scrape-giant/pm/tasks.org

* [X] t1.1: harden giant receipt fetch cli (2-4 commits)
** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- script only fetches unseen orders
- script appends to `orders.csv` and `items.csv` without duplicating prior visits
- script prints a note that giant only exposes the most recent 50 visits
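
a minimal sketch of the "only unseen orders, no duplicate rows" criteria above. the `order_id` column name and the helper names `load_seen_ids` / `append_unseen` are assumptions for illustration, not the actual `scraper.py` code.

```python
import csv
from pathlib import Path


def load_seen_ids(csv_path, id_column="order_id"):
    """Return the ids already stored in csv_path (empty set if the file is missing)."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="") as f:
        return {row[id_column] for row in csv.DictReader(f)}


def append_unseen(csv_path, rows, fieldnames, id_column="order_id"):
    """Append only rows whose id is not in csv_path yet; return the number written."""
    seen = load_seen_ids(csv_path, id_column)
    new_rows = []
    for row in rows:
        row_id = str(row[id_column])  # csv values come back as strings
        if row_id not in seen:
            seen.add(row_id)
            new_rows.append(row)

    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

calling `load_seen_ids("orders.csv")` before fetching would also let the scraper skip network requests for orders it already has.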

** notes
- keep this giant-specific
- no canonical product logic here
- raw json archive remains source of truth

** evidence
- commit: `d57b9cf` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14

* [X] t1.2: define grocery data model and file layout (1-2 commits)
** acceptance criteria
- decide and document the files/directories for:
  - retailer raw exports
  - enriched line items
  - observed products
  - canonical products
  - product links
- define stable column schemas for each file
- explicitly separate retailer-specific parsing from cross-retailer canonicalization

** notes
- this is the guardrail task so we don't make giant-specific hacks the system of record
- keep schema minimal but extensible

** evidence
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15

* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
** acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- preserves core raw values plus parsed fields such as:
  - normalized item name
  - image url
  - size value/unit guesses
  - pack/count guesses
  - fee/store-brand flags
  - per-unit/per-weight derived price where possible
- parser is deterministic and rerunnable

** notes
- do not attempt canonical cross-store matching yet
- parser should preserve ambiguity rather than hallucinating precision
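
one way to read "preserve ambiguity" for the size value/unit guess, as a hedged sketch: return a size only when exactly one candidate appears in the item name, otherwise return nulls. the unit list and the `parse_size` name are assumptions, not what `enrich_giant.py` does.

```python
import re

# Assumed unit vocabulary; longer alternatives come first so "fl oz" wins over "oz".
_SIZE_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(fl oz|oz|lb|gal|qt|pt|ml|l|kg|g)\b",
    re.IGNORECASE,
)


def parse_size(item_name):
    """Return (value, unit) when the name holds exactly one size, else (None, None)."""
    matches = _SIZE_RE.findall(item_name)
    if len(matches) != 1:
        # Zero or several candidates: leave the fields empty instead of guessing.
        return None, None
    value, unit = matches[0]
    return float(value), unit.lower()
```

the function has no hidden state, so rerunning it on the same raw json gives the same output.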

** evidence
- commit:
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16

* [ ] t1.4: generate observed-product layer from enriched items (2-3 commits)

** acceptance criteria
- distinct observed products are generated from enriched giant items
- each observed product has a stable `observed_product_id`
- observed products aggregate:
  - first seen / last seen
  - times seen
  - representative upc
  - representative image url
  - representative normalized name
- outputs `products_observed.csv`

** notes
- observed product is retailer-facing, not yet canonical
- likely key is some combination of retailer + upc + normalized name
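
a sketch of a stable id built from the candidate key above, plus the first seen / last seen / times seen aggregation. the `obs_` prefix, the sha1 truncation, and the input field names (`retailer`, `upc`, `normalized_name`, `order_date`) are all assumptions.

```python
import hashlib


def observed_product_id(retailer, upc, normalized_name):
    """Derive a deterministic id from retailer + upc + normalized name."""
    parts = [
        retailer.strip().lower(),
        (upc or "").strip(),
        normalized_name.strip().lower(),
    ]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return f"obs_{digest[:12]}"


def aggregate_observed(items):
    """Group enriched items by observed_product_id.

    Each item is a dict with retailer, upc, normalized_name and an ISO-format
    order_date string (ISO dates sort correctly as plain strings).
    """
    products = {}
    for item in items:
        pid = observed_product_id(
            item["retailer"], item.get("upc"), item["normalized_name"]
        )
        date = item["order_date"]
        product = products.setdefault(
            pid,
            {
                "observed_product_id": pid,
                "first_seen": date,
                "last_seen": date,
                "times_seen": 0,
            },
        )
        product["first_seen"] = min(product["first_seen"], date)
        product["last_seen"] = max(product["last_seen"], date)
        product["times_seen"] += 1
    # Sort so repeated runs write rows in the same order.
    return sorted(products.values(), key=lambda p: p["observed_product_id"])
```

a content hash keeps ids stable across reruns, but any change to name normalization changes the ids, so that tradeoff would need a decision before relying on this.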

** evidence
- commit:
- tests:
- date:

* [ ] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

** acceptance criteria
- produce a review file containing observed products needing manual review
- include enough context to review quickly:
  - raw names
  - parsed names
  - upc
  - image url
  - example prices
  - seen count
- review status is persisted and reused on later runs

** notes
- this is where human-in-the-loop starts
- optimize for “approve once, remember forever”
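
a rough sketch of "approve once, remember forever": decisions live in a small csv keyed by `observed_product_id`, and anything already decided drops out of the queue. the file layout, the `status` column, and the function names are hypothetical.

```python
import csv
from pathlib import Path

REVIEW_FIELDS = ["observed_product_id", "status"]


def load_decisions(path):
    """Return {observed_product_id: status} from the decisions file, if it exists."""
    p = Path(path)
    if not p.exists():
        return {}
    with p.open(newline="") as f:
        return {row["observed_product_id"]: row["status"] for row in csv.DictReader(f)}


def save_decision(path, observed_product_id, status):
    """Record one decision and rewrite the file sorted by id for stable diffs."""
    decisions = load_decisions(path)
    decisions[observed_product_id] = status
    with Path(path).open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=REVIEW_FIELDS)
        writer.writeheader()
        for pid in sorted(decisions):
            writer.writerow({"observed_product_id": pid, "status": decisions[pid]})


def pending_review(observed_products, decisions):
    """Keep only observed products that have no stored decision yet."""
    return [p for p in observed_products if p["observed_product_id"] not in decisions]
```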

** evidence
- commit:
- tests:
- date:

* [ ] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

** acceptance criteria
- define and create `products_canonical.csv`
- define and create `product_links.csv`
- support linking one or more observed products to one canonical product
- canonical product schema supports food-cost comparison fields such as:
  - product type
  - variant
  - size
  - measure type
  - normalized quantity basis

** notes
- this is the first cross-retailer abstraction layer
- do not require llm assistance for v1

** evidence
- commit:
- tests:
- date:

* [ ] t1.7: implement auto-link rules for easy matches (2-3 commits)

** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- rules include at least:
  - exact upc
  - exact normalized name
  - exact size/unit match where available
- low-confidence cases remain unlinked for review

** notes
- keep the rules conservative
- false positives are worse than unresolved items
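
a conservative reading of the rules above, as a sketch only: a rule fires when it finds exactly one canonical candidate, and everything else stays unlinked for review. field names such as `canonical_product_id`, `size_value`, and `size_unit` are assumptions about a schema that is not defined yet.

```python
def auto_link(observed, canonicals):
    """Return a canonical_product_id for an easy match, or None to leave it for review."""
    # Rule 1: exact upc, accepted only when it points at a single canonical product.
    upc = observed.get("upc")
    if upc:
        hits = [c for c in canonicals if c.get("upc") == upc]
        if len(hits) == 1:
            return hits[0]["canonical_product_id"]
        if len(hits) > 1:
            return None  # ambiguous upc: do not guess

    # Rule 2: exact normalized name AND exact size/unit, both required here.
    name = observed.get("normalized_name")
    size = (observed.get("size_value"), observed.get("size_unit"))
    if name and None not in size:
        hits = [
            c
            for c in canonicals
            if c.get("normalized_name") == name
            and (c.get("size_value"), c.get("size_unit")) == size
        ]
        if len(hits) == 1:
            return hits[0]["canonical_product_id"]

    return None
```

requiring size on the name rule is stricter than "where available" strictly demands; loosening it trades fewer review items for more false-positive risk.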

** evidence
- commit:
- tests:
- date:

* [ ] t1.8: support costco raw ingest path (2-5 commits)

** acceptance criteria
- add a costco-specific raw ingest/export path
- output costco line items into the same shared raw/enriched schema family
- confirm at least one product class can exist as:
  - giant observed product
  - costco observed product
  - one shared canonical product

** notes
- this is the proof that the architecture generalizes
- don’t chase perfection before the second retailer lands

** evidence
- commit:
- tests:
- date:

* [ ] t1.9: compute normalized comparison metrics (2-3 commits)

** acceptance criteria
- derive normalized comparison fields where possible:
  - price per lb
  - price per oz
  - price per each
  - price per count
- metrics are attached at canonical or linked-observed level as appropriate
- emit explicit nulls (empty values) when the basis is unknown rather than inventing values
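
a small sketch of the metric derivation with nulls for unknown bases. the unit labels (`lb`, `oz`, `each`, `ct`) and the `price_per_unit` name are assumptions; `oz` here means weight ounces, not fluid ounces.

```python
OZ_PER_LB = 16


def price_per_unit(price, size_value, size_unit):
    """Return the four comparison fields; any basis that is unknown stays None."""
    result = {
        "price_per_lb": None,
        "price_per_oz": None,
        "price_per_each": None,
        "price_per_count": None,
    }
    if price is None or not size_value or size_value <= 0:
        return result  # no usable basis: leave every metric null

    unit = (size_unit or "").lower()
    if unit == "lb":
        result["price_per_lb"] = price / size_value
        result["price_per_oz"] = result["price_per_lb"] / OZ_PER_LB
    elif unit == "oz":
        result["price_per_oz"] = price / size_value
        result["price_per_lb"] = result["price_per_oz"] * OZ_PER_LB
    elif unit == "each":
        result["price_per_each"] = price / size_value
    elif unit == "ct":
        result["price_per_count"] = price / size_value
    # Any other unit falls through with all metrics left as None.
    return result
```

for example, a 5 lb bag at 5.00 gives `price_per_lb` 1.0 and `price_per_oz` 0.0625, while the each/count fields stay null.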

** notes
- this is where “gala apples 5 lb bag vs other gala apples” becomes possible
- units discipline matters a lot here

** evidence
- commit:
- tests:
- date:

* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

** acceptance criteria
- llm suggestions are generated only for unresolved observed products
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files

** notes
- bounded assistant, not autonomous goblin
- image urls may become useful here

** evidence
- commit:
- tests:
- date: