scrape-giant/pm/tasks.org

* [X] t1.1: harden giant receipt fetch cli (2-4 commits)
** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- script only fetches unseen orders
- script appends to `orders.csv` and `items.csv` without duplicating prior visits
- script prints a note that giant only exposes the most recent 50 visits

** notes
- keep this giant-specific
- no canonical product logic here
- raw json archive remains source of truth
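
A minimal sketch of the "only unseen orders, append without duplicates" criteria, assuming `orders.csv` has an `order_id` column and that fetched rows share the existing file's columns. The function names are hypothetical and not taken from `scraper.py`.

#+begin_src python
import csv
from pathlib import Path


def load_seen_order_ids(orders_csv, id_column="order_id"):
    """Return the set of order ids already present in the CSV (empty if no file)."""
    path = Path(orders_csv)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {row[id_column] for row in csv.DictReader(handle) if row.get(id_column)}


def append_unseen_orders(orders_csv, fetched_rows, id_column="order_id"):
    """Append only rows whose id is not already on disk; return how many were added."""
    seen = load_seen_order_ids(orders_csv, id_column)
    new_rows = []
    for row in fetched_rows:
        if row[id_column] in seen:
            continue
        seen.add(row[id_column])  # also guards against repeats within one fetch
        new_rows.append(row)
    if not new_rows:
        return 0
    path = Path(orders_csv)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(new_rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
#+end_src

The same check can run before fetching order detail, so visits already on disk are never requested again.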

** evidence
- commit: `d57b9cf` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14

* [X] t1.2: define grocery data model and file layout (1-2 commits)
** acceptance criteria
- decide and document the files/directories for:
  - retailer raw exports
  - enriched line items
  - observed products
  - canonical products
  - product links
- define stable column schemas for each file
- explicitly separate retailer-specific parsing from cross-retailer canonicalization

** notes
- this is the guardrail task so we don't make giant-specific hacks the system of record
- keep schema minimal but extensible

** evidence
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15

* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
** acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- preserves core raw values plus parsed fields such as:
  - normalized item name
  - image url
  - size value/unit guesses
  - pack/count guesses
  - fee/store-brand flags
  - per-unit/per-weight derived price where possible
- parser is deterministic and rerunnable

** notes
- do not attempt canonical cross-store matching yet
- parser should preserve ambiguity rather than hallucinating precision

** evidence
- commit: `14f2cc2` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16

* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)

** acceptance criteria
- distinct observed products are generated from enriched giant items
- each observed product has a stable `observed_product_id`
- observed products aggregate:
  - first seen / last seen
  - times seen
  - representative upc
  - representative image url
  - representative normalized name
- outputs `products_observed.csv`

** notes
- observed product is retailer-facing, not yet canonical
- likely key is some combo of retailer + upc + normalized name
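
One way the "retailer + upc + normalized name" key could become a stable id is a content hash, sketched below. The `obs_` prefix, the 12-character length, and the function name are illustrative assumptions, not the scheme used in `build_observed_products.py`.

#+begin_src python
import hashlib


def observed_product_id(retailer, upc, normalized_name):
    """Derive a deterministic id from the candidate key fields."""
    parts = [
        retailer.strip().lower(),
        (upc or "").strip(),  # upc may be blank for some retailers
        " ".join(normalized_name.upper().split()),  # collapse case and whitespace
    ]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return f"obs_{digest[:12]}"
#+end_src

Because the id depends only on the key fields, rerunning the builder on the same enriched rows yields the same ids.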

** evidence
- commit: `dc39214` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
- date: 2026-03-16

* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

** acceptance criteria
- produce a review file containing observed products needing manual review
- include enough context to review quickly:
  - raw names
  - parsed names
  - upc
  - image url
  - example prices
  - seen count
- reviewed status can be stored and reused

** notes
- this is where human-in-the-loop starts
- optimize for “approve once, remember forever”

** evidence
- commit: `9b13ec3` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
- date: 2026-03-16

* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

** acceptance criteria
- define and create `products_canonical.csv`
- define and create `product_links.csv`
- support linking one or more observed products to one canonical product
- canonical product schema supports food-cost comparison fields such as:
  - product type
  - variant
  - size
  - measure type
  - normalized quantity basis

** notes
- this is the first cross-retailer abstraction layer
- do not require llm assistance for v1

** evidence
- commit: `347cd44` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)

** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- rules include at least:
  - exact upc
  - exact normalized name
  - exact size/unit match where available
- low-confidence cases remain unlinked for review

** notes
- keep the rules conservative
- false positives are worse than unresolved items
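
The conservative rule order can be sketched as follows. Column names such as `canonical_product_id`, `canonical_name`, `size_value`, and `size_unit` are assumptions for illustration; the real schemas live in `pm/data-model.org`.

#+begin_src python
def _size_matches(observed, canonical):
    """Sizes conflict only when both sides have a value and the values differ."""
    for field in ("size_value", "size_unit"):
        left = (observed.get(field) or "").strip()
        right = (canonical.get(field) or "").strip()
        if left and right and left != right:
            return False
    return True


def auto_link(observed, canonicals):
    """Return (canonical_product_id, rule) or (None, "") when no unique match exists."""

    def unique(matches):
        return matches[0]["canonical_product_id"] if len(matches) == 1 else None

    upc = (observed.get("upc") or "").strip()
    if upc:
        hit = unique([c for c in canonicals if (c.get("upc") or "").strip() == upc])
        if hit:
            return hit, "exact_upc"

    name = (observed.get("normalized_name") or "").strip()
    if name:
        same_name = [c for c in canonicals if (c.get("canonical_name") or "").strip() == name]
        hit = unique([c for c in same_name if _size_matches(observed, c)])
        if hit:
            return hit, "exact_name"

    # Ambiguous or unmatched: leave unlinked so it lands in the review queue.
    return None, ""
#+end_src

Requiring a unique match at every step is what keeps ambiguous products unlinked instead of guessed.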

** evidence
- commit: `385a31c` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.8: support costco raw ingest path (2-5 commits)

** acceptance criteria
- add a costco-specific raw ingest/export path
- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- persist raw json under `costco_output/raw/` and flatten to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
- use costco-native identifiers: `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping

** notes
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16

* [X] t1.8.1: support costco parser/enricher path (2-4 commits)

** acceptance criteria
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
  - `itemDescription01` + `itemDescription02`
  - `itemNumber` as `retailer_item_id`
  - discount lines / negative rows
  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values

** notes
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant
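
A rough sketch of parsing the size patterns listed above (`25#`, `48 OZ`, `2/24 OZ`, `6-PACK`) while leaving unknowns blank. The regexes, the set of units, and the output field names are assumptions, not the patterns in `enrich_costco.py`.

#+begin_src python
import re

_PACK_SIZE = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|CT)\b")  # 2/24 OZ
_POUNDS = re.compile(r"\b(\d+(?:\.\d+)?)\s*#")                            # 25#
_SIZE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB|CT)\b")                   # 48 OZ
_PACK = re.compile(r"\b(\d+)\s*-?\s*(?:PACK|PK)\b")                       # 6-PACK


def parse_size(description):
    """Return size/pack guesses as strings; anything not matched stays blank."""
    text = description.upper()
    result = {"size_value": "", "size_unit": "", "pack_qty": ""}

    match = _PACK_SIZE.search(text)
    if match:
        result.update(pack_qty=match.group(1), size_value=match.group(2),
                      size_unit=match.group(3).lower())
        return result

    match = _POUNDS.search(text)
    if match:
        result.update(size_value=match.group(1), size_unit="lb")
    else:
        match = _SIZE.search(text)
        if match:
            result.update(size_value=match.group(1), size_unit=match.group(2).lower())

    match = _PACK.search(text)
    if match:
        result["pack_qty"] = match.group(1)
    return result
#+end_src

A description with no recognizable pattern returns three blank fields, which matches the "blank rather than guessed" criterion.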

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16
* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)

** acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
- confirm at least one product class can exist as:
  - Giant observed product
  - Costco observed product
  - one shared canonical product
- document the exact example used for proof

** notes
- keep this to one or two well-behaved product classes first
- apples, eggs, bananas, or flour are better than weird prepared foods

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16
* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)

** acceptance criteria
- add shared fields needed for non-upc retailers, including:
  - `retailer_item_id`
  - `is_discount_line`
  - `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`

** notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere

** evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16
* [X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)

** acceptance criteria
- confirm graphql summary query returns all expected receipts
- compare `inWarehouse` count vs number of `receipts` returned
- widen or parameterize date window if necessary; website shows receipts in 3-month windows
- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
- emit warning when receipt counts mismatch

** notes
- goal is to confirm we are enumerating all receipts before parsing
- do not expand schema or parser logic in this task
- keep changes limited to summary query handling and diagnostics
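
The window chunking and mismatch diagnostic could look roughly like this. `inWarehouse` and `receipts` come from the criteria above, but the payload shape (both at the top level of one dict) and the function names are assumptions.

#+begin_src python
import calendar
from datetime import date, timedelta


def shift_months(day, months):
    """Move a date by whole months, clamping to the last day of the target month."""
    index = day.year * 12 + (day.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def build_windows(start, end, months_per_window=3):
    """Split [start, end] into consecutive, non-overlapping windows."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(shift_months(cursor, months_per_window) - timedelta(days=1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def count_mismatch_warning(summary):
    """Return a warning string when the reported count differs from receipts returned."""
    expected = summary.get("inWarehouse")
    returned = len(summary.get("receipts") or [])
    if expected is not None and expected != returned:
        return f"warning: inWarehouse={expected} but {returned} receipts returned"
    return None
#+end_src

Each window's `startDate` and `endDate` can then be persisted next to its summary response as request metadata.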

** evidence
- commit: `ac82fa6` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
- date: 2026-03-16
* [X] t1.8.5: align costco scraper auth and UX with giant scraper

** acceptance criteria
- remove manual auth env vars
- load costco cookies from firefox session
- require only logged-in browser
- replace start/end date flags with --months-back
- maintain same raw output structure
- ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone

** notes
- align Costco acquisition ergonomics with the Giant scraper
- keep downstream Costco parsing and shared schemas unchanged
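
A minimal sketch of the collision-safe lookup, assuming each summary receipt is a dict carrying `transactionBarcode` and `transactionDateTime`. The helper names are hypothetical.

#+begin_src python
def receipt_key(receipt):
    """Composite key: barcode alone is not assumed unique across receipts."""
    return (receipt["transactionBarcode"], receipt["transactionDateTime"])


def build_summary_lookup(receipts):
    """Index summary receipts by composite key, keeping the first copy of each."""
    lookup = {}
    for receipt in receipts:
        lookup.setdefault(receipt_key(receipt), receipt)
    return lookup
#+end_src

Two receipts that share a barcode but differ in timestamp stay as separate entries instead of one overwriting the other.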

** evidence
- commit: `c0054dc` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
- date: 2026-03-16
* [X] t1.8.6: add browser session helper (2-4 commits)

** acceptance criteria
- create a separate Python module/script that extracts the firefox browser session data needed by the giant and costco scrapers
- support Firefox as the browser and Costco as the retailer first, including:
  - loading cookies via existing browser-cookie approach
  - reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
  - copying locked browser sqlite/db files to a temp location before reading when necessary
- expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
- keep retailer-specific parsing of extracted session data outside the low-level browser access layer
- structure the helper so Chromium-family browser support can be added later without changing scraper call sites

** notes
- goal is to replace manual `.env` copying of volatile browser-derived auth data
- session bootstrap only, not full browser automation
- prefer one shared helper over retailer-specific ad hoc storage reads
- Firefox only; Chromium support later
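
The "copy locked sqlite files to a temp location before reading" step might be sketched as below. The function name is hypothetical and this is not the code in `browser_session.py`. Copying the `-wal` and `-shm` sidecars is an extra precaution so recent writes are visible in the copy.

#+begin_src python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def read_sqlite_copy(db_path, query, params=()):
    """Copy a possibly locked sqlite file to a temp dir and run a query on the copy."""
    source = Path(db_path)
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / source.name
        shutil.copy2(source, copy_path)
        for suffix in ("-wal", "-shm"):
            sidecar = source.with_name(source.name + suffix)
            if sidecar.exists():
                shutil.copy2(sidecar, copy_path.with_name(copy_path.name + suffix))
        connection = sqlite3.connect(copy_path)
        try:
            return connection.execute(query, params).fetchall()
        finally:
            connection.close()  # close before the temp dir is removed
#+end_src

Scrapers never touch the live profile database this way, so a running Firefox does not block the read.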

** evidence
- commit: `7789c2e` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
- date: 2026-03-16
* [X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)

** acceptance criteria
- make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
- keep `browser_session.py` limited to low-level browser data access only:
  - firefox profile discovery
  - cookie loading
  - storage reads
  - sqlite copy/read helpers
- remove or sharply reduce `retailer_sessions.py` so retailer-specific header extraction lives with the retailer scraper or in a very small retailer-specific helper
- make session bootstrap flow explicit and linear:
  - load browser context
  - extract costco auth values
  - build request headers
  - build requests session
- eliminate inconsistent/obsolete function signatures and dead call paths (e.g. mixed `build_session(...)` calling conventions, stale fallback branches, mismatched `build_headers(...)` args)
- add one focused bootstrap debug print showing whether cookies, authorization, client id, and client identifier were found
- preserve current working behavior where available; this is a refactor/clarification task, not a feature expansion task

** notes
- goal is to restore concern separation and debuggability
- prefer obvious retailer-specific code over “generic” helpers that guess and obscure control flow
- browser access can stay shared; retailer auth mapping should be explicit
- no new heuristics in this task

** evidence
- commit: `d7a0329` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
- date: 2026-03-16
* [X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)

** acceptance criteria
- produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
- each purchase row preserves:
  - purchase date
  - retailer
  - order id
  - raw item name
  - normalized item name
  - canonical item id when resolved
  - quantity / unit
  - line total
  - store/location info where available
- derive normalized comparison fields where possible on enriched or observed product rows:
  - `price_per_lb`
  - `price_per_oz`
  - `price_per_each`
  - `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
  - parsed size/unit
  - receipt weight
  - explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- support pivot-friendly analysis of purchase frequency and item cost over time
- document at least one Giant vs Costco comparison example using the normalized metrics

** notes
- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage
- raw item name must be retained for audit/debugging
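
A small sketch of deriving the comparison metrics with blank output when the basis is unknown. The function name, the `price_basis` column, the `ct` unit label, and four-decimal rounding are assumptions, not the behavior of `build_purchases.py`.

#+begin_src python
from decimal import Decimal, InvalidOperation

OZ_PER_LB = Decimal(16)
PLACES = Decimal("0.0001")


def unit_prices(line_total, size_value, size_unit, pack_qty="", basis="parsed_size"):
    """Return per-unit prices as strings; every field stays blank if the basis is unusable."""
    out = {"price_per_lb": "", "price_per_oz": "", "price_per_count": "", "price_basis": ""}
    try:
        total = Decimal(str(line_total).strip())
        amount = Decimal(str(size_value).strip())
        if str(pack_qty).strip():
            amount *= Decimal(str(pack_qty).strip())
        if amount <= 0:
            return out
    except InvalidOperation:
        return out  # blank or non-numeric input: emit nulls, do not guess

    unit = str(size_unit).strip().lower()
    if unit == "lb":
        per_lb = total / amount
        per_oz = per_lb / OZ_PER_LB
    elif unit == "oz":
        per_oz = total / amount
        per_lb = per_oz * OZ_PER_LB
    elif unit == "ct":
        out["price_per_count"] = str((total / amount).quantize(PLACES))
        out["price_basis"] = basis
        return out
    else:
        return out

    out["price_per_lb"] = str(per_lb.quantize(PLACES))
    out["price_per_oz"] = str(per_oz.quantize(PLACES))
    out["price_basis"] = basis
    return out
#+end_src

Weight and count metrics are kept in separate columns so a pivot never mixes a per-pound price with a per-count price.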

** evidence
- commit: `be1bf63` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
- date: 2026-03-16

* [X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)

** acceptance criteria
- define the persistent files used to resolve unknown items, including:
  - review queue
  - canonical item catalog
  - alias / mapping layer if separate
- specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
- define the manual resolution workflow, including:
  - what the human edits
  - what script is rerun afterward
  - how resolved mappings are persisted for future runs
- ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
- document how raw item name, normalized item name, and canonical item id are all retained

** notes
- goal is “approve once, reuse forever”
- keep the workflow simple and auditable
- manual review is fine; the important part is making it durable and rerunnable

** evidence
- commit: `c7dad54` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified `combined_output/review_queue.csv`, `combined_output/review_resolutions.csv` workflow, and `combined_output/canonical_catalog.csv`
- date: 2026-03-16
* [X] t1.12: simplify review process display
Clearly show current state separate from proposed future state.
** acceptance criteria
1. Display position in review queue, e.g., (1/22)
2. Add short help text header to the review item explaining that the action resolves the current observed product group
3. color-code outputs based on info, prompt/menu, warning
   1. color action menu/requests for input differently from display text; do not color individual options separately
4. update action menu `[x]exclude` to `e[x]clude`
5. on each review item, display a list of all matched items to be linked, sorted by descending date:
   1. YYYY-mm-dd, price, raw item name, normalized item name, upc, retailer
   2. image URL, if exists
6. on each review item, suggest (but do not auto-apply) up to 3 likely existing canonicals using deterministic rules, e.g.:
   1. exact normalized name match
   2. prefix/contains match on canonical name
   3. exact UPC
7. reinforce project terminology such as raw_name, observed_name, canonical_name

** evidence
- commit:
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python review_products.py --help`; verified review prompt now shows queue position, matched item history, image URLs when present, and deterministic canonical suggestions
- date: 2026-03-17

** notes
- The big win here was clarifying terminology in the prompt itself: the reviewer is resolving one observed product group to a canonical_name, not linking raw rows to each other.
- Showing the full matched-item list by descending date made the review context much more legible than the old compact summary fields.
- Deterministic suggestions help, but they are intentionally conservative and shallow; this improves reviewer speed without pretending to solve product matching automatically.
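
The shallow suggestion rules could be sketched as a ranked lookup like the one below. The column names and the exact "contains" test are assumptions, and the ranking follows the order in which the acceptance criteria list the rules. This is not the code in `review_products.py`.

#+begin_src python
def suggest_canonicals(observed, canonicals, limit=3):
    """Return up to `limit` (canonical_product_id, reason) pairs; nothing is applied."""
    name = (observed.get("normalized_name") or "").strip().upper()
    upc = (observed.get("upc") or "").strip()
    ranked = []
    for canonical in canonicals:
        canonical_name = (canonical.get("canonical_name") or "").strip().upper()
        canonical_upc = (canonical.get("upc") or "").strip()
        if name and canonical_name == name:
            rank, reason = 0, "exact normalized name"
        elif name and canonical_name and (name in canonical_name or canonical_name in name):
            rank, reason = 1, "prefix/contains on canonical name"
        elif upc and canonical_upc == upc:
            rank, reason = 2, "exact upc"
        else:
            continue
        ranked.append((rank, canonical_name, canonical["canonical_product_id"], reason))
    ranked.sort()  # best rule first, then by name for a stable order
    return [(canonical_id, reason) for _, _, canonical_id, reason in ranked[:limit]]
#+end_src

The reviewer still chooses or rejects each suggestion, which keeps the helper in line with "suggest, do not auto-apply".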

* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

** acceptance criteria
- llm suggestions are generated only for unresolved observed products
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files

** notes
- bounded assistant, not autonomous goblin
- image urls may become useful here

** evidence
- commit:
- tests:
- date: