* [X] t1.1: harden giant receipt fetch cli (2-4 commits)
** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- script only fetches unseen orders
- script appends to `orders.csv` and `items.csv` without duplicating prior visits
- script prints a note that giant only exposes the most recent 50 visits
** notes
- keep this giant-specific
- no canonical product logic here
- raw json archive remains source of truth
** evidence
- commit: `d57b9cf` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14

* [X] t1.2: define grocery data model and file layout (1-2 commits)
** acceptance criteria
- decide and document the files/directories for:
  - retailer raw exports
  - enriched line items
  - observed products
  - canonical products
  - product links
- define stable column schemas for each file
- explicitly separate retailer-specific parsing from cross-retailer canonicalization
** notes
- this is the guardrail task so we don't make giant-specific hacks the system of record
- keep schema minimal but extensible
** evidence
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15

* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
** acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- preserves core raw values plus parsed fields such as:
  - normalized item name
  - image url
  - size value/unit guesses
  - pack/count guesses
  - fee/store-brand flags
  - per-unit/per-weight derived price where possible
- parser is deterministic and rerunnable
** notes
- do not attempt canonical cross-store matching yet
- parser should preserve ambiguity rather than hallucinating precision
** evidence
- commit: `14f2cc2` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16

* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)
** acceptance criteria
- distinct observed products are generated from enriched giant items
- each observed product has a stable `observed_product_id`
- observed products aggregate:
  - first seen / last seen
  - times seen
  - representative upc
  - representative image url
  - representative normalized name
- outputs `products_observed.csv`
** notes
- observed product is retailer-facing, not yet canonical
- likely key is some combo of retailer + upc + normalized name
** evidence
- commit: `dc39214` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
- date: 2026-03-16

* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
** acceptance criteria
- produce a review file containing observed products needing manual review
- include enough context to review quickly:
  - raw names
  - parsed names
  - upc
  - image url
  - example prices
  - seen count
- reviewed status can be stored and reused
** notes
- this is where human-in-the-loop starts
- optimize for “approve once, remember forever”
** evidence
- commit: `9b13ec3` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
- date: 2026-03-16

* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
** acceptance criteria
- define and create `products_canonical.csv`
- define and create `product_links.csv`
- support linking one or more observed products to one canonical product
- canonical product schema supports food-cost comparison fields such as:
  - product type
  - variant
  - size
  - measure type
  - normalized quantity basis
** notes
- this is the first cross-retailer abstraction layer
- do not require llm assistance for v1
** evidence
- commit: `347cd44` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)
** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- rules include at least:
  - exact upc
  - exact normalized name
  - exact size/unit match where available
- low-confidence cases remain unlinked for review
** notes
- keep the rules conservative
- false positives are worse than unresolved items
** evidence
- commit: `385a31c` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.8: support costco raw ingest path (2-5 commits)
** acceptance criteria
- add a costco-specific raw ingest/export path
- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- persist raw json under `costco_output/raw/` and flatten to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
- use costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping them
** notes
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16

* [X] t1.8.1: support costco parser/enricher path (2-4 commits)
** acceptance criteria
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
  - `itemDescription01` + `itemDescription02`
  - `itemNumber` as `retailer_item_id`
  - discount lines / negative rows
  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values
** notes
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16

* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
** acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
- confirm at least one product class can exist as:
  - Giant observed product
  - Costco observed product
  - one shared canonical product
- document the exact example used for proof
** notes
- keep this to one or two well-behaved product classes first
- apples, eggs, bananas, or flour are better than weird prepared foods
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16

* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
** acceptance criteria
- add shared fields needed for non-upc retailers, including:
  - `retailer_item_id`
  - `is_discount_line`
  - `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`
** notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere
** evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16

* [X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)
** acceptance criteria
- confirm graphql summary query returns all expected receipts
- compare `inWarehouse` count vs number of `receipts` returned
- widen or parameterize date window if necessary; website shows receipts in 3-month windows
- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
- emit warning when receipt counts mismatch
** notes
- goal is to confirm we are enumerating all receipts before parsing
- do not expand schema or parser logic in this task
- keep changes limited to summary query handling and diagnostics
** evidence
- commit: `ac82fa6` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
- date: 2026-03-16

* [X] t1.8.5: refactor costco scraper auth and UX to match giant scraper
** acceptance criteria
- remove manual auth env vars
- load costco cookies from firefox session
- require only logged-in browser
- replace start/end date flags with `--months-back`
- maintain same raw output structure
- ensure `summary_lookup` keys are collision-safe by using a composite key (`transactionBarcode` + `transactionDateTime`) instead of `transactionBarcode` alone
** notes
- align Costco acquisition ergonomics with the Giant scraper
- keep downstream Costco parsing and shared schemas unchanged
** evidence
- commit: `c0054dc` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
- date: 2026-03-16

* [X] t1.8.6: add browser session helper (2-4 commits)
** acceptance criteria
- create a separate Python module/script that extracts firefox browser session data needed for giant and costco scrapers
- support Firefox and Costco first, including:
  - loading cookies via existing browser-cookie approach
  - reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
  - copying locked browser sqlite/db files to a temp location before reading when necessary
- expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
- keep retailer-specific parsing of extracted session data outside the low-level browser access layer
- structure the helper so Chromium-family browser support can be added later without changing scraper call sites
** notes
- goal is to replace manual `.env` copying of volatile browser-derived auth data
- session bootstrap only, not full browser automation
- prefer one shared helper over retailer-specific ad hoc storage reads
- Firefox only; Chromium support later
** evidence
- commit: `7789c2e` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
- date: 2026-03-16

* [X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)
** acceptance criteria
- make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
- keep `browser_session.py` limited to low-level browser data access only:
  - firefox profile discovery
  - cookie loading
  - storage reads
  - sqlite copy/read helpers
- remove or sharply reduce `retailer_sessions.py` so retailer-specific header extraction lives with the retailer scraper or in a very small retailer-specific helper
- make session bootstrap flow explicit and linear:
  - load browser context
  - extract costco auth values
  - build request headers
  - build requests session
- eliminate inconsistent/obsolete function signatures and dead call paths (e.g. mixed `build_session(...)` calling conventions, stale fallback branches, mismatched `build_headers(...)` args)
- add one focused bootstrap debug print showing whether cookies, authorization, client id, and client identifier were found
- preserve current working behavior where available; this is a refactor/clarification task, not a feature expansion task
** notes
- goal is to restore concern separation and debuggability
- prefer obvious retailer-specific code over “generic” helpers that guess and obscure control flow
- browser access can stay shared; retailer auth mapping should be explicit
- no new heuristics in this task
** evidence
- commit: `d7a0329` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
- date: 2026-03-16

* [X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)
** acceptance criteria
- produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
- each purchase row preserves:
  - purchase date
  - retailer
  - order id
  - raw item name
  - normalized item name
  - canonical item id when resolved
  - quantity / unit
  - line total
  - store/location info where available
- derive normalized comparison fields where possible on enriched or observed product rows:
  - `price_per_lb`
  - `price_per_oz`
  - `price_per_each`
  - `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
  - parsed size/unit
  - receipt weight
  - explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- support pivot-friendly analysis of purchase frequency and item cost over time
- document at least one Giant vs Costco comparison example using the normalized metrics
** notes
- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage
- raw item name must be retained for audit/debugging
** evidence
- commit: `be1bf63` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
- date: 2026-03-16

* [X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)
** acceptance criteria
- define the persistent files used to resolve unknown items, including:
  - review queue
  - canonical item catalog
  - alias / mapping layer if separate
- specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
- define the manual resolution workflow, including:
  - what the human edits
  - what script is rerun afterward
  - how resolved mappings are persisted for future runs
- ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
- document how raw item name, normalized item name, and canonical item id are all retained
** notes
- goal is “approve once, reuse forever”
- keep the workflow simple and auditable
- manual review is fine; the important part is making it durable and rerunnable
** evidence
- commit: `c7dad54` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified `combined_output/review_queue.csv`, `combined_output/review_resolutions.csv` workflow, and `combined_output/canonical_catalog.csv`
- date: 2026-03-16

* [X] t1.12: simplify review process display
Clearly show the current state separately from the proposed future state.
** acceptance criteria
1. Display position in review queue, e.g., (1/22)
2. Display compact header with observed_product under review, queue position, and canonical decision, e.g.: "Resolve [n] observed product group [name] and associated items to canonical_name [name]? (\n [n] matched items)"
3. color-code outputs based on info, input/prompt, warning/error
   1. color action menu/requests for input differently from display text; do not color individual options separately
   2. "no canonical_name suggestions found" is informational, not a warning/error.
4. update action menu `[x]exclude` to `e[x]clude`
5. on each review item, display a list of all matched items to be linked, sorted by descending date:
   1. YYYY-mm-dd, price, raw item name, normalized item name, upc, retailer
   2. image URL, if exists
   3. Sample:
6. on each review item, suggest (but do not auto-apply) up to 3 likely existing canonicals using deterministic rules, e.g.:
   1. exact normalized name match
   2. prefix/contains match on canonical name
   3. exact UPC
7. Sample Entry:
   #+begin_comment
   Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
   2 matched items:
   [1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
   [2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
   2 canonical suggestions found:
   [1] BELL PEPPERS, PRODUCE
   [2] PEPPER, SPICES

   - reinforce project terminology such as raw_name, observed_name, canonical_name
   #+end_comment
8. When link is selected, users should be able to select the number of the item in the list, e.g.:
   #+begin_comment
   Select the canonical_name to associate [n] items with:
   [1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
   [2] BTB CHICKEN | gcan_0201f0feb749
   [3] LIME | gcan_02074d9e7359
   #+end_comment
9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok? actions: [Y]es [n]o [b]ack [s]kip [q]uit"
** evidence
- commit: `7b8141c`, `d39497c`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
- date: 2026-03-17
** notes
- The key improvement was shifting the prompt from system metadata to reviewer intent: one observed_product, its matched retailer rows, and one canonical_name decision.
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
- resolve observed product group (group id) to canonical name:

* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
** acceptance criteria
- llm suggestions are generated only for unresolved observed products
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files
** notes
- bounded assistant, not autonomous goblin
- image urls may become useful here
** evidence
- commit:
- tests:
- date:
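** sketch
A possible shape for the t1.10 suggestion lifecycle, offered as a starting point rather than the planned implementation. Every name here (`Suggestion`, `propose`, `review`, `links_to_persist`, the status strings) is hypothetical and does not exist in the repo, and `suggest` stands in for whatever llm call is eventually chosen. It only illustrates the four acceptance criteria: suggestions are generated for unresolved products only, are stored as pending suggestions, can be approved/edited/rejected, and only reviewer-confirmed rows become eligible for the link files.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical record for one llm suggestion; not an existing schema."""
    observed_product_id: str
    suggested_name: str
    status: str = "pending"           # pending | approved | edited | rejected
    final_name: Optional[str] = None  # set only by a reviewer decision


def propose(observed_ids: Iterable[str], linked_ids: Set[str],
            suggest: Callable[[str], str]) -> List[Suggestion]:
    """Generate suggestions only for observed products that have no link yet."""
    return [Suggestion(pid, suggest(pid))
            for pid in observed_ids if pid not in linked_ids]


def review(s: Suggestion, action: str,
           edited_name: Optional[str] = None) -> Suggestion:
    """Apply a reviewer decision; nothing becomes final without one."""
    if action == "approve":
        return replace(s, status="approved", final_name=s.suggested_name)
    if action == "edit":
        if not edited_name:
            raise ValueError("edit requires a replacement name")
        return replace(s, status="edited", final_name=edited_name)
    if action == "reject":
        return replace(s, status="rejected", final_name=None)
    raise ValueError(f"unknown action: {action}")


def links_to_persist(suggestions: Iterable[Suggestion]) -> List[Tuple[str, str]]:
    """Return (observed_product_id, canonical_name) for confirmed rows only."""
    return [(s.observed_product_id, s.final_name)
            for s in suggestions
            if s.status in ("approved", "edited") and s.final_name]
```

Pending and rejected suggestions never reach `links_to_persist`, which keeps the llm output advisory. How the confirmed pairs are written into the canonical/link files is left open here.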