* [X] t1.1: harden giant receipt fetch cli (2-4 commits)
** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- script only fetches unseen orders
- script appends to `orders.csv` and `items.csv` without duplicating prior visits
- script prints a note that giant only exposes the most recent 50 visits
** notes
- keep this giant-specific
- no canonical product logic here
- raw json archive remains source of truth
** evidence
- commit: `d57b9cf` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14

* [X] t1.2: define grocery data model and file layout (1-2 commits)
** acceptance criteria
- decide and document the files/directories for:
  - retailer raw exports
  - enriched line items
  - observed products
  - canonical products
  - product links
- define stable column schemas for each file
- explicitly separate retailer-specific parsing from cross-retailer canonicalization
** notes
- this is the guardrail task so we don't make giant-specific hacks the system of record
- keep schema minimal but extensible
** evidence
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15

* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
** acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- preserves core raw values plus parsed fields such as:
  - normalized item name
  - image url
  - size value/unit guesses
  - pack/count guesses
  - fee/store-brand flags
  - per-unit/per-weight derived price where possible
- parser is deterministic and rerunnable
** notes
- do not attempt canonical cross-store matching yet
- parser should preserve ambiguity rather than hallucinating precision
** evidence
- commit: `14f2cc2` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16

* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)
** acceptance criteria
- distinct observed products are generated from enriched giant items
- each observed product has a stable `observed_product_id`
- observed products aggregate:
  - first seen / last seen
  - times seen
  - representative upc
  - representative image url
  - representative normalized name
- outputs `products_observed.csv`
** notes
- observed product is retailer-facing, not yet canonical
- likely key is some combo of retailer + upc + normalized name
** evidence
- commit: `dc39214` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
- date: 2026-03-16

* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
** acceptance criteria
- produce a review file containing observed products needing manual review
- include enough context to review quickly:
  - raw names
  - parsed names
  - upc
  - image url
  - example prices
  - seen count
- reviewed status can be stored and reused
** notes
- this is where human-in-the-loop starts
- optimize for “approve once, remember forever”
** evidence
- commit: `9b13ec3` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
- date: 2026-03-16

* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
** acceptance criteria
- define and create `products_canonical.csv`
- define and create `product_links.csv`
- support linking one or more observed products to one canonical product
- canonical product schema supports food-cost comparison fields such as:
  - product type
  - variant
  - size
  - measure type
  - normalized quantity basis
** notes
- this is the first cross-retailer abstraction layer
- do not require llm assistance for v1
** evidence
- commit: `347cd44` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)
** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- rules include at least:
  - exact upc
  - exact normalized name
  - exact size/unit match where available
- low-confidence cases remain unlinked for review
** notes
- keep the rules conservative
- false positives are worse than unresolved items
** evidence
- commit: `385a31c` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.8: support costco raw ingest path (2-5 commits)
** acceptance criteria
- add a costco-specific raw ingest/export path
- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- persist raw json under `costco_output/raw/orders.csv` and `./items.csv`, same format as giant
- use costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping them
** notes
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16

* [X] t1.8.1: support costco parser/enricher path (2-4 commits)
** acceptance criteria
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
  - `itemDescription01` + `itemDescription02`
  - `itemNumber` as `retailer_item_id`
  - discount lines / negative rows
  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values
** notes
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16

* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
** acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
- confirm at least one product class can exist as:
  - Giant observed product
  - Costco observed product
  - one shared canonical product
- document the exact example used for proof
** notes
- keep this to one or two well-behaved product classes first
- apples, eggs, bananas, or flour are better than weird prepared foods
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16

* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
** acceptance criteria
- add shared fields needed for non-upc retailers, including:
  - `retailer_item_id`
  - `is_discount_line`
  - `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`
** notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere
** evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16

* [X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)
** acceptance criteria
- confirm graphql summary query returns all expected receipts
- compare `inWarehouse` count vs number of `receipts` returned
- widen or parameterize date window if necessary; website shows receipts in 3-month windows
- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
- emit warning when receipt counts mismatch
** notes
- goal is to confirm we are enumerating all receipts before parsing
- do not expand schema or parser logic in this task
- keep changes limited to summary query handling and diagnostics
** evidence
- commit: `ac82fa6` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
- date: 2026-03-16

* [X] t1.8.5: refactor costco scraper auth and UX to match giant scraper
** acceptance criteria
- remove manual auth env vars
- load costco cookies from firefox session
- require only logged-in browser
- replace start/end date flags with --months-back
- maintain same raw output structure
- ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone
** notes
- align Costco acquisition ergonomics with the Giant scraper
- keep downstream Costco parsing and shared schemas unchanged
** evidence
- commit: `c0054dc` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
- date: 2026-03-16

* [X] t1.8.6: add browser session helper (2-4 commits)
** acceptance criteria
- create a separate Python module/script that extracts firefox browser session data needed for giant and costco scrapers
- support Firefox and Costco first, including:
  - loading cookies via existing browser-cookie approach
  - reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
  - copying locked browser sqlite/db files to a temp location before reading when necessary
- expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
- keep retailer-specific parsing of extracted session data outside the low-level browser access layer
- structure the helper so Chromium-family browser support can be added later without changing scraper call sites
** notes
- goal is to replace manual `.env` copying of volatile browser-derived auth data
- session bootstrap only, not full browser automation
- prefer one shared helper over retailer-specific ad hoc storage reads
- Firefox only; Chromium support later
** evidence
- commit: `7789c2e` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
- date: 2026-03-16

* [X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)
** acceptance criteria
- make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
- keep `browser_session.py` limited to low-level browser data access only:
  - firefox profile discovery
  - cookie loading
  - storage reads
  - sqlite copy/read helpers
- remove or sharply reduce `retailer_sessions.py` so retailer-specific header extraction lives with the retailer scraper or in a very small retailer-specific helper
- make session bootstrap flow explicit and linear:
  - load browser context
  - extract costco auth values
  - build request headers
  - build requests session
- eliminate inconsistent/obsolete function signatures and dead call paths (e.g. mixed `build_session(...)` calling conventions, stale fallback branches, mismatched `build_headers(...)` args)
- add one focused bootstrap debug print showing whether cookies, authorization, client id, and client identifier were found
- preserve current working behavior where available; this is a refactor/clarification task, not a feature expansion task
** notes
- goal is to restore concern separation and debuggability
- prefer obvious retailer-specific code over “generic” helpers that guess and obscure control flow
- browser access can stay shared; retailer auth mapping should be explicit
- no new heuristics in this task
** evidence
- commit: `d7a0329` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
- date: 2026-03-16

* [X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)
** acceptance criteria
- produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
- each purchase row preserves:
  - purchase date
  - retailer
  - order id
  - raw item name
  - normalized item name
  - canonical item id when resolved
  - quantity / unit
  - line total
  - store/location info where available
- derive normalized comparison fields where possible on enriched or observed product rows:
  - `price_per_lb`
  - `price_per_oz`
  - `price_per_each`
  - `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
  - parsed size/unit
  - receipt weight
  - explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- support pivot-friendly analysis of purchase frequency and item cost over time
- document at least one Giant vs Costco comparison example using the normalized metrics
** notes
- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage
- raw item name must be retained for audit/debugging
** evidence
- commit: `be1bf63` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
- date: 2026-03-16

* [X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)
** acceptance criteria
- define the persistent files used to resolve unknown items, including:
  - review queue
  - canonical item catalog
  - alias / mapping layer if separate
- specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
- define the manual resolution workflow, including:
  - what the human edits
  - what script is rerun afterward
  - how resolved mappings are persisted for future runs
- ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
- document how raw item name, normalized item name, and canonical item id are all retained
** notes
- goal is “approve once, reuse forever”
- keep the workflow simple and auditable
- manual review is fine; the important part is making it durable and rerunnable
** evidence
- commit: `c7dad54` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified `combined_output/review_queue.csv`, `combined_output/review_resolutions.csv` workflow, and `combined_output/canonical_catalog.csv`
- date: 2026-03-16

* [X] t1.12: simplify review process display
Clearly show current state separate from proposed future state.
** acceptance criteria
1. Display position in review queue, e.g., (1/22)
2. Display compact header with observed_product under review, queue position, and canonical decision, e.g.: "Resolve [n] observed product group [name] and associated items to canonical_name [name]? (\n [n] matched items)"
3. color-code outputs based on info, input/prompt, warning/error
   1. color action menu/requests for input differently from display text; do not color individual options separately
   2. "no canonical_name suggestions found" is informational, not a warning/error.
4. update action menu `[x]exclude` to `e[x]clude`
5. on each review item, display a list of all matched items to be linked, sorted by descending date:
   1. YYYY-mm-dd, price, raw item name, normalized item name, upc, retailer
   2. image URL, if exists
   3. Sample:
6. on each review item, suggest (but do not auto-apply) up to 3 likely existing canonicals using deterministic rules, e.g.:
   1. exact normalized name match
   2. prefix/contains match on canonical name
   3. exact UPC
7. Sample Entry:
   #+begin_comment
   Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
   2 matched items:
   [1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
   [2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
   2 canonical suggestions found:
   [1] BELL PEPPERS, PRODUCE
   [2] PEPPER, SPICES
   #+end_comment
8. When link is selected, users should be able to select the number of the item in the list, e.g.:
   #+begin_comment
   Select the canonical_name to associate [n] items with:
   [1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
   [2] BTB CHICKEN | gcan_0201f0feb749
   [3] LIME | gcan_02074d9e7359
   #+end_comment
9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok? actions: [Y]es [n]o [b]ack [s]kip [q]uit"
- reinforce project terminology such as raw_name, observed_name, canonical_name
** evidence
- commit: `7b8141c`, `d39497c`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
- date: 2026-03-17
** notes
- The key improvement was shifting the prompt from system metadata to reviewer intent: one observed_product, its matched retailer rows, and one canonical_name decision.
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.

* [X] t1.13.1 pipeline accountability and stage visibility (1-2 commits)
add simple accounting so we can see what survives or drops at each pipeline stage
** AC
1. emit counts for raw, enriched, combined/observed, review-queued, canonical-linked, and final purchase-log rows
2. report unresolved and dropped item counts explicitly
3. make it easy to verify that missing items were intentionally left in review rather than silently lost
- pm note: simple text/json/csv summary is sufficient; trust and visibility matter more than presentation
** evidence
- commit: `967e19e`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python report_pipeline_status.py`; verified `combined_output/pipeline_status.csv` and `combined_output/pipeline_status.json`
- date: 2026-03-17
** notes
- Added a single explicit status script instead of threading counters through every pipeline step; this keeps the pipeline simple while still making row survival visible.
- The most useful check here is `unresolved_not_in_review_rows`; when it is non-zero, we know we have a real accounting bug rather than normal unresolved work.

* [X] t1.13.2 costco discount matching and net pricing in enrich_costco (2-3 commits)
refactor costco enrichment so discount lines are matched to purchased items and net pricing is preserved
** AC
1. detect costco discount/coupon rows like `/` and match them to purchased items within the same order
2. preserve raw discount rows for auditability while also carrying matched discount values onto the purchased item row
3. add explicit fields for discount-adjusted pricing, e.g. `matched_discount_amount` and `net_line_total` (or equivalent)
4. preserve original raw receipt amounts (`line_total`) without overwriting them
- pm note: keep this retailer-specific and explicit; do not introduce generic discount heuristics
** evidence
- commit: `56a03bc`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified matched Costco discount rows now populate `matched_discount_amount` and `net_line_total` while preserving raw `line_total`
- date: 2026-03-17
** notes
- Kept this retailer-specific and literal: only discount rows with `/` are matched, and only within the same order.
- Raw discount rows are still preserved for auditability; the purchased row now carries the matched adjustment separately rather than overwriting the original amount.

* [X] t1.13.3 canonical cleanup and review-first product identity (3-4 commits)
refactor canonical generation so product identity is cleaner, duplicate canonicals are reduced, and unresolved items stay in review instead of spawning junk canonicals
** AC
1. stop auto-creating new canonical products from weak normalized names alone; unresolved items remain in `review_queue.csv`
2. canonical names are based on stable product identity rather than noisy observed titles
3. packaging/count/size tokens are removed from canonical names when they belong in structured fields (`pack_qty`, `size_value`, `size_unit`)
4. consolidate obvious duplicate canonicals (e.g. egg/lime cases) and ensure final outputs retain raw item name, normalized item name, and canonical item id
- pm note: prefer conservative canonical creation and a better manual review loop over aggressive auto-unification
** evidence
- commit: `08e2a86`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified weaker exact-name cases now remain unresolved in `combined_output/review_queue.csv` and canonical names are cleaned before auto-catalog creation
- date: 2026-03-17
** notes
- Removed weak exact-name auto-canonical creation so ambiguous products stay in review instead of generating junk canonicals.
- Canonical display names are now cleaned of obvious punctuation and packaging noise, but I kept the cleanup conservative rather than adding a broad fuzzy merge layer.
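A minimal sketch of the packaging-token cleanup described in t1.13.3 AC 3, for illustration only. It is not the code from `08e2a86`: the names `clean_canonical_name` and `PACKAGING_PATTERNS` are hypothetical, and the pattern list is an assumption built from the size examples in t1.8.1 (`25#`, `48 OZ`, `2/24 OZ`, `6-PACK`). It strips size/count tokens that belong in `pack_qty`, `size_value`, and `size_unit`, then drops leftover punctuation.

#+begin_src python
import re

# Hypothetical, intentionally small pattern list (assumption, not the project's rules).
PACKAGING_PATTERNS = [
    r"\b\d+\s*/\s*\d+(?:\.\d+)?\s*(?:OZ|LB|CT)\b",  # multi-pack sizes, e.g. "2/24 OZ"
    r"\b\d+(?:\.\d+)?\s*(?:OZ|LB|KG|CT)\b",         # single sizes, e.g. "48 OZ", "1.36 KG"
    r"\b\d+\s*-\s*PACK\b",                          # pack counts, e.g. "6-PACK"
    r"\b\d+#",                                      # pound shorthand, e.g. "25#"
]


def clean_canonical_name(name: str) -> str:
    """Remove packaging/size tokens and stray punctuation from a display name."""
    cleaned = name.upper()
    for pattern in PACKAGING_PATTERNS:
        cleaned = re.sub(pattern, " ", cleaned)
    # Replace anything that is not a letter, digit, ampersand, or space.
    cleaned = re.sub(r"[^A-Z0-9& ]+", " ", cleaned)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())
#+end_src

With these assumed patterns, `MIXED PEPPER 6-PACK` becomes `MIXED PEPPER` and `BANANAS 3 LB / 1.36 KG` becomes `BANANAS`. Anything the list does not recognize is left in the name, which matches the conservative intent: an unmatched token stays visible for review rather than being guessed away.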

* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
** acceptance criteria
- llm suggestions are generated only for unresolved observed products
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files
** notes
- bounded assistant, not autonomous goblin
- image urls may become useful here
** evidence
- commit:
- tests:
- date:
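A possible shape for the "suggestions, not auto-applied truth" criterion in the llm-assisted suggestion task, sketched under assumptions since the task is not implemented yet. `Suggestion`, `review`, `approved_links`, the status values, and the sample ids are all hypothetical and not part of the existing pipeline. The point is only that a stored suggestion produces no link until a reviewer approves it, optionally with an edited name.

#+begin_src python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Suggestion:
    """One stored suggestion for one unresolved observed product (hypothetical schema)."""
    observed_product_id: str
    suggested_canonical_name: str
    status: str = "pending"  # pending | approved | rejected
    final_canonical_name: Optional[str] = None


def review(suggestion: Suggestion, decision: str,
           edited_name: Optional[str] = None) -> Suggestion:
    """Return a new record reflecting the reviewer's decision; never mutates the input."""
    if decision == "reject":
        return replace(suggestion, status="rejected")
    if decision == "approve":
        final = edited_name or suggestion.suggested_canonical_name
        return replace(suggestion, status="approved", final_canonical_name=final)
    raise ValueError(f"unknown decision: {decision!r}")


def approved_links(suggestions: List[Suggestion]) -> List[Tuple[str, Optional[str]]]:
    """Only approved suggestions become (observed_product_id, canonical_name) rows."""
    return [
        (s.observed_product_id, s.final_canonical_name)
        for s in suggestions
        if s.status == "approved"
    ]
#+end_src

Keeping the suggestion record immutable and deriving link rows only from approved records means a rerun with unreviewed suggestions writes nothing new to the canonical/link files.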