* [X] t1.1: harden giant receipt fetch cli (2-4 commits)
** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- script only fetches unseen orders
- script appends to `orders.csv` and `items.csv` without duplicating prior visits
- script prints a note that giant only exposes the most recent 50 visits

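a minimal sketch of the "append without duplicating prior visits" criterion, not the committed implementation. the helper name `append_unseen` and the `order_id` key column are assumptions for illustration; the real column names live in the scraper.

```python
import csv
from pathlib import Path


def append_unseen(csv_path, rows, key="order_id"):
    """Append only rows whose key is not already in csv_path; return the count added.

    `key` is a hypothetical column name, not taken from the real schema.
    """
    path = Path(csv_path)
    seen, fieldnames = set(), None
    if path.exists() and path.stat().st_size > 0:
        with path.open(newline="") as fh:
            reader = csv.DictReader(fh)
            fieldnames = reader.fieldnames  # keep the existing column order
            seen = {row[key] for row in reader}

    new_rows = []
    for row in rows:
        row_key = str(row[key])  # csv values are read back as strings
        if row_key not in seen:
            seen.add(row_key)  # also guards against repeats within one batch
            new_rows.append(row)
    if not new_rows:
        return 0

    write_header = fieldnames is None
    fieldnames = fieldnames or list(new_rows[0].keys())
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

the same set of seen order ids could be used to skip fetching order detail and to filter rows bound for `items.csv`.
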
** notes
- keep this giant-specific
- no canonical product logic here
- raw json archive remains source of truth

** evidence
- commit: `d57b9cf` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14

* [X] t1.2: define grocery data model and file layout (1-2 commits)
** acceptance criteria
- decide and document the files/directories for:
  - retailer raw exports
  - enriched line items
  - observed products
  - canonical products
  - product links
- define stable column schemas for each file
- explicitly separate retailer-specific parsing from cross-retailer canonicalization

** notes
- this is the guardrail task so we don't make giant-specific hacks the system of record
- keep schema minimal but extensible

** evidence
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15

* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
** acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- preserves core raw values plus parsed fields such as:
  - normalized item name
  - image url
  - size value/unit guesses
  - pack/count guesses
  - fee/store-brand flags
  - per-unit/per-weight derived price where possible
- parser is deterministic and rerunnable

** notes
- do not attempt canonical cross-store matching yet
- parser should preserve ambiguity rather than hallucinating precision

** evidence
- commit: `14f2cc2` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16

* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)

** acceptance criteria
- distinct observed products are generated from enriched giant items
- each observed product has a stable `observed_product_id`
- observed products aggregate:
  - first seen / last seen
  - times seen
  - representative upc
  - representative image url
  - representative normalized name
- outputs `products_observed.csv`

** notes
- observed product is retailer-facing, not yet canonical
- likely key is some combo of retailer + upc + normalized name

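one way the "retailer + upc + normalized name" key could become a stable id, sketched under assumptions: the `obs_` prefix, the 12-character truncation, and the function name are illustrative, not what `build_observed_products.py` necessarily does.

```python
import hashlib


def observed_product_id(retailer, upc, normalized_name):
    """Derive a rerun-stable id from the likely key fields.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would change between runs.
    """
    parts = [
        retailer.strip().lower(),
        (upc or "").strip(),  # a blank upc is allowed and hashes as an empty part
        normalized_name.strip().upper(),
    ]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return f"obs_{digest[:12]}"  # hypothetical id format
```

the same inputs always yield the same id, so regenerating `products_observed.csv` would not reshuffle ids as long as the normalized name is itself stable.
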
** evidence
- commit: `dc39214` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
- date: 2026-03-16

* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

** acceptance criteria
- produce a review file containing observed products needing manual review
- include enough context to review quickly:
  - raw names
  - parsed names
  - upc
  - image url
  - example prices
  - seen count
- reviewed status can be stored and reused

** notes
- this is where human-in-the-loop starts
- optimize for “approve once, remember forever”

** evidence
- commit: `9b13ec3` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
- date: 2026-03-16

* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

** acceptance criteria
- define and create `products_canonical.csv`
- define and create `product_links.csv`
- support linking one or more observed products to one canonical product
- canonical product schema supports food-cost comparison fields such as:
  - product type
  - variant
  - size
  - measure type
  - normalized quantity basis

** notes
- this is the first cross-retailer abstraction layer
- do not require llm assistance for v1

** evidence
- commit: `347cd44` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)

** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- rules include at least:
  - exact upc
  - exact normalized name
  - exact size/unit match where available
- low-confidence cases remain unlinked for review

** notes
- keep the rules conservative
- false positives are worse than unresolved items

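a conservative reading of the rules above, as a sketch only. the field names `canonical_product_id`, `normalized_name`, `size_value`, and `size_unit` are assumptions, and the real rule order in `build_canonical_layer.py` may differ.

```python
def _size_key(row):
    # Blank and missing sizes compare equal, so "unknown" only matches "unknown".
    return (row.get("size_value") or "", row.get("size_unit") or "")


def auto_link(observed, canonicals):
    """Return (canonical_product_id, rule_name), or (None, None) to leave for review."""
    upc = observed.get("upc") or ""
    if upc:
        hits = [c for c in canonicals if c.get("upc") == upc]
        if len(hits) == 1:  # an ambiguous upc is never auto-linked by this rule
            return hits[0]["canonical_product_id"], "exact_upc"

    name = observed.get("normalized_name") or ""
    if name:
        hits = [
            c for c in canonicals
            if c.get("normalized_name") == name and _size_key(c) == _size_key(observed)
        ]
        if len(hits) == 1:
            rule = "exact_name_size" if any(_size_key(observed)) else "exact_name"
            return hits[0]["canonical_product_id"], rule

    return None, None  # unresolved beats a false positive
```

recording the rule name alongside each link keeps auto-links auditable and easy to revoke if a rule turns out to be too loose.
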
** evidence
- commit: `385a31c` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.8: support costco raw ingest path (2-5 commits)

** acceptance criteria
- add a costco-specific raw ingest/export path
- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- persist raw json under `costco_output/raw/` and flatten to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
- use costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping them

** notes
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16

* [X] t1.8.1: support costco parser/enricher path (2-4 commits)

** acceptance criteria
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
  - `itemDescription01` + `itemDescription02`
  - `itemNumber` as `retailer_item_id`
  - discount lines / negative rows
  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values

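the size patterns listed above could be handled with a few ordered regexes. this is a sketch under assumptions: the function name, the output keys, and the limited unit list (`OZ`, `LB`, `CT`) are illustrative and not taken from `enrich_costco.py`.

```python
import re

# The multi-pack form ("2/24 OZ") must be tried before the plain size form.
_PACK_SIZE = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|CT)\b")
_POUNDS_HASH = re.compile(r"\b(\d+(?:\.\d+)?)#")  # "25#" read as 25 lb
_SIZE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB|CT)\b")
_PACK = re.compile(r"\b(\d+)\s*-\s*PACK\b")


def parse_size(description):
    """Return size_value, size_unit, pack_count; None wherever nothing parses."""
    text = description.upper()
    result = {"size_value": None, "size_unit": None, "pack_count": None}

    m = _PACK_SIZE.search(text)
    if m:
        result.update(
            pack_count=int(m.group(1)),
            size_value=float(m.group(2)),
            size_unit=m.group(3).lower(),
        )
        return result

    m = _POUNDS_HASH.search(text)
    if m:
        result.update(size_value=float(m.group(1)), size_unit="lb")
    else:
        m = _SIZE.search(text)
        if m:
            result.update(size_value=float(m.group(1)), size_unit=m.group(2).lower())

    m = _PACK.search(text)
    if m:
        result["pack_count"] = int(m.group(1))
    return result
```

descriptions that match none of the patterns come back with every field as `None`, which maps to blank cells rather than guessed values.
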
** notes
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16

* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)

** acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
- confirm at least one product class can exist as:
  - Giant observed product
  - Costco observed product
  - one shared canonical product
- document the exact example used for proof

** notes
- keep this to one or two well-behaved product classes first
- apples, eggs, bananas, or flour are better than weird prepared foods

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16

* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)

** acceptance criteria
- add shared fields needed for non-upc retailers, including:
  - `retailer_item_id`
  - `is_discount_line`
  - `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`

** notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere

** evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16

* [ ] t1.9: compute normalized comparison metrics (2-4 commits)

** acceptance criteria
- derive normalized comparison fields where possible on enriched or observed product rows:
  - `price_per_lb`
  - `price_per_oz`
  - `price_per_each`
  - `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
  - parsed size/unit
  - receipt weight
  - explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- document at least one Giant vs Costco comparison example using the normalized metrics

** notes
- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage

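since this task is still open, the following is only a sketch of how the weight and count metrics might be derived with the basis recorded and nulls preserved. the function name, argument names, and basis labels are all assumptions, and `line_total` is assumed to be the price paid for a single package. `price_per_each` and receipt-weight handling are left out.

```python
OZ_PER_LB = 16.0


def unit_prices(line_total, size_value=None, size_unit=None, pack_count=None):
    """Derive per-unit prices for one row; every field stays None when the basis is unknown."""
    out = {"price_per_oz": None, "price_per_lb": None, "price_per_count": None, "basis": None}
    if line_total is None or line_total <= 0:
        return out  # discount lines and missing totals get no unit economics

    unit = (size_unit or "").lower()
    if size_value and unit in ("oz", "lb"):
        # Convert everything to ounces first, then scale up to pounds.
        ounces = size_value * (OZ_PER_LB if unit == "lb" else 1.0) * (pack_count or 1)
        out["price_per_oz"] = round(line_total / ounces, 4)
        out["price_per_lb"] = round(line_total / ounces * OZ_PER_LB, 4)
        out["basis"] = "parsed_size"  # hypothetical basis label
    elif pack_count and not size_value:
        out["price_per_count"] = round(line_total / pack_count, 4)
        out["basis"] = "explicit_count"  # hypothetical basis label
    return out
```

for example, a 12.00 line with a parsed size of 48 oz gives 0.25 per oz and 4.00 per lb, while a row with no parsed size or count keeps every metric null.
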
** evidence
- commit:
- tests:
- date:

* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

** acceptance criteria
- llm suggestions are generated only for unresolved observed products
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files

** notes
- bounded assistant, not autonomous goblin
- image urls may become useful here

** evidence
- commit:
- tests:
- date: