updated scope to prep for costco scraper

2026-03-16 09:04:52 -04:00
parent 4216daa37c
commit d20a131e04
3 changed files with 256 additions and 20 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,103 @@
+# scrape-giant
+
+Small grocery-history pipeline for Giant receipts.
+
+The project currently does four things:
+
+1. scrape Giant in-store order history from an active Firefox session
+2. enrich raw line items into a deterministic `items_enriched.csv`
+3. aggregate retailer-facing observed products and build a manual review queue
+4. create a first-pass canonical product layer plus conservative auto-links
+
+The work so far is Giant-specific on the ingest side and intentionally simple on
+the shared product-model side.
+
+## Current flow
+
+Run the commands from the repo root with the project venv active, or call them
+directly through `./venv/bin/python`.
+
+```bash
+./venv/bin/python scraper.py
+./venv/bin/python enrich_giant.py
+./venv/bin/python build_observed_products.py
+./venv/bin/python build_review_queue.py
+./venv/bin/python build_canonical_layer.py
+```
+
+## Inputs
+
+- Firefox cookies for `giantfood.com`
+- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
+- Giant raw order payloads in `giant_output/raw/`
+
+## Outputs
+
+Current generated files live under `giant_output/`:
+
+- `orders.csv`: flattened visit/order rows from the Giant history API
+- `items.csv`: flattened raw line items from fetched order detail payloads
+- `items_enriched.csv`: deterministic parsed/enriched line items
+- `products_observed.csv`: retailer-facing observed product groups
+- `review_queue.csv`: products needing manual review
+- `products_canonical.csv`: shared canonical product rows
+- `product_links.csv`: observed-to-canonical links
+
+Raw JSON remains the source of truth:
+
+- `giant_output/raw/history.json`
+- `giant_output/raw/<order_id>.json`
+
+## Scripts
+
+- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
+- `enrich_giant.py`: reads raw Giant order JSON and writes `items_enriched.csv`
+- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
+- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
+- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`
+
+## Notes on the current model
+
+- Observed products are retailer-specific: Giant today, Costco once its ingest path lands.
+- Canonical products are the first cross-retailer layer.
+- Auto-linking is conservative:
+  exact UPC first, then exact normalized name plus exact size/unit context, then
+  exact normalized name when there is no size context to conflict.
+- Fee rows are excluded from auto-linking.
+- Unknown values are left blank instead of guessed.
+
+## Verification
+
+Run the test suite with:
+
+```bash
+./venv/bin/python -m unittest discover -s tests
+```
+
+Useful one-off rebuilds:
+
+```bash
+./venv/bin/python enrich_giant.py
+./venv/bin/python build_observed_products.py
+./venv/bin/python build_review_queue.py
+./venv/bin/python build_canonical_layer.py
+```
+
+## Project docs
+
+- `pm/tasks.org`: task log and evidence
+- `pm/data-model.org`: file layout and schema decisions
+
+## Status
+
+Completed through `t1.7`:
+
+- Giant receipt fetch CLI
+- data model and file layout
+- Giant parser/enricher
+- observed products
+- review queue
+- canonical layer scaffold
+- conservative auto-link rules
+
+Next planned task is `t1.8`: add a Costco raw ingest path.
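
The auto-linking order in the README's "Notes on the current model" section could be sketched roughly as below. This is a minimal illustration, not the code in `build_canonical_layer.py`: the `Observed` class, its field names, and the rule labels are hypothetical, and it assumes "no size context to conflict" means that neither side carries a size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observed:
    # Hypothetical field names; the real CSV columns may differ.
    upc: str = ""
    normalized_name: str = ""
    size_value: str = ""
    size_unit: str = ""
    is_fee: bool = False


def link_rule(a: Observed, b: Observed) -> Optional[str]:
    """Return the first rule that links a and b, or None if nothing applies."""
    if a.is_fee or b.is_fee:
        return None  # fee rows are excluded from auto-linking
    if a.upc and a.upc == b.upc:
        return "exact_upc"
    if not a.normalized_name or a.normalized_name != b.normalized_name:
        return None
    a_size = (a.size_value, a.size_unit)
    b_size = (b.size_value, b.size_unit)
    a_has, b_has = any(a_size), any(b_size)
    if a_has and b_has:
        # Both sides carry size context: it must match exactly.
        return "exact_name_size" if a_size == b_size else None
    if not a_has and not b_has:
        # Assumption: link on name alone only when neither side has a size.
        return "exact_name"
    return None  # one-sided size context is left for manual review
```

Returning the rule name rather than a bare boolean keeps each link auditable, which fits the README's preference for leaving unknowns blank over guessing.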
--- a/pm/scrape-giant.org
+++ b/pm/scrape-giant.org
--- a/pm/tasks.org
+++ b/pm/tasks.org
@@ -147,35 +147,96 @@

 ** acceptance criteria
 - add a costco-specific raw ingest/export path
- output costco line items into the same shared raw/enriched schema family
- confirm at least one product class can exist as:
-  - giant observed product
-  - costco observed product
-  - one shared canonical product
+- fetch costco receipt summary and receipt detail payloads from graphql endpoint
+- persist raw json under `costco_output/raw/orders.csv` and `./items.csv`, same format as giant
+- use costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
+- preserve discount/coupon rows rather than dropping them

 ** notes
- this is the proof that the architecture generalizes
- don’t chase perfection before the second retailer lands
+- focus on raw costco acquisition and flattening
+- do not force costco identifiers into `upc`
+- bearer/auth values should come from local env, not source

 ** evidence
 - commit:
 - tests:
 - date:

-* [ ] t1.9: compute normalized comparison metrics (2-3 commits)
+* [ ] t1.8.1: support costco parser/enricher path (2-4 commits)

 ** acceptance criteria
- derive normalized comparison fields where possible:
-  - price per lb
-  - price per oz
-  - price per each
-  - price per count
- metrics are attached at canonical or linked-observed level as appropriate
- emit obvious nulls when basis is unknown rather than inventing values
+- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
+- output rows into the same shared enriched schema family as Giant
+- support costco-specific parsing for:
+  - `itemDescription01` + `itemDescription02`
+  - `itemNumber` as `retailer_item_id`
+  - discount lines / negative rows
+  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
+- preserve obvious unknowns as blank rather than guessed values

 ** notes
- this is where “gala apples 5 lb bag vs other gala apples” becomes possible
- units discipline matters a lot here
+- this is the real schema compatibility proof, not raw ingest alone
+- expect weaker identifiers than Giant
+
+** evidence
+- commit:
+- tests:
+- date:
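
The size patterns listed for t1.8.1 (`25#`, `48 OZ`, `2/24 OZ`, `6-PACK`) might be parsed along these lines. This is only a sketch under stated assumptions: `parse_size` and its output keys are hypothetical names, `#` is read as pounds, and `2/24 OZ` is read as two units of 24 oz each. Anything that does not match stays blank, in line with the "unknowns as blank" criterion.

```python
import re
from typing import Dict

# Lookbehind so a match never starts in the middle of a longer number.
_NUM_START = r"(?<![\d.])"


def parse_size(description: str) -> Dict[str, str]:
    """Extract size/pack hints from a receipt description; blanks when unknown."""
    result = {"size_value": "", "size_unit": "", "pack_count": ""}
    text = description.upper()

    # "2/24 OZ": assumed to mean pack count / per-unit size.
    m = re.search(_NUM_START + r"(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB)\b", text)
    if m:
        result.update(pack_count=m.group(1), size_value=m.group(2),
                      size_unit=m.group(3).lower())
        return result

    # "25#": assumed shorthand for pounds.
    m = re.search(_NUM_START + r"(\d+(?:\.\d+)?)\s*#", text)
    if m:
        result.update(size_value=m.group(1), size_unit="lb")
        return result

    # "48 OZ" or "5 LB".
    m = re.search(_NUM_START + r"(\d+(?:\.\d+)?)\s*(OZ|LB)\b", text)
    if m:
        result.update(size_value=m.group(1), size_unit=m.group(2).lower())
        return result

    # "6-PACK" or "6 PK": count only, no size.
    m = re.search(_NUM_START + r"(\d+)\s*-?\s*(?:PACK|PK)\b", text)
    if m:
        result["pack_count"] = m.group(1)
    return result
```

Values are kept as strings so they can be written to CSV unchanged and a blank stays distinguishable from zero.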
+* [ ] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
+
+** acceptance criteria
+- feed Giant and Costco enriched rows through the same observed/canonical pipeline
+- confirm at least one product class can exist as:
+  - Giant observed product
+  - Costco observed product
+  - one shared canonical product
+- document the exact example used for proof
+
+** notes
+- keep this to one or two well-behaved product classes first
+- apples, eggs, bananas, or flour are better than weird prepared foods
+
+** evidence
+- commit:
+- tests:
+- date:
+* [ ] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
+
+** acceptance criteria
+- add shared fields needed for non-upc retailers, including:
+  - `retailer_item_id`
+  - `is_discount_line`
+  - `is_coupon_line` or equivalent if needed
+- keep `upc` nullable across the pipeline
+- update downstream builders/tests to accept retailers with blank `upc`
+
+** notes
+- this prevents costco from becoming a schema hack
+- do this once instead of sprinkling exceptions everywhere
+
+** evidence
+- commit:
+- tests:
+- date:
+* [ ] t1.9: compute normalized comparison metrics (2-4 commits)
+
+** acceptance criteria
+- derive normalized comparison fields where possible on enriched or observed product rows:
+  - `price_per_lb`
+  - `price_per_oz`
+  - `price_per_each`
+  - `price_per_count`
+- preserve the source basis used to derive each metric, e.g.:
+  - parsed size/unit
+  - receipt weight
+  - explicit count/pack
+- emit nulls when basis is unknown, conflicting, or ambiguous
+- document at least one Giant vs Costco comparison example using the normalized metrics
+
+** notes
+- compute metrics as close to the raw observation as possible
+- canonical layer can aggregate later, but should not invent missing unit economics
+- unit discipline matters more than coverage
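
As a rough illustration of the t1.9 criteria, the sketch below derives per-unit prices from an already parsed size and records the basis used for each. The function name, the `*_basis` keys, and the four-decimal rounding are assumptions for illustration only. It also treats `oz` as weight ounces, which would be wrong for fluid ounces, and it omits `price_per_each` for brevity.

```python
from decimal import Decimal
from typing import Dict, Optional

OZ_PER_LB = Decimal(16)
_PLACES = Decimal("0.0001")


def normalized_metrics(price: str, size_value: str = "", size_unit: str = "",
                       pack_count: str = "") -> Dict[str, Optional[object]]:
    """Derive per-unit prices; every metric stays None when its basis is unknown."""
    out: Dict[str, Optional[object]] = {
        "price_per_oz": None, "price_per_lb": None, "weight_basis": "",
        "price_per_count": None, "count_basis": "",
    }
    if not price:
        return out
    total = Decimal(price)
    packs = Decimal(pack_count) if pack_count else Decimal(1)

    # Weight metrics need an explicit size in a known unit.
    if size_value and size_unit in ("oz", "lb") and packs > 0:
        amount = Decimal(size_value) * packs
        if amount > 0:
            ounces = amount * OZ_PER_LB if size_unit == "lb" else amount
            out["price_per_oz"] = (total / ounces).quantize(_PLACES)
            out["price_per_lb"] = (total * OZ_PER_LB / ounces).quantize(_PLACES)
            out["weight_basis"] = "parsed_size"

    # Count metrics need an explicit pack count; never assume one.
    if pack_count and packs > 0:
        out["price_per_count"] = (total / packs).quantize(_PLACES)
        out["count_basis"] = "explicit_count"
    return out
```

For example, a 9.60 line parsed as `2/24 OZ` yields 0.20 per oz, 3.20 per lb, and 4.80 per count, while a line with no parsed size yields `None` for all three.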

 ** evidence
 - commit: