Merge remote-tracking branch 'gitea/cx' into cx

2026-03-16 12:40:44 -04:00
18 changed files with 3675 additions and 47 deletions

pm/data-model.org Normal file

@@ -0,0 +1,309 @@
* grocery data model and file layout
This document defines the shared file layout and stable CSV schemas for the
grocery pipeline. The goal is to keep retailer-specific ingest separate from
cross-retailer product modeling so Giant-specific quirks do not become the
system of record.
** design rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific enrichment.
- CSV schemas are stable and additive: new columns may be appended, but
  existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.
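The additive-schema rule can be checked mechanically. The sketch below is one possible guard, not part of the existing pipeline; the function name `is_additive_change` is hypothetical. It treats a header change as additive only when every existing column keeps its name and position and new columns appear at the end.

#+begin_src python
def is_additive_change(old_header, new_header):
    """Return True when new_header keeps every old column in place.

    Hypothetical helper: an additive change may only append columns,
    so the old header must be an exact prefix of the new one.
    """
    old = list(old_header)
    new = list(new_header)
    return new[:len(old)] == old
#+end_src

A check like this could run in tests against a stored copy of each schema header, so that a reordered or renamed column fails before it reaches downstream builders.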
** directory layout
Use one top-level data root:
#+begin_example
data/
  giant/
    raw/
      history.json
      orders/
        <order_id>.json
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  costco/
    raw/
      ...
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  shared/
    products_canonical.csv
    product_links.csv
    review_queue.csv
#+end_example
** layer responsibilities
- `data/<retailer>/raw/`
  Stores unmodified retailer payloads exactly as fetched.
- `data/<retailer>/orders.csv`
  One row per retailer order or visit, flattened from raw order data.
- `data/<retailer>/items_raw.csv`
  One row per retailer line item, preserving retailer-native values needed for
  reruns and debugging.
- `data/<retailer>/items_enriched.csv`
  Parsed retailer line items with normalized fields and derived guesses, still
  retailer-specific.
- `data/<retailer>/products_observed.csv`
  Distinct retailer-facing observed products aggregated from enriched items.
- `data/shared/products_canonical.csv`
  Cross-retailer canonical product entities used for comparison.
- `data/shared/product_links.csv`
  Links from retailer observed products to canonical products.
- `data/shared/review_queue.csv`
  Human review queue for unresolved or low-confidence matching/parsing cases.
** retailer-specific versus shared
Retailer-specific:
- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- parsed guesses derived from one retailer feed
- observed products scoped to one retailer
Shared:
- canonical products
- observed-to-canonical links
- human review state for unresolved cases
- comparison-ready normalized quantity basis fields
Observed products are the boundary between retailer-specific parsing and
cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
should require knowledge of another retailer.
** schema: `data/<retailer>/orders.csv`
One row per order or visit.
| column | meaning |
|-
| `retailer` | retailer slug such as `giant` |
| `order_id` | retailer order or visit id |
| `order_date` | order date in `YYYY-MM-DD` when available |
| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
| `service_type` | retailer service type such as `INSTORE` |
| `order_total` | order total as provided by retailer |
| `payment_method` | retailer payment label |
| `total_item_count` | total line count or item count from retailer |
| `total_savings` | total savings as provided by retailer |
| `your_savings_total` | savings field from retailer when present |
| `coupons_discounts_total` | coupon/discount total from retailer |
| `store_name` | retailer store name |
| `store_number` | retailer store number |
| `store_address1` | street address |
| `store_city` | city |
| `store_state` | state or province |
| `store_zipcode` | postal code |
| `refund_order` | retailer refund flag |
| `ebt_order` | retailer EBT flag |
| `raw_history_path` | relative path to source history payload |
| `raw_order_path` | relative path to source order payload |
Primary key:
- (`retailer`, `order_id`)
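A primary key is only useful if something verifies it. The following is a minimal sketch of a uniqueness check that would work for any of the CSV files in this document; the helper name `duplicate_keys` and the sample rows are invented for illustration.

#+begin_src python
import csv
import io
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the sorted list of key tuples that occur more than once."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, seen in counts.items() if seen > 1)


# Invented sample data; a real run would open data/<retailer>/orders.csv.
sample = io.StringIO(
    "retailer,order_id,order_date\n"
    "giant,1001,2026-03-01\n"
    "giant,1002,2026-03-08\n"
    "giant,1001,2026-03-01\n"
)
rows = list(csv.DictReader(sample))
dupes = duplicate_keys(rows, ["retailer", "order_id"])
#+end_src

Here `dupes` holds the single repeated key `("giant", "1001")`. The same function applies to the other schemas by passing their key columns, for example `["retailer", "order_id", "line_no"]` for the item files.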
** schema: `data/<retailer>/items_raw.csv`
One row per retailer line item.
| column | meaning |
|------------------+-----------------------------------------|
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | stable line number within order export |
| `order_date` | copied from order when available |
| `retailer_item_id` | retailer-native item id when available |
| `pod_id` | retailer pod/item id |
| `item_name` | raw retailer item name |
| `upc` | retailer UPC or PLU value |
| `category_id` | retailer category id |
| `category` | retailer category description |
| `qty` | retailer quantity field |
| `unit` | retailer unit code such as `EA` or `LB` |
| `unit_price` | retailer unit price field |
| `line_total` | retailer extended price field |
| `picked_weight` | retailer picked weight field |
| `mvp_savings` | retailer savings field |
| `reward_savings` | retailer rewards savings field |
| `coupon_savings` | retailer coupon savings field |
| `coupon_price` | retailer coupon price field |
| `image_url` | raw retailer image url when present |
| `raw_order_path` | relative path to source order payload |
| `is_discount_line` | retailer adjustment or discount-line flag |
| `is_coupon_line` | coupon-like line flag when distinguishable |
Primary key:
- (`retailer`, `order_id`, `line_no`)
** schema: `data/<retailer>/items_enriched.csv`
One row per retailer line item after deterministic parsing. Preserve the raw
fields from `items_raw.csv` and add parsed fields.
| column | meaning |
|---------------------+-------------------------------------------------------------|
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | line number within order |
| `observed_item_key` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
| `retailer_item_id` | retailer-native item id |
| `item_name` | raw retailer item name |
| `item_name_norm` | normalized item name |
| `brand_guess` | parsed brand guess |
| `variant` | parsed variant text |
| `size_value` | parsed numeric size value |
| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty` | parsed pack or count guess |
| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
| `is_store_brand` | store-brand guess |
| `is_fee` | fee or non-product flag |
| `is_discount_line` | discount or adjustment-line flag |
| `is_coupon_line` | coupon-like line flag |
| `price_per_each` | derived per-each price when supported |
| `price_per_lb` | derived per-pound price when supported |
| `price_per_oz` | derived per-ounce price when supported |
| `image_url` | best available retailer image url |
| `parse_version` | parser version string for reruns |
| `parse_notes` | optional non-fatal parser notes |
Primary key:
- (`retailer`, `order_id`, `line_no`)
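The derived price columns are filled only "when supported", which in practice means a known weight basis. The sketch below shows one way to do that for `price_per_oz` and `price_per_lb`. It is an illustration under stated assumptions, not the enricher's actual logic: the function name `weight_prices` is hypothetical, inputs are taken to be CSV strings, and `line_total` is assumed to be the price of exactly one unit of the parsed size. Anything it cannot convert is left blank, in line with the design rules.

#+begin_src python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP

# Only weight units are converted; volume units such as fl_oz are not weights.
OUNCES_PER_UNIT = {"oz": Decimal(1), "lb": Decimal(16)}
PRECISION = Decimal("0.0001")


def weight_prices(line_total, size_value, size_unit, pack_qty=""):
    """Return price_per_oz and price_per_lb as strings, blank when unknown."""
    blank = {"price_per_oz": "", "price_per_lb": ""}
    try:
        total = Decimal(line_total)
        size = Decimal(size_value)
        packs = Decimal(pack_qty) if pack_qty else Decimal(1)
    except InvalidOperation:
        return blank  # a missing or non-numeric input means no basis
    factor = OUNCES_PER_UNIT.get(size_unit)
    if factor is None or size <= 0 or packs <= 0:
        return blank
    ounces = size * factor * packs
    per_oz = (total / ounces).quantize(PRECISION, rounding=ROUND_HALF_UP)
    per_lb = (total * 16 / ounces).quantize(PRECISION, rounding=ROUND_HALF_UP)
    return {"price_per_oz": str(per_oz), "price_per_lb": str(per_lb)}
#+end_src

For a 48 oz item with a line total of 4.80, this yields `0.1000` per ounce and `1.6000` per pound. Using `Decimal` rather than floats keeps the rounding predictable across reruns.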
** schema: `data/<retailer>/products_observed.csv`
One row per distinct retailer-facing observed product.
| column | meaning |
|-------------------------------+----------------------------------------------------------------|
| `observed_product_id` | stable observed product id |
| `retailer` | retailer slug |
| `observed_key` | deterministic grouping key used to create the observed product |
| `representative_retailer_item_id` | best representative retailer-native item id |
| `representative_upc` | best representative UPC/PLU |
| `representative_item_name` | representative raw retailer name |
| `representative_name_norm` | representative normalized name |
| `representative_brand` | representative brand guess |
| `representative_variant` | representative variant |
| `representative_size_value` | representative size value |
| `representative_size_unit` | representative size unit |
| `representative_pack_qty` | representative pack/count |
| `representative_measure_type` | representative measure type |
| `representative_image_url` | representative image url |
| `is_store_brand` | representative store-brand flag |
| `is_fee` | representative fee flag |
| `is_discount_line` | representative discount-line flag |
| `is_coupon_line` | representative coupon-line flag |
| `first_seen_date` | first order date seen |
| `last_seen_date` | last order date seen |
| `times_seen` | number of enriched item rows grouped here |
| `example_order_id` | one example retailer order id |
| `example_item_name` | one example raw item name |
| `distinct_retailer_item_ids_count` | count of distinct retailer-native item ids |
Primary key:
- (`observed_product_id`)
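The aggregation from enriched rows to observed products can be sketched as a group-by on `observed_key`. The key recipe used here (retailer, UPC, and normalized name joined with `|`) is an assumption for illustration, and the function name `build_observed` is hypothetical. The sketch fills only the bookkeeping columns and leaves `observed_product_id` and the `representative_*` fields out.

#+begin_src python
from collections import defaultdict


def build_observed(rows):
    """Group enriched item rows into observed-product summaries."""
    groups = defaultdict(list)
    for row in rows:
        # Assumed key recipe; rows with a blank upc group by name alone.
        key = "|".join([row["retailer"], row["upc"], row["item_name_norm"]])
        groups[key].append(row)

    observed = []
    for key in sorted(groups):
        members = groups[key]
        # YYYY-MM-DD strings sort chronologically; blank dates are skipped.
        dates = sorted(r["order_date"] for r in members if r["order_date"])
        item_ids = {r["retailer_item_id"] for r in members if r["retailer_item_id"]}
        observed.append({
            "observed_key": key,
            "retailer": members[0]["retailer"],
            "first_seen_date": dates[0] if dates else "",
            "last_seen_date": dates[-1] if dates else "",
            "times_seen": len(members),
            "example_order_id": members[0]["order_id"],
            "example_item_name": members[0]["item_name"],
            "distinct_retailer_item_ids_count": len(item_ids),
        })
    return observed
#+end_src

Because the key never looks outside one retailer's rows, this step stays on the retailer-specific side of the boundary.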
** schema: `data/shared/products_canonical.csv`
One row per cross-retailer canonical product.
| column | meaning |
|----------------------------+--------------------------------------------------|
| `canonical_product_id` | stable canonical product id |
| `canonical_name` | canonical human-readable name |
| `product_type` | broad class such as `apple`, `milk`, `trash_bag` |
| `brand` | canonical brand when applicable |
| `variant` | canonical variant |
| `size_value` | normalized size value |
| `size_unit` | normalized size unit |
| `pack_qty` | normalized pack/count |
| `measure_type` | normalized measure type |
| `normalized_quantity` | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes` | optional human notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
Primary key:
- (`canonical_product_id`)
** schema: `data/shared/product_links.csv`
One row per observed-to-canonical relationship.
| column | meaning |
|-
| `observed_product_id` | retailer observed product id |
| `canonical_product_id` | linked canonical product id |
| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
| `link_confidence` | optional confidence label |
| `review_status` | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by` | reviewer id or initials |
| `reviewed_at` | review timestamp or date |
| `link_notes` | optional notes |
Primary key:
- (`observed_product_id`, `canonical_product_id`)
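As an illustration of a deterministic `link_method`, the sketch below proposes `exact_name` links only when an observed product's normalized name matches exactly one canonical name. The comparison rule (trimmed, lowercased `canonical_name` against `representative_name_norm`) and the function name `exact_name_links` are assumptions, not the pipeline's confirmed behavior. Ambiguous or blank names produce no link, so they remain candidates for human review.

#+begin_src python
def exact_name_links(observed_rows, canonical_rows):
    """Propose links for observed products with exactly one name match."""
    by_name = {}
    for canonical in canonical_rows:
        name = canonical["canonical_name"].strip().lower()
        by_name.setdefault(name, []).append(canonical["canonical_product_id"])

    links = []
    for observed in observed_rows:
        name = observed["representative_name_norm"].strip().lower()
        candidates = by_name.get(name, [])
        # Skip blank names and names shared by several canonical products.
        if name and len(candidates) == 1:
            links.append({
                "observed_product_id": observed["observed_product_id"],
                "canonical_product_id": candidates[0],
                "link_method": "exact_name",
                "link_confidence": "",
                "review_status": "",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "",
            })
    return links
#+end_src

Refusing to link on more than one candidate trades coverage for precision: an unlinked product can still be reviewed later, while a wrong link would quietly distort comparisons.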
** schema: `data/shared/review_queue.csv`
One row per issue needing human review.
| column | meaning |
|-
| `review_id` | stable review row id |
| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
| `retailer` | retailer slug when applicable |
| `observed_product_id` | observed product id when applicable |
| `canonical_product_id` | candidate canonical id when applicable |
| `reason_code` | machine-readable review reason |
| `priority` | optional priority label |
| `raw_item_names` | compact list of example raw names |
| `normalized_names` | compact list of example normalized names |
| `upc` | example UPC/PLU |
| `image_url` | example image url |
| `example_prices` | compact list of example prices |
| `seen_count` | count of related rows |
| `status` | `pending`, `approved`, `rejected`, `deferred` |
| `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
Primary key:
- (`review_id`)
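One simple source of review rows is observed products that have no canonical link yet. The sketch below is a hedged illustration: the function name `unlinked_review_rows`, the `rvw_` id prefix, and the reason code `no_canonical_link` are all hypothetical. Deriving `review_id` from a hash of the queue type and observed product id is one way to keep ids stable across reruns, so a decision made once can be matched to the same row later.

#+begin_src python
import hashlib


def unlinked_review_rows(observed_rows, link_rows):
    """Build review rows for observed products with no canonical link."""
    linked = {link["observed_product_id"] for link in link_rows}
    queue = []
    for observed in observed_rows:
        product_id = observed["observed_product_id"]
        if product_id in linked:
            continue
        # Hash of stable inputs, so reruns reproduce the same review_id.
        source = f"observed_product:{product_id}".encode("utf-8")
        digest = hashlib.sha1(source).hexdigest()[:12]
        queue.append({
            "review_id": f"rvw_{digest}",
            "queue_type": "observed_product",
            "retailer": observed["retailer"],
            "observed_product_id": product_id,
            "canonical_product_id": "",
            "reason_code": "no_canonical_link",
            "seen_count": observed["times_seen"],
            "status": "pending",
        })
    return queue
#+end_src

Columns not shown here, such as `raw_item_names` or `example_prices`, would be filled from the observed and enriched layers in a fuller version.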
** current giant mapping
Current scraper outputs map to the new layout as follows:
- `giant_output/raw/history.json` -> `data/giant/raw/history.json`
- `giant_output/raw/<order_id>.json` -> `data/giant/raw/orders/<order_id>.json`
- `giant_output/orders.csv` -> `data/giant/orders.csv`
- `giant_output/items.csv` -> `data/giant/items_raw.csv`
Current Giant raw order payloads already expose fields needed for future
enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`,
`unitPrice`, `groceryAmount`, and `totalPickedWeight`.
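The path mapping listed above is mechanical enough to express as a small function, which could drive a one-time migration. This is a sketch under the assumption that every JSON file directly under `giant_output/raw/` other than `history.json` is an order payload; the function name `new_giant_path` is hypothetical.

#+begin_src python
from pathlib import PurePosixPath

# Fixed one-to-one renames taken from the mapping list.
STATIC_PATHS = {
    "giant_output/raw/history.json": "data/giant/raw/history.json",
    "giant_output/orders.csv": "data/giant/orders.csv",
    "giant_output/items.csv": "data/giant/items_raw.csv",
}


def new_giant_path(old_path):
    """Map a current scraper output path to the new layout, or None."""
    old = PurePosixPath(old_path)
    key = str(old)
    if key in STATIC_PATHS:
        return STATIC_PATHS[key]
    parts = old.parts
    # Assumed: remaining JSON files directly under raw/ are order payloads.
    if len(parts) == 3 and parts[:2] == ("giant_output", "raw") and old.suffix == ".json":
        return str(PurePosixPath("data/giant/raw/orders") / old.name)
    return None
#+end_src

Returning `None` for unmapped paths, rather than guessing a destination, lets a migration script report leftovers explicitly.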

File diff suppressed because one or more lines are too long


@@ -16,7 +16,7 @@
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14
* [ ] t1.2: define grocery data model and file layout (1-2 commits)
* [X] t1.2: define grocery data model and file layout (1-2 commits)
** acceptance criteria
- decide and document the files/directories for:
- retailer raw exports
@@ -32,11 +32,11 @@
- keep schema minimal but extensible
** evidence
- commit:
- tests:
- date:
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15
* [ ] t1.3: build giant parser/enricher from raw json (2-4 commits)
* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
** acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
@@ -54,11 +54,11 @@
- parser should preserve ambiguity rather than hallucinating precision
** evidence
- commit:
- tests:
- date:
- commit: `14f2cc2` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16
* [ ] t1.4: generate observed-product layer from enriched items (2-3 commits)
* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)
** acceptance criteria
- distinct observed products are generated from enriched giant items
@@ -76,11 +76,11 @@
- likely key is some combo of retailer + upc + normalized name
** evidence
- commit:
- tests:
- date:
- commit: `dc39214` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
- date: 2026-03-16
* [ ] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
** acceptance criteria
- produce a review file containing observed products needing manual review
@@ -98,11 +98,11 @@
- optimize for “approve once, remember forever”
** evidence
- commit:
- tests:
- date:
- commit: `9b13ec3` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
- date: 2026-03-16
* [ ] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
** acceptance criteria
- define and create `products_canonical.csv`
@@ -120,11 +120,11 @@
- do not require llm assistance for v1
** evidence
- commit:
- tests:
- date:
- commit: `347cd44` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16
* [ ] t1.7: implement auto-link rules for easy matches (2-3 commits)
* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)
** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
@@ -139,43 +139,140 @@
- false positives are worse than unresolved items
** evidence
- commit:
- tests:
- date:
- commit: `385a31c` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16
* [ ] t1.8: support costco raw ingest path (2-5 commits)
* [X] t1.8: support costco raw ingest path (2-5 commits)
** acceptance criteria
- add a costco-specific raw ingest/export path
- output costco line items into the same shared raw/enriched schema family
- confirm at least one product class can exist as:
- giant observed product
- costco observed product
- one shared canonical product
- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- persist raw json under `costco_output/raw/` and flatten to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
- use costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping
** notes
- this is the proof that the architecture generalizes
- don't chase perfection before the second retailer lands
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source
** evidence
- commit:
- tests:
- date:
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16
* [ ] t1.9: compute normalized comparison metrics (2-3 commits)
* [X] t1.8.1: support costco parser/enricher path (2-4 commits)
** acceptance criteria
- derive normalized comparison fields where possible:
- price per lb
- price per oz
- price per each
- price per count
- metrics are attached at canonical or linked-observed level as appropriate
- emit obvious nulls when basis is unknown rather than inventing values
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
- `itemDescription01` + `itemDescription02`
- `itemNumber` as `retailer_item_id`
- discount lines / negative rows
- common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values
** notes
- this is where “gala apples 5 lb bag vs other gala apples” becomes possible
- units discipline matters a lot here
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16
* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
** acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
- confirm at least one product class can exist as:
- Giant observed product
- Costco observed product
- one shared canonical product
- document the exact example used for proof
** notes
- keep this to one or two well-behaved product classes first
- apples, eggs, bananas, or flour are better than weird prepared foods
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16
* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
** acceptance criteria
- add shared fields needed for non-upc retailers, including:
- `retailer_item_id`
- `is_discount_line`
- `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`
** notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere
** evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16
* [X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)
** acceptance criteria
- confirm graphql summary query returns all expected receipts
- compare `inWarehouse` count vs number of `receipts` returned
- widen or parameterize date window if necessary; website shows receipts in 3-month windows
- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
- emit warning when receipt counts mismatch
** notes
- goal is to confirm we are enumerating all receipts before parsing
- do not expand schema or parser logic in this task
- keep changes limited to summary query handling and diagnostics
** evidence
- commit: `ac82fa6` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
- date: 2026-03-16
* [X] t1.8.5: refactor costco scraper auth and UX to align with giant scraper
** acceptance criteria
- remove manual auth env vars
- load costco cookies from firefox session
- require only logged-in browser
- replace start/end date flags with --months-back
- maintain same raw output structure
- ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone
** notes
- align Costco acquisition ergonomics with the Giant scraper
- keep downstream Costco parsing and shared schemas unchanged
** evidence
- commit: `c0054dc` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
- date: 2026-03-16
* [ ] t1.9: compute normalized comparison metrics (2-4 commits)
** acceptance criteria
- derive normalized comparison fields where possible on enriched or observed product rows:
- `price_per_lb`
- `price_per_oz`
- `price_per_each`
- `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
- parsed size/unit
- receipt weight
- explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- document at least one Giant vs Costco comparison example using the normalized metrics
** notes
- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage
** evidence
- commit: