Merge remote-tracking branch 'gitea/cx' into cx

2026-03-16 12:40:44 -04:00
parent d080a35697 2e5109bd11
commit de0c276a24
18 changed files with 3675 additions and 47 deletions
--- a/pm/data-model.org
+++ b/pm/data-model.org
@@ -0,0 +1,309 @@
+* grocery data model and file layout
+
+This document defines the shared file layout and stable CSV schemas for the
+grocery pipeline. The goal is to keep retailer-specific ingest separate from
+cross-retailer product modeling so Giant-specific quirks do not become the
+system of record.
+
+** design rules
+
+- Raw retailer exports remain the source of truth.
+- Retailer parsing is isolated to retailer-specific files and ids.
+- Cross-retailer product layers begin only after retailer-specific enrichment.
+- CSV schemas are stable and additive: new columns may be appended, but
+  existing columns should not be repurposed.
+- Unknown values should be left blank rather than guessed.
+
+** directory layout
+
+Use one top-level data root:
+
+#+begin_example
+data/
+  giant/
+    raw/
+      history.json
+      orders/
+        <order_id>.json
+    orders.csv
+    items_raw.csv
+    items_enriched.csv
+    products_observed.csv
+  costco/
+    raw/
+      ...
+    orders.csv
+    items_raw.csv
+    items_enriched.csv
+    products_observed.csv
+  shared/
+    products_canonical.csv
+    product_links.csv
+    review_queue.csv
+#+end_example
+
+** layer responsibilities
+
+- `data/<retailer>/raw/`
+  Stores unmodified retailer payloads exactly as fetched.
+- `data/<retailer>/orders.csv`
+  One row per retailer order or visit, flattened from raw order data.
+- `data/<retailer>/items_raw.csv`
+  One row per retailer line item, preserving retailer-native values needed for
+  reruns and debugging.
+- `data/<retailer>/items_enriched.csv`
+  Parsed retailer line items with normalized fields and derived guesses, still
+  retailer-specific.
+- `data/<retailer>/products_observed.csv`
+  Distinct retailer-facing observed products aggregated from enriched items.
+- `data/shared/products_canonical.csv`
+  Cross-retailer canonical product entities used for comparison.
+- `data/shared/product_links.csv`
+  Links from retailer observed products to canonical products.
+- `data/shared/review_queue.csv`
+  Human review queue for unresolved or low-confidence matching/parsing cases.
+
+** retailer-specific versus shared
+
+Retailer-specific:
+
+- raw json payloads
+- retailer order ids
+- retailer line numbers
+- retailer category ids and names
+- retailer item names
+- retailer image urls
+- parsed guesses derived from one retailer feed
+- observed products scoped to one retailer
+
+Shared:
+
+- canonical products
+- observed-to-canonical links
+- human review state for unresolved cases
+- comparison-ready normalized quantity basis fields
+
+Observed products are the boundary between retailer-specific parsing and
+cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
+should require knowledge of another retailer.
+
+** schema: `data/<retailer>/orders.csv`
+
+One row per order or visit.
+
+| column | meaning |
+|-
+| `retailer` | retailer slug such as `giant` |
+| `order_id` | retailer order or visit id |
+| `order_date` | order date in `YYYY-MM-DD` when available |
+| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
+| `service_type` | retailer service type such as `INSTORE` |
+| `order_total` | order total as provided by retailer |
+| `payment_method` | retailer payment label |
+| `total_item_count` | total line count or item count from retailer |
+| `total_savings` | total savings as provided by retailer |
+| `your_savings_total` | savings field from retailer when present |
+| `coupons_discounts_total` | coupon/discount total from retailer |
+| `store_name` | retailer store name |
+| `store_number` | retailer store number |
+| `store_address1` | street address |
+| `store_city` | city |
+| `store_state` | state or province |
+| `store_zipcode` | postal code |
+| `refund_order` | retailer refund flag |
+| `ebt_order` | retailer EBT flag |
+| `raw_history_path` | relative path to source history payload |
+| `raw_order_path` | relative path to source order payload |
+
+Primary key:
+
+- (`retailer`, `order_id`)
+
+** schema: `data/<retailer>/items_raw.csv`
+
+One row per retailer line item.
+
+| column           | meaning                                 |
+|------------------+-----------------------------------------|
+| `retailer`       | retailer slug                           |
+| `order_id`       | retailer order id                       |
+| `line_no`        | stable line number within order export  |
+| `order_date`     | copied from order when available        |
+| `retailer_item_id` | retailer-native item id when available |
+| `pod_id`         | retailer pod/item id                    |
+| `item_name`      | raw retailer item name                  |
+| `upc`            | retailer UPC or PLU value               |
+| `category_id`    | retailer category id                    |
+| `category`       | retailer category description           |
+| `qty`            | retailer quantity field                 |
+| `unit`           | retailer unit code such as `EA` or `LB` |
+| `unit_price`     | retailer unit price field               |
+| `line_total`     | retailer extended price field           |
+| `picked_weight`  | retailer picked weight field            |
+| `mvp_savings`    | retailer savings field                  |
+| `reward_savings` | retailer rewards savings field          |
+| `coupon_savings` | retailer coupon savings field           |
+| `coupon_price`   | retailer coupon price field             |
+| `image_url`      | raw retailer image url when present     |
+| `raw_order_path` | relative path to source order payload   |
+| `is_discount_line` | retailer adjustment or discount-line flag |
+| `is_coupon_line` | coupon-like line flag when distinguishable |
+
+Primary key:
+
+- (`retailer`, `order_id`, `line_no`)
+
+** schema: `data/<retailer>/items_enriched.csv`
+
+One row per retailer line item after deterministic parsing. Preserve the raw
+fields from `items_raw.csv` and add parsed fields.
+
+| column              | meaning                                                     |
+|---------------------+-------------------------------------------------------------|
+| `retailer`          | retailer slug                                               |
+| `order_id`          | retailer order id                                           |
+| `line_no`           | line number within order                                    |
+| `observed_item_key` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
+| `retailer_item_id`  | retailer-native item id                                     |
+| `item_name`         | raw retailer item name                                      |
+| `item_name_norm`    | normalized item name                                        |
+| `brand_guess`       | parsed brand guess                                          |
+| `variant`           | parsed variant text                                         |
+| `size_value`        | parsed numeric size value                                   |
+| `size_unit`         | parsed size unit such as `oz`, `lb`, `fl_oz`                |
+| `pack_qty`          | parsed pack or count guess                                  |
+| `measure_type`      | `each`, `weight`, `volume`, `count`, or blank               |
+| `is_store_brand`    | store-brand guess                                           |
+| `is_fee`            | fee or non-product flag                                     |
+| `is_discount_line`  | discount or adjustment-line flag                            |
+| `is_coupon_line`    | coupon-like line flag                                       |
+| `price_per_each`    | derived per-each price when supported                       |
+| `price_per_lb`      | derived per-pound price when supported                      |
+| `price_per_oz`      | derived per-ounce price when supported                      |
+| `image_url`         | best available retailer image url                           |
+| `parse_version`     | parser version string for reruns                            |
+| `parse_notes`       | optional non-fatal parser notes                             |
+
+Primary key:
+
+- (`retailer`, `order_id`, `line_no`)
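The key columns above can be sketched as a small helper plus a uniqueness check. This is a minimal illustration, not the repository's code: the function names and the sample rows are hypothetical, and the key format follows the "typically `<retailer>:<order_id>:<line_no>`" wording in the table.

```python
from collections import Counter


def observed_item_key(retailer, order_id, line_no):
    """Build the row key in the `<retailer>:<order_id>:<line_no>` form."""
    return f"{retailer}:{order_id}:{line_no}"


def duplicate_keys(rows):
    """Return primary-key tuples that appear more than once in `rows`."""
    counts = Counter((r["retailer"], r["order_id"], r["line_no"]) for r in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical sample rows, shaped like csv.DictReader output (all strings).
rows = [
    {"retailer": "giant", "order_id": "1001", "line_no": "1"},
    {"retailer": "giant", "order_id": "1001", "line_no": "2"},
    {"retailer": "giant", "order_id": "1001", "line_no": "2"},
]
```

A check like `duplicate_keys` could run after each enrich step to confirm the declared primary key actually holds before observed products are built.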
+
+** schema: `data/<retailer>/products_observed.csv`
+
+One row per distinct retailer-facing observed product.
+
+| column                        | meaning                                                        |
+|-------------------------------+----------------------------------------------------------------|
+| `observed_product_id`         | stable observed product id                                     |
+| `retailer`                    | retailer slug                                                  |
+| `observed_key`                | deterministic grouping key used to create the observed product |
+| `representative_retailer_item_id` | best representative retailer-native item id               |
+| `representative_upc`          | best representative UPC/PLU                                    |
+| `representative_item_name`    | representative raw retailer name                               |
+| `representative_name_norm`    | representative normalized name                                 |
+| `representative_brand`        | representative brand guess                                     |
+| `representative_variant`      | representative variant                                         |
+| `representative_size_value`   | representative size value                                      |
+| `representative_size_unit`    | representative size unit                                       |
+| `representative_pack_qty`     | representative pack/count                                      |
+| `representative_measure_type` | representative measure type                                    |
+| `representative_image_url`    | representative image url                                       |
+| `is_store_brand`              | representative store-brand flag                                |
+| `is_fee`                      | representative fee flag                                        |
+| `is_discount_line`            | representative discount-line flag                              |
+| `is_coupon_line`              | representative coupon-line flag                                |
+| `first_seen_date`             | first order date seen                                          |
+| `last_seen_date`              | last order date seen                                           |
+| `times_seen`                  | number of enriched item rows grouped here                      |
+| `example_order_id`            | one example retailer order id                                  |
+| `example_item_name`           | one example raw item name                                      |
+| `distinct_retailer_item_ids_count` | count of distinct retailer-native item ids               |
+
+Primary key:
+
+- (`observed_product_id`)
+
+** schema: `data/shared/products_canonical.csv`
+
+One row per cross-retailer canonical product.
+
+| column                     | meaning                                          |
+|----------------------------+--------------------------------------------------|
+| `canonical_product_id`     | stable canonical product id                      |
+| `canonical_name`           | canonical human-readable name                    |
+| `product_type`             | broad class such as `apple`, `milk`, `trash_bag` |
+| `brand`                    | canonical brand when applicable                  |
+| `variant`                  | canonical variant                                |
+| `size_value`               | normalized size value                            |
+| `size_unit`                | normalized size unit                             |
+| `pack_qty`                 | normalized pack/count                            |
+| `measure_type`             | normalized measure type                          |
+| `normalized_quantity`      | numeric comparison basis value                   |
+| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`           |
+| `notes`                    | optional human notes                             |
+| `created_at`               | creation timestamp or date                       |
+| `updated_at`               | last update timestamp or date                    |
+
+Primary key:
+
+- (`canonical_product_id`)
+
+** schema: `data/shared/product_links.csv`
+
+One row per observed-to-canonical relationship.
+
+| column | meaning |
+|-
+| `observed_product_id` | retailer observed product id |
+| `canonical_product_id` | linked canonical product id |
+| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
+| `link_confidence` | optional confidence label |
+| `review_status` | `pending`, `approved`, `rejected`, or blank |
+| `reviewed_by` | reviewer id or initials |
+| `reviewed_at` | review timestamp or date |
+| `link_notes` | optional notes |
+
+Primary key:
+
+- (`observed_product_id`, `canonical_product_id`)
+
+** schema: `data/shared/review_queue.csv`
+
+One row per issue needing human review.
+
+| column | meaning |
+|-
+| `review_id` | stable review row id |
+| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
+| `retailer` | retailer slug when applicable |
+| `observed_product_id` | observed product id when applicable |
+| `canonical_product_id` | candidate canonical id when applicable |
+| `reason_code` | machine-readable review reason |
+| `priority` | optional priority label |
+| `raw_item_names` | compact list of example raw names |
+| `normalized_names` | compact list of example normalized names |
+| `upc` | example UPC/PLU |
+| `image_url` | example image url |
+| `example_prices` | compact list of example prices |
+| `seen_count` | count of related rows |
+| `status` | `pending`, `approved`, `rejected`, `deferred` |
+| `resolution_notes` | reviewer notes |
+| `created_at` | creation timestamp or date |
+| `updated_at` | last update timestamp or date |
+
+Primary key:
+
+- (`review_id`)
+
+** current giant mapping
+
+Current scraper outputs map to the new layout as follows:
+
+- `giant_output/raw/history.json` -> `data/giant/raw/history.json`
+- `giant_output/raw/<order_id>.json` -> `data/giant/raw/orders/<order_id>.json`
+- `giant_output/orders.csv` -> `data/giant/orders.csv`
+- `giant_output/items.csv` -> `data/giant/items_raw.csv`
+
+Current Giant raw order payloads already expose fields needed for future
+enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`,
+`unitPrice`, `groceryAmount`, and `totalPickedWeight`.
--- a/pm/scrape-giant.org
+++ b/pm/scrape-giant.org
--- a/pm/tasks.org
+++ b/pm/tasks.org
@@ -16,7 +16,7 @@
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
 - date: 2026-03-14

-* [ ] t1.2: define grocery data model and file layout (1-2 commits)
+* [X] t1.2: define grocery data model and file layout (1-2 commits)
 ** acceptance criteria
 - decide and document the files/directories for:
  - retailer raw exports
@@ -32,11 +32,11 @@
 - keep schema minimal but extensible

 ** evidence
- commit:
- tests:
- date:
+- commit: `42dbae1` on branch `cx`
+- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
+- date: 2026-03-15

-* [ ] t1.3: build giant parser/enricher from raw json (2-4 commits)
+* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
 ** acceptance criteria
 - parser reads giant raw order json files
 - outputs `items_enriched.csv`
@@ -54,11 +54,11 @@
 - parser should preserve ambiguity rather than hallucinating precision

 ** evidence
- commit:
- tests:
- date:
+- commit: `14f2cc2` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
+- date: 2026-03-16

-* [ ] t1.4: generate observed-product layer from enriched items (2-3 commits)
+* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)

 ** acceptance criteria
 - distinct observed products are generated from enriched giant items
@@ -76,11 +76,11 @@
 - likely key is some combo of retailer + upc + normalized name

 ** evidence
- commit:
- tests:
- date:
+- commit: `dc39214` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
+- date: 2026-03-16

-* [ ] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
+* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

 ** acceptance criteria
 - produce a review file containing observed products needing manual review
@@ -98,11 +98,11 @@
 - optimize for “approve once, remember forever”

 ** evidence
- commit:
- tests:
- date:
+- commit: `9b13ec3` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
+- date: 2026-03-16

-* [ ] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
+* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

 ** acceptance criteria
 - define and create `products_canonical.csv`
@@ -120,11 +120,11 @@
 - do not require llm assistance for v1

 ** evidence
- commit:
- tests:
- date:
+- commit: `347cd44` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
+- date: 2026-03-16

-* [ ] t1.7: implement auto-link rules for easy matches (2-3 commits)
+* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)

 ** acceptance criteria
 - auto-link can match observed products to canonical products using deterministic rules
@@ -139,43 +139,140 @@
 - false positives are worse than unresolved items

 ** evidence
- commit:
- tests:
- date:
+- commit: `385a31c` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
+- date: 2026-03-16

-* [ ] t1.8: support costco raw ingest path (2-5 commits)
+* [X] t1.8: support costco raw ingest path (2-5 commits)

 ** acceptance criteria
 - add a costco-specific raw ingest/export path
- output costco line items into the same shared raw/enriched schema family
- confirm at least one product class can exist as:
-  - giant observed product
-  - costco observed product
-  - one shared canonical product
+- fetch costco receipt summary and receipt detail payloads from graphql endpoint
+- persist raw json under `costco_output/raw/` and write flattened `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
+- use costco-native identifiers: `transactionBarcode` as order id and `itemNumber` as retailer item id
+- preserve discount/coupon rows rather than dropping them

 ** notes
- this is the proof that the architecture generalizes
- don’t chase perfection before the second retailer lands
+- focus on raw costco acquisition and flattening
+- do not force costco identifiers into `upc`
+- bearer/auth values should come from local env, not source

 ** evidence
- commit:
- tests:
- date:
+- commit: `da00288` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
+- date: 2026-03-16

-* [ ] t1.9: compute normalized comparison metrics (2-3 commits)
+* [X] t1.8.1: support costco parser/enricher path (2-4 commits)

 ** acceptance criteria
- derive normalized comparison fields where possible:
-  - price per lb
-  - price per oz
-  - price per each
-  - price per count
- metrics are attached at canonical or linked-observed level as appropriate
- emit obvious nulls when basis is unknown rather than inventing values
+- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
+- output rows into the same shared enriched schema family as Giant
+- support costco-specific parsing for:
+  - `itemDescription01` + `itemDescription02`
+  - `itemNumber` as `retailer_item_id`
+  - discount lines / negative rows
+  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
+- preserve obvious unknowns as blank rather than guessed values
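The size patterns listed above could be parsed roughly as follows. This is a sketch under stated assumptions, not the logic in `enrich_costco.py`: it assumes `25#` means 25 pounds, `2/24 OZ` means a pack of 2 at 24 oz each, and `6-PACK` is a pack count with no size. `parse_size` and the regex names are hypothetical. Unmatched text yields blank fields rather than guesses.

```python
import re

# Assumed readings: "2/24 OZ" = 2 packs of 24 oz, "25#" = 25 lb.
PACK_SIZE = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB)\b")
POUNDS_HASH = re.compile(r"\b(\d+(?:\.\d+)?)\s*#")
SIZE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB)\b")
PACK = re.compile(r"\b(\d+)\s*-\s*PACK\b")


def parse_size(text):
    """Return size_value, size_unit, pack_qty as strings; blank when unknown."""
    result = {"size_value": "", "size_unit": "", "pack_qty": ""}
    upper = text.upper()

    match = PACK_SIZE.search(upper)
    if match:
        result["pack_qty"] = match.group(1)
        result["size_value"] = match.group(2)
        result["size_unit"] = match.group(3).lower()
        return result

    match = POUNDS_HASH.search(upper)
    if match:
        result["size_value"] = match.group(1)
        result["size_unit"] = "lb"
    else:
        match = SIZE.search(upper)
        if match:
            result["size_value"] = match.group(1)
            result["size_unit"] = match.group(2).lower()

    match = PACK.search(upper)
    if match:
        result["pack_qty"] = match.group(1)
    return result
```

Values stay as strings so they round-trip through CSV unchanged, and the combined `N/M OZ` pattern is tried first so `2/24 OZ` is not misread as a plain 24 oz size.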

 ** notes
- this is where “gala apples 5 lb bag vs other gala apples” becomes possible
- units discipline matters a lot here
+- this is the real schema compatibility proof, not raw ingest alone
+- expect weaker identifiers than Giant
+
+** evidence
+- commit: `da00288` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
+- date: 2026-03-16
+* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
+
+** acceptance criteria
+- feed Giant and Costco enriched rows through the same observed/canonical pipeline
+- confirm at least one product class can exist as:
+  - Giant observed product
+  - Costco observed product
+  - one shared canonical product
+- document the exact example used for proof
+
+** notes
+- keep this to one or two well-behaved product classes first
+- apples, eggs, bananas, or flour are better than weird prepared foods
+
+** evidence
+- commit: `da00288` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
+- date: 2026-03-16
+* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
+
+** acceptance criteria
+- add shared fields needed for non-upc retailers, including:
+  - `retailer_item_id`
+  - `is_discount_line`
+  - `is_coupon_line` or equivalent if needed
+- keep `upc` nullable across the pipeline
+- update downstream builders/tests to accept retailers with blank `upc`
+
+** notes
+- this prevents costco from becoming a schema hack
+- do this once instead of sprinkling exceptions everywhere
+
+** evidence
+- commit: `9497565` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
+- date: 2026-03-16
+* [X] t1.8.4: verify and correct costco receipt enumeration (1–2 commits)
+
+** acceptance criteria
+- confirm graphql summary query returns all expected receipts
+- compare `inWarehouse` count vs number of `receipts` returned
+- widen or parameterize date window if necessary; website shows receipts in 3-month windows
+- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
+- emit warning when receipt counts mismatch
+
+** notes
+- goal is to confirm we are enumerating all receipts before parsing
+- do not expand schema or parser logic in this task
+- keep changes limited to summary query handling and diagnostics
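The date-window handling described in the acceptance criteria might look like the sketch below, which splits an inclusive date range into consecutive windows of at most three calendar months. The helper names are hypothetical and the actual chunking in `scrape_costco.py` may differ; only the 3-month window size comes from the task text.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month's length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def date_windows(start, end, months=3):
    """Split [start, end] into consecutive inclusive (window_start, window_end) pairs."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(add_months(cursor, months) - timedelta(days=1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows
```

Each pair could then supply the `startDate` and `endDate` for one summary request, and the count of receipts returned per window can be compared against the expected total to drive the mismatch warning.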
+
+** evidence
+- commit: `ac82fa6` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
+- date: 2026-03-16
+* [X] t1.8.5: refactor costco scraper auth and UX to match giant scraper
+
+** acceptance criteria
+- remove manual auth env vars
+- load costco cookies from firefox session
+- require only logged-in browser
+- replace start/end date flags with --months-back
+- maintain same raw output structure
+- ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone 
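The composite-key requirement can be illustrated with a minimal sketch. The field names `transactionBarcode` and `transactionDateTime` come from the task text; the function names, the flat list-of-dicts shape, and the choice to raise on a duplicate are assumptions for illustration only.

```python
def receipt_key(receipt):
    """Composite key: barcode alone is not assumed to be unique."""
    return (receipt["transactionBarcode"], receipt["transactionDateTime"])


def build_summary_lookup(receipts):
    """Index receipt summaries by composite key, refusing silent overwrites."""
    lookup = {}
    for receipt in receipts:
        key = receipt_key(receipt)
        if key in lookup:
            raise ValueError(f"duplicate receipt key: {key!r}")
        lookup[key] = receipt
    return lookup
```

Raising on a duplicate composite key surfaces a collision immediately instead of letting a later receipt replace an earlier one in the lookup.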
+
+** notes
+- align Costco acquisition ergonomics with the Giant scraper
+- keep downstream Costco parsing and shared schemas unchanged
+
+** evidence
+- commit: `c0054dc` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
+- date: 2026-03-16
+* [ ] t1.9: compute normalized comparison metrics (2-4 commits)
+
+** acceptance criteria
+- derive normalized comparison fields where possible on enriched or observed product rows:
+  - `price_per_lb`
+  - `price_per_oz`
+  - `price_per_each`
+  - `price_per_count`
+- preserve the source basis used to derive each metric, e.g.:
+  - parsed size/unit
+  - receipt weight
+  - explicit count/pack
+- emit nulls when basis is unknown, conflicting, or ambiguous
+- document at least one Giant vs Costco comparison example using the normalized metrics
+
+** notes
+- compute metrics as close to the raw observation as possible
+- canonical layer can aggregate later, but should not invent missing unit economics
+- unit discipline matters more than coverage
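One way to read the criteria above is sketched here for the weight-based metrics only. It assumes the basis is a parsed size in `oz` or `lb`, optionally multiplied by a pack count. The function name, the `basis` label, and the rounding are hypothetical. Anything else, including volume units such as `fl_oz`, returns blanks instead of an invented value.

```python
OZ_PER_LB = 16


def weight_price_metrics(line_total, size_value, size_unit, pack_qty=""):
    """Derive price_per_oz and price_per_lb, or blanks when the basis is unknown."""
    metrics = {"price_per_lb": "", "price_per_oz": "", "basis": ""}
    try:
        total = float(line_total)
        size = float(size_value)
        packs = float(pack_qty) if pack_qty else 1.0
    except (TypeError, ValueError):
        return metrics  # missing or unparseable inputs: emit blanks
    if size <= 0 or packs <= 0:
        return metrics

    if size_unit == "lb":
        ounces = size * packs * OZ_PER_LB
    elif size_unit == "oz":
        ounces = size * packs
    else:
        return metrics  # not a weight unit: no weight-based metric

    metrics["price_per_oz"] = round(total / ounces, 4)
    metrics["price_per_lb"] = round(total / ounces * OZ_PER_LB, 4)
    metrics["basis"] = "parsed_size"  # hypothetical label for the source basis
    return metrics
```

Recording the basis next to each metric keeps it clear whether a price came from parsed size, receipt weight, or an explicit count when the canonical layer aggregates later.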

 ** evidence
 - commit: