data-model refactor and prep scope

2026-03-18 13:08:28 -04:00
parent 9122821db1
commit 10aad05808
3 changed files with 538 additions and 267 deletions
--- a/README.md
+++ b/README.md
@@ -12,6 +12,7 @@ Run each script step-by-step from the terminal.
 4. `enrich_costco.py`: normalize Costco line items
 5. `build_purchases.py`: combine retailer outputs into one purchase table
 6. `review_products.py`: review unresolved product matches in the terminal
+7. `report_pipeline_status.py`: show how many rows survive each stage

 ## Requirements

@@ -31,6 +32,7 @@ pip install -r requirements.txt
 Current version works best with `.env` in the project root.  The scraper will prompt for these values if they are not found in the current browser session.  
 - `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
 - `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
+- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.

 ```env
 GIANT_USER_ID=...
@@ -53,6 +55,8 @@ python enrich_costco.py
 python build_purchases.py
 python review_products.py
 python build_purchases.py
+python review_products.py --refresh-only
+python report_pipeline_status.py
 ```

 Why run `build_purchases.py` twice:
@@ -66,6 +70,12 @@ If you only want to refresh the queue without reviewing interactively:
 python review_products.py --refresh-only
 ```

+If you want a quick stage-by-stage accountability check:
+
+```bash
+python report_pipeline_status.py
+```
+
 ## Key Outputs

 Giant:
@@ -77,6 +87,7 @@ Costco:
 - `costco_output/orders.csv`
 - `costco_output/items.csv`
 - `costco_output/items_enriched.csv`
+- `costco_output/items_enriched.csv` now preserves raw totals and matched net discount fields

 Combined:
 - `combined_output/purchases.csv`
@@ -85,6 +96,8 @@ Combined:
 - `combined_output/canonical_catalog.csv`
 - `combined_output/product_links.csv`
 - `combined_output/comparison_examples.csv`
+- `combined_output/pipeline_status.csv`
+- `combined_output/pipeline_status.json`

 ## Review Workflow

@@ -95,9 +108,14 @@ Run `review_products.py` to cleanup unresolved or weakly unified items:
 - skip it for later
 Decisions are saved and reused on later runs.

+The review step is intentionally conservative:
+- weak exact-name matches stay in the queue instead of auto-creating canonical products
+- canonical names should describe stable product identity, not retailer packaging text
+
 ## Notes
 - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
 - `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
+- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
 - `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

 ## Test
--- a/pm/data-model.org
+++ b/pm/data-model.org
@@ -1,12 +1,13 @@
-* grocery data model and file layout
+* Grocery data model and file layout

 This document defines the shared file layout and stable CSV schemas for the
-grocery pipeline. The goal is to keep retailer-specific ingest separate from
-cross-retailer product modeling so Giant-specific quirks do not become the
-system of record.
-
-** design rules
+grocery pipeline.
+Goals:
+- Ensure data gathering is separate from analysis
+- Enable multiple data gathering methods
+- One layer for review and analysis  

+** Design Rules
 - Raw retailer exports remain the source of truth.
 - Retailer parsing is isolated to retailer-specific files and ids.
 - Cross-retailer product layers begin only after retailer-specific enrichment.
@@ -14,296 +15,313 @@ system of record.
  existing columns should not be repurposed.
 - Unknown values should be left blank rather than guessed.

-** directory layout
-
-Use one top-level data root:
-
-#+begin_example
-data/
-  giant/
-    raw/
-      history.json
-      orders/
-        <order_id>.json
-    orders.csv
-    items_raw.csv
-    items_enriched.csv
-    products_observed.csv
-  costco/
-    raw/
-      ...
-    orders.csv
-    items_raw.csv
-    items_enriched.csv
-    products_observed.csv
-  shared/
-    products_canonical.csv
-    product_links.csv
-    review_queue.csv
-#+end_example
-
-** layer responsibilities
-
- `data/<retailer>/raw/`
-  Stores unmodified retailer payloads exactly as fetched.
- `data/<retailer>/orders.csv`
-  One row per retailer order or visit, flattened from raw order data.
- `data/<retailer>/items_raw.csv`
-  One row per retailer line item, preserving retailer-native values needed for
-  reruns and debugging.
- `data/<retailer>/items_enriched.csv`
-  Parsed retailer line items with normalized fields and derived guesses, still
-  retailer-specific.
- `data/<retailer>/products_observed.csv`
-  Distinct retailer-facing observed products aggregated from enriched items.
- `data/shared/products_canonical.csv`
-  Cross-retailer canonical product entities used for comparison.
- `data/shared/product_links.csv`
-  Links from retailer observed products to canonical products.
- `data/shared/review_queue.csv`
-  Human review queue for unresolved or low-confidence matching/parsing cases.
-
-** retailer-specific versus shared
-
-Retailer-specific:
-
+*** Retailer-specific data:
 - raw json payloads
 - retailer order ids
 - retailer line numbers
 - retailer category ids and names
 - retailer item names
 - retailer image urls
- parsed guesses derived from one retailer feed
 - observed products scoped to one retailer

-Shared:
-
+*** Review/Combined data:
 - canonical products
 - observed-to-canonical links
 - human review state for unresolved cases
 - comparison-ready normalized quantity basis fields

+// I don't like this terminology - what is "observed" doing for us?
+// output should be normalized_items, not observed
+// unless this is the way we're matching multiple upc's?
 Observed products are the boundary between retailer-specific parsing and
 cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
 should require knowledge of another retailer.

-** schema: `data/<retailer>/orders.csv`
+* Pipeline
+Key: 
+- (1) input
+- [2] output

-One row per order or visit.
+Each step can be run alone if its inputs already exist.

-| column | meaning |
-|-
-| `retailer` | retailer slug such as `giant` |
-| `order_id` | retailer order or visit id |
-| `order_date` | order date in `YYYY-MM-DD` when available |
-| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
-| `service_type` | retailer service type such as `INSTORE` |
-| `order_total` | order total as provided by retailer |
-| `payment_method` | retailer payment label |
-| `total_item_count` | total line count or item count from retailer |
-| `total_savings` | total savings as provided by retailer |
-| `your_savings_total` | savings field from retailer when present |
-| `coupons_discounts_total` | coupon/discount total from retailer |
-| `store_name` | retailer store name |
-| `store_number` | retailer store number |
-| `store_address1` | street address |
-| `store_city` | city |
-| `store_state` | state or province |
-| `store_zipcode` | postal code |
-| `refund_order` | retailer refund flag |
-| `ebt_order` | retailer EBT flag |
-| `raw_history_path` | relative path to source history payload |
-| `raw_order_path` | relative path to source order payload |
+** 1. Collect
+Get raw receipt/visit and item data from a retailer.  Scraping is specific to a retailer and method (e.g., Giant-Web and Giant-Scan).  Preserve the complete raw data with full fidelity.  Avoid interpretation beyond basic data flattening.
+ - (1) Source access (varies, e.g., header data, auth for API access)
+ - [1] collected visits from each retailer
+ - [2] collected items from each retailer
+ - [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
+   
+** 2. Normalize
+Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer.  Strictly dependent on Collect method and output.
+ - Extract quantity, size, pack, pricing, variant
+ - Add discount line items to product line items using upc/retailer_item_id and concurrence
+ - Clean up naming to facilitate later matching
+ - (1) collected items from each retailer
+ - (2) collected visits from each retailer
+ - [1] normalized items from each retailer

-Primary key:
+** 3. Review/Combine (Canonicalization)
+Decide whether two normalized retailer items are "the same product"; match items across retailers using matching logic and human review.  Create a catalog linked to normalized items.
+ - Group occurrences of the same item from a retailer
+ - Ask a human to create a canonical/catalog item with:
+   - friendly/canonical_name: "bell pepper"; "milk"
+   - category: "produce"; "dairy"
+   - product_type: "pepper"; "milk"
+   - variant (open question): "whole"; "skim"; "2pct"
+ - (1) normalized items from each retailer
+ - [1] review queue of items to be reviewed
+ - [2] catalog (lookup table) of confirmed retailer_item and canonical_name
+ - [3] canonical purchase list, pivot-ready
+   
+** Unresolved Issues
+1. need central script to orchestrate; metadata belongs there and nowhere else

- (`retailer`, `order_id`)
+** Symptoms
+- `LIME` and `LIME . / .` appearing in canonical_catalog:
+  - names must come from review-approved names, not raw strings

-** schema: `data/<retailer>/items_raw.csv`

+* Directory Layout
+Use one top-level data root:
+#+begin_example
+main.py
+collect_<retailer>_<method>.py
+normalize_<retailer>_<method>.py
+review.py
+data/
+  <retailer-method>/
+    raw/  # unmodified retailer payloads exactly as fetched
+      <order_id>.json
+    collected_items.csv # one row per retailer line item w/ retailer-native values
+    collected_orders.csv # one row per receipt/visit, flattened from raw order data
+    normalized_items.csv # parsed retailer-specific line items with normalized fields
+  costco-web/ # sample
+    raw/
+      orders/
+        history.json
+        <order_id>.json
+    collected_items.csv
+    collected_orders.csv
+    normalized_items.csv
+  review/
+    review_queue.csv #  Human review queue for unresolved matching/parsing cases.
+    product_links.csv # Links from retailer-observed products to canonical products.
+  catalog.csv  # Cross-retailer canonical product entities used for comparison.
+  purchases.csv
+#+end_example
+
+* Schemas
+** `data/<retailer-method>/collected_items.csv`
 One row per retailer line item.
+| key                | definition                                 |
+|--------------------+--------------------------------------------|
+| `retailer` PK      | retailer slug                              |
+| `order_id` PK      | retailer order id                          |
+| `line_no`  PK      | stable line number within order export     |
+| `order_date`       | copied from order when available           |
+| `retailer_item_id` | retailer-native item id when available     |
+| `pod_id`           | retailer pod/item id                       |
+| `item_name`        | raw retailer item name                     |
+| `upc`              | retailer UPC or PLU value                  |
+| `category_id`      | retailer category id                       |
+| `category`         | retailer category description              |
+| `qty`              | retailer quantity field                    |
+| `unit`             | retailer unit code such as `EA` or `LB`    |
+| `unit_price`       | retailer unit price field                  |
+| `line_total`       | retailer extended price field              |
+| `picked_weight`    | retailer picked weight field               |
+| `mvp_savings`      | retailer savings field                     |
+| `reward_savings`   | retailer rewards savings field             |
+| `coupon_savings`   | retailer coupon savings field              |
+| `coupon_price`     | retailer coupon price field                |
+| `image_url`        | raw retailer image url when present        |
+| `raw_order_path`   | relative path to source order payload      |
+| `is_discount_line` | retailer adjustment or discount-line flag  |
+| `is_coupon_line`   | coupon-like line flag when distinguishable |

-| column           | meaning                                 |
-|------------------+-----------------------------------------|
-| `retailer`       | retailer slug                           |
-| `order_id`       | retailer order id                       |
-| `line_no`        | stable line number within order export  |
-| `order_date`     | copied from order when available        |
-| `retailer_item_id` | retailer-native item id when available |
-| `pod_id`         | retailer pod/item id                    |
-| `item_name`      | raw retailer item name                  |
-| `upc`            | retailer UPC or PLU value               |
-| `category_id`    | retailer category id                    |
-| `category`       | retailer category description           |
-| `qty`            | retailer quantity field                 |
-| `unit`           | retailer unit code such as `EA` or `LB` |
-| `unit_price`     | retailer unit price field               |
-| `line_total`     | retailer extended price field           |
-| `picked_weight`  | retailer picked weight field            |
-| `mvp_savings`    | retailer savings field                  |
-| `reward_savings` | retailer rewards savings field          |
-| `coupon_savings` | retailer coupon savings field           |
-| `coupon_price`   | retailer coupon price field             |
-| `image_url`      | raw retailer image url when present     |
-| `raw_order_path` | relative path to source order payload   |
-| `is_discount_line` | retailer adjustment or discount-line flag |
-| `is_coupon_line` | coupon-like line flag when distinguishable |
+** `data/<retailer-method>/collected_orders.csv`
+One row per order or visit.
+| key                       | definition                                      |
+|---------------------------+-------------------------------------------------|
+| `retailer` PK             | retailer slug such as `giant`                   |
+| `order_id` PK             | retailer order or visit id                      |
+| `order_date`              | order date in `YYYY-MM-DD` when available       |
+| `delivery_date`           | fulfillment date in `YYYY-MM-DD` when available |
+| `service_type`            | retailer service type such as `INSTORE`         |
+| `order_total`             | order total as provided by retailer             |
+| `payment_method`          | retailer payment label                          |
+| `total_item_count`        | total line count or item count from retailer    |
+| `total_savings`           | total savings as provided by retailer           |
+| `your_savings_total`      | savings field from retailer when present        |
+| `coupons_discounts_total` | coupon/discount total from retailer             |
+| `store_name`              | retailer store name                             |
+| `store_number`            | retailer store number                           |
+| `store_address1`          | street address                                  |
+| `store_city`              | city                                            |
+| `store_state`             | state or province                               |
+| `store_zipcode`           | postal code                                     |
+| `refund_order`            | retailer refund flag                            |
+| `ebt_order`               | retailer EBT flag                               |
+| `raw_history_path`        | relative path to source history payload         |
+| `raw_order_path`          | relative path to source order payload           |

-Primary key:
+** `data/<retailer-method>/normalized_items.csv`
+One row per retailer line item after deterministic parsing. Preserve raw
+fields from `collected_items.csv` and add parsed fields plus retailer-level
+identity needed before cross-retailer review.

- (`retailer`, `order_id`, `line_no`)
+| key                        | definition                                                       |
+|----------------------------+------------------------------------------------------------------|
+| `retailer` PK              | retailer slug                                                    |
+| `order_id` PK              | retailer order id                                                |
+| `line_no` PK               | line number within order                                         |
+| `normalized_row_id`        | stable row key, typically `<retailer>:<order_id>:<line_no>`      |
+| `normalized_item_id`       | stable retailer-level item identity after deterministic grouping |
+| `normalization_basis`      | basis used to assign `normalized_item_id`                        |
+| `retailer_item_id`         | retailer-native item id                                          |
+| `item_name`                | raw retailer item name                                           |
+| `item_name_norm`           | normalized retailer item name                                    |
+| `brand_guess`              | parsed brand guess                                               |
+| `variant`                  | parsed variant text                                              |
+| `size_value`               | parsed numeric size value                                        |
+| `size_unit`                | parsed size unit such as `oz`, `lb`, `fl_oz`                     |
+| `pack_qty`                 | parsed pack or count guess                                       |
+| `measure_type`             | `each`, `weight`, `volume`, `count`, or blank                    |
+| `normalized_quantity`      | numeric comparison basis derived during normalization            |
+| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank                 |
+| `is_store_brand`           | store-brand guess                                                |
+| `is_fee`                   | fee or non-product flag                                          |
+| `is_discount_line`         | discount or adjustment-line flag                                 |
+| `is_coupon_line`           | coupon-like line flag                                            |
+| `matched_discount_amount`  | matched discount value carried onto purchased row when supported |
+| `net_line_total`           | line total after matched discount when supported                 |
+| `price_per_each`           | derived per-each price when supported                            |
+| `price_per_each_basis`     | source basis for `price_per_each`                                |
+| `price_per_count`          | derived per-count price when supported                           |
+| `price_per_count_basis`    | source basis for `price_per_count`                               |
+| `price_per_lb`             | derived per-pound price when supported                           |
+| `price_per_lb_basis`       | source basis for `price_per_lb`                                  |
+| `price_per_oz`             | derived per-ounce price when supported                           |
+| `price_per_oz_basis`       | source basis for `price_per_oz`                                  |
+| `image_url`                | best available retailer image url                                |
+| `raw_order_path`           | relative path to source order payload                            |
+| `parse_version`            | parser version string for reruns                                 |
+| `parse_notes`              | optional non-fatal parser notes                                  |

-** schema: `data/<retailer>/items_enriched.csv`
-
-One row per retailer line item after deterministic parsing. Preserve the raw
-fields from `items_raw.csv` and add parsed fields.
-
-| column              | meaning                                                     |
-|---------------------+-------------------------------------------------------------|
-| `retailer`          | retailer slug                                               |
-| `order_id`          | retailer order id                                           |
-| `line_no`           | line number within order                                    |
-| `observed_item_key` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
-| `retailer_item_id`  | retailer-native item id                                     |
-| `item_name`         | raw retailer item name                                      |
-| `item_name_norm`    | normalized item name                                        |
-| `brand_guess`       | parsed brand guess                                          |
-| `variant`           | parsed variant text                                         |
-| `size_value`        | parsed numeric size value                                   |
-| `size_unit`         | parsed size unit such as `oz`, `lb`, `fl_oz`                |
-| `pack_qty`          | parsed pack or count guess                                  |
-| `measure_type`      | `each`, `weight`, `volume`, `count`, or blank               |
-| `is_store_brand`    | store-brand guess                                           |
-| `is_fee`            | fee or non-product flag                                     |
-| `is_discount_line`  | discount or adjustment-line flag                            |
-| `is_coupon_line`    | coupon-like line flag                                       |
-| `price_per_each`    | derived per-each price when supported                       |
-| `price_per_lb`      | derived per-pound price when supported                      |
-| `price_per_oz`      | derived per-ounce price when supported                      |
-| `image_url`         | best available retailer image url                           |
-| `parse_version`     | parser version string for reruns                            |
-| `parse_notes`       | optional non-fatal parser notes                             |
-
-Primary key:
-
- (`retailer`, `order_id`, `line_no`)
-
-** schema: `data/<retailer>/products_observed.csv`
-
-One row per distinct retailer-facing observed product.
-
-| column                        | meaning                                                        |
-|-------------------------------+----------------------------------------------------------------|
-| `observed_product_id`         | stable observed product id                                     |
-| `retailer`                    | retailer slug                                                  |
-| `observed_key`                | deterministic grouping key used to create the observed product |
-| `representative_retailer_item_id` | best representative retailer-native item id               |
-| `representative_upc`          | best representative UPC/PLU                                    |
-| `representative_item_name`    | representative raw retailer name                               |
-| `representative_name_norm`    | representative normalized name                                 |
-| `representative_brand`        | representative brand guess                                     |
-| `representative_variant`      | representative variant                                         |
-| `representative_size_value`   | representative size value                                      |
-| `representative_size_unit`    | representative size unit                                       |
-| `representative_pack_qty`     | representative pack/count                                      |
-| `representative_measure_type` | representative measure type                                    |
-| `representative_image_url`    | representative image url                                       |
-| `is_store_brand`              | representative store-brand flag                                |
-| `is_fee`                      | representative fee flag                                        |
-| `is_discount_line`            | representative discount-line flag                              |
-| `is_coupon_line`              | representative coupon-line flag                                |
-| `first_seen_date`             | first order date seen                                          |
-| `last_seen_date`              | last order date seen                                           |
-| `times_seen`                  | number of enriched item rows grouped here                      |
-| `example_order_id`            | one example retailer order id                                  |
-| `example_item_name`           | one example raw item name                                      |
-| `distinct_retailer_item_ids_count` | count of distinct retailer-native item ids               |
-
-Primary key:
-
- (`observed_product_id`)
-
-** schema: `data/shared/products_canonical.csv`
-
-One row per cross-retailer canonical product.
-
-| column                     | meaning                                          |
-|----------------------------+--------------------------------------------------|
-| `canonical_product_id`     | stable canonical product id                      |
-| `canonical_name`           | canonical human-readable name                    |
-| `product_type`             | broad class such as `apple`, `milk`, `trash_bag` |
-| `brand`                    | canonical brand when applicable                  |
-| `variant`                  | canonical variant                                |
-| `size_value`               | normalized size value                            |
-| `size_unit`                | normalized size unit                             |
-| `pack_qty`                 | normalized pack/count                            |
-| `measure_type`             | normalized measure type                          |
-| `normalized_quantity`      | numeric comparison basis value                   |
-| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`           |
-| `notes`                    | optional human notes                             |
-| `created_at`               | creation timestamp or date                       |
-| `updated_at`               | last update timestamp or date                    |
-
-Primary key:
-
- (`canonical_product_id`)
-
-** schema: `data/shared/product_links.csv`
+Notes:
+- `normalized_item_id` replaces the need for a core `products_observed.csv` layer.
+- `normalization_basis` should be explicit values like `exact_upc`, `retailer_item_id`, `name_size_pack`, or `manual_retailer_alias`.
+- Cross-retailer identity is still handled later in review/combine via `catalog.csv` and `product_links.csv`.
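One way to read the notes above is as a fallback chain: try the strongest identifier first and record which one was used. The sketch below assumes that order (`exact_upc`, then `retailer_item_id`, then `name_size_pack`). The id format, the hashing, and the function name `assign_normalized_item_id` are illustrative assumptions, not part of this commit. `manual_retailer_alias` is left out because it would need a lookup table that is not defined here.

```python
import hashlib


def assign_normalized_item_id(row):
    """Return (normalized_item_id, normalization_basis) for one row dict.

    Assumption: the row has `retailer` plus whichever of `upc`,
    `retailer_item_id`, `item_name_norm`, `size_value`, `size_unit`,
    `pack_qty` the parser could fill; blanks are empty strings.
    """
    retailer = row["retailer"]
    if row.get("upc"):
        basis, key = "exact_upc", row["upc"]
    elif row.get("retailer_item_id"):
        basis, key = "retailer_item_id", row["retailer_item_id"]
    else:
        basis = "name_size_pack"
        fields = ("item_name_norm", "size_value", "size_unit", "pack_qty")
        key = "|".join(str(row.get(f, "")) for f in fields)

    # Hash so the id is stable across reruns and scoped to one retailer.
    digest = hashlib.sha1(f"{retailer}|{basis}|{key}".encode("utf-8")).hexdigest()
    return f"{retailer}:{digest[:12]}", basis
```

Because the retailer slug is part of the hashed key, two retailers never share a `normalized_item_id`, which keeps cross-retailer identity in the review/combine step.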

+** `data/review/product_links.csv`
 One row per observed-to-canonical relationship.
+One catalog item maps to many normalized items.

-| column | meaning |
-|-
-| `observed_product_id` | retailer observed product id |
-| `canonical_product_id` | linked canonical product id |
-| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
-| `link_confidence` | optional confidence label |
-| `review_status` | `pending`, `approved`, `rejected`, or blank |
-| `reviewed_by` | reviewer id or initials |
-| `reviewed_at` | review timestamp or date |
-| `link_notes` | optional notes |
-
-Primary key:
-
- (`observed_product_id`, `canonical_product_id`)
-
-** schema: `data/shared/review_queue.csv`
+| key               | definition                                  |
+|-------------------+---------------------------------------------|
+| `observed_id` PK  | retailer observed product id                |
+| `catalog_id` PK   | linked canonical product id                 |
+| `link_method`     | `manual`, `exact_upc`, `exact_name`, etc.   |
+| `link_confidence` | optional confidence label                   |
+| `review_status`   | `pending`, `approved`, `rejected`, or blank |
+| `reviewed_by`     | reviewer id or initials                     |
+| `reviewed_at`     | review timestamp or date                    |
+| `link_notes`      | optional notes                              |

+** `data/review/review_queue.csv`
 One row per issue needing human review.

-| column | meaning |
-|-
-| `review_id` | stable review row id |
-| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
-| `retailer` | retailer slug when applicable |
-| `observed_product_id` | observed product id when applicable |
-| `canonical_product_id` | candidate canonical id when applicable |
-| `reason_code` | machine-readable review reason |
-| `priority` | optional priority label |
-| `raw_item_names` | compact list of example raw names |
-| `normalized_names` | compact list of example normalized names |
-| `upc` | example UPC/PLU |
-| `image_url` | example image url |
-| `example_prices` | compact list of example prices |
-| `seen_count` | count of related rows |
-| `status` | `pending`, `approved`, `rejected`, `deferred` |
-| `resolution_notes` | reviewer notes |
-| `created_at` | creation timestamp or date |
-| `updated_at` | last update timestamp or date |
+| key                   | definition                                          |
+|-----------------------+-----------------------------------------------------|
+| `review_id` PK        | stable review row id                                |
+| `queue_type`          | `observed_product`, `link_candidate`, `parse_issue` |
+| `retailer`            | retailer slug when applicable                       |
+| `observed_product_id` | observed product id when applicable                 |
+| `catalog_id`          | candidate canonical id when applicable              |
+| `reason_code`         | machine-readable review reason                      |
+| `priority`            | optional priority label                             |
+| `raw_item_names`      | compact list of example raw names                   |
+| `normalized_names`    | compact list of example normalized names            |
+| `upc`                 | example UPC/PLU                                     |
+| `image_url`           | example image url                                   |
+| `example_prices`      | compact list of example prices                      |
+| `seen_count`          | count of related rows                               |
+| `status`              | `pending`, `approved`, `rejected`, `deferred`       |
+| `resolution_notes`    | reviewer notes                                      |
+| `created_at`          | creation timestamp or date                          |
+| `updated_at`          | last update timestamp or date                       |
+** `data/catalog.csv`
+One row per cross-retailer canonical product.
+| key                        | definition                             |
+|----------------------------+----------------------------------------|
+| `catalog_id` PK            | stable canonical product id            |
+| `catalog_name`             | canonical human-readable name          |
+| `product_type`             | generic product e.g. `apple`, `milk`   |
+| `category`                 | broad section e.g. `produce`, `dairy`  |
+| `brand`                    | canonical brand when applicable        |
+| `variant`                  | canonical variant                      |
+| `size_value`               | normalized size value                  |
+| `size_unit`                | normalized size unit                   |
+| `pack_qty`                 | normalized pack/count                  |
+| `measure_type`             | normalized measure type                |
+| `normalized_quantity`      | numeric comparison basis value         |
+| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
+| `notes`                    | optional human notes                   |
+| `created_at`               | creation timestamp or date             |
+| `updated_at`               | last update timestamp or date          |

-Primary key:
+** `data/purchases.csv`
+One row per purchased item (i.e., `row_type=item` from normalized layer), with
+catalog attributes denormalized in and discounts already applied.

- (`review_id`)
+| key                        | definition                                                     |
+|----------------------------+----------------------------------------------------------------|
+| `purchase_date`            | date of purchase (from order)                                  |
+| `retailer`                 | retailer slug                                                  |
+| `order_id`                 | retailer order id                                              |
+| `line_no`                  | line number within order                                       |
+| `normalized_row_id`        | `<retailer>:<order_id>:<line_no>`                              |
+| `normalized_item_id`       | retailer-level normalized item identity                        |
+| `catalog_id`               | linked canonical product id                                    |
+| `catalog_name`             | canonical product name for analysis                            |
+| `catalog_product_type`     | broader product family (e.g., `egg`, `milk`)                   |
+| `catalog_category`         | category such as `produce`, `dairy`                            |
+| `catalog_brand`            | canonical brand when applicable                                |
+| `catalog_variant`          | canonical variant when applicable                              |
+| `raw_item_name`            | original retailer item name                                    |
+| `normalized_item_name`     | cleaned/normalized retailer item name                          |
+| `retailer_item_id`         | retailer-native item id                                        |
+| `upc`                      | UPC/PLU when available                                         |
+| `qty`                      | retailer quantity field                                        |
+| `unit`                     | retailer unit (e.g., `EA`, `LB`)                               |
+| `pack_qty`                 | parsed pack/count                                              |
+| `size_value`               | parsed size value                                              |
+| `size_unit`                | parsed size unit                                               |
+| `measure_type`             | `each`, `weight`, `volume`, `count`                            |
+| `normalized_quantity`      | normalized comparison quantity                                 |
+| `normalized_quantity_unit` | unit for normalized quantity                                   |
+| `unit_price`               | retailer unit price                                            |
+| `line_total`               | original retailer extended price (pre-discount)                |
+| `matched_discount_amount`  | discount amount matched from discount lines                    |
+| `net_line_total`           | effective price after discount (`line_total` + discounts)      |
+| `store_name`               | retailer store name                                            |
+| `store_city`               | store city                                                     |
+| `store_state`              | store state                                                    |
+| `price_per_each`           | derived per-each price                                         |
+| `price_per_each_basis`     | source basis for per-each calc                                 |
+| `price_per_count`          | derived per-count price                                        |
+| `price_per_count_basis`    | source basis for per-count calc                                |
+| `price_per_lb`             | derived per-pound price                                        |
+| `price_per_lb_basis`       | source basis for per-pound calc                                |
+| `price_per_oz`             | derived per-ounce price                                        |
+| `price_per_oz_basis`       | source basis for per-ounce calc                                |
+| `is_fee`                   | true if row represents non-product fee                         |
+| `raw_order_path`           | relative path to original order payload                        |

-** current giant mapping
+Notes:
+- Only rows with `row_type=item` from normalization should appear here.
+- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
+- catalog fields are denormalized in to make pivoting trivial.
+- no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
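A minimal sketch of how the two totals relate. It assumes `matched_discount_amount` is stored as a negative number (as Costco discount lines are, e.g. `-100`) and is blank when no discount matched. `net_line_total()` is a hypothetical helper for illustration, not a function in the repo.

```python
from decimal import Decimal


def net_line_total(line_total: str, matched_discount_amount: str = "") -> Decimal:
    """Return line_total plus the (negative) matched discount.

    Assumption: a blank discount field means no discount was matched.
    """
    if matched_discount_amount.strip():
        discount = Decimal(matched_discount_amount)
    else:
        discount = Decimal("0")
    return Decimal(line_total) + discount


if __name__ == "__main__":
    # The retailer's line_total stays untouched; only the derived value changes.
    print(net_line_total("399.99", "-100"))
    print(net_line_total("5.00"))
```

Decimal is used here so that cent values survive the round trip through CSV text without float drift.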

-Current scraper outputs map to the new layout as follows:
-
- `giant_output/raw/history.json` -> `data/giant/raw/history.json`
- `giant_output/raw/<order_id>.json` -> `data/giant/raw/orders/<order_id>.json`
- `giant_output/orders.csv` -> `data/giant/orders.csv`
- `giant_output/items.csv` -> `data/giant/items_raw.csv`
-
-Current Giant raw order payloads already expose fields needed for future
-enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`,
-`unitPrice`, `groceryAmount`, and `totalPickedWeight`.
+* /
--- a/pm/scrape-giant.org
+++ b/pm/scrape-giant.org
@@ -70,7 +70,13 @@ b l : switch to local branch (cx)
 l l : open local reflog
 put point on the commit; highlighted remote gitea/cx
 X   : reset branch; prompts you, selected cx
-       
+
+
+
+** merge branch
+b b : switch to branch to be merged into (cx)
+m m : pick branch to merge into current branch
+
 * giant requests
 ** item:
 get:
@@ -250,18 +256,247 @@ python build_observed_products.py
 python build_review_queue.py
 python build_canonical_layer.py
 python validate_cross_retailer_flow.py
-* t1.11 tasks [2026-03-17 Tue 13:49]
+* t1.13 tasks [2026-03-17 Tue 13:49]
 ok i ran a few. time to run some cleanups here - i'm wondering if we shouldn't be less aggressive with canonical names and encourage a better manual process to start. 
-1. auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty
-2. canonical_names feel too specific, e.g., "5DZ egg"
-3. some canonical_names need consolidation, eg "LIME" and "LIME  . / ." ; poss cleanup issue. there are 5 entries for ergg but but they are all regular large grade A white eggs, just different amounts in dozens.
+** TODO fill in auto-created canonical category, product-type
+auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty
+
+** TODO consolidation cleanup
+1. canonical_names feel too specific, e.g., "5DZ egg" - probably a problem with the enrich_* steps not adding appropriate normalizing data /and/ removing from observed product title?
+2. some canonical_names need consolidation, e.g., "LIME" and "LIME  . / ." ; possibly a cleanup issue. there are 5 entries for egg, but they are all regular large grade A white eggs, just different amounts in dozens.
  Eggs are actually a great candidate for the kind of analysis we want to do - the pipeline should have caught and properly sorted these into size/qty:
+  #+begin_example
  ```canonical_product_id	canonical_name	category	product_type	brand	variant	size_value	size_unit	pack_qty	measure_type	notes	created_at	updated_at
  gcan_0e350505fd22	5DZ EGG / /			KS					each	auto-linked via exact_name		
  gcan_47279a80f5f3	EGG 5 DOZ. BBS								each	auto-linked via exact_name		
  gcan_7d099130c1bf	LRG WHITE EGG			SB				30	count	auto-linked via exact_upc		
  gcan_849c2817e667	GDA LRG WHITE EGG			SB				18	count	auto-linked via exact_upc		
  gcan_cb0c6c8cf480	LG EGG CONVENTIONAL					18	count		count	auto-linked via exact_name_size		  ```
-4. Build costco mechanism for matching discount to line item.
+  #+end_example
+** TODO costco discount matching
+Build costco mechanism for matching discount to line item.
   1. Discounts appear as their own line items with a number like /123456, this matches the UPC of the discounted item
   2. must be date-matched to the UPC
+
+Data model might be missing shape:
+1. match discount rows like `item_name:/2303476` to `retailer_item_id:2303476`
+2. display this value on the item somehow? maybe update line_total? otherwise we lose fidelity. should be stored in items_enriched somehow
+#+begin_example
+```retailer	order_id	line_no	observed_item_key	order_date	retailer_item_id	pod_id	item_name	upc	category_id	category	qty	unit	unit_price	line_total	picked_weight	mvp_savings	reward_savings	coupon_savings	coupon_price	image_url	raw_order_path	item_name_norm	brand_guess	variant	size_value	size_unit	pack_qty	measure_type	is_store_brand	is_fee	is_discount_line	is_coupon_line	price_per_each	price_per_lb	price_per_oz	parse_version	parse_notes
+costco	2.11115E+22	3	costco:21111520101942404241753:3	4/24/2024	2303476		KA 6QT MIXER P16 KSM60SECXER/CU FY23		33	33	1	None	399.99	399.99							costco_output/raw/21111520101942404241753-2024-04-24T17-53-00.json	KA 6QT MIXER KSM60SECXER/CU						each	FALSE	FALSE	FALSE	FALSE	399.99			costco-enrich-v1	
+costco	2.11115E+22	4	costco:21111520101942404241753:4	4/24/2024	325173		/2303476		33	33	-1	None	0	-100				-100			costco_output/raw/21111520101942404241753-2024-04-24T17-53-00.json	/2303476						each	FALSE	FALSE	TRUE	TRUE	100			costco-enrich-v1	```
+#+end_example
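One possible shape for the matching step, as a sketch only. It assumes a discount row is any row whose `item_name` is a slash, optional spaces, then digits, and that the digits refer to the `retailer_item_id` of an item in the same order (which also satisfies the date match). `match_discounts()` and the output fields `matched_discount_amount` and `net_line_total` are illustrative names, not existing code.

```python
import re

# Matches "/2303476" and "/ 1768123": slash, optional spaces, digits only.
DISCOUNT_NAME = re.compile(r"^/\s*(\d+)$")


def match_discounts(rows):
    """Fold discount rows into the item row they reference within one order.

    rows: dicts with order_id, retailer_item_id, item_name, line_total.
    Returns (item_rows, unmatched_discount_rows).
    """
    items, discounts = [], []
    for row in rows:
        found = DISCOUNT_NAME.match(row["item_name"].strip())
        if found:
            discounts.append((found.group(1), row))
        else:
            items.append(dict(row, matched_discount_amount=0.0))

    # Index items by (order, retailer item id); the first occurrence wins.
    by_key = {}
    for item in items:
        by_key.setdefault((item["order_id"], item["retailer_item_id"]), item)

    unmatched = []
    for target_id, discount in discounts:
        item = by_key.get((discount["order_id"], target_id))
        if item is None:
            unmatched.append(discount)  # keep for review instead of dropping
        else:
            item["matched_discount_amount"] += float(discount["line_total"])

    for item in items:
        # line_total is preserved as-is; the net value is stored separately.
        item["net_line_total"] = round(
            float(item["line_total"]) + item["matched_discount_amount"], 2
        )
    return items, unmatched
```

Keeping `line_total` untouched and writing the discount to separate fields avoids the loss of fidelity raised above; unmatched discounts are returned rather than silently discarded so they can be counted.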
+** TODO giant discount matching
+
+* prompt
+ do not add new abstractions unless they remove real duplication. prefer explicit retailer-specific logic over generic heuristics. do not auto-create new canonical products from weak normalized names.
+ and propose the smallest set of edits needed.
+* 1.13 fixes
+** 15x Costco discounts not caught
+- 15x, some with slash-space: `/ 1768123` and some without: `/2303476`
+** canonical names suck - tempted to force manual config from scratch?
+- maybe first-pass should be naming groups, starting with largest groups and going on down.
+- unfortunately not seeing many cross-retailer items? looks like costco-only; just taking Giant as gospel
+- could be as simple as changing canonical name in canonical_catalog.csv  
+- tough to figure out where the data is, leading to below:  
+** need to refactor whole flow and where data is stored
+group by browser or by site, or both? currently mixed. 
+1. Scrape
+   - Script:
+   - Output: /output/raw/orderN.json, history.json, orders.csv, history.csv
+2. Enrich
+   - Scripts:
+   - Output: /output/enrich/items.json
+3. Combined - /output/?
+   - Review step?
+
+** proposed fixes
+* 1.14 prep - OBE
+** [ ] t1.14.1 define and document the filesystem/data-layer layout (2-3 commits)
+make stage ownership and retailer ownership explicit so every artifact has one obvious home
+
+** AC
+1. define and document the canonical directory layout for the pipeline, separating retailer-specific artifacts from shared combined artifacts
+2. adopt an explicit layout of the form:
+   - `data/<retailer>/raw/`
+   - `data/<retailer>/orders.csv`
+   - `data/<retailer>/items.csv`
+   - `data/<retailer>/items_enriched.csv`
+   - `data/combined/products_observed.csv`
+   - `data/combined/review_queue.csv`
+   - `data/combined/item_aliases.csv`
+   - `data/combined/canonical_catalog.csv`
+   - `data/combined/product_links.csv`
+   - `data/combined/purchases.csv`
+   - `data/combined/pipeline_status.csv`
+   - `data/combined/pipeline_status.json`
+3. update docs/readme and pipeline docs so each script’s inputs and outputs point to the new layout
+4. remove or deprecate ambiguous stage outputs living under a retailer-specific output directory when they are actually shared artifacts
+- pm note: goal is “where does this file live?” should have one answer, not three
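The layout above could be centralized in one small module so that every script resolves paths the same way. This is a sketch under assumptions: `DATA_ROOT`, `retailer_path()` and `combined_path()` are hypothetical names, and the list of shared files is copied from the AC list above.

```python
from pathlib import Path

DATA_ROOT = Path("data")  # assumption: paths are relative to the project root

# Shared artifacts named in the AC list; anything else is rejected.
COMBINED_FILES = (
    "products_observed.csv",
    "review_queue.csv",
    "item_aliases.csv",
    "canonical_catalog.csv",
    "product_links.csv",
    "purchases.csv",
    "pipeline_status.csv",
    "pipeline_status.json",
)


def retailer_path(retailer: str, name: str) -> Path:
    """Path of a retailer-owned artifact, e.g. data/costco/items_enriched.csv."""
    return DATA_ROOT / retailer / name


def combined_path(name: str) -> Path:
    """Path of a shared artifact; unknown names raise so typos fail early."""
    if name not in COMBINED_FILES:
        raise ValueError(f"not a known combined artifact: {name}")
    return DATA_ROOT / "combined" / name
```

With a single lookup like this, "where does this file live?" has one answer in code as well as in the docs.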
+
+** evidence
+- commit:
+- tests:
+- date:
+
+** notes
+
+** [ ] t1.14.2 define the row-level data model for raw, enriched, observed, canonical, and purchases layers (2-4 commits)
+lock the item model before further refactors so each stage has a clear grain and purpose
+
+** AC
+1. document the row grain for each layer:
+   - raw item row = one receipt line from one retailer order
+   - enriched item row = one retailer line with retailer-specific parsed fields
+   - observed product row = one grouped retailer-facing product concept
+   - canonical catalog row = one review-controlled product identity
+   - purchase row = one final pivot-ready purchased item line
+2. define the required fields for each layer, including stable ids and provenance fields
+3. explicitly document which fields are allowed to be blank at each layer (e.g. `upc`, `canonical_item_id`, category)
+4. document the relationship between:
+   - `raw_item_name`
+   - `normalized_item_name`
+   - `observed_product_id`
+   - `canonical_item_id`
+5. document how retailer-native ids (e.g. Costco `retailer_item_id`) fit into the shared model without being forced into `upc`
+- pm note: this is the schema contract task; code should follow it, not invent it ad hoc
+
+** evidence
+- commit:
+- tests:
+- date:
+
+** notes
+** [ ] t1.14.3 refactor pipeline outputs to the new layout without changing semantics (2-4 commits)
+move files and script defaults to the new structure while preserving current behavior
+
+** AC
+1. update scraper and enrich scripts to write retailer-specific outputs under `data/<retailer>/...`
+2. update combined/shared scripts to read from retailer-specific enriched outputs and write to `data/combined/...`
+3. preserve current content/meaning of outputs during the move; this is a location/structure refactor, not a behavior rewrite
+4. update tests, docs, and script defaults to use the new paths
+- pm note: do not mix data-layout cleanup with canonical/review logic changes in this task
+
+** evidence
+- commit:
+- tests:
+- date:
+
+** notes
+** [ ] t1.14.4 make the review and catalog layer explicit and authoritative (2-4 commits)
+treat review and canonical resolution as first-class data, not incidental byproducts
+
+** AC
+1. define `review_queue.csv`, `item_aliases.csv`, and `canonical_catalog.csv` as the authoritative review/catalog files in `data/combined/`
+2. document the intended purpose of each:
+   - `review_queue.csv` = unresolved observed items needing action
+   - `item_aliases.csv` = approved mapping from observed/normalized names to canonical ids
+   - `canonical_catalog.csv` = review-controlled canonical product definitions and display names
+3. ensure final purchase generation reads from these files as the source of truth for resolution
+4. stop relying on weak implicit canonical creation as a substitute for the explicit review/catalog layer
+- pm note: this is the control-plane task; observed products may be automatic, canonical products are review-controlled
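A sketch of what "review-controlled" resolution might look like. It assumes the alias file has been loaded into a dict from normalized name to canonical id and the catalog into a dict keyed by canonical id; the real column names are not fixed here, and `resolve_items()` is a hypothetical helper.

```python
def resolve_items(normalized_names, aliases, catalog):
    """Split names into resolved and unresolved using only approved data.

    aliases: normalized name -> canonical id (from item_aliases.csv).
    catalog: canonical id -> catalog row (from canonical_catalog.csv).
    Nothing is auto-created: a name without an approved alias, or an alias
    pointing at a missing catalog row, goes to the review queue.
    """
    resolved, unresolved = {}, []
    for name in normalized_names:
        canonical_id = aliases.get(name)
        if canonical_id in catalog:
            resolved[name] = catalog[canonical_id]
        else:
            unresolved.append(name)
    return resolved, unresolved


if __name__ == "__main__":
    aliases = {"LIME": "cat_lime"}
    catalog = {"cat_lime": {"canonical_name": "lime", "category": "produce"}}
    resolved, unresolved = resolve_items(["LIME", "LIME . / ."], aliases, catalog)
    print(resolved)
    print(unresolved)
```

The display name always comes from the catalog row, so a raw string such as `LIME . / .` can only reach the review queue, never the catalog.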
+
+** evidence
+- commit:
+- tests:
+- date:
+
+** notes
+** [ ] t1.14.5 define and document the final pivot-ready purchases output (2-3 commits)
+make the final analysis artifact explicit so excel/pivot/chart use is a first-class target
+
+** AC
+1. define `data/combined/purchases.csv` as the final normalized purchase log
+2. ensure each purchase row retains:
+   - purchase date
+   - retailer
+   - order id
+   - raw item name
+   - normalized item name
+   - canonical item id when resolved
+   - quantity and unit
+   - original line total
+   - discount-adjusted fields when applicable
+   - store/location fields where available
+3. document that `purchases.csv` is the primary excel/pivot input and that earlier files are staging layers
+4. document expected pivot uses such as purchase frequency and cost over time by canonical item
+- pm note: this task is about making the final artifact explicit and stable, not about adding new metrics
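The pivot uses in AC 4 can be approximated outside Excel with a few lines of standard-library Python. This sketch assumes the `purchases.csv` field names `catalog_id`, `purchase_date` and `net_line_total` from the data model, and assumes `purchase_date` is ISO formatted (`YYYY-MM-DD`); `monthly_spend()` is a hypothetical helper.

```python
from collections import defaultdict


def monthly_spend(purchases):
    """Count purchases and sum net_line_total per (catalog_id, YYYY-MM)."""
    totals = defaultdict(lambda: {"count": 0, "spend": 0.0})
    for row in purchases:
        if not row.get("catalog_id"):
            continue  # unresolved rows have no canonical item to group by
        month = row["purchase_date"][:7]  # assumes ISO dates
        bucket = totals[(row["catalog_id"], month)]
        bucket["count"] += 1
        bucket["spend"] += float(row["net_line_total"])
    return dict(totals)


if __name__ == "__main__":
    rows = [
        {"catalog_id": "egg", "purchase_date": "2024-04-02", "net_line_total": "3.50"},
        {"catalog_id": "egg", "purchase_date": "2024-04-24", "net_line_total": "4.00"},
        {"catalog_id": "", "purchase_date": "2024-04-24", "net_line_total": "9.99"},
    ]
    for key, value in sorted(monthly_spend(rows).items()):
        print(key, value)
```

In practice the rows would come from `csv.DictReader` over `data/combined/purchases.csv`; the same grouping is what a pivot table on canonical item and month would produce.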
+
+** evidence
+- commit:
+- tests:
+- date:
+
+** notes
+
+* pipeline prep [2026-03-17 Tue]
+
+data saved to /data
+1. "scrape_<retailer>" gathers data from a retailer and outputs:
+   1. raw list of items per visit          ./<retailer>/scraped/raw/order-<uid>.json
+   2. raw list of visits                   ./<retailer>/scraped_visits.csv
+   3. raw list of items from all visits    ./<retailer>/scraped_items.csv
+2. "enrich <retailer>" takes /scraped/ data and outputs:
+   1. normalized list of items             ./<retailer>/enriched_items.csv
+3. "combine" takes retailer 
+ input:
+   1. all enriched items                   ./<retailer>/enriched_items.csv
+   2. all retailer visits                  ./<retailer>/scraped_visits.csv
+ outputs:
+   1. observed product groups              ./combined/observed/products_observed.csv
+   2. unresolved products for review       ./combined/review/review_queue.csv
+   3. pipeline accounting/status           ./combined/status/pipeline_status.csv
+   4. pipeline accounting/status           ./combined/status/pipeline_status.json
+4. review resolves unknown or weakly identified products and maintains:
+   1. canonical product catalog            ./combined/review/canonical_catalog.csv
+   2. approved alias mappings              ./combined/review/item_aliases.csv
+   3. optional observed→canonical links    ./combined/review/product_links.csv
+5. build purchases takes combined observed data plus review/catalog data and outputs:
+   1. final normalized purchase log        ./combined/purchases/purchases.csv
+
+let's get this pipeline right before more refactoring.
+
+* Pipeline - moved to data-model.org [2026-03-18 Wed]
+Key: 
+- (1) input
+- [2] output
+
+Each step can be run alone if its dependencies exist.
+
+** 1. Collect
+Get raw receipt/visit and item data from a retailer.  Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan).  Preserve complete raw data and preserve fidelity.  Avoid interpretation beyond basic data flattening.
+ - (1) Source access (Varies, eg header data, auth for API access)
+ - [1] collected visits from each retailer
+ - [2] collected items from each retailer
+ - [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
+   
+** 2. Normalize
+Parse and extract structured facts from retailer-specific raw data to create a standardized item format.  Strictly dependent on Collect method and output.
+ - Extract quantity, size, pack, pricing, variant
+ - Consolidate discount with item using upc/retailer_item_id and concurrence
+ - Cleanup naming to facilitate later matching
+ - (1) collected items from each retailer
+ - (2) collected visits from each retailer
+ - [1] normalized items from each retailer
+
+** 3. Review/Combine (Canonicalization)
+Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review.  Create catalog linked to normalized items.
+ - Grouping the same item from retailer
+ - Asking human to create a canonical/catalog item with:
+   - friendly/canonical_name: "bell pepper"; "milk"
+   - category: "produce"; "dairy"
+   - product_type: "pepper"; "milk"
+   - ? variant? "whole", "skim", "2pct"
+ - (1) normalized items from each retailer
+ - [1] review queue of items to be reviewed
+ - [2] catalog (lookup table) of confirmed retailer_item and canonical_name
+ - [3] canonical purchase list, pivot-ready
+   
+** Unresolved Issues
+2. Create tags: canonical_name (need better label), category, product_type is missing data like Variant, shouldn't this be part of the normalization step?
+3. need central script to orchestrate; metadata belongs here and nowhere else
+
+** Symptoms
+- `LIME` and `LIME . / .` appearing in canonical_catalog:
+  - names must come from review-approved names, not raw strings
+*