* Grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the
grocery pipeline.
Goals:
- Keep data gathering separate from analysis
- Support multiple data gathering methods
- Provide one layer for review and analysis

** Design Rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific enrichment.
- CSV schemas are stable and additive: new columns may be appended, but
  existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.
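
The additive-schema rule can be checked mechanically. The sketch below is one possible check, not part of the pipeline as specified: `is_additive_change` and `to_cell` are hypothetical helper names, and "appended" is read strictly as "added at the end of the header".

#+begin_src python
def is_additive_change(old_columns, new_columns):
    """Return True when every old column keeps its position and name,
    and any new columns appear only after the old ones."""
    old = list(old_columns)
    return list(new_columns)[:len(old)] == old


def to_cell(value):
    """Write unknown values as blank cells rather than guessing."""
    return "" if value is None else str(value)


# Usage: compare the header of an existing CSV with a proposed header.
old_header = ["retailer", "order_id", "line_no"]
new_header = ["retailer", "order_id", "line_no", "order_date"]
print(is_additive_change(old_header, new_header))
#+end_src

A check like this could run before any step overwrites an existing CSV, so a reordered or renamed column fails early instead of silently shifting data.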

*** Retailer-specific data:
- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- observed products scoped to one retailer

*** Review/Combined data:
- canonical products
- observed-to-canonical links
- human review state for unresolved cases
- comparison-ready normalized quantity basis fields

// I don't like this terminology - what is "observed" doing for us?
// output should be normalized_items, not observed
// unless this is the way we're matching multiple upc's?
Observed products are the boundary between retailer-specific parsing and
cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
should require knowledge of another retailer.

* Pipeline
Key: 
- (1) input
- [2] output

Each step can be run alone if its inputs already exist.

** 1. Collect
Get raw receipt/visit and item data from a retailer.  Scraping is specific to a retailer and method (e.g., Giant-Web and Giant-Scan).  Preserve the complete raw data at full fidelity, and avoid interpretation beyond basic data flattening.
 - (1) Source access (varies, e.g., header data, auth for API access)
 - [1] collected visits from each retailer
 - [2] collected items from each retailer
 - [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
   
** 2. Normalize
Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer.  Strictly dependent on Collect method and output.
 - Extract quantity, size, pack, pricing, variant
 - Attach discount line items to product line items using `upc`/`retailer_item_id` and concurrence
 - Clean up naming to facilitate later matching
 - (1) collected items from each retailer
 - (2) collected visits from each retailer
 - [1] normalized items from each retailer
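
The discount-matching step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual logic: `apply_discounts` is a hypothetical name, flags are assumed to be already parsed to booleans, discount amounts are assumed to be negative values in `line_total`, and "concurrence" is interpreted as "appears in the same order".

#+begin_src python
from decimal import Decimal

MATCH_FIELDS = ("upc", "retailer_item_id")  # tried in this order (assumption)


def apply_discounts(rows):
    """Fold discount lines into the product line they match within one order.

    Returns (product_rows, unmatched_discount_rows).
    """
    products = [dict(r) for r in rows if not r["is_discount_line"]]
    index = {}
    for row in products:
        row["matched_discount_amount"] = Decimal("0")
        for field in MATCH_FIELDS:
            if row.get(field):
                # First product line seen for a key wins.
                index.setdefault((row["order_id"], field, row[field]), row)

    unmatched = []
    for disc in (r for r in rows if r["is_discount_line"]):
        target = None
        for field in MATCH_FIELDS:
            if disc.get(field):
                target = index.get((disc["order_id"], field, disc[field]))
                if target is not None:
                    break
        if target is None:
            unmatched.append(disc)  # leave for review rather than guess
        else:
            target["matched_discount_amount"] += Decimal(disc["line_total"])

    for row in products:
        row["net_line_total"] = (
            Decimal(row["line_total"]) + row["matched_discount_amount"]
        )
    return products, unmatched
#+end_src

Discount lines that match nothing are returned separately instead of being dropped, which keeps the "leave blank rather than guess" rule intact.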

** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review.  Create catalog linked to normalized items.
 - Grouping the same item from retailer
 - Asking human to create a canonical/catalog item with:
   - friendly/canonical_name: "bell pepper"; "milk"
   - category: "produce"; "dairy"
   - product_type: "pepper"; "milk"
   - variant (?): "whole"; "skim"; "2pct"
 - (1) normalized items from each retailer
 - [1] review queue of items to be reviewed
 - [2] catalog (lookup table) of confirmed retailer_item and canonical_name
 - [3] canonical purchase list, pivot-ready
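
One way to picture the split between the catalog lookup and the review queue is sketched below. It assumes that approved links are keyed by `normalized_item_id` and that anything without an approved link goes to review; `split_for_review` is a hypothetical name and the queue entry carries only a few of the `review_queue.csv` fields.

#+begin_src python
def split_for_review(normalized_items, approved_links):
    """Separate items with an approved catalog link from items needing review.

    approved_links maps normalized_item_id -> catalog_id.
    Returns (linked_rows, review_queue_entries).
    """
    linked = []
    queue = {}
    for item in normalized_items:
        item_id = item["normalized_item_id"]
        catalog_id = approved_links.get(item_id)
        if catalog_id:
            linked.append({**item, "catalog_id": catalog_id})
            continue
        # One queue entry per unresolved item, with examples for the reviewer.
        entry = queue.setdefault(item_id, {
            "normalized_item_id": item_id,
            "raw_item_names": set(),
            "seen_count": 0,
            "status": "pending",
        })
        entry["raw_item_names"].add(item["item_name"])
        entry["seen_count"] += 1
    return linked, list(queue.values())
#+end_src

Because canonical names only enter through `approved_links`, raw strings never reach the catalog directly.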
   
** Unresolved Issues
1. need central script to orchestrate; metadata belongs there and nowhere else

** Symptoms
- `LIME` and `LIME . / .` appearing in canonical_catalog:
  - names must come from review-approved names, not raw strings


* Directory Layout
Use one top-level data root:
#+begin_example
main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
  <retailer-method>/
    raw/  # unmodified retailer payloads exactly as fetched
      <order_id>.json
    collected_items.csv # one row per retailer line item w/ retailer-native values
    collected_orders.csv # one row per receipt/visit, flattened from raw order data
    normalized_items.csv # parsed retailer-specific line items with normalized fields
  costco-web/ # sample
    raw/
      orders/
        history.json
        <order_id>.json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv #  Human review queue for unresolved matching/parsing cases.
    product_links.csv # Links from retailer-observed products to canonical products.
  catalog.csv  # Cross-retailer canonical product entities used for comparison.
  purchases.csv
#+end_example
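
The layout above could be resolved in one place so that individual scripts do not hard-code paths. The helper below is a sketch: `retailer_paths` and `DATA_ROOT` are hypothetical names, and only the per-retailer files listed in the layout are covered.

#+begin_src python
from pathlib import Path

DATA_ROOT = Path("data")  # assumed to be relative to the project root


def retailer_paths(retailer_method, root=DATA_ROOT):
    """Return the standard file locations for one <retailer-method> directory."""
    base = root / retailer_method
    return {
        "raw": base / "raw",
        "collected_items": base / "collected_items.csv",
        "collected_orders": base / "collected_orders.csv",
        "normalized_items": base / "normalized_items.csv",
    }


# Usage: look up where the normalize step for costco-web should write.
paths = retailer_paths("costco-web")
print(paths["normalized_items"])
#+end_src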

* Schemas
** `data/<retailer-method>/collected_items.csv`
One row per retailer line item.
| key                | definition                                 |
|--------------------+--------------------------------------------|
| `retailer` PK      | retailer slug                              |
| `order_id` PK      | retailer order id                          |
| `line_no`  PK      | stable line number within order export     |
| `order_date`       | copied from order when available           |
| `retailer_item_id` | retailer-native item id when available     |
| `pod_id`           | retailer pod/item id                       |
| `item_name`        | raw retailer item name                     |
| `upc`              | retailer UPC or PLU value                  |
| `category_id`      | retailer category id                       |
| `category`         | retailer category description              |
| `qty`              | retailer quantity field                    |
| `unit`             | retailer unit code such as `EA` or `LB`    |
| `unit_price`       | retailer unit price field                  |
| `line_total`       | retailer extended price field              |
| `picked_weight`    | retailer picked weight field               |
| `mvp_savings`      | retailer savings field                     |
| `reward_savings`   | retailer rewards savings field             |
| `coupon_savings`   | retailer coupon savings field              |
| `coupon_price`     | retailer coupon price field                |
| `image_url`        | raw retailer image url when present        |
| `raw_order_path`   | relative path to source order payload      |
| `is_discount_line` | retailer adjustment or discount-line flag  |
| `is_coupon_line`   | coupon-like line flag when distinguishable |

** `data/<retailer-method>/collected_orders.csv`
One row per order or visit.
| key                       | definition                                      |
|---------------------------+-------------------------------------------------|
| `retailer` PK             | retailer slug such as `giant`                   |
| `order_id` PK             | retailer order or visit id                      |
| `order_date`              | order date in `YYYY-MM-DD` when available       |
| `delivery_date`           | fulfillment date in `YYYY-MM-DD` when available |
| `service_type`            | retailer service type such as `INSTORE`         |
| `order_total`             | order total as provided by retailer             |
| `payment_method`          | retailer payment label                          |
| `total_item_count`        | total line count or item count from retailer    |
| `total_savings`           | total savings as provided by retailer           |
| `your_savings_total`      | savings field from retailer when present        |
| `coupons_discounts_total` | coupon/discount total from retailer             |
| `store_name`              | retailer store name                             |
| `store_number`            | retailer store number                           |
| `store_address1`          | street address                                  |
| `store_city`              | city                                            |
| `store_state`             | state or province                               |
| `store_zipcode`           | postal code                                     |
| `refund_order`            | retailer refund flag                            |
| `ebt_order`               | retailer EBT flag                               |
| `raw_history_path`        | relative path to source history payload         |
| `raw_order_path`          | relative path to source order payload           |

** `data/<retailer-method>/normalized_items.csv`
One row per retailer line item after deterministic parsing. Preserve raw
fields from `collected_items.csv` and add parsed fields plus retailer-level
identity needed before cross-retailer review.

| key                        | definition                                                       |
|----------------------------+------------------------------------------------------------------|
| `retailer` PK              | retailer slug                                                    |
| `order_id` PK              | retailer order id                                                |
| `line_no` PK               | line number within order                                         |
| `normalized_row_id`        | stable row key, typically `<retailer>:<order_id>:<line_no>`      |
| `normalized_item_id`       | stable retailer-level item identity after deterministic grouping |
| `normalization_basis`      | basis used to assign `normalized_item_id`                        |
| `retailer_item_id`         | retailer-native item id                                          |
| `item_name`                | raw retailer item name                                           |
| `item_name_norm`           | normalized retailer item name                                    |
| `brand_guess`              | parsed brand guess                                               |
| `variant`                  | parsed variant text                                              |
| `size_value`               | parsed numeric size value                                        |
| `size_unit`                | parsed size unit such as `oz`, `lb`, `fl_oz`                     |
| `pack_qty`                 | parsed pack or count guess                                       |
| `measure_type`             | `each`, `weight`, `volume`, `count`, or blank                    |
| `normalized_quantity`      | numeric comparison basis derived during normalization            |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank                 |
| `is_store_brand`           | store-brand guess                                                |
| `is_fee`                   | fee or non-product flag                                          |
| `is_discount_line`         | discount or adjustment-line flag                                 |
| `is_coupon_line`           | coupon-like line flag                                            |
| `matched_discount_amount`  | matched discount value carried onto purchased row when supported |
| `net_line_total`           | line total after matched discount when supported                 |
| `price_per_each`           | derived per-each price when supported                            |
| `price_per_each_basis`     | source basis for `price_per_each`                                |
| `price_per_count`          | derived per-count price when supported                           |
| `price_per_count_basis`    | source basis for `price_per_count`                               |
| `price_per_lb`             | derived per-pound price when supported                           |
| `price_per_lb_basis`       | source basis for `price_per_lb`                                  |
| `price_per_oz`             | derived per-ounce price when supported                           |
| `price_per_oz_basis`       | source basis for `price_per_oz`                                  |
| `image_url`                | best available retailer image url                                |
| `raw_order_path`           | relative path to source order payload                            |
| `parse_version`            | parser version string for reruns                                 |
| `parse_notes`              | optional non-fatal parser notes                                  |

Notes:
- `normalized_item_id` replaces the need for a core `observed_products.csv` layer.
- `normalization_basis` should be explicit values like `exact_upc`, `retailer_item_id`, `name_size_pack`, or `manual_retailer_alias`.
- Cross-retailer identity is still handled later in review/combine via `catalog.csv` and `product_links.csv`.
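
The two identity fields can be illustrated with a short sketch. The `normalized_row_id` format follows the table above; the id format for `normalized_item_id` and the priority order (UPC first, then `retailer_item_id`, then name/size/pack) are assumptions made for illustration, and `assign_item_identity` is a hypothetical name.

#+begin_src python
def normalized_row_id(retailer, order_id, line_no):
    """Build the row key in the documented <retailer>:<order_id>:<line_no> form."""
    return f"{retailer}:{order_id}:{line_no}"


def assign_item_identity(row):
    """Return (normalized_item_id, normalization_basis) for one parsed row.

    Both values are blank when nothing reliable is available.
    """
    retailer = row["retailer"]
    if row.get("upc"):
        return f"{retailer}:upc:{row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"{retailer}:item:{row['retailer_item_id']}", "retailer_item_id"
    name = row.get("item_name_norm", "")
    if name:
        parts = [
            name,
            str(row.get("size_value", "")),
            str(row.get("size_unit", "")),
            str(row.get("pack_qty", "")),
        ]
        return f"{retailer}:name:" + "|".join(parts), "name_size_pack"
    return "", ""
#+end_src

Recording the basis next to the id makes it possible to audit weaker groupings such as `name_size_pack` separately from exact matches.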

** `data/review/product_links.csv`
One row per observed-to-canonical relationship.
Cardinality: one catalog item to many normalized items.

| key               | definition                                  |
|-------------------+---------------------------------------------|
| `observed_id` PK  | retailer observed product id                |
| `catalog_id` PK   | linked canonical product id                 |
| `link_method`     | `manual`, `exact_upc`, `exact_name`, etc.   |
| `link_confidence` | optional confidence label                   |
| `review_status`   | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by`     | reviewer id or initials                     |
| `reviewed_at`     | review timestamp or date                    |
| `link_notes`      | optional notes                              |

** `data/review/review_queue.csv`
One row per issue needing human review.

| key                   | definition                                          |
|-----------------------+-----------------------------------------------------|
| `review_id` PK        | stable review row id                                |
| `queue_type`          | `observed_product`, `link_candidate`, `parse_issue` |
| `retailer`            | retailer slug when applicable                       |
| `observed_product_id` | observed product id when applicable                 |
| `catalog_id`          | candidate canonical id when applicable              |
| `reason_code`         | machine-readable review reason                      |
| `priority`            | optional priority label                             |
| `raw_item_names`      | compact list of example raw names                   |
| `normalized_names`    | compact list of example normalized names            |
| `upc`                 | example UPC/PLU                                     |
| `image_url`           | example image url                                   |
| `example_prices`      | compact list of example prices                      |
| `seen_count`          | count of related rows                               |
| `status`              | `pending`, `approved`, `rejected`, `deferred`       |
| `resolution_notes`    | reviewer notes                                      |
| `created_at`          | creation timestamp or date                          |
| `updated_at`          | last update timestamp or date                       |

** `data/catalog.csv`
One row per cross-retailer canonical product.
| key                        | definition                             |
|----------------------------+----------------------------------------|
| `catalog_id` PK            | stable canonical product id            |
| `catalog_name`             | canonical human-readable name          |
| `product_type`             | generic product e.g. `apple`, `milk`   |
| `category`                 | broad section e.g. `produce`, `dairy`  |
| `brand`                    | canonical brand when applicable        |
| `variant`                  | canonical variant                      |
| `size_value`               | normalized size value                  |
| `size_unit`                | normalized size unit                   |
| `pack_qty`                 | normalized pack/count                  |
| `measure_type`             | normalized measure type                |
| `normalized_quantity`      | numeric comparison basis value         |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes`                    | optional human notes                   |
| `created_at`               | creation timestamp or date             |
| `updated_at`               | last update timestamp or date          |

** `data/purchases.csv`
One row per purchased item (i.e., `row_type=item` from normalized layer), with
catalog attributes denormalized in and discounts already applied.

| key                        | definition                                                     |
|----------------------------+----------------------------------------------------------------|
| `purchase_date`            | date of purchase (from order)                                  |
| `retailer`                 | retailer slug                                                  |
| `order_id`                 | retailer order id                                              |
| `line_no`                  | line number within order                                       |
| `normalized_row_id`        | `<retailer>:<order_id>:<line_no>`                              |
| `normalized_item_id`       | retailer-level normalized item identity                        |
| `catalog_id`               | linked canonical product id                                    |
| `catalog_name`             | canonical product name for analysis                            |
| `catalog_product_type`     | broader product family (e.g., `egg`, `milk`)                   |
| `catalog_category`         | category such as `produce`, `dairy`                            |
| `catalog_brand`            | canonical brand when applicable                                |
| `catalog_variant`          | canonical variant when applicable                              |
| `raw_item_name`            | original retailer item name                                    |
| `normalized_item_name`     | cleaned/normalized retailer item name                          |
| `retailer_item_id`         | retailer-native item id                                        |
| `upc`                      | UPC/PLU when available                                         |
| `qty`                      | retailer quantity field                                        |
| `unit`                     | retailer unit (e.g., `EA`, `LB`)                               |
| `pack_qty`                 | parsed pack/count                                              |
| `size_value`               | parsed size value                                              |
| `size_unit`                | parsed size unit                                               |
| `measure_type`             | `each`, `weight`, `volume`, `count`                            |
| `normalized_quantity`      | normalized comparison quantity                                 |
| `normalized_quantity_unit` | unit for normalized quantity                                   |
| `unit_price`               | retailer unit price                                            |
| `line_total`               | original retailer extended price (pre-discount)                |
| `matched_discount_amount`  | discount amount matched from discount lines                    |
| `net_line_total`           | effective price after discount (`line_total` + discounts)      |
| `store_name`               | retailer store name                                            |
| `store_city`               | store city                                                     |
| `store_state`              | store state                                                    |
| `price_per_each`           | derived per-each price                                         |
| `price_per_each_basis`     | source basis for per-each calc                                 |
| `price_per_count`          | derived per-count price                                        |
| `price_per_count_basis`    | source basis for per-count calc                                |
| `price_per_lb`             | derived per-pound price                                        |
| `price_per_lb_basis`       | source basis for per-pound calc                                |
| `price_per_oz`             | derived per-ounce price                                        |
| `price_per_oz_basis`       | source basis for per-ounce calc                                |
| `is_fee`                   | true if row represents non-product fee                         |
| `raw_order_path`           | relative path to original order payload                        |

Notes:
- Only rows with `row_type=item` from normalization should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- catalog fields are denormalized in to make pivoting trivial.
- no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
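
A rough sketch of how purchase rows might be assembled from the other layers follows. It assumes flags are already parsed to booleans, that links are keyed by `normalized_item_id`, and that the catalog is a dict keyed by `catalog_id`; `build_purchases` is a hypothetical name and only a few catalog columns are shown.

#+begin_src python
def build_purchases(normalized_items, links, catalog):
    """Join normalized item rows to catalog attributes, skipping discount rows.

    links maps normalized_item_id -> catalog_id.
    catalog maps catalog_id -> dict of catalog.csv fields.
    """
    purchases = []
    for item in normalized_items:
        if item.get("is_discount_line") or item.get("is_coupon_line"):
            continue  # effects are carried via matched_discount_amount
        catalog_id = links.get(item["normalized_item_id"], "")
        entry = catalog.get(catalog_id, {})
        purchases.append({
            **item,
            "catalog_id": catalog_id,
            # Unlinked items keep blank catalog fields rather than guesses.
            "catalog_name": entry.get("catalog_name", ""),
            "catalog_product_type": entry.get("product_type", ""),
            "catalog_category": entry.get("category", ""),
        })
    return purchases
#+end_src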
