* Grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the
grocery pipeline.
Goals:
- Keep data gathering separate from analysis.
- Support multiple data-gathering methods.
- Provide a single layer for review and analysis.

** Design Rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific normalization.
- CSV schemas are stable and additive: new columns may be appended, but
   existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.
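
The additive-schema rule lends itself to a mechanical check. The following is a minimal sketch, not part of the current repo: `is_additive_change` is a hypothetical helper that compares an old and a new CSV header and accepts the change only if every existing column keeps its name and position. It cannot detect a column whose meaning was repurposed under the same name; that still needs human review.

```python
def is_additive_change(old_header, new_header):
    """Return True if new_header keeps all old columns in place and only appends."""
    old_header, new_header = list(old_header), list(new_header)
    return new_header[:len(old_header)] == old_header


old = ["retailer", "order_id", "line_no"]
appended = old + ["order_date"]
reordered = ["retailer", "line_no", "order_id"]

print(is_additive_change(old, appended))   # True: one column appended
print(is_additive_change(old, reordered))  # False: existing columns moved
```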

*** Retailer-specific data:
- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- comparison-ready normalized quantity basis fields
  
*** Review/Combined data:
- catalog of reviewed products
- links from normalized retailer items to catalog
- human review state for unresolved cases


* Pipeline
Each step can be run on its own as long as its inputs already exist.
Each retail provider script must produce deterministic line-item outputs, and
normalization may assign within-retailer product identity only when the
retailer itself provides strong evidence.
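
One way to make "run alone if the inputs exist" concrete is a precondition check before each step. This is a sketch under assumptions: `STEP_INPUTS` and `missing_inputs` are hypothetical names, and the file names are the ones listed under Directory Layout in this document.

```python
import tempfile
from pathlib import Path

# Files each step reads from data/<retailer-method>/ (names taken from this document).
STEP_INPUTS = {
    "normalize": ["collected_items.csv", "collected_orders.csv"],
    "review": ["normalized_items.csv"],
}


def missing_inputs(step, retailer_dir):
    """List the input files that `step` needs but that are absent from retailer_dir."""
    base = Path(retailer_dir)
    return [name for name in STEP_INPUTS[step] if not (base / name).is_file()]


# Demo in a throwaway directory: the review step is blocked until its input exists.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_inputs("review", tmp)
    (Path(tmp) / "normalized_items.csv").write_text("retailer,order_id,line_no\n")
    after = missing_inputs("review", tmp)

print(before)  # ['normalized_items.csv']
print(after)   # []
```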

Key: 
- (1) input
- [1] output
 
** 1. Collect
Get raw receipt/visit and item data from a retailer.
Scraping is specific to a retailer and method (e.g., Giant-Web and Giant-Scan).
Preserve the complete raw data with full fidelity.
Avoid interpretation beyond basic data flattening.
 - (1) source access (varies, e.g., header data or auth for API access)
 - [1] collected visits from each retailer
 - [2] collected items from each retailer
 - [3] any other raw data that supports [1] and [2], with an explicit source (eventually a receipt scan?)
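
"Basic data flattening" can be illustrated with a short sketch. The payload keys used here (`order_id`, `items`, `name`, `upc`) are placeholders, not a real retailer format, and `flatten_order` is a hypothetical helper. The point is that values are copied as-is and missing values stay blank.

```python
def flatten_order(retailer, payload, raw_order_path):
    """Turn one raw order payload into collected_items rows without interpreting values."""
    rows = []
    # "order_id", "items", "name", and "upc" are placeholder keys, not a real retailer format.
    for line_no, item in enumerate(payload.get("items", []), start=1):
        rows.append({
            "retailer": retailer,
            "order_id": str(payload["order_id"]),
            "line_no": line_no,                  # position within the export
            "item_name": item.get("name", ""),   # raw name, no cleanup at this stage
            "upc": item.get("upc", ""),          # missing values stay blank
            "raw_order_path": raw_order_path,
        })
    return rows


payload = {"order_id": 1001, "items": [{"name": "LIME", "upc": "4048"}, {"name": "BAG FEE"}]}
rows = flatten_order("giant", payload, "raw/1001.json")
print(rows[1])
```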
   
** 2. Normalize
Parse and extract structured facts from retailer-specific raw data
  to create a standardized item format for that retailer.
Strictly dependent on Collect method and output.
 - Extract quantity, size, pack, pricing, variant
 - Attach discount line items to product line items using `upc`/`retailer_item_id` and co-occurrence within the same order
 - Clean up naming to facilitate later matching
 - Assign retailer-level `normalized_item_id` only when evidence is deterministic
 - Never use fuzzy or semantic matching here
 - (1) collected items from each retailer
 - (2) collected visits from each retailer
 - [1] normalized items from each retailer
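
The "deterministic evidence only" rule can be sketched as a fixed precedence of exact keys. The id format (`<retailer>:upc:<value>` and so on) and the precedence order are assumptions made for illustration; the basis labels are the ones this document lists for `normalization_basis`. `assign_normalized_item_id` is a hypothetical helper.

```python
def assign_normalized_item_id(row):
    """Return (normalized_item_id, normalization_basis); both blank when evidence is weak."""
    retailer = row["retailer"]
    if row.get("upc"):
        return f"{retailer}:upc:{row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"{retailer}:item:{row['retailer_item_id']}", "exact_retailer_item_id"
    name, size, unit, pack = (
        row.get(key, "") for key in ("item_name_norm", "size_value", "size_unit", "pack_qty")
    )
    if name and size and unit:
        return f"{retailer}:name:{name}|{size}|{unit}|{pack}", "exact_name_size_pack"
    # No exact key available: leave identity blank instead of guessing.
    return "", ""


print(assign_normalized_item_id({"retailer": "giant", "upc": "4048"}))
print(assign_normalized_item_id({"retailer": "giant", "item_name_norm": "lime"}))
```

A name without size information falls through to the blank result, so such rows are left for review rather than grouped on a weak name match.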

** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product";
 match items across retailers using algo/logic and human review.
Create catalog linked to normalized retailer items.
 - Review operates on distinct `normalized_item_id` values, not individual purchase rows
 - Cross-retailer identity decisions happen only here
 - Ask a human to create a canonical/catalog item with:
   - friendly/catalog_name: "bell pepper"; "milk"
   - category: "produce"; "dairy"
   - product_type: "pepper"; "milk"
   - variant (?): "whole"; "skim"; "2pct"
 - Then link the group of items to that catalog item.
 - (1) normalized items from each retailer
 - [1] review queue of items to be reviewed
 - [2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
 - [3] purchase list of normalized items, pivot-ready
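
Because review works on distinct `normalized_item_id` values, the queue can be built by grouping purchase rows first. A minimal sketch, assuming rows are already loaded as dicts and that `linked_ids` holds the ids that already have a link; `build_review_candidates` is a hypothetical helper, and only a few of the `review_queue.csv` columns are filled in.

```python
from collections import defaultdict


def build_review_candidates(normalized_rows, linked_ids):
    """Group unlinked rows by normalized_item_id and emit one queue entry per id."""
    groups = defaultdict(list)
    for row in normalized_rows:
        item_id = row.get("normalized_item_id")
        if item_id and item_id not in linked_ids:
            groups[item_id].append(row)
    queue = []
    for item_id, rows in sorted(groups.items()):
        queue.append({
            "queue_type": "link_candidate",
            "retailer": rows[0]["retailer"],
            "normalized_item_id": item_id,
            "raw_item_names": "; ".join(sorted({r["item_name"] for r in rows})),
            "seen_count": len(rows),
            "status": "pending",
        })
    return queue


sample = [
    {"retailer": "giant", "normalized_item_id": "giant:upc:4048", "item_name": "LIME"},
    {"retailer": "giant", "normalized_item_id": "giant:upc:4048", "item_name": "LIME . / ."},
    {"retailer": "giant", "normalized_item_id": "giant:upc:1", "item_name": "MILK"},
    {"retailer": "giant", "normalized_item_id": "", "item_name": "BAG FEE"},
]
queue = build_review_candidates(sample, linked_ids={"giant:upc:1"})
print(queue)
```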
   
** Unresolved Issues
1. Need a central script to orchestrate the steps; metadata belongs there and nowhere else.
2. `LIME` and `LIME . / .` both appear in the catalog: catalog names must come from review-approved names, not raw strings.


* Directory Layout
Use one top-level data root:
#+begin_example
main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
  <retailer-method>/
    raw/  # unmodified retailer payloads exactly as fetched
      <order_id>.json
    collected_items.csv # one row per retailer line item w/ retailer-native values
    collected_orders.csv # one row per receipt/visit, flattened from raw order data
    normalized_items.csv # parsed retailer-specific line items with normalized fields
  costco-web/ # sample
    raw/
      orders/
        history.json
        <order_id>.json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv # Human review queue for unresolved matching/parsing cases.
    product_links.csv # Links from normalized retailer items to catalog items.
  catalog.csv  # Cross-retailer product catalog entities used for comparison.
  purchases.csv
#+end_example

Notes:
- The current repo still uses transitional root-level scripts and output folders.
- This layout is the target structure for the refactor, not a claim that migration is already complete.
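
To keep scripts from hard-coding paths while the migration is in progress, the target layout could be expressed in one place. This is a sketch only: `retailer_paths` and `shared_paths` are hypothetical helpers, and the file names are the ones shown in the layout above.

```python
from pathlib import Path


def retailer_paths(data_root, retailer, method):
    """Paths for one data/<retailer-method>/ folder."""
    base = Path(data_root) / f"{retailer}-{method}"
    return {
        "raw": base / "raw",
        "collected_items": base / "collected_items.csv",
        "collected_orders": base / "collected_orders.csv",
        "normalized_items": base / "normalized_items.csv",
    }


def shared_paths(data_root):
    """Paths for the cross-retailer review and analysis files."""
    root = Path(data_root)
    return {
        "review_queue": root / "review" / "review_queue.csv",
        "product_links": root / "review" / "product_links.csv",
        "catalog": root / "catalog.csv",
        "purchases": root / "purchases.csv",
    }


paths = retailer_paths("data", "costco", "web")
print(paths["normalized_items"].as_posix())  # data/costco-web/normalized_items.csv
```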

* Schemas
** `data/<retailer-method>/collected_items.csv`
One row per retailer line item.
| key                | definition                                 |
|--------------------+--------------------------------------------|
| `retailer` PK      | retailer slug                              |
| `order_id` PK      | retailer order id                          |
| `line_no` PK       | stable line number within order export     |
| `order_date`       | copied from order when available           |
| `retailer_item_id` | retailer-native item id when available     |
| `pod_id`           | retailer pod/item id                       |
| `item_name`        | raw retailer item name                     |
| `upc`              | retailer UPC or PLU value                  |
| `category_id`      | retailer category id                       |
| `category`         | retailer category description              |
| `qty`              | retailer quantity field                    |
| `unit`             | retailer unit code such as `EA` or `LB`    |
| `unit_price`       | retailer unit price field                  |
| `line_total`       | retailer extended price field              |
| `picked_weight`    | retailer picked weight field               |
| `mvp_savings`      | retailer savings field                     |
| `reward_savings`   | retailer rewards savings field             |
| `coupon_savings`   | retailer coupon savings field              |
| `coupon_price`     | retailer coupon price field                |
| `image_url`        | raw retailer image url when present        |
| `raw_order_path`   | relative path to source order payload      |
| `is_discount_line` | retailer adjustment or discount-line flag  |
| `is_coupon_line`   | coupon-like line flag when distinguishable |
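
The composite key marked PK above (`retailer`, `order_id`, `line_no`) can be validated after each collect run. A minimal sketch, assuming rows have been read with `csv.DictReader` so every value is a string; `duplicate_keys` is a hypothetical helper.

```python
from collections import Counter

PRIMARY_KEY = ("retailer", "order_id", "line_no")


def duplicate_keys(rows, key_fields=PRIMARY_KEY):
    """Return the key tuples that occur more than once, sorted for stable output."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


collected = [
    {"retailer": "giant", "order_id": "1001", "line_no": "1", "item_name": "LIME"},
    {"retailer": "giant", "order_id": "1001", "line_no": "2", "item_name": "MILK"},
    {"retailer": "giant", "order_id": "1001", "line_no": "2", "item_name": "MILK"},
]
print(duplicate_keys(collected))  # [('giant', '1001', '2')]
```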

** `data/<retailer-method>/collected_orders.csv`
One row per order/visit/receipt.
| key                       | definition                                      |
|---------------------------+-------------------------------------------------|
| `retailer` PK             | retailer slug such as `giant`                   |
| `order_id` PK             | retailer order or visit id                      |
| `order_date`              | order date in `YYYY-MM-DD` when available       |
| `delivery_date`           | fulfillment date in `YYYY-MM-DD` when available |
| `service_type`            | retailer service type such as `INSTORE`         |
| `order_total`             | order total as provided by retailer             |
| `payment_method`          | retailer payment label                          |
| `total_item_count`        | total line count or item count from retailer    |
| `total_savings`           | total savings as provided by retailer           |
| `your_savings_total`      | savings field from retailer when present        |
| `coupons_discounts_total` | coupon/discount total from retailer             |
| `store_name`              | retailer store name                             |
| `store_number`            | retailer store number                           |
| `store_address1`          | street address                                  |
| `store_city`              | city                                            |
| `store_state`             | state or province                               |
| `store_zipcode`           | postal code                                     |
| `refund_order`            | retailer refund flag                            |
| `ebt_order`               | retailer EBT flag                               |
| `raw_history_path`        | relative path to source history payload         |
| `raw_order_path`          | relative path to source order payload           |

** `data/<retailer-method>/normalized_items.csv`
One row per retailer line item after deterministic parsing. Preserve raw
fields from `collected_items.csv` and add parsed fields that make later review
and grouping easier. Normalization may assign retailer-level identity when the
evidence is deterministic and retailer-scoped.

| key                        | definition                                                       |
|----------------------------+------------------------------------------------------------------|
| `retailer` PK              | retailer slug                                                    |
| `order_id` PK              | retailer order id                                                |
| `line_no` PK               | line number within order                                         |
| `normalized_row_id`        | stable row key, typically `<retailer>:<order_id>:<line_no>`      |
| `normalized_item_id`       | stable retailer-level item identity when deterministic grouping is supported |
| `normalization_basis`      | basis used to assign `normalized_item_id`                        |
| `retailer_item_id`         | retailer-native item id                                          |
| `item_name`                | raw retailer item name                                           |
| `item_name_norm`           | normalized retailer item name                                    |
| `brand_guess`              | parsed brand guess                                               |
| `variant`                  | parsed variant text                                              |
| `size_value`               | parsed numeric size value                                        |
| `size_unit`                | parsed size unit such as `oz`, `lb`, `fl_oz`                     |
| `pack_qty`                 | parsed pack or count guess                                       |
| `measure_type`             | `each`, `weight`, `volume`, `count`, or blank                    |
| `normalized_quantity`      | numeric comparison basis derived during normalization            |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank                 |
| `is_item`                  | item flag                                                        |
| `is_store_brand`           | store-brand guess                                                |
| `is_fee`                   | fee or non-product flag                                          |
| `is_discount_line`         | discount or adjustment-line flag                                 |
| `is_coupon_line`           | coupon-like line flag                                            |
| `matched_discount_amount`  | matched discount value carried onto purchased row when supported |
| `net_line_total`           | line total after matched discount when supported                 |
| `price_per_each`           | derived per-each price when supported                            |
| `price_per_each_basis`     | source basis for `price_per_each`                                |
| `price_per_count`          | derived per-count price when supported                           |
| `price_per_count_basis`    | source basis for `price_per_count`                               |
| `price_per_lb`             | derived per-pound price when supported                           |
| `price_per_lb_basis`       | source basis for `price_per_lb`                                  |
| `price_per_oz`             | derived per-ounce price when supported                           |
| `price_per_oz_basis`       | source basis for `price_per_oz`                                  |
| `image_url`                | best available retailer image url                                |
| `raw_order_path`           | relative path to source order payload                            |
| `parse_version`            | parser version string for reruns                                 |
| `parse_notes`              | optional non-fatal parser notes                                  |
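
As an illustration of how a derived price column might be computed from the comparison basis, here is a sketch for `price_per_oz`. Which total to divide (gross or net) and the rounding to four decimal places are assumptions, not decisions recorded in this document; `price_per_oz` as a function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

OZ_PER_LB = Decimal("16")  # avoirdupois ounces per pound


def price_per_oz(total, normalized_quantity, normalized_quantity_unit):
    """Return the per-ounce price, or None when the basis is missing or not a weight."""
    if not normalized_quantity:
        return None  # leave unknown values blank rather than guessing
    quantity = Decimal(normalized_quantity)
    if normalized_quantity_unit == "lb":
        quantity *= OZ_PER_LB
    elif normalized_quantity_unit != "oz":
        return None
    if quantity <= 0:
        return None
    return (Decimal(total) / quantity).quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


print(price_per_oz("3.99", "16", "oz"))    # 0.2494
print(price_per_oz("4.00", "2", "lb"))     # 0.1250
print(price_per_oz("1.00", "3", "count"))  # None
```

`Decimal` is used instead of `float` so that currency values parsed from CSV text do not pick up binary rounding noise.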

Notes:
- `normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.
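
The discount-matching note can be sketched as follows. Assumptions, stated loudly: discount lines carry a negative `line_total`, a discount pairs with the first product row in the same order that shares its `upc` (or `retailer_item_id`), and `net_line_total` is `line_total` plus the signed discount. `match_key` and `attach_discounts` are hypothetical helpers.

```python
from decimal import Decimal


def match_key(row):
    """Key for pairing a discount line with a product line in the same order."""
    item_key = row.get("upc") or row.get("retailer_item_id")
    return (row["order_id"], item_key) if item_key else None


def attach_discounts(rows):
    """Carry discount-line amounts onto product rows; discount rows stay in place."""
    products = {}
    for row in rows:
        if not row["is_discount_line"]:
            row["matched_discount_amount"] = Decimal("0")
            key = match_key(row)
            if key is not None:
                products.setdefault(key, row)  # first product row with this key wins
    for row in rows:
        if row["is_discount_line"]:
            target = products.get(match_key(row))
            if target is not None:
                # Assumes discount amounts are negative, so adding reduces the total.
                target["matched_discount_amount"] += Decimal(row["line_total"])
    for row in rows:
        if not row["is_discount_line"]:
            row["net_line_total"] = Decimal(row["line_total"]) + row["matched_discount_amount"]
    return rows


order = attach_discounts([
    {"order_id": "1001", "upc": "123", "line_total": "4.99", "is_discount_line": False},
    {"order_id": "1001", "upc": "123", "line_total": "-1.00", "is_discount_line": True},
    {"order_id": "1001", "upc": "999", "line_total": "2.50", "is_discount_line": False},
])
print(order[0]["net_line_total"])  # 3.99
```

The discount row is kept in the output unchanged, which matches the auditability note above.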

** `data/review/product_links.csv`
One row per review-approved link from a normalized retailer item to a catalog item.
Many normalized retailer items may link to the same catalog item.

| key                     | definition                                  |
|-------------------------+---------------------------------------------|
| `normalized_item_id` PK | normalized retailer item id                 |
| `catalog_id` PK         | linked catalog product id                   |
| `link_method`           | `manual`, `exact_upc`, `exact_name_size`, etc. |
| `link_confidence`       | optional confidence label                   |
| `review_status`         | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by`           | reviewer id or initials                     |
| `reviewed_at`           | review timestamp or date                    |
| `link_notes`            | optional notes                              |

** `data/review/review_queue.csv`
One row per issue needing human review.

| key                  | definition                                          |
|----------------------+-----------------------------------------------------|
| `review_id` PK       | stable review row id                                |
| `queue_type`         | `link_candidate`, `parse_issue`, `catalog_cleanup`  |
| `retailer`           | retailer slug when applicable                       |
| `normalized_item_id` | normalized retailer item id when review is item-level |
| `normalized_row_id`  | normalized row id when review is row-specific       |
| `catalog_id`         | candidate catalog id                                |
| `reason_code`        | machine-readable review reason                      |
| `priority`           | optional priority label                             |
| `raw_item_names`     | compact list of example raw names                   |
| `normalized_names`   | compact list of example normalized names            |
| `upc`                | example UPC/PLU                                     |
| `image_url`          | example image url                                   |
| `example_prices`     | compact list of example prices                      |
| `seen_count`         | count of related rows                               |
| `status`             | `pending`, `approved`, `rejected`, `deferred`       |
| `resolution_notes`   | reviewer notes                                      |
| `created_at`         | creation timestamp or date                          |
| `updated_at`         | last update timestamp or date                       |

** `data/catalog.csv`
One row per cross-retailer catalog product.
| key                        | definition                             |
|----------------------------+----------------------------------------|
| `catalog_id` PK            | stable catalog product id              |
| `catalog_name`             | human-reviewed product name            |
| `product_type`             | generic product eg `apple`, `milk`     |
| `category`                 | broad section eg `produce`, `dairy`    |
| `brand`                    | canonical brand when applicable        |
| `variant`                  | canonical variant                      |
| `size_value`               | normalized size value                  |
| `size_unit`                | normalized size unit                   |
| `pack_qty`                 | normalized pack/count                  |
| `measure_type`             | normalized measure type                |
| `normalized_quantity`      | numeric comparison basis value         |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes`                    | optional human notes                   |
| `created_at`               | creation timestamp or date             |
| `updated_at`               | last update timestamp or date          |

Notes:
- Do not auto-create new catalog rows from weak normalized names alone.
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings.

** `data/purchases.csv`
One row per purchased item (i.e., rows where `is_item` is true in the normalized layer), with
catalog attributes denormalized in and discounts already applied.

| key                        | definition                                                     |
|----------------------------+----------------------------------------------------------------|
| `purchase_date`            | date of purchase (from order)                                  |
| `retailer`                 | retailer slug                                                  |
| `order_id`                 | retailer order id                                              |
| `line_no`                  | line number within order                                       |
| `normalized_row_id`        | `<retailer>:<order_id>:<line_no>`                              |
| `normalized_item_id`       | retailer-level normalized item identity                        |
| `catalog_id`               | linked catalog product id                                      |
| `catalog_name`             | catalog product name for analysis                              |
| `catalog_product_type`     | broader product family (e.g., `egg`, `milk`)                   |
| `catalog_category`         | category such as `produce`, `dairy`                            |
| `catalog_brand`            | canonical brand when applicable                                |
| `catalog_variant`          | canonical variant when applicable                              |
| `raw_item_name`            | original retailer item name                                    |
| `normalized_item_name`     | cleaned/normalized retailer item name                          |
| `retailer_item_id`         | retailer-native item id                                        |
| `upc`                      | UPC/PLU when available                                         |
| `qty`                      | retailer quantity field                                        |
| `unit`                     | retailer unit (e.g., `EA`, `LB`)                               |
| `pack_qty`                 | parsed pack/count                                              |
| `size_value`               | parsed size value                                              |
| `size_unit`                | parsed size unit                                               |
| `measure_type`             | `each`, `weight`, `volume`, `count`                            |
| `normalized_quantity`      | normalized comparison quantity                                 |
| `normalized_quantity_unit` | unit for normalized quantity                                   |
| `unit_price`               | retailer unit price                                            |
| `line_total`               | original retailer extended price (pre-discount)                |
| `matched_discount_amount`  | discount amount matched from discount lines                    |
| `net_line_total`           | effective price after discount (`line_total` + discounts)      |
| `store_name`               | retailer store name                                            |
| `store_city`               | store city                                                     |
| `store_state`              | store state                                                    |
| `price_per_each`           | derived per-each price                                         |
| `price_per_each_basis`     | source basis for per-each calc                                 |
| `price_per_count`          | derived per-count price                                        |
| `price_per_count_basis`    | source basis for per-count calc                                |
| `price_per_lb`             | derived per-pound price                                        |
| `price_per_lb_basis`       | source basis for per-pound calc                                |
| `price_per_oz`             | derived per-ounce price                                        |
| `price_per_oz_basis`       | source basis for per-ounce calc                                |
| `is_fee`                   | true if row represents non-product fee                         |
| `raw_order_path`           | relative path to original order payload                        |

Notes:
- Only rows that represent purchased items should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- Catalog fields are denormalized in to make pivoting trivial.
- No discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
- Review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
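
The fan-out described in these notes amounts to a filter followed by two lookups. A minimal sketch, assuming rows are already parsed into dicts with a boolean `is_item`, that only links with `review_status` of `approved` count, and that unlinked items keep blank catalog fields; `build_purchases` is a hypothetical helper and only a subset of the `purchases.csv` columns is shown.

```python
def build_purchases(normalized_rows, links, catalog):
    """Keep purchased items and attach catalog fields via approved links."""
    approved = {
        link["normalized_item_id"]: link["catalog_id"]
        for link in links
        if link["review_status"] == "approved"
    }
    catalog_by_id = {entry["catalog_id"]: entry for entry in catalog}
    purchases = []
    for row in normalized_rows:
        if not row["is_item"]:
            continue  # discount, coupon, and fee rows do not appear in purchases
        catalog_id = approved.get(row.get("normalized_item_id"), "")
        entry = catalog_by_id.get(catalog_id, {})
        purchases.append({
            "normalized_row_id": row["normalized_row_id"],
            "normalized_item_id": row.get("normalized_item_id", ""),
            "catalog_id": catalog_id,
            "catalog_name": entry.get("catalog_name", ""),
            "catalog_category": entry.get("category", ""),
            "net_line_total": row["net_line_total"],
        })
    return purchases


normalized = [
    {"normalized_row_id": "giant:1001:1", "normalized_item_id": "giant:upc:4048", "is_item": True, "net_line_total": "0.50"},
    {"normalized_row_id": "giant:1001:2", "normalized_item_id": "", "is_item": False, "net_line_total": "-0.10"},
    {"normalized_row_id": "giant:1002:1", "normalized_item_id": "giant:upc:4048", "is_item": True, "net_line_total": "0.55"},
    {"normalized_row_id": "giant:1002:2", "normalized_item_id": "giant:upc:7", "is_item": True, "net_line_total": "3.00"},
]
links = [
    {"normalized_item_id": "giant:upc:4048", "catalog_id": "cat-001", "review_status": "approved"},
    {"normalized_item_id": "giant:upc:7", "catalog_id": "cat-002", "review_status": "pending"},
]
catalog = [{"catalog_id": "cat-001", "catalog_name": "lime", "category": "produce"}]

purchases = build_purchases(normalized, links, catalog)
print([p["catalog_name"] for p in purchases])  # ['lime', 'lime', '']
```

One approved link covers both purchase rows that share `giant:upc:4048`, while the pending link leaves its row with blank catalog fields.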
