* Grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the grocery pipeline.

Goals:
- Ensure data gathering is separate from analysis
- Enable multiple data gathering methods
- One layer for review and analysis

** Design Rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific enrichment.
- CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.

*** Retailer-specific data:
- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- observed products scoped to one retailer

*** Review/Combined data:
- canonical products
- observed-to-canonical links
- human review state for unresolved cases
- comparison-ready normalized quantity basis fields

// I don't like this terminology - what is "observed" doing for us?
// output should be normalized_items, not observed
// unless this is the way we're matching multiple upc's?

Observed products are the boundary between retailer-specific parsing and cross-retailer canonicalization.
Nothing upstream of `products_observed.csv` should require knowledge of another retailer.

* Pipeline
Key:
- (1) input
- [2] output

Each step can be run alone if its dependencies exist.

** 1. Collect
Get raw receipt/visit and item data from a retailer. Scraping is unique to a retailer and method (e.g., Giant-Web and Giant-Scan). Preserve the complete raw data at full fidelity. Avoid interpretation beyond basic data flattening.

- (1) Source access (varies, e.g., header data, auth for API access)
- [1] collected visits from each retailer
- [2] collected items from each retailer
- [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)

** 2. Normalize
Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer. Strictly dependent on Collect method and output.

- Extract quantity, size, pack, pricing, variant
- Add discount line items to product line items using upc/retailer_item_id and concurrence
- Clean up naming to facilitate later matching

- (1) collected items from each retailer
- (2) collected visits from each retailer
- [1] normalized items from each retailer

** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review. Create a catalog linked to normalized items.

- Grouping the same item from each retailer
- Asking a human to create a canonical/catalog item with:
  - friendly/canonical_name: "bell pepper"; "milk"
  - category: "produce"; "dairy"
  - product_type: "pepper"; "milk"
  - ? variant? "whole", "skim", "2pct"

- (1) normalized items from each retailer
- [1] review queue of items to be reviewed
- [2] catalog (lookup table) of confirmed retailer_item and canonical_name
- [3] canonical purchase list, pivot-ready

** Unresolved Issues
1. need central script to orchestrate; metadata belongs there and nowhere else

** Symptoms
- `LIME` and `LIME . / .` appearing in canonical_catalog:
  - names must come from review-approved names, not raw strings

* Directory Layout
Use one top-level data root:

#+begin_example
main.py
collect__.py
normalize__.py
review.py
data/
  /
    raw/                    # unmodified retailer payloads exactly as fetched
    collected_items.csv     # one row per retailer line item w/ retailer-native values
    collected_orders.csv    # one row per receipt/visit, flattened from raw order data
    normalized_items.csv    # parsed retailer-specific line items with normalized fields
  costco-web/               # sample
    raw/
      orders/
        history.json
        .json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv        # Human review queue for unresolved matching/parsing cases.
    product_links.csv       # Links from retailer-observed products to canonical products.
  catalog.csv               # Cross-retailer canonical product entities used for comparison.
  purchases.csv
#+end_example

* Schemas

** `data//collected_items.csv`
One row per retailer line item.

| key                | definition                                 |
|--------------------+--------------------------------------------|
| `retailer` PK      | retailer slug                              |
| `order_id` PK      | retailer order id                          |
| `line_no` PK       | stable line number within order export     |
| `order_date`       | copied from order when available           |
| `retailer_item_id` | retailer-native item id when available     |
| `pod_id`           | retailer pod/item id                       |
| `item_name`        | raw retailer item name                     |
| `upc`              | retailer UPC or PLU value                  |
| `category_id`      | retailer category id                       |
| `category`         | retailer category description              |
| `qty`              | retailer quantity field                    |
| `unit`             | retailer unit code such as `EA` or `LB`    |
| `unit_price`       | retailer unit price field                  |
| `line_total`       | retailer extended price field              |
| `picked_weight`    | retailer picked weight field               |
| `mvp_savings`      | retailer savings field                     |
| `reward_savings`   | retailer rewards savings field             |
| `coupon_savings`   | retailer coupon savings field              |
| `coupon_price`     | retailer coupon price field                |
| `image_url`        | raw retailer image url when present        |
| `raw_order_path`   | relative path to source order payload      |
| `is_discount_line` | retailer adjustment or discount-line flag  |
| `is_coupon_line`   | coupon-like line flag when distinguishable |

** `data//collected_orders.csv`
One row per order or visit.

| key                       | definition                                      |
|---------------------------+-------------------------------------------------|
| `retailer` PK             | retailer slug such as `giant`                   |
| `order_id` PK             | retailer order or visit id                      |
| `order_date`              | order date in `YYYY-MM-DD` when available       |
| `delivery_date`           | fulfillment date in `YYYY-MM-DD` when available |
| `service_type`            | retailer service type such as `INSTORE`         |
| `order_total`             | order total as provided by retailer             |
| `payment_method`          | retailer payment label                          |
| `total_item_count`        | total line count or item count from retailer    |
| `total_savings`           | total savings as provided by retailer           |
| `your_savings_total`      | savings field from retailer when present        |
| `coupons_discounts_total` | coupon/discount total from retailer             |
| `store_name`              | retailer store name                             |
| `store_number`            | retailer store number                           |
| `store_address1`          | street address                                  |
| `store_city`              | city                                            |
| `store_state`             | state or province                               |
| `store_zipcode`           | postal code                                     |
| `refund_order`            | retailer refund flag                            |
| `ebt_order`               | retailer EBT flag                               |
| `raw_history_path`        | relative path to source history payload         |
| `raw_order_path`          | relative path to source order payload           |

** `data//normalized_items.csv`
One row per retailer line item after deterministic parsing. Preserve raw fields from `collected_items.csv` and add parsed fields plus retailer-level identity needed before cross-retailer review.
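
A minimal sketch of what deterministic parsing could look like for a single row, assuming the size is embedded in `item_name`. The function name `normalize_row` and the regular expression are hypothetical and not part of the pipeline; only the column names come from the schemas in this document. Raw fields are copied through untouched, and values that cannot be parsed are left blank rather than guessed.

#+begin_src python
import re

# Hypothetical pattern: a number followed by one of a few size units.
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(fl\s*oz|oz|lb)\b", re.IGNORECASE)


def normalize_row(collected):
    """Copy a collected row and add a few parsed fields (illustrative only)."""
    row = dict(collected)  # preserve every raw field as collected
    name = collected.get("item_name", "")
    # Lowercase and collapse runs of whitespace to ease later matching.
    row["item_name_norm"] = " ".join(name.lower().split())
    match = SIZE_RE.search(name)
    if match:
        row["size_value"] = match.group(1)
        # "FL OZ" becomes "fl_oz"; "OZ" and "LB" become "oz" and "lb".
        row["size_unit"] = re.sub(r"\s+", "_", match.group(2).lower())
    else:
        # Unknown values stay blank.
        row["size_value"] = ""
        row["size_unit"] = ""
    return row
#+end_src

Keeping values as strings mirrors how they would round-trip through a CSV file; numeric conversion is left to whichever step needs it.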

| key                        | definition                                                       |
|----------------------------+------------------------------------------------------------------|
| `retailer` PK              | retailer slug                                                    |
| `order_id` PK              | retailer order id                                                |
| `line_no` PK               | line number within order                                         |
| `normalized_row_id`        | stable row key, typically `::`                                   |
| `normalized_item_id`       | stable retailer-level item identity after deterministic grouping |
| `normalization_basis`      | basis used to assign `normalized_item_id`                        |
| `retailer_item_id`         | retailer-native item id                                          |
| `item_name`                | raw retailer item name                                           |
| `item_name_norm`           | normalized retailer item name                                    |
| `brand_guess`              | parsed brand guess                                               |
| `variant`                  | parsed variant text                                              |
| `size_value`               | parsed numeric size value                                        |
| `size_unit`                | parsed size unit such as `oz`, `lb`, `fl_oz`                     |
| `pack_qty`                 | parsed pack or count guess                                       |
| `measure_type`             | `each`, `weight`, `volume`, `count`, or blank                    |
| `normalized_quantity`      | numeric comparison basis derived during normalization            |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank                 |
| `is_store_brand`           | store-brand guess                                                |
| `is_fee`                   | fee or non-product flag                                          |
| `is_discount_line`         | discount or adjustment-line flag                                 |
| `is_coupon_line`           | coupon-like line flag                                            |
| `matched_discount_amount`  | matched discount value carried onto purchased row when supported |
| `net_line_total`           | line total after matched discount when supported                 |
| `price_per_each`           | derived per-each price when supported                            |
| `price_per_each_basis`     | source basis for `price_per_each`                                |
| `price_per_count`          | derived per-count price when supported                           |
| `price_per_count_basis`    | source basis for `price_per_count`                               |
| `price_per_lb`             | derived per-pound price when supported                           |
| `price_per_lb_basis`       | source basis for `price_per_lb`                                  |
| `price_per_oz`             | derived per-ounce price when supported                           |
| `price_per_oz_basis`       | source basis for `price_per_oz`                                  |
| `image_url`                | best available retailer image url                                |
| `raw_order_path`           | relative path to source order payload                            |
| `parse_version`            | parser version string for reruns                                 |
| `parse_notes`              | optional non-fatal parser notes                                  |

Notes:
- `normalized_item_id` replaces the need for a core `observed_products.csv` layer.
- `normalization_basis` should be explicit values like `exact_upc`, `retailer_item_id`, `name_size_pack`, or `manual_retailer_alias`.
- Cross-retailer identity is still handled later in review/combine via `catalog.csv` and `product_links.csv`.

** `data/review/product_links.csv`
One row per observed-to-canonical relationship. Cardinality: 1 (catalog_item) to many (normalized_items).

| key               | definition                                  |
|-------------------+---------------------------------------------|
| `observed_id` PK  | retailer observed product id                |
| `catalog_id` PK   | linked canonical product id                 |
| `link_method`     | `manual`, `exact_upc`, `exact_name`, etc.   |
| `link_confidence` | optional confidence label                   |
| `review_status`   | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by`     | reviewer id or initials                     |
| `reviewed_at`     | review timestamp or date                    |
| `link_notes`      | optional notes                              |

** `data/review/review_queue.csv`
One row per issue needing human review.

| key                   | definition                                          |
|-----------------------+-----------------------------------------------------|
| `review_id` PK        | stable review row id                                |
| `queue_type`          | `observed_product`, `link_candidate`, `parse_issue` |
| `retailer`            | retailer slug when applicable                       |
| `observed_product_id` | observed product id when applicable                 |
| `catalog_id`          | candidate canonical id when applicable              |
| `reason_code`         | machine-readable review reason                      |
| `priority`            | optional priority label                             |
| `raw_item_names`      | compact list of example raw names                   |
| `normalized_names`    | compact list of example normalized names            |
| `upc`                 | example UPC/PLU                                     |
| `image_url`           | example image url                                   |
| `example_prices`      | compact list of example prices                      |
| `seen_count`          | count of related rows                               |
| `status`              | `pending`, `approved`, `rejected`, `deferred`       |
| `resolution_notes`    | reviewer notes                                      |
| `created_at`          | creation timestamp or date                          |
| `updated_at`          | last update timestamp or date                       |

** `data/catalog.csv`
One row per cross-retailer canonical product.

| key                        | definition                             |
|----------------------------+----------------------------------------|
| `catalog_id` PK            | stable canonical product id            |
| `catalog_name`             | canonical human-readable name          |
| `product_type`             | generic product e.g. `apple`, `milk`   |
| `category`                 | broad section e.g. `produce`, `dairy`  |
| `brand`                    | canonical brand when applicable        |
| `variant`                  | canonical variant                      |
| `size_value`               | normalized size value                  |
| `size_unit`                | normalized size unit                   |
| `pack_qty`                 | normalized pack/count                  |
| `measure_type`             | normalized measure type                |
| `normalized_quantity`      | numeric comparison basis value         |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes`                    | optional human notes                   |
| `created_at`               | creation timestamp or date             |
| `updated_at`               | last update timestamp or date          |

** `data/purchases.csv`
One row per purchased item (i.e., `row_type=item` from normalized layer), with catalog attributes denormalized in and discounts already applied.
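
A minimal sketch of how such rows might be assembled, under stated assumptions: flags are stored as the string `true`, and a lookup from `normalized_item_id` to `catalog_id` has already been built from approved links. The function name `build_purchases` and both lookup arguments are hypothetical; the column names follow the schemas in this document, and only a subset of columns is carried for brevity.

#+begin_src python
def build_purchases(normalized_rows, catalog_id_by_item, catalog_by_id):
    """Return purchase rows with catalog fields copied in (illustrative only)."""
    purchases = []
    for row in normalized_rows:
        # Assumption: boolean flags are serialized as the string "true".
        # Discount and coupon lines are dropped; their effect is already
        # carried on the purchased row via matched_discount_amount.
        if row.get("is_discount_line") == "true" or row.get("is_coupon_line") == "true":
            continue
        catalog_id = catalog_id_by_item.get(row.get("normalized_item_id", ""), "")
        entry = catalog_by_id.get(catalog_id, {})
        purchases.append({
            "retailer": row.get("retailer", ""),
            "order_id": row.get("order_id", ""),
            "line_no": row.get("line_no", ""),
            "normalized_item_id": row.get("normalized_item_id", ""),
            # Unlinked items keep blank catalog fields rather than guesses.
            "catalog_id": catalog_id,
            "catalog_name": entry.get("catalog_name", ""),
            "catalog_product_type": entry.get("product_type", ""),
            "catalog_category": entry.get("category", ""),
            "line_total": row.get("line_total", ""),
            "matched_discount_amount": row.get("matched_discount_amount", ""),
            "net_line_total": row.get("net_line_total", ""),
        })
    return purchases
#+end_src

Copying the catalog fields onto each row trades some duplication for a file that can be pivoted without any further joins.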

| key                        | definition                                                     |
|----------------------------+----------------------------------------------------------------|
| `purchase_date`            | date of purchase (from order)                                  |
| `retailer`                 | retailer slug                                                  |
| `order_id`                 | retailer order id                                              |
| `line_no`                  | line number within order                                       |
| `normalized_row_id`        | `::`                                                           |
| `normalized_item_id`       | retailer-level normalized item identity                        |
| `catalog_id`               | linked canonical product id                                    |
| `catalog_name`             | canonical product name for analysis                            |
| `catalog_product_type`     | broader product family (e.g., `egg`, `milk`)                   |
| `catalog_category`         | category such as `produce`, `dairy`                            |
| `catalog_brand`            | canonical brand when applicable                                |
| `catalog_variant`          | canonical variant when applicable                              |
| `raw_item_name`            | original retailer item name                                    |
| `normalized_item_name`     | cleaned/normalized retailer item name                          |
| `retailer_item_id`         | retailer-native item id                                        |
| `upc`                      | UPC/PLU when available                                         |
| `qty`                      | retailer quantity field                                        |
| `unit`                     | retailer unit (e.g., `EA`, `LB`)                               |
| `pack_qty`                 | parsed pack/count                                              |
| `size_value`               | parsed size value                                              |
| `size_unit`                | parsed size unit                                               |
| `measure_type`             | `each`, `weight`, `volume`, `count`                            |
| `normalized_quantity`      | normalized comparison quantity                                 |
| `normalized_quantity_unit` | unit for normalized quantity                                   |
| `unit_price`               | retailer unit price                                            |
| `line_total`               | original retailer extended price (pre-discount)                |
| `matched_discount_amount`  | discount amount matched from discount lines                    |
| `net_line_total`           | effective price after discount (`line_total` + discounts)      |
| `store_name`               | retailer store name                                            |
| `store_city`               | store city                                                     |
| `store_state`              | store state                                                    |
| `price_per_each`           | derived per-each price                                         |
| `price_per_each_basis`     | source basis for per-each calc                                 |
| `price_per_count`          | derived per-count price                                        |
| `price_per_count_basis`    | source basis for per-count calc                                |
| `price_per_lb`             | derived per-pound price                                        |
| `price_per_lb_basis`       | source basis for per-pound calc                                |
| `price_per_oz`             | derived per-ounce price                                        |
| `price_per_oz_basis`       | source basis for per-ounce calc                                |
| `is_fee`                   | true if row represents non-product fee                         |
| `raw_order_path`           | relative path to original order payload                        |

Notes:
- Only rows with `row_type=item` from normalization should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- Catalog fields are denormalized in to make pivoting trivial.
- No discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.

* /