* Grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the grocery pipeline.

Goals:
- Ensure data gathering is separate from analysis
- Enable multiple data gathering methods
- One layer for review and analysis

** Design Rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific normalization.
- CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.

*** Retailer-specific data:
- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- comparison-ready normalized quantity basis fields

*** Review/Combined data:
- catalog of reviewed products
- links from normalized retailer items to catalog
- human review state for unresolved cases

* Pipeline

Each step can be run alone if its dependencies exist. Each retail provider script must produce deterministic line-item outputs, and normalization may assign within-retailer product identity only when the retailer itself provides strong evidence.

Key:
- (1) input
- [1] output

** 1. Collect
Get raw receipt/visit and item data from a retailer. Scraping is unique to a retailer and method (e.g., Giant-Web and Giant-Scan). Preserve complete raw data with full fidelity. Avoid interpretation beyond basic data flattening.

- (1) Source access (varies, e.g., header data, auth for API access)
- [1] collected visits from each retailer
- [2] collected items from each retailer
- [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)

** 2. Normalize
Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer. Strictly dependent on Collect method and output.
- Extract quantity, size, pack, pricing, variant
- Add discount line items to product line items using upc/`retailer_item_id` and concurrence
- Clean up naming to facilitate later matching
- Assign retailer-level `normalized_item_id` only when evidence is deterministic
- Never use fuzzy or semantic matching here
- (1) collected items from each retailer
- (2) collected visits from each retailer
- [1] normalized items from each retailer

** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review. Create a catalog linked to normalized retailer items.

- Review operates on distinct `normalized_item_id` values, not individual purchase rows
- Cross-retailer identity decisions happen only here
- Ask a human to create a canonical/catalog item with:
  - friendly/catalog_name: "bell pepper"; "milk"
  - category: "produce"; "dairy"
  - product_type: "pepper"; "milk"
  - variant (open question): "whole"; "skim"; "2pct"
- Then link the group of items to that catalog item.
- (1) normalized items from each retailer
- [1] review queue of items to be reviewed
- [2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
- [3] purchase list of normalized items, pivot-ready

** Unresolved Issues
1. need central script to orchestrate; metadata belongs there and nowhere else
2. `LIME` and `LIME . / .` appearing in the catalog: names must come from review-approved names, not raw strings

* Directory Layout
Use one top-level data root:

#+begin_example
main.py
collect__.py
normalize__.py
review.py
data/
  /
    raw/                    # unmodified retailer payloads exactly as fetched
    collected_items.csv     # one row per retailer line item w/ retailer-native values
    collected_orders.csv    # one row per receipt/visit, flattened from raw order data
    normalized_items.csv    # parsed retailer-specific line items with normalized fields
  costco-web/               # sample
    raw/
      orders/
        history.json
        .json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv        # Human review queue for unresolved matching/parsing cases.
    product_links.csv       # Links from normalized retailer items to catalog items.
  catalog.csv               # Cross-retailer product catalog entities used for comparison.
  purchases.csv
#+end_example

Notes:
- The current repo still uses transitional root-level scripts and output folders.
- This layout is the target structure for the refactor, not a claim that migration is already complete.

* Schemas

** `data//collected_items.csv`
One row per retailer line item.
| key                | definition |
|--------------------+--------------------------------------------|
| `retailer` PK      | retailer slug |
| `order_id` PK      | retailer order id |
| `line_no` PK       | stable line number within order export |
| `order_date`       | copied from order when available |
| `retailer_item_id` | retailer-native item id when available |
| `pod_id`           | retailer pod/item id |
| `item_name`        | raw retailer item name |
| `upc`              | retailer UPC or PLU value |
| `category_id`      | retailer category id |
| `category`         | retailer category description |
| `qty`              | retailer quantity field |
| `unit`             | retailer unit code such as `EA` or `LB` |
| `unit_price`       | retailer unit price field |
| `line_total`       | retailer extended price field |
| `picked_weight`    | retailer picked weight field |
| `mvp_savings`      | retailer savings field |
| `reward_savings`   | retailer rewards savings field |
| `coupon_savings`   | retailer coupon savings field |
| `coupon_price`     | retailer coupon price field |
| `image_url`        | raw retailer image url when present |
| `raw_order_path`   | relative path to source order payload |
| `is_discount_line` | retailer adjustment or discount-line flag |
| `is_coupon_line`   | coupon-like line flag when distinguishable |

** `data//collected_orders.csv`
One row per order/visit/receipt.
| key                       | definition |
|---------------------------+-------------------------------------------------|
| `retailer` PK             | retailer slug such as `giant` |
| `order_id` PK             | retailer order or visit id |
| `order_date`              | order date in `YYYY-MM-DD` when available |
| `delivery_date`           | fulfillment date in `YYYY-MM-DD` when available |
| `service_type`            | retailer service type such as `INSTORE` |
| `order_total`             | order total as provided by retailer |
| `payment_method`          | retailer payment label |
| `total_item_count`        | total line count or item count from retailer |
| `total_savings`           | total savings as provided by retailer |
| `your_savings_total`      | savings field from retailer when present |
| `coupons_discounts_total` | coupon/discount total from retailer |
| `store_name`              | retailer store name |
| `store_number`            | retailer store number |
| `store_address1`          | street address |
| `store_city`              | city |
| `store_state`             | state or province |
| `store_zipcode`           | postal code |
| `refund_order`            | retailer refund flag |
| `ebt_order`               | retailer EBT flag |
| `raw_history_path`        | relative path to source history payload |
| `raw_order_path`          | relative path to source order payload |

** `data//normalized_items.csv`
One row per retailer line item after deterministic parsing. Preserve raw fields from `collected_items.csv` and add parsed fields that make later review and grouping easier. Normalization may assign retailer-level identity when the evidence is deterministic and retailer-scoped.
| key                        | definition |
|----------------------------+------------------------------------------------------------------|
| `retailer` PK              | retailer slug |
| `order_id` PK              | retailer order id |
| `line_no` PK               | line number within order |
| `normalized_row_id`        | stable row key, typically `::` |
| `normalized_item_id`       | stable retailer-level item identity when deterministic grouping is supported |
| `normalization_basis`      | basis used to assign `normalized_item_id` |
| `retailer_item_id`         | retailer-native item id |
| `item_name`                | raw retailer item name |
| `item_name_norm`           | normalized retailer item name |
| `brand_guess`              | parsed brand guess |
| `variant`                  | parsed variant text |
| `size_value`               | parsed numeric size value |
| `size_unit`                | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty`                 | parsed pack or count guess |
| `measure_type`             | `each`, `weight`, `volume`, `count`, or blank |
| `normalized_quantity`      | numeric comparison basis derived during normalization |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank |
| `is_item`                  | item flag |
| `is_store_brand`           | store-brand guess |
| `is_fee`                   | fee or non-product flag |
| `is_discount_line`         | discount or adjustment-line flag |
| `is_coupon_line`           | coupon-like line flag |
| `matched_discount_amount`  | matched discount value carried onto purchased row when supported |
| `net_line_total`           | line total after matched discount when supported |
| `price_per_each`           | derived per-each price when supported |
| `price_per_each_basis`     | source basis for `price_per_each` |
| `price_per_count`          | derived per-count price when supported |
| `price_per_count_basis`    | source basis for `price_per_count` |
| `price_per_lb`             | derived per-pound price when supported |
| `price_per_lb_basis`       | source basis for `price_per_lb` |
| `price_per_oz`             | derived per-ounce price when supported |
| `price_per_oz_basis`       | source basis for `price_per_oz` |
| `image_url`                | best available retailer image url |
| `raw_order_path`           | relative path to source order payload |
| `parse_version`            | parser version string for reruns |
| `parse_notes`              | optional non-fatal parser notes |

Notes:
- `normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.

** `data/review/product_links.csv`
One row per review-approved link from a normalized retailer item to a catalog item. Many normalized retailer items may link to the same catalog item.

| key                     | definition |
|-------------------------+---------------------------------------------|
| `normalized_item_id` PK | normalized retailer item id |
| `catalog_id` PK         | linked catalog product id |
| `link_method`           | `manual`, `exact_upc`, `exact_name_size`, etc. |
| `link_confidence`       | optional confidence label |
| `review_status`         | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by`           | reviewer id or initials |
| `reviewed_at`           | review timestamp or date |
| `link_notes`            | optional notes |

** `data/review/review_queue.csv`
One row per issue needing human review.
| key                  | definition |
|----------------------+-----------------------------------------------------|
| `review_id` PK       | stable review row id |
| `queue_type`         | `link_candidate`, `parse_issue`, `catalog_cleanup` |
| `retailer`           | retailer slug when applicable |
| `normalized_item_id` | normalized retailer item id when review is item-level |
| `normalized_row_id`  | normalized row id when review is row-specific |
| `catalog_id`         | candidate canonical id |
| `reason_code`        | machine-readable review reason |
| `priority`           | optional priority label |
| `raw_item_names`     | compact list of example raw names |
| `normalized_names`   | compact list of example normalized names |
| `upc`                | example UPC/PLU |
| `image_url`          | example image url |
| `example_prices`     | compact list of example prices |
| `seen_count`         | count of related rows |
| `status`             | `pending`, `approved`, `rejected`, `deferred` |
| `resolution_notes`   | reviewer notes |
| `created_at`         | creation timestamp or date |
| `updated_at`         | last update timestamp or date |

** `data/catalog.csv`
One row per cross-retailer catalog product.

| key                        | definition |
|----------------------------+----------------------------------------|
| `catalog_id` PK            | stable catalog product id |
| `catalog_name`             | human-reviewed product name |
| `product_type`             | generic product, e.g. `apple`, `milk` |
| `category`                 | broad section, e.g. `produce`, `dairy` |
| `brand`                    | canonical brand when applicable |
| `variant`                  | canonical variant |
| `size_value`               | normalized size value |
| `size_unit`                | normalized size unit |
| `pack_qty`                 | normalized pack/count |
| `measure_type`             | normalized measure type |
| `normalized_quantity`      | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes`                    | optional human notes |
| `created_at`               | creation timestamp or date |
| `updated_at`               | last update timestamp or date |

Notes:
- Do not auto-create new catalog rows from weak normalized names alone.
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings.

** `data/purchases.csv`
One row per purchased item (i.e., rows with `is_item` == true in the normalized layer), with catalog attributes denormalized in and discounts already applied.

| key                        | definition |
|----------------------------+----------------------------------------------------------------|
| `purchase_date`            | date of purchase (from order) |
| `retailer`                 | retailer slug |
| `order_id`                 | retailer order id |
| `line_no`                  | line number within order |
| `normalized_row_id`        | `::` |
| `normalized_item_id`       | retailer-level normalized item identity |
| `catalog_id`               | linked catalog product id |
| `catalog_name`             | catalog product name for analysis |
| `catalog_product_type`     | broader product family (e.g., `egg`, `milk`) |
| `catalog_category`         | category such as `produce`, `dairy` |
| `catalog_brand`            | canonical brand when applicable |
| `catalog_variant`          | canonical variant when applicable |
| `raw_item_name`            | original retailer item name |
| `normalized_item_name`     | cleaned/normalized retailer item name |
| `retailer_item_id`         | retailer-native item id |
| `upc`                      | UPC/PLU when available |
| `qty`                      | retailer quantity field |
| `unit`                     | retailer unit (e.g., `EA`, `LB`) |
| `pack_qty`                 | parsed pack/count |
| `size_value`               | parsed size value |
| `size_unit`                | parsed size unit |
| `measure_type`             | `each`, `weight`, `volume`, `count` |
| `normalized_quantity`      | normalized comparison quantity |
| `normalized_quantity_unit` | unit for normalized quantity |
| `unit_price`               | retailer unit price |
| `line_total`               | original retailer extended price (pre-discount) |
| `matched_discount_amount`  | discount amount matched from discount lines |
| `net_line_total`           | effective price after discount (`line_total` + discounts) |
| `store_name`               | retailer store name |
| `store_city`               | store city |
| `store_state`              | store state |
| `price_per_each`           | derived per-each price |
| `price_per_each_basis`     | source basis for per-each calc |
| `price_per_count`          | derived per-count price |
| `price_per_count_basis`    | source basis for per-count calc |
| `price_per_lb`             | derived per-pound price |
| `price_per_lb_basis`       | source basis for per-pound calc |
| `price_per_oz`             | derived per-ounce price |
| `price_per_oz_basis`       | source basis for per-ounce calc |
| `is_fee`                   | true if row represents non-product fee |
| `raw_order_path`           | relative path to original order payload |

Notes:
- Only rows that represent purchased items should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- Catalog fields are denormalized in to make pivoting trivial.
- No discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
- Review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
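
Fanning a link decision out from `normalized_item_id` to every purchase row that shares it can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the pipeline's actual code: `build_purchases` is a hypothetical helper, rows are plain dicts keyed by the CSV column names, boolean flags are assumed to be stored as the strings `true`/`false`, and each `normalized_item_id` is assumed to have at most one approved link. Only a handful of the `purchases.csv` columns are carried through to keep the example short.

```python
def build_purchases(normalized_rows, link_rows, catalog_rows):
    """Attach catalog fields to purchased rows via approved product links.

    Hypothetical helper: each argument is a list of dicts keyed by the
    column names of normalized_items.csv, product_links.csv and catalog.csv.
    """
    catalog_by_id = {row["catalog_id"]: row for row in catalog_rows}

    # Assumption: at most one approved link per normalized_item_id.
    catalog_id_by_item = {
        link["normalized_item_id"]: link["catalog_id"]
        for link in link_rows
        if link.get("review_status") == "approved"
    }

    purchases = []
    for row in normalized_rows:
        # Assumption: boolean flags are stored as "true"/"false" strings.
        if row.get("is_item", "").strip().lower() != "true":
            continue  # rows not flagged as items (e.g. discount lines) are skipped

        item_id = row.get("normalized_item_id", "")
        catalog_id = catalog_id_by_item.get(item_id, "") if item_id else ""
        catalog = catalog_by_id.get(catalog_id, {})

        purchases.append({
            "retailer": row.get("retailer", ""),
            "order_id": row.get("order_id", ""),
            "line_no": row.get("line_no", ""),
            "normalized_item_id": item_id,
            # Unknown values stay blank rather than guessed.
            "catalog_id": catalog_id,
            "catalog_name": catalog.get("catalog_name", ""),
            "catalog_category": catalog.get("category", ""),
            "net_line_total": row.get("net_line_total", ""),
        })
    return purchases
```

Because the lookup is keyed on `normalized_item_id`, approving one link updates every purchase row with that id the next time the purchase list is rebuilt. Rows without an approved link keep blank catalog fields. The input lists could come from `csv.DictReader` over the three CSV files.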