22 KiB
Grocery data model and file layout
This document defines the shared file layout and stable CSV schemas for the grocery pipeline. Goals:
- Ensure data gathering is separate from analysis
- Enable multiple data gathering methods
- One layer for review and analysis
Design Rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific normalization.
- CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.
Retailer-specific data:
- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- comparison-ready normalized quantity basis fields
Review/Combined data:
- catalog of reviewed products
- links from normalized retailer items to catalog
- human review state for unresolved cases
Pipeline
Each step can be run alone if its dependents exist. Each retail provider script must produce deterministic line-item outputs, and normalization may assign within-retailer product identity only when the retailer itself provides strong evidence.
Key:
- (1) input
- [1] output
1. Collect
Get raw receipt/visit and item data from a retailer. Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan). Preserve complete raw data and preserve fidelity. Avoid interpretation beyond basic data flattening.
- (1) Source access (Varies, eg header data, auth for API access)
- [1] collected visits from each retailer
- [2] collected items from each retailer
- [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
2. Normalize
Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer. Strictly dependent on Collect method and output.
- Extract quantity, size, pack, pricing, variant
- Add discount line items to product line items using upc/retail_item_id and concurrence
- Cleanup naming to facilitate later matching
- Assign retailer-level `normalized_item_id` only when evidence is deterministic
- Never use fuzzy or semantic matching here
- (1) collected items from each retailer
- (2) collected visits from each retailer
- [1] normalized items from each retailer
3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review. Create catalog linked to normalized retailer items.
- Review operates on distinct `normalized_item_id` values, not individual purchase rows
- Cross-retailer identity decisions happen only here
-
Asking human to create a canonical/catalog item with:
- friendly/catalog_name: "bell pepper"; "milk"
- category: "produce"; "dairy"
- product_type: "pepper"; "milk"
- ? variant? "whole, "skim", "2pct"
- Then link the group of items to that catalog item.
- (1) normalized items from each retailer
- [1] review queue of items to be reviewed
- [2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
- [3] purchase list of normalized items , pivot-ready
Unresolved Issues
- need central script to orchestrate; metadata belongs there and nowhere else
- `LIME` and `LIME . / .` appearing in the catalog: names must come from review-approved names, not raw strings
Directory Layout
Use one top-level data root:
main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
<retailer-method>/
raw/ # unmodified retailer payloads exactly as fetched
<order_id.json>
collected_items.csv # one row per retailer line item w/ retailer-native values
collected_orders.csv # one row per receipt/visit, flattened from raw order data
normalized_items.csv # parsed retailer-specific line items with normalized fields
costco-web/ # sample
raw/
orders/
history.json
<order_id>.json
collected_items.csv
collected_orders.csv
normalized_items.csv
review/
review_queue.csv # Human review queue for unresolved matching/parsing cases.
product_links.csv # Links from normalized retailer items to catalog items.
catalog.csv # Cross-retailer product catalog entities used for comparison.
analysis/
purchases.csv
comparison_examples.csv
item_price_over_time.csv
spend_by_visit.csv
items_per_visit.csv
category_spend_over_time.csv
retailer_store_breakdown.csv
Notes:
- The current repo still uses transitional root-level scripts and output folders.
- This layout is the target structure for the refactor, not a claim that migration is already complete.
Schemas
`data/<retailer-method>/collected_items.csv`
One row per retailer line item.
| key | definition |
|---|---|
| `retailer` PK | retailer slug |
| `order_id` PK | retailer order id |
| `line_no` PK | stable line number within order export |
| `order_date` | copied from order when available |
| `retailer_item_id` | retailer-native item id when available |
| `pod_id` | retailer pod/item id |
| `item_name` | raw retailer item name |
| `upc` | retailer UPC or PLU value |
| `category_id` | retailer category id |
| `category` | retailer category description |
| `qty` | retailer quantity field |
| `unit` | retailer unit code such as `EA` or `LB` |
| `unit_price` | retailer unit price field |
| `line_total` | retailer extended price field |
| `picked_weight` | retailer picked weight field |
| `mvp_savings` | retailer savings field |
| `reward_savings` | retailer rewards savings field |
| `coupon_savings` | retailer coupon savings field |
| `coupon_price` | retailer coupon price field |
| `image_url` | raw retailer image url when present |
| `raw_order_path` | relative path to source order payload |
| `is_discount_line` | retailer adjustment or discount-line flag |
| `is_coupon_line` | coupon-like line flag when distinguishable |
`data/<retailer-method>/collected_orders.csv`
One row per order/visit/receipt.
| key | definition |
|---|---|
| `retailer` PK | retailer slug such as `giant` |
| `order_id` PK | retailer order or visit id |
| `order_date` | order date in `YYYY-MM-DD` when available |
| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
| `service_type` | retailer service type such as `INSTORE` |
| `order_total` | order total as provided by retailer |
| `payment_method` | retailer payment label |
| `total_item_count` | total line count or item count from retailer |
| `total_savings` | total savings as provided by retailer |
| `your_savings_total` | savings field from retailer when present |
| `coupons_discounts_total` | coupon/discount total from retailer |
| `store_name` | retailer store name |
| `store_number` | retailer store number |
| `store_address1` | street address |
| `store_city` | city |
| `store_state` | state or province |
| `store_zipcode` | postal code |
| `refund_order` | retailer refund flag |
| `ebt_order` | retailer EBT flag |
| `raw_history_path` | relative path to source history payload |
| `raw_order_path` | relative path to source order payload |
`data/<retailer-method>/normalized_items.csv`
One row per retailer line item after deterministic parsing. Preserve raw fields from `collected_items.csv` and add parsed fields that make later review and grouping easier. Normalization may assign retailer-level identity when the evidence is deterministic and retailer-scoped.
| key | definition |
|---|---|
| `retailer` PK | retailer slug |
| `order_id` PK | retailer order id |
| `line_no` PK | line number within order |
| `normalized_row_id` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
| `normalized_item_id` | stable retailer-level item identity when deterministic grouping is supported |
| `normalization_basis` | basis used to assign `normalized_item_id` |
| `retailer_item_id` | retailer-native item id |
| `item_name` | raw retailer item name |
| `item_name_norm` | normalized retailer item name |
| `brand_guess` | parsed brand guess |
| `variant` | parsed variant text |
| `size_value` | parsed numeric size value |
| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty` | parsed pack or count guess |
| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
| `normalized_quantity` | numeric comparison basis derived during normalization |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank |
| `is_item` | item flag |
| `is_store_brand` | store-brand guess |
| `is_fee` | fee or non-product flag |
| `is_discount_line` | discount or adjustment-line flag |
| `is_coupon_line` | coupon-like line flag |
| `matched_discount_amount` | matched discount value carried onto purchased row when supported |
| `net_line_total` | line total after matched discount when supported |
| `price_per_each` | derived per-each price when supported |
| `price_per_each_basis` | source basis for `price_per_each` |
| `price_per_count` | derived per-count price when supported |
| `price_per_count_basis` | source basis for `price_per_count` |
| `price_per_lb` | derived per-pound price when supported |
| `price_per_lb_basis` | source basis for `price_per_lb` |
| `price_per_oz` | derived per-ounce price when supported |
| `price_per_oz_basis` | source basis for `price_per_oz` |
| `image_url` | best available retailer image url |
| `raw_order_path` | relative path to source order payload |
| `parse_version` | parser version string for reruns |
| `parse_notes` | optional non-fatal parser notes |
Notes:
- `normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
- Cross-retailer identity is handled later in review/combine via `data/review/catalog.csv` and `product_links.csv`.
`data/review/product_links.csv`
One row per review-approved link from a normalized retailer item to a catalog item. Many normalized retailer items may link to the same catalog item.
| key | definition |
|---|---|
| `normalized_item_id` PK | normalized retailer item id |
| `catalog_id` PK | linked catalog product id |
| `link_method` | `manual`, `exact_upc`, `exact_name_size`, etc. |
| `link_confidence` | optional confidence label |
| `review_status` | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by` | reviewer id or initials |
| `reviewed_at` | review timestamp or date |
| `link_notes` | optional notes |
`data/review/review_queue.csv`
One row per issue needing human review.
| key | definition |
|---|---|
| `review_id` PK | stable review row id |
| `queue_type` | `link_candidate`, `parse_issue`, `catalog_cleanup` |
| `retailer` | retailer slug when applicable |
| `normalized_item_id` | normalized retailer item id when review is item-level |
| `normalized_row_id` | normalized row id when review is row-specific |
| `catalog_id` | candidate canonical id |
| `reason_code` | machine-readable review reason |
| `priority` | optional priority label |
| `raw_item_names` | compact list of example raw names |
| `normalized_names` | compact list of example normalized names |
| `upc` | example UPC/PLU |
| `image_url` | example image url |
| `example_prices` | compact list of example prices |
| `seen_count` | count of related rows |
| `status` | `pending`, `approved`, `rejected`, `deferred` |
| `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
`data/review/catalog.csv`
One row per cross-retailer catalog product.
| key | definition |
|---|---|
| `catalog_id` PK | stable catalog product id |
| `catalog_name` | human-reviewed product name |
| `product_type` | generic product eg `apple`, `milk` |
| `category` | broad section eg `produce`, `dairy` |
| `brand` | canonical brand when applicable |
| `variant` | canonical variant |
| `size_value` | normalized size value |
| `size_unit` | normalized size unit |
| `pack_qty` | normalized pack/count |
| `measure_type` | normalized measure type |
| `normalized_quantity` | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes` | optional human notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
Notes:
- Do not auto-create new catalog rows from weak normalized names alone.
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings.
`data/analysis/purchases.csv`
One row per purchased item (i.e., `is_item`==true from normalized layer), with catalog attributes denormalized in and discounts already applied.
| key | definition |
|---|---|
| `purchase_date` | date of purchase (from order) |
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | line number within order |
| `normalized_row_id` | `<retailer>:<order_id>:<line_no>` |
| `normalized_item_id` | retailer-level normalized item identity |
| `catalog_id` | linked catalog product id |
| `catalog_name` | catalog product name for analysis |
| `catalog_product_type` | broader product family (e.g., `egg`, `milk`) |
| `catalog_category` | category such as `produce`, `dairy` |
| `catalog_brand` | canonical brand when applicable |
| `catalog_variant` | canonical variant when applicable |
| `raw_item_name` | original retailer item name |
| `normalized_item_name` | cleaned/normalized retailer item name |
| `retailer_item_id` | retailer-native item id |
| `upc` | UPC/PLU when available |
| `qty` | retailer quantity field |
| `unit` | retailer unit (e.g., `EA`, `LB`) |
| `pack_qty` | parsed pack/count |
| `size_value` | parsed size value |
| `size_unit` | parsed size unit |
| `measure_type` | `each`, `weight`, `volume`, `count` |
| `normalized_quantity` | normalized comparison quantity |
| `normalized_quantity_unit` | unit for normalized quantity |
| `unit_price` | retailer unit price |
| `line_total` | original retailer extended price (pre-discount) |
| `matched_discount_amount` | discount amount matched from discount lines |
| `net_line_total` | effective price after discount (`line_total` + discounts) |
| `store_name` | retailer store name |
| `store_city` | store city |
| `store_state` | store state |
| `price_per_each` | derived per-each price |
| `price_per_each_basis` | source basis for per-each calc |
| `price_per_count` | derived per-count price |
| `price_per_count_basis` | source basis for per-count calc |
| `price_per_lb` | derived per-pound price |
| `price_per_lb_basis` | source basis for per-pound calc |
| `price_per_oz` | derived per-ounce price |
| `price_per_oz_basis` | source basis for per-ounce calc |
| `is_fee` | true if row represents non-product fee |
| `raw_order_path` | relative path to original order payload |
Notes:
- Only rows that represent purchased items should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- catalog fields are denormalized in to make pivoting trivial.
- no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
- review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
/
Normalized quantity is deterministic and conservative:
- if `qty * pack_qty * size_value` is available, use that total with `size_unit`
- else if count basis is explicit, use `qty * pack_qty` with unit `count`
- else if `measure_type` is `each`, use `qty each`
- else leave both fields blank
- no hidden unit conversion is applied inside normalization; values stay in their parsed units such as `oz`, `lb`, `qt`, or `count`