data-model refactor and prep scope

This commit is contained in:
ben
2026-03-18 13:08:28 -04:00
parent 9122821db1
commit 10aad05808
3 changed files with 538 additions and 267 deletions

View File

@@ -1,12 +1,13 @@
* grocery data model and file layout
* Grocery data model and file layout
This document defines the shared file layout and stable CSV schemas for the
grocery pipeline. The goal is to keep retailer-specific ingest separate from
cross-retailer product modeling so Giant-specific quirks do not become the
system of record.
** design rules
grocery pipeline.
Goals:
- Ensure data gathering is separate from analysis
- Enable multiple data gathering methods
- One layer for review and analysis
** Design Rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific enrichment.
@@ -14,296 +15,313 @@ system of record.
existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.
** directory layout
Use one top-level data root:
#+begin_example
data/
giant/
raw/
history.json
orders/
<order_id>.json
orders.csv
items_raw.csv
items_enriched.csv
products_observed.csv
costco/
raw/
...
orders.csv
items_raw.csv
items_enriched.csv
products_observed.csv
shared/
products_canonical.csv
product_links.csv
review_queue.csv
#+end_example
** layer responsibilities
- `data/<retailer>/raw/`
Stores unmodified retailer payloads exactly as fetched.
- `data/<retailer>/orders.csv`
One row per retailer order or visit, flattened from raw order data.
- `data/<retailer>/items_raw.csv`
One row per retailer line item, preserving retailer-native values needed for
reruns and debugging.
- `data/<retailer>/items_enriched.csv`
Parsed retailer line items with normalized fields and derived guesses, still
retailer-specific.
- `data/<retailer>/products_observed.csv`
Distinct retailer-facing observed products aggregated from enriched items.
- `data/shared/products_canonical.csv`
Cross-retailer canonical product entities used for comparison.
- `data/shared/product_links.csv`
Links from retailer observed products to canonical products.
- `data/shared/review_queue.csv`
Human review queue for unresolved or low-confidence matching/parsing cases.
** retailer-specific versus shared
Retailer-specific:
*** Retailer-specific data:
- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- parsed guesses derived from one retailer feed
- observed products scoped to one retailer
Shared:
*** Review/Combined data:
- canonical products
- observed-to-canonical links
- human review state for unresolved cases
- comparison-ready normalized quantity basis fields
// I don't like this terminology - what is "observed" doing for us?
// output should be normalized_items, not observed
// unless this is the way we're matching multiple upc's?
Observed products are the boundary between retailer-specific parsing and
cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
should require knowledge of another retailer.
** schema: `data/<retailer>/orders.csv`
* Pipeline
Key:
- (1) input
- [2] output
One row per order or visit.
Each step can be run alone if its dependents exist.
| column | meaning |
|-
| `retailer` | retailer slug such as `giant` |
| `order_id` | retailer order or visit id |
| `order_date` | order date in `YYYY-MM-DD` when available |
| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
| `service_type` | retailer service type such as `INSTORE` |
| `order_total` | order total as provided by retailer |
| `payment_method` | retailer payment label |
| `total_item_count` | total line count or item count from retailer |
| `total_savings` | total savings as provided by retailer |
| `your_savings_total` | savings field from retailer when present |
| `coupons_discounts_total` | coupon/discount total from retailer |
| `store_name` | retailer store name |
| `store_number` | retailer store number |
| `store_address1` | street address |
| `store_city` | city |
| `store_state` | state or province |
| `store_zipcode` | postal code |
| `refund_order` | retailer refund flag |
| `ebt_order` | retailer EBT flag |
| `raw_history_path` | relative path to source history payload |
| `raw_order_path` | relative path to source order payload |
** 1. Collect
Get raw receipt/visit and item data from a retailer. Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan). Preserve complete raw data and preserve fidelity. Avoid interpretation beyond basic data flattening.
- (1) Source access (Varies, eg header data, auth for API access)
- [1] collected visits from each retailer
- [2] collected items from each retailer
- [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
** 2. Normalize
Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer. Strictly dependent on Collect method and output.
- Extract quantity, size, pack, pricing, variant
- Add discount line items to product line items using upc/retail_item_id and concurrence
- Cleanup naming to facilitate later matching
- (1) collected items from each retailer
- (2) collected visits from each retailer
- [1] normalized items from each retailer
Primary key:
** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review. Create catalog linked to normalized items.
- Grouping the same item from retailer
- Asking human to create a canonical/catalog item with:
- friendly/canonical_name: "bell pepper"; "milk"
- category: "produce"; "dairy"
- product_type: "pepper"; "milk"
- ? variant? "whole, "skim", "2pct"
- (1) normalized items from each retailer
- [1] review queue of items to be reviewed
- [2] catalog (lookup table) of confirmed retailer_item and canonical_name
- [3] canonical purchase list, pivot-ready
** Unresolved Issues
1. need central script to orchestrate; metadata belongs there and nowhere else
- (`retailer`, `order_id`)
** Symptoms
- `LIME` and `LIME . / .` appearing in canonical_catalog:
- names must come from review-approved names, not raw strings
** schema: `data/<retailer>/items_raw.csv`
* Directory Layout
Use one top-level data root:
#+begin_example
main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
<retailer-method>/
raw/ # unmodified retailer payloads exactly as fetched
<order_id.json>
collected_items.csv # one row per retailer line item w/ retailer-native values
collected_orders.csv # one row per receipt/visit, flattened from raw order data
normalized_items.csv # parsed retailer-specific line items with normalized fields
costco-web/ # sample
raw/
orders/
history.json
<order_id>.json
collected_items.csv
collected_orders.csv
normalized_items.csv
review/
review_queue.csv # Human review queue for unresolved matching/parsing cases.
product_links.csv # Links from retailer-observed products to canonical products.
catalog.csv # Cross-retailer canonical product entities used for comparison.
purchases.csv
#+end_example
* Schemas
** `data/<retailer-method>/collected_items.csv`
One row per retailer line item.
| key | definition |
|--------------------+--------------------------------------------|
| `retailer` PK | retailer slug |
| `order_id` PK | retailer order id |
| `line_no` PK | stable line number within order export |
| `order_date` | copied from order when available |
| `retailer_item_id` | retailer-native item id when available |
| `pod_id` | retailer pod/item id |
| `item_name` | raw retailer item name |
| `upc` | retailer UPC or PLU value |
| `category_id` | retailer category id |
| `category` | retailer category description |
| `qty` | retailer quantity field |
| `unit` | retailer unit code such as `EA` or `LB` |
| `unit_price` | retailer unit price field |
| `line_total` | retailer extended price field |
| `picked_weight` | retailer picked weight field |
| `mvp_savings` | retailer savings field |
| `reward_savings` | retailer rewards savings field |
| `coupon_savings` | retailer coupon savings field |
| `coupon_price` | retailer coupon price field |
| `image_url` | raw retailer image url when present |
| `raw_order_path` | relative path to source order payload |
| `is_discount_line` | retailer adjustment or discount-line flag |
| `is_coupon_line` | coupon-like line flag when distinguishable |
| column | meaning |
|------------------+-----------------------------------------|
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | stable line number within order export |
| `order_date` | copied from order when available |
| `retailer_item_id` | retailer-native item id when available |
| `pod_id` | retailer pod/item id |
| `item_name` | raw retailer item name |
| `upc` | retailer UPC or PLU value |
| `category_id` | retailer category id |
| `category` | retailer category description |
| `qty` | retailer quantity field |
| `unit` | retailer unit code such as `EA` or `LB` |
| `unit_price` | retailer unit price field |
| `line_total` | retailer extended price field |
| `picked_weight` | retailer picked weight field |
| `mvp_savings` | retailer savings field |
| `reward_savings` | retailer rewards savings field |
| `coupon_savings` | retailer coupon savings field |
| `coupon_price` | retailer coupon price field |
| `image_url` | raw retailer image url when present |
| `raw_order_path` | relative path to source order payload |
| `is_discount_line` | retailer adjustment or discount-line flag |
| `is_coupon_line` | coupon-like line flag when distinguishable |
** `data/<retailer-method>/collected_orders.csv`
One row per order or visit.
| key | definition |
|---------------------------+-------------------------------------------------|
| `retailer` PK | retailer slug such as `giant` |
| `order_id` PK | retailer order or visit id |
| `order_date` | order date in `YYYY-MM-DD` when available |
| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
| `service_type` | retailer service type such as `INSTORE` |
| `order_total` | order total as provided by retailer |
| `payment_method` | retailer payment label |
| `total_item_count` | total line count or item count from retailer |
| `total_savings` | total savings as provided by retailer |
| `your_savings_total` | savings field from retailer when present |
| `coupons_discounts_total` | coupon/discount total from retailer |
| `store_name` | retailer store name |
| `store_number` | retailer store number |
| `store_address1` | street address |
| `store_city` | city |
| `store_state` | state or province |
| `store_zipcode` | postal code |
| `refund_order` | retailer refund flag |
| `ebt_order` | retailer EBT flag |
| `raw_history_path` | relative path to source history payload |
| `raw_order_path` | relative path to source order payload |
Primary key:
** `data/<retailer-method>/normalized_items.csv`
One row per retailer line item after deterministic parsing. Preserve raw
fields from `collected_items.csv` and add parsed fields plus retailer-level
identity needed before cross-retailer review.
- (`retailer`, `order_id`, `line_no`)
| key | definition |
|----------------------------+------------------------------------------------------------------|
| `retailer` PK | retailer slug |
| `order_id` PK | retailer order id |
| `line_no` PK | line number within order |
| `normalized_row_id` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
| `normalized_item_id` | stable retailer-level item identity after deterministic grouping |
| `normalization_basis` | basis used to assign `normalized_item_id` |
| `retailer_item_id` | retailer-native item id |
| `item_name` | raw retailer item name |
| `item_name_norm` | normalized retailer item name |
| `brand_guess` | parsed brand guess |
| `variant` | parsed variant text |
| `size_value` | parsed numeric size value |
| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty` | parsed pack or count guess |
| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
| `normalized_quantity` | numeric comparison basis derived during normalization |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank |
| `is_store_brand` | store-brand guess |
| `is_fee` | fee or non-product flag |
| `is_discount_line` | discount or adjustment-line flag |
| `is_coupon_line` | coupon-like line flag |
| `matched_discount_amount` | matched discount value carried onto purchased row when supported |
| `net_line_total` | line total after matched discount when supported |
| `price_per_each` | derived per-each price when supported |
| `price_per_each_basis` | source basis for `price_per_each` |
| `price_per_count` | derived per-count price when supported |
| `price_per_count_basis` | source basis for `price_per_count` |
| `price_per_lb` | derived per-pound price when supported |
| `price_per_lb_basis` | source basis for `price_per_lb` |
| `price_per_oz` | derived per-ounce price when supported |
| `price_per_oz_basis` | source basis for `price_per_oz` |
| `image_url` | best available retailer image url |
| `raw_order_path` | relative path to source order payload |
| `parse_version` | parser version string for reruns |
| `parse_notes` | optional non-fatal parser notes |
** schema: `data/<retailer>/items_enriched.csv`
One row per retailer line item after deterministic parsing. Preserve the raw
fields from `items_raw.csv` and add parsed fields.
| column | meaning |
|---------------------+-------------------------------------------------------------|
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | line number within order |
| `observed_item_key` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
| `retailer_item_id` | retailer-native item id |
| `item_name` | raw retailer item name |
| `item_name_norm` | normalized item name |
| `brand_guess` | parsed brand guess |
| `variant` | parsed variant text |
| `size_value` | parsed numeric size value |
| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty` | parsed pack or count guess |
| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
| `is_store_brand` | store-brand guess |
| `is_fee` | fee or non-product flag |
| `is_discount_line` | discount or adjustment-line flag |
| `is_coupon_line` | coupon-like line flag |
| `price_per_each` | derived per-each price when supported |
| `price_per_lb` | derived per-pound price when supported |
| `price_per_oz` | derived per-ounce price when supported |
| `image_url` | best available retailer image url |
| `parse_version` | parser version string for reruns |
| `parse_notes` | optional non-fatal parser notes |
Primary key:
- (`retailer`, `order_id`, `line_no`)
** schema: `data/<retailer>/products_observed.csv`
One row per distinct retailer-facing observed product.
| column | meaning |
|-------------------------------+----------------------------------------------------------------|
| `observed_product_id` | stable observed product id |
| `retailer` | retailer slug |
| `observed_key` | deterministic grouping key used to create the observed product |
| `representative_retailer_item_id` | best representative retailer-native item id |
| `representative_upc` | best representative UPC/PLU |
| `representative_item_name` | representative raw retailer name |
| `representative_name_norm` | representative normalized name |
| `representative_brand` | representative brand guess |
| `representative_variant` | representative variant |
| `representative_size_value` | representative size value |
| `representative_size_unit` | representative size unit |
| `representative_pack_qty` | representative pack/count |
| `representative_measure_type` | representative measure type |
| `representative_image_url` | representative image url |
| `is_store_brand` | representative store-brand flag |
| `is_fee` | representative fee flag |
| `is_discount_line` | representative discount-line flag |
| `is_coupon_line` | representative coupon-line flag |
| `first_seen_date` | first order date seen |
| `last_seen_date` | last order date seen |
| `times_seen` | number of enriched item rows grouped here |
| `example_order_id` | one example retailer order id |
| `example_item_name` | one example raw item name |
| `distinct_retailer_item_ids_count` | count of distinct retailer-native item ids |
Primary key:
- (`observed_product_id`)
** schema: `data/shared/products_canonical.csv`
One row per cross-retailer canonical product.
| column | meaning |
|----------------------------+--------------------------------------------------|
| `canonical_product_id` | stable canonical product id |
| `canonical_name` | canonical human-readable name |
| `product_type` | broad class such as `apple`, `milk`, `trash_bag` |
| `brand` | canonical brand when applicable |
| `variant` | canonical variant |
| `size_value` | normalized size value |
| `size_unit` | normalized size unit |
| `pack_qty` | normalized pack/count |
| `measure_type` | normalized measure type |
| `normalized_quantity` | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes` | optional human notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
Primary key:
- (`canonical_product_id`)
** schema: `data/shared/product_links.csv`
Notes:
- `normalized_item_id` replaces the need for a core `observed_products.csv` layer.
- `normalization_basis` should be explicit values like `exact_upc`, `retailer_item_id`, `name_size_pack`, or `manual_retailer_alias`.
- Cross-retailer identity is still handled later in review/combine via `catalog.csv` and `product_links.csv`.
** `data/review/product_links.csv`
One row per observed-to-canonical relationship.
1 (catalog_item) to many (normalized_items)
| column | meaning |
|-
| `observed_product_id` | retailer observed product id |
| `canonical_product_id` | linked canonical product id |
| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
| `link_confidence` | optional confidence label |
| `review_status` | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by` | reviewer id or initials |
| `reviewed_at` | review timestamp or date |
| `link_notes` | optional notes |
Primary key:
- (`observed_product_id`, `canonical_product_id`)
** schema: `data/shared/review_queue.csv`
| key | definition |
|-------------------+---------------------------------------------|
| `observed_id` PK | retailer observed product id |
| `catalog_id` PK | linked canonical product id |
| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
| `link_confidence` | optional confidence label |
| `review_status` | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by` | reviewer id or initials |
| `reviewed_at` | review timestamp or date |
| `link_notes` | optional notes |
** `data/review/review_queue.csv`
One row per issue needing human review.
| column | meaning |
|-
| `review_id` | stable review row id |
| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
| `retailer` | retailer slug when applicable |
| `observed_product_id` | observed product id when applicable |
| `canonical_product_id` | candidate canonical id when applicable |
| `reason_code` | machine-readable review reason |
| `priority` | optional priority label |
| `raw_item_names` | compact list of example raw names |
| `normalized_names` | compact list of example normalized names |
| `upc` | example UPC/PLU |
| `image_url` | example image url |
| `example_prices` | compact list of example prices |
| `seen_count` | count of related rows |
| `status` | `pending`, `approved`, `rejected`, `deferred` |
| `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
| key | definition |
|-----------------------+-----------------------------------------------------|
| `review_id` PK | stable review row id |
| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
| `retailer` | retailer slug when applicable |
| `observed_product_id` | observed product id when applicable |
| `catalod_id` | candidate canonical id when applicable |
| `reason_code` | machine-readable review reason |
| `priority` | optional priority label |
| `raw_item_names` | compact list of example raw names |
| `normalized_names` | compact list of example normalized names |
| `upc` | example UPC/PLU |
| `image_url` | example image url |
| `example_prices` | compact list of example prices |
| `seen_count` | count of related rows |
| `status` | `pending`, `approved`, `rejected`, `deferred` |
| `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
** `data/catalog.csv`
One row per cross-retailer canonical product.
| key | definition |
|----------------------------+----------------------------------------|
| `catalog_id` PK | stable canonical product id |
| `catalog_name` | canonical human-readable name |
| `product_type` | generic product eg `apple`, `milk` |
| `category` | broad section eg `produce`, `dairy` |
| `brand` | canonical brand when applicable |
| `variant` | canonical variant |
| `size_value` | normalized size value |
| `size_unit` | normalized size unit |
| `pack_qty` | normalized pack/count |
| `measure_type` | normalized measure type |
| `normalized_quantity` | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes` | optional human notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
Primary key:
** `data/purchases.csv`
One row per purchased item (i.e., `row_type=item` from normalized layer), with
catalog attributes denormalized in and discounts already applied.
- (`review_id`)
| key | definition |
|----------------------------+----------------------------------------------------------------|
| `purchase_date` | date of purchase (from order) |
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | line number within order |
| `normalized_row_id` | `<retailer>:<order_id>:<line_no>` |
| `normalized_item_id` | retailer-level normalized item identity |
| `catalog_id` | linked canonical product id |
| `catalog_name` | canonical product name for analysis |
| `catalog_product_type` | broader product family (e.g., `egg`, `milk`) |
| `catalog_category` | category such as `produce`, `dairy` |
| `catalog_brand` | canonical brand when applicable |
| `catalog_variant` | canonical variant when applicable |
| `raw_item_name` | original retailer item name |
| `normalized_item_name` | cleaned/normalized retailer item name |
| `retailer_item_id` | retailer-native item id |
| `upc` | UPC/PLU when available |
| `qty` | retailer quantity field |
| `unit` | retailer unit (e.g., `EA`, `LB`) |
| `pack_qty` | parsed pack/count |
| `size_value` | parsed size value |
| `size_unit` | parsed size unit |
| `measure_type` | `each`, `weight`, `volume`, `count` |
| `normalized_quantity` | normalized comparison quantity |
| `normalized_quantity_unit` | unit for normalized quantity |
| `unit_price` | retailer unit price |
| `line_total` | original retailer extended price (pre-discount) |
| `matched_discount_amount` | discount amount matched from discount lines |
| `net_line_total` | effective price after discount (`line_total` + discounts) |
| `store_name` | retailer store name |
| `store_city` | store city |
| `store_state` | store state |
| `price_per_each` | derived per-each price |
| `price_per_each_basis` | source basis for per-each calc |
| `price_per_count` | derived per-count price |
| `price_per_count_basis` | source basis for per-count calc |
| `price_per_lb` | derived per-pound price |
| `price_per_lb_basis` | source basis for per-pound calc |
| `price_per_oz` | derived per-ounce price |
| `price_per_oz_basis` | source basis for per-ounce calc |
| `is_fee` | true if row represents non-product fee |
| `raw_order_path` | relative path to original order payload |
** current giant mapping
Notes:
- Only rows with `row_type=item` from normalization should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- catalog fields are denormalized in to make pivoting trivial.
- no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
Current scraper outputs map to the new layout as follows:
- `giant_output/raw/history.json` -> `data/giant/raw/history.json`
- `giant_output/raw/<order_id>.json` -> `data/giant/raw/orders/<order_id>.json`
- `giant_output/orders.csv` -> `data/giant/orders.csv`
- `giant_output/items.csv` -> `data/giant/items_raw.csv`
Current Giant raw order payloads already expose fields needed for future
enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`,
`unitPrice`, `groceryAmount`, and `totalPickedWeight`.
* /