scrape-giant/pm/data-model.org
2026-03-21 21:50:10 -04:00

Grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the grocery pipeline. Goals:

  • Keep data gathering separate from analysis
  • Support multiple data-gathering methods
  • Provide a single layer for review and analysis

Design Rules

  • Raw retailer exports remain the source of truth.
  • Retailer parsing is isolated to retailer-specific files and ids.
  • Cross-retailer product layers begin only after retailer-specific normalization.
  • CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
  • Unknown values should be left blank rather than guessed.
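
The additive-schema rule can be checked mechanically. The following is a minimal sketch, not part of the existing pipeline: `is_additive_change` is a hypothetical helper that compares an old and a new CSV header and accepts the change only if every existing column keeps its name and position and new columns are appended at the end.

```python
def is_additive_change(old_header: list[str], new_header: list[str]) -> bool:
    """Return True if new_header starts with old_header unchanged.

    Hypothetical helper: it checks only column names and positions. It cannot
    detect a column whose meaning was repurposed while its name stayed the same.
    """
    return new_header[: len(old_header)] == old_header


# Appending a column is allowed; reordering or dropping columns is not.
old = ["retailer", "order_id", "line_no"]
print(is_additive_change(old, old + ["order_date"]))
print(is_additive_change(old, ["order_id", "retailer", "line_no"]))
```

A check like this could run before a script overwrites an existing CSV, so that a schema change which breaks downstream readers fails early.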

Retailer-specific data:

  • raw json payloads
  • retailer order ids
  • retailer line numbers
  • retailer category ids and names
  • retailer item names
  • retailer image urls
  • comparison-ready normalized quantity basis fields

Review/Combined data:

  • catalog of reviewed products
  • links from normalized retailer items to catalog
  • human review state for unresolved cases

Pipeline

Each step can be run on its own as long as its inputs (the outputs of earlier steps) already exist. Each retail provider script must produce deterministic line-item outputs, and normalization may assign within-retailer product identity only when the retailer itself provides strong evidence.

Key:

  • (1) input
  • [1] output

1. Collect

Get raw receipt/visit and item data from a retailer. Each collector is specific to one retailer and method (e.g., Giant-Web and Giant-Scan). Preserve the complete raw data with full fidelity, and avoid interpretation beyond basic data flattening.

  • (1) source access (varies, e.g., header data or auth for API access)
  • [1] collected visits from each retailer
  • [2] collected items from each retailer
  • [3] any other raw data that supports [1] and [2], with its source recorded explicitly (eventually receipt scans?)

2. Normalize

Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer. This step depends strictly on the Collect method and its output.

  • Extract quantity, size, pack, pricing, variant
  • Attach discount line items to the product line items they apply to, matching on upc/`retailer_item_id` and co-occurrence in the same order
  • Clean up naming to facilitate later matching
  • Assign retailer-level `normalized_item_id` only when evidence is deterministic
  • Never use fuzzy or semantic matching here
  • (1) collected items from each retailer
  • (2) collected visits from each retailer
  • [1] normalized items from each retailer
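
As an illustration of the discount-matching bullet above, here is a minimal sketch under stated assumptions. The function name `attach_discounts` is hypothetical. Rows are assumed to be dicts with `line_total` already parsed to a number and `is_discount_line` already parsed to a bool. Discount lines are assumed to carry negative `line_total` values, so the net amount is `line_total + matched_discount_amount`. The retailer's real exports may differ on any of these points.

```python
from collections import defaultdict


def attach_discounts(rows):
    """Attach each discount line to a product line in the same order.

    Matching is exact only: same order_id, and the same non-blank upc or
    retailer_item_id. Unmatched discounts are left untouched. Discount rows
    stay in the list for auditability.
    """
    by_order = defaultdict(list)
    for row in rows:
        by_order[row["order_id"]].append(row)

    for order_rows in by_order.values():
        products = [r for r in order_rows if not r["is_discount_line"]]
        for prod in products:
            prod["matched_discount_amount"] = 0.0

        for disc in (r for r in order_rows if r["is_discount_line"]):
            for prod in products:
                same_upc = disc["upc"] and disc["upc"] == prod["upc"]
                same_id = (
                    disc["retailer_item_id"]
                    and disc["retailer_item_id"] == prod["retailer_item_id"]
                )
                if same_upc or same_id:
                    # Assumption: discount amounts are negative numbers.
                    prod["matched_discount_amount"] += disc["line_total"]
                    break

        for prod in products:
            prod["net_line_total"] = round(
                prod["line_total"] + prod["matched_discount_amount"], 2
            )
    return rows
```

Because the match is restricted to rows within one order, a discount can never be applied to a purchase of the same product on a different visit.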

3. Review/Combine (Canonicalization)

Decide whether two normalized retailer items are "the same product": match items across retailers using algorithmic logic and human review, then create a catalog linked to the normalized retailer items.

  • Review operates on distinct `normalized_item_id` values, not individual purchase rows
  • Cross-retailer identity decisions happen only here
  • Ask a human to create a canonical/catalog item with:

    • friendly/catalog_name: "bell pepper"; "milk"
    • category: "produce"; "dairy"
    • product_type: "pepper"; "milk"
    • variant (open question): "whole"; "skim"; "2pct"
  • Then link the group of items to that catalog item.
  • (1) normalized items from each retailer
  • [1] review queue of items to be reviewed
  • [2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
  • [3] pivot-ready purchase list of normalized items
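
To show what "review operates on distinct `normalized_item_id` values" could look like, here is a small sketch. The function `build_review_queue` and the ` | ` separator for the compact name list are assumptions for illustration. The output field names follow the `review_queue.csv` schema in this document, and the sample ids in the usage lines are made up.

```python
def build_review_queue(normalized_rows, linked_ids):
    """Build one pending queue row per distinct, not-yet-linked normalized_item_id.

    normalized_rows: iterable of dicts with normalized_item_id and item_name.
    linked_ids: set of normalized_item_id values that already have a link.
    """
    groups = {}
    for row in normalized_rows:
        item_id = row["normalized_item_id"]
        if not item_id or item_id in linked_ids:
            continue  # rows without retailer-level identity are not queued here
        group = groups.setdefault(item_id, {"names": [], "count": 0})
        group["count"] += 1
        if row["item_name"] not in group["names"]:
            group["names"].append(row["item_name"])

    return [
        {
            "queue_type": "link_candidate",
            "normalized_item_id": item_id,
            "raw_item_names": " | ".join(group["names"]),
            "seen_count": group["count"],
            "status": "pending",
        }
        for item_id, group in sorted(groups.items())
    ]


rows = [
    {"normalized_item_id": "giant:upc:4048", "item_name": "LIME"},
    {"normalized_item_id": "giant:upc:4048", "item_name": "LIME . / ."},
    {"normalized_item_id": "giant:upc:4048", "item_name": "LIME"},
]
for entry in build_review_queue(rows, linked_ids=set()):
    print(entry["normalized_item_id"], entry["seen_count"], entry["raw_item_names"])
```

Three purchase rows collapse into a single review decision, which is the point of reviewing at the item level rather than the row level.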

Unresolved Issues

  1. A central script is needed to orchestrate the steps; metadata belongs there and nowhere else.
  2. `LIME` and `LIME . / .` both appear in the catalog; catalog names must come from review-approved names, not raw strings.

Directory Layout

Use one top-level data root:

main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
  <retailer-method>/
    raw/  # unmodified retailer payloads exactly as fetched
      <order_id>.json
    collected_items.csv # one row per retailer line item w/ retailer-native values
    collected_orders.csv # one row per receipt/visit, flattened from raw order data
    normalized_items.csv # parsed retailer-specific line items with normalized fields
  costco-web/ # sample
    raw/
      orders/
        history.json
        <order_id>.json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv # Human review queue for unresolved matching/parsing cases.
    product_links.csv # Links from normalized retailer items to catalog items.
  catalog.csv  # Cross-retailer product catalog entities used for comparison.
  purchases.csv

Notes:

  • The current repo still uses transitional root-level scripts and output folders.
  • This layout is the target structure for the refactor, not a claim that migration is already complete.
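
One way to keep every script agreeing on this target layout is to build paths in a single place. The sketch below is an assumption about how that might look, not existing code: `retailer_paths` and `shared_paths` are hypothetical helpers, and only the file and folder names come from the layout above.

```python
from pathlib import Path


def retailer_paths(data_root, retailer_method):
    """Return the per-retailer paths for a slug such as "giant-web"."""
    base = Path(data_root) / retailer_method
    return {
        "raw": base / "raw",
        "collected_items": base / "collected_items.csv",
        "collected_orders": base / "collected_orders.csv",
        "normalized_items": base / "normalized_items.csv",
    }


def shared_paths(data_root):
    """Return the cross-retailer review, catalog, and purchases paths."""
    root = Path(data_root)
    return {
        "review_queue": root / "review" / "review_queue.csv",
        "product_links": root / "review" / "product_links.csv",
        "catalog": root / "catalog.csv",
        "purchases": root / "purchases.csv",
    }


print(retailer_paths("data", "giant-web")["normalized_items"].as_posix())
```

Centralizing the paths this way means that a later layout change during the refactor touches one module instead of every collect and normalize script.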

Schemas

`data/<retailer-method>/collected_items.csv`

One row per retailer line item.

key definition
`retailer` PK retailer slug
`order_id` PK retailer order id
`line_no` PK stable line number within order export
`order_date` copied from order when available
`retailer_item_id` retailer-native item id when available
`pod_id` retailer pod/item id
`item_name` raw retailer item name
`upc` retailer UPC or PLU value
`category_id` retailer category id
`category` retailer category description
`qty` retailer quantity field
`unit` retailer unit code such as `EA` or `LB`
`unit_price` retailer unit price field
`line_total` retailer extended price field
`picked_weight` retailer picked weight field
`mvp_savings` retailer savings field
`reward_savings` retailer rewards savings field
`coupon_savings` retailer coupon savings field
`coupon_price` retailer coupon price field
`image_url` raw retailer image url when present
`raw_order_path` relative path to source order payload
`is_discount_line` retailer adjustment or discount-line flag
`is_coupon_line` coupon-like line flag when distinguishable

`data/<retailer-method>/collected_orders.csv`

One row per order/visit/receipt.

key definition
`retailer` PK retailer slug such as `giant`
`order_id` PK retailer order or visit id
`order_date` order date in `YYYY-MM-DD` when available
`delivery_date` fulfillment date in `YYYY-MM-DD` when available
`service_type` retailer service type such as `INSTORE`
`order_total` order total as provided by retailer
`payment_method` retailer payment label
`total_item_count` total line count or item count from retailer
`total_savings` total savings as provided by retailer
`your_savings_total` savings field from retailer when present
`coupons_discounts_total` coupon/discount total from retailer
`store_name` retailer store name
`store_number` retailer store number
`store_address1` street address
`store_city` city
`store_state` state or province
`store_zipcode` postal code
`refund_order` retailer refund flag
`ebt_order` retailer EBT flag
`raw_history_path` relative path to source history payload
`raw_order_path` relative path to source order payload

`data/<retailer-method>/normalized_items.csv`

One row per retailer line item after deterministic parsing. Preserve raw fields from `collected_items.csv` and add parsed fields that make later review and grouping easier. Normalization may assign retailer-level identity when the evidence is deterministic and retailer-scoped.

key definition
`retailer` PK retailer slug
`order_id` PK retailer order id
`line_no` PK line number within order
`normalized_row_id` stable row key, typically `<retailer>:<order_id>:<line_no>`
`normalized_item_id` stable retailer-level item identity when deterministic grouping is supported
`normalization_basis` basis used to assign `normalized_item_id`
`retailer_item_id` retailer-native item id
`item_name` raw retailer item name
`item_name_norm` normalized retailer item name
`brand_guess` parsed brand guess
`variant` parsed variant text
`size_value` parsed numeric size value
`size_unit` parsed size unit such as `oz`, `lb`, `fl_oz`
`pack_qty` parsed pack or count guess
`measure_type` `each`, `weight`, `volume`, `count`, or blank
`normalized_quantity` numeric comparison basis derived during normalization
`normalized_quantity_unit` basis unit such as `oz`, `lb`, `count`, or blank
`is_item` item flag
`is_store_brand` store-brand guess
`is_fee` fee or non-product flag
`is_discount_line` discount or adjustment-line flag
`is_coupon_line` coupon-like line flag
`matched_discount_amount` matched discount value carried onto purchased row when supported
`net_line_total` line total after matched discount when supported
`price_per_each` derived per-each price when supported
`price_per_each_basis` source basis for `price_per_each`
`price_per_count` derived per-count price when supported
`price_per_count_basis` source basis for `price_per_count`
`price_per_lb` derived per-pound price when supported
`price_per_lb_basis` source basis for `price_per_lb`
`price_per_oz` derived per-ounce price when supported
`price_per_oz_basis` source basis for `price_per_oz`
`image_url` best available retailer image url
`raw_order_path` relative path to source order payload
`parse_version` parser version string for reruns
`parse_notes` optional non-fatal parser notes

Notes:

  • `normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
  • Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
  • Do not use fuzzy or semantic matching to assign `normalized_item_id`.
  • Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
  • Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.
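
The two identifiers described in these notes can be sketched as follows. The `normalized_row_id` format comes from the schema above. Everything about `assign_item_identity` is an assumption made for illustration: the id string format, the precedence order (UPC, then retailer item id, then name/size/pack), and the requirement that name, size, unit, and pack all be present for `exact_name_size_pack`. Only the basis labels are taken from this document.

```python
def normalized_row_id(retailer, order_id, line_no):
    """Stable key for one purchase row: <retailer>:<order_id>:<line_no>."""
    return f"{retailer}:{order_id}:{line_no}"


def assign_item_identity(row):
    """Return (normalized_item_id, normalization_basis) from exact evidence only.

    Returns ("", "") when no deterministic evidence exists. No fuzzy or
    semantic matching is attempted. The id formats are hypothetical.
    """
    retailer = row["retailer"]
    if row.get("upc"):
        return f"{retailer}:upc:{row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"{retailer}:item:{row['retailer_item_id']}", "exact_retailer_item_id"

    name = row.get("item_name_norm")
    size = row.get("size_value")
    unit = row.get("size_unit")
    pack = row.get("pack_qty")
    if name and size and unit and pack:
        return f"{retailer}:name:{name}|{size}{unit}|{pack}", "exact_name_size_pack"
    return "", ""


print(normalized_row_id("giant", "123", 4))
print(assign_item_identity({"retailer": "giant", "upc": "4048"}))
print(assign_item_identity({"retailer": "giant", "item_name_norm": "LIME"}))
```

A row with only a normalized name gets no identity at all, which keeps weak evidence out of `normalized_item_id` and leaves it blank rather than guessed.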

`data/review/product_links.csv`

One row per review-approved link from a normalized retailer item to a catalog item. Many normalized retailer items may link to the same catalog item.

key definition
`normalized_item_id` PK normalized retailer item id
`catalog_id` PK linked catalog product id
`link_method` `manual`, `exact_upc`, `exact_name_size`, etc.
`link_confidence` optional confidence label
`review_status` `pending`, `approved`, `rejected`, or blank
`reviewed_by` reviewer id or initials
`reviewed_at` review timestamp or date
`link_notes` optional notes

`data/review/review_queue.csv`

One row per issue needing human review.

key definition
`review_id` PK stable review row id
`queue_type` `link_candidate`, `parse_issue`, `catalog_cleanup`
`retailer` retailer slug when applicable
`normalized_item_id` normalized retailer item id when review is item-level
`normalized_row_id` normalized row id when review is row-specific
`catalog_id` candidate canonical id
`reason_code` machine-readable review reason
`priority` optional priority label
`raw_item_names` compact list of example raw names
`normalized_names` compact list of example normalized names
`upc` example UPC/PLU
`image_url` example image url
`example_prices` compact list of example prices
`seen_count` count of related rows
`status` `pending`, `approved`, `rejected`, `deferred`
`resolution_notes` reviewer notes
`created_at` creation timestamp or date
`updated_at` last update timestamp or date

`data/catalog.csv`

One row per cross-retailer catalog product.

key definition
`catalog_id` PK stable catalog product id
`catalog_name` human-reviewed product name
`product_type` generic product, e.g., `apple`, `milk`
`category` broad section, e.g., `produce`, `dairy`
`brand` canonical brand when applicable
`variant` canonical variant
`size_value` normalized size value
`size_unit` normalized size unit
`pack_qty` normalized pack/count
`measure_type` normalized measure type
`normalized_quantity` numeric comparison basis value
`normalized_quantity_unit` basis unit such as `oz`, `lb`, `count`
`notes` optional human notes
`created_at` creation timestamp or date
`updated_at` last update timestamp or date

Notes:

  • Do not auto-create new catalog rows from weak normalized names alone.
  • Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
  • `catalog_name` should come from review-approved naming, not raw retailer strings.

`data/purchases.csv`

One row per purchased item (i.e., rows where `is_item` is true in the normalized layer), with catalog attributes denormalized in and discounts already applied.

key definition
`purchase_date` date of purchase (from order)
`retailer` retailer slug
`order_id` retailer order id
`line_no` line number within order
`normalized_row_id` `<retailer>:<order_id>:<line_no>`
`normalized_item_id` retailer-level normalized item identity
`catalog_id` linked catalog product id
`catalog_name` catalog product name for analysis
`catalog_product_type` broader product family (e.g., `egg`, `milk`)
`catalog_category` category such as `produce`, `dairy`
`catalog_brand` canonical brand when applicable
`catalog_variant` canonical variant when applicable
`raw_item_name` original retailer item name
`normalized_item_name` cleaned/normalized retailer item name
`retailer_item_id` retailer-native item id
`upc` UPC/PLU when available
`qty` retailer quantity field
`unit` retailer unit (e.g., `EA`, `LB`)
`pack_qty` parsed pack/count
`size_value` parsed size value
`size_unit` parsed size unit
`measure_type` `each`, `weight`, `volume`, `count`
`normalized_quantity` normalized comparison quantity
`normalized_quantity_unit` unit for normalized quantity
`unit_price` retailer unit price
`line_total` original retailer extended price (pre-discount)
`matched_discount_amount` discount amount matched from discount lines
`net_line_total` effective price after discount (`line_total` + discounts)
`store_name` retailer store name
`store_city` store city
`store_state` store state
`price_per_each` derived per-each price
`price_per_each_basis` source basis for per-each calc
`price_per_count` derived per-count price
`price_per_count_basis` source basis for per-count calc
`price_per_lb` derived per-pound price
`price_per_lb_basis` source basis for per-pound calc
`price_per_oz` derived per-ounce price
`price_per_oz_basis` source basis for per-ounce calc
`is_fee` true if row represents non-product fee
`raw_order_path` relative path to original order payload

Notes:

  • Only rows that represent purchased items should appear here.
  • `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
  • Catalog fields are denormalized in to make pivoting trivial.
  • No discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
  • Review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
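
The fan-out described in the last note can be sketched as a simple join. This is an illustrative sketch, not the project's implementation: `build_purchases` is a hypothetical name, `links` is assumed to be a dict built from approved `product_links.csv` rows, `catalog` a dict keyed by `catalog_id`, and `is_item` is assumed to be parsed to a bool already.

```python
def build_purchases(normalized_rows, links, catalog):
    """Keep purchased-item rows and denormalize catalog fields onto them.

    links: dict mapping normalized_item_id -> catalog_id (approved links only).
    catalog: dict mapping catalog_id -> catalog row dict.
    Unlinked rows are kept with blank catalog fields rather than guessed ones.
    """
    purchases = []
    for row in normalized_rows:
        if not row["is_item"]:
            continue  # discount, coupon, and other non-item rows are excluded
        catalog_id = links.get(row["normalized_item_id"], "")
        entry = catalog.get(catalog_id, {})
        purchases.append(
            {
                **row,
                "catalog_id": catalog_id,
                "catalog_name": entry.get("catalog_name", ""),
                "catalog_product_type": entry.get("product_type", ""),
                "catalog_category": entry.get("category", ""),
                "catalog_brand": entry.get("brand", ""),
                "catalog_variant": entry.get("variant", ""),
            }
        )
    return purchases
```

Because the link is keyed by `normalized_item_id`, one approved review decision reaches every purchase row that shares that id without any per-row work.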


Normalized quantity is deterministic and conservative:

  • if `qty * pack_qty * size_value` is available, use that total with `size_unit`
  • else if count basis is explicit, use `qty * pack_qty` with unit `count`
  • else if `measure_type` is `each`, use `qty` with unit `each`
  • else leave both fields blank
  • no hidden unit conversion is applied inside normalization; values stay in their parsed units such as `oz`, `lb`, `qt`, or `count`
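
The rule cascade above translates fairly directly into code. The sketch below is one reading of it, with assumptions called out: the function name `normalized_quantity` is hypothetical, "count basis is explicit" is interpreted as `measure_type == "count"`, a missing `pack_qty` is treated as unavailable rather than defaulted to 1, and inputs are assumed to be numbers or `None` already.

```python
def normalized_quantity(qty, pack_qty, size_value, size_unit, measure_type):
    """Return (normalized_quantity, normalized_quantity_unit).

    Both fields are blank strings when no rule applies. Values stay in their
    parsed units; no unit conversion happens here.
    """
    have_size = (
        qty is not None
        and pack_qty is not None
        and size_value is not None
        and bool(size_unit)
    )
    if have_size:
        return qty * pack_qty * size_value, size_unit
    # Assumption: an explicit count basis is signalled by measure_type "count".
    if measure_type == "count" and qty is not None and pack_qty is not None:
        return qty * pack_qty, "count"
    if measure_type == "each" and qty is not None:
        return qty, "each"
    return "", ""


# Two 6-packs of 12 oz cans stay in ounces: 2 * 6 * 12.0
print(normalized_quantity(2, 6, 12.0, "oz", "volume"))
# Three dozen eggs, counted: 3 * 12
print(normalized_quantity(3, 12, None, "", "count"))
# Nothing parseable: both fields stay blank
print(normalized_quantity(1, None, None, "", ""))
```

Returning blanks in the final branch follows the design rule that unknown values are left blank rather than guessed.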