scrape-giant/pm/data-model.org at 10aad058089b430b42b03837d4917dac8bcc06e1

ben 10aad05808 data-model refactor and prep scope

2026-03-18 13:08:28 -04:00

Grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the grocery pipeline. Goals:

Keep data gathering separate from analysis
Support multiple data gathering methods
Provide one layer for review and analysis

Design Rules

Raw retailer exports remain the source of truth.
Retailer parsing is isolated to retailer-specific files and ids.
Cross-retailer product layers begin only after retailer-specific enrichment.
CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
Unknown values should be left blank rather than guessed.
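
The additive-schema and blank-for-unknown rules can be checked mechanically. The sketch below is illustrative only: `is_additive_change` and `blank_if_unknown` are hypothetical helper names, and treating "appended" as "existing columns also keep their positions" is an assumption about how strictly the rule is meant.

```python
def is_additive_change(old_columns, new_columns):
    """Return True when new_columns keeps every old column in place and only appends."""
    if len(new_columns) < len(old_columns):
        return False
    # Existing columns must keep both their names and their positions.
    return list(new_columns[: len(old_columns)]) == list(old_columns)


def blank_if_unknown(value):
    """Serialize an unknown value as an empty CSV cell instead of a guess."""
    return "" if value is None else str(value)
```

A check like this could run before a step overwrites an existing CSV, comparing the header on disk with the header about to be written.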

Retailer-specific data:

raw json payloads
retailer order ids
retailer line numbers
retailer category ids and names
retailer item names
retailer image urls
observed products scoped to one retailer

Review/Combined data:

canonical products
observed-to-canonical links
human review state for unresolved cases
comparison-ready normalized quantity basis fields

Open question: I don't like this terminology. What is "observed" doing for us? The output should be normalized_items, not observed, unless this is the way we're matching multiple UPCs?

Observed products are the boundary between retailer-specific parsing and cross-retailer canonicalization. Nothing upstream of `products_observed.csv` should require knowledge of another retailer.

Pipeline

Key:

(n) input
[n] output

Each step can be run alone if its dependencies (inputs) exist.
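
One way to enforce this is to have each step check for its inputs before running. The following is a minimal sketch, assuming the per-retailer file names from the directory layout in this document; `STEP_INPUTS`, `missing_inputs`, and `can_run` are hypothetical names, not part of the existing scripts.

```python
from pathlib import Path

# Hypothetical mapping of step name -> input files inside one <retailer-method> directory.
STEP_INPUTS = {
    "collect": [],
    "normalize": ["collected_items.csv", "collected_orders.csv"],
}


def missing_inputs(step, retailer_dir):
    """List the input files a step needs that are not present yet."""
    base = Path(retailer_dir)
    return [name for name in STEP_INPUTS[step] if not (base / name).exists()]


def can_run(step, retailer_dir):
    """A step is runnable on its own once all of its inputs exist."""
    return not missing_inputs(step, retailer_dir)
```

Reporting the missing file names, rather than only a boolean, lets an orchestrating script tell the user which upstream step to run first.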

1. Collect

Get raw receipt/visit and item data from a retailer. Scraping is specific to a retailer and method (e.g., Giant-Web and Giant-Scan). Preserve the complete raw data with full fidelity. Avoid interpretation beyond basic data flattening.

(1) Source access (varies, e.g., header data, auth for API access)
[1] collected visits from each retailer
[2] collected items from each retailer
[3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)

2. Normalize

Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer. Strictly dependent on Collect method and output.

Extract quantity, size, pack, pricing, variant
Attach discount line items to their product line items using upc/retailer_item_id and co-occurrence in the same order
Clean up naming to facilitate later matching
(1) collected items from each retailer
(2) collected visits from each retailer
[1] normalized items from each retailer
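
The discount-matching part of this step could look roughly like the sketch below. It assumes rows are already-parsed dicts, that discount lines carry a negative `line_total`, and that a discount belongs to the first product line with the same `order_id` and `upc`. None of these assumptions is confirmed by the retailer exports, and `apply_discounts` is a hypothetical name.

```python
from collections import defaultdict
from decimal import Decimal


def apply_discounts(rows):
    """Fold discount lines into product lines sharing the same order_id and upc."""
    discounts = defaultdict(Decimal)
    for row in rows:
        if row["is_discount_line"] and row["upc"]:
            discounts[(row["order_id"], row["upc"])] += Decimal(row["line_total"])

    out = []
    claimed = set()
    for row in rows:
        if row["is_discount_line"]:
            continue  # discount lines are folded in, not emitted as items
        key = (row["order_id"], row["upc"])
        amount = Decimal("0")
        # Only the first matching product line receives the discount, so the
        # same discount is never counted twice within an order.
        if row["upc"] and key in discounts and key not in claimed:
            amount = discounts[key]
            claimed.add(key)
        out.append({
            **row,
            "matched_discount_amount": str(amount),
            "net_line_total": str(Decimal(row["line_total"]) + amount),
        })
    return out
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact.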

3. Review/Combine (Canonicalization)

Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review. Create catalog linked to normalized items.

Group the same item from each retailer
Ask a human to create a canonical/catalog item with:
- friendly/canonical_name: "bell pepper"; "milk"
- category: "produce"; "dairy"
- product_type: "pepper"; "milk"
- variant (?): "whole"; "skim"; "2pct"
(1) normalized items from each retailer
[1] review queue of items to be reviewed
[2] catalog (lookup table) of confirmed retailer_item and canonical_name
[3] canonical purchase list, pivot-ready
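
As a rough illustration of output [1], the review queue can be derived by collecting every normalized item that has no approved catalog link yet. This sketch assumes `normalized_item_id` is the grouping key and that approved links are available as a plain dict; `build_review_queue` is a hypothetical name.

```python
def build_review_queue(normalized_rows, approved_links):
    """Return one queue entry per normalized_item_id that has no approved link.

    approved_links maps normalized_item_id -> catalog_id (assumed shape).
    """
    entries = {}
    for row in normalized_rows:
        item_id = row["normalized_item_id"]
        if item_id in approved_links:
            continue  # already resolved, nothing for a human to review
        entry = entries.setdefault(item_id, {
            "normalized_item_id": item_id,
            "retailer": row["retailer"],
            "raw_item_names": [],
            "seen_count": 0,
            "status": "pending",
        })
        entry["seen_count"] += 1
        if row["item_name"] not in entry["raw_item_names"]:
            entry["raw_item_names"].append(row["item_name"])
    return list(entries.values())
```

Keeping the raw names as review examples only, and never as catalog names, matches the rule that canonical names come from review-approved values.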

Unresolved Issues

need central script to orchestrate; metadata belongs there and nowhere else

Symptoms

`LIME` and `LIME . / .` appearing in canonical_catalog:
- names must come from review-approved names, not raw strings

Directory Layout

Use one top-level data root:

main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
  <retailer-method>/
    raw/  # unmodified retailer payloads exactly as fetched
      <order_id>.json
    collected_items.csv # one row per retailer line item w/ retailer-native values
    collected_orders.csv # one row per receipt/visit, flattened from raw order data
    normalized_items.csv # parsed retailer-specific line items with normalized fields
  costco-web/ # sample
    raw/
      orders/
        history.json
        <order_id>.json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv #  Human review queue for unresolved matching/parsing cases.
    product_links.csv # Links from retailer-observed products to canonical products.
  catalog.csv  # Cross-retailer canonical product entities used for comparison.
  purchases.csv
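
To keep every script pointed at the same locations, the layout above can be expressed once as path helpers. This is a sketch only: `retailer_paths` and `shared_paths` are hypothetical names, and the default root of `data` is taken from the tree above.

```python
from pathlib import Path

DATA_ROOT = Path("data")


def retailer_paths(retailer, method, root=DATA_ROOT):
    """Paths for one <retailer-method> directory, e.g. costco-web."""
    base = Path(root) / f"{retailer}-{method}"
    return {
        "raw": base / "raw",
        "collected_items": base / "collected_items.csv",
        "collected_orders": base / "collected_orders.csv",
        "normalized_items": base / "normalized_items.csv",
    }


def shared_paths(root=DATA_ROOT):
    """Paths for the cross-retailer review and analysis layer."""
    root = Path(root)
    return {
        "review_queue": root / "review" / "review_queue.csv",
        "product_links": root / "review" / "product_links.csv",
        "catalog": root / "catalog.csv",
        "purchases": root / "purchases.csv",
    }
```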

Schemas

`data/<retailer-method>/collected_items.csv`

One row per retailer line item.

key	definition
`retailer` PK	retailer slug
`order_id` PK	retailer order id
`line_no` PK	stable line number within order export
`order_date`	copied from order when available
`retailer_item_id`	retailer-native item id when available
`pod_id`	retailer pod/item id
`item_name`	raw retailer item name
`upc`	retailer UPC or PLU value
`category_id`	retailer category id
`category`	retailer category description
`qty`	retailer quantity field
`unit`	retailer unit code such as `EA` or `LB`
`unit_price`	retailer unit price field
`line_total`	retailer extended price field
`picked_weight`	retailer picked weight field
`mvp_savings`	retailer savings field
`reward_savings`	retailer rewards savings field
`coupon_savings`	retailer coupon savings field
`coupon_price`	retailer coupon price field
`image_url`	raw retailer image url when present
`raw_order_path`	relative path to source order payload
`is_discount_line`	retailer adjustment or discount-line flag
`is_coupon_line`	coupon-like line flag when distinguishable

`data/<retailer-method>/collected_orders.csv`

One row per order or visit.

key	definition
`retailer` PK	retailer slug such as `giant`
`order_id` PK	retailer order or visit id
`order_date`	order date in `YYYY-MM-DD` when available
`delivery_date`	fulfillment date in `YYYY-MM-DD` when available
`service_type`	retailer service type such as `INSTORE`
`order_total`	order total as provided by retailer
`payment_method`	retailer payment label
`total_item_count`	total line count or item count from retailer
`total_savings`	total savings as provided by retailer
`your_savings_total`	savings field from retailer when present
`coupons_discounts_total`	coupon/discount total from retailer
`store_name`	retailer store name
`store_number`	retailer store number
`store_address1`	street address
`store_city`	city
`store_state`	state or province
`store_zipcode`	postal code
`refund_order`	retailer refund flag
`ebt_order`	retailer EBT flag
`raw_history_path`	relative path to source history payload
`raw_order_path`	relative path to source order payload

`data/<retailer-method>/normalized_items.csv`

One row per retailer line item after deterministic parsing. Preserve raw fields from `collected_items.csv` and add parsed fields plus retailer-level identity needed before cross-retailer review.

key	definition
`retailer` PK	retailer slug
`order_id` PK	retailer order id
`line_no` PK	line number within order
`normalized_row_id`	stable row key, typically `<retailer>:<order_id>:<line_no>`
`normalized_item_id`	stable retailer-level item identity after deterministic grouping
`normalization_basis`	basis used to assign `normalized_item_id`
`retailer_item_id`	retailer-native item id
`item_name`	raw retailer item name
`item_name_norm`	normalized retailer item name
`brand_guess`	parsed brand guess
`variant`	parsed variant text
`size_value`	parsed numeric size value
`size_unit`	parsed size unit such as `oz`, `lb`, `fl_oz`
`pack_qty`	parsed pack or count guess
`measure_type`	`each`, `weight`, `volume`, `count`, or blank
`normalized_quantity`	numeric comparison basis derived during normalization
`normalized_quantity_unit`	basis unit such as `oz`, `lb`, `count`, or blank
`is_store_brand`	store-brand guess
`is_fee`	fee or non-product flag
`is_discount_line`	discount or adjustment-line flag
`is_coupon_line`	coupon-like line flag
`matched_discount_amount`	matched discount value carried onto purchased row when supported
`net_line_total`	line total after matched discount when supported
`price_per_each`	derived per-each price when supported
`price_per_each_basis`	source basis for `price_per_each`
`price_per_count`	derived per-count price when supported
`price_per_count_basis`	source basis for `price_per_count`
`price_per_lb`	derived per-pound price when supported
`price_per_lb_basis`	source basis for `price_per_lb`
`price_per_oz`	derived per-ounce price when supported
`price_per_oz_basis`	source basis for `price_per_oz`
`image_url`	best available retailer image url
`raw_order_path`	relative path to source order payload
`parse_version`	parser version string for reruns
`parse_notes`	optional non-fatal parser notes
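
The derived price columns can be computed from the net total and the comparison basis. The sketch below assumes `normalized_quantity` is the total quantity for the whole line and rounds to four decimal places. Both are assumptions, `unit_prices` is a hypothetical name, and the `_basis` columns are left out for brevity.

```python
from decimal import Decimal

OZ_PER_LB = Decimal("16")
FOUR_PLACES = Decimal("0.0001")


def unit_prices(net_line_total, normalized_quantity, normalized_quantity_unit):
    """Derive per-unit prices, leaving unsupported columns blank."""
    blank = {"price_per_oz": "", "price_per_lb": "", "price_per_count": ""}
    if not net_line_total or not normalized_quantity:
        return blank
    total = Decimal(net_line_total)
    quantity = Decimal(normalized_quantity)
    if quantity <= 0:
        return blank
    per_unit = total / quantity
    if normalized_quantity_unit == "oz":
        return {**blank,
                "price_per_oz": str(per_unit.quantize(FOUR_PLACES)),
                "price_per_lb": str((per_unit * OZ_PER_LB).quantize(FOUR_PLACES))}
    if normalized_quantity_unit == "lb":
        return {**blank,
                "price_per_lb": str(per_unit.quantize(FOUR_PLACES)),
                "price_per_oz": str((per_unit / OZ_PER_LB).quantize(FOUR_PLACES))}
    if normalized_quantity_unit == "count":
        return {**blank, "price_per_count": str(per_unit.quantize(FOUR_PLACES))}
    return blank  # unknown basis: leave every derived price blank
```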

Notes:

`normalized_item_id` replaces the need for a core `observed_products.csv` layer.
`normalization_basis` should be explicit values like `exact_upc`, `retailer_item_id`, `name_size_pack`, or `manual_retailer_alias`.
Cross-retailer identity is still handled later in review/combine via `catalog.csv` and `product_links.csv`.
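
A possible way to assign `normalized_row_id`, `normalized_item_id`, and `normalization_basis` is sketched here. The precedence (UPC, then retailer item id, then name/size/pack) and the id string format are assumptions made for illustration; the `manual_retailer_alias` basis is not covered.

```python
def normalized_row_id(retailer, order_id, line_no):
    """Stable row key in the documented <retailer>:<order_id>:<line_no> form."""
    return f"{retailer}:{order_id}:{line_no}"


def assign_item_identity(row):
    """Return (normalized_item_id, normalization_basis) for one parsed row."""
    retailer = row["retailer"]
    if row.get("upc"):
        return f"{retailer}:upc:{row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"{retailer}:item:{row['retailer_item_id']}", "retailer_item_id"
    name = row.get("item_name_norm", "")
    if name:
        parts = [name, row.get("size_value", ""), row.get("size_unit", ""), row.get("pack_qty", "")]
        return f"{retailer}:name:" + "|".join(str(p) for p in parts), "name_size_pack"
    # Nothing reliable to group on: leave both fields blank rather than guess.
    return "", ""
```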

`data/review/product_links.csv`

One row per observed-to-canonical relationship. Cardinality: one catalog item links to many normalized items.

key	definition
`observed_id` PK	retailer observed product id
`catalog_id` PK	linked canonical product id
`link_method`	`manual`, `exact_upc`, `exact_name`, etc.
`link_confidence`	optional confidence label
`review_status`	`pending`, `approved`, `rejected`, or blank
`reviewed_by`	reviewer id or initials
`reviewed_at`	review timestamp or date
`link_notes`	optional notes
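
Because the relationship is one catalog item to many retailer items, an observed id should not end up approved against two different catalog ids. A small consistency check, assuming link rows are dicts keyed by the columns above (`conflicting_links` is a hypothetical name):

```python
from collections import defaultdict


def conflicting_links(links):
    """Return observed ids that are approved against more than one catalog id."""
    approved = defaultdict(set)
    for link in links:
        if link["review_status"] == "approved":
            approved[link["observed_id"]].add(link["catalog_id"])
    return {oid: sorted(cids) for oid, cids in approved.items() if len(cids) > 1}
```

Rejected and pending rows are ignored, so a rejected candidate next to an approved one is not reported as a conflict.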

`data/review/review_queue.csv`

One row per issue needing human review.

key	definition
`review_id` PK	stable review row id
`queue_type`	`observed_product`, `link_candidate`, `parse_issue`
`retailer`	retailer slug when applicable
`observed_product_id`	observed product id when applicable
`catalog_id`	candidate canonical id when applicable
`reason_code`	machine-readable review reason
`priority`	optional priority label
`raw_item_names`	compact list of example raw names
`normalized_names`	compact list of example normalized names
`upc`	example UPC/PLU
`image_url`	example image url
`example_prices`	compact list of example prices
`seen_count`	count of related rows
`status`	`pending`, `approved`, `rejected`, `deferred`
`resolution_notes`	reviewer notes
`created_at`	creation timestamp or date
`updated_at`	last update timestamp or date

`data/catalog.csv`

One row per cross-retailer canonical product.

key	definition
`catalog_id` PK	stable canonical product id
`catalog_name`	canonical human-readable name
`product_type`	generic product, e.g. `apple`, `milk`
`category`	broad section, e.g. `produce`, `dairy`
`brand`	canonical brand when applicable
`variant`	canonical variant
`size_value`	normalized size value
`size_unit`	normalized size unit
`pack_qty`	normalized pack/count
`measure_type`	normalized measure type
`normalized_quantity`	numeric comparison basis value
`normalized_quantity_unit`	basis unit such as `oz`, `lb`, `count`
`notes`	optional human notes
`created_at`	creation timestamp or date
`updated_at`	last update timestamp or date

`data/purchases.csv`

One row per purchased item (i.e., `row_type=item` from normalized layer), with catalog attributes denormalized in and discounts already applied.

key	definition
`purchase_date`	date of purchase (from order)
`retailer`	retailer slug
`order_id`	retailer order id
`line_no`	line number within order
`normalized_row_id`	`<retailer>:<order_id>:<line_no>`
`normalized_item_id`	retailer-level normalized item identity
`catalog_id`	linked canonical product id
`catalog_name`	canonical product name for analysis
`catalog_product_type`	broader product family (e.g., `egg`, `milk`)
`catalog_category`	category such as `produce`, `dairy`
`catalog_brand`	canonical brand when applicable
`catalog_variant`	canonical variant when applicable
`raw_item_name`	original retailer item name
`normalized_item_name`	cleaned/normalized retailer item name
`retailer_item_id`	retailer-native item id
`upc`	UPC/PLU when available
`qty`	retailer quantity field
`unit`	retailer unit (e.g., `EA`, `LB`)
`pack_qty`	parsed pack/count
`size_value`	parsed size value
`size_unit`	parsed size unit
`measure_type`	`each`, `weight`, `volume`, `count`
`normalized_quantity`	normalized comparison quantity
`normalized_quantity_unit`	unit for normalized quantity
`unit_price`	retailer unit price
`line_total`	original retailer extended price (pre-discount)
`matched_discount_amount`	discount amount matched from discount lines
`net_line_total`	effective price after discount (`line_total` + `matched_discount_amount`)
`store_name`	retailer store name
`store_city`	store city
`store_state`	store state
`price_per_each`	derived per-each price
`price_per_each_basis`	source basis for per-each calc
`price_per_count`	derived per-count price
`price_per_count_basis`	source basis for per-count calc
`price_per_lb`	derived per-pound price
`price_per_lb_basis`	source basis for per-pound calc
`price_per_oz`	derived per-ounce price
`price_per_oz_basis`	source basis for per-ounce calc
`is_fee`	true if row represents non-product fee
`raw_order_path`	relative path to original order payload
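
Building this table amounts to a join from normalized rows through the approved links into the catalog. The sketch below shows only a subset of the columns and assumes links are keyed by `normalized_item_id`; `build_purchases` is a hypothetical name. Rows without a link keep blank catalog fields rather than a guessed match.

```python
def build_purchases(normalized_rows, links, catalog):
    """Join normalized item rows with catalog attributes.

    links: normalized_item_id -> catalog_id (assumed shape)
    catalog: catalog_id -> catalog row dict (assumed shape)
    """
    purchases = []
    for row in normalized_rows:
        # Discount and coupon lines are already folded into net_line_total.
        if row.get("is_discount_line") or row.get("is_coupon_line"):
            continue
        catalog_id = links.get(row["normalized_item_id"], "")
        entry = catalog.get(catalog_id, {})
        purchases.append({
            "retailer": row["retailer"],
            "order_id": row["order_id"],
            "line_no": row["line_no"],
            "normalized_item_id": row["normalized_item_id"],
            "catalog_id": catalog_id,
            "catalog_name": entry.get("catalog_name", ""),
            "catalog_category": entry.get("category", ""),
            "raw_item_name": row["item_name"],
            "net_line_total": row["net_line_total"],
        })
    return purchases
```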