scrape-giant/pm/data-model.org at eb3959ae0fdd81e3f163f611709f3c6f78de7820

ben 09829b2b9d Finalize post-refactor layout and remove old pipeline files

2026-03-24 17:09:57 -04:00

Grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the grocery pipeline. Goals:

Ensure data gathering is separate from analysis
Enable multiple data gathering methods
One layer for review and analysis

Design Rules

Raw retailer exports remain the source of truth.
Retailer parsing is isolated to retailer-specific files and ids.
Cross-retailer product layers begin only after retailer-specific normalization.
CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
Unknown values should be left blank rather than guessed.
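The additive rule can be checked mechanically against CSV headers. The following is a minimal sketch, not part of the repo; `is_additive_change` is a hypothetical helper name. It only detects reordered, renamed, or dropped columns. A column that keeps its name but changes meaning (repurposing) still needs human review.

```python
def is_additive_change(old_header, new_header):
    """Return True if every old column keeps its position and new ones are only appended."""
    old_header = list(old_header)
    new_header = list(new_header)
    return new_header[: len(old_header)] == old_header


# Hypothetical headers used only for illustration.
old = ["retailer", "order_id", "line_no", "item_name"]
appended = old + ["image_url"]
reordered = ["retailer", "line_no", "order_id", "item_name"]

print(is_additive_change(old, appended))   # True
print(is_additive_change(old, reordered))  # False
```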

Retailer-specific data:

raw json payloads
retailer order ids
retailer line numbers
retailer category ids and names
retailer item names
retailer image urls
comparison-ready normalized quantity basis fields

Review/Combined data:

catalog of reviewed products
links from normalized retailer items to catalog
human review state for unresolved cases

Pipeline

Each step can be run alone if its dependencies (the outputs of the preceding steps) exist. Each retailer script must produce deterministic line-item outputs, and normalization may assign within-retailer product identity only when the retailer itself provides strong evidence.

Key:

(1) input
[1] output

1. Collect

Get raw receipt/visit and item data from a retailer. Scraping is specific to a retailer and method (e.g., Giant-Web and Giant-Scan). Preserve the complete raw data with full fidelity. Avoid interpretation beyond basic data flattening.

(1) Source access (varies, e.g., header data, auth for API access)
[1] collected visits from each retailer
[2] collected items from each retailer
[3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)

2. Normalize

Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer. Strictly dependent on Collect method and output.

Extract quantity, size, pack, pricing, variant
Add discount line items to product line items using `upc`/`retailer_item_id` and co-occurrence within the same order
Clean up naming to facilitate later matching
Assign retailer-level `normalized_item_id` only when evidence is deterministic
Never use fuzzy or semantic matching here
(1) collected items from each retailer
(2) collected visits from each retailer
[1] normalized items from each retailer
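The discount-attachment step above can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: discount lines are assumed to carry negative `line_total` values, flags are assumed to be parsed into booleans already, and a discount is attached to the first product line in the same order that shares a `upc` or `retailer_item_id`. The function name `attach_discounts` is hypothetical.

```python
from decimal import Decimal


def attach_discounts(rows):
    """Attach each discount line to a product line in the same order.

    Assumptions (not confirmed by the source): discount amounts are negative,
    and a shared `upc` or `retailer_item_id` identifies the target product.
    """
    products = [r for r in rows if not r["is_discount_line"]]
    for prod in products:
        prod["matched_discount_amount"] = Decimal("0")
    for disc in (r for r in rows if r["is_discount_line"]):
        for prod in products:
            same_order = prod["order_id"] == disc["order_id"]
            same_item = any(
                disc[k] and disc[k] == prod[k] for k in ("upc", "retailer_item_id")
            )
            if same_order and same_item:
                prod["matched_discount_amount"] += Decimal(disc["line_total"])
                break  # first match only; unmatched discounts stay standalone
    for prod in products:
        prod["net_line_total"] = Decimal(prod["line_total"]) + prod["matched_discount_amount"]
    return rows


# Made-up sample rows for illustration.
rows = attach_discounts([
    {"order_id": "1001", "line_no": "1", "upc": "4048", "retailer_item_id": "",
     "line_total": "3.00", "is_discount_line": False},
    {"order_id": "1001", "line_no": "2", "upc": "4048", "retailer_item_id": "",
     "line_total": "-0.50", "is_discount_line": True},
])
print(rows[0]["net_line_total"])  # 2.50
```

The discount row itself is left in place, so it can still be emitted as a standalone normalized row for auditability.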

3. Review/Combine (Canonicalization)

Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review. Create catalog linked to normalized retailer items.

Review operates on distinct `normalized_item_id` values, not individual purchase rows
Cross-retailer identity decisions happen only here
Ask a human to create a canonical/catalog item with:
- friendly/catalog_name: "bell pepper"; "milk"
- category: "produce"; "dairy"
- product_type: "pepper"; "milk"
- variant (open question): "whole"; "skim"; "2pct"
Then link the group of items to that catalog item.
(1) normalized items from each retailer
[1] review queue of items to be reviewed
[2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
[3] purchase list of normalized items, pivot-ready
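Because review operates on distinct `normalized_item_id` values rather than purchase rows, the queue can be built by grouping. The sketch below is one possible shape, assuming rows are dicts keyed by the schema column names; `build_review_queue` and the `" | "` separator for the compact name list are hypothetical choices.

```python
from collections import defaultdict


def build_review_queue(normalized_rows, linked_ids):
    """Return one pending queue entry per unlinked normalized_item_id."""
    groups = defaultdict(list)
    for row in normalized_rows:
        item_id = row["normalized_item_id"]
        # Rows without a deterministic item id are skipped in this sketch.
        if item_id and item_id not in linked_ids:
            groups[item_id].append(row)
    queue = []
    for item_id, members in sorted(groups.items()):
        names = sorted({m["item_name"] for m in members})
        queue.append({
            "normalized_item_id": item_id,
            "queue_type": "link_candidate",
            "raw_item_names": " | ".join(names),
            "seen_count": len(members),
            "status": "pending",
        })
    return queue


# Made-up sample rows for illustration.
sample_rows = [
    {"normalized_item_id": "giant:upc:4048", "item_name": "LIME"},
    {"normalized_item_id": "giant:upc:4048", "item_name": "LIME . / ."},
    {"normalized_item_id": "giant:upc:4011", "item_name": "BANANA"},
    {"normalized_item_id": "", "item_name": "BAG FEE"},
]
queue = build_review_queue(sample_rows, linked_ids={"giant:upc:4011"})
print(queue[0]["seen_count"])  # 2
```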

Unresolved Issues

need central script to orchestrate; metadata belongs there and nowhere else
`LIME` and `LIME . / .` appearing in the catalog: names must come from review-approved names, not raw strings

Directory Layout

Use one top-level data root:

main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
  <retailer-method>/
    raw/  # unmodified retailer payloads exactly as fetched
      <order_id>.json
    collected_items.csv # one row per retailer line item w/ retailer-native values
    collected_orders.csv # one row per receipt/visit, flattened from raw order data
    normalized_items.csv # parsed retailer-specific line items with normalized fields
  costco-web/ # sample
    raw/
      orders/
        history.json
        <order_id>.json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv # Human review queue for unresolved matching/parsing cases.
    product_links.csv # Links from normalized retailer items to catalog items.
    catalog.csv # Cross-retailer product catalog entities used for comparison.
  analysis/
    purchases.csv
    comparison_examples.csv
    item_price_over_time.csv
    spend_by_visit.csv
    items_per_visit.csv
    category_spend_over_time.csv
    retailer_store_breakdown.csv

Notes:

The current repo still uses transitional root-level scripts and output folders.
This layout is the target structure for the refactor, not a claim that migration is already complete.
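A small path helper keeps scripts aligned with the target layout. This is a sketch only; `retailer_paths` and `DATA_ROOT` are hypothetical names, and the repo's scripts may resolve paths differently.

```python
from pathlib import Path

DATA_ROOT = Path("data")  # assumed top-level data root from the layout above


def retailer_paths(retailer, method, root=DATA_ROOT):
    """Return the per-retailer file locations for one <retailer-method> folder."""
    base = root / f"{retailer}-{method}"
    return {
        "raw": base / "raw",
        "collected_items": base / "collected_items.csv",
        "collected_orders": base / "collected_orders.csv",
        "normalized_items": base / "normalized_items.csv",
    }


paths = retailer_paths("giant", "web")
print(paths["normalized_items"].as_posix())  # data/giant-web/normalized_items.csv
```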

Schemas

`data/<retailer-method>/collected_items.csv`

One row per retailer line item.

key	definition
`retailer` PK	retailer slug
`order_id` PK	retailer order id
`line_no` PK	stable line number within order export
`order_date`	copied from order when available
`retailer_item_id`	retailer-native item id when available
`pod_id`	retailer pod/item id
`item_name`	raw retailer item name
`upc`	retailer UPC or PLU value
`category_id`	retailer category id
`category`	retailer category description
`qty`	retailer quantity field
`unit`	retailer unit code such as `EA` or `LB`
`unit_price`	retailer unit price field
`line_total`	retailer extended price field
`picked_weight`	retailer picked weight field
`mvp_savings`	retailer savings field
`reward_savings`	retailer rewards savings field
`coupon_savings`	retailer coupon savings field
`coupon_price`	retailer coupon price field
`image_url`	raw retailer image url when present
`raw_order_path`	relative path to source order payload
`is_discount_line`	retailer adjustment or discount-line flag
`is_coupon_line`	coupon-like line flag when distinguishable
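The composite primary key (`retailer`, `order_id`, `line_no`) can be verified after collection. The following is a minimal sketch using only the standard library; `duplicate_keys` is a hypothetical helper and the sample CSV content is made up for illustration.

```python
import csv
import io
from collections import Counter

PK = ("retailer", "order_id", "line_no")


def duplicate_keys(rows, key_fields=PK):
    """Return the sorted list of primary-key tuples that occur more than once."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# In practice this would be open("data/<retailer-method>/collected_items.csv", newline="").
sample = io.StringIO(
    "retailer,order_id,line_no,item_name\n"
    "giant,1001,1,LIME\n"
    "giant,1001,2,MILK\n"
    "giant,1001,2,MILK\n"
)
sample_items = list(csv.DictReader(sample))
print(duplicate_keys(sample_items))  # [('giant', '1001', '2')]
```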

`data/<retailer-method>/collected_orders.csv`

One row per order/visit/receipt.

key	definition
`retailer` PK	retailer slug such as `giant`
`order_id` PK	retailer order or visit id
`order_date`	order date in `YYYY-MM-DD` when available
`delivery_date`	fulfillment date in `YYYY-MM-DD` when available
`service_type`	retailer service type such as `INSTORE`
`order_total`	order total as provided by retailer
`payment_method`	retailer payment label
`total_item_count`	total line count or item count from retailer
`total_savings`	total savings as provided by retailer
`your_savings_total`	savings field from retailer when present
`coupons_discounts_total`	coupon/discount total from retailer
`store_name`	retailer store name
`store_number`	retailer store number
`store_address1`	street address
`store_city`	city
`store_state`	state or province
`store_zipcode`	postal code
`refund_order`	retailer refund flag
`ebt_order`	retailer EBT flag
`raw_history_path`	relative path to source history payload
`raw_order_path`	relative path to source order payload

`data/<retailer-method>/normalized_items.csv`

One row per retailer line item after deterministic parsing. Preserve raw fields from `collected_items.csv` and add parsed fields that make later review and grouping easier. Normalization may assign retailer-level identity when the evidence is deterministic and retailer-scoped.

key	definition
`retailer` PK	retailer slug
`order_id` PK	retailer order id
`line_no` PK	line number within order
`normalized_row_id`	stable row key, typically `<retailer>:<order_id>:<line_no>`
`normalized_item_id`	stable retailer-level item identity when deterministic grouping is supported
`normalization_basis`	basis used to assign `normalized_item_id`
`retailer_item_id`	retailer-native item id
`item_name`	raw retailer item name
`item_name_norm`	normalized retailer item name
`brand_guess`	parsed brand guess
`variant`	parsed variant text
`size_value`	parsed numeric size value
`size_unit`	parsed size unit such as `oz`, `lb`, `fl_oz`
`pack_qty`	parsed pack or count guess
`measure_type`	`each`, `weight`, `volume`, `count`, or blank
`normalized_quantity`	numeric comparison basis derived during normalization
`normalized_quantity_unit`	basis unit such as `oz`, `lb`, `count`, or blank
`is_item`	item flag
`is_store_brand`	store-brand guess
`is_fee`	fee or non-product flag
`is_discount_line`	discount or adjustment-line flag
`is_coupon_line`	coupon-like line flag
`matched_discount_amount`	matched discount value carried onto purchased row when supported
`net_line_total`	line total after matched discount when supported
`price_per_each`	derived per-each price when supported
`price_per_each_basis`	source basis for `price_per_each`
`price_per_count`	derived per-count price when supported
`price_per_count_basis`	source basis for `price_per_count`
`price_per_lb`	derived per-pound price when supported
`price_per_lb_basis`	source basis for `price_per_lb`
`price_per_oz`	derived per-ounce price when supported
`price_per_oz_basis`	source basis for `price_per_oz`
`image_url`	best available retailer image url
`raw_order_path`	relative path to source order payload
`parse_version`	parser version string for reruns
`parse_notes`	optional non-fatal parser notes

Notes:

`normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
Do not use fuzzy or semantic matching to assign `normalized_item_id`.
Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
Cross-retailer identity is handled later in review/combine via `data/review/catalog.csv` and `product_links.csv`.
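The notes above can be sketched as a deterministic assignment function. The id formats (`<retailer>:upc:...` and so on) and the precedence order are assumptions made for illustration; only the `normalization_basis` labels and the `normalized_row_id` pattern come from this document. The `approved_retailer_alias` basis is omitted because it depends on a reviewed alias list.

```python
def normalized_row_id(row):
    """Build the stable row key <retailer>:<order_id>:<line_no>."""
    return f"{row['retailer']}:{row['order_id']}:{row['line_no']}"


def assign_item_identity(row):
    """Return (normalized_item_id, normalization_basis), or ("", "") without strong evidence.

    No fuzzy or semantic matching: every branch is an exact comparison key.
    """
    retailer = row["retailer"]
    if row.get("upc"):
        return f"{retailer}:upc:{row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"{retailer}:item:{row['retailer_item_id']}", "exact_retailer_item_id"
    name = row.get("item_name_norm", "")
    size = row.get("size_value", "")
    unit = row.get("size_unit", "")
    pack = row.get("pack_qty", "")
    if name and size and unit and pack:
        return f"{retailer}:nsp:{name}|{size}{unit}|{pack}", "exact_name_size_pack"
    return "", ""  # unknown values stay blank rather than guessed


# Made-up sample row for illustration.
example = {"retailer": "giant", "order_id": "1001", "line_no": "3", "upc": "4048"}
print(normalized_row_id(example))     # giant:1001:3
print(assign_item_identity(example))  # ('giant:upc:4048', 'exact_upc')
```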

`data/review/product_links.csv`

One row per review-approved link from a normalized retailer item to a catalog item. Many normalized retailer items may link to the same catalog item.

key	definition
`normalized_item_id` PK	normalized retailer item id
`catalog_id` PK	linked catalog product id
`link_method`	`manual`, `exact_upc`, `exact_name_size`, etc.
`link_confidence`	optional confidence label
`review_status`	`pending`, `approved`, `rejected`, or blank
`reviewed_by`	reviewer id or initials
`reviewed_at`	review timestamp or date
`link_notes`	optional notes

`data/review/review_queue.csv`

One row per issue needing human review.

key	definition
`review_id` PK	stable review row id
`queue_type`	`link_candidate`, `parse_issue`, `catalog_cleanup`
`retailer`	retailer slug when applicable
`normalized_item_id`	normalized retailer item id when review is item-level
`normalized_row_id`	normalized row id when review is row-specific
`catalog_id`	candidate canonical id
`reason_code`	machine-readable review reason
`priority`	optional priority label
`raw_item_names`	compact list of example raw names
`normalized_names`	compact list of example normalized names
`upc`	example UPC/PLU
`image_url`	example image url
`example_prices`	compact list of example prices
`seen_count`	count of related rows
`status`	`pending`, `approved`, `rejected`, `deferred`
`resolution_notes`	reviewer notes
`created_at`	creation timestamp or date
`updated_at`	last update timestamp or date

`data/review/catalog.csv`

One row per cross-retailer catalog product.

key	definition
`catalog_id` PK	stable catalog product id
`catalog_name`	human-reviewed product name
`product_type`	generic product, e.g. `apple`, `milk`
`category`	broad section, e.g. `produce`, `dairy`
`brand`	canonical brand when applicable
`variant`	canonical variant
`size_value`	normalized size value
`size_unit`	normalized size unit
`pack_qty`	normalized pack/count
`measure_type`	normalized measure type
`normalized_quantity`	numeric comparison basis value
`normalized_quantity_unit`	basis unit such as `oz`, `lb`, `count`
`notes`	optional human notes
`created_at`	creation timestamp or date
`updated_at`	last update timestamp or date

Notes:

Do not auto-create new catalog rows from weak normalized names alone.
Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
`catalog_name` should come from review-approved naming, not raw retailer strings.

`data/analysis/purchases.csv`

One row per purchased item (i.e., rows where `is_item` is true in the normalized layer), with catalog attributes denormalized in and discounts already applied.

key	definition
`purchase_date`	date of purchase (from order)
`retailer`	retailer slug
`order_id`	retailer order id
`line_no`	line number within order
`normalized_row_id`	`<retailer>:<order_id>:<line_no>`
`normalized_item_id`	retailer-level normalized item identity
`catalog_id`	linked catalog product id
`catalog_name`	catalog product name for analysis
`catalog_product_type`	broader product family (e.g., `egg`, `milk`)
`catalog_category`	category such as `produce`, `dairy`
`catalog_brand`	canonical brand when applicable
`catalog_variant`	canonical variant when applicable
`raw_item_name`	original retailer item name
`normalized_item_name`	cleaned/normalized retailer item name
`retailer_item_id`	retailer-native item id
`upc`	UPC/PLU when available
`qty`	retailer quantity field
`unit`	retailer unit (e.g., `EA`, `LB`)
`pack_qty`	parsed pack/count
`size_value`	parsed size value
`size_unit`	parsed size unit
`measure_type`	`each`, `weight`, `volume`, `count`
`normalized_quantity`	normalized comparison quantity
`normalized_quantity_unit`	unit for normalized quantity
`unit_price`	retailer unit price
`line_total`	original retailer extended price (pre-discount)
`matched_discount_amount`	discount amount matched from discount lines
`net_line_total`	effective price after discount (`line_total` + discounts)
`store_name`	retailer store name
`store_city`	store city
`store_state`	store state
`price_per_each`	derived per-each price
`price_per_each_basis`	source basis for per-each calc
`price_per_count`	derived per-count price
`price_per_count_basis`	source basis for per-count calc
`price_per_lb`	derived per-pound price
`price_per_lb_basis`	source basis for per-pound calc
`price_per_oz`	derived per-ounce price
`price_per_oz_basis`	source basis for per-ounce calc
`is_fee`	true if row represents non-product fee
`raw_order_path`	relative path to original order payload

Notes:

Only rows that represent purchased items should appear here.
`line_total` preserves retailer truth; `net_line_total` is what you actually paid.
Catalog fields are denormalized in to make pivoting trivial.
No discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
Review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
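The fan-out described in these notes amounts to a lookup join. The sketch below assumes rows are dicts, boolean flags are stored as the strings `true`/`false`, and each `normalized_item_id` has at most one approved link; none of these are confirmed by the schemas. `build_purchases` is a hypothetical name and only a few catalog columns are shown.

```python
def build_purchases(normalized_rows, links, catalog):
    """Join item rows to approved catalog links; unlinked items keep blank catalog fields."""
    approved = {
        link["normalized_item_id"]: link["catalog_id"]
        for link in links
        if link["review_status"] == "approved"
    }
    catalog_by_id = {entry["catalog_id"]: entry for entry in catalog}
    purchases = []
    for row in normalized_rows:
        if row["is_item"] != "true":  # assumed flag encoding; drops discount/coupon rows
            continue
        catalog_id = approved.get(row["normalized_item_id"], "")
        entry = catalog_by_id.get(catalog_id, {})
        purchases.append({
            **row,
            "catalog_id": catalog_id,
            "catalog_name": entry.get("catalog_name", ""),
            "catalog_category": entry.get("category", ""),
        })
    return purchases


# Made-up sample data for illustration.
purchases = build_purchases(
    normalized_rows=[
        {"normalized_item_id": "giant:upc:4048", "is_item": "true", "item_name": "LIME"},
        {"normalized_item_id": "giant:upc:4048", "is_item": "true", "item_name": "LIME . / ."},
        {"normalized_item_id": "", "is_item": "false", "item_name": "MVP SAVINGS"},
    ],
    links=[{"normalized_item_id": "giant:upc:4048", "catalog_id": "cat_0001",
            "review_status": "approved"}],
    catalog=[{"catalog_id": "cat_0001", "catalog_name": "lime", "category": "produce"}],
)
print([p["catalog_name"] for p in purchases])  # ['lime', 'lime']
```

One approved link covers both purchase rows, which is why review effort scales with distinct items rather than with purchases.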

Normalized quantity is deterministic and conservative:

if `qty * pack_qty * size_value` is available, use that total with `size_unit`
else if count basis is explicit, use `qty * pack_qty` with unit `count`
else if `measure_type` is `each`, use `qty` with unit `each`
else leave both fields blank
no hidden unit conversion is applied inside normalization; values stay in their parsed units such as `oz`, `lb`, `qt`, or `count`
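The fallback chain above can be sketched as a single function. This is an illustration, not the repo's code: "count basis is explicit" is interpreted here as `measure_type` being `count`, which is an assumption, and blank CSV cells are assumed to arrive as empty strings.

```python
from decimal import Decimal


def _num(value):
    """Parse a CSV cell into Decimal, treating blank or missing as unknown."""
    return Decimal(value) if value not in ("", None) else None


def normalized_quantity(row):
    """Return (normalized_quantity, normalized_quantity_unit); both blank when unknown."""
    qty = _num(row.get("qty", ""))
    pack = _num(row.get("pack_qty", ""))
    size = _num(row.get("size_value", ""))
    unit = row.get("size_unit", "")
    measure = row.get("measure_type", "")
    if qty is not None and pack is not None and size is not None and unit:
        return qty * pack * size, unit  # stays in the parsed unit, no conversion
    if measure == "count" and qty is not None and pack is not None:
        return qty * pack, "count"
    if measure == "each" and qty is not None:
        return qty, "each"
    return "", ""


# Two 6-packs of 12 oz cans: 2 * 6 * 12 = 144 oz.
print(normalized_quantity({"qty": "2", "pack_qty": "6", "size_value": "12", "size_unit": "oz"}))
```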
