scrape-giant/pm/data-model.org at 347cd44d09af6c22fc5f9ba39586ac49ad7eb68a

ben 42dbae1d2e added data-model

2026-03-16 00:22:24 -04:00

grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the grocery pipeline. The goal is to keep retailer-specific ingest separate from cross-retailer product modeling so Giant-specific quirks do not become the system of record.

design rules

Raw retailer exports remain the source of truth.
Retailer parsing is isolated to retailer-specific files and ids.
Cross-retailer product layers begin only after retailer-specific enrichment.
CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
Unknown values should be left blank rather than guessed.
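
The additive rule can be checked mechanically when a CSV header changes. The following is a minimal sketch, not part of the existing pipeline; the function name `is_additive_change` is hypothetical, and it assumes "additive" means existing columns keep their names and positions while new ones are only appended.

```python
def is_additive_change(old_columns, new_columns):
    """Return True when new_columns keeps every old column in place and only appends."""
    old = list(old_columns)
    new = list(new_columns)
    # Existing columns must stay in the same positions with the same names.
    if new[:len(old)] != old:
        return False
    # Appended names must not duplicate any existing column.
    return len(set(new)) == len(new)


old_header = ["retailer", "order_id", "order_date"]
print(is_additive_change(old_header, old_header + ["store_name"]))  # True
print(is_additive_change(old_header, ["retailer", "order_date", "order_id"]))  # False
```

A check like this only covers names and order; it cannot detect a column whose meaning was repurposed while its name stayed the same.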

directory layout

Use one top-level data root:

data/
  giant/
    raw/
      history.json
      orders/
        <order_id>.json
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  costco/
    raw/
      ...
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  shared/
    products_canonical.csv
    product_links.csv
    review_queue.csv
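
To keep path construction in one place, the layout above could be wrapped in a few helpers. This is a sketch under assumptions: the helper names and the `DATA_ROOT` constant are hypothetical, and `raw_order_path` follows the `raw/orders/<order_id>.json` layout shown for Giant, which may not apply to other retailers.

```python
from pathlib import Path

DATA_ROOT = Path("data")  # assumed location of the top-level data root

RETAILER_FILES = ("orders.csv", "items_raw.csv", "items_enriched.csv", "products_observed.csv")
SHARED_FILES = ("products_canonical.csv", "product_links.csv", "review_queue.csv")


def retailer_file(retailer, name, root=DATA_ROOT):
    """Path to one of the per-retailer CSV files."""
    if name not in RETAILER_FILES:
        raise ValueError(f"unknown retailer file: {name}")
    return root / retailer / name


def shared_file(name, root=DATA_ROOT):
    """Path to one of the cross-retailer CSV files."""
    if name not in SHARED_FILES:
        raise ValueError(f"unknown shared file: {name}")
    return root / "shared" / name


def raw_order_path(retailer, order_id, root=DATA_ROOT):
    """Path to a raw order payload, following the layout shown for Giant."""
    return root / retailer / "raw" / "orders" / f"{order_id}.json"


print(retailer_file("giant", "items_raw.csv").as_posix())  # data/giant/items_raw.csv
print(shared_file("product_links.csv").as_posix())  # data/shared/product_links.csv
```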

layer responsibilities

`data/<retailer>/raw/`: stores unmodified retailer payloads exactly as fetched.
`data/<retailer>/orders.csv`: one row per retailer order or visit, flattened from raw order data.
`data/<retailer>/items_raw.csv`: one row per retailer line item, preserving retailer-native values needed for reruns and debugging.
`data/<retailer>/items_enriched.csv`: parsed retailer line items with normalized fields and derived guesses, still retailer-specific.
`data/<retailer>/products_observed.csv`: distinct retailer-facing observed products aggregated from enriched items.
`data/shared/products_canonical.csv`: cross-retailer canonical product entities used for comparison.
`data/shared/product_links.csv`: links from retailer observed products to canonical products.
`data/shared/review_queue.csv`: human review queue for unresolved or low-confidence matching/parsing cases.

retailer-specific versus shared

Retailer-specific:

raw json payloads
retailer order ids
retailer line numbers
retailer category ids and names
retailer item names
retailer image urls
parsed guesses derived from one retailer feed
observed products scoped to one retailer

Shared:

canonical products
observed-to-canonical links
human review state for unresolved cases
normalized quantity fields used as the basis for cross-retailer comparison

Observed products are the boundary between retailer-specific parsing and cross-retailer canonicalization. Nothing upstream of `products_observed.csv` should require knowledge of another retailer.

schema: `data/<retailer>/orders.csv`

One row per order or visit.

column	meaning
`retailer`	retailer slug such as `giant`
`order_id`	retailer order or visit id
`order_date`	order date in `YYYY-MM-DD` when available
`delivery_date`	fulfillment date in `YYYY-MM-DD` when available
`service_type`	retailer service type such as `INSTORE`
`order_total`	order total as provided by retailer
`payment_method`	retailer payment label
`total_item_count`	total line count or item count from retailer
`total_savings`	total savings as provided by retailer
`your_savings_total`	savings field from retailer when present
`coupons_discounts_total`	coupon/discount total from retailer
`store_name`	retailer store name
`store_number`	retailer store number
`store_address1`	street address
`store_city`	city
`store_state`	state or province
`store_zipcode`	postal code
`refund_order`	retailer refund flag
`ebt_order`	retailer EBT flag
`raw_history_path`	relative path to source history payload
`raw_order_path`	relative path to source order payload

Primary key:

(`retailer`, `order_id`)
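
A primary key like this can be verified after each export. The sketch below is illustrative only: `duplicate_keys` is a hypothetical helper and the sample rows are made up. The same function would apply to the other tables by passing their key columns.

```python
import csv
import io
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key tuples that appear on more than one row."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return sorted(key for key, seen in counts.items() if seen > 1)


# Made-up sample standing in for data/giant/orders.csv.
sample = io.StringIO(
    "retailer,order_id,order_total\n"
    "giant,1001,25.10\n"
    "giant,1002,8.75\n"
    "giant,1001,25.10\n"
)
rows = list(csv.DictReader(sample))
print(duplicate_keys(rows, ("retailer", "order_id")))  # [('giant', '1001')]
```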

schema: `data/<retailer>/items_raw.csv`

One row per retailer line item.

column	meaning
`retailer`	retailer slug
`order_id`	retailer order id
`line_no`	stable line number within order export
`order_date`	copied from order when available
`pod_id`	retailer pod/item id
`item_name`	raw retailer item name
`upc`	retailer UPC or PLU value
`category_id`	retailer category id
`category`	retailer category description
`qty`	retailer quantity field
`unit`	retailer unit code such as `EA` or `LB`
`unit_price`	retailer unit price field
`line_total`	retailer extended price field
`picked_weight`	retailer picked weight field
`mvp_savings`	retailer savings field
`reward_savings`	retailer rewards savings field
`coupon_savings`	retailer coupon savings field
`coupon_price`	retailer coupon price field
`image_url`	raw retailer image url when present
`raw_order_path`	relative path to source order payload

Primary key:

(`retailer`, `order_id`, `line_no`)

schema: `data/<retailer>/items_enriched.csv`

One row per retailer line item after deterministic parsing. Preserve the raw fields from `items_raw.csv` and add parsed fields.

column	meaning
`retailer`	retailer slug
`order_id`	retailer order id
`line_no`	line number within order
`observed_item_key`	stable row key, typically `<retailer>:<order_id>:<line_no>`
`item_name`	raw retailer item name
`item_name_norm`	normalized item name
`brand_guess`	parsed brand guess
`variant`	parsed variant text
`size_value`	parsed numeric size value
`size_unit`	parsed size unit such as `oz`, `lb`, `fl_oz`
`pack_qty`	parsed pack or count guess
`measure_type`	`each`, `weight`, `volume`, `count`, or blank
`is_store_brand`	store-brand guess
`is_fee`	fee or non-product flag
`price_per_each`	derived per-each price when supported
`price_per_lb`	derived per-pound price when supported
`price_per_oz`	derived per-ounce price when supported
`image_url`	best available retailer image url
`parse_version`	parser version string for reruns
`parse_notes`	optional non-fatal parser notes

Primary key:

(`retailer`, `order_id`, `line_no`)
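
As an illustration of the derived fields, the sketch below computes `observed_item_key` and the per-unit prices from one raw row. The formulas are assumptions, not the parser's confirmed behavior: per-each price is taken as `line_total / qty` for `EA` rows, per-pound price as `line_total / picked_weight` for `LB` rows, and per-ounce price as the per-pound price divided by 16. In keeping with the design rules, anything that cannot be derived is left blank. The helper names are hypothetical.

```python
from decimal import Decimal, InvalidOperation

OUNCES_PER_POUND = Decimal(16)


def to_decimal(value):
    """Parse a CSV cell into a Decimal; return None for blank or invalid text."""
    try:
        number = Decimal(value.strip())
    except (InvalidOperation, AttributeError):
        return None
    return number if number.is_finite() else None


def fmt(value):
    """Format a derived price with four decimals, or blank when unknown."""
    return "" if value is None else str(value.quantize(Decimal("0.0001")))


def derive_prices(row):
    """Derive the row key and per-unit prices from one items_raw.csv row."""
    line_total = to_decimal(row.get("line_total", ""))
    qty = to_decimal(row.get("qty", ""))
    weight = to_decimal(row.get("picked_weight", ""))
    per_each = per_lb = per_oz = None
    if line_total is not None:
        # Zero or missing divisors fall through and leave the fields blank.
        if row.get("unit") == "LB" and weight:
            per_lb = line_total / weight
            per_oz = per_lb / OUNCES_PER_POUND
        elif row.get("unit") == "EA" and qty:
            per_each = line_total / qty
    return {
        "observed_item_key": f"{row['retailer']}:{row['order_id']}:{row['line_no']}",
        "price_per_each": fmt(per_each),
        "price_per_lb": fmt(per_lb),
        "price_per_oz": fmt(per_oz),
    }
```

`Decimal` is used instead of `float` so that prices read from CSV text are not subject to binary rounding.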

schema: `data/<retailer>/products_observed.csv`

One row per distinct retailer-facing observed product.

column	meaning
`observed_product_id`	stable observed product id
`retailer`	retailer slug
`observed_key`	deterministic grouping key used to create the observed product
`representative_upc`	best representative UPC/PLU
`representative_item_name`	representative raw retailer name
`representative_name_norm`	representative normalized name
`representative_brand`	representative brand guess
`representative_variant`	representative variant
`representative_size_value`	representative size value
`representative_size_unit`	representative size unit
`representative_pack_qty`	representative pack/count
`representative_measure_type`	representative measure type
`representative_image_url`	representative image url
`is_store_brand`	representative store-brand flag
`is_fee`	representative fee flag
`first_seen_date`	first order date seen
`last_seen_date`	last order date seen
`times_seen`	number of enriched item rows grouped here
`example_order_id`	one example retailer order id
`example_item_name`	one example raw item name

Primary key:

(`observed_product_id`)
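
The aggregation from enriched rows to observed products might look roughly like the sketch below. The grouping rule is an assumption (prefer the UPC/PLU, fall back to the normalized name), since this document does not define how `observed_key` is built. It also assumes the enriched rows carry `upc` and `order_date` from `items_raw.csv`, and it omits `observed_product_id` and the `representative_*` columns because no id scheme or selection rule is specified here.

```python
from collections import defaultdict


def observed_key(row):
    """Hypothetical grouping rule: UPC/PLU when present, else normalized name."""
    if row.get("upc"):
        return f"upc:{row['upc']}"
    return f"name:{row.get('item_name_norm', '')}"


def aggregate_observed(rows):
    """Group enriched item rows into one summary row per observed product."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["retailer"], observed_key(row))].append(row)

    observed = []
    for (retailer, key), members in sorted(groups.items()):
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        dates = sorted(m["order_date"] for m in members if m.get("order_date"))
        observed.append({
            "retailer": retailer,
            "observed_key": key,
            "first_seen_date": dates[0] if dates else "",
            "last_seen_date": dates[-1] if dates else "",
            "times_seen": len(members),
            "example_order_id": members[0]["order_id"],
            "example_item_name": members[0]["item_name"],
        })
    return observed
```

Because the grouping only reads fields from a single retailer's rows, it stays on the retailer-specific side of the boundary described earlier.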

schema: `data/shared/products_canonical.csv`

One row per cross-retailer canonical product.

column	meaning
`canonical_product_id`	stable canonical product id
`canonical_name`	canonical human-readable name
`product_type`	broad class such as `apple`, `milk`, `trash_bag`
`brand`	canonical brand when applicable
`variant`	canonical variant
`size_value`	normalized size value
`size_unit`	normalized size unit
`pack_qty`	normalized pack/count
`measure_type`	normalized measure type
`normalized_quantity`	numeric comparison basis value
`normalized_quantity_unit`	basis unit such as `oz`, `lb`, `count`
`notes`	optional human notes
`created_at`	creation timestamp or date
`updated_at`	last update timestamp or date

Primary key:

(`canonical_product_id`)
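
One possible way to fill `normalized_quantity` and `normalized_quantity_unit` is to multiply the size by the pack count and convert to a basis unit. The sketch below is an assumption about how that could work, not a defined rule of this model: the conversion table, the choice of `oz` as the weight basis, and the function name are all hypothetical. Unknown units produce blanks rather than a guess.

```python
from decimal import Decimal

# Assumed conversions: size_unit -> (basis unit, multiplier into that basis).
TO_BASIS = {
    "oz": ("oz", Decimal(1)),
    "lb": ("oz", Decimal(16)),
    "fl_oz": ("fl_oz", Decimal(1)),
    "count": ("count", Decimal(1)),
}


def normalized_quantity(size_value, size_unit, pack_qty=""):
    """Return (normalized_quantity, normalized_quantity_unit) as CSV-ready strings."""
    if not size_value or size_unit not in TO_BASIS:
        return "", ""
    basis_unit, factor = TO_BASIS[size_unit]
    # A blank pack_qty is treated as a single unit.
    packs = Decimal(pack_qty) if pack_qty else Decimal(1)
    total = Decimal(size_value) * factor * packs
    return str(total), basis_unit


# A 12-pack of 12 fl oz cans compares as 144 fl oz.
print(normalized_quantity("12", "fl_oz", "12"))  # ('144', 'fl_oz')
```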

schema: `data/shared/product_links.csv`

One row per observed-to-canonical relationship.

column	meaning
`observed_product_id`	retailer observed product id
`canonical_product_id`	linked canonical product id
`link_method`	`manual`, `exact_upc`, `exact_name`, etc.
`link_confidence`	optional confidence label
`review_status`	`pending`, `approved`, `rejected`, or blank
`reviewed_by`	reviewer id or initials
`reviewed_at`	review timestamp or date
`link_notes`	optional notes

Primary key:

(`observed_product_id`, `canonical_product_id`)

schema: `data/shared/review_queue.csv`

One row per issue needing human review.

column	meaning
`review_id`	stable review row id
`queue_type`	`observed_product`, `link_candidate`, `parse_issue`
`retailer`	retailer slug when applicable
`observed_product_id`	observed product id when applicable
`canonical_product_id`	candidate canonical id when applicable
`reason_code`	machine-readable review reason
`priority`	optional priority label
`raw_item_names`	compact list of example raw names
`normalized_names`	compact list of example normalized names
`upc`	example UPC/PLU
`image_url`	example image url
`example_prices`	compact list of example prices
`seen_count`	count of related rows
`status`	`pending`, `approved`, `rejected`, `deferred`
`resolution_notes`	reviewer notes
`created_at`	creation timestamp or date
`updated_at`	last update timestamp or date

Primary key:

(`review_id`)

current giant mapping

Current scraper outputs map to the new layout as follows:

`giant_output/raw/history.json` -> `data/giant/raw/history.json`
`giant_output/raw/<order_id>.json` -> `data/giant/raw/orders/<order_id>.json`
`giant_output/orders.csv` -> `data/giant/orders.csv`
`giant_output/items.csv` -> `data/giant/items_raw.csv`
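
The mapping above is small enough to express as a single function, which could back a one-off migration script. This is a sketch; `map_legacy_path` is a hypothetical name and the function only translates path strings without moving any files.

```python
from pathlib import PurePosixPath

NEW_ROOT = PurePosixPath("data/giant")


def map_legacy_path(old_path):
    """Translate a giant_output/ path to its location in the new layout."""
    parts = PurePosixPath(old_path).parts
    if not parts or parts[0] != "giant_output":
        raise ValueError(f"not a legacy path: {old_path}")
    rest = parts[1:]
    if rest == ("raw", "history.json"):
        return str(NEW_ROOT / "raw" / "history.json")
    if len(rest) == 2 and rest[0] == "raw" and rest[1].endswith(".json"):
        # Per-order payloads move into the raw/orders/ subdirectory.
        return str(NEW_ROOT / "raw" / "orders" / rest[1])
    if rest == ("orders.csv",):
        return str(NEW_ROOT / "orders.csv")
    if rest == ("items.csv",):
        # items.csv is renamed to items_raw.csv.
        return str(NEW_ROOT / "items_raw.csv")
    raise ValueError(f"no mapping defined for: {old_path}")


print(map_legacy_path("giant_output/raw/12345.json"))  # data/giant/raw/orders/12345.json
```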

Current Giant raw order payloads already expose fields needed for future enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`, `unitPrice`, `groceryAmount`, and `totalPickedWeight`.
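
For orientation, the sketch below shows how those payload fields might line up with `items_raw.csv` columns. The correspondence is an assumption inferred from the field names (for example `groceryAmount` as `line_total` and `lbEachCd` as `unit`) and should be confirmed against real payloads; the sample item and the `flatten_item` helper are made up for illustration.

```python
# Assumed correspondence between Giant payload fields and items_raw.csv columns.
GIANT_FIELD_TO_COLUMN = {
    "itemName": "item_name",
    "primUpcCd": "upc",
    "lbEachCd": "unit",
    "unitPrice": "unit_price",
    "groceryAmount": "line_total",
    "totalPickedWeight": "picked_weight",
    "image": "image_url",
}


def flatten_item(item, retailer, order_id, line_no):
    """Build a partial items_raw.csv row from one raw line-item dict."""
    row = {"retailer": retailer, "order_id": order_id, "line_no": line_no}
    for field, column in GIANT_FIELD_TO_COLUMN.items():
        value = item.get(field)
        # Missing values stay blank rather than being guessed.
        row[column] = "" if value is None else str(value)
    return row


# Made-up line item; real payloads may nest or name values differently.
sample_item = {"itemName": "BANANAS", "primUpcCd": "4011", "lbEachCd": "LB", "unitPrice": 0.59}
print(flatten_item(sample_item, "giant", "1001", 1))
```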
