scrape-giant/pm/data-model.org
2026-03-16 00:22:24 -04:00

grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the grocery pipeline. The goal is to keep retailer-specific ingest separate from cross-retailer product modeling so Giant-specific quirks do not become the system of record.

design rules

  • Raw retailer exports remain the source of truth.
  • Retailer parsing is isolated to retailer-specific files and ids.
  • Cross-retailer product layers begin only after retailer-specific enrichment.
  • CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
  • Unknown values should be left blank rather than guessed.
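
The additive-schema and blank-for-unknown rules can be checked mechanically. The sketch below shows one way to do that in Python; `is_additive_change` and `blank_if_unknown` are hypothetical helper names, not part of the existing scraper.

```python
def is_additive_change(old_header: list[str], new_header: list[str]) -> bool:
    """True when every old column keeps its position and new columns are only appended."""
    return new_header[: len(old_header)] == old_header


def blank_if_unknown(value: object) -> str:
    """Serialize a value for CSV; an unknown (None) value becomes an empty cell."""
    return "" if value is None else str(value)
```

A check like `is_additive_change` could run before any file is overwritten, so that a reordered or renamed column fails early instead of silently shifting data.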

directory layout

Use one top-level data root:

data/
  giant/
    raw/
      history.json
      orders/
        <order_id>.json
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  costco/
    raw/
      ...
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  shared/
    products_canonical.csv
    product_links.csv
    review_queue.csv
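
The layout above can be captured in a few path helpers so that scripts do not assemble paths by hand. This is a minimal sketch, assuming the data root is resolved relative to the working directory. The function names are illustrative, and the raw-order helper is Giant-specific because only Giant's raw layout is spelled out above.

```python
from pathlib import Path

DATA_ROOT = Path("data")  # assumption: resolved relative to the working directory

RETAILER_FILES = {"orders.csv", "items_raw.csv", "items_enriched.csv", "products_observed.csv"}
SHARED_FILES = {"products_canonical.csv", "product_links.csv", "review_queue.csv"}


def retailer_path(retailer: str, filename: str, root: Path = DATA_ROOT) -> Path:
    """Path to one of the four per-retailer CSV files."""
    if filename not in RETAILER_FILES:
        raise ValueError(f"not a retailer-level file: {filename}")
    return root / retailer / filename


def shared_path(filename: str, root: Path = DATA_ROOT) -> Path:
    """Path to one of the cross-retailer CSV files."""
    if filename not in SHARED_FILES:
        raise ValueError(f"not a shared file: {filename}")
    return root / "shared" / filename


def giant_raw_order_path(order_id: str, root: Path = DATA_ROOT) -> Path:
    """Path to one raw Giant order payload."""
    return root / "giant" / "raw" / "orders" / f"{order_id}.json"
```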

layer responsibilities

  • `data/<retailer>/raw/`: Stores unmodified retailer payloads exactly as fetched.
  • `data/<retailer>/orders.csv`: One row per retailer order or visit, flattened from raw order data.
  • `data/<retailer>/items_raw.csv`: One row per retailer line item, preserving retailer-native values needed for reruns and debugging.
  • `data/<retailer>/items_enriched.csv`: Parsed retailer line items with normalized fields and derived guesses, still retailer-specific.
  • `data/<retailer>/products_observed.csv`: Distinct retailer-facing observed products aggregated from enriched items.
  • `data/shared/products_canonical.csv`: Cross-retailer canonical product entities used for comparison.
  • `data/shared/product_links.csv`: Links from retailer observed products to canonical products.
  • `data/shared/review_queue.csv`: Human review queue for unresolved or low-confidence matching/parsing cases.

retailer-specific versus shared

Retailer-specific:

  • raw json payloads
  • retailer order ids
  • retailer line numbers
  • retailer category ids and names
  • retailer item names
  • retailer image urls
  • parsed guesses derived from one retailer feed
  • observed products scoped to one retailer

Shared:

  • canonical products
  • observed-to-canonical links
  • human review state for unresolved cases
  • comparison-ready normalized quantity basis fields

Observed products are the boundary between retailer-specific parsing and cross-retailer canonicalization. Nothing upstream of `products_observed.csv` should require knowledge of another retailer.

schema: `data/<retailer>/orders.csv`

One row per order or visit.

| column | meaning |
|---|---|
| `retailer` | retailer slug such as `giant` |
| `order_id` | retailer order or visit id |
| `order_date` | order date in `YYYY-MM-DD` when available |
| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
| `service_type` | retailer service type such as `INSTORE` |
| `order_total` | order total as provided by retailer |
| `payment_method` | retailer payment label |
| `total_item_count` | total line count or item count from retailer |
| `total_savings` | total savings as provided by retailer |
| `your_savings_total` | savings field from retailer when present |
| `coupons_discounts_total` | coupon/discount total from retailer |
| `store_name` | retailer store name |
| `store_number` | retailer store number |
| `store_address1` | street address |
| `store_city` | city |
| `store_state` | state or province |
| `store_zipcode` | postal code |
| `refund_order` | retailer refund flag |
| `ebt_order` | retailer EBT flag |
| `raw_history_path` | relative path to source history payload |
| `raw_order_path` | relative path to source order payload |

Primary key:

  • (`retailer`, `order_id`)
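
Because the key is composite, uniqueness is worth checking after every export. The following is a small sketch of such a check. `duplicate_keys` and `check_orders` are hypothetical helpers, and the first one works for any CSV in this document when given the matching key columns.

```python
import csv
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return each key tuple that appears on more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def check_orders(path):
    """Raise if orders.csv violates its (retailer, order_id) primary key."""
    with open(path, newline="", encoding="utf-8") as f:
        dupes = duplicate_keys(csv.DictReader(f), ("retailer", "order_id"))
    if dupes:
        raise ValueError(f"duplicate order keys: {dupes}")
```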

schema: `data/<retailer>/items_raw.csv`

One row per retailer line item.

| column | meaning |
|---|---|
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | stable line number within order export |
| `order_date` | copied from order when available |
| `pod_id` | retailer pod/item id |
| `item_name` | raw retailer item name |
| `upc` | retailer UPC or PLU value |
| `category_id` | retailer category id |
| `category` | retailer category description |
| `qty` | retailer quantity field |
| `unit` | retailer unit code such as `EA` or `LB` |
| `unit_price` | retailer unit price field |
| `line_total` | retailer extended price field |
| `picked_weight` | retailer picked weight field |
| `mvp_savings` | retailer savings field |
| `reward_savings` | retailer rewards savings field |
| `coupon_savings` | retailer coupon savings field |
| `coupon_price` | retailer coupon price field |
| `image_url` | raw retailer image url when present |
| `raw_order_path` | relative path to source order payload |

Primary key:

  • (`retailer`, `order_id`, `line_no`)
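
When writing this file, keeping the column order in one place helps the schema stay stable, and `csv.DictWriter` can fill absent fields with blanks. The sketch below works under those assumptions. `write_items_raw` is an illustrative name, and the column list mirrors the schema for `items_raw.csv`.

```python
import csv

ITEMS_RAW_COLUMNS = [
    "retailer", "order_id", "line_no", "order_date", "pod_id", "item_name",
    "upc", "category_id", "category", "qty", "unit", "unit_price",
    "line_total", "picked_weight", "mvp_savings", "reward_savings",
    "coupon_savings", "coupon_price", "image_url", "raw_order_path",
]


def write_items_raw(rows, fileobj):
    """Write rows in schema order; missing fields stay blank, unknown fields raise."""
    writer = csv.DictWriter(
        fileobj, fieldnames=ITEMS_RAW_COLUMNS, restval="", extrasaction="raise"
    )
    writer.writeheader()
    writer.writerows(rows)
```

The file object should be opened with `newline=""`, as the `csv` module documentation recommends, so that line endings are not translated twice.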

schema: `data/<retailer>/items_enriched.csv`

One row per retailer line item after deterministic parsing. Preserve the raw fields from `items_raw.csv` and add parsed fields.

| column | meaning |
|---|---|
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | line number within order |
| `observed_item_key` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
| `item_name` | raw retailer item name |
| `item_name_norm` | normalized item name |
| `brand_guess` | parsed brand guess |
| `variant` | parsed variant text |
| `size_value` | parsed numeric size value |
| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty` | parsed pack or count guess |
| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
| `is_store_brand` | store-brand guess |
| `is_fee` | fee or non-product flag |
| `price_per_each` | derived per-each price when supported |
| `price_per_lb` | derived per-pound price when supported |
| `price_per_oz` | derived per-ounce price when supported |
| `image_url` | best available retailer image url |
| `parse_version` | parser version string for reruns |
| `parse_notes` | optional non-fatal parser notes |

Primary key:

  • (`retailer`, `order_id`, `line_no`)
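
Two of the derived columns can be illustrated briefly. The sketch below builds `observed_item_key` in the typical `<retailer>:<order_id>:<line_no>` form and derives `price_per_oz` only when the inputs support it, leaving the cell blank otherwise. It assumes, without confirmation from the source, that `unit_price` is the price of one package of `size_value` `size_unit`. The four-decimal rounding is also an arbitrary choice.

```python
from decimal import Decimal, InvalidOperation

# Weight units this sketch can convert to ounces; fl_oz is volume and is skipped.
OZ_PER_UNIT = {"oz": Decimal(1), "lb": Decimal(16)}


def observed_item_key(retailer: str, order_id: str, line_no: str) -> str:
    """Stable row key in the <retailer>:<order_id>:<line_no> form."""
    return f"{retailer}:{order_id}:{line_no}"


def price_per_oz(unit_price: str, size_value: str, size_unit: str) -> str:
    """Per-ounce price as a string, or "" when it cannot be derived."""
    factor = OZ_PER_UNIT.get(size_unit)
    if factor is None:
        return ""
    try:
        price = Decimal(unit_price)
        size = Decimal(size_value)
    except InvalidOperation:  # blank or non-numeric input
        return ""
    if size <= 0:
        return ""
    return str((price / (size * factor)).quantize(Decimal("0.0001")))
```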

schema: `data/<retailer>/products_observed.csv`

One row per distinct retailer-facing observed product.

| column | meaning |
|---|---|
| `observed_product_id` | stable observed product id |
| `retailer` | retailer slug |
| `observed_key` | deterministic grouping key used to create the observed product |
| `representative_upc` | best representative UPC/PLU |
| `representative_item_name` | representative raw retailer name |
| `representative_name_norm` | representative normalized name |
| `representative_brand` | representative brand guess |
| `representative_variant` | representative variant |
| `representative_size_value` | representative size value |
| `representative_size_unit` | representative size unit |
| `representative_pack_qty` | representative pack/count |
| `representative_measure_type` | representative measure type |
| `representative_image_url` | representative image url |
| `is_store_brand` | representative store-brand flag |
| `is_fee` | representative fee flag |
| `first_seen_date` | first order date seen |
| `last_seen_date` | last order date seen |
| `times_seen` | number of enriched item rows grouped here |
| `example_order_id` | one example retailer order id |
| `example_item_name` | one example raw item name |

Primary key:

  • (`observed_product_id`)
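
The aggregation from enriched items to observed products can be sketched as a group-by on a deterministic key. This document does not define the grouping rule or the id scheme, so both are assumptions here: the key is the UPC when present and the normalized name otherwise, and the id is a truncated SHA-1 of retailer plus key. The sketch also assumes the enriched rows carry `upc` and `order_date` from `items_raw.csv`, and it fills only the date, count, and example columns.

```python
import hashlib
from collections import defaultdict


def observed_key(row: dict) -> str:
    """Hypothetical grouping key: UPC when present, else the normalized name."""
    if row.get("upc"):
        return f"upc={row['upc']}"
    return f"name={row.get('item_name_norm', '')}"


def aggregate_observed(rows: list[dict]) -> list[dict]:
    """Group enriched item rows into one record per (retailer, observed_key)."""
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for row in rows:
        groups[(row["retailer"], observed_key(row))].append(row)

    products = []
    for (retailer, key), members in groups.items():
        # YYYY-MM-DD strings sort chronologically; blank dates are ignored.
        dates = sorted(m["order_date"] for m in members if m.get("order_date"))
        digest = hashlib.sha1(f"{retailer}|{key}".encode()).hexdigest()[:12]
        products.append({
            "observed_product_id": f"{retailer}_{digest}",  # assumed id scheme
            "retailer": retailer,
            "observed_key": key,
            "first_seen_date": dates[0] if dates else "",
            "last_seen_date": dates[-1] if dates else "",
            "times_seen": len(members),
            "example_order_id": members[0]["order_id"],
            "example_item_name": members[0]["item_name"],
        })
    return products
```

Hashing the key rather than numbering rows keeps `observed_product_id` the same across reruns, which matters because `product_links.csv` refers to it.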

schema: `data/shared/products_canonical.csv`

One row per cross-retailer canonical product.

| column | meaning |
|---|---|
| `canonical_product_id` | stable canonical product id |
| `canonical_name` | canonical human-readable name |
| `product_type` | broad class such as `apple`, `milk`, `trash_bag` |
| `brand` | canonical brand when applicable |
| `variant` | canonical variant |
| `size_value` | normalized size value |
| `size_unit` | normalized size unit |
| `pack_qty` | normalized pack/count |
| `measure_type` | normalized measure type |
| `normalized_quantity` | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes` | optional human notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |

Primary key:

  • (`canonical_product_id`)
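
The `normalized_quantity` pair is what makes two products comparable, so a worked sketch may help. The rule below is an assumption, not something this document specifies: weights are converted to ounces and multiplied by the pack count, a bare pack count becomes a `count` basis, and anything else is left blank rather than guessed.

```python
from decimal import Decimal, InvalidOperation

TO_OZ = {"oz": Decimal(1), "lb": Decimal(16)}  # weight units handled by this sketch


def normalized_quantity(size_value: str, size_unit: str, pack_qty: str) -> tuple[str, str]:
    """Return (normalized_quantity, normalized_quantity_unit), or ("", "") if unknown."""
    try:
        packs = Decimal(pack_qty) if pack_qty else Decimal(1)
    except InvalidOperation:
        return "", ""
    if size_unit in TO_OZ:
        try:
            size = Decimal(size_value)
        except InvalidOperation:
            return "", ""
        return str(size * packs * TO_OZ[size_unit]), "oz"
    if not size_unit and pack_qty:
        return str(packs), "count"
    return "", ""
```

For example, a two-pack of 1.5 lb items would come out as 48 oz under this rule, which could then be divided into a price to compare against a single 48 oz item.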

schema: `data/shared/product_links.csv`

One row per observed-to-canonical relationship.

| column | meaning |
|---|---|
| `observed_product_id` | retailer observed product id |
| `canonical_product_id` | linked canonical product id |
| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
| `link_confidence` | optional confidence label |
| `review_status` | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by` | reviewer id or initials |
| `reviewed_at` | review timestamp or date |
| `link_notes` | optional notes |

Primary key:

  • (`observed_product_id`, `canonical_product_id`)
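
Since each link row points into two other files, a consistency check is a natural companion to this schema. The sketch below flags duplicate keys, dangling ids, and unexpected `review_status` values. `link_problems` is a hypothetical helper, and the caller is assumed to have loaded the id sets from `products_observed.csv` and `products_canonical.csv`.

```python
LINK_REVIEW_STATUSES = {"", "pending", "approved", "rejected"}


def link_problems(links, observed_ids, canonical_ids):
    """Return human-readable problems found in product_links rows."""
    problems = []
    seen = set()
    for i, link in enumerate(links, start=1):
        key = (link["observed_product_id"], link["canonical_product_id"])
        if key in seen:
            problems.append(f"row {i}: duplicate link {key}")
        seen.add(key)
        if key[0] not in observed_ids:
            problems.append(f"row {i}: unknown observed_product_id {key[0]}")
        if key[1] not in canonical_ids:
            problems.append(f"row {i}: unknown canonical_product_id {key[1]}")
        status = link.get("review_status", "")
        if status not in LINK_REVIEW_STATUSES:
            problems.append(f"row {i}: unexpected review_status {status!r}")
    return problems
```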

schema: `data/shared/review_queue.csv`

One row per issue needing human review.

| column | meaning |
|---|---|
| `review_id` | stable review row id |
| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
| `retailer` | retailer slug when applicable |
| `observed_product_id` | observed product id when applicable |
| `canonical_product_id` | candidate canonical id when applicable |
| `reason_code` | machine-readable review reason |
| `priority` | optional priority label |
| `raw_item_names` | compact list of example raw names |
| `normalized_names` | compact list of example normalized names |
| `upc` | example UPC/PLU |
| `image_url` | example image url |
| `example_prices` | compact list of example prices |
| `seen_count` | count of related rows |
| `status` | `pending`, `approved`, `rejected`, `deferred` |
| `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |

Primary key:

  • (`review_id`)
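
One way the queue might be populated is by enqueuing every observed product that has no usable link. The sketch below assumes a link counts as usable unless its `review_status` is `rejected`. The `review_id` format and the `no_canonical_link` reason code are invented for illustration, and columns that do not apply are simply omitted so a CSV writer can leave them blank.

```python
from datetime import date


def queue_unlinked(observed, links, today=None):
    """Build review_queue rows for observed products with no non-rejected link."""
    today = today or date.today().isoformat()
    linked = {
        link["observed_product_id"]
        for link in links
        if link.get("review_status") != "rejected"
    }
    rows = []
    for product in observed:
        pid = product["observed_product_id"]
        if pid in linked:
            continue
        rows.append({
            "review_id": f"observed_product:{pid}",  # hypothetical id scheme
            "queue_type": "observed_product",
            "retailer": product["retailer"],
            "observed_product_id": pid,
            "reason_code": "no_canonical_link",  # hypothetical reason code
            "raw_item_names": product.get("representative_item_name", ""),
            "normalized_names": product.get("representative_name_norm", ""),
            "upc": product.get("representative_upc", ""),
            "seen_count": product.get("times_seen", ""),
            "status": "pending",
            "created_at": today,
            "updated_at": today,
        })
    return rows
```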

current giant mapping

Current scraper outputs map to the new layout as follows:

  • `giant_output/raw/history.json` -> `data/giant/raw/history.json`
  • `giant_output/raw/<order_id>.json` -> `data/giant/raw/orders/<order_id>.json`
  • `giant_output/orders.csv` -> `data/giant/orders.csv`
  • `giant_output/items.csv` -> `data/giant/items_raw.csv`
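
The four mappings above are simple enough to express as a path-rewriting function, which a one-off migration script could use. This is a sketch only: `migrated_path` is a hypothetical name, and paths are treated as POSIX-style strings.

```python
from pathlib import PurePosixPath
from typing import Optional

OLD_ROOT = PurePosixPath("giant_output")
NEW_ROOT = PurePosixPath("data/giant")


def migrated_path(old: str) -> Optional[PurePosixPath]:
    """Map a legacy scraper output path to the new layout, or None if unmapped."""
    try:
        rel = PurePosixPath(old).relative_to(OLD_ROOT)
    except ValueError:  # path is not under giant_output/
        return None
    parts = rel.parts
    if parts == ("raw", "history.json"):
        return NEW_ROOT / "raw" / "history.json"
    if len(parts) == 2 and parts[0] == "raw" and rel.suffix == ".json":
        # Per-order payloads move into a dedicated orders/ subdirectory.
        return NEW_ROOT / "raw" / "orders" / parts[1]
    if parts == ("orders.csv",):
        return NEW_ROOT / "orders.csv"
    if parts == ("items.csv",):
        return NEW_ROOT / "items_raw.csv"
    return None
```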

Current Giant raw order payloads already expose fields needed for future enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`, `unitPrice`, `groceryAmount`, and `totalPickedWeight`.
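
As an illustration of how these payload fields might feed `items_raw.csv`, the sketch below maps one raw item object to a partial row. The field-to-column pairing is inferred from the names and is an assumption, not something this document confirms. If `image` turns out to be a nested object rather than a URL string, it would need its own extractor.

```python
# Assumed correspondence between Giant payload fields and items_raw.csv columns.
GIANT_FIELD_TO_COLUMN = {
    "itemName": "item_name",
    "primUpcCd": "upc",
    "lbEachCd": "unit",
    "unitPrice": "unit_price",
    "groceryAmount": "line_total",
    "totalPickedWeight": "picked_weight",
    "image": "image_url",
}


def giant_item_to_raw(item: dict) -> dict:
    """Copy known payload fields into items_raw columns; missing values stay blank."""
    row = {}
    for field, column in GIANT_FIELD_TO_COLUMN.items():
        value = item.get(field)
        row[column] = "" if value is None else str(value)
    return row
```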