* grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the grocery pipeline. The goal is to keep retailer-specific ingest separate from cross-retailer product modeling so Giant-specific quirks do not become the system of record.

** design rules

- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific enrichment.
- CSV schemas are stable and additive: new columns may be appended, but existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.

** directory layout

Use one top-level data root:

#+begin_example
data/
  giant/
    raw/
      history.json
      orders/
        .json
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  costco/
    raw/
      ...
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  shared/
    products_canonical.csv
    product_links.csv
    review_queue.csv
#+end_example

** layer responsibilities

- `data//raw/`
  Stores unmodified retailer payloads exactly as fetched.
- `data//orders.csv`
  One row per retailer order or visit, flattened from raw order data.
- `data//items_raw.csv`
  One row per retailer line item, preserving retailer-native values needed for reruns and debugging.
- `data//items_enriched.csv`
  Parsed retailer line items with normalized fields and derived guesses, still retailer-specific.
- `data//products_observed.csv`
  Distinct retailer-facing observed products aggregated from enriched items.
- `data/shared/products_canonical.csv`
  Cross-retailer canonical product entities used for comparison.
- `data/shared/product_links.csv`
  Links from retailer observed products to canonical products.
- `data/shared/review_queue.csv`
  Human review queue for unresolved or low-confidence matching/parsing cases.
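The layout and layer list above can be sketched as a pair of small path helpers. This is only an illustration under stated assumptions: the names `RETAILER_FILES`, `SHARED_FILES`, `retailer_paths`, and `shared_paths` are hypothetical and not part of the pipeline; only the directory and file names come from this document.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# File names taken from the directory layout; the constant names are hypothetical.
RETAILER_FILES = (
    "orders.csv",
    "items_raw.csv",
    "items_enriched.csv",
    "products_observed.csv",
)
SHARED_FILES = (
    "products_canonical.csv",
    "product_links.csv",
    "review_queue.csv",
)


def retailer_paths(retailer: str, root: str = "data") -> dict[str, PurePosixPath]:
    """Return the raw directory and layer files for one retailer, keyed by stem."""
    base = PurePosixPath(root) / retailer
    paths = {"raw": base / "raw"}
    for name in RETAILER_FILES:
        paths[PurePosixPath(name).stem] = base / name
    return paths


def shared_paths(root: str = "data") -> dict[str, PurePosixPath]:
    """Return the cross-retailer files under the shared directory, keyed by stem."""
    base = PurePosixPath(root) / "shared"
    return {PurePosixPath(name).stem: base / name for name in SHARED_FILES}
```

Keeping the retailer slug as the only varying input mirrors the rule that retailer-specific files never depend on another retailer; a new retailer such as `costco` would reuse the same helper unchanged.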
** retailer-specific versus shared

Retailer-specific:

- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- parsed guesses derived from one retailer feed
- observed products scoped to one retailer

Shared:

- canonical products
- observed-to-canonical links
- human review state for unresolved cases
- comparison-ready normalized quantity basis fields

Observed products are the boundary between retailer-specific parsing and cross-retailer canonicalization. Nothing upstream of `products_observed.csv` should require knowledge of another retailer.

** schema: `data//orders.csv`

One row per order or visit.

| column                    | meaning                                         |
|---------------------------+-------------------------------------------------|
| `retailer`                | retailer slug such as `giant`                   |
| `order_id`                | retailer order or visit id                      |
| `order_date`              | order date in `YYYY-MM-DD` when available       |
| `delivery_date`           | fulfillment date in `YYYY-MM-DD` when available |
| `service_type`            | retailer service type such as `INSTORE`         |
| `order_total`             | order total as provided by retailer             |
| `payment_method`          | retailer payment label                          |
| `total_item_count`        | total line count or item count from retailer    |
| `total_savings`           | total savings as provided by retailer           |
| `your_savings_total`      | savings field from retailer when present        |
| `coupons_discounts_total` | coupon/discount total from retailer             |
| `store_name`              | retailer store name                             |
| `store_number`            | retailer store number                           |
| `store_address1`          | street address                                  |
| `store_city`              | city                                            |
| `store_state`             | state or province                               |
| `store_zipcode`           | postal code                                     |
| `refund_order`            | retailer refund flag                            |
| `ebt_order`               | retailer EBT flag                               |
| `raw_history_path`        | relative path to source history payload         |
| `raw_order_path`          | relative path to source order payload           |

Primary key:

- (`retailer`, `order_id`)

** schema: `data//items_raw.csv`

One row per retailer line item.

| column           | meaning                                 |
|------------------+-----------------------------------------|
| `retailer`       | retailer slug                           |
| `order_id`       | retailer order id                       |
| `line_no`        | stable line number within order export  |
| `order_date`     | copied from order when available        |
| `pod_id`         | retailer pod/item id                    |
| `item_name`      | raw retailer item name                  |
| `upc`            | retailer UPC or PLU value               |
| `category_id`    | retailer category id                    |
| `category`       | retailer category description           |
| `qty`            | retailer quantity field                 |
| `unit`           | retailer unit code such as `EA` or `LB` |
| `unit_price`     | retailer unit price field               |
| `line_total`     | retailer extended price field           |
| `picked_weight`  | retailer picked weight field            |
| `mvp_savings`    | retailer savings field                  |
| `reward_savings` | retailer rewards savings field          |
| `coupon_savings` | retailer coupon savings field           |
| `coupon_price`   | retailer coupon price field             |
| `image_url`      | raw retailer image url when present     |
| `raw_order_path` | relative path to source order payload   |

Primary key:

- (`retailer`, `order_id`, `line_no`)

** schema: `data//items_enriched.csv`

One row per retailer line item after deterministic parsing. Preserve the raw fields from `items_raw.csv` and add parsed fields.

| column              | meaning                                       |
|---------------------+-----------------------------------------------|
| `retailer`          | retailer slug                                 |
| `order_id`          | retailer order id                             |
| `line_no`           | line number within order                      |
| `observed_item_key` | stable row key, typically `::`                |
| `item_name`         | raw retailer item name                        |
| `item_name_norm`    | normalized item name                          |
| `brand_guess`       | parsed brand guess                            |
| `variant`           | parsed variant text                           |
| `size_value`        | parsed numeric size value                     |
| `size_unit`         | parsed size unit such as `oz`, `lb`, `fl_oz`  |
| `pack_qty`          | parsed pack or count guess                    |
| `measure_type`      | `each`, `weight`, `volume`, `count`, or blank |
| `is_store_brand`    | store-brand guess                             |
| `is_fee`            | fee or non-product flag                       |
| `price_per_each`    | derived per-each price when supported         |
| `price_per_lb`      | derived per-pound price when supported        |
| `price_per_oz`      | derived per-ounce price when supported        |
| `image_url`         | best available retailer image url             |
| `parse_version`     | parser version string for reruns              |
| `parse_notes`       | optional non-fatal parser notes               |

Primary key:

- (`retailer`, `order_id`, `line_no`)

** schema: `data//products_observed.csv`

One row per distinct retailer-facing observed product.

| column                        | meaning                                                        |
|-------------------------------+----------------------------------------------------------------|
| `observed_product_id`         | stable observed product id                                     |
| `retailer`                    | retailer slug                                                  |
| `observed_key`                | deterministic grouping key used to create the observed product |
| `representative_upc`          | best representative UPC/PLU                                    |
| `representative_item_name`    | representative raw retailer name                               |
| `representative_name_norm`    | representative normalized name                                 |
| `representative_brand`        | representative brand guess                                     |
| `representative_variant`      | representative variant                                         |
| `representative_size_value`   | representative size value                                      |
| `representative_size_unit`    | representative size unit                                       |
| `representative_pack_qty`     | representative pack/count                                      |
| `representative_measure_type` | representative measure type                                    |
| `representative_image_url`    | representative image url                                       |
| `is_store_brand`              | representative store-brand flag                                |
| `is_fee`                      | representative fee flag                                        |
| `first_seen_date`             | first order date seen                                          |
| `last_seen_date`              | last order date seen                                           |
| `times_seen`                  | number of enriched item rows grouped here                      |
| `example_order_id`            | one example retailer order id                                  |
| `example_item_name`           | one example raw item name                                      |

Primary key:

- (`observed_product_id`)

** schema: `data/shared/products_canonical.csv`

One row per cross-retailer canonical product.

| column                     | meaning                                          |
|----------------------------+--------------------------------------------------|
| `canonical_product_id`     | stable canonical product id                      |
| `canonical_name`           | canonical human-readable name                    |
| `product_type`             | broad class such as `apple`, `milk`, `trash_bag` |
| `brand`                    | canonical brand when applicable                  |
| `variant`                  | canonical variant                                |
| `size_value`               | normalized size value                            |
| `size_unit`                | normalized size unit                             |
| `pack_qty`                 | normalized pack/count                            |
| `measure_type`             | normalized measure type                          |
| `normalized_quantity`      | numeric comparison basis value                   |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`           |
| `notes`                    | optional human notes                             |
| `created_at`               | creation timestamp or date                       |
| `updated_at`               | last update timestamp or date                    |

Primary key:

- (`canonical_product_id`)

** schema: `data/shared/product_links.csv`

One row per observed-to-canonical relationship.

| column                 | meaning                                     |
|------------------------+---------------------------------------------|
| `observed_product_id`  | retailer observed product id                |
| `canonical_product_id` | linked canonical product id                 |
| `link_method`          | `manual`, `exact_upc`, `exact_name`, etc.   |
| `link_confidence`      | optional confidence label                   |
| `review_status`        | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by`          | reviewer id or initials                     |
| `reviewed_at`          | review timestamp or date                    |
| `link_notes`           | optional notes                              |

Primary key:

- (`observed_product_id`, `canonical_product_id`)

** schema: `data/shared/review_queue.csv`

One row per issue needing human review.

| column                 | meaning                                             |
|------------------------+-----------------------------------------------------|
| `review_id`            | stable review row id                                |
| `queue_type`           | `observed_product`, `link_candidate`, `parse_issue` |
| `retailer`             | retailer slug when applicable                       |
| `observed_product_id`  | observed product id when applicable                 |
| `canonical_product_id` | candidate canonical id when applicable              |
| `reason_code`          | machine-readable review reason                      |
| `priority`             | optional priority label                             |
| `raw_item_names`       | compact list of example raw names                   |
| `normalized_names`     | compact list of example normalized names            |
| `upc`                  | example UPC/PLU                                     |
| `image_url`            | example image url                                   |
| `example_prices`       | compact list of example prices                      |
| `seen_count`           | count of related rows                               |
| `status`               | `pending`, `approved`, `rejected`, `deferred`       |
| `resolution_notes`     | reviewer notes                                      |
| `created_at`           | creation timestamp or date                          |
| `updated_at`           | last update timestamp or date                       |

Primary key:

- (`review_id`)

** current giant mapping

Current scraper outputs map to the new layout as follows:

- `giant_output/raw/history.json` -> `data/giant/raw/history.json`
- `giant_output/raw/.json` -> `data/giant/raw/orders/.json`
- `giant_output/orders.csv` -> `data/giant/orders.csv`
- `giant_output/items.csv` -> `data/giant/items_raw.csv`

Current Giant raw order payloads already expose fields needed for future enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`, `unitPrice`, `groceryAmount`, and `totalPickedWeight`.
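
The path mapping listed in this section could be expressed as a small pure function, which makes a one-off migration easy to dry-run before any files move. This is a minimal sketch, not the scraper's actual code: `map_giant_path`, `OLD_ROOT`, `NEW_ROOT`, and `FIXED_RENAMES` are hypothetical names, and it assumes that every JSON file directly under `giant_output/raw/` other than `history.json` is a per-order payload.

```python
from pathlib import PurePosixPath

OLD_ROOT = PurePosixPath("giant_output")
NEW_ROOT = PurePosixPath("data/giant")

# Fixed renames copied from the mapping list (paths relative to each root).
FIXED_RENAMES = {
    "raw/history.json": "raw/history.json",
    "orders.csv": "orders.csv",
    "items.csv": "items_raw.csv",
}


def map_giant_path(old_path: str) -> PurePosixPath:
    """Map a current scraper output path to its location in the new layout."""
    # relative_to raises ValueError when old_path is not under OLD_ROOT.
    rel = PurePosixPath(old_path).relative_to(OLD_ROOT)
    key = rel.as_posix()
    if key in FIXED_RENAMES:
        return NEW_ROOT / FIXED_RENAMES[key]
    # Assumption: any other JSON file directly under raw/ is an order payload
    # and keeps its file name under raw/orders/.
    if rel.parent == PurePosixPath("raw") and rel.suffix == ".json":
        return NEW_ROOT / "raw" / "orders" / rel.name
    raise ValueError(f"no mapping defined for {old_path}")
```

Raising on unmapped paths, rather than guessing a destination, follows the design rule that unknown values are left alone instead of guessed.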