diff --git a/pm/data-model.org b/pm/data-model.org
new file mode 100644
index 0000000..5b1966c
--- /dev/null
+++ b/pm/data-model.org
@@ -0,0 +1,300 @@
+* grocery data model and file layout
+
+This document defines the shared file layout and stable CSV schemas for the
+grocery pipeline. The goal is to keep retailer-specific ingest separate from
+cross-retailer product modeling so Giant-specific quirks do not become the
+system of record.
+
+** design rules
+
+- Raw retailer exports remain the source of truth.
+- Retailer parsing is isolated to retailer-specific files and ids.
+- Cross-retailer product layers begin only after retailer-specific enrichment.
+- CSV schemas are stable and additive: new columns may be appended, but
+  existing columns should not be repurposed.
+- Unknown values should be left blank rather than guessed.
+
+** directory layout
+
+Use one top-level data root:
+
+#+begin_example
+data/
+  giant/
+    raw/
+      history.json
+      orders/
+        .json
+    orders.csv
+    items_raw.csv
+    items_enriched.csv
+    products_observed.csv
+  costco/
+    raw/
+      ...
+    orders.csv
+    items_raw.csv
+    items_enriched.csv
+    products_observed.csv
+  shared/
+    products_canonical.csv
+    product_links.csv
+    review_queue.csv
+#+end_example
+
+** layer responsibilities
+
+- `data//raw/`
+  Stores unmodified retailer payloads exactly as fetched.
+- `data//orders.csv`
+  One row per retailer order or visit, flattened from raw order data.
+- `data//items_raw.csv`
+  One row per retailer line item, preserving retailer-native values needed for
+  reruns and debugging.
+- `data//items_enriched.csv`
+  Parsed retailer line items with normalized fields and derived guesses, still
+  retailer-specific.
+- `data//products_observed.csv`
+  Distinct retailer-facing observed products aggregated from enriched items.
+- `data/shared/products_canonical.csv`
+  Cross-retailer canonical product entities used for comparison.
+- `data/shared/product_links.csv`
+  Links from retailer observed products to canonical products.
+- `data/shared/review_queue.csv`
+  Human review queue for unresolved or low-confidence matching/parsing cases.
+
+** retailer-specific versus shared
+
+Retailer-specific:
+
+- raw json payloads
+- retailer order ids
+- retailer line numbers
+- retailer category ids and names
+- retailer item names
+- retailer image urls
+- parsed guesses derived from one retailer feed
+- observed products scoped to one retailer
+
+Shared:
+
+- canonical products
+- observed-to-canonical links
+- human review state for unresolved cases
+- comparison-ready normalized quantity basis fields
+
+Observed products are the boundary between retailer-specific parsing and
+cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
+should require knowledge of another retailer.
+
+** schema: `data//orders.csv`
+
+One row per order or visit.
+
+| column | meaning |
+|-
+| `retailer` | retailer slug such as `giant` |
+| `order_id` | retailer order or visit id |
+| `order_date` | order date in `YYYY-MM-DD` when available |
+| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
+| `service_type` | retailer service type such as `INSTORE` |
+| `order_total` | order total as provided by retailer |
+| `payment_method` | retailer payment label |
+| `total_item_count` | total line count or item count from retailer |
+| `total_savings` | total savings as provided by retailer |
+| `your_savings_total` | savings field from retailer when present |
+| `coupons_discounts_total` | coupon/discount total from retailer |
+| `store_name` | retailer store name |
+| `store_number` | retailer store number |
+| `store_address1` | street address |
+| `store_city` | city |
+| `store_state` | state or province |
+| `store_zipcode` | postal code |
+| `refund_order` | retailer refund flag |
+| `ebt_order` | retailer EBT flag |
+| `raw_history_path` | relative path to source history payload |
+| `raw_order_path` | relative path to source order payload |
+
+Primary key:
+
+- (`retailer`, `order_id`)
+
+** schema: `data//items_raw.csv`
+
+One row per retailer line item.
+
+| column | meaning |
+|------------------+-----------------------------------------|
+| `retailer` | retailer slug |
+| `order_id` | retailer order id |
+| `line_no` | stable line number within order export |
+| `order_date` | copied from order when available |
+| `pod_id` | retailer pod/item id |
+| `item_name` | raw retailer item name |
+| `upc` | retailer UPC or PLU value |
+| `category_id` | retailer category id |
+| `category` | retailer category description |
+| `qty` | retailer quantity field |
+| `unit` | retailer unit code such as `EA` or `LB` |
+| `unit_price` | retailer unit price field |
+| `line_total` | retailer extended price field |
+| `picked_weight` | retailer picked weight field |
+| `mvp_savings` | retailer savings field |
+| `reward_savings` | retailer rewards savings field |
+| `coupon_savings` | retailer coupon savings field |
+| `coupon_price` | retailer coupon price field |
+| `image_url` | raw retailer image url when present |
+| `raw_order_path` | relative path to source order payload |
+
+Primary key:
+
+- (`retailer`, `order_id`, `line_no`)
+
+** schema: `data//items_enriched.csv`
+
+One row per retailer line item after deterministic parsing. Preserve the raw
+fields from `items_raw.csv` and add parsed fields.
+
+| column | meaning |
+|---------------------+-------------------------------------------------------------|
+| `retailer` | retailer slug |
+| `order_id` | retailer order id |
+| `line_no` | line number within order |
+| `observed_item_key` | stable row key, typically `::` |
+| `item_name` | raw retailer item name |
+| `item_name_norm` | normalized item name |
+| `brand_guess` | parsed brand guess |
+| `variant` | parsed variant text |
+| `size_value` | parsed numeric size value |
+| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
+| `pack_qty` | parsed pack or count guess |
+| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
+| `is_store_brand` | store-brand guess |
+| `is_fee` | fee or non-product flag |
+| `price_per_each` | derived per-each price when supported |
+| `price_per_lb` | derived per-pound price when supported |
+| `price_per_oz` | derived per-ounce price when supported |
+| `image_url` | best available retailer image url |
+| `parse_version` | parser version string for reruns |
+| `parse_notes` | optional non-fatal parser notes |
+
+Primary key:
+
+- (`retailer`, `order_id`, `line_no`)
+
+** schema: `data//products_observed.csv`
+
+One row per distinct retailer-facing observed product.
+
+| column | meaning |
+|-------------------------------+----------------------------------------------------------------|
+| `observed_product_id` | stable observed product id |
+| `retailer` | retailer slug |
+| `observed_key` | deterministic grouping key used to create the observed product |
+| `representative_upc` | best representative UPC/PLU |
+| `representative_item_name` | representative raw retailer name |
+| `representative_name_norm` | representative normalized name |
+| `representative_brand` | representative brand guess |
+| `representative_variant` | representative variant |
+| `representative_size_value` | representative size value |
+| `representative_size_unit` | representative size unit |
+| `representative_pack_qty` | representative pack/count |
+| `representative_measure_type` | representative measure type |
+| `representative_image_url` | representative image url |
+| `is_store_brand` | representative store-brand flag |
+| `is_fee` | representative fee flag |
+| `first_seen_date` | first order date seen |
+| `last_seen_date` | last order date seen |
+| `times_seen` | number of enriched item rows grouped here |
+| `example_order_id` | one example retailer order id |
+| `example_item_name` | one example raw item name |
+
+Primary key:
+
+- (`observed_product_id`)
+
+** schema: `data/shared/products_canonical.csv`
+
+One row per cross-retailer canonical product.
+
+| column | meaning |
+|----------------------------+--------------------------------------------------|
+| `canonical_product_id` | stable canonical product id |
+| `canonical_name` | canonical human-readable name |
+| `product_type` | broad class such as `apple`, `milk`, `trash_bag` |
+| `brand` | canonical brand when applicable |
+| `variant` | canonical variant |
+| `size_value` | normalized size value |
+| `size_unit` | normalized size unit |
+| `pack_qty` | normalized pack/count |
+| `measure_type` | normalized measure type |
+| `normalized_quantity` | numeric comparison basis value |
+| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
+| `notes` | optional human notes |
+| `created_at` | creation timestamp or date |
+| `updated_at` | last update timestamp or date |
+
+Primary key:
+
+- (`canonical_product_id`)
+
+** schema: `data/shared/product_links.csv`
+
+One row per observed-to-canonical relationship.
+
+| column | meaning |
+|-
+| `observed_product_id` | retailer observed product id |
+| `canonical_product_id` | linked canonical product id |
+| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
+| `link_confidence` | optional confidence label |
+| `review_status` | `pending`, `approved`, `rejected`, or blank |
+| `reviewed_by` | reviewer id or initials |
+| `reviewed_at` | review timestamp or date |
+| `link_notes` | optional notes |
+
+Primary key:
+
+- (`observed_product_id`, `canonical_product_id`)
+
+** schema: `data/shared/review_queue.csv`
+
+One row per issue needing human review.
+
+| column | meaning |
+|-
+| `review_id` | stable review row id |
+| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
+| `retailer` | retailer slug when applicable |
+| `observed_product_id` | observed product id when applicable |
+| `canonical_product_id` | candidate canonical id when applicable |
+| `reason_code` | machine-readable review reason |
+| `priority` | optional priority label |
+| `raw_item_names` | compact list of example raw names |
+| `normalized_names` | compact list of example normalized names |
+| `upc` | example UPC/PLU |
+| `image_url` | example image url |
+| `example_prices` | compact list of example prices |
+| `seen_count` | count of related rows |
+| `status` | `pending`, `approved`, `rejected`, `deferred` |
+| `resolution_notes` | reviewer notes |
+| `created_at` | creation timestamp or date |
+| `updated_at` | last update timestamp or date |
+
+Primary key:
+
+- (`review_id`)
+
+** current giant mapping
+
+Current scraper outputs map to the new layout as follows:
+
+- `giant_output/raw/history.json` -> `data/giant/raw/history.json`
+- `giant_output/raw/.json` -> `data/giant/raw/orders/.json`
+- `giant_output/orders.csv` -> `data/giant/orders.csv`
+- `giant_output/items.csv` -> `data/giant/items_raw.csv`
+
+Current Giant raw order payloads already expose fields needed for future
+enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`,
+`unitPrice`, `groceryAmount`, and `totalPickedWeight`.
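The path mapping listed under "current giant mapping" could be expressed as a small lookup. This is only a sketch, not part of the commit: the function name `map_legacy_path` and the constants are hypothetical, and it assumes that any JSON file directly under `giant_output/raw/` other than `history.json` is a per-order payload.

```python
from pathlib import PurePosixPath
from typing import Optional

# Roots taken from the mapping list; the constant names are illustrative only.
LEGACY_ROOT = PurePosixPath("giant_output")
NEW_ROOT = PurePosixPath("data/giant")

# Fixed one-to-one renames from the mapping list.
FIXED_RENAMES = {
    "raw/history.json": "raw/history.json",
    "orders.csv": "orders.csv",
    "items.csv": "items_raw.csv",
}


def map_legacy_path(legacy: str) -> Optional[PurePosixPath]:
    """Return the new-layout path for a legacy scraper output, or None if unmapped."""
    try:
        rel = PurePosixPath(legacy).relative_to(LEGACY_ROOT)
    except ValueError:
        # Not under the legacy output root at all.
        return None

    key = rel.as_posix()
    if key in FIXED_RENAMES:
        return NEW_ROOT / FIXED_RENAMES[key]

    # Assumption: remaining JSON files directly under raw/ are order payloads
    # and keep their file name under raw/orders/.
    if len(rel.parts) == 2 and rel.parts[0] == "raw" and rel.suffix == ".json":
        return NEW_ROOT / "raw" / "orders" / rel.name

    return None
```

Returning `None` for anything outside the documented mapping keeps a migration script from silently inventing destinations for unexpected files.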
+
diff --git a/pm/tasks.org b/pm/tasks.org
index 0f7ad9b..a8e9d5e 100644
--- a/pm/tasks.org
+++ b/pm/tasks.org
@@ -16,7 +16,7 @@
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
 - date: 2026-03-14
 
-* [ ] t1.2: define grocery data model and file layout (1-2 commits)
+* [X] t1.2: define grocery data model and file layout (1-2 commits)
 ** acceptance criteria
 - decide and document the files/directories for:
   - retailer raw exports
@@ -33,8 +33,8 @@
 
 ** evidence
 - commit:
-- tests:
-- date:
+- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
+- date: 2026-03-15
 
 * [ ] t1.3: build giant parser/enricher from raw json (2-4 commits)
 ** acceptance criteria