From c13d14441848cc3ac697702f5cd9da492f71e3e0 Mon Sep 17 00:00:00 2001
From: ben
Date: Wed, 18 Mar 2026 14:02:36 -0400
Subject: [PATCH] cleanup

---
 pm/data-model.org | 167 ++++++++++++++++++++++++----------------------
 1 file changed, 88 insertions(+), 79 deletions(-)

diff --git a/pm/data-model.org b/pm/data-model.org
index 64e1c82..ca7963d 100644
--- a/pm/data-model.org
+++ b/pm/data-model.org
@@ -7,12 +7,12 @@ Goals:
 - Enable multiple data gathering methods
 - One layer for review and analysis
 
- ** Design Rules
+** Design Rules
 - Raw retailer exports remain the source of truth.
 - Retailer parsing is isolated to retailer-specific files and ids.
-- Cross-retailer product layers begin only after retailer-specific enrichment.
+- Cross-retailer product layers begin only after retailer-specific normalization.
 - CSV schemas are stable and additive: new columns may be appended, but
-  existing columns should not be repurposed.
+  existing columns should not be repurposed.
 - Unknown values should be left blank rather than guessed.
 
 *** Retailer-specific data:
@@ -22,63 +22,64 @@ Goals:
 - retailer category ids and names
 - retailer item names
 - retailer image urls
-- observed products scoped to one retailer
-
-*** Review/Combined data:
-- canonical products
-- observed-to-canonical links
-- human review state for unresolved cases
 - comparison-ready normalized quantity basis fields
+
+*** Review/Combined data:
+- catalog of reviewed products
+- links from normalized line items to catalog
+- human review state for unresolved cases
 
-// I don't like this terminology - what is "observed" doing for us?
-// output should be normalized_items, not observed
-// unless this is the way we're matching multiple upc's?
-Observed products are the boundary between retailer-specific parsing and
-cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
-should require knowledge of another retailer.
 
 * Pipeline
+Each step can be run alone if its dependents exist.
+Each retail provider script must produce deterministic line-item outputs and
+must not group items into product identities before review.
+
 Key:
 - (1) input
-- [2] output
-
-Each step can be run alone if its dependents exist.
-
+- [1] output
+
 ** 1. Collect
-Get raw receipt/visit and item data from a retailer. Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan). Preserve complete raw data and preserve fidelity. Avoid interpretation beyond basic data flattening.
+Get raw receipt/visit and item data from a retailer.
+Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan).
+Preserve complete raw data and preserve fidelity.
+Avoid interpretation beyond basic data flattening.
   - (1) Source access (Varies, eg header data, auth for API access)
   - [1] collected visits from each retailer
   - [2] collected items from each retailer
   - [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
 
 ** 2. Normalize
-Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer. Strictly dependent on Collect method and output.
+Parse and extract structured facts from retailer-specific raw data
+ to create a standardized item format for that retailer.
+Strictly dependent on Collect method and output.
   - Extract quantity, size, pack, pricing, variant
   - Add discount line items to product line items using upc/retail_item_id and concurrence
   - Cleanup naming to facilitate later matching
+  - Do not group line items into retailer-level product identities here
   - (1) collected items from each retailer
   - (2) collected visits from each retailer
   - [1] normalized items from each retailer
 
 ** 3. Review/Combine (Canonicalization)
-Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review. Create catalog linked to normalized items.
-  - Grouping the same item from retailer
+Decide whether two normalized retailer items are "the same product";
+ match items across retailers using algo/logic and human review.
+Create catalog linked to normalized items.
+  - Group line items only here, when review/combine logic explicitly decides they belong together
   - Asking human to create a canonical/catalog item with:
-    - friendly/canonical_name: "bell pepper"; "milk"
+    - friendly/catalog_name: "bell pepper"; "milk"
     - category: "produce"; "dairy"
     - product_type: "pepper"; "milk"
     - ? variant? "whole, "skim", "2pct"
+  - Then link the group of items to that catalog item.
   - (1) normalized items from each retailer
   - [1] review queue of items to be reviewed
-  - [2] catalog (lookup table) of confirmed retailer_item and canonical_name
-  - [3] canonical purchase list, pivot-ready
+  - [2] catalog (lookup table) of confirmed normalized line items and catalog_id
+  - [3] purchase list of normalized items , pivot-ready
 
 ** Unresolved Issues
 1. need central script to orchestrate; metadata belongs there and nowhere else
-
-** Symptoms
-- `LIME` and `LIME . / .` appearing in canonical_catalog:
-  - names must come from review-approved names, not raw strings
+2. `LIME` and `LIME . / .` appearing in the catalog: names must come from review-approved names, not raw strings
 
 * Directory Layout
 
@@ -104,12 +105,16 @@ data/
     collected_orders.csv
     normalized_items.csv
   review/
-    review_queue.csv # Human review queue for unresolved matching/parsing cases.
-    product_links.csv # Links from retailer-observed products to canonical products.
-  catalog.csv # Cross-retailer canonical product entities used for comparison.
+    review_queue.csv # Human review queue for unresolved matching/parsing cases.
+    product_links.csv # Links from normalized line items to catalog items.
+  catalog.csv # Cross-retailer product catalog entities used for comparison.
   purchases.csv
 #+end_example
 
+Notes:
+- The current repo still uses transitional root-level scripts and output folders.
+- This layout is the target structure for the refactor, not a claim that migration is already complete.
+
 * Schemas
 ** `data//collected_items.csv`
 One row per retailer line item.
@@ -140,7 +145,7 @@ One row per retailer line item.
 | `is_coupon_line` | coupon-like line flag when distinguishable |
 
 ** `data//collected_orders.csv`
-One row per order or visit.
+One row per order/visit/receipt.
 | key | definition |
 |---------------------------+-------------------------------------------------|
 | `retailer` PK | retailer slug such as `giant` |
@@ -167,8 +172,8 @@ One row per order or visit.
 
 ** `data//normalized_items.csv`
 One row per retailer line item after deterministic parsing. Preserve raw
-fields from `collected_items.csv` and add parsed fields plus retailer-level
-identity needed before cross-retailer review.
+fields from `collected_items.csv` and add parsed fields that make later review
+and grouping easier. Normalization does not assign product identity.
 
 | key | definition |
 |----------------------------+------------------------------------------------------------------|
@@ -176,8 +181,6 @@ identity needed before cross-retailer review.
 | `order_id` PK | retailer order id |
 | `line_no` PK | line number within order |
 | `normalized_row_id` | stable row key, typically `::` |
-| `normalized_item_id` | stable retailer-level item identity after deterministic grouping |
-| `normalization_basis` | basis used to assign `normalized_item_id` |
 | `retailer_item_id` | retailer-native item id |
 | `item_name` | raw retailer item name |
 | `item_name_norm` | normalized retailer item name |
@@ -189,6 +192,7 @@ identity needed before cross-retailer review.
 | `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
 | `normalized_quantity` | numeric comparison basis derived during normalization |
 | `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank |
+| `is_item` | item flag |
 | `is_store_brand` | store-brand guess |
 | `is_fee` | fee or non-product flag |
 | `is_discount_line` | discount or adjustment-line flag |
@@ -209,53 +213,54 @@ identity needed before cross-retailer review.
 | `parse_notes` | optional non-fatal parser notes |
 
 Notes:
-- `normalized_item_id` replaces the need for a core `observed_products.csv` layer.
-- `normalization_basis` should be explicit values like `exact_upc`, `retailer_item_id`, `name_size_pack`, or `manual_retailer_alias`.
-- Cross-retailer identity is still handled later in review/combine via `catalog.csv` and `product_links.csv`.
+- `normalized_row_id` is the only required identity at this stage.
+- Many normalized rows may later be grouped together during review/combine, but that grouping is not persisted here.
+- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
+- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.
 
 ** `data/review/product_links.csv`
-One row per observed-to-canonical relationship.
-1 (catalog_item) to many (normalized_items)
+One row per review-approved link from a normalized row to a catalog item.
+Many normalized rows may link to the same catalog item.
 
-| key | definition |
-|-------------------+---------------------------------------------|
-| `observed_id` PK | retailer observed product id |
-| `catalog_id` PK | linked canonical product id |
-| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
-| `link_confidence` | optional confidence label |
-| `review_status` | `pending`, `approved`, `rejected`, or blank |
-| `reviewed_by` | reviewer id or initials |
-| `reviewed_at` | review timestamp or date |
-| `link_notes` | optional notes |
+| key | definition |
+|-------------------------+---------------------------------------------|
+| `normalized_row_id` PK | normalized retailer line-item id |
+| `catalog_id` PK | linked catalog product id |
+| `link_method` | `manual`, `exact_upc`, `exact_name_size`, etc. |
+| `link_confidence` | optional confidence label |
+| `review_status` | `pending`, `approved`, `rejected`, or blank |
+| `reviewed_by` | reviewer id or initials |
+| `reviewed_at` | review timestamp or date |
+| `link_notes` | optional notes |
 
 ** `data/review/review_queue.csv`
 One row per issue needing human review.
-| key | definition |
-|-----------------------+-----------------------------------------------------|
-| `review_id` PK | stable review row id |
-| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
-| `retailer` | retailer slug when applicable |
-| `observed_product_id` | observed product id when applicable |
-| `catalod_id` | candidate canonical id when applicable |
-| `reason_code` | machine-readable review reason |
-| `priority` | optional priority label |
-| `raw_item_names` | compact list of example raw names |
-| `normalized_names` | compact list of example normalized names |
-| `upc` | example UPC/PLU |
-| `image_url` | example image url |
-| `example_prices` | compact list of example prices |
-| `seen_count` | count of related rows |
-| `status` | `pending`, `approved`, `rejected`, `deferred` |
-| `resolution_notes` | reviewer notes |
-| `created_at` | creation timestamp or date |
-| `updated_at` | last update timestamp or date |
+| key | definition |
+|----------------------+-----------------------------------------------------|
+| `review_id` PK | stable review row id |
+| `queue_type` | `link_candidate`, `parse_issue`, `catalog_cleanup` |
+| `retailer` | retailer slug when applicable |
+| `normalized_row_id` | normalized row id when review is row-specific |
+| `catalog_id` | candidate canonical id |
+| `reason_code` | machine-readable review reason |
+| `priority` | optional priority label |
+| `raw_item_names` | compact list of example raw names |
+| `normalized_names` | compact list of example normalized names |
+| `upc` | example UPC/PLU |
+| `image_url` | example image url |
+| `example_prices` | compact list of example prices |
+| `seen_count` | count of related rows |
+| `status` | `pending`, `approved`, `rejected`, `deferred` |
+| `resolution_notes` | reviewer notes |
+| `created_at` | creation timestamp or date |
+| `updated_at` | last update timestamp or date |
 
 ** `data/catalog.csv`
-One row per cross-retailer canonical product.
+One row per cross-retailer catalog product.
 | key | definition |
 |----------------------------+----------------------------------------|
-| `catalog_id` PK | stable canonical product id |
-| `catalog_name` | canonical human-readable name |
+| `catalog_id` PK | stable catalog product id |
+| `catalog_name` | human-reviewed product name |
 | `product_type` | generic product eg `apple`, `milk` |
 | `category` | broad section eg `produce`, `dairy` |
 | `brand` | canonical brand when applicable |
@@ -270,6 +275,11 @@ One row per cross-retailer canonical product.
 | `created_at` | creation timestamp or date |
 | `updated_at` | last update timestamp or date |
 
+Notes:
+- Do not auto-create new catalog rows from weak normalized names alone.
+- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
+- `catalog_name` should come from review-approved naming, not raw retailer strings.
+
 ** `data/purchases.csv`
 One row per purchased item (i.e., `row_type=item` from normalized layer), with
 catalog attributes denormalized in and discounts already applied.
@@ -281,9 +291,8 @@ catalog attributes denormalized in and discounts already applied.
 | `order_id` | retailer order id |
 | `line_no` | line number within order |
 | `normalized_row_id` | `::` |
-| `normalized_item_id` | retailer-level normalized item identity |
-| `catalog_id` | linked canonical product id |
-| `catalog_name` | canonical product name for analysis |
+| `catalog_id` | linked catalog product id |
+| `catalog_name` | catalog product name for analysis |
 | `catalog_product_type` | broader product family (e.g., `egg`, `milk`) |
 | `catalog_category` | category such as `produce`, `dairy` |
 | `catalog_brand` | canonical brand when applicable |
@@ -319,7 +328,7 @@ catalog attributes denormalized in and discounts already applied.
 | `raw_order_path` | relative path to original order payload |
 
 Notes:
-- Only rows with `row_type=item` from normalization should appear here.
+- Only rows that represent purchased items should appear here.
 - `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
 - catalog fields are denormalized in to make pivoting trivial.
 - no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
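
Review note (not part of the patch): a minimal sketch of how the `purchases.csv` rules above might be applied, assuming rows are already loaded as dicts. The function name `build_purchases`, the `"true"`/`"false"` flag encoding, the sample ids, and the rule that only `approved` links count are illustrative assumptions, not something the patch specifies.

```python
def build_purchases(normalized_rows, links, catalog):
    """Join purchased-item rows to review-approved catalog links.

    Assumptions (hypothetical, not from the patch): `is_item` is stored as
    the string "true"/"false", and only links whose `review_status` is
    "approved" are treated as confirmed.
    """
    catalog_by_id = {row["catalog_id"]: row for row in catalog}
    approved = {
        link["normalized_row_id"]: link["catalog_id"]
        for link in links
        if link["review_status"] == "approved"
    }
    purchases = []
    for row in normalized_rows:
        if row["is_item"] != "true":
            continue  # fee, discount, and coupon rows never reach purchases
        catalog_id = approved.get(row["normalized_row_id"], "")
        entry = catalog_by_id.get(catalog_id, {})
        purchases.append({
            "normalized_row_id": row["normalized_row_id"],
            "catalog_id": catalog_id,  # left blank when no approved link exists
            "catalog_name": entry.get("catalog_name", ""),
        })
    return purchases


# Hypothetical sample data; the ids are placeholders, not the real id format.
normalized_rows = [
    {"normalized_row_id": "row-1", "is_item": "true"},
    {"normalized_row_id": "row-2", "is_item": "false"},
    {"normalized_row_id": "row-3", "is_item": "true"},
]
links = [
    {"normalized_row_id": "row-1", "catalog_id": "cat-1", "review_status": "approved"},
    {"normalized_row_id": "row-3", "catalog_id": "cat-1", "review_status": "pending"},
]
catalog = [{"catalog_id": "cat-1", "catalog_name": "lime"}]

purchases = build_purchases(normalized_rows, links, catalog)
```

Unlinked items keep blank catalog fields rather than a guessed name, which follows the design rule that unknown values are left blank.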