cleanup

2026-03-18 14:02:36 -04:00
parent 10aad05808
commit c13d144418
1 changed files with 88 additions and 79 deletions
--- a/pm/data-model.org
+++ b/pm/data-model.org
@@ -7,12 +7,12 @@ Goals:
 - Enable multiple data gathering methods
 - One layer for review and analysis  

- ** Design Rules
+** Design Rules
 - Raw retailer exports remain the source of truth.
 - Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific enrichment.
+- Cross-retailer product layers begin only after retailer-specific normalization.
 - CSV schemas are stable and additive: new columns may be appended, but
-  existing columns should not be repurposed.
+   existing columns should not be repurposed.
 - Unknown values should be left blank rather than guessed.

 *** Retailer-specific data:
@@ -22,63 +22,64 @@ Goals:
 - retailer category ids and names
 - retailer item names
 - retailer image urls
- observed products scoped to one retailer
-
-*** Review/Combined data:
- canonical products
- observed-to-canonical links
- human review state for unresolved cases
 - comparison-ready normalized quantity basis fields
+  
+*** Review/Combined data:
+- catalog of reviewed products
+- links from normalized line items to catalog
+- human review state for unresolved cases

-// I don't like this terminology - what is "observed" doing for us?
-// output should be normalized_items, not observed
-// unless this is the way we're matching multiple upc's?
-Observed products are the boundary between retailer-specific parsing and
-cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
-should require knowledge of another retailer.

 * Pipeline
+Each step can be run alone if its dependencies exist.
+Each retail provider script must produce deterministic line-item outputs and
+must not group items into product identities before review.
+
 Key: 
 - (1) input
- [2] output
-
-Each step can be run alone if its dependents exist.
-
+- [1] output
+ 
 ** 1. Collect
-Get raw receipt/visit and item data from a retailer.  Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan).  Preserve complete raw data and preserve fidelity.  Avoid interpretation beyond basic data flattening.
+Get raw receipt/visit and item data from a retailer.
+Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan).
+Preserve complete raw data and preserve fidelity.
+Avoid interpretation beyond basic data flattening.
 - (1) Source access (Varies, eg header data, auth for API access)
 - [1] collected visits from each retailer
 - [2] collected items from each retailer
 - [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
   
 ** 2. Normalize
-Parse and extract structured facts from retailer-specific raw data to create a standardized item format for that retailer.  Strictly dependent on Collect method and output.
+Parse and extract structured facts from retailer-specific raw data
+  to create a standardized item format for that retailer.
+Strictly dependent on Collect method and output.
 - Extract quantity, size, pack, pricing, variant
 - Add discount line items to product line items using upc/retailer_item_id and concurrence
 - Clean up naming to facilitate later matching
+ - Do not group line items into retailer-level product identities here
 - (1) collected items from each retailer
 - (2) collected visits from each retailer
 - [1] normalized items from each retailer

 ** 3. Review/Combine (Canonicalization)
-Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review.  Create catalog linked to normalized items.
- - Grouping the same item from retailer
+Decide whether two normalized retailer items are "the same product";
+ match items across retailers using algo/logic and human review.
+Create catalog linked to normalized items.
+ - Group line items only here, when review/combine logic explicitly decides they belong together
 - Asking human to create a canonical/catalog item with:
-   - friendly/canonical_name: "bell pepper"; "milk"
+   - friendly/catalog_name: "bell pepper"; "milk"
   - category: "produce"; "dairy"
   - product_type: "pepper"; "milk"
   - ? variant? "whole", "skim", "2pct"
+ - Then link the group of items to that catalog item.
 - (1) normalized items from each retailer
 - [1] review queue of items to be reviewed
- - [2] catalog (lookup table) of confirmed retailer_item and canonical_name
- - [3] canonical purchase list, pivot-ready
+ - [2] catalog (lookup table) of confirmed normalized line items and catalog_id
+ - [3] purchase list of normalized items, pivot-ready
   
 ** Unresolved Issues
 1. need central script to orchestrate; metadata belongs there and nowhere else
-
-** Symptoms
- `LIME` and `LIME . / .` appearing in canonical_catalog:
-  - names must come from review-approved names, not raw strings
+2. `LIME` and `LIME . / .` appearing in the catalog: names must come from review-approved names, not raw strings


 * Directory Layout
@@ -104,12 +105,16 @@ data/
    collected_orders.csv
    normalized_items.csv
  review/
-    review_queue.csv #  Human review queue for unresolved matching/parsing cases.
-    product_links.csv # Links from retailer-observed products to canonical products.
-  catalog.csv  # Cross-retailer canonical product entities used for comparison.
+    review_queue.csv # Human review queue for unresolved matching/parsing cases.
+    product_links.csv # Links from normalized line items to catalog items.
+  catalog.csv  # Cross-retailer product catalog entities used for comparison.
  purchases.csv
 #+end_example

+Notes:
+- The current repo still uses transitional root-level scripts and output folders.
+- This layout is the target structure for the refactor, not a claim that migration is already complete.
+
 * Schemas
 ** `data/<retailer-method>/collected_items.csv`
 One row per retailer line item.
@@ -140,7 +145,7 @@ One row per retailer line item.
 | `is_coupon_line`   | coupon-like line flag when distinguishable |

 ** `data/<retailer-method>/collected_orders.csv`
-One row per order or visit.
+One row per order/visit/receipt.
 | key                       | definition                                      |
 |---------------------------+-------------------------------------------------|
 | `retailer` PK             | retailer slug such as `giant`                   |
@@ -167,8 +172,8 @@ One row per order or visit.

 ** `data/<retailer-method>/normalized_items.csv`
 One row per retailer line item after deterministic parsing. Preserve raw
-fields from `collected_items.csv` and add parsed fields plus retailer-level
-identity needed before cross-retailer review.
+fields from `collected_items.csv` and add parsed fields that make later review
+and grouping easier. Normalization does not assign product identity.

 | key                        | definition                                                       |
 |----------------------------+------------------------------------------------------------------|
@@ -176,8 +181,6 @@ identity needed before cross-retailer review.
 | `order_id` PK              | retailer order id                                                |
 | `line_no` PK               | line number within order                                         |
 | `normalized_row_id`        | stable row key, typically `<retailer>:<order_id>:<line_no>`      |
-| `normalized_item_id`       | stable retailer-level item identity after deterministic grouping |
-| `normalization_basis`      | basis used to assign `normalized_item_id`                        |
 | `retailer_item_id`         | retailer-native item id                                          |
 | `item_name`                | raw retailer item name                                           |
 | `item_name_norm`           | normalized retailer item name                                    |
@@ -189,6 +192,7 @@ identity needed before cross-retailer review.
 | `measure_type`             | `each`, `weight`, `volume`, `count`, or blank                    |
 | `normalized_quantity`      | numeric comparison basis derived during normalization            |
 | `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank                 |
+| `is_item`                  | item flag                                                        |
 | `is_store_brand`           | store-brand guess                                                |
 | `is_fee`                   | fee or non-product flag                                          |
 | `is_discount_line`         | discount or adjustment-line flag                                 |
@@ -209,53 +213,54 @@ identity needed before cross-retailer review.
 | `parse_notes`              | optional non-fatal parser notes                                  |

 Notes:
- `normalized_item_id` replaces the need for a core `observed_products.csv` layer.
- `normalization_basis` should be explicit values like `exact_upc`, `retailer_item_id`, `name_size_pack`, or `manual_retailer_alias`.
- Cross-retailer identity is still handled later in review/combine via `catalog.csv` and `product_links.csv`.
+- `normalized_row_id` is the only required identity at this stage.
+- Many normalized rows may later be grouped together during review/combine, but that grouping is not persisted here.
+- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
+- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.
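
The row-key convention above can be sketched in Python. This is a minimal illustration, not the project's code: the helper names and the sample order ids are hypothetical, and it assumes the "typical" `<retailer>:<order_id>:<line_no>` shape named in the schema.

```python
def make_normalized_row_id(retailer: str, order_id: str, line_no: int) -> str:
    """Build a row key in the `<retailer>:<order_id>:<line_no>` shape."""
    return f"{retailer}:{order_id}:{line_no}"


def assert_unique_row_ids(rows: list[dict]) -> list[str]:
    """Return one id per row, raising if two rows collide on the same key."""
    ids = [
        make_normalized_row_id(r["retailer"], r["order_id"], r["line_no"])
        for r in rows
    ]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate normalized_row_id detected")
    return ids


# Hypothetical sample rows; only the three PK columns matter here.
sample_rows = [
    {"retailer": "giant", "order_id": "A100", "line_no": 1},
    {"retailer": "giant", "order_id": "A100", "line_no": 2},
]
row_ids = assert_unique_row_ids(sample_rows)
```

Because the key is derived only from the three PK columns, re-running normalization on the same collected data should yield the same ids, which is what lets later review links stay valid.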

 ** `data/review/product_links.csv`
-One row per observed-to-canonical relationship.
-1 (catalog_item) to many (normalized_items)
+One row per review-approved link from a normalized row to a catalog item.
+Many normalized rows may link to the same catalog item.

-| key               | definition                                  |
-|-------------------+---------------------------------------------|
-| `observed_id` PK  | retailer observed product id                |
-| `catalog_id` PK   | linked canonical product id                 |
-| `link_method`     | `manual`, `exact_upc`, `exact_name`, etc.   |
-| `link_confidence` | optional confidence label                   |
-| `review_status`   | `pending`, `approved`, `rejected`, or blank |
-| `reviewed_by`     | reviewer id or initials                     |
-| `reviewed_at`     | review timestamp or date                    |
-| `link_notes`      | optional notes                              |
+| key                     | definition                                     |
+|-------------------------+------------------------------------------------|
+| `normalized_row_id` PK  | normalized retailer line-item id               |
+| `catalog_id` PK         | linked catalog product id                      |
+| `link_method`           | `manual`, `exact_upc`, `exact_name_size`, etc. |
+| `link_confidence`       | optional confidence label                      |
+| `review_status`         | `pending`, `approved`, `rejected`, or blank    |
+| `reviewed_by`           | reviewer id or initials                        |
+| `reviewed_at`           | review timestamp or date                       |
+| `link_notes`            | optional notes                                 |
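
As a rough sketch of the many-to-one relationship described above, the following Python groups approved links by `catalog_id`. The sample rows, the `cat_*` ids, the `costco` retailer slug, and the function name are all hypothetical; only the column names come from the schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of product_links.csv (subset of columns).
SAMPLE_LINKS = """normalized_row_id,catalog_id,link_method,review_status
giant:A100:1,cat_lime,exact_upc,approved
costco:B200:3,cat_lime,manual,approved
giant:A101:2,cat_milk,manual,pending
"""


def group_approved_links(csv_text: str) -> dict[str, list[str]]:
    """Map each catalog_id to the normalized rows approved for it."""
    grouped = defaultdict(list)
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["normalized_row_id"], row["catalog_id"])
        if key in seen:
            # The two PK columns together must be unique.
            raise ValueError(f"duplicate link: {key}")
        seen.add(key)
        if row["review_status"] == "approved":
            grouped[row["catalog_id"]].append(row["normalized_row_id"])
    return dict(grouped)


links_by_catalog = group_approved_links(SAMPLE_LINKS)
```

Pending or rejected links are skipped here, so grouping only reflects decisions a reviewer has confirmed.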

 ** `data/review/review_queue.csv`
 One row per issue needing human review.

-| key                   | definition                                          |
-|-----------------------+-----------------------------------------------------|
-| `review_id` PK        | stable review row id                                |
-| `queue_type`          | `observed_product`, `link_candidate`, `parse_issue` |
-| `retailer`            | retailer slug when applicable                       |
-| `observed_product_id` | observed product id when applicable                 |
-| `catalod_id`          | candidate canonical id when applicable              |
-| `reason_code`         | machine-readable review reason                      |
-| `priority`            | optional priority label                             |
-| `raw_item_names`      | compact list of example raw names                   |
-| `normalized_names`    | compact list of example normalized names            |
-| `upc`                 | example UPC/PLU                                     |
-| `image_url`           | example image url                                   |
-| `example_prices`      | compact list of example prices                      |
-| `seen_count`          | count of related rows                               |
-| `status`              | `pending`, `approved`, `rejected`, `deferred`       |
-| `resolution_notes`    | reviewer notes                                      |
-| `created_at`          | creation timestamp or date                          |
-| `updated_at`          | last update timestamp or date                       |
+| key                  | definition                                          |
+|----------------------+-----------------------------------------------------|
+| `review_id` PK       | stable review row id                                |
+| `queue_type`         | `link_candidate`, `parse_issue`, `catalog_cleanup`  |
+| `retailer`           | retailer slug when applicable                       |
+| `normalized_row_id`  | normalized row id when review is row-specific       |
+| `catalog_id`         | candidate catalog id                                |
+| `reason_code`        | machine-readable review reason                      |
+| `priority`           | optional priority label                             |
+| `raw_item_names`     | compact list of example raw names                   |
+| `normalized_names`   | compact list of example normalized names            |
+| `upc`                | example UPC/PLU                                     |
+| `image_url`          | example image url                                   |
+| `example_prices`     | compact list of example prices                      |
+| `seen_count`         | count of related rows                               |
+| `status`             | `pending`, `approved`, `rejected`, `deferred`       |
+| `resolution_notes`   | reviewer notes                                      |
+| `created_at`         | creation timestamp or date                          |
+| `updated_at`         | last update timestamp or date                       |
 ** `data/catalog.csv`
-One row per cross-retailer canonical product.
+One row per cross-retailer catalog product.
 | key                        | definition                             |
 |----------------------------+----------------------------------------|
-| `catalog_id` PK            | stable canonical product id            |
-| `catalog_name`             | canonical human-readable name          |
+| `catalog_id` PK            | stable catalog product id              |
+| `catalog_name`             | human-reviewed product name            |
 | `product_type`             | generic product eg `apple`, `milk`     |
 | `category`                 | broad section eg `produce`, `dairy`    |
 | `brand`                    | canonical brand when applicable        |
@@ -270,6 +275,11 @@ One row per cross-retailer canonical product.
 | `created_at`               | creation timestamp or date             |
 | `updated_at`               | last update timestamp or date          |

+Notes:
+- Do not auto-create new catalog rows from weak normalized names alone.
+- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
+- `catalog_name` should come from review-approved naming, not raw retailer strings.
+
 ** `data/purchases.csv`
 One row per purchased item (i.e., `row_type=item` from normalized layer), with
 catalog attributes denormalized in and discounts already applied.
@@ -281,9 +291,8 @@ catalog attributes denormalized in and discounts already applied.
 | `order_id`                 | retailer order id                                              |
 | `line_no`                  | line number within order                                       |
 | `normalized_row_id`        | `<retailer>:<order_id>:<line_no>`                              |
-| `normalized_item_id`       | retailer-level normalized item identity                        |
-| `catalog_id`               | linked canonical product id                                    |
-| `catalog_name`             | canonical product name for analysis                            |
+| `catalog_id`               | linked catalog product id                                      |
+| `catalog_name`             | catalog product name for analysis                              |
 | `catalog_product_type`     | broader product family (e.g., `egg`, `milk`)                   |
 | `catalog_category`         | category such as `produce`, `dairy`                            |
 | `catalog_brand`            | canonical brand when applicable                                |
@@ -319,7 +328,7 @@ catalog attributes denormalized in and discounts already applied.
 | `raw_order_path`           | relative path to original order payload                        |

 Notes:
- Only rows with `row_type=item` from normalization should appear here.
+- Only rows that represent purchased items should appear here.
 - `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
 - catalog fields are denormalized in to make pivoting trivial.
 - no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
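
The notes above can be illustrated with a small Python sketch that builds purchase rows from normalized rows, approved links, and the catalog. Several details are assumptions rather than documented behavior: flags are encoded as the strings `true`/`false`, `matched_discount_amount` is stored as a positive amount, and `net_line_total` is computed as `line_total` minus that amount. The function name and sample values are hypothetical.

```python
from decimal import Decimal


def build_purchases(normalized_rows, links, catalog):
    """Join item rows to catalog attributes and compute net totals."""
    approved = {
        link["normalized_row_id"]: link["catalog_id"]
        for link in links
        if link["review_status"] == "approved"
    }
    purchases = []
    for row in normalized_rows:
        if row["is_discount_line"] == "true":
            continue  # discount rows never appear in purchases
        catalog_id = approved.get(row["normalized_row_id"], "")
        entry = catalog.get(catalog_id, {})
        line_total = Decimal(row["line_total"])
        # Assumption: blank means no matched discount.
        discount = Decimal(row["matched_discount_amount"] or "0")
        purchases.append({
            "normalized_row_id": row["normalized_row_id"],
            "catalog_id": catalog_id,  # left blank when unlinked
            "catalog_name": entry.get("catalog_name", ""),
            "line_total": str(line_total),
            "matched_discount_amount": str(discount),
            "net_line_total": str(line_total - discount),
        })
    return purchases


# Hypothetical sample data.
normalized = [
    {"normalized_row_id": "giant:A100:1", "line_total": "3.99",
     "matched_discount_amount": "1.00", "is_discount_line": "false"},
    {"normalized_row_id": "giant:A100:2", "line_total": "-1.00",
     "matched_discount_amount": "", "is_discount_line": "true"},
]
links = [{"normalized_row_id": "giant:A100:1", "catalog_id": "cat_lime",
          "review_status": "approved"}]
catalog = {"cat_lime": {"catalog_name": "lime"}}
purchases = build_purchases(normalized, links, catalog)
```

Rows without an approved link keep blank catalog fields instead of a guessed match, in line with the design rule that unknown values stay blank.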