From 2e5d69c75ee7ca0f9f2d2325cfb1d13879d4cc5b Mon Sep 17 00:00:00 2001
From: eulaly
Date: Fri, 20 Mar 2026 09:55:46 -0400
Subject: [PATCH] added 14.2 and 14.3 for refactor prep

---
 pm/tasks.org | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/pm/tasks.org b/pm/tasks.org
index 5f7bd7c..78c5aa6 100644
--- a/pm/tasks.org
+++ b/pm/tasks.org
@@ -546,6 +546,78 @@ make Giant and Costco emit the shared normalized line-item schema without introd
 - `normalized_item_id` is always present, but it only collapses repeated rows when the evidence is strong; otherwise it falls back to row-level identity via `normalized_row_id`.
 - Added `normalize_*` entry points for the new data-model layout while leaving the legacy `enrich_*` commands available during the transition.
 
+* [ ] t1.14.2: finalize filesystem and schema alignment for the refactor (2-4 commits)
+bring on-disk outputs fully into the target `data/` structure without changing retailer behavior
+
+** Acceptance Criteria
+1. retailer data directories conform to pm/data-model.org:
+   - `data/giant-web/raw/...`
+   - `data/giant-web/collected_orders.csv`
+   - `data/giant-web/collected_items.csv`
+   - `data/giant-web/normalized_items.csv`
+   - `data/costco-web/raw/...`
+   - `data/costco-web/collected_orders.csv`
+   - `data/costco-web/collected_items.csv`
+   - `data/costco-web/normalized_items.csv`
+2. review/combine outputs are moved or rewritten into the target review paths:
+   - `data/review/review_queue.csv`
+   - `data/review/product_links.csv`
+   - `data/review/review_resolutions.csv`
+   - `data/review/purchases.csv`
+   - `data/review/pipeline_status.csv`
+   - `data/review/pipeline_status.json`
+3. old transitional output paths are either:
+   - removed from active script defaults, or
+   - left as explicit compatibility shims with clear deprecation notes
+4. no recollection is required if existing raw files and collected csvs can be moved/copied losslessly into the new structure
+5. no schema information is lost during the move:
+   - raw paths still resolve
+   - collected/normalized csvs still open with the expected headers
+6. README and task/docs reflect the final active paths
+- pm note: prefer moving/adapting existing files over recollecting from retailers unless a real data loss or schema mismatch forces recollection
+- pm note: this is a structure-alignment task, not a retailer parsing task
+
+** evidence
+- commit:
+- tests:
+- datetime:
+
+** notes
+
+* [ ] t1.14.3: retailer-specific Costco normalization cleanup (2-4 commits)
+tighten Costco-specific normalization so normalized item names are cleaner and deterministic retailer grouping is less noisy
+
+** Acceptance Criteria
+1. improve Costco item-name cleanup for obvious non-identity noise, such as:
+   - trailing slash fragments
+   - code tokens and receipt-format artifacts
+   - duplicated measurement fragments already captured in structured fields
+2. preserve deterministic normalization rules only:
+   - exact retailer_item_id
+   - exact cleaned name + same size/pack when needed
+   - approved retailer alias
+   - no fuzzy or semantic matching
+3. normalized Costco names improve on known bad examples, e.g.:
+   - `MANDARIN /` -> cleaner normalized item name
+   - `LIFE 6'TABLE ... /` -> cleaner normalized item name
+4. cleanup does not overwrite retailer truth:
+   - raw `item_name` is unchanged
+   - parsed `size_value`, `size_unit`, `pack_qty`, and pricing fields remain intact
+5. discount-row behavior remains correct:
+   - matched discount rows still populate `matched_discount_amount`
+   - `net_line_total` remains correct
+   - discount rows remain auditable
+6. add regression tests for the cleaned Costco examples and any new parsing rules
+- pm note: keep this explicitly Costco-specific; do not introduce a generic cleanup framework
+- pm note: prefer a short allowlist/blocklist of known receipt artifacts over broad heuristics
+
+** evidence
+- commit:
+- tests:
+- datetime:
+
+** notes
+
 * [ ] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
 replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items
 
@@ -595,7 +667,6 @@ replace the old observed/canonical workflow with a review-first pipeline that us
 
 ** notes
 
-
 * [ ] 1t.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
 
 ** acceptance criteria