added 14.2 and 14.3 for refactor prep

This commit is contained in:
2026-03-20 09:55:46 -04:00
parent 3c2462845b
commit 2e5d69c75e

View File

@@ -546,6 +546,78 @@ make Giant and Costco emit the shared normalized line-item schema without introd
- `normalized_item_id` is always present, but it only collapses repeated rows when the evidence is strong; otherwise it falls back to row-level identity via `normalized_row_id`.
- Added `normalize_*` entry points for the new data-model layout while leaving the legacy `enrich_*` commands available during the transition.
* [ ] t1.14.2: finalize filesystem and schema alignment for the refactor (2-4 commits)
bring on-disk outputs fully into the target `data/` structure without changing retailer behavior
** Acceptance Criteria
1. retailer data directories conform to pm/data-model.org:
- `data/giant-web/raw/...`
- `data/giant-web/collected_orders.csv`
- `data/giant-web/collected_items.csv`
- `data/giant-web/normalized_items.csv`
- `data/costco-web/raw/...`
- `data/costco-web/collected_orders.csv`
- `data/costco-web/collected_items.csv`
- `data/costco-web/normalized_items.csv`
2. review/combine outputs are moved or rewritten into the target review paths:
- `data/review/review_queue.csv`
- `data/review/product_links.csv`
- `data/review/review_resolutions.csv`
- `data/review/purchases.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
3. old transitional output paths are either:
- removed from active script defaults, or
- left as explicit compatibility shims with clear deprecation notes
4. no recollection is required if existing raw files and collected csvs can be moved/copied losslessly into the new structure
5. no schema information is lost during the move:
- raw paths still resolve
- collected/normalized csvs still open with the expected headers
6. README and task/docs reflect the final active paths
- pm note: prefer moving/adapting existing files over recollecting from retailers unless a real data loss or schema mismatch forces recollection
- pm note: this is a structure-alignment task, not a retailer parsing task
** evidence
- commit:
- tests:
- datetime:
** notes
* [ ] t1.14.3: retailer-specific Costco normalization cleanup (2-4 commits)
tighten Costco-specific normalization so normalized item names are cleaner and deterministic retailer grouping is less noisy
** Acceptance Criteria
1. improve Costco item-name cleanup for obvious non-identity noise, such as:
- trailing slash fragments
- code tokens and receipt-format artifacts
- duplicated measurement fragments already captured in structured fields
2. preserve deterministic normalization rules only:
- exact retailer_item_id
- exact cleaned name + same size/pack when needed
- approved retailer alias
- no fuzzy or semantic matching
3. normalized Costco names improve on known bad examples, e.g.:
- `MANDARIN /` -> cleaner normalized item name
- `LIFE 6'TABLE ... /` -> cleaner normalized item name
4. cleanup does not overwrite retailer truth:
- raw `item_name` is unchanged
- parsed `size_value`, `size_unit`, `pack_qty`, and pricing fields remain intact
5. discount-row behavior remains correct:
- matched discount rows still populate `matched_discount_amount`
- `net_line_total` remains correct
- discount rows remain auditable
6. add regression tests for the cleaned Costco examples and any new parsing rules
- pm note: keep this explicitly Costco-specific; do not introduce a generic cleanup framework
- pm note: prefer a short allowlist/blocklist of known receipt artifacts over broad heuristics
** evidence
- commit:
- tests:
- datetime:
** notes
* [ ] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items
@@ -595,7 +667,6 @@ replace the old observed/canonical workflow with a review-first pipeline that us
** notes
* [ ] 1t.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
** acceptance criteria