Refactor retailer normalization outputs

ben
2026-03-18 15:46:20 -04:00
parent ec1f36a140
commit 9064de5f67
7 changed files with 222 additions and 11 deletions


@@ -502,7 +502,7 @@ move Giant and Costco collection into the new collect structure and make both re
- Added lightweight deprecation nudges on the legacy `scrape_*` commands rather than removing them immediately, so the move is inspectable and low-risk.
- The main schema fix was on Giant collection, which was missing retailer/provenance/audit fields that Costco collection already carried.
-* [ ] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)
+* [X] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)
make Giant and Costco emit the shared normalized line-item schema without introducing cross-retailer identity logic
** Acceptance Criteria
@@ -538,10 +538,13 @@ make Giant and Costco emit the shared normalized line-item schema without introd
- pm note: normalized_item_id is the only retailer-level grouping identity; do not introduce observed_products or a second grouping artifact
** evidence
- commit:
-- tests:
-- datetime:
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python enrich_giant.py --help`; `./venv/bin/python enrich_costco.py --help`
+- datetime: 2026-03-18
** notes
+- Kept the existing Giant and Costco parsing logic intact and added the new normalized schema fields in place, rather than rewriting the enrichers from scratch.
+- `normalized_item_id` is always present, but it only collapses repeated rows when the evidence is strong; otherwise it falls back to row-level identity via `normalized_row_id`.
+- Added `normalize_*` entry points for the new data-model layout while leaving the legacy `enrich_*` commands available during the transition.
* [ ] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items
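
The t1.14.1 notes say that `normalized_item_id` is always present but only collapses repeated rows when the evidence is strong, falling back to `normalized_row_id` otherwise. A minimal sketch of that fallback might look like the following. Only `normalized_item_id` and `normalized_row_id` come from the commit; the `retailer` and `retailer_item_code` fields, the `_digest` helper, the id string layout, and the rule that "strong evidence" means a non-empty retailer item code are all assumptions made for illustration, not the repository's actual logic.

```python
import hashlib


def _digest(*parts: str) -> str:
    """Short stable hash of the given parts (hypothetical helper)."""
    joined = "|".join(parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]


def assign_normalized_item_id(row: dict) -> dict:
    """Return a copy of `row` with `normalized_item_id` always populated.

    ASSUMPTION: "strong evidence" is modeled as a non-empty
    `retailer_item_code`; the commit does not state the real criterion.
    """
    out = dict(row)
    retailer = row["retailer"]
    code = (row.get("retailer_item_code") or "").strip()
    if code:
        # Strong evidence: rows with the same retailer and code share one id.
        out["normalized_item_id"] = f"{retailer}:item:{_digest(retailer, code)}"
    else:
        # Weak evidence: reuse the row-level identity so nothing is merged.
        out["normalized_item_id"] = f"{retailer}:row:{row['normalized_row_id']}"
    return out


# Illustrative rows; the values are made up.
rows = [
    {"retailer": "giant", "normalized_row_id": "r1", "retailer_item_code": "12345"},
    {"retailer": "giant", "normalized_row_id": "r2", "retailer_item_code": "12345"},
    {"retailer": "giant", "normalized_row_id": "r3", "retailer_item_code": ""},
]
result = [assign_normalized_item_id(r) for r in rows]
```

In this sketch the first two rows end up with the same `normalized_item_id`, while the third keeps an id derived from its own `normalized_row_id`. Because the id is scoped by retailer, it stays a retailer-level grouping and does not introduce cross-retailer identity.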