data-model refactor and prep scope

2026-03-18 13:08:28 -04:00
parent 9122821db1
commit 10aad05808
3 changed files with 538 additions and 267 deletions
--- a/README.md
+++ b/README.md
@@ -12,6 +12,7 @@ Run each script step-by-step from the terminal.
 4. `enrich_costco.py`: normalize Costco line items
 5. `build_purchases.py`: combine retailer outputs into one purchase table
 6. `review_products.py`: review unresolved product matches in the terminal
+7. `report_pipeline_status.py`: show how many rows survive each stage

 ## Requirements

@@ -31,6 +32,7 @@ pip install -r requirements.txt
 Current version works best with `.env` in the project root.  The scraper will prompt for these values if they are not found in the current browser session.  
 - `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
 - `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
+- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.

 ```env
 GIANT_USER_ID=...
@@ -53,6 +55,8 @@ python enrich_costco.py
 python build_purchases.py
 python review_products.py
 python build_purchases.py
+python review_products.py --refresh-only
+python report_pipeline_status.py
 ```

 Why run `build_purchases.py` twice:
@@ -66,6 +70,12 @@ If you only want to refresh the queue without reviewing interactively:
 python review_products.py --refresh-only
 ```

+If you want a quick stage-by-stage accountability check:
+
+```bash
+python report_pipeline_status.py
+```
+
 ## Key Outputs

 Giant:
@@ -77,6 +87,7 @@ Costco:
 - `costco_output/orders.csv`
 - `costco_output/items.csv`
 - `costco_output/items_enriched.csv`
+- `costco_output/items_enriched.csv` now preserves raw totals and matched net discount fields

 Combined:
 - `combined_output/purchases.csv`
@@ -85,6 +96,8 @@ Combined:
 - `combined_output/canonical_catalog.csv`
 - `combined_output/product_links.csv`
 - `combined_output/comparison_examples.csv`
+- `combined_output/pipeline_status.csv`
+- `combined_output/pipeline_status.json`

 ## Review Workflow

@@ -95,9 +108,14 @@ Run `review_products.py` to clean up unresolved or weakly unified items:
 - skip it for later
 Decisions are saved and reused on later runs.

+The review step is intentionally conservative:
+- weak exact-name matches stay in the queue instead of auto-creating canonical products
+- canonical names should describe stable product identity, not retailer packaging text
+
 ## Notes
 - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
 - `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
+- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
 - `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

 ## Test