Align refactor paths with data layout

2026-03-20 10:04:58 -04:00
parent 424a777dd0
commit d2e6f2afd3
5 changed files with 81 additions and 48 deletions
--- a/README.md
+++ b/README.md
@@ -14,6 +14,12 @@ Run each script step-by-step from the terminal.
 6. `review_products.py`: review unresolved product matches in the terminal
 7. `report_pipeline_status.py`: show how many rows survive each stage

+Active refactor entrypoints:
+- `collect_giant_web.py`
+- `collect_costco_web.py`
+- `normalize_giant_web.py`
+- `normalize_costco_web.py`
+
 ## Requirements

 - Python 3.10+
@@ -30,8 +36,8 @@ pip install -r requirements.txt
 ## Optional `.env`

 Current version works best with `.env` in the project root.  The scraper will prompt for these values if they are not found in the current browser session.  
-- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
-- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
+- `collect_giant_web.py` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
+- `collect_costco_web.py` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
 - Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.

 ```env
@@ -43,15 +49,39 @@ COSTCO_X_WCS_CLIENTID=...
 COSTCO_CLIENT_IDENTIFIER=...
 ```

+Current active path layout:
+
+```text
+data/
+  giant-web/
+    raw/
+    collected_orders.csv
+    collected_items.csv
+    normalized_items.csv
+  costco-web/
+    raw/
+    collected_orders.csv
+    collected_items.csv
+    normalized_items.csv
+  review/
+    review_queue.csv
+    review_resolutions.csv
+    product_links.csv
+    purchases.csv
+    pipeline_status.csv
+    pipeline_status.json
+  catalog.csv
+```
+
 ## Run Order

 Run the pipeline in this order:

 ```bash
-python scrape_giant.py
-python enrich_giant.py
-python scrape_costco.py
-python enrich_costco.py
+python collect_giant_web.py
+python normalize_giant_web.py
+python collect_costco_web.py
+python normalize_costco_web.py
 python build_purchases.py
 python review_products.py
 python build_purchases.py
@@ -79,25 +109,25 @@ python report_pipeline_status.py
 ## Key Outputs

 Giant:
-- `giant_output/orders.csv`
-- `giant_output/items.csv`
-- `giant_output/items_enriched.csv`
+- `data/giant-web/collected_orders.csv`
+- `data/giant-web/collected_items.csv`
+- `data/giant-web/normalized_items.csv`

 Costco:
-- `costco_output/orders.csv`
-- `costco_output/items.csv`
-- `costco_output/items_enriched.csv`
-- `costco_output/items_enriched.csv` now preserves raw totals and matched net discount fields
+- `data/costco-web/collected_orders.csv`
+- `data/costco-web/collected_items.csv`
+- `data/costco-web/normalized_items.csv`
+- `data/costco-web/normalized_items.csv` preserves raw totals and matched net discount fields

 Combined:
-- `combined_output/purchases.csv`
-- `combined_output/review_queue.csv`
-- `combined_output/review_resolutions.csv`
-- `combined_output/canonical_catalog.csv`
-- `combined_output/product_links.csv`
-- `combined_output/comparison_examples.csv`
-- `combined_output/pipeline_status.csv`
-- `combined_output/pipeline_status.json`
+- `data/review/purchases.csv`
+- `data/review/review_queue.csv`
+- `data/review/review_resolutions.csv`
+- `data/review/product_links.csv`
+- `data/review/comparison_examples.csv`
+- `data/review/pipeline_status.csv`
+- `data/review/pipeline_status.json`
+- `data/catalog.csv`

 ## Review Workflow

@@ -114,7 +144,7 @@ The review step is intentionally conservative:

 ## Notes
 - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
-- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
+- `scrape_giant.py`, `scrape_costco.py`, `enrich_giant.py`, and `enrich_costco.py` are now legacy-compatible entrypoints; prefer the `collect_*` and `normalize_*` scripts for active work.
 - Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
 - `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
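
Note on the path layout documented in this README change: the `data/` tree could be described in one small helper so that scripts do not repeat directory strings. The sketch below is illustrative only. `DATA_ROOT`, `RETAILERS`, `retailer_paths`, and `review_paths` are hypothetical names, not functions confirmed to exist in this repository; only the directory and file names are taken from the layout above.

```python
from pathlib import Path

# Assumption: scripts run from the project root, so "data" is a relative path.
DATA_ROOT = Path("data")

# Retailer directory names as listed in the README layout.
RETAILERS = ("giant-web", "costco-web")


def retailer_paths(retailer: str, root: Path = DATA_ROOT) -> dict[str, Path]:
    """Return the per-retailer paths from the documented layout."""
    if retailer not in RETAILERS:
        raise ValueError(f"unknown retailer directory: {retailer!r}")
    base = root / retailer
    return {
        "raw": base / "raw",
        "collected_orders": base / "collected_orders.csv",
        "collected_items": base / "collected_items.csv",
        "normalized_items": base / "normalized_items.csv",
    }


def review_paths(root: Path = DATA_ROOT) -> dict[str, Path]:
    """Return the combined outputs under data/review plus the catalog."""
    review = root / "review"
    return {
        "review_queue": review / "review_queue.csv",
        "review_resolutions": review / "review_resolutions.csv",
        "product_links": review / "product_links.csv",
        "purchases": review / "purchases.csv",
        "pipeline_status_csv": review / "pipeline_status.csv",
        "pipeline_status_json": review / "pipeline_status.json",
        "catalog": root / "catalog.csv",
    }
```

A caller would then write something like `retailer_paths("costco-web")["normalized_items"]` instead of assembling the string by hand, which keeps a future directory rename to a single edit.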