Finalize post-refactor layout and remove old pipeline files
README.md: 45 lines changed
@@ -6,21 +6,15 @@ Run each script step-by-step from the terminal.
 
 ## What It Does
 
-1. `scrape_giant.py`: download Giant orders and items
-2. `enrich_giant.py`: normalize Giant line items
-3. `scrape_costco.py`: download Costco orders and items
-4. `enrich_costco.py`: normalize Costco line items
+1. `collect_giant_web.py`: download Giant orders and items
+2. `normalize_giant_web.py`: normalize Giant line items
+3. `collect_costco_web.py`: download Costco orders and items
+4. `normalize_costco_web.py`: normalize Costco line items
 5. `build_purchases.py`: combine retailer outputs into one purchase table
 6. `review_products.py`: review unresolved product matches in the terminal
 7. `report_pipeline_status.py`: show how many rows survive each stage
 8. `analyze_purchases.py`: write chart-ready analysis CSVs from the purchase table
 
-Active refactor entrypoints:
-- `collect_giant_web.py`
-- `collect_costco_web.py`
-- `normalize_giant_web.py`
-- `normalize_costco_web.py`
-
 ## Requirements
 
 - Python 3.10+
@@ -65,13 +59,20 @@ data/
     collected_items.csv
     normalized_items.csv
   review/
+    catalog.csv
     review_queue.csv
     review_resolutions.csv
     product_links.csv
-    purchases.csv
     pipeline_status.csv
     pipeline_status.json
-  catalog.csv
+  analysis/
+    purchases.csv
+    comparison_examples.csv
+    item_price_over_time.csv
+    spend_by_visit.csv
+    items_per_visit.csv
+    category_spend_over_time.csv
+    retailer_store_breakdown.csv
 ```
 
 ## Run Order
@@ -122,21 +123,21 @@ Costco:
 - `data/costco-web/normalized_items.csv` preserves raw totals and matched net discount fields
 
 Combined:
-- `data/review/purchases.csv`
-- `data/review/analysis/item_price_over_time.csv`
-- `data/review/analysis/spend_by_visit.csv`
-- `data/review/analysis/items_per_visit.csv`
-- `data/review/analysis/category_spend_over_time.csv`
-- `data/review/analysis/retailer_store_breakdown.csv`
+- `data/analysis/purchases.csv`
+- `data/analysis/comparison_examples.csv`
+- `data/analysis/item_price_over_time.csv`
+- `data/analysis/spend_by_visit.csv`
+- `data/analysis/items_per_visit.csv`
+- `data/analysis/category_spend_over_time.csv`
+- `data/analysis/retailer_store_breakdown.csv`
 - `data/review/review_queue.csv`
 - `data/review/review_resolutions.csv`
 - `data/review/product_links.csv`
-- `data/review/comparison_examples.csv`
 - `data/review/pipeline_status.csv`
 - `data/review/pipeline_status.json`
-- `data/catalog.csv`
+- `data/review/catalog.csv`
 
-`data/review/purchases.csv` is the main analysis artifact. It is designed to support both:
+`data/analysis/purchases.csv` is the main analysis artifact. It is designed to support both:
 - item-level price analysis
 - visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown
 
@@ -164,9 +165,7 @@ The review step is intentionally conservative:
 
 ## Notes
 - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
-- `scrape_giant.py`, `scrape_costco.py`, `enrich_giant.py`, and `enrich_costco.py` are now legacy-compatible entrypoints; prefer the `collect_*` and `normalize_*` scripts for active work.
 - Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
-- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
 
 ## Test
 
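Review note: since this commit moves several outputs between `data/review/` and `data/analysis/`, a quick check of a local `data/` directory against the new layout may help catch leftovers from earlier runs. The following is a minimal sketch, not part of the commit. The function name `check_layout` and the `EXPECTED`/`STALE` lists are hypothetical and were assembled only from the paths in the "Combined" list of the README diff; it assumes the scripts write exactly those paths.

```python
from pathlib import Path

# Paths (relative to the data root) that the updated README lists under "Combined".
EXPECTED = [
    "analysis/purchases.csv",
    "analysis/comparison_examples.csv",
    "analysis/item_price_over_time.csv",
    "analysis/spend_by_visit.csv",
    "analysis/items_per_visit.csv",
    "analysis/category_spend_over_time.csv",
    "analysis/retailer_store_breakdown.csv",
    "review/review_queue.csv",
    "review/review_resolutions.csv",
    "review/product_links.csv",
    "review/pipeline_status.csv",
    "review/pipeline_status.json",
    "review/catalog.csv",
]

# Paths the README no longer lists; finding them suggests output from the old layout.
STALE = [
    "review/purchases.csv",
    "review/comparison_examples.csv",
    "review/analysis",
    "catalog.csv",
]


def check_layout(data_root):
    """Return (missing, stale): expected files not found, old-layout paths still present."""
    root = Path(data_root)
    missing = [p for p in EXPECTED if not (root / p).is_file()]
    stale = [p for p in STALE if (root / p).exists()]
    return missing, stale


if __name__ == "__main__":
    # "data" is assumed to be the data root relative to the current directory.
    missing, stale = check_layout("data")
    for path in missing:
        print(f"missing: data/{path}")
    for path in stale:
        print(f"stale (old layout): data/{path}")
```

Running it from the repository root after a full pipeline run should print nothing if the layout matches the README. Any `stale` line points at a file that an older script version probably wrote.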