updated readme and prep for next phase
25
README.md
@@ -1,17 +1,17 @@
# scrape-giant

Small CLI pipeline for pulling purchase history from Giant and Costco, enriching line items, and building a reviewable cross-retailer purchase dataset.
CLI to pull purchase history from Giant and Costco websites and refine it into a single product catalog for external analysis.

There is no one-shot runner yet. Today, you run the scripts step by step from the terminal.
Run each script step by step from the terminal.

## What It Does

- `scrape_giant.py`: download Giant orders and items
- `enrich_giant.py`: normalize Giant line items
- `scrape_costco.py`: download Costco orders and items
- `enrich_costco.py`: normalize Costco line items
- `build_purchases.py`: combine retailer outputs into one purchase table
- `review_products.py`: review unresolved product matches in the terminal
1. `scrape_giant.py`: download Giant orders and items
2. `enrich_giant.py`: normalize Giant line items
3. `scrape_costco.py`: download Costco orders and items
4. `enrich_costco.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal
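
The numbered order above could be scripted with a small wrapper. The repository does not ship a one-shot runner, so this is only a sketch: `build_commands` and `run_pipeline` are hypothetical helpers, and it assumes the scripts sit in the working directory and need no required arguments.

```python
import subprocess
import sys
from pathlib import Path

# Script order copied from the numbered list in the README.
PIPELINE = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_purchases.py",
    "review_products.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable, root=Path(".")):
    """Return one argv list per script, in pipeline order.

    Assumption: each script runs without required arguments.
    """
    return [[python, str(root / name)] for name in scripts]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against incomplete inputs.
        runner(argv, check=True)
```

Calling `run_pipeline(build_commands())` from the project root would run the six steps in sequence. The last step, `review_products.py`, is interactive, so it inherits the terminal from the parent process.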

## Requirements

@@ -36,7 +36,6 @@ Current version works best with `.env` in the project root. The scraper will pr
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...

# Costco can use these if present, but it can also pull session values from Firefox.
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
@@ -89,18 +88,14 @@ Combined:

## Review Workflow

`review_products.py` is the manual cleanup step for unresolved or weakly unified items.

In the terminal, you can:
Run `review_products.py` to clean up unresolved or weakly unified items:
- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Those decisions are saved and reused on later runs.
Decisions are saved and reused on later runs.
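
One way to picture "saved and reused" is a small decision store keyed by item. The README does not describe how `review_products.py` actually stores decisions, so everything here is an assumption for illustration: the JSON file, the `item_key` strings, the action names, and the choice to leave skipped items unrecorded so they come up again.

```python
import json
from pathlib import Path

# Hypothetical action names mirroring the four choices listed above.
ACTIONS = {"link", "create", "exclude", "skip"}


def load_decisions(path):
    """Load saved decisions, or return an empty dict on the first run."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def record_decision(decisions, item_key, action, canonical_id=None):
    """Record one review decision in place and return the dict."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "skip":
        # Assumption: a skipped item stays unresolved and is shown again.
        return decisions
    if action in {"link", "create"} and canonical_id is None:
        raise ValueError(f"{action} needs a canonical product id")
    decisions[item_key] = {"action": action, "canonical_id": canonical_id}
    return decisions


def save_decisions(decisions, path):
    """Write decisions to disk so a later run can reuse them."""
    text = json.dumps(decisions, indent=2, sort_keys=True)
    Path(path).write_text(text, encoding="utf-8")


def pending_items(item_keys, decisions):
    """Return the items that still need review, in their original order."""
    return [key for key in item_keys if key not in decisions]
```

On a later run, loading the file first and filtering with `pending_items` would leave only skipped or new items to review.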

## Notes

- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.