# scrape-giant
CLI tools that pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis.
Run each script step by step from the terminal.
## What It Does
1. `scrape_giant.py`: download Giant orders and items
2. `enrich_giant.py`: normalize Giant line items
3. `scrape_costco.py`: download Costco orders and items
4. `enrich_costco.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal
7. `report_pipeline_status.py`: show how many rows survive each stage
8. `analyze_purchases.py`: write chart-ready analysis CSVs from the purchase table
Active refactor entrypoints:
- `collect_giant_web.py`
- `collect_costco_web.py`
- `normalize_giant_web.py`
- `normalize_costco_web.py`
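The legacy scripts map onto the active entrypoints one-for-one. As a quick reference (this mapping is inferred from the step list above, not a structure defined in the repo):

```python
# Inferred legacy-to-active entrypoint mapping: scrape_* became collect_*
# and enrich_* became normalize_* for each retailer.
LEGACY_TO_ACTIVE = {
    "scrape_giant.py": "collect_giant_web.py",
    "enrich_giant.py": "normalize_giant_web.py",
    "scrape_costco.py": "collect_costco_web.py",
    "enrich_costco.py": "normalize_costco_web.py",
}
```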
## Requirements
- Python 3.10+
- Firefox installed with active Giant and Costco sessions
## Install
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
## Optional `.env`
The current version works best with a `.env` file in the project root. The scrapers prompt for any of these values that cannot be found in the current browser session.
- `collect_giant_web.py` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `collect_costco_web.py` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.
```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
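The resolution order described above (use the `.env` value if present, otherwise prompt) can be sketched as a small helper. This is an illustration, not code from the repo; the variable names match the README, but `resolve_setting` and its signature are hypothetical:

```python
import os

# Hypothetical credential resolver: values already loaded from .env into
# the environment win; anything missing is requested interactively. The
# environ and prompt parameters are injectable so the logic is testable.
def resolve_setting(name, environ=os.environ, prompt=input):
    value = environ.get(name, "").strip()
    if value:
        return value
    return prompt(f"Enter a value for {name}: ").strip()
```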
Current active path layout:
```text
data/
  giant-web/
    raw/
      collected_orders.csv
      collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
      collected_orders.csv
      collected_items.csv
    normalized_items.csv
  review/
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    purchases.csv
    pipeline_status.csv
    pipeline_status.json
  catalog.csv
```
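One way to keep these paths consistent across scripts is a small constants module. This is a hypothetical sketch of that idea; the actual scripts may hard-code or derive these paths differently:

```python
from pathlib import Path

# Hypothetical path constants mirroring the layout above.
DATA_DIR = Path("data")
GIANT_RAW_DIR = DATA_DIR / "giant-web" / "raw"
COSTCO_RAW_DIR = DATA_DIR / "costco-web" / "raw"
REVIEW_DIR = DATA_DIR / "review"
PURCHASES_CSV = REVIEW_DIR / "purchases.csv"
CATALOG_CSV = DATA_DIR / "catalog.csv"
```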
## Run Order
Run the pipeline in this order:
```bash
python collect_giant_web.py
python normalize_giant_web.py
python collect_costco_web.py
python normalize_costco_web.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
python analyze_purchases.py
```
Why run `build_purchases.py` twice:
- first pass builds the current combined dataset and review queue inputs
- `review_products.py` writes durable review decisions
- second pass reapplies those decisions into the purchase output
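The second pass is cheap because reapplying durable decisions is just a lookup. A minimal sketch of the idea, with assumed field names (`raw_name`, `product_id`) that may differ from the real schema:

```python
# Hypothetical sketch of reapplying review decisions on a rebuild: rows
# whose raw name has a recorded resolution get a canonical product id;
# unresolved rows stay unresolved and re-enter the review queue.
def apply_resolutions(rows, resolutions):
    """rows: dicts with 'raw_name'; resolutions: raw_name -> canonical id."""
    out = []
    for row in rows:
        resolved = dict(row)
        resolved["product_id"] = resolutions.get(row["raw_name"])
        out.append(resolved)
    return out
```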
If you only want to refresh the queue without reviewing interactively:
```bash
python review_products.py --refresh-only
```
If you want a quick stage-by-stage accountability check:
```bash
python report_pipeline_status.py
```
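The status report boils down to counting how many rows survive in each stage file. A minimal sketch of that check, assuming each stage writes a CSV with a header row (the real `report_pipeline_status.py` may do more):

```python
import csv
from pathlib import Path

# Hypothetical per-stage row count: data rows only (header excluded),
# with missing files reported as zero rather than raising.
def count_rows(path):
    p = Path(path)
    if not p.exists():
        return 0
    with p.open(newline="") as f:
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)
```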
## Key Outputs
Giant:
- `data/giant-web/collected_orders.csv`
- `data/giant-web/collected_items.csv`
- `data/giant-web/normalized_items.csv`
Costco:
- `data/costco-web/collected_orders.csv`
- `data/costco-web/collected_items.csv`
- `data/costco-web/normalized_items.csv` (preserves raw totals and matched net discount fields)
Combined:
- `data/review/purchases.csv`
- `data/review/analysis/item_price_over_time.csv`
- `data/review/analysis/spend_by_visit.csv`
- `data/review/analysis/items_per_visit.csv`
- `data/review/analysis/category_spend_over_time.csv`
- `data/review/analysis/retailer_store_breakdown.csv`
- `data/review/review_queue.csv`
- `data/review/review_resolutions.csv`
- `data/review/product_links.csv`
- `data/review/comparison_examples.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
- `data/catalog.csv`
`data/review/purchases.csv` is the main analysis artifact. It is designed to support both:
- item-level price analysis
- visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown
The visit fields are carried directly in `purchases.csv`, so you can pivot on them without extra joins:
- `order_id`
- `purchase_date`
- `retailer`
- `store_name`
- `store_number`
- `store_city`
- `store_state`
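For example, a visit-level rollup needs nothing beyond these columns. A stdlib-only sketch, where `item_total` is an assumed price column name not listed above:

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical spend-by-visit rollup straight off purchases.csv: group on
# the visit fields carried in each row, no joins against other files.
def spend_by_visit(purchases_csv_text):
    totals = defaultdict(float)
    for row in csv.DictReader(StringIO(purchases_csv_text)):
        key = (row["order_id"], row["purchase_date"], row["retailer"])
        totals[key] += float(row["item_total"])
    return dict(totals)
```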
## Review Workflow
Run `review_products.py` to clean up unresolved or weakly unified items:
- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later
Decisions are saved and reused on later runs.
The review step is intentionally conservative:
- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text
## Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py`, `scrape_costco.py`, `enrich_giant.py`, and `enrich_costco.py` are now legacy-compatible entrypoints; prefer the `collect_*` and `normalize_*` scripts for active work.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
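The discount-matching idea from the Costco notes can be sketched as follows. The field names (`kind`, `item_number`, `total`) are assumptions for illustration; the real enrichment step may key the match differently:

```python
# Hypothetical sketch: negative discount rows are matched back to the
# purchased item sharing the same item number, producing a net total
# while the original discount rows remain available for auditing.
def net_out_discounts(rows):
    items = [dict(r) for r in rows if r["kind"] == "item"]
    discounts = [r for r in rows if r["kind"] == "discount"]
    by_number = {r["item_number"]: r for r in items}
    for d in discounts:
        target = by_number.get(d["item_number"])
        if target is not None:
            target["net_total"] = round(target["total"] + d["total"], 2)
    for r in items:
        r.setdefault("net_total", r["total"])
    return items
```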
## Test
```bash
./venv/bin/python -m unittest discover -s tests
```
## Project Docs
- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow