# scrape-giant

A CLI that pulls purchase history from the Giant and Costco websites and refines it into a single product catalog for external analysis.

Run each script step by step from the terminal.

## What It Does

1. `scrape_giant.py`: download Giant orders and items
2. `enrich_giant.py`: normalize Giant line items
3. `scrape_costco.py`: download Costco orders and items
4. `enrich_costco.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal
7. `report_pipeline_status.py`: show how many rows survive each stage

## Requirements

- Python 3.10+
- Firefox installed with active Giant and Costco sessions

## Install

```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## Optional `.env`

The current version works best with a `.env` file in the project root. The scrapers prompt for these values when they are not found in the current browser session.

- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.

```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...

COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
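
The lookup order described above (`.env` first, then session-backed values, then a prompt) can be sketched roughly as follows. This is an illustration only, not the scrapers' actual code: `parse_env_text` and `resolve_setting` are hypothetical names, and the session values are passed in as a plain mapping rather than read from Firefox.

```python
from typing import Callable, Mapping, Optional


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(
    name: str,
    env_values: Mapping[str, str],
    session_values: Mapping[str, str],
    prompt: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return the first non-empty value: .env, then session, then prompt."""
    for source in (env_values, session_values):
        value = source.get(name, "").strip()
        if value:
            return value
    if prompt is not None:
        # Hypothetical fallback: ask the user (for example via input()).
        return prompt(f"{name}: ").strip() or None
    return None
```

Passing `input` as `prompt` would give an interactive fallback; leaving it as `None` makes a missing value visible to the caller instead.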

## Run Order

Run the pipeline in this order:

```bash
python scrape_giant.py
python enrich_giant.py
python scrape_costco.py
python enrich_costco.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
```

Why run `build_purchases.py` twice:

- first pass builds the current combined dataset and review queue inputs
- `review_products.py` writes durable review decisions
- second pass reapplies those decisions into the purchase output
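
If you would rather drive the whole sequence from one place, a small wrapper along these lines could run the same steps in order and stop at the first failure. It is a sketch, not part of the project: `PIPELINE_STEPS` and `run_pipeline` are hypothetical names, and it assumes it is started from the project root with the virtual environment active.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Same order as the shell commands above, including the second
# build_purchases.py pass that reapplies saved review decisions.
PIPELINE_STEPS: list[list[str]] = [
    ["scrape_giant.py"],
    ["enrich_giant.py"],
    ["scrape_costco.py"],
    ["enrich_costco.py"],
    ["build_purchases.py"],
    ["review_products.py"],
    ["build_purchases.py"],
    ["review_products.py", "--refresh-only"],
    ["report_pipeline_status.py"],
]


def run_pipeline(
    steps: Sequence[Sequence[str]] = PIPELINE_STEPS,
    run: Callable[..., object] = subprocess.run,
) -> list[list[str]]:
    """Run each step with the current interpreter; return the commands run."""
    completed: list[list[str]] = []
    for step in steps:
        command = [sys.executable, *step]
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later steps never see partial output.
        run(command, check=True)
        completed.append(command)
    return completed
```

Calling `run_pipeline()` runs the full sequence. The interactive `review_products.py` step still uses the terminal, because the child process inherits standard input by default.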

If you only want to refresh the queue without reviewing interactively:

```bash
python review_products.py --refresh-only
```

If you want a quick stage-by-stage accountability check:

```bash
python report_pipeline_status.py
```

## Key Outputs

Giant:

- `giant_output/orders.csv`
- `giant_output/items.csv`
- `giant_output/items_enriched.csv`

Costco:

- `costco_output/orders.csv`
- `costco_output/items.csv`
- `costco_output/items_enriched.csv` (now preserves raw totals and matched net discount fields)

Combined:

- `combined_output/purchases.csv`
- `combined_output/review_queue.csv`
- `combined_output/review_resolutions.csv`
- `combined_output/canonical_catalog.csv`
- `combined_output/product_links.csv`
- `combined_output/comparison_examples.csv`
- `combined_output/pipeline_status.csv`
- `combined_output/pipeline_status.json`

## Review Workflow

Run `review_products.py` to clean up unresolved or weakly unified items:

- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Decisions are saved and reused on later runs.

The review step is intentionally conservative:

- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text
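
The four review actions and their effect on the queue can be modeled roughly like this. It is a sketch under stated assumptions: the field names (`item_key`, `canonical_id`) and the `pending_items` helper are hypothetical and do not reflect the actual schema of `review_resolutions.csv`.

```python
from dataclasses import dataclass
from enum import Enum


class Action(str, Enum):
    LINK = "link"        # attach to an existing canonical product
    CREATE = "create"    # create a new canonical product
    EXCLUDE = "exclude"  # drop the item from the catalog
    SKIP = "skip"        # leave it in the queue for later


@dataclass(frozen=True)
class Resolution:
    item_key: str           # hypothetical identifier for a queued item
    action: Action
    canonical_id: str = ""  # only meaningful for LINK and CREATE


def pending_items(queue: list[str], resolutions: list[Resolution]) -> list[str]:
    """Items still needing review: no decision yet, or explicitly skipped."""
    settled = {r.item_key for r in resolutions if r.action is not Action.SKIP}
    return [key for key in queue if key not in settled]
```

Treating `skip` as "not settled" is what keeps skipped items in the queue on later runs, while saved link, create, and exclude decisions are reused without asking again.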

## Notes

- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

## Test

```bash
./venv/bin/python -m unittest discover -s tests
```

## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow