
# scrape-giant

CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis.

Run each script step-by-step from the terminal.

## What It Does

1. `scrape_giant.py`: download Giant orders and items
2. `enrich_giant.py`: normalize Giant line items
3. `scrape_costco.py`: download Costco orders and items
4. `enrich_costco.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal

## Requirements

- Python 3.10+
- Firefox installed with active Giant and Costco sessions

## Install

```bash
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## Optional `.env`

The current version works best with a `.env` file in the project root. The scrapers prompt for these values if they are not found in the current browser session.

- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.

```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...

COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
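The lookup order described above (`.env` first, then a browser-session fallback, then an interactive prompt) can be sketched roughly as follows. This is an illustration, not the scrapers' actual code: `resolve_setting` and `session_lookup` are hypothetical names, and the sketch assumes the `.env` values have already been loaded into the process environment.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_setting(
    name: str,
    env: Optional[Mapping[str, str]] = None,
    session_lookup: Optional[Callable[[str], Optional[str]]] = None,
    prompt: Callable[[str], str] = input,
) -> str:
    """Return a setting from the environment, a session lookup, or a prompt."""
    # Assumption: .env values were already loaded into the environment.
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return value
    # Hypothetical fallback, e.g. a reader for Firefox local storage.
    if session_lookup is not None:
        value = session_lookup(name)
        if value:
            return value
    # Last resort: ask the user interactively.
    return prompt(f"{name}: ").strip()
```

Passing `env`, `session_lookup`, and `prompt` as arguments keeps the order easy to test without a real browser session.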

## Run Order

Run the pipeline in this order:

```bash
python scrape_giant.py
python enrich_giant.py
python scrape_costco.py
python enrich_costco.py
python build_purchases.py
python review_products.py
python build_purchases.py
```

Why run `build_purchases.py` twice:
- first pass builds the current combined dataset and review queue inputs
- `review_products.py` writes durable review decisions
- second pass reapplies those decisions into the purchase output

If you only want to refresh the queue without reviewing interactively:

```bash
python review_products.py --refresh-only
```
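If you would rather not type each command, the full run order above could be wrapped in a small driver. This is a minimal sketch and not part of the project: `run_pipeline` and `PIPELINE` are hypothetical names, and it assumes you run it from the project root with the virtual environment active.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Script order copied from the run order above; build_purchases.py runs twice.
PIPELINE = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_purchases.py",
    "review_products.py",
    "build_purchases.py",
]


def _run_checked(cmd: list) -> None:
    # check=True raises CalledProcessError on a non-zero exit code,
    # which stops the pipeline at the first failing step.
    subprocess.run(cmd, check=True)


def run_pipeline(
    steps: Sequence[str] = PIPELINE,
    run: Callable[[list], object] = _run_checked,
) -> list:
    """Run each script in order with the current interpreter."""
    executed = []
    for script in steps:
        run([sys.executable, script])
        executed.append(script)
    return executed
```

Calling `run_pipeline()` runs all seven steps. Because the child processes inherit the terminal, the interactive `review_products.py` step should still be able to read your input.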

## Key Outputs

Giant:
- `giant_output/orders.csv`
- `giant_output/items.csv`
- `giant_output/items_enriched.csv`

Costco:
- `costco_output/orders.csv`
- `costco_output/items.csv`
- `costco_output/items_enriched.csv`

Combined:
- `combined_output/purchases.csv`
- `combined_output/review_queue.csv`
- `combined_output/review_resolutions.csv`
- `combined_output/canonical_catalog.csv`
- `combined_output/product_links.csv`
- `combined_output/comparison_examples.csv`
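
For a quick sanity check before external analysis, any of these files can be inspected with the standard library. The sketch below makes no assumption about column names; it only reports the header and the number of data rows. `summarize_csv` is a hypothetical helper, and UTF-8 encoding is assumed.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file."""
    # Assumption: the output files are UTF-8 encoded.
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        return list(reader.fieldnames or []), row_count
```

For example, `summarize_csv("combined_output/purchases.csv")` returns the header and row count of the combined purchase table once `build_purchases.py` has run.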

## Review Workflow

Run `review_products.py` to clean up unresolved or weakly unified items:
- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Decisions are saved and reused on later runs.
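
One way to picture how saved decisions shape the next session is sketched below. Everything here is hypothetical: `Decision`, `still_pending`, and the item keys are illustrative names, not the schema of `review_resolutions.csv`, and treating skipped items as returning to the queue is one plausible reading of "skip it for later".

```python
from enum import Enum


class Decision(Enum):
    LINK = "link"        # link to an existing canonical product
    CREATE = "create"    # create a new canonical product
    EXCLUDE = "exclude"  # exclude the item
    SKIP = "skip"        # leave it for a later session


def still_pending(item_keys, saved):
    """Return the item keys that would reappear in the next review session."""
    pending = []
    for key in item_keys:
        decision = saved.get(key)
        # Undecided and skipped items both stay in the queue (assumption).
        if decision is None or decision is Decision.SKIP:
            pending.append(key)
    return pending
```

With `saved = {"a": Decision.LINK, "b": Decision.SKIP}`, calling `still_pending(["a", "b", "c"], saved)` keeps `"b"` and `"c"` and drops `"a"`.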

## Notes

- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

## Test

```bash
./venv/bin/python -m unittest discover -s tests
```

## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow