# scrape-giant

Small CLI pipeline for pulling purchase history from Giant and Costco, enriching line items, and building a reviewable cross-retailer purchase dataset.

There is no one-shot runner yet. Today, you run the scripts step by step from the terminal.

## What It Does

- `scrape_giant.py`: download Giant orders and items
- `enrich_giant.py`: normalize Giant line items
- `scrape_costco.py`: download Costco orders and items
- `enrich_costco.py`: normalize Costco line items
- `build_purchases.py`: combine retailer outputs into one purchase table
- `review_products.py`: review unresolved product matches in the terminal

## Requirements

- Python 3.10+
- Firefox, with logged-in Giant and Costco sessions

## Install

```bash
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## Optional `.env`

The current version works best with a `.env` file in the project root. The scrapers prompt for any of these values that they cannot find in `.env` or in the current browser session.

- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.

```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...

# Costco can use these if present, but it can also pull session values from Firefox.
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
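As an illustration of the "`.env` first, then prompt" fallback described above, here is a minimal sketch using only the standard library. The helper names `load_env_file` and `get_setting` are hypothetical and are not taken from the project's scripts; the Firefox local-storage lookup that `scrape_costco` performs is not shown.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_setting(name, file_values, prompt=input):
    """Check the process environment, then the .env values, then ask the user."""
    value = os.environ.get(name) or file_values.get(name)
    if value:
        return value
    return prompt(f"{name}: ").strip()
```

With this sketch, `get_setting("GIANT_USER_ID", load_env_file())` would return the configured value if one exists and only prompt otherwise.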

## Run Order

Run the pipeline in this order:

```bash
python scrape_giant.py
python enrich_giant.py
python scrape_costco.py
python enrich_costco.py
python build_purchases.py
python review_products.py
python build_purchases.py
```

Why run `build_purchases.py` twice:
- first pass builds the current combined dataset and review queue inputs
- `review_products.py` writes durable review decisions
- second pass reapplies those decisions into the purchase output
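Since there is no one-shot runner yet, the run order above could be wrapped in a small script. The following is only a sketch under stated assumptions: `build_commands` and `run_pipeline` are hypothetical names, the scripts are assumed to live in the current working directory, and the run stops at the first step that exits with a non-zero status.

```python
import subprocess
import sys

# Same order as the manual run, including the second build_purchases.py pass.
PIPELINE_STEPS = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_purchases.py",
    "review_products.py",
    "build_purchases.py",
]


def build_commands(python=sys.executable, steps=PIPELINE_STEPS):
    """Return one argv list per pipeline step."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True raises on a non-zero exit."""
    for command in commands:
        print("running:", " ".join(command))
        runner(command, check=True)
```

Calling `run_pipeline(build_commands())` would execute the steps in sequence. Because `review_products.py` is interactive, the child process needs to inherit the terminal, which is the default behavior of `subprocess.run`.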

If you only want to refresh the queue without reviewing interactively:

```bash
python review_products.py --refresh-only
```

## Key Outputs

Giant:
- `giant_output/orders.csv`
- `giant_output/items.csv`
- `giant_output/items_enriched.csv`

Costco:
- `costco_output/orders.csv`
- `costco_output/items.csv`
- `costco_output/items_enriched.csv`

Combined:
- `combined_output/purchases.csv`
- `combined_output/review_queue.csv`
- `combined_output/review_resolutions.csv`
- `combined_output/canonical_catalog.csv`
- `combined_output/product_links.csv`
- `combined_output/comparison_examples.csv`
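After a full run, a quick way to confirm that every file listed above was produced is a small existence check. This is a hedged sketch, not part of the project: `missing_outputs` is a hypothetical helper, and it only tests that each file exists, not that its contents are valid.

```python
from pathlib import Path

# Paths copied from the Key Outputs list, relative to the project root.
EXPECTED_OUTPUTS = [
    "giant_output/orders.csv",
    "giant_output/items.csv",
    "giant_output/items_enriched.csv",
    "costco_output/orders.csv",
    "costco_output/items.csv",
    "costco_output/items_enriched.csv",
    "combined_output/purchases.csv",
    "combined_output/review_queue.csv",
    "combined_output/review_resolutions.csv",
    "combined_output/canonical_catalog.csv",
    "combined_output/product_links.csv",
    "combined_output/comparison_examples.csv",
]


def missing_outputs(root=".", expected=EXPECTED_OUTPUTS):
    """Return the expected output paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]
```

An empty list from `missing_outputs()` means all expected files are present. Some files, such as `combined_output/review_resolutions.csv`, may only appear after the relevant step has run at least once.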

## Review Workflow

`review_products.py` is the manual cleanup step for unresolved or weakly unified items.

In the terminal, you can:
- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Those decisions are saved and reused on later runs.

## Notes

- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

## Test

```bash
./venv/bin/python -m unittest discover -s tests
```

## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow