# scrape-giant
Small grocery-history pipeline for Giant and Costco receipt data.
This repo is still a manual, stepwise pipeline. There is no single orchestrator
script yet. Each stage is run directly, and later stages depend on files
produced by earlier stages.
## What The Project Does
The current flow is:
1. acquire raw Giant receipt/history data
2. enrich Giant line items into a shared enriched-item schema
3. acquire raw Costco receipt data
4. enrich Costco line items into the same shared enriched-item schema
5. build observed-product, review, and canonical-product layers
6. validate that Giant and Costco can flow through the same downstream model
Raw retailer JSON remains the source of truth.
## Current Scripts
- `scrape_giant.py`
Fetch Giant in-store history and order detail payloads from an active Firefox
session.
- `scrape_costco.py`
Fetch Costco receipt summary/detail payloads from an active Firefox session.
For session auth, the Costco script uses `.env` header values when present and
otherwise falls back to values read verbatim from Firefox local storage.
- `enrich_giant.py`
Parse Giant raw order JSON into `giant_output/items_enriched.csv`.
- `enrich_costco.py`
Parse Costco raw receipt JSON into `costco_output/items_enriched.csv`.
- `build_observed_products.py`
Build retailer-facing observed products from enriched rows.
- `build_review_queue.py`
Build a manual review queue for low-confidence or unresolved observed
products.
- `build_canonical_layer.py`
Build shared canonical products and observed-to-canonical links.
- `validate_cross_retailer_flow.py`
Write a proof/check output showing that Giant and Costco rows can flow through
the same downstream model.
## Manual Pipeline
Run these from the repo root with the venv active, or call them through
`./venv/bin/python`.
### 1. Acquire Giant raw data
```bash
./venv/bin/python scrape_giant.py
```
Inputs:
- active Firefox session for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from `.env`, shell env, or prompt
Outputs:
- `giant_output/raw/history.json`
- `giant_output/raw/<order_id>.json`
- `giant_output/orders.csv`
- `giant_output/items.csv`
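The lookup order for `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` described above (`.env`, then shell environment, then an interactive prompt) can be sketched as follows. `resolve_setting` is an illustrative helper, not the script's actual function, and the exact precedence is an assumption:

```python
def resolve_setting(name, dotenv, environ, prompt):
    """Return a config value: .env first, then the shell environment,
    then an interactive prompt (illustrative; the real script may differ)."""
    value = dotenv.get(name) or environ.get(name)
    if value:
        return value
    return prompt(f"{name}: ")

# Example: the value is only present in the shell environment.
user_id = resolve_setting(
    "GIANT_USER_ID",
    dotenv={},                          # nothing parsed from .env
    environ={"GIANT_USER_ID": "u-123"},
    prompt=lambda msg: "typed-in",      # would only run as a last resort
)
print(user_id)  # u-123
```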
### 2. Enrich Giant data
```bash
./venv/bin/python enrich_giant.py
```
Input:
- `giant_output/raw/*.json`
Output:
- `giant_output/items_enriched.csv`
### 3. Acquire Costco raw data
```bash
./venv/bin/python scrape_costco.py
```
Optional useful flags:
```bash
./venv/bin/python scrape_costco.py --months-back 36
./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
```
Inputs:
- active Firefox session for `costco.com`
- optional `.env` values:
- `COSTCO_X_AUTHORIZATION`
- `COSTCO_X_WCS_CLIENTID`
- `COSTCO_CLIENT_IDENTIFIER`
- if `COSTCO_X_AUTHORIZATION` is absent, the script falls back to values read
verbatim from Firefox local storage:
- `idToken` -> sent as `Bearer <idToken>`
- `clientID` -> used as `costco-x-wcs-clientId` when `COSTCO_X_WCS_CLIENTID` is unset
Outputs:
- `costco_output/raw/summary.json`
- `costco_output/raw/summary_requests.json`
- `costco_output/raw/<receipt_id>-<timestamp>.json`
- `costco_output/orders.csv`
- `costco_output/items.csv`
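The env-first / local-storage-fallback rule above can be sketched as a small header builder. This is an assumption-laden illustration: the `x-authorization` header name and the exact precedence are guesses, and only `costco-x-wcs-clientId` is taken from the README:

```python
def build_costco_headers(env, local_storage):
    """Sketch of the fallback rule: prefer .env values, otherwise use
    Firefox local-storage values. Header names other than
    costco-x-wcs-clientId are illustrative, not confirmed."""
    auth = env.get("COSTCO_X_AUTHORIZATION")
    if not auth:
        # Fall back to the Firefox local-storage idToken.
        auth = f"Bearer {local_storage['idToken']}"
    client_id = env.get("COSTCO_X_WCS_CLIENTID") or local_storage.get("clientID")
    return {
        "x-authorization": auth,
        "costco-x-wcs-clientId": client_id,
    }

headers = build_costco_headers(
    env={},  # no .env values set, so local storage wins
    local_storage={"idToken": "eyJ-example", "clientID": "abc-123"},
)
print(headers["x-authorization"])  # Bearer eyJ-example
```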
### 4. Enrich Costco data
```bash
./venv/bin/python enrich_costco.py
```
Input:
- `costco_output/raw/*.json`
Output:
- `costco_output/items_enriched.csv`
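Both enrich scripts are expected to emit the same shared enriched-item schema. One lightweight way to spot-check that is to compare CSV header rows; the real validation lives in `validate_cross_retailer_flow.py` and may be stricter, and the column names below are hypothetical:

```python
import csv
import io

def same_schema(csv_a: str, csv_b: str) -> bool:
    """Compare the header rows of two CSV payloads. A sketch of a
    cross-retailer schema check, not the project's actual validation."""
    header_a = next(csv.reader(io.StringIO(csv_a)))
    header_b = next(csv.reader(io.StringIO(csv_b)))
    return header_a == header_b

# Hypothetical column names -- the real enriched schema is defined
# by the enrich_* scripts, not here.
giant = "retailer,item_name,unit_price\ngiant,apples,2.99\n"
costco = "retailer,item_name,unit_price\ncostco,apples,5.49\n"
print(same_schema(giant, costco))  # True
```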
### 5. Build shared downstream layers
```bash
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```
These scripts consume the enriched item files and generate the downstream
product-model outputs.
Current outputs on disk:
- retailer-facing:
- `giant_output/products_observed.csv`
- `giant_output/review_queue.csv`
- `giant_output/products_canonical.csv`
- `giant_output/product_links.csv`
- cross-retailer proof/check output:
- `combined_output/products_observed.csv`
- `combined_output/products_canonical.csv`
- `combined_output/product_links.csv`
- `combined_output/proof_examples.csv`
### 6. Validate cross-retailer flow
```bash
./venv/bin/python validate_cross_retailer_flow.py
```
This is a proof/check step, not the main acquisition path.
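Since there is no single orchestrator script yet, the documented stage order can be captured in a small wrapper. A sketch, assuming only what the steps above state (the `run_pipeline` helper is hypothetical; the stage list mirrors the documented order, and each stage depends on files from earlier stages):

```python
import subprocess

# Stage order as documented in the manual pipeline above.
PIPELINE = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
    "validate_cross_retailer_flow.py",
]

def run_pipeline(python="./venv/bin/python", dry_run=True):
    """Run every stage in order, stopping on the first failure.
    With dry_run=True, just return the commands without executing."""
    commands = [[python, script] for script in PIPELINE]
    if not dry_run:
        for cmd in commands:
            # check=True aborts the run so later stages never see
            # missing or stale inputs from a failed earlier stage.
            subprocess.run(cmd, check=True)
    return commands

for cmd in run_pipeline():
    print(" ".join(cmd))
```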
## Inputs And Outputs By Directory
### `giant_output/`
Inputs to this layer:
- Firefox session data for Giant
- Giant raw JSON payloads
Generated files:
- `raw/history.json`
- `raw/<order_id>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
- `products_observed.csv`
- `review_queue.csv`
- `products_canonical.csv`
- `product_links.csv`
### `costco_output/`
Inputs to this layer:
- Firefox session data for Costco
- Costco raw GraphQL receipt payloads
Generated files:
- `raw/summary.json`
- `raw/summary_requests.json`
- `raw/<receipt_id>-<timestamp>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
### `combined_output/`
Generated by cross-retailer proof/build scripts:
- `products_observed.csv`
- `products_canonical.csv`
- `product_links.csv`
- `proof_examples.csv`
## Notes
- The pipeline is intentionally simple and currently manual.
- Scraping is retailer-specific and fragile; downstream modeling is shared only
after enrichment.
- `summary_requests.json` is diagnostic metadata from Costco summary enumeration
and is not a receipt payload.
- `enrich_costco.py` skips that file and only parses receipt payloads.
- The repo may contain archived or sample output files under `archive/`; they
are not part of the active scrape path.
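The `summary_requests.json` note above implies a filtering rule when walking `costco_output/raw/`. A minimal sketch of that rule, inferred from the documented file naming rather than from `enrich_costco.py` itself:

```python
def is_receipt_payload(filename: str) -> bool:
    """Keep <receipt_id>-<timestamp>.json receipt files and skip the
    summary.json / summary_requests.json diagnostics. This filename
    test is inferred from the naming documented in this README, not
    taken from the actual enrich_costco.py logic."""
    return filename.endswith(".json") and not filename.startswith("summary")

raw_files = [
    "summary.json",
    "summary_requests.json",
    "00123-20260301T101500.json",
]
print([f for f in raw_files if is_receipt_payload(f)])
# ['00123-20260301T101500.json']
```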
## Verification
Run the full test suite with:
```bash
./venv/bin/python -m unittest discover -s tests
```
Useful one-off checks:
```bash
./venv/bin/python scrape_giant.py --help
./venv/bin/python scrape_costco.py --help
./venv/bin/python enrich_giant.py
./venv/bin/python enrich_costco.py
```
## Project Docs
- `pm/tasks.org`
- `pm/data-model.org`
- `pm/scrape-giant.org`