updated readme

ben
2026-03-16 17:40:23 -04:00
parent 861955557a
commit 6806c0e7ff

252
README.md

@@ -1,103 +1,227 @@
# scrape-giant
Small grocery-history pipeline for Giant receipts.
Small grocery-history pipeline for Giant and Costco receipt data.
The project currently does four things:
This repo is still a manual, stepwise pipeline. There is no single orchestrator
script yet. Each stage is run directly, and later stages depend on files
produced by earlier stages.
1. scrape Giant in-store order history from an active Firefox session
2. enrich raw line items into a deterministic `items_enriched.csv`
3. aggregate retailer-facing observed products and build a manual review queue
4. create a first-pass canonical product layer plus conservative auto-links
## What The Project Does
The work so far is Giant-specific on the ingest side and intentionally simple on
the shared product-model side.
The current flow is:
## Current flow
1. acquire raw Giant receipt/history data
2. enrich Giant line items into a shared enriched-item schema
3. acquire raw Costco receipt data
4. enrich Costco line items into the same shared enriched-item schema
5. build observed-product, review, and canonical-product layers
6. validate that Giant and Costco can flow through the same downstream model
Run the commands from the repo root with the project venv active, or call them
directly through `./venv/bin/python`.
Raw retailer JSON remains the source of truth.
## Current Scripts
- `scrape_giant.py`
Fetch Giant in-store history and order detail payloads from an active Firefox
session.
- `scrape_costco.py`
Fetch Costco receipt summary/detail payloads from an active Firefox session.
The Costco scraper currently uses `.env` header values when they are set and
otherwise falls back to exact Firefox local-storage values for session auth.
- `enrich_giant.py`
Parse Giant raw order JSON into `giant_output/items_enriched.csv`.
- `enrich_costco.py`
Parse Costco raw receipt JSON into `costco_output/items_enriched.csv`.
- `build_observed_products.py`
Build retailer-facing observed products from enriched rows.
- `build_review_queue.py`
Build a manual review queue for low-confidence or unresolved observed
products.
- `build_canonical_layer.py`
Build shared canonical products and observed-to-canonical links.
- `validate_cross_retailer_flow.py`
Write a proof/check output showing that Giant and Costco can meet in the same
downstream model.
## Manual Pipeline
Run these from the repo root with the venv active, or call them through
`./venv/bin/python`.
### 1. Acquire Giant raw data
```bash
./venv/bin/python scrape_giant.py
```
Inputs:
- active Firefox session for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from `.env`, shell env, or prompt
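A minimal sketch of how one setting could be resolved from these sources. The function name `resolve_setting`, its parameters, and the precedence (shell env, then `.env`, then prompt) are assumptions for illustration; the README does not state the order `scrape_giant.py` actually uses.

```python
def resolve_setting(name, dotenv_values, shell_env, prompt=input):
    """Return the first non-blank value for `name`.

    Assumed order (hypothetical): shell env, then parsed .env values,
    then an interactive prompt as the last resort.
    """
    for source in (shell_env, dotenv_values):
        value = (source.get(name) or "").strip()
        if value:
            return value
    # Nothing configured anywhere: ask the user.
    return prompt(f"{name}: ").strip()
```

For example, `resolve_setting("GIANT_USER_ID", {}, os.environ)` would prompt only when the variable is missing or blank in the shell environment.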
Outputs:
- `giant_output/raw/history.json`
- `giant_output/raw/<order_id>.json`
- `giant_output/orders.csv`
- `giant_output/items.csv`
### 2. Enrich Giant data
```bash
./venv/bin/python scraper.py
./venv/bin/python enrich_giant.py
```
Input:
- `giant_output/raw/*.json`
Output:
- `giant_output/items_enriched.csv`
### 3. Acquire Costco raw data
```bash
./venv/bin/python scrape_costco.py
```
Optional useful flags:
```bash
./venv/bin/python scrape_costco.py --months-back 36
./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
```
Inputs:
- active Firefox session for `costco.com`
- optional `.env` values:
- `COSTCO_X_AUTHORIZATION`
- `COSTCO_X_WCS_CLIENTID`
- `COSTCO_CLIENT_IDENTIFIER`
- if `COSTCO_X_AUTHORIZATION` is absent, the script falls back to exact Firefox
local-storage values:
- `idToken` -> sent as `Bearer <idToken>`
- `clientID` -> used as `costco-x-wcs-clientId` when env is blank
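The fallback described above can be sketched roughly as follows. This is an illustration, not the code in `scrape_costco.py`: the function name `resolve_costco_auth`, the `authorization` result key, and the choice to resolve the client ID independently of the authorization value are assumptions.

```python
def resolve_costco_auth(env, local_storage):
    """Pick Costco auth values: .env first, then Firefox local storage.

    `env` and `local_storage` are plain dicts (hypothetical inputs).
    """
    authorization = (env.get("COSTCO_X_AUTHORIZATION") or "").strip()
    if not authorization:
        # Fall back to the local-storage idToken, sent as a Bearer token.
        id_token = (local_storage.get("idToken") or "").strip()
        if id_token:
            authorization = f"Bearer {id_token}"

    client_id = (env.get("COSTCO_X_WCS_CLIENTID") or "").strip()
    if not client_id:
        # Env is blank: use the local-storage clientID instead.
        client_id = (local_storage.get("clientID") or "").strip()

    return {
        "authorization": authorization,      # result key name is an assumption
        "costco-x-wcs-clientId": client_id,  # header name taken from the README
    }
```

Values that cannot be resolved come back as empty strings rather than guesses, so the caller can decide whether to stop.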
Outputs:
- `costco_output/raw/summary.json`
- `costco_output/raw/summary_requests.json`
- `costco_output/raw/<receipt_id>-<timestamp>.json`
- `costco_output/orders.csv`
- `costco_output/items.csv`
### 4. Enrich Costco data
```bash
./venv/bin/python enrich_costco.py
```
Input:
- `costco_output/raw/*.json`
Output:
- `costco_output/items_enriched.csv`
### 5. Build shared downstream layers
```bash
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```
## Inputs
These scripts consume the enriched item files and generate the downstream
product-model outputs.
- Firefox cookies for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
- Giant raw order payloads in `giant_output/raw/`
Current outputs on disk:
## Outputs
- retailer-facing:
- `giant_output/products_observed.csv`
- `giant_output/review_queue.csv`
- `giant_output/products_canonical.csv`
- `giant_output/product_links.csv`
- cross-retailer proof/check output:
- `combined_output/products_observed.csv`
- `combined_output/products_canonical.csv`
- `combined_output/product_links.csv`
- `combined_output/proof_examples.csv`
Current generated files live under `giant_output/`:
### 6. Validate cross-retailer flow
- `orders.csv`: flattened visit/order rows from the Giant history API
- `items.csv`: flattened raw line items from fetched order detail payloads
- `items_enriched.csv`: deterministic parsed/enriched line items
- `products_observed.csv`: retailer-facing observed product groups
- `review_queue.csv`: products needing manual review
- `products_canonical.csv`: shared canonical product rows
- `product_links.csv`: observed-to-canonical links
```bash
./venv/bin/python validate_cross_retailer_flow.py
```
Raw JSON remains the source of truth:
This is a proof/check step, not the main acquisition path.
- `giant_output/raw/history.json`
- `giant_output/raw/<order_id>.json`
## Inputs And Outputs By Directory
## Scripts
### `giant_output/`
- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
- `enrich_giant.py`: reads raw Giant order JSON and writes `items_enriched.csv`
- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`
Inputs to this layer:
- Firefox session data for Giant
- Giant raw JSON payloads
## Notes on the current model
Generated files:
- `raw/history.json`
- `raw/<order_id>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
- `products_observed.csv`
- `review_queue.csv`
- `products_canonical.csv`
- `product_links.csv`
- Observed products are retailer-specific: Giant, Costco.
- Canonical products are the first cross-retailer layer.
- Auto-linking is conservative:
exact UPC first, then exact normalized name plus exact size/unit context, then
exact normalized name when there is no size context to conflict.
- Fee rows are excluded from auto-linking.
- Unknown values are left blank instead of guessed.
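The auto-link precedence above could be sketched as a single comparison function. The field names (`upc`, `normalized_name`, `size_value`, `size_unit`, `is_fee`) and the rule labels are hypothetical, and "no size context to conflict" is read here as "neither side has size information", which may be stricter than what `build_canonical_layer.py` does.

```python
def match_rule(a, b):
    """Return the name of the first rule linking two observed products, or ""."""
    # Fee rows never auto-link.
    if a.get("is_fee") or b.get("is_fee"):
        return ""

    # Rule 1: exact UPC, only when a UPC is actually present.
    if a.get("upc") and a.get("upc") == b.get("upc"):
        return "exact_upc"

    name_a = a.get("normalized_name", "")
    if not name_a or name_a != b.get("normalized_name", ""):
        return ""

    size_a = (a.get("size_value", ""), a.get("size_unit", ""))
    size_b = (b.get("size_value", ""), b.get("size_unit", ""))

    # Rule 2: exact name plus exact size/unit context.
    if all(size_a) and size_a == size_b:
        return "exact_name_size"

    # Rule 3: exact name when neither side carries size context.
    if not any(size_a) and not any(size_b):
        return "exact_name"

    return ""
```

Returning an empty string for anything ambiguous keeps the behavior conservative: unmatched pairs are left for the review queue instead of being linked on a guess.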
### `costco_output/`
Inputs to this layer:
- Firefox session data for Costco
- Costco raw GraphQL receipt payloads
Generated files:
- `raw/summary.json`
- `raw/summary_requests.json`
- `raw/<receipt_id>-<timestamp>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
### `combined_output/`
Generated by cross-retailer proof/build scripts:
- `products_observed.csv`
- `products_canonical.csv`
- `product_links.csv`
- `proof_examples.csv`
## Notes
- The pipeline is intentionally simple and currently manual.
- Scraping is retailer-specific and fragile; downstream modeling is shared only
after enrichment.
- `summary_requests.json` is diagnostic metadata from Costco summary enumeration
and is not a receipt payload.
- `enrich_costco.py` skips that file and only parses receipt payloads.
- The repo may contain archived or sample output files under `archive/`; they
are not part of the active scrape path.
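As an illustration of the `summary_requests.json` note above, a file filter along these lines would keep diagnostic metadata out of the receipt parser. The helper name `receipt_payload_paths` and its `skip_names` parameter are assumptions; the README does not say whether other files, such as `summary.json`, are also skipped.

```python
from pathlib import Path


def receipt_payload_paths(raw_dir, skip_names=("summary_requests.json",)):
    """List JSON files under raw_dir, leaving out known non-receipt files."""
    return sorted(
        path
        for path in Path(raw_dir).glob("*.json")
        if path.name not in skip_names
    )
```

Calling `receipt_payload_paths("costco_output/raw")` would return the sorted paths to hand to the enrichment step.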
## Verification
Run the test suite with:
Run the full test suite with:
```bash
./venv/bin/python -m unittest discover -s tests
```
Useful one-off rebuilds:
Useful one-off checks:
```bash
./venv/bin/python scrape_giant.py --help
./venv/bin/python scrape_costco.py --help
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
./venv/bin/python enrich_costco.py
```
## Project docs
## Project Docs
- `pm/tasks.org`: task log and evidence
- `pm/data-model.org`: file layout and schema decisions
## Status
Completed through `t1.7`:
- Giant receipt fetch CLI
- data model and file layout
- Giant parser/enricher
- observed products
- review queue
- canonical layer scaffold
- conservative auto-link rules
Next planned task is `t1.8`: add a Costco raw ingest path.
- `pm/tasks.org`
- `pm/data-model.org`
- `pm/scrape-giant.org`