updated readme

ben
2026-03-16 17:40:23 -04:00
parent 861955557a
commit 6806c0e7ff

252
README.md

@@ -1,103 +1,227 @@
# scrape-giant
-Small grocery-history pipeline for Giant receipts.
+Small grocery-history pipeline for Giant and Costco receipt data.
-The project currently does four things:
+This repo is still a manual, stepwise pipeline. There is no single orchestrator
+script yet. Each stage is run directly, and later stages depend on files
+produced by earlier stages.
-1. scrape Giant in-store order history from an active Firefox session
-2. enrich raw line items into a deterministic `items_enriched.csv`
-3. aggregate retailer-facing observed products and build a manual review queue
-4. create a first-pass canonical product layer plus conservative auto-links
-The work so far is Giant-specific on the ingest side and intentionally simple on
-the shared product-model side.
-## Current flow
-Run the commands from the repo root with the project venv active, or call them
-directly through `./venv/bin/python`.
+## What The Project Does
+The current flow is:
+1. acquire raw Giant receipt/history data
+2. enrich Giant line items into a shared enriched-item schema
+3. acquire raw Costco receipt data
+4. enrich Costco line items into the same shared enriched-item schema
+5. build observed-product, review, and canonical-product layers
+6. validate that Giant and Costco can flow through the same downstream model
+Raw retailer JSON remains the source of truth.
## Current Scripts
- `scrape_giant.py`
Fetch Giant in-store history and order detail payloads from an active Firefox
session.
- `scrape_costco.py`
Fetch Costco receipt summary/detail payloads from an active Firefox session.
Costco currently prefers `.env` header values first, then falls back to exact
Firefox local-storage values for session auth.
- `enrich_giant.py`
Parse Giant raw order JSON into `giant_output/items_enriched.csv`.
- `enrich_costco.py`
Parse Costco raw receipt JSON into `costco_output/items_enriched.csv`.
- `build_observed_products.py`
Build retailer-facing observed products from enriched rows.
- `build_review_queue.py`
Build a manual review queue for low-confidence or unresolved observed
products.
- `build_canonical_layer.py`
Build shared canonical products and observed-to-canonical links.
- `validate_cross_retailer_flow.py`
Write a proof/check output showing that Giant and Costco can meet in the same
downstream model.
## Manual Pipeline
Run these from the repo root with the venv active, or call them through
`./venv/bin/python`.
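Because later stages depend on files written by earlier ones, the order of the manual steps matters. The sketch below is not part of the repo (the README states there is no orchestrator script yet); it only illustrates the documented order. `STAGES`, `build_commands`, and `run_pipeline` are hypothetical names, and the stage list is copied from the numbered steps in this README.

```python
import subprocess
import sys

# Stage order copied from the numbered manual steps in this README.
# This list is illustrative; the repo itself ships no orchestrator.
STAGES = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
    "validate_cross_retailer_flow.py",
]


def build_commands(python="./venv/bin/python"):
    """Return one argv list per stage, in pipeline order."""
    return [[python, script] for script in STAGES]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    `runner` defaults to subprocess.run and can be swapped for a fake in tests.
    """
    for argv in commands:
        result = runner(argv)
        if result.returncode != 0:
            print(f"stage failed: {' '.join(argv)}", file=sys.stderr)
            return result.returncode
    return 0
```

Calling `run_pipeline(build_commands())` from the repo root would run the stages in order. Stopping at the first failure keeps a broken scrape from feeding stale files into the enrich and build steps.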
### 1. Acquire Giant raw data
```bash
./venv/bin/python scrape_giant.py
```
Inputs:
- active Firefox session for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from `.env`, shell env, or prompt
Outputs:
- `giant_output/raw/history.json`
- `giant_output/raw/<order_id>.json`
- `giant_output/orders.csv`
- `giant_output/items.csv`
### 2. Enrich Giant data
```bash
-./venv/bin/python scraper.py
./venv/bin/python enrich_giant.py
```
Input:
- `giant_output/raw/*.json`
Output:
- `giant_output/items_enriched.csv`
### 3. Acquire Costco raw data
```bash
./venv/bin/python scrape_costco.py
```
Optional useful flags:
```bash
./venv/bin/python scrape_costco.py --months-back 36
./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
```
Inputs:
- active Firefox session for `costco.com`
- optional `.env` values:
- `COSTCO_X_AUTHORIZATION`
- `COSTCO_X_WCS_CLIENTID`
- `COSTCO_CLIENT_IDENTIFIER`
- if `COSTCO_X_AUTHORIZATION` is absent, the script falls back to exact Firefox
local-storage values:
- `idToken` -> sent as `Bearer <idToken>`
- `clientID` -> used as `costco-x-wcs-clientId` when env is blank
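The env-first, local-storage-second lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `scrape_costco.py`: the function name `resolve_costco_auth` is hypothetical, and the header names `costco-x-authorization` and `client-identifier` are guesses (only `costco-x-wcs-clientId` appears in this README).

```python
def resolve_costco_auth(env, local_storage):
    """Pick Costco auth header values: .env first, Firefox local storage second.

    `env` and `local_storage` are plain string-to-string mappings.
    Header names other than `costco-x-wcs-clientId` are assumptions.
    """
    headers = {}

    # Prefer the explicit env value; otherwise build a Bearer token from idToken.
    authorization = env.get("COSTCO_X_AUTHORIZATION", "").strip()
    if not authorization:
        id_token = local_storage.get("idToken", "").strip()
        if id_token:
            authorization = f"Bearer {id_token}"
    if authorization:
        headers["costco-x-authorization"] = authorization  # assumed header name

    # Same precedence for the client id: env value, then local-storage clientID.
    client_id = env.get("COSTCO_X_WCS_CLIENTID", "").strip()
    if not client_id:
        client_id = local_storage.get("clientID", "").strip()
    if client_id:
        headers["costco-x-wcs-clientId"] = client_id

    # No local-storage fallback is documented for this value.
    client_identifier = env.get("COSTCO_CLIENT_IDENTIFIER", "").strip()
    if client_identifier:
        headers["client-identifier"] = client_identifier  # assumed header name

    return headers
```

Missing values are simply left out of the result instead of being guessed, so a caller can tell which headers were actually resolved.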
Outputs:
- `costco_output/raw/summary.json`
- `costco_output/raw/summary_requests.json`
- `costco_output/raw/<receipt_id>-<timestamp>.json`
- `costco_output/orders.csv`
- `costco_output/items.csv`
### 4. Enrich Costco data
```bash
./venv/bin/python enrich_costco.py
```
Input:
- `costco_output/raw/*.json`
Output:
- `costco_output/items_enriched.csv`
### 5. Build shared downstream layers
```bash
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```
-## Inputs
-- Firefox cookies for `giantfood.com`
-- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
-- Giant raw order payloads in `giant_output/raw/`
-## Outputs
+These scripts consume the enriched item files and generate the downstream
+product-model outputs.
+Current outputs on disk:
+- retailer-facing:
- `giant_output/products_observed.csv`
- `giant_output/review_queue.csv`
- `giant_output/products_canonical.csv`
- `giant_output/product_links.csv`
- cross-retailer proof/check output:
- `combined_output/products_observed.csv`
- `combined_output/products_canonical.csv`
- `combined_output/product_links.csv`
- `combined_output/proof_examples.csv`
-Current generated files live under `giant_output/`:
-- `orders.csv`: flattened visit/order rows from the Giant history API
-- `items.csv`: flattened raw line items from fetched order detail payloads
-- `items_enriched.csv`: deterministic parsed/enriched line items
-- `products_observed.csv`: retailer-facing observed product groups
-- `review_queue.csv`: products needing manual review
-- `products_canonical.csv`: shared canonical product rows
-- `product_links.csv`: observed-to-canonical links
-Raw json remains the source of truth:
-- `giant_output/raw/history.json`
-- `giant_output/raw/<order_id>.json`
-## Scripts
-- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
-- `enrich_giant.py`: reads raw Giant order json and writes `items_enriched.csv`
-- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
-- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
-- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`
-## Notes on the current model
-- Observed products are retailer-specific: Giant, Costco.
-- Canonical products are the first cross-retailer layer.
-- Auto-linking is conservative:
-  exact UPC first, then exact normalized name plus exact size/unit context, then
-  exact normalized name when there is no size context to conflict.
-- Fee rows are excluded from auto-linking.
-- Unknown values are left blank instead of guessed.
+### 6. Validate cross-retailer flow
+```bash
+./venv/bin/python validate_cross_retailer_flow.py
+```
+This is a proof/check step, not the main acquisition path.
+## Inputs And Outputs By Directory
+### `giant_output/`
+Inputs to this layer:
+- Firefox session data for Giant
+- Giant raw JSON payloads
+Generated files:
+- `raw/history.json`
+- `raw/<order_id>.json`
+- `orders.csv`
+- `items.csv`
+- `items_enriched.csv`
+- `products_observed.csv`
+- `review_queue.csv`
+- `products_canonical.csv`
+- `product_links.csv`
+### `costco_output/`
+Inputs to this layer:
+- Firefox session data for Costco
+- Costco raw GraphQL receipt payloads
+Generated files:
- `raw/summary.json`
- `raw/summary_requests.json`
- `raw/<receipt_id>-<timestamp>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
### `combined_output/`
Generated by cross-retailer proof/build scripts:
- `products_observed.csv`
- `products_canonical.csv`
- `product_links.csv`
- `proof_examples.csv`
## Notes
- The pipeline is intentionally simple and currently manual.
- Scraping is retailer-specific and fragile; downstream modeling is shared only
after enrichment.
- `summary_requests.json` is diagnostic metadata from Costco summary enumeration
and is not a receipt payload.
- `enrich_costco.py` skips that file and only parses receipt payloads.
- The repo may contain archived or sample output files under `archive/`; they
are not part of the active scrape path.
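The note about `summary_requests.json` implies that the enrich step has to filter the raw directory before parsing. A minimal sketch of that filtering, assuming the layout `costco_output/raw/*.json` from this README, is shown below. The helper name `iter_receipt_payloads` and the `skip_names` parameter are hypothetical and are not taken from `enrich_costco.py`.

```python
import json
from pathlib import Path

# Diagnostic file named in the notes above; it is not a receipt payload.
DEFAULT_SKIP = frozenset({"summary_requests.json"})


def iter_receipt_payloads(raw_dir, skip_names=DEFAULT_SKIP):
    """Yield (file name, parsed JSON) for each raw file that is not skipped.

    Files are visited in sorted name order so repeated runs see the same order.
    """
    for path in sorted(Path(raw_dir).glob("*.json")):
        if path.name in skip_names:
            continue
        with path.open(encoding="utf-8") as handle:
            yield path.name, json.load(handle)
```

This README does not say whether `summary.json` is parsed as a receipt payload, so it is not skipped by default here; a caller can pass a larger `skip_names` set if it should be excluded.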
## Verification
-Run the test suite with:
+Run the full test suite with:
```bash
./venv/bin/python -m unittest discover -s tests
```
-Useful one-off rebuilds:
+Useful one-off checks:
```bash
+./venv/bin/python scrape_giant.py --help
+./venv/bin/python scrape_costco.py --help
./venv/bin/python enrich_giant.py
-./venv/bin/python build_observed_products.py
-./venv/bin/python build_review_queue.py
-./venv/bin/python build_canonical_layer.py
+./venv/bin/python enrich_costco.py
```
-## Project docs
-- `pm/tasks.org`: task log and evidence
-- `pm/data-model.org`: file layout and schema decisions
+## Project Docs
+- `pm/tasks.org`
+- `pm/data-model.org`
+- `pm/scrape-giant.org`
## Status
Completed through `t1.7`:
- Giant receipt fetch CLI
- data model and file layout
- Giant parser/enricher
- observed products
- review queue
- canonical layer scaffold
- conservative auto-link rules
Next planned task is `t1.8`: add a Costco raw ingest path.
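The conservative auto-link order from the notes on the model (exact UPC, then exact normalized name with exact size/unit, then exact normalized name when no size context conflicts, with fee rows excluded) can be sketched as a single rule function. This is a sketch under assumptions, not the logic in `build_canonical_layer.py`: the `Product` fields, the rule labels, and the reading of "no size context to conflict" as "neither side has any size data" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Product:
    # Hypothetical field names; unknown values stay blank instead of guessed.
    upc: str = ""
    normalized_name: str = ""
    size_value: str = ""
    size_unit: str = ""
    is_fee: bool = False


def link_rule(observed: Product, canonical: Product) -> Optional[str]:
    """Return the name of the first matching rule, or None for no auto-link."""
    if observed.is_fee:
        return None  # fee rows are excluded from auto-linking

    # Rule 1: exact UPC match.
    if observed.upc and observed.upc == canonical.upc:
        return "exact_upc"

    # Rules 2 and 3 both require an exact normalized-name match.
    if not observed.normalized_name:
        return None
    if observed.normalized_name != canonical.normalized_name:
        return None

    obs_size = (observed.size_value, observed.size_unit)
    can_size = (canonical.size_value, canonical.size_unit)

    # Rule 2: name plus fully specified, identical size and unit.
    if all(obs_size) and obs_size == can_size:
        return "exact_name_size"

    # Rule 3: name only, assuming "no conflict" means neither side has size data.
    if not any(obs_size) and not any(can_size):
        return "exact_name"

    return None
```

Returning the rule name instead of a bare boolean would let a links file record why each link was made, which is useful when reviewing auto-links by hand.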