
scrape-giant

Small grocery-history pipeline for Giant and Costco receipt data.

This repo is still a manual, stepwise pipeline. There is no single orchestrator script yet. Each stage is run directly, and later stages depend on files produced by earlier stages.

What The Project Does

The current flow is:

  1. acquire raw Giant receipt/history data
  2. enrich Giant line items into a shared enriched-item schema
  3. acquire raw Costco receipt data
  4. enrich Costco line items into the same shared enriched-item schema
  5. build observed-product, review, and canonical-product layers
  6. validate that Giant and Costco can flow through the same downstream model

Raw retailer JSON remains the source of truth.

Current Scripts

  • scrape_giant.py: Fetch Giant in-store history and order detail payloads from an active Firefox session.
  • scrape_costco.py: Fetch Costco receipt summary/detail payloads from an active Firefox session. For session auth, Costco uses .env header values when present and otherwise falls back to exact Firefox local-storage values.
  • enrich_giant.py: Parse Giant raw order JSON into giant_output/items_enriched.csv.
  • enrich_costco.py: Parse Costco raw receipt JSON into costco_output/items_enriched.csv.
  • build_observed_products.py: Build retailer-facing observed products from enriched rows.
  • build_review_queue.py: Build a manual review queue for low-confidence or unresolved observed products.
  • build_canonical_layer.py: Build shared canonical products and observed-to-canonical links.
  • validate_cross_retailer_flow.py: Write a proof/check output showing that Giant and Costco can meet in the same downstream model.
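Since there is no orchestrator script yet, a minimal sketch of what one could look like is shown below. The stage order comes from this README; the `run_pipeline` helper itself is hypothetical and stops at the first stage that exits nonzero.

```python
import subprocess

# Stage order as documented in this README. The orchestrator itself
# does not exist in the repo; this is an illustrative sketch only.
STAGES = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
    "validate_cross_retailer_flow.py",
]

def run_pipeline(python="./venv/bin/python", stages=STAGES, runner=subprocess.run):
    """Run each stage in order; return the name of the first failed
    stage, or None if every stage succeeded."""
    for script in stages:
        result = runner([python, script])
        if result.returncode != 0:
            return script
    return None
```

The `runner` parameter exists only so the ordering logic can be exercised without actually invoking the scripts.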

Manual Pipeline

Run these from the repo root with the venv active, or call them through ./venv/bin/python.

1. Acquire Giant raw data

./venv/bin/python scrape_giant.py

Inputs:

  • active Firefox session for giantfood.com
  • GIANT_USER_ID and GIANT_LOYALTY_NUMBER from .env, shell env, or prompt

Outputs:

  • giant_output/raw/history.json
  • giant_output/raw/<order_id>.json
  • giant_output/orders.csv
  • giant_output/items.csv
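A sketch of the credential resolution described above (shell env, then .env values, then an interactive prompt). The exact precedence is an assumption based on this README's wording; check scrape_giant.py for the real order.

```python
import os

def resolve_credential(name, env=None, dotenv=None, prompt=input):
    """Resolve a credential such as GIANT_USER_ID.

    Lookup order (assumed): shell environment, then values parsed
    from .env, then an interactive prompt as a last resort.
    """
    env = os.environ if env is None else env
    dotenv = {} if dotenv is None else dotenv
    value = env.get(name) or dotenv.get(name)
    if not value:
        value = prompt(f"{name}: ")
    return value
```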

2. Enrich Giant data

./venv/bin/python enrich_giant.py

Input:

  • giant_output/raw/*.json

Output:

  • giant_output/items_enriched.csv
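The enrichment step flattens raw order JSON into item rows. The sketch below illustrates the shape of that transformation; the field names ("orderId", "items", "description", "unitPrice") are assumptions, not the real Giant payload schema, which lives in giant_output/raw/ and enrich_giant.py.

```python
import csv
import json
from pathlib import Path

def enrich_giant_raw(raw_dir, out_csv):
    """Flatten raw Giant order JSON files into enriched item rows
    and write them as CSV. Field names are illustrative only."""
    rows = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        if path.name == "history.json":  # order index, not an order payload
            continue
        order = json.loads(path.read_text())
        for item in order.get("items", []):
            rows.append({
                "order_id": order.get("orderId"),
                "description": item.get("description"),
                "quantity": item.get("quantity"),
                "unit_price": item.get("unitPrice"),
            })
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["order_id", "description", "quantity", "unit_price"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```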

3. Acquire Costco raw data

./venv/bin/python scrape_costco.py

Useful optional flags:

./venv/bin/python scrape_costco.py --months-back 36
./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"

Inputs:

  • active Firefox session for costco.com
  • optional .env values:
    • COSTCO_X_AUTHORIZATION
    • COSTCO_X_WCS_CLIENTID
    • COSTCO_CLIENT_IDENTIFIER
  • if COSTCO_X_AUTHORIZATION is absent, the script falls back to exact Firefox local-storage values:
    • idToken -> sent as Bearer <idToken>
    • clientID -> used as costco-x-wcs-clientId when env is blank

Outputs:

  • costco_output/raw/summary.json
  • costco_output/raw/summary_requests.json
  • costco_output/raw/<receipt_id>-<timestamp>.json
  • costco_output/orders.csv
  • costco_output/items.csv
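The auth fallback above can be sketched as a small header-resolution helper. The env-variable names and the `Bearer <idToken>` form come from this README; the exact header names and casing are assumptions, so defer to scrape_costco.py for the real request headers.

```python
def resolve_costco_headers(env, local_storage):
    """Build Costco auth headers: .env values first, then Firefox
    local-storage fallbacks (idToken as a Bearer token, clientID
    when the env client id is blank). Header names are assumed."""
    auth = env.get("COSTCO_X_AUTHORIZATION")
    if not auth:
        # Fallback documented above: exact local-storage idToken,
        # sent as "Bearer <idToken>".
        auth = f"Bearer {local_storage['idToken']}"
    client_id = env.get("COSTCO_X_WCS_CLIENTID") or local_storage.get("clientID")
    return {
        "costco-x-authorization": auth,
        "costco-x-wcs-clientId": client_id,
    }
```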

4. Enrich Costco data

./venv/bin/python enrich_costco.py

Input:

  • costco_output/raw/*.json

Output:

  • costco_output/items_enriched.csv

5. Build shared downstream layers

./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py

These scripts consume the enriched item files and generate the downstream product-model outputs.
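As a rough illustration of the observed-product layer, the sketch below collapses enriched item rows into one observed product per (retailer, description) pair. The grouping key and row fields are assumptions for illustration; the real logic is in build_observed_products.py.

```python
from collections import defaultdict

def build_observed_products(enriched_rows):
    """Collapse enriched item rows into observed products, counting
    how often each (retailer, description) pair was seen. The key
    choice is illustrative, not the repo's actual grouping rule."""
    seen = defaultdict(int)
    for row in enriched_rows:
        seen[(row["retailer"], row["description"])] += 1
    return [
        {"retailer": r, "description": d, "times_seen": n}
        for (r, d), n in sorted(seen.items())
    ]
```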

Current outputs on disk:

  • retailer-facing:
    • giant_output/products_observed.csv
    • giant_output/review_queue.csv
    • giant_output/products_canonical.csv
    • giant_output/product_links.csv
  • cross-retailer proof/check output:
    • combined_output/products_observed.csv
    • combined_output/products_canonical.csv
    • combined_output/product_links.csv
    • combined_output/proof_examples.csv

6. Validate cross-retailer flow

./venv/bin/python validate_cross_retailer_flow.py

This is a proof/check step, not the main acquisition path.

Inputs And Outputs By Directory

giant_output/

Inputs to this layer:

  • Firefox session data for Giant
  • Giant raw JSON payloads

Generated files:

  • raw/history.json
  • raw/<order_id>.json
  • orders.csv
  • items.csv
  • items_enriched.csv
  • products_observed.csv
  • review_queue.csv
  • products_canonical.csv
  • product_links.csv

costco_output/

Inputs to this layer:

  • Firefox session data for Costco
  • Costco raw GraphQL receipt payloads

Generated files:

  • raw/summary.json
  • raw/summary_requests.json
  • raw/<receipt_id>-<timestamp>.json
  • orders.csv
  • items.csv
  • items_enriched.csv

combined_output/

Generated by cross-retailer proof/build scripts:

  • products_observed.csv
  • products_canonical.csv
  • product_links.csv
  • proof_examples.csv

Notes

  • The pipeline is intentionally simple and currently manual.
  • Scraping is retailer-specific and fragile; downstream modeling is shared only after enrichment.
  • summary_requests.json is diagnostic metadata from Costco summary enumeration and is not a receipt payload.
  • enrich_costco.py skips that file and only parses receipt payloads.
  • The repo may contain archived or sample output files under archive/; they are not part of the active scrape path.

Verification

Run the full test suite with:

./venv/bin/python -m unittest discover -s tests

Useful one-off checks:

./venv/bin/python scrape_giant.py --help
./venv/bin/python scrape_costco.py --help
./venv/bin/python enrich_giant.py
./venv/bin/python enrich_costco.py

Project Docs

  • pm/tasks.org
  • pm/data-model.org
  • pm/scrape-giant.org