
scrape-giant

Small grocery-history pipeline for Giant and Costco receipt data.

This repo is still a manual, stepwise pipeline. There is no single orchestrator script yet. Each stage is run directly, and later stages depend on files produced by earlier stages.

What The Project Does

The current flow is:

  1. acquire raw Giant receipt/history data
  2. enrich Giant line items into a shared enriched-item schema
  3. acquire raw Costco receipt data
  4. enrich Costco line items into the same shared enriched-item schema
  5. build observed-product, review, and canonical-product layers
  6. validate that Giant and Costco can flow through the same downstream model

Raw retailer JSON remains the source of truth.

Current Scripts

  • scrape_giant.py: Fetch Giant in-store history and order detail payloads from an active Firefox session.
  • scrape_costco.py: Fetch Costco receipt summary/detail payloads from an active Firefox session. For session auth, Costco uses .env header values when present and otherwise falls back to exact Firefox local-storage values.
  • enrich_giant.py: Parse Giant raw order JSON into giant_output/items_enriched.csv.
  • enrich_costco.py: Parse Costco raw receipt JSON into costco_output/items_enriched.csv.
  • build_observed_products.py: Build retailer-facing observed products from enriched rows.
  • build_review_queue.py: Build a manual review queue for low-confidence or unresolved observed products.
  • build_canonical_layer.py: Build shared canonical products and observed-to-canonical links.
  • validate_cross_retailer_flow.py: Write a proof/check output showing that Giant and Costco can meet in the same downstream model.
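Since there is no orchestrator script yet, a minimal sketch of what one could look like is shown below. The stage order comes from this README; the `run_pipeline` helper itself is hypothetical and stops at the first stage that exits nonzero.

```python
import subprocess

# Stage order as documented in this README. The orchestrator itself
# does not exist in the repo; this is an illustrative sketch only.
STAGES = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
    "validate_cross_retailer_flow.py",
]

def run_pipeline(python="./venv/bin/python", stages=STAGES, runner=subprocess.run):
    """Run each stage in order; return the name of the first failed
    stage, or None if every stage succeeded."""
    for script in stages:
        result = runner([python, script])
        if result.returncode != 0:
            return script
    return None
```

The `runner` parameter exists only so the ordering logic can be exercised without actually invoking the scripts.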

Manual Pipeline

Run these from the repo root with the venv active, or call them through ./venv/bin/python.

1. Acquire Giant raw data

./venv/bin/python scrape_giant.py

Inputs:

  • active Firefox session for giantfood.com
  • GIANT_USER_ID and GIANT_LOYALTY_NUMBER from .env, shell env, or prompt

Outputs:

  • giant_output/raw/history.json
  • giant_output/raw/<order_id>.json
  • giant_output/orders.csv
  • giant_output/items.csv
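A sketch of the credential resolution described above (shell env, then .env values, then an interactive prompt). The exact precedence is an assumption based on this README's wording; check scrape_giant.py for the real order.

```python
import os

def resolve_credential(name, env=None, dotenv=None, prompt=input):
    """Resolve a credential such as GIANT_USER_ID.

    Lookup order (assumed): shell environment, then values parsed
    from .env, then an interactive prompt as a last resort.
    """
    env = os.environ if env is None else env
    dotenv = {} if dotenv is None else dotenv
    value = env.get(name) or dotenv.get(name)
    if not value:
        value = prompt(f"{name}: ")
    return value
```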

2. Enrich Giant data

./venv/bin/python enrich_giant.py

Input:

  • giant_output/raw/*.json

Output:

  • giant_output/items_enriched.csv
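The enrichment step flattens raw order JSON into item rows. The sketch below illustrates the shape of that transformation; the field names ("orderId", "items", "description", "unitPrice") are assumptions, not the real Giant payload schema, which lives in giant_output/raw/ and enrich_giant.py.

```python
import csv
import json
from pathlib import Path

def enrich_giant_raw(raw_dir, out_csv):
    """Flatten raw Giant order JSON files into enriched item rows
    and write them as CSV. Field names are illustrative only."""
    rows = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        if path.name == "history.json":  # order index, not an order payload
            continue
        order = json.loads(path.read_text())
        for item in order.get("items", []):
            rows.append({
                "order_id": order.get("orderId"),
                "description": item.get("description"),
                "quantity": item.get("quantity"),
                "unit_price": item.get("unitPrice"),
            })
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["order_id", "description", "quantity", "unit_price"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```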

3. Acquire Costco raw data

./venv/bin/python scrape_costco.py

Useful optional flags:

./venv/bin/python scrape_costco.py --months-back 36
./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"

Inputs:

  • active Firefox session for costco.com
  • optional .env values:
    • COSTCO_X_AUTHORIZATION
    • COSTCO_X_WCS_CLIENTID
    • COSTCO_CLIENT_IDENTIFIER
  • if COSTCO_X_AUTHORIZATION is absent, the script falls back to exact Firefox local-storage values:
    • idToken -> sent as Bearer <idToken>
    • clientID -> used as costco-x-wcs-clientId when env is blank

Outputs:

  • costco_output/raw/summary.json
  • costco_output/raw/summary_requests.json
  • costco_output/raw/<receipt_id>-<timestamp>.json
  • costco_output/orders.csv
  • costco_output/items.csv
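The auth fallback above can be sketched as a small header-resolution helper. The env-variable names and the `Bearer <idToken>` form come from this README; the exact header names and casing are assumptions, so defer to scrape_costco.py for the real request headers.

```python
def resolve_costco_headers(env, local_storage):
    """Build Costco auth headers: .env values first, then Firefox
    local-storage fallbacks (idToken as a Bearer token, clientID
    when the env client id is blank). Header names are assumed."""
    auth = env.get("COSTCO_X_AUTHORIZATION")
    if not auth:
        # Fallback documented above: exact local-storage idToken,
        # sent as "Bearer <idToken>".
        auth = f"Bearer {local_storage['idToken']}"
    client_id = env.get("COSTCO_X_WCS_CLIENTID") or local_storage.get("clientID")
    return {
        "costco-x-authorization": auth,
        "costco-x-wcs-clientId": client_id,
    }
```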

4. Enrich Costco data

./venv/bin/python enrich_costco.py

Input:

  • costco_output/raw/*.json

Output:

  • costco_output/items_enriched.csv

5. Build shared downstream layers

./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py

These scripts consume the enriched item files and generate the downstream product-model outputs.
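As a rough illustration of the observed-product layer, the sketch below collapses enriched item rows into one observed product per (retailer, description) pair. The grouping key and row fields are assumptions for illustration; the real logic is in build_observed_products.py.

```python
from collections import defaultdict

def build_observed_products(enriched_rows):
    """Collapse enriched item rows into observed products, counting
    how often each (retailer, description) pair was seen. The key
    choice is illustrative, not the repo's actual grouping rule."""
    seen = defaultdict(int)
    for row in enriched_rows:
        seen[(row["retailer"], row["description"])] += 1
    return [
        {"retailer": r, "description": d, "times_seen": n}
        for (r, d), n in sorted(seen.items())
    ]
```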

Current outputs on disk:

  • retailer-facing:
    • giant_output/products_observed.csv
    • giant_output/review_queue.csv
    • giant_output/products_canonical.csv
    • giant_output/product_links.csv
  • cross-retailer proof/check output:
    • combined_output/products_observed.csv
    • combined_output/products_canonical.csv
    • combined_output/product_links.csv
    • combined_output/proof_examples.csv

6. Validate cross-retailer flow

./venv/bin/python validate_cross_retailer_flow.py

This is a proof/check step, not the main acquisition path.

Inputs And Outputs By Directory

giant_output/

Inputs to this layer:

  • Firefox session data for Giant
  • Giant raw JSON payloads

Generated files:

  • raw/history.json
  • raw/<order_id>.json
  • orders.csv
  • items.csv
  • items_enriched.csv
  • products_observed.csv
  • review_queue.csv
  • products_canonical.csv
  • product_links.csv

costco_output/

Inputs to this layer:

  • Firefox session data for Costco
  • Costco raw GraphQL receipt payloads

Generated files:

  • raw/summary.json
  • raw/summary_requests.json
  • raw/<receipt_id>-<timestamp>.json
  • orders.csv
  • items.csv
  • items_enriched.csv

combined_output/

Generated by cross-retailer proof/build scripts:

  • products_observed.csv
  • products_canonical.csv
  • product_links.csv
  • proof_examples.csv

Notes

  • The pipeline is intentionally simple and currently manual.
  • Scraping is retailer-specific and fragile; downstream modeling is shared only after enrichment.
  • summary_requests.json is diagnostic metadata from Costco summary enumeration and is not a receipt payload.
  • enrich_costco.py skips that file and only parses receipt payloads.
  • The repo may contain archived or sample output files under archive/; they are not part of the active scrape path.

Verification

Run the full test suite with:

./venv/bin/python -m unittest discover -s tests

Useful one-off checks:

./venv/bin/python scrape_giant.py --help
./venv/bin/python scrape_costco.py --help
./venv/bin/python enrich_giant.py
./venv/bin/python enrich_costco.py

Project Docs

  • pm/tasks.org
  • pm/data-model.org
  • pm/scrape-giant.org