
scrape-giant

Small grocery-history pipeline for Giant receipts.

The project currently does four things:

  1. scrape Giant in-store order history from an active Firefox session
  2. enrich raw line items into a deterministic items_enriched.csv
  3. aggregate retailer-facing observed products and build a manual review queue
  4. create a first-pass canonical product layer plus conservative auto-links

The work so far is Giant-specific on the ingest side and intentionally simple on the shared product-model side.

Current flow

Run the commands from the repo root with the project venv active, or call them directly through ./venv/bin/python.

./venv/bin/python scraper.py
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
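
The same sequence can be driven from one small Python wrapper. This is only a sketch: the script names and their order come from the commands above, while PIPELINE, build_commands, run_pipeline, and the skip_scrape option are hypothetical names introduced here and are not part of the repo.

```python
import subprocess
import sys
from pathlib import Path

# Script order copied from the flow above.
PIPELINE = [
    "scraper.py",
    "enrich_giant.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
]


def build_commands(python=sys.executable, skip_scrape=False):
    """Return one argv list per pipeline step, in order."""
    scripts = PIPELINE[1:] if skip_scrape else PIPELINE
    return [[python, script] for script in scripts]


def run_pipeline(repo_root=".", python=sys.executable, skip_scrape=False):
    """Run each step from the repo root and stop at the first failure."""
    for cmd in build_commands(python, skip_scrape):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, cwd=Path(repo_root), check=True)
```

Calling run_pipeline(python="./venv/bin/python", skip_scrape=True) would rebuild the derived files without touching the live Firefox session.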

Inputs

  • Firefox cookies for giantfood.com
  • GIANT_USER_ID and GIANT_LOYALTY_NUMBER, supplied via .env, the shell environment, or interactive prompts
  • Giant raw order payloads in giant_output/raw/
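
One way to resolve the two identifiers from those three sources is sketched below. The precedence (shell environment, then .env, then a prompt) is an assumption, as are the helper names parse_env_file and resolve_setting; the scraper's actual lookup may differ.

```python
import os
from pathlib import Path


def parse_env_file(path=".env"):
    """Read simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(name, env_file_values, environ=None, prompt=input):
    """Assumed precedence: shell env, then .env values, then a prompt."""
    environ = os.environ if environ is None else environ
    value = environ.get(name) or env_file_values.get(name)
    if not value:
        value = prompt(f"{name}: ").strip()
    return value
```

For example, resolve_setting("GIANT_USER_ID", parse_env_file()) only prompts when neither source provides a value.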

Outputs

Current generated files live under giant_output/:

  • orders.csv: flattened visit/order rows from the Giant history API
  • items.csv: flattened raw line items from fetched order detail payloads
  • items_enriched.csv: deterministic parsed/enriched line items
  • products_observed.csv: retailer-facing observed product groups
  • review_queue.csv: products needing manual review
  • products_canonical.csv: shared canonical product rows
  • product_links.csv: observed-to-canonical links

Raw JSON remains the source of truth:

  • giant_output/raw/history.json
  • giant_output/raw/<order_id>.json
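
Because the CSVs are derived from these files, a rebuild can start by walking the raw directory. The sketch below assumes only the layout listed above (one history.json plus one file per order); iter_order_payloads is a hypothetical helper, and nothing is assumed about the payload structure.

```python
import json
from pathlib import Path


def iter_order_payloads(raw_dir="giant_output/raw"):
    """Yield (order_id, payload) for each per-order file, skipping history.json."""
    for path in sorted(Path(raw_dir).glob("*.json")):
        if path.name == "history.json":
            continue
        with path.open(encoding="utf-8") as fh:
            # The file stem is treated as the order id, per the layout above.
            yield path.stem, json.load(fh)
```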

Scripts

  • scraper.py: fetches Giant history/detail payloads and updates orders.csv and items.csv
  • enrich_giant.py: reads raw Giant order JSON and writes items_enriched.csv
  • build_observed_products.py: groups enriched rows into products_observed.csv
  • build_review_queue.py: generates review_queue.csv and preserves review status on reruns
  • build_canonical_layer.py: builds products_canonical.csv and product_links.csv
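
Preserving review status on reruns amounts to merging the freshly generated queue with the previous one. The sketch below shows one possible approach; the column names review_id and status and the function merge_review_status are placeholders, not the real review_queue.csv schema (see pm/data-model.org for that).

```python
def merge_review_status(new_rows, old_rows, key="review_id", keep=("status",)):
    """Carry reviewer-edited fields from the old queue into the new one.

    Rows are plain dicts, e.g. as produced by csv.DictReader.
    """
    previous = {row[key]: row for row in old_rows}
    merged = []
    for row in new_rows:
        out = dict(row)
        old = previous.get(row[key])
        if old is not None:
            for field in keep:
                # Only overwrite when the old row actually had a value.
                if old.get(field):
                    out[field] = old[field]
        merged.append(out)
    return merged
```

Rows that no longer appear in the new queue are dropped in this sketch, and rows seen for the first time keep whatever default status the generator assigned.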

Notes on the current model

  • Observed products are retailer-specific: Giant today, with Costco planned.
  • Canonical products are the first cross-retailer layer.
  • Auto-linking is conservative: exact UPC first, then exact normalized name plus exact size/unit context, then exact normalized name when there is no size context to conflict.
  • Fee rows are excluded from auto-linking.
  • Unknown values are left blank instead of guessed.
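
The tiered rules above can be sketched as a single lookup. This is an illustration under assumptions, not the code in build_canonical_layer.py: the field names (upc, name, size_value, size_unit, is_fee, canonical_id) are hypothetical, normalization is reduced to lowercasing and whitespace collapsing, and requiring exactly one candidate per tier is an added conservative choice. "No size context to conflict" is read here as "at least one side has no size recorded".

```python
def normalize_name(name):
    """Lowercase and collapse whitespace (a stand-in for real normalization)."""
    return " ".join(name.lower().split())


def _size(row):
    return (row.get("size_value") or "", row.get("size_unit") or "")


def find_link(observed, canonicals):
    """Return (canonical_id, rule) for the first tier with one match, else None."""
    if observed.get("is_fee"):
        return None  # fee rows are never auto-linked
    upc = observed.get("upc") or ""
    name = normalize_name(observed.get("name") or "")
    size = _size(observed)

    # Tier 1: exact UPC.
    if upc:
        hits = [c for c in canonicals if (c.get("upc") or "") == upc]
        if len(hits) == 1:
            return hits[0]["canonical_id"], "exact_upc"
    if not name:
        return None
    same_name = [c for c in canonicals
                 if normalize_name(c.get("name") or "") == name]

    # Tier 2: exact name plus exact size/unit.
    if any(size):
        hits = [c for c in same_name if _size(c) == size]
        if len(hits) == 1:
            return hits[0]["canonical_id"], "exact_name_size"

    # Tier 3: exact name where one side has no size to conflict with.
    hits = [c for c in same_name if not any(size) or not any(_size(c))]
    if len(hits) == 1:
        return hits[0]["canonical_id"], "exact_name"
    return None
```

Anything that returns None here would be left unlinked, which matches the rule that unknown values stay blank rather than being guessed.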

Verification

Run the test suite with:

./venv/bin/python -m unittest discover -s tests

Useful one-off rebuilds:

./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py

Project docs

  • pm/tasks.org: task log and evidence
  • pm/data-model.org: file layout and schema decisions

Status

Completed through t1.7:

  • Giant receipt fetch CLI
  • data model and file layout
  • Giant parser/enricher
  • observed products
  • review queue
  • canonical layer scaffold
  • conservative auto-link rules

Next planned task is t1.8: add a Costco raw ingest path.
