scrape-giant
Small grocery-history pipeline for Giant receipts.
The project currently does four things:
- scrape Giant in-store order history from an active Firefox session
- enrich raw line items into a deterministic items_enriched.csv
- aggregate retailer-facing observed products and build a manual review queue
- create a first-pass canonical product layer plus conservative auto-links
The work so far is Giant-specific on the ingest side and intentionally simple on the shared product-model side.
Current flow
Run the commands from the repo root with the project venv active, or call them
directly through ./venv/bin/python.
./venv/bin/python scraper.py
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
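The five commands above could also be driven from one small runner. The following is only a sketch: the script names and their order come from the flow above, while `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical names that do not exist in the repo.

```python
import subprocess
import sys
from pathlib import Path

# Script names and order are taken from the flow above.
# The runner itself is hypothetical and not part of the repo.
PIPELINE = [
    "scraper.py",
    "enrich_giant.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
]


def build_commands(python=sys.executable, steps=PIPELINE):
    """Return one argv list per pipeline step, in run order."""
    return [[python, step] for step in steps]


def run_pipeline(repo_root=Path("."), python=sys.executable):
    """Run each step from the repo root and stop at the first failure."""
    for argv in build_commands(python):
        print("running:", " ".join(argv))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against stale or partial output.
        subprocess.run(argv, cwd=repo_root, check=True)
```

Calling `run_pipeline(python="./venv/bin/python")` from the repo root would mirror the manual sequence. Stopping at the first failure is a design choice here, since each step reads the files written by the one before it.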
Inputs
- Firefox cookies for giantfood.com
- GIANT_USER_ID and GIANT_LOYALTY_NUMBER in .env, shell env, or prompts
- Giant raw order payloads in giant_output/raw/
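As an illustration of how GIANT_USER_ID and GIANT_LOYALTY_NUMBER might be resolved from the three sources listed above, here is a minimal sketch. The lookup order (shell env, then .env, then prompt) and the helper names `parse_env_file` and `resolve_setting` are assumptions; the scraper's actual behavior may differ.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    path = Path(path)
    if not path.exists():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(name, env_file=".env", environ=None, prompt=input):
    """Look in the shell env, then the .env file, then ask the user.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name) or parse_env_file(env_file).get(name)
    if not value:
        value = prompt(f"{name}: ").strip()
    return value
```

Under these assumptions, `resolve_setting("GIANT_USER_ID")` only prompts when neither the shell environment nor .env supplies a value.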
Outputs
Current generated files live under giant_output/:
- orders.csv: flattened visit/order rows from the Giant history API
- items.csv: flattened raw line items from fetched order detail payloads
- items_enriched.csv: deterministic parsed/enriched line items
- products_observed.csv: retailer-facing observed product groups
- review_queue.csv: products needing manual review
- products_canonical.csv: shared canonical product rows
- product_links.csv: observed-to-canonical links
Raw json remains the source of truth:
- giant_output/raw/history.json
- giant_output/raw/<order_id>.json
Scripts
- scraper.py: fetches Giant history/detail payloads and updates orders.csv and items.csv
- enrich_giant.py: reads raw Giant order json and writes items_enriched.csv
- build_observed_products.py: groups enriched rows into products_observed.csv
- build_review_queue.py: generates review_queue.csv and preserves review status on reruns
- build_canonical_layer.py: builds products_canonical.csv and product_links.csv
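One way build_review_queue.py could preserve review status on reruns is to read the previous review_queue.csv and carry each row's status forward by key. This is a sketch under assumptions: the column names `observed_product_id` and `status`, the default value `pending`, and both function names are hypothetical and not taken from the repo.

```python
import csv
from pathlib import Path


def load_existing_status(path, key_field="observed_product_id", status_field="status"):
    """Map each row key to its saved status; empty if the file does not exist yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as f:
        return {row[key_field]: row.get(status_field, "") for row in csv.DictReader(f)}


def merge_review_status(new_rows, existing_status,
                        key_field="observed_product_id",
                        status_field="status", default="pending"):
    """Return regenerated rows with any previously saved status carried over."""
    merged = []
    for row in new_rows:
        out = dict(row)  # copy so the caller's rows are not mutated
        out[status_field] = existing_status.get(row[key_field]) or default
        merged.append(out)
    return merged
```

Keying on a stable identifier rather than on row position means a rerun can add or drop queue entries without disturbing decisions already recorded.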
Notes on the current model
- Observed products are retailer-specific: Giant, Costco.
- Canonical products are the first cross-retailer layer.
- Auto-linking is conservative: exact UPC first, then exact normalized name plus exact size/unit context, then exact normalized name when there is no size context to conflict.
- Fee rows are excluded from auto-linking.
- Unknown values are left blank instead of guessed.
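The auto-link order described above can be sketched as a tiered matcher. The field names (`upc`, `name`, `size`, `unit`, `is_fee`, `canonical_id`), the rule labels, and the simple `normalize_name` are assumptions for illustration, not the repo's schema. The sketch also assumes that "no size context to conflict" means at least one side has no size/unit recorded.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace; a stand-in for real normalization."""
    return " ".join(name.lower().split())


def size_key(row):
    """Return (size, unit) when both are present, else None."""
    size, unit = row.get("size", ""), row.get("unit", "")
    return (size, unit) if size and unit else None


def auto_link(observed, canonicals):
    """Return (canonical_id, rule) for the first rule that matches, else None."""
    if observed.get("is_fee"):
        return None  # fee rows are excluded from auto-linking

    # Rule 1: exact UPC.
    upc = observed.get("upc", "")
    if upc:
        for c in canonicals:
            if c.get("upc") == upc:
                return c["canonical_id"], "exact_upc"

    name = normalize_name(observed.get("name", ""))
    if not name:
        return None  # blank name: nothing to match on, so do not guess

    same_name = [c for c in canonicals if normalize_name(c.get("name", "")) == name]
    obs_size = size_key(observed)

    # Rule 2: exact normalized name plus exact size/unit.
    for c in same_name:
        if obs_size is not None and size_key(c) == obs_size:
            return c["canonical_id"], "exact_name_size"

    # Rule 3: exact normalized name when one side has no size context,
    # so there is nothing that could conflict (assumed interpretation).
    for c in same_name:
        if obs_size is None or size_key(c) is None:
            return c["canonical_id"], "exact_name"

    return None  # same name but conflicting sizes: leave for manual review
```

Returning the rule label alongside the id keeps each link explainable, which fits the conservative intent: anything that falls through all three rules stays unlinked.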
Verification
Run the test suite with:
./venv/bin/python -m unittest discover -s tests
Useful one-off rebuilds:
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
Project docs
- pm/tasks.org: task log and evidence
- pm/data-model.org: file layout and schema decisions
Status
Completed through t1.7:
- Giant receipt fetch CLI
- data model and file layout
- Giant parser/enricher
- observed products
- review queue
- canonical layer scaffold
- conservative auto-link rules
Next planned task is t1.8: add a Costco raw ingest path.