ben/scrape-giant

ben 4fd309251d Record t1.8.6 task evidence

2026-03-16 13:54:11 -04:00

Record t1.8.6 task evidence

2026-03-16 13:54:11 -04:00

Add shared browser session bootstrap

2026-03-16 13:54:00 -04:00

.gitignore

added pm folder and tasks

2026-03-14 17:59:40 -04:00

agents.md

assume local venv available

2026-03-16 11:44:10 -04:00

browser_session.py

Add shared browser session bootstrap

2026-03-16 13:54:00 -04:00

build_canonical_layer.py

Extend shared schema for retailer-native ids

2026-03-16 09:17:36 -04:00

build_observed_products.py

Extend shared schema for retailer-native ids

2026-03-16 09:17:36 -04:00

build_review_queue.py

Extend shared schema for retailer-native ids

2026-03-16 09:17:36 -04:00

enrich_costco.py

Add Costco acquisition and enrich flow

2026-03-16 09:17:46 -04:00

enrich_giant.py

Extend shared schema for retailer-native ids

2026-03-16 09:17:36 -04:00

layer_helpers.py

Generate Giant observed products

2026-03-16 00:43:11 -04:00

README.md

updated scope to prep for costco scraper

2026-03-16 09:04:52 -04:00

requirements.txt

added dotenv and completed t1.1

2026-03-14 18:45:55 -04:00

retailer_sessions.py

Add shared browser session bootstrap

2026-03-16 13:54:00 -04:00

scrape_costco.py

Add shared browser session bootstrap

2026-03-16 13:54:00 -04:00

scrape_giant.py

Add shared browser session bootstrap

2026-03-16 13:54:00 -04:00

scraper.py

Add shared browser session bootstrap

2026-03-16 13:54:00 -04:00

validate_cross_retailer_flow.py

Add Costco acquisition and enrich flow

2026-03-16 09:17:46 -04:00

README.md

scrape-giant

Small grocery-history pipeline for Giant receipts.

The project currently does four things:

scrape Giant in-store order history from an active Firefox session
enrich raw line items into a deterministic items_enriched.csv
aggregate retailer-facing observed products and build a manual review queue
create a first-pass canonical product layer plus conservative auto-links

The work so far is Giant-specific on the ingest side and intentionally simple on the shared product-model side.

Current flow

Run the commands from the repo root with the project venv active, or call them directly through ./venv/bin/python.

./venv/bin/python scraper.py
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
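
The five steps above could also be driven from one small wrapper that stops at the first failure. This is a minimal sketch, not part of the repo: `build_commands` and `run_pipeline` are hypothetical helper names, and only the script order comes from this README.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from the "Current flow" section of this README.
PIPELINE_STEPS = [
    "scraper.py",
    "enrich_giant.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
]


def build_commands(python=sys.executable, repo_root=".", steps=PIPELINE_STEPS):
    """Return one argv list per pipeline step, in run order."""
    return [[str(python), str(Path(repo_root) / step)] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code."""
    for argv in commands:
        result = runner(argv)
        if result.returncode != 0:
            return result.returncode
    return 0
```

From the repo root, `run_pipeline(build_commands(python="./venv/bin/python"))` would run the same sequence as the commands listed above. The `runner` parameter exists only so the loop can be exercised without launching real processes.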

Inputs

Firefox cookies for giantfood.com
GIANT_USER_ID and GIANT_LOYALTY_NUMBER, supplied via .env, the shell environment, or an interactive prompt
Giant raw order payloads in giant_output/raw/
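
One way to read the "env first, then prompt" behaviour for the two Giant settings is sketched below. `resolve_setting` is a hypothetical helper, not the function the scraper uses, and it assumes any .env file has already been loaded into the process environment (for example with python-dotenv).

```python
import os


def resolve_setting(name, env=None, prompt=input):
    """Return the value for `name` from the environment, else ask the user.

    Assumes a .env file, if any, was loaded into the environment beforehand;
    that loading step is not shown here.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if value:
        return value
    # Nothing configured: fall back to an interactive prompt.
    return prompt(f"{name}: ").strip()
```

Calling `resolve_setting("GIANT_USER_ID")` and `resolve_setting("GIANT_LOYALTY_NUMBER")` at startup would cover all three input routes listed above.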

Outputs

Current generated files live under giant_output/:

orders.csv: flattened visit/order rows from the Giant history API
items.csv: flattened raw line items from fetched order detail payloads
items_enriched.csv: deterministic parsed/enriched line items
products_observed.csv: retailer-facing observed product groups
review_queue.csv: products needing manual review
products_canonical.csv: shared canonical product rows
product_links.csv: observed-to-canonical links

Raw JSON remains the source of truth:

giant_output/raw/history.json
giant_output/raw/<order_id>.json
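
Because the CSVs are derived from the raw files, they can in principle be regenerated at any time. The sketch below illustrates that idea only: the payload shape (`orderId`, `items`, `itemName`, and so on) and the output column names are hypothetical, since this README does not show the real Giant payload fields or the items.csv header.

```python
import json
from pathlib import Path


def flatten_order(payload):
    """Turn one order-detail payload into flat item rows.

    Assumes a payload shaped like {"orderId": ..., "items": [{...}, ...]}.
    Missing values become empty strings rather than guessed defaults.
    """
    rows = []
    for line_no, item in enumerate(payload.get("items") or [], start=1):
        row = {
            "order_id": payload.get("orderId"),
            "line_no": line_no,
            "item_name": item.get("itemName"),
            "upc": item.get("upc"),
            "quantity": item.get("quantity"),
            "price": item.get("price"),
        }
        rows.append({key: "" if value is None else str(value) for key, value in row.items()})
    return rows


def load_raw_orders(raw_dir):
    """Yield parsed payloads for every per-order file, skipping history.json."""
    for path in sorted(Path(raw_dir).glob("*.json")):
        if path.name == "history.json":
            continue
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)
```

Rows produced this way could be written with `csv.DictWriter`; the blank-string convention matches the "left blank instead of guessed" rule in the model notes.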

Scripts

scraper.py: fetches Giant history/detail payloads and updates orders.csv and items.csv
enrich_giant.py: reads raw Giant order JSON and writes items_enriched.csv
build_observed_products.py: groups enriched rows into products_observed.csv
build_review_queue.py: generates review_queue.csv and preserves review status on reruns
build_canonical_layer.py: builds products_canonical.csv and product_links.csv
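
The "preserves review status on reruns" behaviour of build_review_queue.py can be pictured as a keyed merge between the previous queue and the freshly generated candidates. This is a sketch under assumptions: the key column `observed_product_id` and the review columns `status` and `reviewer_note` are hypothetical names, not the script's actual schema.

```python
# Hypothetical review columns and the defaults given to newly queued rows.
REVIEW_FIELDS = {"status": "pending", "reviewer_note": ""}


def merge_review_queue(existing_rows, candidate_rows, key="observed_product_id"):
    """Rebuild the queue from fresh candidates, carrying over prior review fields."""
    previous = {row[key]: row for row in existing_rows}
    merged = []
    for candidate in candidate_rows:
        row = dict(candidate)
        old = previous.get(row[key], {})
        for field, default in REVIEW_FIELDS.items():
            # Keep what a reviewer already entered; otherwise use the default.
            row[field] = old.get(field) or default
        merged.append(row)
    return merged
```

Rebuilding from the candidates, rather than appending to the old file, means rows that no longer need review drop out while decisions already made are kept.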

Notes on the current model

Observed products are retailer-specific: Giant, Costco.
Canonical products are the first cross-retailer layer.
Auto-linking is conservative and applies three rules in order: exact UPC match; then exact normalized name plus exact size/unit context; then exact normalized name alone, but only when there is no size context to conflict.
Fee rows are excluded from auto-linking.
Unknown values are left blank instead of guessed.
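
The linking rules above can be sketched as a short cascade. This is an illustration, not the code in build_canonical_layer.py: the `Product` fields and rule labels are hypothetical, and "no size context to conflict" is read here in its strictest sense, meaning neither side carries size information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    product_id: str
    name_norm: str = ""
    upc: str = ""
    size_value: str = ""
    size_unit: str = ""
    is_fee: bool = False

    @property
    def has_size(self):
        return bool(self.size_value or self.size_unit)


def auto_link(observed, canonicals):
    """Return (canonical_id, rule) for the first matching rule, else None."""
    if observed.is_fee:
        return None  # fee rows are never auto-linked
    # Rule 1: exact UPC.
    if observed.upc:
        for canon in canonicals:
            if canon.upc == observed.upc:
                return canon.product_id, "exact_upc"
    if not observed.name_norm:
        return None  # unknown name: leave unlinked rather than guess
    same_name = [c for c in canonicals if c.name_norm == observed.name_norm]
    # Rule 2: exact normalized name plus exact size/unit.
    if observed.has_size:
        for canon in same_name:
            if (canon.size_value, canon.size_unit) == (observed.size_value, observed.size_unit):
                return canon.product_id, "exact_name_size"
        return None
    # Rule 3: exact normalized name, only when neither side has size context.
    for canon in same_name:
        if not canon.has_size:
            return canon.product_id, "exact_name"
    return None
```

Anything that falls through all three rules returns `None`, which is the case the manual review queue is meant to catch.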

Verification

Run the test suite with:

./venv/bin/python -m unittest discover -s tests

Useful one-off rebuilds:

./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py

Project docs

pm/tasks.org: task log and evidence
pm/data-model.org: file layout and schema decisions

Status

Completed through t1.7:

Giant receipt fetch CLI
data model and file layout
Giant parser/enricher
observed products
review queue
canonical layer scaffold
conservative auto-link rules

Next planned task is t1.8: add a Costco raw ingest path.