# scrape-giant

Small grocery-history pipeline for Giant receipts.

The project currently does four things:

1. scrape Giant in-store order history from an active Firefox session
2. enrich raw line items into a deterministic `items_enriched.csv`
3. aggregate retailer-facing observed products and build a manual review queue
4. create a first-pass canonical product layer plus conservative auto-links

The work so far is Giant-specific on the ingest side and intentionally simple on the shared product-model side.

## Current flow

Run the commands from the repo root with the project venv active, or call them directly through `./venv/bin/python`.

```bash
./venv/bin/python scraper.py
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```

## Inputs

- Firefox cookies for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
- Giant raw order payloads in `giant_output/raw/`

## Outputs

Current generated files live under `giant_output/`:

- `orders.csv`: flattened visit/order rows from the Giant history API
- `items.csv`: flattened raw line items from fetched order detail payloads
- `items_enriched.csv`: deterministic parsed/enriched line items
- `products_observed.csv`: retailer-facing observed product groups
- `review_queue.csv`: products needing manual review
- `products_canonical.csv`: shared canonical product rows
- `product_links.csv`: observed-to-canonical links

Raw JSON remains the source of truth:

- `giant_output/raw/history.json`
- `giant_output/raw/.json`

## Scripts

- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
- `enrich_giant.py`: reads raw Giant order JSON and writes `items_enriched.csv`
- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`

## Notes on the current model

- Observed products are retailer-specific: Giant, Costco.
- Canonical products are the first cross-retailer layer.
- Auto-linking is conservative: exact UPC first, then exact normalized name plus exact size/unit context, then exact normalized name when there is no size context to conflict.
- Fee rows are excluded from auto-linking.
- Unknown values are left blank instead of guessed.

## Verification

Run the test suite with:

```bash
./venv/bin/python -m unittest discover -s tests
```

Useful one-off rebuilds:

```bash
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```

## Project docs

- `pm/tasks.org`: task log and evidence
- `pm/data-model.org`: file layout and schema decisions

## Status

Completed through `t1.7`:

- Giant receipt fetch CLI
- data model and file layout
- Giant parser/enricher
- observed products
- review queue
- canonical layer scaffold
- conservative auto-link rules

Next planned task is `t1.8`: add a Costco raw ingest path.
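
## Sketch: auto-link order

The conservative auto-link order described in the notes on the current model (exact UPC, then exact normalized name plus size/unit, then exact normalized name with no size context) can be illustrated with a minimal sketch. This is not the code in `build_canonical_layer.py`. The `Product` class, its field names, the rule labels, and the choice to skip ambiguous matches are all assumptions made for illustration, and the real CSV columns may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Product:
    # Hypothetical fields; blank strings stand for "unknown, left blank".
    product_id: str
    normalized_name: str = ""
    upc: str = ""
    size_value: str = ""
    size_unit: str = ""
    is_fee: bool = False


def has_size(p: Product) -> bool:
    """True when the row carries any size/unit context."""
    return bool(p.size_value or p.size_unit)


def auto_link(observed: Product,
              canonicals: List[Product]) -> Optional[Tuple[str, str]]:
    """Return (canonical_id, rule_name), or None when no rule applies."""
    # Fee rows are never auto-linked.
    if observed.is_fee:
        return None

    # Rule 1: exact UPC. Assumption: only a unique hit counts.
    if observed.upc:
        hits = [c for c in canonicals if c.upc == observed.upc]
        if len(hits) == 1:
            return hits[0].product_id, "exact_upc"

    if not observed.normalized_name:
        return None
    same_name = [c for c in canonicals
                 if c.normalized_name == observed.normalized_name]

    # Rule 2: exact name plus exact size/unit context.
    if has_size(observed):
        key = (observed.size_value, observed.size_unit)
        hits = [c for c in same_name
                if (c.size_value, c.size_unit) == key]
        if len(hits) == 1:
            return hits[0].product_id, "exact_name_size"
        return None

    # Rule 3: exact name only, and only when neither side has size
    # context that could conflict (an assumed reading of the rule).
    hits = [c for c in same_name if not has_size(c)]
    if len(hits) == 1:
        return hits[0].product_id, "exact_name"
    return None
```

Anything that falls through returns `None`, which in this sketch would leave the observed product unlinked rather than guessing at a match.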