
scrape-giant

CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis.

Run each script step-by-step from the terminal.

What It Does

  1. scrape_giant.py: download Giant orders and items
  2. enrich_giant.py: normalize Giant line items
  3. scrape_costco.py: download Costco orders and items
  4. enrich_costco.py: normalize Costco line items
  5. build_purchases.py: combine retailer outputs into one purchase table
  6. review_products.py: review unresolved product matches in the terminal
  7. report_pipeline_status.py: show how many rows survive each stage

Active refactor entrypoints (preferred over the scrape_* and enrich_* scripts for collection and normalization):

  • collect_giant_web.py
  • collect_costco_web.py
  • normalize_giant_web.py
  • normalize_costco_web.py

Requirements

  • Python 3.10+
  • Firefox installed with active Giant and Costco sessions

Install

python -m venv venv
source ./venv/bin/activate    # on Windows: .\venv\Scripts\activate
pip install -r requirements.txt

Optional .env

The current version works best with a .env file in the project root. The collect scripts prompt for these values if they are not found in the current browser session.

  • collect_giant_web.py prompts if GIANT_USER_ID or GIANT_LOYALTY_NUMBER is missing.
  • collect_costco_web.py tries .env first, then Firefox local storage for session-backed values; COSTCO_CLIENT_IDENTIFIER should still be set explicitly.
  • Costco discount matching happens later in enrich_costco.py; you do not need to pre-clean discount lines by hand.

GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...

COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
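
The fallback described above (.env first, then a prompt) can be sketched roughly as follows. This is not the project's actual loader: read_env_file and resolve_setting are hypothetical names, the parser only handles plain KEY=VALUE lines, and the Firefox local storage lookup used by collect_costco_web.py is left out.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(name, file_values, prompt=input):
    """Return the first non-empty value from .env, the process environment, or a prompt."""
    value = file_values.get(name) or os.environ.get(name)
    if value:
        return value
    # Nothing configured: ask the user, as the collect scripts are described to do.
    return prompt(f"{name}: ").strip()
```

For example, resolve_setting("GIANT_USER_ID", read_env_file(".env")) returns the value from .env when it is set and only prompts when it is missing or empty.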

Current active path layout:

data/
  giant-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  review/
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    purchases.csv
    pipeline_status.csv
    pipeline_status.json
  catalog.csv

Run Order

Run the pipeline in this order:

python collect_giant_web.py
python normalize_giant_web.py
python collect_costco_web.py
python normalize_costco_web.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
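
If you would rather not type each command, the run order above could be wrapped in a small driver. This is a sketch only: the repository is not known to ship such a script, and build_commands and run_pipeline are hypothetical names. It stops at the first failing step because it passes check=True to subprocess.run.

```python
import subprocess
import sys

# Script names and order are copied from the run order above.
PIPELINE_STEPS = [
    ["collect_giant_web.py"],
    ["normalize_giant_web.py"],
    ["collect_costco_web.py"],
    ["normalize_costco_web.py"],
    ["build_purchases.py"],
    ["review_products.py"],
    ["build_purchases.py"],
    ["review_products.py", "--refresh-only"],
    ["report_pipeline_status.py"],
]


def build_commands(python=sys.executable):
    """Return one argv list per pipeline step, using the given interpreter."""
    return [[python, *step] for step in PIPELINE_STEPS]


def run_pipeline(runner=subprocess.run):
    """Run every step in order; check=True aborts on the first non-zero exit."""
    for command in build_commands():
        print("running:", " ".join(command[1:]))
        runner(command, check=True)
```

Call run_pipeline() from the project root. The interactive review step still reads from the terminal because the child process inherits standard input.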

Why run build_purchases.py twice:

  • first pass builds the current combined dataset and review queue inputs
  • review_products.py writes durable review decisions
  • second pass reapplies those decisions into the purchase output
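
The two-pass idea can be illustrated with a toy function. The field names (item_key, action, product_id, status) are assumptions made for this sketch and are not taken from the real CSV schemas or from build_purchases.py.

```python
def apply_resolutions(purchases, resolutions):
    """Return new purchase rows with saved review decisions applied.

    purchases: list of dicts, each with a hypothetical "item_key" field.
    resolutions: dict mapping item_key to a saved decision dict.
    """
    resolved = []
    for row in purchases:
        decision = resolutions.get(row["item_key"])
        if decision is None or decision["action"] == "skip":
            # No durable decision yet: the row stays in the review queue.
            resolved.append({**row, "status": "needs_review"})
        elif decision["action"] == "exclude":
            # Excluded items are dropped from the purchase output.
            continue
        else:
            # "link" or "create": attach the canonical product id.
            resolved.append(
                {**row, "product_id": decision["product_id"], "status": "resolved"}
            )
    return resolved
```

On the first pass resolutions is empty, so every row comes back as needs_review. After the review step has saved decisions, the same call produces linked rows, which is why the build step runs again.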

If you only want to refresh the queue without reviewing interactively:

python review_products.py --refresh-only

If you want a quick stage-by-stage accountability check:

python report_pipeline_status.py
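
For a rough idea of what a stage-by-stage check involves, the sketch below counts data rows in a few of the files from the path layout. It is not the logic of report_pipeline_status.py. The stage labels and the choice of files are assumptions, and a missing file is reported as None.

```python
import csv
from pathlib import Path

# Labels are hypothetical; paths follow the layout shown in this README.
STAGE_FILES = [
    ("giant collected", "data/giant-web/collected_items.csv"),
    ("giant normalized", "data/giant-web/normalized_items.csv"),
    ("costco collected", "data/costco-web/collected_items.csv"),
    ("costco normalized", "data/costco-web/normalized_items.csv"),
    ("purchases", "data/review/purchases.csv"),
]


def count_rows(path):
    """Count data rows (excluding the header) in a CSV, or None if it is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        return None
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def stage_counts(root="."):
    """Map each stage label to its row count under the given project root."""
    return {label: count_rows(Path(root) / rel) for label, rel in STAGE_FILES}
```

Printing stage_counts() after each step makes it easier to spot a stage that unexpectedly drops rows.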

Key Outputs

Giant:

  • data/giant-web/collected_orders.csv
  • data/giant-web/collected_items.csv
  • data/giant-web/normalized_items.csv

Costco:

  • data/costco-web/collected_orders.csv
  • data/costco-web/collected_items.csv
  • data/costco-web/normalized_items.csv
  • data/costco-web/normalized_items.csv preserves raw totals and matched net discount fields

Combined:

  • data/review/purchases.csv
  • data/review/review_queue.csv
  • data/review/review_resolutions.csv
  • data/review/product_links.csv
  • data/review/comparison_examples.csv
  • data/review/pipeline_status.csv
  • data/review/pipeline_status.json
  • data/catalog.csv

Review Workflow

Run review_products.py to clean up unresolved or weakly unified items:

  • link an item to an existing canonical product
  • create a new canonical product
  • exclude an item
  • skip it for later

Decisions are saved and reused on later runs.

The review step is intentionally conservative:

  • weak exact-name matches stay in the queue instead of auto-creating canonical products
  • canonical names should describe stable product identity, not retailer packaging text

Notes

  • This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
  • scrape_giant.py, scrape_costco.py, enrich_giant.py, and enrich_costco.py are now legacy-compatible entrypoints; prefer the collect_* and normalize_* scripts for active work.
  • Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
  • validate_cross_retailer_flow.py is a proof/check script, not a required production step.
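
The Costco discount note above can be pictured with a small matching sketch. It assumes each row carries hypothetical fields item_number, is_discount, applies_to and line_total_cents, and that a discount row names the item number it applies to. The real enrichment script may match differently. Amounts are integer cents to avoid floating-point drift.

```python
def match_discounts(rows):
    """Attach discount rows to purchased items without modifying the input rows.

    Returns (items, unmatched): items are copies of the non-discount rows with
    matched_discount_cents and net_total_cents added; unmatched lists the
    discount rows that had no target item.
    """
    items = [dict(r, matched_discount_cents=0) for r in rows if not r["is_discount"]]
    by_number = {}
    for item in items:
        by_number.setdefault(item["item_number"], []).append(item)

    unmatched = []
    for row in rows:
        if not row["is_discount"]:
            continue
        targets = by_number.get(row["applies_to"])
        if targets:
            # Discount totals are negative; apply to the first matching item.
            targets[0]["matched_discount_cents"] += row["line_total_cents"]
        else:
            unmatched.append(row)

    for item in items:
        item["net_total_cents"] = item["line_total_cents"] + item["matched_discount_cents"]
    return items, unmatched
```

Because the original rows are left untouched, the raw discount lines remain available for auditing while each item also carries its net total.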

Test

./venv/bin/python -m unittest discover -s tests

Project Docs

  • pm/tasks.org: task tracking
  • pm/data-model.org: current data model notes
  • pm/review-workflow.org: review and resolution workflow