# scrape-giant

CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis. Run each script step by step from the terminal.
## What It Does

- `collect_giant_web.py`: download Giant orders and items
- `normalize_giant_web.py`: normalize Giant line items
- `collect_costco_web.py`: download Costco orders and items
- `normalize_costco_web.py`: normalize Costco line items
- `build_purchases.py`: combine retailer outputs into one purchase table
- `review_products.py`: review unresolved product matches in the terminal
- `report_pipeline_status.py`: show how many rows survive each stage
- `analyze_purchases.py`: write chart-ready analysis CSVs from the purchase table
## Requirements
- Python 3.10+
- Firefox installed with active Giant and Costco sessions
## Install

```
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
## Optional .env

The current version works best with a `.env` file in the project root. The scrapers will prompt for these values if they are not found in the current browser session.

- `collect_giant_web.py` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `collect_costco_web.py` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.
```
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
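The prompt-if-missing behavior can be sketched roughly as below. This is a minimal illustration, not code from the scripts: `get_setting` is an invented name, and the real collectors also consult Firefox local storage, which is omitted here.

```python
import os

def get_setting(name: str, prompt_fn=input) -> str:
    """Return a configuration value, prompting the user if it is absent.

    Minimal sketch of the prompt-if-missing behavior; the real collectors
    also check Firefox local storage for session-backed values.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # Fall back to an interactive prompt, as the collectors do when
        # values like GIANT_USER_ID are not configured.
        value = prompt_fn(f"Enter {name}: ").strip()
    return value
```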
Current active path layout:

```
data/
  giant-web/
    raw/
      collected_orders.csv
      collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
      collected_orders.csv
      collected_items.csv
    normalized_items.csv
  review/
    catalog.csv
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    pipeline_status.csv
    pipeline_status.json
  analysis/
    purchases.csv
    comparison_examples.csv
    item_price_over_time.csv
    spend_by_visit.csv
    items_per_visit.csv
    category_spend_over_time.csv
    retailer_store_breakdown.csv
```
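If you need to recreate the workspace by hand, the directory skeleton can be materialized with a few lines of stdlib Python. This is a convenience sketch; the collection scripts may create these paths themselves, and `ensure_layout` is not a function from the project.

```python
from pathlib import Path

# Directories from the layout above; collected CSVs land under raw/,
# normalized and derived outputs beside or below them.
DATA_DIRS = [
    "data/giant-web/raw",
    "data/costco-web/raw",
    "data/review",
    "data/analysis",
]

def ensure_layout(root: Path = Path(".")) -> None:
    """Create the data directory skeleton if it does not exist yet."""
    for rel in DATA_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
```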
## Run Order

Run the pipeline in this order:

```
python collect_giant_web.py
python normalize_giant_web.py
python collect_costco_web.py
python normalize_costco_web.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
python analyze_purchases.py
```
Why run `build_purchases.py` twice:

- the first pass builds the current combined dataset and review queue inputs
- `review_products.py` writes durable review decisions
- the second pass reapplies those decisions into the purchase output
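The reapply step amounts to a lookup over saved decisions. This is an illustrative sketch only: `item_key`, `action`, and `canonical_product` are assumed field names, not the project's actual schema.

```python
def apply_resolutions(items, resolutions):
    """Reapply saved review decisions to combined purchase rows.

    Sketch of the second build_purchases.py pass: rows with a saved
    decision are linked or dropped; everything else passes through.
    """
    resolved = []
    for row in items:
        decision = resolutions.get(row["item_key"])
        if decision is None:
            resolved.append(row)  # no decision yet: stays unresolved
        elif decision["action"] == "exclude":
            continue  # reviewer excluded this item from the output
        else:
            # Link the row to the reviewer's canonical product.
            resolved.append({**row, "product": decision["canonical_product"]})
    return resolved
```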
If you only want to refresh the queue without reviewing interactively:

```
python review_products.py --refresh-only
```
If you want a quick stage-by-stage accountability check:

```
python report_pipeline_status.py
```
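Conceptually, the status report is a row count per stage CSV. The sketch below shows the idea; the stage list and the exact output of `report_pipeline_status.py` are assumptions.

```python
import csv
from pathlib import Path

def stage_row_counts(stages):
    """Count data rows (header excluded) for each (name, csv_path) stage.

    Missing files count as zero, so the report also flags stages that
    have not run yet.
    """
    counts = {}
    for name, path in stages:
        p = Path(path)
        if not p.exists():
            counts[name] = 0
            continue
        with p.open(newline="") as fh:
            counts[name] = max(sum(1 for _ in csv.reader(fh)) - 1, 0)
    return counts
```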
## Key Outputs
Giant:

- `data/giant-web/raw/collected_orders.csv`
- `data/giant-web/raw/collected_items.csv`
- `data/giant-web/normalized_items.csv`
Costco:

- `data/costco-web/raw/collected_orders.csv`
- `data/costco-web/raw/collected_items.csv`
- `data/costco-web/normalized_items.csv` (preserves raw totals and matched net discount fields)
Combined:

- `data/analysis/purchases.csv`
- `data/analysis/comparison_examples.csv`
- `data/analysis/item_price_over_time.csv`
- `data/analysis/spend_by_visit.csv`
- `data/analysis/items_per_visit.csv`
- `data/analysis/category_spend_over_time.csv`
- `data/analysis/retailer_store_breakdown.csv`
- `data/review/review_queue.csv`
- `data/review/review_resolutions.csv`
- `data/review/product_links.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
- `data/review/catalog.csv`
data/analysis/purchases.csv is the main analysis artifact. It is designed to support both:
- item-level price analysis
- visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown
The visit fields are carried directly in purchases.csv, so you can pivot on them without extra joins:
- `order_id`
- `purchase_date`
- `retailer`
- `store_name`
- `store_number`
- `store_city`
- `store_state`
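For example, visit-level spend falls out of a single group-by on those fields. This is a stdlib sketch; the `price` column name is an assumption about the purchases.csv schema.

```python
from collections import defaultdict

def spend_by_visit(rows):
    """Aggregate item rows into per-visit spend totals.

    Each row is a dict of purchases.csv fields; the visit key uses the
    columns carried directly on the table, so no joins are needed.
    """
    totals = defaultdict(float)
    for row in rows:
        key = (row["order_id"], row["purchase_date"], row["retailer"])
        totals[key] += float(row["price"])  # 'price' is an assumed column
    return dict(totals)
```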
## Review Workflow

Run `review_products.py` to clean up unresolved or weakly unified items:

- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Decisions are saved and reused on later runs.
The review step is intentionally conservative:
- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text
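That conservative policy can be sketched as: reuse a previously saved link if one exists, otherwise queue the item for a human even when its name matches a catalog entry exactly. This is illustrative only; `item_key`, `prior_links`, and the return shape are assumptions, not the project's schema.

```python
def triage_item(item, catalog, prior_links):
    """Route an incoming item: reuse a saved link, else queue it.

    'prior_links' maps a retailer item key to a canonical product name
    saved by an earlier review run.
    """
    if item["item_key"] in prior_links:
        return ("link", prior_links[item["item_key"]])
    norm = " ".join(item["name"].lower().split())
    candidates = [c for c in catalog if " ".join(c.lower().split()) == norm]
    # A name-only match is treated as weak: queue it with the candidate
    # attached rather than auto-creating or auto-linking a canonical product.
    return ("queue", candidates[0] if candidates else None)
```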
## Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
## Test

```
./venv/bin/python -m unittest discover -s tests
```
## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow