scrape-giant
CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis.
Run each script step-by-step from the terminal.
What It Does
- scrape_giant.py: download Giant orders and items
- enrich_giant.py: normalize Giant line items
- scrape_costco.py: download Costco orders and items
- enrich_costco.py: normalize Costco line items
- build_purchases.py: combine retailer outputs into one purchase table
- review_products.py: review unresolved product matches in the terminal
- report_pipeline_status.py: show how many rows survive each stage
Active refactor entrypoints:
- collect_giant_web.py
- collect_costco_web.py
- normalize_giant_web.py
- normalize_costco_web.py
Requirements
- Python 3.10+
- Firefox installed with active Giant and Costco sessions
Install
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt
Optional .env
The current version works best with a .env file in the project root. The collectors prompt for any of these values that they cannot find in .env or the current browser session.
- collect_giant_web.py prompts if GIANT_USER_ID or GIANT_LOYALTY_NUMBER is missing.
- collect_costco_web.py tries .env first, then Firefox local storage for session-backed values; COSTCO_CLIENT_IDENTIFIER should still be set explicitly.
- Costco discount matching happens later in enrich_costco.py; you do not need to pre-clean discount lines by hand.
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
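The lookup order described above (use .env when a value is present, otherwise ask) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `parse_env_text` and `get_setting` are hypothetical names, and the Firefox local-storage fallback that the Costco collector uses is left out.

```python
import os


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_setting(name: str, env_values: dict[str, str], prompt=input) -> str:
    """Return a setting from .env values, then os.environ, then a prompt."""
    value = env_values.get(name) or os.environ.get(name)
    if value:
        return value
    # Hypothetical fallback: ask the user, as the collectors do for Giant IDs.
    return prompt(f"{name}: ").strip()


# Example with inline text instead of a real .env file.
sample = "GIANT_USER_ID=abc123\n# comment\nGIANT_LOYALTY_NUMBER='440'\n"
settings = parse_env_text(sample)
```

Passing `prompt` as a parameter keeps the fallback testable without blocking on terminal input.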
Current active path layout:
data/
  giant-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  review/
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    purchases.csv
    pipeline_status.csv
    pipeline_status.json
  catalog.csv
Run Order
Run the pipeline in this order:
python collect_giant_web.py
python normalize_giant_web.py
python collect_costco_web.py
python normalize_costco_web.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
Why run build_purchases.py twice:
- first pass builds the current combined dataset and review queue inputs
- review_products.py writes durable review decisions
- second pass reapplies those decisions into the purchase output
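If typing the steps by hand gets tedious, the run order could be wrapped in a small driver. The sketch below is an assumption, not a script shipped with this project: `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical names, and it assumes every script exits non-zero on failure so that the run stops at the first broken stage.

```python
import subprocess
import sys

# Script names and order are taken from the run order in this README.
PIPELINE_STEPS = [
    ["collect_giant_web.py"],
    ["normalize_giant_web.py"],
    ["collect_costco_web.py"],
    ["normalize_costco_web.py"],
    ["build_purchases.py"],   # first pass: build dataset and queue inputs
    ["review_products.py"],   # interactive review in the terminal
    ["build_purchases.py"],   # second pass: reapply review decisions
    ["review_products.py", "--refresh-only"],
    ["report_pipeline_status.py"],
]


def build_commands(python=sys.executable):
    """Prefix each step with the interpreter that runs this driver."""
    return [[python, *step] for step in PIPELINE_STEPS]


def run_pipeline(runner=subprocess.run):
    """Run each step in order; check=True raises on a non-zero exit code."""
    for command in build_commands():
        print("running:", " ".join(command[1:]))
        runner(command, check=True)
```

Calling `run_pipeline()` from the project root with the virtual environment active would run all nine steps; the interactive review step inherits the terminal, so prompts still work.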
If you only want to refresh the queue without reviewing interactively:
python review_products.py --refresh-only
If you want a quick stage-by-stage accountability check:
python report_pipeline_status.py
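For a rough idea of what a stage-by-stage check involves, the sketch below counts data rows in the main CSV outputs. It is not the logic of report_pipeline_status.py, which may count differently; `STAGE_FILES`, `count_rows`, and `stage_counts` are hypothetical names, and the file paths are the ones listed in this README.

```python
import csv
from pathlib import Path

# Stage labels are illustrative; paths come from the layout in this README.
STAGE_FILES = {
    "giant collected": "data/giant-web/collected_items.csv",
    "giant normalized": "data/giant-web/normalized_items.csv",
    "costco collected": "data/costco-web/collected_items.csv",
    "costco normalized": "data/costco-web/normalized_items.csv",
    "purchases": "data/review/purchases.csv",
}


def count_rows(path):
    """Count data rows (header excluded); return None if the file is missing."""
    path = Path(path)
    if not path.exists():
        return None
    with path.open(newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.DictReader(handle))


def stage_counts(root="."):
    """Map each stage label to its row count, relative to the project root."""
    return {
        stage: count_rows(Path(root) / relative)
        for stage, relative in STAGE_FILES.items()
    }
```

A large drop between a collected and a normalized count, or a `None` for a stage that should have run, is the kind of thing the status report is meant to surface.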
Key Outputs
Giant:
- data/giant-web/collected_orders.csv
- data/giant-web/collected_items.csv
- data/giant-web/normalized_items.csv
Costco:
- data/costco-web/collected_orders.csv
- data/costco-web/collected_items.csv
- data/costco-web/normalized_items.csv
- data/costco-web/normalized_items.csv preserves raw totals and matched net discount fields
Combined:
- data/review/purchases.csv
- data/review/review_queue.csv
- data/review/review_resolutions.csv
- data/review/product_links.csv
- data/review/comparison_examples.csv
- data/review/pipeline_status.csv
- data/review/pipeline_status.json
- data/catalog.csv
Review Workflow
Run review_products.py to clean up unresolved or weakly unified items:
- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later
Decisions are saved and reused on later runs.
The review step is intentionally conservative:
- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text
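The idea of durable, reusable decisions can be illustrated with a small sketch. The column names (`item_key`, `action`, `canonical_product`) and the helper functions are assumptions made for this example; the real schema of review_resolutions.csv is described in pm/data-model.org and may differ. Only the four actions (link, create, exclude, skip) come from the workflow above.

```python
import csv
from pathlib import Path

RESOLUTIONS_PATH = Path("data/review/review_resolutions.csv")
FIELDS = ["item_key", "action", "canonical_product"]  # hypothetical columns


def load_resolutions(path=RESOLUTIONS_PATH):
    """Load saved decisions keyed by item; empty if nothing is saved yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as handle:
        return {row["item_key"]: row for row in csv.DictReader(handle)}


def save_resolution(item_key, action, canonical_product="", path=RESOLUTIONS_PATH):
    """Record one decision (link, create, exclude, or skip) and rewrite the file."""
    path = Path(path)
    resolutions = load_resolutions(path)
    resolutions[item_key] = {
        "item_key": item_key,
        "action": action,
        "canonical_product": canonical_product,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(resolutions.values())


def pending_items(queue_keys, resolutions):
    """Items that still need review: no decision yet, or skipped for later."""
    return [
        key for key in queue_keys
        if resolutions.get(key, {}).get("action") in (None, "skip")
    ]
```

Because decisions live in a file and are keyed by item, a later run only has to show items with no decision or a skip, which is the behavior the queue refresh relies on.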
Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- scrape_giant.py, scrape_costco.py, enrich_giant.py, and enrich_costco.py are now legacy-compatible entrypoints; prefer the collect_* and normalize_* scripts for active work.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- validate_cross_retailer_flow.py is a proof/check script, not a required production step.
Test
./venv/bin/python -m unittest discover -s tests
Project Docs
- pm/tasks.org: task tracking
- pm/data-model.org: current data model notes
- pm/review-workflow.org: review and resolution workflow