scrape-giant
CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis.
Run each script step-by-step from the terminal.
What It Does
- scrape_giant.py: download Giant orders and items
- enrich_giant.py: normalize Giant line items
- scrape_costco.py: download Costco orders and items
- enrich_costco.py: normalize Costco line items
- build_purchases.py: combine retailer outputs into one purchase table
- review_products.py: review unresolved product matches in the terminal
Requirements
- Python 3.10+
- Firefox installed with active Giant and Costco sessions
Install
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt
Optional .env
The current version works best with a .env file in the project root. The scrapers prompt for these values if they are not found in the current browser session.
- scrape_giant prompts if GIANT_USER_ID or GIANT_LOYALTY_NUMBER is missing.
- scrape_costco tries .env first, then Firefox local storage for session-backed values; COSTCO_CLIENT_IDENTIFIER should still be set explicitly.
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
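As a quick pre-flight check, the minimal sketch below parses .env-style text and reports which of the variables above are unset. It is not part of the project: `parse_env_text`, `missing_keys`, and `EXPECTED_KEYS` are hypothetical names, and the parser assumes plain KEY=VALUE lines with no `export` prefixes or multi-line values.

```python
# Variable names copied from the .env example above.
EXPECTED_KEYS = [
    "GIANT_USER_ID",
    "GIANT_LOYALTY_NUMBER",
    "COSTCO_X_AUTHORIZATION",
    "COSTCO_X_WCS_CLIENTID",
    "COSTCO_CLIENT_IDENTIFIER",
]


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks, comments, and junk."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, expected=EXPECTED_KEYS) -> list[str]:
    """Return the expected keys that are absent or empty, in listed order."""
    values = parse_env_text(text)
    return [key for key in expected if not values.get(key)]
```

Passing the contents of the project's .env file to `missing_keys` would list the values the scrapers may still need to prompt for or read from Firefox.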
Run Order
Run the pipeline in this order:
python scrape_giant.py
python enrich_giant.py
python scrape_costco.py
python enrich_costco.py
python build_purchases.py
python review_products.py
python build_purchases.py
Why run build_purchases.py twice:
- first pass builds the current combined dataset and review queue inputs
- review_products.py writes durable review decisions
- second pass reapplies those decisions into the purchase output
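The sequence above could be wrapped in a small runner. The sketch below assumes every script sits in the current directory and exits non-zero on failure; `STEPS` and `run_pipeline` are illustrative names, not part of the repository.

```python
import subprocess
import sys

# Script order copied from the run order above; build_purchases.py appears
# twice so the second pass can pick up the saved review decisions.
STEPS = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_purchases.py",
    "review_products.py",
    "build_purchases.py",
]


def run_pipeline(steps=STEPS, runner=subprocess.run, python=sys.executable):
    """Run each script in order, stopping at the first failure."""
    for script in steps:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later steps never run after a failed one.
        runner([python, script], check=True)
```

Calling `run_pipeline()` from the project root would execute the seven steps in turn. By default `subprocess.run` inherits the terminal's standard input, so the interactive review step should still be able to prompt.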
If you only want to refresh the queue without reviewing interactively:
python review_products.py --refresh-only
Key Outputs
Giant:
- giant_output/orders.csv
- giant_output/items.csv
- giant_output/items_enriched.csv
Costco:
- costco_output/orders.csv
- costco_output/items.csv
- costco_output/items_enriched.csv
Combined:
- combined_output/purchases.csv
- combined_output/review_queue.csv
- combined_output/review_resolutions.csv
- combined_output/canonical_catalog.csv
- combined_output/product_links.csv
- combined_output/comparison_examples.csv
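For external analysis, any of these files can be inspected with the standard library alone. The sketch below makes no assumption about column names; it only reports the header and the row count. `summarize_rows` and `summarize_file` are illustrative helpers, not project code.

```python
import csv
from pathlib import Path


def summarize_rows(lines):
    """Return (column names, data row count) for an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    count = sum(1 for _ in reader)
    return list(reader.fieldnames or []), count


def summarize_file(path):
    """Open a CSV file and summarize it with summarize_rows."""
    # newline="" is the csv module's recommended way to open CSV files.
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return summarize_rows(handle)
```

After a full run, `summarize_file("combined_output/purchases.csv")` would return the column list and the number of purchase rows, which is a cheap sanity check before loading the file into another tool.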
Review Workflow
Run review_products.py to clean up unresolved or weakly unified items:
- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Decisions are saved and reused on later runs.
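To illustrate how saved decisions can shrink the queue on later runs, here is a minimal sketch. The column names (`item_key`, `action`, `canonical_id`) and both helper functions are hypothetical; the real schema of review_resolutions.csv and the logic in review_products.py may differ.

```python
import csv

# The four actions listed above; the string labels are assumptions.
ACTIONS = {"link", "create", "exclude", "skip"}


def load_decisions(lines):
    """Map each item key to its most recent decision row."""
    decisions = {}
    for row in csv.DictReader(lines):
        if row["action"] not in ACTIONS:
            raise ValueError(f"unknown action: {row['action']!r}")
        # Later rows overwrite earlier ones for the same item.
        decisions[row["item_key"]] = row
    return decisions


def pending_items(queue_keys, decisions):
    """Items still needing review: never decided, or skipped for later."""
    return [
        key
        for key in queue_keys
        if decisions.get(key, {}).get("action", "skip") == "skip"
    ]
```

Under these assumptions, an item that was linked, created, or excluded drops out of the pending list, while skipped and undecided items come back on the next review pass.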
Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- scrape_giant.py and scrape_costco.py are meant to work as standalone acquisition scripts.
- validate_cross_retailer_flow.py is a proof/check script, not a required production step.
Test
./venv/bin/python -m unittest discover -s tests
Project Docs
- pm/tasks.org: task tracking
- pm/data-model.org: current data model notes
- pm/review-workflow.org: review and resolution workflow