scrape-giant
Small grocery-history pipeline for Giant and Costco receipt data.
This repo is still a manual, stepwise pipeline. There is no single orchestrator script yet. Each stage is run directly, and later stages depend on files produced by earlier stages.
What The Project Does
The current flow is:
- acquire raw Giant receipt/history data
- enrich Giant line items into a shared enriched-item schema
- acquire raw Costco receipt data
- enrich Costco line items into the same shared enriched-item schema
- build observed-product, review, and canonical-product layers
- validate that Giant and Costco can flow through the same downstream model
Raw retailer JSON remains the source of truth.
Current Scripts
- scrape_giant.py: Fetch Giant in-store history and order detail payloads from an active Firefox session.
- scrape_costco.py: Fetch Costco receipt summary/detail payloads from an active Firefox session. Costco currently prefers .env header values first, then falls back to exact Firefox local-storage values for session auth.
- enrich_giant.py: Parse Giant raw order JSON into giant_output/items_enriched.csv.
- enrich_costco.py: Parse Costco raw receipt JSON into costco_output/items_enriched.csv.
- build_observed_products.py: Build retailer-facing observed products from enriched rows.
- build_review_queue.py: Build a manual review queue for low-confidence or unresolved observed products.
- build_canonical_layer.py: Build shared canonical products and observed-to-canonical links.
- validate_cross_retailer_flow.py: Write a proof/check output showing that Giant and Costco can meet in the same downstream model.
Manual Pipeline
Run these from the repo root with the venv active, or call them through ./venv/bin/python.
1. Acquire Giant raw data
./venv/bin/python scrape_giant.py
Inputs:
- active Firefox session for giantfood.com
- GIANT_USER_ID and GIANT_LOYALTY_NUMBER from .env, shell env, or prompt
Outputs:
- giant_output/raw/history.json
- giant_output/raw/<order_id>.json
- giant_output/orders.csv
- giant_output/items.csv
2. Enrich Giant data
./venv/bin/python enrich_giant.py
Input:
- giant_output/raw/*.json
Output:
- giant_output/items_enriched.csv
3. Acquire Costco raw data
./venv/bin/python scrape_costco.py
Optional useful flags:
./venv/bin/python scrape_costco.py --months-back 36
./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
Inputs:
- active Firefox session for costco.com
- optional .env values:
  - COSTCO_X_AUTHORIZATION
  - COSTCO_X_WCS_CLIENTID
  - COSTCO_CLIENT_IDENTIFIER
- if COSTCO_X_AUTHORIZATION is absent, the script falls back to exact Firefox local-storage values:
  - idToken -> sent as Bearer <idToken>
  - clientID -> used as costco-x-wcs-clientId when env is blank
Outputs:
- costco_output/raw/summary.json
- costco_output/raw/summary_requests.json
- costco_output/raw/<receipt_id>-<timestamp>.json
- costco_output/orders.csv
- costco_output/items.csv
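The auth fallback listed under the inputs for this step can be sketched as below. This is an illustration, not the code in scrape_costco.py: the function name `resolve_costco_auth` and its return shape are hypothetical, and the sketch assumes the clientID fallback applies whenever the env client ID is blank. Only the `costco-x-wcs-clientId` header name is stated in this README, so the result is keyed by the .env variable names rather than by HTTP header names.

```python
from typing import Dict, Mapping


def resolve_costco_auth(
    env: Mapping[str, str],
    local_storage: Mapping[str, str],
) -> Dict[str, str]:
    """Resolve Costco auth values, preferring env over Firefox local storage.

    Hypothetical helper: the name and return shape are illustrative only.
    The result is keyed by the .env variable names.
    """
    resolved = {
        key: env.get(key, "").strip()
        for key in (
            "COSTCO_X_AUTHORIZATION",
            "COSTCO_X_WCS_CLIENTID",
            "COSTCO_CLIENT_IDENTIFIER",
        )
    }

    # No authorization value in env: build one from the local-storage idToken.
    if not resolved["COSTCO_X_AUTHORIZATION"]:
        id_token = local_storage.get("idToken", "").strip()
        if id_token:
            resolved["COSTCO_X_AUTHORIZATION"] = f"Bearer {id_token}"

    # No client ID in env: use the local-storage clientID instead.
    # Assumption: this fallback is independent of the authorization fallback.
    if not resolved["COSTCO_X_WCS_CLIENTID"]:
        resolved["COSTCO_X_WCS_CLIENTID"] = local_storage.get("clientID", "").strip()

    return resolved
```

Calling `resolve_costco_auth(os.environ, storage_values)` with a dict of local-storage values would give env values priority and fill only the blanks from Firefox. How the real script reads local storage is not covered here.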
4. Enrich Costco data
./venv/bin/python enrich_costco.py
Input:
- costco_output/raw/*.json
Output:
- costco_output/items_enriched.csv
5. Build shared downstream layers
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
These scripts consume the enriched item files and generate the downstream product-model outputs.
Current outputs on disk:
- retailer-facing:
  - giant_output/products_observed.csv
  - giant_output/review_queue.csv
  - giant_output/products_canonical.csv
  - giant_output/product_links.csv
- cross-retailer proof/check output:
  - combined_output/products_observed.csv
  - combined_output/products_canonical.csv
  - combined_output/product_links.csv
  - combined_output/proof_examples.csv
6. Validate cross-retailer flow
./venv/bin/python validate_cross_retailer_flow.py
This is a proof/check step, not the main acquisition path.
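Since the repo has no orchestrator, the six steps above could be chained with a small wrapper like the following sketch. It is not part of the repo: `STAGES`, `build_commands`, and `run_pipeline` are hypothetical names, and it assumes each script exits non-zero on failure, which this README does not confirm.

```python
import subprocess
from typing import List, Sequence

# Stage order taken from the manual steps above.
STAGES: Sequence[str] = (
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
    "validate_cross_retailer_flow.py",
)


def build_commands(python: str = "./venv/bin/python") -> List[List[str]]:
    """Return one argv list per stage, in pipeline order."""
    return [[python, script] for script in STAGES]


def run_pipeline(python: str = "./venv/bin/python") -> None:
    """Run every stage in order from the repo root, stopping at the first failure."""
    for command in build_commands(python):
        print("running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later stages never run against missing or stale inputs.
        subprocess.run(command, check=True)
```

The scrape stages still need an active Firefox session and may prompt for values, so a wrapper like this would not make the pipeline unattended.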
Inputs And Outputs By Directory
giant_output/
Inputs to this layer:
- Firefox session data for Giant
- Giant raw JSON payloads
Generated files:
- raw/history.json
- raw/<order_id>.json
- orders.csv
- items.csv
- items_enriched.csv
- products_observed.csv
- review_queue.csv
- products_canonical.csv
- product_links.csv
costco_output/
Inputs to this layer:
- Firefox session data for Costco
- Costco raw GraphQL receipt payloads
Generated files:
- raw/summary.json
- raw/summary_requests.json
- raw/<receipt_id>-<timestamp>.json
- orders.csv
- items.csv
- items_enriched.csv
combined_output/
Generated by cross-retailer proof/build scripts:
- products_observed.csv
- products_canonical.csv
- product_links.csv
- proof_examples.csv
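Because later stages depend on files written by earlier ones, a quick existence check against the per-directory lists above can show which stage still needs to run. The sketch below is a hypothetical helper, not something in the repo. It covers only the fixed file names and leaves out the per-order and per-receipt raw payloads, whose names vary.

```python
from pathlib import Path
from typing import Dict, List

# Fixed-name generated files, copied from the directory listings above.
EXPECTED_OUTPUTS: Dict[str, List[str]] = {
    "giant_output": [
        "raw/history.json",
        "orders.csv",
        "items.csv",
        "items_enriched.csv",
        "products_observed.csv",
        "review_queue.csv",
        "products_canonical.csv",
        "product_links.csv",
    ],
    "costco_output": [
        "raw/summary.json",
        "raw/summary_requests.json",
        "orders.csv",
        "items.csv",
        "items_enriched.csv",
    ],
    "combined_output": [
        "products_observed.csv",
        "products_canonical.csv",
        "product_links.csv",
        "proof_examples.csv",
    ],
}


def missing_outputs(repo_root: Path) -> List[str]:
    """Return repo-relative paths of expected files that do not exist yet."""
    missing = []
    for directory, names in EXPECTED_OUTPUTS.items():
        for name in names:
            if not (repo_root / directory / name).is_file():
                missing.append(f"{directory}/{name}")
    return missing
```

Running `missing_outputs(Path("."))` from the repo root returns an empty list once every stage has produced its fixed-name files.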
Notes
- The pipeline is intentionally simple and currently manual.
- Scraping is retailer-specific and fragile; downstream modeling is shared only after enrichment.
- summary_requests.json is diagnostic metadata from Costco summary enumeration and is not a receipt payload. enrich_costco.py skips that file and only parses receipt payloads.
- The repo may contain archived or sample output files under archive/; they are not part of the active scrape path.
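The skip rule for summary_requests.json amounts to a file-name filter over the raw directory. A minimal sketch, assuming a plain name check: `SKIPPED_NAMES` and `receipt_payload_paths` are hypothetical and do not describe how enrich_costco.py is written. Whether summary.json gets separate handling is not stated here, so the sketch leaves it in the result.

```python
from pathlib import Path
from typing import List

# Diagnostic files that are not receipt payloads (per the note above).
SKIPPED_NAMES = {"summary_requests.json"}


def receipt_payload_paths(raw_dir: Path) -> List[Path]:
    """List the *.json files in raw_dir, excluding known diagnostic files."""
    return sorted(
        path for path in raw_dir.glob("*.json") if path.name not in SKIPPED_NAMES
    )
```

For example, `receipt_payload_paths(Path("costco_output/raw"))` would return every JSON file there except summary_requests.json, in sorted order.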
Verification
Run the full test suite with:
./venv/bin/python -m unittest discover -s tests
Useful one-off checks:
./venv/bin/python scrape_giant.py --help
./venv/bin/python scrape_costco.py --help
./venv/bin/python enrich_giant.py
./venv/bin/python enrich_costco.py
Project Docs
- pm/tasks.org
- pm/data-model.org
- pm/scrape-giant.org