# scrape-giant

Small grocery-history pipeline for Giant and Costco receipt data.

This repo is still a manual, stepwise pipeline. There is no single orchestrator script yet. Each stage is run directly, and later stages depend on files produced by earlier stages.

## What The Project Does

The current flow is:

1. acquire raw Giant receipt/history data
2. enrich Giant line items into a shared enriched-item schema
3. acquire raw Costco receipt data
4. enrich Costco line items into the same shared enriched-item schema
5. build observed-product, review, and canonical-product layers
6. validate that Giant and Costco can flow through the same downstream model

Raw retailer JSON remains the source of truth.

## Current Scripts

- `scrape_giant.py`: fetch Giant in-store history and order detail payloads from an active Firefox session.
- `scrape_costco.py`: fetch Costco receipt summary/detail payloads from an active Firefox session. Costco currently prefers `.env` header values, then falls back to exact Firefox local-storage values for session auth.
- `enrich_giant.py`: parse Giant raw order JSON into `giant_output/items_enriched.csv`.
- `enrich_costco.py`: parse Costco raw receipt JSON into `costco_output/items_enriched.csv`.
- `build_observed_products.py`: build retailer-facing observed products from enriched rows.
- `build_review_queue.py`: build a manual review queue for low-confidence or unresolved observed products.
- `build_canonical_layer.py`: build shared canonical products and observed-to-canonical links.
- `validate_cross_retailer_flow.py`: write a proof/check output showing that Giant and Costco can meet in the same downstream model.

## Manual Pipeline

Run these from the repo root with the venv active, or call them through `./venv/bin/python`.

### 1. Acquire Giant raw data

```bash
./venv/bin/python scrape_giant.py
```

Inputs:

- active Firefox session for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from `.env`, shell env, or prompt

Outputs:

- `giant_output/raw/history.json`
- `giant_output/raw/.json`
- `giant_output/orders.csv`
- `giant_output/items.csv`

### 2. Enrich Giant data

```bash
./venv/bin/python enrich_giant.py
```

Input:

- `giant_output/raw/*.json`

Output:

- `giant_output/items_enriched.csv`

### 3. Acquire Costco raw data

```bash
./venv/bin/python scrape_costco.py
```

Optional useful flags:

```bash
./venv/bin/python scrape_costco.py --months-back 36
./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
```

Inputs:

- active Firefox session for `costco.com`
- optional `.env` values:
  - `COSTCO_X_AUTHORIZATION`
  - `COSTCO_X_WCS_CLIENTID`
  - `COSTCO_CLIENT_IDENTIFIER`
- if `COSTCO_X_AUTHORIZATION` is absent, the script falls back to exact Firefox local-storage values:
  - `idToken` -> sent as `Bearer `
  - `clientID` -> used as `costco-x-wcs-clientId` when env is blank

Outputs:

- `costco_output/raw/summary.json`
- `costco_output/raw/summary_requests.json`
- `costco_output/raw/-.json`
- `costco_output/orders.csv`
- `costco_output/items.csv`

### 4. Enrich Costco data

```bash
./venv/bin/python enrich_costco.py
```

Input:

- `costco_output/raw/*.json`

Output:

- `costco_output/items_enriched.csv`

### 5. Build shared downstream layers

```bash
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```

These scripts consume the enriched item files and generate the downstream product-model outputs.
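The observed-product step groups enriched rows from both retailers into one observed product per distinct item. As a rough illustration of that idea only, here is a minimal sketch; the column names (`retailer`, `item_name`), the grouping key, and the function name are assumptions for this example, not the repo's actual schema or logic:

```python
import csv
import io
from collections import defaultdict

def build_observed_products(enriched_rows):
    """Group enriched line items into one observed product per
    (retailer, normalized item name). Hypothetical sketch only."""
    groups = defaultdict(list)
    for row in enriched_rows:
        # Normalize the name so repeat purchases collapse together.
        key = (row["retailer"], row["item_name"].strip().lower())
        groups[key].append(row)
    return [
        {"retailer": retailer, "observed_name": name, "times_seen": len(rows)}
        for (retailer, name), rows in sorted(groups.items())
    ]

# Stand-in for giant_output/items_enriched.csv + costco_output/items_enriched.csv.
sample = io.StringIO(
    "retailer,item_name\n"
    "giant,Bananas\n"
    "giant,bananas \n"
    "costco,Kirkland Eggs\n"
)
products = build_observed_products(list(csv.DictReader(sample)))
```

Whatever key the real script uses, the same grouping is the natural place where low-confidence or unresolved products get flagged for the review queue.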
Current outputs on disk:

- retailer-facing:
  - `giant_output/products_observed.csv`
  - `giant_output/review_queue.csv`
  - `giant_output/products_canonical.csv`
  - `giant_output/product_links.csv`
- cross-retailer proof/check output:
  - `combined_output/products_observed.csv`
  - `combined_output/products_canonical.csv`
  - `combined_output/product_links.csv`
  - `combined_output/proof_examples.csv`

### 6. Validate cross-retailer flow

```bash
./venv/bin/python validate_cross_retailer_flow.py
```

This is a proof/check step, not the main acquisition path.

## Inputs And Outputs By Directory

### `giant_output/`

Inputs to this layer:

- Firefox session data for Giant
- Giant raw JSON payloads

Generated files:

- `raw/history.json`
- `raw/.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
- `products_observed.csv`
- `review_queue.csv`
- `products_canonical.csv`
- `product_links.csv`

### `costco_output/`

Inputs to this layer:

- Firefox session data for Costco
- Costco raw GraphQL receipt payloads

Generated files:

- `raw/summary.json`
- `raw/summary_requests.json`
- `raw/-.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`

### `combined_output/`

Generated by cross-retailer proof/build scripts:

- `products_observed.csv`
- `products_canonical.csv`
- `product_links.csv`
- `proof_examples.csv`

## Notes

- The pipeline is intentionally simple and currently manual.
- Scraping is retailer-specific and fragile; downstream modeling is shared only after enrichment.
- `summary_requests.json` is diagnostic metadata from Costco summary enumeration and is not a receipt payload.
- `enrich_costco.py` skips that file and only parses receipt payloads.
- The repo may contain archived or sample output files under `archive/`; they are not part of the active scrape path.
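The note about `enrich_costco.py` skipping diagnostic files can be pictured as a filename filter over `costco_output/raw/`. The function name, the exact skip rule, and the example receipt filename below are illustrative assumptions, not the script's actual implementation:

```python
from pathlib import Path

# Known diagnostic/enumeration files that are not receipt payloads.
SKIP_NAMES = {"summary.json", "summary_requests.json"}

def receipt_payload_paths(raw_files):
    """Keep only raw JSON files that look like receipt payloads,
    dropping the summary/diagnostic files. Hypothetical sketch."""
    return [p for p in raw_files
            if p.suffix == ".json" and p.name not in SKIP_NAMES]

# Hypothetical directory listing (receipt filename is made up).
files = [
    Path("costco_output/raw/summary.json"),
    Path("costco_output/raw/summary_requests.json"),
    Path("costco_output/raw/receipt_001.json"),
]
payloads = receipt_payload_paths(files)
```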
## Verification

Run the full test suite with:

```bash
./venv/bin/python -m unittest discover -s tests
```

Useful one-off checks:

```bash
./venv/bin/python scrape_giant.py --help
./venv/bin/python scrape_costco.py --help
./venv/bin/python enrich_giant.py
./venv/bin/python enrich_costco.py
```

## Project Docs

- `pm/tasks.org`
- `pm/data-model.org`
- `pm/scrape-giant.org`
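For context on the verification command above, `unittest discover -s tests` picks up `test_*.py` modules under `tests/`. Here is a hypothetical smoke test in that style; the expected column names and the test/class names are illustrative assumptions, not the repo's actual tests:

```python
import csv
import io
import unittest

# Columns a shared enriched-item CSV might carry (assumed, not the real schema).
EXPECTED_COLUMNS = {"retailer", "item_name", "quantity", "price"}

class EnrichedSchemaTest(unittest.TestCase):
    def test_enriched_csv_has_shared_columns(self):
        # Stand-in for reading giant_output/items_enriched.csv.
        sample = io.StringIO(
            "retailer,item_name,quantity,price\n"
            "giant,Bananas,1,0.99\n"
        )
        reader = csv.DictReader(sample)
        self.assertTrue(EXPECTED_COLUMNS.issubset(reader.fieldnames))

# Run the case directly, as `unittest discover` would.
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(EnrichedSchemaTest))
```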