diff --git a/README.md b/README.md
index f593c0d..0ff4e66 100644
--- a/README.md
+++ b/README.md
@@ -1,103 +1,227 @@
 # scrape-giant
 
-Small grocery-history pipeline for Giant receipts.
+Small grocery-history pipeline for Giant and Costco receipt data.
 
-The project currently does four things:
+This repo is still a manual, stepwise pipeline. There is no single orchestrator
+script yet. Each stage is run directly, and later stages depend on files
+produced by earlier stages.
 
-1. scrape Giant in-store order history from an active Firefox session
-2. enrich raw line items into a deterministic `items_enriched.csv`
-3. aggregate retailer-facing observed products and build a manual review queue
-4. create a first-pass canonical product layer plus conservative auto-links
+## What The Project Does
 
-The work so far is Giant-specific on the ingest side and intentionally simple on
-the shared product-model side.
+The current flow is:
 
-## Current flow
+1. acquire raw Giant receipt/history data
+2. enrich Giant line items into a shared enriched-item schema
+3. acquire raw Costco receipt data
+4. enrich Costco line items into the same shared enriched-item schema
+5. build observed-product, review, and canonical-product layers
+6. validate that Giant and Costco can flow through the same downstream model
 
-Run the commands from the repo root with the project venv active, or call them
-directly through `./venv/bin/python`.
+Raw retailer JSON remains the source of truth.
+
+## Current Scripts
+
+- `scrape_giant.py`
+  Fetch Giant in-store history and order detail payloads from an active Firefox
+  session.
+- `scrape_costco.py`
+  Fetch Costco receipt summary/detail payloads from an active Firefox session.
+  Costco currently prefers `.env` header values first, then falls back to exact
+  Firefox local-storage values for session auth.
+- `enrich_giant.py`
+  Parse Giant raw order JSON into `giant_output/items_enriched.csv`.
+- `enrich_costco.py`
+  Parse Costco raw receipt JSON into `costco_output/items_enriched.csv`.
+- `build_observed_products.py`
+  Build retailer-facing observed products from enriched rows.
+- `build_review_queue.py`
+  Build a manual review queue for low-confidence or unresolved observed
+  products.
+- `build_canonical_layer.py`
+  Build shared canonical products and observed-to-canonical links.
+- `validate_cross_retailer_flow.py`
+  Write a proof/check output showing that Giant and Costco can meet in the same
+  downstream model.
+
+## Manual Pipeline
+
+Run these from the repo root with the venv active, or call them through
+`./venv/bin/python`.
+
+### 1. Acquire Giant raw data
+
+```bash
+./venv/bin/python scrape_giant.py
+```
+
+Inputs:
+- active Firefox session for `giantfood.com`
+- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from `.env`, shell env, or prompt
+
+Outputs:
+- `giant_output/raw/history.json`
+- `giant_output/raw/.json`
+- `giant_output/orders.csv`
+- `giant_output/items.csv`
+
+### 2. Enrich Giant data
 
 ```bash
-./venv/bin/python scraper.py
 ./venv/bin/python enrich_giant.py
+```
+
+Input:
+- `giant_output/raw/*.json`
+
+Output:
+- `giant_output/items_enriched.csv`
+
+### 3. Acquire Costco raw data
+
+```bash
+./venv/bin/python scrape_costco.py
+```
+
+Optional useful flags:
+
+```bash
+./venv/bin/python scrape_costco.py --months-back 36
+./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
+```
+
+Inputs:
+- active Firefox session for `costco.com`
+- optional `.env` values:
+  - `COSTCO_X_AUTHORIZATION`
+  - `COSTCO_X_WCS_CLIENTID`
+  - `COSTCO_CLIENT_IDENTIFIER`
+- if `COSTCO_X_AUTHORIZATION` is absent, the script falls back to exact Firefox
+  local-storage values:
+  - `idToken` -> sent as `Bearer `
+  - `clientID` -> used as `costco-x-wcs-clientId` when env is blank
+
+Outputs:
+- `costco_output/raw/summary.json`
+- `costco_output/raw/summary_requests.json`
+- `costco_output/raw/-.json`
+- `costco_output/orders.csv`
+- `costco_output/items.csv`
+
+### 4. Enrich Costco data
+
+```bash
+./venv/bin/python enrich_costco.py
+```
+
+Input:
+- `costco_output/raw/*.json`
+
+Output:
+- `costco_output/items_enriched.csv`
+
+### 5. Build shared downstream layers
+
+```bash
 ./venv/bin/python build_observed_products.py
 ./venv/bin/python build_review_queue.py
 ./venv/bin/python build_canonical_layer.py
 ```
 
-## Inputs
+These scripts consume the enriched item files and generate the downstream
+product-model outputs.
 
-- Firefox cookies for `giantfood.com`
-- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
-- Giant raw order payloads in `giant_output/raw/`
+Current outputs on disk:
 
-## Outputs
+- retailer-facing:
+  - `giant_output/products_observed.csv`
+  - `giant_output/review_queue.csv`
+  - `giant_output/products_canonical.csv`
+  - `giant_output/product_links.csv`
+- cross-retailer proof/check output:
+  - `combined_output/products_observed.csv`
+  - `combined_output/products_canonical.csv`
+  - `combined_output/product_links.csv`
+  - `combined_output/proof_examples.csv`
 
-Current generated files live under `giant_output/`:
+### 6. Validate cross-retailer flow
 
-- `orders.csv`: flattened visit/order rows from the Giant history API
-- `items.csv`: flattened raw line items from fetched order detail payloads
-- `items_enriched.csv`: deterministic parsed/enriched line items
-- `products_observed.csv`: retailer-facing observed product groups
-- `review_queue.csv`: products needing manual review
-- `products_canonical.csv`: shared canonical product rows
-- `product_links.csv`: observed-to-canonical links
+```bash
+./venv/bin/python validate_cross_retailer_flow.py
+```
 
-Raw json remains the source of truth:
+This is a proof/check step, not the main acquisition path.
 
-- `giant_output/raw/history.json`
-- `giant_output/raw/.json`
+## Inputs And Outputs By Directory
 
-## Scripts
+### `giant_output/`
 
-- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
-- `enrich_giant.py`: reads raw Giant order json and writes `items_enriched.csv`
-- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
-- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
-- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`
+Inputs to this layer:
+- Firefox session data for Giant
+- Giant raw JSON payloads
 
-## Notes on the current model
+Generated files:
+- `raw/history.json`
+- `raw/.json`
+- `orders.csv`
+- `items.csv`
+- `items_enriched.csv`
+- `products_observed.csv`
+- `review_queue.csv`
+- `products_canonical.csv`
+- `product_links.csv`
 
-- Observed products are retailer-specific: Giant, Costco.
-- Canonical products are the first cross-retailer layer.
-- Auto-linking is conservative:
-  exact UPC first, then exact normalized name plus exact size/unit context, then
-  exact normalized name when there is no size context to conflict.
-- Fee rows are excluded from auto-linking.
-- Unknown values are left blank instead of guessed.
+### `costco_output/`
+
+Inputs to this layer:
+- Firefox session data for Costco
+- Costco raw GraphQL receipt payloads
+
+Generated files:
+- `raw/summary.json`
+- `raw/summary_requests.json`
+- `raw/-.json`
+- `orders.csv`
+- `items.csv`
+- `items_enriched.csv`
+
+### `combined_output/`
+
+Generated by cross-retailer proof/build scripts:
+- `products_observed.csv`
+- `products_canonical.csv`
+- `product_links.csv`
+- `proof_examples.csv`
+
+## Notes
+
+- The pipeline is intentionally simple and currently manual.
+- Scraping is retailer-specific and fragile; downstream modeling is shared only
+  after enrichment.
+- `summary_requests.json` is diagnostic metadata from Costco summary enumeration
+  and is not a receipt payload.
+- `enrich_costco.py` skips that file and only parses receipt payloads.
+- The repo may contain archived or sample output files under `archive/`; they
+  are not part of the active scrape path.
 
 ## Verification
 
-Run the test suite with:
+Run the full test suite with:
 
 ```bash
 ./venv/bin/python -m unittest discover -s tests
 ```
 
-Useful one-off rebuilds:
+Useful one-off checks:
 
 ```bash
+./venv/bin/python scrape_giant.py --help
+./venv/bin/python scrape_costco.py --help
 ./venv/bin/python enrich_giant.py
-./venv/bin/python build_observed_products.py
-./venv/bin/python build_review_queue.py
-./venv/bin/python build_canonical_layer.py
+./venv/bin/python enrich_costco.py
 ```
 
-## Project docs
+## Project Docs
 
-- `pm/tasks.org`: task log and evidence
-- `pm/data-model.org`: file layout and schema decisions
-
-## Status
-
-Completed through `t1.7`:
-
-- Giant receipt fetch CLI
-- data model and file layout
-- Giant parser/enricher
-- observed products
-- review queue
-- canonical layer scaffold
-- conservative auto-link rules
-
-Next planned task is `t1.8`: add a Costco raw ingest path.
+- `pm/tasks.org`
+- `pm/data-model.org`
+- `pm/scrape-giant.org`
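
Note on the stage order: the new README text says there is no single orchestrator script yet and that later stages depend on files produced by earlier ones. A minimal sketch of how the documented order could be chained is below. It is not part of this change. `STAGES`, `build_commands`, and `run_pipeline` are hypothetical names, and the sketch assumes every script can run with no arguments from the repo root, which may not hold for the scrapers since they need an active Firefox session.

```python
"""Hypothetical stage runner for the manual pipeline; not part of the repo."""
import subprocess
import sys
from pathlib import Path

# Script names and order are taken from the README's "Manual Pipeline" section.
STAGES = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
    "validate_cross_retailer_flow.py",
]


def build_commands(python=sys.executable, repo_root=".", skip=()):
    """Return one argv list per stage, in pipeline order, minus skipped stages."""
    return [
        [str(python), str(Path(repo_root) / script)]
        for script in STAGES
        if script not in skip
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failing stage."""
    for argv in commands:
        print("running:", " ".join(argv))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later stages never run against missing or stale inputs.
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_commands(python="./venv/bin/python", skip=("scrape_giant.py", "scrape_costco.py")))` would rebuild everything downstream of the raw JSON, assuming the raw payloads are already on disk. Stopping at the first failure is deliberate: each stage reads files written by the previous one.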