updated readme with Review steps

2026-03-17 09:14:14 -04:00
parent 91bfd3597e
commit 7f8c3ed8eb
3 changed files with 147 additions and 191 deletions
--- a/README.md
+++ b/README.md
@@ -1,227 +1,118 @@
 # scrape-giant

-Small grocery-history pipeline for Giant and Costco receipt data.
+Small CLI pipeline for pulling purchase history from Giant and Costco, enriching line items, and building a reviewable cross-retailer purchase dataset.

-This repo is still a manual, stepwise pipeline. There is no single orchestrator
-script yet. Each stage is run directly, and later stages depend on files
-produced by earlier stages.
+There is no one-shot runner yet. Today, you run the scripts step by step from the terminal.

-## What The Project Does
+## What It Does

-The current flow is:
+- `scrape_giant.py`: download Giant orders and items
+- `enrich_giant.py`: normalize Giant line items
+- `scrape_costco.py`: download Costco orders and items
+- `enrich_costco.py`: normalize Costco line items
+- `build_purchases.py`: combine retailer outputs into one purchase table
+- `review_products.py`: review unresolved product matches in the terminal

-1. acquire raw Giant receipt/history data
-2. enrich Giant line items into a shared enriched-item schema
-3. acquire raw Costco receipt data
-4. enrich Costco line items into the same shared enriched-item schema
-5. build observed-product, review, and canonical-product layers
-6. validate that Giant and Costco can flow through the same downstream model
+## Requirements

-Raw retailer JSON remains the source of truth.
+- Python 3.10+
+- Firefox installed with active Giant and Costco sessions

-## Current Scripts
-
- `scrape_giant.py`
-  Fetch Giant in-store history and order detail payloads from an active Firefox
-  session.
- `scrape_costco.py`
-  Fetch Costco receipt summary/detail payloads from an active Firefox session.
-  Costco currently prefers `.env` header values first, then falls back to exact
-  Firefox local-storage values for session auth.
- `enrich_giant.py`
-  Parse Giant raw order JSON into `giant_output/items_enriched.csv`.
- `enrich_costco.py`
-  Parse Costco raw receipt JSON into `costco_output/items_enriched.csv`.
- `build_observed_products.py`
-  Build retailer-facing observed products from enriched rows.
- `build_review_queue.py`
-  Build a manual review queue for low-confidence or unresolved observed
-  products.
- `build_canonical_layer.py`
-  Build shared canonical products and observed-to-canonical links.
- `validate_cross_retailer_flow.py`
-  Write a proof/check output showing that Giant and Costco can meet in the same
-  downstream model.
-
-## Manual Pipeline
-
-Run these from the repo root with the venv active, or call them through
-`./venv/bin/python`.
-
-### 1. Acquire Giant raw data
+## Install

 ```bash
-./venv/bin/python scrape_giant.py
+python -m venv venv
+source venv/bin/activate  # Windows (Git Bash): source venv/Scripts/activate
+pip install -r requirements.txt
 ```

-Inputs:
- active Firefox session for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from `.env`, shell env, or prompt
+## Optional `.env`

-Outputs:
- `giant_output/raw/history.json`
- `giant_output/raw/<order_id>.json`
+The current version works best with a `.env` file in the project root. When a value is missing, each scraper falls back as follows:
+- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
+- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
+
+```env
+GIANT_USER_ID=...
+GIANT_LOYALTY_NUMBER=...
+
+# Costco can use these if present, but it can also pull session values from Firefox.
+COSTCO_X_AUTHORIZATION=...
+COSTCO_X_WCS_CLIENTID=...
+COSTCO_CLIENT_IDENTIFIER=...
+```
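The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the scrapers' actual code: `resolve_setting` and its `session_values` parameter are hypothetical names, and the sketch assumes the `.env` file has already been loaded into the process environment.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_setting(
    name: str,
    env: Mapping[str, str] = os.environ,
    session_values: Optional[Mapping[str, str]] = None,
    prompt: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a setting from env first, then session values, then a prompt."""
    value = env.get(name, "").strip()
    if value:
        return value

    # Hypothetical stand-in for values read from Firefox local storage.
    if session_values:
        value = session_values.get(name, "").strip()
        if value:
            return value

    # Last resort: ask the user, e.g. by passing prompt=input.
    if prompt is not None:
        value = prompt(f"{name}: ").strip()
        if value:
            return value

    raise KeyError(f"{name} is not set")
```

For example, `resolve_setting("GIANT_USER_ID", prompt=input)` would mirror the Giant behavior (prompt when missing), while passing `session_values` without a prompt would mirror the Costco behavior.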
+
+## Run Order
+
+Run the pipeline in this order:
+
+```bash
+python scrape_giant.py
+python enrich_giant.py
+python scrape_costco.py
+python enrich_costco.py
+python build_purchases.py
+python review_products.py
+python build_purchases.py
+```
+
+Why run `build_purchases.py` twice:
+- first pass builds the current combined dataset and review queue inputs
+- `review_products.py` writes durable review decisions
+- second pass reapplies those decisions into the purchase output
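Since there is no one-shot runner yet, the run order above could be wrapped in a small script along these lines. This is only a sketch under assumptions: `build_commands` and `run_pipeline` are hypothetical helpers that do not exist in the repo, and the sketch assumes every script runs without extra arguments from the repo root.

```python
import subprocess
import sys
from typing import List, Sequence

# Same order as the manual steps; build_purchases.py runs before and after review.
PIPELINE_STEPS = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_purchases.py",
    "review_products.py",
    "build_purchases.py",
]


def build_commands(
    python_exe: str = sys.executable,
    steps: Sequence[str] = PIPELINE_STEPS,
) -> List[List[str]]:
    """Return one argv list per pipeline step, in run order."""
    return [[python_exe, step] for step in steps]


def run_pipeline(commands: Sequence[Sequence[str]]) -> None:
    """Run each command in turn and stop at the first failing step."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(list(command), check=True)
```

Calling `run_pipeline(build_commands())` would execute the steps with the current interpreter. The review step stays interactive because the child process inherits the terminal.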
+
+If you only want to refresh the queue without reviewing interactively:
+
+```bash
+python review_products.py --refresh-only
+```
+
+## Key Outputs
+
+Giant:
 - `giant_output/orders.csv`
 - `giant_output/items.csv`
-
-### 2. Enrich Giant data
-
-```bash
-./venv/bin/python enrich_giant.py
-```
-
-Input:
- `giant_output/raw/*.json`
-
-Output:
 - `giant_output/items_enriched.csv`

-### 3. Acquire Costco raw data
-
-```bash
-./venv/bin/python scrape_costco.py
-```
-
-Optional useful flags:
-
-```bash
-./venv/bin/python scrape_costco.py --months-back 36
-./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
-```
-
-Inputs:
- active Firefox session for `costco.com`
- optional `.env` values:
-  - `COSTCO_X_AUTHORIZATION`
-  - `COSTCO_X_WCS_CLIENTID`
-  - `COSTCO_CLIENT_IDENTIFIER`
- if `COSTCO_X_AUTHORIZATION` is absent, the script falls back to exact Firefox
-  local-storage values:
-  - `idToken` -> sent as `Bearer <idToken>`
-  - `clientID` -> used as `costco-x-wcs-clientId` when env is blank
-
-Outputs:
- `costco_output/raw/summary.json`
- `costco_output/raw/summary_requests.json`
- `costco_output/raw/<receipt_id>-<timestamp>.json`
+Costco:
 - `costco_output/orders.csv`
 - `costco_output/items.csv`
-
-### 4. Enrich Costco data
-
-```bash
-./venv/bin/python enrich_costco.py
-```
-
-Input:
- `costco_output/raw/*.json`
-
-Output:
 - `costco_output/items_enriched.csv`

-### 5. Build shared downstream layers
+Combined:
+- `combined_output/purchases.csv`
+- `combined_output/review_queue.csv`
+- `combined_output/review_resolutions.csv`
+- `combined_output/canonical_catalog.csv`
+- `combined_output/product_links.csv`
+- `combined_output/comparison_examples.csv`
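A quick way to confirm that a full run produced everything listed above is a small existence check. The sketch below is illustrative only: `missing_outputs` is a hypothetical helper, and the file list is copied from this section rather than read from the scripts.

```python
from pathlib import Path
from typing import List, Sequence

# Paths copied from the Key Outputs list, relative to the repo root.
EXPECTED_OUTPUTS = [
    "giant_output/orders.csv",
    "giant_output/items.csv",
    "giant_output/items_enriched.csv",
    "costco_output/orders.csv",
    "costco_output/items.csv",
    "costco_output/items_enriched.csv",
    "combined_output/purchases.csv",
    "combined_output/review_queue.csv",
    "combined_output/review_resolutions.csv",
    "combined_output/canonical_catalog.csv",
    "combined_output/product_links.csv",
    "combined_output/comparison_examples.csv",
]


def missing_outputs(
    root: Path,
    expected: Sequence[str] = EXPECTED_OUTPUTS,
) -> List[str]:
    """Return the expected output paths that are not present under root."""
    return [rel for rel in expected if not (root / rel).is_file()]
```

Running `missing_outputs(Path("."))` from the repo root returns an empty list when every output exists.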

-```bash
-./venv/bin/python build_observed_products.py
-./venv/bin/python build_review_queue.py
-./venv/bin/python build_canonical_layer.py
-```
+## Review Workflow

-These scripts consume the enriched item files and generate the downstream
-product-model outputs.
+`review_products.py` is the manual cleanup step for unresolved or low-confidence product matches.

-Current outputs on disk:
+In the terminal, you can:
+- link an item to an existing canonical product
+- create a new canonical product
+- exclude an item
+- skip it for later

- retailer-facing:
-  - `giant_output/products_observed.csv`
-  - `giant_output/review_queue.csv`
-  - `giant_output/products_canonical.csv`
-  - `giant_output/product_links.csv`
- cross-retailer proof/check output:
-  - `combined_output/products_observed.csv`
-  - `combined_output/products_canonical.csv`
-  - `combined_output/product_links.csv`
-  - `combined_output/proof_examples.csv`
-
-### 6. Validate cross-retailer flow
-
-```bash
-./venv/bin/python validate_cross_retailer_flow.py
-```
-
-This is a proof/check step, not the main acquisition path.
-
-## Inputs And Outputs By Directory
-
-### `giant_output/`
-
-Inputs to this layer:
- Firefox session data for Giant
- Giant raw JSON payloads
-
-Generated files:
- `raw/history.json`
- `raw/<order_id>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
- `products_observed.csv`
- `review_queue.csv`
- `products_canonical.csv`
- `product_links.csv`
-
-### `costco_output/`
-
-Inputs to this layer:
- Firefox session data for Costco
- Costco raw GraphQL receipt payloads
-
-Generated files:
- `raw/summary.json`
- `raw/summary_requests.json`
- `raw/<receipt_id>-<timestamp>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
-
-### `combined_output/`
-
-Generated by cross-retailer proof/build scripts:
- `products_observed.csv`
- `products_canonical.csv`
- `product_links.csv`
- `proof_examples.csv`
+Those decisions are saved and reused on later runs.
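To make the "saved and reused" idea concrete, here is a rough sketch of how stored decisions might be reapplied to items on a later run. The column names (`observed_id`, `action`, `canonical_id`), the action values, and both function names are assumptions for illustration; the real `review_resolutions.csv` schema may differ.

```python
import csv
import io
from typing import Dict, Iterable, List


def load_resolutions(csv_text: str) -> Dict[str, Dict[str, str]]:
    """Index saved review decisions by the (hypothetical) observed_id column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["observed_id"]: row for row in reader}


def apply_resolutions(
    items: Iterable[Dict[str, str]],
    resolutions: Dict[str, Dict[str, str]],
) -> List[Dict[str, str]]:
    """Reapply saved decisions: link/create resolve, exclude drops, else pending."""
    result = []
    for item in items:
        decision = resolutions.get(item["observed_id"])
        action = decision["action"] if decision else "skip"
        if action == "exclude":
            continue  # excluded items leave the output entirely
        if action in ("link", "create"):
            result.append(
                {**item, "canonical_id": decision["canonical_id"], "status": "resolved"}
            )
        else:
            # Skipped or never-reviewed items stay in the queue for later.
            result.append({**item, "status": "pending"})
    return result
```

Under these assumptions, an item linked once keeps its canonical product on every later build, and only skipped or new items come back for review.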

 ## Notes

- The pipeline is intentionally simple and currently manual.
- Scraping is retailer-specific and fragile; downstream modeling is shared only
-  after enrichment.
- `summary_requests.json` is diagnostic metadata from Costco summary enumeration
-  and is not a receipt payload.
- `enrich_costco.py` skips that file and only parses receipt payloads.
- The repo may contain archived or sample output files under `archive/`; they
-  are not part of the active scrape path.
+- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
+- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
+- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

-## Verification
-
-Run the full test suite with:
+## Test

 ```bash
 ./venv/bin/python -m unittest discover -s tests
 ```

-Useful one-off checks:
-
-```bash
-./venv/bin/python scrape_giant.py --help
-./venv/bin/python scrape_costco.py --help
-./venv/bin/python enrich_giant.py
-./venv/bin/python enrich_costco.py
-```
-
 ## Project Docs

- `pm/tasks.org`
- `pm/data-model.org`
- `pm/scrape-giant.org`
+- `pm/tasks.org`: task tracking
+- `pm/data-model.org`: current data model notes
+- `pm/review-workflow.org`: review and resolution workflow