updated readme
252
README.md
@@ -1,103 +1,227 @@
 # scrape-giant
 
-Small grocery-history pipeline for Giant receipts.
+Small grocery-history pipeline for Giant and Costco receipt data.
 
-The project currently does four things:
+This repo is still a manual, stepwise pipeline. There is no single orchestrator
+script yet. Each stage is run directly, and later stages depend on files
+produced by earlier stages.
 
-1. scrape Giant in-store order history from an active Firefox session
-2. enrich raw line items into a deterministic `items_enriched.csv`
-3. aggregate retailer-facing observed products and build a manual review queue
-4. create a first-pass canonical product layer plus conservative auto-links
+## What The Project Does
 
-The work so far is Giant-specific on the ingest side and intentionally simple on
-the shared product-model side.
+The current flow is:
 
-## Current flow
+1. acquire raw Giant receipt/history data
+2. enrich Giant line items into a shared enriched-item schema
+3. acquire raw Costco receipt data
+4. enrich Costco line items into the same shared enriched-item schema
+5. build observed-product, review, and canonical-product layers
+6. validate that Giant and Costco can flow through the same downstream model
 
-Run the commands from the repo root with the project venv active, or call them
-directly through `./venv/bin/python`.
+Raw retailer JSON remains the source of truth.
+
+## Current Scripts
+
+- `scrape_giant.py`
+  Fetch Giant in-store history and order detail payloads from an active Firefox
+  session.
+- `scrape_costco.py`
+  Fetch Costco receipt summary/detail payloads from an active Firefox session.
+  Costco currently prefers `.env` header values first, then falls back to exact
+  Firefox local-storage values for session auth.
+- `enrich_giant.py`
+  Parse Giant raw order JSON into `giant_output/items_enriched.csv`.
+- `enrich_costco.py`
+  Parse Costco raw receipt JSON into `costco_output/items_enriched.csv`.
+- `build_observed_products.py`
+  Build retailer-facing observed products from enriched rows.
+- `build_review_queue.py`
+  Build a manual review queue for low-confidence or unresolved observed
+  products.
+- `build_canonical_layer.py`
+  Build shared canonical products and observed-to-canonical links.
+- `validate_cross_retailer_flow.py`
+  Write a proof/check output showing that Giant and Costco can meet in the same
+  downstream model.
+
+## Manual Pipeline
+
+Run these from the repo root with the venv active, or call them through
+`./venv/bin/python`.
+
+### 1. Acquire Giant raw data
+
+```bash
+./venv/bin/python scrape_giant.py
+```
+
+Inputs:
+- active Firefox session for `giantfood.com`
+- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from `.env`, shell env, or prompt
+
+Outputs:
+- `giant_output/raw/history.json`
+- `giant_output/raw/<order_id>.json`
+- `giant_output/orders.csv`
+- `giant_output/items.csv`
+
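
Review note on the Giant inputs above: the README says the two IDs can come from `.env`, the shell environment, or a prompt, but it does not state the order of precedence. The following is a minimal sketch of one plausible lookup, assuming the shell environment wins over `.env` and the prompt is the last resort. `parse_dotenv` and `resolve_setting` are hypothetical names for illustration, not functions taken from `scrape_giant.py`.

```python
import os


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines; skips blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(name, shell_env=None, dotenv=None, prompt=input):
    """Return the first non-blank value: shell env, then .env, then a prompt.

    The precedence order is an assumption; the README does not specify it.
    """
    shell_env = os.environ if shell_env is None else shell_env
    dotenv = {} if dotenv is None else dotenv
    for source in (shell_env, dotenv):
        value = (source.get(name) or "").strip()
        if value:
            return value
    return prompt(f"{name}: ").strip()
```

Passing `shell_env`, `dotenv`, and `prompt` as arguments keeps the lookup testable without touching the real environment or stdin.
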
+### 2. Enrich Giant data
 
 ```bash
-./venv/bin/python scraper.py
 ./venv/bin/python enrich_giant.py
+```
+
+Input:
+- `giant_output/raw/*.json`
+
+Output:
+- `giant_output/items_enriched.csv`
+
+### 3. Acquire Costco raw data
+
+```bash
+./venv/bin/python scrape_costco.py
+```
+
+Optional useful flags:
+
+```bash
+./venv/bin/python scrape_costco.py --months-back 36
+./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
+```
+
+Inputs:
+- active Firefox session for `costco.com`
+- optional `.env` values:
+  - `COSTCO_X_AUTHORIZATION`
+  - `COSTCO_X_WCS_CLIENTID`
+  - `COSTCO_CLIENT_IDENTIFIER`
+- if `COSTCO_X_AUTHORIZATION` is absent, the script falls back to exact Firefox
+  local-storage values:
+  - `idToken` -> sent as `Bearer <idToken>`
+  - `clientID` -> used as `costco-x-wcs-clientId` when env is blank
+
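
Review note on the Costco auth fallback above: the env-first, local-storage-second rule can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in `scrape_costco.py`. The function name `build_costco_headers` is hypothetical, and the header names `costco-x-authorization` and `client-identifier` are guesses derived from the env variable names; only `costco-x-wcs-clientId` appears in the README. The sketch also assumes the env value for authorization is sent unchanged.

```python
def _clean(mapping, key):
    """Return a stripped string value, treating missing or None as blank."""
    return (mapping.get(key) or "").strip()


def build_costco_headers(env, local_storage):
    """Prefer .env values, then fall back to Firefox local-storage values."""
    headers = {}

    # Authorization: env value as-is, else "Bearer <idToken>" from local storage.
    authorization = _clean(env, "COSTCO_X_AUTHORIZATION")
    if not authorization:
        id_token = _clean(local_storage, "idToken")
        if id_token:
            authorization = f"Bearer {id_token}"
    if authorization:
        headers["costco-x-authorization"] = authorization  # assumed header name

    # Client ID: env value, else the local-storage clientID when env is blank.
    client_id = _clean(env, "COSTCO_X_WCS_CLIENTID") or _clean(local_storage, "clientID")
    if client_id:
        headers["costco-x-wcs-clientId"] = client_id

    # The README lists no local-storage fallback for this value.
    client_identifier = _clean(env, "COSTCO_CLIENT_IDENTIFIER")
    if client_identifier:
        headers["client-identifier"] = client_identifier  # assumed header name

    return headers
```

Taking `env` and `local_storage` as plain dictionaries keeps the precedence rule separate from how the Firefox profile is actually read.
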
+Outputs:
+- `costco_output/raw/summary.json`
+- `costco_output/raw/summary_requests.json`
+- `costco_output/raw/<receipt_id>-<timestamp>.json`
+- `costco_output/orders.csv`
+- `costco_output/items.csv`
+
+### 4. Enrich Costco data
+
+```bash
+./venv/bin/python enrich_costco.py
+```
+
+Input:
+- `costco_output/raw/*.json`
+
+Output:
+- `costco_output/items_enriched.csv`
+
+### 5. Build shared downstream layers
+
+```bash
 ./venv/bin/python build_observed_products.py
 ./venv/bin/python build_review_queue.py
 ./venv/bin/python build_canonical_layer.py
 ```
 
-## Inputs
+These scripts consume the enriched item files and generate the downstream
+product-model outputs.
 
-- Firefox cookies for `giantfood.com`
-- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
-- Giant raw order payloads in `giant_output/raw/`
+Current outputs on disk:
 
-## Outputs
+- retailer-facing:
+  - `giant_output/products_observed.csv`
+  - `giant_output/review_queue.csv`
+  - `giant_output/products_canonical.csv`
+  - `giant_output/product_links.csv`
+- cross-retailer proof/check output:
+  - `combined_output/products_observed.csv`
+  - `combined_output/products_canonical.csv`
+  - `combined_output/product_links.csv`
+  - `combined_output/proof_examples.csv`
 
-Current generated files live under `giant_output/`:
+### 6. Validate cross-retailer flow
 
-- `orders.csv`: flattened visit/order rows from the Giant history API
-- `items.csv`: flattened raw line items from fetched order detail payloads
-- `items_enriched.csv`: deterministic parsed/enriched line items
-- `products_observed.csv`: retailer-facing observed product groups
-- `review_queue.csv`: products needing manual review
-- `products_canonical.csv`: shared canonical product rows
-- `product_links.csv`: observed-to-canonical links
+```bash
+./venv/bin/python validate_cross_retailer_flow.py
+```
 
-Raw json remains the source of truth:
+This is a proof/check step, not the main acquisition path.
 
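
Review note on the six manual steps above: the README states that there is no orchestrator script yet. Purely as an illustration of the stage ordering, a thin runner could look like the sketch below. `STAGES`, `build_commands`, and `run_pipeline` are hypothetical names, and the sketch assumes every script runs without arguments from the repo root and that a non-zero exit should stop the run.

```python
import subprocess
import sys

# Script order taken from the manual pipeline steps 1 through 6.
STAGES = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
    "validate_cross_retailer_flow.py",
]


def build_commands(python=sys.executable, stages=STAGES):
    """Return one argv list per stage, in pipeline order."""
    return [[python, script] for script in stages]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True raises on the first failure."""
    for command in commands:
        runner(command, check=True)


if __name__ == "__main__":
    run_pipeline(build_commands(python="./venv/bin/python"))
```

Because later stages depend on files written by earlier ones, stopping at the first failing stage avoids rebuilding downstream outputs from stale inputs.
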
-- `giant_output/raw/history.json`
-- `giant_output/raw/<order_id>.json`
+## Inputs And Outputs By Directory
 
-## Scripts
+### `giant_output/`
 
-- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
-- `enrich_giant.py`: reads raw Giant order json and writes `items_enriched.csv`
-- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
-- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
-- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`
+Inputs to this layer:
+- Firefox session data for Giant
+- Giant raw JSON payloads
 
-## Notes on the current model
+Generated files:
+- `raw/history.json`
+- `raw/<order_id>.json`
+- `orders.csv`
+- `items.csv`
+- `items_enriched.csv`
+- `products_observed.csv`
+- `review_queue.csv`
+- `products_canonical.csv`
+- `product_links.csv`
 
-- Observed products are retailer-specific: Giant, Costco.
-- Canonical products are the first cross-retailer layer.
-- Auto-linking is conservative:
-  exact UPC first, then exact normalized name plus exact size/unit context, then
-  exact normalized name when there is no size context to conflict.
-- Fee rows are excluded from auto-linking.
-- Unknown values are left blank instead of guessed.
+### `costco_output/`
+
+Inputs to this layer:
+- Firefox session data for Costco
+- Costco raw GraphQL receipt payloads
+
+Generated files:
+- `raw/summary.json`
+- `raw/summary_requests.json`
+- `raw/<receipt_id>-<timestamp>.json`
+- `orders.csv`
+- `items.csv`
+- `items_enriched.csv`
+
+### `combined_output/`
+
+Generated by cross-retailer proof/build scripts:
+- `products_observed.csv`
+- `products_canonical.csv`
+- `product_links.csv`
+- `proof_examples.csv`
+
+## Notes
+
+- The pipeline is intentionally simple and currently manual.
+- Scraping is retailer-specific and fragile; downstream modeling is shared only
+  after enrichment.
+- `summary_requests.json` is diagnostic metadata from Costco summary enumeration
+  and is not a receipt payload.
+- `enrich_costco.py` skips that file and only parses receipt payloads.
+- The repo may contain archived or sample output files under `archive/`; they
+  are not part of the active scrape path.
 
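
Review note on the Notes section above: the README says `enrich_costco.py` skips `summary_requests.json` and parses only receipt payloads, even though its input is `costco_output/raw/*.json`. A minimal sketch of that file selection follows. `receipt_payload_paths` is a hypothetical helper, and also skipping `summary.json` is an assumption on my part, since the README only names `summary_requests.json` explicitly.

```python
from pathlib import Path

# summary_requests.json is named in the README as a non-receipt file;
# treating summary.json the same way is an assumption.
NON_RECEIPT_FILES = {"summary.json", "summary_requests.json"}


def receipt_payload_paths(raw_dir):
    """Return sorted paths of receipt JSON files, skipping summary files."""
    return sorted(
        path
        for path in Path(raw_dir).glob("*.json")
        if path.name not in NON_RECEIPT_FILES
    )
```

For example, `receipt_payload_paths("costco_output/raw")` would return only the `<receipt_id>-<timestamp>.json` files, in a stable order across reruns.
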
 ## Verification
 
-Run the test suite with:
+Run the full test suite with:
 
 ```bash
 ./venv/bin/python -m unittest discover -s tests
 ```
 
-Useful one-off rebuilds:
+Useful one-off checks:
 
 ```bash
+./venv/bin/python scrape_giant.py --help
+./venv/bin/python scrape_costco.py --help
 ./venv/bin/python enrich_giant.py
-./venv/bin/python build_observed_products.py
-./venv/bin/python build_review_queue.py
-./venv/bin/python build_canonical_layer.py
+./venv/bin/python enrich_costco.py
 ```
 
-## Project docs
+## Project Docs
 
-- `pm/tasks.org`: task log and evidence
-- `pm/data-model.org`: file layout and schema decisions
-
-## Status
-
-Completed through `t1.7`:
-
-- Giant receipt fetch CLI
-- data model and file layout
-- Giant parser/enricher
-- observed products
-- review queue
-- canonical layer scaffold
-- conservative auto-link rules
-
-Next planned task is `t1.8`: add a Costco raw ingest path.
+- `pm/tasks.org`
+- `pm/data-model.org`
+- `pm/scrape-giant.org`