# scrape-giant

Small grocery-history pipeline for Giant receipts.

The project currently does four things:

1. scrape Giant in-store order history from an active Firefox session
2. enrich raw line items into a deterministic `items_enriched.csv`
3. aggregate retailer-facing observed products and build a manual review queue
4. create a first-pass canonical product layer plus conservative auto-links

The work so far is Giant-specific on the ingest side and intentionally simple on
the shared product-model side.

## Current flow

Run the commands from the repo root with the project venv active, or call them
directly through `./venv/bin/python`.

```bash
./venv/bin/python scraper.py
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```

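
If you would rather run the whole flow in one go, a small wrapper along these lines could work. This is only a sketch: `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical names that do not exist in the repo, and it assumes each script exits with a non-zero status on failure.

```python
import subprocess

# Script names are taken from the command list above; the order matters
# because each step reads files written by the previous one.
PIPELINE_STEPS = [
    "scraper.py",
    "enrich_giant.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
]


def build_commands(python="./venv/bin/python", steps=PIPELINE_STEPS):
    """Return one argv list per pipeline step."""
    return [[python, step] for step in steps]


def run_pipeline(commands, run=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for command in commands:
        run(command, check=True)
```

Calling `run_pipeline(build_commands())` from the repo root would then run all five steps. The `run` parameter is there only so the wrapper can be exercised without launching the real scripts.
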
## Inputs

- Firefox cookies for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
- Giant raw order payloads in `giant_output/raw/`

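
The three sources for `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` can be pictured as a fallback chain. The sketch below is illustrative rather than the scraper's actual code: `read_dotenv` and `resolve_setting` are hypothetical helpers, and the precedence (shell env over `.env`, prompt last) is an assumption.

```python
import os
from pathlib import Path


def read_dotenv(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_setting(name, dotenv_values, environ=os.environ, prompt=input):
    """Assumed order: shell env, then .env, then an interactive prompt."""
    value = environ.get(name) or dotenv_values.get(name)
    if value:
        return value
    return prompt(f"{name}: ").strip()
```

For example, `resolve_setting("GIANT_USER_ID", read_dotenv())` would only prompt when neither the shell nor `.env` provides a value.
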
## Outputs

Current generated files live under `giant_output/`:

- `orders.csv`: flattened visit/order rows from the Giant history API
- `items.csv`: flattened raw line items from fetched order detail payloads
- `items_enriched.csv`: deterministic parsed/enriched line items
- `products_observed.csv`: retailer-facing observed product groups
- `review_queue.csv`: products needing manual review
- `products_canonical.csv`: shared canonical product rows
- `product_links.csv`: observed-to-canonical links

Raw JSON remains the source of truth:

- `giant_output/raw/history.json`
- `giant_output/raw/<order_id>.json`

## Scripts

- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
- `enrich_giant.py`: reads raw Giant order JSON and writes `items_enriched.csv`
- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`

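
Preserving review status on reruns amounts to carrying the old status column over by key before the queue file is rewritten. A minimal sketch of that idea follows; the column names `observed_product_id` and `status`, the `"pending"` default, and both helper names are assumptions for illustration, not the real schema (see `pm/data-model.org` for that).

```python
import csv
from pathlib import Path


def load_existing_status(path, key_field="observed_product_id", status_field="status"):
    """Map each key in a previous review queue file to its saved status."""
    if not Path(path).exists():
        return {}
    with open(path, newline="", encoding="utf-8") as handle:
        return {
            row[key_field]: row.get(status_field, "")
            for row in csv.DictReader(handle)
        }


def merge_review_status(new_rows, existing_status,
                        key_field="observed_product_id",
                        status_field="status",
                        default_status="pending"):
    """Copy prior statuses onto freshly built rows; new keys get the default."""
    merged = []
    for row in new_rows:
        updated = dict(row)  # copy so the caller's rows are not mutated
        updated[status_field] = existing_status.get(row[key_field]) or default_status
        merged.append(updated)
    return merged
```

The important property is that a rerun never resets a row a reviewer has already touched; only rows with no prior status fall back to the default.
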
## Notes on the current model

- Observed products are retailer-specific: Giant today, with Costco planned next.
- Canonical products are the first cross-retailer layer.
- Auto-linking is conservative:
  exact UPC first, then exact normalized name plus exact size/unit context, then
  exact normalized name when there is no size context to conflict.
- Fee rows are excluded from auto-linking.
- Unknown values are left blank instead of guessed.

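
The auto-link order above can be sketched as a short rule cascade. This is one reading of the rules, not the code in `build_canonical_layer.py`: the `Product` fields, the rule labels, and the choice to leave ambiguous matches unlinked are all assumptions, and "no size context to conflict" is interpreted here as neither side carrying size information.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Product:
    # Hypothetical field names; blank strings stand for unknown values.
    product_id: str
    upc: str = ""
    normalized_name: str = ""
    size_value: str = ""
    size_unit: str = ""
    is_fee: bool = False


def has_size(product: Product) -> bool:
    """True when the row carries any size/unit context."""
    return bool(product.size_value or product.size_unit)


def _single(matches: List[Product], rule: str) -> Optional[Tuple[str, str]]:
    """Link only when exactly one candidate matches; otherwise stay unlinked."""
    return (matches[0].product_id, rule) if len(matches) == 1 else None


def auto_link(observed: Product, canonicals: List[Product]) -> Optional[Tuple[str, str]]:
    if observed.is_fee:
        return None  # fee rows are excluded from auto-linking
    # Rule 1: exact UPC.
    if observed.upc:
        upc_matches = [c for c in canonicals if c.upc == observed.upc]
        if upc_matches:
            return _single(upc_matches, "exact_upc")
    if not observed.normalized_name:
        return None
    same_name = [c for c in canonicals
                 if c.normalized_name == observed.normalized_name]
    # Rule 2: exact normalized name plus exact size/unit context.
    if has_size(observed):
        sized = [c for c in same_name
                 if (c.size_value, c.size_unit)
                 == (observed.size_value, observed.size_unit)]
        return _single(sized, "exact_name_size")
    # Rule 3: exact normalized name, only when neither side has size context.
    return _single([c for c in same_name if not has_size(c)], "exact_name")
```

Under this reading, an observed "whole milk" row with no size would not link to a canonical "whole milk, 1 gal" row, because the canonical size could conflict with the unknown observed size.
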
## Verification

Run the test suite with:

```bash
./venv/bin/python -m unittest discover -s tests
```

Useful one-off rebuilds:

```bash
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```

## Project docs

- `pm/tasks.org`: task log and evidence
- `pm/data-model.org`: file layout and schema decisions

## Status

Completed through `t1.7`:

- Giant receipt fetch CLI
- data model and file layout
- Giant parser/enricher
- observed products
- review queue
- canonical layer scaffold
- conservative auto-link rules

Next planned task is `t1.8`: add a Costco raw ingest path.