updated scope to prep for costco scraper
README.md (normal file, 103 lines changed)
@@ -0,0 +1,103 @@
# scrape-giant

Small grocery-history pipeline for Giant receipts.

The project currently does four things:

1. scrape Giant in-store order history from an active Firefox session
2. enrich raw line items into a deterministic `items_enriched.csv`
3. aggregate retailer-facing observed products and build a manual review queue
4. create a first-pass canonical product layer plus conservative auto-links

The work so far is Giant-specific on the ingest side and intentionally simple on
the shared product-model side.

## Current flow

Run the commands from the repo root with the project venv active, or call them
directly through `./venv/bin/python`.

```bash
./venv/bin/python scraper.py
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```
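
The same five steps can also be chained from Python so that a failing step stops the run. This is a minimal sketch, not something the repo ships: `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical names, and it assumes the scripts need no required arguments.

```python
import subprocess
import sys
from pathlib import Path

# Script names come from the command list above; the order matters because
# each step reads files written by an earlier one.
PIPELINE_STEPS = [
    "scraper.py",
    "enrich_giant.py",
    "build_observed_products.py",
    "build_review_queue.py",
    "build_canonical_layer.py",
]


def build_commands(python=sys.executable, steps=PIPELINE_STEPS):
    """Return one argv list per step, in pipeline order."""
    return [[str(python), step] for step in steps]


def run_pipeline(repo_root=".", python=sys.executable, steps=PIPELINE_STEPS):
    """Run each step from the repo root and stop at the first failure."""
    for argv in build_commands(python, steps):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(argv, cwd=Path(repo_root), check=True)
```

Calling `run_pipeline(python="./venv/bin/python")` from the repo root would mirror the shell commands above. Passing a shorter `steps` list skips the scrape when only the derived files need rebuilding.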

## Inputs

- Firefox cookies for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
- Giant raw order payloads in `giant_output/raw/`
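
One way to resolve `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from those three sources is sketched below. The lookup order (shell env, then `.env`, then an interactive prompt) is an assumption, since the README does not state a precedence. `parse_env_file` and `resolve_setting` are hypothetical helpers, not functions from `scraper.py`.

```python
import os
from getpass import getpass
from pathlib import Path


def parse_env_file(text):
    """Parse simple KEY=VALUE lines; skip blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_setting(name, environ=None, env_path=".env", prompt=getpass):
    """Look up `name` in the shell env, then the .env file, then prompt."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    path = Path(env_path)
    if path.is_file():
        value = parse_env_file(path.read_text()).get(name)
        if value:
            return value
    # Assumed fallback: ask the user rather than failing outright.
    return prompt(f"{name}: ")
```

The `environ` and `prompt` parameters exist so the lookup can be tested without touching the real environment or blocking on input.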

## Outputs

Current generated files live under `giant_output/`:

- `orders.csv`: flattened visit/order rows from the Giant history API
- `items.csv`: flattened raw line items from fetched order detail payloads
- `items_enriched.csv`: deterministic parsed/enriched line items
- `products_observed.csv`: retailer-facing observed product groups
- `review_queue.csv`: products needing manual review
- `products_canonical.csv`: shared canonical product rows
- `product_links.csv`: observed-to-canonical links

Raw json remains the source of truth:

- `giant_output/raw/history.json`
- `giant_output/raw/<order_id>.json`

## Scripts

- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
- `enrich_giant.py`: reads raw Giant order json and writes `items_enriched.csv`
- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`
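
Preserving review status on reruns could work roughly as follows: load the statuses from the previous `review_queue.csv`, then carry them over onto the regenerated rows. This is only an illustration. The column names `observed_product_id` and `status`, the `pending` default, and both helper functions are assumptions, not the actual schema or code of `build_review_queue.py`.

```python
import csv
from pathlib import Path


def load_existing_status(path, key="observed_product_id", status_field="status"):
    """Map each row key to its previously recorded status, if the file exists."""
    path = Path(path)
    if not path.is_file():
        return {}
    with path.open(newline="") as handle:
        return {row[key]: row.get(status_field, "") for row in csv.DictReader(handle)}


def merge_status(new_rows, existing, key="observed_product_id",
                 status_field="status", default="pending"):
    """Return the new queue rows, reusing any status a reviewer already set."""
    merged = []
    for row in new_rows:
        previous = existing.get(row[key])
        # A blank or missing previous status falls back to the default.
        merged.append({**row, status_field: previous or default})
    return merged
```

Keying on a stable identifier rather than row position means a rerun that adds or reorders products does not shift a reviewer's decisions onto the wrong rows.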

## Notes on the current model

- Observed products are retailer-specific: Giant, Costco.
- Canonical products are the first cross-retailer layer.
- Auto-linking is conservative:
  exact UPC first, then exact normalized name plus exact size/unit context, then
  exact normalized name when there is no size context to conflict.
- Fee rows are excluded from auto-linking.
- Unknown values are left blank instead of guessed.
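
The auto-link order described in these notes can be sketched as a single function that returns the first rule that matches. The field names (`upc`, `normalized_name`, `size_value`, `size_unit`, `is_fee`) and the rule labels are assumptions for illustration. "No size context to conflict" is read here as "at least one side has a blank size", which may be looser than what `build_canonical_layer.py` actually does.

```python
def size_context(row):
    """Return (size_value, size_unit), or None when either part is blank."""
    value, unit = row.get("size_value", ""), row.get("size_unit", "")
    return (value, unit) if value and unit else None


def link_rule(observed, canonical):
    """Return the name of the first rule that links the rows, or "" for no link."""
    # Assumes CSV-style string flags; fee rows are never auto-linked.
    if observed.get("is_fee") == "true":
        return ""
    # Rule 1: exact UPC, only when the observed UPC is not blank.
    if observed.get("upc") and observed["upc"] == canonical.get("upc"):
        return "exact_upc"
    name = observed.get("normalized_name")
    if not name or name != canonical.get("normalized_name"):
        return ""
    obs_size, can_size = size_context(observed), size_context(canonical)
    # Rule 2: same name and the same size/unit on both sides.
    if obs_size and obs_size == can_size:
        return "exact_name_size"
    # Rule 3: same name, and one side has no size that could conflict.
    if obs_size is None or can_size is None:
        return "exact_name"
    # Same name but different sizes: leave for manual review.
    return ""
```

Returning the rule name rather than a boolean keeps a record of why each link was made.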

## Verification

Run the test suite with:

```bash
./venv/bin/python -m unittest discover -s tests
```

Useful one-off rebuilds:

```bash
./venv/bin/python enrich_giant.py
./venv/bin/python build_observed_products.py
./venv/bin/python build_review_queue.py
./venv/bin/python build_canonical_layer.py
```

## Project docs

- `pm/tasks.org`: task log and evidence
- `pm/data-model.org`: file layout and schema decisions

## Status

Completed through `t1.7`:

- Giant receipt fetch CLI
- data model and file layout
- Giant parser/enricher
- observed products
- review queue
- canonical layer scaffold
- conservative auto-link rules

Next planned task is `t1.8`: add a Costco raw ingest path.
File diff suppressed because one or more lines are too long
pm/tasks.org (95 lines changed)
@@ -147,35 +147,96 @@
 ** acceptance criteria
 - add a costco-specific raw ingest/export path
-- output costco line items into the same shared raw/enriched schema family
-- confirm at least one product class can exist as:
-  - giant observed product
-  - costco observed product
-  - one shared canonical product
+- fetch costco receipt summary and receipt detail payloads from graphql endpoint
+- persist raw json under `costco_output/raw/orders.csv` and `./items.csv`, same format as giant
+- costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
+- preserve discount/coupon rows rather than dropping
 
 ** notes
-- this is the proof that the architecture generalizes
-- don’t chase perfection before the second retailer lands
+- focus on raw costco acquisition and flattening
+- do not force costco identifiers into `upc`
+- bearer/auth values should come from local env, not source
 
 ** evidence
 - commit:
 - tests:
 - date:
 
-* [ ] t1.9: compute normalized comparison metrics (2-3 commits)
+* [ ] t1.8.1: support costco parser/enricher path (2-4 commits)
 
 ** acceptance criteria
-- derive normalized comparison fields where possible:
-  - price per lb
-  - price per oz
-  - price per each
-  - price per count
-- metrics are attached at canonical or linked-observed level as appropriate
-- emit obvious nulls when basis is unknown rather than inventing values
+- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
+- output rows into the same shared enriched schema family as Giant
+- support costco-specific parsing for:
+  - `itemDescription01` + `itemDescription02`
+  - `itemNumber` as `retailer_item_id`
+  - discount lines / negative rows
+  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
+- preserve obvious unknowns as blank rather than guessed values
 
 ** notes
-- this is where “gala apples 5 lb bag vs other gala apples” becomes possible
-- units discipline matters a lot here
+- this is the real schema compatibility proof, not raw ingest alone
+- expect weaker identifiers than Giant
 
+** evidence
+- commit:
+- tests:
+- date:
+* [ ] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
+
+** acceptance criteria
+- feed Giant and Costco enriched rows through the same observed/canonical pipeline
+- confirm at least one product class can exist as:
+  - Giant observed product
+  - Costco observed product
+  - one shared canonical product
+- document the exact example used for proof
+
+** notes
+- keep this to one or two well-behaved product classes first
+- apples, eggs, bananas, or flour are better than weird prepared foods
+
+** evidence
+- commit:
+- tests:
+- date:
+* [ ] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
+
+** acceptance criteria
+- add shared fields needed for non-upc retailers, including:
+  - `retailer_item_id`
+  - `is_discount_line`
+  - `is_coupon_line` or equivalent if needed
+- keep `upc` nullable across the pipeline
+- update downstream builders/tests to accept retailers with blank `upc`
+
+** notes
+- this prevents costco from becoming a schema hack
+- do this once instead of sprinkling exceptions everywhere
+
+** evidence
+- commit:
+- tests:
+- date:
+* [ ] t1.9: compute normalized comparison metrics (2-4 commits)
+
+** acceptance criteria
+- derive normalized comparison fields where possible on enriched or observed product rows:
+  - `price_per_lb`
+  - `price_per_oz`
+  - `price_per_each`
+  - `price_per_count`
+- preserve the source basis used to derive each metric, e.g.:
+  - parsed size/unit
+  - receipt weight
+  - explicit count/pack
+- emit nulls when basis is unknown, conflicting, or ambiguous
+- document at least one Giant vs Costco comparison example using the normalized metrics
+
+** notes
+- compute metrics as close to the raw observation as possible
+- canonical layer can aggregate later, but should not invent missing unit economics
+- unit discipline matters more than coverage
+
 ** evidence
 - commit:
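
The size patterns named in the `t1.8.1` criteria (`25#`, `48 OZ`, `2/24 OZ`, `6-PACK`) could be parsed with a few ordered regular expressions. The sketch below is illustrative only. Reading `#` as pounds and `2/24 OZ` as "2 packs of 24 oz" are assumptions about Costco receipt conventions, and `parse_size` with its output field names (`size_value`, `size_unit`, `pack_qty`) is hypothetical rather than the planned enricher's schema.

```python
import re

# Patterns are tried in order; the multi-pack form must come before the
# plain "48 OZ" form so that "2/24 OZ" is not read as just "24 OZ".
_PACK_OF_SIZED = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB)\b", re.I)
_POUNDS_HASH = re.compile(r"\b(\d+(?:\.\d+)?)\s*#")
_SIZED = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB)\b", re.I)
_PACK = re.compile(r"\b(\d+)\s*-?\s*(?:PACK|PK)\b", re.I)


def parse_size(description):
    """Return size fields parsed from a description; unknowns stay blank."""
    result = {"size_value": "", "size_unit": "", "pack_qty": ""}
    text = description or ""
    if m := _PACK_OF_SIZED.search(text):
        result.update(pack_qty=m.group(1), size_value=m.group(2),
                      size_unit=m.group(3).lower())
    elif m := _POUNDS_HASH.search(text):
        # Assumption: "#" after a number means pounds.
        result.update(size_value=m.group(1), size_unit="lb")
    elif m := _SIZED.search(text):
        result.update(size_value=m.group(1), size_unit=m.group(2).lower())
    elif m := _PACK.search(text):
        result.update(pack_qty=m.group(1))
    return result
```

A description that matches none of the patterns comes back with all three fields blank, in line with the "preserve obvious unknowns as blank" criterion.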
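
For the `t1.9` weight metrics, one possible shape is a function that derives `price_per_lb` and `price_per_oz` from a line total and a parsed size, records the basis it used, and returns blanks when the basis is unknown. This is a sketch under assumptions: `weight_metrics`, the `metric_basis` field, and the four-decimal rounding are hypothetical, and `oz` is treated as weight ounces (16 per pound), which would be wrong for fluid ounces.

```python
from decimal import Decimal, InvalidOperation

OZ_PER_LB = Decimal(16)  # weight ounces only; fluid ounces are not handled


def _to_decimal(value):
    """Parse a CSV cell into a positive Decimal, or None when blank/invalid."""
    try:
        number = Decimal(str(value).strip())
        return number if number > 0 else None
    except InvalidOperation:
        return None


def weight_metrics(line_total, size_value, size_unit, basis="parsed_size"):
    """Return price_per_lb / price_per_oz plus the basis used, blank if unknown."""
    blank = {"price_per_lb": "", "price_per_oz": "", "metric_basis": ""}
    total, size = _to_decimal(line_total), _to_decimal(size_value)
    unit = (size_unit or "").strip().lower()
    if total is None or size is None or unit not in ("lb", "oz"):
        return blank  # unknown basis: emit blanks rather than guess
    ounces = size * OZ_PER_LB if unit == "lb" else size
    per_oz = total / ounces
    return {
        "price_per_lb": str((per_oz * OZ_PER_LB).quantize(Decimal("0.0001"))),
        "price_per_oz": str(per_oz.quantize(Decimal("0.0001"))),
        "metric_basis": basis,
    }
```

`Decimal` avoids binary floating-point drift in the price columns. Count-based metrics (`price_per_each`, `price_per_count`) would need a separate basis and are left out here.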