scrape-giant/pm/tasks.org at 0f797d0a96aaf18b9a3c27986364ebf27d8be782

eulaly 0f797d0a96 added scope for browser session pull task and cleanup

2026-03-16 13:46:52 -04:00

[X] t1.1: harden giant receipt fetch cli (2-4 commits)
[X] t1.2: define grocery data model and file layout (1-2 commits)
[X] t1.3: build giant parser/enricher from raw json (2-4 commits)
[X] t1.4: generate observed-product layer from enriched items (2-3 commits)
[X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
[X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
[X] t1.7: implement auto-link rules for easy matches (2-3 commits)
[X] t1.8: support costco raw ingest path (2-5 commits)
[X] t1.8.1: support costco parser/enricher path (2-4 commits)
[X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
[X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
[X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)
[X] t1.8.5: refactor costco scraper auth and UX with giant scraper
[ ] t1.8.6: add browser session helper (2-4 commits)
[ ] t1.9: compute normalized comparison metrics (2-4 commits)
[ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

[X] t1.1: harden giant receipt fetch cli (2-4 commits)

acceptance criteria

giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
script reuses current browser session via firefox cookies + `curl_cffi`
script only fetches unseen orders
script appends to `orders.csv` and `items.csv` without duplicating prior visits
script prints a note that giant only exposes the most recent 50 visits
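
A minimal sketch of the append-without-duplicates criterion, assuming each row carries a unique key column. The helper name `append_unseen` and the `order_id` column are illustrative assumptions, not the actual `scraper.py` interface.

```python
import csv
from pathlib import Path


def append_unseen(path, rows, key_field, fieldnames):
    """Append only rows whose key is not already in the CSV; return rows written."""
    path = Path(path)
    seen = set()
    if path.exists():
        with path.open(newline="", encoding="utf-8") as fh:
            seen = {row[key_field] for row in csv.DictReader(fh)}

    new_rows = []
    for row in rows:
        key = str(row[key_field])
        if key not in seen:
            seen.add(key)  # also guards against duplicates within this batch
            new_rows.append(row)

    # Write the header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

Rerunning with the same rows writes nothing, which is what keeps repeated fetches from duplicating prior visits.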

notes

keep this giant-specific
no canonical product logic here
raw json archive remains source of truth

evidence

commit: `d57b9cf` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
date: 2026-03-14

[X] t1.2: define grocery data model and file layout (1-2 commits)

acceptance criteria

decide and document the files/directories for:
- retailer raw exports
- enriched line items
- observed products
- canonical products
- product links
define stable column schemas for each file
explicitly separate retailer-specific parsing from cross-retailer canonicalization

notes

this is the guardrail task so we don't make giant-specific hacks the system of record
keep schema minimal but extensible

evidence

commit: `42dbae1` on branch `cx`
tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
date: 2026-03-15

[X] t1.3: build giant parser/enricher from raw json (2-4 commits)

acceptance criteria

parser reads giant raw order json files
outputs `items_enriched.csv`
preserves core raw values plus parsed fields such as:
- normalized item name
- image url
- size value/unit guesses
- pack/count guesses
- fee/store-brand flags
- per-unit/per-weight derived price where possible
parser is deterministic and rerunnable
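
A rough sketch of a size/unit guess that stays blank when the name is ambiguous. The function name `guess_size`, the unit list, and the blank-string convention are assumptions for illustration, not the behavior of `enrich_giant.py`.

```python
import re

# A number followed by a known unit token; the unit list is illustrative only.
SIZE_RE = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*(OZ|LB|CT|GAL|QT|ML|KG|L|G)\b",
    re.IGNORECASE,
)


def guess_size(name):
    """Return (value, unit) when exactly one size token is found, else ('', '')."""
    matches = SIZE_RE.findall(name)
    if len(matches) != 1:
        # Zero or several candidates: preserve ambiguity instead of picking one.
        return "", ""
    value, unit = matches[0]
    return value, unit.lower()
```

The same input always yields the same output, so reruns over the raw json stay deterministic.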

notes

do not attempt canonical cross-store matching yet
parser should preserve ambiguity rather than hallucinating precision

evidence

commit: `14f2cc2` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
date: 2026-03-16

[X] t1.4: generate observed-product layer from enriched items (2-3 commits)

acceptance criteria

distinct observed products are generated from enriched giant items
each observed product has a stable `observed_product_id`
observed products aggregate:
- first seen / last seen
- times seen
- representative upc
- representative image url
- representative normalized name
outputs `products_observed.csv`

notes

observed product is retailer-facing, not yet canonical
likely key is some combo of retailer + upc + normalized name
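
One way the "retailer + upc + normalized name" key could become a stable id is by hashing the normalized parts. This is a sketch under that assumption; the `obs_` prefix, the 12-character truncation, and the function name are hypothetical.

```python
import hashlib


def observed_product_id(retailer, upc, normalized_name):
    """Derive a deterministic id from retailer, upc (may be blank), and name."""
    parts = [
        retailer.strip().lower(),
        (upc or "").strip(),
        " ".join(normalized_name.upper().split()),  # collapse whitespace
    ]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return f"obs_{digest[:12]}"
```

Because the id depends only on the inputs, rebuilding `products_observed.csv` from the same enriched rows yields the same ids.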

evidence

commit: `dc39214` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
date: 2026-03-16

[X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

acceptance criteria

produce a review file containing observed products needing manual review
include enough context to review quickly:
- raw names
- parsed names
- upc
- image url
- example prices
- seen count
reviewed status can be stored and reused
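
A small sketch of how a stored review status might be reused when the queue is rebuilt. The `status` column, the `pending` default, and the function name are assumptions; only `observed_product_id` comes from the tasks above.

```python
def carry_forward_status(queue_rows, prior_rows, key="observed_product_id"):
    """Copy any earlier review status onto freshly generated queue rows."""
    prior = {row[key]: row.get("status", "") for row in prior_rows}
    merged = []
    for row in queue_rows:
        out = dict(row)  # do not mutate the caller's rows
        out["status"] = prior.get(row[key]) or "pending"
        merged.append(out)
    return merged
```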

notes

this is where human-in-the-loop starts
optimize for “approve once, remember forever”

evidence

commit: `9b13ec3` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
date: 2026-03-16

[X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

acceptance criteria

define and create `products_canonical.csv`
define and create `product_links.csv`
support linking one or more observed products to one canonical product
canonical product schema supports food-cost comparison fields such as:
- product type
- variant
- size
- measure type
- normalized quantity basis

notes

this is the first cross-retailer abstraction layer
do not require llm assistance for v1

evidence

commit: `347cd44` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
date: 2026-03-16

[X] t1.7: implement auto-link rules for easy matches (2-3 commits)

acceptance criteria

auto-link can match observed products to canonical products using deterministic rules
rules include at least:
- exact upc
- exact normalized name
- exact size/unit match where available
low-confidence cases remain unlinked for review
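
The rules above could look roughly like the following. This is a conservative sketch, not the logic in `build_canonical_layer.py`; the column names `canonical_product_id`, `normalized_name`, `size_value`, and `size_unit` are assumed.

```python
def _size(row):
    return (row.get("size_value", ""), row.get("size_unit", ""))


def _size_ok(a, b):
    """Sizes only disqualify a match when both sides carry a full size."""
    sa, sb = _size(a), _size(b)
    return not all(sa) or not all(sb) or sa == sb


def auto_link(observed, canonicals):
    """Return (canonical_product_id, rule) for one unambiguous match, else None."""
    upc = observed.get("upc", "")
    if upc:
        hits = [c for c in canonicals if c.get("upc") == upc]
        if len(hits) == 1:
            return hits[0]["canonical_product_id"], "exact_upc"

    name = observed.get("normalized_name", "")
    if name:
        hits = [
            c for c in canonicals
            if c.get("normalized_name") == name and _size_ok(observed, c)
        ]
        if len(hits) == 1:
            return hits[0]["canonical_product_id"], "exact_name"

    return None  # leave for the review queue
```

Anything with zero or multiple candidates returns `None`, so an uncertain row ends up unresolved rather than wrongly linked.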

notes

keep the rules conservative
false positives are worse than unresolved items

evidence

commit: `385a31c` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
date: 2026-03-16

[X] t1.8: support costco raw ingest path (2-5 commits)

acceptance criteria

add a costco-specific raw ingest/export path
fetch costco receipt summary and receipt detail payloads from graphql endpoint
persist raw json under `costco_output/raw/` and flatten it to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
use costco-native identifiers: `transactionBarcode` as order id and `itemNumber` as retailer item id
preserve discount/coupon rows rather than dropping them

notes

focus on raw costco acquisition and flattening
do not force costco identifiers into `upc`
bearer/auth values should come from local env, not source

evidence

commit: `da00288` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
date: 2026-03-16

[X] t1.8.1: support costco parser/enricher path (2-4 commits)

acceptance criteria

add a costco-specific enrich step producing `costco_output/items_enriched.csv`
output rows into the same shared enriched schema family as Giant
support costco-specific parsing for:
- `itemDescription01` + `itemDescription02`
- `itemNumber` as `retailer_item_id`
- discount lines / negative rows
- common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
preserve obvious unknowns as blank rather than guessed values
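
A sketch of parsing the four size patterns listed above. It assumes `#` means pounds and that `2/24 OZ` means a pack of 2 at 24 oz each; both readings, the function name, and the output keys are assumptions rather than what `enrich_costco.py` does.

```python
import re

MULTI_RE = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB)\b")  # 2/24 OZ
POUND_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*#")                      # 25#
SIZE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB)\b")               # 48 OZ
PACK_RE = re.compile(r"\b(\d+)\s*-?\s*PACK\b")                       # 6-PACK


def parse_costco_size(text):
    """Return size_value/size_unit/pack_qty as strings; unknowns stay blank."""
    text = text.upper()
    out = {"size_value": "", "size_unit": "", "pack_qty": ""}

    m = MULTI_RE.search(text)
    if m:
        out["pack_qty"] = m.group(1)
        out["size_value"] = m.group(2)
        out["size_unit"] = m.group(3).lower()
        return out

    m = POUND_RE.search(text)
    if m:
        out["size_value"], out["size_unit"] = m.group(1), "lb"
    else:
        m = SIZE_RE.search(text)
        if m:
            out["size_value"], out["size_unit"] = m.group(1), m.group(2).lower()

    m = PACK_RE.search(text)
    if m:
        out["pack_qty"] = m.group(1)
    return out
```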

notes

this is the real schema compatibility proof, not raw ingest alone
expect weaker identifiers than Giant

evidence

commit: `da00288` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
date: 2026-03-16

[X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)

acceptance criteria

feed Giant and Costco enriched rows through the same observed/canonical pipeline
confirm at least one product class can exist as:
- Giant observed product
- Costco observed product
- one shared canonical product
document the exact example used for proof

notes

keep this to one or two well-behaved product classes first
apples, eggs, bananas, or flour are better than weird prepared foods

evidence

commit: `da00288` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
date: 2026-03-16

[X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)

acceptance criteria

add shared fields needed for non-upc retailers, including:
- `retailer_item_id`
- `is_discount_line`
- `is_coupon_line` or equivalent if needed
keep `upc` nullable across the pipeline
update downstream builders/tests to accept retailers with blank `upc`
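
With `upc` nullable, downstream builders need a consistent fallback. The helper below is one hypothetical way to express that; `product_identity` is not a name from the codebase.

```python
def product_identity(row):
    """Prefer upc, fall back to retailer_item_id, else report no identifier."""
    if row.get("upc"):
        return ("upc", row["upc"])
    if row.get("retailer_item_id"):
        return ("retailer_item_id", row["retailer_item_id"])
    return ("", "")  # neither id present; callers must handle this case
```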

notes

this prevents costco from becoming a schema hack
do this once instead of sprinkling exceptions everywhere

evidence

commit: `9497565` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
date: 2026-03-16

[X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)

acceptance criteria

confirm graphql summary query returns all expected receipts
compare `inWarehouse` count vs number of `receipts` returned
widen or parameterize date window if necessary; website shows receipts in 3-month windows
persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
emit warning when receipt counts mismatch
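
A sketch of the date-window chunking and the count check, under the assumption that windows align to calendar months and that `inWarehouse` is an integer count. The function names are illustrative and not taken from `scrape_costco.py`.

```python
from datetime import date, timedelta


def _first_of_month_after(d, months):
    """First day of the month that is `months` months after d's month."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)


def month_windows(start, end, months=3):
    """Split [start, end] into consecutive inclusive windows of at most `months` months."""
    windows = []
    cur = start
    while cur <= end:
        nxt = _first_of_month_after(cur, months)
        windows.append((cur, min(nxt - timedelta(days=1), end)))
        cur = nxt
    return windows


def count_warning(in_warehouse, receipts):
    """Return a warning string when the reported count and the list length differ."""
    if in_warehouse != len(receipts):
        return (
            f"warning: summary reports {in_warehouse} in-warehouse receipts "
            f"but returned {len(receipts)}"
        )
    return None
```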

notes

goal is to confirm we are enumerating all receipts before parsing
do not expand schema or parser logic in this task
keep changes limited to summary query handling and diagnostics

evidence

commit: `ac82fa6` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
date: 2026-03-16

[X] t1.8.5: refactor costco scraper auth and UX with giant scraper

acceptance criteria

remove manual auth env vars
load costco cookies from firefox session
require only logged-in browser
replace start/end date flags with `--months-back`
maintain same raw output structure
ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone
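
The composite key can be sketched as a tuple of the two fields named above. `receipt_key` and `build_summary_lookup` are hypothetical helper names, and raising on a repeated key is one possible policy, not necessarily the one the scraper uses.

```python
def receipt_key(receipt):
    """Composite key: barcode alone may repeat, barcode + timestamp should not."""
    return (receipt["transactionBarcode"], receipt["transactionDateTime"])


def build_summary_lookup(receipts):
    """Index summary receipts by composite key, refusing silent overwrites."""
    lookup = {}
    for receipt in receipts:
        key = receipt_key(receipt)
        if key in lookup:
            raise ValueError(f"duplicate receipt key: {key}")
        lookup[key] = receipt
    return lookup
```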

notes

align Costco acquisition ergonomics with the Giant scraper
keep downstream Costco parsing and shared schemas unchanged

evidence

commit: `c0054dc` on branch `cx`
tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
date: 2026-03-16

[ ] t1.8.6: add browser session helper (2-4 commits)

acceptance criteria

create a separate Python module/script that extracts firefox browser session data needed for giant and costco scrapers.
support Firefox and Costco first, including:
- loading cookies via existing browser-cookie approach
- reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
- copying locked browser sqlite/db files to a temp location before reading when necessary
expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
keep retailer-specific parsing of extracted session data outside the low-level browser access layer
structure the helper so Chromium-family browser support can be added later without changing scraper call sites
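
A minimal sketch of the copy-then-read step for a locked Firefox cookie database. It assumes the `moz_cookies` table with `name`, `value`, and `host` columns; the function name is hypothetical, and a real profile may also have a `-wal` sidecar file that this sketch ignores.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def read_cookies(cookies_db, host_suffix):
    """Copy the sqlite file to a temp dir, then read cookies for a host suffix."""
    with tempfile.TemporaryDirectory() as tmp:
        copy = Path(tmp) / "cookies.sqlite"
        shutil.copy2(cookies_db, copy)  # read the copy, never the live file
        conn = sqlite3.connect(copy)
        try:
            rows = conn.execute(
                "SELECT name, value FROM moz_cookies WHERE host LIKE ?",
                ("%" + host_suffix,),
            ).fetchall()
        finally:
            conn.close()
    return dict(rows)
```

Returning a plain name-to-value mapping keeps retailer-specific interpretation out of the browser access layer, in line with the criteria above.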

notes

goal is to replace manual `.env` copying of volatile browser-derived auth data
session bootstrap only, not full browser automation
prefer one shared helper over retailer-specific ad hoc storage reads
Firefox only; Chromium support later

evidence

commit:
tests:
date:

[ ] t1.9: compute normalized comparison metrics (2-4 commits)

acceptance criteria

derive normalized comparison fields where possible on enriched or observed product rows:
- `price_per_lb`
- `price_per_oz`
- `price_per_each`
- `price_per_count`
preserve the source basis used to derive each metric, e.g.:
- parsed size/unit
- receipt weight
- explicit count/pack
emit nulls when basis is unknown, conflicting, or ambiguous
document at least one Giant vs Costco comparison example using the normalized metrics
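
A sketch of deriving a subset of these metrics while recording the basis and emitting nulls when it is unknown. The input field names, the separate `weight_basis`/`count_basis` keys, and the four-decimal rounding are assumptions; `price_per_each` is left out here.

```python
from decimal import Decimal

OZ_PER_LB = Decimal(16)
Q = Decimal("0.0001")  # round derived prices to four decimal places


def price_metrics(price, size_value="", size_unit="", pack_qty=""):
    """Derive per-oz, per-lb, and per-count prices; None where basis is missing."""
    out = {
        "price_per_lb": None,
        "price_per_oz": None,
        "price_per_count": None,
        "weight_basis": None,
        "count_basis": None,
    }
    price = Decimal(price)

    if size_value and size_unit in ("lb", "oz"):
        total = Decimal(size_value) * Decimal(pack_qty or "1")
        ounces = total * OZ_PER_LB if size_unit == "lb" else total
        if ounces > 0:
            out["price_per_oz"] = (price / ounces).quantize(Q)
            out["price_per_lb"] = (price * OZ_PER_LB / ounces).quantize(Q)
            out["weight_basis"] = "parsed size/unit"

    if pack_qty and Decimal(pack_qty) > 0:
        out["price_per_count"] = (price / Decimal(pack_qty)).quantize(Q)
        out["count_basis"] = "explicit count/pack"

    return out
```

For example, a 48 oz item at 4.80 works out to 0.10 per oz and 1.60 per lb, while a row with no parsed size keeps every weight metric null.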

notes

compute metrics as close to the raw observation as possible
canonical layer can aggregate later, but should not invent missing unit economics
unit discipline matters more than coverage

evidence

commit:
tests:
date:

[ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

acceptance criteria

llm suggestions are generated only for unresolved observed products
llm outputs are stored as suggestions, not auto-applied truth
reviewer can approve/edit/reject suggestions
approved decisions are persisted into canonical/link files
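
A sketch of turning a reviewer decision into a link row, so suggestions never become truth on their own. Every field name other than `observed_product_id` (`suggested_canonical_id`, `canonical_product_id`, `link_method`) is hypothetical.

```python
VALID_DECISIONS = {"approved", "edited", "rejected"}


def apply_decision(suggestion, decision, edited_canonical_id=None):
    """Return a link row for approved/edited suggestions, None for rejected ones."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    if decision == "rejected":
        return None
    if decision == "edited":
        canonical_id = edited_canonical_id
    else:
        canonical_id = suggestion.get("suggested_canonical_id")
    if not canonical_id:
        raise ValueError("no canonical id to link")
    return {
        "observed_product_id": suggestion["observed_product_id"],
        "canonical_product_id": canonical_id,
        "link_method": f"llm_{decision}",  # records that a reviewer signed off
    }
```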

notes

bounded assistant, not autonomous goblin
image urls may become useful here

evidence

commit:
tests:
date:
