scrape-giant/pm/tasks.org

[X] t1.1: harden giant receipt fetch cli (2-4 commits)

acceptance criteria

  • giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
  • script reuses current browser session via firefox cookies + `curl_cffi`
  • script only fetches unseen orders
  • script appends to `orders.csv` and `items.csv` without duplicating prior visits
  • script prints a note that giant only exposes the most recent 50 visits
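
A minimal sketch of the "only unseen orders" and "append without duplicating" criteria, assuming `orders.csv` carries an `order_id` column. The column name and both helper names are illustrative, not taken from `scraper.py`.

```python
import csv
from pathlib import Path


def load_seen_ids(orders_csv: Path, id_column: str = "order_id") -> set[str]:
    """Return ids already recorded in the csv (empty set if the file is missing)."""
    if not orders_csv.exists():
        return set()
    with orders_csv.open(newline="", encoding="utf-8") as handle:
        return {row[id_column] for row in csv.DictReader(handle) if row.get(id_column)}


def append_unseen(orders_csv: Path, rows: list[dict], fieldnames: list[str],
                  id_column: str = "order_id") -> int:
    """Append only rows whose id is not yet in the file; return how many were written."""
    seen = load_seen_ids(orders_csv, id_column)
    fresh = []
    for row in rows:
        if row[id_column] not in seen:
            seen.add(row[id_column])  # also guards against duplicates within one batch
            fresh.append(row)
    # Write the header only when starting a new (or empty) file.
    write_header = not orders_csv.exists() or orders_csv.stat().st_size == 0
    with orders_csv.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(fresh)
    return len(fresh)
```

The same seen-id set can decide which order detail payloads to request, so a rerun touches only visits that are not yet on disk.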

notes

  • keep this giant-specific
  • no canonical product logic here
  • raw json archive remains source of truth

evidence

  • commit: `d57b9cf` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py help`; verified `.env` loading via `scraper.load_config()`
  • date: 2026-03-14

[X] t1.2: define grocery data model and file layout (1-2 commits)

acceptance criteria

  • decide and document the files/directories for:

    • retailer raw exports
    • enriched line items
    • observed products
    • canonical products
    • product links
  • define stable column schemas for each file
  • explicitly separate retailer-specific parsing from cross-retailer canonicalization

notes

  • this is the guardrail task so we don't make giant-specific hacks the system of record
  • keep schema minimal but extensible

evidence

  • commit: `42dbae1` on branch `cx`
  • tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
  • date: 2026-03-15

[X] t1.3: build giant parser/enricher from raw json (2-4 commits)

acceptance criteria

  • parser reads giant raw order json files
  • outputs `items_enriched.csv`
  • preserves core raw values plus parsed fields such as:

    • normalized item name
    • image url
    • size value/unit guesses
    • pack/count guesses
    • fee/store-brand flags
    • per-unit/per-weight derived price where possible
  • parser is deterministic and rerunnable

notes

  • do not attempt canonical cross-store matching yet
  • parser should preserve ambiguity rather than hallucinating precision

evidence

  • commit: `14f2cc2` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
  • date: 2026-03-16

[X] t1.4: generate observed-product layer from enriched items (2-3 commits)

acceptance criteria

  • distinct observed products are generated from enriched giant items
  • each observed product has a stable `observed_product_id`
  • observed products aggregate:

    • first seen / last seen
    • times seen
    • representative upc
    • representative image url
    • representative normalized name
  • outputs `products_observed.csv`

notes

  • observed product is retailer-facing, not yet canonical
  • likely key is some combo of retailer + upc + normalized name
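
One way the "retailer + upc + normalized name" key could become a stable id is to hash the normalized parts. This is a sketch only: the `obs_` prefix, the 12-character truncation, and the function name are assumptions, not what `build_observed_products.py` does.

```python
import hashlib


def observed_product_id(retailer: str, upc: str, normalized_name: str) -> str:
    """Derive a deterministic id from retailer + upc + normalized name."""
    parts = [
        retailer.strip().lower(),
        upc.strip(),
        " ".join(normalized_name.upper().split()),  # collapse whitespace, fold case
    ]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return f"obs_{digest[:12]}"
```

Because the id depends only on its inputs, rerunning the builder on the same enriched rows yields the same ids, which keeps review decisions and links attachable across runs.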

evidence

  • commit: `dc39214` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
  • date: 2026-03-16

[X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

acceptance criteria

  • produce a review file containing observed products needing manual review
  • include enough context to review quickly:

    • raw names
    • parsed names
    • upc
    • image url
    • example prices
    • seen count
  • reviewed status can be stored and reused

notes

  • this is where human-in-the-loop starts
  • optimize for “approve once, remember forever”

evidence

  • commit: `9b13ec3` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
  • date: 2026-03-16

[X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

acceptance criteria

  • define and create `products_canonical.csv`
  • define and create `product_links.csv`
  • support linking one or more observed products to one canonical product
  • canonical product schema supports food-cost comparison fields such as:

    • product type
    • variant
    • size
    • measure type
    • normalized quantity basis

notes

  • this is the first cross-retailer abstraction layer
  • do not require llm assistance for v1

evidence

  • commit: `347cd44` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
  • date: 2026-03-16

[X] t1.7: implement auto-link rules for easy matches (2-3 commits)

acceptance criteria

  • auto-link can match observed products to canonical products using deterministic rules
  • rules include at least:

    • exact upc
    • exact normalized name
    • exact size/unit match where available
  • low-confidence cases remain unlinked for review
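
A conservative sketch of the rules above. The column names (`canonical_product_id`, `normalized_name`, `size_value`, `size_unit`) are assumed for illustration, and the choice to reject a name match when only one side has a size is an assumption in the spirit of "false positives are worse than unresolved items".

```python
from typing import Optional


def _size_key(row: dict) -> tuple[str, str]:
    """Size as a comparable pair; blanks compare equal only to blanks."""
    return (str(row.get("size_value") or "").strip(),
            str(row.get("size_unit") or "").strip().lower())


def auto_link(observed: dict, canonicals: list[dict]) -> Optional[str]:
    """Return a canonical id only when exactly one candidate matches; else None."""
    upc = (observed.get("upc") or "").strip()
    if upc:
        hits = [c for c in canonicals if (c.get("upc") or "").strip() == upc]
        if len(hits) == 1:
            return hits[0]["canonical_product_id"]
        if len(hits) > 1:
            return None  # ambiguous upc: leave for review
    name = (observed.get("normalized_name") or "").strip()
    if name:
        hits = [
            c for c in canonicals
            if (c.get("normalized_name") or "").strip() == name
            and _size_key(c) == _size_key(observed)
        ]
        if len(hits) == 1:
            return hits[0]["canonical_product_id"]
    return None  # unresolved rows stay unlinked and go to the review queue
```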

notes

  • keep the rules conservative
  • false positives are worse than unresolved items

evidence

  • commit: `385a31c` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
  • date: 2026-03-16

[X] t1.8: support costco raw ingest path (2-5 commits)

acceptance criteria

  • add a costco-specific raw ingest/export path
  • fetch costco receipt summary and receipt detail payloads from graphql endpoint
  • persist raw json under `costco_output/raw/` and flatten to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
  • use costco-native identifiers: `transactionBarcode` as order id and `itemNumber` as retailer item id
  • preserve discount/coupon rows rather than dropping

notes

  • focus on raw costco acquisition and flattening
  • do not force costco identifiers into `upc`
  • bearer/auth values should come from local env, not source

evidence

  • commit: `da00288` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
  • date: 2026-03-16

[X] t1.8.1: support costco parser/enricher path (2-4 commits)

acceptance criteria

  • add a costco-specific enrich step producing `costco_output/items_enriched.csv`
  • output rows into the same shared enriched schema family as Giant
  • support costco-specific parsing for:

    • `itemDescription01` + `itemDescription02`
    • `itemNumber` as `retailer_item_id`
    • discount lines / negative rows
    • common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
  • preserve obvious unknowns as blank rather than guessed values
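
A hedged sketch of parsing the size patterns listed above with regular expressions. Reading `#` as pounds and `2/24 OZ` as two packs of 24 oz are assumptions about Costco's notation, and the field names are illustrative rather than the actual `enrich_costco.py` schema.

```python
import re
from typing import Optional, TypedDict


class SizeGuess(TypedDict):
    size_value: Optional[float]
    size_unit: Optional[str]
    pack_count: Optional[int]


PACK_OF_SIZE = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB)\b")  # 2/24 OZ
POUNDS_HASH = re.compile(r"\b(\d+(?:\.\d+)?)\s*#")                       # 25#
SIZE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB)\b")                      # 48 OZ
PACK = re.compile(r"\b(\d+)\s*-?\s*(?:PACK|PK)\b")                       # 6-PACK


def guess_size(description: str) -> SizeGuess:
    """Extract size/pack guesses; anything not matched stays None."""
    text = description.upper()
    guess: SizeGuess = {"size_value": None, "size_unit": None, "pack_count": None}
    if m := PACK_OF_SIZE.search(text):
        guess["pack_count"] = int(m.group(1))
        guess["size_value"] = float(m.group(2))
        guess["size_unit"] = m.group(3).lower()
        return guess
    if m := POUNDS_HASH.search(text):
        guess["size_value"], guess["size_unit"] = float(m.group(1)), "lb"
    elif m := SIZE.search(text):
        guess["size_value"], guess["size_unit"] = float(m.group(1)), m.group(2).lower()
    if m := PACK.search(text):
        guess["pack_count"] = int(m.group(1))
    return guess
```

Descriptions that match none of the patterns come back with every field `None`, which maps to blank cells in the csv instead of a guessed value.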

notes

  • this is the real schema compatibility proof, not raw ingest alone
  • expect weaker identifiers than Giant

evidence

  • commit: `da00288` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
  • date: 2026-03-16

[X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)

acceptance criteria

  • feed Giant and Costco enriched rows through the same observed/canonical pipeline
  • confirm at least one product class can exist as:

    • Giant observed product
    • Costco observed product
    • one shared canonical product
  • document the exact example used for proof

notes

  • keep this to one or two well-behaved product classes first
  • apples, eggs, bananas, or flour are better than weird prepared foods

evidence

  • commit: `da00288` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
  • date: 2026-03-16

[X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)

acceptance criteria

  • add shared fields needed for non-upc retailers, including:

    • `retailer_item_id`
    • `is_discount_line`
    • `is_coupon_line` or equivalent if needed
  • keep `upc` nullable across the pipeline
  • update downstream builders/tests to accept retailers with blank `upc`

notes

  • this prevents costco from becoming a schema hack
  • do this once instead of sprinkling exceptions everywhere

evidence

  • commit: `9497565` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
  • date: 2026-03-16

[X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)

acceptance criteria

  • confirm graphql summary query returns all expected receipts
  • compare `inWarehouse` count vs number of `receipts` returned
  • widen or parameterize date window if necessary; website shows receipts in 3-month windows
  • persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
  • emit warning when receipt counts mismatch
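
A sketch of the window chunking and the count check, using only the standard library. The payload shape (a top-level integer `inWarehouse` next to a `receipts` list) is an assumption based on the field names above, and the helper names are hypothetical.

```python
import calendar
from datetime import date, timedelta
from typing import Optional


def add_months(day: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the target month."""
    index = day.year * 12 + (day.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last_day))


def date_windows(start: date, end: date, months: int = 3) -> list[tuple[date, date]]:
    """Split [start, end] into consecutive, non-overlapping windows of at most `months` months."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(add_months(cursor, months) - timedelta(days=1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def count_mismatch(summary: dict) -> Optional[str]:
    """Return a warning string when the reported count differs from the rows returned."""
    expected = summary.get("inWarehouse")
    returned = len(summary.get("receipts") or [])
    if expected is not None and expected != returned:
        return f"warning: inWarehouse={expected} but {returned} receipts returned"
    return None
```

Running one summary query per window and checking each response separately keeps a mismatch traceable to a specific date range.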

notes

  • goal is to confirm we are enumerating all receipts before parsing
  • do not expand schema or parser logic in this task
  • keep changes limited to summary query handling and diagnostics

evidence

  • commit: `ac82fa6` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
  • date: 2026-03-16

[X] t1.8.5: refactor costco scraper auth and UX to align with giant scraper

acceptance criteria

  • remove manual auth env vars
  • load costco cookies from firefox session
  • require only logged-in browser
  • replace start/end date flags with months-back
  • maintain same raw output structure
  • ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone
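
The composite key can be illustrated as a tuple of the two fields. This is a sketch, not the code in `scrape_costco.py`; the helper names are hypothetical.

```python
def receipt_key(receipt: dict) -> tuple[str, str]:
    """Composite key: barcode alone is treated as possibly non-unique."""
    return (
        str(receipt.get("transactionBarcode", "")),
        str(receipt.get("transactionDateTime", "")),
    )


def build_summary_lookup(receipts: list[dict]) -> dict[tuple[str, str], dict]:
    """Index summary receipts by (transactionBarcode, transactionDateTime)."""
    return {receipt_key(receipt): receipt for receipt in receipts}
```

Two receipts that share a barcode but differ in timestamp then occupy separate entries instead of one silently overwriting the other.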

notes

  • align Costco acquisition ergonomics with the Giant scraper
  • keep downstream Costco parsing and shared schemas unchanged

evidence

  • commit: `c0054dc` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
  • date: 2026-03-16

[X] t1.8.6: add browser session helper (2-4 commits)

acceptance criteria

  • create a separate Python module/script that extracts the firefox browser session data needed by the giant and costco scrapers
  • support Firefox (browser) and Costco (retailer) first, including:

    • loading cookies via existing browser-cookie approach
    • reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
    • copying locked browser sqlite/db files to a temp location before reading when necessary
  • expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
  • keep retailer-specific parsing of extracted session data outside the low-level browser access layer
  • structure the helper so Chromium-family browser support can be added later without changing scraper call sites
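
A minimal sketch of the copy-before-read step for a locked Firefox database. It relies on the `moz_cookies` table in `cookies.sqlite`; the function name and the host-suffix filter are illustrative, and the real helper may load cookies through a browser-cookie library instead.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def read_cookies_from_copy(cookies_db: Path, host_suffix: str) -> dict[str, str]:
    """Copy the sqlite file to a temp dir, then read name/value pairs for one host."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        copy_path = Path(tmp_dir) / cookies_db.name
        shutil.copy2(cookies_db, copy_path)  # the running browser may hold a lock on the original
        connection = sqlite3.connect(copy_path)
        try:
            rows = connection.execute(
                "SELECT name, value FROM moz_cookies WHERE host LIKE ?",
                (f"%{host_suffix}",),
            ).fetchall()
        finally:
            connection.close()
    return dict(rows)
```

Copying only the main file may miss very recent writes if the browser keeps them in a write-ahead-log sidecar, so a fresh page load followed by a short wait can help before running the scraper.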

notes

  • goal is to replace manual `.env` copying of volatile browser-derived auth data
  • session bootstrap only, not full browser automation
  • prefer one shared helper over retailer-specific ad hoc storage reads
  • Firefox only; Chromium support later

evidence

  • commit: `7789c2e` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py help`; `./venv/bin/python scrape_costco.py help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
  • date: 2026-03-16

[X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)

acceptance criteria

  • make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
  • keep `browser_session.py` limited to low-level browser data access only:

    • firefox profile discovery
    • cookie loading
    • storage reads
    • sqlite copy/read helpers
  • remove or sharply reduce `retailer_sessions.py` so retailer-specific header extraction lives with the retailer scraper or in a very small retailer-specific helper
  • make session bootstrap flow explicit and linear:

    • load browser context
    • extract costco auth values
    • build request headers
    • build requests session
  • eliminate inconsistent/obsolete function signatures and dead call paths (e.g. mixed `build_session(…)` calling conventions, stale fallback branches, mismatched `build_headers(…)` args)
  • add one focused bootstrap debug print showing whether cookies, authorization, client id, and client identifier were found
  • preserve current working behavior where available; this is a refactor/clarification task, not a feature expansion task
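
The single bootstrap debug line might look like the sketch below. The header keys (`authorization`, `client-id`, `client-identifier`) are placeholders for whatever the Costco request actually uses, and only presence is reported so no secret values are printed.

```python
def bootstrap_summary(cookies, headers: dict) -> str:
    """Report which session pieces were found, without printing their values."""
    found = {
        "cookies": bool(cookies),
        "authorization": bool(headers.get("authorization")),
        "client_id": bool(headers.get("client-id")),                  # placeholder key
        "client_identifier": bool(headers.get("client-identifier")),  # placeholder key
    }
    status = ", ".join(
        f"{name}={'found' if ok else 'missing'}" for name, ok in found.items()
    )
    return f"costco bootstrap: {status}"
```

Calling `print(bootstrap_summary(cookie_jar, headers))` once, right after the headers are built, gives one line to check when a request fails with an auth error.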

notes

  • goal is to restore concern separation and debuggability
  • prefer obvious retailer-specific code over “generic” helpers that guess and obscure control flow
  • browser access can stay shared; retailer auth mapping should be explicit
  • no new heuristics in this task

evidence

  • commit: `d7a0329` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
  • date: 2026-03-16

[X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)

acceptance criteria

  • produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
  • each purchase row preserves:

    • purchase date
    • retailer
    • order id
    • raw item name
    • normalized item name
    • canonical item id when resolved
    • quantity / unit
    • line total
    • store/location info where available
  • derive normalized comparison fields where possible on enriched or observed product rows:

    • `price_per_lb`
    • `price_per_oz`
    • `price_per_each`
    • `price_per_count`
  • preserve the source basis used to derive each metric, e.g.:

    • parsed size/unit
    • receipt weight
    • explicit count/pack
  • emit nulls when basis is unknown, conflicting, or ambiguous
  • support pivot-friendly analysis of purchase frequency and item cost over time
  • document at least one Giant vs Costco comparison example using the normalized metrics
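
A sketch of deriving comparison metrics while recording the basis and emitting nulls when the basis is unknown. It covers only weight-based and count-based metrics; the function name, the `basis` labels, and the assumption that `line_total` covers `quantity` identical units are illustrative, not taken from `build_purchases.py`.

```python
from typing import Optional

OZ_PER_LB = 16.0


def unit_prices(line_total: Optional[float], quantity: Optional[float],
                size_value: Optional[float], size_unit: Optional[str],
                pack_count: Optional[int] = None) -> dict:
    """Derive per-unit prices; every metric stays None unless its basis is known."""
    result = {"price_per_lb": None, "price_per_oz": None,
              "price_per_count": None, "basis": None}
    # Discount/coupon lines (zero or negative totals) get no unit economics.
    if not line_total or line_total <= 0 or not quantity or quantity <= 0:
        return result
    unit_price = line_total / quantity
    unit = (size_unit or "").lower()
    if size_value and size_value > 0 and unit in ("lb", "oz"):
        ounces = size_value * (pack_count or 1) * (OZ_PER_LB if unit == "lb" else 1.0)
        result["price_per_oz"] = round(unit_price / ounces, 4)
        result["price_per_lb"] = round(unit_price / ounces * OZ_PER_LB, 4)
        result["basis"] = "parsed_size"
    elif pack_count and pack_count > 0:
        result["price_per_count"] = round(unit_price / pack_count, 4)
        result["basis"] = "explicit_count"
    return result
```

For example, a 4.80 line for one 48 oz item works out to 0.10 per oz and 1.60 per lb, while a row with no parsed size or count returns all nulls.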

notes

  • compute metrics as close to the raw observation as possible
  • canonical layer can aggregate later, but should not invent missing unit economics
  • unit discipline matters more than coverage
  • raw item name must be retained for audit/debugging

evidence

  • commit: `be1bf63` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
  • date: 2026-03-16

[ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

acceptance criteria

  • llm suggestions are generated only for unresolved observed products
  • llm outputs are stored as suggestions, not auto-applied truth
  • reviewer can approve/edit/reject suggestions
  • approved decisions are persisted into canonical/link files

notes

  • bounded assistant, not autonomous goblin
  • image urls may become useful here

evidence

  • commit:
  • tests:
  • date:

[ ] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)

acceptance criteria

  • define the persistent files used to resolve unknown items, including:

    • review queue
    • canonical item catalog
    • alias / mapping layer if separate
  • specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
  • define the manual resolution workflow, including:

    • what the human edits
    • what script is rerun afterward
    • how resolved mappings are persisted for future runs
  • ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
  • document how raw item name, normalized item name, and canonical item id are all retained

notes

  • goal is “approve once, reuse forever”
  • keep the workflow simple and auditable
  • manual review is fine; the important part is making it durable and rerunnable

evidence

  • commit:
  • tests:
  • date: