scrape-giant/pm/tasks.org

[X] t1.1: harden giant receipt fetch cli (2-4 commits)

acceptance criteria

  • giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
  • script reuses current browser session via firefox cookies + `curl_cffi`
  • script only fetches unseen orders
  • script appends to `orders.csv` and `items.csv` without duplicating prior visits
  • script prints a note that giant only exposes the most recent 50 visits
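
A minimal sketch of the "append without duplicating" criterion, not the code in `scraper.py`. The helper name `append_unseen` and the `order_id` key column are assumptions for illustration; `items.csv` would likely need a composite key instead.

```python
import csv
from pathlib import Path


def append_unseen(csv_path, rows, key_field="order_id"):
    """Append rows whose key is not already present; return how many were added."""
    path = Path(csv_path)
    seen = set()
    fieldnames = None
    if path.exists() and path.stat().st_size > 0:
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            fieldnames = reader.fieldnames  # reuse the existing column order
            seen = {row[key_field] for row in reader}

    new_rows = []
    for row in rows:
        key = str(row[key_field])
        if key not in seen:
            seen.add(key)  # also guards against duplicates inside this batch
            new_rows.append(row)
    if not new_rows:
        return 0

    write_header = fieldnames is None
    if fieldnames is None:
        fieldnames = list(new_rows[0].keys())
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

Rerunning the scraper against the same visit list then appends nothing, which keeps the step safe to repeat.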

notes

  • keep this giant-specific
  • no canonical product logic here
  • raw json archive remains source of truth

evidence

  • commit: `d57b9cf` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py help`; verified `.env` loading via `scraper.load_config()`
  • date: 2026-03-14

[X] t1.2: define grocery data model and file layout (1-2 commits)

acceptance criteria

  • decide and document the files/directories for:

    • retailer raw exports
    • enriched line items
    • observed products
    • canonical products
    • product links
  • define stable column schemas for each file
  • explicitly separate retailer-specific parsing from cross-retailer canonicalization

notes

  • this is the guardrail task so we don't make giant-specific hacks the system of record
  • keep schema minimal but extensible

evidence

  • commit: `42dbae1` on branch `cx`
  • tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
  • date: 2026-03-15

[X] t1.3: build giant parser/enricher from raw json (2-4 commits)

acceptance criteria

  • parser reads giant raw order json files
  • outputs `items_enriched.csv`
  • preserves core raw values plus parsed fields such as:

    • normalized item name
    • image url
    • size value/unit guesses
    • pack/count guesses
    • fee/store-brand flags
    • per-unit/per-weight derived price where possible
  • parser is deterministic and rerunnable

notes

  • do not attempt canonical cross-store matching yet
  • parser should preserve ambiguity rather than hallucinating precision

evidence

  • commit: `14f2cc2` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
  • date: 2026-03-16

[X] t1.4: generate observed-product layer from enriched items (2-3 commits)

acceptance criteria

  • distinct observed products are generated from enriched giant items
  • each observed product has a stable `observed_product_id`
  • observed products aggregate:

    • first seen / last seen
    • times seen
    • representative upc
    • representative image url
    • representative normalized name
  • outputs `products_observed.csv`

notes

  • observed product is retailer-facing, not yet canonical
  • likely key is some combo of retailer + upc + normalized name
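
One way the "retailer + upc + normalized name" key could become a stable id is a short content hash. This is a sketch under that assumption; the `obs_` prefix, the 12-character length, and the function name are illustrative, not what `build_observed_products.py` necessarily does.

```python
import hashlib


def observed_product_id(retailer, upc, normalized_name):
    """Derive a deterministic id from the candidate key fields."""
    parts = [
        retailer.strip().lower(),
        (upc or "").strip(),  # upc may be blank for some retailers
        " ".join(normalized_name.upper().split()),  # collapse case and whitespace
    ]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return f"obs_{digest[:12]}"
```

Because the id depends only on the key fields, rebuilding `products_observed.csv` from the same enriched rows yields the same ids.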

evidence

  • commit: `dc39214` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
  • date: 2026-03-16

[X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

acceptance criteria

  • produce a review file containing observed products needing manual review
  • include enough context to review quickly:

    • raw names
    • parsed names
    • upc
    • image url
    • example prices
    • seen count
  • reviewed status can be stored and reused

notes

  • this is where human-in-the-loop starts
  • optimize for “approve once, remember forever”
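
"approve once, remember forever" can be read as: anything with a stored decision never re-enters the queue. A minimal sketch, assuming decisions are rows keyed by `observed_product_id` with a non-empty `status` (both the function name and the `status` column are hypothetical):

```python
def pending_review(observed_products, decisions):
    """Return observed products that have no stored review decision yet."""
    decided = {d["observed_product_id"] for d in decisions if d.get("status")}
    return [p for p in observed_products if p["observed_product_id"] not in decided]
```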

evidence

  • commit: `9b13ec3` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
  • date: 2026-03-16

[X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

acceptance criteria

  • define and create `products_canonical.csv`
  • define and create `product_links.csv`
  • support linking one or more observed products to one canonical product
  • canonical product schema supports food-cost comparison fields such as:

    • product type
    • variant
    • size
    • measure type
    • normalized quantity basis

notes

  • this is the first cross-retailer abstraction layer
  • do not require llm assistance for v1

evidence

  • commit: `347cd44` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
  • date: 2026-03-16

[X] t1.7: implement auto-link rules for easy matches (2-3 commits)

acceptance criteria

  • auto-link can match observed products to canonical products using deterministic rules
  • rules include at least:

    • exact upc
    • exact normalized name
    • exact size/unit match where available
  • low-confidence cases remain unlinked for review

notes

  • keep the rules conservative
  • false positives are worse than unresolved items
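
The conservative rules above might look like the following sketch. The field names (`canonical_product_id`, `normalized_name`, `size_value`, `size_unit`) and the choice to treat size as a veto rather than a separate rule are assumptions. Any ambiguous case returns no link so it stays in review.

```python
def _size_conflict(a, b):
    """True only when both sides have a full size and the sizes differ."""
    size_a = (a.get("size_value") or "", a.get("size_unit") or "")
    size_b = (b.get("size_value") or "", b.get("size_unit") or "")
    return all(size_a) and all(size_b) and size_a != size_b


def auto_link(observed, canonicals):
    """Return (canonical_product_id, rule), or (None, "") when unsure."""
    upc = observed.get("upc") or ""
    if upc:
        hits = [c for c in canonicals if (c.get("upc") or "") == upc]
        if len(hits) == 1:  # multiple hits are ambiguous, so skip
            return hits[0]["canonical_product_id"], "exact_upc"

    name = observed.get("normalized_name") or ""
    if name:
        hits = [
            c for c in canonicals
            if c.get("normalized_name") == name and not _size_conflict(observed, c)
        ]
        if len(hits) == 1:
            return hits[0]["canonical_product_id"], "exact_name"
    return None, ""
```

Returning the rule name alongside the id makes each link auditable later.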

evidence

  • commit: `385a31c` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
  • date: 2026-03-16

[X] t1.8: support costco raw ingest path (2-5 commits)

acceptance criteria

  • add a costco-specific raw ingest/export path
  • fetch costco receipt summary and receipt detail payloads from graphql endpoint
  • persist raw json under `costco_output/raw/` and flatten to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
  • use costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
  • preserve discount/coupon rows rather than dropping

notes

  • focus on raw costco acquisition and flattening
  • do not force costco identifiers into `upc`
  • bearer/auth values should come from local env, not source

evidence

  • commit: `da00288` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
  • date: 2026-03-16

[X] t1.8.1: support costco parser/enricher path (2-4 commits)

acceptance criteria

  • add a costco-specific enrich step producing `costco_output/items_enriched.csv`
  • output rows into the same shared enriched schema family as Giant
  • support costco-specific parsing for:

    • `itemDescription01` + `itemDescription02`
    • `itemNumber` as `retailer_item_id`
    • discount lines / negative rows
    • common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
  • preserve obvious unknowns as blank rather than guessed values
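
A sketch of how the listed size patterns could be parsed while leaving unknowns blank. The regexes, the unit list (`OZ`, `LB`, `CT`), and the output field names are assumptions for illustration, not the rules in `enrich_costco.py`.

```python
import re

# Most specific pattern first so "2/24 OZ" is not read as a plain "24 OZ".
_PACK_OF_SIZED = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|CT)\b")
_POUNDS_HASH = re.compile(r"\b(\d+(?:\.\d+)?)\s*#")  # "25#" means 25 lb
_SIZED = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB|CT)\b")
_PACK = re.compile(r"\b(\d+)\s*-?\s*(?:PACK|PK)\b")


def parse_size(description):
    """Return size/pack guesses as strings; blank means unknown."""
    text = description.upper()
    out = {"size_value": "", "size_unit": "", "pack_qty": ""}
    m = _PACK_OF_SIZED.search(text)
    if m:
        out.update(pack_qty=m.group(1), size_value=m.group(2), size_unit=m.group(3).lower())
        return out
    m = _POUNDS_HASH.search(text)
    if m:
        out.update(size_value=m.group(1), size_unit="lb")
        return out
    m = _SIZED.search(text)
    if m:
        out.update(size_value=m.group(1), size_unit=m.group(2).lower())
    m = _PACK.search(text)
    if m:
        out["pack_qty"] = m.group(1)
    return out
```

A description with no recognizable pattern comes back with all three fields blank, matching the "blank rather than guessed" criterion.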

notes

  • this is the real schema compatibility proof, not raw ingest alone
  • expect weaker identifiers than Giant

evidence

  • commit: `da00288` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
  • date: 2026-03-16

[X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)

acceptance criteria

  • feed Giant and Costco enriched rows through the same observed/canonical pipeline
  • confirm at least one product class can exist as:

    • Giant observed product
    • Costco observed product
    • one shared canonical product
  • document the exact example used for proof

notes

  • keep this to one or two well-behaved product classes first
  • apples, eggs, bananas, or flour are better than weird prepared foods

evidence

  • commit: `da00288` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
  • date: 2026-03-16

[X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)

acceptance criteria

  • add shared fields needed for non-upc retailers, including:

    • `retailer_item_id`
    • `is_discount_line`
    • `is_coupon_line` or equivalent if needed
  • keep `upc` nullable across the pipeline
  • update downstream builders/tests to accept retailers with blank `upc`

notes

  • this prevents costco from becoming a schema hack
  • do this once instead of sprinkling exceptions everywhere

evidence

  • commit: `9497565` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
  • date: 2026-03-16

[X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)

acceptance criteria

  • confirm graphql summary query returns all expected receipts
  • compare `inWarehouse` count vs number of `receipts` returned
  • widen or parameterize date window if necessary; website shows receipts in 3-month windows
  • persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
  • emit warning when receipt counts mismatch
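
The window chunking and count check can be sketched as below. The function names and the shape of the summary dict (an `inWarehouse` count next to a `receipts` list) are assumptions based on the field names above, not the confirmed payload layout.

```python
from datetime import date, timedelta


def three_month_windows(start, end):
    """Split [start, end] into consecutive windows of at most three calendar months."""
    windows = []
    cursor = start
    while cursor <= end:
        # First day of the month three months after the cursor's month.
        month_index = cursor.year * 12 + (cursor.month - 1) + 3
        next_start = date(month_index // 12, month_index % 12 + 1, 1)
        window_end = min(next_start - timedelta(days=1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def count_mismatch(summary):
    """Return a warning string when the reported and returned counts differ."""
    expected = summary.get("inWarehouse")
    returned = len(summary.get("receipts", []))
    if expected is not None and expected != returned:
        return f"warning: inWarehouse={expected} but {returned} receipts returned"
    return None
```

Each `(start, end)` pair would feed one summary request, and the stored request metadata makes a later mismatch traceable to a specific window.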

notes

  • goal is to confirm we are enumerating all receipts before parsing
  • do not expand schema or parser logic in this task
  • keep changes limited to summary query handling and diagnostics

evidence

  • commit: `ac82fa6` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
  • date: 2026-03-16

[X] t1.8.5: align costco scraper auth and UX with giant scraper

acceptance criteria

  • remove manual auth env vars
  • load costco cookies from firefox session
  • require only logged-in browser
  • replace start/end date flags with months-back
  • maintain same raw output structure
  • ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone
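
The composite-key criterion, as a small sketch. `receipt_key` and `build_summary_lookup` are hypothetical names; only `transactionBarcode`, `transactionDateTime`, and `summary_lookup` come from the task text.

```python
def receipt_key(receipt):
    """Composite key: barcode alone is not assumed to be unique."""
    return (receipt["transactionBarcode"], receipt["transactionDateTime"])


def build_summary_lookup(receipts):
    """Index summary receipts by composite key, failing loudly on a collision."""
    lookup = {}
    for receipt in receipts:
        key = receipt_key(receipt)
        if key in lookup:
            raise ValueError(f"duplicate receipt key: {key}")
        lookup[key] = receipt
    return lookup
```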

notes

  • align Costco acquisition ergonomics with the Giant scraper
  • keep downstream Costco parsing and shared schemas unchanged

evidence

  • commit: `c0054dc` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
  • date: 2026-03-16

[ ] t1.8.6: add browser session helper (2-4 commits)

acceptance criteria

  • create a separate Python module/script that extracts the Firefox browser session data needed by the giant and costco scrapers
  • support the Firefox + Costco combination first, including:

    • loading cookies via existing browser-cookie approach
    • reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
    • copying locked browser sqlite/db files to a temp location before reading when necessary
  • expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
  • keep retailer-specific parsing of extracted session data outside the low-level browser access layer
  • structure the helper so Chromium-family browser support can be added later without changing scraper call sites
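
A sketch of the "copy locked db to a temp location before reading" step for Firefox cookies. `moz_cookies` is the table Firefox uses in `cookies.sqlite`; the function name, its arguments, and the returned dict shape are assumptions. Copying only the main file may miss very recent writes that are still in the write-ahead log.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def read_cookies(cookies_db, host_suffix):
    """Read name/value cookie pairs for a host from a snapshot of the db."""
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "cookies.sqlite"
        shutil.copy2(cookies_db, snapshot)  # avoid touching the live, locked file
        conn = sqlite3.connect(snapshot)
        try:
            rows = conn.execute(
                "SELECT name, value FROM moz_cookies WHERE host LIKE ?",
                (f"%{host_suffix}",),
            ).fetchall()
        finally:
            conn.close()
    return dict(rows)
```

Retailer-specific interpretation of those values (for example, which one carries an auth token) would stay in the scraper, per the layering criterion above.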

notes

  • goal is to replace manual `.env` copying of volatile browser-derived auth data
  • session bootstrap only, not full browser automation
  • prefer one shared helper over retailer-specific ad hoc storage reads
  • Firefox only; Chromium support later

evidence

  • commit:
  • tests:
  • date:

[ ] t1.9: compute normalized comparison metrics (2-4 commits)

acceptance criteria

  • derive normalized comparison fields where possible on enriched or observed product rows:

    • `price_per_lb`
    • `price_per_oz`
    • `price_per_each`
    • `price_per_count`
  • preserve the source basis used to derive each metric, e.g.:

    • parsed size/unit
    • receipt weight
    • explicit count/pack
  • emit nulls when basis is unknown, conflicting, or ambiguous
  • document at least one Giant vs Costco comparison example using the normalized metrics
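
A sketch of the normalization under the "emit nulls when unknown" rule. The function name, the input field names, and the `basis` labels are assumptions. It covers weight-based and count-based metrics only and leaves `price_per_each` out.

```python
from decimal import Decimal

OZ_PER_LB = Decimal(16)
_CENTS4 = Decimal("0.0001")


def price_metrics(price, size_value="", size_unit="", pack_qty=""):
    """Derive per-unit prices; any metric without a clear basis stays None."""
    out = {"price_per_lb": None, "price_per_oz": None,
           "price_per_count": None, "basis": None}
    if not price:
        return out
    total = Decimal(str(price))
    packs = Decimal(str(pack_qty)) if pack_qty else Decimal(1)

    if size_value and size_unit in ("oz", "lb"):
        quantity = Decimal(str(size_value)) * packs
        if quantity > 0:
            ounces = quantity if size_unit == "oz" else quantity * OZ_PER_LB
            out["price_per_oz"] = (total / ounces).quantize(_CENTS4)
            out["price_per_lb"] = (total / (ounces / OZ_PER_LB)).quantize(_CENTS4)
            out["basis"] = "parsed_size"
    elif pack_qty and not size_value and packs > 0:
        out["price_per_count"] = (total / packs).quantize(_CENTS4)
        out["basis"] = "explicit_count"
    return out
```

For example, a 48 oz item at 4.80 works out to 0.10 per oz and 1.60 per lb, while a row with no parsed size and no count yields all nulls.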

notes

  • compute metrics as close to the raw observation as possible
  • canonical layer can aggregate later, but should not invent missing unit economics
  • unit discipline matters more than coverage

evidence

  • commit:
  • tests:
  • date:

[ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

acceptance criteria

  • llm suggestions are generated only for unresolved observed products
  • llm outputs are stored as suggestions, not auto-applied truth
  • reviewer can approve/edit/reject suggestions
  • approved decisions are persisted into canonical/link files

notes

  • bounded assistant, not autonomous goblin
  • image urls may become useful here

evidence

  • commit:
  • tests:
  • date: