scrape-giant/pm/tasks.org
2026-03-16 00:28:28 -04:00


[X] t1.1: harden giant receipt fetch cli (2-4 commits)

acceptance criteria

  • giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
  • script reuses current browser session via firefox cookies + `curl_cffi`
  • script only fetches unseen orders
  • script appends to `orders.csv` and `items.csv` without duplicating prior visits
  • script prints a note that giant only exposes the most recent 50 visits

notes

  • keep this giant-specific
  • no canonical product logic here
  • raw json archive remains source of truth

evidence

  • commit: `d57b9cf` on branch `cx`
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py help`; verified `.env` loading via `scraper.load_config()`
  • date: 2026-03-14

[X] t1.2: define grocery data model and file layout (1-2 commits)

acceptance criteria

  • decide and document the files/directories for:

    • retailer raw exports
    • enriched line items
    • observed products
    • canonical products
    • product links
  • define stable column schemas for each file
  • explicitly separate retailer-specific parsing from cross-retailer canonicalization

notes

  • this is the guardrail task so we don't make giant-specific hacks the system of record
  • keep schema minimal but extensible
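the layout and schema decisions above can be pinned down as constants. every path and column name here is a placeholder; the authoritative definitions live in `pm/data-model.org`:

```python
# Hypothetical layout sketch; authoritative schemas are in pm/data-model.org.
LAYOUT = {
    "raw": "giant_output/raw/",                    # retailer raw exports (source of truth)
    "enriched": "giant_output/items_enriched.csv", # enriched line items
    "observed": "products_observed.csv",           # observed products
    "canonical": "products_canonical.csv",         # canonical products
    "links": "product_links.csv",                  # observed-to-canonical links
}

# Retailer-specific columns stay out of the canonical schema on purpose:
# that separation is what keeps parsing and canonicalization decoupled.
OBSERVED_COLUMNS = ["observed_product_id", "retailer", "upc", "name_normalized",
                    "first_seen", "last_seen", "times_seen", "image_url"]
CANONICAL_COLUMNS = ["canonical_product_id", "product_type", "variant",
                     "size", "measure_type", "quantity_basis"]
LINK_COLUMNS = ["observed_product_id", "canonical_product_id", "link_method"]
```

the guardrail is visible in the constants themselves: `retailer` appears only in the observed schema, never the canonical one.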

evidence

  • commit: `42dbae1` on branch `cx`
  • tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
  • date: 2026-03-15

[X] t1.3: build giant parser/enricher from raw json (2-4 commits)

acceptance criteria

  • parser reads giant raw order json files
  • outputs `items_enriched.csv`
  • preserves core raw values plus parsed fields such as:

    • normalized item name
    • image url
    • size value/unit guesses
    • pack/count guesses
    • fee/store-brand flags
    • per-unit/per-weight derived price where possible
  • parser is deterministic and rerunnable

notes

  • do not attempt canonical cross-store matching yet
  • parser should preserve ambiguity rather than hallucinating precision
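the "size value/unit guesses" field can be sketched as a single deterministic regex pass. the unit list here is a guess to be extended as real raw data demands, and ambiguity comes back as `(None, None)` rather than a forced value:

```python
import re

# Common grocery size suffixes; extend this list as raw data demands.
_SIZE_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(fl oz|oz|lb|ct|g|kg|ml|l)\b", re.IGNORECASE
)

def guess_size(raw_name):
    """Return (value, unit) parsed from an item name, or (None, None).

    Deterministic and rerunnable: same input always yields the same
    output, and ambiguity is preserved as (None, None), never invented.
    """
    m = _SIZE_RE.search(raw_name)
    if not m:
        return (None, None)
    return (float(m.group(1)), m.group(2).lower())
```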

evidence

  • commit:
  • tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
  • date: 2026-03-16

[ ] t1.4: generate observed-product layer from enriched items (2-3 commits)

acceptance criteria

  • distinct observed products are generated from enriched giant items
  • each observed product has a stable `observed_product_id`
  • observed products aggregate:

    • first seen / last seen
    • times seen
    • representative upc
    • representative image url
    • representative normalized name
  • outputs `products_observed.csv`

notes

  • observed product is retailer-facing, not yet canonical
  • likely key is some combo of retailer + upc + normalized name
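the stable `observed_product_id` can be sketched as a hash of exactly that likely key (retailer + upc + normalized name), so reruns never reassign ids:

```python
import hashlib

def observed_product_id(retailer, upc, name_normalized):
    """Stable id derived from retailer + upc + normalized name.

    Key choice is the guess from the notes above; hashing the key
    fields means re-running the pipeline reproduces the same ids.
    """
    key = "|".join([retailer, upc or "", name_normalized or ""])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
```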

evidence

  • commit:
  • tests:
  • date:

[ ] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

acceptance criteria

  • produce a review file containing observed products needing manual review
  • include enough context to review quickly:

    • raw names
    • parsed names
    • upc
    • image url
    • example prices
    • seen count
  • reviewed status can be stored and reused

notes

  • this is where human-in-the-loop starts
  • optimize for “approve once, remember forever”
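the queue-selection step can be sketched as a filter over observed rows: low confidence and not yet reviewed. the `confidence` field, the 0.8 threshold, and the reviewed-id set are all assumptions about the eventual schema:

```python
def needs_review(observed, reviewed_ids):
    """Select observed products that still need a human decision.

    `observed` rows are dicts; `confidence` and the persisted
    `reviewed_ids` set are assumed schema, and the 0.8 cutoff is a
    placeholder. Skipping already-reviewed ids is the "approve once,
    remember forever" part.
    """
    return [
        row for row in observed
        if row.get("confidence", 0.0) < 0.8
        and row["observed_product_id"] not in reviewed_ids
    ]
```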

evidence

  • commit:
  • tests:
  • date:

[ ] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

acceptance criteria

  • define and create `products_canonical.csv`
  • define and create `product_links.csv`
  • support linking one or more observed products to one canonical product
  • canonical product schema supports food-cost comparison fields such as:

    • product type
    • variant
    • size
    • measure type
    • normalized quantity basis

notes

  • this is the first cross-retailer abstraction layer
  • do not require llm assistance for v1

evidence

  • commit:
  • tests:
  • date:

[ ] t1.7: implement auto-link rules for easy matches (2-3 commits)

acceptance criteria

  • auto-link can match observed products to canonical products using deterministic rules
  • rules include at least:

    • exact upc
    • exact normalized name
    • exact size/unit match where available
  • low-confidence cases remain unlinked for review

notes

  • keep the rules conservative
  • false positives are worse than unresolved items
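the conservative rule set can be sketched with the strictest rule, exact upc, requiring a unique hit; anything else stays unresolved for review. field names are assumptions pending the real schemas:

```python
def auto_link(observed, canonical):
    """Deterministic matching by exact upc only (the other rules,
    exact normalized name and exact size/unit, would layer on the
    same shape). Returns (links, unresolved).
    """
    by_upc = {}
    for c in canonical:
        if c.get("upc"):
            by_upc.setdefault(c["upc"], []).append(c)
    links, unresolved = [], []
    for o in observed:
        matches = by_upc.get(o.get("upc"), [])
        if len(matches) == 1:  # require a unique hit, never a best guess
            links.append((o["observed_product_id"],
                          matches[0]["canonical_product_id"]))
        else:
            unresolved.append(o)  # low confidence stays unlinked
    return links, unresolved
```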

evidence

  • commit:
  • tests:
  • date:

[ ] t1.8: support costco raw ingest path (2-5 commits)

acceptance criteria

  • add a costco-specific raw ingest/export path
  • output costco line items into the same shared raw/enriched schema family
  • confirm at least one product class can exist as:

    • giant observed product
    • costco observed product
    • one shared canonical product

notes

  • this is the proof that the architecture generalizes
  • don't chase perfection before the second retailer lands

evidence

  • commit:
  • tests:
  • date:

[ ] t1.9: compute normalized comparison metrics (2-3 commits)

acceptance criteria

  • derive normalized comparison fields where possible:

    • price per lb
    • price per oz
    • price per each
    • price per count
  • metrics are attached at canonical or linked-observed level as appropriate
  • emit obvious nulls when basis is unknown rather than inventing values

notes

  • this is where “gala apples 5 lb bag vs other gala apples” becomes possible
  • units discipline matters a lot here
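the units-discipline point can be made concrete with one conversion table used everywhere and an explicit null when the basis is unknown. a sketch for price per oz; the other metrics follow the same shape:

```python
OZ_PER_LB = 16.0  # one conversion table, used everywhere

def price_per_oz(price, size_value, size_unit):
    """Derive price/oz where the basis is known; None otherwise.

    Emitting an obvious null beats inventing a value, per the
    acceptance criteria above.
    """
    if price is None or not size_value:
        return None
    if size_unit == "oz":
        return price / size_value
    if size_unit == "lb":
        return price / (size_value * OZ_PER_LB)
    return None  # unknown basis, so null, never a guess
```

with this in place, "gala apples 5 lb bag vs other gala apples" reduces to comparing two price_per_oz values.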

evidence

  • commit:
  • tests:
  • date:

[ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

acceptance criteria

  • llm suggestions are generated only for unresolved observed products
  • llm outputs are stored as suggestions, not auto-applied truth
  • reviewer can approve/edit/reject suggestions
  • approved decisions are persisted into canonical/link files

notes

  • bounded assistant, not autonomous goblin
  • image urls may become useful here
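the "suggestions, not auto-applied truth" criterion can be sketched as a persist step that only ever copies approved rows into the link table. the `status` vocabulary ("approved"/"rejected"/"pending") and column names are assumptions:

```python
def apply_reviewed(suggestions, links):
    """Persist only approved suggestions into the link table.

    Suggestion rows carry an assumed `status` field; pending and
    rejected rows are never applied, keeping the llm a bounded
    assistant rather than a source of truth.
    """
    for s in suggestions:
        if s["status"] == "approved":
            links.append({"observed_product_id": s["observed_product_id"],
                          "canonical_product_id": s["canonical_product_id"],
                          "link_method": "llm_approved"})
    return links
```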

evidence

  • commit:
  • tests:
  • date: