scrape-giant/pm/tasks.org
2026-03-14 17:59:40 -04:00
[ ] t1.1: harden giant receipt fetch cli (2-4 commits)

acceptance criteria

  • giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
  • script reuses current browser session via firefox cookies + `curl_cffi`
  • script only fetches unseen orders
  • script appends to `orders.csv` and `items.csv` without duplicating prior visits
  • script prints a note that giant only exposes the most recent 50 visits

notes

  • keep this giant-specific
  • no canonical product logic here
  • raw json archive remains source of truth
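The fetch itself (curl_cffi impersonating the current firefox session via its cookies) is environment-dependent, but the "only fetches unseen orders" criterion can be sketched as a pure dedupe step. This is a sketch, not the implementation; the `order_id` column name is an assumption:

```python
import csv
import os

def seen_order_ids(orders_csv="orders.csv"):
    """Return the set of order ids already appended to orders.csv."""
    if not os.path.exists(orders_csv):
        return set()
    with open(orders_csv, newline="") as f:
        return {row["order_id"] for row in csv.DictReader(f)}

def unseen(order_ids, orders_csv="orders.csv"):
    """Filter the (at most 50) visits giant exposes down to new ones only."""
    seen = seen_order_ids(orders_csv)
    return [oid for oid in order_ids if oid not in seen]
```

Running `unseen` before any network fetch is what keeps reruns from duplicating prior visits in the csvs.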

evidence

  • commit:
  • tests:
  • date:

[ ] t1.2: define grocery data model and file layout (1-2 commits)

acceptance criteria

  • decide and document the files/directories for:

    • retailer raw exports
    • enriched line items
    • observed products
    • canonical products
    • product links
  • define stable column schemas for each file
  • explicitly separate retailer-specific parsing from cross-retailer canonicalization

notes

  • this is the guardrail task so we don't make giant-specific hacks the system of record
  • keep schema minimal but extensible
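One possible shape for the decision, expressed as constants. Every path and column name below is an assumption to be confirmed by this task, not a decided layout:

```python
# hypothetical layout: one raw/enriched/observed tree per retailer,
# canonical and link files shared across retailers
LAYOUT = {
    "raw": "data/{retailer}/raw/orders/",          # retailer raw exports (json)
    "enriched": "data/{retailer}/items_enriched.csv",
    "observed": "data/{retailer}/products_observed.csv",
    "canonical": "data/products_canonical.csv",    # cross-retailer
    "links": "data/product_links.csv",             # observed -> canonical
}

# minimal but extensible column set for enriched line items (a guess)
ITEMS_ENRICHED_COLUMNS = [
    "order_id", "raw_name", "normalized_name", "upc",
    "size_value", "size_unit", "pack_count",
    "unit_price", "total_price", "image_url",
]
```

Keeping canonical and link files outside the per-retailer tree is what enforces the retailer-specific vs cross-retailer separation structurally.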

evidence

  • commit:
  • tests:
  • date:

[ ] t1.3: build giant parser/enricher from raw json (2-4 commits)

acceptance criteria

  • parser reads giant raw order json files
  • outputs `items_enriched.csv`
  • preserves core raw values plus parsed fields such as:

    • normalized item name
    • image url
    • size value/unit guesses
    • pack/count guesses
    • fee/store-brand flags
    • per-unit/per-weight derived price where possible
  • parser is deterministic and rerunnable

notes

  • do not attempt canonical cross-store matching yet
  • parser should preserve ambiguity rather than hallucinating precision
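The "preserve ambiguity" rule for size guessing can be sketched as: return nothing rather than a bad guess. The unit list and regex below are assumptions, not the real parser:

```python
import re

# conservative size/unit guess from an item name; multi-char units are
# listed before their prefixes ("fl oz" before "oz") so they win the match
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(fl oz|lbs?|oz|kg|g|ml|ct)\b")

def guess_size(name):
    """Return (value, unit) when a size is unambiguous, else (None, None)."""
    m = SIZE_RE.search(name.lower())
    if not m:
        return None, None
    value, unit = float(m.group(1)), m.group(2)
    return value, {"lbs": "lb"}.get(unit, unit)  # normalize plural form
```

Deterministic regex parsing like this is trivially rerunnable, which is half the acceptance criteria.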

evidence

  • commit:
  • tests:
  • date:

[ ] t1.4: generate observed-product layer from enriched items (2-3 commits)

acceptance criteria

  • distinct observed products are generated from enriched giant items
  • each observed product has a stable `observed_product_id`
  • observed products aggregate:

    • first seen / last seen
    • times seen
    • representative upc
    • representative image url
    • representative normalized name
  • outputs `products_observed.csv`

notes

  • observed product is retailer-facing, not yet canonical
  • likely key is some combo of retailer + upc + normalized name
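If the key really is retailer + upc + normalized name, a stable id falls out of hashing that combo. Key choice and id length are assumptions:

```python
import hashlib

def observed_product_id(retailer, upc, normalized_name):
    """Stable, rerunnable id: same inputs always hash to the same id."""
    key = "|".join([retailer, upc or "", normalized_name.lower().strip()])
    return hashlib.sha1(key.encode()).hexdigest()[:12]
```

Hashing (rather than autoincrementing) means regenerating `products_observed.csv` from scratch never reshuffles ids.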

evidence

  • commit:
  • tests:
  • date:

[ ] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

acceptance criteria

  • produce a review file containing observed products needing manual review
  • include enough context to review quickly:

    • raw names
    • parsed names
    • upc
    • image url
    • example prices
    • seen count
  • reviewed status can be stored and reused

notes

  • this is where human-in-the-loop starts
  • optimize for “approve once, remember forever”
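"Approve once, remember forever" reduces to filtering against a persisted set of reviewed ids. Column names here are assumptions taken from the context list above:

```python
REVIEW_COLUMNS = ["observed_product_id", "raw_names", "parsed_name", "upc",
                  "image_url", "example_prices", "seen_count", "status"]

def review_rows(observed, reviewed_ids):
    """Queue only products not already reviewed; reviewed_ids is loaded
    from the stored review file so decisions survive reruns."""
    return [
        {**{k: p.get(k, "") for k in REVIEW_COLUMNS[:-1]}, "status": "pending"}
        for p in observed
        if p["observed_product_id"] not in reviewed_ids
    ]
```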

evidence

  • commit:
  • tests:
  • date:

[ ] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

acceptance criteria

  • define and create `products_canonical.csv`
  • define and create `product_links.csv`
  • support linking one or more observed products to one canonical product
  • canonical product schema supports food-cost comparison fields such as:

    • product type
    • variant
    • size
    • measure type
    • normalized quantity basis

notes

  • this is the first cross-retailer abstraction layer
  • do not require llm assistance for v1

evidence

  • commit:
  • tests:
  • date:

[ ] t1.7: implement auto-link rules for easy matches (2-3 commits)

acceptance criteria

  • auto-link can match observed products to canonical products using deterministic rules
  • rules include at least:

    • exact upc
    • exact normalized name
    • exact size/unit match where available
  • low-confidence cases remain unlinked for review

notes

  • keep the rules conservative
  • false positives are worse than unresolved items
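The conservative-rules stance can be sketched as an ordered rule chain that falls through to "unlinked" whenever nothing matches exactly. Field names are assumptions:

```python
def auto_link(observed, canonicals):
    """Return a canonical_id on an exact match, else None (stays in review).
    Rules are ordered strongest-first; no fuzzy matching on purpose."""
    # rule 1: exact upc
    for c in canonicals:
        if observed.get("upc") and observed["upc"] == c.get("upc"):
            return c["canonical_id"]
    # rule 2: exact normalized name + exact size where available
    for c in canonicals:
        if (observed.get("normalized_name") == c.get("normalized_name")
                and observed.get("size") == c.get("size")):
            return c["canonical_id"]
    return None  # low confidence: leave unlinked rather than guess
```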

evidence

  • commit:
  • tests:
  • date:

[ ] t1.8: support costco raw ingest path (2-5 commits)

acceptance criteria

  • add a costco-specific raw ingest/export path
  • output costco line items into the same shared raw/enriched schema family
  • confirm at least one product class can exist as:

    • giant observed product
    • costco observed product
    • one shared canonical product

notes

  • this is the proof that the architecture generalizes
  • don't chase perfection before the second retailer lands

evidence

  • commit:
  • tests:
  • date:

[ ] t1.9: compute normalized comparison metrics (2-3 commits)

acceptance criteria

  • derive normalized comparison fields where possible:

    • price per lb
    • price per oz
    • price per each
    • price per count
  • metrics are attached at canonical or linked-observed level as appropriate
  • emit obvious nulls when basis is unknown rather than inventing values

notes

  • this is where “gala apples 5 lb bag vs other gala apples” becomes possible
  • units discipline matters a lot here
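The "obvious nulls" rule can be sketched with a single weight-normalized metric; the conversion table is an assumption and would grow per basis (each, count, volume) in the real task:

```python
# weight units normalized to oz; anything absent here is an unknown basis
TO_OZ = {"oz": 1.0, "lb": 16.0}

def price_per_oz(total_price, size_value, size_unit):
    """Derived per-oz price, or None when the basis is unknown --
    never invent a value."""
    if size_value is None or size_unit not in TO_OZ:
        return None
    return total_price / (size_value * TO_OZ[size_unit])
```

With this, an $8.00 "gala apples 5 lb bag" comes out at $0.10/oz, directly comparable to loose gala apples priced per lb.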

evidence

  • commit:
  • tests:
  • date:

[ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)

acceptance criteria

  • llm suggestions are generated only for unresolved observed products
  • llm outputs are stored as suggestions, not auto-applied truth
  • reviewer can approve/edit/reject suggestions
  • approved decisions are persisted into canonical/link files

notes

  • bounded assistant, not autonomous goblin
  • image urls may become useful here

evidence

  • commit:
  • tests:
  • date: