- [ ] t1.1: harden giant receipt fetch cli (2-4 commits)
- [ ] t1.2: define grocery data model and file layout (1-2 commits)
- [ ] t1.3: build giant parser/enricher from raw json (2-4 commits)
- [ ] t1.4: generate observed-product layer from enriched items (2-3 commits)
- [ ] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
- [ ] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
- [ ] t1.7: implement auto-link rules for easy matches (2-3 commits)
- [ ] t1.8: support costco raw ingest path (2-5 commits)
- [ ] t1.9: compute normalized comparison metrics (2-3 commits)
- [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
[ ] t1.1: harden giant receipt fetch cli (2-4 commits)
acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- script only fetches unseen orders
- script appends to `orders.csv` and `items.csv` without duplicating prior visits
- script prints a note that giant only exposes the most recent 50 visits
notes
- keep this giant-specific
- no canonical product logic here
- raw json archive remains source of truth
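the "only fetches unseen orders" criterion can be sketched as a pure dedupe step against the order ids already recorded in `orders.csv`. this is a hedged sketch: the `order_id` column name and row shape are assumptions, not the real schema, and the actual fetch (firefox cookies + `curl_cffi`) is out of scope here.

```python
import csv
import os

def load_seen_order_ids(path="orders.csv"):
    """Return the set of order ids already recorded, so reruns skip them.
    The `order_id` column name is an assumption about the csv schema."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row["order_id"] for row in csv.DictReader(f)}

def filter_unseen(fetched_orders, seen_ids):
    """Keep only orders whose id has not been appended before."""
    return [o for o in fetched_orders if o["order_id"] not in seen_ids]
```

appending only the `filter_unseen` result keeps `orders.csv` and `items.csv` free of duplicated prior visits even when giant re-serves the same recent 50.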
evidence
- commit:
- tests:
- date:
[ ] t1.2: define grocery data model and file layout (1-2 commits)
acceptance criteria
- decide and document the files/directories for:
  - retailer raw exports
  - enriched line items
  - observed products
  - canonical products
  - product links
- define stable column schemas for each file
- explicitly separate retailer-specific parsing from cross-retailer canonicalization
notes
- this is the guardrail task so we don’t make giant-specific hacks the system of record
- keep schema minimal but extensible
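one way to document the layout is a single declarative mapping that the other tasks import. every directory and column name below is a placeholder this task would decide, not a settled convention — only the filenames already named elsewhere in this plan are fixed points.

```python
# candidate file layout + minimal column schemas; names are placeholders
# pending the decisions this task exists to make
LAYOUT = {
    "raw":       {"dir": "raw/{retailer}/", "format": "json"},
    "enriched":  {"file": "items_enriched.csv",
                  "columns": ["retailer", "order_id", "raw_name",
                              "normalized_name", "upc", "size_value",
                              "size_unit", "unit_price"]},
    "observed":  {"file": "products_observed.csv",
                  "columns": ["observed_product_id", "retailer", "upc",
                              "normalized_name", "first_seen", "last_seen",
                              "times_seen"]},
    "canonical": {"file": "products_canonical.csv",
                  "columns": ["canonical_product_id", "product_type",
                              "variant", "size", "measure_type"]},
    "links":     {"file": "product_links.csv",
                  "columns": ["observed_product_id", "canonical_product_id",
                              "link_method", "confidence"]},
}
```

keeping retailer-specific parsing confined to the raw/enriched layers and canonicalization to the canonical/links layers is exactly the separation the acceptance criteria call for.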
evidence
- commit:
- tests:
- date:
[ ] t1.3: build giant parser/enricher from raw json (2-4 commits)
acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- preserves core raw values plus parsed fields such as:
  - normalized item name
  - image url
  - size value/unit guesses
  - pack/count guesses
  - fee/store-brand flags
  - per-unit/per-weight derived price where possible
- parser is deterministic and rerunnable
notes
- do not attempt canonical cross-store matching yet
- parser should preserve ambiguity rather than hallucinating precision
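the size-guess and derived-price pieces can be a couple of small deterministic functions. this is a sketch under assumptions — the real raw json fields aren't shown here — and it returns `None` where the basis is unknown, preserving ambiguity rather than inventing precision.

```python
import re

# rough size/unit guess from an item name; (None, None) when ambiguous
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(oz|lb|fl oz|ct|g|kg|l|ml)\b", re.I)

def guess_size(name):
    m = SIZE_RE.search(name)
    if not m:
        return None, None
    return float(m.group(1)), m.group(2).lower()

def derive_unit_price(total_price, size_value):
    """Price per size unit, or None when the basis is unknown."""
    if size_value in (None, 0):
        return None
    return round(total_price / size_value, 4)
```

because both functions are pure, rerunning the parser over the same raw json always yields the same `items_enriched.csv`.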
evidence
- commit:
- tests:
- date:
[ ] t1.4: generate observed-product layer from enriched items (2-3 commits)
acceptance criteria
- distinct observed products are generated from enriched giant items
- each observed product has a stable `observed_product_id`
- observed products aggregate:
  - first seen / last seen
  - times seen
  - representative upc
  - representative image url
  - representative normalized name
- outputs `products_observed.csv`
notes
- observed product is retailer-facing, not yet canonical
- likely key is some combo of retailer + upc + normalized name
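a stable `observed_product_id` can be derived by hashing the likely key combo (retailer + upc + normalized name), and aggregation is then a fold over enriched items. the item field names here are assumptions about the enriched schema.

```python
import hashlib

def observed_product_id(retailer, upc, normalized_name):
    """Deterministic id from retailer + upc + normalized name, so the same
    observed product keeps the same id across reruns."""
    key = "|".join([retailer, upc or "", normalized_name])
    return "obs_" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

def aggregate_observed(items):
    """Fold enriched line items into distinct observed products with
    first seen / last seen / times seen. Item field names are assumptions."""
    out = {}
    for it in items:
        oid = observed_product_id(it["retailer"], it.get("upc"), it["normalized_name"])
        p = out.setdefault(oid, {"observed_product_id": oid,
                                 "first_seen": it["date"],
                                 "last_seen": it["date"],
                                 "times_seen": 0})
        p["first_seen"] = min(p["first_seen"], it["date"])
        p["last_seen"] = max(p["last_seen"], it["date"])
        p["times_seen"] += 1
    return out
```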
evidence
- commit:
- tests:
- date:
[ ] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
acceptance criteria
- produce a review file containing observed products needing manual review
- include enough context to review quickly:
  - raw names
  - parsed names
  - upc
  - image url
  - example prices
  - seen count
- reviewed status can be stored and reused
notes
- this is where human-in-the-loop starts
- optimize for “approve once, remember forever”
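the queue itself is a filter: observed products that are unlinked (or linked below a confidence floor) and not already reviewed. the persisted `reviewed_ids` set is what makes "approve once, remember forever" work; the row shapes and the 0.9 floor are assumptions.

```python
def build_review_queue(observed, links, reviewed_ids, confidence_floor=0.9):
    """Observed products needing manual review: no sufficiently confident
    link, and not already marked reviewed. Rows are assumed to carry the
    context fields listed above (raw/parsed names, upc, image url, prices)."""
    linked_ok = {l["observed_product_id"] for l in links
                 if float(l.get("confidence", 0)) >= confidence_floor}
    return [p for p in observed
            if p["observed_product_id"] not in linked_ok
            and p["observed_product_id"] not in reviewed_ids]
```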
evidence
- commit:
- tests:
- date:
[ ] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
acceptance criteria
- define and create `products_canonical.csv`
- define and create `product_links.csv`
- support linking one or more observed products to one canonical product
- canonical product schema supports food-cost comparison fields such as:
  - product type
  - variant
  - size
  - measure type
  - normalized quantity basis
notes
- this is the first cross-retailer abstraction layer
- do not require llm assistance for v1
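a minimal in-memory sketch of the two layers: canonical rows carry the comparison fields, and link rows let one or more observed products point at the same canonical product. column names mirror the acceptance criteria but are placeholders until the schema is pinned down; no llm involved.

```python
import itertools

_counter = itertools.count(1)  # stand-in for real id assignment

def new_canonical(product_type, variant=None, size=None, measure_type=None):
    """One row for products_canonical.csv (column names are placeholders)."""
    return {"canonical_product_id": f"can_{next(_counter):06d}",
            "product_type": product_type, "variant": variant,
            "size": size, "measure_type": measure_type}

def link(observed_product_id, canonical, method="manual", confidence=1.0):
    """One observed→canonical row for product_links.csv; many observed
    products may link to the same canonical product."""
    return {"observed_product_id": observed_product_id,
            "canonical_product_id": canonical["canonical_product_id"],
            "link_method": method, "confidence": confidence}
```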
evidence
- commit:
- tests:
- date:
[ ] t1.7: implement auto-link rules for easy matches (2-3 commits)
acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- rules include at least:
  - exact upc
  - exact normalized name
  - exact size/unit match where available
- low-confidence cases remain unlinked for review
notes
- keep the rules conservative
- false positives are worse than unresolved items
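the conservative rule set can be sketched as a single deterministic function: exact upc wins, then exact normalized name but only when sizes agree (or neither side has one); anything else stays unresolved. the lookup-table shapes are assumptions.

```python
def auto_link(observed, canonicals_by_upc, canonicals_by_name):
    """Deterministic matching: exact upc, then exact normalized name with a
    size check. Returns a canonical id or None — unresolved beats a false
    positive."""
    upc = observed.get("upc")
    if upc and upc in canonicals_by_upc:
        return canonicals_by_upc[upc]["canonical_product_id"]
    cand = canonicals_by_name.get(observed.get("normalized_name"))
    if cand is None:
        return None
    # only accept a name match when sizes agree (or both are absent)
    if observed.get("size") == cand.get("size"):
        return cand["canonical_product_id"]
    return None
```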
evidence
- commit:
- tests:
- date:
[ ] t1.8: support costco raw ingest path (2-5 commits)
acceptance criteria
- add a costco-specific raw ingest/export path
- output costco line items into the same shared raw/enriched schema family
- confirm at least one product class can exist as:
  - giant observed product
  - costco observed product
  - one shared canonical product
notes
- this is the proof that the architecture generalizes
- don’t chase perfection before the second retailer lands
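the ingest side reduces to a mapper from whatever the costco export contains into the same enriched-item shape the giant path emits. every raw field name below is a guess at the costco export, flagged as such.

```python
def costco_to_enriched(raw_item):
    """Map a raw costco line item into the shared enriched-item shape.
    The raw field names (orderNumber, description, itemNumber, price) are
    guesses at the costco export, not confirmed."""
    name = raw_item.get("description", "")
    return {
        "retailer": "costco",
        "order_id": raw_item.get("orderNumber"),
        "raw_name": name,
        "normalized_name": name.strip().lower(),
        "upc": raw_item.get("itemNumber"),  # costco item number, not a true upc
        "unit_price": raw_item.get("price"),
    }
```

once costco items flow through this mapper, the observed-product and canonical layers need no costco-specific code — which is the generalization proof this task is after.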
evidence
- commit:
- tests:
- date:
[ ] t1.9: compute normalized comparison metrics (2-3 commits)
acceptance criteria
- derive normalized comparison fields where possible:
  - price per lb
  - price per oz
  - price per each
  - price per count
- metrics are attached at canonical or linked-observed level as appropriate
- emit obvious nulls when basis is unknown rather than inventing values
notes
- this is where “gala apples 5 lb bag vs other gala apples” becomes possible
- units discipline matters a lot here
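the units discipline can live in one small conversion table plus helpers that emit `None` (an obvious null) whenever the basis is unknown, rather than inventing a value. the conversion table here covers only a few weight/volume units as an illustration.

```python
# ounces per unit, for normalizing weight/volume prices onto one basis
OZ_PER = {"oz": 1.0, "lb": 16.0, "fl oz": 1.0}

def price_per_oz(total_price, size_value, size_unit):
    """Normalized price per oz, or None when the basis is unknown."""
    if size_value in (None, 0) or size_unit not in OZ_PER:
        return None
    return round(total_price / (size_value * OZ_PER[size_unit]), 4)

def price_per_each(total_price, count):
    """Normalized price per each/count, or None when count is unknown."""
    if count in (None, 0):
        return None
    return round(total_price / count, 4)
```

with this in place, "gala apples 5 lb bag" and loose gala apples priced per lb land on the same per-oz axis and become directly comparable.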
evidence
- commit:
- tests:
- date:
[ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
acceptance criteria
- llm suggestions are generated only for unresolved observed products
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files
notes
- bounded assistant, not autonomous goblin
- image urls may become useful here
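the "suggestions, not auto-applied truth" boundary can be enforced in the data shape itself: every suggestion carries a status, and only explicitly approved ones are ever turned into link rows. field names are placeholders; the actual llm call is out of scope.

```python
def make_suggestion(observed_product_id, suggested_canonical_id, rationale):
    """Store an llm suggestion as a pending record — never applied directly."""
    return {"observed_product_id": observed_product_id,
            "suggested_canonical_id": suggested_canonical_id,
            "rationale": rationale,
            "status": "pending"}  # reviewer moves this to approved/rejected

def apply_approved(suggestions):
    """Turn only approved suggestions into link rows; pending and rejected
    suggestions never touch the canonical/link files."""
    return [{"observed_product_id": s["observed_product_id"],
             "canonical_product_id": s["suggested_canonical_id"],
             "link_method": "llm_approved"}
            for s in suggestions if s["status"] == "approved"]
```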
evidence
- commit:
- tests:
- date: