Scrape-Giant Task Log
- [X] t1.1: harden giant receipt fetch cli (2-4 commits)
- [X] t1.2: define grocery data model and file layout (1-2 commits)
- [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
- [X] t1.4: generate observed-product layer from enriched items (2-3 commits)
- [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
- [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
- [X] t1.7: implement auto-link rules for easy matches (2-3 commits)
- [X] t1.8: support costco raw ingest path (2-5 commits)
- [X] t1.8.1: support costco parser/enricher path (2-4 commits)
- [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
- [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
- [X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)
- [X] t1.8.5: refactor costco scraper auth and UX with giant scraper
- [X] t1.8.6: add browser session helper (2-4 commits)
- [X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)
- [X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)
- [X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)
- [X] t1.12: simplify review process display
- [X] t1.13.1 pipeline accountability and stage visibility (1-2 commits)
- [X] t1.13.2 costco discount matching and net pricing in enrich_costco (2-3 commits)
- [X] t1.13.3 canonical cleanup and review-first product identity (3-4 commits)
- [X] t1.14: refactor retailer collection into the new data model (2-4 commits)
- [X] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)
- [X] t1.14.2: finalize filesystem and schema alignment for the refactor (2-4 commits)
- [ ] t1.14.3: retailer-specific Costco normalization cleanup (2-4 commits)
- [ ] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
- [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
[X] t1.1: harden giant receipt fetch cli (2-4 commits)
acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- script only fetches unseen orders
- script appends to `orders.csv` and `items.csv` without duplicating prior visits
- script prints a note that giant only exposes the most recent 50 visits
notes
- keep this giant-specific
- no canonical product logic here
- raw json archive remains source of truth
evidence
- commit: `d57b9cf` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14
[X] t1.2: define grocery data model and file layout (1-2 commits)
acceptance criteria
- decide and document the files/directories for:
- retailer raw exports
- enriched line items
- observed products
- canonical products
- product links
- define stable column schemas for each file
- explicitly separate retailer-specific parsing from cross-retailer canonicalization
notes
- this is the guardrail task so we don't make giant-specific hacks the system of record
- keep schema minimal but extensible
evidence
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15
[X] t1.3: build giant parser/enricher from raw json (2-4 commits)
acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- preserves core raw values plus parsed fields such as:
- normalized item name
- image url
- size value/unit guesses
- pack/count guesses
- fee/store-brand flags
- per-unit/per-weight derived price where possible
- parser is deterministic and rerunnable
notes
- do not attempt canonical cross-store matching yet
- parser should preserve ambiguity rather than hallucinating precision
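The parsing discipline above (preserve ambiguity, never hallucinate precision) can be sketched as a size-guess helper; `guess_size`, its regex, and the unit list are illustrative assumptions, not the enricher's actual code.

```python
import re

# Hypothetical size-guess step: extract a size value/unit pair from a raw
# item name, returning (None, None) when nothing matches deterministically.
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(fl oz|oz|lb|ct|pk|gal)\b", re.IGNORECASE)

def guess_size(raw_name: str):
    """Return (size_value, size_unit) or (None, None) when unknown."""
    m = SIZE_RE.search(raw_name)
    if not m:
        return None, None  # preserve ambiguity rather than guessing
    return float(m.group(1)), m.group(2).lower()

print(guess_size("GNT WHITE BREAD 20 OZ"))  # (20.0, 'oz')
print(guess_size("FRESH BANANA"))           # (None, None)
```

The important property is the deterministic fallthrough: any name the rules do not cover yields blanks, which keeps the parser rerunnable and honest.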
evidence
- commit: `14f2cc2` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16
[X] t1.4: generate observed-product layer from enriched items (2-3 commits)
acceptance criteria
- distinct observed products are generated from enriched giant items
- each observed product has a stable `observed_product_id`
- observed products aggregate:
- first seen / last seen
- times seen
- representative upc
- representative image url
- representative normalized name
- outputs `products_observed.csv`
notes
- observed product is retailer-facing, not yet canonical
- likely key is some combo of retailer + upc + normalized name
evidence
- commit: `dc39214` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
- date: 2026-03-16
[X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
acceptance criteria
- produce a review file containing observed products needing manual review
- include enough context to review quickly:
- raw names
- parsed names
- upc
- image url
- example prices
- seen count
- reviewed status can be stored and reused
notes
- this is where human-in-the-loop starts
- optimize for “approve once, remember forever”
evidence
- commit: `9b13ec3` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
- date: 2026-03-16
[X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
acceptance criteria
- define and create `products_canonical.csv`
- define and create `product_links.csv`
- support linking one or more observed products to one canonical product
- canonical product schema supports food-cost comparison fields such as:
- product type
- variant
- size
- measure type
- normalized quantity basis
notes
- this is the first cross-retailer abstraction layer
- do not require llm assistance for v1
evidence
- commit: `347cd44` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16
[X] t1.7: implement auto-link rules for easy matches (2-3 commits)
acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- rules include at least:
- exact upc
- exact normalized name
- exact size/unit match where available
- low-confidence cases remain unlinked for review
notes
- keep the rules conservative
- false positives are worse than unresolved items
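A minimal sketch of the conservative matching, with illustrative field names (`auto_link` is hypothetical); anything without exact deterministic evidence stays unlinked for review.

```python
# Hypothetical auto-link pass: link an observed product to a canonical only
# on exact deterministic evidence; otherwise leave it for manual review.
def auto_link(observed: dict, canonicals: list[dict]):
    """Return the matched canonical dict, or None to leave it for review."""
    for canon in canonicals:
        if observed.get("upc") and observed["upc"] == canon.get("upc"):
            return canon  # exact upc
        if observed.get("normalized_name") and (
            observed["normalized_name"] == canon.get("normalized_name")
            and observed.get("size_value") == canon.get("size_value")
            and observed.get("size_unit") == canon.get("size_unit")
        ):
            return canon  # exact normalized name + exact size/unit
    return None  # low confidence: stays in the review queue

canon = [{"upc": "0123", "normalized_name": "bananas",
          "size_value": None, "size_unit": None}]
print(auto_link({"upc": "0123"}, canon) is canon[0])       # True: exact upc
print(auto_link({"normalized_name": "apples"}, canon))     # None -> review
```

Returning `None` by default is the point: false positives poison the canonical layer, while an unlinked row just waits in review.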
evidence
- commit: `385a31c` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16
[X] t1.8: support costco raw ingest path (2-5 commits)
acceptance criteria
- add a costco-specific raw ingest/export path
- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- persist raw json under `costco_output/raw/` and flatten to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
- use costco-native identifiers, such as `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping
notes
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source
evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16
[X] t1.8.1: support costco parser/enricher path (2-4 commits)
acceptance criteria
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
- `itemDescription01` + `itemDescription02`
- `itemNumber` as `retailer_item_id`
- discount lines / negative rows
- common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values
notes
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant
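The listed size patterns can be sketched as ordered regex rules; the order matters (the `2/24 OZ` pack form must be tried before the plain `OZ` form), and treating `25#` as pounds is an assumption of this sketch, not a documented Costco rule.

```python
import re

# Illustrative Costco size patterns; rules are tried in order and anything
# unmatched stays blank (None) rather than being guessed.
PATTERNS = [
    (re.compile(r"(\d+)#$"), lambda m: (float(m.group(1)), "lb", None)),  # 25#  (assumed: # = lb)
    (re.compile(r"(\d+)/(\d+(?:\.\d+)?)\s*OZ$"),
     lambda m: (float(m.group(2)), "oz", int(m.group(1)))),               # 2/24 OZ
    (re.compile(r"(\d+(?:\.\d+)?)\s*OZ$"),
     lambda m: (float(m.group(1)), "oz", None)),                          # 48 OZ
    (re.compile(r"(\d+)-PACK$"), lambda m: (None, None, int(m.group(1)))),  # 6-PACK
]

def parse_costco_size(text: str):
    """Return (size_value, size_unit, pack_qty); Nones when unknown."""
    for pattern, extract in PATTERNS:
        m = pattern.search(text.upper())
        if m:
            return extract(m)
    return None, None, None

print(parse_costco_size("KS FLOUR 25#"))    # (25.0, 'lb', None)
print(parse_costco_size("SALSA 2/24 OZ"))   # (24.0, 'oz', 2)
```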
evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16
[X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
- confirm at least one product class can exist as:
- Giant observed product
- Costco observed product
- one shared canonical product
- document the exact example used for proof
notes
- keep this to one or two well-behaved product classes first
- apples, eggs, bananas, or flour are better than weird prepared foods
evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16
[X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
acceptance criteria
- add shared fields needed for non-upc retailers, including:
- `retailer_item_id`
- `is_discount_line`
- `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`
notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere
evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16
[X] t1.8.4: verify and correct costco receipt enumeration (1–2 commits)
acceptance criteria
- confirm graphql summary query returns all expected receipts
- compare `inWarehouse` count vs number of `receipts` returned
- widen or parameterize date window if necessary; website shows receipts in 3-month windows
- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
- emit warning when receipt counts mismatch
notes
- goal is to confirm we are enumerating all receipts before parsing
- do not expand schema or parser logic in this task
- keep changes limited to summary query handling and diagnostics
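The window handling above can be sketched as date chunking; approximating a 3-month window as 90 days is an assumption made for this illustration, not the scraper's actual arithmetic.

```python
from datetime import date, timedelta

# Sketch of 3-month window chunking: the site only serves receipts in
# 3-month windows, so a wider requested range is split into consecutive
# windows before issuing summary queries.
def three_month_windows(start: date, end: date):
    """Yield (window_start, window_end) pairs covering start..end."""
    cur = start
    while cur <= end:
        window_end = min(cur + timedelta(days=90), end)  # ~3 months
        yield cur, window_end
        cur = window_end + timedelta(days=1)

windows = list(three_month_windows(date(2026, 1, 1), date(2026, 6, 30)))
print(len(windows))  # 2
```

Each window's summary response can then be checked against its `inWarehouse` count, with a warning emitted on mismatch as the AC requires.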
evidence
- commit: `ac82fa6` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
- date: 2026-03-16
[X] t1.8.5: refactor costco scraper auth and UX with giant scraper
acceptance criteria
- remove manual auth env vars
- load costco cookies from firefox session
- require only logged-in browser
- replace start/end date flags with `--months-back`
- maintain same raw output structure
- ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone
notes
- align Costco acquisition ergonomics with the Giant scraper
- keep downstream Costco parsing and shared schemas unchanged
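The composite-key fix can be shown in miniature; the sample receipts below are fabricated for illustration.

```python
# Collision-safe lookup key: two receipts can share a transactionBarcode,
# so the key also includes transactionDateTime.
def receipt_key(receipt: dict) -> tuple:
    return (receipt["transactionBarcode"], receipt["transactionDateTime"])

summaries = [
    {"transactionBarcode": "ABC", "transactionDateTime": "2026-03-01T10:00"},
    {"transactionBarcode": "ABC", "transactionDateTime": "2026-03-08T17:30"},
]
summary_lookup = {receipt_key(r): r for r in summaries}
print(len(summary_lookup))  # 2: both receipts survive
```

Keyed on `transactionBarcode` alone, the second receipt would silently overwrite the first; the composite key keeps both.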
evidence
- commit: `c0054dc` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
- date: 2026-03-16
[X] t1.8.6: add browser session helper (2-4 commits)
acceptance criteria
- create a separate Python module/script that extracts firefox browser session data needed for giant and costco scrapers.
- support Firefox and Costco first, including:
- loading cookies via existing browser-cookie approach
- reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
- copying locked browser sqlite/db files to a temp location before reading when necessary
- expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
- keep retailer-specific parsing of extracted session data outside the low-level browser access layer
- structure the helper so Chromium-family browser support can be added later without changing scraper call sites
notes
- goal is to replace manual `.env` copying of volatile browser-derived auth data
- session bootstrap only, not full browser automation
- prefer one shared helper over retailer-specific ad hoc storage reads
- Firefox only; Chromium support later
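A minimal sketch of the copy-before-read behavior, assuming a hypothetical `read_locked_sqlite` helper; the real module's interface and queries may differ.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Firefox keeps its sqlite files (e.g. cookies.sqlite) locked while the
# browser runs, so the helper copies the db to a temp location and reads
# the copy instead of the live file.
def read_locked_sqlite(db_path: Path, query: str):
    with tempfile.TemporaryDirectory() as tmp:
        copy = Path(tmp) / db_path.name
        shutil.copy2(db_path, copy)  # read the copy, never the live db
        conn = sqlite3.connect(copy)
        try:
            return conn.execute(query).fetchall()
        finally:
            conn.close()
```

Retailer-specific interpretation of the extracted rows (e.g. turning a storage value into a Costco bearer header) stays outside this low-level layer, per the AC.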
evidence
- commit: `7789c2e` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
- date: 2026-03-16
[X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)
acceptance criteria
- make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
- keep `browser_session.py` limited to low-level browser data access only:
- firefox profile discovery
- cookie loading
- storage reads
- sqlite copy/read helpers
- remove or sharply reduce `retailer_sessions.py` so retailer-specific header extraction lives with the retailer scraper or in a very small retailer-specific helper
- make session bootstrap flow explicit and linear:
- load browser context
- extract costco auth values
- build request headers
- build requests session
- eliminate inconsistent/obsolete function signatures and dead call paths (e.g. mixed `build_session(…)` calling conventions, stale fallback branches, mismatched `build_headers(…)` args)
- add one focused bootstrap debug print showing whether cookies, authorization, client id, and client identifier were found
- preserve current working behavior where available; this is a refactor/clarification task, not a feature expansion task
notes
- goal is to restore concern separation and debuggability
- prefer obvious retailer-specific code over “generic” helpers that guess and obscure control flow
- browser access can stay shared; retailer auth mapping should be explicit
- no new heuristics in this task
evidence
- commit: `d7a0329` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
- date: 2026-03-16
[X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)
acceptance criteria
- produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
- each purchase row preserves:
- purchase date
- retailer
- order id
- raw item name
- normalized item name
- canonical item id when resolved
- quantity / unit
- line total
- store/location info where available
- derive normalized comparison fields where possible on enriched or observed product rows:
- `price_per_lb`
- `price_per_oz`
- `price_per_each`
- `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
- parsed size/unit
- receipt weight
- explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- support pivot-friendly analysis of purchase frequency and item cost over time
- document at least one Giant vs Costco comparison example using the normalized metrics
notes
- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage
- raw item name must be retained for audit/debugging
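The null-on-unknown-basis rule can be sketched for one metric; `price_per_oz` and its unit handling are illustrative, not the builder's actual code.

```python
# Hypothetical per-unit derivation: compute a metric only when a single
# unambiguous basis exists; otherwise emit null rather than guessing.
def price_per_oz(line_total: float, size_value, size_unit):
    """Return price per ounce, or None when the basis is unknown."""
    if size_unit == "oz" and size_value:
        return round(line_total / size_value, 4)
    if size_unit == "lb" and size_value:
        return round(line_total / (size_value * 16), 4)  # 16 oz per lb
    return None  # unknown, conflicting, or ambiguous basis -> null

print(price_per_oz(7.49, 48, "oz"))    # 0.156
print(price_per_oz(7.49, None, None))  # None
```

The basis used (parsed size, receipt weight, explicit count) would be recorded alongside the metric so a pivot table can filter on it.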
evidence
- commit: `be1bf63` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
- date: 2026-03-16
[X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)
acceptance criteria
- define the persistent files used to resolve unknown items, including:
- review queue
- canonical item catalog
- alias / mapping layer if separate
- specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
- define the manual resolution workflow, including:
- what the human edits
- what script is rerun afterward
- how resolved mappings are persisted for future runs
- ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
- document how raw item name, normalized item name, and canonical item id are all retained
notes
- goal is “approve once, reuse forever”
- keep the workflow simple and auditable
- manual review is fine; the important part is making it durable and rerunnable
evidence
- commit: `c7dad54` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified `combined_output/review_queue.csv`, `combined_output/review_resolutions.csv` workflow, and `combined_output/canonical_catalog.csv`
- date: 2026-03-16
[X] t1.12: simplify review process display
Clearly show current state separate from proposed future state.
acceptance criteria
- Display position in review queue, e.g., (1/22)
- Display compact header with observed_product under review, queue position, and canonical decision, e.g.: "Resolve [n] observed product group [name] and associated items to canonical_name [name]? (\n [n] matched items)"
- color-code outputs based on info, input/prompt, warning/error
- color action menu/requests for input differently from display text; do not color individual options separately
- "no canonical_name suggestions found" is informational, not a warning/error.
- update action menu `[x]exclude` to `e[x]clude`
- on each review item, display a list of all matched items to be linked, sorted by descending date:
- YYYY-mm-dd, price, raw item name, normalized item name, upc, retailer
- image URL, if exists
- Sample:
- on each review item, suggest (but do not auto-apply) up to 3 likely existing canonicals using deterministic rules, e.g.:
- exact normalized name match
- prefix/contains match on canonical name
- exact UPC
- Sample Entry:
Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]? 2 matched items:
[1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
[2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
2 canonical suggestions found: [1] BELL PEPPERS, PRODUCE [2] PEPPER, SPICES
- When link is selected, users should be able to select the number of the item in the list, e.g.:
Select the canonical_name to associate [n] items with:
[1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
[2] BTB CHICKEN | gcan_0201f0feb749
[3] LIME | gcan_02074d9e7359
- Add confirmation to link selection with instructions: "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok? actions: [Y]es [n]o [b]ack [s]kip [q]uit"
- reinforce project terminology such as raw_name, observed_name, canonical_name
evidence
- commit: `7b8141c`, `d39497c`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
- date: 2026-03-17
notes
- The key improvement was shifting the prompt from system metadata to reviewer intent: one observed_product, its matched retailer rows, and one canonical_name decision.
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
[X] t1.13.1 pipeline accountability and stage visibility (1-2 commits)
add simple accounting so we can see what survives or drops at each pipeline stage
AC
- emit counts for raw, enriched, combined/observed, review-queued, canonical-linked, and final purchase-log rows
- report unresolved and dropped item counts explicitly
- make it easy to verify that missing items were intentionally left in review rather than silently lost
- pm note: simple text/json/csv summary is sufficient; trust and visibility matter more than presentation
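The accounting idea can be sketched as a single summary computation; the stage names and the `stage_report` helper are illustrative, not the status script's actual code.

```python
# Hypothetical stage accounting: count rows per stage and flag the case
# where unresolved rows are missing from the review queue (silent loss).
def stage_report(counts: dict) -> dict:
    unresolved = counts["normalized"] - counts["canonical_linked"]
    report = dict(counts)
    report["unresolved_rows"] = unresolved
    # non-zero here means rows were lost, not just awaiting review
    report["unresolved_not_in_review_rows"] = unresolved - counts["review_queued"]
    return report

report = stage_report({"normalized": 100, "canonical_linked": 80, "review_queued": 20})
print(report["unresolved_not_in_review_rows"])  # 0 -> nothing silently lost
```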
evidence
- commit: `967e19e`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python report_pipeline_status.py`; verified `combined_output/pipeline_status.csv` and `combined_output/pipeline_status.json`
- date: 2026-03-17
notes
- Added a single explicit status script instead of threading counters through every pipeline step; this keeps the pipeline simple while still making row survival visible.
- The most useful check here is `unresolved_not_in_review_rows`; when it is non-zero, we know we have a real accounting bug rather than normal unresolved work.
[X] t1.13.2 costco discount matching and net pricing in enrich_costco (2-3 commits)
refactor costco enrichment so discount lines are matched to purchased items and net pricing is preserved
AC
- detect costco discount/coupon rows like `/<retailer_item_id>` and match them to purchased items within the same order
- preserve raw discount rows for auditability while also carrying matched discount values onto the purchased item row
- add explicit fields for discount-adjusted pricing, e.g. `matched_discount_amount` and `net_line_total` (or equivalent)
- preserve original raw receipt amounts (`line_total`) without overwriting them
- pm note: keep this retailer-specific and explicit; do not introduce generic discount heuristics
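The literal matching rule can be sketched as below, assuming a list of rows from a single order and illustrative field names (`match_discounts` is hypothetical).

```python
# Sketch of literal Costco discount matching: a discount row named
# "/<retailer_item_id>" is matched to the purchased row with that id in the
# same order; raw line_total is preserved and net_line_total is added.
def match_discounts(rows: list[dict]) -> list[dict]:
    items = {r["retailer_item_id"]: r
             for r in rows if not r["item_name"].startswith("/")}
    for r in rows:
        if r["item_name"].startswith("/"):
            target = items.get(r["item_name"][1:])
            if target:
                target["matched_discount_amount"] = r["line_total"]
                target["net_line_total"] = target["line_total"] + r["line_total"]
    return rows

order = [
    {"retailer_item_id": "12345", "item_name": "KS ALMONDS", "line_total": 12.99},
    {"retailer_item_id": "12345", "item_name": "/12345", "line_total": -3.00},
]
match_discounts(order)
print(round(order[0]["net_line_total"], 2))  # 9.99
```

Note that the discount row itself is untouched, so the raw adjustment remains auditable alongside the derived net price.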
evidence
- commit: `56a03bc`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified matched Costco discount rows now populate `matched_discount_amount` and `net_line_total` while preserving raw `line_total`
- date: 2026-03-17
notes
- Kept this retailer-specific and literal: only discount rows with `/<retailer_item_id>` are matched, and only within the same order.
- Raw discount rows are still preserved for auditability; the purchased row now carries the matched adjustment separately rather than overwriting the original amount.
[X] t1.13.3 canonical cleanup and review-first product identity (3-4 commits)
refactor canonical generation so product identity is cleaner, duplicate canonicals are reduced, and unresolved items stay in review instead of spawning junk canonicals
AC
- stop auto-creating new canonical products from weak normalized names alone; unresolved items remain in `review_queue.csv`
- canonical names are based on stable product identity rather than noisy observed titles
- packaging/count/size tokens are removed from canonical names when they belong in structured fields (`pack_qty`, `size_value`, `size_unit`)
- consolidate obvious duplicate canonicals (e.g. egg/lime cases) and ensure final outputs retain raw item name, normalized item name, and canonical item id
- pm note: prefer conservative canonical creation and a better manual review loop over aggressive auto-unification
evidence
- commit: `08e2a86`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified weaker exact-name cases now remain unresolved in `combined_output/review_queue.csv` and canonical names are cleaned before auto-catalog creation
- date: 2026-03-17
notes
- Removed weak exact-name auto-canonical creation so ambiguous products stay in review instead of generating junk canonicals.
- Canonical display names are now cleaned of obvious punctuation and packaging noise, but I kept the cleanup conservative rather than adding a broad fuzzy merge layer.
[X] t1.14: refactor retailer collection into the new data model (2-4 commits)
move Giant and Costco collection into the new collect structure and make both retailers emit the same collected schemas
Acceptance Criteria
- create retailer-specific collect scripts in the target naming pattern, e.g.:
- collect_giant_web.py
- collect_costco_web.py
- collected outputs conform to pm/data-model.org:
- data/<retailer-method>/raw/…
- data/<retailer-method>/collected_orders.csv
- data/<retailer-method>/collected_items.csv
- current Giant and Costco raw acquisition behavior is preserved during the move
- collected schemas preserve retailer truth and provenance:
- no interpretation beyond basic flattening
- raw_order_path/raw_history_path remain usable
- unknown values remain blank rather than guessed
- old paths should be removed or deprecated
- collect_* scripts do not depend on any normalize/review files or scripts
- pm note: this is a path/schema refactor, not a parsing rewrite
evidence
- commit: `48c6eaf`
- tests: `./venv/bin/python -m unittest tests.test_scraper tests.test_costco_pipeline tests.test_browser_session`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`
- datetime: 2026-03-18
notes
- Kept this as a path/schema move, not a parsing rewrite: the existing Giant and Costco collection behavior remains in place behind new `collect_*` entry points.
- Added lightweight deprecation nudges on the legacy `scrape_*` commands rather than removing them immediately, so the move is inspectable and low-risk.
- The main schema fix was on Giant collection, which was missing retailer/provenance/audit fields that Costco collection already carried.
[X] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)
make Giant and Costco emit the shared normalized line-item schema without introducing cross-retailer identity logic
Acceptance Criteria
- create retailer-specific normalize scripts in the target naming pattern, e.g.:
- normalize_giant_web.py
- normalize_costco_web.py
- normalized outputs conform to pm/data-model.org:
- data/<retailer-method>/normalized_items.csv
- one row per collected line item
- normalized_row_id is stable and present
- normalized_item_id is stable, present, and represents retailer-level identity reused across repeated purchase rows when deterministic retailer evidence is sufficient
- normalized_quantity and normalized_quantity_unit
- repeated rows for the same retailer product resolve to the same normalized_item_id only when supported by deterministic retailer evidence, e.g. exact upc, exact retailer_item_id, exact cleaned name + same size/pack
- normalization_basis is explicit
- Giant normalization preserves current useful parsing:
- normalized item name
- size/unit/pack parsing
- fee/store-brand flags
- derived price fields
- Costco normalization preserves current useful parsing:
- normalized item name
- size/unit/pack parsing
- explicit discount matching using retailer-specific logic
- matched_discount_amount and net_line_total
- both normalizers preserve raw retailer truth:
- line_total is never overwritten
- unknown values remain blank rather than guessed
- no cross-retailer identity assignment occurs in normalization
- normalize never uses fuzzy or semantic matching to assign normalized_item_id
- pm note: prefer explicit retailer-specific code paths over generic normalization helpers unless the duplication is truly mechanical
- pm note: normalization may resolve retailer-level identity, but not catalog identity
- pm note: normalized_item_id is the only retailer-level grouping identity; do not introduce observed_products or a second grouping artifact
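The deterministic-evidence rule can be sketched with a hypothetical `normalized_item_id` helper; the `nitem_` id scheme and evidence priority order here are illustrative, not the project's actual format.

```python
import hashlib

# Sketch of retailer-level identity: a stable normalized_item_id is derived
# from exact retailer evidence when available, and falls back to row-level
# identity (normalized_row_id) when the evidence is insufficient.
def normalized_item_id(row: dict) -> str:
    if row.get("upc"):
        evidence = f"upc:{row['upc']}"                 # exact upc
    elif row.get("retailer_item_id"):
        evidence = f"rid:{row['retailer_item_id']}"    # exact retailer item id
    else:
        # no deterministic evidence: do not collapse repeated rows
        evidence = f"row:{row['normalized_row_id']}"
    digest = hashlib.sha1(f"{row['retailer']}|{evidence}".encode()).hexdigest()
    return f"nitem_{digest[:12]}"

a = normalized_item_id({"retailer": "costco-web", "retailer_item_id": "12345"})
b = normalized_item_id({"retailer": "costco-web", "retailer_item_id": "12345"})
print(a == b)  # True: repeated purchases resolve to the same id
```

Because the retailer is part of the hashed evidence, identical ids never cross retailers, which keeps catalog identity out of normalization as the pm notes require.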
evidence
- commit: `9064de5`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python enrich_giant.py --help`; `./venv/bin/python enrich_costco.py --help`
- datetime: 2026-03-18
notes
- Kept the existing Giant and Costco parsing logic intact and added the new normalized schema fields in place, rather than rewriting the enrichers from scratch.
- `normalized_item_id` is always present, but it only collapses repeated rows when the evidence is strong; otherwise it falls back to row-level identity via `normalized_row_id`.
- Added `normalize_*` entry points for the new data-model layout while leaving the legacy `enrich_*` commands available during the transition.
[X] t1.14.2: finalize filesystem and schema alignment for the refactor (2-4 commits)
bring on-disk outputs fully into the target `data/` structure without changing retailer behavior
Acceptance Criteria
- retailer data directories conform to pm/data-model.org:
- `data/giant-web/raw/…`
- `data/giant-web/collected_orders.csv`
- `data/giant-web/collected_items.csv`
- `data/giant-web/normalized_items.csv`
- `data/costco-web/raw/…`
- `data/costco-web/collected_orders.csv`
- `data/costco-web/collected_items.csv`
- `data/costco-web/normalized_items.csv`
- review/combine outputs are moved or rewritten into the target review paths:
- `data/review/review_queue.csv`
- `data/review/product_links.csv`
- `data/review/review_resolutions.csv`
- `data/review/purchases.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
- old transitional output paths are either:
- removed from active script defaults, or
- left as explicit compatibility shims with clear deprecation notes
- no recollection is required if existing raw files and collected csvs can be moved/copied losslessly into the new structure
- no schema information is lost during the move:
- raw paths still resolve
- collected/normalized csvs still open with the expected headers
- README and task/docs reflect the final active paths
- pm note: prefer moving/adapting existing files over recollecting from retailers unless a real data loss or schema mismatch forces recollection
- pm note: this is a structure-alignment task, not a retailer parsing task
evidence
- commit: `d2e6f2a`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`; verified `data/giant-web/collected_orders.csv`, `data/giant-web/collected_items.csv`, `data/costco-web/collected_orders.csv`, `data/costco-web/collected_items.csv`, `data/catalog.csv`, and archived transitional review outputs under `data/review/archive/`
- datetime: 2026-03-20 10:04:15 EDT
notes
- No recollection was needed; existing raw and collected exports were adapted in place and moved into the target names.
- Updated the active script defaults to point at `data/…` so the code and on-disk layout now agree.
- Kept obviously obsolete review artifacts, but moved them under `data/review/archive/` instead of deleting them outright.
[ ] t1.14.3: retailer-specific Costco normalization cleanup (2-4 commits)
tighten Costco-specific normalization so normalized item names are cleaner and deterministic retailer grouping is less noisy
Acceptance Criteria
-
improve Costco item-name cleanup for obvious non-identity noise, such as:
- trailing slash fragments
- code tokens and receipt-format artifacts
- duplicated measurement fragments already captured in structured fields
-
preserve deterministic normalization rules only:
- exact retailer_item_id
- exact cleaned name + same size/pack when needed
- approved retailer alias
- no fuzzy or semantic matching
-
normalized Costco names improve on known bad examples, e.g.:
- `MANDARIN /` -> cleaner normalized item name
- `LIFE 6'TABLE … /` -> cleaner normalized item name
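a minimal sketch of the deterministic cleanup described above, covering the known bad examples; the function name and regexes are assumptions, not existing code, and the raw `item_name` is never mutated:

```python
import re

# Short, explicit blocklist of known receipt artifacts; no fuzzy rewriting.
TRAILING_SLASH = re.compile(r"\s*/\s*$")   # e.g. "MANDARIN /"
CODE_TOKEN = re.compile(r"\b\d{5,}\b")     # long numeric code tokens

def clean_costco_name(raw_name: str) -> str:
    """Derive a cleaned display name; the raw item_name column stays as-is."""
    name = TRAILING_SLASH.sub("", raw_name)
    name = CODE_TOKEN.sub("", name)
    return " ".join(name.split())          # collapse leftover whitespace

# clean_costco_name("MANDARIN /") -> "MANDARIN"
# clean_costco_name("ORG BANANAS 123456 /") -> "ORG BANANAS"
```

these two regexes are exactly the allowlist/blocklist shape the pm notes ask for; new artifacts get a new pattern plus a regression test, not a broader heuristic.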
-
cleanup does not overwrite retailer truth:
- raw `item_name` is unchanged
- parsed `size_value`, `size_unit`, `pack_qty`, and pricing fields remain intact
-
discount-row behavior remains correct:
- matched discount rows still populate `matched_discount_amount`
- `net_line_total` remains correct
- discount rows remain auditable
- add regression tests for the cleaned Costco examples and any new parsing rules
- pm note: keep this explicitly Costco-specific; do not introduce a generic cleanup framework
- pm note: prefer a short allowlist/blocklist of known receipt artifacts over broad heuristics
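the net-pricing invariant above can be stated as a one-line derivation, assuming `matched_discount_amount` is stored as a positive amount (the pipeline's actual sign convention may differ):

```python
from decimal import Decimal

def derive_net_line_total(line_total: Decimal,
                          matched_discount_amount: Decimal = Decimal("0")) -> Decimal:
    """line_total is raw retailer truth and is never mutated; net is derived."""
    return line_total - matched_discount_amount

# derive_net_line_total(Decimal("9.99"), Decimal("2.00")) -> Decimal("7.99")
```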
evidence
- commit:
- tests:
- datetime:
notes
[ ] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items
Acceptance Criteria
-
refactor review outputs to conform to pm/data-model.org:
- data/review/review_queue.csv
- data/review/product_links.csv
- data/catalog.csv
- data/purchases.csv
-
review logic uses normalized_item_id as the upstream retailer-level review identity:
- no dependency on observed_product_id
- no dependency on products_observed.csv
- one review/link decision applies to all purchase rows sharing the same normalized_item_id
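the one-decision-per-identity rule above amounts to grouping purchase rows by normalized_item_id before review; a minimal sketch, with illustrative field names and sample rows:

```python
from collections import defaultdict

def group_by_item_id(rows):
    """Collapse purchase rows into one review unit per normalized_item_id."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["normalized_item_id"]].append(row)
    return dict(groups)

rows = [
    {"normalized_item_id": "costco:12345", "item_name": "MANDARIN"},
    {"normalized_item_id": "costco:12345", "item_name": "MANDARIN"},
    {"normalized_item_id": "giant:999", "item_name": "BANANAS"},
]
# 3 purchase rows collapse into 2 review identities.
```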
-
product_links.csv stores review-approved links from normalized_item_id to catalog_id
- one row per approved normalized_item_id -> catalog_id mapping
-
catalog.csv entries are review-first and conservative:
- no auto-creation from weak normalized names alone
- names come from reviewed catalog naming, not raw retailer strings
- packaging/count is not embedded in catalog_name unless essential to identity
- catalog_name/product_type/category/brand/variant may be blank until reviewed; blank is preferred to guessed
-
purchases.csv remains pivot-ready and retains:
- raw item name
- normalized item name
- normalized_row_id (not for review)
- normalized_item_id
- catalog_id
- catalog fields
- raw line_total
- matched_discount_amount and net_line_total when present
- derived price fields and their bases
-
terminal review flow remains simple and usable:
- reviewer sees one grouped retailer item identity (normalized_item_id) with count and list of matches, not one prompt per purchase row; use existing pattern as a template
- link to existing catalog item
- create new catalog item
- exclude
- skip
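the four actions above map naturally to a small keymap in the terminal loop; keys and action names here are assumptions, not the existing CLI's bindings:

```python
ACTIONS = {
    "l": "link",      # link to existing catalog item
    "c": "create",    # create new catalog item
    "x": "exclude",
    "s": "skip",
}

def resolve_choice(raw: str) -> str:
    """Map a keypress to an action; anything unrecognized falls back to skip."""
    return ACTIONS.get(raw.strip().lower(), "skip")
```

defaulting unrecognized input to skip keeps the loop safe: a stray keypress never creates or links a catalog item.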
-
pipeline accounting remains valid after the refactor:
- unresolved items are visible
- missing items are not silently dropped
- pm note: prefer a better manual review loop over aggressive automatic grouping. initial manual data entry is expected, and should resolve over time
- pm note: keep review/combine auditable; each catalog link should be explainable from normalized rows and review state
evidence
- commit:
- tests:
- datetime:
notes
[ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
Acceptance Criteria
- llm suggestions are generated only for unresolved normalized retailer items
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files
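the stored-not-applied rule above implies a suggestion record that carries its own review status; a hypothetical shape, with field names as assumptions:

```python
from dataclasses import dataclass

@dataclass
class LinkSuggestion:
    """An LLM suggestion awaiting review; never auto-applied truth."""
    normalized_item_id: str
    suggested_catalog_id: str
    rationale: str
    status: str = "pending"   # pending -> approved / edited / rejected

suggestion = LinkSuggestion("costco:12345", "cat:000042", "name and size match")
```

only rows that a reviewer moves to approved would ever be persisted into the link files.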
evidence
- commit:
- tests:
- datetime:
notes
- bounded assistant, not autonomous goblin
- image urls may become useful here