#+title: Scrape-Giant Task Log
#+STARTUP: overview
* [X] t1.1: harden giant receipt fetch cli (2-4 commits)
** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- script only fetches unseen orders
- script appends to `orders.csv` and `items.csv` without duplicating prior visits
- script prints a note that giant only exposes the most recent 50 visits

** notes
- keep this giant-specific
- no canonical product logic here
- raw json archive remains source of truth
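
A minimal sketch of the "only fetch unseen orders" and "append without duplicating" criteria. The `order_id` column name and both helper names are assumptions for illustration; they are not taken from `scraper.py`.

```python
import csv
from pathlib import Path


def load_seen_keys(csv_path, key_fields=("order_id",)):
    """Return the set of key tuples already stored in the CSV (empty if the file is absent)."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {tuple(row.get(f, "") for f in key_fields) for row in csv.DictReader(handle)}


def append_unseen(csv_path, rows, fieldnames, key_fields=("order_id",)):
    """Append only rows whose key is not yet in the file; return how many were written."""
    path = Path(csv_path)
    seen = load_seen_keys(path, key_fields)
    fresh = []
    for row in rows:
        key = tuple(row.get(f, "") for f in key_fields)
        if key not in seen:
            seen.add(key)
            fresh.append(row)
    # Write the header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(fresh)
    return len(fresh)
```

For `items.csv` the key would need more than the order id, e.g. a hypothetical `key_fields=("order_id", "line_no")`, since one order carries many item rows.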

** evidence
- commit: `d57b9cf` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14

* [X] t1.2: define grocery data model and file layout (1-2 commits)
** acceptance criteria
- decide and document the files/directories for:
  - retailer raw exports
  - enriched line items
  - observed products
  - canonical products
  - product links
- define stable column schemas for each file
- explicitly separate retailer-specific parsing from cross-retailer canonicalization

** notes
- this is the guardrail task so we don't make giant-specific hacks the system of record
- keep schema minimal but extensible

** evidence
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15

* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
** acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- preserves core raw values plus parsed fields such as:
  - normalized item name
  - image url
  - size value/unit guesses
  - pack/count guesses
  - fee/store-brand flags
  - per-unit/per-weight derived price where possible
- parser is deterministic and rerunnable

** notes
- do not attempt canonical cross-store matching yet
- parser should preserve ambiguity rather than hallucinating precision

** evidence
- commit: `14f2cc2` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16

* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)

** acceptance criteria
- distinct observed products are generated from enriched giant items
- each observed product has a stable `observed_product_id`
- observed products aggregate:
  - first seen / last seen
  - times seen
  - representative upc
  - representative image url
  - representative normalized name
- outputs `products_observed.csv`

** notes
- observed product is retailer-facing, not yet canonical
- likely key is some combo of retailer + upc + normalized name
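
One way to get a stable `observed_product_id` from the "retailer + upc + normalized name" key mentioned above, as a sketch only. The `gobs_` prefix, the 12-character digest length, and the function name are assumptions, not the project's actual scheme.

```python
import hashlib


def observed_product_id(retailer, upc, normalized_name):
    """Derive a deterministic id from retailer + upc + normalized name."""
    parts = [
        retailer.strip().lower(),
        (upc or "").strip(),            # upc may be blank for some rows
        normalized_name.strip().upper(),
    ]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return f"gobs_{digest[:12]}"        # hypothetical prefix and length
```

Because the id is a pure function of the key fields, rerunning the builder on the same enriched rows yields the same ids, which keeps review decisions attachable across runs.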

** evidence
- commit: `dc39214` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
- date: 2026-03-16

* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

** acceptance criteria
- produce a review file containing observed products needing manual review
- include enough context to review quickly:
  - raw names
  - parsed names
  - upc
  - image url
  - example prices
  - seen count
- reviewed status can be stored and reused

** notes
- this is where human-in-the-loop starts
- optimize for “approve once, remember forever”

** evidence
- commit: `9b13ec3` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
- date: 2026-03-16

* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

** acceptance criteria
- define and create `products_canonical.csv`
- define and create `product_links.csv`
- support linking one or more observed products to one canonical product
- canonical product schema supports food-cost comparison fields such as:
  - product type
  - variant
  - size
  - measure type
  - normalized quantity basis

** notes
- this is the first cross-retailer abstraction layer
- do not require llm assistance for v1

** evidence
- commit: `347cd44` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)

** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- rules include at least:
  - exact upc
  - exact normalized name
  - exact size/unit match where available
- low-confidence cases remain unlinked for review

** notes
- keep the rules conservative
- false positives are worse than unresolved items
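
The conservative rule order above could look roughly like this. It is a sketch under assumptions: the column names (`canonical_product_id`, `normalized_name`, `size_value`, `size_unit`) and the rule labels are illustrative, and an ambiguous match is deliberately left unlinked.

```python
def auto_link(observed, canonicals):
    """Return (canonical_id, rule) for a single confident match, else (None, None)."""
    upc = observed.get("upc") or ""
    if upc:
        hits = [c for c in canonicals if c.get("upc") == upc]
        if len(hits) == 1:
            return hits[0]["canonical_product_id"], "exact_upc"

    name = observed.get("normalized_name") or ""
    size = (observed.get("size_value") or "", observed.get("size_unit") or "")
    if name:
        hits = [
            c for c in canonicals
            if c.get("normalized_name") == name
            and (c.get("size_value") or "", c.get("size_unit") or "") == size
        ]
        # More than one candidate is a low-confidence case: leave it for review.
        if len(hits) == 1:
            rule = "exact_name_size" if any(size) else "exact_name"
            return hits[0]["canonical_product_id"], rule
    return None, None
```

Returning the rule name alongside the id keeps each link auditable, which fits the "false positives are worse than unresolved items" note.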

** evidence
- commit: `385a31c` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.8: support costco raw ingest path (2-5 commits)

** acceptance criteria
- add a costco-specific raw ingest/export path
- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- persist raw json under `costco_output/raw/` and flatten into `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
- use costco-native identifiers: `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping

** notes
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16

* [X] t1.8.1: support costco parser/enricher path (2-4 commits)

** acceptance criteria
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
  - `itemDescription01` + `itemDescription02`
  - `itemNumber` as `retailer_item_id`
  - discount lines / negative rows
  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values

** notes
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant
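
A rough sketch of parsing the size patterns listed above (`25#`, `48 OZ`, `2/24 OZ`, `6-PACK`) while leaving unknowns blank. Reading `#` as pounds and `2/24 OZ` as "2 packs of 24 oz" are assumptions, as are the output field names; this is not the logic in `enrich_costco.py`.

```python
import re

PACK_OF_SIZE = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB)\b")  # 2/24 OZ
PACK_ONLY = re.compile(r"\b(\d+)\s*-\s*PACK\b")                           # 6-PACK
POUND_SIGN = re.compile(r"\b(\d+(?:\.\d+)?)\s*#")                         # 25#
SIZE_UNIT = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB)\b")                  # 48 OZ


def parse_size(description):
    """Return pack_qty / size_value / size_unit as strings; blank when not found."""
    text = description.upper()
    parsed = {"pack_qty": "", "size_value": "", "size_unit": ""}
    if (m := PACK_OF_SIZE.search(text)):
        parsed.update(pack_qty=m.group(1), size_value=m.group(2), size_unit=m.group(3).lower())
        return parsed
    if (m := PACK_ONLY.search(text)):
        parsed["pack_qty"] = m.group(1)
    if (m := POUND_SIGN.search(text)):
        parsed.update(size_value=m.group(1), size_unit="lb")
    elif (m := SIZE_UNIT.search(text)):
        parsed.update(size_value=m.group(1), size_unit=m.group(2).lower())
    return parsed
```

Values stay as strings so a blank remains distinguishable from zero when the row is written back to CSV.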

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16
* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)

** acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
- confirm at least one product class can exist as:
  - Giant observed product
  - Costco observed product
  - one shared canonical product
- document the exact example used for proof

** notes
- keep this to one or two well-behaved product classes first
- apples, eggs, bananas, or flour are better than weird prepared foods

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16
* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)

** acceptance criteria
- add shared fields needed for non-upc retailers, including:
  - `retailer_item_id`
  - `is_discount_line`
  - `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`

** notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere

** evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16
* [X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)

** acceptance criteria
- confirm graphql summary query returns all expected receipts
- compare `inWarehouse` count vs number of `receipts` returned
- widen or parameterize date window if necessary; website shows receipts in 3-month windows
- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
- emit warning when receipt counts mismatch

** notes
- goal is to confirm we are enumerating all receipts before parsing
- do not expand schema or parser logic in this task
- keep changes limited to summary query handling and diagnostics
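
The 3-month window chunking and the count check could be sketched as follows. Function names and the warning text are assumptions; only `inWarehouse` and `receipts` come from the criteria above.

```python
import calendar
from datetime import date, timedelta


def shift_months(day, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    index = day.year * 12 + (day.month - 1) + months
    year, month_index = divmod(index, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(day.day, last_day))


def date_windows(start, end, months=3):
    """Split [start, end] into consecutive, non-overlapping windows of at most `months` months."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(shift_months(cursor, months) - timedelta(days=1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def count_mismatch_warning(in_warehouse_count, receipts):
    """Return a warning string when the reported count and returned receipts differ."""
    if in_warehouse_count != len(receipts):
        return f"warning: inWarehouse={in_warehouse_count} but {len(receipts)} receipts returned"
    return None
```

Each `(start, end)` pair would become one summary request, with the pair itself persisted as the request metadata.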

** evidence
- commit: `ac82fa6` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
- date: 2026-03-16
* [X] t1.8.5: refactor costco scraper auth and UX to match giant scraper

** acceptance criteria
- remove manual auth env vars
- load costco cookies from firefox session
- require only logged-in browser
- replace start/end date flags with --months-back
- maintain same raw output structure
- ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone

** notes
- align Costco acquisition ergonomics with the Giant scraper
- keep downstream Costco parsing and shared schemas unchanged
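
The composite-key idea for `summary_lookup` can be shown in a few lines. The helper names are hypothetical; the two field names are the ones named in the criteria.

```python
def receipt_key(receipt):
    """Composite key: barcode alone is assumed not to be unique across receipts."""
    return (receipt["transactionBarcode"], receipt["transactionDateTime"])


def build_summary_lookup(receipts):
    """Index summary receipts by composite key, failing loudly on a true collision."""
    lookup = {}
    for receipt in receipts:
        key = receipt_key(receipt)
        if key in lookup:
            raise ValueError(f"duplicate receipt key: {key}")
        lookup[key] = receipt
    return lookup
```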

** evidence
- commit: `c0054dc` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
- date: 2026-03-16
* [X] t1.8.6: add browser session helper (2-4 commits)

** acceptance criteria
- create a separate Python module/script that extracts firefox browser session data needed for giant and costco scrapers.
- support Firefox and Costco first, including:
  - loading cookies via existing browser-cookie approach
  - reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
  - copying locked browser sqlite/db files to a temp location before reading when necessary
- expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
- keep retailer-specific parsing of extracted session data outside the low-level browser access layer
- structure the helper so Chromium-family browser support can be added later without changing scraper call sites

** notes
- goal is to replace manual `.env` copying of volatile browser-derived auth data
- session bootstrap only, not full browser automation
- prefer one shared helper over retailer-specific ad hoc storage reads
- Firefox only; Chromium support later
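
The "copy locked sqlite files to a temp location before reading" step might look like this. It is a sketch, not the code in `browser_session.py`; the function name is hypothetical, and copying the `-wal` sidecar is an assumption about how a running browser may leave recent writes outside the main file.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def read_sqlite_copy(db_path, query, params=()):
    """Copy a possibly locked sqlite file to a temp dir, then query the copy."""
    source = Path(db_path)
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / source.name
        shutil.copy2(source, copy_path)
        # A write-ahead log may sit next to the database; copy it if present.
        wal = source.with_name(source.name + "-wal")
        if wal.exists():
            shutil.copy2(wal, copy_path.with_name(copy_path.name + "-wal"))
        connection = sqlite3.connect(copy_path)
        try:
            return connection.execute(query, params).fetchall()
        finally:
            connection.close()
```

Reading from a copy means the scraper never holds a lock on the live profile, so the browser can stay open while the session is bootstrapped.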

** evidence
- commit: `7789c2e` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
- date: 2026-03-16
* [X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)

** acceptance criteria
- make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
- keep `browser_session.py` limited to low-level browser data access only:
  - firefox profile discovery
  - cookie loading
  - storage reads
  - sqlite copy/read helpers
- remove or sharply reduce `retailer_sessions.py` so retailer-specific header extraction lives with the retailer scraper or in a very small retailer-specific helper
- make session bootstrap flow explicit and linear:
  - load browser context
  - extract costco auth values
  - build request headers
  - build requests session
- eliminate inconsistent/obsolete function signatures and dead call paths (e.g. mixed `build_session(...)` calling conventions, stale fallback branches, mismatched `build_headers(...)` args)
- add one focused bootstrap debug print showing whether cookies, authorization, client id, and client identifier were found
- preserve current working behavior where available; this is a refactor/clarification task, not a feature expansion task

** notes
- goal is to restore concern separation and debuggability
- prefer obvious retailer-specific code over “generic” helpers that guess and obscure control flow
- browser access can stay shared; retailer auth mapping should be explicit
- no new heuristics in this task

** evidence
- commit: `d7a0329` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
- date: 2026-03-16
* [X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)

** acceptance criteria
- produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
- each purchase row preserves:
  - purchase date
  - retailer
  - order id
  - raw item name
  - normalized item name
  - canonical item id when resolved
  - quantity / unit
  - line total
  - store/location info where available
- derive normalized comparison fields where possible on enriched or observed product rows:
  - `price_per_lb`
  - `price_per_oz`
  - `price_per_each`
  - `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
  - parsed size/unit
  - receipt weight
  - explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- support pivot-friendly analysis of purchase frequency and item cost over time
- document at least one Giant vs Costco comparison example using the normalized metrics

** notes
- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage
- raw item name must be retained for audit/debugging
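
A sketch of deriving `price_per_lb` and `price_per_oz` with an explicit basis, emitting nulls when the basis is unknown. The function name, the `price_basis` field, and the four-decimal rounding are assumptions; only weight-based units are handled here.

```python
from decimal import Decimal, InvalidOperation

OUNCES_PER_POUND = Decimal(16)
PRECISION = Decimal("0.0001")


def unit_prices(line_total, size_value, size_unit, pack_qty=""):
    """Return weight-based unit prices, or all-None when the basis is unknown."""
    blank = {"price_per_lb": None, "price_per_oz": None, "price_basis": None}
    try:
        total = Decimal(line_total)
        size = Decimal(size_value)
        packs = Decimal(pack_qty) if pack_qty else Decimal(1)
    except (InvalidOperation, TypeError, ValueError):
        return blank  # blank or malformed inputs: do not guess
    unit = (size_unit or "").lower()
    if unit == "lb":
        ounces = size * packs * OUNCES_PER_POUND
    elif unit == "oz":
        ounces = size * packs
    else:
        return blank  # non-weight or unknown unit
    if ounces <= 0:
        return blank
    per_oz = total / ounces
    return {
        "price_per_oz": per_oz.quantize(PRECISION),
        "price_per_lb": (per_oz * OUNCES_PER_POUND).quantize(PRECISION),
        "price_basis": "parsed_size",  # hypothetical label for the source basis
    }
```

For example, a 3 lb bag at 7.49 comes out at 2.4967 per lb, while a row with no parsed size yields `None` for every metric rather than a guess.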

** evidence
- commit: `be1bf63` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
- date: 2026-03-16

* [X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)

** acceptance criteria
- define the persistent files used to resolve unknown items, including:
  - review queue
  - canonical item catalog
  - alias / mapping layer if separate
- specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
- define the manual resolution workflow, including:
  - what the human edits
  - what script is rerun afterward
  - how resolved mappings are persisted for future runs
- ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
- document how raw item name, normalized item name, and canonical item id are all retained

** notes
- goal is “approve once, reuse forever”
- keep the workflow simple and auditable
- manual review is fine; the important part is making it durable and rerunnable

** evidence
- commit: `c7dad54` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified `combined_output/review_queue.csv`, `combined_output/review_resolutions.csv` workflow, and `combined_output/canonical_catalog.csv`
- date: 2026-03-16
* [X] t1.12: simplify review process display
Clearly show current state separate from proposed future state.
** acceptance criteria
1. Display position in review queue, e.g., (1/22)
2. Display compact header with observed_product under review, queue position, and canonical decision, e.g.: "Resolve [n] observed product group [name] and associated items to canonical_name [name]? (\n [n] matched items)"
3. color-code outputs based on info, input/prompt, warning/error
   1. color action menu/requests for input differently from display text; do not color individual options separately
   2. "no canonical_name suggestions found" is informational, not a warning/error.
4. update action menu `[x]exclude` to `e[x]clude`
5. on each review item, display a list of all matched items to be linked, sorted by descending date:
   1. YYYY-mm-dd, price, raw item name, normalized item name, upc, retailer
   2. image URL, if exists
   3. Sample:
6. on each review item, suggest (but do not auto-apply) up to 3 likely existing canonicals using deterministic rules, e.g.:
   1. exact normalized name match
   2. prefix/contains match on canonical name
   3. exact UPC
7. Sample Entry:
#+begin_comment
Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
2 matched items:
[1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
[2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
2 canonical suggestions found:
[1] BELL PEPPERS, PRODUCE
[2] PEPPER, SPICES
#+end_comment
8. When link is selected, users should be able to select the number of the item in the list, e.g.:
#+begin_comment
Select the canonical_name to associate [n] items with:
[1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
[2] BTB CHICKEN | gcan_0201f0feb749
[3] LIME | gcan_02074d9e7359
#+end_comment
9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok?"
   actions: [Y]es [n]o [b]ack [s]kip [q]uit

- reinforce project terminology such as raw_name, observed_name, canonical_name

** evidence
- commit: `7b8141c`, `d39497c`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
- date: 2026-03-17

** notes
- The key improvement was shifting the prompt from system metadata to reviewer intent: one observed_product, its matched retailer rows, and one canonical_name decision.
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.

* [X] t1.13.1 pipeline accountability and stage visibility (1-2 commits)
add simple accounting so we can see what survives or drops at each pipeline stage

** AC
1. emit counts for raw, enriched, combined/observed, review-queued, canonical-linked, and final purchase-log rows
2. report unresolved and dropped item counts explicitly
3. make it easy to verify that missing items were intentionally left in review rather than silently lost
- pm note: simple text/json/csv summary is sufficient; trust and visibility matter more than presentation

** evidence
- commit: `967e19e`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python report_pipeline_status.py`; verified `combined_output/pipeline_status.csv` and `combined_output/pipeline_status.json`
- date: 2026-03-17

** notes
- Added a single explicit status script instead of threading counters through every pipeline step; this keeps the pipeline simple while still making row survival visible.
- The most useful check here is `unresolved_not_in_review_rows`; when it is non-zero, we know we have a real accounting bug rather than normal unresolved work.
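
The `unresolved_not_in_review_rows` check described above amounts to a set difference. A minimal sketch, assuming purchase rows and review-queue rows share an `observed_product_id` column and that an unresolved row has a blank `canonical_product_id`; neither the column names nor the function name are confirmed by `report_pipeline_status.py`.

```python
def pipeline_status(purchases, review_queue):
    """Count unresolved purchase rows and flag any that are missing from the review queue."""
    unresolved = [row for row in purchases if not row.get("canonical_product_id")]
    queued_ids = {row["observed_product_id"] for row in review_queue}
    lost = [row for row in unresolved if row["observed_product_id"] not in queued_ids]
    return {
        "purchase_rows": len(purchases),
        "unresolved_rows": len(unresolved),
        "review_queue_rows": len(review_queue),
        # Non-zero here means rows were dropped silently rather than left for review.
        "unresolved_not_in_review_rows": len(lost),
    }
```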

* [X] t1.13.2 costco discount matching and net pricing in enrich_costco (2-3 commits)
refactor costco enrichment so discount lines are matched to purchased items and net pricing is preserved

** AC
1. detect costco discount/coupon rows like `/<retailer_item_id>` and match them to purchased items within the same order
2. preserve raw discount rows for auditability while also carrying matched discount values onto the purchased item row
3. add explicit fields for discount-adjusted pricing, e.g. `matched_discount_amount` and `net_line_total` (or equivalent)
4. preserve original raw receipt amounts (`line_total`) without overwriting them
- pm note: keep this retailer-specific and explicit; do not introduce generic discount heuristics

** evidence
- commit: `56a03bc`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified matched Costco discount rows now populate `matched_discount_amount` and `net_line_total` while preserving raw `line_total`
- date: 2026-03-17

** notes
- Kept this retailer-specific and literal: only discount rows with `/<retailer_item_id>` are matched, and only within the same order.
- Raw discount rows are still preserved for auditability; the purchased row now carries the matched adjustment separately rather than overwriting the original amount.
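
The literal `/<retailer_item_id>` matching described in these notes could be sketched like this. It assumes the slash pattern appears in an `item_name` column and that discount rows carry a negative `line_total`; both are assumptions about the row shape, and the function name is hypothetical.

```python
from decimal import Decimal


def apply_discounts(rows):
    """Carry `/<retailer_item_id>` discounts onto the purchased row in the same order."""
    purchases = {}
    for row in rows:
        is_discount = row["item_name"].startswith("/")
        row["matched_discount_amount"] = ""
        row["net_line_total"] = "" if is_discount else row["line_total"]
        if not is_discount:
            purchases.setdefault((row["order_id"], row["retailer_item_id"]), row)

    for row in rows:
        if not row["item_name"].startswith("/"):
            continue
        target = purchases.get((row["order_id"], row["item_name"][1:].strip()))
        if target is None:
            continue  # unmatched discount stays as a raw row only
        matched = Decimal(target["matched_discount_amount"] or "0") + Decimal(row["line_total"])
        target["matched_discount_amount"] = str(matched)
        # line_total is left untouched; the adjustment lives in separate fields.
        target["net_line_total"] = str(Decimal(target["line_total"]) + matched)
    return rows
```

Discount rows are returned alongside the purchased rows, so the raw receipt remains reconstructable from the output.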
* [X] t1.13.3 canonical cleanup and review-first product identity (3-4 commits)
refactor canonical generation so product identity is cleaner, duplicate canonicals are reduced, and unresolved items stay in review instead of spawning junk canonicals

** AC
1. stop auto-creating new canonical products from weak normalized names alone; unresolved items remain in `review_queue.csv`
2. canonical names are based on stable product identity rather than noisy observed titles
3. packaging/count/size tokens are removed from canonical names when they belong in structured fields (`pack_qty`, `size_value`, `size_unit`)
4. consolidate obvious duplicate canonicals (e.g. egg/lime cases) and ensure final outputs retain raw item name, normalized item name, and canonical item id
- pm note: prefer conservative canonical creation and a better manual review loop over aggressive auto-unification

** evidence
- commit: `08e2a86`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified weaker exact-name cases now remain unresolved in `combined_output/review_queue.csv` and canonical names are cleaned before auto-catalog creation
- date: 2026-03-17

** notes
- Removed weak exact-name auto-canonical creation so ambiguous products stay in review instead of generating junk canonicals.
- Canonical display names are now cleaned of obvious punctuation and packaging noise, but I kept the cleanup conservative rather than adding a broad fuzzy merge layer.
* [X] t1.14: refactor retailer collection into the new data model (2-4 commits)
move Giant and Costco collection into the new collect structure and make both retailers emit the same collected schemas

** Acceptance Criteria
1. create retailer-specific collect scripts in the target naming pattern, e.g.:
   - collect_giant_web.py
   - collect_costco_web.py
2. collected outputs conform to pm/data-model.org:
   - data/<retailer-method>/raw/...
   - data/<retailer-method>/collected_orders.csv
   - data/<retailer-method>/collected_items.csv
3. current Giant and Costco raw acquisition behavior is preserved during the move
4. collected schemas preserve retailer truth and provenance:
   - no interpretation beyond basic flattening
   - raw_order_path/raw_history_path remain usable
   - unknown values remain blank rather than guessed
5. old paths should be removed or deprecated
6. collect_* scripts do not depend on any normalize/review files or scripts
- pm note: this is a path/schema refactor, not a parsing rewrite

** evidence
- commit: `48c6eaf`
- tests: `./venv/bin/python -m unittest tests.test_scraper tests.test_costco_pipeline tests.test_browser_session`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`
- datetime: 2026-03-18

** notes
- Kept this as a path/schema move, not a parsing rewrite: the existing Giant and Costco collection behavior remains in place behind new `collect_*` entry points.
- Added lightweight deprecation nudges on the legacy `scrape_*` commands rather than removing them immediately, so the move is inspectable and low-risk.
- The main schema fix was on Giant collection, which was missing retailer/provenance/audit fields that Costco collection already carried.

* [X] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)
make Giant and Costco emit the shared normalized line-item schema without introducing cross-retailer identity logic

** Acceptance Criteria
1. create retailer-specific normalize scripts in the target naming pattern, e.g.:
   - normalize_giant_web.py
   - normalize_costco_web.py
2. normalized outputs conform to pm/data-model.org:
   - data/<retailer-method>/normalized_items.csv
   - one row per collected line item
   - normalized_row_id is stable and present
   - normalized_item_id is stable, present, and represents retailer-level identity reused across repeated purchase rows when deterministic retailer evidence is sufficient
   - normalized_quantity and normalized_quantity_unit
   - repeated rows for the same retailer product resolve to the same normalized_item_id only when supported by deterministic retailer evidence, e.g. exact upc, exact retailer_item_id, exact cleaned name + same size/pack
   - normalization_basis is explicit
3. Giant normalization preserves current useful parsing:
   - normalized item name
   - size/unit/pack parsing
   - fee/store-brand flags
   - derived price fields
4. Costco normalization preserves current useful parsing:
   - normalized item name
   - size/unit/pack parsing
   - explicit discount matching using retailer-specific logic
   - matched_discount_amount and net_line_total
5. both normalizers preserve raw retailer truth:
   - line_total is never overwritten
   - unknown values remain blank rather than guessed
6. no cross-retailer identity assignment occurs in normalization
7. normalize never uses fuzzy or semantic matching to assign normalized_item_id

- pm note: prefer explicit retailer-specific code paths over generic normalization helpers unless the duplication is truly mechanical
- pm note: normalization may resolve retailer-level identity, but not catalog identity
- pm note: normalized_item_id is the only retailer-level grouping identity; do not introduce observed_products or a second grouping artifact
** evidence
- commit: `9064de5`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python enrich_giant.py --help`; `./venv/bin/python enrich_costco.py --help`
- datetime: 2026-03-18

** notes
- Kept the existing Giant and Costco parsing logic intact and added the new normalized schema fields in place, rather than rewriting the enrichers from scratch.
- `normalized_item_id` is always present, but it only collapses repeated rows when the evidence is strong; otherwise it falls back to row-level identity via `normalized_row_id`.
- Added `normalize_*` entry points for the new data-model layout while leaving the legacy `enrich_*` commands available during the transition.
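
The "collapse only on strong evidence, else fall back to the row id" behaviour can be sketched as below. The `gnorm_` prefix, digest length, basis labels, and column names other than `normalized_row_id`, `upc`, and `retailer_item_id` are assumptions for illustration.

```python
import hashlib


def _digest(*parts):
    """Short deterministic digest of the identifying parts."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:12]


def assign_normalized_item_id(row):
    """Return (normalized_item_id, normalization_basis) using deterministic evidence only."""
    retailer = row["retailer"]
    if row.get("upc"):
        return f"gnorm_{_digest(retailer, 'upc', row['upc'])}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"gnorm_{_digest(retailer, 'item', row['retailer_item_id'])}", "exact_retailer_item_id"
    name = row.get("normalized_name") or ""
    size = row.get("size_value") or ""
    unit = row.get("size_unit") or ""
    pack = row.get("pack_qty") or ""
    if name and size and unit:
        return f"gnorm_{_digest(retailer, 'name', name, size, unit, pack)}", "exact_name_size_pack"
    # Weak evidence: keep the row distinct instead of guessing a grouping.
    return f"gnorm_{_digest(retailer, 'row', row['normalized_row_id'])}", "row_fallback"
```

The returned basis string is what would populate `normalization_basis`, so every grouping decision stays explainable from the row itself.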
|
||
* [X] t1.14.2: finalize filesystem and schema alignment for the refactor (2-4 commits)
|
||
bring on-disk outputs fully into the target `data/` structure without changing retailer behavior
|
||
|
||
** Acceptance Criteria
|
||
1. retailer data directories conform to pm/data-model.org:
- `data/giant-web/raw/...`
- `data/giant-web/collected_orders.csv`
- `data/giant-web/collected_items.csv`
- `data/giant-web/normalized_items.csv`
- `data/costco-web/raw/...`
- `data/costco-web/collected_orders.csv`
- `data/costco-web/collected_items.csv`
- `data/costco-web/normalized_items.csv`
2. review/combine outputs are moved or rewritten into the target review paths:
- `data/review/review_queue.csv`
- `data/review/product_links.csv`
- `data/review/review_resolutions.csv`
- `data/review/purchases.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
3. old transitional output paths are either:
- removed from active script defaults, or
- left as explicit compatibility shims with clear deprecation notes
4. no recollection is required if existing raw files and collected csvs can be moved/copied losslessly into the new structure
5. no schema information is lost during the move:
- raw paths still resolve
- collected/normalized csvs still open with the expected headers
6. README and task/docs reflect the final active paths
- pm note: prefer moving/adapting existing files over recollecting from retailers unless a real data loss or schema mismatch forces recollection
- pm note: this is a structure-alignment task, not a retailer parsing task

** evidence
- commit: `d2e6f2a`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`; verified `data/giant-web/collected_orders.csv`, `data/giant-web/collected_items.csv`, `data/costco-web/collected_orders.csv`, `data/costco-web/collected_items.csv`, `data/catalog.csv`, and archived transitional review outputs under `data/review/archive/`
- datetime: [2026-03-20 10:04:15 EDT]

** notes
- No recollection was needed; existing raw and collected exports were adapted in place and moved into the target names.
- Updated the active script defaults to point at `data/...` so the code and on-disk layout now agree.
- Kept obviously obsolete review artifacts, but moved them under `data/review/archive/` instead of deleting them outright.
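The header check from acceptance criterion 5 could be scripted along these lines. This is a sketch only: `missing_columns` is a hypothetical helper, the column names in the usage comment are placeholders, and the authoritative schemas live in `pm/data-model.org`.

```python
import csv
from pathlib import Path


def missing_columns(csv_path, expected_columns):
    """Return the expected columns that are absent from the CSV header."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        # An empty file has no header row, so every column counts as missing.
        header = next(csv.reader(handle), [])
    return [name for name in expected_columns if name not in header]


# Hypothetical usage; the column names are placeholders, not the real schema:
# missing_columns("data/giant-web/collected_items.csv", ["order_id", "item_name"])
```

An empty result means the moved file still opens and carries every expected column; anything else lists what the move lost.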

* [X] t1.14.3: retailer-specific Costco normalization cleanup (2-4 commits)
tighten Costco-specific normalization so normalized item names are cleaner and deterministic retailer grouping is less noisy

** Acceptance Criteria
1. improve Costco item-name cleanup for obvious non-identity noise, such as:
- trailing slash fragments
- code tokens and receipt-format artifacts
- duplicated measurement fragments already captured in structured fields
2. preserve deterministic normalization rules only:
- exact retailer_item_id
- exact cleaned name + same size/pack when needed
- approved retailer alias
- no fuzzy or semantic matching
3. normalized Costco names improve on known bad examples, e.g.:
- `MANDARIN /` -> cleaner normalized item name
- `LIFE 6'TABLE ... /` -> cleaner normalized item name
4. cleanup does not overwrite retailer truth:
- raw `item_name` is unchanged
- parsed `size_value`, `size_unit`, `pack_qty`, and pricing fields remain intact
5. discount-row behavior remains correct:
- matched discount rows still populate `matched_discount_amount`
- `net_line_total` remains correct
- discount rows remain auditable
6. add regression tests for the cleaned Costco examples and any new parsing rules
- pm note: keep this explicitly Costco-specific; do not introduce a generic cleanup framework
- pm note: prefer a short allowlist/blocklist of known receipt artifacts over broad heuristics

** evidence
- commit: `bcec6b3`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_costco_pipeline`; `./venv/bin/python normalize_costco_web.py`; verified live cleaned examples in `data/costco-web/normalized_items.csv`, including `MANDARINS 2.27 KG / 5 LBS -> MANDARIN` and `LIFE 6'TABLE MDL #80873U - T12/H3/P36 -> LIFE 6'TABLE MDL`
- datetime: 2026-03-20 11:09:32 EDT

** notes
- Kept this explicitly Costco-specific and narrow: the cleanup removes known logistics/code artifacts and orphan slash tokens without introducing fuzzy naming logic.
- The structured parsing still owns size/pack extraction, so name cleanup can safely strip dual-unit and logistics fragments after those fields are parsed.
- Discount-line behavior remains unchanged; this task only cleaned normalized names and preserved the existing audit trail.
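For illustration, a narrow blocklist-style cleanup in the spirit of the notes above might look like the sketch below. The regex patterns and the name `clean_costco_name` are assumptions made for this example; the rules actually shipped in `normalize_costco_web.py` may differ.

```python
import re

# Assumed blocklist patterns; the real Costco rules may differ.
MEASUREMENT = re.compile(r"\b\d+(?:\.\d+)?\s*(?:KG|LBS?|OZ|ML|G|L)\b")
CODE_TOKEN = re.compile(r"#\w+")  # e.g. "#80873U"
LOGISTICS = re.compile(r"\b[A-Z]\d+(?:/[A-Z]\d+)+\b")  # e.g. "T12/H3/P36"
ORPHANS = {"/", "-"}  # separators left behind once fragments are stripped


def clean_costco_name(item_name: str) -> str:
    """Return a cleaned copy of the name; the raw item_name is not modified."""
    cleaned = item_name
    for pattern in (MEASUREMENT, CODE_TOKEN, LOGISTICS):
        cleaned = pattern.sub(" ", cleaned)
    # Drop orphan separator tokens and collapse whitespace.
    tokens = [token for token in cleaned.split() if token not in ORPHANS]
    return " ".join(tokens)
```

With these assumed patterns, `MANDARIN /` reduces to `MANDARIN` and `LIFE 6'TABLE MDL #80873U - T12/H3/P36` reduces to `LIFE 6'TABLE MDL`. The sketch does not attempt the singularization visible in the logged `MANDARINS ... -> MANDARIN` example, and it assumes size/pack fields were parsed before the name is stripped.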

* [ ] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items

** Acceptance Criteria
1. refactor review outputs to conform to pm/data-model.org:
- data/review/review_queue.csv
- data/review/product_links.csv
- data/catalog.csv
- data/purchases.csv
2. review logic uses normalized_item_id as the upstream retailer-level review identity:
- no dependency on observed_product_id
- no dependency on products_observed.csv
- one review/link decision applies to all purchase rows sharing the same normalized_item_id
3. product_links.csv stores review-approved links from normalized_item_id to catalog_id
- one row per approved retailer-level identity to catalog mapping
4. catalog.csv entries are review-first and conservative:
- no auto-creation from weak normalized names alone
- names come from reviewed catalog naming, not raw retailer strings
- packaging/count is not embedded in catalog_name unless essential to identity
- catalog_name/product_type/category/brand/variant may be blank until reviewed; blank is preferred to guessed
5. purchases.csv remains pivot-ready and retains:
- raw item name
- normalized item name
- normalized_row_id (not for review)
- normalized_item_id
- catalog_id
- catalog fields
- raw line_total
- matched_discount_amount and net_line_total when present
- derived price fields and their bases
6. terminal review flow remains simple and usable:
- reviewer sees one grouped retailer item identity (normalized_item_id) with count and list of matches, not one prompt per purchase row; use existing pattern as a template
- link to existing catalog item
- create new catalog item
- exclude
- skip
7. pipeline accounting remains valid after the refactor:
- unresolved items are visible
- missing items are not silently dropped
8. pm note: prefer a better manual review loop over aggressive automatic grouping; initial manual data entry is expected and should taper off as reviewed links accumulate
9. pm note: keep review/combine auditable; each catalog link should be explainable from normalized rows and review state
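Criteria 2, 3, and 7 together imply a join keyed on normalized_item_id. A hedged sketch of that join follows; `apply_product_links` and the `review_status` column are hypothetical names used only for illustration, and the rows are plain dicts as `csv.DictReader` would yield them.

```python
def apply_product_links(purchase_rows, link_rows):
    """Attach catalog_id to every purchase row via its normalized_item_id."""
    # One approved link per normalized_item_id (assumed from criterion 3).
    links = {row["normalized_item_id"]: row["catalog_id"] for row in link_rows}
    resolved = []
    for row in purchase_rows:
        catalog_id = links.get(row["normalized_item_id"], "")
        resolved.append(
            {
                **row,
                "catalog_id": catalog_id,
                # Hypothetical status column: unresolved rows stay visible
                # instead of being dropped from the output.
                "review_status": "linked" if catalog_id else "unresolved",
            }
        )
    return resolved
```

Because the lookup is by normalized_item_id, a single review decision reaches every purchase row in that group, and the output always has as many rows as the input.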

** evidence
- commit:
- tests:
- datetime:

** notes

* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)

** acceptance criteria
- llm suggestions are generated only for unresolved normalized retailer items
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files

** notes
- bounded assistant, not autonomous goblin
- image urls may become useful here

** evidence
- commit:
- tests:
- date: