From de8ff535b846370a497619c7b93aa2c65c8a1d64 Mon Sep 17 00:00:00 2001
From: ben
Date: Tue, 24 Mar 2026 08:27:41 -0400
Subject: [PATCH] 1.18 cleanup and review

---
 pm/notes.org | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 pm/tasks.org |  5 ++--
 2 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/pm/notes.org b/pm/notes.org
index 9efa7da..b43a609 100644
--- a/pm/notes.org
+++ b/pm/notes.org
@@ -587,4 +587,68 @@ instead of
 [5] yellow onion, onion, produce (0 items, 0 rows)
 selection:

-*
+* data cleanup [2026-03-23 Mon]
+ok we're getting closer. still see some issues
+1. reorder purchases columns for display: catalog_name, product_type, category (makes data/troubleshooting way easier)
+2. shouldn't net_line_price never be empty? this allows cumulative cost comparison/analysis (we can see normalized price per X via effective_price, but shouldn't this be weighted against how much we bought? eg if we bought 5lb flour at $0.970/lb, this is weighted 1-to-1 with a 25lb purchase at $0.670/lb)
+3. some items missing entire categorizations? probably a result of me trying to do data cleanup. i found the orphaned values in the product_links table and removed them, but re-running review_products.py did not catch this...
+ shouldn't review_products compare each vendor's normalized_items to the existing review_queue?
+ RSET POTATO US 1
+ GREEK YOGURT DOM55
+ FDLY CHY VAN IC CRM
+ DUNKIN DONUT CANISTER ORIG BLND P=260
+ ICE CUBES
+ BLACK BEANS
+ KETCHUP SQUEEZE BTL
+ YELLOW_GOLD POTATO US 1
+ YELLOW_GOLD POTATO US 1
+ PINTO BEANS
+4. cleanup deprecated .py files
+5. Goals:
+ 1. When have I purchased this item, what did I pay, and how has the price changed over time?
+ - we're close, but missing units - eg AP flour shows a value that looks like price/lb but you just see $0.765
+ - doesn't seem like we've captured everything but that's just a gut feeling
+ 2. Visit breakdown as well as catalog/product/category? this certainly belongs in purchases.csv.
+ 3. Consider dash/plotly for better-than-excel tracking, since we're really only looking at a couple of graphs and filtering within certain values? (obv keep purchases as a user-friendly output)
+** 1. Cleanup purchases column order
+purchase_date
+retailer
+catalog_name
+product_type
+category
+net_line_total
+normalized_quantity
+effective_price
+effective_price_unit (new)
+order_id
+line_no
+raw_item_name
+normalized_item_name
+catalog_id
+normalized_item_id
+** 2. Populate and use purchases.net_line_total
+ net_line_total = line_total + matched_discount_amount
+ effective_price = net_line_total / normalized_quantity
+ weighted cost analysis uses net_line_total, not just avg effective_price
+** 3. Improve review robustness, enable norm_item re-review
+1. should regenerate candidates from:
+- normalized items with no valid catalog_id
+- normalized items whose linked catalog_id no longer exists
+- normalized items whose linked catalog row exists but is missing required fields (if you want completeness review)
+2. review_products.py should compare:
+- current normalized universe
+- current product_links
+- current catalog
+- current review_queue
+** 4. Remove deprecated .py files
+** 5. Improve Charts
+1. Histogram: add effective_price_unit to purchases.py
+2. Visits: plot by order_id to enable display of:
+ 1. spend by visit
+ 2. items per visit
+ 3. category spend by visit
+ 4. retailer/store breakdown
+
+* /
+
+
diff --git a/pm/tasks.org b/pm/tasks.org
index d8642bb..1dbb0a6 100644
--- a/pm/tasks.org
+++ b/pm/tasks.org
@@ -962,7 +962,7 @@ Costco 25# FLOUR not parsed into normalized weight - meaure_type says each
 - Costco `25#` weight text was falling through to `each` because the hash-size parser missed sizes followed by whitespace.
 - This fix is intentionally narrow: explicit `#`-weight parsing now feeds the existing quantity and effective-price flow without changing `normalized_item_id` behavior.

-* [x] t1.18.4: clean purchases output and finalize effective price fields (2-4 commits)
+* [X] t1.18.4: clean purchases output and finalize effective price fields (2-4 commits)
 make `purchases.csv` easier to inspect and ensure price fields support weighted cost analysis

 ** acceptance criteria
@@ -995,7 +995,7 @@ make `purchases.csv` easier to inspect and ensure price fields support weighted
 - `purchases.csv` now carries a filled `net_line_total` for every row, preserving existing values from normalization and deriving the rest from `line_total` plus matched discounts.
 - `effective_price_unit` now mirrors the normalized quantity basis, so downstream analysis can tell whether an `effective_price` is per `lb`, `oz`, `count`, or `each`.

-* [x] t1.19: make review_products.py robust to orphaned and incomplete catalog links (2-4 commits)
+* [X] t1.19: make review_products.py robust to orphaned and incomplete catalog links (2-4 commits)
 refresh review state from the current normalized universe so missing or broken links re-enter review instead of silently disappearing

 ** acceptance criteria
@@ -1048,7 +1048,6 @@ ensure purchases retains enough visit/order context to support spend-by-visit an

 ** notes

-
 * [ ] t1.21: add lightweight charting/analysis surface on top of purchases.csv (2-4 commits)
 build a minimal analysis layer for common price and visit charts without changing the csv pipeline
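
Reviewer sketch, not part of the patch: the notes.org hunk asks for cost analysis weighted by `net_line_total` rather than a plain average of `effective_price`. The Python below is one minimal way to read that, using the flour example from the notes (5 lb at $0.970/lb and 25 lb at $0.670/lb). `PurchaseRow`, `weighted_price`, and `simple_mean_price` are hypothetical names, not code from the repo. It also assumes that `matched_discount_amount` is stored as a negative number, which is what the `line_total + matched_discount_amount` formula implies.

```python
from dataclasses import dataclass


@dataclass
class PurchaseRow:
    """Hypothetical stand-in for one purchases.csv row (not the repo's schema)."""
    line_total: float
    matched_discount_amount: float  # assumption: discounts are negative values
    normalized_quantity: float
    effective_price_unit: str       # e.g. "lb", "oz", "count", "each"

    @property
    def net_line_total(self) -> float:
        # net_line_total = line_total + matched_discount_amount (from the notes)
        return self.line_total + self.matched_discount_amount

    @property
    def effective_price(self) -> float:
        # effective_price = net_line_total / normalized_quantity (from the notes)
        return self.net_line_total / self.normalized_quantity


def _check_same_unit(rows):
    """Refuse to mix rows priced on different quantity bases."""
    units = {r.effective_price_unit for r in rows}
    if len(units) != 1:
        raise ValueError(f"expected one unit, got {sorted(units)}")


def simple_mean_price(rows):
    """Unweighted average: every purchase counts 1-to-1 regardless of size."""
    _check_same_unit(rows)
    return sum(r.effective_price for r in rows) / len(rows)


def weighted_price(rows):
    """Quantity-weighted price: total net spend divided by total quantity."""
    _check_same_unit(rows)
    total_qty = sum(r.normalized_quantity for r in rows)
    if total_qty <= 0:
        raise ValueError("total normalized_quantity must be positive")
    return sum(r.net_line_total for r in rows) / total_qty


if __name__ == "__main__":
    flour = [
        PurchaseRow(4.85, 0.0, 5.0, "lb"),    # 5 lb at $0.970/lb
        PurchaseRow(16.75, 0.0, 25.0, "lb"),  # 25 lb at $0.670/lb
    ]
    print(f"{simple_mean_price(flour):.3f}")  # prints 0.820
    print(f"{weighted_price(flour):.3f}")     # prints 0.720
```

The gap between 0.820 and 0.720 per lb is the 1-to-1 weighting problem the notes describe. The unit check is there because `effective_price_unit` can differ between rows, and averaging across `lb` and `each` would be meaningless.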