From eddef7de2b3bb957125ed2790692bd9f134e418c Mon Sep 17 00:00:00 2001
From: ben <johnmosescarter@gmail.com>
Date: Tue, 17 Mar 2026 13:59:57 -0400
Subject: [PATCH] updated readme and prep for next phase

---
 README.md           | 25 ++++++++++---------------
 pm/scrape-giant.org | 15 +++++++++++++++
 pm/tasks.org        | 29 ++++++++++++++---------------
 3 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/README.md b/README.md
index d7803a1..f8e1692 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,17 @@
 # scrape-giant
 
-Small CLI pipeline for pulling purchase history from Giant and Costco, enriching line items, and building a reviewable cross-retailer purchase dataset.
+CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis.
 
-There is no one-shot runner yet. Today, you run the scripts step by step from the terminal.
+Run each script step-by-step from the terminal.
 
 ## What It Does
 
-- `scrape_giant.py`: download Giant orders and items
-- `enrich_giant.py`: normalize Giant line items
-- `scrape_costco.py`: download Costco orders and items
-- `enrich_costco.py`: normalize Costco line items
-- `build_purchases.py`: combine retailer outputs into one purchase table
-- `review_products.py`: review unresolved product matches in the terminal
+1. `scrape_giant.py`: download Giant orders and items
+2. `enrich_giant.py`: normalize Giant line items
+3. `scrape_costco.py`: download Costco orders and items
+4. `enrich_costco.py`: normalize Costco line items
+5. `build_purchases.py`: combine retailer outputs into one purchase table
+6. `review_products.py`: review unresolved product matches in the terminal
 
 ## Requirements
 
@@ -36,7 +36,6 @@ Current version works best with `.env` in the project root.  The scraper will pr
 GIANT_USER_ID=...
 GIANT_LOYALTY_NUMBER=...
 
-# Costco can use these if present, but it can also pull session values from Firefox.
 COSTCO_X_AUTHORIZATION=...
 COSTCO_X_WCS_CLIENTID=...
 COSTCO_CLIENT_IDENTIFIER=...
@@ -89,18 +88,14 @@ Combined:
 
 ## Review Workflow
 
-`review_products.py` is the manual cleanup step for unresolved or weakly unified items.
-
-In the terminal, you can:
+Run `review_products.py` to clean up unresolved or weakly unified items:
 - link an item to an existing canonical product
 - create a new canonical product
 - exclude an item
 - skip it for later
-
-Those decisions are saved and reused on later runs.
+Decisions are saved and reused on later runs.
 
 ## Notes
-
 - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
 - `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
 - `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
diff --git a/pm/scrape-giant.org b/pm/scrape-giant.org
index 0400459..770bccc 100644
--- a/pm/scrape-giant.org
+++ b/pm/scrape-giant.org
@@ -250,3 +250,18 @@ python build_observed_products.py
 python build_review_queue.py
 python build_canonical_layer.py
 python validate_cross_retailer_flow.py
+* t1.11 tasks [2026-03-17 Tue 13:49]
+ok i ran a few. time to run some cleanups here - i'm wondering if we shouldn't be less aggressive with canonical names and encourage a better manual process to start. 
+1. auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty
+2. canonical_names feel too specific, e.g., "5DZ egg"
+3. some canonical_names need consolidation, e.g., "LIME" and "LIME  . / ." ; possibly a cleanup issue. there are 5 entries for egg, but they are all regular large grade A white eggs, just different amounts in dozens.
+  Eggs are actually a great candidate for the kind of analysis we want to do - the pipeline should have caught and properly sorted these into size/qty:
+  ```canonical_product_id	canonical_name	category	product_type	brand	variant	size_value	size_unit	pack_qty	measure_type	notes	created_at	updated_at
+  gcan_0e350505fd22	5DZ EGG / /			KS					each	auto-linked via exact_name		
+  gcan_47279a80f5f3	EGG 5 DOZ. BBS								each	auto-linked via exact_name		
+  gcan_7d099130c1bf	LRG WHITE EGG			SB				30	count	auto-linked via exact_upc		
+  gcan_849c2817e667	GDA LRG WHITE EGG			SB				18	count	auto-linked via exact_upc		
+  gcan_cb0c6c8cf480	LG EGG CONVENTIONAL					18	count		count	auto-linked via exact_name_size		  ```
+4. Build costco mechanism for matching discount to line item.
+   1. Discounts appear as their own line items with a number like /123456; this number matches the UPC of the discounted item
+   2. must be date-matched to the UPC
diff --git a/pm/tasks.org b/pm/tasks.org
index 691040d..9e78e3c 100644
--- a/pm/tasks.org
+++ b/pm/tasks.org
@@ -386,24 +386,26 @@ Clearly show current state separate from proposed future state.
    3. exact UPC
 7. Sample Entry:
 #+begin_comment
-Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
-2 matched items:
- [1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
- [2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
-2 canonical suggestions found:
- [1] BELL PEPPERS, PRODUCE
- [2] PEPPER, SPICES
-- reinforce project terminology such as raw_name, observed_name, canonical_name   
+ Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
+ 2 matched items:
+  [1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
+  [2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
+ 2 canonical suggestions found:
+  [1] BELL PEPPERS, PRODUCE
+  [2] PEPPER, SPICES
 #+end_comment
 8. When link is selected, users should be able to select the number of the item in the list, e.g.:
 #+begin_comment
-Select the canonical_name to associate [n] items with:
-  [1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
-  [2] BTB CHICKEN         | gcan_0201f0feb749
-  [3] LIME                | gcan_02074d9e7359
+  Select the canonical_name to associate [n] items with:
+   [1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
+   [2] BTB CHICKEN         | gcan_0201f0feb749
+   [3] LIME                | gcan_02074d9e7359
 #+end_comment
 9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok?
      actions: [Y]es  [n]o  [b]ack  [s]kip  [q]uit
+
+- reinforce project terminology such as raw_name, observed_name, canonical_name   
+
 ** evidence
 - commit: `7b8141c`, `d39497c`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
@@ -414,9 +416,6 @@ Select the canonical_name to associate [n] items with:
 - Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
 - Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
 
-
-- resolve observed product group (group id)
-  to canonical name: 
 * [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
 
 ** acceptance criteria