From eddef7de2b3bb957125ed2790692bd9f134e418c Mon Sep 17 00:00:00 2001
From: ben
Date: Tue, 17 Mar 2026 13:59:57 -0400
Subject: [PATCH] updated readme and prep for next phase

---
 README.md           | 25 ++++++++++---------------
 pm/scrape-giant.org | 15 +++++++++++++++
 pm/tasks.org        | 29 ++++++++++++++---------------
 3 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/README.md b/README.md
index d7803a1..f8e1692 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,17 @@
 # scrape-giant
 
-Small CLI pipeline for pulling purchase history from Giant and Costco, enriching line items, and building a reviewable cross-retailer purchase dataset.
+CLI to pull purchase history from Giant and Costco websites and refine into a single product catalog for external analysis.
 
-There is no one-shot runner yet. Today, you run the scripts step by step from the terminal.
+Run each script step-by-step from the terminal.
 
 ## What It Does
 
-- `scrape_giant.py`: download Giant orders and items
-- `enrich_giant.py`: normalize Giant line items
-- `scrape_costco.py`: download Costco orders and items
-- `enrich_costco.py`: normalize Costco line items
-- `build_purchases.py`: combine retailer outputs into one purchase table
-- `review_products.py`: review unresolved product matches in the terminal
+1. `scrape_giant.py`: download Giant orders and items
+2. `enrich_giant.py`: normalize Giant line items
+3. `scrape_costco.py`: download Costco orders and items
+4. `enrich_costco.py`: normalize Costco line items
+5. `build_purchases.py`: combine retailer outputs into one purchase table
+6. `review_products.py`: review unresolved product matches in the terminal
 
 ## Requirements
 
@@ -36,7 +36,6 @@ Current version works best with `.env` in the project root. The scraper will pr
 GIANT_USER_ID=...
 GIANT_LOYALTY_NUMBER=...
 
-# Costco can use these if present, but it can also pull session values from Firefox.
 COSTCO_X_AUTHORIZATION=...
 COSTCO_X_WCS_CLIENTID=...
 COSTCO_CLIENT_IDENTIFIER=...
@@ -89,18 +88,14 @@ Combined:
 
 ## Review Workflow
 
-`review_products.py` is the manual cleanup step for unresolved or weakly unified items.
-
-In the terminal, you can:
+Run `review_products.py` to clean up unresolved or weakly unified items:
 - link an item to an existing canonical product
 - create a new canonical product
 - exclude an item
 - skip it for later
-
-Those decisions are saved and reused on later runs.
+Decisions are saved and reused on later runs.
 
 ## Notes
-
 - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
 - `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
 - `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
diff --git a/pm/scrape-giant.org b/pm/scrape-giant.org
index 0400459..770bccc 100644
--- a/pm/scrape-giant.org
+++ b/pm/scrape-giant.org
@@ -250,3 +250,18 @@ python build_observed_products.py
 python build_review_queue.py
 python build_canonical_layer.py
 python validate_cross_retailer_flow.py
+* t1.11 tasks [2026-03-17 Tue 13:49]
+ok i ran a few. time to run some cleanups here - i'm wondering if we shouldn't be less aggressive with canonical names and encourage a better manual process to start.
+1. auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty
+2. canonical_names feel too specific, e.g., "5DZ egg"
+3. some canonical_names need consolidation, eg "LIME" and "LIME . / ." ; poss cleanup issue. there are 5 entries for egg but they are all regular large grade A white eggs, just different amounts in dozens.
+ Eggs are actually a great candidate for the kind of analysis we want to do - the pipeline should have caught and properly sorted these into size/qty:
+ ```canonical_product_id canonical_name category product_type brand variant size_value size_unit pack_qty measure_type notes created_at updated_at
+ gcan_0e350505fd22 5DZ EGG / / KS each auto-linked via exact_name
+ gcan_47279a80f5f3 EGG 5 DOZ. BBS each auto-linked via exact_name
+ gcan_7d099130c1bf LRG WHITE EGG SB 30 count auto-linked via exact_upc
+ gcan_849c2817e667 GDA LRG WHITE EGG SB 18 count auto-linked via exact_upc
+ gcan_cb0c6c8cf480 LG EGG CONVENTIONAL 18 count count auto-linked via exact_name_size ```
+4. Build costco mechanism for matching discount to line item.
+ 1. Discounts appear as their own line items with a number like /123456, this matches the UPC of the discounted item
+ 2. must be date-matched to the UPC
diff --git a/pm/tasks.org b/pm/tasks.org
index 691040d..9e78e3c 100644
--- a/pm/tasks.org
+++ b/pm/tasks.org
@@ -386,24 +386,26 @@ Clearly show current state separate from proposed future state.
 3. exact UPC
 7. Sample Entry:
 #+begin_comment
-Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
-2 matched items:
- [1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
- [2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
-2 canonical suggestions found:
- [1] BELL PEPPERS, PRODUCE
- [2] PEPPER, SPICES
-- reinforce project terminology such as raw_name, observed_name, canonical_name
+ Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
+ 2 matched items:
+ [1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
+ [2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
+ 2 canonical suggestions found:
+ [1] BELL PEPPERS, PRODUCE
+ [2] PEPPER, SPICES
 #+end_comment
 8. When link is selected, users should be able to select the number of the item in the list, e.g.:
 #+begin_comment
-Select the canonical_name to associate [n] items with:
- [1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
- [2] BTB CHICKEN | gcan_0201f0feb749
- [3] LIME | gcan_02074d9e7359
+ Select the canonical_name to associate [n] items with:
+ [1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
+ [2] BTB CHICKEN | gcan_0201f0feb749
+ [3] LIME | gcan_02074d9e7359
 #+end_comment
 9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok? actions: [Y]es [n]o [b]ack [s]kip [q]uit
+
+- reinforce project terminology such as raw_name, observed_name, canonical_name
+
 ** evidence
 - commit: `7b8141c`, `d39497c`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
@@ -414,9 +416,6 @@ Select the canonical_name to associate [n] items with:
 - Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
 - Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
 
-
-- resolve observed product group (group id)
- to canonical name:
 * [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
 ** acceptance criteria