updated readme and prep for next phase
README.md
@@ -1,17 +1,17 @@
|
|||||||
# scrape-giant
|
# scrape-giant
|
||||||
|
|
||||||
Small CLI pipeline for pulling purchase history from Giant and Costco, enriching line items, and building a reviewable cross-retailer purchase dataset.
|
CLI to pull purchase history from Giant and Costco websites and refine into a single product catalog for external analysis.
|
||||||
|
|
||||||
There is no one-shot runner yet. Today, you run the scripts step by step from the terminal.
|
Run each script step-by-step from the terminal.
|
||||||
|
|
||||||
## What It Does
|
## What It Does
|
||||||
|
|
||||||
- `scrape_giant.py`: download Giant orders and items
|
1. `scrape_giant.py`: download Giant orders and items
|
||||||
- `enrich_giant.py`: normalize Giant line items
|
2. `enrich_giant.py`: normalize Giant line items
|
||||||
- `scrape_costco.py`: download Costco orders and items
|
3. `scrape_costco.py`: download Costco orders and items
|
||||||
- `enrich_costco.py`: normalize Costco line items
|
4. `enrich_costco.py`: normalize Costco line items
|
||||||
- `build_purchases.py`: combine retailer outputs into one purchase table
|
5. `build_purchases.py`: combine retailer outputs into one purchase table
|
||||||
- `review_products.py`: review unresolved product matches in the terminal
|
6. `review_products.py`: review unresolved product matches in the terminal
|
||||||
|
|
||||||
## Requirements
|
## Requirements
|
||||||
|
|
||||||
@@ -36,7 +36,6 @@ Current version works best with `.env` in the project root. The scraper will pr
|
|||||||
GIANT_USER_ID=...
|
GIANT_USER_ID=...
|
||||||
GIANT_LOYALTY_NUMBER=...
|
GIANT_LOYALTY_NUMBER=...
|
||||||
|
|
||||||
# Costco can use these if present, but it can also pull session values from Firefox.
|
|
||||||
COSTCO_X_AUTHORIZATION=...
|
COSTCO_X_AUTHORIZATION=...
|
||||||
COSTCO_X_WCS_CLIENTID=...
|
COSTCO_X_WCS_CLIENTID=...
|
||||||
COSTCO_CLIENT_IDENTIFIER=...
|
COSTCO_CLIENT_IDENTIFIER=...
|
||||||
@@ -89,18 +88,14 @@ Combined:
|
|||||||
|
|
||||||
## Review Workflow
|
## Review Workflow
|
||||||
|
|
||||||
`review_products.py` is the manual cleanup step for unresolved or weakly unified items.
|
Run `review_products.py` to clean up unresolved or weakly unified items:
|
||||||
|
|
||||||
In the terminal, you can:
|
|
||||||
- link an item to an existing canonical product
|
- link an item to an existing canonical product
|
||||||
- create a new canonical product
|
- create a new canonical product
|
||||||
- exclude an item
|
- exclude an item
|
||||||
- skip it for later
|
- skip it for later
|
||||||
|
Decisions are saved and reused on later runs.
|
||||||
Those decisions are saved and reused on later runs.
|
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
|
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
|
||||||
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
|
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
|
||||||
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
|
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
|
||||||
|
|||||||
@@ -250,3 +250,18 @@ python build_observed_products.py
|
|||||||
python build_review_queue.py
|
python build_review_queue.py
|
||||||
python build_canonical_layer.py
|
python build_canonical_layer.py
|
||||||
python validate_cross_retailer_flow.py
|
python validate_cross_retailer_flow.py
|
||||||
|
* t1.11 tasks [2026-03-17 Tue 13:49]
|
||||||
|
ok i ran a few. time to run some cleanups here - i'm wondering if we shouldn't be less aggressive with canonical names and encourage a better manual process to start.
|
||||||
|
1. auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty
|
||||||
|
2. canonical_names feel too specific, e.g., "5DZ egg"
|
||||||
|
3. some canonical_names need consolidation, e.g., "LIME" and "LIME . / ."; possibly a cleanup issue. there are 5 entries for egg, but they are all regular large grade A white eggs, just different amounts in dozens.
|
||||||
|
Eggs are actually a great candidate for the kind of analysis we want to do - the pipeline should have caught and properly sorted these into size/qty:
|
||||||
|
```
canonical_product_id canonical_name category product_type brand variant size_value size_unit pack_qty measure_type notes created_at updated_at
gcan_0e350505fd22 5DZ EGG / / KS each auto-linked via exact_name
gcan_47279a80f5f3 EGG 5 DOZ. BBS each auto-linked via exact_name
gcan_7d099130c1bf LRG WHITE EGG SB 30 count auto-linked via exact_upc
gcan_849c2817e667 GDA LRG WHITE EGG SB 18 count auto-linked via exact_upc
gcan_cb0c6c8cf480 LG EGG CONVENTIONAL 18 count count auto-linked via exact_name_size
```
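One way the egg rows above could be normalized to a common quantity is sketched below. This is a minimal sketch and not the pipeline's actual logic: `egg_count` and the `DOZEN` pattern are hypothetical names, and it assumes the numeric value in rows such as `LRG WHITE EGG SB 30 count` is already a count of eggs.

```python
import re

# Hypothetical pattern: a number followed by DZ, DOZ, or DOZEN (e.g. "5DZ", "5 DOZ.").
DOZEN = re.compile(r"(\d+)\s*(?:DOZEN|DOZ|DZ)\b", re.IGNORECASE)


def egg_count(name, count=None):
    """Return the number of eggs for a product row, or None if it cannot be determined.

    `count` is an assumed, already-parsed count value (e.g. 18 or 30);
    when it is missing, fall back to a dozen marker in the name.
    """
    if count is not None:
        return int(count)
    match = DOZEN.search(name)
    if match:
        return int(match.group(1)) * 12
    return None
```

With a shared egg count, the five rows could be grouped under one canonical egg product and compared per egg rather than per pack.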
|
||||||
|
4. Build a Costco mechanism for matching each discount to its line item.
|
||||||
|
   1. Discounts appear as their own line items with a number like /123456, which matches the UPC of the discounted item.
|
||||||
|
   2. Each discount must also be date-matched to the item with that UPC.
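The discount matching described in item 4 could be sketched roughly as follows. The row keys (`order_date`, `upc`, `description`, `amount`) and the function name `match_discounts` are assumptions for illustration, not the project's actual schema. The sketch also assumes the `/123456` reference appears at the start of the discount row's description and that a given UPC appears at most once per date.

```python
import re
from decimal import Decimal

# Assumed shape of a discount reference, e.g. "/123456".
DISCOUNT_REF = re.compile(r"^/\s*(\d+)")


def match_discounts(rows):
    """Attach each discount row to the item with the same UPC and order date.

    Returns (items, unmatched): items carry an added "discount" total,
    unmatched holds discount rows with no item on that date.
    """
    items = {}
    discounts = []
    for row in rows:
        ref = DISCOUNT_REF.match(row["description"])
        if ref:
            discounts.append((row, ref.group(1)))
        else:
            # Key on (date, UPC) so a discount only applies to a same-day purchase.
            items[(row["order_date"], row["upc"])] = dict(row, discount=Decimal("0"))

    unmatched = []
    for row, upc in discounts:
        item = items.get((row["order_date"], upc))
        if item is None:
            unmatched.append(row)
        else:
            item["discount"] += Decimal(row["amount"])
    return list(items.values()), unmatched
```

Keeping unmatched discounts in a separate list, rather than dropping them, leaves them available for manual review.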
|
||||||
|
|||||||
@@ -393,7 +393,6 @@ Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
|
|||||||
2 canonical suggestions found:
|
2 canonical suggestions found:
|
||||||
[1] BELL PEPPERS, PRODUCE
|
[1] BELL PEPPERS, PRODUCE
|
||||||
[2] PEPPER, SPICES
|
[2] PEPPER, SPICES
|
||||||
- reinforce project terminology such as raw_name, observed_name, canonical_name
|
|
||||||
#+end_comment
|
#+end_comment
|
||||||
8. When link is selected, users should be able to select the number of the item in the list, e.g.:
|
8. When link is selected, users should be able to select the number of the item in the list, e.g.:
|
||||||
#+begin_comment
|
#+begin_comment
|
||||||
@@ -404,6 +403,9 @@ Select the canonical_name to associate [n] items with:
|
|||||||
#+end_comment
|
#+end_comment
|
||||||
9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok?"
|
9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok?"
|
||||||
actions: [Y]es [n]o [b]ack [s]kip [q]uit
|
actions: [Y]es [n]o [b]ack [s]kip [q]uit
|
||||||
|
|
||||||
|
- reinforce project terminology such as raw_name, observed_name, canonical_name
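The confirmation step from item 9 could be sketched as below. This is illustrative only and not the code in `review_products.py`: `parse_action` and `confirm_link` are hypothetical names, and treating the capitalized `[Y]es` as the default on an empty reply is an assumption based on common CLI convention.

```python
ACTIONS = {"y": "yes", "n": "no", "b": "back", "s": "skip", "q": "quit"}


def parse_action(answer):
    """Map a raw reply to an action name; an empty reply is assumed to mean [Y]es."""
    key = answer.strip().lower()[:1] or "y"
    return ACTIONS.get(key)  # None for unrecognized input


def confirm_link(n, observed_name, canonical_name, ask=input):
    """Ask the user to confirm a link, re-prompting until the reply is recognized."""
    prompt = (
        f"{n} {observed_name} and future observed_name matches will be "
        f"associated with {canonical_name}, is this ok?\n"
        "actions: [Y]es [n]o [b]ack [s]kip [q]uit > "
    )
    while True:
        action = parse_action(ask(prompt))
        if action is not None:
            return action
```

Passing `ask` as a parameter keeps the prompt testable without a terminal, for example `confirm_link(3, "MIXED PEPPER", "BELL PEPPERS", ask=lambda _: "b")`.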
|
||||||
|
|
||||||
** evidence
|
** evidence
|
||||||
- commit: `7b8141c`, `d39497c`
|
- commit: `7b8141c`, `d39497c`
|
||||||
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
|
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
|
||||||
@@ -414,9 +416,6 @@ Select the canonical_name to associate [n] items with:
|
|||||||
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
|
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
|
||||||
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
|
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
|
||||||
|
|
||||||
|
|
||||||
- resolve observed product group (group id)
|
|
||||||
to canonical name:
|
|
||||||
* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
|
* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
|
||||||
|
|
||||||
** acceptance criteria
|
** acceptance criteria
|
||||||
|
|||||||