updated readme and prep for next phase

This commit is contained in:
ben
2026-03-17 13:59:57 -04:00
parent 83bc6c4a7c
commit eddef7de2b
3 changed files with 39 additions and 30 deletions

View File

@@ -250,3 +250,18 @@ python build_observed_products.py
python build_review_queue.py
python build_canonical_layer.py
python validate_cross_retailer_flow.py
* t1.11 tasks [2026-03-17 Tue 13:49]
ok i ran a few. time to run some cleanups here - i'm wondering if we shouldn't be less aggressive with canonical names and encourage a better manual process to start.
1. auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty
2. canonical_names feel too specific, e.g., "5DZ egg"
3. some canonical_names need consolidation, eg "LIME" and "LIME . / ." ; poss cleanup issue. there are 5 entries for ergg but but they are all regular large grade A white eggs, just different amounts in dozens.
Eggs are actually a great candidate for the kind of analysis we want to do - the pipeline should have caught and properly sorted these into size/qty:
```canonical_product_id canonical_name category product_type brand variant size_value size_unit pack_qty measure_type notes created_at updated_at
gcan_0e350505fd22 5DZ EGG / / KS each auto-linked via exact_name
gcan_47279a80f5f3 EGG 5 DOZ. BBS each auto-linked via exact_name
gcan_7d099130c1bf LRG WHITE EGG SB 30 count auto-linked via exact_upc
gcan_849c2817e667 GDA LRG WHITE EGG SB 18 count auto-linked via exact_upc
gcan_cb0c6c8cf480 LG EGG CONVENTIONAL 18 count count auto-linked via exact_name_size ```
4. Build costco mechanism for matching discount to line item.
1. Discounts appear as their own line items with a number like /123456, this matches the UPC of the discounted item
2. must be date-matched to the UPC