Finalize post-refactor layout and remove old pipeline files

ben
2026-03-24 17:09:57 -04:00
parent cdb7a15739
commit 09829b2b9d
17 changed files with 59 additions and 1154 deletions

@@ -110,8 +110,15 @@ data/
review/
review_queue.csv # Human review queue for unresolved matching/parsing cases.
product_links.csv # Links from normalized retailer items to catalog items.
-catalog.csv # Cross-retailer product catalog entities used for comparison.
-purchases.csv
+catalog.csv # Cross-retailer product catalog entities used for comparison.
+analysis/
+purchases.csv
+comparison_examples.csv
+item_price_over_time.csv
+spend_by_visit.csv
+items_per_visit.csv
+category_spend_over_time.csv
+retailer_store_breakdown.csv
#+end_example
Notes:
@@ -223,7 +230,7 @@ Notes:
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
-- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.
+- Cross-retailer identity is handled later in review/combine via `data/review/catalog.csv` and `product_links.csv`.
** `data/review/product_links.csv`
One row per review-approved link from a normalized retailer item to a catalog item.
@@ -263,7 +270,7 @@ One row per issue needing human review.
| `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
-** `data/catalog.csv`
+** `data/review/catalog.csv`
One row per cross-retailer catalog product.
| key | definition |
|----------------------------+----------------------------------------|
@@ -288,7 +295,7 @@ Notes:
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings.
-** `data/purchases.csv`
+** `data/analysis/purchases.csv`
One row per purchased item (i.e., `is_item`==true from normalized layer), with
catalog attributes denormalized in and discounts already applied.