Compare commits

33 Commits

| Author | SHA1 | Message | Date |
| ben | 74d17b0b0c | minor edit | 2026-03-24 17:28:16 -04:00 |
| ben | fea5132100 | minor edi | 2026-03-24 17:27:34 -04:00 |
| ben | eb3959ae0f | Record t1.22.1 task evidence | 2026-03-24 17:26:00 -04:00 |
| ben | 867275c67a | Trim requirements to direct runtime deps | 2026-03-24 17:25:52 -04:00 |
| ben | 6336c15da8 | Record t1.22 task evidence | 2026-03-24 17:10:09 -04:00 |
| ben | 09829b2b9d | Finalize post-refactor layout and remove old pipeline files | 2026-03-24 17:09:57 -04:00 |
| ben | cdb7a15739 | Record t1.21 task evidence | 2026-03-24 16:49:01 -04:00 |
| ben | 46a3b2c639 | Add purchase analysis summaries | 2026-03-24 16:48:53 -04:00 |
| ben | c35688c87f | Record t1.20 task evidence | 2026-03-24 08:29:31 -04:00 |
| ben | 6940f165fb | Document visit-level purchase analysis | 2026-03-24 08:29:26 -04:00 |
| ben | de8ff535b8 | 1.18 cleanup and review | 2026-03-24 08:27:41 -04:00 |
| ben | 02be6f52c0 | Record t1.19 task evidence | 2026-03-23 15:32:48 -04:00 |
| ben | 8ccf3ff43b | Reconcile review queue against current catalog state | 2026-03-23 15:32:41 -04:00 |
| ben | a93229408b | Record t1.18.4 task evidence | 2026-03-23 15:28:05 -04:00 |
| ben | a45522c110 | Finalize purchase effective price fields | 2026-03-23 15:27:58 -04:00 |
| ben | d78230f1c6 | Record t1.18.3 task evidence | 2026-03-23 13:56:56 -04:00 |
| ben | 73176117fe | Fix Costco hash-size weight parsing | 2026-03-23 13:56:47 -04:00 |
| ben | facebced9c | Record t1.18.2 task evidence | 2026-03-23 13:23:03 -04:00 |
| ben | 23dfc3de3e | Use picked weight for Giant quantity basis | 2026-03-23 13:22:56 -04:00 |
| ben | 3bc76ed243 | Record t1.18 and t1.18.1 evidence | 2026-03-23 12:54:09 -04:00 |
| ben | dc0d0614bb | Add effective price to purchases | 2026-03-23 12:53:54 -04:00 |
| ben | 605c94498b | Add effective price regression tests | 2026-03-23 12:52:41 -04:00 |
| ben | d4f479b0d8 | added effective_price and testing to id upstream data | 2026-03-23 12:35:27 -04:00 |
| ben | 38c2c2ea2e | Record t1.17 task evidence | 2026-03-21 21:50:16 -04:00 |
| ben | d25448b690 | Fix normalized quantity basis | 2026-03-21 21:50:10 -04:00 |
|  | db761adafc | added notes from first review session | 2026-03-21 20:53:22 -04:00 |
|  | e8e11e15b3 | added draft scope for review/search loop | 2026-03-21 09:48:34 -04:00 |
| ben | afadd0c0d0 | Restore skip and move search to find | 2026-03-20 13:35:07 -04:00 |
| ben | 2847d2d59f | Record t1.16.1 task evidence | 2026-03-20 13:32:27 -04:00 |
| ben | f93b9aa464 | Add catalog search to review flow | 2026-03-20 13:32:20 -04:00 |
| ben | 17158fb9e9 | Record t1.16 task evidence | 2026-03-20 12:45:57 -04:00 |
| ben | 975d44bebb | Tighten review prompt flow | 2026-03-20 12:45:38 -04:00 |
| ben | f478795b5d | added t1.16 to cleanup review process | 2026-03-20 12:42:23 -04:00 |
26 changed files with 2450 additions and 1424 deletions

@@ -6,19 +6,14 @@ Run each script step-by-step from the terminal.
 ## What It Does
-1. `scrape_giant.py`: download Giant orders and items
-2. `enrich_giant.py`: normalize Giant line items
-3. `scrape_costco.py`: download Costco orders and items
-4. `enrich_costco.py`: normalize Costco line items
+1. `collect_giant_web.py`: download Giant orders and items
+2. `normalize_giant_web.py`: normalize Giant line items
+3. `collect_costco_web.py`: download Costco orders and items
+4. `normalize_costco_web.py`: normalize Costco line items
 5. `build_purchases.py`: combine retailer outputs into one purchase table
 6. `review_products.py`: review unresolved product matches in the terminal
 7. `report_pipeline_status.py`: show how many rows survive each stage
-Active refactor entrypoints:
-- `collect_giant_web.py`
-- `collect_costco_web.py`
-- `normalize_giant_web.py`
-- `normalize_costco_web.py`
+8. `analyze_purchases.py`: write chart-ready analysis CSVs from the purchase table
 ## Requirements
@@ -64,13 +59,20 @@ data/
     collected_items.csv
     normalized_items.csv
   review/
+    catalog.csv
     review_queue.csv
     review_resolutions.csv
     product_links.csv
-    purchases.csv
     pipeline_status.csv
     pipeline_status.json
-  catalog.csv
+  analysis/
+    purchases.csv
+    comparison_examples.csv
+    item_price_over_time.csv
+    spend_by_visit.csv
+    items_per_visit.csv
+    category_spend_over_time.csv
+    retailer_store_breakdown.csv
 ```
 ## Run Order
@@ -87,6 +89,7 @@ python review_products.py
 python build_purchases.py
 python review_products.py --refresh-only
 python report_pipeline_status.py
+python analyze_purchases.py
 ```
 Why run `build_purchases.py` twice:
@@ -120,14 +123,32 @@ Costco:
 - `data/costco-web/normalized_items.csv` preserves raw totals and matched net discount fields
 Combined:
-- `data/review/purchases.csv`
+- `data/analysis/purchases.csv`
+- `data/analysis/comparison_examples.csv`
+- `data/analysis/item_price_over_time.csv`
+- `data/analysis/spend_by_visit.csv`
+- `data/analysis/items_per_visit.csv`
+- `data/analysis/category_spend_over_time.csv`
+- `data/analysis/retailer_store_breakdown.csv`
 - `data/review/review_queue.csv`
 - `data/review/review_resolutions.csv`
 - `data/review/product_links.csv`
-- `data/review/comparison_examples.csv`
 - `data/review/pipeline_status.csv`
 - `data/review/pipeline_status.json`
-- `data/catalog.csv`
+- `data/review/catalog.csv`
+`data/analysis/purchases.csv` is the main analysis artifact. It is designed to support both:
+- item-level price analysis
+- visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown
+The visit fields are carried directly in `purchases.csv`, so you can pivot on them without extra joins:
+- `order_id`
+- `purchase_date`
+- `retailer`
+- `store_name`
+- `store_number`
+- `store_city`
+- `store_state`
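As a rough illustration of pivoting on the visit fields without a join, the sketch below groups spend by any combination of them using only the standard library. The sample rows, store names, and the `spend_by` helper are made up for this example and are not part of the repository; only the column names come from the list above.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Made-up sample standing in for data/analysis/purchases.csv.
SAMPLE = """\
order_id,purchase_date,retailer,store_name,store_city,store_state,net_line_total
A1,2026-03-01,giant,Giant Food,Arlington,VA,4.50
A1,2026-03-01,giant,Giant Food,Arlington,VA,3.25
B7,2026-03-02,costco,Costco Wholesale,Fairfax,VA,19.99
"""


def spend_by(rows, *fields):
    """Sum net_line_total per unique combination of the given fields (hypothetical helper)."""
    totals = defaultdict(Decimal)  # Decimal() is Decimal("0")
    for row in rows:
        amount = (row.get("net_line_total") or "").strip()
        if not amount:
            continue  # skip rows without a usable total
        key = tuple(row.get(field, "") for field in fields)
        totals[key] += Decimal(amount)
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_visit = spend_by(rows, "retailer", "order_id")
by_state = spend_by(rows, "store_state")
print(by_visit)
print(by_state)
```

Because every purchase row already carries its store and order columns, switching the pivot is only a matter of passing different field names.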
## Review Workflow
@@ -144,9 +165,7 @@ The review step is intentionally conservative:
## Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py`, `scrape_costco.py`, `enrich_giant.py`, and `enrich_costco.py` are now legacy-compatible entrypoints; prefer the `collect_*` and `normalize_*` scripts for active work.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
## Test

analyze_purchases.py (new file, 271 lines)

@@ -0,0 +1,271 @@
from collections import defaultdict
from pathlib import Path

import click

from enrich_giant import format_decimal, to_decimal
from layer_helpers import read_csv_rows, write_csv_rows


ITEM_PRICE_FIELDS = [
    "purchase_date",
    "retailer",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "order_id",
    "catalog_id",
    "catalog_name",
    "category",
    "product_type",
    "effective_price",
    "effective_price_unit",
    "net_line_total",
    "normalized_quantity",
]

SPEND_BY_VISIT_FIELDS = [
    "purchase_date",
    "retailer",
    "order_id",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "visit_spend_total",
]

ITEMS_PER_VISIT_FIELDS = [
    "purchase_date",
    "retailer",
    "order_id",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "item_row_count",
    "distinct_catalog_count",
]

CATEGORY_SPEND_FIELDS = [
    "purchase_date",
    "retailer",
    "category",
    "category_spend_total",
]

RETAILER_STORE_FIELDS = [
    "retailer",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "visit_count",
    "item_row_count",
    "store_spend_total",
]


def effective_total(row):
    total = to_decimal(row.get("net_line_total"))
    if total is not None:
        return total
    return to_decimal(row.get("line_total"))


def is_item_row(row):
    return (
        row.get("is_fee") != "true"
        and row.get("is_discount_line") != "true"
        and row.get("is_coupon_line") != "true"
    )


def build_item_price_rows(purchase_rows):
    rows = []
    for row in purchase_rows:
        if not row.get("catalog_name") or not row.get("effective_price"):
            continue
        rows.append(
            {
                "purchase_date": row.get("purchase_date", ""),
                "retailer": row.get("retailer", ""),
                "store_name": row.get("store_name", ""),
                "store_number": row.get("store_number", ""),
                "store_city": row.get("store_city", ""),
                "store_state": row.get("store_state", ""),
                "order_id": row.get("order_id", ""),
                "catalog_id": row.get("catalog_id", ""),
                "catalog_name": row.get("catalog_name", ""),
                "category": row.get("category", ""),
                "product_type": row.get("product_type", ""),
                "effective_price": row.get("effective_price", ""),
                "effective_price_unit": row.get("effective_price_unit", ""),
                "net_line_total": row.get("net_line_total", ""),
                "normalized_quantity": row.get("normalized_quantity", ""),
            }
        )
    return rows


def build_spend_by_visit_rows(purchase_rows):
    grouped = defaultdict(lambda: {"total": to_decimal("0")})
    for row in purchase_rows:
        total = effective_total(row)
        if total is None:
            continue
        key = (
            row.get("purchase_date", ""),
            row.get("retailer", ""),
            row.get("order_id", ""),
            row.get("store_name", ""),
            row.get("store_number", ""),
            row.get("store_city", ""),
            row.get("store_state", ""),
        )
        grouped[key]["total"] += total

    rows = []
    for key, values in sorted(grouped.items()):
        rows.append(
            {
                "purchase_date": key[0],
                "retailer": key[1],
                "order_id": key[2],
                "store_name": key[3],
                "store_number": key[4],
                "store_city": key[5],
                "store_state": key[6],
                "visit_spend_total": format_decimal(values["total"]),
            }
        )
    return rows


def build_items_per_visit_rows(purchase_rows):
    grouped = defaultdict(lambda: {"item_rows": 0, "catalog_ids": set()})
    for row in purchase_rows:
        if not is_item_row(row):
            continue
        key = (
            row.get("purchase_date", ""),
            row.get("retailer", ""),
            row.get("order_id", ""),
            row.get("store_name", ""),
            row.get("store_number", ""),
            row.get("store_city", ""),
            row.get("store_state", ""),
        )
        grouped[key]["item_rows"] += 1
        if row.get("catalog_id"):
            grouped[key]["catalog_ids"].add(row["catalog_id"])

    rows = []
    for key, values in sorted(grouped.items()):
        rows.append(
            {
                "purchase_date": key[0],
                "retailer": key[1],
                "order_id": key[2],
                "store_name": key[3],
                "store_number": key[4],
                "store_city": key[5],
                "store_state": key[6],
                "item_row_count": str(values["item_rows"]),
                "distinct_catalog_count": str(len(values["catalog_ids"])),
            }
        )
    return rows


def build_category_spend_rows(purchase_rows):
    grouped = defaultdict(lambda: to_decimal("0"))
    for row in purchase_rows:
        category = row.get("category", "")
        total = effective_total(row)
        if not category or total is None:
            continue
        key = (
            row.get("purchase_date", ""),
            row.get("retailer", ""),
            category,
        )
        grouped[key] += total

    rows = []
    for key, total in sorted(grouped.items()):
        rows.append(
            {
                "purchase_date": key[0],
                "retailer": key[1],
                "category": key[2],
                "category_spend_total": format_decimal(total),
            }
        )
    return rows


def build_retailer_store_rows(purchase_rows):
    grouped = defaultdict(lambda: {"visit_ids": set(), "item_rows": 0, "total": to_decimal("0")})
    for row in purchase_rows:
        total = effective_total(row)
        key = (
            row.get("retailer", ""),
            row.get("store_name", ""),
            row.get("store_number", ""),
            row.get("store_city", ""),
            row.get("store_state", ""),
        )
        grouped[key]["visit_ids"].add((row.get("purchase_date", ""), row.get("order_id", "")))
        if is_item_row(row):
            grouped[key]["item_rows"] += 1
        if total is not None:
            grouped[key]["total"] += total

    rows = []
    for key, values in sorted(grouped.items()):
        rows.append(
            {
                "retailer": key[0],
                "store_name": key[1],
                "store_number": key[2],
                "store_city": key[3],
                "store_state": key[4],
                "visit_count": str(len(values["visit_ids"])),
                "item_row_count": str(values["item_rows"]),
                "store_spend_total": format_decimal(values["total"]),
            }
        )
    return rows


@click.command()
@click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--output-dir", default="data/analysis", show_default=True)
def main(purchases_csv, output_dir):
    purchase_rows = read_csv_rows(purchases_csv)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    item_price_rows = build_item_price_rows(purchase_rows)
    spend_by_visit_rows = build_spend_by_visit_rows(purchase_rows)
    items_per_visit_rows = build_items_per_visit_rows(purchase_rows)
    category_spend_rows = build_category_spend_rows(purchase_rows)
    retailer_store_rows = build_retailer_store_rows(purchase_rows)

    outputs = [
        ("item_price_over_time.csv", item_price_rows, ITEM_PRICE_FIELDS),
        ("spend_by_visit.csv", spend_by_visit_rows, SPEND_BY_VISIT_FIELDS),
        ("items_per_visit.csv", items_per_visit_rows, ITEMS_PER_VISIT_FIELDS),
        ("category_spend_over_time.csv", category_spend_rows, CATEGORY_SPEND_FIELDS),
        ("retailer_store_breakdown.csv", retailer_store_rows, RETAILER_STORE_FIELDS),
    ]
    for filename, rows, fieldnames in outputs:
        write_csv_rows(output_path / filename, rows, fieldnames)

    click.echo(f"wrote analysis outputs to {output_path}")


if __name__ == "__main__":
    main()
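
One detail of the file above that is easy to miss: spend totals include every row that has a usable total, while item counts skip fee, discount, and coupon rows. The following is a minimal self-contained sketch of that split on a single made-up visit. `summarize_visit` and the sample rows are hypothetical, and plain `Decimal` parsing stands in for the repository's `to_decimal` helper, whose exact behavior is not shown in this diff.

```python
from decimal import Decimal

FLAG_FIELDS = ("is_fee", "is_discount_line", "is_coupon_line")


def is_item_row(row):
    # Same rule as in the diff: any flag set to "true" disqualifies the row.
    return all(row.get(flag) != "true" for flag in FLAG_FIELDS)


def summarize_visit(rows):
    """Hypothetical helper: spend over all rows, counts over item rows only."""
    spend = Decimal("0")
    for row in rows:
        # Prefer net_line_total, fall back to line_total when it is blank.
        raw = row.get("net_line_total") or row.get("line_total")
        if raw:
            spend += Decimal(raw)
    item_rows = [row for row in rows if is_item_row(row)]
    catalog_ids = {row["catalog_id"] for row in item_rows if row.get("catalog_id")}
    return {
        "visit_spend_total": spend,
        "item_row_count": len(item_rows),
        "distinct_catalog_count": len(catalog_ids),
    }


visit = [
    {"catalog_id": "c1", "net_line_total": "3.00"},
    {"catalog_id": "c1", "net_line_total": "", "line_total": "2.50"},
    {"is_fee": "true", "line_total": "0.05"},
    {"is_coupon_line": "true", "net_line_total": "-1.00"},
]
summary = summarize_visit(visit)
print(summary)
```

The fee and the coupon change the visit's spend but do not inflate the item count, which matches how `build_spend_by_visit_rows` and `build_items_per_visit_rows` treat the same rows.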

@@ -1,220 +0,0 @@
import click
import re
from layer_helpers import read_csv_rows, representative_value, stable_id, write_csv_rows
CANONICAL_FIELDS = [
"canonical_product_id",
"canonical_name",
"product_type",
"brand",
"variant",
"size_value",
"size_unit",
"pack_qty",
"measure_type",
"normalized_quantity",
"normalized_quantity_unit",
"notes",
"created_at",
"updated_at",
]
CANONICAL_DROP_TOKENS = {"CT", "COUNT", "COUNTS", "DOZ", "DOZEN", "DOZ.", "PACK"}
LINK_FIELDS = [
"observed_product_id",
"canonical_product_id",
"link_method",
"link_confidence",
"review_status",
"reviewed_by",
"reviewed_at",
"link_notes",
]
def to_float(value):
try:
return float(value)
except (TypeError, ValueError):
return None
def normalized_quantity(row):
size_value = to_float(row.get("representative_size_value"))
pack_qty = to_float(row.get("representative_pack_qty")) or 1.0
size_unit = row.get("representative_size_unit", "")
measure_type = row.get("representative_measure_type", "")
if size_value is not None and size_unit:
return format(size_value * pack_qty, "g"), size_unit
if row.get("representative_pack_qty") and measure_type == "count":
return row["representative_pack_qty"], "count"
if measure_type == "each":
return "1", "each"
return "", ""
def auto_link_rule(observed_row):
if (
observed_row.get("is_fee") == "true"
or observed_row.get("is_discount_line") == "true"
or observed_row.get("is_coupon_line") == "true"
):
return "", "", ""
if observed_row.get("representative_upc"):
return (
"exact_upc",
f"upc={observed_row['representative_upc']}",
"high",
)
if (
observed_row.get("representative_name_norm")
and observed_row.get("representative_size_value")
and observed_row.get("representative_size_unit")
):
return (
"exact_name_size",
"|".join(
[
f"name={observed_row['representative_name_norm']}",
f"size={observed_row['representative_size_value']}",
f"unit={observed_row['representative_size_unit']}",
f"pack={observed_row['representative_pack_qty']}",
f"measure={observed_row['representative_measure_type']}",
]
),
"high",
)
return "", "", ""
def clean_canonical_name(name):
tokens = []
for token in re.sub(r"[^A-Z0-9\s]", " ", (name or "").upper()).split():
if token.isdigit():
continue
if token in CANONICAL_DROP_TOKENS:
continue
if re.fullmatch(r"\d+(?:PK|PACK)", token):
continue
if re.fullmatch(r"\d+DZ", token):
continue
tokens.append(token)
return " ".join(tokens).strip()
def canonical_row_for_group(canonical_product_id, group_rows, link_method):
quantity_value, quantity_unit = normalized_quantity(
{
"representative_size_value": representative_value(
group_rows, "representative_size_value"
),
"representative_size_unit": representative_value(
group_rows, "representative_size_unit"
),
"representative_pack_qty": representative_value(
group_rows, "representative_pack_qty"
),
"representative_measure_type": representative_value(
group_rows, "representative_measure_type"
),
}
)
return {
"canonical_product_id": canonical_product_id,
"canonical_name": clean_canonical_name(
representative_value(group_rows, "representative_name_norm")
)
or representative_value(group_rows, "representative_name_norm"),
"product_type": "",
"brand": representative_value(group_rows, "representative_brand"),
"variant": representative_value(group_rows, "representative_variant"),
"size_value": representative_value(group_rows, "representative_size_value"),
"size_unit": representative_value(group_rows, "representative_size_unit"),
"pack_qty": representative_value(group_rows, "representative_pack_qty"),
"measure_type": representative_value(group_rows, "representative_measure_type"),
"normalized_quantity": quantity_value,
"normalized_quantity_unit": quantity_unit,
"notes": f"auto-linked via {link_method}",
"created_at": "",
"updated_at": "",
}
def build_canonical_layer(observed_rows):
canonical_rows = []
link_rows = []
groups = {}
for observed_row in sorted(observed_rows, key=lambda row: row["observed_product_id"]):
link_method, group_key, confidence = auto_link_rule(observed_row)
if not group_key:
continue
canonical_product_id = stable_id("gcan", f"{link_method}|{group_key}")
groups.setdefault(canonical_product_id, {"method": link_method, "rows": []})
groups[canonical_product_id]["rows"].append(observed_row)
link_rows.append(
{
"observed_product_id": observed_row["observed_product_id"],
"canonical_product_id": canonical_product_id,
"link_method": link_method,
"link_confidence": confidence,
"review_status": "",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
}
)
for canonical_product_id, group in sorted(groups.items()):
canonical_rows.append(
canonical_row_for_group(
canonical_product_id, group["rows"], group["method"]
)
)
return canonical_rows, link_rows
@click.command()
@click.option(
"--observed-csv",
default="giant_output/products_observed.csv",
show_default=True,
help="Path to observed product rows.",
)
@click.option(
"--canonical-csv",
default="giant_output/products_canonical.csv",
show_default=True,
help="Path to canonical product output.",
)
@click.option(
"--links-csv",
default="giant_output/product_links.csv",
show_default=True,
help="Path to observed-to-canonical link output.",
)
def main(observed_csv, canonical_csv, links_csv):
observed_rows = read_csv_rows(observed_csv)
canonical_rows, link_rows = build_canonical_layer(observed_rows)
write_csv_rows(canonical_csv, canonical_rows, CANONICAL_FIELDS)
write_csv_rows(links_csv, link_rows, LINK_FIELDS)
click.echo(
f"wrote {len(canonical_rows)} canonical rows to {canonical_csv} and "
f"{len(link_rows)} links to {links_csv}"
)
if __name__ == "__main__":
main()

@@ -1,172 +0,0 @@
from collections import defaultdict
import click
from layer_helpers import (
compact_join,
distinct_values,
first_nonblank,
read_csv_rows,
representative_value,
stable_id,
write_csv_rows,
)
OUTPUT_FIELDS = [
"observed_product_id",
"retailer",
"observed_key",
"representative_retailer_item_id",
"representative_upc",
"representative_item_name",
"representative_name_norm",
"representative_brand",
"representative_variant",
"representative_size_value",
"representative_size_unit",
"representative_pack_qty",
"representative_measure_type",
"representative_image_url",
"is_store_brand",
"is_fee",
"is_discount_line",
"is_coupon_line",
"first_seen_date",
"last_seen_date",
"times_seen",
"example_order_id",
"example_item_name",
"raw_name_examples",
"normalized_name_examples",
"example_prices",
"distinct_item_names_count",
"distinct_retailer_item_ids_count",
"distinct_upcs_count",
]
def build_observed_key(row):
if row.get("upc"):
return "|".join(
[
row["retailer"],
f"upc={row['upc']}",
f"name={row['item_name_norm']}",
]
)
if row.get("retailer_item_id"):
return "|".join(
[
row["retailer"],
f"retailer_item_id={row['retailer_item_id']}",
f"name={row['item_name_norm']}",
f"discount={row.get('is_discount_line', 'false')}",
f"coupon={row.get('is_coupon_line', 'false')}",
]
)
return "|".join(
[
row["retailer"],
f"name={row['item_name_norm']}",
f"size={row['size_value']}",
f"unit={row['size_unit']}",
f"pack={row['pack_qty']}",
f"measure={row['measure_type']}",
f"store_brand={row['is_store_brand']}",
f"fee={row['is_fee']}",
]
)
def build_observed_products(rows):
grouped = defaultdict(list)
for row in rows:
grouped[build_observed_key(row)].append(row)
observed_rows = []
for observed_key, group_rows in sorted(grouped.items()):
ordered = sorted(
group_rows,
key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])),
)
observed_rows.append(
{
"observed_product_id": stable_id("gobs", observed_key),
"retailer": ordered[0]["retailer"],
"observed_key": observed_key,
"representative_retailer_item_id": representative_value(
ordered, "retailer_item_id"
),
"representative_upc": representative_value(ordered, "upc"),
"representative_item_name": representative_value(ordered, "item_name"),
"representative_name_norm": representative_value(
ordered, "item_name_norm"
),
"representative_brand": representative_value(ordered, "brand_guess"),
"representative_variant": representative_value(ordered, "variant"),
"representative_size_value": representative_value(ordered, "size_value"),
"representative_size_unit": representative_value(ordered, "size_unit"),
"representative_pack_qty": representative_value(ordered, "pack_qty"),
"representative_measure_type": representative_value(
ordered, "measure_type"
),
"representative_image_url": first_nonblank(ordered, "image_url"),
"is_store_brand": representative_value(ordered, "is_store_brand"),
"is_fee": representative_value(ordered, "is_fee"),
"is_discount_line": representative_value(
ordered, "is_discount_line"
),
"is_coupon_line": representative_value(ordered, "is_coupon_line"),
"first_seen_date": ordered[0]["order_date"],
"last_seen_date": ordered[-1]["order_date"],
"times_seen": str(len(ordered)),
"example_order_id": ordered[0]["order_id"],
"example_item_name": ordered[0]["item_name"],
"raw_name_examples": compact_join(
distinct_values(ordered, "item_name"), limit=4
),
"normalized_name_examples": compact_join(
distinct_values(ordered, "item_name_norm"), limit=4
),
"example_prices": compact_join(
distinct_values(ordered, "line_total"), limit=4
),
"distinct_item_names_count": str(
len(distinct_values(ordered, "item_name"))
),
"distinct_retailer_item_ids_count": str(
len(distinct_values(ordered, "retailer_item_id"))
),
"distinct_upcs_count": str(len(distinct_values(ordered, "upc"))),
}
)
observed_rows.sort(key=lambda row: row["observed_product_id"])
return observed_rows
@click.command()
@click.option(
"--items-enriched-csv",
default="giant_output/items_enriched.csv",
show_default=True,
help="Path to enriched Giant item rows.",
)
@click.option(
"--output-csv",
default="giant_output/products_observed.csv",
show_default=True,
help="Path to observed product output.",
)
def main(items_enriched_csv, output_csv):
rows = read_csv_rows(items_enriched_csv)
observed_rows = build_observed_products(rows)
write_csv_rows(output_csv, observed_rows, OUTPUT_FIELDS)
click.echo(f"wrote {len(observed_rows)} rows to {output_csv}")
if __name__ == "__main__":
main()

@@ -10,6 +10,14 @@ from layer_helpers import read_csv_rows, write_csv_rows
 PURCHASE_FIELDS = [
     "purchase_date",
     "retailer",
+    "catalog_name",
+    "product_type",
+    "category",
+    "net_line_total",
+    "normalized_quantity",
+    "normalized_quantity_unit",
+    "effective_price",
+    "effective_price_unit",
     "order_id",
     "line_no",
     "normalized_row_id",
@@ -19,9 +27,6 @@ PURCHASE_FIELDS = [
     "resolution_action",
     "raw_item_name",
     "normalized_item_name",
-    "catalog_name",
-    "category",
-    "product_type",
     "brand",
     "variant",
     "image_url",
@@ -170,6 +175,41 @@ def derive_metrics(row):
     }
+def derive_effective_price(row):
+    normalized_quantity = to_decimal(row.get("normalized_quantity"))
+    if normalized_quantity in (None, Decimal("0")):
+        return ""
+    numerator = to_decimal(derive_net_line_total(row))
+    if numerator is None:
+        return ""
+    return format_decimal(numerator / normalized_quantity)
+def derive_effective_price_unit(row):
+    normalized_quantity = to_decimal(row.get("normalized_quantity"))
+    if normalized_quantity in (None, Decimal("0")):
+        return ""
+    return row.get("normalized_quantity_unit", "")
+def derive_net_line_total(row):
+    existing_net = row.get("net_line_total", "")
+    if str(existing_net).strip() != "":
+        return str(existing_net)
+    line_total = to_decimal(row.get("line_total"))
+    if line_total is None:
+        return ""
+    matched_discount_amount = to_decimal(row.get("matched_discount_amount"))
+    if matched_discount_amount is not None:
+        return format_decimal(line_total + matched_discount_amount)
+    return format_decimal(line_total)
 def order_lookup(rows, retailer):
     return {(retailer, row["order_id"]): row for row in rows}
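
The arithmetic behind the added functions is: effective price = net line total / normalized quantity, where the net total is the line total plus any matched discount. The sketch below works one example through. It is only an illustration: `to_decimal` here is a stand-in written for this example (the real helper lives in the repository and is not shown in this diff), `effective_price` is a hypothetical condensed version of the three functions, and the assumption that discounts are stored as negative amounts is inferred from the `line_total + matched_discount_amount` expression rather than confirmed by the source.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(value):
    """Stand-in parser (assumption): blank or malformed input becomes None."""
    try:
        return Decimal(str(value).strip())
    except (InvalidOperation, ValueError):
        return None


def effective_price(line_total, matched_discount_amount, normalized_quantity):
    """Hypothetical condensed form: (line_total + discount) / quantity, or None."""
    quantity = to_decimal(normalized_quantity)
    total = to_decimal(line_total)
    if quantity is None or quantity == 0 or total is None:
        return None  # no usable quantity basis, so no price per unit
    discount = to_decimal(matched_discount_amount)
    net = total + discount if discount is not None else total
    return net / quantity


# A 32 oz item at 7.99 with a 2.00 discount recorded as -2.00 (assumed sign).
per_oz = effective_price("7.99", "-2.00", "32")
print(per_oz)  # price per normalized unit, here per oz

# A zero or missing quantity yields no effective price instead of raising.
print(effective_price("7.99", "", "0"))
```

Returning an empty value for a zero or missing quantity, as the diff does, keeps the division from raising and leaves such rows out of `item_price_over_time.csv`, which skips rows without an `effective_price`.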
@@ -318,6 +358,14 @@ def build_purchase_rows(
             {
                 "purchase_date": row["order_date"],
                 "retailer": row["retailer"],
+                "catalog_name": catalog_row.get("catalog_name", ""),
+                "product_type": catalog_row.get("product_type", ""),
+                "category": catalog_row.get("category", ""),
+                "net_line_total": derive_net_line_total(row),
+                "normalized_quantity": row.get("normalized_quantity", ""),
+                "normalized_quantity_unit": row.get("normalized_quantity_unit", ""),
+                "effective_price": derive_effective_price({**row, "net_line_total": derive_net_line_total(row)}),
+                "effective_price_unit": derive_effective_price_unit(row),
                 "order_id": row["order_id"],
                 "line_no": row["line_no"],
                 "normalized_row_id": row.get("normalized_row_id", ""),
@@ -327,9 +375,6 @@ def build_purchase_rows(
                 "resolution_action": resolution.get("resolution_action", ""),
                 "raw_item_name": row["item_name"],
                 "normalized_item_name": row["item_name_norm"],
-                "catalog_name": catalog_row.get("catalog_name", ""),
-                "category": catalog_row.get("category", ""),
-                "product_type": catalog_row.get("product_type", ""),
                 "brand": catalog_row.get("brand", ""),
                 "variant": catalog_row.get("variant", ""),
                 "image_url": row.get("image_url", ""),
@@ -344,7 +389,6 @@ def build_purchase_rows(
                 "line_total": row["line_total"],
                 "unit_price": row["unit_price"],
                 "matched_discount_amount": row.get("matched_discount_amount", ""),
-                "net_line_total": row.get("net_line_total", ""),
                 "store_name": order_row.get("store_name", ""),
                 "store_number": order_row.get("store_number", ""),
                 "store_city": order_row.get("store_city", ""),
@@ -396,10 +440,10 @@ def build_comparison_examples(purchase_rows):
 @click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
 @click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
 @click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
-@click.option("--catalog-csv", default="data/catalog.csv", show_default=True)
+@click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
 @click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
-@click.option("--output-csv", default="data/review/purchases.csv", show_default=True)
-@click.option("--examples-csv", default="data/review/comparison_examples.csv", show_default=True)
+@click.option("--output-csv", default="data/analysis/purchases.csv", show_default=True)
+@click.option("--examples-csv", default="data/analysis/comparison_examples.csv", show_default=True)
 def main(
     giant_items_enriched_csv,
     costco_items_enriched_csv,

@@ -1,175 +0,0 @@
from collections import defaultdict
from datetime import date
import click
from layer_helpers import compact_join, distinct_values, read_csv_rows, stable_id, write_csv_rows
OUTPUT_FIELDS = [
"review_id",
"queue_type",
"retailer",
"observed_product_id",
"canonical_product_id",
"reason_code",
"priority",
"raw_item_names",
"normalized_names",
"upc",
"image_url",
"example_prices",
"seen_count",
"status",
"resolution_notes",
"created_at",
"updated_at",
]
def existing_review_state(path):
try:
rows = read_csv_rows(path)
except FileNotFoundError:
return {}
return {row["review_id"]: row for row in rows}
def review_reasons(observed_row):
reasons = []
if (
observed_row["is_fee"] == "true"
or observed_row.get("is_discount_line") == "true"
or observed_row.get("is_coupon_line") == "true"
):
return reasons
if observed_row["distinct_upcs_count"] not in {"", "0", "1"}:
reasons.append(("multiple_upcs", "high"))
if observed_row["distinct_item_names_count"] not in {"", "0", "1"}:
reasons.append(("multiple_raw_names", "medium"))
if not observed_row["representative_image_url"]:
reasons.append(("missing_image", "medium"))
if not observed_row["representative_upc"]:
reasons.append(("missing_upc", "high"))
if not observed_row["representative_name_norm"]:
reasons.append(("missing_normalized_name", "high"))
return reasons
def build_review_queue(observed_rows, item_rows, existing_rows, today_text):
by_observed = defaultdict(list)
for row in item_rows:
observed_id = row.get("observed_product_id", "")
if observed_id:
by_observed[observed_id].append(row)
queue_rows = []
for observed_row in observed_rows:
reasons = review_reasons(observed_row)
if not reasons:
continue
related_items = by_observed.get(observed_row["observed_product_id"], [])
raw_names = compact_join(distinct_values(related_items, "item_name"), limit=5)
norm_names = compact_join(
distinct_values(related_items, "item_name_norm"), limit=5
)
example_prices = compact_join(
distinct_values(related_items, "line_total"), limit=5
)
for reason_code, priority in reasons:
review_id = stable_id(
"rvw",
f"{observed_row['observed_product_id']}|{reason_code}",
)
prior = existing_rows.get(review_id, {})
queue_rows.append(
{
"review_id": review_id,
"queue_type": "observed_product",
"retailer": observed_row["retailer"],
"observed_product_id": observed_row["observed_product_id"],
"canonical_product_id": prior.get("canonical_product_id", ""),
"reason_code": reason_code,
"priority": priority,
"raw_item_names": raw_names,
"normalized_names": norm_names,
"upc": observed_row["representative_upc"],
"image_url": observed_row["representative_image_url"],
"example_prices": example_prices,
"seen_count": observed_row["times_seen"],
"status": prior.get("status", "pending"),
"resolution_notes": prior.get("resolution_notes", ""),
"created_at": prior.get("created_at", today_text),
"updated_at": today_text,
}
)
queue_rows.sort(key=lambda row: (row["priority"], row["reason_code"], row["review_id"]))
return queue_rows
def attach_observed_ids(item_rows, observed_rows):
observed_by_key = {row["observed_key"]: row["observed_product_id"] for row in observed_rows}
attached = []
for row in item_rows:
observed_key = "|".join(
[
row["retailer"],
f"upc={row['upc']}",
f"name={row['item_name_norm']}",
]
) if row.get("upc") else "|".join(
[
row["retailer"],
f"retailer_item_id={row.get('retailer_item_id', '')}",
f"name={row['item_name_norm']}",
f"size={row['size_value']}",
f"unit={row['size_unit']}",
f"pack={row['pack_qty']}",
f"measure={row['measure_type']}",
f"store_brand={row['is_store_brand']}",
f"fee={row['is_fee']}",
f"discount={row.get('is_discount_line', 'false')}",
f"coupon={row.get('is_coupon_line', 'false')}",
]
)
enriched = dict(row)
enriched["observed_product_id"] = observed_by_key.get(observed_key, "")
attached.append(enriched)
return attached
@click.command()
@click.option(
"--observed-csv",
default="giant_output/products_observed.csv",
show_default=True,
help="Path to observed product rows.",
)
@click.option(
"--items-enriched-csv",
default="giant_output/items_enriched.csv",
show_default=True,
help="Path to enriched Giant item rows.",
)
@click.option(
"--output-csv",
default="giant_output/review_queue.csv",
show_default=True,
help="Path to review queue output.",
)
def main(observed_csv, items_enriched_csv, output_csv):
observed_rows = read_csv_rows(observed_csv)
item_rows = read_csv_rows(items_enriched_csv)
item_rows = attach_observed_ids(item_rows, observed_rows)
existing_rows = existing_review_state(output_csv)
today_text = str(date.today())
queue_rows = build_review_queue(observed_rows, item_rows, existing_rows, today_text)
write_csv_rows(output_csv, queue_rows, OUTPUT_FIELDS)
click.echo(f"wrote {len(queue_rows)} rows to {output_csv}")
if __name__ == "__main__":
main()

@@ -29,7 +29,7 @@ CODE_TOKEN_RE = re.compile(
     r"\b(?:SL\d+|T\d+H\d+|P\d+(?:/\d+)?|W\d+T\d+H\d+|FY\d+|CSPC#|C\d+T\d+H\d+|EC\d+T\d+H\d+|\d+X\d+)\b"
 )
 PACK_FRACTION_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT)\b")
-HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#\b")
+HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#(?=\s|$)")
 ITEM_CODE_RE = re.compile(r"#\w+\b")
 DUAL_WEIGHT_RE = re.compile(
     r"\b\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\s*/\s*\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\b"
@@ -37,7 +37,9 @@ DUAL_WEIGHT_RE = re.compile(
LOGISTICS_SLASH_RE = re.compile(r"\b(?:T\d+/H\d+(?:/P\d+)?/?|H\d+/P\d+/?|T\d+/H\d+/?)\b")
PACK_DASH_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*-\s*PACK\b")
PACK_WORD_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*PACK\b")
-SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT|KG|G)\b")
+SIZE_RE = re.compile(
+    r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT|KG|G|QT|QTS|PT|PTS|GAL|GALS|FL OZ|FLOZ)\b"
+)
DISCOUNT_TARGET_RE = re.compile(r"^/\s*(\d+)\b")
@@ -192,10 +194,12 @@ def parse_costco_item(order_id, order_date, raw_path, line_no, item):
)
normalized_row_id = f"{RETAILER}:{order_id}:{line_no}"
normalized_quantity, normalized_quantity_unit = derive_normalized_quantity(
item.get("unit"),
size_value,
size_unit,
pack_qty,
measure_type,
"",
)
identity_key, normalization_basis = normalization_identity(
{

@@ -224,13 +224,17 @@ def normalize_unit(unit):
"OZ": "oz",
"FZ": "fl_oz",
"FL OZ": "fl_oz",
"FLOZ": "fl_oz",
"LB": "lb",
"LBS": "lb",
"ML": "ml",
"L": "l",
"QT": "qt",
"QTS": "qt",
"PT": "pt",
"PTS": "pt",
"GAL": "gal",
"GALS": "gal",
"GA": "gal",
}.get(collapsed, collapsed.lower())
@@ -340,16 +344,27 @@ def derive_prices(item, measure_type, size_value="", size_unit="", pack_qty=""):
return price_per_each, price_per_lb, price_per_oz
-def derive_normalized_quantity(size_value, size_unit, pack_qty, measure_type):
+def derive_normalized_quantity(qty, size_value, size_unit, pack_qty, measure_type, picked_weight=""):
+    parsed_qty = to_decimal(qty)
     parsed_size = to_decimal(size_value)
-    parsed_pack = to_decimal(pack_qty) or Decimal("1")
+    parsed_pack = to_decimal(pack_qty)
+    parsed_picked_weight = to_decimal(picked_weight)
+    total_multiplier = None
+    if parsed_qty not in (None, Decimal("0")):
+        total_multiplier = parsed_qty * (parsed_pack or Decimal("1"))
-    if parsed_size not in (None, Decimal("0")) and size_unit:
-        return format_decimal(parsed_size * parsed_pack), size_unit
-    if parsed_pack not in (None, Decimal("0")) and measure_type == "count":
-        return format_decimal(parsed_pack), "count"
-    if measure_type == "each":
-        return "1", "each"
+    if (
+        parsed_size not in (None, Decimal("0"))
+        and size_unit
+        and total_multiplier not in (None, Decimal("0"))
+    ):
+        return format_decimal(parsed_size * total_multiplier), size_unit
+    if measure_type == "weight" and parsed_picked_weight not in (None, Decimal("0")):
+        return format_decimal(parsed_picked_weight), "lb"
+    if measure_type == "count" and total_multiplier not in (None, Decimal("0")):
+        return format_decimal(total_multiplier), "count"
+    if measure_type == "each" and parsed_qty not in (None, Decimal("0")):
+        return format_decimal(parsed_qty), "each"
     return "", ""
@@ -424,10 +439,12 @@ def parse_item(order_id, order_date, raw_path, line_no, item):
normalized_row_id = f"{RETAILER}:{order_id}:{line_no}"
normalized_quantity, normalized_quantity_unit = derive_normalized_quantity(
item.get("shipQy"),
size_value,
size_unit,
pack_qty,
measure_type,
item.get("totalPickedWeight"),
)
identity_key, normalization_basis = normalization_identity(
{

@@ -111,7 +111,14 @@ data/
review_queue.csv # Human review queue for unresolved matching/parsing cases.
product_links.csv # Links from normalized retailer items to catalog items.
catalog.csv # Cross-retailer product catalog entities used for comparison.
analysis/
purchases.csv
comparison_examples.csv
item_price_over_time.csv
spend_by_visit.csv
items_per_visit.csv
category_spend_over_time.csv
retailer_store_breakdown.csv
#+end_example
Notes:
@@ -223,7 +230,7 @@ Notes:
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
-- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.
+- Cross-retailer identity is handled later in review/combine via `data/review/catalog.csv` and `product_links.csv`.
** `data/review/product_links.csv`
One row per review-approved link from a normalized retailer item to a catalog item.
@@ -263,7 +270,7 @@ One row per issue needing human review.
| `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
-** `data/catalog.csv`
+** `data/review/catalog.csv`
One row per cross-retailer catalog product.
| key | definition |
|----------------------------+----------------------------------------|
@@ -288,7 +295,7 @@ Notes:
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings.
-** `data/purchases.csv`
+** `data/analysis/purchases.csv`
One row per purchased item (i.e., `is_item`==true from normalized layer), with
catalog attributes denormalized in and discounts already applied.
@@ -344,3 +351,9 @@ Notes:
- review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
* /
Normalized quantity is deterministic and conservative:
- if `qty * pack_qty * size_value` is available, use that total with `size_unit`
- else if count basis is explicit, use `qty * pack_qty` with unit `count`
- else if `measure_type` is `each`, use `qty each`
- else leave both fields blank
- no hidden unit conversion is applied inside normalization; values stay in their parsed units such as `oz`, `lb`, `qt`, or `count`
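The four rules above can be sketched in a few lines. This is a simplified illustration, not the project's implementation: `sketch_normalized_quantity`, `_dec`, and `_fmt` are hypothetical names, and the sketch covers only the rules listed here (it omits the picked-weight fallback that appears in the code diff).

```python
from decimal import Decimal, InvalidOperation


def _dec(value):
    """Hypothetical helper: parse to a positive, finite Decimal, else None."""
    try:
        parsed = Decimal(str(value).strip())
    except InvalidOperation:
        return None
    if not parsed.is_finite() or parsed <= 0:
        return None
    return parsed


def _fmt(value):
    """Hypothetical helper: plain decimal text without trailing zeros."""
    return format(value.normalize(), "f")


def sketch_normalized_quantity(qty, size_value, size_unit, pack_qty, measure_type):
    quantity = _dec(qty)
    size = _dec(size_value)
    pack = _dec(pack_qty)
    # A missing pack_qty counts as a pack of one; a missing qty blocks every rule.
    multiplier = quantity * (pack or Decimal("1")) if quantity is not None else None

    if size is not None and size_unit and multiplier is not None:
        return _fmt(size * multiplier), size_unit  # qty * pack_qty * size_value
    if measure_type == "count" and multiplier is not None:
        return _fmt(multiplier), "count"  # qty * pack_qty
    if measure_type == "each" and quantity is not None:
        return _fmt(quantity), "each"
    return "", ""  # no reliable basis: leave both fields blank


# Two units of a 6-pack of 7.5 oz cans stay in ounces; nothing is converted.
print(sketch_normalized_quantity("2", "7.5", "oz", "6", "weight"))
```

Note that the unit passes through untouched in every branch, which is the "no hidden unit conversion" rule in practice.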

@@ -500,4 +500,155 @@ Decide whether two normalized retailer items are "the same product"; match items
** Symptoms
- `LIME` and `LIME . / .` appearing in canonical_catalog:
- names must come from review-approved names, not raw strings
*
* notes
** to fix
- options not reading/sticking?
- ice cream - add flavor, call it frozen (not dairy)
- seltzer/soda from "seltzer,soda,bev" to "cherry san pellegrino, seltzer, bev"?
- [1] chicken bouillon, soup, (0 items, 0 rows) -> chicken bouillon, broth?, ,
- peanut butter,, -> creamy peanut butter, peanut butter, condiment
- add gummy bear to candy
- add "fresh" to fresh strawberry
- fix "onion,veg,produce"
manage product_type and category directly?
future: fix match
*** Done
fuji apple, apple, produce (not apple, fruit, produce)
spinach, , produce -> frozen vs fresh?
frozen chicken thighs ->
rotisserie chicken, chicken, poultry -> rotisserie chicken, chicken, meat
beef patty, hamburger, meat -> hamburger patty, beef, meat
oats > cereal
cheerios > cereal
- 3 kinds of greek yogurt!!
** takeaways
- variants not caught, how to fix?
catalog_name = what you actually bought
product_type = reasonable substitute
category = store aisle
Using different categories maintains a direct comparison (product_type==spinach) and a distinction.
fresh spinach, spinach, produce
frozen spinach, spinach, frozen
include in catalog_name:
- form: frozen, fresh, ground, shredded
- fat level: whole, skim, 2%
- flavor when primary: vanilla yogurt vs plain yogurt
- cut: diced tomatoes vs crushed tomatoes
- species when relevant: gala apple vs fuji apple
exclude from catalog_name:
- package size / multipack count
- promo wording; adjectives like "premium"; retailer marketing fluff
** AC
1. fix internal search flow, add same menu
#+begin_src diff
Review 4/345: SHRP CHDR
5 matched items:
[1] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2026-03-12 | 5.49 |
[2] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2025-01-24 | 12.58 |
[3] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2025-01-10 | 6.29 |
[4] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2024-12-14 | 6.29 |
[5] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2024-08-06 | 5.99 |
no catalog_name suggestions found
[f]ind [n]ew [s]kip e[x]clude [q]uit >
f
search: cheddar
1 search results found:
[1] cheddar cheese, cheese, dairy (0 items, 0 rows)
- selection: 1
+ [#] link to suggestion [f]ind [n]ew [s]kip e[x]clude [q]uit >
#+end_src
instead of
#+begin_src diff
search: banana
no matches found
- search again? [enter=yes, q=no]:
+ [f]ind [n]ew [s]kip e[x]clude [q]uit >
#+end_src
2. during a long review session, with two pepper or onion types back-to-back, i can't see the one i just added
- suggest just-added catalog items
- script likely needs to re-read the csv, not just add
//3. suggest based on both catalog & product_name (this is already happening)//
3. Search results do not properly list running totals:
5 search results found:
[1] red onion, onion, produce (0 items, 0 rows)
[2] mild roasted red bell pepper, bell pepper, produce (0 items, 0 rows)
[3] onion, vegetable, produce (0 items, 0 rows)
[4] sour cream and onion potato chip, chips, snack (0 items, 0 rows)
[5] yellow onion, onion, produce (0 items, 0 rows)
selection:
* data cleanup [2026-03-23 Mon]
ok we're getting closer. still see some issues
1. reorder purchases columns for display: catalog_name, product_type, category (makes data/troubleshooting way easier)
2. shouldn't net_line_total never be empty? it's needed for cumulative cost comparison/analysis (we can see normalized price per X via effective_price, but shouldn't this be weighted by how much we bought? e.g. if we bought 5lb flour at $0.970/lb, this is weighted 1-to-1 with a 25lb purchase at $0.670/lb)
3. some items missing entire categorizations? probably a result of me trying to do data cleanup. i found the orphaned values in the product_links table and removed them, but re-running review_products.py did not catch this...
shouldn't review_products compare each vendor's normalized_items against the existing review_queue?
RSET POTATO US 1
GREEK YOGURT DOM55
FDLY CHY VAN IC CRM
DUNKIN DONUT CANISTER ORIG BLND P=260
ICE CUBES
BLACK BEANS
KETCHUP SQUEEZE BTL
YELLOW_GOLD POTATO US 1
YELLOW_GOLD POTATO US 1
PINTO BEANS
4. cleanup deprecated .py files
5. Goals:
1. When have I purchased this item, what did I pay, and how has the price changed over time?
- we're close, but missing units - e.g. AP flour shows a value that looks like price/lb but you just see $0.765
- doesn't seem like we've captured everything, but that's just a gut feeling
2. Visit breakdown as well as catalog/product/category? this certainly belongs in purchases.csv.
3. Consider dash/plotly for better-than-excel tracking, since we're really only looking at a couple of graphs and filtering within certain values? (obv keep purchases as a user-friendly output)
** 1. Cleanup purchases column order
purchase_date
retailer
catalog_name
product_type
category
net_line_total
normalized_quantity
effective_price
effective_price_unit (new)
order_id
line_no
raw_item_name
normalized_item_name
catalog_id
normalized_item_id
** 2. Populate and use purchases.net_line_total
net_line_total = line_total + matched_discount_amount
effective_price = net_line_total / normalized_quantity
weighted cost analysis uses net_line_total, not just avg effective_price
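A minimal sketch of the weighting concern, using the flour figures from the issue list above (5 lb at $0.970/lb and 25 lb at $0.670/lb). `weighted_unit_cost` is a hypothetical helper, and the sketch assumes every row passed in already shares one `normalized_quantity_unit`.

```python
from decimal import Decimal


def weighted_unit_cost(rows):
    """Total spend divided by total quantity, skipping rows with blank fields.

    Assumes all rows share one normalized_quantity_unit (an assumption of this
    sketch; mixed units would need conversion first).
    """
    total_cost = Decimal("0")
    total_quantity = Decimal("0")
    for row in rows:
        if not row.get("net_line_total") or not row.get("normalized_quantity"):
            continue
        total_cost += Decimal(row["net_line_total"])
        total_quantity += Decimal(row["normalized_quantity"])
    if total_quantity == 0:
        return None
    return total_cost / total_quantity


flour = [
    {"net_line_total": "4.85", "normalized_quantity": "5"},    # $0.970/lb
    {"net_line_total": "16.75", "normalized_quantity": "25"},  # $0.670/lb
]

simple_mean = sum(
    Decimal(r["net_line_total"]) / Decimal(r["normalized_quantity"]) for r in flour
) / len(flour)

print(simple_mean)                # unweighted mean of the two per-lb prices
print(weighted_unit_cost(flour))  # total dollars over total pounds
```

The unweighted mean treats the 5 lb bag and the 25 lb bag as equals, while the weighted figure reflects what was actually paid per pound across both purchases.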
** 3. Improve review robustness, enable norm_item re review
1. should regenerate candidates from:
- normalized items with no valid catalog_id
- normalized items whose linked catalog_id no longer exists
- normalized items whose linked catalog row exists but missing required fields if you want completeness review
2. review_products.py should compare:
- current normalized universe
- current product_links
- current catalog
- current review_queue
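The comparison described above might look roughly like the following. It is a sketch under stated assumptions: `find_review_candidates`, its argument shapes, and the reason labels are all hypothetical, and it covers only the unlinked and orphaned-link cases, not the optional completeness review.

```python
def find_review_candidates(normalized_ids, product_links, catalog_ids, queued_ids):
    """Return (candidates, missing_from_queue).

    normalized_ids: iterable of normalized_item_id values currently in the data
    product_links:  dict mapping normalized_item_id -> catalog_id
    catalog_ids:    set of catalog_id values present in the catalog
    queued_ids:     set of normalized_item_id values already in the review queue
    """
    candidates = {}
    for item_id in normalized_ids:
        catalog_id = product_links.get(item_id, "")
        if not catalog_id:
            candidates[item_id] = "unlinked"
        elif catalog_id not in catalog_ids:
            # The link survives but its catalog row was deleted by hand.
            candidates[item_id] = "orphaned_link"
    missing_from_queue = sorted(set(candidates) - set(queued_ids))
    return candidates, missing_from_queue


candidates, missing = find_review_candidates(
    normalized_ids=["n1", "n2", "n3"],
    product_links={"n1": "c1", "n2": "c9"},
    catalog_ids={"c1"},
    queued_ids={"n3"},
)
print(candidates)
print(missing)
```

Recomputing the candidate set from current state on every run, rather than trusting the stored queue, is what would have caught the manually removed links.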
** 4. Remove deprecated.py
** 5. Improve Charts
1. Histogram: add effective_price_unit to purchases.py
1. Visits: plot by order_id enable display of:
1. spend by visit
2. items per visit
3. category spend by visit
4. retailer/store breakdown
* /

@@ -624,7 +624,7 @@ tighten Costco-specific normalization so normalized item names are cleaner and d
- The structured parsing still owns size/pack extraction, so name cleanup can safely strip dual-unit and logistics fragments after those fields are parsed.
- Discount-line behavior remains unchanged; this task only cleaned normalized names and preserved the existing audit trail.
-* [x] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
+* [X] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items
** Acceptance Criteria
@@ -677,7 +677,452 @@ replace the old observed/canonical workflow with a review-first pipeline that us
- Existing auto-generated catalog rows are no longer carried forward by default; only deliberate catalog entries survive. That keeps the new `catalog.csv` conservative, but it also means prior observed-based auto-links do not migrate into the new model.
- Live rerun after the refactor produced `627` purchase rows, `387` review-queue rows, `407` distinct normalized items, `0` linked normalized items, and `0` unresolved rows missing from the review queue.
* [ ] 1t.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
* [X] t1.16: cleanup review process and format
** acceptance criteria
1. Add intro text explaining:
1. catalog name: unique product including variant but not packaging, eg "whole milk", "sharp cheddar cheese"
2. product type: general product you would like to compare to, eg "milk", "cheese"
3. category: eg "dairy"
2. Reformat input per item
1. Change matched item field display order
2. Add count of distinct normalized_item_ids and total purchase rows already linked to the catalog item
3. Add option to select catalog suggestion directly
#+begin_comment
Review 7/22: MIXED PEPPER 6-PK
2 matched items:
- MIXED PEPPER 6-PK | costco | 2026-03-12 | 7.49 | [img_url]
- [raw_name] | [retailer] | [YYYY-mm-dd] | [price] | [img_url]
2 catalog suggestions found:
[1] bell pepper, pepper, produce (42 items)
[2] ground pepper, spice, baking (1 item)
[#] link to suggestion [n]ew [s]kip e[x]clude [q]uit >
#+end_comment
3. When creating new, ask for input in catalog_name, product_type, category order
1. enter to accept blank value
4. Each reviewed item is saved after user input, not at the end of the script.
1. on new creation, create entry in catalog.csv and create entry in product_links.csv
2. on link existing, create entry in product_links.csv
3. update review_queue.csv status for item immediately after action
5. linking operates at normalized_item_id level, not per normalized_row_id
6. ensure catalog.csv and product_links.csv are human-editable and consistent so manual correction is possible without tooling
** evidence
- commit: `975d44b`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python review_products.py --help`
- datetime: 2026-03-20 12:45:25 EDT
** notes
- The main flow change is operational rather than architectural: each review decision now persists immediately to `review_resolutions.csv`, `catalog.csv`, `product_links.csv`, and the on-disk `review_queue.csv`.
- Direct numeric selection works well for suggestion-heavy review, while `[l]ink existing` remains available as a fallback when the suggestion list is empty or incomplete.
- I kept the review data model unchanged from `t1.15`; this task only tightened the prompt format, field order, and save behavior.
* [X] t1.16.1: add catalog search flow to review ui (2-3 commits)
enable fast lookup of catalog items during review via tokenized search and replace manual list scanning
** acceptance criteria
1. replace `[l]ink existing` with `[f]ind` in review prompt:
- `[#] link to suggestion [f]ind [n]ew [s]kip e[x]clude [q]uit >`
2. implement search flow:
- on `f`, prompt: `search: `
- tokenize input using same normalization rules as suggestion matching
- return ranked list of catalog items where tokens overlap with:
- catalog_name
- product_type
- variant
- display results in same numbered format as suggestions:
[1] flour, flour, baking (12 items, 48 rows)
3. allow direct selection from search results:
- when user inputs number, immediately creates approved resolution and product_links rows
- returns to next review item
4. reuse match logic used for suggestion matching; no new matching system introduced
- future improvements to matching logic will therefore apply in both places
5. search results exclude already-linked current normalized_item_id target
6. fallback behavior:
- if no results, print `no matches found`
- allow retry or return to main prompt
7. keep interaction tight:
- no full catalog dump
- max ~10 results returned
- sorted by simple score (token overlap count)
8. persistence:
- selected link writes immediately to `product_links.csv`
- no buffering until script end
- pm note: optimize for speed over correctness; this is a manual assist tool, not a ranking system
- pm note: improve manual lookup flow only, don't retool or create a second algorithm
** evidence
- commit: `f93b9aa`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python review_products.py --help`; `./venv/bin/python review_products.py --refresh-only`
- datetime: 2026-03-20 13:34:57 EDT
** notes
- The search path reuses the same lightweight token matching rules as suggestion ranking, so there is still only one matching system to maintain.
- Direct numeric suggestion-pick remains the fastest happy path; search is the fallback when suggestions are sparse or missing.
- Search intentionally optimizes for manual speed rather than smart ranking: simple token overlap, max 10 rows, and immediate persistence on selection.
- Follow-up fix: search moved to `[f]ind` so `[s]kip` remains available at the main prompt.
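The token-overlap search described in these notes can be sketched as follows. This is an illustration only: `tokenize` and `search_catalog` are hypothetical names, and the real tool reuses its existing suggestion-matching rules rather than this simplified tokenizer.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for the shared normalization rules."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_catalog(query, catalog_rows, limit=10):
    """Rank catalog rows by how many query tokens they share, best first."""
    query_tokens = tokenize(query)
    scored = []
    for row in catalog_rows:
        fields = " ".join(
            row.get(name, "") for name in ("catalog_name", "product_type", "variant")
        )
        score = len(query_tokens & tokenize(fields))
        if score:
            scored.append((score, row))
    # sorted() is stable, so ties keep their catalog order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [row for _, row in ranked[:limit]]


catalog = [
    {"catalog_name": "cheddar cheese", "product_type": "cheese", "variant": ""},
    {"catalog_name": "red onion", "product_type": "onion", "variant": ""},
    {"catalog_name": "yellow onion", "product_type": "onion", "variant": ""},
]

for hit in search_catalog("onion", catalog):
    print(hit["catalog_name"])
```

An empty result list maps directly onto the `no matches found` fallback in the acceptance criteria.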
* [X] t1.17: fix normalized quantity derivation and carry it through purchases (2-4 commits)
correct and document deterministic normalized quantity fields so unit-cost analysis works across package sizes
** Acceptance Criteria
1. populate and validate `normalized_quantity` and `normalized_quantity_unit` in `data/<retailer-method>/normalized_items.csv`
- these columns already exist and must be corrected rather than reintroduced
2. carry `normalized_quantity` and `normalized_quantity_unit` through to `data/review/purchases.csv`
3. derive normalized quantity deterministically from existing parsed fields only:
- `qty`
- `pack_qty`
- `size_value`
- `size_unit`
- `measure_type`
4. prefer the best deterministic basis rather than falling back to `each` too early:
- count items when count is explicit
- weight items when parsed weight is explicit
- volume items when parsed volume is explicit
- `each` only when no better basis is available
5. handle common cases explicitly, including totals derived from deterministic patterns such as:
- `18 count`
- `5 lb`
- `64 oz`
- `2 each`
6. preserve blanks when no reliable normalized quantity basis can be derived
7. existing `normalized_item_id` values remain stable; this task must not change retailer-level grouping identity
8. document the derivation rules and any intentional conversions or non-conversions in `pm/data-model.org` or task notes
- if unit conversions are allowed, they must be explicit and minimal
- pm note: keep this deterministic and conservative; do not introduce fuzzy inference
- pm note: if `lb <-> oz` or volume conversions are used, document them directly rather than hiding them in code
- pm note: this task enables cost analysis and charting, not catalog/review changes
** evidence
- commit: `d25448b`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python normalize_giant_web.py`; `./venv/bin/python normalize_costco_web.py`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-21 21:02:21 EDT
** notes
- The missing purchases fields were a carry-through bug: normalization had `normalized_quantity` and `normalized_quantity_unit`, but `build_purchases.py` never wrote them into `data/review/purchases.csv`.
- Normalized quantity now prefers explicit package basis over `each`, so rows like `PEPSI 6PK 7.5Z` resolve to `90 oz` and `KS ALMND BAR US 1.74QTS` purchased twice resolves to `3.48 qt`.
- The derivation stays conservative and does not convert units during normalization; parsed units such as `oz`, `lb`, `qt`, and `count` are preserved as-is.
* [X] t1.18: add regression tests for known quantity/price failures (1-2 commits)
capture the currently broken comparison cases before changing normalization or purchases logic
** acceptance criteria
1. ensure the new tests assert the intended `effective_price` behavior for the known banana, ice, and beef patty examples
2. add tests covering known broken cases:
- giant bananas produce non-blank effective price
- giant bagged ice produces non-zero effective price
- costco bananas retain correct effective price
- beef patty comparison rows preserve expected quantity basis behavior
3. tests fail against current broken behavior and document the expected outcome
4. include at least one assertion that effective_price is blank rather than `0` or divide-by-zero when no denominator exists
- pm note: this task should only add tests/fixtures and not change business logic
** pm identified problems
we have a few problems to scope. looks like:
1. normalize_giant_web not always propagating weight data to price_per
2. effective_price calc needs more robust matching algo (my excel hack is clearly not enough)
```
catalog_name banana
Average of effective_price Column Labels
Row Labels 8/6/2024 12/6/2024 12/12/2024 1/7/2025 1/24/2025 2/16/2025 2/20/2025 6/25/2025 2/14/2026 3/12/2026 Grand Total
Jan #DIV/0! 0.496666667 #DIV/0!
Feb #DIV/0! #DIV/0! 0.496666667 #DIV/0!
Mar 0.496666667 0.496666667
Jun #DIV/0! #DIV/0!
Aug 0.496666667 0.496666667
Dec #DIV/0! #DIV/0! #DIV/0!
Grand Total 0.496666667 #DIV/0! #DIV/0! #DIV/0! 0.496666667 #DIV/0! #DIV/0! #DIV/0! 0.496666667 0.496666667 #DIV/0!
purchase_date retailer normalized_item_name catalog_name category product_type qty unit normalized_quantity normalized_quantity_unit pack_qty size_value size_unit measure_type line_total unit_price net_line_total price_per_each price_per_each_basis price_per_count price_per_count_basis price_per_lb price_per_lb_basis price_per_oz price_per_oz_basis effective_price
8/6/2024 costco BANANA banana produce banana 1 E 3 lb 3 lb weight 1.49 1.49 1.49 1.49 line_total_over_qty 0.4967 parsed_size_lb 0.031 parsed_size_lb_to_oz 0.496666667
12/6/2024 giant BANANA banana produce banana 1 LB weight 0.99 0.99 0.99 line_total_over_qty 0.5893 picked_weight_lb 0.0368 picked_weight_lb_to_oz #DIV/0!
12/12/2024 giant BANANA banana produce banana 1 LB weight 1.37 1.37 1.37 line_total_over_qty 0.5905 picked_weight_lb 0.0369 picked_weight_lb_to_oz #DIV/0!
1/7/2025 giant BANANA banana produce banana 1 LB weight 1.44 1.44 1.44 line_total_over_qty 0.5902 picked_weight_lb 0.0369 picked_weight_lb_to_oz #DIV/0!
1/24/2025 costco BANANA banana produce banana 1 E 3 lb 3 lb weight 1.49 1.49 1.49 1.49 line_total_over_qty 0.4967 parsed_size_lb 0.031 parsed_size_lb_to_oz 0.496666667
2/16/2025 giant BANANA banana produce banana 2 LB weight 2.54 1.27 1.27 line_total_over_qty 0.588 picked_weight_lb 0.0367 picked_weight_lb_to_oz #DIV/0!
2/20/2025 giant BANANA banana produce banana 1 LB weight 1.4 1.4 1.4 line_total_over_qty 0.5907 picked_weight_lb 0.0369 picked_weight_lb_to_oz #DIV/0!
6/25/2025 giant BANANA banana produce banana 1 LB weight 1.29 1.29 1.29 line_total_over_qty 0.589 picked_weight_lb 0.0368 picked_weight_lb_to_oz #DIV/0!
2/14/2026 costco BANANA banana produce banana 1 E 3 lb 3 lb weight 1.49 1.49 1.49 1.49 line_total_over_qty 0.4967 parsed_size_lb 0.031 parsed_size_lb_to_oz 0.496666667
3/12/2026 costco BANANA banana produce banana 2 E 6 lb 3 lb weight 2.98 1.49 2.98 1.49 line_total_over_qty 0.4967 parsed_size_lb 0.031 parsed_size_lb_to_oz 0.496666667
purchase_date retailer normalized_item_name catalog_name category product_type qty unit normalized_quantity normalized_quantity_unit pack_qty size_value size_unit measure_type line_total unit_price net_line_total price_per_each price_per_each_basis price_per_count price_per_count_basis price_per_lb price_per_lb_basis price_per_oz price_per_oz_basis effective_price
9/9/2023 costco BEEF PATTIES 6# BAG beef patty meat hamburger 1 E 1 each each 26.99 26.99 26.99 26.99 line_total_over_qty 26.99
11/26/2025 giant 80% PATTIES PK12 beef patty meat hamburger 1 LB weight 10.05 10.05 10.05 line_total_over_qty 7.7907 picked_weight_lb 0.4869 picked_weight_lb_to_oz #DIV/0!
purchase_date retailer normalized_item_name catalog_name category product_type qty unit normalized_quantity normalized_quantity_unit pack_qty size_value size_unit measure_type line_total unit_price net_line_total price_per_each price_per_each_basis price_per_count price_per_count_basis price_per_lb price_per_lb_basis price_per_oz price_per_oz_basis effective_price
5/26/2025 giant BAGGED ICE bagged ice cubes frozen ice 2 EA 40 lb 20 lb weight 9.98 4.99 4.99 line_total_over_qty 0.2495 parsed_size_lb 0.0156 parsed_size_lb_to_oz 0
6/12/2025 giant BAG ICE CUBED bagged ice cubes frozen ice 1 EA 10 lb 10 lb weight 3.49 3.49 3.49 line_total_over_qty 0.349 parsed_size_lb 0.0218 parsed_size_lb_to_oz 0
9/13/2025 giant BAGGED ICE bagged ice cubes frozen ice 2 EA 20 lb 10 lb weight 6.98 3.49 3.49 line_total_over_qty 0.349 parsed_size_lb 0.0218 parsed_size_lb_to_oz 0
10/10/2025 giant BAGGED ICE bagged ice cubes frozen ice 1 EA 20 lb 20 lb weight 4.99 4.99 4.99 line_total_over_qty 0.2495 parsed_size_lb 0.0156 parsed_size_lb_to_oz 0
```
** evidence
- commit: `605c944`
- tests: `./venv/bin/python -m unittest tests.test_purchases` (fails as expected before implementation: missing `effective_price` in purchases rows)
- datetime: 2026-03-23 12:52:32 EDT
** notes
- Added purchases-level regression coverage for the known comparison cases before implementation: Giant banana, Costco banana, Giant bagged ice, Costco beef patties, and a blank-denominator case.
- The current failure mode is the intended one for this task: `build_purchase_rows()` does not yet emit `effective_price`, so the tests document the missing behavior before `t1.18.1`.
* [X] t1.18.1: fix effective price calculation precedence and blank handling (1-3 commits)
correct purchases/effective price logic for the known broken cases using existing normalized fields
** acceptance criteria
1. when generating `data/purchases.csv`, add `effective_price` = `effective_total` / `normalized_quantity`
2. effective_price uses explicit numerator precedence:
- prefer `net_line_total`
- fallback to `line_total`
3. effective_price uses `normalized_quantity` if not blank
4. effective_price is blank when no valid denominator exists
5. effective_price is never written as `0` or divide-by-zero for missing-basis cases
6. effective_price is only comparable within same `normalized_quantity_unit` unless later analysis converts the units
7. existing regression tests for bananas and ice pass
- pm note: keep this limited to calculation logic; do not broaden into catalog or review changes
** evidence
- commit: `dc0d061`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-23 12:53:34 EDT
** notes
- `effective_price` is now a downstream purchases field only. It does not replace `price_per_lb` / `price_per_each`; it gives one deterministic comparison value based on the existing normalized quantity basis.
- The implemented precedence is: use non-zero `net_line_total` when present, otherwise `line_total`; divide by `normalized_quantity` when that denominator is > 0; otherwise leave blank.
- This keeps the calculation conservative for mixed-quality data: Costco bananas and ice now compute correctly, while rows like Giant patties with no quantity basis stay blank instead of producing `0` or a divide-by-zero artifact.
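The precedence in these notes can be sketched as below. `effective_price` here is a hypothetical stand-alone function, not the code in `build_purchases.py`, and rounding to four decimal places is an assumption made for the example.

```python
from decimal import Decimal, InvalidOperation


def _dec(value):
    """Hypothetical helper: parse to a finite Decimal, or None for blank/invalid."""
    try:
        parsed = Decimal(str(value).strip())
    except InvalidOperation:
        return None
    return parsed if parsed.is_finite() else None


def effective_price(row):
    """Blank string unless both a numerator and a positive denominator exist."""
    net = _dec(row.get("net_line_total", ""))
    # Prefer a non-zero net_line_total; otherwise fall back to line_total.
    numerator = net if net not in (None, Decimal("0")) else _dec(row.get("line_total", ""))
    denominator = _dec(row.get("normalized_quantity", ""))
    if numerator is None or denominator is None or denominator <= 0:
        return ""
    # Four decimal places is an assumption of this sketch.
    return str((numerator / denominator).quantize(Decimal("0.0001")))


costco_banana = {"net_line_total": "1.49", "line_total": "1.49", "normalized_quantity": "3"}
giant_patties = {"net_line_total": "", "line_total": "10.05", "normalized_quantity": ""}

print(effective_price(costco_banana))
print(repr(effective_price(giant_patties)))
```

Returning an empty string rather than `0` keeps missing-basis rows out of spreadsheet averages instead of dragging them toward zero.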
* [X] t1.18.2: fix giant normalization quantity carry-through for weight-based items (1-3 commits)
ensure giant normalization emits usable normalized quantity for known weight-based cases
** acceptance criteria
1. giant bananas populate normalized quantity and unit from deterministic weight basis
2. giant weight-based items that already produce `price_per_lb` also carry enough quantity basis for effective price calculation where supported
3. existing regression tests pass without changing normalized_item_id behavior
4. blanks are preserved only when no deterministic quantity basis exists
- pm note: this task is about normalization carry-through, not fuzzy matching or catalog cleanup
** pm notes
*** banana
giant bananas have picked weight and price_per_oz but normalized missing
| purchase_date | retailer | normalized_item_name | catalog_name | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | net_line_total | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis | effective_price |
| 8/6/2024 | costco | BANANAS 3 LB / 1.36 KG | BANANA | 1 | E | 3 | lb | | 3 | lb | weight | 1.49 | 1.49 | 1.49 | 1.49 | line_total_over_qty | | | 0.4967 | parsed_size_lb | 0.031 | parsed_size_lb_to_oz | $0.50 |
| 12/6/2024 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 0.99 | 0.99 | | 0.99 | line_total_over_qty | | | 0.5893 | picked_weight_lb | 0.0368 | picked_weight_lb_to_oz | |
| 12/12/2024 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 1.37 | 1.37 | | 1.37 | line_total_over_qty | | | 0.5905 | picked_weight_lb | 0.0369 | picked_weight_lb_to_oz | |
| 1/7/2025 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 1.44 | 1.44 | | 1.44 | line_total_over_qty | | | 0.5902 | picked_weight_lb | 0.0369 | picked_weight_lb_to_oz | |
| 1/24/2025 | costco | BANANAS 3 LB / 1.36 KG | BANANA | 1 | E | 3 | lb | | 3 | lb | weight | 1.49 | 1.49 | 1.49 | 1.49 | line_total_over_qty | | | 0.4967 | parsed_size_lb | 0.031 | parsed_size_lb_to_oz | 0.4967 |
| 2/16/2025 | giant | FRESH BANANA | BANANA | 2 | LB | | | | | | weight | 2.54 | 1.27 | | 1.27 | line_total_over_qty | | | 0.588 | picked_weight_lb | 0.0367 | picked_weight_lb_to_oz | |
| 2/20/2025 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 1.4 | 1.4 | | 1.4 | line_total_over_qty | | | 0.5907 | picked_weight_lb | 0.0369 | picked_weight_lb_to_oz | |
| 6/25/2025 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 1.29 | 1.29 | | 1.29 | line_total_over_qty | | | 0.589 | picked_weight_lb | 0.0368 | picked_weight_lb_to_oz | |
| 2/14/2026 | costco | BANANAS 3 LB / 1.36 KG | BANANA | 1 | E | 3 | lb | | 3 | lb | weight | 1.49 | 1.49 | 1.49 | 1.49 | line_total_over_qty | | | 0.4967 | parsed_size_lb | 0.031 | parsed_size_lb_to_oz | 0.4967 |
| 3/12/2026 | costco | BANANAS 3 LB / 1.36 KG | BANANA | 2 | E | 6 | lb | | 3 | lb | weight | 2.98 | 1.49 | 2.98 | 1.49 | line_total_over_qty | | | 0.4967 | parsed_size_lb | 0.031 | parsed_size_lb_to_oz | 0.4967 |
*** beef patty
beef patty by weight not made into effective price
| purchase_date | retailer | normalized_item_name | product_type | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | matched_discount_amount | net_line_total | store_name | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis | effective_price |
| 9/9/2023 | costco | BEEF PATTIES 6# BAG | hamburger | 1 | E | 1 | each | | | | each | 26.99 | 26.99 | | 26.99 | MT VERNON | 26.99 | line_total_over_qty | | | | | | | $26.99 |
| 11/26/2025 | giant | PATTIES PK12 | hamburger | 1 | LB | | | | | | weight | 10.05 | 10.05 | | | Giant Food | 10.05 | line_total_over_qty | | | 7.7907 | picked_weight_lb | 0.4869 | picked_weight_lb_to_oz | |
** evidence
- commit: `23dfc3d` `Use picked weight for Giant quantity basis`
- tests: `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_purchases`; `./venv/bin/python normalize_giant_web.py`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-23 13:22:47 EDT
** notes
- Giant loose-weight rows already had deterministic `picked_weight` and `price_per_lb`; this task reuses that basis when parsed size/pack is absent.
- Parsed package size still wins when present, so fixed-size products keep their original comparison basis and `normalized_item_id` behavior does not change.
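A minimal sketch of the precedence described in these notes, assuming rows are plain dicts of strings. The helper name `choose_quantity_basis` and the exact `picked_weight` field name are illustrative assumptions, not the pipeline's actual API:

```python
def choose_quantity_basis(row):
    """Return (quantity, unit, basis) for one normalized row.

    Hypothetical helper: field names follow the task notes, the function
    itself is an illustration and not part of the real pipeline.
    """
    size_value = (row.get("size_value") or "").strip()
    size_unit = (row.get("size_unit") or "").strip()
    picked_weight = (row.get("picked_weight") or "").strip()
    qty = float(row.get("qty") or 1)
    pack_qty = float(row.get("pack_qty") or 1)

    # Parsed package size wins when present, so fixed-size products
    # keep their original comparison basis.
    if size_value and size_unit:
        return float(size_value) * pack_qty * qty, size_unit, "parsed_size"

    # Otherwise reuse the picked weight for loose-weight rows.
    if picked_weight and float(picked_weight) > 0:
        return float(picked_weight), "lb", "picked_weight"

    # No valid basis: leave the quantity blank.
    return None, "", ""
```

With this ordering, the Costco `BANANAS 3 LB` row bought at `qty=2` resolves to 6 lb from the parsed size, while a Giant loose-weight row with no parsed size falls back to its picked weight.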
* [X] t1.18.3: fix costco normalization quantity carry-through for weight-based items (1-3 commits)
** acceptance criteria
1. add regression tests covering known broken Costco quantity-basis cases before changing parser logic
2. Costco normalization correctly parses explicit weight-bearing package text into normalized quantity fields for known cases such as:
- `25# FLOUR ALL-PURPOSE HARV ...` -> `normalized_quantity=25`, `normalized_quantity_unit=lb`, `measure_type=weight`
3. corrected Costco normalized rows carry through to `data/purchases.csv` without changing `normalized_item_id` behavior
4. `effective_price` for corrected Costco rows uses the same rule already established for Giant:
- use `net_line_total` when present, otherwise `line_total`
- divide by `normalized_quantity` when `normalized_quantity > 0`
- leave blank when no valid denominator exists
5. rerun output verifies the broken Costco flour examples no longer behave like `each` items and now produce non-blank weight-based effective prices
6. keep this task limited to the identified Costco parsing failures; do not broaden into catalog cleanup or fuzzy matching
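The `effective_price` rule in criterion 4 can be sketched as below. This is an illustration only: the function name and the four-decimal formatting are assumptions, and rows are assumed to be dicts of strings as read from CSV.

```python
def effective_price(row):
    """Return effective price as a string, or "" when it cannot be computed."""
    # Use net_line_total when present, otherwise line_total.
    total = (row.get("net_line_total") or "").strip() or (row.get("line_total") or "").strip()
    quantity = (row.get("normalized_quantity") or "").strip()
    try:
        numerator = float(total)
        denominator = float(quantity)
    except ValueError:
        # A blank or non-numeric value means there is no valid price.
        return ""
    if denominator <= 0:
        # Leave blank rather than writing 0 or dividing by zero.
        return ""
    # Four decimals is an assumed display precision; trim trailing zeros.
    return f"{numerator / denominator:.4f}".rstrip("0").rstrip(".")
```

For the 25 lb flour row at 8.79, a corrected `normalized_quantity` of 25 would give 0.3516 per lb instead of the `each` price of 8.79.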
*** All Purpose Flour
Costco 25# FLOUR is not parsed into a normalized weight - `measure_type` says `each`
| purchase_date | retailer | normalized_item_name | catalog_name | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | matched_discount_amount | net_line_total | store_name | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis | effective_price | is_discount_line | is_coupon_line | is_fee | raw_order_path | |
| 9/9/2023 | costco | 10LB BAKERS 4.5KG / 10 LB | all purpose flour | 1 | E | 10 | lb | | 10 | lb | weight | 5.99 | 5.99 | | 5.99 | VA | 5.99 | line_total_over_qty | | | 0.599 | parsed_size_lb | 0.0374 | parsed_size_lb_to_oz | $0.60 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500603752309091647-2023-09-09T16-47-00.json | |
| 8/6/2024 | costco | 10LB BAKERS 4.5KG / 10 LB | all purpose flour | 1 | E | 10 | lb | | 10 | lb | weight | 5.29 | 5.29 | | 5.29 | VA | 5.29 | line_total_over_qty | | | 0.529 | parsed_size_lb | 0.0331 | parsed_size_lb_to_oz | $0.53 | FALSE | FALSE | FALSE | data/costco-web/raw/21111520101732408061704-2024-08-06T17-04-00.json | |
| 11/29/2024 | costco | 25# FLOUR ALL-PURPOSE HARV P98/100 | all purpose flour | 1 | E | 1 | each | | | | each | 8.79 | 8.79 | | 8.79 | VA | 8.79 | line_total_over_qty | | | | | | | $8.79 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500803392411291626-2024-11-29T16-26-00.json | |
| 12/14/2024 | costco | KS ORG FLOUR 2/10 LB P112 | all purpose flour | 1 | E | 20 | lb | 2 | 10 | lb | weight | 17.99 | 17.99 | | 17.99 | VA | 17.99 | line_total_over_qty | 8.995 | line_total_over_pack_qty | 0.8995 | parsed_size_lb | 0.0562 | parsed_size_lb_to_oz | 0.8995 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500301442412141209-2024-12-14T12-09-00.json | |
| 12/14/2024 | costco | 10LB BAKERS 4.5KG / 10 LB | all purpose flour | 1 | E | 10 | lb | | 10 | lb | weight | 5.49 | 5.49 | | 5.49 | VA | 5.49 | line_total_over_qty | | | 0.549 | parsed_size_lb | 0.0343 | parsed_size_lb_to_oz | 0.549 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500301442412141209-2024-12-14T12-09-00.json | |
| 1/10/2025 | costco | 10LB BAKERS 4.5KG / 10 LB | all purpose flour | 1 | E | 10 | lb | | 10 | lb | weight | 5.49 | 5.49 | | 5.49 | VA | 5.49 | line_total_over_qty | | | 0.549 | parsed_size_lb | 0.0343 | parsed_size_lb_to_oz | 0.549 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500702462501101630-2025-01-10T16-30-00.json | |
| 1/10/2025 | costco | KS ORG FLOUR 2/10 LB P112 | all purpose flour | 1 | E | 20 | lb | 2 | 10 | lb | weight | 17.99 | 17.99 | | 17.99 | VA | 17.99 | line_total_over_qty | 8.995 | line_total_over_pack_qty | 0.8995 | parsed_size_lb | 0.0562 | parsed_size_lb_to_oz | 0.8995 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500702462501101630-2025-01-10T16-30-00.json | |
| 1/31/2026 | giant | SB FLOUR ALL PRPSE 5LB | all purpose flour | 1 | EA | 5 | lb | | 5 | lb | weight | 3.39 | 3.39 | | | VA | 3.39 | line_total_over_qty | | | 0.678 | parsed_size_lb | 0.0424 | parsed_size_lb_to_oz | 0.678 | FALSE | FALSE | FALSE | data/giant-web/raw/697f42031c28e23df08d95f9.json | |
| 3/12/2026 | costco | 25# FLOUR ALL-PURPOSE HARV P98/100 | all purpose flour | 1 | E | 1 | each | | | | each | 9.49 | 9.49 | | 9.49 | VA | 9.49 | line_total_over_qty | | | | | | | 9.49 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500804012603121616-2026-03-12T16-16-00.json | |
** evidence
- commit: `7317611` `Fix Costco hash-size weight parsing`
- tests: `./venv/bin/python -m unittest tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_costco_web.py`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-23 13:56:38 EDT
** notes
- Costco `25#` weight text was falling through to `each` because the hash-size parser missed sizes followed by whitespace.
- This fix is intentionally narrow: explicit `#`-weight parsing now feeds the existing quantity and effective-price flow without changing `normalized_item_id` behavior.
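One way a hash-size pattern can accept a size followed by whitespace is sketched below. The regex and the helper `parse_hash_weight` are assumptions for illustration; the actual parser in `normalize_costco_web.py` is not shown here and may differ.

```python
import re

# A number immediately followed by "#", where the "#" is followed by
# whitespace or end of string. The lookbehind avoids matching the tail
# of a longer alphanumeric token.
HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9.])(\d+(?:\.\d+)?)\s*#(?=\s|$)")


def parse_hash_weight(item_name):
    """Return normalized weight fields for '<n>#' text, or None if absent."""
    match = HASH_SIZE_RE.search(item_name.upper())
    if not match:
        return None
    return {
        "normalized_quantity": float(match.group(1)),
        "normalized_quantity_unit": "lb",
        "measure_type": "weight",
    }
```

Applied to `25# FLOUR ALL-PURPOSE HARV P98/100`, the sketch yields a quantity of 25 with unit `lb`, which matches the acceptance criterion above.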
* [X] t1.18.4: clean purchases output and finalize effective price fields (2-4 commits)
make `purchases.csv` easier to inspect and ensure price fields support weighted cost analysis
** acceptance criteria
1. reorder `data/purchases.csv` columns for human inspection, with analysis fields first:
- `purchase_date`
- `retailer`
- `catalog_name`
- `product_type`
- `category`
- `net_line_total`
- `normalized_quantity`
- `effective_price`
- `effective_price_unit`
- followed by order/item/provenance fields
2. populate `net_line_total` for all purchase rows:
- preserve existing `net_line_total` when already populated
- otherwise, derive `net_line_total = line_total + matched_discount_amount` when discount exists
- else `net_line_total = line_total`
3. compute `effective_price` from `net_line_total / normalized_quantity` when `normalized_quantity > 0`
4. add `effective_price_unit` and populate it consistently from the normalized quantity basis
5. preserve blanks rather than writing `0` or divide-by-zero when no valid denominator exists
- pm note: this task is about final purchase output correctness and usability, not review/catalog logic
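The `net_line_total` fill order above might look like the following sketch. It assumes `matched_discount_amount` is stored as a negative number (which is what makes the `+` in the rule a subtraction) and that rows are dicts of strings; `fill_net_line_total` is a hypothetical name.

```python
def fill_net_line_total(row):
    """Return the net_line_total string for one purchase row."""
    # 1. Preserve an existing value from normalization.
    existing = (row.get("net_line_total") or "").strip()
    if existing:
        return existing

    line_total = (row.get("line_total") or "").strip()
    if not line_total:
        # Nothing to derive from; keep the field blank.
        return ""

    # 2. Derive from line_total plus the matched discount when one exists.
    #    Assumption: discounts are stored as negative amounts.
    discount = (row.get("matched_discount_amount") or "").strip()
    if discount:
        return f"{float(line_total) + float(discount):.2f}"

    # 3. Otherwise net equals the line total.
    return line_total
```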
** evidence
- commit: `a45522c` `Finalize purchase effective price fields`
- tests: `./venv/bin/python -m unittest tests.test_purchases`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-23 15:27:42 EDT
** notes
- `purchases.csv` now carries a filled `net_line_total` for every row, preserving existing values from normalization and deriving the rest from `line_total` plus matched discounts.
- `effective_price_unit` now mirrors the normalized quantity basis, so downstream analysis can tell whether an `effective_price` is per `lb`, `oz`, `count`, or `each`.
* [X] t1.19: make review_products.py robust to orphaned and incomplete catalog links (2-4 commits)
refresh review state from the current normalized universe so missing or broken links re-enter review instead of silently disappearing
** acceptance criteria
1. `review_products.py` regenerates review candidates from the current normalized item universe (`data/<provider>/normalized_items.csv`), not just previously queued items
2. items are added or re-added to review when:
- they have no valid `catalog_id`
- their linked `catalog_id` no longer exists
- their linked catalog row does not have both `catalog_name` and `product_type`
3. `review_products.py` compares and reconciles:
- current normalized items
- current product_links
- current catalog
- current review_queue
4. rerunning review after manual cleanup of `product_links.csv` or `catalog.csv` surfaces newly orphaned normalized items
5. unresolved items remain visible and are not silently dropped from review or purchases accounting
- pm note: keep the logic explicit and auditable; this is a refresh/reconciliation task, not a new matching system
** evidence
- commit: `8ccf3ff` `Reconcile review queue against current catalog state`
- tests: `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`
- datetime: 2026-03-23 15:32:29 EDT
** notes
- `review_products.py` now rebuilds its queue from the current normalized files and order files instead of trusting stale `purchases.csv` state.
- Missing catalog rows and incomplete catalog rows now re-enter review explicitly as `orphaned_catalog_link` or `incomplete_catalog_link`, and excluded rows no longer inflate unresolved-not-in-review accounting.
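The three re-entry conditions map onto reason codes roughly as in this sketch. The reason codes and the required fields come from the task; the standalone function `classify_link` is a hypothetical condensation, since the real logic lives inside `build_review_queue`.

```python
REQUIRED_CATALOG_FIELDS = ("catalog_name", "product_type")


def classify_link(catalog_id, catalog_lookup):
    """Return "" for a valid link, otherwise the review reason code."""
    if not catalog_id:
        # No link at all.
        return "missing_catalog_link"
    catalog_row = catalog_lookup.get(catalog_id)
    if catalog_row is None:
        # The link points at a catalog row that no longer exists.
        return "orphaned_catalog_link"
    if not all((catalog_row.get(field) or "").strip() for field in REQUIRED_CATALOG_FIELDS):
        # The catalog row exists but lacks a required field.
        return "incomplete_catalog_link"
    return ""
```

Keeping the check as a pure function of the current catalog state means that deleting or blanking a row in `catalog.csv` changes the result on the next refresh, with no stored state to invalidate.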
* [X] t1.20: add visit-level fields and outputs for spend analysis (2-4 commits)
ensure purchases retains enough visit/order context to support spend-by-visit and store-level analysis
** acceptance criteria
1. `data/purchases.csv` retains or adds the visit/order fields needed for visit analysis:
- `order_id`
- `purchase_date`
- `store_name`
- `store_number`
- `store_city`
- `store_state`
- `retailer`
2. purchases output supports these analyses without additional joins:
- spend by visit
- items per visit
- category spend by visit
- retailer/store breakdown
3. documentation or task notes make clear that `purchases.csv` is the primary analysis artifact for both item-level and visit-level reporting
- pm note: do not build dash/plotly here; this task is only about carrying the right data through
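As an illustration of "without additional joins", spend and row counts per visit can be computed from purchase rows alone. This is a sketch under assumptions: the function name is hypothetical, rows are dicts such as those produced by `csv.DictReader`, and every row with an amount is counted (the real analysis may filter fee or discount lines).

```python
from collections import defaultdict


def spend_by_visit(rows):
    """Aggregate spend and row counts per (retailer, order_id, purchase_date)."""
    visits = defaultdict(lambda: {"spend": 0.0, "item_rows": 0})
    for row in rows:
        key = (
            row.get("retailer", ""),
            row.get("order_id", ""),
            row.get("purchase_date", ""),
        )
        # Prefer the net amount; fall back to the line total.
        amount = (row.get("net_line_total") or row.get("line_total") or "").strip()
        if not amount:
            continue
        visits[key]["spend"] += float(amount)
        visits[key]["item_rows"] += 1
    # Round once at the end to avoid accumulating rounding error.
    return {
        key: {"spend": round(value["spend"], 2), "item_rows": value["item_rows"]}
        for key, value in sorted(visits.items())
    }
```

Feeding it the rows of `purchases.csv` read with `csv.DictReader` would give one entry per visit; adding `store_name` or `category` to the key gives the store-level and category-by-visit breakdowns.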
** evidence
- commit: `6940f16` `Document visit-level purchase analysis`
- tests: `./venv/bin/python -m unittest tests.test_purchases`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-24 08:29:13 EDT
** notes
- The needed visit fields were already flowing through `build_purchases.py`; this task locked them in with explicit tests and documentation instead of adding a new visit layer.
- `data/analysis/purchases.csv` is now documented as the primary analysis artifact for both item-level and visit-level work.
* [X] t1.21: add lightweight charting/analysis surface on top of purchases.csv (2-4 commits)
build a minimal analysis layer for common price and visit charts without changing the csv pipeline
** acceptance criteria
1. support charting of:
- item price over time
- spend by visit
- items per visit
- category spend over time
- retailer/store comparison
2. use `data/purchases.csv` as the source of truth
3. keep excel/pivot compatibility intact
- pm note: thin reader layer only; do not move business logic out of the pipeline
** evidence
- commit: `46a3b2c` `Add purchase analysis summaries`
- tests: `./venv/bin/python -m unittest tests.test_analyze_purchases tests.test_purchases`; `./venv/bin/python analyze_purchases.py`
- datetime: 2026-03-24 16:48:41 EDT
** notes
- The new layer is file-based, not notebook- or dashboard-based: `analyze_purchases.py` reads `data/analysis/purchases.csv` and writes chart-ready CSVs under `data/analysis/`.
- This keeps Excel/pivot workflows intact while still giving a repeatable CLI path for common price, visit, category, and retailer/store summaries.
* [X] t1.22: clean up and finalize post-refactor layout before merging refactor/enrich into cx (3-6 commits)
remove transitional detritus from the repo and make the final folder/script layout explicit before merging back into `cx`
** acceptance criteria
1. move `catalog.csv` alongside the other step-3 review artifacts under `data/review/`
- update active scripts, tests, docs, and task notes to match the chosen path
2. promote analysis to a top-level step-4 folder such as `data/analysis/`
- add `purchases.csv` to this folder
- update active scripts, tests, docs, and task notes to match the chosen path
3. remove obsolete or superseded Python files
- includes old `scrape_*`, `enrich_*`, `build_*`, and proof/check scripts as appropriate
- do not remove files still required by the active collect/normalize/review/analysis pipeline
4. active repo entrypoints are reduced to the intended flow and are easy to identify, including:
- retailer collection
- retailer normalization
- review/combine
- status/reporting
- analysis
5. tests pass after removals and path decisions
6. README reflects the final post-refactor structure and run order without legacy ambiguity
7. `pm/data-model.org` and `pm/tasks.org` reflect the final chosen layout
- pm note: prefer deleting true detritus over keeping compatibility shims now that the refactor path is established
- pm note: make folder decisions once here so we stop carrying path churn into later tasks
** evidence
- commit: `09829b2` `Finalize post-refactor layout and remove old pipeline files`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python analyze_purchases.py`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`
- datetime: 2026-03-24 17:09:45 EDT
** notes
- Final layout decision: `catalog.csv` now lives under `data/review/`, while `purchases.csv` and the chart-ready analysis outputs live under the step-4 `data/analysis/` folder.
- Removed obsolete top-level pipeline files and their dead tests so the active entrypoints are now the collect, normalize, review/combine, status, and analysis scripts only.
* [X] t1.22.1: remove unneeded python deps
** acceptance criteria
1. update requirements.txt to add/remove necessary python libs
2. keep only direct runtime deps in requirements.txt; transitive deps should not be pinned unless imported directly
** evidence
- commit: `867275c` `Trim requirements to direct runtime deps`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python analyze_purchases.py --help`
- datetime: 2026-03-24 17:25:39 EDT
** notes
- `requirements.txt` now keeps only direct runtime deps imported by the active pipeline: `browser-cookie3`, `click`, `curl_cffi`, and `python-dotenv`.
- Low-level support packages such as `cffi`, `jeepney`, `lz4`, `pycryptodomex`, and `certifi` are left to transitive installation instead of being pinned directly.
* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
** acceptance criteria


@@ -27,9 +27,11 @@ def build_status_summary(
costco_enriched,
purchases,
resolutions,
links,
catalog,
):
normalized_rows = giant_enriched + costco_enriched
queue_rows = review_products.build_review_queue(purchases, resolutions)
queue_rows = review_products.build_review_queue(purchases, resolutions, links, catalog, [])
queue_ids = {row["normalized_item_id"] for row in queue_rows}
unresolved_purchase_rows = [
@@ -37,6 +39,7 @@ def build_status_summary(
for row in purchases
if row.get("normalized_item_id")
and not row.get("catalog_id")
and row.get("resolution_action") != "exclude"
and row.get("is_fee") != "true"
and row.get("is_discount_line") != "true"
and row.get("is_coupon_line") != "true"
@@ -82,8 +85,10 @@ def build_status_summary(
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--costco-items-csv", default="data/costco-web/collected_items.csv", show_default=True)
@click.option("--costco-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--purchases-csv", default="data/review/purchases.csv", show_default=True)
@click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
@click.option("--summary-csv", default="data/review/pipeline_status.csv", show_default=True)
@click.option("--summary-json", default="data/review/pipeline_status.json", show_default=True)
def main(
@@ -95,6 +100,8 @@ def main(
costco_enriched_csv,
purchases_csv,
resolutions_csv,
links_csv,
catalog_csv,
summary_csv,
summary_json,
):
@@ -107,6 +114,8 @@ def main(
read_rows_if_exists(costco_enriched_csv),
read_rows_if_exists(purchases_csv),
[build_purchases.normalize_resolution_row(row) for row in read_rows_if_exists(resolutions_csv)],
[build_purchases.normalize_link_row(row) for row in read_rows_if_exists(links_csv)],
[build_purchases.normalize_catalog_row(row) for row in read_rows_if_exists(catalog_csv)],
)
write_csv_rows(summary_csv, summary_rows, SUMMARY_FIELDS)
summary_json_path = Path(summary_json)


@@ -1,10 +1,4 @@
browser-cookie3==0.20.1
certifi==2026.2.25
cffi==2.0.0
click==8.3.1
curl_cffi==0.14.0
jeepney==0.9.0
lz4==4.4.5
pycparser==3.0
pycryptodomex==3.23.0
python-dotenv==1.1.1


@@ -1,5 +1,6 @@
from collections import defaultdict
from datetime import date
import re
import click
@@ -29,11 +30,48 @@ QUEUE_FIELDS = [
INFO_COLOR = "cyan"
PROMPT_COLOR = "bright_yellow"
WARNING_COLOR = "magenta"
TOKEN_RE = re.compile(r"[A-Z0-9]+")
REQUIRED_CATALOG_FIELDS = ("catalog_name", "product_type")
def build_review_queue(purchase_rows, resolution_rows):
def print_intro_text():
click.secho("Review guide:", fg=INFO_COLOR)
click.echo(" catalog name: unique product identity including variant, but not packaging")
click.echo(" product type: general product you want to compare across purchases")
click.echo(" category: broad analysis bucket such as dairy, produce, or frozen")
def has_complete_catalog_row(catalog_row):
if not catalog_row:
return False
return all(catalog_row.get(field, "").strip() for field in REQUIRED_CATALOG_FIELDS)
def load_queue_lookup(queue_rows):
lookup = {}
for row in queue_rows:
normalized_item_id = row.get("normalized_item_id", "")
if normalized_item_id:
lookup[normalized_item_id] = row
return lookup
def build_review_queue(
purchase_rows,
resolution_rows,
link_rows=None,
catalog_rows=None,
existing_queue_rows=None,
):
by_normalized = defaultdict(list)
resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
link_lookup = build_purchases.load_link_lookup(link_rows or [])
catalog_lookup = {
row.get("catalog_id", ""): build_purchases.normalize_catalog_row(row)
for row in (catalog_rows or [])
if row.get("catalog_id", "")
}
queue_lookup = load_queue_lookup(existing_queue_rows or [])
for row in purchase_rows:
normalized_item_id = row.get("normalized_item_id", "")
@@ -45,30 +83,40 @@ def build_review_queue(purchase_rows, resolution_rows):
queue_rows = []
for normalized_item_id, rows in sorted(by_normalized.items()):
current_resolution = resolution_lookup.get(normalized_item_id, {})
if current_resolution.get("status") == "approved":
if current_resolution.get("status") == "approved" and current_resolution.get("resolution_action") == "exclude":
continue
existing_queue_row = queue_lookup.get(normalized_item_id, {})
linked_catalog_id = current_resolution.get("catalog_id") or link_lookup.get(normalized_item_id, {}).get("catalog_id", "")
linked_catalog_row = catalog_lookup.get(linked_catalog_id, {})
has_valid_catalog_link = bool(linked_catalog_id and has_complete_catalog_row(linked_catalog_row))
unresolved_rows = [
row
for row in rows
if not row.get("catalog_id")
and row.get("is_item", "true") != "false"
if row.get("is_item", "true") != "false"
and row.get("is_fee") != "true"
and row.get("is_discount_line") != "true"
and row.get("is_coupon_line") != "true"
]
if not unresolved_rows:
if not unresolved_rows or has_valid_catalog_link:
continue
retailers = sorted({row["retailer"] for row in rows})
review_id = stable_id("rvw", normalized_item_id)
reason_code = "missing_catalog_link"
if linked_catalog_id and linked_catalog_id not in catalog_lookup:
reason_code = "orphaned_catalog_link"
elif linked_catalog_id and not has_complete_catalog_row(linked_catalog_row):
reason_code = "incomplete_catalog_link"
queue_rows.append(
{
"review_id": review_id,
"retailer": " | ".join(retailers),
"normalized_item_id": normalized_item_id,
"catalog_id": current_resolution.get("catalog_id", ""),
"reason_code": "missing_catalog_link",
"catalog_id": linked_catalog_id,
"reason_code": reason_code,
"priority": "high",
"raw_item_names": compact_join(
sorted({row["raw_item_name"] for row in rows if row["raw_item_name"]}),
@@ -93,10 +141,13 @@ def build_review_queue(purchase_rows, resolution_rows):
limit=8,
),
"seen_count": str(len(rows)),
"status": current_resolution.get("status", "pending"),
"resolution_action": current_resolution.get("resolution_action", ""),
"resolution_notes": current_resolution.get("resolution_notes", ""),
"created_at": current_resolution.get("reviewed_at", today_text),
"status": existing_queue_row.get("status") or current_resolution.get("status", "pending"),
"resolution_action": existing_queue_row.get("resolution_action")
or current_resolution.get("resolution_action", ""),
"resolution_notes": existing_queue_row.get("resolution_notes")
or current_resolution.get("resolution_notes", ""),
"created_at": existing_queue_row.get("created_at")
or current_resolution.get("reviewed_at", today_text),
"updated_at": today_text,
}
)
@@ -111,6 +162,10 @@ def save_catalog_rows(path, rows):
write_csv_rows(path, rows, build_purchases.CATALOG_FIELDS)
def save_link_rows(path, rows):
write_csv_rows(path, rows, build_purchases.PRODUCT_LINK_FIELDS)
def sort_related_items(rows):
return sorted(
rows,
@@ -123,6 +178,13 @@ def sort_related_items(rows):
)
def tokenize_match_text(*values):
tokens = set()
for value in values:
tokens.update(TOKEN_RE.findall((value or "").upper()))
return tokens
def build_catalog_suggestions(related_rows, purchase_rows, catalog_rows, limit=3):
normalized_names = {
row.get("normalized_item_name", "").strip().upper()
@@ -179,23 +241,122 @@ def build_catalog_suggestions(related_rows, purchase_rows, catalog_rows, limit=3
return suggestions
def search_catalog_rows(query, catalog_rows, purchase_rows, current_normalized_item_id, limit=10):
query_tokens = tokenize_match_text(query)
if not query_tokens:
return []
linked_purchase_counts = defaultdict(int)
linked_normalized_ids = defaultdict(set)
current_catalog_id = ""
for row in purchase_rows:
catalog_id = row.get("catalog_id", "")
normalized_item_id = row.get("normalized_item_id", "")
if catalog_id and normalized_item_id:
linked_purchase_counts[catalog_id] += 1
linked_normalized_ids[catalog_id].add(normalized_item_id)
if normalized_item_id == current_normalized_item_id and catalog_id:
current_catalog_id = catalog_id
ranked_rows = []
for row in catalog_rows:
catalog_id = row.get("catalog_id", "")
if not catalog_id or catalog_id == current_catalog_id:
continue
catalog_tokens = tokenize_match_text(
row.get("catalog_name", ""),
row.get("product_type", ""),
row.get("variant", ""),
)
overlap = query_tokens & catalog_tokens
if not overlap:
continue
ranked_rows.append(
{
"catalog_id": catalog_id,
"catalog_name": row.get("catalog_name", ""),
"product_type": row.get("product_type", ""),
"category": row.get("category", ""),
"variant": row.get("variant", ""),
"linked_normalized_items": len(linked_normalized_ids.get(catalog_id, set())),
"linked_purchase_rows": linked_purchase_counts.get(catalog_id, 0),
"score": len(overlap),
}
)
ranked_rows.sort(
key=lambda row: (-row["score"], row["catalog_name"], row["catalog_id"])
)
return ranked_rows[:limit]
def suggestion_display_rows(suggestions, purchase_rows, catalog_rows):
linked_purchase_counts = defaultdict(int)
linked_normalized_ids = defaultdict(set)
for row in purchase_rows:
catalog_id = row.get("catalog_id", "")
normalized_item_id = row.get("normalized_item_id", "")
if not catalog_id or not normalized_item_id:
continue
linked_purchase_counts[catalog_id] += 1
linked_normalized_ids[catalog_id].add(normalized_item_id)
display_rows = []
catalog_details = {
row["catalog_id"]: {
"product_type": row.get("product_type", ""),
"category": row.get("category", ""),
}
for row in catalog_rows
if row.get("catalog_id")
}
for row in purchase_rows:
if row.get("catalog_id"):
catalog_details.setdefault(
row["catalog_id"],
{
"product_type": row.get("product_type", ""),
"category": row.get("category", ""),
},
)
for row in suggestions:
catalog_id = row["catalog_id"]
details = catalog_details.get(catalog_id, {})
display_rows.append(
{
**row,
"product_type": details.get("product_type", ""),
"category": details.get("category", ""),
"linked_purchase_rows": linked_purchase_counts.get(catalog_id, 0),
"linked_normalized_items": len(linked_normalized_ids.get(catalog_id, set())),
}
)
return display_rows
def print_catalog_rows(rows):
for index, row in enumerate(rows, start=1):
click.echo(
f" [{index}] {row['catalog_name']}, {row.get('product_type', '')}, "
f"{row.get('category', '')} ({row['linked_normalized_items']} items, "
f"{row['linked_purchase_rows']} rows)"
)
def build_display_lines(related_rows):
lines = []
for index, row in enumerate(sort_related_items(related_rows), start=1):
lines.append(
" [{index}] {purchase_date} | {line_total} | {raw_item_name} | {normalized_item_name} | "
"{upc} | {retailer}".format(
" [{index}] {raw_item_name} | {retailer} | {purchase_date} | {line_total} | {image_url}".format(
index=index,
raw_item_name=row.get("raw_item_name", ""),
retailer=row.get("retailer", ""),
purchase_date=row.get("purchase_date", ""),
line_total=row.get("line_total", ""),
raw_item_name=row.get("raw_item_name", ""),
normalized_item_name=row.get("normalized_item_name", ""),
upc=row.get("upc", ""),
retailer=row.get("retailer", ""),
image_url=row.get("image_url", ""),
)
)
if row.get("image_url"):
lines.append(f" {row['image_url']}")
if not lines:
lines.append(" [1] no matched item rows found")
return lines
@@ -215,8 +376,7 @@ def choose_existing_catalog(display_rows, normalized_name, matched_count):
f"Select the catalog_name to associate {matched_count} items with:",
fg=INFO_COLOR,
)
for index, row in enumerate(display_rows, start=1):
click.echo(f" [{index}] {row['catalog_name']} | {row['catalog_id']}")
print_catalog_rows(display_rows)
choice = click.prompt(
click.style("selection", fg=PROMPT_COLOR),
type=click.IntRange(1, len(display_rows)),
@@ -241,13 +401,16 @@ def choose_existing_catalog(display_rows, normalized_name, matched_count):
def prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total):
suggestions = build_catalog_suggestions(related_rows, purchase_rows, catalog_rows)
suggestions = suggestion_display_rows(
build_catalog_suggestions(related_rows, purchase_rows, catalog_rows),
purchase_rows,
catalog_rows,
)
normalized_name = normalized_label(queue_row, related_rows)
matched_count = len(related_rows)
click.echo("")
click.secho(
f"Review {queue_index}/{queue_total}: Resolve normalized_item {normalized_name} "
"to catalog_name [__]?",
f"Review {queue_index}/{queue_total}: {normalized_name}",
fg=INFO_COLOR,
)
click.echo(f"{matched_count} matched items:")
@@ -255,12 +418,30 @@ def prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queu
click.echo(line)
if suggestions:
click.echo(f"{len(suggestions)} catalog_name suggestions found:")
for index, suggestion in enumerate(suggestions, start=1):
click.echo(f" [{index}] {suggestion['catalog_name']}")
print_catalog_rows(suggestions)
else:
click.echo("no catalog_name suggestions found")
click.secho("[l]ink existing [n]ew catalog e[x]clude [s]kip [q]uit:", fg=PROMPT_COLOR)
action = click.prompt("", type=click.Choice(["l", "n", "x", "s", "q"]), prompt_suffix=" ")
prompt_bits = []
if suggestions:
prompt_bits.append("[#] link to suggestion")
prompt_bits.extend(["[f]ind", "[n]ew", "[s]kip", "e[x]clude", "[q]uit"])
click.secho(" ".join(prompt_bits) + " >", fg=PROMPT_COLOR)
action = click.prompt("", type=str, prompt_suffix=" ").strip().lower()
if action.isdigit() and suggestions:
choice = int(action)
if 1 <= choice <= len(suggestions):
chosen_row = suggestions[choice - 1]
notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
return {
"normalized_item_id": queue_row["normalized_item_id"],
"catalog_id": chosen_row["catalog_id"],
"resolution_action": "link",
"status": "approved",
"resolution_notes": notes,
"reviewed_at": str(date.today()),
}, None
click.secho("invalid suggestion number", fg=WARNING_COLOR)
return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
if action == "q":
return None, None
if action == "s":
@@ -272,6 +453,43 @@ def prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queu
"resolution_notes": queue_row.get("resolution_notes", ""),
"reviewed_at": str(date.today()),
}, None
if action == "f":
while True:
query = click.prompt(click.style("search", fg=PROMPT_COLOR), default="", show_default=False).strip()
if not query:
return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
search_rows = search_catalog_rows(
query,
catalog_rows,
purchase_rows,
queue_row["normalized_item_id"],
)
if not search_rows:
click.echo("no matches found")
retry = click.prompt(
click.style("search again? [enter=yes, q=no]", fg=PROMPT_COLOR),
default="",
show_default=False,
).strip().lower()
if retry == "q":
return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
continue
click.echo(f"{len(search_rows)} search results found:")
print_catalog_rows(search_rows)
choice = click.prompt(
click.style("selection", fg=PROMPT_COLOR),
type=click.IntRange(1, len(search_rows)),
)
chosen_row = search_rows[choice - 1]
notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
return {
"normalized_item_id": queue_row["normalized_item_id"],
"catalog_id": chosen_row["catalog_id"],
"resolution_action": "link",
"status": "approved",
"resolution_notes": notes,
"reviewed_at": str(date.today()),
}, None
if action == "x":
notes = click.prompt(click.style("exclude notes", fg=PROMPT_COLOR), default="", show_default=False)
return {
@@ -282,45 +500,13 @@ def prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queu
"resolution_notes": notes,
"reviewed_at": str(date.today()),
}, None
if action == "l":
display_rows = suggestions or [
{
"catalog_id": row["catalog_id"],
"catalog_name": row["catalog_name"],
"reason": "catalog sample",
}
for row in catalog_rows[:10]
if row.get("catalog_id")
]
while True:
catalog_id, outcome = choose_existing_catalog(display_rows, normalized_name, matched_count)
if outcome == "skip":
return {
"normalized_item_id": queue_row["normalized_item_id"],
"catalog_id": "",
"resolution_action": "skip",
"status": "pending",
"resolution_notes": queue_row.get("resolution_notes", ""),
"reviewed_at": str(date.today()),
}, None
if outcome == "quit":
return None, None
if outcome == "back":
continue
break
notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
return {
"normalized_item_id": queue_row["normalized_item_id"],
"catalog_id": catalog_id,
"resolution_action": "link",
"status": "approved",
"resolution_notes": notes,
"reviewed_at": str(date.today()),
}, None
if action != "n":
click.secho("invalid action", fg=WARNING_COLOR)
return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
catalog_name = click.prompt(click.style("catalog name", fg=PROMPT_COLOR), type=str)
category = click.prompt(click.style("category", fg=PROMPT_COLOR), default="", show_default=False)
product_type = click.prompt(click.style("product type", fg=PROMPT_COLOR), default="", show_default=False)
category = click.prompt(click.style("category", fg=PROMPT_COLOR), default="", show_default=False)
notes = click.prompt(click.style("notes", fg=PROMPT_COLOR), default="", show_default=False)
catalog_id = stable_id("cat", f"manual|{catalog_name}|{category}|{product_type}")
catalog_row = {
@@ -349,24 +535,81 @@ def prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queu
return resolution_row, catalog_row
def apply_resolution_to_queue(queue_rows, resolution_lookup):
today_text = str(date.today())
updated_rows = []
for row in queue_rows:
resolution = resolution_lookup.get(row["normalized_item_id"], {})
row_copy = dict(row)
if resolution:
row_copy["catalog_id"] = resolution.get("catalog_id", "")
row_copy["status"] = resolution.get("status", row_copy.get("status", "pending"))
row_copy["resolution_action"] = resolution.get("resolution_action", "")
row_copy["resolution_notes"] = resolution.get("resolution_notes", "")
row_copy["updated_at"] = resolution.get("reviewed_at", today_text)
if resolution.get("status") == "approved":
row_copy["created_at"] = row_copy.get("created_at") or resolution.get("reviewed_at", today_text)
updated_rows.append(row_copy)
return updated_rows
def link_rows_from_state(link_lookup):
return sorted(link_lookup.values(), key=lambda row: row["normalized_item_id"])
@click.command()
@click.option("--purchases-csv", default="data/review/purchases.csv", show_default=True)
@click.option("--giant-items-enriched-csv", default="data/giant-web/normalized_items.csv", show_default=True)
@click.option("--costco-items-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--queue-csv", default="data/review/review_queue.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="data/catalog.csv", show_default=True)
@click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--limit", default=0, show_default=True, type=int)
@click.option("--refresh-only", is_flag=True, help="Only rebuild review_queue.csv without prompting.")
def main(purchases_csv, queue_csv, resolutions_csv, catalog_csv, limit, refresh_only):
purchase_rows = build_purchases.read_optional_csv_rows(purchases_csv)
def main(
giant_items_enriched_csv,
costco_items_enriched_csv,
giant_orders_csv,
costco_orders_csv,
purchases_csv,
queue_csv,
resolutions_csv,
catalog_csv,
links_csv,
limit,
refresh_only,
):
resolution_rows = build_purchases.read_optional_csv_rows(resolutions_csv)
catalog_rows = build_purchases.merge_catalog_rows(build_purchases.read_optional_csv_rows(catalog_csv), [])
queue_rows = build_review_queue(purchase_rows, resolution_rows)
link_rows = build_purchases.read_optional_csv_rows(links_csv)
purchase_rows, refreshed_link_rows = build_purchases.build_purchase_rows(
build_purchases.read_optional_csv_rows(giant_items_enriched_csv),
build_purchases.read_optional_csv_rows(costco_items_enriched_csv),
build_purchases.read_optional_csv_rows(giant_orders_csv),
build_purchases.read_optional_csv_rows(costco_orders_csv),
resolution_rows,
link_rows,
catalog_rows,
)
build_purchases.write_csv_rows(purchases_csv, purchase_rows, build_purchases.PURCHASE_FIELDS)
link_lookup = build_purchases.load_link_lookup(refreshed_link_rows)
queue_rows = build_review_queue(
purchase_rows,
resolution_rows,
refreshed_link_rows,
catalog_rows,
build_purchases.read_optional_csv_rows(queue_csv),
)
write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
click.echo(f"wrote {len(queue_rows)} rows to {queue_csv}")
if refresh_only:
return
print_intro_text()
resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
catalog_by_id = {row["catalog_id"]: row for row in catalog_rows if row.get("catalog_id")}
rows_by_normalized = defaultdict(list)
@@ -388,16 +631,38 @@ def main(purchases_csv, queue_csv, resolutions_csv, catalog_csv, limit, refresh_
if catalog_row and catalog_row["catalog_id"] not in catalog_by_id:
catalog_by_id[catalog_row["catalog_id"]] = catalog_row
catalog_rows.append(catalog_row)
reviewed += 1
normalized_item_id = resolution_row["normalized_item_id"]
if resolution_row["status"] == "approved":
if resolution_row["resolution_action"] in {"link", "create"} and resolution_row.get("catalog_id"):
link_lookup[normalized_item_id] = {
"normalized_item_id": normalized_item_id,
"catalog_id": resolution_row["catalog_id"],
"link_method": f"manual_{resolution_row['resolution_action']}",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": resolution_row.get("reviewed_at", ""),
"link_notes": resolution_row.get("resolution_notes", ""),
}
elif resolution_row["resolution_action"] == "exclude":
link_lookup.pop(normalized_item_id, None)
queue_rows = apply_resolution_to_queue(queue_rows, resolution_lookup)
write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
save_resolution_rows(
resolutions_csv,
sorted(resolution_lookup.values(), key=lambda row: row["normalized_item_id"]),
)
save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["catalog_id"]))
save_link_rows(links_csv, link_rows_from_state(link_lookup))
reviewed += 1
save_resolution_rows(resolutions_csv, sorted(resolution_lookup.values(), key=lambda row: row["normalized_item_id"]))
save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["catalog_id"]))
save_link_rows(links_csv, link_rows_from_state(link_lookup))
click.echo(
f"saved {len(resolution_lookup)} resolution rows to {resolutions_csv} "
f"and {len(catalog_by_id)} catalog rows to {catalog_csv}"
f"saved {len(resolution_lookup)} resolution rows to {resolutions_csv}, "
f"{len(catalog_by_id)} catalog rows to {catalog_csv}, "
f"and {len(link_lookup)} product links to {links_csv}"
)


@@ -1,5 +0,0 @@
from scrape_giant import * # noqa: F401,F403
if __name__ == "__main__":
main()


@@ -0,0 +1,149 @@
import csv
import tempfile
import unittest
from pathlib import Path
import analyze_purchases
class AnalyzePurchasesTests(unittest.TestCase):
def test_analysis_outputs_cover_required_views(self):
with tempfile.TemporaryDirectory() as tmpdir:
purchases_csv = Path(tmpdir) / "purchases.csv"
output_dir = Path(tmpdir) / "analysis"
fieldnames = [
"purchase_date",
"retailer",
"order_id",
"catalog_id",
"catalog_name",
"category",
"product_type",
"net_line_total",
"line_total",
"normalized_quantity",
"normalized_quantity_unit",
"effective_price",
"effective_price_unit",
"store_name",
"store_number",
"store_city",
"store_state",
"is_fee",
"is_discount_line",
"is_coupon_line",
]
with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(
[
{
"purchase_date": "2026-03-01",
"retailer": "giant",
"order_id": "g1",
"catalog_id": "cat_banana",
"catalog_name": "BANANA",
"category": "produce",
"product_type": "banana",
"net_line_total": "1.29",
"line_total": "1.29",
"normalized_quantity": "2.19",
"normalized_quantity_unit": "lb",
"effective_price": "0.589",
"effective_price_unit": "lb",
"store_name": "Giant",
"store_number": "42",
"store_city": "Springfield",
"store_state": "VA",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"purchase_date": "2026-03-01",
"retailer": "giant",
"order_id": "g1",
"catalog_id": "cat_ice",
"catalog_name": "ICE",
"category": "frozen",
"product_type": "ice",
"net_line_total": "3.50",
"line_total": "3.50",
"normalized_quantity": "20",
"normalized_quantity_unit": "lb",
"effective_price": "0.175",
"effective_price_unit": "lb",
"store_name": "Giant",
"store_number": "42",
"store_city": "Springfield",
"store_state": "VA",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"purchase_date": "2026-03-02",
"retailer": "costco",
"order_id": "c1",
"catalog_id": "cat_banana",
"catalog_name": "BANANA",
"category": "produce",
"product_type": "banana",
"net_line_total": "1.49",
"line_total": "2.98",
"normalized_quantity": "3",
"normalized_quantity_unit": "lb",
"effective_price": "0.4967",
"effective_price_unit": "lb",
"store_name": "MT VERNON",
"store_number": "1115",
"store_city": "ALEXANDRIA",
"store_state": "VA",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
]
)
analyze_purchases.main.callback(
purchases_csv=str(purchases_csv),
output_dir=str(output_dir),
)
expected_files = [
"item_price_over_time.csv",
"spend_by_visit.csv",
"items_per_visit.csv",
"category_spend_over_time.csv",
"retailer_store_breakdown.csv",
]
for name in expected_files:
self.assertTrue((output_dir / name).exists(), name)
with (output_dir / "spend_by_visit.csv").open(newline="", encoding="utf-8") as handle:
spend_rows = list(csv.DictReader(handle))
self.assertEqual("4.79", spend_rows[0]["visit_spend_total"])
with (output_dir / "items_per_visit.csv").open(newline="", encoding="utf-8") as handle:
item_rows = list(csv.DictReader(handle))
self.assertEqual("2", item_rows[0]["item_row_count"])
self.assertEqual("2", item_rows[0]["distinct_catalog_count"])
with (output_dir / "category_spend_over_time.csv").open(newline="", encoding="utf-8") as handle:
category_rows = list(csv.DictReader(handle))
produce_row = next(row for row in category_rows if row["purchase_date"] == "2026-03-01" and row["category"] == "produce")
self.assertEqual("1.29", produce_row["category_spend_total"])
with (output_dir / "retailer_store_breakdown.csv").open(newline="", encoding="utf-8") as handle:
store_rows = list(csv.DictReader(handle))
giant_row = next(row for row in store_rows if row["retailer"] == "giant")
self.assertEqual("1", giant_row["visit_count"])
self.assertEqual("2", giant_row["item_row_count"])
self.assertEqual("4.79", giant_row["store_spend_total"])
if __name__ == "__main__":
unittest.main()
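
Review note: the assertions in this new test file suggest that `spend_by_visit.csv` sums `net_line_total` per visit, so the two Giant rows on order `g1` (1.29 + 3.50) yield `4.79`. The `analyze_purchases` implementation is not part of this hunk, so the following is only a minimal sketch of that grouping under stated assumptions. The function name `spend_by_visit` and the choice of `(purchase_date, retailer, order_id)` as the visit key are hypothetical, not taken from the module.

```python
from collections import defaultdict
from decimal import Decimal


def spend_by_visit(rows):
    """Sum net_line_total per visit (hypothetical helper, not the module's API)."""
    # Decimal avoids float drift such as 1.29 + 3.50 != 4.79.
    totals = defaultdict(Decimal)
    for row in rows:
        # Assumed visit key; the real module may also include store fields.
        key = (row["purchase_date"], row["retailer"], row["order_id"])
        totals[key] += Decimal(row["net_line_total"])
    return [
        {
            "purchase_date": purchase_date,
            "retailer": retailer,
            "order_id": order_id,
            "visit_spend_total": str(total),
        }
        for (purchase_date, retailer, order_id), total in sorted(totals.items())
    ]


# Same three rows as the test fixture, reduced to the fields used here.
sample_rows = [
    {"purchase_date": "2026-03-01", "retailer": "giant", "order_id": "g1", "net_line_total": "1.29"},
    {"purchase_date": "2026-03-01", "retailer": "giant", "order_id": "g1", "net_line_total": "3.50"},
    {"purchase_date": "2026-03-02", "retailer": "costco", "order_id": "c1", "net_line_total": "1.49"},
]
visit_totals = spend_by_visit(sample_rows)
```

Summing `net_line_total` rather than `line_total` is what keeps the Costco visit at 1.49 instead of 2.98 in this sketch, which matches how the fixture separates the two fields.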


@@ -1,119 +0,0 @@
import unittest
import build_canonical_layer
class CanonicalLayerTests(unittest.TestCase):
def test_build_canonical_layer_auto_links_exact_upc_and_name_size_only(self):
observed_rows = [
{
"observed_product_id": "gobs_1",
"representative_upc": "111",
"representative_retailer_item_id": "11",
"representative_name_norm": "GALA APPLE",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "5",
"representative_size_unit": "lb",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_2",
"representative_upc": "111",
"representative_retailer_item_id": "12",
"representative_name_norm": "LARGE WHITE EGGS",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "18",
"representative_measure_type": "count",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_3",
"representative_upc": "",
"representative_retailer_item_id": "21",
"representative_name_norm": "ROTINI",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "16",
"representative_size_unit": "oz",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_4",
"representative_upc": "",
"representative_retailer_item_id": "22",
"representative_name_norm": "ROTINI",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "16",
"representative_size_unit": "oz",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_5",
"representative_upc": "",
"representative_retailer_item_id": "99",
"representative_name_norm": "GL BAG CHARGE",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "",
"representative_measure_type": "each",
"is_fee": "true",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_6",
"representative_upc": "",
"representative_retailer_item_id": "",
"representative_name_norm": "LIME",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "",
"representative_measure_type": "each",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
]
canonicals, links = build_canonical_layer.build_canonical_layer(observed_rows)
self.assertEqual(2, len(canonicals))
self.assertEqual(4, len(links))
methods = {row["observed_product_id"]: row["link_method"] for row in links}
self.assertEqual("exact_upc", methods["gobs_1"])
self.assertEqual("exact_upc", methods["gobs_2"])
self.assertEqual("exact_name_size", methods["gobs_3"])
self.assertEqual("exact_name_size", methods["gobs_4"])
self.assertNotIn("gobs_5", methods)
self.assertNotIn("gobs_6", methods)
def test_clean_canonical_name_removes_packaging_noise(self):
self.assertEqual("LIME", build_canonical_layer.clean_canonical_name("LIME . / ."))
self.assertEqual("EGG", build_canonical_layer.clean_canonical_name("5DZ EGG / /"))
if __name__ == "__main__":
unittest.main()


@@ -7,7 +7,6 @@ from unittest import mock
import enrich_costco
import scrape_costco
import validate_cross_retailer_flow
class CostcoPipelineTests(unittest.TestCase):
@@ -264,6 +263,26 @@ class CostcoPipelineTests(unittest.TestCase):
self.assertEqual("6", row["normalized_quantity"])
self.assertEqual("count", row["normalized_quantity_unit"])
volume_row = enrich_costco.parse_costco_item(
order_id="abc",
order_date="2026-03-12",
raw_path=Path("costco_output/raw/abc.json"),
line_no=3,
item={
"itemNumber": "1185912",
"itemDescription01": "KS ALMND BAR US 1.74QTS CN",
"itemDescription02": None,
"itemDepartmentNumber": 18,
"transDepartmentNumber": 18,
"unit": 2,
"itemIdentifier": "E",
"amount": 21.98,
"itemUnitPriceAmount": 10.99,
},
)
self.assertEqual("3.48", volume_row["normalized_quantity"])
self.assertEqual("qt", volume_row["normalized_quantity_unit"])
discount = enrich_costco.parse_costco_item(
order_id="abc",
order_date="2026-03-12",
@@ -326,6 +345,32 @@ class CostcoPipelineTests(unittest.TestCase):
)
self.assertEqual("LIFE 6'TABLE MDL", logistics["item_name_norm"])
def test_costco_hash_weight_parses_into_weight_basis(self):
row = enrich_costco.parse_costco_item(
order_id="abc",
order_date="2024-11-29",
raw_path=Path("costco_output/raw/abc.json"),
line_no=4,
item={
"itemNumber": "999",
"itemDescription01": "25# FLOUR ALL-PURPOSE HARV P98/100",
"itemDescription02": None,
"itemDepartmentNumber": 14,
"transDepartmentNumber": 14,
"unit": 1,
"itemIdentifier": "E",
"amount": 8.79,
"itemUnitPriceAmount": 8.79,
},
)
self.assertEqual("FLOUR ALL-PURPOSE HARV", row["item_name_norm"])
self.assertEqual("25", row["size_value"])
self.assertEqual("lb", row["size_unit"])
self.assertEqual("weight", row["measure_type"])
self.assertEqual("25", row["normalized_quantity"])
self.assertEqual("lb", row["normalized_quantity_unit"])
self.assertEqual("0.3516", row["price_per_lb"])
def test_build_items_enriched_matches_discount_to_item(self):
with tempfile.TemporaryDirectory() as tmpdir:
raw_dir = Path(tmpdir) / "raw"
@@ -377,76 +422,6 @@ class CostcoPipelineTests(unittest.TestCase):
self.assertIn("matched_discount=4873222", purchase_row["parse_notes"])
self.assertIn("matched_to_item=4873222", discount_row["parse_notes"])
def test_cross_retailer_validation_writes_proof_example(self):
with tempfile.TemporaryDirectory() as tmpdir:
giant_csv = Path(tmpdir) / "giant_items_enriched.csv"
costco_csv = Path(tmpdir) / "costco_items_enriched.csv"
outdir = Path(tmpdir) / "combined"
fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames}
giant_row.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"order_date": "2026-03-01",
"retailer_item_id": "100",
"item_name": "FRESH BANANA",
"item_name_norm": "BANANA",
"upc": "4011",
"measure_type": "weight",
"is_store_brand": "false",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "1.29",
}
)
costco_row = {field: "" for field in fieldnames}
costco_row.update(
{
"retailer": "costco",
"order_id": "c1",
"line_no": "1",
"order_date": "2026-03-12",
"retailer_item_id": "30669",
"item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA",
"upc": "",
"size_value": "3",
"size_unit": "lb",
"measure_type": "weight",
"is_store_brand": "false",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "2.98",
}
)
with giant_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(giant_row)
with costco_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(costco_row)
validate_cross_retailer_flow.main.callback(
giant_items_enriched_csv=str(giant_csv),
costco_items_enriched_csv=str(costco_csv),
outdir=str(outdir),
)
proof_path = outdir / "proof_examples.csv"
self.assertTrue(proof_path.exists())
with proof_path.open(newline="", encoding="utf-8") as handle:
rows = list(csv.DictReader(handle))
self.assertEqual(1, len(rows))
self.assertEqual("banana", rows[0]["proof_name"])
def test_main_writes_summary_request_metadata(self):
with tempfile.TemporaryDirectory() as tmpdir:
outdir = Path(tmpdir) / "costco_output"


@@ -111,9 +111,82 @@ class EnrichGiantTests(unittest.TestCase):
self.assertEqual("weight", row["measure_type"])
self.assertEqual("6", row["pack_qty"])
self.assertEqual("7.5", row["size_value"])
self.assertEqual("90", row["normalized_quantity"])
self.assertEqual("oz", row["normalized_quantity_unit"])
self.assertEqual("0.0667", row["price_per_oz"])
self.assertEqual("1.0667", row["price_per_lb"])
def test_derive_normalized_quantity_handles_count_volume_and_each(self):
self.assertEqual(
("18", "count"),
enrich_giant.derive_normalized_quantity("1", "", "", "18", "count"),
)
self.assertEqual(
("3.48", "qt"),
enrich_giant.derive_normalized_quantity("2", "1.74", "qt", "", "volume"),
)
self.assertEqual(
("2", "each"),
enrich_giant.derive_normalized_quantity("2", "", "", "", "each"),
)
self.assertEqual(
("1.68", "lb"),
enrich_giant.derive_normalized_quantity("1", "", "", "", "weight", "1.68"),
)
def test_parse_item_uses_picked_weight_for_loose_weight_items(self):
banana = enrich_giant.parse_item(
order_id="abc123",
order_date="2026-03-01",
raw_path=Path("raw/abc123.json"),
line_no=1,
item={
"podId": 1,
"shipQy": 1,
"totalPickedWeight": 1.68,
"unitPrice": 0.99,
"itemName": "FRESH BANANA",
"lbEachCd": "LB",
"groceryAmount": 0.99,
"primUpcCd": "111",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
},
)
self.assertEqual("weight", banana["measure_type"])
self.assertEqual("1.68", banana["normalized_quantity"])
self.assertEqual("lb", banana["normalized_quantity_unit"])
patty = enrich_giant.parse_item(
order_id="abc123",
order_date="2026-03-01",
raw_path=Path("raw/abc123.json"),
line_no=2,
item={
"podId": 2,
"shipQy": 1,
"totalPickedWeight": 1.29,
"unitPrice": 10.05,
"itemName": "80% PATTIES PK12",
"lbEachCd": "LB",
"groceryAmount": 10.05,
"primUpcCd": "222",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
},
)
self.assertEqual("1.29", patty["normalized_quantity"])
self.assertEqual("lb", patty["normalized_quantity_unit"])
def test_build_items_enriched_reads_raw_order_files_and_writes_csv(self):
with tempfile.TemporaryDirectory() as tmpdir:
raw_dir = Path(tmpdir) / "raw"


@@ -1,67 +0,0 @@
import unittest
import build_observed_products
class ObservedProductTests(unittest.TestCase):
def test_build_observed_products_aggregates_rows_with_same_key(self):
rows = [
{
"retailer": "giant",
"order_id": "1",
"line_no": "1",
"order_date": "2026-01-01",
"item_name": "SB GALA APPLE 5LB",
"item_name_norm": "GALA APPLE",
"retailer_item_id": "11",
"upc": "111",
"brand_guess": "SB",
"variant": "",
"size_value": "5",
"size_unit": "lb",
"pack_qty": "",
"measure_type": "weight",
"image_url": "https://example.test/a.jpg",
"is_store_brand": "true",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "7.99",
},
{
"retailer": "giant",
"order_id": "2",
"line_no": "1",
"order_date": "2026-01-10",
"item_name": "SB GALA APPLE 5 LB",
"item_name_norm": "GALA APPLE",
"retailer_item_id": "11",
"upc": "111",
"brand_guess": "SB",
"variant": "",
"size_value": "5",
"size_unit": "lb",
"pack_qty": "",
"measure_type": "weight",
"image_url": "",
"is_store_brand": "true",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "8.49",
},
]
observed = build_observed_products.build_observed_products(rows)
self.assertEqual(1, len(observed))
self.assertEqual("2", observed[0]["times_seen"])
self.assertEqual("2026-01-01", observed[0]["first_seen_date"])
self.assertEqual("2026-01-10", observed[0]["last_seen_date"])
self.assertEqual("11", observed[0]["representative_retailer_item_id"])
self.assertEqual("111", observed[0]["representative_upc"])
self.assertIn("SB GALA APPLE 5LB", observed[0]["raw_name_examples"])
if __name__ == "__main__":
unittest.main()


@@ -65,6 +65,21 @@ class PipelineStatusTests(unittest.TestCase):
},
],
resolutions=[],
links=[
{
"normalized_item_id": "gnorm_banana",
"catalog_id": "cat_banana",
"review_status": "approved",
}
],
catalog=[
{
"catalog_id": "cat_banana",
"catalog_name": "BANANA",
"product_type": "banana",
"category": "produce",
}
],
)
counts = {row["stage"]: row["count"] for row in summary}


@@ -8,6 +8,11 @@ import enrich_costco
class PurchaseLogTests(unittest.TestCase):
def test_derive_net_line_total_preserves_existing_then_derives(self):
self.assertEqual("1.49", build_purchases.derive_net_line_total({"net_line_total": "1.49", "line_total": "2.98"}))
self.assertEqual("5.99", build_purchases.derive_net_line_total({"line_total": "6.99", "matched_discount_amount": "-1.00"}))
self.assertEqual("3.5", build_purchases.derive_net_line_total({"line_total": "3.50"}))
def test_derive_metrics_prefers_picked_weight_and_pack_count(self):
metrics = build_purchases.derive_metrics(
{
@@ -47,6 +52,8 @@ class PurchaseLogTests(unittest.TestCase):
"upc": "4011",
"qty": "1",
"unit": "LB",
"normalized_quantity": "1",
"normalized_quantity_unit": "lb",
"line_total": "1.29",
"unit_price": "1.29",
"measure_type": "weight",
@@ -71,6 +78,8 @@ class PurchaseLogTests(unittest.TestCase):
"retailer_item_id": "30669",
"qty": "1",
"unit": "E",
"normalized_quantity": "3",
"normalized_quantity_unit": "lb",
"line_total": "2.98",
"unit_price": "2.98",
"size_value": "3",
@@ -155,6 +164,14 @@ class PurchaseLogTests(unittest.TestCase):
self.assertTrue(all(row["catalog_id"] == "cat_banana" for row in rows))
self.assertEqual({"giant", "costco"}, {row["retailer"] for row in rows})
self.assertEqual("https://example.test/banana.jpg", rows[0]["image_url"])
self.assertEqual("1", rows[0]["normalized_quantity"])
self.assertEqual("lb", rows[0]["normalized_quantity_unit"])
self.assertEqual("lb", rows[0]["effective_price_unit"])
self.assertEqual("g1", rows[0]["order_id"])
self.assertEqual("Giant", rows[0]["store_name"])
self.assertEqual("42", rows[0]["store_number"])
self.assertEqual("Springfield", rows[0]["store_city"])
self.assertEqual("VA", rows[0]["store_state"])
def test_main_writes_purchase_and_example_csvs(self):
with tempfile.TemporaryDirectory() as tmpdir:
@@ -184,6 +201,8 @@ class PurchaseLogTests(unittest.TestCase):
"upc": "4011",
"qty": "1",
"unit": "LB",
"normalized_quantity": "1",
"normalized_quantity_unit": "lb",
"line_total": "1.29",
"unit_price": "1.29",
"measure_type": "weight",
@@ -208,6 +227,8 @@ class PurchaseLogTests(unittest.TestCase):
"retailer_item_id": "30669",
"qty": "1",
"unit": "E",
"normalized_quantity": "3",
"normalized_quantity_unit": "lb",
"line_total": "2.98",
"unit_price": "2.98",
"size_value": "3",
@@ -346,6 +367,8 @@ class PurchaseLogTests(unittest.TestCase):
"upc": "",
"qty": "1",
"unit": "EA",
"normalized_quantity": "1",
"normalized_quantity_unit": "each",
"line_total": "3.50",
"unit_price": "3.50",
"measure_type": "each",
@@ -403,6 +426,296 @@ class PurchaseLogTests(unittest.TestCase):
self.assertEqual("approved", rows[0]["review_status"])
self.assertEqual("create", rows[0]["resolution_action"])
self.assertEqual("cat_ice", links[0]["catalog_id"])
self.assertEqual("1", rows[0]["normalized_quantity"])
self.assertEqual("each", rows[0]["normalized_quantity_unit"])
def test_build_purchase_rows_derives_effective_price_for_known_cases(self):
fieldnames = enrich_costco.OUTPUT_FIELDS
def base_row():
return {field: "" for field in fieldnames}
giant_banana = base_row()
giant_banana.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_row_id": "giant:g1:1",
"normalized_item_id": "gnorm:banana",
"order_date": "2026-03-01",
"item_name": "FRESH BANANA",
"item_name_norm": "BANANA",
"retailer_item_id": "100",
"qty": "1",
"unit": "LB",
"normalized_quantity": "1.68",
"normalized_quantity_unit": "lb",
"line_total": "0.99",
"unit_price": "0.99",
"measure_type": "weight",
"price_per_lb": "0.5893",
"raw_order_path": "data/giant-web/raw/g1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
costco_banana = base_row()
costco_banana.update(
{
"retailer": "costco",
"order_id": "c1",
"line_no": "1",
"normalized_row_id": "costco:c1:1",
"normalized_item_id": "cnorm:banana",
"order_date": "2026-03-12",
"item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA",
"retailer_item_id": "30669",
"qty": "1",
"unit": "E",
"normalized_quantity": "3",
"normalized_quantity_unit": "lb",
"line_total": "2.98",
"net_line_total": "1.49",
"unit_price": "2.98",
"size_value": "3",
"size_unit": "lb",
"measure_type": "weight",
"price_per_lb": "0.4967",
"raw_order_path": "data/costco-web/raw/c1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
giant_ice = base_row()
giant_ice.update(
{
"retailer": "giant",
"order_id": "g2",
"line_no": "1",
"normalized_row_id": "giant:g2:1",
"normalized_item_id": "gnorm:ice",
"order_date": "2026-03-02",
"item_name": "SB BAGGED ICE 20LB",
"item_name_norm": "BAGGED ICE",
"retailer_item_id": "101",
"qty": "2",
"unit": "EA",
"normalized_quantity": "40",
"normalized_quantity_unit": "lb",
"line_total": "9.98",
"unit_price": "4.99",
"size_value": "20",
"size_unit": "lb",
"measure_type": "weight",
"price_per_lb": "0.2495",
"raw_order_path": "data/giant-web/raw/g2.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
costco_patty = base_row()
costco_patty.update(
{
"retailer": "costco",
"order_id": "c2",
"line_no": "1",
"normalized_row_id": "costco:c2:1",
"normalized_item_id": "cnorm:patty",
"order_date": "2026-03-03",
"item_name": "BEEF PATTIES 6# BAG",
"item_name_norm": "BEEF PATTIES 6# BAG",
"retailer_item_id": "777",
"qty": "1",
"unit": "E",
"normalized_quantity": "1",
"normalized_quantity_unit": "each",
"line_total": "26.99",
"net_line_total": "26.99",
"unit_price": "26.99",
"measure_type": "each",
"raw_order_path": "data/costco-web/raw/c2.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
giant_patty = base_row()
giant_patty.update(
{
"retailer": "giant",
"order_id": "g3",
"line_no": "1",
"normalized_row_id": "giant:g3:1",
"normalized_item_id": "gnorm:patty",
"order_date": "2026-03-04",
"item_name": "80% PATTIES PK12",
"item_name_norm": "80% PATTIES PK12",
"retailer_item_id": "102",
"qty": "1",
"unit": "LB",
"normalized_quantity": "",
"normalized_quantity_unit": "",
"line_total": "10.05",
"unit_price": "10.05",
"measure_type": "weight",
"price_per_lb": "7.7907",
"raw_order_path": "data/giant-web/raw/g3.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
rows, _links = build_purchases.build_purchase_rows(
[giant_banana, giant_ice, giant_patty],
[costco_banana, costco_patty],
[],
[],
[],
[],
[],
)
rows_by_item = {row["normalized_item_id"]: row for row in rows}
self.assertEqual("0.5893", rows_by_item["gnorm:banana"]["effective_price"])
self.assertEqual("lb", rows_by_item["gnorm:banana"]["effective_price_unit"])
self.assertEqual("0.4967", rows_by_item["cnorm:banana"]["effective_price"])
self.assertEqual("lb", rows_by_item["cnorm:banana"]["effective_price_unit"])
self.assertEqual("0.2495", rows_by_item["gnorm:ice"]["effective_price"])
self.assertEqual("lb", rows_by_item["gnorm:ice"]["effective_price_unit"])
self.assertEqual("26.99", rows_by_item["cnorm:patty"]["effective_price"])
self.assertEqual("each", rows_by_item["cnorm:patty"]["effective_price_unit"])
self.assertEqual("", rows_by_item["gnorm:patty"]["effective_price"])
self.assertEqual("", rows_by_item["gnorm:patty"]["effective_price_unit"])
    def test_build_purchase_rows_leaves_effective_price_blank_without_valid_denominator(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS
        row = {field: "" for field in fieldnames}
        row.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "normalized_row_id": "giant:g1:1",
                "normalized_item_id": "gnorm:blank",
                "order_date": "2026-03-01",
                "item_name": "MYSTERY ITEM",
                "item_name_norm": "MYSTERY ITEM",
                "retailer_item_id": "100",
                "qty": "1",
                "unit": "EA",
                "normalized_quantity": "0",
                "normalized_quantity_unit": "each",
                "line_total": "3.50",
                "unit_price": "3.50",
                "measure_type": "each",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )

        rows, _links = build_purchases.build_purchase_rows([row], [], [], [], [], [], [])

        self.assertEqual("", rows[0]["effective_price"])
        self.assertEqual("", rows[0]["effective_price_unit"])

    def test_purchase_rows_support_visit_level_grouping_without_extra_joins(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS

        def base_row():
            return {field: "" for field in fieldnames}

        row_one = base_row()
        row_one.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "normalized_row_id": "giant:g1:1",
                "normalized_item_id": "gnorm:first",
                "order_date": "2026-03-01",
                "item_name": "FIRST ITEM",
                "item_name_norm": "FIRST ITEM",
                "qty": "1",
                "unit": "EA",
                "normalized_quantity": "1",
                "normalized_quantity_unit": "each",
                "line_total": "3.50",
                "measure_type": "each",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        row_two = base_row()
        row_two.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "2",
                "normalized_row_id": "giant:g1:2",
                "normalized_item_id": "gnorm:second",
                "order_date": "2026-03-01",
                "item_name": "SECOND ITEM",
                "item_name_norm": "SECOND ITEM",
                "qty": "1",
                "unit": "EA",
                "normalized_quantity": "1",
                "normalized_quantity_unit": "each",
                "line_total": "2.00",
                "measure_type": "each",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )

        rows, _links = build_purchases.build_purchase_rows(
            [row_one, row_two],
            [],
            [
                {
                    "order_id": "g1",
                    "store_name": "Giant",
                    "store_number": "42",
                    "store_city": "Springfield",
                    "store_state": "VA",
                }
            ],
            [],
            [],
            [],
            [],
        )

        visit_key = {
            (
                row["retailer"],
                row["order_id"],
                row["purchase_date"],
                row["store_name"],
                row["store_number"],
                row["store_city"],
                row["store_state"],
            )
            for row in rows
        }
        visit_total = sum(float(row["net_line_total"]) for row in rows)

        self.assertEqual(1, len(visit_key))
        self.assertEqual(5.5, visit_total)
if __name__ == "__main__":

@@ -1,133 +0,0 @@
import tempfile
import unittest
from pathlib import Path

import build_observed_products
import build_review_queue
from layer_helpers import write_csv_rows


class ReviewQueueTests(unittest.TestCase):
    def test_build_review_queue_preserves_existing_status(self):
        observed_rows = [
            {
                "observed_product_id": "gobs_1",
                "retailer": "giant",
                "representative_upc": "111",
                "representative_image_url": "",
                "representative_name_norm": "GALA APPLE",
                "times_seen": "2",
                "distinct_item_names_count": "2",
                "distinct_upcs_count": "1",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            }
        ]
        item_rows = [
            {
                "observed_product_id": "gobs_1",
                "item_name": "SB GALA APPLE 5LB",
                "item_name_norm": "GALA APPLE",
                "line_total": "7.99",
            },
            {
                "observed_product_id": "gobs_1",
                "item_name": "SB GALA APPLE 5 LB",
                "item_name_norm": "GALA APPLE",
                "line_total": "8.49",
            },
        ]
        existing = {
            build_review_queue.stable_id("rvw", "gobs_1|missing_image"): {
                "status": "approved",
                "resolution_notes": "looked fine",
                "created_at": "2026-03-15",
            }
        }

        queue = build_review_queue.build_review_queue(
            observed_rows, item_rows, existing, "2026-03-16"
        )

        self.assertEqual(2, len(queue))
        missing_image = [row for row in queue if row["reason_code"] == "missing_image"][0]
        self.assertEqual("approved", missing_image["status"])
        self.assertEqual("looked fine", missing_image["resolution_notes"])

    def test_review_queue_main_writes_output(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            observed_path = Path(tmpdir) / "products_observed.csv"
            items_path = Path(tmpdir) / "items_enriched.csv"
            output_path = Path(tmpdir) / "review_queue.csv"

            observed_rows = [
                {
                    "observed_product_id": "gobs_1",
                    "retailer": "giant",
                    "observed_key": "giant|upc=111|name=GALA APPLE",
                    "representative_retailer_item_id": "11",
                    "representative_upc": "111",
                    "representative_item_name": "SB GALA APPLE 5LB",
                    "representative_name_norm": "GALA APPLE",
                    "representative_brand": "SB",
                    "representative_variant": "",
                    "representative_size_value": "5",
                    "representative_size_unit": "lb",
                    "representative_pack_qty": "",
                    "representative_measure_type": "weight",
                    "representative_image_url": "",
                    "is_store_brand": "true",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "first_seen_date": "2026-01-01",
                    "last_seen_date": "2026-01-10",
                    "times_seen": "2",
                    "example_order_id": "1",
                    "example_item_name": "SB GALA APPLE 5LB",
                    "raw_name_examples": "SB GALA APPLE 5LB | SB GALA APPLE 5 LB",
                    "normalized_name_examples": "GALA APPLE",
                    "example_prices": "7.99 | 8.49",
                    "distinct_item_names_count": "2",
                    "distinct_retailer_item_ids_count": "1",
                    "distinct_upcs_count": "1",
                }
            ]
            item_rows = [
                {
                    "retailer": "giant",
                    "order_id": "1",
                    "line_no": "1",
                    "item_name": "SB GALA APPLE 5LB",
                    "item_name_norm": "GALA APPLE",
                    "retailer_item_id": "11",
                    "upc": "111",
                    "size_value": "5",
                    "size_unit": "lb",
                    "pack_qty": "",
                    "measure_type": "weight",
                    "is_store_brand": "true",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "line_total": "7.99",
                }
            ]

            write_csv_rows(
                observed_path, observed_rows, build_observed_products.OUTPUT_FIELDS
            )
            write_csv_rows(items_path, item_rows, list(item_rows[0].keys()))

            build_review_queue.main.callback(
                observed_csv=str(observed_path),
                items_enriched_csv=str(items_path),
                output_csv=str(output_path),
            )

            self.assertTrue(output_path.exists())


if __name__ == "__main__":
    unittest.main()

@@ -6,9 +6,94 @@ from unittest import mock
from click.testing import CliRunner
import enrich_costco
import review_products
def write_review_source_files(tmpdir, rows):
    giant_items_csv = Path(tmpdir) / "giant_items.csv"
    costco_items_csv = Path(tmpdir) / "costco_items.csv"
    giant_orders_csv = Path(tmpdir) / "giant_orders.csv"
    costco_orders_csv = Path(tmpdir) / "costco_orders.csv"

    fieldnames = enrich_costco.OUTPUT_FIELDS
    grouped_rows = {"giant": [], "costco": []}
    grouped_orders = {"giant": {}, "costco": {}}

    for index, row in enumerate(rows, start=1):
        retailer = row.get("retailer", "giant")
        normalized_row = {field: "" for field in fieldnames}
        normalized_row.update(
            {
                "retailer": retailer,
                "order_id": row.get("order_id", f"{retailer[0]}{index}"),
                "line_no": row.get("line_no", str(index)),
                "normalized_row_id": row.get(
                    "normalized_row_id",
                    f"{retailer}:{row.get('order_id', f'{retailer[0]}{index}')}:{row.get('line_no', str(index))}",
                ),
                "normalized_item_id": row.get("normalized_item_id", ""),
                "order_date": row.get("purchase_date", ""),
                "item_name": row.get("raw_item_name", ""),
                "item_name_norm": row.get("normalized_item_name", ""),
                "image_url": row.get("image_url", ""),
                "upc": row.get("upc", ""),
                "line_total": row.get("line_total", ""),
                "net_line_total": row.get("net_line_total", ""),
                "matched_discount_amount": row.get("matched_discount_amount", ""),
                "qty": row.get("qty", "1"),
                "unit": row.get("unit", "EA"),
                "normalized_quantity": row.get("normalized_quantity", ""),
                "normalized_quantity_unit": row.get("normalized_quantity_unit", ""),
                "size_value": row.get("size_value", ""),
                "size_unit": row.get("size_unit", ""),
                "pack_qty": row.get("pack_qty", ""),
                "measure_type": row.get("measure_type", "each"),
                "retailer_item_id": row.get("retailer_item_id", ""),
                "price_per_each": row.get("price_per_each", ""),
                "price_per_lb": row.get("price_per_lb", ""),
                "price_per_oz": row.get("price_per_oz", ""),
                "is_discount_line": row.get("is_discount_line", "false"),
                "is_coupon_line": row.get("is_coupon_line", "false"),
                "is_fee": row.get("is_fee", "false"),
                "raw_order_path": row.get("raw_order_path", ""),
            }
        )
        grouped_rows[retailer].append(normalized_row)

        order_id = normalized_row["order_id"]
        grouped_orders[retailer].setdefault(
            order_id,
            {
                "order_id": order_id,
                "store_name": row.get("store_name", ""),
                "store_number": row.get("store_number", ""),
                "store_city": row.get("store_city", ""),
                "store_state": row.get("store_state", ""),
            },
        )

    for path, source_rows in [
        (giant_items_csv, grouped_rows["giant"]),
        (costco_items_csv, grouped_rows["costco"]),
    ]:
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(source_rows)

    order_fields = ["order_id", "store_name", "store_number", "store_city", "store_state"]
    for path, source_rows in [
        (giant_orders_csv, grouped_orders["giant"].values()),
        (costco_orders_csv, grouped_orders["costco"].values()),
    ]:
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=order_fields)
            writer.writeheader()
            writer.writerows(source_rows)

    return giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv
class ReviewWorkflowTests(unittest.TestCase):
def test_build_review_queue_groups_unresolved_purchases(self):
queue_rows = review_products.build_review_queue(
@@ -76,30 +161,46 @@ class ReviewWorkflowTests(unittest.TestCase):
self.assertEqual("cat_2", suggestions[0]["catalog_id"])
self.assertEqual("exact upc", suggestions[0]["reason"])
    def test_search_catalog_rows_ranks_token_overlap(self):
        results = review_products.search_catalog_rows(
            "mixed pepper",
            [
                {
                    "catalog_id": "cat_1",
                    "catalog_name": "MIXED PEPPER",
                    "product_type": "pepper",
                    "category": "produce",
                    "variant": "",
                },
                {
                    "catalog_id": "cat_2",
                    "catalog_name": "GROUND PEPPER",
                    "product_type": "spice",
                    "category": "baking",
                    "variant": "",
                },
            ],
            [
                {
                    "normalized_item_id": "gnorm_mix",
                    "catalog_id": "cat_1",
                }
            ],
            "cnorm_mix",
        )

        self.assertEqual("cat_1", results[0]["catalog_id"])
        self.assertGreater(results[0]["score"], results[1]["score"])
def test_review_products_displays_position_items_and_suggestions(self):
with tempfile.TemporaryDirectory() as tmpdir:
purchases_csv = Path(tmpdir) / "purchases.csv"
queue_csv = Path(tmpdir) / "review_queue.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv"
purchase_fields = [
"purchase_date",
"retailer",
"order_id",
"line_no",
"normalized_item_id",
"catalog_id",
"raw_item_name",
"normalized_item_name",
"image_url",
"upc",
"line_total",
]
with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=purchase_fields)
writer.writeheader()
writer.writerows(
links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
tmpdir,
[
{
"purchase_date": "2026-03-14",
@@ -107,7 +208,6 @@ class ReviewWorkflowTests(unittest.TestCase):
"order_id": "c2",
"line_no": "2",
"normalized_item_id": "cnorm_mix",
"catalog_id": "",
"raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
@@ -120,7 +220,6 @@ class ReviewWorkflowTests(unittest.TestCase):
"order_id": "c1",
"line_no": "1",
"normalized_item_id": "cnorm_mix",
"catalog_id": "",
"raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER",
"image_url": "https://example.test/mixed-pepper.jpg",
@@ -133,14 +232,13 @@ class ReviewWorkflowTests(unittest.TestCase):
"order_id": "g1",
"line_no": "1",
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"raw_item_name": "MIXED PEPPER",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
"upc": "",
"line_total": "5.99",
},
]
],
)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
@@ -163,11 +261,34 @@ class ReviewWorkflowTests(unittest.TestCase):
"updated_at": "",
}
)
with links_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.PRODUCT_LINK_FIELDS)
writer.writeheader()
writer.writerow(
{
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"link_method": "manual_link",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
}
)
runner = CliRunner()
result = runner.invoke(
review_products.main,
[
"--giant-items-enriched-csv",
str(giant_items_csv),
"--costco-items-enriched-csv",
str(costco_items_csv),
"--giant-orders-csv",
str(giant_orders_csv),
"--costco-orders-csv",
str(costco_orders_csv),
"--purchases-csv",
str(purchases_csv),
"--queue-csv",
@@ -176,21 +297,23 @@ class ReviewWorkflowTests(unittest.TestCase):
str(resolutions_csv),
"--catalog-csv",
str(catalog_csv),
"--links-csv",
str(links_csv),
],
input="q\n",
color=True,
)
self.assertEqual(0, result.exit_code)
self.assertIn("Review 1/1: Resolve normalized_item MIXED PEPPER to catalog_name [__]?", result.output)
self.assertIn("Review guide:", result.output)
self.assertIn("Review 1/1: MIXED PEPPER", result.output)
self.assertIn("2 matched items:", result.output)
self.assertIn("[l]ink existing [n]ew catalog e[x]clude [s]kip [q]uit:", result.output)
first_item = result.output.index("[1] 2026-03-14 | 7.49")
second_item = result.output.index("[2] 2026-03-12 | 6.99")
self.assertIn("[#] link to suggestion [f]ind [n]ew [s]kip e[x]clude [q]uit >", result.output)
first_item = result.output.index("[1] MIXED PEPPER 6-PACK | costco | 2026-03-14 | 7.49 | ")
second_item = result.output.index("[2] MIXED PEPPER 6-PACK | costco | 2026-03-12 | 6.99 | https://example.test/mixed-pepper.jpg")
self.assertLess(first_item, second_item)
self.assertIn("https://example.test/mixed-pepper.jpg", result.output)
self.assertIn("1 catalog_name suggestions found:", result.output)
self.assertIn("[1] MIXED PEPPER", result.output)
self.assertIn("[1] MIXED PEPPER, pepper, produce (1 items, 1 rows)", result.output)
self.assertIn("\x1b[", result.output)
def test_review_products_no_suggestions_is_informational(self):
@@ -199,39 +322,23 @@ class ReviewWorkflowTests(unittest.TestCase):
queue_csv = Path(tmpdir) / "review_queue.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv"
with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(
handle,
fieldnames=[
"purchase_date",
"retailer",
"order_id",
"line_no",
"normalized_item_id",
"catalog_id",
"raw_item_name",
"normalized_item_name",
"image_url",
"upc",
"line_total",
],
)
writer.writeheader()
writer.writerow(
links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
tmpdir,
[
{
"purchase_date": "2026-03-14",
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_item_id": "gnorm_ice",
"catalog_id": "",
"raw_item_name": "SB BAGGED ICE 20LB",
"normalized_item_name": "BAGGED ICE",
"image_url": "",
"upc": "",
"line_total": "3.50",
}
],
)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
@@ -241,6 +348,14 @@ class ReviewWorkflowTests(unittest.TestCase):
result = CliRunner().invoke(
review_products.main,
[
"--giant-items-enriched-csv",
str(giant_items_csv),
"--costco-items-enriched-csv",
str(costco_items_csv),
"--giant-orders-csv",
str(giant_orders_csv),
"--costco-orders-csv",
str(costco_orders_csv),
"--purchases-csv",
str(purchases_csv),
"--queue-csv",
@@ -249,6 +364,8 @@ class ReviewWorkflowTests(unittest.TestCase):
str(resolutions_csv),
"--catalog-csv",
str(catalog_csv),
"--links-csv",
str(links_csv),
],
input="q\n",
color=True,
@@ -257,32 +374,15 @@ class ReviewWorkflowTests(unittest.TestCase):
self.assertEqual(0, result.exit_code)
self.assertIn("no catalog_name suggestions found", result.output)
def test_link_existing_uses_numbered_selection_and_confirmation(self):
def test_search_links_catalog_and_writes_link_row(self):
with tempfile.TemporaryDirectory() as tmpdir:
purchases_csv = Path(tmpdir) / "purchases.csv"
queue_csv = Path(tmpdir) / "review_queue.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv"
with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(
handle,
fieldnames=[
"purchase_date",
"retailer",
"order_id",
"line_no",
"normalized_item_id",
"catalog_id",
"raw_item_name",
"normalized_item_name",
"image_url",
"upc",
"line_total",
],
)
writer.writeheader()
writer.writerows(
links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
tmpdir,
[
{
"purchase_date": "2026-03-14",
@@ -290,7 +390,6 @@ class ReviewWorkflowTests(unittest.TestCase):
"order_id": "c2",
"line_no": "2",
"normalized_item_id": "cnorm_mix",
"catalog_id": "",
"raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
@@ -303,7 +402,6 @@ class ReviewWorkflowTests(unittest.TestCase):
"order_id": "c1",
"line_no": "1",
"normalized_item_id": "cnorm_mix",
"catalog_id": "",
"raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
@@ -316,14 +414,13 @@ class ReviewWorkflowTests(unittest.TestCase):
"order_id": "g1",
"line_no": "1",
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"raw_item_name": "MIXED PEPPER",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
"upc": "",
"line_total": "5.99",
},
]
],
)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
@@ -346,10 +443,33 @@ class ReviewWorkflowTests(unittest.TestCase):
"updated_at": "",
}
)
with links_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.PRODUCT_LINK_FIELDS)
writer.writeheader()
writer.writerow(
{
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"link_method": "manual_link",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
}
)
result = CliRunner().invoke(
review_products.main,
[
"--giant-items-enriched-csv",
str(giant_items_csv),
"--costco-items-enriched-csv",
str(costco_items_csv),
"--giant-orders-csv",
str(giant_orders_csv),
"--costco-orders-csv",
str(costco_orders_csv),
"--purchases-csv",
str(purchases_csv),
"--queue-csv",
@@ -358,22 +478,162 @@ class ReviewWorkflowTests(unittest.TestCase):
str(resolutions_csv),
"--catalog-csv",
str(catalog_csv),
"--links-csv",
str(links_csv),
"--limit",
"1",
],
input="l\n1\ny\nlinked by test\n",
input="f\nmixed pepper\n1\nlinked by test\n",
color=True,
)
self.assertEqual(0, result.exit_code)
self.assertIn("Select the catalog_name to associate 2 items with:", result.output)
self.assertIn("[1] MIXED PEPPER | cat_mix", result.output)
self.assertIn('2 "MIXED PEPPER" items and future matches will be associated with "MIXED PEPPER".', result.output)
self.assertIn("actions: [y]es [n]o [b]ack [s]kip [q]uit", result.output)
self.assertIn("1 search results found:", result.output)
with resolutions_csv.open(newline="", encoding="utf-8") as handle:
rows = list(csv.DictReader(handle))
with links_csv.open(newline="", encoding="utf-8") as handle:
link_rows = list(csv.DictReader(handle))
self.assertEqual("cat_mix", rows[0]["catalog_id"])
self.assertEqual("link", rows[0]["resolution_action"])
self.assertEqual("cat_mix", link_rows[0]["catalog_id"])
    def test_search_no_matches_allows_retry_or_return(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
                tmpdir,
                [
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "normalized_item_id": "gnorm_ice",
                        "raw_item_name": "SB BAGGED ICE 20LB",
                        "normalized_item_name": "BAGGED ICE",
                        "image_url": "",
                        "upc": "",
                        "line_total": "3.50",
                    }
                ],
            )

            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "catalog_id": "cat_ice",
                        "catalog_name": "ICE",
                        "category": "frozen",
                        "product_type": "ice",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )

            result = CliRunner().invoke(
                review_products.main,
                [
                    "--giant-items-enriched-csv",
                    str(giant_items_csv),
                    "--costco-items-enriched-csv",
                    str(costco_items_csv),
                    "--giant-orders-csv",
                    str(giant_orders_csv),
                    "--costco-orders-csv",
                    str(costco_orders_csv),
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--links-csv",
                    str(links_csv),
                ],
                input="f\nzzz\nq\nq\n",
                color=True,
            )

            self.assertEqual(0, result.exit_code)
            self.assertIn("no matches found", result.output)

    def test_skip_remains_available_from_main_prompt(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
                tmpdir,
                [
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "normalized_item_id": "gnorm_skip",
                        "raw_item_name": "TEST ITEM",
                        "normalized_item_name": "TEST ITEM",
                        "image_url": "",
                        "upc": "",
                        "line_total": "1.00",
                    }
                ],
            )

            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()

            result = CliRunner().invoke(
                review_products.main,
                [
                    "--giant-items-enriched-csv",
                    str(giant_items_csv),
                    "--costco-items-enriched-csv",
                    str(costco_items_csv),
                    "--giant-orders-csv",
                    str(giant_orders_csv),
                    "--costco-orders-csv",
                    str(costco_orders_csv),
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--links-csv",
                    str(links_csv),
                    "--limit",
                    "1",
                ],
                input="s\n",
                color=True,
            )

            self.assertEqual(0, result.exit_code)
            with resolutions_csv.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            self.assertEqual("skip", rows[0]["resolution_action"])
            self.assertEqual("pending", rows[0]["status"])
def test_review_products_creates_catalog_and_resolution(self):
with tempfile.TemporaryDirectory() as tmpdir:
@@ -381,30 +641,13 @@ class ReviewWorkflowTests(unittest.TestCase):
queue_csv = Path(tmpdir) / "review_queue.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv"
with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(
handle,
fieldnames=[
"purchase_date",
"normalized_item_id",
"catalog_id",
"retailer",
"raw_item_name",
"normalized_item_name",
"image_url",
"upc",
"line_total",
"order_id",
"line_no",
],
)
writer.writeheader()
writer.writerow(
links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
tmpdir,
[
{
"purchase_date": "2026-03-15",
"normalized_item_id": "gnorm_ice",
"catalog_id": "",
"retailer": "giant",
"raw_item_name": "SB BAGGED ICE 20LB",
"normalized_item_name": "BAGGED ICE",
@@ -414,6 +657,7 @@ class ReviewWorkflowTests(unittest.TestCase):
"order_id": "g1",
"line_no": "1",
}
],
)
with mock.patch.object(
@@ -422,10 +666,15 @@ class ReviewWorkflowTests(unittest.TestCase):
side_effect=["n", "ICE", "frozen", "ice", "manual merge", "q"],
):
review_products.main.callback(
giant_items_enriched_csv=str(giant_items_csv),
costco_items_enriched_csv=str(costco_items_csv),
giant_orders_csv=str(giant_orders_csv),
costco_orders_csv=str(costco_orders_csv),
purchases_csv=str(purchases_csv),
queue_csv=str(queue_csv),
resolutions_csv=str(resolutions_csv),
catalog_csv=str(catalog_csv),
links_csv=str(links_csv),
limit=1,
refresh_only=False,
)
@@ -433,13 +682,78 @@ class ReviewWorkflowTests(unittest.TestCase):
self.assertTrue(queue_csv.exists())
self.assertTrue(resolutions_csv.exists())
self.assertTrue(catalog_csv.exists())
self.assertTrue(links_csv.exists())
with queue_csv.open(newline="", encoding="utf-8") as handle:
queue_rows = list(csv.DictReader(handle))
with resolutions_csv.open(newline="", encoding="utf-8") as handle:
resolution_rows = list(csv.DictReader(handle))
with catalog_csv.open(newline="", encoding="utf-8") as handle:
catalog_rows = list(csv.DictReader(handle))
with links_csv.open(newline="", encoding="utf-8") as handle:
link_rows = list(csv.DictReader(handle))
self.assertEqual("approved", queue_rows[0]["status"])
self.assertEqual("create", queue_rows[0]["resolution_action"])
self.assertEqual("create", resolution_rows[0]["resolution_action"])
self.assertEqual("approved", resolution_rows[0]["status"])
self.assertEqual("ICE", catalog_rows[0]["catalog_name"])
self.assertEqual(catalog_rows[0]["catalog_id"], link_rows[0]["catalog_id"])
    def test_build_review_queue_readds_orphaned_and_incomplete_links(self):
        purchase_rows = [
            {
                "normalized_item_id": "gnorm_orphan",
                "catalog_id": "cat_missing",
                "retailer": "giant",
                "raw_item_name": "ORPHAN ITEM",
                "normalized_item_name": "ORPHAN ITEM",
                "upc": "",
                "line_total": "3.50",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "normalized_item_id": "gnorm_incomplete",
                "catalog_id": "cat_incomplete",
                "retailer": "giant",
                "raw_item_name": "INCOMPLETE ITEM",
                "normalized_item_name": "INCOMPLETE ITEM",
                "upc": "",
                "line_total": "4.50",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
        ]
        link_rows = [
            {
                "normalized_item_id": "gnorm_orphan",
                "catalog_id": "cat_missing",
            },
            {
                "normalized_item_id": "gnorm_incomplete",
                "catalog_id": "cat_incomplete",
            },
        ]
        catalog_rows = [
            {
                "catalog_id": "cat_incomplete",
                "catalog_name": "INCOMPLETE ITEM",
                "product_type": "",
            }
        ]

        queue_rows = review_products.build_review_queue(
            purchase_rows,
            [],
            link_rows,
            catalog_rows,
            [],
        )

        reasons = {row["normalized_item_id"]: row["reason_code"] for row in queue_rows}
        self.assertEqual("orphaned_catalog_link", reasons["gnorm_orphan"])
        self.assertEqual("incomplete_catalog_link", reasons["gnorm_incomplete"])
if __name__ == "__main__":

@@ -3,7 +3,7 @@ import tempfile
import unittest
from pathlib import Path
import scraper
import scrape_giant as scraper
class ScraperTests(unittest.TestCase):

@@ -1,154 +0,0 @@
import json
from pathlib import Path

import click

import build_canonical_layer
import build_observed_products
from layer_helpers import stable_id, write_csv_rows


PROOF_FIELDS = [
    "proof_name",
    "canonical_product_id",
    "giant_observed_product_id",
    "costco_observed_product_id",
    "giant_example_item",
    "costco_example_item",
    "notes",
]


def read_rows(path):
    import csv

    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def find_proof_pair(observed_rows):
    giant = None
    costco = None
    for row in observed_rows:
        if row["retailer"] == "giant" and row["representative_name_norm"] == "BANANA":
            giant = row
        if row["retailer"] == "costco" and row["representative_name_norm"] == "BANANA":
            costco = row
    return giant, costco


def merge_proof_pair(canonical_rows, link_rows, giant_row, costco_row):
    if not giant_row or not costco_row:
        return canonical_rows, link_rows, []

    proof_canonical_id = stable_id("gcan", "proof|banana")
    link_rows = [
        row
        for row in link_rows
        if row["observed_product_id"]
        not in {giant_row["observed_product_id"], costco_row["observed_product_id"]}
    ]
    canonical_rows = [
        row
        for row in canonical_rows
        if row["canonical_product_id"] != proof_canonical_id
    ]
    canonical_rows.append(
        {
            "canonical_product_id": proof_canonical_id,
            "canonical_name": "BANANA",
            "product_type": "banana",
            "brand": "",
            "variant": "",
            "size_value": "",
            "size_unit": "",
            "pack_qty": "",
            "measure_type": "weight",
            "normalized_quantity": "",
            "normalized_quantity_unit": "",
            "notes": "manual proof merge for cross-retailer validation",
            "created_at": "",
            "updated_at": "",
        }
    )
    for observed_row in [giant_row, costco_row]:
        link_rows.append(
            {
                "observed_product_id": observed_row["observed_product_id"],
                "canonical_product_id": proof_canonical_id,
                "link_method": "manual_proof_merge",
                "link_confidence": "medium",
                "review_status": "",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "cross-retailer validation proof",
            }
        )

    proof_rows = [
        {
            "proof_name": "banana",
            "canonical_product_id": proof_canonical_id,
            "giant_observed_product_id": giant_row["observed_product_id"],
            "costco_observed_product_id": costco_row["observed_product_id"],
            "giant_example_item": giant_row["example_item_name"],
            "costco_example_item": costco_row["example_item_name"],
            "notes": "BANANA proof pair built from Giant and Costco enriched rows",
        }
    ]
    return canonical_rows, link_rows, proof_rows


@click.command()
@click.option(
    "--giant-items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
)
@click.option(
    "--costco-items-enriched-csv",
    default="costco_output/items_enriched.csv",
    show_default=True,
)
@click.option(
    "--outdir",
    default="combined_output",
    show_default=True,
)
def main(giant_items_enriched_csv, costco_items_enriched_csv, outdir):
    outdir = Path(outdir)
    rows = read_rows(giant_items_enriched_csv) + read_rows(costco_items_enriched_csv)
    observed_rows = build_observed_products.build_observed_products(rows)
    canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows)
    giant_row, costco_row = find_proof_pair(observed_rows)
    if not giant_row or not costco_row:
        raise click.ClickException(
            "could not find BANANA proof pair across Giant and Costco observed products"
        )

    canonical_rows, link_rows, proof_rows = merge_proof_pair(
        canonical_rows, link_rows, giant_row, costco_row
    )

    write_csv_rows(
        outdir / "products_observed.csv",
        observed_rows,
        build_observed_products.OUTPUT_FIELDS,
    )
    write_csv_rows(
        outdir / "products_canonical.csv",
        canonical_rows,
        build_canonical_layer.CANONICAL_FIELDS,
    )
    write_csv_rows(
        outdir / "product_links.csv",
        link_rows,
        build_canonical_layer.LINK_FIELDS,
    )
    write_csv_rows(outdir / "proof_examples.csv", proof_rows, PROOF_FIELDS)

    click.echo(
        f"wrote combined outputs to {outdir} using {len(observed_rows)} observed rows"
    )


if __name__ == "__main__":
    main()