Compare commits

...

22 Commits

Author SHA1 Message Date
ben
74d17b0b0c minor edit 2026-03-24 17:28:16 -04:00
ben
fea5132100 minor edi 2026-03-24 17:27:34 -04:00
ben
eb3959ae0f Record t1.22.1 task evidence 2026-03-24 17:26:00 -04:00
ben
867275c67a Trim requirements to direct runtime deps 2026-03-24 17:25:52 -04:00
ben
6336c15da8 Record t1.22 task evidence 2026-03-24 17:10:09 -04:00
ben
09829b2b9d Finalize post-refactor layout and remove old pipeline files 2026-03-24 17:09:57 -04:00
ben
cdb7a15739 Record t1.21 task evidence 2026-03-24 16:49:01 -04:00
ben
46a3b2c639 Add purchase analysis summaries 2026-03-24 16:48:53 -04:00
ben
c35688c87f Record t1.20 task evidence 2026-03-24 08:29:31 -04:00
ben
6940f165fb Document visit-level purchase analysis 2026-03-24 08:29:26 -04:00
ben
de8ff535b8 1.18 cleanup and review 2026-03-24 08:27:41 -04:00
ben
02be6f52c0 Record t1.19 task evidence 2026-03-23 15:32:48 -04:00
ben
8ccf3ff43b Reconcile review queue against current catalog state 2026-03-23 15:32:41 -04:00
ben
a93229408b Record t1.18.4 task evidence 2026-03-23 15:28:05 -04:00
ben
a45522c110 Finalize purchase effective price fields 2026-03-23 15:27:58 -04:00
ben
d78230f1c6 Record t1.18.3 task evidence 2026-03-23 13:56:56 -04:00
ben
73176117fe Fix Costco hash-size weight parsing 2026-03-23 13:56:47 -04:00
ben
facebced9c Record t1.18.2 task evidence 2026-03-23 13:23:03 -04:00
ben
23dfc3de3e Use picked weight for Giant quantity basis 2026-03-23 13:22:56 -04:00
ben
3bc76ed243 Record t1.18 and t1.18.1 evidence 2026-03-23 12:54:09 -04:00
ben
dc0d0614bb Add effective price to purchases 2026-03-23 12:53:54 -04:00
ben
605c94498b Add effective price regression tests 2026-03-23 12:52:41 -04:00
26 changed files with 1659 additions and 1421 deletions

View File

@@ -6,19 +6,14 @@ Run each script step-by-step from the terminal.
## What It Does ## What It Does
1. `scrape_giant.py`: download Giant orders and items 1. `collect_giant_web.py`: download Giant orders and items
2. `enrich_giant.py`: normalize Giant line items 2. `normalize_giant_web.py`: normalize Giant line items
3. `scrape_costco.py`: download Costco orders and items 3. `collect_costco_web.py`: download Costco orders and items
4. `enrich_costco.py`: normalize Costco line items 4. `normalize_costco_web.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table 5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal 6. `review_products.py`: review unresolved product matches in the terminal
7. `report_pipeline_status.py`: show how many rows survive each stage 7. `report_pipeline_status.py`: show how many rows survive each stage
8. `analyze_purchases.py`: write chart-ready analysis CSVs from the purchase table
Active refactor entrypoints:
- `collect_giant_web.py`
- `collect_costco_web.py`
- `normalize_giant_web.py`
- `normalize_costco_web.py`
## Requirements ## Requirements
@@ -64,13 +59,20 @@ data/
collected_items.csv collected_items.csv
normalized_items.csv normalized_items.csv
review/ review/
catalog.csv
review_queue.csv review_queue.csv
review_resolutions.csv review_resolutions.csv
product_links.csv product_links.csv
purchases.csv
pipeline_status.csv pipeline_status.csv
pipeline_status.json pipeline_status.json
catalog.csv analysis/
purchases.csv
comparison_examples.csv
item_price_over_time.csv
spend_by_visit.csv
items_per_visit.csv
category_spend_over_time.csv
retailer_store_breakdown.csv
``` ```
## Run Order ## Run Order
@@ -87,6 +89,7 @@ python review_products.py
python build_purchases.py python build_purchases.py
python review_products.py --refresh-only python review_products.py --refresh-only
python report_pipeline_status.py python report_pipeline_status.py
python analyze_purchases.py
``` ```
Why run `build_purchases.py` twice: Why run `build_purchases.py` twice:
@@ -120,14 +123,32 @@ Costco:
- `data/costco-web/normalized_items.csv` preserves raw totals and matched net discount fields - `data/costco-web/normalized_items.csv` preserves raw totals and matched net discount fields
Combined: Combined:
- `data/review/purchases.csv` - `data/analysis/purchases.csv`
- `data/analysis/comparison_examples.csv`
- `data/analysis/item_price_over_time.csv`
- `data/analysis/spend_by_visit.csv`
- `data/analysis/items_per_visit.csv`
- `data/analysis/category_spend_over_time.csv`
- `data/analysis/retailer_store_breakdown.csv`
- `data/review/review_queue.csv` - `data/review/review_queue.csv`
- `data/review/review_resolutions.csv` - `data/review/review_resolutions.csv`
- `data/review/product_links.csv` - `data/review/product_links.csv`
- `data/review/comparison_examples.csv`
- `data/review/pipeline_status.csv` - `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json` - `data/review/pipeline_status.json`
- `data/catalog.csv` - `data/review/catalog.csv`
`data/analysis/purchases.csv` is the main analysis artifact. It is designed to support both:
- item-level price analysis
- visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown
The visit fields are carried directly in `purchases.csv`, so you can pivot on them without extra joins:
- `order_id`
- `purchase_date`
- `retailer`
- `store_name`
- `store_number`
- `store_city`
- `store_state`
## Review Workflow ## Review Workflow
@@ -144,9 +165,7 @@ The review step is intentionally conservative:
## Notes ## Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction. - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py`, `scrape_costco.py`, `enrich_giant.py`, and `enrich_costco.py` are now legacy-compatible entrypoints; prefer the `collect_*` and `normalize_*` scripts for active work.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment. - Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
## Test ## Test

271
analyze_purchases.py Normal file
View File

@@ -0,0 +1,271 @@
from collections import defaultdict
from pathlib import Path
import click
from enrich_giant import format_decimal, to_decimal
from layer_helpers import read_csv_rows, write_csv_rows
ITEM_PRICE_FIELDS = [
"purchase_date",
"retailer",
"store_name",
"store_number",
"store_city",
"store_state",
"order_id",
"catalog_id",
"catalog_name",
"category",
"product_type",
"effective_price",
"effective_price_unit",
"net_line_total",
"normalized_quantity",
]
SPEND_BY_VISIT_FIELDS = [
"purchase_date",
"retailer",
"order_id",
"store_name",
"store_number",
"store_city",
"store_state",
"visit_spend_total",
]
ITEMS_PER_VISIT_FIELDS = [
"purchase_date",
"retailer",
"order_id",
"store_name",
"store_number",
"store_city",
"store_state",
"item_row_count",
"distinct_catalog_count",
]
CATEGORY_SPEND_FIELDS = [
"purchase_date",
"retailer",
"category",
"category_spend_total",
]
RETAILER_STORE_FIELDS = [
"retailer",
"store_name",
"store_number",
"store_city",
"store_state",
"visit_count",
"item_row_count",
"store_spend_total",
]
def effective_total(row):
total = to_decimal(row.get("net_line_total"))
if total is not None:
return total
return to_decimal(row.get("line_total"))
def is_item_row(row):
return (
row.get("is_fee") != "true"
and row.get("is_discount_line") != "true"
and row.get("is_coupon_line") != "true"
)
def build_item_price_rows(purchase_rows):
rows = []
for row in purchase_rows:
if not row.get("catalog_name") or not row.get("effective_price"):
continue
rows.append(
{
"purchase_date": row.get("purchase_date", ""),
"retailer": row.get("retailer", ""),
"store_name": row.get("store_name", ""),
"store_number": row.get("store_number", ""),
"store_city": row.get("store_city", ""),
"store_state": row.get("store_state", ""),
"order_id": row.get("order_id", ""),
"catalog_id": row.get("catalog_id", ""),
"catalog_name": row.get("catalog_name", ""),
"category": row.get("category", ""),
"product_type": row.get("product_type", ""),
"effective_price": row.get("effective_price", ""),
"effective_price_unit": row.get("effective_price_unit", ""),
"net_line_total": row.get("net_line_total", ""),
"normalized_quantity": row.get("normalized_quantity", ""),
}
)
return rows
def build_spend_by_visit_rows(purchase_rows):
grouped = defaultdict(lambda: {"total": to_decimal("0")})
for row in purchase_rows:
total = effective_total(row)
if total is None:
continue
key = (
row.get("purchase_date", ""),
row.get("retailer", ""),
row.get("order_id", ""),
row.get("store_name", ""),
row.get("store_number", ""),
row.get("store_city", ""),
row.get("store_state", ""),
)
grouped[key]["total"] += total
rows = []
for key, values in sorted(grouped.items()):
rows.append(
{
"purchase_date": key[0],
"retailer": key[1],
"order_id": key[2],
"store_name": key[3],
"store_number": key[4],
"store_city": key[5],
"store_state": key[6],
"visit_spend_total": format_decimal(values["total"]),
}
)
return rows
def build_items_per_visit_rows(purchase_rows):
grouped = defaultdict(lambda: {"item_rows": 0, "catalog_ids": set()})
for row in purchase_rows:
if not is_item_row(row):
continue
key = (
row.get("purchase_date", ""),
row.get("retailer", ""),
row.get("order_id", ""),
row.get("store_name", ""),
row.get("store_number", ""),
row.get("store_city", ""),
row.get("store_state", ""),
)
grouped[key]["item_rows"] += 1
if row.get("catalog_id"):
grouped[key]["catalog_ids"].add(row["catalog_id"])
rows = []
for key, values in sorted(grouped.items()):
rows.append(
{
"purchase_date": key[0],
"retailer": key[1],
"order_id": key[2],
"store_name": key[3],
"store_number": key[4],
"store_city": key[5],
"store_state": key[6],
"item_row_count": str(values["item_rows"]),
"distinct_catalog_count": str(len(values["catalog_ids"])),
}
)
return rows
def build_category_spend_rows(purchase_rows):
grouped = defaultdict(lambda: to_decimal("0"))
for row in purchase_rows:
category = row.get("category", "")
total = effective_total(row)
if not category or total is None:
continue
key = (
row.get("purchase_date", ""),
row.get("retailer", ""),
category,
)
grouped[key] += total
rows = []
for key, total in sorted(grouped.items()):
rows.append(
{
"purchase_date": key[0],
"retailer": key[1],
"category": key[2],
"category_spend_total": format_decimal(total),
}
)
return rows
def build_retailer_store_rows(purchase_rows):
grouped = defaultdict(lambda: {"visit_ids": set(), "item_rows": 0, "total": to_decimal("0")})
for row in purchase_rows:
total = effective_total(row)
key = (
row.get("retailer", ""),
row.get("store_name", ""),
row.get("store_number", ""),
row.get("store_city", ""),
row.get("store_state", ""),
)
grouped[key]["visit_ids"].add((row.get("purchase_date", ""), row.get("order_id", "")))
if is_item_row(row):
grouped[key]["item_rows"] += 1
if total is not None:
grouped[key]["total"] += total
rows = []
for key, values in sorted(grouped.items()):
rows.append(
{
"retailer": key[0],
"store_name": key[1],
"store_number": key[2],
"store_city": key[3],
"store_state": key[4],
"visit_count": str(len(values["visit_ids"])),
"item_row_count": str(values["item_rows"]),
"store_spend_total": format_decimal(values["total"]),
}
)
return rows
@click.command()
@click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--output-dir", default="data/analysis", show_default=True)
def main(purchases_csv, output_dir):
purchase_rows = read_csv_rows(purchases_csv)
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
item_price_rows = build_item_price_rows(purchase_rows)
spend_by_visit_rows = build_spend_by_visit_rows(purchase_rows)
items_per_visit_rows = build_items_per_visit_rows(purchase_rows)
category_spend_rows = build_category_spend_rows(purchase_rows)
retailer_store_rows = build_retailer_store_rows(purchase_rows)
outputs = [
("item_price_over_time.csv", item_price_rows, ITEM_PRICE_FIELDS),
("spend_by_visit.csv", spend_by_visit_rows, SPEND_BY_VISIT_FIELDS),
("items_per_visit.csv", items_per_visit_rows, ITEMS_PER_VISIT_FIELDS),
("category_spend_over_time.csv", category_spend_rows, CATEGORY_SPEND_FIELDS),
("retailer_store_breakdown.csv", retailer_store_rows, RETAILER_STORE_FIELDS),
]
for filename, rows, fieldnames in outputs:
write_csv_rows(output_path / filename, rows, fieldnames)
click.echo(f"wrote analysis outputs to {output_path}")
if __name__ == "__main__":
main()

View File

@@ -1,220 +0,0 @@
import click
import re
from layer_helpers import read_csv_rows, representative_value, stable_id, write_csv_rows
CANONICAL_FIELDS = [
"canonical_product_id",
"canonical_name",
"product_type",
"brand",
"variant",
"size_value",
"size_unit",
"pack_qty",
"measure_type",
"normalized_quantity",
"normalized_quantity_unit",
"notes",
"created_at",
"updated_at",
]
CANONICAL_DROP_TOKENS = {"CT", "COUNT", "COUNTS", "DOZ", "DOZEN", "DOZ.", "PACK"}
LINK_FIELDS = [
"observed_product_id",
"canonical_product_id",
"link_method",
"link_confidence",
"review_status",
"reviewed_by",
"reviewed_at",
"link_notes",
]
def to_float(value):
try:
return float(value)
except (TypeError, ValueError):
return None
def normalized_quantity(row):
size_value = to_float(row.get("representative_size_value"))
pack_qty = to_float(row.get("representative_pack_qty")) or 1.0
size_unit = row.get("representative_size_unit", "")
measure_type = row.get("representative_measure_type", "")
if size_value is not None and size_unit:
return format(size_value * pack_qty, "g"), size_unit
if row.get("representative_pack_qty") and measure_type == "count":
return row["representative_pack_qty"], "count"
if measure_type == "each":
return "1", "each"
return "", ""
def auto_link_rule(observed_row):
if (
observed_row.get("is_fee") == "true"
or observed_row.get("is_discount_line") == "true"
or observed_row.get("is_coupon_line") == "true"
):
return "", "", ""
if observed_row.get("representative_upc"):
return (
"exact_upc",
f"upc={observed_row['representative_upc']}",
"high",
)
if (
observed_row.get("representative_name_norm")
and observed_row.get("representative_size_value")
and observed_row.get("representative_size_unit")
):
return (
"exact_name_size",
"|".join(
[
f"name={observed_row['representative_name_norm']}",
f"size={observed_row['representative_size_value']}",
f"unit={observed_row['representative_size_unit']}",
f"pack={observed_row['representative_pack_qty']}",
f"measure={observed_row['representative_measure_type']}",
]
),
"high",
)
return "", "", ""
def clean_canonical_name(name):
tokens = []
for token in re.sub(r"[^A-Z0-9\s]", " ", (name or "").upper()).split():
if token.isdigit():
continue
if token in CANONICAL_DROP_TOKENS:
continue
if re.fullmatch(r"\d+(?:PK|PACK)", token):
continue
if re.fullmatch(r"\d+DZ", token):
continue
tokens.append(token)
return " ".join(tokens).strip()
def canonical_row_for_group(canonical_product_id, group_rows, link_method):
quantity_value, quantity_unit = normalized_quantity(
{
"representative_size_value": representative_value(
group_rows, "representative_size_value"
),
"representative_size_unit": representative_value(
group_rows, "representative_size_unit"
),
"representative_pack_qty": representative_value(
group_rows, "representative_pack_qty"
),
"representative_measure_type": representative_value(
group_rows, "representative_measure_type"
),
}
)
return {
"canonical_product_id": canonical_product_id,
"canonical_name": clean_canonical_name(
representative_value(group_rows, "representative_name_norm")
)
or representative_value(group_rows, "representative_name_norm"),
"product_type": "",
"brand": representative_value(group_rows, "representative_brand"),
"variant": representative_value(group_rows, "representative_variant"),
"size_value": representative_value(group_rows, "representative_size_value"),
"size_unit": representative_value(group_rows, "representative_size_unit"),
"pack_qty": representative_value(group_rows, "representative_pack_qty"),
"measure_type": representative_value(group_rows, "representative_measure_type"),
"normalized_quantity": quantity_value,
"normalized_quantity_unit": quantity_unit,
"notes": f"auto-linked via {link_method}",
"created_at": "",
"updated_at": "",
}
def build_canonical_layer(observed_rows):
canonical_rows = []
link_rows = []
groups = {}
for observed_row in sorted(observed_rows, key=lambda row: row["observed_product_id"]):
link_method, group_key, confidence = auto_link_rule(observed_row)
if not group_key:
continue
canonical_product_id = stable_id("gcan", f"{link_method}|{group_key}")
groups.setdefault(canonical_product_id, {"method": link_method, "rows": []})
groups[canonical_product_id]["rows"].append(observed_row)
link_rows.append(
{
"observed_product_id": observed_row["observed_product_id"],
"canonical_product_id": canonical_product_id,
"link_method": link_method,
"link_confidence": confidence,
"review_status": "",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
}
)
for canonical_product_id, group in sorted(groups.items()):
canonical_rows.append(
canonical_row_for_group(
canonical_product_id, group["rows"], group["method"]
)
)
return canonical_rows, link_rows
@click.command()
@click.option(
"--observed-csv",
default="giant_output/products_observed.csv",
show_default=True,
help="Path to observed product rows.",
)
@click.option(
"--canonical-csv",
default="giant_output/products_canonical.csv",
show_default=True,
help="Path to canonical product output.",
)
@click.option(
"--links-csv",
default="giant_output/product_links.csv",
show_default=True,
help="Path to observed-to-canonical link output.",
)
def main(observed_csv, canonical_csv, links_csv):
observed_rows = read_csv_rows(observed_csv)
canonical_rows, link_rows = build_canonical_layer(observed_rows)
write_csv_rows(canonical_csv, canonical_rows, CANONICAL_FIELDS)
write_csv_rows(links_csv, link_rows, LINK_FIELDS)
click.echo(
f"wrote {len(canonical_rows)} canonical rows to {canonical_csv} and "
f"{len(link_rows)} links to {links_csv}"
)
if __name__ == "__main__":
main()

View File

@@ -1,172 +0,0 @@
from collections import defaultdict
import click
from layer_helpers import (
compact_join,
distinct_values,
first_nonblank,
read_csv_rows,
representative_value,
stable_id,
write_csv_rows,
)
OUTPUT_FIELDS = [
"observed_product_id",
"retailer",
"observed_key",
"representative_retailer_item_id",
"representative_upc",
"representative_item_name",
"representative_name_norm",
"representative_brand",
"representative_variant",
"representative_size_value",
"representative_size_unit",
"representative_pack_qty",
"representative_measure_type",
"representative_image_url",
"is_store_brand",
"is_fee",
"is_discount_line",
"is_coupon_line",
"first_seen_date",
"last_seen_date",
"times_seen",
"example_order_id",
"example_item_name",
"raw_name_examples",
"normalized_name_examples",
"example_prices",
"distinct_item_names_count",
"distinct_retailer_item_ids_count",
"distinct_upcs_count",
]
def build_observed_key(row):
if row.get("upc"):
return "|".join(
[
row["retailer"],
f"upc={row['upc']}",
f"name={row['item_name_norm']}",
]
)
if row.get("retailer_item_id"):
return "|".join(
[
row["retailer"],
f"retailer_item_id={row['retailer_item_id']}",
f"name={row['item_name_norm']}",
f"discount={row.get('is_discount_line', 'false')}",
f"coupon={row.get('is_coupon_line', 'false')}",
]
)
return "|".join(
[
row["retailer"],
f"name={row['item_name_norm']}",
f"size={row['size_value']}",
f"unit={row['size_unit']}",
f"pack={row['pack_qty']}",
f"measure={row['measure_type']}",
f"store_brand={row['is_store_brand']}",
f"fee={row['is_fee']}",
]
)
def build_observed_products(rows):
grouped = defaultdict(list)
for row in rows:
grouped[build_observed_key(row)].append(row)
observed_rows = []
for observed_key, group_rows in sorted(grouped.items()):
ordered = sorted(
group_rows,
key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])),
)
observed_rows.append(
{
"observed_product_id": stable_id("gobs", observed_key),
"retailer": ordered[0]["retailer"],
"observed_key": observed_key,
"representative_retailer_item_id": representative_value(
ordered, "retailer_item_id"
),
"representative_upc": representative_value(ordered, "upc"),
"representative_item_name": representative_value(ordered, "item_name"),
"representative_name_norm": representative_value(
ordered, "item_name_norm"
),
"representative_brand": representative_value(ordered, "brand_guess"),
"representative_variant": representative_value(ordered, "variant"),
"representative_size_value": representative_value(ordered, "size_value"),
"representative_size_unit": representative_value(ordered, "size_unit"),
"representative_pack_qty": representative_value(ordered, "pack_qty"),
"representative_measure_type": representative_value(
ordered, "measure_type"
),
"representative_image_url": first_nonblank(ordered, "image_url"),
"is_store_brand": representative_value(ordered, "is_store_brand"),
"is_fee": representative_value(ordered, "is_fee"),
"is_discount_line": representative_value(
ordered, "is_discount_line"
),
"is_coupon_line": representative_value(ordered, "is_coupon_line"),
"first_seen_date": ordered[0]["order_date"],
"last_seen_date": ordered[-1]["order_date"],
"times_seen": str(len(ordered)),
"example_order_id": ordered[0]["order_id"],
"example_item_name": ordered[0]["item_name"],
"raw_name_examples": compact_join(
distinct_values(ordered, "item_name"), limit=4
),
"normalized_name_examples": compact_join(
distinct_values(ordered, "item_name_norm"), limit=4
),
"example_prices": compact_join(
distinct_values(ordered, "line_total"), limit=4
),
"distinct_item_names_count": str(
len(distinct_values(ordered, "item_name"))
),
"distinct_retailer_item_ids_count": str(
len(distinct_values(ordered, "retailer_item_id"))
),
"distinct_upcs_count": str(len(distinct_values(ordered, "upc"))),
}
)
observed_rows.sort(key=lambda row: row["observed_product_id"])
return observed_rows
@click.command()
@click.option(
"--items-enriched-csv",
default="giant_output/items_enriched.csv",
show_default=True,
help="Path to enriched Giant item rows.",
)
@click.option(
"--output-csv",
default="giant_output/products_observed.csv",
show_default=True,
help="Path to observed product output.",
)
def main(items_enriched_csv, output_csv):
rows = read_csv_rows(items_enriched_csv)
observed_rows = build_observed_products(rows)
write_csv_rows(output_csv, observed_rows, OUTPUT_FIELDS)
click.echo(f"wrote {len(observed_rows)} rows to {output_csv}")
if __name__ == "__main__":
main()

View File

@@ -10,6 +10,14 @@ from layer_helpers import read_csv_rows, write_csv_rows
PURCHASE_FIELDS = [ PURCHASE_FIELDS = [
"purchase_date", "purchase_date",
"retailer", "retailer",
"catalog_name",
"product_type",
"category",
"net_line_total",
"normalized_quantity",
"normalized_quantity_unit",
"effective_price",
"effective_price_unit",
"order_id", "order_id",
"line_no", "line_no",
"normalized_row_id", "normalized_row_id",
@@ -19,9 +27,6 @@ PURCHASE_FIELDS = [
"resolution_action", "resolution_action",
"raw_item_name", "raw_item_name",
"normalized_item_name", "normalized_item_name",
"catalog_name",
"category",
"product_type",
"brand", "brand",
"variant", "variant",
"image_url", "image_url",
@@ -29,8 +34,6 @@ PURCHASE_FIELDS = [
"upc", "upc",
"qty", "qty",
"unit", "unit",
"normalized_quantity",
"normalized_quantity_unit",
"pack_qty", "pack_qty",
"size_value", "size_value",
"size_unit", "size_unit",
@@ -172,6 +175,41 @@ def derive_metrics(row):
} }
def derive_effective_price(row):
normalized_quantity = to_decimal(row.get("normalized_quantity"))
if normalized_quantity in (None, Decimal("0")):
return ""
numerator = to_decimal(derive_net_line_total(row))
if numerator is None:
return ""
return format_decimal(numerator / normalized_quantity)
def derive_effective_price_unit(row):
normalized_quantity = to_decimal(row.get("normalized_quantity"))
if normalized_quantity in (None, Decimal("0")):
return ""
return row.get("normalized_quantity_unit", "")
def derive_net_line_total(row):
existing_net = row.get("net_line_total", "")
if str(existing_net).strip() != "":
return str(existing_net)
line_total = to_decimal(row.get("line_total"))
if line_total is None:
return ""
matched_discount_amount = to_decimal(row.get("matched_discount_amount"))
if matched_discount_amount is not None:
return format_decimal(line_total + matched_discount_amount)
return format_decimal(line_total)
def order_lookup(rows, retailer): def order_lookup(rows, retailer):
return {(retailer, row["order_id"]): row for row in rows} return {(retailer, row["order_id"]): row for row in rows}
@@ -320,6 +358,14 @@ def build_purchase_rows(
{ {
"purchase_date": row["order_date"], "purchase_date": row["order_date"],
"retailer": row["retailer"], "retailer": row["retailer"],
"catalog_name": catalog_row.get("catalog_name", ""),
"product_type": catalog_row.get("product_type", ""),
"category": catalog_row.get("category", ""),
"net_line_total": derive_net_line_total(row),
"normalized_quantity": row.get("normalized_quantity", ""),
"normalized_quantity_unit": row.get("normalized_quantity_unit", ""),
"effective_price": derive_effective_price({**row, "net_line_total": derive_net_line_total(row)}),
"effective_price_unit": derive_effective_price_unit(row),
"order_id": row["order_id"], "order_id": row["order_id"],
"line_no": row["line_no"], "line_no": row["line_no"],
"normalized_row_id": row.get("normalized_row_id", ""), "normalized_row_id": row.get("normalized_row_id", ""),
@@ -329,9 +375,6 @@ def build_purchase_rows(
"resolution_action": resolution.get("resolution_action", ""), "resolution_action": resolution.get("resolution_action", ""),
"raw_item_name": row["item_name"], "raw_item_name": row["item_name"],
"normalized_item_name": row["item_name_norm"], "normalized_item_name": row["item_name_norm"],
"catalog_name": catalog_row.get("catalog_name", ""),
"category": catalog_row.get("category", ""),
"product_type": catalog_row.get("product_type", ""),
"brand": catalog_row.get("brand", ""), "brand": catalog_row.get("brand", ""),
"variant": catalog_row.get("variant", ""), "variant": catalog_row.get("variant", ""),
"image_url": row.get("image_url", ""), "image_url": row.get("image_url", ""),
@@ -339,8 +382,6 @@ def build_purchase_rows(
"upc": row["upc"], "upc": row["upc"],
"qty": row["qty"], "qty": row["qty"],
"unit": row["unit"], "unit": row["unit"],
"normalized_quantity": row.get("normalized_quantity", ""),
"normalized_quantity_unit": row.get("normalized_quantity_unit", ""),
"pack_qty": row["pack_qty"], "pack_qty": row["pack_qty"],
"size_value": row["size_value"], "size_value": row["size_value"],
"size_unit": row["size_unit"], "size_unit": row["size_unit"],
@@ -348,7 +389,6 @@ def build_purchase_rows(
"line_total": row["line_total"], "line_total": row["line_total"],
"unit_price": row["unit_price"], "unit_price": row["unit_price"],
"matched_discount_amount": row.get("matched_discount_amount", ""), "matched_discount_amount": row.get("matched_discount_amount", ""),
"net_line_total": row.get("net_line_total", ""),
"store_name": order_row.get("store_name", ""), "store_name": order_row.get("store_name", ""),
"store_number": order_row.get("store_number", ""), "store_number": order_row.get("store_number", ""),
"store_city": order_row.get("store_city", ""), "store_city": order_row.get("store_city", ""),
@@ -400,10 +440,10 @@ def build_comparison_examples(purchase_rows):
@click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True) @click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True) @click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True) @click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="data/catalog.csv", show_default=True) @click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True) @click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--output-csv", default="data/review/purchases.csv", show_default=True) @click.option("--output-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--examples-csv", default="data/review/comparison_examples.csv", show_default=True) @click.option("--examples-csv", default="data/analysis/comparison_examples.csv", show_default=True)
def main( def main(
giant_items_enriched_csv, giant_items_enriched_csv,
costco_items_enriched_csv, costco_items_enriched_csv,

View File

@@ -1,175 +0,0 @@
from collections import defaultdict
from datetime import date
import click
from layer_helpers import compact_join, distinct_values, read_csv_rows, stable_id, write_csv_rows
OUTPUT_FIELDS = [
"review_id",
"queue_type",
"retailer",
"observed_product_id",
"canonical_product_id",
"reason_code",
"priority",
"raw_item_names",
"normalized_names",
"upc",
"image_url",
"example_prices",
"seen_count",
"status",
"resolution_notes",
"created_at",
"updated_at",
]
def existing_review_state(path):
try:
rows = read_csv_rows(path)
except FileNotFoundError:
return {}
return {row["review_id"]: row for row in rows}
def review_reasons(observed_row):
reasons = []
if (
observed_row["is_fee"] == "true"
or observed_row.get("is_discount_line") == "true"
or observed_row.get("is_coupon_line") == "true"
):
return reasons
if observed_row["distinct_upcs_count"] not in {"", "0", "1"}:
reasons.append(("multiple_upcs", "high"))
if observed_row["distinct_item_names_count"] not in {"", "0", "1"}:
reasons.append(("multiple_raw_names", "medium"))
if not observed_row["representative_image_url"]:
reasons.append(("missing_image", "medium"))
if not observed_row["representative_upc"]:
reasons.append(("missing_upc", "high"))
if not observed_row["representative_name_norm"]:
reasons.append(("missing_normalized_name", "high"))
return reasons
def build_review_queue(observed_rows, item_rows, existing_rows, today_text):
by_observed = defaultdict(list)
for row in item_rows:
observed_id = row.get("observed_product_id", "")
if observed_id:
by_observed[observed_id].append(row)
queue_rows = []
for observed_row in observed_rows:
reasons = review_reasons(observed_row)
if not reasons:
continue
related_items = by_observed.get(observed_row["observed_product_id"], [])
raw_names = compact_join(distinct_values(related_items, "item_name"), limit=5)
norm_names = compact_join(
distinct_values(related_items, "item_name_norm"), limit=5
)
example_prices = compact_join(
distinct_values(related_items, "line_total"), limit=5
)
for reason_code, priority in reasons:
review_id = stable_id(
"rvw",
f"{observed_row['observed_product_id']}|{reason_code}",
)
prior = existing_rows.get(review_id, {})
queue_rows.append(
{
"review_id": review_id,
"queue_type": "observed_product",
"retailer": observed_row["retailer"],
"observed_product_id": observed_row["observed_product_id"],
"canonical_product_id": prior.get("canonical_product_id", ""),
"reason_code": reason_code,
"priority": priority,
"raw_item_names": raw_names,
"normalized_names": norm_names,
"upc": observed_row["representative_upc"],
"image_url": observed_row["representative_image_url"],
"example_prices": example_prices,
"seen_count": observed_row["times_seen"],
"status": prior.get("status", "pending"),
"resolution_notes": prior.get("resolution_notes", ""),
"created_at": prior.get("created_at", today_text),
"updated_at": today_text,
}
)
queue_rows.sort(key=lambda row: (row["priority"], row["reason_code"], row["review_id"]))
return queue_rows
def attach_observed_ids(item_rows, observed_rows):
observed_by_key = {row["observed_key"]: row["observed_product_id"] for row in observed_rows}
attached = []
for row in item_rows:
observed_key = "|".join(
[
row["retailer"],
f"upc={row['upc']}",
f"name={row['item_name_norm']}",
]
) if row.get("upc") else "|".join(
[
row["retailer"],
f"retailer_item_id={row.get('retailer_item_id', '')}",
f"name={row['item_name_norm']}",
f"size={row['size_value']}",
f"unit={row['size_unit']}",
f"pack={row['pack_qty']}",
f"measure={row['measure_type']}",
f"store_brand={row['is_store_brand']}",
f"fee={row['is_fee']}",
f"discount={row.get('is_discount_line', 'false')}",
f"coupon={row.get('is_coupon_line', 'false')}",
]
)
enriched = dict(row)
enriched["observed_product_id"] = observed_by_key.get(observed_key, "")
attached.append(enriched)
return attached
@click.command()
@click.option(
"--observed-csv",
default="giant_output/products_observed.csv",
show_default=True,
help="Path to observed product rows.",
)
@click.option(
"--items-enriched-csv",
default="giant_output/items_enriched.csv",
show_default=True,
help="Path to enriched Giant item rows.",
)
@click.option(
"--output-csv",
default="giant_output/review_queue.csv",
show_default=True,
help="Path to review queue output.",
)
def main(observed_csv, items_enriched_csv, output_csv):
observed_rows = read_csv_rows(observed_csv)
item_rows = read_csv_rows(items_enriched_csv)
item_rows = attach_observed_ids(item_rows, observed_rows)
existing_rows = existing_review_state(output_csv)
today_text = str(date.today())
queue_rows = build_review_queue(observed_rows, item_rows, existing_rows, today_text)
write_csv_rows(output_csv, queue_rows, OUTPUT_FIELDS)
click.echo(f"wrote {len(queue_rows)} rows to {output_csv}")
if __name__ == "__main__":
main()

View File

@@ -29,7 +29,7 @@ CODE_TOKEN_RE = re.compile(
r"\b(?:SL\d+|T\d+H\d+|P\d+(?:/\d+)?|W\d+T\d+H\d+|FY\d+|CSPC#|C\d+T\d+H\d+|EC\d+T\d+H\d+|\d+X\d+)\b" r"\b(?:SL\d+|T\d+H\d+|P\d+(?:/\d+)?|W\d+T\d+H\d+|FY\d+|CSPC#|C\d+T\d+H\d+|EC\d+T\d+H\d+|\d+X\d+)\b"
) )
PACK_FRACTION_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT)\b") PACK_FRACTION_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT)\b")
HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#\b") HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#(?=\s|$)")
ITEM_CODE_RE = re.compile(r"#\w+\b") ITEM_CODE_RE = re.compile(r"#\w+\b")
DUAL_WEIGHT_RE = re.compile( DUAL_WEIGHT_RE = re.compile(
r"\b\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\s*/\s*\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\b" r"\b\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\s*/\s*\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\b"
@@ -199,6 +199,7 @@ def parse_costco_item(order_id, order_date, raw_path, line_no, item):
size_unit, size_unit,
pack_qty, pack_qty,
measure_type, measure_type,
"",
) )
identity_key, normalization_basis = normalization_identity( identity_key, normalization_basis = normalization_identity(
{ {

View File

@@ -344,10 +344,11 @@ def derive_prices(item, measure_type, size_value="", size_unit="", pack_qty=""):
return price_per_each, price_per_lb, price_per_oz return price_per_each, price_per_lb, price_per_oz
def derive_normalized_quantity(qty, size_value, size_unit, pack_qty, measure_type): def derive_normalized_quantity(qty, size_value, size_unit, pack_qty, measure_type, picked_weight=""):
parsed_qty = to_decimal(qty) parsed_qty = to_decimal(qty)
parsed_size = to_decimal(size_value) parsed_size = to_decimal(size_value)
parsed_pack = to_decimal(pack_qty) parsed_pack = to_decimal(pack_qty)
parsed_picked_weight = to_decimal(picked_weight)
total_multiplier = None total_multiplier = None
if parsed_qty not in (None, Decimal("0")): if parsed_qty not in (None, Decimal("0")):
total_multiplier = parsed_qty * (parsed_pack or Decimal("1")) total_multiplier = parsed_qty * (parsed_pack or Decimal("1"))
@@ -358,6 +359,8 @@ def derive_normalized_quantity(qty, size_value, size_unit, pack_qty, measure_typ
and total_multiplier not in (None, Decimal("0")) and total_multiplier not in (None, Decimal("0"))
): ):
return format_decimal(parsed_size * total_multiplier), size_unit return format_decimal(parsed_size * total_multiplier), size_unit
if measure_type == "weight" and parsed_picked_weight not in (None, Decimal("0")):
return format_decimal(parsed_picked_weight), "lb"
if measure_type == "count" and total_multiplier not in (None, Decimal("0")): if measure_type == "count" and total_multiplier not in (None, Decimal("0")):
return format_decimal(total_multiplier), "count" return format_decimal(total_multiplier), "count"
if measure_type == "each" and parsed_qty not in (None, Decimal("0")): if measure_type == "each" and parsed_qty not in (None, Decimal("0")):
@@ -441,6 +444,7 @@ def parse_item(order_id, order_date, raw_path, line_no, item):
size_unit, size_unit,
pack_qty, pack_qty,
measure_type, measure_type,
item.get("totalPickedWeight"),
) )
identity_key, normalization_basis = normalization_identity( identity_key, normalization_basis = normalization_identity(
{ {

View File

@@ -110,8 +110,15 @@ data/
review/ review/
review_queue.csv # Human review queue for unresolved matching/parsing cases. review_queue.csv # Human review queue for unresolved matching/parsing cases.
product_links.csv # Links from normalized retailer items to catalog items. product_links.csv # Links from normalized retailer items to catalog items.
catalog.csv # Cross-retailer product catalog entities used for comparison. catalog.csv # Cross-retailer product catalog entities used for comparison.
purchases.csv analysis/
purchases.csv
comparison_examples.csv
item_price_over_time.csv
spend_by_visit.csv
items_per_visit.csv
category_spend_over_time.csv
retailer_store_breakdown.csv
#+end_example #+end_example
Notes: Notes:
@@ -223,7 +230,7 @@ Notes:
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`. - Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`. - Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`. - Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`. - Cross-retailer identity is handled later in review/combine via `data/review/catalog.csv` and `product_links.csv`.
** `data/review/product_links.csv` ** `data/review/product_links.csv`
One row per review-approved link from a normalized retailer item to a catalog item. One row per review-approved link from a normalized retailer item to a catalog item.
@@ -263,7 +270,7 @@ One row per issue needing human review.
| `resolution_notes` | reviewer notes | | `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date | | `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date | | `updated_at` | last update timestamp or date |
** `data/catalog.csv` ** `data/review/catalog.csv`
One row per cross-retailer catalog product. One row per cross-retailer catalog product.
| key | definition | | key | definition |
|----------------------------+----------------------------------------| |----------------------------+----------------------------------------|
@@ -288,7 +295,7 @@ Notes:
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity. - Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings. - `catalog_name` should come from review-approved naming, not raw retailer strings.
** `data/purchases.csv` ** `data/analysis/purchases.csv`
One row per purchased item (i.e., `is_item`==true from normalized layer), with One row per purchased item (i.e., `is_item`==true from normalized layer), with
catalog attributes denormalized in and discounts already applied. catalog attributes denormalized in and discounts already applied.

View File

@@ -587,4 +587,68 @@ instead of
[5] yellow onion, onion, produce (0 items, 0 rows) [5] yellow onion, onion, produce (0 items, 0 rows)
selection: selection:
* * data cleanup [2026-03-23 Mon]
ok we're getting closer. still see some issues
1. reorder purchases columns for display: catalog_name, product_type, category (makes data/troubleshooting way easier)
2. shouldn't net_line_price should never be empty? to allow cumulative cost comparison/analysis (we can see normalized price per X via effective_price but shouldnt this be weighted against how much we bought? eg if we bought 5lb flour at $0.970/lb this is weighted as 1-to-1 with a 25lb purchase as 0.670/lb
3. some items missing entire categorizations? probably a result of me trying to do data cleanup. i found the orphaned values in teh product_links table and removed them, but re-running review_products.py did not catch this...
shouldn't review_products run a comparison between each vendor's normalized_items and compare to the existing review_queu?
RSET POTATO US 1
GREEK YOGURT DOM55
FDLY CHY VAN IC CRM
DUNKIN DONUT CANISTER ORIG BLND P=260
ICE CUBES
BLACK BEANS
KETCHUP SQUEEZE BTL
YELLOW_GOLD POTATO US 1
YELLOW_GOLD POTATO US 1
PINTO BEANS
4. cleanup deprecated .py files
5. Goals:
1. When have I purchased this item, what did I pay, and how has the price changed over time?
- we're close, but missing units - eg AP flour shows a value that looks like price/lb but you just see $0.765
- doesnt seem like we've captured everything but that's just a gut feeling
2. Visit breakdown as well as catalog/product/category? this certainly belongs in purchases.csv.
3. Consider dash/plotly for better-than-excel tracking, since we're really only looking at a couple of graphs and filtering within certain values? (obv keep purchases as a user-friendly output)
** 1. Cleanup purchases column order
purchase_date
retailer
catalog_name
product_type
category
net_line_total
normalized_quantity
effective_price
effective_price_unit (new)
order_id
line_no
raw_item_name
normalized_item_name
catalog_id
normalized_item_id
** 2. Populate and use purchases.net_line_total
net_line_total = line_total+matched_discount_amoun
effective_price = net_line_total / normalized_quantity
weighted cost analysis uses net_line_total, not just avg effective_price
** 3. Improve review robustness, enable norm_item re review
1. should regenerate candidates from:
- normalized items with no valid catalog_id
- normalized items whose linked catalog_id no longer exists
- normalized items whose linked catalog row exists but missing required fields if you want completeness review
2. review_products.py should compare:
- current normalized universe
- current product_links
- current catalog
- current review_queue
** 4. Remove deprecated.py
** 5. Improve Charts
1. Histogram: add effective_price_unit to purchases.py
1. Visits: plot by order_id enable display of:
1. spend by visit
2. items per visit
3. category spend by visit
4. retailer/store breakdown
* /

View File

@@ -803,26 +803,19 @@ correct and document deterministic normalized quantity fields so unit-cost analy
- The missing purchases fields were a carry-through bug: normalization had `normalized_quantity` and `normalized_quantity_unit`, but `build_purchases.py` never wrote them into `data/review/purchases.csv`. - The missing purchases fields were a carry-through bug: normalization had `normalized_quantity` and `normalized_quantity_unit`, but `build_purchases.py` never wrote them into `data/review/purchases.csv`.
- Normalized quantity now prefers explicit package basis over `each`, so rows like `PEPSI 6PK 7.5Z` resolve to `90 oz` and `KS ALMND BAR US 1.74QTS` purchased twice resolves to `3.48 qt`. - Normalized quantity now prefers explicit package basis over `each`, so rows like `PEPSI 6PK 7.5Z` resolve to `90 oz` and `KS ALMND BAR US 1.74QTS` purchased twice resolves to `3.48 qt`.
- The derivation stays conservative and does not convert units during normalization; parsed units such as `oz`, `lb`, `qt`, and `count` are preserved as-is. - The derivation stays conservative and does not convert units during normalization; parsed units such as `oz`, `lb`, `qt`, and `count` are preserved as-is.
* [ ] t1.18: add regression tests for known quantity/price failures (1-2 commits) * [X] t1.18: add regression tests for known quantity/price failures (1-2 commits)
capture the currently broken comparison cases before changing normalization or purchases logic capture the currently broken comparison cases before changing normalization or purchases logic
** acceptance criteria ** acceptance criteria
1. when generating `data/purchases.csv`, add `effective_price` = `effective_total` / `normalized_quantity` 1. ensure the new tests assert the intended `effective_price` behavior for the known banana, ice, and beef patty examples
2. define `effective_price` behavior explicitly from the covered cases: 2. add tests covering known broken cases:
- use `net_line_total` when present and non-zero, else use `line_total`
- divide by `normalized_quantity` when `normalized_quantity > 0`
- leave blank when no valid denominator exists
- never emit `0` or divide-by-zero for missing-basis cases
- `effective_price` only comparable within same `normalized_quantity_unit` unless later analysis converts the units
3. ensure the new tests assert the intended `effective_price` behavior for the known banana, ice, and beef patty examples
4. add tests covering known broken cases:
- giant bananas produce non-blank effective price - giant bananas produce non-blank effective price
- giant bagged ice produces non-zero effective price - giant bagged ice produces non-zero effective price
- costco bananas retain correct effective price - costco bananas retain correct effective price
- beef patty comparison rows preserve expected quantity basis behavior - beef patty comparison rows preserve expected quantity basis behavior
5. tests fail against current broken behavior and document the expected outcome 3. tests fail against current broken behavior and document the expected outcome
6. include at least one assertion that effective_price is blank rather than `0` or divide-by-zero when no denominator exists 4. include at least one assertion that effective_price is blank rather than `0` or divide-by-zero when no denominator exists
7. pm note: this task should only add tests/fixtures and not change business logic - pm note: this task should only add tests/fixtures and not change business logic
** pm identified problems ** pm identified problems
we have a few problems to scope. looks like: we have a few problems to scope. looks like:
1. normalize_giant_web not always propagating weight data to price_per 1. normalize_giant_web not always propagating weight data to price_per
@@ -862,34 +855,40 @@ purchase_date retailer normalized_item_name catalog_name category product_type q
10/10/2025 giant BAGGED ICE bagged ice cubes frozen ice 1 EA 20 lb 20 lb weight 4.99 4.99 4.99 line_total_over_qty 0.2495 parsed_size_lb 0.0156 parsed_size_lb_to_oz 0 10/10/2025 giant BAGGED ICE bagged ice cubes frozen ice 1 EA 20 lb 20 lb weight 4.99 4.99 4.99 line_total_over_qty 0.2495 parsed_size_lb 0.0156 parsed_size_lb_to_oz 0
``` ```
** evidence ** evidence
- commit: - commit: `605c944`
- tests: - tests: `./venv/bin/python -m unittest tests.test_purchases` (fails as expected before implementation: missing `effective_price` in purchases rows)
- datetime: - datetime: 2026-03-23 12:52:32 EDT
** notes ** notes
- Added purchases-level regression coverage for the known comparison cases before implementation: Giant banana, Costco banana, Giant bagged ice, Costco beef patties, and a blank-denominator case.
- The current failure mode is the intended one for this task: `build_purchase_rows()` does not yet emit `effective_price`, so the tests document the missing behavior before `t1.18.1`.
* [ ] t1.18.1: fix effective price calculation precedence and blank handling (1-3 commits) * [X] t1.18.1: fix effective price calculation precedence and blank handling (1-3 commits)
correct purchases/effective price logic for the known broken cases using existing normalized fields correct purchases/effective price logic for the known broken cases using existing normalized fields
** acceptance criteria ** acceptance criteria
1. effective_price uses explicit numerator precedence: 1. when generating `data/purchases.csv`, add `effective_price` = `effective_total` / `normalized_quantity`
2. effective_price uses explicit numerator precedence:
- prefer `net_line_total` - prefer `net_line_total`
- fallback to `line_total` - fallback to `line_total`
2. effective_price uses `normalized_quantity` when present and > 0 3. effective_price uses `normalized_quantity` if not blank
3. effective_price is blank when no valid denominator exists 4. effective_price is blank when no valid denominator exists
4. effective_price is never written as `0` or divide-by-zero for missing-basis cases 5. effective_price is never written as `0` or divide-by-zero for missing-basis cases
5. existing regression tests for bananas and ice pass 6. effective_price is only comparable within same `normalized_quantity_unit` unless later analysis converts the units
7. existing regression tests for bananas and ice pass
- pm note: keep this limited to calculation logic; do not broaden into catalog or review changes - pm note: keep this limited to calculation logic; do not broaden into catalog or review changes
** evidence ** evidence
- commit: - commit: `dc0d061`
- tests: - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`
- datetime: - datetime: 2026-03-23 12:53:34 EDT
** notes ** notes
- `effective_price` is now a downstream purchases field only. It does not replace `price_per_lb` / `price_per_each`; it gives one deterministic comparison value based on the existing normalized quantity basis.
- The implemented precedence is: use non-zero `net_line_total` when present, otherwise `line_total`; divide by `normalized_quantity` when that denominator is > 0; otherwise leave blank.
- This keeps the calculation conservative for mixed-quality data: Costco bananas and ice now compute correctly, while rows like Giant patties with no quantity basis stay blank instead of producing `0` or a divide-by-zero artifact.
* [X] t1.18.2: fix giant normalization quantity carry-through for weight-based items (1-3 commits)
* [ ] t1.18.2: fix giant normalization quantity carry-through for weight-based items (1-3 commits)
ensure giant normalization emits usable normalized quantity for known weight-based cases ensure giant normalization emits usable normalized quantity for known weight-based cases
** acceptance criteria ** acceptance criteria
@@ -898,13 +897,231 @@ ensure giant normalization emits usable normalized quantity for known weight-bas
3. existing regression tests pass without changing normalized_item_id behavior 3. existing regression tests pass without changing normalized_item_id behavior
4. blanks are preserved only when no deterministic quantity basis exists 4. blanks are preserved only when no deterministic quantity basis exists
- pm note: this task is about normalization carry-through, not fuzzy matching or catalog cleanup - pm note: this task is about normalization carry-through, not fuzzy matching or catalog cleanup
** pm notes
*** banana
giant bananas have picked weight and price_per_oz but normalized missing
| purchase_date | retailer | normalized_item_name | catalog_name | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | net_line_total | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis | effective_price |
| 8/6/2024 | costco | BANANAS 3 LB / 1.36 KG | BANANA | 1 | E | 3 | lb | | 3 | lb | weight | 1.49 | 1.49 | 1.49 | 1.49 | line_total_over_qty | | | 0.4967 | parsed_size_lb | 0.031 | parsed_size_lb_to_oz | $0.50 |
| 12/6/2024 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 0.99 | 0.99 | | 0.99 | line_total_over_qty | | | 0.5893 | picked_weight_lb | 0.0368 | picked_weight_lb_to_oz | |
| 12/12/2024 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 1.37 | 1.37 | | 1.37 | line_total_over_qty | | | 0.5905 | picked_weight_lb | 0.0369 | picked_weight_lb_to_oz | |
| 1/7/2025 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 1.44 | 1.44 | | 1.44 | line_total_over_qty | | | 0.5902 | picked_weight_lb | 0.0369 | picked_weight_lb_to_oz | |
| 1/24/2025 | costco | BANANAS 3 LB / 1.36 KG | BANANA | 1 | E | 3 | lb | | 3 | lb | weight | 1.49 | 1.49 | 1.49 | 1.49 | line_total_over_qty | | | 0.4967 | parsed_size_lb | 0.031 | parsed_size_lb_to_oz | 0.4967 |
| 2/16/2025 | giant | FRESH BANANA | BANANA | 2 | LB | | | | | | weight | 2.54 | 1.27 | | 1.27 | line_total_over_qty | | | 0.588 | picked_weight_lb | 0.0367 | picked_weight_lb_to_oz | |
| 2/20/2025 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 1.4 | 1.4 | | 1.4 | line_total_over_qty | | | 0.5907 | picked_weight_lb | 0.0369 | picked_weight_lb_to_oz | |
| 6/25/2025 | giant | FRESH BANANA | BANANA | 1 | LB | | | | | | weight | 1.29 | 1.29 | | 1.29 | line_total_over_qty | | | 0.589 | picked_weight_lb | 0.0368 | picked_weight_lb_to_oz | |
| 2/14/2026 | costco | BANANAS 3 LB / 1.36 KG | BANANA | 1 | E | 3 | lb | | 3 | lb | weight | 1.49 | 1.49 | 1.49 | 1.49 | line_total_over_qty | | | 0.4967 | parsed_size_lb | 0.031 | parsed_size_lb_to_oz | 0.4967 |
| 3/12/2026 | costco | BANANAS 3 LB / 1.36 KG | BANANA | 2 | E | 6 | lb | | 3 | lb | weight | 2.98 | 1.49 | 2.98 | 1.49 | line_total_over_qty | | | 0.4967 | parsed_size_lb | 0.031 | parsed_size_lb_to_oz | 0.4967 |
*** beef patty
beef patty by weight not made into effective price
| purchase_date | retailer | normalized_item_name | product_type | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | matched_discount_amount | net_line_total | store_name | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis | effective_price |
| 9/9/2023 | costco | BEEF PATTIES 6# BAG | hamburger | 1 | E | 1 | each | | | | each | 26.99 | 26.99 | | 26.99 | MT VERNON | 26.99 | line_total_over_qty | | | | | | | $26.99 |
| 11/26/2025 | giant | PATTIES PK12 | hamburger | 1 | LB | | | | | | weight | 10.05 | 10.05 | | | Giant Food | 10.05 | line_total_over_qty | | | 7.7907 | picked_weight_lb | 0.4869 | picked_weight_lb_to_oz | |
** evidence ** evidence
- commit: - commit: `23dfc3d` `Use picked weight for Giant quantity basis`
- tests: - tests: `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_purchases`; `./venv/bin/python normalize_giant_web.py`; `./venv/bin/python build_purchases.py`
- datetime: - datetime: 2026-03-23 13:22:47 EDT
** notes ** notes
- Giant loose-weight rows already had deterministic `picked_weight` and `price_per_lb`; this task reuses that basis when parsed size/pack is absent.
- Parsed package size still wins when present, so fixed-size products keep their original comparison basis and `normalized_item_id` behavior does not change.
* [X] t1.18.3: fix costco normalization quantity carry-through for weight-based items (1-3 commits)
** acceptance criteria
1. add regression tests covering known broken Costco quantity-basis cases before changing parser logic
2. Costco normalization correctly parses explicit weight-bearing package text into normalized quantity fields for known cases such as:
- `25# FLOUR ALL-PURPOSE HARV ...` -> `normalized_quantity=25`, `normalized_quantity_unit=lb`, `measure_type=weight`
3. corrected Costco normalized rows carry through to `data/purchases.csv` without changing `normalized_item_id` behavior
4. `effective_price` for corrected Costco rows uses the same rule already established for Giant:
- use `net_line_total` when present, otherwise `line_total`
- divide by `normalized_quantity` when `normalized_quantity > 0`
- leave blank when no valid denominator exists
5. rerun output verifies the broken Costco flour examples no longer behave like `each` items and now produce non-blank weight-based effective prices
6. keep this task limited to the identified Costco parsing failures; do not broaden into catalog cleanup or fuzzy matching
*** All Purpose Flour
Costco 25# FLOUR not parsed into normalized weight - meaure_type says each
| purchase_date | retailer | normalized_item_name | catalog_name | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | matched_discount_amount | net_line_total | store_name | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis | effective_price | is_discount_line | is_coupon_line | is_fee | raw_order_path | |
| 9/9/2023 | costco | 10LB BAKERS 4.5KG / 10 LB | all purpose flour | 1 | E | 10 | lb | | 10 | lb | weight | 5.99 | 5.99 | | 5.99 | VA | 5.99 | line_total_over_qty | | | 0.599 | parsed_size_lb | 0.0374 | parsed_size_lb_to_oz | $0.60 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500603752309091647-2023-09-09T16-47-00.json | |
| 8/6/2024 | costco | 10LB BAKERS 4.5KG / 10 LB | all purpose flour | 1 | E | 10 | lb | | 10 | lb | weight | 5.29 | 5.29 | | 5.29 | VA | 5.29 | line_total_over_qty | | | 0.529 | parsed_size_lb | 0.0331 | parsed_size_lb_to_oz | $0.53 | FALSE | FALSE | FALSE | data/costco-web/raw/21111520101732408061704-2024-08-06T17-04-00.json | |
| 11/29/2024 | costco | 25# FLOUR ALL-PURPOSE HARV P98/100 | all purpose flour | 1 | E | 1 | each | | | | each | 8.79 | 8.79 | | 8.79 | VA | 8.79 | line_total_over_qty | | | | | | | $8.79 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500803392411291626-2024-11-29T16-26-00.json | |
| 12/14/2024 | costco | KS ORG FLOUR 2/10 LB P112 | all purpose flour | 1 | E | 20 | lb | 2 | 10 | lb | weight | 17.99 | 17.99 | | 17.99 | VA | 17.99 | line_total_over_qty | 8.995 | line_total_over_pack_qty | 0.8995 | parsed_size_lb | 0.0562 | parsed_size_lb_to_oz | 0.8995 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500301442412141209-2024-12-14T12-09-00.json | |
| 12/14/2024 | costco | 10LB BAKERS 4.5KG / 10 LB | all purpose flour | 1 | E | 10 | lb | | 10 | lb | weight | 5.49 | 5.49 | | 5.49 | VA | 5.49 | line_total_over_qty | | | 0.549 | parsed_size_lb | 0.0343 | parsed_size_lb_to_oz | 0.549 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500301442412141209-2024-12-14T12-09-00.json | |
| 1/10/2025 | costco | 10LB BAKERS 4.5KG / 10 LB | all purpose flour | 1 | E | 10 | lb | | 10 | lb | weight | 5.49 | 5.49 | | 5.49 | VA | 5.49 | line_total_over_qty | | | 0.549 | parsed_size_lb | 0.0343 | parsed_size_lb_to_oz | 0.549 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500702462501101630-2025-01-10T16-30-00.json | |
| 1/10/2025 | costco | KS ORG FLOUR 2/10 LB P112 | all purpose flour | 1 | E | 20 | lb | 2 | 10 | lb | weight | 17.99 | 17.99 | | 17.99 | VA | 17.99 | line_total_over_qty | 8.995 | line_total_over_pack_qty | 0.8995 | parsed_size_lb | 0.0562 | parsed_size_lb_to_oz | 0.8995 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500702462501101630-2025-01-10T16-30-00.json | |
| 1/31/2026 | giant | SB FLOUR ALL PRPSE 5LB | all purpose flour | 1 | EA | 5 | lb | | 5 | lb | weight | 3.39 | 3.39 | | | VA | 3.39 | line_total_over_qty | | | 0.678 | parsed_size_lb | 0.0424 | parsed_size_lb_to_oz | 0.678 | FALSE | FALSE | FALSE | data/giant-web/raw/697f42031c28e23df08d95f9.json | |
| 3/12/2026 | costco | 25# FLOUR ALL-PURPOSE HARV P98/100 | all purpose flour | 1 | E | 1 | each | | | | each | 9.49 | 9.49 | | 9.49 | VA | 9.49 | line_total_over_qty | | | | | | | 9.49 | FALSE | FALSE | FALSE | data/costco-web/raw/21111500804012603121616-2026-03-12T16-16-00.json
| |
** evidence
- commit: `7317611` `Fix Costco hash-size weight parsing`
- tests: `./venv/bin/python -m unittest tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_costco_web.py`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-23 13:56:38 EDT
** notes
- Costco `25#` weight text was falling through to `each` because the hash-size parser missed sizes followed by whitespace.
- This fix is intentionally narrow: explicit `#`-weight parsing now feeds the existing quantity and effective-price flow without changing `normalized_item_id` behavior.
* [X] t1.18.4: clean purchases output and finalize effective price fields (2-4 commits)
make `purchases.csv` easier to inspect and ensure price fields support weighted cost analysis
** acceptance criteria
1. reorder `data/purchases.csv` columns for human inspection, with analysis fields first:
- `purchase_date`
- `retailer`
- `catalog_name`
- `product_type`
- `category`
- `net_line_total`
- `normalized_quantity`
- `effective_price`
- `effective_price_unit`
- followed by order/item/provenance fields
3. populate `net_line_total` for all purchase rows:
- preserve existing net_line_total when already populated;
- otherwise, derive `net_line_total = line_total + matched_discount_amount` when discount exists;
- else `net_line_total = line_total`
4. compute `effective_price` from `net_line_total / normalized_quantity` when `normalized_quantity > 0`
5. add `effective_price_unit` and populate it consistently from the normalized quantity basis
6. preserve blanks rather than writing `0` or divide-by-zero when no valid denominator exists
- pm note: this task is about final purchase output correctness and usability, not review/catalog logic
** evidence
- commit: `a45522c` `Finalize purchase effective price fields`
- tests: `./venv/bin/python -m unittest tests.test_purchases`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-23 15:27:42 EDT
** notes
- `purchases.csv` now carries a filled `net_line_total` for every row, preserving existing values from normalization and deriving the rest from `line_total` plus matched discounts.
- `effective_price_unit` now mirrors the normalized quantity basis, so downstream analysis can tell whether an `effective_price` is per `lb`, `oz`, `count`, or `each`.
* [X] t1.19: make review_products.py robust to orphaned and incomplete catalog links (2-4 commits)
refresh review state from the current normalized universe so missing or broken links re-enter review instead of silently disappearing
** acceptance criteria
1. `review_products.py` regenerates review candidates from the current normalized item universe, not just previously queued items (/data/<provider>/normalized_items.csv)
2. items are added or re-added to review when:
- they have no valid `catalog_id`
- their linked `catalog_id` no longer exists
- their linked catalog row does noth have both "catalog_name" AND "product_type"
3. `review_products.py` compares and reconciles:
- current normalized items
- current product_links
- current catalog
- current review_queue
4. rerunning review after manual cleanup of `product_links.csv` or `catalog.csv` surfaces newly orphaned normalized items
5. unresolved items remain visible and are not silently dropped from review or purchases accounting
- pm note: keep the logic explicit and auditable; this is a refresh/reconciliation task, not a new matching system
** evidence
- commit: `8ccf3ff` `Reconcile review queue against current catalog state`
- tests: `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`
- datetime: 2026-03-23 15:32:29 EDT
** notes
- `review_products.py` now rebuilds its queue from the current normalized files and order files instead of trusting stale `purchases.csv` state.
- Missing catalog rows and incomplete catalog rows now re-enter review explicitly as `orphaned_catalog_link` or `incomplete_catalog_link`, and excluded rows no longer inflate unresolved-not-in-review accounting.
* [X] t1.20: add visit-level fields and outputs for spend analysis (2-4 commits)
ensure purchases retains enough visit/order context to support spend-by-visit and store-level analysis
** acceptance criteria
1. `data/purchases.csv` retains or adds the visit/order fields needed for visit analysis:
- `order_id`
- `purchase_date`
- `store_name`
- `store_number`
- `store_city`
- `store_state`
- `retailer`
2. purchases output supports these analyses without additional joins:
- spend by visit
- items per visit
- category spend by visit
- retailer/store breakdown
3. documentation or task notes make clear that `purchases.csv` is the primary analysis artifact for both item-level and visit-level reporting
- pm note: do not build dash/plotly here; this task is only about carrying the right data through
** evidence
- commit: `6940f16` `Document visit-level purchase analysis`
- tests: `./venv/bin/python -m unittest tests.test_purchases`; `./venv/bin/python build_purchases.py`
- datetime: 2026-03-24 08:29:13 EDT
** notes
- The needed visit fields were already flowing through `build_purchases.py`; this task locked them in with explicit tests and documentation instead of adding a new visit layer.
- `data/analysis/purchases.csv` is now documented as the primary analysis artifact for both item-level and visit-level work.
* [X] t1.21: add lightweight charting/analysis surface on top of purchases.csv (2-4 commits)
build a minimal analysis layer for common price and visit charts without changing the csv pipeline
** acceptance criteria
1. support charting of:
- item price over time
- spend by visit
- items per visit
- category spend over time
- retailer/store comparison
2. use `data/purchases.csv` as the source of truth
3. keep excel/pivot compatibility intact
- pm note: thin reader layer only; do not move business logic out of the pipeline
** evidence
- commit: `46a3b2c` `Add purchase analysis summaries`
- tests: `./venv/bin/python -m unittest tests.test_analyze_purchases tests.test_purchases`; `./venv/bin/python analyze_purchases.py`
- datetime: 2026-03-24 16:48:41 EDT
** notes
- The new layer is file-based, not notebook- or dashboard-based: `analyze_purchases.py` reads `data/analysis/purchases.csv` and writes chart-ready CSVs under `data/analysis/`.
- This keeps Excel/pivot workflows intact while still giving a repeatable CLI path for common price, visit, category, and retailer/store summaries.
* [X] t1.22: cleanup and finalize post-refactor merging refactor/enrich into cx (3-6 commits)
remove transitional detritus from the repo and make the final folder/script layout explicit before merging back into `cx`
** acceptance criteria
1. move `catalog.csv` alongside the other step-3 review artifacts under `data/review/`
- update active scripts, tests, docs, and task notes to match the chosen path
2. promote analysis to a top-level step-4 folder such as `data/analysis/`
- add `purchases.csv` to this folder
- update active scripts, tests, docs, and task notes to match the chosen path
3. remove obsolete or superseded Python files
- includes old `scrape_*`, `enrich_*`, `build_*`, and proof/check scripts as appropriate
- do not remove files still required by the active collect/normalize/review/analysis pipeline
4. active repo entrypoints are reduced to the intended flow and are easy to identify, including:
- retailer collection
- retailer normalization
- review/combine
- status/reporting
- analysis
5. tests pass after removals and path decisions
6. README reflects the final post-refactor structure and run order without legacy ambiguity
7. `pm/data-model.org` and `pm/tasks.org` reflect the final chosen layout
- pm note: prefer deleting true detritus over keeping compatibility shims now that the refactor path is established
- pm note: make folder decisions once here so we stop carrying path churn into later tasks
** evidence
- commit: `09829b2` `Finalize post-refactor layout and remove old pipeline files`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python analyze_purchases.py`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`
- datetime: 2026-03-24 17:09:45 EDT
** notes
- Final layout decision: `catalog.csv` now lives under `data/review/`, while `purchases.csv` and the chart-ready analysis outputs live under the step-4 `data/analysis/` folder.
- Removed obsolete top-level pipeline files and their dead tests so the active entrypoints are now the collect, normalize, review/combine, status, and analysis scripts only.
* [X] t1.22.1: remove unneeded python deps
** acceptance criteria
1. update requirements.txt to add/remove necessary python libs
2. keep only direct runtime deps in requirements.txt; transitive deps should not be pinned unless imported directly
** evidence
- commit: `867275c` `Trim requirements to direct runtime deps`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python analyze_purchases.py --help`
- date: 2026-03-24 17:25:39 EDT
** notes
- `requirements.txt` now keeps only direct runtime deps imported by the active pipeline: `browser-cookie3`, `click`, `curl_cffi`, and `python-dotenv`.
- Low-level support packages such as `cffi`, `jeepney`, `lz4`, `pycryptodomex`, and `certifi` are left to transitive installation instead of being pinned directly.
* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits) * [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)

View File

@@ -27,9 +27,11 @@ def build_status_summary(
costco_enriched, costco_enriched,
purchases, purchases,
resolutions, resolutions,
links,
catalog,
): ):
normalized_rows = giant_enriched + costco_enriched normalized_rows = giant_enriched + costco_enriched
queue_rows = review_products.build_review_queue(purchases, resolutions) queue_rows = review_products.build_review_queue(purchases, resolutions, links, catalog, [])
queue_ids = {row["normalized_item_id"] for row in queue_rows} queue_ids = {row["normalized_item_id"] for row in queue_rows}
unresolved_purchase_rows = [ unresolved_purchase_rows = [
@@ -37,6 +39,7 @@ def build_status_summary(
for row in purchases for row in purchases
if row.get("normalized_item_id") if row.get("normalized_item_id")
and not row.get("catalog_id") and not row.get("catalog_id")
and row.get("resolution_action") != "exclude"
and row.get("is_fee") != "true" and row.get("is_fee") != "true"
and row.get("is_discount_line") != "true" and row.get("is_discount_line") != "true"
and row.get("is_coupon_line") != "true" and row.get("is_coupon_line") != "true"
@@ -82,8 +85,10 @@ def build_status_summary(
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True) @click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--costco-items-csv", default="data/costco-web/collected_items.csv", show_default=True) @click.option("--costco-items-csv", default="data/costco-web/collected_items.csv", show_default=True)
@click.option("--costco-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True) @click.option("--costco-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--purchases-csv", default="data/review/purchases.csv", show_default=True) @click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True) @click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
@click.option("--summary-csv", default="data/review/pipeline_status.csv", show_default=True) @click.option("--summary-csv", default="data/review/pipeline_status.csv", show_default=True)
@click.option("--summary-json", default="data/review/pipeline_status.json", show_default=True) @click.option("--summary-json", default="data/review/pipeline_status.json", show_default=True)
def main( def main(
@@ -95,6 +100,8 @@ def main(
costco_enriched_csv, costco_enriched_csv,
purchases_csv, purchases_csv,
resolutions_csv, resolutions_csv,
links_csv,
catalog_csv,
summary_csv, summary_csv,
summary_json, summary_json,
): ):
@@ -107,6 +114,8 @@ def main(
read_rows_if_exists(costco_enriched_csv), read_rows_if_exists(costco_enriched_csv),
read_rows_if_exists(purchases_csv), read_rows_if_exists(purchases_csv),
[build_purchases.normalize_resolution_row(row) for row in read_rows_if_exists(resolutions_csv)], [build_purchases.normalize_resolution_row(row) for row in read_rows_if_exists(resolutions_csv)],
[build_purchases.normalize_link_row(row) for row in read_rows_if_exists(links_csv)],
[build_purchases.normalize_catalog_row(row) for row in read_rows_if_exists(catalog_csv)],
) )
write_csv_rows(summary_csv, summary_rows, SUMMARY_FIELDS) write_csv_rows(summary_csv, summary_rows, SUMMARY_FIELDS)
summary_json_path = Path(summary_json) summary_json_path = Path(summary_json)

View File

@@ -1,10 +1,4 @@
browser-cookie3==0.20.1 browser-cookie3==0.20.1
certifi==2026.2.25
cffi==2.0.0
click==8.3.1 click==8.3.1
curl_cffi==0.14.0 curl_cffi==0.14.0
jeepney==0.9.0
lz4==4.4.5
pycparser==3.0
pycryptodomex==3.23.0
python-dotenv==1.1.1 python-dotenv==1.1.1

View File

@@ -31,6 +31,7 @@ INFO_COLOR = "cyan"
PROMPT_COLOR = "bright_yellow" PROMPT_COLOR = "bright_yellow"
WARNING_COLOR = "magenta" WARNING_COLOR = "magenta"
TOKEN_RE = re.compile(r"[A-Z0-9]+") TOKEN_RE = re.compile(r"[A-Z0-9]+")
REQUIRED_CATALOG_FIELDS = ("catalog_name", "product_type")
def print_intro_text(): def print_intro_text():
@@ -40,9 +41,37 @@ def print_intro_text():
click.echo(" category: broad analysis bucket such as dairy, produce, or frozen") click.echo(" category: broad analysis bucket such as dairy, produce, or frozen")
def build_review_queue(purchase_rows, resolution_rows): def has_complete_catalog_row(catalog_row):
if not catalog_row:
return False
return all(catalog_row.get(field, "").strip() for field in REQUIRED_CATALOG_FIELDS)
def load_queue_lookup(queue_rows):
lookup = {}
for row in queue_rows:
normalized_item_id = row.get("normalized_item_id", "")
if normalized_item_id:
lookup[normalized_item_id] = row
return lookup
def build_review_queue(
purchase_rows,
resolution_rows,
link_rows=None,
catalog_rows=None,
existing_queue_rows=None,
):
by_normalized = defaultdict(list) by_normalized = defaultdict(list)
resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows) resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
link_lookup = build_purchases.load_link_lookup(link_rows or [])
catalog_lookup = {
row.get("catalog_id", ""): build_purchases.normalize_catalog_row(row)
for row in (catalog_rows or [])
if row.get("catalog_id", "")
}
queue_lookup = load_queue_lookup(existing_queue_rows or [])
for row in purchase_rows: for row in purchase_rows:
normalized_item_id = row.get("normalized_item_id", "") normalized_item_id = row.get("normalized_item_id", "")
@@ -54,30 +83,40 @@ def build_review_queue(purchase_rows, resolution_rows):
queue_rows = [] queue_rows = []
for normalized_item_id, rows in sorted(by_normalized.items()): for normalized_item_id, rows in sorted(by_normalized.items()):
current_resolution = resolution_lookup.get(normalized_item_id, {}) current_resolution = resolution_lookup.get(normalized_item_id, {})
if current_resolution.get("status") == "approved": if current_resolution.get("status") == "approved" and current_resolution.get("resolution_action") == "exclude":
continue continue
existing_queue_row = queue_lookup.get(normalized_item_id, {})
linked_catalog_id = current_resolution.get("catalog_id") or link_lookup.get(normalized_item_id, {}).get("catalog_id", "")
linked_catalog_row = catalog_lookup.get(linked_catalog_id, {})
has_valid_catalog_link = bool(linked_catalog_id and has_complete_catalog_row(linked_catalog_row))
unresolved_rows = [ unresolved_rows = [
row row
for row in rows for row in rows
if not row.get("catalog_id") if row.get("is_item", "true") != "false"
and row.get("is_item", "true") != "false"
and row.get("is_fee") != "true" and row.get("is_fee") != "true"
and row.get("is_discount_line") != "true" and row.get("is_discount_line") != "true"
and row.get("is_coupon_line") != "true" and row.get("is_coupon_line") != "true"
] ]
if not unresolved_rows: if not unresolved_rows or has_valid_catalog_link:
continue continue
retailers = sorted({row["retailer"] for row in rows}) retailers = sorted({row["retailer"] for row in rows})
review_id = stable_id("rvw", normalized_item_id) review_id = stable_id("rvw", normalized_item_id)
reason_code = "missing_catalog_link"
if linked_catalog_id and linked_catalog_id not in catalog_lookup:
reason_code = "orphaned_catalog_link"
elif linked_catalog_id and not has_complete_catalog_row(linked_catalog_row):
reason_code = "incomplete_catalog_link"
queue_rows.append( queue_rows.append(
{ {
"review_id": review_id, "review_id": review_id,
"retailer": " | ".join(retailers), "retailer": " | ".join(retailers),
"normalized_item_id": normalized_item_id, "normalized_item_id": normalized_item_id,
"catalog_id": current_resolution.get("catalog_id", ""), "catalog_id": linked_catalog_id,
"reason_code": "missing_catalog_link", "reason_code": reason_code,
"priority": "high", "priority": "high",
"raw_item_names": compact_join( "raw_item_names": compact_join(
sorted({row["raw_item_name"] for row in rows if row["raw_item_name"]}), sorted({row["raw_item_name"] for row in rows if row["raw_item_name"]}),
@@ -102,10 +141,13 @@ def build_review_queue(purchase_rows, resolution_rows):
limit=8, limit=8,
), ),
"seen_count": str(len(rows)), "seen_count": str(len(rows)),
"status": current_resolution.get("status", "pending"), "status": existing_queue_row.get("status") or current_resolution.get("status", "pending"),
"resolution_action": current_resolution.get("resolution_action", ""), "resolution_action": existing_queue_row.get("resolution_action")
"resolution_notes": current_resolution.get("resolution_notes", ""), or current_resolution.get("resolution_action", ""),
"created_at": current_resolution.get("reviewed_at", today_text), "resolution_notes": existing_queue_row.get("resolution_notes")
or current_resolution.get("resolution_notes", ""),
"created_at": existing_queue_row.get("created_at")
or current_resolution.get("reviewed_at", today_text),
"updated_at": today_text, "updated_at": today_text,
} }
) )
@@ -516,19 +558,51 @@ def link_rows_from_state(link_lookup):
@click.command() @click.command()
@click.option("--purchases-csv", default="data/review/purchases.csv", show_default=True) @click.option("--giant-items-enriched-csv", default="data/giant-web/normalized_items.csv", show_default=True)
@click.option("--costco-items-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--queue-csv", default="data/review/review_queue.csv", show_default=True) @click.option("--queue-csv", default="data/review/review_queue.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True) @click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="data/catalog.csv", show_default=True) @click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True) @click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--limit", default=0, show_default=True, type=int) @click.option("--limit", default=0, show_default=True, type=int)
@click.option("--refresh-only", is_flag=True, help="Only rebuild review_queue.csv without prompting.") @click.option("--refresh-only", is_flag=True, help="Only rebuild review_queue.csv without prompting.")
def main(purchases_csv, queue_csv, resolutions_csv, catalog_csv, links_csv, limit, refresh_only): def main(
purchase_rows = build_purchases.read_optional_csv_rows(purchases_csv) giant_items_enriched_csv,
costco_items_enriched_csv,
giant_orders_csv,
costco_orders_csv,
purchases_csv,
queue_csv,
resolutions_csv,
catalog_csv,
links_csv,
limit,
refresh_only,
):
resolution_rows = build_purchases.read_optional_csv_rows(resolutions_csv) resolution_rows = build_purchases.read_optional_csv_rows(resolutions_csv)
catalog_rows = build_purchases.merge_catalog_rows(build_purchases.read_optional_csv_rows(catalog_csv), []) catalog_rows = build_purchases.merge_catalog_rows(build_purchases.read_optional_csv_rows(catalog_csv), [])
link_lookup = build_purchases.load_link_lookup(build_purchases.read_optional_csv_rows(links_csv)) link_rows = build_purchases.read_optional_csv_rows(links_csv)
queue_rows = build_review_queue(purchase_rows, resolution_rows) purchase_rows, refreshed_link_rows = build_purchases.build_purchase_rows(
build_purchases.read_optional_csv_rows(giant_items_enriched_csv),
build_purchases.read_optional_csv_rows(costco_items_enriched_csv),
build_purchases.read_optional_csv_rows(giant_orders_csv),
build_purchases.read_optional_csv_rows(costco_orders_csv),
resolution_rows,
link_rows,
catalog_rows,
)
build_purchases.write_csv_rows(purchases_csv, purchase_rows, build_purchases.PURCHASE_FIELDS)
link_lookup = build_purchases.load_link_lookup(refreshed_link_rows)
queue_rows = build_review_queue(
purchase_rows,
resolution_rows,
refreshed_link_rows,
catalog_rows,
build_purchases.read_optional_csv_rows(queue_csv),
)
write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS) write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
click.echo(f"wrote {len(queue_rows)} rows to {queue_csv}") click.echo(f"wrote {len(queue_rows)} rows to {queue_csv}")

View File

@@ -1,5 +0,0 @@
from scrape_giant import * # noqa: F401,F403
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,149 @@
import csv
import tempfile
import unittest
from pathlib import Path
import analyze_purchases
class AnalyzePurchasesTests(unittest.TestCase):
def test_analysis_outputs_cover_required_views(self):
with tempfile.TemporaryDirectory() as tmpdir:
purchases_csv = Path(tmpdir) / "purchases.csv"
output_dir = Path(tmpdir) / "analysis"
fieldnames = [
"purchase_date",
"retailer",
"order_id",
"catalog_id",
"catalog_name",
"category",
"product_type",
"net_line_total",
"line_total",
"normalized_quantity",
"normalized_quantity_unit",
"effective_price",
"effective_price_unit",
"store_name",
"store_number",
"store_city",
"store_state",
"is_fee",
"is_discount_line",
"is_coupon_line",
]
with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(
[
{
"purchase_date": "2026-03-01",
"retailer": "giant",
"order_id": "g1",
"catalog_id": "cat_banana",
"catalog_name": "BANANA",
"category": "produce",
"product_type": "banana",
"net_line_total": "1.29",
"line_total": "1.29",
"normalized_quantity": "2.19",
"normalized_quantity_unit": "lb",
"effective_price": "0.589",
"effective_price_unit": "lb",
"store_name": "Giant",
"store_number": "42",
"store_city": "Springfield",
"store_state": "VA",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"purchase_date": "2026-03-01",
"retailer": "giant",
"order_id": "g1",
"catalog_id": "cat_ice",
"catalog_name": "ICE",
"category": "frozen",
"product_type": "ice",
"net_line_total": "3.50",
"line_total": "3.50",
"normalized_quantity": "20",
"normalized_quantity_unit": "lb",
"effective_price": "0.175",
"effective_price_unit": "lb",
"store_name": "Giant",
"store_number": "42",
"store_city": "Springfield",
"store_state": "VA",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"purchase_date": "2026-03-02",
"retailer": "costco",
"order_id": "c1",
"catalog_id": "cat_banana",
"catalog_name": "BANANA",
"category": "produce",
"product_type": "banana",
"net_line_total": "1.49",
"line_total": "2.98",
"normalized_quantity": "3",
"normalized_quantity_unit": "lb",
"effective_price": "0.4967",
"effective_price_unit": "lb",
"store_name": "MT VERNON",
"store_number": "1115",
"store_city": "ALEXANDRIA",
"store_state": "VA",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
]
)
analyze_purchases.main.callback(
purchases_csv=str(purchases_csv),
output_dir=str(output_dir),
)
expected_files = [
"item_price_over_time.csv",
"spend_by_visit.csv",
"items_per_visit.csv",
"category_spend_over_time.csv",
"retailer_store_breakdown.csv",
]
for name in expected_files:
self.assertTrue((output_dir / name).exists(), name)
with (output_dir / "spend_by_visit.csv").open(newline="", encoding="utf-8") as handle:
spend_rows = list(csv.DictReader(handle))
self.assertEqual("4.79", spend_rows[0]["visit_spend_total"])
with (output_dir / "items_per_visit.csv").open(newline="", encoding="utf-8") as handle:
item_rows = list(csv.DictReader(handle))
self.assertEqual("2", item_rows[0]["item_row_count"])
self.assertEqual("2", item_rows[0]["distinct_catalog_count"])
with (output_dir / "category_spend_over_time.csv").open(newline="", encoding="utf-8") as handle:
category_rows = list(csv.DictReader(handle))
produce_row = next(row for row in category_rows if row["purchase_date"] == "2026-03-01" and row["category"] == "produce")
self.assertEqual("1.29", produce_row["category_spend_total"])
with (output_dir / "retailer_store_breakdown.csv").open(newline="", encoding="utf-8") as handle:
store_rows = list(csv.DictReader(handle))
giant_row = next(row for row in store_rows if row["retailer"] == "giant")
self.assertEqual("1", giant_row["visit_count"])
self.assertEqual("2", giant_row["item_row_count"])
self.assertEqual("4.79", giant_row["store_spend_total"])
if __name__ == "__main__":
unittest.main()

View File

@@ -1,119 +0,0 @@
import unittest
import build_canonical_layer
class CanonicalLayerTests(unittest.TestCase):
def test_build_canonical_layer_auto_links_exact_upc_and_name_size_only(self):
observed_rows = [
{
"observed_product_id": "gobs_1",
"representative_upc": "111",
"representative_retailer_item_id": "11",
"representative_name_norm": "GALA APPLE",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "5",
"representative_size_unit": "lb",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_2",
"representative_upc": "111",
"representative_retailer_item_id": "12",
"representative_name_norm": "LARGE WHITE EGGS",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "18",
"representative_measure_type": "count",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_3",
"representative_upc": "",
"representative_retailer_item_id": "21",
"representative_name_norm": "ROTINI",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "16",
"representative_size_unit": "oz",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_4",
"representative_upc": "",
"representative_retailer_item_id": "22",
"representative_name_norm": "ROTINI",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "16",
"representative_size_unit": "oz",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_5",
"representative_upc": "",
"representative_retailer_item_id": "99",
"representative_name_norm": "GL BAG CHARGE",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "",
"representative_measure_type": "each",
"is_fee": "true",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_6",
"representative_upc": "",
"representative_retailer_item_id": "",
"representative_name_norm": "LIME",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "",
"representative_measure_type": "each",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
]
canonicals, links = build_canonical_layer.build_canonical_layer(observed_rows)
self.assertEqual(2, len(canonicals))
self.assertEqual(4, len(links))
methods = {row["observed_product_id"]: row["link_method"] for row in links}
self.assertEqual("exact_upc", methods["gobs_1"])
self.assertEqual("exact_upc", methods["gobs_2"])
self.assertEqual("exact_name_size", methods["gobs_3"])
self.assertEqual("exact_name_size", methods["gobs_4"])
self.assertNotIn("gobs_5", methods)
self.assertNotIn("gobs_6", methods)
def test_clean_canonical_name_removes_packaging_noise(self):
self.assertEqual("LIME", build_canonical_layer.clean_canonical_name("LIME . / ."))
self.assertEqual("EGG", build_canonical_layer.clean_canonical_name("5DZ EGG / /"))
if __name__ == "__main__":
unittest.main()

View File

@@ -7,7 +7,6 @@ from unittest import mock
import enrich_costco import enrich_costco
import scrape_costco import scrape_costco
import validate_cross_retailer_flow
class CostcoPipelineTests(unittest.TestCase): class CostcoPipelineTests(unittest.TestCase):
@@ -346,6 +345,32 @@ class CostcoPipelineTests(unittest.TestCase):
) )
self.assertEqual("LIFE 6'TABLE MDL", logistics["item_name_norm"]) self.assertEqual("LIFE 6'TABLE MDL", logistics["item_name_norm"])
def test_costco_hash_weight_parses_into_weight_basis(self):
row = enrich_costco.parse_costco_item(
order_id="abc",
order_date="2024-11-29",
raw_path=Path("costco_output/raw/abc.json"),
line_no=4,
item={
"itemNumber": "999",
"itemDescription01": "25# FLOUR ALL-PURPOSE HARV P98/100",
"itemDescription02": None,
"itemDepartmentNumber": 14,
"transDepartmentNumber": 14,
"unit": 1,
"itemIdentifier": "E",
"amount": 8.79,
"itemUnitPriceAmount": 8.79,
},
)
self.assertEqual("FLOUR ALL-PURPOSE HARV", row["item_name_norm"])
self.assertEqual("25", row["size_value"])
self.assertEqual("lb", row["size_unit"])
self.assertEqual("weight", row["measure_type"])
self.assertEqual("25", row["normalized_quantity"])
self.assertEqual("lb", row["normalized_quantity_unit"])
self.assertEqual("0.3516", row["price_per_lb"])
def test_build_items_enriched_matches_discount_to_item(self): def test_build_items_enriched_matches_discount_to_item(self):
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:
raw_dir = Path(tmpdir) / "raw" raw_dir = Path(tmpdir) / "raw"
@@ -397,76 +422,6 @@ class CostcoPipelineTests(unittest.TestCase):
self.assertIn("matched_discount=4873222", purchase_row["parse_notes"]) self.assertIn("matched_discount=4873222", purchase_row["parse_notes"])
self.assertIn("matched_to_item=4873222", discount_row["parse_notes"]) self.assertIn("matched_to_item=4873222", discount_row["parse_notes"])
def test_cross_retailer_validation_writes_proof_example(self):
with tempfile.TemporaryDirectory() as tmpdir:
giant_csv = Path(tmpdir) / "giant_items_enriched.csv"
costco_csv = Path(tmpdir) / "costco_items_enriched.csv"
outdir = Path(tmpdir) / "combined"
fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames}
giant_row.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"order_date": "2026-03-01",
"retailer_item_id": "100",
"item_name": "FRESH BANANA",
"item_name_norm": "BANANA",
"upc": "4011",
"measure_type": "weight",
"is_store_brand": "false",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "1.29",
}
)
costco_row = {field: "" for field in fieldnames}
costco_row.update(
{
"retailer": "costco",
"order_id": "c1",
"line_no": "1",
"order_date": "2026-03-12",
"retailer_item_id": "30669",
"item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA",
"upc": "",
"size_value": "3",
"size_unit": "lb",
"measure_type": "weight",
"is_store_brand": "false",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "2.98",
}
)
with giant_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(giant_row)
with costco_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(costco_row)
validate_cross_retailer_flow.main.callback(
giant_items_enriched_csv=str(giant_csv),
costco_items_enriched_csv=str(costco_csv),
outdir=str(outdir),
)
proof_path = outdir / "proof_examples.csv"
self.assertTrue(proof_path.exists())
with proof_path.open(newline="", encoding="utf-8") as handle:
rows = list(csv.DictReader(handle))
self.assertEqual(1, len(rows))
self.assertEqual("banana", rows[0]["proof_name"])
def test_main_writes_summary_request_metadata(self): def test_main_writes_summary_request_metadata(self):
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:
outdir = Path(tmpdir) / "costco_output" outdir = Path(tmpdir) / "costco_output"

View File

@@ -129,6 +129,63 @@ class EnrichGiantTests(unittest.TestCase):
("2", "each"), ("2", "each"),
enrich_giant.derive_normalized_quantity("2", "", "", "", "each"), enrich_giant.derive_normalized_quantity("2", "", "", "", "each"),
) )
self.assertEqual(
("1.68", "lb"),
enrich_giant.derive_normalized_quantity("1", "", "", "", "weight", "1.68"),
)
def test_parse_item_uses_picked_weight_for_loose_weight_items(self):
banana = enrich_giant.parse_item(
order_id="abc123",
order_date="2026-03-01",
raw_path=Path("raw/abc123.json"),
line_no=1,
item={
"podId": 1,
"shipQy": 1,
"totalPickedWeight": 1.68,
"unitPrice": 0.99,
"itemName": "FRESH BANANA",
"lbEachCd": "LB",
"groceryAmount": 0.99,
"primUpcCd": "111",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
},
)
self.assertEqual("weight", banana["measure_type"])
self.assertEqual("1.68", banana["normalized_quantity"])
self.assertEqual("lb", banana["normalized_quantity_unit"])
patty = enrich_giant.parse_item(
order_id="abc123",
order_date="2026-03-01",
raw_path=Path("raw/abc123.json"),
line_no=2,
item={
"podId": 2,
"shipQy": 1,
"totalPickedWeight": 1.29,
"unitPrice": 10.05,
"itemName": "80% PATTIES PK12",
"lbEachCd": "LB",
"groceryAmount": 10.05,
"primUpcCd": "222",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
},
)
self.assertEqual("1.29", patty["normalized_quantity"])
self.assertEqual("lb", patty["normalized_quantity_unit"])
def test_build_items_enriched_reads_raw_order_files_and_writes_csv(self): def test_build_items_enriched_reads_raw_order_files_and_writes_csv(self):
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:

View File

@@ -1,67 +0,0 @@
import unittest
import build_observed_products
class ObservedProductTests(unittest.TestCase):
def test_build_observed_products_aggregates_rows_with_same_key(self):
rows = [
{
"retailer": "giant",
"order_id": "1",
"line_no": "1",
"order_date": "2026-01-01",
"item_name": "SB GALA APPLE 5LB",
"item_name_norm": "GALA APPLE",
"retailer_item_id": "11",
"upc": "111",
"brand_guess": "SB",
"variant": "",
"size_value": "5",
"size_unit": "lb",
"pack_qty": "",
"measure_type": "weight",
"image_url": "https://example.test/a.jpg",
"is_store_brand": "true",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "7.99",
},
{
"retailer": "giant",
"order_id": "2",
"line_no": "1",
"order_date": "2026-01-10",
"item_name": "SB GALA APPLE 5 LB",
"item_name_norm": "GALA APPLE",
"retailer_item_id": "11",
"upc": "111",
"brand_guess": "SB",
"variant": "",
"size_value": "5",
"size_unit": "lb",
"pack_qty": "",
"measure_type": "weight",
"image_url": "",
"is_store_brand": "true",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "8.49",
},
]
observed = build_observed_products.build_observed_products(rows)
self.assertEqual(1, len(observed))
self.assertEqual("2", observed[0]["times_seen"])
self.assertEqual("2026-01-01", observed[0]["first_seen_date"])
self.assertEqual("2026-01-10", observed[0]["last_seen_date"])
self.assertEqual("11", observed[0]["representative_retailer_item_id"])
self.assertEqual("111", observed[0]["representative_upc"])
self.assertIn("SB GALA APPLE 5LB", observed[0]["raw_name_examples"])
if __name__ == "__main__":
unittest.main()

View File

@@ -65,6 +65,21 @@ class PipelineStatusTests(unittest.TestCase):
}, },
], ],
resolutions=[], resolutions=[],
links=[
{
"normalized_item_id": "gnorm_banana",
"catalog_id": "cat_banana",
"review_status": "approved",
}
],
catalog=[
{
"catalog_id": "cat_banana",
"catalog_name": "BANANA",
"product_type": "banana",
"category": "produce",
}
],
) )
counts = {row["stage"]: row["count"] for row in summary} counts = {row["stage"]: row["count"] for row in summary}

View File

@@ -8,6 +8,11 @@ import enrich_costco
class PurchaseLogTests(unittest.TestCase): class PurchaseLogTests(unittest.TestCase):
def test_derive_net_line_total_preserves_existing_then_derives(self):
self.assertEqual("1.49", build_purchases.derive_net_line_total({"net_line_total": "1.49", "line_total": "2.98"}))
self.assertEqual("5.99", build_purchases.derive_net_line_total({"line_total": "6.99", "matched_discount_amount": "-1.00"}))
self.assertEqual("3.5", build_purchases.derive_net_line_total({"line_total": "3.50"}))
def test_derive_metrics_prefers_picked_weight_and_pack_count(self): def test_derive_metrics_prefers_picked_weight_and_pack_count(self):
metrics = build_purchases.derive_metrics( metrics = build_purchases.derive_metrics(
{ {
@@ -161,6 +166,12 @@ class PurchaseLogTests(unittest.TestCase):
self.assertEqual("https://example.test/banana.jpg", rows[0]["image_url"]) self.assertEqual("https://example.test/banana.jpg", rows[0]["image_url"])
self.assertEqual("1", rows[0]["normalized_quantity"]) self.assertEqual("1", rows[0]["normalized_quantity"])
self.assertEqual("lb", rows[0]["normalized_quantity_unit"]) self.assertEqual("lb", rows[0]["normalized_quantity_unit"])
self.assertEqual("lb", rows[0]["effective_price_unit"])
self.assertEqual("g1", rows[0]["order_id"])
self.assertEqual("Giant", rows[0]["store_name"])
self.assertEqual("42", rows[0]["store_number"])
self.assertEqual("Springfield", rows[0]["store_city"])
self.assertEqual("VA", rows[0]["store_state"])
def test_main_writes_purchase_and_example_csvs(self): def test_main_writes_purchase_and_example_csvs(self):
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:
@@ -418,6 +429,294 @@ class PurchaseLogTests(unittest.TestCase):
self.assertEqual("1", rows[0]["normalized_quantity"]) self.assertEqual("1", rows[0]["normalized_quantity"])
self.assertEqual("each", rows[0]["normalized_quantity_unit"]) self.assertEqual("each", rows[0]["normalized_quantity_unit"])
def test_build_purchase_rows_derives_effective_price_for_known_cases(self):
fieldnames = enrich_costco.OUTPUT_FIELDS
def base_row():
return {field: "" for field in fieldnames}
giant_banana = base_row()
giant_banana.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_row_id": "giant:g1:1",
"normalized_item_id": "gnorm:banana",
"order_date": "2026-03-01",
"item_name": "FRESH BANANA",
"item_name_norm": "BANANA",
"retailer_item_id": "100",
"qty": "1",
"unit": "LB",
"normalized_quantity": "1.68",
"normalized_quantity_unit": "lb",
"line_total": "0.99",
"unit_price": "0.99",
"measure_type": "weight",
"price_per_lb": "0.5893",
"raw_order_path": "data/giant-web/raw/g1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
costco_banana = base_row()
costco_banana.update(
{
"retailer": "costco",
"order_id": "c1",
"line_no": "1",
"normalized_row_id": "costco:c1:1",
"normalized_item_id": "cnorm:banana",
"order_date": "2026-03-12",
"item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA",
"retailer_item_id": "30669",
"qty": "1",
"unit": "E",
"normalized_quantity": "3",
"normalized_quantity_unit": "lb",
"line_total": "2.98",
"net_line_total": "1.49",
"unit_price": "2.98",
"size_value": "3",
"size_unit": "lb",
"measure_type": "weight",
"price_per_lb": "0.4967",
"raw_order_path": "data/costco-web/raw/c1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
giant_ice = base_row()
giant_ice.update(
{
"retailer": "giant",
"order_id": "g2",
"line_no": "1",
"normalized_row_id": "giant:g2:1",
"normalized_item_id": "gnorm:ice",
"order_date": "2026-03-02",
"item_name": "SB BAGGED ICE 20LB",
"item_name_norm": "BAGGED ICE",
"retailer_item_id": "101",
"qty": "2",
"unit": "EA",
"normalized_quantity": "40",
"normalized_quantity_unit": "lb",
"line_total": "9.98",
"unit_price": "4.99",
"size_value": "20",
"size_unit": "lb",
"measure_type": "weight",
"price_per_lb": "0.2495",
"raw_order_path": "data/giant-web/raw/g2.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
costco_patty = base_row()
costco_patty.update(
{
"retailer": "costco",
"order_id": "c2",
"line_no": "1",
"normalized_row_id": "costco:c2:1",
"normalized_item_id": "cnorm:patty",
"order_date": "2026-03-03",
"item_name": "BEEF PATTIES 6# BAG",
"item_name_norm": "BEEF PATTIES 6# BAG",
"retailer_item_id": "777",
"qty": "1",
"unit": "E",
"normalized_quantity": "1",
"normalized_quantity_unit": "each",
"line_total": "26.99",
"net_line_total": "26.99",
"unit_price": "26.99",
"measure_type": "each",
"raw_order_path": "data/costco-web/raw/c2.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
giant_patty = base_row()
giant_patty.update(
{
"retailer": "giant",
"order_id": "g3",
"line_no": "1",
"normalized_row_id": "giant:g3:1",
"normalized_item_id": "gnorm:patty",
"order_date": "2026-03-04",
"item_name": "80% PATTIES PK12",
"item_name_norm": "80% PATTIES PK12",
"retailer_item_id": "102",
"qty": "1",
"unit": "LB",
"normalized_quantity": "",
"normalized_quantity_unit": "",
"line_total": "10.05",
"unit_price": "10.05",
"measure_type": "weight",
"price_per_lb": "7.7907",
"raw_order_path": "data/giant-web/raw/g3.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
rows, _links = build_purchases.build_purchase_rows(
[giant_banana, giant_ice, giant_patty],
[costco_banana, costco_patty],
[],
[],
[],
[],
[],
)
rows_by_item = {row["normalized_item_id"]: row for row in rows}
self.assertEqual("0.5893", rows_by_item["gnorm:banana"]["effective_price"])
self.assertEqual("lb", rows_by_item["gnorm:banana"]["effective_price_unit"])
self.assertEqual("0.4967", rows_by_item["cnorm:banana"]["effective_price"])
self.assertEqual("lb", rows_by_item["cnorm:banana"]["effective_price_unit"])
self.assertEqual("0.2495", rows_by_item["gnorm:ice"]["effective_price"])
self.assertEqual("lb", rows_by_item["gnorm:ice"]["effective_price_unit"])
self.assertEqual("26.99", rows_by_item["cnorm:patty"]["effective_price"])
self.assertEqual("each", rows_by_item["cnorm:patty"]["effective_price_unit"])
self.assertEqual("", rows_by_item["gnorm:patty"]["effective_price"])
self.assertEqual("", rows_by_item["gnorm:patty"]["effective_price_unit"])
def test_build_purchase_rows_leaves_effective_price_blank_without_valid_denominator(self):
fieldnames = enrich_costco.OUTPUT_FIELDS
row = {field: "" for field in fieldnames}
row.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_row_id": "giant:g1:1",
"normalized_item_id": "gnorm:blank",
"order_date": "2026-03-01",
"item_name": "MYSTERY ITEM",
"item_name_norm": "MYSTERY ITEM",
"retailer_item_id": "100",
"qty": "1",
"unit": "EA",
"normalized_quantity": "0",
"normalized_quantity_unit": "each",
"line_total": "3.50",
"unit_price": "3.50",
"measure_type": "each",
"raw_order_path": "data/giant-web/raw/g1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
rows, _links = build_purchases.build_purchase_rows([row], [], [], [], [], [], [])
self.assertEqual("", rows[0]["effective_price"])
self.assertEqual("", rows[0]["effective_price_unit"])
def test_purchase_rows_support_visit_level_grouping_without_extra_joins(self):
fieldnames = enrich_costco.OUTPUT_FIELDS
def base_row():
return {field: "" for field in fieldnames}
row_one = base_row()
row_one.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_row_id": "giant:g1:1",
"normalized_item_id": "gnorm:first",
"order_date": "2026-03-01",
"item_name": "FIRST ITEM",
"item_name_norm": "FIRST ITEM",
"qty": "1",
"unit": "EA",
"normalized_quantity": "1",
"normalized_quantity_unit": "each",
"line_total": "3.50",
"measure_type": "each",
"raw_order_path": "data/giant-web/raw/g1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
row_two = base_row()
row_two.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "2",
"normalized_row_id": "giant:g1:2",
"normalized_item_id": "gnorm:second",
"order_date": "2026-03-01",
"item_name": "SECOND ITEM",
"item_name_norm": "SECOND ITEM",
"qty": "1",
"unit": "EA",
"normalized_quantity": "1",
"normalized_quantity_unit": "each",
"line_total": "2.00",
"measure_type": "each",
"raw_order_path": "data/giant-web/raw/g1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
rows, _links = build_purchases.build_purchase_rows(
[row_one, row_two],
[],
[
{
"order_id": "g1",
"store_name": "Giant",
"store_number": "42",
"store_city": "Springfield",
"store_state": "VA",
}
],
[],
[],
[],
[],
)
visit_key = {
(
row["retailer"],
row["order_id"],
row["purchase_date"],
row["store_name"],
row["store_number"],
row["store_city"],
row["store_state"],
)
for row in rows
}
visit_total = sum(float(row["net_line_total"]) for row in rows)
self.assertEqual(1, len(visit_key))
self.assertEqual(5.5, visit_total)
if __name__ == "__main__": if __name__ == "__main__":
unittest.main() unittest.main()

View File

@@ -1,133 +0,0 @@
import tempfile
import unittest
from pathlib import Path
import build_observed_products
import build_review_queue
from layer_helpers import write_csv_rows
class ReviewQueueTests(unittest.TestCase):
def test_build_review_queue_preserves_existing_status(self):
observed_rows = [
{
"observed_product_id": "gobs_1",
"retailer": "giant",
"representative_upc": "111",
"representative_image_url": "",
"representative_name_norm": "GALA APPLE",
"times_seen": "2",
"distinct_item_names_count": "2",
"distinct_upcs_count": "1",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
}
]
item_rows = [
{
"observed_product_id": "gobs_1",
"item_name": "SB GALA APPLE 5LB",
"item_name_norm": "GALA APPLE",
"line_total": "7.99",
},
{
"observed_product_id": "gobs_1",
"item_name": "SB GALA APPLE 5 LB",
"item_name_norm": "GALA APPLE",
"line_total": "8.49",
},
]
existing = {
build_review_queue.stable_id("rvw", "gobs_1|missing_image"): {
"status": "approved",
"resolution_notes": "looked fine",
"created_at": "2026-03-15",
}
}
queue = build_review_queue.build_review_queue(
observed_rows, item_rows, existing, "2026-03-16"
)
self.assertEqual(2, len(queue))
missing_image = [row for row in queue if row["reason_code"] == "missing_image"][0]
self.assertEqual("approved", missing_image["status"])
self.assertEqual("looked fine", missing_image["resolution_notes"])
def test_review_queue_main_writes_output(self):
with tempfile.TemporaryDirectory() as tmpdir:
observed_path = Path(tmpdir) / "products_observed.csv"
items_path = Path(tmpdir) / "items_enriched.csv"
output_path = Path(tmpdir) / "review_queue.csv"
observed_rows = [
{
"observed_product_id": "gobs_1",
"retailer": "giant",
"observed_key": "giant|upc=111|name=GALA APPLE",
"representative_retailer_item_id": "11",
"representative_upc": "111",
"representative_item_name": "SB GALA APPLE 5LB",
"representative_name_norm": "GALA APPLE",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "5",
"representative_size_unit": "lb",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"representative_image_url": "",
"is_store_brand": "true",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"first_seen_date": "2026-01-01",
"last_seen_date": "2026-01-10",
"times_seen": "2",
"example_order_id": "1",
"example_item_name": "SB GALA APPLE 5LB",
"raw_name_examples": "SB GALA APPLE 5LB | SB GALA APPLE 5 LB",
"normalized_name_examples": "GALA APPLE",
"example_prices": "7.99 | 8.49",
"distinct_item_names_count": "2",
"distinct_retailer_item_ids_count": "1",
"distinct_upcs_count": "1",
}
]
item_rows = [
{
"retailer": "giant",
"order_id": "1",
"line_no": "1",
"item_name": "SB GALA APPLE 5LB",
"item_name_norm": "GALA APPLE",
"retailer_item_id": "11",
"upc": "111",
"size_value": "5",
"size_unit": "lb",
"pack_qty": "",
"measure_type": "weight",
"is_store_brand": "true",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "7.99",
}
]
write_csv_rows(
observed_path, observed_rows, build_observed_products.OUTPUT_FIELDS
)
write_csv_rows(items_path, item_rows, list(item_rows[0].keys()))
build_review_queue.main.callback(
observed_csv=str(observed_path),
items_enriched_csv=str(items_path),
output_csv=str(output_path),
)
self.assertTrue(output_path.exists())
if __name__ == "__main__":
unittest.main()

View File

@@ -6,9 +6,94 @@ from unittest import mock
from click.testing import CliRunner from click.testing import CliRunner
import enrich_costco
import review_products import review_products
def write_review_source_files(tmpdir, rows):
giant_items_csv = Path(tmpdir) / "giant_items.csv"
costco_items_csv = Path(tmpdir) / "costco_items.csv"
giant_orders_csv = Path(tmpdir) / "giant_orders.csv"
costco_orders_csv = Path(tmpdir) / "costco_orders.csv"
fieldnames = enrich_costco.OUTPUT_FIELDS
grouped_rows = {"giant": [], "costco": []}
grouped_orders = {"giant": {}, "costco": {}}
for index, row in enumerate(rows, start=1):
retailer = row.get("retailer", "giant")
normalized_row = {field: "" for field in fieldnames}
normalized_row.update(
{
"retailer": retailer,
"order_id": row.get("order_id", f"{retailer[0]}{index}"),
"line_no": row.get("line_no", str(index)),
"normalized_row_id": row.get(
"normalized_row_id",
f"{retailer}:{row.get('order_id', f'{retailer[0]}{index}')}:{row.get('line_no', str(index))}",
),
"normalized_item_id": row.get("normalized_item_id", ""),
"order_date": row.get("purchase_date", ""),
"item_name": row.get("raw_item_name", ""),
"item_name_norm": row.get("normalized_item_name", ""),
"image_url": row.get("image_url", ""),
"upc": row.get("upc", ""),
"line_total": row.get("line_total", ""),
"net_line_total": row.get("net_line_total", ""),
"matched_discount_amount": row.get("matched_discount_amount", ""),
"qty": row.get("qty", "1"),
"unit": row.get("unit", "EA"),
"normalized_quantity": row.get("normalized_quantity", ""),
"normalized_quantity_unit": row.get("normalized_quantity_unit", ""),
"size_value": row.get("size_value", ""),
"size_unit": row.get("size_unit", ""),
"pack_qty": row.get("pack_qty", ""),
"measure_type": row.get("measure_type", "each"),
"retailer_item_id": row.get("retailer_item_id", ""),
"price_per_each": row.get("price_per_each", ""),
"price_per_lb": row.get("price_per_lb", ""),
"price_per_oz": row.get("price_per_oz", ""),
"is_discount_line": row.get("is_discount_line", "false"),
"is_coupon_line": row.get("is_coupon_line", "false"),
"is_fee": row.get("is_fee", "false"),
"raw_order_path": row.get("raw_order_path", ""),
}
)
grouped_rows[retailer].append(normalized_row)
order_id = normalized_row["order_id"]
grouped_orders[retailer].setdefault(
order_id,
{
"order_id": order_id,
"store_name": row.get("store_name", ""),
"store_number": row.get("store_number", ""),
"store_city": row.get("store_city", ""),
"store_state": row.get("store_state", ""),
},
)
for path, source_rows in [
(giant_items_csv, grouped_rows["giant"]),
(costco_items_csv, grouped_rows["costco"]),
]:
with path.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(source_rows)
order_fields = ["order_id", "store_name", "store_number", "store_city", "store_state"]
for path, source_rows in [
(giant_orders_csv, grouped_orders["giant"].values()),
(costco_orders_csv, grouped_orders["costco"].values()),
]:
with path.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=order_fields)
writer.writeheader()
writer.writerows(source_rows)
return giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv
class ReviewWorkflowTests(unittest.TestCase): class ReviewWorkflowTests(unittest.TestCase):
def test_build_review_queue_groups_unresolved_purchases(self): def test_build_review_queue_groups_unresolved_purchases(self):
queue_rows = review_products.build_review_queue( queue_rows = review_products.build_review_queue(
@@ -114,66 +199,47 @@ class ReviewWorkflowTests(unittest.TestCase):
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv" links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
purchase_fields = [ tmpdir,
"purchase_date", [
"retailer", {
"order_id", "purchase_date": "2026-03-14",
"line_no", "retailer": "costco",
"normalized_item_id", "order_id": "c2",
"catalog_id", "line_no": "2",
"raw_item_name", "normalized_item_id": "cnorm_mix",
"normalized_item_name", "raw_item_name": "MIXED PEPPER 6-PACK",
"image_url", "normalized_item_name": "MIXED PEPPER",
"upc", "image_url": "",
"line_total", "upc": "",
] "line_total": "7.49",
with purchases_csv.open("w", newline="", encoding="utf-8") as handle: },
writer = csv.DictWriter(handle, fieldnames=purchase_fields) {
writer.writeheader() "purchase_date": "2026-03-12",
writer.writerows( "retailer": "costco",
[ "order_id": "c1",
{ "line_no": "1",
"purchase_date": "2026-03-14", "normalized_item_id": "cnorm_mix",
"retailer": "costco", "raw_item_name": "MIXED PEPPER 6-PACK",
"order_id": "c2", "normalized_item_name": "MIXED PEPPER",
"line_no": "2", "image_url": "https://example.test/mixed-pepper.jpg",
"normalized_item_id": "cnorm_mix", "upc": "",
"catalog_id": "", "line_total": "6.99",
"raw_item_name": "MIXED PEPPER 6-PACK", },
"normalized_item_name": "MIXED PEPPER", {
"image_url": "", "purchase_date": "2026-03-10",
"upc": "", "retailer": "giant",
"line_total": "7.49", "order_id": "g1",
}, "line_no": "1",
{ "normalized_item_id": "gnorm_mix",
"purchase_date": "2026-03-12", "raw_item_name": "MIXED PEPPER",
"retailer": "costco", "normalized_item_name": "MIXED PEPPER",
"order_id": "c1", "image_url": "",
"line_no": "1", "upc": "",
"normalized_item_id": "cnorm_mix", "line_total": "5.99",
"catalog_id": "", },
"raw_item_name": "MIXED PEPPER 6-PACK", ],
"normalized_item_name": "MIXED PEPPER", )
"image_url": "https://example.test/mixed-pepper.jpg",
"upc": "",
"line_total": "6.99",
},
{
"purchase_date": "2026-03-10",
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"raw_item_name": "MIXED PEPPER",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
"upc": "",
"line_total": "5.99",
},
]
)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle: with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS) writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
@@ -195,11 +261,34 @@ class ReviewWorkflowTests(unittest.TestCase):
"updated_at": "", "updated_at": "",
} }
) )
with links_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.PRODUCT_LINK_FIELDS)
writer.writeheader()
writer.writerow(
{
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"link_method": "manual_link",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
}
)
runner = CliRunner() runner = CliRunner()
result = runner.invoke( result = runner.invoke(
review_products.main, review_products.main,
[ [
"--giant-items-enriched-csv",
str(giant_items_csv),
"--costco-items-enriched-csv",
str(costco_items_csv),
"--giant-orders-csv",
str(giant_orders_csv),
"--costco-orders-csv",
str(costco_orders_csv),
"--purchases-csv", "--purchases-csv",
str(purchases_csv), str(purchases_csv),
"--queue-csv", "--queue-csv",
@@ -234,40 +323,23 @@ class ReviewWorkflowTests(unittest.TestCase):
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv" links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
with purchases_csv.open("w", newline="", encoding="utf-8") as handle: tmpdir,
writer = csv.DictWriter( [
handle,
fieldnames=[
"purchase_date",
"retailer",
"order_id",
"line_no",
"normalized_item_id",
"catalog_id",
"raw_item_name",
"normalized_item_name",
"image_url",
"upc",
"line_total",
],
)
writer.writeheader()
writer.writerow(
{ {
"purchase_date": "2026-03-14", "purchase_date": "2026-03-14",
"retailer": "giant", "retailer": "giant",
"order_id": "g1", "order_id": "g1",
"line_no": "1", "line_no": "1",
"normalized_item_id": "gnorm_ice", "normalized_item_id": "gnorm_ice",
"catalog_id": "",
"raw_item_name": "SB BAGGED ICE 20LB", "raw_item_name": "SB BAGGED ICE 20LB",
"normalized_item_name": "BAGGED ICE", "normalized_item_name": "BAGGED ICE",
"image_url": "", "image_url": "",
"upc": "", "upc": "",
"line_total": "3.50", "line_total": "3.50",
} }
) ],
)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle: with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS) writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
@@ -276,6 +348,14 @@ class ReviewWorkflowTests(unittest.TestCase):
result = CliRunner().invoke( result = CliRunner().invoke(
review_products.main, review_products.main,
[ [
"--giant-items-enriched-csv",
str(giant_items_csv),
"--costco-items-enriched-csv",
str(costco_items_csv),
"--giant-orders-csv",
str(giant_orders_csv),
"--costco-orders-csv",
str(costco_orders_csv),
"--purchases-csv", "--purchases-csv",
str(purchases_csv), str(purchases_csv),
"--queue-csv", "--queue-csv",
@@ -301,68 +381,47 @@ class ReviewWorkflowTests(unittest.TestCase):
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv" links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
with purchases_csv.open("w", newline="", encoding="utf-8") as handle: tmpdir,
writer = csv.DictWriter( [
handle, {
fieldnames=[ "purchase_date": "2026-03-14",
"purchase_date", "retailer": "costco",
"retailer", "order_id": "c2",
"order_id", "line_no": "2",
"line_no", "normalized_item_id": "cnorm_mix",
"normalized_item_id", "raw_item_name": "MIXED PEPPER 6-PACK",
"catalog_id", "normalized_item_name": "MIXED PEPPER",
"raw_item_name", "image_url": "",
"normalized_item_name", "upc": "",
"image_url", "line_total": "7.49",
"upc", },
"line_total", {
], "purchase_date": "2026-03-12",
) "retailer": "costco",
writer.writeheader() "order_id": "c1",
writer.writerows( "line_no": "1",
[ "normalized_item_id": "cnorm_mix",
{ "raw_item_name": "MIXED PEPPER 6-PACK",
"purchase_date": "2026-03-14", "normalized_item_name": "MIXED PEPPER",
"retailer": "costco", "image_url": "",
"order_id": "c2", "upc": "",
"line_no": "2", "line_total": "6.99",
"normalized_item_id": "cnorm_mix", },
"catalog_id": "", {
"raw_item_name": "MIXED PEPPER 6-PACK", "purchase_date": "2026-03-10",
"normalized_item_name": "MIXED PEPPER", "retailer": "giant",
"image_url": "", "order_id": "g1",
"upc": "", "line_no": "1",
"line_total": "7.49", "normalized_item_id": "gnorm_mix",
}, "raw_item_name": "MIXED PEPPER",
{ "normalized_item_name": "MIXED PEPPER",
"purchase_date": "2026-03-12", "image_url": "",
"retailer": "costco", "upc": "",
"order_id": "c1", "line_total": "5.99",
"line_no": "1", },
"normalized_item_id": "cnorm_mix", ],
"catalog_id": "", )
"raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
"upc": "",
"line_total": "6.99",
},
{
"purchase_date": "2026-03-10",
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"raw_item_name": "MIXED PEPPER",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
"upc": "",
"line_total": "5.99",
},
]
)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle: with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS) writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
@@ -384,10 +443,33 @@ class ReviewWorkflowTests(unittest.TestCase):
"updated_at": "", "updated_at": "",
} }
) )
with links_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.PRODUCT_LINK_FIELDS)
writer.writeheader()
writer.writerow(
{
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"link_method": "manual_link",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
}
)
result = CliRunner().invoke( result = CliRunner().invoke(
review_products.main, review_products.main,
[ [
"--giant-items-enriched-csv",
str(giant_items_csv),
"--costco-items-enriched-csv",
str(costco_items_csv),
"--giant-orders-csv",
str(giant_orders_csv),
"--costco-orders-csv",
str(costco_orders_csv),
"--purchases-csv", "--purchases-csv",
str(purchases_csv), str(purchases_csv),
"--queue-csv", "--queue-csv",
@@ -422,40 +504,23 @@ class ReviewWorkflowTests(unittest.TestCase):
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv" links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
with purchases_csv.open("w", newline="", encoding="utf-8") as handle: tmpdir,
writer = csv.DictWriter( [
handle,
fieldnames=[
"purchase_date",
"retailer",
"order_id",
"line_no",
"normalized_item_id",
"catalog_id",
"raw_item_name",
"normalized_item_name",
"image_url",
"upc",
"line_total",
],
)
writer.writeheader()
writer.writerow(
{ {
"purchase_date": "2026-03-14", "purchase_date": "2026-03-14",
"retailer": "giant", "retailer": "giant",
"order_id": "g1", "order_id": "g1",
"line_no": "1", "line_no": "1",
"normalized_item_id": "gnorm_ice", "normalized_item_id": "gnorm_ice",
"catalog_id": "",
"raw_item_name": "SB BAGGED ICE 20LB", "raw_item_name": "SB BAGGED ICE 20LB",
"normalized_item_name": "BAGGED ICE", "normalized_item_name": "BAGGED ICE",
"image_url": "", "image_url": "",
"upc": "", "upc": "",
"line_total": "3.50", "line_total": "3.50",
} }
) ],
)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle: with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS) writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
@@ -481,6 +546,14 @@ class ReviewWorkflowTests(unittest.TestCase):
result = CliRunner().invoke( result = CliRunner().invoke(
review_products.main, review_products.main,
[ [
"--giant-items-enriched-csv",
str(giant_items_csv),
"--costco-items-enriched-csv",
str(costco_items_csv),
"--giant-orders-csv",
str(giant_orders_csv),
"--costco-orders-csv",
str(costco_orders_csv),
"--purchases-csv", "--purchases-csv",
str(purchases_csv), str(purchases_csv),
"--queue-csv", "--queue-csv",
@@ -506,40 +579,23 @@ class ReviewWorkflowTests(unittest.TestCase):
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv" links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
with purchases_csv.open("w", newline="", encoding="utf-8") as handle: tmpdir,
writer = csv.DictWriter( [
handle,
fieldnames=[
"purchase_date",
"retailer",
"order_id",
"line_no",
"normalized_item_id",
"catalog_id",
"raw_item_name",
"normalized_item_name",
"image_url",
"upc",
"line_total",
],
)
writer.writeheader()
writer.writerow(
{ {
"purchase_date": "2026-03-14", "purchase_date": "2026-03-14",
"retailer": "giant", "retailer": "giant",
"order_id": "g1", "order_id": "g1",
"line_no": "1", "line_no": "1",
"normalized_item_id": "gnorm_skip", "normalized_item_id": "gnorm_skip",
"catalog_id": "",
"raw_item_name": "TEST ITEM", "raw_item_name": "TEST ITEM",
"normalized_item_name": "TEST ITEM", "normalized_item_name": "TEST ITEM",
"image_url": "", "image_url": "",
"upc": "", "upc": "",
"line_total": "1.00", "line_total": "1.00",
} }
) ],
)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle: with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS) writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
@@ -548,6 +604,14 @@ class ReviewWorkflowTests(unittest.TestCase):
result = CliRunner().invoke( result = CliRunner().invoke(
review_products.main, review_products.main,
[ [
"--giant-items-enriched-csv",
str(giant_items_csv),
"--costco-items-enriched-csv",
str(costco_items_csv),
"--giant-orders-csv",
str(giant_orders_csv),
"--costco-orders-csv",
str(costco_orders_csv),
"--purchases-csv", "--purchases-csv",
str(purchases_csv), str(purchases_csv),
"--queue-csv", "--queue-csv",
@@ -578,30 +642,12 @@ class ReviewWorkflowTests(unittest.TestCase):
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv" links_csv = Path(tmpdir) / "product_links.csv"
giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
with purchases_csv.open("w", newline="", encoding="utf-8") as handle: tmpdir,
writer = csv.DictWriter( [
handle,
fieldnames=[
"purchase_date",
"normalized_item_id",
"catalog_id",
"retailer",
"raw_item_name",
"normalized_item_name",
"image_url",
"upc",
"line_total",
"order_id",
"line_no",
],
)
writer.writeheader()
writer.writerow(
{ {
"purchase_date": "2026-03-15", "purchase_date": "2026-03-15",
"normalized_item_id": "gnorm_ice", "normalized_item_id": "gnorm_ice",
"catalog_id": "",
"retailer": "giant", "retailer": "giant",
"raw_item_name": "SB BAGGED ICE 20LB", "raw_item_name": "SB BAGGED ICE 20LB",
"normalized_item_name": "BAGGED ICE", "normalized_item_name": "BAGGED ICE",
@@ -611,7 +657,8 @@ class ReviewWorkflowTests(unittest.TestCase):
"order_id": "g1", "order_id": "g1",
"line_no": "1", "line_no": "1",
} }
) ],
)
with mock.patch.object( with mock.patch.object(
review_products.click, review_products.click,
@@ -619,6 +666,10 @@ class ReviewWorkflowTests(unittest.TestCase):
side_effect=["n", "ICE", "frozen", "ice", "manual merge", "q"], side_effect=["n", "ICE", "frozen", "ice", "manual merge", "q"],
): ):
review_products.main.callback( review_products.main.callback(
giant_items_enriched_csv=str(giant_items_csv),
costco_items_enriched_csv=str(costco_items_csv),
giant_orders_csv=str(giant_orders_csv),
costco_orders_csv=str(costco_orders_csv),
purchases_csv=str(purchases_csv), purchases_csv=str(purchases_csv),
queue_csv=str(queue_csv), queue_csv=str(queue_csv),
resolutions_csv=str(resolutions_csv), resolutions_csv=str(resolutions_csv),
@@ -647,6 +698,63 @@ class ReviewWorkflowTests(unittest.TestCase):
self.assertEqual("ICE", catalog_rows[0]["catalog_name"]) self.assertEqual("ICE", catalog_rows[0]["catalog_name"])
self.assertEqual(catalog_rows[0]["catalog_id"], link_rows[0]["catalog_id"]) self.assertEqual(catalog_rows[0]["catalog_id"], link_rows[0]["catalog_id"])
def test_build_review_queue_readds_orphaned_and_incomplete_links(self):
purchase_rows = [
{
"normalized_item_id": "gnorm_orphan",
"catalog_id": "cat_missing",
"retailer": "giant",
"raw_item_name": "ORPHAN ITEM",
"normalized_item_name": "ORPHAN ITEM",
"upc": "",
"line_total": "3.50",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"normalized_item_id": "gnorm_incomplete",
"catalog_id": "cat_incomplete",
"retailer": "giant",
"raw_item_name": "INCOMPLETE ITEM",
"normalized_item_name": "INCOMPLETE ITEM",
"upc": "",
"line_total": "4.50",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
]
link_rows = [
{
"normalized_item_id": "gnorm_orphan",
"catalog_id": "cat_missing",
},
{
"normalized_item_id": "gnorm_incomplete",
"catalog_id": "cat_incomplete",
},
]
catalog_rows = [
{
"catalog_id": "cat_incomplete",
"catalog_name": "INCOMPLETE ITEM",
"product_type": "",
}
]
queue_rows = review_products.build_review_queue(
purchase_rows,
[],
link_rows,
catalog_rows,
[],
)
reasons = {row["normalized_item_id"]: row["reason_code"] for row in queue_rows}
self.assertEqual("orphaned_catalog_link", reasons["gnorm_orphan"])
self.assertEqual("incomplete_catalog_link", reasons["gnorm_incomplete"])
if __name__ == "__main__": if __name__ == "__main__":
unittest.main() unittest.main()

View File

@@ -3,7 +3,7 @@ import tempfile
import unittest import unittest
from pathlib import Path from pathlib import Path
import scraper import scrape_giant as scraper
class ScraperTests(unittest.TestCase): class ScraperTests(unittest.TestCase):

View File

@@ -1,154 +0,0 @@
import json
from pathlib import Path
import click
import build_canonical_layer
import build_observed_products
from layer_helpers import stable_id, write_csv_rows
PROOF_FIELDS = [
"proof_name",
"canonical_product_id",
"giant_observed_product_id",
"costco_observed_product_id",
"giant_example_item",
"costco_example_item",
"notes",
]
def read_rows(path):
import csv
with Path(path).open(newline="", encoding="utf-8") as handle:
return list(csv.DictReader(handle))
def find_proof_pair(observed_rows):
giant = None
costco = None
for row in observed_rows:
if row["retailer"] == "giant" and row["representative_name_norm"] == "BANANA":
giant = row
if row["retailer"] == "costco" and row["representative_name_norm"] == "BANANA":
costco = row
return giant, costco
def merge_proof_pair(canonical_rows, link_rows, giant_row, costco_row):
if not giant_row or not costco_row:
return canonical_rows, link_rows, []
proof_canonical_id = stable_id("gcan", "proof|banana")
link_rows = [
row
for row in link_rows
if row["observed_product_id"]
not in {giant_row["observed_product_id"], costco_row["observed_product_id"]}
]
canonical_rows = [
row
for row in canonical_rows
if row["canonical_product_id"] != proof_canonical_id
]
canonical_rows.append(
{
"canonical_product_id": proof_canonical_id,
"canonical_name": "BANANA",
"product_type": "banana",
"brand": "",
"variant": "",
"size_value": "",
"size_unit": "",
"pack_qty": "",
"measure_type": "weight",
"normalized_quantity": "",
"normalized_quantity_unit": "",
"notes": "manual proof merge for cross-retailer validation",
"created_at": "",
"updated_at": "",
}
)
for observed_row in [giant_row, costco_row]:
link_rows.append(
{
"observed_product_id": observed_row["observed_product_id"],
"canonical_product_id": proof_canonical_id,
"link_method": "manual_proof_merge",
"link_confidence": "medium",
"review_status": "",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "cross-retailer validation proof",
}
)
proof_rows = [
{
"proof_name": "banana",
"canonical_product_id": proof_canonical_id,
"giant_observed_product_id": giant_row["observed_product_id"],
"costco_observed_product_id": costco_row["observed_product_id"],
"giant_example_item": giant_row["example_item_name"],
"costco_example_item": costco_row["example_item_name"],
"notes": "BANANA proof pair built from Giant and Costco enriched rows",
}
]
return canonical_rows, link_rows, proof_rows
@click.command()
@click.option(
"--giant-items-enriched-csv",
default="giant_output/items_enriched.csv",
show_default=True,
)
@click.option(
"--costco-items-enriched-csv",
default="costco_output/items_enriched.csv",
show_default=True,
)
@click.option(
"--outdir",
default="combined_output",
show_default=True,
)
def main(giant_items_enriched_csv, costco_items_enriched_csv, outdir):
outdir = Path(outdir)
rows = read_rows(giant_items_enriched_csv) + read_rows(costco_items_enriched_csv)
observed_rows = build_observed_products.build_observed_products(rows)
canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows)
giant_row, costco_row = find_proof_pair(observed_rows)
if not giant_row or not costco_row:
raise click.ClickException(
"could not find BANANA proof pair across Giant and Costco observed products"
)
canonical_rows, link_rows, proof_rows = merge_proof_pair(
canonical_rows, link_rows, giant_row, costco_row
)
write_csv_rows(
outdir / "products_observed.csv",
observed_rows,
build_observed_products.OUTPUT_FIELDS,
)
write_csv_rows(
outdir / "products_canonical.csv",
canonical_rows,
build_canonical_layer.CANONICAL_FIELDS,
)
write_csv_rows(
outdir / "product_links.csv",
link_rows,
build_canonical_layer.LINK_FIELDS,
)
write_csv_rows(outdir / "proof_examples.csv", proof_rows, PROOF_FIELDS)
click.echo(
f"wrote combined outputs to {outdir} using {len(observed_rows)} observed rows"
)
if __name__ == "__main__":
main()