minor edit

minor edi
Record t1.22.1 task evidence
2026-03-24 17:28:16 -04:00 · 2026-03-24 17:27:34 -04:00 · 2026-03-24 17:26:00 -04:00 · 2026-03-24 17:25:52 -04:00 · 2026-03-24 17:10:09 -04:00 · 2026-03-24 17:09:57 -04:00
34 changed files with 5793 additions and 1468 deletions
--- a/README.md
+++ b/README.md
@@ -1,103 +1,180 @@
 # scrape-giant
-Small grocery-history pipeline for Giant receipts.
+CLI to pull purchase history from Giant and Costco websites and refine into a single product catalog for external analysis.
-The project currently does four things:
+Run each script step-by-step from the terminal.
-1. scrape Giant in-store order history from an active Firefox session
+## What It Does
 2. enrich raw line items into a deterministic `items_enriched.csv`
 3. aggregate retailer-facing observed products and build a manual review queue
 4. create a first-pass canonical product layer plus conservative auto-links
-The work so far is Giant-specific on the ingest side and intentionally simple on
+1. `collect_giant_web.py`: download Giant orders and items
-the shared product-model side.
+2. `normalize_giant_web.py`: normalize Giant line items
 3. `collect_costco_web.py`: download Costco orders and items
 4. `normalize_costco_web.py`: normalize Costco line items
 5. `build_purchases.py`: combine retailer outputs into one purchase table
 6. `review_products.py`: review unresolved product matches in the terminal
 7. `report_pipeline_status.py`: show how many rows survive each stage
 8. `analyze_purchases.py`: write chart-ready analysis CSVs from the purchase table
-## Current flow
+## Requirements
-Run the commands from the repo root with the project venv active, or call them
+- Python 3.10+
-directly through `./venv/bin/python`.
+- Firefox installed with active Giant and Costco sessions
 ## Install
 ```bash
-./venv/bin/python scraper.py
+python -m venv venv
-./venv/bin/python enrich_giant.py
+source venv/bin/activate
-./venv/bin/python build_observed_products.py
+pip install -r requirements.txt
 ./venv/bin/python build_review_queue.py
 ./venv/bin/python build_canonical_layer.py
 ```
-## Inputs
+## Optional `.env`
- Firefox cookies for `giantfood.com`
+The current version works best with a `.env` file in the project root. The scraper prompts for these values if they are not found in the current browser session.
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` in `.env`, shell env, or prompts
+- `collect_giant_web.py` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- Giant raw order payloads in `giant_output/raw/`
+- `collect_costco_web.py` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
 - Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.
-## Outputs
+```env
 GIANT_USER_ID=...
 GIANT_LOYALTY_NUMBER=...
-Current generated files live under `giant_output/`:
+COSTCO_X_AUTHORIZATION=...
 COSTCO_X_WCS_CLIENTID=...
 COSTCO_CLIENT_IDENTIFIER=...
 ```
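The lookup order described above (configured value first, interactive prompt as a fallback) can be sketched as follows. This is a minimal illustration, not the code in `collect_giant_web.py`: `resolve_setting` and its parameters are hypothetical names, `.env` parsing is left out by assuming the values are already present in the environment mapping, and the Firefox local storage lookup used for Costco is not modeled.

```python
import os


def resolve_setting(name, env=None, prompt=input):
    """Return a setting from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; the real collectors may differ.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if value:
        return value
    # Fall back to asking the user, as the README says the collectors do.
    return prompt(f"{name}: ").strip()


if __name__ == "__main__":
    # Placeholder values so the sketch runs without real credentials.
    demo_env = {"GIANT_USER_ID": "12345"}
    print(resolve_setting("GIANT_USER_ID", env=demo_env))
    print(resolve_setting("GIANT_LOYALTY_NUMBER", env=demo_env, prompt=lambda _: "67890"))
```

Passing `env` and `prompt` as parameters keeps the helper testable without touching real credentials or blocking on keyboard input.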
- `orders.csv`: flattened visit/order rows from the Giant history API
+Current active path layout:
 - `items.csv`: flattened raw line items from fetched order detail payloads
 - `items_enriched.csv`: deterministic parsed/enriched line items
 - `products_observed.csv`: retailer-facing observed product groups
 - `review_queue.csv`: products needing manual review
 - `products_canonical.csv`: shared canonical product rows
 - `product_links.csv`: observed-to-canonical links
-Raw json remains the source of truth:
+```text
 data/
  giant-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  review/
    catalog.csv
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    pipeline_status.csv
    pipeline_status.json
  analysis/
    purchases.csv
    comparison_examples.csv
    item_price_over_time.csv
    spend_by_visit.csv
    items_per_visit.csv
    category_spend_over_time.csv
    retailer_store_breakdown.csv
 ```
- `giant_output/raw/history.json`
+## Run Order
 - `giant_output/raw/<order_id>.json`
-## Scripts
+Run the pipeline in this order:
- `scraper.py`: fetches Giant history/detail payloads and updates `orders.csv` and `items.csv`
+```bash
- `enrich_giant.py`: reads raw Giant order json and writes `items_enriched.csv`
+python collect_giant_web.py
- `build_observed_products.py`: groups enriched rows into `products_observed.csv`
+python normalize_giant_web.py
- `build_review_queue.py`: generates `review_queue.csv` and preserves review status on reruns
+python collect_costco_web.py
- `build_canonical_layer.py`: builds `products_canonical.csv` and `product_links.csv`
+python normalize_costco_web.py
 python build_purchases.py
 python review_products.py
 python build_purchases.py
 python review_products.py --refresh-only
 python report_pipeline_status.py
 python analyze_purchases.py
 ```
-## Notes on the current model
+Why run `build_purchases.py` twice:
 - first pass builds the current combined dataset and review queue inputs
 - `review_products.py` writes durable review decisions
 - second pass reapplies those decisions into the purchase output
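The second pass can be pictured as a lookup of saved decisions keyed by `normalized_item_id`. The sketch below is an assumption about the shape of that step, not the code in `build_purchases.py`: `apply_resolutions` is a hypothetical name, the action values `link` and `exclude` are illustrative placeholders, and dropping excluded rows is an assumed behavior. Only the column names are taken from the resolution and purchase field lists.

```python
def apply_resolutions(purchase_rows, resolution_rows):
    """Reapply saved review decisions to purchase rows (illustrative only)."""
    # Index decisions by item id; a later row for the same item wins.
    decisions = {row["normalized_item_id"]: row for row in resolution_rows}
    resolved = []
    for row in purchase_rows:
        decision = decisions.get(row.get("normalized_item_id", ""))
        if decision is None:
            resolved.append(dict(row))  # still unreviewed; leave untouched
            continue
        if decision.get("resolution_action") == "exclude":
            continue  # assumed behavior: excluded items drop out of the output
        updated = dict(row)
        updated["catalog_id"] = decision.get("catalog_id", "")
        updated["resolution_action"] = decision.get("resolution_action", "")
        updated["review_status"] = decision.get("status", "")
        resolved.append(updated)
    return resolved


if __name__ == "__main__":
    purchases = [
        {"normalized_item_id": "a", "catalog_id": ""},
        {"normalized_item_id": "b", "catalog_id": ""},
    ]
    saved = [
        {"normalized_item_id": "a", "catalog_id": "cat1",
         "resolution_action": "link", "status": "approved"},
    ]
    for row in apply_resolutions(purchases, saved):
        print(row)
```

Because the decisions live in their own file, rerunning the first pass never loses review work; the second pass only has to join them back in.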
- Observed products are retailer-specific: Giant, Costco.
+If you only want to refresh the queue without reviewing interactively:
 - Canonical products are the first cross-retailer layer.
 - Auto-linking is conservative:
  exact UPC first, then exact normalized name plus exact size/unit context, then
  exact normalized name when there is no size context to conflict.
 - Fee rows are excluded from auto-linking.
 - Unknown values are left blank instead of guessed.
-## Verification
+```bash
 python review_products.py --refresh-only
 ```
-Run the test suite with:
+If you want a quick stage-by-stage accountability check:
 ```bash
 python report_pipeline_status.py
 ```
 ## Key Outputs
 Giant:
 - `data/giant-web/collected_orders.csv`
 - `data/giant-web/collected_items.csv`
 - `data/giant-web/normalized_items.csv`
 Costco:
 - `data/costco-web/collected_orders.csv`
 - `data/costco-web/collected_items.csv`
 - `data/costco-web/normalized_items.csv`
 - `data/costco-web/normalized_items.csv` preserves raw totals and matched net discount fields
 Combined:
 - `data/analysis/purchases.csv`
 - `data/analysis/comparison_examples.csv`
 - `data/analysis/item_price_over_time.csv`
 - `data/analysis/spend_by_visit.csv`
 - `data/analysis/items_per_visit.csv`
 - `data/analysis/category_spend_over_time.csv`
 - `data/analysis/retailer_store_breakdown.csv`
 - `data/review/review_queue.csv`
 - `data/review/review_resolutions.csv`
 - `data/review/product_links.csv`
 - `data/review/pipeline_status.csv`
 - `data/review/pipeline_status.json`
 - `data/review/catalog.csv`
 `data/analysis/purchases.csv` is the main analysis artifact. It is designed to support both:
 - item-level price analysis
 - visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown
 The visit fields are carried directly in `purchases.csv`, so you can pivot on them without extra joins:
 - `order_id`
 - `purchase_date`
 - `retailer`
 - `store_name`
 - `store_number`
 - `store_city`
 - `store_state`
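As an example of pivoting on these fields without a join, the sketch below totals spend per visit straight from purchase rows. It mirrors the idea behind `spend_by_visit.csv` but is a standalone illustration using only the standard library; `spend_by_visit` is a hypothetical name and the sample rows are made up for the demonstration.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def spend_by_visit(rows):
    """Sum net_line_total (falling back to line_total) per visit."""
    totals = defaultdict(Decimal)
    for row in rows:
        raw = row.get("net_line_total") or row.get("line_total") or ""
        try:
            amount = Decimal(raw)
        except InvalidOperation:
            continue  # skip rows with no usable amount
        key = (
            row.get("purchase_date", ""),
            row.get("retailer", ""),
            row.get("order_id", ""),
        )
        totals[key] += amount
    return dict(totals)


# Made-up sample rows using column names from purchases.csv.
SAMPLE = """purchase_date,retailer,order_id,net_line_total,line_total
2026-03-01,giant,g1,3.50,4.00
2026-03-01,giant,g1,,2.25
2026-03-02,costco,c1,10.00,10.00
"""

if __name__ == "__main__":
    sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
    for visit, total in sorted(spend_by_visit(sample_rows).items()):
        print(visit, total)
```

Using `Decimal` rather than `float` keeps currency sums exact, which matches the decimal helpers the analysis scripts import.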
 ## Review Workflow
 Run `review_products.py` to clean up unresolved or weakly unified items:
 - link an item to an existing canonical product
 - create a new canonical product
 - exclude an item
 - skip it for later
 Decisions are saved and reused on later runs.
 The review step is intentionally conservative:
 - weak exact-name matches stay in the queue instead of auto-creating canonical products
 - canonical names should describe stable product identity, not retailer packaging text
 ## Notes
 - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
 - Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
 ## Test
 ```bash
 ./venv/bin/python -m unittest discover -s tests
 ```
-Useful one-off rebuilds:
+## Project Docs
-```bash
+- `pm/tasks.org`: task tracking
-./venv/bin/python enrich_giant.py
+- `pm/data-model.org`: current data model notes
-./venv/bin/python build_observed_products.py
+- `pm/review-workflow.org`: review and resolution workflow
 ./venv/bin/python build_review_queue.py
 ./venv/bin/python build_canonical_layer.py
 ```
 ## Project docs
 - `pm/tasks.org`: task log and evidence
 - `pm/data-model.org`: file layout and schema decisions
 ## Status
 Completed through `t1.7`:
 - Giant receipt fetch CLI
 - data model and file layout
 - Giant parser/enricher
 - observed products
 - review queue
 - canonical layer scaffold
 - conservative auto-link rules
 Next planned task is `t1.8`: add a Costco raw ingest path.
--- a/analyze_purchases.py
+++ b/analyze_purchases.py
@@ -0,0 +1,271 @@
 from collections import defaultdict
 from pathlib import Path
 import click
 from enrich_giant import format_decimal, to_decimal
 from layer_helpers import read_csv_rows, write_csv_rows
 ITEM_PRICE_FIELDS = [
    "purchase_date",
    "retailer",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "order_id",
    "catalog_id",
    "catalog_name",
    "category",
    "product_type",
    "effective_price",
    "effective_price_unit",
    "net_line_total",
    "normalized_quantity",
 ]
 SPEND_BY_VISIT_FIELDS = [
    "purchase_date",
    "retailer",
    "order_id",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "visit_spend_total",
 ]
 ITEMS_PER_VISIT_FIELDS = [
    "purchase_date",
    "retailer",
    "order_id",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "item_row_count",
    "distinct_catalog_count",
 ]
 CATEGORY_SPEND_FIELDS = [
    "purchase_date",
    "retailer",
    "category",
    "category_spend_total",
 ]
 RETAILER_STORE_FIELDS = [
    "retailer",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "visit_count",
    "item_row_count",
    "store_spend_total",
 ]
 def effective_total(row):
    # Prefer the discount-adjusted total; fall back to the raw line total.
    total = to_decimal(row.get("net_line_total"))
    if total is not None:
        return total
    return to_decimal(row.get("line_total"))
 def is_item_row(row):
    # Fee, discount, and coupon lines are not counted as purchased items.
    return (
        row.get("is_fee") != "true"
        and row.get("is_discount_line") != "true"
        and row.get("is_coupon_line") != "true"
    )
 def build_item_price_rows(purchase_rows):
    rows = []
    for row in purchase_rows:
        if not row.get("catalog_name") or not row.get("effective_price"):
            continue
        rows.append(
            {
                "purchase_date": row.get("purchase_date", ""),
                "retailer": row.get("retailer", ""),
                "store_name": row.get("store_name", ""),
                "store_number": row.get("store_number", ""),
                "store_city": row.get("store_city", ""),
                "store_state": row.get("store_state", ""),
                "order_id": row.get("order_id", ""),
                "catalog_id": row.get("catalog_id", ""),
                "catalog_name": row.get("catalog_name", ""),
                "category": row.get("category", ""),
                "product_type": row.get("product_type", ""),
                "effective_price": row.get("effective_price", ""),
                "effective_price_unit": row.get("effective_price_unit", ""),
                "net_line_total": row.get("net_line_total", ""),
                "normalized_quantity": row.get("normalized_quantity", ""),
            }
        )
    return rows
 def build_spend_by_visit_rows(purchase_rows):
    grouped = defaultdict(lambda: {"total": to_decimal("0")})
    for row in purchase_rows:
        total = effective_total(row)
        if total is None:
            continue
        key = (
            row.get("purchase_date", ""),
            row.get("retailer", ""),
            row.get("order_id", ""),
            row.get("store_name", ""),
            row.get("store_number", ""),
            row.get("store_city", ""),
            row.get("store_state", ""),
        )
        grouped[key]["total"] += total
    rows = []
    for key, values in sorted(grouped.items()):
        rows.append(
            {
                "purchase_date": key[0],
                "retailer": key[1],
                "order_id": key[2],
                "store_name": key[3],
                "store_number": key[4],
                "store_city": key[5],
                "store_state": key[6],
                "visit_spend_total": format_decimal(values["total"]),
            }
        )
    return rows
 def build_items_per_visit_rows(purchase_rows):
    grouped = defaultdict(lambda: {"item_rows": 0, "catalog_ids": set()})
    for row in purchase_rows:
        if not is_item_row(row):
            continue
        key = (
            row.get("purchase_date", ""),
            row.get("retailer", ""),
            row.get("order_id", ""),
            row.get("store_name", ""),
            row.get("store_number", ""),
            row.get("store_city", ""),
            row.get("store_state", ""),
        )
        grouped[key]["item_rows"] += 1
        if row.get("catalog_id"):
            grouped[key]["catalog_ids"].add(row["catalog_id"])
    rows = []
    for key, values in sorted(grouped.items()):
        rows.append(
            {
                "purchase_date": key[0],
                "retailer": key[1],
                "order_id": key[2],
                "store_name": key[3],
                "store_number": key[4],
                "store_city": key[5],
                "store_state": key[6],
                "item_row_count": str(values["item_rows"]),
                "distinct_catalog_count": str(len(values["catalog_ids"])),
            }
        )
    return rows
 def build_category_spend_rows(purchase_rows):
    grouped = defaultdict(lambda: to_decimal("0"))
    for row in purchase_rows:
        category = row.get("category", "")
        total = effective_total(row)
        if not category or total is None:
            continue
        key = (
            row.get("purchase_date", ""),
            row.get("retailer", ""),
            category,
        )
        grouped[key] += total
    rows = []
    for key, total in sorted(grouped.items()):
        rows.append(
            {
                "purchase_date": key[0],
                "retailer": key[1],
                "category": key[2],
                "category_spend_total": format_decimal(total),
            }
        )
    return rows
 def build_retailer_store_rows(purchase_rows):
    grouped = defaultdict(lambda: {"visit_ids": set(), "item_rows": 0, "total": to_decimal("0")})
    for row in purchase_rows:
        total = effective_total(row)
        key = (
            row.get("retailer", ""),
            row.get("store_name", ""),
            row.get("store_number", ""),
            row.get("store_city", ""),
            row.get("store_state", ""),
        )
        grouped[key]["visit_ids"].add((row.get("purchase_date", ""), row.get("order_id", "")))
        if is_item_row(row):
            grouped[key]["item_rows"] += 1
        if total is not None:
            grouped[key]["total"] += total
    rows = []
    for key, values in sorted(grouped.items()):
        rows.append(
            {
                "retailer": key[0],
                "store_name": key[1],
                "store_number": key[2],
                "store_city": key[3],
                "store_state": key[4],
                "visit_count": str(len(values["visit_ids"])),
                "item_row_count": str(values["item_rows"]),
                "store_spend_total": format_decimal(values["total"]),
            }
        )
    return rows
@click.command()
@click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--output-dir", default="data/analysis", show_default=True)
 def main(purchases_csv, output_dir):
    purchase_rows = read_csv_rows(purchases_csv)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    item_price_rows = build_item_price_rows(purchase_rows)
    spend_by_visit_rows = build_spend_by_visit_rows(purchase_rows)
    items_per_visit_rows = build_items_per_visit_rows(purchase_rows)
    category_spend_rows = build_category_spend_rows(purchase_rows)
    retailer_store_rows = build_retailer_store_rows(purchase_rows)
    outputs = [
        ("item_price_over_time.csv", item_price_rows, ITEM_PRICE_FIELDS),
        ("spend_by_visit.csv", spend_by_visit_rows, SPEND_BY_VISIT_FIELDS),
        ("items_per_visit.csv", items_per_visit_rows, ITEMS_PER_VISIT_FIELDS),
        ("category_spend_over_time.csv", category_spend_rows, CATEGORY_SPEND_FIELDS),
        ("retailer_store_breakdown.csv", retailer_store_rows, RETAILER_STORE_FIELDS),
    ]
    for filename, rows, fieldnames in outputs:
        write_csv_rows(output_path / filename, rows, fieldnames)
    click.echo(f"wrote analysis outputs to {output_path}")
 if __name__ == "__main__":
    main()
--- a/build_canonical_layer.py
+++ b/build_canonical_layer.py
@@ -1,216 +0,0 @@
 import click
 from layer_helpers import read_csv_rows, representative_value, stable_id, write_csv_rows
 CANONICAL_FIELDS = [
    "canonical_product_id",
    "canonical_name",
    "product_type",
    "brand",
    "variant",
    "size_value",
    "size_unit",
    "pack_qty",
    "measure_type",
    "normalized_quantity",
    "normalized_quantity_unit",
    "notes",
    "created_at",
    "updated_at",
 ]
 LINK_FIELDS = [
    "observed_product_id",
    "canonical_product_id",
    "link_method",
    "link_confidence",
    "review_status",
    "reviewed_by",
    "reviewed_at",
    "link_notes",
 ]
 def to_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None
 def normalized_quantity(row):
    size_value = to_float(row.get("representative_size_value"))
    pack_qty = to_float(row.get("representative_pack_qty")) or 1.0
    size_unit = row.get("representative_size_unit", "")
    measure_type = row.get("representative_measure_type", "")
    if size_value is not None and size_unit:
        return format(size_value * pack_qty, "g"), size_unit
    if row.get("representative_pack_qty") and measure_type == "count":
        return row["representative_pack_qty"], "count"
    if measure_type == "each":
        return "1", "each"
    return "", ""
 def auto_link_rule(observed_row):
    if (
        observed_row.get("is_fee") == "true"
        or observed_row.get("is_discount_line") == "true"
        or observed_row.get("is_coupon_line") == "true"
    ):
        return "", "", ""
    if observed_row.get("representative_upc"):
        return (
            "exact_upc",
            f"upc={observed_row['representative_upc']}",
            "high",
        )
    if (
        observed_row.get("representative_name_norm")
        and observed_row.get("representative_size_value")
        and observed_row.get("representative_size_unit")
    ):
        return (
            "exact_name_size",
            "|".join(
                [
                    f"name={observed_row['representative_name_norm']}",
                    f"size={observed_row['representative_size_value']}",
                    f"unit={observed_row['representative_size_unit']}",
                    f"pack={observed_row['representative_pack_qty']}",
                    f"measure={observed_row['representative_measure_type']}",
                ]
            ),
            "high",
        )
    if (
        observed_row.get("representative_name_norm")
        and not observed_row.get("representative_size_value")
        and not observed_row.get("representative_size_unit")
        and not observed_row.get("representative_pack_qty")
    ):
        return (
            "exact_name",
            "|".join(
                [
                    f"name={observed_row['representative_name_norm']}",
                    f"measure={observed_row['representative_measure_type']}",
                ]
            ),
            "medium",
        )
    return "", "", ""
 def canonical_row_for_group(canonical_product_id, group_rows, link_method):
    quantity_value, quantity_unit = normalized_quantity(
        {
            "representative_size_value": representative_value(
                group_rows, "representative_size_value"
            ),
            "representative_size_unit": representative_value(
                group_rows, "representative_size_unit"
            ),
            "representative_pack_qty": representative_value(
                group_rows, "representative_pack_qty"
            ),
            "representative_measure_type": representative_value(
                group_rows, "representative_measure_type"
            ),
        }
    )
    return {
        "canonical_product_id": canonical_product_id,
        "canonical_name": representative_value(group_rows, "representative_name_norm"),
        "product_type": "",
        "brand": representative_value(group_rows, "representative_brand"),
        "variant": representative_value(group_rows, "representative_variant"),
        "size_value": representative_value(group_rows, "representative_size_value"),
        "size_unit": representative_value(group_rows, "representative_size_unit"),
        "pack_qty": representative_value(group_rows, "representative_pack_qty"),
        "measure_type": representative_value(group_rows, "representative_measure_type"),
        "normalized_quantity": quantity_value,
        "normalized_quantity_unit": quantity_unit,
        "notes": f"auto-linked via {link_method}",
        "created_at": "",
        "updated_at": "",
    }
 def build_canonical_layer(observed_rows):
    canonical_rows = []
    link_rows = []
    groups = {}
    for observed_row in sorted(observed_rows, key=lambda row: row["observed_product_id"]):
        link_method, group_key, confidence = auto_link_rule(observed_row)
        if not group_key:
            continue
        canonical_product_id = stable_id("gcan", f"{link_method}|{group_key}")
        groups.setdefault(canonical_product_id, {"method": link_method, "rows": []})
        groups[canonical_product_id]["rows"].append(observed_row)
        link_rows.append(
            {
                "observed_product_id": observed_row["observed_product_id"],
                "canonical_product_id": canonical_product_id,
                "link_method": link_method,
                "link_confidence": confidence,
                "review_status": "",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "",
            }
        )
    for canonical_product_id, group in sorted(groups.items()):
        canonical_rows.append(
            canonical_row_for_group(
                canonical_product_id, group["rows"], group["method"]
            )
        )
    return canonical_rows, link_rows
@click.command()
@click.option(
    "--observed-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product rows.",
 )
@click.option(
    "--canonical-csv",
    default="giant_output/products_canonical.csv",
    show_default=True,
    help="Path to canonical product output.",
 )
@click.option(
    "--links-csv",
    default="giant_output/product_links.csv",
    show_default=True,
    help="Path to observed-to-canonical link output.",
 )
 def main(observed_csv, canonical_csv, links_csv):
    observed_rows = read_csv_rows(observed_csv)
    canonical_rows, link_rows = build_canonical_layer(observed_rows)
    write_csv_rows(canonical_csv, canonical_rows, CANONICAL_FIELDS)
    write_csv_rows(links_csv, link_rows, LINK_FIELDS)
    click.echo(
        f"wrote {len(canonical_rows)} canonical rows to {canonical_csv} and "
        f"{len(link_rows)} links to {links_csv}"
    )
 if __name__ == "__main__":
    main()
--- a/build_observed_products.py
+++ b/build_observed_products.py
@@ -1,172 +0,0 @@
 from collections import defaultdict
 import click
 from layer_helpers import (
    compact_join,
    distinct_values,
    first_nonblank,
    read_csv_rows,
    representative_value,
    stable_id,
    write_csv_rows,
 )
 OUTPUT_FIELDS = [
    "observed_product_id",
    "retailer",
    "observed_key",
    "representative_retailer_item_id",
    "representative_upc",
    "representative_item_name",
    "representative_name_norm",
    "representative_brand",
    "representative_variant",
    "representative_size_value",
    "representative_size_unit",
    "representative_pack_qty",
    "representative_measure_type",
    "representative_image_url",
    "is_store_brand",
    "is_fee",
    "is_discount_line",
    "is_coupon_line",
    "first_seen_date",
    "last_seen_date",
    "times_seen",
    "example_order_id",
    "example_item_name",
    "raw_name_examples",
    "normalized_name_examples",
    "example_prices",
    "distinct_item_names_count",
    "distinct_retailer_item_ids_count",
    "distinct_upcs_count",
 ]
 def build_observed_key(row):
    if row.get("upc"):
        return "|".join(
            [
                row["retailer"],
                f"upc={row['upc']}",
                f"name={row['item_name_norm']}",
            ]
        )
    if row.get("retailer_item_id"):
        return "|".join(
            [
                row["retailer"],
                f"retailer_item_id={row['retailer_item_id']}",
                f"name={row['item_name_norm']}",
                f"discount={row.get('is_discount_line', 'false')}",
                f"coupon={row.get('is_coupon_line', 'false')}",
            ]
        )
    return "|".join(
        [
            row["retailer"],
            f"name={row['item_name_norm']}",
            f"size={row['size_value']}",
            f"unit={row['size_unit']}",
            f"pack={row['pack_qty']}",
            f"measure={row['measure_type']}",
            f"store_brand={row['is_store_brand']}",
            f"fee={row['is_fee']}",
        ]
    )
 def build_observed_products(rows):
    grouped = defaultdict(list)
    for row in rows:
        grouped[build_observed_key(row)].append(row)
    observed_rows = []
    for observed_key, group_rows in sorted(grouped.items()):
        ordered = sorted(
            group_rows,
            key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])),
        )
        observed_rows.append(
            {
                "observed_product_id": stable_id("gobs", observed_key),
                "retailer": ordered[0]["retailer"],
                "observed_key": observed_key,
                "representative_retailer_item_id": representative_value(
                    ordered, "retailer_item_id"
                ),
                "representative_upc": representative_value(ordered, "upc"),
                "representative_item_name": representative_value(ordered, "item_name"),
                "representative_name_norm": representative_value(
                    ordered, "item_name_norm"
                ),
                "representative_brand": representative_value(ordered, "brand_guess"),
                "representative_variant": representative_value(ordered, "variant"),
                "representative_size_value": representative_value(ordered, "size_value"),
                "representative_size_unit": representative_value(ordered, "size_unit"),
                "representative_pack_qty": representative_value(ordered, "pack_qty"),
                "representative_measure_type": representative_value(
                    ordered, "measure_type"
                ),
                "representative_image_url": first_nonblank(ordered, "image_url"),
                "is_store_brand": representative_value(ordered, "is_store_brand"),
                "is_fee": representative_value(ordered, "is_fee"),
                "is_discount_line": representative_value(
                    ordered, "is_discount_line"
                ),
                "is_coupon_line": representative_value(ordered, "is_coupon_line"),
                "first_seen_date": ordered[0]["order_date"],
                "last_seen_date": ordered[-1]["order_date"],
                "times_seen": str(len(ordered)),
                "example_order_id": ordered[0]["order_id"],
                "example_item_name": ordered[0]["item_name"],
                "raw_name_examples": compact_join(
                    distinct_values(ordered, "item_name"), limit=4
                ),
                "normalized_name_examples": compact_join(
                    distinct_values(ordered, "item_name_norm"), limit=4
                ),
                "example_prices": compact_join(
                    distinct_values(ordered, "line_total"), limit=4
                ),
                "distinct_item_names_count": str(
                    len(distinct_values(ordered, "item_name"))
                ),
                "distinct_retailer_item_ids_count": str(
                    len(distinct_values(ordered, "retailer_item_id"))
                ),
                "distinct_upcs_count": str(len(distinct_values(ordered, "upc"))),
            }
        )
    observed_rows.sort(key=lambda row: row["observed_product_id"])
    return observed_rows
@click.command()
@click.option(
    "--items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
    help="Path to enriched Giant item rows.",
 )
@click.option(
    "--output-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product output.",
 )
 def main(items_enriched_csv, output_csv):
    rows = read_csv_rows(items_enriched_csv)
    observed_rows = build_observed_products(rows)
    write_csv_rows(output_csv, observed_rows, OUTPUT_FIELDS)
    click.echo(f"wrote {len(observed_rows)} rows to {output_csv}")
 if __name__ == "__main__":
    main()
--- a/build_purchases.py
+++ b/build_purchases.py
@@ -0,0 +1,487 @@
 from decimal import Decimal
 from pathlib import Path
 import click
 from enrich_giant import format_decimal, to_decimal
 from layer_helpers import read_csv_rows, write_csv_rows
 PURCHASE_FIELDS = [
    "purchase_date",
    "retailer",
    "catalog_name",
    "product_type",
    "category",
    "net_line_total",
    "normalized_quantity",
    "normalized_quantity_unit",
    "effective_price",
    "effective_price_unit",
    "order_id",
    "line_no",
    "normalized_row_id",
    "normalized_item_id",
    "catalog_id",
    "review_status",
    "resolution_action",
    "raw_item_name",
    "normalized_item_name",
    "brand",
    "variant",
    "image_url",
    "retailer_item_id",
    "upc",
    "qty",
    "unit",
    "pack_qty",
    "size_value",
    "size_unit",
    "measure_type",
    "line_total",
    "unit_price",
    "matched_discount_amount",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "price_per_each",
    "price_per_each_basis",
    "price_per_count",
    "price_per_count_basis",
    "price_per_lb",
    "price_per_lb_basis",
    "price_per_oz",
    "price_per_oz_basis",
    "is_discount_line",
    "is_coupon_line",
    "is_fee",
    "raw_order_path",
 ]
 EXAMPLE_FIELDS = [
    "example_name",
    "catalog_id",
    "giant_purchase_date",
    "giant_raw_item_name",
    "giant_price_per_lb",
    "costco_purchase_date",
    "costco_raw_item_name",
    "costco_price_per_lb",
    "notes",
 ]
 CATALOG_FIELDS = [
    "catalog_id",
    "catalog_name",
    "category",
    "product_type",
    "brand",
    "variant",
    "size_value",
    "size_unit",
    "pack_qty",
    "measure_type",
    "notes",
    "created_at",
    "updated_at",
 ]
 PRODUCT_LINK_FIELDS = [
    "normalized_item_id",
    "catalog_id",
    "link_method",
    "link_confidence",
    "review_status",
    "reviewed_by",
    "reviewed_at",
    "link_notes",
 ]
 RESOLUTION_FIELDS = [
    "normalized_item_id",
    "catalog_id",
    "resolution_action",
    "status",
    "resolution_notes",
    "reviewed_at",
 ]
 def derive_metrics(row):
    line_total = to_decimal(row.get("net_line_total") or row.get("line_total"))
    qty = to_decimal(row.get("qty"))
    pack_qty = to_decimal(row.get("pack_qty"))
    size_value = to_decimal(row.get("size_value"))
    picked_weight = to_decimal(row.get("picked_weight"))
    size_unit = row.get("size_unit", "")
    price_per_each = row.get("price_per_each", "")
    price_per_lb = row.get("price_per_lb", "")
    price_per_oz = row.get("price_per_oz", "")
    price_per_count = ""
    basis_each = ""
    basis_count = ""
    basis_lb = ""
    basis_oz = ""
    if price_per_each:
        basis_each = "line_total_over_qty"
    elif line_total is not None and qty not in (None, 0):
        price_per_each = format_decimal(line_total / qty)
        basis_each = "line_total_over_qty"
    if line_total is not None and pack_qty not in (None, 0):
        total_count = pack_qty * (qty or Decimal("1"))
        if total_count not in (None, 0):
            price_per_count = format_decimal(line_total / total_count)
            basis_count = "line_total_over_pack_qty"
    # Only derive weight-based prices when there is a total to divide; otherwise
    # keep any upstream price_per_lb / price_per_oz values untouched.
    if line_total is not None and picked_weight not in (None, 0):
        per_lb = line_total / picked_weight
        price_per_lb = format_decimal(per_lb)
        price_per_oz = format_decimal(per_lb / Decimal("16"))
        basis_lb = "picked_weight_lb"
        basis_oz = "picked_weight_lb_to_oz"
    elif line_total is not None and size_value not in (None, 0):
        total_units = size_value * (pack_qty or Decimal("1")) * (qty or Decimal("1"))
        if size_unit == "lb" and total_units not in (None, 0):
            per_lb = line_total / total_units
            price_per_lb = format_decimal(per_lb)
            price_per_oz = format_decimal(per_lb / Decimal("16"))
            basis_lb = "parsed_size_lb"
            basis_oz = "parsed_size_lb_to_oz"
        elif size_unit == "oz" and total_units not in (None, 0):
            per_oz = line_total / total_units
            price_per_oz = format_decimal(per_oz)
            price_per_lb = format_decimal(per_oz * Decimal("16"))
            basis_lb = "parsed_size_oz_to_lb"
            basis_oz = "parsed_size_oz"
    return {
        "price_per_each": price_per_each,
        "price_per_each_basis": basis_each,
        "price_per_count": price_per_count,
        "price_per_count_basis": basis_count,
        "price_per_lb": price_per_lb,
        "price_per_lb_basis": basis_lb,
        "price_per_oz": price_per_oz,
        "price_per_oz_basis": basis_oz,
    }
 def derive_effective_price(row):
    normalized_quantity = to_decimal(row.get("normalized_quantity"))
    if normalized_quantity in (None, Decimal("0")):
        return ""
    numerator = to_decimal(derive_net_line_total(row))
    if numerator is None:
        return ""
    return format_decimal(numerator / normalized_quantity)
 def derive_effective_price_unit(row):
    normalized_quantity = to_decimal(row.get("normalized_quantity"))
    if normalized_quantity in (None, Decimal("0")):
        return ""
    return row.get("normalized_quantity_unit", "")
 def derive_net_line_total(row):
    existing_net = row.get("net_line_total", "")
    if str(existing_net).strip() != "":
        return str(existing_net)
    line_total = to_decimal(row.get("line_total"))
    if line_total is None:
        return ""
    matched_discount_amount = to_decimal(row.get("matched_discount_amount"))
    if matched_discount_amount is not None:
        return format_decimal(line_total + matched_discount_amount)
    return format_decimal(line_total)
 def order_lookup(rows, retailer):
    return {(retailer, row["order_id"]): row for row in rows}
 def read_optional_csv_rows(path):
    path = Path(path)
    if not path.exists():
        return []
    return read_csv_rows(path)
 def normalize_catalog_row(row):
    return {
        "catalog_id": row.get("catalog_id") or row.get("canonical_product_id", ""),
        "catalog_name": row.get("catalog_name") or row.get("canonical_name", ""),
        "category": row.get("category", ""),
        "product_type": row.get("product_type", ""),
        "brand": row.get("brand", ""),
        "variant": row.get("variant", ""),
        "size_value": row.get("size_value", ""),
        "size_unit": row.get("size_unit", ""),
        "pack_qty": row.get("pack_qty", ""),
        "measure_type": row.get("measure_type", ""),
        "notes": row.get("notes", ""),
        "created_at": row.get("created_at", ""),
        "updated_at": row.get("updated_at", ""),
    }
 def is_review_first_catalog_row(row):
    notes = row.get("notes", "").strip().lower()
    # Rows created by the old auto-linker are dropped; everything else is kept.
    return not notes.startswith("auto-linked via")
 def normalize_link_row(row):
    return {
        "normalized_item_id": row.get("normalized_item_id", ""),
        "catalog_id": row.get("catalog_id") or row.get("canonical_product_id", ""),
        "link_method": row.get("link_method", ""),
        "link_confidence": row.get("link_confidence", ""),
        "review_status": row.get("review_status", ""),
        "reviewed_by": row.get("reviewed_by", ""),
        "reviewed_at": row.get("reviewed_at", ""),
        "link_notes": row.get("link_notes", ""),
    }
 def normalize_resolution_row(row):
    return {
        "normalized_item_id": row.get("normalized_item_id", ""),
        "catalog_id": row.get("catalog_id") or row.get("canonical_product_id", ""),
        "resolution_action": row.get("resolution_action", ""),
        "status": row.get("status", ""),
        "resolution_notes": row.get("resolution_notes", ""),
        "reviewed_at": row.get("reviewed_at", ""),
    }
 def load_resolution_lookup(resolution_rows):
    lookup = {}
    for row in resolution_rows:
        normalized_row = normalize_resolution_row(row)
        normalized_item_id = normalized_row.get("normalized_item_id", "")
        if not normalized_item_id:
            continue
        lookup[normalized_item_id] = normalized_row
    return lookup
 def merge_catalog_rows(existing_rows, new_rows):
    merged = {}
    for row in existing_rows + new_rows:
        normalized_row = normalize_catalog_row(row)
        catalog_id = normalized_row.get("catalog_id", "")
        if catalog_id:
            merged[catalog_id] = normalized_row
    return sorted(merged.values(), key=lambda row: row["catalog_id"])
 def load_link_lookup(link_rows):
    lookup = {}
    for row in link_rows:
        normalized_row = normalize_link_row(row)
        normalized_item_id = normalized_row.get("normalized_item_id", "")
        if not normalized_item_id:
            continue
        lookup[normalized_item_id] = normalized_row
    return lookup
 def build_purchase_rows(
    giant_enriched_rows,
    costco_enriched_rows,
    giant_orders,
    costco_orders,
    resolution_rows,
    link_rows=None,
    catalog_rows=None,
 ):
    all_enriched_rows = giant_enriched_rows + costco_enriched_rows
    resolution_lookup = load_resolution_lookup(resolution_rows)
    link_lookup = load_link_lookup(link_rows or [])
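    # Key on the normalized id so legacy rows that only carry
    # canonical_product_id do not raise KeyError on row["catalog_id"].
    catalog_lookup = {}
    for row in catalog_rows or []:
        normalized_catalog_row = normalize_catalog_row(row)
        if normalized_catalog_row["catalog_id"]:
            catalog_lookup[normalized_catalog_row["catalog_id"]] = normalized_catalog_row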
    for normalized_item_id, resolution in resolution_lookup.items():
        action = resolution.get("resolution_action", "")
        status = resolution.get("status", "")
        if status != "approved":
            continue
        if action in {"link", "create"} and resolution.get("catalog_id"):
            link_lookup[normalized_item_id] = {
                "normalized_item_id": normalized_item_id,
                "catalog_id": resolution["catalog_id"],
                "link_method": f"manual_{action}",
                "link_confidence": "high",
                "review_status": status,
                "reviewed_by": "",
                "reviewed_at": resolution.get("reviewed_at", ""),
                "link_notes": resolution.get("resolution_notes", ""),
            }
        elif action == "exclude":
            link_lookup.pop(normalized_item_id, None)
    orders_by_id = {}
    orders_by_id.update(order_lookup(giant_orders, "giant"))
    orders_by_id.update(order_lookup(costco_orders, "costco"))
    purchase_rows = []
    for row in sorted(
        all_enriched_rows,
        key=lambda item: (item["order_date"], item["retailer"], item["order_id"], int(item["line_no"])),
    ):
        normalized_item_id = row.get("normalized_item_id", "")
        resolution = resolution_lookup.get(normalized_item_id, {})
        link_row = link_lookup.get(normalized_item_id, {})
        catalog_row = catalog_lookup.get(link_row.get("catalog_id", ""), {})
        order_row = orders_by_id.get((row["retailer"], row["order_id"]), {})
        metrics = derive_metrics(row)
        purchase_rows.append(
            {
                "purchase_date": row["order_date"],
                "retailer": row["retailer"],
                "catalog_name": catalog_row.get("catalog_name", ""),
                "product_type": catalog_row.get("product_type", ""),
                "category": catalog_row.get("category", ""),
                "net_line_total": derive_net_line_total(row),
                "normalized_quantity": row.get("normalized_quantity", ""),
                "normalized_quantity_unit": row.get("normalized_quantity_unit", ""),
                "effective_price": derive_effective_price({**row, "net_line_total": derive_net_line_total(row)}),
                "effective_price_unit": derive_effective_price_unit(row),
                "order_id": row["order_id"],
                "line_no": row["line_no"],
                "normalized_row_id": row.get("normalized_row_id", ""),
                "normalized_item_id": normalized_item_id,
                "catalog_id": link_row.get("catalog_id", ""),
                "review_status": resolution.get("status", ""),
                "resolution_action": resolution.get("resolution_action", ""),
                "raw_item_name": row["item_name"],
                "normalized_item_name": row["item_name_norm"],
                "brand": catalog_row.get("brand", ""),
                "variant": catalog_row.get("variant", ""),
                "image_url": row.get("image_url", ""),
                "retailer_item_id": row["retailer_item_id"],
                "upc": row["upc"],
                "qty": row["qty"],
                "unit": row["unit"],
                "pack_qty": row["pack_qty"],
                "size_value": row["size_value"],
                "size_unit": row["size_unit"],
                "measure_type": row["measure_type"],
                "line_total": row["line_total"],
                "unit_price": row["unit_price"],
                "matched_discount_amount": row.get("matched_discount_amount", ""),
                "store_name": order_row.get("store_name", ""),
                "store_number": order_row.get("store_number", ""),
                "store_city": order_row.get("store_city", ""),
                "store_state": order_row.get("store_state", ""),
                "is_discount_line": row["is_discount_line"],
                "is_coupon_line": row["is_coupon_line"],
                "is_fee": row["is_fee"],
                "raw_order_path": row["raw_order_path"],
                **metrics,
            }
        )
    return purchase_rows, sorted(link_lookup.values(), key=lambda row: row["normalized_item_id"])
 def build_comparison_examples(purchase_rows):
    giant_banana = None
    costco_banana = None
    for row in purchase_rows:
        if row.get("normalized_item_name") != "BANANA":
            continue
        if not row.get("catalog_id"):
            continue
        if row["retailer"] == "giant" and row.get("price_per_lb"):
            giant_banana = row
        if row["retailer"] == "costco" and row.get("price_per_lb"):
            costco_banana = row
    if not giant_banana or not costco_banana:
        return []
    return [
        {
            "example_name": "banana_price_per_lb",
            "catalog_id": giant_banana["catalog_id"],
            "giant_purchase_date": giant_banana["purchase_date"],
            "giant_raw_item_name": giant_banana["raw_item_name"],
            "giant_price_per_lb": giant_banana["price_per_lb"],
            "costco_purchase_date": costco_banana["purchase_date"],
            "costco_raw_item_name": costco_banana["raw_item_name"],
            "costco_price_per_lb": costco_banana["price_per_lb"],
            "notes": "Example comparison using normalized price_per_lb across Giant and Costco",
        }
    ]
@click.command()
@click.option("--giant-items-enriched-csv", default="data/giant-web/normalized_items.csv", show_default=True)
@click.option("--costco-items-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--output-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--examples-csv", default="data/analysis/comparison_examples.csv", show_default=True)
 def main(
    giant_items_enriched_csv,
    costco_items_enriched_csv,
    giant_orders_csv,
    costco_orders_csv,
    resolutions_csv,
    catalog_csv,
    links_csv,
    output_csv,
    examples_csv,
 ):
    resolution_rows = read_optional_csv_rows(resolutions_csv)
    catalog_rows = merge_catalog_rows(
        [row for row in read_optional_csv_rows(catalog_csv) if is_review_first_catalog_row(row)],
        [],
    )
    existing_links = [normalize_link_row(row) for row in read_optional_csv_rows(links_csv)]
    purchase_rows, link_rows = build_purchase_rows(
        read_csv_rows(giant_items_enriched_csv),
        read_csv_rows(costco_items_enriched_csv),
        read_csv_rows(giant_orders_csv),
        read_csv_rows(costco_orders_csv),
        resolution_rows,
        existing_links,
        catalog_rows,
    )
    example_rows = build_comparison_examples(purchase_rows)
    write_csv_rows(catalog_csv, catalog_rows, CATALOG_FIELDS)
    write_csv_rows(links_csv, link_rows, PRODUCT_LINK_FIELDS)
    write_csv_rows(output_csv, purchase_rows, PURCHASE_FIELDS)
    write_csv_rows(examples_csv, example_rows, EXAMPLE_FIELDS)
    click.echo(
        f"wrote {len(purchase_rows)} purchase rows to {output_csv}, "
        f"{len(catalog_rows)} catalog rows to {catalog_csv}, "
        f"{len(link_rows)} product links to {links_csv}, "
        f"and {len(example_rows)} comparison examples to {examples_csv}"
    )
 if __name__ == "__main__":
    main()
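
Note on `derive_net_line_total` and `derive_effective_price` above: the following is a minimal, self-contained sketch of how the two appear to fit together. An existing `net_line_total` wins, otherwise the (negative) matched discount is added to `line_total`, and the result is divided by `normalized_quantity`. The `to_decimal` helper here is a hypothetical stand-in for the one imported from `enrich_giant`, whose real behavior is not shown in this diff, and the sample row is invented for illustration.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(value):
    """Hypothetical stand-in for enrich_giant.to_decimal: blank or invalid -> None."""
    text = str(value if value is not None else "").strip()
    if not text:
        return None
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


def net_line_total(row):
    """Prefer an existing net total; otherwise line_total plus the matched discount."""
    existing = str(row.get("net_line_total", "")).strip()
    if existing:
        return to_decimal(existing)
    line_total = to_decimal(row.get("line_total"))
    if line_total is None:
        return None
    # Discounts are stored as negative amounts, so they are added, not subtracted.
    discount = to_decimal(row.get("matched_discount_amount")) or Decimal("0")
    return line_total + discount


def effective_price(row):
    """Net total divided by normalized quantity; None when either is unusable."""
    quantity = to_decimal(row.get("normalized_quantity"))
    net = net_line_total(row)
    if net is None or quantity in (None, Decimal("0")):
        return None
    return net / quantity


# Invented sample row: 2 lb bought for 12.99 with a 3.00 discount matched to it.
row = {
    "line_total": "12.99",
    "matched_discount_amount": "-3.00",
    "normalized_quantity": "2",
    "normalized_quantity_unit": "lb",
}
print(net_line_total(row))   # 9.99
print(effective_price(row))  # 4.995
```

The unit of the resulting price is whatever `normalized_quantity_unit` holds, which is why `derive_effective_price_unit` returns that field only when the quantity is usable.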
--- a/build_review_queue.py
+++ b/build_review_queue.py
@@ -1,175 +0,0 @@
 from collections import defaultdict
 from datetime import date
 import click
 from layer_helpers import compact_join, distinct_values, read_csv_rows, stable_id, write_csv_rows
 OUTPUT_FIELDS = [
    "review_id",
    "queue_type",
    "retailer",
    "observed_product_id",
    "canonical_product_id",
    "reason_code",
    "priority",
    "raw_item_names",
    "normalized_names",
    "upc",
    "image_url",
    "example_prices",
    "seen_count",
    "status",
    "resolution_notes",
    "created_at",
    "updated_at",
 ]
 def existing_review_state(path):
    try:
        rows = read_csv_rows(path)
    except FileNotFoundError:
        return {}
    return {row["review_id"]: row for row in rows}
 def review_reasons(observed_row):
    reasons = []
    if (
        observed_row["is_fee"] == "true"
        or observed_row.get("is_discount_line") == "true"
        or observed_row.get("is_coupon_line") == "true"
    ):
        return reasons
    if observed_row["distinct_upcs_count"] not in {"", "0", "1"}:
        reasons.append(("multiple_upcs", "high"))
    if observed_row["distinct_item_names_count"] not in {"", "0", "1"}:
        reasons.append(("multiple_raw_names", "medium"))
    if not observed_row["representative_image_url"]:
        reasons.append(("missing_image", "medium"))
    if not observed_row["representative_upc"]:
        reasons.append(("missing_upc", "high"))
    if not observed_row["representative_name_norm"]:
        reasons.append(("missing_normalized_name", "high"))
    return reasons
 def build_review_queue(observed_rows, item_rows, existing_rows, today_text):
    by_observed = defaultdict(list)
    for row in item_rows:
        observed_id = row.get("observed_product_id", "")
        if observed_id:
            by_observed[observed_id].append(row)
    queue_rows = []
    for observed_row in observed_rows:
        reasons = review_reasons(observed_row)
        if not reasons:
            continue
        related_items = by_observed.get(observed_row["observed_product_id"], [])
        raw_names = compact_join(distinct_values(related_items, "item_name"), limit=5)
        norm_names = compact_join(
            distinct_values(related_items, "item_name_norm"), limit=5
        )
        example_prices = compact_join(
            distinct_values(related_items, "line_total"), limit=5
        )
        for reason_code, priority in reasons:
            review_id = stable_id(
                "rvw",
                f"{observed_row['observed_product_id']}|{reason_code}",
            )
            prior = existing_rows.get(review_id, {})
            queue_rows.append(
                {
                    "review_id": review_id,
                    "queue_type": "observed_product",
                    "retailer": observed_row["retailer"],
                    "observed_product_id": observed_row["observed_product_id"],
                    "canonical_product_id": prior.get("canonical_product_id", ""),
                    "reason_code": reason_code,
                    "priority": priority,
                    "raw_item_names": raw_names,
                    "normalized_names": norm_names,
                    "upc": observed_row["representative_upc"],
                    "image_url": observed_row["representative_image_url"],
                    "example_prices": example_prices,
                    "seen_count": observed_row["times_seen"],
                    "status": prior.get("status", "pending"),
                    "resolution_notes": prior.get("resolution_notes", ""),
                    "created_at": prior.get("created_at", today_text),
                    "updated_at": today_text,
                }
            )
    queue_rows.sort(key=lambda row: (row["priority"], row["reason_code"], row["review_id"]))
    return queue_rows
 def attach_observed_ids(item_rows, observed_rows):
    observed_by_key = {row["observed_key"]: row["observed_product_id"] for row in observed_rows}
    attached = []
    for row in item_rows:
        observed_key = "|".join(
            [
                row["retailer"],
                f"upc={row['upc']}",
                f"name={row['item_name_norm']}",
            ]
        ) if row.get("upc") else "|".join(
            [
                row["retailer"],
                f"retailer_item_id={row.get('retailer_item_id', '')}",
                f"name={row['item_name_norm']}",
                f"size={row['size_value']}",
                f"unit={row['size_unit']}",
                f"pack={row['pack_qty']}",
                f"measure={row['measure_type']}",
                f"store_brand={row['is_store_brand']}",
                f"fee={row['is_fee']}",
                f"discount={row.get('is_discount_line', 'false')}",
                f"coupon={row.get('is_coupon_line', 'false')}",
            ]
        )
        enriched = dict(row)
        enriched["observed_product_id"] = observed_by_key.get(observed_key, "")
        attached.append(enriched)
    return attached
@click.command()
@click.option(
    "--observed-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product rows.",
 )
@click.option(
    "--items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
    help="Path to enriched Giant item rows.",
 )
@click.option(
    "--output-csv",
    default="giant_output/review_queue.csv",
    show_default=True,
    help="Path to review queue output.",
 )
 def main(observed_csv, items_enriched_csv, output_csv):
    observed_rows = read_csv_rows(observed_csv)
    item_rows = read_csv_rows(items_enriched_csv)
    item_rows = attach_observed_ids(item_rows, observed_rows)
    existing_rows = existing_review_state(output_csv)
    today_text = str(date.today())
    queue_rows = build_review_queue(observed_rows, item_rows, existing_rows, today_text)
    write_csv_rows(output_csv, queue_rows, OUTPUT_FIELDS)
    click.echo(f"wrote {len(queue_rows)} rows to {output_csv}")
 if __name__ == "__main__":
    main()
--- a/collect_costco_web.py
+++ b/collect_costco_web.py
@@ -0,0 +1,65 @@
 import click
 import scrape_costco
@click.command()
@click.option(
    "--outdir",
    default="data/costco-web",
    show_default=True,
    help="Directory for Costco raw and collected outputs.",
 )
@click.option(
    "--document-type",
    default="all",
    show_default=True,
    help="Summary document type.",
 )
@click.option(
    "--document-sub-type",
    default="all",
    show_default=True,
    help="Summary document sub type.",
 )
@click.option(
    "--window-days",
    default=92,
    show_default=True,
    type=int,
    help="Maximum number of days to request per summary window.",
 )
@click.option(
    "--months-back",
    default=36,
    show_default=True,
    type=int,
    help="How many months of receipts to enumerate back from today.",
 )
@click.option(
    "--firefox-profile-dir",
    default=None,
    help="Firefox profile directory to use for cookies and session storage.",
 )
 def main(
    outdir,
    document_type,
    document_sub_type,
    window_days,
    months_back,
    firefox_profile_dir,
 ):
    scrape_costco.run_collection(
        outdir=outdir,
        document_type=document_type,
        document_sub_type=document_sub_type,
        window_days=window_days,
        months_back=months_back,
        firefox_profile_dir=firefox_profile_dir,
        orders_filename="collected_orders.csv",
        items_filename="collected_items.csv",
    )
 if __name__ == "__main__":
    main()
--- a/collect_giant_web.py
+++ b/collect_giant_web.py
@@ -0,0 +1,34 @@
 import click
 import scrape_giant
@click.command()
@click.option("--user-id", default=None, help="Giant user id.")
@click.option("--loyalty", default=None, help="Giant loyalty number.")
@click.option(
    "--outdir",
    default="data/giant-web",
    show_default=True,
    help="Directory for raw json and collected csv outputs.",
 )
@click.option(
    "--sleep-seconds",
    default=1.5,
    show_default=True,
    type=float,
    help="Delay between order detail requests.",
 )
 def main(user_id, loyalty, outdir, sleep_seconds):
    scrape_giant.run_collection(
        user_id,
        loyalty,
        outdir,
        sleep_seconds,
        orders_filename="collected_orders.csv",
        items_filename="collected_items.csv",
    )
 if __name__ == "__main__":
    main()
--- a/enrich_costco.py
+++ b/enrich_costco.py
@@ -1,13 +1,17 @@
 import csv
 import json
 import re
 from collections import defaultdict
 from pathlib import Path
 import click
 from enrich_giant import (
    OUTPUT_FIELDS,
    derive_normalized_quantity,
    derive_price_fields,
    format_decimal,
    normalization_identity,
    normalize_number,
    normalize_unit,
    normalize_whitespace,
@@ -25,10 +29,18 @@ CODE_TOKEN_RE = re.compile(
    r"\b(?:SL\d+|T\d+H\d+|P\d+(?:/\d+)?|W\d+T\d+H\d+|FY\d+|CSPC#|C\d+T\d+H\d+|EC\d+T\d+H\d+|\d+X\d+)\b"
 )
 PACK_FRACTION_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT)\b")
-HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#\b")
+HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#(?=\s|$)")
 ITEM_CODE_RE = re.compile(r"#\w+\b")
 DUAL_WEIGHT_RE = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\s*/\s*\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\b"
 )
 LOGISTICS_SLASH_RE = re.compile(r"\b(?:T\d+/H\d+(?:/P\d+)?/?|H\d+/P\d+/?|T\d+/H\d+/?)\b")
 PACK_DASH_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*-\s*PACK\b")
 PACK_WORD_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*PACK\b")
-SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT|KG|G)\b")
+SIZE_RE = re.compile(
    r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT|KG|G|QT|QTS|PT|PTS|GAL|GALS|FL OZ|FLOZ)\b"
 )
 DISCOUNT_TARGET_RE = re.compile(r"^/\s*(\d+)\b")
 def clean_costco_name(name):
@@ -93,12 +105,17 @@ def normalize_costco_name(cleaned_name):
            base = PACK_FRACTION_RE.sub(" ", base)
        else:
            base = SIZE_RE.sub(" ", base)
    base = DUAL_WEIGHT_RE.sub(" ", base)
    base = HASH_SIZE_RE.sub(" ", base)
    base = ITEM_CODE_RE.sub(" ", base)
    base = LOGISTICS_SLASH_RE.sub(" ", base)
    base = PACK_DASH_RE.sub(" ", base)
    base = PACK_WORD_RE.sub(" ", base)
    base = normalize_whitespace(base)
    tokens = []
    for token in base.split():
        if token in {"/", "-"}:
            continue
        if token in {"ORG"}:
            continue
        if token in {"PEANUT", "BUTTER"} and "JIF" in base:
@@ -156,6 +173,13 @@ def is_discount_item(item):
    return amount < 0 or unit < 0 or description.startswith("/")
 def discount_target_id(raw_name):
    match = DISCOUNT_TARGET_RE.match(normalize_whitespace(raw_name))
    if not match:
        return ""
    return match.group(1)
 def parse_costco_item(order_id, order_date, raw_path, line_no, item):
    raw_name = combine_description(item)
    cleaned_name = clean_costco_name(raw_name)
@@ -168,12 +192,44 @@ def parse_costco_item(order_id, order_date, raw_path, line_no, item):
    price_per_each, price_per_lb, price_per_oz = derive_costco_prices(
        item, measure_type, size_value, size_unit, pack_qty
    )
    normalized_row_id = f"{RETAILER}:{order_id}:{line_no}"
    normalized_quantity, normalized_quantity_unit = derive_normalized_quantity(
        item.get("unit"),
        size_value,
        size_unit,
        pack_qty,
        measure_type,
        "",
    )
    identity_key, normalization_basis = normalization_identity(
        {
            "retailer": RETAILER,
            "normalized_row_id": normalized_row_id,
            "upc": "",
            "retailer_item_id": str(item.get("itemNumber", "")),
            "item_name_norm": item_name_norm,
            "size_value": size_value,
            "size_unit": size_unit,
            "pack_qty": pack_qty,
        }
    )
    price_fields = derive_price_fields(
        price_per_each,
        price_per_lb,
        price_per_oz,
        str(item.get("amount", "")),
        str(item.get("unit", "")),
        pack_qty,
    )
    return {
        "retailer": RETAILER,
        "order_id": str(order_id),
        "line_no": str(line_no),
-        "observed_item_key": f"{RETAILER}:{order_id}:{line_no}",
+        "normalized_row_id": normalized_row_id,
        "normalized_item_id": f"cnorm:{identity_key}",
        "normalization_basis": normalization_basis,
        "observed_item_key": normalized_row_id,
        "order_date": normalize_whitespace(order_date),
        "retailer_item_id": str(item.get("itemNumber", "")),
        "pod_id": "",
@@ -190,6 +246,8 @@ def parse_costco_item(order_id, order_date, raw_path, line_no, item):
        "reward_savings": "",
        "coupon_savings": str(item.get("amount", "")) if is_discount_line else "",
        "coupon_price": "",
        "matched_discount_amount": "",
        "net_line_total": str(item.get("amount", "")) if not is_discount_line else "",
        "image_url": "",
        "raw_order_path": raw_path.as_posix(),
        "item_name_norm": item_name_norm,
@@ -199,23 +257,71 @@ def parse_costco_item(order_id, order_date, raw_path, line_no, item):
        "size_unit": size_unit,
        "pack_qty": pack_qty,
        "measure_type": measure_type,
        "normalized_quantity": normalized_quantity,
        "normalized_quantity_unit": normalized_quantity_unit,
        "is_store_brand": "true" if brand_guess else "false",
        "is_item": "false" if is_discount_line else "true",
        "is_fee": "false",
        "is_discount_line": "true" if is_discount_line else "false",
        "is_coupon_line": is_coupon_line,
-        "price_per_each": price_per_each,
+        **price_fields,
        "price_per_lb": price_per_lb,
        "price_per_oz": price_per_oz,
        "parse_version": PARSER_VERSION,
        "parse_notes": "",
    }
 def match_costco_discounts(rows):
    rows_by_order = defaultdict(list)
    for row in rows:
        rows_by_order[row["order_id"]].append(row)
    for order_rows in rows_by_order.values():
        purchase_rows_by_item_id = defaultdict(list)
        for row in order_rows:
            if row.get("is_discount_line") == "true":
                continue
            retailer_item_id = row.get("retailer_item_id", "")
            if retailer_item_id:
                purchase_rows_by_item_id[retailer_item_id].append(row)
        for row in order_rows:
            if row.get("is_discount_line") != "true":
                continue
            target_id = discount_target_id(row.get("item_name", ""))
            if not target_id:
                continue
            matches = purchase_rows_by_item_id.get(target_id, [])
            if len(matches) != 1:
                row["parse_notes"] = normalize_whitespace(
                    f"{row.get('parse_notes', '')};discount_target_unmatched={target_id}"
                ).strip(";")
                continue
            purchase_row = matches[0]
            matched_discount = to_decimal(row.get("line_total"))
            gross_total = to_decimal(purchase_row.get("line_total"))
            existing_discount = to_decimal(purchase_row.get("matched_discount_amount")) or 0
            if matched_discount is None or gross_total is None:
                continue
            total_discount = existing_discount + matched_discount
            purchase_row["matched_discount_amount"] = format_decimal(total_discount)
            purchase_row["net_line_total"] = format_decimal(gross_total + total_discount)
            purchase_row["parse_notes"] = normalize_whitespace(
                f"{purchase_row.get('parse_notes', '')};matched_discount={target_id}"
            ).strip(";")
            row["parse_notes"] = normalize_whitespace(
                f"{row.get('parse_notes', '')};matched_to_item={target_id}"
            ).strip(";")
 def iter_costco_rows(raw_dir):
    for path in discover_json_files(raw_dir):
-        if path.name == "summary.json":
+        if path.name in {"summary.json", "summary_requests.json"}:
            continue
        payload = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(payload, dict):
            continue
        receipts = payload.get("data", {}).get("receiptsWithCounts", {}).get("receipts", [])
        for receipt in receipts:
            order_id = receipt["transactionBarcode"]
@@ -236,6 +342,7 @@ def discover_json_files(raw_dir):
 def build_items_enriched(raw_dir):
    rows = list(iter_costco_rows(raw_dir))
    match_costco_discounts(rows)
    rows.sort(key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])))
    return rows
@@ -262,6 +369,7 @@ def write_csv(path, rows):
    help="CSV path for enriched Costco item rows.",
 )
 def main(input_dir, output_csv):
    click.echo("legacy entrypoint: prefer normalize_costco_web.py for data-model outputs")
    rows = build_items_enriched(Path(input_dir))
    write_csv(Path(output_csv), rows)
    click.echo(f"wrote {len(rows)} rows to {output_csv}")
--- a/enrich_giant.py
+++ b/enrich_giant.py
@@ -16,6 +16,9 @@ OUTPUT_FIELDS = [
    "retailer",
    "order_id",
    "line_no",
    "normalized_row_id",
    "normalized_item_id",
    "normalization_basis",
    "observed_item_key",
    "order_date",
    "retailer_item_id",
@@ -33,6 +36,8 @@ OUTPUT_FIELDS = [
    "reward_savings",
    "coupon_savings",
    "coupon_price",
    "matched_discount_amount",
    "net_line_total",
    "image_url",
    "raw_order_path",
    "item_name_norm",
@@ -42,13 +47,21 @@ OUTPUT_FIELDS = [
    "size_unit",
    "pack_qty",
    "measure_type",
    "normalized_quantity",
    "normalized_quantity_unit",
    "is_store_brand",
    "is_item",
    "is_fee",
    "is_discount_line",
    "is_coupon_line",
    "price_per_each",
    "price_per_each_basis",
    "price_per_count",
    "price_per_count_basis",
    "price_per_lb",
    "price_per_lb_basis",
    "price_per_oz",
    "price_per_oz_basis",
    "parse_version",
    "parse_notes",
 ]
@@ -211,13 +224,17 @@ def normalize_unit(unit):
        "OZ": "oz",
        "FZ": "fl_oz",
        "FL OZ": "fl_oz",
        "FLOZ": "fl_oz",
        "LB": "lb",
        "LBS": "lb",
        "ML": "ml",
        "L": "l",
        "QT": "qt",
        "QTS": "qt",
        "PT": "pt",
        "PTS": "pt",
        "GAL": "gal",
        "GALS": "gal",
        "GA": "gal",
    }.get(collapsed, collapsed.lower())
@@ -327,6 +344,76 @@ def derive_prices(item, measure_type, size_value="", size_unit="", pack_qty=""):
    return price_per_each, price_per_lb, price_per_oz
 def derive_normalized_quantity(qty, size_value, size_unit, pack_qty, measure_type, picked_weight=""):
    parsed_qty = to_decimal(qty)
    parsed_size = to_decimal(size_value)
    parsed_pack = to_decimal(pack_qty)
    parsed_picked_weight = to_decimal(picked_weight)
    total_multiplier = None
    if parsed_qty not in (None, Decimal("0")):
        total_multiplier = parsed_qty * (parsed_pack or Decimal("1"))
    if (
        parsed_size not in (None, Decimal("0"))
        and size_unit
        and total_multiplier not in (None, Decimal("0"))
    ):
        return format_decimal(parsed_size * total_multiplier), size_unit
    if measure_type == "weight" and parsed_picked_weight not in (None, Decimal("0")):
        return format_decimal(parsed_picked_weight), "lb"
    if measure_type == "count" and total_multiplier not in (None, Decimal("0")):
        return format_decimal(total_multiplier), "count"
    if measure_type == "each" and parsed_qty not in (None, Decimal("0")):
        return format_decimal(parsed_qty), "each"
    return "", ""
 def derive_price_fields(price_per_each, price_per_lb, price_per_oz, line_total, qty, pack_qty):
    line_total_decimal = to_decimal(line_total)
    qty_decimal = to_decimal(qty)
    pack_decimal = to_decimal(pack_qty)
    price_per_count = ""
    price_per_count_basis = ""
    if line_total_decimal is not None and qty_decimal not in (None, Decimal("0")) and pack_decimal not in (
        None,
        Decimal("0"),
    ):
        price_per_count = format_decimal(line_total_decimal / (qty_decimal * pack_decimal))
        price_per_count_basis = "line_total_over_pack_qty"
    return {
        "price_per_each": price_per_each,
        "price_per_each_basis": "line_total_over_qty" if price_per_each else "",
        "price_per_count": price_per_count,
        "price_per_count_basis": price_per_count_basis,
        "price_per_lb": price_per_lb,
        "price_per_lb_basis": "parsed_or_picked_weight" if price_per_lb else "",
        "price_per_oz": price_per_oz,
        "price_per_oz_basis": "parsed_or_picked_weight" if price_per_oz else "",
    }
 def normalization_identity(row):
    if row.get("upc"):
        return f"{row['retailer']}|upc={row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"{row['retailer']}|retailer_item_id={row['retailer_item_id']}", "exact_retailer_item_id"
    if row.get("item_name_norm"):
        return (
            "|".join(
                [
                    row["retailer"],
                    f"name={row['item_name_norm']}",
                    f"size={row.get('size_value', '')}",
                    f"unit={row.get('size_unit', '')}",
                    f"pack={row.get('pack_qty', '')}",
                ]
            ),
            "exact_name_size_pack",
        )
    return row["normalized_row_id"], "row_identity"
 def parse_item(order_id, order_date, raw_path, line_no, item):
    cleaned_name = clean_item_name(item.get("itemName", ""))
    size_value, size_unit, pack_qty = parse_size_and_pack(cleaned_name)
@@ -350,11 +437,44 @@ def parse_item(order_id, order_date, raw_path, line_no, item):
    if size_value and not size_unit:
        parse_notes.append("size_without_unit")
    normalized_row_id = f"{RETAILER}:{order_id}:{line_no}"
    normalized_quantity, normalized_quantity_unit = derive_normalized_quantity(
        item.get("shipQy"),
        size_value,
        size_unit,
        pack_qty,
        measure_type,
        item.get("totalPickedWeight"),
    )
    identity_key, normalization_basis = normalization_identity(
        {
            "retailer": RETAILER,
            "normalized_row_id": normalized_row_id,
            "upc": stringify(item.get("primUpcCd")),
            "retailer_item_id": stringify(item.get("podId")),
            "item_name_norm": normalized_name,
            "size_value": size_value,
            "size_unit": size_unit,
            "pack_qty": pack_qty,
        }
    )
    price_fields = derive_price_fields(
        price_per_each,
        price_per_lb,
        price_per_oz,
        stringify(item.get("groceryAmount")),
        stringify(item.get("shipQy")),
        pack_qty,
    )
    return {
        "retailer": RETAILER,
        "order_id": str(order_id),
        "line_no": str(line_no),
-        "observed_item_key": f"{RETAILER}:{order_id}:{line_no}",
+        "normalized_row_id": normalized_row_id,
        "normalized_item_id": f"gnorm:{identity_key}",
        "normalization_basis": normalization_basis,
        "observed_item_key": normalized_row_id,
        "order_date": normalize_whitespace(order_date),
        "retailer_item_id": stringify(item.get("podId")),
        "pod_id": stringify(item.get("podId")),
@@ -371,6 +491,8 @@ def parse_item(order_id, order_date, raw_path, line_no, item):
        "reward_savings": stringify(item.get("rewardSavings")),
        "coupon_savings": stringify(item.get("couponSavings")),
        "coupon_price": stringify(item.get("couponPrice")),
        "matched_discount_amount": "",
        "net_line_total": stringify(item.get("totalPrice")),
        "image_url": extract_image_url(item),
        "raw_order_path": raw_path.as_posix(),
        "item_name_norm": normalized_name,
@@ -380,13 +502,14 @@ def parse_item(order_id, order_date, raw_path, line_no, item):
        "size_unit": size_unit,
        "pack_qty": pack_qty,
        "measure_type": measure_type,
        "normalized_quantity": normalized_quantity,
        "normalized_quantity_unit": normalized_quantity_unit,
        "is_store_brand": "true" if bool(prefix) else "false",
        "is_item": "false" if is_fee else "true",
        "is_fee": "true" if is_fee else "false",
        "is_discount_line": "false",
        "is_coupon_line": "false",
-        "price_per_each": price_per_each,
+        **price_fields,
        "price_per_lb": price_per_lb,
        "price_per_oz": price_per_oz,
        "parse_version": PARSER_VERSION,
        "parse_notes": ";".join(parse_notes),
    }
@@ -439,6 +562,7 @@ def write_csv(path, rows):
    help="CSV path for enriched Giant item rows.",
 )
 def main(input_dir, output_csv):
    click.echo("legacy entrypoint: prefer normalize_giant_web.py for data-model outputs")
    raw_dir = Path(input_dir)
    output_path = Path(output_csv)
--- a/normalize_costco_web.py
+++ b/normalize_costco_web.py
@@ -0,0 +1,28 @@
 from pathlib import Path
 import click
 import enrich_costco
@click.command()
@click.option(
    "--input-dir",
    default="data/costco-web/raw",
    show_default=True,
    help="Directory containing Costco raw order json files.",
 )
@click.option(
    "--output-csv",
    default="data/costco-web/normalized_items.csv",
    show_default=True,
    help="CSV path for normalized Costco item rows.",
 )
 def main(input_dir, output_csv):
    rows = enrich_costco.build_items_enriched(Path(input_dir))
    enrich_costco.write_csv(Path(output_csv), rows)
    click.echo(f"wrote {len(rows)} rows to {output_csv}")
 if __name__ == "__main__":
    main()
--- a/normalize_giant_web.py
+++ b/normalize_giant_web.py
@@ -0,0 +1,28 @@
 from pathlib import Path
 import click
 import enrich_giant
@click.command()
@click.option(
    "--input-dir",
    default="data/giant-web/raw",
    show_default=True,
    help="Directory containing Giant raw order json files.",
 )
@click.option(
    "--output-csv",
    default="data/giant-web/normalized_items.csv",
    show_default=True,
    help="CSV path for normalized Giant item rows.",
 )
 def main(input_dir, output_csv):
    rows = enrich_giant.build_items_enriched(Path(input_dir))
    enrich_giant.write_csv(Path(output_csv), rows)
    click.echo(f"wrote {len(rows)} rows to {output_csv}")
 if __name__ == "__main__":
    main()
--- a/pm/data-model.org
+++ b/pm/data-model.org
@@ -1,133 +1,138 @@
-* grocery data model and file layout
+* Grocery data model and file layout
 This document defines the shared file layout and stable CSV schemas for the
-grocery pipeline. The goal is to keep retailer-specific ingest separate from
+grocery pipeline.
-cross-retailer product modeling so Giant-specific quirks do not become the
+Goals:
-system of record.
+- Ensure data gathering is separate from analysis
-
+- Enable multiple data gathering methods
-** design rules
+- One layer for review and analysis  
 ** Design Rules
 - Raw retailer exports remain the source of truth.
 - Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific enrichment.
+- Cross-retailer product layers begin only after retailer-specific normalization.
 - CSV schemas are stable and additive: new columns may be appended, but
   existing columns should not be repurposed.
 - Unknown values should be left blank rather than guessed.
-** directory layout
+*** Retailer-specific data:
 Use one top-level data root:
 #+begin_example
 data/
  giant/
    raw/
      history.json
      orders/
        <order_id>.json
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  costco/
    raw/
      ...
    orders.csv
    items_raw.csv
    items_enriched.csv
    products_observed.csv
  shared/
    products_canonical.csv
    product_links.csv
    review_queue.csv
 #+end_example
 ** layer responsibilities
 - `data/<retailer>/raw/`
  Stores unmodified retailer payloads exactly as fetched.
 - `data/<retailer>/orders.csv`
  One row per retailer order or visit, flattened from raw order data.
 - `data/<retailer>/items_raw.csv`
  One row per retailer line item, preserving retailer-native values needed for
  reruns and debugging.
 - `data/<retailer>/items_enriched.csv`
  Parsed retailer line items with normalized fields and derived guesses, still
  retailer-specific.
 - `data/<retailer>/products_observed.csv`
  Distinct retailer-facing observed products aggregated from enriched items.
 - `data/shared/products_canonical.csv`
  Cross-retailer canonical product entities used for comparison.
 - `data/shared/product_links.csv`
  Links from retailer observed products to canonical products.
 - `data/shared/review_queue.csv`
  Human review queue for unresolved or low-confidence matching/parsing cases.
 ** retailer-specific versus shared
 Retailer-specific:
 - raw json payloads
 - retailer order ids
 - retailer line numbers
 - retailer category ids and names
 - retailer item names
 - retailer image urls
 - parsed guesses derived from one retailer feed
 - observed products scoped to one retailer
 Shared:
 - canonical products
 - observed-to-canonical links
 - human review state for unresolved cases
 - comparison-ready normalized quantity basis fields
-Observed products are the boundary between retailer-specific parsing and
+*** Review/Combined data:
-cross-retailer canonicalization. Nothing upstream of `products_observed.csv`
+- catalog of reviewed products
-should require knowledge of another retailer.
+- links from normalized retailer items to catalog
 - human review state for unresolved cases
 ** schema: `data/<retailer>/orders.csv`
-One row per order or visit.
+* Pipeline
 Each step can be run alone if its dependencies exist.
 Each retail provider script must produce deterministic line-item outputs, and
 normalization may assign within-retailer product identity only when the
 retailer itself provides strong evidence.
-| column | meaning |
+Key: 
-|-
+- (1) input
-| `retailer` | retailer slug such as `giant` |
+- [1] output
 | `order_id` | retailer order or visit id |
 | `order_date` | order date in `YYYY-MM-DD` when available |
 | `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
 | `service_type` | retailer service type such as `INSTORE` |
 | `order_total` | order total as provided by retailer |
 | `payment_method` | retailer payment label |
 | `total_item_count` | total line count or item count from retailer |
 | `total_savings` | total savings as provided by retailer |
 | `your_savings_total` | savings field from retailer when present |
 | `coupons_discounts_total` | coupon/discount total from retailer |
 | `store_name` | retailer store name |
 | `store_number` | retailer store number |
 | `store_address1` | street address |
 | `store_city` | city |
 | `store_state` | state or province |
 | `store_zipcode` | postal code |
 | `refund_order` | retailer refund flag |
 | `ebt_order` | retailer EBT flag |
 | `raw_history_path` | relative path to source history payload |
 | `raw_order_path` | relative path to source order payload |
-Primary key:
+** 1. Collect
 Get raw receipt/visit and item data from a retailer.
 Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan).
 Preserve complete raw data and preserve fidelity.
 Avoid interpretation beyond basic data flattening.
 - (1) Source access (varies, e.g., header data, auth for API access)
 - [1] collected visits from each retailer
 - [2] collected items from each retailer
 - [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
- (`retailer`, `order_id`)
+** 2. Normalize
 Parse and extract structured facts from retailer-specific raw data
  to create a standardized item format for that retailer.
 Strictly dependent on Collect method and output.
 - Extract quantity, size, pack, pricing, variant
 - Attach discount line items to their product line items using upc/retailer_item_id and concurrence
 - Cleanup naming to facilitate later matching
 - Assign retailer-level `normalized_item_id` only when evidence is deterministic
 - Never use fuzzy or semantic matching here
 - (1) collected items from each retailer
 - (2) collected visits from each retailer
 - [1] normalized items from each retailer
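The discount-attachment rule in the Normalize step can be sketched as follows. This is a minimal illustration, not the repo's `match_costco_discounts`: the `discount_target_id` field and the exactly-one-match rule are assumptions made for the example, while `matched_discount_amount` and `net_line_total` follow the schema in this document.

```python
from decimal import Decimal


def attach_discounts(rows):
    """Fold discount-line amounts into the purchased row they refer to.

    Assumed row shape (hypothetical): dicts of strings with order_id,
    retailer_item_id, line_total, is_discount_line, and, on discount
    rows only, discount_target_id naming the discounted item.
    """
    # Index purchasable rows by (order_id, retailer_item_id).
    purchases = {}
    for row in rows:
        if row["is_discount_line"] != "true":
            key = (row["order_id"], row["retailer_item_id"])
            purchases.setdefault(key, []).append(row)

    for row in rows:
        if row["is_discount_line"] != "true":
            continue
        key = (row["order_id"], row.get("discount_target_id", ""))
        matches = purchases.get(key, [])
        if len(matches) != 1:
            # Assumption: ambiguous or missing targets stay unmatched.
            continue
        purchase = matches[0]
        discount = Decimal(row["line_total"])  # discount amounts are negative
        existing = Decimal(purchase.get("matched_discount_amount") or "0")
        total = existing + discount
        purchase["matched_discount_amount"] = str(total)
        purchase["net_line_total"] = str(Decimal(purchase["line_total"]) + total)
    return rows


rows = attach_discounts(
    [
        {"order_id": "A1", "retailer_item_id": "100",
         "line_total": "12.99", "is_discount_line": "false"},
        {"order_id": "A1", "retailer_item_id": "900",
         "line_total": "-3.00", "is_discount_line": "true",
         "discount_target_id": "100"},
    ]
)
print(rows[0]["matched_discount_amount"], rows[0]["net_line_total"])
```

The discount row is left in place as its own row, so the original receipt lines stay auditable while the purchased row carries the net amount.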
-** schema: `data/<retailer>/items_raw.csv`
+** 3. Review/Combine (Canonicalization)
 Decide whether two normalized retailer items are "the same product";
 match items across retailers using deterministic rules and human review.
 Create catalog linked to normalized retailer items.
 - Review operates on distinct `normalized_item_id` values, not individual purchase rows
 - Cross-retailer identity decisions happen only here
 - Ask a human to create a canonical/catalog item with:
   - friendly/catalog_name: "bell pepper"; "milk"
   - category: "produce"; "dairy"
   - product_type: "pepper"; "milk"
   - variant (open question): "whole"; "skim"; "2pct"
 - Then link the group of items to that catalog item.
 - (1) normalized items from each retailer
 - [1] review queue of items to be reviewed
 - [2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
 - [3] purchase list of normalized items, pivot-ready
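As a rough sketch of output [1] above, a review queue can be built from the distinct `normalized_item_id` values that have no approved link yet. The function name and the `" | "` separator are hypothetical; the column names are taken from the schemas in this document.

```python
def build_review_queue(normalized_rows, product_links):
    """Return one queue entry per unlinked normalized_item_id.

    Hypothetical helper: review works on distinct items, not on
    individual purchase rows, so rows are grouped by item id first.
    """
    linked = {link["normalized_item_id"] for link in product_links}
    queue = {}
    for row in normalized_rows:
        if row.get("is_item") != "true":
            continue  # fees, discount lines, etc. are not catalog candidates
        item_id = row["normalized_item_id"]
        if item_id in linked:
            continue
        entry = queue.setdefault(
            item_id,
            {
                "queue_type": "link_candidate",
                "retailer": row["retailer"],
                "normalized_item_id": item_id,
                "raw_item_names": [],
            },
        )
        # Keep each distinct raw name once, in first-seen order.
        if row["item_name"] not in entry["raw_item_names"]:
            entry["raw_item_names"].append(row["item_name"])
    for entry in queue.values():
        entry["raw_item_names"] = " | ".join(entry["raw_item_names"])
    return list(queue.values())


sample_rows = [
    {"retailer": "giant", "normalized_item_id": "gnorm:giant|upc=4048",
     "item_name": "LIME", "is_item": "true"},
    {"retailer": "giant", "normalized_item_id": "gnorm:giant|upc=4048",
     "item_name": "LIME . / .", "is_item": "true"},
    {"retailer": "giant", "normalized_item_id": "gnorm:giant|upc=1111",
     "item_name": "MILK", "is_item": "true"},
]
sample_links = [
    {"normalized_item_id": "gnorm:giant|upc=1111", "catalog_id": "cat_milk"},
]
review_queue = build_review_queue(sample_rows, sample_links)
print(review_queue)
```

Grouping by item id means a reviewer sees both raw spellings of the same item in one entry and approves a single catalog name for them.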
 ** Unresolved Issues
 1. need central script to orchestrate; metadata belongs there and nowhere else
 2. `LIME` and `LIME . / .` appearing in the catalog: names must come from review-approved names, not raw strings
 * Directory Layout
 Use one top-level data root:
 #+begin_example
 main.py
 collect_<retailer>_<method>.py
 normalize_<retailer>_<method>.py
 review.py
 data/
  <retailer-method>/
    raw/  # unmodified retailer payloads exactly as fetched
       <order_id>.json
    collected_items.csv # one row per retailer line item w/ retailer-native values
    collected_orders.csv # one row per receipt/visit, flattened from raw order data
    normalized_items.csv # parsed retailer-specific line items with normalized fields
  costco-web/ # sample
    raw/
      orders/
        history.json
        <order_id>.json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv # Human review queue for unresolved matching/parsing cases.
    product_links.csv # Links from normalized retailer items to catalog items.
    catalog.csv # Cross-retailer product catalog entities used for comparison.
  analysis/
    purchases.csv
    comparison_examples.csv
    item_price_over_time.csv
    spend_by_visit.csv
    items_per_visit.csv
    category_spend_over_time.csv
    retailer_store_breakdown.csv
 #+end_example
 Notes:
 - The current repo still uses transitional root-level scripts and output folders.
 - This layout is the target structure for the refactor, not a claim that migration is already complete.
 * Schemas
 ** `data/<retailer-method>/collected_items.csv`
 One row per retailer line item.
-
+| key                | definition                                 |
-| column           | meaning                                 |
+|--------------------+--------------------------------------------|
-|------------------+-----------------------------------------|
+| `retailer` PK      | retailer slug                              |
-| `retailer`       | retailer slug                           |
+| `order_id` PK      | retailer order id                          |
-| `order_id`       | retailer order id                       |
+| `line_no`  PK      | stable line number within order export     |
 | `line_no`        | stable line number within order export  |
 | `order_date`       | copied from order when available           |
 | `retailer_item_id` | retailer-native item id when available     |
 | `pod_id`           | retailer pod/item id                       |
@@ -149,135 +154,110 @@ One row per retailer line item.
 | `is_discount_line` | retailer adjustment or discount-line flag  |
 | `is_coupon_line`   | coupon-like line flag when distinguishable |
-Primary key:
+** `data/<retailer-method>/collected_orders.csv`
 One row per order/visit/receipt.
 | key                       | definition                                      |
 |---------------------------+-------------------------------------------------|
 | `retailer` PK             | retailer slug such as `giant`                   |
 | `order_id` PK             | retailer order or visit id                      |
 | `order_date`              | order date in `YYYY-MM-DD` when available       |
 | `delivery_date`           | fulfillment date in `YYYY-MM-DD` when available |
 | `service_type`            | retailer service type such as `INSTORE`         |
 | `order_total`             | order total as provided by retailer             |
 | `payment_method`          | retailer payment label                          |
 | `total_item_count`        | total line count or item count from retailer    |
 | `total_savings`           | total savings as provided by retailer           |
 | `your_savings_total`      | savings field from retailer when present        |
 | `coupons_discounts_total` | coupon/discount total from retailer             |
 | `store_name`              | retailer store name                             |
 | `store_number`            | retailer store number                           |
 | `store_address1`          | street address                                  |
 | `store_city`              | city                                            |
 | `store_state`             | state or province                               |
 | `store_zipcode`           | postal code                                     |
 | `refund_order`            | retailer refund flag                            |
 | `ebt_order`               | retailer EBT flag                               |
 | `raw_history_path`        | relative path to source history payload         |
 | `raw_order_path`          | relative path to source order payload           |
- (`retailer`, `order_id`, `line_no`)
+** `data/<retailer-method>/normalized_items.csv`
 One row per retailer line item after deterministic parsing. Preserve raw
 fields from `collected_items.csv` and add parsed fields that make later review
 and grouping easier. Normalization may assign retailer-level identity when the
 evidence is deterministic and retailer-scoped.
-** schema: `data/<retailer>/items_enriched.csv`
+| key                        | definition                                                       |
-
+|----------------------------+------------------------------------------------------------------|
-One row per retailer line item after deterministic parsing. Preserve the raw
+| `retailer` PK              | retailer slug                                                    |
-fields from `items_raw.csv` and add parsed fields.
+| `order_id` PK              | retailer order id                                                |
-
+| `line_no` PK               | line number within order                                         |
-| column              | meaning                                                     |
+| `normalized_row_id`        | stable row key, typically `<retailer>:<order_id>:<line_no>`      |
-|---------------------+-------------------------------------------------------------|
+| `normalized_item_id`       | stable retailer-level item identity when deterministic grouping is supported |
-| `retailer`          | retailer slug                                               |
+| `normalization_basis`      | basis used to assign `normalized_item_id`                        |
 | `order_id`          | retailer order id                                           |
 | `line_no`           | line number within order                                    |
 | `observed_item_key` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
 | `retailer_item_id`         | retailer-native item id                                          |
 | `item_name`                | raw retailer item name                                           |
-| `item_name_norm`    | normalized item name                                        |
+| `item_name_norm`           | normalized retailer item name                                    |
 | `brand_guess`              | parsed brand guess                                               |
 | `variant`                  | parsed variant text                                              |
 | `size_value`               | parsed numeric size value                                        |
 | `size_unit`                | parsed size unit such as `oz`, `lb`, `fl_oz`                     |
 | `pack_qty`                 | parsed pack or count guess                                       |
 | `measure_type`             | `each`, `weight`, `volume`, `count`, or blank                    |
 | `normalized_quantity`      | numeric comparison basis derived during normalization            |
 | `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank                 |
 | `is_item`                  | item flag                                                        |
 | `is_store_brand`           | store-brand guess                                                |
 | `is_fee`                   | fee or non-product flag                                          |
 | `is_discount_line`         | discount or adjustment-line flag                                 |
 | `is_coupon_line`           | coupon-like line flag                                            |
 | `matched_discount_amount`  | matched discount value carried onto purchased row when supported |
 | `net_line_total`           | line total after matched discount when supported                 |
 | `price_per_each`           | derived per-each price when supported                            |
 | `price_per_each_basis`     | source basis for `price_per_each`                                |
 | `price_per_count`          | derived per-count price when supported                           |
 | `price_per_count_basis`    | source basis for `price_per_count`                               |
 | `price_per_lb`             | derived per-pound price when supported                           |
 | `price_per_lb_basis`       | source basis for `price_per_lb`                                  |
 | `price_per_oz`             | derived per-ounce price when supported                           |
 | `price_per_oz_basis`       | source basis for `price_per_oz`                                  |
 | `image_url`                | best available retailer image url                                |
 | `raw_order_path`           | relative path to source order payload                            |
 | `parse_version`            | parser version string for reruns                                 |
 | `parse_notes`              | optional non-fatal parser notes                                  |
-Primary key:
+Notes:
 - `normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
 - Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
 - Do not use fuzzy or semantic matching to assign `normalized_item_id`.
 - Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
 - Cross-retailer identity is handled later in review/combine via `data/review/catalog.csv` and `product_links.csv`.
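To illustrate how the two ids are meant to be used downstream, here is a minimal sketch that resolves each normalized row to a catalog entry through `product_links.csv`. It is not the repo's `build_purchases.py`; the function name is hypothetical, and it assumes at most one approved link per `normalized_item_id`.

```python
def attach_catalog(normalized_rows, product_links, catalog):
    """Add catalog_id and catalog_name to each normalized purchase row.

    Assumption for this sketch: each normalized_item_id has at most one
    link. Unlinked rows keep blank catalog fields rather than a guess.
    """
    link_by_item = {
        link["normalized_item_id"]: link["catalog_id"] for link in product_links
    }
    catalog_by_id = {entry["catalog_id"]: entry for entry in catalog}
    purchases = []
    for row in normalized_rows:
        catalog_id = link_by_item.get(row["normalized_item_id"], "")
        entry = catalog_by_id.get(catalog_id, {})
        purchases.append(
            {
                **row,
                "catalog_id": catalog_id,
                "catalog_name": entry.get("catalog_name", ""),
            }
        )
    return purchases


purchases = attach_catalog(
    normalized_rows=[
        {"normalized_row_id": "giant:1001:1",
         "normalized_item_id": "gnorm:giant|upc=4048"},
        {"normalized_row_id": "giant:1002:3",
         "normalized_item_id": "gnorm:giant|upc=4048"},
        {"normalized_row_id": "giant:1002:4",
         "normalized_item_id": "gnorm:giant|upc=9999"},
    ],
    product_links=[
        {"normalized_item_id": "gnorm:giant|upc=4048", "catalog_id": "cat_lime"},
    ],
    catalog=[{"catalog_id": "cat_lime", "catalog_name": "lime"}],
)
print([(p["normalized_row_id"], p["catalog_name"]) for p in purchases])
```

Two purchase rows share one `normalized_item_id` and therefore one catalog name, while the unlinked row stays blank, in line with the rule that unknown values are left blank rather than guessed.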
- (`retailer`, `order_id`, `line_no`)
+** `data/review/product_links.csv`
 One row per review-approved link from a normalized retailer item to a catalog item.
 Many normalized retailer items may link to the same catalog item.
-** schema: `data/<retailer>/products_observed.csv`
+| key                     | definition                                  |
-
+|-------------------------+---------------------------------------------|
-One row per distinct retailer-facing observed product.
+| `normalized_item_id` PK | normalized retailer item id                 |
-
+| `catalog_id` PK         | linked catalog product id                   |
-| column                        | meaning                                                        |
+| `link_method`           | `manual`, `exact_upc`, `exact_name_size`, etc. |
 |-------------------------------+----------------------------------------------------------------|
 | `observed_product_id`         | stable observed product id                                     |
 | `retailer`                    | retailer slug                                                  |
 | `observed_key`                | deterministic grouping key used to create the observed product |
 | `representative_retailer_item_id` | best representative retailer-native item id               |
 | `representative_upc`          | best representative UPC/PLU                                    |
 | `representative_item_name`    | representative raw retailer name                               |
 | `representative_name_norm`    | representative normalized name                                 |
 | `representative_brand`        | representative brand guess                                     |
 | `representative_variant`      | representative variant                                         |
 | `representative_size_value`   | representative size value                                      |
 | `representative_size_unit`    | representative size unit                                       |
 | `representative_pack_qty`     | representative pack/count                                      |
 | `representative_measure_type` | representative measure type                                    |
 | `representative_image_url`    | representative image url                                       |
 | `is_store_brand`              | representative store-brand flag                                |
 | `is_fee`                      | representative fee flag                                        |
 | `is_discount_line`            | representative discount-line flag                              |
 | `is_coupon_line`              | representative coupon-line flag                                |
 | `first_seen_date`             | first order date seen                                          |
 | `last_seen_date`              | last order date seen                                           |
 | `times_seen`                  | number of enriched item rows grouped here                      |
 | `example_order_id`            | one example retailer order id                                  |
 | `example_item_name`           | one example raw item name                                      |
 | `distinct_retailer_item_ids_count` | count of distinct retailer-native item ids               |
 Primary key:
 - (`observed_product_id`)
 ** schema: `data/shared/products_canonical.csv`
 One row per cross-retailer canonical product.
 | column                     | meaning                                          |
 |----------------------------+--------------------------------------------------|
 | `canonical_product_id`     | stable canonical product id                      |
 | `canonical_name`           | canonical human-readable name                    |
 | `product_type`             | broad class such as `apple`, `milk`, `trash_bag` |
 | `brand`                    | canonical brand when applicable                  |
 | `variant`                  | canonical variant                                |
 | `size_value`               | normalized size value                            |
 | `size_unit`                | normalized size unit                             |
 | `pack_qty`                 | normalized pack/count                            |
 | `measure_type`             | normalized measure type                          |
 | `normalized_quantity`      | numeric comparison basis value                   |
 | `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`           |
 | `notes`                    | optional human notes                             |
 | `created_at`               | creation timestamp or date                       |
 | `updated_at`               | last update timestamp or date                    |
 Primary key:
 - (`canonical_product_id`)
 ** schema: `data/shared/product_links.csv`
 One row per observed-to-canonical relationship.
 | column | meaning |
 |-
 | `observed_product_id` | retailer observed product id |
 | `canonical_product_id` | linked canonical product id |
 | `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
 | `link_confidence`       | optional confidence label                   |
 | `review_status`         | `pending`, `approved`, `rejected`, or blank |
 | `reviewed_by`           | reviewer id or initials                     |
 | `reviewed_at`           | review timestamp or date                    |
 | `link_notes`            | optional notes                              |
-Primary key:
+** `data/review/review_queue.csv`
 - (`observed_product_id`, `canonical_product_id`)
 ** schema: `data/shared/review_queue.csv`
 One row per issue needing human review.
-| column | meaning |
+| key                  | definition                                          |
-|-
+|----------------------+-----------------------------------------------------|
-| `review_id` | stable review row id |
+| `review_id` PK       | stable review row id                                |
-| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` |
+| `queue_type`         | `link_candidate`, `parse_issue`, `catalog_cleanup`  |
 | `retailer`           | retailer slug when applicable                       |
-| `observed_product_id` | observed product id when applicable |
+| `normalized_item_id` | normalized retailer item id when review is item-level |
-| `canonical_product_id` | candidate canonical id when applicable |
+| `normalized_row_id`  | normalized row id when review is row-specific       |
 | `catalog_id`         | candidate canonical id                              |
 | `reason_code`        | machine-readable review reason                      |
 | `priority`           | optional priority label                             |
 | `raw_item_names`     | compact list of example raw names                   |
@@ -290,20 +270,90 @@ One row per issue needing human review.
 | `resolution_notes`   | reviewer notes                                      |
 | `created_at`         | creation timestamp or date                          |
 | `updated_at`         | last update timestamp or date                       |
 ** `data/review/catalog.csv`
 One row per cross-retailer catalog product.
 | key                        | definition                             |
 |----------------------------+----------------------------------------|
 | `catalog_id` PK            | stable catalog product id              |
 | `catalog_name`             | human-reviewed product name            |
 | `product_type`             | generic product eg `apple`, `milk`     |
 | `category`                 | broad section eg `produce`, `dairy`    |
 | `brand`                    | canonical brand when applicable        |
 | `variant`                  | canonical variant                      |
 | `size_value`               | normalized size value                  |
 | `size_unit`                | normalized size unit                   |
 | `pack_qty`                 | normalized pack/count                  |
 | `measure_type`             | normalized measure type                |
 | `normalized_quantity`      | numeric comparison basis value         |
 | `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
 | `notes`                    | optional human notes                   |
 | `created_at`               | creation timestamp or date             |
 | `updated_at`               | last update timestamp or date          |
-Primary key:
+Notes:
 - Do not auto-create new catalog rows from weak normalized names alone.
 - Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
 - `catalog_name` should come from review-approved naming, not raw retailer strings.
- (`review_id`)
+** `data/analysis/purchases.csv`
 One row per purchased item (i.e., `is_item`==true from normalized layer), with
 catalog attributes denormalized in and discounts already applied.
-** current giant mapping
+| key                        | definition                                                     |
 |----------------------------+----------------------------------------------------------------|
 | `purchase_date`            | date of purchase (from order)                                  |
 | `retailer`                 | retailer slug                                                  |
 | `order_id`                 | retailer order id                                              |
 | `line_no`                  | line number within order                                       |
 | `normalized_row_id`        | `<retailer>:<order_id>:<line_no>`                              |
 | `normalized_item_id`       | retailer-level normalized item identity                        |
 | `catalog_id`               | linked catalog product id                                      |
 | `catalog_name`             | catalog product name for analysis                              |
 | `catalog_product_type`     | broader product family (e.g., `egg`, `milk`)                   |
 | `catalog_category`         | category such as `produce`, `dairy`                            |
 | `catalog_brand`            | canonical brand when applicable                                |
 | `catalog_variant`          | canonical variant when applicable                              |
 | `raw_item_name`            | original retailer item name                                    |
 | `normalized_item_name`     | cleaned/normalized retailer item name                          |
 | `retailer_item_id`         | retailer-native item id                                        |
 | `upc`                      | UPC/PLU when available                                         |
 | `qty`                      | retailer quantity field                                        |
 | `unit`                     | retailer unit (e.g., `EA`, `LB`)                               |
 | `pack_qty`                 | parsed pack/count                                              |
 | `size_value`               | parsed size value                                              |
 | `size_unit`                | parsed size unit                                               |
 | `measure_type`             | `each`, `weight`, `volume`, `count`                            |
 | `normalized_quantity`      | normalized comparison quantity                                 |
 | `normalized_quantity_unit` | unit for normalized quantity                                   |
 | `unit_price`               | retailer unit price                                            |
 | `line_total`               | original retailer extended price (pre-discount)                |
 | `matched_discount_amount`  | discount amount matched from discount lines                    |
 | `net_line_total`           | effective price after discount (`line_total` + discounts)      |
 | `store_name`               | retailer store name                                            |
 | `store_city`               | store city                                                     |
 | `store_state`              | store state                                                    |
 | `price_per_each`           | derived per-each price                                         |
 | `price_per_each_basis`     | source basis for per-each calc                                 |
 | `price_per_count`          | derived per-count price                                        |
 | `price_per_count_basis`    | source basis for per-count calc                                |
 | `price_per_lb`             | derived per-pound price                                        |
 | `price_per_lb_basis`       | source basis for per-pound calc                                |
 | `price_per_oz`             | derived per-ounce price                                        |
 | `price_per_oz_basis`       | source basis for per-ounce calc                                |
 | `is_fee`                   | true if row represents non-product fee                         |
 | `raw_order_path`           | relative path to original order payload                        |
-Current scraper outputs map to the new layout as follows:
+Notes:
 - Only rows that represent purchased items should appear here.
 - `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
 - catalog fields are denormalized in to make pivoting trivial.
 - no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
 - review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
- `giant_output/raw/history.json` -> `data/giant/raw/history.json`
+* /
- `giant_output/raw/<order_id>.json` -> `data/giant/raw/orders/<order_id>.json`
+Normalized quantity is deterministic and conservative:
- `giant_output/orders.csv` -> `data/giant/orders.csv`
+- if `qty * pack_qty * size_value` is available, use that total with `size_unit`
- `giant_output/items.csv` -> `data/giant/items_raw.csv`
+- else if count basis is explicit, use `qty * pack_qty` with unit `count`
-
+- else if `measure_type` is `each`, use `qty each`
-Current Giant raw order payloads already expose fields needed for future
+- else leave both fields blank
-enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`,
+- no hidden unit conversion is applied inside normalization; values stay in their parsed units such as `oz`, `lb`, `qt`, or `count`
 `unitPrice`, `groceryAmount`, and `totalPickedWeight`.
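The normalized-quantity rules above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `normalize_quantity` is hypothetical, a missing `pack_qty` is assumed to mean a single pack, and "count basis is explicit" is assumed to mean `measure_type == "count"` with a parsed `pack_qty`.

```python
def normalize_quantity(qty, pack_qty, size_value, size_unit, measure_type):
    """Return (normalized_quantity, normalized_quantity_unit), or ("", "") when unknown.

    Hypothetical helper; argument names mirror the schema columns.
    """
    if qty is None:
        return "", ""
    # Assumption: a missing pack_qty means one pack.
    packs = pack_qty if pack_qty is not None else 1
    # Rule 1: a parsed size gives a total in the parsed unit (no unit conversion).
    if size_value is not None and size_unit:
        return qty * packs * size_value, size_unit
    # Rule 2: explicit count basis (assumed: measure_type "count" plus a pack_qty).
    if measure_type == "count" and pack_qty is not None:
        return qty * pack_qty, "count"
    # Rule 3: plain "each" items fall back to the retailer quantity.
    if measure_type == "each":
        return qty, "each"
    # Rule 4: otherwise leave both fields blank.
    return "", ""


flour = normalize_quantity(1, None, 5, "lb", "weight")
eggs = normalize_quantity(2, 18, None, "", "count")
limes = normalize_quantity(3, None, None, "", "each")
unknown = normalize_quantity(1, None, None, "", "weight")
```

Values stay in their parsed units, so a 5 lb bag and a 16 oz bag are not silently converted to a shared basis here.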
--- a/pm/scrape-giant.org
+++ b/pm/scrape-giant.org
@@ -27,6 +27,9 @@ carry forward image url
 3. build observed-product table from enriched items
 * git issues
 - dont try to git push from win emacs viewing wsl, it will be screwy (windows identity vs wsl)
 ** ssh / access to gitea
 ssh://git@192.168.1.207:2020/ben/scrape-giant.git
 https://git.hgsky.me/ben/scrape-giant.git
@@ -44,6 +47,37 @@ git remote set-url gitea git@gitea:ben/scrape-giant.git
 on local network: use ssh to 192.168.1.207:2020
 from elsewhere/public: use https to git.hgsky.me/... unless you later expose ssh properly
 ** stash
 z z to stash local work only
 take care not to add ignored files which will add the venv and `__pycache__`
 z p to pop the stash back
 ** creating remote branches
 P p, magit will suggest upstream (gitea), select and Enter and it will be created
 ** cherry-picking
 b b : switch to desired branch (review)
 l B : open reflog for local branches
      (my changes were committed to local cx but not pushed to gitea/cx)
 put point on the commit you want; did this in sequence
 A A : cherry pick commit to current branch
      minibuffer will show the commit and all branches, leave it on that commit
      the final commit was not shown by hash, just the branch cx
       since (local) cx was caught up with that branch
 ** reverting a branch
 b l : switch to local branch (cx)
 l l : open local reflog
 put point on the commit; highlighted remote gitea/cx
 X   : reset branch; prompts you, selected cx
 ** merge branch
 b b : switch to branch to be merged into (cx)
 m m : pick branch to merge into current branch
 * giant requests
 ** item:
 get:
@@ -212,3 +246,409 @@ request:
 - pull all orders by default
 - add online orders
 - copy header data from browser using selenium
 * how to run
 python scrape_giant.py
 python enrich_giant.py
 python scrape_costco.py
 python enrich_costco.py
 python build_observed_products.py
 python build_review_queue.py
 python build_canonical_layer.py
 python validate_cross_retailer_flow.py
 * t1.13 tasks [2026-03-17 Tue 13:49]
 ok i ran a few. time to run some cleanups here - i'm wondering if we shouldn't be less aggressive with canonical names and encourage a better manual process to start. 
 ** TODO fill in auto-created canonical category, product-type
 auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty
 ** TODO consolidation cleanup
 1. canonical_names feel too specific, e.g., "5DZ egg" - probably a problem with the enrich_* steps not adding appropriate normalizing data /and/ removing from observed product title?
 2.  some canonical_names need consolidation, eg "LIME" and "LIME  . / ." ; poss cleanup issue. there are 5 entries for egg but they are all regular large grade A white eggs, just different amounts in dozens.
  Eggs are actually a great candidate for the kind of analysis we want to do - the pipeline should have caught and properly sorted these into size/qty:
  #+begin_example
  ```canonical_product_id	canonical_name	category	product_type	brand	variant	size_value	size_unit	pack_qty	measure_type	notes	created_at	updated_at
  gcan_0e350505fd22	5DZ EGG / /			KS					each	auto-linked via exact_name		
  gcan_47279a80f5f3	EGG 5 DOZ. BBS								each	auto-linked via exact_name		
  gcan_7d099130c1bf	LRG WHITE EGG			SB				30	count	auto-linked via exact_upc		
  gcan_849c2817e667	GDA LRG WHITE EGG			SB				18	count	auto-linked via exact_upc		
  gcan_cb0c6c8cf480	LG EGG CONVENTIONAL					18	count		count	auto-linked via exact_name_size		  ```
  #+end_example
 ** TODO costco discount matching
 Build costco mechanism for matching discount to line item.
   1. Discounts appear as their own line items with a number like /123456, this matches the UPC of the discounted item
   2. must be date-matched to the UPC
 Data model might be missing shape:
 1. match discount rows like `item_name:/2303476` to `retailer_item_id:2303476`
 2. display this value on the item somehow? maybe update line_total? otherwise we lose fidelity. should be stored in items_enriched somehow
 #+begin_example
 ```retailer	order_id	line_no	observed_item_key	order_date	retailer_item_id	pod_id	item_name	upc	category_id	category	qty	unit	unit_price	line_total	picked_weight	mvp_savings	reward_savings	coupon_savings	coupon_price	image_url	raw_order_path	item_name_norm	brand_guess	variant	size_value	size_unit	pack_qty	measure_type	is_store_brand	is_fee	is_discount_line	is_coupon_line	price_per_each	price_per_lb	price_per_oz	parse_version	parse_notes
 costco	2.11115E+22	3	costco:21111520101942404241753:3	4/24/2024	2303476		KA 6QT MIXER P16 KSM60SECXER/CU FY23		33	33	1	None	399.99	399.99							costco_output/raw/21111520101942404241753-2024-04-24T17-53-00.json	KA 6QT MIXER KSM60SECXER/CU						each	FALSE	FALSE	FALSE	FALSE	399.99			costco-enrich-v1	
 costco	2.11115E+22	4	costco:21111520101942404241753:4	4/24/2024	325173		/2303476		33	33	-1	None	0	-100				-100			costco_output/raw/21111520101942404241753-2024-04-24T17-53-00.json	/2303476						each	FALSE	FALSE	TRUE	TRUE	100			costco-enrich-v1	```
 #+end_example
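One possible shape for the Costco discount matching described above, as a hedged sketch. It assumes rows are dicts using the `items_enriched` column names shown in the example, that matching within one `order_id` is enough to satisfy the date-matching requirement, and that discount amounts are negative. The function name `match_costco_discounts` and the returned `unmatched` list are illustrative only. The pattern accepts both `/2303476` and `/ 1768123`.

```python
import re

# Discount rows carry the discounted item's id in item_name, e.g. "/2303476".
DISCOUNT_NAME = re.compile(r"^/\s*(\d+)$")


def match_costco_discounts(rows):
    """Attach discount rows to their item rows within the same order.

    Returns (item_rows, unmatched_discount_rows). Hypothetical helper.
    """
    items = []
    items_by_key = {}
    discounts = []
    for row in rows:
        match = DISCOUNT_NAME.match(row["item_name"].strip())
        if match:
            discounts.append((match.group(1), row))
            continue
        item = dict(row, matched_discount_amount=0.0)
        items.append(item)
        # First item with this id in the order wins (assumption).
        items_by_key.setdefault((row["order_id"], str(row["retailer_item_id"])), item)

    unmatched = []
    for target_id, row in discounts:
        target = items_by_key.get((row["order_id"], target_id))
        if target is None:
            unmatched.append(row)
        else:
            target["matched_discount_amount"] += float(row["line_total"])

    for item in items:
        # line_total stays untouched so retailer truth is preserved.
        item["net_line_total"] = round(
            float(item["line_total"]) + item["matched_discount_amount"], 2
        )
    return items, unmatched


sample = [
    {"order_id": "A", "retailer_item_id": "2303476",
     "item_name": "KA 6QT MIXER P16 KSM60SECXER/CU FY23", "line_total": 399.99},
    {"order_id": "A", "retailer_item_id": "325173",
     "item_name": "/2303476", "line_total": -100},
    {"order_id": "A", "retailer_item_id": "999",
     "item_name": "/ 1768123", "line_total": -5},
]
matched_items, unmatched_discounts = match_costco_discounts(sample)
```

Keeping `line_total` as-is and storing the discount separately avoids the loss of fidelity raised in point 2.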
 ** TODO giant discount matching
 * prompt
 do not add new abstractions unless they remove real duplication. prefer explicit retailer-specific logic over generic heuristics. do not auto-create new canonical products from weak normalized names.
 and propose the smallest set of edits needed.
 * 1.13 fixes
 ** 15x Costco discounts not caught
 - 15x, some with slash-space: `/ 1768123` and some without: `/2303476`
 ** canonical names suck - tempted to force manual config from scratch?
 - maybe first-pass should be naming groups, starting with largest groups and going on down.
 - unfortunately not seeing many cross-retailer items? looks like costco-only; just taking Giant as gospel
 - could be as simple as changing canonical name in canonical_catalog.csv  
 - tough to figure out where the data is, leading to below:  
 ** need to refactor whole flow and where data is stored
 group by browser or by site, or both? currently mixed. 
 1. Scrape
   - Script:
   - Output: /output/raw/orderN.json, history.json, orders.csv, history.csv
 2. Enrich
   - Scripts:
   - Output: /output/enrich/items.json
 3. Combined - /output/?
   - Review step?
 ** proposed fixes
 * 1.14 prep - OBE
 ** [ ] t1.14.1 define and document the filesystem/data-layer layout (2-3 commits)
 make stage ownership and retailer ownership explicit so every artifact has one obvious home
 ** AC
 1. define and document the canonical directory layout for the pipeline, separating retailer-specific artifacts from shared combined artifacts
 2. adopt an explicit layout of the form:
   - `data/<retailer>/raw/`
   - `data/<retailer>/orders.csv`
   - `data/<retailer>/items.csv`
   - `data/<retailer>/items_enriched.csv`
   - `data/combined/products_observed.csv`
   - `data/combined/review_queue.csv`
   - `data/combined/item_aliases.csv`
   - `data/combined/canonical_catalog.csv`
   - `data/combined/product_links.csv`
   - `data/combined/purchases.csv`
   - `data/combined/pipeline_status.csv`
   - `data/combined/pipeline_status.json`
 3. update docs/readme and pipeline docs so each script’s inputs and outputs point to the new layout
 4. remove or deprecate ambiguous stage outputs living under a retailer-specific output directory when they are actually shared artifacts
 - pm note: goal is “where does this file live?” should have one answer, not three
 ** evidence
 - commit:
 - tests:
 - date:
 ** notes
 ** [ ] t1.14.2 define the row-level data model for raw, enriched, observed, canonical, and purchases layers (2-4 commits)
 lock the item model before further refactors so each stage has a clear grain and purpose
 ** AC
 1. document the row grain for each layer:
   - raw item row = one receipt line from one retailer order
   - enriched item row = one retailer line with retailer-specific parsed fields
   - observed product row = one grouped retailer-facing product concept
   - canonical catalog row = one review-controlled product identity
   - purchase row = one final pivot-ready purchased item line
 2. define the required fields for each layer, including stable ids and provenance fields
 3. explicitly document which fields are allowed to be blank at each layer (e.g. `upc`, `canonical_item_id`, category)
 4. document the relationship between:
   - `raw_item_name`
   - `normalized_item_name`
   - `observed_product_id`
   - `canonical_item_id`
 5. document how retailer-native ids (e.g. Costco `retailer_item_id`) fit into the shared model without being forced into `upc`
 - pm note: this is the schema contract task; code should follow it, not invent it ad hoc
 ** evidence
 - commit:
 - tests:
 - date:
 ** notes
 ** [ ] t1.14.3 refactor pipeline outputs to the new layout without changing semantics (2-4 commits)
 move files and script defaults to the new structure while preserving current behavior
 ** AC
 1. update scraper and enrich scripts to write retailer-specific outputs under `data/<retailer>/...`
 2. update combined/shared scripts to read from retailer-specific enriched outputs and write to `data/combined/...`
 3. preserve current content/meaning of outputs during the move; this is a location/structure refactor, not a behavior rewrite
 4. update tests, docs, and script defaults to use the new paths
 - pm note: do not mix data-layout cleanup with canonical/review logic changes in this task
 ** evidence
 - commit:
 - tests:
 - date:
 ** notes
 ** [ ] t1.14.4 make the review and catalog layer explicit and authoritative (2-4 commits)
 treat review and canonical resolution as first-class data, not incidental byproducts
 ** AC
 1. define `review_queue.csv`, `item_aliases.csv`, and `canonical_catalog.csv` as the authoritative review/catalog files in `data/combined/`
 2. document the intended purpose of each:
   - `review_queue.csv` = unresolved observed items needing action
   - `item_aliases.csv` = approved mapping from observed/normalized names to canonical ids
   - `canonical_catalog.csv` = review-controlled canonical product definitions and display names
 3. ensure final purchase generation reads from these files as the source of truth for resolution
 4. stop relying on weak implicit canonical creation as a substitute for the explicit review/catalog layer
 - pm note: this is the control-plane task; observed products may be automatic, canonical products are review-controlled
 ** evidence
 - commit:
 - tests:
 - date:
 ** notes
 ** [ ] t1.14.5 define and document the final pivot-ready purchases output (2-3 commits)
 make the final analysis artifact explicit so excel/pivot/chart use is a first-class target
 ** AC
 1. define `data/combined/purchases.csv` as the final normalized purchase log
 2. ensure each purchase row retains:
   - purchase date
   - retailer
   - order id
   - raw item name
   - normalized item name
   - canonical item id when resolved
   - quantity and unit
   - original line total
   - discount-adjusted fields when applicable
   - store/location fields where available
 3. document that `purchases.csv` is the primary excel/pivot input and that earlier files are staging layers
 4. document expected pivot uses such as purchase frequency and cost over time by canonical item
 - pm note: this task is about making the final artifact explicit and stable, not about adding new metrics
 ** evidence
 - commit:
 - tests:
 - date:
 ** notes
 * pipeline prep [2026-03-17 Tue]
 data saved to /data
 1. "scrape_<retailer>" gathers data from a retailer and outputs:
   1. raw list of items per visit          ./<retailer>/scraped/raw/order-<uid>.json
   2. raw list of visits                   ./<retailer>/scraped_visits.csv
   3. raw list of items from all visits    ./<retailer>/scraped_items.csv
 2. "enrich <retailer>" takes /scraped/ data and outputs:
   1. normalized list of items             ./<retailer>/enriched_items.csv
 3. "combine" takes retailer 
 input:
   1. all enriched items                   ./<retailer>/enriched_items.csv
   2. all retailer visits                  ./<retailer>/scraped_visits.csv
 outputs:
   1. observed product groups              ./combined/observed/products_observed.csv
   2. unresolved products for review       ./combined/review/review_queue.csv
   3. pipeline accounting/status           ./combined/status/pipeline_status.csv
   4. pipeline accounting/status           ./combined/status/pipeline_status.json
 4. review resolves unknown or weakly identified products and maintains:
   1. canonical product catalog            ./combined/review/canonical_catalog.csv
   2. approved alias mappings              ./combined/review/item_aliases.csv
   3. optional observed→canonical links    ./combined/review/product_links.csv
 5. build purchases takes combined observed data plus review/catalog data and outputs:
   [1]. final normalized purchase log        ./combined/purchases/purchases.csv
 lets get this pipeline right before more refactoring.
 * Pipeline - moved to data-model.org [2026-03-18 Wed]
 Key: 
 - (1) input
 - [2] output
 Each step can be run alone if its dependencies exist.
 ** 1. Collect
 Get raw receipt/visit and item data from a retailer.  Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan).  Preserve complete raw data and preserve fidelity.  Avoid interpretation beyond basic data flattening.
 - (1) Source access (Varies, eg header data, auth for API access)
 - [1] collected visits from each retailer
 - [2] collected items from each retailer
 - [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
 ** 2. Normalize
 Parse and extract structured facts from retailer-specific raw data to create a standardized item format.  Strictly dependent on Collect method and output.
 - Extract quantity, size, pack, pricing, variant
 - Consolidate discount with item using upc/retailer_item_id and concurrence
 - Cleanup naming to facilitate later matching
 - (1) collected items from each retailer
 - (2) collected visits from each retailer
 - [1] normalized items from each retailer
 ** 3. Review/Combine (Canonicalization)
 Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review.  Create catalog linked to normalized items.
 - Grouping the same item from retailer
 - Asking human to create a canonical/catalog item with:
   - friendly/canonical_name: "bell pepper"; "milk"
   - category: "produce"; "dairy"
   - product_type: "pepper"; "milk"
   - ? variant? "whole", "skim", "2pct"
 - (1) normalized items from each retailer
 - [1] review queue of items to be reviewed
 - [2] catalog (lookup table) of confirmed retailer_item and canonical_name
 - [3] canonical purchase list, pivot-ready
 ** Unresolved Issues
 2. Create tags: canonical_name (need better label), category, product_type is missing data like Variant, shouldn't this be part of the normalization step?
 3. need central script to orchestrate; metadata belongs here and nowhere else
 ** Symptoms
 - `LIME` and `LIME . / .` appearing in canonical_catalog:
  - names must come from review-approved names, not raw strings
 * notes
 ** to fix
 - options not reading/sticking?
 - ice cream - add flavor, call it frozen (not dairy)
 - seltzer/soda from "seltzer,soda,bev" to "cherry san pellegrino, seltzer, bev"?
 -  [1] chicken bouillon, soup,  (0 items, 0 rows)  -> chicken bouillon, broth?, ,
 - peanut butter,, -> creamy peanut butter, peanut butter, condiment
 - add gummy bear to candy
 - add "fresh" to fresh strawberry
 - fix "onion,veg,produce"
 manage product_type and category directly?
 future: fix match
 *** Done
 fuji apple, apple, produce (not apple, fruit, produce)
 spinach, , produce -> frozen vs fresh?
 frozen chicken thighs ->
 rotisserie chicken, chicken, poultry ->  rotisserie chicken, chicken, meat
 beef patty, hamburger, meat -> hamburger patty, beef, meat
 oats > cereal
 cheerios > cereal
 - 3 kinds of greek yogurt!!
 ** takeaways
 - variants not caught, how to fix? 
 catalog_name = what you actually bought
 product_type = reasonable substitute
 category = store aisle
 Using different categories maintains a direct comparison (product_type==spinach) and a distinction.
 fresh spinach, spinach, produce
 frozen spinach, spinach, frozen
 include in catalog_name: 
  - form: frozen, fresh, ground, shredded
  - fat level: whole, skim, 2%
  - flavor when primary: vanilla yogurt vs plain yogurt
  - cut: diced tomatoes vs crushed tomatoes
  - variety when relevant: gala apple vs fuji apple
 exclude from catalog_name:
  - package size / multipack count
  - promo wording; adjectives like "premium"; retailer marketing fluff
 ** AC 
 1. fix internal search flow, add same menu
   #+begin_src diff
  Review 4/345: SHRP CHDR
  5 matched items:
   [1] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2026-03-12 |  5.49 | 
   [2] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2025-01-24 | 12.58 | 
   [3] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2025-01-10 |  6.29 | 
   [4] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2024-12-14 |  6.29 | 
   [5] KS SHRP CHDR EC20T9H5 W12T13H5 SL130 | costco | 2024-08-06 |  5.99 | 
  no catalog_name suggestions found
  [f]ind  [n]ew  [s]kip  e[x]clude  [q]uit >
   f
  search: cheddar
  1 search results found:
   [1] cheddar cheese, cheese, dairy (0 items, 0 rows)
 -  selection: 1
 + [#] link to suggestion  [f]ind  [n]ew  [s]kip  e[x]clude  [q]uit >
   #+end_src
 instead of
 #+begin_src diff
 search: banana
 no matches found
 - search again? [enter=yes, q=no]:
 + [f]ind  [n]ew  [s]kip  e[x]clude  [q]uit >
 #+end_src
 2. during a long review session, two pepper or onion types back-to-back cant see the one i just added
  - suggest just-added catalog items
  - script likely needs to re-read the csv, not just add
 //3. suggest based on both catalog & product_name  (this is already happening)//
 3. Search results do not properly list running totals:
      5 search results found:
 [1] red onion, onion, produce (0 items, 0 rows)
 [2] mild roasted red bell pepper, bell pepper, produce (0 items, 0 rows)
 [3] onion, vegetable, produce (0 items, 0 rows)
 [4] sour cream and onion potato chip, chips, snack (0 items, 0 rows)
 [5] yellow onion, onion, produce (0 items, 0 rows)
 selection:  
 * data cleanup [2026-03-23 Mon]
 ok we're getting closer. still see some issues
 1. reorder purchases columns for display: catalog_name, product_type, category (makes data/troubleshooting way easier)
 2. shouldn't net_line_price never be empty? this would allow cumulative cost comparison/analysis (we can see normalized price per X via effective_price, but shouldn't this be weighted against how much we bought? e.g. a 5lb flour purchase at $0.970/lb is weighted 1-to-1 with a 25lb purchase at $0.670/lb)
 3. some items missing entire categorizations? probably a result of me trying to do data cleanup. i found the orphaned values in the product_links table and removed them, but re-running review_products.py did not catch this...
   shouldn't review_products run a comparison between each vendor's normalized_items and compare to the existing review_queue? 
   RSET POTATO US 1
   GREEK YOGURT DOM55
   FDLY CHY VAN IC CRM
   DUNKIN DONUT CANISTER ORIG BLND P=260
   ICE CUBES
   BLACK BEANS
   KETCHUP SQUEEZE BTL
   YELLOW_GOLD POTATO US 1
   YELLOW_GOLD POTATO US 1
   PINTO BEANS
 4. cleanup deprecated .py files
 5. Goals:
   1. When have I purchased this item, what did I pay, and how has the price changed over time?
      - we're close, but missing units - eg AP flour shows a value that looks like price/lb but you just see $0.765
      - doesnt seem like we've captured everything but that's just a gut feeling
   2. Visit breakdown as well as catalog/product/category? this certainly belongs in purchases.csv.
   3. Consider dash/plotly for better-than-excel tracking, since we're really only looking at a couple of graphs and filtering within certain values? (obv keep purchases as a user-friendly output)
 ** 1. Cleanup purchases column order
 purchase_date
 retailer
 catalog_name
 product_type
 category
 net_line_total
 normalized_quantity
 effective_price
 effective_price_unit (new)
 order_id
 line_no
 raw_item_name
 normalized_item_name
 catalog_id
 normalized_item_id
 ** 2. Populate and use purchases.net_line_total
  net_line_total = line_total + matched_discount_amount
  effective_price = net_line_total / normalized_quantity
  weighted cost analysis uses net_line_total, not just avg effective_price
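A small worked sketch of why the weighted figure matters, using the flour numbers from the cleanup notes (5 lb at $0.970/lb and 25 lb at $0.670/lb). The helper names are hypothetical, and discounts are assumed to be stored as negative amounts, as the formula above implies.

```python
def net_line_total(line_total, matched_discount_amount=0.0):
    # Discounts are assumed negative, so they are added to the line total.
    return line_total + matched_discount_amount


def effective_price(net_total, normalized_quantity):
    # Blank or zero quantity: return None instead of dividing by zero.
    if not normalized_quantity:
        return None
    return net_total / normalized_quantity


def weighted_average_price(rows):
    """rows: iterable of (net_line_total, normalized_quantity) pairs."""
    usable = [(net, qty) for net, qty in rows if qty]
    total_qty = sum(qty for _, qty in usable)
    if not total_qty:
        return None
    return sum(net for net, _ in usable) / total_qty


# 5 lb for $4.85 and 25 lb for $16.75
flour = [(4.85, 5.0), (16.75, 25.0)]
simple_mean = sum(effective_price(net, qty) for net, qty in flour) / len(flour)
weighted = weighted_average_price(flour)
```

Here the simple mean of the two per-pound prices is about $0.82, while the quantity-weighted price is about $0.72, because the 25 lb purchase dominates the total.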
 ** 3. Improve review robustness, enable norm_item re review
 1. should regenerate candidates from:
 - normalized items with no valid catalog_id
 - normalized items whose linked catalog_id no longer exists
 - normalized items whose linked catalog row exists but missing required fields if you want completeness review
 2. review_products.py should compare:
 - current normalized universe
 - current product_links
 - current catalog
 - current review_queue
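The comparison above could look something like this sketch. It assumes `product_links` is loaded as a mapping from `normalized_item_id` to `catalog_id` and the catalog as a mapping from `catalog_id` to its row. The function name, the reason codes, and the choice of required fields are illustrative, not taken from `review_products.py`.

```python
# Assumed completeness fields; adjust to whatever the catalog actually requires.
REQUIRED_CATALOG_FIELDS = ("catalog_name", "product_type", "category")


def find_review_candidates(normalized_item_ids, product_links, catalog, queued_item_ids):
    """Return (normalized_item_id, reason) pairs that need review.

    normalized_item_ids: ids in the current normalized universe
    product_links:       dict normalized_item_id -> catalog_id
    catalog:             dict catalog_id -> catalog row (dict)
    queued_item_ids:     ids already present in the review queue
    """
    candidates = []
    for item_id in normalized_item_ids:
        if item_id in queued_item_ids:
            continue  # already waiting for review
        catalog_id = product_links.get(item_id, "")
        if not catalog_id:
            reason = "missing_link"
        elif catalog_id not in catalog:
            reason = "orphaned_link"  # link points at a deleted catalog row
        elif any(not catalog[catalog_id].get(f) for f in REQUIRED_CATALOG_FIELDS):
            reason = "incomplete_catalog"
        else:
            continue
        candidates.append((item_id, reason))
    return candidates


catalog = {
    "cat_1": {"catalog_name": "black beans", "product_type": "beans", "category": "canned"},
    "cat_2": {"catalog_name": "ketchup", "product_type": "", "category": ""},
}
links = {"giant:1": "cat_1", "giant:2": "cat_gone", "giant:3": "cat_2"}
found = find_review_candidates(
    ["giant:1", "giant:2", "giant:3", "giant:4", "giant:5"],
    links, catalog, queued_item_ids={"giant:5"},
)
```

Because the check runs against the current files on every call, a link removed by hand shows up again as a candidate instead of silently dropping out of the queue.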
 ** 4. Remove deprecated .py files
 ** 5. Improve Charts
 1. Histogram: add effective_price_unit to purchases.py
 2. Visits: plot by order_id to enable display of:
   1. spend by visit
   2. items per visit
   3. category spend by visit
   4. retailer/store breakdown
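A rough sketch of the per-visit rollup that would feed these charts. It assumes purchase rows are dicts read from `purchases.csv` with `retailer`, `order_id`, `net_line_total`, and `category` columns. The function name and the `uncategorized` fallback label are hypothetical.

```python
from collections import defaultdict


def summarize_visits(purchases):
    """Group purchase rows by (retailer, order_id) into per-visit totals."""
    visits = defaultdict(
        lambda: {"spend": 0.0, "items": 0, "by_category": defaultdict(float)}
    )
    for row in purchases:
        visit = visits[(row["retailer"], row["order_id"])]
        # CSV values arrive as strings; a blank net_line_total counts as 0.
        amount = float(row["net_line_total"] or 0)
        visit["spend"] += amount
        visit["items"] += 1
        visit["by_category"][row.get("category") or "uncategorized"] += amount
    return visits


rows = [
    {"retailer": "giant", "order_id": "100", "net_line_total": "5.00", "category": "produce"},
    {"retailer": "giant", "order_id": "100", "net_line_total": "2.50", "category": "dairy"},
    {"retailer": "costco", "order_id": "200", "net_line_total": "", "category": ""},
]
summary = summarize_visits(rows)
```

Keying on `(retailer, order_id)` rather than `order_id` alone guards against two retailers reusing the same id.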
 *  /
--- a/pm/review-workflow.org
+++ b/pm/review-workflow.org
@@ -0,0 +1,73 @@
 * review and item-resolution workflow
 This document defines the durable review workflow for unresolved observed
 products.
 ** persistent files
 - `combined_output/purchases.csv`
  Flat normalized purchase log. This is the review input because it retains:
  - raw item name
  - normalized item name
  - observed product id
  - canonical product id when resolved
  - retailer/order/date/price context
 - `combined_output/review_queue.csv`
  Current unresolved observed products grouped for review.
 - `combined_output/review_resolutions.csv`
  Durable mapping decisions from observed products to canonical products.
 - `combined_output/canonical_catalog.csv`
  Durable canonical item catalog used by manual review and later purchase-log
  rebuilds.
 There is no separate alias file in v1. `review_resolutions.csv` is the mapping
 layer from observed products to canonical product ids.
 ** workflow
 1. Run `build_purchases.py`
   This refreshes the purchase log and seeds/updates the canonical catalog from
   current auto-linked canonical rows.
 2. Run `review_products.py`
   This rebuilds `review_queue.csv` from unresolved purchase rows and prompts in
   the terminal for one observed product at a time.
 3. Choose one of:
   - link to existing canonical
   - create new canonical
   - exclude
   - skip
 4. `review_products.py` writes decisions immediately to:
   - `review_resolutions.csv`
   - `canonical_catalog.csv` when a new canonical item is created
 5. Rerun `build_purchases.py`
   This reapplies approved resolutions so the final normalized purchase log now
   carries the reviewed `canonical_product_id`.
 ** what the human edits
 The primary interface is terminal prompts in `review_products.py`.
 The human provides:
 - existing canonical id when linking
 - canonical name/category/product type when creating a new canonical item
 - optional resolution notes
 The generated CSVs remain editable by hand if needed, but the intended workflow
 is terminal-first.
 ** durability
 - Resolutions are keyed by `observed_product_id`, not by one-off text
  substitution.
 - Canonical products are keyed by stable `canonical_product_id`.
 - Future runs reuse approved mappings through `review_resolutions.csv`.
 ** retention of audit fields
 The final `purchases.csv` retains:
 - `raw_item_name`
 - `normalized_item_name`
 - `canonical_product_id`
 This preserves the raw receipt description, the deterministic parser output, and
 the human-approved canonical identity in one flat purchase log.
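 The rerun in step 5 can be pictured with a minimal sketch. It assumes `review_resolutions.csv` has `observed_product_id` and `canonical_product_id` columns plus a hypothetical `status` column; the function names are illustrative and are not taken from `build_purchases.py`.

```python
import csv


def load_resolutions(resolution_lines):
    """Map observed_product_id -> canonical_product_id for approved rows."""
    mapping = {}
    for row in csv.DictReader(resolution_lines):
        # "status" is a hypothetical column; only approved links are reused.
        if row.get("status") == "approved" and row.get("canonical_product_id"):
            mapping[row["observed_product_id"]] = row["canonical_product_id"]
    return mapping


def apply_resolutions(purchase_rows, mapping):
    """Fill canonical_product_id on unresolved rows; audit fields are untouched."""
    resolved = []
    for row in purchase_rows:
        updated = dict(row)  # copy so the input rows are not mutated
        if not updated.get("canonical_product_id"):
            updated["canonical_product_id"] = mapping.get(
                updated["observed_product_id"], ""
            )
        resolved.append(updated)
    return resolved
```

 Because the lookup is keyed by `observed_product_id`, a decision made once applies to every later purchase row in the same observed group, and `raw_item_name` and `normalized_item_name` pass through unchanged.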
--- a/pm/task-sample.org
+++ b/pm/task-sample.org
@@ -0,0 +1,22 @@
 #+title: Task Log
 #+updated: [2026-03-18 Wed 14:19]
 Use the template below, which should be a top-level org-mode header.
 * [ ] M.m.m: Task Title (estimate # commits)
 replace the old observed/canonical workflow with a review-first pipeline that groups normalized rows only during review/combine and links them to catalog items
 ** Acceptance Criteria
 1. Criterion
   - expanded data
 2. Criterion 
 - pm note: amplifying information
 ** evidence
 - commit: abc123, bcd234
 - tests: 
 - datetime: [2026-03-18 Wed 14:15]
 ** notes    
 - explanation of work done, decisions made, reasoning
--- a/pm/tasks.org
+++ b/pm/tasks.org
@@ -1,3 +1,5 @@
 #+title: Scrape-Giant Task Log
 #+STARTUP: overview
 * [X] t1.1: harden giant receipt fetch cli (2-4 commits)
 ** acceptance criteria
 - giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
@@ -276,7 +278,7 @@
 - commit: `7789c2e` on branch `cx`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
 - date: 2026-03-16
-* [ ] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)
+* [X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)
 ** acceptance criteria
 - make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
@@ -302,12 +304,23 @@
 - no new heuristics in this task
 ** evidence
- commit:
+- commit: `d7a0329` on branch `cx`
- tests:
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
- date:  
+- date: 2026-03-16
-* [ ] t1.9: compute normalized comparison metrics (2-4 commits)
+* [X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)
 ** acceptance criteria
 - produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
 - each purchase row preserves:
  - purchase date
  - retailer
  - order id
  - raw item name
  - normalized item name
  - canonical item id when resolved
  - quantity / unit
  - line total
  - store/location info where available
 - derive normalized comparison fields where possible on enriched or observed product rows:
  - `price_per_lb`
  - `price_per_oz`
@@ -318,22 +331,801 @@
  - receipt weight
  - explicit count/pack
 - emit nulls when basis is unknown, conflicting, or ambiguous
 - support pivot-friendly analysis of purchase frequency and item cost over time
 - document at least one Giant vs Costco comparison example using the normalized metrics
 ** notes
 - compute metrics as close to the raw observation as possible
 - canonical layer can aggregate later, but should not invent missing unit economics
 - unit discipline matters more than coverage
 - raw item name must be retained for audit/debugging
 ** evidence
- commit:
+- commit: `be1bf63` on branch `cx`
- tests:
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
- date:
+- date: 2026-03-16
-* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
+* [X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)
 ** acceptance criteria
- llm suggestions are generated only for unresolved observed products
+- define the persistent files used to resolve unknown items, including:
  - review queue
  - canonical item catalog
  - alias / mapping layer if separate
 - specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
 - define the manual resolution workflow, including:
  - what the human edits
  - what script is rerun afterward
  - how resolved mappings are persisted for future runs
 - ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
 - document how raw item name, normalized item name, and canonical item id are all retained
 ** notes
 - goal is “approve once, reuse forever”
 - keep the workflow simple and auditable
 - manual review is fine; the important part is making it durable and rerunnable
 ** evidence
 - commit: `c7dad54` on branch `cx`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified `combined_output/review_queue.csv`, `combined_output/review_resolutions.csv` workflow, and `combined_output/canonical_catalog.csv`
 - date: 2026-03-16
 * [X] t1.12: simplify review process display
 Clearly show current state separate from proposed future state.
 ** acceptance criteria
 1. Display position in review queue, e.g., (1/22)
 2. Display compact header with observed_product under review, queue position, and canonical decision, e.g.: "Resolve [n] observed product group [name]  and associated items to canonical_name [name]? (\n [n] matched items)"
 3. color-code outputs based on info, input/prompt, warning/error
   1. color action menu/requests for input differently from display text; do not color individual options separately
   2. "no canonical_name suggestions found" is informational, not a warning/error.
 4. update action menu `[x]exclude` to `e[x]clude`
 5. on each review item, display a list of all matched items to be linked, sorted by descending date:
   1. YYYY-mm-dd, price, raw item name, normalized item name, upc, retailer
   2. image URL, if exists
   3. Sample:
 6. on each review item, suggest (but do not auto-apply) up to 3 likely existing canonicals using deterministic rules, e.g.:
   1. exact normalized name match
   2. prefix/contains match on canonical name
   3. exact UPC
 7. Sample Entry:
 #+begin_comment
 Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
 2 matched items:
  [1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
  [2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
 2 canonical suggestions found:
  [1] BELL PEPPERS, PRODUCE
  [2] PEPPER, SPICES
 #+end_comment
 8. When link is selected, users should be able to select the number of the item in the list, e.g.:
 #+begin_comment
  Select the canonical_name to associate [n] items with:
   [1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
   [2] BTB CHICKEN         | gcan_0201f0feb749
   [3] LIME                | gcan_02074d9e7359
 #+end_comment
 9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok?"
     actions: [Y]es  [n]o  [b]ack  [s]kip  [q]uit
 - reinforce project terminology such as raw_name, observed_name, canonical_name   
 ** evidence
 - commit: `7b8141c`, `d39497c`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
 - date: 2026-03-17
 ** notes
 - The key improvement was shifting the prompt from system metadata to reviewer intent: one observed_product, its matched retailer rows, and one canonical_name decision.
 - Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
 - Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
 * [X] t1.13.1 pipeline accountability and stage visibility (1-2 commits)
 add simple accounting so we can see what survives or drops at each pipeline stage
 ** AC
 1. emit counts for raw, enriched, combined/observed, review-queued, canonical-linked, and final purchase-log rows
 2. report unresolved and dropped item counts explicitly
 3. make it easy to verify that missing items were intentionally left in review rather than silently lost
 - pm note: simple text/json/csv summary is sufficient; trust and visibility matter more than presentation
 ** evidence
 - commit: `967e19e`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python report_pipeline_status.py`; verified `combined_output/pipeline_status.csv` and `combined_output/pipeline_status.json`
 - date: 2026-03-17
 ** notes
 - Added a single explicit status script instead of threading counters through every pipeline step; this keeps the pipeline simple while still making row survival visible.
 - The most useful check here is `unresolved_not_in_review_rows`; when it is non-zero, we know we have a real accounting bug rather than normal unresolved work.
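 A rough sketch of the accounting check described in the notes above. Only `unresolved_not_in_review_rows` is a name from this log; the other keys, the function name, and the row shape are assumptions, and excluded rows are ignored for brevity.

```python
def summarize_pipeline(purchase_rows, review_queue_ids):
    """Count linked and unresolved rows, flagging unresolved rows absent from review."""
    unresolved = [r for r in purchase_rows if not r.get("canonical_product_id")]
    # An unresolved row whose group is not queued for review was silently lost.
    missing = [
        r for r in unresolved if r["observed_product_id"] not in review_queue_ids
    ]
    return {
        "final_purchase_rows": len(purchase_rows),
        "canonical_linked_rows": len(purchase_rows) - len(unresolved),
        "unresolved_rows": len(unresolved),
        "unresolved_not_in_review_rows": len(missing),
    }
```

 A non-zero `unresolved_not_in_review_rows` then points at an accounting bug rather than ordinary pending review work.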
 * [X] t1.13.2 costco discount matching and net pricing in enrich_costco (2-3 commits)
 refactor costco enrichment so discount lines are matched to purchased items and net pricing is preserved
 ** AC
 1. detect costco discount/coupon rows like `/<retailer_item_id>` and match them to purchased items within the same order
 2. preserve raw discount rows for auditability while also carrying matched discount values onto the purchased item row
 3. add explicit fields for discount-adjusted pricing, e.g. `matched_discount_amount` and `net_line_total` (or equivalent)
 4. preserve original raw receipt amounts (`line_total`) without overwriting them
 - pm note: keep this retailer-specific and explicit; do not introduce generic discount heuristics
 ** evidence
 - commit: `56a03bc`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified matched Costco discount rows now populate `matched_discount_amount` and `net_line_total` while preserving raw `line_total`
 - date: 2026-03-17
 ** notes
 - Kept this retailer-specific and literal: only discount rows with `/<retailer_item_id>` are matched, and only within the same order.
 - Raw discount rows are still preserved for auditability; the purchased row now carries the matched adjustment separately rather than overwriting the original amount.
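 The matching rule in these notes could look roughly like the sketch below. It assumes the `/<retailer_item_id>` marker appears in `item_name`, that discount rows carry a negative `line_total`, and that rows are dicts of strings; none of this is confirmed by `enrich_costco.py`, and the function name is hypothetical.

```python
from decimal import Decimal


def match_costco_discounts(rows):
    """Return row copies with matched_discount_amount and net_line_total filled in."""
    out = [dict(r, matched_discount_amount="", net_line_total="") for r in rows]
    # Index purchased (non-discount) rows by order and retailer item id.
    purchased = {}
    for row in out:
        if not row["item_name"].startswith("/"):
            purchased.setdefault((row["order_id"], row["retailer_item_id"]), row)
    for row in out:
        if not row["item_name"].startswith("/"):
            continue
        target = purchased.get((row["order_id"], row["item_name"][1:].strip()))
        if target is None:
            continue  # unmatched discount rows are kept as-is for audit
        prior = Decimal(target["matched_discount_amount"] or "0")
        target["matched_discount_amount"] = str(prior + Decimal(row["line_total"]))
    for row in out:
        if row["matched_discount_amount"]:
            # line_total is never overwritten; the net value is a separate field.
            row["net_line_total"] = str(
                Decimal(row["line_total"]) + Decimal(row["matched_discount_amount"])
            )
    return out
```

 Discount rows stay in the output unchanged, so the raw receipt can still be reconstructed from `line_total` alone.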
 * [X] t1.13.3 canonical cleanup and review-first product identity (3-4 commits)
 refactor canonical generation so product identity is cleaner, duplicate canonicals are reduced, and unresolved items stay in review instead of spawning junk canonicals
 ** AC
 1. stop auto-creating new canonical products from weak normalized names alone; unresolved items remain in `review_queue.csv`
 2. canonical names are based on stable product identity rather than noisy observed titles
 3. packaging/count/size tokens are removed from canonical names when they belong in structured fields (`pack_qty`, `size_value`, `size_unit`)
 4. consolidate obvious duplicate canonicals (e.g. egg/lime cases) and ensure final outputs retain raw item name, normalized item name, and canonical item id
 - pm note: prefer conservative canonical creation and a better manual review loop over aggressive auto-unification
 ** evidence
 - commit: `08e2a86`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified weaker exact-name cases now remain unresolved in `combined_output/review_queue.csv` and canonical names are cleaned before auto-catalog creation
 - date: 2026-03-17
 ** notes
 - Removed weak exact-name auto-canonical creation so ambiguous products stay in review instead of generating junk canonicals.
 - Canonical display names are now cleaned of obvious punctuation and packaging noise, but I kept the cleanup conservative rather than adding a broad fuzzy merge layer.
 * [X] t1.14: refactor retailer collection into the new data model (2-4 commits)
 move Giant and Costco collection into the new collect structure and make both retailers emit the same collected schemas
 ** Acceptance Criteria
 1. create retailer-specific collect scripts in the target naming pattern, e.g.:
  - collect_giant_web.py
  - collect_costco_web.py
 2. collected outputs conform to pm/data-model.org:
  - data/<retailer-method>/raw/...
  - data/<retailer-method>/collected_orders.csv
  - data/<retailer-method>/collected_items.csv
 3. current Giant and Costco raw acquisition behavior is preserved during the move
 4. collected schemas preserve retailer truth and provenance:
  - no interpretation beyond basic flattening
  - raw_order_path/raw_history_path remain usable
  - unknown values remain blank rather than guessed
 5. old paths should be removed or deprecated
 6. collect_* scripts do not depend on any normalize/review files or scripts
 - pm note: this is a path/schema refactor, not a parsing rewrite
 ** evidence
 - commit: `48c6eaf`
 - tests: `./venv/bin/python -m unittest tests.test_scraper tests.test_costco_pipeline tests.test_browser_session`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`
 - datetime: 2026-03-18
 ** notes
 - Kept this as a path/schema move, not a parsing rewrite: the existing Giant and Costco collection behavior remains in place behind new `collect_*` entry points.
 - Added lightweight deprecation nudges on the legacy `scrape_*` commands rather than removing them immediately, so the move is inspectable and low-risk.
 - The main schema fix was on Giant collection, which was missing retailer/provenance/audit fields that Costco collection already carried.
 * [X] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)
 make Giant and Costco emit the shared normalized line-item schema without introducing cross-retailer identity logic
 ** Acceptance Criteria
 1. create retailer-specific normalize scripts in the target naming pattern, e.g.:
   - normalize_giant_web.py
   - normalize_costco_web.py
 2. normalized outputs conform to pm/data-model.org:
   - data/<retailer-method>/normalized_items.csv
   - one row per collected line item
   - normalized_row_id is stable and present
   - normalized_item_id is stable, present, and represents retailer-level identity reused across repeated purchase rows when deterministic retailer evidence is sufficient
   - normalized_quantity and normalized_quantity_unit
   - repeated rows for the same retailer product resolve to the same normalized_item_id only when supported by deterministic retailer evidence, e.g. exact upc, exact retailer_item_id, exact cleaned name + same size/pack
   - normalization_basis is explicit
 3. Giant normalization preserves current useful parsing:
   - normalized item name
   - size/unit/pack parsing
   - fee/store-brand flags
   - derived price fields
 4. Costco normalization preserves current useful parsing:
   - normalized item name
   - size/unit/pack parsing
   - explicit discount matching using retailer-specific logic
   - matched_discount_amount and net_line_total
 5. both normalizers preserve raw retailer truth:
   - line_total is never overwritten
   - unknown values remain blank rather than guessed
 6. no cross-retailer identity assignment occurs in normalization
 7. normalize never uses fuzzy or semantic matching to assign normalized_item_id
 - pm note: prefer explicit retailer-specific code paths over generic normalization helpers unless the duplication is truly mechanical
 - pm note: normalization may resolve retailer-level identity, but not catalog identity
 - pm note: normalized_item_id is the only retailer-level grouping identity; do not introduce observed_products or a second grouping artifact
 ** evidence
 - commit: `9064de5`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python enrich_giant.py --help`; `./venv/bin/python enrich_costco.py --help`
 - datetime: 2026-03-18
 ** notes
 - Kept the existing Giant and Costco parsing logic intact and added the new normalized schema fields in place, rather than rewriting the enrichers from scratch.
 - `normalized_item_id` is always present, but it only collapses repeated rows when the evidence is strong; otherwise it falls back to row-level identity via `normalized_row_id`.
 - Added `normalize_*` entry points for the new data-model layout while leaving the legacy `enrich_*` commands available during the transition.
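 One way to picture the identity rule from the acceptance criteria and notes: try deterministic evidence in a fixed order, and fall back to `normalized_row_id` when none is strong enough. The basis labels, hashing scheme, and id format below are illustrative assumptions, not the repo's actual implementation.

```python
import hashlib


def assign_normalized_item_id(row, retailer):
    """Return (normalized_item_id, normalization_basis) from deterministic evidence."""
    if row.get("upc"):
        basis, key = "exact_upc", row["upc"]
    elif row.get("retailer_item_id"):
        basis, key = "exact_retailer_item_id", row["retailer_item_id"]
    elif row.get("normalized_item_name") and row.get("size_value") and row.get("size_unit"):
        basis = "exact_name_size_pack"
        key = "|".join(
            [
                row["normalized_item_name"],
                row["size_value"],
                row["size_unit"],
                row.get("pack_qty", ""),
            ]
        )
    else:
        # No strong evidence: keep row-level identity instead of guessing.
        return row["normalized_row_id"], "row_fallback"
    digest = hashlib.sha1(f"{retailer}|{basis}|{key}".encode("utf-8")).hexdigest()[:12]
    return f"{retailer}_{digest}", basis
```

 Including the retailer in the hashed key keeps the id retailer-scoped, so two retailers selling the same UPC never share a `normalized_item_id` at this stage.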
 * [X] t1.14.2: finalize filesystem and schema alignment for the refactor (2-4 commits)
 bring on-disk outputs fully into the target `data/` structure without changing retailer behavior
 ** Acceptance Criteria
 1. retailer data directories conform to pm/data-model.org:
   - `data/giant-web/raw/...`
   - `data/giant-web/collected_orders.csv`
   - `data/giant-web/collected_items.csv`
   - `data/giant-web/normalized_items.csv`
   - `data/costco-web/raw/...`
   - `data/costco-web/collected_orders.csv`
   - `data/costco-web/collected_items.csv`
   - `data/costco-web/normalized_items.csv`
 2. review/combine outputs are moved or rewritten into the target review paths:
   - `data/review/review_queue.csv`
   - `data/review/product_links.csv`
   - `data/review/review_resolutions.csv`
   - `data/review/purchases.csv`
   - `data/review/pipeline_status.csv`
   - `data/review/pipeline_status.json`
 3. old transitional output paths are either:
   - removed from active script defaults, or
   - left as explicit compatibility shims with clear deprecation notes
 4. no recollection is required if existing raw files and collected csvs can be moved/copied losslessly into the new structure
 5. no schema information is lost during the move:
   - raw paths still resolve
   - collected/normalized csvs still open with the expected headers
 6. README and task/docs reflect the final active paths
 - pm note: prefer moving/adapting existing files over recollecting from retailers unless a real data loss or schema mismatch forces recollection
 - pm note: this is a structure-alignment task, not a retailer parsing task
 ** evidence
 - commit: `d2e6f2a`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`; verified `data/giant-web/collected_orders.csv`, `data/giant-web/collected_items.csv`, `data/costco-web/collected_orders.csv`, `data/costco-web/collected_items.csv`, `data/catalog.csv`, and archived transitional review outputs under `data/review/archive/`
 - datetime: [2026-03-20 10:04:15 EDT]
 ** notes
 - No recollection was needed; existing raw and collected exports were adapted in place and moved into the target names.
 - Updated the active script defaults to point at `data/...` so the code and on-disk layout now agree.
 - Kept obviously obsolete review artifacts, but moved them under `data/review/archive/` instead of deleting them outright.
 * [X] t1.14.3: retailer-specific Costco normalization cleanup (2-4 commits)
 tighten Costco-specific normalization so normalized item names are cleaner and deterministic retailer grouping is less noisy
 ** Acceptance Criteria
 1. improve Costco item-name cleanup for obvious non-identity noise, such as:
   - trailing slash fragments
   - code tokens and receipt-format artifacts
   - duplicated measurement fragments already captured in structured fields
 2. preserve deterministic normalization rules only:
   - exact retailer_item_id
   - exact cleaned name + same size/pack when needed
   - approved retailer alias
   - no fuzzy or semantic matching
 3. normalized Costco names improve on known bad examples, e.g.:
   - `MANDARIN /` -> cleaner normalized item name
   - `LIFE 6'TABLE ... /` -> cleaner normalized item name
 4. cleanup does not overwrite retailer truth:
   - raw `item_name` is unchanged
   - parsed `size_value`, `size_unit`, `pack_qty`, and pricing fields remain intact
 5. discount-row behavior remains correct:
   - matched discount rows still populate `matched_discount_amount`
   - `net_line_total` remains correct
   - discount rows remain auditable
 6. add regression tests for the cleaned Costco examples and any new parsing rules
 - pm note: keep this explicitly Costco-specific; do not introduce a generic cleanup framework
 - pm note: prefer a short allowlist/blocklist of known receipt artifacts over broad heuristics
 ** evidence
 - commit: `bcec6b3`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_costco_pipeline`; `./venv/bin/python normalize_costco_web.py`; verified live cleaned examples in `data/costco-web/normalized_items.csv`, including `MANDARINS 2.27 KG / 5 LBS -> MANDARIN` and `LIFE 6'TABLE MDL #80873U - T12/H3/P36 -> LIFE 6'TABLE MDL`
 - datetime: 2026-03-20 11:09:32 EDT
 ** notes
 - Kept this explicitly Costco-specific and narrow: the cleanup removes known logistics/code artifacts and orphan slash tokens without introducing fuzzy naming logic.
 - The structured parsing still owns size/pack extraction, so name cleanup can safely strip dual-unit and logistics fragments after those fields are parsed.
 - Discount-line behavior remains unchanged; this task only cleaned normalized names and preserved the existing audit trail.
 * [X] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
 replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items
 ** Acceptance Criteria
 1. refactor review outputs to conform to pm/data-model.org:
   - data/review/review_queue.csv
   - data/review/product_links.csv
   - data/catalog.csv
   - data/purchases.csv
 2. review logic uses normalized_item_id as the upstream retailer-level review identity:
   - no dependency on observed_product_id
   - no dependency on products_observed.csv
   - one review/link decision applies to all purchase rows sharing the same normalized_item_id
 3. product_links.csv stores review-approved links from normalized_item_id to catalog_id
   - one row per approved retailer-level identity to catalog mapping
 4. catalog.csv entries are review-first and conservative:
   - no auto-creation from weak normalized names alone
   - names come from reviewed catalog naming, not raw retailer strings
   - packaging/count is not embedded in catalog_name unless essential to identity
   - catalog_name/product_type/category/brand/variant may be blank until reviewed; blank is preferred to guessed
 5. purchases.csv remains pivot-ready and retains:
   - raw item name
   - normalized item name
   - normalized_row_id (not for review)
   - normalized_item_id
   - catalog_id
   - catalog fields
   - raw line_total
   - matched_discount_amount and net_line_total when present
   - derived price fields and their bases
 6. terminal review flow remains simple and usable:
   - reviewer sees one grouped retailer item identity (normalized_item_id) with count and list of matches, not one prompt per purchase row; use existing pattern as a template
   - link to existing catalog item
   - create new catalog item
   - exclude
   - skip
 7. pipeline accounting remains valid after the refactor:
   - unresolved items are visible
   - missing items are not silently dropped
 8. pm note: prefer a better manual review loop over aggressive automatic grouping. initial manual data entry is expected, and should resolve over time
 9. pm note: keep review/combine auditable; each catalog link should be explainable from normalized rows and review state
 ** evidence
 - commit: `9104781`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`
 - datetime: 2026-03-20 11:27:12 EDT
 ** notes
 - The old observed/canonical auto-layer is no longer in the active review/combine path. `build_purchases.py`, `review_products.py`, and `report_pipeline_status.py` now operate on `normalized_item_id`, `catalog_id`, and `catalog_name`.
 - I kept the review CLI shape intentionally close to the pre-refactor flow so the project only changed its identity model, not the operator workflow.
 - Existing auto-generated catalog rows are no longer carried forward by default; only deliberate catalog entries survive. That keeps the new `catalog.csv` conservative, but it also means prior observed-based auto-links do not migrate into the new model.
 - Live rerun after the refactor produced `627` purchase rows, `387` review-queue rows, `407` distinct normalized items, `0` linked normalized items, and `0` unresolved rows missing from the review queue.
 * [X] t1.16: cleanup review process and format
 ** acceptance criteria
 1. Add intro text explaining:
   1. catalog name: unique product including variant but not packaging, eg "whole milk", "sharp cheddar cheese"
   2. product type: general product you would like to compare to, eg "milk", "cheese"
   3. category: eg "dairy"
 2. Reformat input per item
   1. Change matched item field display order
   2. Add count of distinct normalized_item_ids and total purchase rows already linked to the catalog item
   3. Add option to select catalog suggestion directly
   #+begin_comment
    Review 7/22: MIXED PEPPER 6-PK
    2 matched items:
     - MIXED PEPPER 6-PK | costco | 2026-03-12 | 7.49 | [img_url]
     - [raw_name] |  [retailer] | [YYYY-mm-dd] | [price] | [img_url]  
    2 catalog suggestions found:
     [1] bell pepper, pepper, produce (42 items)
     [2] ground pepper, spice, baking (1 item)
   [#] link to suggestion [n]ew [s]kip e[x]clude [q]uit > 
   #+end_comment
 3. When creating new, ask for input in catalog_name, product_type, category order
   1. enter to accept blank value
 4. Each reviewed item is saved after user input, not at the end of the script.
   1. on new creation, create entry in catalog.csv and create entry in product_links.csv
   2. on link existing, create entry in product_links.csv
   3. update review_queue.csv status for item immediately after action
 5. linking operates at normalized_item_id level, not per normalized_row_id
 6. ensure catalog.csv and product_links.csv are human-editable and consistent so manual correction is possible without tooling
 ** evidence
 - commit: `975d44b`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python review_products.py --help`
 - datetime: 2026-03-20 12:45:25 EDT
 ** notes
 - The main flow change is operational rather than architectural: each review decision now persists immediately to `review_resolutions.csv`, `catalog.csv`, `product_links.csv`, and the on-disk `review_queue.csv`.
 - Direct numeric selection works well for suggestion-heavy review, while `[l]ink existing` remains available as a fallback when the suggestion list is empty or incomplete.
 - I kept the review data model unchanged from `t1.15`; this task only tightened the prompt format, field order, and save behavior.
 * [X] t1.16.1: add catalog search flow to review ui (2-3 commits)
 enable fast lookup of catalog items during review via tokenized search and replace manual list scanning
 ** acceptance criteria
 1. replace `[l]ink existing` with `[f]ind` in review prompt:
   - `[#] link to suggestion  [f]ind  [n]ew  [s]kip  e[x]clude  [q]uit >`
 2. implement search flow:
   - on `f`, prompt: `search: `
   - tokenize input using same normalization rules as suggestion matching
   - return ranked list of catalog items where tokens overlap with:
     - catalog_name
     - product_type
     - variant
   - display results in same numbered format as suggestions:
     [1] flour, flour, baking (12 items, 48 rows)
 3. allow direct selection from search results:
   - when the user inputs a number, immediately create the approved resolution and product_links rows
   - returns to next review item
 4. reuse match logic used for suggestion matching; no new matching system introduced
   - future improvements to matching logic will therefore apply in both places
 5. search results exclude already-linked current normalized_item_id target
 6. fallback behavior:
   - if no results, print `no matches found`
   - allow retry or return to main prompt
 7. keep interaction tight:
   - no full catalog dump
   - max ~10 results returned
   - sorted by simple score (token overlap count)
 8. persistence:
   - selected link writes immediately to `product_links.csv`
   - no buffering until script end
 - pm note: optimize for speed over correctness; this is a manual assist tool, not a ranking system
 - pm note: improve manual lookup flow only, don't retool or create a second algorithm
 ** evidence
 - commit: `f93b9aa`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python review_products.py --help`; `./venv/bin/python review_products.py --refresh-only`
 - datetime: 2026-03-20 13:34:57 EDT
 ** notes
 - The search path reuses the same lightweight token matching rules as suggestion ranking, so there is still only one matching system to maintain.
 - Direct numeric suggestion-pick remains the fastest happy path; search is the fallback when suggestions are sparse or missing.
 - Search intentionally optimizes for manual speed rather than smart ranking: simple token overlap, max 10 rows, and immediate persistence on selection.
 - Follow-up fix: search moved to `[f]ind` so `[s]kip` remains available at the main prompt.
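 The token-overlap search described above can be sketched as follows. The tokenizer regex and function names are assumptions; the real code reuses the suggestion-matching rules, which this log does not spell out.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_catalog(query, catalog_rows, limit=10):
    """Rank catalog rows by how many query tokens they share, best first."""
    query_tokens = tokenize(query)
    scored = []
    for row in catalog_rows:
        haystack = " ".join(
            row.get(field, "") for field in ("catalog_name", "product_type", "variant")
        )
        score = len(query_tokens & tokenize(haystack))
        if score:
            scored.append((score, row))
    # Highest overlap first; catalog_name breaks ties so output is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]["catalog_name"]))
    return [row for _, row in scored[:limit]]
```

 An empty result list maps onto the `no matches found` branch, and the `limit` default mirrors the max of roughly 10 results.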
 * [X] t1.17: fix normalized quantity derivation and carry it through purchases (2-4 commits)
 correct and document deterministic normalized quantity fields so unit-cost analysis works across package sizes
 ** Acceptance Criteria
 1. populate and validate `normalized_quantity` and `normalized_quantity_unit` in `data/<retailer-method>/normalized_items.csv`
   - these columns already exist and must be corrected rather than reintroduced
 2. carry `normalized_quantity` and `normalized_quantity_unit` through to `data/review/purchases.csv`
 3. derive normalized quantity deterministically from existing parsed fields only:
   - `qty`
   - `pack_qty`
   - `size_value`
   - `size_unit`
   - `measure_type`
 4. prefer the best deterministic basis rather than falling back to `each` too early:
   - count items when count is explicit
   - weight items when parsed weight is explicit
   - volume items when parsed volume is explicit
   - `each` only when no better basis is available
 5. handle common cases explicitly, including totals derived from deterministic patterns such as:
   - `18 count`
   - `5 lb`
   - `64 oz`
   - `2 each`
 6. preserve blanks when no reliable normalized quantity basis can be derived
 7. existing `normalized_item_id` values remain stable; this task must not change retailer-level grouping identity
 8. document the derivation rules and any intentional conversions or non-conversions in `pm/data-model.org` or task notes
   - if unit conversions are allowed, they must be explicit and minimal
 - pm note: keep this deterministic and conservative; do not introduce fuzzy inference
 - pm note: if `lb <-> oz` or volume conversions are used, document them directly rather than hiding them in code
 - pm note: this task enables cost analysis and charting, not catalog/review changes
 ** evidence
 - commit: `d25448b`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python normalize_giant_web.py`; `./venv/bin/python normalize_costco_web.py`; `./venv/bin/python build_purchases.py`
 - datetime: 2026-03-21 21:02:21 EDT
 ** notes
 - The missing purchases fields were a carry-through bug: normalization had `normalized_quantity` and `normalized_quantity_unit`, but `build_purchases.py` never wrote them into `data/review/purchases.csv`.
 - Normalized quantity now prefers explicit package basis over `each`, so rows like `PEPSI 6PK 7.5Z` resolve to `90 oz` and `KS ALMND BAR US 1.74QTS` purchased twice resolves to `3.48 qt`.
 - The derivation stays conservative and does not convert units during normalization; parsed units such as `oz`, `lb`, `qt`, and `count` are preserved as-is.
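 The derivation rules above can be sketched roughly as follows. This is an illustration under assumptions, not the normalizer's actual code: the function names, the use of `Decimal`, and the exact order of the fallbacks are hypothetical. The input columns are the ones listed in the acceptance criteria.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(value):
    """Parse a CSV cell into Decimal; blank or invalid cells become None."""
    try:
        return Decimal(str(value).strip())
    except InvalidOperation:
        return None


def derive_normalized_quantity(row):
    """Return (normalized_quantity, normalized_quantity_unit) as strings."""
    qty = to_decimal(row.get("qty", ""))
    pack_qty = to_decimal(row.get("pack_qty", ""))
    size_value = to_decimal(row.get("size_value", ""))
    size_unit = (row.get("size_unit") or "").strip()
    measure_type = (row.get("measure_type") or "").strip()

    if qty is None or qty <= 0:
        return "", ""
    if size_value is not None and size_value > 0 and size_unit:
        # Explicit package basis: qty x pack_qty x size, unit kept as parsed.
        total = qty * (pack_qty or 1) * size_value
        return format(total.normalize(), "f"), size_unit
    if pack_qty is not None and pack_qty > 0:
        # Assumption: a pack count with no size is treated as a count basis.
        return format((qty * pack_qty).normalize(), "f"), "count"
    if measure_type == "each":
        return format(qty.normalize(), "f"), "each"
    # No reliable basis: preserve blanks rather than guessing.
    return "", ""
```

 With `qty=2`, `size_value=1.74`, `size_unit=qt` this yields `3.48 qt`, matching the almond-bar case in the notes. No unit conversion is attempted.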
 * [X] t1.18: add regression tests for known quantity/price failures (1-2 commits)
 capture the currently broken comparison cases before changing normalization or purchases logic
 ** acceptance criteria
 1. ensure the new tests assert the intended `effective_price` behavior for the known banana, ice, and beef patty examples
 2. add tests covering known broken cases:
   - giant bananas produce non-blank effective price
   - giant bagged ice produces non-zero effective price
   - costco bananas retain correct effective price
   - beef patty comparison rows preserve expected quantity basis behavior
 3. tests fail against current broken behavior and document the expected outcome
 4. include at least one assertion that effective_price is blank rather than `0` or divide-by-zero when no denominator exists
 - pm note: this task should only add tests/fixtures and not change business logic
 ** pm identified problems
 we have a few problems to scope. looks like:
 1. normalize_giant_web not always propagating weight data to price_per
 2. effective_price calc needs more robust matching algo (my excel hack is clearly not enough)
 ```
 catalog_name	banana										
 Average of effective_price	Column Labels										
 Row Labels	8/6/2024	12/6/2024	12/12/2024	1/7/2025	1/24/2025	2/16/2025	2/20/2025	6/25/2025	2/14/2026	3/12/2026	Grand Total
 Jan				#DIV/0!	0.496666667						#DIV/0!
 Feb						#DIV/0!	#DIV/0!		0.496666667		#DIV/0!
 Mar										0.496666667	0.496666667
 Jun								#DIV/0!			#DIV/0!
 Aug	0.496666667										0.496666667
 Dec		#DIV/0!	#DIV/0!								#DIV/0!
 Grand Total	0.496666667	#DIV/0!	#DIV/0!	#DIV/0!	0.496666667	#DIV/0!	#DIV/0!	#DIV/0!	0.496666667	0.496666667	#DIV/0!
 purchase_date	retailer	normalized_item_name	catalog_name	category	product_type	qty	unit	normalized_quantity	normalized_quantity_unit	pack_qty	size_value	size_unit	measure_type	line_total	unit_price	net_line_total	price_per_each	price_per_each_basis	price_per_count	price_per_count_basis	price_per_lb	price_per_lb_basis	price_per_oz	price_per_oz_basis	effective_price
 8/6/2024	costco	BANANA	banana	produce	banana	1	E	3	lb		3	lb	weight	1.49	1.49	1.49	1.49	line_total_over_qty			0.4967	parsed_size_lb	0.031	parsed_size_lb_to_oz	0.496666667
 12/6/2024	giant	BANANA	banana	produce	banana	1	LB						weight	0.99	0.99		0.99	line_total_over_qty			0.5893	picked_weight_lb	0.0368	picked_weight_lb_to_oz	#DIV/0!
 12/12/2024	giant	BANANA	banana	produce	banana	1	LB						weight	1.37	1.37		1.37	line_total_over_qty			0.5905	picked_weight_lb	0.0369	picked_weight_lb_to_oz	#DIV/0!
 1/7/2025	giant	BANANA	banana	produce	banana	1	LB						weight	1.44	1.44		1.44	line_total_over_qty			0.5902	picked_weight_lb	0.0369	picked_weight_lb_to_oz	#DIV/0!
 1/24/2025	costco	BANANA	banana	produce	banana	1	E	3	lb		3	lb	weight	1.49	1.49	1.49	1.49	line_total_over_qty			0.4967	parsed_size_lb	0.031	parsed_size_lb_to_oz	0.496666667
 2/16/2025	giant	BANANA	banana	produce	banana	2	LB						weight	2.54	1.27		1.27	line_total_over_qty			0.588	picked_weight_lb	0.0367	picked_weight_lb_to_oz	#DIV/0!
 2/20/2025	giant	BANANA	banana	produce	banana	1	LB						weight	1.4	1.4		1.4	line_total_over_qty			0.5907	picked_weight_lb	0.0369	picked_weight_lb_to_oz	#DIV/0!
 6/25/2025	giant	BANANA	banana	produce	banana	1	LB						weight	1.29	1.29		1.29	line_total_over_qty			0.589	picked_weight_lb	0.0368	picked_weight_lb_to_oz	#DIV/0!
 2/14/2026	costco	BANANA	banana	produce	banana	1	E	3	lb		3	lb	weight	1.49	1.49	1.49	1.49	line_total_over_qty			0.4967	parsed_size_lb	0.031	parsed_size_lb_to_oz	0.496666667
 3/12/2026	costco	BANANA	banana	produce	banana	2	E	6	lb		3	lb	weight	2.98	1.49	2.98	1.49	line_total_over_qty			0.4967	parsed_size_lb	0.031	parsed_size_lb_to_oz	0.496666667
 purchase_date	retailer	normalized_item_name	catalog_name	category	product_type	qty	unit	normalized_quantity	normalized_quantity_unit	pack_qty	size_value	size_unit	measure_type	line_total	unit_price	net_line_total	price_per_each	price_per_each_basis	price_per_count	price_per_count_basis	price_per_lb	price_per_lb_basis	price_per_oz	price_per_oz_basis	effective_price
 9/9/2023	costco	BEEF PATTIES 6# BAG	beef patty	meat	hamburger	1	E	1	each				each	26.99	26.99	26.99	26.99	line_total_over_qty							26.99
 11/26/2025	giant	80% PATTIES PK12	beef patty	meat	hamburger	1	LB						weight	10.05	10.05		10.05	line_total_over_qty			7.7907	picked_weight_lb	0.4869	picked_weight_lb_to_oz	#DIV/0!
 purchase_date	retailer	normalized_item_name	catalog_name	category	product_type	qty	unit	normalized_quantity	normalized_quantity_unit	pack_qty	size_value	size_unit	measure_type	line_total	unit_price	net_line_total	price_per_each	price_per_each_basis	price_per_count	price_per_count_basis	price_per_lb	price_per_lb_basis	price_per_oz	price_per_oz_basis	effective_price
 5/26/2025	giant	BAGGED ICE	bagged ice cubes	frozen	ice	2	EA	40	lb		20	lb	weight	9.98	4.99		4.99	line_total_over_qty			0.2495	parsed_size_lb	0.0156	parsed_size_lb_to_oz	0
 6/12/2025	giant	BAG ICE CUBED	bagged ice cubes	frozen	ice	1	EA	10	lb		10	lb	weight	3.49	3.49		3.49	line_total_over_qty			0.349	parsed_size_lb	0.0218	parsed_size_lb_to_oz	0
 9/13/2025	giant	BAGGED ICE	bagged ice cubes	frozen	ice	2	EA	20	lb		10	lb	weight	6.98	3.49		3.49	line_total_over_qty			0.349	parsed_size_lb	0.0218	parsed_size_lb_to_oz	0
 10/10/2025	giant	BAGGED ICE	bagged ice cubes	frozen	ice	1	EA	20	lb		20	lb	weight	4.99	4.99		4.99	line_total_over_qty			0.2495	parsed_size_lb	0.0156	parsed_size_lb_to_oz	0
 ```
 ** evidence
 - commit: `605c944`
 - tests: `./venv/bin/python -m unittest tests.test_purchases` (fails as expected before implementation: missing `effective_price` in purchases rows)
 - datetime: 2026-03-23 12:52:32 EDT
 ** notes
 - Added purchases-level regression coverage for the known comparison cases before implementation: Giant banana, Costco banana, Giant bagged ice, Costco beef patties, and a blank-denominator case.
 - The current failure mode is the intended one for this task: `build_purchase_rows()` does not yet emit `effective_price`, so the tests document the missing behavior before `t1.18.1`.
 * [X] t1.18.1: fix effective price calculation precedence and blank handling (1-3 commits)
 correct purchases/effective price logic for the known broken cases using existing normalized fields
 ** acceptance criteria
 1. when generating `data/purchases.csv`, add `effective_price` = `effective_total` / `normalized_quantity`
 2. effective_price uses explicit numerator precedence:
   - prefer `net_line_total`
   - fallback to `line_total`
 3. effective_price uses `normalized_quantity` if not blank
 4. effective_price is blank when no valid denominator exists
 5. effective_price is never written as `0` or divide-by-zero for missing-basis cases
 6. effective_price is only comparable within same `normalized_quantity_unit` unless later analysis converts the units
 7. existing regression tests for bananas and ice pass
 - pm note: keep this limited to calculation logic; do not broaden into catalog or review changes
 ** evidence
 - commit: `dc0d061`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`
 - datetime: 2026-03-23 12:53:34 EDT
 ** notes
 - `effective_price` is now a downstream purchases field only. It does not replace `price_per_lb` / `price_per_each`; it gives one deterministic comparison value based on the existing normalized quantity basis.
 - The implemented precedence is: use non-zero `net_line_total` when present, otherwise `line_total`; divide by `normalized_quantity` when that denominator is > 0; otherwise leave blank.
 - This keeps the calculation conservative for mixed-quality data: Costco bananas and Giant bagged ice now compute correctly, while rows like Giant patties with no quantity basis stay blank instead of producing `0` or a divide-by-zero artifact.
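 The precedence in these notes can be sketched as below. The helper names and the rounding to four decimals are assumptions for illustration; the numerator and denominator rules follow the notes.

```python
from decimal import Decimal, InvalidOperation


def parse_number(value):
    """Parse a CSV cell into Decimal; blank or invalid cells become None."""
    try:
        return Decimal(str(value).strip())
    except InvalidOperation:
        return None


def effective_price(row):
    """Return effective_price as a string, or "" when it cannot be computed."""
    net = parse_number(row.get("net_line_total", ""))
    line = parse_number(row.get("line_total", ""))
    # Prefer a non-zero net_line_total, otherwise fall back to line_total.
    numerator = net if net is not None and net != 0 else line
    denominator = parse_number(row.get("normalized_quantity", ""))
    if numerator is None or denominator is None or denominator <= 0:
        # Blank, never 0 or a divide-by-zero artifact.
        return ""
    # Assumption: round to 4 decimals for readability in the CSV.
    return str((numerator / denominator).quantize(Decimal("0.0001")))
```

 Using the rows from the pm table: Costco bananas (`1.49 / 3`) give `0.4967`, Giant bagged ice (`9.98 / 40`, blank net total) gives `0.2495`, and Giant patties with no `normalized_quantity` stay blank.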
 * [X] t1.18.2: fix giant normalization quantity carry-through for weight-based items (1-3 commits)
 ensure giant normalization emits usable normalized quantity for known weight-based cases
 ** acceptance criteria
 1. giant bananas populate normalized quantity and unit from deterministic weight basis
 2. giant weight-based items that already produce `price_per_lb` also carry enough quantity basis for effective price calculation where supported
 3. existing regression tests pass without changing normalized_item_id behavior
 4. blanks are preserved only when no deterministic quantity basis exists
 - pm note: this task is about normalization carry-through, not fuzzy matching or catalog cleanup
 ** pm notes
 *** banana
 giant bananas have picked weight and price_per_oz, but `normalized_quantity` and `normalized_quantity_unit` are blank
 | purchase_date | retailer | normalized_item_name   | catalog_name | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | net_line_total | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis     | effective_price |
 | 8/6/2024      | costco   | BANANAS 3 LB / 1.36 KG | BANANA       |   1 | E    |                   3 | lb                       |          |          3 | lb        | weight       |       1.49 |       1.49 |           1.49 |           1.49 | line_total_over_qty  |                 |                       |       0.4967 | parsed_size_lb     |        0.031 | parsed_size_lb_to_oz   |           $0.50 |
 | 12/6/2024     | giant    | FRESH BANANA           | BANANA       |   1 | LB   |                     |                          |          |            |           | weight       |       0.99 |       0.99 |                |           0.99 | line_total_over_qty  |                 |                       |       0.5893 | picked_weight_lb   |       0.0368 | picked_weight_lb_to_oz |                 |
 | 12/12/2024    | giant    | FRESH BANANA           | BANANA       |   1 | LB   |                     |                          |          |            |           | weight       |       1.37 |       1.37 |                |           1.37 | line_total_over_qty  |                 |                       |       0.5905 | picked_weight_lb   |       0.0369 | picked_weight_lb_to_oz |                 |
 | 1/7/2025      | giant    | FRESH BANANA           | BANANA       |   1 | LB   |                     |                          |          |            |           | weight       |       1.44 |       1.44 |                |           1.44 | line_total_over_qty  |                 |                       |       0.5902 | picked_weight_lb   |       0.0369 | picked_weight_lb_to_oz |                 |
 | 1/24/2025     | costco   | BANANAS 3 LB / 1.36 KG | BANANA       |   1 | E    |                   3 | lb                       |          |          3 | lb        | weight       |       1.49 |       1.49 |           1.49 |           1.49 | line_total_over_qty  |                 |                       |       0.4967 | parsed_size_lb     |        0.031 | parsed_size_lb_to_oz   |          0.4967 |
 | 2/16/2025     | giant    | FRESH BANANA           | BANANA       |   2 | LB   |                     |                          |          |            |           | weight       |       2.54 |       1.27 |                |           1.27 | line_total_over_qty  |                 |                       |        0.588 | picked_weight_lb   |       0.0367 | picked_weight_lb_to_oz |                 |
 | 2/20/2025     | giant    | FRESH BANANA           | BANANA       |   1 | LB   |                     |                          |          |            |           | weight       |        1.4 |        1.4 |                |            1.4 | line_total_over_qty  |                 |                       |       0.5907 | picked_weight_lb   |       0.0369 | picked_weight_lb_to_oz |                 |
 | 6/25/2025     | giant    | FRESH BANANA           | BANANA       |   1 | LB   |                     |                          |          |            |           | weight       |       1.29 |       1.29 |                |           1.29 | line_total_over_qty  |                 |                       |        0.589 | picked_weight_lb   |       0.0368 | picked_weight_lb_to_oz |                 |
 | 2/14/2026     | costco   | BANANAS 3 LB / 1.36 KG | BANANA       |   1 | E    |                   3 | lb                       |          |          3 | lb        | weight       |       1.49 |       1.49 |           1.49 |           1.49 | line_total_over_qty  |                 |                       |       0.4967 | parsed_size_lb     |        0.031 | parsed_size_lb_to_oz   |          0.4967 |
 | 3/12/2026     | costco   | BANANAS 3 LB / 1.36 KG | BANANA       |   2 | E    |                   6 | lb                       |          |          3 | lb        | weight       |       2.98 |       1.49 |           2.98 |           1.49 | line_total_over_qty  |                 |                       |       0.4967 | parsed_size_lb     |        0.031 | parsed_size_lb_to_oz   |          0.4967 |
 *** beef patty
 giant beef patties sold by weight do not get an effective price
 | purchase_date | retailer | normalized_item_name | product_type | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | matched_discount_amount | net_line_total | store_name | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis     | effective_price |
 | 9/9/2023      | costco   | BEEF PATTIES 6# BAG  | hamburger    |   1 | E    |                   1 | each                     |          |            |           | each         |      26.99 |      26.99 |                         |          26.99 | MT VERNON  |          26.99 | line_total_over_qty  |                 |                       |              |                    |              |                        | $26.99          |
 | 11/26/2025    | giant    | PATTIES PK12         | hamburger    |   1 | LB   |                     |                          |          |            |           | weight       |      10.05 |      10.05 |                         |                | Giant Food |          10.05 | line_total_over_qty  |                 |                       |       7.7907 | picked_weight_lb   |       0.4869 | picked_weight_lb_to_oz |                 |
 ** evidence
 - commit: `23dfc3d` `Use picked weight for Giant quantity basis`
 - tests: `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_purchases`; `./venv/bin/python normalize_giant_web.py`; `./venv/bin/python build_purchases.py`
 - datetime: 2026-03-23 13:22:47 EDT
 ** notes
 - Giant loose-weight rows already had deterministic `picked_weight` and `price_per_lb`; this task reuses that basis when parsed size/pack is absent.
 - Parsed package size still wins when present, so fixed-size products keep their original comparison basis and `normalized_item_id` behavior does not change.
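 A rough sketch of the fallback order in these notes. The function name and signature are hypothetical, and it assumes `picked_weight` is the line's total picked weight in lb, which the task does not state explicitly.

```python
def giant_quantity_basis(qty, pack_qty, size_value, size_unit, picked_weight):
    """Return (normalized_quantity, unit); (None, "") when no basis exists."""
    if size_value and size_unit:
        # Parsed package size wins, so fixed-size products keep their basis.
        return qty * (pack_qty or 1) * size_value, size_unit
    if picked_weight and picked_weight > 0:
        # Assumption: picked_weight is the total weight for the line, in lb.
        return picked_weight, "lb"
    # No deterministic basis: leave blank.
    return None, ""
```

 For example, `giant_quantity_basis(2, None, 20.0, "lb", None)` keeps the parsed `40.0 lb` basis for bagged ice, while a loose banana row with a hypothetical picked weight of `1.68` falls back to `(1.68, "lb")`.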
 * [X] t1.18.3: fix costco normalization quantity carry-through for weight-based items (1-3 commits)
 ** acceptance criteria
 1. add regression tests covering known broken Costco quantity-basis cases before changing parser logic
 2. Costco normalization correctly parses explicit weight-bearing package text into normalized quantity fields for known cases such as:
   - `25# FLOUR ALL-PURPOSE HARV ...` -> `normalized_quantity=25`, `normalized_quantity_unit=lb`, `measure_type=weight`
 3. corrected Costco normalized rows carry through to `data/purchases.csv` without changing `normalized_item_id` behavior
 4. `effective_price` for corrected Costco rows uses the same rule already established for Giant:
   - use `net_line_total` when present, otherwise `line_total`
   - divide by `normalized_quantity` when `normalized_quantity > 0`
   - leave blank when no valid denominator exists
 5. rerun output verifies the broken Costco flour examples no longer behave like `each` items and now produce non-blank weight-based effective prices
 6. keep this task limited to the identified Costco parsing failures; do not broaden into catalog cleanup or fuzzy matching
 *** All Purpose Flour 
 Costco 25# FLOUR not parsed into normalized weight; measure_type says each
 | purchase_date | retailer | normalized_item_name               | catalog_name      | qty | unit | normalized_quantity | normalized_quantity_unit | pack_qty | size_value | size_unit | measure_type | line_total | unit_price | matched_discount_amount | net_line_total | store_name | price_per_each | price_per_each_basis | price_per_count | price_per_count_basis    | price_per_lb | price_per_lb_basis | price_per_oz | price_per_oz_basis   | effective_price | is_discount_line | is_coupon_line | is_fee | raw_order_path                                                         |   |
 | 9/9/2023      | costco   | 10LB BAKERS 4.5KG / 10 LB          | all purpose flour |   1 | E    |                  10 | lb                       |          |         10 | lb        | weight       |       5.99 |       5.99 |                         |           5.99 | VA         |           5.99 | line_total_over_qty  |                 |                          |        0.599 | parsed_size_lb     |       0.0374 | parsed_size_lb_to_oz |           $0.60 | FALSE            | FALSE          | FALSE  | data/costco-web/raw/21111500603752309091647-2023-09-09T16-47-00.json   |   |
 | 8/6/2024      | costco   | 10LB BAKERS 4.5KG / 10 LB          | all purpose flour |   1 | E    |                  10 | lb                       |          |         10 | lb        | weight       |       5.29 |       5.29 |                         |           5.29 | VA         |           5.29 | line_total_over_qty  |                 |                          |        0.529 | parsed_size_lb     |       0.0331 | parsed_size_lb_to_oz |           $0.53 | FALSE            | FALSE          | FALSE  | data/costco-web/raw/21111520101732408061704-2024-08-06T17-04-00.json   |   |
 | 11/29/2024    | costco   | 25# FLOUR ALL-PURPOSE HARV P98/100 | all purpose flour |   1 | E    |                   1 | each                     |          |            |           | each         |       8.79 |       8.79 |                         |           8.79 | VA         |           8.79 | line_total_over_qty  |                 |                          |              |                    |              |                      |           $8.79 | FALSE            | FALSE          | FALSE  | data/costco-web/raw/21111500803392411291626-2024-11-29T16-26-00.json   |   |
 | 12/14/2024    | costco   | KS ORG FLOUR 2/10 LB P112          | all purpose flour |   1 | E    |                  20 | lb                       |        2 |         10 | lb        | weight       |      17.99 |      17.99 |                         |          17.99 | VA         |          17.99 | line_total_over_qty  |           8.995 | line_total_over_pack_qty |       0.8995 | parsed_size_lb     |       0.0562 | parsed_size_lb_to_oz |          0.8995 | FALSE            | FALSE          | FALSE  | data/costco-web/raw/21111500301442412141209-2024-12-14T12-09-00.json   |   |
 | 12/14/2024    | costco   | 10LB BAKERS 4.5KG / 10 LB          | all purpose flour |   1 | E    |                  10 | lb                       |          |         10 | lb        | weight       |       5.49 |       5.49 |                         |           5.49 | VA         |           5.49 | line_total_over_qty  |                 |                          |        0.549 | parsed_size_lb     |       0.0343 | parsed_size_lb_to_oz |           0.549 | FALSE            | FALSE          | FALSE  | data/costco-web/raw/21111500301442412141209-2024-12-14T12-09-00.json   |   |
 | 1/10/2025     | costco   | 10LB BAKERS 4.5KG / 10 LB          | all purpose flour |   1 | E    |                  10 | lb                       |          |         10 | lb        | weight       |       5.49 |       5.49 |                         |           5.49 | VA         |           5.49 | line_total_over_qty  |                 |                          |        0.549 | parsed_size_lb     |       0.0343 | parsed_size_lb_to_oz |           0.549 | FALSE            | FALSE          | FALSE  | data/costco-web/raw/21111500702462501101630-2025-01-10T16-30-00.json   |   |
 | 1/10/2025     | costco   | KS ORG FLOUR 2/10 LB P112          | all purpose flour |   1 | E    |                  20 | lb                       |        2 |         10 | lb        | weight       |      17.99 |      17.99 |                         |          17.99 | VA         |          17.99 | line_total_over_qty  |           8.995 | line_total_over_pack_qty |       0.8995 | parsed_size_lb     |       0.0562 | parsed_size_lb_to_oz |          0.8995 | FALSE            | FALSE          | FALSE  | data/costco-web/raw/21111500702462501101630-2025-01-10T16-30-00.json   |   |
 | 1/31/2026     | giant    | SB FLOUR ALL PRPSE 5LB             | all purpose flour |   1 | EA   |                   5 | lb                       |          |          5 | lb        | weight       |       3.39 |       3.39 |                         |                | VA         |           3.39 | line_total_over_qty  |                 |                          |        0.678 | parsed_size_lb     |       0.0424 | parsed_size_lb_to_oz |           0.678 | FALSE            | FALSE          | FALSE  | data/giant-web/raw/697f42031c28e23df08d95f9.json                       |   |
 | 3/12/2026     | costco   | 25# FLOUR ALL-PURPOSE HARV P98/100 | all purpose flour |   1 | E    |                   1 | each                     |          |            |           | each         |       9.49 |       9.49 |                         |           9.49 | VA         |           9.49 | line_total_over_qty  |                 |                          |              |                    |              |                      |            9.49 | FALSE            | FALSE          | FALSE  | data/costco-web/raw/21111500804012603121616-2026-03-12T16-16-00.json   |   |
 ** evidence
 - commit: `7317611` `Fix Costco hash-size weight parsing`
 - tests: `./venv/bin/python -m unittest tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_costco_web.py`; `./venv/bin/python build_purchases.py`
 - datetime: 2026-03-23 13:56:38 EDT
 ** notes
 - Costco `25#` weight text was falling through to `each` because the hash-size parser missed sizes followed by whitespace.
 - This fix is intentionally narrow: explicit `#`-weight parsing now feeds the existing quantity and effective-price flow without changing `normalized_item_id` behavior.
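 One way the `#`-weight pattern could be matched, as a sketch only. The regex and function name are assumptions and are not taken from `normalize_costco_web.py`; the expected fields for `25# FLOUR ...` come from the acceptance criteria.

```python
import re

# A number immediately followed by "#", then whitespace or end of string.
HASH_WEIGHT = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)#(?=\s|$)")


def parse_hash_weight(item_name: str):
    """Return parsed weight fields for names like '25# FLOUR', else None."""
    match = HASH_WEIGHT.search(item_name)
    if match is None:
        return None
    # "#" is read as pounds here, matching the expected output for 25# FLOUR.
    return {"size_value": match.group(1), "size_unit": "lb", "measure_type": "weight"}
```

 Names without a hash size, such as `KS ORG FLOUR 2/10 LB P112`, return `None` and continue through the existing size parsing.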
 * [X] t1.18.4: clean purchases output and finalize effective price fields (2-4 commits)
 make `purchases.csv` easier to inspect and ensure price fields support weighted cost analysis
 ** acceptance criteria
 1. reorder `data/purchases.csv` columns for human inspection, with analysis fields first:
   - `purchase_date`
   - `retailer`
   - `catalog_name`
   - `product_type`
   - `category`
   - `net_line_total`
   - `normalized_quantity`
   - `effective_price`
   - `effective_price_unit`
   - followed by order/item/provenance fields
 2. populate `net_line_total` for all purchase rows:
   - preserve existing `net_line_total` when already populated
   - otherwise, derive `net_line_total = line_total + matched_discount_amount` when a discount exists
   - else `net_line_total = line_total`
 3. compute `effective_price` from `net_line_total / normalized_quantity` when `normalized_quantity > 0`
 4. add `effective_price_unit` and populate it consistently from the normalized quantity basis
 5. preserve blanks rather than writing `0` or divide-by-zero when no valid denominator exists
 - pm note: this task is about final purchase output correctness and usability, not review/catalog logic
 ** evidence
 - commit: `a45522c` `Finalize purchase effective price fields`
 - tests: `./venv/bin/python -m unittest tests.test_purchases`; `./venv/bin/python build_purchases.py`
 - datetime: 2026-03-23 15:27:42 EDT
 ** notes
 - `purchases.csv` now carries a filled `net_line_total` for every row, preserving existing values from normalization and deriving the rest from `line_total` plus matched discounts.
 - `effective_price_unit` now mirrors the normalized quantity basis, so downstream analysis can tell whether an `effective_price` is per `lb`, `oz`, `count`, or `each`.
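 The fill rule for `net_line_total` and the unit mirroring can be sketched like this. Function names are hypothetical, and the sketch assumes `matched_discount_amount` is stored as a negative number, which is why the criteria add it rather than subtract it.

```python
from decimal import Decimal


def fill_net_line_total(row):
    """Return net_line_total for a purchase row as a string."""
    existing = (row.get("net_line_total") or "").strip()
    if existing:
        # Preserve values already produced by normalization.
        return existing
    line_total = Decimal(row["line_total"])
    discount = (row.get("matched_discount_amount") or "").strip()
    if discount:
        # Assumption: discounts are negative, so addition lowers the total.
        return str(line_total + Decimal(discount))
    return str(line_total)


def effective_price_unit(row):
    """Mirror the normalized quantity unit only when a price was computed."""
    if (row.get("effective_price") or "").strip():
        return row.get("normalized_quantity_unit", "")
    return ""
```

 A Giant flour row with `line_total=3.39`, no discount, and a blank net total resolves to `3.39`; with a hypothetical `-0.50` matched discount it would resolve to `2.89`.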
 * [X] t1.19: make review_products.py robust to orphaned and incomplete catalog links (2-4 commits)
 refresh review state from the current normalized universe so missing or broken links re-enter review instead of silently disappearing
 ** acceptance criteria
 1. `review_products.py` regenerates review candidates from the current normalized item universe (`data/<provider>/normalized_items.csv`), not just previously queued items
 2. items are added or re-added to review when:
   - they have no valid `catalog_id`
   - their linked `catalog_id` no longer exists
   - their linked catalog row does not have both `catalog_name` and `product_type`
 3. `review_products.py` compares and reconciles:
   - current normalized items
   - current product_links
   - current catalog
   - current review_queue
 4. rerunning review after manual cleanup of `product_links.csv` or `catalog.csv` surfaces newly orphaned normalized items
 5. unresolved items remain visible and are not silently dropped from review or purchases accounting
 - pm note: keep the logic explicit and auditable; this is a refresh/reconciliation task, not a new matching system
 ** evidence
 - commit: `8ccf3ff` `Reconcile review queue against current catalog state`
 - tests: `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`
 - datetime: 2026-03-23 15:32:29 EDT
 ** notes
 - `review_products.py` now rebuilds its queue from the current normalized files and order files instead of trusting stale `purchases.csv` state.
 - Missing catalog rows and incomplete catalog rows now re-enter review explicitly as `orphaned_catalog_link` or `incomplete_catalog_link`, and excluded rows no longer inflate unresolved-not-in-review accounting.
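 The reconciliation check can be pictured as a small classifier. This is a sketch under assumptions: the function name, the in-memory dict shapes, and the `missing_catalog_link` label are hypothetical. The `orphaned_catalog_link` and `incomplete_catalog_link` labels come from the notes above.

```python
def review_reason(normalized_item_id, links, catalog):
    """Return why an item needs review, or "" when its link is healthy.

    links:   normalized_item_id -> catalog_id (from product_links.csv)
    catalog: catalog_id -> catalog row dict (from catalog.csv)
    """
    catalog_id = links.get(normalized_item_id, "")
    if not catalog_id:
        # Hypothetical label for items that were never linked.
        return "missing_catalog_link"
    row = catalog.get(catalog_id)
    if row is None:
        return "orphaned_catalog_link"
    if not (row.get("catalog_name") and row.get("product_type")):
        return "incomplete_catalog_link"
    return ""
```

 Running this over every current normalized item, rather than over the previous queue, is what lets manual deletions in `product_links.csv` or `catalog.csv` resurface as review work.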
 * [X] t1.20: add visit-level fields and outputs for spend analysis (2-4 commits)
 ensure purchases retains enough visit/order context to support spend-by-visit and store-level analysis
 ** acceptance criteria
 1. `data/purchases.csv` retains or adds the visit/order fields needed for visit analysis:
   - `order_id`
   - `purchase_date`
   - `store_name`
   - `store_number`
   - `store_city`
   - `store_state`
   - `retailer`
 2. purchases output supports these analyses without additional joins:
   - spend by visit
   - items per visit
   - category spend by visit
   - retailer/store breakdown
 3. documentation or task notes make clear that `purchases.csv` is the primary analysis artifact for both item-level and visit-level reporting
 - pm note: do not build dash/plotly here; this task is only about carrying the right data through
 ** evidence
 - commit: `6940f16` `Document visit-level purchase analysis`
 - tests: `./venv/bin/python -m unittest tests.test_purchases`; `./venv/bin/python build_purchases.py`
 - datetime: 2026-03-24 08:29:13 EDT
 ** notes
 - The needed visit fields were already flowing through `build_purchases.py`; this task locked them in with explicit tests and documentation instead of adding a new visit layer.
 - `data/analysis/purchases.csv` is now documented as the primary analysis artifact for both item-level and visit-level work.
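 As an illustration of visit-level analysis without extra joins, the sketch below totals spend and item count per visit. The function name and the choice of visit key are assumptions; the column names are the ones listed in the acceptance criteria.

```python
from collections import defaultdict
from decimal import Decimal


def spend_by_visit(rows):
    """Aggregate spend and line count per (retailer, order_id, purchase_date)."""
    totals = defaultdict(lambda: {"spend": Decimal("0"), "items": 0})
    for row in rows:
        key = (row["retailer"], row["order_id"], row["purchase_date"])
        totals[key]["spend"] += Decimal(row["net_line_total"])
        totals[key]["items"] += 1
    return dict(totals)
```

 The rows can come straight from `csv.DictReader` over `purchases.csv`. Two Costco flour lines on the same visit (`17.99` and `5.49`) would total `23.48` across `2` items.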
 * [X] t1.21: add lightweight charting/analysis surface on top of purchases.csv (2-4 commits)
 build a minimal analysis layer for common price and visit charts without changing the csv pipeline
 ** acceptance criteria
 1. support charting of:
   - item price over time
   - spend by visit
   - items per visit
   - category spend over time
   - retailer/store comparison
 2. use `data/purchases.csv` as the source of truth
 3. keep excel/pivot compatibility intact
 - pm note: thin reader layer only; do not move business logic out of the pipeline
 ** evidence
 - commit: `46a3b2c` `Add purchase analysis summaries`
 - tests: `./venv/bin/python -m unittest tests.test_analyze_purchases tests.test_purchases`; `./venv/bin/python analyze_purchases.py`
 - datetime: 2026-03-24 16:48:41 EDT
 ** notes
 - The new layer is file-based, not notebook- or dashboard-based: `analyze_purchases.py` reads `data/analysis/purchases.csv` and writes chart-ready CSVs under `data/analysis/`.
 - This keeps Excel/pivot workflows intact while still giving a repeatable CLI path for common price, visit, category, and retailer/store summaries.
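 A thin reader for the item-price-over-time chart might look like the sketch below. It is not the code in `analyze_purchases.py`; the function name and output shape are assumptions. It groups by `effective_price_unit` so prices on different bases are never mixed, and it skips rows with a blank `effective_price`.

```python
from collections import defaultdict


def price_series(rows, catalog_name):
    """Return {unit: [(purchase_date, retailer, effective_price), ...]}."""
    series = defaultdict(list)
    for row in rows:
        if row.get("catalog_name") != catalog_name:
            continue
        price = (row.get("effective_price") or "").strip()
        if not price:
            # Blank means no valid denominator; leave it out of the chart.
            continue
        unit = row.get("effective_price_unit", "")
        series[unit].append((row["purchase_date"], row["retailer"], float(price)))
    return dict(series)
```

 Because this only filters and reshapes existing columns, the business logic stays in the pipeline and the CSV remains usable in Excel pivots.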
 * [X] t1.22: clean up and finalize the post-refactor layout before merging refactor/enrich into cx (3-6 commits)
 remove transitional detritus from the repo and make the final folder/script layout explicit before merging back into `cx`
 ** acceptance criteria
 1. move `catalog.csv` alongside the other step-3 review artifacts under `data/review/`
   - update active scripts, tests, docs, and task notes to match the chosen path
 2. promote analysis to a top-level step-4 folder such as `data/analysis/`
   - add `purchases.csv` to this folder
   - update active scripts, tests, docs, and task notes to match the chosen path
 3. remove obsolete or superseded Python files
   - includes old `scrape_*`, `enrich_*`, `build_*`, and proof/check scripts as appropriate
   - do not remove files still required by the active collect/normalize/review/analysis pipeline
 4. active repo entrypoints are reduced to the intended flow and are easy to identify, including:
   - retailer collection
   - retailer normalization
   - review/combine
   - status/reporting
   - analysis
 5. tests pass after removals and path decisions
 6. README reflects the final post-refactor structure and run order without legacy ambiguity
 7. `pm/data-model.org` and `pm/tasks.org` reflect the final chosen layout
 - pm note: prefer deleting true detritus over keeping compatibility shims now that the refactor path is established
 - pm note: make folder decisions once here so we stop carrying path churn into later tasks
 ** evidence
 - commit: `09829b2` `Finalize post-refactor layout and remove old pipeline files`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python analyze_purchases.py`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`
 - datetime: 2026-03-24 17:09:45 EDT
 ** notes
 - Final layout decision: `catalog.csv` now lives under `data/review/`, while `purchases.csv` and the chart-ready analysis outputs live under the step-4 `data/analysis/` folder.
 - Removed obsolete top-level pipeline files and their dead tests so the active entrypoints are now the collect, normalize, review/combine, status, and analysis scripts only.
 * [X] t1.22.1: remove unneeded python deps
 ** acceptance criteria
 1. update `requirements.txt`: add missing Python libs and remove unneeded ones
 2. keep only direct runtime deps in requirements.txt; transitive deps should not be pinned unless imported directly
 ** evidence
 - commit: `867275c` `Trim requirements to direct runtime deps`
 - tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python analyze_purchases.py --help`
 - datetime: 2026-03-24 17:25:39 EDT
 ** notes
 - `requirements.txt` now keeps only direct runtime deps imported by the active pipeline: `browser-cookie3`, `click`, `curl_cffi`, and `python-dotenv`.
 - Low-level support packages such as `cffi`, `jeepney`, `lz4`, `pycryptodomex`, and `certifi` are left to transitive installation instead of being pinned directly.
 * [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
 ** acceptance criteria
 - llm suggestions are generated only for unresolved normalized retailer items
 - llm outputs are stored as suggestions, not auto-applied truth
 - reviewer can approve/edit/reject suggestions
--- a/report_pipeline_status.py
+++ b/report_pipeline_status.py
@@ -0,0 +1,129 @@
 import json
 from pathlib import Path
 import click
 import build_purchases
 import review_products
 from layer_helpers import read_csv_rows, write_csv_rows
 SUMMARY_FIELDS = ["stage", "count"]
 def read_rows_if_exists(path):
    path = Path(path)
    if not path.exists():
        return []
    return read_csv_rows(path)
 def build_status_summary(
    giant_orders,
    giant_items,
    giant_enriched,
    costco_orders,
    costco_items,
    costco_enriched,
    purchases,
    resolutions,
    links,
    catalog,
 ):
    normalized_rows = giant_enriched + costco_enriched
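    # Rebuild the review queue in memory (no prior queue rows) to see which unresolved items it covers.
    queue_rows = review_products.build_review_queue(purchases, resolutions, links, catalog, [])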
    queue_ids = {row["normalized_item_id"] for row in queue_rows}
    unresolved_purchase_rows = [
        row
        for row in purchases
        if row.get("normalized_item_id")
        and not row.get("catalog_id")
        and row.get("resolution_action") != "exclude"
        and row.get("is_fee") != "true"
        and row.get("is_discount_line") != "true"
        and row.get("is_coupon_line") != "true"
    ]
    excluded_rows = [row for row in purchases if row.get("resolution_action") == "exclude"]
    linked_purchase_rows = [row for row in purchases if row.get("catalog_id")]
    distinct_normalized_items = {
        row["normalized_item_id"] for row in normalized_rows if row.get("normalized_item_id")
    }
    linked_normalized_items = {
        row["normalized_item_id"] for row in purchases if row.get("normalized_item_id") and row.get("catalog_id")
    }
    summary = [
        {"stage": "raw_orders", "count": len(giant_orders) + len(costco_orders)},
        {"stage": "raw_items", "count": len(giant_items) + len(costco_items)},
        {"stage": "normalized_items", "count": len(normalized_rows)},
        {"stage": "distinct_normalized_items", "count": len(distinct_normalized_items)},
        {"stage": "review_queue_normalized_items", "count": len(queue_rows)},
        {"stage": "linked_normalized_items", "count": len(linked_normalized_items)},
        {"stage": "linked_purchase_rows", "count": len(linked_purchase_rows)},
        {"stage": "final_purchase_rows", "count": len(purchases)},
        {"stage": "unresolved_purchase_rows", "count": len(unresolved_purchase_rows)},
        {"stage": "excluded_purchase_rows", "count": len(excluded_rows)},
        {
            "stage": "unresolved_not_in_review_rows",
            "count": len(
                [
                    row
                    for row in unresolved_purchase_rows
                    if row.get("normalized_item_id") not in queue_ids
                ]
            ),
        },
    ]
    return summary
@click.command()
@click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
@click.option("--giant-items-csv", default="data/giant-web/collected_items.csv", show_default=True)
@click.option("--giant-enriched-csv", default="data/giant-web/normalized_items.csv", show_default=True)
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--costco-items-csv", default="data/costco-web/collected_items.csv", show_default=True)
@click.option("--costco-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
@click.option("--summary-csv", default="data/review/pipeline_status.csv", show_default=True)
@click.option("--summary-json", default="data/review/pipeline_status.json", show_default=True)
 def main(
    giant_orders_csv,
    giant_items_csv,
    giant_enriched_csv,
    costco_orders_csv,
    costco_items_csv,
    costco_enriched_csv,
    purchases_csv,
    resolutions_csv,
    links_csv,
    catalog_csv,
    summary_csv,
    summary_json,
 ):
    summary_rows = build_status_summary(
        read_rows_if_exists(giant_orders_csv),
        read_rows_if_exists(giant_items_csv),
        read_rows_if_exists(giant_enriched_csv),
        read_rows_if_exists(costco_orders_csv),
        read_rows_if_exists(costco_items_csv),
        read_rows_if_exists(costco_enriched_csv),
        read_rows_if_exists(purchases_csv),
        [build_purchases.normalize_resolution_row(row) for row in read_rows_if_exists(resolutions_csv)],
        [build_purchases.normalize_link_row(row) for row in read_rows_if_exists(links_csv)],
        [build_purchases.normalize_catalog_row(row) for row in read_rows_if_exists(catalog_csv)],
    )
    write_csv_rows(summary_csv, summary_rows, SUMMARY_FIELDS)
    summary_json_path = Path(summary_json)
    summary_json_path.parent.mkdir(parents=True, exist_ok=True)
    summary_json_path.write_text(json.dumps(summary_rows, indent=2), encoding="utf-8")
    for row in summary_rows:
        click.echo(f"{row['stage']}: {row['count']}")
 if __name__ == "__main__":
    main()
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,10 +1,4 @@
 browser-cookie3==0.20.1
 certifi==2026.2.25
 cffi==2.0.0
 click==8.3.1
 curl_cffi==0.14.0
 jeepney==0.9.0
 lz4==4.4.5
 pycparser==3.0
 pycryptodomex==3.23.0
 python-dotenv==1.1.1
--- a/review_products.py
+++ b/review_products.py
@@ -0,0 +1,670 @@
 from collections import defaultdict
 from datetime import date
 import re
 import click
 import build_purchases
 from layer_helpers import compact_join, stable_id, write_csv_rows
 QUEUE_FIELDS = [
    "review_id",
    "retailer",
    "normalized_item_id",
    "catalog_id",
    "reason_code",
    "priority",
    "raw_item_names",
    "normalized_names",
    "upc_values",
    "example_prices",
    "seen_count",
    "status",
    "resolution_action",
    "resolution_notes",
    "created_at",
    "updated_at",
 ]
 INFO_COLOR = "cyan"
 PROMPT_COLOR = "bright_yellow"
 WARNING_COLOR = "magenta"
 TOKEN_RE = re.compile(r"[A-Z0-9]+")
 REQUIRED_CATALOG_FIELDS = ("catalog_name", "product_type")
 def print_intro_text():
    click.secho("Review guide:", fg=INFO_COLOR)
    click.echo("  catalog name: unique product identity including variant, but not packaging")
    click.echo("  product type: general product you want to compare across purchases")
    click.echo("  category: broad analysis bucket such as dairy, produce, or frozen")
 def has_complete_catalog_row(catalog_row):
    if not catalog_row:
        return False
    return all(catalog_row.get(field, "").strip() for field in REQUIRED_CATALOG_FIELDS)
 def load_queue_lookup(queue_rows):
    lookup = {}
    for row in queue_rows:
        normalized_item_id = row.get("normalized_item_id", "")
        if normalized_item_id:
            lookup[normalized_item_id] = row
    return lookup
 def build_review_queue(
    purchase_rows,
    resolution_rows,
    link_rows=None,
    catalog_rows=None,
    existing_queue_rows=None,
 ):
    by_normalized = defaultdict(list)
    resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
    link_lookup = build_purchases.load_link_lookup(link_rows or [])
    catalog_lookup = {
        row.get("catalog_id", ""): build_purchases.normalize_catalog_row(row)
        for row in (catalog_rows or [])
        if row.get("catalog_id", "")
    }
    queue_lookup = load_queue_lookup(existing_queue_rows or [])
    for row in purchase_rows:
        normalized_item_id = row.get("normalized_item_id", "")
        if not normalized_item_id:
            continue
        by_normalized[normalized_item_id].append(row)
    today_text = str(date.today())
    queue_rows = []
    for normalized_item_id, rows in sorted(by_normalized.items()):
        current_resolution = resolution_lookup.get(normalized_item_id, {})
        if current_resolution.get("status") == "approved" and current_resolution.get("resolution_action") == "exclude":
            continue
        existing_queue_row = queue_lookup.get(normalized_item_id, {})
        linked_catalog_id = current_resolution.get("catalog_id") or link_lookup.get(normalized_item_id, {}).get("catalog_id", "")
        linked_catalog_row = catalog_lookup.get(linked_catalog_id, {})
        has_valid_catalog_link = bool(linked_catalog_id and has_complete_catalog_row(linked_catalog_row))
        unresolved_rows = [
            row
            for row in rows
            if row.get("is_item", "true") != "false"
            and row.get("is_fee") != "true"
            and row.get("is_discount_line") != "true"
            and row.get("is_coupon_line") != "true"
        ]
        # Skip items with no reviewable rows or with a link to a complete catalog row.
        if not unresolved_rows or has_valid_catalog_link:
            continue
        retailers = sorted({row["retailer"] for row in rows})
        review_id = stable_id("rvw", normalized_item_id)
        reason_code = "missing_catalog_link"
        if linked_catalog_id and linked_catalog_id not in catalog_lookup:
            reason_code = "orphaned_catalog_link"
        elif linked_catalog_id and not has_complete_catalog_row(linked_catalog_row):
            reason_code = "incomplete_catalog_link"
        queue_rows.append(
            {
                "review_id": review_id,
                "retailer": " | ".join(retailers),
                "normalized_item_id": normalized_item_id,
                "catalog_id": linked_catalog_id,
                "reason_code": reason_code,
                "priority": "high",
                "raw_item_names": compact_join(
                    sorted({row["raw_item_name"] for row in rows if row["raw_item_name"]}),
                    limit=8,
                ),
                "normalized_names": compact_join(
                    sorted(
                        {
                            row["normalized_item_name"]
                            for row in rows
                            if row["normalized_item_name"]
                        }
                    ),
                    limit=8,
                ),
                "upc_values": compact_join(
                    sorted({row["upc"] for row in rows if row["upc"]}),
                    limit=8,
                ),
                "example_prices": compact_join(
                    sorted({row["line_total"] for row in rows if row["line_total"]}),
                    limit=8,
                ),
                "seen_count": str(len(rows)),
                "status": existing_queue_row.get("status") or current_resolution.get("status", "pending"),
                "resolution_action": existing_queue_row.get("resolution_action")
                or current_resolution.get("resolution_action", ""),
                "resolution_notes": existing_queue_row.get("resolution_notes")
                or current_resolution.get("resolution_notes", ""),
                "created_at": existing_queue_row.get("created_at")
                or current_resolution.get("reviewed_at", today_text),
                "updated_at": today_text,
            }
        )
    return queue_rows
 def save_resolution_rows(path, rows):
    write_csv_rows(path, rows, build_purchases.RESOLUTION_FIELDS)
 def save_catalog_rows(path, rows):
    write_csv_rows(path, rows, build_purchases.CATALOG_FIELDS)
 def save_link_rows(path, rows):
    write_csv_rows(path, rows, build_purchases.PRODUCT_LINK_FIELDS)
 def sort_related_items(rows):
    return sorted(
        rows,
        key=lambda row: (
            row.get("purchase_date", ""),
            row.get("order_id", ""),
            int(row.get("line_no", "0") or "0"),
        ),
        reverse=True,
    )
 def tokenize_match_text(*values):
    tokens = set()
    for value in values:
        tokens.update(TOKEN_RE.findall((value or "").upper()))
    return tokens
 def build_catalog_suggestions(related_rows, purchase_rows, catalog_rows, limit=3):
    normalized_names = {
        row.get("normalized_item_name", "").strip().upper()
        for row in related_rows
        if row.get("normalized_item_name", "").strip()
    }
    upcs = {
        row.get("upc", "").strip()
        for row in related_rows
        if row.get("upc", "").strip()
    }
    catalog_by_id = {
        row.get("catalog_id", ""): row for row in catalog_rows if row.get("catalog_id", "")
    }
    suggestions = []
    seen_ids = set()
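    # Record each catalog_id at most once; returns True once `limit` suggestions exist so callers can stop early.
    def add_catalog_id(catalog_id, reason):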
        if not catalog_id or catalog_id in seen_ids or catalog_id not in catalog_by_id:
            return False
        seen_ids.add(catalog_id)
        catalog_row = catalog_by_id[catalog_id]
        suggestions.append(
            {
                "catalog_id": catalog_id,
                "catalog_name": catalog_row.get("catalog_name", ""),
                "reason": reason,
            }
        )
        return len(suggestions) >= limit
    reviewed_purchase_rows = [
        row for row in purchase_rows if row.get("catalog_id") and row.get("normalized_item_id")
    ]
    for row in reviewed_purchase_rows:
        if row.get("upc", "").strip() and row.get("upc", "").strip() in upcs:
            if add_catalog_id(row.get("catalog_id", ""), "exact upc"):
                return suggestions
    for row in reviewed_purchase_rows:
        if row.get("normalized_item_name", "").strip().upper() in normalized_names:
            if add_catalog_id(row.get("catalog_id", ""), "exact normalized name"):
                return suggestions
    for catalog_row in catalog_rows:
        catalog_name = catalog_row.get("catalog_name", "").strip().upper()
        if not catalog_name:
            continue
        for normalized_name in normalized_names:
            if normalized_name in catalog_name or catalog_name in normalized_name:
                if add_catalog_id(catalog_row.get("catalog_id", ""), "catalog name contains match"):
                    return suggestions
                break
    return suggestions
 def search_catalog_rows(query, catalog_rows, purchase_rows, current_normalized_item_id, limit=10):
    query_tokens = tokenize_match_text(query)
    if not query_tokens:
        return []
    linked_purchase_counts = defaultdict(int)
    linked_normalized_ids = defaultdict(set)
    current_catalog_id = ""
    for row in purchase_rows:
        catalog_id = row.get("catalog_id", "")
        normalized_item_id = row.get("normalized_item_id", "")
        if catalog_id and normalized_item_id:
            linked_purchase_counts[catalog_id] += 1
            linked_normalized_ids[catalog_id].add(normalized_item_id)
        if normalized_item_id == current_normalized_item_id and catalog_id:
            current_catalog_id = catalog_id
    ranked_rows = []
    for row in catalog_rows:
        catalog_id = row.get("catalog_id", "")
        if not catalog_id or catalog_id == current_catalog_id:
            continue
        catalog_tokens = tokenize_match_text(
            row.get("catalog_name", ""),
            row.get("product_type", ""),
            row.get("variant", ""),
        )
        overlap = query_tokens & catalog_tokens
        if not overlap:
            continue
        ranked_rows.append(
            {
                "catalog_id": catalog_id,
                "catalog_name": row.get("catalog_name", ""),
                "product_type": row.get("product_type", ""),
                "category": row.get("category", ""),
                "variant": row.get("variant", ""),
                "linked_normalized_items": len(linked_normalized_ids.get(catalog_id, set())),
                "linked_purchase_rows": linked_purchase_counts.get(catalog_id, 0),
                "score": len(overlap),
            }
        )
    ranked_rows.sort(
        key=lambda row: (-row["score"], row["catalog_name"], row["catalog_id"])
    )
    return ranked_rows[:limit]
 def suggestion_display_rows(suggestions, purchase_rows, catalog_rows):
    linked_purchase_counts = defaultdict(int)
    linked_normalized_ids = defaultdict(set)
    for row in purchase_rows:
        catalog_id = row.get("catalog_id", "")
        normalized_item_id = row.get("normalized_item_id", "")
        if not catalog_id or not normalized_item_id:
            continue
        linked_purchase_counts[catalog_id] += 1
        linked_normalized_ids[catalog_id].add(normalized_item_id)
    display_rows = []
    catalog_details = {
        row["catalog_id"]: {
            "product_type": row.get("product_type", ""),
            "category": row.get("category", ""),
        }
        for row in catalog_rows
        if row.get("catalog_id")
    }
    for row in purchase_rows:
        if row.get("catalog_id"):
            catalog_details.setdefault(
                row["catalog_id"],
                {
                    "product_type": row.get("product_type", ""),
                    "category": row.get("category", ""),
                },
            )
    for row in suggestions:
        catalog_id = row["catalog_id"]
        details = catalog_details.get(catalog_id, {})
        display_rows.append(
            {
                **row,
                "product_type": details.get("product_type", ""),
                "category": details.get("category", ""),
                "linked_purchase_rows": linked_purchase_counts.get(catalog_id, 0),
                "linked_normalized_items": len(linked_normalized_ids.get(catalog_id, set())),
            }
        )
    return display_rows
 def print_catalog_rows(rows):
    for index, row in enumerate(rows, start=1):
        click.echo(
            f" [{index}] {row['catalog_name']}, {row.get('product_type', '')}, "
            f"{row.get('category', '')} ({row['linked_normalized_items']} items, "
            f"{row['linked_purchase_rows']} rows)"
        )
 def build_display_lines(related_rows):
    lines = []
    for index, row in enumerate(sort_related_items(related_rows), start=1):
        lines.append(
            " [{index}] {raw_item_name} | {retailer} | {purchase_date} | {line_total} | {image_url}".format(
                index=index,
                raw_item_name=row.get("raw_item_name", ""),
                retailer=row.get("retailer", ""),
                purchase_date=row.get("purchase_date", ""),
                line_total=row.get("line_total", ""),
                image_url=row.get("image_url", ""),
            )
        )
    if not lines:
        lines.append(" [1] no matched item rows found")
    return lines
 def normalized_label(queue_row, related_rows):
    if queue_row.get("normalized_names"):
        return queue_row["normalized_names"].split(" | ")[0]
    for row in related_rows:
        if row.get("normalized_item_name"):
            return row["normalized_item_name"]
    return queue_row.get("normalized_item_id", "")
 def choose_existing_catalog(display_rows, normalized_name, matched_count):
    click.secho(
        f"Select the catalog_name to associate {matched_count} items with:",
        fg=INFO_COLOR,
    )
    print_catalog_rows(display_rows)
    choice = click.prompt(
        click.style("selection", fg=PROMPT_COLOR),
        type=click.IntRange(1, len(display_rows)),
    )
    chosen_row = display_rows[choice - 1]
    click.echo(
        f'{matched_count} "{normalized_name}" items and future matches will be associated '
        f'with "{chosen_row["catalog_name"]}".'
    )
    click.secho("actions: [y]es  [n]o  [b]ack  [s]kip  [q]uit", fg=PROMPT_COLOR)
    confirm = click.prompt(
        click.style("confirm", fg=PROMPT_COLOR),
        type=click.Choice(["y", "n", "b", "s", "q"]),
    )
    if confirm == "y":
        return chosen_row["catalog_id"], ""
    if confirm == "s":
        return "", "skip"
    if confirm == "q":
        return "", "quit"
    return "", "back"
 def prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total):
    suggestions = suggestion_display_rows(
        build_catalog_suggestions(related_rows, purchase_rows, catalog_rows),
        purchase_rows,
        catalog_rows,
    )
    normalized_name = normalized_label(queue_row, related_rows)
    matched_count = len(related_rows)
    click.echo("")
    click.secho(
        f"Review {queue_index}/{queue_total}: {normalized_name}",
        fg=INFO_COLOR,
    )
    click.echo(f"{matched_count} matched items:")
    for line in build_display_lines(related_rows):
        click.echo(line)
    if suggestions:
        click.echo(f"{len(suggestions)} catalog_name suggestions found:")
        print_catalog_rows(suggestions)
    else:
        click.echo("no catalog_name suggestions found")
    prompt_bits = []
    if suggestions:
        prompt_bits.append("[#] link to suggestion")
    prompt_bits.extend(["[f]ind", "[n]ew", "[s]kip", "e[x]clude", "[q]uit"])
    click.secho("  ".join(prompt_bits) + " >", fg=PROMPT_COLOR)
    action = click.prompt("", type=str, prompt_suffix=" ").strip().lower()
    if action.isdigit() and suggestions:
        choice = int(action)
        if 1 <= choice <= len(suggestions):
            chosen_row = suggestions[choice - 1]
            notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
            return {
                "normalized_item_id": queue_row["normalized_item_id"],
                "catalog_id": chosen_row["catalog_id"],
                "resolution_action": "link",
                "status": "approved",
                "resolution_notes": notes,
                "reviewed_at": str(date.today()),
            }, None
        click.secho("invalid suggestion number", fg=WARNING_COLOR)
        return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
    if action == "q":
        return None, None
    if action == "s":
        return {
            "normalized_item_id": queue_row["normalized_item_id"],
            "catalog_id": "",
            "resolution_action": "skip",
            "status": "pending",
            "resolution_notes": queue_row.get("resolution_notes", ""),
            "reviewed_at": str(date.today()),
        }, None
    if action == "f":
        while True:
            query = click.prompt(click.style("search", fg=PROMPT_COLOR), default="", show_default=False).strip()
            if not query:
                return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
            search_rows = search_catalog_rows(
                query,
                catalog_rows,
                purchase_rows,
                queue_row["normalized_item_id"],
            )
            if not search_rows:
                click.echo("no matches found")
                retry = click.prompt(
                    click.style("search again? [enter=yes, q=no]", fg=PROMPT_COLOR),
                    default="",
                    show_default=False,
                ).strip().lower()
                if retry == "q":
                    return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
                continue
            click.echo(f"{len(search_rows)} search results found:")
            print_catalog_rows(search_rows)
            choice = click.prompt(
                click.style("selection", fg=PROMPT_COLOR),
                type=click.IntRange(1, len(search_rows)),
            )
            chosen_row = search_rows[choice - 1]
            notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
            return {
                "normalized_item_id": queue_row["normalized_item_id"],
                "catalog_id": chosen_row["catalog_id"],
                "resolution_action": "link",
                "status": "approved",
                "resolution_notes": notes,
                "reviewed_at": str(date.today()),
            }, None
    if action == "x":
        notes = click.prompt(click.style("exclude notes", fg=PROMPT_COLOR), default="", show_default=False)
        return {
            "normalized_item_id": queue_row["normalized_item_id"],
            "catalog_id": "",
            "resolution_action": "exclude",
            "status": "approved",
            "resolution_notes": notes,
            "reviewed_at": str(date.today()),
        }, None
    if action != "n":
        click.secho("invalid action", fg=WARNING_COLOR)
        return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
    catalog_name = click.prompt(click.style("catalog name", fg=PROMPT_COLOR), type=str)
    product_type = click.prompt(click.style("product type", fg=PROMPT_COLOR), default="", show_default=False)
    category = click.prompt(click.style("category", fg=PROMPT_COLOR), default="", show_default=False)
    notes = click.prompt(click.style("notes", fg=PROMPT_COLOR), default="", show_default=False)
    catalog_id = stable_id("cat", f"manual|{catalog_name}|{category}|{product_type}")
    catalog_row = {
        "catalog_id": catalog_id,
        "catalog_name": catalog_name,
        "category": category,
        "product_type": product_type,
        "brand": "",
        "variant": "",
        "size_value": "",
        "size_unit": "",
        "pack_qty": "",
        "measure_type": "",
        "notes": notes,
        "created_at": str(date.today()),
        "updated_at": str(date.today()),
    }
    resolution_row = {
        "normalized_item_id": queue_row["normalized_item_id"],
        "catalog_id": catalog_id,
        "resolution_action": "create",
        "status": "approved",
        "resolution_notes": notes,
        "reviewed_at": str(date.today()),
    }
    return resolution_row, catalog_row
 def apply_resolution_to_queue(queue_rows, resolution_lookup):
    today_text = str(date.today())
    updated_rows = []
    for row in queue_rows:
        resolution = resolution_lookup.get(row["normalized_item_id"], {})
        row_copy = dict(row)
        if resolution:
            row_copy["catalog_id"] = resolution.get("catalog_id", "")
            row_copy["status"] = resolution.get("status", row_copy.get("status", "pending"))
            row_copy["resolution_action"] = resolution.get("resolution_action", "")
            row_copy["resolution_notes"] = resolution.get("resolution_notes", "")
            row_copy["updated_at"] = resolution.get("reviewed_at", today_text)
            if resolution.get("status") == "approved":
                row_copy["created_at"] = row_copy.get("created_at") or resolution.get("reviewed_at", today_text)
        updated_rows.append(row_copy)
    return updated_rows
 def link_rows_from_state(link_lookup):
    return sorted(link_lookup.values(), key=lambda row: row["normalized_item_id"])
@click.command()
@click.option("--giant-items-enriched-csv", default="data/giant-web/normalized_items.csv", show_default=True)
@click.option("--costco-items-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--purchases-csv", default="data/analysis/purchases.csv", show_default=True)
@click.option("--queue-csv", default="data/review/review_queue.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="data/review/catalog.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--limit", default=0, show_default=True, type=int)
@click.option("--refresh-only", is_flag=True, help="Only rebuild review_queue.csv without prompting.")
 def main(
    giant_items_enriched_csv,
    costco_items_enriched_csv,
    giant_orders_csv,
    costco_orders_csv,
    purchases_csv,
    queue_csv,
    resolutions_csv,
    catalog_csv,
    links_csv,
    limit,
    refresh_only,
 ):
    resolution_rows = build_purchases.read_optional_csv_rows(resolutions_csv)
    catalog_rows = build_purchases.merge_catalog_rows(build_purchases.read_optional_csv_rows(catalog_csv), [])
    link_rows = build_purchases.read_optional_csv_rows(links_csv)
    purchase_rows, refreshed_link_rows = build_purchases.build_purchase_rows(
        build_purchases.read_optional_csv_rows(giant_items_enriched_csv),
        build_purchases.read_optional_csv_rows(costco_items_enriched_csv),
        build_purchases.read_optional_csv_rows(giant_orders_csv),
        build_purchases.read_optional_csv_rows(costco_orders_csv),
        resolution_rows,
        link_rows,
        catalog_rows,
    )
    build_purchases.write_csv_rows(purchases_csv, purchase_rows, build_purchases.PURCHASE_FIELDS)
    link_lookup = build_purchases.load_link_lookup(refreshed_link_rows)
    queue_rows = build_review_queue(
        purchase_rows,
        resolution_rows,
        refreshed_link_rows,
        catalog_rows,
        build_purchases.read_optional_csv_rows(queue_csv),
    )
    write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
    click.echo(f"wrote {len(queue_rows)} rows to {queue_csv}")
    if refresh_only:
        return
    print_intro_text()
    resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
    catalog_by_id = {row["catalog_id"]: row for row in catalog_rows if row.get("catalog_id")}
    rows_by_normalized = defaultdict(list)
    for row in purchase_rows:
        normalized_item_id = row.get("normalized_item_id", "")
        if normalized_item_id:
            rows_by_normalized[normalized_item_id].append(row)
    reviewed = 0
    for index, queue_row in enumerate(queue_rows, start=1):
        if limit and reviewed >= limit:
            break
        related_rows = rows_by_normalized.get(queue_row["normalized_item_id"], [])
        result = prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, index, len(queue_rows))
        if result == (None, None):
            break
        resolution_row, catalog_row = result
        resolution_lookup[resolution_row["normalized_item_id"]] = resolution_row
        if catalog_row and catalog_row["catalog_id"] not in catalog_by_id:
            catalog_by_id[catalog_row["catalog_id"]] = catalog_row
            catalog_rows.append(catalog_row)
        normalized_item_id = resolution_row["normalized_item_id"]
        if resolution_row["status"] == "approved":
            if resolution_row["resolution_action"] in {"link", "create"} and resolution_row.get("catalog_id"):
                link_lookup[normalized_item_id] = {
                    "normalized_item_id": normalized_item_id,
                    "catalog_id": resolution_row["catalog_id"],
                    "link_method": f"manual_{resolution_row['resolution_action']}",
                    "link_confidence": "high",
                    "review_status": "approved",
                    "reviewed_by": "",
                    "reviewed_at": resolution_row.get("reviewed_at", ""),
                    "link_notes": resolution_row.get("resolution_notes", ""),
                }
            elif resolution_row["resolution_action"] == "exclude":
                link_lookup.pop(normalized_item_id, None)
        queue_rows = apply_resolution_to_queue(queue_rows, resolution_lookup)
        write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
        save_resolution_rows(
            resolutions_csv,
            sorted(resolution_lookup.values(), key=lambda row: row["normalized_item_id"]),
        )
        save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["catalog_id"]))
        save_link_rows(links_csv, link_rows_from_state(link_lookup))
        reviewed += 1
    save_resolution_rows(resolutions_csv, sorted(resolution_lookup.values(), key=lambda row: row["normalized_item_id"]))
    save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["catalog_id"]))
    save_link_rows(links_csv, link_rows_from_state(link_lookup))
    click.echo(
        f"saved {len(resolution_lookup)} resolution rows to {resolutions_csv}, "
        f"{len(catalog_by_id)} catalog rows to {catalog_csv}, "
        f"and {len(link_lookup)} product links to {links_csv}"
    )
 if __name__ == "__main__":
    main()
--- a/scrape_costco.py
+++ b/scrape_costco.py
@@ -648,6 +648,27 @@ def main(
    window_days,
    months_back,
    firefox_profile_dir,
 ):
+    click.echo("legacy entrypoint: prefer collect_costco_web.py for data-model outputs")
+    run_collection(
+        outdir=outdir,
+        document_type=document_type,
+        document_sub_type=document_sub_type,
+        window_days=window_days,
+        months_back=months_back,
+        firefox_profile_dir=firefox_profile_dir,
+    )
+def run_collection(
+    outdir,
+    document_type,
+    document_sub_type,
+    window_days,
+    months_back,
+    firefox_profile_dir,
+    orders_filename="orders.csv",
+    items_filename="items.csv",
+):
    outdir = Path(outdir)
    raw_dir = outdir / "raw"
@@ -670,6 +691,13 @@ def main(
        client_identifier=config["client_identifier"],
    )
    session = build_session(profile_dir, auth_headers)
+    click.echo(
+        "session bootstrap: "
+        f"cookies={True} "
+        f"authorization={bool(auth_headers.get('costco-x-authorization'))} "
+        f"client_id={bool(auth_headers.get('costco-x-wcs-clientId'))} "
+        f"client_identifier={bool(auth_headers.get('client-identifier'))}"
+    )
    start_date, end_date = resolve_date_range(months_back)
@@ -699,8 +727,8 @@ def main(
        write_json(raw_dir / f"{safe_filename(receipt_id)}.json", detail_payload)
    orders, items = flatten_costco_data(summary_payload, detail_payloads, raw_dir)
-    write_csv(outdir / "orders.csv", orders, ORDER_FIELDS)
+    write_csv(outdir / orders_filename, orders, ORDER_FIELDS)
-    write_csv(outdir / "items.csv", items, ITEM_FIELDS)
+    write_csv(outdir / items_filename, items, ITEM_FIELDS)
    click.echo(f"wrote {len(orders)} orders and {len(items)} item rows to {outdir}")
--- a/scrape_giant.py
+++ b/scrape_giant.py
@@ -13,8 +13,10 @@ from browser_session import find_firefox_profile_dir, load_firefox_cookies
 BASE = "https://giantfood.com"
 ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
+RETAILER = "giant"
 ORDER_FIELDS = [
+    "retailer",
    "order_id",
    "order_date",
    "delivery_date",
@@ -33,12 +35,16 @@ ORDER_FIELDS = [
    "store_zipcode",
    "refund_order",
    "ebt_order",
+    "raw_history_path",
+    "raw_order_path",
 ]
 ITEM_FIELDS = [
+    "retailer",
    "order_id",
    "order_date",
    "line_no",
+    "retailer_item_id",
    "pod_id",
    "item_name",
    "upc",
@@ -53,6 +59,10 @@ ITEM_FIELDS = [
    "reward_savings",
    "coupon_savings",
    "coupon_price",
+    "image_url",
+    "raw_order_path",
+    "is_discount_line",
+    "is_coupon_line",
 ]
@@ -130,18 +140,21 @@ def get_order_detail(session, user_id, order_id):
    return response.json()
-def flatten_orders(history, details):
+def flatten_orders(history, details, history_path=None, raw_dir=None):
    orders = []
    items = []
    history_lookup = {record["orderId"]: record for record in history.get("records", [])}
+    history_path_value = history_path.as_posix() if history_path else ""
    for detail in details:
        order_id = str(detail["orderId"])
        history_row = history_lookup.get(detail["orderId"], {})
        pickup = detail.get("pup", {})
+        raw_order_path = (raw_dir / f"{order_id}.json").as_posix() if raw_dir else ""
        orders.append(
            {
+                "retailer": RETAILER,
                "order_id": order_id,
                "order_date": detail.get("orderDate"),
                "delivery_date": detail.get("deliveryDate"),
@@ -160,15 +173,19 @@ def flatten_orders(history, details):
                "store_zipcode": pickup.get("storeZipcode"),
                "refund_order": detail.get("refundOrder"),
                "ebt_order": detail.get("ebtOrder"),
+                "raw_history_path": history_path_value,
+                "raw_order_path": raw_order_path,
            }
        )
        for line_no, item in enumerate(detail.get("items", []), start=1):
            items.append(
                {
+                    "retailer": RETAILER,
                    "order_id": order_id,
                    "order_date": detail.get("orderDate"),
                    "line_no": str(line_no),
+                    "retailer_item_id": "",
                    "pod_id": item.get("podId"),
                    "item_name": item.get("itemName"),
                    "upc": item.get("primUpcCd"),
@@ -183,6 +200,10 @@ def flatten_orders(history, details):
                    "reward_savings": item.get("rewardSavings"),
                    "coupon_savings": item.get("couponSavings"),
                    "coupon_price": item.get("couponPrice"),
+                    "image_url": "",
+                    "raw_order_path": raw_order_path,
+                    "is_discount_line": "false",
+                    "is_coupon_line": "false",
                }
            )
@@ -269,6 +290,18 @@ def write_json(path, payload):
    help="Delay between order detail requests.",
 )
 def main(user_id, loyalty, outdir, sleep_seconds):
+    click.echo("legacy entrypoint: prefer collect_giant_web.py for data-model outputs")
+    run_collection(user_id, loyalty, outdir, sleep_seconds)
+def run_collection(
+    user_id,
+    loyalty,
+    outdir,
+    sleep_seconds,
+    orders_filename="orders.csv",
+    items_filename="items.csv",
+):
    config = load_config()
    user_id = user_id or config["user_id"] or click.prompt("Giant user id", type=str)
    loyalty = loyalty or config["loyalty"] or click.prompt(
@@ -279,13 +312,14 @@ def main(user_id, loyalty, outdir, sleep_seconds):
    rawdir = outdir / "raw"
    rawdir.mkdir(parents=True, exist_ok=True)
-    orders_csv = outdir / "orders.csv"
+    orders_csv = outdir / orders_filename
-    items_csv = outdir / "items.csv"
+    items_csv = outdir / items_filename
    existing_order_ids = read_existing_order_ids(orders_csv)
    session = build_session()
    history = get_history(session, user_id, loyalty)
-    write_json(rawdir / "history.json", history)
+    history_path = rawdir / "history.json"
+    write_json(history_path, history)
    records = history.get("records", [])
    click.echo(f"history returned {len(records)} visits; Giant exposes only the most recent 50")
@@ -310,7 +344,7 @@ def main(user_id, loyalty, outdir, sleep_seconds):
        if index < len(unseen_records):
            time.sleep(sleep_seconds)
-    orders, items = flatten_orders(history, details)
+    orders, items = flatten_orders(history, details, history_path=history_path, raw_dir=rawdir)
    merged_orders = append_dedup(
        orders_csv,
        orders,
--- a/scraper.py
+++ b/scraper.py
@@ -1,5 +0,0 @@
 from scrape_giant import *  # noqa: F401,F403
 if __name__ == "__main__":
    main()
--- a/tests/test_analyze_purchases.py
+++ b/tests/test_analyze_purchases.py
@@ -0,0 +1,149 @@
 import csv
 import tempfile
 import unittest
 from pathlib import Path
 import analyze_purchases
 class AnalyzePurchasesTests(unittest.TestCase):
    def test_analysis_outputs_cover_required_views(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            output_dir = Path(tmpdir) / "analysis"
            fieldnames = [
                "purchase_date",
                "retailer",
                "order_id",
                "catalog_id",
                "catalog_name",
                "category",
                "product_type",
                "net_line_total",
                "line_total",
                "normalized_quantity",
                "normalized_quantity_unit",
                "effective_price",
                "effective_price_unit",
                "store_name",
                "store_number",
                "store_city",
                "store_state",
                "is_fee",
                "is_discount_line",
                "is_coupon_line",
            ]
            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(
                    [
                        {
                            "purchase_date": "2026-03-01",
                            "retailer": "giant",
                            "order_id": "g1",
                            "catalog_id": "cat_banana",
                            "catalog_name": "BANANA",
                            "category": "produce",
                            "product_type": "banana",
                            "net_line_total": "1.29",
                            "line_total": "1.29",
                            "normalized_quantity": "2.19",
                            "normalized_quantity_unit": "lb",
                            "effective_price": "0.589",
                            "effective_price_unit": "lb",
                            "store_name": "Giant",
                            "store_number": "42",
                            "store_city": "Springfield",
                            "store_state": "VA",
                            "is_fee": "false",
                            "is_discount_line": "false",
                            "is_coupon_line": "false",
                        },
                        {
                            "purchase_date": "2026-03-01",
                            "retailer": "giant",
                            "order_id": "g1",
                            "catalog_id": "cat_ice",
                            "catalog_name": "ICE",
                            "category": "frozen",
                            "product_type": "ice",
                            "net_line_total": "3.50",
                            "line_total": "3.50",
                            "normalized_quantity": "20",
                            "normalized_quantity_unit": "lb",
                            "effective_price": "0.175",
                            "effective_price_unit": "lb",
                            "store_name": "Giant",
                            "store_number": "42",
                            "store_city": "Springfield",
                            "store_state": "VA",
                            "is_fee": "false",
                            "is_discount_line": "false",
                            "is_coupon_line": "false",
                        },
                        {
                            "purchase_date": "2026-03-02",
                            "retailer": "costco",
                            "order_id": "c1",
                            "catalog_id": "cat_banana",
                            "catalog_name": "BANANA",
                            "category": "produce",
                            "product_type": "banana",
                            "net_line_total": "1.49",
                            "line_total": "2.98",
                            "normalized_quantity": "3",
                            "normalized_quantity_unit": "lb",
                            "effective_price": "0.4967",
                            "effective_price_unit": "lb",
                            "store_name": "MT VERNON",
                            "store_number": "1115",
                            "store_city": "ALEXANDRIA",
                            "store_state": "VA",
                            "is_fee": "false",
                            "is_discount_line": "false",
                            "is_coupon_line": "false",
                        },
                    ]
                )
            analyze_purchases.main.callback(
                purchases_csv=str(purchases_csv),
                output_dir=str(output_dir),
            )
            expected_files = [
                "item_price_over_time.csv",
                "spend_by_visit.csv",
                "items_per_visit.csv",
                "category_spend_over_time.csv",
                "retailer_store_breakdown.csv",
            ]
            for name in expected_files:
                self.assertTrue((output_dir / name).exists(), name)
            with (output_dir / "spend_by_visit.csv").open(newline="", encoding="utf-8") as handle:
                spend_rows = list(csv.DictReader(handle))
            self.assertEqual("4.79", spend_rows[0]["visit_spend_total"])
            with (output_dir / "items_per_visit.csv").open(newline="", encoding="utf-8") as handle:
                item_rows = list(csv.DictReader(handle))
            self.assertEqual("2", item_rows[0]["item_row_count"])
            self.assertEqual("2", item_rows[0]["distinct_catalog_count"])
            with (output_dir / "category_spend_over_time.csv").open(newline="", encoding="utf-8") as handle:
                category_rows = list(csv.DictReader(handle))
            produce_row = next(row for row in category_rows if row["purchase_date"] == "2026-03-01" and row["category"] == "produce")
            self.assertEqual("1.29", produce_row["category_spend_total"])
            with (output_dir / "retailer_store_breakdown.csv").open(newline="", encoding="utf-8") as handle:
                store_rows = list(csv.DictReader(handle))
            giant_row = next(row for row in store_rows if row["retailer"] == "giant")
            self.assertEqual("1", giant_row["visit_count"])
            self.assertEqual("2", giant_row["item_row_count"])
            self.assertEqual("4.79", giant_row["store_spend_total"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_canonical_layer.py
+++ b/tests/test_canonical_layer.py
@@ -1,99 +0,0 @@
 import unittest
 import build_canonical_layer
 class CanonicalLayerTests(unittest.TestCase):
    def test_build_canonical_layer_auto_links_exact_upc_and_name_size(self):
        observed_rows = [
            {
                "observed_product_id": "gobs_1",
                "representative_upc": "111",
                "representative_retailer_item_id": "11",
                "representative_name_norm": "GALA APPLE",
                "representative_brand": "SB",
                "representative_variant": "",
                "representative_size_value": "5",
                "representative_size_unit": "lb",
                "representative_pack_qty": "",
                "representative_measure_type": "weight",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_2",
                "representative_upc": "111",
                "representative_retailer_item_id": "12",
                "representative_name_norm": "LARGE WHITE EGGS",
                "representative_brand": "SB",
                "representative_variant": "",
                "representative_size_value": "",
                "representative_size_unit": "",
                "representative_pack_qty": "18",
                "representative_measure_type": "count",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_3",
                "representative_upc": "",
                "representative_retailer_item_id": "21",
                "representative_name_norm": "ROTINI",
                "representative_brand": "",
                "representative_variant": "",
                "representative_size_value": "16",
                "representative_size_unit": "oz",
                "representative_pack_qty": "",
                "representative_measure_type": "weight",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_4",
                "representative_upc": "",
                "representative_retailer_item_id": "22",
                "representative_name_norm": "ROTINI",
                "representative_brand": "SB",
                "representative_variant": "",
                "representative_size_value": "16",
                "representative_size_unit": "oz",
                "representative_pack_qty": "",
                "representative_measure_type": "weight",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_5",
                "representative_upc": "",
                "representative_retailer_item_id": "99",
                "representative_name_norm": "GL BAG CHARGE",
                "representative_brand": "",
                "representative_variant": "",
                "representative_size_value": "",
                "representative_size_unit": "",
                "representative_pack_qty": "",
                "representative_measure_type": "each",
                "is_fee": "true",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
        ]
        canonicals, links = build_canonical_layer.build_canonical_layer(observed_rows)
        self.assertEqual(2, len(canonicals))
        self.assertEqual(4, len(links))
        methods = {row["observed_product_id"]: row["link_method"] for row in links}
        self.assertEqual("exact_upc", methods["gobs_1"])
        self.assertEqual("exact_upc", methods["gobs_2"])
        self.assertEqual("exact_name_size", methods["gobs_3"])
        self.assertEqual("exact_name_size", methods["gobs_4"])
        self.assertNotIn("gobs_5", methods)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_costco_pipeline.py
+++ b/tests/test_costco_pipeline.py
@@ -7,7 +7,6 @@ from unittest import mock
 import enrich_costco
 import scrape_costco
-import validate_cross_retailer_flow
 class CostcoPipelineTests(unittest.TestCase):
@@ -258,6 +257,31 @@ class CostcoPipelineTests(unittest.TestCase):
        self.assertEqual("MIXED PEPPER", row["item_name_norm"])
        self.assertEqual("6", row["pack_qty"])
        self.assertEqual("count", row["measure_type"])
+        self.assertEqual("costco:abc:1", row["normalized_row_id"])
+        self.assertEqual("exact_retailer_item_id", row["normalization_basis"])
+        self.assertTrue(row["normalized_item_id"])
+        self.assertEqual("6", row["normalized_quantity"])
+        self.assertEqual("count", row["normalized_quantity_unit"])
+        volume_row = enrich_costco.parse_costco_item(
+            order_id="abc",
+            order_date="2026-03-12",
+            raw_path=Path("costco_output/raw/abc.json"),
+            line_no=3,
+            item={
+                "itemNumber": "1185912",
+                "itemDescription01": "KS ALMND BAR US 1.74QTS CN",
+                "itemDescription02": None,
+                "itemDepartmentNumber": 18,
+                "transDepartmentNumber": 18,
+                "unit": 2,
+                "itemIdentifier": "E",
+                "amount": 21.98,
+                "itemUnitPriceAmount": 10.99,
+            },
+        )
+        self.assertEqual("3.48", volume_row["normalized_quantity"])
+        self.assertEqual("qt", volume_row["normalized_quantity_unit"])
        discount = enrich_costco.parse_costco_item(
            order_id="abc",
@@ -278,76 +302,125 @@ class CostcoPipelineTests(unittest.TestCase):
        )
        self.assertEqual("true", discount["is_discount_line"])
        self.assertEqual("true", discount["is_coupon_line"])
+        self.assertEqual("false", discount["is_item"])
-    def test_cross_retailer_validation_writes_proof_example(self):
+    def test_costco_name_cleanup_removes_dual_weight_and_logistics_artifacts(self):
+        mixed_units = enrich_costco.parse_costco_item(
+            order_id="abc",
+            order_date="2026-03-12",
+            raw_path=Path("costco_output/raw/abc.json"),
+            line_no=1,
+            item={
+                "itemNumber": "18600",
+                "itemDescription01": "MANDARINS 2.27 KG / 5 LBS",
+                "itemDescription02": None,
+                "itemDepartmentNumber": 65,
+                "transDepartmentNumber": 65,
+                "unit": 1,
+                "itemIdentifier": "E",
+                "amount": 7.49,
+                "itemUnitPriceAmount": 7.49,
+            },
+        )
+        self.assertEqual("MANDARIN", mixed_units["item_name_norm"])
+        self.assertEqual("5", mixed_units["size_value"])
+        self.assertEqual("lb", mixed_units["size_unit"])
+        logistics = enrich_costco.parse_costco_item(
+            order_id="abc",
+            order_date="2026-03-12",
+            raw_path=Path("costco_output/raw/abc.json"),
+            line_no=2,
+            item={
+                "itemNumber": "1375005",
+                "itemDescription01": "LIFE 6'TABLE MDL #80873U - T12/H3/P36",
+                "itemDescription02": None,
+                "itemDepartmentNumber": 18,
+                "transDepartmentNumber": 18,
+                "unit": 1,
+                "itemIdentifier": "E",
+                "amount": 119.98,
+                "itemUnitPriceAmount": 119.98,
+            },
+        )
+        self.assertEqual("LIFE 6'TABLE MDL", logistics["item_name_norm"])
+    def test_costco_hash_weight_parses_into_weight_basis(self):
+        row = enrich_costco.parse_costco_item(
+            order_id="abc",
+            order_date="2024-11-29",
+            raw_path=Path("costco_output/raw/abc.json"),
+            line_no=4,
+            item={
+                "itemNumber": "999",
+                "itemDescription01": "25# FLOUR ALL-PURPOSE HARV P98/100",
+                "itemDescription02": None,
+                "itemDepartmentNumber": 14,
+                "transDepartmentNumber": 14,
+                "unit": 1,
+                "itemIdentifier": "E",
+                "amount": 8.79,
+                "itemUnitPriceAmount": 8.79,
+            },
+        )
+        self.assertEqual("FLOUR ALL-PURPOSE HARV", row["item_name_norm"])
+        self.assertEqual("25", row["size_value"])
+        self.assertEqual("lb", row["size_unit"])
+        self.assertEqual("weight", row["measure_type"])
+        self.assertEqual("25", row["normalized_quantity"])
+        self.assertEqual("lb", row["normalized_quantity_unit"])
+        self.assertEqual("0.3516", row["price_per_lb"])
+    def test_build_items_enriched_matches_discount_to_item(self):
        with tempfile.TemporaryDirectory() as tmpdir:
-            giant_csv = Path(tmpdir) / "giant_items_enriched.csv"
+            raw_dir = Path(tmpdir) / "raw"
-            costco_csv = Path(tmpdir) / "costco_items_enriched.csv"
+            raw_dir.mkdir()
-            outdir = Path(tmpdir) / "combined"
+            payload = {
-
+                "data": {
-            fieldnames = enrich_costco.OUTPUT_FIELDS
+                    "receiptsWithCounts": {
-            giant_row = {field: "" for field in fieldnames}
+                        "receipts": [
-            giant_row.update(
                            {
-                    "retailer": "giant",
+                                "transactionBarcode": "abc",
-                    "order_id": "g1",
+                                "transactionDate": "2026-03-12",
-                    "line_no": "1",
+                                "itemArray": [
-                    "order_date": "2026-03-01",
-                    "retailer_item_id": "100",
-                    "item_name": "FRESH BANANA",
-                    "item_name_norm": "BANANA",
-                    "upc": "4011",
-                    "measure_type": "weight",
-                    "is_store_brand": "false",
-                    "is_fee": "false",
-                    "is_discount_line": "false",
-                    "is_coupon_line": "false",
-                    "line_total": "1.29",
-                }
-            )
-            costco_row = {field: "" for field in fieldnames}
-            costco_row.update(
                                    {
-                    "retailer": "costco",
+                                        "itemNumber": "4873222",
-                    "order_id": "c1",
+                                        "itemDescription01": "ALL F&C",
-                    "line_no": "1",
+                                        "itemDescription02": "200OZ 160LOADS P104",
-                    "order_date": "2026-03-12",
+                                        "itemDepartmentNumber": 14,
-                    "retailer_item_id": "30669",
+                                        "transDepartmentNumber": 14,
-                    "item_name": "BANANAS 3 LB / 1.36 KG",
+                                        "unit": 1,
-                    "item_name_norm": "BANANA",
+                                        "itemIdentifier": "E",
-                    "upc": "",
+                                        "amount": 19.99,
-                    "size_value": "3",
+                                        "itemUnitPriceAmount": 19.99,
-                    "size_unit": "lb",
+                                    },
-                    "measure_type": "weight",
+                                    {
-                    "is_store_brand": "false",
+                                        "itemNumber": "374664",
-                    "is_fee": "false",
+                                        "itemDescription01": "/ 4873222",
-                    "is_discount_line": "false",
+                                        "itemDescription02": None,
-                    "is_coupon_line": "false",
+                                        "itemDepartmentNumber": 14,
-                    "line_total": "2.98",
+                                        "transDepartmentNumber": 14,
+                                        "unit": -1,
+                                        "itemIdentifier": None,
+                                        "amount": -5,
+                                        "itemUnitPriceAmount": 0,
+                                    },
+                                ],
                            }
-            )
+                        ]
+                    }
+                }
+            }
+            (raw_dir / "abc.json").write_text(json.dumps(payload), encoding="utf-8")
-            with giant_csv.open("w", newline="", encoding="utf-8") as handle:
+            rows = enrich_costco.build_items_enriched(raw_dir)
-                writer = csv.DictWriter(handle, fieldnames=fieldnames)
-                writer.writeheader()
-                writer.writerow(giant_row)
-            with costco_csv.open("w", newline="", encoding="utf-8") as handle:
-                writer = csv.DictWriter(handle, fieldnames=fieldnames)
-                writer.writeheader()
-                writer.writerow(costco_row)
-            validate_cross_retailer_flow.main.callback(
+            purchase_row = next(row for row in rows if row["is_discount_line"] == "false")
-                giant_items_enriched_csv=str(giant_csv),
+            discount_row = next(row for row in rows if row["is_discount_line"] == "true")
-                costco_items_enriched_csv=str(costco_csv),
+            self.assertEqual("-5", purchase_row["matched_discount_amount"])
-                outdir=str(outdir),
+            self.assertEqual("14.99", purchase_row["net_line_total"])
-            )
+            self.assertIn("matched_discount=4873222", purchase_row["parse_notes"])
-
+            self.assertIn("matched_to_item=4873222", discount_row["parse_notes"])
-            proof_path = outdir / "proof_examples.csv"
-            self.assertTrue(proof_path.exists())
-            with proof_path.open(newline="", encoding="utf-8") as handle:
-                rows = list(csv.DictReader(handle))
-            self.assertEqual(1, len(rows))
-            self.assertEqual("banana", rows[0]["proof_name"])
    def test_main_writes_summary_request_metadata(self):
        with tempfile.TemporaryDirectory() as tmpdir:
--- a/tests/test_enrich_giant.py
+++ b/tests/test_enrich_giant.py
@@ -51,6 +51,11 @@ class EnrichGiantTests(unittest.TestCase):
        self.assertEqual("1.99", row["price_per_lb"])
        self.assertEqual("0.1244", row["price_per_oz"])
        self.assertEqual("https://example.test/apple.jpg", row["image_url"])
+        self.assertEqual("giant:abc123:1", row["normalized_row_id"])
+        self.assertEqual("exact_upc", row["normalization_basis"])
+        self.assertEqual("5", row["normalized_quantity"])
+        self.assertEqual("lb", row["normalized_quantity_unit"])
+        self.assertEqual("true", row["is_item"])
        fee_row = enrich_giant.parse_item(
            order_id="abc123",
@@ -77,6 +82,7 @@ class EnrichGiantTests(unittest.TestCase):
        self.assertEqual("true", fee_row["is_fee"])
        self.assertEqual("GL BAG CHARGE", fee_row["item_name_norm"])
+        self.assertEqual("false", fee_row["is_item"])
    def test_parse_item_derives_packaged_weight_prices_from_size_tokens(self):
        row = enrich_giant.parse_item(
@@ -105,9 +111,82 @@ class EnrichGiantTests(unittest.TestCase):
        self.assertEqual("weight", row["measure_type"])
        self.assertEqual("6", row["pack_qty"])
        self.assertEqual("7.5", row["size_value"])
+        self.assertEqual("90", row["normalized_quantity"])
+        self.assertEqual("oz", row["normalized_quantity_unit"])
        self.assertEqual("0.0667", row["price_per_oz"])
        self.assertEqual("1.0667", row["price_per_lb"])
    def test_derive_normalized_quantity_handles_count_volume_and_each(self):
        self.assertEqual(
            ("18", "count"),
            enrich_giant.derive_normalized_quantity("1", "", "", "18", "count"),
        )
        self.assertEqual(
            ("3.48", "qt"),
            enrich_giant.derive_normalized_quantity("2", "1.74", "qt", "", "volume"),
        )
        self.assertEqual(
            ("2", "each"),
            enrich_giant.derive_normalized_quantity("2", "", "", "", "each"),
        )
        self.assertEqual(
            ("1.68", "lb"),
            enrich_giant.derive_normalized_quantity("1", "", "", "", "weight", "1.68"),
        )
    def test_parse_item_uses_picked_weight_for_loose_weight_items(self):
        banana = enrich_giant.parse_item(
            order_id="abc123",
            order_date="2026-03-01",
            raw_path=Path("raw/abc123.json"),
            line_no=1,
            item={
                "podId": 1,
                "shipQy": 1,
                "totalPickedWeight": 1.68,
                "unitPrice": 0.99,
                "itemName": "FRESH BANANA",
                "lbEachCd": "LB",
                "groceryAmount": 0.99,
                "primUpcCd": "111",
                "mvpSavings": 0,
                "rewardSavings": 0,
                "couponSavings": 0,
                "couponPrice": 0,
                "categoryId": "1",
                "categoryDesc": "Grocery",
            },
        )
        self.assertEqual("weight", banana["measure_type"])
        self.assertEqual("1.68", banana["normalized_quantity"])
        self.assertEqual("lb", banana["normalized_quantity_unit"])
        patty = enrich_giant.parse_item(
            order_id="abc123",
            order_date="2026-03-01",
            raw_path=Path("raw/abc123.json"),
            line_no=2,
            item={
                "podId": 2,
                "shipQy": 1,
                "totalPickedWeight": 1.29,
                "unitPrice": 10.05,
                "itemName": "80% PATTIES PK12",
                "lbEachCd": "LB",
                "groceryAmount": 10.05,
                "primUpcCd": "222",
                "mvpSavings": 0,
                "rewardSavings": 0,
                "couponSavings": 0,
                "couponPrice": 0,
                "categoryId": "1",
                "categoryDesc": "Grocery",
            },
        )
        self.assertEqual("1.29", patty["normalized_quantity"])
        self.assertEqual("lb", patty["normalized_quantity_unit"])
    def test_build_items_enriched_reads_raw_order_files_and_writes_csv(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            raw_dir = Path(tmpdir) / "raw"
@@ -179,6 +258,8 @@ class EnrichGiantTests(unittest.TestCase):
            self.assertEqual("7.5", rows[0]["size_value"])
            self.assertEqual("10", rows[0]["retailer_item_id"])
            self.assertEqual("true", rows[1]["is_store_brand"])
            self.assertTrue(rows[0]["normalized_item_id"])
            self.assertEqual("exact_upc", rows[0]["normalization_basis"])
            with output_csv.open(newline="", encoding="utf-8") as handle:
                written_rows = list(csv.DictReader(handle))
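
Review note on the `tests/test_enrich_giant.py` changes above: the new assertions pin down how `normalized_quantity` and `normalized_quantity_unit` are expected to behave, but `derive_normalized_quantity` itself is not part of this excerpt. The sketch below is one hypothetical implementation that would satisfy the four cases in `test_derive_normalized_quantity_handles_count_volume_and_each`. The function name `sketch_normalized_quantity`, the helper names, the argument order (inferred from the test calls), and the precedence of the rules are all assumptions, not the project's actual code.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(text):
    """Parse a CSV-style string into a Decimal, or None when blank or invalid."""
    if text in ("", None):
        return None
    try:
        return Decimal(str(text))
    except InvalidOperation:
        return None


def fmt(value):
    """Render a Decimal without trailing zeros or exponent notation."""
    return format(value.normalize(), "f")


def sketch_normalized_quantity(qty, size_value, size_unit, pack_qty,
                               measure_type, picked_weight=""):
    """Hypothetical stand-in for enrich_giant.derive_normalized_quantity."""
    qty_d = to_decimal(qty)
    if qty_d is None:
        qty_d = Decimal(1)  # assumption: a blank qty means one unit

    # Assumed rule 1: loose-weight items use the picked weight, in pounds.
    weight_d = to_decimal(picked_weight)
    if measure_type == "weight" and weight_d is not None and weight_d > 0:
        return fmt(weight_d), "lb"

    # Assumed rule 2: packaged sizes multiply qty * size * pack count.
    size_d = to_decimal(size_value)
    pack_d = to_decimal(pack_qty)
    if size_d and size_unit:
        return fmt(qty_d * size_d * (pack_d or Decimal(1))), size_unit

    # Assumed rule 3: a pack count with no size yields a count.
    if pack_d:
        return fmt(qty_d * pack_d), "count"

    # Assumed rule 4: fall back to the purchased quantity as "each".
    return fmt(qty_d), "each"
```

Working in `Decimal` rather than `float` keeps string outputs such as `"3.48"` exact, which matters because the tests compare CSV string values rather than numbers.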
--- a/tests/test_observed_products.py
+++ b/tests/test_observed_products.py
@@ -1,67 +0,0 @@
 import unittest
 import build_observed_products
 class ObservedProductTests(unittest.TestCase):
    def test_build_observed_products_aggregates_rows_with_same_key(self):
        rows = [
            {
                "retailer": "giant",
                "order_id": "1",
                "line_no": "1",
                "order_date": "2026-01-01",
                "item_name": "SB GALA APPLE 5LB",
                "item_name_norm": "GALA APPLE",
                "retailer_item_id": "11",
                "upc": "111",
                "brand_guess": "SB",
                "variant": "",
                "size_value": "5",
                "size_unit": "lb",
                "pack_qty": "",
                "measure_type": "weight",
                "image_url": "https://example.test/a.jpg",
                "is_store_brand": "true",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "line_total": "7.99",
            },
            {
                "retailer": "giant",
                "order_id": "2",
                "line_no": "1",
                "order_date": "2026-01-10",
                "item_name": "SB GALA APPLE 5 LB",
                "item_name_norm": "GALA APPLE",
                "retailer_item_id": "11",
                "upc": "111",
                "brand_guess": "SB",
                "variant": "",
                "size_value": "5",
                "size_unit": "lb",
                "pack_qty": "",
                "measure_type": "weight",
                "image_url": "",
                "is_store_brand": "true",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "line_total": "8.49",
            },
        ]
        observed = build_observed_products.build_observed_products(rows)
        self.assertEqual(1, len(observed))
        self.assertEqual("2", observed[0]["times_seen"])
        self.assertEqual("2026-01-01", observed[0]["first_seen_date"])
        self.assertEqual("2026-01-10", observed[0]["last_seen_date"])
        self.assertEqual("11", observed[0]["representative_retailer_item_id"])
        self.assertEqual("111", observed[0]["representative_upc"])
        self.assertIn("SB GALA APPLE 5LB", observed[0]["raw_name_examples"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_pipeline_status.py
+++ b/tests/test_pipeline_status.py
@@ -0,0 +1,96 @@
 import unittest
 import report_pipeline_status
 class PipelineStatusTests(unittest.TestCase):
    def test_build_status_summary_reports_unresolved_and_reviewed_counts(self):
        summary = report_pipeline_status.build_status_summary(
            giant_orders=[{"order_id": "g1"}],
            giant_items=[{"order_id": "g1", "line_no": "1"}],
            giant_enriched=[
                {
                    "retailer": "giant",
                    "order_id": "g1",
                    "line_no": "1",
                    "normalized_item_id": "gnorm_banana",
                    "item_name_norm": "BANANA",
                    "item_name": "FRESH BANANA",
                    "retailer_item_id": "1",
                    "upc": "4011",
                    "brand_guess": "",
                    "variant": "",
                    "size_value": "",
                    "size_unit": "",
                    "pack_qty": "",
                    "measure_type": "weight",
                    "image_url": "",
                    "is_store_brand": "false",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "order_date": "2026-03-01",
                    "line_total": "1.29",
                }
            ],
            costco_orders=[],
            costco_items=[],
            costco_enriched=[],
            purchases=[
                {
                    "normalized_item_id": "gnorm_banana",
                    "catalog_id": "cat_banana",
                    "resolution_action": "",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "retailer": "giant",
                    "raw_item_name": "FRESH BANANA",
                    "normalized_item_name": "BANANA",
                    "upc": "4011",
                    "line_total": "1.29",
                },
                {
                    "normalized_item_id": "cnorm_lime",
                    "catalog_id": "",
                    "resolution_action": "",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "retailer": "costco",
                    "raw_item_name": "LIME 5LB",
                    "normalized_item_name": "LIME",
                    "upc": "",
                    "line_total": "4.99",
                },
            ],
            resolutions=[],
            links=[
                {
                    "normalized_item_id": "gnorm_banana",
                    "catalog_id": "cat_banana",
                    "review_status": "approved",
                }
            ],
            catalog=[
                {
                    "catalog_id": "cat_banana",
                    "catalog_name": "BANANA",
                    "product_type": "banana",
                    "category": "produce",
                }
            ],
        )
        counts = {row["stage"]: row["count"] for row in summary}
        self.assertEqual(1, counts["raw_orders"])
        self.assertEqual(1, counts["raw_items"])
        self.assertEqual(1, counts["normalized_items"])
        self.assertEqual(1, counts["linked_purchase_rows"])
        self.assertEqual(1, counts["unresolved_purchase_rows"])
        self.assertEqual(1, counts["review_queue_normalized_items"])
        self.assertEqual(0, counts["unresolved_not_in_review_rows"])
 if __name__ == "__main__":
    unittest.main()
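
Review note on `tests/test_pipeline_status.py` above: the test expects one linked and one unresolved purchase row from a two-row purchase table, where only the banana row carries a `catalog_id`. The body of `build_status_summary` is not shown in this excerpt, so the following is only a minimal sketch of how such counts could be derived. The function name `sketch_purchase_counts`, the `distinct_unresolved_items` key, and the choice to exclude fee, discount, and coupon lines are assumptions made for illustration.

```python
NON_ITEM_FLAGS = ("is_fee", "is_discount_line", "is_coupon_line")


def sketch_purchase_counts(purchases):
    """Hypothetical count of linked vs. unresolved purchase rows."""
    # Assumption: fee, discount, and coupon lines never need a catalog link.
    item_rows = [
        row for row in purchases
        if not any(row.get(flag) == "true" for flag in NON_ITEM_FLAGS)
    ]
    linked = [row for row in item_rows if row.get("catalog_id")]
    unresolved = [row for row in item_rows if not row.get("catalog_id")]
    return {
        "linked_purchase_rows": len(linked),
        "unresolved_purchase_rows": len(unresolved),
        # Several purchase rows can share one normalized item, so a review
        # queue would plausibly be sized by distinct ids rather than rows.
        "distinct_unresolved_items": len(
            {row["normalized_item_id"] for row in unresolved}
        ),
    }
```

Applied to the two purchase rows in the test fixture, this sketch would report one linked row (`gnorm_banana`) and one unresolved row (`cnorm_lime`).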
--- a/tests/test_purchases.py
+++ b/tests/test_purchases.py
@@ -0,0 +1,722 @@
 import csv
 import tempfile
 import unittest
 from pathlib import Path
 import build_purchases
 import enrich_costco
 class PurchaseLogTests(unittest.TestCase):
    def test_derive_net_line_total_preserves_existing_then_derives(self):
        self.assertEqual("1.49", build_purchases.derive_net_line_total({"net_line_total": "1.49", "line_total": "2.98"}))
        self.assertEqual("5.99", build_purchases.derive_net_line_total({"line_total": "6.99", "matched_discount_amount": "-1.00"}))
        self.assertEqual("3.5", build_purchases.derive_net_line_total({"line_total": "3.50"}))
    def test_derive_metrics_prefers_picked_weight_and_pack_count(self):
        metrics = build_purchases.derive_metrics(
            {
                "line_total": "4.00",
                "qty": "1",
                "pack_qty": "4",
                "size_value": "",
                "size_unit": "",
                "picked_weight": "2",
                "price_per_each": "",
                "price_per_lb": "",
                "price_per_oz": "",
            }
        )
        self.assertEqual("4", metrics["price_per_each"])
        self.assertEqual("1", metrics["price_per_count"])
        self.assertEqual("2", metrics["price_per_lb"])
        self.assertEqual("0.125", metrics["price_per_oz"])
        self.assertEqual("picked_weight_lb", metrics["price_per_lb_basis"])
    def test_build_purchase_rows_maps_catalog_ids(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS
        giant_row = {field: "" for field in fieldnames}
        giant_row.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "normalized_row_id": "giant:g1:1",
                "normalized_item_id": "gnorm:banana",
                "order_date": "2026-03-01",
                "item_name": "FRESH BANANA",
                "item_name_norm": "BANANA",
                "image_url": "https://example.test/banana.jpg",
                "retailer_item_id": "100",
                "upc": "4011",
                "qty": "1",
                "unit": "LB",
                "normalized_quantity": "1",
                "normalized_quantity_unit": "lb",
                "line_total": "1.29",
                "unit_price": "1.29",
                "measure_type": "weight",
                "price_per_lb": "1.29",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        costco_row = {field: "" for field in fieldnames}
        costco_row.update(
            {
                "retailer": "costco",
                "order_id": "c1",
                "line_no": "1",
                "normalized_row_id": "costco:c1:1",
                "normalized_item_id": "cnorm:banana",
                "order_date": "2026-03-12",
                "item_name": "BANANAS 3 LB / 1.36 KG",
                "item_name_norm": "BANANA",
                "retailer_item_id": "30669",
                "qty": "1",
                "unit": "E",
                "normalized_quantity": "3",
                "normalized_quantity_unit": "lb",
                "line_total": "2.98",
                "unit_price": "2.98",
                "size_value": "3",
                "size_unit": "lb",
                "measure_type": "weight",
                "price_per_lb": "0.9933",
                "raw_order_path": "data/costco-web/raw/c1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        giant_orders = [
            {
                "order_id": "g1",
                "store_name": "Giant",
                "store_number": "42",
                "store_city": "Springfield",
                "store_state": "VA",
            }
        ]
        costco_orders = [
            {
                "order_id": "c1",
                "store_name": "MT VERNON",
                "store_number": "1115",
                "store_city": "ALEXANDRIA",
                "store_state": "VA",
            }
        ]
        catalog_rows = [
            {
                "catalog_id": "cat_banana",
                "catalog_name": "BANANA",
                "category": "produce",
                "product_type": "banana",
                "brand": "",
                "variant": "",
                "size_value": "",
                "size_unit": "",
                "pack_qty": "",
                "measure_type": "",
                "notes": "",
                "created_at": "",
                "updated_at": "",
            }
        ]
        link_rows = [
            {
                "normalized_item_id": "gnorm:banana",
                "catalog_id": "cat_banana",
                "link_method": "manual_link",
                "link_confidence": "high",
                "review_status": "approved",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "",
            },
            {
                "normalized_item_id": "cnorm:banana",
                "catalog_id": "cat_banana",
                "link_method": "manual_link",
                "link_confidence": "high",
                "review_status": "approved",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "",
            },
        ]
        rows, _links = build_purchases.build_purchase_rows(
            [giant_row],
            [costco_row],
            giant_orders,
            costco_orders,
            [],
            link_rows,
            catalog_rows,
        )
        self.assertEqual(2, len(rows))
        self.assertTrue(all(row["catalog_id"] == "cat_banana" for row in rows))
        self.assertEqual({"giant", "costco"}, {row["retailer"] for row in rows})
        self.assertEqual("https://example.test/banana.jpg", rows[0]["image_url"])
        self.assertEqual("1", rows[0]["normalized_quantity"])
        self.assertEqual("lb", rows[0]["normalized_quantity_unit"])
        self.assertEqual("lb", rows[0]["effective_price_unit"])
        self.assertEqual("g1", rows[0]["order_id"])
        self.assertEqual("Giant", rows[0]["store_name"])
        self.assertEqual("42", rows[0]["store_number"])
        self.assertEqual("Springfield", rows[0]["store_city"])
        self.assertEqual("VA", rows[0]["store_state"])
    def test_main_writes_purchase_and_example_csvs(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            giant_items = Path(tmpdir) / "giant_items.csv"
            costco_items = Path(tmpdir) / "costco_items.csv"
            giant_orders = Path(tmpdir) / "giant_orders.csv"
            costco_orders = Path(tmpdir) / "costco_orders.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            purchases_csv = Path(tmpdir) / "review" / "purchases.csv"
            examples_csv = Path(tmpdir) / "review" / "comparison_examples.csv"
            fieldnames = enrich_costco.OUTPUT_FIELDS
            giant_row = {field: "" for field in fieldnames}
            giant_row.update(
                {
                    "retailer": "giant",
                    "order_id": "g1",
                    "line_no": "1",
                    "normalized_row_id": "giant:g1:1",
                    "normalized_item_id": "gnorm:banana",
                    "order_date": "2026-03-01",
                    "item_name": "FRESH BANANA",
                    "item_name_norm": "BANANA",
                    "retailer_item_id": "100",
                    "upc": "4011",
                    "qty": "1",
                    "unit": "LB",
                    "normalized_quantity": "1",
                    "normalized_quantity_unit": "lb",
                    "line_total": "1.29",
                    "unit_price": "1.29",
                    "measure_type": "weight",
                    "price_per_lb": "1.29",
                    "raw_order_path": "data/giant-web/raw/g1.json",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "is_fee": "false",
                }
            )
            costco_row = {field: "" for field in fieldnames}
            costco_row.update(
                {
                    "retailer": "costco",
                    "order_id": "c1",
                    "line_no": "1",
                    "normalized_row_id": "costco:c1:1",
                    "normalized_item_id": "cnorm:banana",
                    "order_date": "2026-03-12",
                    "item_name": "BANANAS 3 LB / 1.36 KG",
                    "item_name_norm": "BANANA",
                    "retailer_item_id": "30669",
                    "qty": "1",
                    "unit": "E",
                    "normalized_quantity": "3",
                    "normalized_quantity_unit": "lb",
                    "line_total": "2.98",
                    "unit_price": "2.98",
                    "size_value": "3",
                    "size_unit": "lb",
                    "measure_type": "weight",
                    "price_per_lb": "0.9933",
                    "raw_order_path": "data/costco-web/raw/c1.json",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "is_fee": "false",
                }
            )
            for path, source_rows in [(giant_items, [giant_row]), (costco_items, [costco_row])]:
                with path.open("w", newline="", encoding="utf-8") as handle:
                    writer = csv.DictWriter(handle, fieldnames=fieldnames)
                    writer.writeheader()
                    writer.writerows(source_rows)
            order_fields = ["order_id", "store_name", "store_number", "store_city", "store_state"]
            for path, source_rows in [
                (
                    giant_orders,
                    [
                        {
                            "order_id": "g1",
                            "store_name": "Giant",
                            "store_number": "42",
                            "store_city": "Springfield",
                            "store_state": "VA",
                        }
                    ],
                ),
                (
                    costco_orders,
                    [
                        {
                            "order_id": "c1",
                            "store_name": "MT VERNON",
                            "store_number": "1115",
                            "store_city": "ALEXANDRIA",
                            "store_state": "VA",
                        }
                    ],
                ),
            ]:
                with path.open("w", newline="", encoding="utf-8") as handle:
                    writer = csv.DictWriter(handle, fieldnames=order_fields)
                    writer.writeheader()
                    writer.writerows(source_rows)
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "catalog_id": "cat_banana",
                        "catalog_name": "BANANA",
                        "category": "produce",
                        "product_type": "banana",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )
            with links_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=build_purchases.PRODUCT_LINK_FIELDS)
                writer.writeheader()
                writer.writerows(
                    [
                        {
                            "normalized_item_id": "gnorm:banana",
                            "catalog_id": "cat_banana",
                            "link_method": "manual_link",
                            "link_confidence": "high",
                            "review_status": "approved",
                            "reviewed_by": "",
                            "reviewed_at": "",
                            "link_notes": "",
                        },
                        {
                            "normalized_item_id": "cnorm:banana",
                            "catalog_id": "cat_banana",
                            "link_method": "manual_link",
                            "link_confidence": "high",
                            "review_status": "approved",
                            "reviewed_by": "",
                            "reviewed_at": "",
                            "link_notes": "",
                        },
                    ]
                )
            build_purchases.main.callback(
                giant_items_enriched_csv=str(giant_items),
                costco_items_enriched_csv=str(costco_items),
                giant_orders_csv=str(giant_orders),
                costco_orders_csv=str(costco_orders),
                resolutions_csv=str(resolutions_csv),
                catalog_csv=str(catalog_csv),
                links_csv=str(links_csv),
                output_csv=str(purchases_csv),
                examples_csv=str(examples_csv),
            )
            self.assertTrue(purchases_csv.exists())
            self.assertTrue(examples_csv.exists())
            with purchases_csv.open(newline="", encoding="utf-8") as handle:
                purchase_rows = list(csv.DictReader(handle))
            with examples_csv.open(newline="", encoding="utf-8") as handle:
                example_rows = list(csv.DictReader(handle))
            self.assertEqual(2, len(purchase_rows))
            self.assertEqual(1, len(example_rows))
    def test_build_purchase_rows_applies_manual_resolution(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS
        giant_row = {field: "" for field in fieldnames}
        giant_row.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "normalized_row_id": "giant:g1:1",
                "normalized_item_id": "gnorm:ice",
                "order_date": "2026-03-01",
                "item_name": "SB BAGGED ICE 20LB",
                "item_name_norm": "BAGGED ICE",
                "retailer_item_id": "100",
                "upc": "",
                "qty": "1",
                "unit": "EA",
                "normalized_quantity": "1",
                "normalized_quantity_unit": "each",
                "line_total": "3.50",
                "unit_price": "3.50",
                "measure_type": "each",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        rows, links = build_purchases.build_purchase_rows(
            [giant_row],
            [],
            [
                {
                    "order_id": "g1",
                    "store_name": "Giant",
                    "store_number": "42",
                    "store_city": "Springfield",
                    "store_state": "VA",
                }
            ],
            [],
            [
                {
                    "normalized_item_id": "gnorm:ice",
                    "catalog_id": "cat_ice",
                    "resolution_action": "create",
                    "status": "approved",
                    "resolution_notes": "manual ice merge",
                    "reviewed_at": "2026-03-16",
                }
            ],
            [],
            [
                {
                    "catalog_id": "cat_ice",
                    "catalog_name": "ICE",
                    "category": "frozen",
                    "product_type": "ice",
                    "brand": "",
                    "variant": "",
                    "size_value": "",
                    "size_unit": "",
                    "pack_qty": "",
                    "measure_type": "",
                    "notes": "",
                    "created_at": "",
                    "updated_at": "",
                }
            ],
        )
        self.assertEqual("cat_ice", rows[0]["catalog_id"])
        self.assertEqual("approved", rows[0]["review_status"])
        self.assertEqual("create", rows[0]["resolution_action"])
        self.assertEqual("cat_ice", links[0]["catalog_id"])
        self.assertEqual("1", rows[0]["normalized_quantity"])
        self.assertEqual("each", rows[0]["normalized_quantity_unit"])
    def test_build_purchase_rows_derives_effective_price_for_known_cases(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS
        def base_row():
            return {field: "" for field in fieldnames}
        giant_banana = base_row()
        giant_banana.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "normalized_row_id": "giant:g1:1",
                "normalized_item_id": "gnorm:banana",
                "order_date": "2026-03-01",
                "item_name": "FRESH BANANA",
                "item_name_norm": "BANANA",
                "retailer_item_id": "100",
                "qty": "1",
                "unit": "LB",
                "normalized_quantity": "1.68",
                "normalized_quantity_unit": "lb",
                "line_total": "0.99",
                "unit_price": "0.99",
                "measure_type": "weight",
                "price_per_lb": "0.5893",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        costco_banana = base_row()
        costco_banana.update(
            {
                "retailer": "costco",
                "order_id": "c1",
                "line_no": "1",
                "normalized_row_id": "costco:c1:1",
                "normalized_item_id": "cnorm:banana",
                "order_date": "2026-03-12",
                "item_name": "BANANAS 3 LB / 1.36 KG",
                "item_name_norm": "BANANA",
                "retailer_item_id": "30669",
                "qty": "1",
                "unit": "E",
                "normalized_quantity": "3",
                "normalized_quantity_unit": "lb",
                "line_total": "2.98",
                "net_line_total": "1.49",
                "unit_price": "2.98",
                "size_value": "3",
                "size_unit": "lb",
                "measure_type": "weight",
                "price_per_lb": "0.4967",
                "raw_order_path": "data/costco-web/raw/c1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        giant_ice = base_row()
        giant_ice.update(
            {
                "retailer": "giant",
                "order_id": "g2",
                "line_no": "1",
                "normalized_row_id": "giant:g2:1",
                "normalized_item_id": "gnorm:ice",
                "order_date": "2026-03-02",
                "item_name": "SB BAGGED ICE 20LB",
                "item_name_norm": "BAGGED ICE",
                "retailer_item_id": "101",
                "qty": "2",
                "unit": "EA",
                "normalized_quantity": "40",
                "normalized_quantity_unit": "lb",
                "line_total": "9.98",
                "unit_price": "4.99",
                "size_value": "20",
                "size_unit": "lb",
                "measure_type": "weight",
                "price_per_lb": "0.2495",
                "raw_order_path": "data/giant-web/raw/g2.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        costco_patty = base_row()
        costco_patty.update(
            {
                "retailer": "costco",
                "order_id": "c2",
                "line_no": "1",
                "normalized_row_id": "costco:c2:1",
                "normalized_item_id": "cnorm:patty",
                "order_date": "2026-03-03",
                "item_name": "BEEF PATTIES 6# BAG",
                "item_name_norm": "BEEF PATTIES 6# BAG",
                "retailer_item_id": "777",
                "qty": "1",
                "unit": "E",
                "normalized_quantity": "1",
                "normalized_quantity_unit": "each",
                "line_total": "26.99",
                "net_line_total": "26.99",
                "unit_price": "26.99",
                "measure_type": "each",
                "raw_order_path": "data/costco-web/raw/c2.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        giant_patty = base_row()
        giant_patty.update(
            {
                "retailer": "giant",
                "order_id": "g3",
                "line_no": "1",
                "normalized_row_id": "giant:g3:1",
                "normalized_item_id": "gnorm:patty",
                "order_date": "2026-03-04",
                "item_name": "80% PATTIES PK12",
                "item_name_norm": "80% PATTIES PK12",
                "retailer_item_id": "102",
                "qty": "1",
                "unit": "LB",
                "normalized_quantity": "",
                "normalized_quantity_unit": "",
                "line_total": "10.05",
                "unit_price": "10.05",
                "measure_type": "weight",
                "price_per_lb": "7.7907",
                "raw_order_path": "data/giant-web/raw/g3.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        rows, _links = build_purchases.build_purchase_rows(
            [giant_banana, giant_ice, giant_patty],
            [costco_banana, costco_patty],
            [],
            [],
            [],
            [],
            [],
        )
        rows_by_item = {row["normalized_item_id"]: row for row in rows}
        self.assertEqual("0.5893", rows_by_item["gnorm:banana"]["effective_price"])
        self.assertEqual("lb", rows_by_item["gnorm:banana"]["effective_price_unit"])
        self.assertEqual("0.4967", rows_by_item["cnorm:banana"]["effective_price"])
        self.assertEqual("lb", rows_by_item["cnorm:banana"]["effective_price_unit"])
        self.assertEqual("0.2495", rows_by_item["gnorm:ice"]["effective_price"])
        self.assertEqual("lb", rows_by_item["gnorm:ice"]["effective_price_unit"])
        self.assertEqual("26.99", rows_by_item["cnorm:patty"]["effective_price"])
        self.assertEqual("each", rows_by_item["cnorm:patty"]["effective_price_unit"])
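        # The Giant patty fixture leaves normalized_quantity empty; the effective price is expected to stay blank.
        self.assertEqual("", rows_by_item["gnorm:patty"]["effective_price"])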
        self.assertEqual("", rows_by_item["gnorm:patty"]["effective_price_unit"])
    def test_build_purchase_rows_leaves_effective_price_blank_without_valid_denominator(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS
        row = {field: "" for field in fieldnames}
        row.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "normalized_row_id": "giant:g1:1",
                "normalized_item_id": "gnorm:blank",
                "order_date": "2026-03-01",
                "item_name": "MYSTERY ITEM",
                "item_name_norm": "MYSTERY ITEM",
                "retailer_item_id": "100",
                "qty": "1",
                "unit": "EA",
                "normalized_quantity": "0",
                "normalized_quantity_unit": "each",
                "line_total": "3.50",
                "unit_price": "3.50",
                "measure_type": "each",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        rows, _links = build_purchases.build_purchase_rows([row], [], [], [], [], [], [])
        self.assertEqual("", rows[0]["effective_price"])
        self.assertEqual("", rows[0]["effective_price_unit"])
    def test_purchase_rows_support_visit_level_grouping_without_extra_joins(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS
        def base_row():
            return {field: "" for field in fieldnames}
        row_one = base_row()
        row_one.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "normalized_row_id": "giant:g1:1",
                "normalized_item_id": "gnorm:first",
                "order_date": "2026-03-01",
                "item_name": "FIRST ITEM",
                "item_name_norm": "FIRST ITEM",
                "qty": "1",
                "unit": "EA",
                "normalized_quantity": "1",
                "normalized_quantity_unit": "each",
                "line_total": "3.50",
                "measure_type": "each",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        row_two = base_row()
        row_two.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "2",
                "normalized_row_id": "giant:g1:2",
                "normalized_item_id": "gnorm:second",
                "order_date": "2026-03-01",
                "item_name": "SECOND ITEM",
                "item_name_norm": "SECOND ITEM",
                "qty": "1",
                "unit": "EA",
                "normalized_quantity": "1",
                "normalized_quantity_unit": "each",
                "line_total": "2.00",
                "measure_type": "each",
                "raw_order_path": "data/giant-web/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        rows, _links = build_purchases.build_purchase_rows(
            [row_one, row_two],
            [],
            [
                {
                    "order_id": "g1",
                    "store_name": "Giant",
                    "store_number": "42",
                    "store_city": "Springfield",
                    "store_state": "VA",
                }
            ],
            [],
            [],
            [],
            [],
        )
        visit_key = {
            (
                row["retailer"],
                row["order_id"],
                row["purchase_date"],
                row["store_name"],
                row["store_number"],
                row["store_city"],
                row["store_state"],
            )
            for row in rows
        }
        visit_total = sum(float(row["net_line_total"]) for row in rows)
        self.assertEqual(1, len(visit_key))
        # Compare the float sum with a tolerance instead of exact equality.
        self.assertAlmostEqual(5.5, visit_total)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_review_queue.py
+++ b/tests/test_review_queue.py
@@ -1,133 +0,0 @@
 import tempfile
 import unittest
 from pathlib import Path
 import build_observed_products
 import build_review_queue
 from layer_helpers import write_csv_rows
 class ReviewQueueTests(unittest.TestCase):
    def test_build_review_queue_preserves_existing_status(self):
        observed_rows = [
            {
                "observed_product_id": "gobs_1",
                "retailer": "giant",
                "representative_upc": "111",
                "representative_image_url": "",
                "representative_name_norm": "GALA APPLE",
                "times_seen": "2",
                "distinct_item_names_count": "2",
                "distinct_upcs_count": "1",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            }
        ]
        item_rows = [
            {
                "observed_product_id": "gobs_1",
                "item_name": "SB GALA APPLE 5LB",
                "item_name_norm": "GALA APPLE",
                "line_total": "7.99",
            },
            {
                "observed_product_id": "gobs_1",
                "item_name": "SB GALA APPLE 5 LB",
                "item_name_norm": "GALA APPLE",
                "line_total": "8.49",
            },
        ]
        existing = {
            build_review_queue.stable_id("rvw", "gobs_1|missing_image"): {
                "status": "approved",
                "resolution_notes": "looked fine",
                "created_at": "2026-03-15",
            }
        }
        queue = build_review_queue.build_review_queue(
            observed_rows, item_rows, existing, "2026-03-16"
        )
        self.assertEqual(2, len(queue))
        missing_image = [row for row in queue if row["reason_code"] == "missing_image"][0]
        self.assertEqual("approved", missing_image["status"])
        self.assertEqual("looked fine", missing_image["resolution_notes"])
    def test_review_queue_main_writes_output(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            observed_path = Path(tmpdir) / "products_observed.csv"
            items_path = Path(tmpdir) / "items_enriched.csv"
            output_path = Path(tmpdir) / "review_queue.csv"
            observed_rows = [
                {
                    "observed_product_id": "gobs_1",
                    "retailer": "giant",
                    "observed_key": "giant|upc=111|name=GALA APPLE",
                    "representative_retailer_item_id": "11",
                    "representative_upc": "111",
                    "representative_item_name": "SB GALA APPLE 5LB",
                    "representative_name_norm": "GALA APPLE",
                    "representative_brand": "SB",
                    "representative_variant": "",
                    "representative_size_value": "5",
                    "representative_size_unit": "lb",
                    "representative_pack_qty": "",
                    "representative_measure_type": "weight",
                    "representative_image_url": "",
                    "is_store_brand": "true",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "first_seen_date": "2026-01-01",
                    "last_seen_date": "2026-01-10",
                    "times_seen": "2",
                    "example_order_id": "1",
                    "example_item_name": "SB GALA APPLE 5LB",
                    "raw_name_examples": "SB GALA APPLE 5LB | SB GALA APPLE 5 LB",
                    "normalized_name_examples": "GALA APPLE",
                    "example_prices": "7.99 | 8.49",
                    "distinct_item_names_count": "2",
                    "distinct_retailer_item_ids_count": "1",
                    "distinct_upcs_count": "1",
                }
            ]
            item_rows = [
                {
                    "retailer": "giant",
                    "order_id": "1",
                    "line_no": "1",
                    "item_name": "SB GALA APPLE 5LB",
                    "item_name_norm": "GALA APPLE",
                    "retailer_item_id": "11",
                    "upc": "111",
                    "size_value": "5",
                    "size_unit": "lb",
                    "pack_qty": "",
                    "measure_type": "weight",
                    "is_store_brand": "true",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "line_total": "7.99",
                }
            ]
            write_csv_rows(
                observed_path, observed_rows, build_observed_products.OUTPUT_FIELDS
            )
            write_csv_rows(items_path, item_rows, list(item_rows[0].keys()))
            build_review_queue.main.callback(
                observed_csv=str(observed_path),
                items_enriched_csv=str(items_path),
                output_csv=str(output_path),
            )
            self.assertTrue(output_path.exists())
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_review_workflow.py
+++ b/tests/test_review_workflow.py
@@ -0,0 +1,760 @@
 import csv
 import tempfile
 import unittest
 from pathlib import Path
 from unittest import mock
 from click.testing import CliRunner
 import enrich_costco
 import review_products
 def write_review_source_files(tmpdir, rows):
    giant_items_csv = Path(tmpdir) / "giant_items.csv"
    costco_items_csv = Path(tmpdir) / "costco_items.csv"
    giant_orders_csv = Path(tmpdir) / "giant_orders.csv"
    costco_orders_csv = Path(tmpdir) / "costco_orders.csv"
    fieldnames = enrich_costco.OUTPUT_FIELDS
    grouped_rows = {"giant": [], "costco": []}
    grouped_orders = {"giant": {}, "costco": {}}
    for index, row in enumerate(rows, start=1):
        retailer = row.get("retailer", "giant")
        # Resolve the defaults once so the row id is built from the same values.
        order_id = row.get("order_id", f"{retailer[0]}{index}")
        line_no = row.get("line_no", str(index))
        normalized_row = {field: "" for field in fieldnames}
        normalized_row.update(
            {
                "retailer": retailer,
                "order_id": order_id,
                "line_no": line_no,
                "normalized_row_id": row.get(
                    "normalized_row_id",
                    f"{retailer}:{order_id}:{line_no}",
                ),
                "normalized_item_id": row.get("normalized_item_id", ""),
                "order_date": row.get("purchase_date", ""),
                "item_name": row.get("raw_item_name", ""),
                "item_name_norm": row.get("normalized_item_name", ""),
                "image_url": row.get("image_url", ""),
                "upc": row.get("upc", ""),
                "line_total": row.get("line_total", ""),
                "net_line_total": row.get("net_line_total", ""),
                "matched_discount_amount": row.get("matched_discount_amount", ""),
                "qty": row.get("qty", "1"),
                "unit": row.get("unit", "EA"),
                "normalized_quantity": row.get("normalized_quantity", ""),
                "normalized_quantity_unit": row.get("normalized_quantity_unit", ""),
                "size_value": row.get("size_value", ""),
                "size_unit": row.get("size_unit", ""),
                "pack_qty": row.get("pack_qty", ""),
                "measure_type": row.get("measure_type", "each"),
                "retailer_item_id": row.get("retailer_item_id", ""),
                "price_per_each": row.get("price_per_each", ""),
                "price_per_lb": row.get("price_per_lb", ""),
                "price_per_oz": row.get("price_per_oz", ""),
                "is_discount_line": row.get("is_discount_line", "false"),
                "is_coupon_line": row.get("is_coupon_line", "false"),
                "is_fee": row.get("is_fee", "false"),
                "raw_order_path": row.get("raw_order_path", ""),
            }
        )
        grouped_rows[retailer].append(normalized_row)
        order_id = normalized_row["order_id"]
        grouped_orders[retailer].setdefault(
            order_id,
            {
                "order_id": order_id,
                "store_name": row.get("store_name", ""),
                "store_number": row.get("store_number", ""),
                "store_city": row.get("store_city", ""),
                "store_state": row.get("store_state", ""),
            },
        )
    for path, source_rows in [
        (giant_items_csv, grouped_rows["giant"]),
        (costco_items_csv, grouped_rows["costco"]),
    ]:
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(source_rows)
    order_fields = ["order_id", "store_name", "store_number", "store_city", "store_state"]
    for path, source_rows in [
        (giant_orders_csv, grouped_orders["giant"].values()),
        (costco_orders_csv, grouped_orders["costco"].values()),
    ]:
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=order_fields)
            writer.writeheader()
            writer.writerows(source_rows)
    return giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv
 class ReviewWorkflowTests(unittest.TestCase):
    def test_build_review_queue_groups_unresolved_purchases(self):
        queue_rows = review_products.build_review_queue(
            [
                {
                    "normalized_item_id": "gnorm_1",
                    "catalog_id": "",
                    "retailer": "giant",
                    "raw_item_name": "SB BAGGED ICE 20LB",
                    "normalized_item_name": "BAGGED ICE",
                    "upc": "",
                    "line_total": "3.50",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                },
                {
                    "normalized_item_id": "gnorm_1",
                    "catalog_id": "",
                    "retailer": "giant",
                    "raw_item_name": "SB BAG ICE CUBED 10LB",
                    "normalized_item_name": "BAG ICE",
                    "upc": "",
                    "line_total": "2.50",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                },
            ],
            [],
        )
        self.assertEqual(1, len(queue_rows))
        self.assertEqual("gnorm_1", queue_rows[0]["normalized_item_id"])
        self.assertIn("SB BAGGED ICE 20LB", queue_rows[0]["raw_item_names"])
    def test_build_catalog_suggestions_prefers_upc_then_name(self):
        suggestions = review_products.build_catalog_suggestions(
            [
                {
                    "normalized_item_name": "MIXED PEPPER",
                    "upc": "12345",
                }
            ],
            [
                {
                    "normalized_item_id": "prior_1",
                    "normalized_item_name": "MIXED PEPPER 6 PACK",
                    "upc": "12345",
                    "catalog_id": "cat_2",
                }
            ],
            [
                {
                    "catalog_id": "cat_1",
                    "catalog_name": "MIXED PEPPER",
                },
                {
                    "catalog_id": "cat_2",
                    "catalog_name": "MIXED PEPPER 6 PACK",
                },
            ],
        )
        self.assertEqual("cat_2", suggestions[0]["catalog_id"])
        self.assertEqual("exact upc", suggestions[0]["reason"])
    def test_search_catalog_rows_ranks_token_overlap(self):
        results = review_products.search_catalog_rows(
            "mixed pepper",
            [
                {
                    "catalog_id": "cat_1",
                    "catalog_name": "MIXED PEPPER",
                    "product_type": "pepper",
                    "category": "produce",
                    "variant": "",
                },
                {
                    "catalog_id": "cat_2",
                    "catalog_name": "GROUND PEPPER",
                    "product_type": "spice",
                    "category": "baking",
                    "variant": "",
                },
            ],
            [
                {
                    "normalized_item_id": "gnorm_mix",
                    "catalog_id": "cat_1",
                }
            ],
            "cnorm_mix",
        )
        self.assertEqual("cat_1", results[0]["catalog_id"])
        self.assertGreater(results[0]["score"], results[1]["score"])
    def test_review_products_displays_position_items_and_suggestions(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
                tmpdir,
                [
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "costco",
                        "order_id": "c2",
                        "line_no": "2",
                        "normalized_item_id": "cnorm_mix",
                        "raw_item_name": "MIXED PEPPER 6-PACK",
                        "normalized_item_name": "MIXED PEPPER",
                        "image_url": "",
                        "upc": "",
                        "line_total": "7.49",
                    },
                    {
                        "purchase_date": "2026-03-12",
                        "retailer": "costco",
                        "order_id": "c1",
                        "line_no": "1",
                        "normalized_item_id": "cnorm_mix",
                        "raw_item_name": "MIXED PEPPER 6-PACK",
                        "normalized_item_name": "MIXED PEPPER",
                        "image_url": "https://example.test/mixed-pepper.jpg",
                        "upc": "",
                        "line_total": "6.99",
                    },
                    {
                        "purchase_date": "2026-03-10",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "normalized_item_id": "gnorm_mix",
                        "raw_item_name": "MIXED PEPPER",
                        "normalized_item_name": "MIXED PEPPER",
                        "image_url": "",
                        "upc": "",
                        "line_total": "5.99",
                    },
                ],
            )
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "catalog_id": "cat_mix",
                        "catalog_name": "MIXED PEPPER",
                        "category": "produce",
                        "product_type": "pepper",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )
            with links_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.PRODUCT_LINK_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "normalized_item_id": "gnorm_mix",
                        "catalog_id": "cat_mix",
                        "link_method": "manual_link",
                        "link_confidence": "high",
                        "review_status": "approved",
                        "reviewed_by": "",
                        "reviewed_at": "",
                        "link_notes": "",
                    }
                )
            runner = CliRunner()
            result = runner.invoke(
                review_products.main,
                [
                    "--giant-items-enriched-csv",
                    str(giant_items_csv),
                    "--costco-items-enriched-csv",
                    str(costco_items_csv),
                    "--giant-orders-csv",
                    str(giant_orders_csv),
                    "--costco-orders-csv",
                    str(costco_orders_csv),
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--links-csv",
                    str(links_csv),
                ],
                input="q\n",
                color=True,
            )
            self.assertEqual(0, result.exit_code)
            self.assertIn("Review guide:", result.output)
            self.assertIn("Review 1/1: MIXED PEPPER", result.output)
            self.assertIn("2 matched items:", result.output)
            self.assertIn("[#] link to suggestion  [f]ind  [n]ew  [s]kip  e[x]clude  [q]uit >", result.output)
            first_item = result.output.index("[1] MIXED PEPPER 6-PACK | costco | 2026-03-14 | 7.49 | ")
            second_item = result.output.index("[2] MIXED PEPPER 6-PACK | costco | 2026-03-12 | 6.99 | https://example.test/mixed-pepper.jpg")
            self.assertLess(first_item, second_item)
            self.assertIn("1 catalog_name suggestions found:", result.output)
            self.assertIn("[1] MIXED PEPPER, pepper, produce (1 items, 1 rows)", result.output)
            # color=True keeps ANSI escape sequences in the captured output.
            self.assertIn("\x1b[", result.output)
    def test_review_products_no_suggestions_is_informational(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
                tmpdir,
                [
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "normalized_item_id": "gnorm_ice",
                        "raw_item_name": "SB BAGGED ICE 20LB",
                        "normalized_item_name": "BAGGED ICE",
                        "image_url": "",
                        "upc": "",
                        "line_total": "3.50",
                    }
                ],
            )
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
            result = CliRunner().invoke(
                review_products.main,
                [
                    "--giant-items-enriched-csv",
                    str(giant_items_csv),
                    "--costco-items-enriched-csv",
                    str(costco_items_csv),
                    "--giant-orders-csv",
                    str(giant_orders_csv),
                    "--costco-orders-csv",
                    str(costco_orders_csv),
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--links-csv",
                    str(links_csv),
                ],
                input="q\n",
                color=True,
            )
            self.assertEqual(0, result.exit_code)
            self.assertIn("no catalog_name suggestions found", result.output)
    def test_search_links_catalog_and_writes_link_row(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
                tmpdir,
                [
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "costco",
                        "order_id": "c2",
                        "line_no": "2",
                        "normalized_item_id": "cnorm_mix",
                        "raw_item_name": "MIXED PEPPER 6-PACK",
                        "normalized_item_name": "MIXED PEPPER",
                        "image_url": "",
                        "upc": "",
                        "line_total": "7.49",
                    },
                    {
                        "purchase_date": "2026-03-12",
                        "retailer": "costco",
                        "order_id": "c1",
                        "line_no": "1",
                        "normalized_item_id": "cnorm_mix",
                        "raw_item_name": "MIXED PEPPER 6-PACK",
                        "normalized_item_name": "MIXED PEPPER",
                        "image_url": "",
                        "upc": "",
                        "line_total": "6.99",
                    },
                    {
                        "purchase_date": "2026-03-10",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "normalized_item_id": "gnorm_mix",
                        "raw_item_name": "MIXED PEPPER",
                        "normalized_item_name": "MIXED PEPPER",
                        "image_url": "",
                        "upc": "",
                        "line_total": "5.99",
                    },
                ],
            )
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "catalog_id": "cat_mix",
                        "catalog_name": "MIXED PEPPER",
                        "category": "",
                        "product_type": "",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )
            with links_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.PRODUCT_LINK_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "normalized_item_id": "gnorm_mix",
                        "catalog_id": "cat_mix",
                        "link_method": "manual_link",
                        "link_confidence": "high",
                        "review_status": "approved",
                        "reviewed_by": "",
                        "reviewed_at": "",
                        "link_notes": "",
                    }
                )
            result = CliRunner().invoke(
                review_products.main,
                [
                    "--giant-items-enriched-csv",
                    str(giant_items_csv),
                    "--costco-items-enriched-csv",
                    str(costco_items_csv),
                    "--giant-orders-csv",
                    str(giant_orders_csv),
                    "--costco-orders-csv",
                    str(costco_orders_csv),
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--links-csv",
                    str(links_csv),
                    "--limit",
                    "1",
                ],
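                # Scripted input: f opens search, then the query, then result 1; the last line answers the follow-up prompt.
                input="f\nmixed pepper\n1\nlinked by test\n",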
                color=True,
            )
            self.assertEqual(0, result.exit_code)
            self.assertIn("1 search results found:", result.output)
            with resolutions_csv.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            with links_csv.open(newline="", encoding="utf-8") as handle:
                link_rows = list(csv.DictReader(handle))
            self.assertEqual("cat_mix", rows[0]["catalog_id"])
            self.assertEqual("link", rows[0]["resolution_action"])
            self.assertEqual("cat_mix", link_rows[0]["catalog_id"])
    def test_search_no_matches_allows_retry_or_return(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
                tmpdir,
                [
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "normalized_item_id": "gnorm_ice",
                        "raw_item_name": "SB BAGGED ICE 20LB",
                        "normalized_item_name": "BAGGED ICE",
                        "image_url": "",
                        "upc": "",
                        "line_total": "3.50",
                    }
                ],
            )
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "catalog_id": "cat_ice",
                        "catalog_name": "ICE",
                        "category": "frozen",
                        "product_type": "ice",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )
            result = CliRunner().invoke(
                review_products.main,
                [
                    "--giant-items-enriched-csv",
                    str(giant_items_csv),
                    "--costco-items-enriched-csv",
                    str(costco_items_csv),
                    "--giant-orders-csv",
                    str(giant_orders_csv),
                    "--costco-orders-csv",
                    str(costco_orders_csv),
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--links-csv",
                    str(links_csv),
                ],
                input="f\nzzz\nq\nq\n",
                color=True,
            )
            self.assertEqual(0, result.exit_code)
            self.assertIn("no matches found", result.output)
    def test_skip_remains_available_from_main_prompt(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
                tmpdir,
                [
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "normalized_item_id": "gnorm_skip",
                        "raw_item_name": "TEST ITEM",
                        "normalized_item_name": "TEST ITEM",
                        "image_url": "",
                        "upc": "",
                        "line_total": "1.00",
                    }
                ],
            )
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
            result = CliRunner().invoke(
                review_products.main,
                [
                    "--giant-items-enriched-csv",
                    str(giant_items_csv),
                    "--costco-items-enriched-csv",
                    str(costco_items_csv),
                    "--giant-orders-csv",
                    str(giant_orders_csv),
                    "--costco-orders-csv",
                    str(costco_orders_csv),
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--links-csv",
                    str(links_csv),
                    "--limit",
                    "1",
                ],
                input="s\n",
                color=True,
            )
            self.assertEqual(0, result.exit_code)
            with resolutions_csv.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            self.assertEqual("skip", rows[0]["resolution_action"])
            self.assertEqual("pending", rows[0]["status"])
    def test_review_products_creates_catalog_and_resolution(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            giant_items_csv, costco_items_csv, giant_orders_csv, costco_orders_csv = write_review_source_files(
                tmpdir,
                [
                    {
                        "purchase_date": "2026-03-15",
                        "normalized_item_id": "gnorm_ice",
                        "retailer": "giant",
                        "raw_item_name": "SB BAGGED ICE 20LB",
                        "normalized_item_name": "BAGGED ICE",
                        "image_url": "",
                        "upc": "",
                        "line_total": "3.50",
                        "order_id": "g1",
                        "line_no": "1",
                    }
                ],
            )
            with mock.patch.object(
                review_products.click,
                "prompt",
                side_effect=["n", "ICE", "frozen", "ice", "manual merge", "q"],
            ):
                review_products.main.callback(
                    giant_items_enriched_csv=str(giant_items_csv),
                    costco_items_enriched_csv=str(costco_items_csv),
                    giant_orders_csv=str(giant_orders_csv),
                    costco_orders_csv=str(costco_orders_csv),
                    purchases_csv=str(purchases_csv),
                    queue_csv=str(queue_csv),
                    resolutions_csv=str(resolutions_csv),
                    catalog_csv=str(catalog_csv),
                    links_csv=str(links_csv),
                    limit=1,
                    refresh_only=False,
                )
            self.assertTrue(queue_csv.exists())
            self.assertTrue(resolutions_csv.exists())
            self.assertTrue(catalog_csv.exists())
            self.assertTrue(links_csv.exists())
            with queue_csv.open(newline="", encoding="utf-8") as handle:
                queue_rows = list(csv.DictReader(handle))
            with resolutions_csv.open(newline="", encoding="utf-8") as handle:
                resolution_rows = list(csv.DictReader(handle))
            with catalog_csv.open(newline="", encoding="utf-8") as handle:
                catalog_rows = list(csv.DictReader(handle))
            with links_csv.open(newline="", encoding="utf-8") as handle:
                link_rows = list(csv.DictReader(handle))
            self.assertEqual("approved", queue_rows[0]["status"])
            self.assertEqual("create", queue_rows[0]["resolution_action"])
            self.assertEqual("create", resolution_rows[0]["resolution_action"])
            self.assertEqual("approved", resolution_rows[0]["status"])
            self.assertEqual("ICE", catalog_rows[0]["catalog_name"])
            self.assertEqual(catalog_rows[0]["catalog_id"], link_rows[0]["catalog_id"])
    def test_build_review_queue_readds_orphaned_and_incomplete_links(self):
        purchase_rows = [
            {
                "normalized_item_id": "gnorm_orphan",
                "catalog_id": "cat_missing",
                "retailer": "giant",
                "raw_item_name": "ORPHAN ITEM",
                "normalized_item_name": "ORPHAN ITEM",
                "upc": "",
                "line_total": "3.50",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "normalized_item_id": "gnorm_incomplete",
                "catalog_id": "cat_incomplete",
                "retailer": "giant",
                "raw_item_name": "INCOMPLETE ITEM",
                "normalized_item_name": "INCOMPLETE ITEM",
                "upc": "",
                "line_total": "4.50",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
        ]
        link_rows = [
            {
                "normalized_item_id": "gnorm_orphan",
                "catalog_id": "cat_missing",
            },
            {
                "normalized_item_id": "gnorm_incomplete",
                "catalog_id": "cat_incomplete",
            },
        ]
        catalog_rows = [
            {
                "catalog_id": "cat_incomplete",
                "catalog_name": "INCOMPLETE ITEM",
                "product_type": "",
            }
        ]
        queue_rows = review_products.build_review_queue(
            purchase_rows,
            [],
            link_rows,
            catalog_rows,
            [],
        )
        reasons = {row["normalized_item_id"]: row["reason_code"] for row in queue_rows}
        self.assertEqual("orphaned_catalog_link", reasons["gnorm_orphan"])
        self.assertEqual("incomplete_catalog_link", reasons["gnorm_incomplete"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_scraper.py
+++ b/tests/test_scraper.py
@@ -3,7 +3,7 @@ import tempfile
 import unittest
 from pathlib import Path
-import scraper
+import scrape_giant as scraper
 class ScraperTests(unittest.TestCase):
@@ -58,14 +58,25 @@ class ScraperTests(unittest.TestCase):
            }
        ]
-        orders, items = scraper.flatten_orders(history, details)
+        orders, items = scraper.flatten_orders(
            history,
            details,
            history_path=Path("data/giant-web/raw/history.json"),
            raw_dir=Path("data/giant-web/raw"),
        )
        self.assertEqual(1, len(orders))
        self.assertEqual("abc123", orders[0]["order_id"])
        self.assertEqual("giant", orders[0]["retailer"])
        self.assertEqual("PICKUP", orders[0]["service_type"])
        self.assertEqual("data/giant-web/raw/history.json", orders[0]["raw_history_path"])
        self.assertEqual("data/giant-web/raw/abc123.json", orders[0]["raw_order_path"])
        self.assertEqual(1, len(items))
        self.assertEqual("1", items[0]["line_no"])
        self.assertEqual("Bananas", items[0]["item_name"])
        self.assertEqual("giant", items[0]["retailer"])
        self.assertEqual("data/giant-web/raw/abc123.json", items[0]["raw_order_path"])
        self.assertEqual("false", items[0]["is_discount_line"])
    def test_append_dedup_replaces_duplicate_rows_and_preserves_new_values(self):
        with tempfile.TemporaryDirectory() as tmpdir:
--- a/validate_cross_retailer_flow.py
+++ b/validate_cross_retailer_flow.py
@@ -1,154 +0,0 @@
 import json
 from pathlib import Path
 import click
 import build_canonical_layer
 import build_observed_products
 from layer_helpers import stable_id, write_csv_rows
 PROOF_FIELDS = [
    "proof_name",
    "canonical_product_id",
    "giant_observed_product_id",
    "costco_observed_product_id",
    "giant_example_item",
    "costco_example_item",
    "notes",
 ]
 def read_rows(path):
    import csv
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
 def find_proof_pair(observed_rows):
    giant = None
    costco = None
    for row in observed_rows:
        if row["retailer"] == "giant" and row["representative_name_norm"] == "BANANA":
            giant = row
        if row["retailer"] == "costco" and row["representative_name_norm"] == "BANANA":
            costco = row
    return giant, costco
 def merge_proof_pair(canonical_rows, link_rows, giant_row, costco_row):
    if not giant_row or not costco_row:
        return canonical_rows, link_rows, []
    proof_canonical_id = stable_id("gcan", "proof|banana")
    link_rows = [
        row
        for row in link_rows
        if row["observed_product_id"]
        not in {giant_row["observed_product_id"], costco_row["observed_product_id"]}
    ]
    canonical_rows = [
        row
        for row in canonical_rows
        if row["canonical_product_id"] != proof_canonical_id
    ]
    canonical_rows.append(
        {
            "canonical_product_id": proof_canonical_id,
            "canonical_name": "BANANA",
            "product_type": "banana",
            "brand": "",
            "variant": "",
            "size_value": "",
            "size_unit": "",
            "pack_qty": "",
            "measure_type": "weight",
            "normalized_quantity": "",
            "normalized_quantity_unit": "",
            "notes": "manual proof merge for cross-retailer validation",
            "created_at": "",
            "updated_at": "",
        }
    )
    for observed_row in [giant_row, costco_row]:
        link_rows.append(
            {
                "observed_product_id": observed_row["observed_product_id"],
                "canonical_product_id": proof_canonical_id,
                "link_method": "manual_proof_merge",
                "link_confidence": "medium",
                "review_status": "",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "cross-retailer validation proof",
            }
        )
    proof_rows = [
        {
            "proof_name": "banana",
            "canonical_product_id": proof_canonical_id,
            "giant_observed_product_id": giant_row["observed_product_id"],
            "costco_observed_product_id": costco_row["observed_product_id"],
            "giant_example_item": giant_row["example_item_name"],
            "costco_example_item": costco_row["example_item_name"],
            "notes": "BANANA proof pair built from Giant and Costco enriched rows",
        }
    ]
    return canonical_rows, link_rows, proof_rows
@click.command()
@click.option(
    "--giant-items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
 )
@click.option(
    "--costco-items-enriched-csv",
    default="costco_output/items_enriched.csv",
    show_default=True,
 )
@click.option(
    "--outdir",
    default="combined_output",
    show_default=True,
 )
 def main(giant_items_enriched_csv, costco_items_enriched_csv, outdir):
    outdir = Path(outdir)
    rows = read_rows(giant_items_enriched_csv) + read_rows(costco_items_enriched_csv)
    observed_rows = build_observed_products.build_observed_products(rows)
    canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows)
    giant_row, costco_row = find_proof_pair(observed_rows)
    if not giant_row or not costco_row:
        raise click.ClickException(
            "could not find BANANA proof pair across Giant and Costco observed products"
        )
    canonical_rows, link_rows, proof_rows = merge_proof_pair(
        canonical_rows, link_rows, giant_row, costco_row
    )
    write_csv_rows(
        outdir / "products_observed.csv",
        observed_rows,
        build_observed_products.OUTPUT_FIELDS,
    )
    write_csv_rows(
        outdir / "products_canonical.csv",
        canonical_rows,
        build_canonical_layer.CANONICAL_FIELDS,
    )
    write_csv_rows(
        outdir / "product_links.csv",
        link_rows,
        build_canonical_layer.LINK_FIELDS,
    )
    write_csv_rows(outdir / "proof_examples.csv", proof_rows, PROOF_FIELDS)
    click.echo(
        f"wrote combined outputs to {outdir} using {len(observed_rows)} observed rows"
    )
 if __name__ == "__main__":
    main()
Author	SHA1	Message	Date
ben	74d17b0b0c	minor edit	2026-03-24 17:28:16 -04:00
ben	fea5132100	minor edi	2026-03-24 17:27:34 -04:00
ben	eb3959ae0f	Record t1.22.1 task evidence	2026-03-24 17:26:00 -04:00
ben	867275c67a	Trim requirements to direct runtime deps	2026-03-24 17:25:52 -04:00
ben	6336c15da8	Record t1.22 task evidence	2026-03-24 17:10:09 -04:00
ben	09829b2b9d	Finalize post-refactor layout and remove old pipeline files	2026-03-24 17:09:57 -04:00
ben	cdb7a15739	Record t1.21 task evidence	2026-03-24 16:49:01 -04:00
ben	46a3b2c639	Add purchase analysis summaries	2026-03-24 16:48:53 -04:00
ben	c35688c87f	Record t1.20 task evidence	2026-03-24 08:29:31 -04:00
ben	6940f165fb	Document visit-level purchase analysis	2026-03-24 08:29:26 -04:00
ben	de8ff535b8	1.18 cleanup and review	2026-03-24 08:27:41 -04:00
ben	02be6f52c0	Record t1.19 task evidence	2026-03-23 15:32:48 -04:00
ben	8ccf3ff43b	Reconcile review queue against current catalog state	2026-03-23 15:32:41 -04:00
ben	a93229408b	Record t1.18.4 task evidence	2026-03-23 15:28:05 -04:00
ben	a45522c110	Finalize purchase effective price fields	2026-03-23 15:27:58 -04:00
ben	d78230f1c6	Record t1.18.3 task evidence	2026-03-23 13:56:56 -04:00
ben	73176117fe	Fix Costco hash-size weight parsing	2026-03-23 13:56:47 -04:00
ben	facebced9c	Record t1.18.2 task evidence	2026-03-23 13:23:03 -04:00
ben	23dfc3de3e	Use picked weight for Giant quantity basis	2026-03-23 13:22:56 -04:00
ben	3bc76ed243	Record t1.18 and t1.18.1 evidence	2026-03-23 12:54:09 -04:00
ben	dc0d0614bb	Add effective price to purchases	2026-03-23 12:53:54 -04:00
ben	605c94498b	Add effective price regression tests	2026-03-23 12:52:41 -04:00
ben	d4f479b0d8	added effective_price and testing to id upstream data	2026-03-23 12:35:27 -04:00
ben	38c2c2ea2e	Record t1.17 task evidence	2026-03-21 21:50:16 -04:00
ben	d25448b690	Fix normalized quantity basis	2026-03-21 21:50:10 -04:00
eulaly	db761adafc	added notes from first review session	2026-03-21 20:53:22 -04:00
eulaly	e8e11e15b3	added draft scope for review/search loop	2026-03-21 09:48:34 -04:00
ben	afadd0c0d0	Restore skip and move search to find	2026-03-20 13:35:07 -04:00
ben	2847d2d59f	Record t1.16.1 task evidence	2026-03-20 13:32:27 -04:00
ben	f93b9aa464	Add catalog search to review flow	2026-03-20 13:32:20 -04:00
ben	17158fb9e9	Record t1.16 task evidence	2026-03-20 12:45:57 -04:00
ben	975d44bebb	Tighten review prompt flow	2026-03-20 12:45:38 -04:00
ben	f478795b5d	added t1.16 to cleanup review process	2026-03-20 12:42:23 -04:00
ben	59fb881c0a	Record t1.15 task evidence	2026-03-20 11:27:56 -04:00
ben	9104781b93	Refactor review pipeline around normalized items	2026-03-20 11:27:46 -04:00
ben	607c51038a	Record t1.14.3 task evidence	2026-03-20 11:09:50 -04:00
ben	bcec6b37d3	Clean Costco normalization artifacts	2026-03-20 11:09:44 -04:00
ben	848d229f2d	Record t1.14.2 task evidence	2026-03-20 10:05:08 -04:00
ben	d2e6f2afd3	Align refactor paths with data layout	2026-03-20 10:04:58 -04:00
eulaly	424a777dd0	added git note	2026-03-20 09:58:25 -04:00
eulaly	2e5d69c75e	added 14.2 and 14.3 for refactor prep	2026-03-20 09:55:46 -04:00
ben	3c2462845b	added task-sample	2026-03-18 15:47:12 -04:00
ben	c0023e8f3a	Record t1.14.1 task evidence	2026-03-18 15:46:31 -04:00
ben	9064de5f67	Refactor retailer normalization outputs	2026-03-18 15:46:20 -04:00
ben	ec1f36a140	Record t1.14 task evidence	2026-03-18 15:18:54 -04:00
ben	48c6eaf753	Refactor retailer collection entrypoints	2026-03-18 15:18:47 -04:00
ben	e74253f6fb	data-model prep for refactor, removing observed layer	2026-03-18 15:15:29 -04:00
ben	c13d144418	cleanup	2026-03-18 14:02:36 -04:00
ben	10aad05808	data-model refactor and prep scope	2026-03-18 13:08:28 -04:00
ben	9122821db1	Fix t1.13 evidence hashes	2026-03-17 15:08:09 -04:00
ben	7743421918	Record t1.13 task evidence	2026-03-17 15:07:51 -04:00
ben	08e2a86cbd	Make canonical auto-linking more conservative	2026-03-17 15:07:48 -04:00
ben	56a03bcb1d	Attach Costco discounts to purchase rows	2026-03-17 15:07:45 -04:00
ben	967e19e561	Add pipeline status accounting	2026-03-17 15:07:42 -04:00
ben	eddef7de2b	updated readme and prep for next phase	2026-03-17 13:59:57 -04:00
ben	83bc6c4a7c	Update t1.12 task evidence	2026-03-17 13:25:21 -04:00
ben	d39497c298	Refine product review prompt flow	2026-03-17 13:25:12 -04:00
ben	7b8141cd42	Improve product review display workflow	2026-03-17 12:25:47 -04:00
ben	e494386e64	build_purchases rev1	2026-03-17 12:21:44 -04:00
ben	7527fe37eb	added git notes	2026-03-17 12:21:24 -04:00
ben	a1fafa3885	added t1.12 scope to simplify review process	2026-03-17 12:20:48 -04:00
ben	37b2196023	added git notes	2026-03-17 09:23:00 -04:00
ben	7f8c3ed8eb	updated readme with Review steps	2026-03-17 09:14:14 -04:00
ben	91bfd3597e	Record t1.11 task evidence	2026-03-16 20:45:57 -04:00
ben	c7dad5489e	Add terminal review resolution workflow	2026-03-16 20:45:37 -04:00
ben	34eedff9c5	Record t1.8.7 and t1.9 task evidence	2026-03-16 18:01:16 -04:00
ben	be1bf6328e	Build pivot-ready purchase log	2026-03-16 18:01:09 -04:00
ben	6806c0e7ff	updated readme	2026-03-16 17:40:23 -04:00
ben	861955557a	added instructions	2026-03-16 17:34:22 -04:00
ben	6e1cde2c83	fix json data pull from /raw	2026-03-16 17:34:01 -04:00