Compare commits

1 commit · ec1f36a140...master · bf2934f487
.gitignore (vendored) · 1 line added

```diff
@@ -21,6 +21,7 @@ env/
 
 # --- project private data ---
 /private/
+giant_output/
 
 # --- django ---
 db.sqlite3
```
README.md · 131 lines deleted

@@ -1,131 +0,0 @@
# scrape-giant

CLI to pull purchase history from Giant and Costco websites and refine it into a single product catalog for external analysis.

Run each script step-by-step from the terminal.
## What It Does

1. `scrape_giant.py`: download Giant orders and items
2. `enrich_giant.py`: normalize Giant line items
3. `scrape_costco.py`: download Costco orders and items
4. `enrich_costco.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal
7. `report_pipeline_status.py`: show how many rows survive each stage
## Requirements

- Python 3.10+
- Firefox installed with active Giant and Costco sessions
## Install

```bash
python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```
## Optional `.env`

The current version works best with a `.env` file in the project root. The scrapers will prompt for these values if they are not found in the current browser session.

- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.
```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...

COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
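How the scripts actually read `.env` is not shown here; a minimal dependency-free loader would look like this sketch (`load_env` is a hypothetical helper name, and the real scripts may use a dotenv library instead):

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Load KEY=VALUE pairs from a .env file into os.environ.

    A sketch only: skips blanks and comments, keeps existing environment
    values, and does no quoting or interpolation.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault so real environment variables win over .env entries
        os.environ.setdefault(key.strip(), value.strip())
```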
## Run Order

Run the pipeline in this order:

```bash
python scrape_giant.py
python enrich_giant.py
python scrape_costco.py
python enrich_costco.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
```
Why run `build_purchases.py` twice:

- the first pass builds the current combined dataset and the review queue inputs
- `review_products.py` writes durable review decisions
- the second pass reapplies those decisions into the purchase output

If you only want to refresh the queue without reviewing interactively:

```bash
python review_products.py --refresh-only
```

If you want a quick stage-by-stage accountability check:

```bash
python report_pipeline_status.py
```
## Key Outputs

Giant:

- `giant_output/orders.csv`
- `giant_output/items.csv`
- `giant_output/items_enriched.csv`

Costco:

- `costco_output/orders.csv`
- `costco_output/items.csv`
- `costco_output/items_enriched.csv` (now preserves raw totals and matched net discount fields)

Combined:

- `combined_output/purchases.csv`
- `combined_output/review_queue.csv`
- `combined_output/review_resolutions.csv`
- `combined_output/canonical_catalog.csv`
- `combined_output/product_links.csv`
- `combined_output/comparison_examples.csv`
- `combined_output/pipeline_status.csv`
- `combined_output/pipeline_status.json`
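For the "external analysis" mentioned in the intro, the combined table loads with nothing beyond the stdlib. A sketch with hypothetical sample rows (only two of the real `purchases.csv` columns shown), summing net spend per retailer:

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample standing in for combined_output/purchases.csv;
# the real file has many more columns.
sample = io.StringIO(
    "retailer,net_line_total\n"
    "giant,3.99\n"
    "costco,11.49\n"
    "giant,2.50\n"
)

totals = defaultdict(Decimal)  # Decimal() == Decimal("0")
for row in csv.DictReader(sample):
    totals[row["retailer"]] += Decimal(row["net_line_total"])

print(dict(totals))
# → {'giant': Decimal('6.49'), 'costco': Decimal('11.49')}
```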
## Review Workflow

Run `review_products.py` to clean up unresolved or weakly unified items:

- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Decisions are saved and reused on later runs.

The review step is intentionally conservative:

- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text
## Notes

- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

## Test

```bash
./venv/bin/python -m unittest discover -s tests
```

## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow
agents.md · 24 lines deleted

@@ -1,24 +0,0 @@

# agent rules

## priorities

- optimize for simplicity, boringness, and long-term maintainability
- prefer minimal diffs; avoid refactors unless required for the active task

## tech stack

- python; pandas or polars
- file storage: json and csv, no sqlite or databases
- assume a local virtual env is available and accessible
- do not add new dependencies unless explicitly approved; if unavoidable, document the justification in the active task notes

## workflow

- prefer direct argv commands (no `bash -lc` / compound shell chains) unless necessary
- work on ONE task at a time unless explicitly instructed otherwise
- at the start of work, state the task id you are executing
- do not start work unless a task id is specified; if missing, choose the earliest unchecked task and say so
- propose incremental steps
- always include basic tests for core logic
- when you complete a task:
  - mark it [x] in pm/tasks.md
  - fill in the evidence with the commit hash + commands run
  - never mark complete unless acceptance criteria are met
  - include the date and time (HH:MM)
@@ -1,129 +0,0 @@

```python
import configparser
import os
import shutil
import sqlite3
import tempfile
from pathlib import Path

import browser_cookie3


def find_firefox_profile_dir():
    profiles_ini = firefox_profiles_root() / "profiles.ini"
    parser = configparser.RawConfigParser()
    if not profiles_ini.exists():
        raise FileNotFoundError(f"Firefox profiles.ini not found at {profiles_ini}")

    parser.read(profiles_ini, encoding="utf-8")
    profiles = []
    for section in parser.sections():
        if not section.startswith("Profile"):
            continue
        path_value = parser.get(section, "Path", fallback="")
        if not path_value:
            continue
        is_relative = parser.getboolean(section, "IsRelative", fallback=True)
        profile_path = (
            profiles_ini.parent / path_value if is_relative else Path(path_value)
        )
        profiles.append(
            (
                parser.getboolean(section, "Default", fallback=False),
                profile_path,
            )
        )

    if not profiles:
        raise FileNotFoundError("No Firefox profiles found in profiles.ini")

    # Prefer the profile marked Default, then fall back to path order.
    profiles.sort(key=lambda item: (not item[0], str(item[1])))
    return profiles[0][1]


def firefox_profiles_root():
    if os.name == "nt":
        appdata = os.getenv("APPDATA", "").strip()
        if not appdata:
            raise FileNotFoundError("APPDATA is not set")
        return Path(appdata) / "Mozilla" / "Firefox"
    return Path.home() / ".mozilla" / "firefox"


def load_firefox_cookies(domain_name, profile_dir):
    cookie_file = Path(profile_dir) / "cookies.sqlite"
    return browser_cookie3.firefox(cookie_file=str(cookie_file), domain_name=domain_name)


def read_firefox_local_storage(profile_dir, origin_filter):
    storage_root = profile_dir / "storage" / "default"
    if not storage_root.exists():
        return {}

    for ls_path in storage_root.glob("*/ls/data.sqlite"):
        origin = decode_firefox_origin(ls_path.parents[1].name)
        if origin_filter.lower() not in origin.lower():
            continue
        return {
            stringify_sql_value(row[0]): stringify_sql_value(row[1])
            for row in query_sqlite(ls_path, "SELECT key, value FROM data")
        }
    return {}


def read_firefox_webapps_store(profile_dir, origin_filter):
    webapps_path = profile_dir / "webappsstore.sqlite"
    if not webapps_path.exists():
        return {}

    values = {}
    for row in query_sqlite(
        webapps_path,
        "SELECT originKey, key, value FROM webappsstore2",
    ):
        origin = stringify_sql_value(row[0])
        if origin_filter.lower() not in origin.lower():
            continue
        values[stringify_sql_value(row[1])] = stringify_sql_value(row[2])
    return values


def query_sqlite(path, query):
    # Copy first: Firefox keeps its databases locked while running.
    copied_path = copy_sqlite_to_temp(path)
    connection = None
    cursor = None
    try:
        connection = sqlite3.connect(copied_path)
        cursor = connection.cursor()
        cursor.execute(query)
        rows = cursor.fetchall()
        return rows
    except sqlite3.OperationalError:
        return []
    finally:
        if cursor is not None:
            cursor.close()
        if connection is not None:
            connection.close()
        copied_path.unlink(missing_ok=True)


def copy_sqlite_to_temp(path):
    fd, tmp = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    shutil.copyfile(path, tmp)
    return Path(tmp)


def decode_firefox_origin(raw_origin):
    origin = raw_origin.split("^", 1)[0]
    return origin.replace("+++", "://")


def stringify_sql_value(value):
    if value is None:
        return ""
    if isinstance(value, bytes):
        for encoding in ("utf-8", "utf-16-le", "utf-16"):
            try:
                return value.decode(encoding)
            except UnicodeDecodeError:
                continue
        return value.decode("utf-8", errors="ignore")
    return str(value)
```
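The two small helpers at the bottom are easiest to sanity-check in isolation. Firefox names each origin's storage directory with `+++` in place of `://` and may append caret-delimited attributes; a self-contained copy (the example directory name is illustrative):

```python
def decode_firefox_origin(raw_origin):
    # Strip caret-delimited origin attributes, then restore "://".
    origin = raw_origin.split("^", 1)[0]
    return origin.replace("+++", "://")


def stringify_sql_value(value):
    # sqlite rows can contain None, bytes in several encodings, or numbers.
    if value is None:
        return ""
    if isinstance(value, bytes):
        for encoding in ("utf-8", "utf-16-le", "utf-16"):
            try:
                return value.decode(encoding)
            except UnicodeDecodeError:
                continue
        return value.decode("utf-8", errors="ignore")
    return str(value)


print(decode_firefox_origin("https+++www.costco.com^userContextId=1"))
# → https://www.costco.com
```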
@@ -1,220 +0,0 @@

```python
import click
import re

from layer_helpers import read_csv_rows, representative_value, stable_id, write_csv_rows


CANONICAL_FIELDS = [
    "canonical_product_id",
    "canonical_name",
    "product_type",
    "brand",
    "variant",
    "size_value",
    "size_unit",
    "pack_qty",
    "measure_type",
    "normalized_quantity",
    "normalized_quantity_unit",
    "notes",
    "created_at",
    "updated_at",
]

CANONICAL_DROP_TOKENS = {"CT", "COUNT", "COUNTS", "DOZ", "DOZEN", "DOZ.", "PACK"}

LINK_FIELDS = [
    "observed_product_id",
    "canonical_product_id",
    "link_method",
    "link_confidence",
    "review_status",
    "reviewed_by",
    "reviewed_at",
    "link_notes",
]


def to_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def normalized_quantity(row):
    size_value = to_float(row.get("representative_size_value"))
    pack_qty = to_float(row.get("representative_pack_qty")) or 1.0
    size_unit = row.get("representative_size_unit", "")
    measure_type = row.get("representative_measure_type", "")

    if size_value is not None and size_unit:
        return format(size_value * pack_qty, "g"), size_unit

    if row.get("representative_pack_qty") and measure_type == "count":
        return row["representative_pack_qty"], "count"

    if measure_type == "each":
        return "1", "each"

    return "", ""


def auto_link_rule(observed_row):
    if (
        observed_row.get("is_fee") == "true"
        or observed_row.get("is_discount_line") == "true"
        or observed_row.get("is_coupon_line") == "true"
    ):
        return "", "", ""

    if observed_row.get("representative_upc"):
        return (
            "exact_upc",
            f"upc={observed_row['representative_upc']}",
            "high",
        )

    if (
        observed_row.get("representative_name_norm")
        and observed_row.get("representative_size_value")
        and observed_row.get("representative_size_unit")
    ):
        return (
            "exact_name_size",
            "|".join(
                [
                    f"name={observed_row['representative_name_norm']}",
                    f"size={observed_row['representative_size_value']}",
                    f"unit={observed_row['representative_size_unit']}",
                    f"pack={observed_row['representative_pack_qty']}",
                    f"measure={observed_row['representative_measure_type']}",
                ]
            ),
            "high",
        )

    return "", "", ""


def clean_canonical_name(name):
    tokens = []
    for token in re.sub(r"[^A-Z0-9\s]", " ", (name or "").upper()).split():
        if token.isdigit():
            continue
        if token in CANONICAL_DROP_TOKENS:
            continue
        if re.fullmatch(r"\d+(?:PK|PACK)", token):
            continue
        if re.fullmatch(r"\d+DZ", token):
            continue
        tokens.append(token)
    return " ".join(tokens).strip()


def canonical_row_for_group(canonical_product_id, group_rows, link_method):
    quantity_value, quantity_unit = normalized_quantity(
        {
            "representative_size_value": representative_value(
                group_rows, "representative_size_value"
            ),
            "representative_size_unit": representative_value(
                group_rows, "representative_size_unit"
            ),
            "representative_pack_qty": representative_value(
                group_rows, "representative_pack_qty"
            ),
            "representative_measure_type": representative_value(
                group_rows, "representative_measure_type"
            ),
        }
    )
    return {
        "canonical_product_id": canonical_product_id,
        "canonical_name": clean_canonical_name(
            representative_value(group_rows, "representative_name_norm")
        )
        or representative_value(group_rows, "representative_name_norm"),
        "product_type": "",
        "brand": representative_value(group_rows, "representative_brand"),
        "variant": representative_value(group_rows, "representative_variant"),
        "size_value": representative_value(group_rows, "representative_size_value"),
        "size_unit": representative_value(group_rows, "representative_size_unit"),
        "pack_qty": representative_value(group_rows, "representative_pack_qty"),
        "measure_type": representative_value(group_rows, "representative_measure_type"),
        "normalized_quantity": quantity_value,
        "normalized_quantity_unit": quantity_unit,
        "notes": f"auto-linked via {link_method}",
        "created_at": "",
        "updated_at": "",
    }


def build_canonical_layer(observed_rows):
    canonical_rows = []
    link_rows = []
    groups = {}

    for observed_row in sorted(observed_rows, key=lambda row: row["observed_product_id"]):
        link_method, group_key, confidence = auto_link_rule(observed_row)
        if not group_key:
            continue

        canonical_product_id = stable_id("gcan", f"{link_method}|{group_key}")
        groups.setdefault(canonical_product_id, {"method": link_method, "rows": []})
        groups[canonical_product_id]["rows"].append(observed_row)
        link_rows.append(
            {
                "observed_product_id": observed_row["observed_product_id"],
                "canonical_product_id": canonical_product_id,
                "link_method": link_method,
                "link_confidence": confidence,
                "review_status": "",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "",
            }
        )

    for canonical_product_id, group in sorted(groups.items()):
        canonical_rows.append(
            canonical_row_for_group(
                canonical_product_id, group["rows"], group["method"]
            )
        )

    return canonical_rows, link_rows


@click.command()
@click.option(
    "--observed-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product rows.",
)
@click.option(
    "--canonical-csv",
    default="giant_output/products_canonical.csv",
    show_default=True,
    help="Path to canonical product output.",
)
@click.option(
    "--links-csv",
    default="giant_output/product_links.csv",
    show_default=True,
    help="Path to observed-to-canonical link output.",
)
def main(observed_csv, canonical_csv, links_csv):
    observed_rows = read_csv_rows(observed_csv)
    canonical_rows, link_rows = build_canonical_layer(observed_rows)
    write_csv_rows(canonical_csv, canonical_rows, CANONICAL_FIELDS)
    write_csv_rows(links_csv, link_rows, LINK_FIELDS)
    click.echo(
        f"wrote {len(canonical_rows)} canonical rows to {canonical_csv} and "
        f"{len(link_rows)} links to {links_csv}"
    )


if __name__ == "__main__":
    main()
```
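`clean_canonical_name` is the piece most worth exercising by hand; a self-contained copy with a hypothetical item name:

```python
import re

CANONICAL_DROP_TOKENS = {"CT", "COUNT", "COUNTS", "DOZ", "DOZEN", "DOZ.", "PACK"}


def clean_canonical_name(name):
    # Uppercase, replace punctuation with spaces, then drop bare numbers
    # and packaging tokens ("CT", "DOZEN", "12PK", "2DZ", ...).
    tokens = []
    for token in re.sub(r"[^A-Z0-9\s]", " ", (name or "").upper()).split():
        if token.isdigit():
            continue
        if token in CANONICAL_DROP_TOKENS:
            continue
        if re.fullmatch(r"\d+(?:PK|PACK)", token):
            continue
        if re.fullmatch(r"\d+DZ", token):
            continue
        tokens.append(token)
    return " ".join(tokens).strip()


print(clean_canonical_name("Organic Large Eggs, 12 ct"))
# → ORGANIC LARGE EGGS
```

One quirk worth knowing: the `"DOZ."` entry in the drop set can never match, because the punctuation pass has already stripped periods before tokens are compared.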
@@ -1,172 +0,0 @@

```python
from collections import defaultdict

import click

from layer_helpers import (
    compact_join,
    distinct_values,
    first_nonblank,
    read_csv_rows,
    representative_value,
    stable_id,
    write_csv_rows,
)


OUTPUT_FIELDS = [
    "observed_product_id",
    "retailer",
    "observed_key",
    "representative_retailer_item_id",
    "representative_upc",
    "representative_item_name",
    "representative_name_norm",
    "representative_brand",
    "representative_variant",
    "representative_size_value",
    "representative_size_unit",
    "representative_pack_qty",
    "representative_measure_type",
    "representative_image_url",
    "is_store_brand",
    "is_fee",
    "is_discount_line",
    "is_coupon_line",
    "first_seen_date",
    "last_seen_date",
    "times_seen",
    "example_order_id",
    "example_item_name",
    "raw_name_examples",
    "normalized_name_examples",
    "example_prices",
    "distinct_item_names_count",
    "distinct_retailer_item_ids_count",
    "distinct_upcs_count",
]


def build_observed_key(row):
    if row.get("upc"):
        return "|".join(
            [
                row["retailer"],
                f"upc={row['upc']}",
                f"name={row['item_name_norm']}",
            ]
        )

    if row.get("retailer_item_id"):
        return "|".join(
            [
                row["retailer"],
                f"retailer_item_id={row['retailer_item_id']}",
                f"name={row['item_name_norm']}",
                f"discount={row.get('is_discount_line', 'false')}",
                f"coupon={row.get('is_coupon_line', 'false')}",
            ]
        )

    return "|".join(
        [
            row["retailer"],
            f"name={row['item_name_norm']}",
            f"size={row['size_value']}",
            f"unit={row['size_unit']}",
            f"pack={row['pack_qty']}",
            f"measure={row['measure_type']}",
            f"store_brand={row['is_store_brand']}",
            f"fee={row['is_fee']}",
        ]
    )


def build_observed_products(rows):
    grouped = defaultdict(list)
    for row in rows:
        grouped[build_observed_key(row)].append(row)

    observed_rows = []
    for observed_key, group_rows in sorted(grouped.items()):
        ordered = sorted(
            group_rows,
            key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])),
        )
        observed_rows.append(
            {
                "observed_product_id": stable_id("gobs", observed_key),
                "retailer": ordered[0]["retailer"],
                "observed_key": observed_key,
                "representative_retailer_item_id": representative_value(
                    ordered, "retailer_item_id"
                ),
                "representative_upc": representative_value(ordered, "upc"),
                "representative_item_name": representative_value(ordered, "item_name"),
                "representative_name_norm": representative_value(
                    ordered, "item_name_norm"
                ),
                "representative_brand": representative_value(ordered, "brand_guess"),
                "representative_variant": representative_value(ordered, "variant"),
                "representative_size_value": representative_value(ordered, "size_value"),
                "representative_size_unit": representative_value(ordered, "size_unit"),
                "representative_pack_qty": representative_value(ordered, "pack_qty"),
                "representative_measure_type": representative_value(
                    ordered, "measure_type"
                ),
                "representative_image_url": first_nonblank(ordered, "image_url"),
                "is_store_brand": representative_value(ordered, "is_store_brand"),
                "is_fee": representative_value(ordered, "is_fee"),
                "is_discount_line": representative_value(
                    ordered, "is_discount_line"
                ),
                "is_coupon_line": representative_value(ordered, "is_coupon_line"),
                "first_seen_date": ordered[0]["order_date"],
                "last_seen_date": ordered[-1]["order_date"],
                "times_seen": str(len(ordered)),
                "example_order_id": ordered[0]["order_id"],
                "example_item_name": ordered[0]["item_name"],
                "raw_name_examples": compact_join(
                    distinct_values(ordered, "item_name"), limit=4
                ),
                "normalized_name_examples": compact_join(
                    distinct_values(ordered, "item_name_norm"), limit=4
                ),
                "example_prices": compact_join(
                    distinct_values(ordered, "line_total"), limit=4
                ),
                "distinct_item_names_count": str(
                    len(distinct_values(ordered, "item_name"))
                ),
                "distinct_retailer_item_ids_count": str(
                    len(distinct_values(ordered, "retailer_item_id"))
                ),
                "distinct_upcs_count": str(len(distinct_values(ordered, "upc"))),
            }
        )

    observed_rows.sort(key=lambda row: row["observed_product_id"])
    return observed_rows


@click.command()
@click.option(
    "--items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
    help="Path to enriched Giant item rows.",
)
@click.option(
    "--output-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product output.",
)
def main(items_enriched_csv, output_csv):
    rows = read_csv_rows(items_enriched_csv)
    observed_rows = build_observed_products(rows)
    write_csv_rows(output_csv, observed_rows, OUTPUT_FIELDS)
    click.echo(f"wrote {len(observed_rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()
```
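The three-tier identity key is the heart of the dedup: a UPC beats a retailer item id, which beats a name/size/flag fingerprint. A self-contained copy of `build_observed_key` with hypothetical rows shows the short-circuiting:

```python
def build_observed_key(row):
    # Strongest identity first: UPC, then retailer item id, then a
    # name/size/flag fingerprint (copied from the module above).
    if row.get("upc"):
        return "|".join(
            [
                row["retailer"],
                f"upc={row['upc']}",
                f"name={row['item_name_norm']}",
            ]
        )
    if row.get("retailer_item_id"):
        return "|".join(
            [
                row["retailer"],
                f"retailer_item_id={row['retailer_item_id']}",
                f"name={row['item_name_norm']}",
                f"discount={row.get('is_discount_line', 'false')}",
                f"coupon={row.get('is_coupon_line', 'false')}",
            ]
        )
    return "|".join(
        [
            row["retailer"],
            f"name={row['item_name_norm']}",
            f"size={row['size_value']}",
            f"unit={row['size_unit']}",
            f"pack={row['pack_qty']}",
            f"measure={row['measure_type']}",
            f"store_brand={row['is_store_brand']}",
            f"fee={row['is_fee']}",
        ]
    )


# Hypothetical row; the UPC wins over every weaker identifier.
print(build_observed_key({"retailer": "giant", "upc": "04138743", "item_name_norm": "WHOLE MILK"}))
# → giant|upc=04138743|name=WHOLE MILK
```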
@@ -1,418 +0,0 @@
|
|||||||
from decimal import Decimal
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
import click
|
|
||||||
|
|
||||||
import build_canonical_layer
|
|
||||||
import build_observed_products
|
|
||||||
import validate_cross_retailer_flow
|
|
||||||
from enrich_giant import format_decimal, to_decimal
|
|
||||||
from layer_helpers import read_csv_rows, stable_id, write_csv_rows
|
|
||||||
|
|
||||||
|
|
||||||
PURCHASE_FIELDS = [
|
|
||||||
"purchase_date",
|
|
||||||
"retailer",
|
|
||||||
"order_id",
|
|
||||||
"line_no",
|
|
||||||
"observed_item_key",
|
|
||||||
"observed_product_id",
|
|
||||||
"canonical_product_id",
|
|
||||||
"review_status",
|
|
||||||
"resolution_action",
|
|
||||||
"raw_item_name",
|
|
||||||
"normalized_item_name",
|
|
||||||
"image_url",
|
|
||||||
"retailer_item_id",
|
|
||||||
"upc",
|
|
||||||
"qty",
|
|
||||||
"unit",
|
|
||||||
"pack_qty",
|
|
||||||
"size_value",
|
|
||||||
"size_unit",
|
|
||||||
"measure_type",
|
|
||||||
"line_total",
|
|
||||||
"unit_price",
|
|
||||||
"matched_discount_amount",
|
|
||||||
"net_line_total",
|
|
||||||
"store_name",
|
|
||||||
"store_number",
|
|
||||||
"store_city",
|
|
||||||
"store_state",
|
|
||||||
"price_per_each",
|
|
||||||
"price_per_each_basis",
|
|
||||||
"price_per_count",
|
|
||||||
"price_per_count_basis",
|
|
||||||
"price_per_lb",
|
|
||||||
"price_per_lb_basis",
|
|
||||||
"price_per_oz",
|
|
||||||
"price_per_oz_basis",
|
|
||||||
"is_discount_line",
|
|
||||||
"is_coupon_line",
|
|
||||||
"is_fee",
|
|
||||||
"raw_order_path",
|
|
||||||
]
|
|
||||||
|
|
||||||
EXAMPLE_FIELDS = [
|
|
||||||
"example_name",
|
|
||||||
"canonical_product_id",
|
|
||||||
"giant_purchase_date",
|
|
||||||
"giant_raw_item_name",
|
|
||||||
"giant_price_per_lb",
|
|
||||||
"costco_purchase_date",
|
|
||||||
"costco_raw_item_name",
|
|
||||||
"costco_price_per_lb",
|
|
||||||
"notes",
|
|
||||||
]
|
|
||||||
|
|
||||||
CATALOG_FIELDS = [
|
|
||||||
"canonical_product_id",
|
|
||||||
"canonical_name",
|
|
||||||
"category",
|
|
||||||
"product_type",
|
|
||||||
"brand",
|
|
||||||
"variant",
|
|
||||||
"size_value",
|
|
||||||
"size_unit",
|
|
||||||
"pack_qty",
|
|
||||||
"measure_type",
|
|
||||||
"notes",
|
|
||||||
"created_at",
|
|
||||||
"updated_at",
|
|
||||||
]
|
|
||||||
|
|
||||||
RESOLUTION_FIELDS = [
|
|
||||||
"observed_product_id",
|
|
||||||
"canonical_product_id",
|
|
||||||
"resolution_action",
|
|
||||||
"status",
|
|
||||||
"resolution_notes",
|
|
||||||
"reviewed_at",
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
def decimal_or_zero(value):
|
|
||||||
return to_decimal(value) or Decimal("0")
|
|
||||||
|
|
||||||
|
|
||||||
def derive_metrics(row):
|
|
||||||
line_total = to_decimal(row.get("net_line_total") or row.get("line_total"))
|
|
||||||
qty = to_decimal(row.get("qty"))
|
|
||||||
pack_qty = to_decimal(row.get("pack_qty"))
|
|
||||||
size_value = to_decimal(row.get("size_value"))
|
|
||||||
picked_weight = to_decimal(row.get("picked_weight"))
|
|
||||||
size_unit = row.get("size_unit", "")
|
|
||||||
|
|
||||||
price_per_each = row.get("price_per_each", "")
|
|
||||||
price_per_lb = row.get("price_per_lb", "")
|
|
||||||
price_per_oz = row.get("price_per_oz", "")
|
|
||||||
price_per_count = ""
|
|
||||||
|
|
||||||
basis_each = ""
|
|
||||||
basis_count = ""
|
|
||||||
basis_lb = ""
|
|
||||||
basis_oz = ""
|
|
||||||
|
|
||||||
if price_per_each:
|
|
||||||
basis_each = "line_total_over_qty"
|
|
||||||
elif line_total is not None and qty not in (None, 0):
|
|
||||||
price_per_each = format_decimal(line_total / qty)
|
|
||||||
basis_each = "line_total_over_qty"
|
|
||||||
|
|
||||||
if line_total is not None and pack_qty not in (None, 0):
|
|
||||||
total_count = pack_qty * (qty or Decimal("1"))
|
|
||||||
if total_count not in (None, 0):
|
|
||||||
price_per_count = format_decimal(line_total / total_count)
|
|
||||||
basis_count = "line_total_over_pack_qty"
|
|
||||||
|
|
||||||
if picked_weight not in (None, 0):
|
|
||||||
price_per_lb = format_decimal(line_total / picked_weight) if line_total is not None else ""
|
|
||||||
price_per_oz = (
|
|
||||||
format_decimal((line_total / picked_weight) / Decimal("16"))
|
|
||||||
if line_total is not None
|
|
||||||
else ""
|
|
||||||
)
|
|
||||||
basis_lb = "picked_weight_lb"
|
|
||||||
basis_oz = "picked_weight_lb_to_oz"
|
|
||||||
elif line_total is not None and size_value not in (None, 0):
|
|
||||||
total_units = size_value * (pack_qty or Decimal("1")) * (qty or Decimal("1"))
|
|
||||||
if size_unit == "lb" and total_units not in (None, 0):
|
|
||||||
per_lb = line_total / total_units
|
|
||||||
price_per_lb = format_decimal(per_lb)
|
|
||||||
price_per_oz = format_decimal(per_lb / Decimal("16"))
|
|
||||||
basis_lb = "parsed_size_lb"
|
|
||||||
basis_oz = "parsed_size_lb_to_oz"
|
|
||||||
elif size_unit == "oz" and total_units not in (None, 0):
|
|
||||||
per_oz = line_total / total_units
|
|
||||||
price_per_oz = format_decimal(per_oz)
|
|
||||||
price_per_lb = format_decimal(per_oz * Decimal("16"))
|
|
||||||
basis_lb = "parsed_size_oz_to_lb"
|
|
||||||
basis_oz = "parsed_size_oz"
|
|
||||||
|
|
||||||
return {
|
|
||||||
"price_per_each": price_per_each,
|
|
||||||
"price_per_each_basis": basis_each,
|
|
||||||
"price_per_count": price_per_count,
|
|
||||||
"price_per_count_basis": basis_count,
|
|
||||||
"price_per_lb": price_per_lb,
|
|
||||||
"price_per_lb_basis": basis_lb,
|
|
||||||
"price_per_oz": price_per_oz,
|
|
||||||
"price_per_oz_basis": basis_oz,
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def order_lookup(rows, retailer):
    return {(retailer, row["order_id"]): row for row in rows}


def read_optional_csv_rows(path):
    path = Path(path)
    if not path.exists():
        return []
    return read_csv_rows(path)


def load_resolution_lookup(resolution_rows):
    lookup = {}
    for row in resolution_rows:
        if not row.get("observed_product_id"):
            continue
        lookup[row["observed_product_id"]] = row
    return lookup


def merge_catalog_rows(existing_rows, auto_rows):
    merged = {}
    for row in auto_rows + existing_rows:
        canonical_product_id = row.get("canonical_product_id", "")
        if canonical_product_id:
            merged[canonical_product_id] = row
    return sorted(merged.values(), key=lambda row: row["canonical_product_id"])

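Because merge_catalog_rows writes auto_rows first and existing_rows second into a dict keyed by canonical_product_id, later writes win: a hand-curated catalog row overrides the auto-built one for the same id. A small demonstration of that precedence (the function body is reproduced here; the row fields are illustrative):

```python
def merge_catalog_rows(existing_rows, auto_rows):
    # Later writes win: existing (curated) rows override auto-built ones.
    merged = {}
    for row in auto_rows + existing_rows:
        canonical_product_id = row.get("canonical_product_id", "")
        if canonical_product_id:
            merged[canonical_product_id] = row
    return sorted(merged.values(), key=lambda row: row["canonical_product_id"])


auto = [{"canonical_product_id": "cp1", "canonical_name": "BANANA"}]
curated = [{"canonical_product_id": "cp1", "canonical_name": "BANANA (ORGANIC)"}]
merged = merge_catalog_rows(curated, auto)
```

Rows without a canonical_product_id are silently dropped, which keeps blank auto-generated stubs out of the catalog.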
def catalog_row_from_canonical(row):
    return {
        "canonical_product_id": row.get("canonical_product_id", ""),
        "canonical_name": row.get("canonical_name", ""),
        "category": row.get("category", ""),
        "product_type": row.get("product_type", ""),
        "brand": row.get("brand", ""),
        "variant": row.get("variant", ""),
        "size_value": row.get("size_value", ""),
        "size_unit": row.get("size_unit", ""),
        "pack_qty": row.get("pack_qty", ""),
        "measure_type": row.get("measure_type", ""),
        "notes": row.get("notes", ""),
        "created_at": row.get("created_at", ""),
        "updated_at": row.get("updated_at", ""),
    }


def build_link_state(enriched_rows):
    observed_rows = build_observed_products.build_observed_products(enriched_rows)
    canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows)
    giant_row, costco_row = validate_cross_retailer_flow.find_proof_pair(observed_rows)
    canonical_rows, link_rows, _proof_rows = validate_cross_retailer_flow.merge_proof_pair(
        canonical_rows,
        link_rows,
        giant_row,
        costco_row,
    )

    observed_id_by_key = {
        row["observed_key"]: row["observed_product_id"] for row in observed_rows
    }
    canonical_id_by_observed = {
        row["observed_product_id"]: row["canonical_product_id"] for row in link_rows
    }
    return observed_rows, canonical_rows, link_rows, observed_id_by_key, canonical_id_by_observed

def build_purchase_rows(
    giant_enriched_rows,
    costco_enriched_rows,
    giant_orders,
    costco_orders,
    resolution_rows,
):
    all_enriched_rows = giant_enriched_rows + costco_enriched_rows
    (
        observed_rows,
        canonical_rows,
        link_rows,
        observed_id_by_key,
        canonical_id_by_observed,
    ) = build_link_state(all_enriched_rows)
    resolution_lookup = load_resolution_lookup(resolution_rows)
    for observed_product_id, resolution in resolution_lookup.items():
        action = resolution.get("resolution_action", "")
        status = resolution.get("status", "")
        if status != "approved":
            continue
        if action in {"link", "create"} and resolution.get("canonical_product_id"):
            canonical_id_by_observed[observed_product_id] = resolution["canonical_product_id"]
        elif action == "exclude":
            canonical_id_by_observed[observed_product_id] = ""
    orders_by_id = {}
    orders_by_id.update(order_lookup(giant_orders, "giant"))
    orders_by_id.update(order_lookup(costco_orders, "costco"))

    purchase_rows = []
    for row in sorted(
        all_enriched_rows,
        key=lambda item: (item["order_date"], item["retailer"], item["order_id"], int(item["line_no"])),
    ):
        observed_key = build_observed_products.build_observed_key(row)
        observed_product_id = observed_id_by_key.get(observed_key, "")
        order_row = orders_by_id.get((row["retailer"], row["order_id"]), {})
        metrics = derive_metrics(row)
        resolution = resolution_lookup.get(observed_product_id, {})
        purchase_rows.append(
            {
                "purchase_date": row["order_date"],
                "retailer": row["retailer"],
                "order_id": row["order_id"],
                "line_no": row["line_no"],
                "observed_item_key": row["observed_item_key"],
                "observed_product_id": observed_product_id,
                "canonical_product_id": canonical_id_by_observed.get(observed_product_id, ""),
                "review_status": resolution.get("status", ""),
                "resolution_action": resolution.get("resolution_action", ""),
                "raw_item_name": row["item_name"],
                "normalized_item_name": row["item_name_norm"],
                "image_url": row.get("image_url", ""),
                "retailer_item_id": row["retailer_item_id"],
                "upc": row["upc"],
                "qty": row["qty"],
                "unit": row["unit"],
                "pack_qty": row["pack_qty"],
                "size_value": row["size_value"],
                "size_unit": row["size_unit"],
                "measure_type": row["measure_type"],
                "line_total": row["line_total"],
                "unit_price": row["unit_price"],
                "matched_discount_amount": row.get("matched_discount_amount", ""),
                "net_line_total": row.get("net_line_total", ""),
                "store_name": order_row.get("store_name", ""),
                "store_number": order_row.get("store_number", ""),
                "store_city": order_row.get("store_city", ""),
                "store_state": order_row.get("store_state", ""),
                "is_discount_line": row["is_discount_line"],
                "is_coupon_line": row["is_coupon_line"],
                "is_fee": row["is_fee"],
                "raw_order_path": row["raw_order_path"],
                **metrics,
            }
        )
    return purchase_rows, observed_rows, canonical_rows, link_rows

def apply_manual_resolutions_to_links(link_rows, resolution_rows):
    link_by_observed = {row["observed_product_id"]: dict(row) for row in link_rows}
    for resolution in resolution_rows:
        if resolution.get("status") != "approved":
            continue
        observed_product_id = resolution.get("observed_product_id", "")
        action = resolution.get("resolution_action", "")
        if not observed_product_id:
            continue
        if action == "exclude":
            link_by_observed.pop(observed_product_id, None)
            continue
        if action in {"link", "create"} and resolution.get("canonical_product_id"):
            link_by_observed[observed_product_id] = {
                "observed_product_id": observed_product_id,
                "canonical_product_id": resolution["canonical_product_id"],
                "link_method": f"manual_{action}",
                "link_confidence": "high",
                "review_status": resolution.get("status", ""),
                "reviewed_by": "",
                "reviewed_at": resolution.get("reviewed_at", ""),
                "link_notes": resolution.get("resolution_notes", ""),
            }
    return sorted(link_by_observed.values(), key=lambda row: row["observed_product_id"])


def build_comparison_examples(purchase_rows):
    giant_banana = None
    costco_banana = None
    for row in purchase_rows:
        if row.get("normalized_item_name") != "BANANA":
            continue
        if not row.get("canonical_product_id"):
            continue
        if row["retailer"] == "giant" and row.get("price_per_lb"):
            giant_banana = row
        if row["retailer"] == "costco" and row.get("price_per_lb"):
            costco_banana = row

    if not giant_banana or not costco_banana:
        return []

    return [
        {
            "example_name": "banana_price_per_lb",
            "canonical_product_id": giant_banana["canonical_product_id"],
            "giant_purchase_date": giant_banana["purchase_date"],
            "giant_raw_item_name": giant_banana["raw_item_name"],
            "giant_price_per_lb": giant_banana["price_per_lb"],
            "costco_purchase_date": costco_banana["purchase_date"],
            "costco_raw_item_name": costco_banana["raw_item_name"],
            "costco_price_per_lb": costco_banana["price_per_lb"],
            "notes": "Example comparison using normalized price_per_lb across Giant and Costco",
        }
    ]

@click.command()
@click.option("--giant-items-enriched-csv", default="giant_output/items_enriched.csv", show_default=True)
@click.option("--costco-items-enriched-csv", default="costco_output/items_enriched.csv", show_default=True)
@click.option("--giant-orders-csv", default="giant_output/orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="costco_output/orders.csv", show_default=True)
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="combined_output/canonical_catalog.csv", show_default=True)
@click.option("--links-csv", default="combined_output/product_links.csv", show_default=True)
@click.option("--output-csv", default="combined_output/purchases.csv", show_default=True)
@click.option("--examples-csv", default="combined_output/comparison_examples.csv", show_default=True)
def main(
    giant_items_enriched_csv,
    costco_items_enriched_csv,
    giant_orders_csv,
    costco_orders_csv,
    resolutions_csv,
    catalog_csv,
    links_csv,
    output_csv,
    examples_csv,
):
    resolution_rows = read_optional_csv_rows(resolutions_csv)
    purchase_rows, _observed_rows, canonical_rows, link_rows = build_purchase_rows(
        read_csv_rows(giant_items_enriched_csv),
        read_csv_rows(costco_items_enriched_csv),
        read_csv_rows(giant_orders_csv),
        read_csv_rows(costco_orders_csv),
        resolution_rows,
    )
    existing_catalog_rows = read_optional_csv_rows(catalog_csv)
    merged_catalog_rows = merge_catalog_rows(
        existing_catalog_rows,
        [catalog_row_from_canonical(row) for row in canonical_rows],
    )
    link_rows = apply_manual_resolutions_to_links(link_rows, resolution_rows)
    example_rows = build_comparison_examples(purchase_rows)
    write_csv_rows(catalog_csv, merged_catalog_rows, CATALOG_FIELDS)
    write_csv_rows(links_csv, link_rows, build_canonical_layer.LINK_FIELDS)
    write_csv_rows(output_csv, purchase_rows, PURCHASE_FIELDS)
    write_csv_rows(examples_csv, example_rows, EXAMPLE_FIELDS)
    click.echo(
        f"wrote {len(purchase_rows)} purchase rows to {output_csv}, "
        f"{len(merged_catalog_rows)} catalog rows to {catalog_csv}, "
        f"and {len(example_rows)} comparison examples to {examples_csv}"
    )


if __name__ == "__main__":
    main()
@@ -1,175 +0,0 @@
from collections import defaultdict
from datetime import date

import click

from layer_helpers import compact_join, distinct_values, read_csv_rows, stable_id, write_csv_rows


OUTPUT_FIELDS = [
    "review_id",
    "queue_type",
    "retailer",
    "observed_product_id",
    "canonical_product_id",
    "reason_code",
    "priority",
    "raw_item_names",
    "normalized_names",
    "upc",
    "image_url",
    "example_prices",
    "seen_count",
    "status",
    "resolution_notes",
    "created_at",
    "updated_at",
]


def existing_review_state(path):
    try:
        rows = read_csv_rows(path)
    except FileNotFoundError:
        return {}
    return {row["review_id"]: row for row in rows}


def review_reasons(observed_row):
    reasons = []
    if (
        observed_row["is_fee"] == "true"
        or observed_row.get("is_discount_line") == "true"
        or observed_row.get("is_coupon_line") == "true"
    ):
        return reasons
    if observed_row["distinct_upcs_count"] not in {"", "0", "1"}:
        reasons.append(("multiple_upcs", "high"))
    if observed_row["distinct_item_names_count"] not in {"", "0", "1"}:
        reasons.append(("multiple_raw_names", "medium"))
    if not observed_row["representative_image_url"]:
        reasons.append(("missing_image", "medium"))
    if not observed_row["representative_upc"]:
        reasons.append(("missing_upc", "high"))
    if not observed_row["representative_name_norm"]:
        reasons.append(("missing_normalized_name", "high"))
    return reasons

def build_review_queue(observed_rows, item_rows, existing_rows, today_text):
    by_observed = defaultdict(list)
    for row in item_rows:
        observed_id = row.get("observed_product_id", "")
        if observed_id:
            by_observed[observed_id].append(row)

    queue_rows = []
    for observed_row in observed_rows:
        reasons = review_reasons(observed_row)
        if not reasons:
            continue

        related_items = by_observed.get(observed_row["observed_product_id"], [])
        raw_names = compact_join(distinct_values(related_items, "item_name"), limit=5)
        norm_names = compact_join(distinct_values(related_items, "item_name_norm"), limit=5)
        example_prices = compact_join(distinct_values(related_items, "line_total"), limit=5)

        for reason_code, priority in reasons:
            review_id = stable_id(
                "rvw",
                f"{observed_row['observed_product_id']}|{reason_code}",
            )
            prior = existing_rows.get(review_id, {})
            queue_rows.append(
                {
                    "review_id": review_id,
                    "queue_type": "observed_product",
                    "retailer": observed_row["retailer"],
                    "observed_product_id": observed_row["observed_product_id"],
                    "canonical_product_id": prior.get("canonical_product_id", ""),
                    "reason_code": reason_code,
                    "priority": priority,
                    "raw_item_names": raw_names,
                    "normalized_names": norm_names,
                    "upc": observed_row["representative_upc"],
                    "image_url": observed_row["representative_image_url"],
                    "example_prices": example_prices,
                    "seen_count": observed_row["times_seen"],
                    "status": prior.get("status", "pending"),
                    "resolution_notes": prior.get("resolution_notes", ""),
                    "created_at": prior.get("created_at", today_text),
                    "updated_at": today_text,
                }
            )

    queue_rows.sort(key=lambda row: (row["priority"], row["reason_code"], row["review_id"]))
    return queue_rows


def attach_observed_ids(item_rows, observed_rows):
    observed_by_key = {row["observed_key"]: row["observed_product_id"] for row in observed_rows}
    attached = []
    for row in item_rows:
        if row.get("upc"):
            observed_key = "|".join(
                [
                    row["retailer"],
                    f"upc={row['upc']}",
                    f"name={row['item_name_norm']}",
                ]
            )
        else:
            observed_key = "|".join(
                [
                    row["retailer"],
                    f"retailer_item_id={row.get('retailer_item_id', '')}",
                    f"name={row['item_name_norm']}",
                    f"size={row['size_value']}",
                    f"unit={row['size_unit']}",
                    f"pack={row['pack_qty']}",
                    f"measure={row['measure_type']}",
                    f"store_brand={row['is_store_brand']}",
                    f"fee={row['is_fee']}",
                    f"discount={row.get('is_discount_line', 'false')}",
                    f"coupon={row.get('is_coupon_line', 'false')}",
                ]
            )
        enriched = dict(row)
        enriched["observed_product_id"] = observed_by_key.get(observed_key, "")
        attached.append(enriched)
    return attached

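attach_observed_ids rebuilds the same pipe-delimited key that build_observed_products uses, preferring a UPC-based key when one exists. A sketch of just the UPC branch, under a hypothetical helper name (the fallback branch, which folds in size/pack/flag fields, is elided):

```python
def observed_key_for(row):
    # Mirrors the UPC branch of attach_observed_ids: retailer|upc=…|name=….
    # Hypothetical helper for illustration; the real code inlines this join.
    if row.get("upc"):
        return "|".join(
            [row["retailer"], f"upc={row['upc']}", f"name={row['item_name_norm']}"]
        )
    raise NotImplementedError("fallback key branch elided")


row = {"retailer": "giant", "upc": "004011", "item_name_norm": "BANANA"}
key = observed_key_for(row)
```

Keying on UPC first means a renamed product still maps to the same observed product as long as its barcode is stable.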
@click.command()
@click.option(
    "--observed-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product rows.",
)
@click.option(
    "--items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
    help="Path to enriched Giant item rows.",
)
@click.option(
    "--output-csv",
    default="giant_output/review_queue.csv",
    show_default=True,
    help="Path to review queue output.",
)
def main(observed_csv, items_enriched_csv, output_csv):
    observed_rows = read_csv_rows(observed_csv)
    item_rows = read_csv_rows(items_enriched_csv)
    item_rows = attach_observed_ids(item_rows, observed_rows)
    existing_rows = existing_review_state(output_csv)
    today_text = str(date.today())
    queue_rows = build_review_queue(observed_rows, item_rows, existing_rows, today_text)
    write_csv_rows(output_csv, queue_rows, OUTPUT_FIELDS)
    click.echo(f"wrote {len(queue_rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()
@@ -1,65 +0,0 @@
import click

import scrape_costco


@click.command()
@click.option(
    "--outdir",
    default="data/costco-web",
    show_default=True,
    help="Directory for Costco raw and collected outputs.",
)
@click.option(
    "--document-type",
    default="all",
    show_default=True,
    help="Summary document type.",
)
@click.option(
    "--document-sub-type",
    default="all",
    show_default=True,
    help="Summary document sub-type.",
)
@click.option(
    "--window-days",
    default=92,
    show_default=True,
    type=int,
    help="Maximum number of days to request per summary window.",
)
@click.option(
    "--months-back",
    default=36,
    show_default=True,
    type=int,
    help="How many months of receipts to enumerate back from today.",
)
@click.option(
    "--firefox-profile-dir",
    default=None,
    help="Firefox profile directory to use for cookies and session storage.",
)
def main(
    outdir,
    document_type,
    document_sub_type,
    window_days,
    months_back,
    firefox_profile_dir,
):
    scrape_costco.run_collection(
        outdir=outdir,
        document_type=document_type,
        document_sub_type=document_sub_type,
        window_days=window_days,
        months_back=months_back,
        firefox_profile_dir=firefox_profile_dir,
        orders_filename="collected_orders.csv",
        items_filename="collected_items.csv",
    )


if __name__ == "__main__":
    main()
@@ -1,34 +0,0 @@
import click

import scrape_giant


@click.command()
@click.option("--user-id", default=None, help="Giant user id.")
@click.option("--loyalty", default=None, help="Giant loyalty number.")
@click.option(
    "--outdir",
    default="data/giant-web",
    show_default=True,
    help="Directory for raw json and collected csv outputs.",
)
@click.option(
    "--sleep-seconds",
    default=1.5,
    show_default=True,
    type=float,
    help="Delay between order detail requests.",
)
def main(user_id, loyalty, outdir, sleep_seconds):
    scrape_giant.run_collection(
        user_id,
        loyalty,
        outdir,
        sleep_seconds,
        orders_filename="collected_orders.csv",
        items_filename="collected_items.csv",
    )


if __name__ == "__main__":
    main()
330	enrich_costco.py
@@ -1,330 +0,0 @@
import csv
import json
import re
from collections import defaultdict
from pathlib import Path

import click

from enrich_giant import (
    OUTPUT_FIELDS,
    format_decimal,
    normalize_number,
    normalize_unit,
    normalize_whitespace,
    singularize_tokens,
    to_decimal,
)


PARSER_VERSION = "costco-enrich-v1"
RETAILER = "costco"
DEFAULT_INPUT_DIR = Path("costco_output/raw")
DEFAULT_OUTPUT_CSV = Path("costco_output/items_enriched.csv")

CODE_TOKEN_RE = re.compile(
    r"\b(?:SL\d+|T\d+H\d+|P\d+(?:/\d+)?|W\d+T\d+H\d+|FY\d+|CSPC#|C\d+T\d+H\d+|EC\d+T\d+H\d+|\d+X\d+)\b"
)
PACK_FRACTION_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT)\b")
HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#\b")
PACK_DASH_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*-\s*PACK\b")
PACK_WORD_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*PACK\b")
SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT|KG|G)\b")
DISCOUNT_TARGET_RE = re.compile(r"^/\s*(\d+)\b")


def clean_costco_name(name):
    cleaned = normalize_whitespace(name).upper().replace('"', "")
    cleaned = CODE_TOKEN_RE.sub(" ", cleaned)
    cleaned = re.sub(r"\s*/\s*\d+(?:\.\d+)?\s*(KG|G)\b", " ", cleaned)
    cleaned = normalize_whitespace(cleaned)
    return cleaned


def combine_description(item):
    return normalize_whitespace(
        " ".join(
            str(part).strip()
            for part in [item.get("itemDescription01"), item.get("itemDescription02")]
            if part
        )
    )


def parse_costco_size_and_pack(cleaned_name):
    pack_qty = ""
    size_value = ""
    size_unit = ""

    match = PACK_FRACTION_RE.search(cleaned_name)
    if match:
        pack_qty = normalize_number(match.group(1))
        size_value = normalize_number(match.group(2))
        size_unit = normalize_unit(match.group(3))
        return size_value, size_unit, pack_qty

    match = HASH_SIZE_RE.search(cleaned_name)
    if match:
        size_value = normalize_number(match.group(1))
        size_unit = "lb"

    match = PACK_DASH_RE.search(cleaned_name) or PACK_WORD_RE.search(cleaned_name)
    if match:
        pack_qty = normalize_number(match.group(1))

    matches = list(SIZE_RE.finditer(cleaned_name))
    if matches:
        last = matches[-1]
        unit = last.group(2)
        size_value = normalize_number(last.group(1))
        size_unit = "count" if unit == "CT" else normalize_unit(unit)

    return size_value, size_unit, pack_qty

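The size and pack patterns above can be exercised in isolation. The regexes are reproduced verbatim; the receipt strings are made-up examples of the formats they target (pack/size fractions like "2/27 OZ", the "#" pound shorthand, and a trailing size token):

```python
import re

PACK_FRACTION_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT)\b")
HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#\b")
SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT|KG|G)\b")

pack = PACK_FRACTION_RE.search("KS ALMOND BUTTER 2/27 OZ")  # pack of 2, 27 oz each
weight = HASH_SIZE_RE.search("PORK LOIN 4.5#AVG")           # '#' shorthand for pounds
size = list(SIZE_RE.finditer("PAPER TOWEL 160 CT"))[-1]     # last size token wins
```

The negative lookbehind `(?<![A-Z0-9])` keeps the patterns from firing mid-token, e.g. inside an item code like `P12/24`.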
def normalize_costco_name(cleaned_name):
    brand = ""
    base = cleaned_name
    if base.startswith("KS "):
        brand = "KS"
        base = normalize_whitespace(base[3:])

    size_value, size_unit, pack_qty = parse_costco_size_and_pack(base)
    if size_value and size_unit:
        if pack_qty:
            base = PACK_FRACTION_RE.sub(" ", base)
        else:
            base = SIZE_RE.sub(" ", base)
    base = HASH_SIZE_RE.sub(" ", base)
    base = PACK_DASH_RE.sub(" ", base)
    base = PACK_WORD_RE.sub(" ", base)
    base = normalize_whitespace(base)
    tokens = []
    for token in base.split():
        if token in {"ORG"}:
            continue
        if token in {"PEANUT", "BUTTER"} and "JIF" in base:
            continue
        tokens.append(token)
    base = singularize_tokens(" ".join(tokens))
    return normalize_whitespace(base), brand, size_value, size_unit, pack_qty


def guess_measure_type(size_unit, pack_qty, is_discount_line):
    if is_discount_line:
        return "each"
    if size_unit in {"lb", "oz", "g", "kg"}:
        return "weight"
    if size_unit in {"ml", "l", "qt", "pt", "gal", "fl_oz"}:
        return "volume"
    if size_unit == "count" or pack_qty:
        return "count"
    return "each"


def derive_costco_prices(item, measure_type, size_value, size_unit, pack_qty):
    line_total = to_decimal(item.get("amount"))
    qty = to_decimal(item.get("unit"))
    parsed_size = to_decimal(size_value)
    parsed_pack = to_decimal(pack_qty) or 1

    price_per_each = ""
    price_per_lb = ""
    price_per_oz = ""
    if line_total is None:
        return price_per_each, price_per_lb, price_per_oz

    if measure_type in {"each", "count"} and qty not in (None, 0):
        price_per_each = format_decimal(line_total / qty)

    if parsed_size not in (None, 0):
        total_units = parsed_size * parsed_pack * (qty or 1)
        if size_unit == "lb":
            per_lb = line_total / total_units
            price_per_lb = format_decimal(per_lb)
            price_per_oz = format_decimal(per_lb / 16)
        elif size_unit == "oz":
            per_oz = line_total / total_units
            price_per_oz = format_decimal(per_oz)
            price_per_lb = format_decimal(per_oz * 16)

    return price_per_each, price_per_lb, price_per_oz

def is_discount_item(item):
    amount = to_decimal(item.get("amount")) or 0
    unit = to_decimal(item.get("unit")) or 0
    description = combine_description(item)
    return amount < 0 or unit < 0 or description.startswith("/")


def discount_target_id(raw_name):
    match = DISCOUNT_TARGET_RE.match(normalize_whitespace(raw_name))
    if not match:
        return ""
    return match.group(1)

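Costco markdown lines name the item they discount with a leading slash, e.g. "/ 1234567 …"; discount_target_id extracts that item number so match_costco_discounts can net it against the purchase line. A self-contained sketch (the example strings are illustrative, and `" ".join(raw_name.split())` stands in for the normalize_whitespace helper):

```python
import re

DISCOUNT_TARGET_RE = re.compile(r"^/\s*(\d+)\b")


def discount_target_id(raw_name):
    # " ".join(split()) collapses whitespace, standing in for normalize_whitespace.
    match = DISCOUNT_TARGET_RE.match(" ".join(raw_name.split()))
    return match.group(1) if match else ""


target = discount_target_id("/ 1234567 TPD")       # a discount line
no_target = discount_target_id("KS BACON 4/1 LB")  # a normal purchase line
```

Anchoring the pattern at the start of the string is what keeps pack fractions like "4/1 LB" mid-name from being mistaken for discount targets.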
def parse_costco_item(order_id, order_date, raw_path, line_no, item):
    raw_name = combine_description(item)
    cleaned_name = clean_costco_name(raw_name)
    item_name_norm, brand_guess, size_value, size_unit, pack_qty = normalize_costco_name(
        cleaned_name
    )
    is_discount_line = is_discount_item(item)
    is_coupon_line = "true" if raw_name.startswith("/") else "false"
    measure_type = guess_measure_type(size_unit, pack_qty, is_discount_line)
    price_per_each, price_per_lb, price_per_oz = derive_costco_prices(
        item, measure_type, size_value, size_unit, pack_qty
    )

    return {
        "retailer": RETAILER,
        "order_id": str(order_id),
        "line_no": str(line_no),
        "observed_item_key": f"{RETAILER}:{order_id}:{line_no}",
        "order_date": normalize_whitespace(order_date),
        "retailer_item_id": str(item.get("itemNumber", "")),
        "pod_id": "",
        "item_name": raw_name,
        "upc": "",
        "category_id": str(item.get("itemDepartmentNumber", "")),
        "category": str(item.get("transDepartmentNumber", "")),
        "qty": str(item.get("unit", "")),
        "unit": str(item.get("itemIdentifier", "")),
        "unit_price": str(item.get("itemUnitPriceAmount", "")),
        "line_total": str(item.get("amount", "")),
        "picked_weight": "",
        "mvp_savings": "",
        "reward_savings": "",
        "coupon_savings": str(item.get("amount", "")) if is_discount_line else "",
        "coupon_price": "",
        "matched_discount_amount": "",
        "net_line_total": str(item.get("amount", "")) if not is_discount_line else "",
        "image_url": "",
        "raw_order_path": raw_path.as_posix(),
        "item_name_norm": item_name_norm,
        "brand_guess": brand_guess,
        "variant": "",
        "size_value": size_value,
        "size_unit": size_unit,
        "pack_qty": pack_qty,
        "measure_type": measure_type,
        "is_store_brand": "true" if brand_guess else "false",
        "is_fee": "false",
        "is_discount_line": "true" if is_discount_line else "false",
        "is_coupon_line": is_coupon_line,
        "price_per_each": price_per_each,
        "price_per_lb": price_per_lb,
        "price_per_oz": price_per_oz,
        "parse_version": PARSER_VERSION,
        "parse_notes": "",
    }


def match_costco_discounts(rows):
    rows_by_order = defaultdict(list)
    for row in rows:
        rows_by_order[row["order_id"]].append(row)

    for order_rows in rows_by_order.values():
        purchase_rows_by_item_id = defaultdict(list)
        for row in order_rows:
            if row.get("is_discount_line") == "true":
                continue
            retailer_item_id = row.get("retailer_item_id", "")
            if retailer_item_id:
                purchase_rows_by_item_id[retailer_item_id].append(row)

        for row in order_rows:
            if row.get("is_discount_line") != "true":
                continue
            target_id = discount_target_id(row.get("item_name", ""))
            if not target_id:
                continue
            matches = purchase_rows_by_item_id.get(target_id, [])
            if len(matches) != 1:
                row["parse_notes"] = normalize_whitespace(
                    f"{row.get('parse_notes', '')};discount_target_unmatched={target_id}"
                ).strip(";")
                continue

            purchase_row = matches[0]
            matched_discount = to_decimal(row.get("line_total"))
            gross_total = to_decimal(purchase_row.get("line_total"))
            existing_discount = to_decimal(purchase_row.get("matched_discount_amount")) or 0
            if matched_discount is None or gross_total is None:
                continue

            total_discount = existing_discount + matched_discount
            purchase_row["matched_discount_amount"] = format_decimal(total_discount)
            purchase_row["net_line_total"] = format_decimal(gross_total + total_discount)
            purchase_row["parse_notes"] = normalize_whitespace(
                f"{purchase_row.get('parse_notes', '')};matched_discount={target_id}"
            ).strip(";")
            row["parse_notes"] = normalize_whitespace(
                f"{row.get('parse_notes', '')};matched_to_item={target_id}"
            ).strip(";")

def iter_costco_rows(raw_dir):
|
|
||||||
for path in discover_json_files(raw_dir):
|
|
||||||
if path.name in {"summary.json", "summary_requests.json"}:
|
|
||||||
continue
|
|
||||||
payload = json.loads(path.read_text(encoding="utf-8"))
|
|
||||||
if not isinstance(payload, dict):
|
|
||||||
continue
|
|
||||||
receipts = payload.get("data", {}).get("receiptsWithCounts", {}).get("receipts", [])
|
|
||||||
for receipt in receipts:
|
|
||||||
order_id = receipt["transactionBarcode"]
|
|
||||||
order_date = receipt.get("transactionDate", "")
|
|
||||||
for line_no, item in enumerate(receipt.get("itemArray", []), start=1):
|
|
||||||
yield parse_costco_item(order_id, order_date, path, line_no, item)
|
|
||||||
|
|
||||||
|
|
||||||
def discover_json_files(raw_dir):
|
|
||||||
raw_dir = Path(raw_dir)
|
|
||||||
candidates = sorted(raw_dir.glob("*.json"))
|
|
||||||
if candidates:
|
|
||||||
return candidates
|
|
||||||
if raw_dir.name == "raw" and raw_dir.parent.exists():
|
|
||||||
return sorted(raw_dir.parent.glob("*.json"))
|
|
||||||
return []
|
|
||||||
|
|
||||||
|
|
||||||
def build_items_enriched(raw_dir):
|
|
||||||
rows = list(iter_costco_rows(raw_dir))
|
|
||||||
match_costco_discounts(rows)
|
|
||||||
rows.sort(key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])))
|
|
||||||
return rows
|
|
||||||
|
|
||||||
|
|
||||||
def write_csv(path, rows):
|
|
||||||
path.parent.mkdir(parents=True, exist_ok=True)
|
|
||||||
with path.open("w", newline="", encoding="utf-8") as handle:
|
|
||||||
writer = csv.DictWriter(handle, fieldnames=OUTPUT_FIELDS)
|
|
||||||
writer.writeheader()
|
|
||||||
writer.writerows(rows)
|
|
||||||
|
|
||||||
|
|
||||||
@click.command()
|
|
||||||
@click.option(
|
|
||||||
"--input-dir",
|
|
||||||
default=str(DEFAULT_INPUT_DIR),
|
|
||||||
show_default=True,
|
|
||||||
help="Directory containing Costco raw order json files.",
|
|
||||||
)
|
|
||||||
@click.option(
|
|
||||||
"--output-csv",
|
|
||||||
default=str(DEFAULT_OUTPUT_CSV),
|
|
||||||
show_default=True,
|
|
||||||
help="CSV path for enriched Costco item rows.",
|
|
||||||
)
|
|
||||||
def main(input_dir, output_csv):
|
|
||||||
rows = build_items_enriched(Path(input_dir))
|
|
||||||
write_csv(Path(output_csv), rows)
|
|
||||||
click.echo(f"wrote {len(rows)} rows to {output_csv}")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
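The discount pass above nets signed discount rows into their single matching purchase row per order. A minimal standalone sketch of the same idea, with simplified row shapes (`match_discounts`, `item_id`, and `net_total` here are illustrative names, not the module's own):

```python
from collections import defaultdict
from decimal import Decimal

def match_discounts(rows):
    # Group purchase rows by item id, then fold each discount row's
    # (negative) amount into the one matching purchase row. Ambiguous
    # matches (0 or 2+ candidates) are skipped, as in the real pass.
    purchases = defaultdict(list)
    for row in rows:
        if not row["is_discount"]:
            purchases[row["item_id"]].append(row)
    for row in rows:
        if row["is_discount"]:
            matches = purchases.get(row["item_id"], [])
            if len(matches) == 1:
                target = matches[0]
                target["net_total"] = target["line_total"] + row["line_total"]
    return rows

rows = [
    {"item_id": "123", "is_discount": False, "line_total": Decimal("9.99")},
    {"item_id": "123", "is_discount": True, "line_total": Decimal("-2.00")},
]
match_discounts(rows)
print(rows[0]["net_total"])  # 7.99
```

Keeping the discount as a signed amount means "net = gross + discount" works without special-casing subtraction.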
459	enrich_giant.py
@@ -1,459 +0,0 @@
import csv
import json
import re
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
from pathlib import Path

import click


PARSER_VERSION = "giant-enrich-v1"
RETAILER = "giant"
DEFAULT_INPUT_DIR = Path("giant_output/raw")
DEFAULT_OUTPUT_CSV = Path("giant_output/items_enriched.csv")

OUTPUT_FIELDS = [
    "retailer",
    "order_id",
    "line_no",
    "observed_item_key",
    "order_date",
    "retailer_item_id",
    "pod_id",
    "item_name",
    "upc",
    "category_id",
    "category",
    "qty",
    "unit",
    "unit_price",
    "line_total",
    "picked_weight",
    "mvp_savings",
    "reward_savings",
    "coupon_savings",
    "coupon_price",
    "matched_discount_amount",
    "net_line_total",
    "image_url",
    "raw_order_path",
    "item_name_norm",
    "brand_guess",
    "variant",
    "size_value",
    "size_unit",
    "pack_qty",
    "measure_type",
    "is_store_brand",
    "is_fee",
    "is_discount_line",
    "is_coupon_line",
    "price_per_each",
    "price_per_lb",
    "price_per_oz",
    "parse_version",
    "parse_notes",
]

STORE_BRAND_PREFIXES = {
    "SB": "SB",
    "NP": "NP",
}

DROP_TOKENS = {"FRESH"}

ABBREVIATIONS = {
    "APPLE": "APPLE",
    "APPLES": "APPLES",
    "APLE": "APPLE",
    "BASIL": "BASIL",
    "BLK": "BLACK",
    "BNLS": "BONELESS",
    "BRWN": "BROWN",
    "CARROTS": "CARROTS",
    "CHDR": "CHEDDAR",
    "CHICKEN": "CHICKEN",
    "CHOC": "CHOCOLATE",
    "CHS": "CHEESE",
    "CHSE": "CHEESE",
    "CHZ": "CHEESE",
    "CILANTRO": "CILANTRO",
    "CKI": "COOKIE",
    "CRSHD": "CRUSHED",
    "FLR": "FLOUR",
    "FRSH": "FRESH",
    "GALA": "GALA",
    "GRAHM": "GRAHAM",
    "HOT": "HOT",
    "HRSRDSH": "HORSERADISH",
    "IMP": "IMPORTED",
    "IQF": "IQF",
    "LENTILS": "LENTILS",
    "LG": "LARGE",
    "MLK": "MILK",
    "MSTRD": "MUSTARD",
    "ONION": "ONION",
    "ORG": "ORGANIC",
    "PEPPER": "PEPPER",
    "PEPPERS": "PEPPERS",
    "POT": "POTATO",
    "POTATO": "POTATO",
    "PPR": "PEPPER",
    "RICOTTA": "RICOTTA",
    "ROASTER": "ROASTER",
    "ROTINI": "ROTINI",
    "SCE": "SAUCE",
    "SLC": "SLICED",
    "SPINCH": "SPINACH",
    "SPNC": "SPINACH",
    "SPINACH": "SPINACH",
    "SQZ": "SQUEEZE",
    "SWT": "SWEET",
    "THYME": "THYME",
    "TOM": "TOMATO",
    "TOMS": "TOMATOES",
    "TRTL": "TORTILLA",
    "VEG": "VEGETABLE",
    "VINEGAR": "VINEGAR",
    "WHT": "WHITE",
    "WHOLE": "WHOLE",
    "YLW": "YELLOW",
    "YLWGLD": "YELLOW_GOLD",
}

FEE_PATTERNS = [
    re.compile(r"\bBAG CHARGE\b"),
    re.compile(r"\bDISC AT TOTAL\b"),
]

SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)(?:\s*)(OZ|Z|LB|LBS|ML|L|FZ|FL OZ|QT|PT|GAL|GA)\b")
PACK_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)(?:\s*)(CT|PK|PKG|PACK)\b")


def to_decimal(value):
    if value in ("", None):
        return None

    try:
        return Decimal(str(value))
    except (InvalidOperation, ValueError):
        return None


def format_decimal(value, places=4):
    if value is None:
        return ""

    quant = Decimal("1").scaleb(-places)
    normalized = value.quantize(quant, rounding=ROUND_HALF_UP).normalize()
    return format(normalized, "f")


def normalize_whitespace(value):
    return " ".join(str(value or "").strip().split())


def clean_item_name(name):
    cleaned = normalize_whitespace(name).upper()
    cleaned = re.sub(r"^\+", "", cleaned)
    cleaned = re.sub(r"^PLU#\d+\s*", "", cleaned)
    cleaned = cleaned.replace("#", " ")
    return normalize_whitespace(cleaned)


def extract_store_brand_prefix(cleaned_name):
    for prefix, brand in STORE_BRAND_PREFIXES.items():
        if cleaned_name == prefix or cleaned_name.startswith(f"{prefix} "):
            return prefix, brand
    return "", ""


def extract_image_url(item):
    image = item.get("image")
    if isinstance(image, dict):
        for key in ["xlarge", "large", "medium", "small"]:
            value = image.get(key)
            if value:
                return value
    if isinstance(image, str):
        return image
    return ""


def parse_size_and_pack(cleaned_name):
    size_value = ""
    size_unit = ""
    pack_qty = ""

    size_matches = list(SIZE_RE.finditer(cleaned_name))
    if size_matches:
        match = size_matches[-1]
        size_value = normalize_number(match.group(1))
        size_unit = normalize_unit(match.group(2))

    pack_matches = list(PACK_RE.finditer(cleaned_name))
    if pack_matches:
        match = pack_matches[-1]
        pack_qty = normalize_number(match.group(1))

    return size_value, size_unit, pack_qty


def normalize_number(value):
    decimal = to_decimal(value)
    if decimal is None:
        return ""
    return format(decimal.normalize(), "f")


def normalize_unit(unit):
    collapsed = normalize_whitespace(unit).upper()
    return {
        "Z": "oz",
        "OZ": "oz",
        "FZ": "fl_oz",
        "FL OZ": "fl_oz",
        "LB": "lb",
        "LBS": "lb",
        "ML": "ml",
        "L": "l",
        "QT": "qt",
        "PT": "pt",
        "GAL": "gal",
        "GA": "gal",
    }.get(collapsed, collapsed.lower())


def strip_measure_tokens(cleaned_name):
    without_sizes = SIZE_RE.sub(" ", cleaned_name)
    without_measures = PACK_RE.sub(" ", without_sizes)
    return normalize_whitespace(without_measures)


def expand_token(token):
    return ABBREVIATIONS.get(token, token)


def normalize_item_name(cleaned_name):
    prefix, _brand = extract_store_brand_prefix(cleaned_name)
    base = cleaned_name
    if prefix:
        base = normalize_whitespace(base[len(prefix):])

    base = strip_measure_tokens(base)
    expanded_tokens = []
    for token in base.split():
        expanded = expand_token(token)
        if expanded in DROP_TOKENS:
            continue
        expanded_tokens.append(expanded)
    expanded = " ".join(token for token in expanded_tokens if token)
    return singularize_tokens(normalize_whitespace(expanded))


def singularize_tokens(text):
    singular_map = {
        "APPLES": "APPLE",
        "BANANAS": "BANANA",
        "BERRIES": "BERRY",
        "EGGS": "EGG",
        "LEMONS": "LEMON",
        "LIMES": "LIME",
        "MANDARINS": "MANDARIN",
        "PEPPERS": "PEPPER",
        "STRAWBERRIES": "STRAWBERRY",
    }
    tokens = [singular_map.get(token, token) for token in text.split()]
    return normalize_whitespace(" ".join(tokens))


def guess_measure_type(item, size_unit, pack_qty):
    unit = normalize_whitespace(item.get("lbEachCd")).upper()
    picked_weight = to_decimal(item.get("totalPickedWeight"))
    qty = to_decimal(item.get("shipQy"))

    if unit == "LB" or (picked_weight is not None and picked_weight > 0 and unit != "EA"):
        return "weight"
    if size_unit in {"lb", "oz"}:
        return "weight"
    if size_unit in {"ml", "l", "qt", "pt", "gal", "fl_oz"}:
        return "volume"
    if pack_qty:
        return "count"
    if unit == "EA" or (qty is not None and qty > 0):
        return "each"
    return ""


def is_fee_item(cleaned_name):
    return any(pattern.search(cleaned_name) for pattern in FEE_PATTERNS)


def derive_prices(item, measure_type, size_value="", size_unit="", pack_qty=""):
    qty = to_decimal(item.get("shipQy"))
    line_total = to_decimal(item.get("groceryAmount"))
    picked_weight = to_decimal(item.get("totalPickedWeight"))
    parsed_size = to_decimal(size_value)
    parsed_pack = to_decimal(pack_qty) or Decimal("1")

    price_per_each = ""
    price_per_lb = ""
    price_per_oz = ""

    if line_total is None:
        return price_per_each, price_per_lb, price_per_oz

    if measure_type == "each" and qty not in (None, Decimal("0")):
        price_per_each = format_decimal(line_total / qty)

    if measure_type == "count" and qty not in (None, Decimal("0")):
        price_per_each = format_decimal(line_total / qty)

    if measure_type == "weight" and picked_weight not in (None, Decimal("0")):
        per_lb = line_total / picked_weight
        price_per_lb = format_decimal(per_lb)
        price_per_oz = format_decimal(per_lb / Decimal("16"))
        return price_per_each, price_per_lb, price_per_oz

    if measure_type == "weight" and parsed_size not in (None, Decimal("0")) and qty not in (None, Decimal("0")):
        total_units = qty * parsed_pack * parsed_size
        if size_unit == "lb":
            per_lb = line_total / total_units
            price_per_lb = format_decimal(per_lb)
            price_per_oz = format_decimal(per_lb / Decimal("16"))
        elif size_unit == "oz":
            per_oz = line_total / total_units
            price_per_oz = format_decimal(per_oz)
            price_per_lb = format_decimal(per_oz * Decimal("16"))

    return price_per_each, price_per_lb, price_per_oz


def parse_item(order_id, order_date, raw_path, line_no, item):
    cleaned_name = clean_item_name(item.get("itemName", ""))
    size_value, size_unit, pack_qty = parse_size_and_pack(cleaned_name)
    prefix, brand_guess = extract_store_brand_prefix(cleaned_name)
    normalized_name = normalize_item_name(cleaned_name)
    measure_type = guess_measure_type(item, size_unit, pack_qty)
    price_per_each, price_per_lb, price_per_oz = derive_prices(
        item,
        measure_type,
        size_value=size_value,
        size_unit=size_unit,
        pack_qty=pack_qty,
    )
    is_fee = is_fee_item(cleaned_name)
    parse_notes = []

    if prefix:
        parse_notes.append(f"store_brand_prefix={prefix}")
    if is_fee:
        parse_notes.append("fee_item")
    if size_value and not size_unit:
        parse_notes.append("size_without_unit")

    return {
        "retailer": RETAILER,
        "order_id": str(order_id),
        "line_no": str(line_no),
        "observed_item_key": f"{RETAILER}:{order_id}:{line_no}",
        "order_date": normalize_whitespace(order_date),
        "retailer_item_id": stringify(item.get("podId")),
        "pod_id": stringify(item.get("podId")),
        "item_name": stringify(item.get("itemName")),
        "upc": stringify(item.get("primUpcCd")),
        "category_id": stringify(item.get("categoryId")),
        "category": stringify(item.get("categoryDesc")),
        "qty": stringify(item.get("shipQy")),
        "unit": stringify(item.get("lbEachCd")),
        "unit_price": stringify(item.get("unitPrice")),
        "line_total": stringify(item.get("groceryAmount")),
        "picked_weight": stringify(item.get("totalPickedWeight")),
        "mvp_savings": stringify(item.get("mvpSavings")),
        "reward_savings": stringify(item.get("rewardSavings")),
        "coupon_savings": stringify(item.get("couponSavings")),
        "coupon_price": stringify(item.get("couponPrice")),
        "matched_discount_amount": "",
        "net_line_total": stringify(item.get("totalPrice")),
        "image_url": extract_image_url(item),
        "raw_order_path": raw_path.as_posix(),
        "item_name_norm": normalized_name,
        "brand_guess": brand_guess,
        "variant": "",
        "size_value": size_value,
        "size_unit": size_unit,
        "pack_qty": pack_qty,
        "measure_type": measure_type,
        "is_store_brand": "true" if bool(prefix) else "false",
        "is_fee": "true" if is_fee else "false",
        "is_discount_line": "false",
        "is_coupon_line": "false",
        "price_per_each": price_per_each,
        "price_per_lb": price_per_lb,
        "price_per_oz": price_per_oz,
        "parse_version": PARSER_VERSION,
        "parse_notes": ";".join(parse_notes),
    }


def stringify(value):
    if value is None:
        return ""
    return str(value)


def iter_order_rows(raw_dir):
    for path in sorted(raw_dir.glob("*.json")):
        if path.name == "history.json":
            continue

        payload = json.loads(path.read_text(encoding="utf-8"))
        order_id = payload.get("orderId", path.stem)
        order_date = payload.get("orderDate", "")

        for line_no, item in enumerate(payload.get("items", []), start=1):
            yield parse_item(order_id, order_date, path, line_no, item)


def build_items_enriched(raw_dir):
    rows = list(iter_order_rows(raw_dir))
    rows.sort(key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])))
    return rows


def write_csv(path, rows):
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=OUTPUT_FIELDS)
        writer.writeheader()
        writer.writerows(rows)


@click.command()
@click.option(
    "--input-dir",
    default=str(DEFAULT_INPUT_DIR),
    show_default=True,
    help="Directory containing Giant raw order json files.",
)
@click.option(
    "--output-csv",
    default=str(DEFAULT_OUTPUT_CSV),
    show_default=True,
    help="CSV path for enriched Giant item rows.",
)
def main(input_dir, output_csv):
    raw_dir = Path(input_dir)
    output_path = Path(output_csv)

    if not raw_dir.exists():
        raise click.ClickException(f"input dir does not exist: {raw_dir}")

    rows = build_items_enriched(raw_dir)
    write_csv(output_path, rows)

    click.echo(f"wrote {len(rows)} rows to {output_path}")


if __name__ == "__main__":
    main()
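The cleanup chain above (uppercase, strip size/count tokens, expand abbreviations) can be exercised end to end. A small standalone sketch with a reduced abbreviation map and size regex (the `normalize` helper and sample name here are illustrative, not part of the module):

```python
import re

# Reduced versions of the module's ABBREVIATIONS map and SIZE_RE pattern.
ABBREV = {"SWT": "SWEET", "YLW": "YELLOW", "PPR": "PEPPER"}
SIZE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|CT)\b")

def normalize(name):
    # Uppercase and collapse whitespace, strip size/count tokens,
    # then expand known receipt abbreviations token by token.
    cleaned = " ".join(name.upper().split())
    cleaned = " ".join(SIZE.sub(" ", cleaned).split())
    return " ".join(ABBREV.get(tok, tok) for tok in cleaned.split())

print(normalize("Swt Ylw Ppr 16 OZ"))  # SWEET YELLOW PEPPER
```

Stripping measure tokens before abbreviation expansion keeps size digits from polluting the matching key.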
@@ -1,54 +0,0 @@
import csv
import hashlib
from collections import Counter
from pathlib import Path


def read_csv_rows(path):
    path = Path(path)
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def write_csv_rows(path, rows, fieldnames):
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def stable_id(prefix, raw_key):
    digest = hashlib.sha1(str(raw_key).encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


def first_nonblank(rows, field):
    for row in rows:
        value = row.get(field, "")
        if value:
            return value
    return ""


def representative_value(rows, field):
    values = [row.get(field, "") for row in rows if row.get(field, "")]
    if not values:
        return ""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[0][0]


def distinct_values(rows, field):
    return sorted({row.get(field, "") for row in rows if row.get(field, "")})


def compact_join(values, limit=3):
    unique = []
    seen = set()
    for value in values:
        if value and value not in seen:
            seen.add(value)
            unique.append(value)
    return " | ".join(unique[:limit])
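Two of the helpers above do the heavy lifting downstream: `stable_id` gives deterministic, rerun-safe keys, and `representative_value` picks the most frequent non-blank value with an alphabetical tiebreak. A quick usage sketch (the sample rows are illustrative):

```python
import hashlib
from collections import Counter

def stable_id(prefix, raw_key):
    # Same input always yields the same id across reruns.
    digest = hashlib.sha1(str(raw_key).encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"

def representative_value(rows, field):
    # Most frequent non-blank value; ties broken alphabetically.
    values = [row.get(field, "") for row in rows if row.get(field, "")]
    if not values:
        return ""
    return sorted(Counter(values).items(), key=lambda item: (-item[1], item[0]))[0][0]

rows = [{"name": "LIME"}, {"name": "LIME"}, {"name": "LIMES"}]
print(representative_value(rows, "name"))  # LIME
print(stable_id("prod", "giant:123:1") == stable_id("prod", "giant:123:1"))  # True
```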
@@ -1,346 +0,0 @@
* Grocery data model and file layout

This document defines the shared file layout and stable CSV schemas for the
grocery pipeline.
Goals:
- Keep data gathering separate from analysis
- Support multiple data-gathering methods per retailer
- Provide a single layer for review and analysis

** Design Rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific normalization.
- CSV schemas are stable and additive: new columns may be appended, but
  existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.
*** Retailer-specific data:
- raw JSON payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- comparison-ready normalized quantity basis fields

*** Review/Combined data:
- catalog of reviewed products
- links from normalized retailer items to the catalog
- human review state for unresolved cases

* Pipeline
Each step can be run alone as long as its inputs exist.
Each retailer script must produce deterministic line-item outputs, and
normalization may assign within-retailer product identity only when the
retailer itself provides strong evidence.

Key:
- (1) input
- [1] output
** 1. Collect
Get raw receipt/visit and item data from a retailer.
Scraping is unique to a retailer and method (e.g., Giant-Web and Giant-Scan).
Preserve the complete raw data and its fidelity.
Avoid interpretation beyond basic data flattening.
- (1) source access (varies; e.g., header data, auth for API access)
- [1] collected visits from each retailer
- [2] collected items from each retailer
- [3] any other raw data that supports [1] and [2], with an explicit source (eventual receipt scan?)

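The collect contract above (store the payload exactly as fetched, no interpretation) can be sketched minimally. Paths and payload shape here are illustrative, and `save_raw_order` is a hypothetical helper, not an existing script function:

```python
import json
import tempfile
from pathlib import Path

def save_raw_order(data_root, retailer_method, order_id, payload):
    # Write the payload exactly as fetched; flattening and parsing
    # happen later, in the Normalize step.
    raw_dir = Path(data_root) / retailer_method / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{order_id}.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path

with tempfile.TemporaryDirectory() as root:
    p = save_raw_order(root, "giant-web", "A123", {"orderId": "A123", "items": []})
    print(p.name)  # A123.json
```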
** 2. Normalize
Parse and extract structured facts from retailer-specific raw data
to create a standardized item format for that retailer.
Strictly dependent on the Collect method and output.
- Extract quantity, size, pack, pricing, and variant
- Attach discount line items to product line items using upc/retailer_item_id and co-occurrence
- Clean up naming to facilitate later matching
- Assign a retailer-level `normalized_item_id` only when evidence is deterministic
- Never use fuzzy or semantic matching here
- (1) collected items from each retailer
- (2) collected visits from each retailer
- [1] normalized items from each retailer

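The deterministic-identity rule above can be sketched as a small helper. This assumes rows already carry `upc` and `retailer_item_id` fields; the function name and id format are illustrative:

```python
def assign_normalized_item_id(row, retailer):
    # Deterministic bases only: exact UPC first, else exact retailer item id.
    # No fuzzy or semantic matching at this stage; leave blank rather than guess.
    if row.get("upc"):
        return f"{retailer}:upc:{row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"{retailer}:item:{row['retailer_item_id']}", "exact_retailer_item_id"
    return "", ""

item_id, basis = assign_normalized_item_id({"upc": "04138700123"}, "giant")
print(item_id, basis)  # giant:upc:04138700123 exact_upc
```

Recording the basis alongside the id makes every grouping decision auditable later.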
** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product";
match items across retailers using algorithmic logic and human review.
Create a catalog linked to normalized retailer items.
- Review operates on distinct `normalized_item_id` values, not individual purchase rows
- Cross-retailer identity decisions happen only here
- Ask the human to create a canonical/catalog item with:
  - friendly/catalog_name: "bell pepper"; "milk"
  - category: "produce"; "dairy"
  - product_type: "pepper"; "milk"
  - variant (optional): "whole", "skim", "2pct"
- Then link the group of items to that catalog item.
- (1) normalized items from each retailer
- [1] review queue of items to be reviewed
- [2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
- [3] purchase list of normalized items, pivot-ready

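Since review operates on distinct `normalized_item_id` values rather than purchase rows, the queue can be built by collapsing rows per id. A sketch under that assumption (field names are illustrative):

```python
from collections import defaultdict

def build_review_queue(purchase_rows):
    # Collapse purchase rows into one review entry per normalized_item_id,
    # carrying an example name and the row count for context.
    groups = defaultdict(list)
    for row in purchase_rows:
        if row.get("normalized_item_id"):
            groups[row["normalized_item_id"]].append(row)
    return [
        {"normalized_item_id": key, "example_name": rows[0]["item_name"], "n_rows": len(rows)}
        for key, rows in sorted(groups.items())
    ]

queue = build_review_queue([
    {"normalized_item_id": "giant:upc:1", "item_name": "SWT YLW PPR"},
    {"normalized_item_id": "giant:upc:1", "item_name": "SWT YLW PPR"},
    {"normalized_item_id": "costco:item:9", "item_name": "BELL PEPPER 6CT"},
])
print([(q["normalized_item_id"], q["n_rows"]) for q in queue])
```

Three purchase rows collapse to two review entries, so a repeated item is only reviewed once.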
** Unresolved Issues
1. Need a central script to orchestrate; metadata belongs there and nowhere else.
2. `LIME` and `LIME . / .` appearing in the catalog: catalog names must come from review-approved names, not raw strings.

* Directory Layout
Use one top-level data root:
#+begin_example
main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
  <retailer-method>/
    raw/                    # unmodified retailer payloads exactly as fetched
      <order_id>.json
    collected_items.csv     # one row per retailer line item w/ retailer-native values
    collected_orders.csv    # one row per receipt/visit, flattened from raw order data
    normalized_items.csv    # parsed retailer-specific line items with normalized fields
  costco-web/               # sample
    raw/
      orders/
        history.json
        <order_id>.json
    collected_items.csv
    collected_orders.csv
    normalized_items.csv
  review/
    review_queue.csv        # human review queue for unresolved matching/parsing cases
    product_links.csv       # links from normalized retailer items to catalog items
    catalog.csv             # cross-retailer product catalog entities used for comparison
    purchases.csv
#+end_example

Notes:
- The current repo still uses transitional root-level scripts and output folders.
- This layout is the target structure for the refactor, not a claim that migration is already complete.

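With one data root, every per-retailer path can be derived from the `<retailer-method>` slug. A sketch of that convention (`retailer_paths` is a hypothetical helper matching the layout above):

```python
from pathlib import Path

def retailer_paths(data_root, retailer_method):
    # Conventional per-retailer layout under the single data root.
    base = Path(data_root) / retailer_method
    return {
        "raw": base / "raw",
        "collected_items": base / "collected_items.csv",
        "collected_orders": base / "collected_orders.csv",
        "normalized_items": base / "normalized_items.csv",
    }

paths = retailer_paths("data", "costco-web")
print(paths["normalized_items"].as_posix())  # data/costco-web/normalized_items.csv
```

Centralizing path construction like this would let the orchestrating script (see Unresolved Issues) own all layout metadata.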
* Schemas
** `data/<retailer-method>/collected_items.csv`
One row per retailer line item.
| key                | definition                                 |
|--------------------+--------------------------------------------|
| `retailer` PK      | retailer slug                              |
| `order_id` PK      | retailer order id                          |
| `line_no` PK       | stable line number within order export     |
| `order_date`       | copied from order when available           |
| `retailer_item_id` | retailer-native item id when available     |
| `pod_id`           | retailer pod/item id                       |
| `item_name`        | raw retailer item name                     |
| `upc`              | retailer UPC or PLU value                  |
| `category_id`      | retailer category id                       |
| `category`         | retailer category description              |
| `qty`              | retailer quantity field                    |
| `unit`             | retailer unit code such as `EA` or `LB`    |
| `unit_price`       | retailer unit price field                  |
| `line_total`       | retailer extended price field              |
| `picked_weight`    | retailer picked weight field               |
| `mvp_savings`      | retailer savings field                     |
| `reward_savings`   | retailer rewards savings field             |
| `coupon_savings`   | retailer coupon savings field              |
| `coupon_price`     | retailer coupon price field                |
| `image_url`        | raw retailer image url when present        |
| `raw_order_path`   | relative path to source order payload      |
| `is_discount_line` | retailer adjustment or discount-line flag  |
| `is_coupon_line`   | coupon-like line flag when distinguishable |

** `data/<retailer-method>/collected_orders.csv`
One row per order/visit/receipt.
| key                       | definition                                      |
|---------------------------+-------------------------------------------------|
| `retailer` PK             | retailer slug such as `giant`                   |
| `order_id` PK             | retailer order or visit id                      |
| `order_date`              | order date in `YYYY-MM-DD` when available       |
| `delivery_date`           | fulfillment date in `YYYY-MM-DD` when available |
| `service_type`            | retailer service type such as `INSTORE`         |
| `order_total`             | order total as provided by retailer             |
| `payment_method`          | retailer payment label                          |
| `total_item_count`        | total line count or item count from retailer    |
| `total_savings`           | total savings as provided by retailer           |
| `your_savings_total`      | savings field from retailer when present        |
| `coupons_discounts_total` | coupon/discount total from retailer             |
| `store_name`              | retailer store name                             |
| `store_number`            | retailer store number                           |
| `store_address1`          | street address                                  |
| `store_city`              | city                                            |
| `store_state`             | state or province                               |
| `store_zipcode`           | postal code                                     |
| `refund_order`            | retailer refund flag                            |
| `ebt_order`               | retailer EBT flag                               |
| `raw_history_path`        | relative path to source history payload         |
| `raw_order_path`          | relative path to source order payload           |

** `data/<retailer-method>/normalized_items.csv`
|
|
One row per retailer line item after deterministic parsing. Preserve raw
fields from `collected_items.csv` and add parsed fields that make later review
and grouping easier. Normalization may assign retailer-level identity when the
evidence is deterministic and retailer-scoped.

| key                        | definition                                                       |
|----------------------------+------------------------------------------------------------------|
| `retailer` PK              | retailer slug                                                    |
| `order_id` PK              | retailer order id                                                |
| `line_no` PK               | line number within order                                         |
| `normalized_row_id`        | stable row key, typically `<retailer>:<order_id>:<line_no>`      |
| `normalized_item_id`       | stable retailer-level item identity when deterministic grouping is supported |
| `normalization_basis`      | basis used to assign `normalized_item_id`                        |
| `retailer_item_id`         | retailer-native item id                                          |
| `item_name`                | raw retailer item name                                           |
| `item_name_norm`           | normalized retailer item name                                    |
| `brand_guess`              | parsed brand guess                                               |
| `variant`                  | parsed variant text                                              |
| `size_value`               | parsed numeric size value                                        |
| `size_unit`                | parsed size unit such as `oz`, `lb`, `fl_oz`                     |
| `pack_qty`                 | parsed pack or count guess                                       |
| `measure_type`             | `each`, `weight`, `volume`, `count`, or blank                    |
| `normalized_quantity`      | numeric comparison basis derived during normalization            |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank                 |
| `is_item`                  | item flag                                                        |
| `is_store_brand`           | store-brand guess                                                |
| `is_fee`                   | fee or non-product flag                                          |
| `is_discount_line`         | discount or adjustment-line flag                                 |
| `is_coupon_line`           | coupon-like line flag                                            |
| `matched_discount_amount`  | matched discount value carried onto purchased row when supported |
| `net_line_total`           | line total after matched discount when supported                 |
| `price_per_each`           | derived per-each price when supported                            |
| `price_per_each_basis`     | source basis for `price_per_each`                                |
| `price_per_count`          | derived per-count price when supported                           |
| `price_per_count_basis`    | source basis for `price_per_count`                               |
| `price_per_lb`             | derived per-pound price when supported                           |
| `price_per_lb_basis`       | source basis for `price_per_lb`                                  |
| `price_per_oz`             | derived per-ounce price when supported                           |
| `price_per_oz_basis`       | source basis for `price_per_oz`                                  |
| `image_url`                | best available retailer image url                                |
| `raw_order_path`           | relative path to source order payload                            |
| `parse_version`            | parser version string for reruns                                 |
| `parse_notes`              | optional non-fatal parser notes                                  |

Notes:
- `normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.

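A minimal sketch of how the identity fields above might be assigned per row, assuming plain dict rows. `assign_normalized_ids` is a hypothetical helper name, and the `upc` before `retailer_item_id` precedence is illustrative, not the pipeline's actual rule order:

```python
def assign_normalized_ids(row: dict) -> dict:
    """Hypothetical sketch: derive row identity and deterministic item
    identity for one normalized line item (field names follow the table)."""
    out = dict(row)
    # Row identity is purely positional: retailer + order + line number.
    out["normalized_row_id"] = f"{row['retailer']}:{row['order_id']}:{row['line_no']}"
    # Item identity only from strong, retailer-scoped evidence; never fuzzy matching.
    if row.get("upc"):
        out["normalized_item_id"] = f"{row['retailer']}:upc:{row['upc']}"
        out["normalization_basis"] = "exact_upc"
    elif row.get("retailer_item_id"):
        out["normalized_item_id"] = f"{row['retailer']}:rid:{row['retailer_item_id']}"
        out["normalization_basis"] = "exact_retailer_item_id"
    else:
        # No deterministic evidence: leave identity blank for later review.
        out["normalized_item_id"] = ""
        out["normalization_basis"] = ""
    return out
```

The key property is that a blank `normalized_item_id` is a legitimate outcome, matching the "preserve ambiguity" rule.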
** `data/review/product_links.csv`

One row per review-approved link from a normalized retailer item to a catalog item.
Many normalized retailer items may link to the same catalog item.

| key                     | definition                                     |
|-------------------------+------------------------------------------------|
| `normalized_item_id` PK | normalized retailer item id                    |
| `catalog_id` PK         | linked catalog product id                      |
| `link_method`           | `manual`, `exact_upc`, `exact_name_size`, etc. |
| `link_confidence`       | optional confidence label                      |
| `review_status`         | `pending`, `approved`, `rejected`, or blank    |
| `reviewed_by`           | reviewer id or initials                        |
| `reviewed_at`           | review timestamp or date                       |
| `link_notes`            | optional notes                                 |

** `data/review/review_queue.csv`

One row per issue needing human review.

| key                  | definition                                            |
|----------------------+-------------------------------------------------------|
| `review_id` PK       | stable review row id                                  |
| `queue_type`         | `link_candidate`, `parse_issue`, `catalog_cleanup`    |
| `retailer`           | retailer slug when applicable                         |
| `normalized_item_id` | normalized retailer item id when review is item-level |
| `normalized_row_id`  | normalized row id when review is row-specific         |
| `catalog_id`         | candidate canonical id                                |
| `reason_code`        | machine-readable review reason                        |
| `priority`           | optional priority label                               |
| `raw_item_names`     | compact list of example raw names                     |
| `normalized_names`   | compact list of example normalized names              |
| `upc`                | example UPC/PLU                                       |
| `image_url`          | example image url                                     |
| `example_prices`     | compact list of example prices                        |
| `seen_count`         | count of related rows                                 |
| `status`             | `pending`, `approved`, `rejected`, `deferred`         |
| `resolution_notes`   | reviewer notes                                        |
| `created_at`         | creation timestamp or date                            |
| `updated_at`         | last update timestamp or date                         |

** `data/catalog.csv`

One row per cross-retailer catalog product.

| key                        | definition                              |
|----------------------------+-----------------------------------------|
| `catalog_id` PK            | stable catalog product id               |
| `catalog_name`             | human-reviewed product name             |
| `product_type`             | generic product, e.g. `apple`, `milk`   |
| `category`                 | broad section, e.g. `produce`, `dairy`  |
| `brand`                    | canonical brand when applicable         |
| `variant`                  | canonical variant                       |
| `size_value`               | normalized size value                   |
| `size_unit`                | normalized size unit                    |
| `pack_qty`                 | normalized pack/count                   |
| `measure_type`             | normalized measure type                 |
| `normalized_quantity`      | numeric comparison basis value          |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`  |
| `notes`                    | optional human notes                    |
| `created_at`               | creation timestamp or date              |
| `updated_at`               | last update timestamp or date           |

Notes:
- Do not auto-create new catalog rows from weak normalized names alone.
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings.

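The `normalized_quantity` / `normalized_quantity_unit` pair can be derived mechanically from the parsed size fields. A sketch under stated assumptions: weight and volume collapse to ounces, pure counts collapse to `count`, and everything else stays blank rather than guessed. `normalized_quantity` as a helper name and the tiny conversion table are illustrative:

```python
# Ounces per unit for the weight/volume conversions used in this sketch.
_TO_OZ = {"oz": 1.0, "fl_oz": 1.0, "lb": 16.0}

def normalized_quantity(size_value, size_unit, pack_qty):
    """Hypothetical sketch: derive (normalized_quantity, normalized_quantity_unit).

    Weight/volume sizes collapse to ounces, pure counts collapse to `count`,
    and anything unrecognized stays blank instead of being guessed.
    """
    pack = pack_qty or 1
    if size_value and size_unit in _TO_OZ:
        return (round(size_value * _TO_OZ[size_unit] * pack, 4), "oz")
    if pack_qty:
        return (float(pack_qty), "count")
    return (None, "")
```

So a `5 lb` bag becomes `(80.0, "oz")` and a bare `10ct` item becomes `(10.0, "count")`, which keeps cross-pack price comparisons on one basis.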
** `data/purchases.csv`

One row per purchased item (i.e., `is_item` == true in the normalized layer), with
catalog attributes denormalized in and discounts already applied.

| key                        | definition                                                     |
|----------------------------+----------------------------------------------------------------|
| `purchase_date`            | date of purchase (from order)                                  |
| `retailer`                 | retailer slug                                                  |
| `order_id`                 | retailer order id                                              |
| `line_no`                  | line number within order                                       |
| `normalized_row_id`        | `<retailer>:<order_id>:<line_no>`                              |
| `normalized_item_id`       | retailer-level normalized item identity                        |
| `catalog_id`               | linked catalog product id                                      |
| `catalog_name`             | catalog product name for analysis                              |
| `catalog_product_type`     | broader product family (e.g., `egg`, `milk`)                   |
| `catalog_category`         | category such as `produce`, `dairy`                            |
| `catalog_brand`            | canonical brand when applicable                                |
| `catalog_variant`          | canonical variant when applicable                              |
| `raw_item_name`            | original retailer item name                                    |
| `normalized_item_name`     | cleaned/normalized retailer item name                          |
| `retailer_item_id`         | retailer-native item id                                        |
| `upc`                      | UPC/PLU when available                                         |
| `qty`                      | retailer quantity field                                        |
| `unit`                     | retailer unit (e.g., `EA`, `LB`)                               |
| `pack_qty`                 | parsed pack/count                                              |
| `size_value`               | parsed size value                                              |
| `size_unit`                | parsed size unit                                               |
| `measure_type`             | `each`, `weight`, `volume`, `count`                            |
| `normalized_quantity`      | normalized comparison quantity                                 |
| `normalized_quantity_unit` | unit for normalized quantity                                   |
| `unit_price`               | retailer unit price                                            |
| `line_total`               | original retailer extended price (pre-discount)                |
| `matched_discount_amount`  | discount amount matched from discount lines                    |
| `net_line_total`           | effective price after discount (`line_total` + discounts)      |
| `store_name`               | retailer store name                                            |
| `store_city`               | store city                                                     |
| `store_state`              | store state                                                    |
| `price_per_each`           | derived per-each price                                         |
| `price_per_each_basis`     | source basis for per-each calc                                 |
| `price_per_count`          | derived per-count price                                        |
| `price_per_count_basis`    | source basis for per-count calc                                |
| `price_per_lb`             | derived per-pound price                                        |
| `price_per_lb_basis`       | source basis for per-pound calc                                |
| `price_per_oz`             | derived per-ounce price                                        |
| `price_per_oz_basis`       | source basis for per-ounce calc                                |
| `is_fee`                   | true if row represents non-product fee                         |
| `raw_order_path`           | relative path to original order payload                        |

Notes:
- Only rows that represent purchased items should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- Catalog fields are denormalized in to make pivoting trivial.
- No discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
- Review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.

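The fan-out rule in the last note can be sketched with plain dict rows. `apply_links` is a hypothetical helper, and the two catalog fields it copies stand in for the full denormalized set:

```python
def apply_links(purchase_rows, approved_links):
    """Hypothetical sketch: fan approved item-level links out to purchase rows.

    `approved_links` maps normalized_item_id -> a catalog record whose fields
    are denormalized onto every purchase row sharing that item id.
    """
    out = []
    for row in purchase_rows:
        row = dict(row)
        link = approved_links.get(row.get("normalized_item_id"))
        if link:
            row["catalog_id"] = link["catalog_id"]
            row["catalog_name"] = link["catalog_name"]
        else:
            # Unlinked rows keep blank catalog fields instead of guesses.
            row.setdefault("catalog_id", "")
            row.setdefault("catalog_name", "")
        out.append(row)
    return out
```

One approved link therefore updates every historical purchase of that retailer item on the next rebuild, with no per-row editing.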
# file: pm/notes.org
* review and item-resolution workflow

This document defines the durable review workflow for unresolved observed
products.

** persistent files

- `combined_output/purchases.csv`
  Flat normalized purchase log. This is the review input because it retains:
  - raw item name
  - normalized item name
  - observed product id
  - canonical product id when resolved
  - retailer/order/date/price context
- `combined_output/review_queue.csv`
  Current unresolved observed products grouped for review.
- `combined_output/review_resolutions.csv`
  Durable mapping decisions from observed products to canonical products.
- `combined_output/canonical_catalog.csv`
  Durable canonical item catalog used by manual review and later purchase-log
  rebuilds.

There is no separate alias file in v1. `review_resolutions.csv` is the mapping
layer from observed products to canonical product ids.

** workflow

1. Run `build_purchases.py`.
   This refreshes the purchase log and seeds/updates the canonical catalog from
   current auto-linked canonical rows.
2. Run `review_products.py`.
   This rebuilds `review_queue.csv` from unresolved purchase rows and prompts in
   the terminal for one observed product at a time.
3. Choose one of:
   - link to existing canonical
   - create new canonical
   - exclude
   - skip
4. `review_products.py` writes decisions immediately to:
   - `review_resolutions.csv`
   - `canonical_catalog.csv` when a new canonical item is created
5. Rerun `build_purchases.py`.
   This reapplies approved resolutions so the final normalized purchase log
   carries the reviewed `canonical_product_id`.

** what the human edits

The primary interface is terminal prompts in `review_products.py`.

The human provides:
- existing canonical id when linking
- canonical name/category/product type when creating a new canonical item
- optional resolution notes

The generated CSVs remain editable by hand if needed, but the intended workflow
is terminal-first.

** durability

- Resolutions are keyed by `observed_product_id`, not by one-off text
  substitution.
- Canonical products are keyed by stable `canonical_product_id`.
- Future runs reuse approved mappings through `review_resolutions.csv`.

** retention of audit fields

The final `purchases.csv` retains:
- `raw_item_name`
- `normalized_item_name`
- `canonical_product_id`

This preserves the raw receipt description, the deterministic parser output, and
the human-approved canonical identity in one flat purchase log.

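The durability rule (future runs reuse approved mappings) comes down to loading `review_resolutions.csv` at the start of every rebuild. A sketch, assuming the columns shown and a `status` field defaulting to approved; `load_resolutions` is a hypothetical name, not the script's actual API:

```python
import csv

def load_resolutions(path):
    """Hypothetical sketch: load durable observed->canonical decisions so
    reruns of build_purchases.py reuse prior approvals instead of re-asking."""
    resolutions = {}
    with open(path, newline="") as fh:
        for rec in csv.DictReader(fh):
            # Only approved decisions flow back into the purchase log.
            if rec.get("status", "approved") == "approved":
                resolutions[rec["observed_product_id"]] = rec["canonical_product_id"]
    return resolutions
```

Keying by `observed_product_id` (not raw text) is what makes the mapping survive re-scrapes and parser reruns.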
# file: pm/scrape-giant.org
* python setup

venv install playwright, pandas
playwright install

1. scrape - raw giant json

2. enrich -

   cols:
   item_name_norm
   brand_guess
   size_value
   size_unit
   pack_qty
   variant
   is_store_brand
   is_fee
   measure_type
   price_per_lb
   price_per_oz
   price_per_each
   image_url

   normalize abbreviations
   extract size like 12z, 10ct, 5lb
   detect fees like bag charges
   infer whether something is sold by each vs weight
   carry forward image url

3. build observed-product table from enriched items

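The "extract size" step above might look like the following regex sketch. The pattern list and the `extract_size` helper are assumptions that only cover the abbreviation examples named here (`12z`, `10ct`, `5lb`); the real enricher presumably handles more:

```python
import re

# Hypothetical sketch of the size-token patterns named above.
# Alternation order matters: longer units must come before their prefixes.
_SIZE_RE = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>fl\s*oz|oz|z|ct|lbs|lb|gal|pk)\b"
)

_UNIT_NORM = {"z": "oz", "lbs": "lb", "fl oz": "fl_oz"}

def extract_size(name: str):
    """Return (size_value, size_unit) or (None, None) when nothing parses."""
    m = _SIZE_RE.search(name.lower())
    if not m:
        return (None, None)
    unit = " ".join(m.group("unit").split())  # collapse "fl  oz" -> "fl oz"
    return (float(m.group("value")), _UNIT_NORM.get(unit, unit))
```

Returning `(None, None)` on no match keeps the "preserve ambiguity" behavior: unknown sizes stay blank instead of being guessed.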
* item:

get:
/api/v6.0/user/369513017/order/history/detail/69a2e44a16be1142e74ad3cc

headers:

request:

GET /api/v6.0/user/369513017/order/history/detail/69a2e44a16be1142e74ad3cc?isInStore=true HTTP/2
Host: giantfood.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br, zstd
DNT: 1
Sec-GPC: 1
Connection: keep-alive
Referer: https://giantfood.com/account/history/invoice/in-store
Cookie: datadome=rDtvd3J2hO5AeghJMSFRRxGc6ifKCQYgMLcqPNr9rWiz2rdcXb032AY6GIZn8tUmYB96BKKbzh3_jSjEzYWLj8hDjl3oGYYAiu4jwdaxpf3vh2v4f7KH7kbqgsMWpkjt; cf_clearance=WEPyQokx9f0qoyS4Svsw4EkZ1TYOxjOwcUHspT3.rXw-1773348940-1.2.1.1-fPvERGxBlFUaBW83sUppbUWpwvFG7mZivag5vBvZb3kxUQv2WSVIV1tON0HV2n8bkVY0U8_BBl62a00Np.oJylYQcGME540gZlYEoL.gMs4WynLqApFe5BOXAEwOm01_6h6b62H90bl4ypRehVb_TXEi4qHaPLVSZhjZK_h.fv6RBqjgYch2j_8XnHe5HXvLziVjl1k2aJskozqy04KOyeHyc3OyIPTZd5On_KAzFIM; dvrctk=MnjKJVShVraEtbrBkkxWxLaZrXnIGNQlwB7QtZVPFeA=; __cflb=0H28vXMLFyydRmDMNgcPHijM6auXkCspCkuh58tVuJ3; __cf_bm=C6QbqiEvbbwdrYBpoJOkcWcedf60vcOfPfTPPbZzKbM-1773348202-1.0.1.1-cSHoYwi8ZjIHTdBItXQP_iXJdRJS6FYjFsGdl1eGHvS5pgfbcT4Lg19P6UStX.bZz1u0OXiS5ykdipPBtwP6OvZr68k4XSmjYpir05jNLhw; _dd_s=rum=0&expire=1773349846445; ppdtk=Uog72CR22mD85C7U4iZHlgOQeRmvHEYp0OdQc+0lEes1c5/LeqGT+ZUlXpSC6FpW; cartId=3820547
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
TE: trailers

response:

HTTP/2 200
date: Thu, 12 Mar 2026 20:55:47 GMT
content-type: application/json
server: cloudflare
cf-ray: 9db5b3a5d84aff28-IAD
cf-cache-status: DYNAMIC
content-encoding: gzip
set-cookie: datadome=MXMri0hss6PlQ0_oS7gG2iMdOKnNkbDmGvOxelgN~nCcupgkJQOqjcjcgdprIaI7hSlt_w8E9Ri_RAzPFrGqtUfqAJ_szB_aNZ2FdC26qmI3870Nn4~T0vtx8Gj3dEZR; Max-Age=31536000; Domain=.giantfood.com; Path=/; Secure; SameSite=Lax
strict-transport-security: max-age=31536000; includeSubDomains
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers, accept-encoding
accept-ch: Sec-CH-UA,Sec-CH-UA-Mobile,Sec-CH-UA-Platform,Sec-CH-UA-Arch,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-Device-Memory
x-datadome: protected
request-context: appId=cid-v1:75750625-0c81-4f08-9f5d-ce4f73198e54
X-Firefox-Spdy: h2

* history:

GET
https://giantfood.com/api/v6.0/user/369513017/order/history?filter=instore&loyaltyNumber=440155630880

headers:

request:

GET /api/v6.0/user/369513017/order/history?filter=instore&loyaltyNumber=440155630880 HTTP/2
Host: giantfood.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br, zstd
DNT: 1
Sec-GPC: 1
Connection: keep-alive
Referer: https://giantfood.com/account/history/invoice/in-store
Cookie: datadome=OH2XjtCoI6XjE3Qsz_b0F1YULKLatAC0Ea~VMeDGBP0N9Z~CeI3RqEbvkGmNW_VCOU~vRb6p0kqibvF2tLbWnzyAGIdO7jsC41KiYbp7USpJDnefZhIg0e1ypAugvDSw; cf_clearance=WEPyQokx9f0qoyS4Svsw4EkZ1TYOxjOwcUHspT3.rXw-1773348940-1.2.1.1-fPvERGxBlFUaBW83sUppbUWpwvFG7mZivag5vBvZb3kxUQv2WSVIV1tON0HV2n8bkVY0U8_BBl62a00Np.oJylYQcGME540gZlYEoL.gMs4WynLqApFe5BOXAEwOm01_6h6b62H90bl4ypRehVb_TXEi4qHaPLVSZhjZK_h.fv6RBqjgYch2j_8XnHe5HXvLziVjl1k2aJskozqy04KOyeHyc3OyIPTZd5On_KAzFIM; dvrctk=MnjKJVShVraEtbrBkkxWxLaZrXnIGNQlwB7QtZVPFeA=; __cflb=0H28vXMLFyydRmDMNgcPHijM6auXkCspCkuh58tVuJ3; __cf_bm=C6QbqiEvbbwdrYBpoJOkcWcedf60vcOfPfTPPbZzKbM-1773348202-1.0.1.1-cSHoYwi8ZjIHTdBItXQP_iXJdRJS6FYjFsGdl1eGHvS5pgfbcT4Lg19P6UStX.bZz1u0OXiS5ykdipPBtwP6OvZr68k4XSmjYpir05jNLhw; _dd_s=rum=0&expire=1773349842848; ppdtk=Uog72CR22mD85C7U4iZHlgOQeRmvHEYp0OdQc+0lEes1c5/LeqGT+ZUlXpSC6FpW; cartId=3820547
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
TE: trailers

response:

HTTP/2 200
date: Thu, 12 Mar 2026 20:55:43 GMT
content-type: application/json
server: cloudflare
cf-ray: 9db5b38f7eebff28-IAD
cf-cache-status: DYNAMIC
content-encoding: gzip
set-cookie: datadome=rDtvd3J2hO5AeghJMSFRRxGc6ifKCQYgMLcqPNr9rWiz2rdcXb032AY6GIZn8tUmYB96BKKbzh3_jSjEzYWLj8hDjl3oGYYAiu4jwdaxpf3vh2v4f7KH7kbqgsMWpkjt; Max-Age=31536000; Domain=.giantfood.com; Path=/; Secure; SameSite=Lax
strict-transport-security: max-age=31536000; includeSubDomains
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers, accept-encoding
accept-ch: Sec-CH-UA,Sec-CH-UA-Mobile,Sec-CH-UA-Platform,Sec-CH-UA-Arch,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-Device-Memory
x-datadome: protected
request-context: appId=cid-v1:75750625-0c81-4f08-9f5d-ce4f73198e54
X-Firefox-Spdy: h2

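The captures above reduce to a small URL builder; the actual fetch reuses the logged-in Firefox cookies via `curl_cffi` (per the task log), which this sketch only hints at in a comment. `history_url` is a hypothetical helper name:

```python
BASE = "https://giantfood.com"

def history_url(user_id: str, loyalty_number: str) -> str:
    """Build the in-store order-history endpoint captured above."""
    return (
        f"{BASE}/api/v6.0/user/{user_id}/order/history"
        f"?filter=instore&loyaltyNumber={loyalty_number}"
    )

# Hypothetical fetch reusing the Firefox session cookies; curl_cffi's browser
# impersonation is what the task log relies on for the datadome/cloudflare
# checks (not verified here):
#   from curl_cffi import requests
#   resp = requests.get(history_url(uid, loyalty), cookies=jar,
#                       impersonate="firefox")
```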
# file: pm/tasks.org
#+title: Scrape-Giant Task Log

* [X] t1.1: harden giant receipt fetch cli (2-4 commits)

** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi`
- ...
- raw json archive remains source of truth

** evidence
- commit: `d57b9cf` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
- date: 2026-03-14

* [X] t1.2: define grocery data model and file layout (1-2 commits)

** acceptance criteria
- decide and document the files/directories for:
  - retailer raw exports
  - ...
- explicitly separate retailer-specific parsing from cross-retailer canonicalization

** notes
- this is the guardrail task so we don't make giant-specific hacks the system of record
- keep schema minimal but extensible

** evidence
- commit: `42dbae1` on branch `cx`
- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
- date: 2026-03-15

* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)

** acceptance criteria
- parser reads giant raw order json files
- outputs `items_enriched.csv`
- ...
- parser should preserve ambiguity rather than hallucinating precision

** evidence
- commit: `14f2cc2` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
- date: 2026-03-16

* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)

** acceptance criteria
- distinct observed products are generated from enriched giant items
- ...
- likely key is some combo of retailer + upc + normalized name

** evidence
- commit: `dc39214` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
- date: 2026-03-16

* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)

** acceptance criteria
- produce a review file containing observed products needing manual review
- ...
- optimize for "approve once, remember forever"

** evidence
- commit: `9b13ec3` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
- date: 2026-03-16

* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)

** acceptance criteria
- define and create `products_canonical.csv`
- ...
- do not require llm assistance for v1

** evidence
- commit: `347cd44` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)

** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules
- ...
- false positives are worse than unresolved items

** evidence
- commit: `385a31c` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
- date: 2026-03-16

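A deterministic auto-link rule in the spirit of t1.7 can be sketched in a few lines; `auto_link` and its inputs are hypothetical, and only the exact-UPC rule is shown:

```python
def auto_link(observed, canonical_by_upc):
    """Hypothetical sketch of one deterministic auto-link rule: exact UPC.

    Returns (canonical_id, method) or (None, None). Anything ambiguous is
    left unresolved, since false positives are worse than unresolved items.
    """
    upc = (observed.get("upc") or "").strip()
    if upc and upc in canonical_by_upc:
        return (canonical_by_upc[upc], "exact_upc")
    return (None, None)
```

Recording the method string alongside the link keeps every auto-link auditable later.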
* [X] t1.8: support costco raw ingest path (2-5 commits)

** acceptance criteria
- add a costco-specific raw ingest/export path
- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- output costco line items into the same shared raw/enriched schema family
- persist raw json under `costco_output/raw/`, with flattened `orders.csv` and `items.csv` in the same format as giant
- costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping

** notes
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16

* [X] t1.8.1: support costco parser/enricher path (2-4 commits)

** acceptance criteria
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
  - `itemDescription01` + `itemDescription02`
  - `itemNumber` as `retailer_item_id`
  - discount lines / negative rows
  - common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values

** notes
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16

* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)

** acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
- confirm at least one product class can exist as:
  - Giant observed product
  - Costco observed product
  - one shared canonical product
- document the exact example used for proof

** notes
- this is the proof that the architecture generalizes
- keep this to one or two well-behaved product classes first
- apples, eggs, bananas, or flour are better than weird prepared foods
- don't chase perfection before the second retailer lands

** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16

* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)

** acceptance criteria
- add shared fields needed for non-upc retailers, including:
  - `retailer_item_id`
  - `is_discount_line`
  - `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`

** notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere

** evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16

* [X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)

** acceptance criteria
- confirm graphql summary query returns all expected receipts
- compare `inWarehouse` count vs number of `receipts` returned
- widen or parameterize date window if necessary; website shows receipts in 3-month windows
- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
- emit warning when receipt counts mismatch

** notes
- goal is to confirm we are enumerating all receipts before parsing
- do not expand schema or parser logic in this task
- keep changes limited to summary query handling and diagnostics

** evidence
- commit: `ac82fa6` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
- date: 2026-03-16

* [X] t1.8.5: refactor costco scraper auth and UX with giant scraper

** acceptance criteria
- remove manual auth env vars
- load costco cookies from firefox session
- require only logged-in browser
- replace start/end date flags with `--months-back`
- maintain same raw output structure
- ensure `summary_lookup` keys are collision-safe by using a composite key (`transactionBarcode` + `transactionDateTime`) instead of `transactionBarcode` alone

** notes
- align Costco acquisition ergonomics with the Giant scraper
- keep downstream Costco parsing and shared schemas unchanged

** evidence
- commit: `c0054dc` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
- date: 2026-03-16

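The composite-key criterion above can be sketched directly; `receipt_key` is a hypothetical helper name for whatever the scraper actually uses:

```python
def receipt_key(receipt: dict) -> tuple:
    """Composite, collision-safe key for summary_lookup.

    `transactionBarcode` alone can repeat across receipts, so the
    date-time is folded into the key.
    """
    return (receipt["transactionBarcode"], receipt["transactionDateTime"])

# e.g. summary_lookup = {receipt_key(r): r for r in receipts}
```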
* [X] t1.8.6: add browser session helper (2-4 commits)

** acceptance criteria

- create a separate Python module/script that extracts the Firefox browser session data needed by the Giant and Costco scrapers
- support Firefox first, with Costco as the driving use case, including:
  - loading cookies via the existing browser-cookie approach
  - reading browser storage needed for dynamic auth headers (e.g. the Costco bearer token)
  - copying locked browser sqlite/db files to a temp location before reading when necessary
- expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
- keep retailer-specific parsing of extracted session data outside the low-level browser access layer
- structure the helper so Chromium-family browser support can be added later without changing scraper call sites

** notes

- goal is to replace manual `.env` copying of volatile browser-derived auth data
- session bootstrap only, not full browser automation
- prefer one shared helper over retailer-specific ad hoc storage reads
- Firefox only; Chromium support later

** evidence

- commit: `7789c2e` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
- date: 2026-03-16
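The copy-before-read criterion is the key trick for locked profile databases. A minimal sketch, assuming the standard Firefox `moz_cookies` table; the function name is hypothetical and not the project's `browser_session.py` API:

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

def read_firefox_cookies(cookies_db):
    """Copy a possibly-locked Firefox cookies.sqlite to a temp file, then read it.

    A running Firefox holds a lock on profile databases, so we never
    open the live file directly.
    """
    cookies_db = Path(cookies_db)
    with tempfile.TemporaryDirectory() as tmp_dir:
        copied = Path(tmp_dir) / cookies_db.name
        shutil.copy2(cookies_db, copied)
        conn = sqlite3.connect(copied)
        try:
            return conn.execute(
                "SELECT host, name, value FROM moz_cookies"
            ).fetchall()
        finally:
            conn.close()
```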
* [X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)

** acceptance criteria

- make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
- keep `browser_session.py` limited to low-level browser data access only:
  - firefox profile discovery
  - cookie loading
  - storage reads
  - sqlite copy/read helpers
- remove or sharply reduce `retailer_sessions.py` so retailer-specific header extraction lives with the retailer scraper or in a very small retailer-specific helper
- make session bootstrap flow explicit and linear:
  - load browser context
  - extract costco auth values
  - build request headers
  - build requests session
- eliminate inconsistent/obsolete function signatures and dead call paths (e.g. mixed `build_session(...)` calling conventions, stale fallback branches, mismatched `build_headers(...)` args)
- add one focused bootstrap debug print showing whether cookies, authorization, client id, and client identifier were found
- preserve current working behavior where available; this is a refactor/clarification task, not a feature expansion task

** notes

- goal is to restore concern separation and debuggability
- prefer obvious retailer-specific code over “generic” helpers that guess and obscure control flow
- browser access can stay shared; retailer auth mapping should be explicit
- no new heuristics in this task

** evidence

- commit: `d7a0329` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
- date: 2026-03-16
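The "one focused bootstrap debug print" criterion could look like the sketch below; the key names and output format are assumptions, not the actual diagnostic in `scrape_costco.py`:

```python
def bootstrap_debug_line(cookies, auth):
    """One-line bootstrap diagnostic: which auth inputs were found.

    `cookies` is any cookie mapping/jar; `auth` is a dict of values
    extracted from browser storage. Key names here are illustrative.
    """
    found = {
        "cookies": bool(cookies),
        "authorization": bool(auth.get("authorization")),
        "client_id": bool(auth.get("client_id")),
        "client_identifier": bool(auth.get("client_identifier")),
    }
    return "bootstrap: " + " ".join(
        f"{name}={'ok' if present else 'MISSING'}" for name, present in found.items()
    )
```

Printing this once, right after the linear load/extract/build steps, makes a failed bootstrap immediately explain itself.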
* [X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)

** acceptance criteria

- produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
- each purchase row preserves:
  - purchase date
  - retailer
  - order id
  - raw item name
  - normalized item name
  - canonical item id when resolved
  - quantity / unit
  - line total
  - store/location info where available
- derive normalized comparison fields where possible on enriched or observed product rows:
  - `price_per_lb`
  - `price_per_oz`
  - `price_per_each`
  - `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
  - parsed size/unit
  - receipt weight
  - explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- support pivot-friendly analysis of purchase frequency and item cost over time
- document at least one Giant vs Costco comparison example using the normalized metrics

** notes

- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage
- raw item name must be retained for audit/debugging

** evidence

- commit: `be1bf63` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
- date: 2026-03-16
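The nulls-over-guesses rule for the comparison metrics can be sketched as one derivation helper. This is a hedged illustration, not `build_purchases.py` itself; the function name, unit table, and basis-string format are assumptions:

```python
def derive_unit_price(line_total, size_value, size_unit):
    """Derive one normalized comparison metric, or None when the basis is unknown.

    Returns (metric_name, value, basis) for an exact, unambiguous basis;
    anything else yields None rather than a guessed number.
    """
    if line_total is None or not size_value or size_value <= 0:
        return None
    unit = (size_unit or "").strip().lower()
    metric_by_unit = {
        "lb": "price_per_lb",
        "oz": "price_per_oz",
        "each": "price_per_each",
        "count": "price_per_count",
    }
    metric = metric_by_unit.get(unit)
    if metric is None:
        return None  # unrecognized unit: emit null, do not invent economics
    return (metric, round(line_total / size_value, 4), f"parsed_size:{size_value} {unit}")
```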
* [X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)

** acceptance criteria

- define the persistent files used to resolve unknown items, including:
  - review queue
  - canonical item catalog
  - alias / mapping layer if separate
- specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
- define the manual resolution workflow, including:
  - what the human edits
  - what script is rerun afterward
  - how resolved mappings are persisted for future runs
- ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
- document how raw item name, normalized item name, and canonical item id are all retained

** notes

- goal is “approve once, reuse forever”
- keep the workflow simple and auditable
- manual review is fine; the important part is making it durable and rerunnable

** evidence

- commit: `c7dad54` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified `combined_output/review_queue.csv`, `combined_output/review_resolutions.csv` workflow, and `combined_output/canonical_catalog.csv`
- date: 2026-03-16
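Stable canonical ids like `gcan_01b0d623aa02` suggest a deterministic hash helper. A sketch of what the shared `stable_id` helper might look like, not necessarily its real implementation:

```python
import hashlib

def stable_id(prefix, *parts):
    """Deterministic id from identifying fields, e.g. stable_id('gcan', 'LIME').

    The same inputs always produce the same id, so re-running the pipeline
    never reshuffles ids and approved mappings stay valid across runs.
    """
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:12]}"
```

This is what makes "approve once, reuse forever" workable: the mapping file keys stay stable no matter how often upstream stages regenerate.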
* [X] t1.12: simplify review process display

Clearly show current state separate from proposed future state.

** acceptance criteria

1. Display position in review queue, e.g., (1/22)
2. Display compact header with observed_product under review, queue position, and canonical decision, e.g.: "Resolve [n] observed product group [name] and associated items to canonical_name [name]? (\n [n] matched items)"
3. color-code outputs based on info, input/prompt, warning/error
   1. color action menu/requests for input differently from display text; do not color individual options separately
   2. "no canonical_name suggestions found" is informational, not a warning/error.
4. update action menu `[x]exclude` to `e[x]clude`
5. on each review item, display a list of all matched items to be linked, sorted by descending date:
   1. YYYY-mm-dd, price, raw item name, normalized item name, upc, retailer
   2. image URL, if exists
6. on each review item, suggest (but do not auto-apply) up to 3 likely existing canonicals using deterministic rules, e.g.:
   1. exact normalized name match
   2. prefix/contains match on canonical name
   3. exact UPC
7. Sample Entry:
#+begin_comment
Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
2 matched items:
[1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
[2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
2 canonical suggestions found:
[1] BELL PEPPERS, PRODUCE
[2] PEPPER, SPICES
#+end_comment
8. When link is selected, users should be able to select the number of the item in the list, e.g.:
#+begin_comment
Select the canonical_name to associate [n] items with:
[1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
[2] BTB CHICKEN | gcan_0201f0feb749
[3] LIME | gcan_02074d9e7359
#+end_comment
9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok?"
   actions: [Y]es [n]o [b]ack [s]kip [q]uit

- reinforce project terminology such as raw_name, observed_name, canonical_name

** evidence

- commit: `7b8141c`, `d39497c`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
- date: 2026-03-17

** notes

- The key improvement was shifting the prompt from system metadata to reviewer intent: one observed_product, its matched retailer rows, and one canonical_name decision.
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
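The three-role color scheme can be kept as small helpers. The review script itself styles output with click, so the exact escape codes and role names below are an illustrative stdlib sketch, not the script's implementation:

```python
# ANSI styles for the three output roles; the real script uses click
# styling, so these exact escape codes are illustrative
ROLE_STYLES = {
    "info": "\033[36m",      # cyan: display text, e.g. matched item lists
    "prompt": "\033[93m",    # bright yellow: action menus / input requests
    "warning": "\033[35m",   # magenta: warnings and errors
}
RESET = "\033[0m"

def colorize(role, text):
    return f"{ROLE_STYLES[role]}{text}{RESET}"

def no_suggestions_line():
    # informational, not a warning/error, per the acceptance criteria
    return colorize("info", "no canonical_name suggestions found")
```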
* [X] t1.13.1 pipeline accountability and stage visibility (1-2 commits)

add simple accounting so we can see what survives or drops at each pipeline stage

** AC

1. emit counts for raw, enriched, combined/observed, review-queued, canonical-linked, and final purchase-log rows
2. report unresolved and dropped item counts explicitly
3. make it easy to verify that missing items were intentionally left in review rather than silently lost

- pm note: simple text/json/csv summary is sufficient; trust and visibility matter more than presentation

** evidence

- commit: `967e19e`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python report_pipeline_status.py`; verified `combined_output/pipeline_status.csv` and `combined_output/pipeline_status.json`
- date: 2026-03-17

** notes

- Added a single explicit status script instead of threading counters through every pipeline step; this keeps the pipeline simple while still making row survival visible.
- The most useful check here is `unresolved_not_in_review_rows`; when it is non-zero, we know we have a real accounting bug rather than normal unresolved work.
* [X] t1.13.2 costco discount matching and net pricing in enrich_costco (2-3 commits)

refactor costco enrichment so discount lines are matched to purchased items and net pricing is preserved

** AC

1. detect costco discount/coupon rows like `/<retailer_item_id>` and match them to purchased items within the same order
2. preserve raw discount rows for auditability while also carrying matched discount values onto the purchased item row
3. add explicit fields for discount-adjusted pricing, e.g. `matched_discount_amount` and `net_line_total` (or equivalent)
4. preserve original raw receipt amounts (`line_total`) without overwriting them

- pm note: keep this retailer-specific and explicit; do not introduce generic discount heuristics

** evidence

- commit: `56a03bc`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified matched Costco discount rows now populate `matched_discount_amount` and `net_line_total` while preserving raw `line_total`
- date: 2026-03-17

** notes

- Kept this retailer-specific and literal: only discount rows with `/<retailer_item_id>` are matched, and only within the same order.
- Raw discount rows are still preserved for auditability; the purchased row now carries the matched adjustment separately rather than overwriting the original amount.
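The `/<retailer_item_id>` matching rule can be sketched as a single same-order pass. Field names follow the acceptance criteria; the function itself is a hedged illustration of the technique, not `enrich_costco.py`:

```python
def match_costco_discounts(order_rows):
    """Attach Costco discount rows ('/<retailer_item_id>') to purchased rows.

    Operates within a single order only. Purchased rows gain
    matched_discount_amount and net_line_total; raw line_total is preserved.
    """
    purchases = {}
    discounts = {}
    for row in order_rows:
        item_id = row["retailer_item_id"]
        if item_id.startswith("/"):
            # discount line: '/123' discounts item '123' in the same order
            discounts.setdefault(item_id[1:], 0.0)
            discounts[item_id[1:]] += abs(float(row["line_total"]))
        else:
            purchases[item_id] = row
    matched = []
    for item_id, row in purchases.items():
        discount = discounts.get(item_id, 0.0)
        new_row = dict(row)  # copy: never mutate the raw row
        new_row["matched_discount_amount"] = discount
        new_row["net_line_total"] = round(float(row["line_total"]) - discount, 2)
        matched.append(new_row)
    return matched
```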
* [X] t1.13.3 canonical cleanup and review-first product identity (3-4 commits)

refactor canonical generation so product identity is cleaner, duplicate canonicals are reduced, and unresolved items stay in review instead of spawning junk canonicals

** AC

1. stop auto-creating new canonical products from weak normalized names alone; unresolved items remain in `review_queue.csv`
2. canonical names are based on stable product identity rather than noisy observed titles
3. packaging/count/size tokens are removed from canonical names when they belong in structured fields (`pack_qty`, `size_value`, `size_unit`)
4. consolidate obvious duplicate canonicals (e.g. egg/lime cases) and ensure final outputs retain raw item name, normalized item name, and canonical item id

- pm note: prefer conservative canonical creation and a better manual review loop over aggressive auto-unification

** evidence

- commit: `08e2a86`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified weaker exact-name cases now remain unresolved in `combined_output/review_queue.csv` and canonical names are cleaned before auto-catalog creation
- date: 2026-03-17

** notes

- Removed weak exact-name auto-canonical creation so ambiguous products stay in review instead of generating junk canonicals.
- Canonical display names are now cleaned of obvious punctuation and packaging noise, but I kept the cleanup conservative rather than adding a broad fuzzy merge layer.
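Stripping packaging/count/size tokens from canonical names could look like the sketch below. The token list and regex are assumptions for illustration; the script's actual rule set is conservative and may differ:

```python
import re

# packaging/count/size tokens that belong in structured fields
# (pack_qty, size_value, size_unit), not the canonical name;
# the token list here is illustrative, not the script's actual rules
PACKAGING_TOKEN = re.compile(
    r"\b\d+(?:[./]\d+)?\s*-?\s*(?:PACK|PK|CT|COUNT|OZ|LB|LBS|GAL|ML)\b\.?",
    re.IGNORECASE,
)

def clean_canonical_name(name):
    cleaned = PACKAGING_TOKEN.sub(" ", name)
    cleaned = re.sub(r"[^\w\s,&'-]", " ", cleaned)  # drop stray punctuation
    return re.sub(r"\s+", " ", cleaned).strip().upper()
```

So "MIXED PEPPER 6-PACK" and "MIXED PEPPER" would collapse to the same display name, while the pack count survives in `pack_qty`.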
* [X] t1.14: refactor retailer collection into the new data model (2-4 commits)

move Giant and Costco collection into the new collect structure and make both retailers emit the same collected schemas

** Acceptance Criteria

1. create retailer-specific collect scripts in the target naming pattern, e.g.:
   - collect_giant_web.py
   - collect_costco_web.py
2. collected outputs conform to pm/data-model.org:
   - data/<retailer-method>/raw/...
   - data/<retailer-method>/collected_orders.csv
   - data/<retailer-method>/collected_items.csv
3. current Giant and Costco raw acquisition behavior is preserved during the move
4. collected schemas preserve retailer truth and provenance:
   - no interpretation beyond basic flattening
   - raw_order_path/raw_history_path remain usable
   - unknown values remain blank rather than guessed
5. old paths should be removed or deprecated
6. collect_* scripts do not depend on any normalize/review files or scripts

- pm note: this is a path/schema refactor, not a parsing rewrite

** evidence

- commit: `48c6eaf`
- tests: `./venv/bin/python -m unittest tests.test_scraper tests.test_costco_pipeline tests.test_browser_session`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`
- datetime: 2026-03-18

** notes

- Kept this as a path/schema move, not a parsing rewrite: the existing Giant and Costco collection behavior remains in place behind new `collect_*` entry points.
- Added lightweight deprecation nudges on the legacy `scrape_*` commands rather than removing them immediately, so the move is inspectable and low-risk.
- The main schema fix was on Giant collection, which was missing retailer/provenance/audit fields that Costco collection already carried.
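The `data/<retailer-method>/...` layout can be resolved in one place so both collect scripts agree on paths. A small sketch; the helper name and the `costco-web` retailer-method spelling are assumptions inferred from the `collect_costco_web.py` naming pattern:

```python
from pathlib import Path

def collect_output_paths(retailer_method, base="data"):
    """Resolve the pm/data-model.org output layout for one retailer-method.

    e.g. retailer_method='costco-web' yields data/costco-web/raw/,
    data/costco-web/collected_orders.csv, data/costco-web/collected_items.csv.
    """
    root = Path(base) / retailer_method
    return {
        "raw_dir": root / "raw",
        "orders_csv": root / "collected_orders.csv",
        "items_csv": root / "collected_items.csv",
    }
```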
* [ ] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)

make Giant and Costco emit the shared normalized line-item schema without introducing cross-retailer identity logic

** Acceptance Criteria

1. create retailer-specific normalize scripts in the target naming pattern, e.g.:
   - normalize_giant_web.py
   - normalize_costco_web.py
2. normalized outputs conform to pm/data-model.org:
   - data/<retailer-method>/normalized_items.csv
   - one row per collected line item
   - normalized_row_id is stable and present
   - normalized_item_id is stable, present, and represents retailer-level identity reused across repeated purchase rows when deterministic retailer evidence is sufficient
   - normalized_quantity and normalized_quantity_unit
   - repeated rows for the same retailer product resolve to the same normalized_item_id only when supported by deterministic retailer evidence, e.g. exact upc, exact retailer_item_id, exact cleaned name + same size/pack
   - normalization_basis is explicit
3. Giant normalization preserves current useful parsing:
   - normalized item name
   - size/unit/pack parsing
   - fee/store-brand flags
   - derived price fields
4. Costco normalization preserves current useful parsing:
   - normalized item name
   - size/unit/pack parsing
   - explicit discount matching using retailer-specific logic
   - matched_discount_amount and net_line_total
5. both normalizers preserve raw retailer truth:
   - line_total is never overwritten
   - unknown values remain blank rather than guessed
6. no cross-retailer identity assignment occurs in normalization
7. normalize never uses fuzzy or semantic matching to assign normalized_item_id

- pm note: prefer explicit retailer-specific code paths over generic normalization helpers unless the duplication is truly mechanical
- pm note: normalization may resolve retailer-level identity, but not catalog identity
- pm note: normalized_item_id is the only retailer-level grouping identity; do not introduce observed_products or a second grouping artifact

** evidence

- commit:
- tests:
- datetime:

** notes
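The deterministic-evidence rule for normalized_item_id can be sketched as an explicit preference ladder; no fuzzy matching, and no id at all when evidence is missing. The helper name and the `nitm` prefix are illustrative, not the data model's spec:

```python
import hashlib

def normalized_item_id(row):
    """Assign retailer-level identity from deterministic evidence only.

    Preference order follows the acceptance criteria: exact upc, then exact
    retailer_item_id, then exact cleaned name + size/pack. Returns None when
    no deterministic basis exists, rather than guessing.
    """
    retailer = row.get("retailer", "")
    if row.get("upc"):
        basis = ("upc", row["upc"])
    elif row.get("retailer_item_id"):
        basis = ("retailer_item_id", row["retailer_item_id"])
    elif row.get("normalized_item_name"):
        basis = (
            "name_size",
            row["normalized_item_name"],
            row.get("size_value", ""),
            row.get("size_unit", ""),
            row.get("pack_qty", ""),
        )
    else:
        return None
    digest = hashlib.sha256("|".join([retailer, *basis]).encode("utf-8")).hexdigest()
    return f"nitm_{digest[:12]}"
```

The chosen `basis` tuple also doubles as the explicit `normalization_basis` the schema requires.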
* [ ] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)

replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items

** Acceptance Criteria

1. refactor review outputs to conform to pm/data-model.org:
   - data/review/review_queue.csv
   - data/review/product_links.csv
   - data/catalog.csv
   - data/purchases.csv
2. review logic uses normalized_item_id as the upstream retailer-level review identity:
   - no dependency on observed_product_id
   - no dependency on products_observed.csv
   - one review/link decision applies to all purchase rows sharing the same normalized_item_id
3. product_links.csv stores review-approved links from normalized_item_id to catalog_id
   - one row per approved retailer-level identity to catalog mapping
4. catalog.csv entries are review-first and conservative:
   - no auto-creation from weak normalized names alone
   - names come from reviewed catalog naming, not raw retailer strings
   - packaging/count is not embedded in catalog_name unless essential to identity
   - catalog_name/product_type/category/brand/variant may be blank until reviewed; blank is preferred to guessed
5. purchases.csv remains pivot-ready and retains:
   - raw item name
   - normalized item name
   - normalized_row_id (not for review)
   - normalized_item_id
   - catalog_id
   - catalog fields
   - raw line_total
   - matched_discount_amount and net_line_total when present
   - derived price fields and their bases
6. terminal review flow remains simple and usable:
   - reviewer sees one grouped retailer item identity (normalized_item_id) with count and list of matches, not one prompt per purchase row; use existing pattern as a template
   - link to existing catalog item
   - create new catalog item
   - exclude
   - skip
7. pipeline accounting remains valid after the refactor:
   - unresolved items are visible
   - missing items are not silently dropped
8. pm note: prefer a better manual review loop over aggressive automatic grouping. initial manual data entry is expected, and should resolve over time
9. pm note: keep review/combine auditable; each catalog link should be explainable from normalized rows and review state

** evidence

- commit:
- tests:
- datetime:
** notes


* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)

** acceptance criteria

- llm suggestions are generated only for unresolved normalized retailer items
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files
@@ -1,119 +0,0 @@
import json
from pathlib import Path

import click

import build_observed_products
import build_purchases
import review_products
from layer_helpers import read_csv_rows, write_csv_rows


SUMMARY_FIELDS = ["stage", "count"]


def read_rows_if_exists(path):
    path = Path(path)
    if not path.exists():
        return []
    return read_csv_rows(path)


def build_status_summary(
    giant_orders,
    giant_items,
    giant_enriched,
    costco_orders,
    costco_items,
    costco_enriched,
    purchases,
    resolutions,
):
    enriched_rows = giant_enriched + costco_enriched
    observed_rows = build_observed_products.build_observed_products(enriched_rows)
    queue_rows = review_products.build_review_queue(purchases, resolutions)

    unresolved_purchase_rows = [
        row
        for row in purchases
        if row.get("observed_product_id")
        and not row.get("canonical_product_id")
        and row.get("is_fee") != "true"
        and row.get("is_discount_line") != "true"
        and row.get("is_coupon_line") != "true"
    ]
    excluded_rows = [
        row
        for row in purchases
        if row.get("resolution_action") == "exclude"
    ]
    linked_purchase_rows = [row for row in purchases if row.get("canonical_product_id")]

    summary = [
        {"stage": "raw_orders", "count": len(giant_orders) + len(costco_orders)},
        {"stage": "raw_items", "count": len(giant_items) + len(costco_items)},
        {"stage": "enriched_items", "count": len(enriched_rows)},
        {"stage": "observed_products", "count": len(observed_rows)},
        {"stage": "review_queue_observed_products", "count": len(queue_rows)},
        {"stage": "canonical_linked_purchase_rows", "count": len(linked_purchase_rows)},
        {"stage": "final_purchase_rows", "count": len(purchases)},
        {"stage": "unresolved_purchase_rows", "count": len(unresolved_purchase_rows)},
        {"stage": "excluded_purchase_rows", "count": len(excluded_rows)},
        {
            "stage": "unresolved_not_in_review_rows",
            "count": len(
                [
                    row
                    for row in unresolved_purchase_rows
                    if row.get("observed_product_id")
                    not in {queue_row["observed_product_id"] for queue_row in queue_rows}
                ]
            ),
        },
    ]
    return summary


@click.command()
@click.option("--giant-orders-csv", default="giant_output/orders.csv", show_default=True)
@click.option("--giant-items-csv", default="giant_output/items.csv", show_default=True)
@click.option("--giant-enriched-csv", default="giant_output/items_enriched.csv", show_default=True)
@click.option("--costco-orders-csv", default="costco_output/orders.csv", show_default=True)
@click.option("--costco-items-csv", default="costco_output/items.csv", show_default=True)
@click.option("--costco-enriched-csv", default="costco_output/items_enriched.csv", show_default=True)
@click.option("--purchases-csv", default="combined_output/purchases.csv", show_default=True)
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True)
@click.option("--summary-csv", default="combined_output/pipeline_status.csv", show_default=True)
@click.option("--summary-json", default="combined_output/pipeline_status.json", show_default=True)
def main(
    giant_orders_csv,
    giant_items_csv,
    giant_enriched_csv,
    costco_orders_csv,
    costco_items_csv,
    costco_enriched_csv,
    purchases_csv,
    resolutions_csv,
    summary_csv,
    summary_json,
):
    summary_rows = build_status_summary(
        read_rows_if_exists(giant_orders_csv),
        read_rows_if_exists(giant_items_csv),
        read_rows_if_exists(giant_enriched_csv),
        read_rows_if_exists(costco_orders_csv),
        read_rows_if_exists(costco_items_csv),
        read_rows_if_exists(costco_enriched_csv),
        read_rows_if_exists(purchases_csv),
        read_rows_if_exists(resolutions_csv),
    )
    write_csv_rows(summary_csv, summary_rows, SUMMARY_FIELDS)
    summary_json_path = Path(summary_json)
    summary_json_path.parent.mkdir(parents=True, exist_ok=True)
    summary_json_path.write_text(json.dumps(summary_rows, indent=2), encoding="utf-8")
    for row in summary_rows:
        click.echo(f"{row['stage']}: {row['count']}")


if __name__ == "__main__":
    main()
BIN requirements.txt
Binary file not shown.
@@ -1,426 +0,0 @@
from collections import defaultdict
|
|
||||||
from datetime import date
|
|
||||||
|
|
||||||
import click
|
|
||||||
|
|
||||||
import build_purchases
|
|
||||||
from layer_helpers import compact_join, stable_id, write_csv_rows
|
|
||||||
|
|
||||||
|
|
||||||
QUEUE_FIELDS = [
|
|
||||||
"review_id",
|
|
||||||
"retailer",
|
|
||||||
"observed_product_id",
|
|
||||||
"canonical_product_id",
|
|
||||||
"reason_code",
|
|
||||||
"priority",
|
|
||||||
"raw_item_names",
|
|
||||||
"normalized_names",
|
|
||||||
"upc_values",
|
|
||||||
"example_prices",
|
|
||||||
"seen_count",
|
|
||||||
"status",
|
|
||||||
"resolution_action",
|
|
||||||
"resolution_notes",
|
|
||||||
"created_at",
|
|
||||||
"updated_at",
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
def build_review_queue(purchase_rows, resolution_rows):
|
|
||||||
by_observed = defaultdict(list)
|
|
||||||
resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
|
|
||||||
|
|
||||||
for row in purchase_rows:
|
|
||||||
observed_product_id = row.get("observed_product_id", "")
|
|
||||||
if not observed_product_id:
|
|
||||||
continue
|
|
||||||
by_observed[observed_product_id].append(row)
|
|
||||||
|
|
||||||
today_text = str(date.today())
|
|
||||||
queue_rows = []
|
|
||||||
for observed_product_id, rows in sorted(by_observed.items()):
|
|
||||||
current_resolution = resolution_lookup.get(observed_product_id, {})
|
|
||||||
if current_resolution.get("status") == "approved":
|
|
||||||
continue
|
|
||||||
unresolved_rows = [row for row in rows if not row.get("canonical_product_id")]
|
|
||||||
if not unresolved_rows:
|
|
||||||
continue
|
|
||||||
|
|
||||||
retailers = sorted({row["retailer"] for row in rows})
|
|
||||||
review_id = stable_id("rvw", observed_product_id)
|
|
||||||
queue_rows.append(
|
|
||||||
{
|
|
||||||
"review_id": review_id,
|
|
||||||
"retailer": " | ".join(retailers),
|
|
||||||
"observed_product_id": observed_product_id,
|
|
||||||
"canonical_product_id": current_resolution.get("canonical_product_id", ""),
|
|
||||||
"reason_code": "missing_canonical_link",
|
|
||||||
"priority": "high",
|
|
||||||
"raw_item_names": compact_join(
|
|
||||||
sorted({row["raw_item_name"] for row in rows if row["raw_item_name"]}),
|
|
||||||
limit=8,
|
|
||||||
),
|
|
||||||
"normalized_names": compact_join(
|
|
||||||
sorted(
|
|
||||||
{
|
|
||||||
row["normalized_item_name"]
|
|
||||||
for row in rows
|
|
||||||
if row["normalized_item_name"]
|
|
||||||
}
|
|
||||||
),
|
|
||||||
limit=8,
|
|
||||||
),
|
|
||||||
"upc_values": compact_join(
|
|
||||||
sorted({row["upc"] for row in rows if row["upc"]}),
|
|
||||||
limit=8,
|
|
||||||
),
|
|
||||||
"example_prices": compact_join(
|
|
||||||
sorted({row["line_total"] for row in rows if row["line_total"]}),
|
|
||||||
limit=8,
|
|
||||||
),
|
|
||||||
"seen_count": str(len(rows)),
|
|
||||||
"status": current_resolution.get("status", "pending"),
|
|
||||||
"resolution_action": current_resolution.get("resolution_action", ""),
|
|
||||||
"resolution_notes": current_resolution.get("resolution_notes", ""),
|
|
||||||
"created_at": current_resolution.get("reviewed_at", today_text),
|
|
||||||
"updated_at": today_text,
|
|
||||||
}
|
|
||||||
)
|
|
||||||
return queue_rows
|
|
||||||
|
|
||||||
|
|
||||||
def save_resolution_rows(path, rows):
|
|
||||||
write_csv_rows(path, rows, build_purchases.RESOLUTION_FIELDS)
|
|
||||||
|
|
||||||
|
|
||||||
def save_catalog_rows(path, rows):
|
|
||||||
write_csv_rows(path, rows, build_purchases.CATALOG_FIELDS)
|
|
||||||
|
|
||||||
|
|
||||||
INFO_COLOR = "cyan"
|
|
||||||
PROMPT_COLOR = "bright_yellow"
|
|
||||||
WARNING_COLOR = "magenta"
|
|
||||||
|
|
||||||
|
|
||||||
def sort_related_items(rows):
|
|
||||||
return sorted(
|
|
||||||
rows,
|
|
||||||
key=lambda row: (
|
|
||||||
row.get("purchase_date", ""),
|
|
||||||
row.get("order_id", ""),
|
|
||||||
int(row.get("line_no", "0") or "0"),
|
|
||||||
),
|
|
||||||
reverse=True,
|
|
||||||
)
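To illustrate the sort key above with hypothetical rows (not part of the script): items are ordered newest-first by purchase date, then order id, then numeric line number, with blank line numbers falling back to 0.

```python
# Hypothetical rows illustrating the descending (date, order_id, line_no)
# key used by sort_related_items; blank line_no values fall back to 0.
rows = [
    {"purchase_date": "2024-01-01", "order_id": "C", "line_no": ""},
    {"purchase_date": "2024-01-02", "order_id": "A", "line_no": "2"},
    {"purchase_date": "2024-01-02", "order_id": "B", "line_no": "1"},
]
ordered = sorted(
    rows,
    key=lambda row: (
        row.get("purchase_date", ""),
        row.get("order_id", ""),
        int(row.get("line_no", "0") or "0"),
    ),
    reverse=True,
)
# both 2024-01-02 rows sort before the 2024-01-01 row; B beats A on order_id
print([row["order_id"] for row in ordered])  # → ['B', 'A', 'C']
```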
|
|
||||||
|
|
||||||
|
|
||||||
def build_canonical_suggestions(related_rows, catalog_rows, limit=3):
|
|
||||||
normalized_names = {
|
|
||||||
row.get("normalized_item_name", "").strip().upper()
|
|
||||||
for row in related_rows
|
|
||||||
if row.get("normalized_item_name", "").strip()
|
|
||||||
}
|
|
||||||
upcs = {
|
|
||||||
row.get("upc", "").strip()
|
|
||||||
for row in related_rows
|
|
||||||
if row.get("upc", "").strip()
|
|
||||||
}
|
|
||||||
suggestions = []
|
|
||||||
seen_ids = set()
|
|
||||||
|
|
||||||
def add_matches(rows, reason):
|
|
||||||
for row in rows:
|
|
||||||
canonical_product_id = row.get("canonical_product_id", "")
|
|
||||||
if not canonical_product_id or canonical_product_id in seen_ids:
|
|
||||||
continue
|
|
||||||
seen_ids.add(canonical_product_id)
|
|
||||||
suggestions.append(
|
|
||||||
{
|
|
||||||
"canonical_product_id": canonical_product_id,
|
|
||||||
"canonical_name": row.get("canonical_name", ""),
|
|
||||||
"reason": reason,
|
|
||||||
}
|
|
||||||
)
|
|
||||||
if len(suggestions) >= limit:
|
|
||||||
return True
|
|
||||||
return False
|
|
||||||
|
|
||||||
exact_upc_rows = [
|
|
||||||
row
|
|
||||||
for row in catalog_rows
|
|
||||||
if row.get("upc", "").strip() and row.get("upc", "").strip() in upcs
|
|
||||||
]
|
|
||||||
if add_matches(exact_upc_rows, "exact upc"):
|
|
||||||
return suggestions
|
|
||||||
|
|
||||||
exact_name_rows = [
|
|
||||||
row
|
|
||||||
for row in catalog_rows
|
|
||||||
if row.get("canonical_name", "").strip().upper() in normalized_names
|
|
||||||
]
|
|
||||||
if add_matches(exact_name_rows, "exact normalized name"):
|
|
||||||
return suggestions
|
|
||||||
|
|
||||||
contains_rows = []
|
|
||||||
for row in catalog_rows:
|
|
||||||
canonical_name = row.get("canonical_name", "").strip().upper()
|
|
||||||
if not canonical_name:
|
|
||||||
continue
|
|
||||||
for normalized_name in normalized_names:
|
|
||||||
if normalized_name in canonical_name or canonical_name in normalized_name:
|
|
||||||
contains_rows.append(row)
|
|
||||||
break
|
|
||||||
add_matches(contains_rows, "canonical name contains match")
|
|
||||||
return suggestions
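The tiers above can be sketched in miniature (hypothetical catalog data, names `c1`/`c2` invented for illustration): an exact UPC hit outranks an exact normalized-name hit, which outranks substring containment.

```python
# Minimal sketch of the tiered matching order in build_canonical_suggestions:
# exact UPC beats exact normalized name (assumed sample data, not real catalog rows).
catalog = [
    {"canonical_product_id": "c1", "canonical_name": "WHOLE MILK", "upc": "111"},
    {"canonical_product_id": "c2", "canonical_name": "MILK", "upc": ""},
]
observed_upcs = {"111"}
observed_names = {"MILK"}

upc_hits = [row for row in catalog if row["upc"] and row["upc"] in observed_upcs]
name_hits = [row for row in catalog if row["canonical_name"] in observed_names]
# the UPC tier is consulted first, so c1 wins even though c2 matches by name
best = (upc_hits or name_hits)[0]["canonical_product_id"]
print(best)  # → c1
```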
|
|
||||||
|
|
||||||
|
|
||||||
def build_display_lines(queue_row, related_rows):
|
|
||||||
lines = []
|
|
||||||
for index, row in enumerate(sort_related_items(related_rows), start=1):
|
|
||||||
lines.append(
|
|
||||||
" [{index}] {purchase_date} | {line_total} | {raw_item_name} | {normalized_item_name} | "
|
|
||||||
"{upc} | {retailer}".format(
|
|
||||||
index=index,
|
|
||||||
purchase_date=row.get("purchase_date", ""),
|
|
||||||
line_total=row.get("line_total", ""),
|
|
||||||
raw_item_name=row.get("raw_item_name", ""),
|
|
||||||
normalized_item_name=row.get("normalized_item_name", ""),
|
|
||||||
upc=row.get("upc", ""),
|
|
||||||
retailer=row.get("retailer", ""),
|
|
||||||
)
|
|
||||||
)
|
|
||||||
if row.get("image_url"):
|
|
||||||
lines.append(f" {row['image_url']}")
|
|
||||||
if not lines:
|
|
||||||
lines.append(" [1] no matched item rows found")
|
|
||||||
return lines
|
|
||||||
|
|
||||||
|
|
||||||
def observed_name(queue_row, related_rows):
|
|
||||||
if queue_row.get("normalized_names"):
|
|
||||||
return queue_row["normalized_names"].split(" | ")[0]
|
|
||||||
for row in related_rows:
|
|
||||||
if row.get("normalized_item_name"):
|
|
||||||
return row["normalized_item_name"]
|
|
||||||
return queue_row.get("observed_product_id", "")
|
|
||||||
|
|
||||||
|
|
||||||
def choose_existing_canonical(display_rows, observed_label, matched_count):
|
|
||||||
click.secho(
|
|
||||||
f"Select the canonical_name to associate {matched_count} items with:",
|
|
||||||
fg=INFO_COLOR,
|
|
||||||
)
|
|
||||||
for index, row in enumerate(display_rows, start=1):
|
|
||||||
click.echo(f" [{index}] {row['canonical_name']} | {row['canonical_product_id']}")
|
|
||||||
choice = click.prompt(
|
|
||||||
click.style("selection", fg=PROMPT_COLOR),
|
|
||||||
type=click.IntRange(1, len(display_rows)),
|
|
||||||
)
|
|
||||||
chosen_row = display_rows[choice - 1]
|
|
||||||
click.echo(
|
|
||||||
f'{matched_count} "{observed_label}" items and future matches will be associated '
|
|
||||||
f'with "{chosen_row["canonical_name"]}".'
|
|
||||||
)
|
|
||||||
click.secho(
|
|
||||||
"actions: [y]es [n]o [b]ack [s]kip [q]uit",
|
|
||||||
fg=PROMPT_COLOR,
|
|
||||||
)
|
|
||||||
confirm = click.prompt(
|
|
||||||
click.style("confirm", fg=PROMPT_COLOR),
|
|
||||||
type=click.Choice(["y", "n", "b", "s", "q"]),
|
|
||||||
)
|
|
||||||
if confirm == "y":
|
|
||||||
return chosen_row["canonical_product_id"], ""
|
|
||||||
if confirm == "s":
|
|
||||||
return "", "skip"
|
|
||||||
if confirm == "q":
|
|
||||||
return "", "quit"
|
|
||||||
return "", "back"
|
|
||||||
|
|
||||||
|
|
||||||
def prompt_resolution(queue_row, related_rows, catalog_rows, queue_index, queue_total):
|
|
||||||
suggestions = build_canonical_suggestions(related_rows, catalog_rows)
|
|
||||||
observed_label = observed_name(queue_row, related_rows)
|
|
||||||
matched_count = len(related_rows)
|
|
||||||
click.echo("")
|
|
||||||
click.secho(
|
|
||||||
f"Review {queue_index}/{queue_total}: Resolve observed_product {observed_label} "
|
|
||||||
"to canonical_name [__]?",
|
|
||||||
fg=INFO_COLOR,
|
|
||||||
)
|
|
||||||
click.echo(f"{matched_count} matched items:")
|
|
||||||
for line in build_display_lines(queue_row, related_rows):
|
|
||||||
click.echo(line)
|
|
||||||
if suggestions:
|
|
||||||
click.echo(f"{len(suggestions)} canonical suggestions found:")
|
|
||||||
for index, suggestion in enumerate(suggestions, start=1):
|
|
||||||
click.echo(f" [{index}] {suggestion['canonical_name']}")
|
|
||||||
else:
|
|
||||||
click.echo("no canonical_name suggestions found")
|
|
||||||
click.secho(
|
|
||||||
"[l]ink existing [n]ew canonical e[x]clude [s]kip [q]uit:",
|
|
||||||
fg=PROMPT_COLOR,
|
|
||||||
)
|
|
||||||
action = click.prompt(
|
|
||||||
"",
|
|
||||||
type=click.Choice(["l", "n", "x", "s", "q"]),
|
|
||||||
prompt_suffix=" ",
|
|
||||||
)
|
|
||||||
if action == "q":
|
|
||||||
return None, None
|
|
||||||
if action == "s":
|
|
||||||
return {
|
|
||||||
"observed_product_id": queue_row["observed_product_id"],
|
|
||||||
"canonical_product_id": "",
|
|
||||||
"resolution_action": "skip",
|
|
||||||
"status": "pending",
|
|
||||||
"resolution_notes": queue_row.get("resolution_notes", ""),
|
|
||||||
"reviewed_at": str(date.today()),
|
|
||||||
}, None
|
|
||||||
if action == "x":
|
|
||||||
notes = click.prompt(
|
|
||||||
click.style("exclude notes", fg=PROMPT_COLOR),
|
|
||||||
default="",
|
|
||||||
show_default=False,
|
|
||||||
)
|
|
||||||
return {
|
|
||||||
"observed_product_id": queue_row["observed_product_id"],
|
|
||||||
"canonical_product_id": "",
|
|
||||||
"resolution_action": "exclude",
|
|
||||||
"status": "approved",
|
|
||||||
"resolution_notes": notes,
|
|
||||||
"reviewed_at": str(date.today()),
|
|
||||||
}, None
|
|
||||||
if action == "l":
|
|
||||||
display_rows = suggestions or [
|
|
||||||
{
|
|
||||||
"canonical_product_id": row["canonical_product_id"],
|
|
||||||
"canonical_name": row["canonical_name"],
|
|
||||||
"reason": "catalog sample",
|
|
||||||
}
|
|
||||||
for row in catalog_rows[:10]
|
|
||||||
]
|
|
||||||
while True:
|
|
||||||
canonical_product_id, outcome = choose_existing_canonical(
|
|
||||||
display_rows,
|
|
||||||
observed_label,
|
|
||||||
matched_count,
|
|
||||||
)
|
|
||||||
if outcome == "skip":
|
|
||||||
return {
|
|
||||||
"observed_product_id": queue_row["observed_product_id"],
|
|
||||||
"canonical_product_id": "",
|
|
||||||
"resolution_action": "skip",
|
|
||||||
"status": "pending",
|
|
||||||
"resolution_notes": queue_row.get("resolution_notes", ""),
|
|
||||||
"reviewed_at": str(date.today()),
|
|
||||||
}, None
|
|
||||||
if outcome == "quit":
|
|
||||||
return None, None
|
|
||||||
if outcome == "back":
|
|
||||||
continue
|
|
||||||
break
|
|
||||||
notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
|
|
||||||
return {
|
|
||||||
"observed_product_id": queue_row["observed_product_id"],
|
|
||||||
"canonical_product_id": canonical_product_id,
|
|
||||||
"resolution_action": "link",
|
|
||||||
"status": "approved",
|
|
||||||
"resolution_notes": notes,
|
|
||||||
"reviewed_at": str(date.today()),
|
|
||||||
}, None
|
|
||||||
|
|
||||||
canonical_name = click.prompt(click.style("canonical name", fg=PROMPT_COLOR), type=str)
|
|
||||||
category = click.prompt(
|
|
||||||
click.style("category", fg=PROMPT_COLOR),
|
|
||||||
default="",
|
|
||||||
show_default=False,
|
|
||||||
)
|
|
||||||
product_type = click.prompt(
|
|
||||||
click.style("product type", fg=PROMPT_COLOR),
|
|
||||||
default="",
|
|
||||||
show_default=False,
|
|
||||||
)
|
|
||||||
notes = click.prompt(
|
|
||||||
click.style("notes", fg=PROMPT_COLOR),
|
|
||||||
default="",
|
|
||||||
show_default=False,
|
|
||||||
)
|
|
||||||
canonical_product_id = stable_id("gcan", f"manual|{canonical_name}|{category}|{product_type}")
|
|
||||||
canonical_row = {
|
|
||||||
"canonical_product_id": canonical_product_id,
|
|
||||||
"canonical_name": canonical_name,
|
|
||||||
"category": category,
|
|
||||||
"product_type": product_type,
|
|
||||||
"brand": "",
|
|
||||||
"variant": "",
|
|
||||||
"size_value": "",
|
|
||||||
"size_unit": "",
|
|
||||||
"pack_qty": "",
|
|
||||||
"measure_type": "",
|
|
||||||
"notes": notes,
|
|
||||||
"created_at": str(date.today()),
|
|
||||||
"updated_at": str(date.today()),
|
|
||||||
}
|
|
||||||
resolution_row = {
|
|
||||||
"observed_product_id": queue_row["observed_product_id"],
|
|
||||||
"canonical_product_id": canonical_product_id,
|
|
||||||
"resolution_action": "create",
|
|
||||||
"status": "approved",
|
|
||||||
"resolution_notes": notes,
|
|
||||||
"reviewed_at": str(date.today()),
|
|
||||||
}
|
|
||||||
return resolution_row, canonical_row
|
|
||||||
|
|
||||||
|
|
||||||
@click.command()
|
|
||||||
@click.option("--purchases-csv", default="combined_output/purchases.csv", show_default=True)
|
|
||||||
@click.option("--queue-csv", default="combined_output/review_queue.csv", show_default=True)
|
|
||||||
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True)
|
|
||||||
@click.option("--catalog-csv", default="combined_output/canonical_catalog.csv", show_default=True)
|
|
||||||
@click.option("--limit", default=0, show_default=True, type=int)
|
|
||||||
@click.option("--refresh-only", is_flag=True, help="Only rebuild review_queue.csv without prompting.")
|
|
||||||
def main(purchases_csv, queue_csv, resolutions_csv, catalog_csv, limit, refresh_only):
|
|
||||||
purchase_rows = build_purchases.read_optional_csv_rows(purchases_csv)
|
|
||||||
resolution_rows = build_purchases.read_optional_csv_rows(resolutions_csv)
|
|
||||||
catalog_rows = build_purchases.read_optional_csv_rows(catalog_csv)
|
|
||||||
queue_rows = build_review_queue(purchase_rows, resolution_rows)
|
|
||||||
write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
|
|
||||||
click.echo(f"wrote {len(queue_rows)} rows to {queue_csv}")
|
|
||||||
|
|
||||||
if refresh_only:
|
|
||||||
return
|
|
||||||
|
|
||||||
resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
|
|
||||||
catalog_by_id = {row["canonical_product_id"]: row for row in catalog_rows if row.get("canonical_product_id")}
|
|
||||||
rows_by_observed = defaultdict(list)
|
|
||||||
for row in purchase_rows:
|
|
||||||
observed_product_id = row.get("observed_product_id", "")
|
|
||||||
if observed_product_id:
|
|
||||||
rows_by_observed[observed_product_id].append(row)
|
|
||||||
reviewed = 0
|
|
||||||
for index, queue_row in enumerate(queue_rows, start=1):
|
|
||||||
if limit and reviewed >= limit:
|
|
||||||
break
|
|
||||||
related_rows = rows_by_observed.get(queue_row["observed_product_id"], [])
|
|
||||||
result = prompt_resolution(queue_row, related_rows, catalog_rows, index, len(queue_rows))
|
|
||||||
if result == (None, None):
|
|
||||||
break
|
|
||||||
resolution_row, canonical_row = result
|
|
||||||
resolution_lookup[resolution_row["observed_product_id"]] = resolution_row
|
|
||||||
if canonical_row and canonical_row["canonical_product_id"] not in catalog_by_id:
|
|
||||||
catalog_by_id[canonical_row["canonical_product_id"]] = canonical_row
|
|
||||||
catalog_rows.append(canonical_row)
|
|
||||||
reviewed += 1
|
|
||||||
|
|
||||||
save_resolution_rows(resolutions_csv, sorted(resolution_lookup.values(), key=lambda row: row["observed_product_id"]))
|
|
||||||
save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["canonical_product_id"]))
|
|
||||||
click.echo(
|
|
||||||
f"saved {len(resolution_lookup)} resolution rows to {resolutions_csv} "
|
|
||||||
f"and {len(catalog_by_id)} catalog rows to {catalog_csv}"
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
254  scrape-click.py  Normal file
@@ -0,0 +1,254 @@
import json
import os
import time
from pathlib import Path

import browser_cookie3
import click
import pandas as pd
from curl_cffi import requests
from dotenv import load_dotenv


BASE = "https://giantfood.com"
ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"


def load_config():
    load_dotenv()
    return {
        "user_id": os.getenv("GIANT_USER_ID", "").strip(),
        "loyalty": os.getenv("GIANT_LOYALTY_NUMBER", "").strip(),
    }


def build_session():
    s = requests.Session()
    s.cookies.update(browser_cookie3.firefox(domain_name="giantfood.com"))
    s.headers.update({
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
        "accept": "application/json, text/plain, */*",
        "accept-language": "en-US,en;q=0.9",
        "referer": ACCOUNT_PAGE,
    })
    return s


def safe_get(session, url, **kwargs):
    last_response = None

    for attempt in range(3):
        try:
            r = session.get(
                url,
                impersonate="firefox",
                timeout=30,
                **kwargs,
            )
            last_response = r

            if r.status_code == 200:
                return r

            click.echo(f"retry {attempt + 1}/3 status={r.status_code}")
        except Exception as e:
            click.echo(f"retry {attempt + 1}/3 error={e}")

        time.sleep(3)

    if last_response is not None:
        last_response.raise_for_status()

    raise RuntimeError(f"failed to fetch {url}")


def get_history(session, user_id, loyalty):
    url = f"{BASE}/api/v6.0/user/{user_id}/order/history"
    r = safe_get(
        session,
        url,
        params={
            "filter": "instore",
            "loyaltyNumber": loyalty,
        },
    )
    return r.json()


def get_order_detail(session, user_id, order_id):
    url = f"{BASE}/api/v6.0/user/{user_id}/order/history/detail/{order_id}"
    r = safe_get(
        session,
        url,
        params={"isInStore": "true"},
    )
    return r.json()


def flatten_orders(history, details):
    orders = []
    items = []

    history_lookup = {
        r["orderId"]: r
        for r in history.get("records", [])
    }

    for d in details:
        hist = history_lookup.get(d["orderId"], {})
        pup = d.get("pup", {})

        orders.append({
            "order_id": d["orderId"],
            "order_date": d.get("orderDate"),
            "delivery_date": d.get("deliveryDate"),
            "service_type": hist.get("serviceType"),
            "order_total": d.get("orderTotal"),
            "payment_method": d.get("paymentMethod"),
            "total_item_count": d.get("totalItemCount"),
            "total_savings": d.get("totalSavings"),
            "your_savings_total": d.get("yourSavingsTotal"),
            "coupons_discounts_total": d.get("couponsDiscountsTotal"),
            "store_name": pup.get("storeName"),
            "store_number": pup.get("aholdStoreNumber"),
            "store_address1": pup.get("storeAddress1"),
            "store_city": pup.get("storeCity"),
            "store_state": pup.get("storeState"),
            "store_zipcode": pup.get("storeZipcode"),
            "refund_order": d.get("refundOrder"),
            "ebt_order": d.get("ebtOrder"),
        })

        for i, item in enumerate(d.get("items", []), start=1):
            items.append({
                "order_id": d["orderId"],
                "order_date": d.get("orderDate"),
                "line_no": i,
                "pod_id": item.get("podId"),
                "item_name": item.get("itemName"),
                "upc": item.get("primUpcCd"),
                "category_id": item.get("categoryId"),
                "category": item.get("categoryDesc"),
                "qty": item.get("shipQy"),
                "unit": item.get("lbEachCd"),
                "unit_price": item.get("unitPrice"),
                "line_total": item.get("groceryAmount"),
                "picked_weight": item.get("totalPickedWeight"),
                "mvp_savings": item.get("mvpSavings"),
                "reward_savings": item.get("rewardSavings"),
                "coupon_savings": item.get("couponSavings"),
                "coupon_price": item.get("couponPrice"),
            })

    return pd.DataFrame(orders), pd.DataFrame(items)


def read_existing_order_ids(orders_csv: Path) -> set[str]:
    if not orders_csv.exists():
        return set()

    try:
        df = pd.read_csv(orders_csv, dtype={"order_id": str})
        if "order_id" not in df.columns:
            return set()
        return set(df["order_id"].dropna().astype(str))
    except Exception:
        return set()


def append_dedup(existing_path: Path, new_df: pd.DataFrame, subset: list[str]) -> pd.DataFrame:
    if existing_path.exists():
        old_df = pd.read_csv(existing_path, dtype=str)
        combined = pd.concat([old_df, new_df.astype(str)], ignore_index=True)
    else:
        combined = new_df.astype(str).copy()

    combined = combined.drop_duplicates(subset=subset, keep="last")
    combined.to_csv(existing_path, index=False)
    return combined


@click.command()
@click.option("--user-id", default=None, help="giant user id")
@click.option("--loyalty", default=None, help="giant loyalty number")
@click.option("--outdir", default="giant_output", show_default=True, help="output directory")
@click.option("--sleep-seconds", default=1.5, show_default=True, type=float, help="delay between detail requests")
def main(user_id, loyalty, outdir, sleep_seconds):
    cfg = load_config()

    user_id = user_id or cfg["user_id"] or click.prompt("giant user id", type=str)
    loyalty = loyalty or cfg["loyalty"] or click.prompt("giant loyalty number", type=str)

    outdir = Path(outdir)
    rawdir = outdir / "raw"
    rawdir.mkdir(parents=True, exist_ok=True)

    orders_csv = outdir / "orders.csv"
    items_csv = outdir / "items.csv"

    click.echo("using cookies from your current firefox profile.")
    click.echo(f"open giant here, make sure you're logged in, then return: {ACCOUNT_PAGE}")
    click.pause(info="press any key once giant is open and logged in")

    session = build_session()

    click.echo("fetching order history...")
    history = get_history(session, user_id, loyalty)

    (rawdir / "history.json").write_text(
        json.dumps(history, indent=2),
        encoding="utf-8",
    )

    records = history.get("records", [])
    click.echo(f"history returned {len(records)} visits")
    click.echo("tip: giant appears to expose only the most recent 50 visits, so run this periodically if you want full continuity.")

    history_order_ids = [str(r["orderId"]) for r in records]
    existing_order_ids = read_existing_order_ids(orders_csv)
    new_order_ids = [oid for oid in history_order_ids if oid not in existing_order_ids]

    click.echo(f"existing orders in csv: {len(existing_order_ids)}")
    click.echo(f"new orders to fetch: {len(new_order_ids)}")

    if not new_order_ids:
        click.echo("no new orders found. done.")
        return

    details = []
    for order_id in new_order_ids:
        click.echo(f"fetching {order_id}")
        d = get_order_detail(session, user_id, order_id)
        details.append(d)

        (rawdir / f"{order_id}.json").write_text(
            json.dumps(d, indent=2),
            encoding="utf-8",
        )

        time.sleep(sleep_seconds)

    click.echo("flattening new data...")
    orders_df, items_df = flatten_orders(history, details)

    orders_all = append_dedup(
        orders_csv,
        orders_df,
        subset=["order_id"],
    )

    items_all = append_dedup(
        items_csv,
        items_df,
        subset=["order_id", "line_no", "item_name", "upc", "line_total"],
    )

    click.echo("done")
    click.echo(f"orders csv: {orders_csv}")
    click.echo(f"items csv: {items_csv}")
    click.echo(f"total orders stored: {len(orders_all)}")
    click.echo(f"total item rows stored: {len(items_all)}")


if __name__ == "__main__":
    main()
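The append-and-dedup step above relies on `drop_duplicates(keep="last")`, so a re-fetched order overwrites the row already on disk while previously unseen rows are kept. A minimal sketch with hypothetical sample data (the `total` values are invented):

```python
# Sketch of append_dedup's core behavior: concat old + new rows, then keep
# the last occurrence per order_id so re-fetched orders win.
import pandas as pd

old = pd.DataFrame([{"order_id": "1", "total": "10"}, {"order_id": "2", "total": "20"}])
new = pd.DataFrame([{"order_id": "2", "total": "21"}, {"order_id": "3", "total": "30"}])

combined = pd.concat([old, new], ignore_index=True).drop_duplicates(
    subset=["order_id"], keep="last"
)
# order 2 now carries the re-fetched total "21"; orders 1 and 3 are untouched
print(list(combined["order_id"]))  # → ['1', '2', '3']
```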
738  scrape_costco.py
@@ -1,738 +0,0 @@
import os
import csv
import json
import time
import re
from pathlib import Path
from calendar import monthrange
from datetime import datetime, timedelta
from dotenv import load_dotenv
import click
from curl_cffi import requests

from browser_session import (
    find_firefox_profile_dir,
    load_firefox_cookies,
    read_firefox_local_storage,
    read_firefox_webapps_store,
)

BASE_URL = "https://ecom-api.costco.com/ebusiness/order/v1/orders/graphql"
RETAILER = "costco"

SUMMARY_QUERY = """
query receiptsWithCounts($startDate: String!, $endDate: String!, $documentType: String!, $documentSubType: String!) {
  receiptsWithCounts(startDate: $startDate, endDate: $endDate, documentType: $documentType, documentSubType: $documentSubType) {
    inWarehouse
    gasStation
    carWash
    gasAndCarWash
    receipts {
      warehouseName
      receiptType
      documentType
      transactionDateTime
      transactionBarcode
      warehouseName
      transactionType
      total
      totalItemCount
      itemArray {
        itemNumber
      }
      tenderArray {
        tenderTypeCode
        tenderDescription
        amountTender
      }
      couponArray {
        upcnumberCoupon
      }
    }
  }
}
""".strip()

DETAIL_QUERY = """
query receiptsWithCounts($barcode: String!, $documentType: String!) {
  receiptsWithCounts(barcode: $barcode, documentType: $documentType) {
    receipts {
      warehouseName
      receiptType
      documentType
      transactionDateTime
      transactionDate
      companyNumber
      warehouseNumber
      operatorNumber
      warehouseShortName
      registerNumber
      transactionNumber
      transactionType
      transactionBarcode
      total
      warehouseAddress1
      warehouseAddress2
      warehouseCity
      warehouseState
      warehouseCountry
      warehousePostalCode
      totalItemCount
      subTotal
      taxes
      total
      invoiceNumber
      sequenceNumber
      itemArray {
        itemNumber
        itemDescription01
        frenchItemDescription1
        itemDescription02
        frenchItemDescription2
        itemIdentifier
        itemDepartmentNumber
        unit
        amount
        taxFlag
        merchantID
        entryMethod
        transDepartmentNumber
        fuelUnitQuantity
        fuelGradeCode
        itemUnitPriceAmount
        fuelUomCode
        fuelUomDescription
        fuelUomDescriptionFr
        fuelGradeDescription
        fuelGradeDescriptionFr
      }
      tenderArray {
        tenderTypeCode
        tenderSubTypeCode
        tenderDescription
        amountTender
        displayAccountNumber
        sequenceNumber
        approvalNumber
        responseCode
        tenderTypeName
        transactionID
        merchantID
        entryMethod
        tenderAcctTxnNumber
        tenderAuthorizationCode
        tenderTypeNameFr
        tenderEntryMethodDescription
        walletType
        walletId
        storedValueBucket
      }
      subTaxes {
        tax1
        tax2
        tax3
        tax4
        aTaxPercent
        aTaxLegend
        aTaxAmount
        aTaxPrintCode
        aTaxPrintCodeFR
        aTaxIdentifierCode
        bTaxPercent
        bTaxLegend
        bTaxAmount
        bTaxPrintCode
        bTaxPrintCodeFR
        bTaxIdentifierCode
        cTaxPercent
        cTaxLegend
        cTaxAmount
        cTaxIdentifierCode
        dTaxPercent
        dTaxLegend
        dTaxAmount
        dTaxPrintCode
        dTaxPrintCodeFR
        dTaxIdentifierCode
        uTaxLegend
        uTaxAmount
        uTaxableAmount
      }
      instantSavings
      membershipNumber
    }
  }
}
""".strip()

ORDER_FIELDS = [
    "retailer",
    "order_id",
    "order_date",
    "delivery_date",
    "service_type",
    "order_total",
    "payment_method",
    "total_item_count",
    "total_savings",
    "your_savings_total",
    "coupons_discounts_total",
    "store_name",
    "store_number",
    "store_address1",
    "store_city",
    "store_state",
    "store_zipcode",
    "refund_order",
    "ebt_order",
    "raw_history_path",
    "raw_order_path",
]

ITEM_FIELDS = [
    "retailer",
    "order_id",
    "line_no",
    "order_date",
    "retailer_item_id",
    "pod_id",
    "item_name",
    "upc",
    "category_id",
    "category",
    "qty",
    "unit",
    "unit_price",
    "line_total",
    "picked_weight",
    "mvp_savings",
    "reward_savings",
    "coupon_savings",
    "coupon_price",
    "image_url",
    "raw_order_path",
    "is_discount_line",
    "is_coupon_line",
]

COSTCO_STORAGE_ORIGIN = "costco.com"
COSTCO_ID_TOKEN_STORAGE_KEY = "idToken"
COSTCO_CLIENT_ID_STORAGE_KEY = "clientID"


def load_config():
    load_dotenv()
    return {
        "authorization": os.getenv("COSTCO_X_AUTHORIZATION", "").strip(),
        "client_id": os.getenv("COSTCO_X_WCS_CLIENTID", "").strip(),
        "client_identifier": os.getenv("COSTCO_CLIENT_IDENTIFIER", "").strip(),
    }


def build_headers(auth_headers):
    headers = {
        "accept": "*/*",
        "content-type": "application/json-patch+json",
        "costco.service": "restOrders",
        "costco.env": "ecom",
        "origin": "https://www.costco.com",
        "referer": "https://www.costco.com/",
        "user-agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) "
            "Gecko/20100101 Firefox/148.0"
        ),
    }
    headers.update(auth_headers)
    return headers


def load_costco_browser_headers(profile_dir, authorization, client_id, client_identifier):
    local_storage = read_firefox_local_storage(profile_dir, COSTCO_STORAGE_ORIGIN)
    webapps_store = read_firefox_webapps_store(profile_dir, COSTCO_STORAGE_ORIGIN)
    auth_header = authorization.strip() if authorization else ""
    if client_id:
        client_id = client_id.strip()
    if client_identifier:
        client_identifier = client_identifier.strip()

    if not auth_header:
        id_token = (
            local_storage.get(COSTCO_ID_TOKEN_STORAGE_KEY, "").strip()
            or webapps_store.get(COSTCO_ID_TOKEN_STORAGE_KEY, "").strip()
        )
        if id_token:
            auth_header = f"Bearer {id_token}"

    client_id = client_id or (
        local_storage.get(COSTCO_CLIENT_ID_STORAGE_KEY, "").strip()
        or webapps_store.get(COSTCO_CLIENT_ID_STORAGE_KEY, "").strip()
    )

    if not auth_header:
        raise click.ClickException(
            "could not find Costco auth token; set COSTCO_X_AUTHORIZATION or load Firefox idToken"
        )
    if not client_id or not client_identifier:
        raise click.ClickException(
            "missing Costco client ids; set COSTCO_X_WCS_CLIENTID and COSTCO_CLIENT_IDENTIFIER"
        )

    return {
        "costco-x-authorization": auth_header,
        "costco-x-wcs-clientId": client_id,
        "client-identifier": client_identifier,
    }


def build_session(profile_dir, auth_headers):
    session = requests.Session()
    session.cookies.update(load_firefox_cookies(".costco.com", profile_dir))
    session.headers.update(build_headers(auth_headers))
    session.headers.update(auth_headers)
    return session


def graphql_post(session, query, variables):
    last_response = None

    for attempt in range(3):
        try:
            response = session.post(
                BASE_URL,
                json={"query": query, "variables": variables},
                impersonate="firefox",
                timeout=30,
            )
            last_response = response
            if response.status_code == 200:
                return response.json()
            click.echo(f"retry {attempt + 1}/3 status={response.status_code} body={response.text[:500]}")
        except Exception as exc:  # pragma: no cover - network error path
            click.echo(f"retry {attempt + 1}/3 error={exc}")
        time.sleep(3)

    if last_response is not None:
        last_response.raise_for_status()

    raise RuntimeError("failed to fetch Costco GraphQL payload")


def safe_filename(value):
    return re.sub(r'[<>:"/\\|?*]+', "-", str(value))


def summary_receipts(payload):
    return payload.get("data", {}).get("receiptsWithCounts", {}).get("receipts", [])


def detail_receipts(payload):
    return payload.get("data", {}).get("receiptsWithCounts", {}).get("receipts", [])
|
|
||||||
|
|
||||||
|
|
||||||
def summary_counts(payload):
|
|
||||||
counts = payload.get("data", {}).get("receiptsWithCounts", {})
|
|
||||||
return {
|
|
||||||
"inWarehouse": counts.get("inWarehouse", 0) or 0,
|
|
||||||
"gasStation": counts.get("gasStation", 0) or 0,
|
|
||||||
"carWash": counts.get("carWash", 0) or 0,
|
|
||||||
"gasAndCarWash": counts.get("gasAndCarWash", 0) or 0,
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def parse_cli_date(value):
|
|
||||||
return datetime.strptime(value, "%m/%d/%Y").date()
|
|
||||||
|
|
||||||
|
|
||||||
def format_cli_date(value):
|
|
||||||
return f"{value.month}/{value.day:02d}/{value.year}"
|
|
||||||
|
|
||||||
|
|
||||||
def subtract_months(value, months):
|
|
||||||
year = value.year
|
|
||||||
month = value.month - months
|
|
||||||
while month <= 0:
|
|
||||||
month += 12
|
|
||||||
year -= 1
|
|
||||||
day = min(value.day, monthrange(year, month)[1])
|
|
||||||
return value.replace(year=year, month=month, day=day)
|
|
||||||
|
|
||||||
|
|
||||||
def resolve_date_range(months_back, today=None):
|
|
||||||
if months_back < 1:
|
|
||||||
raise click.ClickException("months-back must be at least 1")
|
|
||||||
|
|
||||||
end = today or datetime.now().date()
|
|
||||||
start = subtract_months(end, months_back)
|
|
||||||
return format_cli_date(start), format_cli_date(end)
|
|
||||||
|
|
||||||
|
|
||||||
def build_date_windows(start_date, end_date, window_days):
|
|
||||||
start = parse_cli_date(start_date)
|
|
||||||
end = parse_cli_date(end_date)
|
|
||||||
if end < start:
|
|
||||||
raise click.ClickException("end-date must be on or after start-date")
|
|
||||||
if window_days < 1:
|
|
||||||
raise click.ClickException("window-days must be at least 1")
|
|
||||||
|
|
||||||
windows = []
|
|
||||||
current = start
|
|
||||||
while current <= end:
|
|
||||||
window_end = min(current + timedelta(days=window_days - 1), end)
|
|
||||||
windows.append(
|
|
||||||
{
|
|
||||||
"startDate": format_cli_date(current),
|
|
||||||
"endDate": format_cli_date(window_end),
|
|
||||||
}
|
|
||||||
)
|
|
||||||
current = window_end + timedelta(days=1)
|
|
||||||
return windows
|
|
||||||
|
|
||||||
|
|
||||||
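The date helpers above are pure functions, so their behavior is easy to check in isolation. This is a minimal standalone sketch of the same month-clamping and windowing logic, with the click validation and `m/dd/yyyy` formatting omitted:

```python
# Standalone sketch of the date-window logic above (click validation omitted).
from calendar import monthrange
from datetime import date, timedelta


def subtract_months(value, months):
    # Walk the month counter back, clamping the day to the target month's length.
    year, month = value.year, value.month - months
    while month <= 0:
        month += 12
        year -= 1
    day = min(value.day, monthrange(year, month)[1])
    return value.replace(year=year, month=month, day=day)


def build_date_windows(start, end, window_days):
    # Split the inclusive [start, end] range into non-overlapping windows
    # of at most window_days days, with no gaps between windows.
    windows = []
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=window_days - 1), end)
        windows.append((current, window_end))
        current = window_end + timedelta(days=1)
    return windows


# Month-end clamping: one month before Mar 31 is Feb 29 in a leap year.
assert subtract_months(date(2024, 3, 31), 1) == date(2024, 2, 29)

windows = build_date_windows(date(2024, 1, 1), date(2024, 3, 15), 30)
assert windows[0] == (date(2024, 1, 1), date(2024, 1, 30))
assert windows[-1][1] == date(2024, 3, 15)  # final window is truncated at end
```

The day clamp is what makes `--months-back` safe on month-end dates; without it, `date.replace` would raise for days like Feb 31.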
def unique_receipts(receipts):
    by_barcode = {}
    for receipt in receipts:
        key = receipt_key(receipt)
        if key:
            by_barcode[key] = receipt
    return list(by_barcode.values())


def receipt_key(receipt):
    barcode = receipt.get("transactionBarcode", "")
    transaction_date_time = receipt.get("transactionDateTime", "")
    if not barcode:
        return ""
    return f"{barcode}::{transaction_date_time}"
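`unique_receipts` collapses receipts returned by multiple summary requests onto a `transactionBarcode::transactionDateTime` key, with the last occurrence winning and barcode-less rows dropped. A small self-contained sketch of that behavior:

```python
# Sketch of the receipt dedup above: unique per (barcode, datetime), last wins.
def receipt_key(receipt):
    barcode = receipt.get("transactionBarcode", "")
    if not barcode:
        return ""
    return f"{barcode}::{receipt.get('transactionDateTime', '')}"


def unique_receipts(receipts):
    by_key = {}
    for receipt in receipts:
        key = receipt_key(receipt)
        if key:  # receipts without a barcode are dropped entirely
            by_key[key] = receipt
    return list(by_key.values())


receipts = [
    {"transactionBarcode": "A", "transactionDateTime": "t1", "total": 1},
    {"transactionBarcode": "A", "transactionDateTime": "t1", "total": 2},  # duplicate
    {"transactionBarcode": "A", "transactionDateTime": "t2"},  # same barcode, new time
    {"transactionDateTime": "t3"},  # no barcode: dropped
]
unique = unique_receipts(receipts)
assert len(unique) == 2
assert unique[0]["total"] == 2  # the later duplicate replaced the first in place
```

Because dicts preserve insertion order, the surviving receipts keep the order in which their keys first appeared.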
def fetch_summary_windows(
    session,
    start_date,
    end_date,
    document_type,
    document_sub_type,
    window_days,
):
    requests_metadata = []
    combined_receipts = []

    for window in build_date_windows(start_date, end_date, window_days):
        variables = {
            "startDate": window["startDate"],
            "endDate": window["endDate"],
            "text": "custom",
            "documentType": document_type,
            "documentSubType": document_sub_type,
        }
        payload = graphql_post(session, SUMMARY_QUERY, variables)
        receipts = summary_receipts(payload)
        counts = summary_counts(payload)
        warehouse_count = sum(
            1 for receipt in receipts if receipt.get("receiptType") == "In-Warehouse"
        )
        mismatch = counts["inWarehouse"] != warehouse_count
        requests_metadata.append(
            {
                **variables,
                "returnedReceipts": len(receipts),
                "returnedInWarehouseReceipts": warehouse_count,
                "inWarehouse": counts["inWarehouse"],
                "gasStation": counts["gasStation"],
                "carWash": counts["carWash"],
                "gasAndCarWash": counts["gasAndCarWash"],
                "countMismatch": mismatch,
            }
        )
        if mismatch:
            click.echo(
                (
                    "warning: summary count mismatch for "
                    f"{window['startDate']} to {window['endDate']}: "
                    f"inWarehouse={counts['inWarehouse']} "
                    f"returnedInWarehouseReceipts={warehouse_count}"
                ),
                err=True,
            )
        combined_receipts.extend(receipts)

    unique = unique_receipts(combined_receipts)
    aggregate_payload = {
        "data": {
            "receiptsWithCounts": {
                "inWarehouse": sum(row["inWarehouse"] for row in requests_metadata),
                "gasStation": sum(row["gasStation"] for row in requests_metadata),
                "carWash": sum(row["carWash"] for row in requests_metadata),
                "gasAndCarWash": sum(row["gasAndCarWash"] for row in requests_metadata),
                "receipts": unique,
            }
        }
    }
    return aggregate_payload, requests_metadata


def flatten_costco_data(summary_payload, detail_payloads, raw_dir):
    summary_lookup = {
        receipt_key(receipt): receipt
        for receipt in summary_receipts(summary_payload)
        if receipt_key(receipt)
    }
    orders = []
    items = []

    for detail_payload in detail_payloads:
        for receipt in detail_receipts(detail_payload):
            order_id = receipt["transactionBarcode"]
            receipt_id = receipt_key(receipt)
            summary_row = summary_lookup.get(receipt_id, {})
            coupon_numbers = {
                row.get("upcnumberCoupon", "")
                for row in summary_row.get("couponArray", []) or []
                if row.get("upcnumberCoupon")
            }
            raw_order_path = raw_dir / f"{safe_filename(receipt_id or order_id)}.json"

            orders.append(
                {
                    "retailer": RETAILER,
                    "order_id": order_id,
                    "order_date": receipt.get("transactionDate", ""),
                    "delivery_date": receipt.get("transactionDate", ""),
                    "service_type": receipt.get("receiptType", ""),
                    "order_total": stringify(receipt.get("total")),
                    "payment_method": compact_join(
                        summary_row.get("tenderArray", []) or [], "tenderDescription"
                    ),
                    "total_item_count": stringify(receipt.get("totalItemCount")),
                    "total_savings": stringify(receipt.get("instantSavings")),
                    "your_savings_total": stringify(receipt.get("instantSavings")),
                    "coupons_discounts_total": stringify(receipt.get("instantSavings")),
                    "store_name": receipt.get("warehouseName", ""),
                    "store_number": stringify(receipt.get("warehouseNumber")),
                    "store_address1": receipt.get("warehouseAddress1", ""),
                    "store_city": receipt.get("warehouseCity", ""),
                    "store_state": receipt.get("warehouseState", ""),
                    "store_zipcode": receipt.get("warehousePostalCode", ""),
                    "refund_order": "false",
                    "ebt_order": "false",
                    "raw_history_path": (raw_dir / "summary.json").as_posix(),
                    "raw_order_path": raw_order_path.as_posix(),
                }
            )

            for line_no, item in enumerate(receipt.get("itemArray", []), start=1):
                item_number = stringify(item.get("itemNumber"))
                description = join_descriptions(
                    item.get("itemDescription01"), item.get("itemDescription02")
                )
                is_discount = is_discount_line(item)
                is_coupon = is_discount and (
                    item_number in coupon_numbers
                    or description.startswith("/")
                )

                items.append(
                    {
                        "retailer": RETAILER,
                        "order_id": order_id,
                        "line_no": str(line_no),
                        "order_date": receipt.get("transactionDate", ""),
                        "retailer_item_id": item_number,
                        "pod_id": "",
                        "item_name": description,
                        "upc": "",
                        "category_id": stringify(item.get("itemDepartmentNumber")),
                        "category": stringify(item.get("transDepartmentNumber")),
                        "qty": stringify(item.get("unit")),
                        "unit": stringify(item.get("itemIdentifier")),
                        "unit_price": stringify(item.get("itemUnitPriceAmount")),
                        "line_total": stringify(item.get("amount")),
                        "picked_weight": "",
                        "mvp_savings": "",
                        "reward_savings": "",
                        "coupon_savings": stringify(item.get("amount") if is_coupon else ""),
                        "coupon_price": "",
                        "image_url": "",
                        "raw_order_path": raw_order_path.as_posix(),
                        "is_discount_line": "true" if is_discount else "false",
                        "is_coupon_line": "true" if is_coupon else "false",
                    }
                )

    return orders, items


def join_descriptions(*parts):
    return " ".join(str(part).strip() for part in parts if part).strip()


def compact_join(rows, field):
    values = [str(row.get(field, "")).strip() for row in rows if row.get(field)]
    return " | ".join(values)


def is_discount_line(item):
    amount = item.get("amount")
    unit = item.get("unit")
    description = join_descriptions(
        item.get("itemDescription01"), item.get("itemDescription02")
    )
    try:
        amount_val = float(amount)
    except (TypeError, ValueError):
        amount_val = 0.0
    try:
        unit_val = float(unit)
    except (TypeError, ValueError):
        unit_val = 0.0
    return amount_val < 0 or unit_val < 0 or description.startswith("/")
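The discount heuristic above flags a line when its amount or unit is negative, or when its description starts with the `/` prefix the script treats as a coupon marker. A self-contained sketch of the same rule:

```python
# Sketch of the discount-line heuristic above: negative amount or unit,
# or a description starting with "/" (the script's coupon-line marker).
def is_discount_line(item):
    def as_float(value):
        # Mirror the source: unparseable or missing values count as 0.0.
        try:
            return float(value)
        except (TypeError, ValueError):
            return 0.0

    description = " ".join(
        str(part).strip()
        for part in (item.get("itemDescription01"), item.get("itemDescription02"))
        if part
    ).strip()
    return (
        as_float(item.get("amount")) < 0
        or as_float(item.get("unit")) < 0
        or description.startswith("/")
    )


assert is_discount_line({"amount": "-3.00", "unit": "1"})
assert is_discount_line({"amount": "0", "unit": "1", "itemDescription01": "/KS BATH"})
assert not is_discount_line({"amount": "9.99", "unit": "2", "itemDescription01": "KS BATH"})
```

Note that an item with no `amount` at all is not treated as a discount: the `as_float` fallback of `0.0` fails the `< 0` test.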
def stringify(value):
    if value is None:
        return ""
    return str(value)


def write_json(path, payload):
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def write_csv(path, rows, fieldnames):
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


@click.command()
@click.option(
    "--outdir",
    default="costco_output",
    show_default=True,
    help="Output directory for Costco raw and flattened files.",
)
@click.option(
    "--document-type",
    default="all",
    show_default=True,
    help="Summary document type.",
)
@click.option(
    "--document-sub-type",
    default="all",
    show_default=True,
    help="Summary document sub type.",
)
@click.option(
    "--window-days",
    default=92,
    show_default=True,
    type=int,
    help="Maximum number of days to request per summary window.",
)
@click.option(
    "--months-back",
    default=36,
    show_default=True,
    type=int,
    help="How many months of receipts to enumerate back from today.",
)
@click.option(
    "--firefox-profile-dir",
    default=None,
    help="Firefox profile directory to use for cookies and session storage.",
)
def main(
    outdir,
    document_type,
    document_sub_type,
    window_days,
    months_back,
    firefox_profile_dir,
):
    click.echo("legacy entrypoint: prefer collect_costco_web.py for data-model outputs")
    run_collection(
        outdir=outdir,
        document_type=document_type,
        document_sub_type=document_sub_type,
        window_days=window_days,
        months_back=months_back,
        firefox_profile_dir=firefox_profile_dir,
    )


def run_collection(
    outdir,
    document_type,
    document_sub_type,
    window_days,
    months_back,
    firefox_profile_dir,
    orders_filename="orders.csv",
    items_filename="items.csv",
):
    outdir = Path(outdir)
    raw_dir = outdir / "raw"
    config = load_config()

    profile_dir = Path(firefox_profile_dir) if firefox_profile_dir else None
    if profile_dir is None:
        try:
            profile_dir = find_firefox_profile_dir()
        except Exception:
            profile_dir = click.prompt(
                "Firefox profile dir",
                type=click.Path(exists=True, file_okay=False, path_type=Path),
            )

    auth_headers = load_costco_browser_headers(
        profile_dir,
        authorization=config["authorization"],
        client_id=config["client_id"],
        client_identifier=config["client_identifier"],
    )
    session = build_session(profile_dir, auth_headers)
    click.echo(
        "session bootstrap: "
        f"cookies={True} "
        f"authorization={bool(auth_headers.get('costco-x-authorization'))} "
        f"client_id={bool(auth_headers.get('costco-x-wcs-clientId'))} "
        f"client_identifier={bool(auth_headers.get('client-identifier'))}"
    )

    start_date, end_date = resolve_date_range(months_back)

    summary_payload, request_metadata = fetch_summary_windows(
        session,
        start_date,
        end_date,
        document_type,
        document_sub_type,
        window_days,
    )
    write_json(raw_dir / "summary.json", summary_payload)
    write_json(raw_dir / "summary_requests.json", request_metadata)
    receipts = summary_receipts(summary_payload)

    detail_payloads = []
    for receipt in receipts:
        barcode = receipt["transactionBarcode"]
        receipt_id = receipt_key(receipt) or barcode
        click.echo(f"fetching {barcode}")
        detail_payload = graphql_post(
            session,
            DETAIL_QUERY,
            {"barcode": barcode, "documentType": "warehouse"},
        )
        detail_payloads.append(detail_payload)
        write_json(raw_dir / f"{safe_filename(receipt_id)}.json", detail_payload)

    orders, items = flatten_costco_data(summary_payload, detail_payloads, raw_dir)
    write_csv(outdir / orders_filename, orders, ORDER_FIELDS)
    write_csv(outdir / items_filename, items, ITEM_FIELDS)
    click.echo(f"wrote {len(orders)} orders and {len(items)} item rows to {outdir}")


if __name__ == "__main__":
    main()
scrape_giant.py (367 lines)
@@ -1,367 +0,0 @@
import csv
import json
import os
import time
from pathlib import Path

import click
from curl_cffi import requests

try:
    from dotenv import load_dotenv
except ImportError:  # python-dotenv is optional; load_config() checks for None
    load_dotenv = None

from browser_session import find_firefox_profile_dir, load_firefox_cookies


BASE = "https://giantfood.com"
ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
RETAILER = "giant"

ORDER_FIELDS = [
    "retailer",
    "order_id",
    "order_date",
    "delivery_date",
    "service_type",
    "order_total",
    "payment_method",
    "total_item_count",
    "total_savings",
    "your_savings_total",
    "coupons_discounts_total",
    "store_name",
    "store_number",
    "store_address1",
    "store_city",
    "store_state",
    "store_zipcode",
    "refund_order",
    "ebt_order",
    "raw_history_path",
    "raw_order_path",
]

ITEM_FIELDS = [
    "retailer",
    "order_id",
    "order_date",
    "line_no",
    "retailer_item_id",
    "pod_id",
    "item_name",
    "upc",
    "category_id",
    "category",
    "qty",
    "unit",
    "unit_price",
    "line_total",
    "picked_weight",
    "mvp_savings",
    "reward_savings",
    "coupon_savings",
    "coupon_price",
    "image_url",
    "raw_order_path",
    "is_discount_line",
    "is_coupon_line",
]


def load_config():
    if load_dotenv is not None:
        load_dotenv()

    return {
        "user_id": os.getenv("GIANT_USER_ID", "").strip(),
        "loyalty": os.getenv("GIANT_LOYALTY_NUMBER", "").strip(),
    }
def build_session():
    profile_dir = find_firefox_profile_dir()
    session = requests.Session()
    session.cookies.update(load_firefox_cookies("giantfood.com", profile_dir))
    session.headers.update(
        {
            "user-agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) "
                "Gecko/20100101 Firefox/148.0"
            ),
            "accept": "application/json, text/plain, */*",
            "accept-language": "en-US,en;q=0.9",
            "referer": ACCOUNT_PAGE,
        }
    )
    return session


def safe_get(session, url, **kwargs):
    last_response = None

    for attempt in range(3):
        try:
            response = session.get(
                url,
                impersonate="firefox",
                timeout=30,
                **kwargs,
            )
            last_response = response

            if response.status_code == 200:
                return response

            click.echo(f"retry {attempt + 1}/3 status={response.status_code}")
        except Exception as exc:  # pragma: no cover - network error path
            click.echo(f"retry {attempt + 1}/3 error={exc}")

        time.sleep(3)

    if last_response is not None:
        last_response.raise_for_status()

    raise RuntimeError(f"failed to fetch {url}")


def get_history(session, user_id, loyalty):
    response = safe_get(
        session,
        f"{BASE}/api/v6.0/user/{user_id}/order/history",
        params={"filter": "instore", "loyaltyNumber": loyalty},
    )
    return response.json()


def get_order_detail(session, user_id, order_id):
    response = safe_get(
        session,
        f"{BASE}/api/v6.0/user/{user_id}/order/history/detail/{order_id}",
        params={"isInStore": "true"},
    )
    return response.json()


def flatten_orders(history, details, history_path=None, raw_dir=None):
    orders = []
    items = []
    history_lookup = {record["orderId"]: record for record in history.get("records", [])}
    history_path_value = history_path.as_posix() if history_path else ""

    for detail in details:
        order_id = str(detail["orderId"])
        history_row = history_lookup.get(detail["orderId"], {})
        pickup = detail.get("pup", {})
        raw_order_path = (raw_dir / f"{order_id}.json").as_posix() if raw_dir else ""

        orders.append(
            {
                "retailer": RETAILER,
                "order_id": order_id,
                "order_date": detail.get("orderDate"),
                "delivery_date": detail.get("deliveryDate"),
                "service_type": history_row.get("serviceType"),
                "order_total": detail.get("orderTotal"),
                "payment_method": detail.get("paymentMethod"),
                "total_item_count": detail.get("totalItemCount"),
                "total_savings": detail.get("totalSavings"),
                "your_savings_total": detail.get("yourSavingsTotal"),
                "coupons_discounts_total": detail.get("couponsDiscountsTotal"),
                "store_name": pickup.get("storeName"),
                "store_number": pickup.get("aholdStoreNumber"),
                "store_address1": pickup.get("storeAddress1"),
                "store_city": pickup.get("storeCity"),
                "store_state": pickup.get("storeState"),
                "store_zipcode": pickup.get("storeZipcode"),
                "refund_order": detail.get("refundOrder"),
                "ebt_order": detail.get("ebtOrder"),
                "raw_history_path": history_path_value,
                "raw_order_path": raw_order_path,
            }
        )

        for line_no, item in enumerate(detail.get("items", []), start=1):
            items.append(
                {
                    "retailer": RETAILER,
                    "order_id": order_id,
                    "order_date": detail.get("orderDate"),
                    "line_no": str(line_no),
                    "retailer_item_id": "",
                    "pod_id": item.get("podId"),
                    "item_name": item.get("itemName"),
                    "upc": item.get("primUpcCd"),
                    "category_id": item.get("categoryId"),
                    "category": item.get("categoryDesc"),
                    "qty": item.get("shipQy"),
                    "unit": item.get("lbEachCd"),
                    "unit_price": item.get("unitPrice"),
                    "line_total": item.get("groceryAmount"),
                    "picked_weight": item.get("totalPickedWeight"),
                    "mvp_savings": item.get("mvpSavings"),
                    "reward_savings": item.get("rewardSavings"),
                    "coupon_savings": item.get("couponSavings"),
                    "coupon_price": item.get("couponPrice"),
                    "image_url": "",
                    "raw_order_path": raw_order_path,
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                }
            )

    return orders, items


def normalize_row(row, fieldnames):
    return {field: stringify(row.get(field)) for field in fieldnames}


def stringify(value):
    if value is None:
        return ""
    return str(value)


def read_csv_rows(path):
    if not path.exists():
        return [], []

    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        return fieldnames, list(reader)


def read_existing_order_ids(path):
    _, rows = read_csv_rows(path)
    return {row["order_id"] for row in rows if row.get("order_id")}


def merge_rows(existing_rows, new_rows, subset):
    merged = []
    row_index = {}

    for row in existing_rows + new_rows:
        key = tuple(stringify(row.get(field)) for field in subset)
        normalized = dict(row)
        if key in row_index:
            merged[row_index[key]] = normalized
        else:
            row_index[key] = len(merged)
            merged.append(normalized)

    return merged
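`merge_rows` is what makes repeated runs idempotent: rows are keyed by the `subset` columns, and a new row with an existing key overwrites the old one in place, so file ordering is stable across re-scrapes. A standalone sketch of that merge:

```python
# Sketch of the merge above: key rows on the `subset` columns; a new row
# with an existing key replaces the old one in place, preserving order.
def stringify(value):
    return "" if value is None else str(value)


def merge_rows(existing_rows, new_rows, subset):
    merged = []
    row_index = {}
    for row in existing_rows + new_rows:
        key = tuple(stringify(row.get(field)) for field in subset)
        if key in row_index:
            merged[row_index[key]] = dict(row)  # overwrite, keep position
        else:
            row_index[key] = len(merged)
            merged.append(dict(row))
    return merged


existing = [{"order_id": "1", "total": "10"}, {"order_id": "2", "total": "20"}]
new = [{"order_id": "2", "total": "21"}, {"order_id": "3", "total": "30"}]
merged = merge_rows(existing, new, subset=["order_id"])
assert [row["order_id"] for row in merged] == ["1", "2", "3"]
assert merged[1]["total"] == "21"  # order 2 updated in place, position kept
```

For items the subset is `["order_id", "line_no"]`, so a re-fetched order replaces its own lines rather than appending duplicates.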
def append_dedup(path, new_rows, subset, fieldnames):
    existing_fieldnames, existing_rows = read_csv_rows(path)
    all_fieldnames = list(dict.fromkeys(existing_fieldnames + fieldnames))

    merged = merge_rows(
        [normalize_row(row, all_fieldnames) for row in existing_rows],
        [normalize_row(row, all_fieldnames) for row in new_rows],
        subset=subset,
    )

    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=all_fieldnames)
        writer.writeheader()
        writer.writerows(merged)

    return merged


def write_json(path, payload):
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


@click.command()
@click.option("--user-id", default=None, help="Giant user id.")
@click.option("--loyalty", default=None, help="Giant loyalty number.")
@click.option(
    "--outdir",
    default="giant_output",
    show_default=True,
    help="Directory for raw json and csv outputs.",
)
@click.option(
    "--sleep-seconds",
    default=1.5,
    show_default=True,
    type=float,
    help="Delay between order detail requests.",
)
def main(user_id, loyalty, outdir, sleep_seconds):
    click.echo("legacy entrypoint: prefer collect_giant_web.py for data-model outputs")
    run_collection(user_id, loyalty, outdir, sleep_seconds)


def run_collection(
    user_id,
    loyalty,
    outdir,
    sleep_seconds,
    orders_filename="orders.csv",
    items_filename="items.csv",
):
    config = load_config()
    user_id = user_id or config["user_id"] or click.prompt("Giant user id", type=str)
    loyalty = loyalty or config["loyalty"] or click.prompt(
        "Giant loyalty number", type=str
    )

    outdir = Path(outdir)
    rawdir = outdir / "raw"
    rawdir.mkdir(parents=True, exist_ok=True)

    orders_csv = outdir / orders_filename
    items_csv = outdir / items_filename
    existing_order_ids = read_existing_order_ids(orders_csv)

    session = build_session()
    history = get_history(session, user_id, loyalty)
    history_path = rawdir / "history.json"
    write_json(history_path, history)

    records = history.get("records", [])
    click.echo(f"history returned {len(records)} visits; Giant exposes only the most recent 50")

    unseen_records = [
        record
        for record in records
        if stringify(record.get("orderId")) not in existing_order_ids
    ]
    click.echo(
        f"found {len(unseen_records)} unseen visits "
        f"({len(existing_order_ids)} already stored)"
    )

    details = []
    for index, record in enumerate(unseen_records, start=1):
        order_id = stringify(record.get("orderId"))
        click.echo(f"[{index}/{len(unseen_records)}] fetching {order_id}")
        detail = get_order_detail(session, user_id, order_id)
        write_json(rawdir / f"{order_id}.json", detail)
        details.append(detail)
        if index < len(unseen_records):
            time.sleep(sleep_seconds)

    orders, items = flatten_orders(history, details, history_path=history_path, raw_dir=rawdir)
    merged_orders = append_dedup(
        orders_csv,
        orders,
        subset=["order_id"],
        fieldnames=ORDER_FIELDS,
    )
    merged_items = append_dedup(
        items_csv,
        items,
        subset=["order_id", "line_no"],
        fieldnames=ITEM_FIELDS,
    )
    click.echo(
        f"wrote {len(orders)} new orders / {len(items)} new items "
        f"({len(merged_orders)} total orders, {len(merged_items)} total items)"
    )


if __name__ == "__main__":
    main()
scraper.py (180 lines)
@@ -1,5 +1,181 @@
-from scrape_giant import *  # noqa: F401,F403

import json
import time
from pathlib import Path

import browser_cookie3
import pandas as pd
from curl_cffi import requests


BASE = "https://giantfood.com"
ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"

USER_ID = "369513017"
LOYALTY = "440155630880"


def build_session():
    s = requests.Session()
    s.cookies.update(browser_cookie3.firefox(domain_name="giantfood.com"))
    s.headers.update({
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
        "accept": "application/json, text/plain, */*",
        "accept-language": "en-US,en;q=0.9",
        "referer": ACCOUNT_PAGE,
    })
    return s


def safe_get(session, url, **kwargs):
    last_response = None

    for attempt in range(3):
        try:
            r = session.get(
                url,
                impersonate="firefox",
                timeout=30,
                **kwargs,
            )
            last_response = r

            if r.status_code == 200:
                return r

            print(f"retry {attempt + 1}/3 status={r.status_code}")
        except Exception as e:
            print(f"retry {attempt + 1}/3 error={e}")

        time.sleep(3)

    if last_response is not None:
        last_response.raise_for_status()

    raise RuntimeError(f"failed to fetch {url}")


def get_history(session):
    url = f"{BASE}/api/v6.0/user/{USER_ID}/order/history"
    r = safe_get(
        session,
        url,
        params={
            "filter": "instore",
            "loyaltyNumber": LOYALTY,
        },
    )
    return r.json()


def get_order_detail(session, order_id):
    url = f"{BASE}/api/v6.0/user/{USER_ID}/order/history/detail/{order_id}"
    r = safe_get(
        session,
        url,
        params={"isInStore": "true"},
    )
    return r.json()


def flatten_orders(history, details):
    orders = []
    items = []

    history_lookup = {
        r["orderId"]: r
        for r in history.get("records", [])
    }

    for d in details:
        hist = history_lookup.get(d["orderId"], {})
        pup = d.get("pup", {})

        orders.append({
            "order_id": d["orderId"],
            "order_date": d.get("orderDate"),
            "delivery_date": d.get("deliveryDate"),
            "service_type": hist.get("serviceType"),
            "order_total": d.get("orderTotal"),
            "payment_method": d.get("paymentMethod"),
            "total_item_count": d.get("totalItemCount"),
            "total_savings": d.get("totalSavings"),
            "your_savings_total": d.get("yourSavingsTotal"),
            "coupons_discounts_total": d.get("couponsDiscountsTotal"),
            "store_name": pup.get("storeName"),
            "store_number": pup.get("aholdStoreNumber"),
            "store_address1": pup.get("storeAddress1"),
            "store_city": pup.get("storeCity"),
            "store_state": pup.get("storeState"),
            "store_zipcode": pup.get("storeZipcode"),
            "refund_order": d.get("refundOrder"),
            "ebt_order": d.get("ebtOrder"),
        })

        for i, item in enumerate(d.get("items", []), start=1):
            items.append({
                "order_id": d["orderId"],
                "order_date": d.get("orderDate"),
                "line_no": i,
                "pod_id": item.get("podId"),
                "item_name": item.get("itemName"),
                "upc": item.get("primUpcCd"),
                "category_id": item.get("categoryId"),
                "category": item.get("categoryDesc"),
                "qty": item.get("shipQy"),
                "unit": item.get("lbEachCd"),
                "unit_price": item.get("unitPrice"),
                "line_total": item.get("groceryAmount"),
                "picked_weight": item.get("totalPickedWeight"),
                "mvp_savings": item.get("mvpSavings"),
                "reward_savings": item.get("rewardSavings"),
                "coupon_savings": item.get("couponSavings"),
                "coupon_price": item.get("couponPrice"),
            })

    return pd.DataFrame(orders), pd.DataFrame(items)


def main():
    outdir = Path("giant_output")
    rawdir = outdir / "raw"
    rawdir.mkdir(parents=True, exist_ok=True)

    session = build_session()

    print("fetching order history...")
    history = get_history(session)

    (rawdir / "history.json").write_text(
        json.dumps(history, indent=2),
        encoding="utf-8",
    )

    order_ids = [r["orderId"] for r in history.get("records", [])]
    print(f"{len(order_ids)} orders found")

    details = []
    for order_id in order_ids:
        print(f"fetching {order_id}")
        d = get_order_detail(session, order_id)
        details.append(d)

        (rawdir / f"{order_id}.json").write_text(
            json.dumps(d, indent=2),
            encoding="utf-8",
        )

        time.sleep(1.5)

    print("flattening data...")
    orders_df, items_df = flatten_orders(history, details)

    orders_df.to_csv(outdir / "orders.csv", index=False)
    items_df.to_csv(outdir / "items.csv", index=False)

    print("done")
    print(f"{len(orders_df)} orders written to {outdir / 'orders.csv'}")
    print(f"{len(items_df)} items written to {outdir / 'items.csv'}")


if __name__ == "__main__":
    main()
@@ -1,17 +1,28 @@
-import requests
-import browser_cookie3
-
-BASE = "https://giantfood.com"
-ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
-USER_ID = "369513017"
-LOYALTY = "440155630880"
-
-cj = browser_cookie3.firefox(domain_name="giantfood.com")
-
-s = requests.Session()
-s.cookies.update(cj)
-s.headers.update({
-    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
-    "accept": "application/json, text/plain, */*",
-    "accept-language": "en-US,en;q=0.9",
-    "referer": ACCOUNT_PAGE,
-})
-
-r = s.get(
-    f"{BASE}/api/v6.0/user/{USER_ID}/order/history",
-    params={"filter": "instore", "loyaltyNumber": LOYALTY},
-    timeout=30,
-)
-
-print(r.status_code)
-print(r.text[:500])
+import unittest
+
+try:
+    import browser_cookie3  # noqa: F401
+    import requests  # noqa: F401
+except ImportError as exc:  # pragma: no cover - dependency-gated smoke test
+    browser_cookie3 = None
+    _IMPORT_ERROR = exc
+else:
+    _IMPORT_ERROR = None
+
+
+@unittest.skipIf(browser_cookie3 is None, f"optional smoke test dependency missing: {_IMPORT_ERROR}")
+class BrowserCookieSmokeTest(unittest.TestCase):
+    def test_dependencies_available(self):
+        self.assertIsNotNone(browser_cookie3)
@@ -1,17 +1,27 @@
-import browser_cookie3
-from curl_cffi import requests
-
-BASE = "https://giantfood.com"
-ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
-USER_ID = "369513017"
-LOYALTY = "440155630880"
-
-s = requests.Session()
-s.cookies.update(browser_cookie3.firefox(domain_name="giantfood.com"))
-s.headers.update({
-    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
-    "accept": "application/json, text/plain, */*",
-    "accept-language": "en-US,en;q=0.9",
-    "referer": ACCOUNT_PAGE,
-})
-
-r = s.get(
-    f"{BASE}/api/v6.0/user/{USER_ID}/order/history",
-    params={"filter": "instore", "loyaltyNumber": LOYALTY},
-    impersonate="firefox",
-    timeout=30,
-)
-
-print(r.status_code)
-print(r.text[:500])
+import unittest
+
+try:
+    import browser_cookie3  # noqa: F401
+    from curl_cffi import requests  # noqa: F401
+except ImportError as exc:  # pragma: no cover - dependency-gated smoke test
+    browser_cookie3 = None
+    _IMPORT_ERROR = exc
+else:
+    _IMPORT_ERROR = None
+
+
+@unittest.skipIf(browser_cookie3 is None, f"optional smoke test dependency missing: {_IMPORT_ERROR}")
+class CurlCffiSmokeTest(unittest.TestCase):
+    def test_dependencies_available(self):
+        self.assertIsNotNone(browser_cookie3)
@@ -1,155 +0,0 @@
import sqlite3
import tempfile
import unittest
from pathlib import Path
from unittest import mock

import browser_session
import scrape_costco


class BrowserSessionTests(unittest.TestCase):
    def test_read_firefox_local_storage_reads_copied_sqlite(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            profile_dir = Path(tmpdir) / "abcd.default-release"
            ls_dir = profile_dir / "storage" / "default" / "https+++www.costco.com" / "ls"
            ls_dir.mkdir(parents=True)
            db_path = ls_dir / "data.sqlite"

            with sqlite3.connect(db_path) as connection:
                connection.execute("CREATE TABLE data (key TEXT, value TEXT)")
                connection.execute(
                    "INSERT INTO data (key, value) VALUES (?, ?)",
                    ("costco-x-wcs-clientId", "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf"),
                )

            values = browser_session.read_firefox_local_storage(
                profile_dir,
                origin_filter="costco.com",
            )

            self.assertEqual(
                "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
                values["costco-x-wcs-clientId"],
            )

    def test_load_costco_browser_headers_reads_id_token_and_client_id(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            profile_dir = Path(tmpdir)
            storage_dir = profile_dir / "storage" / "default" / "https+++www.costco.com" / "ls"
            storage_dir.mkdir(parents=True)
            db_path = storage_dir / "data.sqlite"

            with sqlite3.connect(db_path) as connection:
                connection.execute("CREATE TABLE data (key TEXT, value TEXT)")
                connection.execute(
                    "INSERT INTO data (key, value) VALUES (?, ?)",
                    ("idToken", "header.payload.signature"),
                )
                connection.execute(
                    "INSERT INTO data (key, value) VALUES (?, ?)",
                    ("clientID", "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf"),
                )

            headers = scrape_costco.load_costco_browser_headers(
                profile_dir,
                authorization="",
                client_id="",
                client_identifier="481b1aec-aa3b-454b-b81b-48187e28f205",
            )

            self.assertEqual("Bearer header.payload.signature", headers["costco-x-authorization"])
            self.assertEqual(
                "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
                headers["costco-x-wcs-clientId"],
            )
            self.assertEqual(
                "481b1aec-aa3b-454b-b81b-48187e28f205",
                headers["client-identifier"],
            )

    def test_load_costco_browser_headers_prefers_env_values(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            profile_dir = Path(tmpdir)
            storage_dir = profile_dir / "storage" / "default" / "https+++www.costco.com" / "ls"
            storage_dir.mkdir(parents=True)
            db_path = storage_dir / "data.sqlite"

            with sqlite3.connect(db_path) as connection:
                connection.execute("CREATE TABLE data (key TEXT, value TEXT)")
                connection.execute(
                    "INSERT INTO data (key, value) VALUES (?, ?)",
                    ("idToken", "storage.payload.signature"),
                )
                connection.execute(
                    "INSERT INTO data (key, value) VALUES (?, ?)",
                    ("clientID", "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf"),
                )

            headers = scrape_costco.load_costco_browser_headers(
                profile_dir,
                authorization="Bearer env.payload.signature",
                client_id="env-client-id",
                client_identifier="481b1aec-aa3b-454b-b81b-48187e28f205",
            )

            self.assertEqual("Bearer env.payload.signature", headers["costco-x-authorization"])
            self.assertEqual("env-client-id", headers["costco-x-wcs-clientId"])

    def test_scrape_costco_prompts_for_profile_dir_when_autodiscovery_fails(self):
        with mock.patch.object(
            scrape_costco,
            "find_firefox_profile_dir",
            side_effect=FileNotFoundError("no default profile"),
        ), mock.patch.object(
            scrape_costco.click,
            "prompt",
            return_value=Path("/tmp/profile"),
        ) as mocked_prompt, mock.patch.object(
            scrape_costco,
            "load_config",
            return_value={
                "authorization": "",
                "client_id": "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
                "client_identifier": "481b1aec-aa3b-454b-b81b-48187e28f205",
            },
        ), mock.patch.object(
            scrape_costco,
            "load_costco_browser_headers",
            return_value={
                "costco-x-authorization": "Bearer header.payload.signature",
                "costco-x-wcs-clientId": "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
                "client-identifier": "481b1aec-aa3b-454b-b81b-48187e28f205",
            },
        ), mock.patch.object(
            scrape_costco,
            "build_session",
            return_value=object(),
        ), mock.patch.object(
            scrape_costco,
            "fetch_summary_windows",
            return_value=(
                {"data": {"receiptsWithCounts": {"receipts": []}}},
                [],
            ),
        ), mock.patch.object(
            scrape_costco,
            "write_json",
        ), mock.patch.object(
            scrape_costco,
            "write_csv",
        ):
            scrape_costco.main.callback(
                outdir="/tmp/costco_output",
                document_type="all",
                document_sub_type="all",
                window_days=92,
                months_back=3,
                firefox_profile_dir=None,
            )

        mocked_prompt.assert_called_once()


if __name__ == "__main__":
    unittest.main()
@@ -1,119 +0,0 @@
import unittest

import build_canonical_layer


class CanonicalLayerTests(unittest.TestCase):
    def test_build_canonical_layer_auto_links_exact_upc_and_name_size_only(self):
        observed_rows = [
            {
                "observed_product_id": "gobs_1",
                "representative_upc": "111",
                "representative_retailer_item_id": "11",
                "representative_name_norm": "GALA APPLE",
                "representative_brand": "SB",
                "representative_variant": "",
                "representative_size_value": "5",
                "representative_size_unit": "lb",
                "representative_pack_qty": "",
                "representative_measure_type": "weight",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_2",
                "representative_upc": "111",
                "representative_retailer_item_id": "12",
                "representative_name_norm": "LARGE WHITE EGGS",
                "representative_brand": "SB",
                "representative_variant": "",
                "representative_size_value": "",
                "representative_size_unit": "",
                "representative_pack_qty": "18",
                "representative_measure_type": "count",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_3",
                "representative_upc": "",
                "representative_retailer_item_id": "21",
                "representative_name_norm": "ROTINI",
                "representative_brand": "",
                "representative_variant": "",
                "representative_size_value": "16",
                "representative_size_unit": "oz",
                "representative_pack_qty": "",
                "representative_measure_type": "weight",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_4",
                "representative_upc": "",
                "representative_retailer_item_id": "22",
                "representative_name_norm": "ROTINI",
                "representative_brand": "SB",
                "representative_variant": "",
                "representative_size_value": "16",
                "representative_size_unit": "oz",
                "representative_pack_qty": "",
                "representative_measure_type": "weight",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_5",
                "representative_upc": "",
                "representative_retailer_item_id": "99",
                "representative_name_norm": "GL BAG CHARGE",
                "representative_brand": "",
                "representative_variant": "",
                "representative_size_value": "",
                "representative_size_unit": "",
                "representative_pack_qty": "",
                "representative_measure_type": "each",
                "is_fee": "true",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
            {
                "observed_product_id": "gobs_6",
                "representative_upc": "",
                "representative_retailer_item_id": "",
                "representative_name_norm": "LIME",
                "representative_brand": "",
                "representative_variant": "",
                "representative_size_value": "",
                "representative_size_unit": "",
                "representative_pack_qty": "",
                "representative_measure_type": "each",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            },
        ]

        canonicals, links = build_canonical_layer.build_canonical_layer(observed_rows)

        self.assertEqual(2, len(canonicals))
        self.assertEqual(4, len(links))
        methods = {row["observed_product_id"]: row["link_method"] for row in links}
        self.assertEqual("exact_upc", methods["gobs_1"])
        self.assertEqual("exact_upc", methods["gobs_2"])
        self.assertEqual("exact_name_size", methods["gobs_3"])
        self.assertEqual("exact_name_size", methods["gobs_4"])
        self.assertNotIn("gobs_5", methods)
        self.assertNotIn("gobs_6", methods)

    def test_clean_canonical_name_removes_packaging_noise(self):
        self.assertEqual("LIME", build_canonical_layer.clean_canonical_name("LIME . / ."))
        self.assertEqual("EGG", build_canonical_layer.clean_canonical_name("5DZ EGG / /"))


if __name__ == "__main__":
    unittest.main()
@@ -1,511 +0,0 @@
import csv
import json
import tempfile
import unittest
from pathlib import Path
from unittest import mock

import enrich_costco
import scrape_costco
import validate_cross_retailer_flow


class CostcoPipelineTests(unittest.TestCase):
    def test_resolve_date_range_uses_months_back(self):
        start_date, end_date = scrape_costco.resolve_date_range(
            3, today=scrape_costco.parse_cli_date("3/16/2026")
        )

        self.assertEqual("12/16/2025", start_date)
        self.assertEqual("3/16/2026", end_date)

    def test_build_date_windows_splits_long_ranges(self):
        windows = scrape_costco.build_date_windows("1/01/2026", "6/30/2026", 92)

        self.assertEqual(
            [
                {"startDate": "1/01/2026", "endDate": "4/02/2026"},
                {"startDate": "4/03/2026", "endDate": "6/30/2026"},
            ],
            windows,
        )

    def test_fetch_summary_windows_records_metadata_and_warns_on_mismatch(self):
        payloads = [
            {
                "data": {
                    "receiptsWithCounts": {
                        "inWarehouse": 2,
                        "gasStation": 0,
                        "carWash": 0,
                        "gasAndCarWash": 0,
                        "receipts": [
                            {
                                "transactionBarcode": "abc",
                                "receiptType": "In-Warehouse",
                            }
                        ],
                    }
                }
            },
            {
                "data": {
                    "receiptsWithCounts": {
                        "inWarehouse": 1,
                        "gasStation": 0,
                        "carWash": 0,
                        "gasAndCarWash": 0,
                        "receipts": [
                            {
                                "transactionBarcode": "def",
                                "receiptType": "In-Warehouse",
                            }
                        ],
                    }
                }
            },
        ]

        with mock.patch.object(
            scrape_costco, "graphql_post", side_effect=payloads
        ) as mocked_post, mock.patch.object(scrape_costco.click, "echo") as mocked_echo:
            summary_payload, metadata = scrape_costco.fetch_summary_windows(
                session=object(),
                start_date="1/01/2026",
                end_date="6/30/2026",
                document_type="all",
                document_sub_type="all",
                window_days=92,
            )

        self.assertEqual(2, mocked_post.call_count)
        self.assertEqual(2, len(metadata))
        self.assertTrue(metadata[0]["countMismatch"])
        self.assertFalse(metadata[1]["countMismatch"])
        self.assertEqual("1/01/2026", metadata[0]["startDate"])
        self.assertEqual("4/03/2026", metadata[1]["startDate"])
        self.assertEqual(
            ["abc", "def"],
            [
                row["transactionBarcode"]
                for row in scrape_costco.summary_receipts(summary_payload)
            ],
        )
        mocked_echo.assert_called_once()
        warning_text = mocked_echo.call_args.args[0]
        self.assertIn("warning: summary count mismatch", warning_text)

    def test_flatten_costco_data_preserves_discount_rows(self):
        summary_payload = {
            "data": {
                "receiptsWithCounts": {
                    "receipts": [
                        {
                            "transactionBarcode": "abc",
                            "tenderArray": [{"tenderDescription": "VISA"}],
                            "couponArray": [{"upcnumberCoupon": "2100003746641"}],
                        }
                    ]
                }
            }
        }
        detail_payloads = [
            {
                "data": {
                    "receiptsWithCounts": {
                        "receipts": [
                            {
                                "transactionBarcode": "abc",
                                "transactionDate": "2026-03-12",
                                "receiptType": "In-Warehouse",
                                "total": 10.0,
                                "totalItemCount": 2,
                                "instantSavings": 5.0,
                                "warehouseName": "MT VERNON",
                                "warehouseNumber": 1115,
                                "warehouseAddress1": "7940 RICHMOND HWY",
                                "warehouseCity": "ALEXANDRIA",
                                "warehouseState": "VA",
                                "warehousePostalCode": "22306",
                                "itemArray": [
                                    {
                                        "itemNumber": "4873222",
                                        "itemDescription01": "ALL F&C",
                                        "itemDescription02": "200OZ 160LOADS P104",
                                        "itemDepartmentNumber": 14,
                                        "transDepartmentNumber": 14,
                                        "unit": 1,
                                        "itemIdentifier": "E",
                                        "amount": 19.99,
                                        "itemUnitPriceAmount": 19.99,
                                    },
                                    {
                                        "itemNumber": "374664",
                                        "itemDescription01": "/ 4873222",
                                        "itemDescription02": None,
                                        "itemDepartmentNumber": 14,
                                        "transDepartmentNumber": 14,
                                        "unit": -1,
                                        "itemIdentifier": None,
                                        "amount": -5,
                                        "itemUnitPriceAmount": 0,
                                    },
                                ],
                            }
                        ]
                    }
                }
            }
        ]

        orders, items = scrape_costco.flatten_costco_data(
            summary_payload, detail_payloads, Path("costco_output/raw")
        )

        self.assertEqual(1, len(orders))
        self.assertEqual(2, len(items))
        self.assertEqual("false", items[0]["is_discount_line"])
        self.assertEqual("true", items[1]["is_discount_line"])
        self.assertEqual("true", items[1]["is_coupon_line"])

    def test_flatten_costco_data_uses_composite_summary_lookup_key(self):
        summary_payload = {
            "data": {
                "receiptsWithCounts": {
                    "receipts": [
                        {
                            "transactionBarcode": "dup",
                            "transactionDateTime": "2026-03-12T16:16:00",
                            "tenderArray": [{"tenderDescription": "VISA"}],
                            "couponArray": [{"upcnumberCoupon": "111"}],
                        },
                        {
                            "transactionBarcode": "dup",
                            "transactionDateTime": "2026-02-14T16:25:00",
                            "tenderArray": [{"tenderDescription": "MASTERCARD"}],
                            "couponArray": [],
                        },
                    ]
                }
            }
        }
        detail_payloads = [
            {
                "data": {
                    "receiptsWithCounts": {
                        "receipts": [
                            {
                                "transactionBarcode": "dup",
                                "transactionDateTime": "2026-03-12T16:16:00",
                                "transactionDate": "2026-03-12",
                                "receiptType": "In-Warehouse",
                                "total": 10.0,
                                "totalItemCount": 1,
                                "instantSavings": 5.0,
                                "warehouseName": "MT VERNON",
                                "warehouseNumber": 1115,
                                "warehouseAddress1": "7940 RICHMOND HWY",
                                "warehouseCity": "ALEXANDRIA",
                                "warehouseState": "VA",
                                "warehousePostalCode": "22306",
                                "itemArray": [
                                    {
                                        "itemNumber": "111",
                                        "itemDescription01": "/ 111",
                                        "itemDescription02": None,
                                        "itemDepartmentNumber": 14,
                                        "transDepartmentNumber": 14,
                                        "unit": -1,
                                        "itemIdentifier": None,
                                        "amount": -5,
                                        "itemUnitPriceAmount": 0,
                                    }
                                ],
                            }
                        ]
                    }
                }
            }
        ]

        orders, items = scrape_costco.flatten_costco_data(
            summary_payload, detail_payloads, Path("costco_output/raw")
        )

        self.assertEqual("VISA", orders[0]["payment_method"])
        self.assertEqual("true", items[0]["is_coupon_line"])
        self.assertIn("dup-2026-03-12T16-16-00.json", items[0]["raw_order_path"])

    def test_costco_enricher_parses_size_pack_and_discount(self):
        row = enrich_costco.parse_costco_item(
            order_id="abc",
            order_date="2026-03-12",
            raw_path=Path("costco_output/raw/abc.json"),
            line_no=1,
            item={
                "itemNumber": "60357",
                "itemDescription01": "MIXED PEPPER",
                "itemDescription02": "6-PACK",
                "itemDepartmentNumber": 65,
                "transDepartmentNumber": 65,
                "unit": 1,
                "itemIdentifier": "E",
                "amount": 7.49,
                "itemUnitPriceAmount": 7.49,
            },
        )
        self.assertEqual("60357", row["retailer_item_id"])
        self.assertEqual("MIXED PEPPER", row["item_name_norm"])
        self.assertEqual("6", row["pack_qty"])
        self.assertEqual("count", row["measure_type"])

        discount = enrich_costco.parse_costco_item(
            order_id="abc",
            order_date="2026-03-12",
            raw_path=Path("costco_output/raw/abc.json"),
            line_no=2,
            item={
                "itemNumber": "374664",
                "itemDescription01": "/ 4873222",
                "itemDescription02": None,
                "itemDepartmentNumber": 14,
                "transDepartmentNumber": 14,
                "unit": -1,
                "itemIdentifier": None,
                "amount": -5,
                "itemUnitPriceAmount": 0,
            },
        )
        self.assertEqual("true", discount["is_discount_line"])
        self.assertEqual("true", discount["is_coupon_line"])

    def test_build_items_enriched_matches_discount_to_item(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            raw_dir = Path(tmpdir) / "raw"
            raw_dir.mkdir()
            payload = {
                "data": {
                    "receiptsWithCounts": {
                        "receipts": [
                            {
                                "transactionBarcode": "abc",
                                "transactionDate": "2026-03-12",
                                "itemArray": [
                                    {
                                        "itemNumber": "4873222",
                                        "itemDescription01": "ALL F&C",
                                        "itemDescription02": "200OZ 160LOADS P104",
                                        "itemDepartmentNumber": 14,
                                        "transDepartmentNumber": 14,
                                        "unit": 1,
                                        "itemIdentifier": "E",
                                        "amount": 19.99,
                                        "itemUnitPriceAmount": 19.99,
                                    },
                                    {
                                        "itemNumber": "374664",
                                        "itemDescription01": "/ 4873222",
                                        "itemDescription02": None,
                                        "itemDepartmentNumber": 14,
                                        "transDepartmentNumber": 14,
                                        "unit": -1,
                                        "itemIdentifier": None,
                                        "amount": -5,
                                        "itemUnitPriceAmount": 0,
                                    },
                                ],
                            }
                        ]
                    }
                }
            }
            (raw_dir / "abc.json").write_text(json.dumps(payload), encoding="utf-8")

            rows = enrich_costco.build_items_enriched(raw_dir)

            purchase_row = next(row for row in rows if row["is_discount_line"] == "false")
            discount_row = next(row for row in rows if row["is_discount_line"] == "true")
            self.assertEqual("-5", purchase_row["matched_discount_amount"])
            self.assertEqual("14.99", purchase_row["net_line_total"])
            self.assertIn("matched_discount=4873222", purchase_row["parse_notes"])
            self.assertIn("matched_to_item=4873222", discount_row["parse_notes"])

    def test_cross_retailer_validation_writes_proof_example(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            giant_csv = Path(tmpdir) / "giant_items_enriched.csv"
            costco_csv = Path(tmpdir) / "costco_items_enriched.csv"
            outdir = Path(tmpdir) / "combined"

            fieldnames = enrich_costco.OUTPUT_FIELDS
            giant_row = {field: "" for field in fieldnames}
            giant_row.update(
                {
                    "retailer": "giant",
                    "order_id": "g1",
                    "line_no": "1",
                    "order_date": "2026-03-01",
                    "retailer_item_id": "100",
                    "item_name": "FRESH BANANA",
                    "item_name_norm": "BANANA",
                    "upc": "4011",
                    "measure_type": "weight",
                    "is_store_brand": "false",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "line_total": "1.29",
                }
            )
            costco_row = {field: "" for field in fieldnames}
            costco_row.update(
                {
                    "retailer": "costco",
                    "order_id": "c1",
                    "line_no": "1",
                    "order_date": "2026-03-12",
                    "retailer_item_id": "30669",
                    "item_name": "BANANAS 3 LB / 1.36 KG",
                    "item_name_norm": "BANANA",
                    "upc": "",
                    "size_value": "3",
                    "size_unit": "lb",
                    "measure_type": "weight",
                    "is_store_brand": "false",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "line_total": "2.98",
                }
            )

            with giant_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerow(giant_row)
            with costco_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerow(costco_row)

            validate_cross_retailer_flow.main.callback(
                giant_items_enriched_csv=str(giant_csv),
                costco_items_enriched_csv=str(costco_csv),
                outdir=str(outdir),
            )

            proof_path = outdir / "proof_examples.csv"
            self.assertTrue(proof_path.exists())
            with proof_path.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            self.assertEqual(1, len(rows))
            self.assertEqual("banana", rows[0]["proof_name"])

    def test_main_writes_summary_request_metadata(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            outdir = Path(tmpdir) / "costco_output"
            summary_payload = {
                "data": {
                    "receiptsWithCounts": {
                        "inWarehouse": 1,
                        "gasStation": 0,
                        "carWash": 0,
                        "gasAndCarWash": 0,
                        "receipts": [
                            {
                                "transactionBarcode": "abc",
                                "receiptType": "In-Warehouse",
                                "tenderArray": [],
                                "couponArray": [],
                            }
                        ],
                    }
                }
            }
            detail_payload = {
                "data": {
                    "receiptsWithCounts": {
                        "receipts": [
                            {
                                "transactionBarcode": "abc",
                                "transactionDate": "2026-03-12",
                                "receiptType": "In-Warehouse",
                                "total": 10.0,
                                "totalItemCount": 1,
                                "instantSavings": 0,
                                "warehouseName": "MT VERNON",
                                "warehouseNumber": 1115,
                                "warehouseAddress1": "7940 RICHMOND HWY",
                                "warehouseCity": "ALEXANDRIA",
                                "warehouseState": "VA",
                                "warehousePostalCode": "22306",
                                "itemArray": [],
                            }
                        ]
                    }
                }
            }
            metadata = [
                {
                    "startDate": "1/01/2026",
                    "endDate": "3/31/2026",
                    "text": "custom",
                    "documentType": "all",
                    "documentSubType": "all",
                    "returnedReceipts": 1,
                    "returnedInWarehouseReceipts": 1,
                    "inWarehouse": 1,
                    "gasStation": 0,
                    "carWash": 0,
                    "gasAndCarWash": 0,
                    "countMismatch": False,
|
|
||||||
}
|
|
||||||
]
|
|
||||||
|
|
||||||
with mock.patch.object(
|
|
||||||
scrape_costco,
|
|
||||||
"load_config",
|
|
||||||
return_value={
|
|
||||||
"authorization": "",
|
|
||||||
"client_id": "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
|
|
||||||
"client_identifier": "481b1aec-aa3b-454b-b81b-48187e28f205",
|
|
||||||
},
|
|
||||||
), mock.patch.object(
|
|
||||||
scrape_costco,
|
|
||||||
"find_firefox_profile_dir",
|
|
||||||
return_value=Path("/tmp/profile"),
|
|
||||||
), mock.patch.object(
|
|
||||||
scrape_costco,
|
|
||||||
"load_costco_browser_headers",
|
|
||||||
return_value={
|
|
||||||
"costco-x-authorization": "Bearer header.payload.signature",
|
|
||||||
"costco-x-wcs-clientId": "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
|
|
||||||
"client-identifier": "481b1aec-aa3b-454b-b81b-48187e28f205",
|
|
||||||
},
|
|
||||||
), mock.patch.object(
|
|
||||||
scrape_costco, "build_session", return_value=object()
|
|
||||||
), mock.patch.object(
|
|
||||||
scrape_costco,
|
|
||||||
"fetch_summary_windows",
|
|
||||||
return_value=(summary_payload, metadata),
|
|
||||||
), mock.patch.object(
|
|
||||||
scrape_costco,
|
|
||||||
"graphql_post",
|
|
||||||
return_value=detail_payload,
|
|
||||||
):
|
|
||||||
scrape_costco.main.callback(
|
|
||||||
outdir=str(outdir),
|
|
||||||
document_type="all",
|
|
||||||
document_sub_type="all",
|
|
||||||
window_days=92,
|
|
||||||
months_back=3,
|
|
||||||
firefox_profile_dir=None,
|
|
||||||
)
|
|
||||||
|
|
||||||
metadata_path = outdir / "raw" / "summary_requests.json"
|
|
||||||
self.assertTrue(metadata_path.exists())
|
|
||||||
saved_metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
|
|
||||||
self.assertEqual(metadata, saved_metadata)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
unittest.main()
|
|
||||||
@@ -1,191 +0,0 @@
import csv
import json
import tempfile
import unittest
from pathlib import Path

import enrich_giant


class EnrichGiantTests(unittest.TestCase):
    def test_parse_size_and_pack_handles_pack_and_weight_tokens(self):
        size_value, size_unit, pack_qty = enrich_giant.parse_size_and_pack(
            "COKE CHERRY 6PK 7.5Z"
        )

        self.assertEqual("7.5", size_value)
        self.assertEqual("oz", size_unit)
        self.assertEqual("6", pack_qty)

    def test_parse_item_marks_store_brand_fee_and_weight_prices(self):
        row = enrich_giant.parse_item(
            order_id="abc123",
            order_date="2026-03-01",
            raw_path=Path("raw/abc123.json"),
            line_no=1,
            item={
                "podId": 1,
                "shipQy": 1,
                "totalPickedWeight": 2,
                "unitPrice": 3.98,
                "itemName": "+SB GALA APPLE 5 LB",
                "lbEachCd": "LB",
                "groceryAmount": 3.98,
                "primUpcCd": "111",
                "mvpSavings": 0,
                "rewardSavings": 0,
                "couponSavings": 0,
                "couponPrice": 0,
                "categoryId": "1",
                "categoryDesc": "Grocery",
                "image": {"large": "https://example.test/apple.jpg"},
            },
        )

        self.assertEqual("SB", row["brand_guess"])
        self.assertEqual("GALA APPLE", row["item_name_norm"])
        self.assertEqual("5", row["size_value"])
        self.assertEqual("lb", row["size_unit"])
        self.assertEqual("weight", row["measure_type"])
        self.assertEqual("true", row["is_store_brand"])
        self.assertEqual("1.99", row["price_per_lb"])
        self.assertEqual("0.1244", row["price_per_oz"])
        self.assertEqual("https://example.test/apple.jpg", row["image_url"])

        fee_row = enrich_giant.parse_item(
            order_id="abc123",
            order_date="2026-03-01",
            raw_path=Path("raw/abc123.json"),
            line_no=2,
            item={
                "podId": 2,
                "shipQy": 1,
                "totalPickedWeight": 0,
                "unitPrice": 0.05,
                "itemName": "GL BAG CHARGE",
                "lbEachCd": "EA",
                "groceryAmount": 0.05,
                "primUpcCd": "",
                "mvpSavings": 0,
                "rewardSavings": 0,
                "couponSavings": 0,
                "couponPrice": 0,
                "categoryId": "1",
                "categoryDesc": "Grocery",
            },
        )

        self.assertEqual("true", fee_row["is_fee"])
        self.assertEqual("GL BAG CHARGE", fee_row["item_name_norm"])

    def test_parse_item_derives_packaged_weight_prices_from_size_tokens(self):
        row = enrich_giant.parse_item(
            order_id="abc123",
            order_date="2026-03-01",
            raw_path=Path("raw/abc123.json"),
            line_no=1,
            item={
                "podId": 1,
                "shipQy": 2,
                "totalPickedWeight": 0,
                "unitPrice": 3.0,
                "itemName": "PEPSI 6PK 7.5Z",
                "lbEachCd": "EA",
                "groceryAmount": 6.0,
                "primUpcCd": "111",
                "mvpSavings": 0,
                "rewardSavings": 0,
                "couponSavings": 0,
                "couponPrice": 0,
                "categoryId": "1",
                "categoryDesc": "Grocery",
            },
        )

        self.assertEqual("weight", row["measure_type"])
        self.assertEqual("6", row["pack_qty"])
        self.assertEqual("7.5", row["size_value"])
        self.assertEqual("0.0667", row["price_per_oz"])
        self.assertEqual("1.0667", row["price_per_lb"])

    def test_build_items_enriched_reads_raw_order_files_and_writes_csv(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            raw_dir = Path(tmpdir) / "raw"
            raw_dir.mkdir()
            (raw_dir / "history.json").write_text("{}", encoding="utf-8")
            (raw_dir / "order-2.json").write_text(
                json.dumps(
                    {
                        "orderId": "order-2",
                        "orderDate": "2026-03-02",
                        "items": [
                            {
                                "podId": 20,
                                "shipQy": 1,
                                "totalPickedWeight": 0,
                                "unitPrice": 2.99,
                                "itemName": "SB ROTINI 16Z",
                                "lbEachCd": "EA",
                                "groceryAmount": 2.99,
                                "primUpcCd": "222",
                                "mvpSavings": 0,
                                "rewardSavings": 0,
                                "couponSavings": 0,
                                "couponPrice": 0,
                                "categoryId": "1",
                                "categoryDesc": "Grocery",
                                "image": {"small": "https://example.test/rotini.jpg"},
                            }
                        ],
                    }
                ),
                encoding="utf-8",
            )
            (raw_dir / "order-1.json").write_text(
                json.dumps(
                    {
                        "orderId": "order-1",
                        "orderDate": "2026-03-01",
                        "items": [
                            {
                                "podId": 10,
                                "shipQy": 2,
                                "totalPickedWeight": 0,
                                "unitPrice": 1.5,
                                "itemName": "PEPSI 6PK 7.5Z",
                                "lbEachCd": "EA",
                                "groceryAmount": 3.0,
                                "primUpcCd": "111",
                                "mvpSavings": 0,
                                "rewardSavings": 0,
                                "couponSavings": 0,
                                "couponPrice": 0,
                                "categoryId": "1",
                                "categoryDesc": "Grocery",
                            }
                        ],
                    }
                ),
                encoding="utf-8",
            )

            rows = enrich_giant.build_items_enriched(raw_dir)
            output_csv = Path(tmpdir) / "items_enriched.csv"
            enrich_giant.write_csv(output_csv, rows)

            self.assertEqual(["order-1", "order-2"], [row["order_id"] for row in rows])
            self.assertEqual("PEPSI", rows[0]["item_name_norm"])
            self.assertEqual("6", rows[0]["pack_qty"])
            self.assertEqual("7.5", rows[0]["size_value"])
            self.assertEqual("10", rows[0]["retailer_item_id"])
            self.assertEqual("true", rows[1]["is_store_brand"])

            with output_csv.open(newline="", encoding="utf-8") as handle:
                written_rows = list(csv.DictReader(handle))

            self.assertEqual(2, len(written_rows))
            self.assertEqual(enrich_giant.OUTPUT_FIELDS, list(written_rows[0].keys()))


if __name__ == "__main__":
    unittest.main()
@@ -1,17 +1,66 @@
-import unittest
-
-try:
-    from playwright.sync_api import sync_playwright  # noqa: F401
-    import requests  # noqa: F401
-except ImportError as exc:  # pragma: no cover - dependency-gated smoke test
-    sync_playwright = None
-    _IMPORT_ERROR = exc
-else:
-    _IMPORT_ERROR = None
-
-
-@unittest.skipIf(sync_playwright is None, f"optional smoke test dependency missing: {_IMPORT_ERROR}")
-class GiantLoginSmokeTest(unittest.TestCase):
-    def test_dependencies_available(self):
-        self.assertIsNotNone(sync_playwright)
+import requests
+from playwright.sync_api import sync_playwright
+
+BASE = "https://giantfood.com"
+ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
+
+USER_ID = "369513017"
+LOYALTY = "440155630880"
+
+
+def get_session():
+    with sync_playwright() as p:
+        browser = p.firefox.launch(headless=False)
+        page = browser.new_page()
+
+        page.goto(ACCOUNT_PAGE)
+
+        print("log in manually in the browser, then press ENTER here")
+        input()
+
+        cookies = page.context.cookies()
+        ua = page.evaluate("() => navigator.userAgent")
+
+        browser.close()
+
+    s = requests.Session()
+
+    s.headers.update({
+        "user-agent": ua,
+        "accept": "application/json, text/plain, */*",
+        "referer": ACCOUNT_PAGE,
+    })
+
+    for c in cookies:
+        domain = c.get("domain", "").lstrip(".") or "giantfood.com"
+        s.cookies.set(c["name"], c["value"], domain=domain)
+
+    return s
+
+
+def test_history(session):
+    url = f"{BASE}/api/v6.0/user/{USER_ID}/order/history"
+
+    r = session.get(
+        url,
+        params={
+            "filter": "instore",
+            "loyaltyNumber": LOYALTY,
+        },
+    )
+
+    print("status:", r.status_code)
+    print()
+
+    data = r.json()
+
+    print("orders found:", len(data.get("records", [])))
+    print()
+
+    for rec in data.get("records", [])[:5]:
+        print(rec["orderId"], rec["orderDate"], rec["orderTotal"])
+
+
+if __name__ == "__main__":
+    session = get_session()
+    test_history(session)
@@ -1,67 +0,0 @@
import unittest

import build_observed_products


class ObservedProductTests(unittest.TestCase):
    def test_build_observed_products_aggregates_rows_with_same_key(self):
        rows = [
            {
                "retailer": "giant",
                "order_id": "1",
                "line_no": "1",
                "order_date": "2026-01-01",
                "item_name": "SB GALA APPLE 5LB",
                "item_name_norm": "GALA APPLE",
                "retailer_item_id": "11",
                "upc": "111",
                "brand_guess": "SB",
                "variant": "",
                "size_value": "5",
                "size_unit": "lb",
                "pack_qty": "",
                "measure_type": "weight",
                "image_url": "https://example.test/a.jpg",
                "is_store_brand": "true",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "line_total": "7.99",
            },
            {
                "retailer": "giant",
                "order_id": "2",
                "line_no": "1",
                "order_date": "2026-01-10",
                "item_name": "SB GALA APPLE 5 LB",
                "item_name_norm": "GALA APPLE",
                "retailer_item_id": "11",
                "upc": "111",
                "brand_guess": "SB",
                "variant": "",
                "size_value": "5",
                "size_unit": "lb",
                "pack_qty": "",
                "measure_type": "weight",
                "image_url": "",
                "is_store_brand": "true",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "line_total": "8.49",
            },
        ]

        observed = build_observed_products.build_observed_products(rows)

        self.assertEqual(1, len(observed))
        self.assertEqual("2", observed[0]["times_seen"])
        self.assertEqual("2026-01-01", observed[0]["first_seen_date"])
        self.assertEqual("2026-01-10", observed[0]["last_seen_date"])
        self.assertEqual("11", observed[0]["representative_retailer_item_id"])
        self.assertEqual("111", observed[0]["representative_upc"])
        self.assertIn("SB GALA APPLE 5LB", observed[0]["raw_name_examples"])


if __name__ == "__main__":
    unittest.main()
@@ -1,80 +0,0 @@
import unittest

import report_pipeline_status


class PipelineStatusTests(unittest.TestCase):
    def test_build_status_summary_reports_unresolved_and_reviewed_counts(self):
        summary = report_pipeline_status.build_status_summary(
            giant_orders=[{"order_id": "g1"}],
            giant_items=[{"order_id": "g1", "line_no": "1"}],
            giant_enriched=[
                {
                    "retailer": "giant",
                    "order_id": "g1",
                    "line_no": "1",
                    "item_name_norm": "BANANA",
                    "item_name": "FRESH BANANA",
                    "retailer_item_id": "1",
                    "upc": "4011",
                    "brand_guess": "",
                    "variant": "",
                    "size_value": "",
                    "size_unit": "",
                    "pack_qty": "",
                    "measure_type": "weight",
                    "image_url": "",
                    "is_store_brand": "false",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "order_date": "2026-03-01",
                    "line_total": "1.29",
                }
            ],
            costco_orders=[],
            costco_items=[],
            costco_enriched=[],
            purchases=[
                {
                    "observed_product_id": "gobs_banana",
                    "canonical_product_id": "gcan_banana",
                    "resolution_action": "",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "retailer": "giant",
                    "raw_item_name": "FRESH BANANA",
                    "normalized_item_name": "BANANA",
                    "upc": "4011",
                    "line_total": "1.29",
                },
                {
                    "observed_product_id": "gobs_lime",
                    "canonical_product_id": "",
                    "resolution_action": "",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "retailer": "costco",
                    "raw_item_name": "LIME 5LB",
                    "normalized_item_name": "LIME",
                    "upc": "",
                    "line_total": "4.99",
                },
            ],
            resolutions=[],
        )

        counts = {row["stage"]: row["count"] for row in summary}
        self.assertEqual(1, counts["raw_orders"])
        self.assertEqual(1, counts["raw_items"])
        self.assertEqual(1, counts["enriched_items"])
        self.assertEqual(1, counts["canonical_linked_purchase_rows"])
        self.assertEqual(1, counts["unresolved_purchase_rows"])
        self.assertEqual(1, counts["review_queue_observed_products"])
        self.assertEqual(0, counts["unresolved_not_in_review_rows"])


if __name__ == "__main__":
    unittest.main()
@@ -1,301 +0,0 @@
import csv
import tempfile
import unittest
from pathlib import Path

import build_purchases
import enrich_costco


class PurchaseLogTests(unittest.TestCase):
    def test_derive_metrics_prefers_picked_weight_and_pack_count(self):
        metrics = build_purchases.derive_metrics(
            {
                "line_total": "4.00",
                "qty": "1",
                "pack_qty": "4",
                "size_value": "",
                "size_unit": "",
                "picked_weight": "2",
                "price_per_each": "",
                "price_per_lb": "",
                "price_per_oz": "",
            }
        )

        self.assertEqual("4", metrics["price_per_each"])
        self.assertEqual("1", metrics["price_per_count"])
        self.assertEqual("2", metrics["price_per_lb"])
        self.assertEqual("0.125", metrics["price_per_oz"])
        self.assertEqual("picked_weight_lb", metrics["price_per_lb_basis"])

    def test_build_purchase_rows_maps_canonical_ids(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS
        giant_row = {field: "" for field in fieldnames}
        giant_row.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "observed_item_key": "giant:g1:1",
                "order_date": "2026-03-01",
                "item_name": "FRESH BANANA",
                "item_name_norm": "BANANA",
                "image_url": "https://example.test/banana.jpg",
                "retailer_item_id": "100",
                "upc": "4011",
                "qty": "1",
                "unit": "LB",
                "line_total": "1.29",
                "unit_price": "1.29",
                "measure_type": "weight",
                "price_per_lb": "1.29",
                "raw_order_path": "giant_output/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        costco_row = {field: "" for field in fieldnames}
        costco_row.update(
            {
                "retailer": "costco",
                "order_id": "c1",
                "line_no": "1",
                "observed_item_key": "costco:c1:1",
                "order_date": "2026-03-12",
                "item_name": "BANANAS 3 LB / 1.36 KG",
                "item_name_norm": "BANANA",
                "retailer_item_id": "30669",
                "qty": "1",
                "unit": "E",
                "line_total": "2.98",
                "unit_price": "2.98",
                "size_value": "3",
                "size_unit": "lb",
                "measure_type": "weight",
                "price_per_lb": "0.9933",
                "raw_order_path": "costco_output/raw/c1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        giant_orders = [
            {
                "order_id": "g1",
                "store_name": "Giant",
                "store_number": "42",
                "store_city": "Springfield",
                "store_state": "VA",
            }
        ]
        costco_orders = [
            {
                "order_id": "c1",
                "store_name": "MT VERNON",
                "store_number": "1115",
                "store_city": "ALEXANDRIA",
                "store_state": "VA",
            }
        ]

        rows, _observed, _canon, _links = build_purchases.build_purchase_rows(
            [giant_row],
            [costco_row],
            giant_orders,
            costco_orders,
            [],
        )

        self.assertEqual(2, len(rows))
        self.assertTrue(all(row["canonical_product_id"] for row in rows))
        self.assertEqual({"giant", "costco"}, {row["retailer"] for row in rows})
        self.assertEqual("https://example.test/banana.jpg", rows[0]["image_url"])

    def test_main_writes_purchase_and_example_csvs(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            giant_items = Path(tmpdir) / "giant_items.csv"
            costco_items = Path(tmpdir) / "costco_items.csv"
            giant_orders = Path(tmpdir) / "giant_orders.csv"
            costco_orders = Path(tmpdir) / "costco_orders.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"
            links_csv = Path(tmpdir) / "product_links.csv"
            purchases_csv = Path(tmpdir) / "combined" / "purchases.csv"
            examples_csv = Path(tmpdir) / "combined" / "comparison_examples.csv"

            fieldnames = enrich_costco.OUTPUT_FIELDS
            giant_row = {field: "" for field in fieldnames}
            giant_row.update(
                {
                    "retailer": "giant",
                    "order_id": "g1",
                    "line_no": "1",
                    "observed_item_key": "giant:g1:1",
                    "order_date": "2026-03-01",
                    "item_name": "FRESH BANANA",
                    "item_name_norm": "BANANA",
                    "retailer_item_id": "100",
                    "upc": "4011",
                    "qty": "1",
                    "unit": "LB",
                    "line_total": "1.29",
                    "unit_price": "1.29",
                    "measure_type": "weight",
                    "price_per_lb": "1.29",
                    "raw_order_path": "giant_output/raw/g1.json",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "is_fee": "false",
                }
            )
            costco_row = {field: "" for field in fieldnames}
            costco_row.update(
                {
                    "retailer": "costco",
                    "order_id": "c1",
                    "line_no": "1",
                    "observed_item_key": "costco:c1:1",
                    "order_date": "2026-03-12",
                    "item_name": "BANANAS 3 LB / 1.36 KG",
                    "item_name_norm": "BANANA",
                    "retailer_item_id": "30669",
                    "qty": "1",
                    "unit": "E",
                    "line_total": "2.98",
                    "unit_price": "2.98",
                    "size_value": "3",
                    "size_unit": "lb",
                    "measure_type": "weight",
                    "price_per_lb": "0.9933",
                    "raw_order_path": "costco_output/raw/c1.json",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "is_fee": "false",
                }
            )

            for path, source_rows in [
                (giant_items, [giant_row]),
                (costco_items, [costco_row]),
            ]:
                with path.open("w", newline="", encoding="utf-8") as handle:
                    writer = csv.DictWriter(handle, fieldnames=fieldnames)
                    writer.writeheader()
                    writer.writerows(source_rows)

            order_fields = ["order_id", "store_name", "store_number", "store_city", "store_state"]
            for path, source_rows in [
                (
                    giant_orders,
                    [
                        {
                            "order_id": "g1",
                            "store_name": "Giant",
                            "store_number": "42",
                            "store_city": "Springfield",
                            "store_state": "VA",
                        }
                    ],
                ),
                (
                    costco_orders,
                    [
                        {
                            "order_id": "c1",
                            "store_name": "MT VERNON",
                            "store_number": "1115",
                            "store_city": "ALEXANDRIA",
                            "store_state": "VA",
                        }
                    ],
                ),
            ]:
                with path.open("w", newline="", encoding="utf-8") as handle:
                    writer = csv.DictWriter(handle, fieldnames=order_fields)
                    writer.writeheader()
                    writer.writerows(source_rows)

            build_purchases.main.callback(
                giant_items_enriched_csv=str(giant_items),
                costco_items_enriched_csv=str(costco_items),
                giant_orders_csv=str(giant_orders),
                costco_orders_csv=str(costco_orders),
                resolutions_csv=str(resolutions_csv),
                catalog_csv=str(catalog_csv),
                links_csv=str(links_csv),
                output_csv=str(purchases_csv),
                examples_csv=str(examples_csv),
            )

            self.assertTrue(purchases_csv.exists())
            self.assertTrue(examples_csv.exists())
            with purchases_csv.open(newline="", encoding="utf-8") as handle:
                purchase_rows = list(csv.DictReader(handle))
            with examples_csv.open(newline="", encoding="utf-8") as handle:
                example_rows = list(csv.DictReader(handle))
            self.assertEqual(2, len(purchase_rows))
            self.assertEqual(1, len(example_rows))

    def test_build_purchase_rows_applies_manual_resolution(self):
        fieldnames = enrich_costco.OUTPUT_FIELDS
        giant_row = {field: "" for field in fieldnames}
        giant_row.update(
            {
                "retailer": "giant",
                "order_id": "g1",
                "line_no": "1",
                "observed_item_key": "giant:g1:1",
                "order_date": "2026-03-01",
                "item_name": "SB BAGGED ICE 20LB",
                "item_name_norm": "BAGGED ICE",
                "retailer_item_id": "100",
                "upc": "",
                "qty": "1",
                "unit": "EA",
                "line_total": "3.50",
                "unit_price": "3.50",
                "measure_type": "each",
                "raw_order_path": "giant_output/raw/g1.json",
                "is_discount_line": "false",
                "is_coupon_line": "false",
                "is_fee": "false",
            }
        )
        observed_rows, _canonical_rows, _link_rows, _observed_id_by_key, _canonical_by_observed = (
            build_purchases.build_link_state([giant_row])
        )
        observed_product_id = observed_rows[0]["observed_product_id"]
        rows, _observed, _canon, _links = build_purchases.build_purchase_rows(
            [giant_row],
            [],
            [
                {
                    "order_id": "g1",
                    "store_name": "Giant",
                    "store_number": "42",
                    "store_city": "Springfield",
                    "store_state": "VA",
                }
            ],
            [],
            [
                {
                    "observed_product_id": observed_product_id,
                    "canonical_product_id": "gcan_manual_ice",
                    "resolution_action": "create",
                    "status": "approved",
                    "resolution_notes": "manual ice merge",
                    "reviewed_at": "2026-03-16",
                }
            ],
        )

        self.assertEqual("gcan_manual_ice", rows[0]["canonical_product_id"])
        self.assertEqual("approved", rows[0]["review_status"])
        self.assertEqual("create", rows[0]["resolution_action"])


if __name__ == "__main__":
    unittest.main()
@@ -1,133 +0,0 @@
import tempfile
import unittest
from pathlib import Path

import build_observed_products
import build_review_queue
from layer_helpers import write_csv_rows


class ReviewQueueTests(unittest.TestCase):
    def test_build_review_queue_preserves_existing_status(self):
        observed_rows = [
            {
                "observed_product_id": "gobs_1",
                "retailer": "giant",
                "representative_upc": "111",
                "representative_image_url": "",
                "representative_name_norm": "GALA APPLE",
                "times_seen": "2",
                "distinct_item_names_count": "2",
                "distinct_upcs_count": "1",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            }
        ]
        item_rows = [
            {
                "observed_product_id": "gobs_1",
                "item_name": "SB GALA APPLE 5LB",
                "item_name_norm": "GALA APPLE",
                "line_total": "7.99",
            },
            {
                "observed_product_id": "gobs_1",
                "item_name": "SB GALA APPLE 5 LB",
                "item_name_norm": "GALA APPLE",
                "line_total": "8.49",
            },
        ]
        existing = {
            build_review_queue.stable_id("rvw", "gobs_1|missing_image"): {
                "status": "approved",
                "resolution_notes": "looked fine",
                "created_at": "2026-03-15",
            }
        }

        queue = build_review_queue.build_review_queue(
            observed_rows, item_rows, existing, "2026-03-16"
        )

        self.assertEqual(2, len(queue))
        missing_image = [row for row in queue if row["reason_code"] == "missing_image"][0]
        self.assertEqual("approved", missing_image["status"])
        self.assertEqual("looked fine", missing_image["resolution_notes"])

    def test_review_queue_main_writes_output(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            observed_path = Path(tmpdir) / "products_observed.csv"
            items_path = Path(tmpdir) / "items_enriched.csv"
            output_path = Path(tmpdir) / "review_queue.csv"

            observed_rows = [
                {
                    "observed_product_id": "gobs_1",
                    "retailer": "giant",
                    "observed_key": "giant|upc=111|name=GALA APPLE",
                    "representative_retailer_item_id": "11",
                    "representative_upc": "111",
                    "representative_item_name": "SB GALA APPLE 5LB",
                    "representative_name_norm": "GALA APPLE",
                    "representative_brand": "SB",
                    "representative_variant": "",
                    "representative_size_value": "5",
                    "representative_size_unit": "lb",
                    "representative_pack_qty": "",
                    "representative_measure_type": "weight",
                    "representative_image_url": "",
                    "is_store_brand": "true",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "first_seen_date": "2026-01-01",
                    "last_seen_date": "2026-01-10",
                    "times_seen": "2",
                    "example_order_id": "1",
                    "example_item_name": "SB GALA APPLE 5LB",
                    "raw_name_examples": "SB GALA APPLE 5LB | SB GALA APPLE 5 LB",
                    "normalized_name_examples": "GALA APPLE",
                    "example_prices": "7.99 | 8.49",
                    "distinct_item_names_count": "2",
                    "distinct_retailer_item_ids_count": "1",
                    "distinct_upcs_count": "1",
                }
            ]
            item_rows = [
                {
                    "retailer": "giant",
                    "order_id": "1",
                    "line_no": "1",
                    "item_name": "SB GALA APPLE 5LB",
                    "item_name_norm": "GALA APPLE",
                    "retailer_item_id": "11",
                    "upc": "111",
                    "size_value": "5",
                    "size_unit": "lb",
                    "pack_qty": "",
                    "measure_type": "weight",
                    "is_store_brand": "true",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "line_total": "7.99",
                }
            ]

            write_csv_rows(
                observed_path, observed_rows, build_observed_products.OUTPUT_FIELDS
            )
            write_csv_rows(items_path, item_rows, list(item_rows[0].keys()))

            build_review_queue.main.callback(
                observed_csv=str(observed_path),
                items_enriched_csv=str(items_path),
                output_csv=str(output_path),
            )

            self.assertTrue(output_path.exists())


if __name__ == "__main__":
    unittest.main()
@@ -1,409 +0,0 @@
import csv
import tempfile
import unittest
from pathlib import Path
from unittest import mock

from click.testing import CliRunner

import review_products


class ReviewWorkflowTests(unittest.TestCase):
    def test_build_review_queue_groups_unresolved_purchases(self):
        queue_rows = review_products.build_review_queue(
            [
                {
                    "observed_product_id": "gobs_1",
                    "canonical_product_id": "",
                    "retailer": "giant",
                    "raw_item_name": "SB BAGGED ICE 20LB",
                    "normalized_item_name": "BAGGED ICE",
                    "upc": "",
                    "line_total": "3.50",
                },
                {
                    "observed_product_id": "gobs_1",
                    "canonical_product_id": "",
                    "retailer": "giant",
                    "raw_item_name": "SB BAG ICE CUBED 10LB",
                    "normalized_item_name": "BAG ICE",
                    "upc": "",
                    "line_total": "2.50",
                },
            ],
            [],
        )

        self.assertEqual(1, len(queue_rows))
        self.assertEqual("gobs_1", queue_rows[0]["observed_product_id"])
        self.assertIn("SB BAGGED ICE 20LB", queue_rows[0]["raw_item_names"])

    def test_build_canonical_suggestions_prefers_upc_then_name(self):
        suggestions = review_products.build_canonical_suggestions(
            [
                {
                    "normalized_item_name": "MIXED PEPPER",
                    "upc": "12345",
                }
            ],
            [
                {
                    "canonical_product_id": "gcan_1",
                    "canonical_name": "MIXED PEPPER",
                    "upc": "",
                },
                {
                    "canonical_product_id": "gcan_2",
                    "canonical_name": "MIXED PEPPER 6 PACK",
                    "upc": "12345",
                },
            ],
        )

        self.assertEqual("gcan_2", suggestions[0]["canonical_product_id"])
        self.assertEqual("exact upc", suggestions[0]["reason"])
        self.assertEqual("gcan_1", suggestions[1]["canonical_product_id"])

    def test_review_products_displays_position_items_and_suggestions(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"

            purchase_fields = [
                "purchase_date",
                "retailer",
                "order_id",
                "line_no",
                "observed_product_id",
                "canonical_product_id",
                "raw_item_name",
                "normalized_item_name",
                "image_url",
                "upc",
                "line_total",
            ]
            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=purchase_fields)
                writer.writeheader()
                writer.writerows(
                    [
                        {
                            "purchase_date": "2026-03-14",
                            "retailer": "costco",
                            "order_id": "c2",
                            "line_no": "2",
                            "observed_product_id": "gobs_mix",
                            "canonical_product_id": "",
                            "raw_item_name": "MIXED PEPPER 6-PACK",
                            "normalized_item_name": "MIXED PEPPER",
                            "image_url": "",
                            "upc": "",
                            "line_total": "7.49",
                        },
                        {
                            "purchase_date": "2026-03-12",
                            "retailer": "costco",
                            "order_id": "c1",
                            "line_no": "1",
                            "observed_product_id": "gobs_mix",
                            "canonical_product_id": "",
                            "raw_item_name": "MIXED PEPPER 6-PACK",
                            "normalized_item_name": "MIXED PEPPER",
                            "image_url": "https://example.test/mixed-pepper.jpg",
                            "upc": "",
                            "line_total": "6.99",
                        },
                    ]
                )

            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "canonical_product_id": "gcan_mix",
                        "canonical_name": "MIXED PEPPER",
                        "category": "produce",
                        "product_type": "pepper",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )

            runner = CliRunner()
            result = runner.invoke(
                review_products.main,
                [
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                ],
                input="q\n",
                color=True,
            )

            self.assertEqual(0, result.exit_code)
            self.assertIn("Review 1/1: Resolve observed_product MIXED PEPPER to canonical_name [__]?", result.output)
            self.assertIn("2 matched items:", result.output)
            self.assertIn("[l]ink existing [n]ew canonical e[x]clude [s]kip [q]uit:", result.output)
            first_item = result.output.index("[1] 2026-03-14 | 7.49")
            second_item = result.output.index("[2] 2026-03-12 | 6.99")
            self.assertLess(first_item, second_item)
            self.assertIn("https://example.test/mixed-pepper.jpg", result.output)
            self.assertIn("1 canonical suggestions found:", result.output)
            self.assertIn("[1] MIXED PEPPER", result.output)
            self.assertIn("\x1b[", result.output)

    def test_review_products_no_suggestions_is_informational(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"

            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(
                    handle,
                    fieldnames=[
                        "purchase_date",
                        "retailer",
                        "order_id",
                        "line_no",
                        "observed_product_id",
                        "canonical_product_id",
                        "raw_item_name",
                        "normalized_item_name",
                        "image_url",
                        "upc",
                        "line_total",
                    ],
                )
                writer.writeheader()
                writer.writerow(
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "observed_product_id": "gobs_ice",
                        "canonical_product_id": "",
                        "raw_item_name": "SB BAGGED ICE 20LB",
                        "normalized_item_name": "BAGGED ICE",
                        "image_url": "",
                        "upc": "",
                        "line_total": "3.50",
                    }
                )

            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()

            result = CliRunner().invoke(
                review_products.main,
                [
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                ],
                input="q\n",
                color=True,
            )

            self.assertEqual(0, result.exit_code)
            self.assertIn("no canonical_name suggestions found", result.output)

    def test_link_existing_uses_numbered_selection_and_confirmation(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"

            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(
                    handle,
                    fieldnames=[
                        "purchase_date",
                        "retailer",
                        "order_id",
                        "line_no",
                        "observed_product_id",
                        "canonical_product_id",
                        "raw_item_name",
                        "normalized_item_name",
                        "image_url",
                        "upc",
                        "line_total",
                    ],
                )
                writer.writeheader()
                writer.writerows(
                    [
                        {
                            "purchase_date": "2026-03-14",
                            "retailer": "costco",
                            "order_id": "c2",
                            "line_no": "2",
                            "observed_product_id": "gobs_mix",
                            "canonical_product_id": "",
                            "raw_item_name": "MIXED PEPPER 6-PACK",
                            "normalized_item_name": "MIXED PEPPER",
                            "image_url": "",
                            "upc": "",
                            "line_total": "7.49",
                        },
                        {
                            "purchase_date": "2026-03-12",
                            "retailer": "costco",
                            "order_id": "c1",
                            "line_no": "1",
                            "observed_product_id": "gobs_mix",
                            "canonical_product_id": "",
                            "raw_item_name": "MIXED PEPPER 6-PACK",
                            "normalized_item_name": "MIXED PEPPER",
                            "image_url": "",
                            "upc": "",
                            "line_total": "6.99",
                        },
                    ]
                )

            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "canonical_product_id": "gcan_mix",
                        "canonical_name": "MIXED PEPPER",
                        "category": "",
                        "product_type": "",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )

            result = CliRunner().invoke(
                review_products.main,
                [
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--limit",
                    "1",
                ],
                input="l\n1\ny\nlinked by test\n",
                color=True,
            )

            self.assertEqual(0, result.exit_code)
            self.assertIn("Select the canonical_name to associate 2 items with:", result.output)
            self.assertIn("[1] MIXED PEPPER | gcan_mix", result.output)
            self.assertIn('2 "MIXED PEPPER" items and future matches will be associated with "MIXED PEPPER".', result.output)
            self.assertIn("actions: [y]es [n]o [b]ack [s]kip [q]uit", result.output)
            with resolutions_csv.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            self.assertEqual("gcan_mix", rows[0]["canonical_product_id"])
            self.assertEqual("link", rows[0]["resolution_action"])

    def test_review_products_creates_canonical_and_resolution(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"

            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(
                    handle,
                    fieldnames=[
                        "purchase_date",
                        "observed_product_id",
                        "canonical_product_id",
                        "retailer",
                        "raw_item_name",
                        "normalized_item_name",
                        "image_url",
                        "upc",
                        "line_total",
                        "order_id",
                        "line_no",
                    ],
                )
                writer.writeheader()
                writer.writerow(
                    {
                        "purchase_date": "2026-03-15",
                        "observed_product_id": "gobs_ice",
                        "canonical_product_id": "",
                        "retailer": "giant",
                        "raw_item_name": "SB BAGGED ICE 20LB",
                        "normalized_item_name": "BAGGED ICE",
                        "image_url": "",
                        "upc": "",
                        "line_total": "3.50",
                        "order_id": "g1",
                        "line_no": "1",
                    }
                )

            with mock.patch.object(
                review_products.click,
                "prompt",
                side_effect=["n", "ICE", "frozen", "ice", "manual merge", "q"],
            ):
                review_products.main.callback(
                    purchases_csv=str(purchases_csv),
                    queue_csv=str(queue_csv),
                    resolutions_csv=str(resolutions_csv),
                    catalog_csv=str(catalog_csv),
                    limit=1,
                    refresh_only=False,
                )

            self.assertTrue(queue_csv.exists())
            self.assertTrue(resolutions_csv.exists())
            self.assertTrue(catalog_csv.exists())
            with resolutions_csv.open(newline="", encoding="utf-8") as handle:
                resolution_rows = list(csv.DictReader(handle))
            with catalog_csv.open(newline="", encoding="utf-8") as handle:
                catalog_rows = list(csv.DictReader(handle))
            self.assertEqual("create", resolution_rows[0]["resolution_action"])
            self.assertEqual("approved", resolution_rows[0]["status"])
            self.assertEqual("ICE", catalog_rows[0]["canonical_name"])


if __name__ == "__main__":
    unittest.main()
@@ -1,128 +0,0 @@
import csv
import tempfile
import unittest
from pathlib import Path

import scraper


class ScraperTests(unittest.TestCase):
    def test_flatten_orders_extracts_order_and_item_rows(self):
        history = {
            "records": [
                {
                    "orderId": "abc123",
                    "serviceType": "PICKUP",
                }
            ]
        }
        details = [
            {
                "orderId": "abc123",
                "orderDate": "2026-03-01",
                "deliveryDate": "2026-03-02",
                "orderTotal": "12.34",
                "paymentMethod": "VISA",
                "totalItemCount": 1,
                "totalSavings": "1.00",
                "yourSavingsTotal": "1.00",
                "couponsDiscountsTotal": "0.50",
                "refundOrder": False,
                "ebtOrder": False,
                "pup": {
                    "storeName": "Giant",
                    "aholdStoreNumber": "42",
                    "storeAddress1": "123 Main",
                    "storeCity": "Springfield",
                    "storeState": "VA",
                    "storeZipcode": "22150",
                },
                "items": [
                    {
                        "podId": "pod-1",
                        "itemName": "Bananas",
                        "primUpcCd": "111",
                        "categoryId": "produce",
                        "categoryDesc": "Produce",
                        "shipQy": "2",
                        "lbEachCd": "EA",
                        "unitPrice": "0.59",
                        "groceryAmount": "1.18",
                        "totalPickedWeight": "",
                        "mvpSavings": "0.10",
                        "rewardSavings": "0.00",
                        "couponSavings": "0.00",
                        "couponPrice": "",
                    }
                ],
            }
        ]

        orders, items = scraper.flatten_orders(
            history,
            details,
            history_path=Path("data/giant-web/raw/history.json"),
            raw_dir=Path("data/giant-web/raw"),
        )

        self.assertEqual(1, len(orders))
        self.assertEqual("abc123", orders[0]["order_id"])
        self.assertEqual("giant", orders[0]["retailer"])
        self.assertEqual("PICKUP", orders[0]["service_type"])
        self.assertEqual("data/giant-web/raw/history.json", orders[0]["raw_history_path"])
        self.assertEqual("data/giant-web/raw/abc123.json", orders[0]["raw_order_path"])
        self.assertEqual(1, len(items))
        self.assertEqual("1", items[0]["line_no"])
        self.assertEqual("Bananas", items[0]["item_name"])
        self.assertEqual("giant", items[0]["retailer"])
        self.assertEqual("data/giant-web/raw/abc123.json", items[0]["raw_order_path"])
        self.assertEqual("false", items[0]["is_discount_line"])

    def test_append_dedup_replaces_duplicate_rows_and_preserves_new_values(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            path = Path(tmpdir) / "orders.csv"

            scraper.append_dedup(
                path,
                [
                    {"order_id": "1", "order_total": "10.00"},
                    {"order_id": "2", "order_total": "20.00"},
                ],
                subset=["order_id"],
                fieldnames=["order_id", "order_total"],
            )

            merged = scraper.append_dedup(
                path,
                [
                    {"order_id": "2", "order_total": "21.50"},
                    {"order_id": "3", "order_total": "30.00"},
                ],
                subset=["order_id"],
                fieldnames=["order_id", "order_total"],
            )

            self.assertEqual(
                [
                    {"order_id": "1", "order_total": "10.00"},
                    {"order_id": "2", "order_total": "21.50"},
                    {"order_id": "3", "order_total": "30.00"},
                ],
                merged,
            )

            with path.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))

            self.assertEqual(merged, rows)

    def test_read_existing_order_ids_returns_known_ids(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            path = Path(tmpdir) / "orders.csv"
            path.write_text("order_id,order_total\n1,10.00\n2,20.00\n", encoding="utf-8")

            self.assertEqual({"1", "2"}, scraper.read_existing_order_ids(path))


if __name__ == "__main__":
    unittest.main()
@@ -1,154 +0,0 @@
import json
from pathlib import Path

import click

import build_canonical_layer
import build_observed_products
from layer_helpers import stable_id, write_csv_rows


PROOF_FIELDS = [
    "proof_name",
    "canonical_product_id",
    "giant_observed_product_id",
    "costco_observed_product_id",
    "giant_example_item",
    "costco_example_item",
    "notes",
]


def read_rows(path):
    import csv

    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def find_proof_pair(observed_rows):
    giant = None
    costco = None
    for row in observed_rows:
        if row["retailer"] == "giant" and row["representative_name_norm"] == "BANANA":
            giant = row
        if row["retailer"] == "costco" and row["representative_name_norm"] == "BANANA":
            costco = row
    return giant, costco


def merge_proof_pair(canonical_rows, link_rows, giant_row, costco_row):
    if not giant_row or not costco_row:
        return canonical_rows, link_rows, []

    proof_canonical_id = stable_id("gcan", "proof|banana")
    link_rows = [
        row
        for row in link_rows
        if row["observed_product_id"]
        not in {giant_row["observed_product_id"], costco_row["observed_product_id"]}
    ]
    canonical_rows = [
        row
        for row in canonical_rows
        if row["canonical_product_id"] != proof_canonical_id
    ]
    canonical_rows.append(
        {
            "canonical_product_id": proof_canonical_id,
            "canonical_name": "BANANA",
            "product_type": "banana",
            "brand": "",
            "variant": "",
            "size_value": "",
            "size_unit": "",
            "pack_qty": "",
            "measure_type": "weight",
            "normalized_quantity": "",
            "normalized_quantity_unit": "",
            "notes": "manual proof merge for cross-retailer validation",
            "created_at": "",
            "updated_at": "",
        }
    )
    for observed_row in [giant_row, costco_row]:
        link_rows.append(
            {
                "observed_product_id": observed_row["observed_product_id"],
                "canonical_product_id": proof_canonical_id,
                "link_method": "manual_proof_merge",
                "link_confidence": "medium",
                "review_status": "",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "cross-retailer validation proof",
            }
        )

    proof_rows = [
        {
            "proof_name": "banana",
            "canonical_product_id": proof_canonical_id,
            "giant_observed_product_id": giant_row["observed_product_id"],
            "costco_observed_product_id": costco_row["observed_product_id"],
            "giant_example_item": giant_row["example_item_name"],
            "costco_example_item": costco_row["example_item_name"],
            "notes": "BANANA proof pair built from Giant and Costco enriched rows",
        }
    ]
    return canonical_rows, link_rows, proof_rows


@click.command()
@click.option(
    "--giant-items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
)
@click.option(
    "--costco-items-enriched-csv",
    default="costco_output/items_enriched.csv",
    show_default=True,
)
@click.option(
    "--outdir",
    default="combined_output",
    show_default=True,
)
def main(giant_items_enriched_csv, costco_items_enriched_csv, outdir):
    outdir = Path(outdir)
    rows = read_rows(giant_items_enriched_csv) + read_rows(costco_items_enriched_csv)
    observed_rows = build_observed_products.build_observed_products(rows)
    canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows)
    giant_row, costco_row = find_proof_pair(observed_rows)
    if not giant_row or not costco_row:
        raise click.ClickException(
            "could not find BANANA proof pair across Giant and Costco observed products"
        )
    canonical_rows, link_rows, proof_rows = merge_proof_pair(
        canonical_rows, link_rows, giant_row, costco_row
    )

    write_csv_rows(
        outdir / "products_observed.csv",
        observed_rows,
        build_observed_products.OUTPUT_FIELDS,
    )
    write_csv_rows(
        outdir / "products_canonical.csv",
        canonical_rows,
        build_canonical_layer.CANONICAL_FIELDS,
    )
    write_csv_rows(
        outdir / "product_links.csv",
        link_rows,
        build_canonical_layer.LINK_FIELDS,
    )
    write_csv_rows(outdir / "proof_examples.csv", proof_rows, PROOF_FIELDS)
    click.echo(
        f"wrote combined outputs to {outdir} using {len(observed_rows)} observed rows"
    )


if __name__ == "__main__":
    main()