Compare commits

27 Commits

Author SHA1 Message Date
ben
afadd0c0d0 Restore skip and move search to find 2026-03-20 13:35:07 -04:00
ben
2847d2d59f Record t1.16.1 task evidence 2026-03-20 13:32:27 -04:00
ben
f93b9aa464 Add catalog search to review flow 2026-03-20 13:32:20 -04:00
ben
17158fb9e9 Record t1.16 task evidence 2026-03-20 12:45:57 -04:00
ben
975d44bebb Tighten review prompt flow 2026-03-20 12:45:38 -04:00
ben
f478795b5d added t1.16 to cleanup review process 2026-03-20 12:42:23 -04:00
ben
59fb881c0a Record t1.15 task evidence 2026-03-20 11:27:56 -04:00
ben
9104781b93 Refactor review pipeline around normalized items 2026-03-20 11:27:46 -04:00
ben
607c51038a Record t1.14.3 task evidence 2026-03-20 11:09:50 -04:00
ben
bcec6b37d3 Clean Costco normalization artifacts 2026-03-20 11:09:44 -04:00
ben
848d229f2d Record t1.14.2 task evidence 2026-03-20 10:05:08 -04:00
ben
d2e6f2afd3 Align refactor paths with data layout 2026-03-20 10:04:58 -04:00
424a777dd0 added git note 2026-03-20 09:58:25 -04:00
2e5d69c75e added 14.2 and 14.3 for refactor prep 2026-03-20 09:55:46 -04:00
ben
3c2462845b added task-sample 2026-03-18 15:47:12 -04:00
ben
c0023e8f3a Record t1.14.1 task evidence 2026-03-18 15:46:31 -04:00
ben
9064de5f67 Refactor retailer normalization outputs 2026-03-18 15:46:20 -04:00
ben
ec1f36a140 Record t1.14 task evidence 2026-03-18 15:18:54 -04:00
ben
48c6eaf753 Refactor retailer collection entrypoints 2026-03-18 15:18:47 -04:00
ben
e74253f6fb data-model prep for refactor, removing observed layer 2026-03-18 15:15:29 -04:00
ben
c13d144418 cleanup 2026-03-18 14:02:36 -04:00
ben
10aad05808 data-model refactor and prep scope 2026-03-18 13:08:28 -04:00
ben
9122821db1 Fix t1.13 evidence hashes 2026-03-17 15:08:09 -04:00
ben
7743421918 Record t1.13 task evidence 2026-03-17 15:07:51 -04:00
ben
08e2a86cbd Make canonical auto-linking more conservative 2026-03-17 15:07:48 -04:00
ben
56a03bcb1d Attach Costco discounts to purchase rows 2026-03-17 15:07:45 -04:00
ben
967e19e561 Add pipeline status accounting 2026-03-17 15:07:42 -04:00
24 changed files with 2731 additions and 729 deletions


@@ -12,6 +12,13 @@ Run each script step-by-step from the terminal.
4. `enrich_costco.py`: normalize Costco line items 4. `enrich_costco.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table 5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal 6. `review_products.py`: review unresolved product matches in the terminal
7. `report_pipeline_status.py`: show how many rows survive each stage
Active refactor entrypoints:
- `collect_giant_web.py`
- `collect_costco_web.py`
- `normalize_giant_web.py`
- `normalize_costco_web.py`
## Requirements ## Requirements
@@ -29,8 +36,9 @@ pip install -r requirements.txt
## Optional `.env` ## Optional `.env`
Current version works best with `.env` in the project root. The scraper will prompt for these values if they are not found in the current browser session. Current version works best with `.env` in the project root. The scraper will prompt for these values if they are not found in the current browser session.
-- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
-- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
+- `collect_giant_web.py` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
+- `collect_costco_web.py` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
+- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.
```env ```env
GIANT_USER_ID=... GIANT_USER_ID=...
@@ -41,18 +49,44 @@ COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=... COSTCO_CLIENT_IDENTIFIER=...
``` ```
Current active path layout:
```text
data/
  giant-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  review/
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    purchases.csv
    pipeline_status.csv
    pipeline_status.json
  catalog.csv
```
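The layout listing can be checked mechanically after a run. The following is a minimal sketch, assuming the nesting implied by the README's Key Outputs paths (retailer files directly under each retailer directory, review artifacts under `data/review/`). `EXPECTED_OUTPUTS` and `missing_outputs` are hypothetical names used for illustration and are not part of the repository.

```python
from pathlib import Path

# Relative paths copied from the layout listing. The nesting is an assumption
# based on the README's Key Outputs section, not something the scripts export.
EXPECTED_OUTPUTS = [
    "giant-web/collected_orders.csv",
    "giant-web/collected_items.csv",
    "giant-web/normalized_items.csv",
    "costco-web/collected_orders.csv",
    "costco-web/collected_items.csv",
    "costco-web/normalized_items.csv",
    "review/review_queue.csv",
    "review/review_resolutions.csv",
    "review/product_links.csv",
    "review/purchases.csv",
    "review/pipeline_status.csv",
    "review/pipeline_status.json",
    "catalog.csv",
]


def missing_outputs(data_root="data"):
    """Return the expected output files that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

Calling `missing_outputs()` from the project root after the full run order should return an empty list if every stage wrote its file; anything it returns points at the stage that did not finish.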
## Run Order ## Run Order
Run the pipeline in this order: Run the pipeline in this order:
 ```bash
-python scrape_giant.py
-python enrich_giant.py
-python scrape_costco.py
-python enrich_costco.py
+python collect_giant_web.py
+python normalize_giant_web.py
+python collect_costco_web.py
+python normalize_costco_web.py
 python build_purchases.py
 python review_products.py
 python build_purchases.py
+python review_products.py --refresh-only
+python report_pipeline_status.py
 ```
Why run `build_purchases.py` twice: Why run `build_purchases.py` twice:
@@ -66,25 +100,34 @@ If you only want to refresh the queue without reviewing interactively:
python review_products.py --refresh-only python review_products.py --refresh-only
``` ```
If you want a quick stage-by-stage accountability check:
```bash
python report_pipeline_status.py
```
## Key Outputs ## Key Outputs
 Giant:
-- `giant_output/orders.csv`
-- `giant_output/items.csv`
-- `giant_output/items_enriched.csv`
+- `data/giant-web/collected_orders.csv`
+- `data/giant-web/collected_items.csv`
+- `data/giant-web/normalized_items.csv`
 Costco:
-- `costco_output/orders.csv`
-- `costco_output/items.csv`
-- `costco_output/items_enriched.csv`
+- `data/costco-web/collected_orders.csv`
+- `data/costco-web/collected_items.csv`
+- `data/costco-web/normalized_items.csv`
+- `data/costco-web/normalized_items.csv` preserves raw totals and matched net discount fields
 Combined:
-- `combined_output/purchases.csv`
-- `combined_output/review_queue.csv`
-- `combined_output/review_resolutions.csv`
-- `combined_output/canonical_catalog.csv`
-- `combined_output/product_links.csv`
-- `combined_output/comparison_examples.csv`
+- `data/review/purchases.csv`
+- `data/review/review_queue.csv`
+- `data/review/review_resolutions.csv`
+- `data/review/product_links.csv`
+- `data/review/comparison_examples.csv`
+- `data/review/pipeline_status.csv`
+- `data/review/pipeline_status.json`
+- `data/catalog.csv`
## Review Workflow ## Review Workflow
@@ -95,9 +138,14 @@ Run `review_products.py` to cleanup unresolved or weakly unified items:
- skip it for later - skip it for later
Decisions are saved and reused on later runs. Decisions are saved and reused on later runs.
The review step is intentionally conservative:
- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text
## Notes ## Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction. - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts. - `scrape_giant.py`, `scrape_costco.py`, `enrich_giant.py`, and `enrich_costco.py` are now legacy-compatible entrypoints; prefer the `collect_*` and `normalize_*` scripts for active work.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step. - `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
## Test ## Test


@@ -1,4 +1,5 @@
import click import click
import re
from layer_helpers import read_csv_rows, representative_value, stable_id, write_csv_rows from layer_helpers import read_csv_rows, representative_value, stable_id, write_csv_rows
@@ -20,6 +21,8 @@ CANONICAL_FIELDS = [
"updated_at", "updated_at",
] ]
CANONICAL_DROP_TOKENS = {"CT", "COUNT", "COUNTS", "DOZ", "DOZEN", "DOZ.", "PACK"}
LINK_FIELDS = [ LINK_FIELDS = [
"observed_product_id", "observed_product_id",
"canonical_product_id", "canonical_product_id",
@@ -91,26 +94,24 @@ def auto_link_rule(observed_row):
"high", "high",
) )
if (
observed_row.get("representative_name_norm")
and not observed_row.get("representative_size_value")
and not observed_row.get("representative_size_unit")
and not observed_row.get("representative_pack_qty")
):
return (
"exact_name",
"|".join(
[
f"name={observed_row['representative_name_norm']}",
f"measure={observed_row['representative_measure_type']}",
]
),
"medium",
)
return "", "", "" return "", "", ""
def clean_canonical_name(name):
tokens = []
for token in re.sub(r"[^A-Z0-9\s]", " ", (name or "").upper()).split():
if token.isdigit():
continue
if token in CANONICAL_DROP_TOKENS:
continue
if re.fullmatch(r"\d+(?:PK|PACK)", token):
continue
if re.fullmatch(r"\d+DZ", token):
continue
tokens.append(token)
return " ".join(tokens).strip()
def canonical_row_for_group(canonical_product_id, group_rows, link_method): def canonical_row_for_group(canonical_product_id, group_rows, link_method):
quantity_value, quantity_unit = normalized_quantity( quantity_value, quantity_unit = normalized_quantity(
{ {
@@ -130,7 +131,10 @@ def canonical_row_for_group(canonical_product_id, group_rows, link_method):
) )
return { return {
"canonical_product_id": canonical_product_id, "canonical_product_id": canonical_product_id,
"canonical_name": representative_value(group_rows, "representative_name_norm"), "canonical_name": clean_canonical_name(
representative_value(group_rows, "representative_name_norm")
)
or representative_value(group_rows, "representative_name_norm"),
"product_type": "", "product_type": "",
"brand": representative_value(group_rows, "representative_brand"), "brand": representative_value(group_rows, "representative_brand"),
"variant": representative_value(group_rows, "representative_variant"), "variant": representative_value(group_rows, "representative_variant"),
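To see what the new `clean_canonical_name` helper does to a name, here is a minimal standalone sketch. The function body and `CANONICAL_DROP_TOKENS` are copied from the hunk above; the sample product names are invented for illustration and do not come from the repository's data.

```python
import re

CANONICAL_DROP_TOKENS = {"CT", "COUNT", "COUNTS", "DOZ", "DOZEN", "DOZ.", "PACK"}


def clean_canonical_name(name):
    tokens = []
    # Uppercase, replace punctuation with spaces, then filter token by token.
    for token in re.sub(r"[^A-Z0-9\s]", " ", (name or "").upper()).split():
        if token.isdigit():
            continue  # bare counts such as "18"
        if token in CANONICAL_DROP_TOKENS:
            continue  # packaging words such as "CT" or "DOZEN"
        if re.fullmatch(r"\d+(?:PK|PACK)", token):
            continue  # fused pack sizes such as "12PK"
        if re.fullmatch(r"\d+DZ", token):
            continue  # fused dozen counts such as "2DZ"
        tokens.append(token)
    return " ".join(tokens).strip()


# Invented sample names, for illustration only.
for raw in ["Large Eggs 18 ct", "Paper Towels 12PK", "EGGS 2DZ", "12 CT"]:
    print(repr(raw), "->", repr(clean_canonical_name(raw)))
```

A name made up only of packaging tokens cleans to an empty string, which is why `canonical_row_for_group` falls back to the uncleaned representative name with `or`. Note also that punctuation is replaced before tokens are compared, so the `"DOZ."` entry in the drop set can never match as written; a trailing period is stripped first and the plain `"DOZ"` entry handles that case.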


@@ -3,11 +3,8 @@ from pathlib import Path
import click import click
import build_canonical_layer
import build_observed_products
import validate_cross_retailer_flow
from enrich_giant import format_decimal, to_decimal from enrich_giant import format_decimal, to_decimal
from layer_helpers import read_csv_rows, stable_id, write_csv_rows from layer_helpers import read_csv_rows, write_csv_rows
PURCHASE_FIELDS = [ PURCHASE_FIELDS = [
@@ -15,13 +12,18 @@ PURCHASE_FIELDS = [
"retailer", "retailer",
"order_id", "order_id",
"line_no", "line_no",
"observed_item_key", "normalized_row_id",
"observed_product_id", "normalized_item_id",
"canonical_product_id", "catalog_id",
"review_status", "review_status",
"resolution_action", "resolution_action",
"raw_item_name", "raw_item_name",
"normalized_item_name", "normalized_item_name",
"catalog_name",
"category",
"product_type",
"brand",
"variant",
"image_url", "image_url",
"retailer_item_id", "retailer_item_id",
"upc", "upc",
@@ -33,6 +35,8 @@ PURCHASE_FIELDS = [
"measure_type", "measure_type",
"line_total", "line_total",
"unit_price", "unit_price",
"matched_discount_amount",
"net_line_total",
"store_name", "store_name",
"store_number", "store_number",
"store_city", "store_city",
@@ -53,7 +57,7 @@ PURCHASE_FIELDS = [
EXAMPLE_FIELDS = [ EXAMPLE_FIELDS = [
"example_name", "example_name",
"canonical_product_id", "catalog_id",
"giant_purchase_date", "giant_purchase_date",
"giant_raw_item_name", "giant_raw_item_name",
"giant_price_per_lb", "giant_price_per_lb",
@@ -64,8 +68,8 @@ EXAMPLE_FIELDS = [
] ]
CATALOG_FIELDS = [ CATALOG_FIELDS = [
"canonical_product_id", "catalog_id",
"canonical_name", "catalog_name",
"category", "category",
"product_type", "product_type",
"brand", "brand",
@@ -79,9 +83,20 @@ CATALOG_FIELDS = [
"updated_at", "updated_at",
] ]
PRODUCT_LINK_FIELDS = [
"normalized_item_id",
"catalog_id",
"link_method",
"link_confidence",
"review_status",
"reviewed_by",
"reviewed_at",
"link_notes",
]
RESOLUTION_FIELDS = [ RESOLUTION_FIELDS = [
"observed_product_id", "normalized_item_id",
"canonical_product_id", "catalog_id",
"resolution_action", "resolution_action",
"status", "status",
"resolution_notes", "resolution_notes",
@@ -89,12 +104,8 @@ RESOLUTION_FIELDS = [
] ]
def decimal_or_zero(value):
return to_decimal(value) or Decimal("0")
def derive_metrics(row): def derive_metrics(row):
line_total = to_decimal(row.get("line_total")) line_total = to_decimal(row.get("net_line_total") or row.get("line_total"))
qty = to_decimal(row.get("qty")) qty = to_decimal(row.get("qty"))
pack_qty = to_decimal(row.get("pack_qty")) pack_qty = to_decimal(row.get("pack_qty"))
size_value = to_decimal(row.get("size_value")) size_value = to_decimal(row.get("size_value"))
@@ -160,10 +171,7 @@ def derive_metrics(row):
def order_lookup(rows, retailer): def order_lookup(rows, retailer):
return { return {(retailer, row["order_id"]): row for row in rows}
(retailer, row["order_id"]): row
for row in rows
}
def read_optional_csv_rows(path): def read_optional_csv_rows(path):
@@ -173,28 +181,10 @@ def read_optional_csv_rows(path):
return read_csv_rows(path) return read_csv_rows(path)
def load_resolution_lookup(resolution_rows): def normalize_catalog_row(row):
lookup = {}
for row in resolution_rows:
if not row.get("observed_product_id"):
continue
lookup[row["observed_product_id"]] = row
return lookup
def merge_catalog_rows(existing_rows, auto_rows):
merged = {}
for row in auto_rows + existing_rows:
canonical_product_id = row.get("canonical_product_id", "")
if canonical_product_id:
merged[canonical_product_id] = row
return sorted(merged.values(), key=lambda row: row["canonical_product_id"])
def catalog_row_from_canonical(row):
return { return {
"canonical_product_id": row.get("canonical_product_id", ""), "catalog_id": row.get("catalog_id") or row.get("canonical_product_id", ""),
"canonical_name": row.get("canonical_name", ""), "catalog_name": row.get("catalog_name") or row.get("canonical_name", ""),
"category": row.get("category", ""), "category": row.get("category", ""),
"product_type": row.get("product_type", ""), "product_type": row.get("product_type", ""),
"brand": row.get("brand", ""), "brand": row.get("brand", ""),
@@ -209,24 +199,67 @@ def catalog_row_from_canonical(row):
} }
def build_link_state(enriched_rows): def is_review_first_catalog_row(row):
observed_rows = build_observed_products.build_observed_products(enriched_rows) notes = row.get("notes", "").strip().lower()
canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows) if notes.startswith("auto-linked via"):
giant_row, costco_row = validate_cross_retailer_flow.find_proof_pair(observed_rows) return False
canonical_rows, link_rows, _proof_rows = validate_cross_retailer_flow.merge_proof_pair( return True
canonical_rows,
link_rows,
giant_row,
costco_row,
)
observed_id_by_key = {
row["observed_key"]: row["observed_product_id"] for row in observed_rows def normalize_link_row(row):
return {
"normalized_item_id": row.get("normalized_item_id", ""),
"catalog_id": row.get("catalog_id") or row.get("canonical_product_id", ""),
"link_method": row.get("link_method", ""),
"link_confidence": row.get("link_confidence", ""),
"review_status": row.get("review_status", ""),
"reviewed_by": row.get("reviewed_by", ""),
"reviewed_at": row.get("reviewed_at", ""),
"link_notes": row.get("link_notes", ""),
} }
canonical_id_by_observed = {
row["observed_product_id"]: row["canonical_product_id"] for row in link_rows
def normalize_resolution_row(row):
return {
"normalized_item_id": row.get("normalized_item_id", ""),
"catalog_id": row.get("catalog_id") or row.get("canonical_product_id", ""),
"resolution_action": row.get("resolution_action", ""),
"status": row.get("status", ""),
"resolution_notes": row.get("resolution_notes", ""),
"reviewed_at": row.get("reviewed_at", ""),
} }
return observed_rows, canonical_rows, link_rows, observed_id_by_key, canonical_id_by_observed
def load_resolution_lookup(resolution_rows):
lookup = {}
for row in resolution_rows:
normalized_row = normalize_resolution_row(row)
normalized_item_id = normalized_row.get("normalized_item_id", "")
if not normalized_item_id:
continue
lookup[normalized_item_id] = normalized_row
return lookup
def merge_catalog_rows(existing_rows, new_rows):
merged = {}
for row in existing_rows + new_rows:
normalized_row = normalize_catalog_row(row)
catalog_id = normalized_row.get("catalog_id", "")
if catalog_id:
merged[catalog_id] = normalized_row
return sorted(merged.values(), key=lambda row: row["catalog_id"])
def load_link_lookup(link_rows):
lookup = {}
for row in link_rows:
normalized_row = normalize_link_row(row)
normalized_item_id = normalized_row.get("normalized_item_id", "")
if not normalized_item_id:
continue
lookup[normalized_item_id] = normalized_row
return lookup
def build_purchase_rows( def build_purchase_rows(
@@ -235,25 +268,37 @@ def build_purchase_rows(
giant_orders, giant_orders,
costco_orders, costco_orders,
resolution_rows, resolution_rows,
link_rows=None,
catalog_rows=None,
): ):
all_enriched_rows = giant_enriched_rows + costco_enriched_rows all_enriched_rows = giant_enriched_rows + costco_enriched_rows
(
observed_rows,
canonical_rows,
link_rows,
observed_id_by_key,
canonical_id_by_observed,
) = build_link_state(all_enriched_rows)
resolution_lookup = load_resolution_lookup(resolution_rows) resolution_lookup = load_resolution_lookup(resolution_rows)
for observed_product_id, resolution in resolution_lookup.items(): link_lookup = load_link_lookup(link_rows or [])
catalog_lookup = {
row["catalog_id"]: normalize_catalog_row(row)
for row in (catalog_rows or [])
if normalize_catalog_row(row).get("catalog_id")
}
for normalized_item_id, resolution in resolution_lookup.items():
action = resolution.get("resolution_action", "") action = resolution.get("resolution_action", "")
status = resolution.get("status", "") status = resolution.get("status", "")
if status != "approved": if status != "approved":
continue continue
if action in {"link", "create"} and resolution.get("canonical_product_id"): if action in {"link", "create"} and resolution.get("catalog_id"):
canonical_id_by_observed[observed_product_id] = resolution["canonical_product_id"] link_lookup[normalized_item_id] = {
"normalized_item_id": normalized_item_id,
"catalog_id": resolution["catalog_id"],
"link_method": f"manual_{action}",
"link_confidence": "high",
"review_status": status,
"reviewed_by": "",
"reviewed_at": resolution.get("reviewed_at", ""),
"link_notes": resolution.get("resolution_notes", ""),
}
elif action == "exclude": elif action == "exclude":
canonical_id_by_observed[observed_product_id] = "" link_lookup.pop(normalized_item_id, None)
orders_by_id = {} orders_by_id = {}
orders_by_id.update(order_lookup(giant_orders, "giant")) orders_by_id.update(order_lookup(giant_orders, "giant"))
orders_by_id.update(order_lookup(costco_orders, "costco")) orders_by_id.update(order_lookup(costco_orders, "costco"))
@@ -263,24 +308,30 @@ def build_purchase_rows(
all_enriched_rows, all_enriched_rows,
key=lambda item: (item["order_date"], item["retailer"], item["order_id"], int(item["line_no"])), key=lambda item: (item["order_date"], item["retailer"], item["order_id"], int(item["line_no"])),
): ):
observed_key = build_observed_products.build_observed_key(row) normalized_item_id = row.get("normalized_item_id", "")
observed_product_id = observed_id_by_key.get(observed_key, "") resolution = resolution_lookup.get(normalized_item_id, {})
link_row = link_lookup.get(normalized_item_id, {})
catalog_row = catalog_lookup.get(link_row.get("catalog_id", ""), {})
order_row = orders_by_id.get((row["retailer"], row["order_id"]), {}) order_row = orders_by_id.get((row["retailer"], row["order_id"]), {})
metrics = derive_metrics(row) metrics = derive_metrics(row)
resolution = resolution_lookup.get(observed_product_id, {})
purchase_rows.append( purchase_rows.append(
{ {
"purchase_date": row["order_date"], "purchase_date": row["order_date"],
"retailer": row["retailer"], "retailer": row["retailer"],
"order_id": row["order_id"], "order_id": row["order_id"],
"line_no": row["line_no"], "line_no": row["line_no"],
"observed_item_key": row["observed_item_key"], "normalized_row_id": row.get("normalized_row_id", ""),
"observed_product_id": observed_product_id, "normalized_item_id": normalized_item_id,
"canonical_product_id": canonical_id_by_observed.get(observed_product_id, ""), "catalog_id": link_row.get("catalog_id", ""),
"review_status": resolution.get("status", ""), "review_status": resolution.get("status", ""),
"resolution_action": resolution.get("resolution_action", ""), "resolution_action": resolution.get("resolution_action", ""),
"raw_item_name": row["item_name"], "raw_item_name": row["item_name"],
"normalized_item_name": row["item_name_norm"], "normalized_item_name": row["item_name_norm"],
"catalog_name": catalog_row.get("catalog_name", ""),
"category": catalog_row.get("category", ""),
"product_type": catalog_row.get("product_type", ""),
"brand": catalog_row.get("brand", ""),
"variant": catalog_row.get("variant", ""),
"image_url": row.get("image_url", ""), "image_url": row.get("image_url", ""),
"retailer_item_id": row["retailer_item_id"], "retailer_item_id": row["retailer_item_id"],
"upc": row["upc"], "upc": row["upc"],
@@ -292,6 +343,8 @@ def build_purchase_rows(
"measure_type": row["measure_type"], "measure_type": row["measure_type"],
"line_total": row["line_total"], "line_total": row["line_total"],
"unit_price": row["unit_price"], "unit_price": row["unit_price"],
"matched_discount_amount": row.get("matched_discount_amount", ""),
"net_line_total": row.get("net_line_total", ""),
"store_name": order_row.get("store_name", ""), "store_name": order_row.get("store_name", ""),
"store_number": order_row.get("store_number", ""), "store_number": order_row.get("store_number", ""),
"store_city": order_row.get("store_city", ""), "store_city": order_row.get("store_city", ""),
@@ -303,33 +356,7 @@ def build_purchase_rows(
**metrics, **metrics,
} }
) )
return purchase_rows, observed_rows, canonical_rows, link_rows return purchase_rows, sorted(link_lookup.values(), key=lambda row: row["normalized_item_id"])
def apply_manual_resolutions_to_links(link_rows, resolution_rows):
link_by_observed = {row["observed_product_id"]: dict(row) for row in link_rows}
for resolution in resolution_rows:
if resolution.get("status") != "approved":
continue
observed_product_id = resolution.get("observed_product_id", "")
action = resolution.get("resolution_action", "")
if not observed_product_id:
continue
if action == "exclude":
link_by_observed.pop(observed_product_id, None)
continue
if action in {"link", "create"} and resolution.get("canonical_product_id"):
link_by_observed[observed_product_id] = {
"observed_product_id": observed_product_id,
"canonical_product_id": resolution["canonical_product_id"],
"link_method": f"manual_{action}",
"link_confidence": "high",
"review_status": resolution.get("status", ""),
"reviewed_by": "",
"reviewed_at": resolution.get("reviewed_at", ""),
"link_notes": resolution.get("resolution_notes", ""),
}
return sorted(link_by_observed.values(), key=lambda row: row["observed_product_id"])
def build_comparison_examples(purchase_rows): def build_comparison_examples(purchase_rows):
@@ -338,7 +365,7 @@ def build_comparison_examples(purchase_rows):
for row in purchase_rows: for row in purchase_rows:
if row.get("normalized_item_name") != "BANANA": if row.get("normalized_item_name") != "BANANA":
continue continue
if not row.get("canonical_product_id"): if not row.get("catalog_id"):
continue continue
if row["retailer"] == "giant" and row.get("price_per_lb"): if row["retailer"] == "giant" and row.get("price_per_lb"):
giant_banana = row giant_banana = row
@@ -351,7 +378,7 @@ def build_comparison_examples(purchase_rows):
return [ return [
{ {
"example_name": "banana_price_per_lb", "example_name": "banana_price_per_lb",
"canonical_product_id": giant_banana["canonical_product_id"], "catalog_id": giant_banana["catalog_id"],
"giant_purchase_date": giant_banana["purchase_date"], "giant_purchase_date": giant_banana["purchase_date"],
"giant_raw_item_name": giant_banana["raw_item_name"], "giant_raw_item_name": giant_banana["raw_item_name"],
"giant_price_per_lb": giant_banana["price_per_lb"], "giant_price_per_lb": giant_banana["price_per_lb"],
@@ -364,15 +391,15 @@ def build_comparison_examples(purchase_rows):
@click.command() @click.command()
@click.option("--giant-items-enriched-csv", default="giant_output/items_enriched.csv", show_default=True) @click.option("--giant-items-enriched-csv", default="data/giant-web/normalized_items.csv", show_default=True)
@click.option("--costco-items-enriched-csv", default="costco_output/items_enriched.csv", show_default=True) @click.option("--costco-items-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--giant-orders-csv", default="giant_output/orders.csv", show_default=True) @click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="costco_output/orders.csv", show_default=True) @click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True) @click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="combined_output/canonical_catalog.csv", show_default=True) @click.option("--catalog-csv", default="data/catalog.csv", show_default=True)
@click.option("--links-csv", default="combined_output/product_links.csv", show_default=True) @click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--output-csv", default="combined_output/purchases.csv", show_default=True) @click.option("--output-csv", default="data/review/purchases.csv", show_default=True)
@click.option("--examples-csv", default="combined_output/comparison_examples.csv", show_default=True) @click.option("--examples-csv", default="data/review/comparison_examples.csv", show_default=True)
def main( def main(
giant_items_enriched_csv, giant_items_enriched_csv,
costco_items_enriched_csv, costco_items_enriched_csv,
@@ -385,27 +412,29 @@ def main(
examples_csv, examples_csv,
): ):
resolution_rows = read_optional_csv_rows(resolutions_csv) resolution_rows = read_optional_csv_rows(resolutions_csv)
purchase_rows, _observed_rows, canonical_rows, link_rows = build_purchase_rows( catalog_rows = merge_catalog_rows(
[row for row in read_optional_csv_rows(catalog_csv) if is_review_first_catalog_row(row)],
[],
)
existing_links = [normalize_link_row(row) for row in read_optional_csv_rows(links_csv)]
purchase_rows, link_rows = build_purchase_rows(
read_csv_rows(giant_items_enriched_csv), read_csv_rows(giant_items_enriched_csv),
read_csv_rows(costco_items_enriched_csv), read_csv_rows(costco_items_enriched_csv),
read_csv_rows(giant_orders_csv), read_csv_rows(giant_orders_csv),
read_csv_rows(costco_orders_csv), read_csv_rows(costco_orders_csv),
resolution_rows, resolution_rows,
existing_links,
catalog_rows,
) )
existing_catalog_rows = read_optional_csv_rows(catalog_csv)
merged_catalog_rows = merge_catalog_rows(
existing_catalog_rows,
[catalog_row_from_canonical(row) for row in canonical_rows],
)
link_rows = apply_manual_resolutions_to_links(link_rows, resolution_rows)
example_rows = build_comparison_examples(purchase_rows) example_rows = build_comparison_examples(purchase_rows)
write_csv_rows(catalog_csv, merged_catalog_rows, CATALOG_FIELDS) write_csv_rows(catalog_csv, catalog_rows, CATALOG_FIELDS)
write_csv_rows(links_csv, link_rows, build_canonical_layer.LINK_FIELDS) write_csv_rows(links_csv, link_rows, PRODUCT_LINK_FIELDS)
write_csv_rows(output_csv, purchase_rows, PURCHASE_FIELDS) write_csv_rows(output_csv, purchase_rows, PURCHASE_FIELDS)
write_csv_rows(examples_csv, example_rows, EXAMPLE_FIELDS) write_csv_rows(examples_csv, example_rows, EXAMPLE_FIELDS)
click.echo( click.echo(
f"wrote {len(purchase_rows)} purchase rows to {output_csv}, " f"wrote {len(purchase_rows)} purchase rows to {output_csv}, "
f"{len(merged_catalog_rows)} catalog rows to {catalog_csv}, " f"{len(catalog_rows)} catalog rows to {catalog_csv}, "
f"{len(link_rows)} product links to {links_csv}, "
f"and {len(example_rows)} comparison examples to {examples_csv}" f"and {len(example_rows)} comparison examples to {examples_csv}"
) )
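The way approved review decisions now override stored links in `build_purchase_rows` can be sketched in isolation. This is a simplified illustration, not the repository's code: `apply_resolutions` is a hypothetical helper name, it copies the lookup instead of mutating it in place as the diff does, it omits the review metadata fields, and the ids in the sample data are invented.

```python
def apply_resolutions(link_lookup, resolution_lookup):
    """Overlay approved manual resolutions onto existing links (hypothetical helper)."""
    links = dict(link_lookup)
    for normalized_item_id, resolution in resolution_lookup.items():
        # Only approved decisions change the link table.
        if resolution.get("status") != "approved":
            continue
        action = resolution.get("resolution_action", "")
        if action in {"link", "create"} and resolution.get("catalog_id"):
            links[normalized_item_id] = {
                "normalized_item_id": normalized_item_id,
                "catalog_id": resolution["catalog_id"],
                "link_method": f"manual_{action}",
                "link_confidence": "high",
            }
        elif action == "exclude":
            # An excluded item loses any link it had.
            links.pop(normalized_item_id, None)
    return links


# Invented ids, for illustration only.
existing_links = {
    "item_a": {"normalized_item_id": "item_a", "catalog_id": "cat_1"},
    "item_b": {"normalized_item_id": "item_b", "catalog_id": "cat_2"},
}
resolutions = {
    "item_a": {"status": "approved", "resolution_action": "exclude"},
    "item_b": {"status": "pending", "resolution_action": "link", "catalog_id": "cat_9"},
    "item_c": {"status": "approved", "resolution_action": "create", "catalog_id": "cat_3"},
}
result = apply_resolutions(existing_links, resolutions)
```

In this sketch `item_a` is dropped, `item_b` keeps its stored link because its resolution is not approved, and `item_c` gains a `manual_create` link. Keying everything on `normalized_item_id` is what lets the diff delete the separate `apply_manual_resolutions_to_links` pass.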

65
collect_costco_web.py Normal file

@@ -0,0 +1,65 @@
import click
import scrape_costco
@click.command()
@click.option(
"--outdir",
default="data/costco-web",
show_default=True,
help="Directory for Costco raw and collected outputs.",
)
@click.option(
"--document-type",
default="all",
show_default=True,
help="Summary document type.",
)
@click.option(
"--document-sub-type",
default="all",
show_default=True,
help="Summary document sub type.",
)
@click.option(
"--window-days",
default=92,
show_default=True,
type=int,
help="Maximum number of days to request per summary window.",
)
@click.option(
"--months-back",
default=36,
show_default=True,
type=int,
help="How many months of receipts to enumerate back from today.",
)
@click.option(
"--firefox-profile-dir",
default=None,
help="Firefox profile directory to use for cookies and session storage.",
)
def main(
outdir,
document_type,
document_sub_type,
window_days,
months_back,
firefox_profile_dir,
):
scrape_costco.run_collection(
outdir=outdir,
document_type=document_type,
document_sub_type=document_sub_type,
window_days=window_days,
months_back=months_back,
firefox_profile_dir=firefox_profile_dir,
orders_filename="collected_orders.csv",
items_filename="collected_items.csv",
)
if __name__ == "__main__":
main()

34
collect_giant_web.py Normal file

@@ -0,0 +1,34 @@
import click
import scrape_giant
@click.command()
@click.option("--user-id", default=None, help="Giant user id.")
@click.option("--loyalty", default=None, help="Giant loyalty number.")
@click.option(
"--outdir",
default="data/giant-web",
show_default=True,
help="Directory for raw json and collected csv outputs.",
)
@click.option(
"--sleep-seconds",
default=1.5,
show_default=True,
type=float,
help="Delay between order detail requests.",
)
def main(user_id, loyalty, outdir, sleep_seconds):
scrape_giant.run_collection(
user_id,
loyalty,
outdir,
sleep_seconds,
orders_filename="collected_orders.csv",
items_filename="collected_items.csv",
)
if __name__ == "__main__":
main()


@@ -1,13 +1,17 @@
 import csv
 import json
 import re
+from collections import defaultdict
 from pathlib import Path

 import click

 from enrich_giant import (
     OUTPUT_FIELDS,
+    derive_normalized_quantity,
+    derive_price_fields,
     format_decimal,
+    normalization_identity,
     normalize_number,
     normalize_unit,
     normalize_whitespace,
@@ -26,9 +30,15 @@ CODE_TOKEN_RE = re.compile(
 )
 PACK_FRACTION_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT)\b")
 HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#\b")
+ITEM_CODE_RE = re.compile(r"#\w+\b")
+DUAL_WEIGHT_RE = re.compile(
+    r"\b\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\s*/\s*\d+(?:\.\d+)?\s*(?:KG|G|LB|LBS|OZ)\b"
+)
+LOGISTICS_SLASH_RE = re.compile(r"\b(?:T\d+/H\d+(?:/P\d+)?/?|H\d+/P\d+/?|T\d+/H\d+/?)\b")
 PACK_DASH_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*-\s*PACK\b")
 PACK_WORD_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*PACK\b")
 SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT|KG|G)\b")
+DISCOUNT_TARGET_RE = re.compile(r"^/\s*(\d+)\b")


 def clean_costco_name(name):
@@ -93,12 +103,17 @@ def normalize_costco_name(cleaned_name):
         base = PACK_FRACTION_RE.sub(" ", base)
     else:
         base = SIZE_RE.sub(" ", base)
+    base = DUAL_WEIGHT_RE.sub(" ", base)
     base = HASH_SIZE_RE.sub(" ", base)
+    base = ITEM_CODE_RE.sub(" ", base)
+    base = LOGISTICS_SLASH_RE.sub(" ", base)
     base = PACK_DASH_RE.sub(" ", base)
     base = PACK_WORD_RE.sub(" ", base)
     base = normalize_whitespace(base)
     tokens = []
     for token in base.split():
+        if token in {"/", "-"}:
+            continue
         if token in {"ORG"}:
             continue
         if token in {"PEANUT", "BUTTER"} and "JIF" in base:
@@ -156,6 +171,13 @@ def is_discount_item(item):
     return amount < 0 or unit < 0 or description.startswith("/")


+def discount_target_id(raw_name):
+    match = DISCOUNT_TARGET_RE.match(normalize_whitespace(raw_name))
+    if not match:
+        return ""
+    return match.group(1)
+
+
 def parse_costco_item(order_id, order_date, raw_path, line_no, item):
     raw_name = combine_description(item)
     cleaned_name = clean_costco_name(raw_name)
@@ -168,12 +190,42 @@ def parse_costco_item(order_id, order_date, raw_path, line_no, item):
     price_per_each, price_per_lb, price_per_oz = derive_costco_prices(
         item, measure_type, size_value, size_unit, pack_qty
     )
+    normalized_row_id = f"{RETAILER}:{order_id}:{line_no}"
+    normalized_quantity, normalized_quantity_unit = derive_normalized_quantity(
+        size_value,
+        size_unit,
+        pack_qty,
+        measure_type,
+    )
+    identity_key, normalization_basis = normalization_identity(
+        {
+            "retailer": RETAILER,
+            "normalized_row_id": normalized_row_id,
+            "upc": "",
+            "retailer_item_id": str(item.get("itemNumber", "")),
+            "item_name_norm": item_name_norm,
+            "size_value": size_value,
+            "size_unit": size_unit,
+            "pack_qty": pack_qty,
+        }
+    )
+    price_fields = derive_price_fields(
+        price_per_each,
+        price_per_lb,
+        price_per_oz,
+        str(item.get("amount", "")),
+        str(item.get("unit", "")),
+        pack_qty,
+    )

     return {
         "retailer": RETAILER,
         "order_id": str(order_id),
         "line_no": str(line_no),
-        "observed_item_key": f"{RETAILER}:{order_id}:{line_no}",
+        "normalized_row_id": normalized_row_id,
+        "normalized_item_id": f"cnorm:{identity_key}",
+        "normalization_basis": normalization_basis,
+        "observed_item_key": normalized_row_id,
         "order_date": normalize_whitespace(order_date),
         "retailer_item_id": str(item.get("itemNumber", "")),
         "pod_id": "",
@@ -190,6 +242,8 @@ def parse_costco_item(order_id, order_date, raw_path, line_no, item):
         "reward_savings": "",
         "coupon_savings": str(item.get("amount", "")) if is_discount_line else "",
         "coupon_price": "",
+        "matched_discount_amount": "",
+        "net_line_total": str(item.get("amount", "")) if not is_discount_line else "",
         "image_url": "",
         "raw_order_path": raw_path.as_posix(),
         "item_name_norm": item_name_norm,
@@ -199,18 +253,64 @@ def parse_costco_item(order_id, order_date, raw_path, line_no, item):
         "size_unit": size_unit,
         "pack_qty": pack_qty,
         "measure_type": measure_type,
+        "normalized_quantity": normalized_quantity,
+        "normalized_quantity_unit": normalized_quantity_unit,
         "is_store_brand": "true" if brand_guess else "false",
+        "is_item": "false" if is_discount_line else "true",
         "is_fee": "false",
         "is_discount_line": "true" if is_discount_line else "false",
         "is_coupon_line": is_coupon_line,
-        "price_per_each": price_per_each,
-        "price_per_lb": price_per_lb,
-        "price_per_oz": price_per_oz,
+        **price_fields,
         "parse_version": PARSER_VERSION,
         "parse_notes": "",
     }


+def match_costco_discounts(rows):
+    rows_by_order = defaultdict(list)
+    for row in rows:
+        rows_by_order[row["order_id"]].append(row)
+
+    for order_rows in rows_by_order.values():
+        purchase_rows_by_item_id = defaultdict(list)
+        for row in order_rows:
+            if row.get("is_discount_line") == "true":
+                continue
+            retailer_item_id = row.get("retailer_item_id", "")
+            if retailer_item_id:
+                purchase_rows_by_item_id[retailer_item_id].append(row)
+
+        for row in order_rows:
+            if row.get("is_discount_line") != "true":
+                continue
+            target_id = discount_target_id(row.get("item_name", ""))
+            if not target_id:
+                continue
+
+            matches = purchase_rows_by_item_id.get(target_id, [])
+            if len(matches) != 1:
+                row["parse_notes"] = normalize_whitespace(
+                    f"{row.get('parse_notes', '')};discount_target_unmatched={target_id}"
+                ).strip(";")
+                continue
+
+            purchase_row = matches[0]
+            matched_discount = to_decimal(row.get("line_total"))
+            gross_total = to_decimal(purchase_row.get("line_total"))
+            existing_discount = to_decimal(purchase_row.get("matched_discount_amount")) or 0
+            if matched_discount is None or gross_total is None:
+                continue
+            total_discount = existing_discount + matched_discount
+            purchase_row["matched_discount_amount"] = format_decimal(total_discount)
+            purchase_row["net_line_total"] = format_decimal(gross_total + total_discount)
+            purchase_row["parse_notes"] = normalize_whitespace(
+                f"{purchase_row.get('parse_notes', '')};matched_discount={target_id}"
+            ).strip(";")
+            row["parse_notes"] = normalize_whitespace(
+                f"{row.get('parse_notes', '')};matched_to_item={target_id}"
+            ).strip(";")
+
+
 def iter_costco_rows(raw_dir):
     for path in discover_json_files(raw_dir):
         if path.name in {"summary.json", "summary_requests.json"}:
@@ -238,6 +338,7 @@ def discover_json_files(raw_dir):

 def build_items_enriched(raw_dir):
     rows = list(iter_costco_rows(raw_dir))
+    match_costco_discounts(rows)
     rows.sort(key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])))
     return rows

@@ -264,6 +365,7 @@ def write_csv(path, rows):
     help="CSV path for enriched Costco item rows.",
 )
 def main(input_dir, output_csv):
+    click.echo("legacy entrypoint: prefer normalize_costco_web.py for data-model outputs")
     rows = build_items_enriched(Path(input_dir))
     write_csv(Path(output_csv), rows)
     click.echo(f"wrote {len(rows)} rows to {output_csv}")

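Reviewer sketch of the discount-attachment rule added above, assuming a single order and pre-parsed `Decimal` amounts. The function name `attach_discounts`, the boolean `is_discount_line` flag, the `matched_discount` field, and the sample rows are hypothetical simplifications for illustration; the real `match_costco_discounts` works on string fields, groups rows by `order_id`, and records `parse_notes`.

```python
import re
from collections import defaultdict
from decimal import Decimal

# Same pattern as DISCOUNT_TARGET_RE in the diff: a leading "/" followed by digits.
DISCOUNT_TARGET_RE = re.compile(r"^/\s*(\d+)\b")


def discount_target_id(raw_name):
    # Collapse whitespace first, as the diff does via normalize_whitespace.
    match = DISCOUNT_TARGET_RE.match(" ".join(raw_name.split()))
    return match.group(1) if match else ""


def attach_discounts(rows):
    """Attach each discount row to the single purchase row it targets."""
    purchases = defaultdict(list)
    for row in rows:
        if not row["is_discount_line"] and row["retailer_item_id"]:
            purchases[row["retailer_item_id"]].append(row)
    for row in rows:
        if not row["is_discount_line"]:
            continue
        target = discount_target_id(row["item_name"])
        matches = purchases.get(target, [])
        if len(matches) != 1:
            continue  # missing or ambiguous target: leave both rows untouched
        purchase = matches[0]
        # Discount amounts are negative, so adding them lowers the net total.
        purchase["matched_discount"] += row["line_total"]
        purchase["net_line_total"] = purchase["line_total"] + purchase["matched_discount"]


# Hypothetical sample rows, not taken from real receipts.
rows = [
    {"item_name": "KS OLIVE OIL", "retailer_item_id": "12345",
     "line_total": Decimal("18.99"), "is_discount_line": False,
     "matched_discount": Decimal("0")},
    {"item_name": "/ 12345", "retailer_item_id": "999",
     "line_total": Decimal("-4.00"), "is_discount_line": True,
     "matched_discount": Decimal("0")},
]
attach_discounts(rows)
print(rows[0]["net_line_total"])  # 14.99
```

Requiring exactly one matching purchase row keeps the attachment deterministic: when an order contains the same item number twice, the discount stays a standalone row instead of being assigned by guesswork.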

@@ -16,6 +16,9 @@ OUTPUT_FIELDS = [
     "retailer",
     "order_id",
     "line_no",
+    "normalized_row_id",
+    "normalized_item_id",
+    "normalization_basis",
     "observed_item_key",
     "order_date",
     "retailer_item_id",
@@ -33,6 +36,8 @@ OUTPUT_FIELDS = [
     "reward_savings",
     "coupon_savings",
     "coupon_price",
+    "matched_discount_amount",
+    "net_line_total",
     "image_url",
     "raw_order_path",
     "item_name_norm",
@@ -42,13 +47,21 @@ OUTPUT_FIELDS = [
     "size_unit",
     "pack_qty",
     "measure_type",
+    "normalized_quantity",
+    "normalized_quantity_unit",
     "is_store_brand",
+    "is_item",
     "is_fee",
     "is_discount_line",
     "is_coupon_line",
     "price_per_each",
+    "price_per_each_basis",
+    "price_per_count",
+    "price_per_count_basis",
     "price_per_lb",
+    "price_per_lb_basis",
     "price_per_oz",
+    "price_per_oz_basis",
     "parse_version",
     "parse_notes",
 ]
@@ -327,6 +340,65 @@ def derive_prices(item, measure_type, size_value="", size_unit="", pack_qty=""):
     return price_per_each, price_per_lb, price_per_oz


+def derive_normalized_quantity(size_value, size_unit, pack_qty, measure_type):
+    parsed_size = to_decimal(size_value)
+    parsed_pack = to_decimal(pack_qty) or Decimal("1")
+
+    if parsed_size not in (None, Decimal("0")) and size_unit:
+        return format_decimal(parsed_size * parsed_pack), size_unit
+    if parsed_pack not in (None, Decimal("0")) and measure_type == "count":
+        return format_decimal(parsed_pack), "count"
+    if measure_type == "each":
+        return "1", "each"
+    return "", ""
+
+
+def derive_price_fields(price_per_each, price_per_lb, price_per_oz, line_total, qty, pack_qty):
+    line_total_decimal = to_decimal(line_total)
+    qty_decimal = to_decimal(qty)
+    pack_decimal = to_decimal(pack_qty)
+    price_per_count = ""
+    price_per_count_basis = ""
+    if line_total_decimal is not None and qty_decimal not in (None, Decimal("0")) and pack_decimal not in (
+        None,
+        Decimal("0"),
+    ):
+        price_per_count = format_decimal(line_total_decimal / (qty_decimal * pack_decimal))
+        price_per_count_basis = "line_total_over_pack_qty"
+
+    return {
+        "price_per_each": price_per_each,
+        "price_per_each_basis": "line_total_over_qty" if price_per_each else "",
+        "price_per_count": price_per_count,
+        "price_per_count_basis": price_per_count_basis,
+        "price_per_lb": price_per_lb,
+        "price_per_lb_basis": "parsed_or_picked_weight" if price_per_lb else "",
+        "price_per_oz": price_per_oz,
+        "price_per_oz_basis": "parsed_or_picked_weight" if price_per_oz else "",
+    }
+
+
+def normalization_identity(row):
+    if row.get("upc"):
+        return f"{row['retailer']}|upc={row['upc']}", "exact_upc"
+    if row.get("retailer_item_id"):
+        return f"{row['retailer']}|retailer_item_id={row['retailer_item_id']}", "exact_retailer_item_id"
+    if row.get("item_name_norm"):
+        return (
+            "|".join(
+                [
+                    row["retailer"],
+                    f"name={row['item_name_norm']}",
+                    f"size={row.get('size_value', '')}",
+                    f"unit={row.get('size_unit', '')}",
+                    f"pack={row.get('pack_qty', '')}",
+                ]
+            ),
+            "exact_name_size_pack",
+        )
+    return row["normalized_row_id"], "row_identity"
+
+
 def parse_item(order_id, order_date, raw_path, line_no, item):
     cleaned_name = clean_item_name(item.get("itemName", ""))
     size_value, size_unit, pack_qty = parse_size_and_pack(cleaned_name)
@@ -350,11 +422,42 @@ def parse_item(order_id, order_date, raw_path, line_no, item):
     if size_value and not size_unit:
         parse_notes.append("size_without_unit")

+    normalized_row_id = f"{RETAILER}:{order_id}:{line_no}"
+    normalized_quantity, normalized_quantity_unit = derive_normalized_quantity(
+        size_value,
+        size_unit,
+        pack_qty,
+        measure_type,
+    )
+    identity_key, normalization_basis = normalization_identity(
+        {
+            "retailer": RETAILER,
+            "normalized_row_id": normalized_row_id,
+            "upc": stringify(item.get("primUpcCd")),
+            "retailer_item_id": stringify(item.get("podId")),
+            "item_name_norm": normalized_name,
+            "size_value": size_value,
+            "size_unit": size_unit,
+            "pack_qty": pack_qty,
+        }
+    )
+    price_fields = derive_price_fields(
+        price_per_each,
+        price_per_lb,
+        price_per_oz,
+        stringify(item.get("groceryAmount")),
+        stringify(item.get("shipQy")),
+        pack_qty,
+    )
+
     return {
         "retailer": RETAILER,
         "order_id": str(order_id),
         "line_no": str(line_no),
-        "observed_item_key": f"{RETAILER}:{order_id}:{line_no}",
+        "normalized_row_id": normalized_row_id,
+        "normalized_item_id": f"gnorm:{identity_key}",
+        "normalization_basis": normalization_basis,
+        "observed_item_key": normalized_row_id,
         "order_date": normalize_whitespace(order_date),
         "retailer_item_id": stringify(item.get("podId")),
         "pod_id": stringify(item.get("podId")),
@@ -371,6 +474,8 @@ def parse_item(order_id, order_date, raw_path, line_no, item):
         "reward_savings": stringify(item.get("rewardSavings")),
         "coupon_savings": stringify(item.get("couponSavings")),
         "coupon_price": stringify(item.get("couponPrice")),
+        "matched_discount_amount": "",
+        "net_line_total": stringify(item.get("totalPrice")),
         "image_url": extract_image_url(item),
         "raw_order_path": raw_path.as_posix(),
         "item_name_norm": normalized_name,
@@ -380,13 +485,14 @@ def parse_item(order_id, order_date, raw_path, line_no, item):
         "size_unit": size_unit,
         "pack_qty": pack_qty,
         "measure_type": measure_type,
+        "normalized_quantity": normalized_quantity,
+        "normalized_quantity_unit": normalized_quantity_unit,
         "is_store_brand": "true" if bool(prefix) else "false",
+        "is_item": "false" if is_fee else "true",
         "is_fee": "true" if is_fee else "false",
         "is_discount_line": "false",
         "is_coupon_line": "false",
-        "price_per_each": price_per_each,
-        "price_per_lb": price_per_lb,
-        "price_per_oz": price_per_oz,
+        **price_fields,
         "parse_version": PARSER_VERSION,
         "parse_notes": ";".join(parse_notes),
     }
@@ -439,6 +545,7 @@ def write_csv(path, rows):
     help="CSV path for enriched Giant item rows.",
 )
 def main(input_dir, output_csv):
+    click.echo("legacy entrypoint: prefer normalize_giant_web.py for data-model outputs")
     raw_dir = Path(input_dir)
     output_path = Path(output_csv)

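Reviewer sketch of how the `normalization_identity` precedence added above plays out: UPC first, then retailer item id, then name/size/pack, and finally the row id itself. The function body mirrors the diff; the sample rows and their values are hypothetical.

```python
def normalization_identity(row):
    # Strongest evidence wins; each branch also reports which basis was used.
    if row.get("upc"):
        return f"{row['retailer']}|upc={row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return (
            f"{row['retailer']}|retailer_item_id={row['retailer_item_id']}",
            "exact_retailer_item_id",
        )
    if row.get("item_name_norm"):
        key = "|".join(
            [
                row["retailer"],
                f"name={row['item_name_norm']}",
                f"size={row.get('size_value', '')}",
                f"unit={row.get('size_unit', '')}",
                f"pack={row.get('pack_qty', '')}",
            ]
        )
        return key, "exact_name_size_pack"
    # No usable evidence: the row only groups with itself.
    return row["normalized_row_id"], "row_identity"


# Hypothetical rows, one per branch.
examples = [
    {"retailer": "giant", "normalized_row_id": "giant:1001:1",
     "upc": "0001111", "retailer_item_id": "55", "item_name_norm": "MILK"},
    {"retailer": "costco", "normalized_row_id": "costco:2002:3",
     "upc": "", "retailer_item_id": "12345", "item_name_norm": "OLIVE OIL"},
    {"retailer": "giant", "normalized_row_id": "giant:1001:2",
     "upc": "", "retailer_item_id": "", "item_name_norm": "BANANAS",
     "size_value": "", "size_unit": "lb", "pack_qty": ""},
    {"retailer": "giant", "normalized_row_id": "giant:1001:4",
     "upc": "", "retailer_item_id": "", "item_name_norm": ""},
]
results = [normalization_identity(row) for row in examples]
for key, basis in results:
    print(f"{basis:24} {key}")
```

Because the last branch falls back to `normalized_row_id`, a row with no evidence never merges with another row, which matches the design rule that unknown values are left ungrouped rather than guessed.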
normalize_costco_web.py Normal file

@@ -0,0 +1,28 @@
from pathlib import Path

import click

import enrich_costco


@click.command()
@click.option(
    "--input-dir",
    default="data/costco-web/raw",
    show_default=True,
    help="Directory containing Costco raw order json files.",
)
@click.option(
    "--output-csv",
    default="data/costco-web/normalized_items.csv",
    show_default=True,
    help="CSV path for normalized Costco item rows.",
)
def main(input_dir, output_csv):
    rows = enrich_costco.build_items_enriched(Path(input_dir))
    enrich_costco.write_csv(Path(output_csv), rows)
    click.echo(f"wrote {len(rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()

normalize_giant_web.py Normal file

@@ -0,0 +1,28 @@
from pathlib import Path

import click

import enrich_giant


@click.command()
@click.option(
    "--input-dir",
    default="data/giant-web/raw",
    show_default=True,
    help="Directory containing Giant raw order json files.",
)
@click.option(
    "--output-csv",
    default="data/giant-web/normalized_items.csv",
    show_default=True,
    help="CSV path for normalized Giant item rows.",
)
def main(input_dir, output_csv):
    rows = enrich_giant.build_items_enriched(Path(input_dir))
    enrich_giant.write_csv(Path(output_csv), rows)
    click.echo(f"wrote {len(rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()


@@ -1,309 +1,346 @@
* grocery data model and file layout * Grocery data model and file layout
This document defines the shared file layout and stable CSV schemas for the This document defines the shared file layout and stable CSV schemas for the
grocery pipeline. The goal is to keep retailer-specific ingest separate from grocery pipeline.
cross-retailer product modeling so Giant-specific quirks do not become the Goals:
system of record. - Ensure data gathering is separate from analysis
- Enable multiple data gathering methods
** design rules - One layer for review and analysis
** Design Rules
- Raw retailer exports remain the source of truth. - Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids. - Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific enrichment. - Cross-retailer product layers begin only after retailer-specific normalization.
- CSV schemas are stable and additive: new columns may be appended, but - CSV schemas are stable and additive: new columns may be appended, but
existing columns should not be repurposed. existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed. - Unknown values should be left blank rather than guessed.
** directory layout *** Retailer-specific data:
Use one top-level data root:
#+begin_example
data/
giant/
raw/
history.json
orders/
<order_id>.json
orders.csv
items_raw.csv
items_enriched.csv
products_observed.csv
costco/
raw/
...
orders.csv
items_raw.csv
items_enriched.csv
products_observed.csv
shared/
products_canonical.csv
product_links.csv
review_queue.csv
#+end_example
** layer responsibilities
- `data/<retailer>/raw/`
Stores unmodified retailer payloads exactly as fetched.
- `data/<retailer>/orders.csv`
One row per retailer order or visit, flattened from raw order data.
- `data/<retailer>/items_raw.csv`
One row per retailer line item, preserving retailer-native values needed for
reruns and debugging.
- `data/<retailer>/items_enriched.csv`
Parsed retailer line items with normalized fields and derived guesses, still
retailer-specific.
- `data/<retailer>/products_observed.csv`
Distinct retailer-facing observed products aggregated from enriched items.
- `data/shared/products_canonical.csv`
Cross-retailer canonical product entities used for comparison.
- `data/shared/product_links.csv`
Links from retailer observed products to canonical products.
- `data/shared/review_queue.csv`
Human review queue for unresolved or low-confidence matching/parsing cases.
** retailer-specific versus shared
Retailer-specific:
- raw json payloads - raw json payloads
- retailer order ids - retailer order ids
- retailer line numbers - retailer line numbers
- retailer category ids and names - retailer category ids and names
- retailer item names - retailer item names
- retailer image urls - retailer image urls
- parsed guesses derived from one retailer feed
- observed products scoped to one retailer
Shared:
- canonical products
- observed-to-canonical links
- human review state for unresolved cases
- comparison-ready normalized quantity basis fields - comparison-ready normalized quantity basis fields
Observed products are the boundary between retailer-specific parsing and *** Review/Combined data:
cross-retailer canonicalization. Nothing upstream of `products_observed.csv` - catalog of reviewed products
should require knowledge of another retailer. - links from normalized retailer items to catalog
- human review state for unresolved cases
** schema: `data/<retailer>/orders.csv`
One row per order or visit. * Pipeline
Each step can be run alone if its dependencies (the inputs it reads) exist.
Each retail provider script must produce deterministic line-item outputs, and
normalization may assign within-retailer product identity only when the
retailer itself provides strong evidence.
| column | meaning | Key:
|- - (1) input
| `retailer` | retailer slug such as `giant` | - [1] output
| `order_id` | retailer order or visit id |
| `order_date` | order date in `YYYY-MM-DD` when available |
| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
| `service_type` | retailer service type such as `INSTORE` |
| `order_total` | order total as provided by retailer |
| `payment_method` | retailer payment label |
| `total_item_count` | total line count or item count from retailer |
| `total_savings` | total savings as provided by retailer |
| `your_savings_total` | savings field from retailer when present |
| `coupons_discounts_total` | coupon/discount total from retailer |
| `store_name` | retailer store name |
| `store_number` | retailer store number |
| `store_address1` | street address |
| `store_city` | city |
| `store_state` | state or province |
| `store_zipcode` | postal code |
| `refund_order` | retailer refund flag |
| `ebt_order` | retailer EBT flag |
| `raw_history_path` | relative path to source history payload |
| `raw_order_path` | relative path to source order payload |
Primary key: ** 1. Collect
Get raw receipt/visit and item data from a retailer.
Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan).
Preserve complete raw data and preserve fidelity.
Avoid interpretation beyond basic data flattening.
- (1) Source access (varies, e.g., header data, auth for API access)
- [1] collected visits from each retailer
- [2] collected items from each retailer
- [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
- (`retailer`, `order_id`) ** 2. Normalize
Parse and extract structured facts from retailer-specific raw data
to create a standardized item format for that retailer.
Strictly dependent on Collect method and output.
- Extract quantity, size, pack, pricing, variant
- Attach discount line items to product line items using upc/retailer_item_id and co-occurrence in the same order
- Clean up naming to facilitate later matching
- Assign retailer-level `normalized_item_id` only when evidence is deterministic
- Never use fuzzy or semantic matching here
- (1) collected items from each retailer
- (2) collected visits from each retailer
- [1] normalized items from each retailer
** schema: `data/<retailer>/items_raw.csv` ** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product";
match items across retailers using algo/logic and human review.
Create catalog linked to normalized retailer items.
- Review operates on distinct `normalized_item_id` values, not individual purchase rows
- Cross-retailer identity decisions happen only here
- Asking human to create a canonical/catalog item with:
- friendly/catalog_name: "bell pepper"; "milk"
- category: "produce"; "dairy"
- product_type: "pepper"; "milk"
- variant (?): "whole"; "skim"; "2pct"
- Then link the group of items to that catalog item.
- (1) normalized items from each retailer
- [1] review queue of items to be reviewed
- [2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
- [3] purchase list of normalized items, pivot-ready
** Unresolved Issues
1. need central script to orchestrate; metadata belongs there and nowhere else
2. `LIME` and `LIME . / .` appearing in the catalog: names must come from review-approved names, not raw strings
* Directory Layout
Use one top-level data root:
#+begin_example
main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
<retailer-method>/
raw/ # unmodified retailer payloads exactly as fetched
<order_id>.json
collected_items.csv # one row per retailer line item w/ retailer-native values
collected_orders.csv # one row per receipt/visit, flattened from raw order data
normalized_items.csv # parsed retailer-specific line items with normalized fields
costco-web/ # sample
raw/
orders/
history.json
<order_id>.json
collected_items.csv
collected_orders.csv
normalized_items.csv
review/
review_queue.csv # Human review queue for unresolved matching/parsing cases.
product_links.csv # Links from normalized retailer items to catalog items.
catalog.csv # Cross-retailer product catalog entities used for comparison.
purchases.csv
#+end_example
Notes:
- The current repo still uses transitional root-level scripts and output folders.
- This layout is the target structure for the refactor, not a claim that migration is already complete.
* Schemas
** `data/<retailer-method>/collected_items.csv`
One row per retailer line item. One row per retailer line item.
| key | definition |
|--------------------+--------------------------------------------|
| `retailer` PK | retailer slug |
| `order_id` PK | retailer order id |
| `line_no` PK | stable line number within order export |
| `order_date` | copied from order when available |
| `retailer_item_id` | retailer-native item id when available |
| `pod_id` | retailer pod/item id |
| `item_name` | raw retailer item name |
| `upc` | retailer UPC or PLU value |
| `category_id` | retailer category id |
| `category` | retailer category description |
| `qty` | retailer quantity field |
| `unit` | retailer unit code such as `EA` or `LB` |
| `unit_price` | retailer unit price field |
| `line_total` | retailer extended price field |
| `picked_weight` | retailer picked weight field |
| `mvp_savings` | retailer savings field |
| `reward_savings` | retailer rewards savings field |
| `coupon_savings` | retailer coupon savings field |
| `coupon_price` | retailer coupon price field |
| `image_url` | raw retailer image url when present |
| `raw_order_path` | relative path to source order payload |
| `is_discount_line` | retailer adjustment or discount-line flag |
| `is_coupon_line` | coupon-like line flag when distinguishable |
| column | meaning | ** `data/<retailer-method>/collected_orders.csv`
|------------------+-----------------------------------------| One row per order/visit/receipt.
| `retailer` | retailer slug | | key | definition |
| `order_id` | retailer order id | |---------------------------+-------------------------------------------------|
| `line_no` | stable line number within order export | | `retailer` PK | retailer slug such as `giant` |
| `order_date` | copied from order when available | | `order_id` PK | retailer order or visit id |
| `retailer_item_id` | retailer-native item id when available | | `order_date` | order date in `YYYY-MM-DD` when available |
| `pod_id` | retailer pod/item id | | `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
| `item_name` | raw retailer item name | | `service_type` | retailer service type such as `INSTORE` |
| `upc` | retailer UPC or PLU value | | `order_total` | order total as provided by retailer |
| `category_id` | retailer category id | | `payment_method` | retailer payment label |
| `category` | retailer category description | | `total_item_count` | total line count or item count from retailer |
| `qty` | retailer quantity field | | `total_savings` | total savings as provided by retailer |
| `unit` | retailer unit code such as `EA` or `LB` | | `your_savings_total` | savings field from retailer when present |
| `unit_price` | retailer unit price field | | `coupons_discounts_total` | coupon/discount total from retailer |
| `line_total` | retailer extended price field | | `store_name` | retailer store name |
| `picked_weight` | retailer picked weight field | | `store_number` | retailer store number |
| `mvp_savings` | retailer savings field | | `store_address1` | street address |
| `reward_savings` | retailer rewards savings field | | `store_city` | city |
| `coupon_savings` | retailer coupon savings field | | `store_state` | state or province |
| `coupon_price` | retailer coupon price field | | `store_zipcode` | postal code |
| `image_url` | raw retailer image url when present | | `refund_order` | retailer refund flag |
| `raw_order_path` | relative path to source order payload | | `ebt_order` | retailer EBT flag |
| `is_discount_line` | retailer adjustment or discount-line flag | | `raw_history_path` | relative path to source history payload |
| `is_coupon_line` | coupon-like line flag when distinguishable | | `raw_order_path` | relative path to source order payload |
Primary key: ** `data/<retailer-method>/normalized_items.csv`
One row per retailer line item after deterministic parsing. Preserve raw
fields from `collected_items.csv` and add parsed fields that make later review
and grouping easier. Normalization may assign retailer-level identity when the
evidence is deterministic and retailer-scoped.
- (`retailer`, `order_id`, `line_no`) | key | definition |
|----------------------------+------------------------------------------------------------------|
| `retailer` PK | retailer slug |
| `order_id` PK | retailer order id |
| `line_no` PK | line number within order |
| `normalized_row_id` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
| `normalized_item_id` | stable retailer-level item identity when deterministic grouping is supported |
| `normalization_basis` | basis used to assign `normalized_item_id` |
| `retailer_item_id` | retailer-native item id |
| `item_name` | raw retailer item name |
| `item_name_norm` | normalized retailer item name |
| `brand_guess` | parsed brand guess |
| `variant` | parsed variant text |
| `size_value` | parsed numeric size value |
| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty` | parsed pack or count guess |
| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
| `normalized_quantity` | numeric comparison basis derived during normalization |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank |
| `is_item` | item flag |
| `is_store_brand` | store-brand guess |
| `is_fee` | fee or non-product flag |
| `is_discount_line` | discount or adjustment-line flag |
| `is_coupon_line` | coupon-like line flag |
| `matched_discount_amount` | matched discount value carried onto purchased row when supported |
| `net_line_total` | line total after matched discount when supported |
| `price_per_each` | derived per-each price when supported |
| `price_per_each_basis` | source basis for `price_per_each` |
| `price_per_count` | derived per-count price when supported |
| `price_per_count_basis` | source basis for `price_per_count` |
| `price_per_lb` | derived per-pound price when supported |
| `price_per_lb_basis` | source basis for `price_per_lb` |
| `price_per_oz` | derived per-ounce price when supported |
| `price_per_oz_basis` | source basis for `price_per_oz` |
| `image_url` | best available retailer image url |
| `raw_order_path` | relative path to source order payload |
| `parse_version` | parser version string for reruns |
| `parse_notes` | optional non-fatal parser notes |
** schema: `data/<retailer>/items_enriched.csv` Notes:
- `normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.
One row per retailer line item after deterministic parsing. Preserve the raw ** `data/review/product_links.csv`
fields from `items_raw.csv` and add parsed fields. One row per review-approved link from a normalized retailer item to a catalog item.
Many normalized retailer items may link to the same catalog item.
| column | meaning | | key | definition |
|---------------------+-------------------------------------------------------------| |-------------------------+---------------------------------------------|
| `retailer` | retailer slug | | `normalized_item_id` PK | normalized retailer item id |
| `order_id` | retailer order id | | `catalog_id` PK | linked catalog product id |
| `line_no` | line number within order | | `link_method` | `manual`, `exact_upc`, `exact_name_size`, etc. |
| `observed_item_key` | stable row key, typically `<retailer>:<order_id>:<line_no>` | | `link_confidence` | optional confidence label |
| `retailer_item_id` | retailer-native item id | | `review_status` | `pending`, `approved`, `rejected`, or blank |
| `item_name` | raw retailer item name | | `reviewed_by` | reviewer id or initials |
| `item_name_norm` | normalized item name | | `reviewed_at` | review timestamp or date |
| `brand_guess` | parsed brand guess | | `link_notes` | optional notes |
| `variant` | parsed variant text |
| `size_value` | parsed numeric size value |
| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty` | parsed pack or count guess |
| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
| `is_store_brand` | store-brand guess |
| `is_fee` | fee or non-product flag |
| `is_discount_line` | discount or adjustment-line flag |
| `is_coupon_line` | coupon-like line flag |
| `price_per_each` | derived per-each price when supported |
| `price_per_lb` | derived per-pound price when supported |
| `price_per_oz` | derived per-ounce price when supported |
| `image_url` | best available retailer image url |
| `parse_version` | parser version string for reruns |
| `parse_notes` | optional non-fatal parser notes |
Primary key:
- (`retailer`, `order_id`, `line_no`)
** schema: `data/<retailer>/products_observed.csv`
One row per distinct retailer-facing observed product.
| column | meaning |
|-------------------------------+----------------------------------------------------------------|
| `observed_product_id` | stable observed product id |
| `retailer` | retailer slug |
| `observed_key` | deterministic grouping key used to create the observed product |
| `representative_retailer_item_id` | best representative retailer-native item id |
| `representative_upc` | best representative UPC/PLU |
| `representative_item_name` | representative raw retailer name |
| `representative_name_norm` | representative normalized name |
| `representative_brand` | representative brand guess |
| `representative_variant` | representative variant |
| `representative_size_value` | representative size value |
| `representative_size_unit` | representative size unit |
| `representative_pack_qty` | representative pack/count |
| `representative_measure_type` | representative measure type |
| `representative_image_url` | representative image url |
| `is_store_brand` | representative store-brand flag |
| `is_fee` | representative fee flag |
| `is_discount_line` | representative discount-line flag |
| `is_coupon_line` | representative coupon-line flag |
| `first_seen_date` | first order date seen |
| `last_seen_date` | last order date seen |
| `times_seen` | number of enriched item rows grouped here |
| `example_order_id` | one example retailer order id |
| `example_item_name` | one example raw item name |
| `distinct_retailer_item_ids_count` | count of distinct retailer-native item ids |
Primary key:
- (`observed_product_id`)
** schema: `data/shared/products_canonical.csv`
One row per cross-retailer canonical product.
| column | meaning |
|----------------------------+--------------------------------------------------|
| `canonical_product_id` | stable canonical product id |
| `canonical_name` | canonical human-readable name |
| `product_type` | broad class such as `apple`, `milk`, `trash_bag` |
| `brand` | canonical brand when applicable |
| `variant` | canonical variant |
| `size_value` | normalized size value |
| `size_unit` | normalized size unit |
| `pack_qty` | normalized pack/count |
| `measure_type` | normalized measure type |
| `normalized_quantity` | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes` | optional human notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
Primary key:
- (`canonical_product_id`)
** schema: `data/shared/product_links.csv`
One row per observed-to-canonical relationship.
| column | meaning |
|-
| `observed_product_id` | retailer observed product id |
| `canonical_product_id` | linked canonical product id |
| `link_method` | `manual`, `exact_upc`, `exact_name`, etc. |
| `link_confidence` | optional confidence label |
| `review_status` | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by` | reviewer id or initials |
| `reviewed_at` | review timestamp or date |
| `link_notes` | optional notes |
Primary key:
- (`observed_product_id`, `canonical_product_id`)
** schema: `data/shared/review_queue.csv`
** `data/review/review_queue.csv`
One row per issue needing human review. One row per issue needing human review.
| column | meaning | | key | definition |
|- |----------------------+-----------------------------------------------------|
| `review_id` | stable review row id | | `review_id` PK | stable review row id |
| `queue_type` | `observed_product`, `link_candidate`, `parse_issue` | | `queue_type` | `link_candidate`, `parse_issue`, `catalog_cleanup` |
| `retailer` | retailer slug when applicable | | `retailer` | retailer slug when applicable |
| `observed_product_id` | observed product id when applicable | | `normalized_item_id` | normalized retailer item id when review is item-level |
| `canonical_product_id` | candidate canonical id when applicable | | `normalized_row_id` | normalized row id when review is row-specific |
| `reason_code` | machine-readable review reason | | `catalog_id` | candidate canonical id |
| `priority` | optional priority label | | `reason_code` | machine-readable review reason |
| `raw_item_names` | compact list of example raw names | | `priority` | optional priority label |
| `normalized_names` | compact list of example normalized names | | `raw_item_names` | compact list of example raw names |
| `upc` | example UPC/PLU | | `normalized_names` | compact list of example normalized names |
| `image_url` | example image url | | `upc` | example UPC/PLU |
| `example_prices` | compact list of example prices | | `image_url` | example image url |
| `seen_count` | count of related rows | | `example_prices` | compact list of example prices |
| `status` | `pending`, `approved`, `rejected`, `deferred` | | `seen_count` | count of related rows |
| `resolution_notes` | reviewer notes | | `status` | `pending`, `approved`, `rejected`, `deferred` |
| `created_at` | creation timestamp or date | | `resolution_notes` | reviewer notes |
| `updated_at` | last update timestamp or date | | `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
** `data/catalog.csv`
One row per cross-retailer catalog product.
| key | definition |
|----------------------------+----------------------------------------|
| `catalog_id` PK | stable catalog product id |
| `catalog_name` | human-reviewed product name |
| `product_type` | generic product eg `apple`, `milk` |
| `category` | broad section eg `produce`, `dairy` |
| `brand` | canonical brand when applicable |
| `variant` | canonical variant |
| `size_value` | normalized size value |
| `size_unit` | normalized size unit |
| `pack_qty` | normalized pack/count |
| `measure_type` | normalized measure type |
| `normalized_quantity` | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes` | optional human notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
Primary key: Notes:
- Do not auto-create new catalog rows from weak normalized names alone.
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings.
- (`review_id`) ** `data/purchases.csv`
One row per purchased item (i.e., `is_item`==true from normalized layer), with
catalog attributes denormalized in and discounts already applied.
** current giant mapping | key | definition |
|----------------------------+----------------------------------------------------------------|
| `purchase_date` | date of purchase (from order) |
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | line number within order |
| `normalized_row_id` | `<retailer>:<order_id>:<line_no>` |
| `normalized_item_id` | retailer-level normalized item identity |
| `catalog_id` | linked catalog product id |
| `catalog_name` | catalog product name for analysis |
| `catalog_product_type` | broader product family (e.g., `egg`, `milk`) |
| `catalog_category` | category such as `produce`, `dairy` |
| `catalog_brand` | canonical brand when applicable |
| `catalog_variant` | canonical variant when applicable |
| `raw_item_name` | original retailer item name |
| `normalized_item_name` | cleaned/normalized retailer item name |
| `retailer_item_id` | retailer-native item id |
| `upc` | UPC/PLU when available |
| `qty` | retailer quantity field |
| `unit` | retailer unit (e.g., `EA`, `LB`) |
| `pack_qty` | parsed pack/count |
| `size_value` | parsed size value |
| `size_unit` | parsed size unit |
| `measure_type` | `each`, `weight`, `volume`, `count` |
| `normalized_quantity` | normalized comparison quantity |
| `normalized_quantity_unit` | unit for normalized quantity |
| `unit_price` | retailer unit price |
| `line_total` | original retailer extended price (pre-discount) |
| `matched_discount_amount` | discount amount matched from discount lines |
| `net_line_total` | effective price after discount (`line_total` + discounts) |
| `store_name` | retailer store name |
| `store_city` | store city |
| `store_state` | store state |
| `price_per_each` | derived per-each price |
| `price_per_each_basis` | source basis for per-each calc |
| `price_per_count` | derived per-count price |
| `price_per_count_basis` | source basis for per-count calc |
| `price_per_lb` | derived per-pound price |
| `price_per_lb_basis` | source basis for per-pound calc |
| `price_per_oz` | derived per-ounce price |
| `price_per_oz_basis` | source basis for per-ounce calc |
| `is_fee` | true if row represents non-product fee |
| `raw_order_path` | relative path to original order payload |
Current scraper outputs map to the new layout as follows: Notes:
- Only rows that represent purchased items should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- catalog fields are denormalized in to make pivoting trivial.
- no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
- review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
- `giant_output/raw/history.json` -> `data/giant/raw/history.json` * /
- `giant_output/raw/<order_id>.json` -> `data/giant/raw/orders/<order_id>.json`
- `giant_output/orders.csv` -> `data/giant/orders.csv`
- `giant_output/items.csv` -> `data/giant/items_raw.csv`
Current Giant raw order payloads already expose fields needed for future
enrichment, including `image`, `itemName`, `primUpcCd`, `lbEachCd`,
`unitPrice`, `groceryAmount`, and `totalPickedWeight`.
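The `data/purchases.csv` notes above say that link decisions apply at the `normalized_item_id` level and then fan out to every purchase row sharing that id. A minimal sketch of that fan-out, assuming rows are already loaded as dicts and that each normalized item has at most one approved link; `apply_links` is a hypothetical helper name, not an existing script:

```python
def apply_links(purchase_rows, link_rows):
    """Copy the approved catalog_id onto every purchase row with the same normalized_item_id."""
    # Only review-approved links count; pending and rejected links are ignored.
    approved = {
        link["normalized_item_id"]: link["catalog_id"]
        for link in link_rows
        if link["review_status"] == "approved"
    }
    # Rows without an approved link keep a blank catalog_id rather than a guess.
    return [
        dict(row, catalog_id=approved.get(row["normalized_item_id"], ""))
        for row in purchase_rows
    ]
```

One approved link resolves every purchase row for that retailer item, so review effort scales with distinct items rather than with receipt lines.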


@@ -27,6 +27,7 @@ carry forward image url
3. build observed-product table from enriched items 3. build observed-product table from enriched items
* git issues * git issues
- don't try to git push from win emacs viewing wsl, it will be screwy (windows identity vs wsl)
** ssh / access to gitea ** ssh / access to gitea
ssh://git@192.168.1.207:2020/ben/scrape-giant.git ssh://git@192.168.1.207:2020/ben/scrape-giant.git
@@ -71,6 +72,12 @@ l l : open local reflog
put point on the commit; highlighted remote gitea/cx put point on the commit; highlighted remote gitea/cx
X : reset branch; prompts you, selected cx X : reset branch; prompts you, selected cx
** merge branch
b b : switch to branch to be merged into (cx)
m m : pick branch to merge into current branch
* giant requests * giant requests
** item: ** item:
get: get:
@@ -250,18 +257,247 @@ python build_observed_products.py
python build_review_queue.py python build_review_queue.py
python build_canonical_layer.py python build_canonical_layer.py
python validate_cross_retailer_flow.py python validate_cross_retailer_flow.py
* t1.11 tasks [2026-03-17 Tue 13:49] * t1.13 tasks [2026-03-17 Tue 13:49]
ok i ran a few. time to run some cleanups here - i'm wondering if we shouldn't be less aggressive with canonical names and encourage a better manual process to start. ok i ran a few. time to run some cleanups here - i'm wondering if we shouldn't be less aggressive with canonical names and encourage a better manual process to start.
1. auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty ** TODO fill in auto-created canonical category, product-type
2. canonical_names feel too specific, e.g., "5DZ egg" auto-created canonical_names lack category, product_type - ok with filling these in manually in the catalog once the queue is empty
3. some canonical_names need consolidation, eg "LIME" and "LIME . / ." ; poss cleanup issue. there are 5 entries for egg, but they are all regular large grade A white eggs, just different amounts in dozens.
** TODO consolidation cleanup
1. canonical_names feel too specific, e.g., "5DZ egg" - probably a problem with the enrich_* steps not adding appropriate normalizing data /and/ removing from observed product title?
2. some canonical_names need consolidation, eg "LIME" and "LIME . / ." ; poss cleanup issue. there are 5 entries for egg, but they are all regular large grade A white eggs, just different amounts in dozens.
Eggs are actually a great candidate for the kind of analysis we want to do - the pipeline should have caught and properly sorted these into size/qty: Eggs are actually a great candidate for the kind of analysis we want to do - the pipeline should have caught and properly sorted these into size/qty:
#+begin_example
```canonical_product_id canonical_name category product_type brand variant size_value size_unit pack_qty measure_type notes created_at updated_at ```canonical_product_id canonical_name category product_type brand variant size_value size_unit pack_qty measure_type notes created_at updated_at
gcan_0e350505fd22 5DZ EGG / / KS each auto-linked via exact_name gcan_0e350505fd22 5DZ EGG / / KS each auto-linked via exact_name
gcan_47279a80f5f3 EGG 5 DOZ. BBS each auto-linked via exact_name gcan_47279a80f5f3 EGG 5 DOZ. BBS each auto-linked via exact_name
gcan_7d099130c1bf LRG WHITE EGG SB 30 count auto-linked via exact_upc gcan_7d099130c1bf LRG WHITE EGG SB 30 count auto-linked via exact_upc
gcan_849c2817e667 GDA LRG WHITE EGG SB 18 count auto-linked via exact_upc gcan_849c2817e667 GDA LRG WHITE EGG SB 18 count auto-linked via exact_upc
gcan_cb0c6c8cf480 LG EGG CONVENTIONAL 18 count count auto-linked via exact_name_size ``` gcan_cb0c6c8cf480 LG EGG CONVENTIONAL 18 count count auto-linked via exact_name_size ```
4. Build costco mechanism for matching discount to line item. #+end_example
** TODO costco discount matching
Build costco mechanism for matching discount to line item.
1. Discounts appear as their own line items with a number like /123456, this matches the UPC of the discounted item 1. Discounts appear as their own line items with a number like /123456, this matches the UPC of the discounted item
2. must be date-matched to the UPC 2. must be date-matched to the UPC
Data model might be missing shape:
1. match discount rows like `item_name:/2303476` to `retailer_item_id:2303476`
2. display this value on the item somehow? maybe update line_total? otherwise we lose fidelity. should be stored in items_enriched somehow
#+begin_example
```retailer order_id line_no observed_item_key order_date retailer_item_id pod_id item_name upc category_id category qty unit unit_price line_total picked_weight mvp_savings reward_savings coupon_savings coupon_price image_url raw_order_path item_name_norm brand_guess variant size_value size_unit pack_qty measure_type is_store_brand is_fee is_discount_line is_coupon_line price_per_each price_per_lb price_per_oz parse_version parse_notes
costco 2.11115E+22 3 costco:21111520101942404241753:3 4/24/2024 2303476 KA 6QT MIXER P16 KSM60SECXER/CU FY23 33 33 1 None 399.99 399.99 costco_output/raw/21111520101942404241753-2024-04-24T17-53-00.json KA 6QT MIXER KSM60SECXER/CU each FALSE FALSE FALSE FALSE 399.99 costco-enrich-v1
costco 2.11115E+22 4 costco:21111520101942404241753:4 4/24/2024 325173 /2303476 33 33 -1 None 0 -100 -100 costco_output/raw/21111520101942404241753-2024-04-24T17-53-00.json /2303476 each FALSE FALSE TRUE TRUE 100 costco-enrich-v1 ```
#+end_example
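The matching rule described above can be sketched as follows. This is a minimal illustration, not the `enrich_costco.py` implementation: `attach_discounts` is a hypothetical name, rows are assumed to be dicts with string values, "date-matched" is approximated by requiring the same `order_id`, and `matched_discount_amount` is assumed to keep the negative sign of the discount line. The pattern accepts both `/2303476` and the slash-space form `/ 1768123`.

```python
import re
from collections import defaultdict
from decimal import Decimal

# Matches discount item names such as "/2303476" and "/ 1768123".
DISCOUNT_NAME = re.compile(r"^/\s*(\d+)$")


def attach_discounts(rows):
    """Return row copies with matched_discount_amount and net_line_total on discounted items."""
    # Sum discount line totals per (order_id, discounted retailer_item_id).
    discounts = defaultdict(Decimal)
    for row in rows:
        match = DISCOUNT_NAME.match(row["item_name"].strip())
        if match:
            discounts[(row["order_id"], match.group(1))] += Decimal(row["line_total"])

    out = []
    claimed = set()
    for row in rows:
        row = dict(row)  # keep the raw input rows untouched for auditability
        key = (row["order_id"], row["retailer_item_id"])
        is_discount = DISCOUNT_NAME.match(row["item_name"].strip()) is not None
        # Attach the discount to the first purchased row with that item id in the order.
        if not is_discount and key in discounts and key not in claimed:
            claimed.add(key)
            row["matched_discount_amount"] = str(discounts[key])
            row["net_line_total"] = str(Decimal(row["line_total"]) + discounts[key])
        out.append(row)
    return out
```

The discount rows stay in the output unchanged and `line_total` is never overwritten, so the receipt remains reconstructible.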
** TODO giant discount matching
* prompt
do not add new abstractions unless they remove real duplication. prefer explicit retailer-specific logic over generic heuristics. do not auto-create new canonical products from weak normalized names.
and propose the smallest set of edits needed.
* 1.13 fixes
** 15x Costco discounts not caught
- 15x, some with slash-space: `/ 1768123` and some without: `/2303476`
** canonical names suck - tempted to force manual config from scratch?
- maybe first-pass should be naming groups, starting with largest groups and going on down.
- unfortunately not seeing many cross-retailer items? looks like costco-only; just taking Giant as gospel
- could be as simple as changing canonical name in canonical_catalog.csv
- tough to figure out where the data is, leading to below:
** need to refactor whole flow and where data is stored
group by browser or by site, or both? currently mixed.
1. Scrape
- Script:
- Output: /output/raw/orderN.json, history.json, orders.csv, history.csv
2. Enrich
- Scripts:
- Output: /output/enrich/items.json
3. Combined - /output/?
- Review step?
** proposed fixes
* 1.14 prep - OBE
** [ ] t1.14.1 define and document the filesystem/data-layer layout (2-3 commits)
make stage ownership and retailer ownership explicit so every artifact has one obvious home
** AC
1. define and document the canonical directory layout for the pipeline, separating retailer-specific artifacts from shared combined artifacts
2. adopt an explicit layout of the form:
- `data/<retailer>/raw/`
- `data/<retailer>/orders.csv`
- `data/<retailer>/items.csv`
- `data/<retailer>/items_enriched.csv`
- `data/combined/products_observed.csv`
- `data/combined/review_queue.csv`
- `data/combined/item_aliases.csv`
- `data/combined/canonical_catalog.csv`
- `data/combined/product_links.csv`
- `data/combined/purchases.csv`
- `data/combined/pipeline_status.csv`
- `data/combined/pipeline_status.json`
3. update docs/readme and pipeline docs so each script's inputs and outputs point to the new layout
4. remove or deprecate ambiguous stage outputs living under a retailer-specific output directory when they are actually shared artifacts
- pm note: goal is “where does this file live?” should have one answer, not three
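One way to enforce "one answer" for file locations is a single path helper that every script imports. The sketch below assumes the layout listed in AC 2; `artifact_path` and the two name sets are hypothetical, not existing code:

```python
from pathlib import Path

# Artifacts owned by a single retailer, per the layout in AC 2.
RETAILER_ARTIFACTS = {"raw", "orders.csv", "items.csv", "items_enriched.csv"}

# Artifacts shared across retailers, per the layout in AC 2.
COMBINED_ARTIFACTS = {
    "products_observed.csv",
    "review_queue.csv",
    "item_aliases.csv",
    "canonical_catalog.csv",
    "product_links.csv",
    "purchases.csv",
    "pipeline_status.csv",
    "pipeline_status.json",
}


def artifact_path(name, retailer=None, root=Path("data")):
    """Return the one location for an artifact, or raise if the request is ambiguous."""
    if name in RETAILER_ARTIFACTS:
        if retailer is None:
            raise ValueError(f"{name} is retailer-specific; pass a retailer")
        return root / retailer / name
    if name in COMBINED_ARTIFACTS:
        if retailer is not None:
            raise ValueError(f"{name} is shared; do not pass a retailer")
        return root / "combined" / name
    raise KeyError(f"unknown artifact: {name}")
```

Raising on a shared artifact requested with a retailer catches the "shared file living under a retailer directory" case from AC 4 at call time.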
** evidence
- commit:
- tests:
- date:
** notes
** [ ] t1.14.2 define the row-level data model for raw, enriched, observed, canonical, and purchases layers (2-4 commits)
lock the item model before further refactors so each stage has a clear grain and purpose
** AC
1. document the row grain for each layer:
- raw item row = one receipt line from one retailer order
- enriched item row = one retailer line with retailer-specific parsed fields
- observed product row = one grouped retailer-facing product concept
- canonical catalog row = one review-controlled product identity
- purchase row = one final pivot-ready purchased item line
2. define the required fields for each layer, including stable ids and provenance fields
3. explicitly document which fields are allowed to be blank at each layer (e.g. `upc`, `canonical_item_id`, category)
4. document the relationship between:
- `raw_item_name`
- `normalized_item_name`
- `observed_product_id`
- `canonical_item_id`
5. document how retailer-native ids (e.g. Costco `retailer_item_id`) fit into the shared model without being forced into `upc`
- pm note: this is the schema contract task; code should follow it, not invent it ad hoc
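For AC 5, a retailer-native id can act as its own identity basis instead of being written into `upc`. The sketch below is one possible shape, using the explicit `normalization_basis` values named in the data model (`exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`). The function name, the id string format, and the priority order are assumptions for illustration:

```python
def normalized_item_identity(row):
    """Return (normalized_item_id, normalization_basis), or ("", "") without exact evidence."""
    retailer = row["retailer"]

    upc = (row.get("upc") or "").strip()
    if upc:
        return f"{retailer}:upc:{upc}", "exact_upc"

    # Retailer-native ids (e.g. Costco item numbers) get their own basis;
    # the upc field stays blank rather than being filled with a non-UPC value.
    item_id = (row.get("retailer_item_id") or "").strip()
    if item_id:
        return f"{retailer}:item:{item_id}", "exact_retailer_item_id"

    # Name-based identity only when name, size, and pack are all present.
    keys = ("item_name_norm", "size_value", "size_unit", "pack_qty")
    parts = [(row.get(key) or "").strip() for key in keys]
    if all(parts):
        return f"{retailer}:name:" + "|".join(parts), "exact_name_size_pack"

    # No fuzzy fallback: the row is left ungrouped for review.
    return "", ""
```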
** evidence
- commit:
- tests:
- date:
** notes
** [ ] t1.14.3 refactor pipeline outputs to the new layout without changing semantics (2-4 commits)
move files and script defaults to the new structure while preserving current behavior
** AC
1. update scraper and enrich scripts to write retailer-specific outputs under `data/<retailer>/...`
2. update combined/shared scripts to read from retailer-specific enriched outputs and write to `data/combined/...`
3. preserve current content/meaning of outputs during the move; this is a location/structure refactor, not a behavior rewrite
4. update tests, docs, and script defaults to use the new paths
- pm note: do not mix data-layout cleanup with canonical/review logic changes in this task
** evidence
- commit:
- tests:
- date:
** notes
** [ ] t1.14.4 make the review and catalog layer explicit and authoritative (2-4 commits)
treat review and canonical resolution as first-class data, not incidental byproducts
** AC
1. define `review_queue.csv`, `item_aliases.csv`, and `canonical_catalog.csv` as the authoritative review/catalog files in `data/combined/`
2. document the intended purpose of each:
- `review_queue.csv` = unresolved observed items needing action
- `item_aliases.csv` = approved mapping from observed/normalized names to canonical ids
- `canonical_catalog.csv` = review-controlled canonical product definitions and display names
3. ensure final purchase generation reads from these files as the source of truth for resolution
4. stop relying on weak implicit canonical creation as a substitute for the explicit review/catalog layer
- pm note: this is the control-plane task; observed products may be automatic, canonical products are review-controlled
** evidence
- commit:
- tests:
- date:
** notes
** [ ] t1.14.5 define and document the final pivot-ready purchases output (2-3 commits)
make the final analysis artifact explicit so excel/pivot/chart use is a first-class target
** AC
1. define `data/combined/purchases.csv` as the final normalized purchase log
2. ensure each purchase row retains:
- purchase date
- retailer
- order id
- raw item name
- normalized item name
- canonical item id when resolved
- quantity and unit
- original line total
- discount-adjusted fields when applicable
- store/location fields where available
3. document that `purchases.csv` is the primary excel/pivot input and that earlier files are staging layers
4. document expected pivot uses such as purchase frequency and cost over time by canonical item
- pm note: this task is about making the final artifact explicit and stable, not about adding new metrics
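As an illustration of the pivot use in AC 4, cost over time by canonical item can be computed with the standard library alone. This is a sketch under assumptions: the column names `purchase_date`, `canonical_item_id`, `line_total`, and `net_line_total` are taken from the field list above but not confirmed as final headers, dates are assumed to be ISO `YYYY-MM-DD`, and `spend_by_item_and_month` is a hypothetical helper:

```python
import csv
from collections import defaultdict
from decimal import Decimal


def spend_by_item_and_month(purchases_file):
    """Sum spend per (canonical_item_id, YYYY-MM) from an open purchases.csv file."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(purchases_file):
        if not row["canonical_item_id"]:
            continue  # unresolved rows stay out of the pivot
        month = row["purchase_date"][:7]  # assumes ISO dates
        # Prefer the discount-adjusted amount when it is present.
        amount = row["net_line_total"] or row["line_total"]
        totals[(row["canonical_item_id"], month)] += Decimal(amount)
    return dict(totals)
```

Purchase frequency follows the same pattern by counting rows instead of summing amounts.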
** evidence
- commit:
- tests:
- date:
** notes
* pipeline prep [2026-03-17 Tue]
data saved to /data
1. "scrape_<retailer>" gathers data from a retailer and outputs:
1. raw list of items per visit ./<retailer>/scraped/raw/order-<uid>.json
2. raw list of visits ./<retailer>/scraped_visits.csv
3. raw list of items from all visits ./<retailer>/scraped_items.csv
2. "enrich <retailer>" takes /scraped/ data and outputs:
1. normalized list of items ./<retailer>/enriched_items.csv
3. "combine" takes retailer
input:
1. all enriched items ./<retailer>/enriched_items.csv
2. all retailer visits ./<retailer>/scraped_visits.csv
outputs:
1. observed product groups ./combined/observed/products_observed.csv
2. unresolved products for review ./combined/review/review_queue.csv
3. pipeline accounting/status ./combined/status/pipeline_status.csv
4. pipeline accounting/status ./combined/status/pipeline_status.json
4. review resolves unknown or weakly identified products and maintains:
1. canonical product catalog ./combined/review/canonical_catalog.csv
2. approved alias mappings ./combined/review/item_aliases.csv
3. optional observed→canonical links ./combined/review/product_links.csv
5. build purchases takes combined observed data plus review/catalog data and outputs:
[1]. final normalized purchase log ./combined/purchases/purchases.csv
let's get this pipeline right before more refactoring.
* Pipeline - moved to data-model.org [2026-03-18 Wed]
Key:
- (1) input
- [2] output
Each step can be run alone if its dependents exist.
** 1. Collect
Get raw receipt/visit and item data from a retailer. Scraping is unique to a Retailer and method (e.g., Giant-Web and Giant-Scan). Preserve complete raw data and preserve fidelity. Avoid interpretation beyond basic data flattening.
- (1) Source access (Varies, eg header data, auth for API access)
- [1] collected visits from each retailer
- [2] collected items from each retailer
- [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
** 2. Normalize
Parse and extract structured facts from retailer-specific raw data to create a standardized item format. Strictly dependent on Collect method and output.
- Extract quantity, size, pack, pricing, variant
- Consolidate discount with item using upc/retailer_item_id and concurrence
- Cleanup naming to facilitate later matching
- (1) collected items from each retailer
- (2) collected visits from each retailer
- [1] normalized items from each retailer
** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product"; match items across retailers using algo/logic and human review. Create catalog linked to normalized items.
- Grouping the same item from retailer
- Asking human to create a canonical/catalog item with:
- friendly/canonical_name: "bell pepper"; "milk"
- category: "produce"; "dairy"
- product_type: "pepper"; "milk"
- ? variant? "whole", "skim", "2pct"
- (1) normalized items from each retailer
- [1] review queue of items to be reviewed
- [2] catalog (lookup table) of confirmed retailer_item and canonical_name
- [3] canonical purchase list, pivot-ready
** Unresolved Issues
2. Create tags: canonical_name (need better label), category, product_type is missing data like Variant, shouldn't this be part of the normalization step?
3. need central script to orchestrate; metadata belongs here and nowhere else
** Symptoms
- `LIME` and `LIME . / .` appearing in canonical_catalog:
- names must come from review-approved names, not raw strings
*

pm/task-sample.org Normal file

@@ -0,0 +1,22 @@
#+title: Task Log
#+updated: [2026-03-18 Wed 14:19]
Use the template below, which should be a top-level org-mode header.
* [ ] M.m.m: Task Title (estimate # commits)
replace the old observed/canonical workflow with a review-first pipeline that groups normalized rows only during review/combine and links them to catalog items
** Acceptance Criteria
1. Criterion
- expanded data
2. Criterion
- pm note: amplifying information
** evidence
- commit: abc123, bcd234
- tests:
- datetime: [2026-03-18 Wed 14:15]
** notes
- explanation of work done, decisions made, reasoning


@@ -1,3 +1,5 @@
#+title: Scrape-Giant Task Log
#+STARTUP: overview
* [X] t1.1: harden giant receipt fetch cli (2-4 commits) * [X] t1.1: harden giant receipt fetch cli (2-4 commits)
** acceptance criteria ** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty` - giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
@@ -416,10 +418,356 @@ Clearly show current state separate from proposed future state.
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links. - Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design. - Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits) * [X] t1.13.1 pipeline accountability and stage visibility (1-2 commits)
add simple accounting so we can see what survives or drops at each pipeline stage
** AC
1. emit counts for raw, enriched, combined/observed, review-queued, canonical-linked, and final purchase-log rows
2. report unresolved and dropped item counts explicitly
3. make it easy to verify that missing items were intentionally left in review rather than silently lost
- pm note: simple text/json/csv summary is sufficient; trust and visibility matter more than presentation
** evidence
- commit: `967e19e`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python report_pipeline_status.py`; verified `combined_output/pipeline_status.csv` and `combined_output/pipeline_status.json`
- date: 2026-03-17
** notes
- Added a single explicit status script instead of threading counters through every pipeline step; this keeps the pipeline simple while still making row survival visible.
- The most useful check here is `unresolved_not_in_review_rows`; when it is non-zero, we know we have a real accounting bug rather than normal unresolved work.
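The `unresolved_not_in_review_rows` check can be pictured as a set difference between unresolved purchase rows and the review queue. The sketch below is not the code in `report_pipeline_status.py`; the function name and the join on `observed_product_id` / `canonical_product_id` are assumptions for illustration:

```python
def status_counts(purchase_rows, review_rows):
    """Count unresolved purchase rows and those missing from the review queue."""
    queued_ids = {row["observed_product_id"] for row in review_rows}
    # A row is unresolved when it has no canonical product yet.
    unresolved = [row for row in purchase_rows if not row.get("canonical_product_id")]
    # Unresolved rows absent from the queue indicate an accounting bug.
    missing = [row for row in unresolved if row["observed_product_id"] not in queued_ids]
    return {
        "purchase_rows": len(purchase_rows),
        "unresolved_rows": len(unresolved),
        "unresolved_not_in_review_rows": len(missing),
    }
```

A non-zero last count means rows were dropped silently rather than parked in review.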
* [X] t1.13.2 costco discount matching and net pricing in enrich_costco (2-3 commits)
refactor costco enrichment so discount lines are matched to purchased items and net pricing is preserved
** AC
1. detect costco discount/coupon rows like `/<retailer_item_id>` and match them to purchased items within the same order
2. preserve raw discount rows for auditability while also carrying matched discount values onto the purchased item row
3. add explicit fields for discount-adjusted pricing, e.g. `matched_discount_amount` and `net_line_total` (or equivalent)
4. preserve original raw receipt amounts (`line_total`) without overwriting them
- pm note: keep this retailer-specific and explicit; do not introduce generic discount heuristics
** evidence
- commit: `56a03bc`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified matched Costco discount rows now populate `matched_discount_amount` and `net_line_total` while preserving raw `line_total`
- date: 2026-03-17
** notes
- Kept this retailer-specific and literal: only discount rows with `/<retailer_item_id>` are matched, and only within the same order.
- Raw discount rows are still preserved for auditability; the purchased row now carries the matched adjustment separately rather than overwriting the original amount.
* [X] t1.13.3 canonical cleanup and review-first product identity (3-4 commits)
refactor canonical generation so product identity is cleaner, duplicate canonicals are reduced, and unresolved items stay in review instead of spawning junk canonicals
** AC
1. stop auto-creating new canonical products from weak normalized names alone; unresolved items remain in `review_queue.csv`
2. canonical names are based on stable product identity rather than noisy observed titles
3. packaging/count/size tokens are removed from canonical names when they belong in structured fields (`pack_qty`, `size_value`, `size_unit`)
4. consolidate obvious duplicate canonicals (e.g. egg/lime cases) and ensure final outputs retain raw item name, normalized item name, and canonical item id
- pm note: prefer conservative canonical creation and a better manual review loop over aggressive auto-unification
** evidence
- commit: `08e2a86`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified weaker exact-name cases now remain unresolved in `combined_output/review_queue.csv` and canonical names are cleaned before auto-catalog creation
- date: 2026-03-17
** notes
- Removed weak exact-name auto-canonical creation so ambiguous products stay in review instead of generating junk canonicals.
- Canonical display names are now cleaned of obvious punctuation and packaging noise, but I kept the cleanup conservative rather than adding a broad fuzzy merge layer.
* [X] t1.14: refactor retailer collection into the new data model (2-4 commits)
move Giant and Costco collection into the new collect structure and make both retailers emit the same collected schemas
** Acceptance Criteria
1. create retailer-specific collect scripts in the target naming pattern, e.g.:
- collect_giant_web.py
- collect_costco_web.py
2. collected outputs conform to pm/data-model.org:
- data/<retailer-method>/raw/...
- data/<retailer-method>/collected_orders.csv
- data/<retailer-method>/collected_items.csv
3. current Giant and Costco raw acquisition behavior is preserved during the move
4. collected schemas preserve retailer truth and provenance:
- no interpretation beyond basic flattening
- raw_order_path/raw_history_path remain usable
- unknown values remain blank rather than guessed
5. old paths should be removed or deprecated
6. collect_* scripts do not depend on any normalize/review files or scripts
- pm note: this is a path/schema refactor, not a parsing rewrite
** evidence
- commit: `48c6eaf`
- tests: `./venv/bin/python -m unittest tests.test_scraper tests.test_costco_pipeline tests.test_browser_session`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`
- datetime: 2026-03-18
** notes
- Kept this as a path/schema move, not a parsing rewrite: the existing Giant and Costco collection behavior remains in place behind new `collect_*` entry points.
- Added lightweight deprecation nudges on the legacy `scrape_*` commands rather than removing them immediately, so the move is inspectable and low-risk.
- The main schema fix was on Giant collection, which was missing retailer/provenance/audit fields that Costco collection already carried.
* [X] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)
make Giant and Costco emit the shared normalized line-item schema without introducing cross-retailer identity logic
** Acceptance Criteria
1. create retailer-specific normalize scripts in the target naming pattern, e.g.:
- normalize_giant_web.py
- normalize_costco_web.py
2. normalized outputs conform to pm/data-model.org:
- data/<retailer-method>/normalized_items.csv
- one row per collected line item
- normalized_row_id is stable and present
- normalized_item_id is stable, present, and represents retailer-level identity reused across repeated purchase rows when deterministic retailer evidence is sufficient
- normalized_quantity and normalized_quantity_unit
- repeated rows for the same retailer product resolve to the same normalized_item_id only when supported by deterministic retailer evidence, e.g. exact upc, exact retailer_item_id, exact cleaned name + same size/pack
- normalization_basis is explicit
3. Giant normalization preserves current useful parsing:
- normalized item name
- size/unit/pack parsing
- fee/store-brand flags
- derived price fields
4. Costco normalization preserves current useful parsing:
- normalized item name
- size/unit/pack parsing
- explicit discount matching using retailer-specific logic
- matched_discount_amount and net_line_total
5. both normalizers preserve raw retailer truth:
- line_total is never overwritten
- unknown values remain blank rather than guessed
6. no cross-retailer identity assignment occurs in normalization
7. normalize never uses fuzzy or semantic matching to assign normalized_item_id
- pm note: prefer explicit retailer-specific code paths over generic normalization helpers unless the duplication is truly mechanical
- pm note: normalization may resolve retailer-level identity, but not catalog identity
- pm note: normalized_item_id is the only retailer-level grouping identity; do not introduce observed_products or a second grouping artifact
** evidence
- commit: `9064de5`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python enrich_giant.py --help`; `./venv/bin/python enrich_costco.py --help`
- datetime: 2026-03-18
** notes
- Kept the existing Giant and Costco parsing logic intact and added the new normalized schema fields in place, rather than rewriting the enrichers from scratch.
- `normalized_item_id` is always present, but it only collapses repeated rows when the evidence is strong; otherwise it falls back to row-level identity via `normalized_row_id`.
- Added `normalize_*` entry points for the new data-model layout while leaving the legacy `enrich_*` commands available during the transition.
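The fallback behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's normalizer: `sketch_stable_id`, the basis labels, and the exact evidence order are hypothetical.

```python
import hashlib


def sketch_stable_id(prefix, *parts):
    # Hypothetical stand-in for a deterministic id helper.
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


def assign_normalized_item_id(row):
    """Pick the strongest deterministic evidence; otherwise fall back to the row id."""
    name = row.get("normalized_item_name", "")
    if row.get("upc"):
        basis, key = "exact_upc", row["upc"]
    elif row.get("retailer_item_id"):
        basis, key = "exact_retailer_item_id", row["retailer_item_id"]
    elif name and (row.get("size_value") or row.get("pack_qty")):
        basis = "exact_name_size_pack"
        key = "|".join(
            [name, row.get("size_value", ""), row.get("size_unit", ""), row.get("pack_qty", "")]
        )
    else:
        # No strong evidence: the id is unique per row, so nothing collapses.
        basis, key = "row_fallback", row["normalized_row_id"]
    row["normalized_item_id"] = sketch_stable_id("nitem", row["retailer"], basis, key)
    row["normalization_basis"] = basis
    return row
```

Including the retailer in the hash keeps the id retailer-level, so two retailers selling the same UPC never share a `normalized_item_id` at this stage.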
* [X] t1.14.2: finalize filesystem and schema alignment for the refactor (2-4 commits)
bring on-disk outputs fully into the target `data/` structure without changing retailer behavior
** Acceptance Criteria
1. retailer data directories conform to pm/data-model.org:
- `data/giant-web/raw/...`
- `data/giant-web/collected_orders.csv`
- `data/giant-web/collected_items.csv`
- `data/giant-web/normalized_items.csv`
- `data/costco-web/raw/...`
- `data/costco-web/collected_orders.csv`
- `data/costco-web/collected_items.csv`
- `data/costco-web/normalized_items.csv`
2. review/combine outputs are moved or rewritten into the target review paths:
- `data/review/review_queue.csv`
- `data/review/product_links.csv`
- `data/review/review_resolutions.csv`
- `data/review/purchases.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
3. old transitional output paths are either:
- removed from active script defaults, or
- left as explicit compatibility shims with clear deprecation notes
4. no recollection is required if existing raw files and collected csvs can be moved/copied losslessly into the new structure
5. no schema information is lost during the move:
- raw paths still resolve
- collected/normalized csvs still open with the expected headers
6. README and task/docs reflect the final active paths
- pm note: prefer moving/adapting existing files over recollecting from retailers unless a real data loss or schema mismatch forces recollection
- pm note: this is a structure-alignment task, not a retailer parsing task
** evidence
- commit: `d2e6f2a`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`; verified `data/giant-web/collected_orders.csv`, `data/giant-web/collected_items.csv`, `data/costco-web/collected_orders.csv`, `data/costco-web/collected_items.csv`, `data/catalog.csv`, and archived transitional review outputs under `data/review/archive/`
- datetime: [2026-03-20 10:04:15 EDT]
** notes
- No recollection was needed; existing raw and collected exports were adapted in place and moved into the target names.
- Updated the active script defaults to point at `data/...` so the code and on-disk layout now agree.
- Kept obviously obsolete review artifacts, but moved them under `data/review/archive/` instead of deleting them outright.
* [X] t1.14.3: retailer-specific Costco normalization cleanup (2-4 commits)
tighten Costco-specific normalization so normalized item names are cleaner and deterministic retailer grouping is less noisy
** Acceptance Criteria
1. improve Costco item-name cleanup for obvious non-identity noise, such as:
- trailing slash fragments
- code tokens and receipt-format artifacts
- duplicated measurement fragments already captured in structured fields
2. preserve deterministic normalization rules only:
- exact retailer_item_id
- exact cleaned name + same size/pack when needed
- approved retailer alias
- no fuzzy or semantic matching
3. normalized Costco names improve on known bad examples, e.g.:
- `MANDARIN /` -> cleaner normalized item name
- `LIFE 6'TABLE ... /` -> cleaner normalized item name
4. cleanup does not overwrite retailer truth:
- raw `item_name` is unchanged
- parsed `size_value`, `size_unit`, `pack_qty`, and pricing fields remain intact
5. discount-row behavior remains correct:
- matched discount rows still populate `matched_discount_amount`
- `net_line_total` remains correct
- discount rows remain auditable
6. add regression tests for the cleaned Costco examples and any new parsing rules
- pm note: keep this explicitly Costco-specific; do not introduce a generic cleanup framework
- pm note: prefer a short allowlist/blocklist of known receipt artifacts over broad heuristics
** evidence
- commit: `bcec6b3`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_costco_pipeline`; `./venv/bin/python normalize_costco_web.py`; verified live cleaned examples in `data/costco-web/normalized_items.csv`, including `MANDARINS 2.27 KG / 5 LBS -> MANDARIN` and `LIFE 6'TABLE MDL #80873U - T12/H3/P36 -> LIFE 6'TABLE MDL`
- datetime: 2026-03-20 11:09:32 EDT
** notes
- Kept this explicitly Costco-specific and narrow: the cleanup removes known logistics/code artifacts and orphan slash tokens without introducing fuzzy naming logic.
- The structured parsing still owns size/pack extraction, so name cleanup can safely strip dual-unit and logistics fragments after those fields are parsed.
- Discount-line behavior remains unchanged; this task only cleaned normalized names and preserved the existing audit trail.
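A rough sketch of a narrow, blocklist-style name cleanup of the kind described above. The patterns and the `clean_costco_name` helper are assumptions for illustration and are not taken from `normalize_costco_web.py`; size and unit stripping is left out because the structured parser owns those fields.

```python
import re

# Short blocklist of known receipt artifacts (hypothetical patterns).
ARTIFACT_PATTERNS = [
    re.compile(r"#\w+"),                        # code tokens such as '#80873U'
    re.compile(r"\b[A-Z]\d+(?:/[A-Z]\d+)+\b"),  # logistics fragments such as 'T12/H3/P36'
    re.compile(r"(?:^|\s)/(?=\s|$)"),           # orphan slash tokens
    re.compile(r"(?:^|\s)-(?=\s|$)"),           # orphan hyphens left behind by the rules above
]


def clean_costco_name(raw_item_name):
    """Return a cleaned copy of the name; the raw value is never modified."""
    cleaned = raw_item_name.upper()
    for pattern in ARTIFACT_PATTERNS:
        cleaned = pattern.sub(" ", cleaned)
    # Collapse the whitespace left by removed fragments.
    return " ".join(cleaned.split())
```

Because every rule is an explicit pattern, a new receipt artifact means one more blocklist entry and one more regression test rather than a change to a general heuristic.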
* [X] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items
** Acceptance Criteria
1. refactor review outputs to conform to pm/data-model.org:
- data/review/review_queue.csv
- data/review/product_links.csv
- data/catalog.csv
- data/purchases.csv
2. review logic uses normalized_item_id as the upstream retailer-level review identity:
- no dependency on observed_product_id
- no dependency on products_observed.csv
- one review/link decision applies to all purchase rows sharing the same normalized_item_id
3. product_links.csv stores review-approved links from normalized_item_id to catalog_id
- one row per approved retailer-level identity to catalog mapping
4. catalog.csv entries are review-first and conservative:
- no auto-creation from weak normalized names alone
- names come from reviewed catalog naming, not raw retailer strings
- packaging/count is not embedded in catalog_name unless essential to identity
- catalog_name/product_type/category/brand/variant may be blank until reviewed; blank is preferred to guessed
5. purchases.csv remains pivot-ready and retains:
- raw item name
- normalized item name
- normalized_row_id (not for review)
- normalized_item_id
- catalog_id
- catalog fields
- raw line_total
- matched_discount_amount and net_line_total when present
- derived price fields and their bases
6. terminal review flow remains simple and usable:
- reviewer sees one grouped retailer item identity (normalized_item_id) with count and list of matches, not one prompt per purchase row; use existing pattern as a template
- link to existing catalog item
- create new catalog item
- exclude
- skip
7. pipeline accounting remains valid after the refactor:
- unresolved items are visible
- missing items are not silently dropped
8. pm note: prefer a better manual review loop over aggressive automatic grouping. initial manual data entry is expected, and should resolve over time
9. pm note: keep review/combine auditable; each catalog link should be explainable from normalized rows and review state
** evidence
- commit: `9104781`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python report_pipeline_status.py`; `./venv/bin/python build_purchases.py --help`; `./venv/bin/python review_products.py --help`; `./venv/bin/python report_pipeline_status.py --help`
- datetime: 2026-03-20 11:27:12 EDT
** notes
- The old observed/canonical auto-layer is no longer in the active review/combine path. `build_purchases.py`, `review_products.py`, and `report_pipeline_status.py` now operate on `normalized_item_id`, `catalog_id`, and `catalog_name`.
- I kept the review CLI shape intentionally close to the pre-refactor flow so the project only changed its identity model, not the operator workflow.
- Existing auto-generated catalog rows are no longer carried forward by default; only deliberate catalog entries survive. That keeps the new `catalog.csv` conservative, but it also means prior observed-based auto-links do not migrate into the new model.
- Live rerun after the refactor produced `627` purchase rows, `387` review-queue rows, `407` distinct normalized items, `0` linked normalized items, and `0` unresolved rows missing from the review queue.
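To illustrate how one approved link applies to every purchase row that shares a `normalized_item_id`, here is a small sketch. It assumes `product_links.csv` rows carry `normalized_item_id` and `catalog_id`, and `catalog.csv` rows carry `catalog_id` and `catalog_name`; the `apply_product_links` helper is hypothetical and is not the logic in `build_purchases.py`.

```python
def apply_product_links(purchase_rows, link_rows, catalog_rows):
    """Copy catalog fields onto purchase rows via normalized_item_id links."""
    catalog_by_id = {row["catalog_id"]: row for row in catalog_rows if row.get("catalog_id")}
    link_by_item = {row["normalized_item_id"]: row["catalog_id"] for row in link_rows}

    linked_rows = []
    for row in purchase_rows:
        catalog_id = link_by_item.get(row.get("normalized_item_id", ""), "")
        catalog = catalog_by_id.get(catalog_id)
        linked_rows.append(
            {
                **row,
                # Leave fields blank rather than guessing when no reviewed link exists.
                "catalog_id": catalog_id if catalog else "",
                "catalog_name": catalog.get("catalog_name", "") if catalog else "",
            }
        )
    return linked_rows
```

Unlinked rows are kept with blank catalog fields, which is what lets the status report count them as unresolved instead of dropping them.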
* [X] t1.16: cleanup review process and format
** acceptance criteria
1. Add intro text explaining:
1. catalog name: unique product including variant but not packaging, eg "whole milk", "sharp cheddar cheese"
2. product type: general product you would like to compare to, eg "milk", "cheese"
3. category: eg "dairy"
2. Reformat input per item
1. Change matched item field display order
2. Add count of distinct normalized_item_ids and total purchase rows already linked to the catalog item
3. Add option to select catalog suggestion directly
#+begin_comment
Review 7/22: MIXED PEPPER 6-PK
2 matched items:
- MIXED PEPPER 6-PK | costco | 2026-03-12 | 7.49 | [img_url]
- [raw_name] | [retailer] | [YYYY-mm-dd] | [price] | [img_url]
2 catalog suggestions found:
[1] bell pepper, pepper, produce (42 items)
[2] ground pepper, spice, baking (1 item)
[#] link to suggestion [n]ew [s]kip e[x]clude [q]uit >
#+end_comment
3. When creating new, ask for input in catalog_name, product_type, category order
1. enter to accept blank value
4. Each reviewed item is saved after user input, not at the end of the script.
1. on new creation, create entry in catalog.csv and create entry in product_links.csv
2. on link existing, create entry in product_links.csv
3. update review_queue.csv status for item immediately after action
5. linking operates at normalized_item_id level, not per normalized_row_id
6. ensure catalog.csv and product_links.csv are human-editable and consistent so manual correction is possible without tooling
** evidence
- commit: `975d44b`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python review_products.py --refresh-only`; `./venv/bin/python review_products.py --help`
- datetime: 2026-03-20 12:45:25 EDT
** notes
- The main flow change is operational rather than architectural: each review decision now persists immediately to `review_resolutions.csv`, `catalog.csv`, `product_links.csv`, and the on-disk `review_queue.csv`.
- Direct numeric selection works well for suggestion-heavy review, while `[l]ink existing` remains available as a fallback when the suggestion list is empty or incomplete.
- I kept the review data model unchanged from `t1.15`; this task only tightened the prompt format, field order, and save behavior.
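One way to persist each decision as soon as it is made is to append a row to the CSV rather than buffering until exit. The sketch below shows that idea with the standard library only; `append_csv_row` is a hypothetical helper, and the actual script may instead rewrite each file through its own save functions.

```python
import csv
from pathlib import Path


def append_csv_row(path, row, fieldnames):
    """Append one row, writing the header first if the file is new or empty."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending keeps the files plain and human-editable, and an interrupted review session loses at most the decision that was in progress.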
* [X] t1.16.1: add catalog search flow to review ui (2-3 commits)
enable fast lookup of catalog items during review via tokenized search and replace manual list scanning
** acceptance criteria
1. replace `[l]ink existing` with `[f]ind` in review prompt:
- `[#] link to suggestion [f]ind [n]ew [s]kip e[x]clude [q]uit >`
2. implement search flow:
- on `f`, prompt: `search: `
- tokenize input using same normalization rules as suggestion matching
- return ranked list of catalog items where tokens overlap with:
- catalog_name
- product_type
- variant
- display results in same numbered format as suggestions:
[1] flour, flour, baking (12 items, 48 rows)
3. allow direct selection from search results:
- when user inputs number, immediately creates approved resolution and product_links rows
- returns to next review item
4. reuse match logic used for suggestion matching; no new matching system introduced
- future improvements to matching logic will therefore apply in both places
5. search results exclude already-linked current normalized_item_id target
6. fallback behavior:
- if no results, print `no matches found`
- allow retry or return to main prompt
7. keep interaction tight:
- no full catalog dump
- max ~10 results returned
- sorted by simple score (token overlap count)
8. persistence:
- selected link writes immediately to `product_links.csv`
- no buffering until script end
- pm note: optimize for speed over correctness; this is a manual assist tool, not a ranking system
- pm note: improve manual lookup flow only, don't retool or create a second algorithm
** evidence
- commit: `f93b9aa`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python review_products.py --help`; `./venv/bin/python review_products.py --refresh-only`
- datetime: 2026-03-20 13:34:57 EDT
** notes
- The search path reuses the same lightweight token matching rules as suggestion ranking, so there is still only one matching system to maintain.
- Direct numeric suggestion-pick remains the fastest happy path; search is the fallback when suggestions are sparse or missing.
- Search intentionally optimizes for manual speed rather than smart ranking: simple token overlap, max 10 rows, and immediate persistence on selection.
- Follow-up fix: search moved to `[f]ind` so `[s]kip` remains available at the main prompt.
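The main prompt keys after the follow-up fix can be summarized with a small dispatch sketch. `parse_review_choice` and its return values are hypothetical and only illustrate how a number, `f`, `n`, `s`, `x`, and `q` might be told apart; the real prompt handling lives in `review_products.py`.

```python
def parse_review_choice(text, suggestion_count):
    """Map raw prompt input to an (action, suggestion_index) pair."""
    choice = text.strip().lower()
    if choice.isdecimal():
        index = int(choice)
        if 1 <= index <= suggestion_count:
            return ("link", index - 1)  # zero-based index into the suggestion list
        return ("invalid", None)
    actions = {"f": "find", "n": "new", "s": "skip", "x": "exclude", "q": "quit"}
    return (actions.get(choice, "invalid"), None)
```

Keeping `s` bound to skip and `f` to find means no key changes meaning between the main prompt and the search flow.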
* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
** acceptance criteria
- llm suggestions are generated only for unresolved normalized retailer items
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files

report_pipeline_status.py

@@ -0,0 +1,120 @@
import json
from pathlib import Path

import click

import build_purchases
import review_products
from layer_helpers import read_csv_rows, write_csv_rows


SUMMARY_FIELDS = ["stage", "count"]


def read_rows_if_exists(path):
    path = Path(path)
    if not path.exists():
        return []
    return read_csv_rows(path)


def build_status_summary(
    giant_orders,
    giant_items,
    giant_enriched,
    costco_orders,
    costco_items,
    costco_enriched,
    purchases,
    resolutions,
):
    normalized_rows = giant_enriched + costco_enriched
    queue_rows = review_products.build_review_queue(purchases, resolutions)
    queue_ids = {row["normalized_item_id"] for row in queue_rows}
    unresolved_purchase_rows = [
        row
        for row in purchases
        if row.get("normalized_item_id")
        and not row.get("catalog_id")
        and row.get("is_fee") != "true"
        and row.get("is_discount_line") != "true"
        and row.get("is_coupon_line") != "true"
    ]
    excluded_rows = [row for row in purchases if row.get("resolution_action") == "exclude"]
    linked_purchase_rows = [row for row in purchases if row.get("catalog_id")]
    distinct_normalized_items = {
        row["normalized_item_id"] for row in normalized_rows if row.get("normalized_item_id")
    }
    linked_normalized_items = {
        row["normalized_item_id"] for row in purchases if row.get("normalized_item_id") and row.get("catalog_id")
    }
    summary = [
        {"stage": "raw_orders", "count": len(giant_orders) + len(costco_orders)},
        {"stage": "raw_items", "count": len(giant_items) + len(costco_items)},
        {"stage": "normalized_items", "count": len(normalized_rows)},
        {"stage": "distinct_normalized_items", "count": len(distinct_normalized_items)},
        {"stage": "review_queue_normalized_items", "count": len(queue_rows)},
        {"stage": "linked_normalized_items", "count": len(linked_normalized_items)},
        {"stage": "linked_purchase_rows", "count": len(linked_purchase_rows)},
        {"stage": "final_purchase_rows", "count": len(purchases)},
        {"stage": "unresolved_purchase_rows", "count": len(unresolved_purchase_rows)},
        {"stage": "excluded_purchase_rows", "count": len(excluded_rows)},
        {
            "stage": "unresolved_not_in_review_rows",
            "count": len(
                [
                    row
                    for row in unresolved_purchase_rows
                    if row.get("normalized_item_id") not in queue_ids
                ]
            ),
        },
    ]
    return summary


@click.command()
@click.option("--giant-orders-csv", default="data/giant-web/collected_orders.csv", show_default=True)
@click.option("--giant-items-csv", default="data/giant-web/collected_items.csv", show_default=True)
@click.option("--giant-enriched-csv", default="data/giant-web/normalized_items.csv", show_default=True)
@click.option("--costco-orders-csv", default="data/costco-web/collected_orders.csv", show_default=True)
@click.option("--costco-items-csv", default="data/costco-web/collected_items.csv", show_default=True)
@click.option("--costco-enriched-csv", default="data/costco-web/normalized_items.csv", show_default=True)
@click.option("--purchases-csv", default="data/review/purchases.csv", show_default=True)
@click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--summary-csv", default="data/review/pipeline_status.csv", show_default=True)
@click.option("--summary-json", default="data/review/pipeline_status.json", show_default=True)
def main(
    giant_orders_csv,
    giant_items_csv,
    giant_enriched_csv,
    costco_orders_csv,
    costco_items_csv,
    costco_enriched_csv,
    purchases_csv,
    resolutions_csv,
    summary_csv,
    summary_json,
):
    summary_rows = build_status_summary(
        read_rows_if_exists(giant_orders_csv),
        read_rows_if_exists(giant_items_csv),
        read_rows_if_exists(giant_enriched_csv),
        read_rows_if_exists(costco_orders_csv),
        read_rows_if_exists(costco_items_csv),
        read_rows_if_exists(costco_enriched_csv),
        read_rows_if_exists(purchases_csv),
        [build_purchases.normalize_resolution_row(row) for row in read_rows_if_exists(resolutions_csv)],
    )
    write_csv_rows(summary_csv, summary_rows, SUMMARY_FIELDS)
    summary_json_path = Path(summary_json)
    summary_json_path.parent.mkdir(parents=True, exist_ok=True)
    summary_json_path.write_text(json.dumps(summary_rows, indent=2), encoding="utf-8")
    for row in summary_rows:
        click.echo(f"{row['stage']}: {row['count']}")


if __name__ == "__main__":
    main()
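
A small sketch of how the summary written by the script above might be checked after a run. It assumes the JSON file is a list of `{"stage", "count"}` objects, as produced by `json.dumps(summary_rows)`; the `find_accounting_gaps` and `load_summary` helpers and the specific checks are hypothetical.

```python
import json
from pathlib import Path


def load_summary(path="data/review/pipeline_status.json"):
    # Reads the list of {"stage": ..., "count": ...} rows from disk.
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_accounting_gaps(summary_rows):
    """Return human-readable problems found in a pipeline status summary."""
    counts = {row["stage"]: int(row["count"]) for row in summary_rows}
    gaps = []
    if counts.get("unresolved_not_in_review_rows", 0) != 0:
        gaps.append("unresolved rows are missing from the review queue")
    if counts.get("linked_purchase_rows", 0) > counts.get("final_purchase_rows", 0):
        gaps.append("more linked purchase rows than final purchase rows")
    if counts.get("linked_normalized_items", 0) > counts.get("distinct_normalized_items", 0):
        gaps.append("more linked normalized items than distinct normalized items")
    return gaps
```

An empty list means the counts are internally consistent; it does not say anything about whether individual links are correct.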


@@ -1,5 +1,6 @@
from collections import defaultdict from collections import defaultdict
from datetime import date from datetime import date
import re
import click import click
@@ -10,8 +11,8 @@ from layer_helpers import compact_join, stable_id, write_csv_rows
QUEUE_FIELDS = [ QUEUE_FIELDS = [
"review_id", "review_id",
"retailer", "retailer",
"observed_product_id", "normalized_item_id",
"canonical_product_id", "catalog_id",
"reason_code", "reason_code",
"priority", "priority",
"raw_item_names", "raw_item_names",
@@ -26,36 +27,57 @@ QUEUE_FIELDS = [
"updated_at", "updated_at",
] ]
INFO_COLOR = "cyan"
PROMPT_COLOR = "bright_yellow"
WARNING_COLOR = "magenta"
TOKEN_RE = re.compile(r"[A-Z0-9]+")
def print_intro_text():
click.secho("Review guide:", fg=INFO_COLOR)
click.echo(" catalog name: unique product identity including variant, but not packaging")
click.echo(" product type: general product you want to compare across purchases")
click.echo(" category: broad analysis bucket such as dairy, produce, or frozen")
def build_review_queue(purchase_rows, resolution_rows): def build_review_queue(purchase_rows, resolution_rows):
by_observed = defaultdict(list) by_normalized = defaultdict(list)
resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows) resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
for row in purchase_rows: for row in purchase_rows:
observed_product_id = row.get("observed_product_id", "") normalized_item_id = row.get("normalized_item_id", "")
if not observed_product_id: if not normalized_item_id:
continue continue
by_observed[observed_product_id].append(row) by_normalized[normalized_item_id].append(row)
today_text = str(date.today()) today_text = str(date.today())
queue_rows = [] queue_rows = []
for observed_product_id, rows in sorted(by_observed.items()): for normalized_item_id, rows in sorted(by_normalized.items()):
current_resolution = resolution_lookup.get(observed_product_id, {}) current_resolution = resolution_lookup.get(normalized_item_id, {})
if current_resolution.get("status") == "approved": if current_resolution.get("status") == "approved":
continue continue
unresolved_rows = [row for row in rows if not row.get("canonical_product_id")]
unresolved_rows = [
row
for row in rows
if not row.get("catalog_id")
and row.get("is_item", "true") != "false"
and row.get("is_fee") != "true"
and row.get("is_discount_line") != "true"
and row.get("is_coupon_line") != "true"
]
if not unresolved_rows: if not unresolved_rows:
continue continue
retailers = sorted({row["retailer"] for row in rows}) retailers = sorted({row["retailer"] for row in rows})
review_id = stable_id("rvw", observed_product_id) review_id = stable_id("rvw", normalized_item_id)
queue_rows.append( queue_rows.append(
{ {
"review_id": review_id, "review_id": review_id,
"retailer": " | ".join(retailers), "retailer": " | ".join(retailers),
"observed_product_id": observed_product_id, "normalized_item_id": normalized_item_id,
"canonical_product_id": current_resolution.get("canonical_product_id", ""), "catalog_id": current_resolution.get("catalog_id", ""),
"reason_code": "missing_canonical_link", "reason_code": "missing_catalog_link",
"priority": "high", "priority": "high",
"raw_item_names": compact_join( "raw_item_names": compact_join(
sorted({row["raw_item_name"] for row in rows if row["raw_item_name"]}), sorted({row["raw_item_name"] for row in rows if row["raw_item_name"]}),
@@ -98,9 +120,8 @@ def save_catalog_rows(path, rows):
write_csv_rows(path, rows, build_purchases.CATALOG_FIELDS) write_csv_rows(path, rows, build_purchases.CATALOG_FIELDS)
INFO_COLOR = "cyan" def save_link_rows(path, rows):
PROMPT_COLOR = "bright_yellow" write_csv_rows(path, rows, build_purchases.PRODUCT_LINK_FIELDS)
WARNING_COLOR = "magenta"
def sort_related_items(rows): def sort_related_items(rows):
@@ -115,7 +136,14 @@ def sort_related_items(rows):
) )
def build_canonical_suggestions(related_rows, catalog_rows, limit=3): def tokenize_match_text(*values):
tokens = set()
for value in values:
tokens.update(TOKEN_RE.findall((value or "").upper()))
return tokens
def build_catalog_suggestions(related_rows, purchase_rows, catalog_rows, limit=3):
normalized_names = { normalized_names = {
row.get("normalized_item_name", "").strip().upper() row.get("normalized_item_name", "").strip().upper()
for row in related_rows for row in related_rows
@@ -126,112 +154,203 @@ def build_canonical_suggestions(related_rows, catalog_rows, limit=3):
for row in related_rows for row in related_rows
if row.get("upc", "").strip() if row.get("upc", "").strip()
} }
catalog_by_id = {
row.get("catalog_id", ""): row for row in catalog_rows if row.get("catalog_id", "")
}
suggestions = [] suggestions = []
seen_ids = set() seen_ids = set()
def add_matches(rows, reason): def add_catalog_id(catalog_id, reason):
for row in rows: if not catalog_id or catalog_id in seen_ids or catalog_id not in catalog_by_id:
canonical_product_id = row.get("canonical_product_id", "") return False
if not canonical_product_id or canonical_product_id in seen_ids: seen_ids.add(catalog_id)
continue catalog_row = catalog_by_id[catalog_id]
seen_ids.add(canonical_product_id) suggestions.append(
suggestions.append( {
{ "catalog_id": catalog_id,
"canonical_product_id": canonical_product_id, "catalog_name": catalog_row.get("catalog_name", ""),
"canonical_name": row.get("canonical_name", ""), "reason": reason,
"reason": reason, }
} )
) return len(suggestions) >= limit
if len(suggestions) >= limit:
return True
return False
exact_upc_rows = [ reviewed_purchase_rows = [
row row for row in purchase_rows if row.get("catalog_id") and row.get("normalized_item_id")
for row in catalog_rows
if row.get("upc", "").strip() and row.get("upc", "").strip() in upcs
] ]
if add_matches(exact_upc_rows, "exact upc"): for row in reviewed_purchase_rows:
return suggestions if row.get("upc", "").strip() and row.get("upc", "").strip() in upcs:
if add_catalog_id(row.get("catalog_id", ""), "exact upc"):
return suggestions
exact_name_rows = [ for row in reviewed_purchase_rows:
row if row.get("normalized_item_name", "").strip().upper() in normalized_names:
for row in catalog_rows if add_catalog_id(row.get("catalog_id", ""), "exact normalized name"):
if row.get("canonical_name", "").strip().upper() in normalized_names return suggestions
]
if add_matches(exact_name_rows, "exact normalized name"):
return suggestions
contains_rows = [] for catalog_row in catalog_rows:
for row in catalog_rows: catalog_name = catalog_row.get("catalog_name", "").strip().upper()
canonical_name = row.get("canonical_name", "").strip().upper() if not catalog_name:
if not canonical_name:
continue continue
for normalized_name in normalized_names: for normalized_name in normalized_names:
if normalized_name in canonical_name or canonical_name in normalized_name: if normalized_name in catalog_name or catalog_name in normalized_name:
contains_rows.append(row) if add_catalog_id(catalog_row.get("catalog_id", ""), "catalog name contains match"):
return suggestions
break break
add_matches(contains_rows, "canonical name contains match")
return suggestions return suggestions
def build_display_lines(queue_row, related_rows): def search_catalog_rows(query, catalog_rows, purchase_rows, current_normalized_item_id, limit=10):
    query_tokens = tokenize_match_text(query)
    if not query_tokens:
        return []
    linked_purchase_counts = defaultdict(int)
    linked_normalized_ids = defaultdict(set)
    current_catalog_id = ""
    for row in purchase_rows:
        catalog_id = row.get("catalog_id", "")
        normalized_item_id = row.get("normalized_item_id", "")
        if catalog_id and normalized_item_id:
            linked_purchase_counts[catalog_id] += 1
            linked_normalized_ids[catalog_id].add(normalized_item_id)
        if normalized_item_id == current_normalized_item_id and catalog_id:
            current_catalog_id = catalog_id
    ranked_rows = []
    for row in catalog_rows:
        catalog_id = row.get("catalog_id", "")
        if not catalog_id or catalog_id == current_catalog_id:
            continue
        catalog_tokens = tokenize_match_text(
            row.get("catalog_name", ""),
            row.get("product_type", ""),
            row.get("variant", ""),
        )
        overlap = query_tokens & catalog_tokens
        if not overlap:
            continue
        ranked_rows.append(
            {
                "catalog_id": catalog_id,
                "catalog_name": row.get("catalog_name", ""),
                "product_type": row.get("product_type", ""),
                "category": row.get("category", ""),
                "variant": row.get("variant", ""),
                "linked_normalized_items": len(linked_normalized_ids.get(catalog_id, set())),
                "linked_purchase_rows": linked_purchase_counts.get(catalog_id, 0),
                "score": len(overlap),
            }
        )
    ranked_rows.sort(
        key=lambda row: (-row["score"], row["catalog_name"], row["catalog_id"])
    )
    return ranked_rows[:limit]


def suggestion_display_rows(suggestions, purchase_rows, catalog_rows):
    linked_purchase_counts = defaultdict(int)
    linked_normalized_ids = defaultdict(set)
    for row in purchase_rows:
        catalog_id = row.get("catalog_id", "")
        normalized_item_id = row.get("normalized_item_id", "")
        if not catalog_id or not normalized_item_id:
            continue
        linked_purchase_counts[catalog_id] += 1
        linked_normalized_ids[catalog_id].add(normalized_item_id)
    display_rows = []
    catalog_details = {
        row["catalog_id"]: {
            "product_type": row.get("product_type", ""),
            "category": row.get("category", ""),
        }
        for row in catalog_rows
        if row.get("catalog_id")
    }
    for row in purchase_rows:
        if row.get("catalog_id"):
            catalog_details.setdefault(
                row["catalog_id"],
                {
                    "product_type": row.get("product_type", ""),
                    "category": row.get("category", ""),
                },
            )
    for row in suggestions:
        catalog_id = row["catalog_id"]
        details = catalog_details.get(catalog_id, {})
        display_rows.append(
            {
                **row,
                "product_type": details.get("product_type", ""),
                "category": details.get("category", ""),
                "linked_purchase_rows": linked_purchase_counts.get(catalog_id, 0),
                "linked_normalized_items": len(linked_normalized_ids.get(catalog_id, set())),
            }
        )
    return display_rows


def print_catalog_rows(rows):
    for index, row in enumerate(rows, start=1):
        click.echo(
            f" [{index}] {row['catalog_name']}, {row.get('product_type', '')}, "
            f"{row.get('category', '')} ({row['linked_normalized_items']} items, "
            f"{row['linked_purchase_rows']} rows)"
        )


def build_display_lines(related_rows):
lines = [] lines = []
for index, row in enumerate(sort_related_items(related_rows), start=1): for index, row in enumerate(sort_related_items(related_rows), start=1):
lines.append( lines.append(
" [{index}] {purchase_date} | {line_total} | {raw_item_name} | {normalized_item_name} | " " [{index}] {raw_item_name} | {retailer} | {purchase_date} | {line_total} | {image_url}".format(
"{upc} | {retailer}".format(
index=index, index=index,
raw_item_name=row.get("raw_item_name", ""),
retailer=row.get("retailer", ""),
purchase_date=row.get("purchase_date", ""), purchase_date=row.get("purchase_date", ""),
line_total=row.get("line_total", ""), line_total=row.get("line_total", ""),
raw_item_name=row.get("raw_item_name", ""), image_url=row.get("image_url", ""),
normalized_item_name=row.get("normalized_item_name", ""),
upc=row.get("upc", ""),
retailer=row.get("retailer", ""),
) )
) )
if row.get("image_url"):
lines.append(f" {row['image_url']}")
if not lines: if not lines:
lines.append(" [1] no matched item rows found") lines.append(" [1] no matched item rows found")
return lines return lines
def observed_name(queue_row, related_rows): def normalized_label(queue_row, related_rows):
if queue_row.get("normalized_names"): if queue_row.get("normalized_names"):
return queue_row["normalized_names"].split(" | ")[0] return queue_row["normalized_names"].split(" | ")[0]
for row in related_rows: for row in related_rows:
if row.get("normalized_item_name"): if row.get("normalized_item_name"):
return row["normalized_item_name"] return row["normalized_item_name"]
return queue_row.get("observed_product_id", "") return queue_row.get("normalized_item_id", "")
def choose_existing_canonical(display_rows, observed_label, matched_count): def choose_existing_catalog(display_rows, normalized_name, matched_count):
click.secho( click.secho(
f"Select the canonical_name to associate {matched_count} items with:", f"Select the catalog_name to associate {matched_count} items with:",
fg=INFO_COLOR, fg=INFO_COLOR,
) )
for index, row in enumerate(display_rows, start=1): print_catalog_rows(display_rows)
click.echo(f" [{index}] {row['canonical_name']} | {row['canonical_product_id']}")
choice = click.prompt( choice = click.prompt(
click.style("selection", fg=PROMPT_COLOR), click.style("selection", fg=PROMPT_COLOR),
type=click.IntRange(1, len(display_rows)), type=click.IntRange(1, len(display_rows)),
) )
chosen_row = display_rows[choice - 1] chosen_row = display_rows[choice - 1]
click.echo( click.echo(
f'{matched_count} "{observed_label}" items and future matches will be associated ' f'{matched_count} "{normalized_name}" items and future matches will be associated '
f'with "{chosen_row["canonical_name"]}".' f'with "{chosen_row["catalog_name"]}".'
)
click.secho(
"actions: [y]es [n]o [b]ack [s]kip [q]uit",
fg=PROMPT_COLOR,
) )
click.secho("actions: [y]es [n]o [b]ack [s]kip [q]uit", fg=PROMPT_COLOR)
confirm = click.prompt( confirm = click.prompt(
click.style("confirm", fg=PROMPT_COLOR), click.style("confirm", fg=PROMPT_COLOR),
type=click.Choice(["y", "n", "b", "s", "q"]), type=click.Choice(["y", "n", "b", "s", "q"]),
) )
if confirm == "y": if confirm == "y":
return chosen_row["canonical_product_id"], "" return chosen_row["catalog_id"], ""
if confirm == "s": if confirm == "s":
return "", "skip" return "", "skip"
if confirm == "q": if confirm == "q":
@@ -239,118 +358,118 @@ def choose_existing_canonical(display_rows, observed_label, matched_count):
return "", "back" return "", "back"
def prompt_resolution(queue_row, related_rows, catalog_rows, queue_index, queue_total): def prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total):
suggestions = build_canonical_suggestions(related_rows, catalog_rows) suggestions = suggestion_display_rows(
observed_label = observed_name(queue_row, related_rows) build_catalog_suggestions(related_rows, purchase_rows, catalog_rows),
purchase_rows,
catalog_rows,
)
normalized_name = normalized_label(queue_row, related_rows)
matched_count = len(related_rows) matched_count = len(related_rows)
click.echo("") click.echo("")
click.secho( click.secho(
f"Review {queue_index}/{queue_total}: Resolve observed_product {observed_label} " f"Review {queue_index}/{queue_total}: {normalized_name}",
"to canonical_name [__]?",
fg=INFO_COLOR, fg=INFO_COLOR,
) )
click.echo(f"{matched_count} matched items:") click.echo(f"{matched_count} matched items:")
for line in build_display_lines(queue_row, related_rows): for line in build_display_lines(related_rows):
click.echo(line) click.echo(line)
if suggestions: if suggestions:
click.echo(f"{len(suggestions)} canonical suggestions found:") click.echo(f"{len(suggestions)} catalog_name suggestions found:")
for index, suggestion in enumerate(suggestions, start=1): print_catalog_rows(suggestions)
click.echo(f" [{index}] {suggestion['canonical_name']}")
else: else:
click.echo("no canonical_name suggestions found") click.echo("no catalog_name suggestions found")
click.secho( prompt_bits = []
"[l]ink existing [n]ew canonical e[x]clude [s]kip [q]uit:", if suggestions:
fg=PROMPT_COLOR, prompt_bits.append("[#] link to suggestion")
) prompt_bits.extend(["[f]ind", "[n]ew", "[s]kip", "e[x]clude", "[q]uit"])
action = click.prompt( click.secho(" ".join(prompt_bits) + " >", fg=PROMPT_COLOR)
"", action = click.prompt("", type=str, prompt_suffix=" ").strip().lower()
type=click.Choice(["l", "n", "x", "s", "q"]), if action.isdigit() and suggestions:
prompt_suffix=" ", choice = int(action)
) if 1 <= choice <= len(suggestions):
chosen_row = suggestions[choice - 1]
notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
return {
"normalized_item_id": queue_row["normalized_item_id"],
"catalog_id": chosen_row["catalog_id"],
"resolution_action": "link",
"status": "approved",
"resolution_notes": notes,
"reviewed_at": str(date.today()),
}, None
click.secho("invalid suggestion number", fg=WARNING_COLOR)
return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
if action == "q": if action == "q":
return None, None return None, None
if action == "s": if action == "s":
return { return {
"observed_product_id": queue_row["observed_product_id"], "normalized_item_id": queue_row["normalized_item_id"],
"canonical_product_id": "", "catalog_id": "",
"resolution_action": "skip", "resolution_action": "skip",
"status": "pending", "status": "pending",
"resolution_notes": queue_row.get("resolution_notes", ""), "resolution_notes": queue_row.get("resolution_notes", ""),
"reviewed_at": str(date.today()), "reviewed_at": str(date.today()),
}, None }, None
if action == "f":
while True:
query = click.prompt(click.style("search", fg=PROMPT_COLOR), default="", show_default=False).strip()
if not query:
return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
search_rows = search_catalog_rows(
query,
catalog_rows,
purchase_rows,
queue_row["normalized_item_id"],
)
if not search_rows:
click.echo("no matches found")
retry = click.prompt(
click.style("search again? [enter=yes, q=no]", fg=PROMPT_COLOR),
default="",
show_default=False,
).strip().lower()
if retry == "q":
return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
continue
click.echo(f"{len(search_rows)} search results found:")
print_catalog_rows(search_rows)
choice = click.prompt(
click.style("selection", fg=PROMPT_COLOR),
type=click.IntRange(1, len(search_rows)),
)
chosen_row = search_rows[choice - 1]
notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
return {
"normalized_item_id": queue_row["normalized_item_id"],
"catalog_id": chosen_row["catalog_id"],
"resolution_action": "link",
"status": "approved",
"resolution_notes": notes,
"reviewed_at": str(date.today()),
}, None
if action == "x": if action == "x":
notes = click.prompt( notes = click.prompt(click.style("exclude notes", fg=PROMPT_COLOR), default="", show_default=False)
click.style("exclude notes", fg=PROMPT_COLOR),
default="",
show_default=False,
)
return { return {
"observed_product_id": queue_row["observed_product_id"], "normalized_item_id": queue_row["normalized_item_id"],
"canonical_product_id": "", "catalog_id": "",
"resolution_action": "exclude", "resolution_action": "exclude",
"status": "approved", "status": "approved",
"resolution_notes": notes, "resolution_notes": notes,
"reviewed_at": str(date.today()), "reviewed_at": str(date.today()),
}, None }, None
if action == "l": if action != "n":
display_rows = suggestions or [ click.secho("invalid action", fg=WARNING_COLOR)
{ return prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, queue_index, queue_total)
"canonical_product_id": row["canonical_product_id"],
"canonical_name": row["canonical_name"],
"reason": "catalog sample",
}
for row in catalog_rows[:10]
]
while True:
canonical_product_id, outcome = choose_existing_canonical(
display_rows,
observed_label,
matched_count,
)
if outcome == "skip":
return {
"observed_product_id": queue_row["observed_product_id"],
"canonical_product_id": "",
"resolution_action": "skip",
"status": "pending",
"resolution_notes": queue_row.get("resolution_notes", ""),
"reviewed_at": str(date.today()),
}, None
if outcome == "quit":
return None, None
if outcome == "back":
continue
break
notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
return {
"observed_product_id": queue_row["observed_product_id"],
"canonical_product_id": canonical_product_id,
"resolution_action": "link",
"status": "approved",
"resolution_notes": notes,
"reviewed_at": str(date.today()),
}, None
canonical_name = click.prompt(click.style("canonical name", fg=PROMPT_COLOR), type=str) catalog_name = click.prompt(click.style("catalog name", fg=PROMPT_COLOR), type=str)
category = click.prompt( product_type = click.prompt(click.style("product type", fg=PROMPT_COLOR), default="", show_default=False)
click.style("category", fg=PROMPT_COLOR), category = click.prompt(click.style("category", fg=PROMPT_COLOR), default="", show_default=False)
default="", notes = click.prompt(click.style("notes", fg=PROMPT_COLOR), default="", show_default=False)
show_default=False, catalog_id = stable_id("cat", f"manual|{catalog_name}|{category}|{product_type}")
) catalog_row = {
product_type = click.prompt( "catalog_id": catalog_id,
click.style("product type", fg=PROMPT_COLOR), "catalog_name": catalog_name,
default="",
show_default=False,
)
notes = click.prompt(
click.style("notes", fg=PROMPT_COLOR),
default="",
show_default=False,
)
canonical_product_id = stable_id("gcan", f"manual|{canonical_name}|{category}|{product_type}")
canonical_row = {
"canonical_product_id": canonical_product_id,
"canonical_name": canonical_name,
"category": category, "category": category,
"product_type": product_type, "product_type": product_type,
"brand": "", "brand": "",
@@ -364,27 +483,51 @@ def prompt_resolution(queue_row, related_rows, catalog_rows, queue_index, queue_
"updated_at": str(date.today()), "updated_at": str(date.today()),
} }
resolution_row = { resolution_row = {
"observed_product_id": queue_row["observed_product_id"], "normalized_item_id": queue_row["normalized_item_id"],
"canonical_product_id": canonical_product_id, "catalog_id": catalog_id,
"resolution_action": "create", "resolution_action": "create",
"status": "approved", "status": "approved",
"resolution_notes": notes, "resolution_notes": notes,
"reviewed_at": str(date.today()), "reviewed_at": str(date.today()),
} }
return resolution_row, canonical_row return resolution_row, catalog_row


def apply_resolution_to_queue(queue_rows, resolution_lookup):
    today_text = str(date.today())
    updated_rows = []
    for row in queue_rows:
        resolution = resolution_lookup.get(row["normalized_item_id"], {})
        row_copy = dict(row)
        if resolution:
            row_copy["catalog_id"] = resolution.get("catalog_id", "")
            row_copy["status"] = resolution.get("status", row_copy.get("status", "pending"))
            row_copy["resolution_action"] = resolution.get("resolution_action", "")
            row_copy["resolution_notes"] = resolution.get("resolution_notes", "")
            row_copy["updated_at"] = resolution.get("reviewed_at", today_text)
            if resolution.get("status") == "approved":
                row_copy["created_at"] = row_copy.get("created_at") or resolution.get("reviewed_at", today_text)
        updated_rows.append(row_copy)
    return updated_rows


def link_rows_from_state(link_lookup):
    return sorted(link_lookup.values(), key=lambda row: row["normalized_item_id"])


@click.command() @click.command()
@click.option("--purchases-csv", default="combined_output/purchases.csv", show_default=True) @click.option("--purchases-csv", default="data/review/purchases.csv", show_default=True)
@click.option("--queue-csv", default="combined_output/review_queue.csv", show_default=True) @click.option("--queue-csv", default="data/review/review_queue.csv", show_default=True)
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True) @click.option("--resolutions-csv", default="data/review/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="combined_output/canonical_catalog.csv", show_default=True) @click.option("--catalog-csv", default="data/catalog.csv", show_default=True)
@click.option("--links-csv", default="data/review/product_links.csv", show_default=True)
@click.option("--limit", default=0, show_default=True, type=int) @click.option("--limit", default=0, show_default=True, type=int)
@click.option("--refresh-only", is_flag=True, help="Only rebuild review_queue.csv without prompting.") @click.option("--refresh-only", is_flag=True, help="Only rebuild review_queue.csv without prompting.")
def main(purchases_csv, queue_csv, resolutions_csv, catalog_csv, limit, refresh_only): def main(purchases_csv, queue_csv, resolutions_csv, catalog_csv, links_csv, limit, refresh_only):
purchase_rows = build_purchases.read_optional_csv_rows(purchases_csv) purchase_rows = build_purchases.read_optional_csv_rows(purchases_csv)
resolution_rows = build_purchases.read_optional_csv_rows(resolutions_csv) resolution_rows = build_purchases.read_optional_csv_rows(resolutions_csv)
catalog_rows = build_purchases.read_optional_csv_rows(catalog_csv) catalog_rows = build_purchases.merge_catalog_rows(build_purchases.read_optional_csv_rows(catalog_csv), [])
link_lookup = build_purchases.load_link_lookup(build_purchases.read_optional_csv_rows(links_csv))
queue_rows = build_review_queue(purchase_rows, resolution_rows) queue_rows = build_review_queue(purchase_rows, resolution_rows)
write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS) write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
click.echo(f"wrote {len(queue_rows)} rows to {queue_csv}") click.echo(f"wrote {len(queue_rows)} rows to {queue_csv}")
@@ -392,33 +535,60 @@ def main(purchases_csv, queue_csv, resolutions_csv, catalog_csv, limit, refresh_
if refresh_only: if refresh_only:
return return
print_intro_text()
resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows) resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
catalog_by_id = {row["canonical_product_id"]: row for row in catalog_rows if row.get("canonical_product_id")} catalog_by_id = {row["catalog_id"]: row for row in catalog_rows if row.get("catalog_id")}
rows_by_observed = defaultdict(list) rows_by_normalized = defaultdict(list)
for row in purchase_rows: for row in purchase_rows:
observed_product_id = row.get("observed_product_id", "") normalized_item_id = row.get("normalized_item_id", "")
if observed_product_id: if normalized_item_id:
rows_by_observed[observed_product_id].append(row) rows_by_normalized[normalized_item_id].append(row)
reviewed = 0 reviewed = 0
for index, queue_row in enumerate(queue_rows, start=1): for index, queue_row in enumerate(queue_rows, start=1):
if limit and reviewed >= limit: if limit and reviewed >= limit:
break break
related_rows = rows_by_observed.get(queue_row["observed_product_id"], []) related_rows = rows_by_normalized.get(queue_row["normalized_item_id"], [])
result = prompt_resolution(queue_row, related_rows, catalog_rows, index, len(queue_rows)) result = prompt_resolution(queue_row, related_rows, purchase_rows, catalog_rows, index, len(queue_rows))
if result == (None, None): if result == (None, None):
break break
resolution_row, canonical_row = result resolution_row, catalog_row = result
resolution_lookup[resolution_row["observed_product_id"]] = resolution_row resolution_lookup[resolution_row["normalized_item_id"]] = resolution_row
if canonical_row and canonical_row["canonical_product_id"] not in catalog_by_id: if catalog_row and catalog_row["catalog_id"] not in catalog_by_id:
catalog_by_id[canonical_row["canonical_product_id"]] = canonical_row catalog_by_id[catalog_row["catalog_id"]] = catalog_row
catalog_rows.append(canonical_row) catalog_rows.append(catalog_row)
normalized_item_id = resolution_row["normalized_item_id"]
if resolution_row["status"] == "approved":
if resolution_row["resolution_action"] in {"link", "create"} and resolution_row.get("catalog_id"):
link_lookup[normalized_item_id] = {
"normalized_item_id": normalized_item_id,
"catalog_id": resolution_row["catalog_id"],
"link_method": f"manual_{resolution_row['resolution_action']}",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": resolution_row.get("reviewed_at", ""),
"link_notes": resolution_row.get("resolution_notes", ""),
}
elif resolution_row["resolution_action"] == "exclude":
link_lookup.pop(normalized_item_id, None)
queue_rows = apply_resolution_to_queue(queue_rows, resolution_lookup)
write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
save_resolution_rows(
resolutions_csv,
sorted(resolution_lookup.values(), key=lambda row: row["normalized_item_id"]),
)
save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["catalog_id"]))
save_link_rows(links_csv, link_rows_from_state(link_lookup))
reviewed += 1 reviewed += 1
save_resolution_rows(resolutions_csv, sorted(resolution_lookup.values(), key=lambda row: row["observed_product_id"])) save_resolution_rows(resolutions_csv, sorted(resolution_lookup.values(), key=lambda row: row["normalized_item_id"]))
save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["canonical_product_id"])) save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["catalog_id"]))
save_link_rows(links_csv, link_rows_from_state(link_lookup))
click.echo( click.echo(
f"saved {len(resolution_lookup)} resolution rows to {resolutions_csv} " f"saved {len(resolution_lookup)} resolution rows to {resolutions_csv}, "
f"and {len(catalog_by_id)} catalog rows to {catalog_csv}" f"{len(catalog_by_id)} catalog rows to {catalog_csv}, "
f"and {len(link_lookup)} product links to {links_csv}"
) )
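
The search added in `search_catalog_rows` ranks catalog rows by how many query tokens they share with the row's name, product type, and variant. A minimal, self-contained sketch of that idea follows. The `tokenize` helper is a hypothetical stand-in, since the body of `tokenize_match_text` does not appear in this diff, and `rank_catalog` omits the link counts and current-item exclusion that the real function carries.

```python
import re


def tokenize(*parts):
    # Hypothetical stand-in for tokenize_match_text (not shown in the diff):
    # lowercase each part, split on non-alphanumerics, collect into a set.
    tokens = set()
    for part in parts:
        tokens.update(t for t in re.split(r"[^a-z0-9]+", part.lower()) if t)
    return tokens


def rank_catalog(query, catalog_rows, limit=10):
    query_tokens = tokenize(query)
    if not query_tokens:
        return []
    ranked = []
    for row in catalog_rows:
        overlap = query_tokens & tokenize(
            row.get("catalog_name", ""),
            row.get("product_type", ""),
            row.get("variant", ""),
        )
        if overlap:
            ranked.append({**row, "score": len(overlap)})
    # Highest score first; ties fall back to name, then id, so output is stable.
    ranked.sort(
        key=lambda r: (-r["score"], r.get("catalog_name", ""), r.get("catalog_id", ""))
    )
    return ranked[:limit]


# Illustrative data only; these rows are not from the repository.
catalog = [
    {"catalog_id": "cat_1", "catalog_name": "Whole Milk", "product_type": "milk", "variant": "whole"},
    {"catalog_id": "cat_2", "catalog_name": "Oat Milk", "product_type": "milk", "variant": "oat"},
    {"catalog_id": "cat_3", "catalog_name": "Lime", "product_type": "fruit", "variant": ""},
]
results = rank_catalog("whole milk", catalog)
```

Negating the score in the sort key keeps a single ascending sort while still putting the best match first, which is the same trick the diff uses.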


@@ -648,6 +648,27 @@ def main(
window_days, window_days,
months_back, months_back,
firefox_profile_dir, firefox_profile_dir,
):
    click.echo("legacy entrypoint: prefer collect_costco_web.py for data-model outputs")
    run_collection(
        outdir=outdir,
        document_type=document_type,
        document_sub_type=document_sub_type,
        window_days=window_days,
        months_back=months_back,
        firefox_profile_dir=firefox_profile_dir,
    )


def run_collection(
    outdir,
    document_type,
    document_sub_type,
    window_days,
    months_back,
    firefox_profile_dir,
    orders_filename="orders.csv",
    items_filename="items.csv",
): ):
outdir = Path(outdir) outdir = Path(outdir)
raw_dir = outdir / "raw" raw_dir = outdir / "raw"
@@ -706,8 +727,8 @@ def main(
write_json(raw_dir / f"{safe_filename(receipt_id)}.json", detail_payload) write_json(raw_dir / f"{safe_filename(receipt_id)}.json", detail_payload)
orders, items = flatten_costco_data(summary_payload, detail_payloads, raw_dir) orders, items = flatten_costco_data(summary_payload, detail_payloads, raw_dir)
write_csv(outdir / "orders.csv", orders, ORDER_FIELDS) write_csv(outdir / orders_filename, orders, ORDER_FIELDS)
write_csv(outdir / "items.csv", items, ITEM_FIELDS) write_csv(outdir / items_filename, items, ITEM_FIELDS)
click.echo(f"wrote {len(orders)} orders and {len(items)} item rows to {outdir}") click.echo(f"wrote {len(orders)} orders and {len(items)} item rows to {outdir}")


@@ -13,8 +13,10 @@ from browser_session import find_firefox_profile_dir, load_firefox_cookies
BASE = "https://giantfood.com" BASE = "https://giantfood.com"
ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store" ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
RETAILER = "giant"
ORDER_FIELDS = [ ORDER_FIELDS = [
"retailer",
"order_id", "order_id",
"order_date", "order_date",
"delivery_date", "delivery_date",
@@ -33,12 +35,16 @@ ORDER_FIELDS = [
"store_zipcode", "store_zipcode",
"refund_order", "refund_order",
"ebt_order", "ebt_order",
"raw_history_path",
"raw_order_path",
] ]
ITEM_FIELDS = [ ITEM_FIELDS = [
"retailer",
"order_id", "order_id",
"order_date", "order_date",
"line_no", "line_no",
"retailer_item_id",
"pod_id", "pod_id",
"item_name", "item_name",
"upc", "upc",
@@ -53,6 +59,10 @@ ITEM_FIELDS = [
"reward_savings", "reward_savings",
"coupon_savings", "coupon_savings",
"coupon_price", "coupon_price",
"image_url",
"raw_order_path",
"is_discount_line",
"is_coupon_line",
] ]
@@ -130,18 +140,21 @@ def get_order_detail(session, user_id, order_id):
return response.json() return response.json()
def flatten_orders(history, details): def flatten_orders(history, details, history_path=None, raw_dir=None):
orders = [] orders = []
items = [] items = []
history_lookup = {record["orderId"]: record for record in history.get("records", [])} history_lookup = {record["orderId"]: record for record in history.get("records", [])}
history_path_value = history_path.as_posix() if history_path else ""
for detail in details: for detail in details:
order_id = str(detail["orderId"]) order_id = str(detail["orderId"])
history_row = history_lookup.get(detail["orderId"], {}) history_row = history_lookup.get(detail["orderId"], {})
pickup = detail.get("pup", {}) pickup = detail.get("pup", {})
raw_order_path = (raw_dir / f"{order_id}.json").as_posix() if raw_dir else ""
orders.append( orders.append(
{ {
"retailer": RETAILER,
"order_id": order_id, "order_id": order_id,
"order_date": detail.get("orderDate"), "order_date": detail.get("orderDate"),
"delivery_date": detail.get("deliveryDate"), "delivery_date": detail.get("deliveryDate"),
@@ -160,15 +173,19 @@ def flatten_orders(history, details):
"store_zipcode": pickup.get("storeZipcode"), "store_zipcode": pickup.get("storeZipcode"),
"refund_order": detail.get("refundOrder"), "refund_order": detail.get("refundOrder"),
"ebt_order": detail.get("ebtOrder"), "ebt_order": detail.get("ebtOrder"),
"raw_history_path": history_path_value,
"raw_order_path": raw_order_path,
} }
) )
for line_no, item in enumerate(detail.get("items", []), start=1): for line_no, item in enumerate(detail.get("items", []), start=1):
items.append( items.append(
{ {
"retailer": RETAILER,
"order_id": order_id, "order_id": order_id,
"order_date": detail.get("orderDate"), "order_date": detail.get("orderDate"),
"line_no": str(line_no), "line_no": str(line_no),
"retailer_item_id": "",
"pod_id": item.get("podId"), "pod_id": item.get("podId"),
"item_name": item.get("itemName"), "item_name": item.get("itemName"),
"upc": item.get("primUpcCd"), "upc": item.get("primUpcCd"),
@@ -183,6 +200,10 @@ def flatten_orders(history, details):
"reward_savings": item.get("rewardSavings"), "reward_savings": item.get("rewardSavings"),
"coupon_savings": item.get("couponSavings"), "coupon_savings": item.get("couponSavings"),
"coupon_price": item.get("couponPrice"), "coupon_price": item.get("couponPrice"),
"image_url": "",
"raw_order_path": raw_order_path,
"is_discount_line": "false",
"is_coupon_line": "false",
} }
) )
@@ -269,6 +290,18 @@ def write_json(path, payload):
help="Delay between order detail requests.", help="Delay between order detail requests.",
) )
def main(user_id, loyalty, outdir, sleep_seconds): def main(user_id, loyalty, outdir, sleep_seconds):
    click.echo("legacy entrypoint: prefer collect_giant_web.py for data-model outputs")
    run_collection(user_id, loyalty, outdir, sleep_seconds)


def run_collection(
    user_id,
    loyalty,
    outdir,
    sleep_seconds,
    orders_filename="orders.csv",
    items_filename="items.csv",
):
config = load_config() config = load_config()
user_id = user_id or config["user_id"] or click.prompt("Giant user id", type=str) user_id = user_id or config["user_id"] or click.prompt("Giant user id", type=str)
loyalty = loyalty or config["loyalty"] or click.prompt( loyalty = loyalty or config["loyalty"] or click.prompt(
@@ -279,13 +312,14 @@ def main(user_id, loyalty, outdir, sleep_seconds):
rawdir = outdir / "raw" rawdir = outdir / "raw"
rawdir.mkdir(parents=True, exist_ok=True) rawdir.mkdir(parents=True, exist_ok=True)
orders_csv = outdir / "orders.csv" orders_csv = outdir / orders_filename
items_csv = outdir / "items.csv" items_csv = outdir / items_filename
existing_order_ids = read_existing_order_ids(orders_csv) existing_order_ids = read_existing_order_ids(orders_csv)
session = build_session() session = build_session()
history = get_history(session, user_id, loyalty) history = get_history(session, user_id, loyalty)
write_json(rawdir / "history.json", history) history_path = rawdir / "history.json"
write_json(history_path, history)
records = history.get("records", []) records = history.get("records", [])
click.echo(f"history returned {len(records)} visits; Giant exposes only the most recent 50") click.echo(f"history returned {len(records)} visits; Giant exposes only the most recent 50")
@@ -310,7 +344,7 @@ def main(user_id, loyalty, outdir, sleep_seconds):
if index < len(unseen_records): if index < len(unseen_records):
time.sleep(sleep_seconds) time.sleep(sleep_seconds)
orders, items = flatten_orders(history, details) orders, items = flatten_orders(history, details, history_path=history_path, raw_dir=rawdir)
merged_orders = append_dedup( merged_orders = append_dedup(
orders_csv, orders_csv,
orders, orders,


@@ -4,7 +4,7 @@ import build_canonical_layer
class CanonicalLayerTests(unittest.TestCase): class CanonicalLayerTests(unittest.TestCase):
def test_build_canonical_layer_auto_links_exact_upc_and_name_size(self): def test_build_canonical_layer_auto_links_exact_upc_and_name_size_only(self):
observed_rows = [ observed_rows = [
{ {
"observed_product_id": "gobs_1", "observed_product_id": "gobs_1",
@@ -81,6 +81,21 @@ class CanonicalLayerTests(unittest.TestCase):
"is_discount_line": "false", "is_discount_line": "false",
"is_coupon_line": "false", "is_coupon_line": "false",
}, },
{
"observed_product_id": "gobs_6",
"representative_upc": "",
"representative_retailer_item_id": "",
"representative_name_norm": "LIME",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "",
"representative_measure_type": "each",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
] ]
canonicals, links = build_canonical_layer.build_canonical_layer(observed_rows) canonicals, links = build_canonical_layer.build_canonical_layer(observed_rows)
@@ -93,6 +108,11 @@ class CanonicalLayerTests(unittest.TestCase):
self.assertEqual("exact_name_size", methods["gobs_3"]) self.assertEqual("exact_name_size", methods["gobs_3"])
self.assertEqual("exact_name_size", methods["gobs_4"]) self.assertEqual("exact_name_size", methods["gobs_4"])
self.assertNotIn("gobs_5", methods) self.assertNotIn("gobs_5", methods)
self.assertNotIn("gobs_6", methods)

    def test_clean_canonical_name_removes_packaging_noise(self):
        self.assertEqual("LIME", build_canonical_layer.clean_canonical_name("LIME . / ."))
        self.assertEqual("EGG", build_canonical_layer.clean_canonical_name("5DZ EGG / /"))

if __name__ == "__main__": if __name__ == "__main__":


@@ -258,6 +258,11 @@ class CostcoPipelineTests(unittest.TestCase):
self.assertEqual("MIXED PEPPER", row["item_name_norm"]) self.assertEqual("MIXED PEPPER", row["item_name_norm"])
self.assertEqual("6", row["pack_qty"]) self.assertEqual("6", row["pack_qty"])
self.assertEqual("count", row["measure_type"]) self.assertEqual("count", row["measure_type"])
self.assertEqual("costco:abc:1", row["normalized_row_id"])
self.assertEqual("exact_retailer_item_id", row["normalization_basis"])
self.assertTrue(row["normalized_item_id"])
self.assertEqual("6", row["normalized_quantity"])
self.assertEqual("count", row["normalized_quantity_unit"])
discount = enrich_costco.parse_costco_item( discount = enrich_costco.parse_costco_item(
order_id="abc", order_id="abc",
@@ -278,6 +283,99 @@ class CostcoPipelineTests(unittest.TestCase):
) )
self.assertEqual("true", discount["is_discount_line"]) self.assertEqual("true", discount["is_discount_line"])
self.assertEqual("true", discount["is_coupon_line"]) self.assertEqual("true", discount["is_coupon_line"])
self.assertEqual("false", discount["is_item"])

    def test_costco_name_cleanup_removes_dual_weight_and_logistics_artifacts(self):
        mixed_units = enrich_costco.parse_costco_item(
            order_id="abc",
            order_date="2026-03-12",
            raw_path=Path("costco_output/raw/abc.json"),
            line_no=1,
            item={
                "itemNumber": "18600",
                "itemDescription01": "MANDARINS 2.27 KG / 5 LBS",
                "itemDescription02": None,
                "itemDepartmentNumber": 65,
                "transDepartmentNumber": 65,
                "unit": 1,
                "itemIdentifier": "E",
                "amount": 7.49,
                "itemUnitPriceAmount": 7.49,
            },
        )
        self.assertEqual("MANDARIN", mixed_units["item_name_norm"])
        self.assertEqual("5", mixed_units["size_value"])
        self.assertEqual("lb", mixed_units["size_unit"])
        logistics = enrich_costco.parse_costco_item(
            order_id="abc",
            order_date="2026-03-12",
            raw_path=Path("costco_output/raw/abc.json"),
            line_no=2,
            item={
                "itemNumber": "1375005",
                "itemDescription01": "LIFE 6'TABLE MDL #80873U - T12/H3/P36",
                "itemDescription02": None,
                "itemDepartmentNumber": 18,
                "transDepartmentNumber": 18,
                "unit": 1,
                "itemIdentifier": "E",
                "amount": 119.98,
                "itemUnitPriceAmount": 119.98,
            },
        )
        self.assertEqual("LIFE 6'TABLE MDL", logistics["item_name_norm"])

    def test_build_items_enriched_matches_discount_to_item(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            raw_dir = Path(tmpdir) / "raw"
            raw_dir.mkdir()
            payload = {
                "data": {
                    "receiptsWithCounts": {
                        "receipts": [
                            {
                                "transactionBarcode": "abc",
                                "transactionDate": "2026-03-12",
                                "itemArray": [
                                    {
                                        "itemNumber": "4873222",
                                        "itemDescription01": "ALL F&C",
                                        "itemDescription02": "200OZ 160LOADS P104",
                                        "itemDepartmentNumber": 14,
                                        "transDepartmentNumber": 14,
                                        "unit": 1,
                                        "itemIdentifier": "E",
                                        "amount": 19.99,
                                        "itemUnitPriceAmount": 19.99,
                                    },
                                    {
                                        "itemNumber": "374664",
                                        "itemDescription01": "/ 4873222",
                                        "itemDescription02": None,
                                        "itemDepartmentNumber": 14,
                                        "transDepartmentNumber": 14,
                                        "unit": -1,
                                        "itemIdentifier": None,
                                        "amount": -5,
                                        "itemUnitPriceAmount": 0,
                                    },
                                ],
                            }
                        ]
                    }
                }
            }
            (raw_dir / "abc.json").write_text(json.dumps(payload), encoding="utf-8")
            rows = enrich_costco.build_items_enriched(raw_dir)
            purchase_row = next(row for row in rows if row["is_discount_line"] == "false")
            discount_row = next(row for row in rows if row["is_discount_line"] == "true")
            self.assertEqual("-5", purchase_row["matched_discount_amount"])
            self.assertEqual("14.99", purchase_row["net_line_total"])
            self.assertIn("matched_discount=4873222", purchase_row["parse_notes"])
            self.assertIn("matched_to_item=4873222", discount_row["parse_notes"])

def test_cross_retailer_validation_writes_proof_example(self):
with tempfile.TemporaryDirectory() as tmpdir:

@@ -51,6 +51,11 @@ class EnrichGiantTests(unittest.TestCase):
self.assertEqual("1.99", row["price_per_lb"])
self.assertEqual("0.1244", row["price_per_oz"])
self.assertEqual("https://example.test/apple.jpg", row["image_url"])
self.assertEqual("giant:abc123:1", row["normalized_row_id"])
self.assertEqual("exact_upc", row["normalization_basis"])
self.assertEqual("5", row["normalized_quantity"])
self.assertEqual("lb", row["normalized_quantity_unit"])
self.assertEqual("true", row["is_item"])
fee_row = enrich_giant.parse_item(
order_id="abc123",
@@ -77,6 +82,7 @@ class EnrichGiantTests(unittest.TestCase):
self.assertEqual("true", fee_row["is_fee"])
self.assertEqual("GL BAG CHARGE", fee_row["item_name_norm"])
self.assertEqual("false", fee_row["is_item"])
def test_parse_item_derives_packaged_weight_prices_from_size_tokens(self):
row = enrich_giant.parse_item(
@@ -179,6 +185,8 @@ class EnrichGiantTests(unittest.TestCase):
self.assertEqual("7.5", rows[0]["size_value"])
self.assertEqual("10", rows[0]["retailer_item_id"])
self.assertEqual("true", rows[1]["is_store_brand"])
self.assertTrue(rows[0]["normalized_item_id"])
self.assertEqual("exact_upc", rows[0]["normalization_basis"])
with output_csv.open(newline="", encoding="utf-8") as handle:
written_rows = list(csv.DictReader(handle))

@@ -0,0 +1,81 @@
import unittest
import report_pipeline_status
class PipelineStatusTests(unittest.TestCase):
def test_build_status_summary_reports_unresolved_and_reviewed_counts(self):
summary = report_pipeline_status.build_status_summary(
giant_orders=[{"order_id": "g1"}],
giant_items=[{"order_id": "g1", "line_no": "1"}],
giant_enriched=[
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_item_id": "gnorm_banana",
"item_name_norm": "BANANA",
"item_name": "FRESH BANANA",
"retailer_item_id": "1",
"upc": "4011",
"brand_guess": "",
"variant": "",
"size_value": "",
"size_unit": "",
"pack_qty": "",
"measure_type": "weight",
"image_url": "",
"is_store_brand": "false",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"order_date": "2026-03-01",
"line_total": "1.29",
}
],
costco_orders=[],
costco_items=[],
costco_enriched=[],
purchases=[
{
"normalized_item_id": "gnorm_banana",
"catalog_id": "cat_banana",
"resolution_action": "",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"retailer": "giant",
"raw_item_name": "FRESH BANANA",
"normalized_item_name": "BANANA",
"upc": "4011",
"line_total": "1.29",
},
{
"normalized_item_id": "cnorm_lime",
"catalog_id": "",
"resolution_action": "",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"retailer": "costco",
"raw_item_name": "LIME 5LB",
"normalized_item_name": "LIME",
"upc": "",
"line_total": "4.99",
},
],
resolutions=[],
)
counts = {row["stage"]: row["count"] for row in summary}
self.assertEqual(1, counts["raw_orders"])
self.assertEqual(1, counts["raw_items"])
self.assertEqual(1, counts["normalized_items"])
self.assertEqual(1, counts["linked_purchase_rows"])
self.assertEqual(1, counts["unresolved_purchase_rows"])
self.assertEqual(1, counts["review_queue_normalized_items"])
self.assertEqual(0, counts["unresolved_not_in_review_rows"])
if __name__ == "__main__":
unittest.main()

@@ -29,7 +29,7 @@ class PurchaseLogTests(unittest.TestCase):
self.assertEqual("0.125", metrics["price_per_oz"]) self.assertEqual("0.125", metrics["price_per_oz"])
self.assertEqual("picked_weight_lb", metrics["price_per_lb_basis"]) self.assertEqual("picked_weight_lb", metrics["price_per_lb_basis"])
def test_build_purchase_rows_maps_canonical_ids(self): def test_build_purchase_rows_maps_catalog_ids(self):
fieldnames = enrich_costco.OUTPUT_FIELDS fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames} giant_row = {field: "" for field in fieldnames}
giant_row.update( giant_row.update(
@@ -37,7 +37,8 @@ class PurchaseLogTests(unittest.TestCase):
"retailer": "giant", "retailer": "giant",
"order_id": "g1", "order_id": "g1",
"line_no": "1", "line_no": "1",
"observed_item_key": "giant:g1:1", "normalized_row_id": "giant:g1:1",
"normalized_item_id": "gnorm:banana",
"order_date": "2026-03-01", "order_date": "2026-03-01",
"item_name": "FRESH BANANA", "item_name": "FRESH BANANA",
"item_name_norm": "BANANA", "item_name_norm": "BANANA",
@@ -50,7 +51,7 @@ class PurchaseLogTests(unittest.TestCase):
"unit_price": "1.29", "unit_price": "1.29",
"measure_type": "weight", "measure_type": "weight",
"price_per_lb": "1.29", "price_per_lb": "1.29",
"raw_order_path": "giant_output/raw/g1.json", "raw_order_path": "data/giant-web/raw/g1.json",
"is_discount_line": "false", "is_discount_line": "false",
"is_coupon_line": "false", "is_coupon_line": "false",
"is_fee": "false", "is_fee": "false",
@@ -62,7 +63,8 @@ class PurchaseLogTests(unittest.TestCase):
"retailer": "costco", "retailer": "costco",
"order_id": "c1", "order_id": "c1",
"line_no": "1", "line_no": "1",
"observed_item_key": "costco:c1:1", "normalized_row_id": "costco:c1:1",
"normalized_item_id": "cnorm:banana",
"order_date": "2026-03-12", "order_date": "2026-03-12",
"item_name": "BANANAS 3 LB / 1.36 KG", "item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA", "item_name_norm": "BANANA",
@@ -75,7 +77,7 @@ class PurchaseLogTests(unittest.TestCase):
"size_unit": "lb", "size_unit": "lb",
"measure_type": "weight", "measure_type": "weight",
"price_per_lb": "0.9933", "price_per_lb": "0.9933",
"raw_order_path": "costco_output/raw/c1.json", "raw_order_path": "data/costco-web/raw/c1.json",
"is_discount_line": "false", "is_discount_line": "false",
"is_coupon_line": "false", "is_coupon_line": "false",
"is_fee": "false", "is_fee": "false",
@@ -99,17 +101,58 @@ class PurchaseLogTests(unittest.TestCase):
"store_state": "VA", "store_state": "VA",
} }
] ]
catalog_rows = [
{
"catalog_id": "cat_banana",
"catalog_name": "BANANA",
"category": "produce",
"product_type": "banana",
"brand": "",
"variant": "",
"size_value": "",
"size_unit": "",
"pack_qty": "",
"measure_type": "",
"notes": "",
"created_at": "",
"updated_at": "",
}
]
link_rows = [
{
"normalized_item_id": "gnorm:banana",
"catalog_id": "cat_banana",
"link_method": "manual_link",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
},
{
"normalized_item_id": "cnorm:banana",
"catalog_id": "cat_banana",
"link_method": "manual_link",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
},
]
rows, _observed, _canon, _links = build_purchases.build_purchase_rows( rows, _links = build_purchases.build_purchase_rows(
[giant_row], [giant_row],
[costco_row], [costco_row],
giant_orders, giant_orders,
costco_orders, costco_orders,
[], [],
link_rows,
catalog_rows,
) )
self.assertEqual(2, len(rows)) self.assertEqual(2, len(rows))
self.assertTrue(all(row["canonical_product_id"] for row in rows)) self.assertTrue(all(row["catalog_id"] == "cat_banana" for row in rows))
self.assertEqual({"giant", "costco"}, {row["retailer"] for row in rows}) self.assertEqual({"giant", "costco"}, {row["retailer"] for row in rows})
self.assertEqual("https://example.test/banana.jpg", rows[0]["image_url"]) self.assertEqual("https://example.test/banana.jpg", rows[0]["image_url"])
@@ -120,10 +163,10 @@ class PurchaseLogTests(unittest.TestCase):
giant_orders = Path(tmpdir) / "giant_orders.csv" giant_orders = Path(tmpdir) / "giant_orders.csv"
costco_orders = Path(tmpdir) / "costco_orders.csv" costco_orders = Path(tmpdir) / "costco_orders.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "canonical_catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv" links_csv = Path(tmpdir) / "product_links.csv"
purchases_csv = Path(tmpdir) / "combined" / "purchases.csv" purchases_csv = Path(tmpdir) / "review" / "purchases.csv"
examples_csv = Path(tmpdir) / "combined" / "comparison_examples.csv" examples_csv = Path(tmpdir) / "review" / "comparison_examples.csv"
fieldnames = enrich_costco.OUTPUT_FIELDS fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames} giant_row = {field: "" for field in fieldnames}
@@ -132,7 +175,8 @@ class PurchaseLogTests(unittest.TestCase):
"retailer": "giant", "retailer": "giant",
"order_id": "g1", "order_id": "g1",
"line_no": "1", "line_no": "1",
"observed_item_key": "giant:g1:1", "normalized_row_id": "giant:g1:1",
"normalized_item_id": "gnorm:banana",
"order_date": "2026-03-01", "order_date": "2026-03-01",
"item_name": "FRESH BANANA", "item_name": "FRESH BANANA",
"item_name_norm": "BANANA", "item_name_norm": "BANANA",
@@ -144,7 +188,7 @@ class PurchaseLogTests(unittest.TestCase):
"unit_price": "1.29", "unit_price": "1.29",
"measure_type": "weight", "measure_type": "weight",
"price_per_lb": "1.29", "price_per_lb": "1.29",
"raw_order_path": "giant_output/raw/g1.json", "raw_order_path": "data/giant-web/raw/g1.json",
"is_discount_line": "false", "is_discount_line": "false",
"is_coupon_line": "false", "is_coupon_line": "false",
"is_fee": "false", "is_fee": "false",
@@ -156,7 +200,8 @@ class PurchaseLogTests(unittest.TestCase):
"retailer": "costco", "retailer": "costco",
"order_id": "c1", "order_id": "c1",
"line_no": "1", "line_no": "1",
"observed_item_key": "costco:c1:1", "normalized_row_id": "costco:c1:1",
"normalized_item_id": "cnorm:banana",
"order_date": "2026-03-12", "order_date": "2026-03-12",
"item_name": "BANANAS 3 LB / 1.36 KG", "item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA", "item_name_norm": "BANANA",
@@ -169,17 +214,14 @@ class PurchaseLogTests(unittest.TestCase):
"size_unit": "lb", "size_unit": "lb",
"measure_type": "weight", "measure_type": "weight",
"price_per_lb": "0.9933", "price_per_lb": "0.9933",
"raw_order_path": "costco_output/raw/c1.json", "raw_order_path": "data/costco-web/raw/c1.json",
"is_discount_line": "false", "is_discount_line": "false",
"is_coupon_line": "false", "is_coupon_line": "false",
"is_fee": "false", "is_fee": "false",
} }
) )
for path, source_rows in [ for path, source_rows in [(giant_items, [giant_row]), (costco_items, [costco_row])]:
(giant_items, [giant_row]),
(costco_items, [costco_row]),
]:
with path.open("w", newline="", encoding="utf-8") as handle: with path.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames) writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader() writer.writeheader()
@@ -217,6 +259,55 @@ class PurchaseLogTests(unittest.TestCase):
writer.writeheader() writer.writeheader()
writer.writerows(source_rows) writer.writerows(source_rows)
with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=build_purchases.CATALOG_FIELDS)
writer.writeheader()
writer.writerow(
{
"catalog_id": "cat_banana",
"catalog_name": "BANANA",
"category": "produce",
"product_type": "banana",
"brand": "",
"variant": "",
"size_value": "",
"size_unit": "",
"pack_qty": "",
"measure_type": "",
"notes": "",
"created_at": "",
"updated_at": "",
}
)
with links_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=build_purchases.PRODUCT_LINK_FIELDS)
writer.writeheader()
writer.writerows(
[
{
"normalized_item_id": "gnorm:banana",
"catalog_id": "cat_banana",
"link_method": "manual_link",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
},
{
"normalized_item_id": "cnorm:banana",
"catalog_id": "cat_banana",
"link_method": "manual_link",
"link_confidence": "high",
"review_status": "approved",
"reviewed_by": "",
"reviewed_at": "",
"link_notes": "",
},
]
)
build_purchases.main.callback( build_purchases.main.callback(
giant_items_enriched_csv=str(giant_items), giant_items_enriched_csv=str(giant_items),
costco_items_enriched_csv=str(costco_items), costco_items_enriched_csv=str(costco_items),
@@ -246,7 +337,8 @@ class PurchaseLogTests(unittest.TestCase):
"retailer": "giant", "retailer": "giant",
"order_id": "g1", "order_id": "g1",
"line_no": "1", "line_no": "1",
"observed_item_key": "giant:g1:1", "normalized_row_id": "giant:g1:1",
"normalized_item_id": "gnorm:ice",
"order_date": "2026-03-01", "order_date": "2026-03-01",
"item_name": "SB BAGGED ICE 20LB", "item_name": "SB BAGGED ICE 20LB",
"item_name_norm": "BAGGED ICE", "item_name_norm": "BAGGED ICE",
@@ -257,17 +349,14 @@ class PurchaseLogTests(unittest.TestCase):
"line_total": "3.50", "line_total": "3.50",
"unit_price": "3.50", "unit_price": "3.50",
"measure_type": "each", "measure_type": "each",
"raw_order_path": "giant_output/raw/g1.json", "raw_order_path": "data/giant-web/raw/g1.json",
"is_discount_line": "false", "is_discount_line": "false",
"is_coupon_line": "false", "is_coupon_line": "false",
"is_fee": "false", "is_fee": "false",
} }
) )
observed_rows, _canonical_rows, _link_rows, _observed_id_by_key, _canonical_by_observed = (
build_purchases.build_link_state([giant_row]) rows, links = build_purchases.build_purchase_rows(
)
observed_product_id = observed_rows[0]["observed_product_id"]
rows, _observed, _canon, _links = build_purchases.build_purchase_rows(
[giant_row], [giant_row],
[], [],
[ [
@@ -282,19 +371,38 @@ class PurchaseLogTests(unittest.TestCase):
[], [],
[ [
{ {
"observed_product_id": observed_product_id, "normalized_item_id": "gnorm:ice",
"canonical_product_id": "gcan_manual_ice", "catalog_id": "cat_ice",
"resolution_action": "create", "resolution_action": "create",
"status": "approved", "status": "approved",
"resolution_notes": "manual ice merge", "resolution_notes": "manual ice merge",
"reviewed_at": "2026-03-16", "reviewed_at": "2026-03-16",
} }
], ],
[],
[
{
"catalog_id": "cat_ice",
"catalog_name": "ICE",
"category": "frozen",
"product_type": "ice",
"brand": "",
"variant": "",
"size_value": "",
"size_unit": "",
"pack_qty": "",
"measure_type": "",
"notes": "",
"created_at": "",
"updated_at": "",
}
],
) )
self.assertEqual("gcan_manual_ice", rows[0]["canonical_product_id"]) self.assertEqual("cat_ice", rows[0]["catalog_id"])
self.assertEqual("approved", rows[0]["review_status"]) self.assertEqual("approved", rows[0]["review_status"])
self.assertEqual("create", rows[0]["resolution_action"]) self.assertEqual("create", rows[0]["resolution_action"])
self.assertEqual("cat_ice", links[0]["catalog_id"])
if __name__ == "__main__": if __name__ == "__main__":

@@ -14,33 +14,39 @@ class ReviewWorkflowTests(unittest.TestCase):
queue_rows = review_products.build_review_queue( queue_rows = review_products.build_review_queue(
[ [
{ {
"observed_product_id": "gobs_1", "normalized_item_id": "gnorm_1",
"canonical_product_id": "", "catalog_id": "",
"retailer": "giant", "retailer": "giant",
"raw_item_name": "SB BAGGED ICE 20LB", "raw_item_name": "SB BAGGED ICE 20LB",
"normalized_item_name": "BAGGED ICE", "normalized_item_name": "BAGGED ICE",
"upc": "", "upc": "",
"line_total": "3.50", "line_total": "3.50",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
}, },
{ {
"observed_product_id": "gobs_1", "normalized_item_id": "gnorm_1",
"canonical_product_id": "", "catalog_id": "",
"retailer": "giant", "retailer": "giant",
"raw_item_name": "SB BAG ICE CUBED 10LB", "raw_item_name": "SB BAG ICE CUBED 10LB",
"normalized_item_name": "BAG ICE", "normalized_item_name": "BAG ICE",
"upc": "", "upc": "",
"line_total": "2.50", "line_total": "2.50",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
}, },
], ],
[], [],
) )
self.assertEqual(1, len(queue_rows)) self.assertEqual(1, len(queue_rows))
self.assertEqual("gobs_1", queue_rows[0]["observed_product_id"]) self.assertEqual("gnorm_1", queue_rows[0]["normalized_item_id"])
self.assertIn("SB BAGGED ICE 20LB", queue_rows[0]["raw_item_names"]) self.assertIn("SB BAGGED ICE 20LB", queue_rows[0]["raw_item_names"])
def test_build_canonical_suggestions_prefers_upc_then_name(self): def test_build_catalog_suggestions_prefers_upc_then_name(self):
suggestions = review_products.build_canonical_suggestions( suggestions = review_products.build_catalog_suggestions(
[ [
{ {
"normalized_item_name": "MIXED PEPPER", "normalized_item_name": "MIXED PEPPER",
@@ -49,36 +55,73 @@ class ReviewWorkflowTests(unittest.TestCase):
], ],
[ [
{ {
"canonical_product_id": "gcan_1", "normalized_item_id": "prior_1",
"canonical_name": "MIXED PEPPER", "normalized_item_name": "MIXED PEPPER 6 PACK",
"upc": "", "upc": "12345",
"catalog_id": "cat_2",
}
],
[
{
"catalog_id": "cat_1",
"catalog_name": "MIXED PEPPER",
}, },
{ {
"canonical_product_id": "gcan_2", "catalog_id": "cat_2",
"canonical_name": "MIXED PEPPER 6 PACK", "catalog_name": "MIXED PEPPER 6 PACK",
"upc": "12345",
}, },
], ],
) )
self.assertEqual("gcan_2", suggestions[0]["canonical_product_id"]) self.assertEqual("cat_2", suggestions[0]["catalog_id"])
self.assertEqual("exact upc", suggestions[0]["reason"]) self.assertEqual("exact upc", suggestions[0]["reason"])
self.assertEqual("gcan_1", suggestions[1]["canonical_product_id"])
def test_search_catalog_rows_ranks_token_overlap(self):
results = review_products.search_catalog_rows(
"mixed pepper",
[
{
"catalog_id": "cat_1",
"catalog_name": "MIXED PEPPER",
"product_type": "pepper",
"category": "produce",
"variant": "",
},
{
"catalog_id": "cat_2",
"catalog_name": "GROUND PEPPER",
"product_type": "spice",
"category": "baking",
"variant": "",
},
],
[
{
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_1",
}
],
"cnorm_mix",
)
self.assertEqual("cat_1", results[0]["catalog_id"])
self.assertGreater(results[0]["score"], results[1]["score"])
def test_review_products_displays_position_items_and_suggestions(self): def test_review_products_displays_position_items_and_suggestions(self):
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:
purchases_csv = Path(tmpdir) / "purchases.csv" purchases_csv = Path(tmpdir) / "purchases.csv"
queue_csv = Path(tmpdir) / "review_queue.csv" queue_csv = Path(tmpdir) / "review_queue.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "canonical_catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv"
purchase_fields = [ purchase_fields = [
"purchase_date", "purchase_date",
"retailer", "retailer",
"order_id", "order_id",
"line_no", "line_no",
"observed_product_id", "normalized_item_id",
"canonical_product_id", "catalog_id",
"raw_item_name", "raw_item_name",
"normalized_item_name", "normalized_item_name",
"image_url", "image_url",
@@ -95,8 +138,8 @@ class ReviewWorkflowTests(unittest.TestCase):
"retailer": "costco", "retailer": "costco",
"order_id": "c2", "order_id": "c2",
"line_no": "2", "line_no": "2",
"observed_product_id": "gobs_mix", "normalized_item_id": "cnorm_mix",
"canonical_product_id": "", "catalog_id": "",
"raw_item_name": "MIXED PEPPER 6-PACK", "raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER", "normalized_item_name": "MIXED PEPPER",
"image_url": "", "image_url": "",
@@ -108,14 +151,27 @@ class ReviewWorkflowTests(unittest.TestCase):
"retailer": "costco", "retailer": "costco",
"order_id": "c1", "order_id": "c1",
"line_no": "1", "line_no": "1",
"observed_product_id": "gobs_mix", "normalized_item_id": "cnorm_mix",
"canonical_product_id": "", "catalog_id": "",
"raw_item_name": "MIXED PEPPER 6-PACK", "raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER", "normalized_item_name": "MIXED PEPPER",
"image_url": "https://example.test/mixed-pepper.jpg", "image_url": "https://example.test/mixed-pepper.jpg",
"upc": "", "upc": "",
"line_total": "6.99", "line_total": "6.99",
}, },
{
"purchase_date": "2026-03-10",
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"raw_item_name": "MIXED PEPPER",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
"upc": "",
"line_total": "5.99",
},
] ]
) )
@@ -124,8 +180,8 @@ class ReviewWorkflowTests(unittest.TestCase):
writer.writeheader() writer.writeheader()
writer.writerow( writer.writerow(
{ {
"canonical_product_id": "gcan_mix", "catalog_id": "cat_mix",
"canonical_name": "MIXED PEPPER", "catalog_name": "MIXED PEPPER",
"category": "produce", "category": "produce",
"product_type": "pepper", "product_type": "pepper",
"brand": "", "brand": "",
@@ -152,21 +208,23 @@ class ReviewWorkflowTests(unittest.TestCase):
str(resolutions_csv), str(resolutions_csv),
"--catalog-csv", "--catalog-csv",
str(catalog_csv), str(catalog_csv),
"--links-csv",
str(links_csv),
], ],
input="q\n", input="q\n",
color=True, color=True,
) )
self.assertEqual(0, result.exit_code) self.assertEqual(0, result.exit_code)
self.assertIn("Review 1/1: Resolve observed_product MIXED PEPPER to canonical_name [__]?", result.output) self.assertIn("Review guide:", result.output)
self.assertIn("Review 1/1: MIXED PEPPER", result.output)
self.assertIn("2 matched items:", result.output) self.assertIn("2 matched items:", result.output)
self.assertIn("[l]ink existing [n]ew canonical e[x]clude [s]kip [q]uit:", result.output) self.assertIn("[#] link to suggestion [f]ind [n]ew [s]kip e[x]clude [q]uit >", result.output)
first_item = result.output.index("[1] 2026-03-14 | 7.49") first_item = result.output.index("[1] MIXED PEPPER 6-PACK | costco | 2026-03-14 | 7.49 | ")
second_item = result.output.index("[2] 2026-03-12 | 6.99") second_item = result.output.index("[2] MIXED PEPPER 6-PACK | costco | 2026-03-12 | 6.99 | https://example.test/mixed-pepper.jpg")
self.assertLess(first_item, second_item) self.assertLess(first_item, second_item)
self.assertIn("https://example.test/mixed-pepper.jpg", result.output) self.assertIn("1 catalog_name suggestions found:", result.output)
self.assertIn("1 canonical suggestions found:", result.output) self.assertIn("[1] MIXED PEPPER, pepper, produce (1 items, 1 rows)", result.output)
self.assertIn("[1] MIXED PEPPER", result.output)
self.assertIn("\x1b[", result.output) self.assertIn("\x1b[", result.output)
def test_review_products_no_suggestions_is_informational(self): def test_review_products_no_suggestions_is_informational(self):
@@ -174,7 +232,8 @@ class ReviewWorkflowTests(unittest.TestCase):
purchases_csv = Path(tmpdir) / "purchases.csv" purchases_csv = Path(tmpdir) / "purchases.csv"
queue_csv = Path(tmpdir) / "review_queue.csv" queue_csv = Path(tmpdir) / "review_queue.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "canonical_catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv"
with purchases_csv.open("w", newline="", encoding="utf-8") as handle: with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter( writer = csv.DictWriter(
@@ -184,8 +243,8 @@ class ReviewWorkflowTests(unittest.TestCase):
"retailer", "retailer",
"order_id", "order_id",
"line_no", "line_no",
"observed_product_id", "normalized_item_id",
"canonical_product_id", "catalog_id",
"raw_item_name", "raw_item_name",
"normalized_item_name", "normalized_item_name",
"image_url", "image_url",
@@ -200,8 +259,8 @@ class ReviewWorkflowTests(unittest.TestCase):
"retailer": "giant", "retailer": "giant",
"order_id": "g1", "order_id": "g1",
"line_no": "1", "line_no": "1",
"observed_product_id": "gobs_ice", "normalized_item_id": "gnorm_ice",
"canonical_product_id": "", "catalog_id": "",
"raw_item_name": "SB BAGGED ICE 20LB", "raw_item_name": "SB BAGGED ICE 20LB",
"normalized_item_name": "BAGGED ICE", "normalized_item_name": "BAGGED ICE",
"image_url": "", "image_url": "",
@@ -225,20 +284,23 @@ class ReviewWorkflowTests(unittest.TestCase):
str(resolutions_csv), str(resolutions_csv),
"--catalog-csv", "--catalog-csv",
str(catalog_csv), str(catalog_csv),
"--links-csv",
str(links_csv),
], ],
input="q\n", input="q\n",
color=True, color=True,
) )
self.assertEqual(0, result.exit_code) self.assertEqual(0, result.exit_code)
self.assertIn("no canonical_name suggestions found", result.output) self.assertIn("no catalog_name suggestions found", result.output)
def test_link_existing_uses_numbered_selection_and_confirmation(self): def test_search_links_catalog_and_writes_link_row(self):
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:
purchases_csv = Path(tmpdir) / "purchases.csv" purchases_csv = Path(tmpdir) / "purchases.csv"
queue_csv = Path(tmpdir) / "review_queue.csv" queue_csv = Path(tmpdir) / "review_queue.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv" resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "canonical_catalog.csv" catalog_csv = Path(tmpdir) / "catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv"
with purchases_csv.open("w", newline="", encoding="utf-8") as handle: with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter( writer = csv.DictWriter(
@@ -248,8 +310,8 @@ class ReviewWorkflowTests(unittest.TestCase):
"retailer", "retailer",
"order_id", "order_id",
"line_no", "line_no",
"observed_product_id", "normalized_item_id",
"canonical_product_id", "catalog_id",
"raw_item_name", "raw_item_name",
"normalized_item_name", "normalized_item_name",
"image_url", "image_url",
@@ -265,8 +327,8 @@ class ReviewWorkflowTests(unittest.TestCase):
"retailer": "costco", "retailer": "costco",
"order_id": "c2", "order_id": "c2",
"line_no": "2", "line_no": "2",
"observed_product_id": "gobs_mix", "normalized_item_id": "cnorm_mix",
"canonical_product_id": "", "catalog_id": "",
"raw_item_name": "MIXED PEPPER 6-PACK", "raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER", "normalized_item_name": "MIXED PEPPER",
"image_url": "", "image_url": "",
@@ -278,14 +340,27 @@ class ReviewWorkflowTests(unittest.TestCase):
"retailer": "costco", "retailer": "costco",
"order_id": "c1", "order_id": "c1",
"line_no": "1", "line_no": "1",
"observed_product_id": "gobs_mix", "normalized_item_id": "cnorm_mix",
"canonical_product_id": "", "catalog_id": "",
"raw_item_name": "MIXED PEPPER 6-PACK", "raw_item_name": "MIXED PEPPER 6-PACK",
"normalized_item_name": "MIXED PEPPER", "normalized_item_name": "MIXED PEPPER",
"image_url": "", "image_url": "",
"upc": "", "upc": "",
"line_total": "6.99", "line_total": "6.99",
}, },
{
"purchase_date": "2026-03-10",
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"normalized_item_id": "gnorm_mix",
"catalog_id": "cat_mix",
"raw_item_name": "MIXED PEPPER",
"normalized_item_name": "MIXED PEPPER",
"image_url": "",
"upc": "",
"line_total": "5.99",
},
] ]
) )
@@ -294,8 +369,8 @@ class ReviewWorkflowTests(unittest.TestCase):
writer.writeheader() writer.writeheader()
writer.writerow( writer.writerow(
{ {
"canonical_product_id": "gcan_mix", "catalog_id": "cat_mix",
"canonical_name": "MIXED PEPPER", "catalog_name": "MIXED PEPPER",
"category": "", "category": "",
"product_type": "", "product_type": "",
"brand": "", "brand": "",
@@ -321,37 +396,196 @@ class ReviewWorkflowTests(unittest.TestCase):
str(resolutions_csv), str(resolutions_csv),
"--catalog-csv", "--catalog-csv",
str(catalog_csv), str(catalog_csv),
"--links-csv",
str(links_csv),
"--limit", "--limit",
"1", "1",
], ],
input="l\n1\ny\nlinked by test\n", input="f\nmixed pepper\n1\nlinked by test\n",
color=True, color=True,
) )
self.assertEqual(0, result.exit_code) self.assertEqual(0, result.exit_code)
self.assertIn("Select the canonical_name to associate 2 items with:", result.output) self.assertIn("1 search results found:", result.output)
self.assertIn('[1] MIXED PEPPER | gcan_mix', result.output)
self.assertIn('2 "MIXED PEPPER" items and future matches will be associated with "MIXED PEPPER".', result.output)
self.assertIn("actions: [y]es [n]o [b]ack [s]kip [q]uit", result.output)
with resolutions_csv.open(newline="", encoding="utf-8") as handle: with resolutions_csv.open(newline="", encoding="utf-8") as handle:
rows = list(csv.DictReader(handle)) rows = list(csv.DictReader(handle))
self.assertEqual("gcan_mix", rows[0]["canonical_product_id"]) with links_csv.open(newline="", encoding="utf-8") as handle:
link_rows = list(csv.DictReader(handle))
self.assertEqual("cat_mix", rows[0]["catalog_id"])
self.assertEqual("link", rows[0]["resolution_action"]) self.assertEqual("link", rows[0]["resolution_action"])
self.assertEqual("cat_mix", link_rows[0]["catalog_id"])
-def test_review_products_creates_canonical_and_resolution(self):
+def test_search_no_matches_allows_retry_or_return(self):
 with tempfile.TemporaryDirectory() as tmpdir:
 purchases_csv = Path(tmpdir) / "purchases.csv"
 queue_csv = Path(tmpdir) / "review_queue.csv"
 resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
-catalog_csv = Path(tmpdir) / "canonical_catalog.csv"
+catalog_csv = Path(tmpdir) / "catalog.csv"
+links_csv = Path(tmpdir) / "product_links.csv"
 with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
 writer = csv.DictWriter(
 handle,
 fieldnames=[
 "purchase_date",
-"observed_product_id",
-"canonical_product_id",
+"retailer",
+"order_id",
+"line_no",
+"normalized_item_id",
+"catalog_id",
+"raw_item_name",
+"normalized_item_name",
+"image_url",
+"upc",
+"line_total",
+],
+)
+writer.writeheader()
+writer.writerow(
+{
+"purchase_date": "2026-03-14",
+"retailer": "giant",
+"order_id": "g1",
+"line_no": "1",
+"normalized_item_id": "gnorm_ice",
+"catalog_id": "",
+"raw_item_name": "SB BAGGED ICE 20LB",
+"normalized_item_name": "BAGGED ICE",
+"image_url": "",
+"upc": "",
+"line_total": "3.50",
+}
+)
+with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
+writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
+writer.writeheader()
+writer.writerow(
+{
+"catalog_id": "cat_ice",
+"catalog_name": "ICE",
+"category": "frozen",
+"product_type": "ice",
+"brand": "",
+"variant": "",
+"size_value": "",
+"size_unit": "",
+"pack_qty": "",
+"measure_type": "",
+"notes": "",
+"created_at": "",
+"updated_at": "",
+}
+)
+result = CliRunner().invoke(
+review_products.main,
+[
+"--purchases-csv",
+str(purchases_csv),
+"--queue-csv",
+str(queue_csv),
+"--resolutions-csv",
+str(resolutions_csv),
+"--catalog-csv",
+str(catalog_csv),
+"--links-csv",
+str(links_csv),
+],
+input="f\nzzz\nq\nq\n",
+color=True,
+)
+self.assertEqual(0, result.exit_code)
+self.assertIn("no matches found", result.output)
+def test_skip_remains_available_from_main_prompt(self):
+with tempfile.TemporaryDirectory() as tmpdir:
+purchases_csv = Path(tmpdir) / "purchases.csv"
+queue_csv = Path(tmpdir) / "review_queue.csv"
+resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
+catalog_csv = Path(tmpdir) / "catalog.csv"
+links_csv = Path(tmpdir) / "product_links.csv"
+with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
+writer = csv.DictWriter(
+handle,
+fieldnames=[
+"purchase_date",
+"retailer",
+"order_id",
+"line_no",
+"normalized_item_id",
+"catalog_id",
+"raw_item_name",
+"normalized_item_name",
+"image_url",
+"upc",
+"line_total",
+],
+)
+writer.writeheader()
+writer.writerow(
+{
+"purchase_date": "2026-03-14",
+"retailer": "giant",
+"order_id": "g1",
+"line_no": "1",
+"normalized_item_id": "gnorm_skip",
+"catalog_id": "",
+"raw_item_name": "TEST ITEM",
+"normalized_item_name": "TEST ITEM",
+"image_url": "",
+"upc": "",
+"line_total": "1.00",
+}
+)
+with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
+writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
+writer.writeheader()
+result = CliRunner().invoke(
+review_products.main,
+[
+"--purchases-csv",
+str(purchases_csv),
+"--queue-csv",
+str(queue_csv),
+"--resolutions-csv",
+str(resolutions_csv),
+"--catalog-csv",
+str(catalog_csv),
+"--links-csv",
+str(links_csv),
+"--limit",
+"1",
+],
+input="s\n",
+color=True,
+)
+self.assertEqual(0, result.exit_code)
+with resolutions_csv.open(newline="", encoding="utf-8") as handle:
+rows = list(csv.DictReader(handle))
+self.assertEqual("skip", rows[0]["resolution_action"])
+self.assertEqual("pending", rows[0]["status"])
+def test_review_products_creates_catalog_and_resolution(self):
+with tempfile.TemporaryDirectory() as tmpdir:
+purchases_csv = Path(tmpdir) / "purchases.csv"
+queue_csv = Path(tmpdir) / "review_queue.csv"
+resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
+catalog_csv = Path(tmpdir) / "catalog.csv"
+links_csv = Path(tmpdir) / "product_links.csv"
+with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
+writer = csv.DictWriter(
+handle,
+fieldnames=[
+"purchase_date",
+"normalized_item_id",
+"catalog_id",
 "retailer",
 "raw_item_name",
 "normalized_item_name",
@@ -366,8 +600,8 @@ class ReviewWorkflowTests(unittest.TestCase):
 writer.writerow(
 {
 "purchase_date": "2026-03-15",
-"observed_product_id": "gobs_ice",
-"canonical_product_id": "",
+"normalized_item_id": "gnorm_ice",
+"catalog_id": "",
 "retailer": "giant",
 "raw_item_name": "SB BAGGED ICE 20LB",
 "normalized_item_name": "BAGGED ICE",
@@ -389,6 +623,7 @@ class ReviewWorkflowTests(unittest.TestCase):
 queue_csv=str(queue_csv),
 resolutions_csv=str(resolutions_csv),
 catalog_csv=str(catalog_csv),
+links_csv=str(links_csv),
 limit=1,
 refresh_only=False,
 )
@@ -396,13 +631,21 @@ class ReviewWorkflowTests(unittest.TestCase):
 self.assertTrue(queue_csv.exists())
 self.assertTrue(resolutions_csv.exists())
 self.assertTrue(catalog_csv.exists())
+self.assertTrue(links_csv.exists())
+with queue_csv.open(newline="", encoding="utf-8") as handle:
+queue_rows = list(csv.DictReader(handle))
 with resolutions_csv.open(newline="", encoding="utf-8") as handle:
 resolution_rows = list(csv.DictReader(handle))
 with catalog_csv.open(newline="", encoding="utf-8") as handle:
 catalog_rows = list(csv.DictReader(handle))
+with links_csv.open(newline="", encoding="utf-8") as handle:
+link_rows = list(csv.DictReader(handle))
+self.assertEqual("approved", queue_rows[0]["status"])
+self.assertEqual("create", queue_rows[0]["resolution_action"])
 self.assertEqual("create", resolution_rows[0]["resolution_action"])
 self.assertEqual("approved", resolution_rows[0]["status"])
-self.assertEqual("ICE", catalog_rows[0]["canonical_name"])
+self.assertEqual("ICE", catalog_rows[0]["catalog_name"])
+self.assertEqual(catalog_rows[0]["catalog_id"], link_rows[0]["catalog_id"])
 if __name__ == "__main__":

@@ -58,14 +58,25 @@ class ScraperTests(unittest.TestCase):
 }
 ]
-orders, items = scraper.flatten_orders(history, details)
+orders, items = scraper.flatten_orders(
+history,
+details,
+history_path=Path("data/giant-web/raw/history.json"),
+raw_dir=Path("data/giant-web/raw"),
+)
 self.assertEqual(1, len(orders))
 self.assertEqual("abc123", orders[0]["order_id"])
+self.assertEqual("giant", orders[0]["retailer"])
 self.assertEqual("PICKUP", orders[0]["service_type"])
+self.assertEqual("data/giant-web/raw/history.json", orders[0]["raw_history_path"])
+self.assertEqual("data/giant-web/raw/abc123.json", orders[0]["raw_order_path"])
 self.assertEqual(1, len(items))
 self.assertEqual("1", items[0]["line_no"])
 self.assertEqual("Bananas", items[0]["item_name"])
+self.assertEqual("giant", items[0]["retailer"])
+self.assertEqual("data/giant-web/raw/abc123.json", items[0]["raw_order_path"])
+self.assertEqual("false", items[0]["is_discount_line"])
 def test_append_dedup_replaces_duplicate_rows_and_preserves_new_values(self):
 with tempfile.TemporaryDirectory() as tmpdir: