Compare commits

63 Commits

Author SHA1 Message Date
2e5d69c75e added 14.2 and 14.3 for refactor prep 2026-03-20 09:55:46 -04:00
ben
3c2462845b added task-sample 2026-03-18 15:47:12 -04:00
ben
c0023e8f3a Record t1.14.1 task evidence 2026-03-18 15:46:31 -04:00
ben
9064de5f67 Refactor retailer normalization outputs 2026-03-18 15:46:20 -04:00
ben
ec1f36a140 Record t1.14 task evidence 2026-03-18 15:18:54 -04:00
ben
48c6eaf753 Refactor retailer collection entrypoints 2026-03-18 15:18:47 -04:00
ben
e74253f6fb data-model prep for refactor, removing observed layer 2026-03-18 15:15:29 -04:00
ben
c13d144418 cleanup 2026-03-18 14:02:36 -04:00
ben
10aad05808 data-model refactor and prep scope 2026-03-18 13:08:28 -04:00
ben
9122821db1 Fix t1.13 evidence hashes 2026-03-17 15:08:09 -04:00
ben
7743421918 Record t1.13 task evidence 2026-03-17 15:07:51 -04:00
ben
08e2a86cbd Make canonical auto-linking more conservative 2026-03-17 15:07:48 -04:00
ben
56a03bcb1d Attach Costco discounts to purchase rows 2026-03-17 15:07:45 -04:00
ben
967e19e561 Add pipeline status accounting 2026-03-17 15:07:42 -04:00
ben
eddef7de2b updated readme and prep for next phase 2026-03-17 13:59:57 -04:00
ben
83bc6c4a7c Update t1.12 task evidence 2026-03-17 13:25:21 -04:00
ben
d39497c298 Refine product review prompt flow 2026-03-17 13:25:12 -04:00
ben
7b8141cd42 Improve product review display workflow 2026-03-17 12:25:47 -04:00
ben
e494386e64 build_purchases rev1 2026-03-17 12:21:44 -04:00
ben
7527fe37eb added git notes 2026-03-17 12:21:24 -04:00
ben
a1fafa3885 added t1.12 scope to simplify review process 2026-03-17 12:20:48 -04:00
ben
37b2196023 added git notes 2026-03-17 09:23:00 -04:00
ben
7f8c3ed8eb updated readme with Review steps 2026-03-17 09:14:14 -04:00
ben
91bfd3597e Record t1.11 task evidence 2026-03-16 20:45:57 -04:00
ben
c7dad5489e Add terminal review resolution workflow 2026-03-16 20:45:37 -04:00
ben
34eedff9c5 Record t1.8.7 and t1.9 task evidence 2026-03-16 18:01:16 -04:00
ben
be1bf6328e Build pivot-ready purchase log 2026-03-16 18:01:09 -04:00
ben
6806c0e7ff updated readme 2026-03-16 17:40:23 -04:00
ben
861955557a added instructions 2026-03-16 17:34:22 -04:00
ben
6e1cde2c83 fix json data pull from /raw 2026-03-16 17:34:01 -04:00
ben
23d0c7e5cd fix bug w session.headers.update missing auth_headers 2026-03-16 17:19:07 -04:00
ben
9a985bf98d updated to use .env, then pull idToken and clientID 2026-03-16 17:17:20 -04:00
ben
b0d4044dac updated task 1.8.7 2026-03-16 17:09:13 -04:00
ben
d7a0329332 Simplify browser session bootstrap 2026-03-16 17:08:44 -04:00
e48dd6c4c2 troubleshooting costco header extraction 2026-03-16 16:59:31 -04:00
ben
1b4c7dde25 Simplify Costco browser header extraction 2026-03-16 16:23:38 -04:00
5a331c9af4 fixed sqlite copy permission error 2026-03-16 16:18:50 -04:00
ben
4fd309251d Record t1.8.6 task evidence 2026-03-16 13:54:11 -04:00
ben
7789c2e6ae Add shared browser session bootstrap 2026-03-16 13:54:00 -04:00
0f797d0a96 added scope for browser session pull task and cleanup 2026-03-16 13:46:52 -04:00
a48a3c8396 added token and dotenv so costco scrapes successfully 36 mo 2026-03-16 13:46:22 -04:00
de0c276a24 Merge remote-tracking branch 'gitea/cx' into cx 2026-03-16 12:40:44 -04:00
d080a35697 added git issues notes 2026-03-16 12:33:50 -04:00
ben
2e5109bd11 Record t1.8.5 task evidence 2026-03-16 12:28:27 -04:00
ben
c0054dc51e Align Costco scraper with browser session flow 2026-03-16 12:28:19 -04:00
ben
58d6efb7bb assume local venv available 2026-03-16 11:44:10 -04:00
ben
031955ba54 Record t1.8.4 task evidence 2026-03-16 11:39:51 -04:00
ben
ac82fa64fb Fix Costco receipt enumeration windows 2026-03-16 11:39:45 -04:00
ben
0d1591a602 Record Costco task evidence 2026-03-16 09:18:05 -04:00
ben
da00288f10 Add Costco acquisition and enrich flow 2026-03-16 09:17:46 -04:00
ben
9497565978 Extend shared schema for retailer-native ids 2026-03-16 09:17:36 -04:00
ben
d20a131e04 updated scope to prep for costco scraper 2026-03-16 09:04:52 -04:00
ben
4216daa37c Record t1.4 through t1.7 task evidence 2026-03-16 00:45:04 -04:00
ben
385a31c07f Auto-link canonical products conservatively 2026-03-16 00:44:45 -04:00
ben
347cd44d09 Create canonical product layer scaffold 2026-03-16 00:43:21 -04:00
ben
9b13ec3b31 Build observed product review queue 2026-03-16 00:43:17 -04:00
ben
dc392149b5 Generate Giant observed products 2026-03-16 00:43:11 -04:00
ben
8cdc4a1ad3 Record t1.3 task evidence 2026-03-16 00:28:37 -04:00
ben
14f2cc2bac Build Giant item enricher 2026-03-16 00:28:28 -04:00
ben
42dbae1d2e added data-model 2026-03-16 00:22:24 -04:00
927643955e mandate dotenv 2026-03-15 15:44:11 -04:00
ben
5e88615a69 added dotenv and completed t1.1 2026-03-14 18:45:55 -04:00
ben
d57b9cf52f Harden giant receipt fetch CLI 2026-03-14 18:32:32 -04:00
41 changed files with 7838 additions and 700 deletions

131
README.md Normal file

@@ -0,0 +1,131 @@
# scrape-giant
CLI to pull purchase history from Giant and Costco websites and refine into a single product catalog for external analysis.
Run each script step-by-step from the terminal.
## What It Does
1. `scrape_giant.py`: download Giant orders and items
2. `enrich_giant.py`: normalize Giant line items
3. `scrape_costco.py`: download Costco orders and items
4. `enrich_costco.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal
7. `report_pipeline_status.py`: show how many rows survive each stage
## Requirements
- Python 3.10+
- Firefox installed with active Giant and Costco sessions
## Install
```bash
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```
## Optional `.env`
The current version works best with a `.env` file in the project root. If a value is missing there and cannot be recovered from the current browser session, the scraper prompts for it.
- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.
```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
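
The lookup order described above (configured value first, interactive prompt as a fallback) can be sketched roughly as follows. `resolve_setting` is a hypothetical helper written for illustration, not a function from `scrape_giant.py` or `scrape_costco.py`, and it assumes `.env` has already been loaded into the process environment. The values in the usage lines are placeholders.

```python
import os


def resolve_setting(name, env=None, prompt=input):
    """Return `name` from the environment, prompting only when it is unset or blank.

    Hypothetical helper: the real scrapers may resolve values differently.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if value:
        return value
    # Fall back to asking the user; `prompt` is injectable so it can be tested.
    return prompt(f"{name}: ").strip()


# Usage with a fake environment and a fake prompt, so nothing reads stdin.
fake_env = {"GIANT_USER_ID": "12345"}
print(resolve_setting("GIANT_USER_ID", env=fake_env))
print(resolve_setting("GIANT_LOYALTY_NUMBER", env=fake_env, prompt=lambda _: "440000"))
```

Passing the prompt function as a parameter keeps the fallback testable without a terminal.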
## Run Order
Run the pipeline in this order:
```bash
python scrape_giant.py
python enrich_giant.py
python scrape_costco.py
python enrich_costco.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
```
Why run `build_purchases.py` twice:
- first pass builds the current combined dataset and review queue inputs
- `review_products.py` writes durable review decisions
- second pass reapplies those decisions into the purchase output
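
A simplified sketch of what the second pass does: approved review decisions are laid over the auto-generated links, and anything not approved is ignored. It follows the override loop in `build_purchases.py` in reduced form. The function name `apply_resolutions` and all ids below are made up for illustration.

```python
def apply_resolutions(auto_links, resolutions):
    """Overlay approved review decisions on auto-generated links.

    auto_links: dict of observed_product_id -> canonical_product_id
    resolutions: list of dicts shaped like rows of review_resolutions.csv
    """
    links = dict(auto_links)
    for row in resolutions:
        if row.get("status") != "approved":
            continue  # unapproved decisions change nothing
        observed_id = row.get("observed_product_id", "")
        action = row.get("resolution_action", "")
        if not observed_id:
            continue
        if action in {"link", "create"} and row.get("canonical_product_id"):
            links[observed_id] = row["canonical_product_id"]
        elif action == "exclude":
            links[observed_id] = ""  # excluded items carry no canonical id
    return links


auto_links = {"obs_1": "can_a", "obs_2": "can_b"}
resolutions = [
    {"observed_product_id": "obs_2", "canonical_product_id": "can_a",
     "resolution_action": "link", "status": "approved"},
    {"observed_product_id": "obs_3", "canonical_product_id": "can_c",
     "resolution_action": "create", "status": "pending"},
]
print(apply_resolutions(auto_links, resolutions))
# {'obs_1': 'can_a', 'obs_2': 'can_a'}
```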
If you only want to refresh the queue without reviewing interactively:
```bash
python review_products.py --refresh-only
```
If you want a quick stage-by-stage accountability check:
```bash
python report_pipeline_status.py
```
## Key Outputs
Giant:
- `giant_output/orders.csv`
- `giant_output/items.csv`
- `giant_output/items_enriched.csv`
Costco:
- `costco_output/orders.csv`
- `costco_output/items.csv`
- `costco_output/items_enriched.csv` (preserves raw totals and matched net discount fields)
Combined:
- `combined_output/purchases.csv`
- `combined_output/review_queue.csv`
- `combined_output/review_resolutions.csv`
- `combined_output/canonical_catalog.csv`
- `combined_output/product_links.csv`
- `combined_output/comparison_examples.csv`
- `combined_output/pipeline_status.csv`
- `combined_output/pipeline_status.json`
## Review Workflow
Run `review_products.py` to clean up unresolved or weakly unified items:
- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later
Decisions are saved and reused on later runs.
The review step is intentionally conservative:
- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text
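
To illustrate the second point, here is a minimal sketch of stripping packaging text from a normalized name. It is modeled on `clean_canonical_name` in `build_canonical_layer.py` but is condensed. The name `strip_packaging_tokens` and the sample product names are assumptions made for this example.

```python
import re

# Tokens that describe packaging rather than the product itself.
DROP_TOKENS = {"CT", "COUNT", "COUNTS", "DOZ", "DOZEN", "PACK"}


def strip_packaging_tokens(name):
    tokens = []
    # Uppercase, replace punctuation with spaces, then filter token by token.
    for token in re.sub(r"[^A-Z0-9\s]", " ", (name or "").upper()).split():
        if token.isdigit() or token in DROP_TOKENS:
            continue
        if re.fullmatch(r"\d+(?:PK|PACK|DZ)", token):
            continue
        tokens.append(token)
    return " ".join(tokens)


print(strip_packaging_tokens("Large Eggs 18 ct"))      # LARGE EGGS
print(strip_packaging_tokens("Sparkling Water 12PK"))  # SPARKLING WATER
```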
## Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
## Test
```bash
./venv/bin/python -m unittest discover -s tests
```
## Project Docs
- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow

24
agents.md Normal file

@@ -0,0 +1,24 @@
# agent rules
## priorities
- optimize for simplicity, boringness, and long-term maintainability
- prefer minimal diffs; avoid refactors unless required for the active task
## tech stack
- python; pandas or polars
- file storage: json and csv, no sqlite or databases
- assume local virtual env is available and accessible
- do not add new dependencies unless explicitly approved; if unavoidable, document justification in the active task notes
## workflow
- prefer direct argv commands (no bash -lc / compound shell chains) unless necessary
- work on ONE task at a time unless explicitly instructed otherwise
- at the start of work, state the task id you are executing
- do not start work unless a task id is specified; if missing, choose the earliest unchecked task and say so
- propose incremental steps
- always include basic tests for core logic
- when you complete a task:
  - mark it [x] in pm/tasks.org
  - fill in evidence with commit hash + commands run
  - never mark complete unless acceptance criteria are met
  - include date and time (HH:MM)

129
browser_session.py Normal file

@@ -0,0 +1,129 @@
import configparser
import os
import shutil
import sqlite3
import tempfile
from pathlib import Path

import browser_cookie3


def find_firefox_profile_dir():
    profiles_ini = firefox_profiles_root() / "profiles.ini"
    parser = configparser.RawConfigParser()
    if not profiles_ini.exists():
        raise FileNotFoundError(f"Firefox profiles.ini not found at {profiles_ini}")
    parser.read(profiles_ini, encoding="utf-8")
    profiles = []
    for section in parser.sections():
        if not section.startswith("Profile"):
            continue
        path_value = parser.get(section, "Path", fallback="")
        if not path_value:
            continue
        is_relative = parser.getboolean(section, "IsRelative", fallback=True)
        profile_path = (
            profiles_ini.parent / path_value if is_relative else Path(path_value)
        )
        profiles.append(
            (
                parser.getboolean(section, "Default", fallback=False),
                profile_path,
            )
        )
    if not profiles:
        raise FileNotFoundError("No Firefox profiles found in profiles.ini")
    # Default profile first; ties are broken by path so the choice is deterministic.
    profiles.sort(key=lambda item: (not item[0], str(item[1])))
    return profiles[0][1]


def firefox_profiles_root():
    if os.name == "nt":
        appdata = os.getenv("APPDATA", "").strip()
        if not appdata:
            raise FileNotFoundError("APPDATA is not set")
        return Path(appdata) / "Mozilla" / "Firefox"
    return Path.home() / ".mozilla" / "firefox"


def load_firefox_cookies(domain_name, profile_dir):
    cookie_file = Path(profile_dir) / "cookies.sqlite"
    return browser_cookie3.firefox(cookie_file=str(cookie_file), domain_name=domain_name)
def read_firefox_local_storage(profile_dir, origin_filter):
    # Accept str or Path, matching load_firefox_cookies.
    storage_root = Path(profile_dir) / "storage" / "default"
    if not storage_root.exists():
        return {}
    for ls_path in storage_root.glob("*/ls/data.sqlite"):
        # The origin is encoded in the directory two levels above data.sqlite.
        origin = decode_firefox_origin(ls_path.parents[1].name)
        if origin_filter.lower() not in origin.lower():
            continue
        return {
            stringify_sql_value(row[0]): stringify_sql_value(row[1])
            for row in query_sqlite(ls_path, "SELECT key, value FROM data")
        }
    return {}


def read_firefox_webapps_store(profile_dir, origin_filter):
    # Accept str or Path, matching load_firefox_cookies.
    webapps_path = Path(profile_dir) / "webappsstore.sqlite"
    if not webapps_path.exists():
        return {}
    values = {}
    for row in query_sqlite(
        webapps_path,
        "SELECT originKey, key, value FROM webappsstore2",
    ):
        origin = stringify_sql_value(row[0])
        if origin_filter.lower() not in origin.lower():
            continue
        values[stringify_sql_value(row[1])] = stringify_sql_value(row[2])
    return values
def query_sqlite(path, query):
    # Query a temporary copy rather than the live database file.
    copied_path = copy_sqlite_to_temp(path)
    connection = None
    cursor = None
    try:
        connection = sqlite3.connect(copied_path)
        cursor = connection.cursor()
        cursor.execute(query)
        rows = cursor.fetchall()
        return rows
    except sqlite3.OperationalError:
        return []
    finally:
        if cursor is not None:
            cursor.close()
        if connection is not None:
            connection.close()
        copied_path.unlink(missing_ok=True)


def copy_sqlite_to_temp(path):
    fd, tmp = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    shutil.copyfile(path, tmp)
    return Path(tmp)


def decode_firefox_origin(raw_origin):
    # Drop everything from the first "^" onward, then restore "://" from "+++".
    origin = raw_origin.split("^", 1)[0]
    return origin.replace("+++", "://")


def stringify_sql_value(value):
    if value is None:
        return ""
    if isinstance(value, bytes):
        # Try the likely encodings in order; fall back to a lossy UTF-8 decode.
        for encoding in ("utf-8", "utf-16-le", "utf-16"):
            try:
                return value.decode(encoding)
            except UnicodeDecodeError:
                continue
        return value.decode("utf-8", errors="ignore")
    return str(value)
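
A small usage sketch of the origin decoding used by `read_firefox_local_storage`. The function is re-declared so the snippet runs on its own. `matches_origin` is a hypothetical helper, and the directory names are illustrative rather than taken from a real profile.

```python
def decode_origin(raw_origin):
    # Same transformation as decode_firefox_origin: drop the "^..." suffix,
    # then turn "+++" back into "://".
    origin = raw_origin.split("^", 1)[0]
    return origin.replace("+++", "://")


def matches_origin(raw_origin, origin_filter):
    # Case-insensitive substring match, as in the storage readers.
    return origin_filter.lower() in decode_origin(raw_origin).lower()


print(decode_origin("https+++www.costco.com^userContextId=2"))
# https://www.costco.com
print(matches_origin("https+++www.costco.com", "COSTCO"))
# True
```

Because the match is a substring test, a short filter can match more than one origin. The reader returns the first match found.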

220
build_canonical_layer.py Normal file

@@ -0,0 +1,220 @@
import re

import click

from layer_helpers import read_csv_rows, representative_value, stable_id, write_csv_rows

CANONICAL_FIELDS = [
    "canonical_product_id",
    "canonical_name",
    "product_type",
    "brand",
    "variant",
    "size_value",
    "size_unit",
    "pack_qty",
    "measure_type",
    "normalized_quantity",
    "normalized_quantity_unit",
    "notes",
    "created_at",
    "updated_at",
]

CANONICAL_DROP_TOKENS = {"CT", "COUNT", "COUNTS", "DOZ", "DOZEN", "DOZ.", "PACK"}

LINK_FIELDS = [
    "observed_product_id",
    "canonical_product_id",
    "link_method",
    "link_confidence",
    "review_status",
    "reviewed_by",
    "reviewed_at",
    "link_notes",
]
def to_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def normalized_quantity(row):
    size_value = to_float(row.get("representative_size_value"))
    pack_qty = to_float(row.get("representative_pack_qty")) or 1.0
    size_unit = row.get("representative_size_unit", "")
    measure_type = row.get("representative_measure_type", "")
    if size_value is not None and size_unit:
        return format(size_value * pack_qty, "g"), size_unit
    if row.get("representative_pack_qty") and measure_type == "count":
        return row["representative_pack_qty"], "count"
    if measure_type == "each":
        return "1", "each"
    return "", ""


def auto_link_rule(observed_row):
    if (
        observed_row.get("is_fee") == "true"
        or observed_row.get("is_discount_line") == "true"
        or observed_row.get("is_coupon_line") == "true"
    ):
        return "", "", ""
    if observed_row.get("representative_upc"):
        return (
            "exact_upc",
            f"upc={observed_row['representative_upc']}",
            "high",
        )
    if (
        observed_row.get("representative_name_norm")
        and observed_row.get("representative_size_value")
        and observed_row.get("representative_size_unit")
    ):
        return (
            "exact_name_size",
            "|".join(
                [
                    f"name={observed_row['representative_name_norm']}",
                    f"size={observed_row['representative_size_value']}",
                    f"unit={observed_row['representative_size_unit']}",
                    f"pack={observed_row['representative_pack_qty']}",
                    f"measure={observed_row['representative_measure_type']}",
                ]
            ),
            "high",
        )
    return "", "", ""
def clean_canonical_name(name):
    tokens = []
    for token in re.sub(r"[^A-Z0-9\s]", " ", (name or "").upper()).split():
        if token.isdigit():
            continue
        if token in CANONICAL_DROP_TOKENS:
            continue
        if re.fullmatch(r"\d+(?:PK|PACK)", token):
            continue
        if re.fullmatch(r"\d+DZ", token):
            continue
        tokens.append(token)
    return " ".join(tokens).strip()


def canonical_row_for_group(canonical_product_id, group_rows, link_method):
    quantity_value, quantity_unit = normalized_quantity(
        {
            "representative_size_value": representative_value(
                group_rows, "representative_size_value"
            ),
            "representative_size_unit": representative_value(
                group_rows, "representative_size_unit"
            ),
            "representative_pack_qty": representative_value(
                group_rows, "representative_pack_qty"
            ),
            "representative_measure_type": representative_value(
                group_rows, "representative_measure_type"
            ),
        }
    )
    return {
        "canonical_product_id": canonical_product_id,
        "canonical_name": clean_canonical_name(
            representative_value(group_rows, "representative_name_norm")
        )
        or representative_value(group_rows, "representative_name_norm"),
        "product_type": "",
        "brand": representative_value(group_rows, "representative_brand"),
        "variant": representative_value(group_rows, "representative_variant"),
        "size_value": representative_value(group_rows, "representative_size_value"),
        "size_unit": representative_value(group_rows, "representative_size_unit"),
        "pack_qty": representative_value(group_rows, "representative_pack_qty"),
        "measure_type": representative_value(group_rows, "representative_measure_type"),
        "normalized_quantity": quantity_value,
        "normalized_quantity_unit": quantity_unit,
        "notes": f"auto-linked via {link_method}",
        "created_at": "",
        "updated_at": "",
    }
def build_canonical_layer(observed_rows):
    canonical_rows = []
    link_rows = []
    groups = {}
    for observed_row in sorted(observed_rows, key=lambda row: row["observed_product_id"]):
        link_method, group_key, confidence = auto_link_rule(observed_row)
        if not group_key:
            continue
        canonical_product_id = stable_id("gcan", f"{link_method}|{group_key}")
        groups.setdefault(canonical_product_id, {"method": link_method, "rows": []})
        groups[canonical_product_id]["rows"].append(observed_row)
        link_rows.append(
            {
                "observed_product_id": observed_row["observed_product_id"],
                "canonical_product_id": canonical_product_id,
                "link_method": link_method,
                "link_confidence": confidence,
                "review_status": "",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "",
            }
        )
    for canonical_product_id, group in sorted(groups.items()):
        canonical_rows.append(
            canonical_row_for_group(
                canonical_product_id, group["rows"], group["method"]
            )
        )
    return canonical_rows, link_rows


@click.command()
@click.option(
    "--observed-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product rows.",
)
@click.option(
    "--canonical-csv",
    default="giant_output/products_canonical.csv",
    show_default=True,
    help="Path to canonical product output.",
)
@click.option(
    "--links-csv",
    default="giant_output/product_links.csv",
    show_default=True,
    help="Path to observed-to-canonical link output.",
)
def main(observed_csv, canonical_csv, links_csv):
    observed_rows = read_csv_rows(observed_csv)
    canonical_rows, link_rows = build_canonical_layer(observed_rows)
    write_csv_rows(canonical_csv, canonical_rows, CANONICAL_FIELDS)
    write_csv_rows(links_csv, link_rows, LINK_FIELDS)
    click.echo(
        f"wrote {len(canonical_rows)} canonical rows to {canonical_csv} and "
        f"{len(link_rows)} links to {links_csv}"
    )


if __name__ == "__main__":
    main()
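
A worked example of the size-times-pack arithmetic behind `normalized_quantity`, reduced to its first branch. `total_quantity` is a hypothetical condensed version written for illustration, and the input values are made up.

```python
def to_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def total_quantity(size_value, size_unit, pack_qty):
    """Return (total size as a string, unit), or ("", "") when size is unknown."""
    size = to_float(size_value)
    packs = to_float(pack_qty) or 1.0  # a blank or zero pack count is treated as 1
    if size is None or not size_unit:
        return "", ""
    # The "g" format drops trailing zeros, so 72.0 is written as "72".
    return format(size * packs, "g"), size_unit


print(total_quantity("12", "oz", "6"))   # ('72', 'oz')
print(total_quantity("0.5", "lb", ""))   # ('0.5', 'lb')
print(total_quantity("", "oz", "6"))     # ('', '')
```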

172
build_observed_products.py Normal file

@@ -0,0 +1,172 @@
from collections import defaultdict

import click

from layer_helpers import (
    compact_join,
    distinct_values,
    first_nonblank,
    read_csv_rows,
    representative_value,
    stable_id,
    write_csv_rows,
)

OUTPUT_FIELDS = [
    "observed_product_id",
    "retailer",
    "observed_key",
    "representative_retailer_item_id",
    "representative_upc",
    "representative_item_name",
    "representative_name_norm",
    "representative_brand",
    "representative_variant",
    "representative_size_value",
    "representative_size_unit",
    "representative_pack_qty",
    "representative_measure_type",
    "representative_image_url",
    "is_store_brand",
    "is_fee",
    "is_discount_line",
    "is_coupon_line",
    "first_seen_date",
    "last_seen_date",
    "times_seen",
    "example_order_id",
    "example_item_name",
    "raw_name_examples",
    "normalized_name_examples",
    "example_prices",
    "distinct_item_names_count",
    "distinct_retailer_item_ids_count",
    "distinct_upcs_count",
]
def build_observed_key(row):
    if row.get("upc"):
        return "|".join(
            [
                row["retailer"],
                f"upc={row['upc']}",
                f"name={row['item_name_norm']}",
            ]
        )
    if row.get("retailer_item_id"):
        return "|".join(
            [
                row["retailer"],
                f"retailer_item_id={row['retailer_item_id']}",
                f"name={row['item_name_norm']}",
                f"discount={row.get('is_discount_line', 'false')}",
                f"coupon={row.get('is_coupon_line', 'false')}",
            ]
        )
    return "|".join(
        [
            row["retailer"],
            f"name={row['item_name_norm']}",
            f"size={row['size_value']}",
            f"unit={row['size_unit']}",
            f"pack={row['pack_qty']}",
            f"measure={row['measure_type']}",
            f"store_brand={row['is_store_brand']}",
            f"fee={row['is_fee']}",
        ]
    )
def build_observed_products(rows):
    grouped = defaultdict(list)
    for row in rows:
        grouped[build_observed_key(row)].append(row)
    observed_rows = []
    for observed_key, group_rows in sorted(grouped.items()):
        ordered = sorted(
            group_rows,
            key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])),
        )
        observed_rows.append(
            {
                "observed_product_id": stable_id("gobs", observed_key),
                "retailer": ordered[0]["retailer"],
                "observed_key": observed_key,
                "representative_retailer_item_id": representative_value(
                    ordered, "retailer_item_id"
                ),
                "representative_upc": representative_value(ordered, "upc"),
                "representative_item_name": representative_value(ordered, "item_name"),
                "representative_name_norm": representative_value(
                    ordered, "item_name_norm"
                ),
                "representative_brand": representative_value(ordered, "brand_guess"),
                "representative_variant": representative_value(ordered, "variant"),
                "representative_size_value": representative_value(ordered, "size_value"),
                "representative_size_unit": representative_value(ordered, "size_unit"),
                "representative_pack_qty": representative_value(ordered, "pack_qty"),
                "representative_measure_type": representative_value(
                    ordered, "measure_type"
                ),
                "representative_image_url": first_nonblank(ordered, "image_url"),
                "is_store_brand": representative_value(ordered, "is_store_brand"),
                "is_fee": representative_value(ordered, "is_fee"),
                "is_discount_line": representative_value(
                    ordered, "is_discount_line"
                ),
                "is_coupon_line": representative_value(ordered, "is_coupon_line"),
                "first_seen_date": ordered[0]["order_date"],
                "last_seen_date": ordered[-1]["order_date"],
                "times_seen": str(len(ordered)),
                "example_order_id": ordered[0]["order_id"],
                "example_item_name": ordered[0]["item_name"],
                "raw_name_examples": compact_join(
                    distinct_values(ordered, "item_name"), limit=4
                ),
                "normalized_name_examples": compact_join(
                    distinct_values(ordered, "item_name_norm"), limit=4
                ),
                "example_prices": compact_join(
                    distinct_values(ordered, "line_total"), limit=4
                ),
                "distinct_item_names_count": str(
                    len(distinct_values(ordered, "item_name"))
                ),
                "distinct_retailer_item_ids_count": str(
                    len(distinct_values(ordered, "retailer_item_id"))
                ),
                "distinct_upcs_count": str(len(distinct_values(ordered, "upc"))),
            }
        )
    observed_rows.sort(key=lambda row: row["observed_product_id"])
    return observed_rows
@click.command()
@click.option(
    "--items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
    help="Path to enriched Giant item rows.",
)
@click.option(
    "--output-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product output.",
)
def main(items_enriched_csv, output_csv):
    rows = read_csv_rows(items_enriched_csv)
    observed_rows = build_observed_products(rows)
    write_csv_rows(output_csv, observed_rows, OUTPUT_FIELDS)
    click.echo(f"wrote {len(observed_rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()
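
A reduced sketch of the group-then-summarize pattern used by `build_observed_products`, covering only the UPC branch of the key. `short_id` is a hypothetical stand-in for `layer_helpers.stable_id`, whose implementation is not shown in this diff, and the sample rows are invented.

```python
import hashlib
from collections import defaultdict


def short_id(prefix, key):
    # Assumption: a truncated SHA-1 of the key. The real stable_id may differ.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


def summarize(rows):
    grouped = defaultdict(list)
    for row in rows:
        key = "|".join(
            [row["retailer"], f"upc={row['upc']}", f"name={row['item_name_norm']}"]
        )
        grouped[key].append(row)
    summary = []
    for key, group in sorted(grouped.items()):
        # ISO dates sort correctly as plain strings.
        ordered = sorted(group, key=lambda r: r["order_date"])
        summary.append(
            {
                "observed_product_id": short_id("gobs", key),
                "observed_key": key,
                "first_seen_date": ordered[0]["order_date"],
                "last_seen_date": ordered[-1]["order_date"],
                "times_seen": str(len(ordered)),
            }
        )
    return summary


sample_rows = [
    {"retailer": "giant", "upc": "111", "item_name_norm": "BANANA", "order_date": "2026-02-01"},
    {"retailer": "giant", "upc": "111", "item_name_norm": "BANANA", "order_date": "2026-01-05"},
    {"retailer": "giant", "upc": "222", "item_name_norm": "MILK", "order_date": "2026-01-20"},
]
for item in summarize(sample_rows):
    print(item["observed_key"], item["times_seen"], item["first_seen_date"])
```

Deriving the id from the key means the same purchases produce the same ids on every run, so saved review decisions keep pointing at the same observed product.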

418
build_purchases.py Normal file

@@ -0,0 +1,418 @@
from decimal import Decimal
from pathlib import Path

import click

import build_canonical_layer
import build_observed_products
import validate_cross_retailer_flow
from enrich_giant import format_decimal, to_decimal
from layer_helpers import read_csv_rows, stable_id, write_csv_rows

PURCHASE_FIELDS = [
    "purchase_date",
    "retailer",
    "order_id",
    "line_no",
    "observed_item_key",
    "observed_product_id",
    "canonical_product_id",
    "review_status",
    "resolution_action",
    "raw_item_name",
    "normalized_item_name",
    "image_url",
    "retailer_item_id",
    "upc",
    "qty",
    "unit",
    "pack_qty",
    "size_value",
    "size_unit",
    "measure_type",
    "line_total",
    "unit_price",
    "matched_discount_amount",
    "net_line_total",
    "store_name",
    "store_number",
    "store_city",
    "store_state",
    "price_per_each",
    "price_per_each_basis",
    "price_per_count",
    "price_per_count_basis",
    "price_per_lb",
    "price_per_lb_basis",
    "price_per_oz",
    "price_per_oz_basis",
    "is_discount_line",
    "is_coupon_line",
    "is_fee",
    "raw_order_path",
]

EXAMPLE_FIELDS = [
    "example_name",
    "canonical_product_id",
    "giant_purchase_date",
    "giant_raw_item_name",
    "giant_price_per_lb",
    "costco_purchase_date",
    "costco_raw_item_name",
    "costco_price_per_lb",
    "notes",
]

CATALOG_FIELDS = [
    "canonical_product_id",
    "canonical_name",
    "category",
    "product_type",
    "brand",
    "variant",
    "size_value",
    "size_unit",
    "pack_qty",
    "measure_type",
    "notes",
    "created_at",
    "updated_at",
]

RESOLUTION_FIELDS = [
    "observed_product_id",
    "canonical_product_id",
    "resolution_action",
    "status",
    "resolution_notes",
    "reviewed_at",
]
def decimal_or_zero(value):
    return to_decimal(value) or Decimal("0")


def derive_metrics(row):
    line_total = to_decimal(row.get("net_line_total") or row.get("line_total"))
    qty = to_decimal(row.get("qty"))
    pack_qty = to_decimal(row.get("pack_qty"))
    size_value = to_decimal(row.get("size_value"))
    picked_weight = to_decimal(row.get("picked_weight"))
    size_unit = row.get("size_unit", "")
    price_per_each = row.get("price_per_each", "")
    price_per_lb = row.get("price_per_lb", "")
    price_per_oz = row.get("price_per_oz", "")
    price_per_count = ""
    basis_each = ""
    basis_count = ""
    basis_lb = ""
    basis_oz = ""
    if price_per_each:
        basis_each = "line_total_over_qty"
    elif line_total is not None and qty not in (None, 0):
        price_per_each = format_decimal(line_total / qty)
        basis_each = "line_total_over_qty"
    if line_total is not None and pack_qty not in (None, 0):
        total_count = pack_qty * (qty or Decimal("1"))
        if total_count not in (None, 0):
            price_per_count = format_decimal(line_total / total_count)
            basis_count = "line_total_over_pack_qty"
    if picked_weight not in (None, 0):
        price_per_lb = format_decimal(line_total / picked_weight) if line_total is not None else ""
        price_per_oz = (
            format_decimal((line_total / picked_weight) / Decimal("16"))
            if line_total is not None
            else ""
        )
        basis_lb = "picked_weight_lb"
        basis_oz = "picked_weight_lb_to_oz"
    elif line_total is not None and size_value not in (None, 0):
        total_units = size_value * (pack_qty or Decimal("1")) * (qty or Decimal("1"))
        if size_unit == "lb" and total_units not in (None, 0):
            per_lb = line_total / total_units
            price_per_lb = format_decimal(per_lb)
            price_per_oz = format_decimal(per_lb / Decimal("16"))
            basis_lb = "parsed_size_lb"
            basis_oz = "parsed_size_lb_to_oz"
        elif size_unit == "oz" and total_units not in (None, 0):
            per_oz = line_total / total_units
            price_per_oz = format_decimal(per_oz)
            price_per_lb = format_decimal(per_oz * Decimal("16"))
            basis_lb = "parsed_size_oz_to_lb"
            basis_oz = "parsed_size_oz"
    return {
        "price_per_each": price_per_each,
        "price_per_each_basis": basis_each,
        "price_per_count": price_per_count,
        "price_per_count_basis": basis_count,
        "price_per_lb": price_per_lb,
        "price_per_lb_basis": basis_lb,
        "price_per_oz": price_per_oz,
        "price_per_oz_basis": basis_oz,
    }
def order_lookup(rows, retailer):
    return {
        (retailer, row["order_id"]): row
        for row in rows
    }


def read_optional_csv_rows(path):
    path = Path(path)
    if not path.exists():
        return []
    return read_csv_rows(path)


def load_resolution_lookup(resolution_rows):
    lookup = {}
    for row in resolution_rows:
        if not row.get("observed_product_id"):
            continue
        lookup[row["observed_product_id"]] = row
    return lookup


def merge_catalog_rows(existing_rows, auto_rows):
    merged = {}
    for row in auto_rows + existing_rows:
        canonical_product_id = row.get("canonical_product_id", "")
        if canonical_product_id:
            merged[canonical_product_id] = row
    return sorted(merged.values(), key=lambda row: row["canonical_product_id"])


def catalog_row_from_canonical(row):
    return {
        "canonical_product_id": row.get("canonical_product_id", ""),
        "canonical_name": row.get("canonical_name", ""),
        "category": row.get("category", ""),
        "product_type": row.get("product_type", ""),
        "brand": row.get("brand", ""),
        "variant": row.get("variant", ""),
        "size_value": row.get("size_value", ""),
        "size_unit": row.get("size_unit", ""),
        "pack_qty": row.get("pack_qty", ""),
        "measure_type": row.get("measure_type", ""),
        "notes": row.get("notes", ""),
        "created_at": row.get("created_at", ""),
        "updated_at": row.get("updated_at", ""),
    }


def build_link_state(enriched_rows):
    observed_rows = build_observed_products.build_observed_products(enriched_rows)
    canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows)
    giant_row, costco_row = validate_cross_retailer_flow.find_proof_pair(observed_rows)
    canonical_rows, link_rows, _proof_rows = validate_cross_retailer_flow.merge_proof_pair(
        canonical_rows,
        link_rows,
        giant_row,
        costco_row,
    )
    observed_id_by_key = {
        row["observed_key"]: row["observed_product_id"] for row in observed_rows
    }
    canonical_id_by_observed = {
        row["observed_product_id"]: row["canonical_product_id"] for row in link_rows
    }
    return observed_rows, canonical_rows, link_rows, observed_id_by_key, canonical_id_by_observed
def build_purchase_rows(
    giant_enriched_rows,
    costco_enriched_rows,
    giant_orders,
    costco_orders,
    resolution_rows,
):
    all_enriched_rows = giant_enriched_rows + costco_enriched_rows
    (
        observed_rows,
        canonical_rows,
        link_rows,
        observed_id_by_key,
        canonical_id_by_observed,
    ) = build_link_state(all_enriched_rows)
    resolution_lookup = load_resolution_lookup(resolution_rows)
    for observed_product_id, resolution in resolution_lookup.items():
        action = resolution.get("resolution_action", "")
        status = resolution.get("status", "")
        if status != "approved":
            continue
        if action in {"link", "create"} and resolution.get("canonical_product_id"):
            canonical_id_by_observed[observed_product_id] = resolution["canonical_product_id"]
        elif action == "exclude":
            canonical_id_by_observed[observed_product_id] = ""
    orders_by_id = {}
    orders_by_id.update(order_lookup(giant_orders, "giant"))
    orders_by_id.update(order_lookup(costco_orders, "costco"))
    purchase_rows = []
    for row in sorted(
        all_enriched_rows,
        key=lambda item: (item["order_date"], item["retailer"], item["order_id"], int(item["line_no"])),
    ):
        observed_key = build_observed_products.build_observed_key(row)
        observed_product_id = observed_id_by_key.get(observed_key, "")
        order_row = orders_by_id.get((row["retailer"], row["order_id"]), {})
        metrics = derive_metrics(row)
        resolution = resolution_lookup.get(observed_product_id, {})
        purchase_rows.append(
            {
                "purchase_date": row["order_date"],
                "retailer": row["retailer"],
                "order_id": row["order_id"],
                "line_no": row["line_no"],
                "observed_item_key": row["observed_item_key"],
                "observed_product_id": observed_product_id,
                "canonical_product_id": canonical_id_by_observed.get(observed_product_id, ""),
                "review_status": resolution.get("status", ""),
                "resolution_action": resolution.get("resolution_action", ""),
                "raw_item_name": row["item_name"],
                "normalized_item_name": row["item_name_norm"],
                "image_url": row.get("image_url", ""),
                "retailer_item_id": row["retailer_item_id"],
                "upc": row["upc"],
                "qty": row["qty"],
                "unit": row["unit"],
                "pack_qty": row["pack_qty"],
                "size_value": row["size_value"],
                "size_unit": row["size_unit"],
                "measure_type": row["measure_type"],
                "line_total": row["line_total"],
                "unit_price": row["unit_price"],
                "matched_discount_amount": row.get("matched_discount_amount", ""),
                "net_line_total": row.get("net_line_total", ""),
                "store_name": order_row.get("store_name", ""),
                "store_number": order_row.get("store_number", ""),
                "store_city": order_row.get("store_city", ""),
                "store_state": order_row.get("store_state", ""),
                "is_discount_line": row["is_discount_line"],
                "is_coupon_line": row["is_coupon_line"],
                "is_fee": row["is_fee"],
                "raw_order_path": row["raw_order_path"],
                **metrics,
            }
        )
    return purchase_rows, observed_rows, canonical_rows, link_rows


def apply_manual_resolutions_to_links(link_rows, resolution_rows):
    link_by_observed = {row["observed_product_id"]: dict(row) for row in link_rows}
    for resolution in resolution_rows:
        if resolution.get("status") != "approved":
            continue
        observed_product_id = resolution.get("observed_product_id", "")
        action = resolution.get("resolution_action", "")
        if not observed_product_id:
            continue
        if action == "exclude":
            link_by_observed.pop(observed_product_id, None)
            continue
        if action in {"link", "create"} and resolution.get("canonical_product_id"):
            link_by_observed[observed_product_id] = {
                "observed_product_id": observed_product_id,
                "canonical_product_id": resolution["canonical_product_id"],
                "link_method": f"manual_{action}",
                "link_confidence": "high",
                "review_status": resolution.get("status", ""),
                "reviewed_by": "",
                "reviewed_at": resolution.get("reviewed_at", ""),
                "link_notes": resolution.get("resolution_notes", ""),
            }
    return sorted(link_by_observed.values(), key=lambda row: row["observed_product_id"])


def build_comparison_examples(purchase_rows):
    giant_banana = None
    costco_banana = None
    for row in purchase_rows:
        if row.get("normalized_item_name") != "BANANA":
            continue
        if not row.get("canonical_product_id"):
            continue
        if row["retailer"] == "giant" and row.get("price_per_lb"):
            giant_banana = row
        if row["retailer"] == "costco" and row.get("price_per_lb"):
            costco_banana = row
    if not giant_banana or not costco_banana:
        return []
    return [
        {
            "example_name": "banana_price_per_lb",
            "canonical_product_id": giant_banana["canonical_product_id"],
            "giant_purchase_date": giant_banana["purchase_date"],
            "giant_raw_item_name": giant_banana["raw_item_name"],
            "giant_price_per_lb": giant_banana["price_per_lb"],
            "costco_purchase_date": costco_banana["purchase_date"],
            "costco_raw_item_name": costco_banana["raw_item_name"],
            "costco_price_per_lb": costco_banana["price_per_lb"],
            "notes": "Example comparison using normalized price_per_lb across Giant and Costco",
        }
    ]


@click.command()
@click.option("--giant-items-enriched-csv", default="giant_output/items_enriched.csv", show_default=True)
@click.option("--costco-items-enriched-csv", default="costco_output/items_enriched.csv", show_default=True)
@click.option("--giant-orders-csv", default="giant_output/orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="costco_output/orders.csv", show_default=True)
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="combined_output/canonical_catalog.csv", show_default=True)
@click.option("--links-csv", default="combined_output/product_links.csv", show_default=True)
@click.option("--output-csv", default="combined_output/purchases.csv", show_default=True)
@click.option("--examples-csv", default="combined_output/comparison_examples.csv", show_default=True)
def main(
    giant_items_enriched_csv,
    costco_items_enriched_csv,
    giant_orders_csv,
    costco_orders_csv,
    resolutions_csv,
    catalog_csv,
    links_csv,
    output_csv,
    examples_csv,
):
    resolution_rows = read_optional_csv_rows(resolutions_csv)
    purchase_rows, _observed_rows, canonical_rows, link_rows = build_purchase_rows(
        read_csv_rows(giant_items_enriched_csv),
        read_csv_rows(costco_items_enriched_csv),
        read_csv_rows(giant_orders_csv),
        read_csv_rows(costco_orders_csv),
        resolution_rows,
    )
    existing_catalog_rows = read_optional_csv_rows(catalog_csv)
    merged_catalog_rows = merge_catalog_rows(
        existing_catalog_rows,
        [catalog_row_from_canonical(row) for row in canonical_rows],
    )
    link_rows = apply_manual_resolutions_to_links(link_rows, resolution_rows)
    example_rows = build_comparison_examples(purchase_rows)
    write_csv_rows(catalog_csv, merged_catalog_rows, CATALOG_FIELDS)
    write_csv_rows(links_csv, link_rows, build_canonical_layer.LINK_FIELDS)
    write_csv_rows(output_csv, purchase_rows, PURCHASE_FIELDS)
    write_csv_rows(examples_csv, example_rows, EXAMPLE_FIELDS)
    click.echo(
        f"wrote {len(purchase_rows)} purchase rows to {output_csv}, "
        f"{len(merged_catalog_rows)} catalog rows to {catalog_csv}, "
        f"and {len(example_rows)} comparison examples to {examples_csv}"
    )


if __name__ == "__main__":
    main()
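
A minimal, self-contained sketch of the override rules that `apply_manual_resolutions_to_links` applies, assuming rows are plain dicts as a CSV reader would produce them. The helper name `apply_resolutions`, the sample IDs, and the trimmed field set are illustrative assumptions and are not part of the repository.

```python
def apply_resolutions(link_rows, resolution_rows):
    """Condensed restatement of the override rules (hypothetical helper)."""
    links = {row["observed_product_id"]: dict(row) for row in link_rows}
    for res in resolution_rows:
        if res.get("status") != "approved":
            continue  # only approved reviews may change links
        observed_id = res.get("observed_product_id", "")
        if not observed_id:
            continue
        action = res.get("resolution_action", "")
        if action == "exclude":
            # An approved exclusion drops any automatic link.
            links.pop(observed_id, None)
        elif action in {"link", "create"} and res.get("canonical_product_id"):
            # An approved link/create replaces the automatic link outright.
            links[observed_id] = {
                "observed_product_id": observed_id,
                "canonical_product_id": res["canonical_product_id"],
                "link_method": f"manual_{action}",
            }
    return sorted(links.values(), key=lambda row: row["observed_product_id"])


# Made-up sample data for illustration only.
auto_links = [
    {"observed_product_id": "obs_1", "canonical_product_id": "can_a", "link_method": "auto"},
    {"observed_product_id": "obs_2", "canonical_product_id": "can_b", "link_method": "auto"},
]
resolutions = [
    {"observed_product_id": "obs_1", "status": "approved", "resolution_action": "exclude"},
    {
        "observed_product_id": "obs_2",
        "status": "approved",
        "resolution_action": "link",
        "canonical_product_id": "can_c",
    },
    {
        "observed_product_id": "obs_3",
        "status": "pending",
        "resolution_action": "create",
        "canonical_product_id": "can_d",
    },
]
result = apply_resolutions(auto_links, resolutions)
```

In this sketch `obs_1` disappears, `obs_2` is re-pointed at `can_c` with method `manual_link`, and the pending `obs_3` row is ignored.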

build_review_queue.py Normal file

@@ -0,0 +1,175 @@
from collections import defaultdict
from datetime import date

import click

from layer_helpers import compact_join, distinct_values, read_csv_rows, stable_id, write_csv_rows

OUTPUT_FIELDS = [
    "review_id",
    "queue_type",
    "retailer",
    "observed_product_id",
    "canonical_product_id",
    "reason_code",
    "priority",
    "raw_item_names",
    "normalized_names",
    "upc",
    "image_url",
    "example_prices",
    "seen_count",
    "status",
    "resolution_notes",
    "created_at",
    "updated_at",
]


def existing_review_state(path):
    try:
        rows = read_csv_rows(path)
    except FileNotFoundError:
        return {}
    return {row["review_id"]: row for row in rows}


def review_reasons(observed_row):
    reasons = []
    if (
        observed_row["is_fee"] == "true"
        or observed_row.get("is_discount_line") == "true"
        or observed_row.get("is_coupon_line") == "true"
    ):
        return reasons
    if observed_row["distinct_upcs_count"] not in {"", "0", "1"}:
        reasons.append(("multiple_upcs", "high"))
    if observed_row["distinct_item_names_count"] not in {"", "0", "1"}:
        reasons.append(("multiple_raw_names", "medium"))
    if not observed_row["representative_image_url"]:
        reasons.append(("missing_image", "medium"))
    if not observed_row["representative_upc"]:
        reasons.append(("missing_upc", "high"))
    if not observed_row["representative_name_norm"]:
        reasons.append(("missing_normalized_name", "high"))
    return reasons


def build_review_queue(observed_rows, item_rows, existing_rows, today_text):
    by_observed = defaultdict(list)
    for row in item_rows:
        observed_id = row.get("observed_product_id", "")
        if observed_id:
            by_observed[observed_id].append(row)
    queue_rows = []
    for observed_row in observed_rows:
        reasons = review_reasons(observed_row)
        if not reasons:
            continue
        related_items = by_observed.get(observed_row["observed_product_id"], [])
        raw_names = compact_join(distinct_values(related_items, "item_name"), limit=5)
        norm_names = compact_join(
            distinct_values(related_items, "item_name_norm"), limit=5
        )
        example_prices = compact_join(
            distinct_values(related_items, "line_total"), limit=5
        )
        for reason_code, priority in reasons:
            review_id = stable_id(
                "rvw",
                f"{observed_row['observed_product_id']}|{reason_code}",
            )
            prior = existing_rows.get(review_id, {})
            queue_rows.append(
                {
                    "review_id": review_id,
                    "queue_type": "observed_product",
                    "retailer": observed_row["retailer"],
                    "observed_product_id": observed_row["observed_product_id"],
                    "canonical_product_id": prior.get("canonical_product_id", ""),
                    "reason_code": reason_code,
                    "priority": priority,
                    "raw_item_names": raw_names,
                    "normalized_names": norm_names,
                    "upc": observed_row["representative_upc"],
                    "image_url": observed_row["representative_image_url"],
                    "example_prices": example_prices,
                    "seen_count": observed_row["times_seen"],
                    "status": prior.get("status", "pending"),
                    "resolution_notes": prior.get("resolution_notes", ""),
                    "created_at": prior.get("created_at", today_text),
                    "updated_at": today_text,
                }
            )
    queue_rows.sort(key=lambda row: (row["priority"], row["reason_code"], row["review_id"]))
    return queue_rows


def attach_observed_ids(item_rows, observed_rows):
    observed_by_key = {row["observed_key"]: row["observed_product_id"] for row in observed_rows}
    attached = []
    for row in item_rows:
        observed_key = "|".join(
            [
                row["retailer"],
                f"upc={row['upc']}",
                f"name={row['item_name_norm']}",
            ]
        ) if row.get("upc") else "|".join(
            [
                row["retailer"],
                f"retailer_item_id={row.get('retailer_item_id', '')}",
                f"name={row['item_name_norm']}",
                f"size={row['size_value']}",
                f"unit={row['size_unit']}",
                f"pack={row['pack_qty']}",
                f"measure={row['measure_type']}",
                f"store_brand={row['is_store_brand']}",
                f"fee={row['is_fee']}",
                f"discount={row.get('is_discount_line', 'false')}",
                f"coupon={row.get('is_coupon_line', 'false')}",
            ]
        )
        enriched = dict(row)
        enriched["observed_product_id"] = observed_by_key.get(observed_key, "")
        attached.append(enriched)
    return attached


@click.command()
@click.option(
    "--observed-csv",
    default="giant_output/products_observed.csv",
    show_default=True,
    help="Path to observed product rows.",
)
@click.option(
    "--items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
    help="Path to enriched Giant item rows.",
)
@click.option(
    "--output-csv",
    default="giant_output/review_queue.csv",
    show_default=True,
    help="Path to review queue output.",
)
def main(observed_csv, items_enriched_csv, output_csv):
    observed_rows = read_csv_rows(observed_csv)
    item_rows = read_csv_rows(items_enriched_csv)
    item_rows = attach_observed_ids(item_rows, observed_rows)
    existing_rows = existing_review_state(output_csv)
    today_text = str(date.today())
    queue_rows = build_review_queue(observed_rows, item_rows, existing_rows, today_text)
    write_csv_rows(output_csv, queue_rows, OUTPUT_FIELDS)
    click.echo(f"wrote {len(queue_rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()
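
`build_review_queue` sorts on the raw `priority` string, so "high" lands before "medium" only because of alphabetical order. The sketch below shows how an explicit rank would keep the ordering intentional if more levels were ever added. `PRIORITY_RANK`, `queue_sort_key`, the "low" level, and the sample rows are assumptions for illustration; none of them appear in the source.

```python
# Explicit ordering for priorities; "low" is a hypothetical extra level.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def queue_sort_key(row):
    # Unknown priorities sort after all known ones instead of raising KeyError.
    rank = PRIORITY_RANK.get(row["priority"], len(PRIORITY_RANK))
    return (rank, row["reason_code"], row["review_id"])


rows = [
    {"priority": "medium", "reason_code": "missing_image", "review_id": "rvw_2"},
    {"priority": "low", "reason_code": "note", "review_id": "rvw_3"},
    {"priority": "high", "reason_code": "missing_upc", "review_id": "rvw_1"},
]

# Plain string sort: "low" slips in between "high" and "medium".
by_string = [row["priority"] for row in sorted(rows, key=lambda row: row["priority"])]
# Rank-based sort: high, medium, low.
by_rank = [row["priority"] for row in sorted(rows, key=queue_sort_key)]
```

With only "high" and "medium" in use, both orderings agree, so the existing sort is correct as written.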

collect_costco_web.py Normal file

@@ -0,0 +1,65 @@
import click

import scrape_costco


@click.command()
@click.option(
    "--outdir",
    default="data/costco-web",
    show_default=True,
    help="Directory for Costco raw and collected outputs.",
)
@click.option(
    "--document-type",
    default="all",
    show_default=True,
    help="Summary document type.",
)
@click.option(
    "--document-sub-type",
    default="all",
    show_default=True,
    help="Summary document sub type.",
)
@click.option(
    "--window-days",
    default=92,
    show_default=True,
    type=int,
    help="Maximum number of days to request per summary window.",
)
@click.option(
    "--months-back",
    default=36,
    show_default=True,
    type=int,
    help="How many months of receipts to enumerate back from today.",
)
@click.option(
    "--firefox-profile-dir",
    default=None,
    help="Firefox profile directory to use for cookies and session storage.",
)
def main(
    outdir,
    document_type,
    document_sub_type,
    window_days,
    months_back,
    firefox_profile_dir,
):
    scrape_costco.run_collection(
        outdir=outdir,
        document_type=document_type,
        document_sub_type=document_sub_type,
        window_days=window_days,
        months_back=months_back,
        firefox_profile_dir=firefox_profile_dir,
        orders_filename="collected_orders.csv",
        items_filename="collected_items.csv",
    )


if __name__ == "__main__":
    main()

collect_giant_web.py Normal file

@@ -0,0 +1,34 @@
import click

import scrape_giant


@click.command()
@click.option("--user-id", default=None, help="Giant user id.")
@click.option("--loyalty", default=None, help="Giant loyalty number.")
@click.option(
    "--outdir",
    default="data/giant-web",
    show_default=True,
    help="Directory for raw json and collected csv outputs.",
)
@click.option(
    "--sleep-seconds",
    default=1.5,
    show_default=True,
    type=float,
    help="Delay between order detail requests.",
)
def main(user_id, loyalty, outdir, sleep_seconds):
    scrape_giant.run_collection(
        user_id,
        loyalty,
        outdir,
        sleep_seconds,
        orders_filename="collected_orders.csv",
        items_filename="collected_items.csv",
    )


if __name__ == "__main__":
    main()

enrich_costco.py Normal file

@@ -0,0 +1,365 @@
import csv
import json
import re
from collections import defaultdict
from pathlib import Path

import click

from enrich_giant import (
    OUTPUT_FIELDS,
    derive_normalized_quantity,
    derive_price_fields,
    format_decimal,
    normalization_identity,
    normalize_number,
    normalize_unit,
    normalize_whitespace,
    singularize_tokens,
    to_decimal,
)

PARSER_VERSION = "costco-enrich-v1"
RETAILER = "costco"
DEFAULT_INPUT_DIR = Path("costco_output/raw")
DEFAULT_OUTPUT_CSV = Path("costco_output/items_enriched.csv")
CODE_TOKEN_RE = re.compile(
    r"\b(?:SL\d+|T\d+H\d+|P\d+(?:/\d+)?|W\d+T\d+H\d+|FY\d+|CSPC#|C\d+T\d+H\d+|EC\d+T\d+H\d+|\d+X\d+)\b"
)
PACK_FRACTION_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT)\b")
HASH_SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)#\b")
PACK_DASH_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*-\s*PACK\b")
PACK_WORD_RE = re.compile(r"(?<![A-Z0-9])(\d+)\s*PACK\b")
SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)\s*(OZ|LB|LBS|CT|KG|G)\b")
DISCOUNT_TARGET_RE = re.compile(r"^/\s*(\d+)\b")


def clean_costco_name(name):
    cleaned = normalize_whitespace(name).upper().replace('"', "")
    cleaned = CODE_TOKEN_RE.sub(" ", cleaned)
    cleaned = re.sub(r"\s*/\s*\d+(?:\.\d+)?\s*(KG|G)\b", " ", cleaned)
    cleaned = normalize_whitespace(cleaned)
    return cleaned


def combine_description(item):
    return normalize_whitespace(
        " ".join(
            str(part).strip()
            for part in [item.get("itemDescription01"), item.get("itemDescription02")]
            if part
        )
    )


def parse_costco_size_and_pack(cleaned_name):
    pack_qty = ""
    size_value = ""
    size_unit = ""
    match = PACK_FRACTION_RE.search(cleaned_name)
    if match:
        pack_qty = normalize_number(match.group(1))
        size_value = normalize_number(match.group(2))
        size_unit = normalize_unit(match.group(3))
        return size_value, size_unit, pack_qty
    match = HASH_SIZE_RE.search(cleaned_name)
    if match:
        size_value = normalize_number(match.group(1))
        size_unit = "lb"
    match = PACK_DASH_RE.search(cleaned_name) or PACK_WORD_RE.search(cleaned_name)
    if match:
        pack_qty = normalize_number(match.group(1))
    matches = list(SIZE_RE.finditer(cleaned_name))
    if matches:
        last = matches[-1]
        unit = last.group(2)
        size_value = normalize_number(last.group(1))
        size_unit = "count" if unit == "CT" else normalize_unit(unit)
    return size_value, size_unit, pack_qty


def normalize_costco_name(cleaned_name):
    brand = ""
    base = cleaned_name
    if base.startswith("KS "):
        brand = "KS"
        base = normalize_whitespace(base[3:])
    size_value, size_unit, pack_qty = parse_costco_size_and_pack(base)
    if size_value and size_unit:
        if pack_qty:
            base = PACK_FRACTION_RE.sub(" ", base)
        else:
            base = SIZE_RE.sub(" ", base)
    base = HASH_SIZE_RE.sub(" ", base)
    base = PACK_DASH_RE.sub(" ", base)
    base = PACK_WORD_RE.sub(" ", base)
    base = normalize_whitespace(base)
    tokens = []
    for token in base.split():
        if token in {"ORG"}:
            continue
        if token in {"PEANUT", "BUTTER"} and "JIF" in base:
            continue
        tokens.append(token)
    base = singularize_tokens(" ".join(tokens))
    return normalize_whitespace(base), brand, size_value, size_unit, pack_qty


def guess_measure_type(size_unit, pack_qty, is_discount_line):
    if is_discount_line:
        return "each"
    if size_unit in {"lb", "oz", "g", "kg"}:
        return "weight"
    if size_unit in {"ml", "l", "qt", "pt", "gal", "fl_oz"}:
        return "volume"
    if size_unit == "count" or pack_qty:
        return "count"
    return "each"


def derive_costco_prices(item, measure_type, size_value, size_unit, pack_qty):
    line_total = to_decimal(item.get("amount"))
    qty = to_decimal(item.get("unit"))
    parsed_size = to_decimal(size_value)
    parsed_pack = to_decimal(pack_qty) or 1
    price_per_each = ""
    price_per_lb = ""
    price_per_oz = ""
    if line_total is None:
        return price_per_each, price_per_lb, price_per_oz
    if measure_type in {"each", "count"} and qty not in (None, 0):
        price_per_each = format_decimal(line_total / qty)
    if parsed_size not in (None, 0):
        total_units = parsed_size * parsed_pack * (qty or 1)
        if size_unit == "lb":
            per_lb = line_total / total_units
            price_per_lb = format_decimal(per_lb)
            price_per_oz = format_decimal(per_lb / 16)
        elif size_unit == "oz":
            per_oz = line_total / total_units
            price_per_oz = format_decimal(per_oz)
            price_per_lb = format_decimal(per_oz * 16)
    return price_per_each, price_per_lb, price_per_oz


def is_discount_item(item):
    amount = to_decimal(item.get("amount")) or 0
    unit = to_decimal(item.get("unit")) or 0
    description = combine_description(item)
    return amount < 0 or unit < 0 or description.startswith("/")


def discount_target_id(raw_name):
    match = DISCOUNT_TARGET_RE.match(normalize_whitespace(raw_name))
    if not match:
        return ""
    return match.group(1)


def parse_costco_item(order_id, order_date, raw_path, line_no, item):
    raw_name = combine_description(item)
    cleaned_name = clean_costco_name(raw_name)
    item_name_norm, brand_guess, size_value, size_unit, pack_qty = normalize_costco_name(
        cleaned_name
    )
    is_discount_line = is_discount_item(item)
    is_coupon_line = "true" if raw_name.startswith("/") else "false"
    measure_type = guess_measure_type(size_unit, pack_qty, is_discount_line)
    price_per_each, price_per_lb, price_per_oz = derive_costco_prices(
        item, measure_type, size_value, size_unit, pack_qty
    )
    normalized_row_id = f"{RETAILER}:{order_id}:{line_no}"
    normalized_quantity, normalized_quantity_unit = derive_normalized_quantity(
        size_value,
        size_unit,
        pack_qty,
        measure_type,
    )
    identity_key, normalization_basis = normalization_identity(
        {
            "retailer": RETAILER,
            "normalized_row_id": normalized_row_id,
            "upc": "",
            "retailer_item_id": str(item.get("itemNumber", "")),
            "item_name_norm": item_name_norm,
            "size_value": size_value,
            "size_unit": size_unit,
            "pack_qty": pack_qty,
        }
    )
    price_fields = derive_price_fields(
        price_per_each,
        price_per_lb,
        price_per_oz,
        str(item.get("amount", "")),
        str(item.get("unit", "")),
        pack_qty,
    )
    return {
        "retailer": RETAILER,
        "order_id": str(order_id),
        "line_no": str(line_no),
        "normalized_row_id": normalized_row_id,
        "normalized_item_id": f"cnorm:{identity_key}",
        "normalization_basis": normalization_basis,
        "observed_item_key": normalized_row_id,
        "order_date": normalize_whitespace(order_date),
        "retailer_item_id": str(item.get("itemNumber", "")),
        "pod_id": "",
        "item_name": raw_name,
        "upc": "",
        "category_id": str(item.get("itemDepartmentNumber", "")),
        "category": str(item.get("transDepartmentNumber", "")),
        "qty": str(item.get("unit", "")),
        "unit": str(item.get("itemIdentifier", "")),
        "unit_price": str(item.get("itemUnitPriceAmount", "")),
        "line_total": str(item.get("amount", "")),
        "picked_weight": "",
        "mvp_savings": "",
        "reward_savings": "",
        "coupon_savings": str(item.get("amount", "")) if is_discount_line else "",
        "coupon_price": "",
        "matched_discount_amount": "",
        "net_line_total": str(item.get("amount", "")) if not is_discount_line else "",
        "image_url": "",
        "raw_order_path": raw_path.as_posix(),
        "item_name_norm": item_name_norm,
        "brand_guess": brand_guess,
        "variant": "",
        "size_value": size_value,
        "size_unit": size_unit,
        "pack_qty": pack_qty,
        "measure_type": measure_type,
        "normalized_quantity": normalized_quantity,
        "normalized_quantity_unit": normalized_quantity_unit,
        "is_store_brand": "true" if brand_guess else "false",
        "is_item": "false" if is_discount_line else "true",
        "is_fee": "false",
        "is_discount_line": "true" if is_discount_line else "false",
        "is_coupon_line": is_coupon_line,
        **price_fields,
        "parse_version": PARSER_VERSION,
        "parse_notes": "",
    }


def match_costco_discounts(rows):
    rows_by_order = defaultdict(list)
    for row in rows:
        rows_by_order[row["order_id"]].append(row)
    for order_rows in rows_by_order.values():
        purchase_rows_by_item_id = defaultdict(list)
        for row in order_rows:
            if row.get("is_discount_line") == "true":
                continue
            retailer_item_id = row.get("retailer_item_id", "")
            if retailer_item_id:
                purchase_rows_by_item_id[retailer_item_id].append(row)
        for row in order_rows:
            if row.get("is_discount_line") != "true":
                continue
            target_id = discount_target_id(row.get("item_name", ""))
            if not target_id:
                continue
            matches = purchase_rows_by_item_id.get(target_id, [])
            if len(matches) != 1:
                row["parse_notes"] = normalize_whitespace(
                    f"{row.get('parse_notes', '')};discount_target_unmatched={target_id}"
                ).strip(";")
                continue
            purchase_row = matches[0]
            matched_discount = to_decimal(row.get("line_total"))
            gross_total = to_decimal(purchase_row.get("line_total"))
            existing_discount = to_decimal(purchase_row.get("matched_discount_amount")) or 0
            if matched_discount is None or gross_total is None:
                continue
            total_discount = existing_discount + matched_discount
            purchase_row["matched_discount_amount"] = format_decimal(total_discount)
            purchase_row["net_line_total"] = format_decimal(gross_total + total_discount)
            purchase_row["parse_notes"] = normalize_whitespace(
                f"{purchase_row.get('parse_notes', '')};matched_discount={target_id}"
            ).strip(";")
            row["parse_notes"] = normalize_whitespace(
                f"{row.get('parse_notes', '')};matched_to_item={target_id}"
            ).strip(";")


def iter_costco_rows(raw_dir):
    for path in discover_json_files(raw_dir):
        if path.name in {"summary.json", "summary_requests.json"}:
            continue
        payload = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(payload, dict):
            continue
        receipts = payload.get("data", {}).get("receiptsWithCounts", {}).get("receipts", [])
        for receipt in receipts:
            order_id = receipt["transactionBarcode"]
            order_date = receipt.get("transactionDate", "")
            for line_no, item in enumerate(receipt.get("itemArray", []), start=1):
                yield parse_costco_item(order_id, order_date, path, line_no, item)


def discover_json_files(raw_dir):
    raw_dir = Path(raw_dir)
    candidates = sorted(raw_dir.glob("*.json"))
    if candidates:
        return candidates
    if raw_dir.name == "raw" and raw_dir.parent.exists():
        return sorted(raw_dir.parent.glob("*.json"))
    return []


def build_items_enriched(raw_dir):
    rows = list(iter_costco_rows(raw_dir))
    match_costco_discounts(rows)
    rows.sort(key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])))
    return rows


def write_csv(path, rows):
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=OUTPUT_FIELDS)
        writer.writeheader()
        writer.writerows(rows)


@click.command()
@click.option(
    "--input-dir",
    default=str(DEFAULT_INPUT_DIR),
    show_default=True,
    help="Directory containing Costco raw order json files.",
)
@click.option(
    "--output-csv",
    default=str(DEFAULT_OUTPUT_CSV),
    show_default=True,
    help="CSV path for enriched Costco item rows.",
)
def main(input_dir, output_csv):
    click.echo("legacy entrypoint: prefer normalize_costco_web.py for data-model outputs")
    rows = build_items_enriched(Path(input_dir))
    write_csv(Path(output_csv), rows)
    click.echo(f"wrote {len(rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()
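
The discount matching in `match_costco_discounts` relies on Costco discount lines naming their target item as `/<item number>` and carrying a negative amount, so the net total is the gross total plus the discount. A small worked sketch of that arithmetic follows. The helper name `discount_target`, the item number, and the prices are made-up values for illustration; only the regular expression is copied from the file above.

```python
import re
from decimal import Decimal

# Same pattern as DISCOUNT_TARGET_RE in enrich_costco.py.
DISCOUNT_TARGET_RE = re.compile(r"^/\s*(\d+)\b")


def discount_target(raw_name):
    """Return the item number a discount line points at, or "" if none."""
    match = DISCOUNT_TARGET_RE.match(" ".join(raw_name.split()))
    return match.group(1) if match else ""


# Hypothetical rows: one purchase line and the discount line that targets it.
purchase = {"retailer_item_id": "1234567", "line_total": "12.99"}
discount = {"item_name": "/ 1234567", "line_total": "-3.00"}

target = discount_target(discount["item_name"])
net_total = None
if target == purchase["retailer_item_id"]:
    # The discount amount is negative, so adding it lowers the total.
    net_total = Decimal(purchase["line_total"]) + Decimal(discount["line_total"])
```

Using `Decimal` rather than `float` keeps the cents exact, which matches how the source parses amounts through `to_decimal`.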

enrich_giant.py Normal file

@@ -0,0 +1,562 @@
import csv
import json
import re
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
from pathlib import Path
import click
PARSER_VERSION = "giant-enrich-v1"
RETAILER = "giant"
DEFAULT_INPUT_DIR = Path("giant_output/raw")
DEFAULT_OUTPUT_CSV = Path("giant_output/items_enriched.csv")
OUTPUT_FIELDS = [
"retailer",
"order_id",
"line_no",
"normalized_row_id",
"normalized_item_id",
"normalization_basis",
"observed_item_key",
"order_date",
"retailer_item_id",
"pod_id",
"item_name",
"upc",
"category_id",
"category",
"qty",
"unit",
"unit_price",
"line_total",
"picked_weight",
"mvp_savings",
"reward_savings",
"coupon_savings",
"coupon_price",
"matched_discount_amount",
"net_line_total",
"image_url",
"raw_order_path",
"item_name_norm",
"brand_guess",
"variant",
"size_value",
"size_unit",
"pack_qty",
"measure_type",
"normalized_quantity",
"normalized_quantity_unit",
"is_store_brand",
"is_item",
"is_fee",
"is_discount_line",
"is_coupon_line",
"price_per_each",
"price_per_each_basis",
"price_per_count",
"price_per_count_basis",
"price_per_lb",
"price_per_lb_basis",
"price_per_oz",
"price_per_oz_basis",
"parse_version",
"parse_notes",
]
STORE_BRAND_PREFIXES = {
"SB": "SB",
"NP": "NP",
}
DROP_TOKENS = {"FRESH"}
ABBREVIATIONS = {
"APPLE": "APPLE",
"APPLES": "APPLES",
"APLE": "APPLE",
"BASIL": "BASIL",
"BLK": "BLACK",
"BNLS": "BONELESS",
"BRWN": "BROWN",
"CARROTS": "CARROTS",
"CHDR": "CHEDDAR",
"CHICKEN": "CHICKEN",
"CHOC": "CHOCOLATE",
"CHS": "CHEESE",
"CHSE": "CHEESE",
"CHZ": "CHEESE",
"CILANTRO": "CILANTRO",
"CKI": "COOKIE",
"CRSHD": "CRUSHED",
"FLR": "FLOUR",
"FRSH": "FRESH",
"GALA": "GALA",
"GRAHM": "GRAHAM",
"HOT": "HOT",
"HRSRDSH": "HORSERADISH",
"IMP": "IMPORTED",
"IQF": "IQF",
"LENTILS": "LENTILS",
"LG": "LARGE",
"MLK": "MILK",
"MSTRD": "MUSTARD",
"ONION": "ONION",
"ORG": "ORGANIC",
"PEPPER": "PEPPER",
"PEPPERS": "PEPPERS",
"POT": "POTATO",
"POTATO": "POTATO",
"PPR": "PEPPER",
"RICOTTA": "RICOTTA",
"ROASTER": "ROASTER",
"ROTINI": "ROTINI",
"SCE": "SAUCE",
"SLC": "SLICED",
"SPINCH": "SPINACH",
"SPNC": "SPINACH",
"SPINACH": "SPINACH",
"SQZ": "SQUEEZE",
"SWT": "SWEET",
"THYME": "THYME",
"TOM": "TOMATO",
"TOMS": "TOMATOES",
"TRTL": "TORTILLA",
"VEG": "VEGETABLE",
"VINEGAR": "VINEGAR",
"WHT": "WHITE",
"WHOLE": "WHOLE",
"YLW": "YELLOW",
"YLWGLD": "YELLOW_GOLD",
}
FEE_PATTERNS = [
re.compile(r"\bBAG CHARGE\b"),
re.compile(r"\bDISC AT TOTAL\b"),
]
SIZE_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)(?:\s*)(OZ|Z|LB|LBS|ML|L|FZ|FL OZ|QT|PT|GAL|GA)\b")
PACK_RE = re.compile(r"(?<![A-Z0-9])(\d+(?:\.\d+)?)(?:\s*)(CT|PK|PKG|PACK)\b")


def to_decimal(value):
    if value in ("", None):
        return None
    try:
        return Decimal(str(value))
    except (InvalidOperation, ValueError):
        return None


def format_decimal(value, places=4):
    if value is None:
        return ""
    quant = Decimal("1").scaleb(-places)
    normalized = value.quantize(quant, rounding=ROUND_HALF_UP).normalize()
    return format(normalized, "f")


def normalize_whitespace(value):
    return " ".join(str(value or "").strip().split())


def clean_item_name(name):
    cleaned = normalize_whitespace(name).upper()
    cleaned = re.sub(r"^\+", "", cleaned)
    cleaned = re.sub(r"^PLU#\d+\s*", "", cleaned)
    cleaned = cleaned.replace("#", " ")
    return normalize_whitespace(cleaned)


def extract_store_brand_prefix(cleaned_name):
    for prefix, brand in STORE_BRAND_PREFIXES.items():
        if cleaned_name == prefix or cleaned_name.startswith(f"{prefix} "):
            return prefix, brand
    return "", ""


def extract_image_url(item):
    image = item.get("image")
    if isinstance(image, dict):
        for key in ["xlarge", "large", "medium", "small"]:
            value = image.get(key)
            if value:
                return value
    if isinstance(image, str):
        return image
    return ""


def parse_size_and_pack(cleaned_name):
    size_value = ""
    size_unit = ""
    pack_qty = ""
    size_matches = list(SIZE_RE.finditer(cleaned_name))
    if size_matches:
        match = size_matches[-1]
        size_value = normalize_number(match.group(1))
        size_unit = normalize_unit(match.group(2))
    pack_matches = list(PACK_RE.finditer(cleaned_name))
    if pack_matches:
        match = pack_matches[-1]
        pack_qty = normalize_number(match.group(1))
    return size_value, size_unit, pack_qty


def normalize_number(value):
    decimal = to_decimal(value)
    if decimal is None:
        return ""
    return format(decimal.normalize(), "f")


def normalize_unit(unit):
    collapsed = normalize_whitespace(unit).upper()
    return {
        "Z": "oz",
        "OZ": "oz",
        "FZ": "fl_oz",
        "FL OZ": "fl_oz",
        "LB": "lb",
        "LBS": "lb",
        "ML": "ml",
        "L": "l",
        "QT": "qt",
        "PT": "pt",
        "GAL": "gal",
        "GA": "gal",
    }.get(collapsed, collapsed.lower())


def strip_measure_tokens(cleaned_name):
    without_sizes = SIZE_RE.sub(" ", cleaned_name)
    without_measures = PACK_RE.sub(" ", without_sizes)
    return normalize_whitespace(without_measures)


def expand_token(token):
    return ABBREVIATIONS.get(token, token)


def normalize_item_name(cleaned_name):
    prefix, _brand = extract_store_brand_prefix(cleaned_name)
    base = cleaned_name
    if prefix:
        base = normalize_whitespace(base[len(prefix):])
    base = strip_measure_tokens(base)
    expanded_tokens = []
    for token in base.split():
        expanded = expand_token(token)
        if expanded in DROP_TOKENS:
            continue
        expanded_tokens.append(expanded)
    expanded = " ".join(token for token in expanded_tokens if token)
    return singularize_tokens(normalize_whitespace(expanded))


def singularize_tokens(text):
    singular_map = {
        "APPLES": "APPLE",
        "BANANAS": "BANANA",
        "BERRIES": "BERRY",
        "EGGS": "EGG",
        "LEMONS": "LEMON",
        "LIMES": "LIME",
        "MANDARINS": "MANDARIN",
        "PEPPERS": "PEPPER",
        "STRAWBERRIES": "STRAWBERRY",
    }
    tokens = [singular_map.get(token, token) for token in text.split()]
    return normalize_whitespace(" ".join(tokens))


def guess_measure_type(item, size_unit, pack_qty):
    unit = normalize_whitespace(item.get("lbEachCd")).upper()
    picked_weight = to_decimal(item.get("totalPickedWeight"))
    qty = to_decimal(item.get("shipQy"))
    if unit == "LB" or (picked_weight is not None and picked_weight > 0 and unit != "EA"):
        return "weight"
    if size_unit in {"lb", "oz"}:
        return "weight"
    if size_unit in {"ml", "l", "qt", "pt", "gal", "fl_oz"}:
        return "volume"
    if pack_qty:
        return "count"
    if unit == "EA" or (qty is not None and qty > 0):
        return "each"
    return ""


def is_fee_item(cleaned_name):
    return any(pattern.search(cleaned_name) for pattern in FEE_PATTERNS)


def derive_prices(item, measure_type, size_value="", size_unit="", pack_qty=""):
    qty = to_decimal(item.get("shipQy"))
    line_total = to_decimal(item.get("groceryAmount"))
    picked_weight = to_decimal(item.get("totalPickedWeight"))
    parsed_size = to_decimal(size_value)
    parsed_pack = to_decimal(pack_qty) or Decimal("1")
    price_per_each = ""
    price_per_lb = ""
    price_per_oz = ""
    if line_total is None:
        return price_per_each, price_per_lb, price_per_oz
    if measure_type == "each" and qty not in (None, Decimal("0")):
        price_per_each = format_decimal(line_total / qty)
    if measure_type == "count" and qty not in (None, Decimal("0")):
        price_per_each = format_decimal(line_total / qty)
    if measure_type == "weight" and picked_weight not in (None, Decimal("0")):
        per_lb = line_total / picked_weight
        price_per_lb = format_decimal(per_lb)
        price_per_oz = format_decimal(per_lb / Decimal("16"))
        return price_per_each, price_per_lb, price_per_oz
    if measure_type == "weight" and parsed_size not in (None, Decimal("0")) and qty not in (None, Decimal("0")):
        total_units = qty * parsed_pack * parsed_size
        if size_unit == "lb":
            per_lb = line_total / total_units
            price_per_lb = format_decimal(per_lb)
            price_per_oz = format_decimal(per_lb / Decimal("16"))
        elif size_unit == "oz":
            per_oz = line_total / total_units
            price_per_oz = format_decimal(per_oz)
            price_per_lb = format_decimal(per_oz * Decimal("16"))
    return price_per_each, price_per_lb, price_per_oz


def derive_normalized_quantity(size_value, size_unit, pack_qty, measure_type):
    parsed_size = to_decimal(size_value)
    parsed_pack = to_decimal(pack_qty) or Decimal("1")
    if parsed_size not in (None, Decimal("0")) and size_unit:
        return format_decimal(parsed_size * parsed_pack), size_unit
    if parsed_pack not in (None, Decimal("0")) and measure_type == "count":
        return format_decimal(parsed_pack), "count"
    if measure_type == "each":
        return "1", "each"
    return "", ""


def derive_price_fields(price_per_each, price_per_lb, price_per_oz, line_total, qty, pack_qty):
    line_total_decimal = to_decimal(line_total)
    qty_decimal = to_decimal(qty)
    pack_decimal = to_decimal(pack_qty)
    price_per_count = ""
    price_per_count_basis = ""
    if line_total_decimal is not None and qty_decimal not in (None, Decimal("0")) and pack_decimal not in (
        None,
        Decimal("0"),
    ):
        price_per_count = format_decimal(line_total_decimal / (qty_decimal * pack_decimal))
        price_per_count_basis = "line_total_over_pack_qty"
    return {
        "price_per_each": price_per_each,
        "price_per_each_basis": "line_total_over_qty" if price_per_each else "",
        "price_per_count": price_per_count,
        "price_per_count_basis": price_per_count_basis,
        "price_per_lb": price_per_lb,
        "price_per_lb_basis": "parsed_or_picked_weight" if price_per_lb else "",
        "price_per_oz": price_per_oz,
        "price_per_oz_basis": "parsed_or_picked_weight" if price_per_oz else "",
    }


def normalization_identity(row):
    if row.get("upc"):
        return f"{row['retailer']}|upc={row['upc']}", "exact_upc"
    if row.get("retailer_item_id"):
        return f"{row['retailer']}|retailer_item_id={row['retailer_item_id']}", "exact_retailer_item_id"
    if row.get("item_name_norm"):
        return (
            "|".join(
                [
                    row["retailer"],
                    f"name={row['item_name_norm']}",
                    f"size={row.get('size_value', '')}",
                    f"unit={row.get('size_unit', '')}",
                    f"pack={row.get('pack_qty', '')}",
                ]
            ),
            "exact_name_size_pack",
        )
    return row["normalized_row_id"], "row_identity"


def parse_item(order_id, order_date, raw_path, line_no, item):
    cleaned_name = clean_item_name(item.get("itemName", ""))
    size_value, size_unit, pack_qty = parse_size_and_pack(cleaned_name)
    prefix, brand_guess = extract_store_brand_prefix(cleaned_name)
    normalized_name = normalize_item_name(cleaned_name)
    measure_type = guess_measure_type(item, size_unit, pack_qty)
    price_per_each, price_per_lb, price_per_oz = derive_prices(
        item,
        measure_type,
        size_value=size_value,
        size_unit=size_unit,
        pack_qty=pack_qty,
    )
    is_fee = is_fee_item(cleaned_name)
    parse_notes = []
    if prefix:
        parse_notes.append(f"store_brand_prefix={prefix}")
    if is_fee:
        parse_notes.append("fee_item")
    if size_value and not size_unit:
        parse_notes.append("size_without_unit")
    normalized_row_id = f"{RETAILER}:{order_id}:{line_no}"
    normalized_quantity, normalized_quantity_unit = derive_normalized_quantity(
        size_value,
        size_unit,
        pack_qty,
        measure_type,
    )
    identity_key, normalization_basis = normalization_identity(
        {
            "retailer": RETAILER,
            "normalized_row_id": normalized_row_id,
            "upc": stringify(item.get("primUpcCd")),
            "retailer_item_id": stringify(item.get("podId")),
            "item_name_norm": normalized_name,
            "size_value": size_value,
            "size_unit": size_unit,
            "pack_qty": pack_qty,
        }
    )
    price_fields = derive_price_fields(
        price_per_each,
        price_per_lb,
        price_per_oz,
        stringify(item.get("groceryAmount")),
        stringify(item.get("shipQy")),
        pack_qty,
    )
    return {
        "retailer": RETAILER,
        "order_id": str(order_id),
        "line_no": str(line_no),
        "normalized_row_id": normalized_row_id,
        "normalized_item_id": f"gnorm:{identity_key}",
        "normalization_basis": normalization_basis,
        "observed_item_key": normalized_row_id,
        "order_date": normalize_whitespace(order_date),
        "retailer_item_id": stringify(item.get("podId")),
        "pod_id": stringify(item.get("podId")),
        "item_name": stringify(item.get("itemName")),
        "upc": stringify(item.get("primUpcCd")),
        "category_id": stringify(item.get("categoryId")),
        "category": stringify(item.get("categoryDesc")),
        "qty": stringify(item.get("shipQy")),
        "unit": stringify(item.get("lbEachCd")),
        "unit_price": stringify(item.get("unitPrice")),
        "line_total": stringify(item.get("groceryAmount")),
        "picked_weight": stringify(item.get("totalPickedWeight")),
        "mvp_savings": stringify(item.get("mvpSavings")),
        "reward_savings": stringify(item.get("rewardSavings")),
        "coupon_savings": stringify(item.get("couponSavings")),
        "coupon_price": stringify(item.get("couponPrice")),
        "matched_discount_amount": "",
        "net_line_total": stringify(item.get("totalPrice")),
        "image_url": extract_image_url(item),
        "raw_order_path": raw_path.as_posix(),
        "item_name_norm": normalized_name,
        "brand_guess": brand_guess,
        "variant": "",
        "size_value": size_value,
        "size_unit": size_unit,
        "pack_qty": pack_qty,
        "measure_type": measure_type,
        "normalized_quantity": normalized_quantity,
        "normalized_quantity_unit": normalized_quantity_unit,
        "is_store_brand": "true" if bool(prefix) else "false",
        "is_item": "false" if is_fee else "true",
        "is_fee": "true" if is_fee else "false",
        "is_discount_line": "false",
        "is_coupon_line": "false",
        **price_fields,
        "parse_version": PARSER_VERSION,
        "parse_notes": ";".join(parse_notes),
    }


def stringify(value):
    if value is None:
        return ""
    return str(value)


def iter_order_rows(raw_dir):
    for path in sorted(raw_dir.glob("*.json")):
        if path.name == "history.json":
            continue
        payload = json.loads(path.read_text(encoding="utf-8"))
        order_id = payload.get("orderId", path.stem)
        order_date = payload.get("orderDate", "")
        for line_no, item in enumerate(payload.get("items", []), start=1):
            yield parse_item(order_id, order_date, path, line_no, item)


def build_items_enriched(raw_dir):
    rows = list(iter_order_rows(raw_dir))
    rows.sort(key=lambda row: (row["order_date"], row["order_id"], int(row["line_no"])))
    return rows


def write_csv(path, rows):
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=OUTPUT_FIELDS)
        writer.writeheader()
        writer.writerows(rows)


@click.command()
@click.option(
    "--input-dir",
    default=str(DEFAULT_INPUT_DIR),
    show_default=True,
    help="Directory containing Giant raw order json files.",
)
@click.option(
    "--output-csv",
    default=str(DEFAULT_OUTPUT_CSV),
    show_default=True,
    help="CSV path for enriched Giant item rows.",
)
def main(input_dir, output_csv):
    click.echo("legacy entrypoint: prefer normalize_giant_web.py for data-model outputs")
    raw_dir = Path(input_dir)
    output_path = Path(output_csv)
    if not raw_dir.exists():
        raise click.ClickException(f"input dir does not exist: {raw_dir}")
    rows = build_items_enriched(raw_dir)
    write_csv(output_path, rows)
    click.echo(f"wrote {len(rows)} rows to {output_path}")


if __name__ == "__main__":
    main()

54
layer_helpers.py Normal file

@@ -0,0 +1,54 @@
import csv
import hashlib
from collections import Counter
from pathlib import Path


def read_csv_rows(path):
    path = Path(path)
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def write_csv_rows(path, rows, fieldnames):
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def stable_id(prefix, raw_key):
    digest = hashlib.sha1(str(raw_key).encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


def first_nonblank(rows, field):
    for row in rows:
        value = row.get(field, "")
        if value:
            return value
    return ""


def representative_value(rows, field):
    values = [row.get(field, "") for row in rows if row.get(field, "")]
    if not values:
        return ""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[0][0]


def distinct_values(rows, field):
    return sorted({row.get(field, "") for row in rows if row.get(field, "")})


def compact_join(values, limit=3):
    unique = []
    seen = set()
    for value in values:
        if value and value not in seen:
            seen.add(value)
            unique.append(value)
    return " | ".join(unique[:limit])

28
normalize_costco_web.py Normal file

@@ -0,0 +1,28 @@
from pathlib import Path

import click

import enrich_costco


@click.command()
@click.option(
    "--input-dir",
    default="data/costco-web/raw",
    show_default=True,
    help="Directory containing Costco raw order json files.",
)
@click.option(
    "--output-csv",
    default="data/costco-web/normalized_items.csv",
    show_default=True,
    help="CSV path for normalized Costco item rows.",
)
def main(input_dir, output_csv):
    rows = enrich_costco.build_items_enriched(Path(input_dir))
    enrich_costco.write_csv(Path(output_csv), rows)
    click.echo(f"wrote {len(rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()

28
normalize_giant_web.py Normal file

@@ -0,0 +1,28 @@
from pathlib import Path

import click

import enrich_giant


@click.command()
@click.option(
    "--input-dir",
    default="data/giant-web/raw",
    show_default=True,
    help="Directory containing Giant raw order json files.",
)
@click.option(
    "--output-csv",
    default="data/giant-web/normalized_items.csv",
    show_default=True,
    help="CSV path for normalized Giant item rows.",
)
def main(input_dir, output_csv):
    rows = enrich_giant.build_items_enriched(Path(input_dir))
    enrich_giant.write_csv(Path(output_csv), rows)
    click.echo(f"wrote {len(rows)} rows to {output_csv}")


if __name__ == "__main__":
    main()

346
pm/data-model.org Normal file

@@ -0,0 +1,346 @@
* Grocery data model and file layout
This document defines the shared file layout and stable CSV schemas for the
grocery pipeline.
Goals:
- Ensure data gathering is separate from analysis
- Enable multiple data gathering methods
- One layer for review and analysis
** Design Rules
- Raw retailer exports remain the source of truth.
- Retailer parsing is isolated to retailer-specific files and ids.
- Cross-retailer product layers begin only after retailer-specific normalization.
- CSV schemas are stable and additive: new columns may be appended, but
existing columns should not be repurposed.
- Unknown values should be left blank rather than guessed.
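The additive-schema rule above can be checked mechanically before a CSV is overwritten. This is a minimal sketch, assuming the rule means that the existing header must remain an unchanged prefix of the new header; `is_additive_change` is a hypothetical helper and is not part of the repo.

```python
def is_additive_change(old_fields, new_fields):
    """Return True when new_fields keeps every old column, in order, and only appends."""
    old_fields = list(old_fields)
    new_fields = list(new_fields)
    # The old header must survive untouched at the front of the new header.
    return new_fields[: len(old_fields)] == old_fields


old_header = ["retailer", "order_id", "line_no"]
# Appending a column keeps the old header as a prefix.
print(is_additive_change(old_header, old_header + ["price_per_count"]))
# Reordering existing columns breaks the prefix.
print(is_additive_change(old_header, ["retailer", "line_no", "order_id"]))
```

A check like this only guards column order and presence; it cannot detect a column whose meaning has been repurposed.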
*** Retailer-specific data:
- raw json payloads
- retailer order ids
- retailer line numbers
- retailer category ids and names
- retailer item names
- retailer image urls
- comparison-ready normalized quantity basis fields
*** Review/Combined data:
- catalog of reviewed products
- links from normalized retailer items to catalog
- human review state for unresolved cases
* Pipeline
Each step can be run alone if its dependencies exist.
Each retail provider script must produce deterministic line-item outputs, and
normalization may assign within-retailer product identity only when the
retailer itself provides strong evidence.
Key:
- (1) input
- [1] output
** 1. Collect
Get raw receipt/visit and item data from a retailer.
Scraping is specific to a retailer and method (e.g., Giant-Web and Giant-Scan).
Preserve the complete raw data with full fidelity.
Avoid interpretation beyond basic data flattening.
- (1) Source access (varies, e.g., header data, auth for API access)
- [1] collected visits from each retailer
- [2] collected items from each retailer
- [3] any other raw data that supports [1] and [2]; explicit source (eventual receipt scan?)
** 2. Normalize
Parse and extract structured facts from retailer-specific raw data
to create a standardized item format for that retailer.
Strictly dependent on Collect method and output.
- Extract quantity, size, pack, pricing, variant
- Attach discount line items to product line items using `upc`/`retailer_item_id` and co-occurrence
- Clean up naming to facilitate later matching
- Assign retailer-level `normalized_item_id` only when evidence is deterministic
- Never use fuzzy or semantic matching here
- (1) collected items from each retailer
- (2) collected visits from each retailer
- [1] normalized items from each retailer
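The discount-attachment step in Normalize could look roughly like the sketch below. It assumes rows are dicts of strings shaped like `normalized_items.csv`, that a discount line shares `order_id` and `retailer_item_id` with the product line it applies to, and that discount amounts are stored as negative `line_total` values. All three are assumptions for illustration, and `attach_discounts` is a hypothetical name.

```python
from collections import defaultdict
from decimal import Decimal


def attach_discounts(rows):
    """Copy discount amounts onto the purchased row with the same order and item id."""
    # Sum discount lines per (order, item); assumes discounts are negative amounts.
    discounts = defaultdict(Decimal)
    for row in rows:
        if row["is_discount_line"] == "true" and row["retailer_item_id"]:
            key = (row["order_id"], row["retailer_item_id"])
            discounts[key] += Decimal(row["line_total"])

    out = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's rows
        key = (row["order_id"], row["retailer_item_id"])
        if row["is_discount_line"] != "true" and key in discounts:
            amount = discounts.pop(key)  # attach each discount total only once
            row["matched_discount_amount"] = str(amount)
            row["net_line_total"] = str(Decimal(row["line_total"]) + amount)
        out.append(row)
    return out
```

Discount rows are passed through unchanged, so they stay available for audit while the purchased row carries the matched amount.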
** 3. Review/Combine (Canonicalization)
Decide whether two normalized retailer items are "the same product";
match items across retailers using algo/logic and human review.
Create catalog linked to normalized retailer items.
- Review operates on distinct `normalized_item_id` values, not individual purchase rows
- Cross-retailer identity decisions happen only here
- Ask a human to create a canonical/catalog item with:
- friendly/catalog_name: "bell pepper"; "milk"
- category: "produce"; "dairy"
- product_type: "pepper"; "milk"
- variant (?): "whole"; "skim"; "2pct"
- Then link the group of items to that catalog item.
- (1) normalized items from each retailer
- [1] review queue of items to be reviewed
- [2] catalog (lookup table) of confirmed normalized retailer items and catalog_id
- [3] pivot-ready purchase list of normalized items
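Because review operates on distinct `normalized_item_id` values, the review queue can be derived by collapsing purchase rows to ids that have no link yet. The sketch below assumes rows are dicts of strings and that any existing row in `product_links.csv` means the id has already been handled; `build_review_queue` is a hypothetical helper, and only a few of the queue columns are filled in.

```python
from collections import Counter


def build_review_queue(normalized_rows, product_links):
    """List each unlinked normalized_item_id once, with a count of its purchase rows."""
    linked = {link["normalized_item_id"] for link in product_links}
    seen = Counter(
        row["normalized_item_id"]
        for row in normalized_rows
        if row["is_item"] == "true" and row["normalized_item_id"] not in linked
    )
    # Sort by id so repeated runs produce the same queue order.
    return [
        {"normalized_item_id": item_id, "seen_count": str(count), "status": "pending"}
        for item_id, count in sorted(seen.items())
    ]
```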
** Unresolved Issues
1. need central script to orchestrate; metadata belongs there and nowhere else
2. `LIME` and `LIME . / .` appearing in the catalog: names must come from review-approved names, not raw strings
* Directory Layout
Use one top-level data root:
#+begin_example
main.py
collect_<retailer>_<method>.py
normalize_<retailer>_<method>.py
review.py
data/
<retailer-method>/
raw/ # unmodified retailer payloads exactly as fetched
<order_id>.json
collected_items.csv # one row per retailer line item w/ retailer-native values
collected_orders.csv # one row per receipt/visit, flattened from raw order data
normalized_items.csv # parsed retailer-specific line items with normalized fields
costco-web/ # sample
raw/
orders/
history.json
<order_id>.json
collected_items.csv
collected_orders.csv
normalized_items.csv
review/
review_queue.csv # Human review queue for unresolved matching/parsing cases.
product_links.csv # Links from normalized retailer items to catalog items.
catalog.csv # Cross-retailer product catalog entities used for comparison.
purchases.csv
#+end_example
Notes:
- The current repo still uses transitional root-level scripts and output folders.
- This layout is the target structure for the refactor, not a claim that migration is already complete.
* Schemas
** `data/<retailer-method>/collected_items.csv`
One row per retailer line item.
| key | definition |
|--------------------+--------------------------------------------|
| `retailer` PK | retailer slug |
| `order_id` PK | retailer order id |
| `line_no` PK | stable line number within order export |
| `order_date` | copied from order when available |
| `retailer_item_id` | retailer-native item id when available |
| `pod_id` | retailer pod/item id |
| `item_name` | raw retailer item name |
| `upc` | retailer UPC or PLU value |
| `category_id` | retailer category id |
| `category` | retailer category description |
| `qty` | retailer quantity field |
| `unit` | retailer unit code such as `EA` or `LB` |
| `unit_price` | retailer unit price field |
| `line_total` | retailer extended price field |
| `picked_weight` | retailer picked weight field |
| `mvp_savings` | retailer savings field |
| `reward_savings` | retailer rewards savings field |
| `coupon_savings` | retailer coupon savings field |
| `coupon_price` | retailer coupon price field |
| `image_url` | raw retailer image url when present |
| `raw_order_path` | relative path to source order payload |
| `is_discount_line` | retailer adjustment or discount-line flag |
| `is_coupon_line` | coupon-like line flag when distinguishable |
** `data/<retailer-method>/collected_orders.csv`
One row per order/visit/receipt.
| key | definition |
|---------------------------+-------------------------------------------------|
| `retailer` PK | retailer slug such as `giant` |
| `order_id` PK | retailer order or visit id |
| `order_date` | order date in `YYYY-MM-DD` when available |
| `delivery_date` | fulfillment date in `YYYY-MM-DD` when available |
| `service_type` | retailer service type such as `INSTORE` |
| `order_total` | order total as provided by retailer |
| `payment_method` | retailer payment label |
| `total_item_count` | total line count or item count from retailer |
| `total_savings` | total savings as provided by retailer |
| `your_savings_total` | savings field from retailer when present |
| `coupons_discounts_total` | coupon/discount total from retailer |
| `store_name` | retailer store name |
| `store_number` | retailer store number |
| `store_address1` | street address |
| `store_city` | city |
| `store_state` | state or province |
| `store_zipcode` | postal code |
| `refund_order` | retailer refund flag |
| `ebt_order` | retailer EBT flag |
| `raw_history_path` | relative path to source history payload |
| `raw_order_path` | relative path to source order payload |
** `data/<retailer-method>/normalized_items.csv`
One row per retailer line item after deterministic parsing. Preserve raw
fields from `collected_items.csv` and add parsed fields that make later review
and grouping easier. Normalization may assign retailer-level identity when the
evidence is deterministic and retailer-scoped.
| key | definition |
|----------------------------+------------------------------------------------------------------|
| `retailer` PK | retailer slug |
| `order_id` PK | retailer order id |
| `line_no` PK | line number within order |
| `normalized_row_id` | stable row key, typically `<retailer>:<order_id>:<line_no>` |
| `normalized_item_id` | stable retailer-level item identity when deterministic grouping is supported |
| `normalization_basis` | basis used to assign `normalized_item_id` |
| `retailer_item_id` | retailer-native item id |
| `item_name` | raw retailer item name |
| `item_name_norm` | normalized retailer item name |
| `brand_guess` | parsed brand guess |
| `variant` | parsed variant text |
| `size_value` | parsed numeric size value |
| `size_unit` | parsed size unit such as `oz`, `lb`, `fl_oz` |
| `pack_qty` | parsed pack or count guess |
| `measure_type` | `each`, `weight`, `volume`, `count`, or blank |
| `normalized_quantity` | numeric comparison basis derived during normalization |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count`, or blank |
| `is_item` | item flag |
| `is_store_brand` | store-brand guess |
| `is_fee` | fee or non-product flag |
| `is_discount_line` | discount or adjustment-line flag |
| `is_coupon_line` | coupon-like line flag |
| `matched_discount_amount` | matched discount value carried onto purchased row when supported |
| `net_line_total` | line total after matched discount when supported |
| `price_per_each` | derived per-each price when supported |
| `price_per_each_basis` | source basis for `price_per_each` |
| `price_per_count` | derived per-count price when supported |
| `price_per_count_basis` | source basis for `price_per_count` |
| `price_per_lb` | derived per-pound price when supported |
| `price_per_lb_basis` | source basis for `price_per_lb` |
| `price_per_oz` | derived per-ounce price when supported |
| `price_per_oz_basis` | source basis for `price_per_oz` |
| `image_url` | best available retailer image url |
| `raw_order_path` | relative path to source order payload |
| `parse_version` | parser version string for reruns |
| `parse_notes` | optional non-fatal parser notes |
Notes:
- `normalized_row_id` identifies the purchase row; `normalized_item_id` identifies a repeated retailer item when strong retailer evidence supports grouping.
- Valid `normalization_basis` values should be explicit, e.g. `exact_upc`, `exact_retailer_item_id`, `exact_name_size_pack`, or `approved_retailer_alias`.
- Do not use fuzzy or semantic matching to assign `normalized_item_id`.
- Discount/coupon rows may remain as standalone normalized rows for auditability even when their amounts are attached to a purchased row via `matched_discount_amount`.
- Cross-retailer identity is handled later in review/combine via `catalog.csv` and `product_links.csv`.
** `data/review/product_links.csv`
One row per review-approved link from a normalized retailer item to a catalog item.
Many normalized retailer items may link to the same catalog item.
| key | definition |
|-------------------------+---------------------------------------------|
| `normalized_item_id` PK | normalized retailer item id |
| `catalog_id` PK | linked catalog product id |
| `link_method` | `manual`, `exact_upc`, `exact_name_size`, etc. |
| `link_confidence` | optional confidence label |
| `review_status` | `pending`, `approved`, `rejected`, or blank |
| `reviewed_by` | reviewer id or initials |
| `reviewed_at` | review timestamp or date |
| `link_notes` | optional notes |
** `data/review/review_queue.csv`
One row per issue needing human review.
| key | definition |
|----------------------+-----------------------------------------------------|
| `review_id` PK | stable review row id |
| `queue_type` | `link_candidate`, `parse_issue`, `catalog_cleanup` |
| `retailer` | retailer slug when applicable |
| `normalized_item_id` | normalized retailer item id when review is item-level |
| `normalized_row_id` | normalized row id when review is row-specific |
| `catalog_id` | candidate canonical id |
| `reason_code` | machine-readable review reason |
| `priority` | optional priority label |
| `raw_item_names` | compact list of example raw names |
| `normalized_names` | compact list of example normalized names |
| `upc` | example UPC/PLU |
| `image_url` | example image url |
| `example_prices` | compact list of example prices |
| `seen_count` | count of related rows |
| `status` | `pending`, `approved`, `rejected`, `deferred` |
| `resolution_notes` | reviewer notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
** `data/catalog.csv`
One row per cross-retailer catalog product.
| key | definition |
|----------------------------+----------------------------------------|
| `catalog_id` PK | stable catalog product id |
| `catalog_name` | human-reviewed product name |
| `product_type` | generic product eg `apple`, `milk` |
| `category` | broad section eg `produce`, `dairy` |
| `brand` | canonical brand when applicable |
| `variant` | canonical variant |
| `size_value` | normalized size value |
| `size_unit` | normalized size unit |
| `pack_qty` | normalized pack/count |
| `measure_type` | normalized measure type |
| `normalized_quantity` | numeric comparison basis value |
| `normalized_quantity_unit` | basis unit such as `oz`, `lb`, `count` |
| `notes` | optional human notes |
| `created_at` | creation timestamp or date |
| `updated_at` | last update timestamp or date |
Notes:
- Do not auto-create new catalog rows from weak normalized names alone.
- Do not encode packaging/count into `catalog_name` unless it is essential to product identity.
- `catalog_name` should come from review-approved naming, not raw retailer strings.
** `data/purchases.csv`
One row per purchased item (i.e., `is_item`==true from normalized layer), with
catalog attributes denormalized in and discounts already applied.
| key | definition |
|----------------------------+----------------------------------------------------------------|
| `purchase_date` | date of purchase (from order) |
| `retailer` | retailer slug |
| `order_id` | retailer order id |
| `line_no` | line number within order |
| `normalized_row_id` | `<retailer>:<order_id>:<line_no>` |
| `normalized_item_id` | retailer-level normalized item identity |
| `catalog_id` | linked catalog product id |
| `catalog_name` | catalog product name for analysis |
| `catalog_product_type` | broader product family (e.g., `egg`, `milk`) |
| `catalog_category` | category such as `produce`, `dairy` |
| `catalog_brand` | canonical brand when applicable |
| `catalog_variant` | canonical variant when applicable |
| `raw_item_name` | original retailer item name |
| `normalized_item_name` | cleaned/normalized retailer item name |
| `retailer_item_id` | retailer-native item id |
| `upc` | UPC/PLU when available |
| `qty` | retailer quantity field |
| `unit` | retailer unit (e.g., `EA`, `LB`) |
| `pack_qty` | parsed pack/count |
| `size_value` | parsed size value |
| `size_unit` | parsed size unit |
| `measure_type` | `each`, `weight`, `volume`, `count` |
| `normalized_quantity` | normalized comparison quantity |
| `normalized_quantity_unit` | unit for normalized quantity |
| `unit_price` | retailer unit price |
| `line_total` | original retailer extended price (pre-discount) |
| `matched_discount_amount` | discount amount matched from discount lines |
| `net_line_total` | effective price after discount (`line_total` + discounts) |
| `store_name` | retailer store name |
| `store_city` | store city |
| `store_state` | store state |
| `price_per_each` | derived per-each price |
| `price_per_each_basis` | source basis for per-each calc |
| `price_per_count` | derived per-count price |
| `price_per_count_basis` | source basis for per-count calc |
| `price_per_lb` | derived per-pound price |
| `price_per_lb_basis` | source basis for per-pound calc |
| `price_per_oz` | derived per-ounce price |
| `price_per_oz_basis` | source basis for per-ounce calc |
| `is_fee` | true if row represents non-product fee |
| `raw_order_path` | relative path to original order payload |
Notes:
- Only rows that represent purchased items should appear here.
- `line_total` preserves retailer truth; `net_line_total` is what you actually paid.
- catalog fields are denormalized in to make pivoting trivial.
- no discount/coupon rows exist here; their effects are carried via `matched_discount_amount`.
- review/link decisions should apply at the `normalized_item_id` level, then fan out to all purchase rows sharing that id.
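The fan-out described in the last note can be sketched as a lookup join: one approved link per `normalized_item_id` is applied to every purchase row that shares the id. This is an illustration only; `build_purchases` here is a hypothetical function (not the repo's `build_purchases.py`), rows are assumed to be dicts of strings, and only a subset of the `purchases.csv` columns is emitted.

```python
def build_purchases(normalized_rows, product_links, catalog_rows):
    """Denormalize catalog fields onto purchased rows via approved links."""
    catalog = {row["catalog_id"]: row for row in catalog_rows}
    link_by_item = {
        link["normalized_item_id"]: link["catalog_id"]
        for link in product_links
        if link["review_status"] == "approved"
    }
    purchases = []
    for row in normalized_rows:
        if row["is_item"] != "true":
            continue  # fee and discount rows never reach the purchase log
        catalog_id = link_by_item.get(row["normalized_item_id"], "")
        entry = catalog.get(catalog_id, {})
        purchases.append(
            {
                "normalized_row_id": row["normalized_row_id"],
                "normalized_item_id": row["normalized_item_id"],
                "catalog_id": catalog_id,
                "catalog_name": entry.get("catalog_name", ""),
                "catalog_category": entry.get("category", ""),
                "line_total": row["line_total"],
                "net_line_total": row["net_line_total"],
            }
        )
    return purchases
```

Unlinked items keep blank catalog fields rather than a guessed match, in line with the rule that unknown values stay blank.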
* /

502
pm/notes.org Normal file

File diff suppressed because one or more lines are too long

73
pm/review-workflow.org Normal file

@@ -0,0 +1,73 @@
* review and item-resolution workflow
This document defines the durable review workflow for unresolved observed
products.
** persistent files
- `combined_output/purchases.csv`
Flat normalized purchase log. This is the review input because it retains:
- raw item name
- normalized item name
- observed product id
- canonical product id when resolved
- retailer/order/date/price context
- `combined_output/review_queue.csv`
Current unresolved observed products grouped for review.
- `combined_output/review_resolutions.csv`
Durable mapping decisions from observed products to canonical products.
- `combined_output/canonical_catalog.csv`
Durable canonical item catalog used by manual review and later purchase-log
rebuilds.
There is no separate alias file in v1. `review_resolutions.csv` is the mapping
layer from observed products to canonical product ids.
** workflow
1. Run `build_purchases.py`
This refreshes the purchase log and seeds/updates the canonical catalog from
current auto-linked canonical rows.
2. Run `review_products.py`
This rebuilds `review_queue.csv` from unresolved purchase rows and prompts in
the terminal for one observed product at a time.
3. Choose one of:
- link to existing canonical
- create new canonical
- exclude
- skip
4. `review_products.py` writes decisions immediately to:
- `review_resolutions.csv`
- `canonical_catalog.csv` when a new canonical item is created
5. Rerun `build_purchases.py`
This reapplies approved resolutions so the final normalized purchase log now
carries the reviewed `canonical_product_id`.
** what the human edits
The primary interface is terminal prompts in `review_products.py`.
The human provides:
- existing canonical id when linking
- canonical name/category/product type when creating a new canonical item
- optional resolution notes
The generated CSVs remain editable by hand if needed, but the intended workflow
is terminal-first.
** durability
- Resolutions are keyed by `observed_product_id`, not by one-off text
substitution.
- Canonical products are keyed by stable `canonical_product_id`.
- Future runs reuse approved mappings through `review_resolutions.csv`.
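The durability rules can be illustrated with a small sketch of how a later run might reuse stored decisions. The CSV content below is invented sample data, the `status` column is an assumption (this document only names the two id columns), and `load_approved` and `apply_resolutions` are hypothetical helpers rather than functions from `review_products.py`.

```python
import csv
import io

# Hypothetical contents of review_resolutions.csv; the `status` column is assumed.
RESOLUTIONS_CSV = """observed_product_id,canonical_product_id,status
obs_1,can_milk,approved
obs_2,can_lime,pending
"""


def load_approved(handle):
    """Map observed_product_id -> canonical_product_id for approved rows only."""
    return {
        row["observed_product_id"]: row["canonical_product_id"]
        for row in csv.DictReader(handle)
        if row["status"] == "approved"
    }


def apply_resolutions(purchase_rows, approved):
    """Fill canonical_product_id on rows that are still unresolved."""
    for row in purchase_rows:
        if not row.get("canonical_product_id"):
            row["canonical_product_id"] = approved.get(row["observed_product_id"], "")
    return purchase_rows


approved = load_approved(io.StringIO(RESOLUTIONS_CSV))
purchases = apply_resolutions(
    [{"observed_product_id": "obs_1"}, {"observed_product_id": "obs_2"}],
    approved,
)
```

Keying on `observed_product_id` means a decision made once applies to every later purchase of the same observed product, which a one-off text substitution would not guarantee.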
** retention of audit fields
The final `purchases.csv` retains:
- `raw_item_name`
- `normalized_item_name`
- `canonical_product_id`
This preserves the raw receipt description, the deterministic parser output, and
the human-approved canonical identity in one flat purchase log.

View File

@@ -1,107 +0,0 @@
* python setup
venv install playwright, pandas
playwright install
1. scrape - raw giant json
2. enrich -
cols:
item_name_norm
brand_guess
size_value
size_unit
pack_qty
variant
is_store_brand
is_fee
measure_type
price_per_lb
price_per_oz
price_per_each
image_url
normalize abbreviations
extract size like 12z, 10ct, 5lb
detect fees like bag charges
infer whether something is sold by each vs weight
carry forward image url
3. build observed-product table from enriched items
* item:
get:
/api/v6.0/user/369513017/order/history/detail/69a2e44a16be1142e74ad3cc
headers:
request:
GET /api/v6.0/user/369513017/order/history/detail/69a2e44a16be1142e74ad3cc?isInStore=true HTTP/2
Host: giantfood.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br, zstd
DNT: 1
Sec-GPC: 1
Connection: keep-alive
Referer: https://giantfood.com/account/history/invoice/in-store
Cookie: datadome=rDtvd3J2hO5AeghJMSFRRxGc6ifKCQYgMLcqPNr9rWiz2rdcXb032AY6GIZn8tUmYB96BKKbzh3_jSjEzYWLj8hDjl3oGYYAiu4jwdaxpf3vh2v4f7KH7kbqgsMWpkjt; cf_clearance=WEPyQokx9f0qoyS4Svsw4EkZ1TYOxjOwcUHspT3.rXw-1773348940-1.2.1.1-fPvERGxBlFUaBW83sUppbUWpwvFG7mZivag5vBvZb3kxUQv2WSVIV1tON0HV2n8bkVY0U8_BBl62a00Np.oJylYQcGME540gZlYEoL.gMs4WynLqApFe5BOXAEwOm01_6h6b62H90bl4ypRehVb_TXEi4qHaPLVSZhjZK_h.fv6RBqjgYch2j_8XnHe5HXvLziVjl1k2aJskozqy04KOyeHyc3OyIPTZd5On_KAzFIM; dvrctk=MnjKJVShVraEtbrBkkxWxLaZrXnIGNQlwB7QtZVPFeA=; __cflb=0H28vXMLFyydRmDMNgcPHijM6auXkCspCkuh58tVuJ3; __cf_bm=C6QbqiEvbbwdrYBpoJOkcWcedf60vcOfPfTPPbZzKbM-1773348202-1.0.1.1-cSHoYwi8ZjIHTdBItXQP_iXJdRJS6FYjFsGdl1eGHvS5pgfbcT4Lg19P6UStX.bZz1u0OXiS5ykdipPBtwP6OvZr68k4XSmjYpir05jNLhw; _dd_s=rum=0&expire=1773349846445; ppdtk=Uog72CR22mD85C7U4iZHlgOQeRmvHEYp0OdQc+0lEes1c5/LeqGT+ZUlXpSC6FpW; cartId=3820547
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
TE: trailers
response:
HTTP/2 200
date: Thu, 12 Mar 2026 20:55:47 GMT
content-type: application/json
server: cloudflare
cf-ray: 9db5b3a5d84aff28-IAD
cf-cache-status: DYNAMIC
content-encoding: gzip
set-cookie: datadome=MXMri0hss6PlQ0_oS7gG2iMdOKnNkbDmGvOxelgN~nCcupgkJQOqjcjcgdprIaI7hSlt_w8E9Ri_RAzPFrGqtUfqAJ_szB_aNZ2FdC26qmI3870Nn4~T0vtx8Gj3dEZR; Max-Age=31536000; Domain=.giantfood.com; Path=/; Secure; SameSite=Lax
strict-transport-security: max-age=31536000; includeSubDomains
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers, accept-encoding
accept-ch: Sec-CH-UA,Sec-CH-UA-Mobile,Sec-CH-UA-Platform,Sec-CH-UA-Arch,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-Device-Memory
x-datadome: protected
request-context: appId=cid-v1:75750625-0c81-4f08-9f5d-ce4f73198e54
X-Firefox-Spdy: h2
* history:
GET
https://giantfood.com/api/v6.0/user/369513017/order/history?filter=instore&loyaltyNumber=440155630880
headers:
request:
GET /api/v6.0/user/369513017/order/history?filter=instore&loyaltyNumber=440155630880 HTTP/2
Host: giantfood.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br, zstd
DNT: 1
Sec-GPC: 1
Connection: keep-alive
Referer: https://giantfood.com/account/history/invoice/in-store
Cookie: datadome=OH2XjtCoI6XjE3Qsz_b0F1YULKLatAC0Ea~VMeDGBP0N9Z~CeI3RqEbvkGmNW_VCOU~vRb6p0kqibvF2tLbWnzyAGIdO7jsC41KiYbp7USpJDnefZhIg0e1ypAugvDSw; cf_clearance=WEPyQokx9f0qoyS4Svsw4EkZ1TYOxjOwcUHspT3.rXw-1773348940-1.2.1.1-fPvERGxBlFUaBW83sUppbUWpwvFG7mZivag5vBvZb3kxUQv2WSVIV1tON0HV2n8bkVY0U8_BBl62a00Np.oJylYQcGME540gZlYEoL.gMs4WynLqApFe5BOXAEwOm01_6h6b62H90bl4ypRehVb_TXEi4qHaPLVSZhjZK_h.fv6RBqjgYch2j_8XnHe5HXvLziVjl1k2aJskozqy04KOyeHyc3OyIPTZd5On_KAzFIM; dvrctk=MnjKJVShVraEtbrBkkxWxLaZrXnIGNQlwB7QtZVPFeA=; __cflb=0H28vXMLFyydRmDMNgcPHijM6auXkCspCkuh58tVuJ3; __cf_bm=C6QbqiEvbbwdrYBpoJOkcWcedf60vcOfPfTPPbZzKbM-1773348202-1.0.1.1-cSHoYwi8ZjIHTdBItXQP_iXJdRJS6FYjFsGdl1eGHvS5pgfbcT4Lg19P6UStX.bZz1u0OXiS5ykdipPBtwP6OvZr68k4XSmjYpir05jNLhw; _dd_s=rum=0&expire=1773349842848; ppdtk=Uog72CR22mD85C7U4iZHlgOQeRmvHEYp0OdQc+0lEes1c5/LeqGT+ZUlXpSC6FpW; cartId=3820547
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
TE: trailers
response:
HTTP/2 200
date: Thu, 12 Mar 2026 20:55:43 GMT
content-type: application/json
server: cloudflare
cf-ray: 9db5b38f7eebff28-IAD
cf-cache-status: DYNAMIC
content-encoding: gzip
set-cookie: datadome=rDtvd3J2hO5AeghJMSFRRxGc6ifKCQYgMLcqPNr9rWiz2rdcXb032AY6GIZn8tUmYB96BKKbzh3_jSjEzYWLj8hDjl3oGYYAiu4jwdaxpf3vh2v4f7KH7kbqgsMWpkjt; Max-Age=31536000; Domain=.giantfood.com; Path=/; Secure; SameSite=Lax
strict-transport-security: max-age=31536000; includeSubDomains
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers, accept-encoding
accept-ch: Sec-CH-UA,Sec-CH-UA-Mobile,Sec-CH-UA-Platform,Sec-CH-UA-Arch,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-Device-Memory
x-datadome: protected
request-context: appId=cid-v1:75750625-0c81-4f08-9f5d-ce4f73198e54
X-Firefox-Spdy: h2

22
pm/task-sample.org Normal file

@@ -0,0 +1,22 @@
#+title: Task Log
#+updated: [2026-03-18 Wed 14:19]
Use the template below, which should be a top-level org-mode header.
* [ ] M.m.m: Task Title (estimate # commits)
replace the old observed/canonical workflow with a review-first pipeline that groups normalized rows only during review/combine and links them to catalog items
** Acceptance Criteria
1. Criterion
- expanded data
2. Criterion
- pm note: amplifying information
** evidence
- commit: abc123, bcd234
- tests:
- datetime: [2026-03-18 Wed 14:15]
** notes
- explanation of work done, decisions made, reasoning

View File

@@ -1,4 +1,6 @@
* [ ] t1.1: harden giant receipt fetch cli (2-4 commits) #+title: Scrape-Giant Task Log
* [X] t1.1: harden giant receipt fetch cli (2-4 commits)
** acceptance criteria ** acceptance criteria
- giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty` - giant scraper runs from cli with prompts or env-backed defaults for `user_id` and `loyalty`
- script reuses current browser session via firefox cookies + `curl_cffi` - script reuses current browser session via firefox cookies + `curl_cffi`
@@ -12,11 +14,11 @@
 - raw json archive remains source of truth
 ** evidence
-- commit:
-- tests:
-- date:
-* [ ] t1.2: define grocery data model and file layout (1-2 commits)
+- commit: `d57b9cf` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scraper.py --help`; verified `.env` loading via `scraper.load_config()`
+- date: 2026-03-14
+* [X] t1.2: define grocery data model and file layout (1-2 commits)
 ** acceptance criteria
 - decide and document the files/directories for:
 - retailer raw exports
@@ -28,15 +30,15 @@
 - explicitly separate retailer-specific parsing from cross-retailer canonicalization
 ** notes
-- this is the guardrail task so we dont make giant-specific hacks the system of record
+- this is the guardrail task so we don't make giant-specific hacks the system of record
 - keep schema minimal but extensible
 ** evidence
-- commit:
-- tests:
-- date:
-* [ ] t1.3: build giant parser/enricher from raw json (2-4 commits)
+- commit: `42dbae1` on branch `cx`
+- tests: reviewed `giant_output/raw/history.json`, one sample raw order json, `giant_output/orders.csv`, `giant_output/items.csv`; documented schemas in `pm/data-model.org`
+- date: 2026-03-15
+* [X] t1.3: build giant parser/enricher from raw json (2-4 commits)
 ** acceptance criteria
 - parser reads giant raw order json files
 - outputs `items_enriched.csv`
@@ -54,11 +56,11 @@
 - parser should preserve ambiguity rather than hallucinating precision
 ** evidence
-- commit:
-- tests:
-- date:
-* [ ] t1.4: generate observed-product layer from enriched items (2-3 commits)
+- commit: `14f2cc2` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_giant.py`; verified `giant_output/items_enriched.csv` on real raw data
+- date: 2026-03-16
+* [X] t1.4: generate observed-product layer from enriched items (2-3 commits)
 ** acceptance criteria
- distinct observed products are generated from enriched giant items - distinct observed products are generated from enriched giant items
@@ -76,11 +78,11 @@
 - likely key is some combo of retailer + upc + normalized name
 ** evidence
-- commit:
-- tests:
-- date:
-* [ ] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
+- commit: `dc39214` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_observed_products.py`; verified `giant_output/products_observed.csv`
+- date: 2026-03-16
+* [X] t1.5: build review queue for unresolved or low-confidence products (1-3 commits)
 ** acceptance criteria
- produce a review file containing observed products needing manual review - produce a review file containing observed products needing manual review
@@ -98,11 +100,11 @@
 - optimize for “approve once, remember forever”
 ** evidence
-- commit:
-- tests:
-- date:
-* [ ] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
+- commit: `9b13ec3` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_review_queue.py`; verified `giant_output/review_queue.csv`
+- date: 2026-03-16
+* [X] t1.6: create canonical product layer and observed→canonical links (2-4 commits)
 ** acceptance criteria
- define and create `products_canonical.csv` - define and create `products_canonical.csv`
@@ -120,11 +122,11 @@
 - do not require llm assistance for v1
 ** evidence
-- commit:
-- tests:
-- date:
-* [ ] t1.7: implement auto-link rules for easy matches (2-3 commits)
+- commit: `347cd44` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified seeded `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
+- date: 2026-03-16
+* [X] t1.7: implement auto-link rules for easy matches (2-3 commits)
 ** acceptance criteria
- auto-link can match observed products to canonical products using deterministic rules - auto-link can match observed products to canonical products using deterministic rules
@@ -139,53 +141,536 @@
 - false positives are worse than unresolved items
 ** evidence
-- commit:
-- tests:
-- date:
-* [ ] t1.8: support costco raw ingest path (2-5 commits)
+- commit: `385a31c` on branch `cx`
+- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_canonical_layer.py`; verified auto-linked `giant_output/products_canonical.csv` and `giant_output/product_links.csv`
+- date: 2026-03-16
+* [X] t1.8: support costco raw ingest path (2-5 commits)
 ** acceptance criteria
 - add a costco-specific raw ingest/export path
-- output costco line items into the same shared raw/enriched schema family
+- fetch costco receipt summary and receipt detail payloads from graphql endpoint
- persist raw json under `costco_output/raw/` and flatten it to `costco_output/orders.csv` and `costco_output/items.csv`, same format as giant
- use costco-native identifiers such as `transactionBarcode` as order id and `itemNumber` as retailer item id
- preserve discount/coupon rows rather than dropping them
** notes
- focus on raw costco acquisition and flattening
- do not force costco identifiers into `upc`
- bearer/auth values should come from local env, not source
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified `costco_output/raw/*.json`, `costco_output/orders.csv`, and `costco_output/items.csv` from the local sample payload
- date: 2026-03-16
* [X] t1.8.1: support costco parser/enricher path (2-4 commits)
** acceptance criteria
- add a costco-specific enrich step producing `costco_output/items_enriched.csv`
- output rows into the same shared enriched schema family as Giant
- support costco-specific parsing for:
- `itemDescription01` + `itemDescription02`
- `itemNumber` as `retailer_item_id`
- discount lines / negative rows
- common size patterns such as `25#`, `48 OZ`, `2/24 OZ`, `6-PACK`
- preserve obvious unknowns as blank rather than guessed values
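A minimal sketch of the size parsing named in the criteria above, covering only the four example patterns. The function name `parse_size`, the output keys, and the reading of `#` as pounds are assumptions for illustration; this is not the code in `enrich_costco.py`.

```python
import re
from typing import Optional, TypedDict


class ParsedSize(TypedDict):
    size_value: Optional[float]
    size_unit: Optional[str]
    pack_qty: Optional[int]


# Hypothetical patterns; each covers one example from the task criteria.
_MULTI = re.compile(r"\b(\d+)\s*/\s*(\d+(?:\.\d+)?)\s*(OZ|LB)\b")  # 2/24 OZ
_POUND = re.compile(r"\b(\d+(?:\.\d+)?)\s*#")                      # 25# (assumed to mean lb)
_SIZE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|LB)\b")               # 48 OZ
_PACK = re.compile(r"\b(\d+)\s*-\s*PACK\b")                        # 6-PACK


def parse_size(description: str) -> ParsedSize:
    """Parse size/pack tokens; anything not matched stays None (blank)."""
    text = description.upper()
    out: ParsedSize = {"size_value": None, "size_unit": None, "pack_qty": None}
    if m := _MULTI.search(text):
        out["pack_qty"] = int(m.group(1))
        out["size_value"] = float(m.group(2))
        out["size_unit"] = m.group(3).lower()
    elif m := _POUND.search(text):
        out["size_value"] = float(m.group(1))
        out["size_unit"] = "lb"
    elif m := _SIZE.search(text):
        out["size_value"] = float(m.group(1))
        out["size_unit"] = m.group(2).lower()
    if out["pack_qty"] is None and (m := _PACK.search(text)):
        out["pack_qty"] = int(m.group(1))
    return out
```

Descriptions with no recognizable token return all-`None` fields, which matches the "blank rather than guessed" rule.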
** notes
- this is the real schema compatibility proof, not raw ingest alone
- expect weaker identifiers than Giant
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified `costco_output/items_enriched.csv`
- date: 2026-03-16
* [X] t1.8.2: validate cross-retailer observed/canonical flow (1-3 commits)
** acceptance criteria
- feed Giant and Costco enriched rows through the same observed/canonical pipeline
 - confirm at least one product class can exist as:
-- giant observed product
-- costco observed product
+- Giant observed product
+- Costco observed product
 - one shared canonical product
- document the exact example used for proof
 ** notes
-- this is the proof that the architecture generalizes
-- dont chase perfection before the second retailer lands
+- keep this to one or two well-behaved product classes first
+- apples, eggs, bananas, or flour are better than weird prepared foods
** evidence
- commit: `da00288` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python validate_cross_retailer_flow.py`; proof example: Giant `FRESH BANANA` and Costco `BANANAS 3 LB / 1.36 KG` share one canonical in `combined_output/proof_examples.csv`
- date: 2026-03-16
* [X] t1.8.3: extend shared schema for retailer-native ids and adjustment lines (1-2 commits)
** acceptance criteria
- add shared fields needed for non-upc retailers, including:
- `retailer_item_id`
- `is_discount_line`
- `is_coupon_line` or equivalent if needed
- keep `upc` nullable across the pipeline
- update downstream builders/tests to accept retailers with blank `upc`
** notes
- this prevents costco from becoming a schema hack
- do this once instead of sprinkling exceptions everywhere
** evidence
- commit: `9497565` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; verified shared enriched fields in `giant_output/items_enriched.csv` and `costco_output/items_enriched.csv`
- date: 2026-03-16
* [X] t1.8.4: verify and correct costco receipt enumeration (1-2 commits)
** acceptance criteria
- confirm graphql summary query returns all expected receipts
- compare `inWarehouse` count vs number of `receipts` returned
- widen or parameterize date window if necessary; website shows receipts in 3-month windows
- persist request metadata (`startDate`, `endDate`, `documentType`, `documentSubType`)
- emit warning when receipt counts mismatch
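The date-window and mismatch checks above could be sketched as follows. `three_month_windows` and `receipt_count_warning` are hypothetical names, and the sketch assumes the summary response exposes an `inWarehouse` count alongside a `receipts` list, as the criteria describe.

```python
import calendar
from datetime import date, timedelta
from typing import Iterator, Optional, Sequence, Tuple


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month length."""
    idx = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(idx, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def three_month_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield consecutive, non-overlapping windows of at most three months."""
    cursor = start
    while cursor <= end:
        window_end = min(add_months(cursor, 3) - timedelta(days=1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


def receipt_count_warning(in_warehouse: int, receipts: Sequence) -> Optional[str]:
    """Return a warning string when the reported count differs from the rows returned."""
    if in_warehouse == len(receipts):
        return None
    return f"receipt count mismatch: inWarehouse={in_warehouse}, returned={len(receipts)}"
```

Each window's `startDate`/`endDate` pair can then be persisted with the request metadata for that query.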
** notes
- goal is to confirm we are enumerating all receipts before parsing
- do not expand schema or parser logic in this task
- keep changes limited to summary query handling and diagnostics
** evidence
- commit: `ac82fa6` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; reviewed the sample Costco summary request in `pm/scrape-giant.org` against `costco_output/raw/summary.json` and added 3-month window chunking plus mismatch diagnostics
- date: 2026-03-16
* [X] t1.8.5: refactor costco scraper auth and UX to align with giant scraper
** acceptance criteria
- remove manual auth env vars
- load costco cookies from firefox session
- require only logged-in browser
- replace start/end date flags with --months-back
- maintain same raw output structure
- ensure summary_lookup keys are collision-safe by using a composite key (transactionBarcode + transactionDateTime) instead of transactionBarcode alone
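A sketch of the collision-safe lookup described in the last criterion, assuming each summary receipt is a mapping that carries `transactionBarcode` and `transactionDateTime`. The helper names are illustrative, not the ones in `scrape_costco.py`.

```python
from typing import Dict, Iterable, Mapping, Tuple

ReceiptKey = Tuple[str, str]


def receipt_key(receipt: Mapping[str, object]) -> ReceiptKey:
    """Composite key: barcode alone is assumed not to be unique across receipts."""
    return (str(receipt["transactionBarcode"]), str(receipt["transactionDateTime"]))


def build_summary_lookup(
    receipts: Iterable[Mapping[str, object]],
) -> Dict[ReceiptKey, Mapping[str, object]]:
    lookup: Dict[ReceiptKey, Mapping[str, object]] = {}
    for receipt in receipts:
        key = receipt_key(receipt)
        if key in lookup:
            # Fail loudly instead of silently overwriting an earlier receipt.
            raise ValueError(f"duplicate receipt key: {key}")
        lookup[key] = receipt
    return lookup
```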
** notes
- align Costco acquisition ergonomics with the Giant scraper
- keep downstream Costco parsing and shared schemas unchanged
** evidence
- commit: `c0054dc` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified Costco summary/detail flattening now uses composite receipt keys in unit tests
- date: 2026-03-16
* [X] t1.8.6: add browser session helper (2-4 commits)
** acceptance criteria
- create a separate Python module/script that extracts firefox browser session data needed for giant and costco scrapers.
- support Firefox and Costco first, including:
- loading cookies via existing browser-cookie approach
- reading browser storage needed for dynamic auth headers (e.g. Costco bearer token)
- copying locked browser sqlite/db files to a temp location before reading when necessary
- expose a small interface usable by scrapers, e.g. cookie jar + storage/header values
- keep retailer-specific parsing of extracted session data outside the low-level browser access layer
- structure the helper so Chromium-family browser support can be added later without changing scraper call sites
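The "copy locked db files to a temp location before reading" step could look roughly like this. `read_sqlite_copy` is a hypothetical helper name; the sketch copies only the main database file, so rows that still live in a separate write-ahead log would not be seen.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path
from typing import List, Sequence, Tuple


def read_sqlite_copy(db_path: Path, query: str, params: Sequence = ()) -> List[Tuple]:
    """Copy a possibly locked sqlite file to a temp dir and query the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / db_path.name
        shutil.copy2(db_path, copy_path)
        conn = sqlite3.connect(copy_path)
        try:
            return conn.execute(query, params).fetchall()
        finally:
            # Close before the temp dir is removed so cleanup succeeds everywhere.
            conn.close()
```

Keeping this helper free of retailer knowledge fits the criterion that retailer-specific parsing stays outside the low-level browser access layer.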
** notes
- goal is to replace manual `.env` copying of volatile browser-derived auth data
- session bootstrap only, not full browser automation
- prefer one shared helper over retailer-specific ad hoc storage reads
- Firefox only; Chromium support later
** evidence
- commit: `7789c2e` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`; verified Firefox storage token extraction and locked-db copy behavior in unit tests
- date: 2026-03-16
* [X] t1.8.7: simplify costco session bootstrap and remove over-abstraction (2-4 commits)
** acceptance criteria
- make `scrape_costco.py` readable end-to-end without tracing through multiple partial bootstrap layers
- keep `browser_session.py` limited to low-level browser data access only:
- firefox profile discovery
- cookie loading
- storage reads
- sqlite copy/read helpers
- remove or sharply reduce `retailer_sessions.py` so retailer-specific header extraction lives with the retailer scraper or in a very small retailer-specific helper
- make session bootstrap flow explicit and linear:
- load browser context
- extract costco auth values
- build request headers
- build requests session
- eliminate inconsistent/obsolete function signatures and dead call paths (e.g. mixed `build_session(...)` calling conventions, stale fallback branches, mismatched `build_headers(...)` args)
- add one focused bootstrap debug print showing whether cookies, authorization, client id, and client identifier were found
- preserve current working behavior where available; this is a refactor/clarification task, not a feature expansion task
** notes
- goal is to restore concern separation and debuggability
- prefer obvious retailer-specific code over “generic” helpers that guess and obscure control flow
- browser access can stay shared; retailer auth mapping should be explicit
- no new heuristics in this task
** evidence
- commit: `d7a0329` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python scrape_costco.py --help`; verified explicit Costco session bootstrap flow in `scrape_costco.py` and low-level-only browser access in `browser_session.py`
- date: 2026-03-16
* [X] t1.9: build pivot-ready normalized purchase log and comparison metrics (2-4 commits)
** acceptance criteria
- produce a flat `purchases.csv` suitable for excel pivot tables and pivot charts
- each purchase row preserves:
- purchase date
- retailer
- order id
- raw item name
- normalized item name
- canonical item id when resolved
- quantity / unit
- line total
- store/location info where available
- derive normalized comparison fields where possible on enriched or observed product rows:
- `price_per_lb`
- `price_per_oz`
- `price_per_each`
- `price_per_count`
- preserve the source basis used to derive each metric, e.g.:
- parsed size/unit
- receipt weight
- explicit count/pack
- emit nulls when basis is unknown, conflicting, or ambiguous
- support pivot-friendly analysis of purchase frequency and item cost over time
- document at least one Giant vs Costco comparison example using the normalized metrics
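As an illustration of the weight-based metrics and the "emit nulls when basis is unknown" rule, here is a small sketch. The function name `unit_prices`, the `price_basis` key, and the `parsed_size` label are assumptions; only `price_per_lb` and `price_per_oz` come from the criteria.

```python
from typing import Dict, Optional

OZ_PER_LB = 16.0


def unit_prices(
    line_total: Optional[float],
    size_value: Optional[float],
    size_unit: Optional[str],
    pack_qty: Optional[int] = None,
) -> Dict[str, Optional[object]]:
    """Derive weight-based prices, or leave everything None when the basis is unknown."""
    result: Dict[str, Optional[object]] = {
        "price_per_lb": None,
        "price_per_oz": None,
        "price_basis": None,
    }
    if line_total is None or not size_value or size_unit not in ("lb", "oz"):
        return result
    total = size_value * (pack_qty or 1)
    ounces = total * OZ_PER_LB if size_unit == "lb" else total
    result["price_per_oz"] = line_total / ounces
    result["price_per_lb"] = line_total / (ounces / OZ_PER_LB)
    result["price_basis"] = "parsed_size"
    return result
```

For example, an 8.00 line for a 5 lb bag yields 1.60 per lb, while a row with no parsed size yields no metric at all rather than a guess.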
** notes
- compute metrics as close to the raw observation as possible
- canonical layer can aggregate later, but should not invent missing unit economics
- unit discipline matters more than coverage
- raw item name must be retained for audit/debugging
** evidence
- commit: `be1bf63` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; verified `combined_output/purchases.csv` and `combined_output/comparison_examples.csv` on the current Giant + Costco dataset
- date: 2026-03-16
* [X] t1.11: define review and item-resolution workflow for unresolved products (2-3 commits)
** acceptance criteria
- define the persistent files used to resolve unknown items, including:
- review queue
- canonical item catalog
- alias / mapping layer if separate
- specify how unresolved items move from `review_queue.csv` into the final normalized purchase log
- define the manual resolution workflow, including:
- what the human edits
- what script is rerun afterward
- how resolved mappings are persisted for future runs
- ensure resolved items are positively identified into stable canonical item ids rather than one-off text substitutions
- document how raw item name, normalized item name, and canonical item id are all retained
** notes
- goal is “approve once, reuse forever”
- keep the workflow simple and auditable
- manual review is fine; the important part is making it durable and rerunnable
** evidence
- commit: `c7dad54` on branch `cx`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified `combined_output/review_queue.csv`, `combined_output/review_resolutions.csv` workflow, and `combined_output/canonical_catalog.csv`
- date: 2026-03-16
* [X] t1.12: simplify review process display
Clearly show current state separate from proposed future state.
** acceptance criteria
1. Display position in review queue, e.g., (1/22)
2. Display compact header with observed_product under review, queue position, and canonical decision, e.g.: "Resolve [n] observed product group [name] and associated items to canonical_name [name]? (\n [n] matched items)"
3. color-code outputs based on info, input/prompt, warning/error
1. color action menu/requests for input differently from display text; do not color individual options separately
2. "no canonical_name suggestions found" is informational, not a warning/error.
4. update action menu `[x]exclude` to `e[x]clude`
5. on each review item, display a list of all matched items to be linked, sorted by descending date:
1. YYYY-mm-dd, price, raw item name, normalized item name, upc, retailer
2. image URL, if exists
3. Sample:
6. on each review item, suggest (but do not auto-apply) up to 3 likely existing canonicals using deterministic rules, e.g.:
1. exact normalized name match
2. prefix/contains match on canonical name
3. exact UPC
7. Sample Entry:
#+begin_comment
Review 7/22: Resolve observed_product MIXED PEPPER to canonical_name [__]?
2 matched items:
[1] 2026-03-12 | 7.49 | MIXED PEPPER 6-PACK | MIXED PEPPER | [upc] | costco | [img_url]
[2] [YYYY-mm-dd] | [price] | [raw_name] | [observed_name] | [upc] | [retailer] | [img_url]
2 canonical suggestions found:
[1] BELL PEPPERS, PRODUCE
[2] PEPPER, SPICES
#+end_comment
8. When link is selected, users should be able to select the number of the item in the list, e.g.:
#+begin_comment
Select the canonical_name to associate [n] items with:
[1] GRB GRADU PCH PUF1. | gcan_01b0d623aa02
[2] BTB CHICKEN | gcan_0201f0feb749
[3] LIME | gcan_02074d9e7359
#+end_comment
9. Add confirmation to link selection with instructions, "[n] [observed_name] and future observed_name matches will be associated with [canonical_name], is this ok?"
actions: [Y]es [n]o [b]ack [s]kip [q]uit
- reinforce project terminology such as raw_name, observed_name, canonical_name
** evidence
- commit: `7b8141c`, `d39497c`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_review_workflow tests.test_purchases`; `./venv/bin/python review_products.py --help`; verified compact review header, numbered matched-item display, informational no-suggestion state, numbered canonical selection, and confirmation flow
- date: 2026-03-17
** notes
- The key improvement was shifting the prompt from system metadata to reviewer intent: one observed_product, its matched retailer rows, and one canonical_name decision.
- Numbered canonical selection plus confirmation worked better than free-text id entry and should reduce accidental links.
- Deterministic suggestions remain intentionally conservative; they speed up common cases, but unresolved items still depend on human review by design.
* [X] t1.13.1 pipeline accountability and stage visibility (1-2 commits)
add simple accounting so we can see what survives or drops at each pipeline stage
** AC
1. emit counts for raw, enriched, combined/observed, review-queued, canonical-linked, and final purchase-log rows
2. report unresolved and dropped item counts explicitly
3. make it easy to verify that missing items were intentionally left in review rather than silently lost
- pm note: simple text/json/csv summary is sufficient; trust and visibility matter more than presentation
** evidence
- commit: `967e19e`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python report_pipeline_status.py --help`; `./venv/bin/python report_pipeline_status.py`; verified `combined_output/pipeline_status.csv` and `combined_output/pipeline_status.json`
- date: 2026-03-17
** notes
- Added a single explicit status script instead of threading counters through every pipeline step; this keeps the pipeline simple while still making row survival visible.
- The most useful check here is `unresolved_not_in_review_rows`; when it is non-zero, we know we have a real accounting bug rather than normal unresolved work.
* [X] t1.13.2 costco discount matching and net pricing in enrich_costco (2-3 commits)
refactor costco enrichment so discount lines are matched to purchased items and net pricing is preserved
** AC
1. detect costco discount/coupon rows like `/<retailer_item_id>` and match them to purchased items within the same order
2. preserve raw discount rows for auditability while also carrying matched discount values onto the purchased item row
3. add explicit fields for discount-adjusted pricing, e.g. `matched_discount_amount` and `net_line_total` (or equivalent)
4. preserve original raw receipt amounts (`line_total`) without overwriting them
- pm note: keep this retailer-specific and explicit; do not introduce generic discount heuristics
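A sketch of the matching rule in the criteria above. It assumes rows are dicts with `order_id`, `retailer_item_id`, `item_name`, and `line_total`, that the `/<retailer_item_id>` marker appears at the start of the discount row's `item_name`, and that discount amounts are negative. The function name and the row shape are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed shape of a Costco discount row name: "/1234567"
DISCOUNT_NAME = re.compile(r"^/\s*(\d+)\b")


def apply_costco_discounts(rows: List[dict]) -> List[dict]:
    """Carry matched discounts onto purchased rows; raw rows and line_total stay intact."""
    out = [dict(row) for row in rows]
    purchased: Dict[Tuple[str, str], dict] = {}
    for row in out:
        row["is_discount_line"] = bool(DISCOUNT_NAME.match(row["item_name"]))
        row["matched_discount_amount"] = None
        row["net_line_total"] = None
        if not row["is_discount_line"]:
            purchased.setdefault((row["order_id"], row["retailer_item_id"]), row)
    for row in out:
        match = DISCOUNT_NAME.match(row["item_name"])
        if not match:
            continue
        # Only match within the same order, on the exact retailer item id.
        target = purchased.get((row["order_id"], match.group(1)))
        if target is None:
            continue
        prior = target["matched_discount_amount"] or 0.0
        target["matched_discount_amount"] = prior + row["line_total"]
        target["net_line_total"] = target["line_total"] + target["matched_discount_amount"]
    return out
```

Unmatched discount rows are left in place with no target, so they remain visible for audit instead of being dropped.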
** evidence
- commit: `56a03bc`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python enrich_costco.py`; verified matched Costco discount rows now populate `matched_discount_amount` and `net_line_total` while preserving raw `line_total`
- date: 2026-03-17
** notes
- Kept this retailer-specific and literal: only discount rows with `/<retailer_item_id>` are matched, and only within the same order.
- Raw discount rows are still preserved for auditability; the purchased row now carries the matched adjustment separately rather than overwriting the original amount.
* [X] t1.13.3 canonical cleanup and review-first product identity (3-4 commits)
refactor canonical generation so product identity is cleaner, duplicate canonicals are reduced, and unresolved items stay in review instead of spawning junk canonicals
** AC
1. stop auto-creating new canonical products from weak normalized names alone; unresolved items remain in `review_queue.csv`
2. canonical names are based on stable product identity rather than noisy observed titles
3. packaging/count/size tokens are removed from canonical names when they belong in structured fields (`pack_qty`, `size_value`, `size_unit`)
4. consolidate obvious duplicate canonicals (e.g. egg/lime cases) and ensure final outputs retain raw item name, normalized item name, and canonical item id
- pm note: prefer conservative canonical creation and a better manual review loop over aggressive auto-unification
** evidence
- commit: `08e2a86`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python build_purchases.py`; `./venv/bin/python review_products.py --refresh-only`; verified weaker exact-name cases now remain unresolved in `combined_output/review_queue.csv` and canonical names are cleaned before auto-catalog creation
- date: 2026-03-17
** notes
- Removed weak exact-name auto-canonical creation so ambiguous products stay in review instead of generating junk canonicals.
- Canonical display names are now cleaned of obvious punctuation and packaging noise, but I kept the cleanup conservative rather than adding a broad fuzzy merge layer.
* [X] t1.14: refactor retailer collection into the new data model (2-4 commits)
move Giant and Costco collection into the new collect structure and make both retailers emit the same collected schemas
** Acceptance Criteria
1. create retailer-specific collect scripts in the target naming pattern, e.g.:
- collect_giant_web.py
- collect_costco_web.py
2. collected outputs conform to pm/data-model.org:
- data/<retailer-method>/raw/...
- data/<retailer-method>/collected_orders.csv
- data/<retailer-method>/collected_items.csv
3. current Giant and Costco raw acquisition behavior is preserved during the move
4. collected schemas preserve retailer truth and provenance:
- no interpretation beyond basic flattening
- raw_order_path/raw_history_path remain usable
- unknown values remain blank rather than guessed
5. old paths should be removed or deprecated
6. collect_* scripts do not depend on any normalize/review files or scripts
- pm note: this is a path/schema refactor, not a parsing rewrite
** evidence
- commit: `48c6eaf`
- tests: `./venv/bin/python -m unittest tests.test_scraper tests.test_costco_pipeline tests.test_browser_session`; `./venv/bin/python collect_giant_web.py --help`; `./venv/bin/python collect_costco_web.py --help`; `./venv/bin/python scrape_giant.py --help`; `./venv/bin/python scrape_costco.py --help`
- datetime: 2026-03-18
** notes
- Kept this as a path/schema move, not a parsing rewrite: the existing Giant and Costco collection behavior remains in place behind new `collect_*` entry points.
- Added lightweight deprecation nudges on the legacy `scrape_*` commands rather than removing them immediately, so the move is inspectable and low-risk.
- The main schema fix was on Giant collection, which was missing retailer/provenance/audit fields that Costco collection already carried.
* [X] t1.14.1: refactor retailer normalization into the new normalized_items schema (3-5 commits)
make Giant and Costco emit the shared normalized line-item schema without introducing cross-retailer identity logic
** Acceptance Criteria
1. create retailer-specific normalize scripts in the target naming pattern, e.g.:
- normalize_giant_web.py
- normalize_costco_web.py
2. normalized outputs conform to pm/data-model.org:
- data/<retailer-method>/normalized_items.csv
- one row per collected line item
- normalized_row_id is stable and present
- normalized_item_id is stable, present, and represents retailer-level identity reused across repeated purchase rows when deterministic retailer evidence is sufficient
- normalized_quantity and normalized_quantity_unit
- repeated rows for the same retailer product resolve to the same normalized_item_id only when supported by deterministic retailer evidence, e.g. exact upc, exact retailer_item_id, exact cleaned name + same size/pack
- normalization_basis is explicit
3. Giant normalization preserves current useful parsing:
- normalized item name
- size/unit/pack parsing
- fee/store-brand flags
- derived price fields
4. Costco normalization preserves current useful parsing:
- normalized item name
- size/unit/pack parsing
- explicit discount matching using retailer-specific logic
- matched_discount_amount and net_line_total
5. both normalizers preserve raw retailer truth:
- line_total is never overwritten
- unknown values remain blank rather than guessed
6. no cross-retailer identity assignment occurs in normalization
7. normalize never uses fuzzy or semantic matching to assign normalized_item_id
- pm note: prefer explicit retailer-specific code paths over generic normalization helpers unless the duplication is truly mechanical
- pm note: normalization may resolve retailer-level identity, but not catalog identity
- pm note: normalized_item_id is the only retailer-level grouping identity; do not introduce observed_products or a second grouping artifact
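One way to read the `normalized_item_id` rules above is as an ordered list of deterministic evidence with a row-level fallback. The sketch below is an assumption-laden illustration: the `nitem_` prefix, the basis labels, and the hashing scheme are invented here and are not taken from `pm/data-model.org`.

```python
import hashlib
from typing import Tuple


def normalized_item_id(
    retailer: str,
    normalized_row_id: str,
    upc: str = "",
    retailer_item_id: str = "",
    clean_name: str = "",
    size_value: str = "",
    size_unit: str = "",
    pack_qty: str = "",
) -> Tuple[str, str]:
    """Return (normalized_item_id, normalization_basis) from deterministic evidence only."""
    if upc:
        basis, key = "exact_upc", upc
    elif retailer_item_id:
        basis, key = "exact_retailer_item_id", retailer_item_id
    elif clean_name and size_value and size_unit:
        basis = "exact_name_size_pack"
        key = "|".join([clean_name, size_value, size_unit, pack_qty])
    else:
        # Weak evidence: fall back to row-level identity rather than guessing.
        basis, key = "row_fallback", normalized_row_id
    digest = hashlib.sha1(f"{retailer}|{basis}|{key}".encode("utf-8")).hexdigest()[:12]
    return f"nitem_{digest}", basis
```

Because the retailer is part of the hashed key, two retailers can never share an id, which keeps cross-retailer identity out of normalization.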
** evidence
- commit: `9064de5`
- tests: `./venv/bin/python -m unittest discover -s tests`; `./venv/bin/python -m unittest tests.test_enrich_giant tests.test_costco_pipeline tests.test_purchases`; `./venv/bin/python normalize_giant_web.py --help`; `./venv/bin/python normalize_costco_web.py --help`; `./venv/bin/python enrich_giant.py --help`; `./venv/bin/python enrich_costco.py --help`
- datetime: 2026-03-18
** notes
- Kept the existing Giant and Costco parsing logic intact and added the new normalized schema fields in place, rather than rewriting the enrichers from scratch.
- `normalized_item_id` is always present, but it only collapses repeated rows when the evidence is strong; otherwise it falls back to row-level identity via `normalized_row_id`.
- Added `normalize_*` entry points for the new data-model layout while leaving the legacy `enrich_*` commands available during the transition.
* [ ] t1.14.2: finalize filesystem and schema alignment for the refactor (2-4 commits)
bring on-disk outputs fully into the target `data/` structure without changing retailer behavior
** Acceptance Criteria
1. retailer data directories conform to pm/data-model.org:
- `data/giant-web/raw/...`
- `data/giant-web/collected_orders.csv`
- `data/giant-web/collected_items.csv`
- `data/giant-web/normalized_items.csv`
- `data/costco-web/raw/...`
- `data/costco-web/collected_orders.csv`
- `data/costco-web/collected_items.csv`
- `data/costco-web/normalized_items.csv`
2. review/combine outputs are moved or rewritten into the target review paths:
- `data/review/review_queue.csv`
- `data/review/product_links.csv`
- `data/review/review_resolutions.csv`
- `data/review/purchases.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
3. old transitional output paths are either:
- removed from active script defaults, or
- left as explicit compatibility shims with clear deprecation notes
4. no recollection is required if existing raw files and collected csvs can be moved/copied losslessly into the new structure
5. no schema information is lost during the move:
- raw paths still resolve
- collected/normalized csvs still open with the expected headers
6. README and task/docs reflect the final active paths
- pm note: prefer moving/adapting existing files over recollecting from retailers unless a real data loss or schema mismatch forces recollection
- pm note: this is a structure-alignment task, not a retailer parsing task
 ** evidence
 - commit:
 - tests:
-- date:
+- datetime:
-* [ ] t1.9: compute normalized comparison metrics (2-3 commits)
-** acceptance criteria
-- derive normalized comparison fields where possible:
-- price per lb
-- price per oz
-- price per each
-- price per count
-- metrics are attached at canonical or linked-observed level as appropriate
-- emit obvious nulls when basis is unknown rather than inventing values
 ** notes
-- this is where “gala apples 5 lb bag vs other gala apples” becomes possible
-- units discipline matters a lot here
+* [ ] t1.14.3: retailer-specific Costco normalization cleanup (2-4 commits)
tighten Costco-specific normalization so normalized item names are cleaner and deterministic retailer grouping is less noisy
** Acceptance Criteria
1. improve Costco item-name cleanup for obvious non-identity noise, such as:
- trailing slash fragments
- code tokens and receipt-format artifacts
- duplicated measurement fragments already captured in structured fields
2. preserve deterministic normalization rules only:
- exact retailer_item_id
- exact cleaned name + same size/pack when needed
- approved retailer alias
- no fuzzy or semantic matching
3. normalized Costco names improve on known bad examples, e.g.:
- `MANDARIN /` -> cleaner normalized item name
- `LIFE 6'TABLE ... /` -> cleaner normalized item name
4. cleanup does not overwrite retailer truth:
- raw `item_name` is unchanged
- parsed `size_value`, `size_unit`, `pack_qty`, and pricing fields remain intact
5. discount-row behavior remains correct:
- matched discount rows still populate `matched_discount_amount`
- `net_line_total` remains correct
- discount rows remain auditable
6. add regression tests for the cleaned Costco examples and any new parsing rules
- pm note: keep this explicitly Costco-specific; do not introduce a generic cleanup framework
- pm note: prefer a short allowlist/blocklist of known receipt artifacts over broad heuristics
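A minimal sketch of the first cleanup case (trailing slash fragments), assuming the cleaned value is written to a separate normalized-name field while the raw `item_name` is left untouched. `clean_costco_name` is a hypothetical name, and the sketch deliberately handles only this one artifact.

```python
import re

_TRAILING_SLASH = re.compile(r"\s*/\s*$")
_WHITESPACE = re.compile(r"\s+")


def clean_costco_name(item_name: str) -> str:
    """Strip a trailing slash fragment and collapse runs of whitespace."""
    cleaned = _TRAILING_SLASH.sub("", item_name)
    return _WHITESPACE.sub(" ", cleaned).strip()
```

A slash inside a size token such as `2/24 OZ` is not at the end of the string, so it is left for the size parser.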
 ** evidence
 - commit:
 - tests:
-- date:
+- datetime:
-* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved products (2-4 commits)
+** notes
* [ ] t1.15: refactor review/combine pipeline around normalized_item_id and catalog links (4-8 commits)
replace the old observed/canonical workflow with a review-first pipeline that uses normalized_item_id as the retailer-level review unit and links it to catalog items
** Acceptance Criteria
1. refactor review outputs to conform to pm/data-model.org:
- data/review/review_queue.csv
- data/review/product_links.csv
- data/catalog.csv
- data/purchases.csv
2. review logic uses normalized_item_id as the upstream retailer-level review identity:
- no dependency on observed_product_id
- no dependency on products_observed.csv
- one review/link decision applies to all purchase rows sharing the same normalized_item_id
3. product_links.csv stores review-approved links from normalized_item_id to catalog_id
- one row per approved retailer-level identity to catalog mapping
4. catalog.csv entries are review-first and conservative:
- no auto-creation from weak normalized names alone
- names come from reviewed catalog naming, not raw retailer strings
- packaging/count is not embedded in catalog_name unless essential to identity
- catalog_name/product_type/category/brand/variant may be blank until reviewed; blank is preferred to guessed
5. purchases.csv remains pivot-ready and retains:
- raw item name
- normalized item name
- normalized_row_id (not for review)
- normalized_item_id
- catalog_id
- catalog fields
- raw line_total
- matched_discount_amount and net_line_total when present
- derived price fields and their bases
6. terminal review flow remains simple and usable:
- reviewer sees one grouped retailer item identity (normalized_item_id) with count and list of matches, not one prompt per purchase row; use existing pattern as a template
- link to existing catalog item
- create new catalog item
- exclude
- skip
7. pipeline accounting remains valid after the refactor:
- unresolved items are visible
- missing items are not silently dropped
8. pm note: prefer a better manual review loop over aggressive automatic grouping. initial manual data entry is expected, and should resolve over time
9. pm note: keep review/combine auditable; each catalog link should be explainable from normalized rows and review state
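Criteria 2, 3, and 7 can be sketched as a single pass over purchase rows. This is a minimal illustration, not the planned implementation: `apply_product_links` is a hypothetical helper, and the assumption that product_links.csv rows carry `normalized_item_id` and `catalog_id` columns comes from the criteria above, not from a confirmed schema.

```python
from collections import defaultdict


def apply_product_links(purchase_rows, link_rows):
    """Attach catalog_id by normalized_item_id; return (rows, review_groups)."""
    # One approved link applies to every purchase row sharing the identity.
    links = {
        row["normalized_item_id"]: row["catalog_id"]
        for row in link_rows
        if row.get("normalized_item_id") and row.get("catalog_id")
    }
    linked_rows = []
    unresolved = defaultdict(list)
    for row in purchase_rows:
        item_id = row.get("normalized_item_id", "")
        updated = dict(row)
        updated["catalog_id"] = links.get(item_id, "")
        linked_rows.append(updated)  # rows are never dropped
        if item_id and not updated["catalog_id"]:
            unresolved[item_id].append(updated)
    # One review entry per retailer-level identity, not per purchase row.
    review_groups = [
        {"normalized_item_id": item_id, "seen_count": len(rows)}
        for item_id, rows in sorted(unresolved.items())
    ]
    return linked_rows, review_groups
```

Every input row appears in the output whether or not it is linked, so unresolved items stay visible to pipeline accounting.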
** evidence
- commit:
- tests:
- datetime:
** notes
* [ ] t1.10: add optional llm-assisted suggestion workflow for unresolved normalized retailer items (2-4 commits)
** acceptance criteria
- llm suggestions are generated only for unresolved normalized retailer items
- llm outputs are stored as suggestions, not auto-applied truth
- reviewer can approve/edit/reject suggestions
- approved decisions are persisted into canonical/link files
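One way to keep suggestions separate from approved truth is to store them with an explicit status and only emit a link once a reviewer decides. The sketch below is an assumption-heavy illustration: `record_suggestion`, `review_suggestion`, and the field names `suggested_catalog_name` and `status` are hypothetical, and no LLM call is shown.

```python
def record_suggestion(suggestions, normalized_item_id, suggested_catalog_name):
    """Store a suggestion as pending; nothing is linked at this point."""
    suggestions[normalized_item_id] = {
        "normalized_item_id": normalized_item_id,
        "suggested_catalog_name": suggested_catalog_name,
        "status": "pending",
    }


def review_suggestion(suggestions, normalized_item_id, decision, edited_name=""):
    """Apply a reviewer decision: 'approve', 'edit', or 'reject'.

    Returns a link decision dict to persist, or None when rejected.
    """
    suggestion = suggestions[normalized_item_id]
    if decision == "reject":
        suggestion["status"] = "rejected"
        return None
    if decision == "edit":
        suggestion["suggested_catalog_name"] = edited_name
    elif decision != "approve":
        raise ValueError(f"unknown decision: {decision}")
    suggestion["status"] = "approved"
    return {
        "normalized_item_id": normalized_item_id,
        "catalog_name": suggestion["suggested_catalog_name"],
    }
```

Only the return value of `review_suggestion` would be written to the link files; pending and rejected suggestions never reach them.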

report_pipeline_status.py

@@ -0,0 +1,119 @@
import json
from pathlib import Path
import click
import build_observed_products
import build_purchases
import review_products
from layer_helpers import read_csv_rows, write_csv_rows
SUMMARY_FIELDS = ["stage", "count"]


def read_rows_if_exists(path):
    path = Path(path)
    if not path.exists():
        return []
    return read_csv_rows(path)


def build_status_summary(
    giant_orders,
    giant_items,
    giant_enriched,
    costco_orders,
    costco_items,
    costco_enriched,
    purchases,
    resolutions,
):
    enriched_rows = giant_enriched + costco_enriched
    observed_rows = build_observed_products.build_observed_products(enriched_rows)
    queue_rows = review_products.build_review_queue(purchases, resolutions)
    unresolved_purchase_rows = [
        row
        for row in purchases
        if row.get("observed_product_id")
        and not row.get("canonical_product_id")
        and row.get("is_fee") != "true"
        and row.get("is_discount_line") != "true"
        and row.get("is_coupon_line") != "true"
    ]
    excluded_rows = [
        row
        for row in purchases
        if row.get("resolution_action") == "exclude"
    ]
    linked_purchase_rows = [row for row in purchases if row.get("canonical_product_id")]
    summary = [
        {"stage": "raw_orders", "count": len(giant_orders) + len(costco_orders)},
        {"stage": "raw_items", "count": len(giant_items) + len(costco_items)},
        {"stage": "enriched_items", "count": len(enriched_rows)},
        {"stage": "observed_products", "count": len(observed_rows)},
        {"stage": "review_queue_observed_products", "count": len(queue_rows)},
        {"stage": "canonical_linked_purchase_rows", "count": len(linked_purchase_rows)},
        {"stage": "final_purchase_rows", "count": len(purchases)},
        {"stage": "unresolved_purchase_rows", "count": len(unresolved_purchase_rows)},
        {"stage": "excluded_purchase_rows", "count": len(excluded_rows)},
        {
            "stage": "unresolved_not_in_review_rows",
            "count": len(
                [
                    row
                    for row in unresolved_purchase_rows
                    if row.get("observed_product_id")
                    not in {queue_row["observed_product_id"] for queue_row in queue_rows}
                ]
            ),
        },
    ]
    return summary


@click.command()
@click.option("--giant-orders-csv", default="giant_output/orders.csv", show_default=True)
@click.option("--giant-items-csv", default="giant_output/items.csv", show_default=True)
@click.option("--giant-enriched-csv", default="giant_output/items_enriched.csv", show_default=True)
@click.option("--costco-orders-csv", default="costco_output/orders.csv", show_default=True)
@click.option("--costco-items-csv", default="costco_output/items.csv", show_default=True)
@click.option("--costco-enriched-csv", default="costco_output/items_enriched.csv", show_default=True)
@click.option("--purchases-csv", default="combined_output/purchases.csv", show_default=True)
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True)
@click.option("--summary-csv", default="combined_output/pipeline_status.csv", show_default=True)
@click.option("--summary-json", default="combined_output/pipeline_status.json", show_default=True)
def main(
    giant_orders_csv,
    giant_items_csv,
    giant_enriched_csv,
    costco_orders_csv,
    costco_items_csv,
    costco_enriched_csv,
    purchases_csv,
    resolutions_csv,
    summary_csv,
    summary_json,
):
    summary_rows = build_status_summary(
        read_rows_if_exists(giant_orders_csv),
        read_rows_if_exists(giant_items_csv),
        read_rows_if_exists(giant_enriched_csv),
        read_rows_if_exists(costco_orders_csv),
        read_rows_if_exists(costco_items_csv),
        read_rows_if_exists(costco_enriched_csv),
        read_rows_if_exists(purchases_csv),
        read_rows_if_exists(resolutions_csv),
    )
    write_csv_rows(summary_csv, summary_rows, SUMMARY_FIELDS)
    summary_json_path = Path(summary_json)
    summary_json_path.parent.mkdir(parents=True, exist_ok=True)
    summary_json_path.write_text(json.dumps(summary_rows, indent=2), encoding="utf-8")
    for row in summary_rows:
        click.echo(f"{row['stage']}: {row['count']}")


if __name__ == "__main__":
    main()
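
The summary rows written by this script could be checked automatically against the accounting criteria in the task file. The sketch below is a hypothetical helper (`find_accounting_gaps` is not part of the diff); it relies only on the stage names that `build_status_summary` emits, and the two checks shown are assumptions about what "valid accounting" should mean.

```python
def find_accounting_gaps(summary_rows):
    """Return human-readable problems found in pipeline status rows."""
    # Counts may be ints (in memory) or strings (read back from CSV).
    counts = {row["stage"]: int(row["count"]) for row in summary_rows}
    problems = []
    if counts.get("unresolved_not_in_review_rows", 0) > 0:
        problems.append("unresolved purchase rows are missing from the review queue")
    if counts.get("canonical_linked_purchase_rows", 0) > counts.get("final_purchase_rows", 0):
        problems.append("more linked rows than final purchase rows")
    return problems
```

An empty list means neither check tripped; it does not prove the pipeline is otherwise correct.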


review_products.py

@@ -0,0 +1,426 @@
from collections import defaultdict
from datetime import date
import click
import build_purchases
from layer_helpers import compact_join, stable_id, write_csv_rows
QUEUE_FIELDS = [
"review_id",
"retailer",
"observed_product_id",
"canonical_product_id",
"reason_code",
"priority",
"raw_item_names",
"normalized_names",
"upc_values",
"example_prices",
"seen_count",
"status",
"resolution_action",
"resolution_notes",
"created_at",
"updated_at",
]


def build_review_queue(purchase_rows, resolution_rows):
    by_observed = defaultdict(list)
    resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
    for row in purchase_rows:
        observed_product_id = row.get("observed_product_id", "")
        if not observed_product_id:
            continue
        by_observed[observed_product_id].append(row)
    today_text = str(date.today())
    queue_rows = []
    for observed_product_id, rows in sorted(by_observed.items()):
        current_resolution = resolution_lookup.get(observed_product_id, {})
        if current_resolution.get("status") == "approved":
            continue
        unresolved_rows = [row for row in rows if not row.get("canonical_product_id")]
        if not unresolved_rows:
            continue
        retailers = sorted({row["retailer"] for row in rows})
        review_id = stable_id("rvw", observed_product_id)
        queue_rows.append(
            {
                "review_id": review_id,
                "retailer": " | ".join(retailers),
                "observed_product_id": observed_product_id,
                "canonical_product_id": current_resolution.get("canonical_product_id", ""),
                "reason_code": "missing_canonical_link",
                "priority": "high",
                "raw_item_names": compact_join(
                    sorted({row["raw_item_name"] for row in rows if row["raw_item_name"]}),
                    limit=8,
                ),
                "normalized_names": compact_join(
                    sorted(
                        {
                            row["normalized_item_name"]
                            for row in rows
                            if row["normalized_item_name"]
                        }
                    ),
                    limit=8,
                ),
                "upc_values": compact_join(
                    sorted({row["upc"] for row in rows if row["upc"]}),
                    limit=8,
                ),
                "example_prices": compact_join(
                    sorted({row["line_total"] for row in rows if row["line_total"]}),
                    limit=8,
                ),
                "seen_count": str(len(rows)),
                "status": current_resolution.get("status", "pending"),
                "resolution_action": current_resolution.get("resolution_action", ""),
                "resolution_notes": current_resolution.get("resolution_notes", ""),
                "created_at": current_resolution.get("reviewed_at", today_text),
                "updated_at": today_text,
            }
        )
    return queue_rows


def save_resolution_rows(path, rows):
    write_csv_rows(path, rows, build_purchases.RESOLUTION_FIELDS)


def save_catalog_rows(path, rows):
    write_csv_rows(path, rows, build_purchases.CATALOG_FIELDS)


INFO_COLOR = "cyan"
PROMPT_COLOR = "bright_yellow"
WARNING_COLOR = "magenta"


def sort_related_items(rows):
    return sorted(
        rows,
        key=lambda row: (
            row.get("purchase_date", ""),
            row.get("order_id", ""),
            int(row.get("line_no", "0") or "0"),
        ),
        reverse=True,
    )


def build_canonical_suggestions(related_rows, catalog_rows, limit=3):
    normalized_names = {
        row.get("normalized_item_name", "").strip().upper()
        for row in related_rows
        if row.get("normalized_item_name", "").strip()
    }
    upcs = {
        row.get("upc", "").strip()
        for row in related_rows
        if row.get("upc", "").strip()
    }
    suggestions = []
    seen_ids = set()

    def add_matches(rows, reason):
        for row in rows:
            canonical_product_id = row.get("canonical_product_id", "")
            if not canonical_product_id or canonical_product_id in seen_ids:
                continue
            seen_ids.add(canonical_product_id)
            suggestions.append(
                {
                    "canonical_product_id": canonical_product_id,
                    "canonical_name": row.get("canonical_name", ""),
                    "reason": reason,
                }
            )
            if len(suggestions) >= limit:
                return True
        return False

    exact_upc_rows = [
        row
        for row in catalog_rows
        if row.get("upc", "").strip() and row.get("upc", "").strip() in upcs
    ]
    if add_matches(exact_upc_rows, "exact upc"):
        return suggestions
    exact_name_rows = [
        row
        for row in catalog_rows
        if row.get("canonical_name", "").strip().upper() in normalized_names
    ]
    if add_matches(exact_name_rows, "exact normalized name"):
        return suggestions
    contains_rows = []
    for row in catalog_rows:
        canonical_name = row.get("canonical_name", "").strip().upper()
        if not canonical_name:
            continue
        for normalized_name in normalized_names:
            if normalized_name in canonical_name or canonical_name in normalized_name:
                contains_rows.append(row)
                break
    add_matches(contains_rows, "canonical name contains match")
    return suggestions


def build_display_lines(queue_row, related_rows):
    lines = []
    for index, row in enumerate(sort_related_items(related_rows), start=1):
        lines.append(
            " [{index}] {purchase_date} | {line_total} | {raw_item_name} | {normalized_item_name} | "
            "{upc} | {retailer}".format(
                index=index,
                purchase_date=row.get("purchase_date", ""),
                line_total=row.get("line_total", ""),
                raw_item_name=row.get("raw_item_name", ""),
                normalized_item_name=row.get("normalized_item_name", ""),
                upc=row.get("upc", ""),
                retailer=row.get("retailer", ""),
            )
        )
        if row.get("image_url"):
            lines.append(f" {row['image_url']}")
    if not lines:
        lines.append(" [1] no matched item rows found")
    return lines


def observed_name(queue_row, related_rows):
    if queue_row.get("normalized_names"):
        return queue_row["normalized_names"].split(" | ")[0]
    for row in related_rows:
        if row.get("normalized_item_name"):
            return row["normalized_item_name"]
    return queue_row.get("observed_product_id", "")


def choose_existing_canonical(display_rows, observed_label, matched_count):
    click.secho(
        f"Select the canonical_name to associate {matched_count} items with:",
        fg=INFO_COLOR,
    )
    for index, row in enumerate(display_rows, start=1):
        click.echo(f" [{index}] {row['canonical_name']} | {row['canonical_product_id']}")
    choice = click.prompt(
        click.style("selection", fg=PROMPT_COLOR),
        type=click.IntRange(1, len(display_rows)),
    )
    chosen_row = display_rows[choice - 1]
    click.echo(
        f'{matched_count} "{observed_label}" items and future matches will be associated '
        f'with "{chosen_row["canonical_name"]}".'
    )
    click.secho(
        "actions: [y]es [n]o [b]ack [s]kip [q]uit",
        fg=PROMPT_COLOR,
    )
    confirm = click.prompt(
        click.style("confirm", fg=PROMPT_COLOR),
        type=click.Choice(["y", "n", "b", "s", "q"]),
    )
    if confirm == "y":
        return chosen_row["canonical_product_id"], ""
    if confirm == "s":
        return "", "skip"
    if confirm == "q":
        return "", "quit"
    return "", "back"


def prompt_resolution(queue_row, related_rows, catalog_rows, queue_index, queue_total):
    suggestions = build_canonical_suggestions(related_rows, catalog_rows)
    observed_label = observed_name(queue_row, related_rows)
    matched_count = len(related_rows)
    click.echo("")
    click.secho(
        f"Review {queue_index}/{queue_total}: Resolve observed_product {observed_label} "
        "to canonical_name [__]?",
        fg=INFO_COLOR,
    )
    click.echo(f"{matched_count} matched items:")
    for line in build_display_lines(queue_row, related_rows):
        click.echo(line)
    if suggestions:
        click.echo(f"{len(suggestions)} canonical suggestions found:")
        for index, suggestion in enumerate(suggestions, start=1):
            click.echo(f" [{index}] {suggestion['canonical_name']}")
    else:
        click.echo("no canonical_name suggestions found")
    click.secho(
        "[l]ink existing [n]ew canonical e[x]clude [s]kip [q]uit:",
        fg=PROMPT_COLOR,
    )
    action = click.prompt(
        "",
        type=click.Choice(["l", "n", "x", "s", "q"]),
        prompt_suffix=" ",
    )
    if action == "q":
        return None, None
    if action == "s":
        return {
            "observed_product_id": queue_row["observed_product_id"],
            "canonical_product_id": "",
            "resolution_action": "skip",
            "status": "pending",
            "resolution_notes": queue_row.get("resolution_notes", ""),
            "reviewed_at": str(date.today()),
        }, None
    if action == "x":
        notes = click.prompt(
            click.style("exclude notes", fg=PROMPT_COLOR),
            default="",
            show_default=False,
        )
        return {
            "observed_product_id": queue_row["observed_product_id"],
            "canonical_product_id": "",
            "resolution_action": "exclude",
            "status": "approved",
            "resolution_notes": notes,
            "reviewed_at": str(date.today()),
        }, None
    if action == "l":
        display_rows = suggestions or [
            {
                "canonical_product_id": row["canonical_product_id"],
                "canonical_name": row["canonical_name"],
                "reason": "catalog sample",
            }
            for row in catalog_rows[:10]
        ]
        while True:
            canonical_product_id, outcome = choose_existing_canonical(
                display_rows,
                observed_label,
                matched_count,
            )
            if outcome == "skip":
                return {
                    "observed_product_id": queue_row["observed_product_id"],
                    "canonical_product_id": "",
                    "resolution_action": "skip",
                    "status": "pending",
                    "resolution_notes": queue_row.get("resolution_notes", ""),
                    "reviewed_at": str(date.today()),
                }, None
            if outcome == "quit":
                return None, None
            if outcome == "back":
                continue
            break
        notes = click.prompt(click.style("link notes", fg=PROMPT_COLOR), default="", show_default=False)
        return {
            "observed_product_id": queue_row["observed_product_id"],
            "canonical_product_id": canonical_product_id,
            "resolution_action": "link",
            "status": "approved",
            "resolution_notes": notes,
            "reviewed_at": str(date.today()),
        }, None
    canonical_name = click.prompt(click.style("canonical name", fg=PROMPT_COLOR), type=str)
    category = click.prompt(
        click.style("category", fg=PROMPT_COLOR),
        default="",
        show_default=False,
    )
    product_type = click.prompt(
        click.style("product type", fg=PROMPT_COLOR),
        default="",
        show_default=False,
    )
    notes = click.prompt(
        click.style("notes", fg=PROMPT_COLOR),
        default="",
        show_default=False,
    )
    canonical_product_id = stable_id("gcan", f"manual|{canonical_name}|{category}|{product_type}")
    canonical_row = {
        "canonical_product_id": canonical_product_id,
        "canonical_name": canonical_name,
        "category": category,
        "product_type": product_type,
        "brand": "",
        "variant": "",
        "size_value": "",
        "size_unit": "",
        "pack_qty": "",
        "measure_type": "",
        "notes": notes,
        "created_at": str(date.today()),
        "updated_at": str(date.today()),
    }
    resolution_row = {
        "observed_product_id": queue_row["observed_product_id"],
        "canonical_product_id": canonical_product_id,
        "resolution_action": "create",
        "status": "approved",
        "resolution_notes": notes,
        "reviewed_at": str(date.today()),
    }
    return resolution_row, canonical_row


@click.command()
@click.option("--purchases-csv", default="combined_output/purchases.csv", show_default=True)
@click.option("--queue-csv", default="combined_output/review_queue.csv", show_default=True)
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="combined_output/canonical_catalog.csv", show_default=True)
@click.option("--limit", default=0, show_default=True, type=int)
@click.option("--refresh-only", is_flag=True, help="Only rebuild review_queue.csv without prompting.")
def main(purchases_csv, queue_csv, resolutions_csv, catalog_csv, limit, refresh_only):
    purchase_rows = build_purchases.read_optional_csv_rows(purchases_csv)
    resolution_rows = build_purchases.read_optional_csv_rows(resolutions_csv)
    catalog_rows = build_purchases.read_optional_csv_rows(catalog_csv)
    queue_rows = build_review_queue(purchase_rows, resolution_rows)
    write_csv_rows(queue_csv, queue_rows, QUEUE_FIELDS)
    click.echo(f"wrote {len(queue_rows)} rows to {queue_csv}")
    if refresh_only:
        return
    resolution_lookup = build_purchases.load_resolution_lookup(resolution_rows)
    catalog_by_id = {row["canonical_product_id"]: row for row in catalog_rows if row.get("canonical_product_id")}
    rows_by_observed = defaultdict(list)
    for row in purchase_rows:
        observed_product_id = row.get("observed_product_id", "")
        if observed_product_id:
            rows_by_observed[observed_product_id].append(row)
    reviewed = 0
    for index, queue_row in enumerate(queue_rows, start=1):
        if limit and reviewed >= limit:
            break
        related_rows = rows_by_observed.get(queue_row["observed_product_id"], [])
        result = prompt_resolution(queue_row, related_rows, catalog_rows, index, len(queue_rows))
        if result == (None, None):
            break
        resolution_row, canonical_row = result
        resolution_lookup[resolution_row["observed_product_id"]] = resolution_row
        if canonical_row and canonical_row["canonical_product_id"] not in catalog_by_id:
            catalog_by_id[canonical_row["canonical_product_id"]] = canonical_row
            catalog_rows.append(canonical_row)
        reviewed += 1
    save_resolution_rows(resolutions_csv, sorted(resolution_lookup.values(), key=lambda row: row["observed_product_id"]))
    save_catalog_rows(catalog_csv, sorted(catalog_by_id.values(), key=lambda row: row["canonical_product_id"]))
    click.echo(
        f"saved {len(resolution_lookup)} resolution rows to {resolutions_csv} "
        f"and {len(catalog_by_id)} catalog rows to {catalog_csv}"
    )


if __name__ == "__main__":
    main()


@@ -1,254 +0,0 @@
import json
import time
from pathlib import Path
import browser_cookie3
import click
import pandas as pd
from curl_cffi import requests
from dotenv import load_dotenv
import os
BASE = "https://giantfood.com"
ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"


def load_config():
    load_dotenv()
    return {
        "user_id": os.getenv("GIANT_USER_ID", "").strip(),
        "loyalty": os.getenv("GIANT_LOYALTY_NUMBER", "").strip(),
    }


def build_session():
    s = requests.Session()
    s.cookies.update(browser_cookie3.firefox(domain_name="giantfood.com"))
    s.headers.update({
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
        "accept": "application/json, text/plain, */*",
        "accept-language": "en-US,en;q=0.9",
        "referer": ACCOUNT_PAGE,
    })
    return s


def safe_get(session, url, **kwargs):
    last_response = None
    for attempt in range(3):
        try:
            r = session.get(
                url,
                impersonate="firefox",
                timeout=30,
                **kwargs,
            )
            last_response = r
            if r.status_code == 200:
                return r
            click.echo(f"retry {attempt + 1}/3 status={r.status_code}")
        except Exception as e:
            click.echo(f"retry {attempt + 1}/3 error={e}")
        time.sleep(3)
    if last_response is not None:
        last_response.raise_for_status()
    raise RuntimeError(f"failed to fetch {url}")


def get_history(session, user_id, loyalty):
    url = f"{BASE}/api/v6.0/user/{user_id}/order/history"
    r = safe_get(
        session,
        url,
        params={
            "filter": "instore",
            "loyaltyNumber": loyalty,
        },
    )
    return r.json()


def get_order_detail(session, user_id, order_id):
    url = f"{BASE}/api/v6.0/user/{user_id}/order/history/detail/{order_id}"
    r = safe_get(
        session,
        url,
        params={"isInStore": "true"},
    )
    return r.json()


def flatten_orders(history, details):
    orders = []
    items = []
    history_lookup = {
        r["orderId"]: r
        for r in history.get("records", [])
    }
    for d in details:
        hist = history_lookup.get(d["orderId"], {})
        pup = d.get("pup", {})
        orders.append({
            "order_id": d["orderId"],
            "order_date": d.get("orderDate"),
            "delivery_date": d.get("deliveryDate"),
            "service_type": hist.get("serviceType"),
            "order_total": d.get("orderTotal"),
            "payment_method": d.get("paymentMethod"),
            "total_item_count": d.get("totalItemCount"),
            "total_savings": d.get("totalSavings"),
            "your_savings_total": d.get("yourSavingsTotal"),
            "coupons_discounts_total": d.get("couponsDiscountsTotal"),
            "store_name": pup.get("storeName"),
            "store_number": pup.get("aholdStoreNumber"),
            "store_address1": pup.get("storeAddress1"),
            "store_city": pup.get("storeCity"),
            "store_state": pup.get("storeState"),
            "store_zipcode": pup.get("storeZipcode"),
            "refund_order": d.get("refundOrder"),
            "ebt_order": d.get("ebtOrder"),
        })
        for i, item in enumerate(d.get("items", []), start=1):
            items.append({
                "order_id": d["orderId"],
                "order_date": d.get("orderDate"),
                "line_no": i,
                "pod_id": item.get("podId"),
                "item_name": item.get("itemName"),
                "upc": item.get("primUpcCd"),
                "category_id": item.get("categoryId"),
                "category": item.get("categoryDesc"),
                "qty": item.get("shipQy"),
                "unit": item.get("lbEachCd"),
                "unit_price": item.get("unitPrice"),
                "line_total": item.get("groceryAmount"),
                "picked_weight": item.get("totalPickedWeight"),
                "mvp_savings": item.get("mvpSavings"),
                "reward_savings": item.get("rewardSavings"),
                "coupon_savings": item.get("couponSavings"),
                "coupon_price": item.get("couponPrice"),
            })
    return pd.DataFrame(orders), pd.DataFrame(items)


def read_existing_order_ids(orders_csv: Path) -> set[str]:
    if not orders_csv.exists():
        return set()
    try:
        df = pd.read_csv(orders_csv, dtype={"order_id": str})
        if "order_id" not in df.columns:
            return set()
        return set(df["order_id"].dropna().astype(str))
    except Exception:
        return set()


def append_dedup(existing_path: Path, new_df: pd.DataFrame, subset: list[str]) -> pd.DataFrame:
    if existing_path.exists():
        old_df = pd.read_csv(existing_path, dtype=str)
        combined = pd.concat([old_df, new_df.astype(str)], ignore_index=True)
    else:
        combined = new_df.astype(str).copy()
    combined = combined.drop_duplicates(subset=subset, keep="last")
    combined.to_csv(existing_path, index=False)
    return combined


@click.command()
@click.option("--user-id", default=None, help="giant user id")
@click.option("--loyalty", default=None, help="giant loyalty number")
@click.option("--outdir", default="giant_output", show_default=True, help="output directory")
@click.option("--sleep-seconds", default=1.5, show_default=True, type=float, help="delay between detail requests")
def main(user_id, loyalty, outdir, sleep_seconds):
    cfg = load_config()
    user_id = user_id or cfg["user_id"] or click.prompt("giant user id", type=str)
    loyalty = loyalty or cfg["loyalty"] or click.prompt("giant loyalty number", type=str)
    outdir = Path(outdir)
    rawdir = outdir / "raw"
    rawdir.mkdir(parents=True, exist_ok=True)
    orders_csv = outdir / "orders.csv"
    items_csv = outdir / "items.csv"
    click.echo("using cookies from your current firefox profile.")
    click.echo(f"open giant here, make sure you're logged in, then return: {ACCOUNT_PAGE}")
    click.pause(info="press any key once giant is open and logged in")
    session = build_session()
    click.echo("fetching order history...")
    history = get_history(session, user_id, loyalty)
    (rawdir / "history.json").write_text(
        json.dumps(history, indent=2),
        encoding="utf-8",
    )
    records = history.get("records", [])
    click.echo(f"history returned {len(records)} visits")
    click.echo("tip: giant appears to expose only the most recent 50 visits, so run this periodically if you want full continuity.")
    history_order_ids = [str(r["orderId"]) for r in records]
    existing_order_ids = read_existing_order_ids(orders_csv)
    new_order_ids = [oid for oid in history_order_ids if oid not in existing_order_ids]
    click.echo(f"existing orders in csv: {len(existing_order_ids)}")
    click.echo(f"new orders to fetch: {len(new_order_ids)}")
    if not new_order_ids:
        click.echo("no new orders found. done.")
        return
    details = []
    for order_id in new_order_ids:
        click.echo(f"fetching {order_id}")
        d = get_order_detail(session, user_id, order_id)
        details.append(d)
        (rawdir / f"{order_id}.json").write_text(
            json.dumps(d, indent=2),
            encoding="utf-8",
        )
        time.sleep(sleep_seconds)
    click.echo("flattening new data...")
    orders_df, items_df = flatten_orders(history, details)
    orders_all = append_dedup(
        orders_csv,
        orders_df,
        subset=["order_id"],
    )
    items_all = append_dedup(
        items_csv,
        items_df,
        subset=["order_id", "line_no", "item_name", "upc", "line_total"],
    )
    click.echo("done")
    click.echo(f"orders csv: {orders_csv}")
    click.echo(f"items csv: {items_csv}")
    click.echo(f"total orders stored: {len(orders_all)}")
    click.echo(f"total item rows stored: {len(items_all)}")


if __name__ == "__main__":
    main()

scrape_costco.py

@@ -0,0 +1,738 @@
import os
import csv
import json
import time
import re
from pathlib import Path
from calendar import monthrange
from datetime import datetime, timedelta
from dotenv import load_dotenv
import click
from curl_cffi import requests
from browser_session import (
find_firefox_profile_dir,
load_firefox_cookies,
read_firefox_local_storage,
read_firefox_webapps_store,
)
BASE_URL = "https://ecom-api.costco.com/ebusiness/order/v1/orders/graphql"
RETAILER = "costco"
SUMMARY_QUERY = """
query receiptsWithCounts($startDate: String!, $endDate: String!, $documentType: String!, $documentSubType: String!) {
receiptsWithCounts(startDate: $startDate, endDate: $endDate, documentType: $documentType, documentSubType: $documentSubType) {
inWarehouse
gasStation
carWash
gasAndCarWash
receipts {
warehouseName
receiptType
documentType
transactionDateTime
transactionBarcode
warehouseName
transactionType
total
totalItemCount
itemArray {
itemNumber
}
tenderArray {
tenderTypeCode
tenderDescription
amountTender
}
couponArray {
upcnumberCoupon
}
}
}
}
""".strip()
DETAIL_QUERY = """
query receiptsWithCounts($barcode: String!, $documentType: String!) {
receiptsWithCounts(barcode: $barcode, documentType: $documentType) {
receipts {
warehouseName
receiptType
documentType
transactionDateTime
transactionDate
companyNumber
warehouseNumber
operatorNumber
warehouseShortName
registerNumber
transactionNumber
transactionType
transactionBarcode
total
warehouseAddress1
warehouseAddress2
warehouseCity
warehouseState
warehouseCountry
warehousePostalCode
totalItemCount
subTotal
taxes
total
invoiceNumber
sequenceNumber
itemArray {
itemNumber
itemDescription01
frenchItemDescription1
itemDescription02
frenchItemDescription2
itemIdentifier
itemDepartmentNumber
unit
amount
taxFlag
merchantID
entryMethod
transDepartmentNumber
fuelUnitQuantity
fuelGradeCode
itemUnitPriceAmount
fuelUomCode
fuelUomDescription
fuelUomDescriptionFr
fuelGradeDescription
fuelGradeDescriptionFr
}
tenderArray {
tenderTypeCode
tenderSubTypeCode
tenderDescription
amountTender
displayAccountNumber
sequenceNumber
approvalNumber
responseCode
tenderTypeName
transactionID
merchantID
entryMethod
tenderAcctTxnNumber
tenderAuthorizationCode
tenderTypeNameFr
tenderEntryMethodDescription
walletType
walletId
storedValueBucket
}
subTaxes {
tax1
tax2
tax3
tax4
aTaxPercent
aTaxLegend
aTaxAmount
aTaxPrintCode
aTaxPrintCodeFR
aTaxIdentifierCode
bTaxPercent
bTaxLegend
bTaxAmount
bTaxPrintCode
bTaxPrintCodeFR
bTaxIdentifierCode
cTaxPercent
cTaxLegend
cTaxAmount
cTaxIdentifierCode
dTaxPercent
dTaxLegend
dTaxAmount
dTaxPrintCode
dTaxPrintCodeFR
dTaxIdentifierCode
uTaxLegend
uTaxAmount
uTaxableAmount
}
instantSavings
membershipNumber
}
}
}
""".strip()
ORDER_FIELDS = [
"retailer",
"order_id",
"order_date",
"delivery_date",
"service_type",
"order_total",
"payment_method",
"total_item_count",
"total_savings",
"your_savings_total",
"coupons_discounts_total",
"store_name",
"store_number",
"store_address1",
"store_city",
"store_state",
"store_zipcode",
"refund_order",
"ebt_order",
"raw_history_path",
"raw_order_path",
]
ITEM_FIELDS = [
"retailer",
"order_id",
"line_no",
"order_date",
"retailer_item_id",
"pod_id",
"item_name",
"upc",
"category_id",
"category",
"qty",
"unit",
"unit_price",
"line_total",
"picked_weight",
"mvp_savings",
"reward_savings",
"coupon_savings",
"coupon_price",
"image_url",
"raw_order_path",
"is_discount_line",
"is_coupon_line",
]
COSTCO_STORAGE_ORIGIN = "costco.com"
COSTCO_ID_TOKEN_STORAGE_KEY = "idToken"
COSTCO_CLIENT_ID_STORAGE_KEY = "clientID"


def load_config():
    load_dotenv()
    return {
        "authorization": os.getenv("COSTCO_X_AUTHORIZATION", "").strip(),
        "client_id": os.getenv("COSTCO_X_WCS_CLIENTID", "").strip(),
        "client_identifier": os.getenv("COSTCO_CLIENT_IDENTIFIER", "").strip(),
    }


def build_headers(auth_headers):
    headers = {
        "accept": "*/*",
        "content-type": "application/json-patch+json",
        "costco.service": "restOrders",
        "costco.env": "ecom",
        "origin": "https://www.costco.com",
        "referer": "https://www.costco.com/",
        "user-agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) "
            "Gecko/20100101 Firefox/148.0"
        ),
    }
    headers.update(auth_headers)
    return headers


def load_costco_browser_headers(profile_dir, authorization, client_id, client_identifier):
    local_storage = read_firefox_local_storage(profile_dir, COSTCO_STORAGE_ORIGIN)
    webapps_store = read_firefox_webapps_store(profile_dir, COSTCO_STORAGE_ORIGIN)
    auth_header = authorization.strip() if authorization else ""
    if client_id:
        client_id = client_id.strip()
    if client_identifier:
        client_identifier = client_identifier.strip()
    if not auth_header:
        id_token = (
            local_storage.get(COSTCO_ID_TOKEN_STORAGE_KEY, "").strip()
            or webapps_store.get(COSTCO_ID_TOKEN_STORAGE_KEY, "").strip()
        )
        if id_token:
            auth_header = f"Bearer {id_token}"
    client_id = client_id or (
        local_storage.get(COSTCO_CLIENT_ID_STORAGE_KEY, "").strip()
        or webapps_store.get(COSTCO_CLIENT_ID_STORAGE_KEY, "").strip()
    )
    if not auth_header:
        raise click.ClickException(
            "could not find Costco auth token; set COSTCO_X_AUTHORIZATION or load Firefox idToken"
        )
    if not client_id or not client_identifier:
        raise click.ClickException(
            "missing Costco client ids; set COSTCO_X_WCS_CLIENTID and COSTCO_CLIENT_IDENTIFIER"
        )
    return {
        "costco-x-authorization": auth_header,
        "costco-x-wcs-clientId": client_id,
        "client-identifier": client_identifier,
    }


def build_session(profile_dir, auth_headers):
    session = requests.Session()
    session.cookies.update(load_firefox_cookies(".costco.com", profile_dir))
    session.headers.update(build_headers(auth_headers))
    session.headers.update(auth_headers)
    return session


def graphql_post(session, query, variables):
    last_response = None
    for attempt in range(3):
        try:
            response = session.post(
                BASE_URL,
                json={"query": query, "variables": variables},
                impersonate="firefox",
                timeout=30,
            )
            last_response = response
            if response.status_code == 200:
                return response.json()
            click.echo(f"retry {attempt + 1}/3 status={response.status_code} body={response.text[:500]}")
        except Exception as exc:  # pragma: no cover - network error path
            click.echo(f"retry {attempt + 1}/3 error={exc}")
        time.sleep(3)
    if last_response is not None:
        last_response.raise_for_status()
    raise RuntimeError("failed to fetch Costco GraphQL payload")


def safe_filename(value):
    return re.sub(r'[<>:"/\\|?*]+', "-", str(value))


def summary_receipts(payload):
    return payload.get("data", {}).get("receiptsWithCounts", {}).get("receipts", [])


def detail_receipts(payload):
    return payload.get("data", {}).get("receiptsWithCounts", {}).get("receipts", [])


def summary_counts(payload):
    counts = payload.get("data", {}).get("receiptsWithCounts", {})
    return {
        "inWarehouse": counts.get("inWarehouse", 0) or 0,
        "gasStation": counts.get("gasStation", 0) or 0,
        "carWash": counts.get("carWash", 0) or 0,
        "gasAndCarWash": counts.get("gasAndCarWash", 0) or 0,
    }
def parse_cli_date(value):
return datetime.strptime(value, "%m/%d/%Y").date()
def format_cli_date(value):
return f"{value.month}/{value.day:02d}/{value.year}"
def subtract_months(value, months):
year = value.year
month = value.month - months
while month <= 0:
month += 12
year -= 1
day = min(value.day, monthrange(year, month)[1])
return value.replace(year=year, month=month, day=day)
def resolve_date_range(months_back, today=None):
if months_back < 1:
raise click.ClickException("months-back must be at least 1")
end = today or datetime.now().date()
start = subtract_months(end, months_back)
return format_cli_date(start), format_cli_date(end)
def build_date_windows(start_date, end_date, window_days):
start = parse_cli_date(start_date)
end = parse_cli_date(end_date)
if end < start:
raise click.ClickException("end-date must be on or after start-date")
if window_days < 1:
raise click.ClickException("window-days must be at least 1")
windows = []
current = start
while current <= end:
window_end = min(current + timedelta(days=window_days - 1), end)
windows.append(
{
"startDate": format_cli_date(current),
"endDate": format_cli_date(window_end),
}
)
current = window_end + timedelta(days=1)
return windows
def unique_receipts(receipts):
by_barcode = {}
for receipt in receipts:
key = receipt_key(receipt)
if key:
by_barcode[key] = receipt
return list(by_barcode.values())
def receipt_key(receipt):
barcode = receipt.get("transactionBarcode", "")
transaction_date_time = receipt.get("transactionDateTime", "")
if not barcode:
return ""
return f"{barcode}::{transaction_date_time}"
def fetch_summary_windows(
session,
start_date,
end_date,
document_type,
document_sub_type,
window_days,
):
requests_metadata = []
combined_receipts = []
for window in build_date_windows(start_date, end_date, window_days):
variables = {
"startDate": window["startDate"],
"endDate": window["endDate"],
"text": "custom",
"documentType": document_type,
"documentSubType": document_sub_type,
}
payload = graphql_post(session, SUMMARY_QUERY, variables)
receipts = summary_receipts(payload)
counts = summary_counts(payload)
warehouse_count = sum(
1 for receipt in receipts if receipt.get("receiptType") == "In-Warehouse"
)
mismatch = counts["inWarehouse"] != warehouse_count
requests_metadata.append(
{
**variables,
"returnedReceipts": len(receipts),
"returnedInWarehouseReceipts": warehouse_count,
"inWarehouse": counts["inWarehouse"],
"gasStation": counts["gasStation"],
"carWash": counts["carWash"],
"gasAndCarWash": counts["gasAndCarWash"],
"countMismatch": mismatch,
}
)
if mismatch:
click.echo(
(
"warning: summary count mismatch for "
f"{window['startDate']} to {window['endDate']}: "
f"inWarehouse={counts['inWarehouse']} "
f"returnedInWarehouseReceipts={warehouse_count}"
),
err=True,
)
combined_receipts.extend(receipts)
unique = unique_receipts(combined_receipts)
aggregate_payload = {
"data": {
"receiptsWithCounts": {
"inWarehouse": sum(row["inWarehouse"] for row in requests_metadata),
"gasStation": sum(row["gasStation"] for row in requests_metadata),
"carWash": sum(row["carWash"] for row in requests_metadata),
"gasAndCarWash": sum(row["gasAndCarWash"] for row in requests_metadata),
"receipts": unique,
}
}
}
return aggregate_payload, requests_metadata
def flatten_costco_data(summary_payload, detail_payloads, raw_dir):
summary_lookup = {
receipt_key(receipt): receipt
for receipt in summary_receipts(summary_payload)
if receipt_key(receipt)
}
orders = []
items = []
for detail_payload in detail_payloads:
for receipt in detail_receipts(detail_payload):
order_id = receipt["transactionBarcode"]
receipt_id = receipt_key(receipt)
summary_row = summary_lookup.get(receipt_id, {})
coupon_numbers = {
row.get("upcnumberCoupon", "")
for row in summary_row.get("couponArray", []) or []
if row.get("upcnumberCoupon")
}
raw_order_path = raw_dir / f"{safe_filename(receipt_id or order_id)}.json"
orders.append(
{
"retailer": RETAILER,
"order_id": order_id,
"order_date": receipt.get("transactionDate", ""),
"delivery_date": receipt.get("transactionDate", ""),
"service_type": receipt.get("receiptType", ""),
"order_total": stringify(receipt.get("total")),
"payment_method": compact_join(
summary_row.get("tenderArray", []) or [], "tenderDescription"
),
"total_item_count": stringify(receipt.get("totalItemCount")),
"total_savings": stringify(receipt.get("instantSavings")),
"your_savings_total": stringify(receipt.get("instantSavings")),
"coupons_discounts_total": stringify(receipt.get("instantSavings")),
"store_name": receipt.get("warehouseName", ""),
"store_number": stringify(receipt.get("warehouseNumber")),
"store_address1": receipt.get("warehouseAddress1", ""),
"store_city": receipt.get("warehouseCity", ""),
"store_state": receipt.get("warehouseState", ""),
"store_zipcode": receipt.get("warehousePostalCode", ""),
"refund_order": "false",
"ebt_order": "false",
"raw_history_path": (raw_dir / "summary.json").as_posix(),
"raw_order_path": raw_order_path.as_posix(),
}
)
for line_no, item in enumerate(receipt.get("itemArray", []) or [], start=1):
item_number = stringify(item.get("itemNumber"))
description = join_descriptions(
item.get("itemDescription01"), item.get("itemDescription02")
)
is_discount = is_discount_line(item)
is_coupon = is_discount and (
item_number in coupon_numbers
or description.startswith("/")
)
items.append(
{
"retailer": RETAILER,
"order_id": order_id,
"line_no": str(line_no),
"order_date": receipt.get("transactionDate", ""),
"retailer_item_id": item_number,
"pod_id": "",
"item_name": description,
"upc": "",
"category_id": stringify(item.get("itemDepartmentNumber")),
"category": stringify(item.get("transDepartmentNumber")),
"qty": stringify(item.get("unit")),
"unit": stringify(item.get("itemIdentifier")),
"unit_price": stringify(item.get("itemUnitPriceAmount")),
"line_total": stringify(item.get("amount")),
"picked_weight": "",
"mvp_savings": "",
"reward_savings": "",
"coupon_savings": stringify(item.get("amount") if is_coupon else ""),
"coupon_price": "",
"image_url": "",
"raw_order_path": raw_order_path.as_posix(),
"is_discount_line": "true" if is_discount else "false",
"is_coupon_line": "true" if is_coupon else "false",
}
)
return orders, items
def join_descriptions(*parts):
return " ".join(str(part).strip() for part in parts if part).strip()
def compact_join(rows, field):
values = [str(row.get(field, "")).strip() for row in rows if row.get(field)]
return " | ".join(values)
def is_discount_line(item):
amount = item.get("amount")
unit = item.get("unit")
description = join_descriptions(
item.get("itemDescription01"), item.get("itemDescription02")
)
try:
amount_val = float(amount)
except (TypeError, ValueError):
amount_val = 0.0
try:
unit_val = float(unit)
except (TypeError, ValueError):
unit_val = 0.0
return amount_val < 0 or unit_val < 0 or description.startswith("/")
def stringify(value):
if value is None:
return ""
return str(value)
def write_json(path, payload):
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
def write_csv(path, rows, fieldnames):
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
@click.command()
@click.option(
"--outdir",
default="costco_output",
show_default=True,
help="Output directory for Costco raw and flattened files.",
)
@click.option(
"--document-type",
default="all",
show_default=True,
help="Summary document type.",
)
@click.option(
"--document-sub-type",
default="all",
show_default=True,
help="Summary document sub type.",
)
@click.option(
"--window-days",
default=92,
show_default=True,
type=int,
help="Maximum number of days to request per summary window.",
)
@click.option(
"--months-back",
default=36,
show_default=True,
type=int,
help="How many months of receipts to enumerate back from today.",
)
@click.option(
"--firefox-profile-dir",
default=None,
help="Firefox profile directory to use for cookies and session storage.",
)
def main(
outdir,
document_type,
document_sub_type,
window_days,
months_back,
firefox_profile_dir,
):
click.echo("legacy entrypoint: prefer collect_costco_web.py for data-model outputs")
run_collection(
outdir=outdir,
document_type=document_type,
document_sub_type=document_sub_type,
window_days=window_days,
months_back=months_back,
firefox_profile_dir=firefox_profile_dir,
)
def run_collection(
outdir,
document_type,
document_sub_type,
window_days,
months_back,
firefox_profile_dir,
orders_filename="orders.csv",
items_filename="items.csv",
):
outdir = Path(outdir)
raw_dir = outdir / "raw"
config = load_config()
profile_dir = Path(firefox_profile_dir) if firefox_profile_dir else None
if profile_dir is None:
try:
profile_dir = find_firefox_profile_dir()
except Exception:
profile_dir = click.prompt(
"Firefox profile dir",
type=click.Path(exists=True, file_okay=False, path_type=Path),
)
auth_headers = load_costco_browser_headers(
profile_dir,
authorization=config["authorization"],
client_id=config["client_id"],
client_identifier=config["client_identifier"],
)
session = build_session(profile_dir, auth_headers)
click.echo(
"session bootstrap: "
f"cookies={True} "
f"authorization={bool(auth_headers.get('costco-x-authorization'))} "
f"client_id={bool(auth_headers.get('costco-x-wcs-clientId'))} "
f"client_identifier={bool(auth_headers.get('client-identifier'))}"
)
start_date, end_date = resolve_date_range(months_back)
summary_payload, request_metadata = fetch_summary_windows(
session,
start_date,
end_date,
document_type,
document_sub_type,
window_days,
)
write_json(raw_dir / "summary.json", summary_payload)
write_json(raw_dir / "summary_requests.json", request_metadata)
receipts = summary_receipts(summary_payload)
detail_payloads = []
for receipt in receipts:
barcode = receipt["transactionBarcode"]
receipt_id = receipt_key(receipt) or barcode
click.echo(f"fetching {barcode}")
detail_payload = graphql_post(
session,
DETAIL_QUERY,
{"barcode": barcode, "documentType": "warehouse"},
)
detail_payloads.append(detail_payload)
write_json(raw_dir / f"{safe_filename(receipt_id)}.json", detail_payload)
orders, items = flatten_costco_data(summary_payload, detail_payloads, raw_dir)
write_csv(outdir / orders_filename, orders, ORDER_FIELDS)
write_csv(outdir / items_filename, items, ITEM_FIELDS)
click.echo(f"wrote {len(orders)} orders and {len(items)} item rows to {outdir}")
if __name__ == "__main__":
main()
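
The windowing done by `build_date_windows` above can be hard to follow in flattened form, so here is a minimal standalone sketch of the same idea. It works on `date` objects instead of the date strings the scraper passes around, and the name `split_into_windows` is hypothetical, not part of `scrape_costco.py`.

```python
from datetime import date, timedelta


def split_into_windows(start, end, window_days):
    """Split the inclusive range [start, end] into consecutive windows.

    Each window covers at most `window_days` days; the last one may be shorter.
    """
    if end < start:
        raise ValueError("end must be on or after start")
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    windows = []
    current = start
    while current <= end:
        # A window of N days ends N - 1 days after it starts, clamped to `end`.
        window_end = min(current + timedelta(days=window_days - 1), end)
        windows.append((current, window_end))
        # The next window starts the day after this one ends, so nothing overlaps.
        current = window_end + timedelta(days=1)
    return windows


windows = split_into_windows(date(2026, 1, 1), date(2026, 1, 10), 4)
for window_start, window_end in windows:
    print(window_start.isoformat(), "to", window_end.isoformat())
```

Because the windows are inclusive and never overlap, a receipt should appear in only one window; the `unique_receipts` step in the scraper is still a sensible safeguard in case the API returns boundary receipts twice.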

scrape_giant.py Normal file

@@ -0,0 +1,367 @@
import csv
import json
import os
import time
from pathlib import Path
import click
from dotenv import load_dotenv
from curl_cffi import requests
from browser_session import find_firefox_profile_dir, load_firefox_cookies
BASE = "https://giantfood.com"
ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
RETAILER = "giant"
ORDER_FIELDS = [
"retailer",
"order_id",
"order_date",
"delivery_date",
"service_type",
"order_total",
"payment_method",
"total_item_count",
"total_savings",
"your_savings_total",
"coupons_discounts_total",
"store_name",
"store_number",
"store_address1",
"store_city",
"store_state",
"store_zipcode",
"refund_order",
"ebt_order",
"raw_history_path",
"raw_order_path",
]
ITEM_FIELDS = [
"retailer",
"order_id",
"order_date",
"line_no",
"retailer_item_id",
"pod_id",
"item_name",
"upc",
"category_id",
"category",
"qty",
"unit",
"unit_price",
"line_total",
"picked_weight",
"mvp_savings",
"reward_savings",
"coupon_savings",
"coupon_price",
"image_url",
"raw_order_path",
"is_discount_line",
"is_coupon_line",
]
def load_config():
load_dotenv()
return {
"user_id": os.getenv("GIANT_USER_ID", "").strip(),
"loyalty": os.getenv("GIANT_LOYALTY_NUMBER", "").strip(),
}
def build_session():
profile_dir = find_firefox_profile_dir()
session = requests.Session()
session.cookies.update(load_firefox_cookies("giantfood.com", profile_dir))
session.headers.update(
{
"user-agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) "
"Gecko/20100101 Firefox/148.0"
),
"accept": "application/json, text/plain, */*",
"accept-language": "en-US,en;q=0.9",
"referer": ACCOUNT_PAGE,
}
)
return session
def safe_get(session, url, **kwargs):
last_response = None
for attempt in range(3):
try:
response = session.get(
url,
impersonate="firefox",
timeout=30,
**kwargs,
)
last_response = response
if response.status_code == 200:
return response
click.echo(f"retry {attempt + 1}/3 status={response.status_code}")
except Exception as exc: # pragma: no cover - network error path
click.echo(f"retry {attempt + 1}/3 error={exc}")
time.sleep(3)
if last_response is not None:
last_response.raise_for_status()
raise RuntimeError(f"failed to fetch {url}")
def get_history(session, user_id, loyalty):
response = safe_get(
session,
f"{BASE}/api/v6.0/user/{user_id}/order/history",
params={"filter": "instore", "loyaltyNumber": loyalty},
)
return response.json()
def get_order_detail(session, user_id, order_id):
response = safe_get(
session,
f"{BASE}/api/v6.0/user/{user_id}/order/history/detail/{order_id}",
params={"isInStore": "true"},
)
return response.json()
def flatten_orders(history, details, history_path=None, raw_dir=None):
orders = []
items = []
# Key the lookup by string id so int and str order ids from the two endpoints still match.
history_lookup = {str(record["orderId"]): record for record in history.get("records", [])}
history_path_value = history_path.as_posix() if history_path else ""
for detail in details:
order_id = str(detail["orderId"])
history_row = history_lookup.get(order_id, {})
pickup = detail.get("pup", {})
raw_order_path = (raw_dir / f"{order_id}.json").as_posix() if raw_dir else ""
orders.append(
{
"retailer": RETAILER,
"order_id": order_id,
"order_date": detail.get("orderDate"),
"delivery_date": detail.get("deliveryDate"),
"service_type": history_row.get("serviceType"),
"order_total": detail.get("orderTotal"),
"payment_method": detail.get("paymentMethod"),
"total_item_count": detail.get("totalItemCount"),
"total_savings": detail.get("totalSavings"),
"your_savings_total": detail.get("yourSavingsTotal"),
"coupons_discounts_total": detail.get("couponsDiscountsTotal"),
"store_name": pickup.get("storeName"),
"store_number": pickup.get("aholdStoreNumber"),
"store_address1": pickup.get("storeAddress1"),
"store_city": pickup.get("storeCity"),
"store_state": pickup.get("storeState"),
"store_zipcode": pickup.get("storeZipcode"),
"refund_order": detail.get("refundOrder"),
"ebt_order": detail.get("ebtOrder"),
"raw_history_path": history_path_value,
"raw_order_path": raw_order_path,
}
)
for line_no, item in enumerate(detail.get("items", []), start=1):
items.append(
{
"retailer": RETAILER,
"order_id": order_id,
"order_date": detail.get("orderDate"),
"line_no": str(line_no),
"retailer_item_id": "",
"pod_id": item.get("podId"),
"item_name": item.get("itemName"),
"upc": item.get("primUpcCd"),
"category_id": item.get("categoryId"),
"category": item.get("categoryDesc"),
"qty": item.get("shipQy"),
"unit": item.get("lbEachCd"),
"unit_price": item.get("unitPrice"),
"line_total": item.get("groceryAmount"),
"picked_weight": item.get("totalPickedWeight"),
"mvp_savings": item.get("mvpSavings"),
"reward_savings": item.get("rewardSavings"),
"coupon_savings": item.get("couponSavings"),
"coupon_price": item.get("couponPrice"),
"image_url": "",
"raw_order_path": raw_order_path,
"is_discount_line": "false",
"is_coupon_line": "false",
}
)
return orders, items
def normalize_row(row, fieldnames):
return {field: stringify(row.get(field)) for field in fieldnames}
def stringify(value):
if value is None:
return ""
return str(value)
def read_csv_rows(path):
if not path.exists():
return [], []
with path.open(newline="", encoding="utf-8") as handle:
reader = csv.DictReader(handle)
fieldnames = reader.fieldnames or []
return fieldnames, list(reader)
def read_existing_order_ids(path):
_, rows = read_csv_rows(path)
return {row["order_id"] for row in rows if row.get("order_id")}
def merge_rows(existing_rows, new_rows, subset):
merged = []
row_index = {}
for row in existing_rows + new_rows:
key = tuple(stringify(row.get(field)) for field in subset)
normalized = dict(row)
if key in row_index:
merged[row_index[key]] = normalized
else:
row_index[key] = len(merged)
merged.append(normalized)
return merged
def append_dedup(path, new_rows, subset, fieldnames):
existing_fieldnames, existing_rows = read_csv_rows(path)
all_fieldnames = list(dict.fromkeys(existing_fieldnames + fieldnames))
merged = merge_rows(
[normalize_row(row, all_fieldnames) for row in existing_rows],
[normalize_row(row, all_fieldnames) for row in new_rows],
subset=subset,
)
with path.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=all_fieldnames)
writer.writeheader()
writer.writerows(merged)
return merged
def write_json(path, payload):
path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
@click.command()
@click.option("--user-id", default=None, help="Giant user id.")
@click.option("--loyalty", default=None, help="Giant loyalty number.")
@click.option(
"--outdir",
default="giant_output",
show_default=True,
help="Directory for raw json and csv outputs.",
)
@click.option(
"--sleep-seconds",
default=1.5,
show_default=True,
type=float,
help="Delay between order detail requests.",
)
def main(user_id, loyalty, outdir, sleep_seconds):
click.echo("legacy entrypoint: prefer collect_giant_web.py for data-model outputs")
run_collection(user_id, loyalty, outdir, sleep_seconds)
def run_collection(
user_id,
loyalty,
outdir,
sleep_seconds,
orders_filename="orders.csv",
items_filename="items.csv",
):
config = load_config()
user_id = user_id or config["user_id"] or click.prompt("Giant user id", type=str)
loyalty = loyalty or config["loyalty"] or click.prompt(
"Giant loyalty number", type=str
)
outdir = Path(outdir)
rawdir = outdir / "raw"
rawdir.mkdir(parents=True, exist_ok=True)
orders_csv = outdir / orders_filename
items_csv = outdir / items_filename
existing_order_ids = read_existing_order_ids(orders_csv)
session = build_session()
history = get_history(session, user_id, loyalty)
history_path = rawdir / "history.json"
write_json(history_path, history)
records = history.get("records", [])
click.echo(f"history returned {len(records)} visits; Giant exposes only the most recent 50")
unseen_records = [
record
for record in records
if stringify(record.get("orderId")) not in existing_order_ids
]
click.echo(
f"found {len(unseen_records)} unseen visits "
f"({len(existing_order_ids)} already stored)"
)
details = []
for index, record in enumerate(unseen_records, start=1):
order_id = stringify(record.get("orderId"))
click.echo(f"[{index}/{len(unseen_records)}] fetching {order_id}")
detail = get_order_detail(session, user_id, order_id)
write_json(rawdir / f"{order_id}.json", detail)
details.append(detail)
if index < len(unseen_records):
time.sleep(sleep_seconds)
orders, items = flatten_orders(history, details, history_path=history_path, raw_dir=rawdir)
merged_orders = append_dedup(
orders_csv,
orders,
subset=["order_id"],
fieldnames=ORDER_FIELDS,
)
merged_items = append_dedup(
items_csv,
items,
subset=["order_id", "line_no"],
fieldnames=ITEM_FIELDS,
)
click.echo(
f"wrote {len(orders)} new orders / {len(items)} new items "
f"({len(merged_orders)} total orders, {len(merged_items)} total items)"
)
if __name__ == "__main__":
main()
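
The merge step behind `append_dedup` above behaves like an upsert keyed on a subset of columns. The sketch below restates `merge_rows` with its indentation restored and runs it on made-up rows; the sample data is illustrative only and does not come from any real order.

```python
def merge_rows(existing_rows, new_rows, subset):
    """Merge two row lists, keyed by the fields named in `subset`.

    A later row with the same key replaces the earlier one in place,
    so the first-seen ordering of keys is preserved.
    """
    merged = []
    row_index = {}
    for row in existing_rows + new_rows:
        key = tuple("" if row.get(field) is None else str(row.get(field)) for field in subset)
        if key in row_index:
            merged[row_index[key]] = dict(row)
        else:
            row_index[key] = len(merged)
            merged.append(dict(row))
    return merged


# Made-up rows for illustration.
existing = [
    {"order_id": "A1", "line_no": "1", "item_name": "old name"},
    {"order_id": "A1", "line_no": "2", "item_name": "eggs"},
]
incoming = [
    {"order_id": "A1", "line_no": "1", "item_name": "new name"},
    {"order_id": "B2", "line_no": "1", "item_name": "rotini"},
]
merged = merge_rows(existing, incoming, subset=["order_id", "line_no"])
for row in merged:
    print(row)
```

Replacing in place rather than appending keeps the CSV stable across reruns: a refetched order overwrites its earlier rows without moving them to the end of the file.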


@@ -1,181 +1,5 @@
-import json
+from scrape_giant import *  # noqa: F401,F403
import time
from pathlib import Path
import browser_cookie3
import pandas as pd
from curl_cffi import requests
BASE = "https://giantfood.com"
ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
USER_ID = "369513017"
LOYALTY = "440155630880"
def build_session():
s = requests.Session()
s.cookies.update(browser_cookie3.firefox(domain_name="giantfood.com"))
s.headers.update({
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
"accept": "application/json, text/plain, */*",
"accept-language": "en-US,en;q=0.9",
"referer": ACCOUNT_PAGE,
})
return s
def safe_get(session, url, **kwargs):
last_response = None
for attempt in range(3):
try:
r = session.get(
url,
impersonate="firefox",
timeout=30,
**kwargs,
)
last_response = r
if r.status_code == 200:
return r
print(f"retry {attempt + 1}/3 status={r.status_code}")
except Exception as e:
print(f"retry {attempt + 1}/3 error={e}")
time.sleep(3)
if last_response is not None:
last_response.raise_for_status()
raise RuntimeError(f"failed to fetch {url}")
def get_history(session):
url = f"{BASE}/api/v6.0/user/{USER_ID}/order/history"
r = safe_get(
session,
url,
params={
"filter": "instore",
"loyaltyNumber": LOYALTY,
},
)
return r.json()
def get_order_detail(session, order_id):
url = f"{BASE}/api/v6.0/user/{USER_ID}/order/history/detail/{order_id}"
r = safe_get(
session,
url,
params={"isInStore": "true"},
)
return r.json()
def flatten_orders(history, details):
orders = []
items = []
history_lookup = {
r["orderId"]: r
for r in history.get("records", [])
}
for d in details:
hist = history_lookup.get(d["orderId"], {})
pup = d.get("pup", {})
orders.append({
"order_id": d["orderId"],
"order_date": d.get("orderDate"),
"delivery_date": d.get("deliveryDate"),
"service_type": hist.get("serviceType"),
"order_total": d.get("orderTotal"),
"payment_method": d.get("paymentMethod"),
"total_item_count": d.get("totalItemCount"),
"total_savings": d.get("totalSavings"),
"your_savings_total": d.get("yourSavingsTotal"),
"coupons_discounts_total": d.get("couponsDiscountsTotal"),
"store_name": pup.get("storeName"),
"store_number": pup.get("aholdStoreNumber"),
"store_address1": pup.get("storeAddress1"),
"store_city": pup.get("storeCity"),
"store_state": pup.get("storeState"),
"store_zipcode": pup.get("storeZipcode"),
"refund_order": d.get("refundOrder"),
"ebt_order": d.get("ebtOrder"),
})
for i, item in enumerate(d.get("items", []), start=1):
items.append({
"order_id": d["orderId"],
"order_date": d.get("orderDate"),
"line_no": i,
"pod_id": item.get("podId"),
"item_name": item.get("itemName"),
"upc": item.get("primUpcCd"),
"category_id": item.get("categoryId"),
"category": item.get("categoryDesc"),
"qty": item.get("shipQy"),
"unit": item.get("lbEachCd"),
"unit_price": item.get("unitPrice"),
"line_total": item.get("groceryAmount"),
"picked_weight": item.get("totalPickedWeight"),
"mvp_savings": item.get("mvpSavings"),
"reward_savings": item.get("rewardSavings"),
"coupon_savings": item.get("couponSavings"),
"coupon_price": item.get("couponPrice"),
})
return pd.DataFrame(orders), pd.DataFrame(items)
def main():
outdir = Path("giant_output")
rawdir = outdir / "raw"
rawdir.mkdir(parents=True, exist_ok=True)
session = build_session()
print("fetching order history...")
history = get_history(session)
(rawdir / "history.json").write_text(
json.dumps(history, indent=2),
encoding="utf-8",
)
order_ids = [r["orderId"] for r in history.get("records", [])]
print(f"{len(order_ids)} orders found")
details = []
for order_id in order_ids:
print(f"fetching {order_id}")
d = get_order_detail(session, order_id)
details.append(d)
(rawdir / f"{order_id}.json").write_text(
json.dumps(d, indent=2),
encoding="utf-8",
)
time.sleep(1.5)
print("flattening data...")
orders_df, items_df = flatten_orders(history, details)
orders_df.to_csv(outdir / "orders.csv", index=False)
items_df.to_csv(outdir / "items.csv", index=False)
print("done")
print(f"{len(orders_df)} orders written to {outdir / 'orders.csv'}")
print(f"{len(items_df)} items written to {outdir / 'items.csv'}")
if __name__ == "__main__":
main()


@@ -1,28 +1,17 @@
-import requests
-import browser_cookie3
-BASE = "https://giantfood.com"
-ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
-USER_ID = "369513017"
-LOYALTY = "440155630880"
-cj = browser_cookie3.firefox(domain_name="giantfood.com")
-s = requests.Session()
-s.cookies.update(cj)
-s.headers.update({
-    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
-    "accept": "application/json, text/plain, */*",
-    "accept-language": "en-US,en;q=0.9",
-    "referer": ACCOUNT_PAGE,
-})
-r = s.get(
-    f"{BASE}/api/v6.0/user/{USER_ID}/order/history",
-    params={"filter": "instore", "loyaltyNumber": LOYALTY},
-    timeout=30,
-)
-print(r.status_code)
-print(r.text[:500])
+import unittest
+try:
+    import browser_cookie3  # noqa: F401
+    import requests  # noqa: F401
+except ImportError as exc:  # pragma: no cover - dependency-gated smoke test
+    browser_cookie3 = None
+    _IMPORT_ERROR = exc
+else:
+    _IMPORT_ERROR = None
+@unittest.skipIf(browser_cookie3 is None, f"optional smoke test dependency missing: {_IMPORT_ERROR}")
+class BrowserCookieSmokeTest(unittest.TestCase):
+    def test_dependencies_available(self):
+        self.assertIsNotNone(browser_cookie3)


@@ -1,27 +1,17 @@
-import browser_cookie3
-from curl_cffi import requests
-BASE = "https://giantfood.com"
-ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
-USER_ID = "369513017"
-LOYALTY = "440155630880"
-s = requests.Session()
-s.cookies.update(browser_cookie3.firefox(domain_name="giantfood.com"))
-s.headers.update({
-    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:148.0) Gecko/20100101 Firefox/148.0",
-    "accept": "application/json, text/plain, */*",
-    "accept-language": "en-US,en;q=0.9",
-    "referer": ACCOUNT_PAGE,
-})
-r = s.get(
-    f"{BASE}/api/v6.0/user/{USER_ID}/order/history",
-    params={"filter": "instore", "loyaltyNumber": LOYALTY},
-    impersonate="firefox",
-    timeout=30,
-)
-print(r.status_code)
-print(r.text[:500])
+import unittest
+try:
+    import browser_cookie3  # noqa: F401
+    from curl_cffi import requests  # noqa: F401
+except ImportError as exc:  # pragma: no cover - dependency-gated smoke test
+    browser_cookie3 = None
+    _IMPORT_ERROR = exc
+else:
+    _IMPORT_ERROR = None
+@unittest.skipIf(browser_cookie3 is None, f"optional smoke test dependency missing: {_IMPORT_ERROR}")
+class CurlCffiSmokeTest(unittest.TestCase):
+    def test_dependencies_available(self):
+        self.assertIsNotNone(browser_cookie3)


@@ -0,0 +1,155 @@
import sqlite3
import tempfile
import unittest
from pathlib import Path
from unittest import mock
import browser_session
import scrape_costco
class BrowserSessionTests(unittest.TestCase):
def test_read_firefox_local_storage_reads_copied_sqlite(self):
with tempfile.TemporaryDirectory() as tmpdir:
profile_dir = Path(tmpdir) / "abcd.default-release"
ls_dir = profile_dir / "storage" / "default" / "https+++www.costco.com" / "ls"
ls_dir.mkdir(parents=True)
db_path = ls_dir / "data.sqlite"
with sqlite3.connect(db_path) as connection:
connection.execute("CREATE TABLE data (key TEXT, value TEXT)")
connection.execute(
"INSERT INTO data (key, value) VALUES (?, ?)",
("costco-x-wcs-clientId", "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf"),
)
values = browser_session.read_firefox_local_storage(
profile_dir,
origin_filter="costco.com",
)
self.assertEqual(
"4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
values["costco-x-wcs-clientId"],
)
def test_load_costco_browser_headers_reads_id_token_and_client_id(self):
with tempfile.TemporaryDirectory() as tmpdir:
profile_dir = Path(tmpdir)
storage_dir = profile_dir / "storage" / "default" / "https+++www.costco.com" / "ls"
storage_dir.mkdir(parents=True)
db_path = storage_dir / "data.sqlite"
with sqlite3.connect(db_path) as connection:
connection.execute("CREATE TABLE data (key TEXT, value TEXT)")
connection.execute(
"INSERT INTO data (key, value) VALUES (?, ?)",
("idToken", "header.payload.signature"),
)
connection.execute(
"INSERT INTO data (key, value) VALUES (?, ?)",
("clientID", "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf"),
)
headers = scrape_costco.load_costco_browser_headers(
profile_dir,
authorization="",
client_id="",
client_identifier="481b1aec-aa3b-454b-b81b-48187e28f205",
)
self.assertEqual("Bearer header.payload.signature", headers["costco-x-authorization"])
self.assertEqual(
"4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
headers["costco-x-wcs-clientId"],
)
self.assertEqual(
"481b1aec-aa3b-454b-b81b-48187e28f205",
headers["client-identifier"],
)
def test_load_costco_browser_headers_prefers_env_values(self):
with tempfile.TemporaryDirectory() as tmpdir:
profile_dir = Path(tmpdir)
storage_dir = profile_dir / "storage" / "default" / "https+++www.costco.com" / "ls"
storage_dir.mkdir(parents=True)
db_path = storage_dir / "data.sqlite"
with sqlite3.connect(db_path) as connection:
connection.execute("CREATE TABLE data (key TEXT, value TEXT)")
connection.execute(
"INSERT INTO data (key, value) VALUES (?, ?)",
("idToken", "storage.payload.signature"),
)
connection.execute(
"INSERT INTO data (key, value) VALUES (?, ?)",
("clientID", "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf"),
)
headers = scrape_costco.load_costco_browser_headers(
profile_dir,
authorization="Bearer env.payload.signature",
client_id="env-client-id",
client_identifier="481b1aec-aa3b-454b-b81b-48187e28f205",
)
self.assertEqual("Bearer env.payload.signature", headers["costco-x-authorization"])
self.assertEqual("env-client-id", headers["costco-x-wcs-clientId"])
def test_scrape_costco_prompts_for_profile_dir_when_autodiscovery_fails(self):
with mock.patch.object(
scrape_costco,
"find_firefox_profile_dir",
side_effect=FileNotFoundError("no default profile"),
), mock.patch.object(
scrape_costco.click,
"prompt",
return_value=Path("/tmp/profile"),
) as mocked_prompt, mock.patch.object(
scrape_costco,
"load_config",
return_value={
"authorization": "",
"client_id": "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
"client_identifier": "481b1aec-aa3b-454b-b81b-48187e28f205",
},
), mock.patch.object(
scrape_costco,
"load_costco_browser_headers",
return_value={
"costco-x-authorization": "Bearer header.payload.signature",
"costco-x-wcs-clientId": "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
"client-identifier": "481b1aec-aa3b-454b-b81b-48187e28f205",
},
), mock.patch.object(
scrape_costco,
"build_session",
return_value=object(),
), mock.patch.object(
scrape_costco,
"fetch_summary_windows",
return_value=(
{"data": {"receiptsWithCounts": {"receipts": []}}},
[],
),
), mock.patch.object(
scrape_costco,
"write_json",
), mock.patch.object(
scrape_costco,
"write_csv",
):
scrape_costco.main.callback(
outdir="/tmp/costco_output",
document_type="all",
document_sub_type="all",
window_days=92,
months_back=3,
firefox_profile_dir=None,
)
mocked_prompt.assert_called_once()
if __name__ == "__main__":
unittest.main()
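
The first test's name suggests that `read_firefox_local_storage` reads from a copy of the profile's SQLite file rather than the live one. The helper itself is not shown in this diff, so the following is only a sketch of that pattern under stated assumptions: the function name `read_key_values` is hypothetical, and the `data` table with `key` and `value` columns mirrors the test fixture, not necessarily what a real Firefox profile stores.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def read_key_values(db_path):
    """Read all (key, value) rows from a temporary copy of a SQLite file.

    Copying first avoids contending with a lock held by a running browser.
    Assumes a `data` table with `key` and `value` columns (as in the fixture).
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        copy_path = Path(tmpdir) / "data.sqlite"
        shutil.copy2(db_path, copy_path)
        connection = sqlite3.connect(copy_path)
        try:
            rows = connection.execute("SELECT key, value FROM data").fetchall()
        finally:
            # Close explicitly so the temporary directory can be removed cleanly.
            connection.close()
    return {key: value for key, value in rows}


# Build a throwaway database shaped like the test fixture, then read it back.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_db = Path(demo_dir) / "data.sqlite"
    demo_connection = sqlite3.connect(demo_db)
    demo_connection.execute("CREATE TABLE data (key TEXT, value TEXT)")
    demo_connection.execute(
        "INSERT INTO data (key, value) VALUES (?, ?)",
        ("idToken", "header.payload.signature"),
    )
    demo_connection.commit()
    demo_connection.close()
    values = read_key_values(demo_db)

print(values)
```

Note that `with sqlite3.connect(...)` only manages the transaction and does not close the connection, which is why the sketch closes it explicitly.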


@@ -0,0 +1,119 @@
import unittest
import build_canonical_layer
class CanonicalLayerTests(unittest.TestCase):
def test_build_canonical_layer_auto_links_exact_upc_and_name_size_only(self):
observed_rows = [
{
"observed_product_id": "gobs_1",
"representative_upc": "111",
"representative_retailer_item_id": "11",
"representative_name_norm": "GALA APPLE",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "5",
"representative_size_unit": "lb",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_2",
"representative_upc": "111",
"representative_retailer_item_id": "12",
"representative_name_norm": "LARGE WHITE EGGS",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "18",
"representative_measure_type": "count",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_3",
"representative_upc": "",
"representative_retailer_item_id": "21",
"representative_name_norm": "ROTINI",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "16",
"representative_size_unit": "oz",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_4",
"representative_upc": "",
"representative_retailer_item_id": "22",
"representative_name_norm": "ROTINI",
"representative_brand": "SB",
"representative_variant": "",
"representative_size_value": "16",
"representative_size_unit": "oz",
"representative_pack_qty": "",
"representative_measure_type": "weight",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_5",
"representative_upc": "",
"representative_retailer_item_id": "99",
"representative_name_norm": "GL BAG CHARGE",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "",
"representative_measure_type": "each",
"is_fee": "true",
"is_discount_line": "false",
"is_coupon_line": "false",
},
{
"observed_product_id": "gobs_6",
"representative_upc": "",
"representative_retailer_item_id": "",
"representative_name_norm": "LIME",
"representative_brand": "",
"representative_variant": "",
"representative_size_value": "",
"representative_size_unit": "",
"representative_pack_qty": "",
"representative_measure_type": "each",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
},
]
canonicals, links = build_canonical_layer.build_canonical_layer(observed_rows)
self.assertEqual(2, len(canonicals))
self.assertEqual(4, len(links))
methods = {row["observed_product_id"]: row["link_method"] for row in links}
self.assertEqual("exact_upc", methods["gobs_1"])
self.assertEqual("exact_upc", methods["gobs_2"])
self.assertEqual("exact_name_size", methods["gobs_3"])
self.assertEqual("exact_name_size", methods["gobs_4"])
self.assertNotIn("gobs_5", methods)
self.assertNotIn("gobs_6", methods)
def test_clean_canonical_name_removes_packaging_noise(self):
self.assertEqual("LIME", build_canonical_layer.clean_canonical_name("LIME . / ."))
self.assertEqual("EGG", build_canonical_layer.clean_canonical_name("5DZ EGG / /"))
if __name__ == "__main__":
unittest.main()
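The linking rule these tests exercise can be sketched roughly as follows. This is a minimal illustration, not the actual `build_canonical_layer` code: the function name `link_observed_products`, the `can_N` id scheme, and the choice to give single-member groups a canonical are all assumptions. Only the field names and the two `link_method` values come from the test fixtures above.

```python
def link_observed_products(observed_rows):
    """Hypothetical stand-in for build_canonical_layer.build_canonical_layer."""
    groups = {}
    for row in observed_rows:
        flags = (
            row.get("is_fee"),
            row.get("is_discount_line"),
            row.get("is_coupon_line"),
        )
        if "true" in flags:
            continue  # fees, discounts, and coupons are never auto-linked

        upc = row.get("representative_upc", "")
        name = row.get("representative_name_norm", "")
        size = row.get("representative_size_value", "")
        unit = row.get("representative_size_unit", "")
        if upc:
            key = ("exact_upc", upc)
        elif name and size and unit:
            key = ("exact_name_size", name, size, unit)
        else:
            continue  # too little evidence: leave the row for manual review
        groups.setdefault(key, []).append(row)

    canonicals, links = [], []
    for number, (key, members) in enumerate(groups.items(), start=1):
        canonical_id = f"can_{number}"  # assumed id scheme, illustration only
        canonicals.append(
            {
                "canonical_product_id": canonical_id,
                "canonical_name": members[0].get("representative_name_norm", ""),
            }
        )
        for member in members:
            links.append(
                {
                    "observed_product_id": member["observed_product_id"],
                    "canonical_product_id": canonical_id,
                    "link_method": key[0],
                }
            )
    return canonicals, links
```

Under these assumptions a bare name such as `LIME`, with no UPC and no size, stays unlinked, which is the conservative behaviour the `assertNotIn` checks above expect.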


@@ -0,0 +1,517 @@
import csv
import json
import tempfile
import unittest
from pathlib import Path
from unittest import mock
import enrich_costco
import scrape_costco
import validate_cross_retailer_flow
class CostcoPipelineTests(unittest.TestCase):
def test_resolve_date_range_uses_months_back(self):
start_date, end_date = scrape_costco.resolve_date_range(
3, today=scrape_costco.parse_cli_date("3/16/2026")
)
self.assertEqual("12/16/2025", start_date)
self.assertEqual("3/16/2026", end_date)
def test_build_date_windows_splits_long_ranges(self):
windows = scrape_costco.build_date_windows("1/01/2026", "6/30/2026", 92)
self.assertEqual(
[
{"startDate": "1/01/2026", "endDate": "4/02/2026"},
{"startDate": "4/03/2026", "endDate": "6/30/2026"},
],
windows,
)
def test_fetch_summary_windows_records_metadata_and_warns_on_mismatch(self):
payloads = [
{
"data": {
"receiptsWithCounts": {
"inWarehouse": 2,
"gasStation": 0,
"carWash": 0,
"gasAndCarWash": 0,
"receipts": [
{
"transactionBarcode": "abc",
"receiptType": "In-Warehouse",
}
],
}
}
},
{
"data": {
"receiptsWithCounts": {
"inWarehouse": 1,
"gasStation": 0,
"carWash": 0,
"gasAndCarWash": 0,
"receipts": [
{
"transactionBarcode": "def",
"receiptType": "In-Warehouse",
}
],
}
}
},
]
with mock.patch.object(
scrape_costco, "graphql_post", side_effect=payloads
) as mocked_post, mock.patch.object(scrape_costco.click, "echo") as mocked_echo:
summary_payload, metadata = scrape_costco.fetch_summary_windows(
session=object(),
start_date="1/01/2026",
end_date="6/30/2026",
document_type="all",
document_sub_type="all",
window_days=92,
)
self.assertEqual(2, mocked_post.call_count)
self.assertEqual(2, len(metadata))
self.assertTrue(metadata[0]["countMismatch"])
self.assertFalse(metadata[1]["countMismatch"])
self.assertEqual("1/01/2026", metadata[0]["startDate"])
self.assertEqual("4/03/2026", metadata[1]["startDate"])
self.assertEqual(
["abc", "def"],
[
row["transactionBarcode"]
for row in scrape_costco.summary_receipts(summary_payload)
],
)
mocked_echo.assert_called_once()
warning_text = mocked_echo.call_args.args[0]
self.assertIn("warning: summary count mismatch", warning_text)
def test_flatten_costco_data_preserves_discount_rows(self):
summary_payload = {
"data": {
"receiptsWithCounts": {
"receipts": [
{
"transactionBarcode": "abc",
"tenderArray": [{"tenderDescription": "VISA"}],
"couponArray": [{"upcnumberCoupon": "2100003746641"}],
}
]
}
}
}
detail_payloads = [
{
"data": {
"receiptsWithCounts": {
"receipts": [
{
"transactionBarcode": "abc",
"transactionDate": "2026-03-12",
"receiptType": "In-Warehouse",
"total": 10.0,
"totalItemCount": 2,
"instantSavings": 5.0,
"warehouseName": "MT VERNON",
"warehouseNumber": 1115,
"warehouseAddress1": "7940 RICHMOND HWY",
"warehouseCity": "ALEXANDRIA",
"warehouseState": "VA",
"warehousePostalCode": "22306",
"itemArray": [
{
"itemNumber": "4873222",
"itemDescription01": "ALL F&C",
"itemDescription02": "200OZ 160LOADS P104",
"itemDepartmentNumber": 14,
"transDepartmentNumber": 14,
"unit": 1,
"itemIdentifier": "E",
"amount": 19.99,
"itemUnitPriceAmount": 19.99,
},
{
"itemNumber": "374664",
"itemDescription01": "/ 4873222",
"itemDescription02": None,
"itemDepartmentNumber": 14,
"transDepartmentNumber": 14,
"unit": -1,
"itemIdentifier": None,
"amount": -5,
"itemUnitPriceAmount": 0,
},
],
}
]
}
}
}
]
orders, items = scrape_costco.flatten_costco_data(
summary_payload, detail_payloads, Path("costco_output/raw")
)
self.assertEqual(1, len(orders))
self.assertEqual(2, len(items))
self.assertEqual("false", items[0]["is_discount_line"])
self.assertEqual("true", items[1]["is_discount_line"])
self.assertEqual("true", items[1]["is_coupon_line"])
def test_flatten_costco_data_uses_composite_summary_lookup_key(self):
summary_payload = {
"data": {
"receiptsWithCounts": {
"receipts": [
{
"transactionBarcode": "dup",
"transactionDateTime": "2026-03-12T16:16:00",
"tenderArray": [{"tenderDescription": "VISA"}],
"couponArray": [{"upcnumberCoupon": "111"}],
},
{
"transactionBarcode": "dup",
"transactionDateTime": "2026-02-14T16:25:00",
"tenderArray": [{"tenderDescription": "MASTERCARD"}],
"couponArray": [],
},
]
}
}
}
detail_payloads = [
{
"data": {
"receiptsWithCounts": {
"receipts": [
{
"transactionBarcode": "dup",
"transactionDateTime": "2026-03-12T16:16:00",
"transactionDate": "2026-03-12",
"receiptType": "In-Warehouse",
"total": 10.0,
"totalItemCount": 1,
"instantSavings": 5.0,
"warehouseName": "MT VERNON",
"warehouseNumber": 1115,
"warehouseAddress1": "7940 RICHMOND HWY",
"warehouseCity": "ALEXANDRIA",
"warehouseState": "VA",
"warehousePostalCode": "22306",
"itemArray": [
{
"itemNumber": "111",
"itemDescription01": "/ 111",
"itemDescription02": None,
"itemDepartmentNumber": 14,
"transDepartmentNumber": 14,
"unit": -1,
"itemIdentifier": None,
"amount": -5,
"itemUnitPriceAmount": 0,
}
],
}
]
}
}
}
]
orders, items = scrape_costco.flatten_costco_data(
summary_payload, detail_payloads, Path("costco_output/raw")
)
self.assertEqual("VISA", orders[0]["payment_method"])
self.assertEqual("true", items[0]["is_coupon_line"])
self.assertIn("dup-2026-03-12T16-16-00.json", items[0]["raw_order_path"])
def test_costco_enricher_parses_size_pack_and_discount(self):
row = enrich_costco.parse_costco_item(
order_id="abc",
order_date="2026-03-12",
raw_path=Path("costco_output/raw/abc.json"),
line_no=1,
item={
"itemNumber": "60357",
"itemDescription01": "MIXED PEPPER",
"itemDescription02": "6-PACK",
"itemDepartmentNumber": 65,
"transDepartmentNumber": 65,
"unit": 1,
"itemIdentifier": "E",
"amount": 7.49,
"itemUnitPriceAmount": 7.49,
},
)
self.assertEqual("60357", row["retailer_item_id"])
self.assertEqual("MIXED PEPPER", row["item_name_norm"])
self.assertEqual("6", row["pack_qty"])
self.assertEqual("count", row["measure_type"])
self.assertEqual("costco:abc:1", row["normalized_row_id"])
self.assertEqual("exact_retailer_item_id", row["normalization_basis"])
self.assertTrue(row["normalized_item_id"])
self.assertEqual("6", row["normalized_quantity"])
self.assertEqual("count", row["normalized_quantity_unit"])
discount = enrich_costco.parse_costco_item(
order_id="abc",
order_date="2026-03-12",
raw_path=Path("costco_output/raw/abc.json"),
line_no=2,
item={
"itemNumber": "374664",
"itemDescription01": "/ 4873222",
"itemDescription02": None,
"itemDepartmentNumber": 14,
"transDepartmentNumber": 14,
"unit": -1,
"itemIdentifier": None,
"amount": -5,
"itemUnitPriceAmount": 0,
},
)
self.assertEqual("true", discount["is_discount_line"])
self.assertEqual("true", discount["is_coupon_line"])
self.assertEqual("false", discount["is_item"])
def test_build_items_enriched_matches_discount_to_item(self):
with tempfile.TemporaryDirectory() as tmpdir:
raw_dir = Path(tmpdir) / "raw"
raw_dir.mkdir()
payload = {
"data": {
"receiptsWithCounts": {
"receipts": [
{
"transactionBarcode": "abc",
"transactionDate": "2026-03-12",
"itemArray": [
{
"itemNumber": "4873222",
"itemDescription01": "ALL F&C",
"itemDescription02": "200OZ 160LOADS P104",
"itemDepartmentNumber": 14,
"transDepartmentNumber": 14,
"unit": 1,
"itemIdentifier": "E",
"amount": 19.99,
"itemUnitPriceAmount": 19.99,
},
{
"itemNumber": "374664",
"itemDescription01": "/ 4873222",
"itemDescription02": None,
"itemDepartmentNumber": 14,
"transDepartmentNumber": 14,
"unit": -1,
"itemIdentifier": None,
"amount": -5,
"itemUnitPriceAmount": 0,
},
],
}
]
}
}
}
(raw_dir / "abc.json").write_text(json.dumps(payload), encoding="utf-8")
rows = enrich_costco.build_items_enriched(raw_dir)
purchase_row = next(row for row in rows if row["is_discount_line"] == "false")
discount_row = next(row for row in rows if row["is_discount_line"] == "true")
self.assertEqual("-5", purchase_row["matched_discount_amount"])
self.assertEqual("14.99", purchase_row["net_line_total"])
self.assertIn("matched_discount=4873222", purchase_row["parse_notes"])
self.assertIn("matched_to_item=4873222", discount_row["parse_notes"])
def test_cross_retailer_validation_writes_proof_example(self):
with tempfile.TemporaryDirectory() as tmpdir:
giant_csv = Path(tmpdir) / "giant_items_enriched.csv"
costco_csv = Path(tmpdir) / "costco_items_enriched.csv"
outdir = Path(tmpdir) / "combined"
fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames}
giant_row.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"order_date": "2026-03-01",
"retailer_item_id": "100",
"item_name": "FRESH BANANA",
"item_name_norm": "BANANA",
"upc": "4011",
"measure_type": "weight",
"is_store_brand": "false",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "1.29",
}
)
costco_row = {field: "" for field in fieldnames}
costco_row.update(
{
"retailer": "costco",
"order_id": "c1",
"line_no": "1",
"order_date": "2026-03-12",
"retailer_item_id": "30669",
"item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA",
"upc": "",
"size_value": "3",
"size_unit": "lb",
"measure_type": "weight",
"is_store_brand": "false",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "2.98",
}
)
with giant_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(giant_row)
with costco_csv.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(costco_row)
validate_cross_retailer_flow.main.callback(
giant_items_enriched_csv=str(giant_csv),
costco_items_enriched_csv=str(costco_csv),
outdir=str(outdir),
)
proof_path = outdir / "proof_examples.csv"
self.assertTrue(proof_path.exists())
with proof_path.open(newline="", encoding="utf-8") as handle:
rows = list(csv.DictReader(handle))
self.assertEqual(1, len(rows))
self.assertEqual("banana", rows[0]["proof_name"])
def test_main_writes_summary_request_metadata(self):
with tempfile.TemporaryDirectory() as tmpdir:
outdir = Path(tmpdir) / "costco_output"
summary_payload = {
"data": {
"receiptsWithCounts": {
"inWarehouse": 1,
"gasStation": 0,
"carWash": 0,
"gasAndCarWash": 0,
"receipts": [
{
"transactionBarcode": "abc",
"receiptType": "In-Warehouse",
"tenderArray": [],
"couponArray": [],
}
],
}
}
}
detail_payload = {
"data": {
"receiptsWithCounts": {
"receipts": [
{
"transactionBarcode": "abc",
"transactionDate": "2026-03-12",
"receiptType": "In-Warehouse",
"total": 10.0,
"totalItemCount": 1,
"instantSavings": 0,
"warehouseName": "MT VERNON",
"warehouseNumber": 1115,
"warehouseAddress1": "7940 RICHMOND HWY",
"warehouseCity": "ALEXANDRIA",
"warehouseState": "VA",
"warehousePostalCode": "22306",
"itemArray": [],
}
]
}
}
}
metadata = [
{
"startDate": "1/01/2026",
"endDate": "3/31/2026",
"text": "custom",
"documentType": "all",
"documentSubType": "all",
"returnedReceipts": 1,
"returnedInWarehouseReceipts": 1,
"inWarehouse": 1,
"gasStation": 0,
"carWash": 0,
"gasAndCarWash": 0,
"countMismatch": False,
}
]
with mock.patch.object(
scrape_costco,
"load_config",
return_value={
"authorization": "",
"client_id": "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
"client_identifier": "481b1aec-aa3b-454b-b81b-48187e28f205",
},
), mock.patch.object(
scrape_costco,
"find_firefox_profile_dir",
return_value=Path("/tmp/profile"),
), mock.patch.object(
scrape_costco,
"load_costco_browser_headers",
return_value={
"costco-x-authorization": "Bearer header.payload.signature",
"costco-x-wcs-clientId": "4900eb1f-0c10-4bd9-99c3-c59e6c1ecebf",
"client-identifier": "481b1aec-aa3b-454b-b81b-48187e28f205",
},
), mock.patch.object(
scrape_costco, "build_session", return_value=object()
), mock.patch.object(
scrape_costco,
"fetch_summary_windows",
return_value=(summary_payload, metadata),
), mock.patch.object(
scrape_costco,
"graphql_post",
return_value=detail_payload,
):
scrape_costco.main.callback(
outdir=str(outdir),
document_type="all",
document_sub_type="all",
window_days=92,
months_back=3,
firefox_profile_dir=None,
)
metadata_path = outdir / "raw" / "summary_requests.json"
self.assertTrue(metadata_path.exists())
saved_metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
self.assertEqual(metadata, saved_metadata)
if __name__ == "__main__":
unittest.main()
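For readers following `test_build_date_windows_splits_long_ranges`, the windowing it checks can be sketched as below. This is an assumed re-implementation, not the code in `scrape_costco`: the helper `format_cli_date` is hypothetical, and the date format (unpadded month, zero-padded day) is inferred from strings such as `1/01/2026` in the assertions.

```python
from datetime import datetime, timedelta


def parse_cli_date(text):
    # Accepts dates such as "1/01/2026" or "12/16/2025".
    return datetime.strptime(text, "%m/%d/%Y").date()


def format_cli_date(value):
    # Hypothetical helper: unpadded month, zero-padded day, four-digit year.
    return f"{value.month}/{value.day:02d}/{value.year}"


def build_date_windows(start_date, end_date, window_days):
    """Split an inclusive date range into consecutive inclusive windows."""
    start = parse_cli_date(start_date)
    end = parse_cli_date(end_date)
    windows = []
    while start <= end:
        # A window of N days ends N - 1 days after it starts.
        window_end = min(start + timedelta(days=window_days - 1), end)
        windows.append(
            {
                "startDate": format_cli_date(start),
                "endDate": format_cli_date(window_end),
            }
        )
        start = window_end + timedelta(days=1)
    return windows
```

Treating both ends as inclusive is what makes a 92-day window starting on 1/01/2026 end on 4/02/2026, with the next window starting on 4/03/2026.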

tests/test_enrich_giant.py Normal file

@@ -0,0 +1,199 @@
import csv
import json
import tempfile
import unittest
from pathlib import Path
import enrich_giant
class EnrichGiantTests(unittest.TestCase):
def test_parse_size_and_pack_handles_pack_and_weight_tokens(self):
size_value, size_unit, pack_qty = enrich_giant.parse_size_and_pack(
"COKE CHERRY 6PK 7.5Z"
)
self.assertEqual("7.5", size_value)
self.assertEqual("oz", size_unit)
self.assertEqual("6", pack_qty)
def test_parse_item_marks_store_brand_fee_and_weight_prices(self):
row = enrich_giant.parse_item(
order_id="abc123",
order_date="2026-03-01",
raw_path=Path("raw/abc123.json"),
line_no=1,
item={
"podId": 1,
"shipQy": 1,
"totalPickedWeight": 2,
"unitPrice": 3.98,
"itemName": "+SB GALA APPLE 5 LB",
"lbEachCd": "LB",
"groceryAmount": 3.98,
"primUpcCd": "111",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
"image": {"large": "https://example.test/apple.jpg"},
},
)
self.assertEqual("SB", row["brand_guess"])
self.assertEqual("GALA APPLE", row["item_name_norm"])
self.assertEqual("5", row["size_value"])
self.assertEqual("lb", row["size_unit"])
self.assertEqual("weight", row["measure_type"])
self.assertEqual("true", row["is_store_brand"])
self.assertEqual("1.99", row["price_per_lb"])
self.assertEqual("0.1244", row["price_per_oz"])
self.assertEqual("https://example.test/apple.jpg", row["image_url"])
self.assertEqual("giant:abc123:1", row["normalized_row_id"])
self.assertEqual("exact_upc", row["normalization_basis"])
self.assertEqual("5", row["normalized_quantity"])
self.assertEqual("lb", row["normalized_quantity_unit"])
self.assertEqual("true", row["is_item"])
fee_row = enrich_giant.parse_item(
order_id="abc123",
order_date="2026-03-01",
raw_path=Path("raw/abc123.json"),
line_no=2,
item={
"podId": 2,
"shipQy": 1,
"totalPickedWeight": 0,
"unitPrice": 0.05,
"itemName": "GL BAG CHARGE",
"lbEachCd": "EA",
"groceryAmount": 0.05,
"primUpcCd": "",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
},
)
self.assertEqual("true", fee_row["is_fee"])
self.assertEqual("GL BAG CHARGE", fee_row["item_name_norm"])
self.assertEqual("false", fee_row["is_item"])
def test_parse_item_derives_packaged_weight_prices_from_size_tokens(self):
row = enrich_giant.parse_item(
order_id="abc123",
order_date="2026-03-01",
raw_path=Path("raw/abc123.json"),
line_no=1,
item={
"podId": 1,
"shipQy": 2,
"totalPickedWeight": 0,
"unitPrice": 3.0,
"itemName": "PEPSI 6PK 7.5Z",
"lbEachCd": "EA",
"groceryAmount": 6.0,
"primUpcCd": "111",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
},
)
self.assertEqual("weight", row["measure_type"])
self.assertEqual("6", row["pack_qty"])
self.assertEqual("7.5", row["size_value"])
self.assertEqual("0.0667", row["price_per_oz"])
self.assertEqual("1.0667", row["price_per_lb"])
def test_build_items_enriched_reads_raw_order_files_and_writes_csv(self):
with tempfile.TemporaryDirectory() as tmpdir:
raw_dir = Path(tmpdir) / "raw"
raw_dir.mkdir()
(raw_dir / "history.json").write_text("{}", encoding="utf-8")
(raw_dir / "order-2.json").write_text(
json.dumps(
{
"orderId": "order-2",
"orderDate": "2026-03-02",
"items": [
{
"podId": 20,
"shipQy": 1,
"totalPickedWeight": 0,
"unitPrice": 2.99,
"itemName": "SB ROTINI 16Z",
"lbEachCd": "EA",
"groceryAmount": 2.99,
"primUpcCd": "222",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
"image": {"small": "https://example.test/rotini.jpg"},
}
],
}
),
encoding="utf-8",
)
(raw_dir / "order-1.json").write_text(
json.dumps(
{
"orderId": "order-1",
"orderDate": "2026-03-01",
"items": [
{
"podId": 10,
"shipQy": 2,
"totalPickedWeight": 0,
"unitPrice": 1.5,
"itemName": "PEPSI 6PK 7.5Z",
"lbEachCd": "EA",
"groceryAmount": 3.0,
"primUpcCd": "111",
"mvpSavings": 0,
"rewardSavings": 0,
"couponSavings": 0,
"couponPrice": 0,
"categoryId": "1",
"categoryDesc": "Grocery",
}
],
}
),
encoding="utf-8",
)
rows = enrich_giant.build_items_enriched(raw_dir)
output_csv = Path(tmpdir) / "items_enriched.csv"
enrich_giant.write_csv(output_csv, rows)
self.assertEqual(["order-1", "order-2"], [row["order_id"] for row in rows])
self.assertEqual("PEPSI", rows[0]["item_name_norm"])
self.assertEqual("6", rows[0]["pack_qty"])
self.assertEqual("7.5", rows[0]["size_value"])
self.assertEqual("10", rows[0]["retailer_item_id"])
self.assertEqual("true", rows[1]["is_store_brand"])
self.assertTrue(rows[0]["normalized_item_id"])
self.assertEqual("exact_upc", rows[0]["normalization_basis"])
with output_csv.open(newline="", encoding="utf-8") as handle:
written_rows = list(csv.DictReader(handle))
self.assertEqual(2, len(written_rows))
self.assertEqual(enrich_giant.OUTPUT_FIELDS, list(written_rows[0].keys()))
if __name__ == "__main__":
unittest.main()
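The size and pack parsing asserted above could look something like this sketch. It is illustrative only: the regular expressions, the `format_number` and `packaged_weight_prices` helpers, and the rounding to four decimals are assumptions inferred from the expected values (`"7.5"`, `"oz"`, `"6"`, `"0.0667"`, `"1.0667"`), not the actual `enrich_giant` implementation.

```python
import re

PACK_RE = re.compile(r"\b(\d+)\s*PK\b")
SIZE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(OZ|Z|LB)\b")
UNIT_ALIASES = {"OZ": "oz", "Z": "oz", "LB": "lb"}
OUNCES_PER_POUND = 16


def parse_size_and_pack(item_name):
    """Return (size_value, size_unit, pack_qty) as strings, "" when absent."""
    text = item_name.upper()
    size = SIZE_RE.search(text)
    pack = PACK_RE.search(text)
    size_value = size.group(1) if size else ""
    size_unit = UNIT_ALIASES[size.group(2)] if size else ""
    pack_qty = pack.group(1) if pack else ""
    return size_value, size_unit, pack_qty


def format_number(value):
    # Hypothetical formatter: four decimals, trailing zeros trimmed.
    return f"{value:.4f}".rstrip("0").rstrip(".")


def packaged_weight_prices(line_total, qty, pack_qty, size_value, size_unit):
    """Return (price_per_oz, price_per_lb) for a packaged item."""
    per_unit_oz = float(size_value)
    if size_unit == "lb":
        per_unit_oz *= OUNCES_PER_POUND
    total_oz = qty * float(pack_qty or 1) * per_unit_oz
    price_per_oz = line_total / total_oz
    price_per_lb = line_total / (total_oz / OUNCES_PER_POUND)
    return format_number(price_per_oz), format_number(price_per_lb)
```

For the `PEPSI 6PK 7.5Z` fixture, two units of a six-pack of 7.5 oz cans come to 90 oz, so a 6.00 line total works out to about 0.0667 per ounce and 1.0667 per pound.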


@@ -1,66 +1,17 @@
-import requests
-from playwright.sync_api import sync_playwright
-BASE = "https://giantfood.com"
-ACCOUNT_PAGE = f"{BASE}/account/history/invoice/in-store"
-USER_ID = "369513017"
-LOYALTY = "440155630880"
-def get_session():
-    with sync_playwright() as p:
-        browser = p.firefox.launch(headless=False)
-        page = browser.new_page()
-        page.goto(ACCOUNT_PAGE)
-        print("log in manually in the browser, then press ENTER here")
-        input()
-        cookies = page.context.cookies()
-        ua = page.evaluate("() => navigator.userAgent")
-        browser.close()
-    s = requests.Session()
-    s.headers.update({
-        "user-agent": ua,
-        "accept": "application/json, text/plain, */*",
-        "referer": ACCOUNT_PAGE,
-    })
-    for c in cookies:
-        domain = c.get("domain", "").lstrip(".") or "giantfood.com"
-        s.cookies.set(c["name"], c["value"], domain=domain)
-    return s
-def test_history(session):
-    url = f"{BASE}/api/v6.0/user/{USER_ID}/order/history"
-    r = session.get(
-        url,
-        params={
-            "filter": "instore",
-            "loyaltyNumber": LOYALTY,
-        },
-    )
-    print("status:", r.status_code)
-    print()
-    data = r.json()
-    print("orders found:", len(data.get("records", [])))
-    print()
-    for rec in data.get("records", [])[:5]:
-        print(rec["orderId"], rec["orderDate"], rec["orderTotal"])
-if __name__ == "__main__":
-    session = get_session()
-    test_history(session)
+import unittest
+try:
+    from playwright.sync_api import sync_playwright  # noqa: F401
+    import requests  # noqa: F401
+except ImportError as exc:  # pragma: no cover - dependency-gated smoke test
+    sync_playwright = None
+    _IMPORT_ERROR = exc
+else:
+    _IMPORT_ERROR = None
+@unittest.skipIf(sync_playwright is None, f"optional smoke test dependency missing: {_IMPORT_ERROR}")
+class GiantLoginSmokeTest(unittest.TestCase):
+    def test_dependencies_available(self):
+        self.assertIsNotNone(sync_playwright)


@@ -0,0 +1,67 @@
import unittest
import build_observed_products
class ObservedProductTests(unittest.TestCase):
def test_build_observed_products_aggregates_rows_with_same_key(self):
rows = [
{
"retailer": "giant",
"order_id": "1",
"line_no": "1",
"order_date": "2026-01-01",
"item_name": "SB GALA APPLE 5LB",
"item_name_norm": "GALA APPLE",
"retailer_item_id": "11",
"upc": "111",
"brand_guess": "SB",
"variant": "",
"size_value": "5",
"size_unit": "lb",
"pack_qty": "",
"measure_type": "weight",
"image_url": "https://example.test/a.jpg",
"is_store_brand": "true",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "7.99",
},
{
"retailer": "giant",
"order_id": "2",
"line_no": "1",
"order_date": "2026-01-10",
"item_name": "SB GALA APPLE 5 LB",
"item_name_norm": "GALA APPLE",
"retailer_item_id": "11",
"upc": "111",
"brand_guess": "SB",
"variant": "",
"size_value": "5",
"size_unit": "lb",
"pack_qty": "",
"measure_type": "weight",
"image_url": "",
"is_store_brand": "true",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"line_total": "8.49",
},
]
observed = build_observed_products.build_observed_products(rows)
self.assertEqual(1, len(observed))
self.assertEqual("2", observed[0]["times_seen"])
self.assertEqual("2026-01-01", observed[0]["first_seen_date"])
self.assertEqual("2026-01-10", observed[0]["last_seen_date"])
self.assertEqual("11", observed[0]["representative_retailer_item_id"])
self.assertEqual("111", observed[0]["representative_upc"])
self.assertIn("SB GALA APPLE 5LB", observed[0]["raw_name_examples"])
if __name__ == "__main__":
unittest.main()
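The aggregation this test describes can be sketched as a group-by over a composite key. The sketch below is an assumption-laden illustration rather than the real `build_observed_products` module: the key layout mirrors the `giant|upc=111|name=GALA APPLE` string seen in the review-queue fixtures, while the `" | "` join for `raw_name_examples` and the use of the first row as the representative are guesses.

```python
def build_observed_products(rows):
    """Hypothetical sketch: collapse enriched item rows into observed products."""
    grouped = {}
    for row in rows:
        key = f"{row['retailer']}|upc={row['upc']}|name={row['item_name_norm']}"
        grouped.setdefault(key, []).append(row)

    observed = []
    for key, members in grouped.items():
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        dates = sorted(member["order_date"] for member in members)
        names = []
        for member in members:
            if member["item_name"] not in names:
                names.append(member["item_name"])
        representative = members[0]  # assumed choice of representative row
        observed.append(
            {
                "observed_key": key,
                "times_seen": str(len(members)),
                "first_seen_date": dates[0],
                "last_seen_date": dates[-1],
                "representative_retailer_item_id": representative["retailer_item_id"],
                "representative_upc": representative["upc"],
                "raw_name_examples": " | ".join(names),
            }
        )
    return observed
```

Counts are emitted as strings here because the test compares `times_seen` against `"2"`, which suggests the rows are written straight to CSV.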


@@ -0,0 +1,80 @@
import unittest
import report_pipeline_status
class PipelineStatusTests(unittest.TestCase):
def test_build_status_summary_reports_unresolved_and_reviewed_counts(self):
summary = report_pipeline_status.build_status_summary(
giant_orders=[{"order_id": "g1"}],
giant_items=[{"order_id": "g1", "line_no": "1"}],
giant_enriched=[
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"item_name_norm": "BANANA",
"item_name": "FRESH BANANA",
"retailer_item_id": "1",
"upc": "4011",
"brand_guess": "",
"variant": "",
"size_value": "",
"size_unit": "",
"pack_qty": "",
"measure_type": "weight",
"image_url": "",
"is_store_brand": "false",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"order_date": "2026-03-01",
"line_total": "1.29",
}
],
costco_orders=[],
costco_items=[],
costco_enriched=[],
purchases=[
{
"observed_product_id": "gobs_banana",
"canonical_product_id": "gcan_banana",
"resolution_action": "",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"retailer": "giant",
"raw_item_name": "FRESH BANANA",
"normalized_item_name": "BANANA",
"upc": "4011",
"line_total": "1.29",
},
{
"observed_product_id": "gobs_lime",
"canonical_product_id": "",
"resolution_action": "",
"is_fee": "false",
"is_discount_line": "false",
"is_coupon_line": "false",
"retailer": "costco",
"raw_item_name": "LIME 5LB",
"normalized_item_name": "LIME",
"upc": "",
"line_total": "4.99",
},
],
resolutions=[],
)
counts = {row["stage"]: row["count"] for row in summary}
self.assertEqual(1, counts["raw_orders"])
self.assertEqual(1, counts["raw_items"])
self.assertEqual(1, counts["enriched_items"])
self.assertEqual(1, counts["canonical_linked_purchase_rows"])
self.assertEqual(1, counts["unresolved_purchase_rows"])
self.assertEqual(1, counts["review_queue_observed_products"])
self.assertEqual(0, counts["unresolved_not_in_review_rows"])
if __name__ == "__main__":
unittest.main()
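Part of the accounting asserted above, the split of purchase rows into linked and unresolved, might be computed along these lines. This is a partial, hypothetical sketch: `summarize_purchase_links` is not a function from the diff, it covers only three of the stages the test checks, and excluding fee, discount, and coupon rows from the counts is an assumption.

```python
def is_item_row(row):
    # Assumption: fee, discount, and coupon lines are not counted as purchases.
    flags = (row.get("is_fee"), row.get("is_discount_line"), row.get("is_coupon_line"))
    return "true" not in flags


def summarize_purchase_links(purchases):
    """Hypothetical sketch covering three stages of the status summary."""
    item_rows = [row for row in purchases if is_item_row(row)]
    linked = [row for row in item_rows if row.get("canonical_product_id")]
    unresolved = [row for row in item_rows if not row.get("canonical_product_id")]
    # Several unresolved rows can point at the same observed product.
    review_ids = {row["observed_product_id"] for row in unresolved}
    return [
        {"stage": "canonical_linked_purchase_rows", "count": len(linked)},
        {"stage": "unresolved_purchase_rows", "count": len(unresolved)},
        {"stage": "review_queue_observed_products", "count": len(review_ids)},
    ]
```

Returning a list of `stage`/`count` rows matches how the test builds its `counts` lookup from the summary.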

tests/test_purchases.py Normal file

@@ -0,0 +1,301 @@
import csv
import tempfile
import unittest
from pathlib import Path
import build_purchases
import enrich_costco
class PurchaseLogTests(unittest.TestCase):
def test_derive_metrics_prefers_picked_weight_and_pack_count(self):
metrics = build_purchases.derive_metrics(
{
"line_total": "4.00",
"qty": "1",
"pack_qty": "4",
"size_value": "",
"size_unit": "",
"picked_weight": "2",
"price_per_each": "",
"price_per_lb": "",
"price_per_oz": "",
}
)
self.assertEqual("4", metrics["price_per_each"])
self.assertEqual("1", metrics["price_per_count"])
self.assertEqual("2", metrics["price_per_lb"])
self.assertEqual("0.125", metrics["price_per_oz"])
self.assertEqual("picked_weight_lb", metrics["price_per_lb_basis"])
def test_build_purchase_rows_maps_canonical_ids(self):
fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames}
giant_row.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"observed_item_key": "giant:g1:1",
"order_date": "2026-03-01",
"item_name": "FRESH BANANA",
"item_name_norm": "BANANA",
"image_url": "https://example.test/banana.jpg",
"retailer_item_id": "100",
"upc": "4011",
"qty": "1",
"unit": "LB",
"line_total": "1.29",
"unit_price": "1.29",
"measure_type": "weight",
"price_per_lb": "1.29",
"raw_order_path": "giant_output/raw/g1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
costco_row = {field: "" for field in fieldnames}
costco_row.update(
{
"retailer": "costco",
"order_id": "c1",
"line_no": "1",
"observed_item_key": "costco:c1:1",
"order_date": "2026-03-12",
"item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA",
"retailer_item_id": "30669",
"qty": "1",
"unit": "E",
"line_total": "2.98",
"unit_price": "2.98",
"size_value": "3",
"size_unit": "lb",
"measure_type": "weight",
"price_per_lb": "0.9933",
"raw_order_path": "costco_output/raw/c1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
giant_orders = [
{
"order_id": "g1",
"store_name": "Giant",
"store_number": "42",
"store_city": "Springfield",
"store_state": "VA",
}
]
costco_orders = [
{
"order_id": "c1",
"store_name": "MT VERNON",
"store_number": "1115",
"store_city": "ALEXANDRIA",
"store_state": "VA",
}
]
rows, _observed, _canon, _links = build_purchases.build_purchase_rows(
[giant_row],
[costco_row],
giant_orders,
costco_orders,
[],
)
self.assertEqual(2, len(rows))
self.assertTrue(all(row["canonical_product_id"] for row in rows))
self.assertEqual({"giant", "costco"}, {row["retailer"] for row in rows})
self.assertEqual("https://example.test/banana.jpg", rows[0]["image_url"])
def test_main_writes_purchase_and_example_csvs(self):
with tempfile.TemporaryDirectory() as tmpdir:
giant_items = Path(tmpdir) / "giant_items.csv"
costco_items = Path(tmpdir) / "costco_items.csv"
giant_orders = Path(tmpdir) / "giant_orders.csv"
costco_orders = Path(tmpdir) / "costco_orders.csv"
resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
catalog_csv = Path(tmpdir) / "canonical_catalog.csv"
links_csv = Path(tmpdir) / "product_links.csv"
purchases_csv = Path(tmpdir) / "combined" / "purchases.csv"
examples_csv = Path(tmpdir) / "combined" / "comparison_examples.csv"
fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames}
giant_row.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"observed_item_key": "giant:g1:1",
"order_date": "2026-03-01",
"item_name": "FRESH BANANA",
"item_name_norm": "BANANA",
"retailer_item_id": "100",
"upc": "4011",
"qty": "1",
"unit": "LB",
"line_total": "1.29",
"unit_price": "1.29",
"measure_type": "weight",
"price_per_lb": "1.29",
"raw_order_path": "giant_output/raw/g1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
costco_row = {field: "" for field in fieldnames}
costco_row.update(
{
"retailer": "costco",
"order_id": "c1",
"line_no": "1",
"observed_item_key": "costco:c1:1",
"order_date": "2026-03-12",
"item_name": "BANANAS 3 LB / 1.36 KG",
"item_name_norm": "BANANA",
"retailer_item_id": "30669",
"qty": "1",
"unit": "E",
"line_total": "2.98",
"unit_price": "2.98",
"size_value": "3",
"size_unit": "lb",
"measure_type": "weight",
"price_per_lb": "0.9933",
"raw_order_path": "costco_output/raw/c1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
for path, source_rows in [
(giant_items, [giant_row]),
(costco_items, [costco_row]),
]:
with path.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(source_rows)
order_fields = ["order_id", "store_name", "store_number", "store_city", "store_state"]
for path, source_rows in [
(
giant_orders,
[
{
"order_id": "g1",
"store_name": "Giant",
"store_number": "42",
"store_city": "Springfield",
"store_state": "VA",
}
],
),
(
costco_orders,
[
{
"order_id": "c1",
"store_name": "MT VERNON",
"store_number": "1115",
"store_city": "ALEXANDRIA",
"store_state": "VA",
}
],
),
]:
with path.open("w", newline="", encoding="utf-8") as handle:
writer = csv.DictWriter(handle, fieldnames=order_fields)
writer.writeheader()
writer.writerows(source_rows)
build_purchases.main.callback(
giant_items_enriched_csv=str(giant_items),
costco_items_enriched_csv=str(costco_items),
giant_orders_csv=str(giant_orders),
costco_orders_csv=str(costco_orders),
resolutions_csv=str(resolutions_csv),
catalog_csv=str(catalog_csv),
links_csv=str(links_csv),
output_csv=str(purchases_csv),
examples_csv=str(examples_csv),
)
self.assertTrue(purchases_csv.exists())
self.assertTrue(examples_csv.exists())
with purchases_csv.open(newline="", encoding="utf-8") as handle:
purchase_rows = list(csv.DictReader(handle))
with examples_csv.open(newline="", encoding="utf-8") as handle:
example_rows = list(csv.DictReader(handle))
self.assertEqual(2, len(purchase_rows))
self.assertEqual(1, len(example_rows))
def test_build_purchase_rows_applies_manual_resolution(self):
fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames}
giant_row.update(
{
"retailer": "giant",
"order_id": "g1",
"line_no": "1",
"observed_item_key": "giant:g1:1",
"order_date": "2026-03-01",
"item_name": "SB BAGGED ICE 20LB",
"item_name_norm": "BAGGED ICE",
"retailer_item_id": "100",
"upc": "",
"qty": "1",
"unit": "EA",
"line_total": "3.50",
"unit_price": "3.50",
"measure_type": "each",
"raw_order_path": "giant_output/raw/g1.json",
"is_discount_line": "false",
"is_coupon_line": "false",
"is_fee": "false",
}
)
observed_rows, _canonical_rows, _link_rows, _observed_id_by_key, _canonical_by_observed = (
build_purchases.build_link_state([giant_row])
)
observed_product_id = observed_rows[0]["observed_product_id"]
rows, _observed, _canon, _links = build_purchases.build_purchase_rows(
[giant_row],
[],
[
{
"order_id": "g1",
"store_name": "Giant",
"store_number": "42",
"store_city": "Springfield",
"store_state": "VA",
}
],
[],
[
{
"observed_product_id": observed_product_id,
"canonical_product_id": "gcan_manual_ice",
"resolution_action": "create",
"status": "approved",
"resolution_notes": "manual ice merge",
"reviewed_at": "2026-03-16",
}
],
)
self.assertEqual("gcan_manual_ice", rows[0]["canonical_product_id"])
self.assertEqual("approved", rows[0]["review_status"])
self.assertEqual("create", rows[0]["resolution_action"])
if __name__ == "__main__":
unittest.main()
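The arithmetic behind `test_derive_metrics_prefers_picked_weight_and_pack_count` can be sketched as follows. This is an assumed reconstruction, not the `build_purchases` code: the helpers `to_float` and `format_number` are hypothetical, and the fallback basis label `parsed_size_lb` is invented for illustration (only `picked_weight_lb` appears in the test).

```python
OUNCES_PER_POUND = 16


def to_float(text):
    # Blank or missing CSV fields count as zero.
    try:
        return float(text)
    except (TypeError, ValueError):
        return 0.0


def format_number(value):
    # Hypothetical formatter: four decimals, trailing zeros trimmed.
    return f"{value:.4f}".rstrip("0").rstrip(".")


def derive_metrics(row):
    total = to_float(row.get("line_total"))
    qty = to_float(row.get("qty")) or 1.0
    pack = to_float(row.get("pack_qty"))
    picked_lb = to_float(row.get("picked_weight"))
    size = to_float(row.get("size_value"))
    unit = (row.get("size_unit") or "").lower()

    metrics = {
        "price_per_each": format_number(total / qty),
        "price_per_count": "",
        "price_per_lb": "",
        "price_per_oz": "",
        "price_per_lb_basis": "",
    }
    if pack:
        metrics["price_per_count"] = format_number(total / (qty * pack))

    weight_lb, basis = 0.0, ""
    if picked_lb:
        # Measured weight wins over anything parsed from the item name.
        weight_lb, basis = picked_lb, "picked_weight_lb"
    elif size and unit in ("lb", "oz"):
        per_unit_lb = size if unit == "lb" else size / OUNCES_PER_POUND
        weight_lb = per_unit_lb * qty * (pack or 1.0)
        basis = "parsed_size_lb"  # hypothetical label, not from the diff
    if weight_lb:
        price_per_lb = total / weight_lb
        metrics["price_per_lb"] = format_number(price_per_lb)
        metrics["price_per_oz"] = format_number(price_per_lb / OUNCES_PER_POUND)
        metrics["price_per_lb_basis"] = basis
    return metrics
```

With the fixture above (4.00 total, quantity 1, pack of 4, 2 lb picked) this gives 4 per each, 1 per count, 2 per pound, and 0.125 per ounce.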

tests/test_review_queue.py Normal file

@@ -0,0 +1,133 @@
import tempfile
import unittest
from pathlib import Path

import build_observed_products
import build_review_queue
from layer_helpers import write_csv_rows


class ReviewQueueTests(unittest.TestCase):
    def test_build_review_queue_preserves_existing_status(self):
        observed_rows = [
            {
                "observed_product_id": "gobs_1",
                "retailer": "giant",
                "representative_upc": "111",
                "representative_image_url": "",
                "representative_name_norm": "GALA APPLE",
                "times_seen": "2",
                "distinct_item_names_count": "2",
                "distinct_upcs_count": "1",
                "is_fee": "false",
                "is_discount_line": "false",
                "is_coupon_line": "false",
            }
        ]
        item_rows = [
            {
                "observed_product_id": "gobs_1",
                "item_name": "SB GALA APPLE 5LB",
                "item_name_norm": "GALA APPLE",
                "line_total": "7.99",
            },
            {
                "observed_product_id": "gobs_1",
                "item_name": "SB GALA APPLE 5 LB",
                "item_name_norm": "GALA APPLE",
                "line_total": "8.49",
            },
        ]
        existing = {
            build_review_queue.stable_id("rvw", "gobs_1|missing_image"): {
                "status": "approved",
                "resolution_notes": "looked fine",
                "created_at": "2026-03-15",
            }
        }
        queue = build_review_queue.build_review_queue(
            observed_rows, item_rows, existing, "2026-03-16"
        )
        self.assertEqual(2, len(queue))
        missing_image = [row for row in queue if row["reason_code"] == "missing_image"][0]
        self.assertEqual("approved", missing_image["status"])
        self.assertEqual("looked fine", missing_image["resolution_notes"])

    def test_review_queue_main_writes_output(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            observed_path = Path(tmpdir) / "products_observed.csv"
            items_path = Path(tmpdir) / "items_enriched.csv"
            output_path = Path(tmpdir) / "review_queue.csv"
            observed_rows = [
                {
                    "observed_product_id": "gobs_1",
                    "retailer": "giant",
                    "observed_key": "giant|upc=111|name=GALA APPLE",
                    "representative_retailer_item_id": "11",
                    "representative_upc": "111",
                    "representative_item_name": "SB GALA APPLE 5LB",
                    "representative_name_norm": "GALA APPLE",
                    "representative_brand": "SB",
                    "representative_variant": "",
                    "representative_size_value": "5",
                    "representative_size_unit": "lb",
                    "representative_pack_qty": "",
                    "representative_measure_type": "weight",
                    "representative_image_url": "",
                    "is_store_brand": "true",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "first_seen_date": "2026-01-01",
                    "last_seen_date": "2026-01-10",
                    "times_seen": "2",
                    "example_order_id": "1",
                    "example_item_name": "SB GALA APPLE 5LB",
                    "raw_name_examples": "SB GALA APPLE 5LB | SB GALA APPLE 5 LB",
                    "normalized_name_examples": "GALA APPLE",
                    "example_prices": "7.99 | 8.49",
                    "distinct_item_names_count": "2",
                    "distinct_retailer_item_ids_count": "1",
                    "distinct_upcs_count": "1",
                }
            ]
            item_rows = [
                {
                    "retailer": "giant",
                    "order_id": "1",
                    "line_no": "1",
                    "item_name": "SB GALA APPLE 5LB",
                    "item_name_norm": "GALA APPLE",
                    "retailer_item_id": "11",
                    "upc": "111",
                    "size_value": "5",
                    "size_unit": "lb",
                    "pack_qty": "",
                    "measure_type": "weight",
                    "is_store_brand": "true",
                    "is_fee": "false",
                    "is_discount_line": "false",
                    "is_coupon_line": "false",
                    "line_total": "7.99",
                }
            ]
            write_csv_rows(
                observed_path, observed_rows, build_observed_products.OUTPUT_FIELDS
            )
            write_csv_rows(items_path, item_rows, list(item_rows[0].keys()))
            build_review_queue.main.callback(
                observed_csv=str(observed_path),
                items_enriched_csv=str(items_path),
                output_csv=str(output_path),
            )
            self.assertTrue(output_path.exists())


if __name__ == "__main__":
    unittest.main()
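
The status-preservation test above implies that a rebuilt queue looks up each freshly generated row in the previous queue by its stable review ID and carries the reviewer's decision forward. A minimal sketch of that merge follows. It is not the `build_review_queue` implementation: the function name `carry_over_review_state`, the `review_id` key, and the `"pending"` default are assumptions made for illustration.

```python
def carry_over_review_state(fresh_rows, existing_by_id, today):
    """Return fresh queue rows with any prior review decision carried forward.

    fresh_rows: rows rebuilt from current data, each with a "review_id" (assumed key).
    existing_by_id: previous queue rows keyed by that same ID.
    today: date string used as created_at for rows never seen before.
    """
    merged = []
    for row in fresh_rows:
        previous = existing_by_id.get(row["review_id"], {})
        merged.append(
            {
                **row,
                # Keep the earlier decision if there was one; "pending" is an assumed default.
                "status": previous.get("status", "pending"),
                "resolution_notes": previous.get("resolution_notes", ""),
                "created_at": previous.get("created_at", today),
            }
        )
    return merged
```

Keying on a deterministic ID rather than on row position means a reviewer's `approved` status survives even when the rebuilt queue gains or loses rows.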


@@ -0,0 +1,409 @@
import csv
import tempfile
import unittest
from pathlib import Path
from unittest import mock

from click.testing import CliRunner

import review_products


class ReviewWorkflowTests(unittest.TestCase):
    def test_build_review_queue_groups_unresolved_purchases(self):
        queue_rows = review_products.build_review_queue(
            [
                {
                    "observed_product_id": "gobs_1",
                    "canonical_product_id": "",
                    "retailer": "giant",
                    "raw_item_name": "SB BAGGED ICE 20LB",
                    "normalized_item_name": "BAGGED ICE",
                    "upc": "",
                    "line_total": "3.50",
                },
                {
                    "observed_product_id": "gobs_1",
                    "canonical_product_id": "",
                    "retailer": "giant",
                    "raw_item_name": "SB BAG ICE CUBED 10LB",
                    "normalized_item_name": "BAG ICE",
                    "upc": "",
                    "line_total": "2.50",
                },
            ],
            [],
        )
        self.assertEqual(1, len(queue_rows))
        self.assertEqual("gobs_1", queue_rows[0]["observed_product_id"])
        self.assertIn("SB BAGGED ICE 20LB", queue_rows[0]["raw_item_names"])

    def test_build_canonical_suggestions_prefers_upc_then_name(self):
        suggestions = review_products.build_canonical_suggestions(
            [
                {
                    "normalized_item_name": "MIXED PEPPER",
                    "upc": "12345",
                }
            ],
            [
                {
                    "canonical_product_id": "gcan_1",
                    "canonical_name": "MIXED PEPPER",
                    "upc": "",
                },
                {
                    "canonical_product_id": "gcan_2",
                    "canonical_name": "MIXED PEPPER 6 PACK",
                    "upc": "12345",
                },
            ],
        )
        self.assertEqual("gcan_2", suggestions[0]["canonical_product_id"])
        self.assertEqual("exact upc", suggestions[0]["reason"])
        self.assertEqual("gcan_1", suggestions[1]["canonical_product_id"])

    def test_review_products_displays_position_items_and_suggestions(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"
            purchase_fields = [
                "purchase_date",
                "retailer",
                "order_id",
                "line_no",
                "observed_product_id",
                "canonical_product_id",
                "raw_item_name",
                "normalized_item_name",
                "image_url",
                "upc",
                "line_total",
            ]
            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=purchase_fields)
                writer.writeheader()
                writer.writerows(
                    [
                        {
                            "purchase_date": "2026-03-14",
                            "retailer": "costco",
                            "order_id": "c2",
                            "line_no": "2",
                            "observed_product_id": "gobs_mix",
                            "canonical_product_id": "",
                            "raw_item_name": "MIXED PEPPER 6-PACK",
                            "normalized_item_name": "MIXED PEPPER",
                            "image_url": "",
                            "upc": "",
                            "line_total": "7.49",
                        },
                        {
                            "purchase_date": "2026-03-12",
                            "retailer": "costco",
                            "order_id": "c1",
                            "line_no": "1",
                            "observed_product_id": "gobs_mix",
                            "canonical_product_id": "",
                            "raw_item_name": "MIXED PEPPER 6-PACK",
                            "normalized_item_name": "MIXED PEPPER",
                            "image_url": "https://example.test/mixed-pepper.jpg",
                            "upc": "",
                            "line_total": "6.99",
                        },
                    ]
                )
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "canonical_product_id": "gcan_mix",
                        "canonical_name": "MIXED PEPPER",
                        "category": "produce",
                        "product_type": "pepper",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )
            runner = CliRunner()
            result = runner.invoke(
                review_products.main,
                [
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                ],
                input="q\n",
                color=True,
            )
            self.assertEqual(0, result.exit_code)
            self.assertIn("Review 1/1: Resolve observed_product MIXED PEPPER to canonical_name [__]?", result.output)
            self.assertIn("2 matched items:", result.output)
            self.assertIn("[l]ink existing [n]ew canonical e[x]clude [s]kip [q]uit:", result.output)
            first_item = result.output.index("[1] 2026-03-14 | 7.49")
            second_item = result.output.index("[2] 2026-03-12 | 6.99")
            self.assertLess(first_item, second_item)
            self.assertIn("https://example.test/mixed-pepper.jpg", result.output)
            self.assertIn("1 canonical suggestions found:", result.output)
            self.assertIn("[1] MIXED PEPPER", result.output)
            self.assertIn("\x1b[", result.output)

    def test_review_products_no_suggestions_is_informational(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"
            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(
                    handle,
                    fieldnames=[
                        "purchase_date",
                        "retailer",
                        "order_id",
                        "line_no",
                        "observed_product_id",
                        "canonical_product_id",
                        "raw_item_name",
                        "normalized_item_name",
                        "image_url",
                        "upc",
                        "line_total",
                    ],
                )
                writer.writeheader()
                writer.writerow(
                    {
                        "purchase_date": "2026-03-14",
                        "retailer": "giant",
                        "order_id": "g1",
                        "line_no": "1",
                        "observed_product_id": "gobs_ice",
                        "canonical_product_id": "",
                        "raw_item_name": "SB BAGGED ICE 20LB",
                        "normalized_item_name": "BAGGED ICE",
                        "image_url": "",
                        "upc": "",
                        "line_total": "3.50",
                    }
                )
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
            result = CliRunner().invoke(
                review_products.main,
                [
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                ],
                input="q\n",
                color=True,
            )
            self.assertEqual(0, result.exit_code)
            self.assertIn("no canonical_name suggestions found", result.output)

    def test_link_existing_uses_numbered_selection_and_confirmation(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"
            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(
                    handle,
                    fieldnames=[
                        "purchase_date",
                        "retailer",
                        "order_id",
                        "line_no",
                        "observed_product_id",
                        "canonical_product_id",
                        "raw_item_name",
                        "normalized_item_name",
                        "image_url",
                        "upc",
                        "line_total",
                    ],
                )
                writer.writeheader()
                writer.writerows(
                    [
                        {
                            "purchase_date": "2026-03-14",
                            "retailer": "costco",
                            "order_id": "c2",
                            "line_no": "2",
                            "observed_product_id": "gobs_mix",
                            "canonical_product_id": "",
                            "raw_item_name": "MIXED PEPPER 6-PACK",
                            "normalized_item_name": "MIXED PEPPER",
                            "image_url": "",
                            "upc": "",
                            "line_total": "7.49",
                        },
                        {
                            "purchase_date": "2026-03-12",
                            "retailer": "costco",
                            "order_id": "c1",
                            "line_no": "1",
                            "observed_product_id": "gobs_mix",
                            "canonical_product_id": "",
                            "raw_item_name": "MIXED PEPPER 6-PACK",
                            "normalized_item_name": "MIXED PEPPER",
                            "image_url": "",
                            "upc": "",
                            "line_total": "6.99",
                        },
                    ]
                )
            with catalog_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=review_products.build_purchases.CATALOG_FIELDS)
                writer.writeheader()
                writer.writerow(
                    {
                        "canonical_product_id": "gcan_mix",
                        "canonical_name": "MIXED PEPPER",
                        "category": "",
                        "product_type": "",
                        "brand": "",
                        "variant": "",
                        "size_value": "",
                        "size_unit": "",
                        "pack_qty": "",
                        "measure_type": "",
                        "notes": "",
                        "created_at": "",
                        "updated_at": "",
                    }
                )
            result = CliRunner().invoke(
                review_products.main,
                [
                    "--purchases-csv",
                    str(purchases_csv),
                    "--queue-csv",
                    str(queue_csv),
                    "--resolutions-csv",
                    str(resolutions_csv),
                    "--catalog-csv",
                    str(catalog_csv),
                    "--limit",
                    "1",
                ],
                input="l\n1\ny\nlinked by test\n",
                color=True,
            )
            self.assertEqual(0, result.exit_code)
            self.assertIn("Select the canonical_name to associate 2 items with:", result.output)
            self.assertIn('[1] MIXED PEPPER | gcan_mix', result.output)
            self.assertIn('2 "MIXED PEPPER" items and future matches will be associated with "MIXED PEPPER".', result.output)
            self.assertIn("actions: [y]es [n]o [b]ack [s]kip [q]uit", result.output)
            with resolutions_csv.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            self.assertEqual("gcan_mix", rows[0]["canonical_product_id"])
            self.assertEqual("link", rows[0]["resolution_action"])

    def test_review_products_creates_canonical_and_resolution(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            purchases_csv = Path(tmpdir) / "purchases.csv"
            queue_csv = Path(tmpdir) / "review_queue.csv"
            resolutions_csv = Path(tmpdir) / "review_resolutions.csv"
            catalog_csv = Path(tmpdir) / "canonical_catalog.csv"
            with purchases_csv.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(
                    handle,
                    fieldnames=[
                        "purchase_date",
                        "observed_product_id",
                        "canonical_product_id",
                        "retailer",
                        "raw_item_name",
                        "normalized_item_name",
                        "image_url",
                        "upc",
                        "line_total",
                        "order_id",
                        "line_no",
                    ],
                )
                writer.writeheader()
                writer.writerow(
                    {
                        "purchase_date": "2026-03-15",
                        "observed_product_id": "gobs_ice",
                        "canonical_product_id": "",
                        "retailer": "giant",
                        "raw_item_name": "SB BAGGED ICE 20LB",
                        "normalized_item_name": "BAGGED ICE",
                        "image_url": "",
                        "upc": "",
                        "line_total": "3.50",
                        "order_id": "g1",
                        "line_no": "1",
                    }
                )
            with mock.patch.object(
                review_products.click,
                "prompt",
                side_effect=["n", "ICE", "frozen", "ice", "manual merge", "q"],
            ):
                review_products.main.callback(
                    purchases_csv=str(purchases_csv),
                    queue_csv=str(queue_csv),
                    resolutions_csv=str(resolutions_csv),
                    catalog_csv=str(catalog_csv),
                    limit=1,
                    refresh_only=False,
                )
            self.assertTrue(queue_csv.exists())
            self.assertTrue(resolutions_csv.exists())
            self.assertTrue(catalog_csv.exists())
            with resolutions_csv.open(newline="", encoding="utf-8") as handle:
                resolution_rows = list(csv.DictReader(handle))
            with catalog_csv.open(newline="", encoding="utf-8") as handle:
                catalog_rows = list(csv.DictReader(handle))
            self.assertEqual("create", resolution_rows[0]["resolution_action"])
            self.assertEqual("approved", resolution_rows[0]["status"])
            self.assertEqual("ICE", catalog_rows[0]["canonical_name"])


if __name__ == "__main__":
    unittest.main()
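
The suggestion-ordering test in this file expects catalog entries that share a UPC with the observed items to rank ahead of entries that only share a normalized name. One way to get that ordering is sketched below. It is an illustration, not the code in `review_products`: the function name `suggest_canonicals` and the `"exact name"` reason label are assumptions, and only the `"exact upc"` label comes from the test.

```python
def suggest_canonicals(observed_rows, catalog_rows):
    """Rank catalog entries: exact UPC matches first, then exact name matches."""
    # Collect the non-empty UPCs and normalized names seen on the observed items.
    upcs = {row["upc"] for row in observed_rows if row.get("upc")}
    names = {
        row["normalized_item_name"]
        for row in observed_rows
        if row.get("normalized_item_name")
    }

    upc_hits = []
    name_hits = []
    for entry in catalog_rows:
        suggestion = {
            "canonical_product_id": entry["canonical_product_id"],
            "canonical_name": entry.get("canonical_name", ""),
        }
        if entry.get("upc") and entry["upc"] in upcs:
            upc_hits.append({**suggestion, "reason": "exact upc"})
        elif entry.get("canonical_name") in names:
            # "exact name" is an assumed label; the test does not check it.
            name_hits.append({**suggestion, "reason": "exact name"})

    # UPC matches are the stronger signal, so they are listed first.
    return upc_hits + name_hits
```

Building two lists and concatenating them keeps catalog order within each tier, so the numbered choices shown to a reviewer stay stable from run to run.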

tests/test_scraper.py Normal file

@@ -0,0 +1,128 @@
import csv
import tempfile
import unittest
from pathlib import Path

import scraper


class ScraperTests(unittest.TestCase):
    def test_flatten_orders_extracts_order_and_item_rows(self):
        history = {
            "records": [
                {
                    "orderId": "abc123",
                    "serviceType": "PICKUP",
                }
            ]
        }
        details = [
            {
                "orderId": "abc123",
                "orderDate": "2026-03-01",
                "deliveryDate": "2026-03-02",
                "orderTotal": "12.34",
                "paymentMethod": "VISA",
                "totalItemCount": 1,
                "totalSavings": "1.00",
                "yourSavingsTotal": "1.00",
                "couponsDiscountsTotal": "0.50",
                "refundOrder": False,
                "ebtOrder": False,
                "pup": {
                    "storeName": "Giant",
                    "aholdStoreNumber": "42",
                    "storeAddress1": "123 Main",
                    "storeCity": "Springfield",
                    "storeState": "VA",
                    "storeZipcode": "22150",
                },
                "items": [
                    {
                        "podId": "pod-1",
                        "itemName": "Bananas",
                        "primUpcCd": "111",
                        "categoryId": "produce",
                        "categoryDesc": "Produce",
                        "shipQy": "2",
                        "lbEachCd": "EA",
                        "unitPrice": "0.59",
                        "groceryAmount": "1.18",
                        "totalPickedWeight": "",
                        "mvpSavings": "0.10",
                        "rewardSavings": "0.00",
                        "couponSavings": "0.00",
                        "couponPrice": "",
                    }
                ],
            }
        ]
        orders, items = scraper.flatten_orders(
            history,
            details,
            history_path=Path("data/giant-web/raw/history.json"),
            raw_dir=Path("data/giant-web/raw"),
        )
        self.assertEqual(1, len(orders))
        self.assertEqual("abc123", orders[0]["order_id"])
        self.assertEqual("giant", orders[0]["retailer"])
        self.assertEqual("PICKUP", orders[0]["service_type"])
        self.assertEqual("data/giant-web/raw/history.json", orders[0]["raw_history_path"])
        self.assertEqual("data/giant-web/raw/abc123.json", orders[0]["raw_order_path"])
        self.assertEqual(1, len(items))
        self.assertEqual("1", items[0]["line_no"])
        self.assertEqual("Bananas", items[0]["item_name"])
        self.assertEqual("giant", items[0]["retailer"])
        self.assertEqual("data/giant-web/raw/abc123.json", items[0]["raw_order_path"])
        self.assertEqual("false", items[0]["is_discount_line"])

    def test_append_dedup_replaces_duplicate_rows_and_preserves_new_values(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            path = Path(tmpdir) / "orders.csv"
            scraper.append_dedup(
                path,
                [
                    {"order_id": "1", "order_total": "10.00"},
                    {"order_id": "2", "order_total": "20.00"},
                ],
                subset=["order_id"],
                fieldnames=["order_id", "order_total"],
            )
            merged = scraper.append_dedup(
                path,
                [
                    {"order_id": "2", "order_total": "21.50"},
                    {"order_id": "3", "order_total": "30.00"},
                ],
                subset=["order_id"],
                fieldnames=["order_id", "order_total"],
            )
            self.assertEqual(
                [
                    {"order_id": "1", "order_total": "10.00"},
                    {"order_id": "2", "order_total": "21.50"},
                    {"order_id": "3", "order_total": "30.00"},
                ],
                merged,
            )
            with path.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            self.assertEqual(merged, rows)

    def test_read_existing_order_ids_returns_known_ids(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            path = Path(tmpdir) / "orders.csv"
            path.write_text("order_id,order_total\n1,10.00\n2,20.00\n", encoding="utf-8")
            self.assertEqual({"1", "2"}, scraper.read_existing_order_ids(path))


if __name__ == "__main__":
    unittest.main()
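
The `append_dedup` test above pins down three behaviours: rows are keyed by the `subset` columns, a later row with the same key replaces the earlier one in place, and the merged result is both returned and written back to the CSV. A minimal sketch that satisfies those expectations is shown here. It is not the `scraper.append_dedup` implementation, which is not part of this diff; the names `append_dedup_sketch` and `demo` are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def append_dedup_sketch(path, new_rows, subset, fieldnames):
    """Merge new_rows into the CSV at path, replacing rows that share a subset key."""
    path = Path(path)
    merged = {}  # dicts keep insertion order, so a replaced row keeps its position

    if path.exists():
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                clean = {field: row.get(field) or "" for field in fieldnames}
                merged[tuple(clean.get(key, "") for key in subset)] = clean

    for row in new_rows:
        clean = {field: str(row.get(field, "")) for field in fieldnames}
        merged[tuple(clean.get(key, "") for key in subset)] = clean

    rows = list(merged.values())
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


def demo():
    """Replay the two calls from the test against a temporary file."""
    fields = ["order_id", "order_total"]
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / "orders.csv"
        append_dedup_sketch(
            path,
            [
                {"order_id": "1", "order_total": "10.00"},
                {"order_id": "2", "order_total": "20.00"},
            ],
            subset=["order_id"],
            fieldnames=fields,
        )
        return append_dedup_sketch(
            path,
            [
                {"order_id": "2", "order_total": "21.50"},
                {"order_id": "3", "order_total": "30.00"},
            ],
            subset=["order_id"],
            fieldnames=fields,
        )
```

Storing every value as a string matches what `csv.DictReader` returns on the next run, which is why the test can compare the returned rows directly with the rows read back from disk.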


@@ -0,0 +1,154 @@
import json
from pathlib import Path

import click

import build_canonical_layer
import build_observed_products
from layer_helpers import stable_id, write_csv_rows


PROOF_FIELDS = [
    "proof_name",
    "canonical_product_id",
    "giant_observed_product_id",
    "costco_observed_product_id",
    "giant_example_item",
    "costco_example_item",
    "notes",
]


def read_rows(path):
    import csv

    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def find_proof_pair(observed_rows):
    giant = None
    costco = None
    for row in observed_rows:
        if row["retailer"] == "giant" and row["representative_name_norm"] == "BANANA":
            giant = row
        if row["retailer"] == "costco" and row["representative_name_norm"] == "BANANA":
            costco = row
    return giant, costco


def merge_proof_pair(canonical_rows, link_rows, giant_row, costco_row):
    if not giant_row or not costco_row:
        return canonical_rows, link_rows, []
    proof_canonical_id = stable_id("gcan", "proof|banana")
    link_rows = [
        row
        for row in link_rows
        if row["observed_product_id"]
        not in {giant_row["observed_product_id"], costco_row["observed_product_id"]}
    ]
    canonical_rows = [
        row
        for row in canonical_rows
        if row["canonical_product_id"] != proof_canonical_id
    ]
    canonical_rows.append(
        {
            "canonical_product_id": proof_canonical_id,
            "canonical_name": "BANANA",
            "product_type": "banana",
            "brand": "",
            "variant": "",
            "size_value": "",
            "size_unit": "",
            "pack_qty": "",
            "measure_type": "weight",
            "normalized_quantity": "",
            "normalized_quantity_unit": "",
            "notes": "manual proof merge for cross-retailer validation",
            "created_at": "",
            "updated_at": "",
        }
    )
    for observed_row in [giant_row, costco_row]:
        link_rows.append(
            {
                "observed_product_id": observed_row["observed_product_id"],
                "canonical_product_id": proof_canonical_id,
                "link_method": "manual_proof_merge",
                "link_confidence": "medium",
                "review_status": "",
                "reviewed_by": "",
                "reviewed_at": "",
                "link_notes": "cross-retailer validation proof",
            }
        )
    proof_rows = [
        {
            "proof_name": "banana",
            "canonical_product_id": proof_canonical_id,
            "giant_observed_product_id": giant_row["observed_product_id"],
            "costco_observed_product_id": costco_row["observed_product_id"],
            "giant_example_item": giant_row["example_item_name"],
            "costco_example_item": costco_row["example_item_name"],
            "notes": "BANANA proof pair built from Giant and Costco enriched rows",
        }
    ]
    return canonical_rows, link_rows, proof_rows


@click.command()
@click.option(
    "--giant-items-enriched-csv",
    default="giant_output/items_enriched.csv",
    show_default=True,
)
@click.option(
    "--costco-items-enriched-csv",
    default="costco_output/items_enriched.csv",
    show_default=True,
)
@click.option(
    "--outdir",
    default="combined_output",
    show_default=True,
)
def main(giant_items_enriched_csv, costco_items_enriched_csv, outdir):
    outdir = Path(outdir)
    rows = read_rows(giant_items_enriched_csv) + read_rows(costco_items_enriched_csv)
    observed_rows = build_observed_products.build_observed_products(rows)
    canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows)
    giant_row, costco_row = find_proof_pair(observed_rows)
    if not giant_row or not costco_row:
        raise click.ClickException(
            "could not find BANANA proof pair across Giant and Costco observed products"
        )
    canonical_rows, link_rows, proof_rows = merge_proof_pair(
        canonical_rows, link_rows, giant_row, costco_row
    )
    write_csv_rows(
        outdir / "products_observed.csv",
        observed_rows,
        build_observed_products.OUTPUT_FIELDS,
    )
    write_csv_rows(
        outdir / "products_canonical.csv",
        canonical_rows,
        build_canonical_layer.CANONICAL_FIELDS,
    )
    write_csv_rows(
        outdir / "product_links.csv",
        link_rows,
        build_canonical_layer.LINK_FIELDS,
    )
    write_csv_rows(outdir / "proof_examples.csv", proof_rows, PROOF_FIELDS)
    click.echo(
        f"wrote combined outputs to {outdir} using {len(observed_rows)} observed rows"
    )


if __name__ == "__main__":
    main()
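
`merge_proof_pair` relies on `stable_id("gcan", "proof|banana")` returning the same value on every run: that is what lets it drop a previously written proof row before appending a fresh one, so reruns do not accumulate duplicates. The helper lives in `layer_helpers` and is not shown in this diff. The sketch below is only one plausible shape for such a function; the hashing scheme, the digest length, and the name `stable_id_sketch` are all assumptions.

```python
import hashlib


def stable_id_sketch(prefix, raw_key):
    """Derive a deterministic ID from a prefix and a key string.

    Assumed scheme: SHA-1 of the key, truncated to 12 hex characters.
    The real layer_helpers.stable_id may work differently.
    """
    digest = hashlib.sha1(raw_key.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


def replace_by_id(rows, new_row, id_field):
    """Drop any row with the same ID, then append the new row (idempotent upsert)."""
    kept = [row for row in rows if row[id_field] != new_row[id_field]]
    kept.append(new_row)
    return kept
```

Because the ID depends only on its inputs, applying `replace_by_id` twice with the same proof row leaves exactly one copy, which mirrors the filter-then-append pattern used for `canonical_rows` above.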