updated readme with Review steps

ben
2026-03-17 09:14:14 -04:00
parent 91bfd3597e
commit 7f8c3ed8eb
3 changed files with 147 additions and 191 deletions

271
README.md

@@ -1,227 +1,118 @@
# scrape-giant # scrape-giant
Small grocery-history pipeline for Giant and Costco receipt data. Small CLI pipeline for pulling purchase history from Giant and Costco, enriching line items, and building a reviewable cross-retailer purchase dataset.
This repo is still a manual, stepwise pipeline. There is no single orchestrator There is no one-shot runner yet. Today, you run the scripts step by step from the terminal.
script yet. Each stage is run directly, and later stages depend on files
produced by earlier stages.
## What The Project Does ## What It Does
The current flow is: - `scrape_giant.py`: download Giant orders and items
- `enrich_giant.py`: normalize Giant line items
- `scrape_costco.py`: download Costco orders and items
- `enrich_costco.py`: normalize Costco line items
- `build_purchases.py`: combine retailer outputs into one purchase table
- `review_products.py`: review unresolved product matches in the terminal
1. acquire raw Giant receipt/history data ## Requirements
2. enrich Giant line items into a shared enriched-item schema
3. acquire raw Costco receipt data
4. enrich Costco line items into the same shared enriched-item schema
5. build observed-product, review, and canonical-product layers
6. validate that Giant and Costco can flow through the same downstream model
Raw retailer JSON remains the source of truth. - Python 3.10+
- Firefox installed with active Giant and Costco sessions
## Current Scripts ## Install
- `scrape_giant.py`
Fetch Giant in-store history and order detail payloads from an active Firefox
session.
- `scrape_costco.py`
Fetch Costco receipt summary/detail payloads from an active Firefox session.
Costco currently prefers `.env` header values first, then falls back to exact
Firefox local-storage values for session auth.
- `enrich_giant.py`
Parse Giant raw order JSON into `giant_output/items_enriched.csv`.
- `enrich_costco.py`
Parse Costco raw receipt JSON into `costco_output/items_enriched.csv`.
- `build_observed_products.py`
Build retailer-facing observed products from enriched rows.
- `build_review_queue.py`
Build a manual review queue for low-confidence or unresolved observed
products.
- `build_canonical_layer.py`
Build shared canonical products and observed-to-canonical links.
- `validate_cross_retailer_flow.py`
Write a proof/check output showing that Giant and Costco can meet in the same
downstream model.
## Manual Pipeline
Run these from the repo root with the venv active, or call them through
`./venv/bin/python`.
### 1. Acquire Giant raw data
```bash ```bash
./venv/bin/python scrape_giant.py python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
``` ```
Inputs: ## Optional `.env`
- active Firefox session for `giantfood.com`
- `GIANT_USER_ID` and `GIANT_LOYALTY_NUMBER` from `.env`, shell env, or prompt
Outputs: The current version works best with a `.env` file in the project root. The scrapers prompt for any of these values that they cannot find in `.env` or in the current browser session.
- `giant_output/raw/history.json` - `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `giant_output/raw/<order_id>.json` - `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
# Costco can use these if present, but it can also pull session values from Firefox.
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
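The prompt-if-missing behavior described above might look roughly like the sketch below. It assumes the `.env` values have already been loaded into the process environment; `get_setting` is a hypothetical helper for illustration, not a function taken from the scrapers.

```python
import os


def get_setting(name, prompt=input, environ=os.environ):
    """Return a setting from the environment, prompting only if it is missing."""
    value = environ.get(name, "").strip()
    if value:
        return value
    # Fall back to an interactive prompt when the value is absent or blank.
    return prompt(f"{name}: ").strip()
```

For example, `get_setting("GIANT_USER_ID")` would return the environment value when it is set and ask in the terminal otherwise.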
## Run Order
Run the pipeline in this order:
```bash
python scrape_giant.py
python enrich_giant.py
python scrape_costco.py
python enrich_costco.py
python build_purchases.py
python review_products.py
python build_purchases.py
```
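Until a one-shot runner exists, the order above could be wrapped in a small helper. This is only a sketch: `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical names, and it assumes every script runs from the repo root with no extra flags.

```python
import subprocess
import sys

# Step order copied from the block above; build_purchases.py appears twice on purpose.
PIPELINE_STEPS = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_purchases.py",
    "review_products.py",
    "build_purchases.py",
]


def build_commands(steps=PIPELINE_STEPS, python=sys.executable):
    """Return one argv list per step, using the given interpreter."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against stale or missing inputs.
        runner(command, check=True)
```

Calling `run_pipeline(build_commands())` would run the seven steps in sequence. Because `subprocess.run` inherits the terminal by default, the interactive `review_products.py` step can still read keyboard input.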
Why run `build_purchases.py` twice:
- first pass builds the current combined dataset and review queue inputs
- `review_products.py` writes durable review decisions
- second pass reapplies those decisions into the purchase output
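
The second pass can be pictured as overlaying saved decisions on the automatically generated links. The sketch below is an assumption-laden illustration, not the code in `build_purchases.py`: `apply_resolutions` is a hypothetical name, and the `link` and `skip` action values are assumed (only `create` and `exclude` appear in this commit).

```python
def apply_resolutions(canonical_id_by_observed, resolution_rows):
    """Overlay saved review decisions on automatically generated links."""
    merged = dict(canonical_id_by_observed)  # leave the caller's mapping untouched
    for resolution in resolution_rows:
        observed_id = resolution["observed_product_id"]
        action = resolution.get("resolution_action", "")
        if action in ("link", "create"):
            # A reviewer chose (or created) a canonical product for this item.
            merged[observed_id] = resolution["canonical_product_id"]
        elif action == "exclude":
            # Excluded items keep their row but lose the canonical link.
            merged[observed_id] = ""
        # Any other action (for example a skip) leaves the automatic link alone.
    return merged
```

On the first pass there are no saved decisions, so the automatic links pass through unchanged; after a review session, the same call picks up the reviewer's choices.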
If you only want to refresh the queue without reviewing interactively:
```bash
python review_products.py --refresh-only
```
## Key Outputs
Giant:
- `giant_output/orders.csv` - `giant_output/orders.csv`
- `giant_output/items.csv` - `giant_output/items.csv`
### 2. Enrich Giant data
```bash
./venv/bin/python enrich_giant.py
```
Input:
- `giant_output/raw/*.json`
Output:
- `giant_output/items_enriched.csv` - `giant_output/items_enriched.csv`
### 3. Acquire Costco raw data Costco:
```bash
./venv/bin/python scrape_costco.py
```
Optional useful flags:
```bash
./venv/bin/python scrape_costco.py --months-back 36
./venv/bin/python scrape_costco.py --firefox-profile-dir "C:\\Users\\you\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\xxxx.default-release"
```
Inputs:
- active Firefox session for `costco.com`
- optional `.env` values:
- `COSTCO_X_AUTHORIZATION`
- `COSTCO_X_WCS_CLIENTID`
- `COSTCO_CLIENT_IDENTIFIER`
- if `COSTCO_X_AUTHORIZATION` is absent, the script falls back to exact Firefox
local-storage values:
- `idToken` -> sent as `Bearer <idToken>`
- `clientID` -> used as `costco-x-wcs-clientId` when env is blank
Outputs:
- `costco_output/raw/summary.json`
- `costco_output/raw/summary_requests.json`
- `costco_output/raw/<receipt_id>-<timestamp>.json`
- `costco_output/orders.csv` - `costco_output/orders.csv`
- `costco_output/items.csv` - `costco_output/items.csv`
### 4. Enrich Costco data
```bash
./venv/bin/python enrich_costco.py
```
Input:
- `costco_output/raw/*.json`
Output:
- `costco_output/items_enriched.csv` - `costco_output/items_enriched.csv`
### 5. Build shared downstream layers Combined:
- `combined_output/purchases.csv`
```bash - `combined_output/review_queue.csv`
./venv/bin/python build_observed_products.py - `combined_output/review_resolutions.csv`
./venv/bin/python build_review_queue.py - `combined_output/canonical_catalog.csv`
./venv/bin/python build_canonical_layer.py
```
These scripts consume the enriched item files and generate the downstream
product-model outputs.
Current outputs on disk:
- retailer-facing:
- `giant_output/products_observed.csv`
- `giant_output/review_queue.csv`
- `giant_output/products_canonical.csv`
- `giant_output/product_links.csv`
- cross-retailer proof/check output:
- `combined_output/products_observed.csv`
- `combined_output/products_canonical.csv`
- `combined_output/product_links.csv` - `combined_output/product_links.csv`
- `combined_output/proof_examples.csv` - `combined_output/comparison_examples.csv`
### 6. Validate cross-retailer flow ## Review Workflow
```bash `review_products.py` is the manual cleanup step for unresolved or weakly unified items.
./venv/bin/python validate_cross_retailer_flow.py
```
This is a proof/check step, not the main acquisition path. In the terminal, you can:
- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later
## Inputs And Outputs By Directory Those decisions are saved and reused on later runs.
### `giant_output/`
Inputs to this layer:
- Firefox session data for Giant
- Giant raw JSON payloads
Generated files:
- `raw/history.json`
- `raw/<order_id>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
- `products_observed.csv`
- `review_queue.csv`
- `products_canonical.csv`
- `product_links.csv`
### `costco_output/`
Inputs to this layer:
- Firefox session data for Costco
- Costco raw GraphQL receipt payloads
Generated files:
- `raw/summary.json`
- `raw/summary_requests.json`
- `raw/<receipt_id>-<timestamp>.json`
- `orders.csv`
- `items.csv`
- `items_enriched.csv`
### `combined_output/`
Generated by cross-retailer proof/build scripts:
- `products_observed.csv`
- `products_canonical.csv`
- `product_links.csv`
- `proof_examples.csv`
## Notes ## Notes
- The pipeline is intentionally simple and currently manual. - This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- Scraping is retailer-specific and fragile; downstream modeling is shared only - `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
after enrichment. - `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.
- `summary_requests.json` is diagnostic metadata from Costco summary enumeration
and is not a receipt payload.
- `enrich_costco.py` skips that file and only parses receipt payloads.
- The repo may contain archived or sample output files under `archive/`; they
are not part of the active scrape path.
## Verification ## Test
Run the full test suite with:
```bash ```bash
./venv/bin/python -m unittest discover -s tests ./venv/bin/python -m unittest discover -s tests
``` ```
Useful one-off checks:
```bash
./venv/bin/python scrape_giant.py --help
./venv/bin/python scrape_costco.py --help
./venv/bin/python enrich_giant.py
./venv/bin/python enrich_costco.py
```
## Project Docs ## Project Docs
- `pm/tasks.org` - `pm/tasks.org`: task tracking
- `pm/data-model.org` - `pm/data-model.org`: current data model notes
- `pm/scrape-giant.org` - `pm/review-workflow.org`: review and resolution workflow


@@ -7,7 +7,11 @@ import build_canonical_layer
import build_observed_products import build_observed_products
import validate_cross_retailer_flow import validate_cross_retailer_flow
from enrich_giant import format_decimal, to_decimal from enrich_giant import format_decimal, to_decimal
<<<<<<< HEAD
from layer_helpers import read_csv_rows, stable_id, write_csv_rows from layer_helpers import read_csv_rows, stable_id, write_csv_rows
=======
from layer_helpers import read_csv_rows, write_csv_rows
>>>>>>> be1bf63 (Build pivot-ready purchase log)
PURCHASE_FIELDS = [ PURCHASE_FIELDS = [
@@ -18,8 +22,11 @@ PURCHASE_FIELDS = [
"observed_item_key", "observed_item_key",
"observed_product_id", "observed_product_id",
"canonical_product_id", "canonical_product_id",
<<<<<<< HEAD
"review_status", "review_status",
"resolution_action", "resolution_action",
=======
>>>>>>> be1bf63 (Build pivot-ready purchase log)
"raw_item_name", "raw_item_name",
"normalized_item_name", "normalized_item_name",
"retailer_item_id", "retailer_item_id",
@@ -62,6 +69,7 @@ EXAMPLE_FIELDS = [
"notes", "notes",
] ]
<<<<<<< HEAD
CATALOG_FIELDS = [ CATALOG_FIELDS = [
"canonical_product_id", "canonical_product_id",
"canonical_name", "canonical_name",
@@ -87,6 +95,8 @@ RESOLUTION_FIELDS = [
"reviewed_at", "reviewed_at",
] ]
=======
>>>>>>> be1bf63 (Build pivot-ready purchase log)
def decimal_or_zero(value): def decimal_or_zero(value):
return to_decimal(value) or Decimal("0") return to_decimal(value) or Decimal("0")
@@ -165,6 +175,7 @@ def order_lookup(rows, retailer):
} }
<<<<<<< HEAD
def read_optional_csv_rows(path): def read_optional_csv_rows(path):
path = Path(path) path = Path(path)
if not path.exists(): if not path.exists():
@@ -209,6 +220,9 @@ def catalog_row_from_canonical(row):
def build_link_state(enriched_rows): def build_link_state(enriched_rows):
=======
def build_link_lookup(enriched_rows):
>>>>>>> be1bf63 (Build pivot-ready purchase log)
observed_rows = build_observed_products.build_observed_products(enriched_rows) observed_rows = build_observed_products.build_observed_products(enriched_rows)
canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows) canonical_rows, link_rows = build_canonical_layer.build_canonical_layer(observed_rows)
giant_row, costco_row = validate_cross_retailer_flow.find_proof_pair(observed_rows) giant_row, costco_row = validate_cross_retailer_flow.find_proof_pair(observed_rows)
@@ -225,6 +239,7 @@ def build_link_state(enriched_rows):
canonical_id_by_observed = { canonical_id_by_observed = {
row["observed_product_id"]: row["canonical_product_id"] for row in link_rows row["observed_product_id"]: row["canonical_product_id"] for row in link_rows
} }
<<<<<<< HEAD
return observed_rows, canonical_rows, link_rows, observed_id_by_key, canonical_id_by_observed return observed_rows, canonical_rows, link_rows, observed_id_by_key, canonical_id_by_observed
@@ -253,6 +268,14 @@ def build_purchase_rows(
canonical_id_by_observed[observed_product_id] = resolution["canonical_product_id"] canonical_id_by_observed[observed_product_id] = resolution["canonical_product_id"]
elif action == "exclude": elif action == "exclude":
canonical_id_by_observed[observed_product_id] = "" canonical_id_by_observed[observed_product_id] = ""
=======
return observed_id_by_key, canonical_id_by_observed
def build_purchase_rows(giant_enriched_rows, costco_enriched_rows, giant_orders, costco_orders):
all_enriched_rows = giant_enriched_rows + costco_enriched_rows
observed_id_by_key, canonical_id_by_observed = build_link_lookup(all_enriched_rows)
>>>>>>> be1bf63 (Build pivot-ready purchase log)
orders_by_id = {} orders_by_id = {}
orders_by_id.update(order_lookup(giant_orders, "giant")) orders_by_id.update(order_lookup(giant_orders, "giant"))
orders_by_id.update(order_lookup(costco_orders, "costco")) orders_by_id.update(order_lookup(costco_orders, "costco"))
@@ -266,7 +289,10 @@ def build_purchase_rows(
observed_product_id = observed_id_by_key.get(observed_key, "") observed_product_id = observed_id_by_key.get(observed_key, "")
order_row = orders_by_id.get((row["retailer"], row["order_id"]), {}) order_row = orders_by_id.get((row["retailer"], row["order_id"]), {})
metrics = derive_metrics(row) metrics = derive_metrics(row)
<<<<<<< HEAD
resolution = resolution_lookup.get(observed_product_id, {}) resolution = resolution_lookup.get(observed_product_id, {})
=======
>>>>>>> be1bf63 (Build pivot-ready purchase log)
purchase_rows.append( purchase_rows.append(
{ {
"purchase_date": row["order_date"], "purchase_date": row["order_date"],
@@ -276,8 +302,11 @@ def build_purchase_rows(
"observed_item_key": row["observed_item_key"], "observed_item_key": row["observed_item_key"],
"observed_product_id": observed_product_id, "observed_product_id": observed_product_id,
"canonical_product_id": canonical_id_by_observed.get(observed_product_id, ""), "canonical_product_id": canonical_id_by_observed.get(observed_product_id, ""),
<<<<<<< HEAD
"review_status": resolution.get("status", ""), "review_status": resolution.get("status", ""),
"resolution_action": resolution.get("resolution_action", ""), "resolution_action": resolution.get("resolution_action", ""),
=======
>>>>>>> be1bf63 (Build pivot-ready purchase log)
"raw_item_name": row["item_name"], "raw_item_name": row["item_name"],
"normalized_item_name": row["item_name_norm"], "normalized_item_name": row["item_name_norm"],
"retailer_item_id": row["retailer_item_id"], "retailer_item_id": row["retailer_item_id"],
@@ -301,6 +330,7 @@ def build_purchase_rows(
**metrics, **metrics,
} }
) )
<<<<<<< HEAD
return purchase_rows, observed_rows, canonical_rows, link_rows return purchase_rows, observed_rows, canonical_rows, link_rows
@@ -328,6 +358,9 @@ def apply_manual_resolutions_to_links(link_rows, resolution_rows):
"link_notes": resolution.get("resolution_notes", ""), "link_notes": resolution.get("resolution_notes", ""),
} }
return sorted(link_by_observed.values(), key=lambda row: row["observed_product_id"]) return sorted(link_by_observed.values(), key=lambda row: row["observed_product_id"])
=======
return purchase_rows
>>>>>>> be1bf63 (Build pivot-ready purchase log)
def build_comparison_examples(purchase_rows): def build_comparison_examples(purchase_rows):
@@ -366,9 +399,12 @@ def build_comparison_examples(purchase_rows):
@click.option("--costco-items-enriched-csv", default="costco_output/items_enriched.csv", show_default=True) @click.option("--costco-items-enriched-csv", default="costco_output/items_enriched.csv", show_default=True)
@click.option("--giant-orders-csv", default="giant_output/orders.csv", show_default=True) @click.option("--giant-orders-csv", default="giant_output/orders.csv", show_default=True)
@click.option("--costco-orders-csv", default="costco_output/orders.csv", show_default=True) @click.option("--costco-orders-csv", default="costco_output/orders.csv", show_default=True)
<<<<<<< HEAD
@click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True) @click.option("--resolutions-csv", default="combined_output/review_resolutions.csv", show_default=True)
@click.option("--catalog-csv", default="combined_output/canonical_catalog.csv", show_default=True) @click.option("--catalog-csv", default="combined_output/canonical_catalog.csv", show_default=True)
@click.option("--links-csv", default="combined_output/product_links.csv", show_default=True) @click.option("--links-csv", default="combined_output/product_links.csv", show_default=True)
=======
>>>>>>> be1bf63 (Build pivot-ready purchase log)
@click.option("--output-csv", default="combined_output/purchases.csv", show_default=True) @click.option("--output-csv", default="combined_output/purchases.csv", show_default=True)
@click.option("--examples-csv", default="combined_output/comparison_examples.csv", show_default=True) @click.option("--examples-csv", default="combined_output/comparison_examples.csv", show_default=True)
def main( def main(
@@ -376,6 +412,7 @@ def main(
costco_items_enriched_csv, costco_items_enriched_csv,
giant_orders_csv, giant_orders_csv,
costco_orders_csv, costco_orders_csv,
<<<<<<< HEAD
resolutions_csv, resolutions_csv,
catalog_csv, catalog_csv,
links_csv, links_csv,
@@ -384,10 +421,17 @@ def main(
): ):
resolution_rows = read_optional_csv_rows(resolutions_csv) resolution_rows = read_optional_csv_rows(resolutions_csv)
purchase_rows, _observed_rows, canonical_rows, link_rows = build_purchase_rows( purchase_rows, _observed_rows, canonical_rows, link_rows = build_purchase_rows(
=======
output_csv,
examples_csv,
):
purchase_rows = build_purchase_rows(
>>>>>>> be1bf63 (Build pivot-ready purchase log)
read_csv_rows(giant_items_enriched_csv), read_csv_rows(giant_items_enriched_csv),
read_csv_rows(costco_items_enriched_csv), read_csv_rows(costco_items_enriched_csv),
read_csv_rows(giant_orders_csv), read_csv_rows(giant_orders_csv),
read_csv_rows(costco_orders_csv), read_csv_rows(costco_orders_csv),
<<<<<<< HEAD
resolution_rows, resolution_rows,
) )
existing_catalog_rows = read_optional_csv_rows(catalog_csv) existing_catalog_rows = read_optional_csv_rows(catalog_csv)
@@ -404,6 +448,14 @@ def main(
click.echo( click.echo(
f"wrote {len(purchase_rows)} purchase rows to {output_csv}, " f"wrote {len(purchase_rows)} purchase rows to {output_csv}, "
f"{len(merged_catalog_rows)} catalog rows to {catalog_csv}, " f"{len(merged_catalog_rows)} catalog rows to {catalog_csv}, "
=======
)
example_rows = build_comparison_examples(purchase_rows)
write_csv_rows(output_csv, purchase_rows, PURCHASE_FIELDS)
write_csv_rows(examples_csv, example_rows, EXAMPLE_FIELDS)
click.echo(
f"wrote {len(purchase_rows)} purchase rows to {output_csv} "
>>>>>>> be1bf63 (Build pivot-ready purchase log)
f"and {len(example_rows)} comparison examples to {examples_csv}" f"and {len(example_rows)} comparison examples to {examples_csv}"
) )


@@ -99,12 +99,19 @@ class PurchaseLogTests(unittest.TestCase):
} }
] ]
<<<<<<< HEAD
rows, _observed, _canon, _links = build_purchases.build_purchase_rows( rows, _observed, _canon, _links = build_purchases.build_purchase_rows(
=======
rows = build_purchases.build_purchase_rows(
>>>>>>> be1bf63 (Build pivot-ready purchase log)
[giant_row], [giant_row],
[costco_row], [costco_row],
giant_orders, giant_orders,
costco_orders, costco_orders,
<<<<<<< HEAD
[], [],
=======
>>>>>>> be1bf63 (Build pivot-ready purchase log)
) )
self.assertEqual(2, len(rows)) self.assertEqual(2, len(rows))
@@ -196,9 +203,12 @@ class PurchaseLogTests(unittest.TestCase):
costco_items_enriched_csv=str(costco_items), costco_items_enriched_csv=str(costco_items),
giant_orders_csv=str(giant_orders), giant_orders_csv=str(giant_orders),
costco_orders_csv=str(costco_orders), costco_orders_csv=str(costco_orders),
<<<<<<< HEAD
resolutions_csv=str(Path(tmpdir) / "review_resolutions.csv"), resolutions_csv=str(Path(tmpdir) / "review_resolutions.csv"),
catalog_csv=str(Path(tmpdir) / "canonical_catalog.csv"), catalog_csv=str(Path(tmpdir) / "canonical_catalog.csv"),
links_csv=str(Path(tmpdir) / "product_links.csv"), links_csv=str(Path(tmpdir) / "product_links.csv"),
=======
>>>>>>> be1bf63 (Build pivot-ready purchase log)
output_csv=str(purchases_csv), output_csv=str(purchases_csv),
examples_csv=str(examples_csv), examples_csv=str(examples_csv),
) )
@@ -212,6 +222,7 @@ class PurchaseLogTests(unittest.TestCase):
self.assertEqual(2, len(purchase_rows)) self.assertEqual(2, len(purchase_rows))
self.assertEqual(1, len(example_rows)) self.assertEqual(1, len(example_rows))
<<<<<<< HEAD
def test_build_purchase_rows_applies_manual_resolution(self): def test_build_purchase_rows_applies_manual_resolution(self):
fieldnames = enrich_costco.OUTPUT_FIELDS fieldnames = enrich_costco.OUTPUT_FIELDS
giant_row = {field: "" for field in fieldnames} giant_row = {field: "" for field in fieldnames}
@@ -262,6 +273,8 @@ class PurchaseLogTests(unittest.TestCase):
self.assertEqual("approved", rows[0]["review_status"]) self.assertEqual("approved", rows[0]["review_status"])
self.assertEqual("create", rows[0]["resolution_action"]) self.assertEqual("create", rows[0]["resolution_action"])
=======
>>>>>>> be1bf63 (Build pivot-ready purchase log)
if __name__ == "__main__": if __name__ == "__main__":
unittest.main() unittest.main()