scrape-giant/pm/review-workflow.org

review and item-resolution workflow

This document defines the durable review workflow for unresolved observed products.

persistent files

  • `combined_output/purchases.csv`: Flat normalized purchase log. This is the review input because it retains:

    • raw item name
    • normalized item name
    • observed product id
    • canonical product id when resolved
    • retailer/order/date/price context
  • `combined_output/review_queue.csv`: Current unresolved observed products, grouped for review.
  • `combined_output/review_resolutions.csv`: Durable mapping decisions from observed products to canonical products.
  • `combined_output/canonical_catalog.csv`: Durable canonical item catalog used by manual review and later purchase-log rebuilds.

There is no separate alias file in v1. `review_resolutions.csv` is the mapping layer from observed products to canonical product ids.
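
As a rough illustration of how the review queue could be derived from the purchase log, the sketch below groups unresolved rows by `observed_product_id`. It is not the actual logic of `review_products.py`: the function name `build_review_queue`, the `raw_item_names` and `purchase_count` fields, the id formats, and the most-purchased-first ordering are all assumptions made for this example.

```python
from collections import defaultdict


def build_review_queue(purchase_rows):
    """Group unresolved purchase rows by observed_product_id (hypothetical helper)."""
    groups = defaultdict(list)
    for row in purchase_rows:
        # A row is treated as unresolved when its canonical id is empty or missing.
        if not row.get("canonical_product_id"):
            groups[row["observed_product_id"]].append(row)

    queue = []
    for observed_id, rows in groups.items():
        queue.append({
            "observed_product_id": observed_id,
            "normalized_item_name": rows[0]["normalized_item_name"],
            # Distinct raw names give the reviewer context from the receipts.
            "raw_item_names": sorted({r["raw_item_name"] for r in rows}),
            "purchase_count": len(rows),
        })
    # Assumed ordering: most frequently purchased products first.
    queue.sort(key=lambda entry: -entry["purchase_count"])
    return queue


# Sample rows with made-up ids and names.
purchases = [
    {"observed_product_id": "obs_1", "raw_item_name": "BANANA ORG",
     "normalized_item_name": "banana", "canonical_product_id": "can_1"},
    {"observed_product_id": "obs_2", "raw_item_name": "MILK 2% GAL",
     "normalized_item_name": "milk 2%", "canonical_product_id": ""},
    {"observed_product_id": "obs_2", "raw_item_name": "MILK 2PCT GAL",
     "normalized_item_name": "milk 2%", "canonical_product_id": ""},
    {"observed_product_id": "obs_3", "raw_item_name": "OAT BAR",
     "normalized_item_name": "oat bar", "canonical_product_id": ""},
]
queue = build_review_queue(purchases)
```

Grouping by id rather than by item text means two differently spelled receipt lines that share an observed product are reviewed once.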

workflow

  1. Run `build_purchases.py`. This refreshes the purchase log and seeds or updates the canonical catalog from the current auto-linked canonical rows.
  2. Run `review_products.py`. This rebuilds `review_queue.csv` from unresolved purchase rows, then prompts in the terminal for one observed product at a time.
  3. Choose one of:

    • link to existing canonical
    • create new canonical
    • exclude
    • skip
  4. `review_products.py` writes decisions immediately to:

    • `review_resolutions.csv`
    • `canonical_catalog.csv` when a new canonical item is created
  5. Rerun `build_purchases.py`. This reapplies approved resolutions so that the final normalized purchase log carries the reviewed `canonical_product_id`.
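
The reapply step in the last item can be pictured as a lookup from observed id to canonical id. The following is a minimal sketch under stated assumptions, not the real `build_purchases.py`: the helper name `apply_resolutions` is hypothetical, a resolution with an empty canonical id (for example a skipped or excluded product) is assumed to leave the row untouched, and a reviewed resolution is assumed to take precedence over any existing value.

```python
def apply_resolutions(purchase_rows, resolutions):
    """Return new purchase rows with reviewed canonical ids filled in (hypothetical helper)."""
    # Build the mapping layer: observed_product_id -> canonical_product_id.
    by_observed = {
        r["observed_product_id"]: r["canonical_product_id"]
        for r in resolutions
        if r.get("canonical_product_id")
    }
    rebuilt = []
    for row in purchase_rows:
        updated = dict(row)  # copy so the input rows are left unchanged
        canonical_id = by_observed.get(row["observed_product_id"])
        if canonical_id:
            # Assumption: a reviewed decision overrides whatever was there before.
            updated["canonical_product_id"] = canonical_id
        rebuilt.append(updated)
    return rebuilt


# Sample data with made-up ids.
purchase_log = [
    {"observed_product_id": "obs_1", "canonical_product_id": "can_1"},
    {"observed_product_id": "obs_2", "canonical_product_id": ""},
    {"observed_product_id": "obs_3", "canonical_product_id": ""},
]
approved = [
    {"observed_product_id": "obs_2", "canonical_product_id": "can_7"},
    {"observed_product_id": "obs_3", "canonical_product_id": ""},
]
rebuilt_log = apply_resolutions(purchase_log, approved)
```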

what the human edits

The primary interface is terminal prompts in `review_products.py`.

The human provides:

  • existing canonical id when linking
  • canonical name/category/product type when creating a new canonical item
  • optional resolution notes

The generated CSVs remain editable by hand if needed, but the intended workflow is terminal-first.

durability

  • Resolutions are keyed by `observed_product_id`, not by one-off text substitution.
  • Canonical products are keyed by a stable `canonical_product_id`.
  • Future runs reuse approved mappings through `review_resolutions.csv`.
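
To make the keyed-mapping idea concrete, here is a small sketch of an upsert that stores one decision per `observed_product_id` and round-trips it through CSV text. The column name `resolution_notes`, the helper names, and the id formats are assumptions for illustration; the actual layout of `review_resolutions.csv` may differ.

```python
import csv
import io

# Assumed column set; only the two id columns are named in this document.
FIELDS = ["observed_product_id", "canonical_product_id", "resolution_notes"]


def upsert_resolution(resolutions, observed_id, canonical_id, notes=""):
    """Insert or replace the decision for one observed product (hypothetical helper)."""
    resolutions[observed_id] = {
        "observed_product_id": observed_id,
        "canonical_product_id": canonical_id,
        "resolution_notes": notes,
    }


def to_csv_text(resolutions):
    """Serialize decisions to CSV text, sorted by key for stable diffs."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for key in sorted(resolutions):
        writer.writerow(resolutions[key])
    return buf.getvalue()


def from_csv_text(text):
    """Load decisions back into a dict keyed by observed_product_id."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["observed_product_id"]: dict(row) for row in reader}


decisions = {}
upsert_resolution(decisions, "obs_2", "can_7", "first pass")
# A later decision for the same observed product replaces the earlier one.
upsert_resolution(decisions, "obs_2", "can_9", "corrected")
csv_text = to_csv_text(decisions)
reloaded = from_csv_text(csv_text)
```

Because the key is the observed product id and not the item text, a later run that sees the same observed product can reuse the stored decision without prompting again.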

retention of audit fields

The final `purchases.csv` retains:

  • `raw_item_name`
  • `normalized_item_name`
  • `canonical_product_id`

This preserves the raw receipt description, the deterministic parser output, and the human-approved canonical identity in one flat purchase log.
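
A quick way to confirm that a rebuilt purchase log still carries the three audit fields is to inspect its header. The check below is a sketch only; `missing_audit_fields` is a hypothetical helper and the sample CSV text is made up.

```python
import csv
import io

AUDIT_FIELDS = ("raw_item_name", "normalized_item_name", "canonical_product_id")


def missing_audit_fields(csv_file):
    """Return the audit columns absent from the CSV header (hypothetical helper)."""
    header = csv.DictReader(csv_file).fieldnames or []
    return [name for name in AUDIT_FIELDS if name not in header]


# Made-up sample: one complete header and one that lost a column.
complete = io.StringIO(
    "raw_item_name,normalized_item_name,canonical_product_id\n"
    "MILK 2% GAL,milk 2%,can_7\n"
)
incomplete = io.StringIO("raw_item_name,canonical_product_id\nMILK 2% GAL,can_7\n")

missing_in_complete = missing_audit_fields(complete)
missing_in_incomplete = missing_audit_fields(incomplete)
```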