# scrape-giant

CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis. Run each script step by step from the terminal.

## What It Does

1. `scrape_giant.py`: download Giant orders and items
2. `enrich_giant.py`: normalize Giant line items
3. `scrape_costco.py`: download Costco orders and items
4. `enrich_costco.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal
7. `report_pipeline_status.py`: show how many rows survive each stage

Active refactor entrypoints:

- `collect_giant_web.py`
- `collect_costco_web.py`
- `normalize_giant_web.py`
- `normalize_costco_web.py`

## Requirements

- Python 3.10+
- Firefox installed with active Giant and Costco sessions

## Install

```bash
python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
```

## Optional `.env`

The current version works best with a `.env` file in the project root. The scrapers will prompt for these values if they are not found in the current browser session.

- `collect_giant_web.py` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `collect_costco_web.py` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.

```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```

Current active path layout:

```text
data/
  giant-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
    collected_orders.csv
    collected_items.csv
    normalized_items.csv
  review/
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    purchases.csv
    pipeline_status.csv
    pipeline_status.json
  catalog.csv
```

## Run Order

Run the pipeline in this order:

```bash
python collect_giant_web.py
python normalize_giant_web.py
python collect_costco_web.py
python normalize_costco_web.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
```

Why run `build_purchases.py` twice:

- the first pass builds the current combined dataset and the review queue inputs
- `review_products.py` writes durable review decisions
- the second pass reapplies those decisions to the purchase output

If you only want to refresh the queue without reviewing interactively:

```bash
python review_products.py --refresh-only
```

If you want a quick stage-by-stage accountability check:

```bash
python report_pipeline_status.py
```

## Key Outputs

Giant:

- `data/giant-web/collected_orders.csv`
- `data/giant-web/collected_items.csv`
- `data/giant-web/normalized_items.csv`

Costco:

- `data/costco-web/collected_orders.csv`
- `data/costco-web/collected_items.csv`
- `data/costco-web/normalized_items.csv` (preserves raw totals and matched net discount fields)

Combined:

- `data/review/purchases.csv`
- `data/review/review_queue.csv`
- `data/review/review_resolutions.csv`
- `data/review/product_links.csv`
- `data/review/comparison_examples.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
- `data/catalog.csv`

`data/review/purchases.csv` is the main analysis artifact.
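Because the visit fields live directly in `purchases.csv`, a visit-level pivot needs no joins. A minimal sketch with the standard library, assuming the documented visit columns plus a per-line amount column (`line_total` is an assumed name, not taken from the repo docs):

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; column names follow the documented visit
# fields, while `line_total` is an assumption for illustration only.
SAMPLE = """order_id,purchase_date,retailer,store_name,line_total
A1,2024-05-01,giant,Giant 123,4.99
A1,2024-05-01,giant,Giant 123,2.50
B7,2024-05-03,costco,Costco 42,19.99
"""

def spend_by_visit(fh):
    """Sum line totals per (retailer, order_id) visit."""
    totals = defaultdict(float)
    for row in csv.DictReader(fh):
        totals[(row["retailer"], row["order_id"])] += float(row["line_total"])
    return dict(totals)

print(spend_by_visit(io.StringIO(SAMPLE)))
```

The same grouping key extends to items per visit or category spend per visit by swapping the aggregated value.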
`purchases.csv` is designed to support both:

- item-level price analysis
- visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown

The visit fields are carried directly in `purchases.csv`, so you can pivot on them without extra joins:

- `order_id`
- `purchase_date`
- `retailer`
- `store_name`
- `store_number`
- `store_city`
- `store_state`

## Review Workflow

Run `review_products.py` to clean up unresolved or weakly unified items:

- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Decisions are saved and reused on later runs.

The review step is intentionally conservative:

- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text

## Notes

- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py`, `scrape_costco.py`, `enrich_giant.py`, and `enrich_costco.py` are now legacy-compatible entrypoints; prefer the `collect_*` and `normalize_*` scripts for active work.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

## Test

```bash
./venv/bin/python -m unittest discover -s tests
```

## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow
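For reference, the documented run order (including the double `build_purchases.py` pass) can be driven from one helper. This is a sketch, not a file in the repo; it assumes the scripts above sit in the working directory:

```python
import subprocess
import sys

# Documented run order; build_purchases.py runs twice so that review
# decisions recorded by review_products.py are reapplied on the second pass.
RUN_ORDER = [
    ["collect_giant_web.py"],
    ["normalize_giant_web.py"],
    ["collect_costco_web.py"],
    ["normalize_costco_web.py"],
    ["build_purchases.py"],
    ["review_products.py"],
    ["build_purchases.py"],
    ["review_products.py", "--refresh-only"],
    ["report_pipeline_status.py"],
]

def run_pipeline(runner=subprocess.run):
    """Run each stage in order; check=True stops on the first failure."""
    for args in RUN_ORDER:
        runner([sys.executable, *args], check=True)
```

Call `run_pipeline()` from a shell session with the virtualenv active; stages after a failed one are never reached, which matches the step-by-step intent of the pipeline.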