# scrape-giant

CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis. Run each script step by step from the terminal.

## What It Does

1. `collect_giant_web.py`: download Giant orders and items
2. `normalize_giant_web.py`: normalize Giant line items
3. `collect_costco_web.py`: download Costco orders and items
4. `normalize_costco_web.py`: normalize Costco line items
5. `build_purchases.py`: combine retailer outputs into one purchase table
6. `review_products.py`: review unresolved product matches in the terminal
7. `report_pipeline_status.py`: show how many rows survive each stage
8. `analyze_purchases.py`: write chart-ready analysis CSVs from the purchase table

## Requirements

- Python 3.10+
- Firefox installed with active Giant and Costco sessions

## Install

```bash
python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
```

## Optional `.env`

The current version works best with a `.env` file in the project root. The scrapers prompt for these values if they are not found in the current browser session.

- `collect_giant_web.py` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `collect_costco_web.py` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.

```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```

Current active path layout:

```text
data/
  giant-web/
    raw/
      collected_orders.csv
      collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
      collected_orders.csv
      collected_items.csv
    normalized_items.csv
  review/
    catalog.csv
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    pipeline_status.csv
    pipeline_status.json
  analysis/
    purchases.csv
    comparison_examples.csv
    item_price_over_time.csv
    spend_by_visit.csv
    items_per_visit.csv
    category_spend_over_time.csv
    retailer_store_breakdown.csv
```

## Run Order

Run the pipeline in this order:

```bash
python collect_giant_web.py
python normalize_giant_web.py
python collect_costco_web.py
python normalize_costco_web.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
python analyze_purchases.py
```

Why `build_purchases.py` runs twice:

- the first pass builds the current combined dataset and the review queue inputs
- `review_products.py` writes durable review decisions
- the second pass reapplies those decisions to the purchase output

If you only want to refresh the queue without reviewing interactively:

```bash
python review_products.py --refresh-only
```

If you want a quick stage-by-stage accountability check:

```bash
python report_pipeline_status.py
```

## Key Outputs

Giant:

- `data/giant-web/raw/collected_orders.csv`
- `data/giant-web/raw/collected_items.csv`
- `data/giant-web/normalized_items.csv`

Costco:

- `data/costco-web/raw/collected_orders.csv`
- `data/costco-web/raw/collected_items.csv`
- `data/costco-web/normalized_items.csv` (preserves raw totals and matched net discount fields)

Combined:

- `data/analysis/purchases.csv`
- `data/analysis/comparison_examples.csv`
- `data/analysis/item_price_over_time.csv`
- `data/analysis/spend_by_visit.csv`
- `data/analysis/items_per_visit.csv`
- `data/analysis/category_spend_over_time.csv`
- `data/analysis/retailer_store_breakdown.csv`
- `data/review/review_queue.csv`
- `data/review/review_resolutions.csv`
- `data/review/product_links.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
- `data/review/catalog.csv`

`data/analysis/purchases.csv` is the main analysis artifact. It is designed to support both:

- item-level price analysis
- visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown

The visit fields are carried directly in `purchases.csv`, so you can pivot on them without extra joins:

- `order_id`
- `purchase_date`
- `retailer`
- `store_name`
- `store_number`
- `store_city`
- `store_state`

## Review Workflow

Run `review_products.py` to clean up unresolved or weakly unified items:

- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Decisions are saved and reused on later runs. The review step is intentionally conservative:

- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text

## Notes

- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- Costco discount rows are preserved for auditability and are also matched back to purchased items during enrichment.

## Test

```bash
./venv/bin/python -m unittest discover -s tests
```

## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow
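## Appendix: Illustrative Sketches

The `.env` fallback described under Optional `.env` can be sketched roughly as follows. This is a stdlib-only illustration, not the actual implementation: the helper names `parse_env_file` and `load_env_value` and the prompt behavior are assumptions.

```python
import os
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values: dict[str, str] = {}
    if not path.exists():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_env_value(key: str, env_path: Path = Path(".env")) -> str:
    """Try the process environment, then .env, then fall back to prompting."""
    if key in os.environ:
        return os.environ[key]
    file_values = parse_env_file(env_path)
    if key in file_values:
        return file_values[key]
    return input(f"Enter {key}: ")
```

In the real scripts the fallback order differs per retailer (e.g. `collect_costco_web.py` also checks Firefox local storage), but the shape is the same: never fail hard on a missing credential, ask for it instead.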
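The reason `build_purchases.py` runs twice — durable review decisions from `data/review/review_resolutions.csv` get reapplied on the second pass — can be illustrated with a sketch like the one below. The column names (`item_key`, `action`, `canonical_product_id`) are hypothetical, chosen only to show the mechanism.

```python
import csv
from pathlib import Path


def load_resolutions(path: Path) -> dict[str, dict[str, str]]:
    """Load saved review decisions keyed by item. Column names are assumptions."""
    if not path.exists():
        return {}
    with path.open(newline="") as f:
        return {row["item_key"]: row for row in csv.DictReader(f)}


def apply_resolutions(rows, resolutions):
    """Reapply saved decisions: drop excluded items, attach canonical product ids."""
    for row in rows:
        decision = resolutions.get(row["item_key"])
        if decision is None:
            yield row  # unresolved: keep as-is; it will reappear in the review queue
        elif decision["action"] == "exclude":
            continue  # reviewer excluded this item from the catalog
        else:
            yield {**row, "canonical_product_id": decision["canonical_product_id"]}
```

Because decisions live in a separate file rather than in the purchase table itself, re-running the pipeline from scratch never loses review work.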
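Because the visit fields ride along in `purchases.csv`, visit-level pivots need no joins. A minimal sketch of the spend-by-visit aggregation, using only the stdlib: the visit field names come from the Key Outputs section above, while the `line_total` column name is an assumption.

```python
import csv
from collections import defaultdict
from io import StringIO


def spend_by_visit(csv_text: str) -> dict[tuple, float]:
    """Group purchase rows by visit and sum line totals.

    A visit is identified by (order_id, purchase_date, retailer, store_name),
    all of which are carried directly in purchases.csv.
    """
    totals: dict[tuple, float] = defaultdict(float)
    for row in csv.DictReader(StringIO(csv_text)):
        visit = (row["order_id"], row["purchase_date"],
                 row["retailer"], row["store_name"])
        totals[visit] += float(row["line_total"])
    return dict(totals)
```

The other visit-level outputs (items per visit, category spend, retailer/store breakdown) follow the same group-and-aggregate pattern over different keys.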