# scrape-giant

Small CLI pipeline for pulling purchase history from Giant and Costco, enriching line items, and building a reviewable cross-retailer purchase dataset.

There is no one-shot runner yet. Today, you run the scripts step by step from the terminal.

## What It Does

- `scrape_giant.py`: download Giant orders and items
- `enrich_giant.py`: normalize Giant line items
- `scrape_costco.py`: download Costco orders and items
- `enrich_costco.py`: normalize Costco line items
- `build_purchases.py`: combine retailer outputs into one purchase table
- `review_products.py`: review unresolved product matches in the terminal

## Requirements

- Python 3.10+
- Firefox installed with active Giant and Costco sessions

## Install

```bash
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## Optional `.env`

The current version works best with a `.env` file in the project root. The scrapers prompt for these values if they are not found in the current browser session.

- `scrape_giant` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `scrape_costco` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.

```env
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...

# Costco can use these if present, but it can also pull session values from Firefox.
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```

## Run Order

Run the pipeline in this order:

```bash
python scrape_giant.py
python enrich_giant.py
python scrape_costco.py
python enrich_costco.py
python build_purchases.py
python review_products.py
python build_purchases.py
```

Why run `build_purchases.py` twice:

- the first pass builds the current combined dataset and the review queue inputs
- `review_products.py` writes durable review decisions
- the second pass reapplies those decisions to the purchase output

If you only want to refresh the queue without reviewing interactively:

```bash
python review_products.py --refresh-only
```

## Key Outputs

Giant:

- `giant_output/orders.csv`
- `giant_output/items.csv`
- `giant_output/items_enriched.csv`

Costco:

- `costco_output/orders.csv`
- `costco_output/items.csv`
- `costco_output/items_enriched.csv`

Combined:

- `combined_output/purchases.csv`
- `combined_output/review_queue.csv`
- `combined_output/review_resolutions.csv`
- `combined_output/canonical_catalog.csv`
- `combined_output/product_links.csv`
- `combined_output/comparison_examples.csv`

## Review Workflow

`review_products.py` is the manual cleanup step for unresolved or weakly unified items.

In the terminal, you can:

- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Those decisions are saved and reused on later runs.

## Notes

- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- `scrape_giant.py` and `scrape_costco.py` are meant to work as standalone acquisition scripts.
- `validate_cross_retailer_flow.py` is a proof/check script, not a required production step.

## Test

```bash
./venv/bin/python -m unittest discover -s tests
```

## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow
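
## Sketch: Chaining the Run Order

Since there is no one-shot runner yet, the run order can be chained with a small wrapper while you wait for one. The following is a minimal sketch, not part of the repository: the file name `run_pipeline.py` and the helpers `build_commands` and `run_pipeline` are hypothetical. It assumes the scripts live in the current working directory and that each exits with a non-zero status on failure.

```python
"""Hypothetical run_pipeline.py: a sketch, not shipped with this project."""
import subprocess
import sys

# Script order copied from the Run Order section; build_purchases.py
# appears twice so review decisions are reapplied on the second pass.
PIPELINE = [
    "scrape_giant.py",
    "enrich_giant.py",
    "scrape_costco.py",
    "enrich_costco.py",
    "build_purchases.py",
    "review_products.py",
    "build_purchases.py",
]


def build_commands(python=sys.executable, steps=PIPELINE):
    """Return one argv list per pipeline step, in run order."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        # check=True makes subprocess.run raise CalledProcessError
        # when a step exits with a non-zero status.
        runner(argv, check=True)
```

Calling `run_pipeline(build_commands())` from the project root would run all seven steps. Child processes inherit the terminal, so the interactive prompts in `review_products.py` should still work, though that has not been verified against this project.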