# scrape-giant

CLI to pull purchase history from the Giant and Costco websites and refine it into a single product catalog for external analysis. Run each script step by step from the terminal.
## What It Does

- `collect_giant_web.py`: download Giant orders and items
- `normalize_giant_web.py`: normalize Giant line items
- `collect_costco_web.py`: download Costco orders and items
- `normalize_costco_web.py`: normalize Costco line items
- `build_purchases.py`: combine retailer outputs into one purchase table
- `review_products.py`: review unresolved product matches in the terminal
- `report_pipeline_status.py`: show how many rows survive each stage
- `analyze_purchases.py`: write chart-ready analysis CSVs from the purchase table
## Requirements
- Python 3.10+
- Firefox installed with active Giant and Costco sessions
## Install

```
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
## Optional .env

The current version works best with a `.env` file in the project root. The scrapers will prompt for these values if they are not found in the current browser session.

- `collect_giant_web.py` prompts if `GIANT_USER_ID` or `GIANT_LOYALTY_NUMBER` is missing.
- `collect_costco_web.py` tries `.env` first, then Firefox local storage for session-backed values; `COSTCO_CLIENT_IDENTIFIER` should still be set explicitly.
- Costco discount matching happens later in `enrich_costco.py`; you do not need to pre-clean discount lines by hand.
```
GIANT_USER_ID=...
GIANT_LOYALTY_NUMBER=...
COSTCO_X_AUTHORIZATION=...
COSTCO_X_WCS_CLIENTID=...
COSTCO_CLIENT_IDENTIFIER=...
```
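The prompt-if-missing behavior can be sketched roughly as below. This is a minimal illustration, not code from the scripts: `get_setting` is an invented name, and the real collectors also consult Firefox local storage, which is omitted here.

```python
import os

def get_setting(name: str, prompt_fn=input) -> str:
    """Return a configuration value, prompting the user if it is absent.

    Minimal sketch of the prompt-if-missing behavior; the real collectors
    also check Firefox local storage for session-backed values.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # Fall back to an interactive prompt, as the collectors do when
        # values like GIANT_USER_ID are not configured.
        value = prompt_fn(f"Enter {name}: ").strip()
    return value
```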
Current active path layout:

```
data/
  giant-web/
    raw/
      collected_orders.csv
      collected_items.csv
    normalized_items.csv
  costco-web/
    raw/
      collected_orders.csv
      collected_items.csv
    normalized_items.csv
  review/
    catalog.csv
    review_queue.csv
    review_resolutions.csv
    product_links.csv
    pipeline_status.csv
    pipeline_status.json
  analysis/
    purchases.csv
    comparison_examples.csv
    item_price_over_time.csv
    spend_by_visit.csv
    items_per_visit.csv
    category_spend_over_time.csv
    retailer_store_breakdown.csv
```
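If you need to recreate the workspace by hand, the directory skeleton can be materialized with a few lines of stdlib Python. This is a convenience sketch; the collection scripts may create these paths themselves, and `ensure_layout` is not a function from the project.

```python
from pathlib import Path

# Directories from the layout above; collected CSVs land under raw/,
# normalized and derived outputs beside or below them.
DATA_DIRS = [
    "data/giant-web/raw",
    "data/costco-web/raw",
    "data/review",
    "data/analysis",
]

def ensure_layout(root: Path = Path(".")) -> None:
    """Create the data directory skeleton if it does not exist yet."""
    for rel in DATA_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
```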
## Run Order

Run the pipeline in this order:

```
python collect_giant_web.py
python normalize_giant_web.py
python collect_costco_web.py
python normalize_costco_web.py
python build_purchases.py
python review_products.py
python build_purchases.py
python review_products.py --refresh-only
python report_pipeline_status.py
python analyze_purchases.py
```
Why run `build_purchases.py` twice:

- the first pass builds the current combined dataset and review queue inputs
- `review_products.py` writes durable review decisions
- the second pass reapplies those decisions into the purchase output
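The reapply step amounts to a lookup over saved decisions. This is an illustrative sketch only: `item_key`, `action`, and `canonical_product` are assumed field names, not the project's actual schema.

```python
def apply_resolutions(items, resolutions):
    """Reapply saved review decisions to combined purchase rows.

    Sketch of the second build_purchases.py pass: rows with a saved
    decision are linked or dropped; everything else passes through.
    """
    resolved = []
    for row in items:
        decision = resolutions.get(row["item_key"])
        if decision is None:
            resolved.append(row)  # no decision yet: stays unresolved
        elif decision["action"] == "exclude":
            continue  # reviewer excluded this item from the output
        else:
            # Link the row to the reviewer's canonical product.
            resolved.append({**row, "product": decision["canonical_product"]})
    return resolved
```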
If you only want to refresh the queue without reviewing interactively:

```
python review_products.py --refresh-only
```
If you want a quick stage-by-stage accountability check:

```
python report_pipeline_status.py
```
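Conceptually, the status report is a row count per stage CSV. The sketch below shows the idea; the stage list and the exact output of `report_pipeline_status.py` are assumptions.

```python
import csv
from pathlib import Path

def stage_row_counts(stages):
    """Count data rows (header excluded) for each (name, csv_path) stage.

    Missing files count as zero, so the report also flags stages that
    have not run yet.
    """
    counts = {}
    for name, path in stages:
        p = Path(path)
        if not p.exists():
            counts[name] = 0
            continue
        with p.open(newline="") as fh:
            counts[name] = max(sum(1 for _ in csv.reader(fh)) - 1, 0)
    return counts
```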
## Key Outputs
Giant:

- `data/giant-web/raw/collected_orders.csv`
- `data/giant-web/raw/collected_items.csv`
- `data/giant-web/normalized_items.csv`
Costco:

- `data/costco-web/raw/collected_orders.csv`
- `data/costco-web/raw/collected_items.csv`
- `data/costco-web/normalized_items.csv` (preserves raw totals and matched net discount fields)
Combined:

- `data/analysis/purchases.csv`
- `data/analysis/comparison_examples.csv`
- `data/analysis/item_price_over_time.csv`
- `data/analysis/spend_by_visit.csv`
- `data/analysis/items_per_visit.csv`
- `data/analysis/category_spend_over_time.csv`
- `data/analysis/retailer_store_breakdown.csv`
- `data/review/review_queue.csv`
- `data/review/review_resolutions.csv`
- `data/review/product_links.csv`
- `data/review/pipeline_status.csv`
- `data/review/pipeline_status.json`
- `data/review/catalog.csv`
data/analysis/purchases.csv is the main analysis artifact. It is designed to support both:
- item-level price analysis
- visit-level analysis such as spend by visit, items per visit, category spend by visit, and retailer/store breakdown
The visit fields are carried directly in purchases.csv, so you can pivot on them without extra joins:
- `order_id`
- `purchase_date`
- `retailer`
- `store_name`
- `store_number`
- `store_city`
- `store_state`
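For example, visit-level spend falls out of a single group-by on those fields. This is a stdlib sketch; the `price` column name is an assumption about the purchases.csv schema.

```python
from collections import defaultdict

def spend_by_visit(rows):
    """Aggregate item rows into per-visit spend totals.

    Each row is a dict of purchases.csv fields; the visit key uses the
    columns carried directly on the table, so no joins are needed.
    """
    totals = defaultdict(float)
    for row in rows:
        key = (row["order_id"], row["purchase_date"], row["retailer"])
        totals[key] += float(row["price"])  # 'price' is an assumed column
    return dict(totals)
```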
## Review Workflow

Run `review_products.py` to clean up unresolved or weakly unified items:

- link an item to an existing canonical product
- create a new canonical product
- exclude an item
- skip it for later

Decisions are saved and reused on later runs.
The review step is intentionally conservative:
- weak exact-name matches stay in the queue instead of auto-creating canonical products
- canonical names should describe stable product identity, not retailer packaging text
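That conservative policy can be sketched as: reuse a previously saved link if one exists, otherwise queue the item for a human even when its name matches a catalog entry exactly. This is illustrative only; `item_key`, `prior_links`, and the return shape are assumptions, not the project's schema.

```python
def triage_item(item, catalog, prior_links):
    """Route an incoming item: reuse a saved link, else queue it.

    'prior_links' maps a retailer item key to a canonical product name
    saved by an earlier review run.
    """
    if item["item_key"] in prior_links:
        return ("link", prior_links[item["item_key"]])
    norm = " ".join(item["name"].lower().split())
    candidates = [c for c in catalog if " ".join(c.lower().split()) == norm]
    # A name-only match is treated as weak: queue it with the candidate
    # attached rather than auto-creating or auto-linking a canonical product.
    return ("queue", candidates[0] if candidates else None)
```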
## Notes
- This project is designed around fragile retailer scraping flows, so the code favors explicit retailer-specific steps over heavy abstraction.
- Costco discount rows are preserved for auditability and also matched back to purchased items during enrichment.
## Test

```
./venv/bin/python -m unittest discover -s tests
```
## Project Docs

- `pm/tasks.org`: task tracking
- `pm/data-model.org`: current data model notes
- `pm/review-workflow.org`: review and resolution workflow