ben/vath

Fork 0

Files

eulaly 28d6d222bd added create_csv.py

2026-05-07 17:22:00 -04:00

20 KiB

Raw Blame History

VATH Task Log

[X] t1.1: scrape one forum (1)
[X] t1.2: initial 4o sentiment
[X] t1.2.1: batch processing
[X] t1.2.2: Tokenizer / Batch mgmt
[X] t1.2.3: batch job refactor
[X] t1.3: cleanup model output and rejoin
[X] t1.1.1: text encoding cleanup
= Backlog =
[ ] X: first dash explorer
- acceptance criteria
[ ] X: complete proposal information
- acceptance criteria

[X] t1.1: scrape one forum (1)

Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step. ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be raw list of all comments on forum - could be useful later for whole-scrape Append forum id to viewall per forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452) Comments are hydrated in backend via js-cued button (AJAX?).

acceptance criteria

run manual scraper
1. store proposal title and description
2. store comment title, commenter, date
3. store relevant metadata
friendly/polite scraping
store forum as distinct item with title, desc
add forum ID in comment filename, eg forum452_comments_<datetime>.jsonl
remove reg_title and reg_desc from each comment; these belong in forum item
parse datetimes into object for later use (plotting)

notes

scraper/spiders/forum.py — ForumSpider using ViewComments.cfm?GdocForumID=N with POST pagination. First request fetches page 1 (vPerPage=500), discovers the last page number from the form's link, generates all remaining page requests upfront. Parses each div.Cbox for all required fields.
scraper/items.py — CommentItem with forum_id, reg_title, reg_desc, comment_id, author, date, title, text
tests/test_forum_spider.py — 7 tests, all passing
Settings: DEFAULT_RESPONSE_ENCODING=utf-8 (fixes Windows-1251 meta-tag mismatch), HTTPCACHE_ENABLED=True, feed output to output/
ViewComments.cfm instead of comments.cfm: POST to Comments.cfm returned a 500 error (wrong endpoint). ViewComments.cfm?GdocForumID=N is the correct listing URL, returns full comment text on the page itself — no per-comment follow requests needed.
Span-wrapped text: .divComment p::text missed 3.6% of comments where text is in <p><span>text</span></p>. Fixed to .divComment *::text, .divComment::text. Worth knowing for when the spider is extended to other forums.
start() vs start_requests(): Scrapy 2.13+ deprecates start_requests() in favor of async def start()
ForumItem vs CommentItem: ForumItem (forum_id, reg_title, reg_desc) yielded once on first page; CommentItem no longer carries reg_title/reg_desc. Both land in the same JSONL feed.
Dynamic output filename: set via from_crawler() overriding FEEDS at 'spider' priority — format is output/forum{id}_comments_%(time)s.jsonl. FEEDS removed from settings.py; spider owns it.
Date parsing: _parse_date() normalizes whitespace, upper-cases, parses "%m/%d/%y %I:%M %p" → ISO 8601; falls back to raw string on failure.

evidence

commit: beb5cf4 (AC1-2), e7df0b2 (AC3-6)
tests: 8 passing (`python -m pytest tests -q`) or (`python -m pytest tests/`)
- `scrapy crawl forum -a forum_id=452 -s LOG_LEVEL=WARNING 2>&1`
- retrieved 9083 comments
datetime: [2026-05-05 Tue 14:00]

[X] t1.2: initial 4o sentiment

Write a simple manual pipeline for gpt-4o that reads one scraped forum jsonl file and roduces a separate analyzed jsonl file. this step must not mutate scraper output. analysis should classify each comment for regulatory stance, generic tone/sentiment, confidence, and enough rationale/evidence to support later dashboard drilldown. Should be run manually, separate from scraper. You may use scrapy, but are not required to.

Sentiment is derived, not scraped - keep separate from raw comments.
keep jsonl as interchange/audit format

acceptance criteria

input scraped jsonl doc by filename/path, e.g. "./output/forum452_comments_<datetime>.jsonl"
- handle mixed itemtypes, e.g., forum + comment items
output new analysis file, e.g., "analysis/forum452_<datetime>_<model>_<datetime>.jsonl"
- one analysis record per comment
- include run_id, forum_id, comment_id, analyzed_at, model, prompt_version
capture stance toward proposed reg/guidance:
- `stance`: support, oppose, neutral, unknown
- `confidence`: 0-1
- short rationale, if provided by model
capture generic sentiment/tone separately from stance: `tone`=positive, negative, neutral, mixed, unclear
capture issue/topic tags for later grouping, may be empty
use .env for api key management
document the exact prompt version used; prompt text may live in code or docs, but must have a version string/hash in output records
for this run, an option to run the first N comments (5, 10, 20, 50) - will add batch processing later

notes

analysis/gpt4o/analysis.py: standalone script; core functions importable for tests.
Prompt version = SHA-256[:7] of SYSTEM_PROMPT+USER_TEMPLATE; auto-updates on prompt change.
Output: analysis/gpt4o/forum{id}_{scrape_ts}_model_{run_ts}.jsonl, one record per comment.
–limit {5,10,20,50} for test runs; omit for full corpus. Batch processing planned for later.
Incremental flush after each record: safe to interrupt and inspect partial output.
temperature=0.0 for deterministic, reproducible classifications across runs.
Retry: 3 attempts (delays 1s, 2s) on RateLimitError; all other exceptions → error record + continue.
openai==2.34.0 installed; python-dotenv already present; key loaded from .env via OPENAI_API_KEY.
MAX_COMMENT_CHARS=6000: covers >99% without truncation; outliers (e.g. 18k-char law firm brief) flagged with truncated=True.

evidence

commit: d834d18
tests: 20 passing (pytest tests/analysis_gpt4o_realtime.py), 28 total across suite
- `python ./analysis/gpt4o/analysis_realtime.py –limit 5 ./output/f452.jsonl`
- see: ./analysis/gpt4o/forum452_unknown_gpt-4o_2026-05-05T18-48-32+00-00.jsonl
date: [2026-05-05 Tue 15:00]

[X] t1.2.1: batch processing

Create analysis-batch.py to capture same elements as t1.2 above. May need to add multiple commands to upload, check batch status, download, etc. Commands should all be run manually. Reference: ./docs/openai-batch.md. openai batch output order is not guaranteed, so custom_id is mandatory for reconciliation

acceptance criteria

input scraped jsonl doc by filename/path, and process the whole thing via batch processing
- ignore non-comment items in jsonl
- do not modify raw scraper output
- specify model and prompt
output a run manifest in ./analysis/<model>/runs/<run_id>.json
- include: include run_id, input_filename, input_sha256, prompt_hash, model, batch_id, records_submitted, records_completed, records_failed, request_filename, raw_output_filename, normalized_output_filename, created_at, completed_at
add tests without live api calls

notes

analysis/gpt4o/analysis-batch.py with three subcommands:
- `submit`: reads scraped JSONL, builds batch request file (requests/<run_id>.jsonl), uploads to Files API, creates batch, saves manifest to runs/<run_id>.json. Prints run_id to stdout for scripting.
- `status`: retrieves batch from OpenAI, prints status + counts, updates manifest.
- `download`: downloads raw output to raw/<run_id>.jsonl, normalizes to <run_id>_<model>.jsonl using comment_lookup keyed by comment_id for reconciliation (batch output order not guaranteed). Updates manifest with filenames, counts, completed_at.
custom_id format: comment_{comment_id} — unique within a forum, stable across runs.
PROMPT_VERSION derived from analysis/prompt-1.txt (same file as realtime); both scripts produce matching prompt_hash in all records.
analysis/prompt-1.txt: system prompt as plaintext, read at import time by both scripts. Edit here to change prompt for both pipelines.

evidence

commit: 683bfb3 (remove hyphen), f3abbef
tests: 18 passing (pytest tests/analysis_gpt4o_batch.py), 46 total across suite
datetime: [2026-05-05 Tue 17:00]

[X] t1.2.2: Tokenizer / Batch mgmt

openai batch analysis requires coordination - more like a job queue. batch script should setup queue for user to setup manually; openai api will reject subsequent batches when the total daily token limit is maxed.

Acceptance Criteria

add token estimator utility script, probably to /analysis
add MODEL_LIMITS dict to analysis_batch.py. if there are more than (n)
- gpt-4o (30k tpm/90k tpd batch)
- gpt-4o-mini (200k tpm/2M tpd batch)
- add models listed in docs/openai.md
Auto-chunk submit: before writing the request file, walk comments, accumulate estimated tokens, and split into chunks that fit under the model's limit.
- Each chunk becomes its own batch submission with its own run_id.
- Drop –limit (or keep as hard cap override).
- Print all run_ids
- Submit the first batch only (failed)
Update test script to show tokenizer output

notes

MODEL_LIMITS and _MODEL_ENCODING dicts in analysis/gpt4o/analysis_batch.py; keyed by model name, sourced from docs/openai.md. Unknown models fall back to o200k_base encoding and 900k token limit.
estimate_tokens(messages, model): uses tiktoken (o200k_base) when available; falls back to chars/3 + 4 overhead per message.
chunk_comments_by_tokens(comments, forum, model): greedy bin-pack; respects 10% headroom (_LIMIT_BUFFER=0.90). Returns list of comment lists.
submit sends only chunks[0] — enqueued token limit is a TOTAL across all concurrent batches; stacking would exceed quota. Remaining chunk ranges are printed as manual instructions.
–limit N still available as a hard cap on total comments before chunking (useful when org-tier limit is below the published model limit).
pip install tiktoken required for exact token counting; chars/3 fallback activates automatically if not installed.

usage

`pip install tiktoken`
submit first chunk (auto-sized to model token limit, uses most recent output file) `python analysis/gpt4o/analysis_batch.py submit output/f452.jsonl –model gpt-4o-mini`
check status (defaults to most recent run) `python analysis/gpt4o/analysis_batch.py status`
download + normalize when complete `python analysis/gpt4o/analysis_batch.py download`
submit next chunk: rerun with `–limit` to cover the next N comments (track which comment_ids have already been analyzed to avoid duplicates)

validation

import pandas as pd
df_input = pd.read_json('C:/Users/moses/projects/vath/analysis/gpt4o/runs/75ee9a/f452.jsonl', lines=True)
# drop forum item
df_input_comments = df_input[df_input["comment_id"].notna()].copy()
df_output = pd.read_json('C:/Users/moses/projects/vath/analysis/gpt4o/runs/75ee9a/75ee9a6c-8fc2-4924-8d96-b55bb4d5e832_gpt-4o.jsonl', lines=True)
dfm = df_output.merge(df_input_comments,on="comment_id",how="left",suffixes=("","_input"),)
dfm.to_csv('C:/Users/moses/projects/vath/analysis/gpt4o/1.csv')

order columns: forum_id_input,comment_id,title,text,date,author,stance,stance_confidence,stance_rationale,tone,tags,error,truncated,analyzed_at,prompt_version,model

evidence

commit:
tests: 23 passing (pytest tests/analysis_gpt4o_batch.py), 51 total across suite
datetime: [2026-05-06 Wed 08:55]

[X] t1.2.3: batch job refactor

This task encompasses intent and fixes for 1.2.1 and 1.2.2. batch processing should be a resumable job queue, not a one-shot script. the user should not need to remember offsets, completed chunks, failed batches, or which comments remain.

Acceptance Criteria

create tokenizer to prepare the batch job

input: prompt.txt, forum.jsonl

output: report.json with each model's batch structure, cost, and time (considering tpd constraints)

analysis_batch should be able to take this report to run the job. good place to copy the raw scraper jsonl

  {'prompt': 'prompt1.txt',
   'input_file': 'f451.jsonl',
   'input_tokens': 123456789,
   'gpt-4o': {'jobs':71,'cost_$':4,'est_queue_days':3} # divide tokens by model TPD to get time_days
   'gpt-4o-mini': {'jobs':71,'cost_$':4,'est_queue_days':3} # divide tokens by model TPD to get time_days

batch py should contain commands to create, check, run, and complete jobs.
- inputs: report.json, –model, optional –job N, read api key from .env
- outputs:
  - status.json: job structure, status, metadata; updated when jobs are finished. includes all report.json info
  - for each job: jobN-input.jsonl (what is sent to openai); jobN-output-raw.jsonl, jobN-output.jsonl, and jobN-errors.jsonl (when downloaded)
  - jobN-output.jsonl contains:
    - one analysis record per comment
    - `run_id`, `forum_id`, `comment_id`, `analyzed_at`, `model`, `prompt_version`
    - `stance` toward proposed reg/guidance: support|oppose|neutral|unclear
    - `stance_confidence`: 0-1
    - short rationale, if provided by model
    - generic sentiment `tone` (separate from stance): positive|negative|neutral|mixed|unclear
    - `tags` for later grouping, may be empty
- commands: `create`, `submit`, `status`, `download`
  - `create` run directory, copy input/prompt/report, generate status.json, job request files
  - `submit` if eligible, submit next or specified job; does not blindly stack jobs, warns if prev jobs in progress, print next action
  - `status` check status of one or all submitted jobs, update status.json
  - `download` raw output (jobN-output-raw.jsonl) and error files for completed jobs, and normalize raw output (jobN-output.jsonl) auto run status.
tests without live api calls
- partial completed run
- failed batch records
- out-of-order output
- duplicate custom_id
- missing output file
- resume from status.json
- remaining-comment detection

notes

analysis/tokenizer.py: new standalone script; imports openai_batch for MODEL_LIMITS, estimate_tokens, build_messages. Reads input JSONL + prompt, computes per-model jobs/cost/time table, writes reports/<stem>-report.json. MODEL_PRICING dict lives here (not in openai_batch). Pass a jobN-input.jsonl to count actual tokens instead.
analysis/openai_batch.py: fully rewritten with four subcommands: create, submit, status, download. Job dirs at analysis/jobs/<stem[:8]>-N/.
Job directories: analysis/jobs/<stem[:8]>-N/ (e.g. f452-1). Each run is self-contained: forum.jsonl, prompt.txt, report.json, jobN-input.jsonl, jobN-output-raw.jsonl, jobN-output.jsonl, jobN-errors.jsonl.
status.json: tracks all jobs with pending/submitted/in_progress/completed/failed states. Updated by submit, status, download.
_find_next_eligible_job: pure function for testability. Returns (next_pending_job, None) or (None, warning). Blocks submission if previous job is in_progress/submitted.
create: no API key required. Reads report.json, re-chunks comments, writes all jobN-input.jsonl files, writes status.json.
submit: uploads jobN-input.jsonl to Files API, creates batch, updates status.json to 'submitted'. Will not stack batches.
status: retrieves batch from OpenAI, updates status.json counts and status.
download: auto-runs status first, downloads output_file_id → jobN-output-raw.jsonl, error_file_id → jobN-errors.jsonl, normalizes → jobN-output.jsonl. Updates status.json.
tests/tokenizer.py: 19 tests for compute_report schema, cost/time calculation, MODEL_PRICING coverage, print_table output, count_input_tokens, report.json round-trip.
Token limit buffer: _LIMIT_BUFFER=0.80 (20% headroom). Estimate uses OpenAI cookbook chat formula (role tokens + 3-token reply primer). Verify a job file with: python analysis/tokenizer.py analysis/jobs/<dir>/jobN-input.jsonl

usage

# 1. estimate tokens and cost
python analysis/tokenizer.py output/f452.jsonl --prompt analysis/prompt-1.txt
# writes reports/f452-report.json

# 2. verify actual tokens in a job file (optional sanity check)
python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl

# 3. create job directory (no api key needed)
python analysis/openai_batch.py create reports/f452-report.json --model gpt-5.4-mini
# creates analysis/jobs/f452-1/

# 4. submit first job
python analysis/openai_batch.py submit

# 5. check status (repeat until completed)
python analysis/openai_batch.py status

# 6. download and normalize
python analysis/openai_batch.py download

# 7. submit next job (if multi-job run), then repeat 5-6
python analysis/openai_batch.py submit

evidence

commit:
tests: passing (pytest tests/openai_batch.py tests/openai_realtime.py tests/tokenizer.py)
datetime: [2026-05-06 Wed]

[X] t1.3: cleanup model output and rejoin

create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv. review create_csv for the simple approach - keep this regardless

acceptance criteria

input raw scrape jsonl and all *-output.jsonl files in a dir
join by comment_id, not dataframe index
output csv columns in review order:
- forum_id, comment_id, title, text, date, author
- stance, stance_confidence, stance_rationale, tone, tags
- error, truncated, analyzed_at, prompt_version, model
output parquet?
print validation counts
- raw comments
- analyzed records
- joined records
- missing comment text
- duplicate comment_ids
- error records
- stance counts
- tone counts
tests cover join behavior and missing/duplicate ids

notes

analysis/create_csv.py: reads raw scrape JSONL + all job*-output.jsonl in a job dir (skips *-output-raw.jsonl); left-joins on comment_id; writes review.csv (UTF-8 BOM for Excel); optional –parquet.
Uses pd.read_json(path, lines=True) — no manual JSON parsing.
Prints summary counts: raw/analyzed/joined/unanalyzed/errors/duplicate IDs, stance distribution, tone distribution.

usage

python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
# output: analysis/jobs/f452-1/review.csv (and optionally review.parquet)

evidence

commit:
tests: passing (pytest tests/create_csv.py tests/encoding.py)
csv: analysis/jobs/f452-1/review.csv
datetime: [2026-05-07 Thu]

[X] t1.1.1: text encoding cleanup

fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as â€™.

acceptance criteria

identify whether mojibake exists in raw scrape, analysis output, or csv export only
add repair step at the earliest correct layer
preserve original raw scrape if repair changes source text
add test cases for common bad sequences:
- â€™
- â€œ
- â€
- â€“
- â€”
document whether repaired text is used for model input

notes

Diagnosis: f452.jsonl raw data is CLEAN — proper Unicode throughout (U+2019, U+201C, etc.). The DEFAULT_RESPONSE_ENCODING=utf-8 spider setting is working for this site. No mojibake or FFFD chars found.
The encoding issue would surface for forums whose server sends cp1252 bytes (0x91-0x97 range) embedded in otherwise UTF-8 content. FFFD replacement chars appear when the UTF-8 decoder hits those bytes. Once the byte is replaced by FFFD, the original character cannot be recovered.
Repair layer: analysis/encoding.py applied in analysis/validate.py at reporting time. Raw scrape JSONL is never modified (AC3).
Model input: repair_text() is NOT applied in build_messages() for this dataset since raw data is clean. Can be added if a future forum produces dirty text.
Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.

evidence

commit:
tests: passing (pytest tests/encoding.py)
before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
datetime: [2026-05-07 Thu]

`=` Backlog `=`

[ ] X: first dash explorer

create a local dash app for exploring one forum analysis dataset.

acceptance criteria

load parquet/csv review dataset
show stance counts, tone counts, tag counts, and confidence histogram
provide filters for stance, tone, confidence, tag, and text search
show filtered comment table
clicking/selecting a comment shows full text and model rationale
app runs locally with one command

[ ] X: complete proposal information

Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.

acceptance criteria

Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
Item: `Comment` stores forum_id, comment_id, author, title, text, date, url

20 KiB Raw Blame History Unescape Escape

VATH Task Log

[X] t1.1: scrape one forum (1)

acceptance criteria

notes

evidence

[X] t1.2: initial 4o sentiment

acceptance criteria

notes

evidence

[X] t1.2.1: batch processing

acceptance criteria

notes

evidence

[X] t1.2.2: Tokenizer / Batch mgmt

Acceptance Criteria

notes

usage

validation

evidence

[X] t1.2.3: batch job refactor

Acceptance Criteria

notes

usage

evidence

[X] t1.3: cleanup model output and rejoin

acceptance criteria

notes

usage

evidence

[X] t1.1.1: text encoding cleanup

acceptance criteria

notes

evidence

= Backlog =

[ ] X: first dash explorer

acceptance criteria

[ ] X: complete proposal information

acceptance criteria

20 KiB

Raw Blame History

`=` Backlog `=`