Files
vath/docs/tasks.org
2026-05-07 17:22:00 -04:00

20 KiB
Raw Blame History

VATH Task Log

[X] t1.1: scrape one forum (1)

Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step. ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be raw list of all comments on forum - could be useful later for whole-scrape Append forum id to viewall per forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452) Comments are hydrated in backend via js-cued button (AJAX?).

acceptance criteria

  1. run manual scraper

    1. store proposal title and description
    2. store comment title, commenter, date
    3. store relevant metadata
  2. friendly/polite scraping
  3. store forum as distinct item with title, desc
  4. add forum ID in comment filename, eg forum452_comments_<datetime>.jsonl
  5. remove reg_title and reg_desc from each comment; these belong in forum item
  6. parse datetimes into object for later use (plotting)

notes

  • scraper/spiders/forum.py — ForumSpider using ViewComments.cfm?GdocForumID=N with POST pagination. First request fetches page 1 (vPerPage=500), discovers the last page number from the form's link, generates all remaining page requests upfront. Parses each div.Cbox for all required fields.
  • scraper/items.py — CommentItem with forum_id, reg_title, reg_desc, comment_id, author, date, title, text
  • tests/test_forum_spider.py — 7 tests, all passing
  • Settings: DEFAULT_RESPONSE_ENCODING=utf-8 (fixes Windows-1251 meta-tag mismatch), HTTPCACHE_ENABLED=True, feed output to output/
  • ViewComments.cfm instead of comments.cfm: POST to Comments.cfm returned a 500 error (wrong endpoint). ViewComments.cfm?GdocForumID=N is the correct listing URL, returns full comment text on the page itself — no per-comment follow requests needed.
  • Span-wrapped text: .divComment p::text missed 3.6% of comments where text is in <p><span>text</span></p>. Fixed to .divComment *::text, .divComment::text. Worth knowing for when the spider is extended to other forums.
  • start() vs start_requests(): Scrapy 2.13+ deprecates start_requests() in favor of async def start()
  • ForumItem vs CommentItem: ForumItem (forum_id, reg_title, reg_desc) yielded once on first page; CommentItem no longer carries reg_title/reg_desc. Both land in the same JSONL feed.
  • Dynamic output filename: set via from_crawler() overriding FEEDS at 'spider' priority — format is output/forum{id}_comments_%(time)s.jsonl. FEEDS removed from settings.py; spider owns it.
  • Date parsing: _parse_date() normalizes whitespace, upper-cases, parses "%m/%d/%y %I:%M %p" → ISO 8601; falls back to raw string on failure.

evidence

  • commit: beb5cf4 (AC1-2), e7df0b2 (AC3-6)
  • tests: 8 passing (`python -m pytest tests -q`) or (`python -m pytest tests/`)

    • `scrapy crawl forum -a forum_id=452 -s LOG_LEVEL=WARNING 2>&1`
    • retrieved 9083 comments
  • datetime: [2026-05-05 Tue 14:00]

[X] t1.2: initial 4o sentiment

Write a simple manual pipeline for gpt-4o that reads one scraped forum jsonl file and roduces a separate analyzed jsonl file. this step must not mutate scraper output. analysis should classify each comment for regulatory stance, generic tone/sentiment, confidence, and enough rationale/evidence to support later dashboard drilldown. Should be run manually, separate from scraper. You may use scrapy, but are not required to.

  • Sentiment is derived, not scraped - keep separate from raw comments.
  • keep jsonl as interchange/audit format

acceptance criteria

  1. input scraped jsonl doc by filename/path, e.g. "./output/forum452_comments_<datetime>.jsonl"

    • handle mixed itemtypes, e.g., forum + comment items
  2. output new analysis file, e.g., "analysis/forum452_<datetime>_<model>_<datetime>.jsonl"

    • one analysis record per comment
    • include run_id, forum_id, comment_id, analyzed_at, model, prompt_version
  3. capture stance toward proposed reg/guidance:

    • `stance`: support, oppose, neutral, unknown
    • `confidence`: 0-1
    • short rationale, if provided by model
  4. capture generic sentiment/tone separately from stance: `tone`=positive, negative, neutral, mixed, unclear
  5. capture issue/topic tags for later grouping, may be empty
  6. use .env for api key management
  7. document the exact prompt version used; prompt text may live in code or docs, but must have a version string/hash in output records
  8. for this run, an option to run the first N comments (5, 10, 20, 50) - will add batch processing later

notes

  • analysis/gpt4o/analysis.py: standalone script; core functions importable for tests.
  • Prompt version = SHA-256[:7] of SYSTEM_PROMPT+USER_TEMPLATE; auto-updates on prompt change.
  • Output: analysis/gpt4o/forum{id}scrape_tsmodelrun_ts.jsonl, one record per comment.
  • limit {5,10,20,50} for test runs; omit for full corpus. Batch processing planned for later.
  • Incremental flush after each record: safe to interrupt and inspect partial output.
  • temperature=0.0 for deterministic, reproducible classifications across runs.
  • Retry: 3 attempts (delays 1s, 2s) on RateLimitError; all other exceptions → error record + continue.
  • openai==2.34.0 installed; python-dotenv already present; key loaded from .env via OPENAI_API_KEY.
  • MAX_COMMENT_CHARS=6000: covers >99% without truncation; outliers (e.g. 18k-char law firm brief) flagged with truncated=True.

evidence

  • commit: d834d18
  • tests: 20 passing (pytest tests/analysis_gpt4o_realtime.py), 28 total across suite

    • `python ./analysis/gpt4o/analysis_realtime.py limit 5 ./output/f452.jsonl`
    • see: ./analysis/gpt4o/forum452_unknown_gpt-4o_2026-05-05T18-48-32+00-00.jsonl
  • date: [2026-05-05 Tue 15:00]

[X] t1.2.1: batch processing

Create analysis-batch.py to capture same elements as t1.2 above. May need to add multiple commands to upload, check batch status, download, etc. Commands should all be run manually. Reference: ./docs/openai-batch.md. openai batch output order is not guaranteed, so custom_id is mandatory for reconciliation

acceptance criteria

  1. input scraped jsonl doc by filename/path, and process the whole thing via batch processing

    • ignore non-comment items in jsonl
    • do not modify raw scraper output
    • specify model and prompt
  2. output a run manifest in ./analysis/<model>/runs/<run_id>.json

    • include: include run_id, input_filename, input_sha256, prompt_hash, model, batch_id, records_submitted, records_completed, records_failed, request_filename, raw_output_filename, normalized_output_filename, created_at, completed_at
  3. add tests without live api calls

notes

  • analysis/gpt4o/analysis-batch.py with three subcommands:

    • `submit`: reads scraped JSONL, builds batch request file (requests/<run_id>.jsonl), uploads to Files API, creates batch, saves manifest to runs/<run_id>.json. Prints run_id to stdout for scripting.
    • `status`: retrieves batch from OpenAI, prints status + counts, updates manifest.
    • `download`: downloads raw output to raw/<run_id>.jsonl, normalizes to <run_id>_<model>.jsonl using comment_lookup keyed by comment_id for reconciliation (batch output order not guaranteed). Updates manifest with filenames, counts, completed_at.
  • custom_id format: commentcomment_id — unique within a forum, stable across runs.
  • PROMPT_VERSION derived from analysis/prompt-1.txt (same file as realtime); both scripts produce matching prompt_hash in all records.
  • analysis/prompt-1.txt: system prompt as plaintext, read at import time by both scripts. Edit here to change prompt for both pipelines.

evidence

  • commit: 683bfb3 (remove hyphen), f3abbef
  • tests: 18 passing (pytest tests/analysis_gpt4o_batch.py), 46 total across suite
  • datetime: [2026-05-05 Tue 17:00]

[X] t1.2.2: Tokenizer / Batch mgmt

openai batch analysis requires coordination - more like a job queue. batch script should setup queue for user to setup manually; openai api will reject subsequent batches when the total daily token limit is maxed.

Acceptance Criteria

  1. add token estimator utility script, probably to /analysis
  2. add MODEL_LIMITS dict to analysis_batch.py. if there are more than (n)

    • gpt-4o (30k tpm/90k tpd batch)
    • gpt-4o-mini (200k tpm/2M tpd batch)
    • add models listed in docs/openai.md
  3. Auto-chunk submit: before writing the request file, walk comments, accumulate estimated tokens, and split into chunks that fit under the model's limit.

    • Each chunk becomes its own batch submission with its own run_id.
    • Drop limit (or keep as hard cap override).
    • Print all run_ids
    • Submit the first batch only (failed)
  4. Update test script to show tokenizer output

notes

  • MODEL_LIMITS and _MODEL_ENCODING dicts in analysis/gpt4o/analysis_batch.py; keyed by model name, sourced from docs/openai.md. Unknown models fall back to o200k_base encoding and 900k token limit.
  • estimate_tokens(messages, model): uses tiktoken (o200k_base) when available; falls back to chars/3 + 4 overhead per message.
  • chunk_comments_by_tokens(comments, forum, model): greedy bin-pack; respects 10% headroom (_LIMIT_BUFFER=0.90). Returns list of comment lists.
  • submit sends only chunks[0] — enqueued token limit is a TOTAL across all concurrent batches; stacking would exceed quota. Remaining chunk ranges are printed as manual instructions.
  • limit N still available as a hard cap on total comments before chunking (useful when org-tier limit is below the published model limit).
  • pip install tiktoken required for exact token counting; chars/3 fallback activates automatically if not installed.

usage

  • `pip install tiktoken`
  • submit first chunk (auto-sized to model token limit, uses most recent output file) `python analysis/gpt4o/analysis_batch.py submit output/f452.jsonl model gpt-4o-mini`
  • check status (defaults to most recent run) `python analysis/gpt4o/analysis_batch.py status`
  • download + normalize when complete `python analysis/gpt4o/analysis_batch.py download`
  • submit next chunk: rerun with `limit` to cover the next N comments (track which comment_ids have already been analyzed to avoid duplicates)

validation

import pandas as pd
df_input = pd.read_json('C:/Users/moses/projects/vath/analysis/gpt4o/runs/75ee9a/f452.jsonl', lines=True)
# drop forum item
df_input_comments = df_input[df_input["comment_id"].notna()].copy()
df_output = pd.read_json('C:/Users/moses/projects/vath/analysis/gpt4o/runs/75ee9a/75ee9a6c-8fc2-4924-8d96-b55bb4d5e832_gpt-4o.jsonl', lines=True)
dfm = df_output.merge(df_input_comments,on="comment_id",how="left",suffixes=("","_input"),)
dfm.to_csv('C:/Users/moses/projects/vath/analysis/gpt4o/1.csv')

order columns: forum_id_input,comment_id,title,text,date,author,stance,stance_confidence,stance_rationale,tone,tags,error,truncated,analyzed_at,prompt_version,model

evidence

  • commit:
  • tests: 23 passing (pytest tests/analysis_gpt4o_batch.py), 51 total across suite
  • datetime: [2026-05-06 Wed 08:55]

[X] t1.2.3: batch job refactor

This task encompasses intent and fixes for 1.2.1 and 1.2.2. batch processing should be a resumable job queue, not a one-shot script. the user should not need to remember offsets, completed chunks, failed batches, or which comments remain.

Acceptance Criteria

  1. create tokenizer to prepare the batch job

    • input: prompt.txt, forum.jsonl
    • output: report.json with each model's batch structure, cost, and time (considering tpd constraints)

      • analysis_batch should be able to take this report to run the job. good place to copy the raw scraper jsonl
        {'prompt': 'prompt1.txt',
         'input_file': 'f451.jsonl',
         'input_tokens': 123456789,
         'gpt-4o': {'jobs':71,'cost_$':4,'est_queue_days':3} # divide tokens by model TPD to get time_days
         'gpt-4o-mini': {'jobs':71,'cost_$':4,'est_queue_days':3} # divide tokens by model TPD to get time_days
  2. batch py should contain commands to create, check, run, and complete jobs.

    • inputs: report.json, model, optional job N, read api key from .env
    • outputs:

      • status.json: job structure, status, metadata; updated when jobs are finished. includes all report.json info
      • for each job: jobN-input.jsonl (what is sent to openai); jobN-output-raw.jsonl, jobN-output.jsonl, and jobN-errors.jsonl (when downloaded)
      • jobN-output.jsonl contains:

        • one analysis record per comment
        • `run_id`, `forum_id`, `comment_id`, `analyzed_at`, `model`, `prompt_version`
        • `stance` toward proposed reg/guidance: support|oppose|neutral|unclear
        • `stance_confidence`: 0-1
        • short rationale, if provided by model
        • generic sentiment `tone` (separate from stance): positive|negative|neutral|mixed|unclear
        • `tags` for later grouping, may be empty
    • commands: `create`, `submit`, `status`, `download`

      • `create` run directory, copy input/prompt/report, generate status.json, job request files
      • `submit` if eligible, submit next or specified job; does not blindly stack jobs, warns if prev jobs in progress, print next action
      • `status` check status of one or all submitted jobs, update status.json
      • `download` raw output (jobN-output-raw.jsonl) and error files for completed jobs, and normalize raw output (jobN-output.jsonl) auto run status.
  3. tests without live api calls

    • partial completed run
    • failed batch records
    • out-of-order output
    • duplicate custom_id
    • missing output file
    • resume from status.json
    • remaining-comment detection

notes

  • analysis/tokenizer.py: new standalone script; imports openai_batch for MODEL_LIMITS, estimate_tokens, build_messages. Reads input JSONL + prompt, computes per-model jobs/cost/time table, writes reports/<stem>-report.json. MODEL_PRICING dict lives here (not in openai_batch). Pass a jobN-input.jsonl to count actual tokens instead.
  • analysis/openai_batch.py: fully rewritten with four subcommands: create, submit, status, download. Job dirs at analysis/jobs/<stem[:8]>-N/.
  • Job directories: analysis/jobs/<stem[:8]>-N/ (e.g. f452-1). Each run is self-contained: forum.jsonl, prompt.txt, report.json, jobN-input.jsonl, jobN-output-raw.jsonl, jobN-output.jsonl, jobN-errors.jsonl.
  • status.json: tracks all jobs with pending/submitted/in_progress/completed/failed states. Updated by submit, status, download.
  • _find_next_eligible_job: pure function for testability. Returns (next_pending_job, None) or (None, warning). Blocks submission if previous job is in_progress/submitted.
  • create: no API key required. Reads report.json, re-chunks comments, writes all jobN-input.jsonl files, writes status.json.
  • submit: uploads jobN-input.jsonl to Files API, creates batch, updates status.json to 'submitted'. Will not stack batches.
  • status: retrieves batch from OpenAI, updates status.json counts and status.
  • download: auto-runs status first, downloads output_file_id → jobN-output-raw.jsonl, error_file_id → jobN-errors.jsonl, normalizes → jobN-output.jsonl. Updates status.json.
  • tests/tokenizer.py: 19 tests for compute_report schema, cost/time calculation, MODEL_PRICING coverage, print_table output, count_input_tokens, report.json round-trip.
  • Token limit buffer: _LIMIT_BUFFER=0.80 (20% headroom). Estimate uses OpenAI cookbook chat formula (role tokens + 3-token reply primer). Verify a job file with: python analysis/tokenizer.py analysis/jobs/<dir>/jobN-input.jsonl

usage

# 1. estimate tokens and cost
python analysis/tokenizer.py output/f452.jsonl --prompt analysis/prompt-1.txt
# writes reports/f452-report.json

# 2. verify actual tokens in a job file (optional sanity check)
python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl

# 3. create job directory (no api key needed)
python analysis/openai_batch.py create reports/f452-report.json --model gpt-5.4-mini
# creates analysis/jobs/f452-1/

# 4. submit first job
python analysis/openai_batch.py submit

# 5. check status (repeat until completed)
python analysis/openai_batch.py status

# 6. download and normalize
python analysis/openai_batch.py download

# 7. submit next job (if multi-job run), then repeat 5-6
python analysis/openai_batch.py submit

evidence

  • commit:
  • tests: passing (pytest tests/openai_batch.py tests/openai_realtime.py tests/tokenizer.py)
  • datetime: [2026-05-06 Wed]

[X] t1.3: cleanup model output and rejoin

create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv. review create_csv for the simple approach - keep this regardless

acceptance criteria

  1. input raw scrape jsonl and all *-output.jsonl files in a dir
  2. join by comment_id, not dataframe index
  3. output csv columns in review order:

    • forum_id, comment_id, title, text, date, author
    • stance, stance_confidence, stance_rationale, tone, tags
    • error, truncated, analyzed_at, prompt_version, model
  4. output parquet?
  5. print validation counts

    • raw comments
    • analyzed records
    • joined records
    • missing comment text
    • duplicate comment_ids
    • error records
    • stance counts
    • tone counts
  6. tests cover join behavior and missing/duplicate ids

notes

  • analysis/create_csv.py: reads raw scrape JSONL + all job*-output.jsonl in a job dir (skips *-output-raw.jsonl); left-joins on comment_id; writes review.csv (UTF-8 BOM for Excel); optional parquet.
  • Uses pd.read_json(path, lines=True) — no manual JSON parsing.
  • Prints summary counts: raw/analyzed/joined/unanalyzed/errors/duplicate IDs, stance distribution, tone distribution.

usage

python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
# output: analysis/jobs/f452-1/review.csv (and optionally review.parquet)

evidence

  • commit:
  • tests: passing (pytest tests/create_csv.py tests/encoding.py)
  • csv: analysis/jobs/f452-1/review.csv
  • datetime: [2026-05-07 Thu]

[X] t1.1.1: text encoding cleanup

fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as ’.

acceptance criteria

  1. identify whether mojibake exists in raw scrape, analysis output, or csv export only
  2. add repair step at the earliest correct layer
  3. preserve original raw scrape if repair changes source text
  4. add test cases for common bad sequences:

    • ’
    • “
    • ”
    • –
    • —
  5. document whether repaired text is used for model input

notes

  • Diagnosis: f452.jsonl raw data is CLEAN — proper Unicode throughout (U+2019, U+201C, etc.). The DEFAULT_RESPONSE_ENCODING=utf-8 spider setting is working for this site. No mojibake or FFFD chars found.
  • The encoding issue would surface for forums whose server sends cp1252 bytes (0x91-0x97 range) embedded in otherwise UTF-8 content. FFFD replacement chars appear when the UTF-8 decoder hits those bytes. Once the byte is replaced by FFFD, the original character cannot be recovered.
  • Repair layer: analysis/encoding.py applied in analysis/validate.py at reporting time. Raw scrape JSONL is never modified (AC3).
  • Model input: repair_text() is NOT applied in build_messages() for this dataset since raw data is clean. Can be added if a future forum produces dirty text.
  • Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.

evidence

  • commit:
  • tests: passing (pytest tests/encoding.py)
  • before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
  • datetime: [2026-05-07 Thu]

= Backlog =

[ ] X: first dash explorer

create a local dash app for exploring one forum analysis dataset.

acceptance criteria

  1. load parquet/csv review dataset
  2. show stance counts, tone counts, tag counts, and confidence histogram
  3. provide filters for stance, tone, confidence, tag, and text search
  4. show filtered comment table
  5. clicking/selecting a comment shows full text and model rationale
  6. app runs locally with one command

[ ] X: complete proposal information

Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.

acceptance criteria

  1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id

  2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url