From 72c2ae0ca0c6db5f4f5b82099c0f08e607547d7f Mon Sep 17 00:00:00 2001 From: eulaly Date: Thu, 7 May 2026 17:01:08 -0400 Subject: [PATCH] updated readme --- README.md | 211 ++++++++++++++++++++++---------------------- docs/vatownhall.org | 117 ++++++++++++++++++------ 2 files changed, 192 insertions(+), 136 deletions(-) diff --git a/README.md b/README.md index 862871a..7767f69 100644 --- a/README.md +++ b/README.md @@ -1,21 +1,20 @@ # Table of Contents -1. [Project Goals](#org5acb669) - 1. [Document and analyze sentiment](#org9291576) - 2. [Make data available](#org8054421) - 3. [Generalize](#orgdda4b6f) -2. [Architecture](#org1d6bc40) - 1. [Scraper](#org4298028) - 2. [Storage](#org1cd413c) - 3. [Analysis](#orgaea450e) -3. [Roadmap](#org6b7660d) + 1. [Project Goals](#orgf37a106) + 1. [Research questions](#orgec50d46) + 2. [Architecture](#org7a5389e) + 1. [Scraper](#org7771df2) + 2. [Analysis](#org16a9e36) + 3. [Storage](#org7341391) + 3. [Instructions](#org692b2f6) +1. [Roadmap](#org9f21934) - + -# Project Goals +## Project Goals 1. Document and analyze sentiment of public comments on Virginia law, to determine: 1. the utility of this forum as a mechanism for public comment, and @@ -24,130 +23,128 @@ 3. Generalize to other public comment tools. - + -## Document and analyze sentiment +### Research questions -- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability. -- Build tests for identifying abuse, such as spam and account fraud -- Identify any patterns connecting measured sentiment against VA decisions +1. What is the quality of the comments on the forum? + 1. Are there duplicate entries? + 2. Are there non-human-generated entries? + 3. Are there entries intended to abuse the forum or drown out comment? +2. How do commenters feel about the proposed change? + 1. What is the total number and percent supporting vs opposing, and how does this change over time? + 2. 
What is the type of support, such as strong/weak, positive/negative? +3. What impact do the comments have on the proposed change? + (I anticipate this will not be measurable from currently available data) - + -## Make data available +## Architecture -- Pick a good visualization tool +1. Scrape/Parse: Scrapy +2. Sentiment analysis: gpt-5.4-mini +3. Display: streamlit +4. Storage: jsonl, csv, parquet - + -## Generalize +### Scraper -- Identify scalable ways to apply this toolset to similar problems - - - - -# Architecture - -1. Scrape/Parse: ****Scrapy**** for downloading comments -2. Storage: json -3. Sentiment analysis: Claude haiku -4. Display: TBD - - - - -## Scraper - -Scrapy provides a simple mechanism for browsing and +Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums. 1. Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count 2. Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date 3. Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment - + -## Storage +### Analysis -One JSONL file per forum/bill. +Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model. + +Before sending the comments for sentiment analysis, \`tokenizer.py\` receives the forum to be processed and prompt as inputs, then generates a \`report.json\` estimating tokens (tiktoken), cost, and time to run for multiple models. + +Then, the batch processing script uses the \`report.json\` to create multiple jobs, with subcommands to download and check their status. 
+ +We selected gpt-5.4-mini for a good balance of quality, cost, and time. + +1. Prompt + + \`\`\` + You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall + regulatory comment system. You will be given the text of a proposed regulation and a single + public comment. Return ONLY a JSON object — no other text. + + Definitions: + + - stance: the commenter's position on whether the regulation should be adopted. + "support" = wants it approved (as-is or with changes); + "oppose" = wants it rejected or substantially weakened; + "neutral" = takes no position, asks a question, or provides factual input only; + "unknown" = too vague, off-topic, or uninterpretable to classify. + - tone: the emotional register of the writing, independent of stance. + "positive" = affirming, hopeful, appreciative; + "negative" = angry, fearful, alarmed, or contemptuous; + "neutral" = matter-of-fact, procedural, or informational; + "mixed" = contains both positive and negative emotional content; + "unclear" = tone cannot be determined (e.g., a one-word comment). + - stance_confidence: float 0.0-1.0, your confidence in the stance label. + - stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible. + - tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. + "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion", + "bullying prevention", "school sports", "bathroom access"). Empty array if none apply. + + Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags. + \`\`\` - + -## Analysis +### Storage -Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback. 
- - +- Each scraped forum is saved to \`output/.jsonl\` +- Each report (forum + prompt) is saved to \`reports/.json\` +- Each job is saved to \`analysis/jobs//\`: + └─\`forum.jsonl\` is a copy of the scraped forum for convenience + └─\`prompt.txt\` is a copy of the prompt used + └─\`report.json\` is a copy of the report used + └─\`status.json\` contains metadata about the job + For each batch in the job, four files are created: + └─\`jobN-input.jsonl\` contains the exact queries sent to the API, for troubleshooting + └─\`jobN-output-raw.jsonl\` contains the exact response from the API + └─\`jobN-output.jsonl\` contains the exact response from the API + └─\`jobN-output-errors.jsonl\` when errors are returned (this file may not exist) +- Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder. --+ -+## Instructions -- -- -- -- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
| Tool | Output | Context | Sarcasm | Context window | Cost/1k comments |
|---|---|---|---|---|---|
| Google NL API | -1→+1, magnitude | No/generic | Poorly | No | ~$1–2 |
| Amazon Comprehend | Pos/Neg/Neutral/Mixed | No/generic | Poorly | No | ~$0.10 |
| Claude Haiku | Prompted → for/against/neutral | Yes | Yes, with prompt | Yes | ~$0.10–0.30 |
| GPT-4o-mini | Prompted → same | Yes | Yes | Yes | ~$0.05–0.15 |
+1. Scrape the forum. + \`python +2. Run model report. + \`python analysis/tokenizer.py --prompt \` +3. To run a realtime subset: + \`python analysis/openai_realtime.py --prompt --model --limit \` + \`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10\` +4. To create and run the whole thing in batches, first create the batch jobs from the report: + \`python analysis/openai_batch.py create --model \` + \`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini\` +5. Then, run the jobs sequentially. Don't submit more than one at a time; if the model fills up, the batch will fail, and resubmission is not implemented. + \`python analysis/openai_batch.py submit\` + + \`python analysis/openai_batch.py status\` + + \`python analysis/openai_batch.py download\` + + \`python analysis/openai_batch.py submit\` - + # Roadmap diff --git a/docs/vatownhall.org b/docs/vatownhall.org index 128b222..0c12b41 100644 --- a/docs/vatownhall.org +++ b/docs/vatownhall.org @@ -1,50 +1,109 @@ #+title: VA Townhall #+date: [2026-05-05 Tue] -#+version: 1 +#+version: 1.1 -* Project Goals +** Project Goals 1. Document and analyze sentiment of public comments on Virginia law, to determine: 1. the utility of this forum as a mechanism for public comment, and 2. the impact of this forum on Virginia regulation. 2. Make data and insights broadly available. 3. Generalize to other public comment tools. -** Document and analyze sentiment -- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability. -- Build tests for identifying abuse, such as spam and account fraud -- Identify any patterns connecting measured sentiment against VA decisions - -** Make data available -- Pick a good visualization tool +*** Research questions +1. What is the quality of the comments on the forum? + 1. Are there duplicate entries? + 2. Are there non-human-generated entries? + 3. 
Are there entries intended to abuse the forum or drown out comment? +2. How do commenters feel about the proposed change? + 1. What is the total number and percent supporting vs opposing, and how does this change over time? + 2. What is the type of support, such as strong/weak, positive/negative? +3. What impact do the comments have on the proposed change? + (I anticipate this will not be measurable from currently available data) -** Generalize -- Identify scalable ways to apply this toolset to similar problems +** Architecture +1. Scrape/Parse: Scrapy +2. Sentiment analysis: gpt-5.4-mini +3. Display: streamlit +4. Storage: jsonl, csv, parquet -* Architecture -1. Scrape/Parse: **Scrapy** for downloading comments -2. Storage: json -3. Sentiment analysis: Claude haiku -4. Display: TBD - -** Scraper -Scrapy provides a simple mechanism for browsing and +*** Scraper +Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums. 1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count 2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date 3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment -** Storage -One JSONL file per forum/bill. +*** Analysis +Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model. -** Analysis -Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. 
We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback. +Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models. -| Tool | Output | Context | Sarcasm | Context window | Cost/1k comments | -|-------------------+--------------------------------+------------+------------------+----------------+------------------| -| Google NL API | -1→+1, magnitude | No/generic | Poorly | No | ~$1–2 | -| Amazon Comprehend | Pos/Neg/Neutral/Mixed | No/generic | Poorly | No | ~$0.10 | -| Claude Haiku | Prompted → for/against/neutral | Yes | Yes, with prompt | Yes | ~$0.10–0.30 | -| GPT-4o-mini | Prompted → same | Yes | Yes | Yes | ~$0.05–0.15 | +Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to download and check their status. +We selected gpt-5.4-mini for a good balance of quality, cost, and time. + +**** Prompt +``` +You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall +regulatory comment system. You will be given the text of a proposed regulation and a single +public comment. Return ONLY a JSON object — no other text. + +Definitions: +- stance: the commenter's position on whether the regulation should be adopted. + "support" = wants it approved (as-is or with changes); + "oppose" = wants it rejected or substantially weakened; + "neutral" = takes no position, asks a question, or provides factual input only; + "unknown" = too vague, off-topic, or uninterpretable to classify. +- tone: the emotional register of the writing, independent of stance. 
+ "positive" = affirming, hopeful, appreciative; + "negative" = angry, fearful, alarmed, or contemptuous; + "neutral" = matter-of-fact, procedural, or informational; + "mixed" = contains both positive and negative emotional content; + "unclear" = tone cannot be determined (e.g., a one-word comment). +- stance_confidence: float 0.0-1.0, your confidence in the stance label. +- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible. +- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. + "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion", + "bullying prevention", "school sports", "bathroom access"). Empty array if none apply. + +Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags. +``` + + +*** Storage +- Each scraped forum is saved to `output/.jsonl` +- Each report (forum + prompt) is saved to `reports/.json` +- Each job is saved to `analysis/jobs//`: + └─`forum.jsonl` is a copy of the scraped forum for convenience + └─`prompt.txt` is a copy of the prompt used + └─`report.json` is a copy of the report used + └─`status.json` contains metadata about the job + For each batch in the job, four files are created: + └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting + └─`jobN-output-raw.jsonl` contains the exact response from the API + └─`jobN-output.jsonl` contains the exact response from the API + └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist) +- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder. + +** Instructions +1. Scrape the forum. + `python +2. Run model report. + `python analysis/tokenizer.py --prompt ` +3. 
To run a realtime subset: + `python analysis/openai_realtime.py --prompt --model --limit ` + `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10` +4. To create and run the whole thing in batches, first create the batch jobs from the report: + `python analysis/openai_batch.py create --model ` + `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini` +5. Then, run the jobs sequentially. Don't submit more than one at a time; if the model fills up, the batch will fail, and resubmission is not implemented. + `python analysis/openai_batch.py submit` + # Check status + `python analysis/openai_batch.py status` + # When complete, download: + `python analysis/openai_batch.py download` + # Submit the next batch after the previous is complete: + `python analysis/openai_batch.py submit` + * Roadmap 1. Scrape one forum 2. Compare sentiment models
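
Review note: the prompt in this patch asks the model to return exactly five keys with fixed label sets. As an illustration only, a check of one model reply against that contract could look roughly like the sketch below. The helper name `validate_response` is hypothetical and is not taken from the repository; the allowed labels and the five-tag limit are copied from the prompt text, and everything else is an assumption.

```python
import json

# Allowed labels, copied from the prompt's definitions.
STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw: str) -> dict:
    """Parse one model reply and check it against the prompt's schema.

    Hypothetical helper for illustration; raises ValueError on any violation.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != REQUIRED_KEYS:
        # Report keys that are missing or unexpected.
        raise ValueError(f"key mismatch: {sorted(set(data) ^ REQUIRED_KEYS)}")
    if data["stance"] not in STANCES:
        raise ValueError(f"bad stance: {data['stance']!r}")
    if data["tone"] not in TONES:
        raise ValueError(f"bad tone: {data['tone']!r}")
    conf = data["stance_confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("stance_confidence must be a number")
    if not 0.0 <= conf <= 1.0:
        raise ValueError("stance_confidence must be within 0.0-1.0")
    if not isinstance(data["stance_rationale"], str):
        raise ValueError("stance_rationale must be a string")
    tags = data["tags"]
    if not isinstance(tags, list) or len(tags) > 5:
        raise ValueError("tags must be a list of at most 5 items")
    if not all(isinstance(tag, str) for tag in tags):
        raise ValueError("every tag must be a string")
    return data
```

A check like this could sit between the raw and cleaned outputs, so that malformed replies are surfaced rather than silently kept; whether the existing scripts already do something similar is not visible in this patch.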