updated readme
211
README.md
@@ -1,21 +1,20 @@
|
|||||||
|
|
||||||
# Table of Contents
|
# Table of Contents
|
||||||
|
|
||||||
1. [Project Goals](#org5acb669)
|
1. [Project Goals](#orgf37a106)
|
||||||
1. [Document and analyze sentiment](#org9291576)
|
1. [Research questions](#orgec50d46)
|
||||||
2. [Make data available](#org8054421)
|
2. [Architecture](#org7a5389e)
|
||||||
3. [Generalize](#orgdda4b6f)
|
1. [Scraper](#org7771df2)
|
||||||
2. [Architecture](#org1d6bc40)
|
2. [Analysis](#org16a9e36)
|
||||||
1. [Scraper](#org4298028)
|
3. [Storage](#org7341391)
|
||||||
2. [Storage](#org1cd413c)
|
3. [Instructions](#org692b2f6)
|
||||||
3. [Analysis](#orgaea450e)
|
1. [Roadmap](#org9f21934)
|
||||||
3. [Roadmap](#org6b7660d)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
<a id="org5acb669"></a>
|
<a id="orgf37a106"></a>
|
||||||
|
|
||||||
# Project Goals
|
## Project Goals
|
||||||
|
|
||||||
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
||||||
1. the utility of this forum as a mechanism for public comment, and
|
1. the utility of this forum as a mechanism for public comment, and
|
||||||
@@ -24,130 +23,128 @@
|
|||||||
3. Generalize to other public comment tools.
|
3. Generalize to other public comment tools.
|
||||||
|
|
||||||
|
|
||||||
<a id="org9291576"></a>
|
<a id="orgec50d46"></a>
|
||||||
|
|
||||||
## Document and analyze sentiment
|
### Research questions
|
||||||
|
|
||||||
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
|
1. What is the quality of the comments on the forum?
|
||||||
- Build tests for identifying abuse, such as spam and account fraud
|
1. Are there duplicate entries?
|
||||||
- Identify any patterns connecting measured sentiment against VA decisions
|
2. Are there non-human-generated entries?
|
||||||
|
3. Are there entries intended to abuse the forum or drown out comment?
|
||||||
|
2. How do commenters feel about the proposed change?
|
||||||
|
1. What is the total number and percent supporting vs opposing, and how does this change over time?
|
||||||
|
2. What is the type of support, such as strong/weak, positive/negative?
|
||||||
|
3. What impact do the comments have on the proposed change?
|
||||||
|
(I anticipate this will not be measurable from currently available data)
|
||||||
|
|
||||||
|
|
||||||
<a id="org8054421"></a>
|
<a id="org7a5389e"></a>
|
||||||
|
|
||||||
## Make data available
|
## Architecture
|
||||||
|
|
||||||
- Pick a good visualization tool
|
1. Scrape/Parse: Scrapy
|
||||||
|
2. Sentiment analysis: gpt-5.4-mini
|
||||||
|
3. Display: streamlit
|
||||||
|
4. Storage: jsonl, csv, parquet
|
||||||
|
|
||||||
|
|
||||||
<a id="orgdda4b6f"></a>
|
<a id="org7771df2"></a>
|
||||||
|
|
||||||
## Generalize
|
### Scraper
|
||||||
|
|
||||||
- Identify scalable ways to apply this toolset to similar problems
|
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
|
||||||
|
|
||||||
|
|
||||||
<a id="org1d6bc40"></a>
|
|
||||||
|
|
||||||
# Architecture
|
|
||||||
|
|
||||||
1. Scrape/Parse: **Scrapy** for downloading comments
|
|
||||||
2. Storage: json
|
|
||||||
3. Sentiment analysis: Claude haiku
|
|
||||||
4. Display: TBD
|
|
||||||
|
|
||||||
|
|
||||||
<a id="org4298028"></a>
|
|
||||||
|
|
||||||
## Scraper
|
|
||||||
|
|
||||||
Scrapy provides a simple mechanism for browsing and
|
|
||||||
|
|
||||||
1. Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
1. Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
||||||
2. Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
|
2. Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
|
||||||
3. Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
|
3. Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
|
||||||
|
|
||||||
|
|
||||||
<a id="org1cd413c"></a>
|
<a id="org16a9e36"></a>
|
||||||
|
|
||||||
## Storage
|
### Analysis
|
||||||
|
|
||||||
One JSONL file per forum/bill.
|
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.
|
||||||
|
|
||||||
|
Before sending the comments for sentiment analysis, \`tokenizer.py\` receives the forum to be processed and prompt as inputs, then generates a \`report.json\` estimating tokens (tiktoken), cost, and time to run for multiple models.
|
||||||
|
|
||||||
|
Then, the batch processing script uses the \`report.json\` to create multiple jobs, with subcommands to download and check their status.
|
||||||
|
|
||||||
|
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
|
||||||
|
|
||||||
|
1. Prompt
|
||||||
|
|
||||||
|
\`\`\`
|
||||||
|
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
||||||
|
regulatory comment system. You will be given the text of a proposed regulation and a single
|
||||||
|
public comment. Return ONLY a JSON object — no other text.
|
||||||
|
|
||||||
|
Definitions:
|
||||||
|
|
||||||
|
- stance: the commenter's position on whether the regulation should be adopted.
|
||||||
|
"support" = wants it approved (as-is or with changes);
|
||||||
|
"oppose" = wants it rejected or substantially weakened;
|
||||||
|
"neutral" = takes no position, asks a question, or provides factual input only;
|
||||||
|
"unknown" = too vague, off-topic, or uninterpretable to classify.
|
||||||
|
- tone: the emotional register of the writing, independent of stance.
|
||||||
|
"positive" = affirming, hopeful, appreciative;
|
||||||
|
"negative" = angry, fearful, alarmed, or contemptuous;
|
||||||
|
"neutral" = matter-of-fact, procedural, or informational;
|
||||||
|
"mixed" = contains both positive and negative emotional content;
|
||||||
|
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||||
|
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||||
|
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||||
|
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
||||||
|
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
||||||
|
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||||
|
|
||||||
|
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||||
|
\`\`\`
|
||||||
|
|
||||||
|
|
||||||
<a id="orgaea450e"></a>
|
<a id="org7341391"></a>
|
||||||
|
|
||||||
## Analysis
|
### Storage
|
||||||
|
|
||||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
|
- Each scraped forum is saved to \`output/<forum-id>.jsonl\`
|
||||||
|
- Each report (forum + prompt) is saved to \`reports/<forum-id-N>.json\`
|
||||||
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
|
- Each job is saved to \`analysis/jobs/<report-id>/\`:
|
||||||
|
└─\`forum.jsonl\` is a copy of the scraped forum for convenience
|
||||||
|
└─\`prompt.txt\` is a copy of the prompt used
|
||||||
|
└─\`report.json\` is a copy of the report used
|
||||||
|
└─\`status.json\` contains metadata about the job
|
||||||
|
For each batch in the job, four files are created:
|
||||||
|
└─\`jobN-input.jsonl\` contains the exact queries sent to the API, for troubleshooting
|
||||||
|
└─\`jobN-output-raw.jsonl\` contains the exact response from the API
|
||||||
|
└─\`jobN-output.jsonl\` contains the exact response from the API
|
||||||
|
└─\`jobN-output-errors.jsonl\` when errors are returned (this file may not exist)
|
||||||
|
- Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder.
|
||||||
|
|
||||||
|
|
||||||
<colgroup>
|
<a id="org692b2f6"></a>
|
||||||
<col class="org-left" />
|
|
||||||
|
|
||||||
<col class="org-left" />
|
## Instructions
|
||||||
|
|
||||||
<col class="org-left" />
|
1. Scrape the forum.
|
||||||
|
\`python
|
||||||
<col class="org-left" />
|
2. Run model report.
|
||||||
|
\`python analysis/tokenizer.py <input> --prompt <prompt>\`
|
||||||
<col class="org-left" />
|
3. To run a realtime subset:
|
||||||
|
\`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>\`
|
||||||
<col class="org-left" />
|
\`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10\`
|
||||||
</colgroup>
|
4. To create and run the whole thing in batches, first create the batch jobs from the report:
|
||||||
<thead>
|
\`python analysis/openai_batch.py create <report> --model <model>\`
|
||||||
<tr>
|
\`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini\`
|
||||||
<th scope="col" class="org-left">Tool</th>
|
5. Then, run the jobs sequentially. Don't submit more than one at a time; if the model fills up, the batch will fail, and resubmission is not implemented.
|
||||||
<th scope="col" class="org-left">Output</th>
|
\`python analysis/openai_batch.py submit\`
|
||||||
<th scope="col" class="org-left">Context</th>
|
|
||||||
<th scope="col" class="org-left">Sarcasm</th>
|
\`python analysis/openai_batch.py status\`
|
||||||
<th scope="col" class="org-left">Context window</th>
|
|
||||||
<th scope="col" class="org-left">Cost/1k comments</th>
|
\`python analysis/openai_batch.py download\`
|
||||||
</tr>
|
|
||||||
</thead>
|
\`python analysis/openai_batch.py submit\`
|
||||||
<tbody>
|
|
||||||
<tr>
|
|
||||||
<td class="org-left">Google NL API</td>
|
|
||||||
<td class="org-left">-1→+1, magnitude</td>
|
|
||||||
<td class="org-left">No/generic</td>
|
|
||||||
<td class="org-left">Poorly</td>
|
|
||||||
<td class="org-left">No</td>
|
|
||||||
<td class="org-left">~$1–2</td>
|
|
||||||
</tr>
|
|
||||||
|
|
||||||
<tr>
|
|
||||||
<td class="org-left">Amazon Comprehend</td>
|
|
||||||
<td class="org-left">Pos/Neg/Neutral/Mixed</td>
|
|
||||||
<td class="org-left">No/generic</td>
|
|
||||||
<td class="org-left">Poorly</td>
|
|
||||||
<td class="org-left">No</td>
|
|
||||||
<td class="org-left">~$0.10</td>
|
|
||||||
</tr>
|
|
||||||
|
|
||||||
<tr>
|
|
||||||
<td class="org-left">Claude Haiku</td>
|
|
||||||
<td class="org-left">Prompted → for/against/neutral</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">Yes, with prompt</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">~$0.10–0.30</td>
|
|
||||||
</tr>
|
|
||||||
|
|
||||||
<tr>
|
|
||||||
<td class="org-left">GPT-4o-mini</td>
|
|
||||||
<td class="org-left">Prompted → same</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">~$0.05–0.15</td>
|
|
||||||
</tr>
|
|
||||||
</tbody>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
|
|
||||||
<a id="org6b7660d"></a>
|
<a id="org9f21934"></a>
|
||||||
|
|
||||||
# Roadmap
|
# Roadmap
|
||||||
|
|
||||||
|
|||||||
@@ -1,50 +1,109 @@
|
|||||||
#+title: VA Townhall
|
#+title: VA Townhall
|
||||||
#+date: [2026-05-05 Tue]
|
#+date: [2026-05-05 Tue]
|
||||||
#+version: 1
|
#+version: 1.1
|
||||||
|
|
||||||
* Project Goals
|
** Project Goals
|
||||||
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
||||||
1. the utility of this forum as a mechanism for public comment, and
|
1. the utility of this forum as a mechanism for public comment, and
|
||||||
2. the impact of this forum on Virginia regulation.
|
2. the impact of this forum on Virginia regulation.
|
||||||
2. Make data and insights broadly available.
|
2. Make data and insights broadly available.
|
||||||
3. Generalize to other public comment tools.
|
3. Generalize to other public comment tools.
|
||||||
|
|
||||||
** Document and analyze sentiment
|
*** Research questions
|
||||||
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
|
1. What is the quality of the comments on the forum?
|
||||||
- Build tests for identifying abuse, such as spam and account fraud
|
1. Are there duplicate entries?
|
||||||
- Identify any patterns connecting measured sentiment against VA decisions
|
2. Are there non-human-generated entries?
|
||||||
|
3. Are there entries intended to abuse the forum or drown out comment?
|
||||||
** Make data available
|
2. How do commenters feel about the proposed change?
|
||||||
- Pick a good visualization tool
|
1. What is the total number and percent supporting vs opposing, and how does this change over time?
|
||||||
|
2. What is the type of support, such as strong/weak, positive/negative?
|
||||||
|
3. What impact do the comments have on the proposed change?
|
||||||
|
(I anticipate this will not be measurable from currently available data)
|
||||||
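The duplicate-entry question above could be approached with a simple exact-match check on normalized text. This is only a minimal sketch: the `id` and `text` field names are assumptions for illustration, not the scraper's confirmed schema, and near-duplicates (reworded form letters) would need a fuzzier method.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial edits still match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def find_duplicates(comments: list[dict]) -> list[list[str]]:
    """Group comment ids whose normalized text is identical.

    Assumes each record has hypothetical "id" and "text" fields.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for comment in comments:
        digest = hashlib.sha256(normalize(comment["text"]).encode("utf-8")).hexdigest()
        groups[digest].append(comment["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up sample records, for illustration only.
sample = [
    {"id": "1", "text": "I oppose this regulation."},
    {"id": "2", "text": "i  OPPOSE this regulation"},
    {"id": "3", "text": "I support this regulation."},
]
duplicate_groups = find_duplicates(sample)
print(duplicate_groups)
```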
|
|
||||||
** Generalize
|
** Architecture
|
||||||
- Identify scalable ways to apply this toolset to similar problems
|
1. Scrape/Parse: Scrapy
|
||||||
|
2. Sentiment analysis: gpt-5.4-mini
|
||||||
|
3. Display: streamlit
|
||||||
|
4. Storage: jsonl, csv, parquet
|
||||||
|
|
||||||
* Architecture
|
*** Scraper
|
||||||
1. Scrape/Parse: **Scrapy** for downloading comments
|
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
|
||||||
2. Storage: json
|
|
||||||
3. Sentiment analysis: Claude haiku
|
|
||||||
4. Display: TBD
|
|
||||||
|
|
||||||
** Scraper
|
|
||||||
Scrapy provides a simple mechanism for browsing and
|
|
||||||
1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
||||||
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
|
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
|
||||||
3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
|
3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
|
||||||
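As a rough illustration of the three page types listed above, the URLs could be built as follows. The query parameter names come from the list; the base URL is a placeholder, and the helper names are hypothetical rather than taken from the scraper.

```python
from urllib.parse import urlencode, urljoin

# Query parameter names taken from the page list above.
LISTING_PARAMS = {"forum": "GDocForumID", "stage": "stageid", "petition": "petitionid"}


def comment_listing_url(base: str, kind: str, ident: int) -> str:
    """Build a comments.cfm URL for one forum, stage, or petition."""
    if kind not in LISTING_PARAMS:
        raise ValueError(f"unknown listing kind: {kind!r}")
    return urljoin(base, "comments.cfm?" + urlencode({LISTING_PARAMS[kind]: ident}))


def comment_url(base: str, comment_id: int) -> str:
    """Build a viewcomments.cfm URL for a single comment."""
    return urljoin(base, "viewcomments.cfm?" + urlencode({"commentid": comment_id}))


BASE = "https://example.invalid/townhall/"  # placeholder, not the real host
print(comment_listing_url(BASE, "stage", 42))
print(comment_url(BASE, 7))
```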
|
|
||||||
** Storage
|
*** Analysis
|
||||||
One JSONL file per forum/bill.
|
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.
|
||||||
|
|
||||||
** Analysis
|
Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
|
||||||
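The estimate described above might look roughly like the sketch below. It is not the contents of `tokenizer.py`: the function names, the report keys, and the price are hypothetical, and the default token counter is a crude characters-per-token approximation standing in for tiktoken so the example runs without extra dependencies.

```python
import json
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: assume about 4 characters per token."""
    return max(1, len(text) // 4)


def estimate(
    comments: list[str],
    prompt: str,
    price_per_million_input: float,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> dict:
    """Estimate input tokens and cost when the prompt is sent once per comment."""
    prompt_tokens = count_tokens(prompt)
    total = sum(prompt_tokens + count_tokens(c) for c in comments)
    return {
        "comments": len(comments),
        "input_tokens": total,
        "estimated_input_cost_usd": round(total * price_per_million_input / 1_000_000, 6),
    }


# The price of 1.0 USD per million tokens is a made-up figure for illustration.
report = estimate(
    ["Please reject this.", "I support the change."],
    "Classify the comment.",
    price_per_million_input=1.0,
)
print(json.dumps(report, indent=2))
```

Passing a real tokenizer as `count_tokens` would replace the approximation without changing the rest of the calculation.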
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
|
|
||||||
|
|
||||||
| Tool | Output | Context | Sarcasm | Context window | Cost/1k comments |
|
Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to download and check their status.
|
||||||
|-------------------+--------------------------------+------------+------------------+----------------+------------------|
|
|
||||||
| Google NL API | -1→+1, magnitude | No/generic | Poorly | No | ~$1–2 |
|
|
||||||
| Amazon Comprehend | Pos/Neg/Neutral/Mixed | No/generic | Poorly | No | ~$0.10 |
|
|
||||||
| Claude Haiku | Prompted → for/against/neutral | Yes | Yes, with prompt | Yes | ~$0.10–0.30 |
|
|
||||||
| GPT-4o-mini | Prompted → same | Yes | Yes | Yes | ~$0.05–0.15 |
|
|
||||||
|
|
||||||
|
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
|
||||||
|
|
||||||
|
**** Prompt
|
||||||
|
```
|
||||||
|
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
||||||
|
regulatory comment system. You will be given the text of a proposed regulation and a single
|
||||||
|
public comment. Return ONLY a JSON object — no other text.
|
||||||
|
|
||||||
|
Definitions:
|
||||||
|
- stance: the commenter's position on whether the regulation should be adopted.
|
||||||
|
"support" = wants it approved (as-is or with changes);
|
||||||
|
"oppose" = wants it rejected or substantially weakened;
|
||||||
|
"neutral" = takes no position, asks a question, or provides factual input only;
|
||||||
|
"unknown" = too vague, off-topic, or uninterpretable to classify.
|
||||||
|
- tone: the emotional register of the writing, independent of stance.
|
||||||
|
"positive" = affirming, hopeful, appreciative;
|
||||||
|
"negative" = angry, fearful, alarmed, or contemptuous;
|
||||||
|
"neutral" = matter-of-fact, procedural, or informational;
|
||||||
|
"mixed" = contains both positive and negative emotional content;
|
||||||
|
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||||
|
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||||
|
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||||
|
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
||||||
|
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
||||||
|
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||||
|
|
||||||
|
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||||
|
```
|
||||||
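Because the prompt asks for a fixed set of keys and labels, each reply can be checked before it is stored. The following is a minimal sketch of such a check, assuming replies arrive as JSON strings; `validate_reply` is a hypothetical helper, not part of the batch scripts.

```python
import json

# Allowed labels and keys copied from the prompt's definitions.
STANCES = ("support", "oppose", "neutral", "unknown")
TONES = ("positive", "negative", "neutral", "mixed", "unclear")
KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_reply(raw: str) -> dict:
    """Parse one model reply; raise ValueError if it breaks the prompt's contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != KEYS:
        raise ValueError("reply must be an object with exactly the five expected keys")
    if data["stance"] not in STANCES:
        raise ValueError(f"bad stance: {data['stance']!r}")
    if data["tone"] not in TONES:
        raise ValueError(f"bad tone: {data['tone']!r}")
    conf = data["stance_confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError("stance_confidence must be a number between 0.0 and 1.0")
    if not isinstance(data["stance_rationale"], str):
        raise ValueError("stance_rationale must be a string")
    tags = data["tags"]
    if not isinstance(tags, list) or len(tags) > 5 or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags must be a list of at most 5 strings")
    return data


# Made-up reply, for illustration only.
example = (
    '{"stance": "oppose", "stance_confidence": 0.9, '
    '"stance_rationale": "Says the rule should be rejected.", '
    '"tone": "negative", "tags": ["privacy"]}'
)
result = validate_reply(example)
print(result["stance"], result["tone"])
```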
|
|
||||||
|
|
||||||
|
*** Storage
|
||||||
|
- Each scraped forum is saved to `output/<forum-id>.jsonl`
|
||||||
|
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
|
||||||
|
- Each job is saved to `analysis/jobs/<report-id>/`:
|
||||||
|
└─`forum.jsonl` is a copy of the scraped forum for convenience
|
||||||
|
└─`prompt.txt` is a copy of the prompt used
|
||||||
|
└─`report.json` is a copy of the report used
|
||||||
|
└─`status.json` contains metadata about the job
|
||||||
|
For each batch in the job, four files are created:
|
||||||
|
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
|
||||||
|
└─`jobN-output-raw.jsonl` contains the exact response from the API
|
||||||
|
└─`jobN-output.jsonl` contains the exact response from the API
|
||||||
|
└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
|
||||||
|
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
|
||||||
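To make the job folder layout above easier to picture, here is a small sketch that lists the files expected for one job. The helper name is hypothetical, and numbering batches from 1 is an assumption; the optional errors file is left out because it may not exist.

```python
from pathlib import Path


def job_files(root: Path, report_id: str, batches: int) -> list[Path]:
    """List the files expected in one job folder, given the number of batches."""
    job_dir = root / "analysis" / "jobs" / report_id
    names = ["forum.jsonl", "prompt.txt", "report.json", "status.json"]
    for n in range(1, batches + 1):  # assumption: batch numbers start at 1
        names += [
            f"job{n}-input.jsonl",
            f"job{n}-output-raw.jsonl",
            f"job{n}-output.jsonl",
        ]
    return [job_dir / name for name in names]


expected = job_files(Path("."), "f452-1", batches=2)
for path in expected:
    print(path.as_posix())
```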
|
|
||||||
|
** Instructions
|
||||||
|
1. Scrape the forum.
|
||||||
|
`python
|
||||||
|
2. Run model report.
|
||||||
|
`python analysis/tokenizer.py <input> --prompt <prompt>`
|
||||||
|
3. To run a realtime subset:
|
||||||
|
`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
|
||||||
|
`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
|
||||||
|
4. To create and run the whole thing in batches, first create the batch jobs from the report:
|
||||||
|
`python analysis/openai_batch.py create <report> --model <model>`
|
||||||
|
`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
|
||||||
|
5. Then, run the jobs sequentially. Don't submit more than one at a time; if the model fills up, the batch will fail, and resubmission is not implemented.
|
||||||
|
`python analysis/openai_batch.py submit`
|
||||||
|
# Check status
|
||||||
|
`python analysis/openai_batch.py status`
|
||||||
|
# When complete, download:
|
||||||
|
`python analysis/openai_batch.py download`
|
||||||
|
# Submit the next batch after the previous is complete:
|
||||||
|
`python analysis/openai_batch.py submit`
|
||||||
|
|
||||||
* Roadmap
|
* Roadmap
|
||||||
1. Scrape one forum
|
1. Scrape one forum
|
||||||
2. Compare sentiment models
|
2. Compare sentiment models
|
||||||
|
|||||||