From 72c2ae0ca0c6db5f4f5b82099c0f08e607547d7f Mon Sep 17 00:00:00 2001
From: eulaly <johnmosescarter@gmail.com>
Date: Thu, 7 May 2026 17:01:08 -0400
Subject: [PATCH] updated readme

---
 README.md           | 211 ++++++++++++++++++++++----------------------
 docs/vatownhall.org | 117 ++++++++++++++++++------
 2 files changed, 192 insertions(+), 136 deletions(-)
diff --git a/README.md b/README.md
index 862871a..7767f69 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,20 @@
 
 # Table of Contents
 
-1.  [Project Goals](#org5acb669)
-    1.  [Document and analyze sentiment](#org9291576)
-    2.  [Make data available](#org8054421)
-    3.  [Generalize](#orgdda4b6f)
-2.  [Architecture](#org1d6bc40)
-    1.  [Scraper](#org4298028)
-    2.  [Storage](#org1cd413c)
-    3.  [Analysis](#orgaea450e)
-3.  [Roadmap](#org6b7660d)
+    1.  [Project Goals](#orgf37a106)
+        1.  [Research questions](#orgec50d46)
+    2.  [Architecture](#org7a5389e)
+        1.  [Scraper](#org7771df2)
+        2.  [Analysis](#org16a9e36)
+        3.  [Storage](#org7341391)
+    3.  [Instructions](#org692b2f6)
+1.  [Roadmap](#org9f21934)
 
 
 
-<a id="org5acb669"></a>
+<a id="orgf37a106"></a>
 
-# Project Goals
+## Project Goals
 
 1.  Document and analyze sentiment of public comments on Virginia law, to determine:
     1.  the utility of this forum as a mechanism for public comment, and
@@ -24,130 +23,128 @@
 3.  Generalize to other public comment tools.
 
 
-<a id="org9291576"></a>
+<a id="orgec50d46"></a>
 
-## Document and analyze sentiment
+### Research questions
 
--   Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
--   Build tests for identifying abuse, such as spam and account fraud
--   Identify any patterns connecting measured sentiment against VA decisions
+1.  What is the quality of the comments on the forum?
+    1.  Are there duplicate entries?
+    2.  Are there non-human-generated entries?
+    3.  Are there entries intended to abuse the forum or drown out comment?
+2.  How do commenters feel about the proposed change?
+    1.  What is the total number and percent supporting vs opposing, and how does this change over time?
+    2.  What is the type of support, such as strong/weak, positive/negative?
+3.  What impact do the comments have on the proposed change?
+    (I anticipate this will not be measurable from currently available data)
 
 
-<a id="org8054421"></a>
+<a id="org7a5389e"></a>
 
-## Make data available
+## Architecture
 
--   Pick a good visualization tool
+1.  Scrape/Parse: Scrapy
+2.  Sentiment analysis: gpt-5.4-mini
+3.  Display: streamlit
+4.  Storage: jsonl, csv, parquet
 
 
-<a id="orgdda4b6f"></a>
+<a id="org7771df2"></a>
 
-## Generalize
+### Scraper
 
--   Identify scalable ways to apply this toolset to similar problems
-
-
-<a id="org1d6bc40"></a>
-
-# Architecture
-
-1.  Scrape/Parse: ****Scrapy**** for downloading comments
-2.  Storage: json
-3.  Sentiment analysis: Claude haiku
-4.  Display: TBD
-
-
-<a id="org4298028"></a>
-
-## Scraper
-
-Scrapy provides a simple mechanism for browsing and 
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
 
 1.  Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
 2.  Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
 3.  Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
 
 
-<a id="org1cd413c"></a>
+<a id="org16a9e36"></a>
 
-## Storage
+### Analysis
 
-One JSONL file per forum/bill.
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" reads as negative in tone, but what matters here is that the commenter is "against" the regulation. To capture stance, we add the proposed change as context to the model.
+
+Before sending the comments for sentiment analysis, \`tokenizer.py\` takes the forum to be processed and the prompt as inputs, then generates a \`report.json\` estimating tokens (via tiktoken), cost, and run time for multiple models.
+
+Then, the batch processing script uses the \`report.json\` to create multiple jobs, with subcommands to submit them, check their status, and download the results.
+
+We selected gpt-5.4-mini for a good balance of quality, cost, and time.
+
+1.  Prompt
+
+    \`\`\`
+    You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
+    regulatory comment system. You will be given the text of a proposed regulation and a single
+    public comment. Return ONLY a JSON object — no other text.
+    
+    Definitions:
+    
+    -   stance: the commenter's position on whether the regulation should be adopted.
+        "support" = wants it approved (as-is or with changes);
+        "oppose"  = wants it rejected or substantially weakened;
+        "neutral" = takes no position, asks a question, or provides factual input only;
+        "unknown" = too vague, off-topic, or uninterpretable to classify.
+    -   tone: the emotional register of the writing, independent of stance.
+        "positive" = affirming, hopeful, appreciative;
+        "negative" = angry, fearful, alarmed, or contemptuous;
+        "neutral"  = matter-of-fact, procedural, or informational;
+        "mixed"    = contains both positive and negative emotional content;
+        "unclear"  = tone cannot be determined (e.g., a one-word comment).
+    -   stance_confidence: float 0.0-1.0, your confidence in the stance label.
+    -   stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
+    -   tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
+        "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
+        "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
+    
+    Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
+    \`\`\`
 
 
-<a id="orgaea450e"></a>
+<a id="org7341391"></a>
 
-## Analysis
+### Storage
 
-Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
-
-<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
+-   Each scraped forum is saved to \`output/<forum-id>.jsonl\`
+-   Each report (forum + prompt) is saved to \`reports/<forum-id-N>.json\`
+-   Each job is saved to \`analysis/jobs/<report-id>/\`:
+     └─\`forum.jsonl\` is a copy of the scraped forum for convenience
+     └─\`prompt.txt\` is a copy of the prompt used
+     └─\`report.json\` is a copy of the report used
+     └─\`status.json\` contains metadata about the job
+    For each batch in the job, four files are created:
+     └─\`jobN-input.jsonl\` contains the exact queries sent to the API, for troubleshooting
+     └─\`jobN-output-raw.jsonl\` contains the exact response from the API
+     └─\`jobN-output.jsonl\` contains the exact response from the API
+     └─\`jobN-output-errors.jsonl\` when errors are returned (this file may not exist)
+-   Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder.
 
 
-<colgroup>
-<col  class="org-left" />
+<a id="org692b2f6"></a>
 
-<col  class="org-left" />
+## Instructions
 
-<col  class="org-left" />
-
-<col  class="org-left" />
-
-<col  class="org-left" />
-
-<col  class="org-left" />
-</colgroup>
-<thead>
-<tr>
-<th scope="col" class="org-left">Tool</th>
-<th scope="col" class="org-left">Output</th>
-<th scope="col" class="org-left">Context</th>
-<th scope="col" class="org-left">Sarcasm</th>
-<th scope="col" class="org-left">Context window</th>
-<th scope="col" class="org-left">Cost/1k comments</th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td class="org-left">Google NL API</td>
-<td class="org-left">-1→+1, magnitude</td>
-<td class="org-left">No/generic</td>
-<td class="org-left">Poorly</td>
-<td class="org-left">No</td>
-<td class="org-left">~$1–2</td>
-</tr>
-
-<tr>
-<td class="org-left">Amazon Comprehend</td>
-<td class="org-left">Pos/Neg/Neutral/Mixed</td>
-<td class="org-left">No/generic</td>
-<td class="org-left">Poorly</td>
-<td class="org-left">No</td>
-<td class="org-left">~$0.10</td>
-</tr>
-
-<tr>
-<td class="org-left">Claude Haiku</td>
-<td class="org-left">Prompted → for/against/neutral</td>
-<td class="org-left">Yes</td>
-<td class="org-left">Yes, with prompt</td>
-<td class="org-left">Yes</td>
-<td class="org-left">~$0.10–0.30</td>
-</tr>
-
-<tr>
-<td class="org-left">GPT-4o-mini</td>
-<td class="org-left">Prompted → same</td>
-<td class="org-left">Yes</td>
-<td class="org-left">Yes</td>
-<td class="org-left">Yes</td>
-<td class="org-left">~$0.05–0.15</td>
-</tr>
-</tbody>
-</table>
+1.  Scrape the forum.
+    \`python
+2.  Run model report.
+    \`python analysis/tokenizer.py <input> --prompt <prompt>\`
+3.  To run a realtime subset:
+    \`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>\`
+    \`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10\`
+4.  To create and run the whole thing in batches, first create the batch jobs from the report:
+    \`python analysis/openai_batch.py create <report> --model <model>\`
+    \`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini\`
+5.  Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
+    \`python analysis/openai_batch.py submit\`
+    
+    \`python analysis/openai_batch.py status\`
+    
+    \`python analysis/openai_batch.py download\`
+    
+    \`python analysis/openai_batch.py submit\`
 
 
-<a id="org6b7660d"></a>
+<a id="org9f21934"></a>
 
 # Roadmap
 
diff --git a/docs/vatownhall.org b/docs/vatownhall.org
index 128b222..0c12b41 100644
--- a/docs/vatownhall.org
+++ b/docs/vatownhall.org
@@ -1,50 +1,109 @@
 #+title: VA Townhall
 #+date: [2026-05-05 Tue]
-#+version: 1
+#+version: 1.1
 
-* Project Goals
+** Project Goals
 1. Document and analyze sentiment of public comments on Virginia law, to determine:
    1. the utility of this forum as a mechanism for public comment, and
    2. the impact of this forum on Virginia regulation.
 2. Make data and insights broadly available.
 3. Generalize to other public comment tools.
 
-** Document and analyze sentiment
-- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
-- Build tests for identifying abuse, such as spam and account fraud
-- Identify any patterns connecting measured sentiment against VA decisions
-  
-** Make data available
-- Pick a good visualization tool
+*** Research questions   
+1. What is the quality of the comments on the forum?
+   1. Are there duplicate entries?
+   2. Are there non-human-generated entries?
+   3. Are there entries intended to abuse the forum or drown out comment?
+2. How do commenters feel about the proposed change?
+   1. What is the total number and percent supporting vs opposing, and how does this change over time?
+   2. What is the type of support, such as strong/weak, positive/negative?
+3. What impact do the comments have on the proposed change?
+   (I anticipate this will not be measurable from currently available data)
 
-** Generalize
-- Identify scalable ways to apply this toolset to similar problems
+** Architecture
+1. Scrape/Parse: Scrapy
+2. Sentiment analysis: gpt-5.4-mini
+3. Display: streamlit
+4. Storage: jsonl, csv, parquet
 
-* Architecture
-1. Scrape/Parse: **Scrapy** for downloading comments
-2. Storage: json
-3. Sentiment analysis: Claude haiku
-4. Display: TBD   
-
-** Scraper
-Scrapy provides a simple mechanism for browsing and 
+*** Scraper
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
 1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
 2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
 3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
 
-** Storage
-One JSONL file per forum/bill.
+*** Analysis
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" reads as negative in tone, but what matters here is that the commenter is "against" the regulation. To capture stance, we add the proposed change as context to the model.
 
-** Analysis
-Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
+Before sending the comments for sentiment analysis, `tokenizer.py` takes the forum to be processed and the prompt as inputs, then generates a `report.json` estimating tokens (via tiktoken), cost, and run time for multiple models.
 
-| Tool              | Output                         | Context    | Sarcasm          | Context window | Cost/1k comments |
-|-------------------+--------------------------------+------------+------------------+----------------+------------------|
-| Google NL API     | -1→+1, magnitude               | No/generic | Poorly           | No             | ~$1–2            |
-| Amazon Comprehend | Pos/Neg/Neutral/Mixed          | No/generic | Poorly           | No             | ~$0.10           |
-| Claude Haiku      | Prompted → for/against/neutral | Yes        | Yes, with prompt | Yes            | ~$0.10–0.30      |
-| GPT-4o-mini       | Prompted → same                | Yes        | Yes              | Yes            | ~$0.05–0.15      |
+Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to submit them, check their status, and download the results.
 
+We selected gpt-5.4-mini for a good balance of quality, cost, and time.
+
+**** Prompt
+```
+You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
+regulatory comment system. You will be given the text of a proposed regulation and a single
+public comment. Return ONLY a JSON object — no other text.
+
+Definitions:
+- stance: the commenter's position on whether the regulation should be adopted.
+  "support" = wants it approved (as-is or with changes);
+  "oppose"  = wants it rejected or substantially weakened;
+  "neutral" = takes no position, asks a question, or provides factual input only;
+  "unknown" = too vague, off-topic, or uninterpretable to classify.
+- tone: the emotional register of the writing, independent of stance.
+  "positive" = affirming, hopeful, appreciative;
+  "negative" = angry, fearful, alarmed, or contemptuous;
+  "neutral"  = matter-of-fact, procedural, or informational;
+  "mixed"    = contains both positive and negative emotional content;
+  "unclear"  = tone cannot be determined (e.g., a one-word comment).
+- stance_confidence: float 0.0-1.0, your confidence in the stance label.
+- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
+- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
+  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
+  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
+
+Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
+```
+
+
+*** Storage
+- Each scraped forum is saved to `output/<forum-id>.jsonl`
+- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
+- Each job is saved to `analysis/jobs/<report-id>/`:
+   └─`forum.jsonl` is a copy of the scraped forum for convenience
+   └─`prompt.txt` is a copy of the prompt used
+   └─`report.json` is a copy of the report used
+   └─`status.json` contains metadata about the job
+  For each batch in the job, four files are created:
+   └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
+   └─`jobN-output-raw.jsonl` contains the exact response from the API
+   └─`jobN-output.jsonl` contains the exact response from the API
+   └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
+- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
+   
+** Instructions
+1. Scrape the forum.
+   `python 
+2. Run model report.
+   `python analysis/tokenizer.py <input> --prompt <prompt>`
+3. To run a realtime subset:
+   `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
+   `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
+4. To create and run the whole thing in batches, first create the batch jobs from the report:
+   `python analysis/openai_batch.py create <report> --model <model>`
+   `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
+5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
+   `python analysis/openai_batch.py submit`
+  # Check status
+   `python analysis/openai_batch.py status`
+  # When complete, download:
+   `python analysis/openai_batch.py download`
+  # Submit the next batch after the previous is complete:
+   `python analysis/openai_batch.py submit`
+   
 * Roadmap
 1. Scrape one forum
 2. Compare sentiment models
