updated readme

2026-05-07 17:01:08 -04:00
parent f5d679808e
commit 72c2ae0ca0
2 changed files with 192 additions and 136 deletions
--- a/README.md
+++ b/README.md
@@ -1,21 +1,20 @@
 # Table of Contents
-1.  [Project Goals](#org5acb669)
+    1.  [Project Goals](#orgf37a106)
-    1.  [Document and analyze sentiment](#org9291576)
+        1.  [Research questions](#orgec50d46)
-    2.  [Make data available](#org8054421)
+    2.  [Architecture](#org7a5389e)
-    3.  [Generalize](#orgdda4b6f)
+        1.  [Scraper](#org7771df2)
-2.  [Architecture](#org1d6bc40)
+        2.  [Analysis](#org16a9e36)
-    1.  [Scraper](#org4298028)
+        3.  [Storage](#org7341391)
-    2.  [Storage](#org1cd413c)
+    3.  [Instructions](#org692b2f6)
-    3.  [Analysis](#orgaea450e)
+1.  [Roadmap](#org9f21934)
 3.  [Roadmap](#org6b7660d)
-<a id="org5acb669"></a>
+<a id="orgf37a106"></a>
-# Project Goals
+## Project Goals
 1.  Document and analyze sentiment of public comments on Virginia law, to determine:
    1.  the utility of this forum as a mechanism for public comment, and
@@ -24,130 +23,128 @@
 3.  Generalize to other public comment tools.
-<a id="org9291576"></a>
+<a id="orgec50d46"></a>
-## Document and analyze sentiment
+### Research questions
-   Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
+1.  What is the quality of the comments on the forum?
-   Build tests for identifying abuse, such as spam and account fraud
+    1.  Are there duplicate entries?
-   Identify any patterns connecting measured sentiment against VA decisions
+    2.  Are there non-human-generated entries?
    3.  Are there entries intended to abuse the forum or drown out comment?
 2.  How do commenters feel about the proposed change?
    1.  What is the total number and percent supporting vs opposing, and how does this change over time?
    2.  What is the type of support, such as strong/weak, positive/negative?
 3.  What impact do the comments have on the proposed change?
    (I anticipate this will not be measurable from currently available data)
-<a id="org8054421"></a>
+<a id="org7a5389e"></a>
-## Make data available
+## Architecture
-   Pick a good visualization tool
+1.  Scrape/Parse: Scrapy
 2.  Sentiment analysis: gpt-5.4-mini
 3.  Display: streamlit
 4.  Storage: jsonl, csv, parquet
-<a id="orgdda4b6f"></a>
+<a id="org7771df2"></a>
-## Generalize
+### Scraper
-   Identify scalable ways to apply this toolset to similar problems
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
 <a id="org1d6bc40"></a>
 # Architecture
 1.  Scrape/Parse: **Scrapy** for downloading comments
 2.  Storage: json
 3.  Sentiment analysis: Claude haiku
 4.  Display: TBD
 <a id="org4298028"></a>
 ## Scraper
 Scrapy provides a simple mechanism for browsing and 
 1.  Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
 2.  Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
 3.  Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
-<a id="org1cd413c"></a>
+<a id="org16a9e36"></a>
-## Storage
+### Analysis
-One JSONL file per forum/bill.
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" scores as negative tone, but the tone score alone does not reveal the stance, which here is "against" the regulation.  To capture stance, we pass the proposed change to the model as context.
 Before sending the comments for sentiment analysis, \`tokenizer.py\` receives the forum to be processed and prompt as inputs, then generates a \`report.json\` estimating tokens (tiktoken), cost, and time to run for multiple models.
 Then, the batch processing script uses the \`report.json\` to create multiple jobs, with subcommands to check their status and download the results.
 We selected gpt-5.4-mini for a good balance of quality, cost, and time.
 1.  Prompt
    \`\`\`
    You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
    regulatory comment system. You will be given the text of a proposed regulation and a single
    public comment. Return ONLY a JSON object — no other text.
    Definitions:
    -   stance: the commenter's position on whether the regulation should be adopted.
        "support" = wants it approved (as-is or with changes);
        "oppose"  = wants it rejected or substantially weakened;
        "neutral" = takes no position, asks a question, or provides factual input only;
        "unknown" = too vague, off-topic, or uninterpretable to classify.
    -   tone: the emotional register of the writing, independent of stance.
        "positive" = affirming, hopeful, appreciative;
        "negative" = angry, fearful, alarmed, or contemptuous;
        "neutral"  = matter-of-fact, procedural, or informational;
        "mixed"    = contains both positive and negative emotional content;
        "unclear"  = tone cannot be determined (e.g., a one-word comment).
    -   stance_confidence: float 0.0-1.0, your confidence in the stance label.
    -   stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
    -   tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
        "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
        "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
    Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
    \`\`\`
-<a id="orgaea450e"></a>
+<a id="org7341391"></a>
-## Analysis
+### Storage
-Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
+-   Each scraped forum is saved to \`output/<forum-id>.jsonl\`
-
+-   Each report (forum + prompt) is saved to \`reports/<forum-id-N>.json\`
-<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
+-   Each job is saved to \`analysis/jobs/<report-id>/\`:
     └─\`forum.jsonl\` is a copy of the scraped forum for convenience
     └─\`prompt.txt\` is a copy of the prompt used
     └─\`report.json\` is a copy of the report used
     └─\`status.json\` contains metadata about the job
    For each batch in the job, four files are created:
     └─\`jobN-input.jsonl\` contains the exact queries sent to the API, for troubleshooting
     └─\`jobN-output-raw.jsonl\` contains the exact response from the API
     └─\`jobN-output.jsonl\` contains the exact response from the API
     └─\`jobN-output-errors.jsonl\` when errors are returned (this file may not exist)
 -   Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder.
-<colgroup>
+<a id="org692b2f6"></a>
 <col  class="org-left" />
-<col  class="org-left" />
+## Instructions
-<col  class="org-left" />
+1.  Scrape the forum.
-
+    \`python
-<col  class="org-left" />
+2.  Run model report.
-
+    \`python analysis/tokenizer.py <input> --prompt <prompt>\`
-<col  class="org-left" />
+3.  To run a realtime subset:
-
+    \`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>\`
-<col  class="org-left" />
+    \`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10\`
-</colgroup>
+4.  To create and run the whole thing in batches, first create the batch jobs from the report:
-<thead>
+    \`python analysis/openai_batch.py create <report> --model <model>\`
-<tr>
+    \`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini\`
-<th scope="col" class="org-left">Tool</th>
+5.  Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
-<th scope="col" class="org-left">Output</th>
+    \`python analysis/openai_batch.py submit\`
-<th scope="col" class="org-left">Context</th>
+    
-<th scope="col" class="org-left">Sarcasm</th>
+    \`python analysis/openai_batch.py status\`
-<th scope="col" class="org-left">Context window</th>
+    
-<th scope="col" class="org-left">Cost/1k comments</th>
+    \`python analysis/openai_batch.py download\`
-</tr>
+    
-</thead>
+    \`python analysis/openai_batch.py submit\`
 <tbody>
 <tr>
 <td class="org-left">Google NL API</td>
 <td class="org-left">-1→+1, magnitude</td>
 <td class="org-left">No/generic</td>
 <td class="org-left">Poorly</td>
 <td class="org-left">No</td>
 <td class="org-left">~$1–2</td>
 </tr>
 <tr>
 <td class="org-left">Amazon Comprehend</td>
 <td class="org-left">Pos/Neg/Neutral/Mixed</td>
 <td class="org-left">No/generic</td>
 <td class="org-left">Poorly</td>
 <td class="org-left">No</td>
 <td class="org-left">~$0.10</td>
 </tr>
 <tr>
 <td class="org-left">Claude Haiku</td>
 <td class="org-left">Prompted → for/against/neutral</td>
 <td class="org-left">Yes</td>
 <td class="org-left">Yes, with prompt</td>
 <td class="org-left">Yes</td>
 <td class="org-left">~$0.10–0.30</td>
 </tr>
 <tr>
 <td class="org-left">GPT-4o-mini</td>
 <td class="org-left">Prompted → same</td>
 <td class="org-left">Yes</td>
 <td class="org-left">Yes</td>
 <td class="org-left">Yes</td>
 <td class="org-left">~$0.05–0.15</td>
 </tr>
 </tbody>
 </table>
-<a id="org6b7660d"></a>
+<a id="org9f21934"></a>
 # Roadmap
--- a/docs/vatownhall.org
+++ b/docs/vatownhall.org
@@ -1,50 +1,109 @@
 #+title: VA Townhall
 #+date: [2026-05-05 Tue]
-#+version: 1
+#+version: 1.1
-* Project Goals
+** Project Goals
 1. Document and analyze sentiment of public comments on Virginia law, to determine:
   1. the utility of this forum as a mechanism for public comment, and
   2. the impact of this forum on Virginia regulation.
 2. Make data and insights broadly available.
 3. Generalize to other public comment tools.
-** Document and analyze sentiment
+*** Research questions   
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
+1. What is the quality of the comments on the forum?
- Build tests for identifying abuse, such as spam and account fraud
+   1. Are there duplicate entries?
- Identify any patterns connecting measured sentiment against VA decisions
+   2. Are there non-human-generated entries?
-  
+   3. Are there entries intended to abuse the forum or drown out comment?
-** Make data available
+2. How do commenters feel about the proposed change?
- Pick a good visualization tool
+   1. What is the total number and percent supporting vs opposing, and how does this change over time?
   2. What is the type of support, such as strong/weak, positive/negative?
 3. What impact do the comments have on the proposed change?
   (I anticipate this will not be measurable from currently available data)
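One possible starting point for research question 1.1 (duplicate entries) is to hash a normalized form of each comment and group identical hashes. This is only a sketch: the field names `comment_id` and `text` are assumptions about the scraped JSONL records, and exact-match hashing will miss paraphrased or lightly edited copies.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial edits still match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def find_duplicates(comments: list[dict]) -> list[list[str]]:
    """Return groups of comment ids whose normalized text is identical.

    `comment_id` and `text` are hypothetical field names, not the project's schema.
    """
    groups = defaultdict(list)
    for comment in comments:
        digest = hashlib.sha256(normalize(comment["text"]).encode("utf-8")).hexdigest()
        groups[digest].append(comment["comment_id"])
    # Keep only hashes shared by more than one comment.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Near-duplicate detection (for example, shingling or edit distance) would be a natural next step if form-letter campaigns vary their wording.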
-** Generalize
+** Architecture
- Identify scalable ways to apply this toolset to similar problems
+1. Scrape/Parse: Scrapy
 2. Sentiment analysis: gpt-5.4-mini
 3. Display: streamlit
 4. Storage: jsonl, csv, parquet
-* Architecture
+*** Scraper
-1. Scrape/Parse: **Scrapy** for downloading comments
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
 2. Storage: json
 3. Sentiment analysis: Claude haiku
 4. Display: TBD   
 ** Scraper
 Scrapy provides a simple mechanism for browsing and 
 1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
 2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
 3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
-** Storage
+*** Analysis
-One JSONL file per forum/bill.
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" scores as negative tone, but the tone score alone does not reveal the stance, which here is "against" the regulation.  To capture stance, we pass the proposed change to the model as context.
-** Analysis
+Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
 Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
-| Tool              | Output                         | Context    | Sarcasm          | Context window | Cost/1k comments |
+Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to check their status and download the results.
 |-------------------+--------------------------------+------------+------------------+----------------+------------------|
 | Google NL API     | -1→+1, magnitude               | No/generic | Poorly           | No             | ~$1–2            |
 | Amazon Comprehend | Pos/Neg/Neutral/Mixed          | No/generic | Poorly           | No             | ~$0.10           |
 | Claude Haiku      | Prompted → for/against/neutral | Yes        | Yes, with prompt | Yes            | ~$0.10–0.30      |
 | GPT-4o-mini       | Prompted → same                | Yes        | Yes              | Yes            | ~$0.05–0.15      |
 We selected gpt-5.4-mini for a good balance of quality, cost, and time.
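The kind of estimate `report.json` holds can be illustrated with a small sketch. It is not the code in `tokenizer.py`: the function name, the `count_tokens` parameter, and the price argument are all hypothetical, and it counts only the prompt plus each comment (the regulation text sent as context is ignored here).

```python
from typing import Callable, Iterable


def estimate_input_cost(
    comments: Iterable[str],
    prompt: str,
    count_tokens: Callable[[str], int],
    usd_per_million_tokens: float,
) -> dict:
    """Estimate input tokens and cost, assuming one request per comment.

    Each request is assumed to carry the full prompt plus one comment.
    The price is supplied by the caller; no real pricing is built in.
    """
    prompt_tokens = count_tokens(prompt)
    n_comments = 0
    total_tokens = 0
    for text in comments:
        n_comments += 1
        total_tokens += prompt_tokens + count_tokens(text)
    return {
        "comments": n_comments,
        "input_tokens": total_tokens,
        "input_cost_usd": total_tokens * usd_per_million_tokens / 1_000_000,
    }
```

A tiktoken-based counter can be passed as `count_tokens`; any callable that maps a string to a token count works, which keeps the arithmetic testable without a tokenizer installed.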
 **** Prompt
 ```
 You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
 regulatory comment system. You will be given the text of a proposed regulation and a single
 public comment. Return ONLY a JSON object — no other text.
 Definitions:
 - stance: the commenter's position on whether the regulation should be adopted.
  "support" = wants it approved (as-is or with changes);
  "oppose"  = wants it rejected or substantially weakened;
  "neutral" = takes no position, asks a question, or provides factual input only;
  "unknown" = too vague, off-topic, or uninterpretable to classify.
 - tone: the emotional register of the writing, independent of stance.
  "positive" = affirming, hopeful, appreciative;
  "negative" = angry, fearful, alarmed, or contemptuous;
  "neutral"  = matter-of-fact, procedural, or informational;
  "mixed"    = contains both positive and negative emotional content;
  "unclear"  = tone cannot be determined (e.g., a one-word comment).
 - stance_confidence: float 0.0-1.0, your confidence in the stance label.
 - stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
 - tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
 Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
 ```
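Because the prompt asks for exactly five keys with fixed label sets, each reply can be checked before it is stored. The following is a minimal sketch of such a check, not the project's cleanup script; `validate_response` is a hypothetical helper name.

```python
import json

REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}
STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}


def validate_response(raw: str) -> dict:
    """Parse one model reply and check it against the prompt's definitions.

    Raises ValueError on any mismatch (json.JSONDecodeError is a ValueError).
    """
    record = json.loads(raw)
    if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
        raise ValueError("reply must be a JSON object with exactly the five keys")
    if not isinstance(record["stance"], str) or record["stance"] not in STANCES:
        raise ValueError("unexpected stance label")
    if not isinstance(record["tone"], str) or record["tone"] not in TONES:
        raise ValueError("unexpected tone label")
    conf = record["stance_confidence"]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError("stance_confidence must be a number in [0.0, 1.0]")
    tags = record["tags"]
    if not isinstance(tags, list) or len(tags) > 5 or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags must be a list of at most 5 strings")
    return record
```

Replies that fail the check could be routed to the errors file for manual review rather than silently dropped.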
 *** Storage
 - Each scraped forum is saved to `output/<forum-id>.jsonl`
 - Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
 - Each job is saved to `analysis/jobs/<report-id>/`:
   └─`forum.jsonl` is a copy of the scraped forum for convenience
   └─`prompt.txt` is a copy of the prompt used
   └─`report.json` is a copy of the report used
   └─`status.json` contains metadata about the job
  For each batch in the job, four files are created:
   └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
   └─`jobN-output-raw.jsonl` contains the exact response from the API
   └─`jobN-output.jsonl` contains the exact response from the API
   └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
 - Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
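The layout above can be expressed as a small path helper. This is a sketch under assumptions: `job_paths` is a hypothetical name, the report id is assumed to look like `f452-1` (taken from the `reports/f452-1.json` example), and `N` in `jobN-...` is assumed to be an integer batch index.

```python
from pathlib import Path


def job_paths(report_id: str, batch: int, root: Path = Path("analysis/jobs")) -> dict[str, Path]:
    """Return the files expected for one batch of one job, keyed by role."""
    job_dir = root / report_id
    prefix = f"job{batch}"  # assumed naming: job1-..., job2-..., and so on
    return {
        # Per-job files
        "forum": job_dir / "forum.jsonl",
        "prompt": job_dir / "prompt.txt",
        "report": job_dir / "report.json",
        "status": job_dir / "status.json",
        # Per-batch files
        "input": job_dir / f"{prefix}-input.jsonl",
        "output_raw": job_dir / f"{prefix}-output-raw.jsonl",
        "output": job_dir / f"{prefix}-output.jsonl",
        "errors": job_dir / f"{prefix}-output-errors.jsonl",  # may not exist
    }
```

For example, `job_paths("f452-1", 2)["input"]` points at `analysis/jobs/f452-1/job2-input.jsonl`.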
 ** Instructions
 1. Scrape the forum.
   `python 
 2. Run model report.
   `python analysis/tokenizer.py <input> --prompt <prompt>`
 3. To run a realtime subset:
   `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
   `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
 4. To create and run the whole thing in batches, first create the batch jobs from the report:
   `python analysis/openai_batch.py create <report> --model <model>`
   `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
 5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
   `python analysis/openai_batch.py submit`
  # Check status
   `python analysis/openai_batch.py status`
  # When complete, download:
   `python analysis/openai_batch.py download`
  # Submit the next batch after the previous is complete:
   `python analysis/openai_batch.py submit`
 * Roadmap
 1. Scrape one forum
 2. Compare sentiment models