tesging images

testing images
adding image
2026-05-07 18:07:45 -04:00 · 2026-05-07 18:06:02 -04:00 · 2026-05-07 18:00:51 -04:00 · 2026-05-07 17:56:05 -04:00
4 changed files with 179 additions and 17 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -29,3 +29,4 @@ output/

 # --- misc ---
 .DS_Store
+*~$*
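The new `*~$*` entry appears aimed at the `~$`-prefixed lock files that Office applications leave beside open documents (the commit log calls it "excel detritus"). Below is a rough way to check which names the pattern catches. It assumes Python's `fnmatchcase`, applied to a file's base name, is a close enough stand-in for how git matches a slash-free pattern. The helper `is_ignored` and the file names are hypothetical.

```python
from fnmatch import fnmatchcase

PATTERN = "*~$*"  # the entry added to .gitignore


def is_ignored(name, pattern=PATTERN):
    """Approximate gitignore matching for a slash-free pattern.

    Assumption: fnmatchcase on the base name mirrors git closely enough
    for this pattern; git's own matcher remains the authority.
    """
    base = name.rsplit("/", 1)[-1]
    return fnmatchcase(base, pattern)


# Hypothetical file names, for illustration only.
candidates = ["~$comments.xlsx", "data/~$review.xlsx", "review.xlsx", "notes.txt~"]
ignored = [c for c in candidates if is_ignored(c)]
print(ignored)
```

Only names containing the two-character sequence `~$` match, so editor backups such as `notes.txt~` are not covered by this entry. Running `git check-ignore -v <path>` inside the repository gives the definitive answer for a real file.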
--- a/README.md
+++ b/README.md
@@ -1,18 +1,18 @@

 # Table of Contents

-    1.  [Project Goals](#orgf37a106)
-        1.  [Research questions](#orgec50d46)
-    2.  [Architecture](#org7a5389e)
-        1.  [Scraper](#org7771df2)
-        2.  [Analysis](#org16a9e36)
-        3.  [Storage](#org7341391)
-    3.  [Instructions](#org692b2f6)
-1.  [Roadmap](#org9f21934)
+1.  [Project Goals](#org2da6874)
+    1.  [Research questions](#org1a2b8b3)
+    2.  [Architecture](#orgfabfcd9)
+        1.  [Scraper](#org2c5c7a2)
+        2.  [Analysis](#org72990f4)
+        3.  [Storage](#org58a5b72)
+    3.  [Instructions](#org24fe465)
+1.  [Roadmap](#org5739d49)



-<a id="orgf37a106"></a>
+<a id="org2da6874"></a>

 ## Project Goals

@@ -23,7 +23,7 @@
 3.  Generalize to other public comment tools.


-<a id="orgec50d46"></a>
+<a id="org1a2b8b3"></a>

 ### Research questions

@@ -38,7 +38,7 @@
    (I anticipate this will not be measurable from currently available data)


-<a id="org7a5389e"></a>
+<a id="orgfabfcd9"></a>

 ## Architecture

@@ -47,8 +47,10 @@
 3.  Display: streamlit
 4.  Storage: jsonl, csv, parquet

+![img](./docs/pipeline-v1.2.3.svg)

-<a id="org7771df2"></a>
+
+<a id="org2c5c7a2"></a>

 ### Scraper

@@ -59,7 +61,7 @@ Scrapy provides a simple mechanism for retrieving, parsing, and saving content f
 3.  Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment


-<a id="org16a9e36"></a>
+<a id="org72990f4"></a>

 ### Analysis

@@ -101,7 +103,7 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
    \`\`\`


-<a id="org7341391"></a>
+<a id="org58a5b72"></a>

 ### Storage

@@ -120,7 +122,7 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
 -   Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder.


-<a id="org692b2f6"></a>
+<a id="org24fe465"></a>

 ## Instructions

@@ -144,7 +146,7 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
    \`python analysis/openai_batch.py submit\`


-<a id="org9f21934"></a>
+<a id="org5739d49"></a>

 # Roadmap

--- a/docs/vatownhall.md
+++ b/docs/vatownhall.md
@@ -0,0 +1,157 @@
+
+# Table of Contents
+
+1.  [Project Goals](#org214014d)
+    1.  [Research questions](#org54bfaa9)
+    2.  [Architecture](#orgf2c1000)
+        1.  [Scraper](#org88a423d)
+        2.  [Analysis](#orga217037)
+        3.  [Storage](#org73d6f34)
+    3.  [Instructions](#org672fefe)
+2.  [Roadmap](#org084df10)
+
+
+<a id="org214014d"></a>
+
+## Project Goals
+
+1.  Document and analyze sentiment of public comments on Virginia law, to determine:
+    1.  the utility of this forum as a mechanism for public comment, and
+    2.  the impact of this forum on Virginia regulation.
+2.  Make data and insights broadly available.
+3.  Generalize to other public comment tools.
+
+
+<a id="org54bfaa9"></a>
+
+### Research questions
+
+1.  What is the quality of the comments on the forum?
+    1.  Are there duplicate entries?
+    2.  Are there non-human-generated entries?
+    3.  Are there entries intended to abuse the forum or drown out comment?
+2.  How do commenters feel about the proposed change?
+    1.  What is the total number and percent supporting vs opposing, and how does this change over time?
+    2.  What is the type of support, such as strong/weak, positive/negative?
+3.  What impact do the comments have on the proposed change?
+    (I anticipate this will not be measurable from currently available data)
+
+
+<a id="orgf2c1000"></a>
+
+## Architecture
+
+1.  Scrape/Parse: Scrapy
+2.  Sentiment analysis: gpt-5.4-mini
+3.  Display: streamlit
+4.  Storage: jsonl, csv, parquet
+
+![](./docs/pipeline-v1.2.3.svg)
+
+
+<a id="org88a423d"></a>
+
+### Scraper
+
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
+
+1.  Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
+2.  Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
+3.  Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
+
+
+<a id="orga217037"></a>
+
+### Analysis
+
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation). For example, "I strongly believe the government should NOT interfere" is negative in tone, but the useful signal is that the commenter is "against" the regulation. To capture stance, we add the proposed change as context to the model.
+
+Before sending the comments for sentiment analysis, \`tokenizer.py\` receives the forum to be processed and prompt as inputs, then generates a \`report.json\` estimating tokens (tiktoken), cost, and time to run for multiple models.
+
+Then, the batch processing script uses the \`report.json\` to create multiple jobs, with subcommands to check their status and download their results.
+
+We selected gpt-5.4-mini for a good balance of quality, cost, and time.
+
+1.  Prompt
+
+    \`\`\`
+    You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
+    regulatory comment system. You will be given the text of a proposed regulation and a single
+    public comment. Return ONLY a JSON object — no other text.
+    
+    Definitions:
+    
+    -   stance: the commenter's position on whether the regulation should be adopted.
+        "support" = wants it approved (as-is or with changes);
+        "oppose"  = wants it rejected or substantially weakened;
+        "neutral" = takes no position, asks a question, or provides factual input only;
+        "unknown" = too vague, off-topic, or uninterpretable to classify.
+    -   tone: the emotional register of the writing, independent of stance.
+        "positive" = affirming, hopeful, appreciative;
+        "negative" = angry, fearful, alarmed, or contemptuous;
+        "neutral"  = matter-of-fact, procedural, or informational;
+        "mixed"    = contains both positive and negative emotional content;
+        "unclear"  = tone cannot be determined (e.g., a one-word comment).
+    -   stance_confidence: float 0.0-1.0, your confidence in the stance label.
+    -   stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
+    -   tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
+        "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
+        "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
+    
+    Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
+    \`\`\`
+
+
+<a id="org73d6f34"></a>
+
+### Storage
+
+-   Each scraped forum is saved to \`output/<forum-id>.jsonl\`
+-   Each report (forum + prompt) is saved to \`reports/<forum-id-N>.json\`
+-   Each job is saved to \`analysis/jobs/<report-id>/\`:
+     └─\`forum.jsonl\` is a copy of the scraped forum for convenience
+     └─\`prompt.txt\` is a copy of the prompt used
+     └─\`report.json\` is a copy of the report used
+     └─\`status.json\` contains metadata about the job
+    For each batch in the job, four files are created:
+     └─\`jobN-input.jsonl\` contains the exact queries sent to the API, for troubleshooting
+     └─\`jobN-output-raw.jsonl\` contains the exact response from the API
+     └─\`jobN-output.jsonl\` contains the exact response from the API
+     └─\`jobN-output-errors.jsonl\` when errors are returned (this file may not exist)
+-   Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder.
+
+
+<a id="org672fefe"></a>
+
+## Instructions
+
+1.  Scrape the forum.
+    \`python
+2.  Run model report.
+    \`python analysis/tokenizer.py <input> --prompt <prompt>\`
+3.  To run a realtime subset:
+    \`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>\`
+    \`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10\`
+4.  To create and run the whole thing in batches, first create the batch jobs from the report:
+    \`python analysis/openai_batch.py create <report> --model <model>\`
+    \`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini\`
+5.  Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
+    \`python analysis/openai_batch.py submit\`
+    
+    \`python analysis/openai_batch.py status\`
+    
+    \`python analysis/openai_batch.py download\`
+    
+    \`python analysis/openai_batch.py submit\`
+
+
+<a id="org084df10"></a>
+
+# Roadmap
+
+1.  Scrape one forum
+2.  Compare sentiment models
+3.  Display
+4.  Scrape all data
+5.  Scale?
+
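The prompt added in `docs/vatownhall.md` asks the model for exactly five keys with fixed label sets. Below is a minimal sketch of checking one response against those rules. It assumes the response arrives as a JSON string and that the keys are spelled `stance_confidence` and `stance_rationale`. `validate_response` is a hypothetical helper, not part of the repository's scripts.

```python
import json

# Label sets and key names copied from the prompt text.
STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw):
    """Return a list of problems found in one model response (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(data)}")
    stance = data.get("stance")
    if not (isinstance(stance, str) and stance in STANCES):
        problems.append("stance is not one of the allowed labels")
    tone = data.get("tone")
    if not (isinstance(tone, str) and tone in TONES):
        problems.append("tone is not one of the allowed labels")
    conf = data.get("stance_confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not (is_number and 0.0 <= conf <= 1.0):
        problems.append("stance_confidence is not a number in [0.0, 1.0]")
    tags = data.get("tags")
    if not isinstance(tags, list) or len(tags) > 5:
        problems.append("tags is not a list of at most 5 items")
    return problems


# Hypothetical response, for illustration only.
example = json.dumps({
    "stance": "oppose",
    "stance_confidence": 0.9,
    "stance_rationale": "Asks the board to reject the rule.",
    "tone": "negative",
    "tags": ["privacy"],
})
print(validate_response(example))
```

Returning a list of problems rather than raising lets a cleanup step record every issue for a comment in a single pass. Whether the project's scripts validate responses this way is not stated in the source.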
--- a/docs/vatownhall.org
+++ b/docs/vatownhall.org
@@ -26,6 +26,8 @@
 3. Display: streamlit
 4. Storage: jsonl, csv, parquet

+![](./docs/pipeline-v1.2.3.svg)
+   
 *** Scraper
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
 1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
Author	SHA1	Message	Date
eulaly	985760be7c	tesging images	2026-05-07 18:07:45 -04:00
eulaly	983650a64f	testing images	2026-05-07 18:06:02 -04:00
eulaly	eaaefb66f2	adding image	2026-05-07 18:00:51 -04:00
eulaly	bdab3c5e21	added excel detritus	2026-05-07 17:56:05 -04:00
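
The job folder layout listed under Storage in `docs/vatownhall.md` can be sketched with `pathlib`. This is an illustration under stated assumptions: `job_files` is a hypothetical helper, the report id `f452-1` is borrowed from the example path `./reports/f452-1.json`, and numbering batches from 1 is a guess at what `N` means in `jobN-...`.

```python
from pathlib import Path

# File names taken from the Storage section of the documentation.
JOB_LEVEL = ["forum.jsonl", "prompt.txt", "report.json", "status.json"]
BATCH_SUFFIXES = [
    "input.jsonl",
    "output-raw.jsonl",
    "output.jsonl",
    "output-errors.jsonl",  # may not exist when no errors are returned
]


def job_files(report_id, n_batches, root=Path("analysis/jobs")):
    """List the files the documentation describes for one job folder."""
    job_dir = root / report_id
    paths = [job_dir / name for name in JOB_LEVEL]
    for n in range(1, n_batches + 1):  # assumption: batches are numbered from 1
        paths += [job_dir / f"job{n}-{suffix}" for suffix in BATCH_SUFFIXES]
    return paths


for path in job_files("f452-1", n_batches=1):
    print(path.as_posix())
```

A listing like this could be compared against the actual folder contents to spot a batch whose output has not been downloaded yet.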