initial commit

2026-05-05 11:35:19 -04:00
commit cd3543bd0f
12 changed files with 507 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,133 @@
# Table of Contents
1. [Project Goals](#org863a759)
2. [Architecture](#orgcd91fd0)
    1. [Scraper](#org3256ad3)
    2. [Storage](#org7a9a92c)
    3. [Analysis](#org6ed72dc)
3. [Roadmap](#org416f14d)
<a id="org863a759"></a>
# Project Goals
1. Document and analyze sentiment of public comments on Virginia law, to determine:
    1. the utility of this forum as a mechanism for public comment, and
    2. the impact of this forum on Virginia regulation.
2. Make data and insights broadly available.
3. Generalize to other public comment tools.
<a id="orgcd91fd0"></a>
# Architecture
1. Scrape/Parse: **Scrapy** for downloading comments
2. Storage: JSONL
3. Sentiment analysis: Claude Haiku
4. Display: TBD
<a id="org3256ad3"></a>
## Scraper
Scrapy provides a simple mechanism for browsing and parsing the site. The scraper needs to handle three page types:
1. Forums listing page: `Forums.cfm` - lists all open forums with agency, regulation title, action type, brief description, closing date, comment count
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
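The three page types can be told apart from the URL alone, which is useful when routing responses to the right parser. The following is a minimal sketch using only the standard library, not the project's actual spider: the function name `classify_url` and the labels `"forums"`, `"comment_list"`, and `"comment"` are hypothetical, and the host in the usage example is a placeholder.

```python
from urllib.parse import urlparse, parse_qs

# Query-parameter names come from the page list above. The page-type labels
# returned below ("forums", "comment_list", "comment") are hypothetical.
LISTING_KEYS = ("GDocForumID", "stageid", "petitionid")


def classify_url(url):
    """Return (page_type, id_key, id_value) for a forum-site URL, or None."""
    parsed = urlparse(url)
    page = parsed.path.rsplit("/", 1)[-1].lower()
    # Lower-case parameter names so "commentid" and "CommentID" both match.
    query = {k.lower(): v[0] for k, v in parse_qs(parsed.query).items()}

    if page == "forums.cfm":
        return ("forums", None, None)
    if page == "comments.cfm":
        for key in LISTING_KEYS:
            if key.lower() in query:
                return ("comment_list", key, query[key.lower()])
        return None
    if page == "viewcomments.cfm" and "commentid" in query:
        return ("comment", "commentid", query["commentid"])
    return None
```

For example, `classify_url("https://example.invalid/comments.cfm?stageid=42")` yields a comment-listing result keyed by `stageid`. In a Scrapy spider, the same distinction would more likely be expressed as one callback per page type.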
<a id="org7a9a92c"></a>
## Storage
One JSONL file per forum/bill.
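A per-forum JSONL layout could look roughly like the sketch below, which appends one JSON object per line and reads them back. The helper names `append_comment` and `read_comments`, the file-naming scheme `<forum_id>.jsonl`, and the record fields are assumptions for illustration, not the project's settled schema.

```python
import json
from pathlib import Path


def append_comment(data_dir, forum_id, comment):
    """Append one comment (a dict) as a single line to <data_dir>/<forum_id>.jsonl."""
    path = Path(data_dir) / f"{forum_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII comment text readable on disk.
        fh.write(json.dumps(comment, ensure_ascii=False) + "\n")
    return path


def read_comments(path):
    """Yield each stored comment back as a dict, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Appending line by line means a scrape that stops partway leaves a valid file behind, and a later run can continue from where it left off.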
<a id="org6ed72dc"></a>
## Analysis
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation). For example, "I strongly believe the government should NOT interfere" registers as negative in tone, but the signal we want is that the commenter is "against" the regulation. We will pass the forum/bill title as context with each comment, and cache the full text of the proposed change, perhaps as a fallback source of context.
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-left" />
<col class="org-left" />
<col class="org-left" />
<col class="org-left" />
<col class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Tool</th>
<th scope="col" class="org-left">Output</th>
<th scope="col" class="org-left">Context</th>
<th scope="col" class="org-left">Sarcasm</th>
<th scope="col" class="org-left">Context window</th>
<th scope="col" class="org-left">Cost/1k comments</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Google NL API</td>
<td class="org-left">-1→+1, magnitude</td>
<td class="org-left">No/generic</td>
<td class="org-left">Poorly</td>
<td class="org-left">No</td>
<td class="org-left">~$12</td>
</tr>
<tr>
<td class="org-left">Amazon Comprehend</td>
<td class="org-left">Pos/Neg/Neutral/Mixed</td>
<td class="org-left">No/generic</td>
<td class="org-left">Poorly</td>
<td class="org-left">No</td>
<td class="org-left">~$0.10</td>
</tr>
<tr>
<td class="org-left">Claude Haiku</td>
<td class="org-left">Prompted → for/against/neutral</td>
<td class="org-left">Yes</td>
<td class="org-left">Yes, with prompt</td>
<td class="org-left">Yes</td>
<td class="org-left">~$0.10-0.30</td>
</tr>
<tr>
<td class="org-left">GPT-4o-mini</td>
<td class="org-left">Prompted → same</td>
<td class="org-left">Yes</td>
<td class="org-left">Yes</td>
<td class="org-left">Yes</td>
<td class="org-left">~$0.05-0.15</td>
</tr>
</tbody>
</table>
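To make the tone-versus-stance distinction concrete, here is a rough sketch of how a prompted model could be asked for stance and how its reply could be normalized. It deliberately makes no API call. The prompt wording and the names `build_stance_prompt`, `parse_stance`, and `STANCES` are hypothetical, not a tested prompt.

```python
STANCES = ("for", "against", "neutral")


def build_stance_prompt(regulation_title, comment_text):
    """Build a prompt asking for stance toward the regulation, not tone."""
    return (
        "You are labeling public comments on a proposed regulation.\n"
        f"Regulation: {regulation_title}\n"
        f"Comment: {comment_text}\n"
        "Is the commenter for or against the regulation, or neutral? "
        "Judge their position on the regulation, not the tone of the writing. "
        "Answer with exactly one word: for, against, or neutral."
    )


def parse_stance(reply):
    """Map a free-text model reply onto one of STANCES, or None if unclear."""
    words = reply.strip().lower().split()
    if not words:
        return None
    # Drop trailing punctuation such as "Against." before matching.
    first = words[0].strip(".,:;!\"'")
    return first if first in STANCES else None
```

Returning `None` for anything unexpected keeps unclear replies out of the results, so they can be retried or reviewed by hand instead of being silently counted as neutral.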
<a id="org416f14d"></a>
# Roadmap
1. Scrape one forum
2. Compare sentiment models
3. Display
4. Scrape all data
5. Scale?