initial commit

2026-05-05 11:35:19 -04:00
commit cd3543bd0f
12 changed files with 507 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,26 @@
+# --- python bytecode ---
+__pycache__/
+*.py[cod]
+*$py.class
+
+# --- environment files ---
+.env
+.env.*
+*.local
+.venv/
+venv/
+env/
+
+# --- emacs ---
+*~
+\#*\#
+.\#*
+*.elc
+
+# --- project private data ---
+/private/
+archive/
+
+
+# --- misc ---
+.DS_Store
--- a/README.md
+++ b/README.md
@@ -0,0 +1,133 @@
+
+# Table of Contents
+
+1.  [Project Goals](#org863a759)
+2.  [Architecture](#orgcd91fd0)
+    1.  [Scraper](#org3256ad3)
+    2.  [Storage](#org7a9a92c)
+    3.  [Analysis](#org6ed72dc)
+3.  [Roadmap](#org416f14d)
+
+
+
+<a id="org863a759"></a>
+
+# Project Goals
+
+1.  Document and analyze sentiment of public comments on Virginia law, to determine:
+    1.  the utility of this forum as a mechanism for public comment, and
+    2.  the impact of this forum on Virginia regulation.
+2.  Make data and insights broadly available.
+3.  Generalize to other public comment tools.
+
+
+<a id="orgcd91fd0"></a>
+
+# Architecture
+
+1.  Scrape/Parse: **Scrapy** for downloading comments
+2.  Storage: JSONL
+3.  Sentiment analysis: Claude Haiku
+4.  Display: TBD
+
+
+<a id="org3256ad3"></a>
+
+## Scraper
+
+Scrapy provides a simple mechanism for browsing and parsing the site's three page types:
+
+1.  Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
+2.  Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
+3.  Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
+
+
+<a id="org7a9a92c"></a>
+
+## Storage
+
+One JSONL file per forum/bill.
+
+
+<a id="org6ed72dc"></a>
+
+## Analysis
+
+Google and Amazon both return generic sentiment (the tone of the writing: positive/negative), not stance (for/against the regulation). For example, a tone classifier scores "I strongly believe the government should NOT interfere" as negative, whereas the label we need is "against" the regulation. To give the model context, we will pass it the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
+
+<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
+
+
+<colgroup>
+<col  class="org-left" />
+
+<col  class="org-left" />
+
+<col  class="org-left" />
+
+<col  class="org-left" />
+
+<col  class="org-left" />
+
+<col  class="org-left" />
+</colgroup>
+<thead>
+<tr>
+<th scope="col" class="org-left">Tool</th>
+<th scope="col" class="org-left">Output</th>
+<th scope="col" class="org-left">Context</th>
+<th scope="col" class="org-left">Sarcasm</th>
+<th scope="col" class="org-left">Context window</th>
+<th scope="col" class="org-left">Cost/1k comments</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td class="org-left">Google NL API</td>
+<td class="org-left">-1→+1, magnitude</td>
+<td class="org-left">No/generic</td>
+<td class="org-left">Poorly</td>
+<td class="org-left">No</td>
+<td class="org-left">~$1–2</td>
+</tr>
+
+<tr>
+<td class="org-left">Amazon Comprehend</td>
+<td class="org-left">Pos/Neg/Neutral/Mixed</td>
+<td class="org-left">No/generic</td>
+<td class="org-left">Poorly</td>
+<td class="org-left">No</td>
+<td class="org-left">~$0.10</td>
+</tr>
+
+<tr>
+<td class="org-left">Claude Haiku</td>
+<td class="org-left">Prompted → for/against/neutral</td>
+<td class="org-left">Yes</td>
+<td class="org-left">Yes, with prompt</td>
+<td class="org-left">Yes</td>
+<td class="org-left">~$0.10–0.30</td>
+</tr>
+
+<tr>
+<td class="org-left">GPT-4o-mini</td>
+<td class="org-left">Prompted → same</td>
+<td class="org-left">Yes</td>
+<td class="org-left">Yes</td>
+<td class="org-left">Yes</td>
+<td class="org-left">~$0.05–0.15</td>
+</tr>
+</tbody>
+</table>
+
+
+<a id="org416f14d"></a>
+
+# Roadmap
+
+1.  Scrape one forum
+2.  Compare sentiment models
+3.  Display
+4.  Scrape all data
+5.  Scale?
+
--- a/agents.md
+++ b/agents.md
@@ -0,0 +1,40 @@
+# agent rules
+
+## priorities
+- optimize for simplicity, boringness, and long-term maintainability
+- prefer minimal diffs; avoid refactors unless required for the active task
+
+## tech stack
+- python; scrapy
+- file storage: json or csv
+- assume local virtual env is available and accessible
+- do not add new dependencies unless explicitly approved; if unavoidable, document justification in the active task notes
+
+## workflow
+- prefer direct argv commands (no bash -lc / compound shell chains) unless necessary
+- work on ONE task at a time unless explicitly instructed otherwise
+- at the start of work, state the task id you are executing
+- a task id is required before starting work; if none is specified, choose the earliest unchecked task and say so
+- propose incremental steps
+- always include basic tests for core logic
+- when you complete a task:
+  - mark it [X] in docs/tasks.org
+  - fill in evidence with commit hash + commands run
+  - never mark complete unless acceptance criteria are met
+  - include date and time (HH:MM)
+
+```
+* [ ] t1.1 Task Title (1)
+Description and PM notes
+** acceptance criteria
+1. AC 1
+2. AC 2
+
+** notes
+- document thoughts, decisions, reasoning
+
+** evidence
+- commit: 
+- tests: 
+- datetime: 
+```
--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -0,0 +1,28 @@
+* [ ] t1.1: scrape one forum (1)
+Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step.
+** acceptance criteria
+1. run manual scraper
+   1. store proposal title and description
+   2. store comment title, commenter, date
+   3. store relevant metadata
+2. friendly/polite scraping
+  
+** notes
+
+** evidence
+- commit: 
+- tests: 
+- datetime: 
+
+* [ ] t1.2: initial analysis pipeline
+Write a simple analysis pipeline covering both models in the acceptance criteria. Prefer running it separately from the scraping run rather than concurrently with it. It should be run manually, separate from the scraper. You may use scrapy, but are not required to.
+** acceptance criteria
+1. run manual sentiment analysis of selected file against haiku
+2. run manual sentiment analysis of selected file against gpt-4o
+
+** notes
+
+** evidence
+- commit: 
+- tests: 
+- datetime: 
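As a possible starting point for t1.2, the stance labelling step could be sketched without committing to either vendor's SDK. The function names `build_stance_prompt` and `parse_stance` are hypothetical, and the prompt wording is an assumption; only the for/against/neutral labels come from the comparison table in the README. The model call itself is deliberately left out.

```python
# Labels taken from the README comparison table ("for/against/neutral").
STANCE_LABELS = ("for", "against", "neutral")


def build_stance_prompt(regulation_title, comment_text):
    """Build a prompt that asks for stance toward the regulation, not tone.

    The wording is illustrative only; it is not a tested prompt.
    """
    return (
        "You are labeling public comments on a proposed regulation.\n"
        f"Regulation: {regulation_title}\n"
        f"Comment: {comment_text}\n"
        "Is the commenter for or against the regulation, or neutral? "
        "Judge their position on the regulation, not the tone of the writing. "
        "Answer with exactly one word: for, against, or neutral."
    )


def parse_stance(reply):
    """Normalize a model reply to one of STANCE_LABELS, or None if unclear."""
    word = reply.strip().strip(".!\"'").lower()
    return word if word in STANCE_LABELS else None
```

Keeping prompt construction and reply parsing as plain functions means the same code can be pointed at both models, and unparseable replies surface as `None` instead of being silently counted.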
--- a/docs/vatownhall.org
+++ b/docs/vatownhall.org
@@ -0,0 +1,53 @@
+#+title: VA Townhall
+#+date: [2026-05-05 Tue]
+#+version: 1
+
+* Project Goals
+1. Document and analyze sentiment of public comments on Virginia law, to determine:
+   1. the utility of this forum as a mechanism for public comment, and
+   2. the impact of this forum on Virginia regulation.
+2. Make data and insights broadly available.
+3. Generalize to other public comment tools.
+
+** Document and analyze sentiment
+- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
+- Build tests for identifying abuse, such as spam and account fraud
+- Identify any patterns connecting measured sentiment against VA decisions
+  
+** Make data available
+- Pick a good visualization tool
+
+** Generalize
+- Identify scalable ways to apply this toolset to similar problems
+
+* Architecture
+1. Scrape/Parse: *Scrapy* for downloading comments
+2. Storage: JSONL
+3. Sentiment analysis: Claude Haiku
+4. Display: TBD   
+
+** Scraper
+Scrapy provides a simple mechanism for browsing and parsing the site's three page types:
+1. Forums listing page: ~Forums.cfm~ - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
+2. Comment listing page: ~comments.cfm?GDocForumID=X~ or ~comments.cfm?stageid=X~ or ~comments.cfm?petitionid=X~ - lists comments with title, author, date
+3. Individual comment page: ~viewcomments.cfm?commentid=X~ - shows regulation title + brief description at the top, plus the comment
+
+** Storage
+One JSONL file per forum/bill.
+
+** Analysis
+Google and Amazon both return generic sentiment (the tone of the writing: positive/negative), not stance (for/against the regulation). For example, a tone classifier scores "I strongly believe the government should NOT interfere" as negative, whereas the label we need is "against" the regulation. To give the model context, we will pass it the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
+
+| Tool              | Output                         | Context    | Sarcasm          | Context window | Cost/1k comments |
+|-------------------+--------------------------------+------------+------------------+----------------+------------------|
+| Google NL API     | -1→+1, magnitude               | No/generic | Poorly           | No             | ~$1–2            |
+| Amazon Comprehend | Pos/Neg/Neutral/Mixed          | No/generic | Poorly           | No             | ~$0.10           |
+| Claude Haiku      | Prompted → for/against/neutral | Yes        | Yes, with prompt | Yes            | ~$0.10–0.30      |
+| GPT-4o-mini       | Prompted → same                | Yes        | Yes              | Yes            | ~$0.05–0.15      |
+
+* Roadmap
+1. Scrape one forum
+2. Compare sentiment models
+3. Display   
+4. Scrape all data
+5. Scale?
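The three page types listed under Scraper can be told apart from the URL alone. The sketch below assumes the URL shapes are exactly those listed there; `classify_url` and the page-type names it returns are hypothetical and do not exist in this commit.

```python
from urllib.parse import urlparse, parse_qs

# Query parameters that identify a comment listing, as listed in the
# Scraper section. The tuple name is hypothetical.
FORUM_ID_PARAMS = ("GDocForumID", "stageid", "petitionid")


def classify_url(url):
    """Return (page_type, id_param, id_value) for a Town Hall URL, or None."""
    parsed = urlparse(url)
    page = parsed.path.rsplit("/", 1)[-1].lower()
    # Lower-case the query keys so matching is case-insensitive.
    query = {k.lower(): v[0] for k, v in parse_qs(parsed.query).items()}

    if page == "forums.cfm":
        return ("forum_list", None, None)
    if page == "comments.cfm":
        for param in FORUM_ID_PARAMS:
            value = query.get(param.lower())
            if value is not None:
                return ("comment_list", param, value)
    if page == "viewcomments.cfm":
        value = query.get("commentid")
        if value is not None:
            return ("comment", "commentid", value)
    return None
```

A spider could use a helper like this to route responses to the right parse callback and to record which identifier a forum was reached through.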
--- a/scraper/__init__.py
+++ b/scraper/__init__.py
--- a/scraper/items.py
+++ b/scraper/items.py
@@ -0,0 +1,12 @@
+# Define here the models for your scraped items
+#
+# See documentation in:
+# https://docs.scrapy.org/en/latest/topics/items.html
+
+import scrapy
+
+
+class ScraperItem(scrapy.Item):
+    # define the fields for your item here like:
+    # name = scrapy.Field()
+    pass
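The generated `ScraperItem` is still an empty placeholder. One way the fields from the t1.1 acceptance criteria might be modelled is a plain dataclass, which Scrapy accepts as an item type. Every field name below is an assumption chosen for illustration; `body` and `source_url` in particular are not named in the acceptance criteria.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class CommentItem:
    # Hypothetical field names mirroring the t1.1 acceptance criteria:
    # proposal title and description, comment title, commenter, date.
    forum_id: str
    proposal_title: str
    proposal_description: str
    comment_id: str
    comment_title: str
    commenter: str
    date: str  # kept as scraped text; date parsing is left open
    # Assumed extras, not listed in the acceptance criteria.
    body: Optional[str] = None
    source_url: Optional[str] = None
```

`asdict(item)` turns an instance into a plain dict, which is convenient when writing one JSON object per line.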
--- a/scraper/middlewares.py
+++ b/scraper/middlewares.py
@@ -0,0 +1,100 @@
+# Define here the models for your spider middleware
+#
+# See documentation in:
+# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
+
+from scrapy import signals
+
+# useful for handling different item types with a single interface
+from itemadapter import ItemAdapter
+
+
+class ScraperSpiderMiddleware:
+    # Not all methods need to be defined. If a method is not defined,
+    # scrapy acts as if the spider middleware does not modify the
+    # passed objects.
+
+    @classmethod
+    def from_crawler(cls, crawler):
+        # This method is used by Scrapy to create the middleware instance.
+        s = cls()
+        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
+        return s
+
+    def process_spider_input(self, response, spider):
+        # Called for each response that goes through the spider
+        # middleware and into the spider.
+
+        # Should return None or raise an exception.
+        return None
+
+    def process_spider_output(self, response, result, spider):
+        # Called with the results returned from the Spider, after
+        # it has processed the response.
+
+        # Must return an iterable of Request, or item objects.
+        for i in result:
+            yield i
+
+    def process_spider_exception(self, response, exception, spider):
+        # Called when a spider or process_spider_input() method
+        # (from other spider middleware) raises an exception.
+
+        # Should return either None or an iterable of Request or item objects.
+        pass
+
+    async def process_start(self, start):
+        # Called with an async iterator over the spider start() method or the
+        # matching method of an earlier spider middleware.
+        async for item_or_request in start:
+            yield item_or_request
+
+    def spider_opened(self, spider):
+        spider.logger.info("Spider opened: %s", spider.name)
+
+
+class ScraperDownloaderMiddleware:
+    # Not all methods need to be defined. If a method is not defined,
+    # scrapy acts as if the downloader middleware does not modify the
+    # passed objects.
+
+    @classmethod
+    def from_crawler(cls, crawler):
+        # This method is used by Scrapy to create the middleware instance.
+        s = cls()
+        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
+        return s
+
+    def process_request(self, request, spider):
+        # Called for each request that goes through the downloader
+        # middleware.
+
+        # Must either:
+        # - return None: continue processing this request
+        # - or return a Response object
+        # - or return a Request object
+        # - or raise IgnoreRequest: process_exception() methods of
+        #   installed downloader middleware will be called
+        return None
+
+    def process_response(self, request, response, spider):
+        # Called with the response returned from the downloader.
+
+        # Must either:
+        # - return a Response object
+        # - return a Request object
+        # - or raise IgnoreRequest
+        return response
+
+    def process_exception(self, request, exception, spider):
+        # Called when a download handler or a process_request()
+        # (from other downloader middleware) raises an exception.
+
+        # Must either:
+        # - return None: continue processing this exception
+        # - return a Response object: stops process_exception() chain
+        # - return a Request object: stops process_exception() chain
+        pass
+
+    def spider_opened(self, spider):
+        spider.logger.info("Spider opened: %s", spider.name)
--- a/scraper/pipelines.py
+++ b/scraper/pipelines.py
@@ -0,0 +1,13 @@
+# Define your item pipelines here
+#
+# Don't forget to add your pipeline to the ITEM_PIPELINES setting
+# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
+
+
+# useful for handling different item types with a single interface
+from itemadapter import ItemAdapter
+
+
+class ScraperPipeline:
+    def process_item(self, item, spider):
+        return item
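The docs call for one JSONL file per forum, while the generated `ScraperPipeline` passes items through unchanged. A minimal sketch of a pipeline that writes such files follows. The class name `JsonlPerForumPipeline`, the `data` output directory, and the `forum_id` key are all assumptions, and items are assumed to be dicts or dict-like. It follows the same `process_item(self, item, spider)` shape as the template above and would still need to be registered in `ITEM_PIPELINES`.

```python
import json
from pathlib import Path


class JsonlPerForumPipeline:
    """Append each item to <out_dir>/<forum_id>.jsonl, one JSON object per line."""

    def __init__(self, out_dir="data"):
        # "data" is a placeholder directory, not something this commit defines.
        self.out_dir = Path(out_dir)
        self._files = {}

    def open_spider(self, spider):
        self.out_dir.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        record = dict(item)
        # Items without a forum_id end up in unknown.jsonl.
        forum_id = str(record.get("forum_id", "unknown"))
        fh = self._files.get(forum_id)
        if fh is None:
            path = self.out_dir / f"{forum_id}.jsonl"
            fh = self._files[forum_id] = path.open("a", encoding="utf-8")
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        for fh in self._files.values():
            fh.close()
        self._files.clear()
```

Opening files in append mode keeps earlier runs intact, so deduplication across runs would have to be handled separately.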
--- a/scraper/settings.py
+++ b/scraper/settings.py
@@ -0,0 +1,87 @@
+# Scrapy settings for scraper project
+#
+# For simplicity, this file contains only settings considered important or
+# commonly used. You can find more settings consulting the documentation:
+#
+#     https://docs.scrapy.org/en/latest/topics/settings.html
+#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
+#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
+
+BOT_NAME = "scraper"
+
+SPIDER_MODULES = ["scraper.spiders"]
+NEWSPIDER_MODULE = "scraper.spiders"
+
+ADDONS = {}
+
+
+# Crawl responsibly by identifying yourself (and your website) on the user-agent
+#USER_AGENT = "scraper (+http://www.yourdomain.com)"
+
+# Obey robots.txt rules
+ROBOTSTXT_OBEY = True
+
+# Concurrency and throttling settings
+#CONCURRENT_REQUESTS = 16
+CONCURRENT_REQUESTS_PER_DOMAIN = 1
+DOWNLOAD_DELAY = 1
+
+# Disable cookies (enabled by default)
+#COOKIES_ENABLED = False
+
+# Disable Telnet Console (enabled by default)
+#TELNETCONSOLE_ENABLED = False
+
+# Override the default request headers:
+#DEFAULT_REQUEST_HEADERS = {
+#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
+#    "Accept-Language": "en",
+#}
+
+# Enable or disable spider middlewares
+# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
+#SPIDER_MIDDLEWARES = {
+#    "scraper.middlewares.ScraperSpiderMiddleware": 543,
+#}
+
+# Enable or disable downloader middlewares
+# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
+#DOWNLOADER_MIDDLEWARES = {
+#    "scraper.middlewares.ScraperDownloaderMiddleware": 543,
+#}
+
+# Enable or disable extensions
+# See https://docs.scrapy.org/en/latest/topics/extensions.html
+#EXTENSIONS = {
+#    "scrapy.extensions.telnet.TelnetConsole": None,
+#}
+
+# Configure item pipelines
+# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
+#ITEM_PIPELINES = {
+#    "scraper.pipelines.ScraperPipeline": 300,
+#}
+
+# Enable and configure the AutoThrottle extension (disabled by default)
+# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
+#AUTOTHROTTLE_ENABLED = True
+# The initial download delay
+#AUTOTHROTTLE_START_DELAY = 5
+# The maximum download delay to be set in case of high latencies
+#AUTOTHROTTLE_MAX_DELAY = 60
+# The average number of requests Scrapy should be sending in parallel to
+# each remote server
+#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
+# Enable showing throttling stats for every response received:
+#AUTOTHROTTLE_DEBUG = False
+
+# Enable and configure HTTP caching (disabled by default)
+# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
+#HTTPCACHE_ENABLED = True
+#HTTPCACHE_EXPIRATION_SECS = 0
+#HTTPCACHE_DIR = "httpcache"
+#HTTPCACHE_IGNORE_HTTP_CODES = []
+#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
+
+# Set settings whose default value is deprecated to a future-proof value
+FEED_EXPORT_ENCODING = "utf-8"
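Task t1.1 asks for friendly/polite scraping. Beyond the delay and per-domain concurrency already set above, the commented-out template options suggest a few more knobs. The fragment below is one possible combination, not a tuned configuration; the contact URL in `USER_AGENT` is a placeholder.

```python
# Identify the crawler. The contact URL is a placeholder, not a real address.
USER_AGENT = "scraper (+https://example.org/contact)"

# Already set in this commit; repeated so the fragment reads on its own.
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1

# AutoThrottle adjusts the delay based on observed latency; it does not go
# below DOWNLOAD_DELAY. Values are the template defaults shown above.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Cache responses locally so repeated development runs do not re-hit the site.
HTTPCACHE_ENABLED = True
```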
--- a/scraper/spiders/__init__.py
+++ b/scraper/spiders/__init__.py
@@ -0,0 +1,4 @@
+# This package will contain the spiders of your Scrapy project
+#
+# Please refer to the documentation for information on how to create and manage
+# your spiders.
--- a/scrapy.cfg
+++ b/scrapy.cfg
@@ -0,0 +1,11 @@
+# Automatically created by: scrapy startproject
+#
+# For more information about the [deploy] section see:
+# https://scrapyd.readthedocs.io/en/latest/deploy.html
+
+[settings]
+default = scraper.settings
+
+[deploy]
+#url = http://localhost:6800/
+project = scraper