t1.1: scrape one forum via ViewComments.cfm POST pagination

Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500, generates all page requests from page-1 metadata, and parses each div.Cbox for comment_id, author, date, title, text, reg_title, reg_desc. Handles span-wrapped comment text. Fixes UTF-8/windows-1251 meta-tag encoding mismatch. 9083 items, 15 empty-text (0.17%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:28:07 -04:00
parent 02964312cb
commit beb5cf461b
6 changed files with 387 additions and 22 deletions
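
The pagination and encoding handling described in the commit message might look roughly like the sketch below. It is not the committed spider: it assumes page 1 reports a total comment count, that pages are selected by a form field named `vPage` (a hypothetical name; only `vPerPage` appears in the message), and that the mismatch is UTF-8 bytes served under a windows-1251 meta tag (the message does not state the direction).

```python
import math


def remaining_page_forms(total_comments: int, per_page: int = 500) -> list[dict[str, str]]:
    """Build POST form bodies for pages 2..N, given the total read from page 1.

    "vPerPage" comes from the commit message; "vPage" is a hypothetical
    field name used only for illustration.
    """
    if total_comments < 0 or per_page <= 0:
        raise ValueError("total_comments must be >= 0 and per_page must be > 0")
    page_count = math.ceil(total_comments / per_page)
    return [
        {"vPerPage": str(per_page), "vPage": str(page)}
        for page in range(2, page_count + 1)  # page 1 has already been fetched
    ]


def decode_body(raw: bytes, declared: str = "windows-1251") -> str:
    """Decode a response body, trying UTF-8 before the declared encoding.

    Assumption: the bytes are usually UTF-8 even though the meta tag says
    otherwise, so the declared encoding is only a fallback.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(declared, errors="replace")


forms = remaining_page_forms(9083)  # item count taken from the commit message
empty_rate = 15 / 9083              # share of items with empty text
```

With 9083 items at 500 per page this yields 19 pages in total, so 18 follow-up requests after page 1. Each dict could be passed as `formdata` to a Scrapy `FormRequest`, assuming the spider is Scrapy-based.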
--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -1,5 +1,8 @@
-* [ ] t1.1: scrape one forum (1)
+* [X] t1.1: scrape one forum (1)
 Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step.
+ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be a raw list of all comments on a forum; could be useful later for a whole-site scrape
+Per forum, append the forum ID to the view-all URL (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452)
+Comments are loaded from the backend when a JS-triggered button is clicked (AJAX?)
 ** acceptance criteria
 1. run manual scraper
   1. store proposal title and description
@@ -10,9 +13,9 @@ Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the firs
 ** notes

 ** evidence
- commit: 
- tests: 
- datetime: 
+- commit: (see below)
+- tests: 7 passing (pytest tests/)
+- datetime: 2026-05-05 12:26

 * [ ] t1.2: initial analysis pipeline
 Write a simple pipeline for both - prefer non-concurrent/async from scraping run. Should be run manually, separate from scraper. You may use scrapy, but are not required to.