Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500, generates all page requests from page-1 metadata, and parses each div.Cbox for comment_id, author, date, title, text, reg_title, reg_desc. Handles span-wrapped comment text. Fixes UTF-8/windows-1251 meta-tag encoding mismatch. 9083 items, 15 empty-text (0.17%). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
32 lines
1.2 KiB
Org Mode
* [X] t1.1: scrape one forum (1)
Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. The scraper should be run manually at this step.
ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be a raw list of all comments on a forum; it could be useful later for a whole-scrape.
Append the forum id to get the view-all page for each forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452).
Comments are hydrated in the backend via a JS-triggered button (AJAX?).
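A minimal sketch of how the per-forum view-all URLs could be generated, one per page. The `GdocForumID` and `vPerPage=500` parameters come from the notes and the commit message; the page parameter name `vPage` is a hypothetical placeholder, and the total comment count is assumed to be readable from page 1. The helper names `page_count` and `page_urls` are illustrative only.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.townhall.virginia.gov/L/ViewComments.cfm"
PER_PAGE = 500  # vPerPage value named in the commit message


def page_count(total_comments: int, per_page: int = PER_PAGE) -> int:
    """Number of pages needed to list every comment (at least one)."""
    return max(1, math.ceil(total_comments / per_page))


def page_urls(forum_id: int, total_comments: int, per_page: int = PER_PAGE) -> list[str]:
    """Build one view-all URL per page for a single forum."""
    urls = []
    for page in range(1, page_count(total_comments, per_page) + 1):
        query = urlencode({
            "GdocForumID": forum_id,
            "vPerPage": per_page,
            "vPage": page,  # HYPOTHETICAL parameter name; check the real site
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls
```

With the 9083 items reported in the commit message and 500 comments per page, this yields 19 page requests.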
** acceptance criteria
1. run manual scraper
   1. store proposal title and description
   2. store comment title, commenter, date
   3. store relevant metadata
2. friendly/polite scraping
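A rough sketch of what these criteria might look like in code. The item fields follow the names in the commit message (`reg_title` and `reg_desc` are presumed to hold the proposal title and description). The settings keys are standard Scrapy settings, but every value, the contact address, and the names `CommentItem`, `POLITE_SETTINGS` and `empty_text_rate` are assumptions, not the project's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class CommentItem:
    # Field names follow the commit message; values are kept as raw strings.
    comment_id: str
    author: str
    date: str
    title: str
    text: str
    reg_title: str  # presumably the proposal title
    reg_desc: str   # presumably the proposal description


# Scrapy setting names are real; the values are assumptions for a polite crawl.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,
    "DOWNLOAD_DELAY": 2.0,                # seconds between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per site
    "AUTOTHROTTLE_ENABLED": True,
    "USER_AGENT": "townhall-comment-scraper (contact: you@example.org)",
}


def empty_text_rate(items: list[CommentItem]) -> float:
    """Fraction of items whose comment text is empty or whitespace-only."""
    if not items:
        return 0.0
    empty = sum(1 for item in items if not item.text.strip())
    return empty / len(items)
```

In Scrapy, a dict like this can be assigned to a spider's `custom_settings` class attribute. The empty-text check mirrors the figure in the commit message (15 of 9083 items, about 0.17%).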
** notes
** evidence
- commit: (see below)
- tests: 7 passing (pytest tests/)
- datetime: 2026-05-05 12:26
* [ ] t1.2: initial analysis pipeline
Write a simple sentiment-analysis pipeline for both models (haiku and gpt-4o). Prefer running it non-concurrently with (asynchronously from) the scraping run. It should be run manually, separate from the scraper. You may use scrapy, but are not required to.
** acceptance criteria
1. run manual sentiment analysis of selected file against haiku
2. run manual sentiment analysis of selected file against gpt-4o
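One possible shape for such a pipeline, sketched under several assumptions: the scraper output is a JSON Lines file with `comment_id` and `text` fields, each model is wrapped in a plain function that maps text to a label, and comments are processed sequentially. `keyword_stub` is a placeholder only; it makes no API call and stands in for whatever haiku or gpt-4o client is eventually used. All function names here are hypothetical.

```python
import json
from typing import Callable

# A classifier is any function that maps comment text to a sentiment label.
Classifier = Callable[[str], str]


def load_comments(path: str) -> list[dict]:
    """Read one JSON object per line (JSON Lines), skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def run_sentiment(comments: list[dict], models: dict[str, Classifier]) -> list[dict]:
    """Label every non-empty comment with each model, one comment at a time."""
    results = []
    for comment in comments:
        text = (comment.get("text") or "").strip()
        if not text:
            continue  # skip empty-text items
        row = {"comment_id": comment.get("comment_id")}
        for name, classify in models.items():
            row[name] = classify(text)
        results.append(row)
    return results


def keyword_stub(text: str) -> str:
    """Placeholder classifier for dry runs; NOT a real model call."""
    lowered = text.lower()
    if "oppose" in lowered:
        return "negative"
    if "support" in lowered:
        return "positive"
    return "neutral"
```

A manual run could then call `run_sentiment(load_comments(path), {"haiku": ..., "gpt-4o": ...})` once per selected file. Swapping the stub for a real client function leaves the loop unchanged.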
** notes
** evidence
- commit:
- tests:
- date: