vath/docs/tasks.org
eulaly beb5cf461b t1.1: scrape one forum via ViewComments.cfm POST pagination
Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500,
generates all page requests from page-1 metadata, and parses
each div.Cbox for comment_id, author, date, title, text, reg_title,
reg_desc. Handles span-wrapped comment text. Fixes UTF-8/windows-1251
meta-tag encoding mismatch. 9083 items, 15 empty-text (0.17%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:28:07 -04:00


[X] t1.1: scrape one forum (1)

Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. The scraper should be run manually at this step. ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be a raw list of all comments on a forum, which could be useful later for a full scrape. Append the forum id to get the view-all page for a single forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452). Comments appear to be loaded from the backend when a JS-triggered button is clicked (AJAX?).
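The per-forum URL and the page-1-driven pagination mentioned in the commit message can be sketched as follows. This is a minimal sketch, not the committed spider: `vPerPage=500` and the `GdocForumID` query parameter come from this file, while the page form field name (`vPage`), the helper names, and the idea that page 1 exposes a total comment count are assumptions.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.townhall.virginia.gov/L/ViewComments.cfm"


def page_count(total_comments: int, per_page: int = 500) -> int:
    """Number of pages needed to cover total_comments (at least one)."""
    return max(1, math.ceil(total_comments / per_page))


def page_requests(forum_id: int, total_comments: int,
                  per_page: int = 500, page_field: str = "vPage"):
    """Return one (url, form_data) pair per page of a forum.

    page_field is a hypothetical form field name; check the real
    POST body in the browser's network tab before relying on it.
    """
    url = f"{BASE_URL}?{urlencode({'GdocForumID': forum_id})}"
    pages = range(1, page_count(total_comments, per_page) + 1)
    return [
        (url, {"vPerPage": str(per_page), page_field: str(page)})
        for page in pages
    ]
```

With the 9083 items reported in the commit message and 500 per page, `page_count(9083)` gives 19 pages. Each pair could be passed to a POST helper such as Scrapy's `FormRequest(url, formdata=...)`.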

acceptance criteria

  1. run manual scraper

    1. store proposal title and description
    2. store comment title, commenter, date
    3. store relevant metadata
  2. friendly/polite scraping
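One way to read the criteria above is a flat record per comment plus conservative crawl settings. The sketch below is illustrative only: the field names are the ones listed in the commit message, the setting keys are standard Scrapy settings, and the class name `CommentItem`, the setting values, and the contact address in the user agent are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CommentItem:
    """One scraped comment; field names follow the commit message."""
    comment_id: str
    author: str
    date: str
    title: str
    text: str
    reg_title: str  # proposal (regulation) title
    reg_desc: str   # proposal (regulation) description

    def to_json_line(self) -> str:
        # ensure_ascii=False keeps non-ASCII comment text readable on disk
        return json.dumps(asdict(self), ensure_ascii=False)


# Assumed values; tune them to whatever the site tolerates.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,
    "DOWNLOAD_DELAY": 2.0,  # seconds between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    "AUTOTHROTTLE_ENABLED": True,
    "USER_AGENT": "vath-scraper (contact: you@example.org)",
}
```

In a Scrapy spider, a dict like `POLITE_SETTINGS` can be assigned to the `custom_settings` class attribute so the limits travel with the spider rather than the project.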

notes

evidence

  • commit: (see below)
  • tests: 7 passing (pytest tests/)
  • datetime: 2026-05-05 12:26

[ ] t1.2: initial analysis pipeline

Write a simple pipeline for both models (haiku and gpt-4o). Prefer a sequential (non-concurrent, non-async) run that is decoupled from the scraping run. It should be run manually, separately from the scraper. You may use scrapy, but are not required to.

acceptance criteria

  1. run manual sentiment analysis of selected file against haiku
  2. run manual sentiment analysis of selected file against gpt-4o
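A sequential pipeline for these two criteria might look like the sketch below. It assumes the selected file is JSON Lines with `comment_id` and `text` keys, and it takes the model call as a plain callable (`ask_model`) so the same loop can be pointed at haiku or gpt-4o. The label set, prompt wording, and function names are all assumptions, and no vendor SDK call is shown.

```python
import json
from typing import Callable, Iterable

LABELS = ("positive", "negative", "neutral")  # assumed label set


def build_prompt(text: str) -> str:
    """Wrap one comment in a short classification instruction."""
    return (
        "Classify the sentiment of this public comment as one of "
        f"{', '.join(LABELS)}. Reply with the label only.\n\n{text}"
    )


def normalize_label(raw: str) -> str:
    """Map a raw model reply onto LABELS, or 'unknown' if it does not fit."""
    label = raw.strip().lower().rstrip(".")
    return label if label in LABELS else "unknown"


def analyze(lines: Iterable[str],
            ask_model: Callable[[str], str]) -> list[dict]:
    """Classify comments one at a time, in file order (no concurrency)."""
    results = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the input file
        item = json.loads(line)
        text = (item.get("text") or "").strip()
        if not text:
            continue  # skip empty-text comments
        reply = ask_model(build_prompt(text))
        results.append({
            "comment_id": item.get("comment_id"),
            "sentiment": normalize_label(reply),
        })
    return results
```

A manual run would open the chosen file, pass the file object as `lines`, and supply one `ask_model` function per model, each wrapping that vendor's client.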

notes

evidence

  • commit:
  • tests:
  • date: