vath/docs/tasks.org at beb5cf461b16e451f1bdca97cfa68fb126dc13f2


eulaly beb5cf461b t1.1: scrape one forum via ViewComments.cfm POST pagination

Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500,
generates all page requests from page-1 metadata, and parses
each div.Cbox for comment_id, author, date, title, text, reg_title,
reg_desc. Handles span-wrapped comment text. Fixes UTF-8/windows-1251
meta-tag encoding mismatch. 9083 items, 15 empty-text (0.17%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-05 12:28:07 -04:00



[X] t1.1: scrape one forum (1)

Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. The scraper should be run manually at this step.

ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be a raw list of all comments on a forum; it could be useful later for a whole scrape. Append the forum id to ViewComments to view all comments for one forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452).

Comments appear to be hydrated from the backend via a JS-triggered button (AJAX?).
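The per-forum URL and its pagination could be sketched roughly as below. This is a minimal sketch, not the scraper's actual code: `vPerPage=500` and the `GdocForumID` parameter come from the commit message and the notes above, while the page-number field name `vPage` and the idea of deriving the page count from a comment total read off page 1 are assumptions for illustration.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.townhall.virginia.gov/L/ViewComments.cfm"


def forum_url(forum_id: int) -> str:
    """Build the view-all URL for one forum by appending its id."""
    return f"{BASE_URL}?{urlencode({'GdocForumID': forum_id})}"


def page_payloads(total_comments: int, per_page: int = 500,
                  page_field: str = "vPage") -> list[dict[str, str]]:
    """Return one POST form payload per results page.

    `page_field` is a hypothetical field name; inspect the real form
    before relying on it.
    """
    pages = math.ceil(total_comments / per_page)
    return [
        {"vPerPage": str(per_page), page_field: str(page)}
        for page in range(1, pages + 1)
    ]


if __name__ == "__main__":
    print(forum_url(452))
    print(len(page_payloads(9083)))
```

With a total of 9083 comments at 500 per page, `page_payloads` yields 19 payloads, one per POST request.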

acceptance criteria

run manual scraper
1. store proposal title and description
2. store comment title, commenter, date
3. store relevant metadata
friendly/polite scraping
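One way to pin these criteria down is a record type plus a conservative settings dict, sketched below. The field names follow the commit message (comment_id, author, date, title, text, reg_title, reg_desc); reading `reg_title` and `reg_desc` as the proposal title and description is an assumption. The setting names are standard Scrapy settings, but every value here, including the user-agent string, is a placeholder rather than what the project uses.

```python
from dataclasses import asdict, dataclass


@dataclass
class CommentRecord:
    """One scraped comment plus the proposal it belongs to."""
    comment_id: str
    author: str      # commenter
    date: str
    title: str       # comment title
    text: str
    reg_title: str   # assumed: proposal title
    reg_desc: str    # assumed: proposal description


# Conservative Scrapy settings for polite scraping.
# Values are illustrative assumptions, not tuned for this site.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,
    "DOWNLOAD_DELAY": 2.0,                # seconds between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time
    "AUTOTHROTTLE_ENABLED": True,
    "USER_AGENT": "vath-scraper (contact: set-your-address-here)",
}


if __name__ == "__main__":
    example = CommentRecord("1", "A. Commenter", "2026-01-01",
                            "Example title", "Example text",
                            "Example proposal", "Example description")
    print(asdict(example))
```

In a Scrapy spider, a dict like `POLITE_SETTINGS` can be assigned to the `custom_settings` class attribute so it applies to that spider only.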

notes

evidence

commit: (see below)
tests: 7 passing (pytest tests/)
datetime: 2026-05-05 12:26

[ ] t1.2: initial analysis pipeline

Write a simple pipeline for both. Prefer that it not run concurrently or asynchronously with the scraping run: it should be run manually, separate from the scraper. You may use scrapy, but are not required to.
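A standalone first pass might look like the sketch below, which runs after the scrape rather than alongside it. It assumes the scraper wrote one JSON object per line (JSON Lines) with at least `text` and `author` keys; the file format, the input path passed on the command line, and the choice of summary statistics are all assumptions, not decisions recorded in this task.

```python
import json
import sys
from collections import Counter
from pathlib import Path


def load_items(path):
    """Read one JSON object per non-blank line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize(items):
    """Count items, empty-text items, and the most frequent authors."""
    total = len(items)
    empty = sum(1 for item in items if not (item.get("text") or "").strip())
    authors = Counter(item.get("author") or "" for item in items)
    return {
        "total": total,
        "empty_text": empty,
        "empty_text_pct": round(100 * empty / total, 2) if total else 0.0,
        "top_authors": authors.most_common(5),
    }


# Run manually, e.g. `python analyze.py items.jsonl` (hypothetical names).
if __name__ == "__main__" and len(sys.argv) > 1:
    print(summarize(load_items(sys.argv[1])))
```

Keeping this as plain synchronous Python means it has no dependency on scrapy and can be rerun on the same output file without touching the site again.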
