Compare commits

4 Commits

Author SHA1 Message Date
314f8d2621 updated instructions and added artifacts 2026-05-05 13:50:27 -04:00
e7df0b24a1 1.1 cleanup 2026-05-05 13:50:04 -04:00
951cc11a14 t1.1: record commit hash in evidence 2026-05-05 12:28:15 -04:00
beb5cf461b t1.1: scrape one forum via ViewComments.cfm POST pagination
Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500,
generates all page requests from page-1 metadata, and parses
each div.Cbox for comment_id, author, date, title, text, reg_title,
reg_desc. Handles span-wrapped comment text. Fixes UTF-8/windows-1251
meta-tag encoding mismatch. 9083 items, 15 empty-text (0.17%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:28:07 -04:00
10 changed files with 651 additions and 36 deletions
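
The pagination approach described in the beb5cf461b message (read page 1, find the highest page number in the pager links, then queue pages 2..N up front) can be sketched with the standard library alone. This is an illustrative sketch, not the spider's code: `last_page` and `page_formdata` are hypothetical helper names, and the href shape is assumed from the fixture in tests/test_forum_spider.py.

```python
import re
from urllib.parse import urlencode

# Assumed href shape, copied from the pager links in the test fixture.
_PAGE_RE = re.compile(r"vpage\.value=(\d+)")


def last_page(hrefs):
    """Return the highest page number referenced by the pager links, or 1."""
    pages = [int(m.group(1)) for h in hrefs if (m := _PAGE_RE.search(h))]
    return max(pages) if pages else 1


def page_formdata(page, per_page=500):
    """Form fields for one page request (field names as used in the spider)."""
    return {"vPage": str(page), "vPerPage": str(per_page), "sub1": "go"}


hrefs = [
    "javascript:document.page.vpage.value=3;document.page.submit();",
    "javascript:document.page.vpage.value=2;document.page.submit();",
]

# Page 1 is already in hand, so only pages 2..last need a request body.
remaining = [urlencode(page_formdata(p)) for p in range(2, last_page(hrefs) + 1)]
print(remaining)
```

Taking the maximum over all pager links, rather than trusting a "Next" link, is what lets every remaining request be generated from the first response.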

4
.gitignore vendored

@@ -22,5 +22,9 @@ env/
 archive/
+# --- scrapy ---
+.scrapy/
+output/
 # --- misc ---
 .DS_Store

View File

@@ -5,14 +5,14 @@
 - prefer minimal diffs; avoid refactors unless required for the active task
 ## tech stack
-- python; scrapy
+- python; scrapy, pytest
 - file storage: json or csv
 - assume local virtual env is available and accessible
 - do not add new dependencies unless explicitly approved; if unavoidable, document justification in the active task notes
 ## workflow
-- prefer direct argv commands (no bash -lc / compound shell chains) unless necessary
-- work on ONE task at a time unless explicitly instructed otherwise
+- prefer direct commands
+- work on ONE task at a time unless explicitly instructed otherwise:
 - at the start of work, state the task id you are executing
 - do not start work unless a task id is specified; if missing, choose the earliest unchecked task and say so
 - propose incremental steps
@@ -22,7 +22,7 @@
- fill in evidence with commit hash + commands run - fill in evidence with commit hash + commands run
- never mark complete unless acceptance criteria are met - never mark complete unless acceptance criteria are met
- include date and time (HH:MM) - include date and time (HH:MM)
- follow this format:
``` ```
* [ ] t1.1 Task Title (1) * [ ] t1.1 Task Title (1)
Description and PM notes Description and PM notes

View File

@@ -1,18 +1,37 @@
-* [ ] t1.1: scrape one forum (1)
+* [X] t1.1: scrape one forum (1)
 Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step.
+ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be raw list of all comments on forum - could be useful later for whole-scrape
+Append forum id to viewall per forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452)
+Comments are hydrated in backend via js-cued button (AJAX?).
 ** acceptance criteria
 1. run manual scraper
 1. store proposal title and description
 2. store comment title, commenter, date
 3. store relevant metadata
 2. friendly/polite scraping
+3. store forum as distinct item with title, desc
+4. add forum ID in comment filename, eg forum452_comments_<datetime>.jsonl
+5. remove reg_title and reg_desc from each comment; these belong in forum item
+6. parse datetimes into object for later use (plotting)
 ** notes
+- scraper/spiders/forum.py — ForumSpider using ViewComments.cfm?GdocForumID=N with POST pagination. First request fetches page 1 (vPerPage=500), discovers the last page number from the form's link, generates all remaining page requests upfront. Parses each div.Cbox for all required fields.
+- scraper/items.py — CommentItem with forum_id, reg_title, reg_desc, comment_id, author, date, title, text
+- tests/test_forum_spider.py — 7 tests, all passing
+- Settings: DEFAULT_RESPONSE_ENCODING=utf-8 (fixes Windows-1251 meta-tag mismatch), HTTPCACHE_ENABLED=True, feed output to output/
+- ViewComments.cfm instead of comments.cfm: POST to Comments.cfm returned a 500 error (wrong endpoint). ViewComments.cfm?GdocForumID=N is the correct listing URL, returns full comment text on the page itself — no per-comment follow requests needed.
+- Span-wrapped text: .divComment p::text missed 3.6% of comments where text is in <p><span>text</span></p>. Fixed to .divComment *::text, .divComment::text. Worth knowing for when the spider is extended to other forums.
+- start() vs start_requests(): Scrapy 2.13+ deprecates start_requests() in favor of async def start()
+- ForumItem vs CommentItem: ForumItem (forum_id, reg_title, reg_desc) yielded once on first page; CommentItem no longer carries reg_title/reg_desc. Both land in the same JSONL feed.
+- Dynamic output filename: set via from_crawler() overriding FEEDS at 'spider' priority — format is output/forum{id}_comments_%(time)s.jsonl. FEEDS removed from settings.py; spider owns it.
+- Date parsing: _parse_date() normalizes whitespace, upper-cases, parses "%m/%d/%y %I:%M %p" → ISO 8601; falls back to raw string on failure.
 ** evidence
-- commit:
-- tests:
-- datetime:
+- commit: beb5cf4 (AC1-2), <commit> (AC3-6)
+- tests: 8 passing (`python -m pytest tests -q`) or (`python -m pytest tests/`)
+- `scrapy crawl forum -a forum_id=452 -s LOG_LEVEL=WARNING 2>&1`
+- retrieved 9083 comments
+- datetime: 2026-05-05
 * [ ] t1.2: initial analysis pipeline
 Write a simple pipeline for both - prefer non-concurrent/async from scraping run. Should be run manually, separate from scraper. You may use scrapy, but are not required to.
@@ -26,3 +45,10 @@ Write a simple pipeline for both - prefer non-concurrent/async from scraping run
 - commit:
 - tests:
 - date:
+* [ ] X: complete proposal information
+Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
+** acceptance criteria
+1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
+- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
+2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
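
The date handling described in the t1.1 notes (collapse whitespace, upper-case, parse with "%m/%d/%y %I:%M %p", fall back to the raw string) can be sketched as below. It mirrors `_parse_date` in scraper/spiders/forum.py; the name `parse_comment_date` is a hypothetical stand-in, and the sketch assumes an English locale for the AM/PM marker.

```python
from datetime import datetime


def parse_comment_date(raw):
    """Normalize a forum timestamp to ISO 8601, or return it unchanged."""
    # str.split() with no arguments also splits on non-breaking spaces,
    # so "1/4/21\xa0\xa09:15 am" collapses to "1/4/21 9:15 AM".
    normalized = " ".join(raw.split()).upper()
    try:
        return datetime.strptime(normalized, "%m/%d/%y %I:%M %p").isoformat()
    except ValueError:
        # Keep the raw value so an unparseable row is not silently lost.
        return raw


iso = parse_comment_date("1/4/21\xa0\xa09:15 am")
print(iso)

# The ISO string converts back to a datetime object for plotting later.
as_object = datetime.fromisoformat(iso)
print(as_object.hour)
```

Storing the ISO string keeps the JSONL feed plain text while still allowing a lossless round trip to `datetime` at analysis time.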

105
docs/tb.py Normal file

@@ -0,0 +1,105 @@
import jsonlines
import re
from textblob import TextBlob
from collections import Counter


def tprint(obj):
    print(f"{type(obj)} : {obj}")


def sort_file(file):
    '''return number of positive and negative comments based on TextBlob sentiment analysis'''
    # with jsonlines.open("/vadoe/vadoe/vadoe/townhall_2021-01-14T02-05-51.json") as reader:
    with jsonlines.open(file, mode='r') as reader:
        # Confirm type
        tprint(reader)
        # Build iterator
        _doc = iter(reader)
        i = 0
        pos = 0
        neg = 0
        posl = []
        negl = []
        while i<25:
            _line = next(_doc)
            tprint(_line)
            if _line['sentiment'] == 'pos':
                pos = pos + 1
                posl.append(_line['comment'])
            elif _line['sentiment'] == 'neg':
                neg = neg + 1
                negl.append(_line['comment'])
            i=i+1
        print(f'{pos} positive and {neg} negative comments')
        # tst = TextBlob(obj['comment'])
        # tst.sentiment


def process_file(file):
    '''Find Smythers posts'''
    with jsonlines.open(file, mode='r') as reader:
        _doc = iter(reader)
        _list = []
        for item in _doc:
            try:
                if item['author'][0] == 'Smythers':
                    _list.append(item['content'][0])
            except KeyError:
                continue
        return(_list)


def write_file(file, data:object):
    '''Write data to file'''
    with jsonlines.open(file, mode='w') as writer:
        for each in data:
            writer.write(each)
    print('write successful')


def clean_text(text:str):
    s1 = remove_html(text)
    s2 = remove_http(s1)
    return s2


def remove_html(text:str):
    '''Remove html tags from string'''
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)


def remove_http(text:str):
    '''Remove URLs from string'''
    return re.sub(r'http\S+','', text)


def get_nouns(text:str):
    blob = TextBlob(text)
    # check nouns? or no
    return blob.tags


vadoe = '/vadoe/vadoe/vadoe/townhall_2021-01-14T02-05-51.json'
vadoe_p = '/vadoe/vadoe/vadoe/townhall_2021-01-14T05-11-55.json'
dlr = '/vadoe/vadoe/vadoe/dlr.json'
smythers_pc = '/vadoe/vadoe/vadoe/smythers.json'
write_to = '/vadoe/vadoe/vadoe/nouns.json'

# processed_file(file)
smythers_posts = process_file(dlr)
# cleaned = []
# for each in smythers:
#     cleaned.append(clean_text(each))
cleaned = [clean_text(each) for each in smythers_posts]

nouns = []
for x in cleaned:
    _list = get_nouns(x)
    for y in _list:
        nouns.append(y)
# nouns.append(x for x in [get_nouns())

sortedNouns = Counter(nouns)
nouns = []
for k, v in sortedNouns.items():
    if v > 2:
        _d = (k, v)
        nouns.append(_d)
print(nouns)
write_file(write_to, nouns)

45
docs/townhall.py Normal file

@@ -0,0 +1,45 @@
# -*- coding: utf-8 -*-
import scrapy
from items import CommentItem
import textblob
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer


class TownhallSpider(scrapy.Spider):
    name = 'townhall'
    allowed_domains = ['townhall.virginia.gov']
    start_urls = ['https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452']
    custom_settings = {
        'FEED_EXPORTERS' : {
            "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
        },
        'FEED_URI' : '%(name)s_%(time)s.json',
        'FEED_FORMAT': 'jsonlines'
    }

    def parse(self, response):
        rows = response.css('#contentwide>table>tr')
        # cut out the header row
        for each in rows[1:]:
        # for each in rows[1:6]:
            cols = each.xpath('.//td')
            linkfollow = cols[0].css('a::attr(href)').get()
            comment_title = cols[0].xpath('a/text()').get()
            # clean up
            commenter = cols[1].xpath('text()').get()
            # clean up
            date = cols[2].xpath('a/text()').get()
            print(f'{comment_title} | {commenter}')
            yield response.follow(linkfollow, callback = self.parse_comment)

    def parse_comment(self, response):
        entry = CommentItem()
        text = response.css('.divComment>p::text').get()
        text = text.replace(u'\u00a0',' ')
        entry['comment'] = text
        blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
        entry['sentiment'] = blob.sentiment.classification
        entry['sentiment_pos'] = blob.sentiment.p_pos
        entry['sentiment_neg'] = blob.sentiment.p_neg
        # yield CommentItem(comment = response.css('.divComment>p::text').get())
        yield entry

62
docs/townhall2.py Normal file

@@ -0,0 +1,62 @@
# -*- coding: utf-8 -*-
import scrapy
from items import CommentItem
import textblob
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer


class TownhallSpider(scrapy.Spider):
    name = 'townhall'
    allowed_domains = ['townhall.virginia.gov']
    start_urls = ['https://www.townhall.virginia.gov/L/Forums.cfm']
    custom_settings = {
        'FEED_EXPORTERS' : {
            "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
        },
        'FEED_URI' : '%(name)s_%(time)s.json',
        'FEED_FORMAT': 'jsonlines'
    }

    def parse(self, response):
        rows = response.css('table>tr>td')
        for each in rows:
            linkfollow = each.css('a').attrib['href']
            if 'comments' in linkfollow:
                yield response.follow(linkfollow, callback = self.parse_forum)
            cols = each.xpath('.//td')
            linkfollow = cols[0].css('a::attr(href)').get()
            comment_title = cols[0].xpath('a/text()').get()
            # clean up
            commenter = cols[1].xpath('text()').get()
            # clean up
            date = cols[2].xpath('a/text()').get()
            print(f'{comment_title} | {commenter}')
            yield response.follow(linkfollow, callback = self.parse_comment)

    def parse_forum(self, response):
        rows = response.css('#contentwide>table>tr')
        # cut out the header row
        for each in rows[1:]:
        # for each in rows[1:6]:
            cols = each.xpath('.//td')
            linkfollow = cols[0].css('a::attr(href)').get()
            comment_title = cols[0].xpath('a/text()').get()
            # clean up
            commenter = cols[1].xpath('text()').get()
            # clean up
            date = cols[2].xpath('a/text()').get()
            print(f'{comment_title} | {commenter}')
            yield response.follow(linkfollow, callback = self.parse_comment)

    def parse_comment(self, response):
        entry = CommentItem()
        text = response.css('.divComment>p::text').get()
        text = text.replace(u'\u00a0',' ')
        entry['comment'] = text
        blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
        entry['sentiment'] = blob.sentiment.classification
        entry['sentiment_pos'] = blob.sentiment.p_pos
        entry['sentiment_neg'] = blob.sentiment.p_neg
        # yield CommentItem(comment = response.css('.divComment>p::text').get())
        yield entry

View File

@@ -1,12 +1,16 @@
-# Define here the models for your scraped items
-#
-# See documentation in:
-# https://docs.scrapy.org/en/latest/topics/items.html
 import scrapy
-class ScraperItem(scrapy.Item):
-    # define the fields for your item here like:
-    # name = scrapy.Field()
-    pass
+class ForumItem(scrapy.Item):
+    forum_id = scrapy.Field()
+    reg_title = scrapy.Field()
+    reg_desc = scrapy.Field()
+class CommentItem(scrapy.Item):
+    forum_id = scrapy.Field()
+    comment_id = scrapy.Field()
+    author = scrapy.Field()
+    date = scrapy.Field()
+    title = scrapy.Field()
+    text = scrapy.Field()

View File

@@ -15,8 +15,7 @@ NEWSPIDER_MODULE = "scraper.spiders"
 ADDONS = {}
-# Crawl responsibly by identifying yourself (and your website) on the user-agent
-#USER_AGENT = "scraper (+http://www.yourdomain.com)"
+USER_AGENT = "vath-research-scraper/1.0 (public comment analysis; contact: research)"
 # Obey robots.txt rules
 ROBOTSTXT_OBEY = True
@@ -75,13 +74,17 @@ DOWNLOAD_DELAY = 1
 # Enable showing throttling stats for every response received:
 #AUTOTHROTTLE_DEBUG = False
-# Enable and configure HTTP caching (disabled by default)
-# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
-#HTTPCACHE_ENABLED = True
-#HTTPCACHE_EXPIRATION_SECS = 0
-#HTTPCACHE_DIR = "httpcache"
-#HTTPCACHE_IGNORE_HTTP_CODES = []
-#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
+# HTTP cache — enabled during development to avoid re-hitting the server on test runs.
+# Disable (or delete httpcache/) before a production run.
+HTTPCACHE_ENABLED = True
+HTTPCACHE_EXPIRATION_SECS = 86400 # 24 h
+HTTPCACHE_DIR = "httpcache"
+# Output filename is set dynamically by each spider via from_crawler (includes forum_id).
+# The site declares windows-1251 in a meta tag but sends valid UTF-8 bytes.
+# Force UTF-8 to prevent lxml from re-decoding via the meta charset.
+DEFAULT_RESPONSE_ENCODING = "utf-8"
 # Set settings whose default value is deprecated to a future-proof value
 FEED_EXPORT_ENCODING = "utf-8"

136
scraper/spiders/forum.py Normal file

@@ -0,0 +1,136 @@
import re
from datetime import datetime

import scrapy

from scraper.items import CommentItem, ForumItem

_BASE = "https://www.townhall.virginia.gov/L/ViewComments.cfm"
_NBSP = "\xa0"
_REPLACEMENT_CHAR = "\ufffd"


def _view_url(forum_id):
    return f"{_BASE}?GdocForumID={forum_id}"


def _parse_date(raw):
    normalized = " ".join(raw.split()).upper()
    try:
        return datetime.strptime(normalized, "%m/%d/%y %I:%M %p").isoformat()
    except ValueError:
        return raw


class ForumSpider(scrapy.Spider):
    name = "forum"
    allowed_domains = ["townhall.virginia.gov"]

    # Override at runtime: scrapy crawl forum -a forum_id=452
    forum_id = "452"
    per_page = 500

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.settings.set(
            "FEEDS",
            {
                f"output/forum{spider.forum_id}_comments_%(time)s.jsonl": {
                    "format": "jsonlines",
                    "encoding": "utf-8",
                    "overwrite": False,
                }
            },
            priority="spider",
        )
        return spider

    async def start(self):
        yield scrapy.FormRequest(
            _view_url(self.forum_id),
            formdata={"vPage": "1", "vPerPage": str(self.per_page), "sub1": "go"},
            callback=self.parse_comments,
            meta={"is_first": True},
        )

    # ------------------------------------------------------------------
    def parse_comments(self, response):
        if response.meta.get("is_first"):
            reg_title, reg_desc = self._reg_context(response)
            last_page = self._last_page(response)
            yield ForumItem(
                forum_id=self.forum_id,
                reg_title=reg_title,
                reg_desc=reg_desc,
            )
            for page in range(2, last_page + 1):
                yield scrapy.FormRequest(
                    _view_url(self.forum_id),
                    formdata={"vPage": str(page), "vPerPage": str(self.per_page), "sub1": "go"},
                    callback=self.parse_comments,
                )

        for box in response.css("div.Cbox"):
            yield self._parse_box(box)

    # ------------------------------------------------------------------
    def _parse_box(self, box):
        cbox_id = box.attrib.get("id", "")
        comment_id = cbox_id[len("cbox"):] if cbox_id.startswith("cbox") else ""

        date_raw = (
            box.css("div[style*='float: right'] div::text").get("")
            .replace(_NBSP, " ").strip()
        )

        author = (
            box.xpath('.//strong[contains(text(),"Commenter:")]/following-sibling::text()[1]')
            .get("").strip()
        )

        # Second <strong> in the commenter block is the comment title
        strongs = box.css("div > strong::text").getall()
        title = strongs[-1].strip() if len(strongs) > 1 else ""

        paragraphs = box.css(".divComment *::text, .divComment::text").getall()
        text = " ".join(p.strip() for p in paragraphs if p.strip())
        text = text.replace(_NBSP, " ").replace(_REPLACEMENT_CHAR, "'").strip()

        return CommentItem(
            forum_id=self.forum_id,
            comment_id=comment_id,
            author=author,
            date=_parse_date(date_raw),
            title=title,
            text=text,
        )

    # ------------------------------------------------------------------
    def _reg_context(self, response):
        # Page shows: <strong>Guidance Document Change:</strong> description text...
        label_node = response.xpath('//strong[contains(text(),"Change:")]')
        # Collect all sibling text nodes following the label
        siblings = label_node.xpath("following-sibling::text()").getall()
        raw = " ".join(t.strip() for t in siblings if t.strip())
        raw = raw.replace(_NBSP, " ").replace(_REPLACEMENT_CHAR, "'").strip()

        reg_desc = raw
        # reg_title: text up to the first "was " clause or first 200 chars
        m = re.match(r"^(.+?)\s+(?:was |has |guidance document)", raw, re.IGNORECASE)
        reg_title = m.group(1).strip() if m else raw[:200]
        return reg_title, reg_desc

    def _last_page(self, response):
        hrefs = response.xpath(
            '//form[@name="page"]//a[contains(@href,"vpage.value=")]/@href'
        ).getall()
        pages = [
            int(m.group(1))
            for h in hrefs
            if (m := re.search(r"vpage\.value=(\d+)", h))
        ]
        return max(pages) if pages else 1
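
The span-wrapped text issue from the t1.1 notes can be reproduced without Scrapy. The sketch below uses the standard library's `html.parser` rather than the spider's CSS selectors, so it illustrates the idea (direct text children of `<p>` versus all descendant text), not the actual selector engine. `TextCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; they must not stay on the open-tag stack.
_VOID = {"br", "hr", "img", "input", "link", "meta"}


class TextCollector(HTMLParser):
    """Collect text nodes together with the tag that directly encloses each."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # list of (enclosing_tag, text)

    def handle_starttag(self, tag, attrs):
        if tag not in _VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((self.stack[-1] if self.stack else None, text))


html = (
    '<div class="divComment">'
    "<p>Plain paragraph text.</p>"
    "<p><span>Text inside a span element.</span></p>"
    "</div>"
)
collector = TextCollector()
collector.feed(html)

# Analogous to "p::text": only text whose direct parent is <p>.
direct_p_text = [t for tag, t in collector.nodes if tag == "p"]
# Analogous to "*::text": every descendant text node.
all_text = [t for _, t in collector.nodes]

print(direct_p_text)
print(all_text)
```

The first list omits the span-wrapped sentence, which matches the failure mode the notes describe; widening the match to all descendants recovers it.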

230
tests/test_forum_spider.py Normal file

@@ -0,0 +1,230 @@
"""Tests for ForumSpider parsing logic using fake HTML responses."""
import scrapy
from scrapy.http import HtmlResponse, Request

from scraper.items import CommentItem, ForumItem
from scraper.spiders.forum import ForumSpider, _parse_date


def fake_response(url, body, meta=None):
    req = Request(url=url, meta=meta or {})
    return HtmlResponse(url=url, body=body.encode("utf-8"), request=req)

# ---------------------------------------------------------------------------
# Minimal page HTML fragments
PAGE1_HTML = """
<html><body>
<strong>Guidance Document Change:</strong> The Model Policies for the Treatment of Transgender Students
was developed in response to House Bill 145 and Senate Bill 161.
<div style="font-family: verdana;">
<form name="page" id="page" action="ViewComments.cfm?GdocForumID=452" method="post">
<input name="vPage" id="vpage" type="input" value="1">
<input name="vPerPage" id="vPerPage" type="input" value="500">
<a href="javascript:document.page.vpage.value=3;document.page.submit();">3</a>
<a href="javascript:document.page.vpage.value=2;document.page.submit();">Next</a>
<input type="submit" name="sub1" value="go">
</form>
</div>
<div id="cbox101" class="Cbox">
<div style="float: right; text-align: right;">
<div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/4/21&nbsp;&nbsp;9:15 am</div>
</div>
<div>
<strong>Commenter:</strong>
Alice Example
<br><br>
<strong>I strongly support this</strong>
</div>
<div style="clear: right">&nbsp;</div>
<div class="divComment">
<p>This is a great policy for students.</p>
<p>All schools should follow it.</p>
</div>
<div style="float: left; font-size: 90%;">
CommentID: <a href="ViewComments.cfm?commentid=101">101</a>
</div>
</div>
<div id="cbox102" class="Cbox">
<div style="float: right; text-align: right;">
<div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/5/21&nbsp;&nbsp;10:00 am</div>
</div>
<div>
<strong>Commenter:</strong>
Bob Sample
<br><br>
<strong>Opposed</strong>
</div>
<div style="clear: right">&nbsp;</div>
<div class="divComment">
<p>I do not support this guidance.</p>
</div>
<div style="float: left; font-size: 90%;">
CommentID: <a href="ViewComments.cfm?commentid=102">102</a>
</div>
</div>
</body></html>
"""
PAGE2_HTML = """
<html><body>
<div id="cbox201" class="Cbox">
<div style="float: right; text-align: right;">
<div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/6/21&nbsp;&nbsp;11:00 am</div>
</div>
<div>
<strong>Commenter:</strong>
Carol T
<br><br>
<strong>Support</strong>
</div>
<div style="clear: right">&nbsp;</div>
<div class="divComment">
<p>This policy is long overdue.</p>
</div>
</div>
</body></html>
"""


def make_spider():
    return ForumSpider()


# ---------------------------------------------------------------------------
def test_page1_generates_remaining_page_requests():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    form_reqs = [r for r in results if isinstance(r, scrapy.FormRequest)]
    # Pages 2 and 3 should be requested (last page link = 3)
    assert len(form_reqs) == 2
    pages = sorted(r.body.decode() for r in form_reqs)
    assert "vPage=2" in pages[0]
    assert "vPage=3" in pages[1]


def test_page1_yields_items():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    items = [r for r in results if isinstance(r, CommentItem)]
    assert len(items) == 2


def test_page1_yields_forum_item():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    forum_items = [r for r in results if isinstance(r, ForumItem)]
    assert len(forum_items) == 1
    fi = forum_items[0]
    assert "Transgender Students" in fi["reg_title"]
    assert "House Bill 145" in fi["reg_desc"]
    assert fi["forum_id"] == "452"


def test_comment_fields_parsed_correctly():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    item = items[0]
    assert item["comment_id"] == "101"
    assert item["author"] == "Alice Example"
    assert item["title"] == "I strongly support this"
    assert "great policy" in item["text"]
    assert "All schools" in item["text"]  # multi-paragraph joined
    assert "reg_title" not in item
    assert "reg_desc" not in item


def test_subsequent_page_yields_comments():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE2_HTML,
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    assert len(items) == 1
    assert items[0]["author"] == "Carol T"


def test_last_page_detection():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    assert spider._last_page(response) == 3


def test_date_parsed_to_iso():
    assert _parse_date("1/4/21 9:15 am") == "2021-01-04T09:15:00"
    assert _parse_date("1/5/21 10:00 am") == "2021-01-05T10:00:00"
    assert _parse_date("unparseable") == "unparseable"


SPAN_WRAPPED_HTML = """
<html><body>
<strong>Guidance Document Change:</strong> Some regulation was developed.
<form name="page" id="page" action="ViewComments.cfm?GdocForumID=452" method="post">
<input name="vPage" value="1"><input name="vPerPage" value="500">
<a href="javascript:document.page.vpage.value=1;document.page.submit();">1</a>
<input type="submit" name="sub1" value="go">
</form>
<div id="cbox301" class="Cbox">
<div style="float: right; text-align: right;">
<div style="background-color: white; border: 1px solid #cccccc; padding: 4px">2/1/21&nbsp;&nbsp;8:00 am</div>
</div>
<div>
<strong>Commenter:</strong>
Dan Span
<br><br>
<strong>Opposed</strong>
</div>
<div style="clear: right">&nbsp;</div>
<div class="divComment">
<!DOCTYPE html><html><head></head><body>
<p style="margin: 0in;"><span style="font-size: 10.5pt;">Text inside a span element.</span></p>
</body></html>
</div>
</div>
</body></html>
"""


def test_span_wrapped_text_is_extracted():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        SPAN_WRAPPED_HTML,
        meta={"is_first": True},
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    assert len(items) == 1
    assert "Text inside a span element" in items[0]["text"]