Compare commits
4 Commits
02964312cb
...
314f8d2621
| Author | SHA1 | Date |
|---|---|---|
| | 314f8d2621 | |
| | e7df0b24a1 | |
| | 951cc11a14 | |
| | beb5cf461b | |
4
.gitignore
vendored
@@ -22,5 +22,9 @@ env/
 archive/


+# --- scrapy ---
+.scrapy/
+output/
+
 # --- misc ---
 .DS_Store
@@ -5,14 +5,14 @@
 - prefer minimal diffs; avoid refactors unless required for the active task

 ## tech stack
-- python; scrapy
+- python; scrapy, pytest
 - file storage: json or csv
 - assume local virtual env is available and accessible
 - do not add new dependencies unless explicitly approved; if unavoidable, document justification in the active task notes

 ## workflow
-- prefer direct argv commands (no bash -lc / compound shell chains) unless necessary
-- work on ONE task at a time unless explicitly instructed otherwise
+- prefer direct commands
+- work on ONE task at a time unless explicitly instructed otherwise:
 - at the start of work, state the task id you are executing
 - do not start work unless a task id is specified; if missing, choose the earliest unchecked task and say so
 - propose incremental steps
@@ -22,7 +22,7 @@
|
|||||||
- fill in evidence with commit hash + commands run
|
- fill in evidence with commit hash + commands run
|
||||||
- never mark complete unless acceptance criteria are met
|
- never mark complete unless acceptance criteria are met
|
||||||
- include date and time (HH:MM)
|
- include date and time (HH:MM)
|
||||||
|
- follow this format:
|
||||||
```
|
```
|
||||||
* [ ] t1.1 Task Title (1)
|
* [ ] t1.1 Task Title (1)
|
||||||
Description and PM notes
|
Description and PM notes
|
||||||
|
|||||||
@@ -1,18 +1,37 @@
-* [ ] t1.1: scrape one forum (1)
+* [X] t1.1: scrape one forum (1)
 Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step.
+ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be raw list of all comments on forum - could be useful later for whole-scrape
+Append forum id to viewall per forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452)
+Comments are hydrated in backend via js-cued button (AJAX?).
 ** acceptance criteria
 1. run manual scraper
 1. store proposal title and description
 2. store comment title, commenter, date
 3. store relevant metadata
 2. friendly/polite scraping
+3. store forum as distinct item with title, desc
+4. add forum ID in comment filename, eg forum452_comments_<datetime>.jsonl
+5. remove reg_title and reg_desc from each comment; these belong in forum item
+6. parse datetimes into object for later use (plotting)

 ** notes
+- scraper/spiders/forum.py — ForumSpider using ViewComments.cfm?GdocForumID=N with POST pagination. First request fetches page 1 (vPerPage=500), discovers the last page number from the form's link, generates all remaining page requests upfront. Parses each div.Cbox for all required fields.
+- scraper/items.py — CommentItem with forum_id, reg_title, reg_desc, comment_id, author, date, title, text
+- tests/test_forum_spider.py — 7 tests, all passing
+- Settings: DEFAULT_RESPONSE_ENCODING=utf-8 (fixes Windows-1251 meta-tag mismatch), HTTPCACHE_ENABLED=True, feed output to output/
+- ViewComments.cfm instead of comments.cfm: POST to Comments.cfm returned a 500 error (wrong endpoint). ViewComments.cfm?GdocForumID=N is the correct listing URL, returns full comment text on the page itself — no per-comment follow requests needed.
+- Span-wrapped text: .divComment p::text missed 3.6% of comments where text is in <p><span>text</span></p>. Fixed to .divComment *::text, .divComment::text. Worth knowing for when the spider is extended to other forums.
+- start() vs start_requests(): Scrapy 2.13+ deprecates start_requests() in favor of async def start()
+- ForumItem vs CommentItem: ForumItem (forum_id, reg_title, reg_desc) yielded once on first page; CommentItem no longer carries reg_title/reg_desc. Both land in the same JSONL feed.
+- Dynamic output filename: set via from_crawler() overriding FEEDS at 'spider' priority — format is output/forum{id}_comments_%(time)s.jsonl. FEEDS removed from settings.py; spider owns it.
+- Date parsing: _parse_date() normalizes whitespace, upper-cases, parses "%m/%d/%y %I:%M %p" → ISO 8601; falls back to raw string on failure.

 ** evidence
-- commit:
-- tests:
-- datetime:
+- commit: beb5cf4 (AC1-2), <commit> (AC3-6)
+- tests: 8 passing (`python -m pytest tests -q`) or (`python -m pytest tests/`)
+- `scrapy crawl forum -a forum_id=452 -s LOG_LEVEL=WARNING 2>&1`
+- retrieved 9083 comments
+- datetime: 2026-05-05

 * [ ] t1.2: initial analysis pipeline
 Write a simple pipeline for both - prefer non-concurrent/async from scraping run. Should be run manually, separate from scraper. You may use scrapy, but are not required to.
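The pagination approach described in the t1.1 notes (read the last page number from the pager links on page 1, then queue every remaining page up front) can be sketched without Scrapy. This is a minimal illustration rather than the spider's own code: `last_page` and `page_form_data` are hypothetical helper names, and the field names `vPage`, `vPerPage`, and `sub1` are taken from the form that appears in this diff.

```python
import re

# Pager links look like: javascript:document.page.vpage.value=3;document.page.submit();
_PAGE_RE = re.compile(r"vpage\.value=(\d+)")


def last_page(hrefs):
    """Return the highest page number found in the pager links, or 1 if none."""
    pages = [int(m.group(1)) for h in hrefs if (m := _PAGE_RE.search(h))]
    return max(pages) if pages else 1


def page_form_data(hrefs, per_page=500):
    """Build one POST payload per remaining page (page 1 is already fetched)."""
    return [
        {"vPage": str(page), "vPerPage": str(per_page), "sub1": "go"}
        for page in range(2, last_page(hrefs) + 1)
    ]


hrefs = [
    "javascript:document.page.vpage.value=3;document.page.submit();",
    "javascript:document.page.vpage.value=2;document.page.submit();",
]
payloads = page_form_data(hrefs)
print([p["vPage"] for p in payloads])
```

Generating all requests from page 1 keeps the crawl independent of the "Next" link, at the cost of trusting the pager to list the true last page.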
@@ -26,3 +45,10 @@ Write a simple pipeline for both - prefer non-concurrent/async from scraping run
 - commit:
 - tests:
 - date:
+
+* [ ] X: complete proposal information
+Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
+** acceptance criteria
+1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
+- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
+2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
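Because the t1.1 notes say that forum and comment items land in the same JSONL feed, any later step has to tell the two record types apart. A rough sketch follows, assuming a record is a comment exactly when it carries a `comment_id` key (an assumption based on the item fields in this diff, not a rule stated in the repository). `split_feed` and `comments_per_day` are hypothetical helpers and the sample records are invented.

```python
import json
from collections import Counter
from datetime import datetime


def split_feed(lines):
    """Separate forum records from comment records in a mixed JSONL feed."""
    forums, comments = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: only comment records carry a comment_id key.
        (comments if "comment_id" in record else forums).append(record)
    return forums, comments


def comments_per_day(comments):
    """Count comments per calendar day, skipping dates that are not ISO 8601."""
    counts = Counter()
    for c in comments:
        try:
            counts[datetime.fromisoformat(c["date"]).date()] += 1
        except (KeyError, TypeError, ValueError):
            continue  # missing date, or the raw fallback string
    return counts


sample = [
    '{"forum_id": "452", "reg_title": "Example title", "reg_desc": "Example description"}',
    '{"forum_id": "452", "comment_id": "101", "date": "2021-01-04T09:15:00"}',
    '{"forum_id": "452", "comment_id": "102", "date": "2021-01-04T10:00:00"}',
    '{"forum_id": "452", "comment_id": "103", "date": "unparseable"}',
]
forums, comments = split_feed(sample)
print(len(forums), len(comments))
print(comments_per_day(comments))
```

If the planned `Forum` and `Comment` items gain an explicit type field, that would be a sturdier discriminator than key presence.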
105
docs/tb.py
Normal file
@@ -0,0 +1,105 @@
import jsonlines
import re
from textblob import TextBlob
from collections import Counter

def tprint(obj):
    print(f"{type(obj)} : {obj}")


def sort_file(file):
    '''return number of positive and negative comments based on TextBlob sentiment analysis'''
    # with jsonlines.open("/vadoe/vadoe/vadoe/townhall_2021-01-14T02-05-51.json") as reader:
    with jsonlines.open(file, mode='r') as reader:
        # Confirm type
        tprint(reader)

        # Build iterator
        _doc = iter(reader)
        i = 0
        pos = 0
        neg = 0
        posl = []
        negl = []

        while i<25:
            _line = next(_doc)
            tprint(_line)
            if _line['sentiment'] == 'pos':
                pos = pos + 1
                posl.append(_line['comment'])
            elif _line['sentiment'] == 'neg':
                neg = neg + 1
                negl.append(_line['comment'])
            i=i+1

        print(f'{pos} positive and {neg} negative comments')
        # tst = TextBlob(obj['comment'])
        # tst.sentiment

def process_file(file):
    '''Find Smythers posts'''
    with jsonlines.open(file, mode='r') as reader:
        _doc = iter(reader)
        _list = []
        for item in _doc:
            try:
                if item['author'][0] == 'Smythers':
                    _list.append(item['content'][0])
            except KeyError:
                continue
    return(_list)

def write_file(file, data:object):
    '''Write data to file'''
    with jsonlines.open(file, mode='w') as writer:
        for each in data:
            writer.write(each)
    print('write successful')

def clean_text(text:str):
    s1 = remove_html(text)
    s2 = remove_http(s1)
    return s2

def remove_html(text:str):
    '''Remove html tags from string'''
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

def remove_http(text:str):
    '''Remove URLs from string'''
    return re.sub(r'http\S+','', text)

def get_nouns(text:str):
    blob = TextBlob(text)
    # check nouns? or no
    return blob.tags

vadoe = '/vadoe/vadoe/vadoe/townhall_2021-01-14T02-05-51.json'
vadoe_p = '/vadoe/vadoe/vadoe/townhall_2021-01-14T05-11-55.json'
dlr = '/vadoe/vadoe/vadoe/dlr.json'

smythers_pc = '/vadoe/vadoe/vadoe/smythers.json'
write_to = '/vadoe/vadoe/vadoe/nouns.json'

# processed_file(file)
smythers_posts = process_file(dlr)
# cleaned = []
# for each in smythers:
# cleaned.append(clean_text(each))
cleaned = [clean_text(each) for each in smythers_posts]
nouns = []
for x in cleaned:
    _list = get_nouns(x)
    for y in _list:
        nouns.append(y)
# nouns.append(x for x in [get_nouns())
sortedNouns = Counter(nouns)
nouns = []
for k, v in sortedNouns.items():
    if v > 2:
        _d = (k, v)
        nouns.append(_d)
print(nouns)
write_file(write_to, nouns)
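`sort_file` in this script advances the reader with `next()` inside a fixed 25-iteration loop, which raises `StopIteration` when the file holds fewer than 25 records. One way around that, sketched here with plain dictionaries standing in for the JSONL records, is `itertools.islice` combined with a `Counter`. `count_sentiment` is a hypothetical helper; the `sentiment` key and its `pos`/`neg` values are taken from the script.

```python
from collections import Counter
from itertools import islice


def count_sentiment(records, limit=25):
    """Tally sentiment labels over at most `limit` records.

    islice stops quietly when the input runs out, so short files are fine.
    """
    return Counter(r.get("sentiment") for r in islice(records, limit))


records = [
    {"sentiment": "pos", "comment": "a"},
    {"sentiment": "neg", "comment": "b"},
    {"sentiment": "pos", "comment": "c"},
]
counts = count_sentiment(records)
print(f"{counts['pos']} positive and {counts['neg']} negative comments")
```

The same function accepts a `jsonlines` reader directly, since it only needs an iterable of dictionaries.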
45
docs/townhall.py
Normal file
@@ -0,0 +1,45 @@
# -*- coding: utf-8 -*-
import scrapy
from items import CommentItem
import textblob
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

class TownhallSpider(scrapy.Spider):
    name = 'townhall'
    allowed_domains = ['townhall.virginia.gov']
    start_urls = ['https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452']
    custom_settings = {
        'FEED_EXPORTERS' : {
            "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
        },
        'FEED_URI' : '%(name)s_%(time)s.json',
        'FEED_FORMAT': 'jsonlines'
    }

    def parse(self, response):
        rows = response.css('#contentwide>table>tr')
        # cut out the header row
        for each in rows[1:]:
        # for each in rows[1:6]:
            cols = each.xpath('.//td')
            linkfollow = cols[0].css('a::attr(href)').get()
            comment_title = cols[0].xpath('a/text()').get()
            # clean up
            commenter = cols[1].xpath('text()').get()
            # clean up
            date = cols[2].xpath('a/text()').get()
            print(f'{comment_title} | {commenter}')
            yield response.follow(linkfollow, callback = self.parse_comment)

    def parse_comment(self, response):
        entry = CommentItem()
        text = response.css('.divComment>p::text').get()
        text = text.replace(u'\u00a0',' ')
        entry['comment'] = text
        blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
        entry['sentiment'] = blob.sentiment.classification
        entry['sentiment_pos'] = blob.sentiment.p_pos
        entry['sentiment_neg'] = blob.sentiment.p_neg
        # yield CommentItem(comment = response.css('.divComment>p::text').get())
        yield entry
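In `parse_comment`, `.get()` returns `None` when the selector matches nothing, so the `text.replace(...)` call that follows would raise `AttributeError` on such a page. A small guard avoids that. It is shown here as a hypothetical `normalize_text` helper that does not depend on Scrapy; the sample strings are invented.

```python
from typing import Optional

NBSP = "\u00a0"


def normalize_text(text: Optional[str]) -> str:
    """Replace non-breaking spaces and trim; treat a missing value as empty."""
    if text is None:
        return ""
    return text.replace(NBSP, " ").strip()


print(repr(normalize_text("I\u00a0support this ")))
print(repr(normalize_text(None)))
```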
62
docs/townhall2.py
Normal file
@@ -0,0 +1,62 @@
# -*- coding: utf-8 -*-
import scrapy
from items import CommentItem
import textblob
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

class TownhallSpider(scrapy.Spider):
    name = 'townhall'
    allowed_domains = ['townhall.virginia.gov']
    start_urls = ['https://www.townhall.virginia.gov/L/Forums.cfm']
    custom_settings = {
        'FEED_EXPORTERS' : {
            "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
        },
        'FEED_URI' : '%(name)s_%(time)s.json',
        'FEED_FORMAT': 'jsonlines'
    }

    def parse(self, response):
        rows = response.css('table>tr>td')
        for each in rows:
            linkfollow = each.css('a').attrib['href']
            if 'comments' in linkfollow:
                yield response.follow(linkfollow, callback = self.parse_forum)

            cols = each.xpath('.//td')
            linkfollow = cols[0].css('a::attr(href)').get()
            comment_title = cols[0].xpath('a/text()').get()
            # clean up
            commenter = cols[1].xpath('text()').get()
            # clean up
            date = cols[2].xpath('a/text()').get()
            print(f'{comment_title} | {commenter}')
            yield response.follow(linkfollow, callback = self.parse_comment)

    def parse_forum(self, response):
        rows = response.css('#contentwide>table>tr')
        # cut out the header row
        for each in rows[1:]:
        # for each in rows[1:6]:
            cols = each.xpath('.//td')
            linkfollow = cols[0].css('a::attr(href)').get()
            comment_title = cols[0].xpath('a/text()').get()
            # clean up
            commenter = cols[1].xpath('text()').get()
            # clean up
            date = cols[2].xpath('a/text()').get()
            print(f'{comment_title} | {commenter}')
            yield response.follow(linkfollow, callback = self.parse_comment)

    def parse_comment(self, response):
        entry = CommentItem()
        text = response.css('.divComment>p::text').get()
        text = text.replace(u'\u00a0',' ')
        entry['comment'] = text
        blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
        entry['sentiment'] = blob.sentiment.classification
        entry['sentiment_pos'] = blob.sentiment.p_pos
        entry['sentiment_neg'] = blob.sentiment.p_neg
        # yield CommentItem(comment = response.css('.divComment>p::text').get())
        yield entry
@@ -1,12 +1,16 @@
-# Define here the models for your scraped items
-#
-# See documentation in:
-# https://docs.scrapy.org/en/latest/topics/items.html
-
 import scrapy


-class ScraperItem(scrapy.Item):
-    # define the fields for your item here like:
-    # name = scrapy.Field()
-    pass
+class ForumItem(scrapy.Item):
+    forum_id = scrapy.Field()
+    reg_title = scrapy.Field()
+    reg_desc = scrapy.Field()
+
+
+class CommentItem(scrapy.Item):
+    forum_id = scrapy.Field()
+    comment_id = scrapy.Field()
+    author = scrapy.Field()
+    date = scrapy.Field()
+    title = scrapy.Field()
+    text = scrapy.Field()
@@ -15,8 +15,7 @@ NEWSPIDER_MODULE = "scraper.spiders"
 ADDONS = {}


-# Crawl responsibly by identifying yourself (and your website) on the user-agent
-#USER_AGENT = "scraper (+http://www.yourdomain.com)"
+USER_AGENT = "vath-research-scraper/1.0 (public comment analysis; contact: research)"

 # Obey robots.txt rules
 ROBOTSTXT_OBEY = True
@@ -75,13 +74,17 @@ DOWNLOAD_DELAY = 1
 # Enable showing throttling stats for every response received:
 #AUTOTHROTTLE_DEBUG = False

-# Enable and configure HTTP caching (disabled by default)
-# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
-#HTTPCACHE_ENABLED = True
-#HTTPCACHE_EXPIRATION_SECS = 0
-#HTTPCACHE_DIR = "httpcache"
-#HTTPCACHE_IGNORE_HTTP_CODES = []
-#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
+# HTTP cache — enabled during development to avoid re-hitting the server on test runs.
+# Disable (or delete httpcache/) before a production run.
+HTTPCACHE_ENABLED = True
+HTTPCACHE_EXPIRATION_SECS = 86400 # 24 h
+HTTPCACHE_DIR = "httpcache"
+
+# Output filename is set dynamically by each spider via from_crawler (includes forum_id).
+
+# The site declares windows-1251 in a meta tag but sends valid UTF-8 bytes.
+# Force UTF-8 to prevent lxml from re-decoding via the meta charset.
+DEFAULT_RESPONSE_ENCODING = "utf-8"

 # Set settings whose default value is deprecated to a future-proof value
 FEED_EXPORT_ENCODING = "utf-8"
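The comment on `DEFAULT_RESPONSE_ENCODING` describes UTF-8 bytes being read under a windows-1251 label. The effect is easy to reproduce with the standard library alone. The sample string below is made up for illustration, and `cp1251` is Python's codec name for windows-1251.

```python
# UTF-8 bytes decoded with the wrong codec turn into mojibake.
original = "students\u2019 rights"       # contains a right single quotation mark
utf8_bytes = original.encode("utf-8")

wrong = utf8_bytes.decode("cp1251")      # what trusting the windows-1251 label would do
right = utf8_bytes.decode("utf-8")       # what forcing UTF-8 gives

print(wrong)
print(right)
print(right == original)
```

Only the non-ASCII character is damaged, which is why this kind of mismatch tends to show up as stray symbols around apostrophes and quotes rather than as wholly unreadable text.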
136
scraper/spiders/forum.py
Normal file
@@ -0,0 +1,136 @@
import re
from datetime import datetime

import scrapy

from scraper.items import CommentItem, ForumItem

_BASE = "https://www.townhall.virginia.gov/L/ViewComments.cfm"
_NBSP = "\xa0"
_REPLACEMENT_CHAR = "\ufffd"


def _view_url(forum_id):
    return f"{_BASE}?GdocForumID={forum_id}"


def _parse_date(raw):
    normalized = " ".join(raw.split()).upper()
    try:
        return datetime.strptime(normalized, "%m/%d/%y %I:%M %p").isoformat()
    except ValueError:
        return raw


class ForumSpider(scrapy.Spider):
    name = "forum"
    allowed_domains = ["townhall.virginia.gov"]

    # Override at runtime: scrapy crawl forum -a forum_id=452
    forum_id = "452"
    per_page = 500

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.settings.set(
            "FEEDS",
            {
                f"output/forum{spider.forum_id}_comments_%(time)s.jsonl": {
                    "format": "jsonlines",
                    "encoding": "utf-8",
                    "overwrite": False,
                }
            },
            priority="spider",
        )
        return spider

    async def start(self):
        yield scrapy.FormRequest(
            _view_url(self.forum_id),
            formdata={"vPage": "1", "vPerPage": str(self.per_page), "sub1": "go"},
            callback=self.parse_comments,
            meta={"is_first": True},
        )

    # ------------------------------------------------------------------
    def parse_comments(self, response):
        if response.meta.get("is_first"):
            reg_title, reg_desc = self._reg_context(response)
            last_page = self._last_page(response)
            yield ForumItem(
                forum_id=self.forum_id,
                reg_title=reg_title,
                reg_desc=reg_desc,
            )
            for page in range(2, last_page + 1):
                yield scrapy.FormRequest(
                    _view_url(self.forum_id),
                    formdata={"vPage": str(page), "vPerPage": str(self.per_page), "sub1": "go"},
                    callback=self.parse_comments,
                )

        for box in response.css("div.Cbox"):
            yield self._parse_box(box)

    # ------------------------------------------------------------------
    def _parse_box(self, box):
        cbox_id = box.attrib.get("id", "")
        comment_id = cbox_id[len("cbox"):] if cbox_id.startswith("cbox") else ""

        date_raw = (
            box.css("div[style*='float: right'] div::text").get("")
            .replace(_NBSP, " ").strip()
        )

        author = (
            box.xpath('.//strong[contains(text(),"Commenter:")]/following-sibling::text()[1]')
            .get("").strip()
        )

        # Second <strong> in the commenter block is the comment title
        strongs = box.css("div > strong::text").getall()
        title = strongs[-1].strip() if len(strongs) > 1 else ""

        paragraphs = box.css(".divComment *::text, .divComment::text").getall()
        text = " ".join(p.strip() for p in paragraphs if p.strip())
        text = text.replace(_NBSP, " ").replace(_REPLACEMENT_CHAR, "'").strip()

        return CommentItem(
            forum_id=self.forum_id,
            comment_id=comment_id,
            author=author,
            date=_parse_date(date_raw),
            title=title,
            text=text,
        )

    # ------------------------------------------------------------------
    def _reg_context(self, response):
        # Page shows: <strong>Guidance Document Change:</strong> description text...
        label_node = response.xpath('//strong[contains(text(),"Change:")]')

        # Collect all sibling text nodes following the label
        siblings = label_node.xpath("following-sibling::text()").getall()
        raw = " ".join(t.strip() for t in siblings if t.strip())
        raw = raw.replace(_NBSP, " ").replace(_REPLACEMENT_CHAR, "'").strip()

        reg_desc = raw

        # reg_title: text up to the first "was " clause or first 200 chars
        m = re.match(r"^(.+?)\s+(?:was |has |guidance document)", raw, re.IGNORECASE)
        reg_title = m.group(1).strip() if m else raw[:200]

        return reg_title, reg_desc

    def _last_page(self, response):
        hrefs = response.xpath(
            '//form[@name="page"]//a[contains(@href,"vpage.value=")]/@href'
        ).getall()
        pages = [
            int(m.group(1))
            for h in hrefs
            if (m := re.search(r"vpage\.value=(\d+)", h))
        ]
        return max(pages) if pages else 1
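For reference, the date handling in `_parse_date` can be exercised on its own. The copy below mirrors the function in this file so the snippet runs without Scrapy; `parse_date` is a local stand-in name, and the late-evening and unparseable inputs are made-up examples.

```python
from datetime import datetime


def parse_date(raw):
    """Standalone copy of the spider's _parse_date, for illustration only."""
    normalized = " ".join(raw.split()).upper()
    try:
        return datetime.strptime(normalized, "%m/%d/%y %I:%M %p").isoformat()
    except ValueError:
        return raw  # keep the original string when the format does not match


print(parse_date("1/4/21 9:15 am"))
print(parse_date("12/31/20   11:59 pm"))
print(parse_date("unparseable"))
```

The function returns an ISO 8601 string rather than a `datetime` object; `datetime.fromisoformat` turns it back into an object when one is needed, for example for plotting.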
230
tests/test_forum_spider.py
Normal file
@@ -0,0 +1,230 @@
"""Tests for ForumSpider parsing logic using fake HTML responses."""

import scrapy
from scrapy.http import HtmlResponse, Request

from scraper.items import CommentItem, ForumItem
from scraper.spiders.forum import ForumSpider, _parse_date


def fake_response(url, body, meta=None):
    req = Request(url=url, meta=meta or {})
    return HtmlResponse(url=url, body=body.encode("utf-8"), request=req)


# ---------------------------------------------------------------------------
# Minimal page HTML fragments

PAGE1_HTML = """
<html><body>
<strong>Guidance Document Change:</strong> The Model Policies for the Treatment of Transgender Students
was developed in response to House Bill 145 and Senate Bill 161.

<div style="font-family: verdana;">
<form name="page" id="page" action="ViewComments.cfm?GdocForumID=452" method="post">
<input name="vPage" id="vpage" type="input" value="1">
<input name="vPerPage" id="vPerPage" type="input" value="500">
<a href="javascript:document.page.vpage.value=3;document.page.submit();">3</a>
<a href="javascript:document.page.vpage.value=2;document.page.submit();">Next</a>
<input type="submit" name="sub1" value="go">
</form>
</div>

<div id="cbox101" class="Cbox">
<div style="float: right; text-align: right;">
<div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/4/21 9:15 am</div>
</div>
<div>
<strong>Commenter:</strong>
Alice Example
<br><br>
<strong>I strongly support this</strong>
</div>
<div style="clear: right"> </div>
<div class="divComment">
<p>This is a great policy for students.</p>
<p>All schools should follow it.</p>
</div>
<div style="float: left; font-size: 90%;">
CommentID: <a href="ViewComments.cfm?commentid=101">101</a>
</div>
</div>

<div id="cbox102" class="Cbox">
<div style="float: right; text-align: right;">
<div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/5/21 10:00 am</div>
</div>
<div>
<strong>Commenter:</strong>
Bob Sample
<br><br>
<strong>Opposed</strong>
</div>
<div style="clear: right"> </div>
<div class="divComment">
<p>I do not support this guidance.</p>
</div>
<div style="float: left; font-size: 90%;">
CommentID: <a href="ViewComments.cfm?commentid=102">102</a>
</div>
</div>
</body></html>
"""

PAGE2_HTML = """
<html><body>
<div id="cbox201" class="Cbox">
<div style="float: right; text-align: right;">
<div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/6/21 11:00 am</div>
</div>
<div>
<strong>Commenter:</strong>
Carol T
<br><br>
<strong>Support</strong>
</div>
<div style="clear: right"> </div>
<div class="divComment">
<p>This policy is long overdue.</p>
</div>
</div>
</body></html>
"""


def make_spider():
    return ForumSpider()


# ---------------------------------------------------------------------------

def test_page1_generates_remaining_page_requests():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    form_reqs = [r for r in results if isinstance(r, scrapy.FormRequest)]
    # Pages 2 and 3 should be requested (last page link = 3)
    assert len(form_reqs) == 2
    pages = sorted(r.body.decode() for r in form_reqs)
    assert "vPage=2" in pages[0]
    assert "vPage=3" in pages[1]


def test_page1_yields_items():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    items = [r for r in results if isinstance(r, CommentItem)]
    assert len(items) == 2


def test_page1_yields_forum_item():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    forum_items = [r for r in results if isinstance(r, ForumItem)]
    assert len(forum_items) == 1
    fi = forum_items[0]
    assert "Transgender Students" in fi["reg_title"]
    assert "House Bill 145" in fi["reg_desc"]
    assert fi["forum_id"] == "452"


def test_comment_fields_parsed_correctly():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    item = items[0]
    assert item["comment_id"] == "101"
    assert item["author"] == "Alice Example"
    assert item["title"] == "I strongly support this"
    assert "great policy" in item["text"]
    assert "All schools" in item["text"] # multi-paragraph joined
    assert "reg_title" not in item
    assert "reg_desc" not in item


def test_subsequent_page_yields_comments():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE2_HTML,
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    assert len(items) == 1
    assert items[0]["author"] == "Carol T"


def test_last_page_detection():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    assert spider._last_page(response) == 3


def test_date_parsed_to_iso():
    assert _parse_date("1/4/21 9:15 am") == "2021-01-04T09:15:00"
    assert _parse_date("1/5/21 10:00 am") == "2021-01-05T10:00:00"
    assert _parse_date("unparseable") == "unparseable"


SPAN_WRAPPED_HTML = """
<html><body>
<strong>Guidance Document Change:</strong> Some regulation was developed.

<form name="page" id="page" action="ViewComments.cfm?GdocForumID=452" method="post">
<input name="vPage" value="1"><input name="vPerPage" value="500">
<a href="javascript:document.page.vpage.value=1;document.page.submit();">1</a>
<input type="submit" name="sub1" value="go">
</form>

<div id="cbox301" class="Cbox">
<div style="float: right; text-align: right;">
<div style="background-color: white; border: 1px solid #cccccc; padding: 4px">2/1/21 8:00 am</div>
</div>
<div>
<strong>Commenter:</strong>
Dan Span
<br><br>
<strong>Opposed</strong>
</div>
<div style="clear: right"> </div>
<div class="divComment">
<!DOCTYPE html><html><head></head><body>
<p style="margin: 0in;"><span style="font-size: 10.5pt;">Text inside a span element.</span></p>
</body></html>
</div>
</div>
</body></html>
"""


def test_span_wrapped_text_is_extracted():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        SPAN_WRAPPED_HTML,
        meta={"is_first": True},
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    assert len(items) == 1
    assert "Text inside a span element" in items[0]["text"]
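`test_page1_yields_forum_item` depends on the title heuristic in `_reg_context`. The same regular expression can be tried in isolation, which makes it easier to see what other forums' descriptions would produce. `split_title` is a hypothetical name, and the second input is invented to show the 200-character fallback.

```python
import re

_TITLE_RE = re.compile(r"^(.+?)\s+(?:was |has |guidance document)", re.IGNORECASE)


def split_title(description):
    """Return the text before the first 'was'/'has'/'guidance document' clause.

    Falls back to the first 200 characters when no such clause is found.
    """
    m = _TITLE_RE.match(description)
    return m.group(1).strip() if m else description[:200]


desc = (
    "The Model Policies for the Treatment of Transgender Students "
    "was developed in response to House Bill 145 and Senate Bill 161."
)
print(split_title(desc))
print(split_title("No matching clause here"))
```

Because the match is lazy, the title stops at the first qualifying word, so a title that itself contains "has" or "was" would be cut short.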