updated instructions and added artifacts

1.1 cleanup
t1.1: record commit hash in evidence
2026-05-05 13:50:27 -04:00 · 2026-05-05 13:50:04 -04:00 · 2026-05-05 12:28:15 -04:00 · 2026-05-05 12:28:07 -04:00
10 changed files with 651 additions and 36 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -22,5 +22,9 @@ env/
 archive/


+# --- scrapy ---
+.scrapy/
+output/
+
 # --- misc ---
 .DS_Store
--- a/agents.md
+++ b/agents.md
@@ -5,24 +5,24 @@
 - prefer minimal diffs; avoid refactors unless required for the active task

 ## tech stack
- python; scrapy
+- python; scrapy, pytest
 - file storage: json or csv
 - assume local virtual env is available and accessible
 - do not add new dependencies unless explicitly approved; if unavoidable, document justification in the active task notes

 ## workflow
- prefer direct argv commands (no bash -lc / compound shell chains) unless necessary
- work on ONE task at a time unless explicitly instructed otherwise
- at the start of work, state the task id you are executing
- do not start work unless a task id is specified; if missing, choose the earliest unchecked task and say so
- propose incremental steps
- always include basic tests for core logic
- when you complete a task:
+- prefer direct commands
+- work on ONE task at a time unless explicitly instructed otherwise:
+  - at the start of work, state the task id you are executing
+  - do not start work unless a task id is specified; if missing, choose the earliest unchecked task and say so
+  - propose incremental steps
+  - always include basic tests for core logic
+  - when you complete a task:
 	- mark it [X] in docs/tasks.md
    - fill in evidence with commit hash + commands run
 	- never mark complete unless acceptance criteria are met
    - include date and time (HH:MM)
-
+	- follow this format:
 ```
 * [ ] t1.1 Task Title (1)
 Description and PM notes
--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -1,18 +1,37 @@
-* [ ] t1.1: scrape one forum (1)
+* [X] t1.1: scrape one forum (1)
 Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step.
+ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be raw list of all comments on forum - could be useful later for whole-scrape
+Append forum id to viewall per forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452)
+Comments are hydrated in backend via js-cued button (AJAX?).
 ** acceptance criteria
 1. run manual scraper
   1. store proposal title and description
   2. store comment title, commenter, date
   3. store relevant metadata
 2. friendly/polite scraping
+3. store forum as distinct item with title, desc
+4. add forum ID in comment filename, eg forum452_comments_<datetime>.jsonl
+5. remove reg_title and reg_desc from each comment; these belong in forum item
+6. parse datetimes into object for later use (plotting)
   
 ** notes
+- scraper/spiders/forum.py: ForumSpider uses ViewComments.cfm?GdocForumID=N with POST pagination. The first request fetches page 1 (vPerPage=500), reads the last page number from the pagination form's links, and generates all remaining page requests up front. Each div.Cbox is parsed for all required fields.
+- scraper/items.py: CommentItem with forum_id, comment_id, author, date, title, text (reg_title and reg_desc later moved to ForumItem)
+- tests/test_forum_spider.py: 8 tests, all passing
+- Settings: DEFAULT_RESPONSE_ENCODING=utf-8 (fixes Windows-1251 meta-tag mismatch), HTTPCACHE_ENABLED=True, feed output to output/
+- ViewComments.cfm instead of comments.cfm: POST to Comments.cfm returned a 500 error (wrong endpoint). ViewComments.cfm?GdocForumID=N is the correct listing URL and returns the full comment text on the page itself, so no per-comment follow requests are needed.
+- Span-wrapped text: .divComment p::text missed 3.6% of comments where text is in <p><span>text</span></p>. Fixed to .divComment *::text, .divComment::text. Worth knowing for when the spider is extended to other forums.
+- start() vs start_requests(): Scrapy 2.13+ deprecates start_requests() in favor of async def start()
+- ForumItem vs CommentItem: ForumItem (forum_id, reg_title, reg_desc) yielded once on first page; CommentItem no longer carries reg_title/reg_desc. Both land in the same JSONL feed.
+- Dynamic output filename: set via from_crawler() overriding FEEDS at 'spider' priority — format is output/forum{id}_comments_%(time)s.jsonl. FEEDS removed from settings.py; spider owns it.
+- Date parsing: _parse_date() normalizes whitespace, upper-cases, parses "%m/%d/%y %I:%M %p" → ISO 8601; falls back to raw string on failure.

 ** evidence
- commit: 
- tests: 
- datetime: 
+- commit: beb5cf4 (AC1-2), <commit> (AC3-6)
+- tests: 8 passing (`python -m pytest tests -q`) or (`python -m pytest tests/`)
+   - `scrapy crawl forum -a forum_id=452 -s LOG_LEVEL=WARNING 2>&1`
+   - retrieved 9083 comments
+- datetime: 2026-05-05

 * [ ] t1.2: initial analysis pipeline
 Write a simple pipeline for both - prefer non-concurrent/async from scraping run. Should be run manually, separate from scraper. You may use scrapy, but are not required to.
@@ -26,3 +45,10 @@ Write a simple pipeline for both - prefer non-concurrent/async from scraping run
 - commit: 
 - tests: 
 - date: 
+
+* [ ] X: complete proposal information
+Ensure we capture as much useful information as possible about the actual proposal: contact information, etc., and what the state actually says about what was posted.
+** acceptance criteria
+1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
+   - add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
+2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
--- a/docs/tb.py
+++ b/docs/tb.py
@@ -0,0 +1,105 @@
+import jsonlines
+import re
+from textblob import TextBlob
+from collections import Counter
+
+def tprint(obj):
+    print(f"{type(obj)} : {obj}")
+
+
+def sort_file(file):
+    '''return number of positive and negative comments based on TextBlob sentiment analysis'''
+    # with jsonlines.open("/vadoe/vadoe/vadoe/townhall_2021-01-14T02-05-51.json") as reader:
+    with jsonlines.open(file, mode='r') as reader:
+        # Confirm type
+        tprint(reader)
+
+        # Build iterator
+        _doc = iter(reader)
+        i = 0
+        pos = 0
+        neg = 0
+        posl = []
+        negl = []
+
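+        # Inspect at most the first 25 records; zip() stops early when the
+        # file holds fewer, instead of next() raising StopIteration.
+        for i, _line in zip(range(25), _doc):
+            tprint(_line)
+            if _line['sentiment'] == 'pos':
+                pos = pos + 1
+                posl.append(_line['comment'])
+            elif _line['sentiment'] == 'neg':
+                neg = neg + 1
+                negl.append(_line['comment'])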
+
+        print(f'{pos} positive and {neg} negative comments')
+            # tst = TextBlob(obj['comment'])
+            # tst.sentiment
+
+def process_file(file):
+    '''Find Smythers posts'''
+    with jsonlines.open(file, mode='r') as reader:
+        _doc = iter(reader)
+        _list = []
+        for item in _doc:
+            try:
+                if item['author'][0] == 'Smythers':
+                    _list.append(item['content'][0])
+            except KeyError:
+                continue
+    return _list
+
+def write_file(file, data:object):
+    '''Write data to file'''
+    with jsonlines.open(file, mode='w') as writer:
+        for each in data:
+            writer.write(each)
+    print('write successful')
+
+def clean_text(text:str):
+    s1 = remove_html(text)
+    s2 = remove_http(s1)
+    return s2
+
+def remove_html(text:str):
+    '''Remove html tags from string'''
+    clean = re.compile('<.*?>')
+    return re.sub(clean, '', text)
+
+def remove_http(text:str):
+    '''Remove URLs from string'''
+    return re.sub(r'http\S+','', text)
+
+def get_nouns(text:str):
+    blob = TextBlob(text)
+    # check nouns? or no
+    return blob.tags
+
+vadoe = '/vadoe/vadoe/vadoe/townhall_2021-01-14T02-05-51.json'
+vadoe_p = '/vadoe/vadoe/vadoe/townhall_2021-01-14T05-11-55.json'
+dlr = '/vadoe/vadoe/vadoe/dlr.json'
+
+smythers_pc = '/vadoe/vadoe/vadoe/smythers.json'
+write_to = '/vadoe/vadoe/vadoe/nouns.json'
+
+# processed_file(file)
+smythers_posts = process_file(dlr)
+# cleaned = []
+# for each in smythers:
+    # cleaned.append(clean_text(each))
+cleaned = [clean_text(each) for each in smythers_posts]
+nouns = []
+for x in cleaned:
+    _list = get_nouns(x)
+    for y in _list:
+        nouns.append(y)
+    # nouns.append(x for x in [get_nouns())
+sortedNouns = Counter(nouns)
+nouns = []
+for k, v in sortedNouns.items():
+    if v > 2: 
+        _d = (k, v)
+        nouns.append(_d)
+print(nouns)
+write_file(write_to, nouns)
--- a/docs/townhall.py
+++ b/docs/townhall.py
@@ -0,0 +1,45 @@
+# -*- coding: utf-8 -*-
+import scrapy
+from items import CommentItem
+import textblob
+from textblob import TextBlob
+from textblob.sentiments import NaiveBayesAnalyzer
+
+class TownhallSpider(scrapy.Spider):
+    name = 'townhall'
+    allowed_domains = ['townhall.virginia.gov']
+    start_urls = ['https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452']
+    custom_settings = {
+        'FEED_EXPORTERS' : {
+            "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
+        },
+        'FEED_URI' : '%(name)s_%(time)s.json',
+        'FEED_FORMAT': 'jsonlines'
+    }
+
+    def parse(self, response):
+        rows = response.css('#contentwide>table>tr')
+        # cut out the header row
+        for each in rows[1:]:
+        # for each in rows[1:6]:
+            cols = each.xpath('.//td')
+            linkfollow = cols[0].css('a::attr(href)').get()
+            comment_title = cols[0].xpath('a/text()').get()
+            # clean up
+            commenter = cols[1].xpath('text()').get()
+            # clean up
+            date = cols[2].xpath('a/text()').get()
+            print(f'{comment_title}  |  {commenter}')
+            yield response.follow(linkfollow, callback = self.parse_comment)
+
+    def parse_comment(self, response):
+        entry = CommentItem()
+        text = response.css('.divComment>p::text').get(default='')  # '' avoids AttributeError when no <p> text matches
+        text = text.replace(u'\u00a0',' ')
+        entry['comment'] = text
+        blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
+        entry['sentiment'] = blob.sentiment.classification
+        entry['sentiment_pos'] = blob.sentiment.p_pos
+        entry['sentiment_neg'] = blob.sentiment.p_neg
+        # yield CommentItem(comment = response.css('.divComment>p::text').get())
+        yield entry
--- a/docs/townhall2.py
+++ b/docs/townhall2.py
@@ -0,0 +1,62 @@
+# -*- coding: utf-8 -*-
+import scrapy
+from items import CommentItem
+import textblob
+from textblob import TextBlob
+from textblob.sentiments import NaiveBayesAnalyzer
+
+class TownhallSpider(scrapy.Spider):
+    name = 'townhall'
+    allowed_domains = ['townhall.virginia.gov']
+    start_urls = ['https://www.townhall.virginia.gov/L/Forums.cfm']
+    custom_settings = {
+        'FEED_EXPORTERS' : {
+            "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
+        },
+        'FEED_URI' : '%(name)s_%(time)s.json',
+        'FEED_FORMAT': 'jsonlines'
+    }
+
+    def parse(self, response):
+        rows = response.css('table>tr>td')
+        for each in rows:
+            linkfollow = each.css('a').attrib['href']
+            if 'comments' in linkfollow:
+                yield response.follow(linkfollow, callback = self.parse_forum)
+
+            cols = each.xpath('.//td')
+            linkfollow = cols[0].css('a::attr(href)').get()
+            comment_title = cols[0].xpath('a/text()').get()
+            # clean up
+            commenter = cols[1].xpath('text()').get()
+            # clean up
+            date = cols[2].xpath('a/text()').get()
+            print(f'{comment_title}  |  {commenter}')
+            yield response.follow(linkfollow, callback = self.parse_comment)
+
+    def parse_forum(self, response):
+        rows = response.css('#contentwide>table>tr')
+        # cut out the header row
+        for each in rows[1:]:
+        # for each in rows[1:6]:
+            cols = each.xpath('.//td')
+            linkfollow = cols[0].css('a::attr(href)').get()
+            comment_title = cols[0].xpath('a/text()').get()
+            # clean up
+            commenter = cols[1].xpath('text()').get()
+            # clean up
+            date = cols[2].xpath('a/text()').get()
+            print(f'{comment_title}  |  {commenter}')
+            yield response.follow(linkfollow, callback = self.parse_comment)
+
+    def parse_comment(self, response):
+        entry = CommentItem()
+        text = response.css('.divComment>p::text').get(default='')  # '' avoids AttributeError when no <p> text matches
+        text = text.replace(u'\u00a0',' ')
+        entry['comment'] = text
+        blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
+        entry['sentiment'] = blob.sentiment.classification
+        entry['sentiment_pos'] = blob.sentiment.p_pos
+        entry['sentiment_neg'] = blob.sentiment.p_neg
+        # yield CommentItem(comment = response.css('.divComment>p::text').get())
+        yield entry
--- a/scraper/items.py
+++ b/scraper/items.py
@@ -1,12 +1,16 @@
-# Define here the models for your scraped items
-#
-# See documentation in:
-# https://docs.scrapy.org/en/latest/topics/items.html
-
 import scrapy


-class ScraperItem(scrapy.Item):
-    # define the fields for your item here like:
-    # name = scrapy.Field()
-    pass
+class ForumItem(scrapy.Item):
+    forum_id  = scrapy.Field()
+    reg_title = scrapy.Field()
+    reg_desc  = scrapy.Field()
+
+
+class CommentItem(scrapy.Item):
+    forum_id   = scrapy.Field()
+    comment_id = scrapy.Field()
+    author     = scrapy.Field()
+    date       = scrapy.Field()
+    title      = scrapy.Field()
+    text       = scrapy.Field()
--- a/scraper/settings.py
+++ b/scraper/settings.py
@@ -15,8 +15,7 @@ NEWSPIDER_MODULE = "scraper.spiders"
 ADDONS = {}


-# Crawl responsibly by identifying yourself (and your website) on the user-agent
-#USER_AGENT = "scraper (+http://www.yourdomain.com)"
+USER_AGENT = "vath-research-scraper/1.0 (public comment analysis; contact: research)"

 # Obey robots.txt rules
 ROBOTSTXT_OBEY = True
@@ -75,13 +74,17 @@ DOWNLOAD_DELAY = 1
 # Enable showing throttling stats for every response received:
 #AUTOTHROTTLE_DEBUG = False

-# Enable and configure HTTP caching (disabled by default)
-# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
-#HTTPCACHE_ENABLED = True
-#HTTPCACHE_EXPIRATION_SECS = 0
-#HTTPCACHE_DIR = "httpcache"
-#HTTPCACHE_IGNORE_HTTP_CODES = []
-#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
+# HTTP cache — enabled during development to avoid re-hitting the server on test runs.
+# Disable (or delete httpcache/) before a production run.
+HTTPCACHE_ENABLED = True
+HTTPCACHE_EXPIRATION_SECS = 86400  # 24 h
+HTTPCACHE_DIR = "httpcache"
+
+# Output filename is set dynamically by each spider via from_crawler (includes forum_id).
+
+# The site declares windows-1251 in a meta tag but sends valid UTF-8 bytes.
+# Force UTF-8 to prevent lxml from re-decoding via the meta charset.
+DEFAULT_RESPONSE_ENCODING = "utf-8"

 # Set settings whose default value is deprecated to a future-proof value
 FEED_EXPORT_ENCODING = "utf-8"
--- a/scraper/spiders/forum.py
+++ b/scraper/spiders/forum.py
@@ -0,0 +1,136 @@
+import re
+from datetime import datetime
+
+import scrapy
+
+from scraper.items import CommentItem, ForumItem
+
+_BASE = "https://www.townhall.virginia.gov/L/ViewComments.cfm"
+_NBSP = "\xa0"
+_REPLACEMENT_CHAR = "\ufffd"  # U+FFFD replacement character
+
+
+def _view_url(forum_id):
+    return f"{_BASE}?GdocForumID={forum_id}"
+
+
+def _parse_date(raw):
+    normalized = " ".join(raw.split()).upper()
+    try:
+        return datetime.strptime(normalized, "%m/%d/%y %I:%M %p").isoformat()
+    except ValueError:
+        return raw
+
+
+class ForumSpider(scrapy.Spider):
+    name = "forum"
+    allowed_domains = ["townhall.virginia.gov"]
+
+    # Override at runtime: scrapy crawl forum -a forum_id=452
+    forum_id = "452"
+    per_page = 500
+
+    @classmethod
+    def from_crawler(cls, crawler, *args, **kwargs):
+        spider = super().from_crawler(crawler, *args, **kwargs)
+        crawler.settings.set(
+            "FEEDS",
+            {
+                f"output/forum{spider.forum_id}_comments_%(time)s.jsonl": {
+                    "format": "jsonlines",
+                    "encoding": "utf-8",
+                    "overwrite": False,
+                }
+            },
+            priority="spider",
+        )
+        return spider
+
+    async def start(self):
+        yield scrapy.FormRequest(
+            _view_url(self.forum_id),
+            formdata={"vPage": "1", "vPerPage": str(self.per_page), "sub1": "go"},
+            callback=self.parse_comments,
+            meta={"is_first": True},
+        )
+
+    # ------------------------------------------------------------------
+    def parse_comments(self, response):
+        if response.meta.get("is_first"):
+            reg_title, reg_desc = self._reg_context(response)
+            last_page = self._last_page(response)
+            yield ForumItem(
+                forum_id=self.forum_id,
+                reg_title=reg_title,
+                reg_desc=reg_desc,
+            )
+            for page in range(2, last_page + 1):
+                yield scrapy.FormRequest(
+                    _view_url(self.forum_id),
+                    formdata={"vPage": str(page), "vPerPage": str(self.per_page), "sub1": "go"},
+                    callback=self.parse_comments,
+                )
+
+        for box in response.css("div.Cbox"):
+            yield self._parse_box(box)
+
+    # ------------------------------------------------------------------
+    def _parse_box(self, box):
+        cbox_id = box.attrib.get("id", "")
+        comment_id = cbox_id[len("cbox"):] if cbox_id.startswith("cbox") else ""
+
+        date_raw = (
+            box.css("div[style*='float: right'] div::text").get("")
+            .replace(_NBSP, " ").strip()
+        )
+
+        author = (
+            box.xpath('.//strong[contains(text(),"Commenter:")]/following-sibling::text()[1]')
+            .get("").strip()
+        )
+
+        # Second <strong> in the commenter block is the comment title
+        strongs = box.css("div > strong::text").getall()
+        title = strongs[-1].strip() if len(strongs) > 1 else ""
+
+        paragraphs = box.css(".divComment *::text, .divComment::text").getall()
+        text = " ".join(p.strip() for p in paragraphs if p.strip())
+        text = text.replace(_NBSP, " ").replace(_REPLACEMENT_CHAR, "'").strip()
+
+        return CommentItem(
+            forum_id=self.forum_id,
+            comment_id=comment_id,
+            author=author,
+            date=_parse_date(date_raw),
+            title=title,
+            text=text,
+        )
+
+    # ------------------------------------------------------------------
+    def _reg_context(self, response):
+        # Page shows: <strong>Guidance Document Change:</strong> description text...
+        label_node = response.xpath('//strong[contains(text(),"Change:")]')
+
+        # Collect all sibling text nodes following the label
+        siblings = label_node.xpath("following-sibling::text()").getall()
+        raw = " ".join(t.strip() for t in siblings if t.strip())
+        raw = raw.replace(_NBSP, " ").replace(_REPLACEMENT_CHAR, "'").strip()
+
+        reg_desc = raw
+
+        # reg_title: text before the first "was ", "has ", or "guidance document"; else the first 200 chars
+        m = re.match(r"^(.+?)\s+(?:was |has |guidance document)", raw, re.IGNORECASE)
+        reg_title = m.group(1).strip() if m else raw[:200]
+
+        return reg_title, reg_desc
+
+    def _last_page(self, response):
+        hrefs = response.xpath(
+            '//form[@name="page"]//a[contains(@href,"vpage.value=")]/@href'
+        ).getall()
+        pages = [
+            int(m.group(1))
+            for h in hrefs
+            if (m := re.search(r"vpage\.value=(\d+)", h))
+        ]
+        return max(pages) if pages else 1
--- a/tests/test_forum_spider.py
+++ b/tests/test_forum_spider.py
@@ -0,0 +1,230 @@
+"""Tests for ForumSpider parsing logic using fake HTML responses."""
+
+import scrapy
+from scrapy.http import HtmlResponse, Request
+
+from scraper.items import CommentItem, ForumItem
+from scraper.spiders.forum import ForumSpider, _parse_date
+
+
+def fake_response(url, body, meta=None):
+    req = Request(url=url, meta=meta or {})
+    return HtmlResponse(url=url, body=body.encode("utf-8"), request=req)
+
+
+# ---------------------------------------------------------------------------
+# Minimal page HTML fragments
+
+PAGE1_HTML = """
+<html><body>
+  <strong>Guidance Document Change:</strong> The Model Policies for the Treatment of Transgender Students
+  was developed in response to House Bill 145 and Senate Bill 161.
+
+  <div style="font-family: verdana;">
+    <form name="page" id="page" action="ViewComments.cfm?GdocForumID=452" method="post">
+      <input name="vPage" id="vpage" type="input" value="1">
+      <input name="vPerPage" id="vPerPage" type="input" value="500">
+      <a href="javascript:document.page.vpage.value=3;document.page.submit();">3</a>
+      <a href="javascript:document.page.vpage.value=2;document.page.submit();">Next</a>
+      <input type="submit" name="sub1" value="go">
+    </form>
+  </div>
+
+  <div id="cbox101" class="Cbox">
+    <div style="float: right; text-align: right;">
+      <div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/4/21&nbsp;&nbsp;9:15 am</div>
+    </div>
+    <div>
+      <strong>Commenter:</strong>
+      Alice Example
+      <br><br>
+      <strong>I strongly support this</strong>
+    </div>
+    <div style="clear: right">&nbsp;</div>
+    <div class="divComment">
+      <p>This is a great policy for students.</p>
+      <p>All schools should follow it.</p>
+    </div>
+    <div style="float: left; font-size: 90%;">
+      CommentID: <a href="ViewComments.cfm?commentid=101">101</a>
+    </div>
+  </div>
+
+  <div id="cbox102" class="Cbox">
+    <div style="float: right; text-align: right;">
+      <div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/5/21&nbsp;&nbsp;10:00 am</div>
+    </div>
+    <div>
+      <strong>Commenter:</strong>
+      Bob Sample
+      <br><br>
+      <strong>Opposed</strong>
+    </div>
+    <div style="clear: right">&nbsp;</div>
+    <div class="divComment">
+      <p>I do not support this guidance.</p>
+    </div>
+    <div style="float: left; font-size: 90%;">
+      CommentID: <a href="ViewComments.cfm?commentid=102">102</a>
+    </div>
+  </div>
+</body></html>
+"""
+
+PAGE2_HTML = """
+<html><body>
+  <div id="cbox201" class="Cbox">
+    <div style="float: right; text-align: right;">
+      <div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/6/21&nbsp;&nbsp;11:00 am</div>
+    </div>
+    <div>
+      <strong>Commenter:</strong>
+      Carol T
+      <br><br>
+      <strong>Support</strong>
+    </div>
+    <div style="clear: right">&nbsp;</div>
+    <div class="divComment">
+      <p>This policy is long overdue.</p>
+    </div>
+  </div>
+</body></html>
+"""
+
+
+def make_spider():
+    return ForumSpider()
+
+
+# ---------------------------------------------------------------------------
+
+def test_page1_generates_remaining_page_requests():
+    spider = make_spider()
+    response = fake_response(
+        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
+        PAGE1_HTML,
+        meta={"is_first": True},
+    )
+    results = list(spider.parse_comments(response))
+    form_reqs = [r for r in results if isinstance(r, scrapy.FormRequest)]
+    # Pages 2 and 3 should be requested (last page link = 3)
+    assert len(form_reqs) == 2
+    pages = sorted(r.body.decode() for r in form_reqs)
+    assert "vPage=2" in pages[0]
+    assert "vPage=3" in pages[1]
+
+
+def test_page1_yields_items():
+    spider = make_spider()
+    response = fake_response(
+        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
+        PAGE1_HTML,
+        meta={"is_first": True},
+    )
+    results = list(spider.parse_comments(response))
+    items = [r for r in results if isinstance(r, CommentItem)]
+    assert len(items) == 2
+
+
+def test_page1_yields_forum_item():
+    spider = make_spider()
+    response = fake_response(
+        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
+        PAGE1_HTML,
+        meta={"is_first": True},
+    )
+    results = list(spider.parse_comments(response))
+    forum_items = [r for r in results if isinstance(r, ForumItem)]
+    assert len(forum_items) == 1
+    fi = forum_items[0]
+    assert "Transgender Students" in fi["reg_title"]
+    assert "House Bill 145" in fi["reg_desc"]
+    assert fi["forum_id"] == "452"
+
+
+def test_comment_fields_parsed_correctly():
+    spider = make_spider()
+    response = fake_response(
+        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
+        PAGE1_HTML,
+        meta={"is_first": True},
+    )
+    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
+    item = items[0]
+    assert item["comment_id"] == "101"
+    assert item["author"] == "Alice Example"
+    assert item["title"] == "I strongly support this"
+    assert "great policy" in item["text"]
+    assert "All schools" in item["text"]  # multi-paragraph joined
+    assert "reg_title" not in item
+    assert "reg_desc" not in item
+
+
+def test_subsequent_page_yields_comments():
+    spider = make_spider()
+    response = fake_response(
+        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
+        PAGE2_HTML,
+    )
+    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
+    assert len(items) == 1
+    assert items[0]["author"] == "Carol T"
+
+
+def test_last_page_detection():
+    spider = make_spider()
+    response = fake_response(
+        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
+        PAGE1_HTML,
+        meta={"is_first": True},
+    )
+    assert spider._last_page(response) == 3
+
+
+def test_date_parsed_to_iso():
+    assert _parse_date("1/4/21  9:15 am") == "2021-01-04T09:15:00"
+    assert _parse_date("1/5/21  10:00 am") == "2021-01-05T10:00:00"
+    assert _parse_date("unparseable") == "unparseable"
+
+
+SPAN_WRAPPED_HTML = """
+<html><body>
+  <strong>Guidance Document Change:</strong> Some regulation was developed.
+
+  <form name="page" id="page" action="ViewComments.cfm?GdocForumID=452" method="post">
+    <input name="vPage" value="1"><input name="vPerPage" value="500">
+    <a href="javascript:document.page.vpage.value=1;document.page.submit();">1</a>
+    <input type="submit" name="sub1" value="go">
+  </form>
+
+  <div id="cbox301" class="Cbox">
+    <div style="float: right; text-align: right;">
+      <div style="background-color: white; border: 1px solid #cccccc; padding: 4px">2/1/21&nbsp;&nbsp;8:00 am</div>
+    </div>
+    <div>
+      <strong>Commenter:</strong>
+      Dan Span
+      <br><br>
+      <strong>Opposed</strong>
+    </div>
+    <div style="clear: right">&nbsp;</div>
+    <div class="divComment">
+      <!DOCTYPE html><html><head></head><body>
+      <p style="margin: 0in;"><span style="font-size: 10.5pt;">Text inside a span element.</span></p>
+      </body></html>
+    </div>
+  </div>
+</body></html>
+"""
+
+
+def test_span_wrapped_text_is_extracted():
+    spider = make_spider()
+    response = fake_response(
+        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
+        SPAN_WRAPPED_HTML,
+        meta={"is_first": True},
+    )
+    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
+    assert len(items) == 1
+    assert "Text inside a span element" in items[0]["text"]
Author	SHA1	Message	Date
eulaly	314f8d2621	updated instructions and added artifacts	2026-05-05 13:50:27 -04:00
eulaly	e7df0b24a1	1.1 cleanup	2026-05-05 13:50:04 -04:00
eulaly	951cc11a14	t1.1: record commit hash in evidence	2026-05-05 12:28:15 -04:00
eulaly	beb5cf461b	t1.1: scrape one forum via ViewComments.cfm POST pagination Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500, generates all page requests from page-1 metadata, and parses each div.Cbox for comment_id, author, date, title, text, reg_title, reg_desc. Handles span-wrapped comment text. Fixes UTF-8/windows-1251 meta-tag encoding mismatch. 9083 items, 15 empty-text (0.17%). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 12:28:07 -04:00
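
The t1.1 notes state that ForumItem and CommentItem land in the same JSONL feed and that dates are stored as ISO 8601 strings, with a fallback to the raw string when parsing fails. A downstream reader therefore has to separate the two record types and tolerate unparsed dates. The following is a minimal sketch of such a reader, not part of this change set. The helper names `split_feed` and `comments_per_day` are hypothetical, and telling the record types apart by the presence of a `comment_id` key is an assumption based on the fields in scraper/items.py.

```python
import json
from collections import Counter
from datetime import datetime


def split_feed(lines):
    """Separate forum records from comment records in a mixed JSONL feed.

    Assumption: a record with a "comment_id" key is a CommentItem and any
    other record is a ForumItem, matching the fields in scraper/items.py.
    """
    forums, comments = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the feed
        record = json.loads(line)
        (comments if "comment_id" in record else forums).append(record)
    return forums, comments


def comments_per_day(comments):
    """Count comments per calendar day, skipping dates left as raw strings."""
    counts = Counter()
    for comment in comments:
        try:
            day = datetime.fromisoformat(comment["date"]).date()
        except (KeyError, TypeError, ValueError):
            continue  # date missing or not ISO 8601 (raw-string fallback)
        counts[day.isoformat()] += 1
    return counts


# Made-up sample records shaped like the feed output, for illustration only.
sample = [
    '{"forum_id": "452", "reg_title": "Example title", "reg_desc": "Example description"}',
    '{"forum_id": "452", "comment_id": "101", "author": "A", "date": "2021-01-04T09:15:00", "title": "t", "text": "x"}',
    '{"forum_id": "452", "comment_id": "102", "author": "B", "date": "2021-01-04T10:00:00", "title": "t", "text": "y"}',
    '{"forum_id": "452", "comment_id": "103", "author": "C", "date": "unparseable", "title": "t", "text": "z"}',
]
forums, comments = split_feed(sample)
print(len(forums), len(comments))
print(dict(comments_per_day(comments)))
```

Because `split_feed` accepts any iterable of lines, an open file handle on a feed such as output/forum452_comments_<datetime>.jsonl could be passed in directly. Keeping the reader separate from the spider matches the t1.2 requirement that analysis be run manually, apart from the scraping run.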