t1.1: scrape one forum via ViewComments.cfm POST pagination

Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500, generates all page requests from page-1 metadata, and parses each div.Cbox for comment_id, author, date, title, text, reg_title, reg_desc. Handles span-wrapped comment text. Fixes UTF-8/windows-1251 meta-tag encoding mismatch. 9083 items, 15 empty-text (0.17%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
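
Note on the pagination step: the message says every follow-up request is generated from page-1 metadata. A minimal sketch of that arithmetic is below, assuming page 1 exposes a total comment count. The helper names and the `page` form field are hypothetical and are not taken from the spider; only `vPerPage=500` comes from the commit message.

```python
import math

PER_PAGE = 500        # vPerPage value named in the commit message
PAGE_FIELD = "page"   # hypothetical form field name; the real one is not shown here


def remaining_pages(total_comments: int, per_page: int = PER_PAGE) -> list[int]:
    """Return the page numbers still to request after page 1 has been fetched."""
    if total_comments <= 0:
        return []
    total_pages = math.ceil(total_comments / per_page)
    # Page 1 is already in hand, so start at page 2.
    return list(range(2, total_pages + 1))


def form_payloads(total_comments: int, per_page: int = PER_PAGE) -> list[dict[str, str]]:
    """Build one POST form body per remaining page (field names are assumptions)."""
    return [
        {"vPerPage": str(per_page), PAGE_FIELD: str(page)}
        for page in remaining_pages(total_comments, per_page)
    ]
```

In a Scrapy spider, each payload would presumably be passed as `formdata` to a `scrapy.FormRequest` against the same URL; that wiring is left out so the sketch stays runnable without Scrapy installed.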
2026-05-05 12:28:07 -04:00
parent 02964312cb
commit beb5cf461b
6 changed files with 387 additions and 22 deletions
--- a/scraper/items.py
+++ b/scraper/items.py
@@ -1,12 +1,17 @@
-# Define here the models for your scraped items
-#
-# See documentation in:
-# https://docs.scrapy.org/en/latest/topics/items.html
-
 import scrapy


-class ScraperItem(scrapy.Item):
-    # define the fields for your item here like:
-    # name = scrapy.Field()
-    pass
+class CommentItem(scrapy.Item):
+    # Forum / regulation context
+    forum_id   = scrapy.Field()
+    reg_title  = scrapy.Field()
+    reg_desc   = scrapy.Field()
+
+    # Comment metadata
+    comment_id = scrapy.Field()
+    author     = scrapy.Field()
+    date       = scrapy.Field()
+    title      = scrapy.Field()
+
+    # Comment content
+    text       = scrapy.Field()