t1.1: scrape one forum via ViewComments.cfm POST pagination

Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500,
generates all page requests from page-1 metadata, and parses
each div.Cbox for comment_id, author, date, title, text, reg_title,
reg_desc. Handles span-wrapped comment text. Fixes UTF-8/windows-1251
meta-tag encoding mismatch. Result: 9083 items, 15 with empty text
(0.17%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:28:07 -04:00
parent 02964312cb
commit beb5cf461b
6 changed files with 387 additions and 22 deletions
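
The pagination step described in the commit message (read the total from page 1, then generate every remaining page request up front) might look roughly like the sketch below. It is not the spider's actual code: the host in `BASE_URL`, the `vPage` form field, and both function names are assumptions made for illustration. Only `ViewComments.cfm`, `GdocForumID`, and `vPerPage=500` come from the commit message.

```python
import math
from urllib.parse import urlencode

# Placeholder host; the real site URL is not given in the commit.
BASE_URL = "https://example.invalid/ViewComments.cfm"
PER_PAGE = 500  # vPerPage value named in the commit message


def page_count(total_comments: int, per_page: int = PER_PAGE) -> int:
    """Number of pages needed to list every comment (at least one)."""
    return max(1, math.ceil(total_comments / per_page))


def build_page_requests(forum_id: int, total_comments: int,
                        per_page: int = PER_PAGE) -> list:
    """Return (url, form_data) pairs for pages 2..N.

    Page 1 is assumed to be fetched already, since it supplies the total.
    """
    url = f"{BASE_URL}?{urlencode({'GdocForumID': forum_id})}"
    requests = []
    for page in range(2, page_count(total_comments, per_page) + 1):
        # "vPage" is a hypothetical field name; check the site's real form.
        form_data = {"vPerPage": str(per_page), "vPage": str(page)}
        requests.append((url, form_data))
    return requests
```

In a Scrapy spider, each pair could be handed to `scrapy.FormRequest(url, formdata=form_data, callback=...)`, which sends the fields as a POST body. With 500 comments per page, a listing of 9083 comments would span 19 pages, so this helper would produce 18 follow-up requests.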


@@ -1,12 +1,17 @@
 # Define here the models for your scraped items
 #
 # See documentation in:
 # https://docs.scrapy.org/en/latest/topics/items.html
 import scrapy
-class ScraperItem(scrapy.Item):
-    # define the fields for your item here like:
-    # name = scrapy.Field()
-    pass
+class CommentItem(scrapy.Item):
+    # Forum / regulation context
+    forum_id = scrapy.Field()
+    reg_title = scrapy.Field()
+    reg_desc = scrapy.Field()
+    # Comment metadata
+    comment_id = scrapy.Field()
+    author = scrapy.Field()
+    date = scrapy.Field()
+    title = scrapy.Field()
+    # Comment content
+    text = scrapy.Field()
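
As a rough illustration of how output shaped like `CommentItem` could be sanity-checked, the sketch below reproduces the kind of empty-text figure quoted in the commit message. It uses plain dicts rather than Scrapy items so it runs without Scrapy installed. The field names are copied from the diff; `COMMENT_FIELDS`, `missing_fields`, and `empty_text_rate` are hypothetical names and not part of the commit.

```python
# Field names copied from CommentItem; everything else here is illustrative.
COMMENT_FIELDS = (
    "forum_id", "reg_title", "reg_desc",
    "comment_id", "author", "date", "title",
    "text",
)


def missing_fields(record: dict) -> list:
    """Return the CommentItem field names absent from a scraped record."""
    return [name for name in COMMENT_FIELDS if name not in record]


def empty_text_rate(records: list) -> float:
    """Percentage of records whose text is missing or whitespace-only."""
    if not records:
        return 0.0
    empty = sum(1 for r in records if not (r.get("text") or "").strip())
    return 100.0 * empty / len(records)
```

Run over an exported feed (for example a JSON lines file loaded into a list of dicts), `empty_text_rate` gives a quick regression check: a sudden jump would suggest that the `div.Cbox` markup or the span-wrapped text handling has changed.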