t1.1: scrape one forum via ViewComments.cfm POST pagination
Spider fetches ViewComments.cfm?GdocForumID=N with vPerPage=500, generates all page requests from the page-1 metadata, and parses each div.Cbox for comment_id, author, date, title, text, reg_title, and reg_desc.

Handles span-wrapped comment text. Fixes a UTF-8/windows-1251 meta-tag encoding mismatch.

Result: 9083 items, 15 with empty text (0.17%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
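The pagination and encoding steps described in the message can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the spider's actual code: the form field name `vPage`, the helper names, and the "try UTF-8 first, then fall back to the declared charset" strategy are all assumptions. Only `GdocForumID`, `vPerPage=500`, and the two encodings come from the commit message.

```python
import math

PER_PAGE = 500  # vPerPage value named in the commit message


def page_count(total_comments: int, per_page: int = PER_PAGE) -> int:
    """Number of result pages needed to cover total_comments."""
    if total_comments <= 0:
        return 0
    return math.ceil(total_comments / per_page)


def page_requests(forum_id: int, total_comments: int, per_page: int = PER_PAGE):
    """Return (url, formdata) pairs for every page after the first.

    Page 1 is assumed to be fetched already, since its metadata supplies
    total_comments. "vPage" is a hypothetical form field name.
    """
    url = f"ViewComments.cfm?GdocForumID={forum_id}"
    pages = page_count(total_comments, per_page)
    return [
        (url, {"vPerPage": str(per_page), "vPage": str(n)})
        for n in range(2, pages + 1)
    ]


def decode_body(raw: bytes, declared: str = "windows-1251") -> str:
    """Decode a response whose meta tag may not match its real encoding.

    Assumed strategy: accept the bytes as UTF-8 if they decode strictly,
    otherwise fall back to the charset the meta tag declares.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(declared, errors="replace")
```

In a Scrapy spider, each `(url, formdata)` pair would typically be turned into a `scrapy.FormRequest`. Generating every request up front from the page-1 total avoids chaining one "next page" request after another.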
@@ -1,12 +1,17 @@
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
class CommentItem(scrapy.Item):
    # Forum / regulation context
    forum_id = scrapy.Field()
    reg_title = scrapy.Field()
    reg_desc = scrapy.Field()

    # Comment metadata
    comment_id = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    title = scrapy.Field()

    # Comment content
    text = scrapy.Field()
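A figure like the empty-text count in the commit message can be checked after a crawl with a small pass over the exported items. The sketch below assumes the items are available as plain dicts keyed by the CommentItem field names (for example, loaded from a JSON export). The helper names are hypothetical and not part of this commit.

```python
# Field names taken from CommentItem in the diff above.
REQUIRED_FIELDS = (
    "forum_id", "reg_title", "reg_desc",
    "comment_id", "author", "date", "title",
    "text",
)


def missing_fields(item: dict) -> list:
    """Return the CommentItem field names absent from one exported item."""
    return [name for name in REQUIRED_FIELDS if name not in item]


def empty_text_rate(items) -> tuple:
    """Return (empty_count, percent_empty) for an iterable of item dicts.

    An item counts as empty when "text" is missing, None, or whitespace only.
    """
    items = list(items)
    if not items:
        return 0, 0.0
    empty = sum(1 for item in items if not (item.get("text") or "").strip())
    return empty, 100.0 * empty / len(items)
```

Treating whitespace-only text as empty is a design choice made here. A stricter check might also flag comments whose text merely repeats the title.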