114 lines
8.2 KiB
Org Mode
114 lines
8.2 KiB
Org Mode
#+title: VATH Task Log
|
|
#+date: [2026-05-05 Tue]
|
|
#+startup: Overview
|
|
|
|
* [X] t1.1: scrape one forum (1)
|
|
Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step.
|
|
ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be raw list of all comments on forum - could be useful later for whole-scrape
|
|
Append forum id to viewall per forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452)
|
|
Comments are hydrated in backend via js-cued button (AJAX?).
|
|
** acceptance criteria
|
|
1. run manual scraper
|
|
1. store proposal title and description
|
|
2. store comment title, commenter, date
|
|
3. store relevant metadata
|
|
2. friendly/polite scraping
|
|
3. store forum as distinct item with title, desc
|
|
4. add forum ID in comment filename, eg forum452_comments_<datetime>.jsonl
|
|
5. remove reg_title and reg_desc from each comment; these belong in forum item
|
|
6. parse datetimes into object for later use (plotting)
|
|
|
|
** notes
|
|
- scraper/spiders/forum.py — ForumSpider using ViewComments.cfm?GdocForumID=N with POST pagination. First request fetches page 1 (vPerPage=500), discovers the last page number from the form's link, generates all remaining page requests upfront. Parses each div.Cbox for all required fields.
|
|
- scraper/items.py — CommentItem with forum_id, reg_title, reg_desc, comment_id, author, date, title, text
|
|
- tests/test_forum_spider.py — 7 tests, all passing
|
|
- Settings: DEFAULT_RESPONSE_ENCODING=utf-8 (fixes Windows-1251 meta-tag mismatch), HTTPCACHE_ENABLED=True, feed output to output/
|
|
- ViewComments.cfm instead of comments.cfm: POST to Comments.cfm returned a 500 error (wrong endpoint). ViewComments.cfm?GdocForumID=N is the correct listing URL, returns full comment text on the page itself — no per-comment follow requests needed.
|
|
- Span-wrapped text: .divComment p::text missed 3.6% of comments where text is in <p><span>text</span></p>. Fixed to .divComment *::text, .divComment::text. Worth knowing for when the spider is extended to other forums.
|
|
- start() vs start_requests(): Scrapy 2.13+ deprecates start_requests() in favor of async def start()
|
|
- ForumItem vs CommentItem: ForumItem (forum_id, reg_title, reg_desc) yielded once on first page; CommentItem no longer carries reg_title/reg_desc. Both land in the same JSONL feed.
|
|
- Dynamic output filename: set via from_crawler() overriding FEEDS at 'spider' priority — format is output/forum{id}_comments_%(time)s.jsonl. FEEDS removed from settings.py; spider owns it.
|
|
- Date parsing: _parse_date() normalizes whitespace, upper-cases, parses "%m/%d/%y %I:%M %p" → ISO 8601; falls back to raw string on failure.
|
|
|
|
** evidence
|
|
- commit: beb5cf4 (AC1-2), e7df0b2 (AC3-6)
|
|
- tests: 8 passing (`python -m pytest tests -q`) or (`python -m pytest tests/`)
|
|
- `scrapy crawl forum -a forum_id=452 -s LOG_LEVEL=WARNING 2>&1`
|
|
- retrieved 9083 comments
|
|
- datetime: [2026-05-05 Tue 14:00]
|
|
|
|
* [X] t1.2: initial 4o sentiment
|
|
Write a simple manual pipeline for gpt-4o that reads one scraped forum jsonl file and roduces a separate analyzed jsonl file. this step must not mutate scraper output. analysis should classify each comment for regulatory stance, generic tone/sentiment, confidence, and enough rationale/evidence to support later dashboard drilldown.
|
|
Should be run manually, separate from scraper. You may use scrapy, but are not required to.
|
|
- Sentiment is derived, not scraped - keep separate from raw comments.
|
|
- keep jsonl as interchange/audit format
|
|
|
|
** acceptance criteria
|
|
1. input scraped jsonl doc by filename/path, e.g. "./output/forum452_comments_<datetime>.jsonl"
|
|
- handle mixed itemtypes, e.g., forum + comment items
|
|
2. output new analysis file, e.g., "analysis/forum452_<datetime>_<model>_<datetime>.jsonl"
|
|
- one analysis record per comment
|
|
- include run_id, forum_id, comment_id, analyzed_at, model, prompt_version
|
|
3. capture stance toward proposed reg/guidance:
|
|
- `stance`: support, oppose, neutral, unknown
|
|
- `confidence`: 0-1
|
|
- short rationale, if provided by model
|
|
4. capture generic sentiment/tone separately from stance: `tone`=positive, negative, neutral, mixed, unclear
|
|
5. capture issue/topic tags for later grouping, may be empty
|
|
6. use .env for api key management
|
|
7. document the exact prompt version used; prompt text may live in code or docs, but must have a version string/hash in output records
|
|
8. for this run, an option to run the first N comments (5, 10, 20, 50) - will add batch processing later
|
|
|
|
** notes
|
|
- analysis/gpt4o/analysis.py: standalone script; core functions importable for tests.
|
|
- Prompt version = SHA-256[:7] of SYSTEM_PROMPT+USER_TEMPLATE; auto-updates on prompt change.
|
|
- Output: analysis/gpt4o/forum{id}_{scrape_ts}_{model}_{run_ts}.jsonl, one record per comment.
|
|
- --limit {5,10,20,50} for test runs; omit for full corpus. Batch processing planned for later.
|
|
- Incremental flush after each record: safe to interrupt and inspect partial output.
|
|
- temperature=0.0 for deterministic, reproducible classifications across runs.
|
|
- Retry: 3 attempts (delays 1s, 2s) on RateLimitError; all other exceptions → error record + continue.
|
|
- openai==2.34.0 installed; python-dotenv already present; key loaded from .env via OPENAI_API_KEY.
|
|
- MAX_COMMENT_CHARS=6000: covers >99% without truncation; outliers (e.g. 18k-char law firm brief) flagged with truncated=True.
|
|
|
|
** evidence
|
|
- commit: d834d18
|
|
- tests: 20 passing (pytest tests/analysis_gpt4o_realtime.py), 28 total across suite
|
|
- `python ./analysis/gpt4o/analysis_realtime.py --limit 5 ./output/f452.jsonl`
|
|
- see: ./analysis/gpt4o/forum452_unknown_gpt-4o_2026-05-05T18-48-32+00-00.jsonl
|
|
- date: [2026-05-05 Tue 15:00]
|
|
|
|
* [ ] t1.2.1: batch processing
|
|
Create analysis-batch.py to capture same elements as t1.2 above.
|
|
May need to add multiple commands to upload, check batch status, download, etc.
|
|
Commands should all be run manually.
|
|
Reference: ./docs/openai-batch.md. openai batch output order is not guaranteed, so custom_id is mandatory for reconciliation
|
|
** acceptance criteria
|
|
1. input scraped jsonl doc by filename/path, and process the whole thing via batch processing
|
|
- ignore non-comment items in jsonl
|
|
- do not modify raw scraper output
|
|
- specify model and prompt
|
|
2. output a run manifest in ./analysis/<model>/runs/<run_id>.json
|
|
- include: include run_id, input_filename, input_sha256, prompt_hash, model, batch_id, records_submitted, records_completed, records_failed, request_filename, raw_output_filename, normalized_output_filename, created_at, completed_at
|
|
3. add tests without live api calls
|
|
** notes
|
|
- analysis/gpt4o/analysis-batch.py with three subcommands:
|
|
- `submit`: reads scraped JSONL, builds batch request file (requests/<run_id>.jsonl), uploads to Files API, creates batch, saves manifest to runs/<run_id>.json. Prints run_id to stdout for scripting.
|
|
- `status`: retrieves batch from OpenAI, prints status + counts, updates manifest.
|
|
- `download`: downloads raw output to raw/<run_id>.jsonl, normalizes to <run_id>_<model>.jsonl using comment_lookup keyed by comment_id for reconciliation (batch output order not guaranteed). Updates manifest with filenames, counts, completed_at.
|
|
- custom_id format: comment_{comment_id} — unique within a forum, stable across runs.
|
|
- PROMPT_VERSION derived from analysis/prompt-1.txt (same file as realtime); both scripts produce matching prompt_hash in all records.
|
|
- analysis/prompt-1.txt: system prompt as plaintext, read at import time by both scripts. Edit here to change prompt for both pipelines.
|
|
- Tests use importlib.util to load hyphenated filenames; monkeypatch for RUNS_DIR in save/load test.
|
|
|
|
** evidence
|
|
- commit:
|
|
- tests: 18 passing (pytest tests/analysis_gpt4o_batch.py), 46 total across suite
|
|
- datetime: [2026-05-05 Tue 17:00]
|
|
|
|
* [ ] X: complete proposal information
|
|
Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
|
|
** acceptance criteria
|
|
1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
|
|
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
|
|
2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
|