Compare commits
44 Commits
02964312cb
...
master
| Author | SHA1 | Date | |
|---|---|---|---|
| | 8f1d9e7723 | | |
| | 181477bce7 | | |
| | 771f11fd3c | | |
| | f42183eeda | | |
| | 92706bafb5 | | |
| | 723b353db8 | | |
| | 67cd96a523 | | |
| | cc16acbb12 | | |
| | afd5b8c60e | | |
| | 3fb424da3c | | |
| | c3f2911563 | | |
| | 05515745fd | | |
| | 3d3372bbb3 | | |
| | 3a139da440 | | |
| | 976db1b0fe | | |
| | 7593754866 | | |
| | 016882d527 | | |
| | 58feb9820d | | |
| | 35f30e9514 | | |
| | 985760be7c | | |
| | 983650a64f | | |
| | eaaefb66f2 | | |
| | bdab3c5e21 | | |
| | b4a9651e11 | | |
| | 1ea696d818 | | |
| | 28d6d222bd | | |
| | 72c2ae0ca0 | | |
| | f5d679808e | | |
| | 64a7a18721 | | |
| | 946aeac7c8 | | |
| | e1ad4432a7 | | |
| | 6eecc186f6 | | |
| | f3abbefac7 | | |
| | 683bfb324f | | |
| | fd9d656e13 | | |
| | 122c1ce939 | | |
| | 490c642bd9 | | |
| | d834d18c81 | | |
| | c8017c908d | | |
| | dfc3faffc3 | | |
| | 314f8d2621 | | |
| | e7df0b24a1 | | |
| | 951cc11a14 | | |
| | beb5cf461b | | |
6
.gitignore
vendored
@@ -10,6 +10,7 @@ __pycache__/
.venv/
venv/
env/
.claude/

# --- emacs ---
*~
@@ -22,5 +23,10 @@ env/
archive/


# --- scrapy ---
.scrapy/
output/

# --- misc ---
.DS_Store
*~$*
212
README.md
@@ -1,21 +1,5 @@

# Table of Contents

1. [Project Goals](#org5acb669)
    1. [Document and analyze sentiment](#org9291576)
    2. [Make data available](#org8054421)
    3. [Generalize](#orgdda4b6f)
2. [Architecture](#org1d6bc40)
    1. [Scraper](#org4298028)
    2. [Storage](#org1cd413c)
    3. [Analysis](#orgaea450e)
3. [Roadmap](#org6b7660d)



<a id="org5acb669"></a>

# Project Goals
## Project Goals

1. Document and analyze sentiment of public comments on Virginia law, to determine:
    1. the utility of this forum as a mechanism for public comment, and
@@ -23,131 +7,127 @@
2. Make data and insights broadly available.
3. Generalize to other public comment tools.


<a id="org9291576"></a>

## Document and analyze sentiment

- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
- Build tests for identifying abuse, such as spam and account fraud
- Identify any patterns connecting measured sentiment to VA decisions
Take a look at https://vatownhall.streamlit.app



<a id="org8054421"></a>
### Research questions

## Make data available

- Pick a good visualization tool
1. What is the quality of the comments on the forum?
    1. Are there duplicate entries?
    2. Are there non-human-generated entries?
    3. Are there entries intended to abuse the forum or drown out comment?
2. How do commenters feel about the proposed change?
    1. What is the total number and percent supporting vs opposing, and how does this change over time?
    2. What is the type of support, such as strong/weak, positive/negative?
3. What impact do the comments have on the proposed change?
    (I anticipate this will not be measurable from currently available data)


<a id="orgdda4b6f"></a>
<a id="orgfabfcd9"></a>

## Generalize
## Architecture

- Identify scalable ways to apply this toolset to similar problems
1. Scrape/Parse: Scrapy
2. Sentiment analysis: gpt-5.4-mini
3. Display: streamlit
4. Storage: jsonl, csv, parquet




<a id="org1d6bc40"></a>
<a id="org2c5c7a2"></a>

# Architecture
### Scraper

1. Scrape/Parse: **Scrapy** for downloading comments
2. Storage: json
3. Sentiment analysis: Claude haiku
4. Display: TBD
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.

1. Forums listing page: `Forums.cfm` lists all open forums with agency, reg title, action type, brief description, closing date, comment count
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` lists comments with title, author, date
3. Individual comment page: `viewcomments.cfm?commentid=X` shows regulation title + brief description at the top, plus the comment


<a id="org4298028"></a>
<a id="org72990f4"></a>

## Scraper
### Analysis

Scrapy provides a simple mechanism for browsing and
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.

1. Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
2. Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
3. Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and the prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.

Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to submit them, check their status, and download the results.

We selected gpt-5.4-mini for a good balance of quality, cost, and time.

1. Prompt
```
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
regulatory comment system. You will be given the text of a proposed regulation and a single
public comment. Return ONLY a JSON object — no other text.

Definitions:
- stance: the commenter's position on whether the regulation should be adopted.
    "support" = wants it approved (as-is or with changes);
    "oppose" = wants it rejected or substantially weakened;
    "neutral" = takes no position, asks a question, or provides factual input only;
    "unknown" = too vague, off-topic, or uninterpretable to classify.
- tone: the emotional register of the writing, independent of stance.
    "positive" = affirming, hopeful, appreciative;
    "negative" = angry, fearful, alarmed, or contemptuous;
    "neutral" = matter-of-fact, procedural, or informational;
    "mixed" = contains both positive and negative emotional content;
    "unclear" = tone cannot be determined (e.g., a one-word comment).
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
    "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
    "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.

Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
```
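
A reply that follows this prompt can be checked mechanically before it is stored. The sketch below is one minimal way to do that, assuming the model's reply arrives as a JSON string. `validate_classification` and its messages are hypothetical names for illustration, not code from this repository; the label sets are copied from the prompt definitions.

```python
import json

# Allowed labels and keys, copied from the prompt definitions.
STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_classification(raw: str) -> list[str]:
    """Return the problems found in one model reply (empty list means valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(data)}")
    stance = data.get("stance")
    if not (isinstance(stance, str) and stance in STANCES):
        problems.append("stance is not one of the allowed labels")
    tone = data.get("tone")
    if not (isinstance(tone, str) and tone in TONES):
        problems.append("tone is not one of the allowed labels")
    conf = data.get("stance_confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("stance_confidence is not a number in [0.0, 1.0]")
    tags = data.get("tags")
    if not isinstance(tags, list) or len(tags) > 5:
        problems.append("tags is not a list of at most 5 items")
    return problems
```

Returning a list of problems rather than raising keeps one malformed reply from stopping a whole batch; the caller decides whether to record the problems or retry.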


<a id="org1cd413c"></a>
<a id="org58a5b72"></a>

## Storage
### Storage

One JSONL file per forum/bill.
- Each scraped forum is saved to `output/<forum-id>.jsonl`
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
- Each job is saved to `analysis/jobs/<report-id>`:
└─`forum.jsonl` is a copy of the scraped forum for convenience
└─`prompt.txt` is a copy of the prompt used
└─`report.json` is a copy of the report used
└─`status.json` contains metadata about the job
For each batch in the job, four files are created:
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
└─`jobN-output-raw.jsonl` contains the exact response from the API
└─`jobN-output.jsonl` contains the normalized response used for analysis
└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
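
The per-batch naming scheme in the list above can be expressed as a small path helper. This is only a sketch of the layout as described here: `batch_files` and `missing_files` are hypothetical helpers, not functions from the repository.

```python
from pathlib import Path


def batch_files(jobs_root: Path, report_id: str, job_num: int) -> dict[str, Path]:
    """Map each per-batch file role to its expected path under the job directory."""
    job_dir = jobs_root / report_id
    prefix = f"job{job_num}"
    return {
        "input": job_dir / f"{prefix}-input.jsonl",
        "output_raw": job_dir / f"{prefix}-output-raw.jsonl",
        "output": job_dir / f"{prefix}-output.jsonl",
        "errors": job_dir / f"{prefix}-output-errors.jsonl",  # may not exist
    }


def missing_files(paths: dict[str, Path]) -> list[str]:
    """Roles whose file is absent, ignoring the optional errors file."""
    return [role for role, p in paths.items() if role != "errors" and not p.exists()]
```

For example, `batch_files(Path("analysis/jobs"), "f452-1", 2)["output"]` points at `analysis/jobs/f452-1/job2-output.jsonl`, which matches the files added in this comparison.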


<a id="orgaea450e"></a>
<a id="org24fe465"></a>

## Analysis
## Instructions

Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.

<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
1. Scrape the forum.
    `python`
2. Run model report.
    `python analysis/tokenizer.py <input> --prompt <prompt>`
3. To run a realtime subset:
    `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
    `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
4. To create and run the whole thing in batches, first create the batch jobs from the report:
    `python analysis/openai_batch.py create <report> --model <model>`
    `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model's enqueued-token limit is exceeded, the batch will fail, and resubmission is not implemented.
    `python analysis/openai_batch.py submit`
    `python analysis/openai_batch.py status`
    `python analysis/openai_batch.py download`
    `python analysis/openai_batch.py submit`


<colgroup>
<col class="org-left" />

<col class="org-left" />

<col class="org-left" />

<col class="org-left" />

<col class="org-left" />

<col class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Tool</th>
<th scope="col" class="org-left">Output</th>
<th scope="col" class="org-left">Context</th>
<th scope="col" class="org-left">Sarcasm</th>
<th scope="col" class="org-left">Context window</th>
<th scope="col" class="org-left">Cost/1k comments</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Google NL API</td>
<td class="org-left">-1→+1, magnitude</td>
<td class="org-left">No/generic</td>
<td class="org-left">Poorly</td>
<td class="org-left">No</td>
<td class="org-left">~$1–2</td>
</tr>

<tr>
<td class="org-left">Amazon Comprehend</td>
<td class="org-left">Pos/Neg/Neutral/Mixed</td>
<td class="org-left">No/generic</td>
<td class="org-left">Poorly</td>
<td class="org-left">No</td>
<td class="org-left">~$0.10</td>
</tr>

<tr>
<td class="org-left">Claude Haiku</td>
<td class="org-left">Prompted → for/against/neutral</td>
<td class="org-left">Yes</td>
<td class="org-left">Yes, with prompt</td>
<td class="org-left">Yes</td>
<td class="org-left">~$0.10–0.30</td>
</tr>

<tr>
<td class="org-left">GPT-4o-mini</td>
<td class="org-left">Prompted → same</td>
<td class="org-left">Yes</td>
<td class="org-left">Yes</td>
<td class="org-left">Yes</td>
<td class="org-left">~$0.05–0.15</td>
</tr>
</tbody>
</table>


<a id="org6b7660d"></a>
<a id="org5739d49"></a>

# Roadmap

26
agents.md
@@ -5,24 +5,24 @@
- prefer minimal diffs; avoid refactors unless required for the active task

## tech stack
- python; scrapy
- python; scrapy, pytest
- file storage: json or csv
- assume local virtual env is available and accessible
- do not add new dependencies unless explicitly approved; if unavoidable, document justification in the active task notes

## workflow
- prefer direct argv commands (no bash -lc / compound shell chains) unless necessary
- work on ONE task at a time unless explicitly instructed otherwise
- at the start of work, state the task id you are executing
- do not start work unless a task id is specified; if missing, choose the earliest unchecked task and say so
- propose incremental steps
- always include basic tests for core logic
- when you complete a task:
- prefer direct commands
- work on ONE task at a time unless explicitly instructed otherwise:
- at the start of work, state the task id you are executing
- do not start work unless a task id is specified; if missing, choose the earliest unchecked task and say so
- propose incremental steps
- always include basic tests for core logic
- when you complete a task:
- mark it [X] in docs/tasks.md
- fill in evidence with commit hash + commands run
- never mark complete unless acceptance criteria are met
- include date and time (HH:MM)

- follow this format:
```
* [ ] t1.1 Task Title (1)
Description and PM notes
@@ -36,5 +36,11 @@ Description and PM notes
** evidence
- commit:
- tests:
- datetime:
- date: [2026-05-05 Tue 15:00]
```

## tests and commands
- project dir: `%userprofile%\projects\vath\`
- python venv: `%userprofile%\projects\vath\venv\scripts\activate`
- pytest (inside venv): `python -m pytest tests/`
- create tests without `test_` prefix, i.e., `tests/tokenizer.py` not `tests/test_tokenizer.py`
76
analysis/create_csv.py
Normal file
@@ -0,0 +1,76 @@
#!/usr/bin/env python3
"""analysis/create_csv.py — join raw scrape with analysis output for review."""

import argparse
from pathlib import Path

import pandas as pd

RAW_COLS = ["forum_id", "comment_id", "title", "text", "date", "author"]
ANALYSIS_COLS = [
    "stance", "stance_confidence", "stance_rationale", "tone", "tags",
    "error", "truncated", "analyzed_at", "prompt_version", "model",
]
OUTPUT_COLS = RAW_COLS + ANALYSIS_COLS


def load_raw(path: Path) -> pd.DataFrame:
    df = pd.read_json(path, lines=True)
    df = df[df["comment_id"].notna()]  # rm first item (forum, not comment)
    for col in RAW_COLS:
        if col not in df.columns:
            df[col] = None
    return df[RAW_COLS].copy()


def load_analysis(jobs_dir: Path) -> pd.DataFrame:
    files = sorted(p for p in jobs_dir.glob("job*-output.jsonl") if "-raw" not in p.name)
    df = pd.concat([pd.read_json(p, lines=True) for p in files], ignore_index=True)
    for col in ANALYSIS_COLS:
        if col not in df.columns:
            df[col] = None
    return df[["comment_id"] + ANALYSIS_COLS].copy()


def join(raw: pd.DataFrame, analysis: pd.DataFrame) -> pd.DataFrame:
    return raw.merge(analysis, on="comment_id", how="left")[OUTPUT_COLS]


def print_counts(raw: pd.DataFrame, analysis: pd.DataFrame, merged: pd.DataFrame) -> None:
    print(f"\nRaw comments : {len(raw):,}")
    print(f"Analyzed : {len(analysis):,}")
    print(f"Joined : {merged['stance'].notna().sum():,}")
    print(f"Unanalyzed : {merged['stance'].isna().sum():,}")
    print(f"Errors : {analysis['error'].notna().sum():,}")
    print(f"Dup IDs (raw) : {raw['comment_id'].duplicated().sum():,}")
    print(f"\nStance:\n{analysis['stance'].value_counts(dropna=False).to_string()}")
    print(f"\nTone:\n{analysis['tone'].value_counts(dropna=False).to_string()}\n")


def main() -> None:
    p = argparse.ArgumentParser(
        description="Join raw scrape JSONL with analysis output; write review CSV."
    )
    p.add_argument("input", help="Raw scrape JSONL (e.g. output/f452.jsonl)")
    p.add_argument("jobs_dir", help="Job directory containing job*-output.jsonl files")
    p.add_argument("--parquet", action="store_true", help="Also write review.parquet")
    p.add_argument("--out", default=None, help="Output CSV path (default: <jobs_dir>/review.csv)")
    args = p.parse_args()

    raw = load_raw(Path(args.input))
    analysis = load_analysis(Path(args.jobs_dir))
    merged = join(raw, analysis)
    print_counts(raw, analysis, merged)

    out = Path(args.out) if args.out else Path(args.jobs_dir) / "review.csv"
    merged.to_csv(out, index=False, encoding="utf-8-sig")
    print(f"CSV → {out}")

    if args.parquet:
        pq = out.with_suffix(".parquet")
        merged.to_parquet(pq, index=False)
        print(f"Parquet → {pq}")


if __name__ == "__main__":
    main()
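
The `how="left"` merge in `join` keeps every raw comment and leaves the analysis columns empty where no result exists, which is what the `Unanalyzed` count relies on. The sketch below reproduces that behavior with plain dictionaries so it can be read without pandas. `left_join` and the sample rows are illustrative only, and the sketch assumes `comment_id` is unique on the analysis side.

```python
def left_join(raw_rows, analysis_rows, key="comment_id"):
    """Left-join two lists of dicts on `key`; unmatched raw rows get stance=None."""
    # Index analysis rows by key (assumes at most one analysis row per comment).
    by_id = {row[key]: row for row in analysis_rows}
    merged = []
    for row in raw_rows:
        match = by_id.get(row[key], {})
        merged.append({**row, "stance": match.get("stance")})
    return merged


# Illustrative data: comment 2 was scraped but never analyzed.
raw = [{"comment_id": 1, "text": "a"}, {"comment_id": 2, "text": "b"}]
analysis = [{"comment_id": 1, "stance": "oppose"}]

merged = left_join(raw, analysis)
unanalyzed = sum(1 for row in merged if row["stance"] is None)
```

One difference worth keeping in mind: if a `comment_id` appeared more than once in the analysis output, a pandas merge would emit one row per match, so the merged frame could end up longer than the raw one.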
74
analysis/encoding.py
Normal file
@@ -0,0 +1,74 @@
"""
analysis/encoding.py — text encoding repair for scraped content.

The townhall.virginia.gov scraper forces UTF-8 decoding, which is correct for the
site's current content. This module provides a defensive repair function for cases
where a response arrives with Windows-1252/cp1252 bytes embedded in otherwise UTF-8
content (common in older CMSes). The raw scrape files are never modified; repair is
applied at the analysis and reporting layers only.

Primary: uses `ftfy` when installed (pip install ftfy).
Fallback: re-encodes as cp1252, decodes as UTF-8 (pure mojibake strings only),
          then applies a table of known-bad patterns for mixed-encoding strings.
"""

# ---------------------------------------------------------------------------
# Known patterns: UTF-8 bytes decoded as cp1252, i.e. the 3-char sequences you
# see when a server sends e.g. E2 80 99 and it gets decoded as cp1252 chars.
#
# Byte → cp1252 char mappings for the 0x80–0x9F range:
#   E2 → â (U+00E2, always)
#   80 → € (U+20AC, cp1252 0x80)
#   99 → ™ (U+2122, cp1252 0x99) ← E2 80 99 = U+2019 ’ right single quote
#   98 → ˜ (U+02DC, cp1252 0x98) ← E2 80 98 = U+2018 ‘ left single quote
#   9C → œ (U+0153, cp1252 0x9C) ← E2 80 9C = U+201C “ left double quote
#   9D → \x9d (undefined → U+009D) ← E2 80 9D = U+201D ” right double quote
#   93 → “ (U+201C, cp1252 0x93) ← E2 80 93 = U+2013 – en dash
#   94 → ” (U+201D, cp1252 0x94) ← E2 80 94 = U+2014 — em dash
#   A6 → ¦ (U+00A6, cp1252 0xA6) ← E2 80 A6 = U+2026 … ellipsis

# Patterns are written as \u escapes so the table survives re-encoding of this file.
_KNOWN_REPAIRS: list[tuple[str, str]] = [
    # Longer / more specific patterns first to avoid partial matches
    ("\u00e2\u20ac\u2122", "\u2019"),  # â€™ → right single quote
    ("\u00e2\u20ac\u02dc", "\u2018"),  # â€˜ → left single quote
    ("\u00e2\u20ac\u0153", "\u201c"),  # â€œ → left double quote
    ("\u00e2\u20ac\u009d", "\u201d"),  # â€\x9d → right double quote
    ("\u00e2\u20ac\u201c", "\u2013"),  # â€ + “ (left DQ) → en dash
    ("\u00e2\u20ac\u201d", "\u2014"),  # â€ + ” (right DQ) → em dash
    ("\u00e2\u20ac\u00a6", "\u2026"),  # â€¦ → ellipsis
    # Generic fallback: bare â€ prefix not caught above → remove artifact
    ("\u00e2\u20ac", ""),
]


def repair_text(text: str) -> str:
    """Repair common encoding artifacts in scraped text.

    Handles:
    - UTF-8 bytes decoded as cp1252/Latin-1 (â€™ → ’)
    - Attempts best-effort cleanup for mixed-encoding strings

    U+FFFD replacement characters (from strict UTF-8 decoding of cp1252 bytes)
    cannot be recovered since the original byte is lost; they are left as-is.
    """
    if not text:
        return text

    try:
        import ftfy
        return ftfy.fix_text(text)
    except ImportError:
        pass

    # Fallback 1: pure mojibake — entire string is UTF-8 bytes read as cp1252.
    # Re-encode as cp1252 and decode as UTF-8.
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass

    # Fallback 2: mixed strings — substitute known-bad patterns.
    for bad, good in _KNOWN_REPAIRS:
        if bad in text:
            text = text.replace(bad, good)
    return text
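
The byte table in `encoding.py` can be reproduced with the standard codecs alone. The sketch below simulates the fault for a right single quote, reverses it the way "Fallback 1" does, and then shows why a right double quote needs the pattern table instead. It assumes Python's strict `cp1252` codec, which leaves byte 0x9D undefined; the sample string is made up.

```python
original = "It\u2019s"  # "It's" written with U+2019, the right single quote

# Simulate the fault: UTF-8 bytes (E2 80 99) read back as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

# Fallback 1 from repair_text: reverse the mistake.
repaired = garbled.encode("cp1252").decode("utf-8")

# A right double quote is E2 80 9D in UTF-8. Byte 0x9D has no cp1252 mapping,
# so the same simulation fails under Python's strict cp1252 codec.
try:
    "\u201d".encode("utf-8").decode("cp1252")
    roundtrip_ok = True
except UnicodeDecodeError:
    roundtrip_ok = False
```

Here `garbled` is the familiar three-character sequence from the table (`â`, `€`, `™`), and `roundtrip_ok` ends up `False`, which is consistent with the module falling through to its pattern table for strings that contain U+009D.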
9084
analysis/jobs/f452-1/forum.jsonl
Normal file
File diff suppressed because one or more lines are too long
2270
analysis/jobs/f452-1/job1-input.jsonl
Normal file
File diff suppressed because one or more lines are too long
2270
analysis/jobs/f452-1/job1-output-raw.jsonl
Normal file
File diff suppressed because it is too large
2270
analysis/jobs/f452-1/job1-output.jsonl
Normal file
File diff suppressed because it is too large
2274
analysis/jobs/f452-1/job2-input.jsonl
Normal file
File diff suppressed because one or more lines are too long
2274
analysis/jobs/f452-1/job2-output-raw.jsonl
Normal file
File diff suppressed because it is too large
2274
analysis/jobs/f452-1/job2-output.jsonl
Normal file
File diff suppressed because it is too large
2282
analysis/jobs/f452-1/job3-input.jsonl
Normal file
File diff suppressed because one or more lines are too long
2282
analysis/jobs/f452-1/job3-output-raw.jsonl
Normal file
File diff suppressed because it is too large
2282
analysis/jobs/f452-1/job3-output.jsonl
Normal file
File diff suppressed because it is too large
2257
analysis/jobs/f452-1/job4-input.jsonl
Normal file
File diff suppressed because one or more lines are too long
2257
analysis/jobs/f452-1/job4-output-raw.jsonl
Normal file
File diff suppressed because it is too large
2257
analysis/jobs/f452-1/job4-output.jsonl
Normal file
File diff suppressed because it is too large
23
analysis/jobs/f452-1/prompt.txt
Normal file
@@ -0,0 +1,23 @@
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
regulatory comment system. You will be given the text of a proposed regulation and a single
public comment. Return ONLY a JSON object — no other text.

Definitions:
- stance: the commenter's position on whether the regulation should be adopted.
    "support" = wants it approved (as-is or with changes);
    "oppose" = wants it rejected or substantially weakened;
    "neutral" = takes no position, asks a question, or provides factual input only;
    "unknown" = too vague, off-topic, or uninterpretable to classify.
- tone: the emotional register of the writing, independent of stance.
    "positive" = affirming, hopeful, appreciative;
    "negative" = angry, fearful, alarmed, or contemptuous;
    "neutral" = matter-of-fact, procedural, or informational;
    "mixed" = contains both positive and negative emotional content;
    "unclear" = tone cannot be determined (e.g., a one-word comment).
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
    "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
    "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.

Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
43
analysis/jobs/f452-1/report.json
Normal file
@@ -0,0 +1,43 @@
{
  "prompt": "analysis\\prompt-1.txt",
  "prompt_hash": "cb41250",
  "input_file": "output\\f452.jsonl",
  "input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
  "total_comments": 9083,
  "input_tokens": 6397254,
  "gpt-5.5": {
    "jobs": 9,
    "cost_$": 15.9931,
    "est_queue_days": 7.11
  },
  "gpt-5.4": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-5.4-mini": {
    "jobs": 4,
    "cost_$": 2.399,
    "est_queue_days": 3.2
  },
  "gpt-5.4-nano": {
    "jobs": 40,
    "cost_$": 0.6397,
    "est_queue_days": 31.99
  },
  "gpt-4o": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-4o-mini": {
    "jobs": 4,
    "cost_$": 0.4798,
    "est_queue_days": 3.2
  },
  "gpt-o4-mini": {
    "jobs": 4,
    "cost_$": 3.5185,
    "est_queue_days": 3.2
  }
}
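
The `jobs` and `est_queue_days` figures in `report.json` line up with simple arithmetic on `input_tokens` and the per-model limits in `openai_batch.py`. The sketch below is a reading of the numbers, not the code from `tokenizer.py` (which is not shown in this comparison): it assumes jobs are filled to 80% of the enqueued-token limit and that one full limit's worth of tokens clears per day. Both function names are hypothetical.

```python
import math

INPUT_TOKENS = 6_397_254   # "input_tokens" in report.json
LIMIT_BUFFER = 0.80        # _LIMIT_BUFFER in openai_batch.py
# Enqueued-token limits copied from MODEL_LIMITS in openai_batch.py.
LIMITS = {"gpt-5.4": 900_000, "gpt-5.4-mini": 2_000_000, "gpt-5.4-nano": 200_000}


def lower_bound_jobs(tokens: int, limit: int) -> int:
    """Fewest jobs possible if every job is filled to the buffered limit."""
    return math.ceil(tokens / (limit * LIMIT_BUFFER))


def queue_days(tokens: int, limit: int) -> float:
    """Queue time in days, assuming one full limit of tokens is processed per day."""
    return round(tokens / limit, 2)


jobs = {model: lower_bound_jobs(INPUT_TOKENS, lim) for model, lim in LIMITS.items()}
days = {model: queue_days(INPUT_TOKENS, lim) for model, lim in LIMITS.items()}
```

Under these assumptions the three models come out at 9, 4, and 40 jobs and 7.11, 3.2, and 31.99 days, matching the report. The greedy packing in `chunk_comments_by_tokens` cannot produce fewer chunks than this bound, though in general it could produce more.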
9091
analysis/jobs/f452-1/review.csv
Normal file
File diff suppressed because one or more lines are too long
BIN
analysis/jobs/f452-1/review.xlsx
Normal file
Binary file not shown.
57
analysis/jobs/f452-1/status.json
Normal file
@@ -0,0 +1,57 @@
{
  "model": "gpt-5.4-mini",
  "prompt_hash": "cb41250",
  "input_file": "output\\f452.jsonl",
  "input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
  "total_comments": 9083,
  "input_tokens": 6397254,
  "est_queue_days": 3.2,
  "cost_$": 2.399,
  "total_jobs": 4,
  "jobs": [
    {
      "job_num": 1,
      "run_id": "76c97113-63aa-43db-8f84-9c60ebcbb105",
      "status": "completed",
      "batch_id": "batch_69fb9081639881909be0c40d86edd747",
      "records_submitted": 2270,
      "records_completed": 2270,
      "records_failed": 0,
      "submitted_at": "2026-05-06T19:03:28.949240+00:00",
      "completed_at": "2026-05-06T20:09:14+00:00"
    },
    {
      "job_num": 2,
      "run_id": "b8f3b0bb-f155-4a5c-acce-f3504c0e09aa",
      "status": "completed",
      "batch_id": "batch_69fba02df7b481909e96afa1ee8879f5",
      "records_submitted": 2274,
      "records_completed": 2274,
      "records_failed": 0,
      "submitted_at": "2026-05-06T20:10:21.424330+00:00",
      "completed_at": "2026-05-06T20:37:11+00:00"
    },
    {
      "job_num": 3,
      "run_id": "8d769f37-6beb-4a1b-87ee-3f66cdc6adc8",
      "status": "completed",
      "batch_id": "batch_69fba69a85488190977792b6f95b614b",
      "records_submitted": 2282,
      "records_completed": 2282,
      "records_failed": 0,
      "submitted_at": "2026-05-06T20:37:45.586815+00:00",
      "completed_at": "2026-05-06T21:09:24+00:00"
    },
    {
      "job_num": 4,
      "run_id": "e6affbc2-ddc9-43a6-b8e9-d1f47e736283",
      "status": "completed",
      "batch_id": "batch_69fbe44565748190ad19f17ee3143f8d",
      "records_submitted": 2257,
      "records_completed": 2257,
      "records_failed": 0,
      "submitted_at": "2026-05-07T01:00:52.886953+00:00",
      "completed_at": "2026-05-07T09:20:01+00:00"
    }
  ]
}
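
Two quick consistency checks can be run against `status.json`: the per-job record counts should add up to `total_comments`, and the timestamps give the wall-clock time each batch took. The sketch below hard-codes the values from the file rather than reading it, so it stays self-contained; the variable names are illustrative.

```python
from datetime import datetime

# (records_submitted, submitted_at, completed_at) copied from status.json.
JOBS = [
    (2270, "2026-05-06T19:03:28.949240+00:00", "2026-05-06T20:09:14+00:00"),
    (2274, "2026-05-06T20:10:21.424330+00:00", "2026-05-06T20:37:11+00:00"),
    (2282, "2026-05-06T20:37:45.586815+00:00", "2026-05-06T21:09:24+00:00"),
    (2257, "2026-05-07T01:00:52.886953+00:00", "2026-05-07T09:20:01+00:00"),
]

# Should equal "total_comments" in status.json.
total_records = sum(n for n, _, _ in JOBS)

# Whole minutes each batch spent between submission and completion.
minutes = [
    int((datetime.fromisoformat(done) - datetime.fromisoformat(sent)).total_seconds() // 60)
    for _, sent, done in JOBS
]
```

The counts sum to 9083, matching `total_comments`, and the four batches took roughly 65, 26, 31, and 499 minutes, all far below the 3.2-day `est_queue_days` estimate recorded for this run.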
624
analysis/openai_batch.py
Normal file
@@ -0,0 +1,624 @@
#!/usr/bin/env python3
"""
openai_batch.py — OpenAI Batch API job runner

Run tokenizer.py first to generate report.json, then:
  create <report.json> --model <model> — build job directory
  submit [--job N] [--dir DIR] — submit next eligible job
  status [--job N] [--dir DIR] — check job status
  download [--job N] [--dir DIR] — download + normalize completed jobs

DIR is a name under analysis/jobs/ (default: most recently created).
"""

import argparse
import hashlib
import json
import os
import shutil
import sys
import uuid
from datetime import datetime, timezone
from pathlib import Path

from dotenv import load_dotenv

try:
    import openai
except ImportError:
    sys.exit("openai package not installed. Run: pip install openai")

# ---------------------------------------------------------------------------
# Model limits and token estimation

# Max enqueued tokens across ALL concurrent batches (docs/openai.md, 2026-05-05).
# Org-tier limits may be lower; use --job to limit submission size if needed.
MODEL_LIMITS: dict[str, int] = {
    "gpt-5.5": 900_000,
    "gpt-5.4": 900_000,
    "gpt-5.4-mini": 2_000_000,
    "gpt-5.4-nano": 200_000,
    "gpt-4o": 900_000,
    "gpt-4o-mini": 2_000_000,
    "gpt-o4-mini": 2_000_000,
}
_DEFAULT_TOKEN_LIMIT = 900_000
_MODEL_ENCODING: dict[str, str] = {
    "gpt-5.5": "o200k_base",
    "gpt-5.4": "o200k_base",
    "gpt-5.4-mini": "o200k_base",
    "gpt-5.4-nano": "o200k_base",
    "gpt-4o": "o200k_base",
    "gpt-4o-mini": "o200k_base",
    "gpt-o4-mini": "o200k_base",
}
_LIMIT_BUFFER = 0.80


def estimate_tokens(messages: list[dict], model: str) -> int:
    """Token count per OpenAI cookbook chat formula; falls back to chars/3."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding(_MODEL_ENCODING.get(model, "o200k_base"))
        # Per OpenAI cookbook for gpt-4o: 3 overhead per message + role + content;
        # plus 3 tokens for the reply primer (<|start|>assistant<|message|>).
        total = 3  # reply primer
        for m in messages:
            total += 3
            total += len(enc.encode(m.get("role", "")))
            total += len(enc.encode(m["content"]))
        return total
    except ImportError:
        return 3 + sum(3 + len(m["content"]) // 3 for m in messages)


def chunk_comments_by_tokens(
    comments: list[dict], forum: dict | None, model: str
) -> list[list[dict]]:
    """Greedy bin-pack comments into chunks that fit under the model TPD limit."""
    token_limit = int(MODEL_LIMITS.get(model, _DEFAULT_TOKEN_LIMIT) * _LIMIT_BUFFER)
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_tokens = 0
    for comment in comments:
        messages, _ = build_messages(comment, forum)
        tokens = estimate_tokens(messages, model)
        if current and current_tokens + tokens > token_limit:
            chunks.append(current)
            current = [comment]
            current_tokens = tokens
        else:
            current.append(comment)
            current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


# ---------------------------------------------------------------------------
# Prompt

_DEFAULT_PROMPT_FILE = Path(__file__).parent / "prompt-1.txt"
SYSTEM_PROMPT = _DEFAULT_PROMPT_FILE.read_text(encoding="utf-8").strip()
PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]


def _load_prompt(path: Path) -> None:
    global SYSTEM_PROMPT, PROMPT_VERSION
    SYSTEM_PROMPT = path.read_text(encoding="utf-8").strip()
    PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]


USER_TEMPLATE = """\
## Proposed Regulation
Title: {reg_title}
Description: {reg_desc}

---

## Public Comment
Comment ID: {comment_id}
Title: {comment_title}
Body:
{comment_text}

---
Classify this comment per the instructions. Return only JSON.\
"""

MAX_COMMENT_CHARS = 6000

# ---------------------------------------------------------------------------
# Directories

_SCRIPT_DIR = Path(__file__).parent
JOBS_DIR = _SCRIPT_DIR / "jobs"

# ---------------------------------------------------------------------------
# Core functions (importable for tests)


def load_items(path: Path) -> tuple[dict | None, list[dict]]:
    """Read a scraped JSONL. Returns (forum_item_or_None, [comment_items])."""
    forum = None
    comments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            if "comment_id" in item:
                comments.append(item)
            elif "reg_title" in item:
|
||||
forum = item
|
||||
return forum, comments
|
||||
|
||||
|
||||
def custom_id_from(comment_id: str) -> str:
|
||||
return f"comment_{comment_id}"
|
||||
|
||||
|
||||
def parse_custom_id(custom_id: str) -> str:
|
||||
return custom_id.removeprefix("comment_")
|
||||
|
||||
|
||||
def build_messages(comment: dict, forum: dict | None) -> tuple[list, bool]:
|
||||
"""Build OpenAI messages for one comment. Returns (messages, truncated)."""
|
||||
reg_title = (forum or {}).get("reg_title", "[unknown]")
|
||||
reg_desc = (forum or {}).get("reg_desc", "[unknown]")
|
||||
body = (comment.get("text") or "").strip()
|
||||
truncated = False
|
||||
if not body:
|
||||
body = "[No body text provided]"
|
||||
elif len(body) > MAX_COMMENT_CHARS:
|
||||
body = body[:MAX_COMMENT_CHARS] + "... [truncated]"
|
||||
truncated = True
|
||||
user_text = USER_TEMPLATE.format(
|
||||
reg_title=reg_title,
|
||||
reg_desc=reg_desc,
|
||||
comment_id=comment.get("comment_id", ""),
|
||||
comment_title=comment.get("title", ""),
|
||||
comment_text=body,
|
||||
)
|
||||
return [
|
||||
{"role": "system", "content": SYSTEM_PROMPT},
|
||||
{"role": "user", "content": user_text},
|
||||
], truncated
|
||||
|
||||
|
||||
def build_batch_request_line(comment: dict, forum: dict | None, model: str) -> dict:
|
||||
messages, _ = build_messages(comment, forum)
|
||||
return {
|
||||
"custom_id": custom_id_from(comment["comment_id"]),
|
||||
"method": "POST",
|
||||
"url": "/v1/chat/completions",
|
||||
"body": {
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"response_format": {"type": "json_object"},
|
||||
"temperature": 0.0,
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def normalize_output_line(
    raw_line: dict,
    comment_lookup: dict,
    run_id: str,
    analyzed_at: str,
    model: str,
    prompt_version: str,
) -> dict:
    """Convert one raw batch output line into a normalized analysis record."""
    comment_id = parse_custom_id(raw_line.get("custom_id", ""))
    comment = comment_lookup.get(comment_id, {})
    keys = ("stance", "stance_confidence", "stance_rationale", "tone", "tags")
    base = {
        "run_id": run_id,
        "forum_id": comment.get("forum_id", ""),
        "comment_id": comment_id,
        "analyzed_at": analyzed_at,
        "model": model,
        "prompt_version": prompt_version,
        "input_title": comment.get("title", ""),
        # Measure the stripped text, as build_messages() does, so this flag
        # matches what was actually sent to the model.
        "truncated": len((comment.get("text") or "").strip()) > MAX_COMMENT_CHARS,
    }
    # Every failure path returns the same shape with all analysis fields null.
    failed = {**base, **dict.fromkeys(keys)}
    if raw_line.get("error"):
        err = raw_line["error"]
        err_msg = err.get("message", str(err)) if isinstance(err, dict) else str(err)
        return {**failed, "error": err_msg}
    response = raw_line.get("response") or {}
    if response.get("status_code") != 200:
        return {**failed, "error": f"status {response.get('status_code')}"}
    try:
        content = response["body"]["choices"][0]["message"]["content"]
        data = json.loads(content)
        return {**base, **{k: data.get(k) for k in keys}, "error": None}
    except Exception as exc:
        return {**failed, "error": str(exc)}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Job directory management
|
||||
|
||||
|
||||
def _next_job_dir(stem: str) -> Path:
|
||||
base = stem[:8]
|
||||
i = 1
|
||||
while (JOBS_DIR / f"{base}-{i}").exists():
|
||||
i += 1
|
||||
return JOBS_DIR / f"{base}-{i}"
|
||||
|
||||
|
||||
def _latest_job_dir() -> Path:
|
||||
if not JOBS_DIR.exists():
|
||||
sys.exit("No jobs directory found. Run 'create' first.")
|
||||
status_files = list(JOBS_DIR.glob("*/status.json"))
|
||||
if not status_files:
|
||||
sys.exit(f"No jobs found in {JOBS_DIR}. Run 'create' first.")
|
||||
return max(status_files, key=lambda p: p.stat().st_mtime).parent
|
||||
|
||||
|
||||
def _resolve_job_dir(args) -> Path:
|
||||
if getattr(args, "dir", None):
|
||||
d = Path(args.dir)
|
||||
if not d.is_absolute():
|
||||
d = JOBS_DIR / d
|
||||
if not d.exists():
|
||||
sys.exit(f"Job directory not found: {d}")
|
||||
return d
|
||||
return _latest_job_dir()
|
||||
|
||||
|
||||
def load_status(job_dir: Path) -> dict:
|
||||
return json.loads((job_dir / "status.json").read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def save_status(status: dict, job_dir: Path) -> None:
|
||||
(job_dir / "status.json").write_text(
|
||||
json.dumps(status, indent=2, ensure_ascii=False), encoding="utf-8"
|
||||
)
|
||||
|
||||
|
||||
def _find_next_eligible_job(jobs: list[dict]) -> tuple[dict | None, str | None]:
|
||||
"""Return (next_pending_job, None) or (None, warning_message).
|
||||
|
||||
A job is eligible when it is 'pending' and either it is the first job
|
||||
or its predecessor is 'completed'.
|
||||
"""
|
||||
for j in jobs:
|
||||
if j["status"] != "pending":
|
||||
continue
|
||||
if j["job_num"] == 1:
|
||||
return j, None
|
||||
prev = next(p for p in jobs if p["job_num"] == j["job_num"] - 1)
|
||||
if prev["status"] == "completed":
|
||||
return j, None
|
||||
if prev["status"] in ("submitted", "in_progress", "validating", "finalizing"):
|
||||
return None, (
|
||||
f"Job {prev['job_num']} is '{prev['status']}'. "
|
||||
f"Wait for it to complete before submitting job {j['job_num']}."
|
||||
)
|
||||
return None, None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Subcommand: create
|
||||
|
||||
|
||||
def cmd_create(args) -> None:
|
||||
report_path = Path(args.report)
|
||||
if not report_path.exists():
|
||||
sys.exit(f"Report not found: {report_path}")
|
||||
|
||||
report = json.loads(report_path.read_text(encoding="utf-8"))
|
||||
|
||||
if args.model not in report or not isinstance(report[args.model], dict):
|
||||
available = [k for k in report if isinstance(report.get(k), dict)]
|
||||
sys.exit(f"Model '{args.model}' not in report. Available: {', '.join(available)}")
|
||||
|
||||
prompt_path = Path(report["prompt"])
|
||||
if not prompt_path.exists():
|
||||
sys.exit(f"Prompt file not found: {prompt_path}")
|
||||
_load_prompt(prompt_path)
|
||||
|
||||
input_path = Path(report["input_file"])
|
||||
if not input_path.exists():
|
||||
sys.exit(f"Input file not found: {input_path}")
|
||||
forum, comments = load_items(input_path)
|
||||
if not comments:
|
||||
sys.exit("No comment items found in input file.")
|
||||
|
||||
chunks = chunk_comments_by_tokens(comments, forum, args.model)
|
||||
|
||||
stem = input_path.stem[:8]
|
||||
job_dir = _next_job_dir(stem)
|
||||
JOBS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
job_dir.mkdir()
|
||||
|
||||
shutil.copy2(input_path, job_dir / "forum.jsonl")
|
||||
shutil.copy2(prompt_path, job_dir / "prompt.txt")
|
||||
shutil.copy2(report_path, job_dir / "report.json")
|
||||
|
||||
jobs_meta = []
|
||||
for i, chunk in enumerate(chunks, start=1):
|
||||
req_path = job_dir / f"job{i}-input.jsonl"
|
||||
with open(req_path, "w", encoding="utf-8") as f:
|
||||
for comment in chunk:
|
||||
f.write(json.dumps(build_batch_request_line(comment, forum, args.model),
|
||||
ensure_ascii=False) + "\n")
|
||||
jobs_meta.append({
|
||||
"job_num": i,
|
||||
"run_id": str(uuid.uuid4()),
|
||||
"status": "pending",
|
||||
"batch_id": None,
|
||||
"records_submitted": len(chunk),
|
||||
"records_completed": None,
|
||||
"records_failed": None,
|
||||
"submitted_at": None,
|
||||
"completed_at": None,
|
||||
})
|
||||
|
||||
model_info = report[args.model]
|
||||
status = {
|
||||
"model": args.model,
|
||||
"prompt_hash": report["prompt_hash"],
|
||||
"input_file": str(input_path),
|
||||
"input_sha256": report["input_sha256"],
|
||||
"total_comments": report["total_comments"],
|
||||
"input_tokens": report["input_tokens"],
|
||||
"est_queue_days": model_info["est_queue_days"],
|
||||
"cost_$": model_info["cost_$"],
|
||||
"total_jobs": len(chunks),
|
||||
"jobs": jobs_meta,
|
||||
}
|
||||
save_status(status, job_dir)
|
||||
|
||||
print(f"Created: {job_dir.name}")
|
||||
print(f" {len(chunks)} job(s) | {len(comments)} comments | model: {args.model}")
|
||||
print("\nNext: python analysis/openai_batch.py submit")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Subcommand: submit
|
||||
|
||||
|
||||
def cmd_submit(args, client) -> None:
|
||||
job_dir = _resolve_job_dir(args)
|
||||
status = load_status(job_dir)
|
||||
jobs = status["jobs"]
|
||||
|
||||
if args.job:
|
||||
target = next((j for j in jobs if j["job_num"] == args.job), None)
|
||||
if target is None:
|
||||
sys.exit(f"Job {args.job} not found in {job_dir.name}.")
|
||||
if target["status"] != "pending":
|
||||
sys.exit(f"Job {args.job} is already '{target['status']}' — cannot resubmit.")
|
||||
if target["job_num"] > 1:
|
||||
prev = next(p for p in jobs if p["job_num"] == target["job_num"] - 1)
|
||||
if prev["status"] != "completed":
|
||||
sys.exit(
|
||||
f"Cannot submit job {target['job_num']}: "
|
||||
f"job {prev['job_num']} is '{prev['status']}' (must be 'completed')."
|
||||
)
|
||||
else:
|
||||
target, warning = _find_next_eligible_job(jobs)
|
||||
if warning:
|
||||
print(warning, file=sys.stderr)
|
||||
sys.exit(1)
|
||||
if target is None:
|
||||
all_done = all(j["status"] == "completed" for j in jobs)
|
||||
print("All jobs completed." if all_done else "No pending jobs eligible for submission.")
|
||||
return
|
||||
|
||||
n = target["job_num"]
|
||||
req_path = job_dir / f"job{n}-input.jsonl"
|
||||
print(f"Submitting job {n}/{status['total_jobs']} ({target['records_submitted']} comments) ...",
|
||||
file=sys.stderr)
|
||||
|
||||
with open(req_path, "rb") as f:
|
||||
uploaded = client.files.create(file=f, purpose="batch")
|
||||
|
||||
batch = client.batches.create(
|
||||
input_file_id=uploaded.id,
|
||||
endpoint="/v1/chat/completions",
|
||||
completion_window="24h",
|
||||
metadata={"run_id": target["run_id"], "job_dir": job_dir.name},
|
||||
)
|
||||
|
||||
target["status"] = "submitted"
|
||||
target["batch_id"] = batch.id
|
||||
target["submitted_at"] = datetime.now(timezone.utc).isoformat()
|
||||
save_status(status, job_dir)
|
||||
|
||||
print(f"Job {n} submitted: {batch.id} ({batch.status})")
|
||||
print(" python analysis/openai_batch.py status")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Subcommand: status
|
||||
|
||||
|
||||
def cmd_status(args, client) -> None:
|
||||
job_dir = _resolve_job_dir(args)
|
||||
status = load_status(job_dir)
|
||||
jobs = status["jobs"]
|
||||
|
||||
job_filter = getattr(args, "job", None)
|
||||
|
||||
for job in jobs:
|
||||
if job_filter is not None and job["job_num"] != job_filter:
|
||||
continue
|
||||
if not job["batch_id"]:
|
||||
continue
|
||||
if job["status"] in ("completed", "failed", "expired", "cancelled", "pending"):
|
||||
continue
|
||||
batch = client.batches.retrieve(job["batch_id"])
|
||||
counts = batch.request_counts
|
||||
if batch.status == "completed":
|
||||
job["status"] = "completed"
|
||||
if batch.completed_at:
|
||||
job["completed_at"] = datetime.fromtimestamp(
|
||||
batch.completed_at, tz=timezone.utc
|
||||
).isoformat()
|
||||
elif batch.status in ("failed", "expired", "cancelled"):
|
||||
job["status"] = batch.status
|
||||
else:
|
||||
job["status"] = batch.status
|
||||
job["records_completed"] = counts.completed
|
||||
job["records_failed"] = counts.failed
|
||||
|
||||
save_status(status, job_dir)
|
||||
|
||||
target_jobs = jobs if not job_filter else [j for j in jobs if j["job_num"] == job_filter]
|
||||
print(f"Dir: {job_dir.name} | Model: {status['model']} | {status['total_jobs']} job(s)")
|
||||
print(f"{'Job':<5} {'Status':<14} {'Records':>12} {'Submitted':<20} {'Completed':<20}")
|
||||
print("-" * 76)
|
||||
for j in target_jobs:
|
||||
rec = (f"{j['records_completed']}/{j['records_submitted']}"
|
||||
if j["records_completed"] is not None else f"-/{j['records_submitted']}")
|
||||
sub = (j["submitted_at"] or "-")[:19]
|
||||
done = (j["completed_at"] or "-")[:19]
|
||||
print(f"{j['job_num']:<5} {j['status']:<14} {rec:>12} {sub:<20} {done:<20}")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Subcommand: download
|
||||
|
||||
|
||||
def cmd_download(args, client) -> None:
|
||||
job_dir = _resolve_job_dir(args)
|
||||
|
||||
# Refresh status before deciding what to download
|
||||
cmd_status(args, client)
|
||||
status = load_status(job_dir)
|
||||
jobs = status["jobs"]
|
||||
|
||||
job_filter = getattr(args, "job", None)
|
||||
if job_filter:
|
||||
candidates = [j for j in jobs if j["job_num"] == job_filter]
|
||||
else:
|
||||
candidates = [
|
||||
j for j in jobs
|
||||
if j["status"] == "completed"
|
||||
and not (job_dir / f"job{j['job_num']}-output.jsonl").exists()
|
||||
]
|
||||
|
||||
if not candidates:
|
||||
print("No completed jobs pending download.", file=sys.stderr)
|
||||
return
|
||||
|
||||
_, all_comments = load_items(job_dir / "forum.jsonl")
|
||||
comment_lookup = {c["comment_id"]: c for c in all_comments}
|
||||
|
||||
for job in candidates:
|
||||
n = job["job_num"]
|
||||
|
||||
if job["status"] != "completed":
|
||||
print(f"Job {n} not yet completed ('{job['status']}'), skipping.", file=sys.stderr)
|
||||
continue
|
||||
|
||||
batch = client.batches.retrieve(job["batch_id"])
|
||||
|
||||
if not batch.output_file_id:
|
||||
print(f"Job {n}: no output file available from OpenAI.", file=sys.stderr)
|
||||
continue
|
||||
|
||||
raw_text = client.files.content(batch.output_file_id).text
|
||||
raw_path = job_dir / f"job{n}-output-raw.jsonl"
|
||||
raw_path.write_text(raw_text, encoding="utf-8")
|
||||
print(f"Job {n} raw → {raw_path.name}", file=sys.stderr)
|
||||
|
||||
if batch.error_file_id:
|
||||
err_text = client.files.content(batch.error_file_id).text
|
||||
err_path = job_dir / f"job{n}-errors.jsonl"
|
||||
err_path.write_text(err_text, encoding="utf-8")
|
||||
n_err_lines = sum(1 for line in err_text.splitlines() if line.strip())
|
||||
print(f"Job {n} errors → {err_path.name} ({n_err_lines} lines)", file=sys.stderr)
|
||||
|
||||
completed_at = job.get("completed_at") or datetime.now(timezone.utc).isoformat()
|
||||
norm_path = job_dir / f"job{n}-output.jsonl"
|
||||
n_ok = n_err = 0
|
||||
with open(norm_path, "w", encoding="utf-8") as out:
|
||||
for line in raw_text.splitlines():
|
||||
if not line.strip():
|
||||
continue
|
||||
record = normalize_output_line(
|
||||
json.loads(line), comment_lookup,
|
||||
job["run_id"], completed_at,
|
||||
status["model"], status["prompt_hash"],
|
||||
)
|
||||
out.write(json.dumps(record, ensure_ascii=False) + "\n")
|
||||
if record["error"]:
|
||||
n_err += 1
|
||||
else:
|
||||
n_ok += 1
|
||||
|
||||
print(f"Job {n} normalized → {norm_path.name} ({n_ok} ok, {n_err} errors)", file=sys.stderr)
|
||||
job["records_completed"] = n_ok
|
||||
job["records_failed"] = n_err
|
||||
|
||||
save_status(status, job_dir)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CLI
|
||||
|
||||
|
||||
def _add_common_args(p: argparse.ArgumentParser) -> None:
|
||||
p.add_argument("--job", type=int, default=None, metavar="N",
|
||||
help="Job number within the run (default: auto)")
|
||||
p.add_argument("--dir", default=None, metavar="DIR",
|
||||
help="Job directory name or path (default: most recent)")
|
||||
|
||||
|
||||
def main() -> None:
|
||||
load_dotenv()
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Batch analysis job runner.",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog=__doc__,
|
||||
)
|
||||
sub = parser.add_subparsers(dest="command", required=True)
|
||||
|
||||
p_create = sub.add_parser("create", help="Create job directory from tokenizer report")
|
||||
p_create.add_argument("report", help="Path to report.json from tokenizer.py")
|
||||
p_create.add_argument("--model", required=True, help="Model (e.g. gpt-4o-mini)")
|
||||
|
||||
p_submit = sub.add_parser("submit", help="Submit next eligible job")
|
||||
_add_common_args(p_submit)
|
||||
|
||||
p_status = sub.add_parser("status", help="Check job status")
|
||||
_add_common_args(p_status)
|
||||
|
||||
p_download = sub.add_parser("download", help="Download and normalize completed jobs")
|
||||
_add_common_args(p_download)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.command == "create":
|
||||
cmd_create(args)
|
||||
return
|
||||
|
||||
api_key = os.environ.get("OPENAI_API_KEY")
|
||||
if not api_key:
|
||||
sys.exit("OPENAI_API_KEY not set. Create a .env file or export the variable.")
|
||||
client = openai.OpenAI(api_key=api_key)
|
||||
|
||||
if args.command == "submit":
|
||||
cmd_submit(args, client)
|
||||
elif args.command == "status":
|
||||
cmd_status(args, client)
|
||||
elif args.command == "download":
|
||||
cmd_download(args, client)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
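The chunking step in `openai_batch.py` can be illustrated in isolation. The following is a minimal sketch, not the module's own code: it assumes the same greedy, in-order packing as `chunk_comments_by_tokens` and the same chars/3 fallback as `estimate_tokens`, but works on plain integer costs. The names `estimate_tokens_fallback` and `greedy_chunks` are hypothetical and exist only for this illustration.

```python
def estimate_tokens_fallback(messages: list[dict]) -> int:
    """Rough count: 3 for the reply primer, plus 3 + len(content) // 3 per message."""
    return 3 + sum(3 + len(m["content"]) // 3 for m in messages)


def greedy_chunks(costs: list[int], limit: int) -> list[list[int]]:
    """Pack item costs, in order, into chunks whose sums stay within limit.

    An item larger than the limit still gets a chunk of its own.
    """
    chunks: list[list[int]] = []
    current: list[int] = []
    current_total = 0
    for cost in costs:
        # Start a new chunk only when the current one is non-empty and
        # adding this item would push it over the limit.
        if current and current_total + cost > limit:
            chunks.append(current)
            current, current_total = [cost], cost
        else:
            current.append(cost)
            current_total += cost
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Hypothetical per-comment token costs packed under a hypothetical budget.
    print(greedy_chunks([400_000, 300_000, 100_000, 700_000], 720_000))
    print(estimate_tokens_fallback([{"role": "user", "content": "abcdef"}]))
```

Because the packing is greedy and order-preserving, it does not search for the minimum number of chunks. It keeps comments in their input order, which keeps the job files easy to compare against the source JSONL.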
228
analysis/openai_realtime.py
Normal file
@@ -0,0 +1,228 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
analysis/openai_realtime.py — Synchronous GPT-4o pipeline for VA Townhall comments.
|
||||
|
||||
Usage:
|
||||
python analysis/openai_realtime.py <input_jsonl> [--limit {5,10,20,50}] [--model MODEL]
|
||||
|
||||
Output:
|
||||
analysis/forum{id}_{scrape_ts}_{model}_{run_ts}.jsonl
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import uuid
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
from dotenv import load_dotenv
|
||||
|
||||
try:
|
||||
import openai
|
||||
except ImportError:
|
||||
sys.exit("openai package not installed. Run: pip install openai")
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Prompt — loaded from analysis/prompt-1.txt at import time
|
||||
|
||||
_PROMPT_FILE = Path(__file__).parent / "prompt-1.txt"
|
||||
SYSTEM_PROMPT = _PROMPT_FILE.read_text(encoding="utf-8").strip()
|
||||
PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
|
||||
|
||||
USER_TEMPLATE = """\
|
||||
## Proposed Regulation
|
||||
Title: {reg_title}
|
||||
Description: {reg_desc}
|
||||
|
||||
---
|
||||
|
||||
## Public Comment
|
||||
Comment ID: {comment_id}
|
||||
Title: {comment_title}
|
||||
Body:
|
||||
{comment_text}
|
||||
|
||||
---
|
||||
Classify this comment per the instructions. Return only JSON.\
|
||||
"""
|
||||
|
||||
MAX_COMMENT_CHARS = 6000
|
||||
_RETRY_DELAYS = [1.0, 2.0]
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Core functions
|
||||
|
||||
|
||||
def load_items(path: Path) -> tuple[dict | None, list[dict]]:
|
||||
"""Read a scraped JSONL file. Returns (forum_item_or_None, [comment_items])."""
|
||||
forum = None
|
||||
comments = []
|
||||
with open(path, encoding="utf-8") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
item = json.loads(line)
|
||||
if "comment_id" in item:
|
||||
comments.append(item)
|
||||
elif "reg_title" in item:
|
||||
forum = item
|
||||
return forum, comments
|
||||
|
||||
|
||||
def build_messages(comment: dict, forum: dict | None) -> tuple[list, bool]:
|
||||
"""Build OpenAI messages for one comment. Returns (messages, truncated)."""
|
||||
reg_title = (forum or {}).get("reg_title", "[unknown]")
|
||||
reg_desc = (forum or {}).get("reg_desc", "[unknown]")
|
||||
|
||||
body = (comment.get("text") or "").strip()
|
||||
truncated = False
|
||||
if not body:
|
||||
body = "[No body text provided]"
|
||||
elif len(body) > MAX_COMMENT_CHARS:
|
||||
body = body[:MAX_COMMENT_CHARS] + "... [truncated]"
|
||||
truncated = True
|
||||
|
||||
user_text = USER_TEMPLATE.format(
|
||||
reg_title=reg_title,
|
||||
reg_desc=reg_desc,
|
||||
comment_id=comment.get("comment_id", ""),
|
||||
comment_title=comment.get("title", ""),
|
||||
comment_text=body,
|
||||
)
|
||||
|
||||
return [
|
||||
{"role": "system", "content": SYSTEM_PROMPT},
|
||||
{"role": "user", "content": user_text},
|
||||
], truncated
|
||||
|
||||
|
||||
def parse_api_response(content: str) -> dict:
|
||||
data = json.loads(content)
|
||||
keys = ("stance", "stance_confidence", "stance_rationale", "tone", "tags")
|
||||
return {k: data.get(k) for k in keys}
|
||||
|
||||
|
||||
def _call_api(client, messages: list, model: str) -> str:
|
||||
last_exc = None
|
||||
for delay in [0.0] + _RETRY_DELAYS:
|
||||
if delay:
|
||||
time.sleep(delay)
|
||||
try:
|
||||
resp = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
response_format={"type": "json_object"},
|
||||
temperature=0.0,
|
||||
)
|
||||
return resp.choices[0].message.content
|
||||
except openai.RateLimitError as exc:
|
||||
last_exc = exc
|
||||
raise last_exc # type: ignore[misc]
|
||||
|
||||
|
||||
def analyze_comment(client, comment: dict, forum: dict | None, run_id: str, model: str) -> dict:
    base = {
        "run_id": run_id,
        "forum_id": comment.get("forum_id", ""),
        "comment_id": comment.get("comment_id", ""),
        "analyzed_at": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt_version": PROMPT_VERSION,
        "input_title": comment.get("title", ""),
    }
    # Set before the try block so the error record can still report whether
    # the body was truncated when the API call or JSON parsing fails.
    truncated = False
    try:
        messages, truncated = build_messages(comment, forum)
        content = _call_api(client, messages, model)
        parsed = parse_api_response(content)
        return {**base, **parsed, "truncated": truncated, "error": None}
    except Exception as exc:
        return {
            **base,
            "stance": None, "stance_confidence": None,
            "stance_rationale": None, "tone": None, "tags": None,
            "truncated": truncated,
            "error": str(exc),
        }
|
||||
|
||||
|
||||
def _scrape_ts_from_filename(path: Path) -> str:
|
||||
m = re.search(r"(\d{4}-\d{2}-\d{2}T[\d\-+:]+)", path.stem)
|
||||
return m.group(1).replace(":", "-") if m else "unknown"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CLI
|
||||
|
||||
def main() -> None:
|
||||
load_dotenv()
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze VA Townhall public comments with GPT-4o (synchronous).",
|
||||
)
|
||||
parser.add_argument("input", help="Path to scraped JSONL file")
|
||||
parser.add_argument(
|
||||
"--limit",
|
||||
type=int,
|
||||
choices=[5, 10, 20, 50],
|
||||
metavar="{5,10,20,50}",
|
||||
help="Process only the first N comments (for testing). Omit to process all.",
|
||||
)
|
||||
parser.add_argument("--model", default="gpt-4o", help="OpenAI model (default: gpt-4o)")
|
||||
args = parser.parse_args()
|
||||
|
||||
api_key = os.environ.get("OPENAI_API_KEY")
|
||||
if not api_key:
|
||||
sys.exit("OPENAI_API_KEY not set. Create a .env file or export the variable.")
|
||||
|
||||
input_path = Path(args.input)
|
||||
if not input_path.exists():
|
||||
sys.exit(f"File not found: {input_path}")
|
||||
|
||||
print(f"Reading {input_path} ...", file=sys.stderr)
|
||||
forum, comments = load_items(input_path)
|
||||
|
||||
if forum is None:
|
||||
print("Warning: no ForumItem found — regulation context will be [unknown].", file=sys.stderr)
|
||||
|
||||
if args.limit:
|
||||
comments = comments[: args.limit]
|
||||
|
||||
forum_id = (forum or {}).get("forum_id", "unknown")
|
||||
scrape_ts = _scrape_ts_from_filename(input_path)
|
||||
run_ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S+00-00")
|
||||
model_slug = args.model.replace("/", "-")
|
||||
|
||||
out_dir = Path(__file__).parent
|
||||
out_path = out_dir / f"forum{forum_id}_{scrape_ts}_{model_slug}_{run_ts}.jsonl"
|
||||
|
||||
run_id = str(uuid.uuid4())
|
||||
client = openai.OpenAI(api_key=api_key)
|
||||
|
||||
n_ok = n_err = 0
|
||||
total = len(comments)
|
||||
print(f"Analyzing {total} comments → {out_path}", file=sys.stderr)
|
||||
|
||||
with open(out_path, "w", encoding="utf-8") as out:
|
||||
for i, comment in enumerate(comments, 1):
|
||||
record = analyze_comment(client, comment, forum, run_id, args.model)
|
||||
out.write(json.dumps(record, ensure_ascii=False) + "\n")
|
||||
out.flush()
|
||||
if record["error"]:
|
||||
n_err += 1
|
||||
print(f" [{i}/{total}] ERROR {comment.get('comment_id')}: {record['error']}", file=sys.stderr)
|
||||
else:
|
||||
n_ok += 1
|
||||
print(f" [{i}/{total}] OK {comment.get('comment_id')} → {record['stance']}", file=sys.stderr)
|
||||
time.sleep(0.1)
|
||||
|
||||
print(f"\nDone. {n_ok} ok, {n_err} errors → {out_path}", file=sys.stderr)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
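The retry loop in `_call_api` makes one immediate attempt and then one more attempt after each delay in `_RETRY_DELAYS`. A minimal sketch of that pattern, independent of the OpenAI client, is below. `TransientError`, `call_with_retries`, and the injectable `sleep` parameter are hypothetical names introduced for illustration; the real function catches `openai.RateLimitError` and calls `time.sleep` directly.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error such as a rate limit."""


def call_with_retries(
    fn: Callable[[], str],
    delays: list[float],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn once, then once more after each delay; re-raise the last error."""
    last_exc: Exception | None = None
    for delay in [0.0] + delays:
        if delay:
            sleep(delay)
        try:
            return fn()
        except TransientError as exc:
            # Only the retryable error is swallowed; anything else propagates.
            last_exc = exc
    assert last_exc is not None
    raise last_exc


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky() -> str:
        # Fails on the first two calls, succeeds on the third.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise TransientError("rate limited")
        return "ok"

    # A no-op sleep keeps the demonstration instant.
    print(call_with_retries(flaky, [1.0, 2.0], sleep=lambda s: None))
```

With two delays the function makes at most three attempts. Passing `sleep` as a parameter is a testing convenience in this sketch, not something the original script does.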
19
analysis/prompt-1.txt
Normal file
@@ -0,0 +1,19 @@
|
||||
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall regulatory comment system. You will be given the text of a proposed regulation and a single public comment. Return ONLY a JSON object — no other text.

Definitions:
- stance: the commenter's position on whether the regulation should be adopted.
  "support" = wants it approved (as-is or with changes);
  "oppose" = wants it rejected or substantially weakened;
  "neutral" = takes no position, asks a question, or provides factual input only;
  "unknown" = too vague, off-topic, or uninterpretable to classify.
- tone: the emotional register of the writing, independent of stance.
  "positive" = affirming, hopeful, appreciative;
  "negative" = angry, fearful, alarmed, or contemptuous;
  "neutral" = matter-of-fact, procedural, or informational;
  "mixed" = contains both positive and negative emotional content;
  "unclear" = tone cannot be determined (e.g., a one-word comment).
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ inclusion", "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.

Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||
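The prompt above fixes the set of keys and the allowed labels, but the scripts in this diff only copy the keys out of the reply with `data.get(k)`. A stricter check could look roughly like the following sketch. It assumes the label sets exactly as written in `prompt-1.txt`; `validate_response` and the constants are hypothetical names, not part of the repository.

```python
import json

# Label sets copied from the definitions in prompt-1.txt.
STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = ("stance", "stance_confidence", "stance_rationale", "tone", "tags")


def validate_response(content: str) -> list[str]:
    """Return a list of problems found in one model reply (empty if none)."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in data]

    stance = data.get("stance")
    if "stance" in data and not (isinstance(stance, str) and stance in STANCES):
        problems.append(f"unexpected stance: {stance!r}")

    tone = data.get("tone")
    if "tone" in data and not (isinstance(tone, str) and tone in TONES):
        problems.append(f"unexpected tone: {tone!r}")

    conf = data.get("stance_confidence")
    if "stance_confidence" in data:
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not (is_number and 0.0 <= conf <= 1.0):
            problems.append(f"stance_confidence out of range: {conf!r}")

    tags = data.get("tags")
    if "tags" in data:
        if not (isinstance(tags, list) and all(isinstance(t, str) for t in tags)):
            problems.append("tags is not a list of strings")
        elif len(tags) > 5:
            problems.append(f"too many tags: {len(tags)}")

    return problems


if __name__ == "__main__":
    reply = json.dumps({
        "stance": "oppose",
        "stance_confidence": 0.9,
        "stance_rationale": "Asks the board to reject the proposal.",
        "tone": "negative",
        "tags": ["privacy"],
    })
    print(validate_response(reply))
```

Returning a list of problems rather than raising lets a caller record every issue in the normalized record's `error` field, if that is the behavior wanted.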
190
analysis/tokenizer.py
Normal file
@@ -0,0 +1,190 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
tokenizer.py — estimate token usage and cost for a batch analysis run.
|
||||
|
||||
Usage:
|
||||
python analysis/tokenizer.py output/f452.jsonl [--prompt analysis/prompt-1.txt]
|
||||
python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl # count actual tokens in a job
|
||||
|
||||
Prints a per-model comparison table and writes reports/<stem>-report.json.
|
||||
Run this before openai_batch.py create.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import math
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
import openai_batch as _ab
|
||||
|
||||
# Input pricing ($/1M tokens, batch API) — from docs/openai.md, updated 2026-05-05.
|
||||
# Add Anthropic/other models here when needed; only models with an entry in openai_batch.MODEL_LIMITS are reported.
|
||||
MODEL_PRICING: dict[str, float] = {
|
||||
"gpt-5.5": 2.50,
|
||||
"gpt-5.4": 1.25,
|
||||
"gpt-5.4-mini": 0.375,
|
||||
"gpt-5.4-nano": 0.10,
|
||||
"gpt-4o": 1.25,
|
||||
"gpt-4o-mini": 0.075,
|
||||
"gpt-o4-mini": 0.55,
|
||||
}
|
||||
|
||||
|
||||
def compute_report(
|
||||
comments: list[dict],
|
||||
forum: dict | None,
|
||||
prompt_hash: str,
|
||||
input_file: str,
|
||||
input_sha256: str,
|
||||
prompt_file: str,
|
||||
) -> dict:
|
||||
"""Compute token estimate and per-model job/cost/time breakdown."""
|
||||
# Use gpt-4o encoding as the canonical estimator (same for all current models)
|
||||
total_tokens = sum(
|
||||
_ab.estimate_tokens(_ab.build_messages(c, forum)[0], "gpt-4o")
|
||||
for c in comments
|
||||
)
|
||||
|
||||
report: dict = {
|
||||
"prompt": prompt_file,
|
||||
"prompt_hash": prompt_hash,
|
||||
"input_file": input_file,
|
||||
"input_sha256": input_sha256,
|
||||
"total_comments": len(comments),
|
||||
"input_tokens": total_tokens,
|
||||
}
|
||||
|
||||
for model, tpd in _ab.MODEL_LIMITS.items():
|
||||
effective_tpd = int(tpd * _ab._LIMIT_BUFFER)
|
||||
jobs = math.ceil(total_tokens / effective_tpd)
|
||||
cost = round(total_tokens / 1_000_000 * MODEL_PRICING.get(model, 0.0), 4)
|
||||
est_days = round(total_tokens / tpd, 2)
|
||||
report[model] = {"jobs": jobs, "cost_$": cost, "est_queue_days": est_days}
|
||||
|
||||
return report
|
||||
|
||||
|
||||
def count_input_tokens(path: Path, model: str = "gpt-4o") -> dict:
|
||||
"""Count tokens in an existing job input JSONL (batch request format).
|
||||
|
||||
Each line must have body.messages (as written by build_batch_request_line).
|
||||
Returns {"total_tokens": int, "total_requests": int, "min": int, "max": int, "mean": float}.
|
||||
"""
|
||||
counts = []
|
||||
with open(path, encoding="utf-8") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
req = json.loads(line)
|
||||
messages = req["body"]["messages"]
|
||||
counts.append(_ab.estimate_tokens(messages, model))
|
||||
if not counts:
|
||||
return {"total_tokens": 0, "total_requests": 0, "min": 0, "max": 0, "mean": 0.0}
|
||||
return {
|
||||
"total_tokens": sum(counts),
|
||||
"total_requests": len(counts),
|
||||
"min": min(counts),
|
||||
"max": max(counts),
|
||||
"mean": round(sum(counts) / len(counts), 1),
|
||||
}
|
||||
|
||||
|
||||
def print_table(report: dict) -> None:
|
||||
"""Print a human-readable model comparison table to stdout."""
|
||||
print(f"\nInput: {report['input_file']}")
|
||||
print(f"Comments: {report['total_comments']:,}")
|
||||
print(f"Tokens: {report['input_tokens']:,}")
|
||||
print(f"Prompt: {report['prompt']} (hash: {report['prompt_hash']})")
|
||||
print()
|
||||
|
||||
# Cheapest model that fits in one job
|
||||
single_job_models = [m for m in _ab.MODEL_LIMITS if report.get(m, {}).get("jobs") == 1]
|
||||
best = (min(single_job_models, key=lambda m: report[m]["cost_$"])
|
||||
if single_job_models else None)
|
||||
|
||||
print(f"{'Model':<15} {'Jobs':>5} {'Cost ($)':>9} {'Est days':>9} {'Note'}")
|
||||
print("-" * 62)
|
||||
for model in _ab.MODEL_LIMITS:
|
||||
if model not in report or not isinstance(report[model], dict):
|
||||
continue
|
||||
m = report[model]
|
||||
note = "<-- recommended" if model == best else ""
|
||||
print(f"{model:<15} {m['jobs']:>5} {m['cost_$']:>9.4f} {m['est_queue_days']:>9.2f} {note}")
|
||||
print()
|
||||
|
||||
|
||||
def _is_job_input(path: Path) -> bool:
|
||||
"""Return True if this JSONL looks like a batch request file (has custom_id)."""
|
||||
with open(path, encoding="utf-8") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line:
|
||||
return "custom_id" in json.loads(line)
|
||||
return False
|
||||
|
||||
|
||||
def main() -> None:
|
||||
_default_prompt = Path(__file__).parent / "prompt-1.txt"
|
||||
|
||||
parser = argparse.ArgumentParser(description="Estimate batch token usage and cost.")
|
||||
parser.add_argument("input", help="Scraped JSONL or job input JSONL (jobN-input.jsonl)")
|
||||
parser.add_argument(
|
||||
"--prompt",
|
||||
default=str(_default_prompt),
|
||||
help=f"System prompt file (default: {_default_prompt.name})",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
input_path = Path(args.input)
|
||||
if not input_path.exists():
|
||||
sys.exit(f"File not found: {input_path}")
|
||||
|
||||
# --- Mode: count tokens in an existing job input file ---
|
||||
if _is_job_input(input_path):
|
||||
result = count_input_tokens(input_path)
|
||||
print(f"\nJob input: {input_path.name}")
|
||||
print(f" Requests : {result['total_requests']:,}")
|
||||
print(f" Tokens : {result['total_tokens']:,}")
|
||||
print(f" Per-req : min={result['min']} max={result['max']} mean={result['mean']}")
|
||||
return
|
||||
|
||||
# --- Mode: estimate from raw scrape file and write report.json ---
|
||||
prompt_path = Path(args.prompt)
|
||||
if not prompt_path.exists():
|
||||
sys.exit(f"Prompt file not found: {prompt_path}")
|
||||
|
||||
prompt_text = prompt_path.read_text(encoding="utf-8").strip()
|
||||
prompt_hash = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:7]
|
||||
|
||||
# Ensure build_messages uses the specified prompt
|
||||
_ab._load_prompt(prompt_path)
|
||||
|
||||
forum, comments = _ab.load_items(input_path)
|
||||
if not comments:
|
||||
sys.exit("No comment items found.")
|
||||
if forum is None:
|
||||
print("Warning: no ForumItem — token estimates may be slightly low.", file=sys.stderr)
|
||||
|
||||
input_sha256 = hashlib.sha256(input_path.read_bytes()).hexdigest()
|
||||
|
||||
report = compute_report(
|
||||
comments, forum, prompt_hash,
|
||||
str(input_path), input_sha256, str(prompt_path),
|
||||
)
|
||||
|
||||
print_table(report)
|
||||
|
||||
reports_dir = Path(__file__).parent.parent / "reports"
|
||||
reports_dir.mkdir(exist_ok=True)
|
||||
out_path = reports_dir / f"{input_path.stem}-report.json"
|
||||
out_path.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
|
||||
print(f"Report written to: {out_path}")
|
||||
print(f"\nNext: python analysis/openai_batch.py create {out_path} --model <model>")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
BIN
docs/excel-snapshot.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 32 KiB |
404
docs/openai.md
Normal file
@@ -0,0 +1,404 @@
|
||||
# Batch API
|
||||
|
||||
Learn how to use OpenAI's Batch API to send asynchronous groups of requests with 50% lower costs, a separate pool of significantly higher rate limits, and a clear 24-hour turnaround time. The service is ideal for processing jobs that don't require immediate responses. You can also [explore the API reference directly here](https://developers.openai.com/api/docs/api-reference/batch).
|
||||
|
||||
## Overview
|
||||
|
||||
While some uses of the OpenAI Platform require you to send synchronous requests, there are many cases where requests do not need an immediate response or [rate limits](https://developers.openai.com/api/docs/guides/rate-limits) prevent you from executing a large number of queries quickly. Batch processing jobs are often helpful in use cases like:
|
||||
|
||||
1. Running evaluations
|
||||
2. Classifying large datasets
|
||||
3. Embedding content repositories
|
||||
4. Queuing large offline video-render jobs
|
||||
|
||||
The Batch API offers a straightforward set of endpoints that allow you to collect a set of requests into a single file, kick off a batch processing job to execute these requests, query for the status of that batch while the underlying requests execute, and eventually retrieve the collected results when the batch is complete.
|
||||
|
||||
Compared to using standard endpoints directly, Batch API has:
|
||||
|
||||
1. **Better cost efficiency:** 50% cost discount compared to synchronous APIs
|
||||
2. **Higher rate limits:** [Substantially more headroom](https://platform.openai.com/settings/organization/limits) compared to the synchronous APIs
|
||||
3. **Fast completion times:** Each batch completes within 24 hours (and often more quickly)
|
||||
|
||||
## Getting started
|
||||
|
||||
### 1. Prepare your batch file
|
||||
|
||||
Batches start with a `.jsonl` file where each line contains the details of an individual request to the API. For now, the available endpoints are:
|
||||
|
||||
- `/v1/responses` ([Responses API](https://developers.openai.com/api/docs/api-reference/responses))
|
||||
- `/v1/chat/completions` ([Chat Completions API](https://developers.openai.com/api/docs/api-reference/chat))
|
||||
- `/v1/embeddings` ([Embeddings API](https://developers.openai.com/api/docs/api-reference/embeddings))
|
||||
- `/v1/completions` ([Completions API](https://developers.openai.com/api/docs/api-reference/completions))
|
||||
- `/v1/moderations` ([Moderations guide](https://developers.openai.com/api/docs/guides/moderation))
|
||||
- `/v1/images/generations` ([Images API](https://developers.openai.com/api/docs/api-reference/images))
|
||||
- `/v1/images/edits` ([Images API](https://developers.openai.com/api/docs/api-reference/images))
|
||||
- `/v1/videos` ([Video generation guide](https://developers.openai.com/api/docs/guides/video-generation))
|
||||
|
||||
For a given input file, the parameters in each line's `body` field are the same as the parameters for the underlying endpoint. Each request must include a unique `custom_id` value, which you can use to reference results after completion. Each input file can only include requests to a single model. An example input file with 2 requests appears after the endpoint-specific notes below.
|
||||
|
||||
For video generation in Batch:
|
||||
|
||||
- Batch currently supports `POST /v1/videos` only.
|
||||
- Batch requests for videos must use JSON, not multipart.
|
||||
- Upload assets ahead of time and pass supported asset references in the request body rather than using multipart uploads.
|
||||
- Use `input_reference` for image-guided generations in Batch. In JSON requests, pass `input_reference` as an object with either `file_id` or `image_url`.
|
||||
- Multipart `input_reference` uploads, including video reference inputs, aren't supported in Batch.
|
||||
- Batch-generated videos are available for download for up to `24` hours after the batch completes.
|
||||
|
||||
When targeting `/v1/moderations`, include an `input` field in every request body. Batch accepts both plain-text inputs (for `omni-moderation-latest` and `text-moderation-latest`) and multimodal content arrays (for `omni-moderation-latest`). The Batch worker enforces the same non-streaming requirement as the synchronous Moderations API and rejects requests that set `stream=true`.
|
||||
|
||||
```jsonl
|
||||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
|
||||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
|
||||
```
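
The input file above can also be generated programmatically. The following is a minimal sketch, assuming chat-completion requests only; the helper names `make_request_line` and `make_batch_lines` and the `request-N` ID scheme are illustrative and are not part of the OpenAI SDK.

```python
import json


def make_request_line(custom_id, model, system_prompt, user_text, max_tokens=1000):
    """Return one JSONL line for a /v1/chat/completions batch request.

    Hypothetical helper: the field layout mirrors the example input file.
    """
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_text},
            ],
            "max_tokens": max_tokens,
        },
    }
    # json.dumps escapes embedded newlines, so each request stays on one line.
    return json.dumps(request, ensure_ascii=False)


def make_batch_lines(texts, model, system_prompt):
    """Build one request line per text, with IDs request-1, request-2, ..."""
    lines = [
        make_request_line(f"request-{i}", model, system_prompt, text)
        for i, text in enumerate(texts, start=1)
    ]
    # Every custom_id must be unique within the input file.
    ids = [json.loads(line)["custom_id"] for line in lines]
    if len(set(ids)) != len(ids):
        raise ValueError("custom_id values must be unique")
    return lines
```

Writing `"\n".join(lines)` to a `.jsonl` file would then give a file in the format shown above, with a single model used for every request.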
|
||||
|
||||
#### Moderations input examples
|
||||
|
||||
Text-only request:
|
||||
|
||||
```jsonl
|
||||
{
|
||||
"custom_id": "moderation-text-1",
|
||||
"method": "POST",
|
||||
"url": "/v1/moderations",
|
||||
"body": {
|
||||
"model": "omni-moderation-latest",
|
||||
"input": "This is a harmless test sentence."
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Multimodal request:
|
||||
|
||||
```jsonl
|
||||
{
|
||||
"custom_id": "moderation-mm-1",
|
||||
"method": "POST",
|
||||
"url": "/v1/moderations",
|
||||
"body": {
|
||||
"model": "omni-moderation-latest",
|
||||
"input": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "Describe this image"
|
||||
},
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": "https://api.nga.gov/iiif/a2e6da57-3cd1-4235-b20e-95dcaefed6c8/full/!800,800/0/default.jpg"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Prefer referencing remote assets with `image_url` (instead of base64 blobs) to keep your `.jsonl` files well below the 200 MB Batch upload limit, especially for multimodal Moderations requests.
|
||||
|
||||
### 2. Upload your batch input file
|
||||
|
||||
Similar to our [Fine-tuning API](https://developers.openai.com/api/docs/guides/model-optimization), you must first upload your input file so that you can reference it correctly when kicking off batches. Upload your `.jsonl` file using the [Files API](https://developers.openai.com/api/docs/api-reference/files).
|
||||
|
||||
Upload files for Batch API
|
||||
|
||||
```javascript
|
||||
import fs from "fs";
|
||||
import OpenAI from "openai";
|
||||
const openai = new OpenAI();
|
||||
|
||||
const file = await openai.files.create({
|
||||
file: fs.createReadStream("batchinput.jsonl"),
|
||||
purpose: "batch",
|
||||
});
|
||||
|
||||
console.log(file);
|
||||
```
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
client = OpenAI()
|
||||
|
||||
batch_input_file = client.files.create(
|
||||
file=open("batchinput.jsonl", "rb"),
|
||||
purpose="batch"
|
||||
)
|
||||
|
||||
print(batch_input_file)
|
||||
```
|
||||
|
||||
```bash
|
||||
curl https://api.openai.com/v1/files \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F purpose="batch" \
  -F file="@batchinput.jsonl"
|
||||
```
|
||||
|
||||
|
||||
### 3. Create the batch
|
||||
|
||||
Once you've successfully uploaded your input file, you can use the input File object's ID to create a batch. In this case, let's assume the file ID is `file-abc123`. For now, the completion window can only be set to `24h`. You can also provide custom metadata via an optional `metadata` parameter.
|
||||
|
||||
Create the Batch
|
||||
|
||||
```javascript
|
||||
import OpenAI from "openai";
|
||||
const openai = new OpenAI();
|
||||
|
||||
const batch = await openai.batches.create({
|
||||
input_file_id: "file-abc123",
|
||||
endpoint: "/v1/chat/completions",
|
||||
completion_window: "24h"
|
||||
});
|
||||
|
||||
console.log(batch);
|
||||
```
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
client = OpenAI()
|
||||
|
||||
batch_input_file_id = batch_input_file.id
|
||||
client.batches.create(
|
||||
input_file_id=batch_input_file_id,
|
||||
endpoint="/v1/chat/completions",
|
||||
completion_window="24h",
|
||||
metadata={
|
||||
"description": "nightly eval job"
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
```bash
|
||||
curl https://api.openai.com/v1/batches \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }'
|
||||
```
|
||||
|
||||
|
||||
This request will return a [Batch object](https://developers.openai.com/api/docs/api-reference/batch/object) with metadata about your batch:
|
||||
|
||||
```python
|
||||
{
|
||||
"id": "batch_abc123",
|
||||
"object": "batch",
|
||||
"endpoint": "/v1/chat/completions",
|
||||
"errors": null,
|
||||
"input_file_id": "file-abc123",
|
||||
"completion_window": "24h",
|
||||
"status": "validating",
|
||||
"output_file_id": null,
|
||||
"error_file_id": null,
|
||||
"created_at": 1714508499,
|
||||
"in_progress_at": null,
|
||||
"expires_at": 1714536634,
|
||||
"completed_at": null,
|
||||
"failed_at": null,
|
||||
"expired_at": null,
|
||||
"request_counts": {
|
||||
"total": 0,
|
||||
"completed": 0,
|
||||
"failed": 0
|
||||
},
|
||||
"metadata": null
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Check the status of a batch
|
||||
|
||||
You can check the status of a batch at any time, which will also return a Batch object.
|
||||
|
||||
Check the status of a batch
|
||||
|
||||
```javascript
|
||||
import OpenAI from "openai";
|
||||
const openai = new OpenAI();
|
||||
|
||||
const batch = await openai.batches.retrieve("batch_abc123");
|
||||
console.log(batch);
|
||||
```
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
client = OpenAI()
|
||||
|
||||
batch = client.batches.retrieve("batch_abc123")
|
||||
print(batch)
|
||||
```
|
||||
|
||||
```bash
|
||||
curl https://api.openai.com/v1/batches/batch_abc123 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json"
|
||||
```
|
||||
|
||||
|
||||
The status of a given Batch object can be any of the following:
|
||||
|
||||
| Status        | Description                                                                    |
| ------------- | ------------------------------------------------------------------------------ |
| `validating`  | the input file is being validated before the batch can begin                   |
| `failed`      | the input file has failed the validation process                               |
| `in_progress` | the input file was successfully validated and the batch is currently being run |
| `finalizing`  | the batch has completed and the results are being prepared                     |
| `completed`   | the batch has been completed and the results are ready                         |
| `expired`     | the batch was not able to be completed within the 24-hour time window          |
| `cancelling`  | the batch is being cancelled (may take up to 10 minutes)                       |
| `cancelled`   | the batch was cancelled                                                        |
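
The statuses above suggest a simple polling loop that stops once a batch reaches a state it cannot leave. The sketch below is one way to write it, assuming that `failed`, `completed`, `expired`, and `cancelled` are the terminal states; `wait_for_batch` and its `get_status` callback are hypothetical names, chosen so the loop can be exercised without network access.

```python
import time

# Assumption: these four statuses are final; all others may still change.
TERMINAL_STATUSES = frozenset({"failed", "completed", "expired", "cancelled"})


def is_terminal(status):
    """Return True if a batch with this status will not change state again."""
    return status in TERMINAL_STATUSES


def wait_for_batch(get_status, poll_seconds=60.0, max_polls=1440, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return it.

    get_status is any zero-argument callable returning the status string.
    Raises TimeoutError if max_polls is reached first.
    """
    for _ in range(max_polls):
        status = get_status()
        if is_terminal(status):
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"batch not finished after {max_polls} polls")
```

With the Python client shown earlier, `get_status` could be `lambda: client.batches.retrieve("batch_abc123").status`.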
|
||||
|
||||
### 5. Retrieve the results
|
||||
|
||||
Once the batch is complete, you can download the output by making a request against the [Files API](https://developers.openai.com/api/docs/api-reference/files) via the `output_file_id` field from the Batch object and writing it to a file on your machine, in this case `batch_output.jsonl`.
|
||||
|
||||
Retrieving the batch results
|
||||
|
||||
```javascript
|
||||
import OpenAI from "openai";
|
||||
const openai = new OpenAI();
|
||||
|
||||
const fileResponse = await openai.files.content("file-xyz123");
|
||||
const fileContents = await fileResponse.text();
|
||||
|
||||
console.log(fileContents);
|
||||
```
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
client = OpenAI()
|
||||
|
||||
file_response = client.files.content("file-xyz123")
|
||||
print(file_response.text)
|
||||
```
|
||||
|
||||
```bash
|
||||
curl https://api.openai.com/v1/files/file-xyz123/content \
  -H "Authorization: Bearer $OPENAI_API_KEY" > batch_output.jsonl
|
||||
```
|
||||
|
||||
|
||||
The output `.jsonl` file will have one response line for every successful request line in the input file. Any failed requests in the batch will have their error information written to an error file that can be found via the batch's `error_file_id`.
|
||||
|
||||
For `/v1/videos`, a completed batch result contains video objects that have already reached a terminal state such as `completed`, `failed`, or `expired`. You can use the returned video IDs to download final assets immediately after the batch finishes.
|
||||
|
||||
Note that the output line order **may not match** the input line order. Instead of relying on order to process your results, use the `custom_id` field, which is present in each line of your output file and lets you map requests in your input to results in your output.
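
Mapping results back to requests by `custom_id` can be sketched as follows. This assumes chat-completion output lines shaped like the example in this section; `index_by_custom_id` and `assistant_text` are illustrative helper names, not SDK functions.

```python
import json


def index_by_custom_id(output_lines):
    """Parse batch output lines into a dict keyed by custom_id."""
    results = {}
    for raw in output_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        results[record["custom_id"]] = record
    return results


def assistant_text(record):
    """Return the first choice's message content, or None if the request failed."""
    response = record.get("response")
    if record.get("error") is not None or response is None:
        return None
    if response.get("status_code") != 200:
        return None
    return response["body"]["choices"][0]["message"]["content"]
```

Looking up `results["request-1"]` then returns the right record regardless of where that line appears in the output file.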
|
||||
|
||||
```jsonl
|
||||
{"id": "batch_req_123", "custom_id": "request-2", "response": {"status_code": 200, "request_id": "req_123", "body": {"id": "chatcmpl-123", "object": "chat.completion", "created": 1711652795, "model": "gpt-3.5-turbo-0125", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello."}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 22, "completion_tokens": 2, "total_tokens": 24}, "system_fingerprint": "fp_123"}}, "error": null}
|
||||
{"id": "batch_req_456", "custom_id": "request-1", "response": {"status_code": 200, "request_id": "req_789", "body": {"id": "chatcmpl-abc", "object": "chat.completion", "created": 1711652789, "model": "gpt-3.5-turbo-0125", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello! How can I assist you today?"}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 20, "completion_tokens": 9, "total_tokens": 29}, "system_fingerprint": "fp_3ba"}}, "error": null}
|
||||
```
|
||||
|
||||
The output file will automatically be deleted 30 days after the batch is complete.
|
||||
|
||||
### 6. Cancel a batch
|
||||
|
||||
If necessary, you can cancel an ongoing batch. The batch's status will change to `cancelling` until in-flight requests are complete (up to 10 minutes), after which the status will change to `cancelled`.
|
||||
|
||||
Cancelling a batch
|
||||
|
||||
```javascript
|
||||
import OpenAI from "openai";
|
||||
const openai = new OpenAI();
|
||||
|
||||
const batch = await openai.batches.cancel("batch_abc123");
|
||||
console.log(batch);
|
||||
```
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
client = OpenAI()
|
||||
|
||||
client.batches.cancel("batch_abc123")
|
||||
```
|
||||
|
||||
```bash
|
||||
curl https://api.openai.com/v1/batches/batch_abc123/cancel \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -X POST
|
||||
```
|
||||
|
||||
|
||||
### 7. Get a list of all batches
|
||||
|
||||
At any time, you can see all your batches. For users with many batches, you can use the `limit` and `after` parameters to paginate your results.
|
||||
|
||||
Getting a list of all batches
|
||||
|
||||
```javascript
|
||||
import OpenAI from "openai";
|
||||
const openai = new OpenAI();
|
||||
|
||||
const list = await openai.batches.list();
|
||||
|
||||
for await (const batch of list) {
|
||||
console.log(batch);
|
||||
}
|
||||
```
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
client = OpenAI()
|
||||
|
||||
client.batches.list(limit=10)
|
||||
```
|
||||
|
||||
```bash
|
||||
curl "https://api.openai.com/v1/batches?limit=10" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json"
|
||||
```
|
||||
|
||||
|
||||
## Model availability
|
||||
|
||||
The Batch API is widely available across most of our models, but not all. Please refer to the [model reference docs](https://developers.openai.com/api/docs/models) to ensure the model you're using supports the Batch API.
|
||||
|
||||
## Rate limits
|
||||
|
||||
Batch API rate limits are separate from existing per-model rate limits. The Batch API has three types of rate limits:
|
||||
|
||||
1. **Per-batch limits:** A single batch may include up to 50,000 requests, and a batch input file can be up to 200 MB in size. Note that `/v1/embeddings` batches are also restricted to a maximum of 50,000 embedding inputs across all requests in the batch.
|
||||
2. **Enqueued prompt tokens per model:** Each model has a maximum number of enqueued prompt tokens allowed for batch processing. You can find these limits on the [Platform Settings page](https://platform.openai.com/settings/organization/limits).
|
||||
3. **Batch creation rate limit:** You can create up to 2,000 batches per hour. If you need to submit more requests, increase the number of requests per batch.
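
The per-batch limits above can be enforced when splitting a large job into several input files. The sketch below is one possible approach, not the pipeline's own splitter: `split_into_batches` is a hypothetical helper, and treating 200 MB as 200,000,000 bytes is a conservative assumption. It does not account for the enqueued-token limit, which depends on the model.

```python
def split_into_batches(lines, max_requests=50_000, max_bytes=200_000_000):
    """Group JSONL request lines into batches that respect both per-batch limits.

    lines: iterable of request lines without trailing newlines.
    Returns a list of lists of lines.
    """
    batches = []
    current = []
    current_bytes = 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline written after it
        if size > max_bytes:
            raise ValueError("a single request exceeds the per-file size limit")
        # Start a new batch if adding this line would break either limit.
        if current and (len(current) >= max_requests or current_bytes + size > max_bytes):
            batches.append(current)
            current = []
            current_bytes = 0
        current.append(line)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each inner list can then be written to its own `.jsonl` file and uploaded as a separate batch.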
|
||||
|
||||
There are no limits for output tokens for the Batch API today. Because Batch API rate limits are a new, separate pool, **using the Batch API will not consume tokens from your standard per-model rate limits**, thereby offering you a convenient way to increase the number of requests and processed tokens you can use when querying our API.
|
||||
|
||||
## Batch expiration
|
||||
|
||||
Batches that do not complete in time eventually move to an `expired` state; unfinished requests within that batch are cancelled, and any responses to completed requests are made available via the batch's output file. You will be charged for tokens consumed from any completed requests.
|
||||
|
||||
Expired requests will be written to your error file with the message as shown below. You can use the `custom_id` to retrieve the request data for expired requests.
|
||||
|
||||
```jsonl
|
||||
{"id": "batch_req_123", "custom_id": "request-3", "response": null, "error": {"code": "batch_expired", "message": "This request could not be executed before the completion window expired."}}
|
||||
{"id": "batch_req_123", "custom_id": "request-7", "response": null, "error": {"code": "batch_expired", "message": "This request could not be executed before the completion window expired."}}
|
||||
```
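
Expired requests can be collected from the error file and matched against the original input for resubmission. This is a minimal sketch, assuming error lines shaped like the example above and input lines in the batch request format; both function names are illustrative.

```python
import json


def expired_custom_ids(error_lines):
    """Return the custom_id of every request whose error code is batch_expired."""
    expired = []
    for raw in error_lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        error = record.get("error") or {}
        if error.get("code") == "batch_expired":
            expired.append(record["custom_id"])
    return expired


def requests_to_retry(input_lines, expired_ids):
    """Select the original input lines whose custom_id is in expired_ids."""
    wanted = set(expired_ids)
    return [
        line for line in input_lines
        if line.strip() and json.loads(line)["custom_id"] in wanted
    ]
```

The selected lines can be written to a new input file and submitted as a fresh batch.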
|
||||
|
||||
|
||||
# Pricing and Limits - Batch
|
||||
Updated 2026-05-05
|
||||
|
||||
Price per 1M Tokens, Short Context / Limits
|
||||
TPM = Tokens per minute
|
||||
Req/rpm = Requests per minute
|
||||
TPD = Tokens per *day*; you cannot queue more than this total across all concurrent batches
|
||||
| Model        | Input ($) | Cached Input ($) | Output ($) | Token (tpm) | Req (rpm) | Batch (tpd) |
|--------------|-----------|------------------|------------|-------------|-----------|-------------|
| gpt-5.5      | 2.5       | 0.25             | 15.00      | 500000      | 500       | 900000      |
| gpt-5.4      | 1.25      | 0.13             | 7.50       | 500000      | 500       | 900000      |
| gpt-5.4-mini | 0.375     | 0.0375           | 2.25       | 200000      | 500       | 2000000     |
| gpt-5.4-nano | 0.10      | 0.01             | 0.625      | 200000      | 500       | 200000      |
| gpt-4o       | 1.25      | -                | 5.00       | 500000      | 500       | 900000      |
| gpt-4o-mini  | 0.075     | -                | 0.30       | 200000      | 500       | 2000000     |
| gpt-o4-mini  | 0.55      | -                | 0.30       | 200000      | 500       | 2000000     |
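
As a worked example of how the table feeds a cost and job estimate, the sketch below applies the same arithmetic as `compute_report` in the tokenizer script: input cost from the per-1M price, and job count from the daily token limit. The function name `estimate_batch` and the default `buffer` of 0.9 are assumptions for illustration; the script's actual safety margin is not shown here.

```python
import math


def estimate_batch(total_tokens, input_price_per_million, tokens_per_day, buffer=0.9):
    """Estimate jobs, input cost, and queue time for one model.

    buffer is an assumed safety margin applied to the daily token limit.
    Only input tokens are priced; output tokens are not included.
    """
    effective_tpd = int(tokens_per_day * buffer)
    return {
        "jobs": math.ceil(total_tokens / effective_tpd),
        "cost_$": round(total_tokens / 1_000_000 * input_price_per_million, 4),
        "est_queue_days": round(total_tokens / tokens_per_day, 2),
    }


# 1.5M input tokens, using the gpt-4o-mini and gpt-5.4 rows of the table.
small = estimate_batch(1_500_000, 0.075, 2_000_000)
large = estimate_batch(1_500_000, 1.25, 900_000)
```

Under these assumptions the gpt-4o-mini row fits in a single job, while the gpt-5.4 row needs two because 1.5M tokens exceeds its buffered daily limit.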
|
||||
https://developers.openai.com/api/docs/pricing?latest-pricing=batch
|
||||
https://platform.openai.com/settings/organization/limits
|
||||
117
docs/pipeline-v1.2.3.drawio
Normal file
@@ -0,0 +1,117 @@
|
||||
<mxfile host="app.diagrams.net">
|
||||
<diagram name="Page-1" id="0sW-Vs8X5usvYmJikUIv">
|
||||
<mxGraphModel dx="1315" dy="798" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
|
||||
<root>
|
||||
<mxCell id="0" />
|
||||
<mxCell id="1" parent="0" />
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-61" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
|
||||
<mxGeometry height="90" width="190" x="1000" y="330" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-60" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
|
||||
<mxGeometry height="90" width="190" x="1010" y="340" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-59" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
|
||||
<mxGeometry height="90" width="190" x="1020" y="350" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-3" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-1" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=16.5;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-29">
|
||||
<mxGeometry relative="1" as="geometry">
|
||||
<mxPoint x="200" y="290" as="targetPoint" />
|
||||
</mxGeometry>
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-1" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="scraper" vertex="1">
|
||||
<mxGeometry height="60" width="120" x="40" y="170" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-46" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-5" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" target="mENAtx_syaeSO5uR6kG6-34">
|
||||
<mxGeometry relative="1" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-5" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="tokenizer" vertex="1">
|
||||
<mxGeometry height="60" width="120" x="400" y="170" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-6" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="&lt;div align=&quot;left&quot;&gt;- collect forum data&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="60" width="120" x="40" y="240" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-7" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="&lt;div&gt;- tokenize forum&lt;/div&gt;&lt;div&gt;- generate report w/&lt;/div&gt;&lt;div&gt;recommendations&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="60" width="120" x="400" y="240" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-28" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-73">
|
||||
<mxGeometry relative="1" as="geometry">
|
||||
<mxPoint x="953" y="240" as="targetPoint" />
|
||||
</mxGeometry>
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-19" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="openai_batch" vertex="1">
|
||||
<mxGeometry height="60" width="120" x="720" y="170" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-21" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="&lt;div&gt;--model&lt;/div&gt;&lt;div&gt;--limit&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="60" width="120" x="590" y="210" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-23" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="--forum" vertex="1">
|
||||
<mxGeometry height="60" width="120" x="-90" y="170" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-26" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="&lt;div&gt;- split job into batches&lt;/div&gt;&lt;div&gt;- submit first batch&lt;/div&gt;&lt;div&gt;- status of current batch&lt;/div&gt;&lt;div&gt;- download batch artifacts&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="70" width="140" x="720" y="240" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-29" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="210" y="240" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-30" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="220" y="250" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-45" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-31" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0;exitDx=50;exitDy=43.5;exitPerimeter=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-5">
|
||||
<mxGeometry relative="1" as="geometry">
|
||||
<Array as="points">
|
||||
<mxPoint x="320" y="304" />
|
||||
<mxPoint x="320" y="200" />
|
||||
</Array>
|
||||
</mxGeometry>
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-31" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;&amp;lt;forumid&amp;gt;&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="230" y="260" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-47" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-34" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0;exitDx=50;exitDy=43.5;exitPerimeter=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-19">
|
||||
<mxGeometry relative="1" as="geometry">
|
||||
<Array as="points">
|
||||
<mxPoint x="640" y="284" />
|
||||
<mxPoint x="640" y="200" />
|
||||
</Array>
|
||||
</mxGeometry>
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-34" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;&amp;lt;forumid&amp;gt;&lt;br&gt;-report&lt;/div&gt;&lt;div&gt;.json&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="560" y="240" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-35" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;status&lt;/div&gt;&lt;div&gt;.json&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="913.25" y="360" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-43" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;jobN-&lt;/div&gt;&lt;div&gt;output&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="1090" y="360" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-48" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;jobN-errors&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="1150" y="360" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-54" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;jobN-&lt;/div&gt;&lt;div&gt;input&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="1030" y="360" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-64" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-63" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0;exitDx=50;exitDy=43.5;exitPerimeter=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-5">
|
||||
<mxGeometry relative="1" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-63" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>prompt</div><div>.txt</div>" vertex="1">
|
||||
<mxGeometry height="70" width="50" x="270" y="90" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-67" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="create" vertex="1">
|
||||
<mxGeometry height="20" width="120" x="850" y="170" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-71" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="<div>submit</div><div><br></div><div>status</div><div>download</div>" vertex="1">
|
||||
<mxGeometry height="60" width="120" x="1020" y="240" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-75" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-73" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" target="mENAtx_syaeSO5uR6kG6-35">
|
||||
<mxGeometry relative="1" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-76" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-73" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-61">
|
||||
<mxGeometry relative="1" as="geometry" />
|
||||
</mxCell>
|
||||
<mxCell id="mENAtx_syaeSO5uR6kG6-73" parent="1" style="image;aspect=fixed;perimeter=ellipsePerimeter;html=1;align=center;shadow=0;dashed=0;spacingTop=3;image=img/lib/active_directory/folder.svg;" value="&lt;forumid&gt;-N" vertex="1">
|
||||
<mxGeometry height="50" width="36.5" x="920" y="240" as="geometry" />
|
||||
</mxCell>
|
||||
</root>
|
||||
</mxGraphModel>
|
||||
</diagram>
|
||||
</mxfile>
|
||||
4
docs/pipeline-v1.2.3.svg
Normal file
File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 170 KiB |
BIN
docs/streamlit-snapshot.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 30 KiB |
374
docs/tasks.org
@@ -1,28 +1,382 @@
|
||||
* [ ] t1.1: scrape one forum (1)
|
||||
#+title: VATH Task Log
|
||||
#+date: [2026-05-05 Tue]
|
||||
#+startup: Overview
|
||||
|
||||
* [X] t1.1: scrape one forum (1)
|
||||
Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step.
|
||||
ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be raw list of all comments on forum - could be useful later for whole-scrape
|
||||
Append forum id to viewall per forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452)
|
||||
Comments are hydrated in backend via js-cued button (AJAX?).
|
||||
** acceptance criteria
|
||||
1. run manual scraper
|
||||
1. store proposal title and description
|
||||
2. store comment title, commenter, date
|
||||
3. store relevant metadata
|
||||
2. friendly/polite scraping
|
||||
3. store forum as distinct item with title, desc
|
||||
4. add forum ID in comment filename, eg forum452_comments_<datetime>.jsonl
|
||||
5. remove reg_title and reg_desc from each comment; these belong in forum item
|
||||
6. parse datetimes into object for later use (plotting)
|
||||
|
||||
** notes
|
||||
- scraper/spiders/forum.py — ForumSpider using ViewComments.cfm?GdocForumID=N with POST pagination. First request fetches page 1 (vPerPage=500), discovers the last page number from the form's link, generates all remaining page requests upfront. Parses each div.Cbox for all required fields.
|
||||
- scraper/items.py — CommentItem with forum_id, reg_title, reg_desc, comment_id, author, date, title, text
|
||||
- tests/test_forum_spider.py — 7 tests, all passing
|
||||
- Settings: DEFAULT_RESPONSE_ENCODING=utf-8 (fixes Windows-1251 meta-tag mismatch), HTTPCACHE_ENABLED=True, feed output to output/
|
||||
- ViewComments.cfm instead of comments.cfm: POST to Comments.cfm returned a 500 error (wrong endpoint). ViewComments.cfm?GdocForumID=N is the correct listing URL, returns full comment text on the page itself — no per-comment follow requests needed.
|
||||
- Span-wrapped text: .divComment p::text missed 3.6% of comments where text is in <p><span>text</span></p>. Fixed to .divComment *::text, .divComment::text. Worth knowing for when the spider is extended to other forums.
|
||||
- start() vs start_requests(): Scrapy 2.13+ deprecates start_requests() in favor of async def start()
|
||||
- ForumItem vs CommentItem: ForumItem (forum_id, reg_title, reg_desc) yielded once on first page; CommentItem no longer carries reg_title/reg_desc. Both land in the same JSONL feed.
|
||||
- Dynamic output filename: set via from_crawler() overriding FEEDS at 'spider' priority — format is output/forum{id}_comments_%(time)s.jsonl. FEEDS removed from settings.py; spider owns it.
|
||||
- Date parsing: _parse_date() normalizes whitespace, upper-cases, parses "%m/%d/%y %I:%M %p" → ISO 8601; falls back to raw string on failure.
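The date-parsing note above can be sketched roughly as follows. This is a minimal illustration, assuming only the format string quoted in the note; the function name `parse_date` and the sample input are hypothetical and not taken from the spider.

```python
import re
from datetime import datetime

# Format string quoted in the note above.
DATE_FORMAT = "%m/%d/%y %I:%M %p"


def parse_date(raw: str) -> str:
    """Return an ISO 8601 string, or the raw input unchanged if parsing fails."""
    # Collapse runs of whitespace (including non-breaking spaces) and upper-case am/pm.
    cleaned = re.sub(r"\s+", " ", raw).strip().upper()
    try:
        return datetime.strptime(cleaned, DATE_FORMAT).isoformat()
    except ValueError:
        # Fall back to the raw string so no scraped value is lost.
        return raw


print(parse_date("5/5/26  2:00 pm"))
print(parse_date("not a date"))
```

Returning the raw string on failure keeps the scrape lossless; a later validation pass can count rows that are not valid ISO 8601.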
|
||||
|
||||
** evidence
|
||||
- commit:
|
||||
- tests:
|
||||
- datetime:
|
||||
- commit: beb5cf4 (AC1-2), e7df0b2 (AC3-6)
|
||||
- tests: 8 passing (`python -m pytest tests -q`) or (`python -m pytest tests/`)
|
||||
- `scrapy crawl forum -a forum_id=452 -s LOG_LEVEL=WARNING 2>&1`
|
||||
- retrieved 9083 comments
|
||||
- datetime: [2026-05-05 Tue 14:00]
|
||||
|
||||
* [X] t1.2: initial 4o sentiment
|
||||
Write a simple manual pipeline for gpt-4o that reads one scraped forum jsonl file and produces a separate analyzed jsonl file. this step must not mutate scraper output. analysis should classify each comment for regulatory stance, generic tone/sentiment, confidence, and enough rationale/evidence to support later dashboard drilldown.
|
||||
Should be run manually, separate from scraper. You may use scrapy, but are not required to.
|
||||
- Sentiment is derived, not scraped - keep separate from raw comments.
|
||||
- keep jsonl as interchange/audit format
|
||||
|
||||
* [ ] t1.2: initial analysis pipeline
|
||||
Write a simple pipeline for both - prefer non-concurrent/async from scraping run. Should be run manually, separate from scraper. You may use scrapy, but are not required to.
|
||||
** acceptance criteria
|
||||
1. run manual sentiment analysis of selected file against haiku
|
||||
2. run manual sentiment analysis of selected file against gpt-4o
|
||||
1. input scraped jsonl doc by filename/path, e.g. "./output/forum452_comments_<datetime>.jsonl"
|
||||
- handle mixed itemtypes, e.g., forum + comment items
|
||||
2. output new analysis file, e.g., "analysis/forum452_<datetime>_<model>_<datetime>.jsonl"
|
||||
- one analysis record per comment
|
||||
- include run_id, forum_id, comment_id, analyzed_at, model, prompt_version
|
||||
3. capture stance toward proposed reg/guidance:
|
||||
- `stance`: support, oppose, neutral, unknown
|
||||
- `confidence`: 0-1
|
||||
- short rationale, if provided by model
|
||||
4. capture generic sentiment/tone separately from stance: `tone`=positive, negative, neutral, mixed, unclear
|
||||
5. capture issue/topic tags for later grouping, may be empty
|
||||
6. use .env for api key management
|
||||
7. document the exact prompt version used; prompt text may live in code or docs, but must have a version string/hash in output records
|
||||
8. for this run, an option to run the first N comments (5, 10, 20, 50) - will add batch processing later
|
||||
|
||||
** notes
|
||||
- analysis/gpt4o/analysis.py: standalone script; core functions importable for tests.
|
||||
- Prompt version = SHA-256[:7] of SYSTEM_PROMPT+USER_TEMPLATE; auto-updates on prompt change.
|
||||
- Output: analysis/gpt4o/forum{id}_{scrape_ts}_{model}_{run_ts}.jsonl, one record per comment.
|
||||
- --limit {5,10,20,50} for test runs; omit for full corpus. Batch processing planned for later.
|
||||
- Incremental flush after each record: safe to interrupt and inspect partial output.
|
||||
- temperature=0.0 for deterministic, reproducible classifications across runs.
|
||||
- Retry: 3 attempts (delays 1s, 2s) on RateLimitError; all other exceptions → error record + continue.
|
||||
- openai==2.34.0 installed; python-dotenv already present; key loaded from .env via OPENAI_API_KEY.
|
||||
- MAX_COMMENT_CHARS=6000: covers >99% without truncation; outliers (e.g. 18k-char law firm brief) flagged with truncated=True.
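Two of the notes above, the hash-derived prompt version and the truncation flag, are easy to illustrate. The sketch below assumes the prompt pieces are plain strings; the function names and sample prompts are hypothetical, and only `MAX_COMMENT_CHARS` and the 7-character SHA-256 prefix come from the notes.

```python
import hashlib

MAX_COMMENT_CHARS = 6000  # value from the note above


def prompt_version(system_prompt: str, user_template: str) -> str:
    """First 7 hex characters of SHA-256 over the concatenated prompt text."""
    digest = hashlib.sha256((system_prompt + user_template).encode("utf-8")).hexdigest()
    return digest[:7]


def truncate_comment(text: str, limit: int = MAX_COMMENT_CHARS) -> tuple[str, bool]:
    """Return the (possibly shortened) text and a flag saying whether it was cut."""
    return text[:limit], len(text) > limit


version = prompt_version("You classify public comments.", "Comment: {text}")
body, truncated = truncate_comment("x" * 18000)
print(version, len(body), truncated)
```

Because the version is derived from the prompt text itself, any prompt edit changes the value written to output records without a manual bump.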
|
||||
|
||||
** evidence
|
||||
- commit: d834d18
|
||||
- tests: 20 passing (pytest tests/analysis_gpt4o_realtime.py), 28 total across suite
|
||||
- `python ./analysis/gpt4o/analysis_realtime.py --limit 5 ./output/f452.jsonl`
|
||||
- see: ./analysis/gpt4o/forum452_unknown_gpt-4o_2026-05-05T18-48-32+00-00.jsonl
|
||||
- date: [2026-05-05 Tue 15:00]
|
||||
|
||||
* [X] t1.2.1: batch processing
|
||||
Create analysis-batch.py to capture same elements as t1.2 above.
|
||||
May need to add multiple commands to upload, check batch status, download, etc.
|
||||
Commands should all be run manually.
|
||||
Reference: ./docs/openai-batch.md. openai batch output order is not guaranteed, so custom_id is mandatory for reconciliation
|
||||
** acceptance criteria
|
||||
1. input scraped jsonl doc by filename/path, and process the whole thing via batch processing
|
||||
- ignore non-comment items in jsonl
|
||||
- do not modify raw scraper output
|
||||
- specify model and prompt
|
||||
2. output a run manifest in ./analysis/<model>/runs/<run_id>.json
|
||||
- include: run_id, input_filename, input_sha256, prompt_hash, model, batch_id, records_submitted, records_completed, records_failed, request_filename, raw_output_filename, normalized_output_filename, created_at, completed_at
|
||||
3. add tests without live api calls
|
||||
** notes
|
||||
- analysis/gpt4o/analysis-batch.py with three subcommands:
|
||||
- `submit`: reads scraped JSONL, builds batch request file (requests/<run_id>.jsonl), uploads to Files API, creates batch, saves manifest to runs/<run_id>.json. Prints run_id to stdout for scripting.
|
||||
- `status`: retrieves batch from OpenAI, prints status + counts, updates manifest.
|
||||
- `download`: downloads raw output to raw/<run_id>.jsonl, normalizes to <run_id>_<model>.jsonl using comment_lookup keyed by comment_id for reconciliation (batch output order not guaranteed). Updates manifest with filenames, counts, completed_at.
|
||||
- custom_id format: comment_{comment_id} — unique within a forum, stable across runs.
|
||||
- PROMPT_VERSION derived from analysis/prompt-1.txt (same file as realtime); both scripts produce matching prompt_hash in all records.
|
||||
- analysis/prompt-1.txt: system prompt as plaintext, read at import time by both scripts. Edit here to change prompt for both pipelines.
|
||||
|
||||
** evidence
|
||||
- commit: 683bfb3 (remove hyphen), f3abbef
|
||||
- tests: 18 passing (pytest tests/analysis_gpt4o_batch.py), 46 total across suite
|
||||
- datetime: [2026-05-05 Tue 17:00]
|
||||
|
||||
* [X] t1.2.2: Tokenizer / Batch mgmt
|
||||
openai batch analysis requires coordination - more like a job queue.
|
||||
batch script should set up a queue for the user to run manually; the openai api rejects subsequent batches once the total daily token limit is reached.
|
||||
** Acceptance Criteria
|
||||
1. add token estimator utility script, probably to /analysis
|
||||
2. add MODEL_LIMITS dict to analysis_batch.py. if there are more than (n)
|
||||
- gpt-4o (30k tpm/90k tpd batch)
|
||||
- gpt-4o-mini (200k tpm/2M tpd batch)
|
||||
- add models listed in docs/openai.md
|
||||
3. Auto-chunk submit: before writing the request file, walk comments, accumulate estimated tokens, and split into chunks that fit under the model's limit.
|
||||
- Each chunk becomes its own batch submission with its own run_id.
|
||||
- Drop --limit (or keep as hard cap override).
|
||||
- Print all run_ids
|
||||
- Submit the first batch only (failed)
|
||||
4. Update test script to show tokenizer output
|
||||
|
||||
** notes
|
||||
- MODEL_LIMITS and _MODEL_ENCODING dicts in analysis/gpt4o/analysis_batch.py; keyed by model name, sourced from docs/openai.md. Unknown models fall back to o200k_base encoding and 900k token limit.
|
||||
- estimate_tokens(messages, model): uses tiktoken (o200k_base) when available; falls back to chars/3 + 4 overhead per message.
|
||||
- chunk_comments_by_tokens(comments, forum, model): greedy bin-pack; respects 10% headroom (_LIMIT_BUFFER=0.90). Returns list of comment lists.
|
||||
- submit sends only chunks[0] — enqueued token limit is a TOTAL across all concurrent batches; stacking would exceed quota. Remaining chunk ranges are printed as manual instructions.
|
||||
- --limit N still available as a hard cap on total comments before chunking (useful when org-tier limit is below the published model limit).
|
||||
- pip install tiktoken required for exact token counting; chars/3 fallback activates automatically if not installed.
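The chunking approach in these notes can be sketched as below. This is an illustration only: it uses the chars/3 + 4 fallback estimate per comment rather than tiktoken, treats each comment as a bare string, and the function names are hypothetical. The 0.90 buffer mirrors `_LIMIT_BUFFER` from the note.

```python
def estimate_tokens(text: str) -> int:
    """Fallback estimate from the notes: roughly chars/3 plus 4 tokens of overhead."""
    return len(text) // 3 + 4


def chunk_by_tokens(texts, token_limit, buffer=0.90):
    """Greedily fill chunks in input order, keeping each under limit * buffer."""
    budget = int(token_limit * buffer)
    chunks, current, used = [], [], 0
    for text in texts:
        cost = estimate_tokens(text)
        # Start a new chunk when adding this comment would exceed the budget.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        chunks.append(current)
    return chunks


comments = ["a" * 30] * 5          # each costs 30 // 3 + 4 = 14 estimated tokens
chunks = chunk_by_tokens(comments, token_limit=40)
print([len(c) for c in chunks])    # → [2, 2, 1]
```

A comment larger than the whole budget still gets its own chunk in this sketch; a real run would presumably want to flag or truncate it first.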
|
||||
|
||||
|
||||
*** usage
|
||||
- `pip install tiktoken`
|
||||
- submit first chunk (auto-sized to model token limit, uses most recent output file)
|
||||
`python analysis/gpt4o/analysis_batch.py submit output/f452.jsonl --model gpt-4o-mini`
|
||||
- check status (defaults to most recent run)
|
||||
`python analysis/gpt4o/analysis_batch.py status`
|
||||
- download + normalize when complete
|
||||
`python analysis/gpt4o/analysis_batch.py download`
|
||||
- submit next chunk: rerun with `--limit` to cover the next N comments
|
||||
(track which comment_ids have already been analyzed to avoid duplicates)
|
||||
|
||||
*** validation
|
||||
#+begin_src python
|
||||
import pandas as pd
|
||||
df_input = pd.read_json('C:/Users/moses/projects/vath/analysis/gpt4o/runs/75ee9a/f452.jsonl', lines=True)
|
||||
# drop forum item
|
||||
df_input_comments = df_input[df_input["comment_id"].notna()].copy()
|
||||
df_output = pd.read_json('C:/Users/moses/projects/vath/analysis/gpt4o/runs/75ee9a/75ee9a6c-8fc2-4924-8d96-b55bb4d5e832_gpt-4o.jsonl', lines=True)
|
||||
dfm = df_output.merge(df_input_comments,on="comment_id",how="left",suffixes=("","_input"),)
|
||||
dfm.to_csv('C:/Users/moses/projects/vath/analysis/gpt4o/1.csv')
|
||||
#+end_src
|
||||
order columns:
|
||||
forum_id_input,comment_id,title,text,date,author,stance,stance_confidence,stance_rationale,tone,tags,error,truncated,analyzed_at,prompt_version,model
|
||||
|
||||
** evidence
|
||||
- commit:
|
||||
- tests:
|
||||
- date:
|
||||
- tests: 23 passing (pytest tests/analysis_gpt4o_batch.py), 51 total across suite
|
||||
- datetime: [2026-05-06 Wed 08:55]
|
||||
|
||||
* [X] t1.2.3: batch job refactor
|
||||
This task encompasses intent and fixes for 1.2.1 and 1.2.2.
|
||||
batch processing should be a resumable job queue, not a one-shot script. the user should not need to remember offsets, completed chunks, failed batches, or which comments remain.
|
||||
** Acceptance Criteria
|
||||
1. create tokenizer to prepare the batch job
|
||||
- input: prompt.txt, forum.jsonl
|
||||
- output: report.json with each model's batch structure, cost, and time (considering tpd constraints)
|
||||
- analysis_batch should be able to take this report to run the job. good place to copy the raw scraper jsonl
|
||||
#+begin_src python
|
||||
{'prompt': 'prompt1.txt',
 'input_file': 'f451.jsonl',
 'input_tokens': 123456789,
 'gpt-4o': {'jobs': 71, 'cost_$': 4, 'est_queue_days': 3},  # divide tokens by model TPD to get est_queue_days
 'gpt-4o-mini': {'jobs': 71, 'cost_$': 4, 'est_queue_days': 3}}  # divide tokens by model TPD to get est_queue_days
|
||||
#+end_src
|
||||
2. batch py should contain commands to create, check, run, and complete jobs.
|
||||
- inputs: report.json, --model, optional --job N, read api key from .env
|
||||
- outputs:
|
||||
- status.json: job structure, status, metadata; updated when jobs are finished. includes all report.json info
|
||||
- for each job: jobN-input.jsonl (what is sent to openai); jobN-output-raw.jsonl, jobN-output.jsonl, and jobN-errors.jsonl (when downloaded)
|
||||
- jobN-output.jsonl contains:
|
||||
- one analysis record per comment
|
||||
- `run_id`, `forum_id`, `comment_id`, `analyzed_at`, `model`, `prompt_version`
|
||||
- `stance` toward proposed reg/guidance: support|oppose|neutral|unclear
|
||||
- `stance_confidence`: 0-1
|
||||
- short rationale, if provided by model
|
||||
- generic sentiment `tone` (separate from stance): positive|negative|neutral|mixed|unclear
|
||||
- `tags` for later grouping, may be empty
|
||||
- commands: `create`, `submit`, `status`, `download`
|
||||
- `create` run directory, copy input/prompt/report, generate status.json, job request files
|
||||
- `submit` if eligible, submit next or specified job; does not blindly stack jobs, warns if prev jobs in progress, print next action
|
||||
- `status` check status of one or all submitted jobs, update status.json
|
||||
- `download` raw output (jobN-output-raw.jsonl) and error files for completed jobs, and normalize raw output (jobN-output.jsonl); auto-runs `status` first.
|
||||
3. tests without live api calls
|
||||
- partial completed run
|
||||
- failed batch records
|
||||
- out-of-order output
|
||||
- duplicate custom_id
|
||||
- missing output file
|
||||
- resume from status.json
|
||||
- remaining-comment detection
|
||||
|
||||
** notes
|
||||
- analysis/tokenizer.py: new standalone script; imports openai_batch for MODEL_LIMITS, estimate_tokens, build_messages. Reads input JSONL + prompt, computes per-model jobs/cost/time table, writes reports/<stem>-report.json. MODEL_PRICING dict lives here (not in openai_batch). Pass a jobN-input.jsonl to count actual tokens instead.
|
||||
- analysis/openai_batch.py: fully rewritten with four subcommands: create, submit, status, download. Job dirs at analysis/jobs/<stem[:8]>-N/.
|
||||
- Job directories: analysis/jobs/<stem[:8]>-N/ (e.g. f452-1). Each run is self-contained: forum.jsonl, prompt.txt, report.json, jobN-input.jsonl, jobN-output-raw.jsonl, jobN-output.jsonl, jobN-errors.jsonl.
|
||||
- status.json: tracks all jobs with pending/submitted/in_progress/completed/failed states. Updated by submit, status, download.
|
||||
- _find_next_eligible_job: pure function for testability. Returns (next_pending_job, None) or (None, warning). Blocks submission if previous job is in_progress/submitted.
|
||||
- create: no API key required. Reads report.json, re-chunks comments, writes all jobN-input.jsonl files, writes status.json.
|
||||
- submit: uploads jobN-input.jsonl to Files API, creates batch, updates status.json to 'submitted'. Will not stack batches.
|
||||
- status: retrieves batch from OpenAI, updates status.json counts and status.
|
||||
- download: auto-runs status first, downloads output_file_id → jobN-output-raw.jsonl, error_file_id → jobN-errors.jsonl, normalizes → jobN-output.jsonl. Updates status.json.
|
||||
- tests/tokenizer.py: 19 tests for compute_report schema, cost/time calculation, MODEL_PRICING coverage, print_table output, count_input_tokens, report.json round-trip.
|
||||
- Token limit buffer: _LIMIT_BUFFER=0.80 (20% headroom). Estimate uses OpenAI cookbook chat formula (role tokens + 3-token reply primer). Verify a job file with: python analysis/tokenizer.py analysis/jobs/<dir>/jobN-input.jsonl
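The submission gate described for `_find_next_eligible_job` above might look something like the following pure function. The job dict shape (`n`, `status`) and the warning text are assumptions made for illustration; only the return convention and the blocking states come from the notes.

```python
BLOCKING_STATES = {"submitted", "in_progress"}


def find_next_eligible_job(jobs):
    """Return (job, None) when a pending job may be submitted, else (None, warning)."""
    # Refuse to stack batches: any job still running blocks the next submission.
    for job in jobs:
        if job["status"] in BLOCKING_STATES:
            return None, f"job {job['n']} is {job['status']}; wait for it to finish"
    # Otherwise hand back the first job that has not been submitted yet.
    for job in jobs:
        if job["status"] == "pending":
            return job, None
    return None, "no pending jobs remain"


jobs = [
    {"n": 1, "status": "completed"},
    {"n": 2, "status": "pending"},
    {"n": 3, "status": "pending"},
]
job, warning = find_next_eligible_job(jobs)
print(job, warning)
```

Keeping this free of API calls and file I/O is what makes the resume, partial-run, and failed-batch cases testable without a live key.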
|
||||
|
||||
*** usage
|
||||
#+begin_src powershell
|
||||
# 1. estimate tokens and cost
|
||||
python analysis/tokenizer.py output/f452.jsonl --prompt analysis/prompt-1.txt
|
||||
# writes reports/f452-report.json
|
||||
|
||||
# 2. verify actual tokens in a job file (optional sanity check)
|
||||
python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl
|
||||
|
||||
# 3. create job directory (no api key needed)
|
||||
python analysis/openai_batch.py create reports/f452-report.json --model gpt-5.4-mini
|
||||
# creates analysis/jobs/f452-1/
|
||||
|
||||
# 4. submit first job
|
||||
python analysis/openai_batch.py submit
|
||||
|
||||
# 5. check status (repeat until completed)
|
||||
python analysis/openai_batch.py status
|
||||
|
||||
# 6. download and normalize
|
||||
python analysis/openai_batch.py download
|
||||
|
||||
# 7. submit next job (if multi-job run), then repeat 5-6
|
||||
python analysis/openai_batch.py submit
|
||||
#+end_src
|
||||
|
||||
** evidence
|
||||
- commit:
|
||||
- tests: passing (pytest tests/openai_batch.py tests/openai_realtime.py tests/tokenizer.py)
|
||||
- datetime: [2026-05-06 Wed]
|
||||
|
||||
* [X] t1.3: cleanup model output and rejoin
|
||||
create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv.
|
||||
review create_csv for the simple approach - keep this regardless
|
||||
|
||||
** acceptance criteria
|
||||
1. input raw scrape jsonl and all *-output.jsonl files in a dir
|
||||
2. join by comment_id, not dataframe index
|
||||
3. output csv columns in review order:
|
||||
- forum_id, comment_id, title, text, date, author
|
||||
- stance, stance_confidence, stance_rationale, tone, tags
|
||||
- error, truncated, analyzed_at, prompt_version, model
|
||||
4. output parquet?
|
||||
5. print validation counts
|
||||
- raw comments
|
||||
- analyzed records
|
||||
- joined records
|
||||
- missing comment text
|
||||
- duplicate comment_ids
|
||||
- error records
|
||||
- stance counts
|
||||
- tone counts
|
||||
6. tests cover join behavior and missing/duplicate ids
|
||||
|
||||
** notes
|
||||
- analysis/create_csv.py: reads raw scrape JSONL + all job*-output.jsonl in a job dir (skips *-output-raw.jsonl); left-joins on comment_id; writes review.csv (UTF-8 BOM for Excel); optional --parquet.
|
||||
- Uses pd.read_json(path, lines=True) — no manual JSON parsing.
|
||||
- Prints summary counts: raw/analyzed/joined/unanalyzed/errors/duplicate IDs, stance distribution, tone distribution.
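To show the join-by-id idea without pulling in pandas, here is a dependency-free sketch. It is not how create_csv.py is written (the notes say it uses `pd.read_json`); it assumes a left join from the raw comments, and the function name and record shapes are hypothetical.

```python
from collections import Counter


def join_on_comment_id(raw_items, analysis_records):
    """Left-join analysis records onto raw comments by comment_id and count outcomes."""
    # Forum items have no comment_id, so they drop out here.
    comments = {c["comment_id"]: c for c in raw_items if c.get("comment_id") is not None}
    id_counts = Counter(r["comment_id"] for r in analysis_records)
    analysis = {r["comment_id"]: r for r in analysis_records}  # last record wins on duplicates
    rows = [{**comment, **analysis.get(cid, {})} for cid, comment in comments.items()]
    counts = {
        "raw": len(comments),
        "analyzed": len(analysis_records),
        "joined": sum(1 for cid in comments if cid in analysis),
        "unanalyzed": sum(1 for cid in comments if cid not in analysis),
        "duplicate_ids": sorted(cid for cid, n in id_counts.items() if n > 1),
        "stance": dict(Counter(r.get("stance") for r in analysis_records)),
    }
    return rows, counts


raw = [
    {"forum_id": 452, "reg_title": "example forum item"},
    {"forum_id": 452, "comment_id": 1, "text": "first"},
    {"forum_id": 452, "comment_id": 2, "text": "second"},
]
analyzed = [
    {"comment_id": 1, "stance": "support"},
    {"comment_id": 1, "stance": "support"},
]
rows, counts = join_on_comment_id(raw, analyzed)
print(counts)
```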
|
||||
|
||||
*** usage
|
||||
#+begin_src sh
|
||||
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/
|
||||
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
|
||||
# output: analysis/jobs/f452-1/review.csv (and optionally review.parquet)
|
||||
#+end_src
|
||||
|
||||
** evidence
|
||||
- commit: 28d6d22
|
||||
- tests: passing (pytest tests/create_csv.py tests/encoding.py)
|
||||
- csv: analysis/jobs/f452-1/review.csv
|
||||
- datetime: [2026-05-07 Thu 17:23]
|
||||
|
||||
* [X] t1.1.1: text encoding cleanup
|
||||
fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as ’.
|
||||
|
||||
|
||||
** acceptance criteria
|
||||
1. identify whether mojibake exists in raw scrape, analysis output, or csv export only
|
||||
2. add repair step at the earliest correct layer
|
||||
3. preserve original raw scrape if repair changes source text
|
||||
4. add test cases for common bad sequences:
|
||||
- ’
|
||||
- “
|
||||
- â€
|
||||
- –
|
||||
- —
|
||||
5. document whether repaired text is used for model input
|
||||
|
||||
** notes
|
||||
- Diagnosis: f452.jsonl raw data is CLEAN — proper Unicode throughout (U+2019, U+201C, etc.). The DEFAULT_RESPONSE_ENCODING=utf-8 spider setting is working for this site. No mojibake or FFFD chars found.
|
||||
- The encoding issue would surface for forums whose server sends cp1252 bytes (0x91-0x97 range) embedded in otherwise UTF-8 content. FFFD replacement chars appear when the UTF-8 decoder hits those bytes. Once the byte is replaced by FFFD, the original character cannot be recovered.
|
||||
- Repair layer: analysis/encoding.py applied in analysis/validate.py at reporting time. Raw scrape JSONL is never modified (AC3).
|
||||
- Model input: repair_text() is NOT applied in build_messages() for this dataset since raw data is clean. Can be added if a future forum produces dirty text.
|
||||
- Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.
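For the classic case where UTF-8 bytes were decoded as cp1252, a round-trip repair can be sketched as below. This is an assumption-laden illustration, not the contents of analysis/encoding.py; the function name is hypothetical, and a library such as ftfy covers many more cases, including partially damaged sequences that this sketch leaves untouched.

```python
def repair_cp1252_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes read as cp1252'; return the input unchanged if that fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text is already clean, or it is damaged beyond a simple round trip.
        return text


# Mojibake form of a right single quote (U+2019), written with escapes for clarity.
broken = "don\u00e2\u20ac\u2122t"
print(repair_cp1252_mojibake(broken) == "don\u2019t")
```

Clean text that already contains curly quotes or accented letters fails the round trip and is returned as-is, so the function is safe to apply to text that was never damaged.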
|
||||
|
||||
** evidence
|
||||
- commit: 1ea696d
|
||||
- tests: passing (pytest tests/encoding.py)
|
||||
- before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
|
||||
- datetime: [2026-05-07 Thu 17:00]
|
||||
|
||||
* [X] t1.4: graph data prototype
|
||||
create ./viz/prototype_charts.py generating individual plotly charts for exploring graphs to embed into streamlit or dash later
|
||||
|
||||
** acceptance criteria
|
||||
2. create graph for Stance/Share
|
||||
- stacked h-bar with % support/oppose/neutral/unknown + raw totals, eg 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
|
||||
- later, consider centered diverging h-bar: oppose ← | neutral/unknown | → support
|
||||
3. create graph for Stance/Time:
|
||||
- cumulative support/oppose % over time
|
||||
4. create graph for Stance/Tone (heatmap count)
|
||||
5. create graph for Confidence/Stance (boxplot or histogram)
|
||||
|
||||
** notes
|
||||
- prototyped in plotly
|
||||
- initial streamlit
|
||||
|
||||
** evidence
|
||||
- commit: 3fb424d
|
||||
- tests: see viz/proto and viz/chart_tests
|
||||
- datetime: [2026-05-08 Fri 08:38]
|
||||
|
||||
* [X] t1.5: streamlit
|
||||
create organized webpage displaying useful information from completed job and analysis
|
||||
|
||||
** acceptance criteria
|
||||
1. display total stance breakdown
|
||||
2. display centered horiz-bar with absolute stances
|
||||
3. show daily comment stances and cumulative
|
||||
4. show comment table with filters for stance (filter tone?)
|
||||
5. clicking/selecting a comment shows full text and model rationale
|
||||
6. app runs locally with one command
|
||||
|
||||
** notes
|
||||
data pulls entirely from the job; goal is to point viz/streamlit.py at any job/ folder and have everything it needs
|
||||
|
||||
** evidence
|
||||
- commit: cc16acb
|
||||
- tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
|
||||
- datetime: [2026-05-08 Fri 23:44]
|
||||
|
||||
* +[ ] t1.6 host streamlit via dockerfile+
|
||||
planning to deploy manually, get cert, etc etc. probably don't care about https?
|
||||
+using streamlit.app instead+
|
||||
** acceptance criteria
|
||||
1. write dockerfile with slim image
|
||||
|
||||
** notes
|
||||
|
||||
* === Backlog ===
|
||||
- add forum_url, forum_collected_date to scraper (to add to viz)
|
||||
* [ ] X: complete proposal information
|
||||
Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
|
||||
** acceptance criteria
|
||||
1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
|
||||
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
|
||||
2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
|
||||
* [ ] X: add helper data to create_csv
|
||||
1. in create_csv.py, create helper columns:
|
||||
- stance_signed = {"support":1, "oppose":-1, "neutral":0, "unknown":0}
|
||||
- stance_weighted = stance_signed * stance_confidence
|
||||
- is_support_oppose = stance in ["support", "oppose"]
|
||||
- date_day
|
||||
- date_hour
|
||||
- text_norm
|
||||
- text_hash
|
||||
- confidence_bucket = 'low' <.7 | 'med' .7-.89 | 'high' >=.9
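The helper columns listed in this backlog item could be derived per record along these lines. This sketch covers only the columns whose rules are spelled out above (signed stance, weighted stance, the support/oppose flag, and the confidence bucket); the function names are hypothetical.

```python
STANCE_SIGNED = {"support": 1, "oppose": -1, "neutral": 0, "unknown": 0}


def confidence_bucket(confidence: float) -> str:
    """Bucket thresholds from the list above: low < .7, med .7-.89, high >= .9."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.7:
        return "med"
    return "low"


def helper_columns(record: dict) -> dict:
    """Compute derived columns for one analysis record."""
    signed = STANCE_SIGNED.get(record["stance"], 0)  # unrecognized stances count as 0
    confidence = record["stance_confidence"]
    return {
        "stance_signed": signed,
        "stance_weighted": signed * confidence,
        "is_support_oppose": record["stance"] in ("support", "oppose"),
        "confidence_bucket": confidence_bucket(confidence),
    }


print(helper_columns({"stance": "oppose", "stance_confidence": 0.8}))
```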
|
||||
|
||||
105
docs/tb.py
Normal file
@@ -0,0 +1,105 @@
import jsonlines
import re
from textblob import TextBlob
from collections import Counter

def tprint(obj):
    print(f"{type(obj)} : {obj}")


def sort_file(file):
    '''return number of positive and negative comments based on TextBlob sentiment analysis'''
    # with jsonlines.open("/vadoe/vadoe/vadoe/townhall_2021-01-14T02-05-51.json") as reader:
    with jsonlines.open(file, mode='r') as reader:
        # Confirm type
        tprint(reader)

        # Build iterator
        _doc = iter(reader)
        i = 0
        pos = 0
        neg = 0
        posl = []
        negl = []

        while i<25:
            _line = next(_doc)
            tprint(_line)
            if _line['sentiment'] == 'pos':
                pos = pos + 1
                posl.append(_line['comment'])
            elif _line['sentiment'] == 'neg':
                neg = neg + 1
                negl.append(_line['comment'])
            i=i+1

        print(f'{pos} positive and {neg} negative comments')
        # tst = TextBlob(obj['comment'])
        # tst.sentiment

def process_file(file):
    '''Find Smythers posts'''
    with jsonlines.open(file, mode='r') as reader:
        _doc = iter(reader)
        _list = []
        for item in _doc:
            try:
                if item['author'][0] == 'Smythers':
                    _list.append(item['content'][0])
            except KeyError:
                continue
    return(_list)

def write_file(file, data:object):
    '''Write data to file'''
    with jsonlines.open(file, mode='w') as writer:
        for each in data:
            writer.write(each)
    print('write successful')

def clean_text(text:str):
    s1 = remove_html(text)
    s2 = remove_http(s1)
    return s2

def remove_html(text:str):
    '''Remove html tags from string'''
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

def remove_http(text:str):
    '''Remove URLs from string'''
    return re.sub(r'http\S+','', text)

def get_nouns(text:str):
    blob = TextBlob(text)
    # check nouns? or no
    return blob.tags

vadoe = '/vadoe/vadoe/vadoe/townhall_2021-01-14T02-05-51.json'
vadoe_p = '/vadoe/vadoe/vadoe/townhall_2021-01-14T05-11-55.json'
dlr = '/vadoe/vadoe/vadoe/dlr.json'

smythers_pc = '/vadoe/vadoe/vadoe/smythers.json'
write_to = '/vadoe/vadoe/vadoe/nouns.json'

# processed_file(file)
smythers_posts = process_file(dlr)
# cleaned = []
# for each in smythers:
# cleaned.append(clean_text(each))
cleaned = [clean_text(each) for each in smythers_posts]
nouns = []
for x in cleaned:
    _list = get_nouns(x)
    for y in _list:
        nouns.append(y)
# nouns.append(x for x in [get_nouns())
sortedNouns = Counter(nouns)
nouns = []
for k, v in sortedNouns.items():
    if v > 2:
        _d = (k, v)
        nouns.append(_d)
print(nouns)
write_file(write_to, nouns)
|
||||
45
docs/townhall.py
Normal file
@@ -0,0 +1,45 @@
# -*- coding: utf-8 -*-
import scrapy
from items import CommentItem
import textblob
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

class TownhallSpider(scrapy.Spider):
    name = 'townhall'
    allowed_domains = ['townhall.virginia.gov']
    start_urls = ['https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452']
    custom_settings = {
        'FEED_EXPORTERS' : {
            "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
        },
        'FEED_URI' : '%(name)s_%(time)s.json',
        'FEED_FORMAT': 'jsonlines'
    }

    def parse(self, response):
        rows = response.css('#contentwide>table>tr')
        # cut out the header row
        for each in rows[1:]:
            # for each in rows[1:6]:
            cols = each.xpath('.//td')
            linkfollow = cols[0].css('a::attr(href)').get()
            comment_title = cols[0].xpath('a/text()').get()
            # clean up
            commenter = cols[1].xpath('text()').get()
            # clean up
            date = cols[2].xpath('a/text()').get()
            print(f'{comment_title} | {commenter}')
            yield response.follow(linkfollow, callback = self.parse_comment)

    def parse_comment(self, response):
        entry = CommentItem()
        text = response.css('.divComment>p::text').get()
        text = text.replace(u'\u00a0',' ')
        entry['comment'] = text
        blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
        entry['sentiment'] = blob.sentiment.classification
        entry['sentiment_pos'] = blob.sentiment.p_pos
        entry['sentiment_neg'] = blob.sentiment.p_neg
        # yield CommentItem(comment = response.css('.divComment>p::text').get())
        yield entry
|
||||
62
docs/townhall2.py
Normal file
@@ -0,0 +1,62 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
import scrapy
|
||||
from items import CommentItem
|
||||
import textblob
|
||||
from textblob import TextBlob
|
||||
from textblob.sentiments import NaiveBayesAnalyzer
|
||||
|
||||
class TownhallSpider(scrapy.Spider):
|
||||
name = 'townhall'
|
||||
allowed_domains = ['townhall.virginia.gov']
|
||||
start_urls = ['https://www.townhall.virginia.gov/L/Forums.cfm']
|
||||
custom_settings = {
|
||||
'FEED_EXPORTERS' : {
|
||||
"jsonlines": "scrapy.exporters.JsonLinesItemExporter",
|
||||
},
|
||||
'FEED_URI' : '%(name)s_%(time)s.json',
|
||||
'FEED_FORMAT': 'jsonlines'
|
||||
}
|
||||
|
||||
def parse(self, response):
|
||||
rows = response.css('table>tr>td')
|
||||
for each in rows:
|
||||
linkfollow = each.css('a').attrib['href']
|
||||
if 'comments' in linkfollow:
|
||||
yield response.follow(linkfollow, callback = self.parse_forum)
|
||||
|
||||
cols = each.xpath('.//td')
|
||||
linkfollow = cols[0].css('a::attr(href)').get()
|
||||
comment_title = cols[0].xpath('a/text()').get()
|
||||
# clean up
|
||||
commenter = cols[1].xpath('text()').get()
|
||||
# clean up
|
||||
date = cols[2].xpath('a/text()').get()
|
||||
print(f'{comment_title} | {commenter}')
|
||||
yield response.follow(linkfollow, callback = self.parse_comment)
|
||||
|
||||
def parse_forum(self, response):
|
||||
rows = response.css('#contentwide>table>tr')
|
||||
# cut out the header row
|
||||
for each in rows[1:]:
|
||||
# for each in rows[1:6]:
|
||||
cols = each.xpath('.//td')
|
||||
linkfollow = cols[0].css('a::attr(href)').get()
|
||||
comment_title = cols[0].xpath('a/text()').get()
|
||||
# clean up
|
||||
commenter = cols[1].xpath('text()').get()
|
||||
# clean up
|
||||
date = cols[2].xpath('a/text()').get()
|
||||
print(f'{comment_title} | {commenter}')
|
||||
yield response.follow(linkfollow, callback = self.parse_comment)
|
||||
|
||||
def parse_comment(self, response):
|
||||
entry = CommentItem()
|
||||
text = response.css('.divComment>p::text').get()
|
||||
text = text.replace(u'\u00a0',' ')
|
||||
entry['comment'] = text
|
||||
blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
|
||||
entry['sentiment'] = blob.sentiment.classification
|
||||
entry['sentiment_pos'] = blob.sentiment.p_pos
|
||||
entry['sentiment_neg'] = blob.sentiment.p_neg
|
||||
# yield CommentItem(comment = response.css('.divComment>p::text').get())
|
||||
yield entry
|
||||
@@ -1,49 +1,110 @@
|
||||
#+title: VA Townhall
|
||||
#+date: [2026-05-05 Tue]
|
||||
#+version: 1
|
||||
#+version: 1.1
|
||||
|
||||
* Project Goals
|
||||
** Project Goals
|
||||
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
||||
1. the utility of this forum as a mechanism for public comment, and
|
||||
2. the impact of this forum on Virginia regulation.
|
||||
2. Make data and insights broadly available.
|
||||
3. Generalize to other public comment tools.
|
||||
|
||||
** Document and analyze sentiment
|
||||
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
|
||||
- Build tests for identifying abuse, such as spam and account fraud
|
||||
- Identify any patterns connecting measured sentiment to VA decisions
|
||||
*** Research questions
|
||||
1. What is the quality of the comments on the forum?
|
||||
1. Are there duplicate entries?
|
||||
2. Are there non-human-generated entries?
|
||||
3. Are there entries intended to abuse the forum or drown out comment?
|
||||
2. How do commenters feel about the proposed change?
|
||||
1. What is the total number and percent supporting vs opposing, and how does this change over time?
|
||||
2. What is the type of support, such as strong/weak, positive/negative?
|
||||
3. What impact do the comments have on the proposed change?
|
||||
(I anticipate this will not be measurable from currently available data)
|
||||
|
||||
** Make data available
|
||||
- Pick a good visualization tool
|
||||
** Architecture
|
||||
1. Scrape/Parse: Scrapy
|
||||
2. Sentiment analysis: gpt-5.4-mini
|
||||
3. Display: streamlit
|
||||
4. Storage: jsonl, csv, parquet
|
||||
|
||||
** Generalize
|
||||
- Identify scalable ways to apply this toolset to similar problems
|
||||
[[file:pipeline-v1.2.3.svg]]
|
||||
|
||||
* Architecture
|
||||
1. Scrape/Parse: **Scrapy** for downloading comments
|
||||
2. Storage: json
|
||||
3. Sentiment analysis: Claude haiku
|
||||
4. Display: TBD
|
||||
|
||||
** Scraper
|
||||
Scrapy provides a simple mechanism for browsing and
|
||||
*** Scraper
|
||||
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
|
||||
1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
||||
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
|
||||
3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
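
The three page types listed above can be told apart from the URL alone. The following is a minimal sketch, not the spider's actual routing code: the function name `classify_townhall_url` and the returned labels are hypothetical, and it assumes query parameter names can be compared case-insensitively.

```python
from urllib.parse import urlparse, parse_qs


def classify_townhall_url(url):
    """Return (page_type, id_param, id_value) for a Town Hall URL.

    Hypothetical helper: the labels are illustrative only.
    """
    parsed = urlparse(url)
    page = parsed.path.rsplit("/", 1)[-1].lower()
    # Assumption: parameter names are treated case-insensitively.
    query = {k.lower(): v[0] for k, v in parse_qs(parsed.query).items()}

    if page == "forums.cfm":
        return ("forum_list", None, None)
    if page == "comments.cfm":
        for key in ("gdocforumid", "stageid", "petitionid"):
            if key in query:
                return ("comment_list", key, query[key])
    if page == "viewcomments.cfm" and "commentid" in query:
        return ("comment", "commentid", query["commentid"])
    return ("unknown", None, None)
```

A URL that matches none of the three listed patterns falls through to `"unknown"`, so new page variants are surfaced rather than silently misrouted.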
|
||||
|
||||
** Storage
|
||||
One JSONL file per forum/bill.
|
||||
*** Analysis
|
||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.
|
||||
|
||||
** Analysis
|
||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
|
||||
Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
|
||||
|
||||
| Tool | Output | Context | Sarcasm | Context window | Cost/1k comments |
|
||||
|-------------------+--------------------------------+------------+------------------+----------------+------------------|
|
||||
| Google NL API | -1→+1, magnitude | No/generic | Poorly | No | ~$1–2 |
|
||||
| Amazon Comprehend | Pos/Neg/Neutral/Mixed | No/generic | Poorly | No | ~$0.10 |
|
||||
| Claude Haiku | Prompted → for/against/neutral | Yes | Yes, with prompt | Yes | ~$0.10–0.30 |
|
||||
| GPT-4o-mini | Prompted → same | Yes | Yes | Yes | ~$0.05–0.15 |
|
||||
Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to submit them, check their status, and download the results.
|
||||
|
||||
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
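
As a rough illustration of the cost estimate, the sketch below multiplies the input token count by a per-million-token rate. The function name `estimate_batch_cost` is hypothetical, and the two rates are not quoted prices: they are simply the ratios implied by the `cost_$` and `input_tokens` values in `reports/f452-1.json`. Output tokens are ignored here, which is an assumption about how the report is computed.

```python
def estimate_batch_cost(input_tokens, usd_per_million_tokens):
    """Estimate input-token cost in USD, rounded to 4 decimals.

    Hypothetical helper; output tokens are not counted.
    """
    return round(input_tokens / 1_000_000 * usd_per_million_tokens, 4)


# Rates implied by the sample report (assumption, not a price list).
IMPLIED_RATES = {"gpt-4o": 1.25, "gpt-4o-mini": 0.075}

INPUT_TOKENS = 6_397_254  # "input_tokens" from reports/f452-1.json

for model, rate in IMPLIED_RATES.items():
    print(model, estimate_batch_cost(INPUT_TOKENS, rate))
```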
|
||||
|
||||
**** Prompt
|
||||
```
|
||||
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
||||
regulatory comment system. You will be given the text of a proposed regulation and a single
|
||||
public comment. Return ONLY a JSON object — no other text.
|
||||
|
||||
Definitions:
|
||||
- stance: the commenter's position on whether the regulation should be adopted.
|
||||
"support" = wants it approved (as-is or with changes);
|
||||
"oppose" = wants it rejected or substantially weakened;
|
||||
"neutral" = takes no position, asks a question, or provides factual input only;
|
||||
"unknown" = too vague, off-topic, or uninterpretable to classify.
|
||||
- tone: the emotional register of the writing, independent of stance.
|
||||
"positive" = affirming, hopeful, appreciative;
|
||||
"negative" = angry, fearful, alarmed, or contemptuous;
|
||||
"neutral" = matter-of-fact, procedural, or informational;
|
||||
"mixed" = contains both positive and negative emotional content;
|
||||
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
||||
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
||||
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||
|
||||
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||
```
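
Because the prompt asks for a fixed set of keys and labels, each reply can be checked before it is stored. This is a minimal sketch under stated assumptions: `validate_response` is a hypothetical helper, not a function from the analysis scripts, and it only enforces the constraints spelled out in the prompt.

```python
import json

STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(content):
    """Parse a model reply and return (record, errors).

    Hypothetical helper: record is None when the reply is not a JSON object.
    """
    try:
        record = json.loads(content)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["reply is not a JSON object"]

    errors = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")

    stance = record.get("stance")
    if not (isinstance(stance, str) and stance in STANCES):
        errors.append("stance is not one of the allowed labels")

    tone = record.get("tone")
    if not (isinstance(tone, str) and tone in TONES):
        errors.append("tone is not one of the allowed labels")

    conf = record.get("stance_confidence")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("stance_confidence must be a number in [0, 1]")

    tags = record.get("tags")
    if not isinstance(tags, list) or len(tags) > 5:
        errors.append("tags must be a list of at most 5 labels")

    return record, errors
```

Returning the error list instead of raising keeps malformed replies available for a separate errors file or a manual review pass.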
|
||||
|
||||
|
||||
*** Storage
|
||||
- Each scraped forum is saved to `output/<forum-id>.jsonl`
|
||||
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
|
||||
- Each job is saved to `analysis/jobs/<report-id>/`:
|
||||
└─`forum.jsonl` is a copy of the scraped forum for convenience
|
||||
└─`prompt.txt` is a copy of the prompt used
|
||||
└─`report.json` is a copy of the report used
|
||||
└─`status.json` contains metadata about the job
|
||||
For each batch in the job, four files are created:
|
||||
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
|
||||
└─`jobN-output-raw.jsonl` contains the exact response from the API
|
||||
└─`jobN-output.jsonl` contains the parsed analysis records extracted from the raw response
|
||||
└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
|
||||
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
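
The layout above can be enumerated programmatically, for example to check that a job folder is complete. This is only a sketch: `job_files` is a hypothetical helper, and it assumes batches are numbered from 1 (as in `job1-output.jsonl`).

```python
from pathlib import Path


def job_files(report_id, batches, root="analysis/jobs"):
    """List the files expected in one job folder (hypothetical helper).

    Assumes batch numbering starts at 1. The errors file is listed
    even though it may not exist for a batch without errors.
    """
    job_dir = Path(root) / report_id
    files = [
        job_dir / name
        for name in ("forum.jsonl", "prompt.txt", "report.json", "status.json")
    ]
    for n in range(1, batches + 1):
        for suffix in ("input", "output-raw", "output", "output-errors"):
            files.append(job_dir / f"job{n}-{suffix}.jsonl")
    return files
```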
|
||||
|
||||
** Instructions
|
||||
1. Scrape the forum.
|
||||
`python
|
||||
2. Run model report.
|
||||
`python analysis/tokenizer.py <input> --prompt <prompt>`
|
||||
3. To run a realtime subset:
|
||||
`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
|
||||
`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
|
||||
4. To create and run the whole thing in batches, first create the batch jobs from the report:
|
||||
`python analysis/openai_batch.py create <report> --model <model>`
|
||||
`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
|
||||
5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model's queue fills up, the batch will fail, and resubmission is not implemented.
|
||||
`python analysis/openai_batch.py submit`
|
||||
# Check status
|
||||
`python analysis/openai_batch.py status`
|
||||
# When complete, download:
|
||||
`python analysis/openai_batch.py download`
|
||||
# Submit the next batch after the previous is complete:
|
||||
`python analysis/openai_batch.py submit`
|
||||
|
||||
* Roadmap
|
||||
1. Scrape one forum
|
||||
|
||||
5
pytest.ini
Normal file
@@ -0,0 +1,5 @@
|
||||
[pytest]
testpaths = tests
python_files = *.py
python_classes = Test*
python_functions = test_*
|
||||
43
reports/f452-1.json
Normal file
@@ -0,0 +1,43 @@
|
||||
{
  "prompt": "analysis\\prompt-1.txt",
  "prompt_hash": "cb41250",
  "input_file": "output\\f452.jsonl",
  "input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
  "total_comments": 9083,
  "input_tokens": 6397254,
  "gpt-5.5": {
    "jobs": 9,
    "cost_$": 15.9931,
    "est_queue_days": 7.11
  },
  "gpt-5.4": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-5.4-mini": {
    "jobs": 4,
    "cost_$": 2.399,
    "est_queue_days": 3.2
  },
  "gpt-5.4-nano": {
    "jobs": 40,
    "cost_$": 0.6397,
    "est_queue_days": 31.99
  },
  "gpt-4o": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-4o-mini": {
    "jobs": 4,
    "cost_$": 0.4798,
    "est_queue_days": 3.2
  },
  "gpt-o4-mini": {
    "jobs": 4,
    "cost_$": 3.5185,
    "est_queue_days": 3.2
  }
}
|
||||
BIN
requirements.txt
Normal file
Binary file not shown.
@@ -1,12 +1,18 @@
|
||||
# Define here the models for your scraped items
|
||||
#
|
||||
# See documentation in:
|
||||
# https://docs.scrapy.org/en/latest/topics/items.html
|
||||
|
||||
import scrapy
|
||||
|
||||
|
||||
class ScraperItem(scrapy.Item):
|
||||
# define the fields for your item here like:
|
||||
# name = scrapy.Field()
|
||||
pass
|
||||
class ForumItem(scrapy.Item):
|
||||
forum_id = scrapy.Field()
|
||||
reg_title = scrapy.Field()
|
||||
reg_desc = scrapy.Field()
|
||||
scraped_at = scrapy.Field()
|
||||
forum_url = scrapy.Field()
|
||||
|
||||
|
||||
class CommentItem(scrapy.Item):
|
||||
forum_id = scrapy.Field()
|
||||
comment_id = scrapy.Field()
|
||||
author = scrapy.Field()
|
||||
date = scrapy.Field()
|
||||
title = scrapy.Field()
|
||||
text = scrapy.Field()
|
||||
|
||||
@@ -15,8 +15,7 @@ NEWSPIDER_MODULE = "scraper.spiders"
|
||||
ADDONS = {}
|
||||
|
||||
|
||||
# Crawl responsibly by identifying yourself (and your website) on the user-agent
|
||||
#USER_AGENT = "scraper (+http://www.yourdomain.com)"
|
||||
USER_AGENT = "vath-research-scraper/1.0 (public comment analysis; contact: research)"
|
||||
|
||||
# Obey robots.txt rules
|
||||
ROBOTSTXT_OBEY = True
|
||||
@@ -75,13 +74,17 @@ DOWNLOAD_DELAY = 1
|
||||
# Enable showing throttling stats for every response received:
|
||||
#AUTOTHROTTLE_DEBUG = False
|
||||
|
||||
# Enable and configure HTTP caching (disabled by default)
|
||||
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
|
||||
#HTTPCACHE_ENABLED = True
|
||||
#HTTPCACHE_EXPIRATION_SECS = 0
|
||||
#HTTPCACHE_DIR = "httpcache"
|
||||
#HTTPCACHE_IGNORE_HTTP_CODES = []
|
||||
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
|
||||
# HTTP cache — enabled during development to avoid re-hitting the server on test runs.
|
||||
# Disable (or delete httpcache/) before a production run.
|
||||
HTTPCACHE_ENABLED = True
|
||||
HTTPCACHE_EXPIRATION_SECS = 86400 # 24 h
|
||||
HTTPCACHE_DIR = "httpcache"
|
||||
|
||||
# Output filename is set dynamically by each spider via from_crawler (includes forum_id).
|
||||
|
||||
# The site declares windows-1251 in a meta tag but sends valid UTF-8 bytes.
|
||||
# Force UTF-8 to prevent lxml from re-decoding via the meta charset.
|
||||
DEFAULT_RESPONSE_ENCODING = "utf-8"
|
||||
|
||||
# Set settings whose default value is deprecated to a future-proof value
|
||||
FEED_EXPORT_ENCODING = "utf-8"
|
||||
|
||||
138
scraper/spiders/forum.py
Normal file
@@ -0,0 +1,138 @@
|
||||
import re
from datetime import datetime

import scrapy

from scraper.items import CommentItem, ForumItem

_BASE = "https://www.townhall.virginia.gov/L/ViewComments.cfm"
_NBSP = "\xa0"
_REPLACEMENT_CHAR = "\ufffd"


def _view_url(forum_id):
    return f"{_BASE}?GdocForumID={forum_id}"


def _parse_date(raw):
    normalized = " ".join(raw.split()).upper()
    try:
        return datetime.strptime(normalized, "%m/%d/%y %I:%M %p").isoformat()
    except ValueError:
        return raw


class ForumSpider(scrapy.Spider):
    name = "forum"
    allowed_domains = ["townhall.virginia.gov"]

    # Override at runtime: scrapy crawl forum -a forum_id=452
    forum_id = "452"
    per_page = 500

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.settings.set(
            "FEEDS",
            {
                f"output/forum{spider.forum_id}_comments_%(time)s.jsonl": {
                    "format": "jsonlines",
                    "encoding": "utf-8",
                    "overwrite": False,
                }
            },
            priority="spider",
        )
        return spider

    async def start(self):
        yield scrapy.FormRequest(
            _view_url(self.forum_id),
            formdata={"vPage": "1", "vPerPage": str(self.per_page), "sub1": "go"},
            callback=self.parse_comments,
            meta={"is_first": True},
        )

    # ------------------------------------------------------------------
    def parse_comments(self, response):
        if response.meta.get("is_first"):
            reg_title, reg_desc = self._reg_context(response)
            last_page = self._last_page(response)
            yield ForumItem(
                forum_id=self.forum_id,
                reg_title=reg_title,
                reg_desc=reg_desc,
                scraped_at=datetime.utcnow().isoformat(),
                forum_url=_view_url(self.forum_id),
            )
            for page in range(2, last_page + 1):
                yield scrapy.FormRequest(
                    _view_url(self.forum_id),
                    formdata={"vPage": str(page), "vPerPage": str(self.per_page), "sub1": "go"},
                    callback=self.parse_comments,
                )

        for box in response.css("div.Cbox"):
            yield self._parse_box(box)

    # ------------------------------------------------------------------
    def _parse_box(self, box):
        cbox_id = box.attrib.get("id", "")
        comment_id = cbox_id[len("cbox"):] if cbox_id.startswith("cbox") else ""

        date_raw = (
            box.css("div[style*='float: right'] div::text").get("")
            .replace(_NBSP, " ").strip()
        )

        author = (
            box.xpath('.//strong[contains(text(),"Commenter:")]/following-sibling::text()[1]')
            .get("").strip()
        )

        # Second <strong> in the commenter block is the comment title
        strongs = box.css("div > strong::text").getall()
        title = strongs[-1].strip() if len(strongs) > 1 else ""

        paragraphs = box.css(".divComment *::text, .divComment::text").getall()
        text = " ".join(p.strip() for p in paragraphs if p.strip())
        text = text.replace(_NBSP, " ").replace(_REPLACEMENT_CHAR, "'").strip()

        return CommentItem(
            forum_id=self.forum_id,
            comment_id=comment_id,
            author=author,
            date=_parse_date(date_raw),
            title=title,
            text=text,
        )

    # ------------------------------------------------------------------
    def _reg_context(self, response):
        # Page shows: <strong>Guidance Document Change:</strong> description text...
        label_node = response.xpath('//strong[contains(text(),"Change:")]')

        # Collect all sibling text nodes following the label
        siblings = label_node.xpath("following-sibling::text()").getall()
        raw = " ".join(t.strip() for t in siblings if t.strip())
        raw = raw.replace(_NBSP, " ").replace(_REPLACEMENT_CHAR, "'").strip()

        reg_desc = raw

        # reg_title: text up to the first "was " clause or first 200 chars
        m = re.match(r"^(.+?)\s+(?:was |has |guidance document)", raw, re.IGNORECASE)
        reg_title = m.group(1).strip() if m else raw[:200]

        return reg_title, reg_desc

    def _last_page(self, response):
        hrefs = response.xpath(
            '//form[@name="page"]//a[contains(@href,"vpage.value=")]/@href'
        ).getall()
        pages = [
            int(m.group(1))
            for h in hrefs
            if (m := re.search(r"vpage\.value=(\d+)", h))
        ]
        return max(pages) if pages else 1
|
||||
155
tests/create_csv.py
Normal file
@@ -0,0 +1,155 @@
|
||||
"""Unit tests for analysis/create_csv.py — no external API calls."""
|
||||
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||
import create_csv as cc
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
|
||||
def _write_jsonl(path: Path, rows: list[dict]) -> None:
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
for row in rows:
|
||||
f.write(json.dumps(row) + "\n")
|
||||
|
||||
|
||||
RAW_ROWS = [
|
||||
{"forum_id": "452", "comment_id": "1", "title": "Support", "text": "I support.", "date": "2021-01-01", "author": "Alice"},
|
||||
{"forum_id": "452", "comment_id": "2", "title": "Oppose", "text": "I oppose.", "date": "2021-01-02", "author": "Bob"},
|
||||
{"forum_id": "452", "comment_id": "3", "title": "Neutral", "text": "No opinion.","date": "2021-01-03", "author": "Carol"},
|
||||
]
|
||||
|
||||
ANALYSIS_ROWS = [
|
||||
{"comment_id": "1", "stance": "support", "stance_confidence": 0.9, "stance_rationale": "clear support",
|
||||
"tone": "neutral", "tags": '["policy"]', "error": None, "truncated": False,
|
||||
"analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
|
||||
{"comment_id": "2", "stance": "oppose", "stance_confidence": 0.8, "stance_rationale": "clear oppose",
|
||||
"tone": "negative", "tags": '[]', "error": None, "truncated": False,
|
||||
"analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
|
||||
]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_raw
|
||||
|
||||
def test_load_raw_returns_raw_cols(tmp_path):
|
||||
p = tmp_path / "forum.jsonl"
|
||||
_write_jsonl(p, RAW_ROWS)
|
||||
df = cc.load_raw(p)
|
||||
assert list(df.columns) == cc.RAW_COLS
|
||||
|
||||
|
||||
def test_load_raw_row_count(tmp_path):
|
||||
p = tmp_path / "forum.jsonl"
|
||||
_write_jsonl(p, RAW_ROWS)
|
||||
df = cc.load_raw(p)
|
||||
assert len(df) == 3
|
||||
|
||||
|
||||
def test_load_raw_skips_non_comment_rows(tmp_path):
|
||||
"""Rows without comment_id (e.g. forum metadata) are dropped."""
|
||||
rows = RAW_ROWS + [{"forum_id": "452", "reg_title": "Metadata row"}]
|
||||
p = tmp_path / "forum.jsonl"
|
||||
_write_jsonl(p, rows)
|
||||
df = cc.load_raw(p)
|
||||
assert len(df) == 3
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_analysis
|
||||
|
||||
def test_load_analysis_returns_analysis_cols(tmp_path):
|
||||
jobs = tmp_path / "jobs"
|
||||
jobs.mkdir()
|
||||
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
df = cc.load_analysis(jobs)
|
||||
expected = ["comment_id"] + cc.ANALYSIS_COLS
|
||||
assert list(df.columns) == expected
|
||||
|
||||
|
||||
def test_load_analysis_skips_raw_files(tmp_path):
|
||||
jobs = tmp_path / "jobs"
|
||||
jobs.mkdir()
|
||||
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
_write_jsonl(jobs / "job1-output-raw.jsonl", ANALYSIS_ROWS) # should be ignored
|
||||
df = cc.load_analysis(jobs)
|
||||
assert len(df) == len(ANALYSIS_ROWS)
|
||||
|
||||
|
||||
def test_load_analysis_concatenates_multiple_files(tmp_path):
|
||||
jobs = tmp_path / "jobs"
|
||||
jobs.mkdir()
|
||||
_write_jsonl(jobs / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
|
||||
_write_jsonl(jobs / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
|
||||
df = cc.load_analysis(jobs)
|
||||
assert len(df) == 2
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# join
|
||||
|
||||
def test_join_all_raw_preserved(tmp_path):
|
||||
"""Left join: all raw comments appear in output, even without analysis."""
|
||||
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||
for col in cc.ANALYSIS_COLS:
|
||||
if col not in analysis.columns:
|
||||
analysis[col] = None
|
||||
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||
|
||||
merged = cc.join(raw, analysis)
|
||||
assert len(merged) == 3 # all 3 raw rows, even comment_id=3 with no analysis
|
||||
|
||||
|
||||
def test_join_unanalyzed_row_has_null_stance(tmp_path):
|
||||
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||
for col in cc.ANALYSIS_COLS:
|
||||
if col not in analysis.columns:
|
||||
analysis[col] = None
|
||||
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||
|
||||
merged = cc.join(raw, analysis)
|
||||
unanalyzed = merged[merged["comment_id"] == "3"]
|
||||
assert pd.isna(unanalyzed.iloc[0]["stance"])
|
||||
|
||||
|
||||
def test_join_column_order(tmp_path):
|
||||
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||
for col in cc.ANALYSIS_COLS:
|
||||
if col not in analysis.columns:
|
||||
analysis[col] = None
|
||||
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||
|
||||
merged = cc.join(raw, analysis)
|
||||
assert list(merged.columns) == cc.OUTPUT_COLS
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# End-to-end: write + read CSV
|
||||
|
||||
def test_csv_written_correctly(tmp_path):
|
||||
raw_path = tmp_path / "forum.jsonl"
|
||||
_write_jsonl(raw_path, RAW_ROWS)
|
||||
|
||||
jobs = tmp_path / "jobs"
|
||||
jobs.mkdir()
|
||||
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
|
||||
out = tmp_path / "review.csv"
|
||||
raw = cc.load_raw(raw_path)
|
||||
analysis = cc.load_analysis(jobs)
|
||||
merged = cc.join(raw, analysis)
|
||||
merged.to_csv(out, index=False, encoding="utf-8-sig")
|
||||
|
||||
loaded = pd.read_csv(out)
|
||||
assert len(loaded) == 3
|
||||
assert list(loaded.columns) == cc.OUTPUT_COLS
|
||||
119
tests/encoding.py
Normal file
@@ -0,0 +1,119 @@
|
||||
"""Unit tests for analysis/encoding.py — no external dependencies required."""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||
from encoding import repair_text, _KNOWN_REPAIRS
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Core contract
|
||||
|
||||
|
||||
def test_empty_string_unchanged():
|
||||
assert repair_text("") == ""
|
||||
|
||||
|
||||
def test_none_like_empty_unchanged():
|
||||
assert repair_text("") == ""
|
||||
|
||||
|
||||
def test_clean_ascii_unchanged():
|
||||
text = "This is a normal sentence with no encoding issues."
|
||||
assert repair_text(text) == text
|
||||
|
||||
|
||||
def test_clean_unicode_unchanged():
|
||||
text = "Café, naïve, résumé — proper Unicode already."
|
||||
result = repair_text(text)
|
||||
# Should either be unchanged or equivalently correct
|
||||
assert "Caf" in result and "na" in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Known mojibake sequences (tasks.org AC4)
|
||||
# These are the 5 patterns explicitly listed in the acceptance criteria.
|
||||
|
||||
|
||||
def test_right_single_quote():
|
||||
"""’ → ' (U+2019 right single quotation mark)"""
|
||||
assert repair_text("Virginia’s") == "Virginia’s"
|
||||
|
||||
|
||||
def test_left_double_quote():
|
||||
"""“ → " (U+201C left double quotation mark)"""
|
||||
assert repair_text("“Hello") == "“Hello"
|
||||
|
||||
|
||||
def test_en_dash():
|
||||
"""â€" (where last char is U+201C) → – (U+2013 en dash)"""
|
||||
result = repair_text("pages 1â€“5")
|
||||
assert "–" in result or "—" in result or "-" in result
|
||||
|
||||
|
||||
def test_em_dash():
|
||||
"""â€" (where last char is U+201D) → — (U+2014 em dash)"""
|
||||
result = repair_text("wordâ€”word")
|
||||
assert "—" in result or "–" in result or "-" in result
|
||||
|
||||
|
||||
def test_right_double_quote():
|
||||
"""â€\x9d → " (U+201D right double quotation mark)"""
|
||||
result = repair_text("said†he")
|
||||
# Should not contain the raw artifact
|
||||
assert "â€" not in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Round-trip: garbled text produces sensible output
|
||||
|
||||
|
||||
def test_garbled_sentence_repaired():
|
||||
"""A sentence with multiple mojibake chars is repaired to readable text."""
|
||||
# "Don't" with right single quote encoded as UTF-8, then decoded as cp1252
|
||||
# D o n ' t → D o n ’ t
|
||||
garbled = "Don’t worry"
|
||||
result = repair_text(garbled)
|
||||
assert "Don" in result and "t worry" in result
|
||||
assert "â€" not in result # artifact gone
|
||||
|
||||
|
||||
def test_clean_string_after_repair_has_no_artifacts():
|
||||
garbled = "She said “Hello†and left."
|
||||
result = repair_text(garbled)
|
||||
assert "â€" not in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# FFFD replacement characters (from strict UTF-8 decode of cp1252 bytes)
|
||||
|
||||
|
||||
def test_fffd_preserved_not_crashed():
|
||||
"""repair_text must not raise on U+FFFD; it may or may not repair it."""
|
||||
text = "Virginia\ufffds Public Schools"
|
||||
result = repair_text(text)
|
||||
assert isinstance(result, str)
|
||||
assert "Virginia" in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _KNOWN_REPAIRS table structure
|
||||
|
||||
|
||||
def test_known_repairs_non_empty():
|
||||
assert len(_KNOWN_REPAIRS) > 0
|
||||
|
||||
|
||||
def test_known_repairs_are_pairs():
|
||||
for item in _KNOWN_REPAIRS:
|
||||
assert len(item) == 2
|
||||
bad, good = item
|
||||
assert isinstance(bad, str) and isinstance(good, str)
|
||||
|
||||
|
||||
def test_known_repairs_bad_not_equal_good():
|
||||
for bad, good in _KNOWN_REPAIRS:
|
||||
assert bad != good
|
||||
390
tests/openai_batch.py
Normal file
@@ -0,0 +1,390 @@
|
||||
"""Unit tests for analysis/openai_batch.py — no real API calls."""
|
||||
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||
import openai_batch as bt
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# Fixtures

FORUM_ITEM = {
    "forum_id": "452",
    "reg_title": "Model Policies for Transgender Students",
    "reg_desc": "Guidance developed in response to HB 145.",
}

COMMENT_ITEM = {
    "forum_id": "452",
    "comment_id": "87914",
    "author": "Alice Example",
    "date": "2021-01-04T09:15:00",
    "title": "I support this policy",
    "text": "This is a great policy that protects students.",
}

RAW_SUCCESS_LINE = {
    "id": "batch_req_001",
    "custom_id": "comment_87914",
    "response": {
        "status_code": 200,
        "request_id": "req_abc",
        "body": {
            "id": "chatcmpl-xyz",
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": json.dumps({
                        "stance": "support",
                        "stance_confidence": 0.95,
                        "stance_rationale": "Commenter explicitly endorses the policy.",
                        "tone": "positive",
                        "tags": ["student safety"],
                    }),
                },
                "finish_reason": "stop",
            }],
        },
    },
    "error": None,
}

RAW_ERROR_LINE = {
    "id": "batch_req_002",
    "custom_id": "comment_87914",
    "response": None,
    "error": {"code": "batch_expired", "message": "This request could not be executed."},
}

RAW_HTTP_ERROR_LINE = {
    "id": "batch_req_003",
    "custom_id": "comment_87914",
    "response": {"status_code": 400, "body": {}},
    "error": None,
}

COMMENT_LOOKUP = {"87914": COMMENT_ITEM}
ANALYZED_AT = "2026-05-05T18:00:00+00:00"
RUN_ID = "test-run-id-123"
MODEL = "gpt-4o"

# Minimal status.json for testing job logic
def _make_status(jobs_override=None):
    jobs = jobs_override or [
        {"job_num": 1, "run_id": "r1", "status": "pending", "batch_id": None,
         "records_submitted": 60, "records_completed": None, "records_failed": None,
         "submitted_at": None, "completed_at": None},
    ]
    return {
        "model": "gpt-4o-mini", "prompt_hash": "abc1234",
        "input_file": "output/f452.jsonl", "input_sha256": "sha",
        "total_comments": 100, "input_tokens": 50_000,
        "est_queue_days": 0.025, "cost_$": 0.01,
        "total_jobs": len(jobs), "jobs": jobs,
    }


# ---------------------------------------------------------------------------
# Prompt versioning

def test_prompt_version_is_7_hex_chars():
    assert len(bt.PROMPT_VERSION) == 7
    assert all(c in "0123456789abcdef" for c in bt.PROMPT_VERSION)


def test_prompt_version_matches_realtime():
    """Both scripts must derive the same PROMPT_VERSION from the same file."""
    import openai_realtime as rt
    assert bt.PROMPT_VERSION == rt.PROMPT_VERSION


# ---------------------------------------------------------------------------
# custom_id helpers

def test_custom_id_from():
    assert bt.custom_id_from("87914") == "comment_87914"


def test_parse_custom_id():
    assert bt.parse_custom_id("comment_87914") == "87914"


def test_custom_id_round_trip():
    cid = "12345"
    assert bt.parse_custom_id(bt.custom_id_from(cid)) == cid


# ---------------------------------------------------------------------------
# build_batch_request_line

def test_batch_request_line_structure():
    line = bt.build_batch_request_line(COMMENT_ITEM, FORUM_ITEM, "gpt-4o")
    assert line["custom_id"] == "comment_87914"
    assert line["method"] == "POST"
    assert line["url"] == "/v1/chat/completions"
    assert line["body"]["model"] == "gpt-4o"
    assert line["body"]["temperature"] == 0.0
    assert line["body"]["response_format"] == {"type": "json_object"}
    messages = line["body"]["messages"]
    assert messages[0]["role"] == "system"
    assert messages[1]["role"] == "user"


def test_batch_request_line_includes_reg_context():
    line = bt.build_batch_request_line(COMMENT_ITEM, FORUM_ITEM, "gpt-4o")
    user_content = line["body"]["messages"][1]["content"]
    assert "Model Policies for Transgender Students" in user_content
    assert "HB 145" in user_content


def test_batch_request_line_truncation():
    long_comment = {**COMMENT_ITEM, "text": "x" * 7000}
    line = bt.build_batch_request_line(long_comment, FORUM_ITEM, "gpt-4o")
    user_content = line["body"]["messages"][1]["content"]
    assert "... [truncated]" in user_content
    assert user_content.count("x") == bt.MAX_COMMENT_CHARS


# ---------------------------------------------------------------------------
# normalize_output_line: success

def test_normalize_success_all_keys():
    record = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    required = {
        "run_id", "forum_id", "comment_id", "analyzed_at", "model", "prompt_version",
        "stance", "stance_confidence", "stance_rationale", "tone", "tags",
        "input_title", "truncated", "error",
    }
    assert required == set(record.keys())


def test_normalize_success_values():
    record = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert record["stance"] == "support"
    assert record["tone"] == "positive"
    assert record["comment_id"] == "87914"
    assert record["run_id"] == RUN_ID
    assert record["analyzed_at"] == ANALYZED_AT
    assert record["error"] is None
    assert record["truncated"] is False


def test_normalize_success_input_title():
    record = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert record["input_title"] == COMMENT_ITEM["title"]


# ---------------------------------------------------------------------------
# normalize_output_line: errors

def test_normalize_batch_expired_error():
    record = bt.normalize_output_line(RAW_ERROR_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert record["error"] is not None
    assert "could not be executed" in record["error"]
    assert record["stance"] is None
    assert record["tone"] is None


def test_normalize_http_error():
    record = bt.normalize_output_line(RAW_HTTP_ERROR_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert record["error"] is not None
    assert record["stance"] is None


def test_normalize_malformed_json_in_response():
    bad_line = {
        "id": "batch_req_004",
        "custom_id": "comment_87914",
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"content": "not valid json{{{"}}]},
        },
        "error": None,
    }
    record = bt.normalize_output_line(bad_line, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert record["error"] is not None
    assert record["stance"] is None


def test_normalize_unknown_comment_id():
    """A custom_id not in lookup yields empty forum_id and title but doesn't crash."""
    record = bt.normalize_output_line(RAW_SUCCESS_LINE, {}, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert record["comment_id"] == "87914"
    assert record["forum_id"] == ""
    assert record["input_title"] == ""


# ---------------------------------------------------------------------------
# estimate_tokens

def test_estimate_tokens_returns_positive_int():
    messages = [{"role": "system", "content": "hello"}, {"role": "user", "content": "world"}]
    result = bt.estimate_tokens(messages, "gpt-4o-mini")
    assert isinstance(result, int)
    assert result > 0


def test_estimate_tokens_longer_content_is_larger():
    short_msg = [{"role": "user", "content": "hi"}]
    long_msg = [{"role": "user", "content": "hi " * 500}]
    assert bt.estimate_tokens(long_msg, "gpt-4o-mini") > bt.estimate_tokens(short_msg, "gpt-4o-mini")


def test_estimate_tokens_fallback_without_tiktoken(monkeypatch):
    import sys as _sys
    monkeypatch.setitem(_sys.modules, "tiktoken", None)
    messages = [{"role": "user", "content": "x" * 300}]
    result = bt.estimate_tokens(messages, "gpt-4o")
    # fallback: 3 primer + (3 + 300//3) per message
    assert result == 3 + (3 + 300 // 3)


# ---------------------------------------------------------------------------
# chunk_comments_by_tokens

def test_chunk_single_chunk_for_small_input(monkeypatch):
    monkeypatch.setattr(bt, "MODEL_LIMITS", {"gpt-4o-mini": 10_000_000})
    comments = [COMMENT_ITEM, {**COMMENT_ITEM, "comment_id": "99999"}]
    chunks = bt.chunk_comments_by_tokens(comments, FORUM_ITEM, "gpt-4o-mini")
    assert len(chunks) == 1
    assert len(chunks[0]) == 2


def test_chunk_splits_when_over_limit(monkeypatch):
    monkeypatch.setattr(bt, "MODEL_LIMITS", {"gpt-4o-mini": 1})
    comments = [
        COMMENT_ITEM,
        {**COMMENT_ITEM, "comment_id": "99999"},
        {**COMMENT_ITEM, "comment_id": "88888"},
    ]
    chunks = bt.chunk_comments_by_tokens(comments, FORUM_ITEM, "gpt-4o-mini")
    assert len(chunks) == len(comments)


def test_chunk_preserves_all_comments(monkeypatch):
    monkeypatch.setattr(bt, "MODEL_LIMITS", {"gpt-4o-mini": 200})
    comments = [{**COMMENT_ITEM, "comment_id": str(i)} for i in range(10)]
    chunks = bt.chunk_comments_by_tokens(comments, FORUM_ITEM, "gpt-4o-mini")
    flat = [c for chunk in chunks for c in chunk]
    assert len(flat) == 10


def test_model_limits_has_required_models():
    for model in ("gpt-4o", "gpt-4o-mini", "gpt-5.4", "gpt-5.4-mini", "gpt-o4-mini"):
        assert model in bt.MODEL_LIMITS, f"{model} missing from MODEL_LIMITS"


# ---------------------------------------------------------------------------
# status.json helpers

def test_status_save_load_roundtrip(tmp_path):
    status = _make_status()
    bt.save_status(status, tmp_path)
    loaded = bt.load_status(tmp_path)
    assert loaded == status


# ---------------------------------------------------------------------------
# _find_next_eligible_job

def test_find_next_eligible_job_first_job_pending():
    jobs = _make_status()["jobs"]
    target, warning = bt._find_next_eligible_job(jobs)
    assert target["job_num"] == 1
    assert warning is None


def test_find_next_eligible_job_after_completed():
    jobs = [
        {"job_num": 1, "status": "completed", "batch_id": "b1",
         "records_submitted": 60, "records_completed": 60, "records_failed": 0,
         "submitted_at": "t", "completed_at": "t", "run_id": "r1"},
        {"job_num": 2, "status": "pending", "batch_id": None,
         "records_submitted": 40, "records_completed": None, "records_failed": None,
         "submitted_at": None, "completed_at": None, "run_id": "r2"},
    ]
    target, warning = bt._find_next_eligible_job(jobs)
    assert target["job_num"] == 2
    assert warning is None


def test_find_next_eligible_job_blocked_by_in_progress():
    jobs = [
        {"job_num": 1, "status": "in_progress", "batch_id": "b1",
         "records_submitted": 60, "records_completed": None, "records_failed": None,
         "submitted_at": "t", "completed_at": None, "run_id": "r1"},
        {"job_num": 2, "status": "pending", "batch_id": None,
         "records_submitted": 40, "records_completed": None, "records_failed": None,
         "submitted_at": None, "completed_at": None, "run_id": "r2"},
    ]
    target, warning = bt._find_next_eligible_job(jobs)
    assert target is None
    assert warning is not None
    assert "in_progress" in warning


def test_find_next_eligible_job_all_completed():
    jobs = [
        {"job_num": 1, "status": "completed", "batch_id": "b1",
         "records_submitted": 60, "records_completed": 60, "records_failed": 0,
         "submitted_at": "t", "completed_at": "t", "run_id": "r1"},
    ]
    target, warning = bt._find_next_eligible_job(jobs)
    assert target is None
    assert warning is None


def test_resume_from_status_json(tmp_path):
    """Reload a status.json with one completed job and find the next pending job."""
    jobs = [
        {"job_num": 1, "run_id": "r1", "status": "completed", "batch_id": "b1",
         "records_submitted": 60, "records_completed": 58, "records_failed": 2,
         "submitted_at": "2026-05-06T10:00:00+00:00", "completed_at": "2026-05-06T11:00:00+00:00"},
        {"job_num": 2, "run_id": "r2", "status": "pending", "batch_id": None,
         "records_submitted": 40, "records_completed": None, "records_failed": None,
         "submitted_at": None, "completed_at": None},
    ]
    bt.save_status(_make_status(jobs), tmp_path)
    loaded = bt.load_status(tmp_path)
    target, warning = bt._find_next_eligible_job(loaded["jobs"])
    assert target["job_num"] == 2
    assert warning is None


# ---------------------------------------------------------------------------
# normalize: out-of-order and duplicate custom_id

def test_out_of_order_output_reconciled_by_custom_id():
    """Raw lines processed in any order are mapped to the correct comment."""
    c2 = {**COMMENT_ITEM, "comment_id": "99999", "title": "Second comment"}
    lookup = {COMMENT_ITEM["comment_id"]: COMMENT_ITEM, "99999": c2}

    line_for_99999 = {
        **RAW_SUCCESS_LINE,
        "custom_id": "comment_99999",
    }
    line_for_87914 = RAW_SUCCESS_LINE

    r1 = bt.normalize_output_line(line_for_99999, lookup, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    r2 = bt.normalize_output_line(line_for_87914, lookup, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)

    assert r1["comment_id"] == "99999"
    assert r1["input_title"] == "Second comment"
    assert r2["comment_id"] == "87914"
    assert r2["input_title"] == COMMENT_ITEM["title"]


def test_duplicate_custom_id_both_produce_valid_records():
    """Two raw lines with the same custom_id each produce a valid record."""
    r1 = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    r2 = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert r1["comment_id"] == r2["comment_id"] == "87914"
    assert r1["error"] is None
    assert r2["error"] is None

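The tests in `tests/openai_batch.py` constrain several small helpers without showing them. The sketch below is one way to satisfy those tests; it is not taken from `analysis/openai_batch.py`. The `comment_` prefix and the fallback formula come from the assertions above, while the names `estimate_tokens_fallback` and `chunk_by_budget`, and the greedy packing strategy, are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")
CUSTOM_ID_PREFIX = "comment_"  # prefix inferred from the test assertions


def custom_id_from(comment_id: str) -> str:
    """Build the batch custom_id for a comment id."""
    return f"{CUSTOM_ID_PREFIX}{comment_id}"


def parse_custom_id(custom_id: str) -> str:
    """Recover the comment id; input without the prefix is returned unchanged."""
    if custom_id.startswith(CUSTOM_ID_PREFIX):
        return custom_id[len(CUSTOM_ID_PREFIX):]
    return custom_id


def estimate_tokens_fallback(messages: Sequence[Dict[str, str]]) -> int:
    """Hypothetical no-tokenizer estimate: 3 primer tokens, then for each
    message 3 overhead tokens plus one token per 3 characters of content."""
    return 3 + sum(3 + len(m["content"]) // 3 for m in messages)


def chunk_by_budget(items: Sequence[T], cost: Callable[[T], int], budget: int) -> List[List[T]]:
    """Greedy packing (assumed strategy): start a new chunk when adding the
    next item would exceed the budget. An item larger than the budget still
    gets a chunk of its own, so no item is ever dropped."""
    chunks: List[List[T]] = []
    current: List[T] = []
    used = 0
    for item in items:
        c = cost(item)
        if current and used + c > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(item)
        used += c
    if current:
        chunks.append(current)
    return chunks
```

With a budget of 1, every item lands in its own chunk, which is the behaviour `test_chunk_splits_when_over_limit` checks; with a very large budget everything fits in one chunk, as in `test_chunk_single_chunk_for_small_input`.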
215
tests/openai_realtime.py
Normal file
@@ -0,0 +1,215 @@
"""Unit tests for analysis/openai_realtime.py (no real API calls)."""

import json
import sys
from pathlib import Path
from unittest.mock import MagicMock

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
import openai_realtime as rt


# ---------------------------------------------------------------------------
# Fixtures

FORUM_ITEM = {
    "forum_id": "452",
    "reg_title": "Model Policies for Transgender Students",
    "reg_desc": "Guidance developed in response to HB 145.",
}

COMMENT_ITEM = {
    "forum_id": "452",
    "comment_id": "87914",
    "author": "Alice Example",
    "date": "2021-01-04T09:15:00",
    "title": "I support this policy",
    "text": "This is a great policy that protects students.",
}

MOCK_RESPONSE_CONTENT = json.dumps({
    "stance": "support",
    "stance_confidence": 0.95,
    "stance_rationale": "Commenter explicitly endorses the policy.",
    "tone": "positive",
    "tags": ["student safety", "LGBTQ+ inclusion"],
})


def _mock_client(response_content: str = MOCK_RESPONSE_CONTENT):
    client = MagicMock()
    choice = MagicMock()
    choice.message.content = response_content
    client.chat.completions.create.return_value = MagicMock(choices=[choice])
    return client


# ---------------------------------------------------------------------------
# Prompt versioning

def test_prompt_version_is_7_hex_chars():
    assert len(rt.PROMPT_VERSION) == 7
    assert all(c in "0123456789abcdef" for c in rt.PROMPT_VERSION)


def test_prompt_version_matches_prompt_file():
    import hashlib
    prompt_file = Path(__file__).parent.parent / "analysis" / "prompt-1.txt"
    expected = hashlib.sha256(prompt_file.read_text(encoding="utf-8").strip().encode()).hexdigest()[:7]
    assert rt.PROMPT_VERSION == expected


def test_prompt_version_is_stable():
    import hashlib
    v2 = hashlib.sha256(rt.SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
    assert v2 == rt.PROMPT_VERSION


# ---------------------------------------------------------------------------
# load_items

def test_load_items_separates_forum_and_comments(tmp_path):
    jsonl = tmp_path / "test.jsonl"
    jsonl.write_text(
        json.dumps(FORUM_ITEM) + "\n" + json.dumps(COMMENT_ITEM) + "\n",
        encoding="utf-8",
    )
    forum, comments = rt.load_items(jsonl)
    assert forum is not None
    assert forum["reg_title"] == FORUM_ITEM["reg_title"]
    assert len(comments) == 1
    assert comments[0]["comment_id"] == "87914"


def test_load_items_no_forum(tmp_path):
    jsonl = tmp_path / "test.jsonl"
    jsonl.write_text(json.dumps(COMMENT_ITEM) + "\n", encoding="utf-8")
    forum, comments = rt.load_items(jsonl)
    assert forum is None
    assert len(comments) == 1


def test_load_items_skips_blank_lines(tmp_path):
    jsonl = tmp_path / "test.jsonl"
    jsonl.write_text("\n" + json.dumps(COMMENT_ITEM) + "\n\n", encoding="utf-8")
    _, comments = rt.load_items(jsonl)
    assert len(comments) == 1


# ---------------------------------------------------------------------------
# build_messages

def test_truncation_applied():
    long_comment = {**COMMENT_ITEM, "text": "x" * 7000}
    messages, truncated = rt.build_messages(long_comment, FORUM_ITEM)
    assert truncated is True
    assert "... [truncated]" in messages[1]["content"]
    assert messages[1]["content"].count("x") == rt.MAX_COMMENT_CHARS


def test_no_truncation_for_short_comment():
    _, truncated = rt.build_messages(COMMENT_ITEM, FORUM_ITEM)
    assert truncated is False


def test_empty_text_fallback():
    empty = {**COMMENT_ITEM, "text": ""}
    messages, truncated = rt.build_messages(empty, FORUM_ITEM)
    assert "[No body text provided]" in messages[1]["content"]
    assert truncated is False


def test_none_text_fallback():
    none_text = {**COMMENT_ITEM, "text": None}
    messages, _ = rt.build_messages(none_text, FORUM_ITEM)
    assert "[No body text provided]" in messages[1]["content"]


def test_missing_forum_uses_unknown_context():
    messages, _ = rt.build_messages(COMMENT_ITEM, None)
    assert "[unknown]" in messages[1]["content"]


def test_reg_context_included_in_prompt():
    messages, _ = rt.build_messages(COMMENT_ITEM, FORUM_ITEM)
    assert FORUM_ITEM["reg_title"] in messages[1]["content"]
    assert "HB 145" in messages[1]["content"]


# ---------------------------------------------------------------------------
# Output record schema

def test_output_record_all_keys_present():
    record = rt.analyze_comment(_mock_client(), COMMENT_ITEM, FORUM_ITEM, "run-123", "gpt-4o")
    required = {
        "run_id", "forum_id", "comment_id", "analyzed_at", "model", "prompt_version",
        "stance", "stance_confidence", "stance_rationale", "tone", "tags",
        "input_title", "truncated", "error",
    }
    assert required == set(record.keys())


def test_output_record_correct_types():
    record = rt.analyze_comment(_mock_client(), COMMENT_ITEM, FORUM_ITEM, "run-123", "gpt-4o")
    assert record["stance"] == "support"
    assert isinstance(record["stance_confidence"], float)
    assert isinstance(record["tags"], list)
    assert record["truncated"] is False
    assert record["error"] is None


def test_output_record_metadata():
    record = rt.analyze_comment(_mock_client(), COMMENT_ITEM, FORUM_ITEM, "run-123", "gpt-4o")
    assert record["run_id"] == "run-123"
    assert record["forum_id"] == "452"
    assert record["comment_id"] == "87914"
    assert record["model"] == "gpt-4o"
    assert record["prompt_version"] == rt.PROMPT_VERSION
    assert record["input_title"] == COMMENT_ITEM["title"]


# ---------------------------------------------------------------------------
# Error handling

def test_error_record_on_api_failure():
    import openai as _openai
    client = MagicMock()
    client.chat.completions.create.side_effect = _openai.RateLimitError(
        "rate limit", response=MagicMock(status_code=429), body={}
    )
    record = rt.analyze_comment(client, COMMENT_ITEM, FORUM_ITEM, "run-123", "gpt-4o")
    assert record["error"] is not None
    assert record["stance"] is None
    assert record["tone"] is None
    assert record["tags"] is None


def test_error_record_on_bad_json():
    record = rt.analyze_comment(_mock_client("not valid json{{{"), COMMENT_ITEM, FORUM_ITEM, "run-123", "gpt-4o")
    assert record["error"] is not None
    assert record["stance"] is None


# ---------------------------------------------------------------------------
# run_id consistency

def test_run_id_is_shared_across_records():
    client = _mock_client()
    run_id = "fixed-run-id"
    r1 = rt.analyze_comment(client, COMMENT_ITEM, FORUM_ITEM, run_id, "gpt-4o")
    r2 = rt.analyze_comment(client, {**COMMENT_ITEM, "comment_id": "99999"}, FORUM_ITEM, run_id, "gpt-4o")
    assert r1["run_id"] == r2["run_id"] == run_id


# ---------------------------------------------------------------------------
# Filename helpers

def test_scrape_ts_extracted_from_filename():
    p = Path("output/forum452_comments_2026-05-05T17-33-54+00-00.jsonl")
    assert rt._scrape_ts_from_filename(p) == "2026-05-05T17-33-54+00-00"


def test_scrape_ts_fallback_for_unknown_filename():
    assert rt._scrape_ts_from_filename(Path("output/somefile.jsonl")) == "unknown"

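Two behaviours tested in `tests/openai_realtime.py` are easy to picture in code: the prompt version is the first 7 hex characters of a SHA-256 digest of the stripped prompt text, and the scrape timestamp is pulled out of the input filename with `"unknown"` as the fallback. The following is a minimal sketch consistent with those tests only. The function names and the regular expression are assumptions, not the contents of `analysis/openai_realtime.py`.

```python
import hashlib
import re
from pathlib import Path

# Assumed timestamp shape, taken from the filename used in the test:
# YYYY-MM-DDTHH-MM-SS followed by a +HH-MM or -HH-MM offset.
_TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}[+-]\d{2}-\d{2}")


def prompt_version(prompt_text: str) -> str:
    """Short, stable identifier for a prompt: first 7 hex chars of SHA-256.

    The text is stripped first so trailing newlines in the prompt file
    do not change the version."""
    return hashlib.sha256(prompt_text.strip().encode("utf-8")).hexdigest()[:7]


def scrape_ts_from_filename(path: Path) -> str:
    """Return the timestamp embedded in the filename, or "unknown"."""
    match = _TS_RE.search(path.name)
    return match.group(0) if match else "unknown"
```

Hashing the prompt rather than numbering it by hand means any edit to the prompt file changes `prompt_version` automatically, so records produced under different prompts can be told apart.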
230
tests/scrape_forum_spider.py
Normal file
@@ -0,0 +1,230 @@
"""Tests for ForumSpider parsing logic using fake HTML responses."""

import scrapy
from scrapy.http import HtmlResponse, Request

from scraper.items import CommentItem, ForumItem
from scraper.spiders.forum import ForumSpider, _parse_date


def fake_response(url, body, meta=None):
    req = Request(url=url, meta=meta or {})
    return HtmlResponse(url=url, body=body.encode("utf-8"), request=req)


# ---------------------------------------------------------------------------
# Minimal page HTML fragments

PAGE1_HTML = """
<html><body>
<strong>Guidance Document Change:</strong> The Model Policies for the Treatment of Transgender Students
was developed in response to House Bill 145 and Senate Bill 161.

<div style="font-family: verdana;">
  <form name="page" id="page" action="ViewComments.cfm?GdocForumID=452" method="post">
    <input name="vPage" id="vpage" type="input" value="1">
    <input name="vPerPage" id="vPerPage" type="input" value="500">
    <a href="javascript:document.page.vpage.value=3;document.page.submit();">3</a>
    <a href="javascript:document.page.vpage.value=2;document.page.submit();">Next</a>
    <input type="submit" name="sub1" value="go">
  </form>
</div>

<div id="cbox101" class="Cbox">
  <div style="float: right; text-align: right;">
    <div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/4/21 9:15 am</div>
  </div>
  <div>
    <strong>Commenter:</strong>
    Alice Example
    <br><br>
    <strong>I strongly support this</strong>
  </div>
  <div style="clear: right"> </div>
  <div class="divComment">
    <p>This is a great policy for students.</p>
    <p>All schools should follow it.</p>
  </div>
  <div style="float: left; font-size: 90%;">
    CommentID: <a href="ViewComments.cfm?commentid=101">101</a>
  </div>
</div>

<div id="cbox102" class="Cbox">
  <div style="float: right; text-align: right;">
    <div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/5/21 10:00 am</div>
  </div>
  <div>
    <strong>Commenter:</strong>
    Bob Sample
    <br><br>
    <strong>Opposed</strong>
  </div>
  <div style="clear: right"> </div>
  <div class="divComment">
    <p>I do not support this guidance.</p>
  </div>
  <div style="float: left; font-size: 90%;">
    CommentID: <a href="ViewComments.cfm?commentid=102">102</a>
  </div>
</div>
</body></html>
"""

PAGE2_HTML = """
<html><body>
<div id="cbox201" class="Cbox">
  <div style="float: right; text-align: right;">
    <div style="background-color: white; border: 1px solid #cccccc; padding: 4px">1/6/21 11:00 am</div>
  </div>
  <div>
    <strong>Commenter:</strong>
    Carol T
    <br><br>
    <strong>Support</strong>
  </div>
  <div style="clear: right"> </div>
  <div class="divComment">
    <p>This policy is long overdue.</p>
  </div>
</div>
</body></html>
"""


def make_spider():
    return ForumSpider()


# ---------------------------------------------------------------------------

def test_page1_generates_remaining_page_requests():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    form_reqs = [r for r in results if isinstance(r, scrapy.FormRequest)]
    # Pages 2 and 3 should be requested (last page link = 3)
    assert len(form_reqs) == 2
    pages = sorted(r.body.decode() for r in form_reqs)
    assert "vPage=2" in pages[0]
    assert "vPage=3" in pages[1]


def test_page1_yields_items():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    items = [r for r in results if isinstance(r, CommentItem)]
    assert len(items) == 2


def test_page1_yields_forum_item():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    results = list(spider.parse_comments(response))
    forum_items = [r for r in results if isinstance(r, ForumItem)]
    assert len(forum_items) == 1
    fi = forum_items[0]
    assert "Transgender Students" in fi["reg_title"]
    assert "House Bill 145" in fi["reg_desc"]
    assert fi["forum_id"] == "452"


def test_comment_fields_parsed_correctly():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    item = items[0]
    assert item["comment_id"] == "101"
    assert item["author"] == "Alice Example"
    assert item["title"] == "I strongly support this"
    assert "great policy" in item["text"]
    assert "All schools" in item["text"]  # multi-paragraph joined
    assert "reg_title" not in item
    assert "reg_desc" not in item


def test_subsequent_page_yields_comments():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE2_HTML,
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    assert len(items) == 1
    assert items[0]["author"] == "Carol T"


def test_last_page_detection():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        PAGE1_HTML,
        meta={"is_first": True},
    )
    assert spider._last_page(response) == 3


def test_date_parsed_to_iso():
    assert _parse_date("1/4/21 9:15 am") == "2021-01-04T09:15:00"
    assert _parse_date("1/5/21 10:00 am") == "2021-01-05T10:00:00"
    assert _parse_date("unparseable") == "unparseable"


SPAN_WRAPPED_HTML = """
<html><body>
<strong>Guidance Document Change:</strong> Some regulation was developed.

<form name="page" id="page" action="ViewComments.cfm?GdocForumID=452" method="post">
  <input name="vPage" value="1"><input name="vPerPage" value="500">
  <a href="javascript:document.page.vpage.value=1;document.page.submit();">1</a>
  <input type="submit" name="sub1" value="go">
</form>

<div id="cbox301" class="Cbox">
  <div style="float: right; text-align: right;">
    <div style="background-color: white; border: 1px solid #cccccc; padding: 4px">2/1/21 8:00 am</div>
  </div>
  <div>
    <strong>Commenter:</strong>
    Dan Span
    <br><br>
    <strong>Opposed</strong>
  </div>
  <div style="clear: right"> </div>
  <div class="divComment">
    <!DOCTYPE html><html><head></head><body>
    <p style="margin: 0in;"><span style="font-size: 10.5pt;">Text inside a span element.</span></p>
    </body></html>
  </div>
</div>
</body></html>
"""


def test_span_wrapped_text_is_extracted():
    spider = make_spider()
    response = fake_response(
        "https://www.townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452",
        SPAN_WRAPPED_HTML,
        meta={"is_first": True},
    )
    items = [r for r in spider.parse_comments(response) if isinstance(r, CommentItem)]
    assert len(items) == 1
    assert "Text inside a span element" in items[0]["text"]

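The spider tests rely on two small parsing steps: converting the forum's `1/4/21 9:15 am` timestamps to ISO 8601 (returning the raw string when parsing fails), and finding the last page number from the `vpage.value=N` pagination links. A minimal sketch of both follows, assuming a `strptime` format and a regular expression chosen to match the test fixtures. It operates on plain strings and is not the actual code in `scraper/spiders/forum.py`.

```python
import re
from datetime import datetime

# Assumed input format, matching fixtures such as "1/4/21 9:15 am".
_FORUM_DATE_FORMAT = "%m/%d/%y %I:%M %p"


def parse_date(raw: str) -> str:
    """Convert a forum timestamp to ISO 8601; return the input unchanged
    when it does not match the expected format."""
    try:
        return datetime.strptime(raw.strip(), _FORUM_DATE_FORMAT).isoformat()
    except ValueError:
        return raw


def last_page(html: str) -> int:
    """Highest page number referenced by the JavaScript pagination links,
    or 1 when the page has no such links."""
    pages = [int(n) for n in re.findall(r"vpage\.value=(\d+)", html)]
    return max(pages, default=1)
```

Taking the maximum over all links, rather than reading the "Next" link, matches `test_last_page_detection`, where the fixture contains links for pages 3 and 2 and the expected result is 3.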
250
tests/tokenizer.py
Normal file
@@ -0,0 +1,250 @@
"""Unit tests for analysis/tokenizer.py (no real API calls)."""

import io
import json
import math
import sys
from pathlib import Path
from unittest.mock import patch

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
import tokenizer as tk
import openai_batch as ab


# ---------------------------------------------------------------------------
|
||||
# Fixtures
|
||||
|
||||
FORUM_ITEM = {
|
||||
"forum_id": "452",
|
||||
"reg_title": "Model Policies for Transgender Students",
|
||||
"reg_desc": "Guidance developed in response to HB 145.",
|
||||
}
|
||||
|
||||
COMMENT_A = {
|
||||
"forum_id": "452",
|
||||
"comment_id": "100",
|
||||
"author": "Alice",
|
||||
"date": "2021-01-04T09:15:00",
|
||||
"title": "Support",
|
||||
"text": "I support this policy.",
|
||||
}
|
||||
|
||||
COMMENT_B = {
|
||||
"forum_id": "452",
|
||||
"comment_id": "101",
|
||||
"author": "Bob",
|
||||
"date": "2021-01-05T10:00:00",
|
||||
"title": "Oppose",
|
||||
"text": "I oppose this policy.",
|
||||
}
|
||||
|
||||
COMMENTS = [COMMENT_A, COMMENT_B]
|
||||
PROMPT_HASH = "abc1234"
|
||||
INPUT_FILE = "output/f452.jsonl"
|
||||
INPUT_SHA256 = "deadbeef" * 8
|
||||
PROMPT_FILE = "analysis/prompt-1.txt"
|
||||
|
||||
|
||||
def _make_report():
|
||||
return tk.compute_report(
|
||||
COMMENTS, FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# compute_report: required top-level keys
|
||||
|
||||
def test_report_has_top_level_keys():
|
||||
report = _make_report()
|
||||
required = {"prompt", "prompt_hash", "input_file", "input_sha256",
|
||||
"total_comments", "input_tokens"}
|
||||
assert required.issubset(set(report.keys()))
|
||||
|
||||
|
||||
def test_report_metadata_values():
|
||||
report = _make_report()
|
||||
assert report["prompt"] == PROMPT_FILE
|
||||
assert report["prompt_hash"] == PROMPT_HASH
|
||||
assert report["input_file"] == INPUT_FILE
|
||||
assert report["input_sha256"] == INPUT_SHA256
|
||||
assert report["total_comments"] == 2
|
||||
|
||||
|
||||
def test_report_input_tokens_positive():
|
||||
report = _make_report()
|
||||
assert isinstance(report["input_tokens"], int)
|
||||
assert report["input_tokens"] > 0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# compute_report: per-model entries
|
||||
|
||||
def test_report_has_per_model_keys():
|
||||
report = _make_report()
|
||||
for model in ab.MODEL_LIMITS:
|
||||
assert model in report, f"Model {model} missing from report"
|
||||
assert isinstance(report[model], dict)
|
||||
|
||||
|
||||
def test_report_per_model_has_required_fields():
|
||||
report = _make_report()
|
||||
for model in ab.MODEL_LIMITS:
|
||||
m = report[model]
|
||||
assert "jobs" in m
|
||||
assert "cost_$" in m
|
||||
assert "est_queue_days" in m
|
||||
|
||||
|
||||
def test_report_jobs_at_least_one():
|
||||
report = _make_report()
|
||||
for model in ab.MODEL_LIMITS:
|
||||
assert report[model]["jobs"] >= 1
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# compute_report: calculation accuracy
|
||||
|
||||
def test_cost_calculation():
|
||||
"""cost_$ = total_tokens / 1M * pricing_rate"""
|
||||
report = _make_report()
|
||||
total = report["input_tokens"]
|
||||
for model in ab.MODEL_LIMITS:
|
||||
expected_cost = round(total / 1_000_000 * tk.MODEL_PRICING.get(model, 0.0), 4)
|
||||
assert report[model]["cost_$"] == pytest.approx(expected_cost, abs=1e-6)
|
||||
|
||||
|
||||
def test_est_queue_days_calculation():
|
||||
"""est_queue_days = total_tokens / tpd (rounded to 2 decimal places)"""
|
||||
report = _make_report()
|
||||
total = report["input_tokens"]
|
||||
for model, tpd in ab.MODEL_LIMITS.items():
|
||||
expected = round(total / tpd, 2)
|
||||
assert report[model]["est_queue_days"] == pytest.approx(expected, abs=1e-4)
|
||||
|
||||
|
||||
def test_jobs_ceiling_division():
|
||||
"""jobs = ceil(total_tokens / (tpd * _LIMIT_BUFFER))"""
|
||||
report = _make_report()
|
||||
total = report["input_tokens"]
|
||||
for model, tpd in ab.MODEL_LIMITS.items():
|
||||
effective = int(tpd * ab._LIMIT_BUFFER)
|
||||
expected = math.ceil(total / effective)
|
||||
assert report[model]["jobs"] == expected
|
||||
|
||||
|
||||
def test_more_comments_increases_tokens():
|
||||
"""More comments → more input_tokens."""
|
||||
few = tk.compute_report([COMMENT_A], FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE)
|
||||
many = tk.compute_report(COMMENTS, FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE)
|
||||
assert many["input_tokens"] > few["input_tokens"]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# MODEL_PRICING coverage
|
||||
|
||||
def test_model_pricing_has_required_models():
|
||||
for model in ("gpt-4o", "gpt-4o-mini", "gpt-5.4", "gpt-5.4-mini", "gpt-o4-mini"):
|
||||
assert model in tk.MODEL_PRICING, f"{model} missing from MODEL_PRICING"
|
||||
|
||||
|
||||
def test_model_pricing_values_positive():
|
||||
for model, price in tk.MODEL_PRICING.items():
|
||||
assert price > 0, f"{model} has non-positive price"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# print_table: runs without error, produces output
|
||||
|
||||
def test_print_table_runs():
|
||||
report = _make_report()
|
||||
buf = io.StringIO()
|
||||
with patch("sys.stdout", buf):
|
||||
tk.print_table(report)
|
||||
output = buf.getvalue()
|
||||
assert "gpt-4o" in output
|
||||
assert "gpt-4o-mini" in output
|
||||
|
||||
|
||||
def test_print_table_shows_all_models():
|
||||
report = _make_report()
|
||||
buf = io.StringIO()
|
||||
with patch("sys.stdout", buf):
|
||||
tk.print_table(report)
|
||||
output = buf.getvalue()
|
||||
for model in ab.MODEL_LIMITS:
|
||||
assert model in output, f"{model} not shown in print_table output"
|
||||
|
||||
|
||||
def test_print_table_highlights_recommended():
|
||||
"""When a single-job cheapest model exists, table marks it as recommended."""
|
||||
report = _make_report()
|
||||
buf = io.StringIO()
|
||||
with patch("sys.stdout", buf):
|
||||
tk.print_table(report)
|
||||
output = buf.getvalue()
|
||||
assert "recommended" in output
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# report.json round-trip (write → read)
|
||||
|
||||
def test_report_json_roundtrip(tmp_path):
|
||||
report = _make_report()
|
||||
out = tmp_path / "report.json"
|
||||
out.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
|
||||
loaded = json.loads(out.read_text(encoding="utf-8"))
|
||||
assert loaded["total_comments"] == report["total_comments"]
|
||||
assert loaded["input_tokens"] == report["input_tokens"]
|
||||
assert loaded["gpt-4o-mini"]["jobs"] == report["gpt-4o-mini"]["jobs"]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# count_input_tokens
|
||||
|
||||
def _make_job_input(tmp_path, comments, forum=None) -> Path:
|
||||
"""Write a batch request JSONL in the same format as job1-input.jsonl."""
|
||||
p = tmp_path / "job1-input.jsonl"
|
||||
with open(p, "w", encoding="utf-8") as f:
|
||||
for c in comments:
|
||||
f.write(json.dumps(ab.build_batch_request_line(c, forum, "gpt-4o-mini")) + "\n")
|
||||
return p
|
||||
|
||||
|
||||
def test_count_input_tokens_matches_estimate(tmp_path):
|
||||
"""count_input_tokens on a freshly written job file equals the sum estimate_tokens produces."""
|
||||
p = _make_job_input(tmp_path, COMMENTS, FORUM_ITEM)
|
||||
result = tk.count_input_tokens(p, "gpt-4o-mini")
|
||||
expected = sum(
|
||||
ab.estimate_tokens(ab.build_messages(c, FORUM_ITEM)[0], "gpt-4o-mini")
|
||||
for c in COMMENTS
|
||||
)
|
||||
assert result["total_tokens"] == expected
|
||||
assert result["total_requests"] == len(COMMENTS)
|
||||
|
||||
|
||||
def test_count_input_tokens_fields(tmp_path):
|
||||
p = _make_job_input(tmp_path, COMMENTS, FORUM_ITEM)
|
||||
result = tk.count_input_tokens(p)
|
||||
assert set(result.keys()) == {"total_tokens", "total_requests", "min", "max", "mean"}
|
||||
assert result["min"] <= result["mean"] <= result["max"]
|
||||
assert result["min"] > 0
|
||||
|
||||
|
||||
def test_count_input_tokens_empty_file(tmp_path):
|
||||
p = tmp_path / "empty.jsonl"
|
||||
p.write_text("", encoding="utf-8")
|
||||
result = tk.count_input_tokens(p)
|
||||
assert result["total_tokens"] == 0
|
||||
assert result["total_requests"] == 0
|
||||
|
||||
|
||||
def test_count_input_tokens_includes_system_prompt(tmp_path):
|
||||
"""Token count must be higher than user-message-only text length / 3 (prompt adds tokens)."""
|
||||
p = _make_job_input(tmp_path, [COMMENT_A], FORUM_ITEM)
|
||||
result = tk.count_input_tokens(p)
|
||||
user_chars = len(COMMENT_A.get("text", ""))
|
||||
# system prompt alone is hundreds of tokens; total must exceed naive user-text estimate
|
||||
assert result["total_tokens"] > user_chars // 3
|
||||
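The docstrings in the calculation tests above state three formulas: `jobs = ceil(total_tokens / (tpd * _LIMIT_BUFFER))`, `cost_$ = total_tokens / 1M * pricing_rate`, and `est_queue_days = total_tokens / tpd`. The following is a rough worked example of those formulas only. The constants are made-up illustration values, not the real contents of `MODEL_LIMITS`, `_LIMIT_BUFFER`, or `MODEL_PRICING`, and `estimate` is a hypothetical name:

```python
import math

# Assumed values for illustration only; the real ones live in
# openai_batch.MODEL_LIMITS, openai_batch._LIMIT_BUFFER and tokenizer.MODEL_PRICING.
TOKENS_PER_DAY = 2_000_000   # assumed daily token limit (tpd) for one model
LIMIT_BUFFER = 0.9           # assumed safety margin applied to the limit
PRICE_PER_MILLION = 0.15     # assumed price in dollars per 1M input tokens


def estimate(total_tokens: int) -> dict:
    """Apply the three formulas from the test docstrings to one model."""
    effective_limit = int(TOKENS_PER_DAY * LIMIT_BUFFER)
    return {
        # Ceiling division: a partial batch still needs its own job.
        "jobs": math.ceil(total_tokens / effective_limit),
        "cost_$": round(total_tokens / 1_000_000 * PRICE_PER_MILLION, 4),
        "est_queue_days": round(total_tokens / TOKENS_PER_DAY, 2),
    }
```

With these assumed constants, 4.5M tokens against an effective limit of 1.8M gives a ratio of 2.5, so the ceiling rounds up to three jobs, which is the behaviour `test_jobs_ceiling_division` checks.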
217
tests/validate-sentiment.py
Normal file
@@ -0,0 +1,217 @@
|
||||
"""Unit tests for analysis/validate.py — no file I/O beyond tmp_path."""
|
||||
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||
|
||||
try:
|
||||
import pandas as pd
|
||||
except ImportError:
|
||||
pytest.skip("pandas not installed", allow_module_level=True)
|
||||
|
||||
import validate as vl
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fixtures
|
||||
|
||||
|
||||
def _write_jsonl(path: Path, rows: list[dict]) -> None:
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
for row in rows:
|
||||
f.write(json.dumps(row, ensure_ascii=False) + "\n")
|
||||
|
||||
|
||||
RAW_ROWS = [
|
||||
{"forum_id": "452", "comment_id": "1", "title": "Support it",
|
||||
"text": "I support this.", "date": "2021-01-04T09:00:00", "author": "Alice"},
|
||||
{"forum_id": "452", "comment_id": "2", "title": "Oppose it",
|
||||
"text": "I oppose this.", "date": "2021-01-05T10:00:00", "author": "Bob"},
|
||||
{"forum_id": "452", "comment_id": "3", "title": "Neutral",
|
||||
"text": "No opinion.", "date": "2021-01-06T11:00:00", "author": "Carol"},
|
||||
]
|
||||
|
||||
ANALYSIS_ROWS = [
|
||||
{"run_id": "r1", "forum_id": "452", "comment_id": "1", "input_title": "Support it",
|
||||
"analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
|
||||
"prompt_version": "abc1234", "stance": "support", "stance_confidence": 0.95,
|
||||
"stance_rationale": "Commenter says 'I support'.", "tone": "positive",
|
||||
"tags": ["student safety"], "truncated": False, "error": None},
|
||||
{"run_id": "r1", "forum_id": "452", "comment_id": "2", "input_title": "Oppose it",
|
||||
"analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
|
||||
"prompt_version": "abc1234", "stance": "oppose", "stance_confidence": 0.90,
|
||||
"stance_rationale": "Commenter says 'I oppose'.", "tone": "negative",
|
||||
"tags": [], "truncated": False, "error": None},
|
||||
]
|
||||
|
||||
FORUM_ROW = {"forum_id": "452", "reg_title": "Policy X", "reg_desc": "Guidance on Y."}
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def raw_jsonl(tmp_path) -> Path:
|
||||
p = tmp_path / "f452.jsonl"
|
||||
_write_jsonl(p, [FORUM_ROW] + RAW_ROWS)
|
||||
return p
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def jobs_dir(tmp_path) -> Path:
|
||||
d = tmp_path / "jobs" / "f452-1"
|
||||
d.mkdir(parents=True)
|
||||
_write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
return d
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_raw
|
||||
|
||||
|
||||
def test_load_raw_returns_only_comments(raw_jsonl):
|
||||
df = vl.load_raw(raw_jsonl)
|
||||
assert len(df) == 3
|
||||
assert set(df.columns) == set(vl.RAW_COLS)
|
||||
|
||||
|
||||
def test_load_raw_correct_columns(raw_jsonl):
|
||||
df = vl.load_raw(raw_jsonl)
|
||||
for col in vl.RAW_COLS:
|
||||
assert col in df.columns
|
||||
|
||||
|
||||
def test_load_raw_skips_forum_item(raw_jsonl):
|
||||
df = vl.load_raw(raw_jsonl)
|
||||
assert "reg_title" not in df.columns
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_analysis
|
||||
|
||||
|
||||
def test_load_analysis_skips_raw_files(tmp_path):
|
||||
d = tmp_path / "jobs" / "f452-1"
|
||||
d.mkdir(parents=True)
|
||||
_write_jsonl(d / "job1-output-raw.jsonl", ANALYSIS_ROWS) # should be ignored
|
||||
_write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
df = vl.load_analysis(d)
|
||||
assert len(df) == len(ANALYSIS_ROWS)
|
||||
|
||||
|
||||
def test_load_analysis_concatenates_multiple_files(tmp_path):
|
||||
d = tmp_path / "jobs" / "f452-1"
|
||||
d.mkdir(parents=True)
|
||||
_write_jsonl(d / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
|
||||
_write_jsonl(d / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
|
||||
df = vl.load_analysis(d)
|
||||
assert len(df) == 2
|
||||
|
||||
|
||||
def test_load_analysis_tags_serialized_as_json(jobs_dir):
|
||||
df = vl.load_analysis(jobs_dir)
|
||||
tags_val = df.loc[df["comment_id"] == "1", "tags"].iloc[0]
|
||||
assert isinstance(tags_val, str)
|
||||
assert json.loads(tags_val) == ["student safety"]
|
||||
|
||||
|
||||
def test_load_analysis_empty_tags_serialized(jobs_dir):
|
||||
df = vl.load_analysis(jobs_dir)
|
||||
tags_val = df.loc[df["comment_id"] == "2", "tags"].iloc[0]
|
||||
assert json.loads(tags_val) == []
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# join — by comment_id, not index
|
||||
|
||||
|
||||
def test_join_by_comment_id_not_index(raw_jsonl, jobs_dir):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
# Shuffle raw order so comment_id ordering differs from index
|
||||
raw = raw.sample(frac=1, random_state=42).reset_index(drop=True)
|
||||
merged = vl.join(raw, analysis)
|
||||
row_1 = merged[merged["comment_id"] == "1"].iloc[0]
|
||||
assert row_1["stance"] == "support"
|
||||
assert row_1["author"] == "Alice"
|
||||
|
||||
|
||||
def test_join_unanalyzed_comment_has_null_stance(raw_jsonl, jobs_dir):
|
||||
"""Comment 3 is in raw but not in analysis — stance should be NaN."""
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
row_3 = merged[merged["comment_id"] == "3"].iloc[0]
|
||||
assert pd.isna(row_3["stance"])
|
||||
|
||||
|
||||
def test_join_preserves_all_raw_comments(raw_jsonl, jobs_dir):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
assert len(merged) == len(raw)
|
||||
|
||||
|
||||
def test_join_output_columns_in_order(raw_jsonl, jobs_dir):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
assert list(merged.columns) == vl.OUTPUT_COLS
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Duplicate comment_id handling
|
||||
|
||||
|
||||
def test_duplicate_raw_id_flagged(raw_jsonl, jobs_dir):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
# Manually duplicate a row
|
||||
raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
# join still produces a row for each raw row (left join)
|
||||
assert len(merged) == len(raw)
|
||||
assert raw["comment_id"].duplicated().sum() == 1
|
||||
|
||||
|
||||
def test_duplicate_analysis_id_produces_extra_rows(raw_jsonl, tmp_path):
|
||||
"""Two analysis records for the same comment_id create two joined rows."""
|
||||
d = tmp_path / "jobs" / "f452-dup"
|
||||
d.mkdir(parents=True)
|
||||
dup_rows = [ANALYSIS_ROWS[0], {**ANALYSIS_ROWS[0], "stance": "oppose"}]
|
||||
_write_jsonl(d / "job1-output.jsonl", dup_rows)
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(d)
|
||||
merged = vl.join(raw, analysis)
|
||||
assert len(merged[merged["comment_id"] == "1"]) == 2
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Validation counts (smoke test — just confirm it runs without error)
|
||||
|
||||
|
||||
def test_print_validation_runs(raw_jsonl, jobs_dir, capsys):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
vl.print_validation(raw, analysis, merged)
|
||||
out = capsys.readouterr().out
|
||||
assert "Raw comments" in out
|
||||
assert "Stance counts" in out
|
||||
assert "Tone counts" in out
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CSV output
|
||||
|
||||
|
||||
def test_csv_written_to_jobs_dir(raw_jsonl, jobs_dir, tmp_path):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
out_path = jobs_dir / "review.csv"
|
||||
merged.to_csv(out_path, index=False, encoding="utf-8-sig")
|
||||
assert out_path.exists()
|
||||
loaded = pd.read_csv(out_path, encoding="utf-8-sig")
|
||||
assert list(loaded.columns) == vl.OUTPUT_COLS
|
||||
assert len(loaded) == len(raw)
|
||||
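The join tests above pin down left-join behaviour keyed on `comment_id`: every raw comment is kept, an unanalyzed comment gets a null stance, and duplicate analysis records fan out into extra rows. The sketch below shows those semantics in plain Python as a stand-in. It is not the code in `analysis/validate.py`, which the tests suggest is pandas-based, and `left_join` is a hypothetical name:

```python
from collections import defaultdict


def left_join(raw_rows, analysis_rows, key="comment_id"):
    """Left join two lists of dicts on `key`, keeping every raw row."""
    by_key = defaultdict(list)
    for a in analysis_rows:
        by_key[a[key]].append(a)

    merged = []
    for r in raw_rows:
        # No analysis record: join against one empty record so the raw row survives.
        matches = by_key.get(r[key]) or [{}]
        for a in matches:
            # Raw fields win on shared keys; stance defaults to None when missing.
            merged.append({"stance": None, **a, **r})
    return merged
```

Matching by key rather than by position is what makes the result independent of row order, which is the property `test_join_by_comment_id_not_index` exercises by shuffling the raw frame.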
3888
viz/chart_tests/confidence_by_stance.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/cumulative_stance_area.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/cumulative_stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_diverging_bar.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_over_time.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_counts.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_heatmap.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_rowpct.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/confidence_by_stance.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_over_time.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_tone_heatmap.html
Normal file
File diff suppressed because one or more lines are too long
134
viz/prototype_charts.py
Normal file
@@ -0,0 +1,134 @@
|
||||
'''
|
||||
prototype_charts.py
|
||||
generate test charts for later addition to streamlit
|
||||
'''
|
||||
|
||||
|
||||
from pathlib import Path
|
||||
import pandas as pd
|
||||
import plotly.express as px
|
||||
import numpy as np
|
||||
|
||||
inp = Path(r"c:/users/moses/projects/vath/analysis/jobs/f452-1/review.csv")
|
||||
out = Path("viz/")
|
||||
out.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
stance_order = ["support", "oppose", "neutral", "unknown"]
|
||||
|
||||
tone_order = ["positive", "negative", "neutral", "mixed", "unknown", "unclear"]  # still required by the stance x tone crosstab (chart 8)
|
||||
# default order was actually better - unclear/negative/neutral/mixed/positive vs unknown/oppose/neutral/support
|
||||
# same for pct w/in stance
|
||||
df = pd.read_csv(inp)
|
||||
df["date"] = pd.to_datetime(df["date"], errors="coerce")
|
||||
df["date_day"] = df["date"].dt.date
|
||||
df["stance"] = df["stance"].fillna("unknown")
|
||||
df["tone"] = df["tone"].fillna("unknown")
|
||||
|
||||
# 1. stance share
|
||||
counts = df["stance"].value_counts().reindex(stance_order, fill_value=0).reset_index()
|
||||
counts.columns = ["stance", "count"]
|
||||
fig = px.bar(counts, x="count", y="stance", orientation="h", text="count")
|
||||
fig.write_html(out / "stance_share.html")
|
||||
|
||||
# 2. stance over time
|
||||
daily = df.groupby(["date_day", "stance"]).size().reset_index(name="count")
|
||||
fig = px.bar(daily, x="date_day", y="count", color="stance", category_orders={"stance": stance_order})
|
||||
fig.write_html(out / "stance_over_time.html")
|
||||
|
||||
# 3. stance x tone
|
||||
heat = df.groupby(["stance", "tone"]).size().reset_index(name="count")
|
||||
fig = px.density_heatmap(heat, x="tone", y="stance", z="count", category_orders={"stance": stance_order})
|
||||
fig.write_html(out / "stance_tone_heatmap.html")
|
||||
|
||||
# 4. confidence by stance
|
||||
fig = px.box(df, x="stance", y="stance_confidence", category_orders={"stance": stance_order}, points="outliers")
|
||||
fig.write_html(out / "confidence_by_stance.html")
|
||||
|
||||
# 5. cumulative stance and share over time
|
||||
daily = (
|
||||
df.groupby(["date_day", "stance"])
|
||||
.size()
|
||||
.unstack(fill_value=0)
|
||||
.reindex(columns=stance_order, fill_value=0)
|
||||
.sort_index()
|
||||
)
|
||||
|
||||
cum = daily.cumsum()
|
||||
cum_long = cum.reset_index().melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
|
||||
|
||||
fig = px.area(
|
||||
cum_long,
|
||||
x="date_day",
|
||||
y="cumulative_count",
|
||||
color="stance",
|
||||
category_orders={"stance": stance_order},
|
||||
title="cumulative comments by stance over time",
|
||||
)
|
||||
fig.write_html(out / "cumulative_stance_area.html")
|
||||
|
||||
cum_pct = cum.div(cum.sum(axis=1), axis=0).reset_index().melt(
|
||||
id_vars="date_day", var_name="stance", value_name="cumulative_share"
|
||||
)
|
||||
|
||||
fig = px.line(
|
||||
cum_pct,
|
||||
x="date_day",
|
||||
y="cumulative_share",
|
||||
color="stance",
|
||||
category_orders={"stance": stance_order},
|
||||
title="cumulative stance share over time",
|
||||
)
|
||||
fig.update_yaxes(tickformat=".0%")
|
||||
fig.write_html(out / "cumulative_stance_share.html")
|
||||
|
||||
# 7. diverging h-bar
|
||||
stance_counts = df["stance"].value_counts().reindex(stance_order, fill_value=0)
|
||||
|
||||
div = pd.DataFrame({
|
||||
"stance": ["oppose", "support", "neutral", "unknown"],
|
||||
"count": [
|
||||
-stance_counts.get("oppose", 0),
|
||||
stance_counts.get("support", 0),
|
||||
stance_counts.get("neutral", 0),
|
||||
stance_counts.get("unknown", 0),
|
||||
],
|
||||
})
|
||||
|
||||
fig = px.bar(
|
||||
div,
|
||||
x="count",
|
||||
y="stance",
|
||||
orientation="h",
|
||||
text=div["count"].abs(),
|
||||
title="support vs oppose",
|
||||
)
|
||||
fig.update_xaxes(title="comments", zeroline=True)
|
||||
fig.update_traces(textposition="outside")
|
||||
fig.write_html(out / "stance_diverging_bar.html")
|
||||
|
||||
# 8. Stance x Tone labels
|
||||
heat = pd.crosstab(df["stance"], df["tone"]).reindex(
|
||||
index=stance_order,
|
||||
columns=[c for c in tone_order if c in df["tone"].unique()],
|
||||
fill_value=0,
|
||||
)
|
||||
|
||||
fig = px.imshow(
|
||||
heat,
|
||||
text_auto=True,
|
||||
aspect="auto",
|
||||
title="stance x tone, count",
|
||||
)
|
||||
fig.write_html(out / "stance_tone_counts.html")
|
||||
|
||||
rowpct = heat.div(heat.sum(axis=1).replace(0, np.nan), axis=0)
|
||||
|
||||
fig = px.imshow(
|
||||
rowpct,
|
||||
text_auto=".0%",
|
||||
aspect="auto",
|
||||
title="stance x tone, percent within stance",
|
||||
)
|
||||
fig.write_html(out / "stance_tone_rowpct.html")
|
||||
|
||||
|
||||
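The cumulative-share chart in `prototype_charts.py` takes running totals per stance and divides each day's totals by that day's overall running total. A small plain-Python sketch of the same arithmetic follows. The function name and input shape are assumptions for illustration, and the script itself does this with pandas `cumsum` and `div`:

```python
from itertools import accumulate


def cumulative_share(daily_counts, stances):
    """daily_counts: list of {stance: count} dicts, already in date order.

    Returns one {stance: share} dict per day, where share is the stance's
    running total divided by the running total across all stances.
    """
    running = {
        s: list(accumulate(day.get(s, 0) for day in daily_counts))
        for s in stances
    }
    shares = []
    for i in range(len(daily_counts)):
        total = sum(running[s][i] for s in stances)
        # Guard against a zero running total (no comments yet) instead of dividing by 0.
        shares.append({s: (running[s][i] / total if total else None) for s in stances})
    return shares
```

The zero-total guard plays the same role as `cum_total.where(cum_total > 0)` in `viz/streamlit.py`: days before the first comment yield a missing value rather than a division error.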
28
viz/prototype_streamlit.py
Normal file
@@ -0,0 +1,28 @@
|
||||
# streamlit run viz/prototype_streamlit.py
|
||||
from datetime import datetime
|
||||
import pandas as pd
|
||||
import plotly.graph_objects as go
|
||||
import plotly.express as px
|
||||
import streamlit as st
|
||||
|
||||
df = pd.read_csv(r"analysis/jobs/f452-1/review.csv")
|
||||
st.set_page_config(layout="wide")
|
||||
|
||||
stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
|
||||
q = st.text_input("Search comment text")
|
||||
dff = df[df["stance"].isin(stance)]
|
||||
if q:
|
||||
dff = dff[dff["text"].fillna("").str.contains(q, case=False, regex=False)]
|
||||
|
||||
st.dataframe(dff[["comment_id", "title", "stance", "stance_confidence", "tone"]], width="stretch")
|
||||
st.write("Showing " + str(len(dff))+ " comments")
|
||||
|
||||
cid = st.selectbox("comment", dff["comment_id"].astype(str))
|
||||
row = dff[dff["comment_id"].astype(str) == cid].iloc[0]
|
||||
|
||||
st.subheader(row["title"])
|
||||
st.write(row["text"])
|
||||
st.write(row["author"] + ", " + row["date"][:10])
|
||||
st.write("**model:** " + str(row["model"]))
|
||||
st.markdown("**stance:** " + str(row["stance"]) + " \n**confidence:** " + str(row["stance_confidence"]) + " \n**tone:** " + str(row["tone"]))
|
||||
st.write("**analysis:** "+ row["stance_rationale"])
|
||||
189
viz/streamlit.py
Normal file
@@ -0,0 +1,189 @@
|
||||
# streamlit run viz/streamlit.py -- --jobs-dir analysis/jobs/f452-1
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
from datetime import datetime as dt
|
||||
import pandas as pd
|
||||
import plotly.graph_objects as go
|
||||
import plotly.express as px
|
||||
import streamlit as st
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--jobs-dir", default="analysis/jobs/f452-1", type=Path,
|
||||
help="Job directory containing review.csv, forum.jsonl, and prompt.txt")
|
||||
args, _ = parser.parse_known_args() # parse_known_args: ignore Streamlit's own argv entries
|
||||
workdir = args.jobs_dir
|
||||
df = pd.read_csv(workdir/"review.csv")
|
||||
df['date_dt'] = pd.to_datetime(df.date)
|
||||
df["date_day"] = df["date_dt"].dt.date
|
||||
forum = pd.read_json(workdir/"forum.jsonl", lines=True).iloc[0].to_dict()
|
||||
prompt = (workdir/"prompt.txt").read_text(encoding="utf-8")
|
||||
|
||||
stance_colors = {'oppose':'#ffa15a', 'neutral':'#e377c2','support':'#19d3f3','unknown':'#000000'}
|
||||
stance_order = ["oppose", "mixed", "unknown", "neutral", "support"]
|
||||
|
||||
st.set_page_config(layout="wide")
|
||||
st.title("Virginia Townhall Explorer",anchor=None)
|
||||
st.caption("Explore data collected from Virginia's public comment system. Source code at https://github.com/eulaly/vath")
|
||||
|
||||
st.subheader("Proposal",anchor=None,divider="gray")
|
||||
st.markdown(f"**{forum.get('reg_title')}**")
|
||||
st.text(forum.get('reg_desc'))
|
||||
st.caption(f'Comments posted from {dt.strftime(min(df.date_dt),"%D")}—{dt.strftime(max(df.date_dt),"%D")} at https://www.townhall.virginia.gov/L/Comments.cfm?GDocForumID={forum.get("forum_id")}')
|
||||
|
||||
st.subheader("Comment Summary",anchor=False,divider="gray")
|
||||
summary_left, summary_right = st.columns([1,2])
|
||||
with summary_left:
|
||||
# Summary Table
|
||||
summary_stats = (
|
||||
df.groupby("stance").size()
|
||||
.reindex(stance_order, fill_value=0)
|
||||
.reset_index(name="count")
|
||||
.assign(percent=lambda d: (d["count"] / d["count"].sum()).map("{:.1%}".format))
|
||||
)
|
||||
|
||||
st.dataframe(summary_stats, hide_index=True, width="stretch")
|
||||
with summary_right:
|
||||
# Stance div-h
|
||||
counts = df["stance"].value_counts()
|
||||
stance_divh = go.Figure()
|
||||
stance_divh.add_bar(y=["stance"], x=[-counts.get("oppose",0)], name="oppose", orientation="h", marker_color=stance_colors.get('oppose'), text=[counts.get("oppose",0)], textposition="inside")
|
||||
stance_divh.add_bar(y=["stance"], x=[counts.get("neutral",0)], name="neutral", orientation="h", marker_color=stance_colors.get('neutral'), text=[counts.get("neutral",0)], textposition="inside")
|
||||
stance_divh.add_bar(y=["stance"], x=[counts.get("unknown",0)], name="unknown", orientation="h", marker_color=stance_colors.get('unknown'), text=[counts.get("unknown",0)], textposition="inside")
|
||||
stance_divh.add_bar(y=["stance"], x=[counts.get("support",0)], name="support", orientation="h", marker_color=stance_colors.get('support'), text=[counts.get("support",0)], textposition="inside")
|
||||
stance_divh.update_yaxes(title_text="",showticklabels=False)
|
||||
stance_divh.update_layout(barmode="relative", title="", height=180, margin=dict(l=0,r=0,t=0,b=0),xaxis_title="", yaxis_title="",legend=dict(orientation="v",y=0.12))
|
||||
st.plotly_chart(stance_divh,width='stretch')
|
||||
|
||||
# Daily Comments Breakdown, 3 Tabs
|
||||
daily_wide = (
|
||||
df.groupby(["date_day", "stance"])
|
||||
.size()
|
||||
.unstack(fill_value=0)
|
||||
.reindex(columns=stance_order, fill_value=0)
|
||||
.sort_index()
|
||||
)
|
||||
|
||||
daily_long = (
|
||||
daily_wide.reset_index()
|
||||
.melt(id_vars="date_day", var_name="stance", value_name="count")
|
||||
)
|
||||
|
||||
cum_wide = daily_wide.cumsum()
|
||||
|
||||
cum_long = (
|
||||
cum_wide.reset_index()
|
||||
.melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
|
||||
)
|
||||
|
||||
cum_total = cum_wide.sum(axis=1)
|
||||
cum_share = cum_wide.div(cum_total.where(cum_total > 0), axis=0)
|
||||
|
||||
cum_share_long = (
|
||||
cum_share.reset_index()
|
||||
.melt(id_vars="date_day", var_name="stance", value_name="cumulative_share")
|
||||
)
|
||||
|
||||
|
||||
tab_daily, tab_area, tab_share = st.tabs([
|
||||
"Daily",
|
||||
"Cumulative",
|
||||
"Cumulative Share",
|
||||
])
|
||||
|
||||
with tab_daily:
|
||||
fig = px.bar(
|
||||
daily_long,
|
||||
x="date_day",
|
||||
y="count",
|
||||
color="stance",
|
||||
category_orders={"stance": stance_order},
|
||||
color_discrete_map=stance_colors,
|
||||
)
|
||||
fig.update_layout(barmode="stack", height=420, legend_orientation="v")
|
||||
st.plotly_chart(fig, width="stretch")
|
||||
|
||||
with tab_area:
|
||||
fig = px.area(
|
||||
cum_long,
|
||||
x="date_day",
|
||||
y="cumulative_count",
|
||||
color="stance",
|
||||
category_orders={"stance": stance_order},
|
||||
color_discrete_map=stance_colors,
|
||||
)
|
||||
fig.update_layout(height=420, legend_orientation="v")
|
||||
st.plotly_chart(fig, width="stretch")
|
||||
|
||||
with tab_share:
|
||||
fig = px.line(
|
||||
cum_share_long,
|
||||
x="date_day",
|
||||
y="cumulative_share",
|
||||
color="stance",
|
||||
category_orders={"stance": stance_order},
|
||||
color_discrete_map=stance_colors,
|
||||
)
|
||||
fig.update_yaxes(tickformat=".0%", range=[0, 1])
|
||||
fig.update_layout(height=420, legend_orientation="v")
|
||||
st.plotly_chart(fig, width="stretch")
|
||||
|
||||
st.subheader("Comment Explorer",anchor=False,divider="gray")
|
||||
# comment explorer
|
||||
cex_left, cex_right = st.columns([1,1])
|
||||
with cex_left:
|
||||
filter_stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
|
||||
filter_tone = st.multiselect("Filter tone", sorted(df["tone"].dropna().unique()), default=sorted(df["tone"].dropna().unique()))
|
||||
dff = df[df["stance"].isin(filter_stance) & df["tone"].isin(filter_tone)]
|
||||
|
||||
with cex_right:
|
||||
q = st.text_input("Search comment title and text")
|
||||
if q:
|
||||
dff = dff[(dff["title"].fillna("") + " " + dff["text"].fillna("")).str.contains(q, case=False, regex=False)]
|
||||
st.text(""); st.text("")
|
||||
st.text("Showing " + str(len(dff))+ " comments",text_alignment="right", width="stretch")
|
||||
|
||||
st.dataframe(dff[["comment_id", "title", "text", "stance", "stance_confidence", "tone"]], width="stretch")
|
||||
|
||||
cid = st.selectbox("Select comment to view:", dff["comment_id"].astype(str))
|
||||
row = dff[dff["comment_id"].astype(str) == cid].iloc[0]
|
||||
|
||||
st.markdown(f'**{row["title"]}**')
|
||||
st.text(row["text"])
|
||||
st.write(row["author"] + ", " + row["date_dt"].strftime("%m/%d/%y"))  # explicit format; "%D" is not supported on every platform
|
||||
|
||||
st.divider()
|
||||
|
||||
st.subheader('Analysis')
|
||||
cexs_left, cexs_right = st.columns([1,1])
|
||||
with cexs_left:
|
||||
st.write(f"**stance:** {row['stance']}")
|
||||
st.write(f"**stance_confidence:** {row['stance_confidence']:.2f}")
|
||||
st.write(f"**tone:** {row['tone']}")
|
||||
st.write("**analysis:** "+ row["stance_rationale"])
|
||||
with cexs_right:
|
||||
x_order = ["unknown","oppose","mixed","neutral","support"] # includes mixed even if absent; harmless zero column
|
||||
y_order = ["positive","neutral","mixed","negative","unclear"]
|
||||
tab = pd.crosstab(df["tone"], df["stance"]).reindex(index=y_order, columns=x_order, fill_value=0)
|
||||
pct = tab.div(tab.sum(axis=1).replace(0, pd.NA), axis=0).fillna(0)
|
||||
tone_stance = px.imshow(
|
||||
pct,
|
||||
x=x_order, y=y_order,
|
||||
text_auto=".0%",
|
||||
aspect="auto",
|
||||
color_continuous_scale="Greens",
|
||||
)
|
||||
tone_stance.update_traces(text=tab.astype(str) + " / " + (pct*100).round(0).astype(int).astype(str) + "%")
|
||||
tone_stance.add_scatter(x=[row["stance"]],y=[row["tone"]],mode="markers",marker=dict(size=15,color="yellow",symbol="cross",line=dict(width=1, color="red")),showlegend=False)
|
||||
tone_stance.update_layout(height=420, xaxis_title="stance", yaxis_title="tone")
|
||||
st.plotly_chart(tone_stance, width='stretch')
|
||||
st.caption("Tone by stance, % within tone", text_alignment="right",width="stretch")
|
||||
|
||||
st.divider()
|
||||
st.write("**model:** " + str(row["model"]))
|
||||
with st.expander("Prompt", expanded=False):
|
||||
st.code(prompt, language="text")
|
||||
|
||||
tone_conf = px.box(df,x="stance",y="stance_confidence",color="stance",category_orders={"stance":stance_order},color_discrete_map=stance_colors,points="outliers",title="Comment Stance Classification Confidence")
|
||||
tone_conf.update_yaxes(range=[0,1.02])
|
||||
tone_conf.update_layout(height=430, legend_orientation="v")
|
||||
st.plotly_chart(tone_conf,width="stretch")
|
||||