Compare commits
29 Commits
946aeac7c8
...
master
| Author | SHA1 | Date | |
|---|---|---|---|
| 8f1d9e7723 | |||
| 181477bce7 | |||
| 771f11fd3c | |||
| f42183eeda | |||
| 92706bafb5 | |||
| 723b353db8 | |||
| 67cd96a523 | |||
| cc16acbb12 | |||
| afd5b8c60e | |||
| 3fb424da3c | |||
| c3f2911563 | |||
| 05515745fd | |||
| 3d3372bbb3 | |||
| 3a139da440 | |||
| 976db1b0fe | |||
| 7593754866 | |||
| 016882d527 | |||
| 58feb9820d | |||
| 35f30e9514 | |||
| 985760be7c | |||
| 983650a64f | |||
| eaaefb66f2 | |||
| bdab3c5e21 | |||
| b4a9651e11 | |||
| 1ea696d818 | |||
| 28d6d222bd | |||
| 72c2ae0ca0 | |||
| f5d679808e | |||
| 64a7a18721 |
3
.gitignore
vendored
@@ -28,4 +28,5 @@ archive/
|
|||||||
output/
|
output/
|
||||||
|
|
||||||
# --- misc ---
|
# --- misc ---
|
||||||
.DS_Store
|
.DS_Store
|
||||||
|
*~$*
|
||||||
212
README.md
@@ -1,21 +1,5 @@
|
|||||||
|
|
||||||
# Table of Contents
|
## Project Goals
|
||||||
|
|
||||||
1. [Project Goals](#org5acb669)
|
|
||||||
1. [Document and analyze sentiment](#org9291576)
|
|
||||||
2. [Make data available](#org8054421)
|
|
||||||
3. [Generalize](#orgdda4b6f)
|
|
||||||
2. [Architecture](#org1d6bc40)
|
|
||||||
1. [Scraper](#org4298028)
|
|
||||||
2. [Storage](#org1cd413c)
|
|
||||||
3. [Analysis](#orgaea450e)
|
|
||||||
3. [Roadmap](#org6b7660d)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
<a id="org5acb669"></a>
|
|
||||||
|
|
||||||
# Project Goals
|
|
||||||
|
|
||||||
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
||||||
1. the utility of this forum as a mechanism for public comment, and
|
1. the utility of this forum as a mechanism for public comment, and
|
||||||
@@ -23,131 +7,127 @@
|
|||||||
2. Make data and insights broadly available.
|
2. Make data and insights broadly available.
|
||||||
3. Generalize to other public comment tools.
|
3. Generalize to other public comment tools.
|
||||||
|
|
||||||
|
Take a look at https://vatownhall.streamlit.app
|
||||||
<a id="org9291576"></a>
|

|
||||||
|
|
||||||
## Document and analyze sentiment
|
|
||||||
|
|
||||||
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
|
|
||||||
- Build tests for identifying abuse, such as spam and account fraud
|
|
||||||
- Identify any patterns connecting measured sentiment against VA decisions
|
|
||||||
|
|
||||||
|
|
||||||
<a id="org8054421"></a>
|
### Research questions
|
||||||
|
|
||||||
## Make data available
|
1. What is the quality of the comments on the forum?
|
||||||
|
1. Are there duplicate entries?
|
||||||
- Pick a good visualization tool
|
2. Are there non-human-generated entries?
|
||||||
|
3. Are there entries intended to abuse the forum or drown out comment?
|
||||||
|
2. How do commenters feel about the proposed change?
|
||||||
|
1. What is the total number and percent supporting vs opposing, and how does this change over time?
|
||||||
|
2. What is the type of support, such as strong/weak, positive/negative?
|
||||||
|
3. What impact do the comments have on the proposed change?
|
||||||
|
(I anticipate this will not be measurable from currently available data)
|
||||||
|
|
||||||
|
|
||||||
<a id="orgdda4b6f"></a>
|
<a id="orgfabfcd9"></a>
|
||||||
|
|
||||||
## Generalize
|
## Architecture
|
||||||
|
|
||||||
- Identify scalable ways to apply this toolset to similar problems
|
1. Scrape/Parse: Scrapy
|
||||||
|
2. Sentiment analysis: gpt-5.4-mini
|
||||||
|
3. Display: streamlit
|
||||||
|
4. Storage: jsonl, csv, parquet
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
<a id="org1d6bc40"></a>
|
<a id="org2c5c7a2"></a>
|
||||||
|
|
||||||
# Architecture
|
### Scraper
|
||||||
|
|
||||||
1. Scrape/Parse: ****Scrapy**** for downloading comments
|
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
|
||||||
2. Storage: json
|
|
||||||
3. Sentiment analysis: Claude haiku
|
1. Forums listing page: `Forums.cfm` lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
||||||
4. Display: TBD
|
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` lists comments with title, author, date
|
||||||
|
3. Individual comment page: `viewcomments.cfm?commentid=X` shows regulation title + brief description at the top, plus the comment
|
||||||
|
|
||||||
|
|
||||||
<a id="org4298028"></a>
|
<a id="org72990f4"></a>
|
||||||
|
|
||||||
## Scraper
|
### Analysis
|
||||||
|
|
||||||
Scrapy provides a simple mechanism for browsing and
|
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation). For example, "I strongly believe the government should NOT interfere" reads as negative in tone, but what matters here is that the commenter is "against" the regulation. To let the model judge stance, we add the proposed change as context.
|
||||||
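A minimal sketch of how the proposed change might be passed as context alongside a comment. The function name, argument names, and section labels are illustrative assumptions, not the project's actual interface.

```python
def build_messages(system_prompt, reg_title, reg_desc, comment_title, comment_text):
    """Pair one comment with the regulation it responds to (hypothetical helper)."""
    user_text = (
        "## Proposed Regulation\n"
        f"Title: {reg_title}\n"
        f"Description: {reg_desc}\n\n"
        "## Public Comment\n"
        f"Title: {comment_title}\n"
        f"Body:\n{comment_text}\n"
    )
    # The system message carries the classification instructions;
    # the user message carries the regulation context plus the comment.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


messages = build_messages(
    system_prompt="Classify stance and tone. Return only JSON.",
    reg_title="Example regulation",          # placeholder text
    reg_desc="Example description",          # placeholder text
    comment_title="Leave it alone",
    comment_text="I strongly believe the government should NOT interfere",
)
```

With the regulation in the same message, the model can classify stance relative to that specific proposal rather than report tone alone.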
|
|
||||||
1. Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
Before sending the comments for sentiment analysis, `tokenizer.py` takes the forum to be processed and the prompt as inputs, then generates a `report.json` estimating tokens (via tiktoken), cost, and run time for multiple models.
|
||||||
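A rough sketch of the kind of estimate such a report could contain. To stay dependency-free it uses a characters-per-token heuristic in place of tiktoken, and the price is a made-up placeholder, so treat the numbers as illustrative only.

```python
def estimate_tokens(messages):
    """Approximate tokens: about 3 characters per token, plus 4 tokens of overhead per message."""
    return sum(4 + len(m["content"]) // 3 for m in messages)


def estimate_cost(total_tokens, usd_per_million_tokens):
    """Convert a token count to dollars at a given (assumed) price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


messages = [
    {"role": "system", "content": "x" * 300},  # stand-in for the prompt
    {"role": "user", "content": "y" * 600},    # stand-in for one comment
]
tokens = estimate_tokens(messages)
cost = estimate_cost(tokens, usd_per_million_tokens=0.5)  # hypothetical price
print(tokens, cost)
```

A real report would swap the heuristic for an actual tokenizer and look up current prices per model.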
2. Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
|
|
||||||
3. Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
|
Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to download and check their status.
|
||||||
|
|
||||||
|
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
|
||||||
|
|
||||||
|
1. Prompt
|
||||||
|
```
|
||||||
|
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
||||||
|
regulatory comment system. You will be given the text of a proposed regulation and a single
|
||||||
|
public comment. Return ONLY a JSON object — no other text.
|
||||||
|
|
||||||
|
Definitions:
|
||||||
|
- stance: the commenter's position on whether the regulation should be adopted.
|
||||||
|
"support" = wants it approved (as-is or with changes);
|
||||||
|
"oppose" = wants it rejected or substantially weakened;
|
||||||
|
"neutral" = takes no position, asks a question, or provides factual input only;
|
||||||
|
"unknown" = too vague, off-topic, or uninterpretable to classify.
|
||||||
|
- tone: the emotional register of the writing, independent of stance.
|
||||||
|
"positive" = affirming, hopeful, appreciative;
|
||||||
|
"negative" = angry, fearful, alarmed, or contemptuous;
|
||||||
|
"neutral" = matter-of-fact, procedural, or informational;
|
||||||
|
"mixed" = contains both positive and negative emotional content;
|
||||||
|
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||||
|
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||||
|
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||||
|
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
||||||
|
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
||||||
|
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||||
|
|
||||||
|
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||||
|
```
|
||||||
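Since the prompt asks for an exact set of keys and labels, a reply can be checked mechanically before it is stored. The following is a sketch under that assumption; `validate_response` is a hypothetical helper, not part of the repository.

```python
import json

# Allowed labels, copied from the prompt's definitions.
STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw: str) -> list[str]:
    """Return a list of problems found in a model reply; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(data)}")

    stance = data.get("stance")
    if not isinstance(stance, str) or stance not in STANCES:
        problems.append(f"bad stance: {stance!r}")

    tone = data.get("tone")
    if not isinstance(tone, str) or tone not in TONES:
        problems.append(f"bad tone: {tone!r}")

    conf = data.get("stance_confidence")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append(f"bad stance_confidence: {conf!r}")

    tags = data.get("tags")
    if not isinstance(tags, list) or len(tags) > 5 or not all(isinstance(t, str) for t in tags):
        problems.append(f"bad tags: {tags!r}")
    return problems


reply = json.dumps({
    "stance": "oppose",
    "stance_confidence": 0.9,
    "stance_rationale": "Says the government should NOT interfere.",
    "tone": "negative",
    "tags": ["parental rights"],
})
print(validate_response(reply))
```

Returning a list of problems rather than raising lets a caller log every issue with a reply in a single pass.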
|
|
||||||
|
|
||||||
<a id="org1cd413c"></a>
|
<a id="org58a5b72"></a>
|
||||||
|
|
||||||
## Storage
|
### Storage
|
||||||
|
|
||||||
One JSONL file per forum/bill.
|
- Each scraped forum is saved to `output/<forum-id>.jsonl`
|
||||||
|
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
|
||||||
|
- Each job is saved to `analysis/jobs/<report-id>`:
|
||||||
|
└─`forum.jsonl` is a copy of the scraped forum for convenience
|
||||||
|
└─`prompt.txt` is a copy of the prompt used
|
||||||
|
└─`report.json` is a copy of the report used
|
||||||
|
└─`status.json` contains metadata about the job
|
||||||
|
For each batch in the job, four files are created:
|
||||||
|
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
|
||||||
|
└─`jobN-output-raw.jsonl` contains the exact response from the API
|
||||||
|
└─`jobN-output.jsonl` contains the normalized output parsed from the raw response
|
||||||
|
└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
|
||||||
|
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
|
||||||
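The job layout above can be expressed as a small path helper, which may be handy for checking that a job folder is complete. This is only a sketch: `job_paths` is a hypothetical function, and numbering batches from 1 is an assumption.

```python
from pathlib import Path


def job_paths(report_id: str, batch_count: int, root: Path = Path("analysis/jobs")) -> list[Path]:
    """List the files expected in one job folder (hypothetical helper)."""
    job_dir = root / report_id
    # Per-job files: copies of the inputs plus job metadata.
    paths = [job_dir / name for name in ("forum.jsonl", "prompt.txt", "report.json", "status.json")]
    # Per-batch files; the errors file may legitimately be absent on disk.
    for n in range(1, batch_count + 1):  # assumption: batches are numbered from 1
        for suffix in ("input", "output-raw", "output", "output-errors"):
            paths.append(job_dir / f"job{n}-{suffix}.jsonl")
    return paths


paths = job_paths("f452-1", batch_count=2)
for p in paths:
    print(p)
```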
|
|
||||||
|
|
||||||
<a id="orgaea450e"></a>
|
<a id="org24fe465"></a>
|
||||||
|
|
||||||
## Analysis
|
## Instructions
|
||||||
|
|
||||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
|
1. Scrape the forum.
|
||||||
|
`python`
|
||||||
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
|
2. Run model report.
|
||||||
|
`python analysis/tokenizer.py <input> --prompt <prompt>`
|
||||||
|
3. To run a realtime subset:
|
||||||
|
`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
|
||||||
|
`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
|
||||||
|
4. To create and run the whole thing in batches, first create the batch jobs from the report:
|
||||||
|
`python analysis/openai_batch.py create <report> --model <model>`
|
||||||
|
`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
|
||||||
|
5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model's queue fills up, the batch will fail, and resubmission is not implemented.
|
||||||
|
`python analysis/openai_batch.py submit`
|
||||||
|
`python analysis/openai_batch.py status`
|
||||||
|
`python analysis/openai_batch.py download`
|
||||||
|
`python analysis/openai_batch.py submit`
|
||||||
|
|
||||||
|
|
||||||
<colgroup>
|
<a id="org5739d49"></a>
|
||||||
<col class="org-left" />
|
|
||||||
|
|
||||||
<col class="org-left" />
|
|
||||||
|
|
||||||
<col class="org-left" />
|
|
||||||
|
|
||||||
<col class="org-left" />
|
|
||||||
|
|
||||||
<col class="org-left" />
|
|
||||||
|
|
||||||
<col class="org-left" />
|
|
||||||
</colgroup>
|
|
||||||
<thead>
|
|
||||||
<tr>
|
|
||||||
<th scope="col" class="org-left">Tool</th>
|
|
||||||
<th scope="col" class="org-left">Output</th>
|
|
||||||
<th scope="col" class="org-left">Context</th>
|
|
||||||
<th scope="col" class="org-left">Sarcasm</th>
|
|
||||||
<th scope="col" class="org-left">Context window</th>
|
|
||||||
<th scope="col" class="org-left">Cost/1k comments</th>
|
|
||||||
</tr>
|
|
||||||
</thead>
|
|
||||||
<tbody>
|
|
||||||
<tr>
|
|
||||||
<td class="org-left">Google NL API</td>
|
|
||||||
<td class="org-left">-1→+1, magnitude</td>
|
|
||||||
<td class="org-left">No/generic</td>
|
|
||||||
<td class="org-left">Poorly</td>
|
|
||||||
<td class="org-left">No</td>
|
|
||||||
<td class="org-left">~$1–2</td>
|
|
||||||
</tr>
|
|
||||||
|
|
||||||
<tr>
|
|
||||||
<td class="org-left">Amazon Comprehend</td>
|
|
||||||
<td class="org-left">Pos/Neg/Neutral/Mixed</td>
|
|
||||||
<td class="org-left">No/generic</td>
|
|
||||||
<td class="org-left">Poorly</td>
|
|
||||||
<td class="org-left">No</td>
|
|
||||||
<td class="org-left">~$0.10</td>
|
|
||||||
</tr>
|
|
||||||
|
|
||||||
<tr>
|
|
||||||
<td class="org-left">Claude Haiku</td>
|
|
||||||
<td class="org-left">Prompted → for/against/neutral</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">Yes, with prompt</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">~$0.10–0.30</td>
|
|
||||||
</tr>
|
|
||||||
|
|
||||||
<tr>
|
|
||||||
<td class="org-left">GPT-4o-mini</td>
|
|
||||||
<td class="org-left">Prompted → same</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">Yes</td>
|
|
||||||
<td class="org-left">~$0.05–0.15</td>
|
|
||||||
</tr>
|
|
||||||
</tbody>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
|
|
||||||
<a id="org6b7660d"></a>
|
|
||||||
|
|
||||||
# Roadmap
|
# Roadmap
|
||||||
|
|
||||||
|
|||||||
@@ -43,3 +43,4 @@ Description and PM notes
|
|||||||
- project dir: `%userprofile%\projects\vath\`
|
- project dir: `%userprofile%\projects\vath\`
|
||||||
- python venv: `%userprofile%\projects\vath\venv\scripts\activate`
|
- python venv: `%userprofile%\projects\vath\venv\scripts\activate`
|
||||||
- pytest (inside venv): `python -m pytest tests/`
|
- pytest (inside venv): `python -m pytest tests/`
|
||||||
|
- create tests without the `test_` prefix, i.e. `tests/tokenizer.py`, not `tests/test_tokenizer.py`
|
||||||
|
|||||||
76
analysis/create_csv.py
Normal file
@@ -0,0 +1,76 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""analysis/create_csv.py — join raw scrape with analysis output for review."""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
RAW_COLS = ["forum_id", "comment_id", "title", "text", "date", "author"]
|
||||||
|
ANALYSIS_COLS = [
|
||||||
|
"stance", "stance_confidence", "stance_rationale", "tone", "tags",
|
||||||
|
"error", "truncated", "analyzed_at", "prompt_version", "model",
|
||||||
|
]
|
||||||
|
OUTPUT_COLS = RAW_COLS + ANALYSIS_COLS
|
||||||
|
|
||||||
|
|
||||||
|
def load_raw(path: Path) -> pd.DataFrame:
|
||||||
|
df = pd.read_json(path, lines=True)
|
||||||
|
df = df[df["comment_id"].notna()] # rm first item (forum, not comment)
|
||||||
|
for col in RAW_COLS:
|
||||||
|
if col not in df.columns:
|
||||||
|
df[col] = None
|
||||||
|
return df[RAW_COLS].copy()
|
||||||
|
|
||||||
|
|
||||||
|
def load_analysis(jobs_dir: Path) -> pd.DataFrame:
|
||||||
|
files = sorted(p for p in jobs_dir.glob("job*-output.jsonl") if "-raw" not in p.name)
|
||||||
|
df = pd.concat([pd.read_json(p, lines=True) for p in files], ignore_index=True)
|
||||||
|
for col in ANALYSIS_COLS:
|
||||||
|
if col not in df.columns:
|
||||||
|
df[col] = None
|
||||||
|
return df[["comment_id"] + ANALYSIS_COLS].copy()
|
||||||
|
|
||||||
|
|
||||||
|
def join(raw: pd.DataFrame, analysis: pd.DataFrame) -> pd.DataFrame:
|
||||||
|
return raw.merge(analysis, on="comment_id", how="left")[OUTPUT_COLS]
|
||||||
|
|
||||||
|
|
||||||
|
def print_counts(raw: pd.DataFrame, analysis: pd.DataFrame, merged: pd.DataFrame) -> None:
|
||||||
|
print(f"\nRaw comments : {len(raw):,}")
|
||||||
|
print(f"Analyzed : {len(analysis):,}")
|
||||||
|
print(f"Joined : {merged['stance'].notna().sum():,}")
|
||||||
|
print(f"Unanalyzed : {merged['stance'].isna().sum():,}")
|
||||||
|
print(f"Errors : {analysis['error'].notna().sum():,}")
|
||||||
|
print(f"Dup IDs (raw) : {raw['comment_id'].duplicated().sum():,}")
|
||||||
|
print(f"\nStance:\n{analysis['stance'].value_counts(dropna=False).to_string()}")
|
||||||
|
print(f"\nTone:\n{analysis['tone'].value_counts(dropna=False).to_string()}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
p = argparse.ArgumentParser(
|
||||||
|
description="Join raw scrape JSONL with analysis output; write review CSV."
|
||||||
|
)
|
||||||
|
p.add_argument("input", help="Raw scrape JSONL (e.g. output/f452.jsonl)")
|
||||||
|
p.add_argument("jobs_dir", help="Job directory containing job*-output.jsonl files")
|
||||||
|
p.add_argument("--parquet", action="store_true", help="Also write review.parquet")
|
||||||
|
p.add_argument("--out", default=None, help="Output CSV path (default: <jobs_dir>/review.csv)")
|
||||||
|
args = p.parse_args()
|
||||||
|
|
||||||
|
raw = load_raw(Path(args.input))
|
||||||
|
analysis = load_analysis(Path(args.jobs_dir))
|
||||||
|
merged = join(raw, analysis)
|
||||||
|
print_counts(raw, analysis, merged)
|
||||||
|
|
||||||
|
out = Path(args.out) if args.out else Path(args.jobs_dir) / "review.csv"
|
||||||
|
merged.to_csv(out, index=False, encoding="utf-8-sig")
|
||||||
|
print(f"CSV → {out}")
|
||||||
|
|
||||||
|
if args.parquet:
|
||||||
|
pq = out.with_suffix(".parquet")
|
||||||
|
merged.to_parquet(pq, index=False)
|
||||||
|
print(f"Parquet → {pq}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
74
analysis/encoding.py
Normal file
@@ -0,0 +1,74 @@
|
|||||||
|
"""
|
||||||
|
analysis/encoding.py — text encoding repair for scraped content.
|
||||||
|
|
||||||
|
The townhall.virginia.gov scraper forces UTF-8 decoding, which is correct for the
|
||||||
|
site's current content. This module provides a defensive repair function for cases
|
||||||
|
where a response arrives with Windows-1252/cp1252 bytes embedded in otherwise UTF-8
|
||||||
|
content (common in older CMSes). The raw scrape files are never modified; repair is
|
||||||
|
applied at the analysis and reporting layers only.
|
||||||
|
|
||||||
|
Primary: uses `ftfy` when installed (pip install ftfy).
|
||||||
|
Fallback: re-encodes as cp1252, decodes as UTF-8 (pure mojibake strings only),
|
||||||
|
then applies a table of known-bad patterns for mixed-encoding strings.
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Known patterns: UTF-8 bytes decoded as cp1252, i.e. the 3-char sequences you
|
||||||
|
# see when a server sends e.g. E2 80 99 and it gets decoded as cp1252 chars.
|
||||||
|
#
|
||||||
|
# Byte → cp1252 char mappings for the 0x80–0x9F range:
|
||||||
|
# E2 → â (U+00E2, always)
|
||||||
|
# 80 → € (U+20AC, cp1252 0x80)
|
||||||
|
# 99 → ™ (U+2122, cp1252 0x99) ← E2 80 99 = U+2019 ' right single quote
|
||||||
|
# 98 → ˜ (U+02DC, cp1252 0x98) ← E2 80 98 = U+2018 ' left single quote
|
||||||
|
# 9C → œ (U+0153, cp1252 0x9C) ← E2 80 9C = U+201C " left double quote
|
||||||
|
# 9D → \x9d (undefined → U+009D) ← E2 80 9D = U+201D " right double quote
|
||||||
|
# 93 → " (U+201C, cp1252 0x93) ← E2 80 93 = U+2013 – en dash
|
||||||
|
# 94 → " (U+201D, cp1252 0x94) ← E2 80 94 = U+2014 — em dash
|
||||||
|
# A6 → ¦ (U+00A6, cp1252 0xA6) ← E2 80 A6 = U+2026 … ellipsis
|
||||||
|
|
||||||
|
_KNOWN_REPAIRS: list[tuple[str, str]] = [
|
||||||
|
# Longer / more specific patterns first to avoid partial matches
|
||||||
|
("\u00e2\u20ac\u2122", "’"), # â€™ → ’ right single quote
|
||||||
|
("\u00e2\u20ac\u02dc", "‘"), # â€˜ → ‘ left single quote
|
||||||
|
("\u00e2\u20ac\u0153", "“"), # â€œ → “ left double quote
|
||||||
|
("\u00e2\u20ac\x9d", "”"), # â€ + U+009D → ” right double quote
|
||||||
|
("\u00e2\u20ac\u201c", "–"), # â€ + “ (U+201C) → – en dash
|
||||||
|
("\u00e2\u20ac\u201d", "—"), # â€ + ” (U+201D) → — em dash
|
||||||
|
("\u00e2\u20ac\u00a6", "…"), # â€¦ → … ellipsis
|
||||||
|
# Generic fallback: bare â€ prefix not caught above → remove artifact
|
||||||
|
("\u00e2\u20ac", ""),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def repair_text(text: str) -> str:
|
||||||
|
"""Repair common encoding artifacts in scraped text.
|
||||||
|
|
||||||
|
Handles:
|
||||||
|
- UTF-8 bytes decoded as cp1252/Latin-1 (â€™ → ’)
|
||||||
|
- Attempts best-effort cleanup for mixed-encoding strings
|
||||||
|
|
||||||
|
U+FFFD replacement characters (from strict UTF-8 decoding of cp1252 bytes)
|
||||||
|
cannot be recovered since the original byte is lost; they are left as-is.
|
||||||
|
"""
|
||||||
|
if not text:
|
||||||
|
return text
|
||||||
|
|
||||||
|
try:
|
||||||
|
import ftfy
|
||||||
|
return ftfy.fix_text(text)
|
||||||
|
except ImportError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Fallback 1: pure mojibake — entire string is UTF-8 bytes read as cp1252.
|
||||||
|
# Re-encode as cp1252 and decode as UTF-8.
|
||||||
|
try:
|
||||||
|
return text.encode("cp1252").decode("utf-8")
|
||||||
|
except (UnicodeEncodeError, UnicodeDecodeError):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Fallback 2: mixed strings — substitute known-bad patterns.
|
||||||
|
for bad, good in _KNOWN_REPAIRS:
|
||||||
|
if bad in text:
|
||||||
|
text = text.replace(bad, good)
|
||||||
|
return text
|
||||||
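To illustrate why the round-trip fallback covers only pure mojibake, here is a small standalone sketch. `roundtrip_repair` is an illustrative name that mirrors the first fallback step; it is not the module's function.

```python
def roundtrip_repair(text: str) -> str:
    """Undo UTF-8-read-as-cp1252 mojibake; return text unchanged if the round trip fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "don\u2019t"                               # contains a right single quote
mojibake = original.encode("utf-8").decode("cp1252")  # simulate the wrong decode
mixed = "caf\u00e9 " + mojibake                       # a genuine é next to mojibake

print(roundtrip_repair(mojibake) == original)  # pure mojibake is recovered
print(roundtrip_repair(mixed) == mixed)        # mixed text is left untouched
```

In the mixed string the genuine `é` encodes to a single cp1252 byte that is not valid UTF-8, so the whole round trip is rejected. That is the case a pattern-substitution table is meant to handle.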
@@ -1,556 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
analysis_batch.py — OpenAI Batch API pipeline
|
|
||||||
|
|
||||||
Commands (run manually in order):
|
|
||||||
submit <input_jsonl> [--model gpt-4o] [--limit N]
|
|
||||||
— build request file, upload, create batch
|
|
||||||
status [run_id] — check batch status, update manifest
|
|
||||||
download [run_id] — download + normalize output, update manifest
|
|
||||||
|
|
||||||
run_id defaults to the most recent run in runs/ when omitted.
|
|
||||||
|
|
||||||
File layout (all under analysis/gpt4o/):
|
|
||||||
requests/<run_id>.jsonl — batch input sent to OpenAI
|
|
||||||
raw/<run_id>.jsonl — raw batch output from OpenAI
|
|
||||||
runs/<run_id>.json — run manifest
|
|
||||||
<run_id>_<model>.jsonl — normalized output (same schema as realtime)
|
|
||||||
"""
|
|
||||||
|
|
||||||
import argparse
|
|
||||||
import hashlib
|
|
||||||
import json
|
|
||||||
import os
|
|
||||||
import sys
|
|
||||||
from datetime import datetime, timezone
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
from dotenv import load_dotenv
|
|
||||||
|
|
||||||
try:
|
|
||||||
import openai
|
|
||||||
except ImportError:
|
|
||||||
sys.exit("openai package not installed. Run: pip install openai")
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Model limits and token estimation
|
|
||||||
|
|
||||||
# Max enqueued tokens across ALL concurrent batches for this model
|
|
||||||
# (docs/openai.md pricing table, updated 2026-05-05).
|
|
||||||
# NOTE: your org tier may be lower — if a submit fails, use --limit to reduce chunk size.
|
|
||||||
MODEL_LIMITS: dict[str, int] = {
|
|
||||||
"gpt-5.5": 900_000,
|
|
||||||
"gpt-5.4": 900_000,
|
|
||||||
"gpt-5.4-mini": 2_000_000,
|
|
||||||
"gpt-5.4-nano": 200_000,
|
|
||||||
"gpt-4o": 900_000,
|
|
||||||
"gpt-4o-mini": 2_000_000,
|
|
||||||
"gpt-o4-mini": 2_000_000,
|
|
||||||
}
|
|
||||||
_DEFAULT_TOKEN_LIMIT = 900_000
|
|
||||||
|
|
||||||
# tiktoken encoding per model family; unknown models fall back to o200k_base
|
|
||||||
_MODEL_ENCODING: dict[str, str] = {
|
|
||||||
"gpt-5.5": "o200k_base",
|
|
||||||
"gpt-5.4": "o200k_base",
|
|
||||||
"gpt-5.4-mini": "o200k_base",
|
|
||||||
"gpt-5.4-nano": "o200k_base",
|
|
||||||
"gpt-4o": "o200k_base",
|
|
||||||
"gpt-4o-mini": "o200k_base",
|
|
||||||
"gpt-o4-mini": "o200k_base",
|
|
||||||
}
|
|
||||||
# Leave 10% headroom below the published limit
|
|
||||||
_LIMIT_BUFFER = 0.90
|
|
||||||
|
|
||||||
|
|
||||||
def estimate_tokens(messages: list[dict], model: str) -> int:
|
|
||||||
"""Estimate token count for a messages list.
|
|
||||||
|
|
||||||
Uses tiktoken when available (exact for OpenAI models); falls back to
|
|
||||||
chars/3 + 4-token overhead per message for unknown/Anthropic models.
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
import tiktoken
|
|
||||||
enc = tiktoken.get_encoding(_MODEL_ENCODING.get(model, "o200k_base"))
|
|
||||||
return sum(4 + len(enc.encode(m["content"])) for m in messages)
|
|
||||||
except ImportError:
|
|
||||||
return sum(4 + len(m["content"]) // 3 for m in messages)
|
|
||||||
|
|
||||||
|
|
||||||
def chunk_comments_by_tokens(
|
|
||||||
comments: list[dict], forum: dict | None, model: str
|
|
||||||
) -> list[list[dict]]:
|
|
||||||
"""Split comments into chunks where each chunk fits under the model token limit."""
|
|
||||||
raw_limit = MODEL_LIMITS.get(model, _DEFAULT_TOKEN_LIMIT)
|
|
||||||
token_limit = int(raw_limit * _LIMIT_BUFFER)
|
|
||||||
|
|
||||||
chunks: list[list[dict]] = []
|
|
||||||
current: list[dict] = []
|
|
||||||
current_tokens = 0
|
|
||||||
|
|
||||||
for comment in comments:
|
|
||||||
messages, _ = build_messages(comment, forum)
|
|
||||||
tokens = estimate_tokens(messages, model)
|
|
||||||
if current and current_tokens + tokens > token_limit:
|
|
||||||
chunks.append(current)
|
|
||||||
current = [comment]
|
|
||||||
current_tokens = tokens
|
|
||||||
else:
|
|
||||||
current.append(comment)
|
|
||||||
current_tokens += tokens
|
|
||||||
|
|
||||||
if current:
|
|
||||||
chunks.append(current)
|
|
||||||
|
|
||||||
return chunks
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Prompt
|
|
||||||
|
|
||||||
_DEFAULT_PROMPT_FILE = Path(__file__).parent.parent / "prompt-1.txt"
|
|
||||||
SYSTEM_PROMPT = _DEFAULT_PROMPT_FILE.read_text(encoding="utf-8").strip()
|
|
||||||
PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
|
|
||||||
|
|
||||||
|
|
||||||
def _load_prompt(path: Path) -> None:
|
|
||||||
"""Re-read a prompt file, updating module-level SYSTEM_PROMPT and PROMPT_VERSION."""
|
|
||||||
global SYSTEM_PROMPT, PROMPT_VERSION
|
|
||||||
SYSTEM_PROMPT = path.read_text(encoding="utf-8").strip()
|
|
||||||
PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
|
|
||||||
|
|
||||||
USER_TEMPLATE = """\
|
|
||||||
## Proposed Regulation
|
|
||||||
Title: {reg_title}
|
|
||||||
Description: {reg_desc}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Public Comment
|
|
||||||
Comment ID: {comment_id}
|
|
||||||
Title: {comment_title}
|
|
||||||
Body:
|
|
||||||
{comment_text}
|
|
||||||
|
|
||||||
---
|
|
||||||
Classify this comment per the instructions. Return only JSON.\
|
|
||||||
"""
|
|
||||||
|
|
||||||
MAX_COMMENT_CHARS = 6000
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Directories
|
|
||||||
|
|
||||||
_SCRIPT_DIR = Path(__file__).parent
|
|
||||||
REQUESTS_DIR = _SCRIPT_DIR / "requests"
|
|
||||||
RAW_DIR = _SCRIPT_DIR / "raw"
|
|
||||||
RUNS_DIR = _SCRIPT_DIR / "runs"
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Core functions (importable for tests)
|
|
||||||
|
|
||||||
|
|
||||||
def load_items(path: Path) -> tuple[dict | None, list[dict]]:
|
|
||||||
"""Read a scraped JSONL file. Returns (forum_item_or_None, [comment_items])."""
|
|
||||||
forum = None
|
|
||||||
comments = []
|
|
||||||
with open(path, encoding="utf-8") as f:
|
|
||||||
for line in f:
|
|
||||||
line = line.strip()
|
|
||||||
if not line:
|
|
||||||
continue
|
|
||||||
item = json.loads(line)
|
|
||||||
if "comment_id" in item:
|
|
||||||
comments.append(item)
|
|
||||||
elif "reg_title" in item:
|
|
||||||
forum = item
|
|
||||||
return forum, comments
|
|
||||||
|
|
||||||
|
|
||||||
def custom_id_from(comment_id: str) -> str:
|
|
||||||
return f"comment_{comment_id}"
|
|
||||||
|
|
||||||
|
|
||||||
def parse_custom_id(custom_id: str) -> str:
|
|
||||||
"""Return comment_id from a custom_id string."""
|
|
||||||
return custom_id.removeprefix("comment_")
|
|
||||||
|
|
||||||
|
|
||||||
def build_messages(comment: dict, forum: dict | None) -> tuple[list, bool]:
|
|
||||||
"""Build OpenAI messages for one comment. Returns (messages, truncated)."""
|
|
||||||
reg_title = (forum or {}).get("reg_title", "[unknown]")
|
|
||||||
reg_desc = (forum or {}).get("reg_desc", "[unknown]")
|
|
||||||
|
|
||||||
body = (comment.get("text") or "").strip()
|
|
||||||
truncated = False
|
|
||||||
if not body:
|
|
||||||
body = "[No body text provided]"
|
|
||||||
elif len(body) > MAX_COMMENT_CHARS:
|
|
||||||
body = body[:MAX_COMMENT_CHARS] + "... [truncated]"
|
|
||||||
truncated = True
|
|
||||||
|
|
||||||
user_text = USER_TEMPLATE.format(
|
|
||||||
reg_title=reg_title,
|
|
||||||
reg_desc=reg_desc,
|
|
||||||
comment_id=comment.get("comment_id", ""),
|
|
||||||
comment_title=comment.get("title", ""),
|
|
||||||
comment_text=body,
|
|
||||||
)
|
|
||||||
|
|
||||||
return [
|
|
||||||
{"role": "system", "content": SYSTEM_PROMPT},
|
|
||||||
{"role": "user", "content": user_text},
|
|
||||||
], truncated
|
|
||||||
|
|
||||||
|
|
||||||
def build_batch_request_line(comment: dict, forum: dict | None, model: str) -> dict:
|
|
||||||
"""Build one line of the batch input JSONL."""
|
|
||||||
messages, _ = build_messages(comment, forum)
|
|
||||||
return {
|
|
||||||
"custom_id": custom_id_from(comment["comment_id"]),
|
|
||||||
"method": "POST",
|
|
||||||
"url": "/v1/chat/completions",
|
|
||||||
"body": {
|
|
||||||
"model": model,
|
|
||||||
"messages": messages,
|
|
||||||
"response_format": {"type": "json_object"},
|
|
||||||
"temperature": 0.0,
|
|
||||||
},
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def normalize_output_line(
|
|
||||||
raw_line: dict,
|
|
||||||
comment_lookup: dict,
|
|
||||||
run_id: str,
|
|
||||||
analyzed_at: str,
|
|
||||||
model: str,
|
|
||||||
prompt_version: str,
|
|
||||||
) -> dict:
|
|
||||||
"""Convert one raw batch output line into a normalized analysis record.
|
|
||||||
|
|
||||||
comment_lookup: {comment_id: CommentItem dict}
|
|
||||||
prompt_version: taken from the run manifest so it reflects what was submitted.
|
|
||||||
"""
|
|
||||||
comment_id = parse_custom_id(raw_line.get("custom_id", ""))
|
|
||||||
comment = comment_lookup.get(comment_id, {})
|
|
||||||
|
|
||||||
base = {
|
|
||||||
"run_id": run_id,
|
|
||||||
"forum_id": comment.get("forum_id", ""),
|
|
||||||
"comment_id": comment_id,
|
|
||||||
"analyzed_at": analyzed_at,
|
|
||||||
"model": model,
|
|
||||||
"prompt_version": prompt_version,
|
|
||||||
"input_title": comment.get("title", ""),
|
|
||||||
"truncated": len(comment.get("text") or "") > MAX_COMMENT_CHARS,
|
|
||||||
}
|
|
||||||
|
|
||||||
# Check for outer-level batch error (e.g. batch_expired)
|
|
||||||
if raw_line.get("error"):
|
|
||||||
err = raw_line["error"]
|
|
||||||
err_msg = err.get("message", str(err)) if isinstance(err, dict) else str(err)
|
|
||||||
return {**base, "stance": None, "stance_confidence": None,
|
|
||||||
"stance_rationale": None, "tone": None, "tags": None, "error": err_msg}
|
|
||||||
|
|
||||||
response = raw_line.get("response") or {}
|
|
||||||
if response.get("status_code") != 200:
|
|
||||||
return {**base, "stance": None, "stance_confidence": None,
|
|
||||||
"stance_rationale": None, "tone": None, "tags": None,
|
|
||||||
"error": f"status {response.get('status_code')}"}
|
|
||||||
|
|
||||||
try:
|
|
||||||
content = response["body"]["choices"][0]["message"]["content"]
|
|
||||||
data = json.loads(content)
|
|
||||||
keys = ("stance", "stance_confidence", "stance_rationale", "tone", "tags")
|
|
||||||
parsed = {k: data.get(k) for k in keys}
|
|
||||||
return {**base, **parsed, "error": None}
|
|
||||||
except Exception as exc:
|
|
||||||
return {**base, "stance": None, "stance_confidence": None,
|
|
||||||
"stance_rationale": None, "tone": None, "tags": None, "error": str(exc)}
|
|
||||||
|
|
||||||
|
|
||||||
def make_manifest(
    run_id: str,
    input_filename: str,
    input_sha256: str,
    model: str,
    batch_id: str,
    records_submitted: int,
    request_filename: str,
) -> dict:
    return {
        "run_id": run_id,
        "input_filename": input_filename,
        "input_sha256": input_sha256,
        "prompt_hash": PROMPT_VERSION,
        "model": model,
        "batch_id": batch_id,
        "records_submitted": records_submitted,
        "records_completed": None,
        "records_failed": None,
        "request_filename": request_filename,
        "raw_output_filename": None,
        "normalized_output_filename": None,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "completed_at": None,
    }


def _latest_run_id() -> str:
    """Return the run_id of the most recently saved manifest, or exit if none found."""
    runs = list(RUNS_DIR.glob("*.json")) if RUNS_DIR.exists() else []
    if not runs:
        sys.exit(f"No runs found in {RUNS_DIR}. Submit a batch first.")
    latest = max(runs, key=lambda p: p.stat().st_mtime)
    return latest.stem


def load_manifest(run_id: str) -> dict:
    path = RUNS_DIR / f"{run_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


def save_manifest(manifest: dict) -> None:
    RUNS_DIR.mkdir(parents=True, exist_ok=True)
    path = RUNS_DIR / f"{manifest['run_id']}.json"
    path.write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")


# ---------------------------------------------------------------------------
# Subcommand: submit

def _submit_chunk(
    chunk: list[dict],
    forum: dict | None,
    input_path: Path,
    input_sha256: str,
    model: str,
    client,
    chunk_index: int,
    total_chunks: int,
) -> str:
    """Upload and submit one chunk of comments. Returns the run_id."""
    import uuid
    run_id = str(uuid.uuid4())
    label = f"chunk {chunk_index + 1}/{total_chunks}" if total_chunks > 1 else "single batch"

    REQUESTS_DIR.mkdir(parents=True, exist_ok=True)
    request_path = REQUESTS_DIR / f"{run_id}.jsonl"
    with open(request_path, "w", encoding="utf-8") as f:
        for comment in chunk:
            line = build_batch_request_line(comment, forum, model)
            f.write(json.dumps(line, ensure_ascii=False) + "\n")

    print(f"[{label}] Wrote {len(chunk)} requests → {request_path}", file=sys.stderr)

    with open(request_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="batch")
    print(f"[{label}] Uploaded: {uploaded.id}", file=sys.stderr)

    batch = client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
        metadata={"run_id": run_id, "input_filename": str(input_path)},
    )
    print(f"[{label}] Batch created: {batch.id} status={batch.status}", file=sys.stderr)

    manifest = make_manifest(
        run_id=run_id,
        input_filename=str(input_path),
        input_sha256=input_sha256,
        model=model,
        batch_id=batch.id,
        records_submitted=len(chunk),
        request_filename=str(request_path),
    )
    save_manifest(manifest)
    return run_id


def cmd_submit(args, client) -> None:
    _load_prompt(Path(args.prompt))
    print(f"Prompt: {args.prompt} (version {PROMPT_VERSION})", file=sys.stderr)

    input_path = Path(args.input)
    if not input_path.exists():
        sys.exit(f"File not found: {input_path}")

    print(f"Reading {input_path} ...", file=sys.stderr)
    forum, comments = load_items(input_path)
    if not comments:
        sys.exit("No comment items found in input file.")
    if forum is None:
        print("Warning: no ForumItem found — regulation context will be [unknown].", file=sys.stderr)

    if args.limit:
        comments = comments[:args.limit]
        print(f"Limiting to {len(comments)} comments (--limit {args.limit}).", file=sys.stderr)

    token_limit = int(MODEL_LIMITS.get(args.model, _DEFAULT_TOKEN_LIMIT) * _LIMIT_BUFFER)
    chunks = chunk_comments_by_tokens(comments, forum, args.model)
    total = len(chunks)
    print(
        f"Model: {args.model} token limit: {token_limit:,} "
        f"→ {len(comments)} comments split into {total} chunk(s).",
        file=sys.stderr,
    )

    input_sha256 = hashlib.sha256(input_path.read_bytes()).hexdigest()

    # Submit only the first chunk — the enqueued token limit is a TOTAL across all
    # concurrent batches, so stacking multiple submissions will exceed the quota.
    # Wait for each batch to complete before submitting the next.
    run_id = _submit_chunk(chunks[0], forum, input_path, input_sha256, args.model, client, 0, total)

    print(f"\nBatch 1/{total} submitted.", file=sys.stderr)
    print(f" status: python analysis/gpt4o/analysis_batch.py status {run_id}", file=sys.stderr)
    print(f" download: python analysis/gpt4o/analysis_batch.py download {run_id}", file=sys.stderr)

    if total > 1:
        remaining = sum(len(c) for c in chunks[1:])
        print(f"\n{total - 1} more chunk(s) remaining ({remaining} comments).", file=sys.stderr)
        print("After this batch completes and is downloaded, rerun submit with --limit to get the next chunk:", file=sys.stderr)
        offset = len(chunks[0])
        for idx, chunk in enumerate(chunks[1:], start=2):
            print(f" chunk {idx}/{total}: comments {offset}–{offset + len(chunk) - 1}", file=sys.stderr)
            offset += len(chunk)

    print(run_id)  # stdout for scripting


# ---------------------------------------------------------------------------
# Subcommand: status

def cmd_status(args, client) -> None:
    run_id = args.run_id or _latest_run_id()
    if not args.run_id:
        print(f"(using latest run: {run_id})", file=sys.stderr)
    manifest = load_manifest(run_id)
    batch = client.batches.retrieve(manifest["batch_id"])

    counts = batch.request_counts
    print(f"status: {batch.status}")
    print(f"completed: {counts.completed}/{counts.total}")
    print(f"failed: {counts.failed}")

    manifest["records_completed"] = counts.completed
    manifest["records_failed"] = counts.failed
    save_manifest(manifest)

    if batch.status == "completed":
        print("\nReady to download. Run:")
        print(f" python analysis/gpt4o/analysis_batch.py download {run_id}")


# ---------------------------------------------------------------------------
# Subcommand: download

def cmd_download(args, client) -> None:
    run_id = args.run_id or _latest_run_id()
    if not args.run_id:
        print(f"(using latest run: {run_id})", file=sys.stderr)
    manifest = load_manifest(run_id)
    batch = client.batches.retrieve(manifest["batch_id"])

    if batch.status != "completed":
        sys.exit(f"Batch not complete yet (status={batch.status}). Run 'status' to check.")
    # A completed batch can lack an output file when no request succeeded.
    if not batch.output_file_id:
        sys.exit("Batch completed without an output file; no successful responses to download.")

    run_id = manifest["run_id"]
    model = manifest["model"]
    model_slug = model.replace("/", "-")

    # Download raw output
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    raw_path = RAW_DIR / f"{run_id}.jsonl"
    raw_text = client.files.content(batch.output_file_id).text
    raw_path.write_text(raw_text, encoding="utf-8")
    print(f"Raw output → {raw_path}", file=sys.stderr)

    # Build comment lookup from original input for reconciliation
    input_path = Path(manifest["input_filename"])
    _, comments = load_items(input_path)
    comment_lookup = {c["comment_id"]: c for c in comments}

    # Normalize
    completed_at = datetime.now(timezone.utc).isoformat()
    if batch.completed_at:
        completed_at = datetime.fromtimestamp(batch.completed_at, tz=timezone.utc).isoformat()

    normalized_path = _SCRIPT_DIR / f"{run_id}_{model_slug}.jsonl"
    n_ok = n_err = 0
    with open(normalized_path, "w", encoding="utf-8") as out:
        for line in raw_text.splitlines():
            if not line.strip():
                continue
            raw_line = json.loads(line)
            record = normalize_output_line(raw_line, comment_lookup, run_id, completed_at, model, manifest["prompt_hash"])
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            if record["error"]:
                n_err += 1
            else:
                n_ok += 1

    print(f"Normalized → {normalized_path} ({n_ok} ok, {n_err} errors)", file=sys.stderr)

    manifest["records_completed"] = n_ok
    manifest["records_failed"] = n_err
    manifest["raw_output_filename"] = str(raw_path)
    manifest["normalized_output_filename"] = str(normalized_path)
    manifest["completed_at"] = completed_at
    save_manifest(manifest)
    print(f"Manifest updated → {RUNS_DIR / run_id}.json", file=sys.stderr)


# ---------------------------------------------------------------------------
# CLI

def main() -> None:
    load_dotenv()

    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        sys.exit("OPENAI_API_KEY not set. Create a .env file or export the variable.")

    parser = argparse.ArgumentParser(
        description="Public comment batch analysis pipeline.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )
    sub = parser.add_subparsers(dest="command", required=True)

    p_submit = sub.add_parser("submit", help="Build and submit a batch job")
    p_submit.add_argument("input", help="Path to scraped JSONL file")
    p_submit.add_argument("--model", default="gpt-4o", help="OpenAI model (default: gpt-4o)")
    p_submit.add_argument(
        "--prompt",
        default=str(_DEFAULT_PROMPT_FILE),
        help="Path to system prompt file (default: analysis/prompt-1.txt)",
    )
    p_submit.add_argument(
        "--limit", type=int, default=None, metavar="N",
        help="Submit only the first N comments (useful for staying under token quota)",
    )

    p_status = sub.add_parser("status", help="Check batch status")
    p_status.add_argument("run_id", nargs="?", default=None,
                          help="run_id from submit (default: most recent run)")

    p_download = sub.add_parser("download", help="Download and normalize completed batch")
    p_download.add_argument("run_id", nargs="?", default=None,
                            help="run_id from submit (default: most recent run)")

    args = parser.parse_args()
    client = openai.OpenAI(api_key=api_key)

    if args.command == "submit":
        cmd_submit(args, client)
    elif args.command == "status":
        cmd_status(args, client)
    elif args.command == "download":
        cmd_download(args, client)


if __name__ == "__main__":
    main()
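The comment in `cmd_submit` says to wait for each batch to finish before submitting the next, because the enqueued-token quota is shared across concurrent batches. A minimal polling sketch of that wait is shown here, assuming the same `client.batches.retrieve` call that `cmd_status` uses. The helper name `wait_for_batch`, its parameters, and the `TERMINAL_STATUSES` set are illustrative and not part of this diff.

```python
import time

# Statuses after which a batch will not change again (assumed set for this sketch).
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60.0, max_polls=None, sleep=time.sleep):
    """Poll a batch until it reaches a terminal status and return that status.

    `client` is expected to expose `batches.retrieve(batch_id)` returning an
    object with a `.status` attribute. `sleep` is injectable so the loop can be
    exercised without real delays. Raises TimeoutError after `max_polls`
    non-terminal polls when `max_polls` is given.
    """
    polls = 0
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch.status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"batch {batch_id} still '{batch.status}' after {polls} polls")
        sleep(poll_seconds)
```

A caller would submit the next chunk only when the returned status is `"completed"`, and treat the other terminal statuses as a reason to stop and inspect the batch.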
9084
analysis/jobs/f452-1/forum.jsonl
Normal file
File diff suppressed because one or more lines are too long
2270
analysis/jobs/f452-1/job1-input.jsonl
Normal file
File diff suppressed because one or more lines are too long
2270
analysis/jobs/f452-1/job1-output-raw.jsonl
Normal file
File diff suppressed because it is too large
2270
analysis/jobs/f452-1/job1-output.jsonl
Normal file
File diff suppressed because it is too large
2274
analysis/jobs/f452-1/job2-input.jsonl
Normal file
File diff suppressed because one or more lines are too long
2274
analysis/jobs/f452-1/job2-output-raw.jsonl
Normal file
File diff suppressed because it is too large
2274
analysis/jobs/f452-1/job2-output.jsonl
Normal file
File diff suppressed because it is too large
2282
analysis/jobs/f452-1/job3-input.jsonl
Normal file
File diff suppressed because one or more lines are too long
2282
analysis/jobs/f452-1/job3-output-raw.jsonl
Normal file
File diff suppressed because it is too large
2282
analysis/jobs/f452-1/job3-output.jsonl
Normal file
File diff suppressed because it is too large
2257
analysis/jobs/f452-1/job4-input.jsonl
Normal file
File diff suppressed because one or more lines are too long
2257
analysis/jobs/f452-1/job4-output-raw.jsonl
Normal file
File diff suppressed because it is too large
2257
analysis/jobs/f452-1/job4-output.jsonl
Normal file
File diff suppressed because it is too large
23
analysis/jobs/f452-1/prompt.txt
Normal file
@@ -0,0 +1,23 @@
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
regulatory comment system. You will be given the text of a proposed regulation and a single
public comment. Return ONLY a JSON object — no other text.

Definitions:
- stance: the commenter's position on whether the regulation should be adopted.
  "support" = wants it approved (as-is or with changes);
  "oppose" = wants it rejected or substantially weakened;
  "neutral" = takes no position, asks a question, or provides factual input only;
  "unknown" = too vague, off-topic, or uninterpretable to classify.
- tone: the emotional register of the writing, independent of stance.
  "positive" = affirming, hopeful, appreciative;
  "negative" = angry, fearful, alarmed, or contemptuous;
  "neutral" = matter-of-fact, procedural, or informational;
  "mixed" = contains both positive and negative emotional content;
  "unclear" = tone cannot be determined (e.g., a one-word comment).
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.

Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
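The prompt above fixes a small output schema: five keys, two closed label sets, a confidence in 0.0 to 1.0, and at most five tags. A minimal sketch of checking one model reply against that schema follows. The function name `validate_classification` and the constant names are hypothetical; the pipeline in this diff copies the five keys without validating them, so this is an optional extra check rather than existing behavior.

```python
import json

# Label sets and key list copied from prompt.txt; the names are illustrative.
STANCES = ("support", "oppose", "neutral", "unknown")
TONES = ("positive", "negative", "neutral", "mixed", "unclear")
REQUIRED_KEYS = ("stance", "stance_confidence", "stance_rationale", "tone", "tags")
MAX_TAGS = 5


def validate_classification(content: str) -> list[str]:
    """Return a list of schema problems in one reply; an empty list means it conforms."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != set(REQUIRED_KEYS):
        problems.append(f"keys differ from the required set: {sorted(data)}")
    if data.get("stance") not in STANCES:
        problems.append(f"unexpected stance: {data.get('stance')!r}")
    if data.get("tone") not in TONES:
        problems.append(f"unexpected tone: {data.get('tone')!r}")

    conf = data.get("stance_confidence")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append(f"stance_confidence not a number in [0, 1]: {conf!r}")

    if not isinstance(data.get("stance_rationale"), str):
        problems.append("stance_rationale is not a string")

    tags = data.get("tags")
    if (not isinstance(tags, list) or len(tags) > MAX_TAGS
            or not all(isinstance(t, str) for t in tags)):
        problems.append(f"tags is not a list of at most {MAX_TAGS} strings")
    return problems
```

Running this over the `content` string of each successful response would flag replies that parse as JSON but use labels outside the prompt's definitions.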
43
analysis/jobs/f452-1/report.json
Normal file
@@ -0,0 +1,43 @@
{
  "prompt": "analysis\\prompt-1.txt",
  "prompt_hash": "cb41250",
  "input_file": "output\\f452.jsonl",
  "input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
  "total_comments": 9083,
  "input_tokens": 6397254,
  "gpt-5.5": {
    "jobs": 9,
    "cost_$": 15.9931,
    "est_queue_days": 7.11
  },
  "gpt-5.4": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-5.4-mini": {
    "jobs": 4,
    "cost_$": 2.399,
    "est_queue_days": 3.2
  },
  "gpt-5.4-nano": {
    "jobs": 40,
    "cost_$": 0.6397,
    "est_queue_days": 31.99
  },
  "gpt-4o": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-4o-mini": {
    "jobs": 4,
    "cost_$": 0.4798,
    "est_queue_days": 3.2
  },
  "gpt-o4-mini": {
    "jobs": 4,
    "cost_$": 3.5185,
    "est_queue_days": 3.2
  }
}
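The `jobs` and `est_queue_days` figures in this report can be reproduced from `input_tokens` and the per-model enqueued-token limits. The sketch below is a back-of-envelope check, not the code of `tokenizer.py`, which is not part of this view: it assumes the limits 900,000, 2,000,000 and 200,000 with the 0.80 buffer used in `openai_batch.py`, takes the job count as a ceiling division by the buffered limit, and takes the queue estimate as `input_tokens / limit`. The function name `estimate_plan` is hypothetical.

```python
import math

LIMIT_BUFFER = 0.80  # same safety margin as _LIMIT_BUFFER in openai_batch.py


def estimate_plan(input_tokens: int, enqueued_limit: int) -> tuple[int, float]:
    """Return (approximate job count, estimated queue days) for one model.

    Job count: how many buffered-limit-sized chunks the input needs.
    Queue days: input tokens divided by the raw limit, rounded to 2 places
    (assumed formula; it matches the values in report.json).
    """
    buffered = int(enqueued_limit * LIMIT_BUFFER)
    jobs = math.ceil(input_tokens / buffered)
    queue_days = round(input_tokens / enqueued_limit, 2)
    return jobs, queue_days


if __name__ == "__main__":
    for limit in (900_000, 2_000_000, 200_000):
        print(limit, estimate_plan(6_397_254, limit))
```

The real chunker packs whole comments, so its job count can exceed this ceiling by one when comments do not divide evenly; for this input the two agree.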
9091
analysis/jobs/f452-1/review.csv
Normal file
File diff suppressed because one or more lines are too long
BIN
analysis/jobs/f452-1/review.xlsx
Normal file
Binary file not shown.
57
analysis/jobs/f452-1/status.json
Normal file
@@ -0,0 +1,57 @@
{
  "model": "gpt-5.4-mini",
  "prompt_hash": "cb41250",
  "input_file": "output\\f452.jsonl",
  "input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
  "total_comments": 9083,
  "input_tokens": 6397254,
  "est_queue_days": 3.2,
  "cost_$": 2.399,
  "total_jobs": 4,
  "jobs": [
    {
      "job_num": 1,
      "run_id": "76c97113-63aa-43db-8f84-9c60ebcbb105",
      "status": "completed",
      "batch_id": "batch_69fb9081639881909be0c40d86edd747",
      "records_submitted": 2270,
      "records_completed": 2270,
      "records_failed": 0,
      "submitted_at": "2026-05-06T19:03:28.949240+00:00",
      "completed_at": "2026-05-06T20:09:14+00:00"
    },
    {
      "job_num": 2,
      "run_id": "b8f3b0bb-f155-4a5c-acce-f3504c0e09aa",
      "status": "completed",
      "batch_id": "batch_69fba02df7b481909e96afa1ee8879f5",
      "records_submitted": 2274,
      "records_completed": 2274,
      "records_failed": 0,
      "submitted_at": "2026-05-06T20:10:21.424330+00:00",
      "completed_at": "2026-05-06T20:37:11+00:00"
    },
    {
      "job_num": 3,
      "run_id": "8d769f37-6beb-4a1b-87ee-3f66cdc6adc8",
      "status": "completed",
      "batch_id": "batch_69fba69a85488190977792b6f95b614b",
      "records_submitted": 2282,
      "records_completed": 2282,
      "records_failed": 0,
      "submitted_at": "2026-05-06T20:37:45.586815+00:00",
      "completed_at": "2026-05-06T21:09:24+00:00"
    },
    {
      "job_num": 4,
      "run_id": "e6affbc2-ddc9-43a6-b8e9-d1f47e736283",
      "status": "completed",
      "batch_id": "batch_69fbe44565748190ad19f17ee3143f8d",
      "records_submitted": 2257,
      "records_completed": 2257,
      "records_failed": 0,
      "submitted_at": "2026-05-07T01:00:52.886953+00:00",
      "completed_at": "2026-05-07T09:20:01+00:00"
    }
  ]
}
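A `status.json` like the one above can be sanity-checked mechanically: the per-job `records_submitted` values should add up to `total_comments`, and each job's wall-clock time follows from its two timestamps. The sketch below assumes the field names shown in this file; the function name `summarize_jobs` and the shape of its return value are illustrative only.

```python
from datetime import datetime


def summarize_jobs(status: dict) -> dict:
    """Summarize a status.json dict: record totals, coverage, per-job seconds."""
    jobs = status["jobs"]
    submitted = sum(j["records_submitted"] for j in jobs)
    completed = sum(j.get("records_completed") or 0 for j in jobs)
    failed = sum(j.get("records_failed") or 0 for j in jobs)

    durations = {}
    for j in jobs:
        # Jobs that are still pending have no timestamps yet; skip them.
        if j.get("submitted_at") and j.get("completed_at"):
            start = datetime.fromisoformat(j["submitted_at"])
            end = datetime.fromisoformat(j["completed_at"])
            durations[j["job_num"]] = (end - start).total_seconds()

    return {
        "submitted": submitted,
        "completed": completed,
        "failed": failed,
        "covers_all_comments": submitted == status["total_comments"],
        "seconds_by_job": durations,
    }
```

For the file above, the four jobs submit 2270 + 2274 + 2282 + 2257 = 9083 records, which equals `total_comments`.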
624
analysis/openai_batch.py
Normal file
@@ -0,0 +1,624 @@
#!/usr/bin/env python3
"""
openai_batch.py — OpenAI Batch API job runner

Run tokenizer.py first to generate report.json, then:
  create <report.json> --model <model> — build job directory
  submit [--job N] [--dir DIR] — submit next eligible job
  status [--job N] [--dir DIR] — check job status
  download [--job N] [--dir DIR] — download + normalize completed jobs

DIR is a name under analysis/jobs/ (default: most recently created).
"""

import argparse
import hashlib
import json
import os
import shutil
import sys
import uuid
from datetime import datetime, timezone
from pathlib import Path

from dotenv import load_dotenv

try:
    import openai
except ImportError:
    sys.exit("openai package not installed. Run: pip install openai")

# ---------------------------------------------------------------------------
# Model limits and token estimation

# Max enqueued tokens across ALL concurrent batches (docs/openai.md, 2026-05-05).
# Org-tier limits may be lower; use --job to limit submission size if needed.
MODEL_LIMITS: dict[str, int] = {
    "gpt-5.5": 900_000,
    "gpt-5.4": 900_000,
    "gpt-5.4-mini": 2_000_000,
    "gpt-5.4-nano": 200_000,
    "gpt-4o": 900_000,
    "gpt-4o-mini": 2_000_000,
    "gpt-o4-mini": 2_000_000,
}
_DEFAULT_TOKEN_LIMIT = 900_000
_MODEL_ENCODING: dict[str, str] = {
    "gpt-5.5": "o200k_base",
    "gpt-5.4": "o200k_base",
    "gpt-5.4-mini": "o200k_base",
    "gpt-5.4-nano": "o200k_base",
    "gpt-4o": "o200k_base",
    "gpt-4o-mini": "o200k_base",
    "gpt-o4-mini": "o200k_base",
}
_LIMIT_BUFFER = 0.80


def estimate_tokens(messages: list[dict], model: str) -> int:
    """Token count per OpenAI cookbook chat formula; falls back to chars/3."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding(_MODEL_ENCODING.get(model, "o200k_base"))
        # Per OpenAI cookbook for gpt-4o: 3 overhead per message + role + content;
        # plus 3 tokens for the reply primer (<|start|>assistant<|message|>).
        total = 3  # reply primer
        for m in messages:
            total += 3
            total += len(enc.encode(m.get("role", "")))
            total += len(enc.encode(m["content"]))
        return total
    except ImportError:
        return 3 + sum(3 + len(m["content"]) // 3 for m in messages)


def chunk_comments_by_tokens(
    comments: list[dict], forum: dict | None, model: str
) -> list[list[dict]]:
    """Greedily group consecutive comments into chunks under the buffered token limit.

    The limit is the model's enqueued-token limit times _LIMIT_BUFFER. Input order
    is preserved and a comment is never split, so a single comment that exceeds
    the limit by itself still ends up in a chunk.
    """
    token_limit = int(MODEL_LIMITS.get(model, _DEFAULT_TOKEN_LIMIT) * _LIMIT_BUFFER)
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_tokens = 0
    for comment in comments:
        messages, _ = build_messages(comment, forum)
        tokens = estimate_tokens(messages, model)
        if current and current_tokens + tokens > token_limit:
            chunks.append(current)
            current = [comment]
            current_tokens = tokens
        else:
            current.append(comment)
            current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


# ---------------------------------------------------------------------------
# Prompt

_DEFAULT_PROMPT_FILE = Path(__file__).parent / "prompt-1.txt"
SYSTEM_PROMPT = _DEFAULT_PROMPT_FILE.read_text(encoding="utf-8").strip()
PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]


def _load_prompt(path: Path) -> None:
    global SYSTEM_PROMPT, PROMPT_VERSION
    SYSTEM_PROMPT = path.read_text(encoding="utf-8").strip()
    PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]


USER_TEMPLATE = """\
## Proposed Regulation
Title: {reg_title}
Description: {reg_desc}

---

## Public Comment
Comment ID: {comment_id}
Title: {comment_title}
Body:
{comment_text}

---
Classify this comment per the instructions. Return only JSON.\
"""

MAX_COMMENT_CHARS = 6000

# ---------------------------------------------------------------------------
# Directories

_SCRIPT_DIR = Path(__file__).parent
JOBS_DIR = _SCRIPT_DIR / "jobs"

# ---------------------------------------------------------------------------
# Core functions (importable for tests)


def load_items(path: Path) -> tuple[dict | None, list[dict]]:
    """Read a scraped JSONL. Returns (forum_item_or_None, [comment_items])."""
    forum = None
    comments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            if "comment_id" in item:
                comments.append(item)
            elif "reg_title" in item:
                forum = item
    return forum, comments


def custom_id_from(comment_id: str) -> str:
    return f"comment_{comment_id}"


def parse_custom_id(custom_id: str) -> str:
    return custom_id.removeprefix("comment_")


def build_messages(comment: dict, forum: dict | None) -> tuple[list, bool]:
    """Build OpenAI messages for one comment. Returns (messages, truncated)."""
    reg_title = (forum or {}).get("reg_title", "[unknown]")
    reg_desc = (forum or {}).get("reg_desc", "[unknown]")
    body = (comment.get("text") or "").strip()
    truncated = False
    if not body:
        body = "[No body text provided]"
    elif len(body) > MAX_COMMENT_CHARS:
        body = body[:MAX_COMMENT_CHARS] + "... [truncated]"
        truncated = True
    user_text = USER_TEMPLATE.format(
        reg_title=reg_title,
        reg_desc=reg_desc,
        comment_id=comment.get("comment_id", ""),
        comment_title=comment.get("title", ""),
        comment_text=body,
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ], truncated


def build_batch_request_line(comment: dict, forum: dict | None, model: str) -> dict:
    messages, _ = build_messages(comment, forum)
    return {
        "custom_id": custom_id_from(comment["comment_id"]),
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": messages,
            "response_format": {"type": "json_object"},
            "temperature": 0.0,
        },
    }


def normalize_output_line(
    raw_line: dict,
    comment_lookup: dict,
    run_id: str,
    analyzed_at: str,
    model: str,
    prompt_version: str,
) -> dict:
    """Convert one raw batch output line into a normalized analysis record."""
    comment_id = parse_custom_id(raw_line.get("custom_id", ""))
    comment = comment_lookup.get(comment_id, {})
    base = {
        "run_id": run_id,
        "forum_id": comment.get("forum_id", ""),
        "comment_id": comment_id,
        "analyzed_at": analyzed_at,
        "model": model,
        "prompt_version": prompt_version,
        "input_title": comment.get("title", ""),
        # Measure the stripped text so this flag agrees with build_messages(),
        # which strips the body before deciding whether to truncate it.
        "truncated": len((comment.get("text") or "").strip()) > MAX_COMMENT_CHARS,
    }
    if raw_line.get("error"):
        err = raw_line["error"]
        err_msg = err.get("message", str(err)) if isinstance(err, dict) else str(err)
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None, "error": err_msg}
    response = raw_line.get("response") or {}
    if response.get("status_code") != 200:
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None,
                "error": f"status {response.get('status_code')}"}
    try:
        content = response["body"]["choices"][0]["message"]["content"]
        data = json.loads(content)
        keys = ("stance", "stance_confidence", "stance_rationale", "tone", "tags")
        parsed = {k: data.get(k) for k in keys}
        return {**base, **parsed, "error": None}
    except Exception as exc:
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None, "error": str(exc)}


# ---------------------------------------------------------------------------
# Job directory management


def _next_job_dir(stem: str) -> Path:
    base = stem[:8]
    i = 1
    while (JOBS_DIR / f"{base}-{i}").exists():
        i += 1
    return JOBS_DIR / f"{base}-{i}"


def _latest_job_dir() -> Path:
    if not JOBS_DIR.exists():
        sys.exit("No jobs directory found. Run 'create' first.")
    status_files = list(JOBS_DIR.glob("*/status.json"))
    if not status_files:
        sys.exit(f"No jobs found in {JOBS_DIR}. Run 'create' first.")
    return max(status_files, key=lambda p: p.stat().st_mtime).parent


def _resolve_job_dir(args) -> Path:
    if getattr(args, "dir", None):
        d = Path(args.dir)
        if not d.is_absolute():
            d = JOBS_DIR / d
        if not d.exists():
            sys.exit(f"Job directory not found: {d}")
        return d
    return _latest_job_dir()


def load_status(job_dir: Path) -> dict:
    return json.loads((job_dir / "status.json").read_text(encoding="utf-8"))


def save_status(status: dict, job_dir: Path) -> None:
    (job_dir / "status.json").write_text(
        json.dumps(status, indent=2, ensure_ascii=False), encoding="utf-8"
    )


def _find_next_eligible_job(jobs: list[dict]) -> tuple[dict | None, str | None]:
    """Return (next_pending_job, None), (None, warning_message), or (None, None).

    A job is eligible when it is 'pending' and either it is the first job
    or its predecessor is 'completed'. A warning is returned when the
    predecessor is still running; (None, None) means nothing is eligible.
    """
    for j in jobs:
        if j["status"] != "pending":
            continue
        if j["job_num"] == 1:
            return j, None
        prev = next(p for p in jobs if p["job_num"] == j["job_num"] - 1)
        if prev["status"] == "completed":
            return j, None
        if prev["status"] in ("submitted", "in_progress", "validating", "finalizing"):
            return None, (
                f"Job {prev['job_num']} is '{prev['status']}'. "
                f"Wait for it to complete before submitting job {j['job_num']}."
            )
    return None, None


# ---------------------------------------------------------------------------
# Subcommand: create


def cmd_create(args) -> None:
    report_path = Path(args.report)
    if not report_path.exists():
        sys.exit(f"Report not found: {report_path}")

    report = json.loads(report_path.read_text(encoding="utf-8"))

    if args.model not in report or not isinstance(report[args.model], dict):
        available = [k for k in report if isinstance(report.get(k), dict)]
        sys.exit(f"Model '{args.model}' not in report. Available: {', '.join(available)}")

    prompt_path = Path(report["prompt"])
    if not prompt_path.exists():
        sys.exit(f"Prompt file not found: {prompt_path}")
    _load_prompt(prompt_path)

    input_path = Path(report["input_file"])
    if not input_path.exists():
        sys.exit(f"Input file not found: {input_path}")
    forum, comments = load_items(input_path)
    if not comments:
        sys.exit("No comment items found in input file.")

    chunks = chunk_comments_by_tokens(comments, forum, args.model)

    stem = input_path.stem[:8]
    job_dir = _next_job_dir(stem)
    JOBS_DIR.mkdir(parents=True, exist_ok=True)
    job_dir.mkdir()

    shutil.copy2(input_path, job_dir / "forum.jsonl")
    shutil.copy2(prompt_path, job_dir / "prompt.txt")
    shutil.copy2(report_path, job_dir / "report.json")

    jobs_meta = []
    for i, chunk in enumerate(chunks, start=1):
        req_path = job_dir / f"job{i}-input.jsonl"
        with open(req_path, "w", encoding="utf-8") as f:
            for comment in chunk:
                f.write(json.dumps(build_batch_request_line(comment, forum, args.model),
                                   ensure_ascii=False) + "\n")
        jobs_meta.append({
            "job_num": i,
            "run_id": str(uuid.uuid4()),
|
"status": "pending",
|
||||||
|
"batch_id": None,
|
||||||
|
"records_submitted": len(chunk),
|
||||||
|
"records_completed": None,
|
||||||
|
"records_failed": None,
|
||||||
|
"submitted_at": None,
|
||||||
|
"completed_at": None,
|
||||||
|
})
|
||||||
|
|
||||||
|
model_info = report[args.model]
|
||||||
|
status = {
|
||||||
|
"model": args.model,
|
||||||
|
"prompt_hash": report["prompt_hash"],
|
||||||
|
"input_file": str(input_path),
|
||||||
|
"input_sha256": report["input_sha256"],
|
||||||
|
"total_comments": report["total_comments"],
|
||||||
|
"input_tokens": report["input_tokens"],
|
||||||
|
"est_queue_days": model_info["est_queue_days"],
|
||||||
|
"cost_$": model_info["cost_$"],
|
||||||
|
"total_jobs": len(chunks),
|
||||||
|
"jobs": jobs_meta,
|
||||||
|
}
|
||||||
|
save_status(status, job_dir)
|
||||||
|
|
||||||
|
print(f"Created: {job_dir.name}")
|
||||||
|
print(f" {len(chunks)} job(s) | {len(comments)} comments | model: {args.model}")
|
||||||
|
print("\nNext: python analysis/openai_batch.py submit")
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Subcommand: submit
|
||||||
|
|
||||||
|
|
||||||
|
def cmd_submit(args, client) -> None:
|
||||||
|
job_dir = _resolve_job_dir(args)
|
||||||
|
status = load_status(job_dir)
|
||||||
|
jobs = status["jobs"]
|
||||||
|
|
||||||
|
if args.job:
|
||||||
|
target = next((j for j in jobs if j["job_num"] == args.job), None)
|
||||||
|
if target is None:
|
||||||
|
sys.exit(f"Job {args.job} not found in {job_dir.name}.")
|
||||||
|
if target["status"] != "pending":
|
||||||
|
sys.exit(f"Job {args.job} is already '{target['status']}' — cannot resubmit.")
|
||||||
|
if target["job_num"] > 1:
|
||||||
|
prev = next(p for p in jobs if p["job_num"] == target["job_num"] - 1)
|
||||||
|
if prev["status"] != "completed":
|
||||||
|
sys.exit(
|
||||||
|
f"Cannot submit job {target['job_num']}: "
|
||||||
|
f"job {prev['job_num']} is '{prev['status']}' (must be 'completed')."
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
target, warning = _find_next_eligible_job(jobs)
|
||||||
|
if warning:
|
||||||
|
print(warning, file=sys.stderr)
|
||||||
|
sys.exit(1)
|
||||||
|
if target is None:
|
||||||
|
all_done = all(j["status"] == "completed" for j in jobs)
|
||||||
|
print("All jobs completed." if all_done else "No pending jobs eligible for submission.")
|
||||||
|
return
|
||||||
|
|
||||||
|
n = target["job_num"]
|
||||||
|
req_path = job_dir / f"job{n}-input.jsonl"
|
||||||
|
print(f"Submitting job {n}/{status['total_jobs']} ({target['records_submitted']} comments) ...",
|
||||||
|
file=sys.stderr)
|
||||||
|
|
||||||
|
with open(req_path, "rb") as f:
|
||||||
|
uploaded = client.files.create(file=f, purpose="batch")
|
||||||
|
|
||||||
|
batch = client.batches.create(
|
||||||
|
input_file_id=uploaded.id,
|
||||||
|
endpoint="/v1/chat/completions",
|
||||||
|
completion_window="24h",
|
||||||
|
metadata={"run_id": target["run_id"], "job_dir": job_dir.name},
|
||||||
|
)
|
||||||
|
|
||||||
|
target["status"] = "submitted"
|
||||||
|
target["batch_id"] = batch.id
|
||||||
|
target["submitted_at"] = datetime.now(timezone.utc).isoformat()
|
||||||
|
save_status(status, job_dir)
|
||||||
|
|
||||||
|
print(f"Job {n} submitted: {batch.id} ({batch.status})")
|
||||||
|
print(" python analysis/openai_batch.py status")
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Subcommand: status
|
||||||
|
|
||||||
|
|
||||||
|
def cmd_status(args, client) -> None:
|
||||||
|
job_dir = _resolve_job_dir(args)
|
||||||
|
status = load_status(job_dir)
|
||||||
|
jobs = status["jobs"]
|
||||||
|
|
||||||
|
job_filter = getattr(args, "job", None)
|
||||||
|
|
||||||
|
for job in jobs:
|
||||||
|
if job_filter is not None and job["job_num"] != job_filter:
|
||||||
|
continue
|
||||||
|
if not job["batch_id"]:
|
||||||
|
continue
|
||||||
|
if job["status"] in ("completed", "failed", "expired", "cancelled", "pending"):
|
||||||
|
continue
|
||||||
|
batch = client.batches.retrieve(job["batch_id"])
|
||||||
|
counts = batch.request_counts
|
||||||
|
if batch.status == "completed":
|
||||||
|
job["status"] = "completed"
|
||||||
|
if batch.completed_at:
|
||||||
|
job["completed_at"] = datetime.fromtimestamp(
|
||||||
|
batch.completed_at, tz=timezone.utc
|
||||||
|
).isoformat()
|
||||||
|
elif batch.status in ("failed", "expired", "cancelled"):
|
||||||
|
job["status"] = batch.status
|
||||||
|
else:
|
||||||
|
job["status"] = batch.status
|
||||||
|
job["records_completed"] = counts.completed
|
||||||
|
job["records_failed"] = counts.failed
|
||||||
|
|
||||||
|
save_status(status, job_dir)
|
||||||
|
|
||||||
|
target_jobs = jobs if job_filter is None else [j for j in jobs if j["job_num"] == job_filter]
|
||||||
|
print(f"Dir: {job_dir.name} | Model: {status['model']} | {status['total_jobs']} job(s)")
|
||||||
|
print(f"{'Job':<5} {'Status':<14} {'Records':>12} {'Submitted':<20} {'Completed':<20}")
|
||||||
|
print("-" * 76)
|
||||||
|
for j in target_jobs:
|
||||||
|
rec = (f"{j['records_completed']}/{j['records_submitted']}"
|
||||||
|
if j["records_completed"] is not None else f"-/{j['records_submitted']}")
|
||||||
|
sub = (j["submitted_at"] or "-")[:19]
|
||||||
|
done = (j["completed_at"] or "-")[:19]
|
||||||
|
print(f"{j['job_num']:<5} {j['status']:<14} {rec:>12} {sub:<20} {done:<20}")
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Subcommand: download
|
||||||
|
|
||||||
|
|
||||||
|
def cmd_download(args, client) -> None:
|
||||||
|
job_dir = _resolve_job_dir(args)
|
||||||
|
|
||||||
|
# Refresh status before deciding what to download
|
||||||
|
cmd_status(args, client)
|
||||||
|
status = load_status(job_dir)
|
||||||
|
jobs = status["jobs"]
|
||||||
|
|
||||||
|
job_filter = getattr(args, "job", None)
|
||||||
|
if job_filter is not None:
|
||||||
|
candidates = [j for j in jobs if j["job_num"] == job_filter]
|
||||||
|
else:
|
||||||
|
candidates = [
|
||||||
|
j for j in jobs
|
||||||
|
if j["status"] == "completed"
|
||||||
|
and not (job_dir / f"job{j['job_num']}-output.jsonl").exists()
|
||||||
|
]
|
||||||
|
|
||||||
|
if not candidates:
|
||||||
|
print("No completed jobs pending download.", file=sys.stderr)
|
||||||
|
return
|
||||||
|
|
||||||
|
_, all_comments = load_items(job_dir / "forum.jsonl")
|
||||||
|
comment_lookup = {c["comment_id"]: c for c in all_comments}
|
||||||
|
|
||||||
|
for job in candidates:
|
||||||
|
n = job["job_num"]
|
||||||
|
|
||||||
|
if job["status"] != "completed":
|
||||||
|
print(f"Job {n} not yet completed ('{job['status']}'), skipping.", file=sys.stderr)
|
||||||
|
continue
|
||||||
|
|
||||||
|
batch = client.batches.retrieve(job["batch_id"])
|
||||||
|
|
||||||
|
if not batch.output_file_id:
|
||||||
|
print(f"Job {n}: no output file available from OpenAI.", file=sys.stderr)
|
||||||
|
continue
|
||||||
|
|
||||||
|
raw_text = client.files.content(batch.output_file_id).text
|
||||||
|
raw_path = job_dir / f"job{n}-output-raw.jsonl"
|
||||||
|
raw_path.write_text(raw_text, encoding="utf-8")
|
||||||
|
print(f"Job {n} raw → {raw_path.name}", file=sys.stderr)
|
||||||
|
|
||||||
|
if batch.error_file_id:
|
||||||
|
err_text = client.files.content(batch.error_file_id).text
|
||||||
|
err_path = job_dir / f"job{n}-errors.jsonl"
|
||||||
|
err_path.write_text(err_text, encoding="utf-8")
|
||||||
|
n_err_lines = sum(1 for line in err_text.splitlines() if line.strip())
|
||||||
|
print(f"Job {n} errors → {err_path.name} ({n_err_lines} lines)", file=sys.stderr)
|
||||||
|
|
||||||
|
completed_at = job.get("completed_at") or datetime.now(timezone.utc).isoformat()
|
||||||
|
norm_path = job_dir / f"job{n}-output.jsonl"
|
||||||
|
n_ok = n_err = 0
|
||||||
|
with open(norm_path, "w", encoding="utf-8") as out:
|
||||||
|
for line in raw_text.splitlines():
|
||||||
|
if not line.strip():
|
||||||
|
continue
|
||||||
|
record = normalize_output_line(
|
||||||
|
json.loads(line), comment_lookup,
|
||||||
|
job["run_id"], completed_at,
|
||||||
|
status["model"], status["prompt_hash"],
|
||||||
|
)
|
||||||
|
out.write(json.dumps(record, ensure_ascii=False) + "\n")
|
||||||
|
if record["error"]:
|
||||||
|
n_err += 1
|
||||||
|
else:
|
||||||
|
n_ok += 1
|
||||||
|
|
||||||
|
print(f"Job {n} normalized → {norm_path.name} ({n_ok} ok, {n_err} errors)", file=sys.stderr)
|
||||||
|
job["records_completed"] = n_ok
|
||||||
|
job["records_failed"] = n_err
|
||||||
|
|
||||||
|
save_status(status, job_dir)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# CLI
|
||||||
|
|
||||||
|
|
||||||
|
def _add_common_args(p: argparse.ArgumentParser) -> None:
|
||||||
|
p.add_argument("--job", type=int, default=None, metavar="N",
|
||||||
|
help="Job number within the run (default: auto)")
|
||||||
|
p.add_argument("--dir", default=None, metavar="DIR",
|
||||||
|
help="Job directory name or path (default: most recent)")
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Batch analysis job runner.",
|
||||||
|
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||||
|
epilog=__doc__,
|
||||||
|
)
|
||||||
|
sub = parser.add_subparsers(dest="command", required=True)
|
||||||
|
|
||||||
|
p_create = sub.add_parser("create", help="Create job directory from tokenizer report")
|
||||||
|
p_create.add_argument("report", help="Path to report.json from tokenizer.py")
|
||||||
|
p_create.add_argument("--model", required=True, help="Model (e.g. gpt-4o-mini)")
|
||||||
|
|
||||||
|
p_submit = sub.add_parser("submit", help="Submit next eligible job")
|
||||||
|
_add_common_args(p_submit)
|
||||||
|
|
||||||
|
p_status = sub.add_parser("status", help="Check job status")
|
||||||
|
_add_common_args(p_status)
|
||||||
|
|
||||||
|
p_download = sub.add_parser("download", help="Download and normalize completed jobs")
|
||||||
|
_add_common_args(p_download)
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if args.command == "create":
|
||||||
|
cmd_create(args)
|
||||||
|
return
|
||||||
|
|
||||||
|
api_key = os.environ.get("OPENAI_API_KEY")
|
||||||
|
if not api_key:
|
||||||
|
sys.exit("OPENAI_API_KEY not set. Create a .env file or export the variable.")
|
||||||
|
client = openai.OpenAI(api_key=api_key)
|
||||||
|
|
||||||
|
if args.command == "submit":
|
||||||
|
cmd_submit(args, client)
|
||||||
|
elif args.command == "status":
|
||||||
|
cmd_status(args, client)
|
||||||
|
elif args.command == "download":
|
||||||
|
cmd_download(args, client)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
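The sequential gate used by `submit` (a job may go out only when it is job 1 or its predecessor is `completed`) can be sketched in isolation. This is a minimal illustration, not the committed code: `next_eligible` and `IN_FLIGHT` are hypothetical names, and the "missing predecessor" branch is an added assumption that replaces the bare `next(...)` lookup, which would raise `StopIteration` if `status.json` had a gap in its job numbers.

```python
from typing import Optional

# Statuses that mean the predecessor is still running (same tuple as in the diff).
IN_FLIGHT = ("submitted", "in_progress", "validating", "finalizing")


def next_eligible(jobs: list[dict]) -> tuple[Optional[dict], Optional[str]]:
    """Return (job, None) when a job may be submitted, else (None, reason or None)."""
    by_num = {j["job_num"]: j for j in jobs}
    for job in jobs:
        if job["status"] != "pending":
            continue
        if job["job_num"] == 1:
            return job, None
        # Assumption: report a gap instead of letting next() raise StopIteration.
        prev = by_num.get(job["job_num"] - 1)
        if prev is None:
            return None, f"Job {job['job_num'] - 1} is missing from status.json."
        if prev["status"] == "completed":
            return job, None
        if prev["status"] in IN_FLIGHT:
            return None, f"Job {prev['job_num']} is '{prev['status']}'."
        # Predecessor failed/expired/cancelled: keep scanning, as the original does.
    return None, None


# Job 3 is pending, but job 2 is still running, so nothing is eligible yet.
jobs = [
    {"job_num": 1, "status": "completed"},
    {"job_num": 2, "status": "in_progress"},
    {"job_num": 3, "status": "pending"},
]
job, warning = next_eligible(jobs)
```

Building the `by_num` dictionary once keeps the predecessor lookup constant-time and makes the gap case explicit; whether a gap should be a warning or a hard error is a design choice the diff does not settle.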
@@ -1,12 +1,12 @@
|
|||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
"""
|
"""
|
||||||
analysis/gpt4o/analysis-realtime.py — Synchronous GPT-4o pipeline for VA Townhall comments.
|
analysis/openai_realtime.py — Synchronous GPT-4o pipeline for VA Townhall comments.
|
||||||
|
|
||||||
Usage:
|
Usage:
|
||||||
python analysis/gpt4o/analysis-realtime.py <input_jsonl> [--limit {5,10,20,50}] [--model MODEL]
|
python analysis/openai_realtime.py <input_jsonl> [--limit {5,10,20,50}] [--model MODEL]
|
||||||
|
|
||||||
Output:
|
Output:
|
||||||
analysis/gpt4o/forum{id}_{scrape_ts}_{model}_{run_ts}.jsonl
|
analysis/forum{id}_{scrape_ts}_{model}_{run_ts}.jsonl
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import argparse
|
import argparse
|
||||||
@@ -30,7 +30,7 @@ except ImportError:
|
|||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Prompt — loaded from analysis/prompt-1.txt at import time
|
# Prompt — loaded from analysis/prompt-1.txt at import time
|
||||||
|
|
||||||
_PROMPT_FILE = Path(__file__).parent.parent / "prompt-1.txt"
|
_PROMPT_FILE = Path(__file__).parent / "prompt-1.txt"
|
||||||
SYSTEM_PROMPT = _PROMPT_FILE.read_text(encoding="utf-8").strip()
|
SYSTEM_PROMPT = _PROMPT_FILE.read_text(encoding="utf-8").strip()
|
||||||
PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
|
PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
|
||||||
|
|
||||||
@@ -1,6 +1,4 @@
|
|||||||
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall regulatory comment system. You will be given the text of a proposed regulation and a single public comment. Return ONLY a JSON object — no other text.
|
||||||
regulatory comment system. You will be given the text of a proposed regulation and a single
|
|
||||||
public comment. Return ONLY a JSON object — no other text.
|
|
||||||
|
|
||||||
Definitions:
|
Definitions:
|
||||||
- stance: the commenter's position on whether the regulation should be adopted.
|
- stance: the commenter's position on whether the regulation should be adopted.
|
||||||
@@ -16,8 +14,6 @@ Definitions:
|
|||||||
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||||
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||||
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||||
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ inclusion", "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||||
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
|
||||||
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
|
||||||
|
|
||||||
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||||
|
|||||||
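Because the prompt asks for "exactly these keys", a downstream check on each reply is easy to sketch. The function below is a hypothetical helper, not part of this change set: the name `validate_response` is invented for illustration, it only checks what the prompt text shown here states (the five keys, a 0.0-1.0 confidence, at most five tags), and it deliberately does not check `stance` or `tone` labels because the full label lists are not visible in this hunk.

```python
import json

# The five keys the prompt says must be returned.
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw: str) -> list[str]:
    """Return a list of problems found in one model reply (empty list means OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        missing = sorted(REQUIRED_KEYS - set(data))
        extra = sorted(set(data) - REQUIRED_KEYS)
        problems.append(f"key mismatch: missing={missing} extra={extra}")

    conf = data.get("stance_confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("stance_confidence must be a number in [0.0, 1.0]")

    tags = data.get("tags")
    if not isinstance(tags, list) or len(tags) > 5 or not all(isinstance(t, str) for t in tags):
        problems.append("tags must be a list of at most 5 strings")
    return problems


# "support" is a placeholder stance label used only for this example.
good = json.dumps({
    "stance": "support",
    "stance_confidence": 0.9,
    "stance_rationale": "The commenter states clear approval.",
    "tone": "unclear",
    "tags": ["privacy"],
})
problems = validate_response(good)
```

Returning a list of problems rather than raising lets a caller record every defect of a reply in one pass, which suits a batch run where failed records are counted rather than aborted on.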
190
analysis/tokenizer.py
Normal file
@@ -0,0 +1,190 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
tokenizer.py — estimate token usage and cost for a batch analysis run.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python analysis/tokenizer.py output/f452.jsonl [--prompt analysis/prompt-1.txt]
|
||||||
|
python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl # count actual tokens in a job
|
||||||
|
|
||||||
|
Prints a per-model comparison table and writes reports/<stem>-report.json.
|
||||||
|
Run this before openai_batch.py create.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import hashlib
|
||||||
|
import json
|
||||||
|
import math
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent))
|
||||||
|
import openai_batch as _ab
|
||||||
|
|
||||||
|
# Input pricing ($/1M tokens, batch API) — from docs/openai.md, updated 2026-05-05.
|
||||||
|
# Add Anthropic/other models here when needed; only models with a LIMITS entry are reported.
|
||||||
|
MODEL_PRICING: dict[str, float] = {
|
||||||
|
"gpt-5.5": 2.50,
|
||||||
|
"gpt-5.4": 1.25,
|
||||||
|
"gpt-5.4-mini": 0.375,
|
||||||
|
"gpt-5.4-nano": 0.10,
|
||||||
|
"gpt-4o": 1.25,
|
||||||
|
"gpt-4o-mini": 0.075,
|
||||||
|
"gpt-o4-mini": 0.55,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def compute_report(
|
||||||
|
comments: list[dict],
|
||||||
|
forum: dict | None,
|
||||||
|
prompt_hash: str,
|
||||||
|
input_file: str,
|
||||||
|
input_sha256: str,
|
||||||
|
prompt_file: str,
|
||||||
|
) -> dict:
|
||||||
|
"""Compute token estimate and per-model job/cost/time breakdown."""
|
||||||
|
# Use gpt-4o encoding as the canonical estimator (same for all current models)
|
||||||
|
total_tokens = sum(
|
||||||
|
_ab.estimate_tokens(_ab.build_messages(c, forum)[0], "gpt-4o")
|
||||||
|
for c in comments
|
||||||
|
)
|
||||||
|
|
||||||
|
report: dict = {
|
||||||
|
"prompt": prompt_file,
|
||||||
|
"prompt_hash": prompt_hash,
|
||||||
|
"input_file": input_file,
|
||||||
|
"input_sha256": input_sha256,
|
||||||
|
"total_comments": len(comments),
|
||||||
|
"input_tokens": total_tokens,
|
||||||
|
}
|
||||||
|
|
||||||
|
for model, tpd in _ab.MODEL_LIMITS.items():
|
||||||
|
effective_tpd = int(tpd * _ab._LIMIT_BUFFER)
|
||||||
|
jobs = math.ceil(total_tokens / effective_tpd)
|
||||||
|
cost = round(total_tokens / 1_000_000 * MODEL_PRICING.get(model, 0.0), 4)
|
||||||
|
est_days = round(total_tokens / tpd, 2)
|
||||||
|
report[model] = {"jobs": jobs, "cost_$": cost, "est_queue_days": est_days}
|
||||||
|
|
||||||
|
return report
|
||||||
|
|
||||||
|
|
||||||
|
def count_input_tokens(path: Path, model: str = "gpt-4o") -> dict:
|
||||||
|
"""Count tokens in an existing job input JSONL (batch request format).
|
||||||
|
|
||||||
|
Each line must have body.messages (as written by build_batch_request_line).
|
||||||
|
Returns {"total_tokens": int, "total_requests": int, "min": int, "max": int, "mean": float}.
|
||||||
|
"""
|
||||||
|
counts = []
|
||||||
|
with open(path, encoding="utf-8") as f:
|
||||||
|
for line in f:
|
||||||
|
line = line.strip()
|
||||||
|
if not line:
|
||||||
|
continue
|
||||||
|
req = json.loads(line)
|
||||||
|
messages = req["body"]["messages"]
|
||||||
|
counts.append(_ab.estimate_tokens(messages, model))
|
||||||
|
if not counts:
|
||||||
|
return {"total_tokens": 0, "total_requests": 0, "min": 0, "max": 0, "mean": 0.0}
|
||||||
|
return {
|
||||||
|
"total_tokens": sum(counts),
|
||||||
|
"total_requests": len(counts),
|
||||||
|
"min": min(counts),
|
||||||
|
"max": max(counts),
|
||||||
|
"mean": round(sum(counts) / len(counts), 1),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def print_table(report: dict) -> None:
|
||||||
|
"""Print a human-readable model comparison table to stdout."""
|
||||||
|
print(f"\nInput: {report['input_file']}")
|
||||||
|
print(f"Comments: {report['total_comments']:,}")
|
||||||
|
print(f"Tokens: {report['input_tokens']:,}")
|
||||||
|
print(f"Prompt: {report['prompt']} (hash: {report['prompt_hash']})")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Cheapest model that fits in one job
|
||||||
|
single_job_models = [m for m in _ab.MODEL_LIMITS if report.get(m, {}).get("jobs") == 1]
|
||||||
|
best = (min(single_job_models, key=lambda m: report[m]["cost_$"])
|
||||||
|
if single_job_models else None)
|
||||||
|
|
||||||
|
print(f"{'Model':<15} {'Jobs':>5} {'Cost ($)':>9} {'Est days':>9} {'Note'}")
|
||||||
|
print("-" * 62)
|
||||||
|
for model in _ab.MODEL_LIMITS:
|
||||||
|
if model not in report or not isinstance(report[model], dict):
|
||||||
|
continue
|
||||||
|
m = report[model]
|
||||||
|
note = "<-- recommended" if model == best else ""
|
||||||
|
print(f"{model:<15} {m['jobs']:>5} {m['cost_$']:>9.4f} {m['est_queue_days']:>9.2f} {note}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
|
||||||
|
def _is_job_input(path: Path) -> bool:
|
||||||
|
"""Return True if this JSONL looks like a batch request file (has custom_id)."""
|
||||||
|
with open(path, encoding="utf-8") as f:
|
||||||
|
for line in f:
|
||||||
|
line = line.strip()
|
||||||
|
if line:
|
||||||
|
return "custom_id" in json.loads(line)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
_default_prompt = Path(__file__).parent / "prompt-1.txt"
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(description="Estimate batch token usage and cost.")
|
||||||
|
parser.add_argument("input", help="Scraped JSONL or job input JSONL (jobN-input.jsonl)")
|
||||||
|
parser.add_argument(
|
||||||
|
"--prompt",
|
||||||
|
default=str(_default_prompt),
|
||||||
|
help=f"System prompt file (default: {_default_prompt.name})",
|
||||||
|
)
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
input_path = Path(args.input)
|
||||||
|
if not input_path.exists():
|
||||||
|
sys.exit(f"File not found: {input_path}")
|
||||||
|
|
||||||
|
# --- Mode: count tokens in an existing job input file ---
|
||||||
|
if _is_job_input(input_path):
|
||||||
|
result = count_input_tokens(input_path)
|
||||||
|
print(f"\nJob input: {input_path.name}")
|
||||||
|
print(f" Requests : {result['total_requests']:,}")
|
||||||
|
print(f" Tokens : {result['total_tokens']:,}")
|
||||||
|
print(f" Per-req : min={result['min']} max={result['max']} mean={result['mean']}")
|
||||||
|
return
|
||||||
|
|
||||||
|
# --- Mode: estimate from raw scrape file and write report.json ---
|
||||||
|
prompt_path = Path(args.prompt)
|
||||||
|
if not prompt_path.exists():
|
||||||
|
sys.exit(f"Prompt file not found: {prompt_path}")
|
||||||
|
|
||||||
|
prompt_text = prompt_path.read_text(encoding="utf-8").strip()
|
||||||
|
prompt_hash = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:7]
|
||||||
|
|
||||||
|
# Ensure build_messages uses the specified prompt
|
||||||
|
_ab._load_prompt(prompt_path)
|
||||||
|
|
||||||
|
forum, comments = _ab.load_items(input_path)
|
||||||
|
if not comments:
|
||||||
|
sys.exit("No comment items found.")
|
||||||
|
if forum is None:
|
||||||
|
print("Warning: no ForumItem — token estimates may be slightly low.", file=sys.stderr)
|
||||||
|
|
||||||
|
input_sha256 = hashlib.sha256(input_path.read_bytes()).hexdigest()
|
||||||
|
|
||||||
|
report = compute_report(
|
||||||
|
comments, forum, prompt_hash,
|
||||||
|
str(input_path), input_sha256, str(prompt_path),
|
||||||
|
)
|
||||||
|
|
||||||
|
print_table(report)
|
||||||
|
|
||||||
|
reports_dir = Path(__file__).parent.parent / "reports"
|
||||||
|
reports_dir.mkdir(exist_ok=True)
|
||||||
|
out_path = reports_dir / f"{input_path.stem}-report.json"
|
||||||
|
out_path.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
|
||||||
|
print(f"Report written to: {out_path}")
|
||||||
|
print(f"\nNext: python analysis/openai_batch.py create {out_path} --model <model>")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
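The per-model arithmetic in `compute_report` can be checked by hand with a small worked example. This is a sketch under stated assumptions: the `estimate` function is a hypothetical stand-in for one iteration of the loop, and the token limit, the price and the `0.9` buffer are made-up inputs, since the real `MODEL_LIMITS` and `_LIMIT_BUFFER` values live in `openai_batch.py` and are not shown in this diff.

```python
import math


def estimate(total_tokens: int, tokens_per_day: int, price_per_million: float,
             limit_buffer: float = 0.9) -> dict:
    """Mirror the jobs/cost/days arithmetic of compute_report for one model."""
    # Jobs are sized against the buffered limit; queue days use the raw limit.
    effective_tpd = int(tokens_per_day * limit_buffer)
    return {
        "jobs": math.ceil(total_tokens / effective_tpd),
        "cost_$": round(total_tokens / 1_000_000 * price_per_million, 4),
        "est_queue_days": round(total_tokens / tokens_per_day, 2),
    }


# Hypothetical inputs: 5M tokens, a 2M tokens/day limit, $0.075 per 1M tokens.
result = estimate(5_000_000, 2_000_000, 0.075)
print(result)
```

With these inputs the buffered limit is 1,800,000 tokens, so the run needs three jobs even though the unbuffered queue estimate is 2.5 days. The two figures use different denominators, so they are not expected to agree.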
BIN
docs/excel-snapshot.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 32 KiB |
File diff suppressed because one or more lines are too long
|
Before Width: | Height: | Size: 130 KiB |
@@ -1,9 +1,18 @@
|
|||||||
<mxfile host="app.diagrams.net">
|
<mxfile host="app.diagrams.net">
|
||||||
<diagram name="Page-1" id="0sW-Vs8X5usvYmJikUIv">
|
<diagram name="Page-1" id="0sW-Vs8X5usvYmJikUIv">
|
||||||
<mxGraphModel dx="2179" dy="1118" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
|
<mxGraphModel dx="1315" dy="798" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
|
||||||
<root>
|
<root>
|
||||||
<mxCell id="0" />
|
<mxCell id="0" />
|
||||||
<mxCell id="1" parent="0" />
|
<mxCell id="1" parent="0" />
|
||||||
|
<mxCell id="mENAtx_syaeSO5uR6kG6-61" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
|
||||||
|
<mxGeometry height="90" width="190" x="1000" y="330" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="mENAtx_syaeSO5uR6kG6-60" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
|
||||||
|
<mxGeometry height="90" width="190" x="1010" y="340" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="mENAtx_syaeSO5uR6kG6-59" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
|
||||||
|
<mxGeometry height="90" width="190" x="1020" y="350" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-3" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-1" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=16.5;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-29">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-3" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-1" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=16.5;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-29">
|
||||||
<mxGeometry relative="1" as="geometry">
|
<mxGeometry relative="1" as="geometry">
|
||||||
<mxPoint x="200" y="290" as="targetPoint" />
|
<mxPoint x="200" y="290" as="targetPoint" />
|
||||||
@@ -18,18 +27,18 @@
|
|||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-5" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="tokenizer" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-5" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="tokenizer" vertex="1">
|
||||||
<mxGeometry height="60" width="120" x="400" y="170" as="geometry" />
|
<mxGeometry height="60" width="120" x="400" y="170" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-6" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=center;verticalAlign=top;rounded=0;" value="gather forum data" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-6" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="<div align="left">- collect forum data</div>" vertex="1">
|
||||||
<mxGeometry height="60" width="120" x="20" y="240" as="geometry" />
|
<mxGeometry height="60" width="120" x="40" y="240" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-7" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="<div>tokenize forum,</div><div>generate report w/</div><div>recommendations</div>" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-7" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="<div>- tokenize forum</div><div>- generate report w/</div><div>recommendations</div>" vertex="1">
|
||||||
<mxGeometry height="60" width="120" x="400" y="240" as="geometry" />
|
<mxGeometry height="60" width="120" x="400" y="240" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-28" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-35">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-28" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-73">
|
||||||
<mxGeometry relative="1" as="geometry">
|
<mxGeometry relative="1" as="geometry">
|
||||||
<mxPoint x="910" y="270" as="targetPoint" />
|
<mxPoint x="953" y="240" as="targetPoint" />
|
||||||
</mxGeometry>
|
</mxGeometry>
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-19" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="batch" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-19" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="openai_batch" vertex="1">
|
||||||
<mxGeometry height="60" width="120" x="720" y="170" as="geometry" />
|
<mxGeometry height="60" width="120" x="720" y="170" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-21" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="<div>--model</div><div>--limit</div>" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-21" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="<div>--model</div><div>--limit</div>" vertex="1">
|
||||||
@@ -38,11 +47,8 @@
|
|||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-23" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="--forum" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-23" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="--forum" vertex="1">
|
||||||
<mxGeometry height="60" width="120" x="-90" y="170" as="geometry" />
|
<mxGeometry height="60" width="120" x="-90" y="170" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-25" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="--prompt" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-26" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="<div>- split job into batches</div><div>- submit first batch</div><div>- status of current batch</div><div>- download batch artifacts</div>" vertex="1">
|
||||||
<mxGeometry height="60" width="120" x="270" y="210" as="geometry" />
|
<mxGeometry height="70" width="140" x="720" y="240" as="geometry" />
|
||||||
</mxCell>
|
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-26" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="<div>split job into batches</div><div>submit first batch</div><div>status of current batch</div><div>download batch artifacts</div>" vertex="1">
|
|
||||||
<mxGeometry height="60" width="120" x="720" y="240" as="geometry" />
|
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-29" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-29" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
|
||||||
<mxGeometry height="70" width="50" x="210" y="240" as="geometry" />
|
<mxGeometry height="70" width="50" x="210" y="240" as="geometry" />
|
||||||
@@ -58,7 +64,7 @@
|
|||||||
</Array>
|
</Array>
|
||||||
</mxGeometry>
|
</mxGeometry>
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-31" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>forum</div><div>.jsonl</div>" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-31" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>&lt;forumid&gt;</div><div>.jsonl</div>" vertex="1">
|
||||||
<mxGeometry height="70" width="50" x="230" y="260" as="geometry" />
|
<mxGeometry height="70" width="50" x="230" y="260" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-47" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-34" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0;exitDx=50;exitDy=43.5;exitPerimeter=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-19">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-47" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-34" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0;exitDx=50;exitDy=43.5;exitPerimeter=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-19">
|
||||||
@@ -69,30 +75,42 @@
|
|||||||
</Array>
|
</Array>
|
||||||
</mxGeometry>
|
</mxGeometry>
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-34" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>report</div><div>.json</div>" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-34" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div><br></div><div>&lt;forumid&gt;<br>-report</div><div>.json</div>" vertex="1">
|
||||||
<mxGeometry height="70" width="50" x="560" y="240" as="geometry" />
|
<mxGeometry height="70" width="50" x="560" y="240" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-35" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="job.json" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-35" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>status</div><div>.json</div>" vertex="1">
|
||||||
<mxGeometry height="70" width="50" x="890" y="240" as="geometry" />
|
<mxGeometry height="70" width="50" x="913.25" y="360" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-41" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-43" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>jobN-</div><div>output</div><div>.jsonl</div>" vertex="1">
|
||||||
<mxGeometry height="70" width="50" x="940" y="340" as="geometry" />
|
<mxGeometry height="70" width="50" x="1090" y="360" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-42" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-48" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>jobN-errors</div><div>.jsonl</div>" vertex="1">
|
||||||
<mxGeometry height="70" width="50" x="950" y="350" as="geometry" />
|
<mxGeometry height="70" width="50" x="1150" y="360" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-43" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>batchN-</div><div>output-</div><div>.jsonl</div>" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-54" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>jobN-</div><div>input</div><div>.jsonl</div>" vertex="1">
|
||||||
<mxGeometry height="70" width="50" x="960" y="360" as="geometry" />
|
<mxGeometry height="70" width="50" x="1030" y="360" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-48" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>errors</div><div>.jsonl</div>" vertex="1">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-64" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-63" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0;exitDx=50;exitDy=43.5;exitPerimeter=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-5">
|
||||||
<mxGeometry height="70" width="50" x="980" y="240" as="geometry" />
|
|
||||||
</mxCell>
|
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-51" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=16.5;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-41">
|
|
||||||
<mxGeometry relative="1" as="geometry" />
|
<mxGeometry relative="1" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
<mxCell id="mENAtx_syaeSO5uR6kG6-53" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=16.5;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-48">
|
<mxCell id="mENAtx_syaeSO5uR6kG6-63" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="<div>prompt</div><div>.txt</div>" vertex="1">
|
||||||
|
<mxGeometry height="70" width="50" x="270" y="90" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="mENAtx_syaeSO5uR6kG6-67" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="create" vertex="1">
|
||||||
|
<mxGeometry height="20" width="120" x="850" y="170" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="mENAtx_syaeSO5uR6kG6-71" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="<div>submit</div><div><br></div><div>status</div><div>download</div>" vertex="1">
|
||||||
|
<mxGeometry height="60" width="120" x="1020" y="240" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="mENAtx_syaeSO5uR6kG6-75" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-73" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" target="mENAtx_syaeSO5uR6kG6-35">
|
||||||
<mxGeometry relative="1" as="geometry" />
|
<mxGeometry relative="1" as="geometry" />
|
||||||
</mxCell>
|
</mxCell>
|
||||||
|
<mxCell id="mENAtx_syaeSO5uR6kG6-76" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-73" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-61">
|
||||||
|
<mxGeometry relative="1" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
|
<mxCell id="mENAtx_syaeSO5uR6kG6-73" parent="1" style="image;aspect=fixed;perimeter=ellipsePerimeter;html=1;align=center;shadow=0;dashed=0;spacingTop=3;image=img/lib/active_directory/folder.svg;" value="&lt;forumid&gt;-N" vertex="1">
|
||||||
|
<mxGeometry height="50" width="36.5" x="920" y="240" as="geometry" />
|
||||||
|
</mxCell>
|
||||||
</root>
|
</root>
|
||||||
</mxGraphModel>
|
</mxGraphModel>
|
||||||
</diagram>
|
</diagram>
|
||||||
|
|||||||
4
docs/pipeline-v1.2.3.svg
Normal file
File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 170 KiB |
BIN
docs/streamlit-snapshot.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 30 KiB |
155
docs/tasks.org
@@ -158,7 +158,7 @@ forum_id_input,comment_id,title,text,date,author,stance,stance_confidence,stance
|
|||||||
- tests: 23 passing (pytest tests/analysis_gpt4o_batch.py), 51 total across suite
|
- tests: 23 passing (pytest tests/analysis_gpt4o_batch.py), 51 total across suite
|
||||||
- datetime: [2026-05-06 Wed 08:55]
|
- datetime: [2026-05-06 Wed 08:55]
|
||||||
|
|
||||||
* [ ] t1.2.3: batch job refactor
|
* [X] t1.2.3: batch job refactor
|
||||||
This task encompasses intent and fixes for 1.2.1 and 1.2.2.
|
This task encompasses intent and fixes for 1.2.1 and 1.2.2.
|
||||||
batch processing should be a resumable job queue, not a one-shot script. the user should not need to remember offsets, completed chunks, failed batches, or which comments remain.
|
batch processing should be a resumable job queue, not a one-shot script. the user should not need to remember offsets, completed chunks, failed batches, or which comments remain.
|
||||||
** Acceptance Criteria
|
** Acceptance Criteria
|
||||||
@@ -200,9 +200,53 @@ batch processing should be a resumable job queue, not a one-shot script. the us
|
|||||||
- resume from status.json
|
- resume from status.json
|
||||||
- remaining-comment detection
|
- remaining-comment detection
|
||||||
|
|
||||||
* === Backlog ===
|
** notes
|
||||||
* [ ] X: analysis validation view
|
- analysis/tokenizer.py: new standalone script; imports openai_batch for MODEL_LIMITS, estimate_tokens, build_messages. Reads input JSONL + prompt, computes per-model jobs/cost/time table, writes reports/<stem>-report.json. MODEL_PRICING dict lives here (not in openai_batch). Pass a jobN-input.jsonl to count actual tokens instead.
|
||||||
|
- analysis/openai_batch.py: fully rewritten with four subcommands: create, submit, status, download. Job dirs at analysis/jobs/<stem[:8]>-N/.
|
||||||
|
- Job directories: analysis/jobs/<stem[:8]>-N/ (e.g. f452-1). Each run is self-contained: forum.jsonl, prompt.txt, report.json, jobN-input.jsonl, jobN-output-raw.jsonl, jobN-output.jsonl, jobN-errors.jsonl.
|
||||||
|
- status.json: tracks all jobs with pending/submitted/in_progress/completed/failed states. Updated by submit, status, download.
|
||||||
|
- _find_next_eligible_job: pure function for testability. Returns (next_pending_job, None) or (None, warning). Blocks submission if previous job is in_progress/submitted.
|
||||||
|
- create: no API key required. Reads report.json, re-chunks comments, writes all jobN-input.jsonl files, writes status.json.
|
||||||
|
- submit: uploads jobN-input.jsonl to Files API, creates batch, updates status.json to 'submitted'. Will not stack batches.
|
||||||
|
- status: retrieves batch from OpenAI, updates status.json counts and status.
|
||||||
|
- download: auto-runs status first, downloads output_file_id → jobN-output-raw.jsonl, error_file_id → jobN-errors.jsonl, normalizes → jobN-output.jsonl. Updates status.json.
|
||||||
|
- tests/tokenizer.py: 19 tests for compute_report schema, cost/time calculation, MODEL_PRICING coverage, print_table output, count_input_tokens, report.json round-trip.
|
||||||
|
- Token limit buffer: _LIMIT_BUFFER=0.80 (20% headroom). Estimate uses OpenAI cookbook chat formula (role tokens + 3-token reply primer). Verify a job file with: python analysis/tokenizer.py analysis/jobs/<dir>/jobN-input.jsonl
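The gating rule described above for `_find_next_eligible_job` can be sketched as a pure function. This is only an illustration under assumptions: the real function body is not shown here, and the `id` and `status` keys, the function name without the underscore, and the warning wording are all hypothetical. This sketch blocks when any job is in flight, which is a slightly stricter reading of "previous job".

```python
from typing import Optional

# Statuses that mean a batch is still in flight (taken from the status.json states above).
BLOCKING = {"submitted", "in_progress"}


def find_next_eligible_job(jobs: list[dict]) -> tuple[Optional[dict], Optional[str]]:
    """Return (next_pending_job, None) or (None, warning).

    `jobs` is assumed to be the ordered job list from status.json, where each
    entry carries hypothetical "id" and "status" keys.
    """
    # Refuse to stack batches: any in-flight job blocks the next submission.
    for job in jobs:
        if job["status"] in BLOCKING:
            return None, f"job {job['id']} is {job['status']}; wait before submitting"
    # Otherwise hand back the first job that has not been submitted yet.
    for job in jobs:
        if job["status"] == "pending":
            return job, None
    return None, "no pending jobs remain"
```

Because the function takes plain data and touches neither the filesystem nor the API, it can be tested with a hand-written job list.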
|
||||||
|
|
||||||
|
*** usage
|
||||||
|
#+begin_src powershell
|
||||||
|
# 1. estimate tokens and cost
|
||||||
|
python analysis/tokenizer.py output/f452.jsonl --prompt analysis/prompt-1.txt
|
||||||
|
# writes reports/f452-report.json
|
||||||
|
|
||||||
|
# 2. verify actual tokens in a job file (optional sanity check)
|
||||||
|
python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl
|
||||||
|
|
||||||
|
# 3. create job directory (no api key needed)
|
||||||
|
python analysis/openai_batch.py create reports/f452-report.json --model gpt-5.4-mini
|
||||||
|
# creates analysis/jobs/f452-1/
|
||||||
|
|
||||||
|
# 4. submit first job
|
||||||
|
python analysis/openai_batch.py submit
|
||||||
|
|
||||||
|
# 5. check status (repeat until completed)
|
||||||
|
python analysis/openai_batch.py status
|
||||||
|
|
||||||
|
# 6. download and normalize
|
||||||
|
python analysis/openai_batch.py download
|
||||||
|
|
||||||
|
# 7. submit next job (if multi-job run), then repeat 5-6
|
||||||
|
python analysis/openai_batch.py submit
|
||||||
|
#+end_src
|
||||||
|
|
||||||
|
** evidence
|
||||||
|
- commit:
|
||||||
|
- tests: passing (pytest tests/openai_batch.py tests/openai_realtime.py tests/tokenizer.py)
|
||||||
|
- datetime: [2026-05-06 Wed]
|
||||||
|
|
||||||
|
* [X] t1.3: cleanup model output and rejoin
|
||||||
create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv.
|
create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv.
|
||||||
|
review create_csv for the simple approach - keep this regardless
|
||||||
|
|
||||||
** acceptance criteria
|
** acceptance criteria
|
||||||
1. input raw scrape jsonl and all *-output.jsonl files in a dir
|
1. input raw scrape jsonl and all *-output.jsonl files in a dir
|
||||||
@@ -211,7 +255,8 @@ create a lightweight validation script that joins raw comments to normalized ana
|
|||||||
- forum_id, comment_id, title, text, date, author
|
- forum_id, comment_id, title, text, date, author
|
||||||
- stance, stance_confidence, stance_rationale, tone, tags
|
- stance, stance_confidence, stance_rationale, tone, tags
|
||||||
- error, truncated, analyzed_at, prompt_version, model
|
- error, truncated, analyzed_at, prompt_version, model
|
||||||
4. print validation counts
|
4. output parquet?
|
||||||
|
5. print validation counts
|
||||||
- raw comments
|
- raw comments
|
||||||
- analyzed records
|
- analyzed records
|
||||||
- joined records
|
- joined records
|
||||||
@@ -220,16 +265,30 @@ create a lightweight validation script that joins raw comments to normalized ana
|
|||||||
- error records
|
- error records
|
||||||
- stance counts
|
- stance counts
|
||||||
- tone counts
|
- tone counts
|
||||||
5. tests cover join behavior and missing/duplicate ids
|
6. tests cover join behavior and missing/duplicate ids
|
||||||
|
|
||||||
|
** notes
|
||||||
|
- analysis/create_csv.py: reads raw scrape JSONL + all job*-output.jsonl in a job dir (skips *-output-raw.jsonl); left-joins on comment_id; writes review.csv (UTF-8 BOM for Excel); optional --parquet.
|
||||||
|
- Uses pd.read_json(path, lines=True) — no manual JSON parsing.
|
||||||
|
- Prints summary counts: raw/analyzed/joined/unanalyzed/errors/duplicate IDs, stance distribution, tone distribution.
|
||||||
|
|
||||||
|
*** usage
|
||||||
|
#+begin_src sh
|
||||||
|
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/
|
||||||
|
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
|
||||||
|
# output: analysis/jobs/f452-1/review.csv (and optionally review.parquet)
|
||||||
|
#+end_src
|
||||||
|
|
||||||
** evidence
|
** evidence
|
||||||
- commit:
|
- commit: 28d6d22
|
||||||
- tests:
|
- tests: passing (pytest tests/create_csv.py tests/encoding.py)
|
||||||
- csv:
|
- csv: analysis/jobs/f452-1/review.csv
|
||||||
- datetime:
|
- datetime: [2026-05-07 Thu 17:23]
|
||||||
* [ ] X: text encoding cleanup
|
|
||||||
|
* [X] t1.1.1: text encoding cleanup
|
||||||
fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as ’.
|
fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as ’.
|
||||||
|
|
||||||
|
|
||||||
** acceptance criteria
|
** acceptance criteria
|
||||||
1. identify whether mojibake exists in raw scrape, analysis output, or csv export only
|
1. identify whether mojibake exists in raw scrape, analysis output, or csv export only
|
||||||
2. add repair step at the earliest correct layer
|
2. add repair step at the earliest correct layer
|
||||||
@@ -242,14 +301,82 @@ fix mojibake in scraped text before analysis/reporting, especially curly quotes
|
|||||||
- —
|
- —
|
||||||
5. document whether repaired text is used for model input
|
5. document whether repaired text is used for model input
|
||||||
|
|
||||||
|
** notes
|
||||||
|
- Diagnosis: f452.jsonl raw data is CLEAN — proper Unicode throughout (U+2019, U+201C, etc.). The DEFAULT_RESPONSE_ENCODING=utf-8 spider setting is working for this site. No mojibake or FFFD chars found.
|
||||||
|
- The encoding issue would surface for forums whose server sends cp1252 bytes (0x91-0x97 range) embedded in otherwise UTF-8 content. FFFD replacement chars appear when the UTF-8 decoder hits those bytes. Once the byte is replaced by FFFD, the original character cannot be recovered.
|
||||||
|
- Repair layer: analysis/encoding.py applied in analysis/validate.py at reporting time. Raw scrape JSONL is never modified (AC3).
|
||||||
|
- Model input: repair_text() is NOT applied in build_messages() for this dataset since raw data is clean. Can be added if a future forum produces dirty text.
|
||||||
|
- Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.
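For reference, the classic mojibake pattern (UTF-8 bytes mistakenly decoded as cp1252) can be reversed with a round trip through the standard codecs. The sketch below is not the contents of `analysis/encoding.py`; `repair_mojibake` is a hypothetical name, and a library such as ftfy handles many more cases. It also shows why text that already contains U+FFFD is left alone: the original byte is gone.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as cp1252.

    Hypothetical helper for illustration only. It repairs a string only when
    the whole string round-trips cleanly; otherwise it returns the input.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Either the text was already clean, or the damage is irreversible
        # (for example U+FFFD replacement characters), so leave it untouched.
        return text
```

Clean text containing a real U+2019 fails the UTF-8 decode step and is returned unchanged, so applying the function to an already clean dataset such as f452.jsonl should be a no-op.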
|
||||||
|
|
||||||
** evidence
|
** evidence
|
||||||
- commit:
|
- commit: 1ea696d
|
||||||
- tests:
|
- tests: passing (pytest tests/encoding.py)
|
||||||
- before/after sample:
|
- before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
|
||||||
- datetime:
|
- datetime: [2026-05-07 Thu 17:00]
|
||||||
|
|
||||||
|
* [X] t1.4: graph data prototype
|
||||||
|
create ./viz/prototype_charts.py generating individual plotly charts for exploring graphs to embed into streamlit or dash later
|
||||||
|
|
||||||
|
** acceptance criteria
|
||||||
|
2. create graph for Stance/Share
|
||||||
|
- stacked h-bar with % support/oppose/neutral/unknown + raw totals, eg 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
|
||||||
|
- later, consider centered diverging h-bar: oppose ← | neutral/unknown | → support
|
||||||
|
3. create graph for Stance/Time:
|
||||||
|
- cumulative support/oppose % over time
|
||||||
|
4. create graph for Stance/Tone (heatmap count)
|
||||||
|
5. create graph for Confidence/Stance (boxplot or histogram)
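The numbers behind the Stance/Share bar can be computed before any plotting. A minimal sketch, assuming the stance labels arrive as a plain list of strings; `stance_share` and `STANCE_ORDER` are hypothetical names, not taken from `viz/prototype_charts.py`.

```python
from collections import Counter

# Fixed order so the stacked bar segments always appear in the same sequence.
STANCE_ORDER = ["support", "oppose", "neutral", "unknown"]


def stance_share(stances: list[str]) -> list[tuple[str, int, float]]:
    """Return (stance, raw count, percent of total) for each known stance."""
    counts = Counter(stances)
    total = sum(counts[s] for s in STANCE_ORDER)
    return [
        (s, counts[s], 100 * counts[s] / total if total else 0.0)
        for s in STANCE_ORDER
    ]


def share_label(count: int, percent: float) -> str:
    """Format a segment label in the style '63% (5720)'."""
    return f"{percent:.0f}% ({count})"
```

Each tuple maps onto one segment of the horizontal bar, with `share_label` providing the combined percent and raw total.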
|
||||||
|
|
||||||
|
** notes
|
||||||
|
- prototyped in plotly
|
||||||
|
- initial streamlit
|
||||||
|
|
||||||
|
** evidence
|
||||||
|
- commit: 3fb424d
|
||||||
|
- tests: see viz/proto and viz/chart_tests
|
||||||
|
- datetime: [2026-05-08 Fri 08:38]
|
||||||
|
|
||||||
|
* [X] t1.5: streamlit
|
||||||
|
create organized webpage displaying useful information from completed job and analysis
|
||||||
|
|
||||||
|
** acceptance criteria
|
||||||
|
1. display total stance breakdown
|
||||||
|
2. display centered horiz-bar with absolute stances
|
||||||
|
3. show daily comment stances and cumulative
|
||||||
|
4. show comment table with filters for stance (filter tone?)
|
||||||
|
5. clicking/selecting a comment shows full text and model rationale
|
||||||
|
6. app runs locally with one command
|
||||||
|
|
||||||
|
** notes
|
||||||
|
data pulls entirely from the job; goal is to point viz/streamlit.py at any job/ folder and have everything it needs
|
||||||
|
|
||||||
|
** evidence
|
||||||
|
- commit: cc16acb
|
||||||
|
- tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
|
||||||
|
- datetime: [2026-05-08 Fri 23:44]
|
||||||
|
|
||||||
|
* +[ ] t1.6 host streamlit via dockerfile+
|
||||||
|
planning to deploy manually, get cert, etc. probably don't care about https?
|
||||||
|
+using streamlit.app instead+
|
||||||
|
** acceptance criteria
|
||||||
|
1. write dockerfile with slim image
|
||||||
|
|
||||||
|
** notes
|
||||||
|
|
||||||
|
* === Backlog ===
|
||||||
|
- add forum_url, forum_collected_date to scraper (to add to viz)
|
||||||
* [ ] X: complete proposal information
|
* [ ] X: complete proposal information
|
||||||
Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
|
Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
|
||||||
** acceptance criteria
|
** acceptance criteria
|
||||||
1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
|
1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
|
||||||
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
|
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
|
||||||
2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
|
2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
|
||||||
|
* [ ] X: add helper data to create_csv
|
||||||
|
1. in create_csv.py, create helper columns:
|
||||||
|
- stance_signed = {"support":1, "oppose":-1, "neutral":0, "unknown":0}
|
||||||
|
- stance_weighted = stance_signed * stance_confidence
|
||||||
|
- is_support_oppose = stance in ["support", "oppose"]
|
||||||
|
- date_day
|
||||||
|
- date_hour
|
||||||
|
- text_norm
|
||||||
|
- text_hash
|
||||||
|
- confidence_bucket = 'low' <.7 | 'med' .7-.89 | 'high' >=.9
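The stance and confidence helpers listed above could look roughly like this. It is a sketch on plain dicts rather than the DataFrame that `create_csv.py` works with, `helper_columns` is a hypothetical name, and the bucket edges assume "med" means 0.7 up to but not including 0.9.

```python
STANCE_SIGNED = {"support": 1, "oppose": -1, "neutral": 0, "unknown": 0}


def confidence_bucket(confidence: float) -> str:
    """Map a 0.0-1.0 confidence to 'low' (<0.7), 'med' (0.7 to <0.9), or 'high' (>=0.9)."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.7:
        return "med"
    return "low"


def helper_columns(row: dict) -> dict:
    """Derive the signed, weighted, and bucketed helper values for one record."""
    signed = STANCE_SIGNED[row["stance"]]
    return {
        "stance_signed": signed,
        "stance_weighted": signed * row["stance_confidence"],
        "is_support_oppose": row["stance"] in ("support", "oppose"),
        "confidence_bucket": confidence_bucket(row["stance_confidence"]),
    }
```

The date and text helpers (`date_day`, `date_hour`, `text_norm`, `text_hash`) are left out because their intended definitions are not specified here.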
|
||||||
|
|||||||
@@ -1,50 +1,111 @@
|
|||||||
#+title: VA Townhall
|
#+title: VA Townhall
|
||||||
#+date: [2026-05-05 Tue]
|
#+date: [2026-05-05 Tue]
|
||||||
#+version: 1
|
#+version: 1.1
|
||||||
|
|
||||||
* Project Goals
|
** Project Goals
|
||||||
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
||||||
1. the utility of this forum as a mechanism for public comment, and
|
1. the utility of this forum as a mechanism for public comment, and
|
||||||
2. the impact of this forum on Virginia regulation.
|
2. the impact of this forum on Virginia regulation.
|
||||||
2. Make data and insights broadly available.
|
2. Make data and insights broadly available.
|
||||||
3. Generalize to other public comment tools.
|
3. Generalize to other public comment tools.
|
||||||
|
|
||||||
** Document and analyze sentiment
|
*** Research questions
|
||||||
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
|
1. What is the quality of the comments on the forum?
|
||||||
- Build tests for identifying abuse, such as spam and account fraud
|
1. Are there duplicate entries?
|
||||||
- Identify any patterns connecting measured sentiment against VA decisions
|
2. Are there non-human-generated entries?
|
||||||
|
3. Are there entries intended to abuse the forum or drown out comment?
|
||||||
** Make data available
|
2. How do commenters feel about the proposed change?
|
||||||
- Pick a good visualization tool
|
1. What is the total number and percent supporting vs opposing, and how does this change over time?
|
||||||
|
2. What is the type of support, such as strong/weak, positive/negative?
|
||||||
|
3. What impact do the comments have on the proposed change?
|
||||||
|
(I anticipate this will not be measurable from currently available data)
|
||||||
|
|
||||||
** Generalize
|
** Architecture
|
||||||
- Identify scalable ways to apply this toolset to similar problems
|
1. Scrape/Parse: Scrapy
|
||||||
|
2. Sentiment analysis: gpt-5.4-mini
|
||||||
|
3. Display: streamlit
|
||||||
|
4. Storage: jsonl, csv, parquet
|
||||||
|
|
||||||
* Architecture
|
[[file:pipeline-v1.2.3.svg]]
|
||||||
1. Scrape/Parse: **Scrapy** for downloading comments
|
|
||||||
2. Storage: json
|
*** Scraper
|
||||||
3. Sentiment analysis: Claude haiku
|
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
|
||||||
4. Display: TBD
|
|
||||||
|
|
||||||
** Scraper
|
|
||||||
Scrapy provides a simple mechanism for browsing and
|
|
||||||
1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
||||||
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
|
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
|
||||||
3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
|
3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
|
||||||
|
|
||||||
** Storage
|
*** Analysis
|
||||||
One JSONL file per forum/bill.
|
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.
|
||||||
|
|
||||||
** Analysis
|
Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
|
||||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
|
|
||||||
|
|
||||||
| Tool | Output | Context | Sarcasm | Context window | Cost/1k comments |
|
Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to submit them, check their status, and download the results.
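Splitting a forum into jobs comes down to grouping comments so that each job stays under a per-model token limit. The sketch below is one plausible greedy approach, not the actual logic in `openai_batch.py`: `split_into_jobs` is a hypothetical name, the per-comment token counts are assumed to be computed already, and the 0.80 factor mirrors the 20% headroom mentioned in the task notes.

```python
LIMIT_BUFFER = 0.80  # assumed headroom factor: use only 80% of the model's limit


def split_into_jobs(token_counts: list[int], model_limit: int) -> list[list[int]]:
    """Greedily group comment indexes so each job fits the buffered token limit.

    `token_counts[i]` is the estimated token count of comment i. A single
    comment larger than the budget still gets a job of its own.
    """
    budget = int(model_limit * LIMIT_BUFFER)
    jobs: list[list[int]] = []
    current: list[int] = []
    used = 0
    for index, tokens in enumerate(token_counts):
        # Close the current job when the next comment would exceed the budget.
        if current and used + tokens > budget:
            jobs.append(current)
            current, used = [], 0
        current.append(index)
        used += tokens
    if current:
        jobs.append(current)
    return jobs
```

Keeping comments in their original order makes it easy to tell which comments remain if a run stops partway through.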
|
||||||
|-------------------+--------------------------------+------------+------------------+----------------+------------------|
|
|
||||||
| Google NL API | -1→+1, magnitude | No/generic | Poorly | No | ~$1–2 |
|
|
||||||
| Amazon Comprehend | Pos/Neg/Neutral/Mixed | No/generic | Poorly | No | ~$0.10 |
|
|
||||||
| Claude Haiku | Prompted → for/against/neutral | Yes | Yes, with prompt | Yes | ~$0.10–0.30 |
|
|
||||||
| GPT-4o-mini | Prompted → same | Yes | Yes | Yes | ~$0.05–0.15 |
|
|
||||||
|
|
||||||
|
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
|
||||||
|
|
||||||
|
**** Prompt
|
||||||
|
```
|
||||||
|
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
||||||
|
regulatory comment system. You will be given the text of a proposed regulation and a single
|
||||||
|
public comment. Return ONLY a JSON object — no other text.
|
||||||
|
|
||||||
|
Definitions:
|
||||||
|
- stance: the commenter's position on whether the regulation should be adopted.
|
||||||
|
"support" = wants it approved (as-is or with changes);
|
||||||
|
"oppose" = wants it rejected or substantially weakened;
|
||||||
|
"neutral" = takes no position, asks a question, or provides factual input only;
|
||||||
|
"unknown" = too vague, off-topic, or uninterpretable to classify.
|
||||||
|
- tone: the emotional register of the writing, independent of stance.
|
||||||
|
"positive" = affirming, hopeful, appreciative;
|
||||||
|
"negative" = angry, fearful, alarmed, or contemptuous;
|
||||||
|
"neutral" = matter-of-fact, procedural, or informational;
|
||||||
|
"mixed" = contains both positive and negative emotional content;
|
||||||
|
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||||
|
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||||
|
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||||
|
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
||||||
|
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
||||||
|
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||||
|
|
||||||
|
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||||
|
```
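Since the prompt asks for exactly five keys with fixed label sets, each response can be checked mechanically before it is joined back to the comments. The sketch below is an illustrative validator only; `validate_response` is a hypothetical name and is not the project's normalization step.

```python
import json

STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw: str) -> list[str]:
    """Return a list of problems found in one model response (empty if valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"key mismatch: {sorted(set(data) ^ REQUIRED_KEYS)}")
    stance = data.get("stance")
    if not isinstance(stance, str) or stance not in STANCES:
        problems.append(f"bad stance: {stance!r}")
    tone = data.get("tone")
    if not isinstance(tone, str) or tone not in TONES:
        problems.append(f"bad tone: {tone!r}")
    confidence = data.get("stance_confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (isinstance(confidence, bool) or not isinstance(confidence, (int, float))
            or not 0.0 <= confidence <= 1.0):
        problems.append(f"bad stance_confidence: {confidence!r}")
    tags = data.get("tags")
    if not isinstance(tags, list) or len(tags) > 5:
        problems.append("tags must be a list of at most 5 items")
    return problems
```

Returning a list of problems rather than raising lets a caller record every issue for a comment in one pass.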
|
||||||
|
|
||||||
|
|
||||||
|
*** Storage
|
||||||
|
- Each scraped forum is saved to `output/<forum-id>.jsonl`
|
||||||
|
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
|
||||||
|
- Each job is saved to `analysis/jobs/<report-id>/`:
|
||||||
|
└─`forum.jsonl` is a copy of the scraped forum for convenience
|
||||||
|
└─`prompt.txt` is a copy of the prompt used
|
||||||
|
└─`report.json` is a copy of the report used
|
||||||
|
└─`status.json` contains metadata about the job
|
||||||
|
For each batch in the job, four files are created:
|
||||||
|
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
|
||||||
|
└─`jobN-output-raw.jsonl` contains the exact response from the API
|
||||||
|
└─`jobN-output.jsonl` contains the normalized response, derived from the raw output
|
||||||
|
└─`jobN-errors.jsonl` when errors are returned (this file may not exist)
|
||||||
|
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
|
||||||
|
|
||||||
|
** Instructions
|
||||||
|
1. Scrape the forum.
|
||||||
|
`python
|
||||||
|
2. Run model report.
|
||||||
|
`python analysis/tokenizer.py <input> --prompt <prompt>`
|
||||||
|
3. To run a realtime subset:
|
||||||
|
`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
|
||||||
|
`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
|
||||||
|
4. To create and run the whole thing in batches, first create the batch jobs from the report:
|
||||||
|
`python analysis/openai_batch.py create <report> --model <model>`
|
||||||
|
`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
|
||||||
|
5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model's queue fills up, the batch will fail, and resubmission is not implemented.
|
||||||
|
`python analysis/openai_batch.py submit`
|
||||||
|
# Check status
|
||||||
|
`python analysis/openai_batch.py status`
|
||||||
|
# When complete, download:
|
||||||
|
`python analysis/openai_batch.py download`
|
||||||
|
# Submit the next batch after the previous is complete:
|
||||||
|
`python analysis/openai_batch.py submit`
|
||||||
|
|
||||||
* Roadmap
|
* Roadmap
|
||||||
1. Scrape one forum
|
1. Scrape one forum
|
||||||
2. Compare sentiment models
|
2. Compare sentiment models
|
||||||
|
|||||||
43
reports/f452-1.json
Normal file
@@ -0,0 +1,43 @@
|
|||||||
|
{
|
||||||
|
"prompt": "analysis\\prompt-1.txt",
|
||||||
|
"prompt_hash": "cb41250",
|
||||||
|
"input_file": "output\\f452.jsonl",
|
||||||
|
"input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
|
||||||
|
"total_comments": 9083,
|
||||||
|
"input_tokens": 6397254,
|
||||||
|
"gpt-5.5": {
|
||||||
|
"jobs": 9,
|
||||||
|
"cost_$": 15.9931,
|
||||||
|
"est_queue_days": 7.11
|
||||||
|
},
|
||||||
|
"gpt-5.4": {
|
||||||
|
"jobs": 9,
|
||||||
|
"cost_$": 7.9966,
|
||||||
|
"est_queue_days": 7.11
|
||||||
|
},
|
||||||
|
"gpt-5.4-mini": {
|
||||||
|
"jobs": 4,
|
||||||
|
"cost_$": 2.399,
|
||||||
|
"est_queue_days": 3.2
|
||||||
|
},
|
||||||
|
"gpt-5.4-nano": {
|
||||||
|
"jobs": 40,
|
||||||
|
"cost_$": 0.6397,
|
||||||
|
"est_queue_days": 31.99
|
||||||
|
},
|
||||||
|
"gpt-4o": {
|
||||||
|
"jobs": 9,
|
||||||
|
"cost_$": 7.9966,
|
||||||
|
"est_queue_days": 7.11
|
||||||
|
},
|
||||||
|
"gpt-4o-mini": {
|
||||||
|
"jobs": 4,
|
||||||
|
"cost_$": 0.4798,
|
||||||
|
"est_queue_days": 3.2
|
||||||
|
},
|
||||||
|
"gpt-o4-mini": {
|
||||||
|
"jobs": 4,
|
||||||
|
"cost_$": 3.5185,
|
||||||
|
"est_queue_days": 3.2
|
||||||
|
}
|
||||||
|
}
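The `gpt-4o` entry in this report can be reproduced by hand. The sketch below assumes an input price of $1.25 per million tokens, a daily token limit of 900,000, and a 0.8 safety buffer on that limit. None of these constants appear in the report itself; they are inferred from its figures, and `estimate` is a hypothetical name.

```python
import math


def estimate(input_tokens, price_per_million, tokens_per_day, buffer=0.8):
    """Estimate job count, cost, and queue time for one model."""
    cost = round(input_tokens / 1_000_000 * price_per_million, 4)
    queue_days = round(input_tokens / tokens_per_day, 2)
    # Each job is sized to stay under the daily limit with some headroom.
    jobs = math.ceil(input_tokens / int(tokens_per_day * buffer))
    return {"jobs": jobs, "cost_$": cost, "est_queue_days": queue_days}


print(estimate(6_397_254, 1.25, 900_000))
# {'jobs': 9, 'cost_$': 7.9966, 'est_queue_days': 7.11}
```

Under the same assumed buffer, a 200,000-token daily limit gives the 40 jobs and 31.99 queue days shown for `gpt-5.4-nano`.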
|
||||||
BIN
requirements.txt
Binary file not shown.
@@ -5,6 +5,8 @@ class ForumItem(scrapy.Item):
|
|||||||
forum_id = scrapy.Field()
|
forum_id = scrapy.Field()
|
||||||
reg_title = scrapy.Field()
|
reg_title = scrapy.Field()
|
||||||
reg_desc = scrapy.Field()
|
reg_desc = scrapy.Field()
|
||||||
|
scraped_at = scrapy.Field()
|
||||||
|
forum_url = scrapy.Field()
|
||||||
|
|
||||||
|
|
||||||
class CommentItem(scrapy.Item):
|
class CommentItem(scrapy.Item):
|
||||||
|
|||||||
@@ -63,6 +63,8 @@ class ForumSpider(scrapy.Spider):
|
|||||||
forum_id=self.forum_id,
|
forum_id=self.forum_id,
|
||||||
reg_title=reg_title,
|
reg_title=reg_title,
|
||||||
reg_desc=reg_desc,
|
reg_desc=reg_desc,
|
||||||
|
scraped_at=datetime.utcnow().isoformat(),
|
||||||
|
forum_url=_view_url(self.forum_id),
|
||||||
)
|
)
|
||||||
for page in range(2, last_page + 1):
|
for page in range(2, last_page + 1):
|
||||||
yield scrapy.FormRequest(
|
yield scrapy.FormRequest(
|
||||||
|
|||||||
155
tests/create_csv.py
Normal file
@@ -0,0 +1,155 @@
|
|||||||
|
"""Unit tests for analysis/create_csv.py — no external API calls."""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||||
|
import create_csv as cc
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Helpers
|
||||||
|
|
||||||
|
def _write_jsonl(path: Path, rows: list[dict]) -> None:
|
||||||
|
with open(path, "w", encoding="utf-8") as f:
|
||||||
|
for row in rows:
|
||||||
|
f.write(json.dumps(row) + "\n")
|
||||||
|
|
||||||
|
|
||||||
|
RAW_ROWS = [
|
||||||
|
{"forum_id": "452", "comment_id": "1", "title": "Support", "text": "I support.", "date": "2021-01-01", "author": "Alice"},
|
||||||
|
{"forum_id": "452", "comment_id": "2", "title": "Oppose", "text": "I oppose.", "date": "2021-01-02", "author": "Bob"},
|
||||||
|
{"forum_id": "452", "comment_id": "3", "title": "Neutral", "text": "No opinion.","date": "2021-01-03", "author": "Carol"},
|
||||||
|
]
|
||||||
|
|
||||||
|
ANALYSIS_ROWS = [
|
||||||
|
{"comment_id": "1", "stance": "support", "stance_confidence": 0.9, "stance_rationale": "clear support",
|
||||||
|
"tone": "neutral", "tags": '["policy"]', "error": None, "truncated": False,
|
||||||
|
"analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
|
||||||
|
{"comment_id": "2", "stance": "oppose", "stance_confidence": 0.8, "stance_rationale": "clear oppose",
|
||||||
|
"tone": "negative", "tags": '[]', "error": None, "truncated": False,
|
||||||
|
"analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# load_raw
|
||||||
|
|
||||||
|
def test_load_raw_returns_raw_cols(tmp_path):
|
||||||
|
p = tmp_path / "forum.jsonl"
|
||||||
|
_write_jsonl(p, RAW_ROWS)
|
||||||
|
df = cc.load_raw(p)
|
||||||
|
assert list(df.columns) == cc.RAW_COLS
|
||||||
|
|
||||||
|
|
||||||
|
def test_load_raw_row_count(tmp_path):
|
||||||
|
p = tmp_path / "forum.jsonl"
|
||||||
|
_write_jsonl(p, RAW_ROWS)
|
||||||
|
df = cc.load_raw(p)
|
||||||
|
assert len(df) == 3
|
||||||
|
|
||||||
|
|
||||||
|
def test_load_raw_skips_non_comment_rows(tmp_path):
|
||||||
|
"""Rows without comment_id (e.g. forum metadata) are dropped."""
|
||||||
|
rows = RAW_ROWS + [{"forum_id": "452", "reg_title": "Metadata row"}]
|
||||||
|
p = tmp_path / "forum.jsonl"
|
||||||
|
_write_jsonl(p, rows)
|
||||||
|
df = cc.load_raw(p)
|
||||||
|
assert len(df) == 3
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# load_analysis
|
||||||
|
|
||||||
|
def test_load_analysis_returns_analysis_cols(tmp_path):
|
||||||
|
jobs = tmp_path / "jobs"
|
||||||
|
jobs.mkdir()
|
||||||
|
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||||
|
df = cc.load_analysis(jobs)
|
||||||
|
expected = ["comment_id"] + cc.ANALYSIS_COLS
|
||||||
|
assert list(df.columns) == expected
|
||||||
|
|
||||||
|
|
||||||
|
def test_load_analysis_skips_raw_files(tmp_path):
|
||||||
|
jobs = tmp_path / "jobs"
|
||||||
|
jobs.mkdir()
|
||||||
|
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||||
|
_write_jsonl(jobs / "job1-output-raw.jsonl", ANALYSIS_ROWS) # should be ignored
|
||||||
|
df = cc.load_analysis(jobs)
|
||||||
|
assert len(df) == len(ANALYSIS_ROWS)
|
||||||
|
|
||||||
|
|
||||||
|
def test_load_analysis_concatenates_multiple_files(tmp_path):
|
||||||
|
jobs = tmp_path / "jobs"
|
||||||
|
jobs.mkdir()
|
||||||
|
_write_jsonl(jobs / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
|
||||||
|
_write_jsonl(jobs / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
|
||||||
|
df = cc.load_analysis(jobs)
|
||||||
|
assert len(df) == 2
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# join
|
||||||
|
|
||||||
|
def test_join_all_raw_preserved(tmp_path):
|
||||||
|
"""Left join: all raw comments appear in output, even without analysis."""
|
||||||
|
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||||
|
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||||
|
for col in cc.ANALYSIS_COLS:
|
||||||
|
if col not in analysis.columns:
|
||||||
|
analysis[col] = None
|
||||||
|
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||||
|
|
||||||
|
merged = cc.join(raw, analysis)
|
||||||
|
assert len(merged) == 3 # all 3 raw rows, even comment_id=3 with no analysis
|
||||||
|
|
||||||
|
|
||||||
|
def test_join_unanalyzed_row_has_null_stance(tmp_path):
|
||||||
|
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||||
|
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||||
|
for col in cc.ANALYSIS_COLS:
|
||||||
|
if col not in analysis.columns:
|
||||||
|
analysis[col] = None
|
||||||
|
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||||
|
|
||||||
|
merged = cc.join(raw, analysis)
|
||||||
|
unanalyzed = merged[merged["comment_id"] == "3"]
|
||||||
|
assert pd.isna(unanalyzed.iloc[0]["stance"])
|
||||||
|
|
||||||
|
|
||||||
|
def test_join_column_order(tmp_path):
|
||||||
|
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||||
|
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||||
|
for col in cc.ANALYSIS_COLS:
|
||||||
|
if col not in analysis.columns:
|
||||||
|
analysis[col] = None
|
||||||
|
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||||
|
|
||||||
|
merged = cc.join(raw, analysis)
|
||||||
|
assert list(merged.columns) == cc.OUTPUT_COLS
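The three join tests pin down left-join semantics: every raw comment survives, unanalyzed comments get nulls, and raw columns come before analysis columns. The project's `join` is not shown in this diff, so the following is only a dependency-free sketch of those semantics on plain dicts; `left_join` and its parameters are hypothetical.

```python
def left_join(raw_rows, analysis_rows, analysis_cols, key="comment_id"):
    """Left-join analysis onto raw rows; unmatched rows get None in analysis columns."""
    # If a key appears twice in analysis_rows, the last record wins.
    by_key = {row[key]: row for row in analysis_rows}
    merged = []
    for raw in raw_rows:
        match = by_key.get(raw[key], {})
        out = dict(raw)  # raw columns first, in their original order
        for col in analysis_cols:
            out[col] = match.get(col)
        merged.append(out)
    return merged


raw = [
    {"comment_id": "1", "text": "I support."},
    {"comment_id": "3", "text": "No opinion."},
]
analysis = [{"comment_id": "1", "stance": "support"}]
rows = left_join(raw, analysis, ["stance"])
```

Here `rows` keeps both comments, and the row for comment `3` carries `None` as its stance.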
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# End-to-end: write + read CSV
|
||||||
|
|
||||||
|
def test_csv_written_correctly(tmp_path):
|
||||||
|
raw_path = tmp_path / "forum.jsonl"
|
||||||
|
_write_jsonl(raw_path, RAW_ROWS)
|
||||||
|
|
||||||
|
jobs = tmp_path / "jobs"
|
||||||
|
jobs.mkdir()
|
||||||
|
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||||
|
|
||||||
|
out = tmp_path / "review.csv"
|
||||||
|
raw = cc.load_raw(raw_path)
|
||||||
|
analysis = cc.load_analysis(jobs)
|
||||||
|
merged = cc.join(raw, analysis)
|
||||||
|
merged.to_csv(out, index=False, encoding="utf-8-sig")
|
||||||
|
|
||||||
|
loaded = pd.read_csv(out)
|
||||||
|
assert len(loaded) == 3
|
||||||
|
assert list(loaded.columns) == cc.OUTPUT_COLS
|
||||||
119
tests/encoding.py
Normal file
@@ -0,0 +1,119 @@
|
|||||||
|
"""Unit tests for analysis/encoding.py — no external dependencies required."""
|
||||||
|
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||||
|
from encoding import repair_text, _KNOWN_REPAIRS
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Core contract
|
||||||
|
|
||||||
|
|
||||||
|
def test_empty_string_unchanged():
|
||||||
|
assert repair_text("") == ""
|
||||||
|
|
||||||
|
|
||||||
|
def test_none_like_empty_unchanged():
|
||||||
|
assert repair_text("") == ""
|
||||||
|
|
||||||
|
|
||||||
|
def test_clean_ascii_unchanged():
|
||||||
|
text = "This is a normal sentence with no encoding issues."
|
||||||
|
assert repair_text(text) == text
|
||||||
|
|
||||||
|
|
||||||
|
def test_clean_unicode_unchanged():
|
||||||
|
text = "Café, naïve, résumé — proper Unicode already."
|
||||||
|
result = repair_text(text)
|
||||||
|
# Should either be unchanged or equivalently correct
|
||||||
|
assert "Caf" in result and "na" in result
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Known mojibake sequences (tasks.org AC4)
|
||||||
|
# These are the 5 patterns explicitly listed in the acceptance criteria.
|
||||||
|
|
||||||
|
|
||||||
|
def test_right_single_quote():
    """’ → ' (U+2019 right single quotation mark)"""
    assert repair_text("Virginia’s") == "Virginia’s"
|
||||||
|
|
||||||
|
|
||||||
|
def test_left_double_quote():
    """“ → " (U+201C left double quotation mark)"""
    assert repair_text("“Hello") == "“Hello"
|
||||||
|
|
||||||
|
|
||||||
|
def test_en_dash():
    """â€" (where last char is U+201C) → – (U+2013 en dash)"""
    result = repair_text("pages 1–5")
    assert "–" in result or "—" in result or "-" in result
|
||||||
|
|
||||||
|
|
||||||
|
def test_em_dash():
    """â€" (where last char is U+201D) → — (U+2014 em dash)"""
    result = repair_text("word—word")
    assert "—" in result or "–" in result or "-" in result
|
||||||
|
|
||||||
|
|
||||||
|
def test_right_double_quote():
|
||||||
|
"""â€\x9d → " (U+201D right double quotation mark)"""
|
||||||
|
result = repair_text("said†he")
|
||||||
|
# Should not contain the raw artifact
|
||||||
|
assert "â€" not in result
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Round-trip: garbled text produces sensible output
|
||||||
|
|
||||||
|
|
||||||
|
def test_garbled_sentence_repaired():
    """A sentence with multiple mojibake chars is repaired to readable text."""
    # "Don't" with right single quote encoded as UTF-8, then decoded as cp1252
    # D o n ' t → D o n â € ™ t
    garbled = "Don’t worry"
    result = repair_text(garbled)
    assert "Don" in result and "t worry" in result
    assert "â€" not in result  # artifact gone
|
||||||
|
|
||||||
|
|
||||||
|
def test_clean_string_after_repair_has_no_artifacts():
    garbled = "She said “Hello†and left."
    result = repair_text(garbled)
    assert "â€" not in result
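The mojibake in these tests comes from UTF-8 bytes being decoded as cp1252, so the usual repair is to reverse that step: encode the text back to cp1252 and decode the bytes as UTF-8. The sketch below shows that technique only. `repair_mojibake` is a hypothetical name, and the project's `repair_text` is not shown in this diff and may work differently.

```python
def repair_mojibake(text):
    """Undo one round of UTF-8-read-as-cp1252 mojibake, if the whole text allows it."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8: leave the text as-is.
        return text
```

This all-or-nothing version leaves a string unchanged when it mixes clean non-ASCII text with mojibake. It also cannot repair the right double quote, because Python's cp1252 codec has no mapping for byte 0x9D; that may be why the tests also reference a `_KNOWN_REPAIRS` lookup table.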
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# FFFD replacement characters (from strict UTF-8 decode of cp1252 bytes)
|
||||||
|
|
||||||
|
|
||||||
|
def test_fffd_preserved_not_crashed():
    """repair_text must not raise on U+FFFD; it may or may not repair it."""
    text = "Virginia\ufffds Public Schools"
    result = repair_text(text)
    assert isinstance(result, str)
    assert "Virginia" in result
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# _KNOWN_REPAIRS table structure
|
||||||
|
|
||||||
|
|
||||||
|
def test_known_repairs_non_empty():
|
||||||
|
assert len(_KNOWN_REPAIRS) > 0
|
||||||
|
|
||||||
|
|
||||||
|
def test_known_repairs_are_pairs():
|
||||||
|
for item in _KNOWN_REPAIRS:
|
||||||
|
assert len(item) == 2
|
||||||
|
bad, good = item
|
||||||
|
assert isinstance(bad, str) and isinstance(good, str)
|
||||||
|
|
||||||
|
|
||||||
|
def test_known_repairs_bad_not_equal_good():
|
||||||
|
for bad, good in _KNOWN_REPAIRS:
|
||||||
|
assert bad != good
|
||||||
@@ -1,4 +1,4 @@
|
|||||||
"""Unit tests for analysis/gpt4o/analysis_batch.py — no real API calls."""
|
"""Unit tests for analysis/openai_batch.py — no real API calls."""
|
||||||
|
|
||||||
import json
|
import json
|
||||||
import sys
|
import sys
|
||||||
@@ -7,8 +7,8 @@ from unittest.mock import MagicMock
|
|||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis" / "gpt4o"))
|
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||||
import analysis_batch as bt
|
import openai_batch as bt
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
@@ -75,9 +75,24 @@ ANALYZED_AT = "2026-05-05T18:00:00+00:00"
|
|||||||
RUN_ID = "test-run-id-123"
|
RUN_ID = "test-run-id-123"
|
||||||
MODEL = "gpt-4o"
|
MODEL = "gpt-4o"
|
||||||
|
|
||||||
|
# Minimal status.json for testing job logic
|
||||||
|
def _make_status(jobs_override=None):
|
||||||
|
jobs = jobs_override or [
|
||||||
|
{"job_num": 1, "run_id": "r1", "status": "pending", "batch_id": None,
|
||||||
|
"records_submitted": 60, "records_completed": None, "records_failed": None,
|
||||||
|
"submitted_at": None, "completed_at": None},
|
||||||
|
]
|
||||||
|
return {
|
||||||
|
"model": "gpt-4o-mini", "prompt_hash": "abc1234",
|
||||||
|
"input_file": "output/f452.jsonl", "input_sha256": "sha",
|
||||||
|
"total_comments": 100, "input_tokens": 50_000,
|
||||||
|
"est_queue_days": 0.025, "cost_$": 0.01,
|
||||||
|
"total_jobs": len(jobs), "jobs": jobs,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Prompt versioning (batch reads the same prompt file)
|
# Prompt versioning
|
||||||
|
|
||||||
def test_prompt_version_is_7_hex_chars():
|
def test_prompt_version_is_7_hex_chars():
|
||||||
assert len(bt.PROMPT_VERSION) == 7
|
assert len(bt.PROMPT_VERSION) == 7
|
||||||
@@ -86,7 +101,7 @@ def test_prompt_version_is_7_hex_chars():
|
|||||||
|
|
||||||
def test_prompt_version_matches_realtime():
|
def test_prompt_version_matches_realtime():
|
||||||
"""Both scripts must derive the same PROMPT_VERSION from the same file."""
|
"""Both scripts must derive the same PROMPT_VERSION from the same file."""
|
||||||
import analysis_realtime as rt
|
import openai_realtime as rt
|
||||||
assert bt.PROMPT_VERSION == rt.PROMPT_VERSION
|
assert bt.PROMPT_VERSION == rt.PROMPT_VERSION
|
||||||
|
|
||||||
|
|
||||||
@@ -206,52 +221,6 @@ def test_normalize_unknown_comment_id():
|
|||||||
assert record["input_title"] == ""
|
assert record["input_title"] == ""
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Manifest
|
|
||||||
|
|
||||||
def test_make_manifest_all_keys():
|
|
||||||
m = bt.make_manifest(
|
|
||||||
run_id=RUN_ID,
|
|
||||||
input_filename="output/forum452.jsonl",
|
|
||||||
input_sha256="abc123",
|
|
||||||
model="gpt-4o",
|
|
||||||
batch_id="batch_xyz",
|
|
||||||
records_submitted=100,
|
|
||||||
request_filename="analysis/gpt4o/requests/test-run-id-123.jsonl",
|
|
||||||
)
|
|
||||||
required = {
|
|
||||||
"run_id", "input_filename", "input_sha256", "prompt_hash", "model",
|
|
||||||
"batch_id", "records_submitted", "records_completed", "records_failed",
|
|
||||||
"request_filename", "raw_output_filename", "normalized_output_filename",
|
|
||||||
"created_at", "completed_at",
|
|
||||||
}
|
|
||||||
assert required == set(m.keys())
|
|
||||||
|
|
||||||
|
|
||||||
def test_make_manifest_initial_nulls():
|
|
||||||
m = bt.make_manifest(
|
|
||||||
run_id=RUN_ID, input_filename="f", input_sha256="s",
|
|
||||||
model="gpt-4o", batch_id="b", records_submitted=10, request_filename="r",
|
|
||||||
)
|
|
||||||
assert m["records_completed"] is None
|
|
||||||
assert m["records_failed"] is None
|
|
||||||
assert m["raw_output_filename"] is None
|
|
||||||
assert m["normalized_output_filename"] is None
|
|
||||||
assert m["completed_at"] is None
|
|
||||||
assert m["prompt_hash"] == bt.PROMPT_VERSION
|
|
||||||
|
|
||||||
|
|
||||||
def test_manifest_save_load_roundtrip(tmp_path, monkeypatch):
|
|
||||||
monkeypatch.setattr(bt, "RUNS_DIR", tmp_path)
|
|
||||||
m = bt.make_manifest(
|
|
||||||
run_id=RUN_ID, input_filename="f", input_sha256="s",
|
|
||||||
model="gpt-4o", batch_id="b", records_submitted=42, request_filename="r",
|
|
||||||
)
|
|
||||||
bt.save_manifest(m)
|
|
||||||
loaded = bt.load_manifest(RUN_ID)
|
|
||||||
assert loaded == m
|
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# estimate_tokens
|
# estimate_tokens
|
||||||
|
|
||||||
@@ -273,7 +242,8 @@ def test_estimate_tokens_fallback_without_tiktoken(monkeypatch):
|
|||||||
monkeypatch.setitem(_sys.modules, "tiktoken", None)
|
monkeypatch.setitem(_sys.modules, "tiktoken", None)
|
||||||
messages = [{"role": "user", "content": "x" * 300}]
|
messages = [{"role": "user", "content": "x" * 300}]
|
||||||
result = bt.estimate_tokens(messages, "gpt-4o")
|
result = bt.estimate_tokens(messages, "gpt-4o")
|
||||||
assert result == 4 + 300 // 3
|
# fallback: 3 primer + (3 + 300//3) per message
|
||||||
|
assert result == 3 + (3 + 300 // 3)
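The expected value in this assertion follows from a character-based heuristic. The sketch below is consistent with the comment above and assumes roughly three characters per token; the real `estimate_tokens` is not shown in this diff, so `estimate_tokens_fallback` is a hypothetical reconstruction.

```python
def estimate_tokens_fallback(messages):
    """Rough token count: 3 primer tokens, plus 3 overhead and len // 3 per message."""
    total = 3  # tokens that prime the assistant's reply
    for message in messages:
        total += 3 + len(message["content"]) // 3
    return total


estimate = estimate_tokens_fallback([{"role": "user", "content": "x" * 300}])
```

For the single 300-character message in the test this gives 3 + (3 + 100) = 106.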
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
@@ -309,3 +279,112 @@ def test_chunk_preserves_all_comments(monkeypatch):
|
|||||||
def test_model_limits_has_required_models():
|
def test_model_limits_has_required_models():
|
||||||
for model in ("gpt-4o", "gpt-4o-mini", "gpt-5.4", "gpt-5.4-mini", "gpt-o4-mini"):
|
for model in ("gpt-4o", "gpt-4o-mini", "gpt-5.4", "gpt-5.4-mini", "gpt-o4-mini"):
|
||||||
assert model in bt.MODEL_LIMITS, f"{model} missing from MODEL_LIMITS"
|
assert model in bt.MODEL_LIMITS, f"{model} missing from MODEL_LIMITS"
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# status.json helpers
|
||||||
|
|
||||||
|
def test_status_save_load_roundtrip(tmp_path):
|
||||||
|
status = _make_status()
|
||||||
|
bt.save_status(status, tmp_path)
|
||||||
|
loaded = bt.load_status(tmp_path)
|
||||||
|
assert loaded == status
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# _find_next_eligible_job
|
||||||
|
|
||||||
|
def test_find_next_eligible_job_first_job_pending():
|
||||||
|
jobs = _make_status()["jobs"]
|
||||||
|
target, warning = bt._find_next_eligible_job(jobs)
|
||||||
|
assert target["job_num"] == 1
|
||||||
|
assert warning is None
|
||||||
|
|
||||||
|
|
||||||
|
def test_find_next_eligible_job_after_completed():
|
||||||
|
jobs = [
|
||||||
|
{"job_num": 1, "status": "completed", "batch_id": "b1",
|
||||||
|
"records_submitted": 60, "records_completed": 60, "records_failed": 0,
|
||||||
|
"submitted_at": "t", "completed_at": "t", "run_id": "r1"},
|
||||||
|
{"job_num": 2, "status": "pending", "batch_id": None,
|
||||||
|
"records_submitted": 40, "records_completed": None, "records_failed": None,
|
||||||
|
"submitted_at": None, "completed_at": None, "run_id": "r2"},
|
||||||
|
]
|
||||||
|
target, warning = bt._find_next_eligible_job(jobs)
|
||||||
|
assert target["job_num"] == 2
|
||||||
|
assert warning is None
|
||||||
|
|
||||||
|
|
||||||
|
def test_find_next_eligible_job_blocked_by_in_progress():
|
||||||
|
jobs = [
|
||||||
|
{"job_num": 1, "status": "in_progress", "batch_id": "b1",
|
||||||
|
"records_submitted": 60, "records_completed": None, "records_failed": None,
|
||||||
|
"submitted_at": "t", "completed_at": None, "run_id": "r1"},
|
||||||
|
{"job_num": 2, "status": "pending", "batch_id": None,
|
||||||
|
"records_submitted": 40, "records_completed": None, "records_failed": None,
|
||||||
|
"submitted_at": None, "completed_at": None, "run_id": "r2"},
|
||||||
|
]
|
||||||
|
target, warning = bt._find_next_eligible_job(jobs)
|
||||||
|
assert target is None
|
||||||
|
assert warning is not None
|
||||||
|
assert "in_progress" in warning
|
||||||
|
|
||||||
|
|
||||||
|
def test_find_next_eligible_job_all_completed():
|
||||||
|
jobs = [
|
||||||
|
{"job_num": 1, "status": "completed", "batch_id": "b1",
|
||||||
|
"records_submitted": 60, "records_completed": 60, "records_failed": 0,
|
||||||
|
"submitted_at": "t", "completed_at": "t", "run_id": "r1"},
|
||||||
|
]
|
||||||
|
target, warning = bt._find_next_eligible_job(jobs)
|
||||||
|
assert target is None
|
||||||
|
assert warning is None
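Taken together, these four tests describe a one-at-a-time gate: completed jobs are skipped, the first pending job is returned, and any unfinished earlier job blocks submission with a warning. A minimal sketch that satisfies those cases follows; it is not the project's `_find_next_eligible_job`, and treating every status other than `completed` and `pending` as blocking is an assumption.

```python
def find_next_eligible_job(jobs):
    """Return (job, warning): the first pending job, unless an earlier job is unfinished."""
    for job in jobs:
        if job["status"] == "completed":
            continue
        if job["status"] == "pending":
            return job, None
        # Any other state (for example in_progress) blocks later submissions.
        return None, f"job {job['job_num']} is {job['status']}; wait before submitting"
    # Every job is completed: nothing left to submit, nothing to warn about.
    return None, None
```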
|
||||||
|
|
||||||
|
|
||||||
|
def test_resume_from_status_json(tmp_path):
|
||||||
|
"""Reload a status.json with one completed job and find the next pending job."""
|
||||||
|
jobs = [
|
||||||
|
{"job_num": 1, "run_id": "r1", "status": "completed", "batch_id": "b1",
|
||||||
|
"records_submitted": 60, "records_completed": 58, "records_failed": 2,
|
||||||
|
"submitted_at": "2026-05-06T10:00:00+00:00", "completed_at": "2026-05-06T11:00:00+00:00"},
|
||||||
|
{"job_num": 2, "run_id": "r2", "status": "pending", "batch_id": None,
|
||||||
|
"records_submitted": 40, "records_completed": None, "records_failed": None,
|
||||||
|
"submitted_at": None, "completed_at": None},
|
||||||
|
]
|
||||||
|
bt.save_status(_make_status(jobs), tmp_path)
|
||||||
|
loaded = bt.load_status(tmp_path)
|
||||||
|
target, warning = bt._find_next_eligible_job(loaded["jobs"])
|
||||||
|
assert target["job_num"] == 2
|
||||||
|
assert warning is None
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# normalize: out-of-order and duplicate custom_id
|
||||||
|
|
||||||
|
def test_out_of_order_output_reconciled_by_custom_id():
|
||||||
|
"""Raw lines processed in any order are mapped to the correct comment."""
|
||||||
|
c2 = {**COMMENT_ITEM, "comment_id": "99999", "title": "Second comment"}
|
||||||
|
lookup = {COMMENT_ITEM["comment_id"]: COMMENT_ITEM, "99999": c2}
|
||||||
|
|
||||||
|
line_for_99999 = {
|
||||||
|
**RAW_SUCCESS_LINE,
|
||||||
|
"custom_id": "comment_99999",
|
||||||
|
}
|
||||||
|
line_for_87914 = RAW_SUCCESS_LINE
|
||||||
|
|
||||||
|
r1 = bt.normalize_output_line(line_for_99999, lookup, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
|
||||||
|
r2 = bt.normalize_output_line(line_for_87914, lookup, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
|
||||||
|
|
||||||
|
assert r1["comment_id"] == "99999"
|
||||||
|
assert r1["input_title"] == "Second comment"
|
||||||
|
assert r2["comment_id"] == "87914"
|
||||||
|
assert r2["input_title"] == COMMENT_ITEM["title"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_duplicate_custom_id_both_produce_valid_records():
|
||||||
|
"""Two raw lines with the same custom_id each produce a valid record."""
|
||||||
|
r1 = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
|
||||||
|
r2 = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
|
||||||
|
assert r1["comment_id"] == r2["comment_id"] == "87914"
|
||||||
|
assert r1["error"] is None
|
||||||
|
assert r2["error"] is None
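Both tests rely on each output line carrying a `custom_id` of the form `comment_<id>`, so results can be matched to comments without depending on line order. A small sketch of that lookup follows, with the `comment_` prefix taken from the fixtures above. `comment_id_from_custom_id` and `reconcile` are hypothetical helpers, not part of `openai_batch.py`.

```python
PREFIX = "comment_"


def comment_id_from_custom_id(custom_id):
    """Strip the 'comment_' prefix to recover the original comment id."""
    if not custom_id.startswith(PREFIX):
        raise ValueError(f"unexpected custom_id: {custom_id!r}")
    return custom_id[len(PREFIX):]


def reconcile(raw_lines, lookup):
    """Pair each raw output line with its source comment, regardless of order."""
    pairs = []
    for line in raw_lines:
        comment_id = comment_id_from_custom_id(line["custom_id"])
        # Unknown ids map to None so the caller can decide how to report them.
        pairs.append((comment_id, lookup.get(comment_id)))
    return pairs
```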
|
||||||
@@ -1,4 +1,4 @@
|
|||||||
"""Unit tests for analysis/gpt4o/analysis_realtime.py — no real API calls."""
|
"""Unit tests for analysis/openai_realtime.py — no real API calls."""
|
||||||
|
|
||||||
import json
|
import json
|
||||||
import sys
|
import sys
|
||||||
@@ -7,8 +7,8 @@ from unittest.mock import MagicMock
|
|||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis" / "gpt4o"))
|
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||||
import analysis_realtime as rt
|
import openai_realtime as rt
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
250
tests/tokenizer.py
Normal file
@@ -0,0 +1,250 @@
|
|||||||
|
"""Unit tests for analysis/tokenizer.py — no real API calls."""
|
||||||
|
|
||||||
|
import io
|
||||||
|
import json
|
||||||
|
import math
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
from unittest.mock import patch
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||||
|
import tokenizer as tk
|
||||||
|
import openai_batch as ab
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Fixtures
|
||||||
|
|
||||||
|
FORUM_ITEM = {
|
||||||
|
"forum_id": "452",
|
||||||
|
"reg_title": "Model Policies for Transgender Students",
|
||||||
|
"reg_desc": "Guidance developed in response to HB 145.",
|
||||||
|
}
|
||||||
|
|
||||||
|
COMMENT_A = {
|
||||||
|
"forum_id": "452",
|
||||||
|
"comment_id": "100",
|
||||||
|
"author": "Alice",
|
||||||
|
"date": "2021-01-04T09:15:00",
|
||||||
|
"title": "Support",
|
||||||
|
"text": "I support this policy.",
|
||||||
|
}
|
||||||
|
|
||||||
|
COMMENT_B = {
|
||||||
|
"forum_id": "452",
|
||||||
|
"comment_id": "101",
|
||||||
|
"author": "Bob",
|
||||||
|
"date": "2021-01-05T10:00:00",
|
||||||
|
"title": "Oppose",
|
||||||
|
"text": "I oppose this policy.",
|
||||||
|
}
|
||||||
|
|
||||||
|
COMMENTS = [COMMENT_A, COMMENT_B]
|
||||||
|
PROMPT_HASH = "abc1234"
|
||||||
|
INPUT_FILE = "output/f452.jsonl"
|
||||||
|
INPUT_SHA256 = "deadbeef" * 8
|
||||||
|
PROMPT_FILE = "analysis/prompt-1.txt"
|
||||||
|
|
||||||
|
|
||||||
|
def _make_report(total_tokens=10_000):
|
||||||
|
return tk.compute_report(
|
||||||
|
COMMENTS, FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# compute_report: required top-level keys
|
||||||
|
|
||||||
|
def test_report_has_top_level_keys():
|
||||||
|
report = _make_report()
|
||||||
|
required = {"prompt", "prompt_hash", "input_file", "input_sha256",
|
||||||
|
"total_comments", "input_tokens"}
|
||||||
|
assert required.issubset(set(report.keys()))
|
||||||
|
|
||||||
|
|
||||||
|
def test_report_metadata_values():
|
||||||
|
report = _make_report()
|
||||||
|
assert report["prompt"] == PROMPT_FILE
|
||||||
|
assert report["prompt_hash"] == PROMPT_HASH
|
||||||
|
assert report["input_file"] == INPUT_FILE
|
||||||
|
assert report["input_sha256"] == INPUT_SHA256
|
||||||
|
assert report["total_comments"] == 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_report_input_tokens_positive():
|
||||||
|
report = _make_report()
|
||||||
|
assert isinstance(report["input_tokens"], int)
|
||||||
|
assert report["input_tokens"] > 0
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# compute_report: per-model entries
|
||||||
|
|
||||||
|
def test_report_has_per_model_keys():
|
||||||
|
report = _make_report()
|
||||||
|
for model in ab.MODEL_LIMITS:
|
||||||
|
assert model in report, f"Model {model} missing from report"
|
||||||
|
assert isinstance(report[model], dict)
|
||||||
|
|
||||||
|
|
||||||
|
def test_report_per_model_has_required_fields():
|
||||||
|
report = _make_report()
|
||||||
|
for model in ab.MODEL_LIMITS:
|
||||||
|
m = report[model]
|
||||||
|
assert "jobs" in m
|
||||||
|
assert "cost_$" in m
|
||||||
|
assert "est_queue_days" in m
|
||||||
|
|
||||||
|
|
||||||
|
def test_report_jobs_at_least_one():
|
||||||
|
report = _make_report()
|
||||||
|
for model in ab.MODEL_LIMITS:
|
||||||
|
assert report[model]["jobs"] >= 1
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# compute_report: calculation accuracy
|
||||||
|
|
||||||
|
def test_cost_calculation():
|
||||||
|
"""cost_$ = total_tokens / 1M * pricing_rate"""
|
||||||
|
report = _make_report()
|
||||||
|
total = report["input_tokens"]
|
||||||
|
for model in ab.MODEL_LIMITS:
|
||||||
|
expected_cost = round(total / 1_000_000 * tk.MODEL_PRICING.get(model, 0.0), 4)
|
||||||
|
assert report[model]["cost_$"] == pytest.approx(expected_cost, abs=1e-6)
|
||||||
|
|
||||||
|
|
||||||
|
def test_est_queue_days_calculation():
|
||||||
|
"""est_queue_days = total_tokens / tpd (rounded to 2 decimal places)"""
|
||||||
|
report = _make_report()
|
||||||
|
total = report["input_tokens"]
|
||||||
|
for model, tpd in ab.MODEL_LIMITS.items():
|
||||||
|
expected = round(total / tpd, 2)
|
||||||
|
assert report[model]["est_queue_days"] == pytest.approx(expected, abs=1e-4)
|
||||||
|
|
||||||
|
|
||||||
|
def test_jobs_ceiling_division():
|
||||||
|
"""jobs = ceil(total_tokens / (tpd * _LIMIT_BUFFER))"""
|
||||||
|
report = _make_report()
|
||||||
|
total = report["input_tokens"]
|
||||||
|
for model, tpd in ab.MODEL_LIMITS.items():
|
||||||
|
effective = int(tpd * ab._LIMIT_BUFFER)
|
||||||
|
expected = math.ceil(total / effective)
|
||||||
|
assert report[model]["jobs"] == expected
|
||||||
|
|
||||||
|
|
||||||
|
def test_more_comments_increases_tokens():
|
||||||
|
"""More comments → more input_tokens."""
|
||||||
|
few = tk.compute_report([COMMENT_A], FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE)
|
||||||
|
many = tk.compute_report(COMMENTS, FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE)
|
||||||
|
assert many["input_tokens"] > few["input_tokens"]
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
# MODEL_PRICING coverage

def test_model_pricing_has_required_models():
    for model in ("gpt-4o", "gpt-4o-mini", "gpt-5.4", "gpt-5.4-mini", "gpt-o4-mini"):
        assert model in tk.MODEL_PRICING, f"{model} missing from MODEL_PRICING"


def test_model_pricing_values_positive():
    for model, price in tk.MODEL_PRICING.items():
        assert price > 0, f"{model} has non-positive price"


# ---------------------------------------------------------------------------
# print_table: runs without error, produces output

def test_print_table_runs():
    report = _make_report()
    buf = io.StringIO()
    with patch("sys.stdout", buf):
        tk.print_table(report)
    output = buf.getvalue()
    assert "gpt-4o" in output
    assert "gpt-4o-mini" in output


def test_print_table_shows_all_models():
    report = _make_report()
    buf = io.StringIO()
    with patch("sys.stdout", buf):
        tk.print_table(report)
    output = buf.getvalue()
    for model in ab.MODEL_LIMITS:
        assert model in output, f"{model} not shown in print_table output"


def test_print_table_highlights_recommended():
    """When a single-job cheapest model exists, table marks it as recommended."""
    report = _make_report()
    buf = io.StringIO()
    with patch("sys.stdout", buf):
        tk.print_table(report)
    output = buf.getvalue()
    assert "recommended" in output


# ---------------------------------------------------------------------------
# report.json round-trip (write → read)

def test_report_json_roundtrip(tmp_path):
    report = _make_report()
    out = tmp_path / "report.json"
    out.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
    loaded = json.loads(out.read_text(encoding="utf-8"))
    assert loaded["total_comments"] == report["total_comments"]
    assert loaded["input_tokens"] == report["input_tokens"]
    assert loaded["gpt-4o-mini"]["jobs"] == report["gpt-4o-mini"]["jobs"]

# ---------------------------------------------------------------------------
# count_input_tokens

def _make_job_input(tmp_path, comments, forum=None) -> Path:
    """Write a batch request JSONL in the same format as job1-input.jsonl."""
    p = tmp_path / "job1-input.jsonl"
    with open(p, "w", encoding="utf-8") as f:
        for c in comments:
            f.write(json.dumps(ab.build_batch_request_line(c, forum, "gpt-4o-mini")) + "\n")
    return p


def test_count_input_tokens_matches_estimate(tmp_path):
    """count_input_tokens on a freshly written job file equals the sum estimate_tokens produces."""
    p = _make_job_input(tmp_path, COMMENTS, FORUM_ITEM)
    result = tk.count_input_tokens(p, "gpt-4o-mini")
    expected = sum(
        ab.estimate_tokens(ab.build_messages(c, FORUM_ITEM)[0], "gpt-4o-mini")
        for c in COMMENTS
    )
    assert result["total_tokens"] == expected
    assert result["total_requests"] == len(COMMENTS)


def test_count_input_tokens_fields(tmp_path):
    p = _make_job_input(tmp_path, COMMENTS, FORUM_ITEM)
    result = tk.count_input_tokens(p)
    assert set(result.keys()) == {"total_tokens", "total_requests", "min", "max", "mean"}
    assert result["min"] <= result["mean"] <= result["max"]
    assert result["min"] > 0


def test_count_input_tokens_empty_file(tmp_path):
    p = tmp_path / "empty.jsonl"
    p.write_text("", encoding="utf-8")
    result = tk.count_input_tokens(p)
    assert result["total_tokens"] == 0
    assert result["total_requests"] == 0


def test_count_input_tokens_includes_system_prompt(tmp_path):
    """Token count must be higher than user-message-only text length / 3 (prompt adds tokens)."""
    p = _make_job_input(tmp_path, [COMMENT_A], FORUM_ITEM)
    result = tk.count_input_tokens(p)
    user_chars = len(COMMENT_A.get("text", ""))
    # system prompt alone is hundreds of tokens; total must exceed naive user-text estimate
    assert result["total_tokens"] > user_chars // 3
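The three calculations pinned down by the `compute_report` tests can be sketched in isolation. This is a minimal illustration under stated assumptions, not the project's implementation: the helper name `estimate_batch`, the 0.9 buffer, and the sample token limit and price are all hypothetical values chosen for the example.

```python
import math


def estimate_batch(total_tokens: int, tokens_per_day: int,
                   price_per_million: float, limit_buffer: float = 0.9) -> dict:
    """Hypothetical helper mirroring the formulas named in the test docstrings."""
    # Usable daily budget after applying the safety buffer (assumed value).
    effective_limit = int(tokens_per_day * limit_buffer)
    return {
        "cost_$": round(total_tokens / 1_000_000 * price_per_million, 4),
        "est_queue_days": round(total_tokens / tokens_per_day, 2),
        "jobs": math.ceil(total_tokens / effective_limit),
    }


example = estimate_batch(total_tokens=2_500_000, tokens_per_day=2_000_000,
                         price_per_million=0.15)
print(example)
```

With these sample inputs the token total exceeds one buffered day of capacity, so the work is split into two jobs.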
217
tests/validate-sentiment.py
Normal file
@@ -0,0 +1,217 @@
"""Unit tests for analysis/validate.py — no file I/O beyond tmp_path."""

import json
import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))

try:
    import pandas as pd
except ImportError:
    pytest.skip("pandas not installed", allow_module_level=True)

import validate as vl

# ---------------------------------------------------------------------------
# Fixtures


def _write_jsonl(path: Path, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


RAW_ROWS = [
    {"forum_id": "452", "comment_id": "1", "title": "Support it",
     "text": "I support this.", "date": "2021-01-04T09:00:00", "author": "Alice"},
    {"forum_id": "452", "comment_id": "2", "title": "Oppose it",
     "text": "I oppose this.", "date": "2021-01-05T10:00:00", "author": "Bob"},
    {"forum_id": "452", "comment_id": "3", "title": "Neutral",
     "text": "No opinion.", "date": "2021-01-06T11:00:00", "author": "Carol"},
]

ANALYSIS_ROWS = [
    {"run_id": "r1", "forum_id": "452", "comment_id": "1", "input_title": "Support it",
     "analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
     "prompt_version": "abc1234", "stance": "support", "stance_confidence": 0.95,
     "stance_rationale": "Commenter says 'I support'.", "tone": "positive",
     "tags": ["student safety"], "truncated": False, "error": None},
    {"run_id": "r1", "forum_id": "452", "comment_id": "2", "input_title": "Oppose it",
     "analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
     "prompt_version": "abc1234", "stance": "oppose", "stance_confidence": 0.90,
     "stance_rationale": "Commenter says 'I oppose'.", "tone": "negative",
     "tags": [], "truncated": False, "error": None},
]

FORUM_ROW = {"forum_id": "452", "reg_title": "Policy X", "reg_desc": "Guidance on Y."}


@pytest.fixture()
def raw_jsonl(tmp_path) -> Path:
    p = tmp_path / "f452.jsonl"
    _write_jsonl(p, [FORUM_ROW] + RAW_ROWS)
    return p


@pytest.fixture()
def jobs_dir(tmp_path) -> Path:
    d = tmp_path / "jobs" / "f452-1"
    d.mkdir(parents=True)
    _write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
    return d

# ---------------------------------------------------------------------------
# load_raw


def test_load_raw_returns_only_comments(raw_jsonl):
    df = vl.load_raw(raw_jsonl)
    assert len(df) == 3
    assert set(df.columns) == set(vl.RAW_COLS)


def test_load_raw_correct_columns(raw_jsonl):
    df = vl.load_raw(raw_jsonl)
    for col in vl.RAW_COLS:
        assert col in df.columns


def test_load_raw_skips_forum_item(raw_jsonl):
    df = vl.load_raw(raw_jsonl)
    assert "reg_title" not in df.columns


# ---------------------------------------------------------------------------
# load_analysis


def test_load_analysis_skips_raw_files(tmp_path):
    d = tmp_path / "jobs" / "f452-1"
    d.mkdir(parents=True)
    _write_jsonl(d / "job1-output-raw.jsonl", ANALYSIS_ROWS)  # should be ignored
    _write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
    df = vl.load_analysis(d)
    assert len(df) == len(ANALYSIS_ROWS)


def test_load_analysis_concatenates_multiple_files(tmp_path):
    d = tmp_path / "jobs" / "f452-1"
    d.mkdir(parents=True)
    _write_jsonl(d / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
    _write_jsonl(d / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
    df = vl.load_analysis(d)
    assert len(df) == 2


def test_load_analysis_tags_serialized_as_json(jobs_dir):
    df = vl.load_analysis(jobs_dir)
    tags_val = df.loc[df["comment_id"] == "1", "tags"].iloc[0]
    assert isinstance(tags_val, str)
    assert json.loads(tags_val) == ["student safety"]


def test_load_analysis_empty_tags_serialized(jobs_dir):
    df = vl.load_analysis(jobs_dir)
    tags_val = df.loc[df["comment_id"] == "2", "tags"].iloc[0]
    assert json.loads(tags_val) == []

# ---------------------------------------------------------------------------
# join — by comment_id, not index


def test_join_by_comment_id_not_index(raw_jsonl, jobs_dir):
    raw = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    # Shuffle raw order so comment_id ordering differs from index
    raw = raw.sample(frac=1, random_state=42).reset_index(drop=True)
    merged = vl.join(raw, analysis)
    row_1 = merged[merged["comment_id"] == "1"].iloc[0]
    assert row_1["stance"] == "support"
    assert row_1["author"] == "Alice"


def test_join_unanalyzed_comment_has_null_stance(raw_jsonl, jobs_dir):
    """Comment 3 is in raw but not in analysis — stance should be NaN."""
    raw = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged = vl.join(raw, analysis)
    row_3 = merged[merged["comment_id"] == "3"].iloc[0]
    assert pd.isna(row_3["stance"])


def test_join_preserves_all_raw_comments(raw_jsonl, jobs_dir):
    raw = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged = vl.join(raw, analysis)
    assert len(merged) == len(raw)


def test_join_output_columns_in_order(raw_jsonl, jobs_dir):
    raw = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged = vl.join(raw, analysis)
    assert list(merged.columns) == vl.OUTPUT_COLS


# ---------------------------------------------------------------------------
# Duplicate comment_id handling


def test_duplicate_raw_id_flagged(raw_jsonl, jobs_dir):
    raw = vl.load_raw(raw_jsonl)
    # Manually duplicate a row
    raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)
    analysis = vl.load_analysis(jobs_dir)
    merged = vl.join(raw, analysis)
    # join still produces a row for each raw row (left join)
    assert len(merged) == len(raw)
    assert raw["comment_id"].duplicated().sum() == 1


def test_duplicate_analysis_id_produces_extra_rows(raw_jsonl, tmp_path):
    """Two analysis records for the same comment_id create two joined rows."""
    d = tmp_path / "jobs" / "f452-dup"
    d.mkdir(parents=True)
    dup_rows = [ANALYSIS_ROWS[0], {**ANALYSIS_ROWS[0], "stance": "oppose"}]
    _write_jsonl(d / "job1-output.jsonl", dup_rows)
    raw = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(d)
    merged = vl.join(raw, analysis)
    assert len(merged[merged["comment_id"] == "1"]) == 2


# ---------------------------------------------------------------------------
# Validation counts (smoke test — just confirm it runs without error)


def test_print_validation_runs(raw_jsonl, jobs_dir, capsys):
    raw = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged = vl.join(raw, analysis)
    vl.print_validation(raw, analysis, merged)
    out = capsys.readouterr().out
    assert "Raw comments" in out
    assert "Stance counts" in out
    assert "Tone counts" in out


# ---------------------------------------------------------------------------
# CSV output


def test_csv_written_to_jobs_dir(raw_jsonl, jobs_dir, tmp_path):
    raw = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged = vl.join(raw, analysis)
    out_path = jobs_dir / "review.csv"
    merged.to_csv(out_path, index=False, encoding="utf-8-sig")
    assert out_path.exists()
    loaded = pd.read_csv(out_path, encoding="utf-8-sig")
    assert list(loaded.columns) == vl.OUTPUT_COLS
    assert len(loaded) == len(raw)
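The join behaviour these tests describe (every raw comment kept, unanalyzed comments left with an empty stance, duplicate analysis records producing extra rows) can be illustrated without pandas. The function `left_join_by_comment_id` below is a hypothetical stand-in written for this example; it is not `validate.join`, whose implementation is not shown in this diff.

```python
from collections import defaultdict


def left_join_by_comment_id(raw_rows, analysis_rows):
    """Hypothetical stand-in for a left join keyed on comment_id."""
    # Group analysis records by comment_id so duplicates are preserved.
    by_id = defaultdict(list)
    for rec in analysis_rows:
        by_id[rec["comment_id"]].append(rec)

    merged = []
    for raw in raw_rows:
        matches = by_id.get(raw["comment_id"])
        if not matches:
            # No analysis for this comment: keep the raw row, stance stays empty.
            merged.append({**raw, "stance": None})
            continue
        # One output row per matching analysis record.
        for rec in matches:
            merged.append({**raw, "stance": rec["stance"]})
    return merged


raw_rows = [
    {"comment_id": "3", "author": "Carol"},
    {"comment_id": "1", "author": "Alice"},
]
analysis_rows = [{"comment_id": "1", "stance": "support"}]
merged = left_join_by_comment_id(raw_rows, analysis_rows)
```

Because matching is done on the key rather than on position, reordering `raw_rows` does not change which stance lands on which author.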
3888
viz/chart_tests/confidence_by_stance.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/cumulative_stance_area.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/cumulative_stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_diverging_bar.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_over_time.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_counts.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_heatmap.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_rowpct.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/confidence_by_stance.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_over_time.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_tone_heatmap.html
Normal file
File diff suppressed because one or more lines are too long
134
viz/prototype_charts.py
Normal file
@@ -0,0 +1,134 @@
'''
prototype_charts.py
generate test charts for later addition to streamlit
'''


from pathlib import Path
import pandas as pd
import plotly.express as px
import numpy as np

inp = Path(r"c:/users/moses/projects/vath/analysis/jobs/f452-1/review.csv")
out = Path("viz/")
out.mkdir(parents=True, exist_ok=True)

stance_order = ["support", "oppose", "neutral", "unknown"]

# tone_order = ["positive", "negative", "neutral", "mixed", "unknown", "unclear"]
# default order was actually better - unclear/negative/neutral/mixed/positive vs unknown/oppose/neutral/support
# same for pct w/in stance
df = pd.read_csv(inp)
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["date_day"] = df["date"].dt.date
df["stance"] = df["stance"].fillna("unknown")
df["tone"] = df["tone"].fillna("unknown")

# 1. stance share
counts = df["stance"].value_counts().reindex(stance_order, fill_value=0).reset_index()
counts.columns = ["stance", "count"]
fig = px.bar(counts, x="count", y="stance", orientation="h", text="count")
fig.write_html(out / "stance_share.html")

# 2. stance over time
daily = df.groupby(["date_day", "stance"]).size().reset_index(name="count")
fig = px.bar(daily, x="date_day", y="count", color="stance", category_orders={"stance": stance_order})
fig.write_html(out / "stance_over_time.html")

# 3. stance x tone
heat = df.groupby(["stance", "tone"]).size().reset_index(name="count")
fig = px.density_heatmap(heat, x="tone", y="stance", z="count", category_orders={"stance": stance_order})
fig.write_html(out / "stance_tone_heatmap.html")

# 4. confidence by stance
fig = px.box(df, x="stance", y="stance_confidence", category_orders={"stance": stance_order}, points="outliers")
fig.write_html(out / "confidence_by_stance.html")

# 5. cumulative stance and share over time
daily = (
    df.groupby(["date_day", "stance"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=stance_order, fill_value=0)
    .sort_index()
)

cum = daily.cumsum()
cum_long = cum.reset_index().melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")

fig = px.area(
    cum_long,
    x="date_day",
    y="cumulative_count",
    color="stance",
    category_orders={"stance": stance_order},
    title="cumulative comments by stance over time",
)
fig.write_html(out / "cumulative_stance_area.html")

cum_pct = cum.div(cum.sum(axis=1), axis=0).reset_index().melt(
    id_vars="date_day", var_name="stance", value_name="cumulative_share"
)

fig = px.line(
    cum_pct,
    x="date_day",
    y="cumulative_share",
    color="stance",
    category_orders={"stance": stance_order},
    title="cumulative stance share over time",
)
fig.update_yaxes(tickformat=".0%")
fig.write_html(out / "cumulative_stance_share.html")

# 7. diverging h-bar
stance_counts = df["stance"].value_counts().reindex(stance_order, fill_value=0)

div = pd.DataFrame({
    "stance": ["oppose", "support", "neutral", "unknown"],
    "count": [
        -stance_counts.get("oppose", 0),
        stance_counts.get("support", 0),
        stance_counts.get("neutral", 0),
        stance_counts.get("unknown", 0),
    ],
})

fig = px.bar(
    div,
    x="count",
    y="stance",
    orientation="h",
    text=div["count"].abs(),
    title="support vs oppose",
)
fig.update_xaxes(title="comments", zeroline=True)
fig.update_traces(textposition="outside")
fig.write_html(out / "stance_diverging_bar.html")

# 8. Stance x Tone labels
heat = pd.crosstab(df["stance"], df["tone"]).reindex(
    index=stance_order,
    # tone_order is commented out above, so keep the default (alphabetical) tone order
    columns=sorted(df["tone"].unique()),
    fill_value=0,
)

fig = px.imshow(
    heat,
    text_auto=True,
    aspect="auto",
    title="stance x tone, count",
)
fig.write_html(out / "stance_tone_counts.html")

rowpct = heat.div(heat.sum(axis=1).replace(0, np.nan), axis=0)

fig = px.imshow(
    rowpct,
    text_auto=".0%",
    aspect="auto",
    title="stance x tone, percent within stance",
)
fig.write_html(out / "stance_tone_rowpct.html")
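The cumulative-share chart rests on a running total per stance divided by the running total across all stances. A minimal sketch of that arithmetic in plain Python follows; the function name `cumulative_share` and the sample counts are assumptions made for illustration, and days with no comments so far are given a share of 0.0 here, whereas the pandas version yields NaN for such days.

```python
from itertools import accumulate


def cumulative_share(daily_counts: dict[str, list[int]]) -> dict[str, list[float]]:
    """Hypothetical helper: per-stance share of all comments received so far."""
    # Running total per stance, day by day.
    cumulative = {s: list(accumulate(c)) for s, c in daily_counts.items()}
    n_days = len(next(iter(cumulative.values())))
    # Running total across all stances for each day.
    totals = [sum(cumulative[s][i] for s in cumulative) for i in range(n_days)]
    return {
        s: [v / t if t else 0.0 for v, t in zip(vals, totals)]
        for s, vals in cumulative.items()
    }


shares = cumulative_share({"support": [1, 1, 0], "oppose": [1, 3, 2]})
print(shares)
```

Each day's shares sum to 1 once at least one comment has been received, which is why the chart's y-axis can be formatted as a percentage.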
28
viz/prototype_streamlit.py
Normal file
@@ -0,0 +1,28 @@
# streamlit run viz/prototype_streamlit.py
from datetime import datetime
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import streamlit as st

df = pd.read_csv(r"analysis/jobs/f452-1/review.csv")
st.set_page_config(layout="wide")

stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
q = st.text_input("Search comment text")
dff = df[df["stance"].isin(stance)]
if q:
    dff = dff[dff["text"].fillna("").str.contains(q, case=False, regex=False)]

st.dataframe(dff[["comment_id", "title", "stance", "stance_confidence", "tone"]], width="stretch")
st.write("Showing " + str(len(dff))+ " comments")

cid = st.selectbox("comment", dff["comment_id"].astype(str))
row = dff[dff["comment_id"].astype(str) == cid].iloc[0]

st.subheader(row["title"])
st.write(row["text"])
st.write(row["author"] + ", " + row["date"][:10])
st.write("**model:** " + str(row["model"]))
st.markdown("**stance:** " + str(row["stance"]) + "  \n**confidence:** " + str(row["stance_confidence"]) + "  \n**tone:** " + str(row["tone"]))
st.write("**analysis:** "+ row["stance_rationale"])
189
viz/streamlit.py
Normal file
@@ -0,0 +1,189 @@
# streamlit run viz/streamlit.py -- --jobs-dir analysis/jobs/f452-1
import argparse
from pathlib import Path
from datetime import datetime as dt
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import streamlit as st

parser = argparse.ArgumentParser()
parser.add_argument("--jobs-dir", default="analysis/jobs/f452-1", type=Path,
                    help="Job directory containing review.csv, forum.jsonl, and prompt.txt")
args, _ = parser.parse_known_args()  # parse_known_args: ignore Streamlit's own argv entries
workdir = args.jobs_dir
df = pd.read_csv(workdir/"review.csv")
df['date_dt'] = pd.to_datetime(df.date)
df["date_day"] = df["date_dt"].dt.date
forum = pd.read_json(workdir/"forum.jsonl", lines=True).iloc[0].to_dict()
prompt = (workdir/"prompt.txt").read_text(encoding="utf-8")

stance_colors = {'oppose':'#ffa15a', 'neutral':'#e377c2','support':'#19d3f3','unknown':'#000000'}
stance_order = ["oppose", "mixed", "unknown", "neutral", "support"]

st.set_page_config(layout="wide")
st.title("Virginia Townhall Explorer",anchor=None)
st.caption("Explore data collected from Virginia's public comment system. Source code at https://github.com/eulaly/vath")

st.subheader("Proposal",anchor=None,divider="gray")
st.markdown(f"**{forum.get('reg_title')}**")
st.text(forum.get('reg_desc'))
st.caption(f'Comments posted from {dt.strftime(min(df.date_dt),"%D")}—{dt.strftime(max(df.date_dt),"%D")} at https://www.townhall.virginia.gov/L/Comments.cfm?GDocForumID={forum.get("forum_id")}')

st.subheader("Comment Summary",anchor=False,divider="gray")
summary_left, summary_right = st.columns([1,2])
with summary_left:
    # Summary Table
    summary_stats = (
        df.groupby("stance").size()
        .reindex(stance_order, fill_value=0)
        .reset_index(name="count")
        .assign(percent=lambda d: (d["count"] / d["count"].sum()).map("{:.1%}".format))
    )

    st.dataframe(summary_stats, hide_index=True, width="stretch")
with summary_right:
    # Stance div-h
    counts = df["stance"].value_counts()
    stance_divh = go.Figure()
    stance_divh.add_bar(y=["stance"], x=[-counts.get("oppose",0)], name="oppose", orientation="h", marker_color=stance_colors.get('oppose'), text=[counts.get("oppose",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("neutral",0)], name="neutral", orientation="h", marker_color=stance_colors.get('neutral'), text=[counts.get("neutral",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("unknown",0)], name="unknown", orientation="h", marker_color=stance_colors.get('unknown'), text=[counts.get("unknown",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("support",0)], name="support", orientation="h", marker_color=stance_colors.get('support'), text=[counts.get("support",0)], textposition="inside")
    stance_divh.update_yaxes(title_text="",showticklabels=False)
    stance_divh.update_layout(barmode="relative", title="", height=180, margin=dict(l=0,r=0,t=0,b=0),xaxis_title="", yaxis_title="",legend=dict(orientation="v",y=0.12))
    st.plotly_chart(stance_divh,width='stretch')

# Daily Comments Breakdown, 3 Tabs
daily_wide = (
    df.groupby(["date_day", "stance"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=stance_order, fill_value=0)
    .sort_index()
)

daily_long = (
    daily_wide.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="count")
)

cum_wide = daily_wide.cumsum()

cum_long = (
    cum_wide.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
)

cum_total = cum_wide.sum(axis=1)
cum_share = cum_wide.div(cum_total.where(cum_total > 0), axis=0)

cum_share_long = (
    cum_share.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="cumulative_share")
)


tab_daily, tab_area, tab_share = st.tabs([
    "Daily",
    "Cumulative",
    "Cumulative Share",
])

with tab_daily:
    fig = px.bar(
        daily_long,
        x="date_day",
        y="count",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_layout(barmode="stack", height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

with tab_area:
    fig = px.area(
        cum_long,
        x="date_day",
        y="cumulative_count",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_layout(height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

with tab_share:
    fig = px.line(
        cum_share_long,
        x="date_day",
        y="cumulative_share",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_yaxes(tickformat=".0%", range=[0, 1])
    fig.update_layout(height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

st.subheader("Comment Explorer",anchor=False,divider="gray")
# comment explorer
cex_left, cex_right = st.columns([1,1])
with cex_left:
    filter_stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
    filter_tone = st.multiselect("Filter tone", sorted(df["tone"].dropna().unique()), default=sorted(df["tone"].dropna().unique()))
    dff = df[df["stance"].isin(filter_stance) & df["tone"].isin(filter_tone)]

with cex_right:
    q = st.text_input("Search comment title and text")
    if q:
        # match the query against title and text, as the input label states
        haystack = dff["title"].fillna("") + " " + dff["text"].fillna("")
        dff = dff[haystack.str.contains(q, case=False, regex=False)]
    st.text(""); st.text("")
    st.text("Showing " + str(len(dff))+ " comments",text_alignment="right", width="stretch")

st.dataframe(dff[["comment_id", "title", "text", "stance", "stance_confidence", "tone"]], width="stretch")

cid = st.selectbox("Select comment to view:", dff["comment_id"].astype(str))
row = dff[dff["comment_id"].astype(str) == cid].iloc[0]

st.markdown(f'**{row["title"]}**')
st.text(row["text"])
st.write(row["author"] + ", " + row["date_dt"].strftime("%D"))

st.divider()

st.subheader('Analysis')
cexs_left, cexs_right = st.columns([1,1])
with cexs_left:
    st.write(f"**stance:** {row['stance']}")
    st.write(f"**stance_confidence:** {row['stance_confidence']:.2f}")
    st.write(f"**tone:** {row['tone']}")
    st.write("**analysis:** "+ row["stance_rationale"])
with cexs_right:
    x_order = ["unknown","oppose","mixed","neutral","support"]  # includes mixed even if absent; harmless zero column
    y_order = ["positive","neutral","mixed","negative","unclear"]
    tab = pd.crosstab(df["tone"], df["stance"]).reindex(index=y_order, columns=x_order, fill_value=0)
    pct = tab.div(tab.sum(axis=1).replace(0, pd.NA), axis=0).fillna(0)
    tone_stance = px.imshow(
        pct,
        x=x_order, y=y_order,
        text_auto=".0%",
        aspect="auto",
        color_continuous_scale="Greens",
    )
    tone_stance.update_traces(text=tab.astype(str) + " / " + (pct*100).round(0).astype(int).astype(str) + "%")
    tone_stance.add_scatter(x=[row["stance"]],y=[row["tone"]],mode="markers",marker=dict(size=15,color="yellow",symbol="cross",line=dict(width=1, color="red")),showlegend=False)
    tone_stance.update_layout(height=420, xaxis_title="stance", yaxis_title="tone")
    st.plotly_chart(tone_stance, width='stretch')
    st.caption("Tone by stance, % within tone", text_alignment="right",width="stretch")

st.divider()
st.write("**model:** " + str(row["model"]))
with st.expander("Prompt", expanded=False):
    st.code(prompt, language="text")

tone_conf = px.box(df,x="stance",y="stance_confidence",color="stance",category_orders={"stance":stance_order},color_discrete_map=stance_colors,points="outliers",title="Comment Stance Classification Confidence")
tone_conf.update_yaxes(range=[0,1.02])
tone_conf.update_layout(height=430, legend_orientation="v")
st.plotly_chart(tone_conf,width="stretch")