added forum metadata for later use

streamlit > local docker
updated readme
2026-05-09 00:36:30 -04:00 · 2026-05-09 00:25:27 -04:00 · 2026-05-09 00:02:24 -04:00 · 2026-05-09 00:00:59 -04:00 · 2026-05-08 23:57:46 -04:00 · 2026-05-08 23:33:55 -04:00
62 changed files with 98713 additions and 812 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -28,4 +28,5 @@ archive/
 output/
 # --- misc ---
-.DS_Store
+.DS_Store
 *~$*
--- a/README.md
+++ b/README.md
@@ -1,21 +1,5 @@
-# Table of Contents
+## Project Goals
 1.  [Project Goals](#org5acb669)
    1.  [Document and analyze sentiment](#org9291576)
    2.  [Make data available](#org8054421)
    3.  [Generalize](#orgdda4b6f)
 2.  [Architecture](#org1d6bc40)
    1.  [Scraper](#org4298028)
    2.  [Storage](#org1cd413c)
    3.  [Analysis](#orgaea450e)
 3.  [Roadmap](#org6b7660d)
 <a id="org5acb669"></a>
 # Project Goals
 1.  Document and analyze sentiment of public comments on Virginia law, to determine:
    1.  the utility of this forum as a mechanism for public comment, and
@@ -23,131 +7,127 @@
 2.  Make data and insights broadly available.
 3.  Generalize to other public comment tools.
-
+Take a look at https://vatownhall.streamlit.app
-<a id="org9291576"></a>
+![img](./docs/streamlit-snapshot.png)
 ## Document and analyze sentiment
 -   Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
 -   Build tests for identifying abuse, such as spam and account fraud
 -   Identify any patterns connecting measured sentiment against VA decisions
-<a id="org8054421"></a>
+### Research questions
-## Make data available
+1.  What is the quality of the comments on the forum?
-
+    1.  Are there duplicate entries?
-   Pick a good visualization tool
+    2.  Are there non-human-generated entries?
    3.  Are there entries intended to abuse the forum or drown out comment?
 2.  How do commenters feel about the proposed change?
    1.  What is the total number and percent supporting vs opposing, and how does this change over time?
    2.  What is the type of support, such as strong/weak, positive/negative?
 3.  What impact do the comments have on the proposed change?
    (I anticipate this will not be measurable from currently available data)
-<a id="orgdda4b6f"></a>
+<a id="orgfabfcd9"></a>
-## Generalize
+## Architecture
-   Identify scalable ways to apply this toolset to similar problems
+1.  Scrape/Parse: Scrapy
 2.  Sentiment analysis: gpt-5.4-mini
 3.  Display: streamlit
 4.  Storage: jsonl, csv, parquet
 ![img](./docs/pipeline-v1.2.3.svg)
-<a id="org1d6bc40"></a>
+<a id="org2c5c7a2"></a>
-# Architecture
+### Scraper
-1.  Scrape/Parse: ****Scrapy**** for downloading comments
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
-2.  Storage: json
+
-3.  Sentiment analysis: Claude haiku
+1.  Forums listing page: `Forums.cfm` lists all open forums with agency, reg title, action type, brief description, closing date, comment count
-4.  Display: TBD
+2.  Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` lists comments with title, author, date
 3.  Individual comment page: `viewcomments.cfm?commentid=X` shows regulation title + brief description at the top, plus the comment
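The three comment listing URL forms above differ only in which query parameter carries the identifier. A minimal sketch of telling them apart, assuming only the parameter names listed above; the `classify_listing_url` helper is hypothetical and is not part of the scraper:

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse

# Query parameters named in the list above, matched case-insensitively.
LISTING_ID_PARAMS = ("gdocforumid", "stageid", "petitionid")


def classify_listing_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (id_kind, id_value) for a comments.cfm URL, or None if unrecognized."""
    parsed = urlparse(url)
    # Compare the final path segment so viewcomments.cfm is not mistaken for comments.cfm.
    page = parsed.path.rsplit("/", 1)[-1].lower()
    if page != "comments.cfm":
        return None
    params = {key.lower(): values[0] for key, values in parse_qs(parsed.query).items()}
    for name in LISTING_ID_PARAMS:
        if name in params:
            return name, params[name]
    return None
```

Calling `classify_listing_url("comments.cfm?stageid=7")` returns the kind and the raw ID string, which a spider could use to decide how to label the forum.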
-<a id="org4298028"></a>
+<a id="org72990f4"></a>
-## Scraper
+### Analysis
-Scrapy provides a simple mechanism for browsing and 
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We add the proposed change as context to the model.
-1.  Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
+Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
-2.  Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
+
-3.  Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
+Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to check their status and download the results.
 We selected gpt-5.4-mini for a good balance of quality, cost, and time.
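The estimate step can be pictured with a rough, dependency-free sketch. This is not `tokenizer.py` itself, which uses tiktoken. It assumes a heuristic of about one token per three characters plus four tokens of overhead per message. The `usd_per_million_tokens` argument is a placeholder to fill in from current pricing, not a real price.

```python
def estimate_tokens(messages: list) -> int:
    """Rough estimate: 4 tokens of overhead per message plus 1 token per 3 characters."""
    return sum(4 + len(m["content"]) // 3 for m in messages)


def estimate_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Scale a token count by a caller-supplied price (placeholder, not a real rate)."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    sample = [
        {"role": "system", "content": "You are an expert policy analyst."},
        {"role": "user", "content": "I oppose this regulation."},
    ]
    print(estimate_tokens(sample))
```

A character-based estimate is only a sanity check; for budgeting real batch jobs, the tiktoken-based counts in `report.json` are the ones to trust.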
 1.  Prompt
    ```
    You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
    regulatory comment system. You will be given the text of a proposed regulation and a single
    public comment. Return ONLY a JSON object — no other text.
    Definitions:
    -   stance: the commenter's position on whether the regulation should be adopted.
        "support" = wants it approved (as-is or with changes);
        "oppose"  = wants it rejected or substantially weakened;
        "neutral" = takes no position, asks a question, or provides factual input only;
        "unknown" = too vague, off-topic, or uninterpretable to classify.
    -   tone: the emotional register of the writing, independent of stance.
        "positive" = affirming, hopeful, appreciative;
        "negative" = angry, fearful, alarmed, or contemptuous;
        "neutral"  = matter-of-fact, procedural, or informational;
        "mixed"    = contains both positive and negative emotional content;
        "unclear"  = tone cannot be determined (e.g., a one-word comment).
    -   stance_confidence: float 0.0-1.0, your confidence in the stance label.
    -   stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
    -   tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
        "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
        "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
    Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
    ```
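Because the prompt fixes the keys and the allowed labels, each response can be checked mechanically before it is stored. A minimal sketch, assuming the response arrives as a raw JSON string; the `validate_classification` helper is hypothetical and not part of the repository:

```python
import json

# Allowed values, copied from the prompt definitions above.
STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_classification(raw: str) -> list:
    """Return a list of problems found in one model response (empty if valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(data)}")

    stance = data.get("stance")
    if not isinstance(stance, str) or stance not in STANCES:
        problems.append(f"invalid stance: {stance!r}")

    tone = data.get("tone")
    if not isinstance(tone, str) or tone not in TONES:
        problems.append(f"invalid tone: {tone!r}")

    conf = data.get("stance_confidence")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append(f"stance_confidence out of range: {conf!r}")

    tags = data.get("tags")
    if not isinstance(tags, list) or len(tags) > 5:
        problems.append("tags must be a list of at most 5 labels")
    return problems
```

Records that fail the check could be routed to the errors file rather than silently dropped, which keeps the review counts honest.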
-<a id="org1cd413c"></a>
+<a id="org58a5b72"></a>
-## Storage
+### Storage
-One JSONL file per forum/bill.
+-   Each scraped forum is saved to `output/<forum-id>.jsonl`
 -   Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
 -   Each job is saved to `analysis/jobs/<report-id>`:
     └─`forum.jsonl` is a copy of the scraped forum for convenience  
     └─`prompt.txt` is a copy of the prompt used  
     └─`report.json` is a copy of the report used  
     └─`status.json` contains metadata about the job  
    For each batch in the job, four files are created:  
     └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting  
     └─`jobN-output-raw.jsonl` contains the exact response from the API  
     └─`jobN-output.jsonl` contains the normalized analysis records parsed from the API response  
     └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)  
 -   Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
-<a id="orgaea450e"></a>
+<a id="org24fe465"></a>
-## Analysis
+## Instructions
-Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
+1.  Scrape the forum.  
-
+    `python`  
-<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
+2.  Run model report.  
    `python analysis/tokenizer.py <input> --prompt <prompt>`  
 3.  To run a realtime subset:  
    `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`  
    `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`  
 4.  To create and run the whole thing in batches, first create the batch jobs from the report:  
    `python analysis/openai_batch.py create <report> --model <model>`  
    `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`  
 5.  Then, run the jobs sequentially. Don't submit more than one at a time: if the model's enqueued-token quota fills up, the batch will fail, and resubmission is not implemented.  
    `python analysis/openai_batch.py submit`  
    `python analysis/openai_batch.py status`  
    `python analysis/openai_batch.py download`  
    `python analysis/openai_batch.py submit`  
-<colgroup>
+<a id="org5739d49"></a>
 <col  class="org-left" />
 <col  class="org-left" />
 <col  class="org-left" />
 <col  class="org-left" />
 <col  class="org-left" />
 <col  class="org-left" />
 </colgroup>
 <thead>
 <tr>
 <th scope="col" class="org-left">Tool</th>
 <th scope="col" class="org-left">Output</th>
 <th scope="col" class="org-left">Context</th>
 <th scope="col" class="org-left">Sarcasm</th>
 <th scope="col" class="org-left">Context window</th>
 <th scope="col" class="org-left">Cost/1k comments</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td class="org-left">Google NL API</td>
 <td class="org-left">-1→+1, magnitude</td>
 <td class="org-left">No/generic</td>
 <td class="org-left">Poorly</td>
 <td class="org-left">No</td>
 <td class="org-left">~$1–2</td>
 </tr>
 <tr>
 <td class="org-left">Amazon Comprehend</td>
 <td class="org-left">Pos/Neg/Neutral/Mixed</td>
 <td class="org-left">No/generic</td>
 <td class="org-left">Poorly</td>
 <td class="org-left">No</td>
 <td class="org-left">~$0.10</td>
 </tr>
 <tr>
 <td class="org-left">Claude Haiku</td>
 <td class="org-left">Prompted → for/against/neutral</td>
 <td class="org-left">Yes</td>
 <td class="org-left">Yes, with prompt</td>
 <td class="org-left">Yes</td>
 <td class="org-left">~$0.10–0.30</td>
 </tr>
 <tr>
 <td class="org-left">GPT-4o-mini</td>
 <td class="org-left">Prompted → same</td>
 <td class="org-left">Yes</td>
 <td class="org-left">Yes</td>
 <td class="org-left">Yes</td>
 <td class="org-left">~$0.05–0.15</td>
 </tr>
 </tbody>
 </table>
 <a id="org6b7660d"></a>
 # Roadmap
--- a/agents.md
+++ b/agents.md
@@ -43,3 +43,4 @@ Description and PM notes
 - project dir: `%userprofile%\projects\vath\`
 - python venv: `%userprofile%\projects\vath\venv\scripts\activate`
 - pytest (inside venv): `python -m pytest tests/`
  - create tests without the `test_` prefix, e.g. `tests/tokenizer.py`, not `tests/test_tokenizer.py`
--- a/analysis/create_csv.py
+++ b/analysis/create_csv.py
@@ -0,0 +1,76 @@
 #!/usr/bin/env python3
 """analysis/create_csv.py — join raw scrape with analysis output for review."""
 import argparse
 from pathlib import Path
 import pandas as pd
 RAW_COLS = ["forum_id", "comment_id", "title", "text", "date", "author"]
 ANALYSIS_COLS = [
    "stance", "stance_confidence", "stance_rationale", "tone", "tags",
    "error", "truncated", "analyzed_at", "prompt_version", "model",
 ]
 OUTPUT_COLS = RAW_COLS + ANALYSIS_COLS
 def load_raw(path: Path) -> pd.DataFrame:
    df = pd.read_json(path, lines=True)
    df = df[df["comment_id"].notna()] # rm first item (forum, not comment)
    for col in RAW_COLS:
        if col not in df.columns:
            df[col] = None
    return df[RAW_COLS].copy()
 def load_analysis(jobs_dir: Path) -> pd.DataFrame:
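    files = sorted(p for p in jobs_dir.glob("job*-output.jsonl") if "-raw" not in p.name)
    if not files:
        # pd.concat raises an opaque ValueError on an empty list; fail with a clear message instead.
        raise SystemExit(f"No job*-output.jsonl files found in {jobs_dir}")
    df = pd.concat([pd.read_json(p, lines=True) for p in files], ignore_index=True)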
    for col in ANALYSIS_COLS:
        if col not in df.columns:
            df[col] = None
    return df[["comment_id"] + ANALYSIS_COLS].copy()
 def join(raw: pd.DataFrame, analysis: pd.DataFrame) -> pd.DataFrame:
    return raw.merge(analysis, on="comment_id", how="left")[OUTPUT_COLS]
 def print_counts(raw: pd.DataFrame, analysis: pd.DataFrame, merged: pd.DataFrame) -> None:
    print(f"\nRaw comments  : {len(raw):,}")
    print(f"Analyzed      : {len(analysis):,}")
    print(f"Joined        : {merged['stance'].notna().sum():,}")
    print(f"Unanalyzed    : {merged['stance'].isna().sum():,}")
    print(f"Errors        : {analysis['error'].notna().sum():,}")
    print(f"Dup IDs (raw) : {raw['comment_id'].duplicated().sum():,}")
    print(f"\nStance:\n{analysis['stance'].value_counts(dropna=False).to_string()}")
    print(f"\nTone:\n{analysis['tone'].value_counts(dropna=False).to_string()}\n")
 def main() -> None:
    p = argparse.ArgumentParser(
        description="Join raw scrape JSONL with analysis output; write review CSV."
    )
    p.add_argument("input",    help="Raw scrape JSONL (e.g. output/f452.jsonl)")
    p.add_argument("jobs_dir", help="Job directory containing job*-output.jsonl files")
    p.add_argument("--parquet", action="store_true", help="Also write review.parquet")
    p.add_argument("--out", default=None, help="Output CSV path (default: <jobs_dir>/review.csv)")
    args = p.parse_args()
    raw      = load_raw(Path(args.input))
    analysis = load_analysis(Path(args.jobs_dir))
    merged   = join(raw, analysis)
    print_counts(raw, analysis, merged)
    out = Path(args.out) if args.out else Path(args.jobs_dir) / "review.csv"
    merged.to_csv(out, index=False, encoding="utf-8-sig")
    print(f"CSV     → {out}")
    if args.parquet:
        pq = out.with_suffix(".parquet")
        merged.to_parquet(pq, index=False)
        print(f"Parquet → {pq}")
 if __name__ == "__main__":
    main()
--- a/analysis/encoding.py
+++ b/analysis/encoding.py
@@ -0,0 +1,74 @@
 """
 analysis/encoding.py — text encoding repair for scraped content.
 The townhall.virginia.gov scraper forces UTF-8 decoding, which is correct for the
 site's current content. This module provides a defensive repair function for cases
 where a response arrives with Windows-1252/cp1252 bytes embedded in otherwise UTF-8
 content (common in older CMSes). The raw scrape files are never modified; repair is
 applied at the analysis and reporting layers only.
 Primary: uses `ftfy` when installed (pip install ftfy).
 Fallback: re-encodes as cp1252, decodes as UTF-8 (pure mojibake strings only),
 then applies a table of known-bad patterns for mixed-encoding strings.
 """
 # ---------------------------------------------------------------------------
 # Known patterns: UTF-8 bytes decoded as cp1252, i.e. the 3-char sequences you
 # see when a server sends e.g. E2 80 99 and it gets decoded as cp1252 chars.
 #
 # Byte → cp1252 char mappings for the 0x80–0x9F range:
 #   E2 → â  (U+00E2, always)
 #   80 → €  (U+20AC, cp1252 0x80)
 #   99 → ™  (U+2122, cp1252 0x99)  ← E2 80 99 = U+2019 ' right single quote
 #   98 → ˜  (U+02DC, cp1252 0x98)  ← E2 80 98 = U+2018 ' left single quote
 #   9C → œ  (U+0153, cp1252 0x9C)  ← E2 80 9C = U+201C " left double quote
 #   9D → \x9d (undefined → U+009D) ← E2 80 9D = U+201D " right double quote
 #   93 → "  (U+201C, cp1252 0x93)  ← E2 80 93 = U+2013 – en dash
 #   94 → "  (U+201D, cp1252 0x94)  ← E2 80 94 = U+2014 — em dash
 #   A6 → ¦  (U+00A6, cp1252 0xA6)  ← E2 80 A6 = U+2026 … ellipsis
 _KNOWN_REPAIRS: list[tuple[str, str]] = [
    # Longer / more specific patterns first to avoid partial matches
    ("â€™",  "’"),  # â€™ → ' right single quote
    ("â€˜",  "‘"),  # â€˜ → ' left single quote
    ("â€œ",  "“"),  # â€œ → " left double quote
    ("â€\x9d",  "”"),  # â€\x9d → " right double quote
    ("â€“",  "–"),  # â€" (with left DQ) → – en dash
    ("â€”",  "—"),  # â€" (with right DQ) → — em dash
    ("â€¦",  "…"),  # â€¦ → … ellipsis
    # Generic fallback: bare â€ prefix not caught above → remove artifact
    ("â€",        ""),
 ]
 def repair_text(text: str) -> str:
    """Repair common encoding artifacts in scraped text.
    Handles:
    - UTF-8 bytes decoded as cp1252/Latin-1 (â€™ → ')
    - Attempts best-effort cleanup for mixed-encoding strings
    U+FFFD replacement characters (from strict UTF-8 decoding of cp1252 bytes)
    cannot be recovered since the original byte is lost; they are left as-is.
    """
    if not text:
        return text
    try:
        import ftfy
        return ftfy.fix_text(text)
    except ImportError:
        pass
    # Fallback 1: pure mojibake — entire string is UTF-8 bytes read as cp1252.
    # Re-encode as cp1252 and decode as UTF-8.
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass
    # Fallback 2: mixed strings — substitute known-bad patterns.
    for bad, good in _KNOWN_REPAIRS:
        if bad in text:
            text = text.replace(bad, good)
    return text
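The first fallback in `repair_text` can be seen in isolation with a short sketch. It assumes the damage is the classic case of UTF-8 bytes decoded as cp1252; `roundtrip_repair` is a hypothetical name used only for this illustration.

```python
def roundtrip_repair(text: str) -> str:
    """Undo 'UTF-8 bytes decoded as cp1252' when the whole string is affected."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Clean or mixed-encoding text does not survive the round trip; leave it alone.
        return text


if __name__ == "__main__":
    # "don" + U+00E2 U+20AC U+2122 + "t" is the mojibake form of "don’t".
    print(roundtrip_repair("don\u00e2\u20ac\u2122t"))
```

A string that mixes a genuine `é` with mojibake fails the round trip and comes back unchanged, which is the case the pattern table in `_KNOWN_REPAIRS` is meant to handle.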
--- a/analysis/gpt4o/init.py
+++ b/analysis/gpt4o/init.py
--- a/analysis/gpt4o/analysis_batch.py
+++ b/analysis/gpt4o/analysis_batch.py
@@ -1,556 +0,0 @@
 #!/usr/bin/env python3
 """
 analysis_batch.py — OpenAI Batch API pipeline
 Commands (run manually in order):
    submit   <input_jsonl> [--model gpt-4o] [--limit N]
                                           — build request file, upload, create batch
    status   [run_id]                      — check batch status, update manifest
    download [run_id]                      — download + normalize output, update manifest
 run_id defaults to the most recent run in runs/ when omitted.
 File layout (all under analysis/gpt4o/):
    requests/<run_id>.jsonl     — batch input sent to OpenAI
    raw/<run_id>.jsonl          — raw batch output from OpenAI
    runs/<run_id>.json          — run manifest
    <run_id>_<model>.jsonl      — normalized output (same schema as realtime)
 """
 import argparse
 import hashlib
 import json
 import os
 import sys
 from datetime import datetime, timezone
 from pathlib import Path
 from dotenv import load_dotenv
 try:
    import openai
 except ImportError:
    sys.exit("openai package not installed. Run: pip install openai")
 # ---------------------------------------------------------------------------
 # Model limits and token estimation
 # Max enqueued tokens across ALL concurrent batches for this model
 # (docs/openai.md pricing table, updated 2026-05-05).
 # NOTE: your org tier may be lower — if a submit fails, use --limit to reduce chunk size.
 MODEL_LIMITS: dict[str, int] = {
    "gpt-5.5":        900_000,
    "gpt-5.4":        900_000,
    "gpt-5.4-mini": 2_000_000,
    "gpt-5.4-nano":   200_000,
    "gpt-4o":         900_000,
    "gpt-4o-mini":  2_000_000,
    "gpt-o4-mini":  2_000_000,
 }
 _DEFAULT_TOKEN_LIMIT = 900_000
 # tiktoken encoding per model family; unknown models fall back to o200k_base
 _MODEL_ENCODING: dict[str, str] = {
    "gpt-5.5":       "o200k_base",
    "gpt-5.4":       "o200k_base",
    "gpt-5.4-mini":  "o200k_base",
    "gpt-5.4-nano":  "o200k_base",
    "gpt-4o":        "o200k_base",
    "gpt-4o-mini":   "o200k_base",
    "gpt-o4-mini":   "o200k_base",
 }
 # Leave 10% headroom below the published limit
 _LIMIT_BUFFER = 0.90
 def estimate_tokens(messages: list[dict], model: str) -> int:
    """Estimate token count for a messages list.
    Uses tiktoken when available (exact for OpenAI models); falls back to
    chars/3 + 4-token overhead per message for unknown/Anthropic models.
    """
    try:
        import tiktoken
        enc = tiktoken.get_encoding(_MODEL_ENCODING.get(model, "o200k_base"))
        return sum(4 + len(enc.encode(m["content"])) for m in messages)
    except ImportError:
        return sum(4 + len(m["content"]) // 3 for m in messages)
 def chunk_comments_by_tokens(
    comments: list[dict], forum: dict | None, model: str
 ) -> list[list[dict]]:
    """Split comments into chunks where each chunk fits under the model token limit."""
    raw_limit = MODEL_LIMITS.get(model, _DEFAULT_TOKEN_LIMIT)
    token_limit = int(raw_limit * _LIMIT_BUFFER)
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_tokens = 0
    for comment in comments:
        messages, _ = build_messages(comment, forum)
        tokens = estimate_tokens(messages, model)
        if current and current_tokens + tokens > token_limit:
            chunks.append(current)
            current = [comment]
            current_tokens = tokens
        else:
            current.append(comment)
            current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
 # ---------------------------------------------------------------------------
 # Prompt
 _DEFAULT_PROMPT_FILE = Path(__file__).parent.parent / "prompt-1.txt"
 SYSTEM_PROMPT = _DEFAULT_PROMPT_FILE.read_text(encoding="utf-8").strip()
 PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
 def _load_prompt(path: Path) -> None:
    """Re-read a prompt file, updating module-level SYSTEM_PROMPT and PROMPT_VERSION."""
    global SYSTEM_PROMPT, PROMPT_VERSION
    SYSTEM_PROMPT = path.read_text(encoding="utf-8").strip()
    PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
 USER_TEMPLATE = """\
 ## Proposed Regulation
 Title: {reg_title}
 Description: {reg_desc}
 ---
 ## Public Comment
 Comment ID: {comment_id}
 Title: {comment_title}
 Body:
 {comment_text}
 ---
 Classify this comment per the instructions. Return only JSON.\
 """
 MAX_COMMENT_CHARS = 6000
 # ---------------------------------------------------------------------------
 # Directories
 _SCRIPT_DIR  = Path(__file__).parent
 REQUESTS_DIR = _SCRIPT_DIR / "requests"
 RAW_DIR      = _SCRIPT_DIR / "raw"
 RUNS_DIR     = _SCRIPT_DIR / "runs"
 # ---------------------------------------------------------------------------
 # Core functions (importable for tests)
 def load_items(path: Path) -> tuple[dict | None, list[dict]]:
    """Read a scraped JSONL file. Returns (forum_item_or_None, [comment_items])."""
    forum = None
    comments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            if "comment_id" in item:
                comments.append(item)
            elif "reg_title" in item:
                forum = item
    return forum, comments
 def custom_id_from(comment_id: str) -> str:
    return f"comment_{comment_id}"
 def parse_custom_id(custom_id: str) -> str:
    """Return comment_id from a custom_id string."""
    return custom_id.removeprefix("comment_")
 def build_messages(comment: dict, forum: dict | None) -> tuple[list, bool]:
    """Build OpenAI messages for one comment. Returns (messages, truncated)."""
    reg_title = (forum or {}).get("reg_title", "[unknown]")
    reg_desc  = (forum or {}).get("reg_desc",  "[unknown]")
    body = (comment.get("text") or "").strip()
    truncated = False
    if not body:
        body = "[No body text provided]"
    elif len(body) > MAX_COMMENT_CHARS:
        body = body[:MAX_COMMENT_CHARS] + "... [truncated]"
        truncated = True
    user_text = USER_TEMPLATE.format(
        reg_title=reg_title,
        reg_desc=reg_desc,
        comment_id=comment.get("comment_id", ""),
        comment_title=comment.get("title", ""),
        comment_text=body,
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": user_text},
    ], truncated
 def build_batch_request_line(comment: dict, forum: dict | None, model: str) -> dict:
    """Build one line of the batch input JSONL."""
    messages, _ = build_messages(comment, forum)
    return {
        "custom_id": custom_id_from(comment["comment_id"]),
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": messages,
            "response_format": {"type": "json_object"},
            "temperature": 0.0,
        },
    }
 def normalize_output_line(
    raw_line: dict,
    comment_lookup: dict,
    run_id: str,
    analyzed_at: str,
    model: str,
    prompt_version: str,
 ) -> dict:
    """Convert one raw batch output line into a normalized analysis record.
    comment_lookup: {comment_id: CommentItem dict}
    prompt_version: taken from the run manifest so it reflects what was submitted.
    """
    comment_id = parse_custom_id(raw_line.get("custom_id", ""))
    comment = comment_lookup.get(comment_id, {})
    base = {
        "run_id":         run_id,
        "forum_id":       comment.get("forum_id", ""),
        "comment_id":     comment_id,
        "analyzed_at":    analyzed_at,
        "model":          model,
        "prompt_version": prompt_version,
        "input_title":    comment.get("title", ""),
        "truncated":      len((comment.get("text") or "").strip()) > MAX_COMMENT_CHARS,
    }
    # Check for outer-level batch error (e.g. batch_expired)
    if raw_line.get("error"):
        err = raw_line["error"]
        err_msg = err.get("message", str(err)) if isinstance(err, dict) else str(err)
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None, "error": err_msg}
    response = raw_line.get("response") or {}
    if response.get("status_code") != 200:
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None,
                "error": f"status {response.get('status_code')}"}
    try:
        content = response["body"]["choices"][0]["message"]["content"]
        data = json.loads(content)
        keys = ("stance", "stance_confidence", "stance_rationale", "tone", "tags")
        parsed = {k: data.get(k) for k in keys}
        return {**base, **parsed, "error": None}
    except Exception as exc:
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None, "error": str(exc)}
 def make_manifest(
    run_id: str,
    input_filename: str,
    input_sha256: str,
    model: str,
    batch_id: str,
    records_submitted: int,
    request_filename: str,
 ) -> dict:
    return {
        "run_id":                 run_id,
        "input_filename":         input_filename,
        "input_sha256":           input_sha256,
        "prompt_hash":            PROMPT_VERSION,
        "model":                  model,
        "batch_id":               batch_id,
        "records_submitted":      records_submitted,
        "records_completed":      None,
        "records_failed":         None,
        "request_filename":       request_filename,
        "raw_output_filename":    None,
        "normalized_output_filename": None,
        "created_at":             datetime.now(timezone.utc).isoformat(),
        "completed_at":           None,
    }
 def _latest_run_id() -> str:
    """Return the run_id of the most recently saved manifest, or exit if none found."""
    runs = list(RUNS_DIR.glob("*.json")) if RUNS_DIR.exists() else []
    if not runs:
        sys.exit(f"No runs found in {RUNS_DIR}. Submit a batch first.")
    latest = max(runs, key=lambda p: p.stat().st_mtime)
    return latest.stem
 def load_manifest(run_id: str) -> dict:
    path = RUNS_DIR / f"{run_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))
 def save_manifest(manifest: dict) -> None:
    RUNS_DIR.mkdir(parents=True, exist_ok=True)
    path = RUNS_DIR / f"{manifest['run_id']}.json"
    path.write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
 # ---------------------------------------------------------------------------
 # Subcommand: submit
 def _submit_chunk(
    chunk: list[dict],
    forum: dict | None,
    input_path: Path,
    input_sha256: str,
    model: str,
    client,
    chunk_index: int,
    total_chunks: int,
 ) -> str:
    """Upload and submit one chunk of comments. Returns the run_id."""
    import uuid
    run_id = str(uuid.uuid4())
    label = f"chunk {chunk_index + 1}/{total_chunks}" if total_chunks > 1 else "single batch"
    REQUESTS_DIR.mkdir(parents=True, exist_ok=True)
    request_path = REQUESTS_DIR / f"{run_id}.jsonl"
    with open(request_path, "w", encoding="utf-8") as f:
        for comment in chunk:
            line = build_batch_request_line(comment, forum, model)
            f.write(json.dumps(line, ensure_ascii=False) + "\n")
    print(f"[{label}] Wrote {len(chunk)} requests → {request_path}", file=sys.stderr)
    with open(request_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="batch")
    print(f"[{label}] Uploaded: {uploaded.id}", file=sys.stderr)
    batch = client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
        metadata={"run_id": run_id, "input_filename": str(input_path)},
    )
    print(f"[{label}] Batch created: {batch.id}  status={batch.status}", file=sys.stderr)
    manifest = make_manifest(
        run_id=run_id,
        input_filename=str(input_path),
        input_sha256=input_sha256,
        model=model,
        batch_id=batch.id,
        records_submitted=len(chunk),
        request_filename=str(request_path),
    )
    save_manifest(manifest)
    return run_id
 def cmd_submit(args, client) -> None:
    _load_prompt(Path(args.prompt))
    print(f"Prompt: {args.prompt}  (version {PROMPT_VERSION})", file=sys.stderr)
    input_path = Path(args.input)
    if not input_path.exists():
        sys.exit(f"File not found: {input_path}")
    print(f"Reading {input_path} ...", file=sys.stderr)
    forum, comments = load_items(input_path)
    if not comments:
        sys.exit("No comment items found in input file.")
    if forum is None:
        print("Warning: no ForumItem found — regulation context will be [unknown].", file=sys.stderr)
    if args.limit:
        comments = comments[:args.limit]
        print(f"Limiting to {len(comments)} comments (--limit {args.limit}).", file=sys.stderr)
    token_limit = int(MODEL_LIMITS.get(args.model, _DEFAULT_TOKEN_LIMIT) * _LIMIT_BUFFER)
    chunks = chunk_comments_by_tokens(comments, forum, args.model)
    total = len(chunks)
    print(
        f"Model: {args.model}  token limit: {token_limit:,}  "
        f"→ {len(comments)} comments split into {total} chunk(s).",
        file=sys.stderr,
    )
    input_sha256 = hashlib.sha256(input_path.read_bytes()).hexdigest()
    # Submit only the first chunk — the enqueued token limit is a TOTAL across all
    # concurrent batches, so stacking multiple submissions will exceed the quota.
    # Wait for each batch to complete before submitting the next.
    run_id = _submit_chunk(chunks[0], forum, input_path, input_sha256, args.model, client, 0, total)
    print(f"\nBatch 1/{total} submitted.", file=sys.stderr)
    print(f"  status:   python analysis/gpt4o/analysis_batch.py status {run_id}", file=sys.stderr)
    print(f"  download: python analysis/gpt4o/analysis_batch.py download {run_id}", file=sys.stderr)
    if total > 1:
        remaining = sum(len(c) for c in chunks[1:])
        print(f"\n{total - 1} more chunk(s) remaining ({remaining} comments).", file=sys.stderr)
        print("After this batch completes and is downloaded, rerun submit with --limit to get the next chunk:", file=sys.stderr)
        offset = len(chunks[0])
        for idx, chunk in enumerate(chunks[1:], start=2):
            print(f"  chunk {idx}/{total}: comments {offset}–{offset + len(chunk) - 1}", file=sys.stderr)
            offset += len(chunk)
    print(run_id)  # stdout for scripting
 # ---------------------------------------------------------------------------
 # Subcommand: status
 def cmd_status(args, client) -> None:
    run_id = args.run_id or _latest_run_id()
    if not args.run_id:
        print(f"(using latest run: {run_id})", file=sys.stderr)
    manifest = load_manifest(run_id)
    batch = client.batches.retrieve(manifest["batch_id"])
    counts = batch.request_counts
    print(f"status:     {batch.status}")
    print(f"completed:  {counts.completed}/{counts.total}")
    print(f"failed:     {counts.failed}")
    manifest["records_completed"] = counts.completed
    manifest["records_failed"]    = counts.failed
    save_manifest(manifest)
    if batch.status == "completed":
        print(f"\nReady to download. Run:")
        print(f"  python analysis/gpt4o/analysis_batch.py download {run_id}")
 # ---------------------------------------------------------------------------
 # Subcommand: download
 def cmd_download(args, client) -> None:
    run_id = args.run_id or _latest_run_id()
    if not args.run_id:
        print(f"(using latest run: {run_id})", file=sys.stderr)
    manifest = load_manifest(run_id)
    batch = client.batches.retrieve(manifest["batch_id"])
    if batch.status != "completed":
        sys.exit(f"Batch not complete yet (status={batch.status}). Run 'status' to check.")
    run_id    = manifest["run_id"]
    model     = manifest["model"]
    model_slug = model.replace("/", "-")
    # Download raw output
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    raw_path = RAW_DIR / f"{run_id}.jsonl"
    raw_text = client.files.content(batch.output_file_id).text
    raw_path.write_text(raw_text, encoding="utf-8")
    print(f"Raw output → {raw_path}", file=sys.stderr)
    # Build comment lookup from original input for reconciliation
    input_path = Path(manifest["input_filename"])
    _, comments = load_items(input_path)
    comment_lookup = {c["comment_id"]: c for c in comments}
    # Normalize
    completed_at = datetime.now(timezone.utc).isoformat()
    if batch.completed_at:
        completed_at = datetime.fromtimestamp(batch.completed_at, tz=timezone.utc).isoformat()
    normalized_path = _SCRIPT_DIR / f"{run_id}_{model_slug}.jsonl"
    n_ok = n_err = 0
    with open(normalized_path, "w", encoding="utf-8") as out:
        for line in raw_text.splitlines():
            if not line.strip():
                continue
            raw_line = json.loads(line)
            record = normalize_output_line(raw_line, comment_lookup, run_id, completed_at, model, manifest["prompt_hash"])
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            if record["error"]:
                n_err += 1
            else:
                n_ok += 1
    print(f"Normalized → {normalized_path}  ({n_ok} ok, {n_err} errors)", file=sys.stderr)
    manifest["records_completed"]         = n_ok
    manifest["records_failed"]            = n_err
    manifest["raw_output_filename"]       = str(raw_path)
    manifest["normalized_output_filename"] = str(normalized_path)
    manifest["completed_at"]              = completed_at
    save_manifest(manifest)
    print(f"Manifest updated → {RUNS_DIR / run_id}.json", file=sys.stderr)
 # ---------------------------------------------------------------------------
 # CLI
 def main() -> None:
    load_dotenv()
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        sys.exit("OPENAI_API_KEY not set. Create a .env file or export the variable.")
    parser = argparse.ArgumentParser(
        description="Public comment batch analysis pipeline.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )
    sub = parser.add_subparsers(dest="command", required=True)
    p_submit = sub.add_parser("submit", help="Build and submit a batch job")
    p_submit.add_argument("input", help="Path to scraped JSONL file")
    p_submit.add_argument("--model", default="gpt-4o", help="OpenAI model (default: gpt-4o)")
    p_submit.add_argument(
        "--prompt",
        default=str(_DEFAULT_PROMPT_FILE),
        help="Path to system prompt file (default: analysis/prompt-1.txt)",
    )
    p_submit.add_argument(
        "--limit", type=int, default=None, metavar="N",
        help="Submit only the first N comments (useful for staying under token quota)",
    )
    p_status = sub.add_parser("status", help="Check batch status")
    p_status.add_argument("run_id", nargs="?", default=None,
                          help="run_id from submit (default: most recent run)")
    p_download = sub.add_parser("download", help="Download and normalize completed batch")
    p_download.add_argument("run_id", nargs="?", default=None,
                            help="run_id from submit (default: most recent run)")
    args = parser.parse_args()
    client = openai.OpenAI(api_key=api_key)
    if args.command == "submit":
        cmd_submit(args, client)
    elif args.command == "status":
        cmd_status(args, client)
    elif args.command == "download":
        cmd_download(args, client)
 if __name__ == "__main__":
    main()
--- a/analysis/jobs/f452-1/forum.jsonl
+++ b/analysis/jobs/f452-1/forum.jsonl
--- a/analysis/jobs/f452-1/job1-input.jsonl
+++ b/analysis/jobs/f452-1/job1-input.jsonl
--- a/analysis/jobs/f452-1/job1-output-raw.jsonl
+++ b/analysis/jobs/f452-1/job1-output-raw.jsonl
--- a/analysis/jobs/f452-1/job1-output.jsonl
+++ b/analysis/jobs/f452-1/job1-output.jsonl
--- a/analysis/jobs/f452-1/job2-input.jsonl
+++ b/analysis/jobs/f452-1/job2-input.jsonl
--- a/analysis/jobs/f452-1/job2-output-raw.jsonl
+++ b/analysis/jobs/f452-1/job2-output-raw.jsonl
--- a/analysis/jobs/f452-1/job2-output.jsonl
+++ b/analysis/jobs/f452-1/job2-output.jsonl
--- a/analysis/jobs/f452-1/job3-input.jsonl
+++ b/analysis/jobs/f452-1/job3-input.jsonl
--- a/analysis/jobs/f452-1/job3-output-raw.jsonl
+++ b/analysis/jobs/f452-1/job3-output-raw.jsonl
--- a/analysis/jobs/f452-1/job3-output.jsonl
+++ b/analysis/jobs/f452-1/job3-output.jsonl
--- a/analysis/jobs/f452-1/job4-input.jsonl
+++ b/analysis/jobs/f452-1/job4-input.jsonl
--- a/analysis/jobs/f452-1/job4-output-raw.jsonl
+++ b/analysis/jobs/f452-1/job4-output-raw.jsonl
--- a/analysis/jobs/f452-1/job4-output.jsonl
+++ b/analysis/jobs/f452-1/job4-output.jsonl
--- a/analysis/jobs/f452-1/prompt.txt
+++ b/analysis/jobs/f452-1/prompt.txt
@@ -0,0 +1,23 @@
 You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
 regulatory comment system. You will be given the text of a proposed regulation and a single
 public comment. Return ONLY a JSON object — no other text.
 Definitions:
 - stance: the commenter's position on whether the regulation should be adopted.
  "support" = wants it approved (as-is or with changes);
  "oppose"  = wants it rejected or substantially weakened;
  "neutral" = takes no position, asks a question, or provides factual input only;
  "unknown" = too vague, off-topic, or uninterpretable to classify.
 - tone: the emotional register of the writing, independent of stance.
  "positive" = affirming, hopeful, appreciative;
  "negative" = angry, fearful, alarmed, or contemptuous;
  "neutral"  = matter-of-fact, procedural, or informational;
  "mixed"    = contains both positive and negative emotional content;
  "unclear"  = tone cannot be determined (e.g., a one-word comment).
 - stance_confidence: float 0.0-1.0, your confidence in the stance label.
 - stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
 - tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
 Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
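
Review note: the schema in this prompt is strict enough to check mechanically before a record is accepted. The sketch below is a minimal illustration, not part of this commit. `validate_record` is a hypothetical helper name, and the allowed labels are copied from the prompt text above.

```python
import json

# Allowed labels, copied from the prompt definitions.
STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_record(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the record passes.

    Hypothetical helper for illustration only.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(data)}")

    stance = data.get("stance")
    if not (isinstance(stance, str) and stance in STANCES):
        problems.append(f"bad stance: {stance!r}")

    tone = data.get("tone")
    if not (isinstance(tone, str) and tone in TONES):
        problems.append(f"bad tone: {tone!r}")

    conf = data.get("stance_confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not (is_number and 0.0 <= conf <= 1.0):
        problems.append(f"bad stance_confidence: {conf!r}")

    tags = data.get("tags")
    tags_ok = (
        isinstance(tags, list)
        and len(tags) <= 5
        and all(isinstance(t, str) for t in tags)
    )
    if not tags_ok:
        problems.append(f"bad tags: {tags!r}")

    return problems
```

A check like this could run on the `content` string before `normalize_output_line` copies the fields, so that off-schema labels surface in the `error` column instead of passing through silently.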
--- a/analysis/jobs/f452-1/report.json
+++ b/analysis/jobs/f452-1/report.json
@@ -0,0 +1,43 @@
 {
  "prompt": "analysis\\prompt-1.txt",
  "prompt_hash": "cb41250",
  "input_file": "output\\f452.jsonl",
  "input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
  "total_comments": 9083,
  "input_tokens": 6397254,
  "gpt-5.5": {
    "jobs": 9,
    "cost_$": 15.9931,
    "est_queue_days": 7.11
  },
  "gpt-5.4": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-5.4-mini": {
    "jobs": 4,
    "cost_$": 2.399,
    "est_queue_days": 3.2
  },
  "gpt-5.4-nano": {
    "jobs": 40,
    "cost_$": 0.6397,
    "est_queue_days": 31.99
  },
  "gpt-4o": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-4o-mini": {
    "jobs": 4,
    "cost_$": 0.4798,
    "est_queue_days": 3.2
  },
  "gpt-o4-mini": {
    "jobs": 4,
    "cost_$": 3.5185,
    "est_queue_days": 3.2
  }
 }
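
Review note: the per-model figures in this report are consistent with two simple relations against `MODEL_LIMITS` in `openai_batch.py`: `jobs` matches the ceiling of `input_tokens` divided by the buffered limit (limit × 0.80), and `est_queue_days` matches `input_tokens` divided by the unbuffered limit. `tokenizer.py` is not shown in this excerpt, so the sketch below is a reconstruction from the numbers rather than the author's code, and `estimate` is a hypothetical name.

```python
import math

INPUT_TOKENS = 6_397_254   # "input_tokens" from report.json
LIMIT_BUFFER = 0.80        # _LIMIT_BUFFER in openai_batch.py
LIMITS = {                 # subset of MODEL_LIMITS in openai_batch.py
    "gpt-4o": 900_000,
    "gpt-5.4-mini": 2_000_000,
    "gpt-5.4-nano": 200_000,
}


def estimate(model: str) -> tuple[int, float]:
    """Return (jobs, est_queue_days) under the assumed relations."""
    limit = LIMITS[model]
    budget = int(limit * LIMIT_BUFFER)           # per-job token budget
    jobs = math.ceil(INPUT_TOKENS / budget)      # lower bound on job count
    queue_days = round(INPUT_TOKENS / limit, 2)  # tokens over unbuffered limit
    return jobs, queue_days


for model in LIMITS:
    print(model, estimate(model))
```

The ceiling is only a lower bound on the job count: the real chunker packs whole comments, so it can need one more job than this estimate when comments do not divide evenly into the budget.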
--- a/analysis/jobs/f452-1/review.csv
+++ b/analysis/jobs/f452-1/review.csv
--- a/analysis/jobs/f452-1/review.xlsx
+++ b/analysis/jobs/f452-1/review.xlsx
--- a/analysis/jobs/f452-1/status.json
+++ b/analysis/jobs/f452-1/status.json
@@ -0,0 +1,57 @@
 {
  "model": "gpt-5.4-mini",
  "prompt_hash": "cb41250",
  "input_file": "output\\f452.jsonl",
  "input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
  "total_comments": 9083,
  "input_tokens": 6397254,
  "est_queue_days": 3.2,
  "cost_$": 2.399,
  "total_jobs": 4,
  "jobs": [
    {
      "job_num": 1,
      "run_id": "76c97113-63aa-43db-8f84-9c60ebcbb105",
      "status": "completed",
      "batch_id": "batch_69fb9081639881909be0c40d86edd747",
      "records_submitted": 2270,
      "records_completed": 2270,
      "records_failed": 0,
      "submitted_at": "2026-05-06T19:03:28.949240+00:00",
      "completed_at": "2026-05-06T20:09:14+00:00"
    },
    {
      "job_num": 2,
      "run_id": "b8f3b0bb-f155-4a5c-acce-f3504c0e09aa",
      "status": "completed",
      "batch_id": "batch_69fba02df7b481909e96afa1ee8879f5",
      "records_submitted": 2274,
      "records_completed": 2274,
      "records_failed": 0,
      "submitted_at": "2026-05-06T20:10:21.424330+00:00",
      "completed_at": "2026-05-06T20:37:11+00:00"
    },
    {
      "job_num": 3,
      "run_id": "8d769f37-6beb-4a1b-87ee-3f66cdc6adc8",
      "status": "completed",
      "batch_id": "batch_69fba69a85488190977792b6f95b614b",
      "records_submitted": 2282,
      "records_completed": 2282,
      "records_failed": 0,
      "submitted_at": "2026-05-06T20:37:45.586815+00:00",
      "completed_at": "2026-05-06T21:09:24+00:00"
    },
    {
      "job_num": 4,
      "run_id": "e6affbc2-ddc9-43a6-b8e9-d1f47e736283",
      "status": "completed",
      "batch_id": "batch_69fbe44565748190ad19f17ee3143f8d",
      "records_submitted": 2257,
      "records_completed": 2257,
      "records_failed": 0,
      "submitted_at": "2026-05-07T01:00:52.886953+00:00",
      "completed_at": "2026-05-07T09:20:01+00:00"
    }
  ]
 }
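
Review note: a status file like this one can be sanity-checked by confirming that the per-job record counts add up to `total_comments` and that every job finished. The sketch below is illustrative only. `summarize` is a hypothetical helper, and the inline data repeats just the fields it needs from the file above.

```python
def summarize(status: dict) -> dict:
    """Aggregate per-job counts from a status.json-style dict."""
    jobs = status["jobs"]
    submitted = sum(j["records_submitted"] for j in jobs)
    # Pending jobs store None for these counters, so treat None as 0.
    completed = sum(j["records_completed"] or 0 for j in jobs)
    failed = sum(j["records_failed"] or 0 for j in jobs)
    return {
        "submitted": submitted,
        "completed": completed,
        "failed": failed,
        "covers_input": submitted == status["total_comments"],
        "all_done": all(j["status"] == "completed" for j in jobs),
    }


# Counts copied from the status.json shown in this diff.
status = {
    "total_comments": 9083,
    "jobs": [
        {"status": "completed", "records_submitted": 2270,
         "records_completed": 2270, "records_failed": 0},
        {"status": "completed", "records_submitted": 2274,
         "records_completed": 2274, "records_failed": 0},
        {"status": "completed", "records_submitted": 2282,
         "records_completed": 2282, "records_failed": 0},
        {"status": "completed", "records_submitted": 2257,
         "records_completed": 2257, "records_failed": 0},
    ],
}

print(summarize(status))
```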
--- a/analysis/openai_batch.py
+++ b/analysis/openai_batch.py
@@ -0,0 +1,624 @@
 #!/usr/bin/env python3
 """
 openai_batch.py — OpenAI Batch API job runner
 Run tokenizer.py first to generate report.json, then:
    create   <report.json> --model <model>   — build job directory
    submit   [--job N] [--dir DIR]           — submit next eligible job
    status   [--job N] [--dir DIR]           — check job status
    download [--job N] [--dir DIR]           — download + normalize completed jobs
 DIR is a name under analysis/jobs/ (default: most recently created).
 """
 import argparse
 import hashlib
 import json
 import os
 import shutil
 import sys
 import uuid
 from datetime import datetime, timezone
 from pathlib import Path
 from dotenv import load_dotenv
 try:
    import openai
 except ImportError:
    sys.exit("openai package not installed. Run: pip install openai")
 # ---------------------------------------------------------------------------
 # Model limits and token estimation
 # Max enqueued tokens across ALL concurrent batches (docs/openai.md, 2026-05-05).
 # Org-tier limits may be lower; use --job to limit submission size if needed.
 MODEL_LIMITS: dict[str, int] = {
    "gpt-5.5":        900_000,
    "gpt-5.4":        900_000,
    "gpt-5.4-mini": 2_000_000,
    "gpt-5.4-nano":   200_000,
    "gpt-4o":         900_000,
    "gpt-4o-mini":  2_000_000,
    "gpt-o4-mini":  2_000_000,
 }
 _DEFAULT_TOKEN_LIMIT = 900_000
 _MODEL_ENCODING: dict[str, str] = {
    "gpt-5.5":       "o200k_base",
    "gpt-5.4":       "o200k_base",
    "gpt-5.4-mini":  "o200k_base",
    "gpt-5.4-nano":  "o200k_base",
    "gpt-4o":        "o200k_base",
    "gpt-4o-mini":   "o200k_base",
    "gpt-o4-mini":   "o200k_base",
 }
 _LIMIT_BUFFER = 0.80
 def estimate_tokens(messages: list[dict], model: str) -> int:
    """Token count per OpenAI cookbook chat formula; falls back to chars/3."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding(_MODEL_ENCODING.get(model, "o200k_base"))
        # Per OpenAI cookbook for gpt-4o: 3 overhead per message + role + content;
        # plus 3 tokens for the reply primer (<|start|>assistant<|message|>).
        total = 3  # reply primer
        for m in messages:
            total += 3
            total += len(enc.encode(m.get("role", "")))
            total += len(enc.encode(m["content"]))
        return total
    except ImportError:
        return 3 + sum(3 + len(m["content"]) // 3 for m in messages)
 def chunk_comments_by_tokens(
     comments: list[dict], forum: dict | None, model: str
 ) -> list[list[dict]]:
     """Greedily pack comments, in input order, into chunks under the buffered enqueued-token limit."""
    token_limit = int(MODEL_LIMITS.get(model, _DEFAULT_TOKEN_LIMIT) * _LIMIT_BUFFER)
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_tokens = 0
    for comment in comments:
        messages, _ = build_messages(comment, forum)
        tokens = estimate_tokens(messages, model)
        if current and current_tokens + tokens > token_limit:
            chunks.append(current)
            current = [comment]
            current_tokens = tokens
        else:
            current.append(comment)
            current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
 # ---------------------------------------------------------------------------
 # Prompt
 _DEFAULT_PROMPT_FILE = Path(__file__).parent / "prompt-1.txt"
 SYSTEM_PROMPT = _DEFAULT_PROMPT_FILE.read_text(encoding="utf-8").strip()
 PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
 def _load_prompt(path: Path) -> None:
    global SYSTEM_PROMPT, PROMPT_VERSION
    SYSTEM_PROMPT = path.read_text(encoding="utf-8").strip()
    PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
 USER_TEMPLATE = """\
 ## Proposed Regulation
 Title: {reg_title}
 Description: {reg_desc}
 ---
 ## Public Comment
 Comment ID: {comment_id}
 Title: {comment_title}
 Body:
 {comment_text}
 ---
 Classify this comment per the instructions. Return only JSON.\
 """
 MAX_COMMENT_CHARS = 6000
 # ---------------------------------------------------------------------------
 # Directories
 _SCRIPT_DIR = Path(__file__).parent
 JOBS_DIR = _SCRIPT_DIR / "jobs"
 # ---------------------------------------------------------------------------
 # Core functions (importable for tests)
 def load_items(path: Path) -> tuple[dict | None, list[dict]]:
    """Read a scraped JSONL. Returns (forum_item_or_None, [comment_items])."""
    forum = None
    comments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            if "comment_id" in item:
                comments.append(item)
            elif "reg_title" in item:
                forum = item
    return forum, comments
 def custom_id_from(comment_id: str) -> str:
    return f"comment_{comment_id}"
 def parse_custom_id(custom_id: str) -> str:
    return custom_id.removeprefix("comment_")
 def build_messages(comment: dict, forum: dict | None) -> tuple[list, bool]:
    """Build OpenAI messages for one comment. Returns (messages, truncated)."""
    reg_title = (forum or {}).get("reg_title", "[unknown]")
    reg_desc  = (forum or {}).get("reg_desc",  "[unknown]")
    body = (comment.get("text") or "").strip()
    truncated = False
    if not body:
        body = "[No body text provided]"
    elif len(body) > MAX_COMMENT_CHARS:
        body = body[:MAX_COMMENT_CHARS] + "... [truncated]"
        truncated = True
    user_text = USER_TEMPLATE.format(
        reg_title=reg_title,
        reg_desc=reg_desc,
        comment_id=comment.get("comment_id", ""),
        comment_title=comment.get("title", ""),
        comment_text=body,
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": user_text},
    ], truncated
 def build_batch_request_line(comment: dict, forum: dict | None, model: str) -> dict:
    messages, _ = build_messages(comment, forum)
    return {
        "custom_id": custom_id_from(comment["comment_id"]),
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": messages,
            "response_format": {"type": "json_object"},
            "temperature": 0.0,
        },
    }
 def normalize_output_line(
    raw_line: dict,
    comment_lookup: dict,
    run_id: str,
    analyzed_at: str,
    model: str,
    prompt_version: str,
 ) -> dict:
    """Convert one raw batch output line into a normalized analysis record."""
    comment_id = parse_custom_id(raw_line.get("custom_id", ""))
    comment = comment_lookup.get(comment_id, {})
     base = {
         "run_id":         run_id,
         "forum_id":       comment.get("forum_id", ""),
         "comment_id":     comment_id,
         "analyzed_at":    analyzed_at,
         "model":          model,
         "prompt_version": prompt_version,
         "input_title":    comment.get("title", ""),
         # Strip before measuring so this flag matches the check in build_messages.
         "truncated":      len((comment.get("text") or "").strip()) > MAX_COMMENT_CHARS,
     }
    if raw_line.get("error"):
        err = raw_line["error"]
        err_msg = err.get("message", str(err)) if isinstance(err, dict) else str(err)
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None, "error": err_msg}
    response = raw_line.get("response") or {}
    if response.get("status_code") != 200:
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None,
                "error": f"status {response.get('status_code')}"}
    try:
        content = response["body"]["choices"][0]["message"]["content"]
        data = json.loads(content)
        keys = ("stance", "stance_confidence", "stance_rationale", "tone", "tags")
        parsed = {k: data.get(k) for k in keys}
        return {**base, **parsed, "error": None}
    except Exception as exc:
        return {**base, "stance": None, "stance_confidence": None,
                "stance_rationale": None, "tone": None, "tags": None, "error": str(exc)}
 # ---------------------------------------------------------------------------
 # Job directory management
 def _next_job_dir(stem: str) -> Path:
    base = stem[:8]
    i = 1
    while (JOBS_DIR / f"{base}-{i}").exists():
        i += 1
    return JOBS_DIR / f"{base}-{i}"
 def _latest_job_dir() -> Path:
    if not JOBS_DIR.exists():
         sys.exit("No jobs directory found. Run 'create' first.")
    status_files = list(JOBS_DIR.glob("*/status.json"))
    if not status_files:
        sys.exit(f"No jobs found in {JOBS_DIR}. Run 'create' first.")
    return max(status_files, key=lambda p: p.stat().st_mtime).parent
 def _resolve_job_dir(args) -> Path:
    if getattr(args, "dir", None):
        d = Path(args.dir)
        if not d.is_absolute():
            d = JOBS_DIR / d
        if not d.exists():
            sys.exit(f"Job directory not found: {d}")
        return d
    return _latest_job_dir()
 def load_status(job_dir: Path) -> dict:
    return json.loads((job_dir / "status.json").read_text(encoding="utf-8"))
 def save_status(status: dict, job_dir: Path) -> None:
    (job_dir / "status.json").write_text(
        json.dumps(status, indent=2, ensure_ascii=False), encoding="utf-8"
    )
 def _find_next_eligible_job(jobs: list[dict]) -> tuple[dict | None, str | None]:
    """Return (next_pending_job, None) or (None, warning_message).
    A job is eligible when it is 'pending' and either it is the first job
    or its predecessor is 'completed'.
    """
    for j in jobs:
        if j["status"] != "pending":
            continue
        if j["job_num"] == 1:
            return j, None
        prev = next(p for p in jobs if p["job_num"] == j["job_num"] - 1)
        if prev["status"] == "completed":
            return j, None
        if prev["status"] in ("submitted", "in_progress", "validating", "finalizing"):
            return None, (
                f"Job {prev['job_num']} is '{prev['status']}'. "
                f"Wait for it to complete before submitting job {j['job_num']}."
            )
    return None, None
 # ---------------------------------------------------------------------------
 # Subcommand: create
 def cmd_create(args) -> None:
    report_path = Path(args.report)
    if not report_path.exists():
        sys.exit(f"Report not found: {report_path}")
    report = json.loads(report_path.read_text(encoding="utf-8"))
    if args.model not in report or not isinstance(report[args.model], dict):
        available = [k for k in report if isinstance(report.get(k), dict)]
        sys.exit(f"Model '{args.model}' not in report. Available: {', '.join(available)}")
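     # report.json may have been written on Windows; normalize separators so the
     # recorded paths also resolve on POSIX systems.
     prompt_path = Path(report["prompt"].replace("\\", "/"))
     if not prompt_path.exists():
         sys.exit(f"Prompt file not found: {prompt_path}")
     _load_prompt(prompt_path)
     input_path = Path(report["input_file"].replace("\\", "/"))
     if not input_path.exists():
         sys.exit(f"Input file not found: {input_path}")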
    forum, comments = load_items(input_path)
    if not comments:
        sys.exit("No comment items found in input file.")
    chunks = chunk_comments_by_tokens(comments, forum, args.model)
    stem = input_path.stem[:8]
    job_dir = _next_job_dir(stem)
    JOBS_DIR.mkdir(parents=True, exist_ok=True)
    job_dir.mkdir()
    shutil.copy2(input_path, job_dir / "forum.jsonl")
    shutil.copy2(prompt_path, job_dir / "prompt.txt")
    shutil.copy2(report_path, job_dir / "report.json")
    jobs_meta = []
    for i, chunk in enumerate(chunks, start=1):
        req_path = job_dir / f"job{i}-input.jsonl"
        with open(req_path, "w", encoding="utf-8") as f:
            for comment in chunk:
                f.write(json.dumps(build_batch_request_line(comment, forum, args.model),
                                   ensure_ascii=False) + "\n")
        jobs_meta.append({
            "job_num": i,
            "run_id": str(uuid.uuid4()),
            "status": "pending",
            "batch_id": None,
            "records_submitted": len(chunk),
            "records_completed": None,
            "records_failed": None,
            "submitted_at": None,
            "completed_at": None,
        })
    model_info = report[args.model]
    status = {
        "model": args.model,
        "prompt_hash": report["prompt_hash"],
        "input_file": str(input_path),
        "input_sha256": report["input_sha256"],
        "total_comments": report["total_comments"],
        "input_tokens": report["input_tokens"],
        "est_queue_days": model_info["est_queue_days"],
        "cost_$": model_info["cost_$"],
        "total_jobs": len(chunks),
        "jobs": jobs_meta,
    }
    save_status(status, job_dir)
    print(f"Created: {job_dir.name}")
    print(f"  {len(chunks)} job(s)  |  {len(comments)} comments  |  model: {args.model}")
     print("\nNext:  python analysis/openai_batch.py submit")
 # ---------------------------------------------------------------------------
 # Subcommand: submit
 def cmd_submit(args, client) -> None:
    job_dir = _resolve_job_dir(args)
    status = load_status(job_dir)
    jobs = status["jobs"]
    if args.job:
        target = next((j for j in jobs if j["job_num"] == args.job), None)
        if target is None:
            sys.exit(f"Job {args.job} not found in {job_dir.name}.")
        if target["status"] != "pending":
            sys.exit(f"Job {args.job} is already '{target['status']}' — cannot resubmit.")
        if target["job_num"] > 1:
            prev = next(p for p in jobs if p["job_num"] == target["job_num"] - 1)
            if prev["status"] != "completed":
                sys.exit(
                    f"Cannot submit job {target['job_num']}: "
                    f"job {prev['job_num']} is '{prev['status']}' (must be 'completed')."
                )
    else:
        target, warning = _find_next_eligible_job(jobs)
        if warning:
            print(warning, file=sys.stderr)
            sys.exit(1)
        if target is None:
            all_done = all(j["status"] == "completed" for j in jobs)
            print("All jobs completed." if all_done else "No pending jobs eligible for submission.")
            return
    n = target["job_num"]
    req_path = job_dir / f"job{n}-input.jsonl"
    print(f"Submitting job {n}/{status['total_jobs']} ({target['records_submitted']} comments) ...",
          file=sys.stderr)
    with open(req_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="batch")
    batch = client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
        metadata={"run_id": target["run_id"], "job_dir": job_dir.name},
    )
    target["status"] = "submitted"
    target["batch_id"] = batch.id
    target["submitted_at"] = datetime.now(timezone.utc).isoformat()
    save_status(status, job_dir)
    print(f"Job {n} submitted: {batch.id}  ({batch.status})")
     print("  python analysis/openai_batch.py status")
 # ---------------------------------------------------------------------------
 # Subcommand: status
 def cmd_status(args, client) -> None:
    job_dir = _resolve_job_dir(args)
    status = load_status(job_dir)
    jobs = status["jobs"]
    job_filter = getattr(args, "job", None)
    for job in jobs:
        if job_filter is not None and job["job_num"] != job_filter:
            continue
        if not job["batch_id"]:
            continue
        if job["status"] in ("completed", "failed", "expired", "cancelled", "pending"):
            continue
        batch = client.batches.retrieve(job["batch_id"])
        counts = batch.request_counts
         # Mirror the remote status; record the completion time when available.
         job["status"] = batch.status
         if batch.status == "completed" and batch.completed_at:
             job["completed_at"] = datetime.fromtimestamp(
                 batch.completed_at, tz=timezone.utc
             ).isoformat()
        job["records_completed"] = counts.completed
        job["records_failed"] = counts.failed
    save_status(status, job_dir)
    target_jobs = jobs if not job_filter else [j for j in jobs if j["job_num"] == job_filter]
    print(f"Dir: {job_dir.name}  |  Model: {status['model']}  |  {status['total_jobs']} job(s)")
    print(f"{'Job':<5} {'Status':<14} {'Records':>12}  {'Submitted':<20}  {'Completed':<20}")
    print("-" * 76)
    for j in target_jobs:
        rec = (f"{j['records_completed']}/{j['records_submitted']}"
               if j["records_completed"] is not None else f"-/{j['records_submitted']}")
        sub  = (j["submitted_at"]  or "-")[:19]
        done = (j["completed_at"] or "-")[:19]
        print(f"{j['job_num']:<5} {j['status']:<14} {rec:>12}  {sub:<20}  {done:<20}")
 # ---------------------------------------------------------------------------
 # Subcommand: download
 def cmd_download(args, client) -> None:
    job_dir = _resolve_job_dir(args)
    # Refresh status before deciding what to download
    cmd_status(args, client)
    status = load_status(job_dir)
    jobs = status["jobs"]
    job_filter = getattr(args, "job", None)
    if job_filter:
        candidates = [j for j in jobs if j["job_num"] == job_filter]
    else:
        candidates = [
            j for j in jobs
            if j["status"] == "completed"
            and not (job_dir / f"job{j['job_num']}-output.jsonl").exists()
        ]
    if not candidates:
        print("No completed jobs pending download.", file=sys.stderr)
        return
    _, all_comments = load_items(job_dir / "forum.jsonl")
    comment_lookup = {c["comment_id"]: c for c in all_comments}
    for job in candidates:
        n = job["job_num"]
        if job["status"] != "completed":
            print(f"Job {n} not yet completed ('{job['status']}'), skipping.", file=sys.stderr)
            continue
        batch = client.batches.retrieve(job["batch_id"])
        if not batch.output_file_id:
            print(f"Job {n}: no output file available from OpenAI.", file=sys.stderr)
            continue
        raw_text = client.files.content(batch.output_file_id).text
        raw_path = job_dir / f"job{n}-output-raw.jsonl"
        raw_path.write_text(raw_text, encoding="utf-8")
        print(f"Job {n} raw → {raw_path.name}", file=sys.stderr)
        if batch.error_file_id:
            err_text = client.files.content(batch.error_file_id).text
            err_path = job_dir / f"job{n}-errors.jsonl"
            err_path.write_text(err_text, encoding="utf-8")
            n_err_lines = sum(1 for line in err_text.splitlines() if line.strip())
            print(f"Job {n} errors → {err_path.name}  ({n_err_lines} lines)", file=sys.stderr)
        completed_at = job.get("completed_at") or datetime.now(timezone.utc).isoformat()
        norm_path = job_dir / f"job{n}-output.jsonl"
        n_ok = n_err = 0
        with open(norm_path, "w", encoding="utf-8") as out:
            for line in raw_text.splitlines():
                if not line.strip():
                    continue
                record = normalize_output_line(
                    json.loads(line), comment_lookup,
                    job["run_id"], completed_at,
                    status["model"], status["prompt_hash"],
                )
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                if record["error"]:
                    n_err += 1
                else:
                    n_ok += 1
        print(f"Job {n} normalized → {norm_path.name}  ({n_ok} ok, {n_err} errors)", file=sys.stderr)
        job["records_completed"] = n_ok
        job["records_failed"] = n_err
    save_status(status, job_dir)
 # ---------------------------------------------------------------------------
 # CLI
 def _add_common_args(p: argparse.ArgumentParser) -> None:
    p.add_argument("--job", type=int, default=None, metavar="N",
                   help="Job number within the run (default: auto)")
    p.add_argument("--dir", default=None, metavar="DIR",
                   help="Job directory name or path (default: most recent)")
 def main() -> None:
    load_dotenv()
    parser = argparse.ArgumentParser(
        description="Batch analysis job runner.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )
    sub = parser.add_subparsers(dest="command", required=True)
    p_create = sub.add_parser("create", help="Create job directory from tokenizer report")
    p_create.add_argument("report", help="Path to report.json from tokenizer.py")
    p_create.add_argument("--model", required=True, help="Model (e.g. gpt-4o-mini)")
    p_submit = sub.add_parser("submit", help="Submit next eligible job")
    _add_common_args(p_submit)
    p_status = sub.add_parser("status", help="Check job status")
    _add_common_args(p_status)
    p_download = sub.add_parser("download", help="Download and normalize completed jobs")
    _add_common_args(p_download)
    args = parser.parse_args()
    if args.command == "create":
        cmd_create(args)
        return
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        sys.exit("OPENAI_API_KEY not set. Create a .env file or export the variable.")
    client = openai.OpenAI(api_key=api_key)
    if args.command == "submit":
        cmd_submit(args, client)
    elif args.command == "status":
        cmd_status(args, client)
    elif args.command == "download":
        cmd_download(args, client)
 if __name__ == "__main__":
    main()
--- a/analysis/gpt4o/analysis_realtime.py
+++ b/analysis/gpt4o/analysis_realtime.py
@@ -1,12 +1,12 @@
 #!/usr/bin/env python3
 """
-analysis/gpt4o/analysis-realtime.py — Synchronous GPT-4o pipeline for VA Townhall comments.
+analysis/openai_realtime.py — Synchronous GPT-4o pipeline for VA Townhall comments.
 Usage:
-    python analysis/gpt4o/analysis-realtime.py <input_jsonl> [--limit {5,10,20,50}] [--model MODEL]
+    python analysis/openai_realtime.py <input_jsonl> [--limit {5,10,20,50}] [--model MODEL]
 Output:
-    analysis/gpt4o/forum{id}_{scrape_ts}_{model}_{run_ts}.jsonl
+    analysis/forum{id}_{scrape_ts}_{model}_{run_ts}.jsonl
 """
 import argparse
@@ -30,7 +30,7 @@ except ImportError:
 # ---------------------------------------------------------------------------
 # Prompt — loaded from analysis/prompt-1.txt at import time
-_PROMPT_FILE = Path(__file__).parent.parent / "prompt-1.txt"
+_PROMPT_FILE = Path(__file__).parent / "prompt-1.txt"
 SYSTEM_PROMPT = _PROMPT_FILE.read_text(encoding="utf-8").strip()
 PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:7]
--- a/analysis/prompt-1.txt
+++ b/analysis/prompt-1.txt
@@ -1,6 +1,4 @@
-You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
+You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall regulatory comment system. You will be given the text of a proposed regulation and a single public comment. Return ONLY a JSON object — no other text.
 regulatory comment system. You will be given the text of a proposed regulation and a single
 public comment. Return ONLY a JSON object — no other text.
 Definitions:
 - stance: the commenter's position on whether the regulation should be adopted.
@@ -16,8 +14,6 @@ Definitions:
  "unclear"  = tone cannot be determined (e.g., a one-word comment).
 - stance_confidence: float 0.0-1.0, your confidence in the stance label.
 - stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
+- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ inclusion", "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
 Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
--- a/analysis/tokenizer.py
+++ b/analysis/tokenizer.py
@@ -0,0 +1,190 @@
 #!/usr/bin/env python3
 """
 tokenizer.py — estimate token usage and cost for a batch analysis run.
 Usage:
    python analysis/tokenizer.py output/f452.jsonl [--prompt analysis/prompt-1.txt]
    python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl  # count actual tokens in a job
 Prints a per-model comparison table and writes reports/<stem>-report.json.
 Run this before openai_batch.py create.
 """
 import argparse
 import hashlib
 import json
 import math
 import sys
 from pathlib import Path
 sys.path.insert(0, str(Path(__file__).parent))
 import openai_batch as _ab
 # Input pricing ($/1M tokens, batch API) — from docs/openai.md, updated 2026-05-05.
 # Add Anthropic/other models here when needed; only models with a MODEL_LIMITS entry in openai_batch are reported.
 MODEL_PRICING: dict[str, float] = {
    "gpt-5.5":       2.50,
    "gpt-5.4":       1.25,
    "gpt-5.4-mini":  0.375,
    "gpt-5.4-nano":  0.10,
    "gpt-4o":        1.25,
    "gpt-4o-mini":   0.075,
    "gpt-o4-mini":   0.55,
 }
 def compute_report(
    comments: list[dict],
    forum: dict | None,
    prompt_hash: str,
    input_file: str,
    input_sha256: str,
    prompt_file: str,
 ) -> dict:
    """Compute token estimate and per-model job/cost/time breakdown."""
    # Use gpt-4o encoding as the canonical estimator (same for all current models)
    total_tokens = sum(
        _ab.estimate_tokens(_ab.build_messages(c, forum)[0], "gpt-4o")
        for c in comments
    )
    report: dict = {
        "prompt": prompt_file,
        "prompt_hash": prompt_hash,
        "input_file": input_file,
        "input_sha256": input_sha256,
        "total_comments": len(comments),
        "input_tokens": total_tokens,
    }
    for model, tpd in _ab.MODEL_LIMITS.items():
        effective_tpd = int(tpd * _ab._LIMIT_BUFFER)
        jobs = math.ceil(total_tokens / effective_tpd)
        cost = round(total_tokens / 1_000_000 * MODEL_PRICING.get(model, 0.0), 4)
        est_days = round(total_tokens / tpd, 2)
        report[model] = {"jobs": jobs, "cost_$": cost, "est_queue_days": est_days}
    return report
 def count_input_tokens(path: Path, model: str = "gpt-4o") -> dict:
    """Count tokens in an existing job input JSONL (batch request format).
    Each line must have body.messages (as written by build_batch_request_line).
    Returns {"total_tokens": int, "total_requests": int, "min": int, "max": int, "mean": float}.
    """
    counts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            req = json.loads(line)
            messages = req["body"]["messages"]
            counts.append(_ab.estimate_tokens(messages, model))
    if not counts:
        return {"total_tokens": 0, "total_requests": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "total_tokens": sum(counts),
        "total_requests": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": round(sum(counts) / len(counts), 1),
    }
 def print_table(report: dict) -> None:
    """Print a human-readable model comparison table to stdout."""
    print(f"\nInput:    {report['input_file']}")
    print(f"Comments: {report['total_comments']:,}")
    print(f"Tokens:   {report['input_tokens']:,}")
    print(f"Prompt:   {report['prompt']}  (hash: {report['prompt_hash']})")
    print()
    # Cheapest model that fits in one job
    single_job_models = [m for m in _ab.MODEL_LIMITS if report.get(m, {}).get("jobs") == 1]
    best = (min(single_job_models, key=lambda m: report[m]["cost_$"])
            if single_job_models else None)
    print(f"{'Model':<15} {'Jobs':>5}  {'Cost ($)':>9}  {'Est days':>9}  {'Note'}")
    print("-" * 62)
    for model in _ab.MODEL_LIMITS:
        if model not in report or not isinstance(report[model], dict):
            continue
        m = report[model]
        note = "<-- recommended" if model == best else ""
        print(f"{model:<15} {m['jobs']:>5}  {m['cost_$']:>9.4f}  {m['est_queue_days']:>9.2f}  {note}")
    print()
 def _is_job_input(path: Path) -> bool:
    """Return True if this JSONL looks like a batch request file (has custom_id)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                return "custom_id" in json.loads(line)
    return False
 def main() -> None:
    _default_prompt = Path(__file__).parent / "prompt-1.txt"
    parser = argparse.ArgumentParser(description="Estimate batch token usage and cost.")
    parser.add_argument("input", help="Scraped JSONL or job input JSONL (jobN-input.jsonl)")
    parser.add_argument(
        "--prompt",
        default=str(_default_prompt),
        help=f"System prompt file (default: {_default_prompt.name})",
    )
    args = parser.parse_args()
    input_path = Path(args.input)
    if not input_path.exists():
        sys.exit(f"File not found: {input_path}")
    # --- Mode: count tokens in an existing job input file ---
    if _is_job_input(input_path):
        result = count_input_tokens(input_path)
        print(f"\nJob input: {input_path.name}")
        print(f"  Requests : {result['total_requests']:,}")
        print(f"  Tokens   : {result['total_tokens']:,}")
        print(f"  Per-req  : min={result['min']}  max={result['max']}  mean={result['mean']}")
        return
    # --- Mode: estimate from raw scrape file and write report.json ---
    prompt_path = Path(args.prompt)
    if not prompt_path.exists():
        sys.exit(f"Prompt file not found: {prompt_path}")
    prompt_text = prompt_path.read_text(encoding="utf-8").strip()
    prompt_hash = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:7]
    # Ensure build_messages uses the specified prompt
    _ab._load_prompt(prompt_path)
    forum, comments = _ab.load_items(input_path)
    if not comments:
        sys.exit("No comment items found.")
    if forum is None:
        print("Warning: no ForumItem — token estimates may be slightly low.", file=sys.stderr)
    input_sha256 = hashlib.sha256(input_path.read_bytes()).hexdigest()
    report = compute_report(
        comments, forum, prompt_hash,
        str(input_path), input_sha256, str(prompt_path),
    )
    print_table(report)
    reports_dir = Path(__file__).parent.parent / "reports"
    reports_dir.mkdir(exist_ok=True)
    out_path = reports_dir / f"{input_path.stem}-report.json"
    out_path.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
    print(f"Report written to: {out_path}")
    print(f"\nNext:  python analysis/openai_batch.py create {out_path} --model <model>")
 if __name__ == "__main__":
    main()
--- a/docs/excel-snapshot.png
+++ b/docs/excel-snapshot.png
--- a/docs/pipeline-1.2.3.svg
+++ b/docs/pipeline-1.2.3.svg
--- a/docs/pipeline-v1.2.3.drawio
+++ b/docs/pipeline-v1.2.3.drawio
@@ -1,9 +1,18 @@
 <mxfile host="app.diagrams.net">
  <diagram name="Page-1" id="0sW-Vs8X5usvYmJikUIv">
-    <mxGraphModel dx="2179" dy="1118" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
+    <mxGraphModel dx="1315" dy="798" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="0" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
      <root>
        <mxCell id="0" />
        <mxCell id="1" parent="0" />
        <mxCell id="mENAtx_syaeSO5uR6kG6-61" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
          <mxGeometry height="90" width="190" x="1000" y="330" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-60" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
          <mxGeometry height="90" width="190" x="1010" y="340" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-59" parent="1" style="rounded=0;whiteSpace=wrap;html=1;" value="" vertex="1">
          <mxGeometry height="90" width="190" x="1020" y="350" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-3" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-1" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=16.5;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-29">
          <mxGeometry relative="1" as="geometry">
            <mxPoint x="200" y="290" as="targetPoint" />
@@ -18,18 +27,18 @@
        <mxCell id="mENAtx_syaeSO5uR6kG6-5" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="tokenizer" vertex="1">
          <mxGeometry height="60" width="120" x="400" y="170" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-6" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=center;verticalAlign=top;rounded=0;" value="gather forum data" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-6" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="&lt;div align=&quot;left&quot;&gt;- collect forum data&lt;/div&gt;" vertex="1">
-          <mxGeometry height="60" width="120" x="20" y="240" as="geometry" />
+          <mxGeometry height="60" width="120" x="40" y="240" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-7" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="&lt;div&gt;tokenize forum,&lt;/div&gt;&lt;div&gt;generate report w/&lt;/div&gt;&lt;div&gt;recommendations&lt;/div&gt;" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-7" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="&lt;div&gt;- tokenize forum&lt;/div&gt;&lt;div&gt;- generate report w/&lt;/div&gt;&lt;div&gt;recommendations&lt;/div&gt;" vertex="1">
          <mxGeometry height="60" width="120" x="400" y="240" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-28" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-35">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-28" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-73">
          <mxGeometry relative="1" as="geometry">
-            <mxPoint x="910" y="270" as="targetPoint" />
+            <mxPoint x="953" y="240" as="targetPoint" />
          </mxGeometry>
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-19" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="batch" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-19" parent="1" style="shape=process;whiteSpace=wrap;html=1;backgroundOutline=1;" value="openai_batch" vertex="1">
          <mxGeometry height="60" width="120" x="720" y="170" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-21" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="&lt;div&gt;--model&lt;/div&gt;&lt;div&gt;--limit&lt;/div&gt;" vertex="1">
@@ -38,11 +47,8 @@
        <mxCell id="mENAtx_syaeSO5uR6kG6-23" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="--forum" vertex="1">
          <mxGeometry height="60" width="120" x="-90" y="170" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-25" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=right;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="--prompt" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-26" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="&lt;div&gt;- split job into batches&lt;/div&gt;&lt;div&gt;- submit first batch&lt;/div&gt;&lt;div&gt;- status of current batch&lt;/div&gt;&lt;div&gt;- download batch artifacts&lt;/div&gt;" vertex="1">
-          <mxGeometry height="60" width="120" x="270" y="210" as="geometry" />
+          <mxGeometry height="70" width="140" x="720" y="240" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-26" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;" value="&lt;div&gt;split job into batches&lt;/div&gt;&lt;div&gt;submit first batch&lt;/div&gt;&lt;div&gt;status of current batch&lt;/div&gt;&lt;div&gt;download batch artifacts&lt;/div&gt;" vertex="1">
          <mxGeometry height="60" width="120" x="720" y="240" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-29" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
          <mxGeometry height="70" width="50" x="210" y="240" as="geometry" />
@@ -58,7 +64,7 @@
            </Array>
          </mxGeometry>
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-31" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;forum&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-31" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;&amp;lt;forumid&amp;gt;&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
          <mxGeometry height="70" width="50" x="230" y="260" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-47" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-34" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0;exitDx=50;exitDy=43.5;exitPerimeter=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-19">
@@ -69,30 +75,42 @@
            </Array>
          </mxGeometry>
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-34" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;report&lt;/div&gt;&lt;div&gt;.json&lt;/div&gt;" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-34" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;&amp;lt;forumid&amp;gt;&lt;br&gt;-report&lt;/div&gt;&lt;div&gt;.json&lt;/div&gt;" vertex="1">
          <mxGeometry height="70" width="50" x="560" y="240" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-35" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="job.json" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-35" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;status&lt;/div&gt;&lt;div&gt;.json&lt;/div&gt;" vertex="1">
-          <mxGeometry height="70" width="50" x="890" y="240" as="geometry" />
+          <mxGeometry height="70" width="50" x="913.25" y="360" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-41" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-43" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;jobN-&lt;/div&gt;&lt;div&gt;output&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
-          <mxGeometry height="70" width="50" x="940" y="340" as="geometry" />
+          <mxGeometry height="70" width="50" x="1090" y="360" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-42" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-48" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;jobN-errors&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
-          <mxGeometry height="70" width="50" x="950" y="350" as="geometry" />
+          <mxGeometry height="70" width="50" x="1150" y="360" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-43" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;batchN-&lt;/div&gt;&lt;div&gt;output-&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-54" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;jobN-&lt;/div&gt;&lt;div&gt;input&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
-          <mxGeometry height="70" width="50" x="960" y="360" as="geometry" />
+          <mxGeometry height="70" width="50" x="1030" y="360" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-48" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;errors&lt;/div&gt;&lt;div&gt;.jsonl&lt;/div&gt;" vertex="1">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-64" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-63" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=0;exitY=0;exitDx=50;exitDy=43.5;exitPerimeter=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-5">
          <mxGeometry height="70" width="50" x="980" y="240" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-51" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=16.5;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-41">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
-        <mxCell id="mENAtx_syaeSO5uR6kG6-53" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-19" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=16.5;entryDy=0;entryPerimeter=0;" target="mENAtx_syaeSO5uR6kG6-48">
+        <mxCell id="mENAtx_syaeSO5uR6kG6-63" parent="1" style="shape=note;whiteSpace=wrap;html=1;backgroundOutline=1;darkOpacity=0.05;size=17;" value="&lt;div&gt;prompt&lt;/div&gt;&lt;div&gt;.txt&lt;/div&gt;" vertex="1">
          <mxGeometry height="70" width="50" x="270" y="90" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-67" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="create" vertex="1">
          <mxGeometry height="20" width="120" x="850" y="170" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-71" parent="1" style="text;html=1;whiteSpace=wrap;strokeColor=none;fillColor=none;align=left;verticalAlign=top;rounded=0;fontFamily=Courier New;" value="&lt;div&gt;submit&lt;/div&gt;&lt;div&gt;&lt;br&gt;&lt;/div&gt;&lt;div&gt;status&lt;/div&gt;&lt;div&gt;download&lt;/div&gt;" vertex="1">
          <mxGeometry height="60" width="120" x="1020" y="240" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-75" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-73" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" target="mENAtx_syaeSO5uR6kG6-35">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-76" edge="1" parent="1" source="mENAtx_syaeSO5uR6kG6-73" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" target="mENAtx_syaeSO5uR6kG6-61">
          <mxGeometry relative="1" as="geometry" />
        </mxCell>
        <mxCell id="mENAtx_syaeSO5uR6kG6-73" parent="1" style="image;aspect=fixed;perimeter=ellipsePerimeter;html=1;align=center;shadow=0;dashed=0;spacingTop=3;image=img/lib/active_directory/folder.svg;" value="&amp;lt;forumid&amp;gt;-N" vertex="1">
          <mxGeometry height="50" width="36.5" x="920" y="240" as="geometry" />
        </mxCell>
      </root>
    </mxGraphModel>
  </diagram>
--- a/docs/pipeline-v1.2.3.svg
+++ b/docs/pipeline-v1.2.3.svg
--- a/docs/streamlit-snapshot.png
+++ b/docs/streamlit-snapshot.png
--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -158,7 +158,7 @@ forum_id_input,comment_id,title,text,date,author,stance,stance_confidence,stance
 - tests: 23 passing (pytest tests/analysis_gpt4o_batch.py), 51 total across suite
 - datetime: [2026-05-06 Wed 08:55]
-* [ ] t1.2.3: batch job refactor
+* [X] t1.2.3: batch job refactor
 This task encompasses intent and fixes for 1.2.1 and 1.2.2.
 Batch processing should be a resumable job queue, not a one-shot script. The user should not need to remember offsets, completed chunks, failed batches, or which comments remain.
 ** Acceptance Criteria
@@ -200,9 +200,53 @@ batch processing should  be a resumable job queue, not a one-shot script. the us
   - resume from status.json
   - remaining-comment detection
-* === Backlog ===
+** notes
-* [ ] X: analysis validation view
+- analysis/tokenizer.py: new standalone script; imports openai_batch for MODEL_LIMITS, estimate_tokens, build_messages. Reads input JSONL + prompt, computes per-model jobs/cost/time table, writes reports/<stem>-report.json. MODEL_PRICING dict lives here (not in openai_batch). Pass a jobN-input.jsonl to count actual tokens instead.
 - analysis/openai_batch.py: fully rewritten with four subcommands: create, submit, status, download. Job dirs at analysis/jobs/<stem[:8]>-N/.
 - Job directories: analysis/jobs/<stem[:8]>-N/ (e.g. f452-1). Each run is self-contained: forum.jsonl, prompt.txt, report.json, jobN-input.jsonl, jobN-output-raw.jsonl, jobN-output.jsonl, jobN-errors.jsonl.
 - status.json: tracks all jobs with pending/submitted/in_progress/completed/failed states. Updated by submit, status, download.
 - _find_next_eligible_job: pure function for testability. Returns (next_pending_job, None) or (None, warning). Blocks submission if previous job is in_progress/submitted.
 - create: no API key required. Reads report.json, re-chunks comments, writes all jobN-input.jsonl files, writes status.json.
 - submit: uploads jobN-input.jsonl to Files API, creates batch, updates status.json to 'submitted'. Will not stack batches.
 - status: retrieves batch from OpenAI, updates status.json counts and status.
 - download: auto-runs status first, downloads output_file_id → jobN-output-raw.jsonl, error_file_id → jobN-errors.jsonl, normalizes → jobN-output.jsonl. Updates status.json.
 - tests/tokenizer.py: 19 tests for compute_report schema, cost/time calculation, MODEL_PRICING coverage, print_table output, count_input_tokens, report.json round-trip.
 - Token limit buffer: _LIMIT_BUFFER=0.80 (20% headroom). Estimate uses OpenAI cookbook chat formula (role tokens + 3-token reply primer). Verify a job file with: python analysis/tokenizer.py analysis/jobs/<dir>/jobN-input.jsonl
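As a rough illustration of how the 20% headroom turns a token total into a job count, the sketch below mirrors the arithmetic in compute_report. The function name plan_jobs and the 2M tokens/day limit are hypothetical placeholders, not values taken from MODEL_LIMITS; the price is the gpt-4o-mini batch input rate listed in MODEL_PRICING.

```python
import math

LIMIT_BUFFER = 0.80  # same 20% headroom as _LIMIT_BUFFER in the notes above


def plan_jobs(total_tokens: int, tokens_per_day: int, price_per_million: float) -> dict:
    """Return job count, input cost, and queue-time estimate for one model."""
    # Only the buffered share of the daily limit is used when sizing jobs.
    effective_tpd = int(tokens_per_day * LIMIT_BUFFER)
    return {
        "jobs": math.ceil(total_tokens / effective_tpd),
        "cost_$": round(total_tokens / 1_000_000 * price_per_million, 4),
        "est_queue_days": round(total_tokens / tokens_per_day, 2),
    }


# Hypothetical numbers: 5M input tokens against an assumed 2M tokens/day limit.
plan = plan_jobs(5_000_000, tokens_per_day=2_000_000, price_per_million=0.075)
print(plan)
```

The job count is computed from the buffered limit while the queue-time estimate uses the raw limit, which matches how compute_report treats the two figures.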
 *** usage
 #+begin_src powershell
 # 1. estimate tokens and cost
 python analysis/tokenizer.py output/f452.jsonl --prompt analysis/prompt-1.txt
 # writes reports/f452-report.json
 # 2. verify actual tokens in a job file (optional sanity check)
 python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl
 # 3. create job directory (no api key needed)
 python analysis/openai_batch.py create reports/f452-report.json --model gpt-5.4-mini
 # creates analysis/jobs/f452-1/
 # 4. submit first job
 python analysis/openai_batch.py submit
 # 5. check status (repeat until completed)
 python analysis/openai_batch.py status
 # 6. download and normalize
 python analysis/openai_batch.py download
 # 7. submit next job (if multi-job run), then repeat 5-6
 python analysis/openai_batch.py submit
 #+end_src
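The "will not stack batches" rule behind steps 4 and 7 can be sketched as a pure function over the job list in status.json. This is a minimal sketch, not the code in openai_batch.py: the "n" and "status" keys are assumed field names, and it blocks on any submitted or in-progress job rather than only the immediately preceding one.

```python
from typing import Optional

# States that mean a batch is still occupying the queue.
BLOCKING = {"submitted", "in_progress"}


def find_next_eligible_job(jobs: list[dict]) -> tuple[Optional[dict], Optional[str]]:
    """Return (next_pending_job, None) or (None, warning)."""
    # Refuse to submit while any earlier batch is still outstanding.
    for job in jobs:
        if job["status"] in BLOCKING:
            return None, f"Job {job['n']} is {job['status']}; wait for it to finish."
    # Otherwise hand back the first job that has not been submitted yet.
    for job in jobs:
        if job["status"] == "pending":
            return job, None
    return None, "No pending jobs remain."


# Hypothetical status.json contents for a three-job run.
jobs = [
    {"n": 1, "status": "completed"},
    {"n": 2, "status": "submitted"},
    {"n": 3, "status": "pending"},
]
job, warning = find_next_eligible_job(jobs)
print(job, warning)
```

Because the function only reads the list it is given, it can be tested without an API key or a network call.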
 ** evidence
 - commit:
 - tests: passing (pytest tests/openai_batch.py tests/openai_realtime.py tests/tokenizer.py)
 - datetime: [2026-05-06 Wed]
 * [X] t1.3: cleanup model output and rejoin
 create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv.
 review create_csv for the simple approach - keep this regardless
 ** acceptance criteria
 1. input raw scrape jsonl and all *-output.jsonl files in a dir
@@ -211,7 +255,8 @@ create a lightweight validation script that joins raw comments to normalized ana
   - forum_id, comment_id, title, text, date, author
   - stance, stance_confidence, stance_rationale, tone, tags
   - error, truncated, analyzed_at, prompt_version, model
-4. print validation counts
+4. output parquet?
 5. print validation counts
   - raw comments
   - analyzed records
   - joined records
@@ -220,16 +265,30 @@ create a lightweight validation script that joins raw comments to normalized ana
   - error records
   - stance counts
   - tone counts
-5. tests cover join behavior and missing/duplicate ids
+6. tests cover join behavior and missing/duplicate ids
 ** notes
 - analysis/create_csv.py: reads raw scrape JSONL + all job*-output.jsonl in a job dir (skips *-output-raw.jsonl); left-joins on comment_id; writes review.csv (UTF-8 BOM for Excel); optional --parquet.
 - Uses pd.read_json(path, lines=True) — no manual JSON parsing.
 - Prints summary counts: raw/analyzed/joined/unanalyzed/errors/duplicate IDs, stance distribution, tone distribution.
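The join and the summary counts can be illustrated without pandas. The sketch below is a stdlib-only approximation, not create_csv.py itself: the function name left_join is hypothetical, and duplicate comment_ids are resolved by keeping the last record, which may differ from what a pandas merge would produce.

```python
from collections import Counter


def left_join(raw: list[dict], analyzed: list[dict]) -> tuple[list[dict], dict]:
    """Left-join analysis records onto raw comments by comment_id."""
    id_counts = Counter(rec["comment_id"] for rec in analyzed)
    # Assumption: on duplicate ids the last analysis record wins.
    by_id = {rec["comment_id"]: rec for rec in analyzed}
    # Every raw comment is kept; analysis fields are added when available.
    joined = [{**comment, **by_id.get(comment["comment_id"], {})} for comment in raw]
    counts = {
        "raw": len(raw),
        "analyzed": len(analyzed),
        "joined": sum(1 for c in raw if c["comment_id"] in by_id),
        "unanalyzed": sum(1 for c in raw if c["comment_id"] not in by_id),
        "duplicate_ids": sum(1 for n in id_counts.values() if n > 1),
    }
    return joined, counts


# Made-up records: comment 2 was analyzed twice, comment 3 not at all.
raw = [
    {"comment_id": 1, "text": "a"},
    {"comment_id": 2, "text": "b"},
    {"comment_id": 3, "text": "c"},
]
analyzed = [
    {"comment_id": 1, "stance": "support"},
    {"comment_id": 2, "stance": "oppose"},
    {"comment_id": 2, "stance": "neutral"},
]
joined, counts = left_join(raw, analyzed)
print(counts)
```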
 *** usage
 #+begin_src sh
 python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/
 python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
 # output: analysis/jobs/f452-1/review.csv (and optionally review.parquet)
 #+end_src
 ** evidence
- commit:
+- commit: 28d6d22
- tests:
+- tests: passing (pytest tests/create_csv.py tests/encoding.py)
- csv:
+- csv: analysis/jobs/f452-1/review.csv
- datetime:       
+- datetime: [2026-05-07 Thu 17:23]
-* [ ] X: text encoding cleanup
+
 * [X] t1.1.1: text encoding cleanup
 fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as â€™.
 ** acceptance criteria
 1. identify whether mojibake exists in raw scrape, analysis output, or csv export only
 2. add repair step at the earliest correct layer
@@ -242,14 +301,82 @@ fix mojibake in scraped text before analysis/reporting, especially curly quotes
   - â€”
 5. document whether repaired text is used for model input
 ** notes
 - Diagnosis: f452.jsonl raw data is CLEAN — proper Unicode throughout (U+2019, U+201C, etc.). The DEFAULT_RESPONSE_ENCODING=utf-8 spider setting is working for this site. No mojibake or FFFD chars found.
 - The encoding issue would surface for forums whose server sends cp1252 bytes (0x91-0x97 range) embedded in otherwise UTF-8 content. FFFD replacement chars appear when the UTF-8 decoder hits those bytes. Once the byte is replaced by FFFD, the original character cannot be recovered.
 - Repair layer: analysis/encoding.py applied in analysis/validate.py at reporting time. Raw scrape JSONL is never modified (AC3).
 - Model input: repair_text() is NOT applied in build_messages() for this dataset since raw data is clean. Can be added if a future forum produces dirty text.
 - Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.
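For reference, the classic curly-quote mojibake arises when UTF-8 bytes are decoded as cp1252, and it can be reversed by round-tripping through the same two codecs. The sketch below is a stdlib illustration under that assumption; repair_mojibake is a hypothetical name and is not the repair_text() in analysis/encoding.py. It can misfire on clean text that happens to round-trip, so it should only be applied where mojibake has been diagnosed.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as cp1252.

    Returns the input unchanged when the round trip is not possible,
    which is the case for clean Unicode and for text containing U+FFFD.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return text


broken = "don\u00e2\u20ac\u2122t"    # the three-character garble in place of U+2019
print(repair_mojibake(broken))       # apostrophe restored as U+2019
print(repair_mojibake("a\ufffdb"))   # U+FFFD cannot be recovered; returned as-is
```

The second call shows the point made in the notes: once a byte has been replaced by U+FFFD, the original character is gone and no round trip brings it back.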
 ** evidence
- commit:
+- commit: 1ea696d
- tests:
+- tests: passing (pytest tests/encoding.py)
- before/after sample:
+- before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
- datetime:
+- datetime: [2026-05-07 Thu 17:00]
 * [X] t1.4: graph data prototype
 create ./viz/prototype_charts.py generating individual plotly charts for exploring graphs to embed into streamlit or dash later
 ** acceptance criteria
 2. create graph for Stance/Share
   - stacked h-bar with % support/oppose/neutral/unknown + raw totals, e.g. 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
   - later, consider centered diverging h-bar: oppose ← | neutral/unknown | → support
 3. create graph for Stance/Time: 
   - cumulative support/oppose % over time
 4. create graph for Stance/Tone (heatmap count)
 5. create graph for Confidence/Stance (boxplot or histogram)
 ** notes
 - prototyped in plotly
 - initial streamlit  
 ** evidence
 - commit: 3fb424d
 - tests: see viz/proto and viz/chart_tests
 - datetime: [2026-05-08 Fri 08:38]
 * [X] t1.5: streamlit
 create organized webpage displaying useful information from completed job and analysis
 ** acceptance criteria
 1. display total stance breakdown
 2. display centered horiz-bar with absolute stances
 3. show daily comment stances and cumulative
 4. show comment table with filters for stance (filter tone?)
 5. clicking/selecting a comment shows full text and model rationale
 6. app runs locally with one command
 ** notes
 data pulls entirely from the job; goal is to point viz/streamlit.py at any job/ folder and have everything it needs
 ** evidence
 - commit: cc16acb
 - tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
 - datetime: [2026-05-08 Fri 23:44]
 * +[ ] t1.6 host streamlit via dockerfile+
 planning to deploy manually, get cert, etc etc. probably dont care about https?
 +using streamlit.app instead+
 ** acceptance criteria
 1. write dockerfile with slim image
 ** notes
 * === Backlog ===
 - add forum_url, forum_collected_date to scraper (to add to viz)
 * [ ] X: complete proposal information
 Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted. 
 ** acceptance criteria
 1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
   - add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
 2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
 * [ ] X: add helper data to create_csv
 1. in create_csv.py, create helper columns:
   - stance_signed = {"support":1, "oppose":-1, "neutral":0, "unknown":0}
   - stance_weighted = stance_signed * stance_confidence
   - is_support_oppose = stance in ["support", "oppose"]
   - date_day
   - date_hour
   - text_norm
   - text_hash
   - confidence_bucket = 'low' <.7 | 'med' .7-.89 | 'high' >=.9
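
A possible shape for these helper columns, sketched on plain dicts so it runs without pandas. Several details are assumptions and not decided in the backlog item: `add_helper_columns` is a hypothetical name, `date` is assumed to be ISO 8601, `text_norm` is assumed to mean lowercased text with collapsed whitespace, and `text_hash` is assumed to be SHA-256 of `text_norm`.

```python
import hashlib
import re
from datetime import datetime

STANCE_SIGNED = {"support": 1, "oppose": -1, "neutral": 0, "unknown": 0}


def confidence_bucket(confidence):
    """'low' below 0.7, 'med' from 0.7 up to (not including) 0.9, else 'high'."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.7:
        return "med"
    return "low"


def add_helper_columns(row):
    """Return a copy of one joined comment row with the helper columns added."""
    stance = row["stance"]
    confidence = row["stance_confidence"]
    when = datetime.fromisoformat(row["date"])  # assumption: ISO 8601 dates
    # Assumption: normalization = collapse whitespace, trim, lowercase.
    text_norm = re.sub(r"\s+", " ", row["text"]).strip().lower()
    signed = STANCE_SIGNED[stance]
    return {
        **row,
        "stance_signed": signed,
        "stance_weighted": signed * confidence,
        "is_support_oppose": stance in ("support", "oppose"),
        "date_day": when.date().isoformat(),
        "date_hour": when.hour,
        "text_norm": text_norm,
        # Assumption: hash the normalized text so near-identical copies collide.
        "text_hash": hashlib.sha256(text_norm.encode("utf-8")).hexdigest(),
        "confidence_bucket": confidence_bucket(confidence),
    }
```

Hashing the normalized text, not the raw text, would let duplicate detection ignore differences in spacing and capitalization.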
--- a/docs/vatownhall.org
+++ b/docs/vatownhall.org
@@ -1,50 +1,111 @@
 #+title: VA Townhall
 #+date: [2026-05-05 Tue]
-#+version: 1
+#+version: 1.1
-* Project Goals
+** Project Goals
 1. Document and analyze sentiment of public comments on Virginia law, to determine:
   1. the utility of this forum as a mechanism for public comment, and
   2. the impact of this forum on Virginia regulation.
 2. Make data and insights broadly available.
 3. Generalize to other public comment tools.
-** Document and analyze sentiment
+*** Research questions   
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
+1. What is the quality of the comments on the forum?
- Build tests for identifying abuse, such as spam and account fraud
+   1. Are there duplicate entries?
- Identify any patterns connecting measured sentiment against VA decisions
+   2. Are there non-human-generated entries?
-  
+   3. Are there entries intended to abuse the forum or drown out comment?
-** Make data available
+2. How do commenters feel about the proposed change?
- Pick a good visualization tool
+   1. What is the total number and percent supporting vs opposing, and how does this change over time?
   2. What is the type of support, such as strong/weak, positive/negative?
 3. What impact do the comments have on the proposed change?
   (I anticipate this will not be measurable from currently available data)
-** Generalize
+** Architecture
- Identify scalable ways to apply this toolset to similar problems
+1. Scrape/Parse: Scrapy
 2. Sentiment analysis: gpt-5.4-mini
 3. Display: streamlit
 4. Storage: jsonl, csv, parquet
-* Architecture
+[[file:pipeline-v1.2.3.svg]]
-1. Scrape/Parse: **Scrapy** for downloading comments
+   
-2. Storage: json
+*** Scraper
-3. Sentiment analysis: Claude haiku
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
 4. Display: TBD   
 ** Scraper
 Scrapy provides a simple mechanism for browsing and 
 1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
 2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
 3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
-** Storage
+*** Analysis
-One JSONL file per forum/bill.
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We add the proposed change as context to the model.
-** Analysis
+Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
 Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
-| Tool              | Output                         | Context    | Sarcasm          | Context window | Cost/1k comments |
+Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to submit the jobs, check their status, and download the results.
 |-------------------+--------------------------------+------------+------------------+----------------+------------------|
 | Google NL API     | -1→+1, magnitude               | No/generic | Poorly           | No             | ~$1–2            |
 | Amazon Comprehend | Pos/Neg/Neutral/Mixed          | No/generic | Poorly           | No             | ~$0.10           |
 | Claude Haiku      | Prompted → for/against/neutral | Yes        | Yes, with prompt | Yes            | ~$0.10–0.30      |
 | GPT-4o-mini       | Prompted → same                | Yes        | Yes              | Yes            | ~$0.05–0.15      |
 We selected gpt-5.4-mini for a good balance of quality, cost, and time.
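
The per-model figures in the report follow three small formulas, which the docstrings in `tests/tokenizer.py` spell out. The sketch below restates them; the function name `estimate` and the example price, daily token limit, and buffer are illustrative assumptions, not values read from `tokenizer.py` or `openai_batch.py`.

```python
import math


def estimate(total_tokens, price_per_million, tokens_per_day, limit_buffer):
    """Estimate batch jobs, cost, and queue time for one model."""
    # Keep each job under the daily token limit with some headroom.
    effective_limit = int(tokens_per_day * limit_buffer)
    return {
        "jobs": math.ceil(total_tokens / effective_limit),
        "cost_$": round(total_tokens / 1_000_000 * price_per_million, 4),
        "est_queue_days": round(total_tokens / tokens_per_day, 2),
    }


# Assumed inputs for illustration only (hypothetical price, limit, buffer).
example = estimate(
    6_397_254,
    price_per_million=0.375,
    tokens_per_day=2_000_000,
    limit_buffer=0.8,
)
```

With these assumed inputs the sketch yields 4 jobs, a cost of 2.399, and 3.2 queue days, matching the `gpt-5.4-mini` entry in `reports/f452-1.json`; the real limits and prices may differ.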
 **** Prompt
 ```
 You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
 regulatory comment system. You will be given the text of a proposed regulation and a single
 public comment. Return ONLY a JSON object — no other text.
 Definitions:
 - stance: the commenter's position on whether the regulation should be adopted.
  "support" = wants it approved (as-is or with changes);
  "oppose"  = wants it rejected or substantially weakened;
  "neutral" = takes no position, asks a question, or provides factual input only;
  "unknown" = too vague, off-topic, or uninterpretable to classify.
 - tone: the emotional register of the writing, independent of stance.
  "positive" = affirming, hopeful, appreciative;
  "negative" = angry, fearful, alarmed, or contemptuous;
  "neutral"  = matter-of-fact, procedural, or informational;
  "mixed"    = contains both positive and negative emotional content;
  "unclear"  = tone cannot be determined (e.g., a one-word comment).
 - stance_confidence: float 0.0-1.0, your confidence in the stance label.
 - stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
 - tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
 Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
 ```
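
Because the prompt asks for an exact set of keys and labels, each reply can be checked before it is stored. The following is a minimal sketch of such a check; `validate_response` is a hypothetical helper, and whether the project validates replies this way is an assumption.

```python
import json

STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the reply fits the prompt."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(data)}")
    if data.get("stance") not in STANCES:
        problems.append(f"bad stance: {data.get('stance')!r}")
    if data.get("tone") not in TONES:
        problems.append(f"bad tone: {data.get('tone')!r}")

    conf = data.get("stance_confidence")
    # bool is a subclass of int, so exclude it explicitly.
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append(f"bad stance_confidence: {conf!r}")

    tags = data.get("tags")
    if not isinstance(tags, list) or len(tags) > 5 or not all(
        isinstance(tag, str) for tag in tags
    ):
        problems.append("tags must be a list of at most 5 strings")
    return problems
```

Returning a list of problems, instead of raising on the first one, makes it easier to log every defect in a reply alongside the raw output.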
 *** Storage
 - Each scraped forum is saved to `output/<forum-id>.jsonl`
 - Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
 - Each job is saved to `analysis/jobs/<report-id>/`:
   └─`forum.jsonl` is a copy of the scraped forum for convenience
   └─`prompt.txt` is a copy of the prompt used
   └─`report.json` is a copy of the report used
   └─`status.json` contains metadata about the job
  For each batch in the job, four files are created:
   └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
   └─`jobN-output-raw.jsonl` contains the exact response from the API
   └─`jobN-output.jsonl` contains the normalized records parsed from the API response
   └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
 - Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
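
Given the file layout above, a cleanup step needs to pick up only the per-batch output files and skip the raw and error files. This is a sketch of one way to do that, assuming `jobN-output.jsonl` holds one parsed record per line; `normalized_output_files` and `load_records` are hypothetical names, not the functions in `create_csv.py`.

```python
import json
import re
from pathlib import Path

# Matches job1-output.jsonl but not job1-output-raw.jsonl or -errors.jsonl.
_JOB_OUTPUT = re.compile(r"^job(\d+)-output\.jsonl$")


def normalized_output_files(job_dir):
    """Return the per-batch output files in numeric batch order."""
    matches = []
    for path in Path(job_dir).iterdir():
        m = _JOB_OUTPUT.match(path.name)
        if m:
            matches.append((int(m.group(1)), path))
    # Sort by batch number so job10 comes after job2.
    return [path for _, path in sorted(matches)]


def load_records(job_dir):
    """Concatenate the records from every per-batch output file."""
    records = []
    for path in normalized_output_files(job_dir):
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records
```

Sorting on the parsed batch number avoids the lexicographic trap where `job10` would sort before `job2`.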
 ** Instructions
 1. Scrape the forum.
   `python 
 2. Run model report.
   `python analysis/tokenizer.py <input> --prompt <prompt>`
 3. To run a realtime subset:
   `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
   `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
 4. To create and run the whole thing in batches, first create the batch jobs from the report:
   `python analysis/openai_batch.py create <report> --model <model>`
   `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
 5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model's queue fills up, the batch will fail, and resubmission is not implemented.
   `python analysis/openai_batch.py submit`
  # Check status
   `python analysis/openai_batch.py status`
  # When complete, download:
   `python analysis/openai_batch.py download`
  # Submit the next batch after the previous is complete:
   `python analysis/openai_batch.py submit`
 * Roadmap
 1. Scrape one forum
 2. Compare sentiment models
--- a/reports/f452-1.json
+++ b/reports/f452-1.json
@@ -0,0 +1,43 @@
 {
  "prompt": "analysis\\prompt-1.txt",
  "prompt_hash": "cb41250",
  "input_file": "output\\f452.jsonl",
  "input_sha256": "59dcc8b13cc2a386977a8b934c498c7e639b7e684a94ca1bfd10a14878670018",
  "total_comments": 9083,
  "input_tokens": 6397254,
  "gpt-5.5": {
    "jobs": 9,
    "cost_$": 15.9931,
    "est_queue_days": 7.11
  },
  "gpt-5.4": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-5.4-mini": {
    "jobs": 4,
    "cost_$": 2.399,
    "est_queue_days": 3.2
  },
  "gpt-5.4-nano": {
    "jobs": 40,
    "cost_$": 0.6397,
    "est_queue_days": 31.99
  },
  "gpt-4o": {
    "jobs": 9,
    "cost_$": 7.9966,
    "est_queue_days": 7.11
  },
  "gpt-4o-mini": {
    "jobs": 4,
    "cost_$": 0.4798,
    "est_queue_days": 3.2
  },
  "gpt-o4-mini": {
    "jobs": 4,
    "cost_$": 3.5185,
    "est_queue_days": 3.2
  }
 }
--- a/requirements.txt
+++ b/requirements.txt
--- a/scraper/items.py
+++ b/scraper/items.py
@@ -5,6 +5,8 @@ class ForumItem(scrapy.Item):
    forum_id  = scrapy.Field()
    reg_title = scrapy.Field()
    reg_desc  = scrapy.Field()
    scraped_at = scrapy.Field()
    forum_url = scrapy.Field()
 class CommentItem(scrapy.Item):
--- a/scraper/spiders/forum.py
+++ b/scraper/spiders/forum.py
@@ -63,6 +63,8 @@ class ForumSpider(scrapy.Spider):
                forum_id=self.forum_id,
                reg_title=reg_title,
                reg_desc=reg_desc,
                scraped_at=datetime.utcnow().isoformat(),
                forum_url=_view_url(self.forum_id),
            )
            for page in range(2, last_page + 1):
                yield scrapy.FormRequest(
--- a/tests/create_csv.py
+++ b/tests/create_csv.py
@@ -0,0 +1,155 @@
 """Unit tests for analysis/create_csv.py — no external API calls."""
 import json
 import sys
 from pathlib import Path
 import pandas as pd
 import pytest
 sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
 import create_csv as cc
 # ---------------------------------------------------------------------------
 # Helpers
 def _write_jsonl(path: Path, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
 RAW_ROWS = [
    {"forum_id": "452", "comment_id": "1", "title": "Support", "text": "I support.", "date": "2021-01-01", "author": "Alice"},
    {"forum_id": "452", "comment_id": "2", "title": "Oppose",  "text": "I oppose.",  "date": "2021-01-02", "author": "Bob"},
    {"forum_id": "452", "comment_id": "3", "title": "Neutral", "text": "No opinion.","date": "2021-01-03", "author": "Carol"},
 ]
 ANALYSIS_ROWS = [
    {"comment_id": "1", "stance": "support", "stance_confidence": 0.9, "stance_rationale": "clear support",
     "tone": "neutral", "tags": '["policy"]', "error": None, "truncated": False,
     "analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
    {"comment_id": "2", "stance": "oppose",  "stance_confidence": 0.8, "stance_rationale": "clear oppose",
     "tone": "negative", "tags": '[]', "error": None, "truncated": False,
     "analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
 ]
 # ---------------------------------------------------------------------------
 # load_raw
 def test_load_raw_returns_raw_cols(tmp_path):
    p = tmp_path / "forum.jsonl"
    _write_jsonl(p, RAW_ROWS)
    df = cc.load_raw(p)
    assert list(df.columns) == cc.RAW_COLS
 def test_load_raw_row_count(tmp_path):
    p = tmp_path / "forum.jsonl"
    _write_jsonl(p, RAW_ROWS)
    df = cc.load_raw(p)
    assert len(df) == 3
 def test_load_raw_skips_non_comment_rows(tmp_path):
    """Rows without comment_id (e.g. forum metadata) are dropped."""
    rows = RAW_ROWS + [{"forum_id": "452", "reg_title": "Metadata row"}]
    p = tmp_path / "forum.jsonl"
    _write_jsonl(p, rows)
    df = cc.load_raw(p)
    assert len(df) == 3
 # ---------------------------------------------------------------------------
 # load_analysis
 def test_load_analysis_returns_analysis_cols(tmp_path):
    jobs = tmp_path / "jobs"
    jobs.mkdir()
    _write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
    df = cc.load_analysis(jobs)
    expected = ["comment_id"] + cc.ANALYSIS_COLS
    assert list(df.columns) == expected
 def test_load_analysis_skips_raw_files(tmp_path):
    jobs = tmp_path / "jobs"
    jobs.mkdir()
    _write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
    _write_jsonl(jobs / "job1-output-raw.jsonl", ANALYSIS_ROWS)  # should be ignored
    df = cc.load_analysis(jobs)
    assert len(df) == len(ANALYSIS_ROWS)
 def test_load_analysis_concatenates_multiple_files(tmp_path):
    jobs = tmp_path / "jobs"
    jobs.mkdir()
    _write_jsonl(jobs / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
    _write_jsonl(jobs / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
    df = cc.load_analysis(jobs)
    assert len(df) == 2
 # ---------------------------------------------------------------------------
 # join
 def test_join_all_raw_preserved(tmp_path):
    """Left join: all raw comments appear in output, even without analysis."""
    raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
    analysis = pd.DataFrame(ANALYSIS_ROWS)
    for col in cc.ANALYSIS_COLS:
        if col not in analysis.columns:
            analysis[col] = None
    analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
    merged = cc.join(raw, analysis)
    assert len(merged) == 3  # all 3 raw rows, even comment_id=3 with no analysis
 def test_join_unanalyzed_row_has_null_stance(tmp_path):
    raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
    analysis = pd.DataFrame(ANALYSIS_ROWS)
    for col in cc.ANALYSIS_COLS:
        if col not in analysis.columns:
            analysis[col] = None
    analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
    merged = cc.join(raw, analysis)
    unanalyzed = merged[merged["comment_id"] == "3"]
    assert pd.isna(unanalyzed.iloc[0]["stance"])
 def test_join_column_order(tmp_path):
    raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
    analysis = pd.DataFrame(ANALYSIS_ROWS)
    for col in cc.ANALYSIS_COLS:
        if col not in analysis.columns:
            analysis[col] = None
    analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
    merged = cc.join(raw, analysis)
    assert list(merged.columns) == cc.OUTPUT_COLS
 # ---------------------------------------------------------------------------
 # End-to-end: write + read CSV
 def test_csv_written_correctly(tmp_path):
    raw_path = tmp_path / "forum.jsonl"
    _write_jsonl(raw_path, RAW_ROWS)
    jobs = tmp_path / "jobs"
    jobs.mkdir()
    _write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
    out = tmp_path / "review.csv"
    raw      = cc.load_raw(raw_path)
    analysis = cc.load_analysis(jobs)
    merged   = cc.join(raw, analysis)
    merged.to_csv(out, index=False, encoding="utf-8-sig")
    loaded = pd.read_csv(out)
    assert len(loaded) == 3
    assert list(loaded.columns) == cc.OUTPUT_COLS
--- a/tests/encoding.py
+++ b/tests/encoding.py
@@ -0,0 +1,119 @@
 """Unit tests for analysis/encoding.py — no external dependencies required."""
 import sys
 from pathlib import Path
 import pytest
 sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
 from encoding import repair_text, _KNOWN_REPAIRS
 # ---------------------------------------------------------------------------
 # Core contract
 def test_empty_string_unchanged():
    assert repair_text("") == ""
 def test_none_like_empty_unchanged():
    assert repair_text("") == ""
 def test_clean_ascii_unchanged():
    text = "This is a normal sentence with no encoding issues."
    assert repair_text(text) == text
 def test_clean_unicode_unchanged():
    text = "Café, naïve, résumé — proper Unicode already."
    result = repair_text(text)
    # Should either be unchanged or equivalently correct
    assert "Caf" in result and "na" in result
 # ---------------------------------------------------------------------------
 # Known mojibake sequences (tasks.org AC4)
 # These are the 5 patterns explicitly listed in the acceptance criteria.
 def test_right_single_quote():
    """â€™ → ' (U+2019 right single quotation mark)"""
    assert repair_text("Virginiaâ€™s") == "Virginia’s"
 def test_left_double_quote():
    """â€œ → " (U+201C left double quotation mark)"""
    assert repair_text("â€œHello") == "“Hello"
 def test_en_dash():
    """â€" (where last char is U+201C) → – (U+2013 en dash)"""
    result = repair_text("pages 1â€“5")
    assert "–" in result or "—" in result or "-" in result
 def test_em_dash():
    """â€" (where last char is U+201D) → — (U+2014 em dash)"""
    result = repair_text("wordâ€”word")
    assert "—" in result or "–" in result or "-" in result
 def test_right_double_quote():
    """â€\x9d → " (U+201D right double quotation mark)"""
    result = repair_text("saidâ€ he")
    # Should not contain the raw artifact
    assert "â€" not in result
 # ---------------------------------------------------------------------------
 # Round-trip: garbled text produces sensible output
 def test_garbled_sentence_repaired():
    """A sentence with multiple mojibake chars is repaired to readable text."""
    # "Don't" with right single quote encoded as UTF-8, then decoded as cp1252
    # D o n ' t  →  D o n â€™ t
    garbled = "Donâ€™t worry"
    result = repair_text(garbled)
    assert "Don" in result and "t worry" in result
    assert "â€" not in result  # artifact gone
 def test_clean_string_after_repair_has_no_artifacts():
    garbled = "She said â€œHelloâ€ and left."
    result = repair_text(garbled)
    assert "â€" not in result
 # ---------------------------------------------------------------------------
 # FFFD replacement characters (from strict UTF-8 decode of cp1252 bytes)
 def test_fffd_preserved_not_crashed():
    """repair_text must not raise on U+FFFD; it may or may not repair it."""
    text = "Virginia\ufffds Public Schools"
    result = repair_text(text)
    assert isinstance(result, str)
    assert "Virginia" in result
 # ---------------------------------------------------------------------------
 # _KNOWN_REPAIRS table structure
 def test_known_repairs_non_empty():
    assert len(_KNOWN_REPAIRS) > 0
 def test_known_repairs_are_pairs():
    for item in _KNOWN_REPAIRS:
        assert len(item) == 2
        bad, good = item
        assert isinstance(bad, str) and isinstance(good, str)
 def test_known_repairs_bad_not_equal_good():
    for bad, good in _KNOWN_REPAIRS:
        assert bad != good
--- a/tests/analysis_gpt4o_batch.py
+++ b/tests/analysis_gpt4o_batch.py
@@ -1,4 +1,4 @@
-"""Unit tests for analysis/gpt4o/analysis_batch.py — no real API calls."""
+"""Unit tests for analysis/openai_batch.py — no real API calls."""
 import json
 import sys
@@ -7,8 +7,8 @@ from unittest.mock import MagicMock
 import pytest
-sys.path.insert(0, str(Path(__file__).parent.parent / "analysis" / "gpt4o"))
+sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
-import analysis_batch as bt
+import openai_batch as bt
 # ---------------------------------------------------------------------------
@@ -75,9 +75,24 @@ ANALYZED_AT = "2026-05-05T18:00:00+00:00"
 RUN_ID = "test-run-id-123"
 MODEL = "gpt-4o"
 # Minimal status.json for testing job logic
 def _make_status(jobs_override=None):
    jobs = jobs_override or [
        {"job_num": 1, "run_id": "r1", "status": "pending", "batch_id": None,
         "records_submitted": 60, "records_completed": None, "records_failed": None,
         "submitted_at": None, "completed_at": None},
    ]
    return {
        "model": "gpt-4o-mini", "prompt_hash": "abc1234",
        "input_file": "output/f452.jsonl", "input_sha256": "sha",
        "total_comments": 100, "input_tokens": 50_000,
        "est_queue_days": 0.025, "cost_$": 0.01,
        "total_jobs": len(jobs), "jobs": jobs,
    }
 # ---------------------------------------------------------------------------
-# Prompt versioning (batch reads the same prompt file)
+# Prompt versioning
 def test_prompt_version_is_7_hex_chars():
    assert len(bt.PROMPT_VERSION) == 7
@@ -86,7 +101,7 @@ def test_prompt_version_is_7_hex_chars():
 def test_prompt_version_matches_realtime():
    """Both scripts must derive the same PROMPT_VERSION from the same file."""
-    import analysis_realtime as rt
+    import openai_realtime as rt
    assert bt.PROMPT_VERSION == rt.PROMPT_VERSION
@@ -206,52 +221,6 @@ def test_normalize_unknown_comment_id():
    assert record["input_title"] == ""
 # ---------------------------------------------------------------------------
 # Manifest
 def test_make_manifest_all_keys():
    m = bt.make_manifest(
        run_id=RUN_ID,
        input_filename="output/forum452.jsonl",
        input_sha256="abc123",
        model="gpt-4o",
        batch_id="batch_xyz",
        records_submitted=100,
        request_filename="analysis/gpt4o/requests/test-run-id-123.jsonl",
    )
    required = {
        "run_id", "input_filename", "input_sha256", "prompt_hash", "model",
        "batch_id", "records_submitted", "records_completed", "records_failed",
        "request_filename", "raw_output_filename", "normalized_output_filename",
        "created_at", "completed_at",
    }
    assert required == set(m.keys())
 def test_make_manifest_initial_nulls():
    m = bt.make_manifest(
        run_id=RUN_ID, input_filename="f", input_sha256="s",
        model="gpt-4o", batch_id="b", records_submitted=10, request_filename="r",
    )
    assert m["records_completed"] is None
    assert m["records_failed"] is None
    assert m["raw_output_filename"] is None
    assert m["normalized_output_filename"] is None
    assert m["completed_at"] is None
    assert m["prompt_hash"] == bt.PROMPT_VERSION
 def test_manifest_save_load_roundtrip(tmp_path, monkeypatch):
    monkeypatch.setattr(bt, "RUNS_DIR", tmp_path)
    m = bt.make_manifest(
        run_id=RUN_ID, input_filename="f", input_sha256="s",
        model="gpt-4o", batch_id="b", records_submitted=42, request_filename="r",
    )
    bt.save_manifest(m)
    loaded = bt.load_manifest(RUN_ID)
    assert loaded == m
 # ---------------------------------------------------------------------------
 # estimate_tokens
@@ -273,7 +242,8 @@ def test_estimate_tokens_fallback_without_tiktoken(monkeypatch):
    monkeypatch.setitem(_sys.modules, "tiktoken", None)
    messages = [{"role": "user", "content": "x" * 300}]
    result = bt.estimate_tokens(messages, "gpt-4o")
-    assert result == 4 + 300 // 3
+    # fallback: 3 primer + (3 + 300//3) per message
    assert result == 3 + (3 + 300 // 3)
 # ---------------------------------------------------------------------------
@@ -309,3 +279,112 @@ def test_chunk_preserves_all_comments(monkeypatch):
 def test_model_limits_has_required_models():
    for model in ("gpt-4o", "gpt-4o-mini", "gpt-5.4", "gpt-5.4-mini", "gpt-o4-mini"):
        assert model in bt.MODEL_LIMITS, f"{model} missing from MODEL_LIMITS"
 # ---------------------------------------------------------------------------
 # status.json helpers
 def test_status_save_load_roundtrip(tmp_path):
    status = _make_status()
    bt.save_status(status, tmp_path)
    loaded = bt.load_status(tmp_path)
    assert loaded == status
 # ---------------------------------------------------------------------------
 # _find_next_eligible_job
 def test_find_next_eligible_job_first_job_pending():
    jobs = _make_status()["jobs"]
    target, warning = bt._find_next_eligible_job(jobs)
    assert target["job_num"] == 1
    assert warning is None
 def test_find_next_eligible_job_after_completed():
    jobs = [
        {"job_num": 1, "status": "completed", "batch_id": "b1",
         "records_submitted": 60, "records_completed": 60, "records_failed": 0,
         "submitted_at": "t", "completed_at": "t", "run_id": "r1"},
        {"job_num": 2, "status": "pending", "batch_id": None,
         "records_submitted": 40, "records_completed": None, "records_failed": None,
         "submitted_at": None, "completed_at": None, "run_id": "r2"},
    ]
    target, warning = bt._find_next_eligible_job(jobs)
    assert target["job_num"] == 2
    assert warning is None
 def test_find_next_eligible_job_blocked_by_in_progress():
    jobs = [
        {"job_num": 1, "status": "in_progress", "batch_id": "b1",
         "records_submitted": 60, "records_completed": None, "records_failed": None,
         "submitted_at": "t", "completed_at": None, "run_id": "r1"},
        {"job_num": 2, "status": "pending", "batch_id": None,
         "records_submitted": 40, "records_completed": None, "records_failed": None,
         "submitted_at": None, "completed_at": None, "run_id": "r2"},
    ]
    target, warning = bt._find_next_eligible_job(jobs)
    assert target is None
    assert warning is not None
    assert "in_progress" in warning
 def test_find_next_eligible_job_all_completed():
    jobs = [
        {"job_num": 1, "status": "completed", "batch_id": "b1",
         "records_submitted": 60, "records_completed": 60, "records_failed": 0,
         "submitted_at": "t", "completed_at": "t", "run_id": "r1"},
    ]
    target, warning = bt._find_next_eligible_job(jobs)
    assert target is None
    assert warning is None
 def test_resume_from_status_json(tmp_path):
    """Reload a status.json with one completed job and find the next pending job."""
    jobs = [
        {"job_num": 1, "run_id": "r1", "status": "completed", "batch_id": "b1",
         "records_submitted": 60, "records_completed": 58, "records_failed": 2,
         "submitted_at": "2026-05-06T10:00:00+00:00", "completed_at": "2026-05-06T11:00:00+00:00"},
        {"job_num": 2, "run_id": "r2", "status": "pending", "batch_id": None,
         "records_submitted": 40, "records_completed": None, "records_failed": None,
         "submitted_at": None, "completed_at": None},
    ]
    bt.save_status(_make_status(jobs), tmp_path)
    loaded = bt.load_status(tmp_path)
    target, warning = bt._find_next_eligible_job(loaded["jobs"])
    assert target["job_num"] == 2
    assert warning is None
 # ---------------------------------------------------------------------------
 # normalize: out-of-order and duplicate custom_id
 def test_out_of_order_output_reconciled_by_custom_id():
    """Raw lines processed in any order are mapped to the correct comment."""
    c2 = {**COMMENT_ITEM, "comment_id": "99999", "title": "Second comment"}
    lookup = {COMMENT_ITEM["comment_id"]: COMMENT_ITEM, "99999": c2}
    line_for_99999 = {
        **RAW_SUCCESS_LINE,
        "custom_id": "comment_99999",
    }
    line_for_87914 = RAW_SUCCESS_LINE
    r1 = bt.normalize_output_line(line_for_99999, lookup, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    r2 = bt.normalize_output_line(line_for_87914, lookup, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert r1["comment_id"] == "99999"
    assert r1["input_title"] == "Second comment"
    assert r2["comment_id"] == "87914"
    assert r2["input_title"] == COMMENT_ITEM["title"]
 def test_duplicate_custom_id_both_produce_valid_records():
    """Two raw lines with the same custom_id each produce a valid record."""
    r1 = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    r2 = bt.normalize_output_line(RAW_SUCCESS_LINE, COMMENT_LOOKUP, RUN_ID, ANALYZED_AT, MODEL, bt.PROMPT_VERSION)
    assert r1["comment_id"] == r2["comment_id"] == "87914"
    assert r1["error"] is None
    assert r2["error"] is None
--- a/tests/analysis_gpt4o_realtime.py
+++ b/tests/analysis_gpt4o_realtime.py
@@ -1,4 +1,4 @@
-"""Unit tests for analysis/gpt4o/analysis_realtime.py — no real API calls."""
+"""Unit tests for analysis/openai_realtime.py — no real API calls."""
 import json
 import sys
@@ -7,8 +7,8 @@ from unittest.mock import MagicMock
 import pytest
-sys.path.insert(0, str(Path(__file__).parent.parent / "analysis" / "gpt4o"))
+sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
-import analysis_realtime as rt
+import openai_realtime as rt
 # ---------------------------------------------------------------------------
--- a/tests/tokenizer.py
+++ b/tests/tokenizer.py
@@ -0,0 +1,250 @@
 """Unit tests for analysis/tokenizer.py — no real API calls."""
 import io
 import json
 import math
 import sys
 from pathlib import Path
 from unittest.mock import patch
 import pytest
 sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
 import tokenizer as tk
 import openai_batch as ab
 # ---------------------------------------------------------------------------
 # Fixtures
 FORUM_ITEM = {
    "forum_id": "452",
    "reg_title": "Model Policies for Transgender Students",
    "reg_desc": "Guidance developed in response to HB 145.",
 }
 COMMENT_A = {
    "forum_id": "452",
    "comment_id": "100",
    "author": "Alice",
    "date": "2021-01-04T09:15:00",
    "title": "Support",
    "text": "I support this policy.",
 }
 COMMENT_B = {
    "forum_id": "452",
    "comment_id": "101",
    "author": "Bob",
    "date": "2021-01-05T10:00:00",
    "title": "Oppose",
    "text": "I oppose this policy.",
 }
 COMMENTS = [COMMENT_A, COMMENT_B]
 PROMPT_HASH = "abc1234"
 INPUT_FILE = "output/f452.jsonl"
 INPUT_SHA256 = "deadbeef" * 8
 PROMPT_FILE = "analysis/prompt-1.txt"
 def _make_report(total_tokens=10_000):
    return tk.compute_report(
        COMMENTS, FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE
    )
 # ---------------------------------------------------------------------------
 # compute_report: required top-level keys
 def test_report_has_top_level_keys():
    report = _make_report()
    required = {"prompt", "prompt_hash", "input_file", "input_sha256",
                "total_comments", "input_tokens"}
    assert required.issubset(set(report.keys()))
 def test_report_metadata_values():
    report = _make_report()
    assert report["prompt"] == PROMPT_FILE
    assert report["prompt_hash"] == PROMPT_HASH
    assert report["input_file"] == INPUT_FILE
    assert report["input_sha256"] == INPUT_SHA256
    assert report["total_comments"] == 2
 def test_report_input_tokens_positive():
    report = _make_report()
    assert isinstance(report["input_tokens"], int)
    assert report["input_tokens"] > 0
 # ---------------------------------------------------------------------------
 # compute_report: per-model entries
 def test_report_has_per_model_keys():
    report = _make_report()
    for model in ab.MODEL_LIMITS:
        assert model in report, f"Model {model} missing from report"
        assert isinstance(report[model], dict)
 def test_report_per_model_has_required_fields():
    report = _make_report()
    for model in ab.MODEL_LIMITS:
        m = report[model]
        assert "jobs" in m
        assert "cost_$" in m
        assert "est_queue_days" in m
 def test_report_jobs_at_least_one():
    report = _make_report()
    for model in ab.MODEL_LIMITS:
        assert report[model]["jobs"] >= 1
 # ---------------------------------------------------------------------------
 # compute_report: calculation accuracy
 def test_cost_calculation():
    """cost_$ = total_tokens / 1M * pricing_rate"""
    report = _make_report()
    total = report["input_tokens"]
    for model in ab.MODEL_LIMITS:
        expected_cost = round(total / 1_000_000 * tk.MODEL_PRICING.get(model, 0.0), 4)
        assert report[model]["cost_$"] == pytest.approx(expected_cost, abs=1e-6)
 def test_est_queue_days_calculation():
    """est_queue_days = total_tokens / tpd (rounded to 2 decimal places)"""
    report = _make_report()
    total = report["input_tokens"]
    for model, tpd in ab.MODEL_LIMITS.items():
        expected = round(total / tpd, 2)
        assert report[model]["est_queue_days"] == pytest.approx(expected, abs=1e-4)
 def test_jobs_ceiling_division():
    """jobs = ceil(total_tokens / (tpd * _LIMIT_BUFFER))"""
    report = _make_report()
    total = report["input_tokens"]
    for model, tpd in ab.MODEL_LIMITS.items():
        effective = int(tpd * ab._LIMIT_BUFFER)
        expected = math.ceil(total / effective)
        assert report[model]["jobs"] == expected
 def test_more_comments_increases_tokens():
    """More comments → more input_tokens."""
    few = tk.compute_report([COMMENT_A], FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE)
    many = tk.compute_report(COMMENTS, FORUM_ITEM, PROMPT_HASH, INPUT_FILE, INPUT_SHA256, PROMPT_FILE)
    assert many["input_tokens"] > few["input_tokens"]
 # ---------------------------------------------------------------------------
 # MODEL_PRICING coverage
 def test_model_pricing_has_required_models():
    for model in ("gpt-4o", "gpt-4o-mini", "gpt-5.4", "gpt-5.4-mini", "gpt-o4-mini"):
        assert model in tk.MODEL_PRICING, f"{model} missing from MODEL_PRICING"
 def test_model_pricing_values_positive():
    for model, price in tk.MODEL_PRICING.items():
        assert price > 0, f"{model} has non-positive price"
 # ---------------------------------------------------------------------------
 # print_table: runs without error, produces output
 def test_print_table_runs():
    report = _make_report()
    buf = io.StringIO()
    with patch("sys.stdout", buf):
        tk.print_table(report)
    output = buf.getvalue()
    assert "gpt-4o" in output
    assert "gpt-4o-mini" in output
 def test_print_table_shows_all_models():
    report = _make_report()
    buf = io.StringIO()
    with patch("sys.stdout", buf):
        tk.print_table(report)
    output = buf.getvalue()
    for model in ab.MODEL_LIMITS:
        assert model in output, f"{model} not shown in print_table output"
 def test_print_table_highlights_recommended():
    """When a single-job cheapest model exists, table marks it as recommended."""
    report = _make_report()
    buf = io.StringIO()
    with patch("sys.stdout", buf):
        tk.print_table(report)
    output = buf.getvalue()
    assert "recommended" in output
 # ---------------------------------------------------------------------------
 # report.json round-trip (write → read)
 def test_report_json_roundtrip(tmp_path):
    report = _make_report()
    out = tmp_path / "report.json"
    out.write_text(json.dumps(report, indent=2, ensure_ascii=False), encoding="utf-8")
    loaded = json.loads(out.read_text(encoding="utf-8"))
    assert loaded["total_comments"] == report["total_comments"]
    assert loaded["input_tokens"] == report["input_tokens"]
    assert loaded["gpt-4o-mini"]["jobs"] == report["gpt-4o-mini"]["jobs"]
 # ---------------------------------------------------------------------------
 # count_input_tokens
 def _make_job_input(tmp_path, comments, forum=None) -> Path:
    """Write a batch request JSONL in the same format as job1-input.jsonl."""
    p = tmp_path / "job1-input.jsonl"
    with open(p, "w", encoding="utf-8") as f:
        for c in comments:
            f.write(json.dumps(ab.build_batch_request_line(c, forum, "gpt-4o-mini")) + "\n")
    return p
 def test_count_input_tokens_matches_estimate(tmp_path):
    """count_input_tokens on a freshly written job file equals the sum estimate_tokens produces."""
    p = _make_job_input(tmp_path, COMMENTS, FORUM_ITEM)
    result = tk.count_input_tokens(p, "gpt-4o-mini")
    expected = sum(
        ab.estimate_tokens(ab.build_messages(c, FORUM_ITEM)[0], "gpt-4o-mini")
        for c in COMMENTS
    )
    assert result["total_tokens"] == expected
    assert result["total_requests"] == len(COMMENTS)
 def test_count_input_tokens_fields(tmp_path):
    p = _make_job_input(tmp_path, COMMENTS, FORUM_ITEM)
    result = tk.count_input_tokens(p)
    assert set(result.keys()) == {"total_tokens", "total_requests", "min", "max", "mean"}
    assert result["min"] <= result["mean"] <= result["max"]
    assert result["min"] > 0
 def test_count_input_tokens_empty_file(tmp_path):
    p = tmp_path / "empty.jsonl"
    p.write_text("", encoding="utf-8")
    result = tk.count_input_tokens(p)
    assert result["total_tokens"] == 0
    assert result["total_requests"] == 0
 def test_count_input_tokens_includes_system_prompt(tmp_path):
    """Token count must be higher than user-message-only text length / 3 (prompt adds tokens)."""
    p = _make_job_input(tmp_path, [COMMENT_A], FORUM_ITEM)
    result = tk.count_input_tokens(p)
    user_chars = len(COMMENT_A.get("text", ""))
    # system prompt alone is hundreds of tokens; total must exceed naive user-text estimate
    assert result["total_tokens"] > user_chars // 3
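Reviewer note: the three calculation tests above (`test_cost_calculation`, `test_est_queue_days_calculation`, `test_jobs_ceiling_division`) each restate one formula in a docstring. The sketch below works all three through with one set of numbers so they can be checked by hand. The function name `estimate_batch` and every input value are hypothetical. They are not taken from `MODEL_LIMITS`, `MODEL_PRICING`, or `_LIMIT_BUFFER`, and this is not the body of `compute_report`.

```python
import math


def estimate_batch(total_tokens: int, tokens_per_day: int,
                   price_per_million: float, limit_buffer: float) -> dict:
    """Apply the three formulas the tests assert (hypothetical helper)."""
    # cost_$: tokens expressed in millions, times the per-million price
    cost = round(total_tokens / 1_000_000 * price_per_million, 4)
    # est_queue_days: share of the daily token allowance the input consumes
    queue_days = round(total_tokens / tokens_per_day, 2)
    # jobs: ceiling division against the buffered daily limit
    effective_limit = int(tokens_per_day * limit_buffer)
    jobs = math.ceil(total_tokens / effective_limit)
    return {"jobs": jobs, "cost_$": cost, "est_queue_days": queue_days}


if __name__ == "__main__":
    # Hypothetical inputs: 2.5M tokens, 2M tokens/day, $0.15 per 1M, 90% buffer
    print(estimate_batch(2_500_000, 2_000_000, 0.15, 0.9))
```

With these inputs the buffered limit is 1,800,000 tokens, so 2.5M tokens need two jobs even though the unbuffered queue estimate is only 1.25 days.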
--- a/tests/validate-sentiment.py
+++ b/tests/validate-sentiment.py
@@ -0,0 +1,217 @@
 """Unit tests for analysis/validate.py — no file I/O beyond tmp_path."""
 import json
 import sys
 from pathlib import Path
 import pytest
 sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
 try:
    import pandas as pd
 except ImportError:
    pytest.skip("pandas not installed", allow_module_level=True)
 import validate as vl
 # ---------------------------------------------------------------------------
 # Fixtures
 def _write_jsonl(path: Path, rows: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
 RAW_ROWS = [
    {"forum_id": "452", "comment_id": "1", "title": "Support it",
     "text": "I support this.", "date": "2021-01-04T09:00:00", "author": "Alice"},
    {"forum_id": "452", "comment_id": "2", "title": "Oppose it",
     "text": "I oppose this.", "date": "2021-01-05T10:00:00", "author": "Bob"},
    {"forum_id": "452", "comment_id": "3", "title": "Neutral",
     "text": "No opinion.", "date": "2021-01-06T11:00:00", "author": "Carol"},
 ]
 ANALYSIS_ROWS = [
    {"run_id": "r1", "forum_id": "452", "comment_id": "1", "input_title": "Support it",
     "analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
     "prompt_version": "abc1234", "stance": "support", "stance_confidence": 0.95,
     "stance_rationale": "Commenter says 'I support'.", "tone": "positive",
     "tags": ["student safety"], "truncated": False, "error": None},
    {"run_id": "r1", "forum_id": "452", "comment_id": "2", "input_title": "Oppose it",
     "analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
     "prompt_version": "abc1234", "stance": "oppose", "stance_confidence": 0.90,
     "stance_rationale": "Commenter says 'I oppose'.", "tone": "negative",
     "tags": [], "truncated": False, "error": None},
 ]
 FORUM_ROW = {"forum_id": "452", "reg_title": "Policy X", "reg_desc": "Guidance on Y."}
@pytest.fixture()
 def raw_jsonl(tmp_path) -> Path:
    p = tmp_path / "f452.jsonl"
    _write_jsonl(p, [FORUM_ROW] + RAW_ROWS)
    return p
@pytest.fixture()
 def jobs_dir(tmp_path) -> Path:
    d = tmp_path / "jobs" / "f452-1"
    d.mkdir(parents=True)
    _write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
    return d
 # ---------------------------------------------------------------------------
 # load_raw
 def test_load_raw_returns_only_comments(raw_jsonl):
    df = vl.load_raw(raw_jsonl)
    assert len(df) == 3
    assert set(df.columns) == set(vl.RAW_COLS)
 def test_load_raw_correct_columns(raw_jsonl):
    df = vl.load_raw(raw_jsonl)
    for col in vl.RAW_COLS:
        assert col in df.columns
 def test_load_raw_skips_forum_item(raw_jsonl):
    df = vl.load_raw(raw_jsonl)
    assert "reg_title" not in df.columns
 # ---------------------------------------------------------------------------
 # load_analysis
 def test_load_analysis_skips_raw_files(tmp_path):
    d = tmp_path / "jobs" / "f452-1"
    d.mkdir(parents=True)
    _write_jsonl(d / "job1-output-raw.jsonl", ANALYSIS_ROWS)   # should be ignored
    _write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
    df = vl.load_analysis(d)
    assert len(df) == len(ANALYSIS_ROWS)
 def test_load_analysis_concatenates_multiple_files(tmp_path):
    d = tmp_path / "jobs" / "f452-1"
    d.mkdir(parents=True)
    _write_jsonl(d / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
    _write_jsonl(d / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
    df = vl.load_analysis(d)
    assert len(df) == 2
 def test_load_analysis_tags_serialized_as_json(jobs_dir):
    df = vl.load_analysis(jobs_dir)
    tags_val = df.loc[df["comment_id"] == "1", "tags"].iloc[0]
    assert isinstance(tags_val, str)
    assert json.loads(tags_val) == ["student safety"]
 def test_load_analysis_empty_tags_serialized(jobs_dir):
    df = vl.load_analysis(jobs_dir)
    tags_val = df.loc[df["comment_id"] == "2", "tags"].iloc[0]
    assert json.loads(tags_val) == []
 # ---------------------------------------------------------------------------
 # join — by comment_id, not index
 def test_join_by_comment_id_not_index(raw_jsonl, jobs_dir):
    raw      = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    # Shuffle raw order so comment_id ordering differs from index
    raw = raw.sample(frac=1, random_state=42).reset_index(drop=True)
    merged = vl.join(raw, analysis)
    row_1 = merged[merged["comment_id"] == "1"].iloc[0]
    assert row_1["stance"] == "support"
    assert row_1["author"] == "Alice"
 def test_join_unanalyzed_comment_has_null_stance(raw_jsonl, jobs_dir):
    """Comment 3 is in raw but not in analysis — stance should be NaN."""
    raw      = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged   = vl.join(raw, analysis)
    row_3 = merged[merged["comment_id"] == "3"].iloc[0]
    assert pd.isna(row_3["stance"])
 def test_join_preserves_all_raw_comments(raw_jsonl, jobs_dir):
    raw      = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged   = vl.join(raw, analysis)
    assert len(merged) == len(raw)
 def test_join_output_columns_in_order(raw_jsonl, jobs_dir):
    raw      = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged   = vl.join(raw, analysis)
    assert list(merged.columns) == vl.OUTPUT_COLS
 # ---------------------------------------------------------------------------
 # Duplicate comment_id handling
 def test_duplicate_raw_id_flagged(raw_jsonl, jobs_dir):
    raw      = vl.load_raw(raw_jsonl)
    # Manually duplicate a row
    raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)
    analysis = vl.load_analysis(jobs_dir)
    merged   = vl.join(raw, analysis)
    # join still produces a row for each raw row (left join)
    assert len(merged) == len(raw)
    assert raw["comment_id"].duplicated().sum() == 1
 def test_duplicate_analysis_id_produces_extra_rows(raw_jsonl, tmp_path):
    """Two analysis records for the same comment_id create two joined rows."""
    d = tmp_path / "jobs" / "f452-dup"
    d.mkdir(parents=True)
    dup_rows = [ANALYSIS_ROWS[0], {**ANALYSIS_ROWS[0], "stance": "oppose"}]
    _write_jsonl(d / "job1-output.jsonl", dup_rows)
    raw      = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(d)
    merged   = vl.join(raw, analysis)
    assert len(merged[merged["comment_id"] == "1"]) == 2
 # ---------------------------------------------------------------------------
 # Validation counts (smoke test — just confirm it runs without error)
 def test_print_validation_runs(raw_jsonl, jobs_dir, capsys):
    raw      = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged   = vl.join(raw, analysis)
    vl.print_validation(raw, analysis, merged)
    out = capsys.readouterr().out
    assert "Raw comments" in out
    assert "Stance counts" in out
    assert "Tone counts" in out
 # ---------------------------------------------------------------------------
 # CSV output
 def test_csv_written_to_jobs_dir(raw_jsonl, jobs_dir, tmp_path):
    raw      = vl.load_raw(raw_jsonl)
    analysis = vl.load_analysis(jobs_dir)
    merged   = vl.join(raw, analysis)
    out_path = jobs_dir / "review.csv"
    merged.to_csv(out_path, index=False, encoding="utf-8-sig")
    assert out_path.exists()
    loaded = pd.read_csv(out_path, encoding="utf-8-sig")
    assert list(loaded.columns) == vl.OUTPUT_COLS
    assert len(loaded) == len(raw)
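Reviewer note: the join tests above pin down three behaviours: rows are matched by `comment_id` rather than by position, unanalyzed comments survive with an empty stance, and duplicate analysis records fan out into extra rows. The plain-Python sketch below illustrates those left-join semantics only. `left_join` is a hypothetical helper, not the implementation of `vl.join`, and `None` stands in for the NaN that pandas would produce.

```python
def left_join(raw_rows, analysis_rows, key="comment_id"):
    """Left-join analysis rows onto raw rows by key, never by position."""
    by_key = {}
    for row in analysis_rows:
        by_key.setdefault(row[key], []).append(row)

    merged = []
    for raw in raw_rows:
        matches = by_key.get(raw[key])
        if not matches:
            # Unanalyzed comment: keep the raw row and leave stance empty.
            merged.append({**raw, "stance": None})
            continue
        # One output row per matching analysis record, so duplicates fan out.
        merged.extend({**raw, **match} for match in matches)
    return merged


raw = [
    {"comment_id": "1", "author": "Alice"},
    {"comment_id": "2", "author": "Bob"},
    {"comment_id": "3", "author": "Carol"},
]
# Deliberately listed in a different order than the raw rows.
analysis = [
    {"comment_id": "2", "stance": "oppose"},
    {"comment_id": "1", "stance": "support"},
]
merged = left_join(raw, analysis)
```

Because a duplicated analysis record silently adds rows, a check on `comment_id` uniqueness before joining is one way to catch it early.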
--- a/viz/chart_tests/confidence_by_stance.html
+++ b/viz/chart_tests/confidence_by_stance.html
--- a/viz/chart_tests/cumulative_stance_area.html
+++ b/viz/chart_tests/cumulative_stance_area.html
--- a/viz/chart_tests/cumulative_stance_share.html
+++ b/viz/chart_tests/cumulative_stance_share.html
--- a/viz/chart_tests/stance_diverging_bar.html
+++ b/viz/chart_tests/stance_diverging_bar.html
--- a/viz/chart_tests/stance_over_time.html
+++ b/viz/chart_tests/stance_over_time.html
--- a/viz/chart_tests/stance_share.html
+++ b/viz/chart_tests/stance_share.html
--- a/viz/chart_tests/stance_tone_counts.html
+++ b/viz/chart_tests/stance_tone_counts.html
--- a/viz/chart_tests/stance_tone_heatmap.html
+++ b/viz/chart_tests/stance_tone_heatmap.html
--- a/viz/chart_tests/stance_tone_rowpct.html
+++ b/viz/chart_tests/stance_tone_rowpct.html
--- a/viz/proto/confidence_by_stance.html
+++ b/viz/proto/confidence_by_stance.html
--- a/viz/proto/stance_over_time.html
+++ b/viz/proto/stance_over_time.html
--- a/viz/proto/stance_share.html
+++ b/viz/proto/stance_share.html
--- a/viz/proto/stance_tone_heatmap.html
+++ b/viz/proto/stance_tone_heatmap.html
--- a/viz/prototype_charts.py
+++ b/viz/prototype_charts.py
@@ -0,0 +1,134 @@
 '''
    prototype_charts.py
    generate test charts for later addition to streamlit
 '''
 from pathlib import Path
 import pandas as pd
 import plotly.express as px
 import numpy as np
 inp = Path("analysis/jobs/f452-1/review.csv")  # relative to the repo root, as in viz/prototype_streamlit.py
 out = Path("viz/")
 out.mkdir(parents=True, exist_ok=True)
 stance_order = ["support", "oppose", "neutral", "unknown"]
 # tone_order = ["positive", "negative", "neutral", "mixed", "unknown", "unclear"]
 # default order was actually better - unclear/negative/neutral/mixed/positive vs unknown/oppose/neutral/support
 # same for pct w/in stance
 df = pd.read_csv(inp)
 df["date"] = pd.to_datetime(df["date"], errors="coerce")
 df["date_day"] = df["date"].dt.date
 df["stance"] = df["stance"].fillna("unknown")
 df["tone"] = df["tone"].fillna("unknown")
 # 1. stance share
 counts = df["stance"].value_counts().reindex(stance_order, fill_value=0).reset_index()
 counts.columns = ["stance", "count"]
 fig = px.bar(counts, x="count", y="stance", orientation="h", text="count")
 fig.write_html(out / "stance_share.html")
 # 2. stance over time
 daily = df.groupby(["date_day", "stance"]).size().reset_index(name="count")
 fig = px.bar(daily, x="date_day", y="count", color="stance", category_orders={"stance": stance_order})
 fig.write_html(out / "stance_over_time.html")
 # 3. stance x tone
 heat = df.groupby(["stance", "tone"]).size().reset_index(name="count")
 fig = px.density_heatmap(heat, x="tone", y="stance", z="count", category_orders={"stance": stance_order})
 fig.write_html(out / "stance_tone_heatmap.html")
 # 4. confidence by stance
 fig = px.box(df, x="stance", y="stance_confidence", category_orders={"stance": stance_order}, points="outliers")
 fig.write_html(out / "confidence_by_stance.html")
 # 5. cumulative stance and share over time
 daily = (
    df.groupby(["date_day", "stance"])
      .size()
      .unstack(fill_value=0)
      .reindex(columns=stance_order, fill_value=0)
      .sort_index()
 )
 cum = daily.cumsum()
 cum_long = cum.reset_index().melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
 fig = px.area(
    cum_long,
    x="date_day",
    y="cumulative_count",
    color="stance",
    category_orders={"stance": stance_order},
    title="cumulative comments by stance over time",
 )
 fig.write_html(out / "cumulative_stance_area.html")
 cum_pct = cum.div(cum.sum(axis=1), axis=0).reset_index().melt(
    id_vars="date_day", var_name="stance", value_name="cumulative_share"
 )
 fig = px.line(
    cum_pct,
    x="date_day",
    y="cumulative_share",
    color="stance",
    category_orders={"stance": stance_order},
    title="cumulative stance share over time",
 )
 fig.update_yaxes(tickformat=".0%")
 fig.write_html(out / "cumulative_stance_share.html")
 # 7. diverging h-bar
 stance_counts = df["stance"].value_counts().reindex(stance_order, fill_value=0)
 div = pd.DataFrame({
    "stance": ["oppose", "support", "neutral", "unknown"],
    "count": [
        -stance_counts.get("oppose", 0),
         stance_counts.get("support", 0),
         stance_counts.get("neutral", 0),
         stance_counts.get("unknown", 0),
    ],
 })
 fig = px.bar(
    div,
    x="count",
    y="stance",
    orientation="h",
    text=div["count"].abs(),
    title="support vs oppose",
 )
 fig.update_xaxes(title="comments", zeroline=True)
 fig.update_traces(textposition="outside")
 fig.write_html(out / "stance_diverging_bar.html")
 # 8. Stance x Tone labels
 # tone_order is commented out above, so keep crosstab's default (alphabetical) tone columns
 heat = pd.crosstab(df["stance"], df["tone"]).reindex(index=stance_order, fill_value=0)
 fig = px.imshow(
    heat,
    text_auto=True,
    aspect="auto",
    title="stance x tone, count",
 )
 fig.write_html(out / "stance_tone_counts.html")
 rowpct = heat.div(heat.sum(axis=1).replace(0, np.nan), axis=0)
 fig = px.imshow(
    rowpct,
    text_auto=".0%",
    aspect="auto",
    title="stance x tone, percent within stance",
 )
 fig.write_html(out / "stance_tone_rowpct.html")
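Reviewer note: the cumulative-share chart in `prototype_charts.py` divides each day's running count per stance by the running total across stances. A dependency-free sketch of that arithmetic follows. `cumulative_share` and the sample records are hypothetical and only illustrate the calculation, not the pandas pipeline itself.

```python
from collections import Counter


def cumulative_share(records, stances):
    """records: iterable of (day, stance) pairs. Returns {day: {stance: share}}."""
    daily = {}
    for day, stance in records:
        daily.setdefault(day, Counter())[stance] += 1

    running = Counter()
    shares = {}
    for day in sorted(daily):  # ISO date strings sort chronologically
        running.update(daily[day])
        total = sum(running[s] for s in stances)
        # Guard against a zero total so no division error occurs.
        shares[day] = {s: (running[s] / total if total else None) for s in stances}
    return shares


records = [
    ("2021-01-04", "support"),
    ("2021-01-04", "oppose"),
    ("2021-01-05", "support"),
    ("2021-01-06", "support"),
]
shares = cumulative_share(records, ["support", "oppose"])
```

In this sample the support share moves from one half on the first day to three quarters on the last, which is the kind of drift the line chart is meant to show.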
--- a/viz/prototype_streamlit.py
+++ b/viz/prototype_streamlit.py
@@ -0,0 +1,28 @@
 # streamlit run viz/prototype_streamlit.py
 from datetime import datetime
 import pandas as pd
 import plotly.graph_objects as go
 import plotly.express as px
 import streamlit as st
 df = pd.read_csv(r"analysis/jobs/f452-1/review.csv")
 st.set_page_config(layout="wide")
 stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
 q = st.text_input("Search comment text")
 dff = df[df["stance"].isin(stance)]
 if q:
    dff = dff[dff["text"].fillna("").str.contains(q, case=False, regex=False)]
 st.dataframe(dff[["comment_id", "title", "stance", "stance_confidence", "tone"]], width="stretch")
 st.write("Showing " + str(len(dff))+ " comments")
 cid = st.selectbox("comment", dff["comment_id"].astype(str))
 row = dff[dff["comment_id"].astype(str) == cid].iloc[0]
 st.subheader(row["title"])
 st.write(row["text"])
 st.write(row["author"] + ", " + row["date"][:10])
 st.write("**model:** " + str(row["model"]))
 st.markdown("**stance:** " + str(row["stance"]) + "  \n**confidence:** " + str(row["stance_confidence"]) + "  \n**tone:** " + str(row["tone"]))
 st.write("**analysis:** "+ row["stance_rationale"])
--- a/viz/streamlit.py
+++ b/viz/streamlit.py
@@ -0,0 +1,189 @@
 # streamlit run viz/streamlit.py -- --jobs-dir analysis/jobs/f452-1
 import argparse
 from pathlib import Path
 from datetime import datetime as dt
 import pandas as pd
 import plotly.graph_objects as go
 import plotly.express as px
 import streamlit as st
 parser = argparse.ArgumentParser()
 parser.add_argument("--jobs-dir", default="analysis/jobs/f452-1", type=Path,
                    help="Job directory containing review.csv, forum.jsonl, and prompt.txt")
 args, _ = parser.parse_known_args()  # parse_known_args: ignore Streamlit's own argv entries
 workdir = args.jobs_dir
 df = pd.read_csv(workdir/"review.csv")
 df['date_dt'] = pd.to_datetime(df.date)
 df["date_day"] = df["date_dt"].dt.date
 forum = pd.read_json(workdir/"forum.jsonl", lines=True).iloc[0].to_dict()
 prompt = (workdir/"prompt.txt").read_text(encoding="utf-8")
 stance_colors = {'oppose':'#ffa15a', 'neutral':'#e377c2','support':'#19d3f3','unknown':'#000000'}
 stance_order = ["oppose", "mixed", "unknown", "neutral", "support"]
 st.set_page_config(layout="wide")
 st.title("Virginia Townhall Explorer",anchor=None)
 st.caption("Explore data collected from Virginia's public comment system. Source code at https://github.com/eulaly/vath")
 st.subheader("Proposal",anchor=None,divider="gray")
 st.markdown(f"**{forum.get('reg_title')}**")
 st.text(forum.get('reg_desc'))
 st.caption(f'Comments posted from {dt.strftime(min(df.date_dt),"%D")}—{dt.strftime(max(df.date_dt),"%D")} at https://www.townhall.virginia.gov/L/Comments.cfm?GDocForumID={forum.get("forum_id")}')
 st.subheader("Comment Summary",anchor=False,divider="gray")
 summary_left, summary_right = st.columns([1,2])
 with summary_left:
     # Summary Table
     summary_stats = (
         df.groupby("stance").size()
           .reindex(stance_order, fill_value=0)
           .reset_index(name="count")
           .assign(percent=lambda d: (d["count"] / d["count"].sum()).map("{:.1%}".format))
     )
     st.dataframe(summary_stats, hide_index=True, width="stretch")
 with summary_right:
     # Stance div-h
    counts = df["stance"].value_counts()
    stance_divh = go.Figure()
    stance_divh.add_bar(y=["stance"], x=[-counts.get("oppose",0)], name="oppose", orientation="h", marker_color=stance_colors.get('oppose'), text=[counts.get("oppose",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("neutral",0)], name="neutral", orientation="h", marker_color=stance_colors.get('neutral'), text=[counts.get("neutral",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("unknown",0)], name="unknown", orientation="h", marker_color=stance_colors.get('unknown'), text=[counts.get("unknown",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("support",0)], name="support", orientation="h", marker_color=stance_colors.get('support'), text=[counts.get("support",0)], textposition="inside")
    stance_divh.update_yaxes(title_text="",showticklabels=False)
    stance_divh.update_layout(barmode="relative", title="", height=180, margin=dict(l=0,r=0,t=0,b=0),xaxis_title="", yaxis_title="",legend=dict(orientation="v",y=0.12))
    st.plotly_chart(stance_divh,width='stretch')
 # Daily Comments Breakdown, 3 Tabs
 daily_wide = (
    df.groupby(["date_day", "stance"])
      .size()
      .unstack(fill_value=0)
      .reindex(columns=stance_order, fill_value=0)
      .sort_index()
 )
 daily_long = (
    daily_wide.reset_index()
      .melt(id_vars="date_day", var_name="stance", value_name="count")
 )
 cum_wide = daily_wide.cumsum()
 cum_long = (
    cum_wide.reset_index()
      .melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
 )
 cum_total = cum_wide.sum(axis=1)
 cum_share = cum_wide.div(cum_total.where(cum_total > 0), axis=0)
 cum_share_long = (
    cum_share.reset_index()
      .melt(id_vars="date_day", var_name="stance", value_name="cumulative_share")
 )
 tab_daily, tab_area, tab_share = st.tabs([
    "Daily",
    "Cumulative",
    "Cumulative Share",
 ])
 with tab_daily:
    fig = px.bar(
        daily_long,
        x="date_day",
        y="count",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_layout(barmode="stack", height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")
 with tab_area:
    fig = px.area(
        cum_long,
        x="date_day",
        y="cumulative_count",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_layout(height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")
 with tab_share:
    fig = px.line(
        cum_share_long,
        x="date_day",
        y="cumulative_share",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_yaxes(tickformat=".0%", range=[0, 1])
    fig.update_layout(height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")
 st.subheader("Comment Explorer",anchor=False,divider="gray") 
 # comment explorer
 cex_left, cex_right = st.columns([1,1])
 with cex_left:
    filter_stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
    filter_tone = st.multiselect("Filter tone", sorted(df["tone"].dropna().unique()), default=sorted(df["tone"].dropna().unique()))
    dff = df[df["stance"].isin(filter_stance) & df["tone"].isin(filter_tone)]
 with cex_right:
     q = st.text_input("Search comment title and text")
     if q:
         # Search both fields, as the label promises
         haystack = dff["title"].fillna("") + " " + dff["text"].fillna("")
         dff = dff[haystack.str.contains(q, case=False, regex=False)]
    st.text(""); st.text("")
    st.text("Showing " + str(len(dff))+ " comments",text_alignment="right", width="stretch")
 st.dataframe(dff[["comment_id", "title", "text", "stance", "stance_confidence", "tone"]], width="stretch")
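 cid = st.selectbox("Select comment to view:", dff["comment_id"].astype(str))
 if cid is None:
     # Filters or search matched nothing; avoid an IndexError on .iloc[0]
     st.info("No comments match the current filters.")
     st.stop()
 row = dff[dff["comment_id"].astype(str) == cid].iloc[0]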
 st.markdown(f'**{row["title"]}**')
 st.text(row["text"])
 st.write(row["author"] + ", " + row["date_dt"].strftime("%D"))
 st.divider()
 st.subheader('Analysis')
 cexs_left, cexs_right = st.columns([1,1])
 with cexs_left:
    st.write(f"**stance:** {row['stance']}")
    st.write(f"**stance_confidence:** {row['stance_confidence']:.2f}")
    st.write(f"**tone:** {row['tone']}")
    st.write("**analysis:** "+ row["stance_rationale"])
 with cexs_right:
    x_order = ["unknown","oppose","mixed","neutral","support"]  # includes mixed even if absent; harmless zero column
    y_order = ["positive","neutral","mixed","negative","unclear"]
    tab = pd.crosstab(df["tone"], df["stance"]).reindex(index=y_order, columns=x_order, fill_value=0)
    pct = tab.div(tab.sum(axis=1).replace(0, pd.NA), axis=0).fillna(0)
    tone_stance = px.imshow(
        pct,
        x=x_order, y=y_order,
        text_auto=".0%",
        aspect="auto",
        color_continuous_scale="Greens",
    )
    tone_stance.update_traces(text=tab.astype(str) + " / " + (pct*100).round(0).astype(int).astype(str) + "%")
    tone_stance.add_scatter(x=[row["stance"]],y=[row["tone"]],mode="markers",marker=dict(size=15,color="yellow",symbol="cross",line=dict(width=1, color="red")),showlegend=False)
    tone_stance.update_layout(height=420, xaxis_title="stance", yaxis_title="tone")
    st.plotly_chart(tone_stance, width='stretch')
    st.caption("Tone by stance, % within tone", text_alignment="right",width="stretch")
 st.divider()
 st.write("**model:** " + str(row["model"]))
 with st.expander("Prompt", expanded=False):
    st.code(prompt, language="text")
 tone_conf = px.box(df,x="stance",y="stance_confidence",color="stance",category_orders={"stance":stance_order},color_discrete_map=stance_colors,points="outliers",title="Comment Stance Classification Confidence")
 tone_conf.update_yaxes(range=[0,1.02])
 tone_conf.update_layout(height=430, legend_orientation="v")
 st.plotly_chart(tone_conf,width="stretch")
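Reviewer note: `viz/streamlit.py` reads `--jobs-dir` with `parse_known_args` so that arguments it does not define are ignored rather than causing the parser to exit. A minimal standalone sketch of that behaviour follows. `parse_jobs_dir`, the `f452-2` path, and `--some-other-flag` are hypothetical and used only for illustration.

```python
import argparse
from pathlib import Path


def parse_jobs_dir(argv):
    """Return (jobs_dir, leftover_args) without failing on unknown flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--jobs-dir", default="analysis/jobs/f452-1", type=Path)
    # parse_known_args returns a (namespace, leftovers) pair instead of
    # exiting when it meets an argument the parser does not define.
    args, unknown = parser.parse_known_args(argv)
    return args.jobs_dir, unknown


jobs_dir, unknown = parse_jobs_dir(
    ["--jobs-dir", "analysis/jobs/f452-2", "--some-other-flag"]
)
default_dir, _ = parse_jobs_dir([])
```

Because the default is given as a string, argparse runs it through `type=Path` too, so `workdir / "review.csv"` works whether or not the flag is passed.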
Author	SHA1	Message	Date
eulaly	8f1d9e7723	added forum metadata for later use	2026-05-09 00:36:30 -04:00
eulaly	181477bce7	streamlit > local docker	2026-05-09 00:25:27 -04:00
eulaly	771f11fd3c	updated readme	2026-05-09 00:02:24 -04:00
eulaly	f42183eeda	added streamlit link	2026-05-09 00:00:59 -04:00
eulaly	92706bafb5	updated tasks and deps	2026-05-08 23:57:46 -04:00
eulaly	723b353db8	lol	2026-05-08 23:33:55 -04:00
eulaly	67cd96a523	updated readme.md	2026-05-08 23:32:44 -04:00
eulaly	cc16acbb12	added argparse for job dir, added tone filter	2026-05-08 23:28:13 -04:00
eulaly	afd5b8c60e	full local streamlit support	2026-05-08 21:57:04 -04:00
eulaly	3fb424da3c	added streamlit v1	2026-05-08 17:22:33 -04:00
eulaly	c3f2911563	updated reqts	2026-05-07 21:55:00 -04:00
eulaly	05515745fd	Merge branch 'master' of https://git.hgsky.me/ben/vath	2026-05-07 21:54:27 -04:00
eulaly	3d3372bbb3	Merge branch 'master' of https://git.hgsky.me/ben/vath	2026-05-07 21:53:40 -04:00
ben	3a139da440	Delete docs/vatownhall.md ye	2026-05-07 21:48:08 -04:00
eulaly	976db1b0fe	finally got images working	2026-05-07 21:46:27 -04:00
ben	7593754866	Update README.md fixed display	2026-05-07 21:42:08 -04:00
ben	016882d527	Update docs/vatownhall.md	2026-05-07 21:35:49 -04:00
ben	58feb9820d	Update docs/vatownhall.md fixing inline img	2026-05-07 21:34:57 -04:00
ben	35f30e9514	Update docs/vatownhall.md fixing inline img	2026-05-07 21:34:33 -04:00
eulaly	985760be7c	tesging images	2026-05-07 18:07:45 -04:00
eulaly	983650a64f	testing images	2026-05-07 18:06:02 -04:00
eulaly	eaaefb66f2	adding image	2026-05-07 18:00:51 -04:00
eulaly	bdab3c5e21	added excel detritus	2026-05-07 17:56:05 -04:00
eulaly	b4a9651e11	added graph snapshot	2026-05-07 17:22:34 -04:00
eulaly	1ea696d818	added texts and fixes for mojibake	2026-05-07 17:22:16 -04:00
eulaly	28d6d222bd	added create_csv.py	2026-05-07 17:22:00 -04:00
eulaly	72c2ae0ca0	updated readme	2026-05-07 17:01:08 -04:00
eulaly	f5d679808e	completed openai batch work	2026-05-07 07:24:11 -04:00
eulaly	64a7a18721	openai batch refactor	2026-05-06 13:53:50 -04:00