Compare commits
27 Commits
f5d679808e
...
master
| Author | SHA1 | Date | |
|---|---|---|---|
| | 8f1d9e7723 | | |
| | 181477bce7 | | |
| | 771f11fd3c | | |
| | f42183eeda | | |
| | 92706bafb5 | | |
| | 723b353db8 | | |
| | 67cd96a523 | | |
| | cc16acbb12 | | |
| | afd5b8c60e | | |
| | 3fb424da3c | | |
| | c3f2911563 | | |
| | 05515745fd | | |
| | 3d3372bbb3 | | |
| | 3a139da440 | | |
| | 976db1b0fe | | |
| | 7593754866 | | |
| | 016882d527 | | |
| | 58feb9820d | | |
| | 35f30e9514 | | |
| | 985760be7c | | |
| | 983650a64f | | |
| | eaaefb66f2 | | |
| | bdab3c5e21 | | |
| | b4a9651e11 | | |
| | 1ea696d818 | | |
| | 28d6d222bd | | |
| | 72c2ae0ca0 | | |
1
.gitignore
vendored
@@ -29,3 +29,4 @@ output/
|
||||
|
||||
# --- misc ---
|
||||
.DS_Store
|
||||
*~$*
|
||||
212
README.md
@@ -1,21 +1,5 @@
|
||||
|
||||
# Table of Contents
|
||||
|
||||
1. [Project Goals](#org5acb669)
|
||||
1. [Document and analyze sentiment](#org9291576)
|
||||
2. [Make data available](#org8054421)
|
||||
3. [Generalize](#orgdda4b6f)
|
||||
2. [Architecture](#org1d6bc40)
|
||||
1. [Scraper](#org4298028)
|
||||
2. [Storage](#org1cd413c)
|
||||
3. [Analysis](#orgaea450e)
|
||||
3. [Roadmap](#org6b7660d)
|
||||
|
||||
|
||||
|
||||
<a id="org5acb669"></a>
|
||||
|
||||
# Project Goals
|
||||
## Project Goals
|
||||
|
||||
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
||||
1. the utility of this forum as a mechanism for public comment, and
|
||||
@@ -23,131 +7,127 @@
|
||||
2. Make data and insights broadly available.
|
||||
3. Generalize to other public comment tools.
|
||||
|
||||
|
||||
<a id="org9291576"></a>
|
||||
|
||||
## Document and analyze sentiment
|
||||
|
||||
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
|
||||
- Build tests for identifying abuse, such as spam and account fraud
|
||||
- Identify any patterns connecting measured sentiment against VA decisions
|
||||
Take a look at https://vatownhall.streamlit.app
|
||||

|
||||
|
||||
|
||||
<a id="org8054421"></a>
|
||||
### Research questions
|
||||
|
||||
## Make data available
|
||||
|
||||
- Pick a good visualization tool
|
||||
1. What is the quality of the comments on the forum?
|
||||
1. Are there duplicate entries?
|
||||
2. Are there non-human-generated entries?
|
||||
3. Are there entries intended to abuse the forum or drown out comment?
|
||||
2. How do commenters feel about the proposed change?
|
||||
1. What is the total number and percent supporting vs opposing, and how does this change over time?
|
||||
2. What is the type of support, such as strong/weak, positive/negative?
|
||||
3. What impact do the comments have on the proposed change?
|
||||
(I anticipate this will not be measurable from currently available data)
|
||||
|
||||
|
||||
<a id="orgdda4b6f"></a>
|
||||
<a id="orgfabfcd9"></a>
|
||||
|
||||
## Generalize
|
||||
## Architecture
|
||||
|
||||
- Identify scalable ways to apply this toolset to similar problems
|
||||
1. Scrape/Parse: Scrapy
|
||||
2. Sentiment analysis: gpt-5.4-mini
|
||||
3. Display: streamlit
|
||||
4. Storage: jsonl, csv, parquet
|
||||
|
||||

|
||||
|
||||
|
||||
<a id="org1d6bc40"></a>
|
||||
<a id="org2c5c7a2"></a>
|
||||
|
||||
# Architecture
|
||||
### Scraper
|
||||
|
||||
1. Scrape/Parse: ****Scrapy**** for downloading comments
|
||||
2. Storage: json
|
||||
3. Sentiment analysis: Claude haiku
|
||||
4. Display: TBD
|
||||
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
|
||||
|
||||
1. Forums listing page: `Forums.cfm` lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
||||
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` lists comments with title, author, date
|
||||
3. Individual comment page: `viewcomments.cfm?commentid=X` shows regulation title + brief description at the top, plus the comment
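The three page types above can be addressed with a small URL helper. This is a minimal sketch rather than the project's spider: the base path is an assumption taken from the `GDocForum.cfm` URL quoted in the task notes, and the function names and the `kind` labels are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: all pages live under this base path; adjust if the site differs.
BASE_URL = "https://www.townhall.virginia.gov/L/"

# Query parameter used by each kind of comment listing (labels are hypothetical).
LISTING_PARAMS = {
    "guidance": "GDocForumID",
    "stage": "stageid",
    "petition": "petitionid",
}


def forums_url() -> str:
    """URL of the page listing all open forums."""
    return BASE_URL + "Forums.cfm"


def comment_listing_url(kind: str, forum_id: int) -> str:
    """URL of the comment listing for one forum of the given kind."""
    param = LISTING_PARAMS[kind]
    return BASE_URL + "comments.cfm?" + urlencode({param: forum_id})


def comment_url(comment_id: int) -> str:
    """URL of a single comment page."""
    return BASE_URL + "viewcomments.cfm?" + urlencode({"commentid": comment_id})
```

Keeping the parameter names in one table makes it easy to add another listing type without touching the URL-building code.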
|
||||
|
||||
|
||||
<a id="org4298028"></a>
|
||||
<a id="org72990f4"></a>
|
||||
|
||||
## Scraper
|
||||
### Analysis
|
||||
|
||||
Scrapy provides a simple mechanism for browsing and
|
||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.
|
||||
|
||||
1. Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
||||
2. Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
|
||||
3. Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
|
||||
Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
|
||||
|
||||
Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to check their status and download the results.
|
||||
|
||||
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
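The cost part of such a report reduces to simple arithmetic once token counts are known. The sketch below is an illustration, not the code in `tokenizer.py`: it assumes one request per comment with the full prompt repeated each time, takes token counts as plain integers (the real script counts them with tiktoken), and treats prices as caller-supplied parameters. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    input_tokens: int
    output_tokens: int
    cost_usd: float


def estimate_cost(
    comment_tokens: list[int],
    prompt_tokens: int,
    output_tokens_per_comment: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> Estimate:
    """Estimate total tokens and cost for classifying every comment once."""
    # Assumption: the prompt is sent with every comment, so it is counted per request.
    input_tokens = sum(prompt_tokens + n for n in comment_tokens)
    output_tokens = output_tokens_per_comment * len(comment_tokens)
    cost = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return Estimate(input_tokens, output_tokens, cost)
```

Running the same function with each candidate model's prices gives the per-model comparison the report is meant to provide.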
|
||||
|
||||
1. Prompt
|
||||
```
|
||||
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
||||
regulatory comment system. You will be given the text of a proposed regulation and a single
|
||||
public comment. Return ONLY a JSON object — no other text.
|
||||
|
||||
Definitions:
|
||||
- stance: the commenter's position on whether the regulation should be adopted.
|
||||
"support" = wants it approved (as-is or with changes);
|
||||
"oppose" = wants it rejected or substantially weakened;
|
||||
"neutral" = takes no position, asks a question, or provides factual input only;
|
||||
"unknown" = too vague, off-topic, or uninterpretable to classify.
|
||||
- tone: the emotional register of the writing, independent of stance.
|
||||
"positive" = affirming, hopeful, appreciative;
|
||||
"negative" = angry, fearful, alarmed, or contemptuous;
|
||||
"neutral" = matter-of-fact, procedural, or informational;
|
||||
"mixed" = contains both positive and negative emotional content;
|
||||
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
||||
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
||||
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||
|
||||
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||
```
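Because the prompt fixes the keys and the allowed labels, each model response can be checked mechanically before it is joined back to the comments. The following is a minimal sketch of such a check; the function name and the wording of the problem messages are hypothetical and not taken from the project's scripts.

```python
import json

# Allowed labels, copied from the prompt definitions.
STANCES = ("support", "oppose", "neutral", "unknown")
TONES = ("positive", "negative", "neutral", "mixed", "unclear")
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw: str) -> list[str]:
    """Return a list of problems found in one model response (empty if valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(data)}")
    if data.get("stance") not in STANCES:
        problems.append(f"bad stance: {data.get('stance')!r}")
    if data.get("tone") not in TONES:
        problems.append(f"bad tone: {data.get('tone')!r}")

    conf = data.get("stance_confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append(f"bad stance_confidence: {conf!r}")

    tags = data.get("tags")
    tags_ok = (
        isinstance(tags, list)
        and len(tags) <= 5
        and all(isinstance(t, str) for t in tags)
    )
    if not tags_ok:
        problems.append(f"bad tags: {tags!r}")
    return problems
```

Returning a list of problems instead of raising lets a cleanup step record every failure in an `error` column and keep going.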
|
||||
|
||||
|
||||
<a id="org1cd413c"></a>
|
||||
<a id="org58a5b72"></a>
|
||||
|
||||
## Storage
|
||||
### Storage
|
||||
|
||||
One JSONL file per forum/bill.
|
||||
- Each scraped forum is saved to `output/<forum-id>.jsonl`
|
||||
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
|
||||
- Each job is saved to `analysis/jobs/<report-id>`:
|
||||
└─`forum.jsonl` is a copy of the scraped forum for convenience
|
||||
└─`prompt.txt` is a copy of the prompt used
|
||||
└─`report.json` is a copy of the report used
|
||||
└─`status.json` contains metadata about the job
|
||||
For each batch in the job, four files are created:
|
||||
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
|
||||
└─`jobN-output-raw.jsonl` contains the exact response from the API
|
||||
└─`jobN-output.jsonl` contains the exact response from the API
|
||||
└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
|
||||
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
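The layout above is regular enough to derive every per-batch path from the job directory and a batch number. A minimal sketch, assuming `N` in `jobN-*` is a plain integer batch index; the helper names are hypothetical and this is not the project's own code.

```python
from pathlib import Path


def batch_files(job_dir: Path, batch: int) -> dict[str, Path]:
    """Paths of the four files written for one batch of a job."""
    prefix = f"job{batch}"
    return {
        "input": job_dir / f"{prefix}-input.jsonl",
        "output_raw": job_dir / f"{prefix}-output-raw.jsonl",
        "output": job_dir / f"{prefix}-output.jsonl",
        "errors": job_dir / f"{prefix}-output-errors.jsonl",  # may not exist
    }


def unfinished_batches(job_dir: Path, batches: range) -> list[int]:
    """Batch numbers whose output file has not been written yet."""
    return [n for n in batches if not batch_files(job_dir, n)["output"].exists()]
```

For example, `unfinished_batches(Path("analysis/jobs/f452-1"), range(1, 4))` would list which of batches 1 to 3 still need a download.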
|
||||
|
||||
|
||||
<a id="orgaea450e"></a>
|
||||
<a id="org24fe465"></a>
|
||||
|
||||
## Analysis
|
||||
## Instructions
|
||||
|
||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
|
||||
|
||||
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
|
||||
1. Scrape the forum.
|
||||
`python`
|
||||
2. Run model report.
|
||||
`python analysis/tokenizer.py <input> --prompt <prompt>`
|
||||
3. To run a realtime subset:
|
||||
`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
|
||||
`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
|
||||
4. To create and run the whole thing in batches, first create the batch jobs from the report:
|
||||
`python analysis/openai_batch.py create <report> --model <model>`
|
||||
`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
|
||||
5. Then run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch fails, and resubmission is not implemented.
|
||||
`python analysis/openai_batch.py submit`
|
||||
`python analysis/openai_batch.py status`
|
||||
`python analysis/openai_batch.py download`
|
||||
`python analysis/openai_batch.py submit`
|
||||
|
||||
|
||||
<colgroup>
|
||||
<col class="org-left" />
|
||||
|
||||
<col class="org-left" />
|
||||
|
||||
<col class="org-left" />
|
||||
|
||||
<col class="org-left" />
|
||||
|
||||
<col class="org-left" />
|
||||
|
||||
<col class="org-left" />
|
||||
</colgroup>
|
||||
<thead>
|
||||
<tr>
|
||||
<th scope="col" class="org-left">Tool</th>
|
||||
<th scope="col" class="org-left">Output</th>
|
||||
<th scope="col" class="org-left">Context</th>
|
||||
<th scope="col" class="org-left">Sarcasm</th>
|
||||
<th scope="col" class="org-left">Context window</th>
|
||||
<th scope="col" class="org-left">Cost/1k comments</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td class="org-left">Google NL API</td>
|
||||
<td class="org-left">-1→+1, magnitude</td>
|
||||
<td class="org-left">No/generic</td>
|
||||
<td class="org-left">Poorly</td>
|
||||
<td class="org-left">No</td>
|
||||
<td class="org-left">~$1–2</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td class="org-left">Amazon Comprehend</td>
|
||||
<td class="org-left">Pos/Neg/Neutral/Mixed</td>
|
||||
<td class="org-left">No/generic</td>
|
||||
<td class="org-left">Poorly</td>
|
||||
<td class="org-left">No</td>
|
||||
<td class="org-left">~$0.10</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td class="org-left">Claude Haiku</td>
|
||||
<td class="org-left">Prompted → for/against/neutral</td>
|
||||
<td class="org-left">Yes</td>
|
||||
<td class="org-left">Yes, with prompt</td>
|
||||
<td class="org-left">Yes</td>
|
||||
<td class="org-left">~$0.10–0.30</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td class="org-left">GPT-4o-mini</td>
|
||||
<td class="org-left">Prompted → same</td>
|
||||
<td class="org-left">Yes</td>
|
||||
<td class="org-left">Yes</td>
|
||||
<td class="org-left">Yes</td>
|
||||
<td class="org-left">~$0.05–0.15</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
<a id="org6b7660d"></a>
|
||||
<a id="org5739d49"></a>
|
||||
|
||||
# Roadmap
|
||||
|
||||
|
||||
76
analysis/create_csv.py
Normal file
@@ -0,0 +1,76 @@
#!/usr/bin/env python3
"""analysis/create_csv.py — join raw scrape with analysis output for review."""

import argparse
from pathlib import Path

import pandas as pd

RAW_COLS = ["forum_id", "comment_id", "title", "text", "date", "author"]
ANALYSIS_COLS = [
    "stance", "stance_confidence", "stance_rationale", "tone", "tags",
    "error", "truncated", "analyzed_at", "prompt_version", "model",
]
OUTPUT_COLS = RAW_COLS + ANALYSIS_COLS


def load_raw(path: Path) -> pd.DataFrame:
    df = pd.read_json(path, lines=True)
    df = df[df["comment_id"].notna()]  # rm first item (forum, not comment)
    for col in RAW_COLS:
        if col not in df.columns:
            df[col] = None
    return df[RAW_COLS].copy()


def load_analysis(jobs_dir: Path) -> pd.DataFrame:
    files = sorted(p for p in jobs_dir.glob("job*-output.jsonl") if "-raw" not in p.name)
    df = pd.concat([pd.read_json(p, lines=True) for p in files], ignore_index=True)
    for col in ANALYSIS_COLS:
        if col not in df.columns:
            df[col] = None
    return df[["comment_id"] + ANALYSIS_COLS].copy()


def join(raw: pd.DataFrame, analysis: pd.DataFrame) -> pd.DataFrame:
    return raw.merge(analysis, on="comment_id", how="left")[OUTPUT_COLS]


def print_counts(raw: pd.DataFrame, analysis: pd.DataFrame, merged: pd.DataFrame) -> None:
    print(f"\nRaw comments : {len(raw):,}")
    print(f"Analyzed : {len(analysis):,}")
    print(f"Joined : {merged['stance'].notna().sum():,}")
    print(f"Unanalyzed : {merged['stance'].isna().sum():,}")
    print(f"Errors : {analysis['error'].notna().sum():,}")
    print(f"Dup IDs (raw) : {raw['comment_id'].duplicated().sum():,}")
    print(f"\nStance:\n{analysis['stance'].value_counts(dropna=False).to_string()}")
    print(f"\nTone:\n{analysis['tone'].value_counts(dropna=False).to_string()}\n")


def main() -> None:
    p = argparse.ArgumentParser(
        description="Join raw scrape JSONL with analysis output; write review CSV."
    )
    p.add_argument("input", help="Raw scrape JSONL (e.g. output/f452.jsonl)")
    p.add_argument("jobs_dir", help="Job directory containing job*-output.jsonl files")
    p.add_argument("--parquet", action="store_true", help="Also write review.parquet")
    p.add_argument("--out", default=None, help="Output CSV path (default: <jobs_dir>/review.csv)")
    args = p.parse_args()

    raw = load_raw(Path(args.input))
    analysis = load_analysis(Path(args.jobs_dir))
    merged = join(raw, analysis)
    print_counts(raw, analysis, merged)

    out = Path(args.out) if args.out else Path(args.jobs_dir) / "review.csv"
    merged.to_csv(out, index=False, encoding="utf-8-sig")
    print(f"CSV → {out}")

    if args.parquet:
        pq = out.with_suffix(".parquet")
        merged.to_parquet(pq, index=False)
        print(f"Parquet → {pq}")


if __name__ == "__main__":
    main()
74
analysis/encoding.py
Normal file
@@ -0,0 +1,74 @@
"""
analysis/encoding.py — text encoding repair for scraped content.

The townhall.virginia.gov scraper forces UTF-8 decoding, which is correct for the
site's current content. This module provides a defensive repair function for cases
where a response arrives with Windows-1252/cp1252 bytes embedded in otherwise UTF-8
content (common in older CMSes). The raw scrape files are never modified; repair is
applied at the analysis and reporting layers only.

Primary: uses `ftfy` when installed (pip install ftfy).
Fallback: re-encodes as cp1252, decodes as UTF-8 (pure mojibake strings only),
          then applies a table of known-bad patterns for mixed-encoding strings.
"""

# ---------------------------------------------------------------------------
# Known patterns: UTF-8 bytes decoded as cp1252, i.e. the 3-char sequences you
# see when a server sends e.g. E2 80 99 and it gets decoded as cp1252 chars.
#
# Byte → cp1252 char mappings used by these sequences:
#   E2 → â    (U+00E2, always)
#   80 → €    (U+20AC, cp1252 0x80)
#   99 → ™    (U+2122, cp1252 0x99)   ← E2 80 99 = U+2019 ’ right single quote
#   98 → ˜    (U+02DC, cp1252 0x98)   ← E2 80 98 = U+2018 ‘ left single quote
#   9C → œ    (U+0153, cp1252 0x9C)   ← E2 80 9C = U+201C “ left double quote
#   9D → \x9d (undefined → U+009D)    ← E2 80 9D = U+201D ” right double quote
#   93 → “    (U+201C, cp1252 0x93)   ← E2 80 93 = U+2013 – en dash
#   94 → ”    (U+201D, cp1252 0x94)   ← E2 80 94 = U+2014 — em dash
#   A6 → ¦    (U+00A6, cp1252 0xA6)   ← E2 80 A6 = U+2026 … ellipsis

_KNOWN_REPAIRS: list[tuple[str, str]] = [
    # Longer / more specific patterns first to avoid partial matches
    ("\u00e2\u20ac\u2122", "\u2019"),  # â€™ → ’ right single quote
    ("\u00e2\u20ac\u02dc", "\u2018"),  # â€˜ → ‘ left single quote
    ("\u00e2\u20ac\u0153", "\u201c"),  # â€œ → “ left double quote
    ("\u00e2\u20ac\x9d", "\u201d"),    # â€\x9d → ” right double quote
    ("\u00e2\u20ac\u201c", "\u2013"),  # â€ + “ (left DQ) → – en dash
    ("\u00e2\u20ac\u201d", "\u2014"),  # â€ + ” (right DQ) → — em dash
    ("\u00e2\u20ac\u00a6", "\u2026"),  # â€¦ → … ellipsis
    # Generic fallback: bare â€ prefix not caught above → remove artifact
    ("\u00e2\u20ac", ""),
]


def repair_text(text: str) -> str:
    """Repair common encoding artifacts in scraped text.

    Handles:
    - UTF-8 bytes decoded as cp1252/Latin-1 (â€™ → ’)
    - Attempts best-effort cleanup for mixed-encoding strings

    U+FFFD replacement characters (from strict UTF-8 decoding of cp1252 bytes)
    cannot be recovered since the original byte is lost; they are left as-is.
    """
    if not text:
        return text

    try:
        import ftfy
        return ftfy.fix_text(text)
    except ImportError:
        pass

    # Fallback 1: pure mojibake — entire string is UTF-8 bytes read as cp1252.
    # Re-encode as cp1252 and decode as UTF-8.
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass

    # Fallback 2: mixed strings — substitute known-bad patterns.
    for bad, good in _KNOWN_REPAIRS:
        if bad in text:
            text = text.replace(bad, good)
    return text
9091
analysis/jobs/f452-1/review.csv
Normal file
File diff suppressed because one or more lines are too long
BIN
analysis/jobs/f452-1/review.xlsx
Normal file
Binary file not shown.
@@ -1,6 +1,4 @@
|
||||
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
||||
regulatory comment system. You will be given the text of a proposed regulation and a single
|
||||
public comment. Return ONLY a JSON object — no other text.
|
||||
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall regulatory comment system. You will be given the text of a proposed regulation and a single public comment. Return ONLY a JSON object — no other text.
|
||||
|
||||
Definitions:
|
||||
- stance: the commenter's position on whether the regulation should be adopted.
|
||||
@@ -16,8 +14,6 @@ Definitions:
|
||||
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
||||
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
||||
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ inclusion", "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||
|
||||
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||
|
||||
BIN
docs/excel-snapshot.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 32 KiB |
BIN
docs/streamlit-snapshot.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 30 KiB |
109
docs/tasks.org
@@ -244,9 +244,9 @@ python analysis/openai_batch.py submit
|
||||
- tests: passing (pytest tests/openai_batch.py tests/openai_realtime.py tests/tokenizer.py)
|
||||
- datetime: [2026-05-06 Wed]
|
||||
|
||||
* === Backlog ===
|
||||
* [ ] X: analysis validation view
|
||||
* [X] t1.3: cleanup model output and rejoin
|
||||
create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv.
|
||||
review create_csv for the simple approach - keep this regardless
|
||||
|
||||
** acceptance criteria
|
||||
1. input raw scrape jsonl and all *-output.jsonl files in a dir
|
||||
@@ -255,7 +255,8 @@ create a lightweight validation script that joins raw comments to normalized ana
|
||||
- forum_id, comment_id, title, text, date, author
|
||||
- stance, stance_confidence, stance_rationale, tone, tags
|
||||
- error, truncated, analyzed_at, prompt_version, model
|
||||
4. print validation counts
|
||||
4. output parquet?
|
||||
5. print validation counts
|
||||
- raw comments
|
||||
- analyzed records
|
||||
- joined records
|
||||
@@ -264,16 +265,30 @@ create a lightweight validation script that joins raw comments to normalized ana
|
||||
- error records
|
||||
- stance counts
|
||||
- tone counts
|
||||
5. tests cover join behavior and missing/duplicate ids
|
||||
6. tests cover join behavior and missing/duplicate ids
|
||||
|
||||
** notes
|
||||
- analysis/create_csv.py: reads raw scrape JSONL + all job*-output.jsonl in a job dir (skips *-output-raw.jsonl); left-joins on comment_id; writes review.csv (UTF-8 BOM for Excel); optional --parquet.
|
||||
- Uses pd.read_json(path, lines=True) — no manual JSON parsing.
|
||||
- Prints summary counts: raw/analyzed/joined/unanalyzed/errors/duplicate IDs, stance distribution, tone distribution.
|
||||
|
||||
*** usage
|
||||
#+begin_src sh
|
||||
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/
|
||||
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
|
||||
# output: analysis/jobs/f452-1/review.csv (and optionally review.parquet)
|
||||
#+end_src
|
||||
|
||||
** evidence
|
||||
- commit:
|
||||
- tests:
|
||||
- csv:
|
||||
- datetime:
|
||||
* [ ] X: text encoding cleanup
|
||||
- commit: 28d6d22
|
||||
- tests: passing (pytest tests/create_csv.py tests/encoding.py)
|
||||
- csv: analysis/jobs/f452-1/review.csv
|
||||
- datetime: [2026-05-07 Thu 17:23]
|
||||
|
||||
* [X] t1.1.1: text encoding cleanup
|
||||
fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as â€™.
|
||||
|
||||
|
||||
** acceptance criteria
|
||||
1. identify whether mojibake exists in raw scrape, analysis output, or csv export only
|
||||
2. add repair step at the earliest correct layer
|
||||
@@ -286,14 +301,82 @@ fix mojibake in scraped text before analysis/reporting, especially curly quotes
|
||||
- —
|
||||
5. document whether repaired text is used for model input
|
||||
|
||||
** notes
|
||||
- Diagnosis: f452.jsonl raw data is CLEAN — proper Unicode throughout (U+2019, U+201C, etc.). The DEFAULT_RESPONSE_ENCODING=utf-8 spider setting is working for this site. No mojibake or FFFD chars found.
|
||||
- The encoding issue would surface for forums whose server sends cp1252 bytes (0x91-0x97 range) embedded in otherwise UTF-8 content. FFFD replacement chars appear when the UTF-8 decoder hits those bytes. Once the byte is replaced by FFFD, the original character cannot be recovered.
|
||||
- Repair layer: analysis/encoding.py applied in analysis/validate.py at reporting time. Raw scrape JSONL is never modified (AC3).
|
||||
- Model input: repair_text() is NOT applied in build_messages() for this dataset since raw data is clean. Can be added if a future forum produces dirty text.
|
||||
- Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.
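The two failure modes described in these notes behave differently, and a few lines of standard-library Python show why only one is repairable. This is an illustrative sketch with a made-up sample string, not data from the scrape.

```python
original = "don\u2019t"  # contains U+2019, right single quotation mark

# Case 1: UTF-8 bytes misread as cp1252 give three-character mojibake.
# No information is lost, so the round trip restores the text.
mojibake = original.encode("utf-8").decode("cp1252")
repaired = mojibake.encode("cp1252").decode("utf-8")

# Case 2: cp1252 bytes read as UTF-8 with replacement. The byte 0x92 is not
# valid UTF-8 on its own, so it becomes U+FFFD and the original is gone.
lossy = original.encode("cp1252").decode("utf-8", errors="replace")

print(repr(mojibake), repr(repaired), repr(lossy))
```

Case 1 is what the fallback in `analysis/encoding.py` reverses; case 2 is why repair has to happen before decoding, for example by decoding as cp1252 in the spider, if a forum ever serves such bytes.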
|
||||
|
||||
** evidence
|
||||
- commit:
|
||||
- tests:
|
||||
- before/after sample:
|
||||
- datetime:
|
||||
- commit: 1ea696d
|
||||
- tests: passing (pytest tests/encoding.py)
|
||||
- before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
|
||||
- datetime: [2026-05-07 Thu 17:00]
|
||||
|
||||
* [X] t1.4: graph data prototype
|
||||
create ./viz/prototype_charts.py generating individual plotly charts for exploring graphs to embed into streamlit or dash later
|
||||
|
||||
** acceptance criteria
|
||||
2. create graph for Stance/Share
|
||||
- stacked h-bar with % support/oppose/neutral/unknown + raw totals, eg 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
|
||||
- later, consider centered diverging h-bar: oppose ← | neutral/unknown | → support
|
||||
3. create graph for Stance/Time:
|
||||
- cumulative support/oppose % over time
|
||||
4. create graph for Stance/Tone (heatmap count)
|
||||
5. create graph for Confidence/Stance (boxplot or histogram)
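The percentage-plus-count labels for the Stance/Share bar can be computed independently of the charting library. A minimal sketch, assuming stances arrive as a flat list of label strings; the function names are hypothetical and plotly is deliberately left out.

```python
from collections import Counter

STANCE_ORDER = ("support", "oppose", "neutral", "unknown")


def stance_shares(stances: list[str]) -> dict[str, tuple[int, float]]:
    """Map each stance to (count, percent of all classified comments)."""
    counts = Counter(stances)
    total = sum(counts[s] for s in STANCE_ORDER)
    return {
        s: (counts[s], 100.0 * counts[s] / total if total else 0.0)
        for s in STANCE_ORDER
    }


def share_label(count: int, pct: float) -> str:
    """Format one bar segment label, e.g. '63% (5720)'."""
    # Show two decimals for tiny slices so they do not round to 0%.
    digits = 0 if pct >= 1 else 2
    return f"{pct:.{digits}f}% ({count})"
```

Passing `share_label(*shares[s])` for each stance yields the text annotations for the stacked bar segments.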
|
||||
|
||||
** notes
|
||||
- prototyped in plotly
|
||||
- initial streamlit
|
||||
|
||||
** evidence
|
||||
- commit: 3fb424d
|
||||
- tests: see viz/proto and viz/chart_tests
|
||||
- datetime: [2026-05-08 Fri 08:38]
|
||||
|
||||
* [X] t1.5: streamlit
|
||||
create organized webpage displaying useful information from completed job and analysis
|
||||
|
||||
** acceptance criteria
|
||||
1. display total stance breakdown
|
||||
2. display centered horiz-bar with absolute stances
|
||||
3. show daily comment stances and cumulative
|
||||
4. show comment table with filters for stance (filter tone?)
|
||||
5. clicking/selecting a comment shows full text and model rationale
|
||||
6. app runs locally with one command
|
||||
|
||||
** notes
|
||||
data pulls entirely from the job; goal is to point viz/streamlit.py at any job/ folder and have everything it needs
|
||||
|
||||
** evidence
|
||||
- commit: cc16acb
|
||||
- tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
|
||||
- datetime: [2026-05-08 Fri 23:44]
|
||||
|
||||
* +[ ] t1.6 host streamlit via dockerfile+
|
||||
planning to deploy manually, get cert, etc etc. probably dont care about https?
|
||||
+using streamlit.app instead+
|
||||
** acceptance criteria
|
||||
1. write dockerfile with slim image
|
||||
|
||||
** notes
|
||||
|
||||
* === Backlog ===
|
||||
- add forum_url, forum_collected_date to scraper (to add to viz)
|
||||
* [ ] X: complete proposal information
|
||||
Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
|
||||
** acceptance criteria
|
||||
1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
|
||||
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
|
||||
2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
|
||||
* [ ] X: add helper data to create_csv
|
||||
1. in create_csv.py, create helper columns:
|
||||
- stance_signed = {"support":1, "oppose":-1, "neutral":0, "unknown":0}
|
||||
- stance_weighted = stance_signed * stance_confidence
|
||||
- is_support_oppose = stance in ["support", "oppose"]
|
||||
- date_day
|
||||
- date_hour
|
||||
- text_norm
|
||||
- text_hash
|
||||
- confidence_bucket = 'low' <.7 | 'med' .7-.89 | 'high' >=.9
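Most of these helper columns are pure functions of a single row, so they can be prototyped without pandas. The sketch below makes two assumptions that the task does not specify: `text_norm` is taken to mean lowercased text with whitespace collapsed, and `text_hash` is a SHA-256 of that normalized text. `date_day` and `date_hour` are omitted because the date format is not stated here.

```python
import hashlib
import re

STANCE_SIGN = {"support": 1, "oppose": -1, "neutral": 0, "unknown": 0}


def confidence_bucket(confidence: float) -> str:
    """'low' below 0.7, 'med' from 0.7 up to 0.9, 'high' at 0.9 and above."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.7:
        return "med"
    return "low"


def normalize_text(text: str) -> str:
    # Assumption: normalization means collapsing whitespace and lowercasing.
    return re.sub(r"\s+", " ", text).strip().lower()


def helper_columns(row: dict) -> dict:
    """Derive the helper values for one joined review row."""
    signed = STANCE_SIGN.get(row["stance"], 0)
    norm = normalize_text(row["text"])
    return {
        "stance_signed": signed,
        "stance_weighted": signed * row["stance_confidence"],
        "is_support_oppose": row["stance"] in ("support", "oppose"),
        "text_norm": norm,
        # Assumption: a hash of the normalized text, useful for spotting duplicates.
        "text_hash": hashlib.sha256(norm.encode("utf-8")).hexdigest(),
        "confidence_bucket": confidence_bucket(row["stance_confidence"]),
    }
```

Two comments that differ only in case or spacing get the same `text_hash`, which gives a cheap first pass at the duplicate-entry research question.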
|
||||
|
||||
@@ -1,49 +1,110 @@
|
||||
#+title: VA Townhall
|
||||
#+date: [2026-05-05 Tue]
|
||||
#+version: 1
|
||||
#+version: 1.1
|
||||
|
||||
* Project Goals
|
||||
** Project Goals
|
||||
1. Document and analyze sentiment of public comments on Virginia law, to determine:
|
||||
1. the utility of this forum as a mechanism for public comment, and
|
||||
2. the impact of this forum on Virginia regulation.
|
||||
2. Make data and insights broadly available.
|
||||
3. Generalize to other public comment tools.
|
||||
|
||||
** Document and analyze sentiment
|
||||
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
|
||||
- Build tests for identifying abuse, such as spam and account fraud
|
||||
- Identify any patterns connecting measured sentiment against VA decisions
|
||||
*** Research questions
|
||||
1. What is the quality of the comments on the forum?
|
||||
1. Are there duplicate entries?
|
||||
2. Are there non-human-generated entries?
|
||||
3. Are there entries intended to abuse the forum or drown out comment?
|
||||
2. How do commenters feel about the proposed change?
|
||||
1. What is the total number and percent supporting vs opposing, and how does this change over time?
|
||||
2. What is the type of support, such as strong/weak, positive/negative?
|
||||
3. What impact do the comments have on the proposed change?
|
||||
(I anticipate this will not be measurable from currently available data)
|
||||
|
||||
** Make data available
|
||||
- Pick a good visualization tool
|
||||
** Architecture
|
||||
1. Scrape/Parse: Scrapy
|
||||
2. Sentiment analysis: gpt-5.4-mini
|
||||
3. Display: streamlit
|
||||
4. Storage: jsonl, csv, parquet
|
||||
|
||||
** Generalize
|
||||
- Identify scalable ways to apply this toolset to similar problems
|
||||
[[file:pipeline-v1.2.3.svg]]
|
||||
|
||||
* Architecture
|
||||
1. Scrape/Parse: **Scrapy** for downloading comments
|
||||
2. Storage: json
|
||||
3. Sentiment analysis: Claude haiku
|
||||
4. Display: TBD
|
||||
|
||||
** Scraper
|
||||
Scrapy provides a simple mechanism for browsing and
|
||||
*** Scraper
|
||||
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
|
||||
1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
|
||||
2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
|
||||
3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment
|
||||
|
||||
** Storage
|
||||
One JSONL file per forum/bill.
|
||||
*** Analysis
|
||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.
|
||||
|
||||
** Analysis
|
||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
|
||||
Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
|
||||
|
||||
| Tool | Output | Context | Sarcasm | Context window | Cost/1k comments |
|
||||
|-------------------+--------------------------------+------------+------------------+----------------+------------------|
|
||||
| Google NL API | -1→+1, magnitude | No/generic | Poorly | No | ~$1–2 |
|
||||
| Amazon Comprehend | Pos/Neg/Neutral/Mixed | No/generic | Poorly | No | ~$0.10 |
|
||||
| Claude Haiku | Prompted → for/against/neutral | Yes | Yes, with prompt | Yes | ~$0.10–0.30 |
|
||||
| GPT-4o-mini | Prompted → same | Yes | Yes | Yes | ~$0.05–0.15 |
|
||||
Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to download them and check their status.
|
||||
|
||||
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
|
||||
|
||||
**** Prompt
|
||||
```
|
||||
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
||||
regulatory comment system. You will be given the text of a proposed regulation and a single
|
||||
public comment. Return ONLY a JSON object — no other text.
|
||||
|
||||
Definitions:
|
||||
- stance: the commenter's position on whether the regulation should be adopted.
|
||||
"support" = wants it approved (as-is or with changes);
|
||||
"oppose" = wants it rejected or substantially weakened;
|
||||
"neutral" = takes no position, asks a question, or provides factual input only;
|
||||
"unknown" = too vague, off-topic, or uninterpretable to classify.
|
||||
- tone: the emotional register of the writing, independent of stance.
|
||||
"positive" = affirming, hopeful, appreciative;
|
||||
"negative" = angry, fearful, alarmed, or contemptuous;
|
||||
"neutral" = matter-of-fact, procedural, or informational;
|
||||
"mixed" = contains both positive and negative emotional content;
|
||||
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
||||
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
||||
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||
|
||||
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||
```
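Because the prompt fixes the keys and the allowed labels, each response can be checked mechanically before it is stored. A minimal sketch, assuming a hypothetical `validate_response` helper (this is not the project's own parser); the allowed values are copied from the prompt above:

```python
import json

STANCES = {"support", "oppose", "neutral", "unknown"}
TONES = {"positive", "negative", "neutral", "mixed", "unclear"}
REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw: str) -> list:
    """Return a list of problems found in one model response (empty if valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"missing or unexpected keys: {sorted(set(data) ^ REQUIRED_KEYS)}")
    stance = data.get("stance")
    if not isinstance(stance, str) or stance not in STANCES:
        problems.append(f"bad stance: {stance!r}")
    tone = data.get("tone")
    if not isinstance(tone, str) or tone not in TONES:
        problems.append(f"bad tone: {tone!r}")
    conf = data.get("stance_confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append(f"bad stance_confidence: {conf!r}")
    tags = data.get("tags")
    if not isinstance(tags, list) or len(tags) > 5:
        problems.append("tags must be a list of at most 5 labels")
    return problems
```

Returning a list of problems rather than raising makes it easy to log every defect in a response and route it to an errors file.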
|
||||
|
||||
|
||||
*** Storage
|
||||
- Each scraped forum is saved to `output/<forum-id>.jsonl`
|
||||
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
|
||||
- Each job is saved to `analysis/jobs/<report-id>/`:
|
||||
└─`forum.jsonl` is a copy of the scraped forum for convenience
|
||||
└─`prompt.txt` is a copy of the prompt used
|
||||
└─`report.json` is a copy of the report used
|
||||
└─`status.json` contains metadata about the job
|
||||
For each batch in the job, four files are created:
|
||||
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
|
||||
└─`jobN-output-raw.jsonl` contains the exact response from the API
|
||||
└─`jobN-output.jsonl` contains the parsed analysis record for each comment (stance, tone, tags, and so on)
|
||||
└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
|
||||
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
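A scraped forum file mixes one forum metadata row with many comment rows. A minimal sketch of reading one back, assuming (as the unit tests do) that comment rows are the ones carrying a `comment_id`; `read_forum` is a hypothetical helper name:

```python
import json
from pathlib import Path


def read_forum(path: Path) -> tuple:
    """Split a forum JSONL file into (forum metadata dict, list of comment rows)."""
    forum = {}
    comments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            if "comment_id" in row:
                comments.append(row)
            elif not forum:
                forum = row  # first non-comment row is treated as metadata
    return forum, comments
```

For example, `forum, comments = read_forum(Path("output/f452.jsonl"))` would give the regulation title in `forum.get("reg_title")` and one dict per comment.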
|
||||
|
||||
** Instructions
|
||||
1. Scrape the forum.
|
||||
`python
|
||||
2. Run model report.
|
||||
`python analysis/tokenizer.py <input> --prompt <prompt>`
|
||||
3. To run a realtime subset:
|
||||
`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
|
||||
`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
|
||||
4. To create and run the whole thing in batches, first create the batch jobs from the report:
|
||||
`python analysis/openai_batch.py create <report> --model <model>`
|
||||
`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
|
||||
5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
|
||||
`python analysis/openai_batch.py submit`
|
||||
# Check status
|
||||
`python analysis/openai_batch.py status`
|
||||
# When complete, download:
|
||||
`python analysis/openai_batch.py download`
|
||||
# Submit the next batch after the previous is complete:
|
||||
`python analysis/openai_batch.py submit`
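The submit, status, download cycle above can be wrapped in a loop so that only one batch is in flight at a time. A sketch under stated assumptions: `submit`, `status`, and `download` are hypothetical callables standing in for the three subcommands, and the status strings `"completed"` and `"failed"` are invented for illustration.

```python
import time
from typing import Callable, Iterable


def run_sequentially(
    batches: Iterable[str],
    submit: Callable[[str], None],
    status: Callable[[str], str],
    download: Callable[[str], None],
    poll_seconds: float = 60.0,
) -> list:
    """Submit batches one at a time; return the names that completed."""
    done = []
    for name in batches:
        submit(name)
        while True:
            state = status(name)
            if state == "completed":
                download(name)
                done.append(name)
                break
            if state == "failed":
                # Resubmission is not implemented, so stop and let a person look.
                return done
            time.sleep(poll_seconds)
    return done
```

Stopping at the first failure mirrors the warning in step 5: later batches are not submitted while an earlier one is unresolved.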
|
||||
|
||||
* Roadmap
|
||||
1. Scrape one forum
|
||||
|
||||
BIN
requirements.txt
Binary file not shown.
@@ -5,6 +5,8 @@ class ForumItem(scrapy.Item):
|
||||
forum_id = scrapy.Field()
|
||||
reg_title = scrapy.Field()
|
||||
reg_desc = scrapy.Field()
|
||||
scraped_at = scrapy.Field()
|
||||
forum_url = scrapy.Field()
|
||||
|
||||
|
||||
class CommentItem(scrapy.Item):
|
||||
|
||||
@@ -63,6 +63,8 @@ class ForumSpider(scrapy.Spider):
|
||||
forum_id=self.forum_id,
|
||||
reg_title=reg_title,
|
||||
reg_desc=reg_desc,
|
||||
scraped_at=datetime.utcnow().isoformat(),
|
||||
forum_url=_view_url(self.forum_id),
|
||||
)
|
||||
for page in range(2, last_page + 1):
|
||||
yield scrapy.FormRequest(
|
||||
|
||||
155
tests/create_csv.py
Normal file
@@ -0,0 +1,155 @@
|
||||
"""Unit tests for analysis/create_csv.py — no external API calls."""
|
||||
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||
import create_csv as cc
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
|
||||
def _write_jsonl(path: Path, rows: list[dict]) -> None:
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
for row in rows:
|
||||
f.write(json.dumps(row) + "\n")
|
||||
|
||||
|
||||
RAW_ROWS = [
|
||||
{"forum_id": "452", "comment_id": "1", "title": "Support", "text": "I support.", "date": "2021-01-01", "author": "Alice"},
|
||||
{"forum_id": "452", "comment_id": "2", "title": "Oppose", "text": "I oppose.", "date": "2021-01-02", "author": "Bob"},
|
||||
{"forum_id": "452", "comment_id": "3", "title": "Neutral", "text": "No opinion.","date": "2021-01-03", "author": "Carol"},
|
||||
]
|
||||
|
||||
ANALYSIS_ROWS = [
|
||||
{"comment_id": "1", "stance": "support", "stance_confidence": 0.9, "stance_rationale": "clear support",
|
||||
"tone": "neutral", "tags": '["policy"]', "error": None, "truncated": False,
|
||||
"analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
|
||||
{"comment_id": "2", "stance": "oppose", "stance_confidence": 0.8, "stance_rationale": "clear oppose",
|
||||
"tone": "negative", "tags": '[]', "error": None, "truncated": False,
|
||||
"analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
|
||||
]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_raw
|
||||
|
||||
def test_load_raw_returns_raw_cols(tmp_path):
|
||||
p = tmp_path / "forum.jsonl"
|
||||
_write_jsonl(p, RAW_ROWS)
|
||||
df = cc.load_raw(p)
|
||||
assert list(df.columns) == cc.RAW_COLS
|
||||
|
||||
|
||||
def test_load_raw_row_count(tmp_path):
|
||||
p = tmp_path / "forum.jsonl"
|
||||
_write_jsonl(p, RAW_ROWS)
|
||||
df = cc.load_raw(p)
|
||||
assert len(df) == 3
|
||||
|
||||
|
||||
def test_load_raw_skips_non_comment_rows(tmp_path):
|
||||
"""Rows without comment_id (e.g. forum metadata) are dropped."""
|
||||
rows = RAW_ROWS + [{"forum_id": "452", "reg_title": "Metadata row"}]
|
||||
p = tmp_path / "forum.jsonl"
|
||||
_write_jsonl(p, rows)
|
||||
df = cc.load_raw(p)
|
||||
assert len(df) == 3
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_analysis
|
||||
|
||||
def test_load_analysis_returns_analysis_cols(tmp_path):
|
||||
jobs = tmp_path / "jobs"
|
||||
jobs.mkdir()
|
||||
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
df = cc.load_analysis(jobs)
|
||||
expected = ["comment_id"] + cc.ANALYSIS_COLS
|
||||
assert list(df.columns) == expected
|
||||
|
||||
|
||||
def test_load_analysis_skips_raw_files(tmp_path):
|
||||
jobs = tmp_path / "jobs"
|
||||
jobs.mkdir()
|
||||
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
_write_jsonl(jobs / "job1-output-raw.jsonl", ANALYSIS_ROWS) # should be ignored
|
||||
df = cc.load_analysis(jobs)
|
||||
assert len(df) == len(ANALYSIS_ROWS)
|
||||
|
||||
|
||||
def test_load_analysis_concatenates_multiple_files(tmp_path):
|
||||
jobs = tmp_path / "jobs"
|
||||
jobs.mkdir()
|
||||
_write_jsonl(jobs / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
|
||||
_write_jsonl(jobs / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
|
||||
df = cc.load_analysis(jobs)
|
||||
assert len(df) == 2
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# join
|
||||
|
||||
def test_join_all_raw_preserved(tmp_path):
|
||||
"""Left join: all raw comments appear in output, even without analysis."""
|
||||
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||
for col in cc.ANALYSIS_COLS:
|
||||
if col not in analysis.columns:
|
||||
analysis[col] = None
|
||||
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||
|
||||
merged = cc.join(raw, analysis)
|
||||
assert len(merged) == 3 # all 3 raw rows, even comment_id=3 with no analysis
|
||||
|
||||
|
||||
def test_join_unanalyzed_row_has_null_stance(tmp_path):
|
||||
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||
for col in cc.ANALYSIS_COLS:
|
||||
if col not in analysis.columns:
|
||||
analysis[col] = None
|
||||
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||
|
||||
merged = cc.join(raw, analysis)
|
||||
unanalyzed = merged[merged["comment_id"] == "3"]
|
||||
assert pd.isna(unanalyzed.iloc[0]["stance"])
|
||||
|
||||
|
||||
def test_join_column_order(tmp_path):
|
||||
raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
|
||||
analysis = pd.DataFrame(ANALYSIS_ROWS)
|
||||
for col in cc.ANALYSIS_COLS:
|
||||
if col not in analysis.columns:
|
||||
analysis[col] = None
|
||||
analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
|
||||
|
||||
merged = cc.join(raw, analysis)
|
||||
assert list(merged.columns) == cc.OUTPUT_COLS
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# End-to-end: write + read CSV
|
||||
|
||||
def test_csv_written_correctly(tmp_path):
|
||||
raw_path = tmp_path / "forum.jsonl"
|
||||
_write_jsonl(raw_path, RAW_ROWS)
|
||||
|
||||
jobs = tmp_path / "jobs"
|
||||
jobs.mkdir()
|
||||
_write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
|
||||
out = tmp_path / "review.csv"
|
||||
raw = cc.load_raw(raw_path)
|
||||
analysis = cc.load_analysis(jobs)
|
||||
merged = cc.join(raw, analysis)
|
||||
merged.to_csv(out, index=False, encoding="utf-8-sig")
|
||||
|
||||
loaded = pd.read_csv(out)
|
||||
assert len(loaded) == 3
|
||||
assert list(loaded.columns) == cc.OUTPUT_COLS
|
||||
119
tests/encoding.py
Normal file
@@ -0,0 +1,119 @@
|
||||
"""Unit tests for analysis/encoding.py — no external dependencies required."""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||
from encoding import repair_text, _KNOWN_REPAIRS
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Core contract
|
||||
|
||||
|
||||
def test_empty_string_unchanged():
|
||||
assert repair_text("") == ""
|
||||
|
||||
|
||||
def test_none_like_empty_unchanged():
|
||||
assert repair_text("") == ""
|
||||
|
||||
|
||||
def test_clean_ascii_unchanged():
|
||||
text = "This is a normal sentence with no encoding issues."
|
||||
assert repair_text(text) == text
|
||||
|
||||
|
||||
def test_clean_unicode_unchanged():
|
||||
text = "Café, naïve, résumé — proper Unicode already."
|
||||
result = repair_text(text)
|
||||
# Should either be unchanged or equivalently correct
|
||||
assert "Caf" in result and "na" in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Known mojibake sequences (tasks.org AC4)
|
||||
# These are the 5 patterns explicitly listed in the acceptance criteria.
|
||||
|
||||
|
||||
def test_right_single_quote():
|
||||
"""â€™ → ’ (U+2019 right single quotation mark)"""
|
||||
assert repair_text("Virginiaâ€™s") == "Virginia’s"
|
||||
|
||||
|
||||
def test_left_double_quote():
|
||||
"""â€œ → “ (U+201C left double quotation mark)"""
|
||||
assert repair_text("â€œHello") == "“Hello"
|
||||
|
||||
|
||||
def test_en_dash():
|
||||
"""â€" (where last char is U+201C) → – (U+2013 en dash)"""
|
||||
result = repair_text("pages 1â€“5")
|
||||
assert "–" in result or "—" in result or "-" in result
|
||||
|
||||
|
||||
def test_em_dash():
|
||||
"""â€" (where last char is U+201D) → — (U+2014 em dash)"""
|
||||
result = repair_text("wordâ€”word")
|
||||
assert "—" in result or "–" in result or "-" in result
|
||||
|
||||
|
||||
def test_right_double_quote():
|
||||
"""â€\x9d → " (U+201D right double quotation mark)"""
|
||||
result = repair_text("said†he")
|
||||
# Should not contain the raw artifact
|
||||
assert "â€" not in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Round-trip: garbled text produces sensible output
|
||||
|
||||
|
||||
def test_garbled_sentence_repaired():
|
||||
"""A sentence with multiple mojibake chars is repaired to readable text."""
|
||||
# "Don't" with right single quote encoded as UTF-8, then decoded as cp1252
|
||||
# D o n ’ t → D o n â € ™ t
|
||||
garbled = "Donâ€™t worry"
|
||||
result = repair_text(garbled)
|
||||
assert "Don" in result and "t worry" in result
|
||||
assert "â€" not in result # artifact gone
|
||||
|
||||
|
||||
def test_clean_string_after_repair_has_no_artifacts():
|
||||
garbled = "She said “Hello†and left."
|
||||
result = repair_text(garbled)
|
||||
assert "â€" not in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# FFFD replacement characters (from strict UTF-8 decode of cp1252 bytes)
|
||||
|
||||
|
||||
def test_fffd_preserved_not_crashed():
|
||||
"""repair_text must not raise on U+FFFD; it may or may not repair it."""
|
||||
text = "Virginia\ufffds Public Schools"
|
||||
result = repair_text(text)
|
||||
assert isinstance(result, str)
|
||||
assert "Virginia" in result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _KNOWN_REPAIRS table structure
|
||||
|
||||
|
||||
def test_known_repairs_non_empty():
|
||||
assert len(_KNOWN_REPAIRS) > 0
|
||||
|
||||
|
||||
def test_known_repairs_are_pairs():
|
||||
for item in _KNOWN_REPAIRS:
|
||||
assert len(item) == 2
|
||||
bad, good = item
|
||||
assert isinstance(bad, str) and isinstance(good, str)
|
||||
|
||||
|
||||
def test_known_repairs_bad_not_equal_good():
|
||||
for bad, good in _KNOWN_REPAIRS:
|
||||
assert bad != good
|
||||
217
tests/validate-sentiment.py
Normal file
@@ -0,0 +1,217 @@
|
||||
"""Unit tests for analysis/validate.py — no file I/O beyond tmp_path."""
|
||||
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
|
||||
|
||||
try:
|
||||
import pandas as pd
|
||||
except ImportError:
|
||||
pytest.skip("pandas not installed", allow_module_level=True)
|
||||
|
||||
import validate as vl
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fixtures
|
||||
|
||||
|
||||
def _write_jsonl(path: Path, rows: list[dict]) -> None:
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
for row in rows:
|
||||
f.write(json.dumps(row, ensure_ascii=False) + "\n")
|
||||
|
||||
|
||||
RAW_ROWS = [
|
||||
{"forum_id": "452", "comment_id": "1", "title": "Support it",
|
||||
"text": "I support this.", "date": "2021-01-04T09:00:00", "author": "Alice"},
|
||||
{"forum_id": "452", "comment_id": "2", "title": "Oppose it",
|
||||
"text": "I oppose this.", "date": "2021-01-05T10:00:00", "author": "Bob"},
|
||||
{"forum_id": "452", "comment_id": "3", "title": "Neutral",
|
||||
"text": "No opinion.", "date": "2021-01-06T11:00:00", "author": "Carol"},
|
||||
]
|
||||
|
||||
ANALYSIS_ROWS = [
|
||||
{"run_id": "r1", "forum_id": "452", "comment_id": "1", "input_title": "Support it",
|
||||
"analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
|
||||
"prompt_version": "abc1234", "stance": "support", "stance_confidence": 0.95,
|
||||
"stance_rationale": "Commenter says 'I support'.", "tone": "positive",
|
||||
"tags": ["student safety"], "truncated": False, "error": None},
|
||||
{"run_id": "r1", "forum_id": "452", "comment_id": "2", "input_title": "Oppose it",
|
||||
"analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
|
||||
"prompt_version": "abc1234", "stance": "oppose", "stance_confidence": 0.90,
|
||||
"stance_rationale": "Commenter says 'I oppose'.", "tone": "negative",
|
||||
"tags": [], "truncated": False, "error": None},
|
||||
]
|
||||
|
||||
FORUM_ROW = {"forum_id": "452", "reg_title": "Policy X", "reg_desc": "Guidance on Y."}
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def raw_jsonl(tmp_path) -> Path:
|
||||
p = tmp_path / "f452.jsonl"
|
||||
_write_jsonl(p, [FORUM_ROW] + RAW_ROWS)
|
||||
return p
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def jobs_dir(tmp_path) -> Path:
|
||||
d = tmp_path / "jobs" / "f452-1"
|
||||
d.mkdir(parents=True)
|
||||
_write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
return d
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_raw
|
||||
|
||||
|
||||
def test_load_raw_returns_only_comments(raw_jsonl):
|
||||
df = vl.load_raw(raw_jsonl)
|
||||
assert len(df) == 3
|
||||
assert set(df.columns) == set(vl.RAW_COLS)
|
||||
|
||||
|
||||
def test_load_raw_correct_columns(raw_jsonl):
|
||||
df = vl.load_raw(raw_jsonl)
|
||||
for col in vl.RAW_COLS:
|
||||
assert col in df.columns
|
||||
|
||||
|
||||
def test_load_raw_skips_forum_item(raw_jsonl):
|
||||
df = vl.load_raw(raw_jsonl)
|
||||
assert "reg_title" not in df.columns
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_analysis
|
||||
|
||||
|
||||
def test_load_analysis_skips_raw_files(tmp_path):
|
||||
d = tmp_path / "jobs" / "f452-1"
|
||||
d.mkdir(parents=True)
|
||||
_write_jsonl(d / "job1-output-raw.jsonl", ANALYSIS_ROWS) # should be ignored
|
||||
_write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
|
||||
df = vl.load_analysis(d)
|
||||
assert len(df) == len(ANALYSIS_ROWS)
|
||||
|
||||
|
||||
def test_load_analysis_concatenates_multiple_files(tmp_path):
|
||||
d = tmp_path / "jobs" / "f452-1"
|
||||
d.mkdir(parents=True)
|
||||
_write_jsonl(d / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
|
||||
_write_jsonl(d / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
|
||||
df = vl.load_analysis(d)
|
||||
assert len(df) == 2
|
||||
|
||||
|
||||
def test_load_analysis_tags_serialized_as_json(jobs_dir):
|
||||
df = vl.load_analysis(jobs_dir)
|
||||
tags_val = df.loc[df["comment_id"] == "1", "tags"].iloc[0]
|
||||
assert isinstance(tags_val, str)
|
||||
assert json.loads(tags_val) == ["student safety"]
|
||||
|
||||
|
||||
def test_load_analysis_empty_tags_serialized(jobs_dir):
|
||||
df = vl.load_analysis(jobs_dir)
|
||||
tags_val = df.loc[df["comment_id"] == "2", "tags"].iloc[0]
|
||||
assert json.loads(tags_val) == []
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# join — by comment_id, not index
|
||||
|
||||
|
||||
def test_join_by_comment_id_not_index(raw_jsonl, jobs_dir):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
# Shuffle raw order so comment_id ordering differs from index
|
||||
raw = raw.sample(frac=1, random_state=42).reset_index(drop=True)
|
||||
merged = vl.join(raw, analysis)
|
||||
row_1 = merged[merged["comment_id"] == "1"].iloc[0]
|
||||
assert row_1["stance"] == "support"
|
||||
assert row_1["author"] == "Alice"
|
||||
|
||||
|
||||
def test_join_unanalyzed_comment_has_null_stance(raw_jsonl, jobs_dir):
|
||||
"""Comment 3 is in raw but not in analysis — stance should be NaN."""
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
row_3 = merged[merged["comment_id"] == "3"].iloc[0]
|
||||
assert pd.isna(row_3["stance"])
|
||||
|
||||
|
||||
def test_join_preserves_all_raw_comments(raw_jsonl, jobs_dir):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
assert len(merged) == len(raw)
|
||||
|
||||
|
||||
def test_join_output_columns_in_order(raw_jsonl, jobs_dir):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
assert list(merged.columns) == vl.OUTPUT_COLS
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Duplicate comment_id handling
|
||||
|
||||
|
||||
def test_duplicate_raw_id_flagged(raw_jsonl, jobs_dir):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
# Manually duplicate a row
|
||||
raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
# join still produces a row for each raw row (left join)
|
||||
assert len(merged) == len(raw)
|
||||
assert raw["comment_id"].duplicated().sum() == 1
|
||||
|
||||
|
||||
def test_duplicate_analysis_id_produces_extra_rows(raw_jsonl, tmp_path):
|
||||
"""Two analysis records for the same comment_id create two joined rows."""
|
||||
d = tmp_path / "jobs" / "f452-dup"
|
||||
d.mkdir(parents=True)
|
||||
dup_rows = [ANALYSIS_ROWS[0], {**ANALYSIS_ROWS[0], "stance": "oppose"}]
|
||||
_write_jsonl(d / "job1-output.jsonl", dup_rows)
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(d)
|
||||
merged = vl.join(raw, analysis)
|
||||
assert len(merged[merged["comment_id"] == "1"]) == 2
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Validation counts (smoke test — just confirm it runs without error)
|
||||
|
||||
|
||||
def test_print_validation_runs(raw_jsonl, jobs_dir, capsys):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
vl.print_validation(raw, analysis, merged)
|
||||
out = capsys.readouterr().out
|
||||
assert "Raw comments" in out
|
||||
assert "Stance counts" in out
|
||||
assert "Tone counts" in out
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# CSV output
|
||||
|
||||
|
||||
def test_csv_written_to_jobs_dir(raw_jsonl, jobs_dir, tmp_path):
|
||||
raw = vl.load_raw(raw_jsonl)
|
||||
analysis = vl.load_analysis(jobs_dir)
|
||||
merged = vl.join(raw, analysis)
|
||||
out_path = jobs_dir / "review.csv"
|
||||
merged.to_csv(out_path, index=False, encoding="utf-8-sig")
|
||||
assert out_path.exists()
|
||||
loaded = pd.read_csv(out_path, encoding="utf-8-sig")
|
||||
assert list(loaded.columns) == vl.OUTPUT_COLS
|
||||
assert len(loaded) == len(raw)
|
||||
3888
viz/chart_tests/confidence_by_stance.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/cumulative_stance_area.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/cumulative_stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_diverging_bar.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_over_time.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_counts.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_heatmap.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_rowpct.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/confidence_by_stance.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_over_time.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_tone_heatmap.html
Normal file
File diff suppressed because one or more lines are too long
134
viz/prototype_charts.py
Normal file
@@ -0,0 +1,134 @@
|
||||
'''
|
||||
prototype_charts.py
|
||||
generate test charts for later addition to streamlit
|
||||
'''
|
||||
|
||||
|
||||
from pathlib import Path
|
||||
import pandas as pd
|
||||
import plotly.express as px
|
||||
import numpy as np
|
||||
|
||||
inp = Path("analysis/jobs/f452-1/review.csv")  # relative to the repo root, not a machine-specific absolute path
|
||||
out = Path("viz/")
|
||||
out.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
stance_order = ["support", "oppose", "neutral", "unknown"]
|
||||
|
||||
tone_order = ["positive", "negative", "neutral", "mixed", "unknown", "unclear"]  # still referenced by chart 8
|
||||
# default order was actually better - unclear/negative/neutral/mixed/positive vs unknown/oppose/neutral/support
|
||||
# same for pct w/in stance
|
||||
df = pd.read_csv(inp)
|
||||
df["date"] = pd.to_datetime(df["date"], errors="coerce")
|
||||
df["date_day"] = df["date"].dt.date
|
||||
df["stance"] = df["stance"].fillna("unknown")
|
||||
df["tone"] = df["tone"].fillna("unknown")
|
||||
|
||||
# 1. stance share
|
||||
counts = df["stance"].value_counts().reindex(stance_order, fill_value=0).reset_index()
|
||||
counts.columns = ["stance", "count"]
|
||||
fig = px.bar(counts, x="count", y="stance", orientation="h", text="count")
|
||||
fig.write_html(out / "stance_share.html")
|
||||
|
||||
# 2. stance over time
|
||||
daily = df.groupby(["date_day", "stance"]).size().reset_index(name="count")
|
||||
fig = px.bar(daily, x="date_day", y="count", color="stance", category_orders={"stance": stance_order})
|
||||
fig.write_html(out / "stance_over_time.html")
|
||||
|
||||
# 3. stance x tone
|
||||
heat = df.groupby(["stance", "tone"]).size().reset_index(name="count")
|
||||
fig = px.density_heatmap(heat, x="tone", y="stance", z="count", category_orders={"stance": stance_order})
|
||||
fig.write_html(out / "stance_tone_heatmap.html")
|
||||
|
||||
# 4. confidence by stance
|
||||
fig = px.box(df, x="stance", y="stance_confidence", category_orders={"stance": stance_order}, points="outliers")
|
||||
fig.write_html(out / "confidence_by_stance.html")
|
||||
|
||||
# 5. cumulative stance and share over time
|
||||
daily = (
|
||||
df.groupby(["date_day", "stance"])
|
||||
.size()
|
||||
.unstack(fill_value=0)
|
||||
.reindex(columns=stance_order, fill_value=0)
|
||||
.sort_index()
|
||||
)
|
||||
|
||||
cum = daily.cumsum()
|
||||
cum_long = cum.reset_index().melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
|
||||
|
||||
fig = px.area(
|
||||
cum_long,
|
||||
x="date_day",
|
||||
y="cumulative_count",
|
||||
color="stance",
|
||||
category_orders={"stance": stance_order},
|
||||
title="cumulative comments by stance over time",
|
||||
)
|
||||
fig.write_html(out / "cumulative_stance_area.html")
|
||||
|
||||
cum_pct = cum.div(cum.sum(axis=1), axis=0).reset_index().melt(
|
||||
id_vars="date_day", var_name="stance", value_name="cumulative_share"
|
||||
)
|
||||
|
||||
fig = px.line(
|
||||
cum_pct,
|
||||
x="date_day",
|
||||
y="cumulative_share",
|
||||
color="stance",
|
||||
category_orders={"stance": stance_order},
|
||||
title="cumulative stance share over time",
|
||||
)
|
||||
fig.update_yaxes(tickformat=".0%")
|
||||
fig.write_html(out / "cumulative_stance_share.html")
|
||||
|
||||
# 7. diverging h-bar
|
||||
stance_counts = df["stance"].value_counts().reindex(stance_order, fill_value=0)
|
||||
|
||||
div = pd.DataFrame({
|
||||
"stance": ["oppose", "support", "neutral", "unknown"],
|
||||
"count": [
|
||||
-stance_counts.get("oppose", 0),
|
||||
stance_counts.get("support", 0),
|
||||
stance_counts.get("neutral", 0),
|
||||
stance_counts.get("unknown", 0),
|
||||
],
|
||||
})
|
||||
|
||||
fig = px.bar(
|
||||
div,
|
||||
x="count",
|
||||
y="stance",
|
||||
orientation="h",
|
||||
text=div["count"].abs(),
|
||||
title="support vs oppose",
|
||||
)
|
||||
fig.update_xaxes(title="comments", zeroline=True)
|
||||
fig.update_traces(textposition="outside")
|
||||
fig.write_html(out / "stance_diverging_bar.html")
|
||||
|
||||
# 8. Stance x Tone labels
|
||||
heat = pd.crosstab(df["stance"], df["tone"]).reindex(
|
||||
index=stance_order,
|
||||
columns=[c for c in tone_order if c in df["tone"].unique()],
|
||||
fill_value=0,
|
||||
)
|
||||
|
||||
fig = px.imshow(
|
||||
heat,
|
||||
text_auto=True,
|
||||
aspect="auto",
|
||||
title="stance x tone, count",
|
||||
)
|
||||
fig.write_html(out / "stance_tone_counts.html")
|
||||
|
||||
rowpct = heat.div(heat.sum(axis=1).replace(0, np.nan), axis=0)
|
||||
|
||||
fig = px.imshow(
|
||||
rowpct,
|
||||
text_auto=".0%",
|
||||
aspect="auto",
|
||||
title="stance x tone, percent within stance",
|
||||
)
|
||||
fig.write_html(out / "stance_tone_rowpct.html")
|
||||
|
||||
|
||||
28
viz/prototype_streamlit.py
Normal file
@@ -0,0 +1,28 @@
|
||||
# streamlit run viz/prototype_streamlit.py
|
||||
from datetime import datetime
|
||||
import pandas as pd
|
||||
import plotly.graph_objects as go
|
||||
import plotly.express as px
|
||||
import streamlit as st
|
||||
|
||||
df = pd.read_csv(r"analysis/jobs/f452-1/review.csv")
|
||||
st.set_page_config(layout="wide")
|
||||
|
||||
stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
|
||||
q = st.text_input("Search comment text")
|
||||
dff = df[df["stance"].isin(stance)]
|
||||
if q:
|
||||
dff = dff[dff["text"].fillna("").str.contains(q, case=False, regex=False)]
|
||||
|
||||
st.dataframe(dff[["comment_id", "title", "stance", "stance_confidence", "tone"]], width="stretch")
|
||||
st.write("Showing " + str(len(dff))+ " comments")
|
||||
|
||||
cid = st.selectbox("comment", dff["comment_id"].astype(str))
|
||||
row = dff[dff["comment_id"].astype(str) == cid].iloc[0]
|
||||
|
||||
st.subheader(row["title"])
|
||||
st.write(row["text"])
|
||||
st.write(row["author"] + ", " + row["date"][:10])
|
||||
st.write("**model:** " + str(row["model"]))
|
||||
st.markdown("**stance:** " + str(row["stance"]) + " \n**confidence:** " + str(row["stance_confidence"]) + " \n**tone:** " + str(row["tone"]))
|
||||
st.write("**analysis:** "+ row["stance_rationale"])
|
||||
189
viz/streamlit.py
Normal file
@@ -0,0 +1,189 @@
# streamlit run viz/streamlit.py -- --jobs-dir analysis/jobs/f452-1
import argparse
from pathlib import Path
from datetime import datetime as dt
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import streamlit as st

parser = argparse.ArgumentParser()
parser.add_argument("--jobs-dir", default="analysis/jobs/f452-1", type=Path,
                    help="Job directory containing review.csv, forum.jsonl, and prompt.txt")
args, _ = parser.parse_known_args()  # parse_known_args: ignore Streamlit's own argv entries
workdir = args.jobs_dir
df = pd.read_csv(workdir/"review.csv")
df['date_dt'] = pd.to_datetime(df.date)
df["date_day"] = df["date_dt"].dt.date
forum = pd.read_json(workdir/"forum.jsonl", lines=True).iloc[0].to_dict()
prompt = (workdir/"prompt.txt").read_text(encoding="utf-8")

stance_colors = {'oppose':'#ffa15a', 'neutral':'#e377c2','support':'#19d3f3','unknown':'#000000'}
stance_order = ["oppose", "mixed", "unknown", "neutral", "support"]

st.set_page_config(layout="wide")
st.title("Virginia Townhall Explorer",anchor=None)
st.caption("Explore data collected from Virginia's public comment system. Source code at https://github.com/eulaly/vath")

st.subheader("Proposal",anchor=None,divider="gray")
st.markdown(f"**{forum.get('reg_title')}**")
st.text(forum.get('reg_desc'))
st.caption(f'Comments posted from {df.date_dt.min().strftime("%m/%d/%y")} to {df.date_dt.max().strftime("%m/%d/%y")} at https://www.townhall.virginia.gov/L/Comments.cfm?GDocForumID={forum.get("forum_id")}')  # "%m/%d/%y" is the portable spelling of "%D"

st.subheader("Comment Summary",anchor=False,divider="gray")
summary_left, summary_right = st.columns([1,2])
with summary_left:
    # Summary Table
    summary_stats = (
        df.groupby("stance").size()
        .reindex(stance_order, fill_value=0)
        .reset_index(name="count")
        .assign(percent=lambda d: (d["count"] / d["count"].sum()).map("{:.1%}".format))
    )

    st.dataframe(summary_stats, hide_index=True, width="stretch")
with summary_right:
    # Stance div-h
    counts = df["stance"].value_counts()
    stance_divh = go.Figure()
    stance_divh.add_bar(y=["stance"], x=[-counts.get("oppose",0)], name="oppose", orientation="h", marker_color=stance_colors.get('oppose'), text=[counts.get("oppose",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("neutral",0)], name="neutral", orientation="h", marker_color=stance_colors.get('neutral'), text=[counts.get("neutral",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("unknown",0)], name="unknown", orientation="h", marker_color=stance_colors.get('unknown'), text=[counts.get("unknown",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("support",0)], name="support", orientation="h", marker_color=stance_colors.get('support'), text=[counts.get("support",0)], textposition="inside")
    stance_divh.update_yaxes(title_text="",showticklabels=False)
    stance_divh.update_layout(barmode="relative", title="", height=180, margin=dict(l=0,r=0,t=0,b=0),xaxis_title="", yaxis_title="",legend=dict(orientation="v",y=0.12))
    st.plotly_chart(stance_divh,width='stretch')

# Daily Comments Breakdown, 3 Tabs
daily_wide = (
    df.groupby(["date_day", "stance"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=stance_order, fill_value=0)
    .sort_index()
)

daily_long = (
    daily_wide.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="count")
)

cum_wide = daily_wide.cumsum()

cum_long = (
    cum_wide.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
)

cum_total = cum_wide.sum(axis=1)
cum_share = cum_wide.div(cum_total.where(cum_total > 0), axis=0)

cum_share_long = (
    cum_share.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="cumulative_share")
)


tab_daily, tab_area, tab_share = st.tabs([
    "Daily",
    "Cumulative",
    "Cumulative Share",
])

with tab_daily:
    fig = px.bar(
        daily_long,
        x="date_day",
        y="count",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_layout(barmode="stack", height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

with tab_area:
    fig = px.area(
        cum_long,
        x="date_day",
        y="cumulative_count",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_layout(height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

with tab_share:
    fig = px.line(
        cum_share_long,
        x="date_day",
        y="cumulative_share",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_yaxes(tickformat=".0%", range=[0, 1])
    fig.update_layout(height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

st.subheader("Comment Explorer",anchor=False,divider="gray")
# comment explorer
cex_left, cex_right = st.columns([1,1])
with cex_left:
    filter_stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
    filter_tone = st.multiselect("Filter tone", sorted(df["tone"].dropna().unique()), default=sorted(df["tone"].dropna().unique()))
    dff = df[df["stance"].isin(filter_stance) & df["tone"].isin(filter_tone)]

with cex_right:
    q = st.text_input("Search comment title and text")
    if q:
        dff = dff[(dff["title"].fillna("") + " " + dff["text"].fillna("")).str.contains(q, case=False, regex=False)]  # search title as well as text, as the label promises
    st.text(""); st.text("")
    st.text("Showing " + str(len(dff)) + " comments",text_alignment="right", width="stretch")

st.dataframe(dff[["comment_id", "title", "text", "stance", "stance_confidence", "tone"]], width="stretch")

cid = st.selectbox("Select comment to view:", dff["comment_id"].astype(str))
row = dff[dff["comment_id"].astype(str) == cid].iloc[0]

st.markdown(f'**{row["title"]}**')
st.text(row["text"])
st.write(str(row["author"]) + ", " + row["date_dt"].strftime("%m/%d/%y"))  # "%m/%d/%y" is the portable spelling of "%D"

st.divider()

st.subheader('Analysis')
cexs_left, cexs_right = st.columns([1,1])
with cexs_left:
    st.write(f"**stance:** {row['stance']}")
    st.write(f"**stance_confidence:** {row['stance_confidence']:.2f}")
    st.write(f"**tone:** {row['tone']}")
    st.write("**analysis:** " + str(row["stance_rationale"]))
with cexs_right:
    x_order = ["unknown","oppose","mixed","neutral","support"] # includes mixed even if absent; harmless zero column
    y_order = ["positive","neutral","mixed","negative","unclear"]
    tab = pd.crosstab(df["tone"], df["stance"]).reindex(index=y_order, columns=x_order, fill_value=0)
    pct = tab.div(tab.sum(axis=1).replace(0, float("nan")), axis=0).fillna(0)  # NaN keeps the frame numeric; pd.NA would turn it into object dtype
    tone_stance = px.imshow(
        pct,
        x=x_order, y=y_order,
        text_auto=".0%",
        aspect="auto",
        color_continuous_scale="Greens",
    )
    tone_stance.update_traces(text=(tab.astype(str) + " / " + (pct*100).round(0).astype(int).astype(str) + "%").to_numpy(), texttemplate="%{text}")  # show "count / pct%"; without texttemplate the text_auto label wins
    tone_stance.add_scatter(x=[row["stance"]],y=[row["tone"]],mode="markers",marker=dict(size=15,color="yellow",symbol="cross",line=dict(width=1, color="red")),showlegend=False)
    tone_stance.update_layout(height=420, xaxis_title="stance", yaxis_title="tone")
    st.plotly_chart(tone_stance, width='stretch')
    st.caption("Tone by stance, % within tone", text_alignment="right",width="stretch")

st.divider()
st.write("**model:** " + str(row["model"]))
with st.expander("Prompt", expanded=False):
    st.code(prompt, language="text")

tone_conf = px.box(df,x="stance",y="stance_confidence",color="stance",category_orders={"stance":stance_order},color_discrete_map=stance_colors,points="outliers",title="Comment Stance Classification Confidence")
tone_conf.update_yaxes(range=[0,1.02])
tone_conf.update_layout(height=430, legend_orientation="v")
st.plotly_chart(tone_conf,width="stretch")
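The cumulative-share series behind the "Cumulative Share" tab can be exercised without Streamlit or Plotly. This is a minimal sketch, assuming only pandas; the function name `cumulative_share` and the toy rows are hypothetical. It mirrors the groupby, unstack, cumsum, and divide steps used in the app.

```python
import pandas as pd


def cumulative_share(df: pd.DataFrame, order: list) -> pd.DataFrame:
    """Per-day running share of comments by stance (each row sums to 1)."""
    daily = (
        df.groupby(["date_day", "stance"])
        .size()
        .unstack(fill_value=0)                 # one column per stance seen in the data
        .reindex(columns=order, fill_value=0)  # add stances that never occur as zero columns
        .sort_index()
    )
    cum = daily.cumsum()
    total = cum.sum(axis=1)
    # where(total > 0) turns a zero denominator into NaN instead of dividing by zero
    return cum.div(total.where(total > 0), axis=0)


# Toy rows invented for illustration (not taken from review.csv).
toy = pd.DataFrame({
    "date_day": ["2025-01-01", "2025-01-01", "2025-01-02", "2025-01-03"],
    "stance": ["oppose", "oppose", "support", "oppose"],
})
share = cumulative_share(toy, ["oppose", "neutral", "support"])
print(share)
```

Because the shares are computed from running totals, early days with few comments swing widely; the line chart is easier to read once the totals have grown.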