Compare commits
11 Commits
05515745fd
...
master
| Author | SHA1 | Date |
|---|---|---|
| | 8f1d9e7723 | |
| | 181477bce7 | |
| | 771f11fd3c | |
| | f42183eeda | |
| | 92706bafb5 | |
| | 723b353db8 | |
| | 67cd96a523 | |
| | cc16acbb12 | |
| | afd5b8c60e | |
| | 3fb424da3c | |
| | c3f2911563 | |
39
README.md
@@ -1,17 +1,3 @@
|
|||||||
# Table of Contents
|
|
||||||
|
|
||||||
1. [Project Goals](#org2da6874)
|
|
||||||
1. [Research questions](#org1a2b8b3)
|
|
||||||
2. [Architecture](#orgfabfcd9)
|
|
||||||
1. [Scraper](#org2c5c7a2)
|
|
||||||
2. [Analysis](#org72990f4)
|
|
||||||
3. [Storage](#org58a5b72)
|
|
||||||
3. [Instructions](#org24fe465)
|
|
||||||
1. [Roadmap](#org5739d49)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
<a id="org2da6874"></a>
|
|
||||||
|
|
||||||
## Project Goals
|
## Project Goals
|
||||||
|
|
||||||
@@ -21,8 +7,9 @@
|
|||||||
2. Make data and insights broadly available.
|
2. Make data and insights broadly available.
|
||||||
3. Generalize to other public comment tools.
|
3. Generalize to other public comment tools.
|
||||||
|
|
||||||
|
Take a look at https://vatownhall.streamlit.app
|
||||||
|

|
||||||
|
|
||||||
<a id="org1a2b8b3"></a>
|
|
||||||
|
|
||||||
### Research questions
|
### Research questions
|
||||||
|
|
||||||
@@ -66,9 +53,9 @@ Scrapy provides a simple mechanism for retrieving, parsing, and saving content f
|
|||||||
|
|
||||||
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.
|
Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation. We add the proposed change as context to the model.
|
||||||
|
|
||||||
Before sending the comments for sentiment analysis, \`tokenizer.py\` receives the forum to be processed and prompt as inputs, then generates a \`report.json\` estimating tokens (tiktoken), cost, and time to run for multiple models.
|
Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
|
||||||
|
|
||||||
Then, the batch processing script uses the \`report.json\` to create multiple jobs, with subcommands to download and check their status.
|
Then, the batch processing script uses the `report.json` to create multiple jobs, with subcommands to download and check their status.
|
||||||
|
|
||||||
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
|
We selected gpt-5.4-mini for a good balance of quality, cost, and time.
|
||||||
|
|
||||||
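The estimation step described in this hunk can be sketched roughly as follows. This is not the project's `tokenizer.py`: the four-characters-per-token heuristic is a stand-in for tiktoken, and the model name, price, and throughput numbers are made-up placeholders used only to show the shape of the report.

```python
import json
import math


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, math.ceil(len(text) / 4))


def estimate_report(comments, prompt, models):
    """Estimate input tokens, cost, and run time for each candidate model.

    `models` maps a model name to hypothetical figures:
    {"usd_per_1m_input": float, "tokens_per_minute": float}.
    """
    prompt_tokens = rough_token_count(prompt)
    # Each request carries the prompt plus exactly one comment.
    total_tokens = sum(prompt_tokens + rough_token_count(c) for c in comments)
    report = {"comments": len(comments), "input_tokens": total_tokens, "models": {}}
    for name, spec in models.items():
        report["models"][name] = {
            "est_cost_usd": round(total_tokens / 1_000_000 * spec["usd_per_1m_input"], 4),
            "est_minutes": round(total_tokens / spec["tokens_per_minute"], 2),
        }
    return report


example = estimate_report(
    comments=["I oppose this change.", "Strongly support."],
    prompt="Classify the stance of the comment.",
    models={"example-model": {"usd_per_1m_input": 0.5, "tokens_per_minute": 1000}},
)
print(json.dumps(example, indent=2))
```

Swapping `rough_token_count` for a tiktoken-based counter would leave the rest of the sketch unchanged; output tokens are ignored here for brevity.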
@@ -107,15 +94,15 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
|
|||||||
- Each scraped forum is saved to `output/<forum-id>.jsonl`
|
- Each scraped forum is saved to `output/<forum-id>.jsonl`
|
||||||
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
|
- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
|
||||||
- Each job is saved to `analysis/jobs/<report-id>`:
|
- Each job is saved to `analysis/jobs/<report-id>`:
|
||||||
└─`forum.jsonl` is a copy of the scraped forum for convenience
|
└─`forum.jsonl` is a copy of the scraped forum for convenience
|
||||||
└─`prompt.txt` is a copy of the prompt used
|
└─`prompt.txt` is a copy of the prompt used
|
||||||
└─`report.json` is a copy of the report used
|
└─`report.json` is a copy of the report used
|
||||||
└─`status.json` contains metadata about the job
|
└─`status.json` contains metadata about the job
|
||||||
For each batch in the job, four files are created:
|
For each batch in the job, four files are created:
|
||||||
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
|
└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
|
||||||
└─`jobN-output-raw.jsonl` contains the exact response from the API
|
└─`jobN-output-raw.jsonl` contains the exact response from the API
|
||||||
└─`jobN-output.jsonl` contains the exact response from the API
|
└─`jobN-output.jsonl` contains the exact response from the API
|
||||||
└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
|
└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
|
||||||
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
|
- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
|
||||||
|
|
||||||
|
|
||||||
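As a quick illustration of the storage layout listed in this hunk, the sketch below builds the expected paths for one job. The function name `job_layout` and the reading of `N` in `jobN-*` as a batch index are assumptions, not taken from the repository.

```python
from pathlib import Path


def job_layout(report_id: str, batch: int, root: Path = Path("analysis/jobs")) -> dict:
    """Return the files expected for one job and one of its batches."""
    job_dir = root / report_id
    prefix = f"job{batch}"  # assumption: N in jobN-* is the batch number
    return {
        "job": [job_dir / n for n in ("forum.jsonl", "prompt.txt", "report.json", "status.json")],
        "batch": [
            job_dir / f"{prefix}-input.jsonl",
            job_dir / f"{prefix}-output-raw.jsonl",
            job_dir / f"{prefix}-output.jsonl",
            job_dir / f"{prefix}-output-errors.jsonl",  # may not exist
        ],
        "review": [job_dir / f"review.{ext}" for ext in ("csv", "pqt", "sqlite")],
    }


layout = job_layout("f452-1", batch=1)
for group, paths in layout.items():
    print(group, [p.name for p in paths])
```

A helper like this could also be used to report which files are still missing before the cleanup script runs.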
|
|||||||
Binary file not shown.
@@ -1,6 +1,4 @@
|
|||||||
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
|
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall regulatory comment system. You will be given the text of a proposed regulation and a single public comment. Return ONLY a JSON object — no other text.
|
||||||
regulatory comment system. You will be given the text of a proposed regulation and a single
|
|
||||||
public comment. Return ONLY a JSON object — no other text.
|
|
||||||
|
|
||||||
Definitions:
|
Definitions:
|
||||||
- stance: the commenter's position on whether the regulation should be adopted.
|
- stance: the commenter's position on whether the regulation should be adopted.
|
||||||
@@ -16,8 +14,6 @@ Definitions:
|
|||||||
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
"unclear" = tone cannot be determined (e.g., a one-word comment).
|
||||||
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
- stance_confidence: float 0.0-1.0, your confidence in the stance label.
|
||||||
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
|
||||||
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
|
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ inclusion", "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
||||||
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
|
|
||||||
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
|
|
||||||
|
|
||||||
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
|
||||||
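Because the prompt asks for exactly five keys, a reply can be checked mechanically before it is stored. The following is a minimal sketch of such a check, not the project's cleanup code; `validate_response` is a hypothetical helper, and it only enforces the constraints stated in the prompt text shown here (exact keys, confidence in 0.0-1.0, at most five tags).

```python
import json

EXPECTED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_response(raw: str) -> dict:
    """Parse one model reply and check it against the prompt's contract."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError("reply must be a JSON object with exactly the expected keys")
    conf = data["stance_confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError("stance_confidence must be a number in [0.0, 1.0]")
    tags = data["tags"]
    if not isinstance(tags, list) or len(tags) > 5:
        raise ValueError("tags must be a list of at most 5 labels")
    return data


ok = validate_response(
    '{"stance": "oppose", "stance_confidence": 0.9, '
    '"stance_rationale": "Says the policy should be withdrawn.", '
    '"tone": "negative", "tags": ["parental rights"]}'
)
print(ok["stance"], ok["tags"])
```

Replies that fail the check could be routed to the errors file rather than silently dropped.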
|
|||||||
BIN
docs/streamlit-snapshot.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 30 KiB |
@@ -280,10 +280,10 @@ python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
|
|||||||
#+end_src
|
#+end_src
|
||||||
|
|
||||||
** evidence
|
** evidence
|
||||||
- commit:
|
- commit: 28d6d22
|
||||||
- tests: passing (pytest tests/create_csv.py tests/encoding.py)
|
- tests: passing (pytest tests/create_csv.py tests/encoding.py)
|
||||||
- csv: analysis/jobs/f452-1/review.csv
|
- csv: analysis/jobs/f452-1/review.csv
|
||||||
- datetime: [2026-05-07 Thu]
|
- datetime: [2026-05-07 Thu 17:23]
|
||||||
|
|
||||||
* [X] t1.1.1: text encoding cleanup
|
* [X] t1.1.1: text encoding cleanup
|
||||||
fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as ’.
|
fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as ’.
|
||||||
@@ -309,24 +309,74 @@ fix mojibake in scraped text before analysis/reporting, especially curly quotes
|
|||||||
- Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.
|
- Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.
|
||||||
|
|
||||||
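The mojibake this task targets comes from UTF-8 bytes being decoded as cp1252. A minimal standard-library sketch of the reverse round trip is shown below; it is a simplified stand-in for ftfy, not the project's implementation, and `repair_mojibake` is a hypothetical name.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as cp1252.

    If the round trip is not possible, the text is assumed to be fine
    and is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "don\u00e2\u20ac\u2122t"      # a right curly quote seen through cp1252
print(repr(repair_mojibake(broken)))   # the three stray characters collapse into one
print(repr(repair_mojibake("caf\u00e9")))  # already-correct text is left alone
```

Unlike ftfy, this all-or-nothing approach leaves a string untouched when it mixes correct non-ASCII characters with mojibake.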
** evidence
|
** evidence
|
||||||
- commit:
|
- commit: 1ea696d
|
||||||
- tests: passing (pytest tests/encoding.py)
|
- tests: passing (pytest tests/encoding.py)
|
||||||
- before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
|
- before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
|
||||||
- datetime: [2026-05-07 Thu]
|
- datetime: [2026-05-07 Thu 17:00]
|
||||||
* === Backlog ===
|
|
||||||
* [ ] X: first dash explorer
|
* [X] t1.4: graph data prototype
|
||||||
create a local dash app for exploring one forum analysis dataset.
|
create ./viz/prototype_charts.py generating individual plotly charts for exploring graphs to embed into streamlit or dash later
|
||||||
|
|
||||||
** acceptance criteria
|
** acceptance criteria
|
||||||
1. load parquet/csv review dataset
|
2. create graph for Stance/Share
|
||||||
2. show stance counts, tone counts, tag counts, and confidence histogram
|
- stacked h-bar with % support/oppose/neutral/unknown + raw totals, eg 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
|
||||||
3. provide filters for stance, tone, confidence, tag, and text search
|
- later, consider centered diverging h-bar: oppose ← | neutral/unknown | → support
|
||||||
4. show filtered comment table
|
3. create graph for Stance/Time:
|
||||||
|
- cumulative support/oppose % over time
|
||||||
|
4. create graph for Stance/Tone (heatmap count)
|
||||||
|
5. create graph for Confidence/Stance (boxplot or histogram)
|
||||||
|
|
||||||
|
** notes
|
||||||
|
- prototyped in plotly
|
||||||
|
- initial streamlit
|
||||||
|
|
||||||
|
** evidence
|
||||||
|
- commit: 3fb424d
|
||||||
|
- tests: see viz/proto and viz/chart_tests
|
||||||
|
- datetime: [2026-05-08 Fri 08:38]
|
||||||
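The label format in the Stance/Share criterion can be reproduced with a few lines of plain Python. This is a sketch only: `stance_share_labels` is a hypothetical helper, and the pairing of the example totals with support/oppose/neutral/unknown is assumed from the order given in the criterion.

```python
def stance_share_labels(counts: dict) -> str:
    """Format stance counts as 'share% (raw total)' segments joined by ' / '."""
    total = sum(counts.values())
    if total == 0:
        return ""
    parts = []
    for n in counts.values():
        pct = 100 * n / total
        # Whole percents for large shares, two decimals for shares under 1%.
        shown = f"{pct:.0f}%" if pct >= 1 else f"{pct:.2f}%"
        parts.append(f"{shown} ({n})")
    return " / ".join(parts)


label = stance_share_labels({"support": 5720, "oppose": 3320, "neutral": 8, "unknown": 34})
print(label)  # 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
```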
|
|
||||||
|
* [X] t1.5: streamlit
|
||||||
|
create organized webpage displaying useful information from completed job and analysis
|
||||||
|
|
||||||
|
** acceptance criteria
|
||||||
|
1. display total stance breakdown
|
||||||
|
2. display centered horiz-bar with absolute stances
|
||||||
|
3. show daily comment stances and cumulative
|
||||||
|
4. show comment table with filters for stance (filter tone?)
|
||||||
5. clicking/selecting a comment shows full text and model rationale
|
5. clicking/selecting a comment shows full text and model rationale
|
||||||
6. app runs locally with one command
|
6. app runs locally with one command
|
||||||
|
|
||||||
|
** notes
|
||||||
|
data pulls entirely from the job; goal is to point viz/streamlit.py at any job/ folder and have everything it needs
|
||||||
|
|
||||||
|
** evidence
|
||||||
|
- commit: cc16acb
|
||||||
|
- tests: from root dir, `streamlit run viz/streamlit.py -- --jobs-dir <job-dir>`
|
||||||
|
- datetime: [2026-05-08 Fri 23:44]
|
||||||
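The goal of pointing the app at any job folder can be sketched with `argparse` alone. The `--jobs-dir` flag and the three required file names come from `viz/streamlit.py` in this compare; the helper names `parse_job_dir` and `missing_files` are hypothetical.

```python
import argparse
from pathlib import Path

REQUIRED = ("review.csv", "forum.jsonl", "prompt.txt")


def parse_job_dir(argv) -> Path:
    """Read --jobs-dir and ignore any arguments meant for other tools."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--jobs-dir", type=Path, default=Path("analysis/jobs/f452-1"))
    args, _unknown = parser.parse_known_args(argv)
    return args.jobs_dir


def missing_files(job_dir: Path, required=REQUIRED) -> list:
    """List the required files that are not present in the job folder."""
    return [name for name in required if not (job_dir / name).is_file()]


job_dir = parse_job_dir(["--jobs-dir", "analysis/jobs/f452-1", "--unrelated-flag"])
print(job_dir, missing_files(job_dir))
```

Checking for missing files up front gives a clearer failure than a traceback from the first read.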
|
|
||||||
|
* +[ ] t1.6 host streamlit via dockerfile+
|
||||||
|
planning to deploy manually, get a cert, etc. probably don't care about https?
|
||||||
|
+using streamlit.app instead+
|
||||||
|
** acceptance criteria
|
||||||
|
1. write dockerfile with slim image
|
||||||
|
|
||||||
|
** notes
|
||||||
|
|
||||||
|
* === Backlog ===
|
||||||
|
- add forum_url, forum_collected_date to scraper (to add to viz)
|
||||||
* [ ] X: complete proposal information
|
* [ ] X: complete proposal information
|
||||||
Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
|
Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
|
||||||
** acceptance criteria
|
** acceptance criteria
|
||||||
1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
|
1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
|
||||||
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
|
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
|
||||||
2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
|
2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
|
||||||
|
* [ ] X: add helper data to create_csv
|
||||||
|
1. in create_csv.py, create helper columns:
|
||||||
|
- stance_signed = {"support":1, "oppose":-1, "neutral":0, "unknown":0}
|
||||||
|
- stance_weighted = stance_signed * stance_confidence
|
||||||
|
- is_support_oppose = stance in ["support", "oppose"]
|
||||||
|
- date_day
|
||||||
|
- date_hour
|
||||||
|
- text_norm
|
||||||
|
- text_hash
|
||||||
|
- confidence_bucket = 'low' <.7 | 'med' .7-.89 | 'high' >=.9
|
||||||
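The helper columns in this backlog item could look roughly like the sketch below, written against a plain dict per row so it does not depend on how `create_csv.py` is structured. The ISO 8601 date format, the normalization rule (lowercase plus collapsed whitespace), and the use of SHA-256 for `text_hash` are assumptions.

```python
import hashlib
from datetime import datetime

STANCE_SIGNED = {"support": 1, "oppose": -1, "neutral": 0, "unknown": 0}


def confidence_bucket(conf: float) -> str:
    """'low' below 0.7, 'med' from 0.7 up to (not including) 0.9, 'high' otherwise."""
    if conf >= 0.9:
        return "high"
    if conf >= 0.7:
        return "med"
    return "low"


def helper_columns(row: dict) -> dict:
    """Derive the helper columns for one review row."""
    stance = row.get("stance") or "unknown"
    conf = float(row.get("stance_confidence") or 0.0)
    when = datetime.fromisoformat(row["date"])  # assumes an ISO 8601 timestamp
    text_norm = " ".join((row.get("text") or "").lower().split())
    signed = STANCE_SIGNED.get(stance, 0)
    return {
        "stance_signed": signed,
        "stance_weighted": signed * conf,
        "is_support_oppose": stance in ("support", "oppose"),
        "date_day": when.date().isoformat(),
        "date_hour": when.hour,
        "text_norm": text_norm,
        "text_hash": hashlib.sha256(text_norm.encode("utf-8")).hexdigest(),
        "confidence_bucket": confidence_bucket(conf),
    }


sample = helper_columns({
    "stance": "oppose",
    "stance_confidence": 0.85,
    "date": "2026-05-07T17:23:00",
    "text": "  Please  DO not  adopt this. ",
})
print(sample["stance_weighted"], sample["confidence_bucket"], sample["text_norm"])
```

Hashing the normalized text makes identical form-letter comments easy to group later.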
|
|||||||
BIN
requirements.txt
Binary file not shown.
@@ -5,6 +5,8 @@ class ForumItem(scrapy.Item):
|
|||||||
forum_id = scrapy.Field()
|
forum_id = scrapy.Field()
|
||||||
reg_title = scrapy.Field()
|
reg_title = scrapy.Field()
|
||||||
reg_desc = scrapy.Field()
|
reg_desc = scrapy.Field()
|
||||||
|
scraped_at = scrapy.Field()
|
||||||
|
forum_url = scrapy.Field()
|
||||||
|
|
||||||
|
|
||||||
class CommentItem(scrapy.Item):
|
class CommentItem(scrapy.Item):
|
||||||
|
|||||||
@@ -63,6 +63,8 @@ class ForumSpider(scrapy.Spider):
|
|||||||
forum_id=self.forum_id,
|
forum_id=self.forum_id,
|
||||||
reg_title=reg_title,
|
reg_title=reg_title,
|
||||||
reg_desc=reg_desc,
|
reg_desc=reg_desc,
|
||||||
|
scraped_at=datetime.utcnow().isoformat(),
|
||||||
|
forum_url=_view_url(self.forum_id),
|
||||||
)
|
)
|
||||||
for page in range(2, last_page + 1):
|
for page in range(2, last_page + 1):
|
||||||
yield scrapy.FormRequest(
|
yield scrapy.FormRequest(
|
||||||
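One note on the `scraped_at` field added in this hunk: `datetime.utcnow()` returns a naive timestamp and is deprecated as of Python 3.12. A timezone-aware alternative, offered here only as a possible follow-up and not as what the commit does, might look like this (`scraped_at_now` is a hypothetical helper):

```python
from datetime import datetime, timezone


def scraped_at_now() -> str:
    """Return the current UTC time as an ISO 8601 string with an explicit offset."""
    return datetime.now(timezone.utc).isoformat()


stamp = scraped_at_now()
parsed = datetime.fromisoformat(stamp)
print(stamp, parsed.tzinfo)
```

The explicit `+00:00` offset means downstream code does not have to guess which timezone the scraper ran in.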
|
|||||||
3888
viz/chart_tests/confidence_by_stance.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/cumulative_stance_area.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/cumulative_stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_diverging_bar.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_over_time.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_counts.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_heatmap.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/chart_tests/stance_tone_rowpct.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/confidence_by_stance.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_over_time.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_share.html
Normal file
File diff suppressed because one or more lines are too long
3888
viz/proto/stance_tone_heatmap.html
Normal file
File diff suppressed because one or more lines are too long
134
viz/prototype_charts.py
Normal file
@@ -0,0 +1,134 @@
'''
prototype_charts.py
generate test charts for later addition to streamlit
'''


from pathlib import Path
import pandas as pd
import plotly.express as px
import numpy as np

inp = Path(r"c:/users/moses/projects/vath/analysis/jobs/f452-1/review.csv")
out = Path("viz/")
out.mkdir(parents=True, exist_ok=True)

stance_order = ["support", "oppose", "neutral", "unknown"]

# tone_order must stay defined: the stance x tone crosstab in section 8 still uses it
tone_order = ["positive", "negative", "neutral", "mixed", "unknown", "unclear"]
# default order was actually better - unclear/negative/neutral/mixed/positive vs unknown/oppose/neutral/support
# same for pct w/in stance
df = pd.read_csv(inp)
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["date_day"] = df["date"].dt.date
df["stance"] = df["stance"].fillna("unknown")
df["tone"] = df["tone"].fillna("unknown")

# 1. stance share
counts = df["stance"].value_counts().reindex(stance_order, fill_value=0).reset_index()
counts.columns = ["stance", "count"]
fig = px.bar(counts, x="count", y="stance", orientation="h", text="count")
fig.write_html(out / "stance_share.html")

# 2. stance over time
daily = df.groupby(["date_day", "stance"]).size().reset_index(name="count")
fig = px.bar(daily, x="date_day", y="count", color="stance", category_orders={"stance": stance_order})
fig.write_html(out / "stance_over_time.html")

# 3. stance x tone
heat = df.groupby(["stance", "tone"]).size().reset_index(name="count")
fig = px.density_heatmap(heat, x="tone", y="stance", z="count", category_orders={"stance": stance_order})
fig.write_html(out / "stance_tone_heatmap.html")

# 4. confidence by stance
fig = px.box(df, x="stance", y="stance_confidence", category_orders={"stance": stance_order}, points="outliers")
fig.write_html(out / "confidence_by_stance.html")

# 5. cumulative stance and share over time
daily = (
    df.groupby(["date_day", "stance"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=stance_order, fill_value=0)
    .sort_index()
)

cum = daily.cumsum()
cum_long = cum.reset_index().melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")

fig = px.area(
    cum_long,
    x="date_day",
    y="cumulative_count",
    color="stance",
    category_orders={"stance": stance_order},
    title="cumulative comments by stance over time",
)
fig.write_html(out / "cumulative_stance_area.html")

cum_pct = cum.div(cum.sum(axis=1), axis=0).reset_index().melt(
    id_vars="date_day", var_name="stance", value_name="cumulative_share"
)

fig = px.line(
    cum_pct,
    x="date_day",
    y="cumulative_share",
    color="stance",
    category_orders={"stance": stance_order},
    title="cumulative stance share over time",
)
fig.update_yaxes(tickformat=".0%")
fig.write_html(out / "cumulative_stance_share.html")

# 7. diverging h-bar
stance_counts = df["stance"].value_counts().reindex(stance_order, fill_value=0)

div = pd.DataFrame({
    "stance": ["oppose", "support", "neutral", "unknown"],
    "count": [
        -stance_counts.get("oppose", 0),
        stance_counts.get("support", 0),
        stance_counts.get("neutral", 0),
        stance_counts.get("unknown", 0),
    ],
})

fig = px.bar(
    div,
    x="count",
    y="stance",
    orientation="h",
    text=div["count"].abs(),
    title="support vs oppose",
)
fig.update_xaxes(title="comments", zeroline=True)
fig.update_traces(textposition="outside")
fig.write_html(out / "stance_diverging_bar.html")

# 8. Stance x Tone labels
heat = pd.crosstab(df["stance"], df["tone"]).reindex(
    index=stance_order,
    columns=[c for c in tone_order if c in df["tone"].unique()],
    fill_value=0,
)

fig = px.imshow(
    heat,
    text_auto=True,
    aspect="auto",
    title="stance x tone, count",
)
fig.write_html(out / "stance_tone_counts.html")

rowpct = heat.div(heat.sum(axis=1).replace(0, np.nan), axis=0)

fig = px.imshow(
    rowpct,
    text_auto=".0%",
    aspect="auto",
    title="stance x tone, percent within stance",
)
fig.write_html(out / "stance_tone_rowpct.html")

|
|
||||||
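The cumulative-share step in `prototype_charts.py` (a running total per stance divided by the running total across stances) can be checked by hand with a small dependency-free sketch. The function name and the input shape are illustrative assumptions; the script itself does this with pandas `cumsum` and `div`.

```python
from itertools import accumulate


def cumulative_share(daily_counts: dict, stances: list) -> list:
    """Turn {day: {stance: count}} into [(day, {stance: cumulative share})]."""
    days = sorted(daily_counts)
    # Running total per stance, in day order.
    running = {
        s: list(accumulate(daily_counts[d].get(s, 0) for d in days))
        for s in stances
    }
    rows = []
    for i, day in enumerate(days):
        total = sum(running[s][i] for s in stances)
        # None marks days before any comment has arrived (no share defined yet).
        shares = {s: (running[s][i] / total if total else None) for s in stances}
        rows.append((day, shares))
    return rows


daily = {
    "2026-05-01": {"support": 1, "oppose": 3},
    "2026-05-02": {"support": 3, "oppose": 1},
}
rows = cumulative_share(daily, ["support", "oppose"])
for day, shares in rows:
    print(day, shares)
```

On the second day the running totals are 4 and 4, so both shares are 0.5 even though that day's own split was 3 to 1.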
28
viz/prototype_streamlit.py
Normal file
@@ -0,0 +1,28 @@
# streamlit run viz/prototype_streamlit.py
from datetime import datetime
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import streamlit as st

df = pd.read_csv(r"analysis/jobs/f452-1/review.csv")
st.set_page_config(layout="wide")

stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
q = st.text_input("Search comment text")
dff = df[df["stance"].isin(stance)]
if q:
    dff = dff[dff["text"].fillna("").str.contains(q, case=False, regex=False)]

st.dataframe(dff[["comment_id", "title", "stance", "stance_confidence", "tone"]], width="stretch")
st.write("Showing " + str(len(dff))+ " comments")

cid = st.selectbox("comment", dff["comment_id"].astype(str))
row = dff[dff["comment_id"].astype(str) == cid].iloc[0]

st.subheader(row["title"])
st.write(row["text"])
st.write(row["author"] + ", " + row["date"][:10])
st.write("**model:** " + str(row["model"]))
st.markdown("**stance:** " + str(row["stance"]) + " \n**confidence:** " + str(row["stance_confidence"]) + " \n**tone:** " + str(row["tone"]))
st.write("**analysis:** "+ row["stance_rationale"])
189
viz/streamlit.py
Normal file
@@ -0,0 +1,189 @@
# streamlit run viz/streamlit.py -- --jobs-dir analysis/jobs/f452-1
import argparse
from pathlib import Path
from datetime import datetime as dt
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import streamlit as st

parser = argparse.ArgumentParser()
parser.add_argument("--jobs-dir", default="analysis/jobs/f452-1", type=Path,
                    help="Job directory containing review.csv, forum.jsonl, and prompt.txt")
args, _ = parser.parse_known_args() # parse_known_args: ignore Streamlit's own argv entries
workdir = args.jobs_dir
df = pd.read_csv(workdir/"review.csv")
df['date_dt'] = pd.to_datetime(df.date)
df["date_day"] = df["date_dt"].dt.date
forum = pd.read_json(workdir/"forum.jsonl", lines=True).iloc[0].to_dict()
prompt = (workdir/"prompt.txt").read_text(encoding="utf-8")

stance_colors = {'oppose':'#ffa15a', 'neutral':'#e377c2','support':'#19d3f3','unknown':'#000000'}
stance_order = ["oppose", "mixed", "unknown", "neutral", "support"]

st.set_page_config(layout="wide")
st.title("Virginia Townhall Explorer",anchor=None)
st.caption("Explore data collected from Virginia's public comment system. Source code at https://github.com/eulaly/vath")

st.subheader("Proposal",anchor=None,divider="gray")
st.markdown(f"**{forum.get('reg_title')}**")
st.text(forum.get('reg_desc'))
st.caption(f'Comments posted from {dt.strftime(min(df.date_dt),"%D")}—{dt.strftime(max(df.date_dt),"%D")} at https://www.townhall.virginia.gov/L/Comments.cfm?GDocForumID={forum.get("forum_id")}')

st.subheader("Comment Summary",anchor=False,divider="gray")
summary_left, summary_right = st.columns([1,2])
with summary_left:
    # Summary Table
    summary_stats = (
        df.groupby("stance").size()
        .reindex(stance_order, fill_value=0)
        .reset_index(name="count")
        .assign(percent=lambda d: (d["count"] / d["count"].sum()).map("{:.1%}".format))
    )

    st.dataframe(summary_stats, hide_index=True, width="stretch")
with summary_right:
    # Stance div-h
    counts = df["stance"].value_counts()
    stance_divh = go.Figure()
    stance_divh.add_bar(y=["stance"], x=[-counts.get("oppose",0)], name="oppose", orientation="h", marker_color=stance_colors.get('oppose'), text=[counts.get("oppose",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("neutral",0)], name="neutral", orientation="h", marker_color=stance_colors.get('neutral'), text=[counts.get("neutral",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("unknown",0)], name="unknown", orientation="h", marker_color=stance_colors.get('unknown'), text=[counts.get("unknown",0)], textposition="inside")
    stance_divh.add_bar(y=["stance"], x=[counts.get("support",0)], name="support", orientation="h", marker_color=stance_colors.get('support'), text=[counts.get("support",0)], textposition="inside")
    stance_divh.update_yaxes(title_text="",showticklabels=False)
    stance_divh.update_layout(barmode="relative", title="", height=180, margin=dict(l=0,r=0,t=0,b=0),xaxis_title="", yaxis_title="",legend=dict(orientation="v",y=0.12))
    st.plotly_chart(stance_divh,width='stretch')

# Daily Comments Breakdown, 3 Tabs
daily_wide = (
    df.groupby(["date_day", "stance"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=stance_order, fill_value=0)
    .sort_index()
)

daily_long = (
    daily_wide.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="count")
)

cum_wide = daily_wide.cumsum()

cum_long = (
    cum_wide.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
)

cum_total = cum_wide.sum(axis=1)
cum_share = cum_wide.div(cum_total.where(cum_total > 0), axis=0)

cum_share_long = (
    cum_share.reset_index()
    .melt(id_vars="date_day", var_name="stance", value_name="cumulative_share")
)


tab_daily, tab_area, tab_share = st.tabs([
    "Daily",
    "Cumulative",
    "Cumulative Share",
])

with tab_daily:
    fig = px.bar(
        daily_long,
        x="date_day",
        y="count",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_layout(barmode="stack", height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

with tab_area:
    fig = px.area(
        cum_long,
        x="date_day",
        y="cumulative_count",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_layout(height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

with tab_share:
    fig = px.line(
        cum_share_long,
        x="date_day",
        y="cumulative_share",
        color="stance",
        category_orders={"stance": stance_order},
        color_discrete_map=stance_colors,
    )
    fig.update_yaxes(tickformat=".0%", range=[0, 1])
    fig.update_layout(height=420, legend_orientation="v")
    st.plotly_chart(fig, width="stretch")

st.subheader("Comment Explorer",anchor=False,divider="gray")
# comment explorer
cex_left, cex_right = st.columns([1,1])
with cex_left:
    filter_stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
    filter_tone = st.multiselect("Filter tone", sorted(df["tone"].dropna().unique()), default=sorted(df["tone"].dropna().unique()))
    dff = df[df["stance"].isin(filter_stance) & df["tone"].isin(filter_tone)]

with cex_right:
    q = st.text_input("Search comment title and text")
    if q:
        # match the query against both title and text, as the input label says
        hit_text = dff["text"].fillna("").str.contains(q, case=False, regex=False)
        hit_title = dff["title"].fillna("").str.contains(q, case=False, regex=False)
        dff = dff[hit_text | hit_title]
    st.text(""); st.text("")
    st.text("Showing " + str(len(dff))+ " comments",text_alignment="right", width="stretch")

st.dataframe(dff[["comment_id", "title", "text", "stance", "stance_confidence", "tone"]], width="stretch")

cid = st.selectbox("Select comment to view:", dff["comment_id"].astype(str))
row = dff[dff["comment_id"].astype(str) == cid].iloc[0]

st.markdown(f'**{row["title"]}**')
st.text(row["text"])
st.write(row["author"] + ", " + row["date_dt"].strftime("%D"))

st.divider()

st.subheader('Analysis')
cexs_left, cexs_right = st.columns([1,1])
with cexs_left:
    st.write(f"**stance:** {row['stance']}")
    st.write(f"**stance_confidence:** {row['stance_confidence']:.2f}")
    st.write(f"**tone:** {row['tone']}")
    st.write("**analysis:** "+ row["stance_rationale"])
with cexs_right:
    x_order = ["unknown","oppose","mixed","neutral","support"] # includes mixed even if absent; harmless zero column
    y_order = ["positive","neutral","mixed","negative","unclear"]
    tab = pd.crosstab(df["tone"], df["stance"]).reindex(index=y_order, columns=x_order, fill_value=0)
    pct = tab.div(tab.sum(axis=1).replace(0, pd.NA), axis=0).fillna(0)
    tone_stance = px.imshow(
        pct,
        x=x_order, y=y_order,
        text_auto=".0%",
        aspect="auto",
        color_continuous_scale="Greens",
    )
    tone_stance.update_traces(text=tab.astype(str) + " / " + (pct*100).round(0).astype(int).astype(str) + "%")
    tone_stance.add_scatter(x=[row["stance"]],y=[row["tone"]],mode="markers",marker=dict(size=15,color="yellow",symbol="cross",line=dict(width=1, color="red")),showlegend=False)
    tone_stance.update_layout(height=420, xaxis_title="stance", yaxis_title="tone")
    st.plotly_chart(tone_stance, width='stretch')
    st.caption("Tone by stance, % within tone", text_alignment="right",width="stretch")

st.divider()
st.write("**model:** " + str(row["model"]))
with st.expander("Prompt", expanded=False):
    st.code(prompt, language="text")

tone_conf = px.box(df,x="stance",y="stance_confidence",color="stance",category_orders={"stance":stance_order},color_discrete_map=stance_colors,points="outliers",title="Comment Stance Classification Confidence")
tone_conf.update_yaxes(range=[0,1.02])
tone_conf.update_layout(height=430, legend_orientation="v")
st.plotly_chart(tone_conf,width="stretch")