full local streamlit support

added streamlit v1
updated reqts
2026-05-08 21:57:04 -04:00 · 2026-05-08 17:22:33 -04:00 · 2026-05-07 21:55:00 -04:00
21 changed files with 50951 additions and 17 deletions
--- a/analysis/jobs/f452-1/review.xlsx
+++ b/analysis/jobs/f452-1/review.xlsx
--- a/analysis/prompt-1.txt
+++ b/analysis/prompt-1.txt
@@ -1,6 +1,4 @@
-You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
-regulatory comment system. You will be given the text of a proposed regulation and a single
-public comment. Return ONLY a JSON object — no other text.
+You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall regulatory comment system. You will be given the text of a proposed regulation and a single public comment. Return ONLY a JSON object — no other text.

 Definitions:
 - stance: the commenter's position on whether the regulation should be adopted.
@@ -16,8 +14,6 @@ Definitions:
  "unclear"  = tone cannot be determined (e.g., a one-word comment).
 - stance_confidence: float 0.0-1.0, your confidence in the stance label.
 - stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
-  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
-  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
+- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ inclusion", "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.

 Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
--- a/docs/streamlit-snapshot.png
+++ b/docs/streamlit-snapshot.png
--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -280,10 +280,10 @@ python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
 #+end_src

 ** evidence
- commit:
+- commit: 28d6d22
 - tests: passing (pytest tests/create_csv.py tests/encoding.py)
 - csv: analysis/jobs/f452-1/review.csv
- datetime: [2026-05-07 Thu]
+- datetime: [2026-05-07 Thu 17:23]

 * [X] t1.1.1: text encoding cleanup
 fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as â€™.
@@ -309,24 +309,70 @@ fix mojibake in scraped text before analysis/reporting, especially curly quotes
 - Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.

 ** evidence
- commit:
+- commit: 1ea696d
 - tests: passing (pytest tests/encoding.py)
 - before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
- datetime: [2026-05-07 Thu]
-* === Backlog ===
-* [ ] X: first dash explorer
-create a local dash app for exploring one forum analysis dataset.
+- datetime: [2026-05-07 Thu 17:00]
+
+* [X] t1.4: graph data prototype
+create ./viz/prototype_charts.py generating individual plotly charts for exploring graphs to embed into streamlit or dash later

 ** acceptance criteria
-1. load parquet/csv review dataset
-2. show stance counts, tone counts, tag counts, and confidence histogram
-3. provide filters for stance, tone, confidence, tag, and text search
-4. show filtered comment table
+1. create graph for Stance/Share
+   - stacked h-bar with % support/oppose/neutral/unknown + raw totals, eg 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
+   - later, consider centered diverging h-bar: oppose ← | neutral/unknown | → support
+2. create graph for Stance/Time
+   - cumulative support/oppose % over time
+3. create graph for Stance/Tone (heatmap count)
+4. create graph for Confidence/Stance (boxplot or histogram)
+
+** notes
+- prototyped in plotly
+- initial streamlit prototype in viz/prototype_streamlit.py
+
+** evidence
+- commit: 3fb424d
+- tests: see viz/proto and viz/chart_tests
+- datetime: [2026-05-08 Fri 08:38]
+
+* [ ] t1.5: streamlit
+create organized webpage displaying useful information from completed job and analysis
+
+** acceptance criteria
+1. display total stance breakdown
+2. display centered horiz-bar with absolute stances
+3. show daily comment stances and cumulative
+4. show comment table with filters for stance (filter tone?)
 5. clicking/selecting a comment shows full text and model rationale
 6. app runs locally with one command
+7. add forum_url, forum_collected_date to scraper
+
+** notes
+data pulls entirely from the job; goal is to point viz/streamlit.py at any job/ folder and have everything it needs
+
+** evidence
+- commit: 
+- tests: from root dir, `streamlit run viz/streamlit.py`
+
+
+* [ ] t1.6: host streamlit
+figure out how to host this, locally or via streamlit servers
+   
+* === Backlog ===
+
 * [ ] X: complete proposal information
 Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted. 
 ** acceptance criteria
 1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
   - add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
 2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
+* [ ] X: add helper data to create_csv
+1. in create_csv.py, create helper columns:
+   - stance_signed = {"support":1, "oppose":-1, "neutral":0, "unknown":0}
+   - stance_weighted = stance_signed * stance_confidence
+   - is_support_oppose = stance in ["support", "oppose"]
+   - date_day
+   - date_hour
+   - text_norm
+   - text_hash
+   - confidence_bucket = 'low' <.7 | 'med' .7-.89 | 'high' >=.9
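Reviewer sketch for the helper-column backlog item above. This is only one possible reading of the list, written per row in plain Python so it does not depend on how create_csv.py is structured. Assumptions not in the source: the function names `add_helper_fields` and `confidence_bucket`, an ISO-8601 `date` string, `text_norm` meaning lowercased text with collapsed whitespace, `text_hash` being a SHA-1 of `text_norm`, and confidences between .89 and .9 falling into 'med'.

```python
import hashlib
from datetime import datetime

# Mapping taken from the backlog item.
STANCE_SIGN = {"support": 1, "oppose": -1, "neutral": 0, "unknown": 0}


def confidence_bucket(conf: float) -> str:
    """'low' below .7, 'high' at .9 and above, 'med' in between (assumed boundary handling)."""
    if conf >= 0.9:
        return "high"
    if conf >= 0.7:
        return "med"
    return "low"


def add_helper_fields(row: dict) -> dict:
    """Return a copy of one comment row with the helper columns added (hypothetical helper)."""
    out = dict(row)
    stance = row.get("stance") or "unknown"
    conf = float(row.get("stance_confidence") or 0.0)
    signed = STANCE_SIGN.get(stance, 0)

    out["stance_signed"] = signed
    out["stance_weighted"] = signed * conf
    out["is_support_oppose"] = stance in ("support", "oppose")

    # Assumes an ISO-8601 timestamp such as 2026-05-07T14:30:00.
    parsed = datetime.fromisoformat(row["date"])
    out["date_day"] = parsed.date().isoformat()
    out["date_hour"] = parsed.hour

    # Assumed normalization: lowercase and collapse runs of whitespace.
    norm = " ".join((row.get("text") or "").lower().split())
    out["text_norm"] = norm
    out["text_hash"] = hashlib.sha1(norm.encode("utf-8")).hexdigest()

    out["confidence_bucket"] = confidence_bucket(conf)
    return out
```

Hashing the normalized text rather than the raw text would let near-identical form-letter comments share a `text_hash`, which may or may not be the intent of that column.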
--- a/requirements.txt
+++ b/requirements.txt
--- a/viz/chart_tests/confidence_by_stance.html
+++ b/viz/chart_tests/confidence_by_stance.html
--- a/viz/chart_tests/cumulative_stance_area.html
+++ b/viz/chart_tests/cumulative_stance_area.html
--- a/viz/chart_tests/cumulative_stance_share.html
+++ b/viz/chart_tests/cumulative_stance_share.html
--- a/viz/chart_tests/stance_diverging_bar.html
+++ b/viz/chart_tests/stance_diverging_bar.html
--- a/viz/chart_tests/stance_over_time.html
+++ b/viz/chart_tests/stance_over_time.html
--- a/viz/chart_tests/stance_share.html
+++ b/viz/chart_tests/stance_share.html
--- a/viz/chart_tests/stance_tone_counts.html
+++ b/viz/chart_tests/stance_tone_counts.html
--- a/viz/chart_tests/stance_tone_heatmap.html
+++ b/viz/chart_tests/stance_tone_heatmap.html
--- a/viz/chart_tests/stance_tone_rowpct.html
+++ b/viz/chart_tests/stance_tone_rowpct.html
--- a/viz/proto/confidence_by_stance.html
+++ b/viz/proto/confidence_by_stance.html
--- a/viz/proto/stance_over_time.html
+++ b/viz/proto/stance_over_time.html
--- a/viz/proto/stance_share.html
+++ b/viz/proto/stance_share.html
--- a/viz/proto/stance_tone_heatmap.html
+++ b/viz/proto/stance_tone_heatmap.html
--- a/viz/prototype_charts.py
+++ b/viz/prototype_charts.py
@@ -0,0 +1,134 @@
+"""
+prototype_charts.py
+Generate test charts for later addition to streamlit.
+"""
+
+
+from pathlib import Path
+import pandas as pd
+import plotly.express as px
+import numpy as np
+
+inp = Path("analysis/jobs/f452-1/review.csv")  # relative to the repo root, as in viz/streamlit.py
+out = Path("viz/")
+out.mkdir(parents=True, exist_ok=True)
+
+stance_order = ["support", "oppose", "neutral", "unknown"]
+
+# tone_order = ["positive", "negative", "neutral", "mixed", "unknown", "unclear"]
+# default order was actually better - unclear/negative/neutral/mixed/positive vs unknown/oppose/neutral/support
+# same for pct w/in stance
+df = pd.read_csv(inp)
+df["date"] = pd.to_datetime(df["date"], errors="coerce")
+df["date_day"] = df["date"].dt.date
+df["stance"] = df["stance"].fillna("unknown")
+df["tone"] = df["tone"].fillna("unknown")
+
+# 1. stance share
+counts = df["stance"].value_counts().reindex(stance_order, fill_value=0).reset_index()
+counts.columns = ["stance", "count"]
+fig = px.bar(counts, x="count", y="stance", orientation="h", text="count")
+fig.write_html(out / "stance_share.html")
+
+# 2. stance over time
+daily = df.groupby(["date_day", "stance"]).size().reset_index(name="count")
+fig = px.bar(daily, x="date_day", y="count", color="stance", category_orders={"stance": stance_order})
+fig.write_html(out / "stance_over_time.html")
+
+# 3. stance x tone
+heat = df.groupby(["stance", "tone"]).size().reset_index(name="count")
+fig = px.density_heatmap(heat, x="tone", y="stance", z="count", category_orders={"stance": stance_order})
+fig.write_html(out / "stance_tone_heatmap.html")
+
+# 4. confidence by stance
+fig = px.box(df, x="stance", y="stance_confidence", category_orders={"stance": stance_order}, points="outliers")
+fig.write_html(out / "confidence_by_stance.html")
+
+# 5. cumulative stance and share over time
+daily = (
+    df.groupby(["date_day", "stance"])
+      .size()
+      .unstack(fill_value=0)
+      .reindex(columns=stance_order, fill_value=0)
+      .sort_index()
+)
+
+cum = daily.cumsum()
+cum_long = cum.reset_index().melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
+
+fig = px.area(
+    cum_long,
+    x="date_day",
+    y="cumulative_count",
+    color="stance",
+    category_orders={"stance": stance_order},
+    title="cumulative comments by stance over time",
+)
+fig.write_html(out / "cumulative_stance_area.html")
+
+cum_pct = cum.div(cum.sum(axis=1), axis=0).reset_index().melt(
+    id_vars="date_day", var_name="stance", value_name="cumulative_share"
+)
+
+fig = px.line(
+    cum_pct,
+    x="date_day",
+    y="cumulative_share",
+    color="stance",
+    category_orders={"stance": stance_order},
+    title="cumulative stance share over time",
+)
+fig.update_yaxes(tickformat=".0%")
+fig.write_html(out / "cumulative_stance_share.html")
+
+# 7. diverging h-bar
+stance_counts = df["stance"].value_counts().reindex(stance_order, fill_value=0)
+
+div = pd.DataFrame({
+    "stance": ["oppose", "support", "neutral", "unknown"],
+    "count": [
+        -stance_counts.get("oppose", 0),
+         stance_counts.get("support", 0),
+         stance_counts.get("neutral", 0),
+         stance_counts.get("unknown", 0),
+    ],
+})
+
+fig = px.bar(
+    div,
+    x="count",
+    y="stance",
+    orientation="h",
+    text=div["count"].abs(),
+    title="support vs oppose",
+)
+fig.update_xaxes(title="comments", zeroline=True)
+fig.update_traces(textposition="outside")
+fig.write_html(out / "stance_diverging_bar.html")
+
+# 8. Stance x Tone labels
+heat = pd.crosstab(df["stance"], df["tone"]).reindex(
+    index=stance_order,
+    columns=sorted(df["tone"].unique()),  # tone_order is commented out above, so use the default sorted order
+    fill_value=0,
+)
+
+fig = px.imshow(
+    heat,
+    text_auto=True,
+    aspect="auto",
+    title="stance x tone, count",
+)
+fig.write_html(out / "stance_tone_counts.html")
+
+rowpct = heat.div(heat.sum(axis=1).replace(0, np.nan), axis=0)
+
+fig = px.imshow(
+    rowpct,
+    text_auto=".0%",
+    aspect="auto",
+    title="stance x tone, percent within stance",
+)
+fig.write_html(out / "stance_tone_rowpct.html")
+
+
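Reviewer sketch: the Stance/Share criterion in docs/tasks.org asks for percent plus raw totals on each bar, while chart 1 above labels bars with the raw count only. Below is one possible way to build those labels. The function name `stance_share_labels` is hypothetical, and the rule of switching to two decimals for shares under 1% is an assumption made so that small categories do not all display as 0%.

```python
def stance_share_labels(counts: dict[str, int]) -> dict[str, str]:
    """Build 'PCT (N)' labels for each stance from raw counts."""
    total = sum(counts.values())
    labels = {}
    for stance, n in counts.items():
        share = n / total if total else 0.0
        # Whole percents for ordinary shares, two decimals for non-zero shares under 1%.
        if n == 0 or share >= 0.01:
            pct = f"{share:.0%}"
        else:
            pct = f"{share:.2%}"
        labels[stance] = f"{pct} ({n})"
    return labels


labels = stance_share_labels({"support": 5720, "oppose": 3320, "neutral": 8, "unknown": 34})
print(" / ".join(labels.values()))
# prints: 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
```

The resulting strings could be passed as the `text` column of `px.bar` in place of `count`.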
--- a/viz/prototype_streamlit.py
+++ b/viz/prototype_streamlit.py
@@ -0,0 +1,28 @@
+# streamlit run viz/prototype_streamlit.py
+from datetime import datetime
+import pandas as pd
+import plotly.graph_objects as go
+import plotly.express as px
+import streamlit as st
+
+df = pd.read_csv(r"analysis/jobs/f452-1/review.csv")
+st.set_page_config(layout="wide")
+
+stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
+q = st.text_input("Search comment text")
+dff = df[df["stance"].isin(stance)]
+if q:
+    dff = dff[dff["text"].fillna("").str.contains(q, case=False, regex=False)]
+
+st.dataframe(dff[["comment_id", "title", "stance", "stance_confidence", "tone"]], width="stretch")
+st.write("Showing " + str(len(dff))+ " comments")
+
+cid = st.selectbox("comment", dff["comment_id"].astype(str))
+row = dff[dff["comment_id"].astype(str) == cid].iloc[0]
+
+st.subheader(row["title"])
+st.write(row["text"])
+st.write(row["author"] + ", " + row["date"][:10])
+st.write("**model:** " + str(row["model"]))
+st.markdown("**stance:** " + str(row["stance"]) + "  \n**confidence:** " + str(row["stance_confidence"]) + "  \n**tone:** " + str(row["tone"]))
+st.write("**analysis:** "+ row["stance_rationale"])
--- a/viz/streamlit.py
+++ b/viz/streamlit.py
@@ -0,0 +1,186 @@
+# streamlit run viz/streamlit.py
+import argparse
+from pathlib import Path
+from datetime import datetime as dt
+import pandas as pd
+import plotly.graph_objects as go
+import plotly.express as px
+import plotly.subplots as ps
+import streamlit as st
+
+workdir = Path("analysis/jobs/f452-1")
+df = pd.read_csv(workdir/"review.csv")
+df['date_dt'] = pd.to_datetime(df.date)
+df["date_day"] = df["date_dt"].dt.date
+forum = pd.read_json(workdir/"forum.jsonl", lines=True).iloc[0].to_dict()
+prompt = (workdir/"prompt.txt").read_text(encoding="utf-8")
+
+stance_colors = {'oppose':'#ffa15a', 'neutral':'#e377c2','support':'#19d3f3','unknown':'#000000'}
+stance_order = ["oppose", "mixed", "unknown", "neutral", "support"]
+
+st.set_page_config(layout="wide")
+st.title("Virginia Townhall Explorer",anchor=None)
+st.caption("Explore data collected from Virginia's public comment system. Source code at https://github.com/eulaly/vath")
+
+st.subheader("Proposal",anchor=None,divider="gray")
+st.markdown(f"**{forum.get('reg_title')}**")
+st.text(forum.get('reg_desc'))
+st.caption(f'Comments posted from {dt.strftime(min(df.date_dt),"%D")}—{dt.strftime(max(df.date_dt),"%D")} at https://www.townhall.virginia.gov/L/Comments.cfm?GDocForumID={forum.get("forum_id")}')
+
+st.subheader("Comment Summary",anchor=False,divider="gray")
+summary_left, summary_right = st.columns([1,2])
+with summary_left:
+    # Summary Table
+    summary_stats = (
+        df.groupby("stance").size()
+          .reindex(stance_order, fill_value=0)
+          .reset_index(name="count")
+          .assign(percent=lambda d: (d["count"] / d["count"].sum()).map("{:.1%}".format))
+    )
+
+    st.dataframe(summary_stats, hide_index=True, width="stretch")
+with summary_right:
+    # Stance div-h
+    counts = df["stance"].value_counts()
+    stance_divh = go.Figure()
+    stance_divh.add_bar(y=["stance"], x=[-counts.get("oppose",0)], name="oppose", orientation="h", marker_color=stance_colors.get('oppose'), text=[counts.get("oppose",0)], textposition="inside")
+    stance_divh.add_bar(y=["stance"], x=[counts.get("neutral",0)], name="neutral", orientation="h", marker_color=stance_colors.get('neutral'), text=[counts.get("neutral",0)], textposition="inside")
+    stance_divh.add_bar(y=["stance"], x=[counts.get("unknown",0)], name="unknown", orientation="h", marker_color=stance_colors.get('unknown'), text=[counts.get("unknown",0)], textposition="inside")
+    stance_divh.add_bar(y=["stance"], x=[counts.get("support",0)], name="support", orientation="h", marker_color=stance_colors.get('support'), text=[counts.get("support",0)], textposition="inside")
+    stance_divh.update_yaxes(title_text="",showticklabels=False)
+    stance_divh.update_layout(barmode="relative", title="", height=180, margin=dict(l=0,r=0,t=0,b=0),xaxis_title="", yaxis_title="",legend=dict(orientation="v",y=0.12))
+    st.plotly_chart(stance_divh,width='stretch')
+
+# Daily Comments Breakdown, 3 Tabs
+daily_wide = (
+    df.groupby(["date_day", "stance"])
+      .size()
+      .unstack(fill_value=0)
+      .reindex(columns=stance_order, fill_value=0)
+      .sort_index()
+)
+
+daily_long = (
+    daily_wide.reset_index()
+      .melt(id_vars="date_day", var_name="stance", value_name="count")
+)
+
+cum_wide = daily_wide.cumsum()
+
+cum_long = (
+    cum_wide.reset_index()
+      .melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
+)
+
+cum_total = cum_wide.sum(axis=1)
+cum_share = cum_wide.div(cum_total.where(cum_total > 0), axis=0)
+
+cum_share_long = (
+    cum_share.reset_index()
+      .melt(id_vars="date_day", var_name="stance", value_name="cumulative_share")
+)
+
+
+tab_daily, tab_area, tab_share = st.tabs([
+    "Daily",
+    "Cumulative",
+    "Cumulative Share",
+])
+
+with tab_daily:
+    fig = px.bar(
+        daily_long,
+        x="date_day",
+        y="count",
+        color="stance",
+        category_orders={"stance": stance_order},
+        color_discrete_map=stance_colors,
+    )
+    fig.update_layout(barmode="stack", height=420, legend_orientation="v")
+    st.plotly_chart(fig, width="stretch")
+
+with tab_area:
+    fig = px.area(
+        cum_long,
+        x="date_day",
+        y="cumulative_count",
+        color="stance",
+        category_orders={"stance": stance_order},
+        color_discrete_map=stance_colors,
+    )
+    fig.update_layout(height=420, legend_orientation="v")
+    st.plotly_chart(fig, width="stretch")
+
+with tab_share:
+    fig = px.line(
+        cum_share_long,
+        x="date_day",
+        y="cumulative_share",
+        color="stance",
+        category_orders={"stance": stance_order},
+        color_discrete_map=stance_colors,
+    )
+    fig.update_yaxes(tickformat=".0%", range=[0, 1])
+    fig.update_layout(height=420, legend_orientation="v")
+    st.plotly_chart(fig, width="stretch")
+    
+st.subheader("Comment Explorer",anchor=False,divider="gray") 
+# comment explorer
+cex_left, cex_right = st.columns([1,1])
+with cex_left:
+    stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
+    q = st.text_input("Search comment title and text")
+    dff = df[df["stance"].isin(stance)]
+    if q:
+        dff = dff[(dff["title"].fillna("").astype(str) + " " + dff["text"].fillna("").astype(str)).str.contains(q, case=False, regex=False)]
+
+with cex_right:
+    filter_tone = st.multiselect("Filter tone", sorted(df["tone"].dropna().unique()), default=sorted(df["tone"].dropna().unique()))
+    dff = dff[dff["tone"].isin(filter_tone)]  # apply the tone filter, which was defined but never used
+    st.text("Showing " + str(len(dff))+ " comments",text_alignment="right", width="stretch")
+
+st.dataframe(dff[["comment_id", "title", "text", "stance", "stance_confidence", "tone"]], width="stretch")
+
+cid = st.selectbox("Select comment to view:", dff["comment_id"].astype(str))
+row = dff[dff["comment_id"].astype(str) == cid].iloc[0]
+
+st.markdown(f'**{row["title"]}**')
+st.text(row["text"])
+st.write(str(row["author"]) + ", " + row["date_dt"].strftime("%D"))
+
+st.divider()
+
+st.subheader('Analysis')
+cexs_left, cexs_right = st.columns([1,1])
+with cexs_left:
+    st.write(f"**stance:** {row['stance']}")
+    st.write(f"**stance_confidence:** {row['stance_confidence']:.2f}")
+    st.write(f"**tone:** {row['tone']}")
+    st.write("**analysis:** " + str(row["stance_rationale"]))
+with cexs_right:
+    x_order = ["unknown","oppose","mixed","neutral","support"]  # includes mixed even if absent; harmless zero column
+    y_order = ["positive","neutral","mixed","negative","unclear"]
+    tab = pd.crosstab(df["tone"], df["stance"]).reindex(index=y_order, columns=x_order, fill_value=0)
+    pct = tab.div(tab.sum(axis=1).replace(0, float("nan")), axis=0).fillna(0)
+    tone_stance = px.imshow(
+        pct,
+        x=x_order, y=y_order,
+        text_auto=".0%",
+        aspect="auto",
+        color_continuous_scale="Greens",
+    )
+    tone_stance.update_traces(text=tab.astype(str) + " / " + (pct*100).round(0).astype(int).astype(str) + "%")
+    tone_stance.add_scatter(x=[row["stance"]],y=[row["tone"]],mode="markers",marker=dict(size=15,color="yellow",symbol="cross",line=dict(width=1, color="red")),showlegend=False)
+    tone_stance.update_layout(height=420, xaxis_title="stance", yaxis_title="tone")
+    st.plotly_chart(tone_stance, width='stretch')
+    st.caption("Tone by stance, % within tone", text_alignment="right",width="stretch")
+
+st.divider()
+st.write("**model:** " + str(row["model"]))
+with st.expander("Prompt", expanded=False):
+    st.code(prompt, language="text")
+
+tone_conf = px.box(df,x="stance",y="stance_confidence",color="stance",category_orders={"stance":stance_order},color_discrete_map=stance_colors,points="outliers",title="Comment Stance Classification Confidence")
+tone_conf.update_yaxes(range=[0,1.02])
+tone_conf.update_layout(height=430, legend_orientation="v")
+st.plotly_chart(tone_conf,width="stretch")
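Reviewer sketch: the t1.5 notes say the goal is to point viz/streamlit.py at any job/ folder, and the script imports argparse but still hardcodes `workdir`. One possible way to close that gap is shown below. The `--job` flag and the names `parse_job_dir`, `missing_job_files`, and `DEFAULT_JOB` are hypothetical; the three required file names are the ones the script above reads.

```python
import argparse
from pathlib import Path

# Same default as the hardcoded workdir in viz/streamlit.py.
DEFAULT_JOB = Path("analysis/jobs/f452-1")


def parse_job_dir(argv: list[str]) -> Path:
    """Read an optional --job argument, ignoring any arguments we do not know."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--job", type=Path, default=DEFAULT_JOB)
    args, _unknown = parser.parse_known_args(argv)
    return args.job


def missing_job_files(job_dir: Path) -> list[str]:
    """List the files the app reads that are absent from job_dir."""
    required = ["review.csv", "forum.jsonl", "prompt.txt"]
    return [name for name in required if not (job_dir / name).is_file()]
```

Streamlit forwards anything after a bare `--` to the script, so this would be invoked as `streamlit run viz/streamlit.py -- --job analysis/jobs/f452-1`, with `workdir = parse_job_dir(sys.argv[1:])` inside the app. Checking `missing_job_files(workdir)` before loading would let the page show a readable error instead of a traceback.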
Author	SHA1	Message	Date
eulaly	afd5b8c60e	full local streamlit support	2026-05-08 21:57:04 -04:00
eulaly	3fb424da3c	added streamlit v1	2026-05-08 17:22:33 -04:00
eulaly	c3f2911563	updated reqts	2026-05-07 21:55:00 -04:00