added forum metadata for later use

streamlit > local docker
updated readme
2026-05-09 00:36:30 -04:00 · 2026-05-09 00:25:27 -04:00 · 2026-05-09 00:02:24 -04:00 · 2026-05-09 00:00:59 -04:00 · 2026-05-08 23:57:46 -04:00 · 2026-05-08 23:33:55 -04:00
33 changed files with 60917 additions and 165 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -29,3 +29,4 @@ output/

 # --- misc ---
 .DS_Store
+*~$*
--- a/README.md
+++ b/README.md
@@ -1,21 +1,5 @@

-# Table of Contents
-
-1.  [Project Goals](#org5acb669)
-    1.  [Document and analyze sentiment](#org9291576)
-    2.  [Make data available](#org8054421)
-    3.  [Generalize](#orgdda4b6f)
-2.  [Architecture](#org1d6bc40)
-    1.  [Scraper](#org4298028)
-    2.  [Storage](#org1cd413c)
-    3.  [Analysis](#orgaea450e)
-3.  [Roadmap](#org6b7660d)
-
-
-
-<a id="org5acb669"></a>
-
-# Project Goals
+## Project Goals

 1.  Document and analyze sentiment of public comments on Virginia law, to determine:
    1.  the utility of this forum as a mechanism for public comment, and
@@ -23,131 +7,127 @@
 2.  Make data and insights broadly available.
 3.  Generalize to other public comment tools.

-
-<a id="org9291576"></a>
-
-## Document and analyze sentiment
-
-   Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
-   Build tests for identifying abuse, such as spam and account fraud
-   Identify any patterns connecting measured sentiment against VA decisions
+Take a look at https://vatownhall.streamlit.app
+![img](./docs/streamlit-snapshot.png)


-<a id="org8054421"></a>
+### Research questions

-## Make data available
-
-   Pick a good visualization tool
+1.  What is the quality of the comments on the forum?
+    1.  Are there duplicate entries?
+    2.  Are there non-human-generated entries?
+    3.  Are there entries intended to abuse the forum or drown out comment?
+2.  How do commenters feel about the proposed change?
+    1.  What is the total number and percent supporting vs opposing, and how does this change over time?
+    2.  What is the type of support, such as strong/weak, positive/negative?
+3.  What impact do the comments have on the proposed change?
+    (I anticipate this will not be measurable from currently available data)


-<a id="orgdda4b6f"></a>
+<a id="orgfabfcd9"></a>

-## Generalize
+## Architecture

-   Identify scalable ways to apply this toolset to similar problems
+1.  Scrape/Parse: Scrapy
+2.  Sentiment analysis: gpt-5.4-mini
+3.  Display: streamlit
+4.  Storage: jsonl, csv, parquet
+
+![img](./docs/pipeline-v1.2.3.svg)


-<a id="org1d6bc40"></a>
+<a id="org2c5c7a2"></a>

-# Architecture
+### Scraper

-1.  Scrape/Parse: ****Scrapy**** for downloading comments
-2.  Storage: json
-3.  Sentiment analysis: Claude haiku
-4.  Display: TBD
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
+
+1.  Forums listing page: `Forums.cfm` lists all open forums with agency, reg title, action type, brief description, closing date, comment count
+2.  Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` lists comments with title, author, date
+3.  Individual comment page: `viewcomments.cfm?commentid=X` shows regulation title + brief description at the top, plus the comment


-<a id="org4298028"></a>
+<a id="org72990f4"></a>

-## Scraper
+### Analysis

-Scrapy provides a simple mechanism for browsing and 
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We add the proposed change as context to the model.

-1.  Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
-2.  Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
-3.  Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
+Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.
+
+Then, the batch processing script uses `report.json` to create multiple jobs, with subcommands to submit them, check their status, and download the results.
+
+We selected gpt-5.4-mini for a good balance of quality, cost, and time.
+
+1.  Prompt
+    ```
+    You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
+    regulatory comment system. You will be given the text of a proposed regulation and a single
+    public comment. Return ONLY a JSON object — no other text.
+    
+    Definitions:
+    -   stance: the commenter's position on whether the regulation should be adopted.
+        "support" = wants it approved (as-is or with changes);
+        "oppose"  = wants it rejected or substantially weakened;
+        "neutral" = takes no position, asks a question, or provides factual input only;
+        "unknown" = too vague, off-topic, or uninterpretable to classify.
+    -   tone: the emotional register of the writing, independent of stance.
+        "positive" = affirming, hopeful, appreciative;
+        "negative" = angry, fearful, alarmed, or contemptuous;
+        "neutral"  = matter-of-fact, procedural, or informational;
+        "mixed"    = contains both positive and negative emotional content;
+        "unclear"  = tone cannot be determined (e.g., a one-word comment).
+    -   stance_confidence: float 0.0-1.0, your confidence in the stance label.
+    -   stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
+    -   tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
+        "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
+        "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
+    
+    Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
+    ```


-<a id="org1cd413c"></a>
+<a id="org58a5b72"></a>

-## Storage
+### Storage

-One JSONL file per forum/bill.
+-   Each scraped forum is saved to `output/<forum-id>.jsonl`
+-   Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
+-   Each job is saved to `analysis/jobs/<report-id>`:
+     └─`forum.jsonl` is a copy of the scraped forum for convenience  
+     └─`prompt.txt` is a copy of the prompt used  
+     └─`report.json` is a copy of the report used  
+     └─`status.json` contains metadata about the job  
+    For each batch in the job, four files are created:  
+     └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting  
+     └─`jobN-output-raw.jsonl` contains the exact response from the API  
+     └─`jobN-output.jsonl` contains the normalized analysis records parsed from the raw response  
+     └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)  
+-   Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.


-<a id="orgaea450e"></a>
+<a id="org24fe465"></a>

-## Analysis
+## Instructions

-Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
-
-<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
+1.  Scrape the forum.  
+    `python`  
+2.  Run model report.  
+    `python analysis/tokenizer.py <input> --prompt <prompt>`  
+3.  To run a realtime subset:  
+    `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`  
+    `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`  
+4.  To create and run the whole thing in batches, first create the batch jobs from the report:  
+    `python analysis/openai_batch.py create <report> --model <model>`  
+    `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`  
+5.  Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.  
+    `python analysis/openai_batch.py submit`  
+    `python analysis/openai_batch.py status`  
+    `python analysis/openai_batch.py download`  
+    `python analysis/openai_batch.py submit`  


-<colgroup>
-<col  class="org-left" />
-
-<col  class="org-left" />
-
-<col  class="org-left" />
-
-<col  class="org-left" />
-
-<col  class="org-left" />
-
-<col  class="org-left" />
-</colgroup>
-<thead>
-<tr>
-<th scope="col" class="org-left">Tool</th>
-<th scope="col" class="org-left">Output</th>
-<th scope="col" class="org-left">Context</th>
-<th scope="col" class="org-left">Sarcasm</th>
-<th scope="col" class="org-left">Context window</th>
-<th scope="col" class="org-left">Cost/1k comments</th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td class="org-left">Google NL API</td>
-<td class="org-left">-1→+1, magnitude</td>
-<td class="org-left">No/generic</td>
-<td class="org-left">Poorly</td>
-<td class="org-left">No</td>
-<td class="org-left">~$1–2</td>
-</tr>
-
-<tr>
-<td class="org-left">Amazon Comprehend</td>
-<td class="org-left">Pos/Neg/Neutral/Mixed</td>
-<td class="org-left">No/generic</td>
-<td class="org-left">Poorly</td>
-<td class="org-left">No</td>
-<td class="org-left">~$0.10</td>
-</tr>
-
-<tr>
-<td class="org-left">Claude Haiku</td>
-<td class="org-left">Prompted → for/against/neutral</td>
-<td class="org-left">Yes</td>
-<td class="org-left">Yes, with prompt</td>
-<td class="org-left">Yes</td>
-<td class="org-left">~$0.10–0.30</td>
-</tr>
-
-<tr>
-<td class="org-left">GPT-4o-mini</td>
-<td class="org-left">Prompted → same</td>
-<td class="org-left">Yes</td>
-<td class="org-left">Yes</td>
-<td class="org-left">Yes</td>
-<td class="org-left">~$0.05–0.15</td>
-</tr>
-</tbody>
-</table>
-
-
-<a id="org6b7660d"></a>
+<a id="org5739d49"></a>

 # Roadmap

--- a/analysis/create_csv.py
+++ b/analysis/create_csv.py
@@ -0,0 +1,76 @@
+#!/usr/bin/env python3
+"""analysis/create_csv.py — join raw scrape with analysis output for review."""
+
+import argparse
+from pathlib import Path
+
+import pandas as pd
+
+RAW_COLS = ["forum_id", "comment_id", "title", "text", "date", "author"]
+ANALYSIS_COLS = [
+    "stance", "stance_confidence", "stance_rationale", "tone", "tags",
+    "error", "truncated", "analyzed_at", "prompt_version", "model",
+]
+OUTPUT_COLS = RAW_COLS + ANALYSIS_COLS
+
+
+def load_raw(path: Path) -> pd.DataFrame:
+    df = pd.read_json(path, lines=True)
+    df = df[df["comment_id"].notna()]  # drop the leading forum record (it has no comment_id)
+    for col in RAW_COLS:
+        if col not in df.columns:
+            df[col] = None
+    return df[RAW_COLS].copy()
+
+
+def load_analysis(jobs_dir: Path) -> pd.DataFrame:
+    files = sorted(p for p in jobs_dir.glob("job*-output.jsonl") if "-raw" not in p.name)
+    df = pd.concat([pd.read_json(p, lines=True) for p in files], ignore_index=True)
+    for col in ANALYSIS_COLS:
+        if col not in df.columns:
+            df[col] = None
+    return df[["comment_id"] + ANALYSIS_COLS].copy()
+
+
+def join(raw: pd.DataFrame, analysis: pd.DataFrame) -> pd.DataFrame:
+    return raw.merge(analysis, on="comment_id", how="left")[OUTPUT_COLS]
+
+
+def print_counts(raw: pd.DataFrame, analysis: pd.DataFrame, merged: pd.DataFrame) -> None:
+    print(f"\nRaw comments  : {len(raw):,}")
+    print(f"Analyzed      : {len(analysis):,}")
+    print(f"Joined        : {merged['stance'].notna().sum():,}")
+    print(f"Unanalyzed    : {merged['stance'].isna().sum():,}")
+    print(f"Errors        : {analysis['error'].notna().sum():,}")
+    print(f"Dup IDs (raw) : {raw['comment_id'].duplicated().sum():,}")
+    print(f"\nStance:\n{analysis['stance'].value_counts(dropna=False).to_string()}")
+    print(f"\nTone:\n{analysis['tone'].value_counts(dropna=False).to_string()}\n")
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(
+        description="Join raw scrape JSONL with analysis output; write review CSV."
+    )
+    p.add_argument("input",    help="Raw scrape JSONL (e.g. output/f452.jsonl)")
+    p.add_argument("jobs_dir", help="Job directory containing job*-output.jsonl files")
+    p.add_argument("--parquet", action="store_true", help="Also write review.parquet")
+    p.add_argument("--out", default=None, help="Output CSV path (default: <jobs_dir>/review.csv)")
+    args = p.parse_args()
+
+    raw      = load_raw(Path(args.input))
+    analysis = load_analysis(Path(args.jobs_dir))
+    merged   = join(raw, analysis)
+    print_counts(raw, analysis, merged)
+
+    out = Path(args.out) if args.out else Path(args.jobs_dir) / "review.csv"
+    merged.to_csv(out, index=False, encoding="utf-8-sig")
+    print(f"CSV     → {out}")
+
+    if args.parquet:
+        pq = out.with_suffix(".parquet")
+        merged.to_parquet(pq, index=False)
+        print(f"Parquet → {pq}")
+
+
+if __name__ == "__main__":
+    main()
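A note on the file filter in `load_analysis`: the glob pattern `job*-output.jsonl` requires the name to end in `-output.jsonl`, so it is worth checking what the extra `"-raw" not in p.name` test actually changes. The sketch below is a minimal illustration, assuming `Path.glob` matches these simple names the same way `fnmatch` does; the file names are hypothetical and only follow the `jobN-*` layout described in the README.

```python
from fnmatch import fnmatchcase

PATTERN = "job*-output.jsonl"

# Hypothetical file names following the jobN-* layout from the README.
names = [
    "job1-input.jsonl",
    "job1-output-raw.jsonl",
    "job1-output.jsonl",
    "job1-output-errors.jsonl",
    "job2-output.jsonl",
]

# Names the pattern accepts on its own.
matched = sorted(n for n in names if fnmatchcase(n, PATTERN))

# Names left after the additional "-raw" exclusion used in load_analysis.
filtered = [n for n in matched if "-raw" not in n]

print(matched)
print(filtered == matched)
```

Under that assumption the pattern alone already skips the raw and error files, so the `-raw` check acts as a safety net rather than a required filter.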
--- a/analysis/encoding.py
+++ b/analysis/encoding.py
@@ -0,0 +1,74 @@
+"""
+analysis/encoding.py — text encoding repair for scraped content.
+
+The townhall.virginia.gov scraper forces UTF-8 decoding, which is correct for the
+site's current content. This module provides a defensive repair function for cases
+where a response arrives with Windows-1252/cp1252 bytes embedded in otherwise UTF-8
+content (common in older CMSes). The raw scrape files are never modified; repair is
+applied at the analysis and reporting layers only.
+
+Primary: uses `ftfy` when installed (pip install ftfy).
+Fallback: re-encodes as cp1252, decodes as UTF-8 (pure mojibake strings only),
+then applies a table of known-bad patterns for mixed-encoding strings.
+"""
+
+# ---------------------------------------------------------------------------
+# Known patterns: UTF-8 bytes decoded as cp1252, i.e. the 3-char sequences you
+# see when a server sends e.g. E2 80 99 and it gets decoded as cp1252 chars.
+#
+# Byte → cp1252 char mappings for the 0x80–0x9F range:
+#   E2 → â  (U+00E2, always)
+#   80 → €  (U+20AC, cp1252 0x80)
+#   99 → ™  (U+2122, cp1252 0x99)  ← E2 80 99 = U+2019 ' right single quote
+#   98 → ˜  (U+02DC, cp1252 0x98)  ← E2 80 98 = U+2018 ' left single quote
+#   9C → œ  (U+0153, cp1252 0x9C)  ← E2 80 9C = U+201C " left double quote
+#   9D → \x9d (undefined → U+009D) ← E2 80 9D = U+201D " right double quote
+#   93 → "  (U+201C, cp1252 0x93)  ← E2 80 93 = U+2013 – en dash
+#   94 → "  (U+201D, cp1252 0x94)  ← E2 80 94 = U+2014 — em dash
+#   A6 → ¦  (U+00A6, cp1252 0xA6)  ← E2 80 A6 = U+2026 … ellipsis
+
+_KNOWN_REPAIRS: list[tuple[str, str]] = [
+    # Longer / more specific patterns first to avoid partial matches
+    ("â€™",  "’"),  # â€™ → ' right single quote
+    ("â€˜",  "‘"),  # â€˜ → ' left single quote
+    ("â€œ",  "“"),  # â€œ → " left double quote
+    ("â€",  "”"),  # â€\x9d → " right double quote
+    ("â€“",  "–"),  # â€" (with left DQ) → – en dash
+    ("â€”",  "—"),  # â€" (with right DQ) → — em dash
+    ("â€¦",  "…"),  # â€¦ → … ellipsis
+    # Generic fallback: bare â€ prefix not caught above → remove artifact
+    ("â€",        ""),
+]
+
+
+def repair_text(text: str) -> str:
+    """Repair common encoding artifacts in scraped text.
+
+    Handles:
+    - UTF-8 bytes decoded as cp1252/Latin-1 (â€™ → ')
+    - Attempts best-effort cleanup for mixed-encoding strings
+
+    U+FFFD replacement characters (from strict UTF-8 decoding of cp1252 bytes)
+    cannot be recovered since the original byte is lost; they are left as-is.
+    """
+    if not text:
+        return text
+
+    try:
+        import ftfy
+        return ftfy.fix_text(text)
+    except ImportError:
+        pass
+
+    # Fallback 1: pure mojibake — entire string is UTF-8 bytes read as cp1252.
+    # Re-encode as cp1252 and decode as UTF-8.
+    try:
+        return text.encode("cp1252").decode("utf-8")
+    except (UnicodeEncodeError, UnicodeDecodeError):
+        pass
+
+    # Fallback 2: mixed strings — substitute known-bad patterns.
+    for bad, good in _KNOWN_REPAIRS:
+        if bad in text:
+            text = text.replace(bad, good)
+    return text
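The two fallbacks in `repair_text` can be illustrated with a short, self-contained sketch. It is not the module's code: `to_mojibake` and `round_trip_repair` are hypothetical helper names, and the sample strings are synthetic. It shows why the cp1252 round trip works for pure mojibake, and why a substitution table is still useful for mixed strings and for the right double quote.

```python
def to_mojibake(text: str) -> str:
    """Simulate the failure: UTF-8 bytes mistakenly decoded as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def round_trip_repair(text: str) -> str:
    """Undo pure mojibake by reversing the mistaken decode."""
    return text.encode("cp1252").decode("utf-8")


# Right single quote U+2019 is E2 80 99 in UTF-8; all three bytes map in cp1252.
broken = to_mojibake("don\u2019t")
repaired = round_trip_repair(broken)

# Right double quote U+201D is E2 80 9D; Python's built-in cp1252 codec
# leaves byte 0x9D undefined, so the strict decode raises instead.
try:
    to_mojibake("\u201d")
    right_dq_decodes = True
except UnicodeDecodeError:
    right_dq_decodes = False

# Mixed string: a legitimate "é" next to a mojibake sequence. Re-encoding
# turns "é" into the lone byte E9, which is not valid UTF-8 here.
mixed = "caf\u00e9 " + to_mojibake("\u2019")
try:
    round_trip_repair(mixed)
    mixed_round_trips = True
except UnicodeDecodeError:
    mixed_round_trips = False

print(repr(broken), repr(repaired), right_dq_decodes, mixed_round_trips)
```

The mixed case is the one the `_KNOWN_REPAIRS` table targets: the whole-string round trip fails, so only pattern substitution can fix the damaged part without touching the valid characters.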
--- a/analysis/jobs/f452-1/review.csv
+++ b/analysis/jobs/f452-1/review.csv
--- a/analysis/jobs/f452-1/review.xlsx
+++ b/analysis/jobs/f452-1/review.xlsx
--- a/analysis/prompt-1.txt
+++ b/analysis/prompt-1.txt
@@ -1,6 +1,4 @@
-You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
-regulatory comment system. You will be given the text of a proposed regulation and a single
-public comment. Return ONLY a JSON object — no other text.
+You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall regulatory comment system. You will be given the text of a proposed regulation and a single public comment. Return ONLY a JSON object — no other text.

 Definitions:
 - stance: the commenter's position on whether the regulation should be adopted.
@@ -16,8 +14,6 @@ Definitions:
  "unclear"  = tone cannot be determined (e.g., a one-word comment).
 - stance_confidence: float 0.0-1.0, your confidence in the stance label.
 - stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
-  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
-  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
+- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ inclusion", "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.

 Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
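Because the prompt asks for exactly five keys and fixed label sets, a reply can be checked mechanically before it is accepted. The following is a minimal sketch of such a check, not the project's normalization code; `validate_reply` is a hypothetical name, and the label lists are copied from the prompt definitions.

```python
import json

REQUIRED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}
# Tuples rather than sets so that unhashable values (e.g. a list) compare safely.
STANCES = ("support", "oppose", "neutral", "unknown")
TONES = ("positive", "negative", "neutral", "mixed", "unclear")


def validate_reply(raw: str) -> list[str]:
    """Return the problems found in one model reply (empty list if none)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(obj) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(obj)}")
    if obj.get("stance") not in STANCES:
        problems.append("stance outside allowed labels")
    if obj.get("tone") not in TONES:
        problems.append("tone outside allowed labels")
    conf = obj.get("stance_confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append("stance_confidence is not a number in [0.0, 1.0]")
    tags = obj.get("tags")
    if not isinstance(tags, list) or len(tags) > 5:
        problems.append("tags is not a list of at most 5 items")
    return problems


good = json.dumps({
    "stance": "oppose",
    "stance_confidence": 0.9,
    "stance_rationale": "Asks the board to reject the proposal.",
    "tone": "negative",
    "tags": ["privacy"],
})
bad = json.dumps({**json.loads(good), "stance": "against"})

print(validate_reply(good))
print(validate_reply(bad))
print(validate_reply("not json"))
```

Returning a list of problems instead of raising keeps the check usable in a batch loop, where a failed record could be routed to the `error` column rather than stopping the run.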
--- a/docs/excel-snapshot.png
+++ b/docs/excel-snapshot.png
--- a/docs/streamlit-snapshot.png
+++ b/docs/streamlit-snapshot.png
--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -244,9 +244,9 @@ python analysis/openai_batch.py submit
 - tests: passing (pytest tests/openai_batch.py tests/openai_realtime.py tests/tokenizer.py)
 - datetime: [2026-05-06 Wed]

-* === Backlog ===
-* [ ] X: analysis validation view
+* [X] t1.3: cleanup model output and rejoin
 create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv.
+review create_csv for the simple approach - keep this regardless

 ** acceptance criteria
 1. input raw scrape jsonl and all *-output.jsonl files in a dir
@@ -255,7 +255,8 @@ create a lightweight validation script that joins raw comments to normalized ana
   - forum_id, comment_id, title, text, date, author
   - stance, stance_confidence, stance_rationale, tone, tags
   - error, truncated, analyzed_at, prompt_version, model
-4. print validation counts
+4. output parquet?
+5. print validation counts
   - raw comments
   - analyzed records
   - joined records
@@ -264,16 +265,30 @@ create a lightweight validation script that joins raw comments to normalized ana
   - error records
   - stance counts
   - tone counts
-5. tests cover join behavior and missing/duplicate ids
+6. tests cover join behavior and missing/duplicate ids
+
+** notes
+- analysis/create_csv.py: reads raw scrape JSONL + all job*-output.jsonl in a job dir (skips *-output-raw.jsonl); left-joins on comment_id; writes review.csv (UTF-8 BOM for Excel); optional --parquet.
+- Uses pd.read_json(path, lines=True) — no manual JSON parsing.
+- Prints summary counts: raw/analyzed/joined/unanalyzed/errors/duplicate IDs, stance distribution, tone distribution.
+
+*** usage
+#+begin_src sh
+python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/
+python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
+# output: analysis/jobs/f452-1/review.csv (and optionally review.parquet)
+#+end_src

 ** evidence
- commit:
- tests:
- csv:
- datetime:       
-* [ ] X: text encoding cleanup
+- commit: 28d6d22
+- tests: passing (pytest tests/create_csv.py tests/encoding.py)
+- csv: analysis/jobs/f452-1/review.csv
+- datetime: [2026-05-07 Thu 17:23]
+
+* [X] t1.1.1: text encoding cleanup
 fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as â€™.

+
 ** acceptance criteria
 1. identify whether mojibake exists in raw scrape, analysis output, or csv export only
 2. add repair step at the earliest correct layer
@@ -286,14 +301,82 @@ fix mojibake in scraped text before analysis/reporting, especially curly quotes
   - â€”
 5. document whether repaired text is used for model input

+** notes
+- Diagnosis: f452.jsonl raw data is CLEAN — proper Unicode throughout (U+2019, U+201C, etc.). The DEFAULT_RESPONSE_ENCODING=utf-8 spider setting is working for this site. No mojibake or FFFD chars found.
+- The encoding issue would surface for forums whose server sends cp1252 bytes (0x91-0x97 range) embedded in otherwise UTF-8 content. FFFD replacement chars appear when the UTF-8 decoder hits those bytes. Once the byte is replaced by FFFD, the original character cannot be recovered.
+- Repair layer: analysis/encoding.py applied in analysis/validate.py at reporting time. Raw scrape JSONL is never modified (AC3).
+- Model input: repair_text() is NOT applied in build_messages() for this dataset since raw data is clean. Can be added if a future forum produces dirty text.
+- Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.
+
 ** evidence
- commit:
- tests:
- before/after sample:
- datetime:
+- commit: 1ea696d
+- tests: passing (pytest tests/encoding.py)
+- before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
+- datetime: [2026-05-07 Thu 17:00]
+
+* [X] t1.4: graph data prototype
+create ./viz/prototype_charts.py generating individual plotly charts for exploring graphs to embed into streamlit or dash later
+
+** acceptance criteria
+2. create graph for Stance/Share
+   - stacked h-bar with % support/oppose/neutral/unknown + raw totals, eg  63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
+   - later, consider centered diverging h-bar: oppose ← | neutral/unknown | → support
+3. create graph for Stance/Time: 
+   - cumulative support/oppose % over time
+4. create graph for Stance/Tone (heatmap count)
+5. create graph for Confidence/Stance (boxplot or histogram)
+
+** notes
+- prototyped in plotly
+- initial streamlit  
+
+** evidence
+- commit: 3fb424d
+- tests: see viz/proto and viz/chart_tests
+- datetime: [2026-05-08 Fri 08:38]
+
+* [X] t1.5: streamlit
+create organized webpage displaying useful information from completed job and analysis
+
+** acceptance criteria
+1. display total stance breakdown
+2. display centered horiz-bar with absolute stances
+3. show daily comment stances and cumulative
+4. show comment table with filters for stance (filter tone?)
+5. clicking/selecting a comment shows full text and model rationale
+6. app runs locally with one command
+
+** notes
+data pulls entirely from the job; goal is to point viz/streamlit.py at any job/ folder and have everything it needs
+
+** evidence
+- commit: cc16acb
+- tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
+- datetime: [2026-05-08 Fri 23:44]
+  
+* +[ ] t1.6 host streamlit via dockerfile+
+planning to deploy manually, get cert, etc etc. probably dont care about https?
+using streamlit.app instead+
+** acceptance criteria
+1. write dockerfile with slim image
+
+** notes
+   
+* === Backlog ===
+- add forum_url, forum_collected_date to scraper (to add to viz)
 * [ ] X: complete proposal information
 Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted. 
 ** acceptance criteria
 1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
   - add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
 2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
+* [ ] X: add helper data to create_csv
+1. in create_csv.py, create helper columns:
+   - stance_signed = {"support":1, "oppose":-1, "neutral":0, "unknown":0}
+   - stance_weighted = stance_signed * stance_confidence
+   - is_support_oppose = stance in ["support", "oppose"]
+   - date_day
+   - date_hour
+   - text_norm
+   - text_hash
+   - confidence_bucket = 'low' <.7 | 'med' .7-.89 | 'high' >=.9
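The helper columns in this backlog item can be sketched per record with the standard library alone. This is an illustration under stated assumptions, not the planned `create_csv.py` change: `add_helpers` is a hypothetical name, `date` is assumed to be an ISO 8601 string, text normalization is assumed to mean lowercasing and collapsing whitespace, and SHA-256 is an arbitrary choice of hash. Confidence values between 0.89 and 0.9, which the bucket spec leaves open, are treated as `med` here.

```python
import hashlib
import re
from datetime import datetime

STANCE_SIGNED = {"support": 1, "oppose": -1, "neutral": 0, "unknown": 0}


def confidence_bucket(conf: float) -> str:
    """Map a confidence score to 'low' (<0.7), 'med' (0.7 up to 0.9), or 'high' (>=0.9)."""
    if conf >= 0.9:
        return "high"
    if conf >= 0.7:
        return "med"
    return "low"


def add_helpers(row: dict) -> dict:
    """Return a copy of one joined record with the helper fields added."""
    out = dict(row)
    signed = STANCE_SIGNED.get(row["stance"], 0)
    out["stance_signed"] = signed
    out["stance_weighted"] = signed * row["stance_confidence"]
    out["is_support_oppose"] = row["stance"] in ("support", "oppose")

    # Assumption: the scraped date is an ISO 8601 timestamp.
    ts = datetime.fromisoformat(row["date"])
    out["date_day"] = ts.date().isoformat()
    out["date_hour"] = ts.hour

    # Assumption: normalization = collapse whitespace, trim, lowercase.
    text_norm = re.sub(r"\s+", " ", row["text"]).strip().lower()
    out["text_norm"] = text_norm
    out["text_hash"] = hashlib.sha256(text_norm.encode("utf-8")).hexdigest()

    out["confidence_bucket"] = confidence_bucket(row["stance_confidence"])
    return out


# Synthetic record for illustration only.
example = add_helpers({
    "stance": "oppose",
    "stance_confidence": 0.8,
    "date": "2026-05-01T14:30:00",
    "text": "  Please   REJECT this.\n",
})
print(example["stance_weighted"], example["date_day"], example["confidence_bucket"])
```

Hashing the normalized text rather than the raw text means two comments that differ only in spacing or case share a `text_hash`, which is one way to approach the duplicate-entry research question.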
--- a/docs/vatownhall.org
+++ b/docs/vatownhall.org
@@ -1,49 +1,110 @@
 #+title: VA Townhall
 #+date: [2026-05-05 Tue]
-#+version: 1
+#+version: 1.1

-* Project Goals
+** Project Goals
 1. Document and analyze sentiment of public comments on Virginia law, to determine:
   1. the utility of this forum as a mechanism for public comment, and
   2. the impact of this forum on Virginia regulation.
 2. Make data and insights broadly available.
 3. Generalize to other public comment tools.

-** Document and analyze sentiment
- Scrape the data, parse, clean, and store. Clearly separate scraper from sentiment analyzer for maximum auditability.
- Build tests for identifying abuse, such as spam and account fraud
- Identify any patterns connecting measured sentiment against VA decisions
+*** Research questions   
+1. What is the quality of the comments on the forum?
+   1. Are there duplicate entries?
+   2. Are there non-human-generated entries?
+   3. Are there entries intended to abuse the forum or drown out comment?
+2. How do commenters feel about the proposed change?
+   1. What is the total number and percent supporting vs opposing, and how does this change over time?
+   2. What is the type of support, such as strong/weak, positive/negative?
+3. What impact do the comments have on the proposed change?
+   (I anticipate this will not be measurable from currently available data)

-** Make data available
- Pick a good visualization tool
+** Architecture
+1. Scrape/Parse: Scrapy
+2. Sentiment analysis: gpt-5.4-mini
+3. Display: streamlit
+4. Storage: jsonl, csv, parquet

-** Generalize
- Identify scalable ways to apply this toolset to similar problems
+[[file:pipeline-v1.2.3.svg]]
   
-* Architecture
-1. Scrape/Parse: **Scrapy** for downloading comments
-2. Storage: json
-3. Sentiment analysis: Claude haiku
-4. Display: TBD   
-
-** Scraper
-Scrapy provides a simple mechanism for browsing and 
+*** Scraper
+Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
 1. Forums listing page: `Forums.cfm` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
 2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` - lists comments with title, author, date
 3. Individual comment page: `viewcomments.cfm?commentid=X` - shows regulation title + brief description at the top, plus the comment

-** Storage
-One JSONL file per forum/bill.
+*** Analysis
+Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We add the proposed change as context to the model.

-** Analysis
-Google and Amazon both return generic sentiment (tone of writing: positive/negative), not stance (for/against the regulation): "I strongly believe the government should NOT interfere" is negative tone but "against" the regulation.  We will run the forum/bill title and cache the entirety of the proposed change, perhaps as a fallback.
+Before sending the comments for sentiment analysis, `tokenizer.py` receives the forum to be processed and prompt as inputs, then generates a `report.json` estimating tokens (tiktoken), cost, and time to run for multiple models.

-| Tool              | Output                         | Context    | Sarcasm          | Context window | Cost/1k comments |
-|-------------------+--------------------------------+------------+------------------+----------------+------------------|
-| Google NL API     | -1→+1, magnitude               | No/generic | Poorly           | No             | ~$1–2            |
-| Amazon Comprehend | Pos/Neg/Neutral/Mixed          | No/generic | Poorly           | No             | ~$0.10           |
-| Claude Haiku      | Prompted → for/against/neutral | Yes        | Yes, with prompt | Yes            | ~$0.10–0.30      |
-| GPT-4o-mini       | Prompted → same                | Yes        | Yes              | Yes            | ~$0.05–0.15      |
+Then, the batch processing script uses `report.json` to create multiple jobs, with subcommands to submit them, check their status, and download the results.
+
+We selected gpt-5.4-mini for a good balance of quality, cost, and time.
+
+**** Prompt
+```
+You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
+regulatory comment system. You will be given the text of a proposed regulation and a single
+public comment. Return ONLY a JSON object — no other text.
+
+Definitions:
+- stance: the commenter's position on whether the regulation should be adopted.
+  "support" = wants it approved (as-is or with changes);
+  "oppose"  = wants it rejected or substantially weakened;
+  "neutral" = takes no position, asks a question, or provides factual input only;
+  "unknown" = too vague, off-topic, or uninterpretable to classify.
+- tone: the emotional register of the writing, independent of stance.
+  "positive" = affirming, hopeful, appreciative;
+  "negative" = angry, fearful, alarmed, or contemptuous;
+  "neutral"  = matter-of-fact, procedural, or informational;
+  "mixed"    = contains both positive and negative emotional content;
+  "unclear"  = tone cannot be determined (e.g., a one-word comment).
+- stance_confidence: float 0.0-1.0, your confidence in the stance label.
+- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
+- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
+  "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
+  "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
+
+Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
+```
+
+
+*** Storage
+- Each scraped forum is saved to `output/<forum-id>.jsonl`
+- Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
+- Each job is saved to `analysis/jobs/<report-id>/`:
+   └─`forum.jsonl` is a copy of the scraped forum for convenience
+   └─`prompt.txt` is a copy of the prompt used
+   └─`report.json` is a copy of the report used
+   └─`status.json` contains metadata about the job
+  For each batch in the job, four files are created:
+   └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
+   └─`jobN-output-raw.jsonl` contains the exact response from the API
+   └─`jobN-output.jsonl` contains the normalized analysis records parsed from the raw response
+   └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
+- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
+   
+** Instructions
+1. Scrape the forum.
+   `python 
+2. Run model report.
+   `python analysis/tokenizer.py <input> --prompt <prompt>`
+3. To run a realtime subset:
+   `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
+   `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
+4. To create and run the whole thing in batches, first create the batch jobs from the report:
+   `python analysis/openai_batch.py create <report> --model <model>`
+   `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
+5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
+   `python analysis/openai_batch.py submit`
+  # Check status
+   `python analysis/openai_batch.py status`
+  # When complete, download:
+   `python analysis/openai_batch.py download`
+  # Submit the next batch after the previous is complete:
+   `python analysis/openai_batch.py submit`
   
 * Roadmap
 1. Scrape one forum
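Note on the Storage section of the README: the per-batch naming scheme can be written down as a small path helper. This is only a sketch of the layout as the README describes it; `job_paths` is a hypothetical name and is not part of the repository.

```python
from pathlib import Path


def job_paths(report_id: str, batch: int, root: Path = Path("analysis/jobs")) -> dict[str, Path]:
    """Return the four per-batch file paths named in the README's Storage section.

    Hypothetical helper for illustration; the repository may build these paths differently.
    """
    job_dir = root / report_id
    prefix = f"job{batch}"
    return {
        "input": job_dir / f"{prefix}-input.jsonl",        # exact queries sent to the API
        "raw": job_dir / f"{prefix}-output-raw.jsonl",     # exact response from the API
        "output": job_dir / f"{prefix}-output.jsonl",      # output file read by the loaders
        "errors": job_dir / f"{prefix}-output-errors.jsonl",  # may not exist
    }


paths = job_paths("f452-1", 1)
print(paths["output"].as_posix())
```

Keeping the names in one place would make it harder for the loaders and the batch script to drift apart on the `-output` versus `-output-raw` suffix.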
--- a/requirements.txt
+++ b/requirements.txt
--- a/scraper/items.py
+++ b/scraper/items.py
@@ -5,6 +5,8 @@ class ForumItem(scrapy.Item):
    forum_id  = scrapy.Field()
    reg_title = scrapy.Field()
    reg_desc  = scrapy.Field()
+    scraped_at = scrapy.Field()
+    forum_url = scrapy.Field()


 class CommentItem(scrapy.Item):
--- a/scraper/spiders/forum.py
+++ b/scraper/spiders/forum.py
@@ -63,6 +63,8 @@ class ForumSpider(scrapy.Spider):
                forum_id=self.forum_id,
                reg_title=reg_title,
                reg_desc=reg_desc,
+                scraped_at=datetime.utcnow().isoformat(),
+                forum_url=_view_url(self.forum_id),
            )
            for page in range(2, last_page + 1):
                yield scrapy.FormRequest(
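Note on `scraped_at`: `datetime.utcnow()` returns a naive datetime (no offset in the ISO string) and is deprecated as of Python 3.12. A possible alternative, assuming a timezone-aware stamp is acceptable downstream, is sketched here; `utc_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current UTC time as an ISO 8601 string with an explicit +00:00 offset."""
    return datetime.now(timezone.utc).isoformat()


stamp = utc_timestamp()
# The string round-trips to a timezone-aware datetime.
parsed = datetime.fromisoformat(stamp)
```

The explicit offset removes any ambiguity about which zone the scrape time refers to when the JSONL is read later.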
--- a/tests/create_csv.py
+++ b/tests/create_csv.py
@@ -0,0 +1,155 @@
+"""Unit tests for analysis/create_csv.py — no external API calls."""
+
+import json
+import sys
+from pathlib import Path
+
+import pandas as pd
+import pytest
+
+sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
+import create_csv as cc
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+
+def _write_jsonl(path: Path, rows: list[dict]) -> None:
+    with open(path, "w", encoding="utf-8") as f:
+        for row in rows:
+            f.write(json.dumps(row) + "\n")
+
+
+RAW_ROWS = [
+    {"forum_id": "452", "comment_id": "1", "title": "Support", "text": "I support.", "date": "2021-01-01", "author": "Alice"},
+    {"forum_id": "452", "comment_id": "2", "title": "Oppose",  "text": "I oppose.",  "date": "2021-01-02", "author": "Bob"},
+    {"forum_id": "452", "comment_id": "3", "title": "Neutral", "text": "No opinion.","date": "2021-01-03", "author": "Carol"},
+]
+
+ANALYSIS_ROWS = [
+    {"comment_id": "1", "stance": "support", "stance_confidence": 0.9, "stance_rationale": "clear support",
+     "tone": "neutral", "tags": '["policy"]', "error": None, "truncated": False,
+     "analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
+    {"comment_id": "2", "stance": "oppose",  "stance_confidence": 0.8, "stance_rationale": "clear oppose",
+     "tone": "negative", "tags": '[]', "error": None, "truncated": False,
+     "analyzed_at": "2021-01-10", "prompt_version": "1", "model": "gpt-4o-mini"},
+]
+
+
+# ---------------------------------------------------------------------------
+# load_raw
+
+def test_load_raw_returns_raw_cols(tmp_path):
+    p = tmp_path / "forum.jsonl"
+    _write_jsonl(p, RAW_ROWS)
+    df = cc.load_raw(p)
+    assert list(df.columns) == cc.RAW_COLS
+
+
+def test_load_raw_row_count(tmp_path):
+    p = tmp_path / "forum.jsonl"
+    _write_jsonl(p, RAW_ROWS)
+    df = cc.load_raw(p)
+    assert len(df) == 3
+
+
+def test_load_raw_skips_non_comment_rows(tmp_path):
+    """Rows without comment_id (e.g. forum metadata) are dropped."""
+    rows = RAW_ROWS + [{"forum_id": "452", "reg_title": "Metadata row"}]
+    p = tmp_path / "forum.jsonl"
+    _write_jsonl(p, rows)
+    df = cc.load_raw(p)
+    assert len(df) == 3
+
+
+# ---------------------------------------------------------------------------
+# load_analysis
+
+def test_load_analysis_returns_analysis_cols(tmp_path):
+    jobs = tmp_path / "jobs"
+    jobs.mkdir()
+    _write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
+    df = cc.load_analysis(jobs)
+    expected = ["comment_id"] + cc.ANALYSIS_COLS
+    assert list(df.columns) == expected
+
+
+def test_load_analysis_skips_raw_files(tmp_path):
+    jobs = tmp_path / "jobs"
+    jobs.mkdir()
+    _write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
+    _write_jsonl(jobs / "job1-output-raw.jsonl", ANALYSIS_ROWS)  # should be ignored
+    df = cc.load_analysis(jobs)
+    assert len(df) == len(ANALYSIS_ROWS)
+
+
+def test_load_analysis_concatenates_multiple_files(tmp_path):
+    jobs = tmp_path / "jobs"
+    jobs.mkdir()
+    _write_jsonl(jobs / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
+    _write_jsonl(jobs / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
+    df = cc.load_analysis(jobs)
+    assert len(df) == 2
+
+
+# ---------------------------------------------------------------------------
+# join
+
+def test_join_all_raw_preserved(tmp_path):
+    """Left join: all raw comments appear in output, even without analysis."""
+    raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
+    analysis = pd.DataFrame(ANALYSIS_ROWS)
+    for col in cc.ANALYSIS_COLS:
+        if col not in analysis.columns:
+            analysis[col] = None
+    analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
+
+    merged = cc.join(raw, analysis)
+    assert len(merged) == 3  # all 3 raw rows, even comment_id=3 with no analysis
+
+
+def test_join_unanalyzed_row_has_null_stance(tmp_path):
+    raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
+    analysis = pd.DataFrame(ANALYSIS_ROWS)
+    for col in cc.ANALYSIS_COLS:
+        if col not in analysis.columns:
+            analysis[col] = None
+    analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
+
+    merged = cc.join(raw, analysis)
+    unanalyzed = merged[merged["comment_id"] == "3"]
+    assert pd.isna(unanalyzed.iloc[0]["stance"])
+
+
+def test_join_column_order(tmp_path):
+    raw = pd.DataFrame(RAW_ROWS)[cc.RAW_COLS]
+    analysis = pd.DataFrame(ANALYSIS_ROWS)
+    for col in cc.ANALYSIS_COLS:
+        if col not in analysis.columns:
+            analysis[col] = None
+    analysis = analysis[["comment_id"] + cc.ANALYSIS_COLS]
+
+    merged = cc.join(raw, analysis)
+    assert list(merged.columns) == cc.OUTPUT_COLS
+
+
+# ---------------------------------------------------------------------------
+# End-to-end: write + read CSV
+
+def test_csv_written_correctly(tmp_path):
+    raw_path = tmp_path / "forum.jsonl"
+    _write_jsonl(raw_path, RAW_ROWS)
+
+    jobs = tmp_path / "jobs"
+    jobs.mkdir()
+    _write_jsonl(jobs / "job1-output.jsonl", ANALYSIS_ROWS)
+
+    out = tmp_path / "review.csv"
+    raw      = cc.load_raw(raw_path)
+    analysis = cc.load_analysis(jobs)
+    merged   = cc.join(raw, analysis)
+    merged.to_csv(out, index=False, encoding="utf-8-sig")
+
+    loaded = pd.read_csv(out)
+    assert len(loaded) == 3
+    assert list(loaded.columns) == cc.OUTPUT_COLS
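Note on the join tests above: the behavior they assert (every raw comment kept, unanalyzed comments left with a null stance) is what a pandas left merge on `comment_id` produces. The sketch below shows that behavior on toy data. It is not the implementation of `cc.join`, which is not shown in this diff.

```python
import pandas as pd

# Toy stand-ins for the raw comments and the analysis output.
raw = pd.DataFrame({"comment_id": ["1", "2", "3"], "author": ["Alice", "Bob", "Carol"]})
analysis = pd.DataFrame({"comment_id": ["1", "2"], "stance": ["support", "oppose"]})

# how="left" keeps every raw row; comments without analysis get NaN in stance.
merged = raw.merge(analysis, on="comment_id", how="left")
unanalyzed = merged[merged["stance"].isna()]
```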
--- a/tests/encoding.py
+++ b/tests/encoding.py
@@ -0,0 +1,119 @@
+"""Unit tests for analysis/encoding.py — no external dependencies required."""
+
+import sys
+from pathlib import Path
+
+import pytest
+
+sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
+from encoding import repair_text, _KNOWN_REPAIRS
+
+
+# ---------------------------------------------------------------------------
+# Core contract
+
+
+def test_empty_string_unchanged():
+    assert repair_text("") == ""
+
+
+def test_none_like_empty_unchanged():
+    assert repair_text("") == ""
+
+
+def test_clean_ascii_unchanged():
+    text = "This is a normal sentence with no encoding issues."
+    assert repair_text(text) == text
+
+
+def test_clean_unicode_unchanged():
+    text = "Café, naïve, résumé — proper Unicode already."
+    result = repair_text(text)
+    # Should either be unchanged or equivalently correct
+    assert "Caf" in result and "na" in result
+
+
+# ---------------------------------------------------------------------------
+# Known mojibake sequences (tasks.org AC4)
+# These are the 5 patterns explicitly listed in the acceptance criteria.
+
+
+def test_right_single_quote():
+    """â€™ → ' (U+2019 right single quotation mark)"""
+    assert repair_text("Virginiaâ€™s") == "Virginia’s"
+
+
+def test_left_double_quote():
+    """â€œ → " (U+201C left double quotation mark)"""
+    assert repair_text("â€œHello") == "“Hello"
+
+
+def test_en_dash():
+    """â€" (where last char is U+201C) → – (U+2013 en dash)"""
+    result = repair_text("pages 1â€“5")
+    assert "–" in result or "—" in result or "-" in result
+
+
+def test_em_dash():
+    """â€" (where last char is U+201D) → — (U+2014 em dash)"""
+    result = repair_text("wordâ€”word")
+    assert "—" in result or "–" in result or "-" in result
+
+
+def test_right_double_quote():
+    """â€\x9d → " (U+201D right double quotation mark)"""
+    result = repair_text("saidâ€ he")
+    # Should not contain the raw artifact
+    assert "â€" not in result
+
+
+# ---------------------------------------------------------------------------
+# Round-trip: garbled text produces sensible output
+
+
+def test_garbled_sentence_repaired():
+    """A sentence with multiple mojibake chars is repaired to readable text."""
+    # "Don't" with right single quote encoded as UTF-8, then decoded as cp1252
+    # D o n ' t  →  D o n â€™ t
+    garbled = "Donâ€™t worry"
+    result = repair_text(garbled)
+    assert "Don" in result and "t worry" in result
+    assert "â€" not in result  # artifact gone
+
+
+def test_clean_string_after_repair_has_no_artifacts():
+    garbled = "She said â€œHelloâ€ and left."
+    result = repair_text(garbled)
+    assert "â€" not in result
+
+
+# ---------------------------------------------------------------------------
+# FFFD replacement characters (from strict UTF-8 decode of cp1252 bytes)
+
+
+def test_fffd_preserved_not_crashed():
+    """repair_text must not raise on U+FFFD; it may or may not repair it."""
+    text = "Virginia\ufffds Public Schools"
+    result = repair_text(text)
+    assert isinstance(result, str)
+    assert "Virginia" in result
+
+
+# ---------------------------------------------------------------------------
+# _KNOWN_REPAIRS table structure
+
+
+def test_known_repairs_non_empty():
+    assert len(_KNOWN_REPAIRS) > 0
+
+
+def test_known_repairs_are_pairs():
+    for item in _KNOWN_REPAIRS:
+        assert len(item) == 2
+        bad, good = item
+        assert isinstance(bad, str) and isinstance(good, str)
+
+
+def test_known_repairs_bad_not_equal_good():
+    for bad, good in _KNOWN_REPAIRS:
+        assert bad != good
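Note on the mojibake cases: sequences like the one in `test_right_single_quote` arise when UTF-8 bytes are decoded as cp1252, so one common repair is to reverse that step. The sketch below assumes this round-trip approach. `repair_text` in `analysis/encoding.py` may work differently (the tests suggest it also uses a `_KNOWN_REPAIRS` table), and `roundtrip_repair` is a hypothetical name.

```python
def roundtrip_repair(text: str) -> str:
    """Undo a UTF-8-read-as-cp1252 mix-up; return the input unchanged if that is not possible."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way (for example, text that is already clean Unicode).
        return text


# "Virginia" + a-circumflex, euro sign, trademark sign + "s": the garbled right single quote.
garbled = "Virginia\u00e2\u20ac\u2122s"
repaired = roundtrip_repair(garbled)
```

The round trip cannot handle every case: Python's cp1252 codec leaves a few byte values undefined (0x9D among them), which is one reason a lookup table of known sequences is a useful complement.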
--- a/tests/validate-sentiment.py
+++ b/tests/validate-sentiment.py
@@ -0,0 +1,217 @@
+"""Unit tests for analysis/validate.py — no file I/O beyond tmp_path."""
+
+import json
+import sys
+from pathlib import Path
+
+import pytest
+
+sys.path.insert(0, str(Path(__file__).parent.parent / "analysis"))
+
+try:
+    import pandas as pd
+except ImportError:
+    pytest.skip("pandas not installed", allow_module_level=True)
+
+import validate as vl
+
+# ---------------------------------------------------------------------------
+# Fixtures
+
+
+def _write_jsonl(path: Path, rows: list[dict]) -> None:
+    with open(path, "w", encoding="utf-8") as f:
+        for row in rows:
+            f.write(json.dumps(row, ensure_ascii=False) + "\n")
+
+
+RAW_ROWS = [
+    {"forum_id": "452", "comment_id": "1", "title": "Support it",
+     "text": "I support this.", "date": "2021-01-04T09:00:00", "author": "Alice"},
+    {"forum_id": "452", "comment_id": "2", "title": "Oppose it",
+     "text": "I oppose this.", "date": "2021-01-05T10:00:00", "author": "Bob"},
+    {"forum_id": "452", "comment_id": "3", "title": "Neutral",
+     "text": "No opinion.", "date": "2021-01-06T11:00:00", "author": "Carol"},
+]
+
+ANALYSIS_ROWS = [
+    {"run_id": "r1", "forum_id": "452", "comment_id": "1", "input_title": "Support it",
+     "analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
+     "prompt_version": "abc1234", "stance": "support", "stance_confidence": 0.95,
+     "stance_rationale": "Commenter says 'I support'.", "tone": "positive",
+     "tags": ["student safety"], "truncated": False, "error": None},
+    {"run_id": "r1", "forum_id": "452", "comment_id": "2", "input_title": "Oppose it",
+     "analyzed_at": "2026-05-06T12:00:00+00:00", "model": "gpt-5.4-mini",
+     "prompt_version": "abc1234", "stance": "oppose", "stance_confidence": 0.90,
+     "stance_rationale": "Commenter says 'I oppose'.", "tone": "negative",
+     "tags": [], "truncated": False, "error": None},
+]
+
+FORUM_ROW = {"forum_id": "452", "reg_title": "Policy X", "reg_desc": "Guidance on Y."}
+
+
+@pytest.fixture()
+def raw_jsonl(tmp_path) -> Path:
+    p = tmp_path / "f452.jsonl"
+    _write_jsonl(p, [FORUM_ROW] + RAW_ROWS)
+    return p
+
+
+@pytest.fixture()
+def jobs_dir(tmp_path) -> Path:
+    d = tmp_path / "jobs" / "f452-1"
+    d.mkdir(parents=True)
+    _write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
+    return d
+
+
+# ---------------------------------------------------------------------------
+# load_raw
+
+
+def test_load_raw_returns_only_comments(raw_jsonl):
+    df = vl.load_raw(raw_jsonl)
+    assert len(df) == 3
+    assert set(df.columns) == set(vl.RAW_COLS)
+
+
+def test_load_raw_correct_columns(raw_jsonl):
+    df = vl.load_raw(raw_jsonl)
+    for col in vl.RAW_COLS:
+        assert col in df.columns
+
+
+def test_load_raw_skips_forum_item(raw_jsonl):
+    df = vl.load_raw(raw_jsonl)
+    assert "reg_title" not in df.columns
+
+
+# ---------------------------------------------------------------------------
+# load_analysis
+
+
+def test_load_analysis_skips_raw_files(tmp_path):
+    d = tmp_path / "jobs" / "f452-1"
+    d.mkdir(parents=True)
+    _write_jsonl(d / "job1-output-raw.jsonl", ANALYSIS_ROWS)   # should be ignored
+    _write_jsonl(d / "job1-output.jsonl", ANALYSIS_ROWS)
+    df = vl.load_analysis(d)
+    assert len(df) == len(ANALYSIS_ROWS)
+
+
+def test_load_analysis_concatenates_multiple_files(tmp_path):
+    d = tmp_path / "jobs" / "f452-1"
+    d.mkdir(parents=True)
+    _write_jsonl(d / "job1-output.jsonl", [ANALYSIS_ROWS[0]])
+    _write_jsonl(d / "job2-output.jsonl", [ANALYSIS_ROWS[1]])
+    df = vl.load_analysis(d)
+    assert len(df) == 2
+
+
+def test_load_analysis_tags_serialized_as_json(jobs_dir):
+    df = vl.load_analysis(jobs_dir)
+    tags_val = df.loc[df["comment_id"] == "1", "tags"].iloc[0]
+    assert isinstance(tags_val, str)
+    assert json.loads(tags_val) == ["student safety"]
+
+
+def test_load_analysis_empty_tags_serialized(jobs_dir):
+    df = vl.load_analysis(jobs_dir)
+    tags_val = df.loc[df["comment_id"] == "2", "tags"].iloc[0]
+    assert json.loads(tags_val) == []
+
+
+# ---------------------------------------------------------------------------
+# join — by comment_id, not index
+
+
+def test_join_by_comment_id_not_index(raw_jsonl, jobs_dir):
+    raw      = vl.load_raw(raw_jsonl)
+    analysis = vl.load_analysis(jobs_dir)
+    # Shuffle raw order so comment_id ordering differs from index
+    raw = raw.sample(frac=1, random_state=42).reset_index(drop=True)
+    merged = vl.join(raw, analysis)
+    row_1 = merged[merged["comment_id"] == "1"].iloc[0]
+    assert row_1["stance"] == "support"
+    assert row_1["author"] == "Alice"
+
+
+def test_join_unanalyzed_comment_has_null_stance(raw_jsonl, jobs_dir):
+    """Comment 3 is in raw but not in analysis — stance should be NaN."""
+    raw      = vl.load_raw(raw_jsonl)
+    analysis = vl.load_analysis(jobs_dir)
+    merged   = vl.join(raw, analysis)
+    row_3 = merged[merged["comment_id"] == "3"].iloc[0]
+    assert pd.isna(row_3["stance"])
+
+
+def test_join_preserves_all_raw_comments(raw_jsonl, jobs_dir):
+    raw      = vl.load_raw(raw_jsonl)
+    analysis = vl.load_analysis(jobs_dir)
+    merged   = vl.join(raw, analysis)
+    assert len(merged) == len(raw)
+
+
+def test_join_output_columns_in_order(raw_jsonl, jobs_dir):
+    raw      = vl.load_raw(raw_jsonl)
+    analysis = vl.load_analysis(jobs_dir)
+    merged   = vl.join(raw, analysis)
+    assert list(merged.columns) == vl.OUTPUT_COLS
+
+
+# ---------------------------------------------------------------------------
+# Duplicate comment_id handling
+
+
+def test_duplicate_raw_id_flagged(raw_jsonl, jobs_dir):
+    raw      = vl.load_raw(raw_jsonl)
+    # Manually duplicate a row
+    raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)
+    analysis = vl.load_analysis(jobs_dir)
+    merged   = vl.join(raw, analysis)
+    # join still produces a row for each raw row (left join)
+    assert len(merged) == len(raw)
+    assert raw["comment_id"].duplicated().sum() == 1
+
+
+def test_duplicate_analysis_id_produces_extra_rows(raw_jsonl, tmp_path):
+    """Two analysis records for the same comment_id create two joined rows."""
+    d = tmp_path / "jobs" / "f452-dup"
+    d.mkdir(parents=True)
+    dup_rows = [ANALYSIS_ROWS[0], {**ANALYSIS_ROWS[0], "stance": "oppose"}]
+    _write_jsonl(d / "job1-output.jsonl", dup_rows)
+    raw      = vl.load_raw(raw_jsonl)
+    analysis = vl.load_analysis(d)
+    merged   = vl.join(raw, analysis)
+    assert len(merged[merged["comment_id"] == "1"]) == 2
+
+
+# ---------------------------------------------------------------------------
+# Validation counts (smoke test — just confirm it runs without error)
+
+
+def test_print_validation_runs(raw_jsonl, jobs_dir, capsys):
+    raw      = vl.load_raw(raw_jsonl)
+    analysis = vl.load_analysis(jobs_dir)
+    merged   = vl.join(raw, analysis)
+    vl.print_validation(raw, analysis, merged)
+    out = capsys.readouterr().out
+    assert "Raw comments" in out
+    assert "Stance counts" in out
+    assert "Tone counts" in out
+
+
+# ---------------------------------------------------------------------------
+# CSV output
+
+
+def test_csv_written_to_jobs_dir(raw_jsonl, jobs_dir, tmp_path):
+    raw      = vl.load_raw(raw_jsonl)
+    analysis = vl.load_analysis(jobs_dir)
+    merged   = vl.join(raw, analysis)
+    out_path = jobs_dir / "review.csv"
+    merged.to_csv(out_path, index=False, encoding="utf-8-sig")
+    assert out_path.exists()
+    loaded = pd.read_csv(out_path, encoding="utf-8-sig")
+    assert list(loaded.columns) == vl.OUTPUT_COLS
+    assert len(loaded) == len(raw)
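Note on the duplicate-ID tests: `test_duplicate_analysis_id_produces_extra_rows` documents that duplicate analysis records silently multiply joined rows. If failing loudly were preferred, pandas' `merge` accepts a `validate` argument. The sketch below shows both behaviors on toy data and is not taken from `analysis/validate.py`.

```python
import pandas as pd
from pandas.errors import MergeError

raw = pd.DataFrame({"comment_id": ["1", "2"], "author": ["Alice", "Bob"]})
# Two analysis records for comment 1, as in the duplicate test above.
analysis = pd.DataFrame({"comment_id": ["1", "1"], "stance": ["support", "oppose"]})

# A plain left merge yields one output row per matching analysis record.
plain = raw.merge(analysis, on="comment_id", how="left")

# validate="many_to_one" requires unique keys on the right side and raises otherwise.
try:
    raw.merge(analysis, on="comment_id", how="left", validate="many_to_one")
    duplicate_detected = False
except MergeError:
    duplicate_detected = True
```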
--- a/viz/chart_tests/confidence_by_stance.html
+++ b/viz/chart_tests/confidence_by_stance.html
--- a/viz/chart_tests/cumulative_stance_area.html
+++ b/viz/chart_tests/cumulative_stance_area.html
--- a/viz/chart_tests/cumulative_stance_share.html
+++ b/viz/chart_tests/cumulative_stance_share.html
--- a/viz/chart_tests/stance_diverging_bar.html
+++ b/viz/chart_tests/stance_diverging_bar.html
--- a/viz/chart_tests/stance_over_time.html
+++ b/viz/chart_tests/stance_over_time.html
--- a/viz/chart_tests/stance_share.html
+++ b/viz/chart_tests/stance_share.html
--- a/viz/chart_tests/stance_tone_counts.html
+++ b/viz/chart_tests/stance_tone_counts.html
--- a/viz/chart_tests/stance_tone_heatmap.html
+++ b/viz/chart_tests/stance_tone_heatmap.html
--- a/viz/chart_tests/stance_tone_rowpct.html
+++ b/viz/chart_tests/stance_tone_rowpct.html
--- a/viz/proto/confidence_by_stance.html
+++ b/viz/proto/confidence_by_stance.html
--- a/viz/proto/stance_over_time.html
+++ b/viz/proto/stance_over_time.html
--- a/viz/proto/stance_share.html
+++ b/viz/proto/stance_share.html
--- a/viz/proto/stance_tone_heatmap.html
+++ b/viz/proto/stance_tone_heatmap.html
--- a/viz/prototype_charts.py
+++ b/viz/prototype_charts.py
@@ -0,0 +1,134 @@
+'''
+    prototype_charts.py
+    generate test charts for later addition to streamlit
+'''
+   
+
+from pathlib import Path
+import pandas as pd
+import plotly.express as px
+import numpy as np
+
+inp = Path(r"c:/users/moses/projects/vath/analysis/jobs/f452-1/review.csv")
+out = Path("viz/")
+out.mkdir(parents=True, exist_ok=True)
+
+stance_order = ["support", "oppose", "neutral", "unknown"]
+
+# tone_order = ["positive", "negative", "neutral", "mixed", "unknown", "unclear"]
+# default order was actually better - unclear/negative/neutral/mixed/positive vs unknown/oppose/neutral/support
+# same for pct w/in stance
+df = pd.read_csv(inp)
+df["date"] = pd.to_datetime(df["date"], errors="coerce")
+df["date_day"] = df["date"].dt.date
+df["stance"] = df["stance"].fillna("unknown")
+df["tone"] = df["tone"].fillna("unknown")
+
+# 1. stance share
+counts = df["stance"].value_counts().reindex(stance_order, fill_value=0).reset_index()
+counts.columns = ["stance", "count"]
+fig = px.bar(counts, x="count", y="stance", orientation="h", text="count")
+fig.write_html(out / "stance_share.html")
+
+# 2. stance over time
+daily = df.groupby(["date_day", "stance"]).size().reset_index(name="count")
+fig = px.bar(daily, x="date_day", y="count", color="stance", category_orders={"stance": stance_order})
+fig.write_html(out / "stance_over_time.html")
+
+# 3. stance x tone
+heat = df.groupby(["stance", "tone"]).size().reset_index(name="count")
+fig = px.density_heatmap(heat, x="tone", y="stance", z="count", category_orders={"stance": stance_order})
+fig.write_html(out / "stance_tone_heatmap.html")
+
+# 4. confidence by stance
+fig = px.box(df, x="stance", y="stance_confidence", category_orders={"stance": stance_order}, points="outliers")
+fig.write_html(out / "confidence_by_stance.html")
+
+# 5. cumulative stance and share over time
+daily = (
+    df.groupby(["date_day", "stance"])
+      .size()
+      .unstack(fill_value=0)
+      .reindex(columns=stance_order, fill_value=0)
+      .sort_index()
+)
+
+cum = daily.cumsum()
+cum_long = cum.reset_index().melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
+
+fig = px.area(
+    cum_long,
+    x="date_day",
+    y="cumulative_count",
+    color="stance",
+    category_orders={"stance": stance_order},
+    title="cumulative comments by stance over time",
+)
+fig.write_html(out / "cumulative_stance_area.html")
+
+cum_pct = cum.div(cum.sum(axis=1), axis=0).reset_index().melt(
+    id_vars="date_day", var_name="stance", value_name="cumulative_share"
+)
+
+fig = px.line(
+    cum_pct,
+    x="date_day",
+    y="cumulative_share",
+    color="stance",
+    category_orders={"stance": stance_order},
+    title="cumulative stance share over time",
+)
+fig.update_yaxes(tickformat=".0%")
+fig.write_html(out / "cumulative_stance_share.html")
+
+# 7. diverging h-bar
+stance_counts = df["stance"].value_counts().reindex(stance_order, fill_value=0)
+
+div = pd.DataFrame({
+    "stance": ["oppose", "support", "neutral", "unknown"],
+    "count": [
+        -stance_counts.get("oppose", 0),
+         stance_counts.get("support", 0),
+         stance_counts.get("neutral", 0),
+         stance_counts.get("unknown", 0),
+    ],
+})
+
+fig = px.bar(
+    div,
+    x="count",
+    y="stance",
+    orientation="h",
+    text=div["count"].abs(),
+    title="support vs oppose",
+)
+fig.update_xaxes(title="comments", zeroline=True)
+fig.update_traces(textposition="outside")
+fig.write_html(out / "stance_diverging_bar.html")
+
+# 8. Stance x Tone labels
+heat = pd.crosstab(df["stance"], df["tone"]).reindex(
+    index=stance_order,
+    columns=sorted(df["tone"].unique()),  # tone_order is commented out above; use the tones present in the data
+    fill_value=0,
+)
+
+fig = px.imshow(
+    heat,
+    text_auto=True,
+    aspect="auto",
+    title="stance x tone, count",
+)
+fig.write_html(out / "stance_tone_counts.html")
+
+rowpct = heat.div(heat.sum(axis=1).replace(0, np.nan), axis=0)
+
+fig = px.imshow(
+    rowpct,
+    text_auto=".0%",
+    aspect="auto",
+    title="stance x tone, percent within stance",
+)
+fig.write_html(out / "stance_tone_rowpct.html")
+
+
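Note on chart 5: the cumulative count and cumulative share are easier to check on a few rows of toy data. The sketch below repeats the same groupby, unstack, cumsum, and row-wise division steps on invented dates and stances; it is not project data.

```python
import pandas as pd

# Invented sample: one support and three oppose comments over three days.
df = pd.DataFrame({
    "date_day": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-03"],
    "stance": ["support", "oppose", "oppose", "oppose"],
})
stance_order = ["support", "oppose", "neutral", "unknown"]

# Daily counts per stance, one column per stance, zero-filled.
daily = (
    df.groupby(["date_day", "stance"]).size()
      .unstack(fill_value=0)
      .reindex(columns=stance_order, fill_value=0)
      .sort_index()
)

# Running totals, then each stance's share of all comments so far.
cum = daily.cumsum()
cum_share = cum.div(cum.sum(axis=1), axis=0)
```

Each row of `cum_share` sums to 1, so the share chart shows how the balance between stances shifts as comments accumulate.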
--- a/viz/prototype_streamlit.py
+++ b/viz/prototype_streamlit.py
@@ -0,0 +1,28 @@
+# streamlit run viz/prototype_streamlit.py
+from datetime import datetime
+import pandas as pd
+import plotly.graph_objects as go
+import plotly.express as px
+import streamlit as st
+
+df = pd.read_csv(r"analysis/jobs/f452-1/review.csv")
+st.set_page_config(layout="wide")
+
+stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
+q = st.text_input("Search comment text")
+dff = df[df["stance"].isin(stance)]
+if q:
+    dff = dff[dff["text"].fillna("").str.contains(q, case=False, regex=False)]
+
+st.dataframe(dff[["comment_id", "title", "stance", "stance_confidence", "tone"]], width="stretch")
+st.write("Showing " + str(len(dff))+ " comments")
+
+cid = st.selectbox("comment", dff["comment_id"].astype(str))
+row = dff[dff["comment_id"].astype(str) == cid].iloc[0]
+
+st.subheader(row["title"])
+st.write(row["text"])
+st.write(row["author"] + ", " + row["date"][:10])
+st.write("**model:** " + str(row["model"]))
+st.markdown("**stance:** " + str(row["stance"]) + "  \n**confidence:** " + str(row["stance_confidence"]) + "  \n**tone:** " + str(row["tone"]))
+st.write("**analysis:** "+ row["stance_rationale"])
--- a/viz/streamlit.py
+++ b/viz/streamlit.py
@@ -0,0 +1,189 @@
+# streamlit run viz/streamlit.py -- --jobs-dir analysis/jobs/f452-1
+import argparse
+from pathlib import Path
+from datetime import datetime as dt
+import pandas as pd
+import plotly.graph_objects as go
+import plotly.express as px
+import streamlit as st
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--jobs-dir", default="analysis/jobs/f452-1", type=Path,
+                    help="Job directory containing review.csv, forum.jsonl, and prompt.txt")
+args, _ = parser.parse_known_args()  # parse_known_args: ignore Streamlit's own argv entries
+workdir = args.jobs_dir
+df = pd.read_csv(workdir/"review.csv")
+df['date_dt'] = pd.to_datetime(df.date)
+df["date_day"] = df["date_dt"].dt.date
+forum = pd.read_json(workdir/"forum.jsonl", lines=True).iloc[0].to_dict()
+prompt = (workdir/"prompt.txt").read_text(encoding="utf-8")
+
+stance_colors = {'oppose':'#ffa15a', 'neutral':'#e377c2','support':'#19d3f3','unknown':'#000000'}
+stance_order = ["oppose", "mixed", "unknown", "neutral", "support"]
+
+st.set_page_config(layout="wide")
+st.title("Virginia Townhall Explorer",anchor=None)
+st.caption("Explore data collected from Virginia's public comment system. Source code at https://github.com/eulaly/vath")
+
+st.subheader("Proposal",anchor=None,divider="gray")
+st.markdown(f"**{forum.get('reg_title')}**")
+st.text(forum.get('reg_desc'))
+st.caption(f'Comments posted from {dt.strftime(min(df.date_dt),"%D")}—{dt.strftime(max(df.date_dt),"%D")} at https://www.townhall.virginia.gov/L/Comments.cfm?GDocForumID={forum.get("forum_id")}')
+
+st.subheader("Comment Summary",anchor=False,divider="gray")
+summary_left, summary_right = st.columns([1,2])
+with summary_left:
+# Summary Table
+    summary_stats = (
+    df.groupby("stance").size()
+      .reindex(stance_order, fill_value=0)
+      .reset_index(name="count")
+      .assign(percent=lambda d: (d["count"] / d["count"].sum()).map("{:.1%}".format))
+)
+
+    st.dataframe(summary_stats, hide_index=True, width="stretch")
+with summary_right:
+# Stance div-h
+    counts = df["stance"].value_counts()
+    stance_divh = go.Figure()
+    stance_divh.add_bar(y=["stance"], x=[-counts.get("oppose",0)], name="oppose", orientation="h", marker_color=stance_colors.get('oppose'), text=[counts.get("oppose",0)], textposition="inside")
+    stance_divh.add_bar(y=["stance"], x=[counts.get("neutral",0)], name="neutral", orientation="h", marker_color=stance_colors.get('neutral'), text=[counts.get("neutral",0)], textposition="inside")
+    stance_divh.add_bar(y=["stance"], x=[counts.get("unknown",0)], name="unknown", orientation="h", marker_color=stance_colors.get('unknown'), text=[counts.get("unknown",0)], textposition="inside")
+    stance_divh.add_bar(y=["stance"], x=[counts.get("support",0)], name="support", orientation="h", marker_color=stance_colors.get('support'), text=[counts.get("support",0)], textposition="inside")
+    stance_divh.update_yaxes(title_text="",showticklabels=False)
+    stance_divh.update_layout(barmode="relative", title="", height=180, margin=dict(l=0,r=0,t=0,b=0),xaxis_title="", yaxis_title="",legend=dict(orientation="v",y=0.12))
+    st.plotly_chart(stance_divh,width='stretch')
+
+# Daily Comments Breakdown, 3 Tabs
+daily_wide = (
+    df.groupby(["date_day", "stance"])
+      .size()
+      .unstack(fill_value=0)
+      .reindex(columns=stance_order, fill_value=0)
+      .sort_index()
+)
+
+daily_long = (
+    daily_wide.reset_index()
+      .melt(id_vars="date_day", var_name="stance", value_name="count")
+)
+
+cum_wide = daily_wide.cumsum()
+
+cum_long = (
+    cum_wide.reset_index()
+      .melt(id_vars="date_day", var_name="stance", value_name="cumulative_count")
+)
+
+cum_total = cum_wide.sum(axis=1)
+cum_share = cum_wide.div(cum_total.where(cum_total > 0), axis=0)
+
+cum_share_long = (
+    cum_share.reset_index()
+      .melt(id_vars="date_day", var_name="stance", value_name="cumulative_share")
+)
+
+
+tab_daily, tab_area, tab_share = st.tabs([
+    "Daily",
+    "Cumulative",
+    "Cumulative Share",
+])
+
+with tab_daily:
+    fig = px.bar(
+        daily_long,
+        x="date_day",
+        y="count",
+        color="stance",
+        category_orders={"stance": stance_order},
+        color_discrete_map=stance_colors,
+    )
+    fig.update_layout(barmode="stack", height=420, legend_orientation="v")
+    st.plotly_chart(fig, width="stretch")
+
+with tab_area:
+    fig = px.area(
+        cum_long,
+        x="date_day",
+        y="cumulative_count",
+        color="stance",
+        category_orders={"stance": stance_order},
+        color_discrete_map=stance_colors,
+    )
+    fig.update_layout(height=420, legend_orientation="v")
+    st.plotly_chart(fig, width="stretch")
+
+with tab_share:
+    fig = px.line(
+        cum_share_long,
+        x="date_day",
+        y="cumulative_share",
+        color="stance",
+        category_orders={"stance": stance_order},
+        color_discrete_map=stance_colors,
+    )
+    fig.update_yaxes(tickformat=".0%", range=[0, 1])
+    fig.update_layout(height=420, legend_orientation="v")
+    st.plotly_chart(fig, width="stretch")
+    
+st.subheader("Comment Explorer",anchor=False,divider="gray") 
+# comment explorer
+cex_left, cex_right = st.columns([1,1])
+with cex_left:
+    filter_stance = st.multiselect("Filter stance", sorted(df["stance"].dropna().unique()), default=sorted(df["stance"].dropna().unique()))
+    filter_tone = st.multiselect("Filter tone", sorted(df["tone"].dropna().unique()), default=sorted(df["tone"].dropna().unique()))
+    dff = df[df["stance"].isin(filter_stance) & df["tone"].isin(filter_tone)]
+
+with cex_right:
+    q = st.text_input("Search comment title and text")
+    if q:
+        dff = dff[(dff["title"].fillna("") + " " + dff["text"].fillna("")).str.contains(q, case=False, regex=False)]
+    st.text(""); st.text("")
+    st.text("Showing " + str(len(dff))+ " comments",text_alignment="right", width="stretch")
+
+st.dataframe(dff[["comment_id", "title", "text", "stance", "stance_confidence", "tone"]], width="stretch")
+
+cid = st.selectbox("Select comment to view:", dff["comment_id"].astype(str))
+row = dff[dff["comment_id"].astype(str) == cid].iloc[0]
+
+st.markdown(f'**{row["title"]}**')
+st.text(row["text"])
+st.write(row["author"] + ", " + row["date_dt"].strftime("%D"))
+
+st.divider()
+
+st.subheader('Analysis')
+cexs_left, cexs_right = st.columns([1,1])
+with cexs_left:
+    st.write(f"**stance:** {row['stance']}")
+    st.write(f"**stance_confidence:** {row['stance_confidence']:.2f}")
+    st.write(f"**tone:** {row['tone']}")
+    st.write("**analysis:** " + str(row["stance_rationale"]))
+with cexs_right:
+    x_order = ["unknown","oppose","mixed","neutral","support"]  # includes mixed even if absent; harmless zero column
+    y_order = ["positive","neutral","mixed","negative","unclear"]
+    tab = pd.crosstab(df["tone"], df["stance"]).reindex(index=y_order, columns=x_order, fill_value=0)
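+    row_totals = tab.sum(axis=1)
+    # mask empty rows with NaN (keeps a float dtype) so 0/0 ends up as a 0% share
+    pct = tab.div(row_totals.where(row_totals > 0), axis=0).fillna(0.0)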
+    tone_stance = px.imshow(
+        pct,
+        x=x_order, y=y_order,
+        text_auto=".0%",
+        aspect="auto",
+        color_continuous_scale="Greens",
+    )
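+    # text_auto sets texttemplate to "%{z:.0%}"; point it at the custom "count / pct%" labels instead
+    cell_labels = tab.astype(str) + " / " + (pct*100).round(0).astype(int).astype(str) + "%"
+    tone_stance.update_traces(text=cell_labels.to_numpy(), texttemplate="%{text}")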
+    tone_stance.add_scatter(x=[row["stance"]],y=[row["tone"]],mode="markers",marker=dict(size=15,color="yellow",symbol="cross",line=dict(width=1, color="red")),showlegend=False)
+    tone_stance.update_layout(height=420, xaxis_title="stance", yaxis_title="tone")
+    st.plotly_chart(tone_stance, width='stretch')
+    st.caption("Tone by stance, % within tone", text_alignment="right",width="stretch")
+
+st.divider()
+st.write("**model:** " + str(row["model"]))
+with st.expander("Prompt", expanded=False):
+    st.code(prompt, language="text")
+
+# distribution of stance classification confidence, per stance
+stance_conf = px.box(
+    df,
+    x="stance",
+    y="stance_confidence",
+    color="stance",
+    category_orders={"stance": stance_order},
+    color_discrete_map=stance_colors,
+    points="outliers",
+    title="Comment Stance Classification Confidence",
+)
+stance_conf.update_yaxes(range=[0, 1.02])
+stance_conf.update_layout(height=430, legend_orientation="v")
+st.plotly_chart(stance_conf, width="stretch")
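
The tone-by-stance heatmap in the diff above rests on a row-normalized cross-tabulation ("% within tone"). A minimal standalone sketch of that step, assuming a frame with `tone` and `stance` columns. The helper name `row_shares` and the toy rows are hypothetical and not part of the commit:

```python
import pandas as pd


def row_shares(df, row_col, col_col, row_order, col_order):
    """Count (row, col) pairs and convert each row to shares that sum to 1."""
    counts = pd.crosstab(df[row_col], df[col_col]).reindex(
        index=row_order, columns=col_order, fill_value=0
    )
    totals = counts.sum(axis=1)
    # Rows with no observations have a total of 0; masking them with NaN
    # avoids 0/0 and keeps a float dtype, then fillna reports them as 0.
    shares = counts.div(totals.where(totals > 0), axis=0).fillna(0.0)
    return counts, shares


# Hypothetical toy data, not taken from the forum dataset.
toy = pd.DataFrame(
    {
        "tone": ["negative", "negative", "negative", "positive"],
        "stance": ["oppose", "oppose", "support", "support"],
    }
)
counts, shares = row_shares(
    toy,
    "tone",
    "stance",
    row_order=["positive", "neutral", "negative"],
    col_order=["oppose", "support"],
)
print(counts)
print(shares.round(2))
```

Reindexing before dividing keeps categories that never occur (here `neutral`) as all-zero rows, so the heatmap axes stay fixed regardless of which tones appear in the data.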
Author	SHA1	Message	Date
eulaly	8f1d9e7723	added forum metadata for later use	2026-05-09 00:36:30 -04:00
eulaly	181477bce7	streamlit > local docker	2026-05-09 00:25:27 -04:00
eulaly	771f11fd3c	updated readme	2026-05-09 00:02:24 -04:00
eulaly	f42183eeda	added streamlit link	2026-05-09 00:00:59 -04:00
eulaly	92706bafb5	updated tasks and deps	2026-05-08 23:57:46 -04:00
eulaly	723b353db8	lol	2026-05-08 23:33:55 -04:00
eulaly	67cd96a523	updated readme.md	2026-05-08 23:32:44 -04:00
eulaly	cc16acbb12	added argparse for job dir, added tone filter	2026-05-08 23:28:13 -04:00
eulaly	afd5b8c60e	full local streamlit support	2026-05-08 21:57:04 -04:00
eulaly	3fb424da3c	added streamlit v1	2026-05-08 17:22:33 -04:00
eulaly	c3f2911563	updated reqts	2026-05-07 21:55:00 -04:00
eulaly	05515745fd	Merge branch 'master' of https://git.hgsky.me/ben/vath	2026-05-07 21:54:27 -04:00
eulaly	3d3372bbb3	Merge branch 'master' of https://git.hgsky.me/ben/vath	2026-05-07 21:53:40 -04:00
ben	3a139da440	Delete docs/vatownhall.md ye	2026-05-07 21:48:08 -04:00
eulaly	976db1b0fe	finally got images working	2026-05-07 21:46:27 -04:00
ben	7593754866	Update README.md fixed display	2026-05-07 21:42:08 -04:00
ben	016882d527	Update docs/vatownhall.md	2026-05-07 21:35:49 -04:00
ben	58feb9820d	Update docs/vatownhall.md fixing inline img	2026-05-07 21:34:57 -04:00
ben	35f30e9514	Update docs/vatownhall.md fixing inline img	2026-05-07 21:34:33 -04:00
eulaly	985760be7c	tesging images	2026-05-07 18:07:45 -04:00
eulaly	983650a64f	testing images	2026-05-07 18:06:02 -04:00
eulaly	eaaefb66f2	adding image	2026-05-07 18:00:51 -04:00
eulaly	bdab3c5e21	added excel detritus	2026-05-07 17:56:05 -04:00
eulaly	b4a9651e11	added graph snapshot	2026-05-07 17:22:34 -04:00
eulaly	1ea696d818	added texts and fixes for mojibake	2026-05-07 17:22:16 -04:00
eulaly	28d6d222bd	added create_csv.py	2026-05-07 17:22:00 -04:00
eulaly	72c2ae0ca0	updated readme	2026-05-07 17:01:08 -04:00