Merge branch 'master' of https://git.hgsky.me/ben/vath

2026-05-07 21:53:40 -04:00
parent 976db1b0fe 7593754866
commit 3d3372bbb3
1 changed files with 35 additions and 41 deletions
--- a/README.md
+++ b/README.md
@@ -1,4 +1,3 @@
-
 # Table of Contents

 1.  [Project Goals](#org2da6874)
@@ -56,9 +55,9 @@

 Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.

-1.  Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
-2.  Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
-3.  Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
+1.  Forums listing page: `Forums.cfm` lists all open forums with agency, reg title, action type, brief description, closing date, comment count
+2.  Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` lists comments with title, author, date
+3.  Individual comment page: `viewcomments.cfm?commentid=X` shows regulation title + brief description at the top, plus the comment


 <a id="org72990f4"></a>
@@ -74,14 +73,12 @@ Then, the batch processing scripts uses the \`report.json\` to create multiple j
 We selected gpt-5.4-mini for a good balance of quality, cost, and time.

 1.  Prompt
-
-    \`\`\`
+    ```
    You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
    regulatory comment system. You will be given the text of a proposed regulation and a single
    public comment. Return ONLY a JSON object — no other text.
    
    Definitions:
-    
    -   stance: the commenter's position on whether the regulation should be adopted.
        "support" = wants it approved (as-is or with changes);
        "oppose"  = wants it rejected or substantially weakened;
@@ -93,57 +90,54 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
        "neutral"  = matter-of-fact, procedural, or informational;
        "mixed"    = contains both positive and negative emotional content;
        "unclear"  = tone cannot be determined (e.g., a one-word comment).
-    -   stance<sub>confidence</sub>: float 0.0-1.0, your confidence in the stance label.
-    -   stance<sub>rationale</sub>: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
+    -   stance_confidence: float 0.0-1.0, your confidence in the stance label.
+    -   stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
    -   tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
        "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
        "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
    
-    Return exactly these keys: stance, stance<sub>confidence</sub>, stance<sub>rationale</sub>, tone, tags.
-    \`\`\`
+    Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
+    ```


 <a id="org58a5b72"></a>

 ### Storage

-   Each scraped forum is saved to \`output/<forum-id>.jsonl\`
-   Each report (forum + prompt) is saves to \`reports/<forum-id-N>.json\`
-   Each job is saved to \`analysis/jobs/<report-id>/:
-     └─\`forum.jsonl\` is a copy of the scraped forum for convenience
-     └─\`prompt.txt\` is a copy of the prompt used
-     └─\`report.json\` is a copy of the report used
-     └─\`status.json\` contains metadata about the job
+-   Each scraped forum is saved to `output/<forum-id>.jsonl`
+-   Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
+-   Each job is saved to `analysis/jobs/<report-id>`:
+     └─`forum.jsonl` is a copy of the scraped forum for convenience
+     └─`prompt.txt` is a copy of the prompt used
+     └─`report.json` is a copy of the report used
+     └─`status.json` contains metadata about the job
    For each batch in the job, four files are created:
-     └─\`jobN-input.jsonl\` contains the exact queries sent to the API, for troubleshooting
-     └─\`jobN-output-raw.jsonl\` contains the exact response from the API
-     └─\`jobN-output.jsonl\` contains the exact response from the API
-     └─\`jobN-output-errors.jsonl\` when errors are returned (this file may not exist)
-   Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder.
+     └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
+     └─`jobN-output-raw.jsonl` contains the exact response from the API
+     └─`jobN-output.jsonl` contains the exact response from the API
+     └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
+-   Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.


 <a id="org24fe465"></a>

 ## Instructions

-1.  Scrape the forum.
-    \`python
-2.  Run model report.
-    \`python analysis/tokenizer.py <input> &ndash;prompt <prompt>\`
-3.  To run a realtime subset:
-    \`python analysis/openai<sub>realtime.py</sub> <input> &ndash;prompt <prompt> &ndash;model <model> &ndash;limit <N comments>\`
-    \`python analysis/openai<sub>realtime.py</sub> output/f452.jsonl &ndash;prompt prompt-1.txt &ndash;model gpt-4o-mini &ndash;limit 10\`
-4.  To create and run the whole thing in batches, first create the batch jobs from the report:
-    \`python analysis/openai<sub>batch.py</sub> create <report> &ndash;model <model>\`
-    \`python analysis/openai<sub>batch.py</sub> create ./reports/f452-1.json &ndash;model gpt-5.4-mini\`
-5.  Then, run the jobs sequentially. Don't submit more than one at a time, if the model fills up the batch will fail and resubmission is not implemented.
-    \`python analysis/openai<sub>batch.py</sub> submit\`
-    
-    \`python analysis/openai<sub>batch.py</sub> status\`
-    
-    \`python analysis/openai<sub>batch.py</sub> download\`
-    
-    \`python analysis/openai<sub>batch.py</sub> submit\`
+1.  Scrape the forum.  
+    `python`  
+2.  Run model report.  
+    `python analysis/tokenizer.py <input> --prompt <prompt>`  
+3.  To run a realtime subset:  
+    `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`  
+    `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`  
+4.  To create and run the whole thing in batches, first create the batch jobs from the report:  
+    `python analysis/openai_batch.py create <report> --model <model>`  
+    `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`  
+5.  Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.  
+    `python analysis/openai_batch.py submit`  
+    `python analysis/openai_batch.py status`  
+    `python analysis/openai_batch.py download`  
+    `python analysis/openai_batch.py submit`  


 <a id="org5739d49"></a>