This commit is contained in:
2026-05-07 21:53:40 -04:00

@@ -1,4 +1,3 @@
# Table of Contents # Table of Contents
1. [Project Goals](#org2da6874) 1. [Project Goals](#org2da6874)
@@ -56,9 +55,9 @@
Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums. Scrapy provides a simple mechanism for retrieving, parsing, and saving content from the forums.
1. Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count 1. Forums listing page: `Forums.cfm` lists all open forums with agency, reg title, action type, brief description, closing date, comment count
2. Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date 2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` lists comments with title, author, date
3. Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment 3. Individual comment page: `viewcomments.cfm?commentid=X` shows regulation title + brief description at the top, plus the comment
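The three page types above can be told apart from the URL alone. The following is a minimal sketch of that routing, not the project's actual Scrapy code: the function name `classify_url`, the returned labels, and the `example.test` host are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Query parameters that identify a comment listing, per the list above.
LISTING_PARAMS = ("GDocForumID", "stageid", "petitionid")


def classify_url(url: str) -> tuple[str, dict[str, str]]:
    """Return (page_kind, ids) for a forum URL; labels are illustrative."""
    parsed = urlparse(url)
    # Compare the last path segment case-insensitively (Forums.cfm vs forums.cfm).
    page = parsed.path.rsplit("/", 1)[-1].lower()
    # Lower-case parameter names and keep the first value of each.
    query = {k.lower(): v[0] for k, v in parse_qs(parsed.query).items()}

    if page == "forums.cfm":
        return "forum_listing", {}
    if page == "comments.cfm":
        ids = {p: query[p.lower()] for p in LISTING_PARAMS if p.lower() in query}
        if ids:
            return "comment_listing", ids
    if page == "viewcomments.cfm" and "commentid" in query:
        return "comment", {"commentid": query["commentid"]}
    return "unknown", {}


if __name__ == "__main__":
    # Hypothetical host; only the page name and query string matter here.
    print(classify_url("https://example.test/L/comments.cfm?stageid=42"))
```

A spider could use the returned label to pick a parse callback, and the extracted identifier to name the output file for that forum.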
<a id="org72990f4"></a> <a id="org72990f4"></a>
@@ -74,14 +73,12 @@ Then, the batch processing scripts uses the \`report.json\` to create multiple j
We selected gpt-5.4-mini for a good balance of quality, cost, and time. We selected gpt-5.4-mini for a good balance of quality, cost, and time.
1. Prompt 1. Prompt
\`\`\` ```
You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
regulatory comment system. You will be given the text of a proposed regulation and a single regulatory comment system. You will be given the text of a proposed regulation and a single
public comment. Return ONLY a JSON object — no other text. public comment. Return ONLY a JSON object — no other text.
Definitions: Definitions:
- stance: the commenter's position on whether the regulation should be adopted. - stance: the commenter's position on whether the regulation should be adopted.
"support" = wants it approved (as-is or with changes); "support" = wants it approved (as-is or with changes);
"oppose" = wants it rejected or substantially weakened; "oppose" = wants it rejected or substantially weakened;
@@ -93,57 +90,54 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
"neutral" = matter-of-fact, procedural, or informational; "neutral" = matter-of-fact, procedural, or informational;
"mixed" = contains both positive and negative emotional content; "mixed" = contains both positive and negative emotional content;
"unclear" = tone cannot be determined (e.g., a one-word comment). "unclear" = tone cannot be determined (e.g., a one-word comment).
- stance<sub>confidence</sub>: float 0.0-1.0, your confidence in the stance label. - stance_confidence: float 0.0-1.0, your confidence in the stance label.
- stance<sub>rationale</sub>: 1-3 sentences explaining the key evidence; quote specific phrases where possible. - stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
- tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g. - tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
"parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion", "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
"bullying prevention", "school sports", "bathroom access"). Empty array if none apply. "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
Return exactly these keys: stance, stance<sub>confidence</sub>, stance<sub>rationale</sub>, tone, tags. Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
\`\`\` ```
<a id="org58a5b72"></a> <a id="org58a5b72"></a>
### Storage ### Storage
- Each scraped forum is saved to \`output/<forum-id>.jsonl\` - Each scraped forum is saved to `output/<forum-id>.jsonl`
- Each report (forum + prompt) is saved to \`reports/<forum-id-N>.json\` - Each report (forum + prompt) is saved to `reports/<forum-id-N>.json`
- Each job is saved to \`analysis/jobs/<report-id>/: - Each job is saved to `analysis/jobs/<report-id>`:
└─\`forum.jsonl\` is a copy of the scraped forum for convenience └─`forum.jsonl` is a copy of the scraped forum for convenience
└─\`prompt.txt\` is a copy of the prompt used └─`prompt.txt` is a copy of the prompt used
└─\`report.json\` is a copy of the report used └─`report.json` is a copy of the report used
└─\`status.json\` contains metadata about the job └─`status.json` contains metadata about the job
For each batch in the job, four files are created: For each batch in the job, four files are created:
└─\`jobN-input.jsonl\` contains the exact queries sent to the API, for troubleshooting └─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
└─\`jobN-output-raw.jsonl\` contains the exact response from the API └─`jobN-output-raw.jsonl` contains the exact response from the API
└─\`jobN-output.jsonl\` contains the exact response from the API └─`jobN-output.jsonl` contains the exact response from the API
└─\`jobN-output-errors.jsonl\` when errors are returned (this file may not exist) └─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
- Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder. - Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
<a id="org24fe465"></a> <a id="org24fe465"></a>
## Instructions ## Instructions
1. Scrape the forum. 1. Scrape the forum.
\`python `python`
2. Run model report. 2. Run model report.
\`python analysis/tokenizer.py <input> &ndash;prompt <prompt>\` `python analysis/tokenizer.py <input> --prompt <prompt>`
3. To run a realtime subset: 3. To run a realtime subset:
\`python analysis/openai<sub>realtime.py</sub> <input> &ndash;prompt <prompt> &ndash;model <model> &ndash;limit <N comments>\` `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
\`python analysis/openai<sub>realtime.py</sub> output/f452.jsonl &ndash;prompt prompt-1.txt &ndash;model gpt-4o-mini &ndash;limit 10\` `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
4. To create and run the whole thing in batches, first create the batch jobs from the report: 4. To create and run the whole thing in batches, first create the batch jobs from the report:
\`python analysis/openai<sub>batch.py</sub> create <report> &ndash;model <model>\` `python analysis/openai_batch.py create <report> --model <model>`
\`python analysis/openai<sub>batch.py</sub> create ./reports/f452-1.json &ndash;model gpt-5.4-mini\` `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented. 5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
\`python analysis/openai<sub>batch.py</sub> submit\` `python analysis/openai<sub>batch.py</sub> submit`
\`python analysis/openai<sub>batch.py</sub> status\` `python analysis/openai<sub>batch.py</sub> status`
\`python analysis/openai<sub>batch.py</sub> download\` `python analysis/openai<sub>batch.py</sub> download`
\`python analysis/openai<sub>batch.py</sub> submit\` `python analysis/openai<sub>batch.py</sub> submit`
<a id="org5739d49"></a> <a id="org5739d49"></a>