Merge branch 'master' of https://git.hgsky.me/ben/vath
66
README.md
@@ -1,4 +1,3 @@
 
 # Table of Contents
 
 1. [Project Goals](#org2da6874)
@@ -56,9 +55,9 @@
 
 Scrapy provides a simple mechanism for retrieving, parsing, and saving content form the forums.
 
-1. Forums listing page: \`Forums.cfm\` - lists all open forums with agency, reg title, action type, brief description, closing date, comment count
-2. Comment listing page: \`comments.cfm?GDocForumID=X\` or \`comments.cfm?stageid=X\` or \`comments.cfm?petitionid=X\` - lists comments with title, author, date
-3. Individual comment page: \`viewcomments.cfm?commentid=X\` - shows regulation title + brief description at the top, plus the comment
+1. Forums listing page: `Forums.cfm` lists all open forums with agency, reg title, action type, brief description, closing date, comment count
+2. Comment listing page: `comments.cfm?GDocForumID=X` or `comments.cfm?stageid=X` or `comments.cfm?petitionid=X` lists comments with title, author, date
+3. Individual comment page: `viewcomments.cfm?commentid=X` shows regulation title + brief description at the top, plus the comment
 
 
 <a id="org72990f4"></a>
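The three page types listed in this hunk can be told apart from the URL alone. The following is a minimal sketch of that idea, not the project's actual spider code: the function name `classify_page` and the returned labels are hypothetical, and it assumes page names and query keys can be compared case-insensitively.

```python
from urllib.parse import urlparse, parse_qs

# Query keys the README lists for comment listing pages.
LISTING_KEYS = ("GDocForumID", "stageid", "petitionid")


def classify_page(url):
    """Return a (page_type, identifier) pair for a Town Hall forum URL.

    page_type is one of the hypothetical labels "forum_list",
    "comment_list", "comment", or "unknown".
    """
    parsed = urlparse(url)
    # Last path segment, lowercased (assumption: case does not matter).
    page = parsed.path.rsplit("/", 1)[-1].lower()
    # Keep the first value of each query key, with lowercased key names.
    query = {k.lower(): v[0] for k, v in parse_qs(parsed.query).items()}

    if page == "forums.cfm":
        return ("forum_list", None)
    if page == "comments.cfm":
        for key in LISTING_KEYS:
            if key.lower() in query:
                return ("comment_list", query[key.lower()])
    if page == "viewcomments.cfm" and "commentid" in query:
        return ("comment", query["commentid"])
    return ("unknown", None)
```

A Scrapy callback could use the returned label to decide which parser to run; that routing is left out here because the README does not describe how the spider is organized.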
@@ -74,14 +73,12 @@ Then, the batch processing scripts uses the \`report.json\` to create multiple j
 We selected gpt-5.4-mini for a good balance of quality, cost, and time.
 
 1. Prompt
-
-\`\`\`
+```
 You are an expert policy analyst classifying public comments submitted to the Virginia Town Hall
 regulatory comment system. You will be given the text of a proposed regulation and a single
 public comment. Return ONLY a JSON object — no other text.
 
 Definitions:
 
 - stance: the commenter's position on whether the regulation should be adopted.
 "support" = wants it approved (as-is or with changes);
 "oppose" = wants it rejected or substantially weakened;
@@ -93,33 +90,33 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
 "neutral" = matter-of-fact, procedural, or informational;
 "mixed" = contains both positive and negative emotional content;
 "unclear" = tone cannot be determined (e.g., a one-word comment).
-- stance<sub>confidence</sub>: float 0.0-1.0, your confidence in the stance label.
-- stance<sub>rationale</sub>: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
+- stance_confidence: float 0.0-1.0, your confidence in the stance label.
+- stance_rationale: 1-3 sentences explaining the key evidence; quote specific phrases where possible.
 - tags: up to 5 short topic labels relevant to the comment's specific concerns (e.g.
 "parental rights", "student safety", "privacy", "religious freedom", "LGBTQ+ inclusion",
 "bullying prevention", "school sports", "bathroom access"). Empty array if none apply.
 
-Return exactly these keys: stance, stance<sub>confidence</sub>, stance<sub>rationale</sub>, tone, tags.
-\`\`\`
+Return exactly these keys: stance, stance_confidence, stance_rationale, tone, tags.
+```
 
 
 <a id="org58a5b72"></a>
 
 ### Storage
 
-- Each scraped forum is saved to \`output/<forum-id>.jsonl\`
-- Each report (forum + prompt) is saves to \`reports/<forum-id-N>.json\`
-- Each job is saved to \`analysis/jobs/<report-id>/:
-└─\`forum.jsonl\` is a copy of the scraped forum for convenience
-└─\`prompt.txt\` is a copy of the prompt used
-└─\`report.json\` is a copy of the report used
-└─\`status.json\` contains metadata about the job
+- Each scraped forum is saved to `output/<forum-id>.jsonl`
+- Each report (forum + prompt) is saves to `reports/<forum-id-N>.json`
+- Each job is saved to `analysis/jobs/<report-id>`:
+└─`forum.jsonl` is a copy of the scraped forum for convenience
+└─`prompt.txt` is a copy of the prompt used
+└─`report.json` is a copy of the report used
+└─`status.json` contains metadata about the job
 For each batch in the job, four files are created:
-└─\`jobN-input.jsonl\` contains the exact queries sent to the API, for troubleshooting
-└─\`jobN-output-raw.jsonl\` contains the exact response from the API
-└─\`jobN-output.jsonl\` contains the exact response from the API
-└─\`jobN-output-errors.jsonl\` when errors are returned (this file may not exist)
-- Once complete, the cleanup script saves \`review.csv\`, \`review.pqt\`, and \`review.sqlite\` in this folder.
+└─`jobN-input.jsonl` contains the exact queries sent to the API, for troubleshooting
+└─`jobN-output-raw.jsonl` contains the exact response from the API
+└─`jobN-output.jsonl` contains the exact response from the API
+└─`jobN-output-errors.jsonl` when errors are returned (this file may not exist)
+- Once complete, the cleanup script saves `review.csv`, `review.pqt`, and `review.sqlite` in this folder.
 
 
 <a id="org24fe465"></a>
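The prompt in the hunks above asks the model for a JSON object with exactly five keys, a confidence between 0.0 and 1.0, and at most five tags. A reply could be checked against those constraints roughly as follows. This is a sketch under assumptions: `validate_classification` is a hypothetical helper, not part of the repository, and it deliberately does not check the `stance` or `tone` labels because the full label lists are not visible in this diff.

```python
import json

# Keys named in the prompt's final "Return exactly these keys" line.
EXPECTED_KEYS = {"stance", "stance_confidence", "stance_rationale", "tone", "tags"}


def validate_classification(raw):
    """Parse one model reply; return a list of problems (empty if none found)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(record) != EXPECTED_KEYS:
        missing = sorted(EXPECTED_KEYS - set(record))
        extra = sorted(set(record) - EXPECTED_KEYS)
        problems.append(f"key mismatch: missing={missing} extra={extra}")

    # bool is a subclass of int in Python, so reject it explicitly.
    conf = record.get("stance_confidence")
    if (isinstance(conf, bool) or not isinstance(conf, (int, float))
            or not 0.0 <= conf <= 1.0):
        problems.append("stance_confidence must be a number between 0.0 and 1.0")

    tags = record.get("tags")
    if (not isinstance(tags, list) or len(tags) > 5
            or not all(isinstance(t, str) for t in tags)):
        problems.append("tags must be a list of at most 5 strings")
    return problems
```

Returning a list of problems instead of raising makes it easy to log every malformed reply in a batch and keep going, which seems to fit the troubleshooting role the README gives to the raw output files.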
@@ -127,23 +124,20 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
 ## Instructions
 
 1. Scrape the forum.
-\`python
+`python`
 2. Run model report.
-\`python analysis/tokenizer.py <input> –prompt <prompt>\`
+`python analysis/tokenizer.py <input> --prompt <prompt>`
 3. To run a realtime subset:
-\`python analysis/openai<sub>realtime.py</sub> <input> –prompt <prompt> –model <model> –limit <N comments>\`
-\`python analysis/openai<sub>realtime.py</sub> output/f452.jsonl –prompt prompt-1.txt –model gpt-4o-mini –limit 10\`
+`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
+`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
 4. To create and run the whole thing in batches, first create the batch jobs from the report:
-\`python analysis/openai<sub>batch.py</sub> create <report> –model <model>\`
-\`python analysis/openai<sub>batch.py</sub> create ./reports/f452-1.json –model gpt-5.4-mini\`
+`python analysis/openai_batch.py create <report> --model <model>`
+`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
 5. Then, run the jobs sequentially. Don't submit more than one at a time, if the model fills up the batch will fail and resubmission is not implemented.
-\`python analysis/openai<sub>batch.py</sub> submit\`
-
-\`python analysis/openai<sub>batch.py</sub> status\`
-
-\`python analysis/openai<sub>batch.py</sub> download\`
-
-\`python analysis/openai<sub>batch.py</sub> submit\`
+`python analysis/openai<sub>batch.py</sub> submit`
+`python analysis/openai<sub>batch.py</sub> status`
+`python analysis/openai<sub>batch.py</sub> download`
+`python analysis/openai<sub>batch.py</sub> submit`
 
 
 <a id="org5739d49"></a>
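Steps 4 and 5 in this hunk describe a fixed order: create the jobs from a report, then submit, check status, and download one job at a time. The sketch below only builds the argument lists for that sequence so they can be printed or handed to `subprocess.run`. It assumes the script path is `analysis/openai_batch.py`, as written in the `create` lines (the `submit`, `status`, and `download` lines still show the `<sub>` export artifact), and `batch_commands` is a hypothetical helper name.

```python
import sys

# Assumed script path, taken from the `create` command in the README.
BATCH_SCRIPT = "analysis/openai_batch.py"


def batch_commands(report, model):
    """Return the documented batch commands, in order, as argv lists."""
    py = sys.executable  # run with the same interpreter as the caller
    return [
        [py, BATCH_SCRIPT, "create", report, "--model", model],
        [py, BATCH_SCRIPT, "submit"],
        [py, BATCH_SCRIPT, "status"],
        [py, BATCH_SCRIPT, "download"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands for the README's example report.
    for argv in batch_commands("./reports/f452-1.json", "gpt-5.4-mini"):
        print(" ".join(argv))
```

The sketch stops after one submit/status/download cycle and does not parse the `status` output, since its format is not shown in the README. Because resubmission is not implemented, waiting for a job to finish before the next `submit` is left as a manual step.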