Compare commits

1 Commits

Author SHA1 Message Date
25a17cb691 updated readme 2026-05-09 00:16:44 -04:00
4 changed files with 18 additions and 21 deletions

@@ -110,30 +110,32 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
## Instructions ## Instructions
1. Scrape the forum. 1. Clone repo and install dependencies:
`python` `python -m pip install -r requirements.txt`
2. Run model report. 2. Scrape the forum based on the ID in the URL.
`scrapy crawl forum -a forum_id=<forum_id> -s LOG_LEVEL=WARNING 2>&1`
3. Run model report.
`python analysis/tokenizer.py <input> --prompt <prompt>` `python analysis/tokenizer.py <input> --prompt <prompt>`
3. To run a realtime subset: 4. To run a realtime subset:
`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>` `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10` `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
4. To create and run the whole thing in batches, first create the batch jobs from the report: 5. To create and run the whole thing in batches, first create the batch jobs from the report:
`python analysis/openai_batch.py create <report> --model <model>` `python analysis/openai_batch.py create <report> --model <model>`
`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini` `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
5. Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented. 6. Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.
`python analysis/openai<sub>batch.py</sub> submit` `python analysis/openai_batch.py submit`
`python analysis/openai<sub>batch.py</sub> status` `python analysis/openai_batch.py status`
`python analysis/openai<sub>batch.py</sub> download` `python analysis/openai_batch.py download`
`python analysis/openai<sub>batch.py</sub> submit` `python analysis/openai_batch.py submit`
<a id="org5739d49"></a> <a id="org5739d49"></a>
# Roadmap # Roadmap
1. Scrape one forum 1. /Done/ Scrape one forum, check sentiment, display
2. Compare sentiment models 2. Test different models
3. Display 3. Build batch runner
4. Scrape all data
5. Scale?
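
Review note: the sequential submit/status/download loop described in the README's batch step could be scripted roughly as follows. This is only a sketch, assuming the three subcommands take no further arguments (as the README shows) and exit non-zero on failure. The diff does not show what `status` prints, so `is_done` is a hypothetical callback the caller has to supply; `poll_seconds`, `run`, and `sleep` are likewise illustrative parameters, not part of the repository.

```python
import subprocess
import sys
import time
from typing import Callable, List

SCRIPT = "analysis/openai_batch.py"  # script path as given in the README


def batch_cmd(subcommand: str) -> List[str]:
    """Build the argv for one openai_batch.py subcommand."""
    return [sys.executable, SCRIPT, subcommand]


def run_one_job(
    is_done: Callable[[str], bool],
    poll_seconds: float = 60.0,
    run=subprocess.run,
    sleep=time.sleep,
) -> None:
    """Submit a single batch job, wait for it to finish, then download it.

    is_done is HYPOTHETICAL: it receives the stdout of the `status`
    subcommand and decides whether the job has finished. The real output
    format is not shown in the diff, so the caller must define this check.
    """
    # Submit exactly one job; check=True raises if the command fails.
    run(batch_cmd("submit"), check=True)

    # Poll until the caller-supplied check says the job is complete.
    while True:
        result = run(
            batch_cmd("status"), check=True, capture_output=True, text=True
        )
        if is_done(result.stdout):
            break
        sleep(poll_seconds)

    # Fetch results before any further job is submitted.
    run(batch_cmd("download"), check=True)
```

Calling `run_one_job` once per pending job keeps only one batch in flight at a time, which is the constraint the README states.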

@@ -354,9 +354,8 @@ data pulls entirely from the job; goal is to point viz/streamlit.py at any job/
- tests: from root dir, `streamlit run viz/streamlit.py <job-dir>` - tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
- datetime: [2026-05-08 Fri 23:44] - datetime: [2026-05-08 Fri 23:44]
* +[ ] t1.6 host streamlit via dockerfile+ * [ ] t1.6 host streamlit via dockerfile
planning to deploy manually, get cert, etc etc. probably dont care about https? planning to deploy manually, get cert, etc etc. probably dont care about https?
+using streamlit.app instead+
** acceptance criteria ** acceptance criteria
1. write dockerfile with slim image 1. write dockerfile with slim image

@@ -5,8 +5,6 @@ class ForumItem(scrapy.Item):
forum_id = scrapy.Field() forum_id = scrapy.Field()
reg_title = scrapy.Field() reg_title = scrapy.Field()
reg_desc = scrapy.Field() reg_desc = scrapy.Field()
scraped_at = scrapy.Field()
forum_url = scrapy.Field()
class CommentItem(scrapy.Item): class CommentItem(scrapy.Item):

@@ -63,8 +63,6 @@ class ForumSpider(scrapy.Spider):
forum_id=self.forum_id, forum_id=self.forum_id,
reg_title=reg_title, reg_title=reg_title,
reg_desc=reg_desc, reg_desc=reg_desc,
scraped_at=datetime.utcnow().isoformat(),
forum_url=_view_url(self.forum_id),
) )
for page in range(2, last_page + 1): for page in range(2, last_page + 1):
yield scrapy.FormRequest( yield scrapy.FormRequest(
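
Review note: the removed `scraped_at` line relied on `datetime.utcnow()`, which returns a naive datetime and is deprecated as of Python 3.12. If the field is ever restored, a timezone-aware timestamp might look like the sketch below. The helper name `utc_timestamp` is hypothetical and does not exist in the repository.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current UTC time as an ISO 8601 string with an explicit offset."""
    # datetime.now(timezone.utc) yields an aware datetime, so isoformat()
    # includes the "+00:00" suffix that utcnow().isoformat() omits.
    return datetime.now(timezone.utc).isoformat()
```

The resulting string differs from the old format by the trailing offset, so any consumer that parses `scraped_at` would need to accept it.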