Compare commits

2 Commits

Author SHA1 Message Date
8f1d9e7723 added forum metadata for later use 2026-05-09 00:36:30 -04:00
181477bce7 streamlit > local docker 2026-05-09 00:25:27 -04:00
4 changed files with 21 additions and 18 deletions

@@ -110,32 +110,30 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.
## Instructions ## Instructions
1. Clone repo and install dependencies: 1. Scrape the forum.
`python -m pip install -r requirements.txt` `python`
2. Scrape the forum based on the ID in the URL. 2. Run model report.
`scrapy crawl forum -a forum_id=<forum_id> -s LOG_LEVEL=WARNING 2>&1`
3. Run model report.
`python analysis/tokenizer.py <input> --prompt <prompt>` `python analysis/tokenizer.py <input> --prompt <prompt>`
4. To run a realtime subset: 3. To run a realtime subset:
`python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>` `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`
`python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10` `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`
5. To create and run the whole thing in batches, first create the batch jobs from the report: 4. To create and run the whole thing in batches, first create the batch jobs from the report:
`python analysis/openai_batch.py create <report> --model <model>` `python analysis/openai_batch.py create <report> --model <model>`
`python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini` `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`
6. Then, run the jobs sequentially. Don't submit more than one at a time; if the model fills up, the batch will fail and resubmission is not implemented. 5. Then, run the jobs sequentially. Don't submit more than one at a time; if the model fills up, the batch will fail and resubmission is not implemented.
`python analysis/openai_batch.py submit` `python analysis/openai_batch.py submit`
`python analysis/openai_batch.py status` `python analysis/openai_batch.py status`
`python analysis/openai_batch.py download` `python analysis/openai_batch.py download`
`python analysis/openai_batch.py submit` `python analysis/openai_batch.py submit`
<a id="org5739d49"></a> <a id="org5739d49"></a>
# Roadmap # Roadmap
1. /Done/ Scrape one forum, check sentiment, display 1. Scrape one forum
2. Test different models 2. Compare sentiment models
3. Build batch runner 3. Display
4. Scrape all data
5. Scale?

@@ -354,8 +354,9 @@ data pulls entirely from the job; goal is to point viz/streamlit.py at any job/
- tests: from root dir, `streamlit run viz/streamlit.py <job-dir>` - tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
- datetime: [2026-05-08 Fri 23:44] - datetime: [2026-05-08 Fri 23:44]
* [ ] t1.6 host streamlit via dockerfile * +[ ] t1.6 host streamlit via dockerfile+
planning to deploy manually, get cert, etc etc. probably dont care about https? planning to deploy manually, get cert, etc etc. probably dont care about https?
+using streamlit.app instead+
** acceptance criteria ** acceptance criteria
1. write dockerfile with slim image 1. write dockerfile with slim image

@@ -5,6 +5,8 @@ class ForumItem(scrapy.Item):
forum_id = scrapy.Field() forum_id = scrapy.Field()
reg_title = scrapy.Field() reg_title = scrapy.Field()
reg_desc = scrapy.Field() reg_desc = scrapy.Field()
scraped_at = scrapy.Field()
forum_url = scrapy.Field()
class CommentItem(scrapy.Item): class CommentItem(scrapy.Item):

@@ -63,6 +63,8 @@ class ForumSpider(scrapy.Spider):
forum_id=self.forum_id, forum_id=self.forum_id,
reg_title=reg_title, reg_title=reg_title,
reg_desc=reg_desc, reg_desc=reg_desc,
scraped_at=datetime.utcnow().isoformat(),
forum_url=_view_url(self.forum_id),
) )
for page in range(2, last_page + 1): for page in range(2, last_page + 1):
yield scrapy.FormRequest( yield scrapy.FormRequest(
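
Review note on the new `scraped_at` field: `datetime.utcnow()` returns a naive datetime, so the resulting ISO string carries no UTC offset, and the call is deprecated as of Python 3.12. Below is a minimal sketch of a timezone-aware alternative, assuming the spider only needs an ISO 8601 string. `utc_timestamp` is a hypothetical helper name used for illustration, not something in this repo.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current UTC time as a timezone-aware ISO 8601 string.

    Hypothetical helper: a possible drop-in for datetime.utcnow().isoformat().
    """
    return datetime.now(timezone.utc).isoformat()


# The string keeps an explicit "+00:00" offset, so it parses back
# into an aware datetime instead of a naive one.
stamp = utc_timestamp()
parsed = datetime.fromisoformat(stamp)
print(stamp)
```

If this were adopted, the spider line would read `scraped_at=utc_timestamp()`. Consumers that already parse the naive format would then see the added offset suffix, so it may be worth checking downstream readers first.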