added forum metadata for later use

streamlit > local docker
2026-05-09 00:36:30 -04:00 · 2026-05-09 00:25:27 -04:00
4 changed files with 21 additions and 18 deletions
--- a/README.md
+++ b/README.md
@@ -110,32 +110,30 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.

 ## Instructions

-1.  Clone repo and install dependencies:
-    `python -m pip install -r requirements.txt`
-2.  Scrape the forum based on the ID in the URL.  
-    `scrapy crawl forum -a forum_id=<forum_id> -s LOG_LEVEL=WARNING 2>&1`  
-3.  Run model report.  
+1.  Scrape the forum.  
+    `python`  
+2.  Run model report.  
    `python analysis/tokenizer.py <input> --prompt <prompt>`  
-4.  To run a realtime subset:  
+3.  To run a realtime subset:  
    `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`  
    `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`  
-5.  To create and run the whole thing in batches, first create the batch jobs from the report:  
+4.  To create and run the whole thing in batches, first create the batch jobs from the report:  
    `python analysis/openai_batch.py create <report> --model <model>`  
    `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`  
-6.  Then, run the jobs sequentially. Don't submit more than one at a time, if the model fills up the batch will fail and resubmission is not implemented.  
-    `python analysis/openai_batch.py</sub> submit`  
-    `python analysis/openai_batch.py</sub> status`  
-    `python analysis/openai_batch.py</sub> download`  
-    `python analysis/openai_batch.py</sub> submit`  
+5.  Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.  
+    `python analysis/openai_batch.py submit`  
+    `python analysis/openai_batch.py status`  
+    `python analysis/openai_batch.py download`  
+    `python analysis/openai_batch.py submit`  


 <a id="org5739d49"></a>

 # Roadmap

-1.  /Done/ Scrape one forum, check sentiment, display
-2.  Test different models
-3.  Build batch runner
-
-
+1.  Scrape one forum
+2.  Compare sentiment models
+3.  Display
+4.  Scrape all data
+5.  Scale?

--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -354,8 +354,9 @@ data pulls entirely from the job; goal is to point viz/streamlit.py at any job/
 - tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
 - datetime: [2026-05-08 Fri 23:44]
  
-* [ ] t1.6 host streamlit via dockerfile
+* +[ ] t1.6 host streamlit via dockerfile+
 planning to deploy manually, get cert, etc etc. probably dont care about https?
+using streamlit.app instead+
 ** acceptance criteria
 1. write dockerfile with slim image

--- a/scraper/items.py
+++ b/scraper/items.py
@@ -5,6 +5,8 @@ class ForumItem(scrapy.Item):
    forum_id  = scrapy.Field()
    reg_title = scrapy.Field()
    reg_desc  = scrapy.Field()
+    scraped_at = scrapy.Field()
+    forum_url = scrapy.Field()


 class CommentItem(scrapy.Item):
--- a/scraper/spiders/forum.py
+++ b/scraper/spiders/forum.py
@@ -63,6 +63,8 @@ class ForumSpider(scrapy.Spider):
                forum_id=self.forum_id,
                reg_title=reg_title,
                reg_desc=reg_desc,
+                scraped_at=datetime.utcnow().isoformat(),
+                forum_url=_view_url(self.forum_id),
            )
            for page in range(2, last_page + 1):
                yield scrapy.FormRequest(
Author	SHA1	Message	Date
eulaly	8f1d9e7723	added forum metadata for later use	2026-05-09 00:36:30 -04:00
eulaly	181477bce7	streamlit > local docker	2026-05-09 00:25:27 -04:00
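
Review note on the `scraped_at` value added in `scraper/spiders/forum.py`: `datetime.utcnow()` returns a naive datetime (no offset in the ISO string) and is deprecated as of Python 3.12. A minimal sketch of a timezone-aware alternative follows. It assumes the spider imports the class with `from datetime import datetime`, which the hunk does not show, and `utc_timestamp` is a hypothetical helper name that does not appear in the commit.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """Return the current UTC time as an ISO 8601 string with an explicit offset."""
    # datetime.now(timezone.utc) yields a timezone-aware value, so the
    # serialized string carries "+00:00" and stays unambiguous downstream.
    return datetime.now(timezone.utc).isoformat()


if __name__ == "__main__":
    # Hypothetical usage: the spider could pass scraped_at=utc_timestamp()
    # when it builds the ForumItem.
    print(utc_timestamp())
```

If later consumers of the JSONL output already parse the naive format, switching would change the stored string shape, so this is a suggestion to weigh rather than a required fix.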