updated readme

2026-05-09 00:16:44 -04:00
4 changed files with 18 additions and 21 deletions
--- a/README.md
+++ b/README.md
@@ -110,30 +110,32 @@ We selected gpt-5.4-mini for a good balance of quality, cost, and time.

 ## Instructions

-1.  Scrape the forum.  
-    `python`  
-2.  Run model report.  
+1.  Clone repo and install dependencies:
+    `python -m pip install -r requirements.txt`
+2.  Scrape the forum based on the ID in the URL.  
+    `scrapy crawl forum -a forum_id=<forum_id> -s LOG_LEVEL=WARNING 2>&1`  
+3.  Run model report.  
    `python analysis/tokenizer.py <input> --prompt <prompt>`  
-3.  To run a realtime subset:  
+4.  To run a realtime subset:  
    `python analysis/openai_realtime.py <input> --prompt <prompt> --model <model> --limit <N comments>`  
    `python analysis/openai_realtime.py output/f452.jsonl --prompt prompt-1.txt --model gpt-4o-mini --limit 10`  
-4.  To create and run the whole thing in batches, first create the batch jobs from the report:  
+5.  To create and run the whole thing in batches, first create the batch jobs from the report:  
    `python analysis/openai_batch.py create <report> --model <model>`  
    `python analysis/openai_batch.py create ./reports/f452-1.json --model gpt-5.4-mini`  
-5.  Then, run the jobs sequentially. Don't submit more than one at a time, if the model fills up the batch will fail and resubmission is not implemented.  
-    `python analysis/openai<sub>batch.py</sub> submit`  
-    `python analysis/openai<sub>batch.py</sub> status`  
-    `python analysis/openai<sub>batch.py</sub> download`  
-    `python analysis/openai<sub>batch.py</sub> submit`  
+6.  Then, run the jobs sequentially. Don't submit more than one at a time: if the model fills up, the batch will fail, and resubmission is not implemented.  
+    `python analysis/openai_batch.py submit`  
+    `python analysis/openai_batch.py status`  
+    `python analysis/openai_batch.py download`  
+    `python analysis/openai_batch.py submit`  


 <a id="org5739d49"></a>

 # Roadmap

-1.  Scrape one forum
-2.  Compare sentiment models
-3.  Display
-4.  Scrape all data
-5.  Scale?
+1.  *Done:* Scrape one forum, check sentiment, display
+2.  Test different models
+3.  Build batch runner
+
+

--- a/docs/tasks.org
+++ b/docs/tasks.org
@@ -354,9 +354,8 @@ data pulls entirely from the job; goal is to point viz/streamlit.py at any job/
 - tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
 - datetime: [2026-05-08 Fri 23:44]
  
-* +[ ] t1.6 host streamlit via dockerfile+
+* [ ] t1.6 host streamlit via dockerfile
 planning to deploy manually, get cert, etc. probably don't care about https?
-+using streamlit.app instead+
 ** acceptance criteria
 1. write dockerfile with slim image

--- a/scraper/items.py
+++ b/scraper/items.py
@@ -5,8 +5,6 @@ class ForumItem(scrapy.Item):
    forum_id  = scrapy.Field()
    reg_title = scrapy.Field()
    reg_desc  = scrapy.Field()
-    scraped_at = scrapy.Field()
-    forum_url = scrapy.Field()


 class CommentItem(scrapy.Item):
--- a/scraper/spiders/forum.py
+++ b/scraper/spiders/forum.py
@@ -63,8 +63,6 @@ class ForumSpider(scrapy.Spider):
                forum_id=self.forum_id,
                reg_title=reg_title,
                reg_desc=reg_desc,
-                scraped_at=datetime.utcnow().isoformat(),
-                forum_url=_view_url(self.forum_id),
            )
            for page in range(2, last_page + 1):
                yield scrapy.FormRequest(