openai batch refactor
@@ -158,7 +158,7 @@ forum_id_input,comment_id,title,text,date,author,stance,stance_confidence,stance
- tests: 23 passing (pytest tests/analysis_gpt4o_batch.py), 51 total across suite
- datetime: [2026-05-06 Wed 08:55]

* [ ] t1.2.3: batch job refactor
* [X] t1.2.3: batch job refactor
This task encompasses intent and fixes for 1.2.1 and 1.2.2.
batch processing should be a resumable job queue, not a one-shot script. the user should not need to remember offsets, completed chunks, failed batches, or which comments remain.
** Acceptance Criteria
@@ -200,6 +200,46 @@ batch processing should be a resumable job queue, not a one-shot script. the us
- resume from status.json
- remaining-comment detection
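
A minimal sketch of remaining-comment detection, assuming every input record and every normalized output record carries a =comment_id= field (the CSV header in this file lists one). The helper names =ids_in= and =remaining_ids= are hypothetical and not taken from analysis_batch.py.

```python
import json
from typing import Iterable, List


def ids_in(lines: Iterable[str], key: str = "comment_id") -> List[str]:
    """Return the ids found in JSONL lines, skipping blank lines."""
    return [str(json.loads(line)[key]) for line in lines if line.strip()]


def remaining_ids(
    input_lines: Iterable[str],
    output_lines: Iterable[str],
    key: str = "comment_id",  # assumed field name
) -> List[str]:
    """Return input ids that have no output record yet, in input order."""
    done = set(ids_in(output_lines, key))
    return [cid for cid in ids_in(input_lines, key) if cid not in done]
```

Open file handles can be passed directly, since iterating a text file yields its lines. Deriving the remainder from the files themselves means nobody has to track offsets by hand.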
** notes
- analysis/gpt4o/tokenizer.py: new standalone script; imports analysis_batch for MODEL_LIMITS, estimate_tokens, build_messages. Reads the input JSONL and prompt, computes a per-model jobs/cost/time table, and writes report.json to the input file's directory. The MODEL_PRICING dict lives here (not in analysis_batch).
- analysis/gpt4o/analysis_batch.py: fully rewritten with four subcommands: create, submit, status, download. No longer uses REQUESTS_DIR / RAW_DIR / RUNS_DIR.
- Job directories: analysis/gpt4o/jobs/<stem[:8]>-N/ (e.g. f452-1). Each run is self-contained: forum.jsonl, prompt.txt, report.json, jobN-input.jsonl, jobN-output-raw.jsonl, jobN-output.jsonl, jobN-errors.jsonl.
- status.json: tracks all jobs with pending/submitted/in_progress/completed/failed states. Updated by submit, status, download.
- _find_next_eligible_job: pure function for testability. Returns (next_pending_job, None) or (None, warning). Blocks submission if a previous job is submitted or in_progress.
- create: no API key required. Reads report.json, re-chunks comments, writes all jobN-input.jsonl files, writes status.json.
- submit: uploads jobN-input.jsonl to the Files API, creates the batch, updates status.json to 'submitted'. Will not stack batches.
- status: retrieves the batch from OpenAI, updates status.json counts and status.
- download: auto-runs status first, downloads output_file_id → jobN-output-raw.jsonl, error_file_id → jobN-errors.jsonl, normalizes → jobN-output.jsonl. Updates status.json.
- tests/test_tokenizer.py: 15 tests for compute_report schema, cost/time calculation, MODEL_PRICING coverage, print_table output, report.json round-trip.
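
The job bookkeeping in these notes could look roughly like the sketch below. It is an illustration under assumptions, not the code in analysis_batch.py: the per-job dict shape (=id=, =status=), the helper =next_run_dir=, and the choice to skip completed and failed jobs are all hypothetical. Only the run directory pattern, the status names, and the return contract of the eligibility function come from the notes.

```python
from pathlib import Path
from typing import List, Optional, Tuple

# Statuses that mean a batch is still running remotely (names from the notes).
ACTIVE = {"submitted", "in_progress"}


def next_run_dir(jobs_root: Path, input_stem: str) -> Path:
    """Pick jobs/<stem[:8]>-N, where N is one past the highest existing run."""
    prefix = input_stem[:8]
    taken = []
    if jobs_root.exists():
        for child in jobs_root.iterdir():
            name, _, n = child.name.rpartition("-")
            if child.is_dir() and name == prefix and n.isdigit():
                taken.append(int(n))
    return jobs_root / f"{prefix}-{max(taken, default=0) + 1}"


def find_next_eligible_job(jobs: List[dict]) -> Tuple[Optional[dict], Optional[str]]:
    """Return (next_pending_job, None), or (None, warning) if nothing may be submitted.

    Jobs are scanned in order; an active job blocks everything after it, so
    batches are never stacked. Completed and failed jobs are skipped
    (an assumption of this sketch).
    """
    for job in jobs:
        if job["status"] in ACTIVE:
            return None, f"job {job['id']} is {job['status']}; wait for it to finish"
        if job["status"] == "pending":
            return job, None
    return None, "no pending jobs left"
```

Because the eligibility check takes a plain list and touches neither the file system nor the API, it can be unit tested with literal job lists.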
*** usage
#+begin_src sh
# 1. estimate tokens and cost
python analysis/gpt4o/tokenizer.py output/f452.jsonl --prompt analysis/prompt-1.txt
# writes output/report.json

# 2. create job directory (no api key needed)
python analysis/gpt4o/analysis_batch.py create output/report.json --model gpt-4o-mini
# creates analysis/gpt4o/jobs/f452-1/

# 3. submit first job
python analysis/gpt4o/analysis_batch.py submit

# 4. check status (repeat until completed)
python analysis/gpt4o/analysis_batch.py status

# 5. download and normalize
python analysis/gpt4o/analysis_batch.py download

# 6. submit next job (if multi-job run), then repeat 4-5
python analysis/gpt4o/analysis_batch.py submit
#+end_src
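
For step 5, the normalization pass might look like the sketch below. Each line of an OpenAI batch output file is a JSON object with =custom_id=, =response= (holding =status_code= and =body=), and =error=. The rest is assumed for illustration: that =custom_id= holds the comment id, that the model replies with a JSON object, and the function name =normalize_lines=.

```python
import json
from typing import Iterable, List, Tuple


def normalize_lines(raw_lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Split raw batch output lines into (normalized rows, error rows)."""
    rows, errors = [], []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        cid = rec.get("custom_id")  # assumed to carry the comment id
        resp = rec.get("response") or {}
        if rec.get("error") or resp.get("status_code") != 200:
            errors.append({"comment_id": cid,
                           "error": rec.get("error") or resp.get("status_code")})
            continue
        content = resp["body"]["choices"][0]["message"]["content"]
        try:
            parsed = json.loads(content)  # assumes the model answered in JSON
        except (json.JSONDecodeError, TypeError):
            parsed = None
        if not isinstance(parsed, dict):
            errors.append({"comment_id": cid, "error": "unparseable model output"})
            continue
        rows.append({"comment_id": cid, **parsed})
    return rows, errors
```

Keeping the raw file untouched and writing normalized rows separately means a parsing bug can be fixed and the pass re-run without downloading anything again.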
** evidence
- commit:
- tests: passing (pytest tests/analysis_gpt4o_batch.py tests/test_tokenizer.py)
- datetime: [2026-05-05 Tue]

* === Backlog ===
* [ ] X: analysis validation view
create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv.
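
One possible shape for that script, as a sketch only. The column names are taken from the CSV header shown earlier in this file (=comment_id=, =text=, =stance=, =stance_confidence=). The function names and the choice to keep comments that have no analysis row, with empty analysis cells, are assumptions.

```python
import csv
import io
from typing import Iterable, List, TextIO

FIELDS = ["comment_id", "text", "stance", "stance_confidence"]


def build_review_rows(comments: Iterable[dict], analysis: Iterable[dict]) -> List[dict]:
    """Left-join raw comments to analysis rows on comment_id."""
    by_id = {str(a["comment_id"]): a for a in analysis}
    rows = []
    for c in comments:
        a = by_id.get(str(c["comment_id"]), {})
        rows.append({
            "comment_id": c["comment_id"],
            "text": c.get("text", ""),
            # Empty cells mark comments that have not been analyzed yet.
            "stance": a.get("stance", ""),
            "stance_confidence": a.get("stance_confidence", ""),
        })
    return rows


def write_review_csv(rows: List[dict], fh: TextIO) -> None:
    """Write the joined rows as CSV with a header line."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    demo = build_review_rows(
        [{"comment_id": "c1", "text": "hello"}, {"comment_id": "c2", "text": "bye"}],
        [{"comment_id": "c1", "stance": "pro", "stance_confidence": 0.9}],
    )
    buf = io.StringIO()
    write_review_csv(demo, buf)
    print(buf.getvalue())
```

A left join keeps unanalyzed comments visible in the review sheet, so the same file also shows which comments remain.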