382 lines
22 KiB
Org Mode
382 lines
22 KiB
Org Mode
#+title: VATH Task Log
|
|
#+date: [2026-05-05 Tue]
|
|
#+startup: Overview
|
|
|
|
* [X] t1.1: scrape one forum (1)
|
|
Use https://www.townhall.virginia.gov/L/comments.cfm?GDocForumID=452 as the first forum. Scraper should be run manually at this step.
|
|
ViewComments (townhall.virginia.gov/L/ViewComments.cfm?CommentID=#) appears to be raw list of all comments on forum - could be useful later for whole-scrape
|
|
Append forum id to viewall per forum (townhall.virginia.gov/L/ViewComments.cfm?GdocForumID=452)
|
|
Comments are hydrated in backend via js-cued button (AJAX?).
|
|
** acceptance criteria
|
|
1. run manual scraper
|
|
1. store proposal title and description
|
|
2. store comment title, commenter, date
|
|
3. store relevant metadata
|
|
2. friendly/polite scraping
|
|
3. store forum as distinct item with title, desc
|
|
4. add forum ID in comment filename, eg forum452_comments_<datetime>.jsonl
|
|
5. remove reg_title and reg_desc from each comment; these belong in forum item
|
|
6. parse datetimes into object for later use (plotting)
|
|
|
|
** notes
|
|
- scraper/spiders/forum.py — ForumSpider using ViewComments.cfm?GdocForumID=N with POST pagination. First request fetches page 1 (vPerPage=500), discovers the last page number from the form's link, generates all remaining page requests upfront. Parses each div.Cbox for all required fields.
|
|
- scraper/items.py — CommentItem with forum_id, reg_title, reg_desc, comment_id, author, date, title, text
|
|
- tests/test_forum_spider.py — 7 tests, all passing
|
|
- Settings: DEFAULT_RESPONSE_ENCODING=utf-8 (fixes Windows-1251 meta-tag mismatch), HTTPCACHE_ENABLED=True, feed output to output/
|
|
- ViewComments.cfm instead of comments.cfm: POST to Comments.cfm returned a 500 error (wrong endpoint). ViewComments.cfm?GdocForumID=N is the correct listing URL, returns full comment text on the page itself — no per-comment follow requests needed.
|
|
- Span-wrapped text: .divComment p::text missed 3.6% of comments where text is in <p><span>text</span></p>. Fixed to .divComment *::text, .divComment::text. Worth knowing for when the spider is extended to other forums.
|
|
- start() vs start_requests(): Scrapy 2.13+ deprecates start_requests() in favor of async def start()
|
|
- ForumItem vs CommentItem: ForumItem (forum_id, reg_title, reg_desc) yielded once on first page; CommentItem no longer carries reg_title/reg_desc. Both land in the same JSONL feed.
|
|
- Dynamic output filename: set via from_crawler() overriding FEEDS at 'spider' priority — format is output/forum{id}_comments_%(time)s.jsonl. FEEDS removed from settings.py; spider owns it.
|
|
- Date parsing: _parse_date() normalizes whitespace, upper-cases, parses "%m/%d/%y %I:%M %p" → ISO 8601; falls back to raw string on failure.
|
|
|
|
** evidence
|
|
- commit: beb5cf4 (AC1-2), e7df0b2 (AC3-6)
|
|
- tests: 8 passing (`python -m pytest tests -q`) or (`python -m pytest tests/`)
|
|
- `scrapy crawl forum -a forum_id=452 -s LOG_LEVEL=WARNING 2>&1`
|
|
- retrieved 9083 comments
|
|
- datetime: [2026-05-05 Tue 14:00]
|
|
|
|
* [X] t1.2: initial 4o sentiment
|
|
Write a simple manual pipeline for gpt-4o that reads one scraped forum jsonl file and roduces a separate analyzed jsonl file. this step must not mutate scraper output. analysis should classify each comment for regulatory stance, generic tone/sentiment, confidence, and enough rationale/evidence to support later dashboard drilldown.
|
|
Should be run manually, separate from scraper. You may use scrapy, but are not required to.
|
|
- Sentiment is derived, not scraped - keep separate from raw comments.
|
|
- keep jsonl as interchange/audit format
|
|
|
|
** acceptance criteria
|
|
1. input scraped jsonl doc by filename/path, e.g. "./output/forum452_comments_<datetime>.jsonl"
|
|
- handle mixed itemtypes, e.g., forum + comment items
|
|
2. output new analysis file, e.g., "analysis/forum452_<datetime>_<model>_<datetime>.jsonl"
|
|
- one analysis record per comment
|
|
- include run_id, forum_id, comment_id, analyzed_at, model, prompt_version
|
|
3. capture stance toward proposed reg/guidance:
|
|
- `stance`: support, oppose, neutral, unknown
|
|
- `confidence`: 0-1
|
|
- short rationale, if provided by model
|
|
4. capture generic sentiment/tone separately from stance: `tone`=positive, negative, neutral, mixed, unclear
|
|
5. capture issue/topic tags for later grouping, may be empty
|
|
6. use .env for api key management
|
|
7. document the exact prompt version used; prompt text may live in code or docs, but must have a version string/hash in output records
|
|
8. for this run, an option to run the first N comments (5, 10, 20, 50) - will add batch processing later
|
|
|
|
** notes
|
|
- analysis/gpt4o/analysis.py: standalone script; core functions importable for tests.
|
|
- Prompt version = SHA-256[:7] of SYSTEM_PROMPT+USER_TEMPLATE; auto-updates on prompt change.
|
|
- Output: analysis/gpt4o/forum{id}_{scrape_ts}_{model}_{run_ts}.jsonl, one record per comment.
|
|
- --limit {5,10,20,50} for test runs; omit for full corpus. Batch processing planned for later.
|
|
- Incremental flush after each record: safe to interrupt and inspect partial output.
|
|
- temperature=0.0 for deterministic, reproducible classifications across runs.
|
|
- Retry: 3 attempts (delays 1s, 2s) on RateLimitError; all other exceptions → error record + continue.
|
|
- openai==2.34.0 installed; python-dotenv already present; key loaded from .env via OPENAI_API_KEY.
|
|
- MAX_COMMENT_CHARS=6000: covers >99% without truncation; outliers (e.g. 18k-char law firm brief) flagged with truncated=True.
|
|
|
|
** evidence
|
|
- commit: d834d18
|
|
- tests: 20 passing (pytest tests/analysis_gpt4o_realtime.py), 28 total across suite
|
|
- `python ./analysis/gpt4o/analysis_realtime.py --limit 5 ./output/f452.jsonl`
|
|
- see: ./analysis/gpt4o/forum452_unknown_gpt-4o_2026-05-05T18-48-32+00-00.jsonl
|
|
- date: [2026-05-05 Tue 15:00]
|
|
|
|
* [X] t1.2.1: batch processing
|
|
Create analysis-batch.py to capture same elements as t1.2 above.
|
|
May need to add multiple commands to upload, check batch status, download, etc.
|
|
Commands should all be run manually.
|
|
Reference: ./docs/openai-batch.md. openai batch output order is not guaranteed, so custom_id is mandatory for reconciliation
|
|
** acceptance criteria
|
|
1. input scraped jsonl doc by filename/path, and process the whole thing via batch processing
|
|
- ignore non-comment items in jsonl
|
|
- do not modify raw scraper output
|
|
- specify model and prompt
|
|
2. output a run manifest in ./analysis/<model>/runs/<run_id>.json
|
|
- include: include run_id, input_filename, input_sha256, prompt_hash, model, batch_id, records_submitted, records_completed, records_failed, request_filename, raw_output_filename, normalized_output_filename, created_at, completed_at
|
|
3. add tests without live api calls
|
|
** notes
|
|
- analysis/gpt4o/analysis-batch.py with three subcommands:
|
|
- `submit`: reads scraped JSONL, builds batch request file (requests/<run_id>.jsonl), uploads to Files API, creates batch, saves manifest to runs/<run_id>.json. Prints run_id to stdout for scripting.
|
|
- `status`: retrieves batch from OpenAI, prints status + counts, updates manifest.
|
|
- `download`: downloads raw output to raw/<run_id>.jsonl, normalizes to <run_id>_<model>.jsonl using comment_lookup keyed by comment_id for reconciliation (batch output order not guaranteed). Updates manifest with filenames, counts, completed_at.
|
|
- custom_id format: comment_{comment_id} — unique within a forum, stable across runs.
|
|
- PROMPT_VERSION derived from analysis/prompt-1.txt (same file as realtime); both scripts produce matching prompt_hash in all records.
|
|
- analysis/prompt-1.txt: system prompt as plaintext, read at import time by both scripts. Edit here to change prompt for both pipelines.
|
|
|
|
** evidence
|
|
- commit: 683bfb3 (remove hyphen), f3abbef
|
|
- tests: 18 passing (pytest tests/analysis_gpt4o_batch.py), 46 total across suite
|
|
- datetime: [2026-05-05 Tue 17:00]
|
|
|
|
* [X] t1.2.2: Tokenizer / Batch mgmt
|
|
openai batch analysis requires coordination - more like a job queue.
|
|
batch script should setup queue for user to setup manually; openai api will reject subsequent batches when the total daily token limit is maxed.
|
|
** Acceptance Criteria
|
|
1. add token estimator utility script, probably to /analysis
|
|
2. add MODEL_LIMITS dict to analysis_batch.py. if there are more than (n)
|
|
- gpt-4o (30k tpm/90k tpd batch)
|
|
- gpt-4o-mini (200k tpm/2M tpd batch)
|
|
- add models listed in docs/openai.md
|
|
3. Auto-chunk submit: before writing the request file, walk comments, accumulate estimated tokens, and split into chunks that fit under the model's limit.
|
|
- Each chunk becomes its own batch submission with its own run_id.
|
|
- Drop --limit (or keep as hard cap override).
|
|
- Print all run_ids
|
|
- Submit the first batch only (failed)
|
|
4. Update test script to show tokenizer output
|
|
|
|
** notes
|
|
- MODEL_LIMITS and _MODEL_ENCODING dicts in analysis/gpt4o/analysis_batch.py; keyed by model name, sourced from docs/openai.md. Unknown models fall back to o200k_base encoding and 900k token limit.
|
|
- estimate_tokens(messages, model): uses tiktoken (o200k_base) when available; falls back to chars/3 + 4 overhead per message.
|
|
- chunk_comments_by_tokens(comments, forum, model): greedy bin-pack; respects 10% headroom (_LIMIT_BUFFER=0.90). Returns list of comment lists.
|
|
- submit sends only chunks[0] — enqueued token limit is a TOTAL across all concurrent batches; stacking would exceed quota. Remaining chunk ranges are printed as manual instructions.
|
|
- --limit N still available as a hard cap on total comments before chunking (useful when org-tier limit is below the published model limit).
|
|
- pip install tiktoken required for exact token counting; chars/3 fallback activates automatically if not installed.
|
|
|
|
|
|
*** usage
|
|
- `pip install tiktoken`
|
|
- submit first chunk (auto-sized to model token limit, uses most recent output file)
|
|
`python analysis/gpt4o/analysis_batch.py submit output/f452.jsonl --model gpt-4o-mini`
|
|
- check status (defaults to most recent run)
|
|
`python analysis/gpt4o/analysis_batch.py status`
|
|
- download + normalize when complete
|
|
`python analysis/gpt4o/analysis_batch.py download`
|
|
- submit next chunk: rerun with `--limit` to cover the next N comments
|
|
(track which comment_ids have already been analyzed to avoid duplicates)
|
|
|
|
*** validation
|
|
#+begin_src python
|
|
import pandas as pd
|
|
df_input = pd.read_json('C:/Users/moses/projects/vath/analysis/gpt4o/runs/75ee9a/f452.jsonl', lines=True)
|
|
# drop forum item
|
|
df_input_comments = df_input[df_input["comment_id"].notna()].copy()
|
|
df_output = pd.read_json('C:/Users/moses/projects/vath/analysis/gpt4o/runs/75ee9a/75ee9a6c-8fc2-4924-8d96-b55bb4d5e832_gpt-4o.jsonl', lines=True)
|
|
dfm = df_output.merge(df_input_comments,on="comment_id",how="left",suffixes=("","_input"),)
|
|
dfm.to_csv('C:/Users/moses/projects/vath/analysis/gpt4o/1.csv')
|
|
#+end_src
|
|
order columns:
|
|
forum_id_input,comment_id,title,text,date,author,stance,stance_confidence,stance_rationale,tone,tags,error,truncated,analyzed_at,prompt_version,model
|
|
|
|
** evidence
|
|
- commit:
|
|
- tests: 23 passing (pytest tests/analysis_gpt4o_batch.py), 51 total across suite
|
|
- datetime: [2026-05-06 Wed 08:55]
|
|
|
|
* [X] t1.2.3: batch job refactor
|
|
This task encompasses intent and fixes for 1.2.1 and 1.2.2.
|
|
batch processing should be a resumable job queue, not a one-shot script. the user should not need to remember offsets, completed chunks, failed batches, or which comments remain.
|
|
** Acceptance Criteria
|
|
1. create tokenizer to prepare the batch job
|
|
- input: prompt.txt, forum.jsonl
|
|
- output: report.json with each model's batch structure, cost, and time (considering tpd constraints)
|
|
- analysis_batch should be able to take this report to run the job. good place to copy the raw scraper jsonl
|
|
#+begin_src python
|
|
{'prompt': 'prompt1.txt',
|
|
'input_file': 'f451.jsonl',
|
|
'input_tokens': 123456789,
|
|
'gpt-4o': {'jobs':71,'cost_$':4,'est_queue_days':3} # divide tokens by model TPD to get time_days
|
|
'gpt-4o-mini': {'jobs':71,'cost_$':4,'est_queue_days':3} # divide tokens by model TPD to get time_days
|
|
#+end_src
|
|
2. batch py should contain commands to create, check, run, and complete jobs.
|
|
- inputs: report.json, --model, optional --job N, read api key from .env
|
|
- outputs:
|
|
- status.json: job structure, status, metadata; updated when jobs are finished. includes all report.json info
|
|
- for each job: jobN-input.jsonl (what is sent to openai); jobN-output-raw.jsonl, jobN-output.jsonl, and jobN-errors.jsonl (when downloaded)
|
|
- jobN-output.jsonl contains:
|
|
- one analysis record per comment
|
|
- `run_id`, `forum_id`, `comment_id`, `analyzed_at`, `model`, `prompt_version`
|
|
- `stance` toward proposed reg/guidance: support|oppose|neutral|unclear
|
|
- `stance_confidence`: 0-1
|
|
- short rationale, if provided by model
|
|
- generic sentiment `tone` (separate from stance): positive|negative|neutral|mixed|unclear
|
|
- `tags` for later grouping, may be empty
|
|
- commands: `create`, `submit`, `status`, `download`
|
|
- `create` run directory, copy input/prompt/report, generate status.json, job request files
|
|
- `submit` if eligible, submit next or specified job; does not blindly stack jobs, warns if prev jobs in progress, print next action
|
|
- `status` check status of one or all submitted jobs, update status.json
|
|
- `download` raw output (jobN-output-raw.jsonl) and error files for completed jobs, and normalize raw output (jobN-output.jsonl) auto run status.
|
|
3. tests without live api calls
|
|
- partial completed run
|
|
- failed batch records
|
|
- out-of-order output
|
|
- duplicate custom_id
|
|
- missing output file
|
|
- resume from status.json
|
|
- remaining-comment detection
|
|
|
|
** notes
|
|
- analysis/tokenizer.py: new standalone script; imports openai_batch for MODEL_LIMITS, estimate_tokens, build_messages. Reads input JSONL + prompt, computes per-model jobs/cost/time table, writes reports/<stem>-report.json. MODEL_PRICING dict lives here (not in openai_batch). Pass a jobN-input.jsonl to count actual tokens instead.
|
|
- analysis/openai_batch.py: fully rewritten with four subcommands: create, submit, status, download. Job dirs at analysis/jobs/<stem[:8]>-N/.
|
|
- Job directories: analysis/jobs/<stem[:8]>-N/ (e.g. f452-1). Each run is self-contained: forum.jsonl, prompt.txt, report.json, jobN-input.jsonl, jobN-output-raw.jsonl, jobN-output.jsonl, jobN-errors.jsonl.
|
|
- status.json: tracks all jobs with pending/submitted/in_progress/completed/failed states. Updated by submit, status, download.
|
|
- _find_next_eligible_job: pure function for testability. Returns (next_pending_job, None) or (None, warning). Blocks submission if previous job is in_progress/submitted.
|
|
- create: no API key required. Reads report.json, re-chunks comments, writes all jobN-input.jsonl files, writes status.json.
|
|
- submit: uploads jobN-input.jsonl to Files API, creates batch, updates status.json to 'submitted'. Will not stack batches.
|
|
- status: retrieves batch from OpenAI, updates status.json counts and status.
|
|
- download: auto-runs status first, downloads output_file_id → jobN-output-raw.jsonl, error_file_id → jobN-errors.jsonl, normalizes → jobN-output.jsonl. Updates status.json.
|
|
- tests/tokenizer.py: 19 tests for compute_report schema, cost/time calculation, MODEL_PRICING coverage, print_table output, count_input_tokens, report.json round-trip.
|
|
- Token limit buffer: _LIMIT_BUFFER=0.80 (20% headroom). Estimate uses OpenAI cookbook chat formula (role tokens + 3-token reply primer). Verify a job file with: python analysis/tokenizer.py analysis/jobs/<dir>/jobN-input.jsonl
|
|
|
|
*** usage
|
|
#+begin_src powershell
|
|
# 1. estimate tokens and cost
|
|
python analysis/tokenizer.py output/f452.jsonl --prompt analysis/prompt-1.txt
|
|
# writes reports/f452-report.json
|
|
|
|
# 2. verify actual tokens in a job file (optional sanity check)
|
|
python analysis/tokenizer.py analysis/jobs/f452-1/job1-input.jsonl
|
|
|
|
# 3. create job directory (no api key needed)
|
|
python analysis/openai_batch.py create reports/f452-report.json --model gpt-5.4-mini
|
|
# creates analysis/jobs/f452-1/
|
|
|
|
# 4. submit first job
|
|
python analysis/openai_batch.py submit
|
|
|
|
# 5. check status (repeat until completed)
|
|
python analysis/openai_batch.py status
|
|
|
|
# 6. download and normalize
|
|
python analysis/openai_batch.py download
|
|
|
|
# 7. submit next job (if multi-job run), then repeat 5-6
|
|
python analysis/openai_batch.py submit
|
|
#+end_src
|
|
|
|
** evidence
|
|
- commit:
|
|
- tests: passing (pytest tests/openai_batch.py tests/openai_realtime.py tests/tokenizer.py)
|
|
- datetime: [2026-05-06 Wed]
|
|
|
|
* [X] t1.3: cleanup model output and rejoin
|
|
create a lightweight validation script that joins raw comments to normalized analysis output and writes a human-reviewable csv.
|
|
review create_csv for the simple approach - keep this regardless
|
|
|
|
** acceptance criteria
|
|
1. input raw scrape jsonl and all *-output.jsonl files in a dir
|
|
2. join by comment_id, not dataframe index
|
|
3. output csv columns in review order:
|
|
- forum_id, comment_id, title, text, date, author
|
|
- stance, stance_confidence, stance_rationale, tone, tags
|
|
- error, truncated, analyzed_at, prompt_version, model
|
|
4. output parquet?
|
|
5. print validation counts
|
|
- raw comments
|
|
- analyzed records
|
|
- joined records
|
|
- missing comment text
|
|
- duplicate comment_ids
|
|
- error records
|
|
- stance counts
|
|
- tone counts
|
|
6. tests cover join behavior and missing/duplicate ids
|
|
|
|
** notes
|
|
- analysis/create_csv.py: reads raw scrape JSONL + all job*-output.jsonl in a job dir (skips *-output-raw.jsonl); left-joins on comment_id; writes review.csv (UTF-8 BOM for Excel); optional --parquet.
|
|
- Uses pd.read_json(path, lines=True) — no manual JSON parsing.
|
|
- Prints summary counts: raw/analyzed/joined/unanalyzed/errors/duplicate IDs, stance distribution, tone distribution.
|
|
|
|
*** usage
|
|
#+begin_src sh
|
|
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/
|
|
python analysis/create_csv.py output/f452.jsonl analysis/jobs/f452-1/ --parquet
|
|
# output: analysis/jobs/f452-1/review.csv (and optionally review.parquet)
|
|
#+end_src
|
|
|
|
** evidence
|
|
- commit: 28d6d22
|
|
- tests: passing (pytest tests/create_csv.py tests/encoding.py)
|
|
- csv: analysis/jobs/f452-1/review.csv
|
|
- datetime: [2026-05-07 Thu 17:23]
|
|
|
|
* [X] t1.1.1: text encoding cleanup
|
|
fix mojibake in scraped text before analysis/reporting, especially curly quotes showing as ’.
|
|
|
|
|
|
** acceptance criteria
|
|
1. identify whether mojibake exists in raw scrape, analysis output, or csv export only
|
|
2. add repair step at the earliest correct layer
|
|
3. preserve original raw scrape if repair changes source text
|
|
4. add test cases for common bad sequences:
|
|
- ’
|
|
- “
|
|
- â€
|
|
- –
|
|
- —
|
|
5. document whether repaired text is used for model input
|
|
|
|
** notes
|
|
- Diagnosis: f452.jsonl raw data is CLEAN — proper Unicode throughout (U+2019, U+201C, etc.). The DEFAULT_RESPONSE_ENCODING=utf-8 spider setting is working for this site. No mojibake or FFFD chars found.
|
|
- The encoding issue would surface for forums whose server sends cp1252 bytes (0x91-0x97 range) embedded in otherwise UTF-8 content. FFFD replacement chars appear when the UTF-8 decoder hits those bytes. Once the byte is replaced by FFFD, the original character cannot be recovered.
|
|
- Repair layer: analysis/encoding.py applied in analysis/validate.py at reporting time. Raw scrape JSONL is never modified (AC3).
|
|
- Model input: repair_text() is NOT applied in build_messages() for this dataset since raw data is clean. Can be added if a future forum produces dirty text.
|
|
- Spider: DEFAULT_RESPONSE_ENCODING=utf-8 remains. If a future forum genuinely sends cp1252, change to 'cp1252' and apply ftfy post-decode in the item pipeline.
|
|
|
|
** evidence
|
|
- commit: 1ea696d
|
|
- tests: passing (pytest tests/encoding.py)
|
|
- before/after sample: N/A — f452.jsonl is clean; tests cover synthetic mojibake patterns
|
|
- datetime: [2026-05-07 Thu 17:00]
|
|
|
|
* [X] t1.4: graph data prototype
|
|
create ./viz/prototype_charts.py generating individual plotly charts for exploring graphs to embed into streamlit or dash later
|
|
|
|
** acceptance criteria
|
|
2. create graph for Stance/Share
|
|
- stacked h-bar with % support/oppose/neutral/unknown + raw totals, eg 63% (5720) / 37% (3320) / 0.09% (8) / 0.37% (34)
|
|
- later, consider centered diverging h-bar: oppose ← | neutral/unknown | → support
|
|
3. create graph for Stance/Time:
|
|
- cumulative support/oppose % over time
|
|
4. create graph for Stance/Tone (heatmap count)
|
|
5. create graph for Confidence/Stance (boxplot or histogram)
|
|
|
|
** notes
|
|
- prototyped in plotly
|
|
- initial streamlit
|
|
|
|
** evidence
|
|
- commit: 3fb424d
|
|
- tests: see viz/proto and viz/chart_tests
|
|
- datetime: [2026-05-08 Fri 08:38]
|
|
|
|
* [X] t1.5: streamlit
|
|
create organized webpage displaying useful information from completed job and analysis
|
|
|
|
** acceptance criteria
|
|
1. display total stance breakdown
|
|
2. display centered horiz-bar with absolute stances
|
|
3. show daily comment stances and cumulative
|
|
4. show comment table with filters for stance (filter tone?)
|
|
5. clicking/selecting a comment shows full text and model rationale
|
|
6. app runs locally with one command
|
|
|
|
** notes
|
|
data pulls entirely from the job; goal is to point viz/streamlit.py at any job/ folder and have everything it needs
|
|
|
|
** evidence
|
|
- commit: cc16acb
|
|
- tests: from root dir, `streamlit run viz/streamlit.py <job-dir>`
|
|
- datetime: [2026-05-08 Fri 23:44]
|
|
|
|
* [ ] t1.6 host streamlit via dockerfile
|
|
planning to deploy manually, get cert, etc etc. probably dont care about https?
|
|
** acceptance criteria
|
|
1. write dockerfile with slim image
|
|
|
|
** notes
|
|
|
|
* === Backlog ===
|
|
- add forum_url, forum_collected_date to scraper (to add to viz)
|
|
* [ ] X: complete proposal information
|
|
Ensure we capture as much useful information as possible about the actual proposal - contact information, etc. what the state actually says about what was posted.
|
|
** acceptance criteria
|
|
1. Item: `Forum` stores id, url, proposal title, description, open/close date, number of comments, agency, board, guidance document id
|
|
- add details for guidanceDoc, publication date, comments, guidance docs - eg: https://www.townhall.virginia.gov/L/GDocForum.cfm?GDocForumID=452
|
|
2. Item: `Comment` stores forum_id, comment_id, author, title, text, date, url
|
|
* [ ] X: add helper data to create_csv
|
|
1. in create_csv.py, create helper columns:
|
|
- stance_signed = {"support":1, "oppose":-1, "neutral":0, "unknown":0}
|
|
- stance_weighted = stance_signed * stance_confidence
|
|
- is_support_oppose = stance in ["support", "oppose"]
|
|
- date_day
|
|
- date_hour
|
|
- text_norm
|
|
- text_hash
|
|
- confidence_bucket = 'low' <.7 | 'med' .7-.89 | 'high' >=.9
|