Imouto: OpenRouter Council Scoring - Devlog

Council Scoring at Scale

A three-model council running on OpenRouter cloud inference (gemma4:e4b, llama3.2, mistral) replaced the planned local Ollama scoring. All three models score each task concurrently via ThreadPoolExecutor. The automatable classification is decided by majority vote, and the models' ai_tools suggestions are merged and deduplicated.
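A minimal sketch of the council step described above, not the project's actual code. It assumes each council member is a callable that takes a task and returns a dict with an automatable label and an ai_tools list; the function name council_score, the result shape, and the case-insensitive deduplication rule are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def council_score(task, scorers):
    """Score one task with every council member concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=len(scorers)) as pool:
        # pool.map returns results in the same order as `scorers`.
        results = list(pool.map(lambda score: score(task), scorers))

    # Majority vote on the classification label. On a tie, Counter
    # returns the label that was seen first.
    votes = Counter(r["automatable"] for r in results)
    automatable = votes.most_common(1)[0][0]

    # Merge tool suggestions, dropping case-insensitive duplicates
    # while keeping first-seen order (an assumed dedup rule).
    seen, ai_tools = set(), []
    for r in results:
        for tool in r.get("ai_tools", []):
            key = tool.strip().lower()
            if key not in seen:
                seen.add(key)
                ai_tools.append(tool.strip())

    return {"automatable": automatable, "ai_tools": ai_tools, "votes": dict(votes)}
```

Threads suit this workload because each council call spends its time waiting on a network response rather than on CPU work.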

Provider Abstraction

A providers/ package with BaseProvider, AnthropicProvider, OllamaProvider, and OpenRouterProvider abstracts the inference backend. The --provider and --council flags on score.py make it possible to switch backends from the command line.
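One way the abstraction and flags could fit together, as a sketch under stated assumptions. Only the BaseProvider name and the --provider and --council flags come from this devlog. The complete method, the EchoProvider stand-in, the PROVIDERS registry, the default backend, and treating --council as a boolean switch are all hypothetical.

```python
import argparse
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    """Common interface; the `complete` method name is an assumption."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        ...


class EchoProvider(BaseProvider):
    """Hypothetical stand-in for a real backend; makes no network calls."""

    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Hypothetical registry mapping --provider values to factories.
PROVIDERS = {
    name: (lambda n=name: EchoProvider(n))
    for name in ("anthropic", "ollama", "openrouter")
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="score.py")
    # The default backend here is an assumption.
    parser.add_argument("--provider", choices=sorted(PROVIDERS), default="openrouter")
    # Assumed to be a boolean switch; the real flag may take arguments.
    parser.add_argument("--council", action="store_true")
    return parser


def get_provider(name: str) -> BaseProvider:
    return PROVIDERS[name]()
```

With this shape, the scoring code depends only on BaseProvider, so adding a backend means adding one class and one registry entry.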

Bulk Run

18,615 US tasks were scored with checkpoint/resume support and a progress display showing per-task timing and ETA. After the bulk run, Sonnet-graded validation on 30 stratified tasks confirmed quality within 5 percentage points of eval-time agreement. A SQLite database layer replaced the JSON file to provide atomic writes.
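The atomic-write behaviour can be illustrated with Python's built-in sqlite3 module. This is a sketch only: the scores table, its columns, and the helper names are hypothetical, not the project's schema.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) scores table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        " task_id TEXT PRIMARY KEY,"
        " automatable INTEGER NOT NULL,"
        " ai_tools TEXT NOT NULL)"
    )
    return conn


def save_score(conn, task_id, automatable, ai_tools):
    # `with conn` commits on success and rolls back on an exception,
    # so a crash mid-write cannot leave a half-written row.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO scores VALUES (?, ?, ?)",
            (task_id, int(automatable), json.dumps(ai_tools)),
        )


def scored_ids(conn):
    """Task ids already stored; a resumed run can skip these."""
    return {row[0] for row in conn.execute("SELECT task_id FROM scores")}
```

Compared with rewriting one large JSON file, each row is committed independently, so an interrupted run loses at most the task in flight.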

Features Delivered

Provider Package

Provider abstraction — BaseProvider, Anthropic, Ollama, OpenRouter implementations

Council Scorer

3-model majority voting — Concurrent council calls with merged results
Checkpoint and resume — JSON checkpoints for long-running bulk jobs
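A sketch of how JSON checkpointing with resume might look, assuming each task is a dict with an id field and that a checkpoint is written every N tasks. The function names, the every parameter, and the file layout are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return previously saved results, or an empty dict on a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_checkpoint(path, results):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(results, fh)
    os.replace(tmp, path)


def run(tasks, score, path, every=100):
    """Score tasks, skipping any already present in the checkpoint."""
    results = load_checkpoint(path)
    for i, task in enumerate(tasks, 1):
        key = str(task["id"])  # JSON object keys are always strings
        if key in results:
            continue  # scored in an earlier run
        results[key] = score(task)
        if i % every == 0:
            save_checkpoint(path, results)
    save_checkpoint(path, results)
    return results
```

Calling run again with the same path after an interruption picks up where the last saved checkpoint left off.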

Bulk Scoring

18,615 tasks scored — Full US catalogue via OpenRouter council
SQLite database layer — Atomic writes replacing JSON file