Imouto: Local Model Scoring and Eval Framework

Model Selection Science

To reduce scoring costs, an eval framework was built to test whether local Ollama models could match Claude's quality. The framework uses a stratified n=100 eval set (seed 42) with both Sonnet and Haiku baselines as anchors.
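A minimal sketch of how a reproducible stratified n=100 sample could be drawn. The source does not say which field the set is stratified on, so the `key` function and the proportional (largest-remainder) allocation here are assumptions, and `stratified_sample` is a hypothetical helper rather than the framework's actual code.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n=100, seed=42):
    """Draw n items, allocating slots to each stratum in proportion to its size.

    `key` maps an item to its stratum label (hypothetical; the real
    stratification field is not given in the source).
    """
    if n > len(items):
        raise ValueError("sample size exceeds population size")

    rng = random.Random(seed)  # fixed seed makes the eval set reproducible

    # Group items by stratum label.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Proportional allocation: floor each quota, then hand the remaining
    # slots to the strata with the largest fractional remainders.
    total = len(items)
    quotas = {label: len(group) * n / total for label, group in strata.items()}
    counts = {label: int(quota) for label, quota in quotas.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda label: quotas[label] - counts[label], reverse=True)
    for label in by_remainder[:leftover]:
        counts[label] += 1

    # Sample within each stratum, visiting strata in sorted order so the
    # result does not depend on input ordering of the labels.
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], counts[label]))
    return sample
```

Calling it twice with the same seed returns the same items, which is what lets every model in a sweep be scored against an identical eval set.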

Multi-Model Sweep

Six general-purpose Ollama models (llama3.2, llama3.1, gemma3, mistral, qwen3:8b, gpt-oss:20b) were evaluated. The Haiku-anchored threshold requires a model's agreement with Sonnet to fall within 5 percentage points of Haiku's own agreement with Sonnet, plus at least 95% JSON validity.
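The threshold can be sketched as a small pass/fail check. This is an illustration under stated assumptions, not the framework's implementation: the function names are hypothetical, agreement is taken to be the exact-match rate against Sonnet's labels, "within 5 percentage points" is read as "no more than 5 points below Haiku" (so a model that beats Haiku also passes), and JSON validity is the share of raw outputs that parse.

```python
import json


def agreement(labels, reference):
    """Fraction of items where a model's label equals the reference (Sonnet) label."""
    if not reference or len(labels) != len(reference):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for a, b in zip(labels, reference) if a == b)
    return matches / len(reference)


def json_validity(raw_outputs):
    """Fraction of raw model outputs that parse as JSON."""
    if not raw_outputs:
        raise ValueError("no outputs to check")
    valid = 0
    for text in raw_outputs:
        try:
            json.loads(text)
            valid += 1
        except (ValueError, TypeError):  # JSONDecodeError subclasses ValueError
            pass
    return valid / len(raw_outputs)


def meets_threshold(candidate_agreement, haiku_agreement, validity,
                    max_gap=0.05, min_validity=0.95):
    """Haiku-anchored check; all rates are fractions in [0, 1]."""
    gap = haiku_agreement - candidate_agreement
    # Small tolerance so a gap of exactly 5 points is not rejected
    # because of floating-point rounding.
    return gap <= max_gap + 1e-9 and validity >= min_validity
```

Anchoring on Haiku's agreement rather than a fixed number means the bar moves with how hard the eval set is: a local model only has to be about as close to Sonnet as the cheaper Claude baseline already is.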

No local model met the full threshold, but the eval infrastructure proved invaluable for the subsequent OpenRouter council scoring initiative.

Features Delivered

Eval Framework

Model Sweep