Imouto

Imouto: Local Model Scoring and Eval Framework

Built an eval framework to objectively compare local Ollama models against Anthropic baselines, with n=100 stratified eval sets, multi-model sweep, and Haiku-anchored quality thresholds.

2 Phases
6 Tasks
1 Day

Model Selection Science

To reduce scoring costs, an eval framework was built to test whether local Ollama models could match the quality of Anthropic's Claude models. The framework uses a stratified eval set of n=100 items, drawn with a fixed seed (42) so the set is reproducible, and anchors every comparison to both a Sonnet and a Haiku baseline.
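A stratified, seeded draw of this kind can be sketched as follows. This is a minimal illustration, not the framework's actual code: the `stratum` field, the `stratified_sample` name, and the proportional allocation rule are assumptions, since the source specifies only n=100 and seed 42.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n=100, seed=42):
    """Draw n items, giving each stratum a share proportional to its size.

    `key` maps an item to its stratum label (hypothetical; the real
    stratification variable is not stated in the source).
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible

    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    total = len(items)
    if n > total:
        raise ValueError("n exceeds the number of available items")

    # Integer proportional allocation: floor share first, then hand the
    # leftover slots to the strata with the largest remainders.
    alloc, remainder = {}, {}
    for label, members in groups.items():
        alloc[label], remainder[label] = divmod(n * len(members), total)
    leftover = n - sum(alloc.values())
    for label in sorted(groups, key=lambda g: remainder[g], reverse=True)[:leftover]:
        alloc[label] += 1

    sample = []
    for label, members in groups.items():
        sample.extend(rng.sample(members, alloc[label]))
    return sample
```

Calling `stratified_sample(pool, key=lambda x: x["stratum"])` twice on the same pool returns the same 100 items, which is what lets every model in a sweep be scored on an identical set.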

Multi-Model Sweep

Six general-purpose Ollama models (llama3.2, llama3.1, gemma3, mistral, qwen3:8b, gpt-oss:20b) were evaluated. To pass the Haiku-anchored threshold, a model's agreement with Sonnet must fall within 5 percentage points of Haiku's own agreement with Sonnet, and at least 95% of its responses must be valid JSON.
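The threshold check can be sketched as below. The names (`ModelResult`, `agreement`, `json_validity`, `meets_threshold`) are hypothetical, and one interpretation is assumed: "within 5 percentage points" is read as "no more than 5 points below Haiku", so a model that beats Haiku also passes. The source does not say how agreement is computed; exact label match is used here for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    agreement_with_sonnet: float  # fraction in [0, 1]
    json_validity: float          # fraction in [0, 1]


def agreement(labels, reference):
    """Fraction of items whose label exactly matches the reference label."""
    if len(labels) != len(reference):
        raise ValueError("label lists must have the same length")
    if not labels:
        return 0.0
    return sum(a == b for a, b in zip(labels, reference)) / len(labels)


def json_validity(responses):
    """Fraction of raw response strings that parse as JSON."""
    if not responses:
        return 0.0
    valid = 0
    for text in responses:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass  # malformed output counts against the model
    return valid / len(responses)


def meets_threshold(candidate, haiku, max_gap_pp=5.0, min_json_validity=0.95):
    """Haiku-anchored check: agreement gap in percentage points plus JSON validity."""
    gap_pp = (haiku.agreement_with_sonnet - candidate.agreement_with_sonnet) * 100
    return gap_pp <= max_gap_pp and candidate.json_validity >= min_json_validity
```

Anchoring to Haiku rather than to a fixed agreement number means the bar moves with the difficulty of the eval set: a local model only has to be about as close to Sonnet as a cheaper Anthropic model already is.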

No local model met the full threshold, but the eval infrastructure proved valuable for the subsequent OpenRouter council scoring initiative.

Features Delivered

Eval Framework

  • Stratified eval set — n=100 eval set with seed 42, Sonnet + Haiku baselines

Model Sweep

  • 6-model sweep — Automated evaluation across all general-purpose Ollama models