Model Selection Science
To reduce scoring costs, an eval framework was built to test whether local Ollama models could match Claude's quality. The framework uses a stratified eval set of 100 items (n=100, seed 42) and anchors every comparison against two baselines, Sonnet and Haiku.
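As an illustration of how a seeded, stratified set of 100 items might be drawn, here is a minimal sketch. It is not the framework's actual code: the function name `stratified_sample`, the `key` callback, and the proportional (largest-remainder) allocation are all assumptions, since the source does not say which field the set is stratified on or how slots are allocated.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n=100, seed=42):
    """Draw n items, giving each stratum a share proportional to its size.

    Hypothetical helper: `key` maps an item to its stratum label.
    """
    if n > len(items):
        raise ValueError("sample size exceeds population size")

    rng = random.Random(seed)  # fixed seed makes the draw reproducible

    # Group items by stratum.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Largest-remainder allocation: floor each proportional quota,
    # then hand leftover slots to the largest fractional parts.
    quotas = {k: n * len(v) / len(items) for k, v in strata.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1

    # Sample within each stratum, visiting strata in a stable order.
    sample = []
    for k in sorted(strata, key=str):
        sample.extend(rng.sample(strata[k], alloc[k]))
    return sample
```

Fixing the seed means every model, including both baselines, is scored on the same items, so differences in agreement cannot come from a different draw.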
Multi-Model Sweep
Six general-purpose Ollama models (llama3.2, llama3.1, gemma3, mistral, qwen3:8b, gpt-oss:20b) were evaluated. To meet the Haiku-anchored threshold, a model's agreement with Sonnet must be within 5 percentage points of Haiku's own agreement with Sonnet, and at least 95% of its outputs must be valid JSON.
No local model met the full threshold, but the eval infrastructure proved invaluable for the subsequent OpenRouter council scoring initiative.
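The Haiku-anchored threshold can be expressed as a small check, sketched below under several assumptions. Agreement is taken to be the exact-match rate against Sonnet's labels, "within 5 percentage points" is read as "no more than 5 points below Haiku" (so a model that beats Haiku also passes), and validity is the share of raw outputs that parse as JSON. All function names are hypothetical and the numbers in the usage lines are illustrative, not measured results.

```python
import json


def agreement(labels, reference):
    """Fraction of items where a model's label equals the reference label."""
    matches = sum(a == b for a, b in zip(labels, reference))
    return matches / len(reference)


def json_validity(raw_outputs):
    """Fraction of raw model outputs that parse as JSON."""
    def is_valid(text):
        try:
            json.loads(text)
            return True
        except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
            return False

    return sum(is_valid(t) for t in raw_outputs) / len(raw_outputs)


def meets_haiku_anchored_threshold(model_agreement, haiku_agreement, validity,
                                   margin=0.05, min_validity=0.95):
    """True if the model trails Haiku by at most `margin` and emits enough valid JSON.

    Assumed reading of the rule: only a shortfall relative to Haiku counts.
    """
    gap = haiku_agreement - model_agreement
    # Small tolerance so a gap of exactly 5 points is not rejected by float rounding.
    return gap <= margin + 1e-9 and validity >= min_validity


# Illustrative values only.
print(meets_haiku_anchored_threshold(0.78, 0.82, 0.97))  # True: 4-point gap, 97% valid
print(meets_haiku_anchored_threshold(0.70, 0.82, 0.97))  # False: 12-point gap
print(meets_haiku_anchored_threshold(0.80, 0.82, 0.90))  # False: validity below 95%
```

Anchoring on Haiku's own agreement with Sonnet, rather than on a fixed accuracy target, measures a local model against the disagreement that already exists between the two Claude baselines.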