Routing LLMs by task verifiability: a small experiment (n=120, 3 models) inspired by Karpathy's framework [D]

Full disclosure: this is directional, not a paper. n=120 tasks, one internal evaluator, not peer reviewed. I work at an LLM infrastructure company. This experiment was done on my own time and is not a company claim.

Karpathy's framework classifies tasks by verifiability: can the output be mechanically checked? High-verifiability tasks like code compilation and structured JSON extraction are safer because the verifier catches errors. Low-verifiability tasks like creative writing are riskier. I wondered whether high-verifiability tasks are also easier in practice. Can a weaker model do them as well as a frontier model if the verifier catches its mistakes?

Setup: 120 tasks across four categories (code unit tests, structured extraction, multi-hop reasoning, creative summarization). Three models: Claude Sonnet 4.6, GPT 5.5, and a local Mistral 3 8B served via vLLM 0.6.3. Pass rate for the first two, human rating 1 to 5 for the last two.

Results were messy.

Code unit tests: Sonnet 4.6 94%, GPT 5.5 91%, Mistral 3 8B 87%. With one retry, Mistral 3 hit 95%. That surprised me. I expected the gap to be bigger.

Structured extraction: Sonnet 4.6 97%, GPT 5.5 94%, Mistral 3 8B 89%. With one retry, Mistral 3 hit 96%. Also closer than I expected. But here is where it got weird. Sonnet 4.6 initially scored worse than GPT 5.5 on structured extraction, which made no sense. It turned out our JSON schema had an ambiguous nested array that confused Claude's tool-use parser. Fixing the schema brought Sonnet to 98%, but I kept the original numbers in the table because the mistake is part of the story. Your verifier is only as good as your schema.

Multi-hop reasoning: Sonnet 4.6 78%, GPT 5.5 71%, Mistral 3 8B 51%. Retry didn't help, because the model would hallucinate the same kinds of reasoning paths consistently. This is where the capability gap was real.

Creative summarization: Sonnet 4.6 4.2 out of 5, GPT 5.5 3.9 out of 5, Mistral 3 8B 3.1 out of 5. As expected.
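For anyone who wants to try the verifier-plus-retry part, here is a minimal sketch of what that loop could look like. It is not the harness from the experiment. The field names, regex patterns, and the `make_flaky_model` stub are hypothetical, the hand-rolled `verify` is only a stdlib stand-in for a real JSON Schema check, and feeding the verifier's error back into the retry prompt is an assumption (the write-up above only says "one retry").

```python
import json
import re
from typing import Callable, Optional, Tuple

# Hypothetical field -> regex rules, standing in for "JSON Schema plus regexes".
REQUIRED_FIELDS = {
    "invoice_id": r"INV-\d{4}",
    "total": r"\d+\.\d{2}",
}


def verify(raw: str) -> Optional[str]:
    """Return None if the output passes, otherwise a short error message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value must be an object"
    for field, pattern in REQUIRED_FIELDS.items():
        value = data.get(field)
        if not isinstance(value, str):
            return f"missing or non-string field: {field}"
        if re.fullmatch(pattern, value) is None:
            return f"field {field} does not match {pattern}"
    return None


def generate_with_retry(
    generate: Callable[[str], str], prompt: str, max_retries: int = 1
) -> Tuple[Optional[str], int]:
    """Call the model, verify the output, and retry with the error appended.

    Returns (accepted_output_or_None, number_of_attempts_made).
    """
    attempt_prompt = prompt
    for attempt in range(max_retries + 1):
        output = generate(attempt_prompt)
        error = verify(output)
        if error is None:
            return output, attempt + 1
        # Assumption: the retry prompt includes the verifier's complaint.
        attempt_prompt = f"{prompt}\n\nPrevious output was rejected: {error}"
    return None, max_retries + 1


def make_flaky_model() -> Callable[[str], str]:
    """Hypothetical stub model: wrong on the first call, right afterwards."""
    calls = {"n": 0}

    def model(prompt: str) -> str:
        calls["n"] += 1
        if calls["n"] == 1:
            return '{"invoice_id": "1234", "total": "19.99"}'
        return '{"invoice_id": "INV-1234", "total": "19.99"}'

    return model


if __name__ == "__main__":
    result, attempts = generate_with_retry(make_flaky_model(), "Extract the invoice.")
    print(result, attempts)
```

In a real run, `generate` would wrap whatever client you use to call the model. The sketch also shows why a schema bug hurts: if a rule in `REQUIRED_FIELDS` is wrong or ambiguous, correct outputs get rejected and the pass rate measures the verifier, not the model.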
Interpretation: high-verifiability tasks seem simpler in the sense that a weaker model plus a verifier can approach frontier performance. Low-verifiability tasks show the expected gap.

Limitations: n=120 is tiny. I'd want roughly 10x that for confidence. Our verifier is just JSON Schema plus regexes. Constrained decoding might change the calculus entirely. I also didn't control for prompt length well. Any prompt over 8k tokens was excluded because Mistral 3 8B degrades near its limit, which probably skewed the sample.

submitted by /u/DragonfruitAlone4497
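On the sample-size limitation: a quick way to see how noisy a pass rate is at a given n is a Wilson score interval. The sketch below is a generic stdlib implementation, not something from the experiment. The counts in the usage example are hypothetical, since the per-category task counts are not given above.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% coverage)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical counts: the same pass rate at n=30 and at 10x that.
    for passes, n in [(26, 30), (260, 300)]:
        low, high = wilson_interval(passes, n)
        print(f"{passes}/{n}: {low:.3f} to {high:.3f}")
```

With the same observed pass rate, the interval at n=30 comes out about three times wider than at n=300, so gaps of a few points between models at the smaller size are well inside the noise.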
