Will the frontier model performance gap (top-1 vs top-5 on standard benchmarks) compress to less than 5 percentage points by Q4 2026?
Resolves Dec 31, 2026
44% probability (4-model average)
LOW confidence (building category history)
Stability: not yet available (loading)
Diverging models (37pt spread)
The three supporting readings tell you how much weight to put on the probability: confidence reflects the category-level track record, stability tracks how the estimate has moved over time, and models shows whether the four models agree.
Per-model probabilities
GPT-4o: 45% (+1 vs avg)
Gemini: 40% (-4 vs avg)
Claude: 28% (-16 vs avg)
Grok: 65% (+21 vs avg)
Key disagreement: Grok (65%) vs Claude (28%), different weighting of factors
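The summary figures above (the 4-model average, the spread, and each model's difference from the average) can be approximately reproduced from the four per-model probabilities. The sketch below is only an illustration: the function name `summarize` and the dictionary layout are hypothetical, and it assumes the average is a plain unweighted mean and the spread is highest minus lowest, neither of which the page states. Because it works from the rounded percentages shown on the page, its results can differ from the displayed figures by up to a point.

```python
# Per-model probabilities in percent, as displayed on the page (already rounded).
probs = {"GPT-4o": 45, "Gemini": 40, "Claude": 28, "Grok": 65}


def summarize(probs):
    """Hypothetical helper: unweighted mean, high/low models, spread, and deltas."""
    avg = sum(probs.values()) / len(probs)
    high = max(probs, key=probs.get)   # model with the highest probability
    low = min(probs, key=probs.get)    # model with the lowest probability
    spread = probs[high] - probs[low]  # assumed definition: highest minus lowest
    deltas = {name: p - avg for name, p in probs.items()}
    return {"average": avg, "high": high, "low": low,
            "spread": spread, "deltas": deltas}


summary = summarize(probs)
print(f"average: {summary['average']}%")
print(f"spread: {summary['spread']}pt ({summary['high']} vs {summary['low']})")
for name, delta in summary["deltas"].items():
    print(f"{name}: {delta:+.1f} vs avg")
```

With these inputs the spread comes out to 37 points between Grok and Claude, matching the page. The mean of the displayed values is 44.5, so the page's 44% and its per-model differences presumably come from unrounded underlying values.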