Will Meta release a Llama model exceeding GPT-4-class capability on standard benchmarks during 2026?
Resolves Dec 31, 2026
60% probability
4-model average
LOW confidence
building category history
— stability
Diverging models
45pt spread
The three supporting readings indicate how much weight to put on the probability: confidence reflects the category-level track record, stability tracks how the estimate has moved over time, and models shows whether the four models agree.
Per-model probabilities
GPT-4o: 30% (-30 vs avg)
Gemini: 65% (+5 vs avg)
Claude: 72% (+12 vs avg)
Grok: 75% (+15 vs avg)
Key disagreement: Grok (75%) vs GPT-4o (30%), attributed to different weighting of factors
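The headline figures above can be reproduced from the four per-model probabilities. The following is a minimal sketch, not the site's actual code: the function name `summarize` and the rounding rule are assumptions. The raw mean of the four values is 60.5, while the page shows 60% and reports deviations relative to 60, so the sketch rounds the mean down before computing the "vs avg" offsets.

```python
import math
from statistics import mean

# Per-model probabilities (percent), as listed on the page.
probabilities = {"GPT-4o": 30, "Gemini": 65, "Claude": 72, "Grok": 75}


def summarize(probs):
    """Hypothetical helper: recompute the average, offsets, and spread."""
    raw_avg = mean(probs.values())
    # Assumption: the displayed average is the raw mean rounded down.
    avg = math.floor(raw_avg)
    # Offset of each model from the displayed average, in points.
    deltas = {name: p - avg for name, p in probs.items()}
    # The key disagreement is between the highest and lowest estimates.
    high = max(probs, key=probs.get)
    low = min(probs, key=probs.get)
    return {
        "raw_avg": raw_avg,
        "avg": avg,
        "deltas": deltas,
        "spread": probs[high] - probs[low],
        "high": high,
        "low": low,
    }


summary = summarize(probabilities)
print(f"{summary['avg']}% average, {summary['spread']}pt spread")
for name, delta in summary["deltas"].items():
    print(f"{name}: {probabilities[name]}% ({delta:+d} vs avg)")
print(f"Key disagreement: {summary['high']} vs {summary['low']}")
```

The spread is simply the gap between the highest and lowest estimates, so a single outlier (here GPT-4o) is enough to produce a wide spread even when the other three models sit within 10 points of each other.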
Resolution criteria
Source: Meta AI publications, standard benchmark leaderboards 2026