© 2026 Crene, Inc.
Part of: Does realized enterprise economic adoption of generative AI materially lag AI infrastructure investment expansion by December 31, 2026?
Event · UNIT ECONOMICS & PRICING

Will the frontier model performance gap (top-1 vs top-5 on standard benchmarks) compress to less than 5 percentage points by Q4 2026?

Resolves Dec 31, 2026
Probability: 61% (4-model average)
Confidence: LOW (building category history)
Stability: not yet loaded
Models: Diverging (32pt spread)

The three supporting readings tell you how much weight to put on the probability: confidence reflects category-level track record, stability tracks how the estimate has moved over time, models shows whether the four agree.


Per-model probabilities

GPT-4o
40% (-21 vs avg)

The gap between top-1 and top-5 performance of frontier models has historically remained above 5 percentage points due to various factors, including model architecture and the selection of evaluation datasets. As of late 2023, reports indicate that while advancements are accelerating in model design, substantial improvements in top-1 accuracy relative to top-5 are becoming more challenging, with metrics for top-1 accuracy improvement stagnating in major benchmarks like ImageNet.

Key uncertainty: Significant breakthroughs in unsupervised learning techniques or multi-task learning approaches could rapidly shift the landscape, potentially compressing the performance gap faster than expected.
Gemini
65% (+4 vs avg)

Current trends in frontier model development show rapid improvements in both top-1 and top-5 accuracy, with the gap narrowing significantly over the past 1-2 years. Continued advancements in model architectures, training techniques, and larger datasets are likely to further reduce this gap by Q4 2026, potentially pushing it below 5 percentage points as models become more robust and less prone to minor errors.

Key uncertainty: The pace and impact of a potential "AI winter" or significant regulatory slowdown could hinder the research and development necessary for this gap compression.
Claude
72% (+11 vs avg)

Frontier models already exhibit 1-3 percentage point gaps on standard benchmarks like MMLU and ImageNet, placing them near or below the 5pp threshold. Given the consistent 3-4 year compression trend (from 8pp to current levels) and 20 months of continued development, the probability of maintaining sub-

Key uncertainty: Not specified
Grok
68% (+7 vs avg)

Top-1 vs top-5 gaps on ImageNet have already compressed from 8-12pp in 2015-2017 to 2-4pp for current frontier vision models, while on GLUE/SuperGLUE the effective top-1 vs top-5 gap for 2024 LLMs is <3pp when measured by exact match vs top-5 exact match; continued scaling of post-training (RLHF/RLAIF) and test-time compute at 3-5x current levels by 2026 makes further compression to <5pp likely on the remaining benchmarks (MMLU, GPQA, MATH).

Key uncertainty: Whether new harder benchmarks introduced by 2026 will reset the measured gap before compression can occur.
Key disagreement: Claude (72%) vs GPT-4o (40%), reflecting different weighting of factors
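
The headline figures on this page can be reproduced from the four per-model probabilities listed above. The sketch below is illustrative only: the function name `summarize` is hypothetical, and it assumes (the page does not say so) that the average is rounded to a whole percent and that each "vs avg" figure is measured against that rounded average. The rule that maps a spread to the "Diverging" label is not stated on the page, so it is not modeled here.

```python
from statistics import mean

# Per-model probabilities in percent, copied from the page.
model_probs = {"GPT-4o": 40, "Gemini": 65, "Claude": 72, "Grok": 68}


def summarize(probs):
    """Return the rounded average, the max-min spread, and per-model deltas.

    Assumption: deltas are taken against the rounded average, which matches
    the "vs avg" figures shown on the page.
    """
    avg = round(mean(probs.values()))                 # 61.25 rounds to 61
    spread = max(probs.values()) - min(probs.values())
    deltas = {name: p - avg for name, p in probs.items()}
    return {"average": avg, "spread": spread, "deltas": deltas}


summary = summarize(model_probs)
print(summary["average"])  # 61
print(summary["spread"])   # 32
print(summary["deltas"])   # {'GPT-4o': -21, 'Gemini': 4, 'Claude': 11, 'Grok': 7}
```

Under these assumptions the output matches the 61% average, the 32pt spread, and the per-model offsets shown on the page.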

Resolution criteria

Source: Standard benchmark leaderboards 2026
CRENE-AIER-C085-20261231 · Generated Jun 28, 2026