4-model average
32pt spread
The three supporting readings tell you how much weight to put on the probability: confidence reflects the category-level track record, stability tracks how the estimate has moved over time, and models shows whether the four models agree.
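As a rough illustration of how an average and a spread could be derived from four model estimates, here is a minimal sketch. It assumes the headline figure is an unweighted mean and that the spread is the distance between the highest and lowest estimate in percentage points; the page confirms neither definition. The function name `summarize` and the four input values are hypothetical.

```python
from statistics import mean


def summarize(probabilities):
    """Return (average, spread) in percentage points for a list of model estimates.

    Assumptions (not confirmed by the page): the average is an unweighted
    mean, and the spread is the highest estimate minus the lowest.
    """
    if not probabilities:
        raise ValueError("need at least one model estimate")
    average = mean(probabilities)
    spread = max(probabilities) - min(probabilities)
    return average, spread


# Hypothetical estimates from four models, in percent (not taken from the page).
estimates = [42, 55, 68, 74]
average, spread = summarize(estimates)
print(f"{len(estimates)}-model average: {average:.1f}%, spread: {spread}pt")
```

Under these assumptions, a wide spread means the models disagree, so the average deserves less weight than it would with a narrow spread.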
The gap between top-1 and top-5 performance of frontier models has historically remained above 5 percentage points, owing to factors that include model architecture and the selection of evaluation datasets. As of late 2023, reports indicate that although advances in model design are accelerating, substantial improvements in top-1 accuracy relative to top-5 are becoming harder to achieve, with top-1 accuracy gains stagnating on major benchmarks such as ImageNet.
Current trends in frontier model development show rapid improvements in both top-1 and top-5 accuracy, with the gap narrowing significantly over the past 1-2 years. Continued advancements in model architectures, training techniques, and larger datasets are likely to further reduce this gap by Q4 2026, potentially pushing it below 5 percentage points as models become more robust and less prone to minor errors.
Frontier models already exhibit 1-3 percentage point gaps on standard benchmarks like MMLU and ImageNet, placing them near or below the 5pp threshold. Given the consistent 3-4 year compression trend (from 8pp to current levels) and 20 months of continued development, the probability of maintaining sub-
Top-1 vs top-5 gaps on ImageNet have already compressed from 8-12pp in 2015-2017 to 2-4pp for current frontier vision models, while on GLUE/SuperGLUE the effective top-1 vs top-5 gap for 2024 LLMs is <3pp when measured by exact match vs top-5 exact match. Continued scaling of post-training (RLHF/RLAIF) and test-time compute to 3-5x current levels by 2026 makes further compression to <5pp likely on the remaining benchmarks (MMLU, GPQA, MATH).
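For readers unfamiliar with the quantity being forecast, the sketch below shows one common way to compute a top-1 vs top-5 gap from per-class scores. It is an illustration only: the helper names `top_k_accuracy` and `top1_top5_gap_pp` are hypothetical, the toy scores are made up, and the benchmarks named above may define or report the gap differently.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest score to lowest.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def top1_top5_gap_pp(scores, labels):
    """Top-5 accuracy minus top-1 accuracy, in percentage points."""
    top1 = top_k_accuracy(scores, labels, 1)
    top5 = top_k_accuracy(scores, labels, 5)
    return (top5 - top1) * 100


# Toy data: four examples over six classes (values are made up).
scores = [
    [0.50, 0.20, 0.12, 0.08, 0.06, 0.04],
    [0.20, 0.50, 0.12, 0.08, 0.06, 0.04],
    [0.50, 0.20, 0.12, 0.08, 0.06, 0.04],
    [0.04, 0.06, 0.08, 0.12, 0.20, 0.50],
]
labels = [0, 0, 5, 5]
gap = top1_top5_gap_pp(scores, labels)
print(f"top-1 vs top-5 gap: {gap:.1f}pp")
```

A gap below 5pp means the model's first guess is wrong but a later top-5 guess is right on fewer than 5 in 100 examples.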