© 2026 Crene, Inc.
Part of: Does realized enterprise economic adoption of generative AI materially lag AI infrastructure investment expansion by December 31, 2026?
Event · POLICY & GEOPOLITICS

Will any open-source 70B-class model achieve top-3 placement on a standard reasoning benchmark leaderboard during 2026?

Resolves Dec 31, 2026
Probability
67%

4-model average

Confidence
LOW

building category history

Stability
—

loading

Models
Aligned

7pt spread

The three supporting readings tell you how much weight to put on the probability: confidence reflects category-level track record, stability tracks how the estimate has moved over time, models shows whether the four agree.
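As a rough illustration of how the headline probability and the model spread relate to the per-model probabilities shown on this page, here is a minimal sketch. The function name `summarize` is hypothetical. The exact mean of the four displayed values is 67.5, while the page shows 67%, so the sketch assumes the displayed average is floored; the page may instead be averaging unrounded values that are not shown.

```python
import math

# Per-model probabilities (percent) as displayed on this page.
model_probs = {"GPT-4o": 65, "Gemini": 65, "Claude": 72, "Grok": 68}


def summarize(probs):
    """Return the exact mean, a floored display value, the spread, and per-model deltas."""
    values = list(probs.values())
    mean = sum(values) / len(values)
    # Assumption: the displayed average is the floor of the mean (67.5 -> 67).
    displayed = math.floor(mean)
    # Spread is the gap between the highest and lowest model estimate.
    spread = max(values) - min(values)
    # Each model's offset from the displayed average ("vs avg").
    deltas = {name: p - displayed for name, p in probs.items()}
    return mean, displayed, spread, deltas


mean, displayed, spread, deltas = summarize(model_probs)
print(f"mean={mean:.1f}% displayed={displayed}% spread={spread}pt")
for name, delta in deltas.items():
    print(f"{name}: {delta:+d} vs avg")
```

Under that flooring assumption, the computed spread and per-model offsets line up with the "7pt spread" and "vs avg" figures shown for each model.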

Per-model probabilities

GPT-4o
65% (-2 vs avg)

The open-source AI model landscape has been rapidly evolving, with large models like LLaMA 2 and other 70B-class variants demonstrating competitive performance on reasoning benchmarks. Notably, open-source projects have gained significant institutional support and community engagement, as seen with the release of Hugging Face's Transformers library and Meta's model releases, which foster improvements. Given these trends, it's plausible that by 2026, an open-source model could secure a top-3 placement on major reasoning leaderboards like SuperGLUE or similar benchmarks, particularly as the focus on transparency and accessibility in AI research intensifies.

Key uncertainty: The pace of proprietary advancements by leading AI companies, especially with resource investments in training and tuning models that could outpace open-source efforts, remains an unpredictable variable that could affect the likelihood of open-source dominance.
Gemini
65% (-2 vs avg)

The rapid advancement in open-source LLM capabilities, exemplified by models like Llama 3 70B achieving near state-of-the-art performance, coupled with significant hardware democratization and research community investment, suggests a high likelihood of a 70B-class open-source model surpassing current top proprietary models on standard reasoning benchmarks by 2026. Historical trends show open-source models rapidly closing performance gaps within 1-2 years.

Key uncertainty: The pace of breakthrough architectural or training paradigm shifts in proprietary models that are not immediately or efficiently replicated in the open-source community.
Claude
72% (+5 vs avg)

Open-source 70B-class models have already demonstrated strong reasoning capabilities as of late 2024 (e.g., Llama 2 70B, Mistral variants), and the trajectory shows rapid improvement with models like Llama 3.1 405B pushing boundaries. The 12-month window to end of 2026 provides sufficient time for fine-tuning and architectural improvements on reasoning-specific benchmarks (MATH, ARC, MMLU-Pro). However, "top-3 placement" requires competing directly with frontier models (GPT-4, Claude 3.5, Gemini) that benefit from massive proprietary compute and data. Historical precedent shows open-source models typically lag cutting-edge closed models by 6-12 months; achieving top-3 on a "standard reasoning benchmark" represents a significant but achievable milestone given current progress velocity. The main limiting factor is that benchmark leaderboards often emphasize frontier performance where proprietary investment maintains advantages, though open-source has surprised on specific benchmarks before.

Key uncertainty: Whether "standard reasoning benchmark leaderboard" refers to established benchmarks (MATH, ARC) where open-source already competes closely, or newly released 2025-2026 benchmarks specifically designed to test frontier capabilities where closed models maintain larger leads.
Grok
68% (+1 vs avg)

Open-weight 70B models reached 85-87% on MMLU and 78% on GSM8K by Q4 2024; the 2024-2025 gap to closed frontier models has narrowed from 15 points to 8 points on the aggregate Open LLM Leaderboard, and continued scaling with synthetic data plus test-time compute (as shown in the 405B Llama-3.1 paper) points to another 6-8 point lift by late 2025, placing a 70B-class model inside the top-3 by mid-2026. Historical precedent shows open releases lag closed models by 9-15 months; the current 12-month lag plus announced 2025 training runs (Meta, Mistral, AllenAI) keeps the probability above 50%.

Key uncertainty: Whether US export controls on HBM or a sudden restriction on synthetic-data distillation will materially slow the next open 70B training run.

Resolution criteria

Source: Standard benchmark leaderboards 2026
CRENE-AIER-C079-20261231 · Generated Jun 17, 2026