Crene Research

Research, APR 2026

Four frontier AI models independently forecast macro, earnings, and policy events. Across 0 resolved outcomes, the consensus Brier is 0.244 versus a 0.25 no-skill baseline. The edge is real but small.

0
Active Forecasts
9 categories
0
Resolved Events
CRENE-native predictions, verified against official sources
57.6%
Directional Accuracy
0 resolved events
Consensus Brier Score
Market: —
Top Model
What Makes This Dataset Unique
4 models
Multi-Model Ensemble

Four frontier LLMs forecast independently with no anchoring. Cross-model spread reveals uncertainty that single-model systems miss.

Official sources
Structured Resolution

Every prediction has named resolution criteria and authoritative sources (SEC filings, BLS, Fed statements). Not crowdsourced. Verified.

Per-event scoring
Per-Model Calibration

Brier scores computed per model per event. Enables model-level analysis: which LLM forecasts best in which domain?

What we found
The edge is real but small.

Four frontier models forecast each event independently. Across 0 resolved outcomes, the consensus Brier is 0.244 versus a 0.25 no-skill baseline. Directional accuracy is 57.6%. The improvement over a coin flip is statistically real but operationally modest.
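The 0.25 baseline comes from always forecasting 50%: the squared error is (0.5)^2 = 0.25 whichever way the event resolves. The sketch below shows how a Brier score and directional accuracy could be computed for binary outcomes. The function names, the 0.5 direction threshold, and the sample numbers are illustrative assumptions, not CRENE's code or data.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


def directional_accuracy(probabilities, outcomes, threshold=0.5):
    """Share of forecasts on the correct side of the threshold.

    Assumption: a forecast of exactly 0.5 counts as predicting "no".
    """
    hits = sum((p > threshold) == bool(o) for p, o in zip(probabilities, outcomes))
    return hits / len(probabilities)


# Made-up numbers for illustration only.
outcomes = [1, 0, 1, 1, 0]
sample_forecasts = [0.7, 0.4, 0.6, 0.3, 0.2]

no_skill = brier_score([0.5] * len(outcomes), outcomes)  # always-50% forecaster
forecast_brier = brier_score(sample_forecasts, outcomes)
hit_rate = directional_accuracy(sample_forecasts, outcomes)
```

Lower Brier is better, so a consensus score is only meaningful when read against the no-skill value computed on the same events.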

Models are calibrated, not oracular.

Earlier internal analyses suggested that tight agreement between models could itself signal correctness. As the dataset expanded, the effect did not hold consistently. We do not treat model agreement as a reliable indicator of accuracy.

What we explicitly do not claim
  • AI forecasts outperform liquid prediction markets.
  • Model agreement reliably improves accuracy.
  • Probabilities should be interpreted as deterministic outcomes.
Calibration Analysis

Are the probabilities meaningful? When a well-calibrated model predicts 70%, the event occurs about 70% of the time. Points near the dashed line indicate good calibration.
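One common way to build such a calibration plot is to group forecasts into probability bins and compare each bin's mean forecast with its observed frequency. The sketch below assumes ten equal-width bins. The binning scheme and function name are assumptions for illustration, not a description of how this chart is produced.

```python
def calibration_table(probabilities, outcomes, n_bins=10):
    """Return (mean forecast, observed frequency, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probabilities, outcomes):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        observed = sum(o for _, o in members) / len(members)
        rows.append((mean_p, observed, len(members)))
    return rows


# Made-up example: ten forecasts of 70%, seven of which resolved "yes".
rows = calibration_table([0.7] * 10, [1] * 7 + [0] * 3)
```

A bin where the mean forecast and the observed frequency match lies on the diagonal. Bins with few members are noisy, so the count is worth plotting alongside each point.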

Methodology
01 Event Detection

Automated scanners detect upcoming earnings (Polygon.io financials), macro releases (CPI, NFP, PMI), central bank meetings, and market events. Each gets structured binary resolution criteria and a named authoritative source.

02 4-Model Consensus

GPT-4o, Gemini 2.5 Flash Lite, Claude Haiku 4.5, and Grok 4 Fast each forecast independently with no model seeing another's output. Ensemble consensus is the mean probability. Spread (max minus min) measures disagreement.
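The two ensemble statistics described here are simple to compute. This is a minimal sketch using the definitions stated above (mean probability, max minus min). The dictionary keys are hypothetical labels and the probabilities are made up.

```python
from statistics import mean


def consensus_and_spread(model_probs):
    """Return (mean probability, max minus min) across models for one event."""
    values = list(model_probs.values())
    if not values:
        raise ValueError("at least one model forecast is required")
    return mean(values), max(values) - min(values)


# Hypothetical labels and made-up probabilities for a single event.
forecasts = {
    "gpt-4o": 0.62,
    "gemini-2.5-flash-lite": 0.55,
    "claude-haiku-4.5": 0.60,
    "grok-4-fast": 0.71,
}
consensus, spread = consensus_and_spread(forecasts)
```

In this example the consensus is about 0.62 and the spread about 0.16. The spread is sensitive to a single outlying model, which is worth keeping in mind when reading it as a disagreement measure.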

03 Belief Trajectories

Every active event is repolled at a daily cadence, producing a time series of how each model's probability evolves as new information emerges. Full trajectory data is queryable per event with timestamped per-model probabilities and event-level consensus.
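A trajectory of this kind can be pictured as a list of timestamped polls per event, each holding one probability per model. The sketch below is one possible in-memory shape. The class and field names are hypothetical and do not reflect the actual API schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from statistics import mean


@dataclass
class Poll:
    polled_at: datetime
    model_probs: dict  # hypothetical model label -> probability

    @property
    def consensus(self):
        """Event-level consensus for this poll: the mean across models."""
        return mean(self.model_probs.values())


@dataclass
class Trajectory:
    event_id: str
    polls: list = field(default_factory=list)

    def add(self, poll):
        """Insert a poll and keep the series in chronological order."""
        self.polls.append(poll)
        self.polls.sort(key=lambda p: p.polled_at)

    def series(self, model):
        """Timestamped probabilities for one model, oldest first."""
        return [(p.polled_at, p.model_probs[model])
                for p in self.polls if model in p.model_probs]


# Made-up event id and values; polls are added out of order on purpose.
traj = Trajectory("example-event")
traj.add(Poll(datetime(2026, 4, 2, tzinfo=timezone.utc), {"model-a": 0.60, "model-b": 0.50}))
traj.add(Poll(datetime(2026, 4, 1, tzinfo=timezone.utc), {"model-a": 0.40, "model-b": 0.50}))
model_a_series = traj.series("model-a")
```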

04 Resolution & Scoring

Earnings are resolved daily against Polygon.io SEC-derived financials as the primary source, with an Alpha Vantage cross-check and a per-event audit trail recording every source response. Macro events are resolved via Gemini search grounding, with the model-cited source URL classified against an authoritative source allowlist (government statistical agencies, central banks, regulators). Brier scores are computed per model per event. All data is served via a public REST API.
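Classifying a cited URL against an allowlist can be done by matching its hostname, rather than the raw string, so that lookalike domains do not pass. The sketch below is one way to do that. The three domains are illustrative examples of the source types named above and are not the actual allowlist.

```python
from urllib.parse import urlparse

# Illustrative entries only; the real allowlist is not published in this text.
AUTHORITATIVE_DOMAINS = {"bls.gov", "federalreserve.gov", "sec.gov"}


def is_authoritative(url, allowlist=AUTHORITATIVE_DOMAINS):
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowlist)


checks = {
    "https://www.bls.gov/cpi/": is_authoritative("https://www.bls.gov/cpi/"),
    "https://bls.gov.example.com/cpi": is_authoritative("https://bls.gov.example.com/cpi"),
    "https://evilbls.gov/": is_authoritative("https://evilbls.gov/"),
    "not a url": is_authoritative("not a url"),
}
```

Requiring either an exact match or a leading dot before the domain is what rejects hosts such as `bls.gov.example.com` and `evilbls.gov`.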

Dataset Coverage
Predictions
Active forecasts across categories
13
Categories
Earnings, macro, crypto, and more
4
AI Models
Independent probability estimates
Daily
Resolution Cadence
Verified against official sources
4 frontier LLMs · 13 categories · Updated daily · Brier scored · All data public
Crene — Methodology