Crene Methodology

Methodology · MAY 2026

How Crene measures.

Four frontier AI models independently forecast macro, earnings, and policy events. Across the resolved outcomes to date, the consensus Brier is 0.244 versus a 0.25 no-skill baseline. The edge is real but small.

Frontier language models contain compressed probabilistic representations of human informational structure: markets, history, causal reasoning, expert analysis. They can express uncertainty. The open question is whether those uncertainty estimates are calibrated, stable across methodology variations, and informative over time. Crene is the disciplined measurement of that question.

Resolved: 0 · Brier: 0.244 · Models: 4 · Categories: 13 · Active forecasts: 0

9 categories, refreshed daily.

0 resolved events. CRENE-native predictions, verified against official sources.

57.6% directional accuracy across resolved events.

Consensus Brier: lower is better; 0.25 no-skill baseline.

What makes this dataset unique
Multi-model ensemble

Four frontier LLMs forecast independently with no anchoring. Cross-model spread reveals uncertainty that single-model systems miss.

Structured resolution

Every prediction has named resolution criteria and authoritative sources: SEC filings, BLS releases, Fed statements. Not crowdsourced; verified.

Per-model calibration

Brier scores computed per model per event. Enables model-level analysis: which LLM forecasts best in which domain?

What we found
The edge is real but small.

Four frontier models forecast each event independently. Across the resolved outcomes to date, the consensus Brier is 0.244 versus a 0.25 no-skill baseline, and directional accuracy is 57.6%. The improvement over a coin flip is statistically real but operationally modest.
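For reference, the two headline numbers are simple to compute. A minimal sketch in Python (illustrative data, not Crene's scoring code): the Brier score is the mean squared error between forecast probabilities and binary outcomes, and directional accuracy is the share of events where the forecast lands on the correct side of 50%.

```python
# Minimal sketch of the two headline metrics; data below is illustrative only.

def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def directional_accuracy(probs: list[float], outcomes: list[int]) -> float:
    """Share of events where the forecast is on the correct side of 50%."""
    hits = sum((p > 0.5) == (o == 1) for p, o in zip(probs, outcomes))
    return hits / len(probs)

# A forecaster that always says 50% scores exactly the 0.25 no-skill baseline.
outcomes = [1, 0, 1, 1, 0]
print(brier_score([0.5] * 5, outcomes))                   # 0.25
print(brier_score([0.8, 0.3, 0.6, 0.7, 0.4], outcomes))   # below 0.25 -> some skill
print(directional_accuracy([0.8, 0.3, 0.6, 0.7, 0.4], outcomes))
```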

Models are calibrated, not oracular.

Earlier internal analyses suggested that tight agreement between models could itself signal correctness. As the dataset expanded, the effect did not hold consistently. We do not treat model agreement as a reliable indicator of accuracy.

What we explicitly do not claim
  • AI forecasts outperform liquid prediction markets.
  • Model agreement reliably improves accuracy.
  • Probabilities should be interpreted as deterministic outcomes.
  • Stability under methodology variation. Calibration has not been tested under variations to prompts, model selection, or category structure. Robustness across configurations is an open question.
  • Trajectory legitimacy. Whether daily probability movement contains information beyond the news cycle is empirically unevaluated. The hypothesis becomes testable as resolution data accumulates over the coming weeks.
  • Per-category significance. Several categories have fewer than 20 resolved events. Per-category numbers are exploratory until sample sizes grow.
Calibration Analysis

Are the probabilities meaningful?

A well-calibrated forecast that predicts 70% is correct 70% of the time. Points on the dashed identity line indicate perfect calibration; points above the line indicate underconfidence, points below indicate overconfidence. The 0.25 baseline is a no-skill coin flip.
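The reliability curve described above can be built by bucketing forecasts and comparing each bucket's mean predicted probability to its observed outcome frequency. A minimal sketch, with invented data:

```python
# Sketch of a reliability (calibration) curve: bucket forecasts, then compare
# each bucket's mean predicted probability to its observed frequency.
from collections import defaultdict

def calibration_curve(probs, outcomes, n_bins=10):
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))  # clamp p=1.0 into top bin
    points = []
    for b in sorted(bins):
        pairs = bins[b]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(o for _, o in pairs) / len(pairs)
        points.append((mean_pred, observed))  # on the identity line == well calibrated
    return points

# Invented forecasts and outcomes, for illustration only.
probs = [0.1, 0.15, 0.3, 0.35, 0.7, 0.72, 0.75, 0.9, 0.92]
outcomes = [0, 0, 0, 1, 1, 1, 0, 1, 1]
for mean_pred, observed in calibration_curve(probs, outcomes, n_bins=5):
    print(f"predicted {mean_pred:.2f} -> observed {observed:.2f}")
```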

As of MAY 2026. Updated daily across all resolved events.
Pipeline
01
Event Detection

Automated scanners detect upcoming earnings (Polygon.io financials), macro releases (CPI, NFP, PMI), central bank meetings, and market events. Each gets structured binary resolution criteria and a named authoritative source.

02
Multi-Model Consensus

GPT-4o, Gemini 2.5 Flash Lite, Claude Haiku 4.5, and Grok 4 Fast each forecast independently, with no model seeing another model's output. The ensemble consensus is the mean probability; spread (max minus min) measures disagreement.
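The ensemble arithmetic is deliberately simple. A minimal sketch (model names used as labels only; the probabilities are invented):

```python
# Ensemble consensus = mean of per-model probabilities; spread = max - min.
# The probabilities below are invented for illustration.
forecasts = {
    "gpt-4o": 0.60,
    "gemini-2.5-flash-lite": 0.55,
    "claude-haiku-4.5": 0.58,
    "grok-4-fast": 0.43,
}

consensus = sum(forecasts.values()) / len(forecasts)
spread = max(forecasts.values()) - min(forecasts.values())
print(f"consensus={consensus:.2f}  spread={spread:.2f}")
```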

03
Belief Trajectories

Every active event is repolled at a daily cadence, producing a time series of how each model's probability evolves as new information emerges. Full trajectory data is queryable per event, with timestamped per-model probabilities and event-level consensus.
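A trajectory is just a dated series of per-model probabilities. A minimal sketch of the shape and of a day-over-day consensus delta (field names and values are assumptions, not the actual API schema):

```python
# Sketch of a belief trajectory: one snapshot per day with per-model probabilities,
# plus the consensus move between the latest two snapshots. Values are invented.
trajectory = [
    {"date": "2026-05-01", "probs": {"gpt-4o": 0.48, "gemini": 0.52, "claude": 0.50, "grok": 0.46}},
    {"date": "2026-05-02", "probs": {"gpt-4o": 0.55, "gemini": 0.57, "claude": 0.53, "grok": 0.51}},
]

def consensus(snapshot):
    probs = snapshot["probs"].values()
    return sum(probs) / len(probs)

latest, previous = trajectory[-1], trajectory[-2]
delta = consensus(latest) - consensus(previous)
print(f"{previous['date']} -> {latest['date']}: consensus moved {delta:+.3f}")
```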

04
Resolution and Scoring

Earnings are resolved daily against Polygon.io SEC-derived financials as the primary source, with an Alpha Vantage cross-check and a per-event audit trail recording every source response. Macro events are resolved via Gemini search grounding, with the model-cited source URL classified against an authoritative-source allowlist (government statistical agencies, central banks, regulators). Brier scores are computed per model per event. All data is served via a public REST API.
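The allowlist classification for macro resolution can be as simple as matching the cited URL's host against a set of authoritative domains. A minimal sketch, with an assumed partial allowlist:

```python
# Sketch of classifying a model-cited source URL against an authoritative-source
# allowlist. The domain set here is a partial, assumed example.
from urllib.parse import urlparse

AUTHORITATIVE_DOMAINS = {"bls.gov", "federalreserve.gov", "sec.gov", "bea.gov"}

def is_authoritative(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in AUTHORITATIVE_DOMAINS)

print(is_authoritative("https://www.bls.gov/news.release/cpi.nr0.htm"))  # True
print(is_authoritative("https://example-blog.com/cpi-recap"))            # False
```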

Cluster Decomposition

A second pipeline, layered on the same multi-model scoring infrastructure, decomposes an anchor question into a factor matrix of falsifiable child events. The use case differs from individual forecasts: a quantitative team uses the matrix as a feature library to detect under-modeled exposure in its existing book, not as a standalone signal. Cluster anchors are typically forward-looking binary questions with public resolution dates, decomposable into 50-200 falsifiable child factors across distinct categories.

The first live cluster is anchored on "Will the Fed cut 75bps+ cumulatively in 2026?", with 100 child events spanning five categories. Future clusters will follow the same pipeline applied to different anchor questions, with category structures appropriate to each question's domain.

Generation pipeline (5 stages):

  1. Candidate generation. ~80 questions per category produced via a frontier LLM, prompted for falsifiable binary events with explicit resolution dates and public data sources.
  2. Falsifiability filter. Each candidate scored 1-5 by a separate model on three axes: falsifiability, specificity, and whether it represents an under-attended factor. Sub-threshold candidates dropped.
  3. Probability pre-screen. Single-model probability estimate; candidates outside the 5-40% band rejected. Low-band events are noise floor; high-band events are mostly already priced into existing models.
  4. Lexical deduplication. Jaccard similarity across token sets at threshold 0.55 within each category. Filters obvious paraphrases; a sketch of this stage and the pre-screen appears after this list.
  5. Manual curation. A human review pass selects the final cohort and rejects category drift, anti-anchor framing, and questions whose resolution requires subjective judgment.
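Stages 3 and 4 are mechanical enough to show directly. A minimal sketch of the probability pre-screen and the Jaccard deduplication, using the thresholds from the list above (the candidate questions and probabilities are invented):

```python
# Sketch of pipeline stages 3-4: drop candidates outside the 5-40% band, then
# drop near-duplicates by Jaccard similarity over token sets (threshold 0.55).
import re

def tokens(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9.%]+", s.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Invented candidates for illustration.
candidates = [
    ("Will core CPI print above 3.5% YoY in June 2026?", 0.22),
    ("Will June 2026 core CPI come in above 3.5% YoY?", 0.24),    # paraphrase of the first
    ("Will the Fed hold rates at the July 2026 meeting?", 0.78),  # above band -> rejected
]

kept: list[tuple[str, float]] = []
for question, prob in candidates:
    if not 0.05 <= prob <= 0.40:
        continue  # stage 3: probability pre-screen
    if any(jaccard(question, q) >= 0.55 for q, _ in kept):
        continue  # stage 4: lexical deduplication
    kept.append((question, prob))

for q, p in kept:
    print(f"{p:.2f}  {q}")
```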

After curation, child events run the same multi-model scoring as standalone events. Each event resolves on a specific date against a public data source and updates twice daily until resolution.

Why the 5-40% band:

Below 5%, multi-model spread is dominated by sampling variance in model priors rather than meaningful disagreement. Above 40%, the event is likely already in standard models for the anchor's domain and the marginal information from a multi-model consensus is small. The 5-40% range is where models reason about events that are mechanically plausible but socially under-attended, which is the value the matrix is designed to capture.

What spread reveals:

Each child has both a consensus probability and a multi-model spread. High-spread rows (≥30 percentage points) surface where models genuinely disagree about an under-modeled factor; these are the most decision-relevant rows for a factor analyst. Tight-spread low-probability rows surface agreed-tail events that can be used to size hedges with calibrated confidence. Trajectory data (twice-daily snapshots) is intended to reveal which factors are drifting up or down over time, which is the leading-indicator hypothesis the cluster product is built around. We do not yet have enough trajectory history to evaluate that hypothesis.
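In code, the two views described above are simple filters over per-child consensus and spread. A minimal sketch with invented rows:

```python
# Sketch of the two spread-based views over a cluster's child events.
# Rows are (name, consensus probability, max-min spread); all values invented.
rows = [
    ("Regional bank CRE charge-offs exceed threshold", 0.18, 0.34),
    ("Core services inflation reaccelerates",          0.25, 0.08),
    ("Money-market fund outflow episode",              0.07, 0.05),
]

# High-spread rows (>= 30 pp): genuine cross-model disagreement, most decision-relevant.
disagreement = [r for r in rows if r[2] >= 0.30]

# Tight-spread, low-probability rows: agreed-tail events usable for hedge sizing.
agreed_tail = [r for r in rows if r[2] <= 0.10 and r[1] <= 0.10]

print("disagreement:", [r[0] for r in disagreement])
print("agreed tail: ", [r[0] for r in agreed_tail])
```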

Continuous Factor Decomposition

A third pipeline, layered on the same multi-model scoring infrastructure, forecasts continuous state variables as cross-model percentile distributions. Where a cluster decomposes a binary anchor question into falsifiable child events, a factor decomposes a continuous variable into a driver matrix that explains movement in the distribution. The output of a factor is a five-point percentile distribution (p5, p25, p50, p75, p95) aggregated across four frontier models, with a disagreement metric and a confidence label derived from cross-model spread.

The first live factors are UST10Y yield at horizon and SPX level at horizon. Each factor is decomposed into roughly 100 driver variables grouped by causal family. For rates these include Fed policy, inflation, real growth, term premium, supply, liquidity, and recession indicators. For equity these include earnings, margins, multiples, AI capex, credit conditions, flows, volatility, and sector leadership. Drivers are how distribution moves are explained, not how the distribution is generated. Each driver carries its own cross-model percentile distribution that can be inspected independently, and drivers can be referenced across multiple factors when the same causal channel applies.
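A minimal sketch of how a cross-model percentile distribution, its disagreement metric, and a confidence label could be assembled. The per-percentile averaging rule, the disagreement definition, and the label cutoffs below are assumptions for illustration, not Crene's published definitions:

```python
# Sketch: aggregate per-model percentile forecasts into a cross-model distribution,
# then derive a disagreement metric and confidence label. Aggregation rule,
# disagreement definition, and cutoffs are assumptions; values are invented.
PCTS = ("p5", "p25", "p50", "p75", "p95")

model_views = {  # illustrative UST10Y-at-horizon forecasts, in percent
    "model_a": {"p5": 3.2, "p25": 3.6, "p50": 3.9, "p75": 4.3, "p95": 4.8},
    "model_b": {"p5": 3.4, "p25": 3.8, "p50": 4.1, "p75": 4.5, "p95": 5.0},
    "model_c": {"p5": 3.0, "p25": 3.5, "p50": 3.8, "p75": 4.2, "p95": 4.7},
    "model_d": {"p5": 3.3, "p25": 3.7, "p50": 4.0, "p75": 4.4, "p95": 4.9},
}

aggregate = {p: sum(v[p] for v in model_views.values()) / len(model_views) for p in PCTS}
medians = [v["p50"] for v in model_views.values()]
disagreement = max(medians) - min(medians)
confidence = "high" if disagreement < 0.15 else "medium" if disagreement < 0.40 else "low"

print(aggregate)
print(f"disagreement={disagreement:.2f}  confidence={confidence}")
```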

Factors and clusters address different questions. A cluster anchors on a forward-looking binary with a public resolution date and decomposes into child events that resolve true or false. A factor anchors on a continuous variable at a horizon and decomposes into drivers whose distributions explain the anchor distribution. Some factors correspond to the same horizon as a cluster anchor and can be inspected jointly; most are standalone. Both pipelines surface on the homepage and have dedicated detail pages.

The full decomposition forms a three-level topology: the anchor factor at the top, driver families in the middle, and individual drivers at the leaves, with each level carrying its own cross-model distribution. The driver families themselves form a taxonomy that recurs across related factors, which is what makes cross-factor correlation tractable. Inspecting a factor means traversing this topology rather than reading a flat list.
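One way to picture that topology is a nested structure that is traversed level by level. A minimal sketch with assumed names (each node would also carry its own cross-model distribution, omitted here):

```python
# Sketch of the three-level topology: anchor factor -> driver families -> drivers.
# Family and driver names are assumed examples, not the live driver set.
factor = {
    "anchor": "UST10Y yield at horizon",
    "families": {
        "Fed policy": ["terminal rate path", "balance-sheet runoff pace"],
        "Inflation": ["core services trend", "shelter disinflation"],
        "Term premium": ["auction tails", "foreign official demand"],
    },
}

for family, drivers in factor["families"].items():
    print(family)
    for driver in drivers:
        print("  -", driver)
```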

Dataset Coverage

  • Predictions: - (active forecasts across categories)
  • Categories: 13 (earnings, macro, crypto, and more)
  • AI Models: 4 (independent probability estimates)
  • Resolution cadence: daily (verified against official sources)