Crene Methodology

Methodology · JUN 2026

How Crene organizes uncertainty.

Four AI models decompose questions into falsifiable components. The calibration record tells you how much to trust the output.
What makes this dataset unique
Multi-model ensemble

Four frontier LLMs forecast independently, with no anchoring on one another's output. The last-poll consensus is well calibrated: its confidence is informative, even though raw cross-model spread is not a reliable accuracy signal.

Decomposition architecture

Every question is decomposed into falsifiable components. Clusters break binary questions into situations. Factors break continuous variables into drivers. Scenarios break strategic questions into coherent pathways.

Structured resolution

Every component has named resolution criteria and authoritative sources: SEC filings, BLS releases, Fed statements, surveys, and academic research. Not crowdsourced. Verified.

What we found
The calibration signal is real but uneven.

Measured at last poll before resolution with a 24-hour leakage gap, the four-model consensus scores Brier 0.114 against a 0.189 base-rate baseline (skill 0.075) across 811 resolved questions. The skill is real but uneven: near-perfect on high-conviction calls (Brier 0.034, n=440), no better than chance on genuinely contested ones (Brier 0.248, n=157). We report the split rather than the blended average.
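The skill figure above is the baseline Brier minus the consensus Brier (0.189 - 0.114 = 0.075). The following is a minimal sketch of that arithmetic, assuming forecasts and outcomes arrive as parallel lists and that the base-rate baseline means always forecasting the observed base rate of the same sample. The function names are illustrative and not part of any Crene API.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def base_rate_brier(outcomes):
    """Brier score of a constant forecast equal to the observed base rate (assumed baseline)."""
    rate = sum(outcomes) / len(outcomes)
    return brier_score([rate] * len(outcomes), outcomes)


def skill(probs, outcomes):
    """Absolute skill: baseline Brier minus forecast Brier. Positive is better."""
    return base_rate_brier(outcomes) - brier_score(probs, outcomes)


# Toy data, not Crene results.
toy_probs = [0.9, 0.1, 0.8, 0.2]
toy_outcomes = [1, 0, 1, 0]
print(round(brier_score(toy_probs, toy_outcomes), 3))
print(round(skill(toy_probs, toy_outcomes), 3))
```

Reporting skill as a difference rather than a ratio keeps it in the same units as the Brier score, which matches how the headline numbers are quoted here.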

Models are calibrated, not oracular.

Earlier internal analyses suggested that tight agreement between models could itself signal correctness. As the dataset expanded the effect did not hold consistently. We do not treat model agreement as a reliable indicator of accuracy.

What we explicitly do not claim
  • AI forecasts outperform liquid prediction markets.
  • Model agreement reliably improves accuracy.
  • Probabilities should be interpreted as deterministic outcomes.
  • Stability under methodology variation. Calibration has not been tested under variations to prompts, model selection, or category structure. Robustness across configurations is an open question.
  • Trajectory legitimacy. Whether daily probability movement contains information beyond the news cycle is empirically unevaluated. The hypothesis becomes testable as resolution data accumulates over the coming weeks.
  • Per-category significance. Several categories have fewer than 20 resolved events. Per-category numbers are exploratory until sample sizes grow.
Active events

13 categories. Repolled daily.

Resolved Crene questions

Crene-native, verified against official sources. Earnings heavy; macro-only skill is modest. See tiers.

0.2325 macro Brier (ex earnings)

Crene macro questions, n=242. Base 0.25, skill 0.018. Modest.

0.114 external benchmark Brier

Market-priced questions, n=811, leakage controlled. Validates the ensemble, not the product layer.

Calibration Analysis

Are the probabilities meaningful?

When a well-calibrated forecaster says 70%, the event occurs about 70% of the time. Points on the dashed identity line indicate perfect calibration. Points above the line indicate underconfidence; points below indicate overconfidence. The 0.25 line marks a balanced-binary reference (the Brier score of always forecasting 50%); the leakage-controlled headline benchmark uses a 0.189 base rate.
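The points on a calibration chart of this kind can be produced by bucketing forecasts and comparing each bucket's mean forecast with its observed hit rate. The following is a rough sketch, assuming ten equal-width bins. The bin count and the function name are assumptions, not a description of how the chart on this page is built.

```python
def calibration_bins(probs, outcomes, n_bins=10):
    """Return (mean forecast, observed frequency, count) for each non-empty bin."""
    acc = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, hits, count]
    for p, o in zip(probs, outcomes):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        acc[i][0] += p
        acc[i][1] += o
        acc[i][2] += 1
    return [(s / n, hits / n, n) for s, hits, n in acc if n]


# Toy data: ten forecasts of 70%, seven of which resolved true.
for mean_forecast, observed, count in calibration_bins([0.7] * 10, [1] * 7 + [0] * 3):
    side = "under" if observed > mean_forecast + 1e-9 else "over" if observed < mean_forecast - 1e-9 else "well"
    print(round(mean_forecast, 2), round(observed, 2), count, side)
```

A bin whose observed frequency sits above its mean forecast plots above the identity line (underconfident); one that sits below plots beneath it (overconfident).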

As of JUN 2026. Updated daily across all resolved events.
Pipeline
01 Event Detection

Automated scanners detect upcoming earnings (Polygon.io financials), macro releases (CPI, NFP, PMI), central bank meetings, and market events. Each gets structured binary resolution criteria and a named authoritative source.

02 Multi Model Consensus

GPT-4o, Gemini 2.5 Flash Lite, Claude Haiku 4.5, and Grok 4 Fast each forecast independently; no model sees another model's output. Ensemble consensus is the mean probability. Spread (max minus min) measures disagreement.
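Both quantities are simple to compute from one poll. The following is a minimal sketch, assuming the per-model probabilities arrive as a dictionary; the keys and values are placeholders, not real forecasts.

```python
from statistics import mean


def consensus_and_spread(model_probs):
    """Consensus is the mean probability; spread is max minus min across models."""
    values = list(model_probs.values())
    if not values:
        raise ValueError("at least one model probability is required")
    return mean(values), max(values) - min(values)


# Placeholder poll for a single event.
poll = {"model_a": 0.60, "model_b": 0.70, "model_c": 0.50, "model_d": 0.80}
consensus, spread = consensus_and_spread(poll)
print(round(consensus, 2), round(spread, 2))
```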

03 Belief Trajectories

Every active event is repolled daily, producing a time series of how each model's probability evolves as new information emerges. Full trajectory data is queryable per event, with timestamped per-model probabilities and event-level consensus.

04 Resolution and Scoring

Earnings are resolved daily against Polygon.io SEC-derived financials as the primary source, with an Alpha Vantage cross-check and a per-event audit trail recording every source response. Macro events are resolved via Gemini search grounding, with the model-cited source URL classified against an authoritative-source allowlist (government statistical agencies, central banks, regulators). Brier scores are computed per model per event. All data is served via a public REST API.
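The allowlist step can be pictured as a hostname check on the cited URL. The following is a sketch under stated assumptions: the three domains are illustrative examples of the source types named above, not Crene's actual allowlist, and `is_authoritative` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Illustrative entries only; the real allowlist is not published in this document.
EXAMPLE_ALLOWLIST = {"bls.gov", "federalreserve.gov", "sec.gov"}


def is_authoritative(url, allowlist=EXAMPLE_ALLOWLIST):
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowlist)


print(is_authoritative("https://www.bls.gov/cpi/"))
print(is_authoritative("https://bls.gov.example.com/cpi/"))
```

Matching on the parsed hostname, rather than on a substring of the URL, avoids accepting lookalike hosts that merely contain an allowlisted name.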

PART II
Decomposition Systems

Everything above is empirically calibrated: resolved events, scored predictions, measured accuracy. Everything below is exploratory: decomposition systems that organize uncertainty into inspectable structure. The calibration record does not transfer downward. Clusters, factors, and scenarios inherit the same multi-model scoring discipline, but their value is structural, not predictive.

  • Scenarios: structural transitions
  • Clusters: binary thesis maps
  • Factors: continuous forecast maps
  • Components: falsifiable elements
  • Frontier AI models: independent ensemble (Claude, GPT, Gemini, Grok)
  • Resolved outcomes: calibration evidence

System      Type                       Question
Scenarios   Structural transitions     How does the system interact?
Clusters    Binary theses              What happens?
Factors     Continuous distributions   Where does it land?

Scenario Modeling

A fourth pipeline moves from atomic forecasting to structured futures reasoning. Where clusters decompose binary theses and factors decompose continuous state variables, scenarios decompose long-horizon strategic questions into coherent world-states called pathways. The three products form a complete hierarchy: clusters answer what happens, factors answer how much, and scenarios answer how the system interacts. Scenarios are not forecasts. They are maps of internally coherent futures that can be inspected, stress-tested, and compared.

Product     Anchor       Children     Question
Clusters    Binary       Situations   What happens?
Factors     Continuous   Drivers      How much?
Scenarios   Hybrid       Pathways     How does the system interact?

A scenario anchors on a long-horizon strategic question. The anchor is decomposed into binary and continuous components spanning currency architecture, geopolitics, technology, demographics, fiscal policy, and market structure. Each component has a resolution source and horizon date, making the entire system falsifiable.

Pathways are the children of a scenario. Each pathway is a coherent mini-world: a named causal narrative backed by a precise vector state assigning values to a subset of the scenario components. Pathways specify only load-bearing assumptions, which is the central design principle. Unspecified components are treated as model-determined during polling, which prevents overconstraint and allows the system to surface emergent tensions.
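One way to picture a partial vector state is as a dictionary that names only the load-bearing components, with everything else filled from a model's own estimates. The sketch below assumes all values are probabilities in [0, 1] and that a pathway-to-consensus delta can be summarized as a mean absolute gap over the specified components. Both choices, and the component names, are illustrative assumptions.

```python
def full_state(pathway_state, model_estimates):
    """Pathway values win where specified; the model determines everything else."""
    return {**model_estimates, **pathway_state}


def pathway_delta(pathway_state, consensus):
    """Mean absolute gap between a pathway's specified values and consensus."""
    shared = [name for name in pathway_state if name in consensus]
    if not shared:
        return None  # nothing to compare
    return sum(abs(pathway_state[n] - consensus[n]) for n in shared) / len(shared)


# Hypothetical components and values.
pathway = {"component_a": 1.0, "component_b": 0.0}
consensus = {"component_a": 0.7, "component_b": 0.2, "component_c": 0.5}
print(full_state(pathway, consensus))
print(round(pathway_delta(pathway, consensus), 2))
```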

The prompting architecture uses Option B joint-state reasoning: each model receives the full scenario with all components and produces an internally coherent set of estimates, reasoning about cross-component dependencies. The coherence layer checks each output for internal contradictions using dependency pairs, producing a coherence score and flagging structural disagreement across models. Multiple internally coherent futures can coexist. The system does not converge on one correct world-state; it surfaces where different coherent worlds diverge and which assumptions drive the divergence.
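The exact form of a dependency pair is not specified here. As one possible reading, the sketch below assumes each pair (a, b) means "a implies b" between binary components, so a coherent output must satisfy P(a) ≤ P(b). The score is then the share of pairs that hold. The pair semantics, the tolerance parameter, and the names are assumptions for illustration.

```python
def coherence_score(estimates, implies_pairs, tolerance=0.0):
    """Return (share of satisfied pairs, list of violated pairs).

    A pair (a, b) is read as "a implies b", which requires P(a) <= P(b).
    """
    violations = [
        (a, b) for a, b in implies_pairs
        if estimates[a] > estimates[b] + tolerance
    ]
    if not implies_pairs:
        return 1.0, violations
    return 1 - len(violations) / len(implies_pairs), violations


# Hypothetical output from one model.
estimates = {"x": 0.6, "y": 0.4, "z": 0.9}
pairs = [("x", "y"), ("x", "z")]
print(coherence_score(estimates, pairs))
```

Returning the violated pairs alongside the score keeps the contradiction inspectable, which is closer in spirit to a coherence map than a single number would be.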

The product output is not a single probability for the thesis. It is a coherence map, hinge variable identification, necessary condition analysis, structural disagreement surface, and pathway-to-consensus delta tracking. The value is uncertainty organization: making the structure of what we do not know inspectable rather than compressing it into a number. This makes scenarios a genuinely distinct product from bundled clusters and factors.

Editorial Choices in Scenario Construction

Each scenario decomposes its anchor question into pathways with editorially chosen distributions across direction labels (acceleration, resistance, mixed). These are editorial choices about which futures are worth specifying, not probabilistic claims about which are most likely.

Each pathway carries a fragility assessment indicating how many single-variable flips would invalidate the thesis. Low fragility pathways are structurally robust. Very high fragility pathways are single-point bets. The fragility distribution surfaces which pathways are worth stress-testing versus which are robust under perturbation.

Crene preserves prior scenario framings for auditability. The current framing powers live scoring, while older framings remain accessible as historical reference views. AI Labor v2 broadens the original AI supervision framing into a labor transformation scenario. Some supervision components remain because AI management is one mechanism through which work may expand, contract, or restructure.

Scenarios are repolled daily across all components, producing a continuous trajectory of how each pathway's coherence with consensus evolves. Different institutions could specify different component sets, pathway structures, fragility assessments, and polling cadences. What stays fixed is the methodology: joint-state reasoning, coherence auditing, pathway-to-consensus delta tracking, and falsifiable resolution against named sources.

Cluster Decomposition

A second pipeline, layered on the same multi-model scoring infrastructure, decomposes a thesis into a factor matrix of falsifiable components. The use case differs from individual forecasts: a quantitative team uses the matrix as a feature library to detect under-modeled exposure in its existing book, not as a standalone signal. Cluster anchors are typically forward-looking binary questions with public resolution dates, decomposable into 50-200 falsifiable components across distinct categories.

Live clusters span monetary policy, AI transition, and civilization dynamics. Each cluster follows the same 5-stage pipeline, with category structures appropriate to its domain.

Generation pipeline (5 stages):

  1. Candidate generation. ~80 questions per category produced via a frontier LLM, prompted for falsifiable binary events with explicit resolution dates and public data sources.
  2. Falsifiability filter. Each candidate scored 1-5 by a separate model on three axes: falsifiability, specificity, and whether it represents an under-attended factor. Sub-threshold candidates dropped.
  3. Probability pre-screen. Single-model probability estimate; candidates outside the target probability band rejected. Events near certainty or near impossibility carry less decomposition value.
  4. Lexical deduplication. Lexical similarity filtering within each category to remove near-duplicate candidates.
  5. Manual curation. A human review pass selects the final cohort and rejects category drift, anti-anchor framing, and questions whose resolution requires subjective judgment.
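Stages 3 and 4 above lend themselves to a short sketch. It assumes the 5-40% target band named later in this section and uses the standard library's `difflib.SequenceMatcher` as a stand-in lexical similarity measure. The 0.85 threshold, the helper names, and the sample questions are assumptions, not the production configuration.

```python
from difflib import SequenceMatcher


def in_band(prob, low=0.05, high=0.40):
    """Stage 3: keep candidates whose pre-screen probability is inside the band."""
    return low <= prob <= high


def _norm(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def dedupe(questions, threshold=0.85):
    """Stage 4: drop a question if it is too similar to one already kept."""
    kept = []
    for q in questions:
        if all(SequenceMatcher(None, _norm(q), _norm(k)).ratio() < threshold for k in kept):
            kept.append(q)
    return kept


# Hypothetical candidates with single-model pre-screen probabilities.
candidates = [
    ("Will the Fed cut rates in March 2027?", 0.22),
    ("Will the Fed cut rates in March 2027 ?", 0.25),
    ("Will Brent crude close above $120?", 0.03),
    ("Does US unemployment exceed 6% by December 2027?", 0.12),
]
screened = [q for q, p in candidates if in_band(p)]
final = dedupe(screened)
print(final)
```

Here the third candidate falls outside the band and the second is dropped as a near-duplicate of the first.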

After curation, components run the same multi-model scoring as standalone events. Each event resolves on a specific date against a public data source and updates daily until resolution.

Why the 5-40% band:

At the extremes of the probability range, model spread reflects noise rather than meaningful disagreement. Near the middle of the range, events are likely already well-modeled by existing systems. The target band contains mechanically plausible events where model disagreement and decomposition density remain informative.

What spread reveals:

Each child has both a consensus probability and a multi-model spread. High-spread rows (≥30 percentage points) surface where models genuinely disagree about an under-modeled factor; these are the most decision-relevant rows for a factor analyst. Tight-spread low-probability rows surface agreed-tail events that can be used to size hedges with calibrated confidence. Trajectory data from repeated snapshots is intended to reveal which factors are drifting up or down over time, which is the leading-indicator hypothesis the cluster product is built around. We do not yet have enough trajectory history to evaluate that hypothesis.
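The two row types described above can be separated with a small classifier. The 30-point high-spread threshold comes from the text; the cutoffs for "tight" spread and "low" probability are not stated, so the values below are placeholder assumptions.

```python
from statistics import mean


def classify_row(model_probs, high_spread=0.30, tight_spread=0.10, tail_prob=0.10):
    """Label a cluster child as 'high-spread', 'agreed-tail', or 'other'."""
    spread = round(max(model_probs) - min(model_probs), 9)  # guard against float noise
    consensus = mean(model_probs)
    if spread >= high_spread:
        return "high-spread"
    if spread <= tight_spread and consensus <= tail_prob:
        return "agreed-tail"
    return "other"


# Hypothetical per-model probabilities for three rows.
print(classify_row([0.15, 0.45, 0.20, 0.30]))
print(classify_row([0.04, 0.05, 0.06, 0.05]))
print(classify_row([0.50, 0.55, 0.60, 0.50]))
```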

Continuous Factor Decomposition

A third pipeline, layered on the same multi-model scoring infrastructure, forecasts continuous state variables as cross-model percentile distributions. Where a cluster decomposes a binary thesis into falsifiable components, a factor decomposes a continuous variable into a driver matrix that explains movement in the distribution. The output of a factor is a five-point percentile distribution (p5, p25, p50, p75, p95) aggregated across four frontier models, with a disagreement metric and a confidence label derived from cross-model spread.
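The aggregation rule, disagreement metric, and label thresholds are not spelled out here. The sketch below shows one plausible shape: average each percentile across models, measure disagreement as the range of the models' medians relative to the aggregated p5-p95 width, and map that ratio to a label. Every one of those choices is an assumption for illustration.

```python
from statistics import mean

LEVELS = ("p5", "p25", "p50", "p75", "p95")


def aggregate(model_dists):
    """Return (aggregated percentiles, disagreement ratio, confidence label)."""
    agg = {lvl: mean(d[lvl] for d in model_dists) for lvl in LEVELS}
    medians = [d["p50"] for d in model_dists]
    width = agg["p95"] - agg["p5"]
    disagreement = (max(medians) - min(medians)) / width if width else 0.0
    if disagreement < 0.25:
        label = "high"
    elif disagreement < 0.5:
        label = "medium"
    else:
        label = "low"
    return agg, disagreement, label


# Hypothetical distributions from two models for the same variable.
dists = [
    {"p5": 1.0, "p25": 2.0, "p50": 3.0, "p75": 4.0, "p95": 5.0},
    {"p5": 2.0, "p25": 3.0, "p50": 4.0, "p75": 5.0, "p95": 6.0},
]
agg, disagreement, label = aggregate(dists)
print(agg["p50"], disagreement, label)
```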

Live factors forecast continuous state variables. Drivers are how distribution moves are explained, not how the distribution is generated. Each driver carries its own cross-model percentile distribution that can be inspected independently, and drivers can be referenced across multiple factors when the same causal channel applies.

Factors and clusters address different questions. A cluster anchors on a forward-looking binary with a public resolution date and decomposes into components that resolve true or false. A factor anchors on a continuous variable at a horizon and decomposes into drivers whose distributions explain the anchor distribution. Some factors correspond to the same horizon as a cluster anchor and can be inspected jointly; most are standalone. Both pipelines surface on the homepage and have dedicated detail pages.

The full decomposition forms a three-level topology: anchor factor at the top, driver families in the middle, and individual drivers at the leaves, with each level carrying its own cross-model distribution. The driver families themselves form a taxonomy that recurs across related factors, which is what makes cross-factor correlation tractable. Inspecting a factor means traversing this topology rather than reading a flat list.
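The topology can be sketched as a small tree in which every node carries its own distribution. The class, field names, and sample values below are hypothetical and only illustrate the traversal.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A factor, driver family, or driver, each with its own distribution."""
    name: str
    distribution: dict
    children: list = field(default_factory=list)


def walk(node, depth=0):
    """Yield (depth, name) top-down: anchor factor, then families, then drivers."""
    yield depth, node.name
    for child in node.children:
        yield from walk(child, depth + 1)


# Hypothetical three-level factor.
driver = Node("example driver", {"p50": 0.4})
family = Node("example driver family", {"p50": 0.5}, [driver])
anchor = Node("example anchor factor", {"p50": 3.2}, [family])

for depth, name in walk(anchor):
    print("  " * depth + name)
```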
