Crene Methodology
A structured system for uncertainty.
Calibration validates the scoring engine. Decomposition organizes the uncertainty.
Crene separates two claims that are often collapsed. The resolved event record tests whether the model ensemble produces meaningful probabilities under a timestamped protocol. Scenarios, clusters, and factors use the same scoring discipline, but their value is structural: making assumptions visible, comparable, and eventually scorable.
This page therefore distinguishes calibrated evidence from exploratory decomposition. The calibration record does not automatically transfer to every long horizon map.
Resolved events test whether the model ensemble produces meaningful probabilities under a timestamped protocol.
Scenarios, clusters, and factors organize assumptions into inspectable maps. Their value is structure, not automatic prediction.
The scoring discipline carries over: independent models, named sources, daily trajectories, and eventual resolution where outcomes exist.
Every question is decomposed into falsifiable components. Clusters break binary questions into situations. Factors break continuous variables into drivers. Scenarios break strategic questions into coherent pathways.
Every component has named resolution criteria and authoritative sources. SEC filings, BLS, Fed statements, surveys, academic research. Not crowd sourced. Verified.
Four frontier LLMs forecast independently with no anchoring, producing a consensus and an explicit disagreement spread for every question.
Crene separates scored forecasting evidence from structural uncertainty maps. Resolved events test the scoring engine against outcomes. Scenarios, clusters, and factors organize assumptions into inspectable structure. The calibration record supports the scoring protocol; it does not automatically validate every decomposition map.
The cleanest Crene-native result is deliberately modest: macro skill ex-earnings, shown below, is directionally positive but thin at the current sample size. We lead with that number rather than the stronger external benchmark because the benchmark is measured on already-priced questions: it shows the ensemble is calibrated, which is table stakes, not that it adds macro skill, which is the harder claim. Neither number validates the decomposition products (clusters, factors, scenarios); those are scored separately and, where outcomes have not yet accrued, not scored at all. We report both because they answer different questions.
Sample reconciliation. The page uses three nested populations:
- Total resolved corpus: all Crene-native resolved questions currently in the database.
- Leakage-controlled benchmark: the subset with a valid last pre-resolution consensus score and a 24 hour guard before resolution.
- Macro ex-earnings subset: excludes the earnings-heavy event flow and isolates the thinner macro record.
Model leaderboard counts are per-model rows from the broader legacy resolved corpus, so their sample sizes differ from the consensus benchmark.
Measured at the last poll before resolution with a 24 hour leakage gap, the four model consensus scores a Brier of 0.114 across 811 resolved questions; skill against the reported base rate baseline is modest. The skill is real but uneven: near perfect on high conviction calls and no better than chance on genuinely contested ones, with per-bucket Brier scores and sample sizes as reported. We report the split rather than the blended average.
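For readers who want the scoring arithmetic spelled out, here is a minimal sketch of a Brier score and its skill against a base rate baseline. The function names are illustrative, and the in-sample base rate is an assumption made here for self-containment; the page's headline benchmark uses a base rate returned by its analytics API.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def base_rate_brier(outcomes: Sequence[int]) -> float:
    """Brier score of always forecasting the sample base rate (assumed baseline)."""
    rate = sum(outcomes) / len(outcomes)
    return brier_score([rate] * len(outcomes), outcomes)


def brier_skill(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """1 - BS / BS_ref; positive values beat the base rate baseline."""
    reference = base_rate_brier(outcomes)
    if reference == 0:
        raise ValueError("baseline Brier is zero; skill is undefined")
    return 1.0 - brier_score(probs, outcomes) / reference


# Toy data, not Crene results.
toy_probs = [0.9, 0.8, 0.2, 0.1]
toy_outcomes = [1, 1, 0, 0]
print(brier_score(toy_probs, toy_outcomes), brier_skill(toy_probs, toy_outcomes))
```

Computing the same two numbers separately for high conviction and contested buckets gives the kind of split described above.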
Earlier internal analyses suggested that tight agreement between models could itself signal correctness. As the dataset expanded the effect did not hold consistently. We do not treat model agreement as a reliable indicator of accuracy.
What Crene does not claim:
- AI forecasts outperform liquid prediction markets.
- Model agreement reliably improves accuracy.
- Probabilities should be interpreted as deterministic outcomes.
What remains open:
- Stability under methodology variation. Calibration has not been tested under variations to prompts, model selection, or category structure. Robustness across configurations is an open question.
- Trajectory legitimacy. Whether daily probability movement contains information beyond the news cycle is empirically unevaluated. The hypothesis becomes testable as resolution data accumulates over the coming weeks.
- Per-category significance. Several categories have fewer than 20 resolved events. Per-category numbers are exploratory until sample sizes grow.
38 categories. Repolled daily.
Crene-native, verified against official sources. Earnings heavy; macro-only skill is modest. See tiers.
Crene macro questions, n=264. Skill against the reported base rate baseline is modest.
Market priced questions, n=811, leakage controlled. Validates the ensemble, not the product layer.
Are the probabilities meaningful?
A well calibrated forecast that predicts 70% is correct 70% of the time. Points on the dashed identity line indicate perfect calibration. Points above the line indicate underconfidence, points below indicate overconfidence. The 0.25 line marks a balanced binary reference; the leakage controlled headline benchmark uses the base rate returned by the live analytics API.
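The calibration plot described above can be reproduced from raw forecasts with a simple binning pass. This is a sketch under assumptions: equal-width bins and the name `calibration_table` are choices made here, not a description of Crene's charting code.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean forecast, observed frequency, count) for each non-empty bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, o))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed = sum(o for _, o in members) / len(members)
        rows.append((mean_forecast, observed, len(members)))
    return rows


# Toy data: ten forecasts of 70%, seven of which resolved true.
for mean_forecast, observed, count in calibration_table([0.7] * 10, [1] * 7 + [0] * 3):
    # observed > mean_forecast: point above the identity line (underconfident)
    # observed < mean_forecast: point below the identity line (overconfident)
    print(round(mean_forecast, 3), round(observed, 3), count)
```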
Automated scanners detect upcoming earnings (Polygon.io financials), macro releases (CPI, NFP, PMI), central bank meetings, and market events. Each gets structured binary resolution criteria and a named authoritative source.
GPT-4o mini, Gemini 2.5 Flash Lite, Claude Haiku 4.5, and Grok 4 Fast (non-reasoning) each forecast independently with no model seeing another model's output. Ensemble consensus is the mean probability. Spread (max minus min) measures disagreement.
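As a small illustration of the two aggregates named above, the sketch below computes consensus as the mean probability and spread as max minus min. The dictionary keys and values are placeholders, not live model output.

```python
from statistics import fmean
from typing import Dict, Tuple


def consensus_and_spread(model_probs: Dict[str, float]) -> Tuple[float, float]:
    """Consensus is the mean probability; spread is max minus min."""
    values = list(model_probs.values())
    if not values:
        raise ValueError("at least one model probability is required")
    return fmean(values), max(values) - min(values)


# Placeholder values for four independently polled models.
toy_poll = {"model_a": 0.60, "model_b": 0.70, "model_c": 0.50, "model_d": 0.80}
consensus, spread = consensus_and_spread(toy_poll)
print(round(consensus, 3), round(spread, 3))
```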
Every active event is repolled at a daily cadence, producing a time series of how each model's probability evolves as new information emerges. Full trajectory data is queryable per event with timestamped per model probabilities and event level consensus.
Earnings resolved daily against Polygon.io SEC derived financials as primary source, with Alpha Vantage cross check and a per event audit trail recording every source response. Macro events resolved via Gemini search grounding, with the model cited source URL classified against an authoritative source allowlist (government statistical agencies, central banks, regulators). Brier scores computed per model per event. All data served via public REST API.
Everything above is empirically calibrated: resolved events, scored predictions, measured accuracy. Everything below is exploratory: decomposition systems that organize uncertainty into inspectable structure. The calibration record does not transfer downward. Clusters, factors, and scenarios inherit the same multi model scoring discipline, but their value is structural, not predictive.
The knowledge map sits above the decomposition systems.
Crene’s knowledge map does not introduce a new forecasting claim. It shows how scenarios, thesis maps, and factor maps sit in one shared field of assumptions.
In the map, threads are shared ontology fields such as fiscal capacity, rates path, AI productivity, markets, dollar system, demographics, and growth poles.
These threads should be read as structural recurrence, not inferred causality. A rates path thread means the same theme appears across multiple anchors. It does not, by itself, claim statistical dependence, causal influence, or predictive correlation. Authored links and empirical co-movement require separate evidence.
Forecast discipline, decomposition discipline, and tail-risk humility.
Crene sits between two traditions that are often treated separately: accountable probabilistic forecasting, associated with the work of Philip Tetlock, Barbara Mellers, and the Good Judgment Project; and humility under fat tails, nonlinear exposure, and model error, associated with Nassim Taleb’s critique of false precision. The product is built around both: score what can be scored, decompose what cannot yet be scored, and expose fragility where prediction would be misleading. These references are methodological context only; no affiliation or endorsement is implied.
Superforecasting begins by turning vague beliefs into explicit, updateable probability judgments. The relevant practices are decomposition, base-rate awareness, inside/outside view comparison, belief updating, and scoring against outcomes.
Crene applies this discipline where questions can resolve: independent model forecasts, timestamped snapshots, named resolution sources, leakage controls, and Brier scoring.
Taleb's work warns against false precision in systems governed by fat tails, nonlinear response, hidden fragility, and model error. In those domains, the central question is often not “what is the exact forecast?” but “where is the system fragile if the distribution is wrong?”
Crene reflects this by separating calibrated evidence from structural maps. Scenario, cluster, and factor maps are not presented as proof of predictive edge. They make exposures, dependencies, and load-bearing assumptions inspectable.
The Crene methodology therefore has two modes. In resolved event space, it behaves like an accountable forecasting system. In strategic uncertainty space, it behaves like an assumption architecture.
The knowledge map connects those architectures without pretending that structural recurrence is causality, correlation, or forecast validation.
Event probabilities, model calibration, Brier scores, leakage-controlled benchmarks, model disagreement, and realized outcomes against named sources.
Scenario pathways, cluster components, factor drivers, ontology fields, cross-anchor recurrence, and the assumption layer underneath strategic judgment.
This distinction is the central guardrail of the methodology page. Calibration validates the scoring protocol where outcomes exist. It does not automatically validate every decomposition map. The maps earn their value by making uncertainty inspectable, contestable, and eventually scorable as evidence accumulates. The reporting structure is also aligned with the statistical tradition behind Brier decomposition: reliability, resolution, and uncertainty should not be collapsed into one flattering number. In macro settings, Crene is closer to tail-aware density forecasting and Growth-at-Risk style reasoning than to point prediction. The goal is to represent uncertainty, downside exposure, and changing distributions rather than claim a single deterministic view.
Different uncertainties produce different errors.
Crene is designed around a practical allocation problem: not every uncertainty deserves the same kind of cognitive effort. Some questions can be scored directly. Some must first be decomposed. Some are dominated by hidden correlation, nonlinear exposure, or tail events that should not be compressed into a single probability without showing the structure underneath.
Forecast error asks whether stated probabilities match resolved outcomes. This is the domain of Brier scoring, calibration curves, leakage controls, base-rate comparison, and proper scoring rules. It is the cleanest part of the system because the output resolves true or false against a named source.
Decomposition error asks whether the right assumptions were surfaced in the first place. A forecast can be well calibrated and still omit the load-bearing variable. Crene addresses this by breaking strategic questions into components, pathways, drivers, and thesis maps before scoring them.
Correlation error asks whether apparently separate assumptions are actually the same bet in different language. The knowledge map exposes structural recurrence across scenarios, clusters, and factors. It does not yet claim that recurrence is statistical dependence; it identifies where dependence should be tested.
Tail error asks what happens when the distribution is wrong, the regime changes, or the important failure mode sits outside the observed sample. This is the Taleb guardrail: the question is not only “what is the probability?” but “where is the view fragile if the model is wrong?”
Why correlation is not a footnote
A set of assumptions can look diversified while carrying one dominant hidden exposure. In portfolio language, the key object is not only the marginal probability or marginal distribution of each bet, but the off-diagonal structure: how the bets move together, share drivers, or fail under the same regime.
Crene’s current public map exposes structural recurrence rather than empirical covariance. Shared ontology fields show that multiple anchors touch the same theme, such as rates path, fiscal capacity, AI productivity, markets, dollar system, demographics, or growth poles. That is not yet a causal or statistical claim. It is a disciplined way to identify where correlation analysis should begin.
As trajectory history accumulates, this becomes a measurable research layer: pairwise factor co-movement, effective number of independent assumptions, concentration diagnostics, and covariance-aware stress maps. Until that empirical layer is mature, Crene labels the output as structural recurrence rather than estimated correlation.
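One common way to define an effective number of independent assumptions is the participation ratio of the correlation matrix, which for a correlation matrix reduces to n squared divided by the sum of squared correlations. The sketch below uses that definition as an assumption; the page does not specify which estimator Crene intends, and the function names are illustrative.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def effective_count(series: Sequence[Sequence[float]]) -> float:
    """Participation ratio: n^2 / sum of squared pairwise correlations."""
    n = len(series)
    total = 0.0
    for i in range(n):
        for j in range(n):
            r = 1.0 if i == j else pearson(series[i], series[j])
            total += r * r
    return n * n / total


# Toy trajectories: two identical series count as one assumption,
# two uncorrelated series count as two.
print(effective_count([[1, 2, 3, 4], [1, 2, 3, 4]]))
print(effective_count([[1, 2, 3, 4], [1, -1, -1, 1]]))
```

In practice such a measure would more plausibly be computed on daily changes than on levels, and only once enough trajectory history exists to make the correlations meaningful.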
Crene separates outputs by epistemic type. Calibration, scoring protocol, and resolved-event accuracy are empirical claims. Scenario pathways, cluster components, factor drivers, and ontology fields are structural claims. Correlation, effective independence, and diversification ratios require additional trajectory evidence and are treated as research outputs until the sample is sufficient.
Crene distinguishes structural relationships from inferred relationships. A structural relationship is explicitly represented in the data: a pathway assigns a value to a component, a cluster child is evaluated against an anchor, or a factor driver belongs to a driver family. These relationships are inspectable without assuming statistical dependence. Inferred relationships — co-movement, correlation, or causal influence between components — require additional evidence and are not claimed by default. Where relationship signals are too thin, Crene surfaces the structural map rather than fabricating precision.
| System | Type | Question |
|---|---|---|
| Scenarios | Structural transitions | How does the system interact? |
| Clusters | Binary theses | What happens? |
| Factors | Continuous distributions | Where does it land? |
A fourth pipeline moves from atomic forecasting to structured futures reasoning. Where clusters decompose binary theses and factors decompose continuous state variables, scenarios decompose long horizon strategic questions into coherent world states called pathways. The three products form a complete hierarchy: clusters answer what happens, factors answer how much, and scenarios answer how the system interacts. Scenarios are not forecasts. They are maps of internally coherent futures that can be inspected, stress tested, and compared.
| Product | Anchor | Children | Question |
|---|---|---|---|
| Clusters | Binary | Situations | What happens? |
| Factors | Continuous | Drivers | How much? |
| Scenarios | Hybrid | Pathways | How does the system interact? |
A scenario anchors on a long horizon strategic question. The anchor is decomposed into binary and continuous components spanning currency architecture, geopolitics, technology, demographics, fiscal policy, and market structure. Each component has a resolution source and horizon date, making the entire system falsifiable.
Pathways are the children of a scenario. Each pathway is a coherent mini world: a named causal narrative backed by a precise vector state assigning values to a subset of the scenario components. Pathways specify only load bearing assumptions, which is the central design principle. Unspecified components are treated as model determined during polling, which prevents overconstraint and allows the system to surface emergent tensions.
Put structurally: components are the binary and continuous state variables that span the scenario. A pathway is a coherent partial assignment across those variables — it fixes only the load-bearing ones and leaves the rest model-determined. Camps group similar pathways into readable regimes. This is what lets a large strategic question be inspected as a state space rather than a flat list of forecasts: Crene does not treat a scenario as a bundle of independent predictions, but as a space of internally coherent worlds whose shared and divergent assumptions are made explicit.
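The idea of a pathway as a partial assignment can be sketched as a small data structure. Everything named here (`Pathway`, `free_components`, `compare`, and the component identifiers) is hypothetical and only illustrates the structure described above, not Crene's schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple, Union

Value = Union[bool, float]  # binary or continuous component state


@dataclass(frozen=True)
class Pathway:
    name: str
    fixed: Dict[str, Value]  # only the load-bearing assumptions are specified


def free_components(pathway: Pathway, components: List[str]) -> List[str]:
    """Components the pathway leaves unspecified (model-determined when polled)."""
    return [c for c in components if c not in pathway.fixed]


def compare(a: Pathway, b: Pathway) -> Tuple[Set[str], Set[str]]:
    """Split components fixed by both pathways into shared and divergent sets."""
    both = a.fixed.keys() & b.fixed.keys()
    shared = {c for c in both if a.fixed[c] == b.fixed[c]}
    return shared, both - shared


# Hypothetical component names for illustration only.
components = ["reserve_share_falls", "capex_growth", "fiscal_consolidation"]
p1 = Pathway("acceleration", {"reserve_share_falls": True, "capex_growth": 0.2})
p2 = Pathway("resistance", {"reserve_share_falls": False, "capex_growth": 0.2,
                            "fiscal_consolidation": True})
print(free_components(p1, components))
print(compare(p1, p2))
```

Keeping `fixed` sparse is what distinguishes a pathway from a full world state: anything absent from the mapping is deliberately left open.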
The prompting architecture uses Option B joint-state reasoning: each model receives the full scenario with all components and produces an internally coherent set of estimates, reasoning about cross component dependencies. The coherence layer checks each output for internal contradictions using dependency pairs, producing a coherence score and flagging structural disagreement across models. Multiple internally coherent futures can coexist. The system does not converge on one correct world-state; it surfaces where different coherent worlds diverge and which assumptions drive the divergence.
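A coherence check over dependency pairs might look like the sketch below. It assumes one specific kind of dependency, "A implies B", which requires P(A) to be no greater than P(B); the page does not say which relations Crene's coherence layer actually uses, so the relation, the tolerance parameter, and the scoring rule are all assumptions.

```python
from typing import Dict, List, Tuple


def coherence_score(
    estimates: Dict[str, float],
    implies: List[Tuple[str, str]],
    tolerance: float = 0.0,
) -> Tuple[float, List[Tuple[str, str]]]:
    """Return the fraction of 'a implies b' pairs satisfied, plus the violations.

    If a implies b, a coherent set of estimates needs P(a) <= P(b).
    """
    violations = [
        (a, b) for a, b in implies if estimates[a] > estimates[b] + tolerance
    ]
    if not implies:
        return 1.0, violations
    return 1.0 - len(violations) / len(implies), violations


# Hypothetical component estimates from one model's joint-state output.
toy_estimates = {"x": 0.6, "y": 0.4, "z": 0.7}
toy_pairs = [("x", "y"), ("x", "z")]
print(coherence_score(toy_estimates, toy_pairs))
```

Running the same check on each model's output and comparing the violation lists is one way structural disagreement across models could be surfaced.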
The product output is not a single probability for the thesis. It is a coherence map, hinge variable identification, necessary condition analysis, structural disagreement surface, and pathway to consensus delta tracking. The value is uncertainty organization: making the structure of what we do not know inspectable rather than compressing it into a number. This makes scenarios a genuinely distinct product from bundled clusters and factors.
Scenario titles are analytical framings, not house political views. They are used to make competing world states readable, inspectable, and falsifiable.
Each scenario decomposes its anchor question into pathways with editorially chosen distributions across direction labels (acceleration, resistance, mixed). These are editorial choices about which futures are worth specifying, not probabilistic claims about which are most likely.
Each pathway carries a fragility assessment indicating how many single variable flips would invalidate the thesis. Low fragility pathways are structurally robust. Very high fragility pathways are single point bets. The fragility distribution surfaces which pathways are worth stress-testing versus which are robust under perturbation.
Crene preserves prior scenario framings for auditability. The current framing powers live scoring, while older framings remain accessible as historical reference views. AI Labor v2 broadens the original AI supervision framing into a labor transformation scenario. Some supervision components remain because AI management is one mechanism through which work may expand, contract, or restructure.
Scenarios are repolled daily across all components, producing a continuous trajectory of how each pathway's coherence with consensus evolves. Different institutions could specify different component sets, pathway structures, fragility assessments, and polling cadences. What stays fixed is the methodology: joint-state reasoning, coherence auditing, pathway to consensus delta tracking, and falsifiable resolution against named sources.
A second pipeline, layered on the same multi model scoring infrastructure, decomposes a thesis into a factor matrix of falsifiable components. The use case differs from individual forecasts: a quantitative team uses the matrix as a feature library to detect under-modeled exposure in its existing book, not as a standalone signal. Cluster anchors are typically forward looking binary questions with public resolution dates, decomposable into 50 to 200 falsifiable components across distinct categories.
Live clusters span monetary policy, AI transition, and trust and information dynamics. Each cluster follows the same 5 stage pipeline with category structures appropriate to its domain.
Generation pipeline (5 stages):
- Candidate generation. ~80 questions per category produced via a frontier LLM, prompted for falsifiable binary events with explicit resolution dates and public data sources.
- Falsifiability filter. Each candidate scored 1-5 by a separate model on three axes: falsifiability, specificity, and whether it represents an under-attended factor. Below threshold candidates dropped.
- Probability pre-screen. Single-model probability estimate; candidates outside the target probability band rejected. Events near certainty or near impossibility carry less decomposition value.
- Lexical deduplication. Lexical similarity filtering within each category to remove near-duplicate candidates.
- Manual curation. A human review pass selects the final cohort and rejects category drift, anti anchor framing, and questions whose resolution requires subjective judgment.
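Two of the mechanical stages above, the probability pre-screen and lexical deduplication, can be sketched in a few lines. The Jaccard token overlap, the 0.8 threshold, and the example questions are assumptions chosen for illustration; the page does not state which similarity measure or cutoff is used. The 5 to 40% band is taken from the text.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token overlap: |A and B| / |A or B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def in_band(prob: float, low: float = 0.05, high: float = 0.40) -> bool:
    """Probability pre-screen: keep candidates inside the target band."""
    return low <= prob <= high


def dedupe(questions: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a question only if it is not a near-duplicate of one already kept."""
    kept: List[str] = []
    for question in questions:
        current = tokens(question)
        if all(jaccard(current, tokens(k)) < threshold for k in kept):
            kept.append(question)
    return kept


# Hypothetical candidate questions.
candidates = [
    "Will the policy rate be cut by the June meeting?",
    "Will the policy rate be cut by the June meeting",
    "Will headline inflation exceed 3% in May?",
]
print(dedupe(candidates))
```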
After curation, components run the same multi model scoring as standalone events. Each event resolves on a specific date against a public data source and updates daily until resolution.
Why the 5 to 40% band:
At the extremes of the probability range, model spread reflects noise rather than meaningful disagreement. Near the middle of the range, events are likely already well-modeled by existing systems. The target band contains mechanically plausible events where model disagreement and decomposition density remain informative.
What spread reveals:
Each child has both a consensus probability and a multi model spread. High-spread rows (≥30 percentage points) surface where models genuinely disagree about an under modeled factor; these are the most decision relevant rows for a factor analyst. Tight-spread low probability rows surface agreed-tail events that can be used to size hedges with calibrated confidence. Trajectory data from repeated snapshots is intended to reveal which factors are drifting up or down over time, which is the leading indicator hypothesis the cluster product is built around. We do not yet have enough trajectory history to evaluate that hypothesis.
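The two row types described above can be separated with a simple rule. Only the 30 percentage point high-spread threshold comes from the text; the tight-spread and tail-probability cutoffs below are placeholders chosen for illustration.

```python
def classify_row(
    consensus: float,
    spread: float,
    high_spread: float = 0.30,   # from the text: >= 30 percentage points
    tight_spread: float = 0.10,  # assumed cutoff, not from the source
    tail_prob: float = 0.15,     # assumed cutoff, not from the source
) -> str:
    """Label a cluster child as contested, agreed_tail, or other."""
    if spread >= high_spread:
        return "contested"
    if spread <= tight_spread and consensus <= tail_prob:
        return "agreed_tail"
    return "other"


print(classify_row(consensus=0.20, spread=0.35))
print(classify_row(consensus=0.08, spread=0.05))
print(classify_row(consensus=0.30, spread=0.15))
```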
Any decomposition of a complex system into a finite set of falsifiable questions encodes editorial choices. We name three explicitly because the calibration record alone does not surface them, and a sophisticated reader should be able to audit them.
Category taxonomy.
Every cluster decomposes into a fixed set of categories. The taxonomy reflects a specific worldview about what factors matter for the thesis. A different analyst would propose a different structure, and both taxonomies can be defensible while producing different matrices. We design taxonomies to map onto factor families that recur across related theses, making cross cluster correlation tractable. The Warsh Fed Rate Path cluster, for example, uses seven categories spanning policy path, Fed communications, inflation, labor resilience, institutional reset, market pricing, and consensus narrative. Different anchors will have different category structures appropriate to their domain. We do not claim our taxonomies are neutral.
Probability band filtering.
Components outside the target probability band are excluded by design. At the extremes, model spread reflects noise rather than meaningful disagreement. Near the middle, events are already well-modeled. The selected band reflects a thesis that decomposition value lives in factors that are mechanically plausible but socially under-attended. A different band would produce a different matrix. The optimal band width is an open research question.
Candidate generation.
The first stage of the pipeline asks a frontier language model to generate roughly 80 candidate questions per category. The model has its own priors about what counts as a relevant question. The downstream filters (falsifiability, probability band, lexical dedup, manual curation) operate on the candidate pool; they do not correct for systematic bias within the pool. If the generating model overweights certain framings within a domain, the resulting matrix inherits that bias. We rotate candidate generation across the scoring models to mitigate single-model bias, but this does not eliminate the underlying issue.
Replaceability.
These are our editorial and scoring choices for our public clusters, not immutable assumptions. Different institutions decompose uncertainty differently. A buyer could specify a different taxonomy, probability band, model ensemble, aggregation method, snapshot cadence, or resolution standard, and run the same forecasting discipline against those choices. Calibration would rebaseline accordingly. What stays fixed is the methodology: independent model forecasts with no anchoring, timestamped trajectories, falsifiable resolution, and tracked calibration over time.
A third pipeline, layered on the same multi model scoring infrastructure, forecasts continuous state variables as cross model percentile distributions. Where a cluster decomposes a binary thesis into falsifiable components, a factor decomposes a continuous variable into a driver matrix that explains movement in the distribution. The output of a factor is a five-point percentile distribution (p5, p25, p50, p75, p95) aggregated across four frontier models, with a disagreement metric and a confidence label derived from cross model spread.
Live factors forecast continuous state variables. Drivers are how distribution moves are explained, not how the distribution is generated. Each driver carries its own cross model percentile distribution that can be inspected independently, and drivers can be referenced across multiple factors when the same causal channel applies.
Factors and clusters address different questions. A cluster anchors on a forward looking binary with a public resolution date and decomposes into components that resolve true or false. A factor anchors on a continuous variable at a horizon and decomposes into drivers whose distributions explain the anchor distribution. Some factors correspond to the same horizon as a cluster anchor and can be inspected jointly; most are standalone. Both pipelines surface on the homepage and have dedicated detail pages.
The full decomposition forms a three level topology: anchor factor at the top, driver families in the middle, individual drivers at the leaves, with each level carrying its own cross model distribution. The driver families themselves form a taxonomy that recurs across related factors, which is what makes cross factor correlation tractable. Inspecting a factor means traversing this topology rather than reading a flat list.
Like cluster decomposition, factor decomposition encodes editorial choices that the calibration record alone does not surface. We name four explicitly so a sophisticated reader can audit them.
Driver taxonomy.
Each factor is decomposed into a fixed set of driver families. For UST10Y these are policy path, inflation, growth, term premium, supply, liquidity, and recession. For SPX these are earnings, margins, multiples, AI capex, credit, flows, volatility, and sector leadership. The families reflect a specific view about what mechanically moves the anchor distribution. A different analyst would propose a different family structure, and both taxonomies can be defensible while producing different driver matrices. We design family structures to map onto recurring causal channels, making cross factor correlation tractable. We do not claim our family taxonomies are neutral.
Percentile grid (p5, p25, p50, p75, p95).
Each model is asked to return a five-point percentile grid rather than a full density. Five points were chosen as the smallest grid that captures both central tendency and tail asymmetry without overfitting to the model's verbal precision. A finer grid produces apparent precision the underlying model cannot support; a coarser grid (just p50 or p25 to p75) loses the tail structure that is the entire reason a continuous factor exists. The choice is empirical, not theoretical. A different decomposition would rebaseline accordingly.
Unit and rendering format.
Each factor specifies its unit (percent, index points, basis points, multiple, dollar amount) and rendering format. Unit choice is consequential: a yield forecast in basis points reads differently from one in percent, and disagreement metrics depend on the unit's scale. We pick units that match how institutional buyers actually trade or analyze the anchor. The choice is a usability decision, not a methodological one, but it shapes how spread and confidence are perceived.
Confidence labeling.
Each observation carries a HIGH / MEDIUM / LOW confidence label derived from cross model disagreement relative to each factor's unit scale. The raw disagreement metric is the primary signal; verbal labels are descriptive summaries. Label calibration is updated as the resolution corpus grows.
Binary forecasts are scored against an outcome of zero or one and aggregated with Brier. Continuous factor forecasts cannot be scored that way. The natural metrics are interval coverage (does the realized value fall inside the p25 to p75 band as often as the band claims, and inside the p5 to p95 band as often as that band claims), per-percentile hit rate (how often the realized value exceeds each forecast percentile), and the continuous ranked probability score (CRPS, the integrated squared distance between the forecast CDF and the step function at the realized value). Each captures a different aspect of calibration.
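Once factor horizons resolve, the metrics above could be computed along these lines. This is a sketch under assumptions: forecasts are represented as plain dictionaries keyed by percentile label, and the mean pinball loss over the five grid points is used as a coarse quantile-based proxy related to CRPS (CRPS equals twice the pinball loss integrated over all quantile levels), not an exact CRPS.

```python
from typing import Dict, Sequence

LEVELS = {"p5": 0.05, "p25": 0.25, "p50": 0.50, "p75": 0.75, "p95": 0.95}


def interval_coverage(
    forecasts: Sequence[Dict[str, float]],
    realized: Sequence[float],
    lower: str = "p25",
    upper: str = "p75",
) -> float:
    """Share of realized values inside [lower, upper]; nominal is 50% for p25-p75."""
    hits = sum(1 for f, y in zip(forecasts, realized) if f[lower] <= y <= f[upper])
    return hits / len(realized)


def pinball_loss(quantile_value: float, level: float, realized: float) -> float:
    """Quantile (pinball) loss for one forecast percentile."""
    diff = realized - quantile_value
    return max(level * diff, (level - 1.0) * diff)


def mean_pinball(forecast: Dict[str, float], realized: float) -> float:
    """Average pinball loss across the five-point grid."""
    losses = [pinball_loss(forecast[k], tau, realized) for k, tau in LEVELS.items()]
    return sum(losses) / len(losses)


# Toy forecast and outcomes, not Crene data.
toy = {"p5": 1.0, "p25": 2.0, "p50": 3.0, "p75": 4.0, "p95": 5.0}
print(interval_coverage([toy, toy], [3.0, 4.5]))
print(interval_coverage([toy, toy], [3.0, 4.5], lower="p5", upper="p95"))
print(mean_pinball(toy, 3.0))
```

A well calibrated factor would show coverage near 50% for the p25 to p75 band and near 90% for the p5 to p95 band once enough horizons have resolved.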
Factor calibration is forward looking. The first factor horizons resolve at end of year, and a single horizon resolution per factor is not enough to estimate any of the three metrics with statistical content. Meaningful calibration depends on the resolution corpus growing as more factors are launched with staggered horizons. We will publish factor calibration alongside binary cluster calibration once the corpus is sufficient.
We do not claim, at this stage, that cross model spread on factors is calibrated, that disagreement is informative about realized variance, or that confidence labels are reliable in either direction. We claim only that the same multi model scoring discipline used for binary forecasts is applied to continuous variables, with independent forecasts, timestamped trajectories, falsifiable resolution against authoritative sources, and tracked calibration once outcomes accrue.
Active questions are repolled daily across the model ensemble and served through the live API.
Resolved records are scored against named sources and used to update the calibration layer.
Claude, GPT, Gemini, and Grok are polled independently, with consensus and disagreement tracked over time.