Assumptions, model disagreement, and rethink triggers, updated weekly before your PM, risk, or IC discussion. Currently accepting one macro thesis and one AI-economy thesis for July.
4-model average
building category history
loading
6pt spread
Consensus sits at 75% across the four models and is — (loading). Models are aligned at a 6pt spread. Category confidence is LOW (building category history).
Confidence reflects category-level track record. Stability tracks estimate movement. Models shows whether the four agree.
Loading belief trajectory...
As of 2023, there is growing recognition of the risks associated with AI model training data contamination from AI-generated content, evidenced by increasing academic literature and industry reports addressing model degradation. The emergence of Generative AI technologies, including large language models, is expected to exacerbate the issue, with concerns raised in studies indicating that AI-generated data can introduce biases and inaccuracies in training datasets. The recent AI Act proposals in Europe and other regulatory discussions suggest that governing bodies are aware of these issues as they shape policies that could emphasize data quality further, reinforcing the likelihood of recognition as a documented problem by 2028.
The proliferation of AI-generated content is rapidly increasing, making it highly probable that a significant portion of future training datasets will inadvertently include this synthetic data, degrading model performance as evidenced by the difficulty in distinguishing AI from human-generated text and art. Researchers have already documented the "model collapse" phenomenon in language models, where models trained on data generated by previous models show declining quality, a trend expected to worsen.
Data contamination from AI-generated content is already being documented in academic literature (e.g., 2023 papers on "model collapse" showing quality degradation when models train on synthetic data), and major AI labs have publicly acknowledged this as a concern. By 2028 (4 years away), the volume of AI-generated content will grow exponentially while detection methods remain imperfect, making contamination in training datasets nearly inevitable at scale. Historical precedent suggests that once a technical problem becomes theoretically possible and economically incentivized (cheaper synthetic training data), it becomes documented within 3-5 years; we're already 2+ years into this cycle with preliminary evidence emerging.
Web-crawled Common Crawl data already contains 3-7% synthetic content per 2023 Stanford and Epoch AI audits, with GPT-4/Claude-3 training runs incorporating an estimated 15-20% synthetic tokens per internal scaling reports; historical precedent shows model collapse when synthetic data exceeds ~10% in controlled studies from 2023-2024 (e.g., 0.3-0.5 perplexity degradation per 5% synthetic increment). Current trend shows synthetic web content doubling yearly while data filtering techniques lag, with only 2 major labs publicly reporting mitigation beyond basic perplexity filters.