4-model average
13pt spread
The three supporting readings tell you how much weight to put on the probability: confidence reflects the category-level track record, stability tracks how the estimate has moved over time, and models shows whether the four models agree.
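As a rough illustration of how the headline average and spread could be derived from the individual model estimates, the sketch below averages the probabilities and reports the spread as the gap between the highest and lowest estimate, in percentage points. The function name `summarize_models` and the input values are hypothetical placeholders, and treating the spread as max minus min is an assumption rather than something this page states.

```python
from statistics import mean

def summarize_models(probabilities):
    """Return (average, spread) in percentage points for model probabilities in [0, 1]."""
    if not probabilities:
        raise ValueError("need at least one model probability")
    avg_pts = mean(probabilities) * 100
    # Assumption: spread is the distance between the most and least optimistic model.
    spread_pts = (max(probabilities) - min(probabilities)) * 100
    return avg_pts, spread_pts

# Placeholder inputs, not the dashboard's actual model outputs.
example = [0.30, 0.35, 0.40, 0.43]
avg, spread = summarize_models(example)
print(f"{len(example)}-model average: {avg:.0f}%, spread: {spread:.0f}pt")
```

A wide spread relative to the average suggests the models disagree, so the averaged probability deserves less weight.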
Inference costs for large models are currently pushed up by rising prices for cloud computing and specialized hardware, with recent reports pointing to pricing pressure from GPU shortages. However, advances in model optimization techniques and competition from emerging AI startups could drive costs down. If these trends continue and model deployment keeps getting more efficient through 2026, prices could fall below $0.05/1M tokens.
Current inference pricing for 70B models ranges from $0.30-$0.80/1M tokens (Meta Llama 2 70B on Together AI, Replicate, and modal.com as of late 2024). Historical precedent shows LLM inference costs have declined 60-80% annually; GPT-3 inference fell from ~$0.02/1K tokens (about $20/1M tokens, 2020) to current competitive rates. Over 20 months, a 6x cost reduction ($0.30 → $0.05) requires a sustained annual decline of roughly 66%, and a 16x reduction from the $0.80 end of the range requires roughly 81%. Both figures are broadly consistent with recent trends driven by quantization improvements, MoE architectures, and provider competition. However, this assumes continued scaling of data center efficiency and no supply-chain constraints on compute hardware.
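The required decline rate can be checked with compound-decline arithmetic. The sketch below is a minimal check that assumes a constant annual rate over the whole period; the function name `required_annual_decline` is illustrative, and the prices and 20-month horizon are the ones quoted in the rationale above.

```python
def required_annual_decline(start_price, target_price, months):
    """Constant annual fractional decline needed to move from start_price to target_price."""
    if start_price <= 0 or target_price <= 0 or months <= 0:
        raise ValueError("prices and months must be positive")
    # Price ratio that must be achieved every 12 months under a constant rate.
    annual_ratio = (target_price / start_price) ** (12 / months)
    return 1 - annual_ratio

# Low and high ends of the quoted $0.30-$0.80/1M token range.
for start in (0.30, 0.80):
    rate = required_annual_decline(start, 0.05, 20)
    print(f"${start:.2f} -> $0.05 over 20 months: {rate:.0%} per year")
```

Under these assumptions, the low end of the quoted range needs about a 66% annual decline and the high end about 81%.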
Current spot pricing shows 70B-class models at $0.10-$0.15/1M tokens on major hosts (Together AI, Fireworks), already down 60% from 2024 levels as HBM3e supply ramps and 8xH200 clusters achieve marginal costs below $0.80/GPU-hour. Historical precedents show 3-4x price drops per generation (Llama-2→3, Mixtral→Llama-3-70B), with inference efficiency gains of 2.2-2.8x per year from speculative decoding and FP8/INT4 quantization. With Blackwell B200 systems delivering 2.5x tokens/watt in 2025 and continued open-source model releases, $0.05/1M tokens appears reachable by Q4 2026 absent supply shocks.
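To see how the quoted efficiency gains translate into a timeline, the sketch below estimates the months needed to reach the target price. It assumes, as a simplification the rationale does not itself claim, that efficiency gains pass through fully to prices at a constant annual factor; `months_to_target` is an illustrative name.

```python
import math

def months_to_target(current_price, target_price, annual_gain):
    """Months until price hits target if it falls by a factor of annual_gain each year."""
    if annual_gain <= 1:
        raise ValueError("annual_gain must exceed 1 for prices to fall")
    if current_price <= target_price:
        return 0.0
    years = math.log(current_price / target_price) / math.log(annual_gain)
    return 12 * years

# Best case: lowest quoted price with the fastest quoted gain; worst case: the reverse.
best = months_to_target(0.10, 0.05, 2.8)
worst = months_to_target(0.15, 0.05, 2.2)
print(f"best case: {best:.1f} months, worst case: {worst:.1f} months")
```

With these inputs the estimate falls between roughly 8 and 17 months, so the conclusion depends mainly on whether providers pass efficiency gains on to prices.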