Neural Research

AMD stated it plainly: memory, not compute, is the next bottleneck in AI data centers. The industry is beginning to confirm it empirically. SemiAnalysis estimates that memory will account for roughly 30% of hyperscaler AI spending in 2026, up from approximately 8% in 2023 and 2024. The underlying constraint is not a forecast, though. It is a current operational reality: memory bandwidth and capacity are already the binding constraint during inference for modern large-scale models.

The Hardware Crunch

Micron's high-bandwidth memory (HBM) capacity is entirely sold out through calendar year 2026, and SK Hynix has confirmed record HBM demand through the same period. As a consequence, gaming GPU production faces cuts of up to 40% as manufacturers redirect production capacity toward AI accelerators. AI now consumes the vast majority of global DRAM production capacity.

The Real Production Metrics

Quantitative evidence from production deployments is striking. The metrics that determine inference performance have little to do with traditional raw-compute figures such as peak FLOPS:

MBU% (Memory Bandwidth Utilization): The fraction of peak HBM bandwidth actually being consumed. In memory-bound inference workloads, an MBU% above 80–90% means the memory system is effectively saturated and has become the bottleneck. If MBU% is high while Streaming Multiprocessor (SM) utilization is low, the GPU is waiting on memory, not compute: you are paying for arithmetic units that sit idle while memory starves them.
KV Cache Pressure: LLMs keep a key-value (KV) cache holding key and value tensors for every token in the context window, at every layer. For long-context workloads with large batches, the KV cache alone can consume the entire HBM capacity. This is one reason frontier models that advertise 200K-token context windows become unreliable in practice around 130K tokens: they hit a physical memory wall long before reaching the advertised limit.
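
Both metrics can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the model shape (80 layers, 8 KV heads, head dimension 128) is an assumed configuration for a 70B-class model with grouped-query attention, and the MBU formula is one common approximation (bytes that must be read per generated token, divided by time per token, relative to peak bandwidth). The 50 ms per token and 3.35 TB/s peak are assumed example values, not measurements.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def mbu(weight_bytes, kv_bytes, seconds_per_token, peak_bandwidth_bytes_per_s):
    """Approximate Memory Bandwidth Utilization for one decode step.

    Assumes all weights and the whole KV cache are read once per generated token.
    """
    achieved_bytes_per_s = (weight_bytes + kv_bytes) / seconds_per_token
    return achieved_bytes_per_s / peak_bandwidth_bytes_per_s


if __name__ == "__main__":
    # Assumed model shape; not taken from any specific deployment.
    for tokens in (130_000, 200_000):
        size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                              seq_len=tokens, batch_size=1)
        print(f"{tokens:>7} tokens -> {size / 1e9:.1f} GB of KV cache per sequence")

    # Assumed inputs: 140 GB of weights, 50 ms per token, 3.35 TB/s peak bandwidth.
    print(f"MBU ~ {mbu(140e9, 0, 0.050, 3.35e12):.0%}")
```

Under these assumptions a single 200K-token sequence needs tens of gigabytes of KV cache on top of the weights, which is why context length and batch size trade off so sharply against available HBM.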

The Scale-Up Bandwidth Gap

At GTC 2026, NVIDIA's own presentations quantified the scale-up fabric bandwidth gap. A high-bandwidth switched scale-up fabric across 72 GPUs delivers around 1,800 GB/s of inter-GPU bandwidth per GPU, whereas standard Ethernet delivers closer to 100 GB/s.

That 18x gap determines whether the scale-up fabric bottlenecks token generation. As Mixture-of-Experts (MoE) becomes the dominant architecture for large-scale inference, every decode step requires GPUs to exchange intermediate activations, because each token must be routed to experts that may live on other GPUs and the results gathered back. That turns scale-up fabric bandwidth into a direct input to inference throughput.
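
To see how the gap shows up per decode step, here is a rough back-of-the-envelope sketch. Only the two bandwidth figures (1,800 GB/s and 100 GB/s) come from the text; the MoE shape, the batch size, and the function names are hypothetical. It also assumes the worst case in which every routed token copy leaves the GPU, and it ignores latency, congestion, and any overlap with compute.

```python
def moe_exchange_bytes(tokens_per_gpu, top_k, hidden_size, n_moe_layers, bytes_per_elem=2):
    """Bytes one GPU exchanges per decode step for expert routing.

    Worst case: every routed copy of every token leaves the GPU (dispatch)
    and comes back (combine), hence the factor of 2.
    """
    per_layer = tokens_per_gpu * top_k * hidden_size * bytes_per_elem
    return 2 * per_layer * n_moe_layers


def transfer_seconds(n_bytes, bandwidth_gb_per_s):
    """Pure bandwidth time; ignores latency, congestion, and compute overlap."""
    return n_bytes / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    # Hypothetical MoE shape and batch; only the two bandwidths come from the text.
    n_bytes = moe_exchange_bytes(tokens_per_gpu=64, top_k=8,
                                 hidden_size=7168, n_moe_layers=58)
    for name, bandwidth in (("scale-up fabric", 1800), ("Ethernet", 100)):
        ms = transfer_seconds(n_bytes, bandwidth) * 1e3
        print(f"{name:>16}: {ms:.2f} ms of exchange per decode step")
```

Because the exchange time scales inversely with bandwidth, the same traffic takes 18 times longer on the slower link, and that time is paid on every generated token.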

The Specific Mechanism: Training Bias vs. Inference Reality

The failure to optimize correctly stems from a classic engineer's attention bias. Training is where AI was born. Training is where the heroic compute stories come from—the multi-week runs, the thousands of parallel GPUs, the dramatic loss curves. Engineers optimized for training, and infrastructure teams planned for training.

But inference is where AI runs. Inference is where 80–90% of the lifetime cost of a production AI system accumulates because it runs continuously, every request, every hour. And inference is fundamentally a memory-bandwidth problem, not a compute problem.

During inference, the model weights must be streamed from HBM to the compute units on every forward pass, which in the decode phase means once per generated token. A 70B-parameter model in FP16 (2 bytes per parameter) needs 140 GB just to hold the weights, before any activations or KV cache. If memory bandwidth is the bottleneck, adding compute units does not fix it; it can make things worse, because more compute units end up contending for the same fixed memory bandwidth.
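
The 140 GB figure, and the throughput ceiling it implies, follow from two lines of arithmetic. This is a minimal sketch with hypothetical function names; the 4.8 TB/s bandwidth is an assumed aggregate figure for whatever devices hold the weights, and the ceiling ignores KV cache reads, batching, and kernel overheads.

```python
def weight_bytes(n_params, bytes_per_param=2):
    """Memory needed for the weights alone (FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param


def decode_tokens_per_s_ceiling(weights_in_bytes, bandwidth_bytes_per_s):
    """Upper bound on single-sequence decode speed when every weight
    must be read from memory once per generated token."""
    return bandwidth_bytes_per_s / weights_in_bytes


if __name__ == "__main__":
    w = weight_bytes(70e9)  # 70B parameters in FP16
    print(f"weights: {w / 1e9:.0f} GB")

    # Assumed aggregate HBM bandwidth of 4.8 TB/s (illustrative value).
    ceiling = decode_tokens_per_s_ceiling(w, 4.8e12)
    print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s per sequence")
```

Note that nothing in the ceiling depends on FLOPS: under these assumptions, doubling compute leaves it unchanged, while doubling bandwidth or halving the bytes per parameter doubles it.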

Every team still reporting "GPU utilization" as their primary infrastructure metric is measuring the wrong thing. That figure typically records only whether a kernel was running, not how much of the arithmetic capacity was in use, so high GPU utilization during inference often reflects memory-bandwidth saturation rather than compute efficiency. A memory-bound GPU at 90% utilization is delivering only a fraction of its theoretical compute capacity.
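
A simple roofline-style check makes the "fraction of theoretical compute capacity" concrete. The sketch below is an approximation under stated assumptions: decode is modeled as roughly 2 FLOPs per parameter per sequence with the weights read once per step, KV cache traffic is ignored, and the peak figures (989 TFLOPS, 3.35 TB/s) are assumed example values. The function names are hypothetical.

```python
def decode_arithmetic_intensity(batch_size, bytes_per_param=2):
    """FLOPs per byte moved in one decode step.

    ~2 FLOPs per parameter per sequence in the batch, with each parameter
    read from memory once per step regardless of batch size.
    """
    return 2.0 * batch_size / bytes_per_param


def attainable_compute_fraction(intensity, peak_flops, peak_bandwidth_bytes_per_s):
    """Roofline model: fraction of peak FLOPS reachable at this intensity."""
    ridge_point = peak_flops / peak_bandwidth_bytes_per_s  # FLOPs per byte
    return min(1.0, intensity / ridge_point)


if __name__ == "__main__":
    PEAK_FLOPS = 989e12      # assumed dense FP16 peak
    PEAK_BANDWIDTH = 3.35e12  # assumed HBM peak, bytes/s
    for batch in (1, 8, 64, 512):
        frac = attainable_compute_fraction(
            decode_arithmetic_intensity(batch), PEAK_FLOPS, PEAK_BANDWIDTH)
        print(f"batch {batch:>3}: at most {frac:.1%} of peak compute")
```

In this model a small-batch decode workload stays in the low single-digit percentages of peak compute no matter how busy the GPU looks, which is why MBU% is the more informative number.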

The True Industry Cost

Lead times for data center GPUs currently run 36–52 weeks. While CHIPS Act investments provide long-term supply security, facilities opening years from now cannot ease 2026 supply constraints.

Organizations are procuring H100s and H200s to solve inference latency problems under the false assumption that compute units are the constraint. The result is millions of dollars of expensive hardware sitting partially idle. The performance gap between naive and optimized inference implementations exceeds 10x in memory efficiency. In fact, tuning KV paging and speculative decoding alone has yielded up to a 40% throughput gain without adding a single unit of compute.
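
One source of the memory-efficiency gap is how KV cache space is reserved. The sketch below compares a naive scheme that reserves the maximum sequence length for every request against a paged scheme that reserves fixed-size blocks on demand. It is a simplified illustration, not a reproduction of the figures above: the function names, the block size of 16 tokens, and the request lengths are all hypothetical.

```python
import math


def contiguous_slots(request_lengths, max_seq_len):
    """Naive scheme: every request reserves max_seq_len token slots up front."""
    return len(request_lengths) * max_seq_len


def paged_slots(request_lengths, block_size=16):
    """Paged scheme: each request reserves only whole blocks it actually needs."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)


def utilization(request_lengths, reserved_slots):
    """Fraction of reserved KV slots that hold real tokens."""
    return sum(request_lengths) / reserved_slots


if __name__ == "__main__":
    lengths = [100, 900, 2500, 40]  # hypothetical in-flight request lengths
    naive = contiguous_slots(lengths, max_seq_len=4096)
    paged = paged_slots(lengths, block_size=16)
    print(f"naive: {naive} slots reserved, {utilization(lengths, naive):.1%} used")
    print(f"paged: {paged} slots reserved, {utilization(lengths, paged):.1%} used")
```

The memory freed by paging can be spent on larger batches, which raises throughput on the same hardware without adding compute.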

Conclusion: What Needs to Exist