Technical · 6 min read

The Memory War That Will Define AI

Giuseppe Albrizio
Strategic Analysis - January 2026

Based on the article by Ben Pouladian


Executive Summary

In late December 2025, two seemingly disconnected events revealed an epochal transition in AI infrastructure:

The Two Key Events
  • Andrej Karpathy (OpenAI co-founder, former Director of AI at Tesla) publicly states: "I've never felt so behind as a programmer"
  • NVIDIA orders 16-Hi HBM, an ultra-advanced memory that has never been mass-produced, with a delivery target of Q4 2026

We are witnessing the construction of an infrastructure that will make AI inference effectively infinite and nearly free at the margin by 2028-2030. This transition will radically redefine the role of the software developer.


The Problem: The Memory Wall

The Bottleneck

AI models grow exponentially faster than our ability to feed them with data.

  • ~3.5TB: GPT-4 (1.76T parameters)
  • 5TB+: projected 2028 models (10T+ parameters)
  • 312GB: KV cache per user at a 1M-token context
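The KV-cache figure follows from the standard size formula: two tensors per layer (keys and values), each proportional to sequence length. A minimal sketch, assuming a hypothetical Llama-70B-style configuration (80 layers, 8 KV heads via grouped-query attention, head dim 128, fp16) rather than anything the article specifies:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys + values: 2 tensors per layer, each of shape [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical Llama-70B-style config, fp16 cache, 1M-token context
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"{size / 1e9:.0f} GB per user at 1M tokens")  # → 328 GB
```

The result is the same order of magnitude as the 312GB cited; the exact figure depends on the (undisclosed) model configuration assumed.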

The '99% Idle Problem'

During inference decode, an H100 GPU worth $40,000 operates at less than 1% effective utilization. 99% of the time is spent waiting for data to arrive from memory.

Root Cause

A mismatch between compute throughput (990 TFLOPS) and memory bandwidth (3.35 TB/s): the H100 is balanced for workloads of ~295 FLOPs per byte, but inference decode executes only ~2 FLOPs per byte.

This is the memory wall - and it's becoming the real bottleneck of AI.
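The sub-1% figure falls straight out of a roofline-style calculation on the numbers above. A quick sketch (H100 specs as quoted in the article; the model is the standard memory-bound bound):

```python
peak_flops = 990e12   # H100 compute throughput (990 TFLOPS)
mem_bw = 3.35e12      # H100 HBM bandwidth (3.35 TB/s)

# Machine balance: FLOPs the chip can perform per byte moved from memory
machine_balance = peak_flops / mem_bw   # ≈ 295 FLOPs/byte

# A memory-bound kernel at ~2 FLOPs/byte can use only this fraction of peak compute
decode_intensity = 2.0
utilization = decode_intensity / machine_balance

print(f"machine balance: {machine_balance:.1f} FLOPs/byte")   # → 295.5
print(f"effective utilization: {utilization:.2%}")            # → 0.68%
```

At ~0.7% of peak, the remaining 99%+ of compute cycles are spent waiting on memory, exactly the "99% idle problem" described above.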


Two Memory Architectures, Two Philosophies

Characteristic | HBM (High Bandwidth Memory) | SRAM (On-Chip Static RAM)
Capacity | 80GB to 1TB (2027) | 50MB to 230MB (Groq)
Bandwidth | 3.35 TB/s to 32 TB/s | 12 TB/s to 80 TB/s
Latency | 100-150 ns | 0.5-2 ns (50-100x lower)
Trade-off | High capacity, medium latency | Low capacity, minimal latency
Best for | Training, prefill, large models | Inference decode, low-latency serving

The Competition: Four Strategic Moves

1. The Race to 16-Hi HBM

NVIDIA wants 16 DRAM layers stacked within the 775µm JEDEC package height. Production requires wafers thinned to 30µm (vs the current 50µm), silicon so thin it is translucent. Samsung, SK Hynix and Micron are competing for $50B+ in annual HBM revenue by 2028.
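A back-of-the-envelope check shows why 16-Hi forces such thin silicon (the division below ignores the base logic die and bonding layers, which squeeze the budget further):

```python
package_height_um = 775   # JEDEC height budget for the HBM stack
layers = 16

budget_per_layer = package_height_um / layers
print(f"{budget_per_layer:.1f} µm per DRAM layer")  # → 48.4 µm
# The base die, bonding interfaces and mold all share the same 775 µm budget,
# so each DRAM die must be thinned well below 48 µm -- hence the 30 µm wafers.
```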

2. The Physical Wall of SRAM

Physical Limit

SRAM density has stalled due to physical limits. You can't add significant SRAM to a monolithic die without prohibitive costs. This is a physics limit, not an engineering one.

3. The Groq $20B Deal

NVIDIA acquired a license to Groq's architecture for $20B. Groq demonstrated that SRAM-centric architectures with deterministic dataflow can reach 276 tokens/sec (vs 60-100 on GPUs) on Llama 70B.

The problem: it requires 576 chips across 8 racks. NVIDIA paid for the strategic validation, not for the chips.
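A rough bandwidth-bound model makes both token rates plausible: in batch-1 decode, every generated token must stream the full weight set from memory, so bandwidth divided by weight bytes is a hard ceiling. A sketch assuming fp16 weights (a simplification; real deployments shard, batch and quantize):

```python
def decode_ceiling_tok_s(params, bytes_per_param, bandwidth_bytes_s):
    # Batch-1 decode reads every weight once per token: tok/s <= bandwidth / weight bytes
    return bandwidth_bytes_s / (params * bytes_per_param)

# Llama 70B in fp16 against a single H100's 3.35 TB/s of HBM bandwidth
print(f"{decode_ceiling_tok_s(70e9, 2, 3.35e12):.0f} tok/s")  # → 24 tok/s
```

Sharding across several GPUs multiplies the available bandwidth and pushes this toward the 60-100 tok/s range, while Groq's 576 chips hold the weights entirely in SRAM at far higher aggregate bandwidth, which is how it reaches 276 tok/s at the cost of rack-scale hardware.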

4. The NVIDIA Solution: Feynman 2028

The Architecture That Closes the Gap
  • 3D-stacked SRAM via hybrid bonding (AMD X3D style)
  • Compute die on TSMC A16 with backside power delivery
  • Separate SRAM dies on mature nodes, vertically stacked
  • HBM 16-Hi (48-64GB per stack) for capacity

Result: HBM capacity for training + SRAM bandwidth for low-latency inference.


Infrastructure Roadmap 2025-2030

Period | Technology | Capacity/Bandwidth | Impact
2025-2026 | HBM3E, 12-Hi HBM4, B200 | 192GB, 8 TB/s | Current baseline
Q4 2026 | 16-Hi HBM4 delivery | 256-320GB (est.) | Production breakthrough
2027 | Rubin Ultra | 1TB HBM4E, 32 TB/s | Enterprise scale
2028+ | Feynman (A16 + 3D SRAM) | 1TB+ HBM + stacked SRAM | Full dominance

Competitive Implications: Who Wins, Who Loses

Winners
  • NVIDIA: complete vertical integration
  • Whoever dominates advanced packaging wins
  • Infrastructure converges on one player
Losers
  • Groq and specialized ASICs: the gap closes
  • Custom hyperscaler ASICs: ROI in question
  • AMD: needs a packaging response, not process
The Pattern

NVIDIA doesn't compete on individual parameters (SRAM, HBM, compute). It competes on vertical integration of all three through advanced packaging.


Implications for Software Development

The New Developer Paradigm

"I've never felt so behind as a programmer"

Andrej Karpathy

This doesn't signal obsolescence. It signals infrastructure velocity exceeding cognitive adaptation velocity.

From... | To...
Writing code | Orchestrating AI systems
Syntax and implementation | Architecture and verification
Memorizing patterns and APIs | Judgment on stochastic output

Meta-Stable Skills vs Volatile Tools

Skills That Remain Valid Regardless of Infrastructure
  • Structured thinking and problem decomposition
  • Ability to read and evaluate others' code rapidly
  • Intuition for code smells, anti-patterns, edge cases
  • Understanding of architectures and systemic trade-offs
  • Security awareness and threat modeling
Specific Tools: 6-18 Month Lifecycle

The AI graveyard of 2024-2025 includes: Inflection Pi ($4B - team hired by Microsoft), Character.AI ($1B+ - Google acquihire), Supermaven (35k devs - acquired by Cursor), Adept ($350M raised - Amazon acquihire).


Strategic Conclusions

For Organizations

Recommendations
  • AI infrastructure will converge on NVIDIA: Plan architectures assuming this as the 2028-2030 baseline
  • Inference cost will collapse: Models that are cost-prohibitive today will become commodities
  • Developer training on AI orchestration, not specific AI coding: Tools change every 6-12 months
  • Physical AI/Robotics becomes viable: Video world models and embodied AI require exactly this infrastructure

For Development Teams

Concrete Actions
  • Invest in meta-stable skills (80%) vs specific tools (20%)
  • Master the generation-verification loop: AI generates - human verifies - rapid iteration
  • Non-negotiable quality gates: Lint, test coverage >80%, security scan, no secrets, type hints
  • Monthly tool landscape review: The only constant is change
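The gate list above can be wired into one script that either passes every gate or blocks the merge. A minimal sketch; the tool choices (ruff, pytest-cov, gitleaks) and the invocation flags are illustrative assumptions, not prescribed here:

```python
import subprocess
import sys

def run_gates(gates):
    # Run each command; a gate passes only when the tool exits with code 0
    results = {}
    for name, cmd in gates:
        try:
            results[name] = subprocess.run(cmd, capture_output=True).returncode == 0
        except FileNotFoundError:   # a missing tool counts as a failed gate
            results[name] = False
    return results

# Illustrative gate set for a Python project
GATES = [
    ("lint", ["ruff", "check", "."]),
    ("tests + coverage >80%", ["pytest", "--cov", "--cov-fail-under=80"]),
    ("secret scan", ["gitleaks", "detect"]),
]

if __name__ == "__main__":
    results = run_gates(GATES)
    for name, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
```

Running this in CI as the merge gate keeps the human-verification step cheap: reviewers only see AI-generated code that has already cleared the mechanical checks.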

The Speed of Transition

Previous infrastructure transitions (railroads, electricity, internet) took decades. NVIDIA is compressing the AI buildout into a 5-year roadmap visible today.

It's not a question of "if" we'll have abundant, near-free AI inference. It's "when" - and the answer is 2028-2030.

Implication: The bottleneck shifts from "can we run this model?" to "what should we ask it?" Innovation becomes prompt design, agentic architectures, and orchestration - not inference optimization.


Strategic analysis based on the article "The Memory War That Will Define AI" by Ben Pouladian
