Frontier Models | Mar 09, 2026 | 3 min read

NVIDIA GTC 2026: Blackwell Ultra, NeMo Overhaul, and the Inference War

NVIDIA's annual GPU conference delivered Blackwell Ultra with a claimed 2.5x inference performance per watt over the H100, a rebuilt NeMo framework, and a clear signal that the company sees inference, not training, as the next bottleneck.

By Jeff Brook

AI Researcher — Founder, AI Daily News

Jensen Huang's keynote at GTC 2026 ran nearly three hours, but the strategic signal was compressible to one sentence: NVIDIA is pivoting its centre of gravity from training to inference. Every major announcement — Blackwell Ultra, the NeMo 3.0 framework rewrite, and the new TensorRT-LLM 2.0 runtime — pointed in the same direction.

What did NVIDIA actually announce?

The headline hardware is Blackwell Ultra, the second-generation Blackwell GPU that doubles the transformer engine throughput of the original B200. NVIDIA claims 2.5x inference performance per watt compared to the H100, achieved through a combination of FP4 support, larger on-chip SRAM, and a redesigned NVLink interconnect that pushes 1.8 TB/s between chips.

On the software side, NeMo 3.0 is a ground-up rewrite of NVIDIA's model training and customisation framework. The previous version was a monolith — training, fine-tuning, alignment, and evaluation were tightly coupled. NeMo 3.0 breaks these into composable microservices. You can now run RLHF alignment as a standalone service, plug in external reward models, and swap evaluation pipelines without touching the training loop.
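
To make the "plug in external reward models" idea concrete, here is a minimal sketch of a swappable reward interface feeding a preference-pair builder of the kind a DPO step consumes. This is not NeMo's API: `RewardModel`, `LengthPenaltyReward`, `KeywordReward`, and `build_preference_pair` are hypothetical names invented for illustration, and the toy rewards stand in for a real external model.

```python
from typing import Dict, List, Protocol


class RewardModel(Protocol):
    """Anything with a score() method can be plugged in (hypothetical interface)."""

    def score(self, prompt: str, response: str) -> float: ...


class LengthPenaltyReward:
    """Toy reward that prefers shorter responses."""

    def score(self, prompt: str, response: str) -> float:
        return -float(len(response))


class KeywordReward:
    """Toy reward that prefers responses mentioning a given keyword."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword.lower()

    def score(self, prompt: str, response: str) -> float:
        return 1.0 if self.keyword in response.lower() else 0.0


def build_preference_pair(reward_model: RewardModel, prompt: str,
                          candidates: List[str]) -> Dict[str, str]:
    """Rank candidates with the supplied reward model; keep the best and worst."""
    ranked = sorted(candidates, key=lambda r: reward_model.score(prompt, r))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}


prompt = "Should we deploy at FP4?"
candidates = ["Use FP4.", "Use FP4 where accuracy allows, and validate first."]

# Swapping the reward model changes the preference data, not the builder.
pair_short = build_preference_pair(LengthPenaltyReward(), prompt, candidates)
pair_keyword = build_preference_pair(KeywordReward("validate"), prompt, candidates)
print(pair_short["chosen"])
print(pair_keyword["chosen"])
```

The point of the design is that the alignment step depends only on the `score` signature, so a reward model can be replaced without touching anything upstream.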

The third pillar is TensorRT-LLM 2.0, which introduces speculative decoding as a first-class feature. NVIDIA demonstrated a 70B parameter model serving at 180 tokens per second on a single B200 node — roughly 3x what the same model achieves on H100s with the previous runtime.

Why is inference the new battleground?

The economics are straightforward. According to Epoch AI's compute trends tracker, training compute for frontier models doubled roughly every 6 months through 2025, but the total inference compute deployed by major providers has been growing at 4x that rate. Every model trained once gets served millions of times. As agentic workloads proliferate — where a single user request might trigger dozens of LLM calls — inference costs dominate the total cost of ownership.

NVIDIA's internal data suggests that inference now accounts for over 60% of all GPU-hours consumed across their cloud partners, up from approximately 40% in early 2025. The Blackwell Ultra is designed to capture this shift.

What does this mean for practitioners?

Three things worth acting on.

First, the FP4 story is real but nuanced. Blackwell Ultra's native FP4 support enables 4-bit inference with minimal quality degradation on most transformer architectures. NVIDIA showed benchmarks where Llama-class models at FP4 retained 98.5% of their FP16 benchmark scores while running at nearly double the throughput. For production deployments, this means you can likely halve your GPU fleet for inference workloads — but you need to validate on your specific use cases. Quantisation-sensitive tasks like structured code generation and mathematical reasoning still benefit from higher precision.
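
A rough sketch of what "validate on your specific use cases" can look like in practice. The task names, scores, and the 98% retention threshold below are hypothetical placeholders, not NVIDIA figures. The weight-memory arithmetic is exact for weights only and ignores KV cache and activation memory.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


def retention_report(fp16_scores, fp4_scores, min_retention=0.98):
    """Return {task: (retention, passed)} comparing FP4 scores to FP16 baselines."""
    report = {}
    for task, baseline in fp16_scores.items():
        retention = fp4_scores[task] / baseline
        report[task] = (retention, retention >= min_retention)
    return report


# Hypothetical evaluation scores, for illustration only.
fp16 = {"summarisation": 0.80, "code_generation": 0.60, "math_reasoning": 0.50}
fp4 = {"summarisation": 0.79, "code_generation": 0.57, "math_reasoning": 0.46}

print(f"70B weights: FP16 {weight_memory_gb(70, 16):.0f} GB, "
      f"FP4 {weight_memory_gb(70, 4):.0f} GB")
report = retention_report(fp16, fp4)
for task, (retention, passed) in report.items():
    verdict = "ok at FP4" if passed else "keep higher precision"
    print(f"{task}: {retention:.1%} retained, {verdict}")
```

Checking retention per task rather than on an aggregate matters here, because an average can hide exactly the quantisation-sensitive workloads mentioned above.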

Second, NeMo 3.0's modular alignment pipeline is genuinely useful. The ability to run RLHF or DPO as a standalone microservice means you can iterate on alignment without re-running full training loops. For teams building domain-specific models, this cuts the feedback cycle from days to hours. The catch is that NeMo 3.0 is tightly coupled to NVIDIA hardware — it leverages NVLink and NCCL in ways that make it impractical to run on non-NVIDIA infrastructure.

Third, speculative decoding in TensorRT-LLM 2.0 changes the deployment calculus. If you are serving models above 30B parameters, speculative decoding with a well-matched draft model can deliver 2-3x latency improvements with minimal additional compute. The engineering effort to set this up has been significant until now — TensorRT-LLM 2.0 makes it a configuration option rather than a research project.
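
For readers who have not seen the technique, here is a toy sketch of the greedy variant of speculative decoding. It is not TensorRT-LLM's API: the `target` and `draft` functions are hypothetical stand-ins for a large and a small model, and production systems use a probabilistic accept/reject rule when sampling rather than the exact-match check shown here.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, target verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position. On real hardware this is
        #    a single batched forward pass; the inner loop only simulates it.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)  # always keep the target's own choice
            if proposal[i] != expected:
                break  # first mismatch: discard the rest of the draft
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], target_passes


def target(prefix: List[int]) -> int:
    """Hypothetical 'large model': counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def draft(prefix: List[int]) -> int:
    """Hypothetical 'small model': usually agrees with the target."""
    return 0 if len(prefix) % 5 == 0 else (prefix[-1] + 1) % 10


baseline = greedy_decode(target, [0], max_new=12)
spec_tokens, passes = speculative_decode(target, draft, [0], max_new=12, k=4)
print(spec_tokens == baseline, passes)
```

Because every kept token is the target's own greedy choice, the output matches plain greedy decoding; the saving comes from needing fewer target passes when the draft agrees often. How well the draft is matched to the target therefore determines the speedup.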

What should you watch for?

The competitive response matters. AMD's MI350 is expected later this year, and early leaked benchmarks suggest it will be competitive on inference throughput per dollar if not per watt. Google's TPU v6 Trillium is already in limited deployment. The inference hardware market is about to get genuinely competitive for the first time since the transformer revolution began.

The deeper question is whether hardware improvements can keep pace with the demand growth from agentic architectures. A single agent workflow that chains 15-20 LLM calls per user interaction creates fundamentally different scaling requirements than a simple chat interface. NVIDIA is betting that inference is the bottleneck — but the real constraint might be the orchestration layer above the GPUs, not the GPUs themselves.
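
A back-of-the-envelope sketch of that argument. The 180 tokens per second figure is the demo number quoted earlier; the 300 output tokens per call and 0.5 seconds of orchestration overhead per call are hypothetical assumptions, and the calls are assumed to run sequentially.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Workload:
    llm_calls: int                # sequential LLM calls per user interaction
    output_tokens_per_call: int   # hypothetical average
    tokens_per_second: float      # decode throughput of the serving stack
    overhead_s_per_call: float    # hypothetical routing, tool, and queue time

    def decode_seconds(self) -> float:
        return self.llm_calls * self.output_tokens_per_call / self.tokens_per_second

    def overhead_seconds(self) -> float:
        return self.llm_calls * self.overhead_s_per_call

    def total_seconds(self) -> float:
        return self.decode_seconds() + self.overhead_seconds()


chat = Workload(1, 300, 180.0, 0.5)
agent = Workload(20, 300, 180.0, 0.5)
# Doubling GPU throughput halves decode time but leaves orchestration untouched.
agent_fast_gpu = replace(agent, tokens_per_second=360.0)

for name, w in [("chat", chat), ("agent", agent), ("agent, 2x GPU", agent_fast_gpu)]:
    print(f"{name}: decode {w.decode_seconds():.1f}s + "
          f"orchestration {w.overhead_seconds():.1f}s = {w.total_seconds():.1f}s")
```

Under these assumptions, doubling decode throughput speeds up the agent workflow by well under 2x, because the per-call overhead scales with the number of calls and not with GPU speed.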

GTC 2026 confirmed that NVIDIA sees the future clearly. Whether they can maintain their margin structure as inference becomes a commodity — that is the open question.
