Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

Published May 23, 2026

Large language models (LLMs) have become the default interface for code generation, math problem solving, summarization, document understanding, and many other developer workflows. Under the hood, though, many LLMs still generate text the same way: one token at a time, with each token depending on the tokens that came before it. These models are called autoregressive because they consume their own outputs as input.

That autoregressive (AR) approach has been remarkably successful. It is stable to train, simple to serve, and responsible for much of the progress in modern language modeling. But it also creates a hard limit: every new token requires a full forward pass through the model, and every weight has to be loaded from memory before computation can start. For developers building latency-sensitive applications, running small batch sizes, or trying to make better use of modern GPUs, token-by-token generation leaves performance on the table, because most of the GPU's time is spent on memory operations rather than computation.
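The memory-bound limit described above can be estimated with back-of-the-envelope arithmetic. The sketch below is only an illustration: the function name, the 2000 GB/s bandwidth figure, and the 4-tokens-per-pass value are hypothetical assumptions, not measurements from this article, and the estimate ignores KV-cache traffic, compute time, and batching.

```python
def decode_ceiling_tokens_per_s(params_billion, bytes_per_param,
                                bandwidth_gb_s, tokens_per_pass=1.0):
    """Rough upper bound on single-stream decode speed when weight loading dominates.

    Assumes every forward pass must read all weights from memory exactly once.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    passes_per_s = (bandwidth_gb_s * 1e9) / weight_bytes
    return passes_per_s * tokens_per_pass


# Hypothetical setup: 8B parameters, 16-bit weights, 2000 GB/s memory bandwidth.
ar_ceiling = decode_ceiling_tokens_per_s(8, 2, 2000)
# Same hardware, but each pass commits 4 tokens on average (illustrative value).
parallel_ceiling = decode_ceiling_tokens_per_s(8, 2, 2000, tokens_per_pass=4)

print(ar_ceiling, parallel_ceiling)  # 125.0 500.0
```

Under these assumptions the ceiling scales linearly with the number of tokens committed per pass, which is why producing several tokens per weight load is attractive at small batch sizes.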

Additionally, once an autoregressive model generates a token, that token is final: the model has no inherent ability to revise earlier tokens. Consequently, mistakes can propagate over the course of generation.

Nemotron-Labs Diffusion introduces a new path forward: diffusion language models (DLMs) that generate multiple tokens in parallel, then iteratively refine them over multiple steps. Not only can these models make better use of the parallel compute of modern GPUs and offer significant runtime performance benefits, but they can also revise generated tokens, making them better suited to editing existing text and to fill-in-the-middle objectives. This generate-and-refine property also offers a built-in way to control the inference budget: reducing the number of refinement steps reduces the compute these models require at runtime.

Quick Links to the Models, Training Recipe and Technical Report

The Nemotron-Labs Diffusion family includes text models at 3B, 8B, and 14B scales, all available under the commercially friendly NVIDIA Nemotron Open Model License, as well as an 8B-scale vision-language model (VLM), available under the NVIDIA Source Code License, which grants broad research flexibility. Across the lineup, NVIDIA is releasing both base models and instruction-tuned chat variants. NVIDIA is also releasing the code for training these models through the NVIDIA Megatron Bridge framework.

Three Generation Modes in One Model

Nemotron-Labs Diffusion is designed around a simple idea: autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model. The model supports three generation modes:

Autoregressive mode runs like a standard left-to-right LLM. This keeps compatibility with the generation workflow developers already know.

Diffusion mode generates text block by block, refining the tokens in each block over multiple steps.

Self-speculation mode uses diffusion to draft multiple candidate tokens, then uses autoregressive decoding to verify them. This combines the speed potential of diffusion-style drafting with the reliability of AR verification.
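The draft-then-verify idea behind self-speculation can be illustrated with a toy sketch. Everything here is hypothetical: `self_speculative_step` and `toy_ar_next_token` are made-up names, the "model" is a simple counting rule, and the choice to substitute the AR token at the first mismatch is a common speculative-decoding convention assumed for illustration, not a description of the released implementation. A real model would also verify all draft positions in a single causal forward pass rather than one call per position.

```python
from typing import Callable, List


def self_speculative_step(prefix: List[int],
                          draft_block: List[int],
                          ar_next_token: Callable[[List[int]], int]) -> List[int]:
    """Return the tokens to commit for one draft block.

    Keeps the longest prefix of the draft that the AR rule would also produce.
    On the first mismatch, the AR token is committed instead (assumption).
    """
    context = list(prefix)
    accepted: List[int] = []
    for drafted in draft_block:
        expected = ar_next_token(context)
        if drafted != expected:
            accepted.append(expected)  # fall back to the AR choice and stop
            return accepted
        accepted.append(drafted)
        context.append(drafted)
    return accepted


def toy_ar_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for greedy AR decoding: count upward modulo 10."""
    return (context[-1] + 1) % 10


partial = self_speculative_step([1, 2, 3], [4, 5, 7, 8], toy_ar_next_token)
full = self_speculative_step([1, 2, 3], [4, 5, 6, 7], toy_ar_next_token)
print(partial, full)  # [4, 5, 6] [4, 5, 6, 7]
```

In this toy, the committed tokens always equal what the AR rule alone would have produced, so a draft can only change how many tokens are committed per step, never which tokens.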

This flexible design is the key developer-facing feature for workloads where speed and accuracy both matter, including workloads with unpredictable batch sizes and single-query serving (batch size = 1). Selecting the inference mode is a deployment-time setting, so it requires almost no change at the application level. Developers can therefore switch between the model they use today and Nemotron-Labs Diffusion in any of its inference modes for ultra-fast generation.

Performance Highlights

Nemotron-Labs Diffusion 8B achieves 1.2% higher average accuracy than Qwen3 8B. Inference speed is compared in tokens per forward pass (TPF), a hardware-agnostic measure of token decoding efficiency. By that measure, diffusion mode reaches 2.6× higher TPF than AR models, while self-speculation pushes that further to 6× for linear self-speculation and 6.4× for quadratic self-speculation, with comparable accuracy across the evaluated tasks.
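As a concrete reading of the TPF metric, the sketch below computes tokens per forward pass from a decoding trace. The helper name and both traces are hypothetical and chosen only to show the arithmetic; they are not the measurements reported here.

```python
def tokens_per_forward_pass(tokens_committed_per_pass):
    """Average number of tokens committed per model forward pass."""
    passes = len(tokens_committed_per_pass)
    if passes == 0:
        raise ValueError("at least one forward pass is required")
    return sum(tokens_committed_per_pass) / passes


# AR decoding commits exactly one token per pass (12 tokens, 12 passes).
ar_tpf = tokens_per_forward_pass([1] * 12)
# Hypothetical parallel trace: the same 12 tokens committed in 4 passes.
parallel_tpf = tokens_per_forward_pass([3, 2, 4, 3])

print(ar_tpf, parallel_tpf)  # 1.0 3.0
```

Because TPF counts passes rather than seconds, it does not depend on the GPU used; wall-clock throughput additionally depends on how long each pass takes.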

How we trained Nemotron-Labs Diffusion

Diffusion language models have been promising for years, but they have historically had practical barriers: lower accuracy than strong AR models, more difficult training, and limited compatibility with KV caching.

Recent work changed that direction. Efficient-DLM showed that pretrained AR models can be converted into diffusion language models through continued pretraining combined with a switch to block-wise attention. That design helps preserve AR model capabilities while enabling KV-cache-friendly parallel decoding.

Nemotron-Labs Diffusion builds on the same practical insight: add diffusion capabilities to an existing AR model. The model was trained with a joint AR and diffusion objective, allowing it to retain what it had learned during its initial AR training while the diffusion objective added parallel drafting capability. The model was pretrained on 1.3T tokens from the NVIDIA Nemotron Pretraining datasets and underwent an additional supervised fine-tuning phase using 45B tokens from the NVIDIA Nemotron Post-training datasets.

Deployment and inference through SGLang

Deployment of Nemotron-Labs Diffusion models will soon be supported in the main branch of SGLang. At the time of this writing, support for inference is available through this issue tracker request on GitHub.

What's neat is that the integration lets you serve the same checkpoint in three different ways, picked by a single line in your algorithm config:

  • Plain autoregressive - set ar_mode=true and the model behaves like any other causal LM. Useful as the correctness reference, or if you just want a sanity check against pure AR output.

  • Diffusion mode (FastDiffuser) - the headliner for raw throughput. The model fills in a 32-token block at a time by iteratively denoising it, and a confidence threshold decides which tokens are "good enough" to commit each step.

  • Self-speculation (LinearSpec) - this one's our favorite. The same model drafts a block bidirectionally, then verifies it causally; whatever prefix matches gets committed. Output is lossless versus AR at temperature 0, and we hit ~865 tok/s on a B200 on the speedbench dataset - roughly 4× the autoregressive baseline on the same hardware.
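The confidence-threshold commit rule described for diffusion mode can be sketched with a toy loop. This is a minimal illustration, not the FastDiffuser implementation: `denoise_block` and `toy_predict` are hypothetical names, the predictor is a hand-written rule rather than a model, the block is 8 tokens instead of 32 to keep the example short, and the fallback that commits the single most confident token when nothing clears the threshold is an assumption added so the loop always makes progress.

```python
MASK = None  # marker for a position that has not been committed yet


def denoise_block(block_size, predict, threshold):
    """Fill one block by repeatedly committing confident predictions.

    `predict(block)` must return {position: (token, confidence)} for every
    masked position. Returns the filled block and the number of steps used.
    """
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        proposals = predict(block)
        steps += 1
        commit = {p: tc for p, tc in proposals.items() if tc[1] >= threshold}
        if not commit:
            # Assumed fallback: commit only the most confident proposal.
            best = max(proposals, key=lambda p: proposals[p][1])
            commit = {best: proposals[best]}
        for position, (token, _confidence) in commit.items():
            block[position] = token
    return block, steps


def toy_predict(block):
    """Hypothetical predictor: even positions are easy, odd positions become
    easy once their left neighbour has been committed."""
    proposals = {}
    for p, tok in enumerate(block):
        if tok is MASK:
            easy = p % 2 == 0 or block[p - 1] is not MASK
            proposals[p] = (p, 0.95 if easy else 0.4)
    return proposals


filled, steps = denoise_block(8, toy_predict, threshold=0.9)
print(filled, steps)  # [0, 1, 2, 3, 4, 5, 6, 7] 2
```

In this toy, raising the threshold to 0.99 means no proposal ever qualifies, so the fallback commits one token per step and the block takes 8 steps instead of 2. That is the speed-versus-caution trade-off the threshold controls.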

Get Started Today

Nemotron-Labs Diffusion brings diffusion-style generation into a form developers can actually use: open models, familiar AR compatibility, diffusion decoding, and self-speculative acceleration in one family. With Nemotron-Labs Diffusion, developers get a new way to draft, refine, verify, and accelerate text generation, without needing to alter their applications.

To get started, explore the Nemotron-Labs Diffusion model family, read the technical report, and try the available training recipe.
