---
license: apache-2.0
---

# EAGLE
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a new baseline for fast decoding of Large Language Models (LLMs) with provably maintained output quality. It extrapolates the second-top-layer contextual feature vectors of the LLM, enabling a significant boost in generation efficiency.
- EAGLE is:
  - certified by third-party evaluation as the **fastest** speculative method so far.
  - achieving a **2x** speedup on gpt-fast.
  - **3x** faster than vanilla decoding (13B).
  - **2x** faster than Lookahead (13B).
  - **1.6x** faster than Medusa (13B).
  - provably maintaining consistency with vanilla decoding in the distribution of generated texts.
  - trainable (within 1-2 days) and testable on 8x RTX 3090 GPUs, so even the GPU-poor can afford it.
  - combinable with other parallel techniques such as vLLM, DeepSpeed, Mamba, FlashAttention, quantization, and hardware optimization.
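To make the drafting-and-verification loop concrete, here is a minimal, self-contained sketch of the EAGLE idea in PyTorch: a small draft head extrapolates the target model's second-top-layer feature (conditioned on the embedding of the token just sampled), and the drafted tokens are then verified by the target model in a single forward pass. The toy target model, module names, and sizes below are illustrative assumptions, not the official EAGLE API, and the greedy verification check is a simplification; real EAGLE uses speculative sampling so the output distribution is unchanged.

```python
# Toy sketch of EAGLE-style drafting (illustrative only, not the official implementation).
import torch
import torch.nn as nn

VOCAB, HIDDEN = 100, 32

class ToyTarget(nn.Module):
    """Stand-in for the target LLM: returns second-top-layer features and token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.body = nn.GRU(HIDDEN, HIDDEN, batch_first=True)  # placeholder for the transformer stack
        self.lm_head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, ids):
        feats, _ = self.body(self.embed(ids))    # (B, T, H): "second-top-layer" features
        return feats, self.lm_head(feats)        # features and token logits

class DraftHead(nn.Module):
    """Draft head: extrapolates the next feature from (current feature, next-token embedding)."""
    def __init__(self, embed, lm_head):
        super().__init__()
        self.embed, self.lm_head = embed, lm_head   # shared with the target model
        self.fc = nn.Linear(2 * HIDDEN, HIDDEN)

    def forward(self, feat, token_id):
        return self.fc(torch.cat([feat, self.embed(token_id)], dim=-1))

@torch.no_grad()
def generate(target, draft, ids, draft_len=4, steps=3):
    for _ in range(steps):
        feats, logits = target(ids)
        feat, tok = feats[:, -1], logits[:, -1].argmax(-1)
        # 1) Draft a chain of tokens by extrapolating features with the small head.
        drafted = []
        for _ in range(draft_len):
            feat = draft(feat, tok)
            tok = draft.lm_head(feat).argmax(-1)
            drafted.append(tok)
        cand = torch.cat([ids, torch.stack(drafted, dim=1)], dim=1)
        # 2) Verify all drafted tokens with one target forward pass.
        #    (Greedy check for brevity; EAGLE uses speculative sampling to stay lossless.)
        _, ver_logits = target(cand)
        base, accepted = ids.shape[1], []
        for i, d in enumerate(drafted):
            expected = ver_logits[:, base - 1 + i].argmax(-1)
            accepted.append(expected)            # the target's own token is always safe to keep
            if not torch.equal(expected, d):     # mismatch: discard the rest of the draft
                break
        ids = torch.cat([ids] + [t.unsqueeze(1) for t in accepted], dim=1)
    return ids

target = ToyTarget()
draft = DraftHead(target.embed, target.lm_head)
print(generate(target, draft, torch.tensor([[1, 2, 3]])))
```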
EAGLE-2 uses the confidence scores from the draft model to approximate acceptance rates, dynamically adjusting the draft tree structure, which further enhances performance.
- EAGLE-2 is:
  - **4x** faster than vanilla decoding (13B).
  - **1.4x** faster than EAGLE-1 (13B).
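Below is a rough sketch of the dynamic draft-tree idea, under the assumption that a node's cumulative product of draft confidences approximates its acceptance rate: leaves are expanded best-first until a node budget is reached, so the tree grows deep where the draft model is confident and branches where it is not. `draft_topk` and all other names are hypothetical stand-ins, not the EAGLE-2 implementation.

```python
# Toy sketch of confidence-guided draft-tree expansion (illustrative only).
import heapq
import random

def draft_topk(prefix, k=3):
    """Hypothetical draft step: return k candidate tokens with normalized confidence scores."""
    rng = random.Random(hash(prefix))
    scores = sorted((rng.random() for _ in range(k)), reverse=True)
    total = sum(scores)
    return [(f"tok{i}", s / total) for i, s in enumerate(scores)]

def build_draft_tree(prompt, budget=8, k=3):
    """Grow a draft tree of at most `budget` nodes, always expanding the most confident leaf."""
    heap = [(-1.0, tuple(prompt))]      # max-heap by cumulative confidence (negated for heapq)
    tree = []                           # drafted nodes as (token path, cumulative confidence)
    while heap and len(tree) < budget:
        neg_conf, prefix = heapq.heappop(heap)
        for tok, p in draft_topk(prefix, k):
            conf = -neg_conf * p        # product of confidences ~ acceptance rate of this path
            child = prefix + (tok,)
            tree.append((child, conf))
            heapq.heappush(heap, (-conf, child))
            if len(tree) >= budget:
                break
    return tree

for path, conf in build_draft_tree(("<s>", "The")):
    print(f"{conf:.3f}  {' '.join(path)}")
```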
EAGLE-3 removes EAGLE's feature-prediction constraint and instead simulates multi-step drafting during training via training-time testing. Because top-layer features are specialized for next-token prediction, EAGLE-3 replaces them with a fusion of low-, mid-, and high-level semantic features.
EAGLE-3 further improves generation speed while ensuring lossless performance.
- EAGLE-3 is:
  - **5.6x** faster than vanilla decoding (13B).
  - **1.8x** faster than EAGLE-1 (13B).
_Inference is conducted on 2x RTX 3090 GPUs at fp16 precision using the Vicuna 13B model._
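Here is a minimal sketch of the feature-fusion step described above: hidden states tapped from a low, a mid, and a high layer of the target model are concatenated and projected down to form the draft head's input. The layer indices, sizes, and module names are assumptions for illustration only.

```python
# Toy sketch of EAGLE-3-style multi-level feature fusion (illustrative only).
import torch
import torch.nn as nn

HIDDEN, N_LAYERS = 32, 12
LOW, MID, HIGH = 2, 6, 11            # which target layers to tap (assumed, not prescribed)

class FusionProj(nn.Module):
    """Fuse low-, mid-, and high-level target features into a single draft-head input."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * HIDDEN, HIDDEN)

    def forward(self, hidden_states):
        # hidden_states: list of per-layer tensors, each of shape (batch, seq, HIDDEN)
        fused = torch.cat([hidden_states[LOW], hidden_states[MID], hidden_states[HIGH]], dim=-1)
        return self.proj(fused)       # (batch, seq, HIDDEN) feature fed to the draft head

# Toy per-layer hidden states standing in for the target model's outputs.
hidden_states = [torch.randn(1, 5, HIDDEN) for _ in range(N_LAYERS)]
draft_input = FusionProj()(hidden_states)
print(draft_input.shape)             # torch.Size([1, 5, 32])
```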
## Support
EAGLE has been merged into the following mainstream LLM serving frameworks (listed in alphabetical order).
- AMD ROCm
- AngelSlim
- AWS NeuronX Distributed Core
- CPM.cu
- Intel® Extension for Transformers
- Intel® LLM Library for PyTorch
- MLC-LLM
- NVIDIA NeMo Framework
- NVIDIA TensorRT-LLM
- NVIDIA TensorRT Model Optimizer
- PaddleNLP
- SGLang
- SpecForge
- vLLM
## Reference
For technical details and full experimental results, please check [the paper of EAGLE](https://arxiv.org/pdf/2401.15077.pdf), [the paper of EAGLE-2](https://arxiv.org/pdf/2406.16858), and [the paper of EAGLE-3](https://arxiv.org/pdf/2503.01840).
```
@inproceedings{li2024eagle,
  author    = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title     = {{EAGLE}: Speculative Sampling Requires Rethinking Feature Uncertainty},
  booktitle = {International Conference on Machine Learning},
  year      = {2024}
}
@inproceedings{li2024eagle2,
  author    = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title     = {{EAGLE-2}: Faster Inference of Language Models with Dynamic Draft Trees},
  booktitle = {Empirical Methods in Natural Language Processing},
  year      = {2024}
}
@inproceedings{li2025eagle3,
  author    = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title     = {{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  booktitle = {Annual Conference on Neural Information Processing Systems},
  year      = {2025}
}
```