---
license: apache-2.0
---

# EAGLE


| [EAGLE](https://arxiv.org/pdf/2401.15077.pdf) | [EAGLE-2](https://arxiv.org/pdf/2406.16858) | [EAGLE-3](https://arxiv.org/pdf/2503.01840) | Blog |


_Figure: benchmark results._

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a new baseline for fast decoding of Large Language Models (LLMs) with provable performance maintenance. It extrapolates the second-top-layer contextual feature vectors of the LLM, enabling a significant boost in generation efficiency.

EAGLE is:

- certified by third-party evaluation as the **fastest** speculative method so far.
- achieving **2x** speedup on gpt-fast.
- **3x** faster than vanilla decoding (13B).
- **2x** faster than Lookahead (13B).
- **1.6x** faster than Medusa (13B).
- provably maintaining the consistency with vanilla decoding in the distribution of generated texts (see the sketch of the accept/reject rule below).
- trainable (within 1-2 days) and testable on 8x RTX 3090 GPUs, so even the GPU poor can afford it.
- combinable with other parallel techniques such as vLLM, DeepSpeed, Mamba, FlashAttention, quantization, and hardware optimization.

EAGLE-2 uses the confidence scores of the draft model to approximate acceptance rates and dynamically adjusts the draft tree structure, which further enhances performance.

EAGLE-2 is:

- **4x** faster than vanilla decoding (13B).
- **1.4x** faster than EAGLE-1 (13B).

EAGLE-3 removes the feature prediction constraint of EAGLE and simulates this process during training using training-time testing. Since top-layer features are specialized for next-token prediction, EAGLE-3 replaces them with a fusion of low-, mid-, and high-level semantic features. EAGLE-3 further improves generation speed while remaining lossless.

EAGLE-3 is:

- **5.6x** faster than vanilla decoding (13B).
- **1.8x** faster than EAGLE-1 (13B).
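EAGLE's losslessness guarantee comes from the standard speculative-sampling verify step: the target model checks each drafted token, and a rejection triggers resampling from the residual distribution. The sketch below illustrates that accept/reject rule in a simplified, sequence-shaped form (EAGLE actually verifies a draft tree); the function and tensor names here are ours for illustration, not the repository's API.

```python
import torch

def speculative_step(draft_probs, target_probs, draft_tokens):
    """One verify step of speculative sampling.

    draft_probs:  (k, vocab) distributions the draft model sampled its tokens from
    target_probs: (k + 1, vocab) distributions from the target model at the same positions
    draft_tokens: (k,) tokens proposed by the draft model

    Returns the accepted tokens plus one token sampled from the target
    (or residual) distribution, so the output is distributed exactly as
    vanilla decoding from the target model.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p = target_probs[i, tok]  # target probability of the drafted token
        q = draft_probs[i, tok]   # draft probability of the same token
        if torch.rand(()) < (p / q).clamp(max=1.0):
            accepted.append(tok)  # accept: keep the drafted token
        else:
            # reject: resample from the normalized residual max(p - q, 0);
            # everything drafted after a rejection is discarded
            residual = (target_probs[i] - draft_probs[i]).clamp(min=0.0)
            residual = residual / residual.sum()
            accepted.append(torch.multinomial(residual, 1).item())
            return accepted
    # all k drafts accepted: sample one bonus token from the target model
    accepted.append(torch.multinomial(target_probs[-1], 1).item())
    return accepted
```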

_Demo GIF: EAGLE inference acceleration._

_Inference is conducted on 2x RTX 3090 GPUs at fp16 precision using the Vicuna 13B model._

## Support

EAGLE has been merged into the following mainstream LLM serving frameworks (listed in alphabetical order); a hedged vLLM usage sketch appears after the references below.

- AMD ROCm
- AngelSlim
- AWS NeuronX Distributed Core
- CPM.cu
- Intel® Extension for Transformers
- Intel® LLM Library for PyTorch
- MLC-LLM
- NVIDIA NeMo Framework
- NVIDIA TensorRT-LLM
- NVIDIA TensorRT Model Optimizer
- PaddleNLP
- SGLang
- SpecForge
- vLLM

## Reference

For technical details and full experimental results, please check [the paper of EAGLE](https://arxiv.org/pdf/2401.15077.pdf), [the paper of EAGLE-2](https://arxiv.org/pdf/2406.16858), and [the paper of EAGLE-3](https://arxiv.org/pdf/2503.01840).

```
@inproceedings{li2024eagle,
  author    = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title     = {{EAGLE}: Speculative Sampling Requires Rethinking Feature Uncertainty},
  booktitle = {International Conference on Machine Learning},
  year      = {2024}
}
@inproceedings{li2024eagle2,
  author    = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title     = {{EAGLE-2}: Faster Inference of Language Models with Dynamic Draft Trees},
  booktitle = {Empirical Methods in Natural Language Processing},
  year      = {2024}
}
@inproceedings{li2025eagle3,
  author    = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title     = {{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  booktitle = {Annual Conference on Neural Information Processing Systems},
  year      = {2025}
}
```
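As a concrete example of the integrations listed under Support, the snippet below sketches how an EAGLE draft model can be enabled in vLLM. This is a hedged sketch: the `speculative_config` keys follow recent vLLM releases and may differ in yours, and the target/draft checkpoint names are illustrative.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: speculative_config keys reflect recent vLLM releases and
# may differ across versions; the model names below are illustrative.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",          # target (base) model
    speculative_config={
        "method": "eagle",                              # use an EAGLE draft head
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # EAGLE draft weights
        "num_speculative_tokens": 4,                    # draft length per step
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```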