# Whisfusion
A Diffusion Transformer ASR model that combines a Whisper encoder with a parallel diffusion decoder; the data flow is sketched below.
- Paper: [arXiv:2508.07048](https://arxiv.org/abs/2508.07048)
- Code: GitHub
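As a rough illustration of that data flow, the sketch below encodes audio with an off-the-shelf Whisper encoder from Hugging Face `transformers`. The `openai/whisper-tiny` checkpoint and the random input features are assumptions for the sketch; the diffusion decoder itself is only described in comments, since the actual implementation lives in the linked repository.

```python
import torch
from transformers import WhisperModel

# Illustrative only: encode audio with an off-the-shelf Whisper encoder.
# whisper-tiny is an assumption for this sketch; Whisfusion's actual encoder
# and its diffusion decoder are in the linked repository.
encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder
encoder.eval()

# Whisper consumes 80-bin log-mel features over a padded 30 s window
# (3,000 frames); random values stand in for real features here.
log_mel = torch.randn(1, 80, 3000)

with torch.no_grad():
    audio_states = encoder(input_features=log_mel).last_hidden_state

print(audio_states.shape)  # torch.Size([1, 1500, 384]) for whisper-tiny

# A diffusion decoder would condition on `audio_states` and iteratively
# denoise an entire text-token sequence, refining all positions in parallel
# instead of emitting tokens one at a time. That component is omitted here.
```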
## Model Files
- `whisfusion_stage2_decoder.pt`: full model (Stage 2: encoder + decoder + adapter)
- `whisfusion_stage1_adapter.pt`: adapter-only checkpoint (Stage 1)
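A minimal sketch of loading and inspecting one of these checkpoints follows. The file name comes from this card, but the internal layout of the saved object is an assumption, so the snippet only prints the top-level structure rather than restoring a model.

```python
import torch

# Load the Stage 2 checkpoint on CPU. The file name comes from this card;
# the internal layout (plain state_dict vs. a training wrapper with keys
# like "model" or "optimizer") is an assumption we check at runtime.
ckpt = torch.load("whisfusion_stage2_decoder.pt", map_location="cpu")

if isinstance(ckpt, dict):
    # Print a few top-level keys to see which layout this checkpoint uses.
    for key in list(ckpt)[:10]:
        print(key)
else:
    print(type(ckpt))
```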
## Key Results
- Parallel decoding architecture with 14× higher throughput than autoregressive models (3,180 vs. 230 tokens/s)
- Superior accuracy: lower WER than Whisper-tiny (8.3% vs. 9.7%) with comparable latency
- Scalable inference: constant decoding time regardless of sequence length (up to 2.6× faster on long audio >20 s)
## Citation
@article{kwon2025whisfusion,
  title={Whisfusion: Parallel ASR Decoding via a Diffusion Transformer},
  author={Kwon, Taeyoun and Ahn, Junhyuk and Yun, Taegeun and Jwa, Heeju and Choi, Yoonchae and Park, Siwon and Kim, Nam-Joon and Kim, Jangchan and Ryu, Hyun Gon and Lee, Hyuk-Jae},
  journal={arXiv preprint arXiv:2508.07048},
  year={2025}
}
## License
Apache 2.0