Whisfusion

An ASR model that combines a Whisper encoder with a parallel Diffusion Transformer decoder.
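
The sketch below is only a schematic toy illustration of the flow described here (Whisper-style encoder, adapter, parallel diffusion decoder). Every module, dimension, vocabulary size, and step count is a hypothetical placeholder and does not reflect Whisfusion's actual code or API.

import torch
import torch.nn as nn

class ToyAdapter(nn.Module):
    # Projects encoder features into the decoder's hidden size (toy dims).
    def __init__(self, enc_dim=384, dec_dim=512):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)

    def forward(self, x):
        return self.proj(x)

class ToyDiffusionDecoder(nn.Module):
    # Minimal Transformer decoder that cross-attends to encoder states.
    def __init__(self, vocab=1000, dim=512, heads=8, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(block, num_layers=layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, enc_states):
        return self.head(self.blocks(self.embed(tokens), enc_states))

MASK_ID = 0  # placeholder mask token id

def parallel_decode(enc_states, decoder, seq_len=32, steps=8):
    # Start from an all-masked sequence and re-predict every position in
    # parallel for a fixed number of refinement steps, so the number of
    # decoder passes does not grow with the output length.
    tokens = torch.full((enc_states.size(0), seq_len), MASK_ID, dtype=torch.long)
    for _ in range(steps):
        tokens = decoder(tokens, enc_states).argmax(dim=-1)
    return tokens

enc = ToyAdapter()(torch.randn(1, 100, 384))  # stand-in for Whisper encoder output
print(parallel_decode(enc, ToyDiffusionDecoder()).shape)  # torch.Size([1, 32])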

Paper: arXiv:2508.07048
Code: GitHub

Model Files

  • whisfusion_stage2_decoder.pt: Full model (Stage 2 - encoder + decoder + adapter)
  • whisfusion_stage1_adapter.pt: Adapter-only checkpoint (Stage 1)
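
A minimal sketch for inspecting the Stage 2 checkpoint locally is shown below. It assumes the file is a regular PyTorch checkpoint containing either a plain state dict or a dict wrapping one under a "state_dict" key; the actual layout may differ.

import torch

# Load on CPU and unwrap the state dict if the checkpoint wraps it
# (this key name is an assumption, not confirmed by the repository).
ckpt = torch.load("whisfusion_stage2_decoder.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

# Print the first few parameter names and shapes to see what the file holds.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))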

Key Results

  • Parallel decoding architecture with 14× higher throughput than autoregressive models (3,180 vs 230 tokens/s)
  • Superior accuracy - lower WER than Whisper-tiny (8.3% vs 9.7%) with comparable latency
  • Scalable inference - constant decoding time regardless of sequence length (up to 2.6× faster on long audio >20 s)

Citation

@article{kwon2025whisfusion,
  title={Whisfusion: Parallel ASR Decoding via a Diffusion Transformer},
  author={Kwon, Taeyoun and Ahn, Junhyuk and Yun, Taegeun and Jwa, Heeju and Choi, Yoonchae and Park, Siwon and Kim, Nam-Joon and Kim, Jangchan and Ryu, Hyun Gon and Lee, Hyuk-Jae},
  journal={arXiv preprint arXiv:2508.07048},
  year={2025}
}

License

Apache 2.0
