EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
Abstract
EMMA is an efficient unified architecture for multimodal tasks that uses autoencoders, channel-wise concatenation, shared-and-decoupled networks, and mixture-of-experts to achieve superior performance and efficiency.
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of: 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation and, by applying the same compression ratio to images, keeps training balanced between understanding and generation tasks; 2) channel-wise concatenation of visual understanding and generation tokens instead of token-wise concatenation, which further reduces the number of visual tokens in the unified architecture; 3) a shared-and-decoupled network that enables mutual improvement across tasks while meeting task-specific modeling requirements; and 4) a mixture-of-experts mechanism in the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments show that EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
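The abstract does not give implementation details, but the token-count advantage of channel-wise over token-wise concatenation can be sketched as follows. The image size, channel widths, and fusion projection below are illustrative assumptions, not values from the paper; only the 32x compression ratio comes from the abstract.

```python
import torch

# Hypothetical shapes for illustration (not from the paper): a 512x512 image,
# a 32x-downsampling autoencoder for generation latents, and an understanding
# encoder producing tokens on the same 16x16 grid.
B, H, W = 1, 512, 512
d_und, d_gen = 1024, 1024            # assumed per-branch channel widths
n_tokens = (H // 32) * (W // 32)     # 32x compression -> 16*16 = 256 tokens per image

und_tokens = torch.randn(B, n_tokens, d_und)   # visual understanding tokens
gen_tokens = torch.randn(B, n_tokens, d_gen)   # visual generation tokens

# Token-wise concatenation: the two sequences are stacked along the length axis,
# so the backbone sees twice as many visual tokens per image.
token_wise = torch.cat([und_tokens, gen_tokens], dim=1)     # (B, 512, 1024)

# Channel-wise concatenation: tokens at the same spatial position are merged
# along the feature axis, keeping the sequence length at 256.
channel_wise = torch.cat([und_tokens, gen_tokens], dim=-1)  # (B, 256, 2048)
proj = torch.nn.Linear(d_und + d_gen, d_und)                # assumed fusion projection
fused = proj(channel_wise)                                   # (B, 256, 1024)

print(token_wise.shape, fused.shape)
```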
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models (2025)
- LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation (2025)
- InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue (2025)
- Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation (2025)
- MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation (2025)
- Architecture Decoupling Is Not All You Need For Unified Multimodal Model (2025)
- HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation (2025)