- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 133
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 63
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
  Paper • 2408.16725 • Published • 52
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
  Paper • 2408.15998 • Published • 87
Collections
Collections including paper arxiv:2408.12637
- Law of Vision Representation in MLLMs
  Paper • 2408.16357 • Published • 95
- CogVLM2: Visual Language Models for Image and Video Understanding
  Paper • 2408.16500 • Published • 57
- Learning to Move Like Professional Counter-Strike Players
  Paper • 2408.13934 • Published • 23
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 133