Collections including paper arxiv:2412.13303

Each community collection below previews up to four items. Model rows read: task • parameter count • last updated • downloads • likes. Paper rows read: arXiv ID • publication status • upvotes.
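This listing can be reproduced programmatically with the `huggingface_hub` client. Below is a minimal sketch, assuming the `item` filter accepts the `papers/<arxiv-id>` form documented for `list_collections` (verify against your installed version):

```python
# pip install -U huggingface_hub
from huggingface_hub import list_collections

# List community collections that contain the FastVLM paper.
# NOTE: the "papers/<arxiv-id>" item format follows the huggingface_hub
# docs, but treat it as an assumption and check your installed version.
for collection in list_collections(item="papers/2412.13303", limit=10):
    print(f"{collection.title} ({collection.slug})")
    for item in collection.items:  # listing previews are capped at 4 items
        print(f"  {item.item_type}: {item.item_id}")
```

Collections returned by a listing are truncated to four preview items, which is why each group below shows at most four entries; `get_collection(slug)` retrieves a collection's full contents.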
- FastVLM: Efficient Vision Encoding for Vision Language Models
  Paper • 2412.13303 • Published • 72
- rStar2-Agent: Agentic Reasoning Technical Report
  Paper • 2508.20722 • Published • 116
- AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications
  Paper • 2508.16279 • Published • 53
- OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
  Paper • 2509.12201 • Published • 104

- vikhyatk/moondream2
  Image-Text-to-Text • 2B • Updated • 1.72M • 1.35k
- Qwen/Qwen2.5-VL-7B-Instruct
  Image-Text-to-Text • 8B • Updated • 3.34M • 1.38k
- google/gemma-3-27b-it-qat-q4_0-gguf
  Image-Text-to-Text • 27B • Updated • 18.5k • 365
- google/paligemma2-3b-mix-224
  Image-Text-to-Text • 3B • Updated • 14.1k • 40

- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
  Paper • 2509.16197 • Published • 56
- Qwen/Qwen3-Omni-30B-A3B-Instruct
  Any-to-Any • 35B • Updated • 283k • 744
- facebook/dinov3-vitb16-pretrain-lvd1689m
  Image Feature Extraction • 85.7M • Updated • 338k • 84
- nvidia/NV-Embed-v2
  Feature Extraction • 8B • Updated • 149k • 488

- USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
  Paper • 2508.18966 • Published • 56
- Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
  Paper • 2509.08721 • Published • 660
- FastVLM: Efficient Vision Encoding for Vision Language Models
  Paper • 2412.13303 • Published • 72
- Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
  Paper • 2411.15466 • Published • 39

- OpenGVLab/InternVL3-1B
  Image-Text-to-Text • 0.9B • Updated • 77.2k • 75
- vikhyatk/moondream2
  Image-Text-to-Text • 2B • Updated • 1.72M • 1.35k
- microsoft/Florence-2-base
  Image-Text-to-Text • 0.2B • Updated • 465k • 320
- HuggingFaceTB/SmolVLM2-256M-Video-Instruct
  Image-Text-to-Text • 0.3B • Updated • 94.2k • 85

- Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
  Paper • 2410.13360 • Published • 9
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
  Paper • 2411.18203 • Published • 41
- Towards Interpreting Visual Information Processing in Vision-Language Models
  Paper • 2410.07149 • Published • 1
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study
  Paper • 2407.02477 • Published • 24
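Any of the Image-Text-to-Text models listed above can be tried locally in a few lines. Here is a minimal sketch, assuming a recent `transformers` release that ships the `image-text-to-text` pipeline and native SmolVLM2 support; the smallest listed checkpoint is used to keep the download light, and the sample image URL (from the Hub's documentation-images dataset) can be swapped for any reachable image:

```python
# pip install -U transformers torch pillow
from transformers import pipeline

# Image-text-to-text pipeline; SmolVLM2-256M keeps the download small.
# Any Image-Text-to-Text model ID listed above should work, though some
# repos (e.g. vikhyatk/moondream2, microsoft/Florence-2-base) ship
# custom code and need trust_remote_code=True.
pipe = pipeline(
    "image-text-to-text",
    model="HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
)

# Chat-style input: one user turn with an image URL and a question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# return_full_text=False keeps only the model's reply, not the prompt.
out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])
```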