Collections including paper arxiv:2412.13303

Each community collection below previews up to four items. Model rows read: task • parameter count • last updated • downloads • likes. Paper rows read: arXiv ID • publication status • upvotes.
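This listing can be reproduced programmatically with the `huggingface_hub` client. Below is a minimal sketch, assuming the `item` filter accepts the `papers/<arxiv-id>` form documented for `list_collections` (verify against your installed version):

```python
# pip install -U huggingface_hub
from huggingface_hub import list_collections

# List community collections that contain the FastVLM paper.
# NOTE: the "papers/<arxiv-id>" item format follows the huggingface_hub
# docs, but treat it as an assumption and check your installed version.
for collection in list_collections(item="papers/2412.13303", limit=10):
    print(f"{collection.title} ({collection.slug})")
    for item in collection.items:  # listing previews are capped at 4 items
        print(f"  {item.item_type}: {item.item_id}")
```

Collections returned by a listing are truncated to four preview items, which is why each group below shows at most four entries; `get_collection(slug)` retrieves a collection's full contents.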
- FastVLM: Efficient Vision Encoding for Vision Language Models
  Paper • 2412.13303 • Published • 72
- rStar2-Agent: Agentic Reasoning Technical Report
  Paper • 2508.20722 • Published • 116
- AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications
  Paper • 2508.16279 • Published • 53
- OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
  Paper • 2509.12201 • Published • 104

- vikhyatk/moondream2
  Image-Text-to-Text • 2B • Updated • 1.72M • 1.35k
- Qwen/Qwen2.5-VL-7B-Instruct
  Image-Text-to-Text • 8B • Updated • 3.34M • 1.38k
- google/gemma-3-27b-it-qat-q4_0-gguf
  Image-Text-to-Text • 27B • Updated • 18.5k • 365
- google/paligemma2-3b-mix-224
  Image-Text-to-Text • 3B • Updated • 14.1k • 40

- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
  Paper • 2509.16197 • Published • 56
- Qwen/Qwen3-Omni-30B-A3B-Instruct
  Any-to-Any • 35B • Updated • 283k • 744
- facebook/dinov3-vitb16-pretrain-lvd1689m
  Image Feature Extraction • 85.7M • Updated • 338k • 84
- nvidia/NV-Embed-v2
  Feature Extraction • 8B • Updated • 149k • 488

- USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
  Paper • 2508.18966 • Published • 56
- Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
  Paper • 2509.08721 • Published • 660
- FastVLM: Efficient Vision Encoding for Vision Language Models
  Paper • 2412.13303 • Published • 72
- Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
  Paper • 2411.15466 • Published • 39

- OpenGVLab/InternVL3-1B
  Image-Text-to-Text • 0.9B • Updated • 77.2k • 75
- vikhyatk/moondream2
  Image-Text-to-Text • 2B • Updated • 1.72M • 1.35k
- microsoft/Florence-2-base
  Image-Text-to-Text • 0.2B • Updated • 465k • 320
- HuggingFaceTB/SmolVLM2-256M-Video-Instruct
  Image-Text-to-Text • 0.3B • Updated • 94.2k • 85

- Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
  Paper • 2410.13360 • Published • 9
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
  Paper • 2411.18203 • Published • 41
- Towards Interpreting Visual Information Processing in Vision-Language Models
  Paper • 2410.07149 • Published • 1
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study
  Paper • 2407.02477 • Published • 24
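Any of the Image-Text-to-Text models listed above can be tried locally in a few lines. Here is a minimal sketch, assuming a recent `transformers` release that ships the `image-text-to-text` pipeline and native SmolVLM2 support; the smallest listed checkpoint is used to keep the download light, and the sample image URL (from the Hub's documentation-images dataset) can be swapped for any reachable image:

```python
# pip install -U transformers torch pillow
from transformers import pipeline

# Image-text-to-text pipeline; SmolVLM2-256M keeps the download small.
# Any Image-Text-to-Text model ID listed above should work, though some
# repos (e.g. vikhyatk/moondream2, microsoft/Florence-2-base) ship
# custom code and need trust_remote_code=True.
pipe = pipeline(
    "image-text-to-text",
    model="HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
)

# Chat-style input: one user turn with an image URL and a question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# return_full_text=False keeps only the model's reply, not the prompt.
out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])
```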