Instructions to use aiqtech/LongCat-Flash-Omni with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use aiqtech/LongCat-Flash-Omni with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="aiqtech/LongCat-Flash-Omni", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("aiqtech/LongCat-Flash-Omni", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use aiqtech/LongCat-Flash-Omni with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "aiqtech/LongCat-Flash-Omni"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "aiqtech/LongCat-Flash-Omni",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/aiqtech/LongCat-Flash-Omni
- SGLang
How to use aiqtech/LongCat-Flash-Omni with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "aiqtech/LongCat-Flash-Omni" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "aiqtech/LongCat-Flash-Omni",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "aiqtech/LongCat-Flash-Omni" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "aiqtech/LongCat-Flash-Omni",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use aiqtech/LongCat-Flash-Omni with Docker Model Runner:
docker model run hf.co/aiqtech/LongCat-Flash-Omni
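The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes a server started as shown above is listening on `localhost:8000` (vLLM) or `localhost:30000` (SGLang). The helper names `build_chat_request` and `send` are illustrative and not part of either project.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(build_chat_request("http://localhost:8000", "aiqtech/LongCat-Flash-Omni", "What is the capital of France?"))` should return the parsed response. Building the request separately from sending it makes the payload easy to inspect without a live server.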
Duplicate from meituan-longcat/LongCat-Flash-Omni
Co-authored-by: hc <diichen@users.noreply.huggingface.co>
This view is limited to 50 files because it contains too many changes. See raw diff
- .gitattributes +35 -0
- LICENSE +21 -0
- README.md +365 -0
- audio/audio_embeddings.pt +3 -0
- audio/audio_encoder.pt +3 -0
- audio/audio_output_layers.pt +3 -0
- audio/audio_projector.pt +3 -0
- audio/config.json +85 -0
- audio/global_cmvn +0 -0
- audio/preprocessor_config.json +26 -0
- audio_codec/LongCatAudioCodec_decoder_24k_4codebooks_aug_sft.pt +3 -0
- audio_codec/config.yaml +11 -0
- config.json +37 -0
- configuration_longcat_flash.py +216 -0
- generation_config.json +7 -0
- language_model_embedding.pt +3 -0
- model.safetensors.index.json +0 -0
- model_00001-of-00080.safetensors +3 -0
- model_00002-of-00080.safetensors +3 -0
- model_00003-of-00080.safetensors +3 -0
- model_00004-of-00080.safetensors +3 -0
- model_00005-of-00080.safetensors +3 -0
- model_00006-of-00080.safetensors +3 -0
- model_00007-of-00080.safetensors +3 -0
- model_00008-of-00080.safetensors +3 -0
- model_00009-of-00080.safetensors +3 -0
- model_00010-of-00080.safetensors +3 -0
- model_00011-of-00080.safetensors +3 -0
- model_00012-of-00080.safetensors +3 -0
- model_00013-of-00080.safetensors +3 -0
- model_00014-of-00080.safetensors +3 -0
- model_00015-of-00080.safetensors +3 -0
- model_00016-of-00080.safetensors +3 -0
- model_00017-of-00080.safetensors +3 -0
- model_00018-of-00080.safetensors +3 -0
- model_00019-of-00080.safetensors +3 -0
- model_00020-of-00080.safetensors +3 -0
- model_00021-of-00080.safetensors +3 -0
- model_00022-of-00080.safetensors +3 -0
- model_00023-of-00080.safetensors +3 -0
- model_00024-of-00080.safetensors +3 -0
- model_00025-of-00080.safetensors +3 -0
- model_00026-of-00080.safetensors +3 -0
- model_00027-of-00080.safetensors +3 -0
- model_00028-of-00080.safetensors +3 -0
- model_00029-of-00080.safetensors +3 -0
- model_00030-of-00080.safetensors +3 -0
- model_00031-of-00080.safetensors +3 -0
- model_00032-of-00080.safetensors +3 -0
- model_00033-of-00080.safetensors +3 -0
.gitattributes
ADDED
|
@@ -0,0 +1,35 @@
| 1 |
+
*.7z filter=lfs diff=lfs merge=lfs -text
|
| 2 |
+
*.arrow filter=lfs diff=lfs merge=lfs -text
|
| 3 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
| 4 |
+
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
| 5 |
+
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
| 6 |
+
*.ftz filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
*.gz filter=lfs diff=lfs merge=lfs -text
|
| 8 |
+
*.h5 filter=lfs diff=lfs merge=lfs -text
|
| 9 |
+
*.joblib filter=lfs diff=lfs merge=lfs -text
|
| 10 |
+
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
| 11 |
+
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
| 12 |
+
*.model filter=lfs diff=lfs merge=lfs -text
|
| 13 |
+
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
| 14 |
+
*.npy filter=lfs diff=lfs merge=lfs -text
|
| 15 |
+
*.npz filter=lfs diff=lfs merge=lfs -text
|
| 16 |
+
*.onnx filter=lfs diff=lfs merge=lfs -text
|
| 17 |
+
*.ot filter=lfs diff=lfs merge=lfs -text
|
| 18 |
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
| 19 |
+
*.pb filter=lfs diff=lfs merge=lfs -text
|
| 20 |
+
*.pickle filter=lfs diff=lfs merge=lfs -text
|
| 21 |
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 22 |
+
*.pt filter=lfs diff=lfs merge=lfs -text
|
| 23 |
+
*.pth filter=lfs diff=lfs merge=lfs -text
|
| 24 |
+
*.rar filter=lfs diff=lfs merge=lfs -text
|
| 25 |
+
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 26 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
| 27 |
+
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
| 28 |
+
*.tar filter=lfs diff=lfs merge=lfs -text
|
| 29 |
+
*.tflite filter=lfs diff=lfs merge=lfs -text
|
| 30 |
+
*.tgz filter=lfs diff=lfs merge=lfs -text
|
| 31 |
+
*.wasm filter=lfs diff=lfs merge=lfs -text
|
| 32 |
+
*.xz filter=lfs diff=lfs merge=lfs -text
|
| 33 |
+
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
+
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
+
*tfevents* filter=lfs diff=lfs merge=lfs -text
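Each rule above routes files matching a pattern through Git LFS. As a rough illustration, the sketch below checks which paths from this commit match the simple basename patterns. It uses Python's `fnmatch` as an approximation of gitattributes matching, an assumption that holds for patterns such as `*.pt` but not for path patterns such as `saved_model/**/*`, which are left out. The pattern subset and the helper name `is_lfs_tracked` are chosen for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the slash-free patterns listed in .gitattributes above.
LFS_PATTERNS = ["*.bin", "*.ckpt", "*.pt", "*.pth", "*.safetensors", "*.zip", "*tfevents*"]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Return True if the file name matches any slash-free LFS pattern."""
    name = PurePosixPath(path).name  # slash-free patterns apply at any directory depth
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for path in ["audio/audio_encoder.pt", "model_00001-of-00080.safetensors", "config.json"]:
    print(path, is_lfs_tracked(path))
```

The `+3 -0` entries for `.pt` and `.safetensors` files in the file list are consistent with this: a Git LFS pointer file is three lines long.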
|
LICENSE
ADDED
|
@@ -0,0 +1,21 @@
| 1 |
+
MIT License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2025 Meituan
|
| 4 |
+
|
| 5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
+
of this software and associated documentation files (the "Software"), to deal
|
| 7 |
+
in the Software without restriction, including without limitation the rights
|
| 8 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
| 9 |
+
copies of the Software, and to permit persons to whom the Software is
|
| 10 |
+
furnished to do so, subject to the following conditions:
|
| 11 |
+
|
| 12 |
+
The above copyright notice and this permission notice shall be included in
|
| 13 |
+
all copies or substantial portions of the Software.
|
| 14 |
+
|
| 15 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
| 16 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
| 17 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
| 18 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
| 19 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
| 20 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
| 21 |
+
SOFTWARE.
|
README.md
ADDED
|
@@ -0,0 +1,365 @@
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
library_name: LongCat-Flash-Omni
|
| 4 |
+
pipeline_tag: text-generation
|
| 5 |
+
tags:
|
| 6 |
+
- transformers
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
# LongCat-Flash-Omni
|
| 10 |
+
|
| 11 |
+
<div align="center">
|
| 12 |
+
<img src="https://raw.githubusercontent.com/meituan-longcat/LongCat-Flash-Omni/main/figures/LongCat-Flash-Omni.svg"
|
| 13 |
+
width="300"
|
| 14 |
+
alt="LongCat Logo"/>
|
| 15 |
+
</div>
|
| 16 |
+
|
| 17 |
+
<div align="center" style="line-height: 1;">
|
| 18 |
+
<a href="https://longcat.ai/" target="_blank" style="margin: 2px;">
|
| 19 |
+
<img alt="Omni" src="https://img.shields.io/badge/🤖%20Omni-LongCat--Flash--Omni-ADFF2F?color=29E154&logoColor=white" fill-opacity="1" style="display: inline-block; vertical-align: middle;"/>
|
| 20 |
+
</a>
|
| 21 |
+
<a href="https://github.com/meituan-longcat/LongCat-Flash-Omni">
|
| 22 |
+
<img alt="github" src="https://img.shields.io/badge/🤖%20Github-LongCat--Flash--Omni-ff6b6b?color=1783ff&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
|
| 23 |
+
</a>
|
| 24 |
+
</div>
|
| 25 |
+
|
| 26 |
+
<div align="center" style="line-height: 1;">
|
| 27 |
+
<a href="https://raw.githubusercontent.com/meituan-longcat/LongCat-Flash-Omni/main/figures/wechat_official_accounts.jpg" target="_blank" style="margin: 2px;">
|
| 28 |
+
<img alt="Wechat" src="https://img.shields.io/badge/WeChat-LongCat-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
|
| 29 |
+
</a>
|
| 30 |
+
<a href="https://x.com/Meituan_LongCat" target="_blank" style="margin: 2px;">
|
| 31 |
+
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-LongCat-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
|
| 32 |
+
</a>
|
| 33 |
+
</div>
|
| 34 |
+
|
| 35 |
+
<div align="center" style="line-height: 1;">
|
| 36 |
+
<a href="https://huggingface.co/meituan-longcat/LongCat-Flash-Omni/blob/main/LICENSE" style="margin: 2px;">
|
| 37 |
+
<img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
|
| 38 |
+
</a>
|
| 39 |
+
</div>
|
| 40 |
+
|
| 41 |
+
<p align="center">
|
| 42 |
+
<a href="https://github.com/meituan-longcat/LongCat-Flash-Omni/blob/main/tech_report.pdf"><b>Tech Report</b> 📄</a>
|
| 43 |
+
</p>
|
| 44 |
+
|
| 45 |
+
## Model Introduction
|
| 46 |
+
We introduce **LongCat-Flash-Omni**, a state-of-the-art open-source omni-modal model with 560 billion parameters (27B activated) that excels at real-time audio-visual interaction. It builds on [LongCat-Flash](https://github.com/meituan-longcat/LongCat-Flash-Chat)'s high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, augmented by efficient multimodal perception and speech reconstruction modules. Through an effective curriculum-inspired progressive training strategy, the model achieves comprehensive multimodal capabilities while maintaining strong unimodal capability. We open-source the model to foster future research and development in the community.
|
| 47 |
+
|
| 48 |
+
### Model Architecture
|
| 49 |
+
<div align="center">
|
| 50 |
+
<img src="https://raw.githubusercontent.com/meituan-longcat/LongCat-Flash-Omni/main/figures/longcat_flash_omni_architecture.png" width="60%" alt="LongCat-Flash-Omni" />
|
| 51 |
+
</div>
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
### Key Features
|
| 55 |
+
|
| 56 |
+
#### 🌟 SOTA and Unified Omni-Modal Model
|
| 57 |
+
|
| 58 |
+
LongCat-Flash-Omni is an open-source omni-modal model that achieves state-of-the-art cross-modal comprehension performance. It seamlessly integrates powerful offline multi-modal understanding with real-time audio–visual interaction within a single all-in-one framework.
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
#### 🌟 Large-Scale with Low-Latency Audio–Visual Interaction
|
| 62 |
+
|
| 63 |
+
By leveraging an efficient LLM backbone, carefully designed lightweight modality encoders and decoder, and a chunk-wise audio–visual feature interleaving mechanism, LongCat-Flash-Omni achieves low-latency, high-quality audio–visual processing and streaming speech generation. It supports a context window of up to 128K tokens, enabling advanced capabilities in long-term memory, multi-turn dialogue, and temporal reasoning across multiple modalities.
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
#### 🌟 Effective Early-Fusion Training
|
| 67 |
+
|
| 68 |
+
The model adopts an innovative multi-stage pretraining pipeline that progressively incorporates text, audio, and visual modalities under a balanced data strategy and early-fusion training paradigm, ensuring strong omni-modal performance without degradation in any single modality.
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
#### 🌟 Efficient Training Infrastructure
|
| 72 |
+
|
| 73 |
+
Inspired by the concept of modality decoupling, we propose a Modality-Decoupled Parallelism training scheme that significantly enhances the efficiency of large-scale and highly challenging multimodal training.
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
#### 🌟 Open-Source Contribution
|
| 77 |
+
We provide a comprehensive overview of the training methodology and data strategies behind LongCat-Flash-Omni, and release the model to accelerate future research and innovation in omni-modal intelligence.
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
For more details, please refer to the comprehensive [***LongCat-Flash-Omni Technical Report***](https://github.com/meituan-longcat/LongCat-Flash-Omni/blob/main/tech_report.pdf).
|
| 81 |
+
|
| 82 |
+
## Evaluation Results
|
| 83 |
+
|
| 84 |
+
<details open>
|
| 85 |
+
<summary>Omni-modality</summary>
|
| 86 |
+
|
| 87 |
+
| **Benchmark** | **LongCat-Flash-Omni Instruct** | **Gemini-2.5-Pro (ThinkingBudget128)** | **Gemini-2.5-Flash (non-thinking)** | **Qwen3-Omni Instruct** | **Qwen2.5-Omni Instruct** |
|
| 88 |
+
|-----------|-------------------------------|-----------------------------------|------------------------------|----------------------|-------------------------|
|
| 89 |
+
| OmniBench | 61.38 | 66.80 | 54.99 | 58.41 | 48.16 |
|
| 90 |
+
| WorldSense | 60.89 | 63.96 | 58.72 | 52.01 | 46.69 |
|
| 91 |
+
| DailyOmni | 82.38 | 80.61 | 80.78 | 69.33 | 47.45 |
|
| 92 |
+
| UNO-Bench | 49.90 | 64.48 | 54.30 | 42.10 | 32.60 |
|
| 93 |
+
|
| 94 |
+
</details>
|
| 95 |
+
|
| 96 |
+
<details>
|
| 97 |
+
<summary>Vision</summary>
|
| 98 |
+
|
| 99 |
+
#### Image-to-Text
|
| 100 |
+
| **Benchmark** | **LongCat-Flash-Omni Instruct** | **Gemini-2.5-Pro (ThinkingBudget128)** | **Gemini-2.5-Flash (non-thinking)** | **Qwen3-Omni Instruct** | **Seed-1.6** | **GPT-4o-1120** | **Qwen3-VL-235B-A22B-Instruct** | **Qwen2.5-VL-72B-Instruct** |
|
| 101 |
+
|-----------|-------------------------------|-----------------------------------|------------------------------|----------------------|----------|---------------|------------------------------|---------------------------|
|
| 102 |
+
| **General** ||||||||||
|
| 103 |
+
| MMBench-EN<sub>test</sub> | 87.5 | 89.8 | 89.3 | 86.8 | 88.5 | 83.7 | 88.3 | 88.6* |
|
| 104 |
+
| MMBench-ZH<sub>test</sub> | 88.7 | 89.2 | 88.5 | 86.4 | 83.8 | 82.8 | 89.8 | 87.9* |
|
| 105 |
+
| RealWorldQA | 74.8 | 76.0 | 73.9 | 72.9 | 74.5 | 74.1 | 79.3* | 75.7* |
|
| 106 |
+
| MMStar | 70.9 | 78.5* | 75.5 | 68.5* | 71.5 | 63.2 | 78.4* | 68.2 |
|
| 107 |
+
| **STEM & Reasoning** ||||||||||
|
| 108 |
+
| MathVista<sub>mini</sub> | 77.9 | 77.7* | 77.1 | 75.9 | 78.7 | 62.8 | 84.9* | 74.8* |
|
| 109 |
+
| MMMU<sub>val</sub> | 70.7 | 80.9* | 76.3 | 69.1* | 74.9 | 69.4 | 78.7* | 70.2* |
|
| 110 |
+
| MMVet | 69.0 | 80.7 | 79.5 | 68.9 | 74.4 | 76.6 | 75.9 | 74.5 |
|
| 111 |
+
| **Multi-Image** ||||||||||
|
| 112 |
+
| BLINK | 63.1 | 70.0* | 65.7 | 56.1 | 65.0 | 65.5 | 70.7* | 60.1 |
|
| 113 |
+
| MuirBench | 77.1 | 74.0* | 73.7 | 62.1 | 74.6 | 70.5 | 72.8* | 70.7* |
|
| 114 |
+
| Mantis | 84.8 | 83.9 | 83.4 | 80.7 | 81.1 | 79.3 | 79.7 | 82.0 |
|
| 115 |
+
| **Text Recognition & Chart/Document Understanding** ||||||||||
|
| 116 |
+
| ChartQA | 87.6 | 71.7 | 77.6 | 86.8* | 82.4 | 74.5 | 89.2 | 89.5* |
|
| 117 |
+
| DocVQA | 91.8 | 94.0* | 93.6* | 95.7 | 94.3 | 80.9 | 94.6 | 96.4* |
|
| 118 |
+
| OCRBench | 84.9 | 87.2* | 85.6 | 85.5 | 85.6 | 82.3 | 91.2 | 88.5 |
|
| 119 |
+
| OmniDocBench<sub>EN/ZH</sub>↓ | 22.8/29.0 | 31.9/24.5 | 22.8/32.9 | 28.4/40.5 | 22.0/27.6 | 25.9/37.7 | 13.6/17.5 | 22.6/32.4* |
|
| 120 |
+
| **Grounding & Counting** ||||||||||
|
| 121 |
+
| RefCOCO-avg | 92.3 | 75.4 | 71.9 | 89.3 | 80.2 | - | 87.1 | 90.3 |
|
| 122 |
+
| CountBench | 92.4 | 91.0* | 78.6 | 90.0* | 94.1 | 85.6* | 94.3 | 93.6* |
|
| 123 |
+
| **Graphical User Interface (GUI)** ||||||||||
|
| 124 |
+
| VisualWebBench | 78.7 | 81.1 | 73.5 | 79.3 | 81.1 | 77.1 | 80.8 | 82.3* |
|
| 125 |
+
| ScreenSpot-v2 | 91.2 | 75.8 | 63.9 | 94.7 | 91.7 | - | 93.4 | 92.9 |
|
| 126 |
+
| AndroidControl<sub>low</sub> | 91.2 | 79.2 | 79.1 | 90.5 | 84.6 | 65.2 | 90.0 | 93.7* |
|
| 127 |
+
| AndroidControl<sub>high</sub> | 75.6 | 60.8 | 55.5 | 70.8 | 55.2 | 41.7 | 74.1 | 67.4* |
|
| 128 |
+
|
| 129 |
+
**Note**: Values marked with * are sourced from public reports. As GPT-4o does not support image grounding, we do not report its results on RefCOCO and ScreenSpot-v2.
|
| 130 |
+
|
| 131 |
+
---
|
| 132 |
+
|
| 133 |
+
#### Video-to-Text
|
| 134 |
+
| **Benchmark** | **LongCat-Flash-Omni Instruct** | **Gemini-2.5-Pro (ThinkingBudget128)** | **Gemini-2.5-Flash (non-thinking)** | **Qwen3-Omni Instruct** | **Seed-1.6** | **GPT-4o-1120** | **Qwen3-VL (235B-A22B-Instruct)** | **Qwen2.5-VL-72B-Instruct** |
|
| 135 |
+
|-----------|-------------------------------|-----------------------------------|------------------------------|----------------------|----------|---------------|------------------------------|---------------------------|
|
| 136 |
+
| **Short Video** ||||||||||
|
| 137 |
+
| MVBench | 75.2 | 66.4 | 63.0 | 69.3* | 68.4 | 62.1 | 71.3 | 70.4* |
|
| 138 |
+
| NextQA | 86.2 | 84.2 | 81.4 | 82.4 | 84.1 | 79.7 | 81.3 | 82.3 |
|
| 139 |
+
| TempCompass | 82.2 | 80.8 | 80.2 | 73.5 | 79.4 | 76.4 | 80.5 | 74.8* |
|
| 140 |
+
| **Long Video** ||||||||||
|
| 141 |
+
| VideoMME (w/o audio) | 76.2 | - | - | 70.5* | 75.2 | 73.2 | 79.2* | 73.3* |
|
| 142 |
+
| VideoMME (w/ audio) | 78.2 | 80.6* | 78.5 | 73.0 | - | - | - | - |
|
| 143 |
+
| LongVideoBench | 69.3 | 69.4 | 66.4 | 65.4 | 64.8 | 63.9 | - | 60.7* |
|
| 144 |
+
| **STEM & Reasoning** ||||||||||
|
| 145 |
+
| MMVU | 67.1 | 75.6 | 72.4 | 62.4 | 67.3 | 67.4 | 69.3 | 62.9* |
|
| 146 |
+
| Video-MMMU | 67.5 | 79.4* | 76.6 | 60.3 | 75.4 | 68.0 | 73.7 | 59.3 |
|
| 147 |
+
|
| 148 |
+
**Note**: Values marked with * are sourced from public reports.
|
| 149 |
+
|
| 150 |
+
</details>
|
| 151 |
+
|
| 152 |
+
<details>
|
| 153 |
+
<summary>Audio</summary>
|
| 154 |
+
|
| 155 |
+
#### **Table 1: Automatic Speech Recognition (ASR) and Speech-to-Text Translation (S2TT)**
|
| 156 |
+
| **Benchmark** | **LongCat-Flash-Omni Instruct** | **Gemini-2.5-Pro (ThinkingBudget128)** | **GPT-4o-Audio** | **Qwen3-Omni Instruct** | **Kimi-Audio** | **Step-Audio-2-mini** |
|
| 157 |
+
|-----------|-------------------------------|-----------------------------------|--------------|----------------------|------------|-------------------|
|
| 158 |
+
| **ASR** | | | | | | |
|
| 159 |
+
| LibriSpeech (test-clean \| test-other) | 1.57 \| 4.01 | 1.74 \| 3.80 | 30.00 \| 41.83 | 1.22 \| 2.48 | 1.28 \| 2.42 | 1.33 \| 2.86 |
|
| 160 |
+
| AISHELL-1 | 0.63 | 3.11 | 34.81 | 0.84 | 0.60 | 0.78 |
|
| 161 |
+
| AISHELL-2 | 2.78 | 5.24 | 77.73 | 2.34 | 2.56 | 2.16 |
|
| 162 |
+
| Fleurs (zh \| en) | 3.99 \| 5.02 | 2.24 \| 4.77 | 3.91 \| 5.56 | 2.20 \| 2.72 | 2.69 \| 4.44 | 2.53 \| 3.05 |
|
| 163 |
+
| CommonVoice 15 (zh \| en) | 4.98 \| 13.59 | 47.30 \| 49.86 | 42.83 \| 23.88 | 4.31 \| 6.05 | 8.46 \| 7.92 | 5.00 \| 6.75 |
|
| 164 |
+
| WenetSpeech (test-meeting \| test-net) | 6.69 \| 6.09 | 136.13 \| 32.82 | 54.35 \| 67.90 | 5.89 \| 4.69 | 6.28 \| 5.37 | 4.87 \| 4.82 |
|
| 165 |
+
| **S2TT (BLEU)** | | | | | | |
|
| 166 |
+
| CoVost2 en→zh | 47.23 | 41.94 | 29.32 | 48.72 | - | 49.12 |
|
| 167 |
+
| CoVost2 zh→en | 27.32 | 25.38 | 16.01 | 21.51 | - | 29.47 |
|
| 168 |
+
|
| 169 |
+
**Note**: ASR results are reported as CER/WER (lower is better); S2TT results are reported as BLEU scores (higher is better).
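For reference, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to align the hypothesis with the reference, divided by the reference length (in words for WER, in characters for CER). The sketch below is a generic textbook implementation, not the evaluation script used for the table, and the function names are illustrative. It also shows why a score can exceed 100, as a few table entries do: insertions count as errors but do not lengthen the reference.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate in percent."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (whitespace removed)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
print(wer("hello", "hello hello hello"))  # two insertions against one reference word
```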
|
| 170 |
+
|
| 171 |
+
---
|
| 172 |
+
|
| 173 |
+
#### **Table 2: Audio Understanding**
|
| 174 |
+
| **Benchmark** | **LongCat-Flash-Omni Instruct** | **Gemini-2.5-Pro (ThinkingBudget128)** | **GPT-4o-Audio** | **Qwen3-Omni Instruct** | **Kimi-Audio** | **Step-Audio-2-mini** |
|
| 175 |
+
|-----------|-------------------------------|-----------------------------------|--------------|----------------------|------------|-------------------|
|
| 176 |
+
| MMAU | 75.90 | 72.80 | 68.40 | 77.50 | 65.20 | 73.20 |
|
| 177 |
+
| VocalSound | 92.76 | 89.45 | 82.37 | 91.60 | 94.85 | 87.58 |
|
| 178 |
+
| TUT2017 | 65.43 | 33.15 | 20.74 | 40.74 | 65.25 | 30.67 |
|
| 179 |
+
| ClothoAQA | 72.83 | 69.67 | 61.87 | 75.16 | 72.21 | 68.39 |
|
| 180 |
+
| Nonspeech7k | 93.79 | 87.59 | 72.28 | 80.83 | 93.93 | 73.24 |
|
| 181 |
+
| CochlScene | 70.02 | 45.34 | 34.94 | 43.03 | 80.42 | 44.58 |
|
| 182 |
+
| MELD | 54.60 | 46.74 | 39.00 | 50.80 | 59.13 | 31.44 |
|
| 183 |
+
|
| 184 |
+
---
|
| 185 |
+
|
| 186 |
+
#### **Table 3: Audio-to-Text Chat**
|
| 187 |
+
| **Benchmark** | **LongCat-Flash-Omni Instruct** | **Gemini-2.5-Pro (ThinkingBudget128)** | **GPT-4o-Audio** | **Qwen3-Omni Instruct** | **Kimi-Audio** | **Step-Audio-2-mini** |
|
| 188 |
+
|-----------|-------------------------------|-----------------------------------|--------------|----------------------|------------|-------------------|
|
| 189 |
+
| **OpenAudioBench** | | | | | | |
|
| 190 |
+
| LlamaQuestions | 83.33 | 83.00 | 86.30 | 83.30 | 79.33 | 69.70 |
|
| 191 |
+
| ReasoningQA | 79.71 | 80.30 | 68.71 | 84.16 | 58.02 | 55.64 |
|
| 192 |
+
| TriviaQA | 86.20 | 90.20 | 76.00 | 75.90 | 62.10 | 45.30 |
|
| 193 |
+
| Webquestions | 76.00 | 80.90 | 81.20 | 75.20 | 70.20 | 54.40 |
|
| 194 |
+
| AlpacaEval | 75.43 | 76.58 | 81.61 | 85.43 | 75.73 | 53.92 |
|
| 195 |
+
| **VoiceBench** | | | | | | |
|
| 196 |
+
| AlpacaEval | 4.94 | 4.70 | 4.73 | 4.74 | 4.46 | 3.84 |
|
| 197 |
+
| CommonEval | 4.32 | 4.11 | 4.37 | 4.54 | 3.97 | 3.19 |
|
| 198 |
+
| OpenBookQA | 93.41 | 95.16 | 87.90 | 89.70 | 83.52 | 72.97 |
|
| 199 |
+
| SDQA | 82.46 | 83.54 | 90.10 | 76.90 | 63.12 | 44.85 |
|
| 200 |
+
| MMSU | 81.95 | 88.32 | 78.90 | 69.00 | 62.17 | 52.00 |
|
| 201 |
+
| AdvBench | 100 | 97.69 | 99.23 | 99.30 | 100 | 97.00 |
|
| 202 |
+
| IFEval | 77.99 | 77.83 | 66.81 | 77.80 | 61.10 | 29.80 |
|
| 203 |
+
|
| 204 |
+
</details>
|
| 205 |
+
|
| 206 |
+
<details>
|
| 207 |
+
<summary>Text</summary>
|
| 208 |
+
|
| 209 |
+
| **Benchmark** | **LongCat-Flash-Omni Instruct** | **LongCat-Flash** | **DeepSeek V3.1** | **Qwen3 MoE-2507** | **Kimi-K2** | **GPT-4.1** | **Claude Sonnet-4** | **Gemini-2.5-Flash** |
|
| 210 |
+
|-----------|-------------------------------|---------------|---------------|----------------|---------|---------|-----------------|------------------|
|
| 211 |
+
| Architecture | MoE | MoE | MoE | MoE | MoE | - | - | - |
|
| 212 |
+
| # Total Params | 560B | 560B | 671B | 235B | 1043B | - | - | - |
|
| 213 |
+
| # Activated Params | 27B | 27B | 37B | 22B | 32B | - | - | - |
|
| 214 |
+
| **General Domains** ||||||||||
|
| 215 |
+
| MMLU<sub>(acc)</sub> | 90.30 | 89.71 | 90.96 | 90.23 | 89.86 | 89.64 | 91.75 | 86.33 |
|
| 216 |
+
| MMLU-Pro<sub>(acc)</sub> | 82.73 | 82.68 | 84.45 | 84.83 | 82.06 | 81.72 | 83.74 | 81.95 |
|
| 217 |
+
| CEval<sub>(acc)</sub> | 91.68 | 90.44 | 89.21 | 92.70 | 91.26 | 79.53 | 86.63 | 78.78 |
|
| 218 |
+
| CMMLU<sub>(acc)</sub> | 89.39 | 84.34 | 88.04 | 88.14 | 89.66 | 77.65 | 86.51 | 78.30 |
|
| 219 |
+
| **Instruction Following** ||||||||||
|
| 220 |
+
| IFEval<sub>(acc)</sub> | 82.44 | 89.65 | 86.69 | 88.54 | 88.91 | 85.58 | 88.35 | 83.92 |
|
| 221 |
+
| COLLIE<sub>(acc)</sub> | 45.69 | 57.10 | 43.80 | 49.71 | 56.34 | 50.00 | 51.22 | 48.60 |
|
| 222 |
+
| Meeseeks-zh<sub>(acc)</sub> | 39.05 | 43.03 | 33.83 | 35.32 | 42.79 | 41.54 | 35.07 | 34.84 |
|
| 223 |
+
| **Mathematical Reasoning** ||||||||||
|
| 224 |
+
| MATH500<sub>(acc)</sub> | 97.60 | 96.40 | 96.08 | 98.80 | 97.60 | 90.60 | 93.80 | 98.40 |
|
| 225 |
+
| AIME24<sub>(avg@10)</sub> | 72.92 | 70.42 | 66.30* | 81.67 | 69.60* | 47.00 | 47.00 | 79.67 |
|
| 226 |
+
| BeyondAIME<sub>(avg@10)</sub> | 47.40 | 43.00 | 36.50 | 57.60 | 36.60 | 22.10 | 20.50 | 44.20 |
|
| 227 |
+
| **General Reasoning** ||||||||||
|
| 228 |
+
| GPQA-diamond<sub>(acc)</sub> | 74.41 | 73.23 | 74.90* | 77.43 | 75.76 | 67.68 | 70.71 | 80.30 |
|
| 229 |
+
| DROP<sub>(f1)</sub> | 83.53 | 79.06 | 84.19 | 78.57 | 89.04 | 66.94 | 73.06 | 45.03 |
|
| 230 |
+
| ZebraLogic<sub>(acc)</sub> | 86.00 | 89.30 | 85.30 | 94.22 | 89.11 | 56.30* | 80.10 | 57.00 |
|
| 231 |
+
| GraphWalks-128k<sub>(precision)</sub> | 56.00 | 51.05 | 73.54 | 80.72 | 47.50 | 85.02 | 80.57 | 64.83 |
|
| 232 |
+
| **Coding** ||||||||||
|
| 233 |
+
| LiveCodeBench<sub>(pass@1)</sub> | 52.64 | 48.02 | 56.40* | 46.48 | 46.70 | 39.21 | 45.59 | 39.65 |
|
| 234 |
+
| Humaneval+<sub>(pass@1)</sub> | 90.85 | 88.41 | 92.68 | 94.51 | 85.98 | 93.29 | 94.51 | 87.80 |
|
| 235 |
+
| MBPP+<sub>(pass@1)</sub> | 80.16 | 79.63 | 79.89 | 79.89 | 81.75 | 79.37 | 80.16 | 76.19 |
|
| 236 |
+
|
| 237 |
+
**Note**: Values marked with * are sourced from other public reports. DeepSeek-V3.1, Qwen3-235B-A22B, Gemini-2.5-Flash, and Claude Sonnet-4 are evaluated in non-thinking mode.
|
| 238 |
+
|
| 239 |
+
</details>
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
## Quick Start
|
| 243 |
+
|
| 244 |
+
### Model Download
|
| 245 |
+
|
| 246 |
+
LongCat-Flash-Omni is a large MoE model whose weights are split into many shards and, at inference time, distributed across multiple devices. When loading through Hugging Face Transformers or vLLM, the weights are downloaded automatically based on the model name. If your runtime environment cannot download weights during execution, use the following commands to download them to a local directory in advance:
|
| 247 |
+
|
| 248 |
+
```bash
|
| 249 |
+
# Download through Hugging Face
|
| 250 |
+
pip install -U "huggingface_hub[cli]"
|
| 251 |
+
huggingface-cli download meituan-longcat/LongCat-Flash-Omni --local-dir ./LongCat-Flash-Omni
|
| 252 |
+
```
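After a manual download, it can be useful to confirm that every weight shard arrived. The sketch below assumes the shard naming seen in this repository's file list (`model_00001-of-00080.safetensors` and so on, which suggests 80 shards in total). The helper names and the default directory are illustrative.

```python
from pathlib import Path


def expected_shards(total=80):
    """Shard file names following the model_XXXXX-of-XXXXX.safetensors pattern."""
    return [f"model_{index:05d}-of-{total:05d}.safetensors" for index in range(1, total + 1)]


def missing_shards(local_dir, total=80):
    """Return the expected shard names that are not present in local_dir."""
    directory = Path(local_dir)
    return [name for name in expected_shards(total) if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_shards("./LongCat-Flash-Omni")
    print(f"{len(missing)} shard(s) missing")
    for name in missing[:5]:
        print("  ", name)
```

This only checks that the files exist. It does not verify checksums or the non-shard files such as the `audio/` and `audio_codec/` components.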
|
| 253 |
+
|
| 254 |
+
### Usage
|
| 255 |
+
|
| 256 |
+
We have implemented basic adaptations in SGLang to support running LongCat-Flash-Omni. The official SGLang release does not yet support the model natively, so for now you can use our [development branch](https://github.com/XiaoBin1992/sglang/tree/longcat_omni_v0.5.3.post3) for local installation and testing.
|
| 257 |
+
|
| 258 |
+
Due to its size of 560 billion parameters (560B), LongCat-Flash-Omni requires at least one node (e.g., 8×H20-141G) to host the model weights in FP8 format, and at least two nodes (e.g., 16×H800-80G) for BF16 weights. Detailed launch configurations are provided below.
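The node counts follow from simple arithmetic on weight storage. The back-of-the-envelope sketch below assumes 1 byte per parameter for FP8 and 2 bytes for BF16, and reads the per-GPU memory sizes off the example names (141 GB and 80 GB). It ignores the KV cache, activations, and any runtime overhead, so real headroom is smaller than shown.

```python
TOTAL_PARAMS = 560e9  # total parameter count stated in this README

BYTES_PER_PARAM = {"FP8": 1, "BF16": 2}  # assumed storage cost per parameter


def weight_gb(precision):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def headroom_per_gpu_gb(precision, gpus, gb_per_gpu):
    """Memory left on each GPU after an even split of the raw weights."""
    return gb_per_gpu - weight_gb(precision) / gpus


# The two example setups from the text: (precision, GPU count, GB per GPU).
for precision, gpus, gb_per_gpu in [("FP8", 8, 141), ("BF16", 16, 80)]:
    total = weight_gb(precision)
    spare = headroom_per_gpu_gb(precision, gpus, gb_per_gpu)
    print(f"{precision}: {total:.0f} GB of weights, {spare:.0f} GB spare per GPU on {gpus} GPUs")
```

Under these assumptions both setups hold about 70 GB of weights per GPU, leaving roughly 71 GB per GPU for FP8 on H20 and 10 GB per GPU for BF16 on H800.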
|
| 259 |
+
|
| 260 |
+
#### Installation
|
| 261 |
+
* Python >= 3.10.0 (Anaconda recommended)
|
| 262 |
+
* PyTorch >= 2.8
|
| 263 |
+
* CUDA >= 12.9
|
| 264 |
+
|
| 265 |
+
```bash
|
| 266 |
+
conda create -n longcat python=3.10
|
| 267 |
+
conda activate longcat
|
| 268 |
+
|
| 269 |
+
# install SGLang
|
| 270 |
+
git clone -b longcat_omni_v0.5.3.post3 https://github.com/XiaoBin1992/sglang.git
|
| 271 |
+
pushd sglang
|
| 272 |
+
pip install -e "python"
|
| 273 |
+
popd
|
| 274 |
+
|
| 275 |
+
# install longcat-flash-omni demo
|
| 276 |
+
git clone https://github.com/meituan-longcat/LongCat-Flash-Omni
|
| 277 |
+
pushd LongCat-Flash-Omni
|
| 278 |
+
git submodule update --init --recursive
|
| 279 |
+
pip install -r requirements.txt
|
| 280 |
+
popd
|
| 281 |
+
```
|
| 282 |
+
|
| 283 |
+
#### Demo
|
| 284 |
+
|
| 285 |
+
The model can be served on your cluster using a combination of Tensor Parallelism and Expert Parallelism.
|
| 286 |
+
Once all dependencies are installed, you can launch the demo using the following command.
|
| 287 |
+
|
| 288 |
+
* single-node inference
|
| 289 |
+
```bash
|
| 290 |
+
python3 longcat_omni_demo.py \
|
| 291 |
+
--tp-size 8 \
|
| 292 |
+
--ep-size 8 \
|
| 293 |
+
--model-path where_you_download_model_dir \
|
| 294 |
+
--output-dir output
|
| 295 |
+
```
|
| 296 |
+
|
| 297 |
+
* multi-node inference
|
| 298 |
+
```bash
|
| 299 |
+
python3 longcat_omni_demo.py \
|
| 300 |
+
--tp-size 16 \
|
| 301 |
+
--ep-size 16 \
|
| 302 |
+
--nodes 2 \
|
| 303 |
+
--node-rank $NODE_RANK \
|
| 304 |
+
--dist-init-addr $MASTER_IP:5000 \
|
| 305 |
+
--model-path where_you_download_model_dir \
|
| 306 |
+
--output-dir output
|
| 307 |
+
```
|
| 308 |
+
|
| 309 |
+
> NOTE: Replace \$NODE\_RANK and \$MASTER\_IP with the corresponding values of your GPU machines.
|
| 310 |
+
|
| 311 |
+
All test cases are defined in `examples_dict.py`, and you can add more as needed. After the model runs, the generated results are saved in the directory specified by the `--output-dir` parameter.
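For multi-node inference, each machine runs the same command with a different `--node-rank`. The sketch below assembles that command line in Python using only the flags that appear in this README. The helper name `build_demo_command`, the example IP address, and the model path are placeholders.

```python
import shlex


def build_demo_command(model_path, output_dir, gpus_total, nodes=1, node_rank=0, master_ip=None, port=5000):
    """Assemble the longcat_omni_demo.py invocation for one node."""
    command = [
        "python3", "longcat_omni_demo.py",
        "--tp-size", str(gpus_total),
        "--ep-size", str(gpus_total),
    ]
    if nodes > 1:
        if master_ip is None:
            raise ValueError("master_ip is required for multi-node inference")
        command += [
            "--nodes", str(nodes),
            "--node-rank", str(node_rank),
            "--dist-init-addr", f"{master_ip}:{port}",
        ]
    command += ["--model-path", model_path, "--output-dir", output_dir]
    return command


# Print the command for each of two nodes (16 GPUs in total, as in the example above).
for rank in range(2):
    argv = build_demo_command(
        "./LongCat-Flash-Omni", "output", gpus_total=16, nodes=2, node_rank=rank, master_ip="10.0.0.1"
    )
    print(shlex.join(argv))
```

Returning the command as a list keeps it ready for `subprocess.run`, while `shlex.join` gives a shell-safe string for logging or copy-pasting.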
|
| 312 |
+
|
| 313 |
+
|
| 314 |
+
## Interaction with LongCat-Flash-Omni
|
| 315 |
+
|
| 316 |
+
### Real-time Chat Website
|
| 317 |
+
|
| 318 |
+
You can try LongCat-Flash-Omni at [https://longcat.ai](https://longcat.ai). The web version currently supports only audio interaction; the full service will be provided in subsequent updates.
|
| 319 |
+
|
| 320 |
+
### APP
|
| 321 |
+
|
| 322 |
+
We are excited to announce that the LongCat-Flash-Omni app is now available for both Android and iOS.
|
| 323 |
+
|
| 324 |
+
For Android, you can download it from the following QR code.
|
| 325 |
+
|
| 326 |
+
<img src=https://raw.githubusercontent.com/meituan-longcat/LongCat-Flash-Omni/main/figures/android_app_qrcode.jpg width="200px">
|
| 327 |
+
|
| 328 |
+
For iOS, you can download it by searching "LongCat" at App Store or QR code. Currently, only the Chinese App Store is supported.
|
| 329 |
+
|
| 330 |
+
<img src=https://raw.githubusercontent.com/meituan-longcat/LongCat-Flash-Omni/main/figures/ios_app_qrcode.jpg width="200px">
|
| 331 |
+
|
| 332 |
+
|
| 333 |
+
## License Agreement
|
| 334 |
+
|
| 335 |
+
The **model weights** are released under the **MIT License**.
|
| 336 |
+
|
| 337 |
+
Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.
|
| 338 |
+
|
| 339 |
+
See the [LICENSE](https://raw.githubusercontent.com/meituan-longcat/LongCat-Flash-Omni/refs/heads/main/LICENSE) file for the full license text.
|
| 340 |
+
|
| 341 |
+
## Usage Considerations
|
| 342 |
+
This model has not been specifically designed or comprehensively evaluated for every possible downstream application.
|
| 343 |
+
|
| 344 |
+
Developers should take into account the known limitations of large language models, including performance variations across different languages, and carefully assess accuracy, safety, and fairness before deploying the model in sensitive or high-risk scenarios.
|
| 345 |
+
It is the responsibility of developers and downstream users to understand and comply with all applicable laws and regulations relevant to their use case, including but not limited to data protection, privacy, and content safety requirements.
|
| 346 |
+
|
| 347 |
+
Nothing in this Model Card should be interpreted as altering or restricting the terms of the MIT License under which the model is released.
|
| 348 |
+
|
| 349 |
+
## Citation
|
| 350 |
+
We kindly encourage citation of our work if you find it useful.
|
| 351 |
+
|
| 352 |
+
```
|
| 353 |
+
@misc{
|
| 354 |
+
title={LongCat-Flash-Omni Technical Report},
|
| 355 |
+
author={Meituan LongCat Team},
|
| 356 |
+
year={2025},
|
| 357 |
+
url={https://github.com/meituan-longcat/LongCat-Flash-Omni},
|
| 358 |
+
}
|
| 359 |
+
```
|
| 360 |
+
|
| 361 |
+
## Contact
|
| 362 |
+
Please contact us at <a href="mailto:longcat-team@meituan.com">longcat-team@meituan.com</a> or join our WeChat Group if you have any questions.
|
| 363 |
+
|
| 364 |
+
#### WeChat Group
|
| 365 |
+
<img src=https://raw.githubusercontent.com/meituan-longcat/LongCat-Flash-Omni/main/figures/wechat_qrcode.jpeg width="200px">
|
audio/audio_embeddings.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:691e4dd1450124c5571a0abe7c4fd63392bc9fd4fd501f4556ac1133b45b3cea
size 404228184
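
The `.pt` and `.safetensors` entries in this commit are Git LFS pointer files rather than the weights themselves: each records a spec version, a SHA-256 object id, and the object size in bytes. A minimal sketch of reading one, assuming the three-line `key value` layout of the pointer for `audio/audio_embeddings.pt` (the function name `parse_lfs_pointer` is illustrative and not part of any library):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict (illustrative helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:691e4dd1450124c5571a0abe7c4fd63392bc9fd4fd501f4556ac1133b45b3cea
size 404228184
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algorithm"], f"{info['size_bytes'] / 2**20:.1f} MiB")
```

Summing `size_bytes` over all pointers gives a rough estimate of the download size before fetching the actual objects.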
audio/audio_encoder.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3e6217e82dd201f081233a46711e08ccc2fed85431543829930123d03dcb14aa
size 1024481990
audio/audio_output_layers.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c8b34a30861bef825e45e5e8ed2fd3c2dd2f009eadbccb36613cf463ea15f5dc
size 707398600
audio/audio_projector.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2a66dc8801ed429dc5f9643762c08b43dfe7234bc57338b560d3033a1616bad0
size 151021584
audio/config.json
ADDED
@@ -0,0 +1,85 @@
{
  "_name_or_path": "",
  "activation": "relu6",
  "add_cross_attention": false,
  "architectures": [
    "LongCatAudioEncoder"
  ],
  "bad_words_ids": null,
  "begin_suppress_tokens": null,
  "bos_token_id": null,
  "chunk_size_feed_forward": 0,
  "cross_attention_hidden_size": null,
  "decoder_start_token_id": null,
  "diversity_penalty": 0.0,
  "do_sample": false,
  "dropout": 0.0,
  "early_stopping": false,
  "encoder_no_repeat_ngram_size": 0,
  "eos_token_id": null,
  "exponential_decay_length_penalty": null,
  "finetuning_task": null,
  "forced_bos_token_id": null,
  "forced_eos_token_id": null,
  "gradient_checkpointing": false,
  "hidden_size": 6144,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "input_size": 1200,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_drop": 0.0,
  "left_context": 7,
  "left_order": 10,
  "left_stride": 1,
  "length_penalty": 1.0,
  "max_length": 20,
  "min_length": 0,
  "model_type": "",
  "ndnn": 2,
  "nlayer": 22,
  "no_repeat_ngram_size": 0,
  "normalize": "LayerNorm",
  "num_beam_groups": 1,
  "num_beams": 1,
  "num_return_sequences": 1,
  "num_right_layers": 6,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_scores": false,
  "pad_token_id": null,
  "prefix": null,
  "problem_type": null,
  "proj_size": 1536,
  "pruned_heads": {},
  "remove_invalid_values": false,
  "repetition_penalty": 1.0,
  "return_dict": true,
  "return_dict_in_generate": false,
  "right_context": 7,
  "right_order": 1,
  "right_stride": 1,
  "sep_token_id": null,
  "stride": 8,
  "suppress_tokens": null,
  "task_specific_params": null,
  "temperature": 1.0,
  "tf_legacy_loss": false,
  "tie_encoder_decoder": false,
  "tie_word_embeddings": false,
  "tokenizer_class": null,
  "top_k": 50,
  "top_p": 1.0,
  "torch_dtype": "float32",
  "torchscript": false,
  "transformers_version": "4.39.3",
  "typical_p": 1.0,
  "use_bfloat16": false,
  "vocab_size": 5252
}
audio/global_cmvn
ADDED
Binary file (663 Bytes).
audio/preprocessor_config.json
ADDED
@@ -0,0 +1,26 @@
{
  "fbank": {
    "dither": 0.0,
    "frame_length": 25,
    "frame_shift": 10,
    "preemphasis": 0.97,
    "freq": 16000,
    "high_freq": -200,
    "low_freq": 40,
    "num_mel_bins": 80
  },
  "delta": {
    "delta_order": 0,
    "window_size": 2
  },
  "cmvn": {
    "global_cmvn": "/path/to/audio/global_cmvn"
  },
  "splice": {
    "left": 7,
    "right": 7,
    "stride": 8,
    "random_start": false,
    "seed": 0
  }
}
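
The numbers in `audio/preprocessor_config.json` can be cross-checked against `audio/config.json`. As a sketch, assume that `frame_length` and `frame_shift` are in milliseconds, that `freq` is the sample rate in Hz, and that the `splice` step concatenates `left + 1 + right` neighboring frames and emits one result every `stride` frames. These interpretations are assumptions based on the key names, not something the repository confirms. Under them, the spliced feature width works out to the encoder's `input_size` of 1200:

```python
# Values copied from audio/preprocessor_config.json.
fbank = {"frame_length": 25, "frame_shift": 10, "freq": 16000, "num_mel_bins": 80}
splice = {"left": 7, "right": 7, "stride": 8}

# Assumption: frame_length and frame_shift are given in milliseconds.
samples_per_window = fbank["freq"] * fbank["frame_length"] // 1000
samples_per_hop = fbank["freq"] * fbank["frame_shift"] // 1000

# Assumption: splicing stacks the center frame with its left and right context.
frames_per_splice = splice["left"] + 1 + splice["right"]
spliced_dim = frames_per_splice * fbank["num_mel_bins"]

# Assumption: one spliced vector is emitted every `stride` frames.
ms_per_encoder_step = splice["stride"] * fbank["frame_shift"]

print(samples_per_window, samples_per_hop)  # analysis window and hop, in samples
print(spliced_dim)                          # feature width fed to the encoder
print(ms_per_encoder_step)                  # audio covered per encoder step, in ms
```

The match between `spliced_dim` and `input_size` supports the splicing interpretation, but it does not prove it.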
audio_codec/LongCatAudioCodec_decoder_24k_4codebooks_aug_sft.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b2ed907fbefd6ca60cbb6078852233ce693ddddef05f1ca4e201b4fd3119189
size 603637648
audio_codec/config.yaml
ADDED
@@ -0,0 +1,11 @@
codec_config:
  codec_dimension: 1024
  codec_dec_ratios: [8,6,6,5]
  decoder_dim: 1536
  semantic_dim: 1280
  decoder_type: '24k'
  ckpt_path: '/path/to/audio_codec/LongCatAudioCodec_decoder_24k_4codebooks_aug_sft.pt'
  codec_codebook_size: 90
  codec_codebook_search_dim: 8
  codec_codebooks: 3
  semantic_token_nums: 8192
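
A quick way to read the values in `audio_codec/config.yaml`: if `codec_dec_ratios` are the per-stage upsampling factors of the decoder and `decoder_type: '24k'` denotes 24 kHz output (both are assumptions based on the key names), then each codec frame expands to the product of the ratios in waveform samples. A minimal sketch of that arithmetic:

```python
import math

codec_dec_ratios = [8, 6, 6, 5]  # from audio_codec/config.yaml
sample_rate_hz = 24_000          # assumption: '24k' means 24 kHz output

# Total upsampling factor of the decoder under the stated assumption.
samples_per_frame = math.prod(codec_dec_ratios)
frames_per_second = sample_rate_hz / samples_per_frame
ms_per_frame = 1000 * samples_per_frame / sample_rate_hz

print(samples_per_frame)            # waveform samples generated per codec frame
print(round(frames_per_second, 2))  # codec frames per second
print(ms_per_frame)                 # duration of one codec frame, in ms
```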
config.json
ADDED
@@ -0,0 +1,37 @@
{
  "architectures": [
    "LongcatFlashOmniForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_longcat_flash.LongcatFlashConfig",
    "AutoModel": "modeling_longcat_flash.LongcatFlashModel",
    "AutoModelForCausalLM": "modeling_longcat_flash.LongcatFlashForCausalLM"
  },
  "vocab_size": 131072,
  "hidden_size": 6144,
  "ffn_hidden_size": 12288,
  "expert_ffn_hidden_size": 2048,
  "num_layers": 28,
  "num_attention_heads": 64,
  "kv_lora_rank": 512,
  "q_lora_rank": 1536,
  "qk_rope_head_dim": 64,
  "v_head_dim": 128,
  "qk_nope_head_dim": 128,
  "mla_scale_q_lora": true,
  "mla_scale_kv_lora": true,
  "routed_scaling_factor": 6.0,
  "n_routed_experts": 512,
  "max_position_embeddings": 131072,
  "rms_norm_eps": 1e-5,
  "use_cache": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "rope_theta": 10000000.0,
  "attention_method": "MLA",
  "zero_expert_num": 256,
  "zero_expert_type": "identity",
  "moe_topk": 12
}
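
Several derived quantities follow directly from the fields in `config.json`. The sketch below computes `qk_head_dim` the same way `configuration_longcat_flash.py` does. The other two quantities rest on assumptions: that the router chooses among the regular routed experts plus the zero ("identity") experts, and that an MLA attention module caches one compressed KV latent of width `kv_lora_rank` plus one shared rotary key of width `qk_rope_head_dim` per token. Neither is stated in this file.

```python
# Subset of values copied from config.json.
config = {
    "kv_lora_rank": 512,
    "qk_rope_head_dim": 64,
    "qk_nope_head_dim": 128,
    "n_routed_experts": 512,
    "zero_expert_num": 256,
    "moe_topk": 12,
}

# Same formula as LongcatFlashConfig: per-head query/key width.
qk_head_dim = config["qk_nope_head_dim"] + config["qk_rope_head_dim"]

# Assumption: the router picks moe_topk entries from routed plus zero experts.
router_choices = config["n_routed_experts"] + config["zero_expert_num"]

# Assumption: MLA caches the compressed KV latent plus the shared RoPE key.
cached_values_per_token = config["kv_lora_rank"] + config["qk_rope_head_dim"]

print(qk_head_dim)
print(router_choices, config["moe_topk"])
print(cached_values_per_token)
```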
configuration_longcat_flash.py
ADDED
@@ -0,0 +1,216 @@
"""LongcatFlash model configuration"""

from transformers.configuration_utils import PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation


LONGCAT_PRETRAINED_CONFIG_ARCHIVE_MAP = {}


class LongcatFlashConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`LongcatFlashModel`]. It is used to instantiate a LongcatFlash
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of LongcatFlash.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 131072):
            Vocabulary size of the LongcatFlash model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`LongcatFlashModel`]
        hidden_size (`int`, *optional*, defaults to 7168):
            Dimension of the hidden representations.
        ffn_hidden_size (`int`, *optional*, defaults to 18432):
            Dimension of the MLP representations.
        expert_ffn_hidden_size (`int`, *optional*, defaults to 2048):
            Dimension of the MoE representations.
        num_layers (`int`, *optional*, defaults to 61):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 128):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*, defaults to 128):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1`, the model will use Multi Query Attention (MQA), otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        n_routed_experts (`int`, *optional*, defaults to 256):
            Number of routed experts.
        routed_scaling_factor (`float`, *optional*, defaults to 1):
            Scaling factor for routed experts.
        kv_lora_rank (`int`, *optional*, defaults to 512):
            Rank of the LoRA matrices for key and value projections.
        q_lora_rank (`int`, *optional*, defaults to 1536):
            Rank of the LoRA matrices for query projections.
        qk_rope_head_dim (`int`, *optional*, defaults to 64):
            Dimension of the query/key heads that use rotary position embeddings.
        v_head_dim (`int`, *optional*, defaults to 128):
            Dimension of the value heads.
        qk_nope_head_dim (`int`, *optional*, defaults to 128):
            Dimension of the query/key heads that don't use rotary position embeddings.
        moe_topk (`int`, *optional*, defaults to 8):
            Number of experts each token is routed to.
        norm_topk_prob (`bool`, *optional*, defaults to `False`):
            Whether to normalize the weights of the routed experts.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model might ever be used with.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 0):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 1):
            End of stream token id.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        attention_method (`str`, *optional*, defaults to `"MLA"`):
            The attention method to use.
        initializer_range (`float`, *optional*, defaults to 0.006):
            The initializer range for the model.
        router_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the router.
        zero_expert_num (`int`, *optional*, defaults to `None`):
            The number of zero experts to use.
        zero_expert_type (`str`, *optional*, defaults to `None`):
            The type of zero expert to use.

    ```python
    >>> from transformers import LongcatFlashModel, LongcatFlashConfig

    >>> # Initializing a LongcatFlash style configuration
    >>> configuration = LongcatFlashConfig()

    >>> # Initializing a model from the configuration
    >>> model = LongcatFlashModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "longcat_flash"
    keys_to_ignore_at_inference = ["past_key_values"]
    base_model_tp_plan = {
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.experts.*.gate_proj": "local_colwise",
        "layers.*.mlp.experts.*.up_proj": "local_colwise",
        "layers.*.mlp.experts.*.down_proj": "local_rowwise",
        "layers.*.mlps.*.gate_proj": "local_colwise",
        "layers.*.mlps.*.up_proj": "local_colwise",
        "layers.*.mlps.*.down_proj": "local_rowwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }

    def __init__(
        self,
        vocab_size=131072,
        hidden_size=7168,
        ffn_hidden_size=18432,
        expert_ffn_hidden_size=2048,
        num_layers=61,
        num_attention_heads=128,
        num_key_value_heads=None,
        n_routed_experts=256,
        routed_scaling_factor=1,
        kv_lora_rank=512,
        q_lora_rank=1536,
        qk_rope_head_dim=64,
        v_head_dim=128,
        qk_nope_head_dim=128,
        mla_scale_q_lora=True,
        mla_scale_kv_lora=True,
        moe_topk=8,
        norm_topk_prob=False,
        hidden_act="silu",
        max_position_embeddings=4096,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=0,
        eos_token_id=1,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        attention_bias=False,
        attention_dropout=0.0,
        attention_method='MLA',
        initializer_range=0.006,
        router_bias=False,
        zero_expert_num=None,
        zero_expert_type=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.ffn_hidden_size = ffn_hidden_size
        self.expert_ffn_hidden_size = expert_ffn_hidden_size
        self.num_layers = num_layers
        self.num_attention_heads = num_attention_heads
        self.n_routed_experts = n_routed_experts
        self.routed_scaling_factor = routed_scaling_factor
        self.kv_lora_rank = kv_lora_rank
        self.q_lora_rank = q_lora_rank
        self.qk_rope_head_dim = qk_rope_head_dim
        self.v_head_dim = v_head_dim
        self.qk_nope_head_dim = qk_nope_head_dim
        self.qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
        self.moe_topk = moe_topk
        self.norm_topk_prob = norm_topk_prob
        self.mla_scale_q_lora = mla_scale_q_lora
        self.mla_scale_kv_lora = mla_scale_kv_lora
        self.attention_method = attention_method
        self.initializer_range = initializer_range
        self.router_bias = router_bias
        self.zero_expert_num = zero_expert_num
        self.zero_expert_type = zero_expert_type

        if self.attention_method == "MLA":
            self.head_dim = qk_rope_head_dim
        else:
            # The original line built the exception without raising it.
            raise ValueError('attention_method should be one of ["MLA"]')


        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.rms_norm_eps = rms_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout

        rope_config_validation(self)

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

    @property
    def num_hidden_layers(self):
        return self.num_layers


__all__ = ["LongcatFlashConfig"]
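
Note that the constructor defaults in `configuration_longcat_flash.py` differ from the values shipped in `config.json`; when the configuration is loaded from the checkpoint, the JSON values are passed to the constructor and take precedence. The following sketch illustrates that precedence with plain dictionaries rather than the real `PretrainedConfig` machinery (the `resolve` helper is hypothetical), using a handful of fields copied from both files:

```python
# Constructor defaults from LongcatFlashConfig.__init__ (subset).
defaults = {
    "hidden_size": 7168,
    "ffn_hidden_size": 18432,
    "num_layers": 61,
    "num_attention_heads": 128,
    "n_routed_experts": 256,
    "routed_scaling_factor": 1,
    "moe_topk": 8,
    "max_position_embeddings": 4096,
    "rope_theta": 10000.0,
    "zero_expert_num": None,
}

# Values from config.json in this commit (subset).
checkpoint = {
    "hidden_size": 6144,
    "ffn_hidden_size": 12288,
    "num_layers": 28,
    "num_attention_heads": 64,
    "n_routed_experts": 512,
    "routed_scaling_factor": 6.0,
    "moe_topk": 12,
    "max_position_embeddings": 131072,
    "rope_theta": 10000000.0,
    "zero_expert_num": 256,
}


def resolve(default_values, overrides):
    """Return the effective settings; overrides win over defaults (hypothetical helper)."""
    return {**default_values, **overrides}


effective = resolve(defaults, checkpoint)
# List the fields whose effective value differs from the constructor default.
changed = sorted(key for key in defaults if effective[key] != defaults[key])
print(changed)
```

In practice this means the docstring defaults describe the class, not this checkpoint; read `config.json` for the values the released model actually uses.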
generation_config.json
ADDED
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.51.3"
}
language_model_embedding.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f18abb8d0e4cc2fa99c4a5fa30f887a3a12a1a41c0732b79f8b2bd133202a76d
size 1610614129
model.safetensors.index.json
ADDED
The diff for this file is too large to render.
model_00001-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0739930e842bb5a86662e1d67f8c545d9ea230d5c57714354b7e34c82ed9a9c7
size 1610612880
model_00002-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e44ce263e6fd885f50d82ca515b9325375b43ee36ededb75acf161ce88bc2e41
size 48
model_00003-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e44ce263e6fd885f50d82ca515b9325375b43ee36ededb75acf161ce88bc2e41
size 48
model_00004-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e44ce263e6fd885f50d82ca515b9325375b43ee36ededb75acf161ce88bc2e41
size 48
model_00005-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e44ce263e6fd885f50d82ca515b9325375b43ee36ededb75acf161ce88bc2e41
size 48
model_00006-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7b744394fd0184b47ad5dd41d1f18e5ba86a7f39c9d751c1312000bcc1a45d6e
size 15953918680
model_00007-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:31aab2325632b27ee496e8f6dc004674288c85245331ff73a908a1c94f18e275
size 15918289608
model_00008-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0923b82637807507c876cc3b08d566caa20b0371f50da986c8c1519728584ee6
size 15852491488
model_00009-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a0bcca8e97b15ed9d955aad63f4584b94ccf3ebc898b8774573e4a1f187ed518
size 16111199992
model_00010-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:42a27f00c1962f37a158b37d69784b1d5e2dce5bbd73cabf73a0965fd0501eb4
size 16048279128
model_00011-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bdd2411cce434a2469cf023fb6fe193147ce693cb8bd64ffc585d5489b2dd85e
size 15953918680
model_00012-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ee9b9c8370f7b1dbed80225a816d378a112ed52132d5ec667f3e9dd7043c8982
size 15918289608
model_00013-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b0c54ae6ca7cd3144a6b7028c27e3fbe959a056f7120da3fcd83ead126d8c95c
size 15852491488
model_00014-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bb540e6244bcf1aef6614481443a1cd785a1b868953886667e796b6a2220d538
size 16111199992
model_00015-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6bf32626e4b6cd01a6e4e2a71e2631e986ae47b94af32a2f8ebdeef5473e14db
size 16048279128
model_00016-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a0d99af3e8fecbb229cac4ed0e9d39b555a6e86abb069b1eb68cc12ad7a9d65e
size 15953918680
model_00017-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5b96034ec959b45273cb66bbf63940210f958eaeb962e1bf4da2dffe716ece71
size 15918289608
model_00018-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ae04faa2f8abc37e2aa992929d8a7fef9ec771e63344b952c92e23c2d285c525
size 15852491488
model_00019-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:54942379fa2969eab782d01bf8d1891c685890a3c470b8431f427406fab2b606
size 16111199992
model_00020-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:72157629e7f74c4ef8831a334b40efe93b7c03744a1a54514488cc9011fbc8a0
size 16048279128
model_00021-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:29e85eb3cd7b4dff4889e38501af42aaf16b7237f83622195c4115803982551d
size 15953918680
model_00022-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7760c0e0fe4f02cc6d5c8fe099bdd1049b2ef3a6b1afe8872d5300f277e0ec3a
size 15918289608
model_00023-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:98161d555f0750b5e0f420f5aca0a3a69760ae74c2a9d3359eda0f9038ddf1cd
size 15852491488
model_00024-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:599ef83350116b3247ce19cac87c502dfd5b5b2ba6fcce2304eb6395bbace8a8
size 16111199992
model_00025-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0cb646c661f276d19267820f271efe4227fc6000b383f3057cd1f78f2210ea82
size 16048279128
model_00026-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:83fc5dd48b2dfa2d752f559e1c449c593a56ccd02108d610a45bc36530c89039
size 15953918680
model_00027-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ba0f653bac6db2981b25a3eee1405a9f3102bd2dfc39245ad69cbbd3ed271ea2
size 15918289608
model_00028-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d69ab73062467f436dfb0e864d4586fcc76228386ecd6bf47b4208fa74b37a46
size 15852491488
model_00029-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:19a5333380e24abc9ec6264b14002da142803762b3b9dd3b1be5aa26ba519b48
size 16111199992
model_00030-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a5a0974c5e6a04ab2c718fa2858ade42b877cab87dd6256ea6b5a857e0e6d685
size 16048279128
model_00031-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:03a6400e25994e94dc4094497f2aa22d428622cc692b71c6a8b1e0da66e7ae5a
size 15953919304
model_00032-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bb6e570622a03d425e5c4a317d3018c541bf865993927a5111aa854c3038ce29
size 15918290232
model_00033-of-00080.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:308d665912c8bee46a9b831d07c61497ebce03c0fe0f17bb50fecff18b2db5f8
size 15852492112