MNV-17: Nonverbal Vocalization Recognition
This repository demonstrates the performance of Qwen2.5-Omni and Qwen2-Audio models fine-tuned on the MNV-17 dataset for joint ASR and Nonverbal Vocalization (NV) recognition. It also provides inference scripts for both models.
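For orientation, here is a minimal inference sketch using the standard Qwen2-Audio-Instruct interface in `transformers`. It assumes the fine-tuned checkpoint `kiiic/MNV-17-Qwen-fintune` keeps that interface; the audio path and prompt text are placeholders, so prefer the repository's own inference scripts for the exact prompts and NV tag format.

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "kiiic/MNV-17-Qwen-fintune"  # fine-tuned from Qwen/Qwen2-Audio-7B-Instruct
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Placeholder audio file and instruction; adjust to the prompts used in the repo's scripts.
wav_path = "example.wav"
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": wav_path},
        {"type": "text", "text": "Transcribe the speech, including nonverbal vocalization tags."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load(wav_path, sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
generated = generated[:, inputs.input_ids.size(1):]  # strip the prompt tokens
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```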
Click here for interactive audio demo
Key Findings
Unseen Speaker Generalization
Crucial Note: All demo samples are from speakers who were completely unseen during training.
This indicates that the model learned universal NV patterns rather than merely fitting specific speakers' habits, demonstrating strong cross-speaker generalization.
Model Performance
Experimental results from our paper:
| Model | Joint CER | NV Recognition Accuracy |
|---|---|---|
| Qwen2.5-Omni | 3.60% | 57.29% |
| Qwen2-Audio | 4.84% | 56.28% |
| SenseVoice | 8.71% | 57.29% |
| Paraformer | 5.70% | 28.64% |
Performance Highlights
- Lowest Joint CER: Qwen2.5-Omni achieved a 3.60% joint CER, the best result on the combined ASR and NV recognition task.
- Strong NV Recognition: 57.29% accuracy under strict exact-match evaluation (NV type, count, and order must all match; see the sketch below).
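The exact-match metric can be illustrated with a short sketch. The bracketed inline tag format used here is an illustrative assumption; the actual MNV-17 annotation scheme may differ.

```python
import re

# Assumed NV tag format for illustration: NVs written inline as bracketed tags, e.g. "[laughter]".
NV_TAG = re.compile(r"\[([^\]]+)\]")

def nv_exact_match(reference: str, hypothesis: str) -> bool:
    """Strict exact match: the NV tag sequences (type, count, and order) must be identical."""
    return NV_TAG.findall(reference) == NV_TAG.findall(hypothesis)

def nv_accuracy(refs, hyps):
    """Fraction of utterances whose NV tag sequence is reproduced exactly."""
    matches = sum(nv_exact_match(r, h) for r, h in zip(refs, hyps))
    return matches / len(refs) if refs else 0.0

# Order matters, so swapping two NVs counts as an error.
print(nv_exact_match("我[laughter]真的没想到[sigh]", "我[laughter]真的没想到[sigh]"))  # True
print(nv_exact_match("我[laughter]真的没想到[sigh]", "我[sigh]真的没想到[laughter]"))  # False
```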
Dataset Characteristics
MNV-17 Dataset Advantages
- Performative Recording: Avoids the ambiguity of NVs in spontaneous speech and ensures high-quality annotation.
- Class Balance: 17 NV categories with balanced distribution (max/min ratio only 2.7).
- Speaker Diversity: 49 native Mandarin speakers from various regions.
- Rich Context: NVs naturally embedded in semantically rich sentences.
Design Innovation
- Scripted Approach: LLM-generated contexts ensure that NVs appear in semantically plausible sentences.
- Multi-NV Combinations: Supports random combinations of 1-3 NVs, simulating realistic scenarios.
- Speaker-Independent Split: A strict train/validation/test division by speaker ensures that evaluation measures generalization (a minimal split sketch follows this list).
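A minimal sketch of the speaker-independent protocol follows. The `"speaker"` field name and the split ratios are illustrative assumptions, not the dataset's actual metadata schema; the point is only that no speaker appears in more than one split.

```python
import random
from collections import defaultdict

def speaker_independent_split(records, ratios=(0.8, 0.1, 0.1), seed=17):
    """Split utterance records so that each speaker lands in exactly one split."""
    by_speaker = defaultdict(list)
    for r in records:
        by_speaker[r["speaker"]].append(r)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = int(len(speakers) * ratios[0])
    n_val = int(len(speakers) * ratios[1])
    train_spk = speakers[:n_train]
    val_spk = speakers[n_train:n_train + n_val]
    test_spk = speakers[n_train + n_val:]

    pick = lambda spks: [r for s in spks for r in by_speaker[s]]
    return pick(train_spk), pick(val_spk), pick(test_spk)
```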
Model tree for kiiic/MNV-17-Qwen-fintune
- Base model: Qwen/Qwen2-Audio-7B-Instruct