Improve model card: add library_name, pipeline_tag and link to paper

Hi! I'm Niels, part of the community science team at Hugging Face.

This PR improves the model card for `MOSS-VoiceGenerator` by:
- Adding `library_name: transformers` to the metadata, which enables the automated "Use in Transformers" button/code snippet.
- Setting `pipeline_tag: text-to-speech` explicitly in the metadata.
- Updating the arXiv badge, which previously read "Coming soon", to link directly to the research paper: [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://huggingface.co/papers/2602.10934).
- Adding a BibTeX citation for the paper.
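
For anyone who wants to apply the same metadata change to other model cards, here is a minimal sketch of the front-matter edit in plain Python. The function name `set_front_matter_keys` is hypothetical and not part of any library, and the sketch assumes the front matter uses simple top-level `key: value` lines like this README does. For anything more complex, the `ModelCard` class in `huggingface_hub` is likely the more robust route.

```python
def set_front_matter_keys(readme: str, updates: dict[str, str]) -> str:
    """Set top-level scalar keys in a model card's YAML front matter.

    Hypothetical helper for illustration; assumes simple `key: value` lines.
    """
    lines = readme.split("\n")
    if not lines or lines[0].strip() != "---":
        # No front matter yet: create one above the existing content.
        header = ["---"] + [f"{k}: {v}" for k, v in updates.items()] + ["---"]
        return "\n".join(header + lines)

    # Locate the closing delimiter of the front matter.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("front matter is not closed by a '---' line")

    body = lines[1:end]
    remaining = dict(updates)
    for i, line in enumerate(body):
        # Top-level keys start in column 0; skip blanks, list items,
        # indented (nested) lines, and comments.
        if line[:1] in ("", " ", "\t", "-", "#") or ":" not in line:
            continue
        key = line.split(":", 1)[0].strip()
        if key in remaining:
            # Key already present: overwrite its value in place.
            body[i] = f"{key}: {remaining.pop(key)}"

    # Keys that were not present are appended before the closing delimiter.
    body += [f"{k}: {v}" for k, v in remaining.items()]
    return "\n".join(["---"] + body + lines[end:])


card = "---\nlicense: apache-2.0\ntags:\n- text-to-speech\n---\n# MOSS-TTS Family\n"
updated = set_front_matter_keys(
    card,
    {"pipeline_tag": "text-to-speech", "library_name": "transformers"},
)
print(updated)
```

The sketch appends new keys at the end of the front matter rather than after `license`, as this PR does; YAML mappings are unordered, so the Hub should read both layouts the same way.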

Please feel free to merge if this looks good!

Files changed (1)

README.md +22 -3

@@ -1,8 +1,11 @@
 ---
 license: apache-2.0
 tags:
 - text-to-speech
 ---
 # MOSS-TTS Family
 <br>
@@ -18,7 +21,7 @@ tags:
   <a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
   <a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
   <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
-  <a href="https://github.com/OpenMOSS/MOSS-TTS"><img src="https://img.shields.io/badge/Arxiv-Coming%20soon-red?logo=arxiv&amp"></a>
   <a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
   <a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
@@ -27,7 +30,7 @@ tags:
 </div>
 ## Overview
-MOSS‑TTS Family is an open‑source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is designed for **high‑fidelity**, **high‑expressiveness**, and **complex real‑world scenarios**, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.
 ## Introduction
@@ -185,7 +188,7 @@ text1="哎呀，我的老腰啊，这年纪大了就是不行了。"
 instruction1="疲惫沙哑的老年声音缓慢抱怨，带有轻微呻吟。"
 text2="亲爱的观众们，今天我要为大家做一道传说中的龙须面，这道面条细如发丝，需要极其精湛的手艺才能制作成功，请大家仔细观看我的每一个动作。"
-instruction2="热情的美食节目主持人，语调生动活泼，充满对美食的热爱和专业精神。"
 text3="Hey there, stranger! What brings you to our humble town? Looking for a good drink or a tall tale?"
 instruction3="Hearty, jovial tavern owner's voice, loud and welcoming with a slightly gruff, friendly tone in American English, radiating warmth and hospitality."
@@ -264,3 +267,19 @@ MOSS Voice Generator demonstrates significant advantages in subjective evaluatio
 <p align="center">
   <img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_voiceGenerator_winrate" width="85%" />
 </p>

 ---
 license: apache-2.0
+pipeline_tag: text-to-speech
+library_name: transformers
 tags:
 - text-to-speech
 ---
 # MOSS-TTS Family
 <br>
   <a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
   <a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
   <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
+  <a href="https://huggingface.co/papers/2602.10934"><img src="https://img.shields.io/badge/Arxiv-2602.10934-red?logo=arxiv&amp"></a>
   <a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
   <a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
 </div>
 ## Overview
+MOSS‑TTS Family is an open‑source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is designed for **high‑fidelity**, **high‑expressiveness**, and **complex real‑world scenarios**, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS. It leverages the technology presented in the paper [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://huggingface.co/papers/2602.10934).
 ## Introduction
 instruction1="疲惫沙哑的老年声音缓慢抱怨，带有轻微呻吟。"
 text2="亲爱的观众们，今天我要为大家做一道传说中的龙须面，这道面条细如发丝，需要极其精湛的手艺才能制作成功，请大家仔细观看我的每一个动作。"
+instruction2="热情的美食节目主持人，语调生动活泼，充满对美食的热爱和专业精神。"
 text3="Hey there, stranger! What brings you to our humble town? Looking for a good drink or a tall tale?"
 instruction3="Hearty, jovial tavern owner's voice, loud and welcoming with a slightly gruff, friendly tone in American English, radiating warmth and hospitality."
 <p align="center">
   <img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_voiceGenerator_winrate" width="85%" />
 </p>
+## Citation
+If you use this model or the CAT architecture in your research, please cite:
+```bibtex
+@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
+      title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
+      author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
+      year={2026},
+      eprint={2602.10934},
+      archivePrefix={arXiv},
+      primaryClass={cs.SD},
+      url={https://arxiv.org/abs/2602.10934},
+}
+```