---
language: es
datasets:
- espnet/yodas
tags:
- audio
- automatic-speech-recognition
- spanish
- parakeet-rnnt-1.1b
- NeMo
- Projecte-ALIA
- Barcelona-Supercomputing-Center
- BSC
- EuroHPC
license: apache-2.0
model-index:
- name: spanish-verification-model-pkt-d
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 Spanish (Test)
      type: mozilla-foundation/common_voice_17_0
      split: test
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 7.909
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 Spanish (Dev)
      type: mozilla-foundation/common_voice_17_0
      split: dev
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 7.022
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech Spanish (Test)
      type: facebook/multilingual_librispeech
      split: test
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 6.077
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech Spanish (Dev)
      type: facebook/multilingual_librispeech
      split: dev
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 6.005
library_name: nemo
---
# spanish-verification-model-pkt-d

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Paper](#paper)
- [Model Summary](#model-summary)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>

<!--
## Paper

**PDF:** [INDEPENDENT ASR ENSEMBLES FOR LOW-COST TRANSCRIPT VALIDATION: A MID-RESOURCE PROTOCOL ON 8,000 HOURS OF YODAS](https://2026.ieeeicassp.org/)
-->

## Model Summary

We define verification models as ASR models specifically designed to assess the reliability of transcriptions. They are particularly useful when no reference transcription is available, as they can generate hypotheses with a quantifiable degree of confidence.

The core idea behind verification models is to train or fine-tune two or more ASR models on different datasets. If these models produce identical transcriptions for the same audio input, the result is likely to be accurate. Furthermore, if a verification model agrees with an existing reference transcription, that agreement can also be interpreted as a signal of reliability.
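The agreement check itself can be as simple as comparing text-normalized hypotheses. A minimal sketch follows; the normalization rules (lowercasing, punctuation stripping, whitespace collapsing) are illustrative assumptions, not the exact pipeline used for these models:

```python
import string
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = unicodedata.normalize("NFC", text).lower()
    # Also drop Spanish inverted punctuation, which is not in string.punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation + "¿¡"))
    return " ".join(text.split())

def hypotheses_agree(hyp_a: str, hyp_b: str) -> bool:
    """Two verification models 'agree' when their normalized outputs match exactly."""
    return normalize(hyp_a) == normalize(hyp_b)

print(hypotheses_agree("¿Cómo estás?", "cómo estás"))   # True
print(hypotheses_agree("Hola mundo", "Hola a todos"))   # False
```

Stricter or looser variants (e.g., accepting hypotheses within a small edit distance instead of requiring exact equality) trade precision of the reliability signal against the amount of data retained.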

In this model card, we present Verification Model D for Spanish, available as ["spanish-verification-model-pkt-d"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-d). This acoustic model is based on ["nvidia/parakeet-rnnt-1.1b"](https://huggingface.co/nvidia/parakeet-rnnt-1.1b) and is designed for Automatic Speech Recognition in Spanish. It is intended to be used in tandem with Verification Model C, ["spanish-verification-model-pkt-c"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-c), to enable cross-verification and boost transcription confidence in unannotated or weakly supervised scenarios. These models can also be used together with Model A ["spanish-verification-model-pkt-a"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-a) and Model B ["spanish-verification-model-pkt-b"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-b) for better results.

## Intended Uses and Limitations

This model is designed for the following scenarios:

* Verification of transcriptions: When two or more verification models produce the same output for a given audio segment, the transcription can be considered highly reliable. This is particularly useful in low-resource or weakly supervised settings.

* Transcription without references: In situations where no reference transcription exists, this model can still produce a hypothesis that, when corroborated by a second verification model, may be considered trustworthy.

* Data filtering and quality control: It can be used to automatically detect and retain high-confidence segments in large-scale speech datasets (e.g., for training or evaluation purposes).

* Human-in-the-loop workflows: These models can assist human annotators by flagging reliable transcriptions, helping reduce manual verification time.

As limitations, we identify the following:

* No ground-truth guarantee: Agreement between models does not guarantee correctness; it only increases the likelihood that a transcription is reliable.

* Domain sensitivity: Accuracy and agreement rates may drop on speech that differs significantly from the training domain (e.g., different accents, topics, or recording conditions).

* Designed for pairwise comparison: This model is intended to work in conjunction with at least one other verification model. Using it in isolation provides no verification benefit.

* Language- and model-specific: This particular model is optimized for Spanish and based on the Parakeet RNNT architecture. Performance for other languages or with different acoustic models may vary significantly.
125
+ ## How to Get Started with the Model
126
+
127
+ To see an updated and functional version of this code, please visit NVIDIA's official [repository](https://huggingface.co/nvidia/parakeet-rnnt-1.1b)
128
+
129
+ ### Installation
130
+
131
+ To use this model, you may install the [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo):
132
+
133
+ Create a virtual environment:
134
+ ```bash
135
+ python -m venv /path/to/venv
136
+ ```
137
+ Activate the environment:
138
+ ```bash
139
+ source /path/to/venv/bin/activate
140
+ ```
141
+ Install the modules:
142
+ ```bash
143
+ BRANCH = 'main'
144
+ python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
145
+ ```

### For Inference
To transcribe Spanish audio with this model, you can follow this example:

```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned Parakeet RNNT model from the Hugging Face Hub
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="BSC-LT/spanish-verification-model-pkt-d")

# Transcribe one or more WAV files (16 kHz mono is the usual NeMo input format)
output = asr_model.transcribe(['YOUR_WAV_FILE.wav'])
print(output[0].text)
```

## Training Details

### Training data

The training data for Model D consists of **1,500 hours of Spanish speech** extracted from the [YODAS](https://huggingface.co/datasets/espnet/yodas) dataset.

To ensure high-quality supervision, we applied a double-consensus filtering strategy: we kept only those utterances for which the output of **Model A** ["spanish-verification-model-pkt-a"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-a) and the output of **Model B** ["spanish-verification-model-pkt-b"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-b) were **identical**.

This approach allowed us to minimize noisy or ambiguous transcriptions while maintaining a large and diverse pool of training material.
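In sketch form, the double-consensus step keeps an utterance only when both models emit the same string, which then becomes the training label. The field names below are illustrative assumptions, not the actual data schema:

```python
def double_consensus_filter(utterances):
    """Keep only utterances where Model A and Model B transcriptions are identical.

    Each utterance is a dict with (assumed) keys: 'audio_id', 'hyp_a', 'hyp_b'.
    """
    kept = []
    for utt in utterances:
        if utt["hyp_a"] == utt["hyp_b"]:
            # Both models agree, so the shared hypothesis becomes the label.
            kept.append({"audio_id": utt["audio_id"], "text": utt["hyp_a"]})
    return kept

batch = [
    {"audio_id": "yodas_0001", "hyp_a": "buenos días", "hyp_b": "buenos días"},
    {"audio_id": "yodas_0002", "hyp_a": "hasta luego", "hyp_b": "hasta pronto"},
]
print(double_consensus_filter(batch))  # only yodas_0001 survives
```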
169
+
170
+ ### Training procedure
171
+
172
+ This model is the result of finetuning the model ["parakeet-rnnt-1.1b"](https://huggingface.co/nvidia/parakeet-rnnt-1.1b) by following this [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb)
173
+
174
+ ### Training Hyperparameters
175
+
176
+ * language: Spanish
177
+ * hours of training audio: 1500
178
+ * learning rate: 2e-4
179
+ * devices=4
180
+ * num_nodes=8
181
+ * batch_size=8
182
+ * accelerator=accelerator
183
+ * strategy="ddp"
184
+ * max_epochs=20
185
+ * enable_checkpointing=True
186
+ * logger=False
187
+ * log_every_n_steps=100
188
+ * check_val_every_n_epoch=1
189
+ * precision='bf16-mixed'
190
+ * callbacks=[checkpoint_callback]
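The `key=value` entries above read like PyTorch Lightning `Trainer` arguments, which is how NeMo configures training. A hedged sketch of how the `Trainer`-shaped entries would be assembled (learning rate and batch size live in the model/data config instead; the `accelerator` value is an assumption, since the card lists it only symbolically):

```python
# Hypothetical reconstruction of the Trainer configuration listed above.
# 'accelerator' is assumed to be "gpu"; the card writes 'accelerator=accelerator'.
trainer_kwargs = dict(
    devices=4,
    num_nodes=8,
    accelerator="gpu",
    strategy="ddp",
    max_epochs=20,
    enable_checkpointing=True,
    logger=False,
    log_every_n_steps=100,
    check_val_every_n_epoch=1,
    precision="bf16-mixed",
)
# With NeMo / PyTorch Lightning installed, this would feed into something like:
#   import pytorch_lightning as pl
#   trainer = pl.Trainer(**trainer_kwargs, callbacks=[checkpoint_callback])
print(trainer_kwargs["devices"] * trainer_kwargs["num_nodes"])  # 32 GPUs in total
```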
191
+
192
+ ## Citation
193
+ If this model contributes to your research, please cite the work:
194
+
195
+ ```bibtex
196
+ @misc{bsc-esvermodel-pkt-c-2025,
197
+ title={Spanish Verification Model Parakeet D},
198
+ author={Hernandez Mena, Carlos Daniel; España-Bonet, Cristina},
199
+ organization={Barcelona Supercomputing Center},
200
+ url={https://huggingface.co/BSC-LT/spanish-verification-model-pkt-d},
201
+ year={2025}
202
+ }
203
+ ```

## Additional Information

### Author

The fine-tuning process was carried out in August 2025 at the [Language Technologies Laboratory](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena), supervised by [Cristina España-Bonet](https://huggingface.co/cristinae).

### Contact
For further information, please email <[email protected]>.

### Copyright
Copyright (c) 2025 by the Language Technologies Laboratory, Barcelona Supercomputing Center.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, funded by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos [ALIA](https://alia.gob.es/).

The training of the model was possible thanks to the computing time provided by the [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.

We acknowledge the [EuroHPC Joint Undertaking](https://www.eurohpc-ju.europa.eu/index_en) for awarding us access to MareNostrum 5, hosted at BSC, Spain.