---
language: es
datasets:
- espnet/yodas
tags:
- audio
- automatic-speech-recognition
- spanish
- parakeet-rnnt-1.1b
- NeMo
- Projecte-ALIA
- Barcelona-Supercomputing-Center
- BSC
- EuroHPC
license: apache-2.0
model-index:
- name: spanish-verification-model-pkt-d
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 Spanish (Test)
      type: mozilla-foundation/common_voice_17_0
      split: test
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 7.909
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 17.0 Spanish (Dev)
      type: mozilla-foundation/common_voice_17_0
      split: dev
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 7.022
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech Spanish (Test)
      type: facebook/multilingual_librispeech
      split: test
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 6.077
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech Spanish (Dev)
      type: facebook/multilingual_librispeech
      split: dev
      args:
        language: es
    metrics:
    - name: WER
      type: wer
      value: 6.005
library_name: nemo
---
# spanish-verification-model-pkt-d

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Paper](#paper)
- [Model Summary](#model-summary)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>

<!--
## Paper

**PDF:** [INDEPENDENT ASR ENSEMBLES FOR LOW-COST TRANSCRIPT VALIDATION: A MID-RESOURCE PROTOCOL ON 8,000 HOURS OF YODAS](https://2026.ieeeicassp.org/)
-->

## Model Summary

We define verification models as ASR models specifically designed to assess the reliability of transcriptions. They are particularly useful when no reference transcription is available, as they can generate hypotheses with a quantifiable degree of confidence.

The core idea behind verification models is to train or fine-tune two or more ASR models on different datasets. If these models produce identical transcriptions for the same audio input, the result is likely to be accurate. Furthermore, if a verification model agrees with an existing reference transcription, that agreement can also be interpreted as a signal of reliability.
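The agreement check itself can be as simple as comparing text-normalized hypotheses. A minimal sketch follows; the normalization rules (lowercasing, punctuation stripping, whitespace collapsing) are illustrative assumptions, not the exact pipeline used for these models:

```python
import string
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = unicodedata.normalize("NFC", text).lower()
    # Also drop Spanish inverted punctuation, which is not in string.punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation + "¿¡"))
    return " ".join(text.split())

def hypotheses_agree(hyp_a: str, hyp_b: str) -> bool:
    """Two verification models 'agree' when their normalized outputs match exactly."""
    return normalize(hyp_a) == normalize(hyp_b)

print(hypotheses_agree("¿Cómo estás?", "cómo estás"))   # True
print(hypotheses_agree("Hola mundo", "Hola a todos"))   # False
```

Stricter or looser variants (e.g., accepting hypotheses within a small edit distance instead of requiring exact equality) trade precision of the reliability signal against the amount of data retained.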

In this model card, we present Verification Model D for Spanish, available as ["spanish-verification-model-pkt-d"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-d). This acoustic model is based on ["nvidia/parakeet-rnnt-1.1b"](https://huggingface.co/nvidia/parakeet-rnnt-1.1b) and is designed for Automatic Speech Recognition in Spanish. It is intended to be used in tandem with Verification Model C, ["spanish-verification-model-pkt-c"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-c), to enable cross-verification and boost transcription confidence in unannotated or weakly supervised scenarios. These models can also be used together with Model A ["spanish-verification-model-pkt-a"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-a) and Model B ["spanish-verification-model-pkt-b"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-b) for better results.

## Intended Uses and Limitations

This model is designed for the following scenarios:

* Verification of transcriptions: When two or more verification models produce the same output for a given audio segment, the transcription can be considered highly reliable. This is particularly useful in low-resource or weakly supervised settings.

* Transcription without references: In situations where no reference transcription exists, this model can still produce a hypothesis that, when corroborated by a second verification model, may be considered trustworthy.

* Data filtering and quality control: It can be used to automatically detect and retain high-confidence segments in large-scale speech datasets (e.g., for training or evaluation purposes).

* Human-in-the-loop workflows: These models can assist human annotators by flagging reliable transcriptions, helping reduce manual verification time.

As limitations, we identify the following:

* No ground-truth guarantee: Agreement between models does not guarantee correctness; it only increases the likelihood that a transcription is reliable.

* Domain sensitivity: Accuracy and agreement rates may drop on speech that differs significantly from the training domain (e.g., different accents, topics, or recording conditions).

* Designed for pairwise comparison: This model is intended to work in conjunction with at least one other verification model. Using it in isolation provides no verification benefit.

* Language- and model-specific: This particular model is optimized for Spanish and based on the Parakeet RNNT architecture. Performance for other languages or with different acoustic models may vary significantly.
125
+ ## How to Get Started with the Model
126
+
127
+ To see an updated and functional version of this code, please visit NVIDIA's official [repository](https://huggingface.co/nvidia/parakeet-rnnt-1.1b)
128
+
129
+ ### Installation
130
+
131
+ To use this model, you may install the [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo):
132
+
133
+ Create a virtual environment:
134
+ ```bash
135
+ python -m venv /path/to/venv
136
+ ```
137
+ Activate the environment:
138
+ ```bash
139
+ source /path/to/venv/bin/activate
140
+ ```
141
+ Install the modules:
142
+ ```bash
143
+ BRANCH = 'main'
144
+ python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
145
+ ```

### For Inference
To transcribe Spanish audio with this model, you can follow this example:

```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned Parakeet RNNT model from the Hugging Face Hub
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="BSC-LT/spanish-verification-model-pkt-d")

# Transcribe one or more WAV files (16 kHz mono is the usual NeMo input format)
output = asr_model.transcribe(['YOUR_WAV_FILE.wav'])
print(output[0].text)
```

## Training Details

### Training data

The training data for Model D consists of **1,500 hours of Spanish speech** extracted from the [YODAS](https://huggingface.co/datasets/espnet/yodas) dataset.

To ensure high-quality supervision, we applied a double-consensus filtering strategy: we kept only those utterances for which the output of **Model A** ["spanish-verification-model-pkt-a"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-a) and the output of **Model B** ["spanish-verification-model-pkt-b"](https://huggingface.co/BSC-LT/spanish-verification-model-pkt-b) were **identical**.

This approach allowed us to minimize noisy or ambiguous transcriptions while maintaining a large and diverse pool of training material.
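In sketch form, the double-consensus step keeps an utterance only when both models emit the same string, which then becomes the training label. The field names below are illustrative assumptions, not the actual data schema:

```python
def double_consensus_filter(utterances):
    """Keep only utterances where Model A and Model B transcriptions are identical.

    Each utterance is a dict with (assumed) keys: 'audio_id', 'hyp_a', 'hyp_b'.
    """
    kept = []
    for utt in utterances:
        if utt["hyp_a"] == utt["hyp_b"]:
            # Both models agree, so the shared hypothesis becomes the label.
            kept.append({"audio_id": utt["audio_id"], "text": utt["hyp_a"]})
    return kept

batch = [
    {"audio_id": "yodas_0001", "hyp_a": "buenos días", "hyp_b": "buenos días"},
    {"audio_id": "yodas_0002", "hyp_a": "hasta luego", "hyp_b": "hasta pronto"},
]
print(double_consensus_filter(batch))  # only yodas_0001 survives
```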
169
+
170
+ ### Training procedure
171
+
172
+ This model is the result of finetuning the model ["parakeet-rnnt-1.1b"](https://huggingface.co/nvidia/parakeet-rnnt-1.1b) by following this [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb)
173
+
174
+ ### Training Hyperparameters
175
+
176
+ * language: Spanish
177
+ * hours of training audio: 1500
178
+ * learning rate: 2e-4
179
+ * devices=4
180
+ * num_nodes=8
181
+ * batch_size=8
182
+ * accelerator=accelerator
183
+ * strategy="ddp"
184
+ * max_epochs=20
185
+ * enable_checkpointing=True
186
+ * logger=False
187
+ * log_every_n_steps=100
188
+ * check_val_every_n_epoch=1
189
+ * precision='bf16-mixed'
190
+ * callbacks=[checkpoint_callback]
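The `key=value` entries above read like PyTorch Lightning `Trainer` arguments, which is how NeMo configures training. A hedged sketch of how the `Trainer`-shaped entries would be assembled (learning rate and batch size live in the model/data config instead; the `accelerator` value is an assumption, since the card lists it only symbolically):

```python
# Hypothetical reconstruction of the Trainer configuration listed above.
# 'accelerator' is assumed to be "gpu"; the card writes 'accelerator=accelerator'.
trainer_kwargs = dict(
    devices=4,
    num_nodes=8,
    accelerator="gpu",
    strategy="ddp",
    max_epochs=20,
    enable_checkpointing=True,
    logger=False,
    log_every_n_steps=100,
    check_val_every_n_epoch=1,
    precision="bf16-mixed",
)
# With NeMo / PyTorch Lightning installed, this would feed into something like:
#   import pytorch_lightning as pl
#   trainer = pl.Trainer(**trainer_kwargs, callbacks=[checkpoint_callback])
print(trainer_kwargs["devices"] * trainer_kwargs["num_nodes"])  # 32 GPUs in total
```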
191
+
192
+ ## Citation
193
+ If this model contributes to your research, please cite the work:
194
+
195
+ ```bibtex
196
+ @misc{bsc-esvermodel-pkt-c-2025,
197
+ title={Spanish Verification Model Parakeet D},
198
+ author={Hernandez Mena, Carlos Daniel; España-Bonet, Cristina},
199
+ organization={Barcelona Supercomputing Center},
200
+ url={https://huggingface.co/BSC-LT/spanish-verification-model-pkt-d},
201
+ year={2025}
202
+ }
203
+ ```

## Additional Information

### Author

The fine-tuning process was carried out in August 2025 at the [Language Technologies Laboratory](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena), supervised by [Cristina España-Bonet](https://huggingface.co/cristinae).

### Contact
For further information, please email <[email protected]>.

### Copyright
Copyright (c) 2025 by the Language Technologies Laboratory, Barcelona Supercomputing Center.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, funded by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos [ALIA](https://alia.gob.es/).

The training of the model was possible thanks to the computing time provided by the [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.

We acknowledge the [EuroHPC Joint Undertaking](https://www.eurohpc-ju.europa.eu/index_en) for awarding us access to MareNostrum 5, hosted at BSC, Spain.