Submitted to LREC 2026

## Model Description

BERnaT is a family of monolingual Basque encoder-only language models trained to better represent linguistic variation—including standard, dialectal, historical, and informal Basque—rather than focusing solely on standard textual corpora. The models were trained on corpora that combine high-quality standard Basque with varied sources such as social media and historical texts, aiming to enhance robustness and generalization across natural language understanding (NLU) tasks.

- **Languages**: Basque (Euskara)
## Getting Started

You can either use this model directly, as in the example below, or fine-tune it on your task of interest.

```python
>>> from transformers import pipeline
>>> pipe = pipeline("fill-mask", model='HiTZ/BERnaT-base')
>>> pipe("Kaixo! Ni <mask> naiz!")
[{'score': 0.022003261372447014,
  'token': 7497,
  'token_str': ' euskalduna',
  'sequence': 'Kaixo! Ni euskalduna naiz!'},
 {'score': 0.016429167240858078,
  'token': 14067,
  'token_str': ' Olentzero',
  'sequence': 'Kaixo! Ni Olentzero naiz!'},
 {'score': 0.012804778292775154,
  'token': 31087,
  'token_str': ' ahobizi',
  'sequence': 'Kaixo! Ni ahobizi naiz!'},
 {'score': 0.01173020526766777,
  'token': 331,
  'token_str': ' ez',
  'sequence': 'Kaixo! Ni ez naiz!'},
 {'score': 0.010091394186019897,
  'token': 7618,
  'token_str': ' irakaslea',
  'sequence': 'Kaixo! Ni irakaslea naiz!'}]
```
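The `score` values returned by the fill-mask pipeline are softmax probabilities over the vocabulary at the masked position. A minimal, self-contained sketch of that ranking step (dummy logits and an illustrative vocabulary fragment, not the real BERnaT vocabulary or its token IDs):

```python
import math

def top_k_fill_mask(logits, id_to_token, k=5):
    """Softmax over vocabulary logits at the masked position, then top-k.

    Mirrors the shape of the fill-mask output above; inputs are
    illustrative, not taken from the actual model.
    """
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax probabilities
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"score": probs[i], "token": i, "token_str": id_to_token[i]}
            for i in ranked[:k]]

# Assumed vocabulary fragment and logits, for illustration only.
vocab = {0: " euskalduna", 1: " ez", 2: " irakaslea", 3: " Olentzero"}
preds = top_k_fill_mask([2.0, 0.5, 0.1, 1.0], vocab, k=2)
print(preds[0]["token_str"])  # highest-probability candidate
```

Each entry thus pairs a vocabulary index (`token`) with its decoded surface form (`token_str`) and probability, exactly as in the pipeline output shown above.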
## Training Data

The BERnaT family was pre-trained on a combination of: