# la_vectors_floret_md
Floret word vectors for Latin (medium, 50k hash buckets, 300 dimensions).
Part of the LatinCy project: pretrained NLP pipelines for Latin.
## Overview
| Feature | Value |
|---|---|
| Type | Floret (hash-based subword embeddings) |
| Dimensions | 300 |
| Hash buckets | 50,000 |
| Algorithm | CBOW |
| Language | Latin (la) |
| spaCy version | >=3.8.0,<3.9.0 |
| License | MIT |
Floret vectors use hash-based subword embeddings, meaning every word gets a vector; there are no out-of-vocabulary words. This is especially important for a morphologically rich language like Latin, where a single lemma can surface in many inflected forms.
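The idea can be sketched in a few lines of NumPy. This is an illustrative simplification with invented names (floret's actual hashing and bucket-combination scheme differ), not the library's implementation; it only shows why an unseen word still gets a vector:

```python
import numpy as np

DIM, BUCKETS = 300, 50_000  # matches this model's configuration
rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM))  # stand-in for the trained bucket table

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams with boundary markers, as floret/fastText use (range 3-6 here)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1) for i in range(len(w) - n + 1)]

def vector(word):
    """Hash each n-gram into a bucket and average the bucket rows."""
    ids = [hash(ng) % BUCKETS for ng in char_ngrams(word)]
    return table[ids].mean(axis=0)

# Any string, including an unseen inflected form, maps to a 300-d vector
print(vector("regibus").shape)  # (300,)
```

Because lookup goes through character n-grams rather than a fixed vocabulary, the table stays small (50k rows) while covering arbitrary input.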
## Installation
```bash
pip install https://huggingface.co/latincy/la_vectors_floret_md/resolve/main/la_vectors_floret_md-3.9.0-py3-none-any.whl
```
## Usage
```python
import spacy

nlp = spacy.load("la_vectors_floret_md")

# Get word vectors
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.vector[:5])

# Compute similarity
doc1 = nlp("bellum")
doc2 = nlp("pugna")
print(doc1.similarity(doc2))
```
These vectors are primarily intended as a component in LatinCy pipelines (la_core_web_md, la_core_web_lg), but can also be used standalone.
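For standalone use without the packaged wheel, a vectors-only spaCy pipeline can be built from a raw floret export with spaCy's `init vectors` command. This is a command sketch; the input and output paths are placeholders:

```shell
# Import floret-format vectors into a blank Latin pipeline (paths are placeholders)
python -m spacy init vectors la la_vectors.floret ./la_vectors_floret_md --mode floret
```

The `--mode floret` flag tells spaCy to keep the subword hashing at lookup time instead of treating the file as a fixed word-vector table.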
## Evaluation
Evaluated on curated Latin benchmarks (1,383 analogy items across 11 categories, 2,728 odd-one-out items). All models were trained on the same LatinCy corpus.
| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
|---|---|---|---|
| LatinCy Floret v3.9 (md) | 79.1% | 91.5% | 61.1% |
| LatinCy Floret v3.8 (md) | 77.4% | 91.5% | 60.2% |
| LatinCy FastText CBOW-300 | 79.9% | 92.3% | 57.7% |
Floret is competitive with FastText on analogies and stronger on semantic clustering (odd-one-out), while being 6x smaller and assigning a vector to arbitrary vocabulary. For these reasons, the LatinCy pipelines are trained with the Floret vectors.
## Training
### Corpus
Trained on 13.7M sentences (437M tokens) from:
| Source | Description |
|---|---|
| UD Latin treebanks (5) | CIRCSE/Perseus/PROIEL/LLCT/UDante |
| Latin Wikisource | General Latin texts |
| CAMENA Neo-Latin | Early modern Latin texts |
| Patrologia Latina | Patristic texts |
| Perseus Digital Library | Classical texts |
| CLTK-Tesserae Latin | Classical texts |
| CC100-Latin | Web-crawled Latin text (deduplicated and filtered) |
| Latin Wikipedia | Latin Wikipedia articles |
| The Latin Library | General Latin texts |
### Parameters
| Parameter | Value |
|---|---|
| Algorithm | CBOW |
| Dimensions | 300 |
| Subword n-gram range | 3–6 |
| Hash buckets | 50,000 |
| Epochs | 15 |
| Negative sampling | 25 |
| Min count | 50 |
| Learning rate | 0.05 |
Training followed Sprugnoli et al. (2019) for the epoch count and negative-sampling parameters.
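The parameters above map onto floret's fastText-style CLI roughly as follows. This is a command sketch, not the project's actual training script: the file paths are placeholders, and the `-hashCount` value is an assumption not stated in this card:

```shell
# Sketch of a floret training run with this card's parameters (paths are placeholders)
floret cbow -input latin_corpus.txt -output la_floret_md \
    -mode floret -bucket 50000 -dim 300 \
    -minn 3 -maxn 6 -epoch 15 -neg 25 -minCount 50 -lr 0.05 \
    -hashCount 2  # assumption: hashes per n-gram is not documented here
```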
## Citation
If you use these vectors, please cite this preprint:
```bibtex
@misc{burns2023latincy,
  title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
  author = "Burns, Patrick J.",
  year = "2023",
  eprint = "2305.04365",
  archivePrefix = "arXiv",
  primaryClass = "cs.CL",
  url = "https://arxiv.org/abs/2305.04365"
}
```
## See also
- la_vectors_floret_lg: large vectors (200k hash buckets)
- LatinCy pipelines: Latin NLP pipelines for spaCy that use these vectors
## References
- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin." In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.