la_vectors_floret_md

Floret word vectors for Latin (medium, 50k hash buckets, 300 dimensions).

Part of the LatinCy project: pretrained NLP pipelines for Latin.

Overview

| Feature | Value |
| --- | --- |
| Type | Floret (hash-based subword embeddings) |
| Dimensions | 300 |
| Hash buckets | 50,000 |
| Algorithm | CBOW |
| Language | Latin (la) |
| spaCy version | >=3.8.0,<3.9.0 |
| License | MIT |

Floret vectors use hash-based subword embeddings, meaning every word gets a vector; there are no out-of-vocabulary words. This is especially important for morphologically rich languages like Latin.
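Conceptually, each word is decomposed into character n-grams (3–6 for this model), each n-gram is hashed into one of the 50,000 bucket rows, and the word vector is the mean of those rows. The sketch below illustrates the idea in plain numpy; the hash function and the random table are stand-ins for illustration, not floret's actual implementation:

```python
import numpy as np

NUM_BUCKETS = 50_000  # hash buckets, as in this model
DIM = 300             # vector dimensions, as in this model
rng = np.random.default_rng(0)
table = rng.standard_normal((NUM_BUCKETS, DIM))  # stand-in bucket table

def ngrams(word, lo=3, hi=6):
    # Boundary markers, as in fastText/floret, so prefixes and
    # suffixes hash differently from word-internal n-grams.
    w = f"<{word}>"
    return [w[i:i + n] for n in range(lo, hi + 1)
            for i in range(len(w) - n + 1)]

def vector(word):
    # Every n-gram maps to *some* bucket, so every string gets a vector.
    rows = [table[hash(g) % NUM_BUCKETS] for g in ngrams(word)]
    return np.mean(rows, axis=0)

# Even an unseen inflected form gets a 300-d vector -- no OOV.
print(vector("regibus").shape)  # (300,)
```

Because unrelated words share few n-grams, their bucket averages differ, while inflected forms of one lemma overlap heavily and land near each other.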

Installation

```bash
pip install https://huggingface.co/latincy/la_vectors_floret_md/resolve/main/la_vectors_floret_md-3.9.0-py3-none-any.whl
```

Usage

```python
import spacy

nlp = spacy.load("la_vectors_floret_md")

# Get word vectors
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.vector[:5])

# Compute similarity
doc1 = nlp("bellum")
doc2 = nlp("pugna")
print(doc1.similarity(doc2))
```

These vectors are primarily intended as a component in LatinCy pipelines (la_core_web_md, la_core_web_lg), but can also be used standalone.
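For a vectors-only pipeline like this, `Doc.similarity` reduces to the cosine of the averaged token vectors. A minimal numpy equivalent, using stand-in vectors so it runs without the model installed:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Doc.similarity averages the token vectors, then takes the cosine.
rng = np.random.default_rng(42)
tokens_a = rng.standard_normal((3, 300))  # stand-in vectors, 3-token doc
tokens_b = rng.standard_normal((2, 300))  # stand-in vectors, 2-token doc
print(cosine(tokens_a.mean(axis=0), tokens_b.mean(axis=0)))
```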

Evaluation

Evaluated on curated Latin benchmarks (1,383 analogy items across 11 categories, 2,728 odd-one-out items). All models were trained on the same LatinCy corpus.

| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
| --- | --- | --- | --- |
| LatinCy Floret v3.9 (md) | 79.1% | 91.5% | 61.1% |
| LatinCy Floret v3.8 (md) | 77.4% | 91.5% | 60.2% |
| LatinCy FastText CBOW-300 | 79.9% | 92.3% | 57.7% |

Floret is competitive with FastText on analogies and superior on semantic clustering (odd-one-out), while being 6x smaller and supporting arbitrary vocabulary. For these reasons, the Floret vectors were chosen for LatinCy model training.
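Analogy items of this kind are conventionally scored by vector offset: for a:b :: c:?, the candidate nearest to b - a + c (query words excluded) should be the answer, and Rank 1 / Rank 5 count how often it appears in the top 1 or 5. A hedged sketch of that scoring on toy 2-d vectors; the scoring convention is the standard offset method, not necessarily the exact LatinCy evaluation script:

```python
import numpy as np

def analogy_rank(vectors, a, b, c, answer):
    """Rank of `answer` among cosine neighbours of b - a + c.
    `vectors` maps word -> unit-normalised vector; query words excluded."""
    q = vectors[b] - vectors[a] + vectors[c]
    q /= np.linalg.norm(q)
    scores = {w: float(np.dot(q, v)) for w, v in vectors.items()
              if w not in (a, b, c)}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(answer) + 1

# Toy example: the vertical offset encodes a "gender"-like direction.
toy = {w: v / np.linalg.norm(v) for w, v in {
    "rex":    np.array([1.0, 1.0]),
    "regina": np.array([1.0, -1.0]),
    "vir":    np.array([2.0, 1.0]),
    "mulier": np.array([2.0, -1.0]),
    "bellum": np.array([0.0, 3.0]),
}.items()}
print(analogy_rank(toy, "rex", "regina", "vir", "mulier"))  # 1
```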

Training

Corpus

Trained on 13.7M sentences (437M tokens) from:

| Source | Description |
| --- | --- |
| UD Latin treebanks (5) | CIRCSE/Perseus/PROIEL/LLCT/UDante |
| Latin Wikisource | General Latin texts |
| CAMENA Neo-Latin | Early modern Latin texts |
| Patrologia Latina | Patristic texts |
| Perseus Digital Library | Classical texts |
| CLTK-Tesserae Latin | Classical texts |
| CC100-Latin | Web-crawled Latin text (deduplicated and filtered) |
| Latin Wikipedia | Latin Wikipedia articles |
| The Latin Library | General Latin texts |

Parameters

| Parameter | Value |
| --- | --- |
| Algorithm | CBOW |
| Dimensions | 300 |
| Subword n-gram range | 3–6 |
| Hash buckets | 50,000 |
| Epochs | 15 |
| Negative sampling | 25 |
| Min count | 50 |
| Learning rate | 0.05 |

Training followed Sprugnoli et al. 2019 for epoch count and negative sampling parameters.
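A floret training invocation along these lines would match the parameter table above; floret shares fastText's CLI flags and adds `-mode floret`. The corpus path, output name, `-hashCount` value, and the spaCy packaging step are assumptions not stated in this card:

```shell
# Hypothetical training command matching the parameters above;
# latin_corpus.txt and la_floret_md are placeholder names.
floret cbow \
    -mode floret -hashCount 2 \
    -dim 300 -minn 3 -maxn 6 -bucket 50000 \
    -epoch 15 -neg 25 -minCount 50 -lr 0.05 \
    -input latin_corpus.txt -output la_floret_md

# Package the .floret table as spaCy vectors (assumed step):
python -m spacy init vectors la la_floret_md.floret ./la_vectors --mode floret
```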

Citation

If you use these vectors, please cite this preprint:

```bibtex
@misc{burns2023latincy,
    title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
    author = "Burns, Patrick J.",
    year = "2023",
    eprint = "2305.04365",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
    url = "https://arxiv.org/abs/2305.04365"
}
```

References

- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. “Vir is to Moderatus as Mulier is to Intemperans: Lemma Embeddings for Latin.” In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.