# la_vectors_floret_md
Floret word vectors for Latin (medium, 50k hash buckets, 300 dimensions).
Part of the LatinCy project: pretrained NLP pipelines for Latin.
## Overview
| Feature | Value |
|---|---|
| Type | Floret (hash-based subword embeddings) |
| Dimensions | 300 |
| Hash buckets | 50,000 |
| Algorithm | CBOW |
| Language | Latin (la) |
| spaCy version | >=3.8.0,<3.9.0 |
| License | MIT |
Floret vectors use hash-based subword embeddings, meaning every word gets a vector; there are no out-of-vocabulary words. This is especially important for a morphologically rich language like Latin, where a single lemma can surface in many inflected forms.
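The idea can be sketched in a few lines of NumPy. This is an illustrative simplification with invented names (floret's actual hashing and bucket-combination scheme differ), not the library's implementation; it only shows why an unseen word still gets a vector:

```python
import numpy as np

DIM, BUCKETS = 300, 50_000  # matches this model's configuration
rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM))  # stand-in for the trained bucket table

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams with boundary markers, as floret/fastText use (range 3-6 here)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1) for i in range(len(w) - n + 1)]

def vector(word):
    """Hash each n-gram into a bucket and average the bucket rows."""
    ids = [hash(ng) % BUCKETS for ng in char_ngrams(word)]
    return table[ids].mean(axis=0)

# Any string, including an unseen inflected form, maps to a 300-d vector
print(vector("regibus").shape)  # (300,)
```

Because lookup goes through character n-grams rather than a fixed vocabulary, the table stays small (50k rows) while covering arbitrary input.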
## Installation
```bash
pip install https://huggingface.co/latincy/la_vectors_floret_md/resolve/main/la_vectors_floret_md-3.9.0-py3-none-any.whl
```
## Usage
```python
import spacy

nlp = spacy.load("la_vectors_floret_md")

# Get word vectors
doc = nlp("rex populum regit")
for token in doc:
    print(token.text, token.vector[:5])

# Compute similarity
doc1 = nlp("bellum")
doc2 = nlp("pugna")
print(doc1.similarity(doc2))
```
These vectors are primarily intended as a component in LatinCy pipelines (la_core_web_md, la_core_web_lg), but can also be used standalone.
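For standalone use without the packaged wheel, a vectors-only spaCy pipeline can be built from a raw floret export with spaCy's `init vectors` command. This is a command sketch; the input and output paths are placeholders:

```shell
# Import floret-format vectors into a blank Latin pipeline (paths are placeholders)
python -m spacy init vectors la la_vectors.floret ./la_vectors_floret_md --mode floret
```

The `--mode floret` flag tells spaCy to keep the subword hashing at lookup time instead of treating the file as a fixed word-vector table.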
## Evaluation
Evaluated on curated Latin benchmarks (1,383 analogy items across 11 categories, 2,728 odd-one-out items). All models were trained on the same LatinCy corpus.
| Model | Analogy Rank 1 | Analogy Rank 5 | Odd-One-Out |
|---|---|---|---|
| LatinCy Floret v3.9 (md) | 79.1% | 91.5% | 61.1% |
| LatinCy Floret v3.8 (md) | 77.4% | 91.5% | 60.2% |
| LatinCy FastText CBOW-300 | 79.9% | 92.3% | 57.7% |
Floret is competitive with FastText on analogies and stronger on semantic clustering (odd-one-out), while being 6x smaller and assigning a vector to arbitrary vocabulary. For these reasons, the LatinCy pipelines are trained with the Floret vectors.
## Training
### Corpus
Trained on 13.7M sentences (437M tokens) from:
| Source | Description |
|---|---|
| UD Latin treebanks (5) | CIRCSE/Perseus/PROIEL/LLCT/UDante |
| Latin Wikisource | General Latin texts |
| CAMENA Neo-Latin | Early modern Latin texts |
| Patrologia Latina | Patristic texts |
| Perseus Digital Library | Classical texts |
| CLTK-Tesserae Latin | Classical texts |
| CC100-Latin | Web-crawled Latin text (deduplicated and filtered) |
| Latin Wikipedia | Latin Wikipedia articles |
| The Latin Library | General Latin texts |
### Parameters
| Parameter | Value |
|---|---|
| Algorithm | CBOW |
| Dimensions | 300 |
| Subword n-gram range | 3–6 |
| Hash buckets | 50,000 |
| Epochs | 15 |
| Negative sampling | 25 |
| Min count | 50 |
| Learning rate | 0.05 |
Training followed Sprugnoli et al. (2019) for the epoch count and negative-sampling parameters.
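The parameters above map onto floret's fastText-style CLI roughly as follows. This is a command sketch, not the project's actual training script: the file paths are placeholders, and the `-hashCount` value is an assumption not stated in this card:

```shell
# Sketch of a floret training run with this card's parameters (paths are placeholders)
floret cbow -input latin_corpus.txt -output la_floret_md \
    -mode floret -bucket 50000 -dim 300 \
    -minn 3 -maxn 6 -epoch 15 -neg 25 -minCount 50 -lr 0.05 \
    -hashCount 2  # assumption: hashes per n-gram is not documented here
```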
## Citation
If you use these vectors, please cite this preprint:
```bibtex
@misc{burns2023latincy,
  title = "{LatinCy}: Synthetic Trained Pipelines for {L}atin {NLP}",
  author = "Burns, Patrick J.",
  year = "2023",
  eprint = "2305.04365",
  archivePrefix = "arXiv",
  primaryClass = "cs.CL",
  url = "https://arxiv.org/abs/2305.04365"
}
```
## See also
- la_vectors_floret_lg: large vectors (200k hash buckets)
- LatinCy pipelines: Latin NLP pipelines for spaCy that use these vectors
## References
- Sprugnoli, R., Passarotti, M., and Moretti, G. 2019. "Vir Is to Moderatus as Mulier Is to Intemperans: Lemma Embeddings for Latin." In Proceedings of the Sixth Italian Conference on Computational Linguistics. Bari, Italy. 1–7. http://ceur-ws.org/Vol-2481/paper69.pdf.