---
license: cc-by-nc-4.0
language:
- fr
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
datasets:
- GEODE/GeoEDdA-TopoRel
---

# bert-base-multilingual-cased-classification-relation

This model is designed to classify spatial relations recognized from geographic encyclopedia articles.
It is a fine-tuned version of the bert-base-multilingual-cased model.
It has been trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).

## Model Description

- **Authors:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
- **Model type:** Text classification
- **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0

## Class labels

The tagset is as follows:

- **Adjacency**
- **Crosses**
- **Distance-Orientation**
- **Inclusion**
- **Movement**
- **Other**

## Dataset

The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
The dataset is split into train, validation and test sets, with the following distribution of entries among classes:

| Class | Train | Validation | Test |
|---|:---:|:---:|:---:|
| Adjacency | 498 | 59 | 75 |
| Crosses | 397 | 50 | 29 |
| Distance-Orientation | 1,065 | 163 | 115 |
| Inclusion | 1,319 | 131 | 156 |
| Movement | 184 | 15 | 35 |
| Other | 195 | 30 | 42 |

## Evaluation

* Overall weighted-average model performance

| | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
| Weighted average | 0.92 | 0.92 | 0.92 |

* Per-class model performance (Test set)

| Class | Precision | Recall | F-score | Support |
|---|:---:|:---:|:---:|:---:|
| Adjacency | 0.85 | 0.84 | 0.85 | 75 |
| Crosses | 0.78 | 0.86 | 0.82 | 29 |
| Distance-Orientation | 0.93 | 0.99 | 0.96 | 115 |
| Inclusion | 0.97 | 0.98 | 0.97 | 156 |
| Movement | 0.89 | 0.69 | 0.77 | 35 |
| Other | 0.95 | 0.88 | 0.91 | 42 |

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import pipeline

# Pick the best available device: Apple MPS, then CUDA, then CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

# Span tagger used to locate relation expressions in the text.
ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
# Relation classifier (this model).
relation_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-relation", truncation=True, device=device)

def get_context(text, span, ngram_context_size=5):
    word = span["word"]
    start = span["start"]
    end = span["end"]
    label = span["entity_group"]

    # Extract context
    previous_text = text[:start].strip()
    next_text = text[end:].strip()
    previous_words = previous_text.split()[-ngram_context_size:]
    next_words = next_text.split()[:ngram_context_size]

    # Build context string
    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
    return word, context, label

content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."

spans = ner(content)
for span in spans:
    # Only spans tagged as relations are passed to the classifier.
    if span['entity_group'] == 'Relation':
        word, context, label = get_context(content, span, ngram_context_size=5)
        print(f"Relation: {word}")
        prediction = relation_classifier(context)
        print(f"Predicted label: {prediction}")

# Output
# Relation: sur le bord de
# Predicted label: [{'label': 'Crosses', 'score': 0.9778845906257629}]
# Relation: à dix-huit milles au sud-est de
# Predicted label: [{'label': 'Distance-Orientation', 'score': 0.9959626793861389}]
# Relation: à soixante sud-ouest de
# Predicted label: [{'label': 'Distance-Orientation', 'score': 0.9963018894195557}]
```

## Bias, Risks, and Limitations

This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or from other corpora.

## Acknowledgement

The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.
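
## Example: classifier input format

The usage code in "How to Get Started with the Model" feeds the classifier a context string rather than the bare relation expression. The sketch below illustrates what that string looks like, without downloading any model. `build_context` is a hypothetical stand-in for `get_context` that takes character offsets directly instead of a span dictionary from the token-classification pipeline, and the shortened sentence is an assumption made for illustration.

```python
def build_context(text, start, end, ngram_context_size=5):
    """Hypothetical helper mirroring get_context: bracketed relation,
    then up to N whitespace-separated words on each side of it."""
    word = text[start:end]
    previous_words = text[:start].split()[-ngram_context_size:]
    next_words = text[end:].split()[:ngram_context_size]
    return f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"


# Shortened version of the WINCHESTER entry (illustrative only).
text = (
    "WINCHESTER, ville d'Angleterre, capitale du Hampshire, "
    "sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury."
)
relation = "sur le bord de"
start = text.index(relation)
end = start + len(relation)

context = build_context(text, start, end)
print(context)
# [sur le bord de]: ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au
```

With the default window of five words, the first word of the sentence falls outside the left context. A larger `ngram_context_size` gives the classifier more surrounding text, at the cost of possibly pulling in neighbouring relation expressions.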