# ÚFAL LatinPipe at EvaLatin 2024: Morphosyntactic Analysis of Latin
This repository contains the LatinPipe parser implementation described in
the _ÚFAL LatinPipe at EvaLatin 2024: Morphosyntactic Analysis of Latin_ paper.
📢 Besides this source code and the [trained model](https://hdl.handle.net/11234/1-5671),
LatinPipe is also available in the [UDPipe LINDAT/CLARIN service](http://lindat.mff.cuni.cz/services/udpipe/)
and can be used either in a web form or through a REST service.
---
<img src="figures/LatinPipe.svg" alt="LatinPipe Architecture" align="right" style="width: 45%">
<h3 align="center"><a href="https://aclanthology.org/2024.lt4hala-1.24/">ÚFAL LatinPipe at EvaLatin 2024: Morphosyntactic Analysis of Latin</a></h3>
<p align="center">
<b>Milan Straka</b> and <b>Jana Straková</b> and <b>Federica Gamba</b><br>
Charles University<br>
Faculty of Mathematics and Physics<br>
Institute of Formal and Applied Linguistics<br>
Malostranské nám. 25, Prague, Czech Republic
</p>
**Abstract:** We present LatinPipe, the winning submission to the EvaLatin 2024
Dependency Parsing shared task. Our system consists of a fine-tuned
concatenation of base and large pre-trained LMs, with a dot-product attention
head for parsing and softmax classification heads for morphology to jointly
learn both dependency parsing and morphological analysis. It is trained by
sampling from seven publicly available Latin corpora, utilizing additional
harmonization of annotations to achieve a more unified annotation style. Before
fine-tuning, we train the system for a few initial epochs with frozen weights.
We also add additional local relative contextualization by stacking the BiLSTM
layers on top of the Transformer(s). Finally, we ensemble output probability
distributions from seven randomly instantiated networks for the final
submission. The code is available at <a href="https://github.com/ufal/evalatin2024-latinpipe">https://github.com/ufal/evalatin2024-latinpipe</a>.<br clear="both">
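For orientation, the following is a minimal PyTorch sketch of the architecture described in the abstract, an illustration rather than the actual implementation (which lives in `latinpipe_evalatin24.py`): transformer word embeddings are contextualized by stacked BiLSTM layers, arcs are scored by a dot-product attention head, and morphology is predicted by softmax classification heads. All module names and sizes are assumptions.

```python
# Illustrative sketch of the LatinPipe architecture: BiLSTM layers stacked on top
# of transformer embeddings, a dot-product attention head scoring head-dependent
# arcs, and softmax classification heads for morphology. Sizes are assumptions.
import torch
import torch.nn as nn

class LatinPipeSketch(nn.Module):
    def __init__(self, emb_dim=768, hidden=512, n_upos=17, n_feats=400, n_deprels=40):
        super().__init__()
        # Additional local relative contextualization on top of the transformer(s).
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Dot-product attention head for parsing: score(dep, head) = q_dep . k_head.
        self.arc_query = nn.Linear(2 * hidden, hidden)
        self.arc_key = nn.Linear(2 * hidden, hidden)
        # Softmax classification heads for morphology and relation labels.
        self.upos = nn.Linear(2 * hidden, n_upos)
        self.feats = nn.Linear(2 * hidden, n_feats)
        self.deprel = nn.Linear(2 * hidden, n_deprels)

    def forward(self, embeddings):              # (batch, words, emb_dim)
        states, _ = self.bilstm(embeddings)     # (batch, words, 2 * hidden)
        arc_scores = self.arc_query(states) @ self.arc_key(states).transpose(1, 2)
        return arc_scores, self.upos(states), self.feats(states), self.deprel(states)

# Example: score a batch of 3 sentences of 20 words with 768-dimensional embeddings.
arcs, upos, feats, deprels = LatinPipeSketch()(torch.randn(3, 20, 768))
```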
---
## Content of this Repository
- The directory `data` is for all the required data (UD 2.13 data, harmonized
PROIEL, Sabellicus, Archimedes Latinus, EvaLatin 2024 test data).
- The script `data/fetch_data.sh` downloads and extracts all the data:
```sh
(cd data && sh fetch_data.sh)
```
- The script `latinpipe_evalatin24.py` is the LatinPipe EvaLatin 2024 source.
- It depends on `latinpipe_evalatin24_eval.py`, which is a modularized version
of the official evaluation script.
- The script `latinpipe_evalatin24_server.py` is a REST server with a UDPipe-2-compatible
  API, using `latinpipe_evalatin24.py` to perform tagging and parsing (a usage sketch
  follows below).
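Once the server is started, it can be queried along these lines (a minimal sketch; the host, port, and endpoint are assumptions to be checked against `latinpipe_evalatin24_server.py`, while the request format follows the UDPipe 2 REST API):

```python
# Query a locally running latinpipe_evalatin24_server.py instance, assuming it
# listens on localhost:8001 and exposes the UDPipe-2-style /process endpoint.
import requests

response = requests.post(
    "http://localhost:8001/process",  # assumed host/port/endpoint
    data={
        "tokenizer": "",   # empty value = use the default settings
        "tagger": "",
        "parser": "",
        "data": "Gallia est omnis divisa in partes tres.",
    },
)
print(response.json()["result"])  # the annotation in CoNLL-U format
```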
## The Released `latinpipe-evalatin24-240520` Model
`latinpipe-evalatin24-240520` is a `PhilBerta`-based model for tagging,
lemmatization, and dependency parsing of Latin, based on the winning entry
to the [EvaLatin 2024](https://circse.github.io/LT4HALA/2024/EvaLatin) shared
task. It is released at https://hdl.handle.net/11234/1-5671 under the
CC BY-NC-SA 4.0 license.
The model is also available in the [UDPipe LINDAT/CLARIN service](http://lindat.mff.cuni.cz/services/udpipe/)
and can be used either in a web form or through a REST service.
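The REST service can be called with a few lines of Python (a minimal sketch; the model identifier below is an assumption, and the service's `/models` endpoint lists the exact available names):

```python
# Call the public LINDAT/CLARIN UDPipe REST service.
import requests

api = "https://lindat.mff.cuni.cz/services/udpipe/api"
print(requests.get(api + "/models").json()["models"])  # discover exact model names

response = requests.post(api + "/process", data={
    "model": "latin",  # assumed identifier; check the /models output first
    "tokenizer": "", "tagger": "", "parser": "",
    "data": "Gallia est omnis divisa in partes tres.",
})
print(response.json()["result"])  # the annotation in CoNLL-U format
```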
See the [latinpipe-evalatin24-240520 directory](latinpipe-evalatin24-240520/) for
the download link, the model performance, and additional information.
## Training a Model
To train a model on all data, you should
1. run the `data/fetch_data.sh` script to download all required data,
2. create a Python environment with the packages listed in `requirements.txt`,
3. train the model itself using the `latinpipe_evalatin24.py` script.
To train a model performing UPOS/UFeats tagging, lemmatization, and
dependency parsing, we use
```sh
la_ud213_all="la_ittb la_llct la_perseus la_proiel la_udante"
la_other="la_archimedes la_sabellicus"
transformer="bowphs/PhilBerta" # or bowphs/LaBerta
latinpipe_evalatin24.py $(for split in dev test train; do echo --$split; for tb in $la_ud213_all; do [ $tb-$split = la_proiel-train ] && tb=la_proielh; echo data/$tb/$tb-ud-$split.conllu; done; done) $(for tb in $la_other; do echo data/$tb/$tb-train.conllu; done) --transformers $transformer --epochs=30 --exp=evalatin24_model --subword_combination=last --epochs_frozen=10 --batch_size=64 --save_checkpoint
```
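The `--epochs_frozen=10` option corresponds to the schedule described in the abstract: the pre-trained transformer weights stay frozen for the first epochs and are only fine-tuned afterwards. A minimal sketch of that pattern (the `Model` class and its modules are placeholder assumptions, not the LatinPipe implementation):

```python
# Frozen-first fine-tuning schedule: keep the transformer fixed for the initial
# epochs, then unfreeze it for the remaining ones.
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.transformer = nn.Linear(768, 768)  # placeholder for the pre-trained LM
        self.heads = nn.Linear(768, 64)         # placeholder for parsing/tagging heads

model, epochs, epochs_frozen = Model(), 30, 10
for epoch in range(epochs):
    for p in model.transformer.parameters():
        p.requires_grad = epoch >= epochs_frozen  # frozen until epochs_frozen
    # ... one training epoch over the sampled corpora goes here ...
```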
## Predicting with a Trained Model
To predict with a trained model, you can use the following command:
```sh
latinpipe_evalatin24.py --load evalatin24_model/model.weights.h5 --exp target_directory --test input1.conllu input2.conllu
```
- the outputs are generated in the target directory with a `.predicted.conllu` suffix
  (a sketch of reading them follows below);
- if you also want to evaluate the predicted files, use the `--dev` option instead of `--test`.
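The predicted files are plain CoNLL-U, so any CoNLL-U reader can post-process them, for example the third-party `conllu` package (`pip install conllu`; the file name below is an assumption):

```python
# Read a predicted file and print the morphosyntax of the first sentence.
import conllu

with open("target_directory/input1.predicted.conllu", encoding="utf-8") as f:
    sentences = conllu.parse(f.read())

for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```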
---
## Contact
Milan Straka: ``[email protected]``\
Jana Straková: ``[email protected]``\
Federica Gamba: ``[email protected]``
## How to Cite
```
@inproceedings{straka-etal-2024-ufal,
title = "{{\'U}FAL} {L}atin{P}ipe at {E}va{L}atin 2024: Morphosyntactic Analysis of {L}atin",
author = "Straka, Milan and Strakov{\'a}, Jana and Gamba, Federica",
editor = "Sprugnoli, Rachele and Passarotti, Marco",
booktitle = "Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lt4hala-1.24",
pages = "207--214"
}
```