Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe

Milan Straka, Jana Straková


Abstract
Many natural language processing tasks, including the most advanced ones, routinely start by several basic processing steps – tokenization and segmentation, most likely also POS tagging and lemmatization, and commonly parsing as well. A multilingual pipeline performing these steps can be trained using the Universal Dependencies project, which contains annotations of the described tasks for 50 languages in the latest release UD 2.0. We present an update to UDPipe, a simple-to-use pipeline processing CoNLL-U version 2.0 files, which performs these tasks for multiple languages without requiring additional external data. We provide models for all 50 languages of UD 2.0, and furthermore, the pipeline can be trained easily using data in CoNLL-U format. UDPipe is a standalone application in C++, with bindings available for Python, Java, C# and Perl. In the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, UDPipe was the eight best system, while achieving low running times and moderately sized models.
Anthology ID:
K17-3009
Volume:
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Month:
August
Year:
2017
Address:
Vancouver, Canada
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Note:
Pages:
88–99
Language:
URL:
https://www.aclweb.org/anthology/K17-3009
DOI:
10.18653/v1/K17-3009
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/K17-3009.pdf
Poster:
 K17-3009.Poster.pdf