Context Sensitive Neural Lemmatization with Lematus

Toms Bergmanis, Sharon Goldwater


Abstract
The main motivation for developing contextsensitive lemmatizers is to improve performance on unseen and ambiguous words. Yet previous systems have not carefully evaluated whether the use of context actually helps in these cases. We introduce Lematus, a lemmatizer based on a standard encoder-decoder architecture, which incorporates character-level sentence context. We evaluate its lemmatization accuracy across 20 languages in both a full data setting and a lower-resource setting with 10k training examples in each language. In both settings, we show that including context significantly improves results against a context-free version of the model. Context helps more for ambiguous words than for unseen words, though the latter has a greater effect on overall performance differences between languages. We also compare to three previous context-sensitive lemmatization systems, which all use pre-extracted edit trees as well as hand-selected features and/or additional sources of information such as tagged training data. Without using any of these, our context-sensitive model outperforms the best competitor system (Lemming) in the fulldata setting, and performs on par in the lowerresource setting.
Anthology ID:
N18-1126
Volume:
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
Month:
June
Year:
2018
Address:
New Orleans, Louisiana
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1391–1400
Language:
URL:
https://www.aclweb.org/anthology/N18-1126
DOI:
10.18653/v1/N18-1126
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/N18-1126.pdf