Composing Byte-Pair Encodings for Morphological Sequence Classification

Adam Ek, Jean-Philippe Bernardy


Abstract
Byte-pair encodings is a method for splitting a word into sub-word tokens, a language model then assigns contextual representations separately to each of these tokens. In this paper, we evaluate four different methods of composing such sub-word representations into word representations. We evaluate the methods on morphological sequence classification, the task of predicting grammatical features of a word. Our experiments reveal that using an RNN to compute word representations is consistently more effective than the other methods tested across a sample of eight languages with different typology and varying numbers of byte-pair tokens per word.
Anthology ID:
2020.udw-1.9
Volume:
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venues:
COLING | UDW
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
76–86
Language:
URL:
https://www.aclweb.org/anthology/2020.udw-1.9
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2020.udw-1.9.pdf