Composing Byte-Pair Encodings for Morphological Sequence Classification
Adam Ek, Jean-Philippe Bernardy
Abstract
Byte-pair encoding is a method for splitting a word into sub-word tokens; a language model then assigns a contextual representation to each of these tokens separately. In this paper, we evaluate four different methods of composing such sub-word representations into word representations. We evaluate the methods on morphological sequence classification, the task of predicting the grammatical features of a word. Our experiments reveal that using an RNN to compute word representations is consistently more effective than the other methods tested, across a sample of eight languages with different typology and varying numbers of byte-pair tokens per word.
- Anthology ID: 2020.udw-1.9
- Volume: Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
- Month: December
- Year: 2020
- Address: Barcelona, Spain (Online)
- Venues: COLING | UDW
- Publisher: Association for Computational Linguistics
- Pages: 76–86
- URL: https://www.aclweb.org/anthology/2020.udw-1.9
- PDF: http://aclanthology.lst.uni-saarland.de/2020.udw-1.9.pdf