Addressing Low-Resource Scenarios with Character-aware Embeddings

Sean Papay, Sebastian Padó, Ngoc Thang Vu


Abstract
Most modern approaches to computing word embeddings assume the availability of text corpora with billions of words. In this paper, we explore a setup where only corpora with millions of words are available, and many words in any new text are out of vocabulary. This setup is both of practical interest – modeling the situation for specific domains and low-resource languages – and of psycholinguistic interest, since it corresponds much more closely to the actual experiences and challenges of human language learning and use. We compare standard skip-gram word embeddings with character-based embeddings on word relatedness prediction. Skip-grams excel on large corpora, while character-based embeddings do well on small corpora generally, and on rare and morphologically complex words specifically. The two models can easily be combined.
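The abstract's key contrast is that character-based embeddings can represent out-of-vocabulary words by composing them from subword units, while skip-gram assigns vectors only to words seen in training. The following is a minimal, illustrative sketch of that compositional idea (fastText-style character n-grams); the hashed pseudo-random n-gram vectors stand in for learned embeddings, and the n-gram range and dimensionality are assumptions, not the paper's actual configuration.

```python
import hashlib

DIM = 16  # embedding dimensionality (illustrative choice)

def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams, with < and > marking word boundaries
    (fastText-style; this n-gram range is an assumption)."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def ngram_vector(ngram):
    """Deterministic pseudo-random vector per n-gram, standing in for a
    learned n-gram embedding table."""
    h = hashlib.md5(ngram.encode("utf-8")).digest()
    return [(b - 128) / 128 for b in h[:DIM]]

def word_vector(word):
    """Compose a word embedding as the average of its n-gram vectors,
    so even out-of-vocabulary words receive a representation."""
    vecs = [ngram_vector(g) for g in char_ngrams(word)]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))
```

Because morphologically related words share many character n-grams, their composed vectors end up similar even when neither word appeared in the training corpus – e.g., `cosine(word_vector("unhappiness"), word_vector("happiness"))` is much higher than the similarity to an unrelated word like `"table"`.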
Anthology ID:
W18-1204
Volume:
Proceedings of the Second Workshop on Subword/Character LEvel Models
Month:
June
Year:
2018
Address:
New Orleans
Venues:
NAACL | SCLeM | WS
Publisher:
Association for Computational Linguistics
Pages:
32–37
URL:
https://www.aclweb.org/anthology/W18-1204
DOI:
10.18653/v1/W18-1204
PDF:
http://aclanthology.lst.uni-saarland.de/W18-1204.pdf