Addressing Low-Resource Scenarios with Character-aware Embeddings
Sean Papay, Sebastian Padó, Ngoc Thang Vu
Abstract
Most modern approaches to computing word embeddings assume the availability of text corpora with billions of words. In this paper, we explore a setup where only corpora with millions of words are available, and many words in any new text are out of vocabulary. This setup is both of practical interest – modeling the situation for specific domains and low-resource languages – and of psycholinguistic interest, since it corresponds much more closely to the actual experiences and challenges of human language learning and use. We compare standard skip-gram word embeddings with character-based embeddings on word relatedness prediction. Skip-grams excel on large corpora, while character-based embeddings do well on small corpora generally and on rare and complex words specifically. The models can be combined easily.
- Anthology ID:
- W18-1204
- Volume:
- Proceedings of the Second Workshop on Subword/Character LEvel Models
- Month:
- June
- Year:
- 2018
- Address:
- New Orleans
- Venues:
- NAACL | SCLeM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 32–37
- URL:
- https://www.aclweb.org/anthology/W18-1204
- DOI:
- 10.18653/v1/W18-1204
- PDF:
- http://aclanthology.lst.uni-saarland.de/W18-1204.pdf
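The abstract notes that character-based embeddings handle out-of-vocabulary words and that the two model types "can be combined easily", without prescribing a specific implementation. As a minimal illustration of that idea, the sketch below builds a FastText-style character n-gram embedding (averaging deterministic hash-derived vectors for each n-gram) and combines it with a word-level vector by concatenation. All names, the dimensionality `DIM`, and the hash-based vectors are illustrative assumptions, not the paper's actual models; a real setup would use trained skip-gram and character embeddings.

```python
import hashlib
import math

DIM = 16  # illustrative embedding dimensionality, not from the paper


def _hash_vec(token: str) -> list[float]:
    # Deterministically map a string to a unit vector via its MD5 digest.
    # This stands in for a trained embedding lookup.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    vec = [digest[i % len(digest)] - 127.5 for i in range(DIM)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def char_ngram_embedding(word: str, n: int = 3) -> list[float]:
    # Character-based embedding: average the vectors of the word's
    # character n-grams, using "<" and ">" as boundary markers in the
    # style of FastText. Works for any word, including OOV words.
    padded = f"<{word}>"
    ngrams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    sums = [0.0] * DIM
    for gram in ngrams:
        for i, x in enumerate(_hash_vec(gram)):
            sums[i] += x
    return [s / len(ngrams) for s in sums]


def combine(word_vec: list[float], char_vec: list[float]) -> list[float]:
    # One easy combination method: concatenate the two representations.
    return word_vec + char_vec


# A placeholder "skip-gram" vector stands in for one trained on a corpus.
word_vec = _hash_vec("unfathomable")
char_vec = char_ngram_embedding("unfathomable")
combined = combine(word_vec, char_vec)
```

Because the character embedding is composed from n-grams, it still produces a vector for a word never seen at training time, which is the property the paper exploits in low-resource settings.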