Morphological Word Embeddings for Arabic Neural Machine Translation in Low-Resource Settings

Pamela Shapiro, Kevin Duh


Abstract
Neural machine translation has achieved impressive results in the last few years, but its success has been limited to settings with large amounts of parallel data. One way to improve NMT for lower-resource settings is to initialize a word-based NMT model with pretrained word embeddings. However, rare words still suffer from lower-quality word embeddings when trained with standard word-level objectives. We introduce word embeddings that utilize morphological resources, and compare them to purely unsupervised alternatives. We work with Arabic, a morphologically rich language with available linguistic resources, and perform Ar-to-En MT experiments on a small corpus of TED subtitles. We find that word embeddings utilizing subword information consistently outperform standard word embeddings on a word similarity task and as initialization of the source word embeddings in a low-resource NMT system.
Anthology ID:
W18-1201
Volume:
Proceedings of the Second Workshop on Subword/Character LEvel Models
Month:
June
Year:
2018
Address:
New Orleans
Venues:
NAACL | SCLeM | WS
Publisher:
Association for Computational Linguistics
Pages:
1–11
URL:
https://www.aclweb.org/anthology/W18-1201
DOI:
10.18653/v1/W18-1201
PDF:
http://aclanthology.lst.uni-saarland.de/W18-1201.pdf