Japanese Text Normalization with Encoder-Decoder Model

Taishi Ikeda, Hiroyuki Shindo, Yuji Matsumoto


Abstract
Text normalization is the task of transforming lexical variants into their canonical forms. We model text normalization as a character-level sequence-to-sequence learning problem and present a neural encoder-decoder model for solving it. Training an encoder-decoder model generally requires many sentence pairs; however, parallel corpora pairing Japanese non-standard forms with their canonical forms are scarce. To address this issue, we propose a data augmentation method that increases the training data size by converting existing resources into synthesized non-standard forms using handcrafted rules. Our experiments demonstrate that the synthesized corpus contributes to stable training of the encoder-decoder model and improves the performance of Japanese text normalization.
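The augmentation idea can be illustrated with a minimal sketch: apply handcrafted substitution rules to standard text to synthesize non-standard variants, then pair each variant with its original sentence as training data. The rules below are illustrative assumptions mimicking common Japanese social-media spellings, not the paper's actual rule set.

```python
import re

# Hypothetical handcrafted rules (illustrative only): each maps a standard
# character sequence to a synthesized non-standard variant.
RULES = [
    (re.compile("い(?=[るた])"), ""),       # い-dropping: 見ている -> 見てる
    (re.compile("う$"), "ー"),              # final う -> prolonged sound mark ー
]

def synthesize_nonstandard(sentence: str) -> str:
    """Apply each rule in order to turn a standard sentence into a
    synthesized non-standard form for data augmentation."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

def make_training_pair(sentence: str) -> tuple[str, str]:
    """Return a (non-standard, canonical) pair for seq2seq training."""
    return synthesize_nonstandard(sentence), sentence
```

Each synthesized pair, e.g. `make_training_pair("見ている")`, supplies a source/target example for the character-level encoder-decoder, supplementing the scarce naturally occurring parallel data.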
Anthology ID:
W16-3918
Volume:
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
Month:
December
Year:
2016
Address:
Osaka, Japan
Venues:
WNUT | WS
Publisher:
The COLING 2016 Organizing Committee
Pages:
129–137
URL:
https://www.aclweb.org/anthology/W16-3918
PDF:
http://aclanthology.lst.uni-saarland.de/W16-3918.pdf