Revisiting NMT for Normalization of Early English Letters

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, Eetu Mäkelä


Abstract
This paper studies the use of neural machine translation (NMT) as a normalization method for a corpus of early English letters. The corpus has previously been normalized, leaving only the less frequent deviant forms unnormalized. This paper discusses several approaches to improving the normalization of these remaining forms. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model, together with lemmatization, improves results.
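
The candidate-filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pick_candidate`, the toy lexicon, the toy lemmatizer, and the choice to return `None` when no candidate matches are all assumptions made for illustration. The sketch walks the NMT model's ranked candidates and keeps the first one whose lemma is attested in the lexicon.

```python
from typing import Callable, Iterable, Optional, Set


def pick_candidate(
    candidates: Iterable[str],
    lexicon: Set[str],
    lemmatize: Callable[[str], str],
) -> Optional[str]:
    """Return the highest-ranked candidate whose lemma is in the lexicon.

    `candidates` is assumed to be ordered best-first, as an n-best list
    from an NMT model would be. Returns None if nothing is attested
    (an assumption; a real system might fall back to the top candidate).
    """
    for candidate in candidates:
        if lemmatize(candidate) in lexicon:
            return candidate
    return None


def toy_lemmatize(word: str) -> str:
    """Hypothetical stand-in for a real lemmatizer: strips a final 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


# Hypothetical stand-in for a lexicographical resource.
TOY_LEXICON = {"horse", "house"}

# Hypothetical n-best list for a historical spelling, best-first.
n_best = ["horsses", "horses", "houses"]
chosen = pick_candidate(n_best, TOY_LEXICON, toy_lemmatize)
```

Here the top-ranked candidate is rejected because its lemma is not in the lexicon, so the second candidate is selected instead. In practice the lemmatizer and the lexicon would be replaced by real resources.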
Anthology ID:
W19-2509
Volume:
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month:
June
Year:
2019
Address:
Minneapolis, USA
Venues:
LaTeCH | NAACL | WS
SIG:
SIGHUM
Publisher:
Association for Computational Linguistics
Pages:
71–75
URL:
https://www.aclweb.org/anthology/W19-2509
DOI:
10.18653/v1/W19-2509
PDF:
http://aclanthology.lst.uni-saarland.de/W19-2509.pdf