A Comparison of Entity Matching Methods between English and Japanese Katakana

Michiharu Yamashita, Hideki Awashima, Hidekazu Oiwa


Abstract
Katakana is one component of the Japanese writing system and is used to write English terms, loanwords, and onomatopoeia in Japanese characters based on their phonemes. The main purpose of this research is to find the best entity matching methods between English and Katakana. We pose two research questions to clarify which types of entity matching systems work better than others. The first question is which transliteration should be used for conversion. English and Katakana terms must be transliterated into the same form before their string similarity can be computed. We consider five conversions: English to Katakana directly, Katakana to English directly, English to Katakana via phonemes, Katakana to English via phonemes, and both English and Katakana to phonemes. The second question is which similarity measure should be used for entity matching. To investigate this, we choose six methods: Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, Levenshtein, and the similarity of the phoneme probabilities predicted by an RNN. Our results show that 1) matching using phonemes and conversion of Katakana to English work better than the other conversions, and 2) the phoneme similarity outperforms the other measures, whose scores vary depending on the data and models.
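As a rough illustration of the string-based measures named in the abstract, the following Python sketch computes Overlap Coefficient, Jaccard, Cosine, and a length-normalized Levenshtein similarity for two strings that are already in the same script. It is a minimal sketch and not the authors' implementation: the use of character bigrams as tokens, the normalization of the edit distance by the longer string, and all function names are assumptions made here for illustration. Jaro-Winkler and the RNN-based phoneme similarity are omitted.

```python
from collections import Counter
from math import sqrt


def bigrams(s):
    """Character bigrams of s (assumed tokenization, not taken from the paper)."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def jaccard(a, b):
    """|X ∩ Y| / |X ∪ Y| over bigram sets."""
    x, y = set(bigrams(a)), set(bigrams(b))
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def overlap_coefficient(a, b):
    """|X ∩ Y| / min(|X|, |Y|) over bigram sets."""
    x, y = set(bigrams(a)), set(bigrams(b))
    return len(x & y) / min(len(x), len(y)) if x and y else 0.0


def cosine(a, b):
    """Cosine similarity of bigram count vectors."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(count * y[gram] for gram, count in x.items())
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance mapped to [0, 1] by dividing by the longer length (assumed)."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    # Two common Katakana spellings of "computer", differing by a final long-vowel mark.
    s, t = "コンピュータ", "コンピューター"
    for name, fn in [("overlap", overlap_coefficient), ("jaccard", jaccard),
                     ("cosine", cosine), ("levenshtein", levenshtein_similarity)]:
        print(f"{name}: {fn(s, t):.3f}")
```

In the setting the abstract describes, such functions would be applied only after one of the five conversions has brought both terms into the same form (Katakana, English, or phonemes); the conversion step itself is not shown here.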
Anthology ID:
W18-5809
Volume:
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
October
Year:
2018
Address:
Brussels, Belgium
Venues:
EMNLP | WS
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
84–92
URL:
https://www.aclweb.org/anthology/W18-5809
DOI:
10.18653/v1/W18-5809
PDF:
http://aclanthology.lst.uni-saarland.de/W18-5809.pdf