A Comparison of Entity Matching Methods between English and Japanese Katakana
Michiharu Yamashita | Hideki Awashima | Hidekazu Oiwa
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
Japanese Katakana is one component of the Japanese writing system and is used to express English terms, loanwords, and onomatopoeia in Japanese characters based on the phonemes. The main purpose of this research is to find the best entity matching methods between English and Katakana. We built two research questions to clarify which types of entity matching systems works better than others. The first question is what transliteration should be used for conversion. We need to transliterate English or Katakana terms into the same form in order to compute the string similarity. We consider five conversions that transliterate English to Katakana directly, Katakana to English directly, English to Katakana via phoneme, Katakana to English via phoneme, and both English and Katakana to phoneme. The second question is what should be used for the similarity measure at entity matching. To investigate the problem, we choose six methods, which are Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, Levenshtein, and the similarity of the phoneme probability predicted by RNN. Our results show that 1) matching using phonemes and conversion of Katakana to English works better than other methods, and 2) the similarity of phonemes outperforms other methods while other similarity score is changed depending on data and models.