Mamoru Komachi


2020

Grammatical Error Correction Using Pseudo Learner Corpus Considering Learner’s Error Tendency
Yujin Takahashi | Satoru Katsumata | Mamoru Komachi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Recently, several studies have focused on improving the performance of grammatical error correction (GEC) tasks using pseudo data. However, a large amount of pseudo data is required to train an accurate GEC model. To address the limitations of language and computational resources, we assume that introducing pseudo errors into sentences similar to those written by language learners is more efficient than incorporating random pseudo errors into monolingual data. In this regard, we study the effect of pseudo data on GEC task performance using two approaches. First, we extract sentences that are similar to the learners’ sentences from monolingual data. Second, we generate realistic pseudo errors by considering error types that learners often make. Based on our comparative results, we observe that F0.5 scores for the Russian GEC task are significantly improved.

Zero-shot North Korean to English Neural Machine Translation by Character Tokenization and Phoneme Decomposition
Hwichan Kim | Tosho Hirasawa | Mamoru Komachi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

The primary limitation of North Korean to English translation is the lack of a parallel corpus; therefore, high translation accuracy cannot be achieved. To address this problem, we propose a zero-shot approach using South Korean data, which are remarkably similar to North Korean data. We train a neural machine translation model after tokenizing a South Korean text at the character level and decomposing characters into phonemes. We demonstrate that our method can effectively learn North Korean to English translation and improve the BLEU scores by +1.01 points in comparison with the baseline.
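
As a rough illustration of character tokenization followed by phoneme decomposition, the sketch below splits precomposed Hangul syllables into conjoining jamo using the standard Unicode arithmetic. It is not the authors' preprocessing code: the function names `decompose_syllable` and `tokenize` are hypothetical, and the paper's actual pipeline may differ.

```python
# Illustrative sketch only; the paper's actual preprocessing may differ.
HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable (U+AC00)
HANGUL_LAST = 0xD7A3  # last precomposed Hangul syllable (U+D7A3)
VOWELS, TAILS = 21, 28  # TAILS counts the "no final consonant" case as index 0


def decompose_syllable(ch: str) -> list[str]:
    """Return the conjoining jamo of one Hangul syllable, or [ch] otherwise."""
    code = ord(ch)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return [ch]
    lead, rest = divmod(code - HANGUL_BASE, VOWELS * TAILS)
    vowel, tail = divmod(rest, TAILS)
    jamo = [chr(0x1100 + lead), chr(0x1161 + vowel)]
    if tail:  # tail index 0 means the syllable has no final consonant
        jamo.append(chr(0x11A7 + tail))
    return jamo


def tokenize(text: str) -> list[str]:
    """Character-level tokenization followed by jamo decomposition."""
    return [jamo for ch in text for jamo in decompose_syllable(ch)]
```

Non-Hangul characters pass through unchanged, so mixed text can be tokenized in a single pass.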

English-to-Japanese Diverse Translation by Combining Forward and Backward Outputs
Masahiro Kaneko | Aizhan Imankulova | Tosho Hirasawa | Mamoru Komachi
Proceedings of the Fourth Workshop on Neural Generation and Translation

We introduce our TMU system submitted to the English-to-Japanese (En→Ja) track of the Simultaneous Translation And Paraphrase for Language Education (STAPLE) shared task at the 4th Workshop on Neural Generation and Translation (WNGT2020). In most cases, machine translation systems generate a single output from the input sentence. However, to assist language learners with better and more diverse feedback, it is helpful to create a machine translation system that can produce diverse translations of each input sentence. Creating such systems would normally require complex modifications to a model to ensure the diversity of outputs. In this paper, we investigate whether it is possible to create such a system in a simple way and whether it can produce the desired diverse outputs. In particular, we combine the outputs from forward and backward neural machine translation (NMT) models. Our system achieved third place in the En→Ja track, despite adopting only a simple approach.

Construction of an Evaluation Corpus for Grammatical Error Correction for Learners of Japanese as a Second Language
Aomi Koyama | Tomoshige Kiyuna | Kenji Kobayashi | Mio Arai | Mamoru Komachi
Proceedings of the 12th Language Resources and Evaluation Conference

The NAIST Lang-8 Learner Corpora (Lang-8 corpus) is one of the largest second-language learner corpora. The Lang-8 corpus is suitable as a training dataset for machine translation-based grammatical error correction systems. However, it is not suitable as an evaluation dataset because the corrected sentences sometimes include inappropriate sentences. Therefore, we created and released an evaluation corpus for correcting grammatical errors made by learners of Japanese as a Second Language (JSL). As our corpus has less noise and its annotation scheme reflects the characteristics of the dataset, it is ideal as an evaluation corpus for correcting grammatical errors in sentences written by JSL learners. In addition, we applied neural machine translation (NMT) and statistical machine translation (SMT) techniques to correct the grammar of the JSL learners’ sentences and evaluated their results using our corpus. We also compared the performance of the NMT system with that of the SMT system.

Automated Essay Scoring System for Nonnative Japanese Learners
Reo Hirao | Mio Arai | Hiroki Shimanaka | Satoru Katsumata | Mamoru Komachi
Proceedings of the 12th Language Resources and Evaluation Conference

In this study, we created an automated essay scoring (AES) system for nonnative Japanese learners using an essay dataset with annotations for a holistic score and multiple trait scores, including content, organization, and language scores. In particular, we developed AES systems using two different approaches: a feature-based approach and a neural-network-based approach. In the former approach, we used Japanese-specific linguistic features, including character-type features such as “kanji” and “hiragana.” In the latter approach, we used two models: a long short-term memory (LSTM) model (Hochreiter and Schmidhuber, 1997) and a bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2019), which achieved the highest accuracy in various natural language processing tasks in 2018. Overall, the BERT model achieved the best root mean squared error and quadratic weighted kappa scores. In addition, we analyzed the robustness of the outputs of the BERT model. We have released and shared this system to facilitate further research on AES for Japanese as a second language learners.

Generating Diverse Corrections with Local Beam Search for Grammatical Error Correction
Kengo Hotate | Masahiro Kaneko | Mamoru Komachi
Proceedings of the 28th International Conference on Computational Linguistics

In this study, we propose a beam search method to obtain diverse outputs in a local sequence transduction task where most of the tokens in the source and target sentences overlap, such as in grammatical error correction (GEC). In GEC, it is advisable to rewrite only the local sequences that must be rewritten while leaving the correct sequences unchanged. However, existing methods of acquiring various outputs focus on revising all tokens of a sentence. Therefore, existing methods may either generate ungrammatical sentences because they force the entire sentence to be changed or produce non-diversified sentences by weakening the constraints to avoid generating ungrammatical sentences. Considering these issues, we propose a method that does not rewrite all the tokens in a text, but only rewrites those parts that need to be diversely corrected. Our beam search method adjusts the search token in the beam according to the probability that the prediction is copied from the source sentence. The experimental results show that our proposed method generates more diverse corrections than existing methods without losing accuracy in the GEC task.

Cross-lingual Transfer Learning for Grammatical Error Correction
Ikumi Yamashita | Satoru Katsumata | Masahiro Kaneko | Aizhan Imankulova | Mamoru Komachi
Proceedings of the 28th International Conference on Computational Linguistics

In this study, we explore cross-lingual transfer learning in grammatical error correction (GEC) tasks. Many languages lack the resources required to train GEC models. Cross-lingual transfer learning from high-resource languages (the source models) is effective for training models of low-resource languages (the target models) for various tasks. However, in GEC tasks, the possibility of transferring grammatical knowledge (e.g., grammatical functions) across languages is not evident. Therefore, we investigate cross-lingual transfer learning methods for GEC. Our results demonstrate that transfer learning from other languages can improve the accuracy of GEC. We also demonstrate that proximity to source languages has a significant impact on the accuracy of correcting certain types of errors.

SOME: Reference-less Sub-Metrics Optimized for Manual Evaluations of Grammatical Error Correction
Ryoma Yoshimura | Masahiro Kaneko | Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of the 28th International Conference on Computational Linguistics

We propose a reference-less metric trained on manual evaluations of system outputs for grammatical error correction (GEC). Previous studies have shown that reference-less metrics are promising; however, existing metrics are not optimized for manual evaluations of the system outputs because no dataset of the system output exists with manual evaluation. This study manually evaluates outputs of GEC systems to optimize the metrics. Experimental results show that the proposed metric improves correlation with the manual evaluation in both system- and sentence-level meta-evaluation. Our dataset and metric will be made publicly available.

Double Attention-based Multimodal Neural Machine Translation with Semantic Image Regions
Yuting Zhao | Mamoru Komachi | Tomoyuki Kajiwara | Chenhui Chu
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Existing studies on multimodal neural machine translation (MNMT) have mainly focused on the effect of combining visual and textual modalities to improve translations. However, it has been suggested that the visual modality is only marginally beneficial. Conventional visual attention mechanisms have been used to select the visual features from equally-sized grids generated by convolutional neural networks (CNNs), and may have had modest effects on aligning the visual concepts associated with textual objects, because the grid visual features do not capture semantic information. In contrast, we propose the application of semantic image regions for MNMT by integrating visual and textual features using two individual attention mechanisms (double attention). We conducted experiments on the Multi30k dataset and achieved improvements of 0.5 and 0.9 BLEU points for the English-German and English-French translation tasks, respectively, compared with the MNMT with grid visual features. We also demonstrated concrete improvements in translation performance resulting from semantic image regions.

2019

Japanese-Russian TMU Neural Machine Translation System using Multilingual Model for WAT 2019
Aizhan Imankulova | Masahiro Kaneko | Mamoru Komachi
Proceedings of the 6th Workshop on Asian Translation

We introduce our system submitted to the News Commentary task (Japanese<->Russian) of the 6th Workshop on Asian Translation. The goal of this shared task is to study extremely low resource situations for distant language pairs. It is known that using parallel corpora of different language pairs as training data is effective for a multilingual neural machine translation model in extremely low resource scenarios. Therefore, to improve the translation quality for the Japanese<->Russian language pair, our method leverages other in-domain Japanese-English and English-Russian parallel corpora as additional training data for our multilingual NMT model.

Controlling Grammatical Error Correction Using Word Edit Rate
Kengo Hotate | Masahiro Kaneko | Satoru Katsumata | Mamoru Komachi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

When professional English teachers correct grammatically erroneous sentences written by English learners, they use various methods. The correction method depends on how much correction a learner requires. In this paper, we propose a method for neural grammatical error correction (GEC) that can control the degree of correction. We show that it is possible to control the degree of GEC by using new training data annotated with word edit rate. Thereby, diverse corrected sentences are obtained from a single erroneous sentence. Moreover, compared to a GEC model that does not use information on the degree of correction, the proposed method improves correction accuracy.
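
To make the notion of word edit rate concrete, the sketch below computes a word-level Levenshtein distance between a learner sentence and its correction and divides it by the source length. The normalization by source length and the function names are assumptions made for illustration; the paper's exact definition may differ.

```python
def word_edit_distance(source: list[str], target: list[str]) -> int:
    """Levenshtein distance over word tokens; each edit costs 1."""
    prev = list(range(len(target) + 1))
    for i, s_word in enumerate(source, start=1):
        curr = [i]
        for j, t_word in enumerate(target, start=1):
            cost = 0 if s_word == t_word else 1
            curr.append(min(prev[j] + 1,          # delete s_word
                            curr[j - 1] + 1,      # insert t_word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def word_edit_rate(source: str, target: str) -> float:
    """Edit distance divided by source length (assumed normalization)."""
    src, tgt = source.split(), target.split()
    if not src:
        raise ValueError("source sentence is empty")
    return word_edit_distance(src, tgt) / len(src)
```

For instance, correcting "he go to school" to "he goes to school" takes one substitution over four source words, giving a rate of 0.25 under this assumed definition.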

Sakura: Large-scale Incorrect Example Retrieval System for Learners of Japanese as a Second Language
Mio Arai | Tomonori Kodaira | Mamoru Komachi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

This study develops an incorrect example retrieval system, called Sakura, using a large-scale Lang-8 dataset for Japanese language learners. Existing example retrieval systems do not include grammatically incorrect examples or present only a few examples, if any. If a retrieval system has a wide coverage of incorrect examples along with the correct counterpart, learners can revise their composition themselves. Considering the usability of retrieving incorrect examples, our proposed system uses a large-scale corpus to expand the coverage of incorrect examples and presents correct expressions along with incorrect expressions. Our intrinsic and extrinsic evaluations indicate that our system is more useful than a previous system.

(Almost) Unsupervised Grammatical Error Correction using Synthetic Comparable Corpus
Satoru Katsumata | Mamoru Komachi
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We introduce unsupervised techniques based on phrase-based statistical machine translation for grammatical error correction (GEC) trained on a pseudo learner corpus created by Google Translation. We verified our GEC system through experiments on a low resource track of the shared task at BEA2019. As a result, we achieved an F0.5 score of 28.31 points with the test data.
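
For readers unfamiliar with the F0.5 score reported in GEC work, the sketch below computes the general F-beta measure from precision and recall; with beta = 0.5, precision is weighted more heavily than recall. The function name `f_beta` is illustrative, and the inputs in the tests are made-up values, not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```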

TMU Transformer System Using BERT for Re-ranking at BEA 2019 Grammatical Error Correction on Restricted Track
Masahiro Kaneko | Kengo Hotate | Satoru Katsumata | Mamoru Komachi
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We introduce our system submitted to the restricted track of the BEA 2019 shared task on grammatical error correction (GEC). It is essential to select an appropriate hypothesis sentence from the candidate list generated by the GEC model. A re-ranker can evaluate the naturalness of a corrected sentence using language models trained on large corpora. On the other hand, these language models and language representations do not explicitly take into account the grammatical errors made by learners. Thus, it is not straightforward to utilize language representations trained from a large corpus, such as Bidirectional Encoder Representations from Transformers (BERT), in a form suitable for the learner’s grammatical errors. Therefore, we propose to fine-tune BERT on learner corpora with grammatical errors for re-ranking. The experimental results on the W&I+LOCNESS development dataset demonstrate that re-ranking using BERT can effectively improve the correction performance.

Grammatical-Error-Aware Incorrect Example Retrieval System for Learners of Japanese as a Second Language
Mio Arai | Masahiro Kaneko | Mamoru Komachi
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Existing example retrieval systems do not include grammatically incorrect examples or present only a few examples, if any. Even if a retrieval system has a wide coverage of incorrect examples along with the correct counterpart, learners need to know whether their query includes errors or not. Considering the usability of retrieving incorrect examples, our proposed method uses a large-scale corpus and presents correct expressions along with incorrect expressions using a grammatical error detection system so that learners do not need to be aware of how to search for the examples. Intrinsic and extrinsic evaluations indicate that our method improves the accuracy of example sentence retrieval and the quality of learners’ writing.

Filtering Pseudo-References by Paraphrasing for Automatic Evaluation of Machine Translation
Ryoma Yoshimura | Hiroki Shimanaka | Yukio Matsumura | Hayahide Yamagishi | Mamoru Komachi
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper, we describe our participation in the WMT 2019 Metric Shared Task. We propose an improved version of sentence BLEU using filtered pseudo-references. We propose a method to filter pseudo-references by paraphrasing for automatic evaluation of machine translation (MT). We use the outputs of off-the-shelf MT systems as pseudo-references filtered by paraphrasing in addition to a single human reference (gold reference). We use BERT fine-tuned with a paraphrase corpus to filter pseudo-references by checking the paraphrasability with the gold reference. Our experimental results on the WMT 2016 and 2017 datasets show that our method achieved higher correlation with human evaluation than the sentence BLEU (SentBLEU) baselines with a single reference and with unfiltered pseudo-references.

Debiasing Word Embeddings Improves Multimodal Machine Translation
Tosho Hirasawa | Mamoru Komachi
Proceedings of Machine Translation Summit XVII Volume 1: Research Track

Multi-Task Learning for Japanese Predicate Argument Structure Analysis
Hikaru Omori | Mamoru Komachi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

An event-noun is a noun that has an argument structure similar to a predicate. Recent works, including those considered state-of-the-art, ignore event-nouns or build a single model for solving both Japanese predicate argument structure analysis (PASA) and event-noun argument structure analysis (ENASA). However, because there are interactions between predicates and event-nouns, it is not sufficient to target only predicates. To address this problem, we present a multi-task learning method for PASA and ENASA. Our multi-task models improved the performance of both tasks compared to a single-task model by sharing knowledge from each task. Moreover, in PASA, our models achieved state-of-the-art results in overall F1 scores on the NAIST Text Corpus. In addition, this is the first work to employ neural networks in ENASA.

Multimodal Machine Translation with Embedding Prediction
Tosho Hirasawa | Hayahide Yamagishi | Yukio Matsumura | Mamoru Komachi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Multimodal machine translation is an attractive application of neural machine translation (NMT). It helps computers to deeply understand visual objects and their relations with natural languages. However, multimodal NMT systems suffer from a shortage of available training data, resulting in poor performance for translating rare words. In NMT, pretrained word embeddings have been shown to improve NMT of low-resource domains, and a search-based approach is proposed to address the rare word problem. In this study, we effectively combine these two approaches in the context of multimodal NMT and explore how we can take full advantage of pretrained word embeddings to better translate rare words. We report overall performance improvements of 1.24 METEOR and 2.49 BLEU and achieve an improvement of 7.67 F-score for rare word translation.

2018

Construction of a Japanese Word Similarity Dataset
Yuya Sakaizawa | Mamoru Komachi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Japanese Predicate Conjugation for Neural Machine Translation
Michiki Kurosawa | Yukio Matsumura | Hayahide Yamagishi | Mamoru Komachi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Neural machine translation (NMT) has a drawback in that it can generate only high-frequency words owing to the computational costs of the softmax function in the output layer. In Japanese-English NMT, Japanese predicate conjugation causes an increase in vocabulary size. For example, one verb can have as many as 19 surface varieties. In this research, we focus on predicate conjugation for compressing the vocabulary size in Japanese. The vocabulary list is filled with the various forms of verbs. We propose methods using predicate conjugation information without discarding linguistic information. The proposed methods can generate low-frequency words and deal with unknown words. Two methods were considered to introduce conjugation information: the first considers it as a token (conjugation token) and the second considers it as an embedded vector (conjugation feature). The results using these methods demonstrate that the vocabulary size can be compressed by approximately 86.1% (Tanaka corpus) and the NMT models can output words not in the training data set. Furthermore, BLEU scores improved by 0.91 points in Japanese-to-English translation, and 0.32 points in English-to-Japanese translation with ASPEC.

Metric for Automatic Machine Translation Evaluation based on Universal Sentence Representations
Hiroki Shimanaka | Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Sentence representations can capture a wide range of information that cannot be captured by local features based on character or word N-grams. This paper examines the usefulness of universal sentence representations for evaluating the quality of machine translation. Although it is difficult to train sentence representations using small-scale translation datasets with manual evaluation, sentence representations trained from large-scale data in other tasks can improve the automatic evaluation of machine translation. Experimental results on the WMT-2016 dataset show that the proposed method achieves state-of-the-art performance with sentence representation features only.

Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models
Satoru Katsumata | Yukio Matsumura | Hayahide Yamagishi | Mamoru Komachi
Proceedings of ACL 2018, Student Research Workshop

Encoder-decoder models typically only employ words that are frequently used in the training corpus because of the computational costs and/or to exclude noisy words. However, this vocabulary set may still include words that interfere with learning in encoder-decoder models. This paper proposes a method for selecting more suitable words for learning encoders by utilizing not only frequency, but also co-occurrence information, which we capture using the HITS algorithm. The proposed method is applied to two tasks: machine translation and grammatical error correction. For Japanese-to-English translation, this method achieved a BLEU score that was 0.56 points more than that of a baseline. It also outperformed the baseline method for English grammatical error correction, with an F-measure that was 1.48 points higher.
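
The HITS algorithm mentioned above can be sketched as a simple power iteration over a directed graph. How the paper builds its graph from word co-occurrence is not stated here, so the edge list of word pairs, the function names, and the iteration count below are all assumptions made purely for illustration.

```python
import math


def _normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges: list[tuple[str, str]], iterations: int = 50):
    """Return (hub, authority) scores for a directed graph given as edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of hub scores of nodes pointing to it.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub of a node: sum of authority scores of nodes it points to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return hub, auth
```

Under this sketch, words could then be ranked by hub or authority score rather than by raw frequency alone; which score to use, and any threshold, would be further design choices not specified in the abstract.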

Complex Word Identification Based on Frequency in a Learner Corpus
Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We introduce the TMU systems for the Complex Word Identification (CWI) Shared Task 2018. TMU systems use random forest classifiers and regressors whose features are the number of characters, the number of words, and the frequency of target words in various corpora. Our simple systems performed best on 5 out of 12 tracks. Our ablation analysis revealed the usefulness of a learner corpus for the CWI task.

TMU System for SLAM-2018
Masahiro Kaneko | Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We introduce the TMU systems for the second language acquisition modeling shared task 2018 (Settles et al., 2018). To model learner error patterns, it is necessary to maintain a considerable amount of information regarding the type of exercises learners have been learning in the past and the manner in which they answered them. Tracking a learner’s extensive learning history, including their correct and mistaken answers, is essential to predict the learner’s future mistakes. Therefore, we propose a model which tracks the learner’s learning history efficiently. Our systems ranked fourth in the English and Spanish subtasks, and fifth in the French subtask.

Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications
Yuen-Hsien Tseng | Hsin-Hsi Chen | Vincent Ng | Mamoru Komachi
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

Neural Machine Translation of Logographic Language Using Sub-character Level Information
Longtu Zhang | Mamoru Komachi
Proceedings of the Third Conference on Machine Translation: Research Papers

Recent neural machine translation (NMT) systems have been greatly improved by encoder-decoder models with attention mechanisms and sub-word units. However, important differences between languages with logographic and alphabetic writing systems have long been overlooked. This study focuses on these differences and uses a simple approach to improve the performance of NMT systems utilizing decomposed sub-character level information for logographic languages. Our results indicate that our approach not only improves the translation capabilities of NMT systems between Chinese and English, but also further improves NMT systems between Chinese and Japanese, because it utilizes the shared information brought by similar sub-character units.

RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation
Hiroki Shimanaka | Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We introduce the RUSE metric for the WMT18 metrics shared task. Sentence embeddings can capture global information that cannot be captured by local features based on character or word N-grams. Although training sentence embeddings using small-scale translation datasets with manual evaluation is difficult, sentence embeddings trained from large-scale data in other tasks can improve the automatic evaluation of machine translation. We use a multi-layer perceptron regressor based on three types of sentence embeddings. The experimental results of the WMT16 and WMT17 datasets show that the RUSE metric achieves a state-of-the-art performance in both segment- and system-level metrics tasks with embedding features only.

Long Short-Term Memory for Japanese Word Segmentation
Yoshiaki Kitagawa | Mamoru Komachi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

The Rule of Three: Abstractive Text Summarization in Three Bullet Points
Tomonori Kodaira | Mamoru Komachi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Japanese Sentiment Classification using a Tree-Structured Long Short-Term Memory with Attention
Ryosuke Miyazaki | Mamoru Komachi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

TMU Japanese-Chinese Unsupervised NMT System for WAT 2018 Translation Task
Longtu Zhang | Yuting Zhao | Mamoru Komachi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

TMU Japanese-English Neural Machine Translation System using Generative Adversarial Network for WAT 2018
Yukio Matsumura | Satoru Katsumata | Mamoru Komachi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

2017

Building a Non-Trivial Paraphrase Corpus Using Multiple Machine Translation Systems
Yui Suzuki | Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of ACL 2017, Student Research Workshop

Grammatical Error Detection Using Error- and Grammaticality-Specific Word Embeddings
Masahiro Kaneko | Yuya Sakaizawa | Mamoru Komachi
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this study, we improve grammatical error detection by learning word embeddings that consider grammaticality and error patterns. Most existing algorithms for learning word embeddings model only the syntactic context of words, so classifiers treat erroneous and correct words as similar inputs. We address the problem of contextual information by considering learner errors. Specifically, we propose two models: one model that employs grammatical error patterns and another model that considers the grammaticality of the target word. We determine the grammaticality of n-gram sequences from the annotated error tags and extract grammatical error patterns for word embeddings from large-scale learner corpora. Experimental results show that a bidirectional long short-term memory model initialized by our word embeddings achieved state-of-the-art accuracy by a large margin in an English grammatical error detection task on the First Certificate in English dataset.

MIPA: Mutual Information Based Paraphrase Acquisition via Bilingual Pivoting
Tomoyuki Kajiwara | Mamoru Komachi | Daichi Mochihashi
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We present a pointwise mutual information (PMI)-based approach to formalize paraphrasability and propose a variant of PMI, called MIPA, for the paraphrase acquisition. Our paraphrase acquisition method first acquires lexical paraphrase pairs by bilingual pivoting and then reranks them by PMI and distributional similarity. The complementary nature of information from bilingual corpora and from monolingual corpora makes the proposed method robust. Experimental results show that the proposed method substantially outperforms bilingual pivoting and distributional similarity themselves in terms of metrics such as MRR, MAP, coverage, and Spearman’s correlation.
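
As background for the PMI-based scoring described above, the sketch below computes plain pointwise mutual information from co-occurrence counts. It shows only standard PMI, not the MIPA variant itself, and the function name and count-based interface are assumptions for illustration.

```python
import math


def pmi(pair_count: int, x_count: int, y_count: int, total: int) -> float:
    """Pointwise mutual information log2(p(x, y) / (p(x) * p(y))) from counts."""
    if min(pair_count, x_count, y_count, total) <= 0:
        raise ValueError("all counts must be positive")
    # p(x, y) / (p(x) * p(y)) simplifies to pair_count * total / (x_count * y_count).
    return math.log2(pair_count * total / (x_count * y_count))
```

A positive value means the two items co-occur more often than independence would predict; a value of zero means they co-occur exactly as often as chance.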

Improving Japanese-to-English Neural Machine Translation by Voice Prediction
Hayahide Yamagishi | Shin Kanouchi | Takayuki Sato | Mamoru Komachi
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

This study reports an attempt to predict the voice of the reference using information from the input sentences or previous input/output sentences. Our previous study presented a voice controlling method to generate sentences for neural machine translation, wherein it was demonstrated that the BLEU score improved when the voice of the generated sentence was controlled relative to that of the reference. However, it is impractical to use the reference information because we cannot discern the voice of the correct translation in advance. Thus, this study presents a voice prediction method for generated sentences for neural machine translation. When evaluating on Japanese-to-English translation, we obtain a 0.70-point improvement in BLEU using the predicted voice.

Improving Japanese-to-English Neural Machine Translation by Paraphrasing the Target Language
Yuuki Sekizawa | Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

Neural machine translation (NMT) produces sentences that are more fluent than those produced by statistical machine translation (SMT). However, NMT has a very high computational cost because of the high dimensionality of the output layer. Generally, NMT restricts the size of vocabulary, which results in infrequent words being treated as out-of-vocabulary (OOV) and degrades the performance of the translation. In evaluation, we achieved a statistically significant BLEU score improvement of 0.55-0.77 over the baselines including the state-of-the-art method.

pdf bib
Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus
Aizhan Imankulova | Takayuki Sato | Mamoru Komachi
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

Large-scale parallel corpora are indispensable to train highly accurate machine translators. However, manually constructed large-scale parallel corpora are not freely available in many language pairs. In previous studies, training data have been expanded using a pseudo-parallel corpus obtained using machine translation of the monolingual corpus in the target language. However, in low-resource language pairs in which only low-accuracy machine translation systems can be used, translation quality is reduced when a pseudo-parallel corpus is used naively. To improve machine translation performance with low-resource language pairs, we propose a method to expand the training data effectively via filtering the pseudo-parallel corpus using a quality estimation based on back-translation. As a result of experiments with three language pairs using small, medium, and large parallel corpora, language pairs with fewer training data filtered out more sentence pairs and improved BLEU scores more significantly.
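
The filtering idea can be sketched as follows, under stated assumptions: each pseudo-parallel pair comes with a round-trip translation of its target sentence, and a pair is kept only when the round trip stays close to the original. The abstract does not specify the quality estimate, so the unigram-overlap F1 used here is a simple stand-in, and the function names, triple layout, and threshold are hypothetical.

```python
from collections import Counter

def overlap_f1(reference, hypothesis):
    """Unigram-overlap F1 between two whitespace-tokenized sentences (stand-in quality score)."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    common = sum((ref & hyp).values())  # clipped count of shared tokens
    if common == 0:
        return 0.0
    precision = common / sum(hyp.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def filter_pseudo_parallel(triples, threshold):
    """Keep (pseudo_source, target) pairs whose round-trip translation scores at least `threshold`.

    `triples` holds hypothetical (pseudo_source, target, round_trip_target) tuples.
    """
    return [(src, tgt) for src, tgt, round_trip in triples
            if overlap_f1(tgt, round_trip) >= threshold]
```

Raising the threshold discards more sentence pairs, which trades corpus size for estimated quality.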

pdf bib
Tokyo Metropolitan University Neural Machine Translation System for WAT 2017
Yukio Matsumura | Mamoru Komachi
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

In this paper, we describe our neural machine translation (NMT) system, which is based on the attention-based NMT and uses long short-term memories (LSTM) as RNN. We implemented beam search and ensemble decoding in the NMT system. The system was tested on the 4th Workshop on Asian Translation (WAT 2017) shared tasks. In our experiments, we participated in the scientific paper subtasks and attempted Japanese-English, English-Japanese, and Japanese-Chinese translation tasks. The experimental results showed that implementation of beam search and ensemble decoding can effectively improve the translation quality.

pdf bib
Suggesting Sentences for ESL using Kernel Embeddings
Kent Shioda | Mamoru Komachi | Rue Ikeya | Daichi Mochihashi
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

Sentence retrieval is an important NLP application for English as a Second Language (ESL) learners. ESL learners are familiar with web search engines, but generic web search results may not be adequate for composing documents in a specific domain. However, if we build our own search system specialized to a domain, it may be subject to the data sparseness problem. The recently proposed word2vec partially addresses the data sparseness problem, but fails to extract sentences relevant to queries owing to the modeling of the latent intent of the query. Thus, we propose a method of retrieving example sentences using kernel embeddings and N-gram windows. This method implicitly models the latent intent of the query and sentences, and alleviates the problem of noisy alignment. Our results show that our method achieved higher precision in sentence retrieval for ESL in the domain of a university press release corpus, as compared to a previous unsupervised method used for a semantic textual similarity task.

2016

pdf bib
Analysis of English Spelling Errors in a Word-Typing Game
Ryuichi Tachibana | Mamoru Komachi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The emergence of the web has created the need to detect and correct noisy consumer-generated texts. Most of the previous studies on English spelling-error extraction collected English spelling errors from web services such as Twitter by using the edit distance or from input logs utilizing crowdsourcing. However, in the former approach, it is not clear which word corresponds to the spelling error, and the latter approach requires an annotation cost for the crowdsourcing. One notable exception is Rodrigues and Rytting (2012), who proposed to extract English spelling errors by using a word-typing game. Their approach saves the cost of crowdsourcing, and guarantees an exact alignment between the word and the spelling error. However, they did not examine whether the extracted spelling error corpora reflect the usual writing process such as writing a document. Therefore, we propose a new correctable word-typing game that is more similar to the actual writing process. Experimental results showed that we can regard typing-game logs as a source of spelling errors.

pdf bib
Building a Monolingual Parallel Corpus for Text Simplification Using Sentence Similarity Based on Alignment between Word Embeddings
Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Methods for text simplification using the framework of statistical machine translation have been extensively studied in recent years. However, building the monolingual parallel corpus necessary for training the model requires costly human annotation. Monolingual parallel corpora for text simplification have therefore been built only for a limited number of languages, such as English and Portuguese. To obviate the need for human annotation, we propose an unsupervised method that automatically builds the monolingual parallel corpus for text simplification using sentence similarity based on word embeddings. For any sentence pair comprising a complex sentence and its simple counterpart, we employ a many-to-one method of aligning each word in the complex sentence with the most similar word in the simple sentence and compute sentence similarity by averaging these word similarities. The experimental results demonstrate the excellent performance of the proposed method in a monolingual parallel corpus construction task for English text simplification. The results also demonstrate that text simplification using the framework of statistical machine translation is more accurate when trained on the corpus built by the proposed method than when trained on the existing corpora.
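
The many-to-one alignment described above can be sketched in a few lines: each word of the complex sentence is aligned to its most similar word in the simple sentence, and the per-word similarities are averaged. This is a minimal sketch assuming cosine similarity over a plain dictionary of word vectors; the function names and the `embeddings` mapping are illustrative and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def alignment_similarity(complex_words, simple_words, embeddings):
    """Average, over complex-sentence words, of the best cosine match in the simple sentence."""
    if not complex_words or not simple_words:
        return 0.0
    total = 0.0
    for word in complex_words:
        # Many-to-one: several complex words may align to the same simple word.
        total += max(cosine(embeddings[word], embeddings[s]) for s in simple_words)
    return total / len(complex_words)
```

Sentence pairs whose score exceeds a chosen cutoff would then be treated as candidate complex-simple pairs; the cutoff itself is a tuning choice that this sketch leaves open.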

pdf bib
Controlled and Balanced Dataset for Japanese Lexical Simplification
Tomonori Kodaira | Tomoyuki Kajiwara | Mamoru Komachi
Proceedings of the ACL 2016 Student Research Workshop

pdf bib
Disaster Analysis using User-Generated Weather Report
Yasunobu Asakura | Masatsugu Hangyo | Mamoru Komachi
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Information extraction from user-generated text has gained much attention with the growth of the Web. Disaster analysis using information from social media provides valuable, real-time, geolocation information for helping people caught up in these disasters. However, it is not convenient to analyze texts posted on social media because disaster keywords match any texts that contain words. For collecting posts about a disaster from social media, we need to develop a classifier to filter posts irrelevant to disasters. Moreover, because of the nature of social media, we can take advantage of posts that come with GPS information. However, a post does not always refer to an event occurring at the place where it has been posted. Therefore, we propose a new task of classifying whether a flood disaster occurred, in addition to predicting the geolocation of events from user-generated text. We report the annotation of the flood disaster corpus and develop a classifier to demonstrate the use of this corpus for disaster analysis.

pdf bib
Japanese-English Machine Translation of Recipe Texts
Takayuki Sato | Jun Harashima | Mamoru Komachi
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

Concomitant with the globalization of food culture, demand for the recipes of specialty dishes has been increasing. The recent growth in recipe sharing websites and food blogs has resulted in numerous recipe texts being available for diverse foods in various languages. However, little work has been done on machine translation of recipe texts. In this paper, we address the task of translating recipes and investigate the advantages and disadvantages of traditional phrase-based statistical machine translation and more recent neural machine translation. Specifically, we translate Japanese recipes into English, analyze errors in the translated recipes, and discuss available room for improvements.

pdf bib
Neural Reordering Model Considering Phrase Translation and Word Alignment for Phrase-based Translation
Shin Kanouchi | Katsuhito Sudoh | Mamoru Komachi
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

This paper presents an improved lexicalized reordering model for phrase-based statistical machine translation using a deep neural network. Lexicalized reordering suffers from reordering ambiguity, data sparseness, and noise in a phrase table. The previous neural reordering model successfully solves the first and second problems but fails to address the third one. Therefore, we propose new features using phrase translation and word alignment to construct phrase vectors to handle inherently noisy phrase translation pairs. The experimental results show that our proposed method improves the accuracy of phrase reordering. We confirm that the proposed method works well with phrase pairs including NULL alignments.

pdf bib
Controlling the Voice of a Sentence in Japanese-to-English Neural Machine Translation
Hayahide Yamagishi | Shin Kanouchi | Takayuki Sato | Mamoru Komachi
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

In machine translation, we must consider the difference in expression between languages. For example, the active/passive voice may change in Japanese-English translation. The same verb in Japanese may be translated into different voices at each translation because the voice of a generated sentence cannot be determined using only the information of the Japanese sentence. Machine translation systems should consider the information structure to improve the coherence of the output by using several topicalization techniques such as passivization. Therefore, this paper reports on our attempt to control the voice of the sentence generated by an encoder-decoder model. To control the voice of the generated sentence, we added the voice information of the target sentence to the source sentence during the training. We then generated sentences with a specified voice by appending the voice information to the source sentence. We observed experimentally whether the voice could be controlled. The results showed that we could control the voice of the generated sentence with 85.0% accuracy on average. In the evaluation of Japanese-English translation, we obtained a 0.73-point improvement in BLEU score by using gold voice labels.
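
The labeling step can be pictured as a small preprocessing function that appends a voice tag to the tokenized source sentence, both when preparing training data and when requesting a particular voice at test time. This is only a sketch: the tag strings `<Active>` and `<Passive>` and the function name are assumptions, since the abstract does not give the exact label format.

```python
# Hypothetical label tokens; the abstract does not specify the actual strings.
VOICE_LABELS = {"active": "<Active>", "passive": "<Passive>"}

def add_voice_label(source_tokens, voice):
    """Return a copy of the source tokens with the voice label appended as the final token."""
    if voice not in VOICE_LABELS:
        raise ValueError(f"unknown voice: {voice!r}")
    return list(source_tokens) + [VOICE_LABELS[voice]]
```

During training the label would come from the voice of the reference translation; at test time it is chosen by the user or by a separate predictor.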

2015

pdf bib
Improving Chinese Grammatical Error Correction with Corpus Augmentation and Hierarchical Phrase-based Statistical Machine Translation
Yinchen Zhao | Mamoru Komachi | Hiroshi Ishikawa
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

pdf bib
Source Phrase Segmentation and Translation for Japanese-English Translation Using Dependency Structure
Junki Matsuo | Kenichi Ohwada | Mamoru Komachi
Proceedings of the 2nd Workshop on Asian Translation (WAT2015)

pdf bib
Who caught a cold ? - Identifying the subject of a symptom
Shin Kanouchi | Mamoru Komachi | Naoaki Okazaki | Eiji Aramaki | Hiroshi Ishikawa
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Disease Event Detection based on Deep Modality Analysis
Yoshiaki Kitagawa | Mamoru Komachi | Eiji Aramaki | Naoaki Okazaki | Hiroshi Ishikawa
Proceedings of the ACL-IJCNLP 2015 Student Research Workshop

pdf bib
Japanese Sentiment Classification with Stacked Denoising Auto-Encoder using Distributed Word Representation
Peinan Zhang | Mamoru Komachi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

2014

pdf bib
Predicate-Argument Structure-based Preordering for Japanese-English Statistical Machine Translation of Scientific Papers
Kenichi Ohwada | Ryosuke Miyazaki | Mamoru Komachi
Proceedings of the 1st Workshop on Asian Translation (WAT2014)

2013

pdf bib
NAIST at the NLI 2013 Shared Task
Tomoya Mizumoto | Yuta Hayashibe | Keisuke Sakaguchi | Mamoru Komachi | Yuji Matsumoto
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
NAIST at 2013 CoNLL Grammatical Error Correction Shared Task
Ippei Yoshimoto | Tomoya Kose | Kensuke Mitsuzawa | Keisuke Sakaguchi | Tomoya Mizumoto | Yuta Hayashibe | Mamoru Komachi | Yuji Matsumoto
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

pdf bib
Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners
Keisuke Sakaguchi | Yuki Arase | Mamoru Komachi
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
A Learner Corpus-based Approach to Verb Suggestion for ESL
Yu Sawai | Mamoru Komachi | Yuji Matsumoto
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Towards Automatic Error Type Classification of Japanese Language Learners’ Writings
Hiromi Oyama | Mamoru Komachi | Yuji Matsumoto
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)

2012

pdf bib
Tense and Aspect Error Correction for ESL Learners Using Global Context
Toshikazu Tajiri | Mamoru Komachi | Yuji Matsumoto
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese
Toshinobu Ogiso | Mamoru Komachi | Yasuharu Den | Yuji Matsumoto
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In order to construct an annotated diachronic corpus of Japanese, we propose to create a new dictionary for morphological analysis of Early Middle Japanese (Classical Japanese) based on UniDic, a dictionary for Contemporary Japanese. Differences between the Early Middle Japanese and Contemporary Japanese, which prevent a naïve adaptation of UniDic to Early Middle Japanese, are found at the levels of lexicon, morphology, grammar, orthography and pronunciation. In order to overcome these problems, we extended dictionary entries and created a training corpus of Early Middle Japanese to adapt UniDic for Contemporary Japanese to Early Middle Japanese. Experimental results show that the proposed UniDic-EMJ, a new dictionary for Early Middle Japanese, achieves as high accuracy (97%) as needed for the linguistic research on lexicon and grammar in Japanese classical text analysis.

pdf bib
NAIST at the HOO 2012 Shared Task
Keisuke Sakaguchi | Yuta Hayashibe | Shuhei Kondo | Lis Kanashiro | Tomoya Mizumoto | Mamoru Komachi | Yuji Matsumoto
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

pdf bib
Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
Keisuke Sakaguchi | Tomoya Mizumoto | Mamoru Komachi | Yuji Matsumoto
Proceedings of COLING 2012

pdf bib
The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
Tomoya Mizumoto | Yuta Hayashibe | Mamoru Komachi | Masaaki Nagata | Yuji Matsumoto
Proceedings of COLING 2012: Posters

2011

pdf bib
Using the Mutual k-Nearest Neighbor Graphs for Semi-supervised Classification on Natural Language Data
Kohei Ozaki | Masashi Shimbo | Mamoru Komachi | Yuji Matsumoto
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

pdf bib
Narrative Schema as World Knowledge for Coreference Resolution
Joseph Irwin | Mamoru Komachi | Yuji Matsumoto
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

pdf bib
Error Correcting Romaji-kana Conversion for Japanese Language Education
Seiji Kasahara | Mamoru Komachi | Masaaki Nagata | Yuji Matsumoto
Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)

pdf bib
Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners
Tomoya Mizumoto | Mamoru Komachi | Masaaki Nagata | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Japanese Predicate Argument Structure Analysis Exploiting Argument Position and Type
Yuta Hayashibe | Mamoru Komachi | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature
Teruaki Oka | Mamoru Komachi | Toshinobu Ogiso | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Japanese Abbreviation Expansion with Query and Clickthrough Logs
Kei Uchiumi | Mamoru Komachi | Keigo Machinaga | Toshiyuki Maezawa | Toshinori Satou | Yoshinori Kobayashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
HITS-based Seed Selection and Stop List Construction for Bootstrapping
Tetsuo Kiso | Masashi Shimbo | Mamoru Komachi | Yuji Matsumoto
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2009

pdf bib
Learning Semantic Categories from Clickthrough Logs
Mamoru Komachi | Shimpei Makimoto | Kei Uchiumi | Manabu Sassano
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

pdf bib
Minimally Supervised Learning of Semantic Knowledge from Query Logs
Mamoru Komachi | Hisami Suzuki
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Graph-based Analysis of Semantic Drift in Espresso-like Bootstrapping Algorithms
Mamoru Komachi | Taku Kudo | Masashi Shimbo | Yuji Matsumoto
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

pdf bib
Annotating a Japanese Text Corpus with Predicate-Argument and Coreference Relations
Ryu Iida | Mamoru Komachi | Kentaro Inui | Yuji Matsumoto
Proceedings of the Linguistic Annotation Workshop