Chenhui Chu


2020

Text Classification with Negative Supervision
Sora Ohashi | Junya Takayama | Tomoyuki Kajiwara | Chenhui Chu | Yuki Arase
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Advanced pre-trained models for text representation have achieved state-of-the-art performance on various text classification tasks. However, the discrepancy between the semantic similarity of texts and labelling standards affects classifiers, i.e., it leads to lower performance in cases where classifiers should assign different labels to semantically similar texts. To address this problem, we propose a simple multitask learning model that uses negative supervision. Specifically, our model encourages texts with different labels to have distinct representations. Comprehensive experiments show that our model outperforms the state-of-the-art pre-trained model on single- and multi-label classification, sentence and document classification, and classification in three different languages.
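The abstract does not give the loss formulation, so the following is only a minimal sketch of how a negative-supervision term might be added to a classification objective: it penalizes cosine similarity between encoder representations of texts that carry different labels. The hinge form, the weighting factor alpha, and the use of fixed-size encoder vectors are assumptions for illustration, not the authors' exact model.

```python
# Hedged sketch of a negative-supervision auxiliary loss (not the authors' exact model).
# Assumption: each text is already encoded into a fixed-size vector (e.g., a [CLS] output).
import torch
import torch.nn.functional as F

def negative_supervision_loss(reprs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Penalize high cosine similarity between representations that have different labels."""
    sims = F.cosine_similarity(reprs.unsqueeze(1), reprs.unsqueeze(0), dim=-1)  # (B, B) pairwise
    diff_label = (labels.unsqueeze(1) != labels.unsqueeze(0)).float()           # 1 where labels differ
    # Hinge on similarity so only overly similar, differently-labelled pairs are pushed apart.
    return (diff_label * torch.clamp(sims, min=0.0)).sum() / diff_label.sum().clamp(min=1.0)

def total_loss(logits, reprs, labels, alpha=0.1):
    # Multitask objective: standard cross-entropy plus the negative-supervision term.
    return F.cross_entropy(logits, labels) + alpha * negative_supervision_loss(reprs, labels)
```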

Constructing a Public Meeting Corpus
Koji Tanaka | Chenhui Chu | Haolin Ren | Benjamin Renoust | Yuta Nakashima | Noriko Takemura | Hajime Nagahara | Takao Fujikawa
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we propose a full analysis pipeline for a large corpus covering a century of public meetings in historical Australian newspapers, from construction to visual exploration. The corpus construction method is based on image processing and OCR. We digitize and transcribe texts on the specific topic of public meetings. Experiments show that our proposed method achieves an F-score of 87.8% for corpus construction. Building on this corpus, we built a content search tool for temporal and semantic content analysis.

Annotation of Adverse Drug Reactions in Patients’ Weblogs
Yuki Arase | Tomoyuki Kajiwara | Chenhui Chu
Proceedings of the 12th Language Resources and Evaluation Conference

Adverse drug reactions are a severe problem that significantly degrades patients' quality of life or even threatens their lives. Patient-generated texts available on the web have been gaining attention as a promising source of information in this regard. While previous studies annotated such patient-generated content, they reported only limited information, such as whether a text described an adverse drug reaction or not. Further, they only annotated short texts of a few sentences crawled from online forums and social networking services. The dataset we present in this paper is unique for the richness of its annotated information, including detailed descriptions of drug reactions with full context. We crawled patients' weblog articles shared on an online patient-networking platform and annotated the effects of the drugs reported therein. We identified spans describing drug reactions and assigned labels for related drug names, standard codes for the symptoms of the reactions, and types of effects. As a first dataset, we annotated 677 drug reactions with these detailed labels based on 169 weblog articles by Japanese lung cancer patients. Our annotation dataset is publicly available at our website (https://yukiar.github.io/adr-jp/) for further research on the detection of adverse drug reactions and, more broadly, on patient-generated text processing.

IDSOU at WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets
Sora Ohashi | Tomoyuki Kajiwara | Chenhui Chu | Noriko Takemura | Yuta Nakashima | Hajime Nagahara
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

We introduce the IDSOU submission for WNUT-2020 Task 2: identification of informative COVID-19 English tweets. Our system is an ensemble of pre-trained language models such as BERT. We ranked 16th by F1 score.
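The paper only states that the system ensembles pre-trained language models such as BERT, without specifying the scheme; the sketch below shows one common choice, averaging class probabilities from several fine-tuned classifiers. The assumption that models follow the Hugging Face sequence-classification interface is illustrative.

```python
# Minimal sketch of probability-averaging over fine-tuned classifiers (one common ensembling
# choice; the paper does not specify the exact scheme used).
import torch

def ensemble_predict(models, encoded_batch):
    """Average softmax probabilities from several classifiers and pick the argmax label."""
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            logits = model(**encoded_batch).logits        # e.g., a *ForSequenceClassification model
            probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)  # (batch,) predicted labels
```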

Multilingual Neural Machine Translation
Raj Dabre | Chenhui Chu | Anoop Kunchukuttan
Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts

The advent of neural machine translation (NMT) has opened up exciting research in building multilingual translation systems, i.e., translation models that can handle more than one language pair. Many advances have been made that enable (1) improving translation for low-resource languages via transfer learning from high-resource languages; and (2) building compact translation models spanning multiple languages. In this tutorial, we will cover the latest advances in NMT approaches that leverage multilingualism, especially to enhance low-resource translation. In particular, we will focus on the following topics: modeling parameter sharing for multi-way models, massively multilingual models, training protocols, language divergence, transfer learning, zero-shot/zero-resource learning, pivoting, multilingual pre-training and multi-source translation.

Double Attention-based Multimodal Neural Machine Translation with Semantic Image Regions
Yuting Zhao | Mamoru Komachi | Tomoyuki Kajiwara | Chenhui Chu
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Existing studies on multimodal neural machine translation (MNMT) have mainly focused on the effect of combining visual and textual modalities to improve translations. However, it has been suggested that the visual modality is only marginally beneficial. Conventional visual attention mechanisms select visual features from equally-sized grids generated by convolutional neural networks (CNNs), and may have had only modest effects on aligning visual concepts with textual objects, because grid-based visual features do not capture semantic information. In contrast, we propose applying semantic image regions to MNMT by integrating visual and textual features through two individual attention mechanisms (double attention). We conducted experiments on the Multi30k dataset and achieved improvements of 0.5 and 0.9 BLEU points for the English-German and English-French translation tasks, compared with an MNMT baseline using grid visual features. We also demonstrate concrete improvements in translation performance that stem from the use of semantic image regions.
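As a rough illustration of the double-attention idea only (not the paper's exact architecture), the sketch below computes two independent attention contexts at each decoder step, one over textual encoder states and one over pre-extracted region-level visual features, and fuses them by concatenation. The layer sizes and the concatenation-based fusion are assumptions.

```python
# Hedged sketch of a decoder step with two independent attention mechanisms
# (one over text encoder states, one over semantic image-region features).
import torch
import torch.nn as nn

class DotAttention(nn.Module):
    def __init__(self, query_dim, key_dim):
        super().__init__()
        self.proj = nn.Linear(query_dim, key_dim, bias=False)

    def forward(self, query, keys):                # query: (B, Dq), keys: (B, T, Dk)
        scores = torch.bmm(keys, self.proj(query).unsqueeze(-1)).squeeze(-1)  # (B, T)
        weights = torch.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)               # (B, Dk) context

class DoubleAttentionStep(nn.Module):
    def __init__(self, hidden=512, text_dim=512, region_dim=2048, vocab=10000):
        super().__init__()
        self.text_attn = DotAttention(hidden, text_dim)
        self.image_attn = DotAttention(hidden, region_dim)
        self.out = nn.Linear(hidden + text_dim + region_dim, vocab)

    def forward(self, dec_state, text_states, region_feats):
        ctx_text = self.text_attn(dec_state, text_states)     # attend over source words
        ctx_image = self.image_attn(dec_state, region_feats)  # attend over image regions
        return self.out(torch.cat([dec_state, ctx_text, ctx_image], dim=-1))  # next-word logits
```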

2019

Exploiting Multilingualism through Multistage Fine-Tuning for Low-Resource Neural Machine Translation
Raj Dabre | Atsushi Fujita | Chenhui Chu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This paper highlights the impressive utility of multi-parallel corpora for transfer learning in a one-to-many low-resource neural machine translation (NMT) setting. We report a systematic comparison of multistage fine-tuning configurations, consisting of (1) pre-training on an external large (209k–440k) parallel corpus for English and a helping target language, (2) mixed pre-training or fine-tuning on a mixture of the external and low-resource (18k) target parallel corpora, and (3) pure fine-tuning on the target parallel corpora. Our experiments confirm that multi-parallel corpora are extremely useful despite their scarcity and content-wise redundancy, thus exhibiting the true power of multilingualism. Even when the helping target language is not one of the target languages of our concern, our multistage fine-tuning can yield 3–9 BLEU point gains over a simple one-to-one model.
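A minimal sketch of the three-stage schedule described above, written as it might be scripted: the `train` function, checkpoint names, and corpus paths are placeholders standing in for whichever NMT toolkit and settings are actually used.

```python
# Hedged sketch of the three-stage fine-tuning schedule; `train` is a placeholder.

def train(corpora, init_checkpoint=None, save_as=None):
    """Placeholder for an NMT toolkit's training call."""
    print(f"training on {corpora}, starting from {init_checkpoint}, saving to {save_as}")

EXTERNAL = "external_en-helping.parallel"   # large (209k-440k) corpus for English + helping language
TARGET   = "target_en-xx.parallel"          # low-resource (18k) target parallel corpus

# Stage 1: pre-train on the external corpus only.
train([EXTERNAL], save_as="stage1.pt")
# Stage 2: mixed pre-training / fine-tuning on the external + target corpora.
train([EXTERNAL, TARGET], init_checkpoint="stage1.pt", save_as="stage2.pt")
# Stage 3: pure fine-tuning on the target parallel corpora alone.
train([TARGET], init_checkpoint="stage2.pt", save_as="stage3.pt")
```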

2018

A Survey of Domain Adaptation for Neural Machine Translation
Chenhui Chu | Rui Wang
Proceedings of the 27th International Conference on Computational Linguistics

Neural machine translation (NMT) is a deep learning based approach to machine translation that yields state-of-the-art translation performance in scenarios where large-scale parallel corpora are available. Although high-quality, domain-specific translation is crucial in the real world, domain-specific corpora are usually scarce or nonexistent, and thus vanilla NMT performs poorly in such scenarios. Domain adaptation, which leverages both out-of-domain parallel corpora and monolingual corpora for in-domain translation, is therefore very important for domain-specific translation. In this paper, we give a comprehensive survey of the state-of-the-art domain adaptation techniques for NMT.

iParaphrasing: Extracting Visually Grounded Paraphrases via an Image
Chenhui Chu | Mayu Otani | Yuta Nakashima
Proceedings of the 27th International Conference on Computational Linguistics

A paraphrase is a restatement of the meaning of a text in other words. Paraphrases have been studied to enhance the performance of many natural language processing tasks. In this paper, we propose a novel task, iParaphrasing, to extract visually grounded paraphrases (VGPs), which are different phrasal expressions describing the same visual concept in an image. These extracted VGPs have the potential to improve language and image multimodal tasks such as visual question answering and image captioning. How to model the similarity between VGPs is the key to iParaphrasing. We apply various existing methods, propose a novel neural network-based method with image attention, and report the results of this first attempt at iParaphrasing.

Recursive Neural Network Based Preordering for English-to-Japanese Machine Translation
Yuki Kawara | Chenhui Chu | Yuki Arase
Proceedings of ACL 2018, Student Research Workshop

The word order difference between source and target languages significantly influences translation quality. Preordering can effectively address this problem. Previous preordering methods require manual feature design, which makes language-dependent design difficult. In this paper, we propose a preordering method based on recursive neural networks that learn features from raw inputs. Experiments show the proposed method is comparable to the state-of-the-art method, but without manual feature design.
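A hedged sketch of the preordering step itself, assuming a binarized parse tree and a learned swap/keep decision at each internal node. In the actual model that decision comes from a recursive neural network composing child representations from raw word inputs; here `should_swap` is only an abstract stand-in.

```python
# Hedged sketch of preordering via node-level swap decisions over a binarized parse tree.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    word: Optional[str] = None          # set for leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def should_swap(node: Node) -> bool:
    """Placeholder for the learned classifier (recursive NN over child representations)."""
    return False

def preorder(node: Node) -> List[str]:
    """Return the source words reordered toward target-language word order."""
    if node.word is not None:           # leaf node
        return [node.word]
    left, right = preorder(node.left), preorder(node.right)
    return right + left if should_swap(node) else left + right
```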

Osaka University MT Systems for WAT 2018: Rewarding, Preordering, and Domain Adaptation
Yuki Kawara | Yuto Takebayashi | Chenhui Chu | Yuki Arase
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

2017

An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation
Chenhui Chu | Raj Dabre | Sadao Kurohashi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper, we propose a novel domain adaptation method named “mixed fine tuning” for neural machine translation (NMT). We combine two existing approaches, namely fine-tuning and multi-domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine-tune it on a parallel corpus that is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with artificial tags to indicate specific domains. We empirically compare our proposed method against fine-tuning and multi-domain methods and discuss its benefits and shortcomings.
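A hedged sketch of the corpus preparation step implied by the abstract: append an artificial domain tag to each source sentence and mix the in-domain and out-of-domain data for the fine-tuning stage. The tag tokens and the oversampling option are illustrative; the paper's exact formatting may differ.

```python
# Hedged sketch of tagging and mixing corpora for "mixed fine tuning".
import random

def tag_corpus(pairs, tag):
    """Prepend an artificial domain tag to every source sentence."""
    return [(f"{tag} {src}", tgt) for src, tgt in pairs]

def build_mixed_corpus(in_domain, out_domain, oversample_in=1):
    """Mix tagged in-domain and out-of-domain pairs, optionally oversampling the in-domain data."""
    mixed = tag_corpus(out_domain, "<2out>") + tag_corpus(in_domain, "<2in>") * oversample_in
    random.shuffle(mixed)
    return mixed

# Usage: first train on tag_corpus(out_domain, "<2out>"),
# then fine-tune the resulting model on build_mixed_corpus(in_domain, out_domain).
```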

2016

Paraphrasing Out-of-Vocabulary Words with Word Embeddings and Semantic Lexicons for Low Resource Statistical Machine Translation
Chenhui Chu | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Out-of-vocabulary (OOV) words are a crucial problem in statistical machine translation (SMT) with low resources. OOV paraphrasing, which augments the translation model for OOV words by using the translation knowledge of their paraphrases, has been proposed to address the OOV problem. In this paper, we propose using word embeddings and semantic lexicons for OOV paraphrasing. Experiments conducted on a low-resource setting of the OLYMPICS task of IWSLT 2012 verify the effectiveness of our proposed method.
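A rough sketch of the core lookup only, assuming pre-trained word embeddings and a semantic lexicon mapping words to synonym sets; the paraphrase scoring and translation-model augmentation in the paper are more involved than this.

```python
# Hedged sketch: propose in-vocabulary paraphrases for an OOV source word via embedding
# similarity, optionally restricted to a semantic lexicon's synonym set.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def paraphrase_oov(oov, embeddings, translation_vocab, lexicon=None, top_k=5):
    """Rank in-vocabulary candidates as paraphrases of an OOV word.
    embeddings: dict word -> vector; translation_vocab: set of words the SMT model can translate."""
    if oov not in embeddings:
        return []
    candidates = lexicon.get(oov, translation_vocab) if lexicon else translation_vocab
    scored = [(w, cosine(embeddings[oov], embeddings[w]))
              for w in candidates if w in translation_vocab and w in embeddings]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```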

Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons
Antoine Bourlon | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Sentence alignment is the task of aligning the parallel sentences in a translated article pair. This paper describes a method that performs sentence boundary detection and alignment simultaneously, which significantly improves alignment accuracy for languages such as Chinese with uncertain sentence boundaries. It relies on the definition of hard (certain) and soft (uncertain) punctuation delimiters, the latter being possibly ignored to optimize the alignment result. The alignment method is used in combination with lexicons automatically generated from the input article pairs using pivot-based MT, achieving better coverage of the input words with fewer entries than pre-existing dictionaries. Pivot-based MT makes it possible to build dictionaries for language pairs that have scarce parallel data. The alignment method is implemented in a tool that will be freely available in the near future.
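A hedged sketch of the hard/soft delimiter idea: always split at hard punctuation, and treat each soft punctuation mark as an optional split point whose keep/ignore decision is made elsewhere by whatever choice optimizes the alignment score. The delimiter sets and the boolean-flag interface are illustrative assumptions.

```python
# Hedged sketch of candidate segmentation with hard (certain) and soft (uncertain) delimiters.
HARD = "。！？"   # always end a sentence
SOFT = "；：，"   # may end a sentence, depending on the alignment objective

def candidate_segments(text, keep_soft=None):
    """Split `text` at hard delimiters and at those soft delimiters flagged True in `keep_soft`,
    a list with one boolean per soft-delimiter occurrence (None means ignore all soft splits)."""
    pieces, buf, soft_index = [], "", 0
    for ch in text:
        buf += ch
        if ch in HARD:
            pieces.append(buf); buf = ""
        elif ch in SOFT:
            if keep_soft and soft_index < len(keep_soft) and keep_soft[soft_index]:
                pieces.append(buf); buf = ""
            soft_index += 1
    if buf:
        pieces.append(buf)
    return pieces
```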

Parallel Sentence Extraction from Comparable Corpora with Neural Network Features
Chenhui Chu | Raj Dabre | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parallel corpora are crucial for machine translation (MT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract parallel sentences from them for MT. In this paper, we exploit neural network features acquired from neural MT for parallel sentence extraction. We observe significant improvements in both sentence extraction accuracy and MT performance.
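A hedged sketch of the general recipe this line of work follows: score candidate sentence pairs with a binary classifier whose features include scores derived from a neural MT model. The feature names and the logistic-regression classifier below are illustrative assumptions, not the paper's exact feature set.

```python
# Hedged sketch: binary classification of candidate sentence pairs using NMT-derived features.
from sklearn.linear_model import LogisticRegression

def pair_features(src, tgt, nmt_score, dict_overlap, len_ratio):
    """Feature vector for one candidate pair; `nmt_score` could be, e.g., a forced-decoding
    log-probability of tgt given src under a neural MT model (hypothetical feature names)."""
    return [nmt_score, dict_overlap, len_ratio, abs(len(src.split()) - len(tgt.split()))]

def train_extractor(labeled_pairs):
    """labeled_pairs: iterable of (feature_vector, is_parallel) built with pair_features()."""
    X, y = zip(*labeled_pairs)
    return LogisticRegression(max_iter=1000).fit(list(X), list(y))
```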

Consistent Word Segmentation, Part-of-Speech Tagging and Dependency Labelling Annotation for Chinese Language
Mo Shen | Wingmui Li | HyunJeong Choe | Chenhui Chu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we propose a new annotation approach to Chinese word segmentation, part-of-speech (POS) tagging and dependency labelling that aims to overcome the two major issues in traditional morphology-based annotation: Inconsistency and data sparsity. We re-annotate the Penn Chinese Treebank 5.0 (CTB5) and demonstrate the advantages of this approach compared to the original CTB5 annotation through word segmentation, POS tagging and machine translation experiments.

Dependency Forest based Word Alignment
Hitoshi Otsuki | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the ACL 2016 Student Research Workshop

Cross-language Projection of Dependency Trees with Constrained Partial Parsing for Tree-to-Tree Machine Translation
Yu Shen | Chenhui Chu | Fabien Cromieres | Sadao Kurohashi
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

Kyoto University Participation to WAT 2016
Fabien Cromieres | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

We describe here our approaches and results on the WAT 2016 shared translation tasks. We tried to use both an example-based machine translation (MT) system and a neural MT system. We report very good translation results, especially when using neural MT for Chinese-to-Japanese translation.

SCTB: A Chinese Treebank in Scientific Domain
Chenhui Chu | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Treebanks are crucial for natural language processing (NLP). In this paper, we present our work on annotating a Chinese treebank in the scientific domain (SCTB), to address the lack of Chinese treebanks in this domain. Chinese analysis and machine translation experiments conducted using this treebank indicate that the annotated treebank can significantly improve performance on both tasks. This treebank is released to promote Chinese NLP research in the scientific domain.

2015

KyotoEBMT System Description for the 2nd Workshop on Asian Translation
John Richardson | Raj Dabre | Chenhui Chu | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 2nd Workshop on Asian Translation (WAT2015)

Large-scale Dictionary Construction via Pivot-based Statistical Machine Translation with Significance Pruning and Neural Network Features
Raj Dabre | Chenhui Chu | Fabien Cromieres | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

Cross-language Projection of Dependency Trees for Tree-to-tree Machine Translation
Yu Shen | Chenhui Chu | Fabien Cromieres | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters

2014

Improving Statistical Machine Translation Accuracy Using Bilingual Lexicon Extraction with Paraphrases
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

Constructing a Chinese—Japanese Parallel Corpus from Wikipedia
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Parallel corpora are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese―Japanese. As comparable corpora are far more available, many studies have been conducted to automatically construct parallel corpora from comparable corpora. This paper presents a robust parallel sentence extraction system for constructing a Chinese―Japanese parallel corpus from Wikipedia. The system is inspired by previous studies that mainly consist of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. We improve the system by using the common Chinese characters for filtering and two novel feature sets for classification. Experiments show that our system performs significantly better than the previous studies in terms of both parallel sentence extraction accuracy and SMT performance. Using the system, we construct a Chinese―Japanese parallel corpus with more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at http://orchid.kuee.kyoto-u.ac.jp/~chu/resource/wiki_zh_ja.tgz.
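A hedged sketch of the common-Chinese-character filtering idea: keep only candidate Chinese-Japanese sentence pairs that share enough Han characters across the two sides. The overlap threshold and the optional kanji-to-hanzi normalization table are placeholders; the paper uses its own mapping and additional classification features.

```python
# Hedged sketch of a candidate filter based on shared Chinese characters.
import unicodedata

def han_chars(text, mapping=None):
    """Collect Han characters, optionally normalizing Japanese kanji to Chinese variants."""
    chars = {c for c in text if 'CJK UNIFIED' in unicodedata.name(c, '')}
    return {mapping.get(c, c) for c in chars} if mapping else chars

def passes_filter(zh_sentence, ja_sentence, mapping=None, min_overlap=0.3):
    """Keep a candidate pair only if enough Han characters are shared between the two sides."""
    zh, ja = han_chars(zh_sentence, mapping), han_chars(ja_sentence, mapping)
    if not zh or not ja:
        return False
    return len(zh & ja) / min(len(zh), len(ja)) >= min_overlap
```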

2013

Accurate Parallel Fragment Extraction from Quasi–Comparable Corpora using Alignment Model and Translation Lexicon
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Chinese–Japanese Parallel Sentence Extraction from Quasi–Comparable Corpora
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation
Chenhui Chu | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

Chinese Characters Mapping Table of Japanese, Traditional Chinese and Simplified Chinese
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Chinese characters are used in both Japanese and Chinese, where they are called Kanji and Hanzi, respectively. Chinese characters carry significant semantic information, so a mapping table between Kanji and Hanzi can be very useful for many Japanese-Chinese bilingual applications, such as machine translation and cross-lingual information retrieval. Because Kanji originated in ancient China, most Kanji have corresponding Chinese characters in Hanzi. However, the relation between Kanji and Hanzi is quite complicated. In this paper, we propose a method for automatically building a Chinese character mapping table of Japanese, Traditional Chinese and Simplified Chinese from freely available resources. We define seven categories for Kanji based on the relation between Kanji and Hanzi, and classify mappings of Chinese characters into these categories. We use a resource from Wiktionary to assess the completeness of the mapping table we built. Statistical comparison shows that our proposed method produces a more complete mapping table than the current version of Wiktionary.
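A hedged sketch of what a few rows of such a mapping table might look like and how they could be queried. The entries are ordinary well-known correspondences, but the category labels are illustrative placeholders rather than the seven categories defined in the paper, and this is not the released resource itself.

```python
# Hedged sketch of a Kanji / Traditional Chinese / Simplified Chinese mapping table.
MAPPING_TABLE = {
    # kanji: (traditional, simplified, illustrative category label)
    "学": ("學", "学", "same as Simplified, differs from Traditional"),
    "国": ("國", "国", "same as Simplified, differs from Traditional"),
    "気": ("氣", "气", "differs from both"),
    "中": ("中", "中", "identical in all three"),
}

def lookup(kanji):
    """Return (traditional, simplified, category) for a kanji, or None if unmapped."""
    return MAPPING_TABLE.get(kanji)

print(lookup("気"))   # ('氣', '气', 'differs from both')
```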