Masato Hagiwara


2020

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors
Masato Hagiwara | Masato Mita
Proceedings of the 12th Language Resources and Evaluation Conference

The lack of large-scale datasets has been a major hindrance to progress on NLP tasks such as spelling correction and grammatical error correction (GEC). As a complementary new resource for these tasks, we present the GitHub Typo Corpus, a large-scale, multilingual dataset of misspellings and grammatical errors along with their corrections harvested from GitHub, a large and popular platform for hosting and sharing git repositories. The dataset, which we have made publicly available, contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date. We also describe our process for filtering true typo edits based on learned classifiers on a small annotated subset, and demonstrate that typo edits can be identified with F1 0.9 using a very simple classifier with only three features. The detailed analyses of the dataset show that existing spelling correctors merely achieve an F-measure of approximately 0.5, suggesting that the dataset serves as a new, rich source of spelling errors that complements existing datasets.

Octanove Labs’ Japanese-Chinese Open Domain Translation System
Masato Hagiwara
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes Octanove Labs’ submission to the IWSLT 2020 open domain translation challenge. In order to build a high-quality Japanese-Chinese neural machine translation (NMT) system, we use a combination of 1) parallel corpus filtering and 2) back-translation. We have shown that, by using heuristic rules and learned classifiers, the size of the parallel data can be reduced by 70% to 90% without much impact on the final MT performance. We have also shown that including the artificially generated parallel data through back-translation further boosts the metric by 17% to 27%, while self-training contributes little. Aside from a small number of parallel sentences annotated for filtering, no external resources have been used to build our system.

Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
Eunjeong L. Park | Masato Hagiwara | Dmitrijs Milajevs | Nelson F. Liu | Geeticka Chauhan | Liling Tan
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

Machine Learning–Driven Language Assessment
Burr Settles | Geoffrey T. LaFlair | Masato Hagiwara
Transactions of the Association for Computational Linguistics, Volume 8

We describe a method for rapidly creating language proficiency assessments, and provide experimental evidence that such tests can be valid, reliable, and secure. Our approach is the first to use machine learning and natural language processing to induce proficiency scales based on a given standard, and then use linguistic models to estimate item difficulty directly for computer-adaptive testing. This alleviates the need for expensive pilot testing with human subjects. We used these methods to develop an online proficiency exam called the Duolingo English Test, and demonstrate that its scores align significantly with other high-stakes English assessments. Furthermore, our approach produces test scores that are highly reliable, while generating item banks large enough to satisfy security requirements.

2019

TEASPN: Framework and Protocol for Integrated Writing Assistance Environments
Masato Hagiwara | Takumi Ito | Tatsuki Kuribayashi | Jun Suzuki | Kentaro Inui
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

Language technologies play a key role in assisting people with their writing. Although there has been steady progress in areas such as grammatical error correction (GEC), human writers have yet to benefit from this progress due to the high development cost of integrating these technologies with writing software. We propose TEASPN, a protocol and an open-source framework for achieving integrated writing assistance environments. The protocol standardizes the way writing software communicates with servers that implement such technologies, allowing developers and researchers to integrate the latest developments in natural language processing (NLP) at low cost. As a result, users can enjoy the integrated experience in their favorite writing software. The results from experiments with human participants show that users use a wide range of technologies and rate their writing experience favorably, allowing them to write more fluent text.

Diamonds in the Rough: Generating Fluent Sentences from Early-Stage Drafts for Academic Writing Assistance
Takumi Ito | Tatsuki Kuribayashi | Hayato Kobayashi | Ana Brassard | Masato Hagiwara | Jun Suzuki | Kentaro Inui
Proceedings of the 12th International Conference on Natural Language Generation

The writing process consists of several stages such as drafting, revising, editing, and proofreading. Studies on writing assistance, such as grammatical error correction (GEC), have mainly focused on sentence editing and proofreading, where surface-level issues such as typographical errors, spelling errors, or grammatical errors should be corrected. We broaden this focus to include the earlier revising stage, where sentences require adjustment to the information included or major rewriting, and propose Sentence-level Revision (SentRev) as a new writing assistance task. Well-performing systems in this task can help inexperienced authors by producing fluent, complete sentences given their rough, incomplete drafts. To support developing and evaluating SentRev models, we build a new, freely available crowdsourced evaluation dataset consisting of incomplete sentences authored by non-native writers paired with their final versions extracted from published academic papers. We also establish baseline performance on SentRev using our newly built evaluation dataset.

2018

Second Language Acquisition Modeling
Burr Settles | Chris Brust | Erin Gustafson | Masato Hagiwara | Nitin Madnani
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present the task of second language acquisition (SLA) modeling. Given a history of errors made by learners of a second language, the task is to predict errors that they are likely to make at arbitrary points in the future. We describe a large corpus of more than 7M words produced by more than 6k learners of English, Spanish, and French using Duolingo, a popular online language-learning app. Then we report on the results of a shared task challenge aimed at studying the SLA task via this corpus, which attracted 15 teams and synthesized work from various fields including cognitive science, linguistics, and machine learning.

Proceedings of Workshop for NLP Open Source Software (NLP-OSS)
Eunjeong L. Park | Masato Hagiwara | Dmitrijs Milajevs | Liling Tan
Proceedings of Workshop for NLP Open Source Software (NLP-OSS)

2015

Cross-lingual Transfer of Named Entity Recognizers without Parallel Corpora
Ayah Zirikly | Masato Hagiwara
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Lightweight Client-Side Chinese/Japanese Morphological Analyzer Based on Online Learning
Masato Hagiwara | Satoshi Sekine
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

Comparison of the Impact of Word Segmentation on Name Tagging for Chinese and Japanese
Haibo Li | Masato Hagiwara | Qi Li | Heng Ji
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Word segmentation is usually considered an essential step for many Chinese and Japanese natural language processing tasks, such as name tagging. This paper presents several new observations and analyses on the impact of word segmentation on name tagging: (1) Due to the limitations of current state-of-the-art Chinese word segmentation performance, a character-based name tagger can outperform its word-based counterparts for Chinese but not for Japanese. (2) It is crucial to keep segmentation settings (e.g., definitions, specifications, methods) consistent between training and testing for name tagging. (3) As long as (2) is ensured, the performance of word segmentation does not have an appreciable impact on Chinese and Japanese name tagging.

2013

Accurate Word Segmentation using Transliteration and Language Model Projection
Masato Hagiwara | Satoshi Sekine
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

KooSHO: Japanese Text Input Environment based on Aerial Hand Writing
Masato Hagiwara | Soh Masuko
Proceedings of the 2013 NAACL HLT Demonstration Session

2012

Latent Semantic Transliteration using Dirichlet Mixture
Masato Hagiwara | Satoshi Sekine
Proceedings of the 4th Named Entity Workshop (NEWS) 2012

phloat : Integrated Writing Environment for ESL learners
Yuta Hayashibe | Masato Hagiwara | Satoshi Sekine
Proceedings of the Second Workshop on Advances in Text Input Methods

2011

Safety Information Mining — What can NLP do in a disaster—
Graham Neubig | Yuichiroh Matsubayashi | Masato Hagiwara | Koji Murakami
Proceedings of 5th International Joint Conference on Natural Language Processing

Latent Class Transliteration based on Source Language Origin
Masato Hagiwara | Satoshi Sekine
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2009

Japanese Query Alteration Based on Lexical Semantic Similarity
Masato Hagiwara | Hisami Suzuki
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

A Supervised Learning Approach to Automatic Synonym Identification Based on Distributional Features
Masato Hagiwara
Proceedings of the ACL-08: HLT Student Research Workshop

Context Feature Selection for Distributional Similarity
Masato Hagiwara | Yasuhiro Ogawa | Katsuhiko Toyama
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Metric Learning for Synonym Acquisition
Nobuyuki Shimizu | Masato Hagiwara | Yasuhiro Ogawa | Katsuhiko Toyama | Hiroshi Nakagawa
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2006

Selection of Effective Contextual Information for Automatic Synonym Acquisition
Masato Hagiwara | Yasuhiro Ogawa | Katsuhiko Toyama
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

PLSI Utilization for Automatic Thesaurus Construction
Masato Hagiwara | Yasuhiro Ogawa | Katsuhiko Toyama
Second International Joint Conference on Natural Language Processing: Full Papers