Masayuki Asahara


2020

Design of BCCWJ-EEG: Balanced Corpus with Human Electroencephalography
Yohei Oseki | Masayuki Asahara
Proceedings of the 12th Language Resources and Evaluation Conference

The past decade has witnessed the happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has re-sparked strong interest in the fusion of NLP and the neuroscience of language. Importantly, this cross-fertilization between NLP, on the one hand, and the cognitive (neuro)science of language, on the other, has been driven by language resources annotated with human language processing data. However, there remain several limitations with those language resources in terms of annotations, genres, languages, etc. In this paper, we describe the design of a novel language resource called BCCWJ-EEG, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature with special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) compilation schedule. In addition, potential applications of BCCWJ-EEG to neuroscience and NLP will also be discussed.

KOTONOHA: A Corpus Concordance System for Skewer-Searching NINJAL Corpora
Teruaki Oka | Yuichi Ishimoto | Yutaka Yagi | Takenori Nakamura | Masayuki Asahara | Kikuo Maekawa | Toshinobu Ogiso | Hanae Koiso | Kumiko Sakoda | Nobuko Kibe
Proceedings of the 12th Language Resources and Evaluation Conference

The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan), has developed several types of corpora. For each corpus NINJAL provided an online search environment, ‘Chunagon’, which is a morphological-information-annotation-based concordance system made publicly available in 2011. NINJAL has now provided a skewer-search system ‘Kotonoha’ based on the ‘Chunagon’ systems. This system enables querying of multiple corpora by certain categories, such as register type and period.

Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning
Fei Cheng | Masayuki Asahara | Ichiro Kobayashi | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2020

Temporal relation classification is the pair-wise task of identifying the relation of a temporal link (TLINK) between two mentions, i.e. event, time and document creation time (DCT). This pair-wise setting leads to two crucial limits: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T and E2D) are hindered from using the whole data. This paper presents an event-centric model that manages dynamic event representations across multiple TLINKs. Our model deals with three TLINK categories with multi-task learning to leverage the full size of data. The experimental results show that our proposal outperforms state-of-the-art models and two strong transfer learning baselines on both the English and Japanese data.

Automatic Creation of Correspondence Table of Meaning Tags from Two Dictionaries in One Language Using Bilingual Word Embedding
Teruo Hirabayashi | Kanako Komiya | Masayuki Asahara | Hiroyuki Shinnou
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

In this paper, we show how to use bilingual word embeddings (BWE) to automatically create a correspondence table of meaning tags from two dictionaries in one language and examine the effectiveness of the method. One problem is that the meaning tags do not always correspond one-to-one because the granularities of the word senses and the concepts differ from each other. Therefore, we regarded the concept tag that corresponds the most to a word sense as the correct concept tag corresponding to the word sense. We used two BWE methods, a linear transformation matrix and VecMap. We evaluated the most frequent sense (MFS) method and the corpus concatenation method for comparison. The accuracies of the proposed methods were higher than the accuracy of the random baseline but lower than those of the MFS and corpus concatenation methods. However, because our method utilized the embedding vectors of the word senses, the relations of the sense tags corresponding to concept tags could be examined by mapping the sense embeddings to the vector space of the concept tags. Also, our methods could be performed when we have only concept or word sense embeddings, whereas the MFS method requires a parallel corpus and the corpus concatenation method needs two tagged corpora.

Adversarial Training for Commonsense Inference
Lis Pereira | Xiaodong Liu | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 5th Workshop on Representation Learning for NLP

We apply small perturbations to word embeddings and minimize the resultant adversarial risk to regularize the model. We exploit a novel combination of two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model boosts the fine-tuning performance of RoBERTa, achieving competitive results on multiple reading comprehension datasets that require commonsense inference.

2019

Word Familiarity Rate Estimation Using a Bayesian Linear Mixed Model
Masayuki Asahara
Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP

This paper presents research on word familiarity rate estimation using the ‘Word List by Semantic Principles’. We collected rating information on 96,557 words in the ‘Word List by Semantic Principles’ via Yahoo! crowdsourcing. We asked 3,392 subject participants to use their introspection to rate the familiarity of words based on the five perspectives of ‘KNOW’, ‘WRITE’, ‘READ’, ‘SPEAK’, and ‘LISTEN’, and each word was rated by at least 16 subject participants. We used Bayesian linear mixed models to estimate the word familiarity rates. We also explored the ratings with the semantic labels used in the ‘Word List by Semantic Principles’.

2018

All-words Word Sense Disambiguation Using Concept Embeddings
Rui Suzuki | Kanako Komiya | Masayuki Asahara | Minoru Sasaki | Hiroyuki Shinnou
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Universal Dependencies Version 2 for Japanese
Masayuki Asahara | Hiroshi Kanayama | Takaaki Tanaka | Yusuke Miyao | Sumire Uematsu | Shinsuke Mori | Yuji Matsumoto | Mai Omura | Yugo Murawaki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Predicting Japanese Word Order in Double Object Constructions
Masayuki Asahara | Satoshi Nambu | Shin-Ichiro Sano
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

This paper presents a statistical model to predict Japanese word order in the double object constructions. We employed a Bayesian linear mixed model with manually annotated predicate-argument structure data. The findings from the refined corpus analysis confirmed the effects of information status of an NP as ‘given-new ordering’ in addition to the effects of ‘long-before-short’ as a tendency of the general Japanese word order.

Coordinate Structures in Universal Dependencies for Head-final Languages
Hiroshi Kanayama | Na-Rae Han | Masayuki Asahara | Jena D. Hwang | Yusuke Miyao | Jinho D. Choi | Yuji Matsumoto
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

This paper discusses the representation of coordinate structures in the Universal Dependencies framework for two head-final languages, Japanese and Korean. UD applies a strict principle that makes the head of coordination the left-most conjunct. However, the guideline may produce syntactic trees which are difficult to accept in head-final languages. This paper describes the status in the current Japanese and Korean corpora and proposes alternative designs suitable for these languages.

UD-Japanese BCCWJ: Universal Dependencies Annotation for the Balanced Corpus of Contemporary Written Japanese
Mai Omura | Masayuki Asahara
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

In this paper, we describe a corpus UD Japanese-BCCWJ that was created by converting the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a Japanese language corpus, to adhere to the UD annotation schema. The BCCWJ already assigns dependency information at the level of the bunsetsu (a Japanese syntactic unit comparable to the phrase). We developed a program to convert the BCCWJ to UD based on this dependency structure, and this corpus is the result of completely automatic conversion using the program. UD Japanese-BCCWJ is the largest-scale UD Japanese corpus and the second-largest of all UD corpora, including 1,980 documents, 57,109 sentences, and 1,273k words across six distinct domains.

Between Reading Time and Clause Boundaries in Japanese - Wrap-up Effect in a Head-Final Language
Masayuki Asahara
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Annotation of ‘Word List by Semantic Principles’ Labels for the Balanced Corpus of Contemporary Written Japanese
Sachi Kato | Masayuki Asahara | Makoto Yamazaki
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

2017

Between Reading Time and Syntactic/Semantic Categories
Masayuki Asahara | Sachi Kato
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This article presents a contrastive analysis between reading time and syntactic/semantic categories in Japanese. We overlaid the reading time annotation of BCCWJ-EyeTrack and a syntactic/semantic category information annotation on the ‘Balanced Corpus of Contemporary Written Japanese’. Statistical analysis based on a mixed linear model showed that verbal phrases tend to have shorter reading times than adjectives, adverbial phrases, or nominal phrases. The results suggest that the preceding phrases associated with the presenting phrases promote the reading process to shorten the gazing time.

Between Reading Time and Information Structure
Masayuki Asahara
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

2016

Universal Dependencies for Japanese
Takaaki Tanaka | Yusuke Miyao | Masayuki Asahara | Sumire Uematsu | Hiroshi Kanayama | Shinsuke Mori | Yuji Matsumoto
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present an attempt to port the international syntactic annotation scheme, Universal Dependencies, to the Japanese language in this paper. Since the Japanese syntactic structure is usually annotated on the basis of unique chunk-based dependencies, we first introduce word-based dependencies by using a word unit called the Short Unit Word, which usually corresponds to an entry in the lexicon UniDic. Porting is done by mapping the part-of-speech tagset in UniDic to the universal part-of-speech tagset, and converting a constituent-based treebank to a typed dependency tree. The conversion is not straightforward, and we discuss the problems that arose in the conversion and the current solutions. A treebank consisting of 10,000 sentences was built by converting existing resources and is currently released to the public.

Reading-Time Annotations for “Balanced Corpus of Contemporary Written Japanese”
Masayuki Asahara | Hajime Ono | Edson T. Miyamoto
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The Dundee Eyetracking Corpus contains eyetracking data collected while native speakers of English and French read newspaper editorial articles. Similar resources for other languages are still rare, especially for languages in which words are not overtly delimited with spaces. This is a report on a project to build an eyetracking corpus for Japanese. Measurements were collected while 24 native speakers of Japanese read excerpts from the Balanced Corpus of Contemporary Written Japanese. Texts were presented with or without segmentation (i.e. with or without space at the boundaries between bunsetsu segmentations) and with two types of methodologies (eyetracking and self-paced reading presentation). Readers’ background information including vocabulary-size estimation and Japanese reading-span score were also collected. As an example of the possible uses for the corpus, we also report analyses investigating the phenomena of anti-locality.

‘BonTen’ – Corpus Concordance System for ‘NINJAL Web Japanese Corpus’
Masayuki Asahara | Kazuya Kawahara | Yuya Takei | Hideto Masuoka | Yasuko Ohba | Yuki Torii | Toru Morii | Yuki Tanaka | Kikuo Maekawa | Sachi Kato | Hikari Konishi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

The National Institute for Japanese Language and Linguistics, Japan (NINJAL) has undertaken a corpus compilation project to construct a web corpus for linguistic research comprising ten billion words. The project is divided into four parts: page collection, linguistic analysis, development of the corpus concordance system, and preservation. This article presents the corpus concordance system named ‘BonTen’ which enables the ten-billion-scaled corpus to be queried by string, a sequence of morphological information or a subtree of the syntactic dependency structure.

Demonstration of ChaKi.NET – beyond the corpus search system
Masayuki Asahara | Yuji Matsumoto | Toshio Morita
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

ChaKi.NET is a corpus management system for dependency structure annotated corpora. After more than 10 years of continuous development, the system is now usable not only for corpus search, but also for visualization, annotation, labelling, and formatting for statistical analysis. This paper describes the various functions included in the current ChaKi.NET system.

BCCWJ-DepPara: A Syntactic Annotation Treebank on the ‘Balanced Corpus of Contemporary Written Japanese’
Masayuki Asahara | Yuji Matsumoto
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Paratactic syntactic structures are difficult to represent in syntactic dependency tree structures. As such, we propose an annotation schema for syntactic dependency annotation of Japanese, in which coordinate structures are split from and overlaid on bunsetsu-based (base phrase unit) dependency. The schema represents nested coordinate structures, non-constituent conjuncts, and forward sharing as the set of regions. The annotation was performed on the core data of ‘Balanced Corpus of Contemporary Written Japanese’, which comprised about one million words and 1980 samples from six registers, such as newspapers, books, magazines, and web texts.

2014

BCCWJ-TimeBank: Temporal and Event Information Annotation on Japanese Text
Masayuki Asahara | Sachi Kato | Hikari Konishi | Mizuho Imada | Kikuo Maekawa
International Journal of Computational Linguistics & Chinese Language Processing, Volume 19, Number 3, September 2014

2013

BCCWJ-TimeBank: Temporal and Event Information Annotation on Japanese Text
Masayuki Asahara | Sachi Yasuda | Hikari Konishi | Mizuho Imada | Kikuo Maekawa
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)

2012

Head-driven Transition-based Parsing with Top-down Prediction
Katsuhiko Hayashi | Taro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Identifying Temporal Relations by Sentence and Document Optimizations
Katsumasa Yoshikawa | Masayuki Asahara | Ryu Iida
Proceedings of COLING 2012: Posters

2011

Third-order Variational Reranking on Packed-Shared Dependency Forests
Katsuhiko Hayashi | Taro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Different Input Systems for Different Devices
Asad Habib | Masakazu Iwatate | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)

Jointly Extracting Japanese Predicate-Argument Relation with Markov Logic
Katsumasa Yoshikawa | Masayuki Asahara | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

A Structured Model for Joint Learning of Argument Roles and Predicate Senses
Yotaro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the ACL 2010 Conference Short Papers

2009

Multilingual Syntactic-Semantic Dependency Parsing with Three-Stage Approximate Max-Margin Linear Models
Yotaro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

Jointly Identifying Temporal Relations with Markov Logic
Katsumasa Yoshikawa | Sebastian Riedel | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Japanese-Spanish Thesaurus Construction Using English as a Pivot
Jessica Ramírez | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Use of Event Types for Temporal Relation Identification in Chinese Text
Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing

Analyzing Chinese Synthetic Words with Tree-based Information and a Survey on Chinese Morphologically Derived Words
Jia Lu | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing

Constructing a Temporal Relation Tagged Corpus of Chinese Based on Dependency Structure Analysis
Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
International Journal of Computational Linguistics & Chinese Language Processing, Volume 13, Number 2, June 2008

A Pipeline Approach for Syntactic and Semantic Dependency Parsing
Yotaro Watanabe | Masakazu Iwatate | Masayuki Asahara | Yuji Matsumoto
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

Japanese Dependency Parsing Using a Tournament Model
Masakazu Iwatate | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

NAIST.Japan: Temporal Relation Identification Using Dependency Parsed Tree
Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields
Yotaro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

An Annotated Corpus Management Tool: ChaKi
Yuji Matsumoto | Masayuki Asahara | Kiyota Hashimoto | Yukio Tono | Akira Ohtani | Toshio Morita
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Large scale annotated corpora are very important not only in linguistic research but also in practical natural language processing tasks since a number of practical tools such as Part-of-speech (POS) taggers and syntactic parsers are now corpus-based or machine learning-based systems which require some amount of accurately annotated corpora. This article presents an annotated corpus management tool that provides various functions that include flexible search, statistic calculation, and error correction for linguistically annotated corpora. The target of annotation covers POS tags, base phrase chunks and syntactic dependency structures. This tool aims to support the consistent construction of lexicons and annotated corpora to be used by researchers in both the linguistics and language processing communities.

Multi-lingual Dependency Parsing at NAIST
Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

The Construction of a Dictionary for a Two-layer Chinese Morphological Analyzer
Chooi-Ling Goh | Jia Lü | Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation

2005

Automatic Extraction of Fixed Multiword Expressions
Campbell Hore | Masayuki Asahara | Yūji Matsumoto
Second International Joint Conference on Natural Language Processing: Full Papers

Building a Japanese-Chinese Dictionary Using Kanji/Hanzi Conversion
Chooi-Ling Goh | Masayuki Asahara | Yuji Matsumoto
Second International Joint Conference on Natural Language Processing: Full Papers

Chinese Deterministic Dependency Analyzer: Examining Effects of Global Features and Root Node Finder
Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

Combination of Machine Learning Methods for Optimum Chinese Word Segmentation
Masayuki Asahara | Kenta Fukuoka | Ai Azuma | Chooi-Ling Goh | Yotaro Watanabe | Yuji Matsumoto | Takashi Tsuzuki
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

Chinese Word Segmentation by Classification of Characters
Chooi-Ling Goh | Masayuki Asahara | Yuji Matsumoto
International Journal of Computational Linguistics & Chinese Language Processing, Volume 10, Number 3, September 2005: Special Issue on Selected Papers from ROCLING XVI

2004

Pruning False Unknown Words to Improve Chinese Word Segmentation
Chooi-Ling Goh | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation

Japanese Unknown Word Identification by Character-based Chunking
Masayuki Asahara | Yuji Matsumoto
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Chinese Word Segmentation by Classification of Characters
Chooi-Ling Goh | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Third SIGHAN Workshop on Chinese Language Processing

2003

Combining Segmenter and Chunker for Chinese Word Segmentation
Masayuki Asahara | Chooi Ling Goh | Xiaojie Wang | Yuji Matsumoto
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

Chinese Unknown Word Identification Using Character-based Tagging and Chunking
Chooi Ling Goh | Masayuki Asahara | Yuji Matsumoto
The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics

Japanese Named Entity Extraction with Redundant Morphological Analysis
Masayuki Asahara | Yuji Matsumoto
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

2002

Use of XML and Relational Databases for Consistent Development and Maintenance of Lexicons and Annotated Corpora
Masayuki Asahara | Ryuichi Yoneda | Akiko Yamashita | Yasuharu Den | Yuji Matsumoto
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

Extended Models and Tools for High-performance Part-of-speech
Masayuki Asahara | Yuji Matsumoto
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics