Dat Quoc Nguyen


2020

A survey of embedding models of entities and relationships for knowledge graph completion
Dat Quoc Nguyen
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)

Knowledge graphs (KGs) of real-world facts about entities and their relationships are useful resources for a variety of natural language processing tasks. However, because knowledge graphs are typically incomplete, it is useful to perform knowledge graph completion or link prediction, i.e. predict whether a relationship not in the knowledge graph is likely to be true. This paper serves as a comprehensive survey of embedding models of entities and relationships for knowledge graph completion, summarizing up-to-date experimental results on standard benchmark datasets and pointing out potential future research directions.

PhoBERT: Pre-trained language models for Vietnamese
Dat Quoc Nguyen | Anh Tuan Nguyen
Findings of the Association for Computational Linguistics: EMNLP 2020

We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT models are available at https://github.com/VinAIResearch/PhoBERT

A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese
Anh Tuan Nguyen | Mai Hoang Dao | Dat Quoc Nguyen
Findings of the Association for Computational Linguistics: EMNLP 2020

Semantic parsing is an important NLP task. However, Vietnamese is a low-resource language in this research area. In this paper, we present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese. We extend and evaluate two strong semantic parsing baselines EditSQL (Zhang et al., 2019) and IRNet (Guo et al., 2019) on our dataset. We compare the two baselines with key configurations and find that: automatic Vietnamese word segmentation improves the parsing results of both baselines; the normalized pointwise mutual information (NPMI) score (Bouma, 2009) is useful for schema linking; latent syntactic features extracted from a neural dependency parser for Vietnamese also improve the results; and the monolingual language model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) helps produce higher performances than the recent best multilingual language model XLM-R (Conneau et al., 2020).
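
For readers unfamiliar with the NPMI score mentioned above, the following is a minimal sketch of how it can be computed from co-occurrence counts. The function name `npmi`, its count-based interface, and the handling of the two boundary cases are assumptions made for illustration; how the paper applies the score to schema linking is not described in the abstract and is not modelled here.

```python
import math

def npmi(count_xy, count_x, count_y, total):
    """Normalized PMI of two events from raw counts; the result lies in [-1, 1]."""
    if count_xy == 0:
        return -1.0  # limiting value when the two events never co-occur
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 1.0:
        return 1.0  # illustrative convention: avoids dividing by -log(1) = 0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)

# Hypothetical counts: how often a question word and a column name co-occur.
print(npmi(count_xy=25, count_x=50, count_y=50, total=100))
```

A score near 1 indicates that the two events almost always occur together, a score near 0 indicates independence, and a score near -1 indicates that they almost never co-occur.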

WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets
Dat Quoc Nguyen | Thanh Vu | Afshin Rahimi | Mai Hoang Dao | Linh The Nguyen | Long Doan
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

In this paper, we provide an overview of the WNUT-2020 shared task on the identification of informative COVID-19 English Tweets. We describe how we construct a corpus of 10K Tweets and organize the development and evaluation phases for this task. We also present a brief summary of results obtained from the final system evaluation submissions of 55 teams, finding that (i) many systems obtain very high performance, up to 0.91 F1 score, (ii) the majority of the submissions achieve substantially higher results than the baseline fastText (Joulin et al., 2017), and (iii) fine-tuning pre-trained language models on relevant language data followed by supervised training performs well in this task.

BERTweet: A pre-trained language model for English Tweets
Dat Quoc Nguyen | Thanh Vu | Anh Tuan Nguyen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet under the MIT License to facilitate future research and applications on Tweet data. Our BERTweet is available at https://github.com/VinAIResearch/BERTweet

2019

A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing
Dat Quoc Nguyen
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

Detecting Chemical Reactions in Patents
Hiyori Yoshikawa | Dat Quoc Nguyen | Zenan Zhai | Christian Druckenbrodt | Camilo Thorne | Saber A. Akhondi | Timothy Baldwin | Karin Verspoor
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

Extracting chemical reactions from patents is a crucial task for chemists working on chemical exploration. In this paper we introduce the novel task of detecting the textual spans that describe or refer to chemical reactions within patents. We formulate this task as a paragraph-level sequence tagging problem, where the system is required to return a sequence of paragraphs which contain a description of a reaction. To address this new task, we construct an annotated dataset from an existing proprietary database of chemical reactions manually extracted from patents. We introduce several baseline methods for the task and evaluate them over our dataset. Through error analysis, we discuss what makes the task complex and challenging, and suggest possible directions for future research.

Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings
Zenan Zhai | Dat Quoc Nguyen | Saber Akhondi | Camilo Thorne | Christian Druckenbrodt | Trevor Cohn | Michelle Gregory | Karin Verspoor
Proceedings of the 18th BioNLP Workshop and Shared Task

Chemical patents are an important resource for chemical information. However, few chemical Named Entity Recognition (NER) systems have been evaluated on patent documents, due in part to their structural and linguistic complexity. In this paper, we explore the NER performance of a BiLSTM-CRF model utilising pre-trained word embeddings, character-level word representations and contextualized ELMo word representations for chemical patents. We compare word embeddings pre-trained on biomedical and chemical patent corpora. The effect of tokenizers optimized for the chemical domain on NER performance in chemical patents is also explored. The results on two patent corpora show that contextualized word representations generated from ELMo substantially improve chemical NER performance w.r.t. the current state-of-the-art. We also show that domain-specific resources, such as word embeddings trained on chemical patents and chemical-specific tokenizers, have a positive impact on NER performance.

A Capsule Network-based Embedding Model for Knowledge Graph Completion and Search Personalization
Dai Quoc Nguyen | Thanh Vu | Tu Dinh Nguyen | Dat Quoc Nguyen | Dinh Phung
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this paper, we introduce an embedding model, named CapsE, exploring a capsule network to model relationship triples (subject, relation, object). Our CapsE represents each triple as a 3-column matrix where each column vector represents the embedding of an element in the triple. This 3-column matrix is then fed to a convolution layer where multiple filters are operated to generate different feature maps. These feature maps are reconstructed into corresponding capsules which are then routed to another capsule to produce a continuous vector. The length of this vector is used to measure the plausibility score of the triple. Our proposed CapsE obtains better performance than previous state-of-the-art embedding models for knowledge graph completion on two benchmark datasets WN18RR and FB15k-237, and outperforms strong search personalization baselines on SEARCH17.
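
The routing step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it starts from already-computed capsule prediction vectors (skipping the convolution layer), and the squash non-linearity, the softmax over input capsules, the iteration count and all function names are assumptions borrowed from common capsule-network formulations.

```python
import math

def squash(vec):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in vec)
    if norm_sq == 0.0:
        return [0.0 for _ in vec]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in vec]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_to_single_capsule(predictions, iterations=3):
    """Route several input capsule predictions to one output capsule."""
    logits = [0.0] * len(predictions)
    dim = len(predictions[0])
    output = [0.0] * dim
    for _ in range(iterations):
        coupling = softmax(logits)
        # Weighted sum of the predictions, then squash to get the output capsule.
        weighted = [sum(c * p[j] for c, p in zip(coupling, predictions))
                    for j in range(dim)]
        output = squash(weighted)
        # Raise the logit of each input capsule that agrees with the output.
        logits = [b + sum(pj * oj for pj, oj in zip(p, output))
                  for b, p in zip(logits, predictions)]
    return output

def plausibility(predictions, iterations=3):
    """Length of the routed output vector, used here as the triple's score."""
    out = route_to_single_capsule(predictions, iterations)
    return math.sqrt(sum(x * x for x in out))

# Hypothetical prediction vectors from three input capsules.
print(plausibility([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]))
```

Because of the squash step, the score always falls between 0 and 1, which makes vector length convenient to read as a plausibility measure.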

2018

NIHRIO at SemEval-2018 Task 3: A Simple and Accurate Neural Network Model for Irony Detection in Twitter
Thanh Vu | Dat Quoc Nguyen | Xuan-Son Vu | Dai Quoc Nguyen | Michael Catt | Michael Trenell
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes our NIHRIO system for SemEval-2018 Task 3 “Irony detection in English tweets.” We propose to use a simple neural network architecture of Multilayer Perceptron with various types of input features including: lexical, syntactic, semantic and polarity features. Our system achieves very high performance in both subtasks of binary and multi-class irony detection in tweets. In particular, we rank at least fourth using the accuracy metric and sixth using the F1 metric. Our code is available at: https://github.com/NIHRIO/IronyDetectionInTwitter

A Fast and Accurate Vietnamese Word Segmenter
Dat Quoc Nguyen | Dai Quoc Nguyen | Thanh Vu | Mark Dras | Mark Johnson
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

An Improved Neural Network Model for Joint POS Tagging and Dependency Parsing
Dat Quoc Nguyen | Karin Verspoor
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We propose a novel neural network model for joint part-of-speech (POS) tagging and dependency parsing. Our model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating a BiLSTM-based tagging component to produce automatically predicted POS tags for the parser. On the benchmark English Penn treebank, our model obtains strong UAS and LAS scores at 94.51% and 92.87%, respectively, producing 1.5+% absolute improvements to the BIST graph-based parser, and also obtaining a state-of-the-art POS tagging accuracy at 97.97%. Furthermore, experimental results on parsing 61 “big” Universal Dependencies treebanks from raw texts show that our model outperforms the baseline UDPipe (Straka and Strakova, 2017) with 0.8% higher average POS tagging score and 3.6% higher average LAS score. In addition, with our model, we also obtain state-of-the-art downstream task scores for biomedical event extraction and opinion analysis applications. Our code is available together with all pre-trained models at: https://github.com/datquocnguyen/jPTDP

A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network
Dai Quoc Nguyen | Tu Dinh Nguyen | Dat Quoc Nguyen | Dinh Phung
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

In this paper, we propose a novel embedding model, named ConvKB, for knowledge base completion. Our model ConvKB advances state-of-the-art models by employing a convolutional neural network, so that it can capture global relationships and transitional characteristics between entities and relations in knowledge bases. In ConvKB, each triple (head entity, relation, tail entity) is represented as a 3-column matrix where each column vector represents a triple element. This 3-column matrix is then fed to a convolution layer where multiple filters are operated on the matrix to generate different feature maps. These feature maps are then concatenated into a single feature vector representing the input triple. The feature vector is multiplied with a weight vector via a dot product to return a score. This score is then used to predict whether the triple is valid or not. Experiments show that ConvKB achieves better link prediction performance than previous state-of-the-art embedding models on two benchmark datasets WN18RR and FB15k-237.
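
The scoring pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the released ConvKB code: the 1x3 filter shape, the ReLU activation, the bias terms and the function name `convkb_style_score` are assumptions, and the embeddings, filters and weights are toy values instead of learned parameters.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def convkb_style_score(head, rel, tail, filters, biases, weights):
    """Score a triple from its three k-dimensional embeddings."""
    k = len(head)
    # 3-column matrix: row i holds dimension i of (head, relation, tail).
    matrix = [(head[i], rel[i], tail[i]) for i in range(k)]
    features = []
    for filt, bias in zip(filters, biases):
        # Apply one 1x3 filter to every row, giving a feature map of length k;
        # appending map after map concatenates them into a single vector.
        for row in matrix:
            features.append(relu(sum(f * x for f, x in zip(filt, row)) + bias))
    # Dot product of the concatenated features with the weight vector.
    return sum(w * f for w, f in zip(weights, features))

# Toy example with k = 2 and two filters (all values are hypothetical).
score = convkb_style_score(
    head=[1.0, 2.0], rel=[0.5, -1.0], tail=[1.5, 1.0],
    filters=[[1.0, 1.0, -1.0], [1.0, 0.0, 0.0]],
    biases=[0.0, 0.0],
    weights=[1.0, 1.0, 1.0, 1.0],
)
print(score)
```

The first toy filter, [1, 1, -1], computes head + relation - tail on each row, which is one way a convolution filter can pick up the translation-style pattern the abstract refers to as transitional characteristics.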

VnCoreNLP: A Vietnamese Natural Language Processing Toolkit
Thanh Vu | Dat Quoc Nguyen | Dai Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present an easy-to-use and fast toolkit, namely VnCoreNLP—a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks. We release VnCoreNLP to provide rich linguistic annotations to facilitate research work on Vietnamese NLP. Our VnCoreNLP is open-source and available at: https://github.com/vncorenlp/VnCoreNLP

Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings
Dat Quoc Nguyen | Karin Verspoor
Proceedings of the BioNLP 2018 workshop

We investigate the incorporation of character-based word representations into a standard CNN-based relation extraction model. We experiment with two common neural architectures, CNN and LSTM, to learn word vector representations from character embeddings. Through a task on the BioCreative-V CDR corpus, extracting relationships between chemicals and diseases, we show that models exploiting the character-based word representations improve on models that do not use this information, obtaining a state-of-the-art result relative to previous neural approaches.

Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition
Zenan Zhai | Dat Quoc Nguyen | Karin Verspoor
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

We compare the use of LSTM-based and CNN-based character-level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus show that the use of either type of character-level word embeddings in conjunction with the BiLSTM-CRF models leads to comparable state-of-the-art performance. However, the models using CNN-based character-level word embeddings have a computational performance advantage: they increase training time over word-based models by 25%, whereas the LSTM-based character-level word embeddings more than double the required training time.

2017

A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing
Dat Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present a novel neural network model that learns POS tagging and graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to learn feature representations shared for both POS tagging and dependency parsing tasks, thus handling the feature-engineering problem. Our extensive experiments, on 19 languages from the Universal Dependencies project, show that our model outperforms the state-of-the-art neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing, resulting in a new state of the art. Our code is open-source and available together with pre-trained models at: https://github.com/datquocnguyen/jPTDP

Sequence to Sequence Learning for Event Prediction
Dai Quoc Nguyen | Dat Quoc Nguyen | Cuong Xuan Chu | Stefan Thater | Manfred Pinkal
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

This paper presents an approach to the task of predicting an event description from a preceding sentence in a text. Our approach explores sequence-to-sequence learning using a bidirectional multi-layer recurrent neural network. Our approach substantially outperforms previous work in terms of the BLEU score on two datasets derived from WikiHow and DeScript respectively. Since the BLEU score is not easy to interpret as a measure of event prediction, we complement our study with a second evaluation that exploits the rich linguistic annotation of gold paraphrase sets of events.

A Mixture Model for Learning Multi-Sense Word Embeddings
Dai Quoc Nguyen | Dat Quoc Nguyen | Ashutosh Modi | Stefan Thater | Manfred Pinkal
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Word embeddings are now a standard technique for inducing meaning representations for words. To obtain good representations, it is important to take into account the different senses of a word. In this paper, we propose a mixture model for learning multi-sense word embeddings. Our model generalizes previous work in that it allows different weights to be induced for the different senses of a word. The experimental results show that our model outperforms previous models on standard evaluation tasks.

From Word Segmentation to POS Tagging for Vietnamese
Dat Quoc Nguyen | Thanh Vu | Dai Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2017

2016

STransE: a novel embedding model of entities and relationships in knowledge bases
Dat Quoc Nguyen | Kairit Sirts | Lizhen Qu | Mark Johnson
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Neighborhood Mixture Model for Knowledge Base Completion
Dat Quoc Nguyen | Kairit Sirts | Lizhen Qu | Mark Johnson
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

An empirical study for Vietnamese dependency parsing
Dat Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2016

2015

Improving Topic Models with Latent Feature Word Representations
Dat Quoc Nguyen | Richard Billingsley | Lan Du | Mark Johnson
Transactions of the Association for Computational Linguistics, Volume 3

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.

Improving Topic Coherence with Latent Feature Word Representations in MAP Estimation for Topic Modeling
Dat Quoc Nguyen | Kairit Sirts | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2015

2014

RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger
Dat Quoc Nguyen | Dai Quoc Nguyen | Dang Duc Pham | Son Bao Pham
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Sentiment Classification on Polarity Reviews: An Empirical Study Using Rating-based Features
Dai Quoc Nguyen | Dat Quoc Nguyen | Thanh Vu | Son Bao Pham
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2013

A Two-Stage Classifier for Sentiment Analysis
Dai Quoc Nguyen | Dat Quoc Nguyen | Son Bao Pham
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2011

Systematic Knowledge Acquisition for Question Analysis
Dat Quoc Nguyen | Dai Quoc Nguyen | Son Bao Pham
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011