Jinhua Du


2020

pdf bib
Pointing to Select: A Fast Pointer-LSTM for Long Text Classification
Jinhua Du | Yan Huang | Karo Moilanen
Proceedings of the 28th International Conference on Computational Linguistics

Recurrent neural networks (RNNs) suffer from well-known limitations and complications which include slow inference and vanishing gradients when processing long sequences in text classification. Recent studies have attempted to accelerate RNNs via various ad hoc mechanisms to skip irrelevant words in the input. However, word skipping approaches proposed to date effectively stop at each or a given time step to decide whether or not a given input word should be skipped, breaking the coherence of input processing in RNNs. Furthermore, current methods cannot change skip rates during inference and are consequently unable to support different skip rates in demanding real-world conditions. To overcome these limitations, we propose Pointer- LSTM, a novel LSTM framework which relies on a pointer network to select important words for target prediction. The model maintains a coherent input process for the LSTM modules and makes it possible to change the skip rate during inference. Our evaluation on four public data sets demonstrates that Pointer-LSTM (a) is 1.1x∼3.5x faster than the standard LSTM architecture; (b) is more accurate than Leap-LSTM (the state-of-the-art LSTM skipping model) at high skip rates; and (c) reaches robust accuracy levels even when the skip rate is changed during inference.

2019

pdf bib
Self-Attention Enhanced CNNs and Collaborative Curriculum Learning for Distantly Supervised Relation Extraction
Yuyun Huang | Jinhua Du
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Distance supervision is widely used in relation extraction tasks, particularly when large-scale manual annotations are virtually impossible to conduct. Although Distantly Supervised Relation Extraction (DSRE) benefits from automatic labelling, it suffers from serious mislabelling issues, i.e. some or all of the instances for an entity pair (head and tail entities) do not express the labelled relation. In this paper, we propose a novel model that employs a collaborative curriculum learning framework to reduce the effects of mislabelled data. Specifically, we firstly propose an internal self-attention mechanism between the convolution operations in convolutional neural networks (CNNs) to learn a better sentence representation from the noisy inputs. Then we define two sentence selection models as two relation extractors in order to collaboratively learn and regularise each other under a curriculum scheme to alleviate noisy effects, where the curriculum could be constructed by conflicts or small loss. Finally, experiments are conducted on a widely-used public dataset and the results indicate that the proposed model significantly outperforms baselines including the state-of-the-art in terms of P@N and PR curve metrics, thus evidencing its capability of reducing noisy effects for DSRE.

pdf bib
AIG Investments.AI at the FinSBD Task: Sentence Boundary Detection through Sequence Labelling and BERT Fine-tuning
Jinhua Du | Yan Huang | Karo Moilanen
Proceedings of the First Workshop on Financial Technology and Natural Language Processing

pdf bib
Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation
Mihael Arcan | Marco Turchi | Jinhua Du | Dimitar Shterionov | Daniel Torregrosa
Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation

2018

pdf bib
Multi-Level Structured Self-Attentions for Distantly Supervised Relation Extraction
Jinhua Du | Jingguang Han | Andy Way | Dadong Wan
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Attention mechanism is often used in deep neural networks for distantly supervised relation extraction (DS-RE) to distinguish valid from noisy instances. However, traditional 1-D vector attention model is insufficient for learning of different contexts in the selection of valid instances to predict the relationship for an entity pair. To alleviate this issue, we propose a novel multi-level structured (2-D matrix) self-attention mechanism for DS-RE in a multi-instance learning (MIL) framework using bidirectional recurrent neural networks (BiRNN). In the proposed method, a structured word-level self-attention learns a 2-D matrix where each row vector represents a weight distribution for different aspects of an instance regarding two entities. Targeting the MIL issue, the structured sentence-level attention learns a 2-D matrix where each row vector represents a weight distribution on selection of different valid instances. Experiments conducted on two publicly available DS-RE datasets show that the proposed framework with multi-level structured self-attention mechanism significantly outperform baselines in terms of PR curves, P@N and F1 measures.

pdf bib
NextGen AML: Distributed Deep Learning based Language Technologies to Augment Anti Money Laundering Investigation
Jingguang Han | Utsab Barman | Jeremiah Hayes | Jinhua Du | Edward Burgin | Dadong Wan
Proceedings of ACL 2018, System Demonstrations

Most of the current anti money laundering (AML) systems, using handcrafted rules, are heavily reliant on existing structured databases, which are not capable of effectively and efficiently identifying hidden and complex ML activities, especially those with dynamic and time-varying characteristics, resulting in a high percentage of false positives. Therefore, analysts are engaged for further investigation which significantly increases human capital cost and processing time. To alleviate these issues, this paper presents a novel framework for the next generation AML by applying and visualizing deep learning-driven natural language processing (NLP) technologies in a distributed and scalable manner to augment AML monitoring and investigation. The proposed distributed framework performs news and tweet sentiment analysis, entity recognition, relation extraction, entity linking and link analysis on different data sources (e.g. news articles and tweets) to provide additional evidence to human investigators for final decision-making. Each NLP module is evaluated on a task-specific data set, and the overall experiments are performed on synthetic and real-world datasets. Feedback from AML practitioners suggests that our system can reduce approximately 30% time and cost compared to their previous manual approaches of AML investigation.

2017

pdf bib
Semantics-Enhanced Task-Oriented Dialogue Translation: A Case Study on Hotel Booking
Longyue Wang | Jinhua Du | Liangyou Li | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the IJCNLP 2017, System Demonstrations

We showcase TODAY, a semantics-enhanced task-oriented dialogue translation system, whose novelties are: (i) task-oriented named entity (NE) definition and a hybrid strategy for NE recognition and translation; and (ii) a novel grounded semantic method for dialogue understanding and task-order management. TODAY is a case-study demo which can efficiently and accurately assist customers and agents in different languages to reach an agreement in a dialogue for the hotel booking.

2016

pdf bib
Using BabelNet to Improve OOV Coverage in SMT
Jinhua Du | Andy Way | Andrzej Zydron
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Out-of-vocabulary words (OOVs) are a ubiquitous and difficult problem in statistical machine translation (SMT). This paper studies different strategies of using BabelNet to alleviate the negative impact brought about by OOVs. BabelNet is a multilingual encyclopedic dictionary and a semantic network, which not only includes lexicographic and encyclopedic terms, but connects concepts and named entities in a very large network of semantic relations. By taking advantage of the knowledge in BabelNet, three different methods ― using direct training data, domain-adaptation techniques and the BabelNet API ― are proposed in this paper to obtain translations for OOVs to improve system performance. Experimental results on English―Polish and English―Chinese language pairs show that domain adaptation can better utilize BabelNet knowledge and performs better than other methods. The results also demonstrate that BabelNet is a really useful tool for improving translation performance of SMT systems.

pdf bib
ProphetMT: A Tree-based SMT-driven Controlled Language Authoring/Post-Editing Tool
Xiaofeng Wu | Jinhua Du | Qun Liu | Andy Way
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents ProphetMT, a tree-based SMT-driven Controlled Language (CL) authoring and post-editing tool. ProphetMT employs the source-side rules in a translation model and provides them as auto-suggestions to users. Accordingly, one might say that users are writing in a Controlled Language that is understood by the computer. ProphetMT also allows users to easily attach structural information as they compose content. When a specific rule is selected, a partial translation is promptly generated on-the-fly with the help of the structural information. Our experiments conducted on English-to-Chinese show that our proposed ProphetMT system can not only better regularise an author’s writing behaviour, but also significantly improve translation fluency which is vital to reduce the post-editing time. Additionally, when the writing and translation process is over, ProphetMT can provide an effective colour scheme to further improve the productivity of post-editors by explicitly featuring the relations between the source and target rules.

2011

pdf bib
Incorporating Source-Language Paraphrases into Phrase-Based SMT with Confusion Networks
Jie Jiang | Jinhua Du | Andy Way
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

2010

pdf bib
TMX Markup: A Challenge When Adapting SMT to the Localisation Environment
Jinhua Du | Johann Roturier | Andy Way
Proceedings of the 14th Annual conference of the European Association for Machine Translation

pdf bib
The Impact of Source–Side Syntactic Reordering on Hierarchical Phrase-based SMT
Jinhua Du | Andy Way
Proceedings of the 14th Annual conference of the European Association for Machine Translation

pdf bib
MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
An Augmented Three-Pass System Combination Framework: DCU Combination System for WMT 2010
Jinhua Du | Pavel Pecina | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
The DCU Dependency-Based Metric in WMT-MetricsMATR 2010
Yifan He | Jinhua Du | Andy Way | Josef van Genabith
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
Source-side Syntactic Reordering Patterns with Functional Words for Improved Phrase-based SMT
Jie Jiang | Jinhua Du | Andy Way
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf bib
A Discriminative Latent Variable-Based “DE” Classifier for Chinese-English SMT
Jinhua Du | Andy Way
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Facilitating Translation Using Source Language Paraphrase Lattices
Jinhua Du | Jie Jiang | Andy Way
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

pdf bib
MATREX: The DCU MT System for WMT 2009
Jinhua Du | Yifan He | Sergio Penkale | Andy Way
Proceedings of the Fourth Workshop on Statistical Machine Translation