Haoran Li


pdf bib
Self-Attention Guided Copy Mechanism for Abstractive Summarization
Song Xu | Haoran Li | Peng Yuan | Youzheng Wu | Xiaodong He | Bowen Zhou
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Copy module has been widely equipped in the recent abstractive summarization models, which facilitates the decoder to extract words from the source into the summary. Generally, the encoder-decoder attention is served as the copy distribution, while how to guarantee that important words in the source are copied remains a challenge. In this work, we propose a Transformer-based model to enhance the copy mechanism. Specifically, we identify the importance of each source word based on the degree centrality with a directed graph built by the self-attention layer in the Transformer. We use the centrality of each source word to guide the copy process explicitly. Experimental results show that the self-attention graph provides useful guidance for the copy distribution. Our proposed models significantly outperform the baseline methods on the CNN/Daily Mail dataset and the Gigaword dataset.

pdf bib
Emerging Cross-lingual Structure in Pretrained Language Models
Alexis Conneau | Shijie Wu | Haoran Li | Luke Zettlemoyer | Veselin Stoyanov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We study the problem of multilingual masked language modeling, i.e. the training of a single model on concatenated text from multiple languages, and present a detailed study of several factors that influence why these models are so effective for cross-lingual transfer. We show, contrary to what was previously hypothesized, that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains. The only requirement is that there are some shared parameters in the top layers of the multi-lingual encoder. To better understand this result, we also show that representations from monolingual BERT models in different languages can be aligned post-hoc quite effectively, strongly suggesting that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces. For multilingual masked language modeling, these symmetries are automatically discovered and aligned during the joint training process.

pdf bib
General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference
Jingfei Du | Myle Ott | Haoran Li | Xing Zhou | Veselin Stoyanov
Findings of the Association for Computational Linguistics: EMNLP 2020

The state of the art on many NLP tasks is currently achieved by large pre-trained language models, which require a considerable amount of computation. We aim to reduce the inference cost in a setting where many different predictions are made on a single piece of text. In that case, computational cost during inference can be amortized over the different predictions (tasks) using a shared text encoder. We compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks. We also compare ways of extracting fixed- and limited-size representations from this encoder, including pooling features extracted from multiple layers or positions. Our best approach compares favorably to knowledge distillation, achieving higher accuracy and lower computational cost once the system is handling around 7 tasks. Further, we show that through binary quantization, we can reduce the size of the extracted representations by a factor of 16 to store them for later use. The resulting method offers a compelling solution for using large-scale pre-trained models at a fraction of the computational cost when multiple tasks are performed on the same text.

pdf bib
Multimodal Sentence Summarization via Multimodal Selective Encoding
Haoran Li | Junnan Zhu | Jiajun Zhang | Xiaodong He | Chengqing Zong
Proceedings of the 28th International Conference on Computational Linguistics

This paper studies the problem of generating a summary for a given sentence-image pair. Existing multimodal sequence-to-sequence approaches mainly focus on enhancing the decoder by visual signals, while ignoring that the image can improve the ability of the encoder to identify highlights of a news event or a document. Thus, we propose a multimodal selective gate network that considers reciprocal relationships between textual and multi-level visual features, including global image descriptor, activation grids, and object proposals, to select highlights of the event when encoding the source sentence. In addition, we introduce a modality regularization to encourage the summary to capture the highlights embedded in the image more accurately. To verify the generalization of our model, we adopt the multimodal selective gate to the text-based decoder and multimodal-based decoder. Experimental results on a public multimodal sentence summarization dataset demonstrate the advantage of our models over baselines. Further analysis suggests that our proposed multimodal selective gate network can effectively select important information in the input sentence.

pdf bib
On the Faithfulness for E-commerce Product Summarization
Peng Yuan | Haoran Li | Song Xu | Youzheng Wu | Xiaodong He | Bowen Zhou
Proceedings of the 28th International Conference on Computational Linguistics

In this work, we present a model to generate e-commerce product summaries. The consistency between the generated summary and the product attributes is an essential criterion for the ecommerce product summarization task. To enhance the consistency, first, we encode the product attribute table to guide the process of summary generation. Second, we identify the attribute words from the vocabulary, and we constrain these attribute words can be presented in the summaries only through copying from the source, i.e., the attribute words not in the source cannot be generated. We construct a Chinese e-commerce product summarization dataset, and the experimental results on this dataset demonstrate that our models significantly improve the faithfulness.

pdf bib
Multimodal Joint Attribute Prediction and Value Extraction for E-commerce Product
Tiangang Zhu | Yue Wang | Haoran Li | Youzheng Wu | Xiaodong He | Bowen Zhou
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Product attribute values are essential in many e-commerce scenarios, such as customer service robots, product recommendations, and product retrieval. While in the real world, the attribute values of a product are usually incomplete and vary over time, which greatly hinders the practical applications. In this paper, we propose a multimodal method to jointly predict product attributes and extract values from textual product descriptions with the help of the product images. We argue that product attributes and values are highly correlated, e.g., it will be easier to extract the values on condition that the product attributes are given. Thus, we jointly model the attribute prediction and value extraction tasks from multiple aspects towards the interactions between attributes and values. Moreover, product images have distinct effects on our tasks for different product attributes and values. Thus, we selectively draw useful visual information from product images to enhance our model. We annotate a multimodal product attribute value dataset that contains 87,194 instances, and the experimental results on this dataset demonstrate that explicitly modeling the relationship between attributes and values facilitates our method to establish the correspondence between them, and selectively utilizing visual product information is necessary for the task. Our code and dataset are available at https://github.com/jd-aig/JAVE.

pdf bib
Conversational Semantic Parsing
Armen Aghajanyan | Jean Maillard | Akshat Shrivastava | Keith Diedrick | Michael Haeger | Haoran Li | Yashar Mehdad | Veselin Stoyanov | Anuj Kumar | Mike Lewis | Sonal Gupta
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The structured representation for semantic parsing in task-oriented assistant systems is geared towards simple understanding of one-turn queries. Due to the limitations of the representation, the session-based properties such as co-reference resolution and context carryover are processed downstream in a pipelined system. In this paper, we propose a semantic representation for such task-oriented conversational systems that can represent concepts such as co-reference and context carryover, enabling comprehensive understanding of queries in a session. We release a new session-based, compositional task-oriented parsing dataset of 20k sessions consisting of 60k utterances. Unlike Dialog State Tracking Challenges, the queries in the dataset have compositional forms. We propose a new family of Seq2Seq models for the session-based parsing above, which also set state-of-the-art in ATIS, SNIPS, TOP and DSTC2. Notably, we improve the best known results on DSTC2 by up to 5 points for slot-carryover.


pdf bib
MSMO: Multimodal Summarization with Multimodal Output
Junnan Zhu | Haoran Li | Tianshang Liu | Yu Zhou | Jiajun Zhang | Chengqing Zong
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Multimodal summarization has drawn much attention due to the rapid growth of multimedia data. The output of the current multimodal summarization systems is usually represented in texts. However, we have found through experiments that multimodal output can significantly improve user satisfaction for informativeness of summaries. In this paper, we propose a novel task, multimodal summarization with multimodal output (MSMO). To handle this task, we first collect a large-scale dataset for MSMO research. We then propose a multimodal attention model to jointly generate text and select the most relevant image from the multimodal input. Finally, to evaluate multimodal outputs, we construct a novel multimodal automatic evaluation (MMAE) method which considers both intra-modality salience and inter-modality relevance. The experimental results show the effectiveness of MMAE.

pdf bib
Ensure the Correctness of the Summary: Incorporate Entailment Knowledge into Abstractive Sentence Summarization
Haoran Li | Junnan Zhu | Jiajun Zhang | Chengqing Zong
Proceedings of the 27th International Conference on Computational Linguistics

In this paper, we investigate the sentence summarization task that produces a summary from a source sentence. Neural sequence-to-sequence models have gained considerable success for this task, while most existing approaches only focus on improving the informativeness of the summary, which ignore the correctness, i.e., the summary should not contain unrelated information with respect to the source sentence. We argue that correctness is an essential requirement for summarization systems. Considering a correct summary is semantically entailed by the source sentence, we incorporate entailment knowledge into abstractive summarization models. We propose an entailment-aware encoder under multi-task framework (i.e., summarization generation and entailment recognition) and an entailment-aware decoder by entailment Reward Augmented Maximum Likelihood (RAML) training. Experiment results demonstrate that our models significantly outperform baselines from the aspects of informativeness and correctness.

pdf bib
Multilingual Seq2seq Training with Similarity Loss for Cross-Lingual Document Classification
Katherine Yu | Haoran Li | Barlas Oguz
Proceedings of The Third Workshop on Representation Learning for NLP

In this paper we continue experiments where neural machine translation training is used to produce joint cross-lingual fixed-dimensional sentence embeddings. In this framework we introduce a simple method of adding a loss to the learning objective which penalizes distance between representations of bilingually aligned sentences. We evaluate cross-lingual transfer using two approaches, cross-lingual similarity search on an aligned corpus (Europarl) and cross-lingual document classification on a recently published benchmark Reuters corpus, and we find the similarity loss significantly improves performance on both. Furthermore, we notice that while our Reuters results are very competitive, our English results are not as competitive, showing room for improvement in the current cross-lingual state-of-the-art. Our results are based on a set of 6 European languages.


pdf bib
Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video
Haoran Li | Junnan Zhu | Cong Ma | Jiajun Zhang | Chengqing Zong
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The rapid increase of the multimedia data over the Internet necessitates multi-modal summarization from collections of text, image, audio and video. In this work, we propose an extractive Multi-modal Summarization (MMS) method which can automatically generate a textual summary given a set of documents, images, audios and videos related to a specific topic. The key idea is to bridge the semantic gaps between multi-modal contents. For audio information, we design an approach to selectively use its transcription. For vision information, we learn joint representations of texts and images using a neural network. Finally, all the multi-modal aspects are considered to generate the textural summary by maximizing the salience, non-redundancy, readability and coverage through budgeted optimization of submodular functions. We further introduce an MMS corpus in English and Chinese. The experimental results on this dataset demonstrate that our method outperforms other competitive baseline methods.


pdf bib
An End-to-End Chinese Discourse Parser with Adaptation to Explicit and Non-explicit Relation Recognition
Xiaomian Kang | Haoran Li | Long Zhou | Jiajun Zhang | Chengqing Zong
Proceedings of the CoNLL-16 shared task