Weibo-COV: A Large-Scale COVID-19 Social Media Dataset from Weibo
Yong Hu | Heyan Huang | Anfan Chen | Xian-Ling Mao
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

With the rapid spread of COVID-19 around the world, people are requested to maintain “social distance” and “stay at home”. In this scenario, extensive social interactions transfer to cyberspace, especially on social media platforms like Twitter and Sina Weibo. People generate posts to share information, express opinions and seek help during the pandemic outbreak, and these kinds of data on social media are valuable for studies to prevent COVID-19 transmission, such as early warning and outbreak detection. Therefore, in this paper, we release a novel and fine-grained large-scale COVID-19 social media dataset collected from Sina Weibo, named Weibo-COV, which contains more than 40 million posts ranging from December 1, 2019 to April 30, 2020. Moreover, this dataset includes comprehensive information nuggets like post-level information, interactive information, location information, and repost network. We hope this dataset can promote studies of COVID-19 from multiple perspectives and enable better and faster research to suppress the spread of this pandemic.

Towards Interpretable Reasoning over Paragraph Effects in Situation
Mucheng Ren | Xiubo Geng | Tao Qin | Heyan Huang | Daxin Jiang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We focus on the task of reasoning over paragraph effects in situation, which requires a model to understand the cause and effect described in a background paragraph, and apply the knowledge to a novel situation. Existing works ignore the complicated reasoning process and solve it with a one-step “black box” model. Inspired by human cognitive processes, in this paper we propose a sequential approach for this task which explicitly models each step of the reasoning process with neural network modules. In particular, five reasoning modules are designed and learned in an end-to-end manner, which leads to a more interpretable model. Experimental results on the ROPES dataset demonstrate the effectiveness and explainability of our proposed approach.


Concept Pointer Network for Abstractive Summarization
Wenbo Wang | Yang Gao | Heyan Huang | Yuxiang Zhou
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

A quality abstractive summary should not only copy salient source texts as summaries but should also tend to generate new conceptual words to express concrete details. Inspired by the popular pointer generator sequence-to-sequence model, this paper presents a concept pointer network for improving these aspects of abstractive summarization. The network leverages knowledge-based, context-aware conceptualizations to derive an extended set of candidate concepts. The model then points to the most appropriate choice using both the concept set and original source text. This joint approach generates abstractive summaries with higher-level semantic concepts. The training model is also optimized in a way that adapts to different data, which is based on a novel method of distant-supervised learning guided by reference summaries and the testing set. Overall, the proposed approach provides statistically significant improvements over several state-of-the-art models on both the DUC-2004 and Gigaword datasets. A human evaluation of the model’s abstractive abilities also supports the quality of the summaries produced within this framework.
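
As a rough illustration of the pointer generator idea that the abstract cites as its inspiration, the sketch below mixes a vocabulary distribution with a copy distribution over source tokens. It shows only the standard pointer-generator mixture, not the concept pointer network itself; the function name, the argument names, and the toy numbers are assumptions made for this example.

```python
def pointer_generator_mix(p_gen, vocab_dist, attention, source_tokens):
    """Combine generation and copying into one output distribution.

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_dist    -- dict mapping vocabulary word -> generation probability
    attention     -- attention weights over source positions (sum to 1)
    source_tokens -- source words, aligned with the attention weights
    """
    # Generation part: scale the vocabulary distribution by p_gen.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Copy part: add (1 - p_gen) * attention mass to each source word,
    # which also lets out-of-vocabulary source words receive probability.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy example (hypothetical values): "felix" is not in the vocabulary
# but can still be produced by copying it from the source.
mixed = pointer_generator_mix(
    p_gen=0.5,
    vocab_dist={"the": 0.5, "cat": 0.5},
    attention=[0.25, 0.75],
    source_tokens=["cat", "felix"],
)
```

A concept-aware variant would, under the abstract's description, add candidate concepts as a further set of words the model can point to; how that is weighted is not stated here and is left out of the sketch.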

Improving Neural Machine Translation by Achieving Knowledge Transfer with Sentence Alignment Learning
Xuewen Shi | Heyan Huang | Wenguan Wang | Ping Jian | Yi-Kun Tang
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Neural Machine Translation (NMT) optimized by Maximum Likelihood Estimation (MLE) lacks the guarantee of translation adequacy. To alleviate this problem, we propose an NMT approach that heightens the adequacy in machine translation by transferring the semantic knowledge learned from bilingual sentence alignment. Specifically, we first design a discriminator that learns to estimate sentence alignment scores over translation candidates, and then the learned semantic knowledge is transferred to the NMT model under an adversarial learning framework. We also propose a gated self-attention based encoder for sentence embedding. Furthermore, an N-pair training loss is introduced in our framework to aid the discriminator in better capturing lexical evidence in translation candidates. Experimental results show that our proposed method outperforms baseline NMT models on Chinese-to-English and English-to-German translation tasks. Further analysis also indicates the detailed semantic knowledge transferred from the discriminator to the NMT model.

Open Domain Event Extraction Using Neural Latent Variable Models
Xiao Liu | Heyan Huang | Yue Zhang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We consider open domain event extraction, the task of extracting unconstrained types of events from news clusters. A novel latent variable neural model is constructed, which is scalable to very large corpora. A dataset is collected and manually annotated, with task-specific evaluation metrics being designed. Results show that the proposed unsupervised model gives better performance compared to the state-of-the-art method for event schema induction.


Zewen at SemEval-2018 Task 1: An Ensemble Model for Affect Prediction in Tweets
Zewen Chi | Heyan Huang | Jiangui Chen | Hao Wu | Ran Wei
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper presents a method for Affect in Tweets, which is the task of automatically determining the intensity of emotions and the intensity of sentiment of tweets. The term affect refers to emotion-related categories such as anger, fear, etc. The intensity of emotions needs to be quantified as a real-valued score in [0, 1]. We propose an ensemble system including four different deep learning methods, which are CNN, Bidirectional LSTM (BLSTM), LSTM-CNN and a CNN-based Attention model (CA). Our system gets an average Pearson correlation score of 0.682 in the subtask EI-reg and an average Pearson correlation score of 0.784 in subtask V-reg, which ranks 17th among 48 systems in EI-reg and 19th among 38 systems in V-reg.
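
To make the evaluation measure concrete, the following minimal sketch computes a Pearson correlation between predicted and gold intensity scores and combines several models' predictions. The abstract does not say how the four models are combined, so the plain averaging in `average_ensemble` is an assumption, as are the function names and the toy scores.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ensemble(model_predictions):
    """Average per-tweet scores across models (assumed combination rule).

    model_predictions is a list with one score list per model; all score
    lists cover the same tweets in the same order.
    """
    return [sum(scores) / len(scores) for scores in zip(*model_predictions)]


# Toy intensity scores in [0, 1] for three tweets (hypothetical values).
gold = [0.1, 0.5, 0.9]
ensemble = average_ensemble([
    [0.2, 0.4, 0.8],  # e.g. one model's predictions
    [0.0, 0.6, 1.0],  # e.g. another model's predictions
])
score = pearson(ensemble, gold)
```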

Genre Separation Network with Adversarial Training for Cross-genre Relation Extraction
Ge Shi | Chong Feng | Lifu Huang | Boliang Zhang | Heng Ji | Lejian Liao | Heyan Huang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Relation Extraction suffers from a dramatic performance decrease when training a model on one genre and directly applying it to a new genre, due to the distinct feature distributions. Previous studies address this problem by discovering a shared space across genres using manually crafted features, which requires great human effort. To effectively automate this process, we design a genre-separation network, which applies two encoders, one genre-independent and one genre-shared, to explicitly extract genre-specific and genre-agnostic features. Then we train a relation classifier using the genre-agnostic features on the source genre and directly apply it to the target genre. Experiment results on three distinct genres of the ACE dataset show that our approach achieves up to 6.1% absolute F1-score gain compared to previous methods. By incorporating a set of external linguistic features, our approach outperforms the state-of-the-art by 1.7% absolute F1 gain. We make all programs of our model publicly available for research purposes.

Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation
Xiao Liu | Zhunchen Luo | Heyan Huang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Event extraction is of practical utility in natural language processing. In the real world, it is common for multiple events to exist in the same sentence, and extracting them is more difficult than extracting a single event. Previous works on modeling the associations between events by sequential modeling methods suffer a lot from low efficiency in capturing very long-range dependencies. In this paper, we propose a novel Jointly Multiple Events Extraction (JMEE) framework to jointly extract multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model graph information. The experiment results demonstrate that our proposed framework achieves competitive results compared with state-of-the-art methods.

Task-oriented Word Embedding for Text Classification
Qian Liu | Heyan Huang | Yang Gao | Xiaochi Wei | Yuxin Tian | Luyang Liu
Proceedings of the 27th International Conference on Computational Linguistics

Distributed word representation plays a pivotal role in various natural language processing tasks. In spite of its success, most existing methods only consider contextual information, which is suboptimal when used in various tasks due to a lack of task-specific features. Rational word embeddings should have the ability to capture both the semantic features and task-specific features of words. In this paper, we propose a task-oriented word embedding method and apply it to the text classification task. With the function-aware component, our method regularizes the distribution of words to enable the embedding space to have a clear classification boundary. We evaluate our method using five text classification datasets. The experiment results show that our method significantly outperforms the state-of-the-art methods.


BIT at SemEval-2017 Task 1: Using Semantic Information Space to Evaluate Semantic Textual Similarity
Hao Wu | Heyan Huang | Ping Jian | Yuhang Guo | Chao Su
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper presents three systems for semantic textual similarity (STS) evaluation at the SemEval-2017 STS task. One is an unsupervised system and the other two are supervised systems which simply employ the unsupervised one. All our systems mainly depend on the semantic information space (SIS), which is constructed based on the semantic hierarchical taxonomy in WordNet, to compute non-overlapping information content (IC) of sentences. Our team ranked 2nd among 31 participating teams by the primary score of Pearson correlation coefficient (PCC) mean of 7 tracks and achieved the best performance on Track 1 (AR-AR) dataset.

QLUT at SemEval-2017 Task 2: Word Similarity Based on Word Embedding and Knowledge Base
Fanqing Meng | Wenpeng Lu | Yuteng Zhang | Ping Jian | Shumin Shi | Heyan Huang
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper presents the details of our system submissions to Task 2 of SemEval 2017. We take part in Subtask 1 of this task, which is an English monolingual subtask. This task is designed to evaluate the semantic word similarity of two linguistic items. The results of the runs are assessed by standard Pearson and Spearman correlation against the official gold standard set. The best performance of our runs is 0.781 (Final). The techniques of our runs mainly make use of word embeddings and a knowledge-based method. The results demonstrate that the combined method is effective for the computation of word similarity, while the word embeddings and the knowledge-based technique each need further improvement in their details.

A Parallel Recurrent Neural Network for Language Modeling with POS Tags
Chao Su | Heyan Huang | Shumin Shi | Yuhang Guo | Hao Wu
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation


A Novel Fast Framework for Topic Labeling Based on Similarity-preserved Hashing
Xian-Ling Mao | Yi-Jing Hao | Qiang Zhou | Wen-Qing Yuan | Liner Yang | Heyan Huang
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Recently, topic modeling has been widely applied in data mining due to its powerful ability. A common, major challenge in applying such topic models to other tasks is to accurately interpret the meaning of each topic. Topic labeling, as a major interpreting method, has attracted significant attention recently. However, most previous work focuses only on the effectiveness of topic labeling, and less attention has been paid to quickly creating good topic descriptors; meanwhile, it is hard to assign labels to newly emerging topics using most existing methods. To solve the problems above, in this paper, we propose a novel fast topic labeling framework that casts the labeling problem as a k-nearest neighbor (KNN) search problem in a probability vector set. Our experimental results show that the proposed sequential interleaving method based on locality sensitive hashing (LSH) technology is efficient in boosting the comparison speed among probability distributions, and the proposed framework can generate meaningful labels to interpret topics, including newly emerging topics.
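
The idea of treating topic labeling as a KNN search over probability vectors, sped up by LSH, can be sketched as follows. This is not the paper's sequential interleaving method: it substitutes a generic random-hyperplane LSH with a brute-force fallback, and the function names, the squared Euclidean distance, and the toy labeled vectors are all assumptions made for illustration.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Hash a vector to a bit tuple: one sign bit per hyperplane."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def sq_dist(a, b):
    """Squared Euclidean distance (assumed comparison measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(labeled_vectors, planes):
    """Group (label, vector) pairs into buckets by LSH signature."""
    buckets = {}
    for label, vec in labeled_vectors:
        buckets.setdefault(signature(vec, planes), []).append((label, vec))
    return buckets


def nearest_labels(query, buckets, planes, k=1):
    """Return labels of the k closest vectors in the query's bucket.

    Falls back to scanning every bucket when the query's bucket is empty.
    """
    candidates = buckets.get(signature(query, planes))
    if not candidates:
        candidates = [item for bucket in buckets.values() for item in bucket]
    ranked = sorted(candidates, key=lambda item: sq_dist(query, item[1]))
    return [label for label, _ in ranked[:k]]


# Toy labeled topic distributions over three words (hypothetical values).
labeled = [
    ("sports", [0.7, 0.2, 0.1]),
    ("finance", [0.1, 0.8, 0.1]),
    ("health", [0.1, 0.1, 0.8]),
]
planes = make_hyperplanes(dim=3, n_bits=4)
index = build_index(labeled, planes)
label_for_new_topic = nearest_labels([0.7, 0.2, 0.1], index, planes, k=1)
```

Only vectors sharing the query's bucket are compared, which is where the speed-up over exhaustive comparison comes from; the trade-off is that a true nearest neighbor hashed to a different bucket can be missed.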

CSE: Conceptual Sentence Embeddings based on Attention Model
Yashen Wang | Heyan Huang | Chong Feng | Qiang Zhou | Jiahui Gu | Xiong Gao
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

BIT at SemEval-2016 Task 1: Sentence Similarity Based on Alignments and Vector with the Weight of Information Content
Hao Wu | Heyan Huang | Wenpeng Lu
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)


Topic-Based Chinese Message Polarity Classification System at SIGHAN8-Task2
Chun Liao | Chong Feng | Sen Yang | Heyan Huang
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing


Introduction to BIT Chinese Spelling Correction System at CLP 2014 Bake-off
Min Liu | Ping Jian | Heyan Huang
Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing


Emotional Tendency Identification for Micro-blog Topics Based on Multiple Characteristics
Quanchao Liu | Chong Feng | Heyan Huang
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

Chinese Word Sense Disambiguation based on Context Expansion
Zhizhuo Yang | Heyan Huang
Proceedings of COLING 2012: Posters


Unsupervised Word Sense Disambiguation Using Neighborhood Knowledge
Heyan Huang | Zhizhuo Yang | Ping Jian
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

An English-Chinese Cross-lingual Word Semantic Similarity Measure Exploring Attributes and Relations
Lin Dai | Heyan Huang
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation


Incorporating New Words Detection with Chinese Word Segmentation
Hua-Ping Zhang | Jian Gao | Qian Mo | He-Yan Huang
CIPS-SIGHAN Joint Conference on Chinese Language Processing

Chinese Personal Name Disambiguation Based on Person Modeling
Hua-Ping Zhang | Zhi-Hua Liu | Qian Mo | He-Yan Huang
CIPS-SIGHAN Joint Conference on Chinese Language Processing


Translation & Transform Algorithm of Query Sentence in Cross-Language Information Retrieval
Xiao-fei Zhang | Ke-liang Zhang | He-yan Huang
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation