Zero-shot learning has been a tough problem since no labeled data is available for unseen classes during training, especially for classes with low similarity. In this situation, transferring from seen classes to unseen classes is extremely hard. To tackle this problem, in this paper we propose a self-training based method to efficiently leverage unlabeled data. Traditional self-training methods use fixed heuristics to select instances from unlabeled data, whose performance varies among different datasets. We propose a reinforcement learning framework to learn data selection strategy automatically and provide more reliable selection. Experimental results on both benchmarks and a real-world e-commerce dataset show that our approach significantly outperforms previous methods in zero-shot text classification
As a typical multi-layered semi-conscious language phenomenon, sarcasm is widely existed in social media text for enhancing the emotion expression. Thus, the detection and processing of sarcasm is important to social media analysis.However, most existing sarcasm dataset are in English and there is still a lack of authoritative Chinese sarcasm dataset. In this paper, we presents the design and construction of a largest high-quality Chinese sarcasm dataset, which contains 2,486 manual annotated sarcastic texts and 89,296 non-sarcastic texts. Furthermore, a balanced dataset through elaborately sampling the same amount non-sarcastic texts for training sarcasm classifier. Using the dataset as the benchmark, some sarcasm classification methods are evaluated.
This paper presents the design and construction of a large-scale target-based sentiment annotation corpus on Chinese financial news text. Different from the most existing paragraph/document-based annotation corpus, in this study, target-based fine-grained sentiment annotation is performed. The companies, brands and other financial entities are regarded as the targets. The clause reflecting the profitability, loss or other business status of financial entities is regarded as the sentiment expression for determining the polarity. Based on high quality annotation guideline and effective quality control strategy, a corpus with 8,314 target-level sentiment annotation is constructed on 6,336 paragraphs from Chinese financial news text. Based on this corpus, several state-of-the-art sentiment analysis models are evaluated.
We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena such as adjacency pairs, lexical entrainment, and topical coherence. The model consists of a long-short-term memory (LSTM) recurrent network that reads the entire word-level history of a conversation, as well as information about turn taking and speaker overlap, in order to predict each next word. The model is applied in a rescoring framework, where the word history prior to the current utterance is approximated with preliminary recognition results. In experiments in the conversational telephone speech domain (Switchboard) we find that such a model gives substantial perplexity reductions over a standard LSTM-LM with utterance scope, as well as improvements in word error rate.