Aasish Pappu


2020

100,000 Podcasts: A Spoken English Document Corpus
Ann Clifton | Sravana Reddy | Yongze Yu | Aasish Pappu | Rezvaneh Rezapour | Hamed Bonab | Maria Eskevich | Gareth Jones | Jussi Karlgren | Ben Carterette | Rosie Jones
Proceedings of the 28th International Conference on Computational Linguistics

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. The corpus is orders of magnitude larger than previous speech corpora used for search and summarization. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. Our results show that the size and variability of this corpus open up new avenues for research.

2019

Unsupervised Neologism Normalization Using Embedding Space Mapping
Nasser Zalmout | Kapil Thadani | Aasish Pappu
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

This paper presents an approach for detecting and normalizing neologisms in social media content. Neologisms are recent expressions that are specific to certain entities or events and are increasingly used by the public, but have not yet been accepted in mainstream language. Automated methods for handling neologisms are important for natural language understanding and normalization, especially for informal genres with user-generated content. We present an unsupervised approach for detecting neologisms and then normalizing them to canonical words without relying on parallel training data. Our approach builds on the text normalization literature and introduces adaptations to fit the specificities of this task, including phonetic and etymological considerations. We evaluate the proposed techniques on a dataset of Reddit comments, with detected neologisms and corresponding normalizations.

2017

Post-Processing Techniques for Improving Predictions of Multilabel Learning Approaches
Akshay Soni | Aasish Pappu | Jerry Chia-mau Ni | Troy Chevalier
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In Multilabel Learning (MLL) each training instance is associated with a set of labels and the task is to learn a function that maps an unseen instance to its corresponding label set. In this paper, we present a suite of post-processing techniques, independent of the MLL algorithm, that utilize the conditional and directional label dependences in order to make the predictions from any MLL approach more coherent and precise. We solve a constrained optimization problem over the output produced by any MLL approach, and the result is a refined version of the input predicted label set. Using the proposed techniques, we show an absolute improvement of 3% on English News and 10% on Chinese E-commerce datasets for the P@K metric.

Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus
Courtney Napoles | Joel Tetreault | Aasish Pappu | Enrica Rosato | Brian Provenzale
Proceedings of the 11th Linguistic Annotation Workshop

This work presents a dataset and annotation scheme for the new task of identifying “good” conversations that occur online, which we call ERICs: Engaging, Respectful, and/or Informative Conversations. We develop a taxonomy to reflect features of entire threads and individual comments which we believe contribute to identifying ERICs; code a novel dataset of Yahoo News comment threads (2.4k threads and 10k comments) and 1k threads from the Internet Argument Corpus; and analyze the features characteristic of ERICs. This is one of the largest annotated corpora of online human dialogues, with the most detailed set of annotations. It will be valuable for identifying ERICs and other aspects of argumentation, dialogue, and discourse.

DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging
Sheng Chen | Akshay Soni | Aasish Pappu | Yashar Mehdad
Proceedings of the 2nd Workshop on Representation Learning for NLP

Tagging news articles or blog posts with relevant tags from a collection of predefined ones is referred to as document tagging in this work. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. In this work, we propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec, two popular models for learning distributed representations of words and documents. In DocTag2Vec, we simultaneously learn the representations of words, documents, and tags in a joint vector space during training, and employ a simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec deals directly with raw text instead of a provided feature vector, and in addition enjoys advantages such as learning tag representations and the ability to handle newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods.

2016

Humor in Collective Discourse: Unsupervised Funniness Detection in the New Yorker Cartoon Caption Contest
Dragomir Radev | Amanda Stent | Joel Tetreault | Aasish Pappu | Aikaterini Iliakopoulou | Agustin Chanfreau | Paloma de Juan | Jordi Vallmitjana | Alejandro Jaimes | Rahul Jha | Robert Mankoff
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The New Yorker publishes a weekly captionless cartoon. More than 5,000 readers submit captions for it. The editors select three of them and ask the readers to pick the funniest one. We describe an experiment that compares a dozen automatic methods for selecting the funniest caption. We show that negative sentiment, human-centeredness, and lexical centrality most strongly match the funniest captions, followed by positive sentiment. These results are useful for understanding humor and also in the design of more engaging conversational agents in text and multimodal (vision+text) systems. As part of this work, a large set of cartoons and captions is being made available to the community.

2015

The Cohort and Speechify Libraries for Rapid Construction of Speech Enabled Applications for Android
Tejaswi Kasturi | Haojian Jin | Aasish Pappu | Sungjin Lee | Beverley Harrison | Ramana Murthy | Amanda Stent
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2014

Conversational Strategies for Robustly Managing Dialog in Public Spaces
Aasish Pappu | Ming Sun | Seshadri Sridharan | Alexander Rudnicky
Proceedings of the EACL 2014 Workshop on Dialogue in Motion

Knowledge Acquisition Strategies for Goal-Oriented Dialog Systems
Aasish Pappu | Alexander Rudnicky
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

2013

Predicting Tasks in Goal-Oriented Spoken Dialog Systems using Semantic Knowledge Bases
Aasish Pappu | Alexander Rudnicky
Proceedings of the SIGDIAL 2013 Conference

2012

The Structure and Generality of Spoken Route Instructions
Aasish Pappu | Alexander Rudnicky
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2009

Using Wikipedia for Hierarchical Finer Categorization of Named Entities
Aasish Pappu
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

2008

Vaakkriti: Sanskrit Tokenizer
Aasish Pappu | Ratna Sanyal
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II