Doug Downey


2020

Stolen Probability: A Structural Weakness of Neural Language Models
David Demeter | Gregory Kimmel | Doug Downey
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural Network Language Models (NNLMs) generate probability distributions by applying a softmax function to a distance metric formed by taking the dot product of a prediction vector with all word vectors in a high-dimensional embedding space. The dot-product distance metric forms part of the inductive bias of NNLMs. Although NNLMs optimize well with this inductive bias, we show that this results in a sub-optimal ordering of the embedding space that structurally impoverishes some words at the expense of others when assigning probability. We present numerical, theoretical and empirical analyses which show that words on the interior of the convex hull in the embedding space have their probability bounded by the probabilities of the words on the hull.
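
The bound described in this abstract can be illustrated with a toy check. This is a minimal sketch, not the authors' experiment: the 2-D embeddings, the interior point, and the function names are all hypothetical. Because the interior vector is a convex combination of the hull vectors, its dot product with any prediction vector cannot exceed the largest hull dot product, so its softmax probability cannot exceed the largest hull probability.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical embeddings: four words on the convex hull, one strictly inside it.
hull_words = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
interior_word = (0.2, 0.1)
embeddings = hull_words + [interior_word]


def interior_vs_best_hull(prediction):
    """Return (probability of the interior word, highest hull-word probability)."""
    probs = softmax([dot(prediction, e) for e in embeddings])
    return probs[-1], max(probs[:-1])


# Try many random prediction vectors and count how often the interior word wins.
rng = random.Random(0)
trials = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(1000)]
pairs = [interior_vs_best_hull(h) for h in trials]
violations = sum(1 for p_interior, p_hull in pairs if p_interior > p_hull)
print("times the interior word had the top probability:", violations)
```

No prediction vector in this toy setup makes the interior word the most probable one, which is the structural limitation the abstract refers to.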

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Arman Cohan | Sergey Feldman | Iz Beltagy | Doug Downey | Daniel Weld
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose SPECTER, a new method to generate document-level embeddings of scientific papers based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
Suchin Gururangan | Ana Marasović | Swabha Swayamdipta | Kyle Lo | Iz Beltagy | Doug Downey | Noah A. Smith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

Generative Data Augmentation for Commonsense Reasoning
Yiben Yang | Chaitanya Malaviya | Jared Fernandez | Swabha Swayamdipta | Ronan Le Bras | Ji-Ping Wang | Chandra Bhagavatula | Yejin Choi | Doug Downey
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAUG^C, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained language models and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, establishing a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. It also enhances out-of-distribution generalization, proving to be robust against adversaries or perturbations. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.

2019

CODAH: An Adversarially-Authored Question Answering Dataset for Common Sense
Michael Chen | Mike D’Arcy | Alisa Liu | Jared Fernandez | Doug Downey
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially-constructed evaluation dataset for testing common sense. CODAH forms a challenging extension to the recently-proposed SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video. To produce a more difficult dataset, we introduce a novel procedure for question acquisition in which workers author questions designed to target weaknesses of state-of-the-art neural question answering systems. Workers are rewarded for submissions that models fail to answer correctly both before and after fine-tuning (in cross-validation). We create 2.8k questions via this procedure and evaluate the performance of multiple state-of-the-art question answering systems on our dataset. We observe a significant gap between human performance, which is 95.3%, and the best baseline accuracy of 65.3%, achieved by the OpenAI GPT model.

Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models
Yiben Yang | Ji-Ping Wang | Doug Downey
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recurrent neural network language models (RNNLMs) form a valuable foundation for many NLP systems, but training them can be computationally expensive and may take days on a large corpus. We explore a technique that uses large corpus n-gram statistics as a regularizer for training a neural network LM on a smaller corpus. In experiments with the Billion-Word and Wikitext corpora, we show that the technique is effective, and more time-efficient than simply training on a larger sequential corpus. We also introduce new strategies for selecting the most informative n-grams, and show that these boost efficiency.

A Semantic Cover Approach for Topic Modeling
Rajagopal Venkatesaramani | Doug Downey | Bradley Malin | Yevgeniy Vorobeychik
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We introduce a novel topic modeling approach based on constructing a semantic set cover for clusters of similar documents. Specifically, our approach first clusters documents using their Tf-Idf representation, and then covers each cluster with a set of topic words based on semantic similarity, defined in terms of a word embedding. Computing a topic cover amounts to solving a minimum set cover problem. Our evaluation compares our topic modeling approach to Latent Dirichlet Allocation (LDA) on three metrics: 1) qualitative topic match, measured using evaluations by Amazon Mechanical Turk (MTurk) workers, 2) performance on classification tasks using each topic model as a sparse feature representation, and 3) topic coherence. We find that qualitative judgments significantly favor our approach, that the method outperforms LDA on topic coherence, and that it is comparable to LDA on document classification tasks.
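
The minimum set cover step mentioned in this abstract can be sketched with the standard greedy approximation. This is an illustration under stated assumptions, not necessarily the solver used in the paper: the function name `greedy_set_cover`, the document ids, and the precomputed coverage sets (which documents each candidate topic word is "semantically close" to) are hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Greedy approximation to minimum set cover.

    universe: iterable of items to cover (here, document ids in one cluster).
    candidate_sets: dict mapping a candidate topic word to the set of items it covers.
    Returns (chosen words in selection order, items left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the word that covers the most still-uncovered documents.
        best = max(candidate_sets, key=lambda w: len(candidate_sets[w] & uncovered))
        gained = candidate_sets[best] & uncovered
        if not gained:
            break  # no candidate covers any of the remaining documents
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical cluster of five documents and four candidate topic words.
documents = ["d1", "d2", "d3", "d4", "d5"]
coverage = {
    "health": {"d1", "d2", "d3"},
    "sports": {"d3", "d4"},
    "finance": {"d4", "d5"},
    "news": {"d1"},
}
chosen, uncovered = greedy_set_cover(documents, coverage)
print(chosen, uncovered)
```

In a fuller pipeline the coverage sets would be derived from embedding similarity between candidate words and the documents in a cluster; here they are hard-coded to keep the sketch self-contained.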

2018

Construction of the Literature Graph in Semantic Scholar
Waleed Ammar | Dirk Groeneveld | Chandra Bhagavatula | Iz Beltagy | Miles Crawford | Doug Downey | Jason Dunkelberger | Ahmed Elgohary | Sergey Feldman | Vu Ha | Rodney Kinney | Sebastian Kohlmeier | Kyle Lo | Tyler Murray | Hsu-Han Ooi | Matthew Peters | Joanna Power | Sam Skjonsberg | Lucy Wang | Chris Wilhelm | Zheng Yuan | Madeleine van Zuylen | Oren Etzioni
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction to familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org.

Estimating Marginal Probabilities of n-grams for Recurrent Neural Language Models
Thanapon Noraset | Doug Downey | Lidong Bing
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recurrent neural network language models (RNNLMs) are the current standard-bearer for statistical language modeling. However, RNNLMs only estimate probabilities for complete sequences of text, whereas some applications require context-independent phrase probabilities instead. In this paper, we study how to compute an RNNLM’s marginal probability: the probability that the model assigns to a short sequence of text when the preceding context is not known. We introduce a simple method of altering the RNNLM training to make the model more accurate at marginal estimation. Our experiments demonstrate that the technique is effective compared to baselines including the traditional RNNLM probability and an importance sampling approach. Finally, we show how we can use the marginal estimation to improve an RNNLM by training the marginals to match n-gram probabilities from a larger corpus.

Extracting Commonsense Properties from Embeddings with Limited Human Guidance
Yiben Yang | Larry Birnbaum | Ji-Ping Wang | Doug Downey
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Intelligent systems require common sense, but automatically extracting this knowledge from text can be difficult. We propose and assess methods for extracting one type of commonsense knowledge, object-property comparisons, from pre-trained embeddings. In experiments, we show that our approach exceeds the accuracy of previous work but requires substantially less hand-annotated knowledge. Further, we show that an active learning approach that synthesizes common-sense queries can boost accuracy.

Sampling Informative Training Data for RNN Language Models
Jared Fernandez | Doug Downey
Proceedings of ACL 2018, Student Research Workshop

We propose an unsupervised importance sampling approach to selecting training data for recurrent neural network (RNN) language models. To increase the information content of the training set, our approach preferentially samples high perplexity sentences, as determined by an easily queryable n-gram language model. We experimentally evaluate the heldout perplexity of models trained with our various importance sampling distributions. We show that language models trained on data sampled using our proposed approach outperform models trained over randomly sampled subsets of both the Billion Word (Chelba et al., 2014) and Wikitext-103 benchmark corpora (Merity et al., 2016).
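
The sampling idea in this abstract can be sketched as follows. The abstract does not specify the scoring model or the exact sampling distributions, so this sketch makes two labeled assumptions: an add-one-smoothed unigram model stands in for the n-gram language model, and sentences are drawn with probability proportional to their perplexity. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def train_unigram(sentences):
    """Return a function giving add-one-smoothed unigram probabilities.

    Assumption: a unigram model stands in for the n-gram scoring model.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)


def perplexity(sentence, prob):
    """Per-token perplexity of a non-empty, whitespace-tokenized sentence."""
    tokens = sentence.split()
    log_prob = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def sample_by_perplexity(sentences, prob, k, seed=0):
    """Draw k sentences (with replacement), weighted by perplexity.

    Assumption: weights proportional to perplexity; other distributions are possible.
    """
    weights = [perplexity(s, prob) for s in sentences]
    rng = random.Random(seed)
    return rng.choices(sentences, weights=weights, k=k)


corpus = [
    "the cat sat",
    "the cat sat",
    "the dog sat",
    "quantum flux capacitors hum",
]
prob = train_unigram(corpus)
sample = sample_by_perplexity(corpus, prob, k=3)
print(sample)
```

Under these assumptions, the sentence made of rare tokens receives the highest perplexity and is therefore the most likely to be drawn into the training subset.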

2017

VecShare: A Framework for Sharing Word Representation Vectors
Jared Fernandez | Zhaocheng Yu | Doug Downey
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Many Natural Language Processing (NLP) models rely on distributed vector representations of words. Because the process of training word vectors can require large amounts of data and computation, NLP researchers and practitioners often utilize pre-trained embeddings downloaded from the Web. However, finding the best embeddings for a given task is difficult, and can be computationally prohibitive. We present a framework, called VecShare, that makes it easy to share and retrieve word embeddings on the Web. The framework leverages a public data-sharing infrastructure to host embedding sets, and provides automated mechanisms for retrieving the embeddings most similar to a given corpus. We perform an experimental evaluation of VecShare’s similarity strategies, and show that they are effective at efficiently retrieving embeddings that boost accuracy in a document classification task. Finally, we provide an open-source Python library for using the VecShare framework.

2015

Efficient Methods for Incorporating Knowledge into Topic Models
Yi Yang | Doug Downey | Jordan Boyd-Graber
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Efficient Methods for Inferring Large Sparse Topic Hierarchies
Doug Downey | Chandra Bhagavatula | Yi Yang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

Learning Representations for Weakly Supervised Natural Language Processing Tasks
Fei Huang | Arun Ahuja | Doug Downey | Yi Yang | Yuhong Guo | Alexander Yates
Computational Linguistics, Volume 40, Issue 1 - March 2014

Active Learning with Constrained Topic Model
Yi Yang | Shimei Pan | Doug Downey | Kunpeng Zhang
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces

Adding High-Precision Links to Wikipedia
Thanapon Noraset | Chandra Bhagavatula | Doug Downey
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Scaling Semi-supervised Naive Bayes with Feature Marginals
Michael Lucas | Doug Downey
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Overcoming the Memory Bottleneck in Distributed Training of Latent Variable Models of Text
Yi Yang | Alexander Yates | Doug Downey
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2011

Language Models as Representations for Weakly Supervised NLP Tasks
Fei Huang | Alexander Yates | Arun Ahuja | Doug Downey
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

Local and Global Algorithms for Disambiguation to Wikipedia
Lev Ratinov | Dan Roth | Doug Downey | Mike Anderson
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Improved Extraction Assessment through Better Language Models
Arun Ahuja | Doug Downey
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

It’s a Contradiction – no, it’s not: A Case Study using Functional Relations
Alan Ritter | Stephen Soderland | Doug Downey | Oren Etzioni
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Sparse Information Extraction: Unsupervised Language Models to the Rescue
Doug Downey | Stefan Schoenmackers | Oren Etzioni
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2005

KnowItNow: Fast, Scalable Information Extraction from the Web
Michael J. Cafarella | Doug Downey | Stephen Soderland | Oren Etzioni
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing