Nicholas Andrews


pdf bib
Sources of Transfer in Multilingual Named Entity Recognition
David Mueller | Nicholas Andrews | Mark Dredze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Named-entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomena, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.

pdf bib
Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast—Choose Three
Steven Reich | David Mueller | Nicholas Andrews
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Modern neural networks do not always produce well-calibrated predictions, even when trained with a proper scoring function such as cross-entropy. In classification settings, simple methods such as isotonic regression or temperature scaling may be used in conjunction with a held-out dataset to calibrate model outputs. However, extending these methods to structured prediction is not always straightforward or effective; furthermore, a held-out calibration set may not always be available. In this paper, we study ensemble distillation as a general framework for producing well-calibrated structured prediction models while avoiding the prohibitive inference-time cost of ensembles. We validate this framework on two tasks: named-entity recognition and machine translation. We find that, across both tasks, ensemble distillation produces models which retain much of, and occasionally improve upon, the performance and calibration benefits of ensembles, while only requiring a single model during test-time.

pdf bib
Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
Mitchell Gordon | Kevin Duh | Nicholas Andrews
Proceedings of the 5th Workshop on Representation Learning for NLP

Pre-trained universal feature extractors, such as BERT for natural language processing and VGG for computer vision, have become effective methods for improving deep learning models without requiring more labeled data. While effective, feature extractors like BERT may be prohibitively large for some deployment scenarios. We explore weight pruning for BERT and ask: how does compression during pre-training affect transfer learning? We find that pruning affects transfer learning in three broad regimes. Low levels of pruning (30-40%) do not affect pre-training loss or transfer to downstream tasks at all. Medium levels of pruning increase the pre-training loss and prevent useful pre-training information from being transferred to downstream tasks. High levels of pruning additionally prevent models from fitting downstream datasets, leading to further degradation. Finally, we observe that fine-tuning BERT on a specific task does not improve its prunability. We conclude that BERT can be pruned once during pre-training rather than separately for each task without affecting performance.


pdf bib
Learning Invariant Representations of Social Media Users
Nicholas Andrews | Marcus Bishop
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The evolution of social media users’ behavior over time complicates user-level comparison tasks such as verification, classification, clustering, and ranking. As a result, naive approaches may fail to generalize to new users or even to future observations of previously known users. In this paper, we propose a novel procedure to learn a mapping from short episodes of user activity on social media to a vector space in which the distance between points captures the similarity of the corresponding users’ invariant features. We fit the model by optimizing a surrogate metric learning objective over a large corpus of unlabeled social media content. Once learned, the mapping may be applied to users not seen at training time and enables efficient comparisons of users in the resulting vector space. We present a comprehensive evaluation to validate the benefits of the proposed approach using data from Reddit, Twitter, and Wikipedia.


pdf bib
Predicting Twitter User Demographics from Names Alone
Zach Wood-Doughty | Nicholas Andrews | Rebecca Marvin | Mark Dredze
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Social media analysis frequently requires tools that can automatically infer demographics to contextualize trends. These tools often require hundreds of user-authored messages for each user, which may be prohibitive to obtain when analyzing millions of users. We explore character-level neural models that learn a representation of a user’s name and screen name to predict gender and ethnicity, allowing for demographic inference with minimal data. We release trained models1 which may enable new demographic analyses that would otherwise require enormous amounts of data collection

pdf bib
Convolutions Are All You Need (For Classifying Character Sequences)
Zach Wood-Doughty | Nicholas Andrews | Mark Dredze
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

While recurrent neural networks (RNNs) are widely used for text classification, they demonstrate poor performance and slow convergence when trained on long sequences. When text is modeled as characters instead of words, the longer sequences make RNNs a poor choice. Convolutional neural networks (CNNs), although somewhat less ubiquitous than RNNs, have an internal structure more appropriate for long-distance character dependencies. To better understand how CNNs and RNNs differ in handling long sequences, we use them for text classification tasks in several character-level social media datasets. The CNN models vastly outperform the RNN models in our experiments, suggesting that CNNs are superior to RNNs at learning to classify character-level data.


pdf bib
Bayesian Modeling of Lexical Resources for Low-Resource Settings
Nicholas Andrews | Mark Dredze | Benjamin Van Durme | Jason Eisner
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Lexical resources such as dictionaries and gazetteers are often used as auxiliary data for tasks such as part-of-speech induction and named-entity recognition. However, discriminative training with lexical features requires annotated data to reliably estimate the lexical feature weights and may result in overfitting the lexical features at the expense of features which generalize better. In this paper, we investigate a more robust approach: we stipulate that the lexicon is the result of an assumed generative process. Practically, this means that we may treat the lexical resources as observations under the proposed generative model. The lexical resources provide training data for the generative model without requiring separate data to estimate lexical feature weights. We evaluate the proposed approach in two settings: part-of-speech induction and low-resource named-entity recognition.


pdf bib
Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation
Mark Dredze | Nicholas Andrews | Jay DeYoung
Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media


pdf bib
A Concrete Chinese NLP Pipeline
Nanyun Peng | Francis Ferraro | Mo Yu | Nicholas Andrews | Jay DeYoung | Max Thomas | Matthew R. Gormley | Travis Wolfe | Craig Harman | Benjamin Van Durme | Mark Dredze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations


pdf bib
Robust Entity Clustering via Phylogenetic Inference
Nicholas Andrews | Jason Eisner | Mark Dredze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


pdf bib
PARMA: A Predicate Argument Aligner
Travis Wolfe | Benjamin Van Durme | Mark Dredze | Nicholas Andrews | Charley Beller | Chris Callison-Burch | Jay DeYoung | Justin Snyder | Jonathan Weese | Tan Xu | Xuchen Yao
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


pdf bib
Entity Clustering Across Languages
Spence Green | Nicholas Andrews | Matthew R. Gormley | Mark Dredze | Christopher D. Manning
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Name Phylogeny: A Generative Model of String Variation
Nicholas Andrews | Jason Eisner | Mark Dredze
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning


pdf bib
Seeded Discovery of Base Relations in Large Corpora
Nicholas Andrews | Naren Ramakrishnan
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing