Mark Dredze


2020

pdf bib
Sources of Transfer in Multilingual Named Entity Recognition
David Mueller | Nicholas Andrews | Mark Dredze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Named entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomenon, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.
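
As an illustration of the two-stage recipe described above, here is a minimal, hypothetical PyTorch sketch: train one tagger on the concatenation of all languages, then fine-tune the same weights on the target language alone. The toy model, random data, and hyperparameters are placeholders, not the paper's implementation.

```python
# Minimal sketch (not the paper's code) of polyglot training followed by
# monolingual fine-tuning. All components here are toy placeholders.
import torch
import torch.nn as nn

class ToyTagger(nn.Module):
    def __init__(self, vocab=10000, dim=64, n_tags=9):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, n_tags)

    def forward(self, tokens):                 # (batch, seq_len)
        return self.out(self.emb(tokens))      # (batch, seq_len, n_tags)

def batches(n=4, batch=8, seq=12, vocab=10000, n_tags=9):
    """Random stand-ins for annotated NER batches."""
    return [(torch.randint(0, vocab, (batch, seq)),
             torch.randint(0, n_tags, (batch, seq))) for _ in range(n)]

def train(model, data, epochs=1, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for tokens, tags in data:
            opt.zero_grad()
            logits = model(tokens)
            loss_fn(logits.view(-1, logits.size(-1)), tags.view(-1)).backward()
            opt.step()

model = ToyTagger()
train(model, batches(), epochs=5)   # 1) polyglot: all languages concatenated
train(model, batches(), epochs=2)   # 2) fine-tune on the target language only
```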

pdf bib
Clinical Concept Linking with Contextualized Neural Representations
Elliot Schumacher | Andriy Mulyar | Mark Dredze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In traditional approaches to entity linking, linking decisions are based on three sources of information – the similarity of the mention string to an entity’s name, the similarity of the context of the document to the entity, and broader information about the knowledge base (KB). In some domains, there is little contextual information present in the KB and thus we rely more heavily on mention string similarity. We consider one example of this, concept linking, which seeks to link mentions of medical concepts to a medical concept ontology. We propose an approach to concept linking that leverages recent work on contextualized neural models, such as ELMo (Peters et al., 2018), which create token representations that integrate the surrounding context of the mention and the concept name. We find that a neural ranking approach paired with contextualized embeddings provides gains over a competitive baseline (Leaman et al., 2013). Additionally, we find that a pre-training step using synonyms from the ontology offers a useful initialization for the ranker.
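
The scoring backbone of such a ranker can be pictured with a small, hypothetical sketch: embed the mention in context, embed each candidate concept name, and rank candidates by cosine similarity. The random 8-dimensional vectors stand in for real contextualized encodings; the paper trains a full neural ranker on top of this kind of signal.

```python
# Minimal sketch: rank ontology concepts for a mention by cosine similarity
# of (stand-in) contextualized embeddings. Concept IDs/names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mention_vec = rng.normal(size=8)            # embedding of the mention in context
concepts = {
    "C0020538": rng.normal(size=8),         # e.g. "hypertensive disease"
    "C0011849": rng.normal(size=8),         # e.g. "diabetes mellitus"
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ranked = sorted(concepts, key=lambda c: cosine(mention_vec, concepts[c]),
                reverse=True)
print(ranked)                               # best-matching concept first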

pdf bib
Do Models of Mental Health Based on Social Media Data Generalize?
Keith Harrigian | Carlos Aguirre | Mark Dredze
Findings of the Association for Computational Linguistics: EMNLP 2020

Proxy-based methods for annotating mental health status in social media have grown popular in computational research due to their ability to gather large training samples. However, an emerging body of literature has raised new concerns regarding the validity of these types of methods for use in clinical applications. To further understand the robustness of distantly supervised mental health models, we explore the generalization ability of machine learning classifiers trained to detect depression in individuals across multiple social media platforms. Our experiments not only reveal that substantial loss occurs when transferring between platforms, but also that there exist several unreliable confounding factors that may enable researchers to overestimate classification performance. Based on these results, we enumerate recommendations for future mental health dataset construction.

pdf bib
Civil Unrest on Twitter (CUT): A Dataset of Tweets to Support Research on Civil Unrest
Justin Sech | Alexandra DeLucia | Anna L. Buczak | Mark Dredze
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

We present CUT, a dataset for studying Civil Unrest on Twitter. Our dataset includes 4,381 tweets related to civil unrest, hand-annotated with information related to the study of civil unrest discussion and events. Our dataset is drawn from 42 countries from 2014 to 2019. We present baseline systems trained on this data for the identification of tweets related to civil unrest. We include a discussion of ethical issues related to research on this topic.

pdf bib
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020
Karin Verspoor | Kevin Bretonnel Cohen | Mark Dredze | Emilio Ferrara | Jonathan May | Robert Munro | Cecile Paris | Byron Wallace
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

pdf bib
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020
Karin Verspoor | Kevin Bretonnel Cohen | Michael Conway | Berry de Bruijn | Mark Dredze | Rada Mihalcea | Byron Wallace
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

pdf bib
Do Explicit Alignments Robustly Improve Multilingual Encoders?
Shijie Wu | Mark Dredze
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Multilingual BERT (mBERT), XLM-RoBERTa (XLMR) and other unsupervised multilingual encoders can effectively learn cross-lingual representations. Explicit alignment objectives based on bitexts like Europarl or MultiUN have been shown to further improve these representations. However, word-level alignments are often suboptimal and such bitexts are unavailable for many languages. In this paper, we propose a new contrastive alignment objective that can better utilize such signal, and examine whether these previous alignment methods can be adapted to noisier sources of aligned data: a randomly sampled 1 million pair subset of the OPUS collection. Additionally, rather than report results on a single dataset with a single model run, we report the mean and standard deviation of multiple runs with different seeds, on four datasets and tasks. Our more extensive analysis finds that, while our new objective outperforms previous work, overall these methods do not improve performance with a more robust evaluation framework. Furthermore, the gains from using a better underlying model eclipse any benefits from alignment training. These negative results dictate more care in evaluating these methods and suggest limitations in applying explicit alignment objectives.
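
To make the flavor of a contrastive alignment objective concrete, here is a minimal InfoNCE-style sketch over a batch of word-aligned embedding pairs, with the rest of the batch serving as negatives; the paper's exact objective may differ in its details.

```python
# Minimal sketch of a contrastive alignment loss over aligned word pairs
# (an InfoNCE-style illustration, not necessarily the paper's formulation).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src, tgt, temperature=0.1):
    """src, tgt: (batch, dim) embeddings of word-aligned pairs.
    Row i of src should score highest against row i of tgt."""
    src = F.normalize(src, dim=-1)
    tgt = F.normalize(tgt, dim=-1)
    logits = src @ tgt.T / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0))      # positives lie on the diagonal
    # Symmetrize: align source->target and target->source.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = contrastive_alignment_loss(torch.randn(16, 32), torch.randn(16, 32))
```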

pdf bib
Are All Languages Created Equal in Multilingual BERT?
Shijie Wu | Mark Dredze
Proceedings of the 5th Workshop on Representation Learning for NLP

Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, which represent only a third of the languages mBERT covers. We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-speech Tagging and Dependency Parsing (54 languages each). mBERT performs better than or comparably to baselines on high-resource languages, but much worse on low-resource languages. Furthermore, monolingual BERT models for these languages do even worse. Paired with similar languages, the performance gap between monolingual BERT and mBERT can be narrowed. We find that better models for low-resource languages require more efficient pretraining techniques or more data.

2019

pdf bib
Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
Shijie Wu | Mark Dredze
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Pretrained contextual representation models (Peters et al., 2018; Devlin et al., 2018) have pushed forward the state-of-the-art on many NLP tasks. A new release of BERT (Devlin, 2018) includes a model simultaneously pretrained on 104 languages with impressive performance for zero-shot cross-lingual transfer on a natural language inference task. This paper explores the broader cross-lingual potential of mBERT (multilingual BERT) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing. We compare mBERT with the best published methods for zero-shot cross-lingual transfer and find mBERT competitive on each task. Additionally, we investigate the most effective strategy for utilizing mBERT in this manner, determine to what extent mBERT generalizes away from language-specific features, and measure factors that influence cross-lingual transfer.
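
The zero-shot transfer recipe evaluated here can be summarized with a small, hypothetical sketch using the Hugging Face transformers library: fine-tune mBERT on labeled data in one language (the training loop is elided to a comment), then apply the very same weights to another language with no target-language supervision.

```python
# Minimal sketch of zero-shot cross-lingual transfer with mBERT.
# The task, label count, and fine-tuning loop are placeholders.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

# 1) Fine-tune `model` on English labeled examples only (standard
#    supervised loop, omitted here).
# 2) Zero-shot evaluation: the same weights score a new language directly.
batch = tokenizer(["Ein Beispiel auf Deutsch."], return_tensors="pt")
pred = model(**batch).logits.argmax(-1)     # no German training data seen
```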

pdf bib
Mental Health Surveillance over Social Media with Digital Cohorts
Silvio Amir | Mark Dredze | John W. Ayers
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

The ability to track mental health conditions via social media opened the doors for large-scale, automated, mental health surveillance. However, inferring accurate population-level trends requires representative samples of the underlying population, which can be challenging given the biases inherent in social media data. While previous work has adjusted samples based on demographic estimates, the populations were selected based on specific outcomes, e.g. specific mental health conditions. We depart from these methods by conducting analyses over demographically representative digital cohorts of social media users. To validate this approach, we constructed a cohort of US-based Twitter users to measure the prevalence of depression and PTSD, and investigate how these illnesses manifest across demographic subpopulations. The analysis demonstrates that cohort-based studies can help control for sampling biases, contextualize outcomes, and provide deeper insights into the data.

2018

pdf bib
Deep Dirichlet Multinomial Regression
Adrian Benton | Mark Dredze
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Dirichlet Multinomial Regression (DMR) and other supervised topic models can incorporate arbitrary document-level features to inform topic priors. However, their ability to model corpora is limited by the representation and selection of these features – a choice the topic modeler must make. Instead, we seek models that can learn the feature representations upon which to condition topic selection. We present deep Dirichlet Multinomial Regression (dDMR), a generative topic model that simultaneously learns document feature representations and topics. We evaluate dDMR on three datasets: New York Times articles with fine-grained tags, Amazon product reviews with product images, and Reddit posts with subreddit identity. dDMR learns representations that outperform DMR and LDA according to heldout perplexity and are more effective at downstream predictive tasks as the number of topics grows. Additionally, human subjects judge dDMR topics as being more representative of associated document features. Finally, we find that supervision leads to faster convergence as compared to an LDA baseline and that dDMR’s model fit is less sensitive to training parameters than DMR.
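
The core mechanism can be sketched in a few lines: a document's features pass through a network whose exponentiated outputs become the concentrations of that document's Dirichlet prior over topics; DMR uses a linear map where dDMR learns a deep one. The tiny fixed MLP below is purely illustrative, not the paper's model.

```python
# Minimal sketch of the (d)DMR idea: document features -> Dirichlet prior.
import numpy as np

rng = np.random.default_rng(0)
n_topics, feat_dim, hidden = 5, 10, 8
W1 = rng.normal(size=(feat_dim, hidden))   # stand-ins for learned weights
W2 = rng.normal(size=(hidden, n_topics))

def doc_topic_prior(x):
    """Map document features x to Dirichlet concentrations alpha_d > 0."""
    h = np.maximum(x @ W1, 0.0)            # the "deep" representation (ReLU MLP)
    return np.exp(h @ W2)                  # exp keeps concentrations positive

x_d = rng.normal(size=feat_dim)            # e.g. encoded tags / image features
alpha_d = doc_topic_prior(x_d)
theta_d = rng.dirichlet(alpha_d)           # draw the document's topic distribution
```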

pdf bib
Challenges of Using Text Classifiers for Causal Inference
Zach Wood-Doughty | Ilya Shpitser | Mark Dredze
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Causal understanding is essential for many kinds of decision-making, but causal inference from observational data has typically only been applied to structured, low-dimensional datasets. While text classifiers produce low-dimensional outputs, their use in causal inference has not previously been studied. To facilitate causal analyses based on language data, we consider the role that text classifiers can play in causal inference through established modeling mechanisms from the causality literature on missing data and measurement error. We demonstrate how to conduct causal analyses using text classifiers on simulated and Yelp data, and discuss the opportunities and challenges of future work that uses text data in causal inference.
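
One established mechanism from the measurement-error literature that this line of work draws on can be shown in a short sketch, assuming the classifier's confusion matrix has been estimated on validation data: the distribution of predicted labels can then be corrected back toward the true class distribution. This illustrates the general idea only, not the paper's specific analyses.

```python
# Minimal sketch: correct a text classifier's output distribution using its
# known confusion matrix (standard misclassification adjustment).
import numpy as np

# M[i, j] = P(classifier predicts i | true class j), from validation data.
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])
q = np.array([0.40, 0.60])     # distribution of *predicted* labels in the corpus
p = np.linalg.solve(M, q)      # corrected estimate of the *true* distribution
print(p)                       # -> [0.2857..., 0.7142...]
```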

pdf bib
Johns Hopkins or johnny-hopkins: Classifying Individuals versus Organizations on Twitter
Zach Wood-Doughty | Praateek Mahajan | Mark Dredze
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Twitter user accounts include a range of different user types. While many individuals use Twitter, organizations also have Twitter accounts. Identifying opinions and trends from Twitter requires the accurate differentiation of these two groups. Previous work (McCorriston et al., 2015) presented a method for determining if an account was an individual or organization based on account profile and a collection of tweets. We present a method that relies solely on the account profile, allowing for the classification of individuals versus organizations based on a single tweet. Our method obtains accuracies comparable to methods that rely on much more information by leveraging two improvements: a character-based Convolutional Neural Network, and an automatically derived labeled corpus an order of magnitude larger than the previously available dataset. We make both the dataset and the resulting tool available.

pdf bib
Predicting Twitter User Demographics from Names Alone
Zach Wood-Doughty | Nicholas Andrews | Rebecca Marvin | Mark Dredze
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Social media analysis frequently requires tools that can automatically infer demographics to contextualize trends. These tools often require hundreds of user-authored messages for each user, which may be prohibitive to obtain when analyzing millions of users. We explore character-level neural models that learn a representation of a user’s name and screen name to predict gender and ethnicity, allowing for demographic inference with minimal data. We release trained models, which may enable new demographic analyses that would otherwise require enormous amounts of data collection.

pdf bib
Using Author Embeddings to Improve Tweet Stance Classification
Adrian Benton | Mark Dredze
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

Many social media classification tasks analyze the content of a message, but do not consider the context of the message. For example, in tweet stance classification – where a tweet is categorized according to a viewpoint it espouses – the expressed viewpoint depends on latent beliefs held by the user. In this paper we investigate whether incorporating knowledge about the author can improve tweet stance classification. Furthermore, since author information and embeddings are often unavailable for labeled training examples, we propose a semi-supervised pretraining method to predict user embeddings. Although the neural stance classifiers we learn are often outperformed by a baseline SVM, author embedding pre-training yields improvements over a non-pre-trained neural network on four out of five domains in the SemEval 2016 Task 6A tweet stance classification task. On a gun control tweet stance classification dataset, improvements from pre-training are only apparent when training data is limited.
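
One simple way to picture "incorporating knowledge about the author" is a classifier head over the concatenation of a text representation and a pretrained author embedding, as in this hypothetical sketch. Shapes and names are placeholders; the paper additionally pretrains the author embeddings with a semi-supervised objective.

```python
# Minimal sketch: fold a pretrained author embedding into stance classification.
import torch
import torch.nn as nn

class StanceClassifier(nn.Module):
    def __init__(self, text_dim=128, author_dim=32, n_stances=3):
        super().__init__()
        self.head = nn.Linear(text_dim + author_dim, n_stances)

    def forward(self, text_vec, author_vec):
        return self.head(torch.cat([text_vec, author_vec], dim=-1))

model = StanceClassifier()
# The author vectors would come from semi-supervised pretraining on the
# author's unlabeled tweets; random tensors stand in for them here.
logits = model(torch.randn(4, 128), torch.randn(4, 32))
```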

pdf bib
Convolutions Are All You Need (For Classifying Character Sequences)
Zach Wood-Doughty | Nicholas Andrews | Mark Dredze
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

While recurrent neural networks (RNNs) are widely used for text classification, they demonstrate poor performance and slow convergence when trained on long sequences. When text is modeled as characters instead of words, the longer sequences make RNNs a poor choice. Convolutional neural networks (CNNs), although somewhat less widely used than RNNs, have an internal structure more appropriate for long-distance character dependencies. To better understand how CNNs and RNNs differ in handling long sequences, we use them for text classification tasks on several character-level social media datasets. The CNN models vastly outperform the RNN models in our experiments, suggesting that CNNs are superior to RNNs at learning to classify character-level data.
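
A minimal character-level CNN classifier in the spirit described above looks like the following sketch (a generic char-CNN, not the paper's exact model): embed characters, apply 1-D convolutions, max-pool over time, classify.

```python
# Minimal character-level CNN text classifier.
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=256, emb=16, n_filters=64, width=5, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, n_filters, kernel_size=width)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, char_ids):                 # (batch, seq_len)
        x = self.emb(char_ids).transpose(1, 2)   # (batch, emb, seq_len)
        x = torch.relu(self.conv(x))             # character n-gram detectors
        x = x.max(dim=2).values                  # max-over-time pooling
        return self.out(x)

model = CharCNN()
ids = torch.randint(0, 256, (8, 140))            # e.g. a batch of tweets as bytes
logits = model(ids)
```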

2017

pdf bib
Bayesian Modeling of Lexical Resources for Low-Resource Settings
Nicholas Andrews | Mark Dredze | Benjamin Van Durme | Jason Eisner
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Lexical resources such as dictionaries and gazetteers are often used as auxiliary data for tasks such as part-of-speech induction and named-entity recognition. However, discriminative training with lexical features requires annotated data to reliably estimate the lexical feature weights and may result in overfitting the lexical features at the expense of features which generalize better. In this paper, we investigate a more robust approach: we stipulate that the lexicon is the result of an assumed generative process. Practically, this means that we may treat the lexical resources as observations under the proposed generative model. The lexical resources provide training data for the generative model without requiring separate data to estimate lexical feature weights. We evaluate the proposed approach in two settings: part-of-speech induction and low-resource named-entity recognition.

pdf bib
Pocket Knowledge Base Population
Travis Wolfe | Mark Dredze | Benjamin Van Durme
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Existing Knowledge Base Population methods extract relations from a closed relational schema with limited coverage, leading to sparse KBs. We propose Pocket Knowledge Base Population (PKBP), the task of dynamically constructing a KB of entities related to a query and finding the best characterization of relationships between entities. We describe novel Open Information Extraction methods which leverage the PKB to find informative trigger words. We evaluate using existing KBP shared-task data as well as new annotations collected for this work. Our methods produce a high-quality KB from text alone, with many more entities and relationships than existing KBP systems.

pdf bib
Proceedings of ACL 2017, Student Research Workshop
Allyson Ettinger | Spandana Gella | Matthieu Labeau | Cecilia Ovesdotter Alm | Marine Carpuat | Mark Dredze
Proceedings of ACL 2017, Student Research Workshop

pdf bib
CADET: Computer Assisted Discovery Extraction and Translation
Benjamin Van Durme | Tom Lippincott | Kevin Duh | Deana Burchfield | Adam Poliak | Cash Costello | Tim Finin | Scott Miller | James Mayfield | Philipp Koehn | Craig Harman | Dawn Lawrie | Chandler May | Max Thomas | Annabelle Carrell | Julianne Chaloux | Tongfei Chen | Alex Comerford | Mark Dredze | Benjamin Glass | Shudong Hao | Patrick Martin | Pushpendre Rastogi | Rashmi Sankepally | Travis Wolfe | Ying-Ying Tran | Ted Zhang
Proceedings of the IJCNLP 2017, System Demonstrations

Computer Assisted Discovery Extraction and Translation (CADET) is a workbench for helping knowledge workers find, label, and translate documents of interest. It combines a multitude of analytics together with a flexible environment for customizing the workflow for different users. This open-source framework allows for easy development of new research prototypes using a micro-service architecture based atop Docker and Apache Thrift.

pdf bib
Ethical Research Protocols for Social Media Health Research
Adrian Benton | Glen Coppersmith | Mark Dredze
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

Social media have transformed data-driven research in political science, the social sciences, health, and medicine. Since health research often touches on sensitive topics that relate to ethics of treatment and patient privacy, similar ethical considerations should be acknowledged when using social media data in health research. While much has been said regarding the ethical considerations of social media research, health research leads to an additional set of concerns. We provide practical suggestions in the form of guidelines for researchers working with social media data in health research. These guidelines can inform an IRB proposal for researchers new to social media health research.

pdf bib
Multi-task Domain Adaptation for Sequence Tagging
Nanyun Peng | Mark Dredze
Proceedings of the 2nd Workshop on Representation Learning for NLP

Many domain adaptation approaches rely on learning cross-domain shared representations to transfer the knowledge learned in one domain to other domains. Traditional domain adaptation only considers adapting for one task. In this paper, we explore multi-task representation learning under the domain adaptation scenario. We propose a neural network framework that supports domain adaptation for multiple tasks simultaneously, and learns shared representations that better generalize for domain adaptation. We apply the proposed framework to domain adaptation for sequence tagging problems considering two tasks: Chinese word segmentation and named entity recognition. Experiments show that multi-task domain adaptation works better than disjoint domain adaptation for each task, and achieves state-of-the-art results for both tasks in the social media domain.
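
The overall shape of such a framework can be sketched as one shared encoder feeding separate task-specific heads, as in this hypothetical sketch; dimensions and module names are placeholders, not the paper's architecture.

```python
# Minimal sketch: shared representation with per-task heads
# (e.g. Chinese word segmentation and NER).
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab=5000, dim=64, seg_tags=4, ner_tags=9):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Embedding(vocab, dim), nn.Linear(dim, dim), nn.Tanh())
        self.seg_head = nn.Linear(dim, seg_tags)   # word segmentation
        self.ner_head = nn.Linear(dim, ner_tags)   # named entity recognition

    def forward(self, token_ids, task):
        h = self.shared(token_ids)   # representation shared across tasks/domains
        return self.seg_head(h) if task == "seg" else self.ner_head(h)

model = MultiTaskTagger()
logits = model(torch.randint(0, 5000, (2, 20)), task="ner")
```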

pdf bib
How Does Twitter User Behavior Vary Across Demographic Groups?
Zach Wood-Doughty | Michael Smith | David Broniatowski | Mark Dredze
Proceedings of the Second Workshop on NLP and Computational Social Science

Demographically-tagged social media messages are a common source of data for computational social science. While these messages can indicate differences in beliefs and behaviors between demographic groups, we do not have a clear understanding of how different demographic groups use platforms such as Twitter. This paper presents a preliminary analysis of how groups’ differing behaviors may confound analyses of the groups themselves. We analyzed one million Twitter users by first inferring demographic attributes, and then measuring several indicators of Twitter behavior. We find differences in these indicators across demographic groups, suggesting that there may be underlying differences in how different demographic groups use Twitter.

pdf bib
Constructing an Alias List for Named Entities during an Event
Anietie Andy | Mark Dredze | Mugizi Rwebangira | Chris Callison-Burch
Proceedings of the 3rd Workshop on Noisy User-generated Text

In certain fields, real-time knowledge from events can help in making informed decisions. In order to extract pertinent real-time knowledge related to an event, it is important to identify the named entities and their corresponding aliases related to the event. The problem of identifying aliases of named entities that spike has remained unexplored. In this paper, we introduce an algorithm, EntitySpike, that identifies entities that spike in popularity in tweets from a given time period, and constructs an alias list for these spiked entities. EntitySpike uses a temporal heuristic to identify named entities with similar context that occur in the same time period (within minutes) during an event. Each entity is encoded as a vector using this temporal heuristic. We show how these entity-vectors can be used to create a named entity alias list. We evaluated our algorithm on a dataset of temporally ordered tweets from a single event, the 2013 Grammy Awards show. We carried out various experiments on tweets that were published in the same time period and show that our algorithm identifies most entity name aliases and outperforms a competitive baseline.
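
The temporal-spike intuition can be captured in a small illustrative function (not the EntitySpike implementation): flag entities whose mention count in the current time slice far exceeds their recent average. Thresholds and names below are arbitrary.

```python
# Minimal sketch: detect entities spiking in the most recent time window.
from collections import Counter

def spiking_entities(windows, ratio=3.0, min_count=5):
    """windows: list of lists of entity mentions, one list per time slice.
    Returns entities whose count in the last slice exceeds `ratio` times
    their average count over the earlier slices."""
    history = Counter()
    for w in windows[:-1]:
        history.update(w)
    current = Counter(windows[-1])
    n_past = max(len(windows) - 1, 1)
    return [e for e, c in current.items()
            if c >= min_count and c > ratio * (history[e] / n_past)]

slices = [["Adele"], ["Adele"], ["Adele"] * 2 + ["Daft Punk"] * 8]
print(spiking_entities(slices))  # -> ['Daft Punk']
```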

2016

pdf bib
Embedding Lexical Features via Low-Rank Tensors
Mo Yu | Mark Dredze | Raman Arora | Matthew R. Gormley
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Geolocation for Twitter: Timing Matters
Mark Dredze | Miles Osborne | Prabhanjan Kambadur
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Learning Multiview Embeddings of Twitter Users
Adrian Benton | Raman Arora | Mark Dredze
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning
Nanyun Peng | Mark Dredze
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Knowledge Base Population for Organization Mentions in Email
Ning Gao | Mark Dredze | Douglas Oard
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf bib
Name Variation in Community Question Answering Systems
Anietie Andy | Satoshi Sekine | Mugizi Rwebangira | Mark Dredze
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Community question answering systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer from the most similar previously resolved question on the site. Semantically similar questions can be worded differently, making it difficult to find questions with shared needs. For example, “Who is the best player for the Reds?” and “Who is currently the biggest star at Manchester United?” have a shared need but are worded differently; both “Reds” and “Manchester United” refer to the Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. We show that in these categories, entity linking can be used to identify relevant past resolved questions that share a need with a given question, by disambiguating named entities and matching questions on the disambiguated entities and the knowledge base information related to them. We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Qgiven, [Qpast, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity-rich categories.
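
The matching step can be pictured with a tiny hypothetical sketch: link each question's mentions to canonical KB entities via an alias table, then compare questions on the linked entities rather than surface strings. The two-entry alias table is illustrative only.

```python
# Minimal sketch: entity-based question matching via a toy alias table.
aliases = {"reds": "Manchester United",
           "manchester united": "Manchester United"}

def linked_entities(question):
    """Crude substring 'entity linking' against the alias table."""
    return {aliases[a] for a in aliases if a in question.lower()}

def entity_match(q_new, q_past):
    e_new, e_past = linked_entities(q_new), linked_entities(q_past)
    return bool(e_new) and e_new == e_past   # shared need via shared entities

print(entity_match("Who is the best player for the Reds?",
                   "Who is currently the biggest star at Manchester United?"))
# -> True: different surface forms, same disambiguated entity
```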

pdf bib
Demographer: Extremely Simple Name Demographics
Rebecca Knowles | Josh Carroll | Mark Dredze
Proceedings of the First Workshop on NLP and Computational Social Science

pdf bib
A Study of Imitation Learning Methods for Semantic Role Labeling
Travis Wolfe | Mark Dredze | Benjamin Van Durme
Proceedings of the Workshop on Structured Prediction for NLP

pdf bib
Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation
Mark Dredze | Nicholas Andrews | Jay DeYoung
Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media

2015

pdf bib
Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings
Nanyun Peng | Mark Dredze
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Improved Relation Extraction with Feature-Rich Compositional Embedding Models
Matthew R. Gormley | Mo Yu | Mark Dredze
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses
Glen Coppersmith | Mark Dredze | Craig Harman | Kristy Hollingshead
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

pdf bib
CLPsych 2015 Shared Task: Depression and PTSD on Twitter
Glen Coppersmith | Mark Dredze | Craig Harman | Kristy Hollingshead | Margaret Mitchell
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

pdf bib
Predicate Argument Alignment using a Global Coherence Model
Travis Wolfe | Mark Dredze | Benjamin Van Durme
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Entity Linking for Spoken Language
Adrian Benton | Mark Dredze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Combining Word Embeddings and Feature Embeddings for Fine-grained Relation Extraction
Mo Yu | Matthew R. Gormley | Mark Dredze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
A Concrete Chinese NLP Pipeline
Nanyun Peng | Francis Ferraro | Mo Yu | Nicholas Andrews | Jay DeYoung | Max Thomas | Matthew R. Gormley | Travis Wolfe | Craig Harman | Benjamin Van Durme | Mark Dredze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf bib
An Empirical Study of Chinese Name Matching and Applications
Nanyun Peng | Mo Yu | Mark Dredze
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
FrameNet+: Fast Paraphrastic Tripling of FrameNet
Ellie Pavlick | Travis Wolfe | Pushpendre Rastogi | Chris Callison-Burch | Mark Dredze | Benjamin Van Durme
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Sprite: Generalizing Topic Models with Structured Priors
Michael J. Paul | Mark Dredze
Transactions of the Association for Computational Linguistics, Volume 3

We introduce Sprite, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing Sprite to be tailored to particular settings. We demonstrate this flexibility by constructing a Sprite-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.
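
Schematically, and in notation of my own rather than necessarily the paper's, the structured-prior idea makes each document's Dirichlet concentrations over topics a log-linear function of shared components:

```latex
\tilde{\alpha}_{dk} = \exp\Big( b_k + \sum_{c} \alpha_{dc}\,\delta_{ck} \Big),
\qquad \theta_d \sim \operatorname{Dirichlet}\big(\tilde{\alpha}_{d1}, \dots, \tilde{\alpha}_{dK}\big)
```

Here \alpha_{dc} is document d's weight on component c and \delta_{ck} is component c's association with topic k; constraining or tying these weights is what yields hierarchies, factorizations, correlations, or supervision.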

pdf bib
Learning Composition Models for Phrase Embeddings
Mo Yu | Mark Dredze
Transactions of the Association for Computational Linguistics, Volume 3

Lexical embeddings can serve as useful representations of words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While a separate embedding can be learned for each word, doing so for every phrase is infeasible. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use.
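
One way to picture feature-driven composition, as a hypothetical sketch rather than the paper's model: transform each word's embedding with weights tied to structural features (say, head vs. modifier) and sum the results into a phrase embedding. The matrices below are random placeholders.

```python
# Minimal sketch: compose word embeddings into a phrase embedding using
# feature-specific transforms.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
T = {"head": rng.normal(size=(dim, dim)),       # one transform per feature
     "modifier": rng.normal(size=(dim, dim))}

def compose(words):
    """words: list of (embedding, feature) pairs for one phrase."""
    return sum(T[feat] @ vec for vec, feat in words)

phrase = [(rng.normal(size=dim), "modifier"),   # e.g. "federal"
          (rng.normal(size=dim), "head")]       # e.g. "agency"
phrase_vec = compose(phrase)
```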

pdf bib
Approximation-Aware Dependency Parsing by Belief Propagation
Matthew R. Gormley | Mark Dredze | Jason Eisner
Transactions of the Association for Computational Linguistics, Volume 3

We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n³) runtime. It outputs the parse with maximum expected recall—but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by back-propagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood.

2014

pdf bib
Robust Entity Clustering via Phylogenetic Inference
Nicholas Andrews | Jason Eisner | Mark Dredze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Low-Resource Semantic Role Labeling
Matthew R. Gormley | Margaret Mitchell | Benjamin Van Durme | Mark Dredze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Improving Lexical Embeddings with Semantic Knowledge
Mo Yu | Mark Dredze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Learning Polylingual Topic Models from Code-Switched Social Media Documents
Nanyun Peng | Yiming Wang | Mark Dredze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Quantifying Mental Health Signals in Twitter
Glen Coppersmith | Mark Dredze | Craig Harman
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

2013

pdf bib
PARMA: A Predicate Argument Aligner
Travis Wolfe | Benjamin Van Durme | Mark Dredze | Nicholas Andrews | Charley Beller | Chris Callison-Burch | Jay DeYoung | Justin Snyder | Jonathan Weese | Tan Xu | Xuchen Yao
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models
Michael J. Paul | Mark Dredze
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
What’s in a Domain? Multi-Domain Learning for Multi-Attribute Data
Mahesh Joshi | Mark Dredze | William W. Cohen | Carolyn P. Rosé
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Separating Fact from Fear: Tracking Flu Infections on Twitter
Alex Lamb | Michael J. Paul | Mark Dredze
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter
Shane Bergsma | Mark Dredze | Benjamin Van Durme | Theresa Wilson | David Yarowsky
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Topic Models and Metadata for Visualizing Text Corpora
Justin Snyder | Rebecca Knowles | Mark Dredze | Matthew Gormley | Travis Wolfe
Proceedings of the 2013 NAACL HLT Demonstration Session

2012

pdf bib
Entity Clustering Across Languages
Spence Green | Nicholas Andrews | Matthew R. Gormley | Mark Dredze | Christopher D. Manning
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Shared Components Topic Models
Matthew R. Gormley | Mark Dredze | Benjamin Van Durme | Jason Eisner
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
Ariya Rastrow | Mark Dredze | Sanjeev Khudanpur
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Revisiting the Case for Explicit Syntactic Information in Language Models
Ariya Rastrow | Sanjeev Khudanpur | Mark Dredze
Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT

pdf bib
Name Phylogeny: A Generative Model of String Variation
Nicholas Andrews | Jason Eisner | Mark Dredze
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Multi-Domain Learning: When Do Domains Matter?
Mahesh Joshi | Mark Dredze | William W. Cohen | Carolyn Rosé
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

pdf bib
Learning Sub-Word Units for Open Vocabulary Speech Recognition
Carolina Parada | Mark Dredze | Abhinav Sethy | Ariya Rastrow
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Contextual Information Improves OOV Detection in Speech
Carolina Parada | Mark Dredze | Denis Filimonov | Frederick Jelinek
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language
Courtney Napoles | Mark Dredze
Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids

pdf bib
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk
Chris Callison-Burch | Mark Dredze
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
Creating Speech and Language Data With Amazon’s Mechanical Turk
Chris Callison-Burch | Mark Dredze
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
Annotating Named Entities in Twitter Data with Crowdsourcing
Tim Finin | William Murnane | Anand Karandikar | Nicholas Keller | Justin Martineau | Mark Dredze
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
Non-Expert Correction of Automatically Generated Relation Annotations
Matthew R. Gormley | Adam Gerber | Mary Harper | Mark Dredze
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
Entity Disambiguation for Knowledge Base Population
Mark Dredze | Paul McNamee | Delip Rao | Adam Gerber | Tim Finin
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Streaming Cross Document Entity Coreference Resolution
Delip Rao | Paul McNamee | Mark Dredze
Coling 2010: Posters

pdf bib
NLP on Spoken Documents Without ASR
Mark Dredze | Aren Jansen | Glen Coppersmith | Ken Church
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
We’re Not in Kansas Anymore: Detecting Domain Changes in Streams
Mark Dredze | Tim Oates | Christine Piatko
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

pdf bib
Multi-Class Confidence Weighted Algorithms
Koby Crammer | Mark Dredze | Alex Kulesza
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

pdf bib
Icelandic Data Driven Part of Speech Tagging
Mark Dredze | Joel Wallenberg
Proceedings of ACL-08: HLT, Short Papers

pdf bib
Active Learning with Confidence
Mark Dredze | Koby Crammer
Proceedings of ACL-08: HLT, Short Papers

pdf bib
Small Statistical Models by Random Feature Mixing
Kuzman Ganchev | Mark Dredze
Proceedings of the ACL-08: HLT Workshop on Mobile Language Processing

pdf bib
Reading the Markets: Forecasting Public Opinion of Political Candidates by News Analysis
Kevin Lerman | Ari Gilder | Mark Dredze | Fernando Pereira
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Online Methods for Multi-Domain Learning and Adaptation
Mark Dredze | Koby Crammer
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

pdf bib
Automatic Code Assignment to Medical Text
Koby Crammer | Mark Dredze | Kuzman Ganchev | Partha Pratim Talukdar | Steven Carroll
Biological, translational, and clinical language processing

pdf bib
Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification
John Blitzer | Mark Dredze | Fernando Pereira
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

pdf bib
Frustratingly Hard Domain Adaptation for Dependency Parsing
Mark Dredze | John Blitzer | Partha Pratim Talukdar | Kuzman Ganchev | João Graça | Fernando Pereira
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
