Katherine Keith


pdf bib
Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates
Katherine Keith | David Jensen | Brendan O’Connor
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Many applications of computational social science aim to infer causal conclusions from non-experimental data. Such observational data often contains confounders, variables that influence both potential causes and potential effects. Unmeasured or latent confounders can bias causal estimates, and this has motivated interest in measuring potential confounders from observed text. For example, an individual’s entire history of social media posts or the content of a news article could provide a rich measurement of multiple confounders.Yet, methods and applications for this problem are scattered across different communities and evaluation practices are inconsistent.This review is the first to gather and categorize these examples and provide a guide to data-processing and evaluation decisions. Despite increased attention on adjusting for confounding using text, there are still many open problems, which we highlight in this paper.

pdf bib
Uncertainty over Uncertainty: Investigating the Assumptions, Annotations, and Text Measurements of Economic Policy Uncertainty
Katherine Keith | Christoph Teichmann | Brendan O’Connor | Edgar Meij
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

Methods and applications are inextricably linked in science, and in particular in the domain of text-as-data. In this paper, we examine one such text-as-data application, an established economic index that measures economic policy uncertainty from keyword occurrences in news. This index, which is shown to correlate with firm investment, employment, and excess market returns, has had substantive impact in both the private sector and academia. Yet, as we revisit and extend the original authors’ annotations and text measurements we find interesting text-as-data methodological research questions: (1) Are annotator disagreements a reflection of ambiguity in language? (2) Do alternative text measurements correlate with one another and with measures of external predictive validity? We find for this application (1) some annotator disagreements of economic policy uncertainty can be attributed to ambiguity in language, and (2) switching measurements from keyword-matching to supervised machine learning classifiers results in low correlation, a concerning implication for the validity of the index.


pdf bib
Modeling Financial Analysts’ Decision Making via the Pragmatics and Semantics of Earnings Calls
Katherine Keith | Amanda Stent
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Every fiscal quarter, companies hold earnings calls in which company executives respond to questions from analysts. After these calls, analysts often change their price target recommendations, which are used in equity re- search reports to help investors make deci- sions. In this paper, we examine analysts’ decision making behavior as it pertains to the language content of earnings calls. We identify a set of 20 pragmatic features of analysts’ questions which we correlate with analysts’ pre-call investor recommendations. We also analyze the degree to which semantic and pragmatic features from an earnings call complement market data in predicting analysts’ post-call changes in price targets. Our results show that earnings calls are moderately predictive of analysts’ decisions even though these decisions are influenced by a number of other factors including private communication with company executives and market conditions. A breakdown of model errors indicates disparate performance on calls from different market sectors.


pdf bib
Monte Carlo Syntax Marginals for Exploring and Using Dependency Parses
Katherine Keith | Su Lin Blodgett | Brendan O’Connor
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Dependency parsing research, which has made significant gains in recent years, typically focuses on improving the accuracy of single-tree predictions. However, ambiguity is inherent to natural language syntax, and communicating such ambiguity is important for error analysis and better-informed downstream applications. In this work, we propose a transition sampling algorithm to sample from the full joint distribution of parse trees defined by a transition-based parsing model, and demonstrate the use of the samples in probabilistic dependency analysis. First, we define the new task of dependency path prediction, inferring syntactic substructures over part of a sentence, and provide the first analysis of performance on this task. Second, we demonstrate the usefulness of our Monte Carlo syntax marginal method for parser error analysis and calibration. Finally, we use this method to propagate parse uncertainty to two downstream information extraction applications: identifying persons killed by police and semantic role assignment.

pdf bib
Uncertainty-aware generative models for inferring document class prevalence
Katherine Keith | Brendan O’Connor
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Prevalence estimation is the task of inferring the relative frequency of classes of unlabeled examples in a group—for example, the proportion of a document collection with positive sentiment. Previous work has focused on aggregating and adjusting discriminative individual classifiers to obtain prevalence point estimates. But imperfect classifier accuracy ought to be reflected in uncertainty over the predicted prevalence for scientifically valid inference. In this work, we present (1) a generative probabilistic modeling approach to prevalence estimation, and (2) the construction and evaluation of prevalence confidence intervals; in particular, we demonstrate that an off-the-shelf discriminative classifier can be given a generative re-interpretation, by backing out an implicit individual-level likelihood function, which can be used to conduct fast and simple group-level Bayesian inference. Empirically, we demonstrate our approach provides better confidence interval coverage than an alternative, and is dramatically more robust to shifts in the class prior between training and testing.


pdf bib
Identifying civilians killed by police with distantly supervised entity-event extraction
Katherine Keith | Abram Handler | Michael Pinkham | Cara Magliozzi | Joshua McDuffie | Brendan O’Connor
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We propose a new, socially-impactful task for natural language processing: from a news corpus, extract names of persons who have been killed by police. We present a newly collected police fatality corpus, which we release publicly, and present a model to solve this problem that uses EM-based distant supervision with logistic regression and convolutional neural network classifiers. Our model outperforms two off-the-shelf event extractor systems, and it can suggest candidate victim names in some cases faster than one of the major manually-collected police fatality databases.