Jonathan K. Kummerfeld


pdf bib
A Novel Workflow for Accurately and Efficiently Crowdsourcing Predicate Senses and Argument Labels
Youxuan Jiang | Huaiyu Zhu | Jonathan K. Kummerfeld | Yunyao Li | Walter Lasecki
Findings of the Association for Computational Linguistics: EMNLP 2020

Resources for Semantic Role Labeling (SRL) are typically annotated by experts at great expense. Prior attempts to develop crowdsourcing methods have either had low accuracy or required substantial expert annotation. We propose a new multi-stage crowd workflow that substantially reduces expert involvement without sacrificing accuracy. In particular, we introduce a unique filter stage based on the key observation that crowd workers are able to almost perfectly filter out incorrect options for labels. Our three-stage workflow produces annotations with 95% accuracy for predicate labels and 93% for argument labels, which is comparable to expert agreement. Compared to prior work on crowdsourcing for SRL, we decrease expert effort by 4x, from 56% to 14% of cases. Our approach enables more scalable annotation of SRL, and could enable annotation of NLP tasks that have previously been considered too complex to effectively crowdsource.

pdf bib
Inconsistencies in Crowdsourced Slot-Filling Annotations: A Typology and Identification Methods
Stefan Larson | Adrian Cheung | Anish Mahendran | Kevin Leach | Jonathan K. Kummerfeld
Proceedings of the 28th International Conference on Computational Linguistics

Slot-filling models in task-driven dialog systems rely on carefully annotated training data. However, annotations by crowd workers are often inconsistent or contain errors. Simple solutions like manually checking annotations or having multiple workers label each sample are expensive and waste effort on samples that are correct. If we can identify inconsistencies, we can focus effort where it is needed. Toward this end, we define six inconsistency types in slot-filling annotations. Using three new noisy crowd-annotated datasets, we show that a wide range of inconsistencies occur and can impact system performance if not addressed. We then introduce automatic methods of identifying inconsistencies. Experiments on our new datasets show that these methods effectively reveal inconsistencies in data, though there is further scope for improvement.

pdf bib
Exploring the Value of Personalized Word Embeddings
Charles Welch | Jonathan K. Kummerfeld | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we introduce personalized word embeddings, and examine their value for language modeling. We compare the performance of our proposed prediction model when using personalized versus generic word representations, and study how these representations can be leveraged for improved performance. We provide insight into what types of words can be more accurately predicted when building personalized models. Our results show that a subset of words belonging to specific psycholinguistic categories tend to vary more in their representations across users and that combining generic and personalized word embeddings yields the best performance, with a 4.7% relative reduction in perplexity. Additionally, we show that a language model using personalized word embeddings can be effectively used for authorship attribution.

pdf bib
Compositional Demographic Word Embeddings
Charles Welch | Jonathan K. Kummerfeld | Verónica Pérez-Rosas | Rada Mihalcea
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Word embeddings are usually derived from corpora containing text from many individuals, thus leading to general purpose representations rather than individually personalized representations. While personalized embeddings can be useful to improve language model performance and other language processing tasks, they can only be computed for people with a large amount of longitudinal data, which is not the case for new users. We propose a new form of personalized word embeddings that use demographic-specific word representations derived compositionally from full or partial demographic information for a user (i.e., gender, age, location, religion). We show that the resulting demographic-aware word representations outperform generic word representations on two tasks for English: language modeling and word associations. We further explore the trade-off between the number of available attributes and their relative effectiveness and discuss the ethical implications of using them.

pdf bib
Iterative Feature Mining for Constraint-Based Data Collection to Increase Data Diversity and Model Robustness
Stefan Larson | Anthony Zheng | Anish Mahendran | Rishi Tekriwal | Adrian Cheung | Eric Guldan | Kevin Leach | Jonathan K. Kummerfeld
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Diverse data is crucial for training robust models, but crowdsourced text often lacks diversity as workers tend to write simple variations from prompts. We propose a general approach for guiding workers to write more diverse text by iteratively constraining their writing. We show how prior workflows are special cases of our approach, and present a way to apply the approach to dialog tasks such as intent classification and slot-filling. Using our method, we create more challenging versions of test sets from prior dialog datasets and find dramatic performance drops for standard models. Finally, we show that our approach is complementary to recent work on improving data diversity, and training on data collected with our approach leads to more robust models.

pdf bib
Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
Charles Welch | Rada Mihalcea | Jonathan K. Kummerfeld
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initializing with embeddings trained on in-domain data.


pdf bib
An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction
Stefan Larson | Anish Mahendran | Joseph J. Peper | Christopher Clarke | Andrew Lee | Parker Hill | Jonathan K. Kummerfeld | Kevin Leach | Michael A. Laurenzano | Lingjia Tang | Jason Mars
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope—i.e., queries that do not fall into any of the system’s supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. We evaluate a range of benchmark classifiers on our dataset along with several different out-of-scope identification schemes. We find that while the classifiers perform well on in-scope intent classification, they struggle to identify out-of-scope queries. Our dataset and evaluation fill an important gap in the field, offering a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.

pdf bib
A Large-Scale Corpus for Conversation Disentanglement
Jonathan K. Kummerfeld | Sai R. Gouravajhala | Joseph J. Peper | Vignesh Athreya | Chulaka Gunasekara | Jatin Ganhotra | Siva Sankalp Patel | Lazaros C Polymenakos | Walter Lasecki
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Disentangling conversations mixed together in a single stream of messages is a difficult task, made harder by the lack of large manually annotated datasets. We created a new dataset of 77,563 messages manually annotated with reply-structure graphs that both disentangle conversations and define internal conversation structure. Our data is 16 times larger than all previously released datasets combined, the first to include adjudication of annotation disagreements, and the first to include context. We use our data to re-examine prior work, in particular, finding that 89% of conversations in a widely used dialogue corpus are either missing messages or contain extra messages. Our manually-annotated data presents an opportunity to develop robust data-driven methods for conversation disentanglement, which will help advance dialogue research.

pdf bib
SLATE: A Super-Lightweight Annotation Tool for Experts
Jonathan K. Kummerfeld
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Many annotation tools have been developed, covering a wide variety of tasks and providing features like user management, pre-processing, and automatic labeling. However, all of these tools use Graphical User Interfaces, and often require substantial effort to install and configure. This paper presents a new annotation tool that is designed to fill the niche of a lightweight interface for users with a terminal-based workflow. SLATE supports annotation at different scales (spans of characters, tokens, and lines, or a document) and of different types (free text, labels, and links), with easily customisable keybindings, and unicode support. In a user study comparing with other tools it was consistently the easiest to install and use. SLATE fills a need not met by existing systems, and has already been used to annotate two corpora, one of which involved over 250 hours of annotation effort.

pdf bib
DSTC7 Task 1: Noetic End-to-End Response Selection
Chulaka Gunasekara | Jonathan K. Kummerfeld | Lazaros Polymenakos | Walter Lasecki
Proceedings of the First Workshop on NLP for Conversational AI

Goal-oriented dialogue in complex domains is an extremely challenging problem and there are relatively few datasets. This task provided two new resources that presented different challenges: one was focused but small, while the other was large but diverse. We also considered several new variations on the next utterance selection problem: (1) increasing the number of candidates, (2) including paraphrases, and (3) not including a correct option in the candidate set. Twenty teams participated, developing a range of neural network models, including some that successfully incorporated external data to boost performance. Both datasets have been publicly released, enabling future work to build on these results, working towards robust goal-oriented dialogue systems.

pdf bib
Outlier Detection for Improved Data Quality and Diversity in Dialog Systems
Stefan Larson | Anish Mahendran | Andrew Lee | Jonathan K. Kummerfeld | Parker Hill | Michael A. Laurenzano | Johann Hauswald | Lingjia Tang | Jason Mars
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In a corpus of data, outliers are either errors: mistakes in the data that are counterproductive, or are unique: informative samples that improve model robustness. Identifying outliers can lead to better datasets by (1) removing noise in datasets and (2) guiding collection of additional data to fill gaps. However, the problem of detecting both outlier types has received relatively little attention in NLP, particularly for dialog systems. We introduce a simple and effective technique for detecting both erroneous and unique samples in a corpus of short texts using neural sentence embeddings combined with distance-based outlier detection. We also present a novel data collection pipeline built atop our detection technique to automatically and iteratively mine unique data samples while discarding erroneous samples. Experiments show that our outlier detection technique is effective at finding errors while our data collection pipeline yields highly diverse corpora that in turn produce more robust intent classification and slot-filling models.


pdf bib
World Knowledge for Abstract Meaning Representation Parsing
Charles Welch | Jonathan K. Kummerfeld | Song Feng | Rada Mihalcea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Factors Influencing the Surprising Instability of Word Embeddings
Laura Wendlandt | Jonathan K. Kummerfeld | Rada Mihalcea
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this paper, we consider one aspect of embedding spaces, namely their stability. We show that even relatively high frequency words (100-200 occurrences) are often unstable. We provide empirical evidence for how various factors contribute to the stability of word embeddings, and we analyze the effects of stability on downstream tasks.

pdf bib
Effective Crowdsourcing for a New Type of Summarization Task
Youxuan Jiang | Catherine Finegan-Dollak | Jonathan K. Kummerfeld | Walter Lasecki
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Most summarization research focuses on summarizing the entire given text, but in practice readers are often interested in only one aspect of the document or conversation. We propose targeted summarization as an umbrella category for summarization tasks that intentionally consider only parts of the input data. This covers query-based summarization, update summarization, and a new task we propose where the goal is to summarize a particular aspect of a document. However, collecting data for this new task is hard because directly asking annotators (e.g., crowd workers) to write summaries leads to data with low accuracy when there are a large number of facts to include. We introduce a novel crowdsourcing workflow, Pin-Refine, that allows us to collect high-quality summaries for our task, a necessary step for the development of automatic systems.

pdf bib
Data Collection for Dialogue System: A Startup Perspective
Yiping Kang | Yunqi Zhang | Jonathan K. Kummerfeld | Lingjia Tang | Jason Mars
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

Industrial dialogue systems such as Apple Siri and Google Now rely on large scale diverse and robust training data to enable their sophisticated conversation capability. Crowdsourcing provides a scalable and inexpensive way of data collection but collecting high quality data efficiently requires thoughtful orchestration of the crowdsourcing jobs. Prior study of this topic have focused on tasks only in the academia settings with limited scope or only provide intrinsic dataset analysis, lacking indication on how it affects the trained model performance. In this paper, we present a study of crowdsourcing methods for a user intent classification task in our deployed dialogue system. Our task requires classification of 47 possible user intents and contains many intent pairs with subtle differences. We consider different crowdsourcing job types and job prompts and analyze quantitatively the quality of the collected data and the downstream model performance on a test set of real user queries from production logs. Our observation provides insights into designing efficient crowdsourcing jobs and provide recommendations for future dialogue system data collection process.

pdf bib
Improving Text-to-SQL Evaluation Methodology
Catherine Finegan-Dollak | Jonathan K. Kummerfeld | Li Zhang | Karthik Ramanathan | Sesh Sadasivam | Rui Zhang | Dragomir Radev
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

To be informative, an evaluation must measure how well systems generalize to realistic unseen data. We identify limitations of and propose improvements to current evaluations of text-to-SQL systems. First, we compare human-generated and automatically generated questions, characterizing properties of queries necessary for real-world applications. To facilitate evaluation on multiple datasets, we release standardized and improved versions of seven existing datasets and one new text-to-SQL dataset. Second, we show that the current division of data into training and test sets measures robustness to variations in the way questions are asked, but only partially tests how well systems generalize to new queries; therefore, we propose a complementary dataset split for evaluation of future work. Finally, we demonstrate how the common practice of anonymizing variables during evaluation removes an important challenge of the task. Our observations highlight key difficulties, and our methodology enables effective measurement of future development.


pdf bib
Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection
Youxuan Jiang | Jonathan K. Kummerfeld | Walter S. Lasecki
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Linguistically diverse datasets are critical for training and evaluating robust machine learning systems, but data collection is a costly process that often requires experts. Crowdsourcing the process of paraphrase generation is an effective means of expanding natural language datasets, but there has been limited analysis of the trade-offs that arise when designing tasks. In this paper, we present the first systematic study of the key factors in crowdsourcing paraphrase collection. We consider variations in instructions, incentives, data domains, and workflows. We manually analyzed paraphrases for correctness, grammaticality, and linguistic diversity. Our observations provide new insight into the trade-offs between accuracy and diversity in crowd responses that arise as a result of task design, providing guidance for future paraphrase generation procedures.

pdf bib
Parsing with Traces: An O(n4) Algorithm and a Structural Representation
Jonathan K. Kummerfeld | Dan Klein
Transactions of the Association for Computational Linguistics, Volume 5

General treebank analyses are graph structured, but parsers are typically restricted to tree structures for efficiency and modeling reasons. We propose a new representation and algorithm for a class of graph structures that is flexible enough to cover almost all treebank structures, while still admitting efficient learning and inference. In particular, we consider directed, acyclic, one-endpoint-crossing graph structures, which cover most long-distance dislocation, shared argumentation, and similar tree-violating linguistic phenomena. We describe how to convert phrase structure parses, including traces, to our new representation in a reversible manner. Our dynamic program uniquely decomposes structures, is sound and complete, and covers 97.3% of the Penn English Treebank. We also implement a proof-of-concept parser that recovers a range of null elements and trace types.

pdf bib
Identifying Products in Online Cybercrime Marketplaces: A Dataset for Fine-grained Domain Adaptation
Greg Durrett | Jonathan K. Kummerfeld | Taylor Berg-Kirkpatrick | Rebecca Portnoff | Sadia Afroz | Damon McCoy | Kirill Levchenko | Vern Paxson
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own “fine-grained domain” in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.


pdf bib
An Empirical Analysis of Optimization for Max-Margin NLP
Jonathan K. Kummerfeld | Taylor Berg-Kirkpatrick | Dan Klein
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing


pdf bib
Error-Driven Analysis of Challenges in Coreference Resolution
Jonathan K. Kummerfeld | Dan Klein
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
An Empirical Examination of Challenges in Chinese Parsing
Jonathan K. Kummerfeld | Daniel Tse | James R. Curran | Dan Klein
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


pdf bib
Robust Conversion of CCG Derivations to Phrase Structure Trees
Jonathan K. Kummerfeld | Dan Klein | James R. Curran
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output
Jonathan K. Kummerfeld | David Hall | James R. Curran | Dan Klein
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning


pdf bib
Mention Detection: Heuristics for the OntoNotes annotations
Jonathan K. Kummerfeld | Mohit Bansal | David Burkett | Dan Klein
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task


pdf bib
Morphological Analysis Can Improve a CCG Parser for English
Matthew Honnibal | Jonathan K. Kummerfeld | James R. Curran
Coling 2010: Posters

pdf bib
Faster Parsing by Supertagger Adaptation
Jonathan K. Kummerfeld | Jessika Roesner | Tim Dawborn | James Haggerty | James R. Curran | Stephen Clark
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics


pdf bib
Faster parsing and supertagging model estimation
Jonathan K. Kummerfeld | Jessika Roesner | James Curran
Proceedings of the Australasian Language Technology Association Workshop 2009


pdf bib
Classification of Verb Particle Constructions with the Google Web1T Corpus
Jonathan K. Kummerfeld | James R. Curran
Proceedings of the Australasian Language Technology Association Workshop 2008