Jonathan H. Clark

Also published as: Jonathan Clark


CapWAP: Image Captioning with a Purpose
Adam Fisch | Kenton Lee | Ming-Wei Chang | Jonathan Clark | Regina Barzilay
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The traditional image captioning task uses generic reference captions to provide textual information about images. Different user populations, however, will care about different visual aspects of images. In this paper, we propose a new task, Captioning with A Purpose (CapWAP). Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population, rather than merely provide generic information about an image. In this task, we use question-answer (QA) pairs—a natural expression of information need—from users, instead of reference captions, for both training and post-inference evaluation. We show that it is possible to use reinforcement learning to directly optimize for the intended information need, by rewarding outputs that allow a question answering model to provide correct answers to sampled user questions. We convert several visual question answering datasets into CapWAP datasets, and demonstrate that under a variety of scenarios our purposeful captioning system learns to anticipate and fulfill specific information needs better than its generic counterparts, as measured by QA performance on user questions from unseen images, when using the caption alone as context.

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Jonathan H. Clark | Eunsol Choi | Michael Collins | Dan Garrette | Tom Kwiatkowski | Vitaly Nikolaev | Jennimaria Palomaki
Transactions of the Association for Computational Linguistics, Volume 8

Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA—a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology—the set of linguistic features each language expresses—such that we expect models performing well on this set to generalize across a large number of the world’s languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.


Locally Non-Linear Learning for Statistical Machine Translation via Discretization and Structured Regularization
Jonathan H. Clark | Chris Dyer | Alon Lavie
Transactions of the Association for Computational Linguistics, Volume 2

Linear models, which support efficient learning and inference, are the workhorses of statistical machine translation; however, linear decision rules are less attractive from a modeling perspective. In this work, we introduce a technique for learning arbitrary, rule-local, non-linear feature transforms that improve model expressivity, but do not sacrifice the efficient inference and learning associated with linear models. To demonstrate the value of our technique, we discard the customary log transform of lexical probabilities and drop the phrasal translation probability in favor of raw counts. We observe that our algorithm learns a variation of a log transform that leads to better translation quality compared to the explicit log transform. We conclude that non-linear responses play an important role in SMT, an observation that we hope will inform the efforts of feature engineers.


Scalable Modified Kneser-Ney Language Model Estimation
Kenneth Heafield | Ivan Pouzyrevsky | Jonathan H. Clark | Philipp Koehn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


The CMU-ARK German-English Translation System
Chris Dyer | Kevin Gimpel | Jonathan H. Clark | Noah A. Smith
Proceedings of the Sixth Workshop on Statistical Machine Translation

Unsupervised Word Alignment with Arbitrary Features
Chris Dyer | Jonathan H. Clark | Alon Lavie | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
Jonathan H. Clark | Chris Dyer | Alon Lavie | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies


Improved Features and Grammar Selection for Syntax-Based MT
Greg Hanneman | Jonathan Clark | Alon Lavie
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

LoonyBin: Keeping Language Technologists Sane through Automated Management of Experimental (Hyper)Workflows
Jonathan H. Clark | Alon Lavie
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Many contemporary language technology systems are characterized by long pipelines of tools with complex dependencies. Too often, these workflows are implemented by ad hoc scripts; or, worse, tools are run manually, making experiments difficult to reproduce. These practices are difficult to maintain in the face of rapidly evolving workflows while they also fail to expose and record important details about intermediate data. Further complicating these systems are hyperparameters, which often cannot be directly optimized by conventional methods, requiring users to determine which combination of values is best via trial and error. We describe LoonyBin, an open-source tool that addresses these issues by providing: 1) a visual interface for the user to create and modify workflows; 2) a well-defined mechanism for tracking metadata and provenance; 3) a script generator that compiles visual workflows into shell scripts; and 4) a new workflow representation we call a HyperWorkflow, which intuitively and succinctly encodes small experimental variations within a larger workflow.


An Improved Statistical Transfer System for French-English Machine Translation
Greg Hanneman | Vamshi Ambati | Jonathan H. Clark | Alok Parlikar | Alon Lavie
Proceedings of the Fourth Workshop on Statistical Machine Translation


Toward Active Learning in Data Selection: Automatic Discovery of Language Features During Elicitation
Jonathan Clark | Robert Frederking | Lori Levin
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Data Selection has emerged as a common issue in language technologies. We define Data Selection as the choosing of a subset of training data that is most effective for a given task. This paper describes deductive feature detection, one component of a data selection system for machine translation. Feature detection determines whether features such as tense, number, and person are expressed in a language. The database of The World Atlas of Language Structures provides a gold standard against which to evaluate feature detection. The discovered features can be used as input to a Navigator, which uses active learning to determine which piece of language data is the most important to acquire next.

Inductive Detection of Language Features via Clustering Minimal Pairs: Toward Feature-Rich Grammars in Machine Translation
Jonathan H. Clark | Robert Frederking | Lori Levin
Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2)