Dan Garrette


2020

Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung | Dan Garrette | Kiat Chuan Tan | Jason Riesa
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model will expect to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. Our experiments show improvements across languages on key multilingual benchmark tasks: TyDi QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER (+2.8 F1), plus a factor-of-8 reduction in out-of-vocabulary rate, all without increasing the size of the model or data.
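
The cluster-then-union idea can be sketched in a few lines. The toy below is illustrative, not the paper's implementation: the hashed character n-gram language vectors, the k-means clustering, and the character-trigram "vocabularies" all stand in for the paper's actual clustering procedure and subword learner (e.g., WordPiece).

```python
# Toy sketch of language-clustered vocabulary generation:
# (1) derive language clusters, (2) train a vocabulary per cluster,
# (3) take the union as the final multilingual vocabulary.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

# Stand-in corpora: one tiny text per language (invented examples).
corpora = {
    "en": "the cat sat on the mat",
    "de": "die katze sass auf der matte",
    "fi": "kissa istui matolla",
    "et": "kass istus matil",
}

def char_ngram_profile(text, n=3, dims=512):
    """Hashed character n-gram counts as a crude language vector."""
    vec = np.zeros(dims)
    for i in range(len(text) - n + 1):
        vec[hash(text[i:i + n]) % dims] += 1
    return vec / max(vec.sum(), 1.0)

langs = sorted(corpora)
X = np.stack([char_ngram_profile(corpora[l]) for l in langs])

# Step 1: automatically derive language clusters (k-means as a
# simple stand-in for the paper's clustering procedure).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: train a "vocabulary" per cluster -- most frequent character
# trigrams here, standing in for a real subword learner.
def train_vocab(texts, size=20):
    counts = Counter()
    for t in texts:
        counts.update(t[i:i + 3] for i in range(len(t) - 2))
    return {ngram for ngram, _ in counts.most_common(size)}

clusters = {}
for lang, label in zip(langs, labels):
    clusters.setdefault(label, []).append(lang)

# Step 3: union the per-cluster vocabularies.
final_vocab = set()
for label, members in clusters.items():
    vocab = train_vocab([corpora[l] for l in members])
    print(f"cluster {label}: {members}, {len(vocab)} subwords")
    final_vocab |= vocab
print(f"union vocabulary size: {len(final_vocab)}")
```

Because each cluster's vocabulary is trained only on related languages, rare languages are not crowded out of the subword inventory by high-resource ones, which is the mechanism behind the reduced out-of-vocabulary rate.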

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Jonathan H. Clark | Eunsol Choi | Michael Collins | Dan Garrette | Tom Kwiatkowski | Vitaly Nikolaev | Jennimaria Palomaki
Transactions of the Association for Computational Linguistics, Volume 8

Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA—a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology—the set of linguistic features each language expresses—such that we expect models performing well on this set to generalize across a large number of the world’s languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.
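
For readers who want to inspect the data, a minimal loading sketch follows. It assumes the mirror on the Hugging Face hub under the id `tydiqa` with a `primary_task` configuration and that release's field names; these are assumptions about a third-party mirror, so check the hub or the TyDi QA site for the canonical distribution.

```python
# Minimal sketch for inspecting TyDi QA. The hub id "tydiqa", the
# "primary_task" config, and the field names are assumptions about
# the Hugging Face mirror, not part of the paper itself.
from datasets import load_dataset

tydiqa = load_dataset("tydiqa", "primary_task")
example = tydiqa["train"][0]
print(example["language"], example["question_text"])
```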

2019

How Multilingual is Multilingual BERT?
Telmo Pires | Eva Schlinger | Dan Garrette
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.
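
The zero-shot transfer setup the paper probes is easy to sketch: fine-tune M-BERT on task annotations in one language, then evaluate directly in another. The snippet below shows the skeleton using the Hugging Face transformers library; the training loop and datasets are elided, and the Spanish NLI example is illustrative rather than the paper's exact evaluation.

```python
# Skeleton of zero-shot cross-lingual transfer with Multilingual BERT:
# fine-tune on labeled data in one language, evaluate in another.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)

# 1) Fine-tune on source-language annotations (e.g., English NLI):
#    standard supervised training, omitted here for brevity.

# 2) Evaluate on a target language the model never saw labels for.
premise = "La casa es grande."      # Spanish: "The house is big."
hypothesis = "La casa es pequeña."  # Spanish: "The house is small."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(-1))  # class probabilities (head untrained here)
```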

2018

Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification
Kelsey Ball | Dan Garrette
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Code-switching, the use of more than one language within a single utterance, is ubiquitous in much of the world, but remains a challenge for NLP largely due to the lack of representative data for training models. In this paper, we present a novel model architecture that is trained exclusively on monolingual resources, but can be applied to unseen code-switched text at inference time. The model accomplishes this by jointly maintaining separate word representations for each of the possible languages, or scripts in the case of transliteration, allowing each to contribute to inferences without forcing the model to commit to a language. Experiments on Hindi-English part-of-speech tagging demonstrate that our approach outperforms standard models when training on monolingual text without transliteration, and testing on code-switched text with alternate scripts.
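
The core architectural idea, separate per-language (or per-script) word representations that each contribute to every token, can be illustrated with a small PyTorch module. The learned soft gate below is one plausible mixing scheme, used here for illustration; it is not claimed to be the paper's exact architecture.

```python
# Toy sketch: keep separate embedding tables per language (or script)
# and let both contribute to each token's representation, so the
# tagger never has to commit to a single language up front.
import torch
import torch.nn as nn

class DualLanguageEmbedder(nn.Module):
    def __init__(self, vocab_a, vocab_b, dim):
        super().__init__()
        self.embed_a = nn.Embedding(vocab_a, dim)  # e.g., Hindi-side table
        self.embed_b = nn.Embedding(vocab_b, dim)  # e.g., English-side table
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, ids_a, ids_b):
        ea, eb = self.embed_a(ids_a), self.embed_b(ids_b)
        # Soft per-token weight: how much each language's view contributes.
        alpha = torch.sigmoid(self.gate(torch.cat([ea, eb], dim=-1)))
        return alpha * ea + (1 - alpha) * eb

embedder = DualLanguageEmbedder(vocab_a=5000, vocab_b=5000, dim=64)
ids_a = torch.randint(0, 5000, (1, 7))  # token ids under language A's lexicon
ids_b = torch.randint(0, 5000, (1, 7))  # same tokens under language B's lexicon
reps = embedder(ids_a, ids_b)           # would feed into a POS tagging layer
print(reps.shape)  # torch.Size([1, 7, 64])
```

Crucially, a model like this can be trained on monolingual text from each language separately, yet at inference time both tables fire on code-switched input without an explicit language-identification step.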

2017

Automatic Compositor Attribution in the First Folio of Shakespeare
Maria Ryskina | Hannah Alpert-Abrams | Dan Garrette | Taylor Berg-Kirkpatrick
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Compositor attribution, the clustering of pages in a historical printed document by the individual who set the type, is a bibliographic task that relies on analysis of orthographic variation and inspection of visual details of the printed page. In this paper, we introduce a novel unsupervised model that jointly describes the textual and visual features needed to distinguish compositors. Applied to images of Shakespeare’s First Folio, our model predicts attributions that agree with the manual judgements of bibliographers with an accuracy of 87%, even on text that is the output of OCR.
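
As a loose analogue of the task, one can cluster pages by joint textual and visual features; the toy below uses a plain Gaussian mixture over fabricated per-page spelling-variant and layout features. The paper's model is a joint generative model over OCR'd text and glyph images, so this is a deliberately simplified stand-in, and all feature names and numbers are invented.

```python
# Toy analogue of unsupervised compositor attribution: represent each
# page by textual habits (rates of spelling variants such as "do"/"doe"
# or "here"/"heere") plus a visual layout feature, then cluster pages.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Fabricated per-page feature rows: [rate of "doe" spelling, rate of
# "heere" spelling, mean spacing around punctuation].
pages_x = rng.normal([0.8, 0.7, 1.2], 0.1, size=(30, 3))  # habit profile X
pages_y = rng.normal([0.2, 0.3, 0.6], 0.1, size=(30, 3))  # habit profile Y
features = np.vstack([pages_x, pages_y])

gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
attributions = gmm.predict(features)
print(attributions)  # cluster id per page = hypothesized compositor
```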

STREAMLInED Challenges: Aligning Research Interests with Shared Tasks
Gina-Anne Levow | Emily M. Bender | Patrick Littell | Kristen Howell | Shobhana Chelliah | Joshua Crowgey | Dan Garrette | Jeff Good | Sharon Hargus | David Inman | Michael Maxwell | Michael Tjalve | Fei Xia
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2016

An Unsupervised Model of Orthographic Variation for Historical Document Transcription
Dan Garrette | Hannah Alpert-Abrams
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
Dan Garrette | Chris Dyer | Jason Baldridge | Noah A. Smith
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

Unsupervised Code-Switching for Multilingual Historical Document Transcription
Dan Garrette | Hannah Alpert-Abrams | Taylor Berg-Kirkpatrick | Dan Klein
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Weakly-Supervised Bayesian Learning of a CCG Supertagger
Dan Garrette | Chris Dyer | Jason Baldridge | Noah A. Smith
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

2013

Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
Dan Garrette | Jason Mielens | Jason Baldridge
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning a Part-of-Speech Tagger from Two Hours of Annotation
Dan Garrette | Jason Baldridge
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Montague Meets Markov: Deep Semantics with Probabilistic Logical Form
Islam Beltagy | Cuong Chau | Gemma Boleda | Dan Garrette | Katrin Erk | Raymond Mooney
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
Dan Garrette | Jason Baldridge
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Integrating Logical Representations with Probabilistic Information using Markov Logic
Dan Garrette | Katrin Erk | Raymond Mooney
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

2009

An Extensible Toolkit for Computational Semantics
Dan Garrette | Ewan Klein
Proceedings of the Eighth International Conference on Computational Semantics