Garrett Nicolai


2020

Cross-Linguistic Syntactic Evaluation of Word Prediction Models
Aaron Mueller | Garrett Nicolai | Panayiota Petrou-Zeniou | Natalia Talmina | Tal Linzen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

A range of studies have concluded that neural word prediction models can distinguish grammatical from ungrammatical sentences with high accuracy. However, these studies are based primarily on monolingual evidence from English. To investigate how these models’ ability to learn syntax varies by language, we introduce CLAMS (Cross-Linguistic Assessment of Models on Syntax), a syntactic evaluation suite for monolingual and multilingual models. CLAMS includes subject-verb agreement challenge sets for English, French, German, Hebrew and Russian, generated from grammars we develop. We use CLAMS to evaluate LSTM language models as well as monolingual and multilingual BERT. Across languages, monolingual LSTMs achieved high accuracy on dependencies without attractors, and generally poor accuracy on agreement across object relative clauses. On other constructions, agreement accuracy was generally higher in languages with richer morphology. Multilingual models generally underperformed monolingual models. Multilingual BERT showed high syntactic accuracy on English, but noticeable deficiencies in other languages.

The Johns Hopkins University Bible Corpus: 1600+ Tongues for Typological Exploration
Arya D. McCarthy | Rachel Wicks | Dylan Lewis | Aaron Mueller | Winston Wu | Oliver Adams | Garrett Nicolai | Matt Post | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

We present findings from the creation of a massively parallel corpus in over 1600 languages, the Johns Hopkins University Bible Corpus (JHUBC). The corpus consists of over 4000 unique translations of the Christian Bible and counting. Our data is derived from scraping several online resources and merging them with existing corpora, combining them under a common scheme that is verse-parallel across all translations. We detail our effort to scrape, clean, align, and utilize this ripe multilingual dataset. The corpus captures the great typological variety of the world’s languages. We catalog this by showing highly similar proportions of representation of Ethnologue’s typological features in our corpus. We also give an example application: projecting pronoun features like clusivity across alignments to richly annotate languages which do not mark the distinction.

An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages
Aaron Mueller | Garrett Nicolai | Arya D. McCarthy | Dylan Lewis | Winston Wu | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

In this work, we explore massively multilingual low-resource neural machine translation. Using translations of the Bible (which have parallel structure across languages), we train models with up to 1,107 source languages. We create various multilingual corpora, varying the number and relatedness of source languages. Using these, we investigate the best ways to use this many-way aligned resource for multilingual machine translation. Our experiments employ a grammatically and phylogenetically diverse set of source languages during testing for more representative evaluations. We find that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harms performance—the best number depends on the source language. Furthermore, training on related languages can improve or degrade performance, depending on the language. As there is no one-size-fits-most answer, we find that it is critical to tailor one’s approach to the source language and its typology.

UniMorph 3.0: Universal Morphology
Arya D. McCarthy | Christo Kirov | Matteo Grella | Amrit Nidhi | Patrick Xia | Kyle Gorman | Ekaterina Vylomova | Sabrina J. Mielke | Garrett Nicolai | Miikka Silfverberg | Timofey Arkhangelskiy | Nataly Krizhanovsky | Andrew Krizhanovsky | Elena Klyachko | Alexey Sorokin | John Mansfield | Valts Ernštreits | Yuval Pinter | Cassandra L. Jacobs | Ryan Cotterell | Mans Hulden | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological paradigms for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. We have implemented several improvements to the extraction pipeline which creates most of our data, so that it is both more complete and more correct. We have added 66 new languages, as well as new parts of speech for 12 languages. We have also amended the schema in several ways. Finally, we present three new community tools: two to validate data for resource creators, and one to make morphological data available from the command line. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018.

Fine-grained Morphosyntactic Analysis and Generation Tools for More Than One Thousand Languages
Garrett Nicolai | Dylan Lewis | Arya D. McCarthy | Aaron Mueller | Winston Wu | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

Exploiting the broad translation of the Bible into the world’s languages, we train and distribute morphosyntactic tools for approximately one thousand languages, vastly outstripping previous distributions of tools devoted to the processing of inflectional morphology. Evaluation of the tools on a subset of available inflectional dictionaries demonstrates strong initial models, supplemented and improved through ensembling and dictionary-based reranking. Likewise, a novel type-to-token based evaluation metric allows us to confirm that models generalize well across rare and common forms alike.

Multilingual Dictionary Based Construction of Core Vocabulary
Winston Wu | Garrett Nicolai | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

We propose a new functional definition and construction method for core vocabulary sets for multiple applications based on the relative coverage of a target concept in thousands of bilingual dictionaries. Our newly developed core concept vocabulary list derived from these dictionary consensus methods achieves high overlap with existing widely utilized core vocabulary lists targeted at applications such as first and second language learning or field linguistics. Our in-depth analysis illustrates multiple desirable properties of our newly proposed core vocabulary set, including their non-compositionality. We employ a cognate prediction method to recover missing coverage of this core vocabulary in massively multilingual dictionary construction, and we argue that this core vocabulary should be prioritized for elicitation when creating new dictionaries for low-resource languages for multiple downstream tasks including machine translation and language learning.

Noise Isn’t Always Negative: Countering Exposure Bias in Sequence-to-Sequence Inflection Models
Garrett Nicolai | Miikka Silfverberg
Proceedings of the 28th International Conference on Computational Linguistics

Morphological inflection, like many sequence-to-sequence tasks, sees great performance from recurrent neural architectures when data is plentiful, but performance falls off sharply in lower-data settings. We investigate one aspect of neural seq2seq models that we hypothesize contributes to overfitting: teacher forcing. By creating different training and test conditions, exposure bias increases the likelihood that a system too closely models its training data. Experiments show that teacher-forced models struggle to recover when they enter unknown territory. However, a simple modification to the training algorithm to more closely mimic test conditions creates models that are better able to generalize to unseen environments.

JHUBC’s Submission to LT4HALA EvaLatin 2020
Winston Wu | Garrett Nicolai
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

We describe the JHUBC submission to the EvaLatin Shared task on lemmatization and part-of-speech tagging for Latin. We modify a hard-attentional character-based encoder-decoder to produce lemmas and POS tags with separate decoders, and to incorporate contextual tagging cues. While our results show that the dual decoder approach fails to encode data as successfully as the single encoder, our simple context incorporation method does lead to modest improvements.

Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Kyle Gorman | Ryan Cotterell
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
Ekaterina Vylomova | Jennifer White | Elizabeth Salesky | Sabrina J. Mielke | Shijie Wu | Edoardo Maria Ponti | Rowan Hall Maudslay | Ran Zmigrod | Josef Valvoda | Svetlana Toldova | Francis Tyers | Elena Klyachko | Ilya Yegorov | Natalia Krizhanovsky | Paula Czarnowska | Irene Nikkarinen | Andrew Krizhanovsky | Tiago Pimentel | Lucas Torroba Hennigen | Christo Kirov | Garrett Nicolai | Adina Williams | Antonios Anastasopoulos | Hilaria Cruz | Eleanor Chodroff | Ryan Cotterell | Miikka Silfverberg | Mans Hulden
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.

The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion
Katharina Kann | Arya D. McCarthy | Garrett Nicolai | Mans Hulden
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

In this paper, we describe the findings of the SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion (SIGMORPHON 2020 Task 2), a novel task in the field of inflectional morphology. Participants were asked to submit systems which take raw text and a list of lemmas as input, and output all inflected forms, i.e., the entire morphological paradigm, of each lemma. In order to simulate a realistic use case, we first released data for 5 development languages. However, systems were officially evaluated on 9 surprise languages, which were only revealed a few days before the submission deadline. We provided a modular baseline system, which is a pipeline of 4 components. 3 teams submitted a total of 7 systems, but, surprisingly, none of the submitted systems was able to improve over the baseline on average over all 9 test languages. Only on 3 languages did a submitted system obtain the best results. This shows that unsupervised morphological paradigm completion is still largely unsolved. We present an analysis here, so that this shared task will ground further research on the topic.

Induced Inflection-Set Keyword Search in Speech
Oliver Adams | Matthew Wiesner | Jan Trmal | Garrett Nicolai | David Yarowsky
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

We investigate the problem of searching for a lexeme-set in speech by searching for its inflectional variants. Experimental results indicate how lexeme-set search performance changes with the number of hypothesized inflections, while ablation experiments highlight the relative importance of different components in the lexeme-set search pipeline and the value of using curated inflectional paradigms. We provide a recipe and evaluation set for the community to use as an extrinsic measure of the performance of inflection generation approaches.

2019

Learning Morphosyntactic Analyzers from the Bible via Iterative Annotation Projection across 26 Languages
Garrett Nicolai | David Yarowsky
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A large percentage of computational tools are concentrated in a very small subset of the planet’s languages. Compounding the issue, many languages lack the high-quality linguistic annotation necessary for the construction of such tools with current machine learning methods. In this paper, we address both issues simultaneously: leveraging the high accuracy of English taggers and parsers, we project morphological information onto translations of the Bible in 26 varied test languages. Using an iterative discovery, constraint, and training process, we build inflectional lexica in the target languages. Through a combination of iteration, ensembling, and reranking, we see double-digit relative error reductions in lemmatization and morphological analysis over a strong initial system.

Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Ryan Cotterell
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection
Arya D. McCarthy | Ekaterina Vylomova | Shijie Wu | Chaitanya Malaviya | Lawrence Wolf-Sonkin | Garrett Nicolai | Christo Kirov | Miikka Silfverberg | Sabrina J. Mielke | Jeffrey Heinz | Ryan Cotterell | Mans Hulden
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years’ inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low-resource language. This year also presents a new second challenge on lemmatization and morphological feature analysis in context. All submissions featured a neural component and built on either this year’s strong baselines or highly ranked systems from previous years’ shared tasks. Every participating team improved in accuracy over the baselines for the inflection task (though not Levenshtein distance), and every team in the contextual analysis task improved on both state-of-the-art neural and non-neural baselines.

2018

The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection
Ryan Cotterell | Christo Kirov | John Sylak-Glassman | Géraldine Walther | Ekaterina Vylomova | Arya D. McCarthy | Katharina Kann | Sabrina J. Mielke | Garrett Nicolai | Miikka Silfverberg | David Yarowsky | Jason Eisner | Mans Hulden
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
Sandra Kuebler | Garrett Nicolai
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

String Transduction with Target Language Models and Insertion Handling
Garrett Nicolai | Saeed Najafi | Grzegorz Kondrak
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

Many character-level tasks can be framed as sequence-to-sequence transduction, where the target is a word from a natural language. We show that leveraging target language models derived from unannotated target corpora, combined with a precise alignment of the training data, yields state-of-the-art results on cognate projection, inflection generation, and phoneme-to-grapheme conversion.

2017

If you can’t beat them, join them: the University of Alberta system description
Garrett Nicolai | Bradley Hauer | Mohammad Motallebi | Saeed Najafi | Grzegorz Kondrak
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

Morphological Analysis without Expert Annotation
Garrett Nicolai | Grzegorz Kondrak
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The task of morphological analysis is to produce a complete list of lemma+tag analyses for a given word-form. We propose a discriminative string transduction approach which exploits plain inflection tables and raw text corpora, thus obviating the need for expert annotation. Experiments on four languages demonstrate that our system has much higher coverage than a hand-engineered FST analyzer, and is more accurate than a state-of-the-art morphological tagger.

Bootstrapping Unsupervised Bilingual Lexicon Induction
Bradley Hauer | Garrett Nicolai | Grzegorz Kondrak
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The task of unsupervised lexicon induction is to find translation pairs across monolingual corpora. We develop a novel method that creates seed lexicons by identifying cognates in the vocabularies of related languages on the basis of their frequency and lexical similarity. We apply bidirectional bootstrapping to a method which learns a linear mapping between context-based vector spaces. Experimental results on three language pairs show consistent improvement over prior work.

2016

Leveraging Inflection Tables for Stemming and Lemmatization
Garrett Nicolai | Grzegorz Kondrak
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Morphological Reinflection via Discriminative String Transduction
Garrett Nicolai | Bradley Hauer | Adam St Arnaud | Grzegorz Kondrak
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Morphological Segmentation Can Improve Syllabification
Garrett Nicolai | Lei Yao | Grzegorz Kondrak
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

2015

Morpho-syntactic Regularities in Continuous Word Representations: A multilingual study
Garrett Nicolai | Colin Cherry | Grzegorz Kondrak
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

Multiple System Combination for Transliteration
Garrett Nicolai | Bradley Hauer | Mohammad Salameh | Adam St Arnaud | Ying Xu | Lei Yao | Grzegorz Kondrak
Proceedings of the Fifth Named Entity Workshop

English orthography is not “close to optimal”
Garrett Nicolai | Grzegorz Kondrak
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Inflection Generation as Discriminative String Transduction
Garrett Nicolai | Colin Cherry | Grzegorz Kondrak
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Does the Phonology of L1 Show Up in L2 Texts?
Garrett Nicolai | Grzegorz Kondrak
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Cognate and Misspelling Features for Natural Language Identification
Garrett Nicolai | Bradley Hauer | Mohammad Salameh | Lei Yao | Grzegorz Kondrak
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications