David Yarowsky


2020

The Johns Hopkins University Bible Corpus: 1600+ Tongues for Typological Exploration
Arya D. McCarthy | Rachel Wicks | Dylan Lewis | Aaron Mueller | Winston Wu | Oliver Adams | Garrett Nicolai | Matt Post | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

We present findings from the creation of a massively parallel corpus in over 1600 languages, the Johns Hopkins University Bible Corpus (JHUBC). The corpus consists of over 4000 unique translations of the Christian Bible and counting. Our data is derived from scraping several online resources and merging them with existing corpora, combining them under a common scheme that is verse-parallel across all translations. We detail our effort to scrape, clean, align, and utilize this ripe multilingual dataset. The corpus captures the great typological variety of the world’s languages. We catalog this by showing highly similar proportions of representation of Ethnologue’s typological features in our corpus. We also give an example application: projecting pronoun features like clusivity across alignments to richly annotate languages which do not mark the distinction.

Computational Etymology and Word Emergence
Winston Wu | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

We developed an extensible, comprehensive Wiktionary parser that improves over several existing parsers. We predict the etymology of a word across the full range of etymology types and languages in Wiktionary, showing improvements over a strong baseline. We also model word emergence and show the application of etymology in modeling this phenomenon. We release our parser to further research in this understudied field.

An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages
Aaron Mueller | Garrett Nicolai | Arya D. McCarthy | Dylan Lewis | Winston Wu | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

In this work, we explore massively multilingual low-resource neural machine translation. Using translations of the Bible (which have parallel structure across languages), we train models with up to 1,107 source languages. We create various multilingual corpora, varying the number and relatedness of source languages. Using these, we investigate the best ways to use this many-way aligned resource for multilingual machine translation. Our experiments employ a grammatically and phylogenetically diverse set of source languages during testing for more representative evaluations. We find that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harms performance—the best number depends on the source language. Furthermore, training on related languages can improve or degrade performance, depending on the language. As there is no one-size-fits-most answer, we find that it is critical to tailor one’s approach to the source language and its typology.

UniMorph 3.0: Universal Morphology
Arya D. McCarthy | Christo Kirov | Matteo Grella | Amrit Nidhi | Patrick Xia | Kyle Gorman | Ekaterina Vylomova | Sabrina J. Mielke | Garrett Nicolai | Miikka Silfverberg | Timofey Arkhangelskiy | Nataly Krizhanovsky | Andrew Krizhanovsky | Elena Klyachko | Alexey Sorokin | John Mansfield | Valts Ernštreits | Yuval Pinter | Cassandra L. Jacobs | Ryan Cotterell | Mans Hulden | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological paradigms for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. We have implemented several improvements to the extraction pipeline which creates most of our data, so that it is both more complete and more correct. We have added 66 new languages, as well as new parts of speech for 12 languages. We have also amended the schema in several ways. Finally, we present three new community tools: two to validate data for resource creators, and one to make morphological data available from the command line. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018.

Fine-grained Morphosyntactic Analysis and Generation Tools for More Than One Thousand Languages
Garrett Nicolai | Dylan Lewis | Arya D. McCarthy | Aaron Mueller | Winston Wu | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

Exploiting the broad translation of the Bible into the world’s languages, we train and distribute morphosyntactic tools for approximately one thousand languages, vastly outstripping previous distributions of tools devoted to the processing of inflectional morphology. Evaluation of the tools on a subset of available inflectional dictionaries demonstrates strong initial models, supplemented and improved through ensembling and dictionary-based reranking. Likewise, a novel type-to-token based evaluation metric allows us to confirm that models generalize well across rare and common forms alike.

Multilingual Dictionary Based Construction of Core Vocabulary
Winston Wu | Garrett Nicolai | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

We propose a new functional definition and construction method for core vocabulary sets for multiple applications based on the relative coverage of a target concept in thousands of bilingual dictionaries. Our newly developed core concept vocabulary list derived from these dictionary consensus methods achieves high overlap with existing widely utilized core vocabulary lists targeted at applications such as first and second language learning or field linguistics. Our in-depth analysis illustrates multiple desirable properties of our newly proposed core vocabulary set, including their non-compositionality. We employ a cognate prediction method to recover missing coverage of this core vocabulary in massively multilingual dictionary construction, and we argue that this core vocabulary should be prioritized for elicitation when creating new dictionaries for low-resource languages for multiple downstream tasks including machine translation and language learning.

Neural Transduction for Multilingual Lexical Translation
Dylan Lewis | Winston Wu | Arya D. McCarthy | David Yarowsky
Proceedings of the 28th International Conference on Computational Linguistics

We present a method for completing multilingual translation dictionaries. Our probabilistic approach can synthesize new word forms, allowing it to operate in settings where correct translations have not been observed in text (cf. cross-lingual embeddings). In addition, we propose an approximate Maximum Mutual Information (MMI) decoding objective to further improve performance in both many-to-one and one-to-one word level translation tasks where we use either multiple input languages for a single target language or more typical single language pair translation. The model is trained in a many-to-many setting, where it can leverage information from related languages to predict words in each of its many target languages. We focus on 6 languages: French, Spanish, Italian, Portuguese, Romanian, and Turkish. When indirect multilingual information is available, ensembling with mixture-of-experts as well as incorporating related languages leads to a 27% relative improvement in whole-word accuracy of predictions over a single-source baseline. To seed the completion when multilingual data is unavailable, it is better to decode with an MMI objective.

Wiktionary Normalization of Translations and Morphological Information
Winston Wu | David Yarowsky
Proceedings of the 28th International Conference on Computational Linguistics

We extend the Yawipa Wiktionary Parser (Wu and Yarowsky, 2020) to extract and normalize translations from etymology glosses, and morphological form-of relations, resulting in 300K unique translations and over 4 million instances of 168 annotated morphological relations. We propose a method to identify typos in translation annotations. Using the extracted morphological data, we develop multilingual neural models for predicting three types of word formation—clipping, contraction, and eye dialect—and improve upon a standard attention baseline by using copy attention.

Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions
Arya D. McCarthy | Adina Williams | Shijia Liu | David Yarowsky | Ryan Cotterell
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

A grammatical gender system divides a lexicon into a small number of relatively fixed grammatical categories. How similar are these gender systems across languages? To quantify the similarity, we define gender systems extensionally, thereby reducing the problem of comparisons between languages’ gender systems to cluster evaluation. We borrow a rich inventory of statistical tools for cluster evaluation from the field of community detection (Driver and Kroeber, 1932; Cattell, 1945) that enable us to craft novel information-theoretic metrics for measuring similarity between gender systems. We first validate our metrics, then use them to measure gender system similarity in 20 languages. We then ask whether our gender system similarities alone are sufficient to reconstruct historical relationships between languages. Towards this end, we make phylogenetic predictions on the popular, but thorny, problem from historical linguistics of inducing a phylogenetic tree over extant Indo-European languages. Of particular interest, languages on the same branch of our phylogenetic tree are notably similar, whereas languages from separate branches are no more similar than chance.

Induced Inflection-Set Keyword Search in Speech
Oliver Adams | Matthew Wiesner | Jan Trmal | Garrett Nicolai | David Yarowsky
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

We investigate the problem of searching for a lexeme-set in speech by searching for its inflectional variants. Experimental results indicate how lexeme-set search performance changes with the number of hypothesized inflections, while ablation experiments highlight the relative importance of different components in the lexeme-set search pipeline and the value of using curated inflectional paradigms. We provide a recipe and evaluation set for the community to use as an extrinsic measure of the performance of inflection generation approaches.

2019

Modeling Color Terminology Across Thousands of Languages
Arya D. McCarthy | Winston Wu | Aaron Mueller | William Watson | David Yarowsky
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

There is an extensive history of scholarship into what constitutes a “basic” color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969). This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses. Collectively, the 14 empirically-grounded computational linguistic metrics we design—as well as their aggregation—correlate strongly with both the Berlin and Kay basic/secondary color term partition (γ = 0.96) and their hypothesized universal acquisition sequence. The measures and result provide further empirical evidence from computational linguistics in support of their claims, as well as additional nuance: they suggest treating the partition as a spectrum instead of a dichotomy.

Learning Morphosyntactic Analyzers from the Bible via Iterative Annotation Projection across 26 Languages
Garrett Nicolai | David Yarowsky
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A large percentage of computational tools are concentrated in a very small subset of the planet’s languages. Compounding the issue, many languages lack the high-quality linguistic annotation necessary for the construction of such tools with current machine learning methods. In this paper, we address both issues simultaneously: leveraging the high accuracy of English taggers and parsers, we project morphological information onto translations of the Bible in 26 varied test languages. Using an iterative discovery, constraint, and training process, we build inflectional lexica in the target languages. Through a combination of iteration, ensembling, and reranking, we see double-digit relative error reductions in lemmatization and morphological analysis over a strong initial system.

Massively Multilingual Adversarial Speech Recognition
Oliver Adams | Matthew Wiesner | Shinji Watanabe | David Yarowsky
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We report on adaptation of multilingual end-to-end speech recognition models trained on as many as 100 languages. Our findings shed light on the relative importance of similarity between the target and pretraining languages along the dimensions of phonetics, phonology, language family, geographical location, and orthography. In this context, experiments demonstrate the effectiveness of two additional pretraining objectives in encouraging language-independent encoder representations: a context-independent phoneme objective paired with a language-adversarial classification objective.

2018

A Comparative Study of Extremely Low-Resource Transliteration of the World’s Languages
Winston Wu | David Yarowsky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Creating a Translation Matrix of the Bible’s Names Across 591 Languages
Winston Wu | Nidhi Vyas | David Yarowsky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

UniMorph 2.0: Universal Morphology
Christo Kirov | Ryan Cotterell | John Sylak-Glassman | Géraldine Walther | Ekaterina Vylomova | Patrick Xia | Manaal Faruqui | Sabrina J. Mielke | Arya McCarthy | Sandra Kübler | David Yarowsky | Jason Eisner | Mans Hulden
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Creating Large-Scale Multilingual Cognate Tables
Winston Wu | David Yarowsky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Massively Translingual Compound Analysis and Translation Discovery
Winston Wu | David Yarowsky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection
Ryan Cotterell | Christo Kirov | John Sylak-Glassman | Géraldine Walther | Ekaterina Vylomova | Arya D. McCarthy | Katharina Kann | Sabrina J. Mielke | Garrett Nicolai | Miikka Silfverberg | David Yarowsky | Jason Eisner | Mans Hulden
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

Improving Low Resource Machine Translation using Morphological Glosses (Non-archival Extended Abstract)
Steven Shearing | Christo Kirov | Huda Khayrallah | David Yarowsky
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Marrying Universal Dependencies and Universal Morphology
Arya D. McCarthy | Miikka Silfverberg | Ryan Cotterell | Mans Hulden | David Yarowsky
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages—UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project’s annotations could be used to validate the other’s. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.

2017

CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages
Ryan Cotterell | Christo Kirov | John Sylak-Glassman | Géraldine Walther | Ekaterina Vylomova | Patrick Xia | Manaal Faruqui | Sandra Kübler | David Yarowsky | Jason Eisner | Mans Hulden
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

Deriving Consensus for Multi-Parallel Corpora: an English Bible Study
Patrick Xia | David Yarowsky
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

What can you do with multiple noisy versions of the same text? We present a method which generates a single consensus between multi-parallel corpora. By maximizing a function of linguistic features between word pairs, we jointly learn a single corpus-wide multiway alignment: a consensus between 27 versions of the English Bible. We additionally produce English paraphrases, word-level distributions of tags, and consensus dependency parses. Our method is language independent and applicable to any multi-parallel corpora. Given the Bible’s unique role as alignable bitext for over 800 of the world’s languages, this consensus alignment and resulting resources offer value for multilingual annotation projection, and also shed potential insights into the Bible itself.

Paradigm Completion for Derivational Morphology
Ryan Cotterell | Ekaterina Vylomova | Huda Khayrallah | Christo Kirov | David Yarowsky
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task. We overview the theoretical motivation for a paradigmatic treatment of derivational morphology, and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models adapted from the inflection task are able to learn the range of derivation patterns, and outperform a non-neural baseline by 16.4%. However, due to semantic, historical, and lexical considerations involved in derivational morphology, future work will be needed to achieve performance parity with inflection-generating systems.

2016

Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages
John Sylak-Glassman | Christo Kirov | David Yarowsky
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Structured, complete inflectional paradigm data exists for very few of the world’s languages, but is crucial to training morphological analysis tools. We present methods inspired by linguistic fieldwork for gathering inflectional paradigm data in a machine-readable, interoperable format from remotely-located speakers of any language. Informants are tasked with completing language-specific paradigm elicitation templates. Templates are constructed by linguists using grammatical reference materials to ensure completeness. Each cell in a template is associated with contextual prompts designed to help informants with varying levels of linguistic expertise (from professional translators to untrained native speakers) provide the desired inflected form. To facilitate downstream use in interoperable NLP/HLT applications, each cell is also associated with a language-independent machine-readable set of morphological tags from the UniMorph Schema. This data is useful for seeding morphological analysis and generation software, particularly when the data is representative of the range of surface morphological variation in the language. At present, we have obtained 792 lemmas and 25,056 inflected forms from 15 languages.

Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
Christo Kirov | John Sylak-Glassman | Roger Que | David Yarowsky
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation. However, it is designed primarily for human viewing rather than machine readability, and presents numerous challenges for generalized parsing and extraction due to a lack of standardized formatting and grammatical descriptor definitions. This paper describes a large-scale effort to automatically extract and standardize the data in Wiktionary and make it available for use by the NLP research community. The methodological innovations include a multidimensional table parsing algorithm, a cross-lexeme, token-frequency-based method of separating inflectional form data from grammatical descriptors, the normalization of grammatical descriptors to a unified annotation scheme that accounts for cross-linguistic diversity, and a verification and correction process that exploits within-language, cross-lexeme table format consistency to minimize human effort. The effort described here resulted in the extraction of a uniquely large normalized resource of nearly 1,000,000 inflectional paradigms across 350 languages. Evaluation shows that even though the data is extracted using a language-independent approach, it is comparable in quantity and quality to data extracted using hand-tuned, language-specific approaches.

The SIGMORPHON 2016 Shared Task—Morphological Reinflection
Ryan Cotterell | Christo Kirov | John Sylak-Glassman | David Yarowsky | Jason Eisner | Mans Hulden
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

2015

Social Media Predictive Analytics
Svitlana Volkova | Benjamin Van Durme | David Yarowsky | Yoram Bachrach
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Cross-lingual Dependency Parsing Based on Distributed Representations
Jiang Guo | Wanxiang Che | David Yarowsky | Haifeng Wang | Ting Liu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

A Language-Independent Feature Schema for Inflectional Morphology
John Sylak-Glassman | Christo Kirov | David Yarowsky | Roger Que
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media
Alice Oh | Benjamin Van Durme | David Yarowsky | Oren Tsur | Svitlana Volkova
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media

2013

Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
David Yarowsky | Timothy Baldwin | Anna Korhonen | Karen Livescu | Steven Bethard
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media
Svitlana Volkova | Theresa Wilson | David Yarowsky
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams
Svitlana Volkova | Theresa Wilson | David Yarowsky
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter
Shane Bergsma | Mark Dredze | Benjamin Van Durme | Theresa Wilson | David Yarowsky
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Stylometric Analysis of Scientific Articles
Shane Bergsma | Matt Post | David Yarowsky
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Toward Statistical Machine Translation without Parallel Corpora
Alexandre Klementiev | Ann Irvine | Chris Callison-Burch | David Yarowsky
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Proceedings of 5th International Joint Conference on Natural Language Processing
Haifeng Wang | David Yarowsky
Proceedings of 5th International Joint Conference on Natural Language Processing

Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
Shane Bergsma | David Yarowsky | Kenneth Church
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Typed Graph Models for Learning Latent Attributes from Names
Delip Rao | David Yarowsky
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

New Tools for Web-Scale N-grams
Dekang Lin | Kenneth Church | Heng Ji | Satoshi Sekine | David Yarowsky | Shane Bergsma | Kailash Patil | Emily Pitler | Rachel Lathbury | Vikram Rao | Kapil Dalwani | Sushant Narsale
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.

2009

Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences
Nikesh Garera | Chris Callison-Burch | David Yarowsky
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

Ranking and Semi-supervised Classification on Large Scale Graphs Using Map-Reduce
Delip Rao | David Yarowsky
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)

Modeling Latent Biographic Attributes in Conversational Genres
Nikesh Garera | David Yarowsky
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Arabic Cross-Document Coreference Resolution
Asad Sayeed | Tamer Elsayed | Nikesh Garera | David Alexander | Tan Xu | Doug Oard | David Yarowsky | Christine Piatko
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Structural, Transitive and Latent Models for Biographic Fact Extraction
Nikesh Garera | David Yarowsky
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
Zhifei Li | David Yarowsky
Proceedings of ACL-08: HLT

Translating Compounds by Learning Component Gloss Translation Models via Multiple Languages
Nikesh Garera | David Yarowsky
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Minimally Supervised Multilingual Taxonomy and Translation Lexicon Induction
Nikesh Garera | David Yarowsky
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Affinity Measures Based on the Graph Laplacian
Delip Rao | David Yarowsky | Chris Callison-Burch
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing

Mining and Modeling Relations between Formal and Informal Chinese Phrases from Web Corpora
Zhifei Li | David Yarowsky
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

JHU1 : An Unsupervised Approach to Person Name Disambiguation using Web Snippets
Delip Rao | Nikesh Garera | David Yarowsky
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

Resolving and Generating Definite Anaphora by Modeling Hypernymy using Unlabeled Corpora
Nikesh Garera | David Yarowsky
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

2005

Multi-Field Information Extraction and Cross-Document Fusion
Gideon Mann | David Yarowsky
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

Induction of Fine-Grained Part-of-Speech Taggers via Classifier Combination and Crosslingual Projection
Elliott Drábek | David Yarowsky
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

Exploiting Aggregate Properties of Bilingual Dictionaries For Distinguishing Senses of English Words and Inducing English Sense Clusters
Charles Schafer | David Yarowsky
Proceedings of the ACL Interactive Poster and Demonstration Sessions

Improving Bitext Word Alignments via Syntax-based Reordering of English
Elliott Franco Drabek | David Yarowsky
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2003

Unsupervised Personal Name Disambiguation
Gideon Mann | David Yarowsky
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003

Statistical Machine Translation Using Coercive Two-Level Syntactic Transduction
Charles Schafer | David Yarowsky
Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing

pdf bib
Minimally Supervised Induction of Grammatical Gender
Silviu Cucerzan | David Yarowsky
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Desperately Seeking Cebuano
Douglas W. Oard | David Doermann | Bonnie Dorr | Daqing He | Philip Resnik | Amy Weinberg | William Byrne | Sanjeev Khudanpur | David Yarowsky | Anton Leuski | Philipp Koehn | Kevin Knight
Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers

2002

pdf bib
Modeling Consensus: Classifier Combination for Word Sense Disambiguation
Radu Florian | David Yarowsky
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

pdf bib
Augmented Mixture Models for Lexical Disambiguation
Silviu Cucerzan | David Yarowsky
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

pdf bib
Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day
Silviu Cucerzan | David Yarowsky
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

pdf bib
Language Independent NER using a Unified Model of Internal and Contextual Evidence
Silviu Cucerzan | David Yarowsky
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

pdf bib
Inducing Translation Lexicons via Diverse Similarity Measures and Bridge Languages
Charles Schafer | David Yarowsky
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

pdf bib
Inducing Information Extraction Systems for New Languages via Cross-language Projection
Ellen Riloff | Charles Schafer | David Yarowsky
COLING 2002: The 19th International Conference on Computational Linguistics

2001

pdf bib
Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora
David Yarowsky | Grace Ngai | Richard Wicentowski
Proceedings of the First International Conference on Human Language Technology Research

pdf bib
Multipath Translation Lexicon Induction via Bridge Languages
Gideon S. Mann | David Yarowsky
Second Meeting of the North American Chapter of the Association for Computational Linguistics

pdf bib
Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora
David Yarowsky | Grace Ngai
Second Meeting of the North American Chapter of the Association for Computational Linguistics

pdf bib
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems
Judita Preiss | David Yarowsky
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

pdf bib
The Johns Hopkins SENSEVAL-2 System Descriptions
David Yarowsky | Silviu Cucerzan | Radu Florian | Charles Schafer | Richard Wicentowski
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

pdf bib
Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking
Grace Ngai | David Yarowsky
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf bib
Minimally Supervised Morphological Analysis by Multimodal Alignment
David Yarowsky | Richard Wicentowski
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf bib
Language Independent, Minimally Supervised Induction of Lexical Probabilities
Silviu Cucerzan | David Yarowsky
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

1999

pdf bib
Dynamic Nonlocal Language Modeling via Hierarchical Topic-Based Adaptation
Radu Florian | David Yarowsky
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

pdf bib
Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence
Silviu Cucerzan | David Yarowsky
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

pdf bib
Taking the load off the conference chairs - towards a digital paper-routing assistant
David Yarowsky | Radu Florian
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

1995

pdf bib
Unsupervised Word Sense Disambiguation Rivaling Supervised Methods
David Yarowsky
33rd Annual Meeting of the Association for Computational Linguistics

1994

pdf bib
Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French
David Yarowsky
32nd Annual Meeting of the Association for Computational Linguistics

1993

pdf bib
One Sense per Collocation
David Yarowsky
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993

1992

pdf bib
One Sense Per Discourse
William A. Gale | Kenneth W. Church | David Yarowsky
Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992

pdf bib
Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs
William Gale | Kenneth Ward Church | David Yarowsky
30th Annual Meeting of the Association for Computational Linguistics

pdf bib
Word-Sense Disambiguation Using Statistical Models of Roget’s Categories Trained on Large Corpora
David Yarowsky
COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics
