Brian Roark


2020

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Brian Roark | Lawrence Wolf-Sonkin | Christo Kirov | Sabrina J. Mielke | Cibu Johny | Isin Demirsahin | Keith Hall
Proceedings of the 12th Language Resources and Evaluation Conference

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Phonotactic Complexity and Its Trade-offs
Tiago Pimentel | Brian Roark | Ryan Cotterell
Transactions of the Association for Computational Linguistics, Volume 8

We present methods for calculating a measure of phonotactic complexity (bits per phoneme) that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the International Phonetic Alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare the entropy across languages, giving insight into how complex a language’s phonotactics is. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of −0.74 between bits per phoneme and the average length of words.

2019

Meaning to Form: Measuring Systematicity as Information
Tiago Pimentel | Arya D. McCarthy | Damian Blasi | Brian Roark | Ryan Cotterell
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram ‘gl’ have any systematic relationship to the meaning of words like ‘glisten’, ‘gleam’ and ‘glow’? In this work, we offer a holistic quantification of the systematicity of the sign using mutual information and recurrent neural networks. We employ these in a data-driven and massively multilingual approach to the question, examining 106 languages. We find a statistically significant reduction in entropy when modeling a word form conditioned on its semantic representation. Encouragingly, we also recover well-attested English examples of systematic affixes. We conclude with the meta-point: Our approximate effect size (measured in bits) is quite small—despite some amount of systematicity between form and meaning, an arbitrary relationship and its resulting benefits dominate human language.

What Kind of Language Is Hard to Language-Model?
Sabrina J. Mielke | Ryan Cotterell | Kyle Gorman | Brian Roark | Jason Eisner
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that “translationese” is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.

Distilling weighted finite automata from arbitrary probabilistic models
Ananda Theertha Suresh | Brian Roark | Michael Riley | Vlad Schogol
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference-of-convex optimization, both of which can be performed efficiently. We demonstrate the usefulness of our approach on some tasks including distilling n-gram models from neural models.

Latin script keyboards for South Asian languages with finite-state normalization
Lawrence Wolf-Sonkin | Vlad Schogol | Brian Roark | Michael Riley
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script. We explore several compact finite-state architectures that permit variable spellings of words during mobile text entry. We find that approaches making use of transliteration transducers provide large accuracy improvements over baselines, but that simpler approaches involving a compact representation of many attested alternatives yield much of the accuracy gain. This is particularly important when operating under constraints on model size (e.g., on inexpensive mobile devices with limited storage and memory for keyboard models), and on speed of inference, since people typing on mobile keyboards expect no perceptual delay in keyboard responsiveness.

Rethinking Phonotactic Complexity
Tiago Pimentel | Brian Roark | Ryan Cotterell
Proceedings of the 2019 Workshop on Widening NLP

In this work, we propose the use of phone-level language models to estimate phonotactic complexity, measured in bits per phoneme, which makes cross-linguistic comparison straightforward. We compare the entropy across languages using this simple measure, gaining insight into how complex different languages’ phonotactics are. Finally, we show a very strong negative correlation between phonotactic complexity and the average length of words (Spearman rho = −0.744) when analysing a collection of 106 languages with 1016 basic concepts each.

Neural Models of Text Normalization for Speech Applications
Hao Zhang | Richard Sproat | Axel H. Ng | Felix Stahlberg | Xiaochang Peng | Kyle Gorman | Brian Roark
Computational Linguistics, Volume 45, Issue 2 - June 2019

Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that 123 is verbalized as one hundred twenty three in 123 pages but as one twenty three in 123 King Ave. For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars. We propose neural network models that treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context, and the output is the verbalization of that token. We find that the most effective model, in accuracy and efficiency, is one where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in terms of representing the context, and also allows us to integrate tagging and segmentation into the process. These models perform very well overall, but occasionally they will predict wildly inappropriate verbalizations, such as reading 3 cm as three kilometers. Although rare, such verbalizations are a major issue for TTS applications. We thus use finite-state covering grammars to guide the neural models, either during training and decoding, or just during decoding, away from such “unrecoverable” errors. Such grammars can largely be learned from data.

2018

Are All Languages Equally Hard to Language-Model?
Ryan Cotterell | Sabrina J. Mielke | Jason Eisner | Brian Roark
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.

2017

Transliterated Mobile Keyboard Input via Weighted Finite-State Transducers
Lars Hellsten | Brian Roark | Prasoon Goyal | Cyril Allauzen | Françoise Beaufays | Tom Ouyang | Michael Riley | David Rybach
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

2016

Distributed representation and estimation of WFST-based n-gram models
Cyril Allauzen | Michael Riley | Brian Roark
Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata

2015

Graph-Based Word Alignment for Clinical Language Evaluation
Emily Prud’hommeaux | Brian Roark
Computational Linguistics, Volume 41, Issue 4 - December 2015

2014

Applications of Lexicographic Semirings to Problems in Speech and Language Processing
Richard Sproat | Mahsa Yarmohammadi | Izhak Shafran | Brian Roark
Computational Linguistics, Volume 40, Issue 4 - December 2014

Hippocratic Abbreviation Expansion
Brian Roark | Richard Sproat
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Transforming trees into hedges and parsing with “hedgebank” grammars
Mahsa Yarmohammadi | Aaron Dunlop | Brian Roark
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Challenges in Automating Maze Detection
Eric Morley | Anna Eva Hallin | Brian Roark
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Data Driven Grammatical Error Detection in Transcripts of Children’s Speech
Eric Morley | Anna Eva Hallin | Brian Roark
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries
Russell Beckley | Brian Roark
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

The Utility of Manual and Automatic Linguistic Error Codes for Identifying Neurodevelopmental Disorders
Eric Morley | Brian Roark | Jan van Santen
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies
Jan Alexandersson | Peter Ljunglöf | Kathleen F. McCoy | François Portet | Brian Roark | Frank Rudzicz | Michel Vacher
Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies

Smoothed marginal distribution constraints for language modeling
Brian Roark | Cyril Allauzen | Michael Riley
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Discriminative Joint Modeling of Lexical Variation and Acoustic Confusion for Automated Narrative Retelling Assessment
Maider Lehr | Izhak Shafran | Emily Prud’hommeaux | Brian Roark
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Distributional semantic models for the evaluation of disordered language
Masoud Rouhizadeh | Emily Prud’hommeaux | Brian Roark | Jan van Santen
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Finite-State Chart Constraints for Reduced Complexity Context-Free Parsing Pipelines
Brian Roark | Kristy Hollingshead | Nathan Bodenstab
Computational Linguistics, Volume 38, Issue 4 - December 2012

The OpenGrm open-source finite-state grammar software libraries
Brian Roark | Richard Sproat | Cyril Allauzen | Michael Riley | Jeffrey Sorensen | Terry Tai
Proceedings of the ACL 2012 System Demonstrations

Robust kaomoji detection in Twitter
Steven Bedrick | Russell Beckley | Brian Roark | Richard Sproat
Proceedings of the Second Workshop on Language in Social Media

Graph-based alignment of narratives for automated neurological assessment
Emily Prud’hommeaux | Brian Roark
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies
Jan Alexandersson | Peter Ljunglöf | Kathleen F. McCoy | Brian Roark | Annalu Waller
Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies

2011

Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
Zhifei Li | Ziyuan Wang | Jason Eisner | Sanjeev Khudanpur | Brian Roark
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Classification of Atypical Language in Autism
Emily T. Prud’hommeaux | Brian Roark | Lois M. Black | Jan van Santen
Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics

Towards technology-assisted co-construction with communication partners
Brian Roark | Andrew Fowler | Richard Sproat | Christopher Gibbons | Melanie Fried-Oken
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

Asynchronous fixed-grid scanning with dynamic codes
Russ Beckley | Brian Roark
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

Efficient Matrix-Encoded Grammars and Low Latency Parallelization Strategies for CYK
Aaron Dunlop | Nathan Bodenstab | Brian Roark
Proceedings of the 12th International Conference on Parsing Technologies

Beam-Width Prediction for Efficient Context-Free Parsing
Nathan Bodenstab | Aaron Dunlop | Keith Hall | Brian Roark
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Lexicographic Semirings for Exact Automata Encoding of Sequence Models
Brian Roark | Richard Sproat | Izhak Shafran
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Semi-Supervised Modeling for Prenominal Modifier Ordering
Margaret Mitchell | Aaron Dunlop | Brian Roark
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Unary Constraints for Efficient Context-Free Parsing
Nathan Bodenstab | Kristy Hollingshead | Brian Roark
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling
Kenneth Hild | Umut Orhan | Deniz Erdogmus | Brian Roark | Barry Oken | Shalini Purwar | Hooman Nezamfar | Melanie Fried-Oken
Proceedings of the ACL-HLT 2011 System Demonstrations

2010

Prenominal Modifier Ordering via Multiple Sequence Alignment
Aaron Dunlop | Margaret Mitchell | Brian Roark
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies
Melanie Fried-Oken | Kathleen F. McCoy | Brian Roark
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

Scanning methods and language modeling for binary switch typing
Brian Roark | Jacques de Villiers | Christopher Gibbons | Melanie Fried-Oken
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

Demo Session Abstracts
Brian Roark
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

2009

Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing
Brian Roark | Asaf Bachrach | Carlos Cardenas | Christophe Pallier
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Proceedings of the ACL-IJCNLP 2009 Student Research Workshop
Brian Roark | Grace Ngai | Davis Muhajereen D. Dimalen | Jenny Rose Finkel | Blaise Thomson
Proceedings of the ACL-IJCNLP 2009 Student Research Workshop

Linear Complexity Context-Free Parsing Pipelines via Chart Constraints
Brian Roark | Kristy Hollingshead
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts
Ciprian Chelba | Paul Kantor | Brian Roark
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts

2008

Classifying Chart Cells for Quadratic Complexity Context-Free Inference
Brian Roark | Kristy Hollingshead
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

Syntactic complexity measures for detecting Mild Cognitive Impairment
Brian Roark | Margaret Mitchell | Kristy Hollingshead
Biological, translational, and clinical language processing

Book Reviews: Putting Linguistics into Speech Recognition: The Regulus Grammar Compiler, by Manny Rayner, Beth Ann Hockey, and Pierette Bouillon
Brian Roark
Computational Linguistics, Volume 33, Number 2, June 2007

Pipeline Iteration
Kristy Hollingshead | Brian Roark
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

The utility of parse-derived features for automatic discourse segmentation
Seeger Fisher | Brian Roark
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

PCFGs with Syntactic and Prosodic Indicators of Speech Repairs
John Hale | Izhak Shafran | Lisa Yung | Bonnie J. Dorr | Mary Harper | Anna Krasnyanskaya | Matthew Lease | Yang Liu | Brian Roark | Matthew Snover | Robin Stewart
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

SParseval: Evaluation Metrics for Parsing Speech
Brian Roark | Mary Harper | Eugene Charniak | Bonnie Dorr | Mark Johnson | Jeremy Kahn | Yang Liu | Mari Ostendorf | John Hale | Anna Krasnyanskaya | Matthew Lease | Izhak Shafran | Matthew Snover | Robin Stewart | Lisa Yung
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

While both spoken and written language processing stand to benefit from parsing, the standard Parseval metrics (Black et al., 1991) and their canonical implementation (Sekine and Collins, 1997) are only useful for text. The Parseval metrics are undefined when the words input to the parser do not match the words in the gold standard parse tree exactly, and word errors are unavoidable with automatic speech recognition (ASR) systems. To fill this gap, we have developed a publicly available tool for scoring parses that implements a variety of metrics which can handle mismatches in words and segmentations, including: alignment-based bracket evaluation, alignment-based dependency evaluation, and a dependency evaluation that does not require alignment. We describe the different metrics, how to use the tool, and the outcome of an extensive set of experiments on the sensitivity.

Probabilistic Context-Free Grammar Induction Based on Structural Zeros
Mehryar Mohri | Brian Roark
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

2005

Comparing and Combining Finite-State and Context-Free Parsers
Kristy Hollingshead | Seeger Fisher | Brian Roark
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Discriminative Syntactic Language Modeling for Speech Recognition
Michael Collins | Brian Roark | Murat Saraclar
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm
Brian Roark | Murat Saraclar | Michael Collins | Mark Johnson
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

Incremental Parsing with the Perceptron Algorithm
Michael Collins | Brian Roark
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

Language Model Adaptation with MAP Estimation and the Perceptron Algorithm
Michiel Bacchiani | Brian Roark | Murat Saraclar
Proceedings of HLT-NAACL 2004: Short Papers

Efficient Incremental Beam-Search Parsing with Generative and Discriminative Models
Brian Roark
Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together

2003

Generalized Algorithms for Constructing Statistical Language Models
Cyril Allauzen | Mehryar Mohri | Brian Roark
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

Supervised and unsupervised PCFG adaptation to novel domains
Brian Roark | Michiel Bacchiani
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

2002

Markov Parsing: Lattice Rescoring with a Statistical Parser
Brian Roark
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

2001

Probabilistic Top-Down Parsing and Language Modeling
Brian Roark
Computational Linguistics, Volume 27, Number 2, June 2001

2000

Compact non-left-recursive grammars using the selective left-corner transform and factoring
Mark Johnson | Brian Roark
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

Measuring Efficiency in High-accuracy, Broad-coverage Statistical Parsing
Brian Roark | Eugene Charniak
Proceedings of the COLING-2000 Workshop on Efficiency In Large-Scale Parsing Systems

1999

Efficient probabilistic top-down and left-corner parsing
Brian Roark | Mark Johnson
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction
Brian Roark | Eugene Charniak
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

Noun-Phrase Co-occurrence Statistics for Semi-Automatic Semantic Lexicon Construction
Brian Roark | Eugene Charniak
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2
