Daan van Esch


2020

pdf bib
Data-Driven Parametric Text Normalization: Rapidly Scaling Finite-State Transduction Verbalizers to New Languages
Sandy Ritchie | Eoin Mahon | Kim Heiligenstein | Nikos Bampounis | Daan van Esch | Christian Schallhart | Jonas Mortensen | Benoit Brard
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

This paper presents a methodology for rapidly generating FST-based verbalizers for ASR and TTS systems by efficiently sourcing language-specific data. We describe a questionnaire which collects the necessary data to bootstrap the number grammar induction system and parameterize the verbalizer templates described in Ritchie et al. (2019), and a machine-readable data store which allows the data collected through the questionnaire to be supplemented by additional data from other sources. This system allows us to rapidly scale technologies such as ASR and TTS to more languages, including low-resource languages.

pdf bib
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
Isaac Caswell | Theresa Breiner | Daan van Esch | Ankur Bapna
Proceedings of the 28th International Conference on Computational Linguistics

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.

2019

pdf bib
Future Directions in Technological Support for Language Documentation
Daan van Esch | Ben Foley | Nay San
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2018

pdf bib
Text Normalization Infrastructure that Scales to Hundreds of Language Varieties
Mason Chua | Daan van Esch | Noah Coccaro | Eunjoon Cho | Sujeet Bhandari | Libin Jia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)