Niko Partanen


2020

pdf bib
Improving the Language Model for Low-Resource ASR with Online Text Corpora
Nils Hjortnaes | Timofey Arkhangelskiy | Niko Partanen | Michael Rießler | Francis Tyers
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

In this paper, we expand on previous work on automatic speech recognition in a low-resource scenario typical of data collected by field linguists. We train DeepSpeech models on 35 hours of dialectal Komi speech recordings and correct the output using language models constructed from various sources. Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high. In this paper we present further experiments with language models created using KenLM from text materials available online. These are constructed from two corpora, one containing literary texts, one for social media content, and another combining the two. We then trained the model using each language model to explore the impact of the language model data source on the speech recognition model. Our results show significant improvements of over 25% in character error rate and nearly 20% in word error rate. This offers important methodological insight into how ASR results can be improved under low-resource conditions: transfer learning can be used to compensate the lack of training data in the target language, and online texts are a very useful resource when developing language models in this context.

pdf bib
A Report on the VarDial Evaluation Campaign 2020
Mihaela Gaman | Dirk Hovy | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Christoph Purschke | Yves Scherrer | Marcos Zampieri
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams who enrolled to participate in one or multiple shared tasks and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.

pdf bib
Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora
Tommi Jauhiainen | Heidi Jauhiainen | Niko Partanen | Krister Lindén
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI shared task and how the test set was constructed using the Wanca 2017 corpora and texts in different languages from the Leipzig corpora collection. We also provide the results of a baseline language identification experiment conducted using the ULI 2020 dataset.

pdf bib
A pseudonymisation method for language documentation corpora: An experiment with spoken Komi
Rogier Blokland | Niko Partanen | Michael Rießler
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

pdf bib
On the questions in developing computational infrastructure for Komi-Permyak
Jack Rueter | Niko Partanen | Larisa Ponomareva
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

pdf bib
Towards a Speech Recognizer for Komi, an Endangered and Low-Resource Uralic Language
Nils Hjortnaes | Niko Partanen | Michael Rießler | Francis M. Tyers
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

pdf bib
Ve’rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter | Niko Partanen
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

We present an open-source online dictionary editing system, Ve′rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that the native speakers and the dictionary oriented are lacking technical understanding to utilize the infrastructures which might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic language masking the technical complexities behind a user-friendly UI.

pdf bib
Open-Source Morphology for Endangered Mordvinic Languages
Jack Rueter | Mika Hämäläinen | Niko Partanen
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

This document describes shared development of finite-state description of two closely related but endangered minority languages, Erzya and Moksha. It touches upon morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that they can benefit from existing open-source infrastructures and are as reusable as possible.

2019

pdf bib
Dialect Text Normalization to Normative Standard Finnish
Niko Partanen | Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We compare different LSTMs and transformer models in terms of their effectiveness in normalizing dialectal Finnish into the normative standard Finnish. As dialect is the common way of communication for people online in Finnish, such a normalization is a necessary step to improve the accuracy of the existing Finnish NLP tools that are tailored for normative Finnish text. We work on a corpus consisting of dialectal data of 23 distinct Finnish dialects. The best functioning BRNN approach lowers the initial word error rate of the corpus from 52.89 to 5.73.

pdf bib
An OCR system for the Unified Northern Alphabet
Niko Partanen | Michael Rießler
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

pdf bib
Survey of Uralic Universal Dependencies development
Niko Partanen | Jack Rueter
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2018

pdf bib
Multilingual Dependency Parsing for Low-Resource Languages: Case Studies on North Saami and Komi-Zyrian
KyungTae Lim | Niko Partanen | Thierry Poibeau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Dependency Parsing of Code-Switching Data with Cross-Lingual Feature Representations
Niko Partanen | Kyungtae Lim | Michael Rießler | Thierry Poibeau
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

pdf bib
The First Komi-Zyrian Universal Dependencies Treebanks
Niko Partanen | Rogier Blokland | KyungTae Lim | Thierry Poibeau | Michael Rießler
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.

2017

pdf bib
Instant annotations in ELAN corpora of spoken and written Komi, an endangered language of the Barents Sea region
Ciprian Gerstenberger | Niko Partanen | Michael Rießler
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

pdf bib
Instant Annotations – Applying NLP Methods to the Annotation of Spoken Language Documentation Corpora
Ciprian Gerstenberger | Niko Partanen | Michael Rießler | Joshua Wilbur
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages