Mika Hämäläinen


2020

Morphological Disambiguation of South Sámi with FSTs and Neural Networks
Mika Hämäläinen | Linda Wiechetek
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags, ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it applicable to other endangered languages as well.
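The tag-level abstraction described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical reading format of `lemma+TAG+TAG...` and placeholder lemmas; the analyzer output format and tag set actually used in the paper may differ.

```python
from typing import List


def strip_lemma(reading: str) -> str:
    """Drop the lemma (text before the first '+') and keep only the tag string.

    Assumes the hypothetical format 'lemma+TAG+TAG'; a reading without '+'
    yields an empty tag string.
    """
    _, _, tags = reading.partition("+")
    return tags


def abstract_sentence(readings_per_word: List[List[str]]) -> List[List[str]]:
    """Map each word's ambiguous readings to a sorted list of unique tag strings.

    Word forms and lemmas are discarded, so the result is language-neutral
    as long as both languages share the same tag inventory.
    """
    return [
        sorted({strip_lemma(reading) for reading in readings})
        for readings in readings_per_word
    ]


# Placeholder lemmas, not real South Sámi words.
sentence = [
    ["lemmaA+N+Sg+Nom", "lemmaB+N+Sg+Nom", "lemmaA+V+Ind+Prs+Sg3"],
    ["lemmaC+V+Ind+Prs+Sg3"],
]
abstracted = abstract_sentence(sentence)
```

Because two readings that differ only in their lemma collapse into one tag string, a model trained on such sequences never sees language-specific vocabulary, which is what allows training data from a related language to be reused.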

FST Morphology for the Endangered Skolt Sami Language
Jack Rueter | Mika Hämäläinen
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami has a rich morphology but very little gold-standard material. This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered, and the work presented in this paper forms part of a larger revitalization effort. Furthermore, we intersperse our description with facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami words in 148 inflectional paradigms and over 12 derivational forms.

On Editing Dictionaries for Uralic Languages in an Online Environment
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

Ve’rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter | Niko Partanen
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

We present an open-source online dictionary editing system, Ve′rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. One problem is getting the community to take part in work beyond the pencil-and-paper level. At times, it seems that native speakers and dictionary-oriented contributors lack the technical understanding needed to use the infrastructures that could make their work more meaningful in the future, i.e. by enabling multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic languages, masking the technical complexities behind a user-friendly UI.

Open-Source Morphology for Endangered Mordvinic Languages
Jack Rueter | Mika Hämäläinen | Niko Partanen
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

This document describes the shared development of finite-state descriptions of two closely related but endangered minority languages, Erzya and Moksha. It touches upon the morpholexical unity and diversity of the two languages and how this motivates shared open-source FST development. We describe how we have designed the transducers so that they can benefit from existing open-source infrastructures and are as reusable as possible.

2019

Generating Modern Poetry Automatically in Finnish
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We present a novel approach for generating poetry automatically for the morphologically rich Finnish language by using a genetic algorithm. The approach improves the state of the art of the previous Finnish poem generators by introducing a higher degree of freedom in terms of structural creativity. Our approach is evaluated and described within the paradigm of computational creativity, where the fitness functions of the genetic algorithm are assimilated with the notion of aesthetics. The output is considered to be a poem 81.5% of the time by human evaluators.

Dialect Text Normalization to Normative Standard Finnish
Niko Partanen | Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We compare different LSTMs and transformer models in terms of their effectiveness in normalizing dialectal Finnish into normative standard Finnish. As dialect is the common way of communicating online in Finnish, such normalization is a necessary step to improve the accuracy of the existing Finnish NLP tools that are tailored for normative Finnish text. We work on a corpus consisting of dialectal data from 23 distinct Finnish dialects. The best-performing BRNN approach lowers the initial word error rate of the corpus from 52.89 to 5.73.
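For readers unfamiliar with the metric, word error rate can be computed as a word-level edit distance divided by the reference length. The following is a minimal sketch under that standard definition, reported as a percentage; the function names are illustrative, and the evaluation script used in the paper may differ in tokenization and aggregation.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,                         # deletion
                    cur[j - 1] + 1,                      # insertion
                    prev[j - 1] + (ref_tok != hyp_tok),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edits needed, relative to reference length."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

With this definition, an unnormalized dialectal sentence is scored against its standard-language counterpart to obtain the initial error rate, and the model output is scored against the same reference to obtain the final one.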

Revisiting NMT for Normalization of Early English Letters
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized, leaving only the less frequent deviant forms unnormalized. This paper discusses different methods for improving the normalization of these deviant forms. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model, together with lemmatization, improves results.
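The candidate filtering mentioned in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: `pick_candidate`, the toy lemmatizer, and the fallback to the top-ranked candidate are all hypothetical choices.

```python
from typing import Callable, Sequence, Set


def pick_candidate(
    candidates: Sequence[str],
    lexicon: Set[str],
    lemmatize: Callable[[str], str],
) -> str:
    """Return the highest-ranked candidate attested in the lexicon.

    A candidate counts as attested if either the form itself or its lemma
    is in the lexicon. If none is attested, fall back to the top-ranked
    candidate (an assumption made for this sketch).
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    for candidate in candidates:
        if candidate in lexicon or lemmatize(candidate) in lexicon:
            return candidate
    return candidates[0]


def toy_lemmatize(word: str) -> str:
    """Toy stand-in for a real lemmatizer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# Candidates are assumed to be ordered best-first by the NMT model.
chosen = pick_candidate(["lettrs", "letters", "lettres"], {"letter"}, toy_lemmatize)
```

Lemmatizing before the lookup lets inflected candidates pass the filter even when the lexical resource lists only base forms.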

Co-Operation as an Asymmetric Form of Human-Computer Creativity. Case: Peace Machine
Mika Hämäläinen | Timo Honkela
Proceedings of the First Workshop on NLP for Conversational AI

This theoretical paper identifies a need for a definition of asymmetric co-creativity where creativity is expected from the computational agent but not from the human user. Our co-operative creativity framework takes into account that the computational agent has a message to convey in a co-operative fashion, which introduces a trade-off on how creative the computer can be. The requirements of co-operation are identified from an interdisciplinary point of view. We divide co-operative creativity into message creativity, contextual creativity and communicative creativity. Finally, these notions are applied in the context of the Peace Machine system concept.

Finding Sami Cognates with a Character-Based NMT Approach
Mika Hämäläinen | Jack Rueter
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Morphosyntactic Disambiguation in an Endangered Language Setting
Jeff Ens | Mika Hämäläinen | Jack Rueter | Philippe Pasquier
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Endangered Uralic languages present a high variety of inflectional forms in their morphology. This results in a high number of homonyms in inflections, which introduces a lot of morphological ambiguity in sentences. Previous research has employed constraint grammars (CGs) to address this problem; however, CGs are often unable to fully disambiguate a sentence, and their development is labour-intensive. We present an LSTM-based model for automatically ranking morphological readings of sentences based on their quality. This ranking can be used to evaluate the existing CG disambiguators or to directly morphologically disambiguate sentences. Our approach works on a morphological abstraction, and it can be trained with a very small dataset.

Let’s FACE it. Finnish Poetry Generation with Aesthetics and Framing
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 12th International Conference on Natural Language Generation

We present a creative poem generator for the morphologically rich Finnish language. Our method falls into the master-apprentice paradigm, where a computationally creative genetic algorithm teaches a BRNN model to generate poetry. We model several parts of poetic aesthetics in the fitness function of the genetic algorithm, such as sonic features, semantic coherence, imagery and metaphor. Furthermore, we justify the creativity of our method based on the FACE theory on computational creativity and take additional care in evaluating our system by automatic metrics for concepts together with human evaluation for aesthetics, framing and expressions.

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction
Mika Hämäläinen | Simon Hengchen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is time-consuming, and most automatic approaches have relied on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.

2018

Combining Concepts and Their Translations from Structured Dictionaries of Uralic Minority Languages
Mika Hämäläinen | Liisa Lotta Tarvainen | Jack Rueter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Development of an Open Source Natural Language Generation Tool for Finnish
Mika Hämäläinen | Jack Rueter
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Normalizing Early English Letters to Present-day English Spelling
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FST. Different normalization methods are compared and evaluated. All of the methods have their own strengths in word normalization. This calls for finding ways of combining the results from these methods to leverage their individual strengths.

Poem Machine - a Co-creative NLG Web Application for Poem Writing
Mika Hämäläinen
Proceedings of the 11th International Conference on Natural Language Generation

We present Poem Machine, an interactive online tool for co-authoring Finnish poetry with a computationally creative agent. Poem Machine can produce poetry of its own and assist the user in authoring poems. The main target group for the system is primary school children, and its use as a part of teaching is currently under study.

A Master-Apprentice Approach to Automatic Creation of Culturally Satirical Movie Titles
Khalid Alnajjar | Mika Hämäläinen
Proceedings of the 11th International Conference on Natural Language Generation

Satire has played a role in indirectly expressing critique towards an authority or a person from time immemorial. We present an autonomously creative master-apprentice approach consisting of a genetic algorithm and an NMT model to produce humorous and culturally apt satire out of movie titles automatically. Furthermore, we evaluate the approach in terms of its creativity and its output. We provide a solid definition for creativity to maximize the objectiveness of the evaluation.

2017

Synchronized Mediawiki based analyzer dictionary development
Jack Rueter | Mika Hämäläinen
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages