Hrafn Loftsson


2020

Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions
Steinþór Steingrímsson | Hrafn Loftsson | Andy Way
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Parallel corpora are key to developing good machine translation systems. However, abundant parallel data are hard to come by, especially for languages with a low number of speakers. When rich morphology exacerbates the data sparsity problem, it is imperative to have accurate alignment and filtering methods that can help make the most of what is available by maximising the number of correctly translated segments in a corpus and minimising noise by removing incorrect translations and segments containing extraneous data. This paper sets out a research plan for improving alignment and filtering methods for parallel texts in low-resource settings. We propose an effective unsupervised alignment method to tackle the alignment problem. Moreover, we propose a strategy to supplement state-of-the-art models with automatically extracted information using basic NLP tools to effectively handle rich morphology.

Language Technology Programme for Icelandic 2019-2023
Anna Nikulásdóttir | Jón Guðnason | Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Steinþór Steingrímsson
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we describe a new national language technology programme for Icelandic. The programme, which spans a period of five years, aims at making Icelandic usable in communication and interactions in the digital world, by developing accessible, open-source language resources and software. The research and development work within the programme is carried out by a consortium of universities, institutions, and private companies, with a strong emphasis on cooperation between academia and industry. Five core projects will be the main content of the programme: language resources, speech recognition, speech synthesis, machine translation, and spell and grammar checking. We also describe other national language technology programmes and give an overview of the history of language technology in Iceland.

Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic
Jón Daðason | David Mollberg | Hrafn Loftsson | Kristín Bjarnadóttir
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how varying amounts of training data affect the performance of the model. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, negatively impacting the performance of many NLP tools. Our model is trained on a dataset of 2.9 million unique word forms and their constituent structures from the Database of Icelandic Morphology. The model learns how to split compound words into two parts and can be used to derive the constituent structure of any word form. Knowing the constituent structure of a word form makes it possible to generate the optimal split for a given task, e.g., a full split for subword tokenization, or, in the case of part-of-speech tagging, splitting an OOV word until the largest known morphological head is found. The model outperforms other previously published methods when evaluated on a corpus of manually split word forms. This method has been integrated into Kvistur, an Icelandic compound word analyzer.
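
The abstract above says that a two-way splitter can be used to derive a full constituent structure, and that different tasks then pick different splits. The following is a minimal Python sketch of that idea, not Kvistur's code. The lookup table `TOY_SPLITS` is a hypothetical stand-in for the trained BiLSTM, the function names are invented for illustration, and the sketch assumes the morphological head is the rightmost constituent.

```python
from typing import Optional, Set, Tuple, Union

# Hypothetical stand-in for a trained binary splitter (toy data only).
TOY_SPLITS = {
    "skólabókasafn": ("skóla", "bókasafn"),
    "bókasafn": ("bóka", "safn"),
}

# A tree is either an unsplittable string or a (left, right) pair of trees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def split_once(word: str) -> Optional[Tuple[str, str]]:
    """Return one binary split of the word, or None if it is not a compound."""
    return TOY_SPLITS.get(word)


def constituent_tree(word: str) -> Tree:
    """Apply the binary splitter recursively to build a constituent tree."""
    parts = split_once(word)
    if parts is None:
        return word
    left, right = parts
    return (constituent_tree(left), constituent_tree(right))


def full_split(tree: Tree) -> list:
    """Flatten the tree into its smallest parts (e.g. for subword tokenization)."""
    if isinstance(tree, str):
        return [tree]
    return full_split(tree[0]) + full_split(tree[1])


def largest_known_head(tree: Tree, vocab: Set[str]) -> Optional[str]:
    """Walk down the right branch until the surface form is in the vocabulary.

    Assumes right-headed compounds; returns None if no head is known.
    """
    node = tree
    while True:
        form = "".join(full_split(node))
        if form in vocab:
            return form
        if isinstance(node, str):
            return None
        node = node[1]
```

With this toy table, `constituent_tree("skólabókasafn")` nests the split of `bókasafn` under the right branch, `full_split` returns all three parts, and `largest_known_head` stops at `bókasafn` when that form is in the vocabulary and falls back to `safn` otherwise.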

2019

Nefnir: A high accuracy lemmatizer for Icelandic
Svanhvít Lilja Ingólfsdóttir | Hrafn Loftsson | Jón Friðrik Daðason | Kristín Bjarnadóttir
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Lemmatization, finding the basic morphological form of a word in a corpus, is an important step in many natural language processing tasks when working with morphologically rich languages. We describe and evaluate Nefnir, a new open source lemmatizer for Icelandic. Nefnir uses suffix substitution rules, derived from a large morphological database, to lemmatize tagged text. Evaluation shows that for correctly tagged text, Nefnir obtains an accuracy of 99.55%, and for text tagged with a PoS tagger, the accuracy obtained is 96.88%.
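
As a rough illustration of lemmatizing tagged text with suffix substitution rules, here is a minimal Python sketch. It is not Nefnir's implementation: the rule table `RULES`, the coarse tag names, and the longest-suffix-wins selection are all assumptions made for this example, and real rules would be derived from a morphological database rather than written by hand.

```python
# Hypothetical rules: tag -> list of (word suffix, lemma suffix) pairs.
RULES = {
    "NOUN": [("arnir", "ur"), ("um", "ur"), ("s", "ur")],
}


def lemmatize(word: str, tag: str) -> str:
    """Replace the longest matching suffix for this tag; else return the word."""
    best = None
    for suffix, replacement in RULES.get(tag, []):
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, replacement)
    if best is None:
        # No rule applies: fall back to the surface form.
        return word
    suffix, replacement = best
    return word[: len(word) - len(suffix)] + replacement
```

Under these toy rules, `lemmatize("hestarnir", "NOUN")` and `lemmatize("hestum", "NOUN")` both yield `hestur`, while a word whose tag has no rules is returned unchanged. Because the tag selects which rules apply, a tagging error can lead to a wrong lemma, which is consistent with the accuracy gap the abstract reports between correctly tagged and automatically tagged text.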

Towards High Accuracy Named Entity Recognition for Icelandic
Svanhvít Lilja Ingólfsdóttir | Sigurjón Þorsteinsson | Hrafn Loftsson
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We report on work in progress which consists of annotating an Icelandic corpus for named entities (NEs) and using it for training a named entity recognizer based on a Bidirectional Long Short-Term Memory model. Currently, we have annotated 7,538 NEs appearing in the first 200,000 tokens of a 1 million token corpus, MIM-GOLD, originally developed for serving as a gold standard for part-of-speech tagging. Our best performing model, trained on this subset of MIM-GOLD, and enriched with external word embeddings, obtains an overall F1 score of 81.3% when categorizing NEs into the following four categories: persons, locations, organizations and miscellaneous. Our preliminary results are promising, especially given the fact that 80% of MIM-GOLD has not yet been used for training.

Augmenting a BiLSTM Tagger with a Morphological Lexicon and a Lexical Category Identification Step
Steinþór Steingrímsson | Örvar Kárason | Hrafn Loftsson
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Previous work on using BiLSTM models for PoS tagging has primarily focused on small tagsets. We evaluate BiLSTM models for tagging Icelandic, a morphologically rich language, using a relatively large tagset. Our baseline BiLSTM model achieves higher accuracy than any other previously published tagger, when not taking advantage of a morphological lexicon. When we extend the model by incorporating such data, we outperform the earlier state-of-the-art results by a significant margin. We also report on work in progress that attempts to address the problem of data sparsity inherent to morphologically detailed, fine-grained tagsets. We experiment with training a separate model on only the lexical category and using the coarse-grained output tag as an input into the main model. This method further increases the accuracy and reduces the tagging errors by 21.3% compared to previous state-of-the-art results. Finally, we train and test our tagger on a new gold standard for Icelandic.

A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System
Vilhjálmur Þorsteinsson | Hulda Óladóttir | Hrafn Loftsson
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We present an open-source, wide-coverage context-free grammar (CFG) for Icelandic, and an accompanying parsing system. The grammar has over 5,600 nonterminals, 4,600 terminals and 19,000 productions in fully expanded form, with feature agreement constraints for case, gender, number and person. The parsing system consists of an enhanced Earley-based parser and a mechanism to select best-scoring parse trees from shared packed parse forests. Our parsing system is able to parse about 90% of all sentences in articles published on the main Icelandic news websites. Preliminary evaluation with evalb shows an F-measure of 70.72% on parsed sentences. Our system demonstrates that parsing a morphologically rich language using a wide-coverage CFG can be practical.

2014

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Nicoletta Calzolari | Khalid Choukri | Thierry Declerck | Hrafn Loftsson | Bente Maegaard | Joseph Mariani | Asuncion Moreno | Jan Odijk | Stelios Piperidis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Correcting Errors in a New Gold Standard for Tagging Icelandic Text
Sigrún Helgadóttir | Hrafn Loftsson | Eiríkur Rögnvaldsson
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe the correction of PoS tags in a new Icelandic corpus, MIM-GOLD, consisting of about 1 million tokens sampled from the Tagged Icelandic Corpus, MÍM, released in 2013. The goal is to use the corpus, among other things, as a new gold standard for training and testing PoS taggers. The construction of the corpus was first described in 2010 together with preliminary work on error detection and correction. In this paper, we further describe the correction of tags in the corpus. We describe manual correction and a method for semi-automatic error detection and correction. We show that, even after manual correction, the number of tagging errors in the corpus can be reduced significantly by applying our semi-automatic detection and correction method. After the semi-automatic error correction, preliminary evaluation of tagging accuracy shows very low error rates. We hope that the existence of the corpus will make it possible to improve PoS taggers for Icelandic text.

Rapid Deployment of Phrase Structure Parsing for Related Languages: A Case Study of Insular Scandinavian
Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Joel C. Wallenberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents ongoing work that aims to improve machine parsing of Faroese using a combination of Faroese and Icelandic training data. We show that even if we only have a relatively small parsed corpus of one language, namely 53,000 words of Faroese, we can obtain better results by adding information about phrase structure from a closely related language which has a similar syntax. Our experiment uses the Berkeley parser. We demonstrate that the addition of Icelandic data without any other modification to the experimental setup results in an f-measure improvement from 75.44% to 78.05% in Faroese and an improvement in part-of-speech tagging accuracy from 88.86% to 90.40%.

2013

Tagging the Past: Experiments using the Saga Corpus
Hrafn Loftsson
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic
Hrafn Loftsson | Robert Östling
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2011

Using a Morphological Database to Increase the Accuracy in POS Tagging
Hrafn Loftsson | Sigrún Helgadóttir | Eiríkur Rögnvaldsson
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2009

Improving the PoS tagging accuracy of Icelandic text
Hrafn Loftsson | Ida Kramarczyk | Sigrún Helgadóttir | Eiríkur Rögnvaldsson
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

Context-Sensitive Spelling Correction and Rich Morphology
Anton K. Ingason | Skúli B. Jóhannsson | Eiríkur Rögnvaldsson | Hrafn Loftsson | Sigrún Helgadóttir
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

Correcting a POS-Tagged Corpus Using Three Complementary Methods
Hrafn Loftsson
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2007

Tagging Icelandic Text using a Linguistic and a Statistical Tagger
Hrafn Loftsson
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

IceParser: An Incremental Finite-State Parser for Icelandic
Hrafn Loftsson | Eiríkur Rögnvaldsson
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)