Anton Karl Ingason

Also published as: Anton K. Ingason


2020

pdf bib
Creating a Parallel Icelandic Dependency Treebank from Raw Text to Universal Dependencies
Hildur Jónsdóttir | Anton Karl Ingason
Proceedings of the 12th Language Resources and Evaluation Conference

Making the low-resource language, Icelandic, accessible and usable in Language Technology is a work in progress and is supported by the Icelandic government. Creating resources and suitable training data (e.g., a dependency treebank) is a fundamental part of that work. We describe work on a parallel Icelandic dependency treebank based on Universal Dependencies (UD). This is important because it is the first parallel treebank resource for the language and since several other languages already have a resource based on the same text. Two Icelandic treebanks based on phrase-structure grammar have been built and ongoing work aims to convert them to UD. Previously, limited work has been done on dependency grammar for Icelandic. The current project aims to ameliorate this situation by creating a small dependency treebank from scratch. Creating a treebank is a laborious task so the process was implemented in an accessible manner using freely available tools and resources. The parallel data in the UD project was chosen as a source because this would furthermore give us the first parallel treebank for Icelandic. The Icelandic parallel UD corpus will be published as part of UD version 2.6.

pdf bib
Language Technology Programme for Icelandic 2019-2023
Anna Nikulásdóttir | Jón Guðnason | Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Steinþór Steingrímsson
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we describe a new national language technology programme for Icelandic. The programme, which spans a period of five years, aims at making Icelandic usable in communication and interactions in the digital world, by developing accessible, open-source language resources and software. The research and development work within the programme is carried out by a consortium of universities, institutions, and private companies, with a strong emphasis on cooperation between academia and industries. Five core projects will be the main content of the programme: language resources, speech recognition, speech synthesis, machine translation, and spell and grammar checking. We also describe other national language technology programmes and give an overview over the history of language technology in Iceland.

pdf bib
Disambiguating Confusion Sets as an Aid for Dyslexic Spelling
Steinunn Rut Friðriksdóttir | Anton Karl Ingason
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Spell checkers and other proofreading software are crucial tools for people with dyslexia and other reading disabilities. Most spell checkers automatically detect spelling mistakes by looking up individual words and seeing if they exist in the vocabulary. However, one of the biggest challenges of automatic spelling correction is how to deal with real-word errors, i.e. spelling mistakes which lead to a real but unintended word, such as when then is written in place of than. These errors account for 20% of all spelling mistakes made by people with dyslexia. As both words exist in the vocabulary, a simple dictionary lookup will not detect the mistake. The only way to disambiguate which word was actually intended is to look at the context in which the word appears. This problem is particularly apparent in languages with rich morphology where there is often minimal orthographic difference between grammatical items. In this paper, we present our novel confusion set corpus for Icelandic and discuss how it could be used for context-sensitive spelling correction. We have collected word pairs from seven different categories, chosen for their homophonous properties, along with sentence examples and frequency information from said pairs. We present a small-scale machine learning experiment using a decision tree binary classification which results range from 73% to 86% average accuracy with 10-fold cross validation. While not intended as a finalized result, the method shows potential and will be improved in future research.

pdf bib
A Universal Dependencies Conversion Pipeline for a Penn-format Constituency Treebank
Þórunn Arnardóttir | Hinrik Hafsteinsson | Einar Freyr Sigurðsson | Kristín Bjarnadóttir | Anton Karl Ingason | Hildur Jónsdóttir | Steinþór Steingrímsson
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

The topic of this paper is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe an Icelandic constituency treebank, its annotation scheme and the UD scheme. The conversion is discussed, the methods used to deliver a fully automated UD corpus and complications involved. To show its applicability to corpora in different languages, we extend the pipeline and convert a Faroese constituency treebank to a UD corpus. The result is an open-source conversion tool, published under an Apache 2.0 license, applicable to a Penn-style treebank for conversion to a UD corpus, along with the two new UD corpora.

2014

pdf bib
Rapid Deployment of Phrase Structure Parsing for Related Languages: A Case Study of Insular Scandinavian
Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Joel C. Wallenberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents ongoing work that aims to improve machine parsing of Faroese using a combination of Faroese and Icelandic training data. We show that even if we only have a relatively small parsed corpus of one language, namely 53,000 words of Faroese, we can obtain better results by adding information about phrase structure from a closely related language which has a similar syntax. Our experiment uses the Berkeley parser. We demonstrate that the addition of Icelandic data without any other modification to the experimental setup results in an f-measure improvement from 75.44% to 78.05% in Faroese and an improvement in part-of-speech tagging accuracy from 88.86% to 90.40%.

2012

pdf bib
The Icelandic Parsed Historical Corpus (IcePaHC)
Eiríkur Rögnvaldsson | Anton Karl Ingason | Einar Freyr Sigurðsson | Joel Wallenberg
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the background for and building of IcePaHC, a one million word parsed historical corpus of Icelandic which has just been finished. This corpus which is completely free and open contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and text collecting process and discuss the quality of the texts and their conversion to modern Icelandic spelling. We explain why we choose to use a phrase structure Penn style annotation scheme and briefly describe the syntactic anno-tation process. We also describe a spin-off project which is only in its beginning stages: a parsed historical corpus of Faroese. Finally, we advocate the importance of an open source policy as regards language resources.

2009

pdf bib
Context-Sensitive Spelling Correction and Rich Morphology
Anton K. Ingason | Skúli B. Jóhannsson | Eiríkur Rögnvaldsson | Hrafn Loftsson | Sigrún Helgadóttir
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)