Kimmo Kettunen


2019

FiST – towards a free Semantic Tagger of modern standard Finnish
Kimmo Kettunen
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

2017

Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger
Kimmo Kettunen | Laura Löfberg
Proceedings of the 21st Nordic Conference on Computational Linguistics

Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing
Mika Koistinen | Kimmo Kettunen | Tuula Pääkkönen
Proceedings of the 21st Nordic Conference on Computational Linguistics

2016

Measuring Lexical Quality of a Historical Finnish Newspaper Collection ― Analysis of Garbled OCR Data with Basic Language Technology Tools and Means
Kimmo Kettunen | Tuula Pääkkönen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001). This collection contains approximately 1.95 million pages in Finnish and Swedish. The Finnish part of the collection consists of about 2.39 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of this material is also freely downloadable from the Language Bank of Finland, provided by the FIN-CLARIN consortium. The collection can also be accessed through the Korp environment, which was developed by Språkbanken at the University of Gothenburg and extended by the FIN-CLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield-style information retrieval test collection has been produced from a small part of the Digi newspaper material at the University of Tampere (Järvelin et al., 2015). The quality of OCRed collections is an important topic in digital humanities, as it affects the general usability and searchability of collections. There is no single available method for assessing the quality of large collections, but different methods can be used to approximate it. This paper discusses different corpus-analysis-style ways to approximate the overall lexical quality of the Finnish part of the Digi collection.

2010

Normalized Compression Distance Based Measures for MetricsMATR 2010
Marcus Dobrinkat | Tero Tapiovaara | Jaakko Väyrynen | Kimmo Kettunen
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Evaluating Machine Translations Using mNCD
Marcus Dobrinkat | Tero Tapiovaara | Jaakko Väyrynen | Kimmo Kettunen
Proceedings of the ACL 2010 Conference Short Papers

2007

Managing Keyword Variation with Frequency Based Generation of Word Forms in IR
Kimmo Kettunen
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

1986

On modelling dependency-oriented parsing
Kimmo Kettunen
Proceedings of the 5th Nordic Conference of Computational Linguistics (NODALIDA 1985)

Is MT Linguistics?
Kimmo Kettunen
Computational Linguistics. Formerly the American Journal of Computational Linguistics, Volume 12, Number 1, January-March 1986