Michal Křen


2016

SYN2015: Representative Corpus of Contemporary Written Czech
Michal Křen | Václav Cvrček | Tomáš Čapka | Anna Čermáková | Milena Hnátková | Lucie Chlumská | Tomáš Jelínek | Dominika Kováříková | Vladimír Petkevič | Pavel Procházka | Hana Skoumalová | Michal Škrabal | Petr Truneček | Pavel Vondřička | Adrian Jan Zasina
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper concentrates on the design, composition and annotation of SYN2015, a new 100-million representative corpus of contemporary written Czech. SYN2015 is a sequel to the representative corpora of the SYN series that can be described as traditional (as opposed to the web-crawled corpora), featuring cleared copyright issues, well-defined composition, reliability of annotation and high-quality text processing. At the same time, SYN2015 is designed as a reflection of the variety of written Czech text production with necessary methodological and technological enhancements that include a detailed bibliographic annotation and text classification based on an updated scheme. The corpus has been produced using a completely rebuilt text processing toolchain called SynKorp. SYN2015 is lemmatized and morphologically and syntactically annotated with state-of-the-art tools. It has been published within the framework of the Czech National Corpus and is available both via the standard corpus query interface KonText at http://kontext.korpus.cz and as a dataset in shuffled format.

Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages
Scott Piao | Paul Rayson | Dawn Archer | Francesca Bianchi | Carmen Dayrell | Mahmoud El-Haj | Ricardo-María Jiménez | Dawn Knight | Michal Křen | Laura Löfberg | Rao Muhammad Adeel Nawab | Jawad Shafi | Phoey Lee Teh | Olga Mudraya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The last two decades have seen the development of various semantic lexical resources such as WordNet (Miller, 1995) and the USAS semantic lexicon (Rayson et al., 2004), which have played an important role in the areas of natural language processing and corpus-based studies. Recently, increasing efforts have been devoted to extending the semantic frameworks of existing lexical knowledge resources to cover more languages, such as EuroWordNet and Global WordNet. In this paper, we report on the construction of large-scale multilingual semantic lexicons for twelve languages, which employ the unified Lancaster semantic taxonomy and provide a multilingual lexical knowledge base for the automatic UCREL semantic annotation system (USAS). Our work contributes towards the goal of constructing larger-scale and higher-quality multilingual semantic lexical resources and developing corpus annotation tools based on them. Lexical coverage is an important factor concerning the quality of the lexicons and the performance of the corpus annotation tools, and in this experiment we focus on evaluating the lexical coverage achieved by the multilingual lexicons and semantic annotation tools based on them. Our evaluation shows that some semantic lexicons such as those for Finnish and Italian have achieved lexical coverage of over 90% while others need further expansion.

2014

The SYN-series corpora of written Czech
Milena Hnátková | Michal Křen | Pavel Procházka | Hana Skoumalová
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The paper gives an overview of the SYN series of synchronic corpora of written Czech compiled within the framework of the Czech National Corpus project. It describes their design and processing with a focus on the annotation, i.e. lemmatization and morphological tagging. The paper also introduces SYN2013PUB, a new 935-million newspaper corpus of Czech published in 2013 as the most recent addition to the SYN series before the planned revision of its architecture. SYN2013PUB can be seen as a completion of the series in terms of titles and publication dates of major Czech newspapers that are now covered by complete volumes in comparable proportions. All SYN-series corpora can be characterized as traditional, with emphasis on cleared copyright issues, well-defined composition, reliable metadata and high-quality data processing; their overall size currently exceeds 2.2 billion running words.

2012

Balanced data repository of spontaneous spoken Czech
Lucie Válková | Martina Waclawičová | Michal Křen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents a data repository that will be used as a source of data for ORAL2013, a new corpus of spontaneous spoken Czech. The corpus is planned to be published in 2013 within the framework of the Czech National Corpus and it will contain both the audio recordings and their transcriptions manually aligned with time stamps. The corpus will be designed as a representation of contemporary spontaneous spoken language used in informal, real-life situations across the whole Czech Republic and thus balanced in the main sociolinguistic categories of speakers. Therefore, the data repository features broad regional coverage with a large variety of speakers, as well as precise and uniform processing. The repository is already built, basically balanced, and comprises 3 million words proper (i.e. tokens not including punctuation). Before publication, another set of overall consistency checks will be carried out, as well as the final selection of the transcriptions to be included in ORAL2013 as the final product.