Michal Novák


2019

SAO WMT19 Test Suite: Machine Translation of Audit Reports
Tereza Vojtěchová | Michal Novák | Miloš Klouček | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes a machine translation test set of documents from the auditing domain and its use as one of the “test suites” in the WMT19 News Translation Task for translation directions involving Czech, English and German. Our evaluation suggests that current MT systems optimized for the general news domain can perform quite well even in the particular domain of audit reports. The detailed manual evaluation, however, indicates that deep factual knowledge of the domain is necessary. To the naked eye of a non-expert, translations by many systems seem almost perfect, and automatic MT evaluation with one reference is practically useless for considering these details. Furthermore, we show on a sample document from the domain of agreements that even the best systems completely fail in preserving the semantics of the agreement, namely the identity of the parties.

2018

PAWS: A Multi-lingual Parallel Treebank with Anaphoric Relations
Anna Nedoluzhko | Michal Novák | Maciej Ogrodniczuk
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference

We present PAWS, a multi-lingual parallel treebank with coreference annotation. It consists of English texts from the Wall Street Journal translated into Czech, Russian and Polish. In addition, the texts are syntactically parsed and word-aligned. PAWS is based on PCEDT 2.0 and continues the tradition of multilingual treebanks with coreference annotation. The paper focuses on the coreference annotation in PAWS and its language-specific differences. PAWS offers linguistic material that can be further leveraged in cross-lingual studies, especially on coreference.

A Fine-grained Large-scale Analysis of Coreference Projection
Michal Novák
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference

We perform a fine-grained large-scale analysis of coreference projection. By projecting gold coreference from Czech to English and vice versa on Prague Czech-English Dependency Treebank 2.0 Coref, we set an upper bound for a proposed projection approach for these two languages. We undertake a detailed, thorough analysis that combines the analysis of the projection’s subtasks with an analysis of performance on individual mention types. The findings are accompanied by examples from the corpus.

2017

Introducing EVALD – Software Applications for Automatic Evaluation of Discourse in Czech
Kateřina Rysová | Magdaléna Rysová | Jiří Mírovský | Michal Novák
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In the paper, we introduce two software applications for automatic evaluation of coherence in Czech texts called EVALD – Evaluator of Discourse. The first one – EVALD 1.0 – evaluates texts written by native speakers of Czech on a five-step scale commonly used at Czech schools (grade 1 is the best, grade 5 is the worst). The second application is EVALD 1.0 for Foreigners, which assesses texts by non-native speakers of Czech using a six-step scale (A1–C2) according to CEFR. Both applications are available online at https://lindat.mff.cuni.cz/services/evald-foreign/.

Projection-based Coreference Resolution Using Deep Syntax
Michal Novák | Anna Nedoluzhko | Zdeněk Žabokrtský
Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017)

The paper describes a system for coreference resolution in German and Russian, trained exclusively on coreference relations projected through a parallel corpus from English. The resolver operates on the level of deep syntax and makes use of multiple specialized models. It achieves 32 and 22 points in terms of CoNLL score for Russian and German, respectively. Analysis of the evaluation results shows that the resolver for Russian is able to preserve 66% of the English resolver’s quality in terms of CoNLL score. The system was submitted to the Closed track of the CORBON 2017 Shared task.

2016

Coreference in Prague Czech-English Dependency Treebank
Anna Nedoluzhko | Michal Novák | Silvie Cinková | Marie Mikulová | Jiří Mírovský
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present coreference annotation on parallel Czech-English texts of the Prague Czech-English Dependency Treebank (PCEDT). The paper describes innovations made to PCEDT 2.0 concerning coreference, as well as coreference information already present there. We characterize the coreference annotation scheme, give the statistics and compare our annotation with the coreference annotation in OntoNotes and Prague Dependency Treebank for Czech. We also present the experiments made using this corpus to improve the alignment of coreferential expressions, which helps us to collect better statistics of correspondences between types of coreferential relations in Czech and English. The corpus, released as PCEDT 2.0 Coref, is publicly available.

Dictionary-based Domain Adaptation of MT Systems without Retraining
Rudolf Rosa | Roman Sudarikov | Michal Novák | Martin Popel | Ondřej Bojar
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Pronoun Prediction with Linguistic Features and Example Weighing
Michal Novák
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

Comparison of Coreference Resolvers for Deep Syntax Translation
Michal Novák | Dieke Oele | Gertjan van Noord
Proceedings of the Second Workshop on Discourse in Machine Translation

New Language Pairs in TectoMT
Ondřej Dušek | Luís Gomes | Michal Novák | Martin Popel | Rudolf Rosa
Proceedings of the Tenth Workshop on Statistical Machine Translation

Translation Model Interpolation for Domain Adaptation in TectoMT
Rudolf Rosa | Ondřej Dušek | Michal Novák | Martin Popel
Proceedings of the 1st Deep Machine Translation Workshop

2014

Machine Translation of Medical Texts in the Khresmoi Project
Ondřej Dušek | Jan Hajič | Jaroslava Hlaváčová | Michal Novák | Pavel Pecina | Rudolf Rosa | Aleš Tamchyna | Zdeňka Urešová | Daniel Zeman
Proceedings of the Ninth Workshop on Statistical Machine Translation

Cross-lingual Coreference Resolution of Pronouns
Michal Novák | Zdeněk Žabokrtský
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Two Case Studies on Translating Pronouns in a Deep Syntax Framework
Michal Novák | Zdeněk Žabokrtský | Anna Nedoluzhko
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Translation of “It” in a Deep Syntax Framework
Michal Novák | Anna Nedoluzhko | Zdeněk Žabokrtský
Proceedings of the Workshop on Discourse in Machine Translation

2012

The Joy of Parallelism with CzEng 1.0
Ondřej Bojar | Zdeněk Žabokrtský | Ondřej Dušek | Petra Galuščáková | Martin Majliš | David Mareček | Jiří Maršík | Michal Novák | Martin Popel | Aleš Tamchyna
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the amount of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees and automatic co-reference links in both English and Czech. This paper describes key properties of the released resource including the distribution of text domains, the corpus data formats, and a toolkit to handle the provided rich annotation. We also summarize the procedure of the rich annotation (incl. co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.

Formemes in English-Czech Deep Syntactic MT
Ondřej Dušek | Zdeněk Žabokrtský | Martin Popel | Martin Majliš | Michal Novák | David Mareček
Proceedings of the Seventh Workshop on Statistical Machine Translation