David Mareček


2020

Universal Dependencies According to BERT: Both More Specific and More General
Tomasz Limisiewicz | David Mareček | Rudolf Rosa
Findings of the Association for Computational Linguistics: EMNLP 2020

This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions. Previous work showed that individual BERT heads tend to encode particular dependency relation types. We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not match one-to-one. We suggest a method for relation identification and syntactic tree construction. Our approach produces significantly more consistent dependency trees than previous work, showing that it better explains the syntactic abstractions in BERT. At the same time, it can be successfully applied with only a minimal amount of supervision and generalizes well across languages.

2019

Derivational Morphological Relations in Word Embeddings
Tomáš Musil | Jonáš Vidra | David Mareček
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Derivation is a type of word-formation process which creates new words from existing ones by adding, changing, or deleting affixes. In this paper, we explore the potential of word embeddings to identify properties of word derivations in the morphologically rich Czech language. We extract derivational relations between pairs of words from DeriNet, a Czech lexical network which organizes almost one million Czech lemmas into derivational trees. For each such pair, we compute the difference of the embeddings of the two words and perform unsupervised clustering of the resulting vectors. Our results show that these clusters largely match manually annotated semantic categories of the derivational relations (e.g. the relation ‘bake–baker’ belongs to category ‘actor’, and a correct clustering puts it into the same cluster as ‘govern–governor’).
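
The idea of clustering embedding differences can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-dimensional vectors are invented, English words stand in for the Czech lemmas from DeriNet, and plain k-means is used only as a stand-in because the abstract does not name the clustering algorithm.

```python
# Toy 2-D "embeddings"; the numbers are invented for illustration only.
EMBEDDINGS = {
    "bake": (1.0, 0.0), "baker": (1.0, 1.0),
    "quick": (0.0, 2.0), "quickly": (1.0, 2.0),
    "govern": (3.0, 0.5), "governor": (3.1, 1.4),
    "slow": (2.0, 3.0), "slowly": (2.9, 3.1),
}
# Hypothetical (base, derived) pairs standing in for DeriNet relations.
PAIRS = [("bake", "baker"), ("quick", "quickly"),
         ("govern", "governor"), ("slow", "slowly")]


def difference(base, derived):
    """Embedding of the derived word minus embedding of the base word."""
    return tuple(d - b for b, d in zip(EMBEDDINGS[base], EMBEDDINGS[derived]))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic start (the first k points)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels


vectors = [difference(base, derived) for base, derived in PAIRS]
labels = kmeans(vectors, k=2)
for (base, derived), label in zip(PAIRS, labels):
    print(f"{base} -> {derived}: cluster {label}")
```

In this toy setting the two agent-noun pairs share one cluster and the two adverb pairs share the other, which mirrors the kind of match with semantic categories that the abstract reports.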

From Balustrades to Pierre Vinken: Looking for Syntax in Transformer Self-Attentions
David Mareček | Rudolf Rosa
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We inspect the multi-head self-attention in Transformer NMT encoders for three source languages, looking for patterns that could have a syntactic interpretation. In many of the attention heads, we frequently find sequences of consecutive states attending to the same position, which resemble syntactic phrases. We propose a transparent deterministic method of quantifying the amount of syntactic information present in the self-attentions, based on automatically building and evaluating phrase-structure trees from the phrase-like sequences. We compare the resulting trees to existing constituency treebanks, both manually and by computing precision and recall.
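
As a rough illustration of the phrase-like sequences mentioned above, the sketch below finds maximal runs of consecutive positions whose strongest attention falls on the same position. It is not the tree-building method of the paper; the function names, the `min_length` parameter, and the toy attention matrix are all hypothetical.

```python
def argmax_targets(attention):
    """For each row (one state), return the index it attends to most."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention]


def phrase_like_runs(attended, min_length=2):
    """Return (start, end, target) for every maximal run of consecutive
    positions that attend to the same target; `end` is inclusive."""
    runs = []
    start = 0
    for i in range(1, len(attended) + 1):
        # Close the current run at the end of the input or when the target changes.
        if i == len(attended) or attended[i] != attended[start]:
            if i - start >= min_length:
                runs.append((start, i - 1, attended[start]))
            start = i
    return runs


# Toy attention weights for one head over a three-token sentence (invented).
attention = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
]
targets = argmax_targets(attention)
print(targets, phrase_like_runs(targets))
```

Taking only the single strongest target per state is a simplification chosen here to keep the example short.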

2018

CUNI x-ling: Parsing Under-Resourced Languages in CoNLL 2018 UD Shared Task
Rudolf Rosa | David Mareček
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This is a system description paper for the CUNI x-ling submission to the CoNLL 2018 UD Shared Task. We focused on parsing under-resourced languages, with no or little training data available. We employed a wide range of approaches, including simple word-based treebank translation, combination of delexicalized parsers, and exploitation of available morphological dictionaries, with a dedicated setup tailored to each of the languages. In the official evaluation, our submission was identified as the clear winner of the Low-resource languages category.

Extracting Syntactic Trees from Transformer Encoder Self-Attentions
David Mareček | Rudolf Rosa
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

This is work in progress on extracting sentence tree structures from the encoder’s self-attention weights when translating into another language with the Transformer neural network architecture. We visualize the structures and discuss their characteristics with respect to existing syntactic theories and annotations.

Input Combination Strategies for Multi-Source Transformer Decoder
Jindřich Libovický | Jindřich Helcl | David Mareček
Proceedings of the Third Conference on Machine Translation: Research Papers

In multi-source sequence-to-sequence tasks, the attention mechanism can be modeled in several ways. This topic has been thoroughly studied on recurrent architectures. In this paper, we extend the previous work to the encoder-decoder attention in the Transformer architecture. We propose four different input combination strategies for the encoder-decoder attention: serial, parallel, flat, and hierarchical. We evaluate our methods on tasks of multimodal translation and translation with multiple source languages. The experiments show that the models are able to use multiple sources and improve over single source baselines.

2017

Slavic Forest, Norwegian Wood
Rudolf Rosa | Daniel Zeman | David Mareček | Zdeněk Žabokrtský
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We once had a corp, or should we say, it once had us
They showed us its tags, isn’t it great, unified tags
They asked us to parse and they told us to use everything
So we looked around and we noticed there was near nothing
We took other langs, bitext aligned: words one-to-one
We played for two weeks, and then they said, here is the test
The parser kept training till morning, just until deadline
So we had to wait and hope what we get would be just fine
And, when we awoke, the results were done, we saw we’d won
So, we wrote this paper, isn’t it good, Norwegian wood.

Communication with Robots using Multilayer Recurrent Networks
Bedřich Pišl | David Mareček
Proceedings of the First Workshop on Language Grounding for Robotics

In this paper, we describe an improvement on the task of giving instructions to robots in a simulated block world using unrestricted natural language commands.

CUNI submission in WMT17: Chimera goes neural
Roman Sudarikov | David Mareček | Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the Second Conference on Machine Translation

CUNI Experiments for WMT17 Metrics Task
David Mareček | Ondřej Bojar | Ondřej Hübsch | Rudolf Rosa | Dušan Variš
Proceedings of the Second Conference on Machine Translation

2016

If You Even Don’t Have a Bit of Bible: Learning Delexicalized POS Taggers
Zhiwei Yu | David Mareček | Zdeněk Žabokrtský | Daniel Zeman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Various unsupervised and semi-supervised methods have been proposed to tag an unseen language. However, many of them require some partial understanding of the target language because they rely on dictionaries or parallel corpora such as the Bible. In this paper, we propose a different method named delexicalized tagging, for which we only need a raw corpus of the target language. We transfer tagging models trained on annotated corpora of one or more resource-rich languages. We employ language-independent features such as word length, frequency, neighborhood entropy, character classes (alphabetic vs. numeric vs. punctuation), etc. We demonstrate that such features can, to a certain extent, serve as predictors of the part of speech, represented by the universal POS tag.
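
The language-independent features listed above can be sketched from a raw token sequence alone. This is only an illustration under stated assumptions: the function names are hypothetical, and "neighborhood entropy" is taken here to mean the entropy of the distribution of the immediately following word, which may differ from the definition used in the paper.

```python
import math
from collections import Counter, defaultdict


def char_class(token):
    """Coarse character class of a token."""
    if token.isalpha():
        return "alphabetic"
    if token.isdigit():
        return "numeric"
    if all(not ch.isalnum() for ch in token):
        return "punctuation"
    return "mixed"


def next_word_entropy(tokens):
    """Entropy (in bits) of the next-word distribution for each word type.
    This is an assumed reading of 'neighborhood entropy'."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    entropy = {}
    for word, counts in following.items():
        total = sum(counts.values())
        entropy[word] = -sum((c / total) * math.log2(c / total)
                             for c in counts.values())
    return entropy


def delexicalized_features(tokens):
    """Map each word type to features that do not depend on its identity."""
    frequency = Counter(tokens)
    entropy = next_word_entropy(tokens)
    return {
        word: {
            "length": len(word),
            "relative_frequency": frequency[word] / len(tokens),
            "next_word_entropy": entropy.get(word, 0.0),
            "char_class": char_class(word),
        }
        for word in frequency
    }


tokens = "the cat saw the dog in 2016 .".split()
features = delexicalized_features(tokens)
print(features["the"])
print(features["2016"])
```

Because none of these features mention the word form itself, a tagger trained on them in a resource-rich language can in principle be applied to a raw corpus of another language, which is the transfer setting the abstract describes.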

Merged bilingual trees based on Universal Dependencies in Machine Translation
David Mareček
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Moses & Treex Hybrid MT Systems Bestiary
Rudolf Rosa | Martin Popel | Ondřej Bojar | David Mareček | Ondřej Dušek
Proceedings of the 2nd Deep Machine Translation Workshop

Planting Trees in the Desert: Delexicalized Tagging and Parsing Combined
Daniel Zeman | David Mareček | Zhiwei Yu | Zdeněk Žabokrtský
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

2014

HamleDT 2.0: Thirty Dependency Treebanks Stanfordized
Rudolf Rosa | Jan Mašek | David Mareček | Martin Popel | Daniel Zeman | Zdeněk Žabokrtský
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank). HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that has become popular in recent years. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both annotation styles, including the adjustments that were necessary, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion. We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in the future. We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.

2013

Stop-probability estimates computed on a large corpus improve Unsupervised Dependency Parsing
David Mareček | Milan Straka
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Coordination Structures in Dependency Treebanks
Martin Popel | David Mareček | Jan Štěpánek | Daniel Zeman | Zdeněk Žabokrtský
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
Rudolf Rosa | David Mareček | Aleš Tamchyna
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

2012

HamleDT: To Parse or Not to Parse?
Daniel Zeman | David Mareček | Martin Popel | Loganathan Ramasamy | Jan Štěpánek | Zdeněk Žabokrtský | Jan Hajič
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We propose HamleDT (HArmonized Multi-LanguagE Dependency Treebank). HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. While the license terms prevent us from directly redistributing the corpora, most of them are easy to acquire for research purposes. What we provide instead is the software that normalizes tree structures in the data obtained by the user from their original providers.

The Joy of Parallelism with CzEng 1.0
Ondřej Bojar | Zdeněk Žabokrtský | Ondřej Dušek | Petra Galuščáková | Martin Majliš | David Mareček | Jiří Maršík | Michal Novák | Martin Popel | Aleš Tamchyna
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the amount of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees and automatic co-reference links in both English and Czech. This paper describes key properties of the released resource including the distribution of text domains, the corpus data formats, and a toolkit to handle the provided rich annotation. We also summarize the procedure of the rich annotation (incl. co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.

Unsupervised Dependency Parsing using Reducibility and Fertility features
David Mareček | Zdeněk Žabokrtský
Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure

Formemes in English-Czech Deep Syntactic MT
Ondřej Dušek | Zdeněk Žabokrtský | Martin Popel | Martin Majliš | Michal Novák | David Mareček
Proceedings of the Seventh Workshop on Statistical Machine Translation

DEPFIX: A System for Automatic Correction of Czech MT Outputs
Rudolf Rosa | David Mareček | Ondřej Dušek
Proceedings of the Seventh Workshop on Statistical Machine Translation

Using Parallel Features in Parsing of Machine-Translated Sentences for Correction of Grammatical Errors
Rudolf Rosa | Ondřej Dušek | David Mareček | Martin Popel
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Exploiting Reducibility in Unsupervised Dependency Parsing
David Mareček | Zdeněk Žabokrtský
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Two-step translation with grammatical post-processing
David Mareček | Rudolf Rosa | Petra Galuščáková | Ondřej Bojar
Proceedings of the Sixth Workshop on Statistical Machine Translation

Influence of Parser Choice on Dependency-Based MT
Martin Popel | David Mareček | Nathan Green | Zdeněk Žabokrtský
Proceedings of the Sixth Workshop on Statistical Machine Translation

Gibbs Sampling with Treeness Constraint in Unsupervised Dependency Parsing
David Mareček | Zdeněk Žabokrtský
Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing

2010

Maximum Entropy Translation Model in Dependency-Based MT Framework
Zdeněk Žabokrtský | Martin Popel | David Mareček
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Tackling Sparse Data Issue in Machine Translation Evaluation
Ondřej Bojar | Kamil Kos | David Mareček
Proceedings of the ACL 2010 Conference Short Papers

2009

English-Czech MT in 2008
Ondřej Bojar | David Mareček | Václav Novák | Martin Popel | Jan Ptáček | Jan Rouš | Zdeněk Žabokrtský
Proceedings of the Fourth Workshop on Statistical Machine Translation

Converting Russian Treebank SynTagRus into Praguian PDT Style
David Mareček | Natalia Kljueva
Proceedings of the Workshop Multilingual resources, technologies and evaluation for central and Eastern European languages

2008

Automatic alignment of Czech and English deep syntactic dependency trees
David Mareček | Zdeněk Žabokrtský | Václav Novák
Proceedings of the 12th Annual conference of the European Association for Machine Translation