Mathieu Dehouck


pdf bib
Data Augmentation via Subtree Swapping for Dependency Parsing of Low-Resource Languages
Mathieu Dehouck | Carlos Gómez-Rodríguez
Proceedings of the 28th International Conference on Computational Linguistics

The lack of annotated data is a big issue for building reliable NLP systems for most of the world’s languages. But this problem can be alleviated by automatic data generation. In this paper, we present a new data augmentation method for artificially creating new dependency-annotated sentences. The main idea is to swap subtrees between annotated sentences while enforcing strong constraints on those trees to ensure maximal grammaticality of the new sentences. We also propose a method to perform low-resource experiments using resource-rich languages by mimicking low-resource languages by sampling sentences under a low-resource distribution. In a series of experiments, we show that our newly proposed data augmentation method outperforms previous proposals using the same basic inputs.

pdf bib
Efficient EUD Parsing
Mathieu Dehouck | Mark Anderson | Carlos Gómez-Rodríguez
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

We present the system submission from the FASTPARSE team for the EUD Shared Task at IWPT 2020. We engaged with the task by focusing on efficiency. For this we considered training costs and inference efficiency. Our models are a combination of distilled neural dependency parsers and a rule-based system that projects UD trees into EUD graphs. We obtained an average ELAS of 74.04 for our official submission, ranking 4th overall.


pdf bib
Phylogenic Multi-Lingual Dependency Parsing
Mathieu Dehouck | Pascal Denis
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Languages evolve and diverge over time. Their evolutionary history is often depicted in the shape of a phylogenetic tree. Assuming parsing models are representations of their languages grammars, their evolution should follow a structure similar to that of the phylogenetic tree. In this paper, drawing inspiration from multi-task learning, we make use of the phylogenetic tree to guide the learning of multi-lingual dependency parsers leveraging languages structural similarities. Experiments on data from the Universal Dependency project show that phylogenetic training is beneficial to low resourced languages and to well furnished languages families. As a side product of phylogenetic training, our model is able to perform zero-shot parsing of previously unseen languages.


pdf bib
A Framework for Understanding the Role of Morphology in Universal Dependency Parsing
Mathieu Dehouck | Pascal Denis
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper presents a simple framework for characterizing morphological complexity and how it encodes syntactic information. In particular, we propose a new measure of morpho-syntactic complexity in terms of governor-dependent preferential attachment that explains parsing performance. Through experiments on dependency parsing with data from Universal Dependencies (UD), we show that representations derived from morphological attributes deliver important parsing performance improvements over standard word form embeddings when trained on the same datasets. We also show that the new morpho-syntactic complexity measure is predictive of the gains provided by using morphological attributes over plain forms on parsing scores, making it a tool to distinguish languages using morphology as a syntactic marker from others.


pdf bib
Delexicalized Word Embeddings for Cross-lingual Dependency Parsing
Mathieu Dehouck | Pascal Denis
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper presents a new approach to the problem of cross-lingual dependency parsing, aiming at leveraging training data from different source languages to learn a parser in a target language. Specifically, this approach first constructs word vector representations that exploit structural (i.e., dependency-based) contexts but only considering the morpho-syntactic information associated with each word and its contexts. These delexicalized word embeddings, which can be trained on any set of languages and capture features shared across languages, are then used in combination with standard language-specific features to train a lexicalized parser in the target language. We evaluate our approach through experiments on a set of eight different languages that are part the Universal Dependencies Project. Our main results show that using such delexicalized embeddings, either trained in a monolingual or multilingual fashion, achieves significant improvements over monolingual baselines.