Joachim Van den Bogaert


2020

Being Generous with Sub-Words towards Small NMT Children
Arne Defauw | Tom Vanallemeersch | Koen Van Winckel | Sara Szoc | Joachim Van den Bogaert
Proceedings of the 12th Language Resources and Evaluation Conference

In the context of under-resourced neural machine translation (NMT), transfer learning from an NMT model trained on a high-resource language pair, or from a multilingual NMT (M-NMT) model, has been shown to boost performance to a large extent. In this paper, we focus on so-called cold start transfer learning from an M-NMT model, which means that the parent model is not trained on any of the child data. Such a set-up enables quick adaptation of M-NMT models to new languages. We investigate the effectiveness of cold start transfer learning from a many-to-many M-NMT model to an under-resourced child. We show that sufficiently large sub-word vocabularies should be used for transfer learning to be effective in such a scenario. When adopting relatively large sub-word vocabularies, we observe increases in performance thanks to transfer learning from a parent M-NMT model, both when translating to and from the under-resourced language. Our proposed approach involving dynamic vocabularies is both practical and effective. We report results on two under-resourced language pairs, i.e. Icelandic-English and Irish-English.

A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?
Julia Ive | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Joachim Van den Bogaert | Eduardo Farah | Christine Maroti | Artur Ventura | Maxim Khalilov
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of such translation by a professional translator, and - where available - the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis suggests that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.

APE-QUEST: an MT Quality Gate
Heidi Depraetere | Joachim Van den Bogaert | Sara Szoc | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The APE-QUEST project (2018-2020) sets up a quality gate and crowdsourcing workflow for the eTranslation system of EC’s Connecting Europe Facility to improve translation quality in specific domains. It packages these services as a translation portal for machine-to-machine and machine-to-human scenarios.

MICE: a middleware layer for MT
Joachim Van den Bogaert | Tom Vanallemeersch | Heidi Depraetere
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The MICE project (2018-2020) will deliver a middleware layer for improving the output quality of the eTranslation system of EC’s Connecting Europe Facility through additional services, such as domain adaptation and named entity recognition. It will also deliver a user portal, allowing for human post-editing.

OCR, Classification & Machine Translation (OCCAM)
Joachim Van den Bogaert | Arne Defauw | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch | Pavel Smrž | Michal Hradiš
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The OCCAM project (Optical Character recognition, ClassificAtion & Machine Translation) aims at integrating the CEF (Connecting Europe Facility) Automated Translation service with image classification, Translation Memories (TMs), Optical Character Recognition (OCR), and Machine Translation (MT). It will support the automated translation of scanned business documents (a document format that, currently, cannot be processed by the CEF eTranslation service) and will also lead to a tool useful for the Digital Humanities domain.

CEFAT4Cities, a Natural Language Layer for the ISA2 Core Public Service Vocabulary
Joachim Van den Bogaert | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The CEFAT4Cities project (2020-2022) will create a “Smart Cities natural language context” (a software layer that facilitates the conversion of natural-language administrative procedures into machine-readable data sets) on top of the existing ISA2 interoperability layer for public services. Integration with the FIWARE/ORION “Smart City” Context Broker will make existing, paper-based public services discoverable through “Smart City” frameworks, thus allowing for the development of more sophisticated and more user-friendly public services applications. An automated translation component will be included to provide a solution that can be used by all EU Member States. As a result, the project will allow EU citizens and businesses to interact with public services at the city, national, regional and EU level in their own language.

2019

APE-QUEST
Joachim Van den Bogaert | Heidi Depraetere | Sara Szoc | Tom Vanallemeersch | Koen Van Winckel | Frederic Everaert | Lucia Specia | Julia Ive | Maxim Khalilov | Christine Maroti | Eduardo Farah | Artur Ventura
Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks

MICE
Joachim Van den Bogaert | Heidi Depraetere | Tom Vanallemeersch | Frederic Everaert | Koen Van Winckel | Katri Tammsaar | Ingmar Vali | Tambet Artma | Piret Saartee | Laura Katariina Teder | Artūrs Vasiļevskis | Valters Sics | Johan Haelterman | David Bienfait
Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks

Collecting domain specific data for MT: an evaluation of the ParaCrawl pipeline
Arne Defauw | Tom Vanallemeersch | Sara Szoc | Frederic Everaert | Koen Van Winckel | Kim Scholte | Joris Brabers | Joachim Van den Bogaert
Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks

Developing a Neural Machine Translation system for Irish
Arne Defauw | Sara Szoc | Tom Vanallemeersch | Anna Bardadym | Joris Brabers | Frederic Everaert | Kim Scholte | Koen Van Winckel | Joachim Van den Bogaert
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages