Mārcis Pinnis

Also published as: Marcis Pinnis


2020

Customized Neural Machine Translation Systems for the Swiss Legal Domain
Rubén Martínez-Domínguez | Matīss Rikters | Artūrs Vasiļevskis | Mārcis Pinnis | Paula Reichenberg
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

A Tale of Eight Countries or the EU Council Presidency Translator in Retrospect
Mārcis Pinnis | Toms Bergmanis | Kristīne Metuzāle | Valters Šics | Artūrs Vasiļevskis | Andrejs Vasiļjevs
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

Neural Translation for the European Union (NTEU) Project
Laurent Bié | Aleix Cerdà-i-Cucó | Hans Degroote | Amando Estela | Mercedes García-Martínez | Manuel Herranz | Alejandro Kohan | Maite Melero | Tony O’Dowd | Sinéad O’Gorman | Mārcis Pinnis | Roberts Rozis | Riccardo Superbo | Artūrs Vasiļevskis
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The Neural Translation for the European Union (NTEU) project aims to build a neural engine farm with all European official language combinations for eTranslation, without the necessity to use a high-resourced language as a pivot. NTEU started in September 2019 and will run until August 2021.

2019

Developing and Orchestrating a Portfolio of Natural Legal Language Processing and Document Curation Services
Georg Rehm | Julián Moreno-Schneider | Jorge Gracia | Artem Revenko | Victor Mireles | Maria Khvalchik | Ilan Kernerman | Andis Lagzdins | Marcis Pinnis | Artus Vasilevskis | Elena Leitner | Jan Milde | Pia Weißenhorn
Proceedings of the Natural Legal Language Processing Workshop 2019

We present a portfolio of natural legal language processing and document curation services currently under development in a collaborative European project. First, we give an overview of the project and the different use cases, while, in the main part of the article, we focus upon the 13 different processing services that are being deployed in different prototype applications using a flexible and scalable microservices architecture. Their orchestration is operationalised using a content and document curation workflow manager.

Tilde’s Machine Translation Systems for WMT 2019
Marcis Pinnis | Rihards Krišlauks | Matīss Rikters
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

The paper describes the development process of Tilde’s NMT systems for the WMT 2019 shared task on news translation. We trained systems for the English-Lithuanian and Lithuanian-English translation directions in constrained and unconstrained tracks. We build upon the best methods of the previous year’s competition and combine them with recent advancements in the field. We also present a new method to ensure source domain adherence in back-translated data. Our systems achieved a shared first place in human evaluation.

Large-scale Machine Translation Evaluation of the iADAATPA Project
Sheila Castilho | Natália Resende | Federico Gaspari | Andy Way | Tony O’Dowd | Marek Mazur | Manuel Herranz | Alex Helle | Gema Ramírez-Sánchez | Víctor Sánchez-Cartagena | Mārcis Pinnis | Valters Šics
Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks

2018

Tilde MT Platform for Developing Client Specific MT Solutions
Mārcis Pinnis | Andrejs Vasiļjevs | Rihards Kalniņš | Roberts Rozis | Raivis Skadiņš | Valters Šics
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Training and Adapting Multilingual NMT for Less-resourced and Morphologically Rich Languages
Matīss Rikters | Mārcis Pinnis | Rihards Krišlauks
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Developing a Neural Machine Translation Service for the 2017-2018 European Union Presidency
Mārcis Pinnis | Rihards Kalnins
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

Tilde’s Machine Translation Systems for WMT 2018
Mārcis Pinnis | Matīss Rikters | Rihards Krišlauks
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The paper describes the development process of Tilde’s NMT systems that were submitted for the WMT 2018 shared task on news translation. We describe the data filtering and pre-processing workflows, the NMT system training architectures, and automatic evaluation results. For the WMT 2018 shared task, we submitted seven systems (both constrained and unconstrained) for English-Estonian and Estonian-English translation directions. The submitted systems were trained using Transformer models.

Tilde’s Parallel Corpus Filtering Methods for WMT 2018
Mārcis Pinnis
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The paper describes parallel corpus filtering methods that reduce the noise in noisy “parallel” corpora from a level where the corpora are not usable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality; well below 10 BLEU points) to a level where the trained systems show decent translation quality (over 20 BLEU points on a 10 million word dataset and up to 30 BLEU points on a 100 million word dataset). The paper also documents Tilde’s submissions to the WMT 2018 shared task on parallel corpus filtering.

2017

NMT or SMT: Case Study of a Narrow-domain English-Latvian Post-editing Project
Inguna Skadiņa | Mārcis Pinnis
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The recent technological shift in machine translation from statistical machine translation (SMT) to neural machine translation (NMT) raises the question of the strengths and weaknesses of NMT. In this paper, we present an analysis of NMT and SMT systems’ outputs from narrow domain English-Latvian MT systems that were trained on a rather small amount of data. We analyze post-edits produced by professional translators and manually annotated errors in these outputs. Analysis of post-edits allowed us to conclude that both approaches are comparably successful, allowing for an increase in translators’ productivity, with the NMT system showing slightly worse results. Through the analysis of annotated errors, we found that NMT translations are more fluent than SMT translations. However, errors related to accuracy, especially mistranslation and omission errors, occur more often in NMT outputs. Word form errors, which reflect the morphological richness of Latvian, are frequent for both systems, but slightly fewer in NMT outputs.

The QT21 Combined Machine Translation System for English to Latvian
Jan-Thorsten Peter | Hermann Ney | Ondřej Bojar | Ngoc-Quan Pham | Jan Niehues | Alex Waibel | Franck Burlot | François Yvon | Mārcis Pinnis | Valters Šics | Jasmijn Bastings | Miguel Rios | Wilker Aziz | Philip Williams | Frédéric Blain | Lucia Specia
Proceedings of the Second Conference on Machine Translation

Tilde’s Machine Translation Systems for WMT 2017
Mārcis Pinnis | Rihards Krišlauks | Toms Miks | Daiga Deksne | Valters Šics
Proceedings of the Second Conference on Machine Translation

2016

Designing a Speech Corpus for the Development and Evaluation of Dictation Systems in Latvian
Mārcis Pinnis | Askars Salimbajevs | Ilze Auziņa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper the authors present a speech corpus designed and created for the development and evaluation of dictation systems in Latvian. The corpus consists of over nine hours of orthographically annotated speech from 30 different speakers. The corpus features spoken commands that are common for dictation systems for text editors. The corpus is evaluated in an automatic speech recognition scenario. Evaluation results in an ASR dictation scenario show that adding the corpus to the acoustic model training data, in combination with language model adaptation, decreases the WER by up to 41.36% relative (or 16.83% absolute) compared to a baseline system without language model adaptation. The contribution of acoustic data augmentation is 12.57% relative (or 3.43% absolute).

The QT21/HimL Combined Machine Translation System
Jan-Thorsten Peter | Tamer Alkhouli | Hermann Ney | Matthias Huck | Fabienne Braune | Alexander Fraser | Aleš Tamchyna | Ondřej Bojar | Barry Haddow | Rico Sennrich | Frédéric Blain | Lucia Specia | Jan Niehues | Alex Waibel | Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Elena Knyazeva | Thomas Lavergne | François Yvon | Mārcis Pinnis | Stella Frank
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

Dynamic Terminology Integration Methods in Statistical Machine Translation
Marcis Pinnis
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Dynamic Terminology Integration Methods in Statistical Machine Translation
Mārcis Pinnis
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

Application of machine translation in localization into low-resourced languages
Raivis Skadiņš | Mārcis Pinnis | Andrejs Vasiļjevs | Inguna Skadiņa | Tomas Hudik
Proceedings of the 17th Annual conference of the European Association for Machine Translation

Terminology localization guidelines for the national scenario
Juris Borzovs | Ilze Ilziņa | Iveta Keiša | Mārcis Pinnis | Andrejs Vasiļjevs
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a set of principles and practical guidelines for terminology work in the national scenario to ensure a harmonized approach to term localization. These linguistic principles and guidelines were elaborated by the Terminology Commission in Latvia in the domain of Information and Communication Technology (ICT). We also present a novel approach to the corpus-based selection and evaluation of the most frequently used terms. Analysis of the terms shows that, in general, localized terms in normative terminology work in Latvia are coined according to these guidelines. We further evaluate how terms included in the database of official terminology are adopted in general use, such as newspaper articles, blogs, forums, and websites. Our evaluation shows that in a non-normative context the official terminology faces strong competition from other variants of localized terms. Conclusions and recommendations from the lexical analysis of localized terms are provided. We hope that the presented guidelines and evaluation approach will be useful to terminology institutions, regulatory authorities and researchers in different countries who are involved in national terminology work.

Designing the Latvian Speech Recognition Corpus
Mārcis Pinnis | Ilze Auziņa | Kārlis Goba
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus design process through analysis of related work on speech corpora creation for different languages. The authors also provide the guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus creation guidelines are general enough to be re-used by other researchers working on speech recognition corpora for other languages. The corpus consists of two parts: an orthographically annotated corpus containing 100 hours of orthographically transcribed audio data and a phonetically annotated corpus containing 4 hours of phonetically transcribed audio data. Metadata files in XML format provide additional details about the speakers, noise levels, speech styles, etc. The speech recognition corpus is phonetically balanced and phonetically rich, and the paper also describes the methodology used to assess its phonetic balance.

Bilingual dictionaries for all EU languages
Ahmet Aker | Monica Paramita | Mārcis Pinnis | Robert Gaizauskas
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Bilingual dictionaries can be automatically generated using the GIZA++ tool. However, these dictionaries contain a lot of noise, which negatively affects the output quality of tools that rely on them. In this work we present three different methods for cleaning noise from automatically generated bilingual dictionaries: LLR, pivot and transliteration based approaches. We have applied these approaches to the GIZA++ dictionaries (dictionaries covering official EU languages) in order to remove noise. Our evaluation showed that all methods help to reduce noise. However, the best performance is achieved using the transliteration based approach. We provide all bilingual dictionaries (the original GIZA++ dictionaries and the cleaned ones) free for download. We also provide the cleaning tools and scripts for free download.

2013

Context Independent Term Mapper for European Languages
Mārcis Pinnis
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
Mārcis Pinnis | Radu Ion | Dan Ştefănescu | Fangzhong Su | Inguna Skadiņa | Andrejs Vasiļjevs | Bogdan Babych
Proceedings of the ACL 2012 System Demonstrations

Collecting and Using Comparable Corpora for Statistical Machine Translation
Inguna Skadiņa | Ahmet Aker | Nikos Mastropavlos | Fangzhong Su | Dan Tufis | Mateja Verlic | Andrejs Vasiļjevs | Bogdan Babych | Paul Clough | Robert Gaizauskas | Nikos Glaros | Monica Lestari Paramita | Mārcis Pinnis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods for improving machine translation systems using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquisition of comparable corpora from the Web and other sources, for evaluation of the comparability of collected corpora, for multi-level alignment of comparable corpora and for extraction of lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of collected corpora in domain-adapted machine translation and real-life applications.

Latvian and Lithuanian Named Entity Recognition with TildeNER
Mārcis Pinnis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper the author presents TildeNER, an open-source, freely available named entity recognition toolkit and the first multi-class named entity recognition system for the Latvian and Lithuanian languages. The system is built upon a supervised conditional random field classifier and features heuristic and statistical refinement methods that improve supervised classification, thus boosting the overall system's performance. The toolkit provides means for named entity recognition model bootstrapping, for named entity tagging of plaintext documents and of pre-processed (morpho-syntactically tagged) tab-separated documents, and for evaluation on test data. The paper presents the design of the system, describes the most important data formats and briefly discusses extension possibilities to different languages. It also gives an evaluation on human-annotated gold standard test corpora for the Latvian and Lithuanian languages as well as a comparative performance analysis against a state-of-the-art English named entity recognition system using parallel and strongly comparable corpora. The author analyzes the annotation process for the Latvian and Lithuanian named entity tagged corpora and the resulting named entity annotated corpora.