Felipe Soares


2020

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts
Felipe Soares | Mark Stevenson | Diego Bartolome | Anna Zaretskaya
Proceedings of the 12th Language Resources and Evaluation Conference

Google Patents is one of the most important sources of patent information. A striking characteristic is that many of its abstracts are presented in more than one language, making it a potential source of parallel corpora. This article presents the development of a parallel corpus from the open access Google Patents dataset in 74 language pairs, comprising more than 68 million sentences and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm for the 22 largest language pairs, while the others were aligned at the abstract (i.e. paragraph) level. We demonstrate the capabilities of our corpus by training Neural Machine Translation (NMT) models for the 9 main language pairs, for a total of 18 models. Our parallel corpus is freely available in TSV format and as a SQLite database, with complementary information regarding patent metadata.

QE Viewer: an Open-Source Tool for Visualization of Machine Translation Quality Estimation Results
Felipe Soares | Anna Zaretskaya | Diego Bartolome
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

QE Viewer is a web-based tool for visualizing results of a Machine Translation Quality Estimation (QE) system. It allows users to see information on the predicted post-editing distance (PED) for a given file or sentence, and highlighted words that were predicted to contain MT errors. The tool can be used in a variety of academic, educational and commercial scenarios.

2019

Medical Word Embeddings for Spanish: Development and Evaluation
Felipe Soares | Marta Villegas | Aitor Gonzalez-Agirre | Martin Krallinger | Jordi Armengol-Estapé
Proceedings of the 2nd Clinical Natural Language Processing Workshop

Word embeddings are representations of words in a dense vector space. Although they are not a recent phenomenon in Natural Language Processing (NLP), they have gained momentum after the recent developments of neural methods and Word2Vec. In medical and clinical NLP, they are invaluable resources for training in-domain named entity recognition systems, classifiers or taggers, for instance. Thus, the development of tailored word embeddings for medical NLP is of great interest. However, we identified a gap in the literature which we aim to fill in this paper: the availability of embeddings for medical NLP in Spanish, as well as a standardized form of intrinsic evaluation. Since most work has been done for English, some established datasets for intrinsic evaluation are already available. In this paper, we show the steps we employed to adapt such datasets for the first time to Spanish, which is of particular relevance due to the considerable volume of EHRs in this language, as well as the creation of in-domain medical word embeddings for Spanish using the state-of-the-art FastText model. We performed intrinsic evaluation with our adapted datasets, as well as extrinsic evaluation with a named entity recognition system, using a general-domain embedding as a baseline. Both experiments showed that our embeddings are suitable for use in medical NLP in the Spanish language, and are more accurate than general-domain ones.

Findings of the WMT 2019 Biomedical Translation Shared Task: Evaluation for MEDLINE Abstracts and Biomedical Terminologies
Rachel Bawden | Kevin Bretonnel Cohen | Cristian Grozea | Antonio Jimeno Yepes | Madeleine Kittner | Martin Krallinger | Nancy Mah | Aurelie Neveol | Mariana Neves | Felipe Soares | Amy Siu | Karin Verspoor | Maika Vicente Navarro
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In the fourth edition of the WMT Biomedical Translation task, we considered a total of six languages, namely Chinese (zh), English (en), French (fr), German (de), Portuguese (pt), and Spanish (es). We performed an evaluation of automatic translations for a total of 10 language directions, namely, zh/en, en/zh, fr/en, en/fr, de/en, en/de, pt/en, en/pt, es/en, and en/es. We provided training data based on MEDLINE abstracts for eight of the 10 language pairs and test sets for all of them. In addition to that, we offered a new sub-task for the translation of terms in biomedical terminologies for the en/es language direction. Higher BLEU scores (close to 0.5) were obtained for the es/en, en/es and en/pt test sets, as well as for the terminology sub-task. After manual validation of the primary runs, some submissions were judged to be better than the reference translations, for instance, for de/en, en/es and es/en.

BSC Participation in the WMT Translation of Biomedical Abstracts
Felipe Soares | Martin Krallinger
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes the machine translation systems developed by the Barcelona Supercomputing Center (BSC) team for the biomedical translation shared task of WMT19. Our system is based on Neural Machine Translation using the OpenNMT-py toolkit and the Transformer architecture. We participated in four translation directions for the English/Spanish and English/Portuguese language pairs. To create our training data, we concatenated several parallel corpora, both from in-domain and out-of-domain sources, as well as terminological resources from UMLS.

2018

A Large Parallel Corpus of Full-Text Scientific Articles
Felipe Soares | Viviane Moreira | Karin Becker
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

UFRGS Participation on the WMT Biomedical Translation Shared Task
Felipe Soares | Karin Becker
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the machine translation systems developed by the Universidade Federal do Rio Grande do Sul (UFRGS) team for the biomedical translation shared task. Our systems are based on statistical machine translation and neural machine translation, using the Moses and OpenNMT toolkits, respectively. We participated in four translation directions for the English/Spanish and English/Portuguese language pairs. To create our training data, we concatenated several parallel corpora, both from in-domain and out-of-domain sources, as well as terminological resources from UMLS. Our systems achieved the best BLEU scores according to the official shared task evaluation.