MEDLINE as a Parallel Corpus: a Survey to Gain Insight on French-, Spanish- and Portuguese-speaking Authors’ Abstract Writing Practice

Aurélie Névéol, Antonio Jimeno Yepes, Mariana Neves


Abstract
Background: Parallel corpora are used to train and evaluate machine translation systems. To alleviate the cost of producing parallel resources for evaluation campaigns, existing corpora are leveraged. However, little information may be available about the methods used to produce a corpus, including translation direction. Objective: To gain insight into the MEDLINE parallel corpus used in the biomedical task at the Workshop on Machine Translation in 2019 (WMT 2019). Material and Methods: Contact information for the authors of MEDLINE articles included in the English/Spanish (EN/ES), English/French (EN/FR), and English/Portuguese (EN/PT) WMT 2019 test sets was obtained from PubMed and publisher websites. The authors were asked about their abstract writing practices in a survey. Results: The response rate was above 20%. Authors reported that they are mainly native speakers of languages other than English. Although manual translation, sometimes via professional translation services, was commonly used for abstract translation, authors of articles in the EN/ES and EN/PT sets also relied on post-edited machine translation. Discussion: This study provides a characterization of MEDLINE authors’ language skills and abstract writing practices. Conclusion: The information collected in this study will be used to inform test set design for the next WMT biomedical task.
Anthology ID:
2020.lrec-1.453
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
COLING | LREC
Publisher:
European Language Resources Association
Pages:
3676–3682
Language:
English
URL:
https://www.aclweb.org/anthology/2020.lrec-1.453
PDF:
http://aclanthology.lst.uni-saarland.de/2020.lrec-1.453.pdf