Martin Volk


2020

How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR
Phillip Benjamin Ströbel | Simon Clematide | Martin Volk
Proceedings of the 12th Language Resources and Evaluation Conference

Recent advances in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) have led to more accurate text recognition of historical documents. The Digital Humanities heavily profit from these developments, but they still struggle when choosing from the plethora of OCR systems available on the one hand and when defining workflows for their projects on the other hand. In this work, we present our approach to build a ground truth for a historical German-language newspaper published in black letter. We also report how we used it to systematically evaluate the performance of different OCR engines. Additionally, we used this ground truth to make an informed estimate as to how much data is necessary to achieve high-quality OCR results. The outcomes of our experiments show that HTR architectures can successfully recognise black letter text and that a ground truth size of 50 newspaper pages suffices to achieve good OCR accuracy. Moreover, our models perform equally well on data they have not seen during training, which means that additional manual correction for diverging data is superfluous.

Benchmarking Data-driven Automatic Text Simplification for German
Andreas Säuberli | Sarah Ebling | Martin Volk
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Automatic text simplification is an active research area, and there are first systems for English, Spanish, Portuguese, and Italian. For German, no data-driven approach exists to this date, due to a lack of training data. In this paper, we present a parallel corpus of news items in German with corresponding simplifications on two complexity levels. The simplifications have been produced according to a well-documented set of guidelines. We then report on experiments in automatically simplifying the German news items using state-of-the-art neural machine translation techniques. We demonstrate that despite our small parallel corpus, our neural models were able to learn essential features of simplified language, such as lexical substitutions, deletion of less relevant words and phrases, and sentence shortening.

2019

Post-editing Productivity with Neural Machine Translation: An Empirical Assessment of Speed and Quality in the Banking and Finance Domain
Samuel Läubli | Chantal Amrhein | Patrick Düggelin | Beatriz Gonzalez | Alena Zwahlen | Martin Volk
Proceedings of Machine Translation Summit XVII Volume 1: Research Track

Geotagging a Diachronic Corpus of Alpine Texts: Comparing Distinct Approaches to Toponym Recognition
Tannon Kew | Anastassia Shaitarova | Isabel Meraner | Janis Goldzycher | Simon Clematide | Martin Volk
Proceedings of the Workshop on Language Technology for Digital Historical Archives

Geotagging historic and cultural texts provides valuable access to heritage data, enabling location-based searching and new geographically related discoveries. In this paper, we describe two distinct approaches to geotagging a variety of fine-grained toponyms in a diachronic corpus of alpine texts. By applying a traditional gazetteer-based approach, aided by a few simple heuristics, we attain strong high-precision annotations. Using the output of this earlier system, we adopt a state-of-the-art neural approach in order to facilitate the detection of new toponyms on the basis of context. Additionally, we present the results of preliminary experiments on integrating a small amount of crowdsourced annotations to improve overall performance of toponym recognition in our heritage corpus.

2018

Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation
Samuel Läubli | Rico Sennrich | Martin Volk
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recent research suggests that neural machine translation achieves parity with professional human translation on the WMT Chinese–English news translation task. We empirically test this claim with alternative evaluation protocols, contrasting the evaluation of single sentences and entire documents. In a pairwise ranking experiment, human raters assessing adequacy and fluency show a stronger preference for human over machine translation when evaluating documents as compared to isolated sentences. Our findings emphasise the need to shift towards document-level evaluation as machine translation improves to the degree that errors which are hard or impossible to spot at the sentence level become decisive in discriminating quality of different translation outputs.

2017

Multilingwis² – Explore Your Parallel Corpus
Johannes Graën | Dominique Sandoz | Martin Volk
Proceedings of the 21st Nordic Conference on Computational Linguistics

2016

Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus
Simon Clematide | Lenz Furrer | Martin Volk
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Crowdsourcing approaches for post-correction of OCR output (Optical Character Recognition) have been successfully applied to several historic text collections. We report on our crowd-correction platform Kokos, which we built to improve the OCR quality of the digitized yearbooks of the Swiss Alpine Club (SAC) from the 19th century. This multilingual heritage corpus consists of Alpine texts mainly written in German and French, all typeset in Antiqua font. Finding and engaging volunteers for correcting large amounts of pages into high quality text requires a carefully designed user interface, an easy-to-use workflow, and continuous efforts for keeping the participants motivated. More than 180,000 characters on about 21,000 pages were corrected by volunteers in about 7 months, achieving an OCR gold standard with a systematically evaluated accuracy of 99.7% on the word level. The crowdsourced OCR gold standard and the corresponding original OCR recognition results from ABBYY FineReader 7 for each page are available as a resource. Additionally, the scanned images (300dpi) of all pages are included in order to facilitate tests with other OCR software.

2015

Detecting Document-level Context Triggers to Resolve Translation Ambiguity
Laura Mascarell | Mark Fishel | Martin Volk
Proceedings of the Second Workshop on Discourse in Machine Translation

Pre-reordering for Statistical Machine Translation of Non-fictional Subtitles
Magdalena Plamadă | Gion Linder | Phillip Ströbel | Martin Volk
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Leveraging Compounds to Improve Noun Phrase Translation from Chinese and German
Xiao Pu | Laura Mascarell | Andrei Popescu-Belis | Mark Fishel | Ngoc-Quang Luong | Martin Volk
Proceedings of the ACL-IJCNLP 2015 Student Research Workshop

2014

Detecting Code-Switching in a Multilingual Alpine Heritage Corpus
Martin Volk | Simon Clematide
Proceedings of the First Workshop on Computational Approaches to Code Switching

Machine Translation for Subtitling: A Large-Scale Evaluation
Thierry Etchegoyhen | Lindsay Bywood | Mark Fishel | Panayota Georgakopoulou | Jie Jiang | Gerard van Loenhout | Arantza del Pozo | Mirjam Sepesy Maučec | Anja Turner | Martin Volk
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article describes a large-scale evaluation of the use of Statistical Machine Translation for professional subtitling. The work was carried out within the FP7 EU-funded project SUMAT and involved two rounds of evaluation: a quality evaluation and a measure of productivity gain/loss. We present the SMT systems built for the project and the corpora they were trained on, which combine professionally created and crowd-sourced data. Evaluation goals, methodology and results are presented for the eleven translation pairs that were evaluated by professional subtitlers. Overall, a majority of the machine translated subtitles received good quality ratings. The results were also positive in terms of productivity, with a global gain approaching 40%. We also evaluated the impact of applying quality estimation and filtering of poor MT output, which resulted in higher productivity gains for filtered files as opposed to fully machine-translated files. Finally, we present and discuss feedback from the subtitlers who participated in the evaluation, a key aspect for any eventual adoption of machine translation technology in professional subtitling.

Innovations in Parallel Corpus Search Tools
Martin Volk | Johannes Graën | Elena Callegaro
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Recent years have seen an increased interest in and availability of parallel corpora. Large corpora from international organizations (e.g. European Union, United Nations, European Patent Office), or from multilingual Internet sites (e.g. OpenSubtitles) are now easily available and are used for statistical machine translation but also for online search by different user groups. This paper gives an overview of different usages and different types of search systems. In the past, parallel corpus search systems were based on sentence-aligned corpora. We argue that automatic word alignment allows for major innovations in searching parallel corpora. Some online query systems already employ word alignment for sorting translation variants, but none supports the full query functionality that has been developed for parallel treebanks. We propose to develop such a system for efficiently searching large parallel corpora with a powerful query language.

2013

Mining for Domain-specific Parallel Text from Wikipedia
Magdalena Plamadă | Martin Volk
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

Building a German/Simple German Parallel Corpus for Automatic Text Simplification
David Klaper | Sarah Ebling | Martin Volk
Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations

Combining Statistical Machine Translation and Translation Memories with Domain Adaptation
Samuel Läubli | Mark Fishel | Martin Volk | Manuela Weibel
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis
Rico Sennrich | Martin Volk | Gerold Schneider
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

From Subtitles to Parallel Corpora
Mark Fishel | Yota Georgakopoulou | Sergio Penkale | Volha Petukhova | Matej Rojc | Martin Volk | Andy Way
Proceedings of the 16th Annual conference of the European Association for Machine Translation

SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles
Volha Petukhova | Rodrigo Agerri | Mark Fishel | Sergio Penkale | Arantza del Pozo | Mirjam Sepesy Maučec | Andy Way | Panayota Georgakopoulou | Martin Volk
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Subtitling and audiovisual translation have been recognized as areas that could greatly benefit from the introduction of Statistical Machine Translation (SMT) followed by post-editing, in order to increase the efficiency of the subtitle production process. The FP7 European project SUMAT (An Online Service for SUbtitling by MAchine Translation: http://www.sumat-project.eu) aims to develop an online subtitle translation service for nine European languages, combined into 14 different language pairs, in order to semi-automate the subtitle translation processes of both freelance translators and subtitling companies on a large scale. In this paper we discuss the data collection and parallel corpus compilation for training SMT systems, which includes several procedures such as data partition, conversion, formatting, normalization and alignment. We discuss in detail each data pre-processing step using various approaches. Apart from the quantity (around 1 million subtitles per language pair), the SUMAT corpus has a number of very important characteristics. First of all, high quality both in terms of translation and in terms of high-precision alignment of parallel documents and their contents has been achieved. Secondly, the contents are provided in one consistent format and encoding. Finally, additional information such as type of content in terms of genres and domain is available.

2011

Reducing OCR Errors in Gothic-Script Documents
Lenz Furrer | Martin Volk
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

Iterative, MT-based Sentence Alignment of Parallel Texts
Rico Sennrich | Martin Volk
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

Disambiguation of English Contractions for Machine Translation of TV Subtitles
Martin Volk | Rico Sennrich
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

Combining Semantic and Syntactic Generalization in Example-Based Machine Translation
Sarah Ebling | Andy Way | Martin Volk | Sudip Kumar Naskar
Proceedings of the 15th Annual conference of the European Association for Machine Translation

2010

Combining Parallel Treebanks and Geo-Tagging
Martin Volk | Anne Goehring | Torsten Marek
Proceedings of the Fourth Linguistic Annotation Workshop

Challenges in Building a Multilingual Alpine Heritage Corpus
Martin Volk | Noah Bubenhofer | Adrian Althaus | Maya Bangerter | Lenz Furrer | Beni Ruef
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes our efforts to build a multilingual heritage corpus of alpine texts. Currently we digitize the yearbooks of the Swiss Alpine Club which contain articles in French, German, Italian and Romansch. Articles comprise mountaineering reports from all corners of the earth, but also scientific topics such as topography, geology or glaciology as well as occasional poetry and lyrics. We have already scanned close to 70,000 pages, which has resulted in a corpus of 25 million words, 10% of which is a parallel French-German corpus. We have solved a number of challenges in automatic language identification and text structure recognition. Our next goal is to identify the great variety of toponyms (e.g. names of mountains and valleys, glaciers and rivers, trails and cabins) in this corpus, and we sketch how a large gazetteer of Swiss topographical names can be exploited for this purpose. Despite the size of the resource, exact matching leads to a low recall because of spelling variations, language mixtures and partial repetitions.

2009

Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles
Christian Hardmeier | Martin Volk
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

Human Judgements in Parallel Treebank Alignment
Martin Volk | Torsten Marek | Yvonne Samuelsson
Coling 2008: Proceedings of the workshop on Human Judgements in Computational Linguistics

2007

A Search Tool for Parallel Treebanks
Martin Volk | Joakim Lundborg | Maël Mettler
Proceedings of the Linguistic Annotation Workshop

Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions
Fintan Costello | John Kelleher | Martin Volk
Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions

Comparing French PP-attachment to English, German and Swedish
Martin Volk | Frida Tidström
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

How Bad is the Problem of PP-Attachment? A Comparison of English, German and Swedish
Martin Volk
Proceedings of the Third ACL-SIGSEM Workshop on Prepositions

XML-based Phrase Alignment in Parallel Treebanks
Martin Volk | Sofia Gustafson-Capková | Joakim Lundborg | Torsten Marek | Yvonne Samuelsson | Frida Tidström
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing

2004

Evaluation Resources for Concept-based Cross-Lingual Information Retrieval in the Medical Domain
Paul Buitelaar | Diana Steffen | Martin Volk | Dominic Widdows | Bogdan Sacaleanu | Špela Vintar | Stanley Peters | Hans Uszkoreit
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Bootstrapping Parallel Treebanks
Martin Volk | Yvonne Samuelsson
Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora

2003

A Cross Language Document Retrieval System Based on Semantic Annotation
Bogdan Sacaleanu | Paul Buitelaar | Martin Volk
Demonstrations

2002

Combining Unsupervised and Supervised Methods for PP Attachment Disambiguation
Martin Volk
COLING 2002: The 19th International Conference on Computational Linguistics

2000

Evaluating Translation Quality as Input to Product Development
Niamh Bohan | Elisabeth Breidt | Martin Volk
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1997

Experiences with the GTU grammar development environment
Martin Volk
Computational Environments for Grammar Development and Linguistic Engineering

Probing the Lexicon in Evaluating Commercial MT Systems
Martin Volk
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

1992

The Role of Testing in Grammar Engineering
Martin Volk
Third Conference on Applied Natural Language Processing

1991

The Logical Structure of English: Computing Semantic Content
Martin Volk
Computational Linguistics, Volume 17, Number 3, September 1991