Fernando Batista


2020

Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
André Martins | Helena Moniz | Sara Fumega | Bruno Martins | Fernando Batista | Luisa Coheur | Carla Parra | Isabel Trancoso | Marco Turchi | Arianna Bisazza | Joss Moorkens | Ana Guerberof | Mary Nurminen | Lena Marg | Mikel L. Forcada
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

2018

Contractions: To Align or Not to Align, That Is the Question
Anabela Barreiro | Fernando Batista
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing

This paper presents a detailed analysis of the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually on a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed, and a decision was made as to whether each contraction should be maintained or decomposed in the alignment. Decomposition was required in cases where the two concatenated words, i.e., the preposition and the determiner or pronoun, belong to two separate translation alignment pairs (e.g., [no seio de] [a União Europeia] | [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (e.g., [no que diz respeito a] | [with regard to] or [além disso] | [in addition]). A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.

2016

SPA: Web-based Platform for easy Access to Speech Processing Modules
Fernando Batista | Pedro Curto | Isabel Trancoso | Alberto Abad | Jaime Ferreira | Eugénio Ribeiro | Helena Moniz | David Martins de Matos | Ricardo Ribeiro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents SPA, a web-based Speech Analytics platform that integrates several speech processing modules and makes them available through the web. It was developed to facilitate the use of the modules without requiring knowledge of their software dependencies and specific configurations. Besides being accessible through a web browser, the platform also provides a REST API for easy integration with other applications. The platform is flexible and scalable, provides authentication for access restrictions, and was developed with attention to the time and effort needed to provide new services. The platform is still being improved, but it already integrates a considerable number of audio and text processing modules, including automatic transcription, speech disfluency classification, emotion detection, dialog act recognition, age and gender classification, non-nativeness detection, hyper-articulation detection, and two external modules for feature extraction and DTMF detection. This paper describes the SPA architecture, presents the already integrated modules, and provides a detailed description of the ones most recently integrated.

Machine Translation of Non-Contiguous Multiword Units
Anabela Barreiro | Fernando Batista
Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing

2014

Revising the annotation of a Broadcast News corpus: a linguistic approach
Vera Cabarrão | Helena Moniz | Fernando Batista | Ricardo Ribeiro | Nuno Mamede | Hugo Meinedo | Isabel Trancoso | Ana Isabel Mata | David Martins de Matos
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a linguistic revision process of a speech corpus of Portuguese broadcast news, focusing on metadata annotation for rich transcription, and reports on the impact of the new data on the performance of several modules. The revision process consisted mainly of annotating and revising structural metadata events, such as disfluencies and punctuation marks. The resulting revised data is now being used extensively, and was of extreme importance for improving the performance of several modules, especially the punctuation and capitalization modules, but also the speech recognition system and all the subsequent modules. The resulting data has also recently been used in disfluency studies across domains.

OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
Anabela Barreiro | Fernando Batista | Ricardo Ribeiro | Helena Moniz | Isabel Trancoso
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents three sets of OpenLogos resources, namely the English-German, the English-French, and the English-Italian bilingual dictionaries. In addition to the usual information on part-of-speech, gender, and number for nouns, offered by most dictionaries currently available, OpenLogos bilingual dictionaries have some distinctive features that make them unique: they contain cross-language morphological information (inflectional and derivational), semantico-syntactic knowledge, indication of the head word in multiword units, information about whether a source word corresponds to a homograph, and information about verb auxiliaries, alternate words (i.e., predicate or process nouns), causatives, reflexivity, and verb aspect, among others. The paper focuses on the semantico-syntactic knowledge that is important for disambiguation and translation precision. The resources are publicly available at the METANET platform for free use by the research community.

Teenage and adult speech in school context: building and processing a corpus of European Portuguese
Ana Isabel Mata | Helena Moniz | Fernando Batista | Julia Hirschberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a corpus of European Portuguese spoken by teenagers and adults in a school context, CPE-FACES, with an overview of the differential characteristics of high school oral presentations and the challenges this data poses to automatic speech processing. The CPE-FACES corpus has been created with two main goals: to provide a resource for the study of prosodic patterns in both spontaneous and prepared unscripted speech, and to capture inter-speaker and speaking style variations common at school, for research on oral presentations. Research on speaking styles is still largely based on adult speech. References to teenagers are sparse, and cross-analyses of speech types comparing teenagers and adults are rare. We expect that CPE-FACES, currently a unique resource in this domain, will contribute to filling this gap in European Portuguese. Focusing on disfluencies and phrase-final phonetic-phonological processes, we show the impact of teenage speech on the automatic segmentation of oral presentations. Analyzing fluent final intonation contours in declarative utterances, we also show that communicative situation specificities, speaker status, and cross-gender differences are key factors in speaking style variation at school.

Prosodic, syntactic, semantic guidelines for topic structures across domains and corpora
Ana Isabel Mata | Helena Moniz | Telmo Móia | Anabela Gonçalves | Fátima Silva | Fernando Batista | Inês Duarte | Fátima Oliveira | Isabel Falé
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the annotation guidelines applied to naturally occurring speech, aiming at an integrated account of contrast and parallel structures in European Portuguese. These guidelines were defined to allow for the empirical study of interactions between intonation and syntax-discourse patterns in selected sets of different corpora (monologues and dialogues, by adults and teenagers). In this paper we focus on the multilayer annotation process of left periphery structures, using a small sample of highly spontaneous speech in which the distinct types of topic structures are displayed. The analysis of this sample provides fundamental training and testing material for further application in a wider range of domains and corpora. The annotation process comprises the following time-linked levels (manual and automatic): phone, syllable, and word level transcriptions (including co-articulation effects); tonal events and break levels; part-of-speech tagging; syntactic-discourse patterns (construction type; construction position; syntactic function; discourse function); and disfluency events. Speech corpora with such a multi-level annotation are a valuable resource for examining grammar module relations in language use from an integrated viewpoint. Such a viewpoint is innovative for our language, and has not often been adopted in studies of other languages.

Linguistic Evaluation of Support Verb Constructions by OpenLogos and Google Translate
Anabela Barreiro | Johanna Monti | Brigitte Orliac | Susanne Preuß | Kutz Arrieta | Wang Ling | Fernando Batista | Isabel Trancoso
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a systematic human evaluation of translations of English support verb constructions (SVCs) produced by a rule-based machine translation (RBMT) system (OpenLogos) and a statistical machine translation (SMT) system (Google Translate) for five languages: French, German, Italian, Portuguese, and Spanish. We classify support verb constructions according to their syntactic structure and semantic behavior and present a qualitative analysis of their translation errors. The study aims to verify how machine translation (MT) systems translate fine-grained linguistic phenomena, and how well equipped they are to produce high-quality translation. Another goal of the linguistically motivated quality analysis of SVC raw output is to reinforce the need for better system hybridization, which leverages the strengths of RBMT to the benefit of SMT, especially in improving the translation of multiword units. Taking multiword units into account, we propose an effective method to achieve MT hybridization based on the integration of semantico-syntactic knowledge into SMT.

2008

Language Dynamics and Capitalization using Maximum Entropy
Fernando Batista | Nuno Mamede | Isabel Trancoso
Proceedings of ACL-08: HLT, Short Papers

2000

Some Language Resources and Tools for Computational Processing of Portuguese at INESC
Luzia Wittmann | Ricardo Daniel Ribeiro | Tânia Pêgo | Fernando Batista
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00)