David Martins de Matos

Also published as: David Martins de Matos, David M. de Matos


Mapping the Dialog Act Annotations of the LEGO Corpus into ISO 24617-2 Communicative Functions
Eugénio Ribeiro | Ricardo Ribeiro | David Martins de Matos
Proceedings of the 12th Language Resources and Evaluation Conference

ISO 24617-2, the ISO standard for dialog act annotation, sets the ground for more comparable research in the area. However, the amount of data annotated according to it is still small, which hinders the development of approaches for automatic recognition. In this paper, we describe a mapping of the original dialog act labels of the LEGO corpus, which have been neglected, into the communicative functions of the standard. Although this does not lead to a complete annotation according to the standard, the 347 dialogs provide a relevant amount of data that can be used in the development of automatic communicative function recognition approaches, which may lead to a wider adoption of the standard. Using the 17 English dialogs of the DialogBank as a gold standard, our preliminary experiments have shown that including the mapped dialogs during the training phase leads to improved performance when recognizing communicative functions in the Task dimension.


L2F/INESC-ID at SemEval-2019 Task 2: Unsupervised Lexical Semantic Frame Induction using Contextualized Word Representations
Eugénio Ribeiro | Vânia Mendonça | Ricardo Ribeiro | David Martins de Matos | Alberto Sardinha | Ana Lúcia Santos | Luísa Coheur
Proceedings of the 13th International Workshop on Semantic Evaluation

Building large datasets annotated with semantic information, such as FrameNet, is an expensive process. Consequently, such resources are unavailable for many languages and specific domains. This problem can be alleviated by using unsupervised approaches to induce the frames evoked by a collection of documents. That is the objective of the second task of SemEval 2019, which comprises three subtasks: clustering of verbs that evoke the same frame and clustering of arguments into both frame-specific slots and semantic roles. We approach all the subtasks by applying a graph clustering algorithm on contextualized embedding representations of the verbs and arguments. Using such representations is appropriate in the context of this task, since they provide cues for word-sense disambiguation. Thus, they can be used to identify different frames evoked by the same words. Using this approach we were able to outperform all of the baselines reported for the task on the test set in terms of Purity F1, as well as in terms of BCubed F1 in most cases.


SPA: Web-based Platform for easy Access to Speech Processing Modules
Fernando Batista | Pedro Curto | Isabel Trancoso | Alberto Abad | Jaime Ferreira | Eugénio Ribeiro | Helena Moniz | David Martins de Matos | Ricardo Ribeiro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents SPA, a web-based Speech Analytics platform that integrates several speech processing modules and makes it possible to use them through the web. It was developed with the aim of facilitating the usage of the modules, without the need to know about software dependencies and specific configurations. Apart from being accessible through a web browser, the platform also provides a REST API for easy integration with other applications. The platform is flexible, scalable, provides authentication for access restrictions, and was developed taking into consideration the time and effort required to provide new services. The platform is still being improved, but it already integrates a considerable number of audio and text processing modules, including automatic transcription, speech disfluency classification, emotion detection, dialog act recognition, age and gender classification, non-nativeness detection, hyper-articulation detection, and two external modules for feature extraction and DTMF detection. This paper describes the SPA architecture, presents the already integrated modules, and provides a detailed description of the ones most recently integrated.


Automatic Keyword Extraction on Twitter
Luís Marujo | Wang Ling | Isabel Trancoso | Chris Dyer | Alan W. Black | Anatole Gershman | David Martins de Matos | João Neto | Jaime Carbonell
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Extending a Single-Document Summarizer to Multi-Document: a Hierarchical Approach
Luís Marujo | Ricardo Ribeiro | David Martins de Matos | João Neto | Anatole Gershman | Jaime Carbonell
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics


Revising the annotation of a Broadcast News corpus: a linguistic approach
Vera Cabarrão | Helena Moniz | Fernando Batista | Ricardo Ribeiro | Nuno Mamede | Hugo Meinedo | Isabel Trancoso | Ana Isabel Mata | David Martins de Matos
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a linguistic revision process of a speech corpus of Portuguese broadcast news focusing on metadata annotation for rich transcription, and reports on the impact of the new data on the performance of several modules. The main focus of the revision process was annotating and revising structural metadata events, such as disfluencies and punctuation marks. The resulting revised data is now used extensively and has been extremely important for improving the performance of several modules, especially the punctuation and capitalization modules, but also the speech recognition system and all subsequent modules. The resulting data has also recently been used in disfluency studies across domains.


Building and Exploring Semantic Equivalences Resources
Gracinda Carvalho | David Martins de Matos | Vitor Rocio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Language resources that include semantic equivalences at word level are common, and their usefulness is well established in text processing applications, as in the case of search. Named entities also play an important role in text-based applications, but are not usually covered by the previously mentioned resources. The present work describes the WES base, Wikipedia Entity Synonym base, a freely available resource based on Wikipedia. The WES base was built for the Portuguese language, in the same format as another freely available thesaurus for the same language, the TeP base, which allows integration of equivalences at both word level and entity level. The resource has been built in a language-independent way, so that it can be extended to different languages. The WES base was used in a Question Answering system, significantly enhancing its performance.


Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs
Wang Ling | João Graça | David Martins de Matos | Isabel Trancoso | Alan W Black
Proceedings of 5th International Joint Conference on Natural Language Processing


Fairy Tale Corpus Organization Using Latent Semantic Mapping and an Item-to-item Top-n Recommendation Algorithm
Paula Vaz Lobo | David Martins de Matos
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we present a fairy tale corpus that was semantically organized and tagged. The proposed method uses latent semantic mapping to represent the stories and a top-n item-to-item recommendation algorithm to define clusters of similar stories. Each story can be placed in more than one cluster and stories in the same cluster are related to the same concepts. The results were manually evaluated regarding the groupings as perceived by human judges. The evaluation resulted in a precision of 0.81, a recall of 0.69, and an f-measure of 0.75 when using tf*idf for word frequency. Our method is topic- and language-independent, and, contrary to traditional clustering methods, automatically defines the number of clusters based on the set of documents. This method can be used as a setup for traditional clustering or classification. The resulting corpus will be used for recommendation purposes, although it can also be used for emotion extraction, semantic role extraction, meaning extraction, and text classification, among other tasks.


High-Performance High-Volume Layered Corpora Annotation
Tiago Luís | David Martins de Matos
Proceedings of the Third Linguistic Annotation Workshop (LAW III)


Using Lexical Acquisition to Enrich a Predicate Argument Reusable Database
Paula Cristina Vaz | David Martins de Matos | Nuno J. Mamede
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The work described in this paper aims to enrich the noun classifications of an existing database of lexical resources (de Matos and Ribeiro, 2004) by adding missing information such as semantic relations. Relations are extracted from an annotated and manually corrected corpus. Semantic relations added to the database are retrieved from noun-appositive relations found in the corpus. The method uses clustering to generate labeled sets of words with hypernym relations between set label and set elements.

Mixed-Source Multi-Document Speech-to-Text Summarization
Ricardo Ribeiro | David Martins de Matos
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization


Rethinking Reusable Resources
David M. de Matos | Ricardo Ribeiro | Nuno J. Mamede
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04)