Marco Antonio Sobrevilla Cabezudo

Also published as: Marco A. Sobrevilla Cabezudo, Marco Sobrevilla


2020

pdf bib
Efficient Strategies for Hierarchical Text Classification: External Knowledge and Auxiliary Tasks
Kervy Rivas Rojas | Gina Bustamante | Arturo Oncevay | Marco Antonio Sobrevilla Cabezudo
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In hierarchical text classification, we perform a sequence of inference steps to predict the category of a document from top to bottom of a given class taxonomy. Most of the studies have focused on developing novels neural network architectures to deal with the hierarchical structure, but we prefer to look for efficient ways to strengthen a baseline model. We first define the task as a sequence-to-sequence problem. Afterwards, we propose an auxiliary synthetic task of bottom-up-classification. Then, from external dictionaries, we retrieve textual definitions for the classes of all the hierarchy’s layers, and map them into the word vector space. We use the class-definition embeddings as an additional input to condition the prediction of the next layer and in an adapted beam search. Whereas the modified search did not provide large gains, the combination of the auxiliary task and the additional input of class-definitions significantly enhance the classification accuracy. With our efficient approaches, we outperform previous studies, using a drastically reduced number of parameters, in two well-known English datasets.

pdf bib
NILC at SR’20: Exploring Pre-Trained Models in Surface Realisation
Marco Antonio Sobrevilla Cabezudo | Thiago Pardo
Proceedings of the Third Workshop on Multilingual Surface Realisation

This paper describes the submission by the NILC Computational Linguistics research group of the University of S ̃ao Paulo/Brazil to the English Track 2 (closed sub-track) at the Surface Realisation Shared Task 2020. The success of the current pre-trained models like BERT or GPT-2 in several tasks is well-known, however, this is not the case for data-to-text generation tasks and just recently some initiatives focused on it. This way, we explore how a pre-trained model (GPT-2) performs on the UD-to-text generation task. In general, the achieved results were poor, but there are some interesting ideas to explore. Among the learned lessons we may note that it is necessary to study strategies to represent UD inputs and to introduce structural knowledge into these pre-trained models.

2019

pdf bib
Back-Translation as Strategy to Tackle the Lack of Corpus in Natural Language Generation from Semantic Representations
Marco Antonio Sobrevilla Cabezudo | Simon Mille | Thiago Pardo
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

This paper presents an exploratory study that aims to evaluate the usefulness of back-translation in Natural Language Generation (NLG) from semantic representations for non-English languages. Specifically, Abstract Meaning Representation and Brazilian Portuguese (BP) are chosen as semantic representation and language, respectively. Two methods (focused on Statistical and Neural Machine Translation) are evaluated on two datasets (one automatically generated and another one human-generated) to compare the performance in a real context. Also, several cuts according to quality measures are performed to evaluate the importance (or not) of the data quality in NLG. Results show that there are still many improvements to be made but this is a promising approach.

pdf bib
Natural Language Generation: Recently Learned Lessons, Directions for Semantic Representation-based Approaches, and the Case of Brazilian Portuguese Language
Marco Antonio Sobrevilla Cabezudo | Thiago Pardo
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

This paper presents a more recent literature review on Natural Language Generation. In particular, we highlight the efforts for Brazilian Portuguese in order to show the available resources and the existent approaches for this language. We also focus on the approaches for generation from semantic representations (emphasizing the Abstract Meaning Representation formalism) as well as their advantages and limitations, including possible future directions.

pdf bib
Assessing Back-Translation as a Corpus Generation Strategy for non-English Tasks: A Study in Reading Comprehension and Word Sense Disambiguation
Fabricio Monsalve | Kervy Rivas Rojas | Marco Antonio Sobrevilla Cabezudo | Arturo Oncevay
Proceedings of the 13th Linguistic Annotation Workshop

Corpora curated by experts have sustained Natural Language Processing mainly in English, but the expensiveness of corpora creation is a barrier for the development in further languages. Thus, we propose a corpus generation strategy that only requires a machine translation system between English and the target language in both directions, where we filter the best translations by computing automatic translation metrics and the task performance score. By studying Reading Comprehension in Spanish and Word Sense Disambiguation in Portuguese, we identified that a more quality-oriented metric has high potential in the corpora selection without degrading the task performance. We conclude that it is possible to systematise the building of quality corpora using machine translation and automatic metrics, besides some prior effort to clean and process the data.

pdf bib
Towards a General Abstract Meaning Representation Corpus for Brazilian Portuguese
Marco Antonio Sobrevilla Cabezudo | Thiago Pardo
Proceedings of the 13th Linguistic Annotation Workshop

Abstract Meaning Representation (AMR) is a recent and prominent semantic representation with good acceptance and several applications in the Natural Language Processing area. For English, there is a large annotated corpus (with approximately 39K sentences) that supports the research with the representation. However, to the best of our knowledge, there is only one restricted corpus for Portuguese, which contains 1,527 sentences. In this context, this paper presents an effort to build a general purpose AMR-annotated corpus for Brazilian Portuguese by translating and adapting AMR English guidelines. Our results show that such approach is feasible, but there are some challenging phenomena to solve. More than this, efforts are necessary to increase the coverage of the corresponding lexical resource that supports the annotation.

2018

pdf bib
ChAnot: An Intelligent Annotation Tool for Indigenous and Highly Agglutinative Languages in Peru
Rodolfo Mercado-Gonzales | José Pereira-Noriega | Marco Sobrevilla | Arturo Oncevay
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Corpus Building and Evaluation of Aspect-based Opinion Summaries from Tweets in Spanish
Daniel Peñaloza | Rodrigo López | Juanjosé Tenorio | Héctor Gómez | Arturo Oncevay-Marcos | Marco A. Sobrevilla Cabezudo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
WordNet-Shp: Towards the Building of a Lexical Database for a Peruvian Minority Language
Diego Maguiño-Valencia | Arturo Oncevay-Marcos | Marco A. Sobrevilla Cabezudo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
NILC-SWORNEMO at the Surface Realization Shared Task: Exploring Syntax-Based Word Ordering using Neural Models
Marco Antonio Sobrevilla Cabezudo | Thiago Pardo
Proceedings of the First Workshop on Multilingual Surface Realisation

This paper describes the submission by the NILC Computational Linguistics research group of the University of São Paulo/Brazil to the Track 1 of the Surface Realization Shared Task (SRST Track 1). We present a neural-based method that works at the syntactic level to order the words (which we refer by NILC-SWORNEMO, standing for “Syntax-based Word ORdering using NEural MOdels”). Additionally, we apply a bottom-up approach to build the sentence and, using language-specific lexicons, we produce the proper word form of each lemma in the sentence. The results obtained by our method outperformed the average of the results for English, Portuguese and Spanish in the track.