Luis Chiruzzo


2020

Development of a Guarani - Spanish Parallel Corpus
Luis Chiruzzo | Pedro Amarilla | Adolfo Ríos | Gustavo Giménez Lugo
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents the development of a Guarani - Spanish parallel corpus with sentence-level alignment. The Guarani sentences of the corpus use the Jopara Guarani dialect, the dialect of Guarani spoken in Paraguay, which is based on Guarani grammar and may include several Spanish loanwords or neologisms. The corpus has around 14,500 sentence pairs aligned using a semi-automatic process, containing 228,000 Guarani tokens and 336,000 Spanish tokens extracted from web sources.

HAHA 2019 Dataset: A Corpus for Humor Analysis in Spanish
Luis Chiruzzo | Santiago Castro | Aiala Rosá
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents the development of a corpus of 30,000 Spanish tweets that were crowd-annotated with humor value and funniness score. Approximately 38.6% of the tweets in the corpus are humorous, and the humorous tweets have an average score of 2.04 on a scale from 1 to 5. The corpus has been used in an automatic humor recognition and analysis competition, obtaining encouraging results from the participants.

A Multi-level Annotated Corpus of Scientific Papers for Scientific Document Summarization and Cross-document Relation Discovery
Ahmed AbuRa’ed | Horacio Saggion | Luis Chiruzzo
Proceedings of the 12th Language Resources and Evaluation Conference

Related work sections or literature reviews are an essential part of every scientific article, being crucial for paper reviewing and assessment. The automatic generation of related work sections can be considered an instance of the multi-document summarization problem. In order to allow the study of this specific problem, we have developed a manually annotated, machine-readable dataset of related work sections, cited papers (e.g. references) and sentences, together with an additional layer of papers citing the references. We additionally present experiments on the identification of cited sentences, using citation contexts as input. The corpus and the gold standard are made available for use by the scientific community.

Statistical Deep Parsing for Spanish Using Neural Networks
Luis Chiruzzo | Dina Wonsever
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

This paper presents the development of a deep parser for Spanish that uses an HPSG grammar and returns trees that contain both syntactic and semantic information. The parsing process uses a top-down approach implemented using LSTM neural networks, and achieves good results in terms of syntactic constituency and dependency metrics, and also semantic role labeling (SRL). We describe the grammar, corpus and implementation of the parser. Our process outperforms a CKY baseline and other Spanish parsers in terms of global metrics and also for some specific Spanish phenomena, such as clitic reduplication and relative referents.

2018

Spanish HPSG Treebank based on the AnCora Corpus
Luis Chiruzzo | Dina Wonsever
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A Crowd-Annotated Spanish Corpus for Humor Analysis
Santiago Castro | Luis Chiruzzo | Aiala Rosá | Diego Garat | Guillermo Moncecchi
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

Computational Humor involves several tasks, such as humor recognition, humor generation, and humor scoring, for which it is useful to have human-curated data. In this work we present a corpus of 27,000 tweets written in Spanish and crowd-annotated with their humor value and funniness score, with about four annotations per tweet, tagged by 1,300 people over the Internet. It is equally divided between tweets coming from humorous and non-humorous accounts. The inter-annotator agreement, measured with Krippendorff’s alpha, is 0.5710. The dataset is available for general usage and can serve as a basis for humor detection and as a first step to tackle subjectivity.

2017

What Sentence are you Referring to and Why? Identifying Cited Sentences in Scientific Literature
Ahmed AbuRa’ed | Luis Chiruzzo | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In the current context of scientific information overload, text mining tools are of paramount importance for researchers who have to read scientific papers and assess their value. Current citation networks, which link papers by citation relationships (reference and citing paper), are useful to quantitatively understand the value of a piece of scientific work; however, they are limited in that they do not provide information about what specific part of the reference paper the citing paper is referring to. This qualitative information is very important, for example, in the context of current community-based scientific summarization activities. In this paper, relying on an annotated dataset of co-citation sentences, we carry out a number of experiments aimed at automatically identifying, given a citation sentence, the part of a reference paper being cited. Additionally, our algorithm predicts the specific reason why such a reference sentence has been cited, out of five possible reasons.

2013

Adaptation of a Rule-Based Translator to Río de la Plata Spanish
Ernesto López | Luis Chiruzzo | Dina Wonsever
Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants