Alberto Barrón-Cedeño


2020

Prta: A System to Support the Analysis of Propaganda Techniques in the News
Giovanni Da San Martino | Shaden Shaar | Yifan Zhang | Seunghak Yu | Alberto Barrón-Cedeño | Preslav Nakov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Recent events, such as the 2016 US Presidential Campaign, Brexit and the COVID-19 “infodemic”, have brought into the spotlight the dangers of online disinformation. There has been a lot of research focusing on fact-checking and disinformation detection. However, little attention has been paid to the specific rhetorical and psychological techniques used to convey propaganda messages. Revealing the use of such techniques can help promote media literacy and critical thinking, and eventually contribute to limiting the impact of “fake news” and disinformation campaigns. Prta (Propaganda Persuasion Techniques Analyzer) allows users to explore the articles crawled on a regular basis by highlighting the spans in which propaganda techniques occur and to compare them on the basis of their use of propaganda techniques. The system further reports statistics about the use of such techniques, overall and over time, or according to filtering criteria specified by the user based on time interval, keywords, and/or political orientation of the media. Moreover, it allows users to analyze any text or URL through a dedicated interface or via an API. The system is available online: https://www.tanbih.org/prta.

SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles
Giovanni Da San Martino | Alberto Barrón-Cedeño | Henning Wachsmuth | Rostislav Petrov | Preslav Nakov
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2020 Task 11 on Detection of Propaganda Techniques in News Articles. The task featured two subtasks. Subtask SI is about Span Identification: given a plain-text document, spot the specific text fragments containing propaganda. Subtask TC is about Technique Classification: given a specific text fragment, in the context of a full document, determine the propaganda technique it uses, choosing from an inventory of 14 possible propaganda techniques. The task attracted a large number of participants: 250 teams signed up to participate and 44 made a submission on the test set. In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For both subtasks, the best systems used pre-trained Transformers and ensembles.

2019

Fine-Grained Analysis of Propaganda in News Article
Giovanni Da San Martino | Seunghak Yu | Alberto Barrón-Cedeño | Rostislav Petrov | Preslav Nakov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Propaganda aims at influencing people’s mindset with the purpose of advancing a specific agenda. Previous work has addressed propaganda detection at document level, typically labelling all articles from a propagandistic news outlet as propaganda. Such noisy gold labels inevitably affect the quality of any learning system trained on them. A further issue with most existing systems is the lack of explainability. To overcome these limitations, we propose a novel task: performing fine-grained analysis of texts by detecting all fragments that contain propaganda techniques as well as their type. In particular, we create a corpus of news articles manually annotated at fragment level with eighteen propaganda techniques and propose a suitable evaluation measure. We further design a novel multi-granularity neural network, and we show that it outperforms several strong BERT-based baselines.

Tanbih: Get To Know What You Are Reading
Yifan Zhang | Giovanni Da San Martino | Alberto Barrón-Cedeño | Salvatore Romeo | Jisun An | Haewoon Kwak | Todor Staykovski | Israa Jaradat | Georgi Karadzhov | Ramy Baly | Kareem Darwish | James Glass | Preslav Nakov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

We introduce Tanbih, a news aggregator with intelligent analysis tools to help readers understand what’s behind a news story. Our system displays news grouped into events and generates media profiles that show the general factuality of reporting, the degree of propagandistic content, hyper-partisanship, leading political ideology, general frame of reporting, and stance with respect to various claims and topics of a news outlet. In addition, we automatically analyse each article to detect whether it is propagandistic and to determine its stance with respect to a number of controversial topics.

Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
Anna Feldman | Giovanni Da San Martino | Alberto Barrón-Cedeño | Chris Brew | Chris Leberknight | Preslav Nakov
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

Findings of the NLP4IF-2019 Shared Task on Fine-Grained Propaganda Detection
Giovanni Da San Martino | Alberto Barrón-Cedeño | Preslav Nakov
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

We present the shared task on Fine-Grained Propaganda Detection, which was organized as part of the NLP4IF workshop at EMNLP-IJCNLP 2019. There were two subtasks. FLC is a fragment-level task that asks for the identification of propagandist text fragments in a news article and also for the prediction of the specific propaganda technique used in each such fragment (18-way classification task). SLC is a sentence-level binary classification task asking to detect the sentences that contain propaganda. A total of 12 teams submitted systems for the FLC task, 25 teams did so for the SLC task, and 14 teams eventually submitted a system description paper. For both subtasks, most systems managed to beat the baseline by a sizable margin. The leaderboard and the data from the competition are available at http://propaganda.qcri.org/nlp4if-shared-task/.

It Takes Nine to Smell a Rat: Neural Multi-Task Learning for Check-Worthiness Prediction
Slavena Vasileva | Pepa Atanasova | Lluís Màrquez | Alberto Barrón-Cedeño | Preslav Nakov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We propose a multi-task deep-learning approach for estimating the check-worthiness of claims in political debates. Given a political debate, such as the 2016 US Presidential and Vice-Presidential ones, the task is to predict which statements in the debate should be prioritized for fact-checking. While different fact-checking organizations would naturally make different choices when analyzing the same debate, we show that it pays to learn from multiple sources simultaneously (PolitiFact, FactCheck, ABC, CNN, NPR, NYT, Chicago Tribune, The Guardian, and Washington Post) in a multi-task learning setup, even when a particular source is chosen as a target to imitate. Our evaluation shows state-of-the-art results on a standard dataset for the task of check-worthiness prediction.

Team Jack Ryder at SemEval-2019 Task 4: Using BERT Representations for Detecting Hyperpartisan News
Daniel Shaprin | Giovanni Da San Martino | Alberto Barrón-Cedeño | Preslav Nakov
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe the system submitted by the Jack Ryder team to SemEval-2019 Task 4 on Hyperpartisan News Detection. The task asked participants to predict whether a given article is hyperpartisan, i.e., extreme-left or extreme-right. We proposed an approach based on BERT with fine-tuning, which was ranked 7th out of 28 teams on the distantly supervised dataset, where all articles from a hyperpartisan/non-hyperpartisan news outlet are considered to be hyperpartisan/non-hyperpartisan. On a manually annotated test dataset, where human annotators double-checked the labels, we were ranked 29th out of 42 teams.

Team QCRI-MIT at SemEval-2019 Task 4: Propaganda Analysis Meets Hyperpartisan News Detection
Abdelrhman Saleh | Ramy Baly | Alberto Barrón-Cedeño | Giovanni Da San Martino | Mitra Mohtarami | Preslav Nakov | James Glass
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe our submission to SemEval-2019 Task 4 on Hyperpartisan News Detection. We rely on a variety of engineered features originally used to detect propaganda. This is based on the assumption that biased messages are propagandistic and promote a particular political cause or viewpoint. In particular, we trained a logistic regression model with features ranging from simple bag of words to vocabulary richness and text readability. Our system achieved 72.9% accuracy on the manually annotated test set, and 60.8% on the test data that was obtained with distant supervision. Additional experiments showed that significant performance gains can be achieved with better feature pre-processing.

2018

ClaimRank: Detecting Check-Worthy Claims in Arabic and English
Israa Jaradat | Pepa Gencheva | Alberto Barrón-Cedeño | Lluís Màrquez | Preslav Nakov
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present ClaimRank, an online system for detecting check-worthy claims. While originally trained on political debates, the system can work for any kind of text, e.g., interviews or just regular news articles. Its aim is to facilitate manual fact-checking efforts by prioritizing the claims that fact-checkers should consider first. ClaimRank supports both Arabic and English. It is trained on actual annotations from nine reputable fact-checking organizations (PolitiFact, FactCheck, ABC, CNN, NPR, NYT, Chicago Tribune, The Guardian, and Washington Post), and thus it can mimic the claim selection strategy of any one of them, as well as that of the union of them all.

A Flexible, Efficient and Accurate Framework for Community Question Answering Pipelines
Salvatore Romeo | Giovanni Da San Martino | Alberto Barrón-Cedeño | Alessandro Moschitti
Proceedings of ACL 2018, System Demonstrations

Although deep neural networks have proven to be excellent tools for delivering state-of-the-art results, when data is scarce and the tackled tasks involve complex semantic inference, deep linguistic processing and traditional structure-based approaches, such as tree kernel methods, are an alternative solution. Community Question Answering is a research area that benefits from deep linguistic analysis to improve the experience of the community of forum users. In this paper, we present a UIMA framework to distribute the computation of cQA tasks over computer clusters such that traditional systems can scale to large datasets and deliver fast processing.

2017

Lump at SemEval-2017 Task 1: Towards an Interlingua Semantic Similarity
Cristina España-Bonet | Alberto Barrón-Cedeño
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes the participation of the Lump team in SemEval 2017 Task 1 on Semantic Textual Similarity. Our supervised model relies on features which are multilingual or interlingual in nature. We include lexical similarities, cross-language explicit semantic analysis, internal representations of multilingual neural networks and interlingual word embeddings. Our representations allow us to use large datasets in language pairs with many instances to better classify instances in smaller language pairs, avoiding the necessity of translating into a single language. Hence we can deal with all the languages in the task: Arabic, English, Spanish, and Turkish.

A Context-Aware Approach for Detecting Worth-Checking Claims in Political Debates
Pepa Gencheva | Preslav Nakov | Lluís Màrquez | Alberto Barrón-Cedeño | Ivan Koychev
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In the context of investigative journalism, we address the problem of automatically identifying which claims in a given document are most worthy and should be prioritized for fact-checking. Despite its importance, this is a relatively understudied problem. Thus, we create a new corpus of political debates, containing statements that have been fact-checked by nine reputable sources, and we train machine learning models to predict which claims should be prioritized for fact-checking, i.e., we model the problem as a ranking task. Unlike previous work, which has looked primarily at sentences in isolation, in this paper we focus on a rich input representation modeling the context: relationship between the target statement and the larger context of the debate, interaction between the opponents, and reaction by the moderator and by the public. Our experiments show state-of-the-art results, outperforming a strong rivaling system by a margin, while also confirming the importance of the contextual information.

Fully Automated Fact Checking Using External Sources
Georgi Karadzhov | Preslav Nakov | Lluís Màrquez | Alberto Barrón-Cedeño | Ivan Koychev
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Given the constantly growing proliferation of false claims online in recent years, there has also been a growing research interest in automatically distinguishing false rumors from factually true claims. Here, we propose a general-purpose framework for fully-automatic fact checking using external sources, tapping the potential of the entire Web as a knowledge source to confirm or reject a claim. Our framework uses a deep neural network with LSTM text encoding to combine semantic kernels with task-specific embeddings that encode a claim together with pieces of potentially relevant text fragments from the Web, taking the source reliability into account. The evaluation results show good performance on two different tasks and datasets: (i) rumor detection and (ii) fact checking of the answers to a question in community question answering forums.

2016

Neural Attention for Learning to Rank Questions in Community Question Answering
Salvatore Romeo | Giovanni Da San Martino | Alberto Barrón-Cedeño | Alessandro Moschitti | Yonatan Belinkov | Wei-Ning Hsu | Yu Zhang | Mitra Mohtarami | James Glass
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In real-world data, e.g., from Web forums, text is often contaminated with redundant or irrelevant content, which introduces noise into machine learning algorithms. In this paper, we apply Long Short-Term Memory networks with an attention mechanism, which can select important parts of text for the task of similar question retrieval from community Question Answering (cQA) forums. In particular, we use the attention weights for selecting both entire sentences and their subparts, i.e., word/chunk, from shallow syntactic trees. More interestingly, we apply tree kernels to the filtered text representations, thus exploiting the implicit features of the subtree space for learning question reranking. Our results show that the attention-based pruning allows for achieving the top position in the cQA challenge of SemEval 2016, with a relatively large gap from the other participants while greatly decreasing running time.

Selecting Sentences versus Selecting Tree Constituents for Automatic Question Ranking
Alberto Barrón-Cedeño | Giovanni Da San Martino | Salvatore Romeo | Alessandro Moschitti
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Community question answering (cQA) websites are focused on users who post questions to an online forum, expecting other users to provide them with answers or suggestions. Unlike other social media, the length of the posted queries has no limits and queries tend to be multi-sentence elaborations combining context, actual questions, and irrelevant information. We approach the problem of question ranking: given a user’s new question, retrieve those previously-posted questions which could be equivalent, or highly relevant. This could prevent the posting of nearly-duplicate questions and provide the user with instantaneous answers. For the first time in cQA, we address the selection of relevant text —both at sentence- and at constituent-level— for parse tree-based representations. Our supervised models for text selection boost the performance of a tree kernel-based machine learning model, allowing it to overtake the current state of the art on a recently released cQA evaluation framework.

An Interactive System for Exploring Community Question Answering Forums
Enamul Hoque | Shafiq Joty | Lluís Màrquez | Alberto Barrón-Cedeño | Giovanni Da San Martino | Alessandro Moschitti | Preslav Nakov | Salvatore Romeo | Giuseppe Carenini
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present an interactive system to provide effective and efficient search capabilities in Community Question Answering (cQA) forums. The system integrates state-of-the-art technology for answer search with a Web-based user interface specifically tailored to support the cQA forum readers. The answer search module automatically finds relevant answers for a new question by exploring related questions and the comments within their threads. The graphical user interface presents the search results and supports the exploration of related information. The system is running live at http://www.qatarliving.com/betasearch/.

ConvKN at SemEval-2016 Task 3: Answer and Question Selection for Question Answering on Arabic and English Fora
Alberto Barrón-Cedeño | Daniele Bonadiman | Giovanni Da San Martino | Shafiq Joty | Alessandro Moschitti | Fahad Al Obaidli | Salvatore Romeo | Kateryna Tymoshenko | Antonio Uva
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Global Thread-level Inference for Comment Classification in Community Question Answering
Shafiq Joty | Alberto Barrón-Cedeño | Giovanni Da San Martino | Simone Filice | Lluís Màrquez | Alessandro Moschitti | Preslav Nakov
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Answer Selection in Arabic Community Question Answering: A Feature-Rich Approach
Yonatan Belinkov | Alberto Barrón-Cedeño | Hamdy Mubarak
Proceedings of the Second Workshop on Arabic Natural Language Processing

A Factory of Comparable Corpora from Wikipedia
Alberto Barrón-Cedeño | Cristina España-Bonet | Josu Boldoba | Lluís Màrquez
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

Thread-Level Information for Comment Classification in Community Question Answering
Alberto Barrón-Cedeño | Simone Filice | Giovanni Da San Martino | Shafiq Joty | Lluís Màrquez | Preslav Nakov | Alessandro Moschitti
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

QCRI: Answer Selection for Community Question Answering - Experiments for Arabic and English
Massimo Nicosia | Simone Filice | Alberto Barrón-Cedeño | Iman Saleh | Hamdy Mubarak | Wei Gao | Preslav Nakov | Giovanni Da San Martino | Alessandro Moschitti | Kareem Darwish | Lluís Màrquez | Shafiq Joty | Walid Magdy
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

IPA and STOUT: Leveraging Linguistic and Source-based Features for Machine Translation Evaluation
Meritxell Gonzàlez | Alberto Barrón-Cedeño | Lluís Màrquez
Proceedings of the Ninth Workshop on Statistical Machine Translation

2013

The TALP-UPC Phrase-Based Translation Systems for WMT13: System Combination with Morphology Generation, Domain Adaptation and Corpus Filtering
Lluís Formiga | Marta R. Costa-jussà | José B. Mariño | José A. R. Fonollosa | Alberto Barrón-Cedeño | Lluís Màrquez
Proceedings of the Eighth Workshop on Statistical Machine Translation

The TALP-UPC Approach to System Selection: Asiya Features and Pairwise Classification Using Random Forests
Lluís Formiga | Meritxell Gonzàlez | Alberto Barrón-Cedeño | José A. R. Fonollosa | Lluís Màrquez
Proceedings of the Eighth Workshop on Statistical Machine Translation

Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection
Alberto Barrón-Cedeño | Marta Vila | M. Antònia Martí | Paolo Rosso
Computational Linguistics, Volume 39, Issue 4 - December 2013

UPC-CORE: What Can Machine Translation Evaluation Metrics and Wikipedia Do for Estimating Semantic Textual Similarity?
Alberto Barrón-Cedeño | Lluís Màrquez | Maria Fuentes | Horacio Rodríguez | Jordi Turmo
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

DeSoCoRe: Detecting Source Code Re-Use across Programming Languages
Enrique Flores | Alberto Barrón-Cedeño | Paolo Rosso | Lidia Moreno
Proceedings of the Demonstration Session at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2010

Plagiarism Detection across Distant Language Pairs
Alberto Barrón-Cedeño | Paolo Rosso | Eneko Agirre | Gorka Labaka
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

An Evaluation Framework for Plagiarism Detection
Martin Potthast | Benno Stein | Alberto Barrón-Cedeño | Paolo Rosso
Coling 2010: Posters

Corpus and Evaluation Measures for Automatic Plagiarism Detection
Alberto Barrón-Cedeño | Martin Potthast | Paolo Rosso | Benno Stein
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The simple access to texts on digital libraries and the World Wide Web has led to an increased number of plagiarism cases in recent years, which renders manual plagiarism detection infeasible at large. Various methods for automatic plagiarism detection have been developed whose objective is to assist human experts in the analysis of documents for plagiarism. The methods can be divided into two main approaches: intrinsic and external. Unlike other tasks in natural language processing and information retrieval, it is not possible to publish a collection of real plagiarism cases for evaluation purposes since they cannot be properly anonymized. Therefore, current evaluations found in the literature are incomparable and, very often, not even reproducible. Our contribution in this respect is a newly developed large-scale corpus of artificial plagiarism useful for the evaluation of intrinsic as well as external plagiarism detection. Additionally, new detection performance measures tailored to the evaluation of plagiarism detection algorithms are proposed.

English-Spanish Large Statistical Dictionary of Inflectional Forms
Grigori Sidorov | Alberto Barrón-Cedeño | Paolo Rosso
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper presents an approach for constructing a weighted bilingual dictionary of inflectional forms using as input data a traditional bilingual dictionary, and not parallel corpora. An algorithm is developed that generates all possible morphological (inflectional) forms and weights them using information on the distribution of the corresponding grammar sets (grammar information) in large corpora for each language. The algorithm also takes into account the compatibility of grammar sets in a language pair; for example, a verb in the past tense in language L is normally expected to be translated by a verb in the past tense in language L'. We consider that the developed method is universal, i.e., it can be applied to any pair of languages. The obtained dictionary is freely available. It can be used in several NLP tasks, for example, statistical machine translation.