Giovanni Da San Martino


2020

pdf bib
That is a Known Lie: Detecting Previously Fact-Checked Claims
Shaden Shaar | Nikolay Babulkov | Giovanni Da San Martino | Preslav Nakov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The recent proliferation of ”fake news” has triggered a number of responses, most notably the emergence of several manual fact-checking initiatives. As a result and over time, a large number of fact-checked claims have been accumulated, which increases the likelihood that a new claim in social media or a new statement by a politician might have already been fact-checked by some trusted fact-checking organization, as viral claims often come back after a while in social media, and politicians like to repeat their favorite statements, true or false, over and over again. As manual fact-checking is very time-consuming (and fully automatic fact-checking has credibility issues), it is important to try to save this effort and to avoid wasting time on claims that have already been fact-checked. Interestingly, despite the importance of the task, it has been largely ignored by the research community so far. Here, we aim to bridge this gap. In particular, we formulate the task and we discuss how it relates to, but also differs from, previous work. We further create a specialized dataset, which we release to the research community. Finally, we present learning-to-rank experiments that demonstrate sizable improvements over state-of-the-art retrieval and textual similarity approaches.

pdf bib
Prta: A System to Support the Analysis of Propaganda Techniques in the News
Giovanni Da San Martino | Shaden Shaar | Yifan Zhang | Seunghak Yu | Alberto Barrón-Cedeño | Preslav Nakov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Recent events, such as the 2016 US Presidential Campaign, Brexit and the COVID-19 “infodemic”, have brought into the spotlight the dangers of online disinformation. There has been a lot of research focusing on fact-checking and disinformation detection. However, little attention has been paid to the specific rhetorical and psychological techniques used to convey propaganda messages. Revealing the use of such techniques can help promote media literacy and critical thinking, and eventually contribute to limiting the impact of “fake news” and disinformation campaigns. Prta (Propaganda Persuasion Techniques Analyzer) allows users to explore the articles crawled on a regular basis by highlighting the spans in which propaganda techniques occur and to compare them on the basis of their use of propaganda techniques. The system further reports statistics about the use of such techniques, overall and over time, or according to filtering criteria specified by the user based on time interval, keywords, and/or political orientation of the media. Moreover, it allows users to analyze any text or URL through a dedicated interface or via an API. The system is available online: https://www.tanbih.org/prta.

pdf bib
Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda
Giovanni Da San Martino | Chris Brew | Giovanni Luca Ciampaglia | Anna Feldman | Chris Leberknight | Preslav Nakov
Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

pdf bib
SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles
Giovanni Da San Martino | Alberto Barrón-Cedeño | Henning Wachsmuth | Rostislav Petrov | Preslav Nakov
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2020 Task 11 on Detection of Propaganda Techniques in News Articles. The task featured two subtasks. Subtask SI is about Span Identification: given a plain-text document, spot the specific text fragments containing propaganda. Subtask TC is about Technique Classification: given a specific text fragment, in the context of a full document, determine the propaganda technique it uses, choosing from an inventory of 14 possible propaganda techniques. The task attracted a large number of participants: 250 teams signed up to participate and 44 made a submission on the test set. In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For both subtasks, the best systems used pre-trained Transformers and ensembles.

pdf bib
We Can Detect Your Bias: Predicting the Political Ideology of News Articles
Ramy Baly | Giovanni Da San Martino | James Glass | Preslav Nakov
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We explore the task of predicting the leading political ideology or bias of news articles. First, we collect and release a large dataset of 34,737 articles that were manually annotated for political ideology –left, center, or right–, which is well-balanced across both topics and media. We further use a challenging experimental setup where the test examples come from media that were not seen during training, which prevents the model from learning to detect the source of the target news article instead of predicting its political ideology. From a modeling perspective, we propose an adversarial media adaptation, as well as a specially adapted triplet loss. We further add background information about the source, and we show that it is quite helpful for improving article-level prediction. Our experimental results show very sizable improvements over using state-of-the-art pre-trained Transformers in this challenging setup.

pdf bib
Fact-Checking, Fake News, Propaganda, and Media Bias: Truth Seeking in the Post-Truth Era
Preslav Nakov | Giovanni Da San Martino
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

The rise of social media has democratized content creation and has made it easy for everybody to share and spread information online. On the positive side, this has given rise to citizen journalism, thus enabling much faster dissemination of information compared to what was possible with newspapers, radio, and TV. On the negative side, stripping traditional media from their gate-keeping role has left the public unprotected against the spread of misinformation, which could now travel at breaking-news speed over the same democratic channel. This has given rise to the proliferation of false information specifically created to affect individual people’s beliefs, and ultimately to influence major events such as political elections. There are strong indications that false information was weaponized at an unprecedented scale during Brexit and the 2016 U.S. presidential elections. “Fake news,” which can be defined as fabricated information that mimics news media content in form but not in organizational process or intent, became the Word of the Year for 2017, according to Collins Dictionary. Thus, limiting the spread of “fake news” and its impact has become a major focus for computer scientists, journalists, social media companies, and regulatory authorities. The tutorial will offer an overview of the broad and emerging research area of disinformation, with focus on the latest developments and research directions.

2019

pdf bib
Fine-Grained Analysis of Propaganda in News Article
Giovanni Da San Martino | Seunghak Yu | Alberto Barrón-Cedeño | Rostislav Petrov | Preslav Nakov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Propaganda aims at influencing people’s mindset with the purpose of advancing a specific agenda. Previous work has addressed propaganda detection at document level, typically labelling all articles from a propagandistic news outlet as propaganda. Such noisy gold labels inevitably affect the quality of any learning system trained on them. A further issue with most existing systems is the lack of explainability. To overcome these limitations, we propose a novel task: performing fine-grained analysis of texts by detecting all fragments that contain propaganda techniques as well as their type. In particular, we create a corpus of news articles manually annotated at fragment level with eighteen propaganda techniques and propose a suitable evaluation measure. We further design a novel multi-granularity neural network, and we show that it outperforms several strong BERT-based baselines.

pdf bib
Tanbih: Get To Know What You Are Reading
Yifan Zhang | Giovanni Da San Martino | Alberto Barrón-Cedeño | Salvatore Romeo | Jisun An | Haewoon Kwak | Todor Staykovski | Israa Jaradat | Georgi Karadzhov | Ramy Baly | Kareem Darwish | James Glass | Preslav Nakov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

We introduce Tanbih, a news aggregator with intelligent analysis tools to help readers understanding what’s behind a news story. Our system displays news grouped into events and generates media profiles that show the general factuality of reporting, the degree of propagandistic content, hyper-partisanship, leading political ideology, general frame of reporting, and stance with respect to various claims and topics of a news outlet. In addition, we automatically analyse each article to detect whether it is propagandistic and to determine its stance with respect to a number of controversial topics.

pdf bib
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
Anna Feldman | Giovanni Da San Martino | Alberto Barrón-Cedeño | Chris Brew | Chris Leberknight | Preslav Nakov
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

pdf bib
Findings of the NLP4IF-2019 Shared Task on Fine-Grained Propaganda Detection
Giovanni Da San Martino | Alberto Barrón-Cedeño | Preslav Nakov
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

We present the shared task on Fine-Grained Propaganda Detection, which was organized as part of the NLP4IF workshop at EMNLP-IJCNLP 2019. There were two subtasks. FLC is a fragment-level task that asks for the identification of propagandist text fragments in a news article and also for the prediction of the specific propaganda technique used in each such fragment (18-way classification task). SLC is a sentence-level binary classification task asking to detect the sentences that contain propaganda. A total of 12 teams submitted systems for the FLC task, 25 teams did so for the SLC task, and 14 teams eventually submitted a system description paper. For both subtasks, most systems managed to beat the baseline by a sizable margin. The leaderboard and the data from the competition are available at http://propaganda.qcri.org/nlp4if-shared-task/.

pdf bib
Team Jack Ryder at SemEval-2019 Task 4: Using BERT Representations for Detecting Hyperpartisan News
Daniel Shaprin | Giovanni Da San Martino | Alberto Barrón-Cedeño | Preslav Nakov
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe the system submitted by the Jack Ryder team to SemEval-2019 Task 4 on Hyperpartisan News Detection. The task asked participants to predict whether a given article is hyperpartisan, i.e., extreme-left or extreme-right. We proposed an approach based on BERT with fine-tuning, which was ranked 7th out 28 teams on the distantly supervised dataset, where all articles from a hyperpartisan/non-hyperpartisan news outlet are considered to be hyperpartisan/non-hyperpartisan. On a manually annotated test dataset, where human annotators double-checked the labels, we were ranked 29th out of 42 teams.

pdf bib
Team QCRI-MIT at SemEval-2019 Task 4: Propaganda Analysis Meets Hyperpartisan News Detection
Abdelrhman Saleh | Ramy Baly | Alberto Barrón-Cedeño | Giovanni Da San Martino | Mitra Mohtarami | Preslav Nakov | James Glass
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe our submission to SemEval-2019 Task 4 on Hyperpartisan News Detection. We rely on a variety of engineered features originally used to detect propaganda. This is based on the assumption that biased messages are propagandistic and promote a particular political cause or viewpoint. In particular, we trained a logistic regression model with features ranging from simple bag of words to vocabulary richness and text readability. Our system achieved 72.9% accuracy on the manually annotated testset, and 60.8% on the test data that was obtained with distant supervision. Additional experiments showed that significant performance gains can be achieved with better feature pre-processing.

2018

pdf bib
A Flexible, Efficient and Accurate Framework for Community Question Answering Pipelines
Salvatore Romeo | Giovanni Da San Martino | Alberto Barrón-Cedeño | Alessandro Moschitti
Proceedings of ACL 2018, System Demonstrations

Although deep neural networks have been proving to be excellent tools to deliver state-of-the-art results, when data is scarce and the tackled tasks involve complex semantic inference, deep linguistic processing and traditional structure-based approaches, such as tree kernel methods, are an alternative solution. Community Question Answering is a research area that benefits from deep linguistic analysis to improve the experience of the community of forum users. In this paper, we present a UIMA framework to distribute the computation of cQA tasks over computer clusters such that traditional systems can scale to large datasets and deliver fast processing.

2017

pdf bib
KeLP at SemEval-2017 Task 3: Learning Pairwise Patterns in Community Question Answering
Simone Filice | Giovanni Da San Martino | Alessandro Moschitti
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes the KeLP system participating in the SemEval-2017 community Question Answering (cQA) task. The system is a refinement of the kernel-based sentence pair modeling we proposed for the previous year challenge. It is implemented within the Kernel-based Learning Platform called KeLP, from which we inherit the team’s name. Our primary submission ranked first in subtask A, and third in subtasks B and C, being the only systems appearing in the top-3 ranking for all the English subtasks. This shows that the proposed framework, which has minor variations among the three subtasks, is extremely flexible and effective in tackling learning tasks defined on sentence pairs.

pdf bib
Building Chatbots from Forum Data: Model Selection Using Question Answering Metrics
Martin Boyanov | Preslav Nakov | Alessandro Moschitti | Giovanni Da San Martino | Ivan Koychev
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We propose to use question answering (QA) data from Web forums to train chat-bots from scratch, i.e., without dialog data. First, we extract pairs of question and answer sentences from the typically much longer texts of questions and answers in a forum. We then use these shorter texts to train seq2seq models in a more efficient way. We further improve the parameter optimization using a new model selection strategy based on QA measures. Finally, we propose to use extrinsic evaluation with respect to a QA task as an automatic evaluation method for chatbot systems. The evaluation shows that the model achieves a MAP of 63.5% on the extrinsic task. Moreover, our manual evaluation demonstrates that the model can answer correctly 49.5% of the questions when they are similar in style to how questions are asked in the forum, and 47.3% of the questions, when they are more conversational in style.

pdf bib
The Impact of Modeling Overall Argumentation with Tree Kernels
Henning Wachsmuth | Giovanni Da San Martino | Dora Kiesel | Benno Stein
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Several approaches have been proposed to model either the explicit sequential structure of an argumentative text or its implicit hierarchical structure. So far, the adequacy of these models of overall argumentation remains unclear. This paper asks what type of structure is actually important to tackle downstream tasks in computational argumentation. We analyze patterns in the overall argumentation of texts from three corpora. Then, we adapt the idea of positional tree kernels in order to capture sequential and hierarchical argumentative structure together for the first time. In systematic experiments for three text classification tasks, we find strong evidence for the impact of both types of structure. Our results suggest that either of them is necessary while their combination may be beneficial.

2016

pdf bib
Neural Attention for Learning to Rank Questions in Community Question Answering
Salvatore Romeo | Giovanni Da San Martino | Alberto Barrón-Cedeño | Alessandro Moschitti | Yonatan Belinkov | Wei-Ning Hsu | Yu Zhang | Mitra Mohtarami | James Glass
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In real-world data, e.g., from Web forums, text is often contaminated with redundant or irrelevant content, which leads to introducing noise in machine learning algorithms. In this paper, we apply Long Short-Term Memory networks with an attention mechanism, which can select important parts of text for the task of similar question retrieval from community Question Answering (cQA) forums. In particular, we use the attention weights for both selecting entire sentences and their subparts, i.e., word/chunk, from shallow syntactic trees. More interestingly, we apply tree kernels to the filtered text representations, thus exploiting the implicit features of the subtree space for learning question reranking. Our results show that the attention-based pruning allows for achieving the top position in the cQA challenge of SemEval 2016, with a relatively large gap from the other participants while greatly decreasing running time.

pdf bib
Selecting Sentences versus Selecting Tree Constituents for Automatic Question Ranking
Alberto Barrón-Cedeño | Giovanni Da San Martino | Salvatore Romeo | Alessandro Moschitti
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Community question answering (cQA) websites are focused on users who query questions onto an online forum, expecting for other users to provide them answers or suggestions. Unlike other social media, the length of the posted queries has no limits and queries tend to be multi-sentence elaborations combining context, actual questions, and irrelevant information. We approach the problem of question ranking: given a user’s new question, to retrieve those previously-posted questions which could be equivalent, or highly relevant. This could prevent the posting of nearly-duplicate questions and provide the user with instantaneous answers. For the first time in cQA, we address the selection of relevant text —both at sentence- and at constituent-level— for parse tree-based representations. Our supervised models for text selection boost the performance of a tree kernel-based machine learning model, allowing it to overtake the current state of the art on a recently released cQA evaluation framework.

pdf bib
An Interactive System for Exploring Community Question Answering Forums
Enamul Hoque | Shafiq Joty | Lluís Màrquez | Alberto Barrón-Cedeño | Giovanni Da San Martino | Alessandro Moschitti | Preslav Nakov | Salvatore Romeo | Giuseppe Carenini
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present an interactive system to provide effective and efficient search capabilities in Community Question Answering (cQA) forums. The system integrates state-of-the-art technology for answer search with a Web-based user interface specifically tailored to support the cQA forum readers. The answer search module automatically finds relevant answers for a new question by exploring related questions and the comments within their threads. The graphical user interface presents the search results and supports the exploration of related information. The system is running live at http://www.qatarliving.com/betasearch/.

pdf bib
QCRI at SemEval-2016 Task 4: Probabilistic Methods for Binary and Ordinal Quantification
Giovanni Da San Martino | Wei Gao | Fabrizio Sebastiani
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
ConvKN at SemEval-2016 Task 3: Answer and Question Selection for Question Answering on Arabic and English Fora
Alberto Barrón-Cedeño | Daniele Bonadiman | Giovanni Da San Martino | Shafiq Joty | Alessandro Moschitti | Fahad Al Obaidli | Salvatore Romeo | Kateryna Tymoshenko | Antonio Uva
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Global Thread-level Inference for Comment Classification in Community Question Answering
Shafiq Joty | Alberto Barrón-Cedeño | Giovanni Da San Martino | Simone Filice | Lluís Màrquez | Alessandro Moschitti | Preslav Nakov
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Structural Representations for Learning Relations between Pairs of Texts
Simone Filice | Giovanni Da San Martino | Alessandro Moschitti
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Thread-Level Information for Comment Classification in Community Question Answering
Alberto Barrón-Cedeño | Simone Filice | Giovanni Da San Martino | Shafiq Joty | Lluís Màrquez | Preslav Nakov | Alessandro Moschitti
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
QCRI: Answer Selection for Community Question Answering - Experiments for Arabic and English
Massimo Nicosia | Simone Filice | Alberto Barrón-Cedeño | Iman Saleh | Hamdy Mubarak | Wei Gao | Preslav Nakov | Giovanni Da San Martino | Alessandro Moschitti | Kareem Darwish | Lluís Màrquez | Shafiq Joty | Walid Magdy
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)