Marco Basaldella


2020

COMETA: A Corpus for Medical Entity Linking in the Social Media
Marco Basaldella | Fangyu Liu | Ehsan Shareghi | Nigel Collier
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman’s language. Meanwhile, there is a growing need for applications that can understand the public’s voice in the health domain. To address this we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines from string- to neural-based models we shed light on the ability of these systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios. Our experimental results on COMETA illustrate that no golden bullet exists and even the best mainstream techniques still have a significant performance gap to fill, while the best solution relies on combining different views of data.

Natural Language Processing for Achieving Sustainable Development: the Case of Neural Labelling to Enhance Community Profiling
Costanza Conforti | Stephanie Hirmer | Dai Morgan | Marco Basaldella | Yau Ben Or
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In recent years, there has been an increasing interest in the application of Artificial Intelligence – and especially Machine Learning – to the field of Sustainable Development (SD). However, until now, NLP has not been systematically applied in this context. In this paper, we show the high potential of NLP to enhance project sustainability. In particular, we focus on the case of community profiling in developing countries, where, in contrast to the developed world, a notable data gap exists. Here, NLP could help to address the cost and time barrier of structuring qualitative data that prohibits its widespread use and associated benefits. We propose the new extreme multi-class multi-label Automatic User-Perceived Value classification task. We release Stories2Insights, an expert-annotated dataset of interviews carried out in Uganda, we provide a detailed corpus analysis, and we implement a number of strong neural baselines to address the task. Experimental results show that the problem is challenging, and leaves considerable room for future research at the intersection of NLP and SD.

2019

BioReddit: Word Embeddings for User-Generated Biomedical NLP
Marco Basaldella | Nigel Collier
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

Word embeddings, in their different shapes and iterations, have changed the natural language processing research landscape in recent years. The biomedical text processing field is no stranger to this revolution; however, scholars in the field largely trained their embeddings on scientific documents only, even when working on user-generated data. In this paper we show how training embeddings on a corpus of user-generated text collected from medical forums heavily influences the performance on downstream tasks, outperforming embeddings trained either on general-purpose data or on scientific papers when applied to user-generated content.

2017

Exploiting and Evaluating a Supervised, Multilanguage Keyphrase Extraction pipeline for under-resourced languages
Marco Basaldella | Muhammad Helmy | Elisa Antolli | Mihai Horia Popescu | Giuseppe Serra | Carlo Tasso
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This paper evaluates different techniques for building a supervised, multilanguage keyphrase extraction pipeline for languages which lack a gold standard. Starting from an unsupervised English keyphrase extraction pipeline, we implement pipelines for Arabic, Italian, Portuguese, and Romanian, and we build test collections for languages which lack one. Then, we add a Machine Learning module trained on a well-known English language corpus and we evaluate the performance not only on English but on the other languages as well. Finally, we repeat the same evaluation after training the pipeline on an Arabic language corpus to check whether using a language-specific corpus brings a further improvement in performance. On the five languages we analyzed, results show an improvement in performance when using a machine learning algorithm, even if such an algorithm is not trained and tested on the same language.

2016

Evaluating anaphora and coreference resolution to improve automatic keyphrase extraction
Marco Basaldella | Giorgia Chiaradia | Carlo Tasso
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper we analyze the effectiveness of using linguistic knowledge from coreference and anaphora resolution for improving the performance of supervised keyphrase extraction. In order to verify the impact of these features, we define a baseline keyphrase extraction system and evaluate its performance on a standard dataset using different machine learning algorithms. Then, we consider new sets of features by adding combinations of the linguistic features we propose and we evaluate the new performance of the system. We also use anaphora and coreference resolution to transform the documents, trying to simulate the cohesion process performed by the human mind. We found that our approach has a slightly positive impact on the performance of automatic keyphrase extraction, in particular when considering the ranking of the results.