Aitor García Pablos


2020

pdf bib
Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT
Aitor García Pablos | Naiara Perez | Montse Cuadros
Proceedings of the 12th Language Resources and Evaluation Conference

Massive digital data processing provides a wide range of opportunities and benefits, but at the cost of endangering personal data privacy. Anonymisation consists in removing or replacing sensitive information from data, enabling its exploitation for different purposes while preserving the privacy of individuals. Over the years, a lot of automatic anonymisation systems have been proposed; however, depending on the type of data, the target language or the availability of training documents, the task remains challenging still. The emergence of novel deep-learning models during the last two years has brought large improvements to the state of the art in the field of Natural Language Processing. These advancements have been most noticeably led by BERT, a model proposed by Google in 2018, and the shared language models pre-trained on millions of documents. In this paper, we use a BERT-based sequence labelling model to conduct a series of anonymisation experiments on several clinical datasets in Spanish. We also compare BERT with other algorithms. The experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain specific feature engineering.

2016

pdf bib
A Comparison of Domain-based Word Polarity Estimation using different Word Embeddings
Aitor García Pablos | Montse Cuadros | German Rigau
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

A key point in Sentiment Analysis is to determine the polarity of the sentiment implied by a certain word or expression. In basic Sentiment Analysis systems this sentiment polarity of the words is accounted and weighted in different ways to provide a degree of positivity/negativity. Currently words are also modelled as continuous dense vectors, known as word embeddings, which seem to encode interesting semantic knowledge. With regard to Sentiment Analysis, word embeddings are used as features to more complex supervised classification systems to obtain sentiment classifiers. In this paper we compare a set of existing sentiment lexicons and sentiment lexicon generation techniques. We also show a simple but effective technique to calculate a word polarity value for each word in a domain using existing continuous word embeddings generation methods. Further, we also show that word embeddings calculated on in-domain corpus capture the polarity better than the ones calculated on general-domain corpus.