Maria Pia di Buono


2020

UNIOR NLP at MWSA Task - GlobaLex 2020: Siamese LSTM with Attention for Word Sense Alignment
Raffaele Manna | Giulia Speranza | Maria Pia di Buono | Johanna Monti
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

In this paper we describe the system submitted to the ELEXIS Monolingual Word Sense Alignment Task. We test different systems, namely two types of LSTMs and a system based on a pretrained Bidirectional Encoder Representations from Transformers (BERT) model, to solve the task. The LSTM models use fastText pre-trained word vector features with different settings. For training the models, we did not combine external data with the dataset provided for the task. We select a subset of the proposed languages, namely a set of Romance languages, i.e., Italian, Spanish, and Portuguese, together with English and Dutch. The Siamese LSTM with attention and PoS tagging (LSTM-A) performed better than the other two systems, achieving a 5-Class Accuracy score of 0.844 in the Overall Results and ranking first among five teams.

From Linguistic Resources to Ontology-Aware Terminologies: Minding the Representation Gap
Giulia Speranza | Maria Pia di Buono | Johanna Monti | Federico Sangati
Proceedings of the 12th Language Resources and Evaluation Conference

Terminological resources have proven crucial in many applications, ranging from Computer-Aided Translation tools to authoring software and multilingual and cross-lingual information retrieval systems. Nonetheless, with the exception of a few felicitous examples, such as the IATE (Interactive Terminology for Europe) Termbank, many terminological resources are not available in standard formats, such as Term Base eXchange (TBX), thus preventing their sharing and reuse. Yet, these terminologies could be improved by associating the corresponding ontology-based information with them. The research described in the present contribution demonstrates the process and the methodologies adopted in the automatic conversion of this type of resource into TBX, together with their semantic enrichment based on the formalization of ontological information into terminologies. We present a proof-of-concept using the Italian Linguistic Resource for the Archaeological domain (developed according to the Thesauri and Guidelines of the Italian Central Institute for the Catalogue and Documentation). Further, we introduce the conversion tool developed to support the process of creating ontology-aware terminologies for improving interoperability and sharing of existing language technologies and data sets.

Terme-à-LLOD: Simplifying the Conversion and Hosting of Terminological Resources as Linked Data
Maria Pia di Buono | Philipp Cimiano | Mohammad Fazleh Elahi | Frank Grimm
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)

In recent years, there has been increasing interest in publishing lexicographic and terminological resources as linked data. The benefit of using linked data technologies to publish terminologies is that terminologies can be linked to each other, thus creating a cloud of linked terminologies that cross domains, languages and that support advanced applications that do not work with single terminologies but can exploit multiple terminologies seamlessly. We present Terme-à-LLOD (TAL), a new paradigm for transforming and publishing terminologies as linked data which relies on a virtualization approach. The approach rests on a preconfigured virtual image of a server that can be downloaded and installed. We describe our approach to simplifying the transformation and hosting of terminological resources in the remainder of this paper. We provide a proof-of-concept for this paradigm showing how to apply it to the conversion of the well-known IATE terminology as well as to various smaller terminologies. Further, we discuss how the implementation of our paradigm can be integrated into existing NLP service infrastructures that rely on virtualization technology. While we apply this paradigm to the transformation and hosting of terminologies as linked data, the paradigm can be applied to any other resource format as well.

2018

TakeLab at SemEval-2018 Task 7: Combining Sparse and Dense Features for Relation Classification in Scientific Texts
Martin Gluhak | Maria Pia di Buono | Abbas Akkasi | Jan Šnajder
Proceedings of The 12th International Workshop on Semantic Evaluation

We describe two systems for semantic relation classification with which we participated in the SemEval 2018 Task 7, subtask 1 on semantic relation classification: an SVM model and a CNN model. Both models combine dense pretrained word2vec features and handcrafted sparse features. For training the models, we combine the two datasets provided for the subtasks in order to balance the under-represented classes. The SVM model performed better than the CNN, achieving an F1-macro score of 69.98% on subtask 1.1 and 75.69% on subtask 1.2. The system ranked 7th among 28 submissions on subtask 1.1 and 7th among 20 submissions on subtask 1.2.

2017

Two Layers of Annotation for Representing Event Mentions in News Stories
Maria Pia di Buono | Martin Tutek | Jan Šnajder | Goran Glavaš | Bojana Dalbelo Bašić | Nataša Milić-Frayling
Proceedings of the 11th Linguistic Annotation Workshop

In this paper, we describe our preliminary study on annotating event mentions as a part of our research on high-precision news event extraction models. To this end, we propose a two-layer annotation scheme, designed to separately capture the functional and conceptual aspects of event mentions. We hypothesize that the precision of models can be improved by modeling and extracting separately the different aspects of news events, and then combining the extracted information by leveraging the complementarities of the models. In addition, we carry out a preliminary annotation using the proposed scheme and analyze the annotation quality in terms of inter-annotator agreement.

An Ontology-Based Method for Extracting and Classifying Domain-Specific Compositional Nominal Compounds
Maria Pia di Buono
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper, we present our preliminary study on an ontology-based method to extract and classify compositional nominal compounds in specific domains of knowledge. This method is based on the assumption that, by applying a conceptual model to represent a knowledge domain, it is possible to improve the extraction and classification of lexicon occurrences for that domain in a semi-automatic way. We explore the possibility of extracting and classifying a specific construction type (nominal compounds) spanning a specific domain (Cultural Heritage) and a specific language (Italian).

Predicting News Values from Headline Text and Emotions
Maria Pia di Buono | Jan Šnajder | Bojana Dalbelo Bašić | Goran Glavaš | Martin Tutek | Natasa Milic-Frayling
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

We present a preliminary study on predicting news values from headline text and emotions. We perform a multivariate analysis on a dataset manually annotated with news values and emotions, discovering interesting correlations among them. We then train two competitive machine learning models – an SVM and a CNN – to predict news values from headline text and emotions as features. We find that, while both models yield a satisfactory performance, some news values are more difficult to detect than others, while some profit more from including emotion information.

2016

Semi-automatic Parsing for Web Knowledge Extraction through Semantic Annotation
Maria Pia di Buono
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parsing Web information, namely parsing content to find relevant documents on the basis of a user’s query, represents a crucial step to guarantee fast and accurate Information Retrieval (IR). Generally, an automated approach to such a task is considered faster and cheaper than manual systems. Nevertheless, results do not seem to have a high level of accuracy; indeed, as Hjorland (2007) also states, using stochastic algorithms entails:
• Low precision due to the indexing of common Atomic Linguistic Units (ALUs) or sentences.
• Low recall caused by the presence of synonyms.
• Generic results arising from the use of too broad or too narrow terms.
Usually IR systems are based on an inverted text index, namely an index data structure storing a mapping from content to its locations in a database file, or in a document or a set of documents. In this paper we propose a system, by means of which we will develop a search engine able to process online documents, starting from a natural language query, and to return information to users. The proposed approach, based on the Lexicon-Grammar (LG) framework and its language formalization methodologies, aims at integrating a semantic annotation process for both query analysis and document retrieval.
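
The index structure described in the abstract above, a mapping from content to its locations in a set of documents, can be illustrated with a minimal sketch. This is not the system proposed in the paper: the function names, the whitespace tokenization, and the sample documents are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased token to the documents and positions where it occurs.

    `documents` is a dict of {doc_id: text}. Tokenization is a plain
    whitespace split, which is an illustrative simplification.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index


def lookup(index, term):
    """Return {doc_id: [positions]} for a term, or an empty dict if absent."""
    postings = index.get(term.lower())
    if not postings:
        return {}
    return {doc_id: list(positions) for doc_id, positions in postings.items()}


# Hypothetical sample documents, not taken from the paper.
documents = {
    "doc1": "roman amphora found near the harbour",
    "doc2": "the amphora is a roman storage vessel",
}
index = build_inverted_index(documents)
print(lookup(index, "amphora"))
```

A purely token-based index like this one matches surface forms only, so a query for a synonym of an indexed term finds nothing, which is the low-recall issue the abstract attributes to such approaches.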

2014

Terminology and Knowledge Representation. Italian Linguistic Resources for the Archaeological Domain
Maria Pia di Buono | Mario Monteleone | Annibale Elia
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

From Natural Language to Ontology Population in the Cultural Heritage Domain. A Computational Linguistics-based approach.
Maria Pia di Buono | Mario Monteleone
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents ongoing Natural Language Processing (NLP) research based on Lexicon-Grammar (LG) and aimed at improving knowledge management of the Cultural Heritage (CH) domain. We intend to demonstrate how our language formalization technique can be applied for both processing and populating a domain ontology. We also use NLP techniques for text extraction and mining to fill information gaps and improve access to cultural resources. The Linguistic Resources (LRs, i.e. electronic dictionaries) we built can be used in the structuring of effective Knowledge Management Systems (KMSs). In order to apply to Parts of Speech (POS) the classes and properties defined by the Conseil International des Musées (CIDOC) Conceptual Reference Model (CRM), we use Finite State Transducers/Automata (FSTs/FSA) and their variables built in the form of graphs. FSTs/FSA are also used for analysing corpora in order to retrieve recursive sentence structures, in which combinatorial and semantic constraints identify properties and denote relationships. Besides, FSTs/FSA are also used to match our electronic dictionary entries (ALUs, or Atomic Linguistic Units) to RDF subject, object and predicate (SKOS Core Vocabulary). This matching of linguistic data to RDF and their translation into SPARQL/SERQL path expressions allows the use of ALUs to process natural-language queries.

2013

Cross-Lingual Information Retrieval and Semantic Interoperability for Cultural Heritage Repositories
Johanna Monti | Mario Monteleone | Maria Pia di Buono | Federica Marano
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013