Felice Dell’Orletta


2020

pdf bib
“Voices of the Great War”: A Richly Annotated Corpus of Italian Texts on the First World War
Federico Boschetti | Irene De Felice | Stefano Dei Rossi | Felice Dell’Orletta | Michele Di Giorgio | Martina Miliani | Lucia C. Passaro | Angelica Puddu | Giulia Venturi | Nicola Labanca | Alessandro Lenci | Simonetta Montemagni
Proceedings of the 12th Language Resources and Evaluation Conference

“Voices of the Great War” is the first large corpus of Italian historical texts dating back to the period of First World War. This corpus differs from other existing resources in several respects. First, from the linguistic point of view it gives account of the wide range of varieties in which Italian was articulated in that period, namely from a diastratic (educated vs. uneducated writers), diaphasic (low/informal vs. high/formal registers) and diatopic (regional varieties, dialects) points of view. From the historical perspective, through a collection of texts belonging to different genres it represents different views on the war and the various styles of narrating war events and experiences. The final corpus is balanced along various dimensions, corresponding to the textual genre, the language variety used, the author type and the typology of conveyed contents. The corpus is fully annotated with lemmas, part-of-speech, terminology, and named entities. Significant corpus samples representative of the different “voices” have also been enriched with meta-linguistic and syntactic information. The layer of syntactic annotation forms the first nucleus of an Italian historical treebank complying with the Universal Dependencies standard. The paper illustrates the final resource, the methodology and tools used to build it, and the Web Interface for navigating it.

pdf bib
Invisible to People but not to Machines: Evaluation of Style-aware HeadlineGeneration in Absence of Reliable Human Judgment
Lorenzo De Mattei | Michele Cafagna | Felice Dell’Orletta | Malvina Nissim
Proceedings of the 12th Language Resources and Evaluation Conference

We automatically generate headlines that are expected to comply with the specific styles of two different Italian newspapers. Through a data alignment strategy and different training/testing settings, we aim at decoupling content from style and preserve the latter in generation. In order to evaluate the generated headlines’ quality in terms of their specific newspaper-compliance, we devise a fine-grained evaluation strategy based on automatic classification. We observe that our models do indeed learn newspaper-specific style. Importantly, we also observe that humans aren’t reliable judges for this task, since although familiar with the newspapers, they are not able to discern their specific styles even in the original human-written headlines. The utility of automatic evaluation goes therefore beyond saving the costs and hurdles of manual annotation, and deserves particular care in its design.

pdf bib
Profiling-UD: a Tool for Linguistic Profiling of Texts
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we introduce Profiling–UD, a new text analysis tool inspired to the principles of linguistic profiling that can support language variation research from different perspectives. It allows the extraction of more than 130 features, spanning across different levels of linguistic description. Beyond the large number of features that can be monitored, a main novelty of Profiling–UD is that it has been specifically devised to be multilingual since it is based on the Universal Dependencies framework. In the second part of the paper, we demonstrate the effectiveness of these features in a number of theoretical and applicative studies in which they were successfully used for text and author profiling.

pdf bib
Tracking the Evolution of Written Language Competence in L2 Spanish Learners
Alessio Miaschi | Sam Davidson | Dominique Brunato | Felice Dell’Orletta | Kenji Sagae | Claudia Helena Sanchez-Gutierrez | Giulia Venturi
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper we present an NLP-based approach for tracking the evolution of written language competence in L2 Spanish learners using a wide range of linguistic features automatically extracted from students’ written productions. Beyond reporting classification results for different scenarios, we explore the connection between the most predictive features and the teaching curriculum, finding that our set of linguistic features often reflect the explicit instructions that students receive during each course.

pdf bib
Linguistic Profiling of a Neural Language Model
Alessio Miaschi | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 28th International Conference on Computational Linguistics

In this paper we investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after a fine-tuning process and how this knowledge affects its predictions during several classification problems. We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that BERT is able to encode a wide range of linguistic characteristics, but it tends to lose this information when trained on specific downstream tasks. We also find that BERT’s capacity to encode different kind of linguistic properties has a positive influence on its predictions: the more it stores readable linguistic information of a sentence, the higher will be its capacity of predicting the expected label assigned to that sentence.

pdf bib
Contextual and Non-Contextual Word Embeddings: an in-depth Linguistic Investigation
Alessio Miaschi | Felice Dell’Orletta
Proceedings of the 5th Workshop on Representation Learning for NLP

In this paper we present a comparison between the linguistic knowledge encoded in the internal representations of a contextual Language Model (BERT) and a contextual-independent one (Word2vec). We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that, although BERT is capable of understanding the full context of each word in an input sequence, the implicit knowledge encoded in its aggregated sentence representations is still comparable to that of a contextual-independent model. We also find that BERT is able to encode sentence-level properties even within single-word embeddings, obtaining comparable or even superior results than those obtained with sentence representations.

2019

pdf bib
Linguistically-Driven Strategy for Concept Prerequisites Learning on Italian
Alessio Miaschi | Chiara Alzetta | Franco Alberto Cardillo | Felice Dell’Orletta
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present a new concept prerequisite learning method for Learning Object (LO) ordering that exploits only linguistic features extracted from textual educational resources. The method was tested in a cross- and in- domain scenario both for Italian and English. Additionally, we performed experiments based on a incremental training strategy to study the impact of the training set size on the classifier performances. The paper also introduces ITA-PREREQ, to the best of our knowledge the first Italian dataset annotated with prerequisite relations between pairs of educational concepts, and describe the automatic strategy devised to build it.

2018

pdf bib
Universal Dependencies and Quantitative Typological Trends. A Case Study on Word Order
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Is this Sentence Difficult? Do you Agree?
Dominique Brunato | Lorenzo De Mattei | Felice Dell’Orletta | Benedetta Iavarone | Giulia Venturi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we present a crowdsourcing-based approach to model the human perception of sentence complexity. We collect a large corpus of sentences rated with judgments of complexity for two typologically-different languages, Italian and English. We test our approach in two experimental scenarios aimed to investigate the contribution of a wide set of lexical, morpho-syntactic and syntactic phenomena in predicting i) the degree of agreement among annotators independently from the assigned judgment and ii) the perception of sentence complexity.

pdf bib
Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Maria Simi | Giulia Venturi
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Detection and correction of errors and inconsistencies in “gold treebanks” are becoming more and more central topics of corpus annotation. The paper illustrates a new incremental method for enhancing treebanks, with particular emphasis on the extension of error patterns across different textual genres and registers. Impact and role of corrections have been assessed in a dependency parsing experiment carried out with four different parsers, whose results are promising. For both evaluation datasets, the performance of parsers increases, in terms of the standard LAS and UAS measures and of a more focused measure taking into account only relations involved in error patterns, and at the level of individual dependencies.

2017

pdf bib
Stacked Sentence-Document Classifier Approach for Improving Native Language Identification
Andrea Cimino | Felice Dell’Orletta
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we describe the approach of the ItaliaNLP Lab team to native language identification and discuss the results we submitted as participants to the essay track of NLI Shared Task 2017. We introduce for the first time a 2-stacked sentence-document architecture for native language identification that is able to exploit both local sentence information and a wide set of general-purpose features qualifying the lexical and grammatical structure of the whole document. When evaluated on the official test set, our sentence-document stacked architecture obtained the best result among all the participants of the essay track with an F1 score of 0.8818.

pdf bib
On the order of Words in Italian: a Study on Genre vs Complexity
Dominique Brunato | Felice Dell’Orletta
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf bib
Dangerous Relations in Dependency Treebanks
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

2016

pdf bib
CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence
Alessia Barbagli | Pietro Lucisano | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present the CItA corpus (Corpus Italiano di Apprendenti L1), a collection of essays written by Italian L1 learners collected during the first and second year of lower secondary school. The corpus was built in the framework of an interdisciplinary study jointly carried out by computational linguistics and experimental pedagogists and aimed at tracking the development of written language competence over the years and students’ background information.

pdf bib
PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Thomas François | Philippe Blache
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

2015

pdf bib
Design and Annotation of the First Italian Corpus for Text Simplification
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of The 9th Linguistic Annotation Workshop

pdf bib
NLP–Based Readability Assessment of Health–Related Texts: a Case Study on Italian Informed Consent Forms
Giulia Venturi | Tommaso Bellandi | Felice Dell’Orletta | Simonetta Montemagni
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

2014

pdf bib
The PAISÀ Corpus of Italian Web Texts
Verena Lyding | Egon Stemle | Claudia Borghetti | Marco Brunello | Sara Castagnoli | Felice Dell’Orletta | Henrik Dittmann | Alessandro Lenci | Vito Pirrelli
Proceedings of the 9th Web as Corpus Workshop (WaC-9)

pdf bib
Assessing the Readability of Sentences: Which Corpora and Features?
Felice Dell’Orletta | Martijn Wieling | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
T2K^2: a System for Automatically Extracting and Organizing Knowledge from Texts
Felice Dell’Orletta | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present T2K^2, a suite of tools for automatically extracting domain―specific knowledge from collections of Italian and English texts. T2K^2 (Text―To―Knowledge v2) relies on a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine learning which are dynamically integrated to provide an accurate and incremental representation of the content of vast repositories of unstructured documents. Extracted knowledge ranges from domain―specific entities and named entities to the relations connecting them and can be used for indexing document collections with respect to different information types. T2K^2 also includes “linguistic profiling” functionalities aimed at supporting the user in constructing the acquisition corpus, e.g. in selecting texts belonging to the same genre or characterized by the same degree of specialization or in monitoring the “added value” of newly inserted documents. T2K^2 is a web application which can be accessed from any browser through a personal account which has been tested in a wide range of domains.

2013

pdf bib
Linguistic Profiling based on General–purpose Features and Native Language Identification
Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Unsupervised Linguistically-Driven Reliable Dependency Parses Detection and Self-Training for Adaptation to the Biomedical Domain
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing

pdf bib
Linguistic Profiling of Texts Across Textual Genres and Readability Levels. An Exploratory Study on Italian Fictional Prose
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

pdf bib
Genre-oriented Readability Assessment: a Case Study
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Workshop on Speech and Language Processing Tools in Education

2011

pdf bib
ULISSE: an Unsupervised Algorithm for Detecting Reliable Dependency Parses
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

pdf bib
READIT: Assessing Readability of Italian Texts with a View to Text Simplification
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

2010

pdf bib
Contrastive Filtering of Domain-Specific Multi-Word Terms from Different Types of Corpora
Francesca Bonin | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf bib
Improvements in Parsing the Index Thomisticus Treebank. Revision, Combination and a Feature Model for Medieval Latin
Marco Passarotti | Felice Dell’Orletta
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The creation of language resources for less-resourced languages like the historical ones benefits from the exploitation of language-independent tools and methods developed over the years by many projects for modern languages. Along these lines, a number of treebanks for historical languages started recently to arise, including treebanks for Latin. Among the Latin treebanks, the Index Thomisticus Treebank is a 68,000 token dependency treebank based on the Index Thomisticus by Roberto Busa SJ, which contains the opera omnia of Thomas Aquinas (118 texts) as well as 61 texts by other authors related to Thomas, for a total of approximately 11 million tokens. In this paper, we describe a number of modifications that we applied to the dependency parser DeSR, in order to improve the parsing accuracy rates on the Index Thomisticus Treebank. First, we adapted the parser to the specific processing of Medieval Latin, defining an ad-hoc configuration of its features. Then, in order to improve the accuracy rates provided by DeSR, we applied a revision parsing method and we combined the outputs produced by different algorithms. This allowed us to improve accuracy rates substantially, reaching results that are well beyond the state of the art of parsing for Latin.

pdf bib
Comparing the Influence of Different Treebank Annotations on Dependency Parsing
Cristina Bosco | Simonetta Montemagni | Alessandro Mazzei | Vincenzo Lombardo | Felice Dell’Orletta | Alessandro Lenci | Leonardo Lesmo | Giuseppe Attardi | Maria Simi | Alberto Lavelli | Johan Hall | Jens Nilsson | Joakim Nivre
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

As the interest of the NLP community grows to develop several treebanks also for languages other than English, we observe efforts towards evaluating the impact of different annotation strategies used to represent particular languages or with reference to particular tasks. This paper contributes to the debate on the influence of resources used for the training and development on the performance of parsing systems. It presents a comparative analysis of the results achieved by three different dependency parsers developed and tested with respect to two treebanks for the Italian language, namely TUT and ISST--TANL, which differ significantly at the level of both corpus composition and adopted dependency representations.

pdf bib
A Contrastive Approach to Multi-word Extraction from Domain-specific Corpora
Francesca Bonin | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a novel approach to multi-word terminology extraction combining a well-known automatic term recognition approach, the C--NC value method, with a contrastive ranking technique, aimed at refining obtained results either by filtering noise due to common words or by discerning between semantically different types of terms within heterogeneous terminologies. Differently from other contrastive methods proposed in the literature that focus on single terms to overcome the multi-word terms' sparsity problem, the proposed contrastive function is able to handle variation in low frequency events by directly operating on pre-selected multi-word terms. This methodology has been tested in two case studies carried out in the History of Art and Legal domains. Evaluation of achieved results showed that the proposed two--stage approach improves significantly multi--word term extraction results. In particular, for what concerns the legal domain it provides an answer to a well-known problem in the semi--automatic construction of legal ontologies, namely that of singling out law terms from terms of the specific domain being regulated.

2009

pdf bib
Reverse Revision and Linear Tree Combination for Dependency Parsing
Giuseppe Attardi | Felice Dell’Orletta
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2008

pdf bib
DeSRL: A Linear-Time Semantic Role Labeling System
Massimiliano Ciaramita | Giuseppe Attardi | Felice Dell’Orletta | Mihai Surdeanu
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

pdf bib
Multilingual Dependency Parsing and Domain Adaptation using DeSR
Giuseppe Attardi | Felice Dell’Orletta | Maria Simi | Atanas Chanev | Massimiliano Ciaramita
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib
Searching treebanks for functional constraints: cross-lingual experiments in grammatical relation assignment
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper reports on a detailed quantitative analysis of distributional language data of both Italian and Czech, highlighting the relative contribution of a number of distributed grammatical factors to sentence-based identification of subjects and direct objects. The work is based on a Maximum Entropy model of stochastic resolution of grammatical conflicting constraints, and is demonstrably capable of putting explanatory theoretical accounts to the challenging test of an extensive, usage-based empirical verification.

pdf bib
Probing the Space of Grammatical Variation: Induction of Cross-Lingual Grammatical Constraints from Treebanks
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006

2005

pdf bib
Climbing the Path to Grammar: A Maximum Entropy Model of Subject/Object Learning
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Workshop on Psychocomputational Models of Human Language Acquisition