Marco Idiart


2019

pdf bib
Unsupervised Compositionality Prediction of Nominal Compounds
Silvio Cordeiro | Aline Villavicencio | Marco Idiart | Carlos Ramisch
Computational Linguistics, Volume 45, Issue 1 - March 2019

Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models and human predictions, suggesting that they are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size in the ability of the model to predict compositionality, and of a uniform combination of the components for best results.

2018

pdf bib
The brWaC Corpus: A New Open Resource for Brazilian Portuguese
Jorge A. Wagner Filho | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Similarity Measures for the Detection of Clinical Conditions with Verbal Fluency Tasks
Felipe Paula | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Semantic Verbal Fluency tests have been used in the detection of certain clinical conditions, like Dementia. In particular, given a sequence of semantically related words, a large number of switches from one semantic class to another has been linked to clinical conditions. In this work, we investigate three similarity measures for automatically identifying switches in semantic chains: semantic similarity from a manually constructed resource, and word association strength and semantic relatedness, both calculated from corpora. This information is used for building classifiers to distinguish healthy controls from clinical cases with early stages of Alzheimer’s Disease and Mild Cognitive Deficits. The overall results indicate that for clinical conditions the classifiers that use these similarity measures outperform those that use a gold standard taxonomy.

pdf bib
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing
Marco Idiart | Alessandro Lenci | Thierry Poibeau | Aline Villavicencio
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

2017

pdf bib
LexSubNC: A Dataset of Lexical Substitution for Nominal Compounds
Rodrigo Wilkens | Leonardo Zilio | Silvio Ricardo Cordeiro | Felipe Paula | Carlos Ramisch | Marco Idiart | Aline Villavicencio
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

2016

pdf bib
Multiword Expressions in Child Language
Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of this work is to introduce CHILDES-MWE, which contains English CHILDES corpora automatically annotated with Multiword Expressions (MWEs) information. The result is a resource with almost 350,000 sentences annotated with more than 70,000 distinct MWEs of various types from both longitudinal and latitudinal corpora. This resource can be used for large scale language acquisition studies of how MWEs feature in child language. Focusing on compound nouns (CN), we then verify in a longitudinal study if there are differences in the distribution and compositionality of CNs in child-directed and child-produced sentences across ages. Moreover, using additional latitudinal data, we investigate if there are further differences in CN usage and in compositionality preferences. The results obtained for the child-produced sentences reflect CN distribution and compositionality in child-directed sentences.

pdf bib
Predicting the Compositionality of Nominal Compounds: Giving Word Embeddings a Hard Time
Silvio Cordeiro | Carlos Ramisch | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
How Naked is the Naked Truth? A Multilingual Lexicon of Nominal Compound Compositionality
Carlos Ramisch | Silvio Cordeiro | Leonardo Zilio | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations
Alexandre Salle | Aline Villavicencio | Marco Idiart
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2014

pdf bib
Nothing like Good Old Frequency: Studying Context Filters for Distributional Thesauri
Muntsa Padró | Marco Idiart | Aline Villavicencio | Carlos Ramisch
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Comparing Similarity Measures for Distributional Thesauri
Muntsa Padró | Marco Idiart | Aline Villavicencio | Carlos Ramisch
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Distributional thesauri have been applied for a variety of tasks involving semantic relatedness. In this paper, we investigate the impact of three parameters: similarity measures, frequency thresholds and association scores. We focus on the robustness and stability of the resulting thesauri, measuring inter-thesaurus agreement when testing different parameter values. The results obtained show that low-frequency thresholds affect thesaurus quality more than similarity measures, with more agreement found for increasing thresholds.These results indicate the sensitivity of distributional thesauri to frequency. Nonetheless, the observed differences do not transpose over extrinsic evaluation using TOEFL-like questions. While this may be specific to the task, we argue that a careful examination of the stability of distributional resources prior to application is needed.

2013

pdf bib
Language Acquisition and Probabilistic Models: keeping it simple
Aline Villavicencio | Marco Idiart | Robert Berwick | Igor Malioutov
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

pdf bib
A large scale annotated child language construction database
Aline Villavicencio | Beracah Yankama | Marco Idiart | Robert Berwick
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Large scale annotated corpora of child language can be of great value in assessing theoretical proposals regarding language acquisition models. For example, they can help determine whether the type and amount of data required by a proposed language acquisition model can actually be found in a naturalistic data sample. To this end, several recent efforts have augmented the CHILDES child language corpora with POS tagging and parsing information for languages such as English. With the increasing availability of robust NLP systems and electronic resources, these corpora can be further annotated with more detailed information about the properties of words, verb argument structure, and sentences. This paper describes such an initiative for combining information from various sources to extend the annotation of the English CHILDES corpora with linguistic, psycholinguistic and distributional information, along with an example illustrating an application of this approach to the extraction of verb alternation information. The end result, the English CHILDES Verb Construction Database, is an integrated resource containing information such as grammatical relations, verb semantic classes, and age of acquisition, enabling more targeted complex searches involving different levels of annotation that can facilitate a more detailed analysis of the linguistic input available to children.

pdf bib
An annotated English child language database
Aline Villavicencio | Beracah Yankama | Rodrigo Wilkens | Marco Idiart | Robert Berwick
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

pdf bib
Get out but don’t fall down: verb-particle constructions in child language
Aline Villavicencio | Marco Idiart | Carlos Ramisch | Vítor Araújo | Beracah Yankama | Robert Berwick
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

2008

pdf bib
Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity
Carlos Ramisch | Aline Villavicencio | Leonardo Moura | Marco Idiart
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

pdf bib
Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering
Aline Villavicencio | Valia Kordoni | Yi Zhang | Marco Idiart | Carlos Ramisch
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib
Automated Multiword Expression Prediction for Grammar Engineering
Yi Zhang | Valia Kordoni | Aline Villavicencio | Marco Idiart
Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties