Gerard de Melo


2020

Inducing Universal Semantic Tag Vectors
Da Huo | Gerard de Melo
Proceedings of the 12th Language Resources and Evaluation Conference

Given the well-established usefulness of part-of-speech tag annotations in many syntactically oriented downstream NLP tasks, the recently proposed notion of semantic tagging (Bjerva et al. 2016) aims at tagging words with tags informed by semantic distinctions, which are likely to be useful across a range of semantic tasks. To this end, their annotation scheme distinguishes, for instance, privative attributes from subsective ones. While annotated corpora exist, their size is limited and thus many words are out-of-vocabulary words. In this paper, we study to what extent we can automatically predict the tags associated with unseen words. We draw on large-scale word representation data to derive a large new Semantic Tag lexicon. Our experiments show that we can infer semantic tags for words with high accuracy both monolingually and cross-lingually.

Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation
Kshitij Shah | Gerard de Melo
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we explore the artificial generation of typographical errors based on real-world statistics. We first draw on a small set of annotated data to compute spelling error statistics. These are then invoked to introduce errors into substantially larger corpora. The generation methodology allows us to generate particularly challenging errors that require context-aware error detection. We use it to create a set of English language error detection and correction datasets. Finally, we examine the effectiveness of machine learning models for detecting and correcting errors based on this data.
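The idea of estimating error statistics from a small annotated set and then applying them to larger text can be illustrated with a minimal sketch. It is not the authors' implementation: it only models same-length character substitutions, and the names `substitution_stats` and `inject_typos`, the toy word pairs, and the `rate` parameter are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def substitution_stats(pairs):
    """Count character substitutions seen in (correct, misspelled) pairs."""
    stats = defaultdict(Counter)
    for correct, typo in pairs:
        if len(correct) != len(typo):
            continue  # this sketch only handles same-length substitutions
        for c, t in zip(correct, typo):
            if c != t:
                stats[c][t] += 1
    return stats


def inject_typos(text, stats, rate, rng):
    """With probability `rate`, swap a character for an observed confusion."""
    out = []
    for ch in text:
        confusions = stats.get(ch)
        if confusions and rng.random() < rate:
            chars = list(confusions.keys())
            weights = list(confusions.values())
            # Sample a replacement in proportion to how often it was observed.
            out.append(rng.choices(chars, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical annotated data: (correct word, observed misspelling).
annotated = [("their", "thier"), ("form", "from"), ("cat", "cst"), ("hat", "hst")]
stats = substitution_stats(annotated)

rng = random.Random(0)  # fixed seed so the run is repeatable
clean = "the cat sat on the mat"
noisy = inject_typos(clean, stats, rate=0.3, rng=rng)
print(noisy)
```

A context-aware setting would additionally need insertions, deletions, and real-word errors, which this sketch leaves out.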

Sentence Analogies: Linguistic Regularities in Sentence Embeddings
Xunjie Zhu | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics

While important properties of word vector representations have been studied extensively, far less is known about the properties of sentence vector representations. Word vectors are often evaluated by assessing to what degree they exhibit regularities with regard to relationships of the sort considered in word analogies. In this paper, we investigate to what extent commonly used sentence vector representation spaces also reflect certain kinds of regularities. We propose a number of schemes to induce evaluation data, based on lexical analogy data as well as semantic relationships between sentences. Our experiments consider a wide range of sentence embedding methods, including ones based on BERT-style contextual embeddings. We find that different models differ substantially in their ability to reflect such regularities.
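For readers unfamiliar with analogy-style evaluation, the following minimal sketch shows the standard vector-offset test (a is to b as c is to ?) applied to sentence vectors. It is an illustration only, not the evaluation scheme of the paper: the three-dimensional vectors are hand-made toy values, and `solve_analogy` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(a, b, c, candidates, embeddings):
    """Return the candidate closest (by cosine) to the offset vector b - a + c."""
    target = [
        eb - ea + ec
        for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    return max(candidates, key=lambda s: cosine(embeddings[s], target))


# Toy sentence embeddings (assumed values, not produced by any real model).
embeddings = {
    "A man is sleeping.": [1.0, 0.0, 1.0],
    "A woman is sleeping.": [0.0, 1.0, 1.0],
    "A man is running.": [1.0, 0.0, -1.0],
    "A woman is running.": [0.0, 1.0, -1.0],
    "A dog is barking.": [0.5, 0.5, 0.0],
}

answer = solve_analogy(
    "A man is sleeping.",
    "A woman is sleeping.",
    "A man is running.",
    candidates=["A woman is running.", "A dog is barking."],
    embeddings=embeddings,
)
print(answer)
```

In practice the vectors would come from a sentence encoder, and accuracy would be averaged over many such quadruples.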

Data Augmentation for Multiclass Utterance Classification – A Systematic Study
Binxia Xu | Siyuan Qiu | Jie Zhang | Yafang Wang | Xiaoyu Shen | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics

Utterance classification is a key component in many conversational systems. However, classifying real-world user utterances is challenging, as people may express their ideas and thoughts in manifold ways, and the amount of training data for some categories may be fairly limited, resulting in imbalanced data distributions. To alleviate these issues, we conduct a comprehensive survey regarding data augmentation approaches for text classification, including simple random resampling, word-level transformations, and neural text generation to cope with imbalanced data. Our experiments focus on multi-class datasets with a large number of data samples, which have not been systematically studied in previous work. The results show that the effectiveness of different data augmentation schemes depends on the nature of the dataset under consideration.

Cross-Lingual Emotion Lexicon Induction using Representation Alignment in Low-Resource Settings
Arun Ramachandran | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics

Emotion lexicons provide information about associations between words and emotions. They have proven useful in analyses of reviews, literary texts, and posts on social media, among other things. We evaluate the feasibility of deriving emotion lexicons cross-lingually, especially for low-resource languages, from existing emotion lexicons in resource-rich languages. For this, we start out from very small corpora to induce cross-lingually aligned vector spaces. Our study empirically analyses the effectiveness of the induced emotion lexicons by measuring translation precision and correlations with existing emotion lexicons, along with measurements on a downstream task of sentence emotion prediction.

Domain-Specific Sentiment Lexicons Induced from Labeled Documents
SM Mazharul Islam | Xin Dong | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics

Sentiment analysis is an area of substantial relevance both in industry and in academia, including for instance in social studies. Although supervised learning algorithms have advanced considerably in recent years, in many settings it remains more practical to apply an unsupervised technique. The latter are oftentimes based on sentiment lexicons. However, existing sentiment lexicons reflect an abstract notion of polarity and do not do justice to the substantial differences of word polarities between different domains. In this work, we draw on a collection of domain-specific data to induce a set of 24 domain-specific sentiment lexicons. We rely on initial linear models to induce initial word intensity scores, and then train new deep models based on word vector representations to overcome the scarcity of the original seed data. Our analysis shows substantial differences between domains, which make domain-specific sentiment lexicons a promising form of lexical resource in downstream tasks, and the predicted lexicons indeed perform effectively on tasks such as review classification and cross-lingual word sentiment prediction.

Query Distillation: BERT-based Distillation for Ensemble Ranking
Wangshu Zhang | Junhong Liu | Zujie Wen | Yafang Wang | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Recent years have witnessed substantial progress in the development of neural ranking networks, but also an increasingly heavy computational burden due to growing numbers of parameters and the adoption of model ensembles. Knowledge Distillation (KD) is a common solution to balance the effectiveness and efficiency. However, it is not straightforward to apply KD to ranking problems. Ranking Distillation (RD) has been proposed to address this issue, but only shows effectiveness on recommendation tasks. We present a novel two-stage distillation method for ranking problems that allows a smaller student model to be trained while benefitting from the better performance of the teacher model, providing better control of the inference latency and computational burden. We design a novel BERT-based ranking model structure for list-wise ranking to serve as our student model. All ranking candidates are fed to the BERT model simultaneously, such that the self-attention mechanism can enable joint inference to rank the document list. Our experiments confirm the advantages of our method, not just with regard to the inference latency but also in terms of higher-quality rankings compared to the original teacher model.

Interactive Question Clarification in Dialogue via Reinforcement Learning
Xiang Hu | Zujie Wen | Yafang Wang | Xiaolong Li | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Coping with ambiguous questions has been a perennial problem in real-world dialogue systems. Although clarification by asking questions is a common form of human interaction, it is hard to define appropriate questions to elicit more specific intents from a user. In this work, we propose a reinforcement model to clarify ambiguous questions by suggesting refinements of the original query. We first formulate a collection partitioning problem to select a set of labels enabling us to distinguish potential unambiguous intents. We list the chosen labels as intent phrases to the user for further confirmation. The selected label along with the original user query then serves as a refined query, for which a suitable response can more easily be identified. The model is trained using reinforcement learning with a deep policy network. We evaluate our model based on real-world user clicks and demonstrate significant improvements across several different experiments.

EmoTag1200: Understanding the Association between Emojis and Emotions
Abu Awal Md Shoeb | Gerard de Melo
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Given the growing ubiquity of emojis in language, there is a need for methods and resources that shed light on their meaning and communicative role. One conspicuous aspect of emojis is their use to convey affect in ways that may otherwise be non-trivial to achieve. In this paper, we seek to explore the connection between emojis and emotions by means of a new dataset consisting of human-solicited association ratings. We additionally conduct experiments to assess to what extent such associations can be inferred from existing data in an unsupervised manner. Our experiments show that this succeeds when high-quality word-level information is available.

2019

A Robust Self-Learning Framework for Cross-Lingual Text Classification
Xin Dong | Gerard de Melo
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Based on massive amounts of data, recent pretrained contextual representation models have made significant strides in advancing a number of different English NLP tasks. However, for other languages, relevant training data may be lacking, while state-of-the-art deep learning methods are known to be data-hungry. In this paper, we present an elegantly simple robust self-learning framework to include unlabeled non-English samples in the fine-tuning process of pretrained multilingual representation models. We leverage a multilingual model’s own predictions on unlabeled non-English data in order to obtain additional information that can be used during further fine-tuning. Compared with original multilingual models and other cross-lingual classification models, we observe significant gains in effectiveness on document and sentiment classification for a range of diverse languages.

Rhetorically Controlled Encoder-Decoder for Modern Chinese Poetry Generation
Zhiqiang Liu | Zuohui Fu | Jie Cao | Gerard de Melo | Yik-Cheung Tam | Cheng Niu | Jie Zhou
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Rhetoric is a vital element in modern poetry, and plays an essential role in improving its aesthetics. However, to date, it has not been considered in research on automatic poetry generation. In this paper, we propose a rhetorically controlled encoder-decoder for modern Chinese poetry generation. Our model relies on a continuous latent variable as a rhetoric controller to capture various rhetorical patterns in an encoder, and then incorporates rhetoric-based mixtures while generating modern Chinese poetry. For metaphor and personification, an automated evaluation shows that our model outperforms state-of-the-art baselines by a substantial margin, while human evaluation shows that our model generates better poems than baseline methods in terms of fluency, coherence, meaningfulness, and rhetorical aesthetics.

Using Multi-Sense Vector Embeddings for Reverse Dictionaries
Michael A. Hedderich | Andrew Yates | Dietrich Klakow | Gerard de Melo
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

Popular word embedding methods such as word2vec and GloVe assign a single vector representation to each word, even if a word has multiple distinct meanings. Multi-sense embeddings instead provide different vectors for each sense of a word. However, they typically cannot serve as a drop-in replacement for conventional single-sense embeddings, because the correct sense vector needs to be selected for each word. In this work, we study the effect of multi-sense embeddings on the task of reverse dictionaries. We propose a technique to easily integrate them into an existing neural network architecture using an attention mechanism. Our experiments demonstrate that large improvements can be obtained when employing multi-sense embeddings both in the input sequence as well as for the target representation. An analysis of the sense distributions and of the learned attention is provided as well.

EmoTag – Towards an Emotion-Based Analysis of Emojis
Abu Awal Md Shoeb | Shahab Raji | Gerard de Melo
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Despite being a fairly recent phenomenon, emojis have quickly become ubiquitous. Besides their extensive use in social media, they are now also invoked in customer surveys and feedback forms. Hence, there is a need for techniques to understand their sentiment and emotion. In this work, we provide a method to quantify the emotional association of basic emotions such as anger, fear, joy, and sadness for a set of emojis. We collect and process a unique corpus of 20 million emoji-centric tweets, such that we can capture rich emoji semantics using a comparably small dataset. We evaluate the induced emotion profiles of emojis with regard to their ability to predict word affect intensities as well as sentiment scores.

CITE: A Corpus of Image-Text Discourse Relations
Malihe Alikhani | Sreyasi Nag Chowdhury | Gerard de Melo | Matthew Stone
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This paper presents a novel crowd-sourced resource for multimodal discourse: our resource characterizes inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations. Like previous corpora annotating discourse structure between text arguments, such as the Penn Discourse Treebank, our new corpus aids in establishing a better understanding of natural communication and common-sense reasoning, while our findings have implications for a wide range of applications, such as understanding and generation of multimodal documents.

2018

FontLex: A Typographical Lexicon based on Affective Associations
Tugba Kulahcioglu | Gerard de Melo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Metaphor Suggestions based on a Semantic Metaphor Repository
Gerard de Melo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Video Captioning with Multi-Faceted Attention
Xiang Long | Chuang Gan | Gerard de Melo
Transactions of the Association for Computational Linguistics, Volume 6

Video captioning has attracted an increasing amount of interest, due in part to its potential for improved accessibility and information retrieval. While existing methods rely on different kinds of visual features and model architectures, they do not make full use of pertinent semantic cues. We present a unified and extensible framework to jointly leverage multiple sorts of visual features and semantic attributes. Our novel architecture builds on LSTMs with two multi-faceted attention layers. These first learn to automatically select the most salient visual features or semantic attributes, and then yield overall representations for the input and output of the sentence generation component via custom feature scaling operations. Experimental results on the challenging MSVD and MSR-VTT datasets show that our framework outperforms previous work and performs robustly even in the presence of added noise to the features and attributes.

Generating Fine-Grained Open Vocabulary Entity Type Descriptions
Rajarshi Bhowmik | Gerard de Melo
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While large-scale knowledge graphs provide vast amounts of structured facts about entities, a short textual description can often be useful to succinctly characterize an entity and its type. Unfortunately, many knowledge graph entities lack such textual descriptions. In this paper, we introduce a dynamic memory-based network that generates a short open vocabulary description of an entity by jointly leveraging induced fact embeddings as well as the dynamic context of the generated sequence of words. We demonstrate the ability of our architecture to discern relevant information for more accurate generation of type descriptions by pitting the system against several strong baselines.

A Helping Hand: Transfer Learning for Deep Sentiment Analysis
Xin Dong | Gerard de Melo
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Deep convolutional neural networks excel at sentiment polarity classification, but tend to require substantial amounts of training data, which moreover differs quite significantly between domains. In this work, we present an approach to feed generic cues into the training process of such networks, leading to better generalization abilities given limited training data. We propose to induce sentiment embeddings via supervision on extrinsic data, which are then fed into the model via a dedicated memory-based component. We observe significant gains in effectiveness on a range of different datasets in seven different languages.

Exploring Semantic Properties of Sentence Embeddings
Xunjie Zhu | Tingfeng Li | Gerard de Melo
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Neural vector representations are ubiquitous throughout all subfields of NLP. While word vectors have been studied in much detail, thus far only little light has been shed on the properties of sentence embeddings. In this paper, we assess to what extent prominent sentence embedding methods exhibit select semantic properties. We propose a framework that generates triplets of sentences to explore how changes in the syntactic structure or semantics of a given sentence affect the similarities obtained between their sentence embeddings.

2017

WebChild 2.0 : Fine-Grained Commonsense Knowledge Distillation
Niket Tandon | Gerard de Melo | Gerhard Weikum
Proceedings of ACL 2017, System Demonstrations

Multilingual Vector Representations of Words, Sentences, and Documents
Gerard de Melo
Proceedings of the IJCNLP 2017, Tutorial Abstracts

Neural vector representations are now ubiquitous in all subfields of natural language processing and text mining. While methods such as word2vec and GloVe are well-known, this tutorial focuses on multilingual and cross-lingual vector representations, not only of words but also of sentences and documents.

PACRR: A Position-Aware Neural IR Model for Relevance Matching
Kai Hui | Andrew Yates | Klaus Berberich | Gerard de Melo
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In order to adopt deep learning for information retrieval, models are needed that can capture all relevant information required to assess the relevance of a document to a given user query. While previous works have successfully captured unigram term matches, how to fully employ position-dependent information such as proximity and term dependencies has been insufficiently explored. In this work, we propose a novel neural IR model named PACRR aiming at better modeling position-dependent interactions between a query and a document. Extensive experiments on six years’ TREC Web Track data confirm that the proposed model yields better results under multiple benchmarks.

2016

The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud
John Philip McCrae | Christian Chiarcos | Francis Bond | Philipp Cimiano | Thierry Declerck | Gerard de Melo | Jorge Gracia | Sebastian Hellmann | Bettina Klimek | Steven Moran | Petya Osenova | Antonio Pareja-Lora | Jonathan Pool
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.

Medical Concept Embeddings via Labeled Background Corpora
Eneldo Loza Mencía | Gerard de Melo | Jinseok Nam
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In recent years, we have seen an increasing amount of interest in low-dimensional vector representations of words. Among other things, these facilitate computing word similarity and relatedness scores. The most well-known example of algorithms to produce representations of this sort are the word2vec approaches. In this paper, we investigate a new model to induce such vector spaces for medical concepts, based on a joint objective that exploits not only word co-occurrences but also manually labeled documents, as available from sources such as PubMed. Our extensive experimental analysis shows that our embeddings lead to significantly higher correlations with human similarity and relatedness assessments than previous work. Due to the simplicity and versatility of vector representations, these findings suggest that our resource can easily be used as a drop-in replacement to improve any systems relying on medical concept similarity measures.

Relation Classification via Multi-Level Attention CNNs
Linlin Wang | Zhu Cao | Gerard de Melo | Zhiyuan Liu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Visualizing and Curating Knowledge Graphs over Time and Space
Tong Ge | Yafang Wang | Gerard de Melo | Haofeng Li | Baoquan Chen
Proceedings of ACL-2016 System Demonstrations

Detecting Cross-Cultural Differences Using a Multilingual Topic Model
E.D. Gutiérrez | Ekaterina Shutova | Patricia Lichtenstein | Gerard de Melo | Luca Gilardi
Transactions of the Association for Computational Linguistics, Volume 4

Understanding cross-cultural differences has important implications for world affairs and many aspects of the life of society. Yet, the majority of text-mining methods to date focus on the analysis of monolingual texts. In contrast, we present a statistical model that simultaneously learns a set of common topics from multilingual, non-parallel data and automatically discovers the differences in perspectives on these topics across linguistic communities. We perform a behavioural evaluation of a subset of the differences identified by our model in English and Spanish to investigate their psychological validity.

2015

Semantic Information Extraction for Improved Word Embeddings
Jiaqiang Chen | Gerard de Melo
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

Sentiment-Aspect Extraction based on Restricted Boltzmann Machines
Linlin Wang | Kang Liu | Zhu Cao | Jun Zhao | Gerard de Melo
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Perceptually Grounded Selectional Preferences
Ekaterina Shutova | Niket Tandon | Gerard de Melo
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

Structured Learning for Taxonomy Induction with Belief Propagation
Mohit Bansal | David Burkett | Gerard de Melo | Dan Klein
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Embedding NomLex-BR nominalizations into OpenWordnet-PT
Alexandre Rademaker | Valeria de Paiva | Gerard de Melo | Livy Maria Real Coelho
Proceedings of the Seventh Global Wordnet Conference

OpenWordNet-PT: A Project Report
Alexandre Rademaker | Valeria de Paiva | Gerard de Melo | Livy Real | Maira Gatti
Proceedings of the Seventh Global Wordnet Conference

Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore (1929-2014)
Miriam R. L. Petruck | Gerard de Melo
Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore (1929-2014)

NomLex-PT: A Lexicon of Portuguese Nominalizations
Valeria de Paiva | Livy Real | Alexandre Rademaker | Gerard de Melo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents NomLex-PT, a lexical resource describing Portuguese nominalizations. NomLex-PT connects verbs to their nominalizations, thereby enabling NLP systems to observe the potential semantic relationships between the two words when analysing a text. NomLex-PT is freely available and encoded in RDF for easy integration with other resources. Most notably, we have integrated NomLex-PT with OpenWordNet-PT, an open Portuguese Wordnet.

Etymological Wordnet: Tracing The History of Words
Gerard de Melo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Research on the history of words has led to remarkable insights about language and also about the history of human civilization more generally. This paper presents the Etymological Wordnet, the first database that aims at making word origin information available as a large, machine-readable network of words in many languages. The information in this resource is obtained from Wiktionary. Extracting a network of etymological information from Wiktionary requires significant effort, as much of the etymological information is only given in prose. We rely on custom pattern matching techniques and mine a large network with over 500,000 word origin links as well as over 2 million derivational/compositional links.

Bring vs. MTRoget: Evaluating automatic thesaurus translation
Lars Borin | Jens Allwood | Gerard de Melo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Evaluation of automatic language-independent methods for language technology resource creation is difficult, and confounded by a largely unknown quantity, viz. to what extent typological differences among languages are significant for results achieved for one language or language pair to be applicable across languages generally. In the work presented here, as a simplifying assumption, language-independence is taken as axiomatic within certain specified bounds. We evaluate the automatic translation of Roget’s “Thesaurus” from English into Swedish using an independently compiled Roget-style Swedish thesaurus, S.C. Bring’s “Swedish vocabulary arranged into conceptual classes” (1930). Our expectation is that this explicit evaluation of one of the thesauruses created in the MTRoget project will provide a good estimate of the quality of the other thesauruses created using similar methods.

2013

Good, Great, Excellent: Global Inference of Semantic Intensities
Gerard de Melo | Mohit Bansal
Transactions of the Association for Computational Linguistics, Volume 1

Adjectives like good, great, and excellent are similar in meaning, but differ in intensity. Intensity order information is very useful for language learners as well as in several NLP tasks, but is missing in most lexical resources (dictionaries, WordNet, and thesauri). In this paper, we present a primarily unsupervised approach that uses semantics from Web-scale data (e.g., phrases like good but not excellent) to rank words by assigning them positions on a continuous scale. We rely on Mixed Integer Linear Programming to jointly determine the ranks, such that individual decisions benefit from global information. When ranking English adjectives, our global algorithm achieves substantial improvements over previous work on both pairwise and rank correlation metrics (specifically, 70% pairwise accuracy as compared to only 56% by previous work). Moreover, our approach can incorporate external synonymy information (increasing its pairwise accuracy to 78%) and extends easily to new languages. We also make our code and data freely available.

2012

UWN: A Large Multilingual Lexical Knowledge Base
Gerard de Melo | Gerhard Weikum
Proceedings of the ACL 2012 System Demonstrations

Empirical Comparisons of MASC Word Sense Annotations
Gerard de Melo | Collin F. Baker | Nancy Ide | Rebecca J. Passonneau | Christiane Fellbaum
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We analyze how different conceptions of lexical semantics affect sense annotations and how multiple sense inventories can be compared empirically, based on annotated text. Our study focuses on the MASC project, where data has been annotated using WordNet sense identifiers on the one hand, and FrameNet lexical units on the other. This allows us to compare the sense inventories of these lexical resources empirically rather than just theoretically, based on their glosses, leading to new insights. In particular, we compute contingency matrices and develop a novel measure, the Expected Jaccard Index, that quantifies the agreement between annotations of the same data based on two different resources even when they have different sets of categories.
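To make the notion of a contingency matrix between two sense inventories concrete, here is a minimal sketch over doubly annotated tokens. It computes the plain Jaccard index between one category from each inventory; it does not implement the Expected Jaccard Index, whose definition is not given in the abstract. The label names (`wn_sense_1`, `fn_unit_A`, and so on) and the helper names are hypothetical.

```python
from collections import Counter


def contingency(labels_a, labels_b):
    """Count how often each (label_a, label_b) pair co-occurs on the same token."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotation layers must cover the same tokens")
    return Counter(zip(labels_a, labels_b))


def jaccard(labels_a, labels_b, cat_a, cat_b):
    """Plain Jaccard index between the token sets tagged cat_a and cat_b."""
    items_a = {i for i, label in enumerate(labels_a) if label == cat_a}
    items_b = {i for i, label in enumerate(labels_b) if label == cat_b}
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


# Hypothetical annotations of the same five tokens under two inventories.
wordnet_style = ["wn_sense_1", "wn_sense_1", "wn_sense_2", "wn_sense_1", "wn_sense_2"]
framenet_style = ["fn_unit_A", "fn_unit_A", "fn_unit_B", "fn_unit_A", "fn_unit_A"]

table = contingency(wordnet_style, framenet_style)
overlap = jaccard(wordnet_style, framenet_style, "wn_sense_1", "fn_unit_A")
print(dict(table))
print(overlap)
```

Because the two inventories use different category sets, agreement has to be read off such cross-tabulations rather than from a direct label match.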

OpenWordNet-PT: An Open Brazilian Wordnet for Reasoning
Valeria de Paiva | Alexandre Rademaker | Gerard de Melo
Proceedings of COLING 2012: Demonstration Papers

Markov Chains for Robust Graph-Based Commonsense Information Extraction
Niket Tandon | Dheeraj Rajagopal | Gerard de Melo
Proceedings of COLING 2012: Demonstration Papers

2010

Untangling the Cross-Lingual Link Structure of Wikipedia
Gerard de Melo | Gerhard Weikum
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Providing Multilingual, Multimodal Answers to Lexical Database Queries
Gerard de Melo | Gerhard Weikum
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Language users are increasingly turning to electronic resources to address their lexical information needs, due to their convenience and their ability to simultaneously capture different facets of lexical knowledge in a single interface. In this paper, we discuss techniques to respond to a user's lexical queries by providing multilingual and multimodal information, and facilitating navigation along different types of links. To this end, structured information from sources like WordNet, Wikipedia, Wiktionary, as well as Web services is linked and integrated to provide a multi-faceted yet consistent response to user queries. The meanings of words in many different languages are characterized by mapping them to appropriate WordNet sense identifiers and adding multilingual gloss descriptions as well as example sentences. Relationships are derived from WordNet and Wiktionary to allow users to discover semantically related words, etymologically related words, alternative spellings, as well as misspellings. Last but not least, images, audio recordings, and geographical maps extracted from Wikipedia and Wiktionary allow for a multimodal experience.

2009

Extracting Sense-Disambiguated Example Sentences From Parallel Corpora
Gerard de Melo | Gerhard Weikum
Proceedings of the 1st Workshop on Definition Extraction

2008

Mapping Roget’s Thesaurus and WordNet to French
Gerard de Melo | Gerhard Weikum
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Roget’s Thesaurus and WordNet are very widely used lexical reference works. We describe an automatic mapping procedure that effectively produces French translations of the terms in these two resources. Our approach to the challenging task of disambiguation is based on structural statistics as well as measures of semantic relatedness that are utilized to learn a classification model for associations between entries in the thesaurus and French terms taken from bilingual dictionaries. By building and applying such models, we have produced French versions of Roget’s Thesaurus and WordNet with a considerable level of accuracy, which can be used for a variety of different purposes, by humans as well as in computational applications.