John Philip McCrae

Also published as: John McCrae, John P. McCrae


2020

pdf bib
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
Bharathi Raja Chakravarthi | Navya Jose | Shardul Suryawanshi | Elizabeth Sherly | John Philip McCrae
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

There is an increasing demand for sentiment analysis of text from social media which are mostly code-mixed. Systems trained on monolingual data fail for code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific for this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still performs better. Only a few datasets for popular languages such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff’s alpha above 0.8 for the dataset. We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.

pdf bib
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
Bharathi Raja Chakravarthi | Vigneshwaran Muralidaran | Ruba Priyadharshini | John Philip McCrae
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

pdf bib
Proceedings of the 2020 Globalex Workshop on Linked Lexicography
Ilan Kernerman | Simon Krek | John P. McCrae | Jorge Gracia | Sina Ahmadi | Besim Kabashi
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

pdf bib
Modelling Frequency and Attestations for OntoLex-Lemon
Christian Chiarcos | Maxim Ionov | Jesse de Does | Katrien Depuydt | Anas Fahad Khan | Sander Stolk | Thierry Declerck | John Philip McCrae
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora such as frequency information, links to attestations, and collocation data were considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward the proposal for a novel module for frequency, attestation and corpus information (FrAC), that not only covers the requirements of digital lexicography, but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary, describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.

pdf bib
NUIG at TIAD: Combining Unsupervised NLP and Graph Metrics for Translation Inference
John Philip McCrae | Mihael Arcan
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

In this paper, we present the NUIG system at the TIAD shard task. This system includes graph-based metrics calculated using novel algorithms, with an unsupervised document embedding tool called ONETA and an unsupervised multi-way neural machine translation method. The results are an improvement over our previous system and produce the highest precision among all systems in the task as well as very competitive F-Measure results. Incorporating features from other systems should be easy in the framework we describe in this paper, suggesting this could very easily be extended to an even stronger result.

pdf bib
ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text
Koustava Goswami | Priya Rani | Bharathi Raja Chakravarthi | Theodorus Fransen | John P. McCrae
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Code mixing is a common phenomena in multilingual societies where people switch from one language to another for various reasons. Recent advances in public communication over different social media sites have led to an increase in the frequency of code-mixed usage in written language. In this paper, we present the Generative Morphemes with Attention (GenMA) Model sentiment analysis system contributed to SemEval 2020 Task 9 SentiMix. The system aims to predict the sentiments of the given English-Hindi code-mixed tweets without using word-level language tags instead inferring this automatically using a morphological model. The system is based on a novel deep neural network (DNN) architecture, which has outperformed the baseline F1-score on the test data-set as well as the validation data-set. Our results can be found under the user name “koustava” on the “Sentimix Hindi English” page.

pdf bib
Some Issues with Building a Multilingual Wordnet
Francis Bond | Luis Morgado da Costa | Michael Wayne Goodman | John Philip McCrae | Ahti Lohk
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper we discuss the experience of bringing together over 40 different wordnets. We introduce some extensions to the GWA wordnet LMF format proposed in Vossen et al. (2016) and look at how this new information can be displayed. Notable extensions include: confidence, corpus frequency, orthographic variants, lexicalized and non-lexicalized synsets and lemmas, new parts of speech, and more. Many of these extensions already exist in multiple wordnets – the challenge was to find a compatible representation. To this end, we introduce a new version of the Open Multilingual Wordnet (Bond and Foster, 2013), that integrates a new set of tools that tests the extensions introduced by this new format, while also ensuring the integrity of the Collaborative Interlingual Index (CILI: Bond et al., 2016), avoiding the same new concept to be introduced through multiple projects.

pdf bib
A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment
Sina Ahmadi | John Philip McCrae | Sanni Nimb | Fahad Khan | Monica Monachini | Bolette Pedersen | Thierry Declerck | Tanja Wissik | Andrea Bellandi | Irene Pisani | Thomas Troelsgård | Sussi Olsen | Simon Krek | Veronika Lipp | Tamás Váradi | László Simon | András Gyorffy | Carole Tiberius | Tanneke Schoonheim | Yifat Ben Moshe | Maya Rudich | Raya Abu Ahmad | Dorielle Lonke | Kira Kovalenko | Margit Langemets | Jelena Kallas | Oksana Dereza | Theodorus Fransen | David Cillessen | David Lindemann | Mikel Alonso | Ana Salgado | José Luis Sancho | Rafael-J. Ureña-Ruiz | Jordi Porta Zamorano | Kiril Simov | Petya Osenova | Zara Kancheva | Ivaylo Radev | Ranka Stanković | Andrej Perdih | Dejan Gabrovsek
Proceedings of the 12th Language Resources and Evaluation Conference

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

pdf bib
Recent Developments for the Linguistic Linked Open Data Infrastructure
Thierry Declerck | John Philip McCrae | Matthias Hartung | Jorge Gracia | Christian Chiarcos | Elena Montiel-Ponsoda | Philipp Cimiano | Artem Revenko | Roser Saurí | Deirdre Lee | Stefania Racioppa | Jamal Abdul Nasir | Matthias Orlikowsk | Marta Lanau-Coronas | Christian Fäth | Mariano Rico | Mohammad Fazleh Elahi | Maria Khvalchik | Meritxell Gonzalez | Katharine Cooney
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the projects to existing standards.

pdf bib
Figure Me Out: A Gold Standard Dataset for Metaphor Interpretation
Omnia Zayed | John Philip McCrae | Paul Buitelaar
Proceedings of the 12th Language Resources and Evaluation Conference

Metaphor comprehension and understanding is a complex cognitive task that requires interpreting metaphors by grasping the interaction between the meaning of their target and source concepts. This is very challenging for humans, let alone computers. Thus, automatic metaphor interpretation is understudied in part due to the lack of publicly available datasets. The creation and manual annotation of such datasets is a demanding task which requires huge cognitive effort and time. Moreover, there will always be a question of accuracy and consistency of the annotated data due to the subjective nature of the problem. This work addresses these issues by presenting an annotation scheme to interpret verb-noun metaphoric expressions in text. The proposed approach is designed with the goal of reducing the workload on annotators and maintain consistency. Our methodology employs an automatic retrieval approach which utilises external lexical resources, word embeddings and semantic similarity to generate possible interpretations of identified metaphors in order to enable quick and accurate annotation. We validate our proposed approach by annotating around 1,500 metaphors in tweets which were annotated by six native English speakers. As a result of this work, we publish as linked data the first gold standard dataset for metaphor interpretation which will facilitate research in this area.

pdf bib
Contextual Modulation for Relation-Level Metaphor Identification
Omnia Zayed | John P. McCrae | Paul Buitelaar
Findings of the Association for Computational Linguistics: EMNLP 2020

Identifying metaphors in text is very challenging and requires comprehending the underlying comparison. The automation of this cognitive process has gained wide attention lately. However, the majority of existing approaches concentrate on word-level identification by treating the task as either single-word classification or sequential labelling without explicitly modelling the interaction between the metaphor components. On the other hand, while existing relation-level approaches implicitly model this interaction, they ignore the context where the metaphor occurs. In this work, we address these limitations by introducing a novel architecture for identifying relation-level metaphoric expressions of certain grammatical relations based on contextual modulation. In a methodology inspired by works in visual reasoning, our approach is based on conditioning the neural network computation on the deep contextualised features of the candidate expressions using feature-wise linear modulation. We demonstrate that the proposed architecture achieves state-of-the-art results on benchmark datasets. The proposed methodology is generic and could be applied to other textual classification problems that benefit from contextual interaction.

pdf bib
Bilingual Lexicon Induction across Orthographically-distinct Under-Resourced Dravidian Languages
Bharathi Raja Chakravarthi | Navaneethan Rajasekaran | Mihael Arcan | Kevin McGuinness | Noel E. O’Connor | John P. McCrae
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Bilingual lexicons are a vital tool for under-resourced languages and recent state-of-the-art approaches to this leverage pretrained monolingual word embeddings using supervised or semi-supervised approaches. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. Especially in the case of low-resourced languages, seed dictionaries are not readily available, and as such, these methods produce extremely weak results on these languages. In this work, we focus on the Dravidian languages, namely Tamil, Telugu, Kannada, and Malayalam, which are even more challenging as they are written in unique scripts. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates, whereby we demonstrate that the longest common sub-sequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times, making bilingual lexicon induction approaches feasible for such under-resourced languages.

pdf bib
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
Stella Markantonatou | John McCrae | Jelena Mitrović | Carole Tiberius | Carlos Ramisch | Ashwini Vaidya | Petya Osenova | Agata Savary
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

pdf bib
A Dataset for Troll Classification of TamilMemes
Shardul Suryawanshi | Bharathi Raja Chakravarthi | Pranav Verma | Mihael Arcan | John Philip McCrae | Paul Buitelaar
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among people. This exchange is not free from offensive, trolling or malicious contents targeting users or communities. One way of trolling is by making memes, which in most cases combines an image with a concept or catchphrase. The challenge of dealing with memes is that they are region-specific and their meaning is often obscured in humour or sarcasm. To facilitate the computational modelling of trolling in the memes for Indian languages, we created a meme dataset for Tamil (TamilMemes). We annotated and released the dataset containing suspected trolls and not-troll memes. In this paper, we use the a image classification to address the difficulties involved in the classification of troll memes with the existing methods. We found that the identification of a troll meme with such an image classifier is not feasible which has been corroborated with precision, recall and F1-score.

pdf bib
English WordNet 2020: Improving and Extending a WordNet for English using an Open-Source Methodology
John Philip McCrae | Alexandre Rademaker | Ewa Rudnicka | Francis Bond
Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020)

WordNet, while one of the most widely used resources for NLP, has not been updated for a long time, and as such a new project English WordNet has arisen to continue the development of the model under an open-source paradigm. In this paper, we detail the second release of this resource entitled “English WordNet 2020”. The work has focused firstly, on the introduction of new synsets and senses and developing guidelines for this and secondly, on the integration of contributions from other projects. We present the changes in this edition, which total over 15,000 changes over the previous release.

pdf bib
On the Linguistic Linked Open Data Infrastructure
Christian Chiarcos | Bettina Klimek | Christian Fäth | Thierry Declerck | John Philip McCrae
Proceedings of the 1st International Workshop on Language Technology Platforms

In this paper we describe the current state of development of the Linguistic Linked Open Data (LLOD) infrastructure, an LOD(sub-)cloud of linguistic resources, which covers various linguistic data bases, lexicons, corpora, terminology and metadata repositories.We give in some details an overview of the contributions made by the European H2020 projects “Prêt-à-LLOD” (‘Ready-to-useMultilingual Linked Language Data for Knowledge Services across Sectors’) and “ELEXIS” (‘European Lexicographic Infrastructure’) to the further development of the LLOD.

pdf bib
Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability
Georg Rehm | Dimitris Galanis | Penny Labropoulou | Stelios Piperidis | Martin Welß | Ricardo Usbeck | Joachim Köhler | Miltos Deligiannis | Katerina Gkirtzou | Johannes Fischer | Christian Chiarcos | Nils Feldhus | Julian Moreno-Schneider | Florian Kintzel | Elena Montiel | Víctor Rodríguez Doncel | John Philip McCrae | David Laqua | Irina Patricia Theile | Christian Dittmar | Kalina Bontcheva | Ian Roberts | Andrejs Vasiļjevs | Andis Lagzdiņš
Proceedings of the 1st International Workshop on Language Technology Platforms

With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest to implement in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER.

pdf bib
Unsupervised Deep Language and Dialect Identification for Short Texts
Koustava Goswami | Rajdeep Sarkar | Bharathi Raja Chakravarthi | Theodorus Fransen | John P. McCrae
Proceedings of the 28th International Conference on Computational Linguistics

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects, is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model understands the sentence constructions of languages by applying attention to character relations which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods and our model has outperformed state-of-the-art LI and DI systems in supervised settings.

pdf bib
Suggest me a movie for tonight: Leveraging Knowledge Graphs for Conversational Recommendation
Rajdeep Sarkar | Koustava Goswami | Mihael Arcan | John P. McCrae
Proceedings of the 28th International Conference on Computational Linguistics

Conversational recommender systems focus on the task of suggesting products to users based on the conversation flow. Recently, the use of external knowledge in the form of knowledge graphs has shown to improve the performance in recommendation and dialogue systems. Information from knowledge graphs aids in enriching those systems by providing additional information such as closely related products and textual descriptions of the items. However, knowledge graphs are incomplete since they do not contain all factual information present on the web. Furthermore, when working on a specific domain, knowledge graphs in its entirety contribute towards extraneous information and noise. In this work, we study several subgraph construction methods and compare their performance across the recommendation task. We incorporate pre-trained embeddings from the subgraphs along with positional embeddings in our models. Extensive experiments show that our method has a relative improvement of at least 5.62% compared to the state-of-the-art on multiple metrics on the recommendation task.

pdf bib
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)
Maxim Ionov | John P. McCrae | Christian Chiarcos | Thierry Declerck | Julia Bosque-Gil | Jorge Gracia
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)

pdf bib
Challenges of Word Sense Alignment: Portuguese Language Resources
Ana Salgado | Sina Ahmadi | Alberto Simões | John Philip McCrae | Rute Costa
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)

This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web technologies. The results obtained are useful for the discussion within the community.

pdf bib
CogALex-VI Shared Task: Bidirectional Transformer based Identification of Semantic Relations
Saurav Karmakar | John P. McCrae
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

This paper presents a bidirectional transformer based approach for recognising semantic relationships between a pair of words as proposed by CogALex VI shared task in 2020. The system presented here works by employing BERT embeddings of the words and passing the same over tuned neural network to produce a learning model for the pair of words and their relationships. Afterwards the very same model is used for the relationship between unknown words from the test set. CogALex VI provided Subtask 1 as the identification of relationship of three specific categories amongst English pair of words and the presented system opts to work on that. The resulted relationships of the unknown words are analysed here which shows a balanced performance in overall characteristics with some scope for improvement.

pdf bib
A Comparative Study of Different State-of-the-Art Hate Speech Detection Methods in Hindi-English Code-Mixed Data
Priya Rani | Shardul Suryawanshi | Koustava Goswami | Bharathi Raja Chakravarthi | Theodorus Fransen | John Philip McCrae
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

Hate speech detection in social media communication has become one of the primary concerns to avoid conflicts and curb undesired activities. In an environment where multilingual speakers switch among multiple languages, hate speech detection becomes a challenging task using methods that are designed for monolingual corpora. In our work, we attempt to analyze, detect and provide a comparative study of hate speech in a code-mixed social media text. We also provide a Hindi-English code-mixed data set consisting of Facebook and Twitter posts and comments. Our experiments show that deep learning models trained on this code-mixed corpus perform better.

pdf bib
Adaptation of Word-Level Benchmark Datasets for Relation-Level Metaphor Identification
Omnia Zayed | John Philip McCrae | Paul Buitelaar
Proceedings of the Second Workshop on Figurative Language Processing

Metaphor processing and understanding has attracted the attention of many researchers recently with an increasing number of computational approaches. A common factor among these approaches is utilising existing benchmark datasets for evaluation and comparisons. The availability, quality and size of the annotated data are among the main difficulties facing the growing research area of metaphor processing. The majority of current approaches pertaining to metaphor processing concentrate on word-level processing due to data availability. On the other hand, approaches that process metaphors on the relation-level ignore the context where the metaphoric expression. This is due to the nature and format of the available data. Word-level annotation is poorly grounded theoretically and is harder to use in downstream tasks such as metaphor interpretation. The conversion from word-level to relation-level annotation is non-trivial. In this work, we attempt to fill this research gap by adapting three benchmark datasets, namely the VU Amsterdam metaphor corpus, the TroFi dataset and the TSV dataset, to suit relation-level metaphor identification. We publish the adapted datasets to facilitate future research in relation-level metaphor processing.

2019

pdf bib
Identification of Adjective-Noun Neologisms using Pretrained Language Models
John Philip McCrae
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

Neologism detection is a key task in the constructing of lexical resources and has wider implications for NLP, however the identification of multiword neologisms has received little attention. In this paper, we show that we can effectively identify the distinction between compositional and non-compositional adjective-noun pairs by using pretrained language models and comparing this with individual word embeddings. Our results show that the use of these models significantly improves over baseline linguistic features, however the combination with linguistic features still further improves the results, suggesting the strength of a hybrid approach.

pdf bib
Adapting Term Recognition to an Under-Resourced Language: the Case of Irish
John P. McCrae | Adrian Doyle
Proceedings of the Celtic Language Technology Workshop

pdf bib
A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles
Adrian Doyle | John P. McCrae | Clodagh Downey
Proceedings of the Celtic Language Technology Workshop

pdf bib
WordNet Gloss Translation for Under-resourced Languages using Multilingual Neural Machine Translation
Bharathi Raja Chakravarthi | Mihael Arcan | John P. McCrae
Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation

2018

pdf bib
Automatic Enrichment of Terminological Resources: the IATE RDF Example
Mihael Arcan | Elena Montiel-Ponsoda | John P. McCrae | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
A Comparison Of Emotion Annotation Schemes And A New Annotated Data Set
Ian Wood | John P. McCrae | Vladimir Andryushechkin | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
A supervised approach to taxonomy extraction using word embeddings
Rajdeep Sarkar | John P. McCrae | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Teanga: A Linked Data based platform for Natural Language Processing
Housam Ziad | John P. McCrae | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Phrase-Level Metaphor Identification Using Distributed Representations of Word Meaning
Omnia Zayed | John Philip McCrae | Paul Buitelaar
Proceedings of the Workshop on Figurative Language Processing

Metaphor is an essential element of human cognition which is often used to express ideas and emotions that might be difficult to express using literal language. Processing metaphoric language is a challenging task for a wide range of applications ranging from text simplification to psychotherapy. Despite the variety of approaches that are trying to process metaphor, there is still a need for better models that mimic the human cognition while exploiting fewer resources. In this paper, we present an approach based on distributional semantics to identify metaphors on the phrase-level. We investigated the use of different word embeddings models to identify verb-noun pairs where the verb is used metaphorically. Several experiments are conducted to show the performance of the proposed approach on benchmark datasets.

pdf bib
Constructing an Annotated Corpus of Verbal MWEs for English
Abigail Walsh | Claire Bonial | Kristina Geeraert | John P. McCrae | Nathan Schneider | Clarissa Somers
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes the construction and annotation of a corpus of verbal MWEs for English, as part of the PARSEME Shared Task 1.1 on automatic identification of verbal MWEs. The criteria for corpus selection, the categories of MWEs used, and the training process are discussed, along with the particular issues that led to revisions in edition 1.1 of the annotation guidelines. Finally, an overview of the characteristics of the final annotated corpus is presented, as well as some discussion on inter-annotator agreement.

2016

pdf bib
The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud
John Philip McCrae | Christian Chiarcos | Francis Bond | Philipp Cimiano | Thierry Declerck | Gerard de Melo | Jorge Gracia | Sebastian Hellmann | Bettina Klimek | Steven Moran | Petya Osenova | Antonio Pareja-Lora | Jonathan Pool
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.

pdf bib
Expanding wordnets to new languages with multilingual sense disambiguation
Mihael Arcan | John Philip McCrae | Paul Buitelaar
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Princeton WordNet is one of the most important resources for natural language processing, but is only available for English. While it has been translated using the expand approach to many other languages, this is an expensive manual process. Therefore it would be beneficial to have a high-quality automatic translation approach that would support NLP techniques, which rely on WordNet in new languages. The translation of wordnets is fundamentally complex because of the need to translate all senses of a word including low frequency senses, which is very challenging for current machine translation approaches. For this reason we leverage existing translations of WordNet in other languages to identify contextual information for wordnet senses from a large set of generic parallel corpora. We evaluate our approach using 10 translated wordnets for European languages. Our experiment shows a significant improvement over translation without any contextual information. Furthermore, we evaluate how the choice of pivot languages affects performance of multilingual word sense disambiguation.

pdf bib
NUIG-UNLP at SemEval-2016 Task 1: Soft Alignment and Deep Learning for Semantic Textual Similarity
John Philip McCrae | Kartik Asooja | Nitish Aggarwal | Paul Buitelaar
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications
Christian Chiarcos | John Philip McCrae | Petya Osenova | Philipp Cimiano | Nancy Ide
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

pdf bib
Reconciling Heterogeneous Descriptions of Language Resources
John Philip McCrae | Philipp Cimiano | Victor Rodríguez Doncel | Daniel Vila-Suero | Jorge Gracia | Luca Matteis | Roberto Navigli | Andrejs Abele | Gabriela Vulcu | Paul Buitelaar
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

pdf bib
Linking Four Heterogeneous Language Resources as Linked Data
Benjamin Siemoneit | John Philip McCrae | Philipp Cimiano
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

2014

pdf bib
Bielefeld SC: Orthonormal Topic Modelling for Grammar Induction
John Philip McCrae | Philipp Cimiano
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Default Physical Measurements in SUMO
Francesca Quattri | Adam Pease | John P. McCrae
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

pdf bib
Modelling the Semantics of Adjectives in the Ontology-Lexicon Interface
John P. McCrae | Francesca Quattri | Christina Unger | Philipp Cimiano
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

pdf bib
Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0
Maud Ehrmann | Francesco Cecconi | Daniele Vannella | John Philip McCrae | Philipp Cimiano | Roberto Navigli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Recent years have witnessed a surge in the amount of semantic information published on the Web. Indeed, the Web of Data, a subset of the Semantic Web, has been increasing steadily in both volume and variety, transforming the Web into a ‘global database’ in which resources are linked across sites. Linguistic fields -- in a broad sense -- have not been left behind, and we observe a similar trend with the growth of linguistic data collections on the so-called ‘Linguistic Linked Open Data (LLOD) cloud’. While both Semantic Web and Natural Language Processing communities can obviously take advantage of this growing and distributed linguistic knowledge base, they are today faced with a new challenge, i.e., that of facilitating multilingual access to the Web of data. In this paper we present the publication of BabelNet 2.0, a wide-coverage multilingual encyclopedic dictionary and ontology, as Linked Data. The conversion made use of lemon, a lexicon model for ontologies particularly well-suited for this enterprise. The result is an interlinked multilingual (lexical) resource which can not only be accessed on the LOD, but also be used to enrich existing datasets with linguistic information, or to support the process of mapping datasets across languages.

2013

pdf bib
Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
John Philip McCrae | Philipp Cimiano | Roman Klinger
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Mining translations from the web of open linked data
John Philip McCrae | Philipp Cimiano
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

pdf bib
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data
Christian Chiarcos | Philipp Cimiano | Thierry Declerck | John P. McCrae
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

pdf bib
Linguistic Linked Open Data (LLOD). Introduction and Overview
Christian Chiarcos | Philipp Cimiano | Thierry Declerck | John P. McCrae
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

pdf bib
Releasing multimodal data as Linguistic Linked Open Data: An experience report
Peter Menke | John Philip McCrae | Philipp Cimiano
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

2012

pdf bib
Collaborative semantic editing of linked data lexica
John McCrae | Elena Montiel-Ponsoda | Philipp Cimiano
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The creation of language resources is a time-consuming process requiring the efforts of many people. The use of resources collaboratively created by non-linguistists can potentially ameliorate this situation. However, such resources often contain more errors compared to resources created by experts. For the particular case of lexica, we analyse the case of Wiktionary, a resource created along wiki principles and argue that through the use of a principled lexicon model, namely Lemon, the resulting data could be better understandable to machines. We then present a platform called Lemon Source that supports the creation of linked lexical data along the Lemon model. This tool builds on the concept of a semantic wiki to enable collaborative editing of the resources by many users concurrently. In this paper, we describe the model, the tool and present an evaluation of its usability based on a small group of users.

2011

pdf bib
Combining statistical and semantic approaches to the translation of ontologies and taxonomies
John McCrae | Mauricio Espinoza | Elena Montiel-Ponsoda | Guadalupe Aguado-de-Cea | Philipp Cimiano
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

2010

pdf bib
An ontology-driven system for detecting global health events
Nigel Collier | Reiko Matsuda Goodwin | John McCrae | Son Doan | Ai Kawazoe | Mike Conway | Asanee Kawtrakul | Koichi Takeuchi | Dinh Dien
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Search
Co-authors