Mihael Arcan

Also published as: Mihael Arčan


2020

NUIG at TIAD: Combining Unsupervised NLP and Graph Metrics for Translation Inference
John Philip McCrae | Mihael Arcan
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

In this paper, we present the NUIG system at the TIAD shared task. This system combines graph-based metrics calculated using novel algorithms with an unsupervised document embedding tool called ONETA and an unsupervised multi-way neural machine translation method. The results are an improvement over our previous system and produce the highest precision among all systems in the task as well as very competitive F-Measure results. Incorporating features from other systems should be easy in the framework we describe in this paper, suggesting that it could readily be extended to achieve an even stronger result.

NUIG at SemEval-2020 Task 12: Pseudo Labelling for Offensive Content Classification
Shardul Suryawanshi | Mihael Arcan | Paul Buitelaar
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This work addresses the classification problem defined by sub-task A (English only) of the OffensEval 2020 challenge. We used a semi-supervised approach to classify given tweets into an offensive (OFF) or not-offensive (NOT) class. As the OffensEval 2020 dataset is loosely labelled with confidence scores given by unsupervised models, we used last year’s offensive language identification dataset (OLID) to label the OffensEval 2020 dataset. Our approach uses a pseudo-labelling method to annotate the current dataset. We trained four text classifiers on the OLID dataset, and the classifier with the highest macro-averaged F1-score was used to pseudo-label the OffensEval 2020 dataset. The model that performed best among the four text classifiers on the OLID dataset was then trained on the combined dataset of OLID and pseudo-labelled OffensEval 2020. We evaluated the classifiers on the OLID and OffensEval 2020 datasets using precision and recall, with macro-averaged F1-score as the primary evaluation metric. This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
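
The pseudo-labelling procedure described in this abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the two toy classifiers (`train_word_vote`, `train_majority`), the `pseudo_label` helper and the example sentences are all assumptions made for this sketch, standing in for the four text classifiers and the OLID and OffensEval 2020 data mentioned above.

```python
from collections import Counter

LABELS = ("OFF", "NOT")


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


def train_word_vote(data):
    """Toy classifier (assumption): each word votes by its per-class count."""
    counts = {label: Counter() for label in LABELS}
    for text, label in data:
        counts[label].update(text.lower().split())

    def predict(text):
        words = text.lower().split()
        off = sum(counts["OFF"][w] for w in words)
        not_off = sum(counts["NOT"][w] for w in words)
        return "OFF" if off > not_off else "NOT"

    return predict


def train_majority(data):
    """Toy baseline (assumption): always predict the majority class."""
    majority = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda text: majority


def pseudo_label(trainers, labelled, dev, unlabelled):
    """Select the best trainer by macro F1 on dev, label the unlabelled
    texts with it, then retrain the same trainer on the combined data."""
    dev_gold = [label for _, label in dev]

    def dev_score(name):
        model = trainers[name](labelled)
        return macro_f1(dev_gold, [model(text) for text, _ in dev])

    best = max(trainers, key=dev_score)
    model = trainers[best](labelled)
    pseudo = [(text, model(text)) for text in unlabelled]
    final_model = trainers[best](labelled + pseudo)
    return best, pseudo, final_model


# Hypothetical miniature data, for illustration only.
labelled = [("you idiot", "OFF"), ("stupid idiot", "OFF"),
            ("nice day", "NOT"), ("lovely day today", "NOT"),
            ("have a nice day", "NOT")]
dev = [("idiot", "OFF"), ("nice", "NOT")]
unlabelled = ["what an idiot", "lovely weather"]

trainers = {"word_vote": train_word_vote, "majority": train_majority}
best, pseudo, final_model = pseudo_label(trainers, labelled, dev, unlabelled)
```

In practice the candidate trainers would be real text classifiers, and the development set would be held-out labelled data rather than two hand-written examples.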

Bilingual Lexicon Induction across Orthographically-distinct Under-Resourced Dravidian Languages
Bharathi Raja Chakravarthi | Navaneethan Rajasekaran | Mihael Arcan | Kevin McGuinness | Noel E. O’Connor | John P. McCrae
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Bilingual lexicons are a vital tool for under-resourced languages, and recent state-of-the-art approaches to inducing them leverage pretrained monolingual word embeddings using supervised or semi-supervised methods. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. Especially in the case of low-resourced languages, seed dictionaries are not readily available, and as such, these methods produce extremely weak results on these languages. In this work, we focus on the Dravidian languages, namely Tamil, Telugu, Kannada, and Malayalam, which are even more challenging as they are written in unique scripts. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates, whereas we demonstrate that the longest common sub-sequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times over, making bilingual lexicon induction approaches feasible for such under-resourced languages.
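
To make the contrast between the two string measures named in this abstract concrete, the following is a minimal sketch of the longest common subsequence alongside the Levenshtein edit distance. The function names and the normalisation in `lcs_ratio` are assumptions chosen for illustration; the abstract does not specify how the authors score or threshold cognate candidates.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """LCS length divided by the longer string's length.

    This normalisation is an assumption; it returns 0.0 for two empty strings.
    """
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For the generic pair "kitten" and "sitting", `lcs_length` gives 4 (the subsequence "ittn") and `levenshtein` gives 3. Both functions operate on strings that have already been brought into a single script, as the abstract describes.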

A Dataset for Troll Classification of TamilMemes
Shardul Suryawanshi | Bharathi Raja Chakravarthi | Pranav Verma | Mihael Arcan | John Philip McCrae | Paul Buitelaar
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among people. This exchange is not free from offensive, trolling or malicious content targeting users or communities. One way of trolling is by making memes, which in most cases combine an image with a concept or catchphrase. The challenge of dealing with memes is that they are region-specific and their meaning is often obscured in humour or sarcasm. To facilitate the computational modelling of trolling in memes for Indian languages, we created a meme dataset for Tamil (TamilMemes). We annotated and released the dataset containing suspected troll and not-troll memes. In this paper, we use image classification to address the difficulties involved in the classification of troll memes with the existing methods. We found that the identification of a troll meme with such an image classifier is not feasible, which is corroborated by precision, recall and F1-score.

Suggest me a movie for tonight: Leveraging Knowledge Graphs for Conversational Recommendation
Rajdeep Sarkar | Koustava Goswami | Mihael Arcan | John P. McCrae
Proceedings of the 28th International Conference on Computational Linguistics

Conversational recommender systems focus on the task of suggesting products to users based on the conversation flow. Recently, the use of external knowledge in the form of knowledge graphs has been shown to improve performance in recommendation and dialogue systems. Information from knowledge graphs enriches those systems by providing additional information such as closely related products and textual descriptions of the items. However, knowledge graphs are incomplete since they do not contain all factual information present on the web. Furthermore, when working on a specific domain, knowledge graphs in their entirety contribute extraneous information and noise. In this work, we study several subgraph construction methods and compare their performance across the recommendation task. We incorporate pre-trained embeddings from the subgraphs along with positional embeddings in our models. Extensive experiments show that our method has a relative improvement of at least 5.62% compared to the state-of-the-art on multiple metrics on the recommendation task.

Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text
Shardul Suryawanshi | Bharathi Raja Chakravarthi | Mihael Arcan | Paul Buitelaar
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

A meme is a form of media that spreads an idea or emotion across the internet. As posting memes has become a new form of communication on the web, and due to the multimodal nature of memes, postings of hateful memes and related events such as trolling and cyberbullying are increasing day by day. Hate speech, offensive content and aggression content detection have been extensively explored in a single modality such as text or image. However, combining two modalities to detect offensive content is still a developing area. Memes make it even more challenging since they express humour and sarcasm in an implicit way, because of which the meme may not be offensive if we only consider the text or the image. Therefore, it is necessary to combine both modalities to identify whether a given meme is offensive or not. Since there was no publicly available dataset for multimodal offensive meme content detection, we leveraged the memes related to the 2016 U.S. presidential election and created the MultiOFF multimodal meme dataset for offensive content detection. We subsequently developed a classifier for this task using the MultiOFF dataset. We use an early fusion technique to combine the image and text modalities and compare it with a text-only and an image-only baseline to investigate its effectiveness. Our results show improvements in terms of Precision, Recall, and F-Score. The code and dataset for this paper are published at https://github.com/bharathichezhiyan/Multimodal-Meme-Classification-Identifying-Offensive-Content-in-Image-and-Text
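
Early fusion, as mentioned in this abstract, generally means combining the per-modality features before a single classifier sees them. The sketch below illustrates that general idea with plain feature concatenation followed by a logistic score. It is not the authors' model: the function names, the feature values and the weights are all placeholders assumed for illustration, and the paper's own encoders and classifier are in the repository linked above.

```python
import math


def early_fusion(text_features, image_features):
    """Concatenate per-modality feature vectors into one joint vector."""
    return list(text_features) + list(image_features)


def offensive_probability(fused, weights, bias=0.0):
    """Logistic score over the fused vector.

    The weights are placeholders (assumption); a real system would learn them.
    """
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    z = bias + sum(w * x for w, x in zip(weights, fused))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical feature vectors standing in for text and image encoder outputs.
fused = early_fusion([0.2, 0.7], [0.1, 0.9, 0.4])
score = offensive_probability(fused, weights=[0.5, -0.3, 0.8, 0.1, -0.6])
```

A text-only or image-only baseline corresponds to scoring just one of the two input vectors with its own weights, which is the comparison the abstract describes.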

2019

Passive Diagnosis Incorporating the PHQ-4 for Depression and Anxiety
Fionn Delahunty | Robert Johansson | Mihael Arcan
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

Depression and anxiety are the two most prevalent mental health disorders worldwide, impacting the lives of millions of people each year. In this work, we develop and evaluate a multilabel, multidimensional deep neural network designed to predict PHQ-4 scores based on individuals' written text. Our system outperforms random baseline metrics and provides a novel approach to predicting psychometric scores from written text. Additionally, we explore how this architecture can be applied to analyse social media data.

Leveraging Rule-Based Machine Translation Knowledge for Under-Resourced Neural Machine Translation Models
Daniel Torregrosa | Nivranshu Pasricha | Maraim Masoud | Bharathi Raja Chakravarthi | Juan Alonso | Noe Casas | Mihael Arcan
Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks

Multilingual Multimodal Machine Translation for Dravidian Languages utilizing Phonetic Transcription
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Bernardo Stearns | Arun Jayapal | Sridevy S | Mihael Arcan | Manel Zarrouk | John P McCrae
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation
Mihael Arcan | Marco Turchi | Jinhua Du | Dimitar Shterionov | Daniel Torregrosa
Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation

WordNet Gloss Translation for Under-resourced Languages using Multilingual Neural Machine Translation
Bharathi Raja Chakravarthi | Mihael Arcan | John P. McCrae
Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation

Neural Machine Translation of Literary Texts from English to Slovene
Taja Kuzman | Špela Vintar | Mihael Arčan
Proceedings of the Qualities of Literary Machine Translation

2018

Automatic Enrichment of Terminological Resources: the IATE RDF Example
Mihael Arcan | Elena Montiel-Ponsoda | John P. McCrae | Paul Buitelaar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

PE2rr Corpus: Manual Error Annotation of Automatically Pre-annotated MT Post-edits
Maja Popović | Mihael Arčan
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a freely available corpus containing source language texts from different domains along with their automatically generated translations into several distinct morphologically rich languages, their post-edited versions, and error annotations of the performed post-edit operations. We believe that the corpus will be useful for many different applications. The main advantage of the approach used for creation of the corpus is the fusion of post-editing and error classification tasks, which have usually been seen as two independent tasks, although naturally they are not. We also show benefits of coupling automatic and manual error classification which facilitates the complex manual error annotation task as well as the development of automatic error classification tools. In addition, the approach facilitates annotation of language pair related issues.

IRIS: English-Irish Machine Translation System
Mihael Arcan | Caoilfhionn Lane | Eoin Ó Droighneáin | Paul Buitelaar
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe IRIS, a statistical machine translation (SMT) system for translating from English into Irish and vice versa. Since Irish is considered an under-resourced language with a limited amount of machine-readable text, building a machine translation system that produces reasonable translations is rather challenging. As translation is a difficult task, current research in SMT focuses on obtaining statistics either from a large amount of parallel, monolingual or other multilingual resources. Nevertheless, we collected available English-Irish data and developed an SMT system aimed at supporting human translators and enabling cross-lingual language technology tasks.

Expanding wordnets to new languages with multilingual sense disambiguation
Mihael Arcan | John Philip McCrae | Paul Buitelaar
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Princeton WordNet is one of the most important resources for natural language processing, but is only available for English. While it has been translated using the expand approach to many other languages, this is an expensive manual process. Therefore it would be beneficial to have a high-quality automatic translation approach that would support NLP techniques, which rely on WordNet in new languages. The translation of wordnets is fundamentally complex because of the need to translate all senses of a word including low frequency senses, which is very challenging for current machine translation approaches. For this reason we leverage existing translations of WordNet in other languages to identify contextual information for wordnet senses from a large set of generic parallel corpora. We evaluate our approach using 10 translated wordnets for European languages. Our experiment shows a significant improvement over translation without any contextual information. Furthermore, we evaluate how the choice of pivot languages affects performance of multilingual word sense disambiguation.

Potential and Limits of Using Post-edits as Reference Translations for MT Evaluation
Maja Popovic | Mihael Arčan | Arle Lommel
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

Language Related Issues for Machine Translation between Closely Related South Slavic Languages
Maja Popović | Mihael Arčan | Filip Klubička
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

Machine translation between closely related languages is less challenging and exhibits a smaller number of translation errors than translation between distant languages, but there are still obstacles which should be addressed in order to improve such systems. This work explores the obstacles for machine translation systems between closely related South Slavic languages, namely Croatian, Serbian and Slovenian. Statistical systems for all language pairs and translation directions are trained using parallel texts from different domains, although mainly on spoken language, i.e. subtitles. For translation between Serbian and Croatian, a rule-based system is also explored. It is shown that for all language pairs and translation systems, the main obstacles are differences between structural properties.

2015

Identifying main obstacles for statistical machine translation of morphologically rich South Slavic languages
Maja Popovic | Mihael Arcan
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Poor man’s lemmatisation for automatic error classification
Maja Popovic | Mihael Arcan | Eleftherios Avramidis | Aljoscha Burchardt
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

MixedEmotions: Social Semantic Emotion Analysis for Innovative Multilingual Big Data Analytics Markets
Mihael Arcan | Paul Buitelaar
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Identifying main obstacles for statistical machine translation of morphologically rich South Slavic languages
Maja Popović | Mihael Arčan
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Poor man’s lemmatisation for automatic error classification
Maja Popović | Mihael Arčan | Eleftherios Avramidis | Aljoscha Burchardt | Arle Lommel
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Knowledge Portability with Semantic Expansion of Ontology Labels
Mihael Arcan | Marco Turchi | Paul Buitelaar
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

Identification of Bilingual Terms from Monolingual Documents for Statistical Machine Translation
Mihael Arcan | Claudio Giuliano | Marco Turchi | Paul Buitelaar
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)

2013

Linguistic Linked Data for Sentiment Analysis
Paul Buitelaar | Mihael Arcan | Carlos Iglesias | Fernando Sánchez-Rada | Carlo Strapparava
Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data

Ontology Label Translation
Mihael Arcan | Paul Buitelaar
Proceedings of the 2013 NAACL HLT Student Research Workshop

2012

Using Domain-specific and Collaborative Resources for Term Translation
Mihael Arcan | Christian Federmann | Paul Buitelaar
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Experiments with Term Translation
Mihael Arcan | Christian Federmann | Paul Buitelaar
Proceedings of COLING 2012