El Moatez Billah Nagoudi


2020

AraNet: A Deep Learning Toolkit for Arabic Social Media
Muhammad Abdul-Mageed | Chiyu Zhang | Azadeh Hashemi | El Moatez Billah Nagoudi
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

We describe AraNet, a collection of deep learning tools for processing Arabic social media. Namely, we exploit an extensive host of both publicly available and novel social media datasets to train Bidirectional Encoder Representations from Transformers (BERT) models focused on social meaning extraction. AraNet models predict age, dialect, gender, emotion, irony, and sentiment. AraNet delivers state-of-the-art performance on a number of these tasks and performs competitively on others. AraNet is exclusively based on a deep learning framework, giving it the advantage of being feature-engineering free. To the best of our knowledge, AraNet is the first to perform predictions across such a wide range of tasks for Arabic NLP. As such, AraNet has the potential to meet critical needs. We publicly release AraNet to accelerate research and to facilitate model-based comparisons across the different tasks.

Understanding and Detecting Dangerous Speech in Social Media
Ali Alshehri | El Moatez Billah Nagoudi | Muhammad Abdul-Mageed
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

Social media communication has become a significant part of daily activity in modern societies. For this reason, ensuring safety on social media platforms is a necessity. Use of dangerous language such as physical threats in online environments is somewhat rare, yet remains highly important. Although several studies have addressed the related issue of detecting offensive and hateful language, dangerous speech has not previously been treated in any significant way. Motivated by these observations, we report our efforts to build a labeled dataset for dangerous speech. We also exploit our dataset to develop highly effective models to detect dangerous content. Our best model performs at 59.60% macro F1, significantly outperforming a competitive baseline.

Growing Together: Modeling Human Language Learning With n-Best Multi-Checkpoint Machine Translation
El Moatez Billah Nagoudi | Muhammad Abdul-Mageed | Hasan Cavusoglu
Proceedings of the Fourth Workshop on Neural Generation and Translation

We describe our submission to the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). We view MT models at various training stages (i.e., checkpoints) as human learners at different levels. Hence, we employ an ensemble of multiple checkpoints from the same model to generate translation sequences with various levels of fluency. From each checkpoint of our best model, we sample n-best sequences (n=10) with a beam width of 100. We achieve a macro F1 of 37.57 with a 6-checkpoint model ensemble on the official shared task test data, outperforming a baseline Amazon translation system (21.30 macro F1) and ultimately demonstrating the utility of our intuitive method.
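As a rough illustration of the multi-checkpoint idea, the sketch below pools the n-best hypotheses produced by several checkpoints into one candidate list. The abstract does not say how the lists are combined, so the de-duplicating union used here, the function name `merge_nbest`, and the toy hypotheses and scores are all assumptions made for illustration, not the authors' implementation.

```python
def merge_nbest(checkpoint_outputs, n=10):
    """Pool the top-n hypotheses of each checkpoint into one list.

    checkpoint_outputs: one list per checkpoint, each holding
    (translation, score) pairs where a higher score is better.
    Duplicates across checkpoints are kept once, in first-seen order.
    This pooling rule is an assumption, not taken from the paper.
    """
    seen = set()
    merged = []
    for hypotheses in checkpoint_outputs:
        # Keep only the n best-scoring hypotheses of this checkpoint.
        top_n = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:n]
        for text, _score in top_n:
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged


# Toy data: two hypothetical checkpoints with made-up log-probabilities.
toy_outputs = [
    [("I am happy", -0.1), ("I'm happy", -0.3), ("I am glad", -0.9)],
    [("I'm happy", -0.2), ("I feel happy", -0.4)],
]
pooled = merge_nbest(toy_outputs, n=2)
print(pooled)
```

With n=2, the third hypothesis of the first checkpoint is dropped and the duplicate from the second checkpoint is kept only once.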

Machine Generation and Detection of Arabic Manipulated and Fake News
El Moatez Billah Nagoudi | AbdelRahim Elmadany | Muhammad Abdul-Mageed | Tariq Alhindi
Proceedings of the Fifth Arabic Natural Language Processing Workshop

Fake news and deceptive machine-generated text are serious problems threatening modern societies, including in the Arab world. This motivates work on detecting false and manipulated stories online. However, a bottleneck for this research is the lack of sufficient data to train detection models. We present a novel method for automatically generating Arabic manipulated (and potentially fake) news stories. Our method is simple and depends only on the availability of true stories, which are abundant online, and a part-of-speech (POS) tagger. To facilitate future work, we dispense with both of these requirements altogether by providing AraNews, a novel and large POS-tagged news dataset that can be used off-the-shelf. Using stories generated based on AraNews, we carry out a human annotation study that casts light on the effects of machine manipulation on text veracity. The study also measures human ability to detect Arabic machine-manipulated text generated by our method. Finally, we develop the first models for detecting manipulated Arabic news and achieve state-of-the-art results on Arabic fake news detection (macro F1=70.06). Our models and data are publicly available.

2019

ArbEngVec: Arabic-English Cross-Lingual Word Embedding Model
Raki Lachraf | El Moatez Billah Nagoudi | Youcef Ayachi | Ahmed Abdelali | Didier Schwab
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Word Embeddings (WE) are increasingly popular and widely applied in many Natural Language Processing (NLP) applications due to their effectiveness in capturing semantic properties of words; Machine Translation (MT), Information Retrieval (IR) and Information Extraction (IE) are among such areas. In this paper, we propose ArbEngVec, an open-source resource that provides several Arabic-English cross-lingual word embedding models. To train our bilingual models, we use a large dataset with more than 93 million pairs of Arabic-English parallel sentences. In addition, we perform both extrinsic and intrinsic evaluations of the different word embedding model variants. The extrinsic evaluation assesses the performance of the models on the cross-language Semantic Textual Similarity (STS) task, while the intrinsic evaluation is based on the Word Translation (WT) task.

2018

ARB-SEN at SemEval-2018 Task1: A New Set of Features for Enhancing the Sentiment Intensity Prediction in Arabic Tweets
El Moatez Billah Nagoudi
Proceedings of The 12th International Workshop on Semantic Evaluation

This article describes our proposed Arabic sentiment analysis system, named ARB-SEN. The system is designed for the International Workshop on Semantic Evaluation 2018 (SemEval-2018), Task 1: Affect in Tweets. ARB-SEN proposes two supervised models to estimate the sentiment intensity in Arabic tweets. Both models use a set of features including sentiment lexicon, negation, word embedding and emotion symbol features. Our system combines these features to assist the sentiment analysis task. The ARB-SEN system achieves a correlation score of 0.720, ranking 6th among all participants in the valence intensity regression (V-reg) for the Arabic sub-task organized within the SemEval 2018 evaluation campaign.

2017

LIM-LIG at SemEval-2017 Task1: Enhancing the Semantic Similarity for Arabic Sentences with Vectors Weighting
El Moatez Billah Nagoudi | Jérémy Ferrero | Didier Schwab
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This article describes our proposed system, named LIM-LIG. The system is designed for SemEval 2017 Task 1: Semantic Textual Similarity (Track 1). LIM-LIG proposes an innovative enhancement to a word embedding-based model for measuring the semantic similarity of Arabic sentences. The main idea is to exploit the word representations as vectors in a multidimensional space to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to support the identification of words that are highly descriptive in each sentence. The LIM-LIG system achieves a Pearson’s correlation of 0.74633, ranking 2nd among all participants in the Arabic monolingual pairs STS task organized within the SemEval 2017 evaluation campaign.

Semantic Similarity of Arabic Sentences with Word Embeddings
El Moatez Billah Nagoudi | Didier Schwab
Proceedings of the Third Arabic Natural Language Processing Workshop

Semantic textual similarity is the basis of countless applications and plays an important role in diverse areas, such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system for calculating the semantic similarity of Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to support the identification of words that are highly descriptive in each sentence. The performance of our proposed system is confirmed through the Pearson correlation between our assigned semantic similarity scores and human judgments.
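
The weighting idea in this abstract can be sketched as follows: each sentence is represented by a weighted sum of its word vectors, and two sentences are compared by cosine similarity. This is a minimal sketch under stated assumptions, not the authors' system. The smoothed IDF formula, the optional per-tag `pos_weight` table, the function names, and the two-dimensional toy embeddings (with English placeholder tokens) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def idf_weights(corpus):
    """Smoothed inverse document frequency for a tokenized corpus.

    The smoothing (add one to both counts, then add 1.0) is an assumed
    choice; the abstract does not specify the exact IDF variant.
    """
    n_docs = len(corpus)
    doc_freq = Counter(word for sent in corpus for word in set(sent))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0
            for w, df in doc_freq.items()}


def sentence_vector(tokens, embeddings, idf, pos_tags=None, pos_weight=None):
    """Weighted sum of word vectors; unknown words are skipped."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for i, tok in enumerate(tokens):
        if tok not in embeddings:
            continue
        weight = idf.get(tok, 1.0)
        # Optional POS weighting: hypothetical per-tag multipliers.
        if pos_tags is not None and pos_weight is not None:
            weight *= pos_weight.get(pos_tags[i], 1.0)
        for d, value in enumerate(embeddings[tok]):
            vec[d] += weight * value
    return vec


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data (made up): a frequent function word and three content words.
toy_embeddings = {
    "the": [0.5, 0.5],
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}
toy_corpus = [["the", "cat"], ["the", "dog"], ["the", "car"]]
idf = idf_weights(toy_corpus)

v_cat = sentence_vector(["the", "cat"], toy_embeddings, idf)
v_dog = sentence_vector(["the", "dog"], toy_embeddings, idf)
v_car = sentence_vector(["the", "car"], toy_embeddings, idf)

sim_cat_dog = cosine(v_cat, v_dog)
sim_cat_car = cosine(v_cat, v_car)
print(sim_cat_dog > sim_cat_car)
```

Because "the" occurs in every toy sentence, it receives the lowest IDF weight, so the content words dominate each sentence vector and the "cat" sentence ends up closer to the "dog" sentence than to the "car" sentence.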