AlexU-AUX-BERT at SemEval-2020 Task 3: Improving BERT Contextual Similarity Using Multiple Auxiliary Contexts
Somaia Mahmoud | Marwan Torki
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the system we built for SemEval-2020 Task 3, which involves predicting similarity scores for a pair of words within two different contexts. Our system is based on both BERT embeddings and WordNet. We simply use cosine similarity to find the closest synset of the target words. Our results show that this simple approach greatly improves system performance. Our model ranked 3rd in subtask 2 of SemEval-2020 Task 3.
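
The closest-match step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each candidate synset is already represented by a vector; the toy 3-dimensional vectors, the candidate names, and the helper `closest_candidate` are hypothetical and are not taken from the paper, which works with BERT embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def closest_candidate(target_vec, candidates):
    """Return the name of the candidate vector most similar to target_vec."""
    return max(
        candidates,
        key=lambda name: cosine_similarity(target_vec, candidates[name]),
    )


# Made-up vectors standing in for a target word and two candidate senses.
target = [0.9, 0.1, 0.0]
synset_vectors = {
    "bank.financial": [1.0, 0.0, 0.1],
    "bank.river": [0.0, 1.0, 0.2],
}
best = closest_candidate(target, synset_vectors)
```

Here `best` is the candidate whose vector points in the direction closest to the target, regardless of vector length.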

AlexU-BackTranslation-TL at SemEval-2020 Task 12: Improving Offensive Language Detection Using Data Augmentation and Transfer Learning
Mai Ibrahim | Marwan Torki | Nagwa El-Makky
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Social media platforms, online news commenting spaces, and many other public forums have become widely known for issues of abusive behavior such as cyber-bullying and personal attacks. In this paper, we use the annotated tweets of the Offensive Language Identification Dataset (OLID) to train three levels of deep learning classifiers to solve the three sub-tasks associated with the dataset. Sub-task A is to determine if the tweet is toxic or not. Then, for offensive tweets, sub-task B requires determining whether the toxicity is targeted. Finally, for sub-task C, we predict the target of the offense, i.e., a group, individual, or other entity. In our solution, we tackle the problem of class imbalance in the dataset by using back translation for data augmentation and utilizing the fine-tuned BERT model in an ensemble of deep learning classifiers. We used this solution to participate in the three English sub-tasks of SemEval-2020 Task 12. The proposed solution achieved macro-averaged F1 scores of 0.91393, 0.6300, and 0.57607 in sub-tasks A, B, and C respectively. We achieved the 9th, 14th, and 22nd places for sub-tasks A, B, and C respectively.
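
For readers unfamiliar with the reported metric, the following is a minimal sketch of macro-averaged F1: the F1 score is computed per class and then averaged with equal weight, so a rare class counts as much as a frequent one. The function name `macro_f1` and the toy labels are illustrative assumptions, not the task's official scorer.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced example: a classifier that always predicts the majority class.
gold = ["OFF", "NOT", "NOT", "NOT"]
pred = ["NOT", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)
```

In this toy case the accuracy is 0.75, but the macro-averaged F1 is much lower because the minority class is never predicted, which is why the metric is sensitive to class imbalance.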

Identifying Nuanced Dialect for Arabic Tweets with Deep Learning and Reverse Translation Corpus Extension System
Rawan Tahssin | Youssef Kishk | Marwan Torki
Proceedings of the Fifth Arabic Natural Language Processing Workshop

In this paper, we present our work for the NADI Shared Task (Abdul-Mageed and Habash, 2020): Nuanced Arabic Dialect Identification, Subtask 1: country-level dialect identification. We introduce a Reverse Translation Corpus Extension System (RTCES) to handle data imbalance, and we report results for several approaches to word and document representation and for different model architectures. The top-scoring model was based on AraBERT (Antoun et al., 2020), trained on our extended corpus built by reverse translation of the given Arabic tweets. The selected system achieved a macro-averaged F1 score of 20.34% on the test set, which places us 7th out of 18 teams on the final leaderboard.

Arabic Dialect Identification Using BERT Fine-Tuning
Moataz Mansour | Moustafa Tohamy | Zeyad Ezzat | Marwan Torki
Proceedings of the Fifth Arabic Natural Language Processing Workshop

In the last few years, deep learning has proved to be a very effective paradigm for discovering patterns in large data sets. Unfortunately, training deep models on small data sets is often not the best option, because traditional machine learning algorithms frequently achieve better scores. With transfer learning, however, we can train a neural network on a large data set and then fine-tune it on a smaller one. In this paper, we present our system for the NADI Shared Task on country-level dialect identification. Our system is based on fine-tuning BERT; it achieves an F1 score of 22.85 on the test set and ranks 5th out of 18 teams.


Question Answering Using Hierarchical Attention on Top of BERT Features
Reham Osama | Nagwa El-Makky | Marwan Torki
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

The submitted model works as follows. Given a question and a passage, it uses BERT embeddings together with a hierarchical attention model, which consists of two parts (co-attention and self-attention), to locate a continuous span of the passage that answers the question.

Arabic Dialect Identification with Deep Learning and Hybrid Frequency Based Features
Youssef Fares | Zeyad El-Zanaty | Kareem Abdel-Salam | Muhammed Ezzeldin | Aliaa Mohamed | Karim El-Awaad | Marwan Torki
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Studies on Dialectal Arabic are growing more important by the day as it becomes the primary written and spoken form of Arabic online in informal settings. Among the important problems that should be explored is dialect identification. This paper presents different techniques that can be applied toward this goal and reports their performance on the Multi Arabic Dialect Applications and Resources (MADAR) Arabic Dialect Corpora. Our results show that improving on traditional systems that use frequency-based features and non-deep-learning classifiers is a challenging task. We propose different models based on different word and document representations. Our top model achieves a macro-averaged F1 score of 65.66 on MADAR’s small-scale parallel corpus of 25 dialects and Modern Standard Arabic (MSA).


A Document Descriptor using Covariance of Word Vectors
Marwan Torki
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper, we address the problem of finding a novel document descriptor based on the covariance matrix of the word vectors of a document. Our descriptor has a fixed length, which makes it easy to use in many supervised and unsupervised applications. We tested our novel descriptor in different tasks including supervised and unsupervised settings. Our evaluation shows that our document covariance descriptor fits different tasks with competitive performance against state-of-the-art methods.
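
The fixed-length property described above can be sketched as follows. A document with any number of d-dimensional word vectors yields a d x d covariance matrix; because that matrix is symmetric, its upper triangle has d(d+1)/2 distinct entries. The flattening scheme, the use of the population covariance (dividing by n), and the name `covariance_descriptor` are assumptions made for this illustration and may differ from the paper's exact formulation.

```python
def covariance_descriptor(word_vectors):
    """Flatten the upper triangle of the covariance matrix of a document's word vectors."""
    if not word_vectors:
        raise ValueError("document must contain at least one word vector")
    n = len(word_vectors)
    d = len(word_vectors[0])
    # Mean of each embedding dimension over the words in the document.
    mean = [sum(vec[j] for vec in word_vectors) / n for j in range(d)]
    centered = [[vec[j] - mean[j] for j in range(d)] for vec in word_vectors]
    # Population covariance (divides by n); a sample estimate would divide by n - 1.
    cov = [
        [sum(row[i] * row[j] for row in centered) / n for j in range(d)]
        for i in range(d)
    ]
    # The matrix is symmetric, so the upper triangle carries all the information.
    return [cov[i][j] for i in range(d) for j in range(i, d)]


# Two toy "documents" of different lengths, using made-up 2-dimensional word vectors.
short_doc = [[1.0, 0.0], [0.0, 1.0]]
long_doc = [[1.0, 2.0], [2.0, 1.0], [0.0, 0.0], [3.0, 3.0]]
short_desc = covariance_descriptor(short_doc)
long_desc = covariance_descriptor(long_doc)
```

Both descriptors have the same length even though the documents contain different numbers of words, which is what lets them feed directly into standard classifiers or clustering methods.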


QU-BIGIR at SemEval 2017 Task 3: Using Similarity Features for Arabic Community Question Answering Forums
Marwan Torki | Maram Hasanain | Tamer Elsayed
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper, we describe our QU-BIGIR system for the Arabic subtask D of SemEval 2017 Task 3. Our approach builds on our participation in the previous edition of the same subtask. This year, our system uses different similarity measures that encode the lexical and semantic pairwise similarity of text pairs. In addition to well-known similarity measures such as cosine similarity, we use other measures based on the summary statistics of the word embedding representation of a given text. To rank a list of candidate question-answer pairs for a given question, we learn a linear SVM classifier over our similarity features. Our best run came second in subtask D, with performance very competitive with that of the first-ranking system.


QU-IR at SemEval 2016 Task 3: Learning to Rank on Arabic Community Question Answering Forums with Word Embedding
Rana Malhas | Marwan Torki | Tamer Elsayed
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)


Al-Bayan: A Knowledge-based System for Arabic Answer Selection
Reham Mohamed | Maha Ragab | Heba Abdelnasser | Nagwa M. El-Makky | Marwan Torki
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)


Al-Bayan: An Arabic Question Answering System for the Holy Quran
Heba Abdelnasser | Maha Ragab | Reham Mohamed | Alaa Mohamed | Bassant Farouk | Nagwa El-Makky | Marwan Torki
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)