Ebrahim Ansari


2020

LSCP: Enhanced Large Scale Colloquial Persian Language Understanding
Hadi Abdi Khojasteh | Ebrahim Ansari | Mahdi Bohlouli
Proceedings of the 12th Language Resources and Evaluation Conference

Language recognition has been significantly advanced in recent years by means of modern machine learning methods such as deep learning and benchmarks with rich annotations. However, research is still limited in low-resource formal languages. This leaves a significant gap in describing colloquial language, especially for low-resource languages such as Persian. To address this gap, we propose a “Large Scale Colloquial Persian Dataset” (LSCP). LSCP is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. This encompasses the recognition of multiple semantic aspects in human-level sentences, which are naturally captured from real-world sentences. We believe that further investigations and processing, as well as the application of novel algorithms and methods, can strengthen computerized understanding and processing of low-resource languages. The proposed corpus consists of 120M sentences derived from 27M tweets, annotated with parse trees, part-of-speech tags, sentiment polarity, and translations in five different languages.

Findings of the IWSLT 2020 Evaluation Campaign
Ebrahim Ansari | Amittai Axelrod | Nguyen Bach | Ondřej Bojar | Roldano Cattoni | Fahim Dalvi | Nadir Durrani | Marcello Federico | Christian Federmann | Jiatao Gu | Fei Huang | Kevin Knight | Xutai Ma | Ajay Nagesh | Matteo Negri | Jan Niehues | Juan Pino | Elizabeth Salesky | Xing Shi | Sebastian Stüker | Marco Turchi | Alexander Waibel | Changhan Wang
Proceedings of the 17th International Conference on Spoken Language Translation

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2020) featured six challenge tracks this year: (i) Simultaneous speech translation, (ii) Video speech translation, (iii) Offline speech translation, (iv) Conversational speech translation, (v) Open domain translation, and (vi) Non-native speech translation. A total of teams participated in at least one of the tracks. This paper introduces each track’s goal, data, and evaluation metrics, and reports the results of the received submissions.

ELITR: European Live Translator
Ondřej Bojar | Dominik Macháček | Sangeet Sagar | Otakar Smrž | Jonáš Kratochvíl | Ebrahim Ansari | Dario Franceschini | Chiara Canton | Ivan Simonini | Thai-Son Nguyen | Felix Schneider | Sebastian Stüker | Alex Waibel | Barry Haddow | Rico Sennrich | Philip Williams
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The ELITR (European Live Translator) project aims to create a speech translation system for simultaneous subtitling of conferences and online meetings, targeting up to 43 languages. The technology is tested by the Supreme Audit Office of the Czech Republic and by alfaview®, a German online conferencing system. Other project goals are to advance document-level and multilingual machine translation, automatic speech recognition, and automatic minuting.

2019

Building a Morphological Network for Persian on Top of a Morpheme-Segmented Lexicon
Hamid Haghdoost | Ebrahim Ansari | Zdeněk Žabokrtský | Mahshid Nikravesh
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology

Supervised Morphological Segmentation Using Rich Annotated Lexicon
Ebrahim Ansari | Zdeněk Žabokrtský | Mohammad Mahmoudi | Hamid Haghdoost | Jonáš Vidra
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Morphological segmentation is the process of dividing a word into smaller units called morphemes; it is especially tricky for morphologically rich or polysynthetic languages. In this work, we designed and evaluated several Recurrent Neural Network (RNN) based models as well as various other machine learning based approaches for the morphological segmentation task. We trained our models using annotated segmentation lexicons. To evaluate the effect of the training data size on our models, we created a large hand-annotated morphologically segmented corpus of Persian words, which is, to the best of our knowledge, the first and only segmentation lexicon for the Persian language. In the experimental phase, using the hand-annotated Persian lexicon and two smaller similar lexicons for Czech and Finnish, we evaluated the effect of the training data size, different hyper-parameter settings, and different RNN-based models.

2018

Extracting an English-Persian Parallel Corpus from Comparable Corpora
Akbar Karimi | Ebrahim Ansari | Bahram Sadeghi Bigham
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)