Marcos Zampieri


2020

pdf bib
SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Marcos Zampieri | Preslav Nakov | Sara Rosenthal | Pepa Atanasova | Georgi Karadzhov | Hamdy Mubarak | Leon Derczynski | Zeses Pitenis | Çağrı Çöltekin
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.

pdf bib
Offensive Language Identification in Greek
Zesis Pitenis | Marcos Zampieri | Tharindu Ranasinghe
Proceedings of the 12th Language Resources and Evaluation Conference

As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc. With a few notable exceptions, most research on this topic so far has dealt with English. This is mostly due to the availability of language resources for English. To address this shortcoming, this paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD). OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive. Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.

pdf bib
CompLex — A New Corpus for Lexical Complexity Prediction from Likert Scale Data
Matthew Shardlow | Michael Cooper | Marcos Zampieri
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such astext simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studieshave approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) fora set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using abinary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexityprediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl,and biomedical texts. This resulted in a corpus of 9,476 sentences each annotated by around 7 annotators.

pdf bib
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
A Report on the VarDial Evaluation Campaign 2020
Mihaela Gaman | Dirk Hovy | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Christoph Purschke | Yves Scherrer | Marcos Zampieri
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams who enrolled to participate in one or multiple shared tasks and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.

pdf bib
MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources
Farhad Akhbardeh | Travis Desell | Marcos Zampieri
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

Maintenance record logbooks are an emerging text type in NLP. An important part of them typically consist of free text with many domain specific technical terms, abbreviations, and non-standard spelling and grammar. This poses difficulties for NLP pipelines trained on standard corpora. Analyzing and annotating such documents is of particular importance in the development of predictive maintenance systems, which aim to improve operational efficiency, reduce costs, prevent accidents, and save lives. In order to facilitate and encourage research in this area, we have developed MaintNet, a collaborative open-source library of technical and domain-specific language resources. MaintNet provides novel logbook data from the aviation, automotive, and facility maintenance domains along with tools to aid in their (pre-)processing and clustering. Furthermore, it provides a way to encourage discussion on and sharing of new datasets and tools for logbook data analysis.

pdf bib
Multilingual Offensive Language Identification with Cross-lingual Embeddings
Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbulling, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain English data. In this paper, we take advantage of English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with less resources. We project predictions on comparable data in Bengali, Hindi, and Spanish and we report results of 0.8415 F1 macro for Bengali, 0.8568 F1 macro for Hindi, and 0.7513 F1 macro for Spanish. Finally, we show that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages, confirming the robustness of cross-lingual contextual embeddings and transfer learning for this task.

pdf bib
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Ritesh Kumar | Atul Kr. Ojha | Bornini Lahiri | Marcos Zampieri | Shervin Malmasi | Vanessa Murdock | Daniel Kadar
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

pdf bib
Evaluating Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In this paper, we present the report and findings of the Shared Task on Aggression and Gendered Aggression Identification organised as part of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC - 2) at LREC 2020. The task consisted of two sub-tasks - aggression identification (sub-task A) and gendered identification (sub-task B) - in three languages - Bangla, Hindi and English. For this task, the participants were provided with a dataset of approximately 5,000 instances from YouTube comments in each language. For testing, approximately 1,000 instances were provided in each language for each sub-task. A total of 70 teams registered to participate in the task and 19 teams submitted their test runs. The best system obtained a weighted F-score of approximately 0.80 in sub-task A for all the three languages. While approximately 0.87 in sub-task B for all the three languages.

2019

pdf bib
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Shervin Malmasi | Nikola Ljubešić | Jörg Tiedemann | Ahmed Ali
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
A Report on the Third VarDial Evaluation Campaign
Marcos Zampieri | Shervin Malmasi | Yves Scherrer | Tanja Samardžić | Francis Tyers | Miikka Silfverberg | Natalia Klyueva | Tung-Le Pan | Chu-Ren Huang | Radu Tudor Ionescu | Andrei M. Butnaru | Tommi Jauhiainen
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

pdf bib
Experiments in Cuneiform Language Identification
Gustavo Henrique Paetzold | Marcos Zampieri
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents methods to discriminate between languages and dialects written in Cuneiform script, one of the first writing systems in the world. We report the results obtained by the PZ team in the Cuneiform Language Identification (CLI) shared task organized within the scope of the VarDial Evaluation Campaign 2019. The task included two languages, Sumerian and Akkadian. The latter is divided into six dialects: Old Babylonian, Middle Babylonian peripheral, Standard Babylonian, Neo Babylonian, Late Babylonian, and Neo Assyrian. We approach the task using a meta-classifier trained on various SVM models and we show the effectiveness of the system for this task. Our submission achieved 0.738 F1 score in discriminating between the seven languages and dialects and it was ranked fourth in the competition among eight teams.

pdf bib
Findings of the 2019 Conference on Machine Translation (WMT19)
Loïc Barrault | Ondřej Bojar | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Philipp Koehn | Shervin Malmasi | Christof Monz | Mathias Müller | Santanu Pal | Matt Post | Marcos Zampieri
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

pdf bib
UDSDFKI Submission to the WMT2019 Czech–Polish Similar Language Translation Shared Task
Santanu Pal | Marcos Zampieri | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper we present the UDS-DFKI system submitted to the Similar Language Translation shared task at WMT 2019. The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish. Participants could choose to participate in any of these three tracks and submit system outputs in any translation direction. We report the results obtained by our system in translating from Czech to Polish and comment on the impact of out-of-domain test data in the performance of our system. UDS-DFKI achieved competitive performance ranking second among ten teams in Czech to Polish translation.

pdf bib
Improving CAT Tools in the Translation Workflow: New Approaches and Evaluation
Mihaela Vela | Santanu Pal | Marcos Zampieri | Sudip Naskar | Josef van Genabith
Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks

pdf bib
Predicting the Type and Target of Offensive Posts in Social Media
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Sara Rosenthal | Noura Farra | Ritesh Kumar
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbulling, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we complied the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and we compare the performance of different machine learning models on OLID.

pdf bib
SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Sara Rosenthal | Noura Farra | Ritesh Kumar
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and it featured three sub-tasks. In sub-task A, systems were asked to discriminate between offensive and non-offensive posts. In sub-task B, systems had to identify the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, nearly 800 teams signed up to participate in the task and 115 of them submitted results, which are presented and analyzed in this report.

pdf bib
UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks
Gustavo Henrique Paetzold | Marcos Zampieri | Shervin Malmasi
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we revisit the problem of automatically identifying hate speech in posts from social media. We approach the task using a system based on minimalistic compositional Recurrent Neural Networks (RNN). We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset. The dataset made available by the HatEval organizers contained English and Spanish posts retrieved from Twitter annotated with respect to the presence of hateful content and its target. In this paper we present the results obtained by our system in comparison to the other entries in the shared task. Our system achieved competitive performance ranking 7th in sub-task A out of 62 systems in the English track.

2018

pdf bib
LIdioms: A Multilingual Linked Idioms Data Set
Diego Moussallem | Mohamed Ahmed Sherif | Diego Esteves | Marcos Zampieri | Axel-Cyrille Ngonga Ngomo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
RDF2PT: Generating Brazilian Portuguese Texts from RDF Data
Diego Moussallem | Thiago Ferreira | Marcos Zampieri | Maria Claudia Cavalcanti | Geraldo Xexéo | Mariana Neves | Axel-Cyrille Ngonga Ngomo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Classifying Patent Applications with Ensemble Methods
Fernando Benites | Shervin Malmasi | Marcos Zampieri
Proceedings of the Australasian Language Technology Association Workshop 2018

We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications. The goal of the task is to use computational methods to categorize patent applications according to a coarse-grained taxonomy of eight classes based on the International Patent Classification (IPC). We tested a variety of approaches for this task and the best results, 0.778 micro-averaged F1-Score, were achieved by SVM ensembles using a combination of words and characters as features. Our team, BMZ, was ranked first among 14 teams in the competition.

pdf bib
A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

pdf bib
A Portuguese Native Language Identification Dataset
Iria del Río Gayo | Marcos Zampieri | Shervin Malmasi
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author’s first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP. We discuss possible applications of this dataset and present the results obtained for the first lexical baseline system for Portuguese NLI.

pdf bib
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi | Ahmed Ali
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

pdf bib
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Ahmed Ali | Suwon Shon | James Glass | Yves Scherrer | Tanja Samardžić | Nikola Ljubešić | Jörg Tiedemann | Chris van der Lee | Stefan Grondelaers | Nelleke Oostdijk | Dirk Speelman | Antal van den Bosch | Ritesh Kumar | Bornini Lahiri | Mayank Jain
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

pdf bib
Discriminating between Indo-Aryan Languages Using SVM Ensembles
Alina Maria Ciobanu | Marcos Zampieri | Shervin Malmasi | Santanu Pal | Liviu P. Dinu
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. The system competed in the Indo-Aryan Language Identification (ILI) shared task organized within the VarDial Evaluation Campaign 2018. Our best entry in the competition, named ILIdentification, scored 88.95% F1 score and it was ranked 3rd out of 8 teams.

pdf bib
A Neural Approach to Language Variety Translation
Marta R. Costa-jussà | Marcos Zampieri | Santanu Pal
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language. We take the pair Brazilian - European Portuguese as an example and compare the performance of this method to a phrase-based statistical machine translation system. We report a performance improvement of 0.9 BLEU points in translating from European to Brazilian Portuguese and 0.2 BLEU points when translating in the opposite direction. We also carried out a human evaluation experiment with native speakers of Brazilian Portuguese which indicates that humans prefer the output produced by the neural-based system in comparison to the statistical system.

pdf bib
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
Ritesh Kumar | Atul Kr. Ojha | Marcos Zampieri | Shervin Malmasi
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

pdf bib
Benchmarking Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

In this paper, we present the report and findings of the Shared Task on Aggression Identification organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC - 1) at COLING 2018. The task was to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts. For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook Posts and Comments each in Hindi (in both Roman and Devanagari script) and English for training and validation. For testing, two different sets - one from Facebook and another from a different social media - were provided. A total of 130 teams registered to participate in the task, 30 teams submitted their test runs, and finally 20 teams also sent their system description paper which are included in the TRAC workshop proceedings. The best system obtained a weighted F-score of 0.64 for both Hindi and English on the Facebook test sets, while the best scores on the surprise set were 0.60 and 0.50 for English and Hindi respectively. The results presented in this report depict how challenging the task is. The positive response from the community and the great levels of participation in the first edition of this shared task also highlights the interest in this topic.

2017

pdf bib
Detecting Hate Speech in Social Media
Shervin Malmasi | Marcos Zampieri
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity. We aim to establish lexical baselines for this task by applying supervised classification methods using a recently released dataset annotated for this purpose. As features, our system uses character n-grams, word n-grams and word skip-grams. We obtain results of 78% accuracy in identifying posts across three classes. Results demonstrate that the main challenge lies in discriminating profanity and hate speech from each other. A number of directions for future work are discussed.

pdf bib
Predicting the Law Area and Decisions of French Supreme Court Cases
Octavia-Maria Şulea | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court. We also investigate the influence of the time period in which a ruling was made over the textual form of the case description and the extent to which it is necessary to mask the judge’s motivation for a ruling to emulate a real-world test scenario. We report results of 96% f1 score in predicting a case ruling, 90% f1 score in predicting the law area of a case, and 75.9% f1 score in estimating the time span when a ruling has been issued using a linear Support Vector Machine (SVM) classifier trained on lexical features.

pdf bib
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Preslav Nakov | Marcos Zampieri | Nikola Ljubešić | Jörg Tiedemann | Shevin Malmasi | Ahmed Ali
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

pdf bib
Findings of the VarDial Evaluation Campaign 2017
Marcos Zampieri | Shervin Malmasi | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann | Yves Scherrer | Noëmi Aepli
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.

pdf bib
German Dialect Identification in Interview Transcriptions
Shervin Malmasi | Marcos Zampieri
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents three systems submitted to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2017. The task consists of training models to identify the dialect of Swiss-German speech transcripts. The dialects included in the GDI dataset are Basel, Bern, Lucerne, and Zurich. The three systems we submitted are based on: a plurality ensemble, a mean probability ensemble, and a meta-classifier trained on character and word n-grams. The best results were obtained by the meta-classifier achieving 68.1% accuracy and 66.2% F1-score, ranking first among the 10 teams which participated in the GDI shared task.

pdf bib
Arabic Dialect Identification Using iVectors and ASR Transcripts
Shervin Malmasi | Marcos Zampieri
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017. The goal of the task is to evaluate computational models to identify the dialect of Arabic utterances using both audio and text transcriptions. The ADI shared task dataset included Modern Standard Arabic (MSA) and four Arabic dialects: Egyptian, Gulf, Levantine, and North-African. The three systems submitted by MAZA are based on combinations of multiple machine learning classifiers arranged as (1) voting ensemble; (2) mean probability ensemble; (3) meta-classifier. The best results were obtained by the meta-classifier achieving 71.7% accuracy, ranking second among the six teams which participated in the ADI shared task.

pdf bib
Native Language Identification on Text and Speech
Marcos Zampieri | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents an ensemble system combining the output of multiple SVM classifiers to native language identification (NLI). The system was submitted to the NLI Shared Task 2017 fusion track which featured students essays and spoken responses in form of audio transcriptions and iVectors by non-native English speakers of eleven native languages. Our system competed in the challenge under the team name ZCD and was based on an ensemble of SVM classifiers trained on character n-grams achieving 83.58% accuracy and ranking 3rd in the shared task.

pdf bib
Complex Word Identification: Challenges in Data Annotation and System Performance
Marcos Zampieri | Shervin Malmasi | Gustavo Paetzold | Lucia Specia
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and one of the reasons for that is the way in which human annotation was performed.

2016

pdf bib
CATaLog Online: Porting a Post-editing Tool to the Web
Santanu Pal | Marcos Zampieri | Sudip Kumar Naskar | Tapas Nayak | Mihaela Vela | Josef van Genabith
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents CATaLog online, a new web-based MT and TM post-editing tool. CATaLog online is a freeware software that can be used through a web browser and it requires only a simple registration. The tool features a number of editing and log functions similar to the desktop version of CATaLog enhanced with several new features that we describe in detail in this paper. CATaLog online is designed to allow users to post-edit both translation memory segments as well as machine translation output. The tool provides a complete set of log information currently not available in most commercial CAT tools. Log information can be used both for project management purposes as well as for the study of the translation process and translator’s productivity.

pdf bib
Discriminating Similar Languages: Evaluations and Explorations
Cyril Goutte | Serge Léger | Shervin Malmasi | Marcos Zampieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation

pdf bib
Modeling Language Change in Historical Corpora: The Case of Portuguese
Marcos Zampieri | Shervin Malmasi | Mark Dras
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. We report results of 99.8% accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change.

pdf bib
CATaLog Online: A Web-based CAT Tool for Distributed Translation with Data Capture for APE and Translation Process Research
Santanu Pal | Sudip Kumar Naskar | Marcos Zampieri | Tapas Nayak | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a free web-based CAT tool called CATaLog Online which provides a novel and user-friendly online CAT environment for post-editors/translators. The goal is to support distributed translation, reduce post-editing time and effort, improve the post-editing experience and capture data for incremental MT/APE (automatic post-editing) and translation process research. The tool supports individual as well as batch mode file translation and provides translations from three engines – translation memory (TM), MT and APE. TM suggestions are color coded to accelerate the post-editing task. The users can integrate their personal TM/MT outputs. The tool remotely monitors and records post-editing activities generating an extensive range of post-editing logs.

pdf bib
Predicting Post Severity in Mental Health Forums
Shervin Malmasi | Marcos Zampieri | Mark Dras
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

pdf bib
Findings of the 2016 Conference on Machine Translation
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Varvara Logacheva | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Martin Popel | Matt Post | Raphael Rubino | Carolina Scarton | Lucia Specia | Marco Turchi | Karin Verspoor | Marcos Zampieri
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
USAAR: An Operation Sequential Model for Automatic Statistical Post-Editing
Santanu Pal | Marcos Zampieri | Josef van Genabith
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Preslav Nakov | Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

pdf bib
Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task
Shervin Malmasi | Marcos Zampieri | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We present the results of the third edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial’2016 workshop at COLING’2016. The challenge offered two subtasks: subtask 1 focused on the identification of very similar languages and language varieties in newswire texts, whereas subtask 2 dealt with Arabic dialect identification in speech transcripts. A total of 37 teams registered to participate in the task, 24 teams submitted test results, and 20 teams also wrote system description papers. High-order character n-grams were the most successful feature, and the best classification approaches included traditional supervised learning methods such as SVM, logistic regression, and language models, while deep learning approaches did not perform very well.

pdf bib
Arabic Dialect Identification in Speech Transcripts
Shervin Malmasi | Marcos Zampieri
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we describe a system developed to identify a set of four regional Arabic dialects (Egyptian, Gulf, Levantine, North African) and Modern Standard Arabic (MSA) in a transcribed speech corpus. We competed under the team name MAZA in the Arabic Dialect Identification sub-task of the 2016 Discriminating between Similar Languages (DSL) shared task. Our system achieved an F1-score of 0.51 in the closed training track, ranking first among the 18 teams that participated in the sub-task. Our system utilizes a classifier ensemble with a set of linear models as base classifiers. We experimented with three different ensemble fusion strategies, with the mean probability approach providing the best performance.

pdf bib
MAZA at SemEval-2016 Task 11: Detecting Lexical Complexity Using a Decision Stump Meta-Classifier
Shervin Malmasi | Marcos Zampieri
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles
Shervin Malmasi | Mark Dras | Marcos Zampieri
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
MacSaar at SemEval-2016 Task 11: Zipfian and Character Features for ComplexWord Identification
Marcos Zampieri | Liling Tan | Josef van Genabith
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Can Translation Memories afford not to use paraphrasing ?
Rohit Gupta | Constantin Orasan | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Searching for Context: a Study on Document-Level Labels for Translation Quality Estimation
Carolina Scarton | Marcos Zampieri | Mihaela Vela | Josef van Genabith | Lucia Specia
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Can Translation Memories afford not to use paraphrasing?
Rohit Gupta | Constantin Orăsan | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Searching for Context: a Study on Document-Level Labels for Translation Quality Estimation
Carolina Scarton | Marcos Zampieri | Mihaela Vela | Josef van Genabith | Lucia Specia
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek | Sudip Kumar Naskar | Santanu Pal | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the Workshop Natural Language Processing for Translation Memories

pdf bib
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects
Preslav Nakov | Marcos Zampieri | Petya Osenova | Liling Tan | Cristina Vertan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

pdf bib
Overview of the DSL Shared Task 2015
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann | Preslav Nakov
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

pdf bib
Comparing Approaches to the Identification of Similar Languages
Marcos Zampieri | Binyam Gebrekidan Gebre | Hernani Costa | Josef van Genabith
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

pdf bib
AMBRA: A Ranking Approach to Temporal Text Classification
Marcos Zampieri | Alina Maria Ciobanu | Vlad Niculae | Liviu P. Dinu
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf bib
Temporal Text Ranking and Automatic Dating of Texts
Vlad Niculae | Marcos Zampieri | Liviu Dinu | Alina Maria Ciobanu
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

pdf bib
Quantifying the Influence of MT Output in the Translators’ Performance: A Case Study in Technical Translation
Marcos Zampieri | Mihaela Vela
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

pdf bib
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

pdf bib
A Report on the DSL Shared Task 2014
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

pdf bib
VarClass: An Open-source Language Identification Tool for Language Varieties
Marcos Zampieri | Binyam Gebre
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents VarClass, an open-source tool for language identification available both to be downloaded as well as through a graphical user-friendly interface. The main difference of VarClass in comparison to other state-of-the-art language identification tools is its focus on language varieties. General purpose language identification tools do not take language varieties into account and our work aims to fill this gap. VarClass currently contains language models for over 27 languages in which 10 of them are language varieties. We report an average performance of over 90.5% accuracy in a challenging dataset. More language models will be included in the upcoming months.

2013

pdf bib
Improving Native Language Identification with TF-IDF Weighting
Binyam Gebrekidan Gebre | Marcos Zampieri | Peter Wittenburg | Tom Heskes
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Effective Spell Checking Methods Using Clustering Algorithms
Renato Cordeiro de Amorim | Marcos Zampieri
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
N-gram Language Models and POS Distribution for the Identification of Spanish Varieties (Ngrammes et Traits Morphosyntaxiques pour la Identification de Variétés de l’Espagnol) [in French]
Marcos Zampieri | Binyam Gebrekidan Gebre | Sascha Diwersy
Proceedings of TALN 2013 (Volume 2: Short Papers)

Search
Co-authors