Marius Popescu


2017

Can string kernels pass the test of time in Native Language Identification?
Radu Tudor Ionescu | Marius Popescu
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We describe a machine learning approach for the 2017 shared task on Native Language Identification (NLI). The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from essays or speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided by the shared task organizers. For the learning stage, we choose Kernel Discriminant Analysis (KDA) over Kernel Ridge Regression (KRR), because the former classifier obtains better results than the latter one on the development set. In our previous work, we have used a similar machine learning approach to achieve state-of-the-art NLI results. The goal of this paper is to demonstrate that our shallow and simple approach based on string kernels (with minor improvements) can pass the test of time and reach state-of-the-art performance in the 2017 NLI shared task, despite the recent advances in natural language processing. We participated in all three tracks, in which the competitors were allowed to use only the essays (essay track), only the speech transcripts (speech track), or both (fusion track). Using only the data provided by the organizers for training our models, we have reached a macro F1 score of 86.95% in the closed essay track, a macro F1 score of 87.55% in the closed speech track, and a macro F1 score of 93.19% in the closed fusion track. With these scores, our team (UnibucKernel) ranked in the first group of teams in all three tracks, while attaining the best scores in the speech and the fusion tracks.
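The character p-gram kernels mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain spectrum kernel (shared p-gram counts), unit-diagonal normalization, and a simple average over several values of p as the combination step. The function names and the p range are hypothetical, and the paper's actual kernels and combination scheme may differ.

```python
from collections import Counter
from math import sqrt

def pgram_counts(text, p):
    """Count every contiguous character p-gram in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def spectrum_kernel(a, b, p):
    """Dot product of the p-gram count vectors of two strings."""
    ca, cb = pgram_counts(a, p), pgram_counts(b, p)
    return sum(n * cb[g] for g, n in ca.items())

def normalized_kernel(a, b, p):
    """Scale the kernel so that k(x, x) = 1 whenever x has at least one p-gram."""
    denom = sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))
    return spectrum_kernel(a, b, p) / denom if denom else 0.0

def combined_kernel(a, b, p_values=(2, 3, 4)):
    """Average the normalized kernels (an assumed, simple combination scheme)."""
    return sum(normalized_kernel(a, b, p) for p in p_values) / len(p_values)
```

Calling `combined_kernel(essay_a, essay_b)` on two raw strings yields a similarity in [0, 1] without any tokenization, which is what makes this family of methods independent of language-specific tools.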

2016

UnibucKernel: An Approach for Arabic Dialect Identification Based on Multiple String Kernels
Radu Tudor Ionescu | Marius Popescu
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

The most common approach in text mining classification tasks is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. In contrast, we present a method that uses only character p-grams (also known as n-grams) as features for the Arabic Dialect Identification (ADI) Closed Shared Task of the DSL 2016 Challenge. The proposed approach combines several string kernels using multiple kernel learning. In the learning stage, we try both Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR), and we choose KDA as it gives better results in a 10-fold cross-validation carried out on the training set. Our approach is shallow and simple, but the empirical results obtained in the ADI Shared Task show that it performs well. Indeed, we ranked second with an accuracy of 50.91% and a weighted F1 score of 51.31%. We also present improved results in this paper, which we obtained after the competition ended. Simply by adding more regularization to our model to make it more suitable for test data that comes from a different distribution than the training data, we obtain an accuracy of 51.82% and a weighted F1 score of 52.18%. Furthermore, the proposed approach has an important advantage in that it is language independent and linguistic theory neutral, as it does not require any NLP tools.
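To make the role of regularization concrete, here is a minimal sketch of Kernel Ridge Regression in its dual form on a precomputed kernel matrix, where the coefficients are alpha = (K + lam * I)^-1 y. It is an illustration of the KRR alternative named in the abstract, not the authors' implementation. The names `solve`, `krr_fit`, `krr_predict`, and `lam` are hypothetical, and the abstract does not say how regularization was added to the KDA model that was finally used.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [a | b]; copying keeps the inputs untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def krr_fit(K, y, lam):
    """Dual KRR coefficients; assumes lam > 0 and a positive semi-definite K."""
    n = len(K)
    reg = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve(reg, y)

def krr_predict(k_row, alpha):
    """Score a sample from its kernel values against the training samples."""
    return sum(k * a for k, a in zip(k_row, alpha))
```

For a multi-class task such as dialect identification, one common setup (assumed here) is to fit one such model per class with +1/-1 targets and pick the class with the highest score. A larger `lam` shrinks the coefficients, which is the usual way to trade training fit for robustness.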

String Kernels for Native Language Identification: Insights from Behind the Curtains
Radu Tudor Ionescu | Marius Popescu | Aoife Cahill
Computational Linguistics, Volume 42, Issue 3 - September 2016

2014

Can characters reveal your native language? A language-independent approach to native language identification
Radu Tudor Ionescu | Marius Popescu | Aoife Cahill
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

The Story of the Characters, the DNA and the Native Language
Marius Popescu | Radu Tudor Ionescu
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

2011

Studying Translationese at the Character Level
Marius Popescu
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2009

What’s in a name? In some languages, grammatical gender
Vivi Nastase | Marius Popescu
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Comparing Statistical Similarity Measures for Stylistic Multivariate Analysis
Marius Popescu | Liviu P. Dinu
Proceedings of the International Conference RANLP-2009

2008

Authorship Identification of Romanian Texts with Controversial Paternity
Liviu Dinu | Marius Popescu | Anca Dinu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this work we propose a new strategy for the authorship identification problem and we test it on an example from Romanian literature: did Radu Albala find the continuation of Mateiu Caragiale’s novel Sub pecetea tainei, or did he write it himself? The proposed strategy is based on the similarity of rankings of function words; we compare the obtained results with those of a learning method, namely Support Vector Machines (SVM) with a string kernel.
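The idea of comparing rankings of function words can be sketched as follows. This is only an illustration under stated assumptions: the Spearman footrule stands in for whatever ranking similarity the paper defines, the function-word list is supplied by the caller, and the names `rank_words` and `footrule` are hypothetical.

```python
from collections import Counter

def rank_words(text, vocabulary):
    """Map each word to its frequency rank (1 = most frequent); ties share the mean rank."""
    counts = Counter(w for w in text.lower().split() if w in vocabulary)
    ordered = sorted(vocabulary, key=lambda w: -counts[w])
    ranks, i = {}, 0
    while i < len(ordered):
        # Extend j over the run of words with the same count as ordered[i].
        j = i
        while j + 1 < len(ordered) and counts[ordered[j + 1]] == counts[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for w in ordered[i:j + 1]:
            ranks[w] = mean_rank
        i = j + 1
    return ranks

def footrule(ranks_a, ranks_b):
    """Spearman footrule: sum of absolute rank differences over the vocabulary."""
    return sum(abs(ranks_a[w] - ranks_b[w]) for w in ranks_a)
```

A smaller footrule value means the two texts use the listed function words in a more similar order of preference, so a disputed text can be compared against samples from each candidate author.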

Rank Distance as a Stylistic Similarity
Marius Popescu | Liviu P. Dinu
Coling 2008: Companion volume: Posters

2004

Regularized Least-Squares classification for Word Sense Disambiguation
Marius Popescu
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text