Liviu P. Dinu

Also published as: Liviu Petrisor Dinu, Liviu Dinu


2020

pdf bib
Automatically Building a Multilingual Lexicon of False Friends With No Supervision
Ana Sabina Uban | Liviu P. Dinu
Proceedings of the 12th Language Resources and Evaluation Conference

Cognate words, defined as words in different languages which derive from a common etymon, can be useful for language learners, who can leverage the orthographical similarity of cognates to more easily understand a text in a foreign language. Deceptive cognates, or false friends, do not share the same meaning anymore; these can be instead deceiving and detrimental for language acquisition or text understanding in a foreign language. We use an automatic method of detecting false friends from a set of cognates, in a fully unsupervised fashion, based on cross-lingual word embeddings. We implement our method for English and five Romance languages, including a low-resource language (Romanian), and evaluate it against two different gold standards. The method can be extended easily to any language pair, requiring only large monolingual corpora for the involved languages and a small bilingual dictionary for the pair. We additionally propose a measure of “falseness” of a false friends pair. We publish freely the database of false friends in the six languages, along with the falseness scores for each cognate pair. The resource is the largest of the kind that we are aware of, both in terms of languages covered and number of word pairs.

pdf bib
Automatic Reconstruction of Missing Romanian Cognates and Unattested Latin Words
Alina Maria Ciobanu | Liviu P. Dinu | Laurentiu Zoicas
Proceedings of the 12th Language Resources and Evaluation Conference

Producing related words is a key concern in historical linguistics. Given an input word, the task is to automatically produce either its proto-word, a cognate pair or a modern word derived from it. In this paper, we apply a method for producing related words based on sequence labeling, aiming to fill in the gaps in incomplete cognate sets in Romance languages with Latin etymology (producing Romanian cognates that are missing) and to reconstruct uncertified Latin words. We further investigate an ensemble-based aggregation for combining and re-ranking the word productions of multiple languages.

2019

pdf bib
The Myth of Double-Blind Review Revisited: ACL vs. EMNLP
Cornelia Caragea | Ana Uban | Liviu P. Dinu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The review and selection process for scientific paper publication is essential for the quality of scholarly publications in a scientific field. The double-blind review system, which enforces author anonymity during the review period, is widely used by prestigious conferences and journals to ensure the integrity of this process. Although the notion of anonymity in the double-blind review has been questioned before, the availability of full text paper collections brings new opportunities for exploring the question: Is the double-blind review process really double-blind? We study this question on the ACL and EMNLP paper collections and present an analysis on how well deep learning techniques can infer the authors of a paper. Specifically, we explore Convolutional Neural Networks trained on various aspects of a paper, e.g., content, style features, and references, to understand the extent to which we can infer the authors of a paper and what aspects contribute the most. Our results show that the authors of a paper can be inferred with accuracy as high as 87% on ACL and 78% on EMNLP for the top 100 most prolific authors.

pdf bib
Studying Laws of Semantic Divergence across Languages using Cognate Sets
Ana Uban | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Semantic divergence in related languages is a key concern of historical linguistics. Intra-lingual semantic shift has been previously studied in computational linguistics, but this can only provide a limited picture of the evolution of word meanings, which often develop in a multilingual environment. In this paper we investigate semantic change across languages by measuring the semantic distance of cognate words in multiple languages. By comparing current meanings of cognates in different languages, we hope to uncover information about their previous meanings, and about how they diverged in their respective languages from their common original etymon. We further study the properties of their semantic divergence, by analyzing how the features of words such as frequency and polysemy are related to the divergence in their meaning, and thus make the first steps towards formulating laws of cross-lingual semantic change.

pdf bib
Linguistic classification: dealing jointly with irrelevance and inconsistency
Laura Franzoi | Andrea Sgarro | Anca Dinu | Liviu P. Dinu
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper, we present new methods for language classification which put to good use both syntax and fuzzy tools, and are capable of dealing with irrelevant linguistic features (i.e. features which should not contribute to the classification) and even inconsistent features (which do not make sense for specific languages). We introduce a metric distance, based on the generalized Steinhaus transform, which allows one to deal jointly with irrelevance and inconsistency. To evaluate our methods, we test them on a syntactic data set, due to the linguist G. Longobardi and his school. We obtain phylogenetic trees which sometimes outperform the ones obtained by Atkinson and Gray.

pdf bib
From Image to Text in Sentiment Analysis via Regression and Deep Learning
Daniela Onita | Liviu P. Dinu | Adriana Birlutiu
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Images and text represent types of content which are used together for conveying user emotions in online social networks. These contents are usually associated with a sentiment category. In this paper, we investigate an approach for mapping images to text for three types of sentiment categories: positive, neutral and negative. The mapping from images to text is performed using a Kernel Ridge Regression model. We considered two types of image features: i) RGB pixel-values features, and ii) features extracted with a deep learning approach. The experimental evaluation was performed on a Twitter data set containing both text and images and the sentiment associated with these. The experimental results show a difference in performance for different sentiment categories, in particular the mapping that we propose performs better for the positive sentiment category in comparison with the neutral and negative ones. Furthermore, the experimental results show that the more complex deep learning features perform better than the RGB pixel-value features for all sentiment categories and for larger training sets.

pdf bib
Automatic Identification and Production of Related Words for Historical Linguistics
Alina Maria Ciobanu | Liviu P. Dinu
Computational Linguistics, Volume 45, Issue 4 - December 2019

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution.First, we introduce a method to automatically determine whether two words are cognates. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymological information. Having built a data set of related words, we further develop machine learning methods based on orthographic alignment for identifying cognates. We use aligned subsequences as features for classification algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates.Second, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind.Third, we develop a machine learning method for automatically producing related words. We focus on reconstructing proto-words, but we also address two related sub-problems, producing modern word forms and producing cognates. The task of reconstructing proto-words consists of recreating the words in an ancient language from its modern daughter languages. Having modern word forms in multiple Romance languages, we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when words entered the modern languages. We leverage information from several modern languages, building an ensemble system for reconstructing proto-words. We apply our method to multiple data sets, showing that our approach improves on previous results, also having the advantage of requiring less input data, which is essential in historical linguistics, where resources are generally scarce.

2018

pdf bib
ALB at SemEval-2018 Task 10: A System for Capturing Discriminative Attributes
Bogdan Dumitru | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of The 12th International Workshop on Semantic Evaluation

Semantic difference detection attempts to capture whether a word is a discriminative attribute between two other words. For example, the discriminative feature red characterizes the first word from the (apple, banana) pair, but not the second. Modeling semantic difference is essential for language understanding systems, as it provides useful information for identifying particular aspects of word senses. This paper describes our system implementation (the ALB system of the NLP@Unibuc team) for the 10th task of the SemEval 2018 workshop, “Capturing Discriminative Attributes”. We propose a method for semantic difference detection that uses an SVM classifier with features based on co-occurrence counts and shallow semantic parsing, achieving 0.63 F1 score in the competition.

pdf bib
Exploring Optimism and Pessimism in Twitter Using Deep Learning
Cornelia Caragea | Liviu P. Dinu | Bogdan Dumitru
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Identifying optimistic and pessimistic viewpoints and users from Twitter is useful for providing better social support to those who need such support, and for minimizing the negative influence among users and maximizing the spread of positive attitudes and ideas. In this paper, we explore a range of deep learning models to predict optimism and pessimism in Twitter at both tweet and user level and show that these models substantially outperform traditional machine learning classifiers used in prior work. In addition, we show evidence that a sentiment classifier would not be sufficient for accurately predicting optimism and pessimism in Twitter. Last, we study the verb tense usage as well as the presence of polarity words in optimistic and pessimistic tweets.

pdf bib
Ab Initio: Automatic Latin Proto-word Reconstruction
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 27th International Conference on Computational Linguistics

Proto-word reconstruction is central to the study of language evolution. It consists of recreating the words in an ancient language from its modern daughter languages. In this paper we investigate automatic word form reconstruction for Latin proto-words. Having modern word forms in multiple Romance languages (French, Italian, Spanish, Portuguese and Romanian), we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when the Latin words entered the modern languages. We leverage information from all modern languages, building an ensemble system for proto-word reconstruction. We use conditional random fields for sequence labeling, but we conduct preliminary experiments with recurrent neural networks as well. We apply our method on multiple datasets, showing that our method improves on previous results, having also the advantage of requiring less input data, which is essential in historical linguistics, where resources are generally scarce.

pdf bib
Simulating Language Evolution: a Tool for Historical Linguistics
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

Language change across space and time is one of the main concerns in historical linguistics. In this paper, we develop a language evolution simulator: a web-based tool for word form production to assist in historical linguistics, in studying the evolution of the languages. Given a word in a source language, the system automatically predicts how the word evolves in a target language. The method that we propose is language-agnostic and does not use any external knowledge, except for the training word pairs.

pdf bib
Discriminating between Indo-Aryan Languages Using SVM Ensembles
Alina Maria Ciobanu | Marcos Zampieri | Shervin Malmasi | Santanu Pal | Liviu P. Dinu
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. The system competed in the Indo-Aryan Language Identification (ILI) shared task organized within the VarDial Evaluation Campaign 2018. Our best entry in the competition, named ILIdentification, scored 88.95% F1 score and it was ranked 3rd out of 8 teams.

pdf bib
German Dialect Identification Using Classifier Ensembles
Alina Maria Ciobanu | Shervin Malmasi | Liviu P. Dinu
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present the GDI classification entry to the second German Dialect Identification (GDI) shared task organized within the scope of the VarDial Evaluation Campaign 2018. We present a system based on SVM classifier ensembles trained on characters and words. The system was trained on a collection of speech transcripts of five Swiss-German dialects provided by the organizers. The transcripts included in the dataset contained speakers from Basel, Bern, Lucerne, and Zurich. Our entry in the challenge reached 62.03% F1 score and was ranked third out of eight teams.

pdf bib
Content Extraction and Lexical Analysis from Customer-Agent Interactions
Sergiu Nisioi | Anca Bucur | Liviu P. Dinu
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

In this paper, we provide a lexical comparative analysis of the vocabulary used by customers and agents in an Enterprise Resource Planning (ERP) environment and a potential solution to clean the data and extract relevant content for NLP. As a result, we demonstrate that the actual vocabulary for the language that prevails in the ERP conversations is highly divergent from the standardized dictionary and further different from general language usage as extracted from the Common Crawl corpus. Moreover, in specific business communication circumstances, where it is expected to observe a high usage of standardized language, code switching and non-standard expression are predominant, emphasizing once more the discrepancy between the day-to-day use of language and the standardized one.

2017

pdf bib
Exploring Neural Text Simplification Models
Sergiu Nisioi | Sanja Štajner | Simone Paolo Ponzetto | Liviu P. Dinu
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present the first attempt at using sequence to sequence neural networks to model text simplification (TS). Unlike the previously proposed automated TS systems, our neural text simplification (NTS) systems are able to simultaneously perform lexical simplification and content reduction. An extensive human evaluation of the output has shown that NTS systems achieve almost perfect grammaticality and meaning preservation of output sentences and higher level of simplification than the state-of-the-art automated TS systems

pdf bib
On the stylistic evolution from communism to democracy: Solomon Marcus study case
Anca Dinu | Liviu P. Dinu | Bogdan Dumitru
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this article we propose a stylistic analysis of Solomon Marcus’ non-scientific published texts, gathered in six volumes, aiming to uncover some of his quantitative and qualitative fingerprints. Moreover, we compare and cluster two distinct periods of time in his writing style: 22 years of communist regime (1967-1989) and 27 years of democracy (1990-2016). The distributional analysis of Marcus’ text reveals that the passing from the communist regime period to democracy is sharply marked by two complementary changes in Marcus’ writing: in the pre-democracy period, the communist norms of writing style demanded on the one hand long phrases, long words and clichés, and on the other hand, a short list of preferred “official” topics; in democracy tendency was towards shorten phrases and words while approaching a broader area of topics.

pdf bib
Finding a Character’s Voice: Stylome Classification on Literary Characters
Liviu P. Dinu | Ana Sabina Uban
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We investigate in this paper the problem of classifying the stylome of characters in a literary work. Previous research in the field of authorship attribution has shown that the writing style of an author can be characterized and distinguished from that of other authors automatically. In this paper we take a look at the less approached problem of how the styles of different characters can be distinguished, trying to verify if an author managed to create believable characters with individual styles. We present the results of some initial experiments developed on the novel “Liaisons Dangereuses”, showing that a simple bag of words model can be used to classify the characters.

pdf bib
Native Language Identification on Text and Speech
Marcos Zampieri | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents an ensemble system combining the output of multiple SVM classifiers to native language identification (NLI). The system was submitted to the NLI Shared Task 2017 fusion track which featured students essays and spoken responses in form of audio transcriptions and iVectors by non-native English speakers of eleven native languages. Our system competed in the challenge under the team name ZCD and was based on an ensemble of SVM classifiers trained on character n-grams achieving 83.58% accuracy and ranking 3rd in the shared task.

2016

pdf bib
A Computational Perspective on the Romanian Dialects
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we conduct an initial study on the dialects of Romanian. We analyze the differences between Romanian and its dialects using the Swadesh list. We analyze the predictive power of the orthographic and phonetic features of the words, building a classification problem for dialect identification.

pdf bib
Using Word Embeddings to Translate Named Entities
Octavia-Maria Şulea | Sergiu Nisioi | Liviu P. Dinu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we investigate the usefulness of neural word embeddings in the process of translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel, yet simple way of obtaining bilingual word vectors. Inspired by observations in (Mikolov et al., 2013b), which show that training their word vector model on comparable corpora yields comparable vector space representations of those corpora, reducing the problem of translating words to finding a rotation matrix, and results in (Zou et al., 2013), which showed that bilingual word embeddings can improve Chinese Named Entity Recognition (NER) and English to Chinese phrase translation, we use the sentence-aligned English-French EuroParl corpora and show that word embeddings extracted from a merged corpus (corpus resulted from the merger of the two aligned corpora) can be used to NE translation. We extrapolate that word embeddings trained on merged parallel corpora are useful in Named Entity Recognition and Translation tasks for resource-poor languages.

pdf bib
A Corpus of Native, Non-native and Translated Texts
Sergiu Nisioi | Ella Rabinovich | Liviu P. Dinu | Shuly Wintner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe a monolingual English corpus of original and (human) translated texts, with an accurate annotation of speaker properties, including the original language of the utterances and the speaker’s country of origin. We thus obtain three sub-corpora of texts reflecting native English, non-native English, and English translated from a variety of European languages. This dataset will facilitate the investigation of similarities and differences between these kinds of sub-languages. Moreover, it will facilitate a unified comparative study of translations and language produced by (highly fluent) non-native speakers, two closely-related phenomena that have only been studied in isolation so far.

pdf bib
Vanilla Classifiers for Distinguishing between Similar Languages
Sergiu Nisioi | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we describe the submission of the UniBuc-NLP team for the Discriminating between Similar Languages Shared Task, DSL 2016. We present and analyze the results we obtained in the closed track of sub-task 1 (Similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting and for sub-task 2 a character-based string kernel with an SVM classifier. Our results show that good accuracy scores can be obtained with limited feature and model engineering. While certain limitations are to be acknowledged, our approach worked surprisingly well for out-of-domain, social media data, with 0.898 accuracy (3rd place) for dataset B1 and 0.838 accuracy (4th place) for dataset B2.

2015

pdf bib
Readability Assessment of Translated Texts
Alina Maria Ciobanu | Liviu P. Dinu | Flaviu Pepelea
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf bib
Cross-lingual Synonymy Overlap
Anca Dinu | Liviu P. Dinu | Ana Sabina Uban
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf bib
Automatic Discrimination between Cognates and Borrowings
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
AMBRA: A Ranking Approach to Temporal Text Classification
Marcos Zampieri | Alina Maria Ciobanu | Vlad Niculae | Liviu P. Dinu
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf bib
Automatic Detection of Cognates Using Orthographic Alignment
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Temporal Text Ranking and Automatic Dating of Texts
Vlad Niculae | Marcos Zampieri | Liviu Dinu | Alina Maria Ciobanu
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

pdf bib
Predicting Romanian Stress Assignment
Alina Maria Ciobanu | Anca Dinu | Liviu Dinu
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

pdf bib
A Quantitative Insight into the Impact of Translation on Readability
Alina Maria Ciobanu | Liviu Dinu
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

pdf bib
An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
On the Romance Languages Mutual Intelligibility
Liviu Dinu | Alina Maria Ciobanu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We propose a method for computing the similarity of natural languages and for clustering them based on their lexical similarity. Our study provides evidence to be used in the investigation of the written intelligibility, i.e., the ability of people writing in different languages to understand one another without prior knowledge of foreign languages. We account for etymons and cognates, we quantify lexical similarity and we extend our analysis from words to languages. Based on the introduced methodology, we compute a matrix of Romance languages intelligibility.

pdf bib
Aggregation methods for efficient collocation detection
Anca Dinu | Liviu Dinu | Ionut Sorodoc
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this article we propose a rank aggregation method for the task of collocations detection. It consists of applying some well-known methods (e.g. Dice method, chi-square test, z-test and likelihood ratio) and then aggregating the resulting collocations rankings by rank distance and Borda score. These two aggregation methods are especially well suited for the task, since the results of each individual method naturally forms a ranking of collocations. Combination methods are known to usually improve the results, and indeed, the proposed aggregation method performs better then each individual method taken in isolation.

pdf bib
Using a machine learning model to assess the complexity of stress systems
Liviu Dinu | Alina Maria Ciobanu | Ioana Chitoran | Vlad Niculae
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We address the task of stress prediction as a sequence tagging problem. We present sequential models with averaged perceptron training for learning primary stress in Romanian words. We use character n-grams and syllable n-grams as features and we account for the consonant-vowel structure of the words. We show in this paper that Romanian stress is predictable, though not deterministic, by using data-driven machine learning techniques.

pdf bib
Building a Dataset of Multilingual Cognates for the Romanian Lexicon
Liviu Dinu | Alina Maria Ciobanu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Identifying cognates is an interesting task with applications in numerous research areas, such as historical and comparative linguistics, language acquisition, cross-lingual information retrieval, readability and machine translation. We propose a dictionary-based approach to identifying cognates based on etymology and etymons. We account for relationships between languages and we extract etymology-related information from electronic dictionaries. We employ the dataset of cognates that we obtain as a gold standard for evaluating to which extent orthographic methods can be used to detect cognate pairs. The question that arises is whether they are able to discriminate between cognates and non-cognates, given the orthographic changes undergone by foreign words when entering new languages. We investigate some orthographic approaches widely used in this research area and some original metrics as well. We run our experiments on the Romanian lexicon, but the method we propose is adaptable to any language, as far as resources are available.

2013

pdf bib
Temporal classification for historical Romanian texts
Alina Maria Ciobanu | Anca Dinu | Liviu Dinu | Vlad Niculae | Octavia-Maria Şulea
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Temporal Text Classification for Romanian Novels set in the Past
Alina Maria Ciobanu | Liviu P. Dinu | Octavia-Maria Şulea | Anca Dinu | Vlad Niculae
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
A Dictionary-Based Approach for Evaluating Orthographic Methods in Cognates Identification
Alina Maria Ciobanu | Liviu Petrisor Dinu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Sequence Tagging for Verb Conjugation in Romanian
Liviu Dinu | Octavia-Maria Şulea | Vlad Niculae
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
A clustering approach for translationese identification
Sergiu Nisioi | Liviu P. Dinu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

pdf bib
The Romanian Neuter Examined Through A Two-Gender N-Gram Classification System
Liviu P. Dinu | Vlad Niculae | Octavia-Maria Şulea
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Romanian has been traditionally seen as bearing three lexical genders: masculine, feminine and neuter, although it has always been known to have only two agreement patterns (for masculine and feminine). A recent analysis of the Romanian gender system described in (Bateman and Polinsky, 2010), based on older observations, argues that there are two lexically unspecified noun classes in the singular and two different ones in the plural and that what is generally called neuter in Romanian shares the class in the singular with masculines, and the class in the plural with feminines based not only on agreement features but also on form. Previous machine learning classifiers that have attempted to discriminate Romanian nouns according to gender have so far taken as input only the singular form, presupposing the traditional tripartite analysis. We propose a classifier based on two parallel support vector machines using n-gram features from the singular and from the plural which outperforms previous classifiers in its high ability to distinguish the neuter. The performance of our system suggests that the two-gender analysis of Romanian, on which it is based, is on the right track.

pdf bib
Pastiche Detection Based on Stopword Rankings. Exposing Impersonators of a Romanian Writer
Liviu P. Dinu | Vlad Niculae | Maria-Octavia Sulea
Proceedings of the Workshop on Computational Approaches to Deception Detection

pdf bib
Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular Verbs
Liviu P. Dinu | Vlad Niculae | Octavia-Maria Sulea
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
On the Romanian Rhyme Detection
Alina Ciobanu | Liviu P. Dinu
Proceedings of COLING 2012: Demonstration Papers

pdf bib
Dealing with the Grey Sheep of the Romanian Gender System, the Neuter
Liviu P. Dinu | Vlad Niculae | Maria Sulea
Proceedings of COLING 2012: Demonstration Papers

pdf bib
Authorial Studies using Ranked Lexical Features
Liviu P. Dinu | Sergiu Nisioi
Proceedings of COLING 2012: Demonstration Papers

2011

pdf bib
Can Alternations Be Learned? A Machine Learning Approach To Romanian Verb Conjugation
Liviu P. Dinu | Emil Ionescu | Vlad Niculae | Octavia-Maria Şulea
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2009

pdf bib
Comparing Statistical Similarity Measures for Stylistic Multivariate Analysis
Marius Popescu | Liviu P. Dinu
Proceedings of the International Conference RANLP-2009

pdf bib
On the behavior of Romanian syllables related to minimum effort laws
Anca Dinu | Liviu P. Dinu
Proceedings of the Workshop Multilingual resources, technologies and evaluation for central and Eastern European languages

2008

pdf bib
Authorship Identification of Romanian Texts with Controversial Paternity
Liviu Dinu | Marius Popescu | Anca Dinu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this work we propose a new strategy for the authorship identification problem and we test it on an example from Romanian literature: did Radu Albala found the continuation of Mateiu Caragiale’s novel Sub pecetea tainei, or did he write himself the respective continuation? The proposed strategy is based on the similarity of rankings of function words; we compare the obtained results with the results obtained by a learning method (namely Support Vector Machines -SVM- with a string kernel).

pdf bib
Rank Distance as a Stylistic Similarity
Marius Popescu | Liviu P. Dinu
Coling 2008: Companion volume: Posters

2006

pdf bib
On the data base of Romanian syllables and some of its quantitative and cryptographic aspects
Liviu Dinu | Anca Dinu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we argue for the need to construct a data base of Romanian syllables. We explain the reasons for our choice of the DOOM corpus which we have used. We describe the way syllabification was performed and explain how we have constructed the data base. The main quantitative aspects which we have extracted from our research are presented. We also computed the entropy of the syllables and the entropy of the syllables w.r.t. the consonant-vowel structure. The results are compared with results of similar researches realized for different languages.

pdf bib
Total Rank Distance and Scaled Total Rank Distance: Two Alternative Metrics in Computational Linguistics
Anca Dinu | Liviu P. Dinu
Proceedings of the Workshop on Linguistic Distances