Aleksandrs Berdicevskis


2020

Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender Assignment
Hartger Veeman | Marc Allassonnière-Tang | Aleksandrs Berdicevskis | Ali Basirat
Proceedings of the 24th Conference on Computational Natural Language Learning

Grammatical gender is assigned to nouns differently in different languages. Are all factors that influence gender assignment idiosyncratic to languages or are there any that are universal? Using cross-lingual aligned word embeddings, we perform two experiments to address these questions about language typology and human cognition. In both experiments, we predict the gender of nouns in language X using a classifier trained on the nouns of language Y, and take the classifier’s accuracy as a measure of transferability of gender systems. First, we show that for 22 Indo-European languages the transferability decreases as the phylogenetic distance increases. This correlation supports the claim that some gender assignment factors are idiosyncratic, and as the languages diverge, the proportion of shared inherited idiosyncrasies diminishes. Second, we show that when the classifier is trained on two Afro-Asiatic languages and tested on the same 22 Indo-European languages (or vice versa), its performance is still significantly above the chance baseline, thus showing that universal factors exist and, moreover, can be captured by word embeddings. When the classifier is tested across families and on inanimate nouns only, the performance is still above baseline, indicating that the universal factors are not limited to biological sex.
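The transfer procedure described in the abstract can be sketched in miniature. Everything below is a synthetic stand-in: the "embeddings" are random vectors with a shared gender signal plus a language-specific offset, and the nearest-centroid classifier is an illustrative choice, not the paper's actual model or data.

```python
import random

random.seed(0)
DIM = 20  # illustrative embedding dimensionality

# Toy stand-in for cross-lingually aligned embeddings: nouns of two
# "languages" drawn from shared gender-specific distributions plus a
# small language-specific offset (all numbers are invented).
def make_nouns(n, offset=0.0):
    nouns = []
    for _ in range(n):
        gender = random.randint(0, 1)  # 0 = feminine, 1 = masculine
        center = 0.5 if gender else -0.5
        vec = [center + random.gauss(0, 1) + offset for _ in range(DIM)]
        nouns.append((vec, gender))
    return nouns

def centroid(vectors):
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

train = make_nouns(400)              # nouns of "language Y"
test = make_nouns(400, offset=0.2)   # nouns of "language X"

# Train a nearest-centroid gender classifier on language Y ...
cents = {g: centroid([v for v, gg in train if gg == g]) for g in (0, 1)}

# ... and take its accuracy on language X as the transferability score.
correct = sum(min(cents, key=lambda g: dist2(v, cents[g])) == gold
              for v, gold in test)
transferability = correct / len(test)
```

On this toy data the score stays well above the 0.5 chance baseline because the two languages share most of their gender signal; increasing `offset` (standing in for phylogenetic distance) degrades it, mirroring the correlation the paper reports.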

A Diachronic Treebank of Russian Spanning More Than a Thousand Years
Aleksandrs Berdicevskis | Hanne Eckhoff
Proceedings of the 12th Language Resources and Evaluation Conference

We describe the Tromsø Old Russian and Old Church Slavonic Treebank (TOROT), which spans from the earliest Old Church Slavonic to modern Russian texts, covering more than a thousand years of continuous language history. We focus on the latest additions to the treebank, most notably the modern subcorpus, which was created by a high-quality conversion of the existing treebank of contemporary standard Russian (SynTagRus).

Foreigner-directed speech is simpler than native-directed: Evidence from social media
Aleksandrs Berdicevskis
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

I test two hypotheses that play an important role in modern sociolinguistics and language evolution studies: first, that non-native production is simpler than native; second, that production addressed to non-native speakers is simpler than that addressed to natives. The second hypothesis is particularly important for theories about contact-induced simplification, since the accommodation to non-natives may explain how the simplification can spread from adult learners to the whole community. To test the hypotheses, I create a very large corpus of native and non-native written speech in four languages (English, French, Italian, Spanish), extracting data from an internet forum where native languages of the participants are known and the structure of the interactions can be inferred. The corpus data yield inconsistent evidence with respect to the first hypothesis, but largely support the second one, suggesting that foreigner-directed speech is indeed simpler than native-directed. Importantly, when testing the first hypothesis, I contrast production of different speakers, which can introduce confounds and is a likely reason for the inconsistencies. When testing the second hypothesis, the comparison is always within the production of the same speaker (but with different addressees), which makes it more reliable.

Subjects tend to be coded only once: Corpus-based and grammar-based evidence for an efficiency-driven trade-off
Aleksandrs Berdicevskis | Karsten Schmidtke-Bode | Ilja Seržant
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

Corpus evidence for word order freezing in Russian and German
Aleksandrs Berdicevskis | Alexander Piperski
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

We use Universal Dependencies treebanks to test whether a well-known typological trade-off between word order freedom and richness of morphological marking of core arguments holds within individual languages. Using Russian and German treebank data, we show that the following phenomenon (sometimes dubbed word order freezing) does occur: those sentences where core arguments cannot be distinguished by morphological means (due to case syncretism or other kinds of ambiguity) have more rigid order of subject, verb and object than those where unambiguous morphological marking is present. In ambiguous clauses, word order is more often equal to the one which is default or dominant (most frequent) in the language. While Russian and German differ with respect to how exactly they mark core arguments, the effect of morphological ambiguity is significant in both languages. It is, however, small, suggesting that languages do adapt to the evolutionary pressure on communicative efficiency and avoidance of redundancy, but that the pressure is weak in this particular respect.
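The measurement behind the freezing claim can be sketched as a simple count over parsed clauses. The clause tuples below are invented toy data (linear positions of subject and object, plus a flag for morphological ambiguity), not extracted from the Russian or German treebanks.

```python
# Toy parsed clauses: (subject_position, object_position, case_ambiguous).
# In a real study these would come from UD treebank annotation; here the
# values are illustrative only.
clauses = [
    (0, 2, False), (2, 0, False), (0, 2, False), (0, 2, True),
    (0, 2, True),  (0, 2, True),  (2, 0, False), (0, 2, True),
]

def subject_first_rate(items):
    """Share of clauses whose subject precedes the object."""
    if not items:
        return None
    return sum(s < o for s, o, _ in items) / len(items)

ambiguous = [c for c in clauses if c[2]]
unambiguous = [c for c in clauses if not c[2]]

# Freezing prediction: when case marking cannot disambiguate the
# arguments, clauses stick to the dominant (subject-first) order
# more often than when marking is unambiguous.
rate_amb = subject_first_rate(ambiguous)
rate_unamb = subject_first_rate(unambiguous)
```

In this toy sample every morphologically ambiguous clause follows the dominant order, while unambiguous clauses vary freely, which is the qualitative pattern (though far stronger than the small effect) that the paper reports.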

2018

pdf bib
Using Universal Dependencies in cross-linguistic complexity research
Aleksandrs Berdicevskis | Çağrı Çöltekin | Katharina Ehret | Kilu von Prince | Daniel Ross | Bill Thompson | Chunxiao Yan | Vera Demberg | Gary Lupyan | Taraka Rama | Christian Bentz
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of those differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation.
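One way to read "robustness of a complexity value" is as the spread of the measure under resampling of the treebank. The sketch below uses a deliberately crude complexity proxy (type/token ratio over word forms) and a tiny invented corpus; neither the measure nor the data corresponds to those evaluated in the paper.

```python
import random
import statistics

# Tiny invented "corpus": each sentence is a list of word forms.
corpus = [
    ["koshka", "spit"], ["koshki", "spyat"], ["kota", "vizhu"],
    ["kot", "spit"], ["kotu", "dali", "rybu"], ["koshka", "spit"],
]

def ttr(sentences):
    """Type/token ratio: a crude stand-in for morphological complexity."""
    tokens = [w for s in sentences for w in s]
    return len(set(tokens)) / len(tokens)

random.seed(1)

# Robustness via bootstrap: resample sentences with replacement and
# look at how much the complexity value varies across resamples.
samples = [ttr(random.choices(corpus, k=len(corpus))) for _ in range(1000)]
spread = statistics.stdev(samples)
```

A measure with a small bootstrap spread relative to cross-language differences is robust in the sense sketched here; the paper's own robustness estimation is more involved, but follows the same logic of checking how stable a value is for a given measure and treebank.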

2016

Learning pressures reduce morphological complexity: Linking corpus, computational and experimental evidence
Christian Bentz | Aleksandrs Berdicevskis
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

The morphological complexity of languages differs widely and changes over time. Pathways of change are often driven by the interplay of multiple competing factors, and are hard to disentangle. We here focus on a paradigmatic scenario of language change: the reduction of morphological complexity from Latin towards the Romance languages. To establish a causal explanation for this phenomenon, we employ three lines of evidence: 1) analyses of parallel corpora to measure the complexity of words in actual language production, 2) applications of NLP tools to further tease apart the contribution of inflectional morphology to word complexity, and 3) experimental data from artificial language learning, which illustrate the learning pressures at play when morphology simplifies. These three lines of evidence converge to show that pressures associated with imperfect language learning are good candidates to causally explain the reduction in morphological complexity in the Latin-to-Romance scenario. More generally, we argue that combining corpus, computational and experimental evidence is the way forward in historical linguistics and linguistic typology.

2015

Estimating Grammeme Redundancy by Measuring Their Importance for Syntactic Parser Performance
Aleksandrs Berdicevskis
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning