Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe
This paper describes a Romanian Dependency Treebank built at the Al. I. Cuza University (UAIC) and the special OCR techniques used to build it. The corpus has rich morphological and syntactic annotation. There are few annotated representative corpora for Romanian, and the existing ones focus mainly on the contemporary Romanian standard. The corpus described below focuses on non-standard aspects of the language: Regional and Old Romanian. Intending to participate in the PROIEL project, which aligns the oldest New Testaments, we annotate the first printed Romanian New Testament (Alba Iulia, 1648). We began by applying the UAIC tools for the morphological and syntactic processing of Contemporary Romanian to the first quarter of the book (second edition). By carefully correcting by hand the output of the automatic annotation, which had modest accuracy, we obtained a sub-corpus for training tools for processing Old Romanian. The first edition of the New Testament, however, is written in Cyrillic letters. Books printed in the Old Cyrillic alphabet are a common problem for Romania and the Republic of Moldova, countries where Romanian is spoken, and one to be solved through the joint efforts of NLP researchers in the two countries.
Contemporary standard language corpora are ideal for NLP. There are few morphologically and syntactically annotated corpora for Romanian, and those that exist or are in progress deal only with the Contemporary Romanian standard. However, the need to study the dynamics of natural languages has given rise to balanced corpora that contain non-standard texts. In this paper, we describe the creation of tools for processing non-standard Romanian in order to build a large balanced corpus. We want to preserve in annotated form as many early stages of the language as possible. We have already built a corpus of Old Romanian. We also intend to include the South-Danube dialects, which are remote from the standard language, along with regional forms closer to the standard. We try to preserve data about endangered idioms such as the Aromanian, Meglenoromanian and Istroromanian dialects, and to calculate the distance between different regional variants, including the language spoken in the Republic of Moldova. This distance, together with the mutual understanding between speakers, is the correct criterion for classifying idioms as different languages, as dialects, or as regional variants close to the standard.
We describe work in the field of folkloristics that consists in creating ontologies based on well-established studies proposed by “classical” folklorists. This work helps make a huge amount of digital, structured knowledge on folktales available to digital humanists. The ontological encoding of past and current motif-indexation and classification systems for folktales was initially limited to English language data. This led us to focus on making the newly generated formal knowledge sources available in a few more languages, such as German, Russian and Bulgarian. We stress the importance of achieving this multilingual extension of our ontologies at a larger scale, for example to support the automated analysis and classification of such narratives in a large variety of languages, as they become more and more accessible on the Web.
Current approaches in Digital Humanities tend to ignore a central aspect of any hermeneutic introspection: the intrinsic vagueness of the analyzed texts. Especially when dealing with historical documents, neglecting vagueness has important implications for the interpretation of the results. In this paper we present current limitations of annotation approaches and describe a current methodology for annotating vagueness in historical Romanian texts.
Language Technologies in Teaching Bulgarian at Primary and Secondary School Level: the NBU Platform of Language Teaching (PLT)
Maria Stambolieva | Valentina Ivanova | Mariana Raykova | Milka Hadjikoteva | Mariya Neykova
The NBU Language Teaching Platform (PLT) was initially designed for teaching foreign languages for specific purposes; at a second stage, some of its functionalities were extended to meet the needs of general foreign language teaching. New functionalities have now been created to provide e-support for Bulgarian language and literature teaching at primary and secondary school level. The article presents the general structure of the platform and the functionalities specifically developed to match the standards and expected results set by the Ministry of Education. The e-platform integrates: 1/ an environment for creating, organizing and maintaining electronic text archives, for extracting text corpora and for aligning corpora; 2/ a linguistic database; 3/ a concordancer; 4/ a set of modules for generating and editing practice exercises for each text or corpus; 5/ functionalities for export from the platform and import into other educational platforms. For Moodle, modules were created for test generation, performance assessment and feedback. The PLT allows centralized presentation of abundant teaching content, control of the educational process, and fast and reliable feedback on performance.
This paper presents the Majoritas ecosystem, which provides a complete overview of political campaign assessment and is aimed at assisting politicians and their staff in delivering consistent and personalized messages within social media.