Reconstructing NER Corpora: a Case Study on Bulgarian
Iva Marinova | Laska Laskova | Petya Osenova | Kiril Simov | Alexander Popov
Proceedings of the 12th Language Resources and Evaluation Conference

The paper reports on the use of deep learning methods for improving a Named Entity Recognition (NER) training corpus and for predicting and annotating new types in a test corpus. We show how the annotations in a type-based corpus of named entities (NE) were populated as occurrences within it, thus ensuring density of the training information. A deep learning model was adopted for discovering inconsistencies in the initial annotation and for learning new NE types. The evaluation results improve after data curation, randomization and deduplication.


The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Laska Laskova | Michał Marcińczuk | Lidia Pivovarova | Pavel Přibáň | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.

Modeling MWEs in BTB-WN
Laska Laskova | Petya Osenova | Kiril Simov | Ivajlo Radev | Zara Kancheva
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

The paper presents the characteristics of the predominant types of multiword expressions (MWEs) in the BulTreeBank WordNet (BTB-WN). Their distribution in BTB-WN is discussed with respect to the overall hierarchical organization of the lexical resource. In addition, a catena-based modeling approach is proposed for handling the lexical semantics of MWEs.


A Treebank-driven Creation of an OntoValence Verb lexicon for Bulgarian
Petya Osenova | Kiril Simov | Laska Laskova | Stanislava Kancheva
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents a treebank-driven approach to the construction of a Bulgarian valence lexicon with ontological restrictions over the inner participants of the event. First, the underlying ideas behind the Bulgarian Ontology-based lexicon are outlined. Then, the extraction and manipulation of the valence frames are discussed with respect to the BulTreeBank annotation scheme and the DOLCE ontology. The most frequent types of syntactic frames are also specified, as well as the most frequent types of ontological restrictions over the verb arguments. The envisaged applications of such a lexicon are assigning ontological labels to syntactically parsed corpora and expanding the lexicon and lexical information in the Bulgarian Resource Grammar.

Linguistic Analysis Processing Line for Bulgarian
Aleksandar Savkov | Laska Laskova | Stanislava Kancheva | Petya Osenova | Kiril Simov
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a linguistic processing pipeline for Bulgarian that includes morphological analysis, lemmatization and syntactic analysis of Bulgarian texts. The morphological analysis is performed by three modules: two statistical and one rule-based. The combination of these modules achieves the best result for morphological tagging of Bulgarian over a rich tagset (680 tags). The lemmatization is based on rules generated from a large morphological lexicon of Bulgarian. The syntactic analysis is implemented via MaltParser. The two statistical morphological taggers and MaltParser are trained on datasets constructed within the BulTreeBank project. The processing pipeline also includes a sentence splitter and a tokenizer. All tools in the pipeline are packaged as modules that can also run separately. The whole pipeline is designed to serve as the back-end of a web-service-oriented interface, but it also supports user tasks through a command-line interface. The processing pipeline is compatible with the Text Corpus Format, which allows it to delegate the management of the components to the WebLicht platform.


Bulgarian-English Parallel Treebank: Word and Semantic Level Alignment
Kiril Simov | Petya Osenova | Laska Laskova | Aleksandar Savkov | Stanislava Kancheva
Proceedings of The Second Workshop on Annotation and Exploitation of Parallel Corpora


Exploring Co-Reference Chains for Concept Annotation of Domain Texts
Petya Osenova | Laska Laskova | Kiril Simov
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper explores co-reference chains as a way of improving the density of concept annotation over domain texts. The idea extends the authors' previous work on relating the ontology to the text terms in two domains, IT and textile; here the IT domain is used. The challenge is to enhance relations among concepts rather than among text entities, the latter being the focus of most works. Our ultimate goal is to exploit these additional chains for concept disambiguation as well as sparseness resolution at the concept level. First, a gold standard was prepared with manually connected links among concepts, anaphoric pronouns and contextual equivalents. This step was necessary not only for test purposes, but also for better orientation in the co-referent types and their distribution. Then, two automatic systems were tested on the gold standard. Note that these systems were not designed specifically for concept chaining. The conclusion is that state-of-the-art co-reference resolution systems might address the concept sparseness problem, but not so much the concept disambiguation task. For the latter, word-sense disambiguation systems have to be integrated.


Event Ordering. Temporal Annotation on Top of the BulTreeBank
Laska Laskova
Proceedings of the Student Research Workshop