Alina Wróblewska


pdf bib
Towards the Conversion of National Corpus of Polish to Universal Dependencies
Alina Wróblewska
Proceedings of the 12th Language Resources and Evaluation Conference

The research presented in this paper aims at enriching the manually morphosyntactically annotated part of National Corpus of Polish (NKJP1M) with a syntactic layer, i.e. dependency trees of sentences, and at converting both dependency trees and morphosyntactic annotations of particular tokens to Universal Dependencies. The dependency layer is built using a semi-automatic annotation procedure. The sentences from NKJP1M are first parsed with a dependency parser trained on Polish Dependency Bank, i.e. the largest bank of Polish dependency trees. The predicted dependency trees and the morphosyntactic annotations of tokens are then automatically converted into UD dependency graphs. NKJP1M sentences are an essential part of Polish Dependency Bank, we thus replace some automatically predicted dependency trees with their manually annotated equivalents. The final dependency treebank consists of 86K trees (including 15K gold-standard trees). A natural language pre-processing model trained on the enlarged set of (possibly noisy) dependency trees outperforms a model trained on a smaller set of the gold-standard trees in predicting part-of-speech tags, morphological features, lemmata, and labelled dependency trees


pdf bib
Empirical Linguistic Study of Sentence Embeddings
Katarzyna Krasnowska-Kieraś | Alina Wróblewska
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The purpose of the research is to answer the question whether linguistic information is retained in vector representations of sentences. We introduce a method of analysing the content of sentence embeddings based on universal probing tasks, along with the classification datasets for two contrasting languages. We perform a series of probing and downstream experiments with different types of sentence embeddings, followed by a thorough analysis of the experimental results. Aside from dependency parser-based embeddings, linguistic information is retained best in the recently proposed LASER sentence embeddings.


pdf bib
Polish Corpus of Annotated Descriptions of Images
Alina Wróblewska
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Semi-Supervised Neural System for Tagging, Parsing and Lematization
Piotr Rybak | Alina Wróblewska
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes the ICS PAS system which took part in CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. The system consists of jointly trained tagger, lemmatizer, and dependency parser which are based on features extracted by a biLSTM network. The system uses both fully connected and dilated convolutional neural architectures. The novelty of our approach is the use of an additional loss function, which reduces the number of cycles in the predicted dependency graphs, and the use of self-training to increase the system performance. The proposed system, i.e. ICS PAS (Warszawa), ranked 3th/4th in the official evaluation obtaining the following overall results: 73.02 (LAS), 60.25 (MLAS) and 64.44 (BLEX).

pdf bib
Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format
Alina Wróblewska
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

The paper presents the largest Polish Dependency Bank in Universal Dependencies format – PDBUD – with 22K trees and 352K tokens. PDBUD builds on its previous version, i.e. the Polish UD treebank (PL-SZ), and contains all 8K PL-SZ trees. The PL-SZ trees are checked and possibly corrected in the current edition of PDBUD. Further 14K trees are automatically converted from a new version of Polish Dependency Bank. The PDBUD trees are expanded with the enhanced edges encoding the shared dependents and the shared governors of the coordinated conjuncts and with the semantic roles of some dependents. The conducted evaluation experiments show that PDBUD is large enough for training a high-quality graph-based dependency parser for Polish.


pdf bib
Polish evaluation dataset for compositional distributional semantics models
Alina Wróblewska | Katarzyna Krasnowska-Kieraś
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The paper presents a procedure of building an evaluation dataset. for the validation of compositional distributional semantics models estimated for languages other than English. The procedure generally builds on steps designed to assemble the SICK corpus, which contains pairs of English sentences annotated for semantic relatedness and entailment, because we aim at building a comparable dataset. However, the implementation of particular building steps significantly differs from the original SICK design assumptions, which is caused by both lack of necessary extraneous resources for an investigated language and the need for language-specific transformation rules. The designed procedure is verified on Polish, a fusional language with a relatively free word order, and contributes to building a Polish evaluation dataset. The resource consists of 10K sentence pairs which are human-annotated for semantic relatedness and entailment. The dataset may be used for the evaluation of compositional distributional semantics models of Polish.


pdf bib
Projection-based Annotation of a Polish Dependency Treebank
Alina Wróblewska | Adam Przepiórkowski
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents an approach of automatic annotation of sentences with dependency structures. The approach builds on the idea of cross-lingual dependency projection. The presented method of acquiring dependency trees involves a weighting factor in the processes of projecting source dependency relations to target sentences and inducing well-formed target dependency trees from sets of projected dependency relations. Using a parallel corpus, source trees are transferred onto equivalent target sentences via an extended set of alignment links. Projected arcs are initially weighted according to the certainty of word alignment links. Then, arc weights are recalculated using a method based on the EM selection algorithm. Maximum spanning trees selected from EM-scored digraphs and labelled with appropriate grammatical functions constitute a target dependency treebank. Extrinsic evaluation shows that parsers trained on such a treebank may perform comparably to parsers trained on a manually developed treebank.


pdf bib
Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
Djamé Seddah | Reut Tsarfaty | Sandra Kübler | Marie Candito | Jinho D. Choi | Richárd Farkas | Jennifer Foster | Iakes Goenaga | Koldo Gojenola Galletebeitia | Yoav Goldberg | Spence Green | Nizar Habash | Marco Kuhlmann | Wolfgang Maier | Joakim Nivre | Adam Przepiórkowski | Ryan Roth | Wolfgang Seeker | Yannick Versley | Veronika Vincze | Marcin Woliński | Alina Wróblewska | Eric Villemonte de la Clergerie
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages