Lidia Pivovarova


2020

Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings Not Always Better than Static for Semantic Change Detection
Matej Martinc | Syrielle Montariol | Elaine Zosa | Lidia Pivovarova
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the approaches used by the Discovery Team to solve SemEval-2020 Task 1 - Unsupervised Lexical Semantic Change Detection. The proposed method is based on clustering of BERT contextual embeddings, followed by a comparison of cluster distributions across time. The best results were obtained by an ensemble of this method and static Word2Vec embeddings. According to the official results, our approach proved the best for Latin in Subtask 2.
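As a rough illustration of the "comparison of cluster distributions across time" step, the sketch below assumes that each usage of a target word has already been assigned a cluster label (for example, by clustering its contextual embeddings) and that the two periods are compared with Jensen-Shannon divergence. The divergence measure, the function names, and the toy labels are assumptions made for illustration and are not taken from the abstract.

```python
import math
from collections import Counter


def cluster_distribution(labels, n_clusters):
    """Turn a list of cluster labels into a normalized frequency vector."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(n_clusters)]


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two distributions (0 to 1)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical cluster labels for the usages of one word in two time periods.
labels_t1 = [0, 0, 0, 1]
labels_t2 = [0, 1, 1, 1]

p = cluster_distribution(labels_t1, n_clusters=2)
q = cluster_distribution(labels_t2, n_clusters=2)
change_score = jensen_shannon_divergence(p, q)
```

A higher `change_score` would indicate that the word's usages are spread across clusters differently in the two periods, which is one possible signal of semantic change.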

A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval
Elaine Zosa | Mark Granroth-Wilding | Lidia Pivovarova
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)

We address the problem of linking related documents across languages in a multilingual collection. We evaluate three diverse unsupervised methods to represent and compare documents: (1) multilingual topic model; (2) cross-lingual document embeddings; and (3) Wasserstein distance. We test the performance of these methods in retrieving news articles in Swedish that are known to be related to a given Finnish article. The results show that ensembles of the methods outperform the stand-alone methods, suggesting that they capture complementary characteristics of the documents.
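The abstract does not say how the ensembles are formed. One simple possibility, shown here purely as an assumption, is to min-max normalize the distances produced by each method and average them before ranking candidate documents. All function names and the toy distance values are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a list of distances to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_rank(distance_lists):
    """Average normalized distances across methods; return indices, closest first."""
    normalized = [min_max_normalize(d) for d in distance_lists]
    n_docs = len(normalized[0])
    combined = [
        sum(method[i] for method in normalized) / len(normalized)
        for i in range(n_docs)
    ]
    return sorted(range(n_docs), key=lambda i: combined[i])


# Hypothetical distances from one query article to three candidate articles,
# one list per method (smaller means more similar).
topic_model_dist = [0.2, 0.9, 0.5]
embedding_dist = [0.4, 0.3, 0.8]
wasserstein_dist = [1.0, 5.0, 3.0]

ranking = ensemble_rank([topic_model_dist, embedding_dist, wasserstein_dist])
```

Normalizing first keeps a method with a larger numeric range (here the Wasserstein distances) from dominating the average.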

2019

Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Michał Marcińczuk | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Laska Laskova | Michał Marcińczuk | Lidia Pivovarova | Pavel Přibáň | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.

Word Clustering for Historical Newspapers Analysis
Lidia Pivovarova | Elaine Zosa | Jani Marjanen
Proceedings of the Workshop on Language Technology for Digital Historical Archives

This paper is part of a collaboration between computer scientists and historians aimed at developing novel tools and methods to improve the analysis of historical newspapers. We present a case study of ideological terms ending with the suffix -ism in nineteenth-century Finnish newspapers. We propose a two-step procedure to trace differences in word usage over time: training diachronic embeddings on several time slices and then clustering the embeddings of selected words together with their neighbours to obtain historical context. The obtained clusters turn out to be useful for historical studies. The paper also discusses specific difficulties related to the development of historian-oriented tools.

2018

DL Team at SemEval-2018 Task 1: Tweet Affect Detection using Sentiment Lexicons and Embeddings
Dmitry Kravchenko | Lidia Pivovarova
Proceedings of The 12th International Workshop on Semantic Evaluation

The paper describes our approach for SemEval-2018 Task 1: Affect Detection in Tweets. We perform experiments with manually compiled sentiment lexicons and word embeddings. We test their performance on the Twitter affect detection task to determine which features produce the most informative representation of a sentence. We demonstrate that general-purpose word embeddings produce a more informative sentence representation than lexicon features. However, combining lexicon features with embeddings yields higher performance than embeddings alone.

Benchmarks and models for entity-oriented polarity detection
Lidia Pivovarova | Arto Klami | Roman Yangarber
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

We address the problem of determining entity-oriented polarity in business news. This can be viewed as classifying the polarity of the sentiment expressed toward a given mention of a company in a news article. We present a complete, end-to-end approach to the problem. We introduce a new dataset of over 17,000 manually labeled documents, which is substantially larger than any currently available resources. We propose a benchmark solution based on convolutional neural networks for classifying entity-oriented polarity. Although our dataset is much larger than those currently available, it is small on the scale of datasets commonly used for training robust neural network models. To compensate for this, we use transfer learning: we pre-train the model on a much larger dataset, annotated for a related but different classification task, in order to learn a good representation for business text, and then fine-tune it on the smaller polarity dataset.

Comparison of Representations of Named Entities for Document Classification
Lidia Pivovarova | Roman Yangarber
Proceedings of The Third Workshop on Representation Learning for NLP

We explore representations for multi-word names in text classification tasks, on Reuters (RCV1) topic and sector classification. We find that: the best way to treat names is to split them into tokens and use each token as a separate feature; NEs have more impact on sector classification than topic classification; replacing NEs with entity types is not an effective strategy; representing tokens by different embeddings for proper names vs. common nouns does not improve results. We highlight the improvements over state-of-the-art results that our CNN models yield.

2017

Grouping business news stories based on salience of named entities
Llorenç Escoter | Lidia Pivovarova | Mian Du | Anisia Katinskaia | Roman Yangarber
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In news aggregation systems focused on broad news domains, certain stories may appear in multiple articles. Depending on the relative importance of the story, the number of versions can reach dozens or hundreds within a day. The text in these versions may be nearly identical or quite different. Linking multiple versions of a story into a single group brings several important benefits to the end-user: reducing the cognitive load on the reader, as well as signaling the relative importance of the story. We present a grouping algorithm, and explore several vector-based representations of input documents: from a baseline using keywords, to a method using salience, a measure of importance of named entities in the text. We demonstrate that features beyond keywords yield substantial improvements, verified on a manually annotated corpus of business news stories.

HCS at SemEval-2017 Task 5: Polarity detection in business news using convolutional neural networks
Lidia Pivovarova | Llorenç Escoter | Arto Klami | Roman Yangarber
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Task 5 of SemEval-2017 involves fine-grained sentiment analysis on financial microblogs and news. Our solution for determining the sentiment score extends an earlier convolutional neural network for sentiment analysis in several ways. We explicitly encode a focus on a particular company, we apply a data augmentation scheme, and use a larger data collection to complement the small training data provided by the task organizers. The best results were achieved by training a model on an external dataset and then tuning it using the provided training dataset.

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Clustering of Russian Adjective-Noun Constructions using Word Embeddings
Andrey Kutuzov | Elizaveta Kuzmenko | Lidia Pivovarova
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper presents a method of automatic construction extraction from a large corpus of Russian. The term ‘construction’ here means a multi-word expression in which a variable can be replaced with another word from the same semantic class, for example, ‘a glass of [water/juice/milk]’. We deal with constructions that consist of a noun and its adjective modifier. We propose a method of grouping such constructions into semantic classes via 2-step clustering of word vectors in distributional models. We compare it with other clustering techniques and evaluate it against A Russian-English Collocational Dictionary of the Human Body, which contains manually annotated groups of constructions with nouns meaning human body parts. The best performing method is used to cluster all adjective-noun bigrams in the Russian National Corpus. Results of this procedure are publicly available and can be used for building a Russian construction dictionary as well as for accelerating theoretical studies of constructions.

The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper describes the outcomes of the first challenge on multilingual named entity recognition that aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching. It was organised in the context of the 6th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2017 conference. Although eleven teams signed up for the evaluation, due to the complexity of the task(s) and short time available for elaborating a solution, only two teams submitted results on time. The reported evaluation figures reflect the relatively higher level of complexity of named entity-related tasks in the context of processing texts in Slavic languages. Since the duration of the challenge goes beyond the date of the publication of this paper, an updated picture of the participating systems and their corresponding performance can be found on the web page of the challenge.

Towards Never Ending Language Learning for Morphologically Rich Languages
Kseniya Buraya | Lidia Pivovarova | Sergey Budkov | Andrey Filchenkov
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This work deals with ontology learning from unstructured Russian text. We implement one of the components of the Never Ending Language Learner and introduce algorithm extensions aimed at capturing the specifics of a morphologically rich, free-word-order language. We demonstrate that this method may be successfully applied to Russian data. We also perform several additional experiments comparing different settings of the training process. We demonstrate that utilizing morphological features significantly improves the system's precision, while using seed patterns helps to improve the coverage.

2015

The 5th Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Hristo Tanev | Roman Yangarber
The 5th Workshop on Balto-Slavic Natural Language Processing

Online Extraction of Russian Multiword Expressions
Mikhail Kopotev | Llorenç Escoter | Daria Kormacheva | Matthew Pierce | Lidia Pivovarova | Roman Yangarber
The 5th Workshop on Balto-Slavic Natural Language Processing

2013

Automatic Detection of Stable Grammatical Features in N-Grams
Mikhail Kopotev | Lidia Pivovarova | Natalia Kochetkova | Roman Yangarber
Proceedings of the 9th Workshop on Multiword Expressions

Event representation across genre
Lidia Pivovarova | Silja Huttunen | Roman Yangarber
Workshop on Events: Definition, Detection, Coreference, and Representation

Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski | Lidia Pivovarova | Hristo Tanev | Roman Yangarber
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Adapting the PULS event extraction framework to analyze Russian text
Lidia Pivovarova | Mian Du | Roman Yangarber
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Combined analysis of news and Twitter messages
Mian Du | Jussi Kangasharju | Ossi Karkulahti | Lidia Pivovarova | Roman Yangarber
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction