Andrey Kutuzov


2020

UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection
Andrey Kutuzov | Mario Giulianelli
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings. They outperform strong baselines by a large margin (in the post-evaluation phase, we have the best Subtask 2 submission for SemEval-2020 Task 1), but interestingly, the choice of a particular algorithm depends on the distribution of gold scores in the test set.
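The two families of measures named in the abstract can be illustrated with a minimal sketch. It assumes each target word is represented by a list of token embeddings per time period, and it gives one plausible reading of "cosine similarity between averaged token embeddings" and "pairwise distances between token embeddings". The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def prototype_distance(period1, period2):
    """Cosine distance between the averaged token embeddings of two periods."""
    return cosine_distance(mean_vector(period1), mean_vector(period2))


def average_pairwise_distance(period1, period2):
    """Mean cosine distance over all cross-period pairs of token embeddings."""
    pairs = list(product(period1, period2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-dimensional "token embeddings" (hypothetical data, for illustration only)
stable_old = [[1.0, 0.0], [1.0, 0.0]]
stable_new = [[1.0, 0.0], [1.0, 0.0]]
shifted_new = [[0.0, 1.0], [0.0, 1.0]]

stable_score = prototype_distance(stable_old, stable_new)
shifted_score = prototype_distance(stable_old, shifted_new)
shifted_pairwise = average_pairwise_distance(stable_old, shifted_new)
```

Under this reading, words would be ranked by either score, with a higher value suggesting stronger semantic drift between the two periods.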

Word Sense Disambiguation for 158 Languages using Word Embeddings Only
Varvara Logacheva | Denis Teslenko | Artem Shelmanov | Steffen Remus | Dmitry Ustalov | Andrey Kutuzov | Ekaterina Artemova | Chris Biemann | Simone Paolo Ponzetto | Alexander Panchenko
Proceedings of the 12th Language Resources and Evaluation Conference

Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely unsupervised and knowledge-free approaches to word sense disambiguation (WSD). They are particularly useful for under-resourced languages which do not have any resources for building either supervised or knowledge-based models. In this paper, we present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory, which can be used for disambiguation in context. We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings by Grave et al. (2018), enabling WSD in these languages. The models and the system are available online.

RuSemShift: a dataset of historical lexical semantic change in Russian
Julia Rodina | Andrey Kutuzov
Proceedings of the 28th International Conference on Computational Linguistics

We present RuSemShift, a large-scale manually annotated test set for the task of semantic change modeling in Russian for two long-term time period pairs: from the pre-Soviet through the Soviet times and from the Soviet through the post-Soviet times. Target words were annotated by multiple crowd-source workers. The annotation process was organized following the DURel framework and was based on sentence contexts extracted from the Russian National Corpus. Additionally, we report the performance of several distributional approaches on RuSemShift, achieving promising results, which at the same time leave room for other researchers to improve.

2019

ÚFAL-Oslo at MRP 2019: Garage Sale Semantic Parsing
Kira Droganova | Andrey Kutuzov | Nikita Mediankin | Daniel Zeman
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

This paper describes the ÚFAL-Oslo system submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP, Oepen et al. 2019). The submission is based on several third-party parsers. Within the official shared task results, the submission ranked 11th out of 13 participating systems.

Making Fast Graph-based Algorithms with Graph Metric Embeddings
Andrey Kutuzov | Mohammad Dorgham | Oleksiy Oliynyk | Chris Biemann | Alexander Panchenko
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Graph measures, such as node distances, are inefficient to compute. We explore dense vector representations as an effective way to approximate the same information. We introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks.

One-to-X Analogical Reasoning on Word Embeddings: a Case for Diachronic Armed Conflict Prediction from News Texts
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We extend the well-known word analogy task to a one-to-X formulation, including one-to-none cases, when no correct answer exists. The task is cast as a relation discovery problem and applied to historical armed conflicts datasets, attempting to predict new relations of type ‘location:armed-group’ based on data about past events. As the source of semantic information, we use diachronic word embedding models trained on English news texts. A simple technique to improve diachronic performance in such a task is demonstrated, using a threshold based on a function of cosine distance to decrease the number of false positives; this approach is shown to be beneficial on two different corpora. Finally, we publish a ready-to-use test set for one-to-X analogy evaluation on historical armed conflicts data.

Measuring Diachronic Evolution of Evaluative Adjectives with Word Embeddings: the Case for English, Norwegian, and Russian
Julia Rodina | Daria Bakshandaeva | Vadim Fomin | Andrey Kutuzov | Samia Touileb | Erik Velldal
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We measure the intensity of diachronic semantic shifts in adjectives in English, Norwegian and Russian across 5 decades. This is done in order to test the hypothesis that evaluative adjectives are more prone to temporal semantic change. To this end, 6 different methods of quantifying semantic change are used. Frequency-controlled experimental results show that, depending on the particular method, evaluative adjectives either do not differ from other types of adjectives in terms of semantic change or appear to actually be less prone to shifting (particularly, to ‘jitter’-type shifting). Thus, in spite of many well-known examples of semantically changing evaluative adjectives (like ‘terrific’ or ‘incredible’), it seems that such cases are not specific to this particular type of words.

To Lemmatize or Not to Lemmatize: How Word Normalisation Affects ELMo Performance in Word Sense Disambiguation
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

In this paper, we critically evaluate the widespread assumption that deep learning NLP models do not require lemmatized input. To test this, we trained versions of contextualised word embedding ELMo models on raw tokenized corpora and on the corpora with word tokens replaced by their lemmas. Then, these models were evaluated on the word sense disambiguation task. This was done for the English and Russian languages. The experiments showed that while lemmatization is indeed not necessary for English, the situation is different for Russian. It seems that for rich-morphology languages, using lemmatized training and testing data yields small but consistent improvements, at least for word sense disambiguation. This means that the decisions about text pre-processing before training ELMo should consider the linguistic nature of the language in question.

Learning Graph Embeddings from WordNet-based Similarity Measures
Andrey Kutuzov | Mohammad Dorgham | Oleksiy Oliynyk | Chris Biemann | Alexander Panchenko
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.

2018

Diachronic word embeddings and semantic shifts: a survey
Andrey Kutuzov | Lilja Øvrelid | Terrence Szymanski | Erik Velldal
Proceedings of the 27th International Conference on Computational Linguistics

Recent years have witnessed a surge of publications aimed at tracing temporal changes in lexical semantics using distributional methods, particularly prediction-based word embedding models. However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing. In this paper, we survey the current state of academic research related to diachronic word embeddings and semantic shift detection. We start by discussing the notion of semantic shifts, and then continue with an overview of the existing methods for tracing such time-related shifts with word embedding models. We propose several axes along which these methods can be compared, and outline the main challenges facing this emerging subfield of NLP, as well as prospects and possible applications.

Unsupervised Semantic Frame Induction using Triclustering
Dmitry Ustalov | Alexander Panchenko | Andrey Kutuzov | Chris Biemann | Simone Paolo Ponzetto
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem that is a generalization of clustering for triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived dataset and performs on par with competitive methods on a verb class clustering task.

2017

Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

In this demo we present WebVectors, a free and open-source toolkit helping to deploy web services which demonstrate and visualize distributional semantic models (widely known as word embeddings). WebVectors can be useful in a very common situation when one has trained a distributional semantics model for one’s particular corpus or language (tools for this are now widespread and simple to use), but then there is a need to demonstrate the results to the general public over the Web. We show its abilities using the example of live web services featuring distributional models for English, Norwegian and Russian.

Word vectors, reuse, and replicability: Towards a community repository of large-text resources
Murhaf Fares | Andrey Kutuzov | Stephan Oepen | Erik Velldal
Proceedings of the 21st Nordic Conference on Computational Linguistics

Redefining Context Windows for Word Embedding Models: An Experimental Study
Pierre Lison | Andrey Kutuzov
Proceedings of the 21st Nordic Conference on Computational Linguistics

Clustering of Russian Adjective-Noun Constructions using Word Embeddings
Andrey Kutuzov | Elizaveta Kuzmenko | Lidia Pivovarova
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper presents a method of automatic construction extraction from a large corpus of Russian. The term ‘construction’ here means a multi-word expression in which a variable can be replaced with another word from the same semantic class, for example, ‘a glass of [water/juice/milk]’. We deal with constructions that consist of a noun and its adjective modifier. We propose a method of grouping such constructions into semantic classes via 2-step clustering of word vectors in distributional models. We compare it with other clustering techniques and evaluate it against A Russian-English Collocational Dictionary of the Human Body that contains manually annotated groups of constructions with nouns meaning human body parts. The best performing method is used to cluster all adjective-noun bigrams in the Russian National Corpus. Results of this procedure are publicly available and can be used for building a Russian construction dictionary as well as to accelerate theoretical studies of constructions.

Tracing armed conflicts with diachronic word embedding models
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Proceedings of the Events and Stories in the News Workshop

Recent studies have shown that word embedding models can be used to trace time-related (diachronic) semantic shifts in particular words. In this paper, we evaluate some of these approaches on the new task of predicting the dynamics of global armed conflicts on a year-to-year basis, using a dataset from the conflict research field as the gold standard and the Gigaword news corpus as the training data. The results show that much work still remains in extracting ‘cultural’ semantic shifts from diachronic word embedding models. At the same time, we present a new task complete with an evaluation set and introduce the ‘anchor words’ method which outperforms previous approaches on this set.

Universal Dependencies-based syntactic features in detecting human translation varieties
Maria Kunilovskaya | Andrey Kutuzov
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper deals with using word embedding models to trace the temporal dynamics of semantic relations between pairs of words. The set-up is similar to the well-known analogies task, but expanded with a time dimension. To this end, we apply incremental updating of the models with new training texts, including incremental vocabulary expansion, coupled with learned transformation matrices that let us map between members of the relation. The proposed approach is evaluated on the task of predicting insurgent armed groups based on geographical locations. The gold standard data for the time span 1994–2010 is extracted from the UCDP Armed Conflicts dataset. The results show that the method is feasible and outperforms the baselines, but also that important work still remains to be done.

2016

Neural Embedding Language Models in Semantic Clustering of Web Search Results
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, a new approach towards semantic clustering of the results of ambiguous search queries is presented. We propose using distributed vector representations of words trained with the help of prediction-based neural embedding models to detect senses of search queries and to cluster the search engine results page according to these senses. The words from titles and snippets together with semantic relationships between them form a graph, which is further partitioned into components related to different query senses. This approach to search engine results clustering is evaluated against a new manually annotated evaluation data set of Russian search queries. We show that in the task of semantically clustering search results, prediction-based models slightly but stably outperform traditional count-based ones, with the same training corpora.

Redefining part-of-speech classes with distributional semantic models
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

Exploration of register-dependent lexical semantics using word embeddings
Andrey Kutuzov | Elizaveta Kuzmenko | Anna Marakasova
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

We present an approach to detect differences in lexical semantics across English language registers, using word embedding models from the distributional semantics paradigm. Models trained on register-specific subcorpora of the BNC corpus are employed to compare lists of nearest associates for particular words and draw conclusions about their semantic shifts depending on the register in which they are used. The models are evaluated on the task of register classification with the help of the deep inverse regression approach. Additionally, we present a demo web service featuring most of the described models, allowing users to explore word meanings in different English registers and to detect register affiliation for arbitrary texts. The code for the service can be easily adapted to any set of underlying models.

2015

Semi-automated typical error annotation for learner English essays: integrating frameworks
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the fourth workshop on NLP for computer-assisted language learning

2014

Russian Error-Annotated Learner English Corpus: a Tool for Computer-Assisted Language Learning
Elizaveta Kuzmenko | Andrey Kutuzov
Proceedings of the third workshop on NLP for computer-assisted language learning

2013

Improving English-Russian sentence alignment through POS tagging and Damerau-Levenshtein distance
Andrey Kutuzov
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing