Alexey Sorokin


2020

UniMorph 3.0: Universal Morphology
Arya D. McCarthy | Christo Kirov | Matteo Grella | Amrit Nidhi | Patrick Xia | Kyle Gorman | Ekaterina Vylomova | Sabrina J. Mielke | Garrett Nicolai | Miikka Silfverberg | Timofey Arkhangelskiy | Nataly Krizhanovsky | Andrew Krizhanovsky | Elena Klyachko | Alexey Sorokin | John Mansfield | Valts Ernštreits | Yuval Pinter | Cassandra L. Jacobs | Ryan Cotterell | Mans Hulden | David Yarowsky
Proceedings of the 12th Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological paradigms for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. We have implemented several improvements to the extraction pipeline which creates most of our data, so that it is both more complete and more correct. We have added 66 new languages, as well as new parts of speech for 12 languages. We have also amended the schema in several ways. Finally, we present three new community tools: two to validate data for resource creators, and one to make morphological data available from the command line. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018.

Getting More Data for Low-resource Morphological Inflection: Language Models and Data Augmentation
Alexey Sorokin
Proceedings of the 12th Language Resources and Evaluation Conference

We investigate how to improve the quality of low-resource morphological inflection without annotating more data. We examine two methods: language models and data augmentation. We show that a model whose decoder additionally uses the states of the language model improves quality by 1.5% in combination with both baselines. We also demonstrate that data augmentation improves performance by 9% on average when adding 1000 artificially generated word forms to the dataset.

2019

Tuning Multilingual Transformers for Language-Specific Named Entity Recognition
Mikhail Arkhipov | Maria Trofimova | Yuri Kuratov | Alexey Sorokin
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

Our paper addresses the problem of multilingual named entity recognition for four languages: Russian, Bulgarian, Czech and Polish. We solve this task using the BERT model. We use a multilingual model covering a hundred languages as the base for transfer to these Slavic languages. Unsupervised pre-training of the BERT model on these four languages allows us to significantly outperform baseline neural approaches and multilingual BERT. An additional improvement is achieved by extending BERT with a word-level CRF layer. Our system was submitted to the BSNLP 2019 Shared Task on Multilingual Named Entity Recognition and demonstrated top performance in the multilingual setting for two competition metrics. We open-sourced the NER models and the BERT model pre-trained on the four Slavic languages.

Convolutional neural networks for low-resource morpheme segmentation: baseline or state-of-the-art?
Alexey Sorokin
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

We apply convolutional neural networks to the task of shallow morpheme segmentation using low-resource datasets for 5 different languages. We show that in both fully supervised and semi-supervised settings our model beats previous state-of-the-art approaches. We argue that convolutional neural networks reflect the local nature of morpheme segmentation better than other semi-supervised approaches.

2018

What can we gain from language models for morphological inflection?
Alexey Sorokin
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

DeepPavlov: Open-Source Library for Dialogue Systems
Mikhail Burtsev | Alexander Seliverstov | Rafael Airapetyan | Mikhail Arkhipov | Dilyara Baymurzina | Nickolay Bushkov | Olga Gureenkova | Taras Khakhulin | Yuri Kuratov | Denis Kuznetsov | Alexey Litinsky | Varvara Logacheva | Alexey Lymar | Valentin Malykh | Maxim Petrov | Vadim Polulyakh | Leonid Pugachev | Alexey Sorokin | Maria Vikhreva | Marat Zaynutdinov
Proceedings of ACL 2018, System Demonstrations

Adoption of messaging communication and voice assistants has grown rapidly in recent years. This creates a demand for tools that speed up prototyping of feature-rich dialogue systems. The open-source library DeepPavlov is tailored for the development of conversational agents. The library prioritises efficiency, modularity, and extensibility with the goal of making it easier to develop dialogue systems from scratch and with limited data available. It supports modular as well as end-to-end approaches to the implementation of conversational agents. A conversational agent consists of skills, and every skill can be decomposed into components. Components are usually models which solve typical NLP tasks such as intent classification, named entity recognition or pre-trained word vectors. A sequence-to-sequence chit-chat skill, a question answering skill or a task-oriented skill can be assembled from components provided in the library.

2017

Spelling Correction for Morphologically Rich Language: a Case Study of Russian
Alexey Sorokin
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

We present an algorithm for automatic correction of spelling errors on the sentence level, which uses a noisy channel model and feature-based reranking of hypotheses. Our system is designed for Russian and clearly outperforms the winner of the SpellRuEval-2016 competition. We show that language model size has the greatest influence on spelling correction quality. We also experiment with different types of features and show that morphological and semantic information also improves the accuracy of spellchecking.
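
For readers unfamiliar with the noisy channel idea mentioned in this abstract, the following is a minimal word-level sketch, not the paper's sentence-level system and without its feature-based reranking. It picks the candidate that maximises a unigram log-prior plus a channel score. The toy counts in `COUNTS`, the function names, and the `error_penalty` value are all assumptions made for illustration, and plain Levenshtein distance stands in for a learned error model.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def correct(word: str, unigram_counts: Counter, error_penalty: float = 4.0) -> str:
    """Return argmax over vocabulary of log P(cand) + log P(word | cand).

    The channel term is approximated as -error_penalty * edit_distance,
    which is an illustrative assumption, not a trained error model.
    """
    total = sum(unigram_counts.values())

    def score(cand: str) -> float:
        log_prior = math.log(unigram_counts[cand] / total)
        log_channel = -error_penalty * edit_distance(word, cand)
        return log_prior + log_channel

    return max(unigram_counts, key=score)


# Hypothetical toy vocabulary with made-up frequencies.
COUNTS = Counter({"the": 50, "then": 10, "than": 8, "hen": 1})

if __name__ == "__main__":
    print(correct("teh", COUNTS))
    print(correct("than", COUNTS))
```

In this sketch a larger `error_penalty` makes the corrector trust the observed spelling more, while a smaller one lets frequent words win more often.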

2016

Using longest common subsequence and character models to predict word forms
Alexey Sorokin
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology