Nikolay Bogoychev


2020

pdf bib
Parallel Sentence Mining by Constrained Decoding
Pinzhen Chen | Nikolay Bogoychev | Kenneth Heafield | Faheem Kirefu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity scorer and it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to other submissions.

pdf bib
In Neural Machine Translation, What Does Transfer Learning Transfer?
Alham Fikri Aji | Nikolay Bogoychev | Kenneth Heafield | Rico Sennrich
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Transfer learning improves quality for low-resource machine translation, but it is unclear what exactly it transfers. We perform several ablation studies that limit information transfer, then measure the quality impact across three language pairs to gain a black-box understanding of transfer learning. Word embeddings play an important role in transfer learning, particularly if they are properly aligned. Although transfer learning can be performed without embeddings, results are sub-optimal. In contrast, transferring only the embeddings but nothing else yields catastrophic results. We then investigate diagonal alignments with auto-encoders over real languages and randomly generated sequences, finding even randomly generated sequences as parents yield noticeable but smaller gains. Finally, transfer learning can eliminate the need for a warm-up phase when training transformer models in high resource language pairs.

pdf bib
Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task
Nikolay Bogoychev | Roman Grundkiewicz | Alham Fikri Aji | Maximiliana Behnke | Kenneth Heafield | Sidharth Kashyap | Emmanouil-Ioannis Farsarakis | Mateusz Chudyk
Proceedings of the Fourth Workshop on Neural Generation and Translation

We participated in all tracks of the Workshop on Neural Generation and Translation 2020 Efficiency Shared Task: single-core CPU, multi-core CPU, and GPU. At the model level, we use teacher-student training with a variety of student sizes, tie embeddings and sometimes layers, use the Simpler Simple Recurrent Unit, and introduce head pruning. On GPUs, we used 16-bit floating-point tensor cores. On CPUs, we customized 8-bit quantization and multiple processes with affinity for the multi-core setting. To reduce model size, we experimented with 4-bit log quantization but use floats at runtime. In the shared task, most of our submissions were Pareto optimal with respect the trade-off between time and quality.

pdf bib
Character Mapping and Ad-hoc Adaptation: Edinburgh’s IWSLT 2020 Open Domain Translation System
Pinzhen Chen | Nikolay Bogoychev | Ulrich Germann
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes the University of Edinburgh’s neural machine translation systems submitted to the IWSLT 2020 open domain JapaneseChinese translation task. On top of commonplace techniques like tokenisation and corpus cleaning, we explore character mapping and unsupervised decoding-time adaptation. Our techniques focus on leveraging the provided data, and we show the positive impact of each technique through the gradual improvement of BLEU.

2019

pdf bib
Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training
Alham Fikri Aji | Kenneth Heafield | Nikolay Bogoychev
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model’s performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node’s locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.

pdf bib
From Research to Production and Back: Ludicrously Fast Neural Machine Translation
Young Jin Kim | Marcin Junczys-Dowmunt | Hany Hassan | Alham Fikri Aji | Kenneth Heafield | Roman Grundkiewicz | Nikolay Bogoychev
Proceedings of the 3rd Workshop on Neural Generation and Translation

This paper describes the submissions of the “Marian” team to the WNGT 2019 efficiency shared task. Taking our dominating submissions to the previous edition of the shared task as a starting point, we develop improved teacher-student training via multi-agent dual-learning and noisy backward-forward translation for Transformer-based student models. For efficient CPU-based decoding, we propose pre-packed 8-bit matrix products, improved batched decoding, cache-friendly student architectures with parameter sharing and light-weight RNN-based decoder architectures. GPU-based decoding benefits from the same architecture changes, from pervasive 16-bit inference and concurrent streams. These modifications together with profiler-based C++ code optimization allow us to push the Pareto frontier established during the 2018 edition towards 24x (CPU) and 14x (GPU) faster models at comparable or higher BLEU values. Our fastest CPU model is more than 4x faster than last year’s fastest submission at more than 3 points higher BLEU. Our fastest GPU model at 1.5 seconds translation time is slightly faster than last year’s fastest RNN-based submissions, but outperforms them by more than 4 BLEU and 10 BLEU points respectively.

pdf bib
Similar Minds Post Alike: Assessment of Suicide Risk Using a Hybrid Model
Lushi Chen | Abeer Aldayel | Nikolay Bogoychev | Tao Gong
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

This paper describes our system submission for the CLPsych 2019 shared task B on suicide risk assessment. We approached the problem with three separate models: a behaviour model; a language model and a hybrid model. For the behavioral model approach, we model each user’s behaviour and thoughts with four groups of features: posting behaviour, sentiment, motivation, and content of the user’s posting. We use these features as an input in a support vector machine (SVM). For the language model approach, we trained a language model for each risk level using all the posts from the users as the training corpora. Then, we computed the perplexity of each user’s posts to determine how likely his/her posts were to belong to each risk level. Finally, we built a hybrid model that combines both the language model and the behavioral model, which demonstrates the best performance in detecting the suicide risk level.

pdf bib
The University of Edinburgh’s Submissions to the WMT19 News Translation Task
Rachel Bawden | Nikolay Bogoychev | Ulrich Germann | Roman Grundkiewicz | Faheem Kirefu | Antonio Valerio Miceli Barone | Alexandra Birch
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English↔Gujarati, English↔Chinese, German→English, and English→Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English↔Gujarati, we also explored semi-supervised MT with cross-lingual language model pre-training, and translation pivoting through Hindi. For translation to and from Chinese, we investigated character-based tokenisation vs. sub-word segmentation of Chinese text. For German→English, we studied the impact of vast amounts of back-translated training data on translation quality, gaining a few additional insights over Edunov et al. (2018). For English→Czech, we compared different preprocessing and tokenisation regimes.

2018

pdf bib
Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
Nikolay Bogoychev | Kenneth Heafield | Alham Fikri Aji | Marcin Junczys-Dowmunt
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In order to extract the best possible performance from asynchronous stochastic gradient descent one must increase the mini-batch size and scale the learning rate accordingly. In order to achieve further speedup we introduce a technique that delays gradient updates effectively increasing the mini-batch size. Unfortunately with the increase of mini-batch size we worsen the stale gradient problem in asynchronous stochastic gradient descent (SGD) which makes the model convergence poor. We introduce local optimizers which mitigate the stale gradient problem and together with fine tuning our momentum we are able to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.

pdf bib
Marian: Fast Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Roman Grundkiewicz | Tomasz Dwojak | Hieu Hoang | Kenneth Heafield | Tom Neckermann | Frank Seide | Ulrich Germann | Alham Fikri Aji | Nikolay Bogoychev | André F. T. Martins | Alexandra Birch
Proceedings of ACL 2018, System Demonstrations

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

pdf bib
The University of Edinburgh’s Submissions to the WMT18 News Translation Task
Barry Haddow | Nikolay Bogoychev | Denis Emelin | Ulrich Germann | Roman Grundkiewicz | Kenneth Heafield | Antonio Valerio Miceli Barone | Rico Sennrich
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The University of Edinburgh made submissions to all 14 language pairs in the news translation task, with strong performances in most pairs. We introduce new RNN-variant, mixed RNN/Transformer ensembles, data selection and weighting, and extensions to back-translation.

2016

pdf bib
N-gram language models for massively parallel devices
Nikolay Bogoychev | Adam Lopez
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Fast and highly parallelizable phrase table for statistical machine translation
Nikolay Bogoychev | Hieu Hoang
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

2015

pdf bib
The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015
Barry Haddow | Matthias Huck | Alexandra Birch | Nikolay Bogoychev | Philipp Koehn
Proceedings of the Tenth Workshop on Statistical Machine Translation