Alham Fikri Aji


pdf bib
In Neural Machine Translation, What Does Transfer Learning Transfer?
Alham Fikri Aji | Nikolay Bogoychev | Kenneth Heafield | Rico Sennrich
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Transfer learning improves quality for low-resource machine translation, but it is unclear what exactly it transfers. We perform several ablation studies that limit information transfer, then measure the quality impact across three language pairs to gain a black-box understanding of transfer learning. Word embeddings play an important role in transfer learning, particularly if they are properly aligned. Although transfer learning can be performed without embeddings, results are sub-optimal. In contrast, transferring only the embeddings but nothing else yields catastrophic results. We then investigate diagonal alignments with auto-encoders over real languages and randomly generated sequences, finding even randomly generated sequences as parents yield noticeable but smaller gains. Finally, transfer learning can eliminate the need for a warm-up phase when training transformer models in high resource language pairs.

pdf bib
Compressing Neural Machine Translation Models with 4-bit Precision
Alham Fikri Aji | Kenneth Heafield
Proceedings of the Fourth Workshop on Neural Generation and Translation

Neural Machine Translation (NMT) is resource-intensive. We design a quantization procedure to compress fit NMT models better for devices with limited hardware capability. We use logarithmic quantization, instead of the more commonly used fixed-point quantization, based on the empirical fact that parameters distribution is not uniform. We find that biases do not take a lot of memory and show that biases can be left uncompressed to improve the overall quality without affecting the compression rate. We also propose to use an error-feedback mechanism during retraining, to preserve the compressed model as a stale gradient. We empirically show that NMT models based on Transformer or RNN architecture can be compressed up to 4-bit precision without any noticeable quality degradation. Models can be compressed up to binary precision, albeit with lower quality. RNN architecture seems to be more robust towards compression, compared to the Transformer.

pdf bib
Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task
Nikolay Bogoychev | Roman Grundkiewicz | Alham Fikri Aji | Maximiliana Behnke | Kenneth Heafield | Sidharth Kashyap | Emmanouil-Ioannis Farsarakis | Mateusz Chudyk
Proceedings of the Fourth Workshop on Neural Generation and Translation

We participated in all tracks of the Workshop on Neural Generation and Translation 2020 Efficiency Shared Task: single-core CPU, multi-core CPU, and GPU. At the model level, we use teacher-student training with a variety of student sizes, tie embeddings and sometimes layers, use the Simpler Simple Recurrent Unit, and introduce head pruning. On GPUs, we used 16-bit floating-point tensor cores. On CPUs, we customized 8-bit quantization and multiple processes with affinity for the multi-core setting. To reduce model size, we experimented with 4-bit log quantization but use floats at runtime. In the shared task, most of our submissions were Pareto optimal with respect the trade-off between time and quality.

pdf bib
Benchmarking Multidomain English-Indonesian Machine Translation
Tri Wahyu Guntara | Alham Fikri Aji | Radityo Eko Prasojo
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

In the context of Machine Translation (MT) from-and-to English, Bahasa Indonesia has been considered a low-resource language, and therefore applying Neural Machine Translation (NMT) which typically requires large training dataset proves to be problematic. In this paper, we show otherwise by collecting large, publicly-available datasets from the Web, which we split into several domains: news, religion, general, and conversation, to train and benchmark some variants of transformer-based NMT models across the domains. We show using BLEU that our models perform well across them , outperform the baseline Statistical Machine Translation (SMT) models, and perform comparably with Google Translate. Our datasets (with the standard split for training, validation, and testing), code, and models are available on


pdf bib
Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training
Alham Fikri Aji | Kenneth Heafield | Nikolay Bogoychev
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model’s performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node’s locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.

pdf bib
Making Asynchronous Stochastic Gradient Descent Work for Transformers
Alham Fikri Aji | Kenneth Heafield
Proceedings of the 3rd Workshop on Neural Generation and Translation

Asynchronous stochastic gradient descent (SGD) converges poorly for Transformer models, so synchronous SGD has become the norm for Transformer training. This is unfortunate because asynchronous SGD is faster at raw training speed since it avoids waiting for synchronization. Moreover, the Transformer model is the basis for state-of-the-art models for several tasks, including machine translation, so training speed matters. To understand why asynchronous SGD under-performs, we blur the lines between asynchronous and synchronous methods. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this method, the Transformer attains the same BLEU score 1.36 times as fast.


pdf bib
Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation
Nikolay Bogoychev | Kenneth Heafield | Alham Fikri Aji | Marcin Junczys-Dowmunt
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In order to extract the best possible performance from asynchronous stochastic gradient descent one must increase the mini-batch size and scale the learning rate accordingly. In order to achieve further speedup we introduce a technique that delays gradient updates effectively increasing the mini-batch size. Unfortunately with the increase of mini-batch size we worsen the stale gradient problem in asynchronous stochastic gradient descent (SGD) which makes the model convergence poor. We introduce local optimizers which mitigate the stale gradient problem and together with fine tuning our momentum we are able to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.

pdf bib
Marian: Fast Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Roman Grundkiewicz | Tomasz Dwojak | Hieu Hoang | Kenneth Heafield | Tom Neckermann | Frank Seide | Ulrich Germann | Alham Fikri Aji | Nikolay Bogoychev | André F. T. Martins | Alexandra Birch
Proceedings of ACL 2018, System Demonstrations

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.


pdf bib
Sparse Communication for Distributed Gradient Descent
Alham Fikri Aji | Kenneth Heafield
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed as most updates are near zero, so we map the 99% smallest updates (by absolute value) to zero then exchange sparse matrices. This method can be combined with quantization to further improve the compression. We explore different configurations and apply them to neural machine translation and MNIST image classification tasks. Most configurations work on MNIST, whereas different configurations reduce convergence rate on the more complex translation task. Our experiments show that we can achieve up to 49% speed up on MNIST and 22% on NMT without damaging the final accuracy or BLEU.