2020
On the Discrepancy between Density Estimation and Sequence Generation
Jason Lee

Dustin Tran

Orhan Firat

Kyunghyun Cho
Proceedings of the Fourth Workshop on Structured Prediction for NLP
Many sequence-to-sequence generation tasks, including machine translation and text-to-speech, can be posed as estimating the density of the output y given the input x: p(y|x). Given this interpretation, it is natural to evaluate sequence-to-sequence models using conditional log-likelihood on a test set. However, the goal of sequence-to-sequence generation (or structured prediction) is to find the best output ŷ given an input x, and each task has its own downstream metric R that scores a model output by comparing against a set of references y*: R(ŷ, y* | x). While we hope that a model that excels in density estimation also performs well on the downstream metric, the exact correlation has not been studied for sequence generation tasks. In this paper, by comparing several density estimators on five machine translation tasks, we find that the correlation between rankings of models based on log-likelihood and BLEU varies significantly depending on the range of the model families being compared. First, log-likelihood is highly correlated with BLEU when we consider models within the same family (e.g. autoregressive models, or latent variable models with the same parameterization of the prior). However, we observe no correlation between rankings of models across different families: (1) among non-autoregressive latent variable models, a flexible prior distribution is better at density estimation but gives worse generation quality than a simple prior, and (2) autoregressive models offer the best translation performance overall, while latent variable models with a normalizing flow prior give the highest held-out log-likelihood across all datasets. Therefore, we recommend using a simple prior for the latent variable non-autoregressive model when fast generation speed is desired.
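The abstract's central measurement, whether ranking models by held-out log-likelihood agrees with ranking them by BLEU, can be illustrated with a small rank-correlation sketch. The scores below are hypothetical, not taken from the paper; the function is a plain-Python Spearman coefficient (no ties assumed):

```python
def spearman(xs, ys):
    # Spearman rank correlation between two score lists (assumes no ties).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical held-out log-likelihoods and BLEU scores for four models
# within one family: the rankings agree perfectly.
ll = [-2.1, -1.9, -1.7, -1.5]
bleu = [24.0, 25.1, 26.3, 27.0]
print(spearman(ll, bleu))  # 1.0 — identical rankings
```

A correlation near 1 corresponds to the within-family finding; the cross-family finding corresponds to a coefficient near 0.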
Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation
Jason Lee

Raphael Shu

Kyunghyun Cho
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
We propose an efficient inference procedure for non-autoregressive machine translation that iteratively refines translation purely in the continuous space. Given a continuous latent variable model for machine translation (Shu et al., 2020), we train an inference network to approximate the gradient of the marginal log probability of the target sentence, using the latent variable instead. This allows us to use gradient-based optimization to find the target sentence at inference time that approximately maximizes its marginal probability. As each refinement step only involves computation in the latent space of low dimensionality (we use 8 in our experiments), we avoid computational overhead incurred by existing non-autoregressive inference procedures that often refine in token space. We compare our approach to a recently proposed EM-like inference procedure (Shu et al., 2020) that optimizes in a hybrid space, consisting of both discrete and continuous variables. We evaluate our approach on WMT’14 En→De, WMT’16 Ro→En and IWSLT’16 De→En, and observe two advantages over the EM-like inference: (1) it is computationally efficient, i.e. each refinement step is twice as fast, and (2) it is more effective, resulting in higher marginal probabilities and BLEU scores with the same number of refinement steps. On WMT’14 En→De, for instance, our approach is able to decode 6.2 times faster than the autoregressive model with minimal degradation to translation quality (0.9 BLEU).
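The gradient-based refinement in the latent space can be sketched with a toy surrogate. Here a Gaussian log-density (whose gradient is simply mu - z) stands in for the trained inference network's approximation of the marginal log-probability gradient; everything except the 8-dimensional latent is an illustrative assumption:

```python
import numpy as np

def refine(z, grad_logp, steps=10, lr=0.1):
    # Iteratively move the latent toward higher (surrogate) log-probability.
    for _ in range(steps):
        z = z + lr * grad_logp(z)
    return z

# Toy stand-in for the learned objective: a Gaussian log-density centred at mu.
# Its gradient, mu - z, plays the role of the inference network's output.
rng = np.random.default_rng(0)
mu = rng.normal(size=8)                # 8-dimensional latent, as in the paper
z0 = np.zeros(8)
z_star = refine(z0, lambda z: mu - z, steps=50, lr=0.5)
print(np.allclose(z_star, mu, atol=1e-6))  # True — converges to the mode
```

Because each step touches only an 8-dimensional vector, the per-step cost is trivial compared with re-decoding in token space.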
2019
Countering Language Drift via Visual Grounding
Jason Lee

Kyunghyun Cho

Douwe Kiela
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Emergent multi-agent communication protocols are very different from natural language and not easily interpretable by humans. We find that agents that were initially pretrained to produce natural language can also experience detrimental language drift: when a non-linguistic reward is used in a goal-based task, e.g. some scalar success metric, the communication protocol may easily and radically diverge from natural language. We recast translation as a multi-agent communication game and examine auxiliary training constraints for their effectiveness in mitigating language drift. We show that a combination of syntactic (language model likelihood) and semantic (visual grounding) constraints gives the best communication performance, allowing pretrained agents to retain English syntax while learning to accurately convey the intended meaning.
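The combination of constraints described above amounts to a weighted objective: maximize task success while penalising drift away from natural language (the LM term) and away from the intended meaning (the grounding term). The function and weights below are a hypothetical sketch of this shape, not the paper's actual loss:

```python
def combined_loss(task_reward, lm_loglik, grounding_score, alpha=1.0, beta=1.0):
    # Negative of the combined objective: task success plus a syntactic
    # (language-model likelihood) and a semantic (visual grounding) term.
    # alpha and beta are hypothetical weighting coefficients.
    return -task_reward - alpha * lm_loglik - beta * grounding_score

# A drift-free agent (high LM likelihood, high grounding) scores lower loss
# than a drifted one with the same task reward.
print(combined_loss(1.0, -0.5, 0.8) < combined_loss(1.0, -3.0, 0.1))  # True
```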
2018
Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement
Jason Lee

Elman Mansimov

Kyunghyun Cho
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
We propose a conditional non-autoregressive neural sequence model based on iterative refinement. The proposed model is designed based on the principles of latent variable models and denoising autoencoders, and is generally applicable to any sequence generation task. We extensively evaluate the proposed model on machine translation (En-De and En-Ro) and image caption generation, and observe that it significantly speeds up decoding while maintaining the generation quality comparable to the autoregressive counterpart.
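The refinement idea can be sketched generically: emit a full draft sequence, then repeatedly apply a denoising step until it stops changing. The one-correction-per-pass `step` below is a toy stand-in (a real model re-predicts all positions in parallel each iteration); all names are illustrative:

```python
def iterative_refine(initial, refine_step, max_iters=10):
    # Generic refinement loop: apply a denoising step until a fixed point.
    y = initial
    for _ in range(max_iters):
        y_next = refine_step(y)
        if y_next == y:
            break
        y = y_next
    return y

# Toy "denoiser" that corrects one token per pass toward a known target.
target = list("hello")
def step(y):
    for i, (a, b) in enumerate(zip(y, target)):
        if a != b:
            return y[:i] + [b] + y[i + 1:]
    return y

print("".join(iterative_refine(list("hxllq"), step)))  # -> hello
```

The speed-up in the paper comes from each refinement pass being parallel across positions, in contrast to autoregressive left-to-right decoding.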
2017
Fully Character-Level Neural Machine Translation without Explicit Segmentation
Jason Lee

Kyunghyun Cho

Thomas Hofmann
Transactions of the Association for Computational Linguistics, Volume 5
Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT’15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of the BLEU score and human judgment.
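The length-reduction mechanism, convolution over character embeddings followed by strided max-pooling, can be sketched in NumPy. Shapes, filter widths, and the pooling stride below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def conv_maxpool(x, w, pool=5):
    # x: (seq_len, emb_dim) character embeddings
    # w: (width, emb_dim, out_dim) convolution filter bank
    width = w.shape[0]
    L = x.shape[0] - width + 1
    # Valid 1-D convolution over the character sequence.
    conv = np.stack([np.einsum('we,weo->o', x[i:i + width], w) for i in range(L)])
    # Max-pool with stride `pool`, shrinking the source representation ~pool-fold.
    L_out = L // pool
    return conv[:L_out * pool].reshape(L_out, pool, -1).max(axis=1)

rng = np.random.default_rng(0)
chars = rng.normal(size=(100, 16))      # 100 characters, 16-dim embeddings
filters = rng.normal(size=(3, 16, 32))  # width-3 filters, 32 output channels
out = conv_maxpool(chars, filters, pool=5)
print(out.shape)  # (19, 32): roughly 5x shorter than the character sequence
```

Shortening the representation this way is what lets the character-level encoder train at a speed comparable to subword-level models.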