The EOS Decision and Length Extrapolation

Benjamin Newman, John Hewitt, Percy Liang, Christopher D. Manning


Abstract
Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40% improvement over +EOS in the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position is a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.
Anthology ID:
2020.blackboxnlp-1.26
Volume:
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2020
Address:
Online
Venues:
BlackboxNLP | EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
276–291
Language:
URL:
https://www.aclweb.org/anthology/2020.blackboxnlp-1.26
DOI:
10.18653/v1/2020.blackboxnlp-1.26
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2020.blackboxnlp-1.26.pdf