Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

J. Edward Hu, Abhinav Singh, Nils Holzenberger, Matt Post, Benjamin Van Durme


Abstract
Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering.We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.
Anthology ID:
K19-1005
Volume:
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month:
November
Year:
2019
Address:
Hong Kong, China
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Note:
Pages:
44–54
Language:
URL:
https://www.aclweb.org/anthology/K19-1005
DOI:
10.18653/v1/K19-1005
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/K19-1005.pdf