BERT is Not an Interlingua and the Bias of Tokenization

Jasdeep Singh, Bryan McCann, Richard Socher, Caiming Xiong


Abstract
Multilingual transfer learning can benefit both high- and low-resource languages, but the source of these improvements is not well understood. Canonical Correlation Analysis (CCA) of the internal representations of a pre-trained, multilingual BERT model reveals that the model partitions representations for each language rather than using a common, shared, interlingual space. This effect is magnified at deeper layers, suggesting that the model does not progressively abstract semantic content while disregarding languages. Hierarchical clustering based on the CCA similarity scores between languages reveals a tree structure that mirrors the phylogenetic trees hand-designed by linguists. The subword tokenization employed by BERT provides a stronger bias towards such structure than character- and word-level tokenizations. We release a subset of the XNLI dataset translated into an additional 14 languages at https://www.github.com/salesforce/xnli_extension to assist further research into multilingual representations.
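As a rough illustration of the analysis described in the abstract (not the authors' implementation), the sketch below computes a mean-canonical-correlation similarity between per-language BERT layer activations and then clusters the languages hierarchically. The feature arrays, function names, and the CCA variant (scikit-learn's CCA rather than the paper's exact formulation) are assumptions made for illustration only.

# Minimal sketch (assumed, not the authors' code): mean canonical correlation
# between per-language BERT layer activations, then hierarchical clustering.
import numpy as np
from itertools import combinations
from sklearn.cross_decomposition import CCA
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def cca_similarity(X, Y, n_components=10):
    """Mean correlation of the top canonical components of X and Y."""
    cca = CCA(n_components=n_components, max_iter=1000)
    X_c, Y_c = cca.fit_transform(X, Y)
    return float(np.mean([np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1]
                          for i in range(n_components)]))

def language_linkage(features):
    """Agglomerative clustering of languages by 1 - CCA similarity."""
    langs = sorted(features)
    dist = np.zeros((len(langs), len(langs)))
    for (i, a), (j, b) in combinations(enumerate(langs), 2):
        dist[i, j] = dist[j, i] = 1.0 - cca_similarity(features[a], features[b])
    return langs, linkage(squareform(dist), method="average")

# Stand-in features: (n_parallel_sentences, hidden_dim) activations per language.
rng = np.random.default_rng(0)
features = {lang: rng.normal(size=(200, 64)) for lang in ["en", "de", "hi", "ur"]}
langs, Z = language_linkage(features)
print(langs)
print(Z)  # scipy linkage matrix; plot with scipy.cluster.hierarchy.dendrogram

With real parallel-sentence activations in place of the random stand-ins, the resulting linkage would group languages by representational similarity, which is the kind of tree structure the paper compares to linguists' phylogenetic trees.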
Anthology ID: D19-6106
Volume: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)
Month: November
Year: 2019
Address: Hong Kong, China
Venues: EMNLP | WS
Publisher: Association for Computational Linguistics
Pages: 47–55
URL: https://www.aclweb.org/anthology/D19-6106
DOI: 10.18653/v1/D19-6106
PDF: http://aclanthology.lst.uni-saarland.de/D19-6106.pdf