Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words

Anmol Nayak, Hariprasad Timmapathini, Karthikeyan Ponnalagu, Vijendran Gopalan Venkoparao


Abstract
BERT model (Devlin et al., 2019) has achieved significant progress in several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, it still has several research challenges which are not tackled well for domain specific corpus found in industries. In this paper, we have highlighted these problems through detailed experiments involving analysis of the attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments have lead to interesting findings that showed: 1) Largest substring from the left that is found in the vocabulary (in-vocab) is always chosen at every sub-word unit that can lead to suboptimal tokenization choices, 2) Semantic meaning of a vocabulary word deteriorates when found as a substring in an Out-Of-Vocabulary (OOV) word, and 3) Minor misspellings in words are inadequately handled. We believe that if these challenges are tackled, it will significantly help the domain adaptation aspect of BERT.
Anthology ID:
2020.insights-1.1
Volume:
Proceedings of the First Workshop on Insights from Negative Results in NLP
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | insights
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1–5
Language:
URL:
https://www.aclweb.org/anthology/2020.insights-1.1
DOI:
10.18653/v1/2020.insights-1.1
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2020.insights-1.1.pdf