Enhancing BERT for Lexical Normalization

Benjamin Muller, Benoit Sagot, Djamé Seddah


Abstract
Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC), we study the ability of BERT to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need of any UGC resources aside from 3,000 training sentences. To the best of our knowledge, it is the first work done in adapting and analysing the ability of this model to handle noisy UGC data.
Anthology ID:
D19-5539
Volume:
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
Month:
November
Year:
2019
Address:
Hong Kong, China
Venues:
EMNLP | WNUT | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
297–306
Language:
URL:
https://www.aclweb.org/anthology/D19-5539
DOI:
10.18653/v1/D19-5539
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/D19-5539.pdf