Unsupervised Neologism Normalization Using Embedding Space Mapping
Nasser Zalmout, Kapil Thadani, Aasish Pappu
Abstract
This paper presents an approach for detecting and normalizing neologisms in social media content. Neologisms refer to recent expressions that are specific to certain entities or events and are being increasingly used by the public, but have not yet been accepted in mainstream language. Automated methods for handling neologisms are important for natural language understanding and normalization, especially for informal genres with user generated content. We present an unsupervised approach for detecting neologisms and then normalizing them to canonical words without relying on parallel training data. Our approach builds on the text normalization literature and introduces adaptations to fit the specificities of this task, including phonetic and etymological considerations. We evaluate the proposed techniques on a dataset of Reddit comments, with detected neologisms and corresponding normalizations.
- Anthology ID: D19-5555
- Volume: Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
- Month: November
- Year: 2019
- Address: Hong Kong, China
- Venues: EMNLP | WNUT | WS
- Publisher: Association for Computational Linguistics
- Pages: 425–430
- URL: https://www.aclweb.org/anthology/D19-5555
- DOI: 10.18653/v1/D19-5555
- PDF: http://aclanthology.lst.uni-saarland.de/D19-5555.pdf