Topic Stability over Noisy Sources

Jing Su, Derek Greene, Oisín Boydell


Abstract
Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous text that can undermine topic stability. It is therefore important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise have diverse effects on the stability of topic models, and that stability is not consistent even across different levels of the same type of noise. We introduce a dictionary filtering approach to address this challenge, with the result that a topic model with the correct number of topics is always identified across different levels of noise.
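The dictionary filtering idea mentioned in the abstract can be sketched as follows: tokens that do not appear in a reference wordlist are dropped before topic modelling, removing OCR- or ASR-style noise terms. The wordlist, documents, and function name below are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of dictionary filtering: keep only tokens found in a
# reference wordlist, so noise terms introduced by OCR or speech
# recognition (e.g. "m0del", "tbe") are removed before topic modelling.
# The wordlist and documents are invented examples.

def dictionary_filter(docs, wordlist):
    """Return the documents with out-of-dictionary tokens removed."""
    vocab = set(wordlist)
    return [[tok for tok in doc if tok in vocab] for doc in docs]

wordlist = ["topic", "model", "speech", "transcript", "noise"]
noisy_docs = [
    ["topic", "m0del", "speech"],    # "m0del": OCR-style character error
    ["tbe", "transcript", "noise"],  # "tbe": a common OCR error for "the"
]
print(dictionary_filter(noisy_docs, wordlist))
# → [['topic', 'speech'], ['transcript', 'noise']]
```

The filtered documents would then be passed to a standard topic modelling pipeline (e.g. an LDA implementation) in place of the raw noisy text.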
Anthology ID:
W16-3913
Volume:
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
Month:
December
Year:
2016
Address:
Osaka, Japan
Venues:
WNUT | WS
Publisher:
The COLING 2016 Organizing Committee
Pages:
85–93
URL:
https://www.aclweb.org/anthology/W16-3913
PDF:
http://aclanthology.lst.uni-saarland.de/W16-3913.pdf