OCR Quality and NLP Preprocessing

Margot Mieskes, Stefan Schmunk


Abstract
We present initial experiments to evaluate the performance of tasks such as part-of-speech tagging on data corrupted by Optical Character Recognition (OCR). Our results on English and German data, from artificial experiments as well as initial real OCRed data, indicate that even a small drop in OCR quality considerably increases error rates, which would significantly affect subsequent processing steps.
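
The abstract does not describe how the artificial experiments were set up. As a minimal sketch, assuming substitution-only character noise applied independently to each character (an assumption for illustration, not the authors' method), the following Python shows how a given character error rate translates into a token-level error rate before any tagger is run. The helper names `corrupt` and `token_error_rate` are hypothetical.

```python
import random
import string


def corrupt(text, char_error_rate, rng):
    """Simulate OCR noise (assumed model): replace each non-space character
    with a different random lowercase letter with probability char_error_rate."""
    out = []
    for ch in text:
        if not ch.isspace() and rng.random() < char_error_rate:
            # Pick a replacement that is guaranteed to differ from the original.
            choices = [c for c in string.ascii_lowercase if c != ch]
            out.append(rng.choice(choices))
        else:
            out.append(ch)
    return "".join(out)


def token_error_rate(clean, noisy):
    """Fraction of whitespace-separated tokens that differ after corruption."""
    clean_tokens = clean.split()
    noisy_tokens = noisy.split()
    # Substitution-only noise never adds or removes whitespace,
    # so the two token sequences stay aligned one-to-one.
    assert len(clean_tokens) == len(noisy_tokens)
    changed = sum(c != n for c, n in zip(clean_tokens, noisy_tokens))
    return changed / len(clean_tokens)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    sample = "the quick brown fox jumps over the lazy dog " * 200
    for rate in (0.01, 0.05, 0.10):
        noisy = corrupt(sample, rate, rng)
        print(rate, round(token_error_rate(sample, noisy), 3))
```

Under this assumed noise model, a token of n characters is altered with probability 1 - (1 - p)^n for character error rate p, so p = 0.05 and n = 6 already gives roughly 26% altered tokens. Each altered token is a potential out-of-vocabulary input for a downstream step such as part-of-speech tagging.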
Anthology ID: W19-3633
Volume: Proceedings of the 2019 Workshop on Widening NLP
Month: August
Year: 2019
Address: Florence, Italy
Venues: ACL | WS | WiNLP
Publisher: Association for Computational Linguistics
Pages: 102–105
URL: https://www.aclweb.org/anthology/W19-3633