Automatic Document Quality Control

Neil Newbold, Lee Gillam


Abstract
This paper focuses on automatically improving the readability of documents. We explore mechanisms relating to content control that could be used (i) by authors to improve the quality and consistency of the language used in authoring; and (ii) to find a means to demonstrate this to readers. To achieve this, we implemented and evaluated a number of software components, including those of the University of Surrey Department of Computing’s content analysis applications (System Quirk). The software integrates these components within the commonly available GATE software and incorporates language resources considered useful within the standards development process: a Plain English thesaurus; lookup of ISO terminology provided from a terminology management system (TMS) via ISO 16642; automatic terminology discovery using statistical and linguistic techniques; and readability metrics. Results lead us to the development of an assistive tool, initially for authors of standards but not considered to be limited only to such authors, and also to a system that provides automatic annotation of texts to help readers to understand them. We describe the system developed and made freely available under the auspices of the EU eContent project LIRICS.
Anthology ID:
L08-1320
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/145_paper.pdf
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/145_paper.pdf