Maik Stührenberg


Less Destructive Cleaning of Web Documents by Using Standoff Annotation
Maik Stührenberg
Proceedings of the 9th Web as Corpus Workshop (WaC-9)

Extending standoff annotation
Maik Stührenberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Textual information is sometimes accompanied by additional encodings (such as visuals). These multimodal documents may be interesting objects of investigation for linguistics. Pre-annotated documents form another class of complex documents. Classic XML inline annotation often fails for both document classes because of overlapping markup. However, standoff annotation, that is, the separation of primary data and markup, is a valuable and common mechanism for annotating multiple hierarchies and/or read-only primary data. We demonstrate an extended version of the XStandoff meta markup language that allows the definition of segments in spatial and pre-annotated primary data. Together with the ability to import already established (linguistic) serialization formats as annotation levels and layers in an XStandoff instance, this allows us to annotate a variety of primary data files, including text, audio, and still and moving images. Application scenarios that may benefit from using XStandoff include the analysis of multimodal documents such as instruction manuals, sports match analysis, and the less destructive cleaning of web pages.
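
The core idea of standoff annotation described in this abstract can be illustrated with a minimal sketch. It is not XStandoff's actual serialization, which is XML-based; the class names `Segment` and `Annotation`, the helper functions, the level names, and the sample sentence are all hypothetical and chosen only for illustration. The sketch assumes textual primary data addressed by character offsets and shows how two annotation levels can cover overlapping spans without modifying the primary data, something a single inline XML tree cannot express directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A span of the primary data, addressed by character offsets [start, end)."""
    start: int
    end: int


@dataclass(frozen=True)
class Annotation:
    """A label on one annotation level; it points at a segment instead of wrapping text."""
    level: str
    label: str
    segment: Segment


def text_of(primary: str, annotation: Annotation) -> str:
    """Resolve an annotation back to the text it refers to."""
    return primary[annotation.segment.start:annotation.segment.end]


def overlaps(a: Segment, b: Segment) -> bool:
    """True if the segments cross each other without one containing the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


# Primary data stays untouched (read-only); annotations live beside it.
PRIMARY = "The sun shines brighter"

# Two hypothetical annotation levels whose spans cross each other.
annotations = [
    Annotation(level="syntax", label="NP", segment=Segment(0, 7)),       # "The sun"
    Annotation(level="prosody", label="phrase", segment=Segment(4, 14)),  # "sun shines"
]

for ann in annotations:
    print(ann.level, ann.label, repr(text_of(PRIMARY, ann)))

print("overlapping:", overlaps(annotations[0].segment, annotations[1].segment))
```

Because each annotation only references offsets, further levels can be added later without touching either the primary data or the existing levels.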


Influence of Text Type and Text Length on Anaphoric Annotation
Daniela Goecke | Maik Stührenberg | Andreas Witt
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We report the results of a study that investigates the agreement of anaphoric annotations. The study focuses on the influence of text length and text type, using a corpus of scientific articles and newspaper texts. To measure inter-annotator agreement, we compare existing approaches and propose measuring each step of the annotation process separately instead of measuring only the resulting anaphoric relations. A total of 3,642 anaphoric relations were annotated in a corpus of 53,038 tokens (12,327 markables). The results show that text type has more influence on inter-annotator agreement than text length. Furthermore, well-defined annotation instructions and coder training are crucial for obtaining good annotation results.
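
The proposal to measure agreement for each annotation step separately can be sketched as follows. The abstract does not name the agreement measures that were compared, so the use of Cohen's kappa here is an assumption, and the step names and coder decisions are invented toy data, not results from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items (assumed measure)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label; kappa is undefined, report 1.0.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical per-step decisions of two coders (toy data for illustration only).
steps = {
    "is_anaphor": (
        ["yes", "yes", "no", "yes", "no", "no"],
        ["yes", "no", "no", "yes", "no", "yes"],
    ),
    "relation_type": (
        ["ident", "bridging", "ident", "ident"],
        ["ident", "bridging", "bridging", "ident"],
    ),
}

# Agreement is reported per step rather than only for the final relations.
for step, (coder_a, coder_b) in steps.items():
    print(f"{step}: kappa = {cohen_kappa(coder_a, coder_b):.3f}")
```

Reporting one value per step shows where coders diverge, which a single score over the final relations would hide.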


Web-based Annotation of Anaphoric Relations and Lexical Chains
Maik Stührenberg | Daniela Goecke | Nils Diewald | Alexander Mehler | Irene Cramer
Proceedings of the Linguistic Annotation Workshop


Multidimensional markup and heterogeneous linguistic resources
Maik Stührenberg | Andreas Witt | Daniela Goecke | Dieter Metzing | Oliver Schonefeld
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing