SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis

Ina Roesiger


Abstract
This paper presents SciCorp, a corpus of full-text English scientific papers of two disciplines, genetics and computational linguistics. The corpus comprises co-reference and bridging information as well as information status labels. Since SciCorp is annotated with both labels and the respective co-referent and bridging links, we believe it is a valuable resource for NLP researchers working on scientific articles or on applications such as co-reference resolution, bridging resolution or information status classification. The corpus has been reliably annotated by independent human coders with moderate inter-annotator agreement (average kappa = 0.71). In total, we have annotated 14 full papers containing 61,045 tokens and marked 8,708 definite noun phrases. The paper describes in detail the annotation scheme as well as the resulting corpus. The corpus is available for download in two different formats: in an offset-based format and for the co-reference annotations in the widely-used, tabular CoNLL-2012 format.
Anthology ID:
L16-1275
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
1743–1749
Language:
URL:
https://www.aclweb.org/anthology/L16-1275
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/L16-1275.pdf