TGermaCorp – A (Digital) Humanities Resource for (Computational) Linguistics

Andy Luecking, Armin Hoenen, Alexander Mehler


Abstract
TGermaCorp is a German text corpus whose primary sources are collected from German literature texts which date from the sixteenth century to the present. The corpus is intended to represent its target language (German) in syntactic, lexical, stylistic and chronological diversity. For this purpose, it is hand-annotated on several linguistic layers, including POS, lemma, named entities, multiword expressions, clauses, sentences and paragraphs. In order to introduce TGermaCorp in comparison to more homogeneous corpora of contemporary everyday language, quantitative assessments of syntactic and lexical diversity are provided. In this respect, TGermaCorp contributes to establishing characterising features for resource descriptions, which is needed for keeping track of a meaningful comparison of the ever-growing number of natural language resources. The assessments confirm the special role of proper names, whose propagation in text may influence lexical and syntactic diversity measures in rather trivial ways. TGermaCorp will be made available via hucompute.org.
Anthology ID:
L16-1677
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
4271–4277
Language:
URL:
https://www.aclweb.org/anthology/L16-1677
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/L16-1677.pdf