GRASS: the Graz corpus of Read And Spontaneous Speech

Barbara Schuppler, Martin Hagmueller, Juan A. Morales-Cordovilla, Hannes Pessentheiner


Abstract
This paper provides a description of the preparation, the speakers, the recordings, and the creation of the orthographic transcriptions of the first large scale speech database for Austrian German. It contains approximately 1900 minutes of (read and spontaneous) speech produced by 38 speakers. The corpus consists of three components. First, the Conversation Speech (CS) component contains free conversations of one hour length between friends, colleagues, couples, or family members. Second, the Commands Component (CC) contains commands and keywords which were either read or elicited by pictures. Third, the Read Speech (RS) component contains phonetically balanced sentences and digits. The speech of all components has been recorded at super-wideband quality in a soundproof recording-studio with head-mounted microphones, large-diaphragm microphones, a laryngograph, and with a video camera. The orthographic transcriptions, which have been created and subsequently corrected manually, contain approximately 290 000 word tokens from 15 000 different word types.
Anthology ID:
L14-1341
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
1465–1470
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/394_Paper.pdf
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/394_Paper.pdf