BAStat: New Statistical Resources at the Bavarian Archive for Speech Signals

Florian Schiel


Abstract
A new type of language resource called 'BAStat' has been released by the Bavarian Archive for Speech Signals at Ludwig Maximilians Universitaet, Munich. In contrast to primary resources such as speech and text corpora, BAStat comprises statistical estimates derived from a number of primary spoken language resources: first- and second-order occurrence probabilities of phones, syllables and words, duration statistics, probabilities of pronunciation variants of words, and probabilities of context information. Unlike other statistical speech resources, BAStat is based solely on recordings of conversational German and therefore models spoken language, not text. The resource consists of a bundle of 7-bit ASCII tables and matrices, a format chosen to maximize interoperability between different operating systems, and can be downloaded for free from the BAS website. This contribution gives a detailed description of the empirical basis, the data types contained, the format of the resulting statistical data, some interesting interpretations of the aggregate figures, and a brief comparison to the text-based statistical resource CELEX.
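As an illustration of what first- and second-order occurrence probabilities can look like, the following is a minimal sketch that estimates them by relative frequency from symbol sequences. It is not the procedure used to build BAStat and does not read the BAStat table format. The function name `occurrence_probabilities`, the toy phone sequences, and the reading of "second order" as the conditional probability of a unit given its predecessor are all assumptions made for this example.

```python
from collections import Counter


def occurrence_probabilities(sequences):
    """Estimate unit probabilities by relative frequency (hypothetical helper).

    Returns two dicts:
      p1[u]      : first-order probability of unit u
      p2[(a, b)] : probability of unit b given preceding unit a
                   (one possible reading of "second order"; an assumption)
    """
    unigrams = Counter()
    bigrams = Counter()
    predecessors = Counter()  # how often each unit occurs as a left context

    for seq in sequences:
        unigrams.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            predecessors[prev] += 1

    total = sum(unigrams.values())
    p1 = {u: n / total for u, n in unigrams.items()}
    p2 = {(a, b): n / predecessors[a] for (a, b), n in bigrams.items()}
    return p1, p2


# Toy phone sequences invented for illustration; these are not BAStat data.
toy = [["h", "a", "l", "o"], ["h", "a", "t"]]
p1, p2 = occurrence_probabilities(toy)
```

The same function applies unchanged to syllable or word sequences, since it only assumes that each sequence is a list of hashable units.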
Anthology ID:
L10-1191
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/277_Paper.pdf