Designing the Latvian Speech Recognition Corpus

Mārcis Pinnis, Ilze Auziņa, Kārlis Goba


Abstract
In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus designing process through analysis of related work on speech corpora creation for different languages. The authors provide also guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus creation guidelines are fairly general for them to be re-used by other researchers when working on different language speech recognition corpora. The corpus consists of two parts ― an orthographically annotated corpus containing 100 hours of orthographically transcribed audio data and a phonetically annotated corpus containing 4 hours of phonetically transcribed audio data. Metadata files in XML format provide additional details about the speakers, noise levels, speech styles, etc. The speech recognition corpus is phonetically balanced and phonetically rich and the paper describes also the methodology how the phonetical balancedness has been assessed.
Anthology ID:
L14-1257
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
1547–1553
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/284_Paper.pdf
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/284_Paper.pdf