Spoken Term Discovery for Language Documentation using Translations

Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, Adam Lopez


Abstract
Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.
Anthology ID:
W17-4607
Volume:
Proceedings of the Workshop on Speech-Centric Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Venue:
WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
53–58
Language:
URL:
https://www.aclweb.org/anthology/W17-4607
DOI:
10.18653/v1/W17-4607
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/W17-4607.pdf