Corpora for Cross-Language Information Retrieval in Six Less-Resourced Languages

Ilya Zavorin, Aric Bills, Cassian Corey, Michelle Morrison, Audrey Tong, Richard Tong


Abstract
The Machine Translation for English Retrieval of Information in Any Language (MATERIAL) research program, sponsored by the Intelligence Advanced Research Projects Activity (IARPA), focuses on the rapid development of end-to-end systems capable of retrieving foreign-language speech and text documents relevant to different types of English queries, which may be further restricted by domain. These systems also provide evidence of the relevance of the retrieved content in the form of English summaries. The program focuses on Less-Resourced Languages and provides its performer teams with very limited amounts of annotated training data. This paper describes the corpora that were created for system development and evaluation for the six languages released by the program to date: Tagalog, Swahili, Somali, Lithuanian, Bulgarian and Pashto. The corpora include build packs to train Machine Translation and Automatic Speech Recognition systems; document sets in three text and three speech genres, annotated for domain and partitioned for analysis, development and evaluation; and queries of several types, together with corresponding binary relevance judgments against the entire set of documents. The paper also describes a detection metric called Actual Query Weighted Value, developed by the program to evaluate end-to-end system performance.
Anthology ID:
2020.clssts-1.2
Volume:
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
CLSSTS | LREC | WS
Publisher:
European Language Resources Association
Pages:
7–13
Language:
English
URL:
https://www.aclweb.org/anthology/2020.clssts-1.2
PDF:
http://aclanthology.lst.uni-saarland.de/2020.clssts-1.2.pdf