Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri

Atul Kr. Ojha, Daniel Zeman


Abstract
This paper presents the first dependency treebank for Bhojpuri, a resource-poor language that belongs to the Indo-Aryan language family. The objective of the Bhojpuri Treebank (BHTB) project is to create a substantial, syntactically annotated treebank that not only serves as a valuable resource for building language technology tools but also helps in cross-lingual learning and typological research. Currently, the treebank consists of 4,881 tokens annotated in accordance with the Universal Dependencies (UD) annotation scheme. A Bhojpuri tagger and parser were created using a machine learning approach. The model achieves 57.49% UAS, 45.50% LAS, 79.69% UPOS accuracy and 77.64% XPOS accuracy. The paper describes the details of the project, including a discussion of the linguistic analysis and annotation process of the Bhojpuri UD treebank.
Anthology ID:
2020.wildre-1.7
Volume:
Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
LREC | WILDRE | WS
Publisher:
European Language Resources Association (ELRA)
Pages:
33–38
Language:
English
URL:
https://www.aclweb.org/anthology/2020.wildre-1.7
PDF:
http://aclanthology.lst.uni-saarland.de/2020.wildre-1.7.pdf