From Grammar Rule Extraction to Treebanking: A Bootstrapping Approach

Masood Ghayoomi


Abstract
Most of the reliable language resources are developed via human supervision. Developing supervised annotated data is hard and tedious, and it will be very time consuming when it is done totally manually; as a result, various types of annotated data, including treebanks, are not available for many languages. Considering that a portion of the language is regular, we can define regular expressions as grammar rules to recognize the strings which match the regular expressions, and reduce the human effort to annotate further unseen data. In this paper, we propose an incremental bootstrapping approach via extracting grammar rules when no treebank is available in the first step. Since Persian suffers from lack of available data sources, we have applied our method to develop a treebank for this language. Our experiment shows that this approach significantly decreases the amount of manual effort in the annotation process while enlarging the treebank.
Anthology ID:
L12-1536
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
1912–1919
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/900_Paper.pdf
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/900_Paper.pdf