The Design of Syntactic Annotation Levels in the National Corpus of Polish

Katarzyna Głowińska, Adam Przepiórkowski


Abstract
The paper presents the procedure of syntactic annotation of the National Corpus of Polish. The paper concentrates on the delimitation of syntactic words (analytical forms, reflexive verbs, discontinuous conjunctions, etc.) and syntactic groups, as well as on problems encountered during the annotation process: syntactic group boundaries, multiword entities, abbreviations, discontinuous phrases and syntactic words. It includes the complete tagset for syntactic words and the list of syntactic groups recognized in NKJP. The tagset defines grammatical classes and categories according to morphosyntactic and syntactic criteria only. Syntactic annotation in the National Corpus of Polish is limited to making constituents of combinations of words. Annotation depends on shallow parsing and manual post-editing of the results by annotators. Manual annotation is performed by two independents annotators, with a referee in cases of disagreement. The manually constructed grammar, both for syntactic words and for syntactic groups, is encoded in the shallow parsing system Spejd.
Anthology ID:
L10-1175
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/259_Paper.pdf
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/259_Paper.pdf