Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus
Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser, Andreas Nürnberger
Abstract
We introduce the Contemporary Amharic Corpus, which is automatically tagged for morpho-syntactic information. The texts were collected from 25,199 documents from different domains, and about 24 million orthographic words were tokenized. Since it is partly a web corpus, we applied automatic spelling error correction. We also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.
- Anthology ID:
- W18-3809
- Volume:
- Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
- Month:
- August
- Year:
- 2018
- Address:
- Santa Fe, New Mexico, USA
- Venues:
- COLING | LR4NLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 65–70
- URL:
- https://www.aclweb.org/anthology/W18-3809
- PDF:
- http://aclanthology.lst.uni-saarland.de/W18-3809.pdf