Nuniyat Kifle


Amharic Word Sequence Prediction
Nuniyat Kifle
Proceedings of the 2019 Workshop on Widening NLP

The significance of computers and handheld devices are not deniable in the modern world of today. Texts are entered to these devices using word processing programs as well as other techniques and word prediction is one of the techniques. Word Prediction is the action of guessing or forecasting what word comes after, based on some current information, and it is the main focus of this study. Even though Amharic is used by a large number of populations, no significant work is done on the topic of word sequence prediction. In this study, Amharic word sequence prediction model is developed with statistical methods using Hidden Markov Model by incorporating detailed Part of speech tag. Evaluation of the model is performed using developed prototype and keystroke savings (KSS) as a metrics. According to our experiment, prediction result using a bi-gram with detailed Part of Speech tag model has higher KSS and it is better compared to tri-gram model and better than those without Part of Speech tag. Therefore, statistical approach with Detailed POS has quite good potential on word sequence prediction for Amharic language. This research deals with designing word sequence prediction model in Amharic language. It is a language that is spoken in eastern Africa. One of the needs for Amharic word sequence prediction for mobile use and other digital devices is in order to facilitate data entry and communication in our language. Word sequence prediction is a challenging task for inflected languages. (Arora, 2007) These kinds of languages are morphologically rich and have enormous word forms. i.e. one word can have different forms. As Amharic language is highly inflected language and morphologically rich it shares this problem. (prediction, 2008) This problem makes word prediction system much more difficult and results poor performance. Due to this reason storing all forms in dictionary won’t solve the problem as in English and other less inflected languages. But considering other techniques that could help the predictor to suggest the next word like a POS based prediction should be used. Previous researches used dictionary approach with no consideration of context information. Hence storing all forms of words in dictionary for inflected languages such as Amharic language has been less effective. The main goal of this thesis is to implement Amharic word prediction model that works with better prediction speed and with narrowed search space as much as possible. We introduced two models; tags and words and linear interpolation that use part of speech tag information in addition to word n-grams in order to maximize the likelihood of syntactic appropriateness of the suggestions. We believe the results found reflect this. Amharic word sequence prediction using bi-gram model with higher POS weight and detailed Part of speech tag gave better keystroke savings in all scenarios of our experiment. The study followed Design Science Research Methodology (DSRM). Since DSRM includes approaches, techniques, tools, algorithms and evaluation mechanisms in the process, we followed statistical approach with statistical language modeling and built Amharic prediction model based on information from Part of Speech tagger. The statistics included in the systems varies from single word frequencies to part-of-speech tag n-grams. That means it included the statistics of Word frequencies, Word sequence frequencies, Part-of-speech sequence frequencies and other important information. Later on the system was evaluated using Keystroke Savings. (Lindh, 011). Linux mint was used as the main Operation System during the frame work design. We used corpus of 680,000 tagged words that has 31 tag sets, python programming language and its libraries for both the part of speech tagger and the predictor module. Other Tool that was used is the SRILIM (The SRI language modeling toolkit) in order to generate unigram bigram and trigram count as an input for the language model. SRILIM is toolkit that uses to build and apply statistical language modeling. This thesis presented Amharic word sequence prediction model using the statistical approach. We described a combined statistical and lexical word prediction system for handling inflected languages by making use of POS tags to build the language model. We developed Amharic language models of bigram and trigram for the training purpose. We obtained 29% of KSS using bigram model with detailed part ofspeech tag. Hence, Based on the experiments carried out for this study and the results obtained, the following conclusions were made. We concluded that employing syntactic information in the form of Part-of-Speech (POS) n-grams promises more effective predictions. We also can conclude data quantity, performance of POS tagger and data quality highly affects the keystroke savings. Here in our study the tests were done on a small collection of 100 phrases. According to our evaluation better Keystroke saving (KSS) is achieved when using bi-gram model than the tri-gram models. We believe the results obtained using the experiment of detailed Part of speech tags were effective Since speed and search space are the basic issues in word sequence prediction