Michael Gasser


2020

Character Alignment in Morphologically Complex Translation Sets for Related Languages
Michael Gasser | Binyam Ephrem Seyoum | Nazareth Amlesom Kifle
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

For languages with complex morphology, word-to-word translation is a task with various potential applications, for example, in information retrieval, language instruction, and dictionary creation, as well as in machine translation. In this paper, we confine ourselves to the subtask of character alignment for the particular case of families of related languages with very few resources for most or all members. There are many such families; we focus on the subgroup of Semitic languages spoken in Ethiopia and Eritrea. We begin with an adaptation of the familiar alignment algorithms behind statistical machine translation, modifying them as appropriate for our task. We show how character alignment can reveal morphological, phonological, and orthographic correspondences among related languages.
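As a rough illustration of the kind of statistical alignment the abstract above alludes to, the following is a minimal sketch of unmodified IBM Model 1 expectation-maximization applied to characters instead of words. It is not the authors' adapted algorithm; the function names `train_char_model1` and `align`, the iteration count, and the toy word pairs are assumptions made purely for illustration.

```python
from collections import defaultdict

def train_char_model1(pairs, iterations=10):
    """Estimate t(target_char | source_char) with IBM Model 1 style EM.

    `pairs` is a list of (source_word, target_word) strings that are
    assumed to be translations of each other.
    """
    src_vocab = {c for s, _ in pairs for c in s}
    tgt_vocab = {c for _, t in pairs for c in t}
    # Start from a uniform distribution over target characters.
    t = {(tc, sc): 1.0 / len(tgt_vocab) for tc in tgt_vocab for sc in src_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # expected (target, source) co-alignments
        total = defaultdict(float)   # expected alignments per source character
        for src, tgt in pairs:
            for tc in tgt:
                # E-step: spread each target character over all source characters.
                norm = sum(t[(tc, sc)] for sc in src)
                for sc in src:
                    frac = t[(tc, sc)] / norm
                    count[(tc, sc)] += frac
                    total[sc] += frac
        # M-step: renormalize the expected counts per source character.
        for (tc, sc) in t:
            t[(tc, sc)] = count[(tc, sc)] / total[sc] if total[sc] else 0.0
    return t

def align(src, tgt, t):
    """For each target character, return the index of its most likely source character."""
    return [max(range(len(src)), key=lambda i: t[(tc, src[i])]) for tc in tgt]

# Invented toy data, not drawn from any real language.
toy_pairs = [("ab", "xy"), ("a", "x")]
model = train_char_model1(toy_pairs)
alignment = align("ab", "xy", model)
```

On the toy data, the second pair pins "x" to "a", which in turn pushes "y" toward "b" in the first pair. Real related-language data would call for the modifications the paper describes, which this sketch does not attempt.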

A Translation-Based Approach to Morphology Learning for Low Resource Languages
Tewodros Gebreselassie | Amanuel Mersha | Michael Gasser
Proceedings of the Fourth Widening Natural Language Processing Workshop

“Low resource languages” usually refers to languages that lack corpora and basic tools such as part-of-speech taggers. But a significant number of such languages do benefit from the availability of relatively complex linguistic descriptions of phonology, morphology, and syntax, as well as dictionaries. A further category, probably the majority of the world’s languages, suffers from the lack of even these resources. In this paper, we investigate the possibility of learning the morphology of such a language by relying on its close relationship to a language with more resources. Specifically, we use a transfer-based approach to learn the morphology of the severely under-resourced language Gofa, starting with a neural morphological generator for the closely related language, Wolaytta. Both languages are members of the Omotic family, spoken in southwestern Ethiopia, and, like other Omotic languages, both are morphologically complex. We first create a finite-state transducer for morphological analysis and generation for Wolaytta, based on relatively complete linguistic descriptions and lexicons for the language. Next, we train an encoder-decoder neural network on the task of morphological generation for Wolaytta, using data generated by the FST. Such a network takes a root and a set of grammatical features as input and generates a word form as output. We then elicit Gofa translations of a small set of Wolaytta words from bilingual speakers. Finally, we retrain the decoder of the Wolaytta network, using a small set of Gofa target words that are translations of the Wolaytta outputs of the original network. The evaluation shows that the transfer network performs better than a separate encoder-decoder network trained on a larger set of Gofa words. We conclude with implications for the learning of morphology for severely under-resourced languages in regions where there are related languages with more resources.

2018

Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus
Andargachew Mekonnen Gezmu | Binyam Ephrem Seyoum | Michael Gasser | Andreas Nürnberger
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing

We introduce the Contemporary Amharic Corpus, which is automatically tagged for morpho-syntactic information. Texts were collected from 25,199 documents from different domains, and about 24 million orthographic words were tokenized. Since it is partly a web corpus, we applied some automatic spelling error correction. We also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.

2014

Guampa: a Toolkit for Collaborative Translation
Alex Rudnick | Taylor Skidmore | Alberto Samaniego | Michael Gasser
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Here we present Guampa, a new software package for online collaborative translation. This system grows out of our discussions with Guarani-language activists and educators in Paraguay, and attempts to address problems faced by machine translation researchers and by members of any community speaking an under-represented language. Guampa enables volunteers and students to work together to translate documents into heritage languages, both to make more materials available in those languages, and also to generate bitext suitable for training machine translation systems. While many approaches to crowdsourcing bitext corpora focus on Mechanical Turk and temporarily engaging anonymous workers, Guampa is intended to foster an online community in which discussions can take place, language learners can practice their translation skills, and complete documents can be translated. This approach is appropriate for the Spanish-Guarani language pair as there are many speakers of both languages, and Guarani has a dedicated activist community. Our goal is to make it easy for anyone to set up their own instance of Guampa and populate it with documents -- such as automatically imported Wikipedia articles -- to be translated for their particular language pair. Guampa is freely available and relatively easy to use.

2013

Lexical Selection for Hybrid MT with Sequence Labeling
Alex Rudnick | Michael Gasser
Proceedings of the Second Workshop on Hybrid Approaches to Translation

HLTDI: CL-WSD Using Markov Random Fields for SemEval-2013 Task 10
Alex Rudnick | Can Liu | Michael Gasser
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

Incremental Learning of Affix Segmentation
Wondwossen Mulugeta | Michael Gasser | Baye Yimam
Proceedings of COLING 2012

2011

Towards a Malay Derivational Lexicon: Learning Affixes Using Expectation Maximization
Suriani Sulaiman | Michael Gasser | Sandra Kuebler
Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP)

2010

Expanding the Lexicon for a Resource-Poor Language Using a Morphological Analyzer and a Web Crawler
Michael Gasser
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Resource-poor languages may suffer from a lack of any of the basic resources that are fundamental to computational linguistics, including an adequate digital lexicon. Given the relatively small corpus of texts that exists for such languages, extending the lexicon presents a challenge. Languages with complex morphology present a special case, however, because individual words in these languages provide a great deal of information about the grammatical properties of the roots that they are based on. Given a morphological analyzer, it is even possible to extract novel roots from words. In this paper, we look at the case of Tigrinya, a Semitic language with limited lexical resources for which a morphological analyzer is available. We show that this analyzer, applied to a list of more than 200,000 Tigrinya words extracted by a web crawler, can extend the lexicon in two ways: by adding new roots and by inferring some of the derivational constraints that apply to known roots.

2009

Semitic Morphological Analysis and Generation Using Finite State Transducers with Feature Structures
Michael Gasser
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

1994

Acquiring Receptive Morphology: A Connectionist Model
Michael Gasser
32nd Annual Meeting of the Association for Computational Linguistics

Modularity in a Connectionist Model of Morphology Acquisition
Michael Gasser
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

1988

Sequencing in a Connectionist Model of Language Processing
Michael Gasser | Michael G. Dyer
Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics