Judith L. Klavans (Editor)
- Santa Fe, New Mexico, USA
- COLING | PYLO | WS
- Association for Computational Linguistics
Given advances in computational linguistic analysis of complex languages using Machine Learning as well as standard Finite State Transducers, coupled with recent efforts in language revitalization, the time was right to organize a first workshop to bring together experts in language technology and linguists on the one hand with language practitioners and revitalization experts on the other. This one-day meeting provides a promising forum to discuss new research on polysynthetic languages in combination with the needs of linguistic communities where such languages are written and spoken.
We experiment with training an encoder-decoder neural model to mimic the behavior of an existing hand-written finite-state morphological grammar for verbs in Arapaho, a polysynthetic language with a highly complex verbal inflection system. After adjusting for ambiguous parses, we find that the system is able to generalize to unseen forms with accuracies of 98.68% (unambiguous verbs) and 92.90% (all verbs).
This paper presents the phonological layer of a Kwak’wala finite-state morphological transducer, using the phonological hypotheses of Lincoln and Rath (1986) and the lenient composition operation of Karttunen (1998) to mediate the complicated relationship between underlying and surface forms. The resulting system decomposes the wide variety of surface forms in such a way that the morphological layer can be specified using unique and largely concatenative morphemes.
In this article we describe the application of finite-state transducers to the morphological and phonological systems of Chukchi, a polysynthetic language spoken in the north of the Russian Federation. The language exhibits progressive and regressive vowel harmony, productive incorporation and extensive circumfixing. To implement the analyser we use the well-known Helsinki Finite-State Toolkit (HFST). The resulting model covers the majority of the morphological and phonological processes. A brief evaluation carried out on publicly available corpora shows that the coverage of the transducer is between 53% and 76%. An error evaluation of 100 tokens randomly selected from the corpus that were not covered by the analyser shows that most of the morphological processes are covered and that the majority of errors are caused by a limited stem lexicon.
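For readers unfamiliar with the coverage figure quoted in the abstract above, the following is a minimal sketch of how token-level ("naive") coverage is commonly computed: the share of corpus tokens for which an analyser returns at least one analysis. It is an illustration only, not the authors' evaluation script. The names `naive_coverage`, `toy_analyse` and `TOY_ANALYSES`, the placeholder tokens and the tag strings are all hypothetical; a real evaluation would look tokens up in the compiled transducer instead of a dictionary.

```python
from typing import Callable, Iterable, List


def naive_coverage(tokens: Iterable[str],
                   analyse: Callable[[str], List[str]]) -> float:
    """Return the fraction of tokens that receive at least one analysis."""
    token_list = list(tokens)
    if not token_list:
        # An empty corpus has no covered tokens; avoid dividing by zero.
        return 0.0
    covered = sum(1 for token in token_list if analyse(token))
    return covered / len(token_list)


# Hypothetical stand-in for a transducer lookup: placeholder tokens mapped
# to made-up analysis strings. These are not real Chukchi forms or tags.
TOY_ANALYSES = {
    "tok_a": ["stem_a<n><sg>"],
    "tok_b": ["stem_b<v><pres>", "stem_b<n><pl>"],
}


def toy_analyse(token: str) -> List[str]:
    """Return every analysis known for the token, or an empty list."""
    return TOY_ANALYSES.get(token, [])


corpus = ["tok_a", "tok_b", "tok_x", "tok_a"]
print(f"coverage: {naive_coverage(corpus, toy_analyse):.2%}")
```

Under this definition a token counts as covered even if none of the returned analyses is correct, which is why coverage figures are often paired with a manual error evaluation of the uncovered tokens.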
Kanyen’kéha (in English, Mohawk) is an Iroquoian language spoken primarily in Eastern Canada (Ontario, Québec). Classified as endangered, it has only a small number of speakers and very few younger native speakers. Consequently, teachers and courses, teaching materials and software are urgently needed. In the case of software, the polysynthetic nature of Kanyen’kéha means that the number of possible combinations grows exponentially and soon surpasses attempts to capture variant forms by hand. It is in this context that we describe an attempt to produce language teaching materials based on a generative approach. A natural language generation environment (ivi/Vinci) embedded in a web environment (VinciLingua) makes it possible to produce, by rule, variant forms of indefinite complexity. These may be used as models to explore, or as materials to which learners respond. Generated materials may take the form of written text, oral utterances, or images; responses may be typed on a keyboard, gestural (using a mouse) or, to a limited extent, oral. The software also provides complex orthographic, morphological and syntactic analysis of learner productions. We describe the trajectory of development of materials for a suite of four courses on Kanyen’kéha, the first of which will be taught in the fall of 2018.
In this paper we describe preliminary work on Kawennón:nis, a verb conjugator for Kanyen’kéha (Ohsweken dialect). The project is the result of a collaboration between Onkwawenna Kentyohkwa Kanyen’kéha immersion school and the Canadian National Research Council’s Indigenous Language Technology lab. The purpose of Kawennón:nis is to build on the educational successes of the Onkwawenna Kentyohkwa school and develop a tool that assists students in learning how to conjugate verbs in Kanyen’kéha, a skill that is essential to mastering the language. Kawennón:nis is implemented with both web and mobile front-ends that communicate with an application programming interface that in turn communicates with a symbolic language model implemented as a finite state transducer. Eventually, it will serve as a foundation for several other applications for both Kanyen’kéha and other Iroquoian languages.
Inuktitut is a polysynthetic language spoken in Northern Canada and is one of the official languages of the Canadian territory of Nunavut. As such, the Nunavut Legislature publishes all of its proceedings in parallel English and Inuktitut. Several parallel English-Inuktitut corpora have been created from these proceedings and are publicly available. The corpus used for current experiments is described. Morphological processing of one of these corpora was carried out and details about the processing are provided. Then, the processed corpus was used in morphological analysis and machine translation (MT) experiments. The morphological analysis experiments aimed to improve the coverage of morphological processing of the corpus, and compare an additional experimental condition to previously published results. The machine translation experiments made use of the additional morphologically analyzed word types in a statistical machine translation system designed to translate to and from Inuktitut morphemes. Results are reported and next steps are defined.
Lost in Translation: Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages
Manuel Mager | Elisabeth Mager | Alfonso Medina-Urrea | Ivan Vladimir Meza Ruiz | Katharina Kann
Machine translation from polysynthetic to fusional languages is a challenging task, which gets further complicated by the limited amount of parallel text available. Thus, translation performance is far from the state of the art for high-resource and more intensively studied language pairs. To shed light on the phenomena which hamper automatic translation to and from polysynthetic languages, we study translations from three low-resource, polysynthetic languages (Nahuatl, Wixarika and Yorem Nokki) into Spanish and vice versa. Doing so, we find that in a morpheme-to-morpheme alignment an important amount of information contained in polysynthetic morphemes has no Spanish counterpart, and its translation is often omitted. We further conduct a qualitative analysis and, thus, identify morpheme types that are commonly hard to align or ignored in the translation process.
Morphological analysis of morphologically rich and low-resource languages is important to both descriptive linguistics and natural language processing. Field documentary efforts usually procure analyzed data in cooperation with native speakers who are capable of providing some level of linguistic information. Manually annotating such data is very expensive and the traditional process is arguably too slow in the face of language endangerment and loss. We report on a case study of learning to automatically gloss a Nakh-Daghestanian language, Lezgi, from a very small amount of seed data. We compare a conditional random field based sequence labeler and a neural encoder-decoder model and show that a nearly 0.9 F1-score on labeled accuracy of morphemes can be achieved with 3,000 words of transcribed oral text. Errors are mostly limited to morphemes with high allomorphy. These results are potentially useful for developing rapid annotation and fieldwork tools to support documentation of morphologically rich, endangered languages.