Alice Millour


2020

pdf bib
Text Corpora and the Challenge of Newly Written Languages
Alice Millour | Karën Fort
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Text corpora represent the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of a sufficient size still remains a complex issue, especially for corpora that are accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. Then, we present our own experiments with crowdsourcing text corpora and an analysis of the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we advocate that the effort towards a greater involvement of the speakers should be pursued, especially when the language of interest is newly written.

pdf bib
Creating Expert Knowledge by Relying on Language Learners: a Generic Approach for Mass-Producing Language Resources by Combining Implicit Crowdsourcing and Language Learning
Lionel Nicolas | Verena Lyding | Claudia Borg | Corina Forascu | Karën Fort | Katerina Zdravkova | Iztok Kosem | Jaka Čibej | Špela Arhar Holdt | Alice Millour | Alexander König | Christos Rodosthenous | Federico Sangati | Umair ul Hassan | Anisia Katinskaia | Anabela Barreiro | Lavinia Aparaschivei | Yaakov HaCohen-Kerner
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of this generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs.

pdf bib
Répliquer et étendre pour l’alsacien “Étiquetage en parties du discours de langues peu dotées par spécialisation des plongements lexicaux” (Replicating and extending for Alsatian : “POS tagging for low-resource languages by adapting word embeddings”)
Alice Millour | Karën Fort | Pierre Magistry
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL)

Nous présentons ici les résultats d’un travail de réplication et d’extension pour l’alsacien d’une expérience concernant l’étiquetage en parties du discours de langues peu dotées par spécialisation des plongements lexicaux (Magistry et al., 2018). Ce travail a été réalisé en étroite collaboration avec les auteurs de l’article d’origine. Cette interaction riche nous a permis de mettre au jour les éléments manquants dans la présentation de l’expérience, de les compléter, et d’étendre la recherche à la robustesse à la variation.

2019

pdf bib
Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling
Alice Millour | Karën Fort
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Building representative linguistic resources and NLP tools for non-standardized languages is challenging: when spelling is not determined by a norm, multiple written forms can be encountered for a given word, inducing a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourced alternative spellings we use to extract rules applied to match OOV words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without expert rule definition. We apply this multilingual methodology on Alsatian, a French regional language and provide an intrinsic evaluation of the correctness of the variants pairs, and an extrinsic evaluation on a downstream task. We show that in a low-resource scenario, 145 inital pairs can lead to the generation of 876 additional variant pairs, and a diminution of OOV words improving the part-of-speech tagging performance by 1 to 4%.

2018

pdf bib
Toward a Lightweight Solution for Less-resourced Languages: Creating a POS Tagger for Alsatian Using Voluntary Crowdsourcing
Alice Millour | Karën Fort
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Vers une solution légère de production de données pour le TAL : création d’un tagger de l’alsacien par crowdsourcing bénévole (Toward a lightweight solution to the language resources bottleneck issue: creating a POS tagger for Alsatian using voluntary crowdsourcing)
Alice Millour | Karën Fort | Delphine Bernhard | Lucie Steiblé
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

Nous présentons ici les résultats d’une expérience menée sur l’annotation en parties du discours d’un corpus d’une langue régionale encore peu dotée, l’alsacien, via une plateforme de myriadisation (crowdsourcing) bénévole développée spécifiquement à cette fin : Bisame1 . La plateforme, mise en ligne en mai 2016, nous a permis de recueillir 15 846 annotations grâce à 42 participants. L’évaluation des annotations, réalisée sur un corpus de référence, montre que la F-mesure des annotations volontaires est de 0, 93. Le tagger entraîné sur le corpus annoté atteint lui 82 % d’exactitude. Il s’agit du premier tagger spécifique à l’alsacien. Cette méthode de développement de ressources langagières est donc efficace et prometteuse pour certaines langues peu dotées, dont un nombre suffisant de locuteurs est connecté et actif sur le Web. Le code de la plateforme, le corpus annoté et le tagger sont librement disponibles.