Anaïs Tack


2020

Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers
Núria Gala | Anaïs Tack | Ludivine Javourey-Drevet | Thomas François | Johannes C. Ziegler
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we present a new parallel corpus addressed to researchers, teachers, and speech therapists interested in text simplification as a means of alleviating difficulties in children learning to read. The corpus is composed of excerpts drawn from 79 authentic literary (tales, stories) and scientific (documentary) texts commonly used in French schools for children aged 7 to 9. The excerpts were manually simplified at the lexical, morpho-syntactic, and discourse levels in order to propose a parallel corpus for reading tests and for the development of automatic text simplification tools. A sample of 21 poor-reading and dyslexic children with an average reading delay of 2.5 years read a portion of the corpus. The transcripts of reading errors were integrated into the corpus with the goal of identifying lexical difficulty in the target population. By means of statistical testing, we provide evidence that the manual simplifications significantly reduced reading errors, highlighting that the words targeted for simplification were not only well-chosen but also substituted with substantially easier alternatives. The entire corpus is available for consultation through a web interface and available on demand for research purposes.

2018

A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet
Anaïs Tack | Thomas François | Piet Desmet | Cédrick Fairon
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we introduce NT2Lex, a novel lexical resource for Dutch as a foreign language (NT2) which includes frequency distributions of 17,743 words and expressions attested in expert-written textbook texts and readers graded along the scale of the Common European Framework of Reference (CEFR). In essence, the lexicon informs us about what kind of vocabulary should be understood when reading Dutch as a non-native reader at a particular proficiency level. The main novelty of the resource with respect to the previously developed CEFR-graded lexicons concerns the introduction of corpus-based evidence for L2 word sense complexity through the linkage to Open Dutch WordNet (Postma et al., 2016). The resource thus contains, on top of the lemmatised and part-of-speech tagged lexical entries, a total of 11,999 unique word senses and 8,934 distinct synsets.

Deep Learning Architecture for Complex Word Identification
Dirk De Hertog | Anaïs Tack
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We describe a system for the CWI task that includes information on 5 aspects of the (complex) lexical item, namely distributional information of the item itself, morphological structure, psychological measures, corpus counts and topical information. We constructed a deep learning architecture that combines those features and applied it to the probabilistic and binary classification tasks for all English sets and for Spanish. We achieved reasonable performance on all sets, with the best performance seen on the probabilistic task, particularly on the English news set (MAE 0.054 and F1-score of 0.872). An analysis of the results shows that reasonable performance can be achieved with a single architecture without any domain-specific tweaking of the parameter settings and that distributional features capture almost all of the information also found in hand-crafted features.

2017

Human and Automated CEFR-based Grading of Short Answers
Anaïs Tack | Thomas François | Sophie Roekhaut | Cédrick Fairon
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper is concerned with the task of automatically assessing the written proficiency level of non-native (L2) learners of English. Drawing on previous research on automated L2 writing assessment following the Common European Framework of Reference for Languages (CEFR), we investigate the possibilities and difficulties of deriving the CEFR level from short answers to open-ended questions, a task that has so far received little attention. The aim of our study is twofold: to examine the intricacies of both human and automated CEFR-based grading of short answers. On the one hand, we describe the compilation of a learner corpus of short answers graded with CEFR levels by three certified Cambridge examiners. We mainly observe that, although the shortness of the answers is reported as undermining a clear-cut evaluation, the length of the answer does not necessarily correlate with inter-examiner disagreement. On the other hand, we explore the development of a soft-voting system for the automated CEFR-based grading of short answers and draw tentative conclusions about its use in a computer-assisted testing (CAT) setting.

2016

SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners
Thomas François | Elena Volodina | Ildikó Pilán | Anaïs Tack
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper introduces SVALex, a lexical resource primarily aimed at learners and teachers of Swedish as a foreign and second language that describes the distribution of 15,681 words and expressions across the Common European Framework of Reference (CEFR). The resource is based on a corpus of coursebook texts, and thus describes the receptive vocabulary learners are exposed to during reading activities, as opposed to the productive vocabulary they use when speaking or writing. The paper describes the methodology applied to create the list and to estimate the frequency distribution. It also discusses some characteristics of the resulting resource and compares it to other lexical resources for Swedish. An interesting feature of this resource is the possibility to separate the wheat from the chaff, distinguishing the core vocabulary at each level, i.e. vocabulary shared by several coursebook writers at that level, from peripheral vocabulary, which is used by only a minority of the coursebook writers.

Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This study examines two possibilities of using the FLELex graded lexicon for the automated assessment of text complexity in the learning of French as a foreign language. From the lexical frequency distributions described in FLELex, we derive a single level of difficulty for each word in a parallel corpus of original and simplified texts. We then use this data to automatically assess the lexical complexity of texts in two ways. On the one hand, we evaluate the degree of lexical simplification in manually simplified texts with respect to their original version. Our results show a significant simplification effect, both in the case of French narratives simplified for non-native readers and in the case of simplified Wikipedia texts. On the other hand, we define a predictive model which identifies the number of words in a text that are expected to be known at a particular learning level. We assess the accuracy with which these predictions are able to capture actual word knowledge as reported by Dutch-speaking learners of French. Our study shows that although the predictions seem relatively accurate in general (87.4% to 92.3%), they do not yet seem to cover the learners’ lack of knowledge very well.

Modèles adaptatifs pour prédire automatiquement la compétence lexicale d’un apprenant de français langue étrangère (Adaptive models for automatically predicting the lexical competence of French as a foreign language learners)
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

This study examines the use of supervised incremental learning methods to predict the lexical competence of learners of French as a foreign language (FLE). The targeted learners are Dutch speakers at the A2/B1 level of the Common European Framework of Reference for Languages (CEFR). Following recent work on predicting lexical mastery by means of complexity features, we develop two types of models that adapt in response to feedback revealing the learner’s knowledge. In particular, we define (i) a model that predicts the lexical competence of all learners at the same proficiency level and (ii) a model that predicts the lexical competence of an individual learner. The resulting models are then evaluated against a baseline model that determines lexical competence from a lexicon specialized for FLE, and are shown to gain significantly in accuracy (9%-17%).