Andrea Horbach


2020

Linguistic Appropriateness and Pedagogic Usefulness of Reading Comprehension Questions
Andrea Horbach | Itziar Aldabe | Marie Bexte | Oier Lopez de Lacalle | Montse Maritxalar
Proceedings of the 12th Language Resources and Evaluation Conference

Automatic generation of reading comprehension questions is a topic receiving growing interest in the NLP community, but there is currently no consensus on evaluation metrics and many approaches focus on linguistic quality only while ignoring the pedagogic value and appropriateness of questions. This paper addresses these weaknesses with a new evaluation scheme where questions from the questionnaire are structured in a hierarchical way to avoid confronting human annotators with evaluation measures that do not make sense for a certain question. We show through an annotation study that our scheme can be applied, but that annotators with some level of expertise are needed. We also created and evaluated two new evaluation data sets from the biology domain for Basque and German, composed of questions written by people with an educational background, which will be publicly released. Results show that manually generated questions are in general of both higher linguistic and higher pedagogic quality and that among the human-generated questions, teacher-generated ones tend to be most useful.

Don’t take “nswvtnvakgxpm” for an answer – The surprising vulnerability of automatic content scoring systems to adversarial input
Yuning Ding | Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch
Proceedings of the 28th International Conference on Computational Linguistics

Automatic content scoring systems are widely used on short answer tasks to save human effort. However, the use of these systems can invite cheating strategies, such as students writing irrelevant answers in the hopes of gaining at least partial credit. We generate adversarial answers for benchmark content scoring datasets based on different methods of increasing sophistication and show that even simple methods lead to a surprising decrease in content scoring performance. As an extreme example, up to 60% of adversarial answers generated from random shuffling of words in real answers are accepted by a state-of-the-art scoring system. In addition to analyzing the vulnerabilities of content scoring systems, we examine countermeasures such as adversarial training and show that these measures improve system robustness against adversarial answers considerably but do not suffice to completely solve the problem.
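
The word-shuffling attack mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `shuffle_answer`, the whitespace tokenization, and the `seed` parameter are assumptions made for this example only.

```python
import random


def shuffle_answer(answer: str, seed: int = 0) -> str:
    """Return the answer with its whitespace-separated tokens in random order.

    Hypothetical helper: the paper's exact tokenization and sampling
    procedure are not specified in the abstract.
    """
    tokens = answer.split()
    rng = random.Random(seed)  # local generator keeps the result reproducible
    rng.shuffle(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    real = "the cell membrane controls what enters the cell"
    print(shuffle_answer(real, seed=1))
```

The shuffled string keeps every word of the real answer, so a scorer that relies only on bag-of-words evidence would see the same features as for the original.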

2018

ESCRITO - An NLP-Enhanced Educational Scoring Toolkit
Torsten Zesch | Andrea Horbach
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Semi-Supervised Clustering for Short Answer Scoring
Andrea Horbach | Manfred Pinkal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Cross-Lingual Content Scoring
Andrea Horbach | Sebastian Stennmanns | Torsten Zesch
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We investigate the feasibility of cross-lingual content scoring, a scenario where training and test data in an automatic scoring task are from two different languages. Cross-lingual scoring can contribute to educational equality by allowing answers in multiple languages. Training a model in one language and applying it to another language might also help to overcome data sparsity issues by re-using trained models from other languages. As there is no suitable dataset available for this new task, we create a comparable bi-lingual corpus by extending the English ASAP dataset with German answers. Our experiments with cross-lingual scoring based on machine-translating either training or test data show a considerable drop in scoring quality.

2017

Investigating neural architectures for short answer scoring
Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch | Chong Min Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models – ngrams and embeddings – are arguably well-suited to evaluate content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.

Fine-grained essay scoring of a complex writing task for native speakers
Andrea Horbach | Dirk Scholten-Akoun | Yuning Ding | Torsten Zesch
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Automatic essay scoring is nowadays successfully used even in high-stakes tests, but this is mainly limited to holistic scoring of learner essays. We present a new dataset of essays written by highly proficient German native speakers that is scored using a fine-grained rubric with the goal of providing detailed feedback. Our experiments with two state-of-the-art scoring systems (a neural and an SVM-based one) show a large drop in performance compared to existing datasets. This demonstrates the need for such datasets to guide research on more elaborate essay scoring methods.

The Influence of Spelling Errors on Content Scoring Performance
Andrea Horbach | Yuning Ding | Torsten Zesch
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

Spelling errors occur frequently in educational settings, but their influence on automatic scoring is largely unknown. We therefore investigate the influence of spelling errors on content scoring performance using the example of the ASAP corpus. We conduct an annotation study on the nature of spelling errors in the ASAP dataset and utilize these findings in machine learning experiments that measure the influence of spelling errors on automatic content scoring. Our main finding is that scoring methods using both token and character n-gram features are robust against spelling errors up to the error frequency in ASAP.

2016

Improving POS Tagging of German Learner Language in a Reading Comprehension Scenario
Lena Keiper | Andrea Horbach | Stefan Thater
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a novel method to automatically improve the accuracy of part-of-speech taggers on learner language. The key idea underlying our approach is to exploit the structure of a typical language learner task and automatically induce POS information for out-of-vocabulary (OOV) words. To evaluate the effectiveness of our approach, we add manual POS and normalization information to an existing language learner corpus. Our evaluation shows an increase in accuracy from 72.4% to 81.5% on OOV words.

A Corpus of Literal and Idiomatic Uses of German Infinitive-Verb Compounds
Andrea Horbach | Andrea Hensler | Sabine Krome | Jakob Prange | Werner Scholze-Stubenrecht | Diana Steffen | Stefan Thater | Christian Wellner | Manfred Pinkal
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present an annotation study on a representative dataset of literal and idiomatic uses of German infinitive-verb compounds in newspaper and journal texts. Infinitive-verb compounds form a challenge for writers of German, because spelling regulations are different for literal and idiomatic uses. Through the participation of expert lexicographers we were able to obtain a high-quality corpus resource which offers itself as a testbed for automatic idiomaticity detection and coarse-grained word-sense disambiguation. We trained a classifier on the corpus which was able to distinguish literal and idiomatic uses with an accuracy of 85%.

Unsupervised Ranked Cross-Lingual Lexical Substitution for Low-Resource Languages
Stefan Ecker | Andrea Horbach | Stefan Thater
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We propose an unsupervised system for a variant of cross-lingual lexical substitution (CLLS) to be used in a reading scenario in computer-assisted language learning (CALL), in which single-word translations provided by a dictionary are ranked according to their appropriateness in context. In contrast to most alternative systems, ours does not rely on either parallel corpora or machine translation systems, making it suitable for low-resource languages as the language to be learned. This is achieved by a graph-based scoring mechanism which can deal with ambiguous translations of context words provided by a dictionary. Due to this decoupling from the source language, we need monolingual corpus resources only for the target language, i.e. the language of the translation candidates. We evaluate our approach for the language pair Norwegian Nynorsk-English on an exploratory manually annotated gold standard and report promising results. When running our system on the original SemEval CLLS task, we rank 6th out of 18 (including 2 baselines and our 2 system variants) in the best evaluation.

Investigating Active Learning for Short-Answer Scoring
Andrea Horbach | Alexis Palmer
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

UdS-(retrain|distributional|surface): Improving POS Tagging for OOV Words in German CMC and Web Data
Jakob Prange | Andrea Horbach | Stefan Thater
Proceedings of the 10th Web as Corpus Workshop

2015

Using Shallow Syntactic Features to Measure Influences of L1 and Proficiency Level in EFL Writings
Andrea Horbach | Jonathan Poitz | Alexis Palmer
Proceedings of the fourth workshop on NLP for computer-assisted language learning

Annotating Entailment Relations for Shortanswer Questions
Simon Ostermann | Andrea Horbach | Manfred Pinkal
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

2014

Paraphrase Detection for Short Answer Scoring
Nikolina Koleva | Andrea Horbach | Alexis Palmer | Simon Ostermann | Manfred Pinkal
Proceedings of the third workshop on NLP for computer-assisted language learning

Finding a Tradeoff between Accuracy and Rater’s Workload in Grading Clustered Short Answers
Andrea Horbach | Alexis Palmer | Magdalena Wolska
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we investigate the potential of answer clustering for semi-automatic scoring of short answer questions for German as a foreign language. We use surface features like word and character n-grams to cluster answers to listening comprehension exercises per question and simulate having human graders only label one answer per cluster and then propagating this label to all other members of the cluster. We investigate various ways to select this single item to be labeled and find that choosing the item closest to the centroid of a cluster leads to improved (simulated) grading accuracy over random item selection. Averaged over all questions, we can reduce a teacher’s workload to labeling only 40% of all different answers for a question, while still maintaining a grading accuracy of more than 85%.
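
The label-propagation idea in this abstract (grade one answer per cluster, chosen as the item closest to the centroid, and copy its label to the rest) can be sketched as follows. This is an illustration under stated assumptions, not the paper's system: the clusters are taken as given rather than computed, only character trigram counts are used as features, squared Euclidean distance is an arbitrary choice, and all function names (`char_ngrams`, `centroid`, `sq_distance`, `propagate_labels`) and the `ask_grader` callback are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a lowercased, space-padded answer."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def centroid(vectors: List[Counter]) -> Dict[str, float]:
    """Mean n-gram count vector of a cluster."""
    total: Counter = Counter()
    for vec in vectors:
        total.update(vec)
    return {gram: count / len(vectors) for gram, count in total.items()}


def sq_distance(vec: Counter, center: Dict[str, float]) -> float:
    """Squared Euclidean distance between an answer vector and a centroid."""
    keys = set(vec) | set(center)
    return sum((vec.get(k, 0) - center.get(k, 0.0)) ** 2 for k in keys)


def propagate_labels(
    clusters: List[List[str]],
    ask_grader: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the grader about one answer per cluster and copy the label to all members.

    `clusters` is assumed to come from an earlier clustering step.
    """
    labels: Dict[str, str] = {}
    for answers in clusters:
        vectors = [char_ngrams(a) for a in answers]
        center = centroid(vectors)
        # Representative = the answer whose vector lies closest to the centroid.
        representative, _ = min(
            zip(answers, vectors), key=lambda pair: sq_distance(pair[1], center)
        )
        label = ask_grader(representative)
        for answer in answers:
            labels[answer] = label
    return labels


if __name__ == "__main__":
    demo_clusters = [["der Hund", "der hund", "ein Hund"], ["die Katze"]]
    print(propagate_labels(demo_clusters, lambda a: "correct" if "und" in a.lower() else "incorrect"))
```

In this sketch the grader is consulted once per cluster, which is where the workload reduction described in the abstract would come from; the accuracy of the propagated labels depends entirely on how pure the clusters are.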

2013

Using the text to evaluate short answers for reading comprehension exercises
Andrea Horbach | Alexis Palmer | Manfred Pinkal
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity