Christian Bentz


2020

pdf bib
Grammatical error detection in transcriptions of spoken English
Andrew Caines | Christian Bentz | Kate Knill | Marek Rei | Paula Buttery
Proceedings of the 28th International Conference on Computational Linguistics

We describe the collection of transcription corrections and grammatical error annotations for the CrowdED Corpus of spoken English monologues on business topics. The corpus recordings were crowdsourced from native speakers of English and learners of English with German as their first language. The new transcriptions and annotations are obtained from different crowdworkers: we analyse the 1108 new crowdworker submissions and propose that they can be used for automatic transcription post-editing and grammatical error correction for speech. To further explore the data we train grammatical error detection models with various configurations including pre-trained and contextual word representations as input, additional features and auxiliary objectives, and extra training data from written error-annotated corpora. We find that a model concatenating pre-trained and contextual word representations as input performs best, and that additional information does not lead to further performance gains.

2018

pdf bib
Using Universal Dependencies in cross-linguistic complexity research
Aleksandrs Berdicevskis | Çağrı Çöltekin | Katharina Ehret | Kilu von Prince | Daniel Ross | Bill Thompson | Chunxiao Yan | Vera Demberg | Gary Lupyan | Taraka Rama | Christian Bentz
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of those differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation.

2016

pdf bib
Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora
Andrew Caines | Christian Bentz | Calbert Graham | Tim Polzehl | Paula Buttery
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL). Release 1 of the CROWDED CORPUS contains 1000 recordings amounting to 33,400 tokens collected from 80 speakers and is freely available to other researchers. We recruited participants via the Crowdee application for Android. Recruits were prompted to respond to business-topic questions of the type found in language learning oral tests. We then used the CrowdFlower web application to pass these recordings to crowdworkers for transcription and annotation of errors and sentence boundaries. Finally, the sentences were tagged and parsed using standard natural language processing tools. We propose that crowdsourcing is a valid and economical method for corpus collection, and discuss the advantages and disadvantages of this approach.

pdf bib
A Comparison Between Morphological Complexity Measures: Typological Data vs. Language Corpora
Christian Bentz | Tatyana Ruzsics | Alexander Koplenig | Tanja Samardžić
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages of 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS), and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word-alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings, and can be used interchangeably.

pdf bib
Learning pressures reduce morphological complexity: Linking corpus, computational and experimental evidence
Christian Bentz | Aleksandrs Berdicevskis
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

The morphological complexity of languages differs widely and changes over time. Pathways of change are often driven by the interplay of multiple competing factors, and are hard to disentangle. We here focus on a paradigmatic scenario of language change: the reduction of morphological complexity from Latin towards the Romance languages. To establish a causal explanation for this phenomenon, we employ three lines of evidence: 1) analyses of parallel corpora to measure the complexity of words in actual language production, 2) applications of NLP tools to further tease apart the contribution of inflectional morphology to word complexity, and 3) experimental data from artificial language learning, which illustrate the learning pressures at play when morphology simplifies. These three lines of evidence converge to show that pressures associated with imperfect language learning are good candidates to causally explain the reduction in morphological complexity in the Latin-to-Romance scenario. More generally, we argue that combining corpus, computational and experimental evidence is the way forward in historical linguistics and linguistic typology.

2014

pdf bib
Towards a computational model of grammaticalization and lexical diversity
Christian Bentz | Paula Buttery
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)