Jelke Bloem


2020

Expert Concept-Modeling Ground Truth Construction for Word Embeddings Evaluation in Concept-Focused Domains
Arianna Betti | Martin Reynaert | Thijs Ossenkoppele | Yvette Oortwijn | Andrew Salway | Jelke Bloem
Proceedings of the 28th International Conference on Computational Linguistics

We present a novel, domain expert-controlled, replicable procedure for the construction of concept-modeling ground truths with the aim of evaluating the application of word embeddings. In particular, our method is designed to evaluate the application of word and paragraph embeddings in concept-focused textual domains, where a generic ontology does not provide enough information. We illustrate the procedure, and validate it by describing the construction of an expert ground truth, QuiNE-GT. QuiNE-GT is built to answer research questions concerning the concept of naturalized epistemology in QUINE, a 2-million-token, single-author, 20th-century English philosophy corpus of outstanding quality, cleaned up and enriched for the purpose. To the best of our knowledge, expert concept-modeling ground truths are extremely rare in current literature, nor has the theoretical methodology behind their construction ever been explicitly conceptualised and properly systematised. Expert-controlled concept-modeling ground truths are, however, essential to allow proper evaluation of word embedding techniques, and increase their trustworthiness in specialised domains in which the detection of concepts through their expression in texts is important. We highlight challenges, requirements, and prospects for future work.

Distributional Semantics for Neo-Latin
Jelke Bloem | Maria Chiara Parisi | Martin Reynaert | Yvette Oortwijn | Arianna Betti
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

We address the problem of creating and evaluating quality Neo-Latin word embeddings for the purpose of philosophical research, adapting the Nonce2Vec tool to learn embeddings from Neo-Latin sentences. This distributional semantic modeling tool can learn from tiny data incrementally, using a larger background corpus for initialization. We conduct two evaluation tasks: definitional learning of Latin Wikipedia terms, and learning consistent embeddings from 18th-century Neo-Latin sentences pertaining to the concept of mathematical method. Our results show that consistent Neo-Latin word embeddings can be learned from this type of data. While our evaluation results are promising, they do not reveal to what extent the learned models match domain expert knowledge of our Neo-Latin texts. Therefore, we propose an additional evaluation method, grounded in expert-annotated data, that would assess whether learned representations are conceptually sound in relation to the domain of study.

2019

Modeling a Historical Variety of a Low-Resource Language: Language Contact Effects in the Verbal Cluster of Early-Modern Frisian
Jelke Bloem | Arjen Versloot | Fred Weerman
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Certain phenomena of interest to linguists mainly occur in low-resource languages, such as contact-induced language change. We show that it is possible to study contact-induced language change computationally in a historical variety of a low-resource language, Early-Modern Frisian, by creating a model using features that were established to be relevant in a closely related language, modern Dutch. This allows us to test two hypotheses on two types of language contact that may have taken place between Frisian and Dutch during this time. Our model shows that Frisian verb cluster word orders are associated with different context features than Dutch verb orders, supporting the ‘learned borrowing’ hypothesis.

Evaluating the Consistency of Word Embeddings from Small Data
Jelke Bloem | Antske Fokkens | Aurélie Herbelot
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this work, we address the evaluation of distributional semantic models trained on smaller, domain-specific texts, in particular philosophical text. Specifically, we inspect the behaviour of models using a pre-trained background space in learning. We propose a measure of consistency which can be used as an evaluation metric when no in-domain gold-standard data is available. This measure simply computes the ability of a model to learn similar embeddings from different parts of some homogeneous data. We show that in spite of being a simple evaluation, consistency actually depends on various combinations of factors, including the nature of the data itself, the model used to train the semantic space, and the frequency of the learnt terms, both in the background space and in the in-domain data of interest.

2016

Testing the Processing Hypothesis of word order variation using a probabilistic language model
Jelke Bloem
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

This work investigates the application of a measure of surprisal to modeling a grammatical variation phenomenon between near-synonymous constructions. We investigate a particular variation phenomenon, word order variation in Dutch two-verb clusters, where it has been established that word order choice is affected by processing cost. Several multifactorial corpus studies of Dutch verb clusters have used other measures of processing complexity to show that this factor affects word order choice. This previous work allows us to compare the surprisal measure, which is based on constraint satisfaction theories of language modeling, to those previously used measures, which are more directly linked to empirical observations of processing complexity. Our results show that surprisal does not predict the word order choice by itself, but is a significant predictor when used in a measure of uniform information density (UID). This lends support to the view that human language processing is facilitated not so much by predictable sequences of words but more by sequences of words in which information is spread evenly.

2015

An agent-based model of a historical word order change
Jelke Bloem | Arjen Versloot | Fred Weerman
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

2014

Applying automatically parsed corpora to the study of language variation
Jelke Bloem | Arjen Versloot | Fred Weerman
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers