Mike Kestemont


2020

Neural Machine Translation of Artwork Titles Using Iconclass Codes
Nikolay Banar | Walter Daelemans | Mike Kestemont
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We investigate the use of Iconclass in the context of neural machine translation for NL<->EN artwork titles. Iconclass is a widely used iconographic classification system in the cultural heritage domain, employed to describe and retrieve subjects represented in the visual arts. The resource contains keywords and definitions that encode the presence of objects, people, events and ideas depicted in artworks, such as paintings. We propose a simple concatenation approach that improves the quality of automatically generated title translations for artworks by leveraging textual information extracted from Iconclass. Our results demonstrate that a neural machine translation system is able to exploit this metadata to boost the translation performance on artwork titles. This technology enables interesting applications of machine learning in resource-scarce domains in the cultural sector.
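
The concatenation approach can be illustrated with a minimal sketch; the separator token and the Iconclass keyword fields below are assumptions for illustration, not the paper's exact input format.

```python
# Hedged sketch of metadata concatenation for NMT input (hypothetical
# separator token and keyword fields; the paper's format may differ).

def build_source_line(title: str, iconclass_keywords: list[str],
                      sep: str = "<SEP>") -> str:
    """Append Iconclass textual metadata to an artwork title so that a
    standard sequence-to-sequence NMT system can condition on it."""
    metadata = " ".join(iconclass_keywords)
    return f"{title} {sep} {metadata}" if metadata else title

# Example: a Dutch title enriched with illustrative Iconclass keywords.
print(build_source_line("Gezicht op Delft", ["city-view", "townscape"]))
# -> "Gezicht op Delft <SEP> city-view townscape"
```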

Detecting Direct Speech in Multilingual Collection of 19th-century Novels
Joanna Byszuk | Michał Woźniak | Mike Kestemont | Albert Leśniak | Wojciech Łukasik | Artjoms Šeļa | Maciej Eder
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

Fictional prose can be broadly divided into narrative and discursive forms, with direct speech being central to any discourse representation (alongside indirect reported speech and free indirect discourse). This distinction is crucial in digital literary studies and enables interesting forms of narratological or stylistic analysis. The difficulty of automatically detecting direct speech, however, is currently underestimated. Rule-based systems that work reasonably well for modern languages struggle with (the lack of) typographical conventions in 19th-century literature. While machine learning approaches to sequence modeling can be applied to the task, they typically face severe skewness in the availability of training material, especially for lesser-resourced languages. In this paper, we report the results of a multilingual approach to direct speech detection in a diverse corpus of 19th-century fiction in 9 European languages. The proposed method fine-tunes a transformer architecture with a multilingual sentence embedder on a minimal amount of annotated training data in each language, and improves performance across languages with ambiguous direct speech marking, in comparison to a carefully constructed regular expression baseline.
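
The kind of rule-based baseline the paper improves upon can be sketched as follows; the patterns below are a deliberately minimal illustration, not the carefully constructed baseline used in the paper.

```python
import re

# Toy rule-based direct speech detector: quoted spans and dash-introduced
# speech turns, two conventions common in 19th-century European typography.
QUOTED = re.compile(r'[«„“"]([^«»„“”"]+)[»”"]')
DASH_TURN = re.compile(r'^\s*[-–]\s+(.+)$', re.MULTILINE)

def detect_direct_speech(text: str) -> list[str]:
    spans = [m.group(1) for m in QUOTED.finditer(text)]
    spans += [m.group(1) for m in DASH_TURN.finditer(text)]
    return spans

print(detect_direct_speech('– Kom binnen, zei hij. «Het is koud.»'))
```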

2019

On the Feasibility of Automated Detection of Allusive Text Reuse
Enrique Manjavacas | Brian Long | Mike Kestemont
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The detection of allusive text reuse is particularly challenging due to the sparse evidence on which allusive references rely, commonly based on few or no shared words. Arguably, lexical semantics can help here, since uncovering semantic relations between words has the potential to increase the support underlying the allusion and alleviate the lexical sparsity. A further obstacle is the lack of evaluation benchmark corpora, largely due to the highly interpretative character of the annotation process. In the present paper, we aim to elucidate the feasibility of automated allusion detection. We approach the matter from an Information Retrieval perspective in which referencing texts act as queries and referenced texts as relevant documents to be retrieved, and estimate the difficulty of benchmark corpus compilation through a novel inter-annotator agreement study on query segmentation. Furthermore, we investigate to what extent the integration of lexical semantic information derived from distributional models and ontologies can aid the retrieval of cases of allusive reuse. The results show that (i) despite low agreement scores, using manual queries considerably improves retrieval performance with respect to a windowing approach, and that (ii) retrieval performance can be moderately boosted with distributional semantics.
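
The retrieval setup can be made concrete with a small sketch in which query and candidate passages are scored by the cosine between averaged word embeddings; this is one simple way to inject distributional semantics, and the embedding table and dimensionality below are assumptions.

```python
import numpy as np

def avg_vector(tokens, embeddings, dim=300):
    # Average the pre-trained vectors of in-vocabulary tokens.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_candidates(query_tokens, candidates, embeddings):
    # Rank referenced passages (documents) for a referencing passage (query).
    q = avg_vector(query_tokens, embeddings)
    scores = []
    for doc_id, doc_tokens in candidates.items():
        d = avg_vector(doc_tokens, embeddings)
        denom = float(np.linalg.norm(q) * np.linalg.norm(d)) or 1.0
        scores.append((doc_id, float(q @ d) / denom))
    return sorted(scores, key=lambda s: s[1], reverse=True)
```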

Generation of Hip-Hop Lyrics with Hierarchical Modeling and Conditional Templates
Enrique Manjavacas | Mike Kestemont | Folgert Karsdorp
Proceedings of the 12th International Conference on Natural Language Generation

This paper addresses Hip-Hop lyric generation with conditional Neural Language Models. We develop a simple yet effective mechanism to extract and apply conditional templates from text snippets, and show, on the basis of a large-scale crowd-sourced manual evaluation, that these templates significantly improve the quality and realism of the generated snippets. Importantly, the proposed approach enables end-to-end training, targeting formal properties of text such as rhythm and rhyme, which are central characteristics of rap texts. Additionally, we explore how generating text at different scales (e.g. character-level or word-level) affects the quality of the output. We find that a hybrid form, a hierarchical model that integrates Language Modeling at both the word and character level, yields significant improvements in text quality yet, surprisingly, cannot exploit conditional templates to their fullest extent. Our findings highlight that text generation models based on Recurrent Neural Networks (RNN) are sensitive to the modeling scale and call for further research on the observed differences in effectiveness of the conditioning mechanism at different scales.
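
What a conditional template might look like can be sketched as follows: per line, a rough syllable count and a crude rhyme key. This is an illustrative simplification; the paper's template definition and extraction mechanism are richer.

```python
import re

def syllables(word: str) -> int:
    # Crude syllable estimate: count vowel groups.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def rhyme_key(line: str) -> str:
    # Crude rhyme class: final vowel-to-end segment of the last word.
    last = line.strip().split()[-1].lower().strip(".,!?;:")
    m = re.search(r"[aeiouy][a-z]*$", last)
    return m.group(0) if m else last

def extract_template(snippet: str):
    # One (syllable_count, rhyme_key) pair per non-empty line.
    lines = [l for l in snippet.splitlines() if l.strip()]
    return [(sum(syllables(w) for w in l.split()), rhyme_key(l))
            for l in lines]
```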

Improving Lemmatization of Non-Standard Languages with Joint Learning
Enrique Manjavacas | Ákos Kádár | Mike Kestemont
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (iii): spelling variation due to the lack of orthographic standards. We approach lemmatization as a string-transduction task with an Encoder-Decoder architecture which we enrich with sentence information using a hierarchical sentence encoder. We show significant improvements over the state-of-the-art by fine-tuning the sentence encodings to jointly optimize a bidirectional language model loss. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. Additionally, we test the proposed model on a set of typologically diverse standard languages, showing results on par with or better than a model without fine-tuned sentence representations and previous state-of-the-art systems. Finally, to encourage future work on processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, which is based on openly accessible sources.
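
The joint objective can be summarized schematically; the loss weight, tensor shapes and target shifting below are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn.functional as F

# Logits: (batch, seq_len, vocab); targets: (batch, seq_len), assumed to be
# shifted upstream for the forward and backward language-model directions.
def joint_loss(lemma_logits, lemma_targets,
               fwd_lm_logits, fwd_targets,
               bwd_lm_logits, bwd_targets,
               lm_weight: float = 0.5):
    transduction = F.cross_entropy(lemma_logits.transpose(1, 2), lemma_targets)
    lm = (F.cross_entropy(fwd_lm_logits.transpose(1, 2), fwd_targets)
          + F.cross_entropy(bwd_lm_logits.transpose(1, 2), bwd_targets))
    # The shared sentence encoder receives gradients from both objectives.
    return transduction + lm_weight * lm
```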

2017

Proceedings of the Workshop on Computational Creativity in Natural Language Generation (CC-NLG 2017)
Hugo Gonçalo Oliveira | Ben Burtenshaw | Mike Kestemont | Tom De Smedt
Proceedings of the Workshop on Computational Creativity in Natural Language Generation (CC-NLG 2017)

Synthetic Literature: Writing Science Fiction in a Co-Creative Process
Enrique Manjavacas | Folgert Karsdorp | Ben Burtenshaw | Mike Kestemont
Proceedings of the Workshop on Computational Creativity in Natural Language Generation (CC-NLG 2017)

Assessing the Stylistic Properties of Neurally Generated Text in Authorship Attribution
Enrique Manjavacas | Jeroen De Gussem | Walter Daelemans | Mike Kestemont
Proceedings of the Workshop on Stylistic Variation

Recent applications of neural language models have led to an increased interest in the automatic generation of natural language. However impressive these applications may be, the evaluation of neurally generated text has so far remained rather informal and anecdotal. Here, we attempt a systematic assessment of one aspect of the quality of neurally generated text: its ability to reproduce authorial writing styles. Using established models for authorship attribution, we empirically assess the stylistic qualities of neurally generated text. In comparison to conventional language models, neural models generate fuzzier text that is relatively harder to attribute correctly. Nevertheless, our results also suggest that neurally generated text offers more valuable perspectives for the augmentation of training data.
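
One way to operationalize such an evaluation is sketched below: fit an established attribution model (here, a character n-gram classifier, an assumption rather than the paper's exact configuration) on authentic texts and measure how reliably generated text is attributed to the author it imitates.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Character n-gram features with a linear SVM, a standard attribution setup.
attributor = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LinearSVC(),
)

def attribution_accuracy(train_texts, train_authors,
                         generated_texts, imitated_authors):
    # Train on authentic texts, then test on neurally generated imitations.
    attributor.fit(train_texts, train_authors)
    preds = attributor.predict(generated_texts)
    return sum(p == a for p, a in zip(preds, imitated_authors)) / len(preds)
```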

2014

Mining the Twentieth Century’s History from the Time Magazine Corpus
Mike Kestemont | Folgert Karsdorp | Marten Düring
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Function Words in Authorship Attribution. From Black Magic to Theory?
Mike Kestemont
Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL)

2012

The Netlog Corpus. A Resource for the Study of Flemish Dutch Internet Language
Mike Kestemont | Claudia Peersman | Benny De Decker | Guy De Pauw | Kim Luyckx | Roser Morante | Frederik Vaassen | Janneke van de Loo | Walter Daelemans
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Although in recent years numerous forms of Internet communication (such as e-mail, blogs, chat rooms and social network environments) have emerged, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, analyzing such a corpus on a large scale requires NLP tools, e.g. for automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language usage, we propose to normalize this ‘anomalous’ input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.
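
The normalization step can be illustrated with a minimal lookup-based sketch; the table below is a toy example, whereas real systems would learn such mappings, e.g. from the annotated Chatty subset.

```python
# Toy mapping from Flemish chat forms to standard Dutch (illustrative only).
NORMALIZATION_TABLE = {
    "da": "dat",    # "that"
    "egt": "echt",  # "really"
    "ma": "maar",   # "but"
}

def normalize(tokens: list[str]) -> list[str]:
    # Replace known chat forms so standard Dutch NLP tools can process them.
    return [NORMALIZATION_TABLE.get(t.lower(), t) for t in tokens]

print(normalize("da was egt leuk".split()))  # ['dat', 'was', 'echt', 'leuk']
```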