Proceedings of the Workshop on Language Technology for Digital Historical Archives

Cristina Vertan (University of Hamburg), Petya Osenova (Bulgarian Academy of Sciences and St. Kliment Ohridski University of Sofia), Dimitar Iliev (St. Kliment Ohridski University of Sofia) (Editors)

Varna, Bulgaria
Graphemic ambiguous queries on Arabic-scripted historical corpora
Alicia González Martínez

Word Clustering for Historical Newspapers Analysis
Lidia Pivovarova | Elaine Zosa | Jani Marjanen

This paper is part of a collaboration between computer scientists and historians aimed at developing novel tools and methods to improve the analysis of historical newspapers. We present a case study of ideological terms ending with the suffix -ism in nineteenth-century Finnish newspapers. We propose a two-step procedure to trace differences in word usage over time: training diachronic embeddings on several time slices and then clustering the embeddings of selected words together with their neighbours to obtain historical context. The obtained clusters turn out to be useful for historical studies. The paper also discusses specific difficulties related to the development of historian-oriented tools.
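
The second step of the procedure described above can be illustrated with a minimal sketch. It assumes that embeddings have already been trained for a given time slice; the two-dimensional toy vectors, the names `SLICE_EMBEDDINGS`, `nearest_neighbours` and `cluster_with_context`, and the threshold-based linking are all assumptions made for illustration, since the abstract does not name the clustering algorithm the authors used.

```python
import math

# Hypothetical 2-D embeddings for a single time slice. In practice one such
# model would be trained on each slice of the newspaper corpus.
SLICE_EMBEDDINGS = {
    "socialism": (1.0, 0.1),
    "communism": (0.9, 0.2),
    "worker": (0.8, 0.3),
    "protestantism": (0.1, 1.0),
    "catholicism": (0.2, 0.9),
    "church": (0.3, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, embeddings, k):
    """Return the k words most similar to `word` in this slice."""
    target = embeddings[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def cluster_with_context(selected, embeddings, k=2, threshold=0.9):
    """Cluster the selected words together with their nearest neighbours.

    Two words end up in the same cluster if they are connected by a chain
    of pairs whose cosine similarity is at least `threshold`.
    """
    vocabulary = []
    for word in selected:
        for item in [word] + nearest_neighbours(word, embeddings, k):
            if item not in vocabulary:
                vocabulary.append(item)

    clusters = []
    for word in vocabulary:
        linked = [c for c in clusters
                  if any(cosine(embeddings[word], embeddings[m]) >= threshold
                         for m in c)]
        merged = {word}
        for c in linked:
            merged |= c
        clusters = [c for c in clusters if c not in linked]
        clusters.append(merged)
    return clusters


for found in cluster_with_context(["socialism", "protestantism"], SLICE_EMBEDDINGS):
    print(sorted(found))
```

Running the same function on the embeddings of each time slice and comparing the resulting clusters is one way a historian could inspect how the context of a term shifts over time.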

Geotagging a Diachronic Corpus of Alpine Texts: Comparing Distinct Approaches to Toponym Recognition
Tannon Kew | Anastassia Shaitarova | Isabel Meraner | Janis Goldzycher | Simon Clematide | Martin Volk

Geotagging historic and cultural texts provides valuable access to heritage data, enabling location-based searching and new geographically related discoveries. In this paper, we describe two distinct approaches to geotagging a variety of fine-grained toponyms in a diachronic corpus of alpine texts. By applying a traditional gazetteer-based approach, aided by a few simple heuristics, we attain strong high-precision annotations. Using the output of this earlier system, we adopt a state-of-the-art neural approach in order to facilitate the detection of new toponyms on the basis of context. Additionally, we present the results of preliminary experiments on integrating a small amount of crowdsourced annotations to improve overall performance of toponym recognition in our heritage corpus.
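
A gazetteer-based tagger of the kind mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the three-entry `GAZETTEER`, the function name `tag_toponyms`, and the two heuristics (longest match first, and a capitalised first token) are hypothetical and are not taken from the authors' system.

```python
import re

# Hypothetical miniature gazetteer; a real one would hold many thousands
# of entries together with coordinates and name variants.
GAZETTEER = {"Matterhorn", "Zermatt", "Monte Rosa"}


def tag_toponyms(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form) for every gazetteer match in text.

    Heuristics: prefer the longest multi-word match, and only start a
    match at a capitalised token to avoid common-noun homographs.
    """
    entries = {tuple(name.lower().split()) for name in gazetteer}
    max_len = max((len(entry) for entry in entries), default=0)
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]

    spans = []
    i = 0
    while i < len(tokens):
        match_len = 0
        if tokens[i][0][0].isupper():
            # Try the longest candidate first, then shorter ones.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = tuple(tok.lower() for tok, _, _ in tokens[i:i + n])
                if candidate in entries:
                    match_len = n
                    break
        if match_len:
            start = tokens[i][1]
            end = tokens[i + match_len - 1][2]
            spans.append((start, end, text[start:end]))
            i += match_len
        else:
            i += 1
    return spans


sentence = "From Zermatt we climbed the Matterhorn and saw Monte Rosa."
print(tag_toponyms(sentence))
```

Annotations produced this way are precise but limited to known names, which is why such output is a plausible source of training data for a context-based neural tagger.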

Controlled Semi-automatic Annotation of Classical Ethiopic
Cristina Vertan

The preservation of cultural heritage by means of digital methods has become extremely popular in recent years. After intensive digitization campaigns, the focus is slowly moving from genuine preservation (i.e. digital archiving together with standard search mechanisms) to research-oriented usage of materials available electronically. This usage is intended to go far beyond simple reading of digitized materials; researchers should be able to gain new insights into the materials and discover new facts by means of tools relying on innovative algorithms. In this article we describe the workflow necessary for the annotation of a diachronic corpus of Classical Ethiopic, a language of essential importance for the study of Early Christianity.

Implementing an archival, multilingual and Semantic Web-compliant taxonomy by means of SKOS (Simple Knowledge Organization System)
Francesco Gelati

The paper shows how a multilingual hierarchical thesaurus, or taxonomy, can be created and implemented in compliance with Semantic Web requirements by means of the data model SKOS (Simple Knowledge Organization System). It takes the EHRI (European Holocaust Research Infrastructure) portal as an example, and shows how open-source software like SKOS Play! can facilitate the task.
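
To make the SKOS data model more concrete, the following sketch serialises a two-level multilingual taxonomy as Turtle. It uses only standard SKOS terms (`skos:Concept`, `skos:ConceptScheme`, `skos:prefLabel`, `skos:broader`, `skos:inScheme`); the `http://example.org/taxonomy/` namespace, the concept identifiers and the labels are invented for illustration and are not taken from the EHRI thesaurus.

```python
# Hypothetical base namespace for the example taxonomy.
BASE = "http://example.org/taxonomy/"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# Each concept has one preferred label per language and an optional parent.
CONCEPTS = {
    "archives": {
        "labels": {"en": "Archives", "de": "Archive", "fr": "Archives"},
        "broader": None,
    },
    "finding_aids": {
        "labels": {"en": "Finding aids", "de": "Findmittel",
                   "fr": "Instruments de recherche"},
        "broader": "archives",
    },
}


def concept_to_turtle(concept_id, labels, broader=None):
    """Serialise one SKOS concept as a Turtle statement block."""
    lines = [f"ex:{concept_id} a skos:Concept ;",
             "    skos:inScheme ex:scheme ;"]
    for lang, label in sorted(labels.items()):
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'    skos:prefLabel "{escaped}"@{lang} ;')
    if broader:
        lines.append(f"    skos:broader ex:{broader} ;")
    # The last predicate closes the statement with a full stop.
    lines[-1] = lines[-1][:-2] + " ."
    return "\n".join(lines)


def taxonomy_to_turtle(concepts):
    """Serialise the whole taxonomy, including prefixes and the scheme."""
    blocks = [
        f"@prefix skos: <{SKOS_NS}> .\n@prefix ex: <{BASE}> .",
        "ex:scheme a skos:ConceptScheme .",
    ]
    for concept_id, data in concepts.items():
        blocks.append(
            concept_to_turtle(concept_id, data["labels"], data["broader"]))
    return "\n\n".join(blocks) + "\n"


print(taxonomy_to_turtle(CONCEPTS))
```

The hierarchy is carried entirely by `skos:broader`, and each language version is a language-tagged `skos:prefLabel` on the same concept, so adding a language does not change the structure of the taxonomy.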

EU 4 U: An educational platform for the cultural heritage of the EU
Maria Stambolieva

The paper presents an ongoing project of the NBU Laboratory for Language Technology aiming to create a multilingual, CEFR-graded electronic didactic resource for online learning, centered on the history and cultural heritage of the EU (e-EULearn). The resource is developed within the e-Platform of the NBU Laboratory for Language Technology and re-uses the rich corpus of educational material created at the Laboratory for the needs of NBU program modules, distance and blended learning language courses and other projects. Because the focus is not just on foreign language tuition but above all on people, places and events in the history and culture of the EU member states, the annotation modules of the e-Platform have been extended accordingly. Current and upcoming activities are directed at: (1) enriching the English corpus of didactic materials on EU history and culture; (2) translating the texts into the other official EU languages and aligning the translations with the English texts; (3) developing new test modules. In the process of developing this resource, a database on important people, places, objects and events in the cultural history of the EU will be created.

Modelling linguistic vagueness and uncertainty in historical texts
Cristina Vertan

Many applications in Digital Humanities (DH) rely on annotations of the raw material. These annotations (inferred automatically or done manually) assume that labelled facts are either true or false, so all inferences based on such annotations use Boolean logic. This contradicts the hermeneutic principles used in the humanities, in which most knowledge has a degree of truth that varies depending on the experience and world knowledge of the interpreter. In this paper we show how uncertainty and vagueness, two main features of any historical text, can be encoded in annotations and thus be taken into account by DH applications.
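
The contrast between Boolean annotations and annotations carrying a degree of truth can be illustrated with a short sketch. It assumes a single certainty value in [0, 1] per annotation and combines values with the standard min/max/complement operators of fuzzy logic; the `Annotation` class, the function names and the example passages are hypothetical, and the encoding proposed in the paper may well be richer (for instance, by treating vagueness and uncertainty separately).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A labelled text span with a degree of truth instead of True/False."""
    span: str
    label: str
    certainty: float  # assumed scale: 0.0 (false) to 1.0 (true)

    def __post_init__(self):
        if not 0.0 <= self.certainty <= 1.0:
            raise ValueError("certainty must lie in [0, 1]")


def fuzzy_and(*annotations):
    """Degree to which all annotations hold (minimum operator)."""
    return min(a.certainty for a in annotations)


def fuzzy_or(*annotations):
    """Degree to which at least one annotation holds (maximum operator)."""
    return max(a.certainty for a in annotations)


def fuzzy_not(annotation):
    """Degree to which the annotation does not hold (complement)."""
    return 1.0 - annotation.certainty


def boolean_view(annotation, threshold=0.5):
    """What a Boolean-only application would see after thresholding."""
    return annotation.certainty >= threshold


# Hypothetical annotations of a vague historical passage.
date = Annotation("in the year of the great flood", "date", 0.4)
place = Annotation("near the old harbour", "place", 0.75)

print(fuzzy_and(date, place), fuzzy_or(date, place), fuzzy_not(place))
print(boolean_view(date), boolean_view(place))
```

Thresholding into `True` or `False` discards the difference between a date that is merely plausible and a place that is fairly certain, whereas the graded values let an application propagate that difference through its inferences.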