Oliver Hellwig


2020

The Treebank of Vedic Sanskrit
Oliver Hellwig | Salvatore Scarlata | Elia Ackermann | Paul Widmer
Proceedings of the 12th Language Resources and Evaluation Conference

This paper introduces the first treebank of Vedic Sanskrit, a morphologically rich ancient Indian language that is of central importance for linguistic and historical research. The selection of the more than 3,700 sentences contained in this treebank reflects the development of metrical and prose texts over a period of 600 years. We discuss how these sentences are annotated in the Universal Dependencies scheme and which syntactic constructions required special attention. In addition, we describe a syntactic labeler based on neural networks that supports the initial annotation of the treebank, and whose evaluation can be helpful for setting up a full syntactic parser of Vedic Sanskrit.

Dating and Stratifying a Historical Corpus with a Bayesian Mixture Model
Oliver Hellwig
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

This paper introduces and evaluates a Bayesian mixture model that is designed for dating texts based on the distributions of linguistic features. The model is applied to the corpus of Vedic Sanskrit, whose historical structure is still unclear in many details. The evaluation concentrates on the interaction between time, genre and linguistic features, detecting those features whose distributions are clearly coupled with historical time. The evaluation also highlights the problems that arise when quantitative results need to be reconciled with philological insights.

Evaluating Neural Morphological Taggers for Sanskrit
Ashim Gupta | Amrith Krishna | Pawan Goyal | Oliver Hellwig
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Neural sequence labelling approaches have achieved state-of-the-art results in morphological tagging. We evaluate the efficacy of four standard sequence labelling models on Sanskrit, a morphologically rich, fusional Indian language. As its label space can theoretically contain more than 40,000 labels, systems that explicitly model the internal structure of a label are better suited to the task, because of their ability to generalise to labels not seen during training. We find that although some neural models perform better than others, one common cause of error for all of these models is misprediction due to syncretism.
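
The generalisation argument in this abstract can be illustrated with a small sketch. The feature values below (case, number, gender) are a made-up toy inventory, not the tagset or any model from the paper. The sketch only shows why a tagger that predicts each feature slot separately can emit a composite label that never occurred in training, while a tagger over whole labels cannot.

```python
from itertools import product

# Hypothetical toy training labels: (case, number, gender) tuples.
train_labels = [
    ("Nom", "Sing", "Masc"),
    ("Acc", "Sing", "Masc"),
    ("Nom", "Plur", "Fem"),
    ("Acc", "Plur", "Neut"),
]

# Monolithic view: each distinct tuple is one atomic class.
monolithic_space = set(train_labels)

# Factored view: one value inventory per feature slot,
# so any combination of observed values can be produced.
inventories = [sorted({label[i] for label in train_labels}) for i in range(3)]
factored_space = set(product(*inventories))

unseen = ("Acc", "Sing", "Fem")  # never observed as a whole label

print(len(monolithic_space), len(factored_space))
print(unseen in monolithic_space, unseen in factored_space)
```

With four training labels, the monolithic space stays at four classes, whereas the factored space covers all twelve combinations of the observed feature values, including the unseen one.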

2018

Multi-layer Annotation of the Rigveda
Oliver Hellwig | Heinrich Hettrich | Ashutosh Modi | Manfred Pinkal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

AET: Web-based Adjective Exploration Tool for German
Tatiana Bladier | Esther Seyffarth | Oliver Hellwig | Wiebke Petersen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks
Oliver Hellwig | Sebastian Nehrdich
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The paper introduces end-to-end neural network models that tokenize Sanskrit by jointly splitting compounds and resolving phonetic merges (Sandhi). Tokenization of Sanskrit depends on local phonetic and distant semantic features that are incorporated using convolutional and recurrent elements. Contrary to most previous systems, our models do not require feature engineering or external linguistic resources, but operate solely on parallel versions of raw and segmented text. The models discussed in this paper clearly improve over previous approaches to Sanskrit word segmentation. As they are language-agnostic, we demonstrate that they also outperform the state of the art for the related task of German compound splitting.

2017

Coarse Semantic Classification of Rare Nouns Using Cross-Lingual Data and Recurrent Neural Networks
Oliver Hellwig
IWCS 2017 - 12th International Conference on Computational Semantics - Long papers

Unsupervised Induction of Compositional Types for English Adjective-Noun Pairs
Wiebke Petersen | Oliver Hellwig
IWCS 2017 - 12th International Conference on Computational Semantics - Short papers

2016

Detecting Sentence Boundaries in Sanskrit Texts
Oliver Hellwig
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The paper applies a deep recurrent neural network to the task of sentence boundary detection in Sanskrit, an important, yet underresourced ancient Indian language. The deep learning approach improves the F scores set by a metrical baseline and by a Conditional Random Field classifier by more than 10%.

Exploring the value space of attributes: Unsupervised bidirectional clustering of adjectives in German
Wiebke Petersen | Oliver Hellwig
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The paper presents an iterative bidirectional clustering of adjectives and nouns based on a co-occurrence matrix. The clustering method combines a Vector Space Model (VSM) and a Latent Dirichlet Allocation (LDA) model, merging their results in each iterative step. The aim is to derive a clustering of German adjectives that reflects latent semantic classes of adjectives, and that can be used to induce frame-based representations of nouns in a later step. We are able to show that the method induces meaningful groups of adjectives, and that it outperforms a baseline k-means algorithm.

Improving the Morphological Analysis of Classical Sanskrit
Oliver Hellwig
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

The paper describes a new tagset for the morphological disambiguation of Sanskrit, and compares the accuracy of two machine learning methods (Conditional Random Fields, deep recurrent neural networks) for this task, with a special focus on how to model the lexicographic information. It reports a significant improvement over previously published results.

2010

Using NLP Methods for the Analysis of Rituals
Nils Reiter | Oliver Hellwig | Anand Mishra | Anette Frank | Jens Burkhardt
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper gives an overview of an interdisciplinary research project that is concerned with the application of computational linguistics methods to the analysis of the structure and variance of rituals, as investigated in ritual science. We present motivation and prospects of a computational approach to ritual research, and explain the choice of specific analysis techniques. We discuss design decisions for data collection and processing and present the general NLP architecture. For the analysis of ritual descriptions, we apply the frame semantics paradigm with newly invented frames where appropriate. Using scientific ritual research literature, we experimented with several techniques of automatic extraction of domain terms for the domain of rituals. As ritual research is a highly interdisciplinary endeavour, a vocabulary common to all sub-areas of ritual research is hard to specify and highly controversial. The domain terms extracted from ritual research literature are used as a basis for a common vocabulary and thus help the creation of ritual-specific frames. We applied the tf*idf, χ² and PageRank algorithms to our ritual research literature corpus and two non-domain corpora: the British National Corpus and the British Academic Written English corpus. All corpora have been part-of-speech tagged and lemmatized. The domain terms have been evaluated by two ritual experts independently. Interestingly, the results of the algorithms were different for different parts of speech. This finding is in line with the fact that the inter-annotator agreement also differs between parts of speech.
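
As a rough illustration of the tf*idf step mentioned in this abstract, the following minimal sketch ranks lemmas of a domain corpus against background documents. The function name `tfidf_scores`, the toy documents, and the particular weighting (raw term frequency times the log inverse document frequency) are assumptions chosen for illustration; the paper's actual setup and corpora are not reproduced here.

```python
import math
from collections import Counter

def tfidf_scores(domain_docs, background_docs):
    """Score lemmas of the domain documents by tf*idf over all documents."""
    all_docs = domain_docs + background_docs
    n_docs = len(all_docs)
    # Document frequency: number of documents that contain each lemma.
    df = Counter(lemma for doc in all_docs for lemma in set(doc))
    # Term frequency, pooled over the domain documents only.
    tf = Counter(lemma for doc in domain_docs for lemma in doc)
    return {lemma: count * math.log(n_docs / df[lemma])
            for lemma, count in tf.items()}

# Hypothetical, already lemmatized toy documents.
domain = [["ritual", "offering", "priest", "the", "ritual"],
          ["ritual", "fire", "the"]]
background = [["the", "market", "price"],
              ["the", "fire", "alarm"]]

scores = tfidf_scores(domain, background)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])
```

In this toy setting, a lemma that occurs in every document (such as "the") receives a score of zero, while lemmas concentrated in the domain documents rise to the top of the ranking.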