Ineke Schuurman


2016

AfriBooms: An Online Treebank for Afrikaans
Liesbeth Augustinus | Peter Dirix | Daniel van Niekerk | Ineke Schuurman | Vincent Vandeghinste | Frank Van Eynde | Gerhard van Huyssteen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Compared to well-resourced languages such as English and Dutch, natural language processing (NLP) tools for Afrikaans are still not abundant. In the context of the AfriBooms project, KU Leuven and the North-West University collaborated to develop a first, small treebank, a dependency parser, and an easy-to-use online linguistic search engine for Afrikaans for use by researchers and students in the humanities and social sciences. The search tool is based on a similar development for Dutch, i.e. GrETEL, a user-friendly search engine which allows users to query a treebank by means of a natural language example instead of a formal search instruction.

Improving Text-to-Pictograph Translation Through Word Sense Disambiguation
Leen Sevens | Gilles Jacobs | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

Natural Language Generation from Pictographs
Leen Sevens | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

Extending a Dutch Text-to-Pictograph Converter to English and Spanish
Leen Sevens | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

2014

Experiences with the ISOcat Data Category Registry
Daan Broeder | Ineke Schuurman | Menzo Windhouwer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The ISOcat Data Category Registry has been a joint project of both ISO TC 37 and the European CLARIN infrastructure. In this paper the experiences of using ISOcat in CLARIN are described and evaluated. This evaluation clarifies the requirements of CLARIN with regard to a semantic registry to support its semantic interoperability needs. A simpler model based on concepts instead of data categories and a simpler workflow based on community recommendations will address these needs better and offer the required flexibility.

Linking Pictographs to Synsets: Sclera2Cornetto
Vincent Vandeghinste | Ineke Schuurman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Social inclusion of people with Intellectual and Developmental Disabilities can be promoted by offering them ways to independently use the internet. People with reading or writing disabilities can use pictographs instead of text. We present a resource in which we have linked a set of 5710 pictographs to lexical-semantic concepts in Cornetto, a Wordnet-like database for Dutch. We show that, by using this resource in a text-to-pictograph translation system, we can greatly improve the coverage compared with a baseline where words are converted into pictographs only if the word equals the filename.

Linguistic resources and cats: how to use ISOcat, RELcat and SCHEMAcat
Menzo Windhouwer | Ineke Schuurman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Within the European CLARIN infrastructure ISOcat is used to enable both humans and computer programs to find specific resources even when they use different terminology or data structures. In order to do so, it should be clear which concepts are used in these resources, both at the level of metadata for the resource as well as its content, and what is meant by them. The concepts can be specified in ISOcat. SCHEMAcat enables us to relate the concepts used by a resource, while RELcat enables us to type these relationships and add relationships beyond resource boundaries. This way these three registries together allow us (and the programs) to find what we are looking for.

2013

Example-Based Treebank Querying with GrETEL–Now Also for Spoken Dutch
Liesbeth Augustinus | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

Beyond SoNaR: towards the facilitation of large corpus building efforts
Martin Reynaert | Ineke Schuurman | Véronique Hoste | Nelleke Oostdijk | Maarten van Gompel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we report on the experiences gained in the recent construction of the SoNaR corpus, a 500 MW reference corpus of contemporary, written Dutch. It shows what can realistically be done within the confines of a project setting where there are limitations to the duration in time as well as to the budget, employing current state-of-the-art tools, standards and best practices. By doing so we aim to pass on insights that may be beneficial for anyone considering undertaking an effort towards building a large, varied yet balanced corpus for use by the wider research community. Various issues are discussed that come into play while compiling a large corpus, including approaches to acquiring texts, the arrangement of IPR, the choice of text formats, and steps to be taken in the preprocessing of data from widely different origins. We describe FoLiA, a new XML format geared at rich linguistic annotations. We also explain the rationale behind the investment in the high-quality semi-automatic enrichment of a relatively small (1 MW) subset with very rich syntactic and semantic annotations. Finally, we present some ideas about future developments and the direction corpus development may take, such as setting up an integrated workflow between web services and the potential role for ISOcat. We list tips for potential corpus builders, tricks they may want to try and further recommendations regarding technical developments future corpus builders may wish to hope for.

2010

Interacting Semantic Layers of Annotation in SoNaR, a Reference Corpus of Contemporary Written Dutch
Ineke Schuurman | Véronique Hoste | Paola Monachesi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper reports on the annotation of a corpus of 1 million words with four semantic annotation layers, including named entities, co-reference relations, semantic roles and spatial and temporal expressions. These semantic annotation layers can benefit from the manually verified part of speech tagging, lemmatization and syntactic analysis (dependency tree) information layers which resulted from an earlier project (Van Noord et al., 2006) and will thus result in a deeply syntactically and semantically annotated corpus. This annotation effort is carried out in the framework of a larger project which aims at the collection of a 500-million word corpus of contemporary Dutch, covering the variants used in the Netherlands and Flanders, the Dutch speaking part of Belgium. All the annotation schemes used were (co-)developed by the authors within the Flemish-Dutch STEVIN-programme as no previous schemes for Dutch were available. They were created taking into account standards (either de facto or official (like ISO)) used elsewhere.

Cultural Aspects of Spatiotemporal Analysis in Multilingual Applications
Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we want to point out some issues arising when a natural language processing task involves several languages (like multilingual, multidocument summarization and the machine translation aspects involved) which are often neglected. These issues are of a more cultural nature, and may even come into play when several documents in a single language are involved. We pay special attention to those aspects dealing with the spatiotemporal characteristics of a text. Correct automatic selection of (parts of) texts such as handling the same eventuality, presupposes spatiotemporal disambiguation at a rather specific level. The same holds for the analysis of the query. For generation and translation purposes, spatiotemporal aspects may be relevant as well. At the moment English (both the British and American variants) and Dutch (the Flemish and Dutch variants) are covered, all taking into account the perspective of a contemporary, Flemish user. In our approach the cultural aspects associated with for example the language of publication and the language used by the user play a crucial role.

2008

From D-Coi to SoNaR: a reference corpus for Dutch
Nelleke Oostdijk | Martin Reynaert | Paola Monachesi | Gertjan Van Noord | Roeland Ordelman | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.

Spatiotemporal Annotation Using MiniSTEx: how to deal with Alternative, Foreign, Vague and/or Obsolete Names?
Ineke Schuurman
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We are currently developing MiniSTEx, a spatiotemporal annotation system to handle temporal and/or geospatial information directly and indirectly expressed in texts. In the end the aim is to locate all eventualities in a text on a time axis and/or a map to ensure an optimal base for automatic temporal and geospatial reasoning. MiniSTEx was originally developed for Dutch, keeping in mind that it should also be useful for other European languages, and for multilingual applications. In order to meet these desiderata we need the MiniSTEx system to be able to draw the conclusions human readers would also draw, e.g. based on their (spatiotemporal) world knowledge, i.e. the common knowledge such readers share. Therefore, notions like “background knowledge”, “intended audience”, and “present-day user” play a major role in our approach. The world knowledge MiniSTEx uses is contained in interconnected tables in a database. At the moment it is used for Dutch and English. Special attention will be paid to the problems we face when looking at older texts or recent historical or encyclopedic texts, i.e. texts with lots of references to times and locations that are not compatible with our current maps and calendars.

Evaluation of a Machine Translation System for Low Resource Languages: METIS-II
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman | Stella Markantonatou | Sokratis Sofianopoulos | Marina Vassiliou | Olga Yannoutsou | Toni Badia | Maite Melero | Gemma Boleda | Michael Carl | Paul Schmidt
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.

2006

Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development
Antal van den Bosch | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe a case study in the reuse and transfer of tools in language resource development, from a corpus of spoken Dutch to a corpus of written Dutch. Once tools for a particular language have been developed, it is logical, but not trivial, to reuse them for other types or registers of the language than the tools were originally designed for. This paper reviews the decisions and adaptations necessary to make this particular transfer from spoken to written language, focusing on a part-of-speech tagger and a lemmatizer. While the lemmatizer can be transferred fairly straightforwardly, the tagger needs to be adapted considerably. We show how it can be adapted without starting from scratch. We describe how the part-of-speech tagset was adapted and how the tagger was retrained to deal with written-text phenomena it had not been trained on earlier.

METIS-II: Machine Translation for Low Resource Languages
Vincent Vandeghinste | Ineke Schuurman | Michael Carl | Stella Markantonatou | Toni Badia
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the resources needed in the described system. Several approaches are presented.

Syntactic Annotation of Large Corpora in STEVIN
Gertjan van Noord | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the Dutch/Flemish STEVIN programme. For part of this corpus, manually corrected syntactic annotations will be provided. The paper presents the background of the syntactic annotation efforts, the Alpino parser which is used as an important tool for constructing the syntactic annotations, as well as a number of other annotation tools and guidelines. For the full STEVIN corpus, automatically derived syntactic annotations will be provided in a later phase of the programme. A number of arguments are provided suggesting that such a resource can be very useful for applications in information extraction, ontology building, lexical acquisition, machine translation and corpus linguistics.

2004

Linguistic Annotation of the Spoken Dutch Corpus: If We Had To Do It All Over Again
Ineke Schuurman | Wim Goedertier | Heleen Hoekstra | Nelleke Oostdijk | Richard Piepenbrock | Machteld Schouppe
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

After the successful completion of the Spoken Dutch Corpus (1998–2003) the time is ripe to take some time to sit back and reflect on our achievements and the procedures underlying them in order to learn from our experiences. In this paper we will in particular pay attention to issues affecting the levels of linguistic annotation, but some more general issues deserve to be treated as well (bug reporting, consistency). We will try to come up with solutions, but sometimes we want to invite further discussion from other researchers.

2003

CGN, an annotated corpus of spoken Dutch
Ineke Schuurman | Machteld Schouppe | Heleen Hoekstra | Ton van der Wouden
Proceedings of 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) at EACL 2003

2002

Syntactic Analysis in the Spoken Dutch Corpus (CGN)
Ton van der Wouden | Heleen Hoekstra | Michael Moortgat | Bram Renmans | Ineke Schuurman
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)