Kris Heylen


Leveraging Sublanguage Features for the Semantic Categorization of Clinical Terms
Leonie Grön | Ann Bertels | Kris Heylen
Proceedings of the 18th BioNLP Workshop and Shared Task

The automatic processing of clinical documents, such as Electronic Health Records (EHRs), could benefit substantially from the enrichment of medical terminologies with terms encountered in clinical practice. To integrate such terms into existing knowledge sources, they must be linked to corresponding concepts. We present a method for the semantic categorization of clinical terms based on their surface form. We find that features based on sublanguage properties can provide valuable cues for the classification of term variants.


The Interplay of Form and Meaning in Complex Medical Terms: Evidence from a Clinical Corpus
Leonie Grön | Ann Bertels | Kris Heylen
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

We conduct a corpus study to investigate the structure of multi-word expressions (MWEs) in the clinical domain. Based on an existing medical taxonomy, we develop an annotation scheme and label a sample of MWEs from a Dutch corpus with semantic and grammatical features. The analysis of the annotated data shows that the formal structure of clinical MWEs correlates with their conceptual properties. The insights gained from this study could inform the design of Natural Language Processing (NLP) systems for clinical writing, but also for other specialized genres.


TermWise: A CAT-tool with Context-Sensitive Terminological Support.
Kris Heylen | Stephen Bond | Dirk De Hertog | Ivan Vulić | Hendrik Kockaert
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Increasingly, large bilingual document collections are being made available online, especially in the legal domain. This type of Big Data is a valuable resource that specialized translators exploit to search for informative examples of how domain-specific expressions should be translated. However, general-purpose search engines are not optimized to retrieve previous translations that are maximally relevant to a translator. In this paper, we report on the TermWise project, a cooperation of terminologists, corpus linguists and computer scientists that aims to leverage big online translation data for terminological support to legal translators at the Belgian Federal Ministry of Justice. The project developed dedicated knowledge extraction algorithms and a server-based tool to provide translators with the most relevant previous translations of domain-specific expressions relative to the current translation assignment. The functionality is implemented as an extra database, a Term&Phrase Memory, that is meant to be integrated with existing Computer Assisted Translation tools. In the paper, we give an overview of the system, demonstrate the user interface, present a user-based evaluation by translators, and discuss how the tool is part of the general evolution towards exploiting Big Data in translation.


Etude sémantique des mots-clés et des marqueurs lexicaux stables dans un corpus technique (Semantic Analysis of Keywords and Stable Lexical Markers in a Technical Corpus) [in French]
Ann Bertels | Dirk De Hertog | Kris Heylen
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets
Kris Heylen | Dirk Speelman | Dirk Geeraerts
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH


The Construction and Evaluation of Word Space Models
Yves Peirsman | Simon De Deyne | Kris Heylen | Dirk Geeraerts
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Semantic similarity is a key issue in many computational tasks. This paper examines the development and evaluation of two common ways of automatically calculating the semantic similarity between two words. On the one hand, such methods may depend on a manually constructed thesaurus like (Euro)WordNet. Their performance is often evaluated on the basis of a very restricted set of human similarity ratings. On the other hand, corpus-based methods rely on the distribution of two words in a corpus to determine their similarity. Their performance is generally quantified through a comparison with the judgements of the first type of approach. This paper introduces a new Gold Standard of more than 5,000 human intra-category similarity judgements. We show that corpus-based methods often outperform (Euro)WordNet on this data set, and that the use of the latter as a Gold Standard for the former is thus often far from ideal.

Modelling Word Similarity: an Evaluation of Automatic Synonymy Extraction Algorithms.
Kris Heylen | Yves Peirsman | Dirk Geeraerts | Dirk Speelman
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Vector-based models of lexical semantics retrieve semantically related words automatically from large corpora by exploiting the property that words with a similar meaning tend to occur in similar contexts. Despite their increasing popularity, it is unclear which kind of semantic similarity they actually capture and for which kind of words. In this paper, we use three vector-based models to retrieve semantically related words for a set of Dutch nouns and we analyse whether three linguistic properties of the nouns influence the results. In particular, we compare results from a dependency-based model with those from a first-order and a second-order bag-of-words model, and we examine the effect of the nouns' frequency, semantic specificity and semantic class. We find that all three models find more synonyms for high-frequency nouns and those belonging to abstract semantic classes. Semantic specificity does not have a clear influence.
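
As a rough illustration of the distributional idea mentioned in the abstract above (words with similar meanings tend to occur in similar contexts), the following is a minimal sketch of a first-order bag-of-words model with cosine similarity. It is not the authors' implementation: the window size, the raw-count weighting, the toy English corpus, and the function names `cooccurrence_vectors` and `cosine` are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Build first-order bag-of-words vectors (hypothetical helper).

    Each target word is represented by raw counts of the words that
    appear within `window` positions to its left or right.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration only.
toy_corpus = [
    ["the", "doctor", "treats", "the", "patient"],
    ["the", "physician", "treats", "the", "patient"],
    ["the", "car", "needs", "new", "tyres"],
]

vectors = cooccurrence_vectors(toy_corpus)
sim_related = cosine(vectors["doctor"], vectors["physician"])
sim_unrelated = cosine(vectors["doctor"], vectors["car"])
print(sim_related, sim_unrelated)
```

In this toy setting, "doctor" and "physician" share identical contexts and therefore score higher than "doctor" and "car". A dependency-based model would replace the window contexts with syntactic relations, and a second-order model would compare the vectors of the context words rather than the context words themselves; neither is shown here.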