Localising the Clinical Terminology SNOMED CT by Semi-automated Creation of a German Interface Vocabulary

Stefan Schulz, Larissa Hammer, David Hashemian-Nik, Markus Kreuzthaler


Abstract
Medical language exhibits great variations regarding users, institutions and language registers. With large parts of clinical documents in free text, NLP is playing a more and more important role in unlocking re-usable and interoperable meaning from medical records. This study describes the architectural principles and the evolution of a German interface vocabulary, combining machine translation with human annotation and rule-based term generation, yielding a resource with 7.7 million raw entries, each of which linked to the reference terminology SNOMED CT, an international standard with about 350 thousand concepts. The purpose is to offer a high coverage of medical jargon in order to optimise terminology grounding of clinical texts by text mining systems. The core resource is a manually curated table of English-to-German word and chunk translations, supported by a set of language generation rules. The work describes a workflow consisting the enrichment and modification of this table with human and machine efforts, manually enriched by grammarspecific tags. Top-down and bottom-up methods for terminology population used in parallel. The final interface terms are produced by a term generator, which creates one-to-many German variants per SNOMED CT English description. Filtering against a large collection of domain terminologies and corpora drastically reduces the size of the vocabulary in favour of more realistic terms or terms that can reasonably be expected to match clinical text passages within a text-mining pipeline. An evaluation was performed by a comparison between the current version of the German interface vocabulary and the English description table of the SNOMED CT International release. An exact term matching was performed with a small parallel corpus constituted by text snippets from different clinical documents. With overall low retrieval parameters (with F-values around 30%), the performance of the German language scenario reaches 80 – 90% of the English one. Interestingly, annotations are slightly better with machine-translated (German – English) texts, using the International SNOMED CT resource only.
Anthology ID:
2020.multilingualbio-1.3
Volume:
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
LREC | MultilingualBIO | WS
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
15–20
Language:
English
URL:
https://www.aclweb.org/anthology/2020.multilingualbio-1.3
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2020.multilingualbio-1.3.pdf