Hercules Dalianis


2020

A Semi-supervised Approach for De-identification of Swedish Clinical Text
Hanna Berg | Hercules Dalianis
Proceedings of the 12th Language Resources and Evaluation Conference

An abundance of electronic health records (EHRs) is produced every day within healthcare. The records contain valuable information for research and future improvement of healthcare. Multiple efforts have been made to protect the integrity of patients while making electronic health records usable for research by removing personally identifiable information from patient records. Supervised machine learning approaches for de-identification of EHRs need annotated data for training, and such annotations are costly in time and human resources. Annotating clinical text is even more costly, as the process must be carried out in a protected environment with a limited number of annotators who must have signed confidentiality agreements. This paper therefore proposes a semi-supervised method for automatically creating high-quality training data. The study shows that the method can be used to improve recall from 84.75% to 89.20%, while precision decreases to a lesser extent, from 95.73% to 94.20%. The model’s recall is arguably more important for de-identification than precision.

The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text
Hanna Berg | Aron Henriksson | Hercules Dalianis
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

The impact of de-identification on data quality and, in particular, utility for developing models for downstream tasks has been more thoroughly studied for structured data than for unstructured text. While previous studies indicate that text de-identification has a limited impact on models for downstream tasks, it remains unclear what the impact is with various levels and forms of de-identification, in particular concerning the trade-off between precision and recall. In this paper, the impact of de-identification is studied on downstream named entity recognition in Swedish clinical text. The results indicate that de-identification models with moderate to high precision lead to similar downstream performance, while low precision has a substantial negative impact. Furthermore, different strategies for concealing sensitive information affect performance to different degrees, ranging from pseudonymisation having a low impact to the removal of entire sentences with sensitive information having a high impact. This study indicates that it is possible to increase the recall of models for identifying sensitive information without negatively affecting the use of de-identified text data for training models for clinical named entity recognition; however, there is ultimately a trade-off between the level of de-identification and the subsequent utility of the data.

Detecting Adverse Drug Events from Swedish Electronic Health Records using Text Mining
Maria Bampa | Hercules Dalianis
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)

Electronic Health Records (EHRs) are a valuable source of patient information which can be leveraged to detect Adverse Drug Events (ADEs) and aid post-market drug surveillance. The overall aim of this study is to scrutinize text written by clinicians in the EHRs and build a model for ADE detection that produces medically relevant predictions. Natural Language Processing techniques are used to create important predictors and incorporate them into the learning process. The study focuses on the 5 most frequent ADE cases found in a Swedish electronic patient record corpus. The results indicate that considering textual features, rather than structured ones, can improve the classification performance by 15% in some ADE cases. Additionally, variable patient history lengths are incorporated in the models, demonstrating the importance of this choice over using an arbitrary history length. The experimental findings suggest that the clinical text in EHRs contains information beyond what is found in a structured format.

2019

Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text
Hanna Berg | Taridzo Chomutare | Hercules Dalianis
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

This article presents experiments with pseudonymised Swedish clinical text used as training data to de-identify real clinical text, with the future aim of transferring non-sensitive training data to other hospitals. Conditional Random Fields (CRF) and Long Short-Term Memory (LSTM) machine learning algorithms were used to train de-identification models. The two models were trained on pseudonymised data and evaluated on real data. For benchmarking, models were also trained on real data and evaluated on real data, as well as trained on pseudonymised data and evaluated on pseudonymised data. CRF showed better performance for some PHI types such as Date Part, First Name and Last Name, consistent with some reports in the literature. In contrast, poor performance on Location and Health Care Unit information was noted, partially due to the constrained vocabulary in the pseudonymised training data. It is concluded that it is possible to train transferable models based on pseudonymised Swedish clinical data, but even small narrative and distributional variation could negatively impact performance.

Augmenting a De-identification System for Swedish Clinical Text Using Open Resources and Deep Learning
Hanna Berg | Hercules Dalianis
Proceedings of the Workshop on NLP and Pseudonymisation

Pseudonymisation of Swedish Electronic Patient Records Using a Rule-Based Approach
Hercules Dalianis
Proceedings of the Workshop on NLP and Pseudonymisation

2017

Efficient Encoding of Pathology Reports Using Natural Language Processing
Rebecka Weegar | Jan F Nygård | Hercules Dalianis
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this article we present a system that extracts information from pathology reports. The reports are written in Norwegian and contain free text describing prostate biopsies. Currently, these reports are manually coded for research and statistical purposes by trained experts at the Cancer Registry of Norway, where the coders extract values for a set of predefined fields that are specific to prostate cancer. The presented system is rule based and achieves an average F-score of 0.91 for the fields Gleason grade, Gleason score, the number of biopsies that contain tumor tissue, and the orientation of the biopsies. The system also identifies reports that contain ambiguity or other content that should be reviewed by an expert. The system shows potential to encode the reports considerably faster, with fewer resources, and with quality similarly high to that of the manual encoding.

2016

Applying deep learning on electronic health records in Swedish to predict healthcare-associated infections
Olof Jacobson | Hercules Dalianis
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

2015

Creating a rule based system for text mining of Norwegian breast cancer pathology reports
Rebecka Weegar | Hercules Dalianis
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

Adverse Drug Event classification of health records using dictionary based pre-processing and machine learning
Stefanie Friedrich | Hercules Dalianis
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

2014

Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi)
Sumithra Velupillai | Martin Duneld | Maria Kvist | Hercules Dalianis | Maria Skeppstedt | Aron Henriksson
Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi)

2013

Negation Scope Delimitation in Clinical Text Using Three Approaches: NegEx, PyConTextNLP and SynNeg
Hideyuki Tanushi | Hercules Dalianis | Martin Duneld | Maria Kvist | Maria Skeppstedt | Sumithra Velupillai
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

Rule-based Entity Recognition and Coverage of SNOMED CT in Swedish Clinical Text
Maria Skeppstedt | Maria Kvist | Hercules Dalianis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Named entity recognition of the clinical entities disorders, findings and body structures is needed for information extraction from unstructured text in health records. Clinical notes from a Swedish emergency unit were annotated and used for evaluating a rule- and terminology-based entity recognition system. This system used different preprocessing techniques for matching terms to SNOMED CT, and, one by one, four other terminologies were added. For the class body structure, the results improved with preprocessing, whereas only small improvements were shown for the classes disorder and finding. The best average results were achieved when all terminologies were used together. The entity body structure was recognised with a precision of 0.74 and a recall of 0.80, whereas lower results were achieved for disorder (precision: 0.75, recall: 0.55) and for finding (precision: 0.57, recall: 0.30). The proportion of entities containing abbreviations was higher for false negatives than for correctly recognised entities, and no entities containing more than two tokens were recognised by the system. Low recall for disorders and findings shows both that additional methods are needed for entity recognition and that there are many expressions in clinical text that are not included in SNOMED CT.

2010

Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents
Hercules Dalianis | Martin Hassel | Gunnar Nilsson
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents

Characteristics and Analysis of Finnish and Swedish Clinical Intensive Care Nursing Narratives
Helen Allvin | Elin Carlsson | Hercules Dalianis | Riitta Danielsson-Ojala | Vidas Daudaravicius | Martin Hassel | Dimitrios Kokkinakis | Heljä Lundgren-Laine | Gunnar Nilsson | Øystein Nytrø | Sanna Salanterä | Maria Skeppstedt | Hanna Suominen | Sumithra Velupillai
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents

Uncertainty Detection as Approximate Max-Margin Sequence Labelling
Oscar Täckström | Sumithra Velupillai | Martin Hassel | Gunnar Eriksson | Hercules Dalianis | Jussi Karlgren
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

Creating and evaluating a consensus for negated and speculative words in a Swedish clinical corpus
Hercules Dalianis | Maria Skeppstedt
Proceedings of the Workshop on Negation and Speculation in Natural Language Processing

Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction
Hercules Dalianis | Hao-chun Xing | Xin Zhang
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper first describes an experiment to construct an English-Chinese parallel corpus, then the application of the Uplug word alignment tool to the corpus, and finally the production and evaluation of an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually translated from Chinese to English. The parallel corpus contains 104 563 Chinese characters, equivalent to 59 918 Chinese words, and the corresponding English corpus contains 75 766 English words. However, Chinese writing does not use any delimiters to mark word boundaries, so we had to carry out word segmentation as a preprocessing step on the Chinese corpus. Moreover, since the parallel corpus was downloaded from the Internet, it is noisy with regard to the alignment between corresponding translated sentences. We therefore spent 60 hours of manual work aligning the sentences in the English and Chinese parallel corpus before performing automatic word alignment using Uplug. The word alignment with Uplug was carried out from English to Chinese. Nine respondents evaluated the resulting English-Chinese word list, restricted to entries with a frequency of three or above, and we obtained an accuracy of 73.1 percent.

Influence of Module Order on Rule-Based De-identification of Personal Names in Electronic Patient Records Written in Swedish
Elin Carlsson | Hercules Dalianis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Electronic patient records (EPRs) are a valuable resource for research, but for confidentiality reasons they cannot be used freely. In order to make EPRs available to a wider group of researchers, sensitive information such as personal names has to be removed. De-identification is a process that makes this possible. Both rule-based methods and statistical and machine learning based methods exist to perform de-identification, but the latter require annotated training material, which exists only very sparsely for patient names. It is therefore necessary to use rule-based methods for de-identification of EPRs. Not much is known, however, about the order in which the various rules should be applied and how the different rules influence precision and recall. This paper aims to answer this research question by implementing and evaluating four common rules for de-identification of personal names in EPRs written in Swedish: (1) dictionary name matching, (2) title matching, (3) common words filtering and (4) learning from previous modules. The results show that to obtain the highest recall and precision, the rules should be applied in the following order: title matching, common words filtering and dictionary name matching.

How Certain are Clinical Assessments? Annotating Swedish Clinical Text for (Un)certainties, Speculations and Negations
Hercules Dalianis | Sumithra Velupillai
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Clinical texts contain a large amount of information. Some of this information is embedded in contexts where, for example, a patient's status is reasoned about, which may lead to a considerable number of statements that indicate uncertainty and speculation. We believe that distinguishing such instances from factual statements will be very beneficial for automatic information extraction. We have annotated a subset of the Stockholm Electronic Patient Record Corpus for certain and uncertain expressions as well as speculative and negation keywords, with the purpose of creating a resource for the development of automatic detection of speculative language in Swedish clinical text. We have analyzed the results from the initial annotation trial by means of pairwise Inter-Annotator Agreement (IAA) measured with F-score. Our main findings are that IAA results for certain expressions and negations are very high, but for uncertain expressions and speculative keywords the results are less encouraging. These instances need to be defined in more detail. With this annotation trial, we have created an important resource that can be used to further analyze the properties of speculative language in Swedish clinical text. Our intention is to release this subset to other research groups in the future after removing identifiable information.

2009

Identification of Parallel Text Pairs Using Fingerprints
Martin Hassel | Hercules Dalianis
Proceedings of the International Conference RANLP-2009

Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages
Hercules Dalianis | Martin Rimka | Viggo Kann
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike
Bart Jongejan | Hercules Dalianis
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Experiments to Investigate the Connection between Case Distribution and Topical Relevance of Search Terms in an Information Retrieval Setting
Jussi Karlgren | Hercules Dalianis | Bart Jongejan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We have performed a set of experiments to investigate the utility of morphological analysis for improving retrieval of documents written in languages with relatively large morphological variation in a practical commercial setting, using the SiteSeeker search system developed and marketed by Euroling Ab. The objective of the experiments was to evaluate different lemmatisers and stemmers to determine which would be the most practical for the task at hand: highly interactive, relatively high precision web searches in commercial customer-oriented document collections. This paper gives an overview of some of the results for Finnish and German, and describes specifically one experiment designed to investigate the case distribution of nouns in a highly inflectional language (Finnish) and the topicality of the nouns in target texts. We find that topical nouns taken from queries are distributed differently over relevant and non-relevant documents depending on their grammatical case.

Automatic Construction of Domain-specific Dictionaries on Sparse Parallel Corpora in the Nordic languages
Sumithra Velupillai | Hercules Dalianis
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization

2006

Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST’s Lemmatiser
Hercules Dalianis | Bart Jongejan
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The Euroling stemmer was developed for a commercial web site and intranet search engine called SiteSeeker. SiteSeeker is mainly used in the Swedish domain but to some extent also in the English domain. CST's lemmatiser comes from the Center for Language Technology, University of Copenhagen, and was originally developed as a research prototype to create lemmatisation rules from training data. In this paper we compare the performance of the stemmer, which uses handcrafted rules for Swedish, Danish and Norwegian as well as one stemmer for Greek, with CST's lemmatiser, which uses training data to extract lemmatisation rules for Swedish, Danish, Norwegian and Greek. The performance of the two approaches is about the same, with around 10 percent errors. The handcrafted rule-based stemmer techniques are easy to get started with if the programmer has the proper linguistic knowledge. The machine-trained sets of lemmatisation rules are very easy to produce without linguistic knowledge, given that one has correct training data.

Improving search engine retrieval using a compound splitter for Swedish
Hercules Dalianis
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

2001

Improving Precision in Information Retrieval for Swedish using Stemming
Johan Carlberger | Hercules Dalianis | Martin Duneld | Ola Knutsson
Proceedings of the 13th Nordic Conference of Computational Linguistics (NODALIDA 2001)

1996

On Lexical Aggregation and Ordering
Hercules Dalianis | Eduard Hovy
Eighth International Natural Language Generation Workshop (Posters and Demonstrations)

1995

Aggregation in the NL-generator of the Visual and Natural language Specification Tool
Hercules Dalianis
Seventh Conference of the European Chapter of the Association for Computational Linguistics