Christina Lohr


ProGene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus
Erik Faessler | Luise Modersohn | Christina Lohr | Udo Hahn
Proceedings of the 12th Language Resources and Evaluation Conference

Genes and proteins constitute the fundamental entities of molecular genetics. We here introduce ProGene (formerly called FSU-PRGE), a corpus that reflects our efforts to cope with this important class of named entities within the framework of a long-lasting large-scale annotation campaign at the Jena University Language & Information Engineering (JULIE) Lab. We assembled the entire corpus from 11 subcorpora covering various biological domains to achieve an overall subdomain-independent corpus. It consists of 3,308 MEDLINE abstracts with over 36k sentences and more than 960k tokens annotated with nearly 60k named entity mentions. Two annotators carefully assigned entity mentions to classes of genes/proteins as well as families/groups, complexes, variants, and enumerations of those; genes and proteins are represented by a single class. The main purpose of the corpus is to provide a large body of consistent and reliable annotations for supervised training and evaluation of machine learning algorithms in this relevant domain. Furthermore, we provide an evaluation of two state-of-the-art baseline systems, BioBERT and flair, on the ProGene corpus. We make the evaluation datasets and the trained models available to encourage comparable evaluations of new methods in the future.

GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines
Florian Borchert | Christina Lohr | Luise Modersohn | Thomas Langer | Markus Follmann | Jan Philipp Sachs | Udo Hahn | Matthieu-P. Schapranow
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

The lack of publicly accessible text corpora is a major obstacle for progress in natural language processing. For medical applications, unfortunately, all language communities other than English are low-resourced. In this work, we present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely distributable German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions. Moreover, GGPONC is the first corpus for the German language covering diverse conditions in a large medical subfield and provides a variety of metadata, such as literature references and evidence levels. By applying and evaluating existing medical information extraction pipelines for German text, we are able to compare the use of medical language with that in other corpora, both medical and non-medical.


Continuous Quality Control and Advanced Text Segment Annotation with WAT-SL 2.0
Christina Lohr | Johannes Kiesel | Stephanie Luther | Johannes Hellrich | Tobias Kolditz | Benno Stein | Udo Hahn
Proceedings of the 13th Linguistic Annotation Workshop

Today’s widely used annotation tools were designed for annotating typically short textual mentions of entities or relations, making their interface cumbersome to use for long(er) stretches of text, e.g., sentences running over several lines in a document. They also lack systematic support for hierarchically structured labels, i.e., one label being conceptually more general than another (e.g., anamnesis in relation to family anamnesis). Moreover, as a more fundamental shortcoming of today’s tools, they provide no continuous quality control mechanisms for the annotation process, an essential feature to intrinsically support iterative cycles in the development of annotation guidelines. We alleviated these problems by developing WAT-SL 2.0, an open-source web-based annotation tool for long-segment labeling, hierarchically structured label sets and built-ins for quality control.


Sharing Copies of Synthetic Clinical Corpora without Physical Distribution — A Case Study to Get Around IPRs and Privacy Constraints Featuring the German JSYNCC Corpus
Christina Lohr | Sven Buechel | Udo Hahn
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)