Gregor Wiedemann


2020

pdf bib
UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection
Gregor Wiedemann | Seid Muhie Yimam | Chris Biemann
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Fine-tuning of pre-trained transformer networks such as BERT yield state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in unsupervised manner beforehand by further pre-training the masked language modeling (MLM) task. Hereby, in-domain data for unsupervised MLM resembling the actual classification target dataset allows for domain adaptation of the model. In this paper, we compare current pre-trained transformer networks with and without MLM fine-tuning on their performance for offensive language detection. Our MLM fine-tuned RoBERTa-based classifier officially ranks 1st in the SemEval 2020 Shared Task 12 for the English language. Further experiments with the ALBERT model even surpass this result.

2019

pdf bib
Adversarial Learning of Privacy-Preserving Text Representations for De-Identification of Medical Records
Max Friedrich | Arne Köhn | Gregor Wiedemann | Chris Biemann
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHR) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a classifier that works well across many types of medical text poses a challenge as privacy laws prohibit the sharing of raw medical records. We introduce a method to create privacy-preserving shareable representations of medical text (i.e. they contain no PHI) that does not require expensive manual pseudonymization. These representations can be shared between organizations to create unified datasets for training de-identification models. Our representation allows training a simple LSTM-CRF de-identification model to an F1 score of 97.4%, which is comparable to a strong baseline that exposes private information in its representation. A robust, widely available de-identification classifier based on our representation could potentially enable studies for which de-identification would otherwise be too costly.

pdf bib
UHH-LT at SemEval-2019 Task 6: Supervised vs. Unsupervised Transfer Learning for Offensive Language Detection
Gregor Wiedemann | Eugen Ruppert | Chris Biemann
Proceedings of the 13th International Workshop on Semantic Evaluation

We present a neural network based approach of transfer learning for offensive language detection. For our system, we compare two types of knowledge transfer: supervised and unsupervised pre-training. Supervised pre-training of our bidirectional GRU-3-CNN architecture is performed as multi-task learning of parallel training of five different tasks. The selected tasks are supervised classification problems from public NLP resources with some overlap to offensive language such as sentiment detection, emoji classification, and aggressive language classification. Unsupervised transfer learning is performed with a thematic clustering of 40M unlabeled tweets via LDA. Based on this dataset, pre-training is performed by predicting the main topic of a tweet. Results indicate that unsupervised transfer from large datasets performs slightly better than supervised training on small ‘near target category’ datasets. In the SemEval Task, our system ranks 14 out of 103 participants.

2018

pdf bib
ILCM - A Virtual Research Infrastructure for Large-Scale Qualitative Data
Andreas Niekler | Arnim Bleier | Christian Kahmann | Lisa Posch | Gregor Wiedemann | Kenan Erdogan | Gerhard Heyer | Markus Strohmaier
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features
Gregor Wiedemann | Gerhard Heyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
A Multilingual Information Extraction Pipeline for Investigative Journalism
Gregor Wiedemann | Seid Muhie Yimam | Chris Biemann
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files up to several Gigabytes containing unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or unofficial data leaks.