Eva D’hondt


2017

Generating a Training Corpus for OCR Post-Correction Using Encoder-Decoder Model
Eva D’hondt | Cyril Grouin | Brigitte Grau
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of (relatively) clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short-Term Memory Networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.

2016

Detection of Text Reuse in French Medical Corpora
Eva D’hondt | Cyril Grouin | Aurélie Névéol | Efstathios Stamatatos | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Electronic Health Records (EHRs) are increasingly available in modern health care institutions either through the direct creation of electronic documents in hospitals’ health information systems, or through the digitization of historical paper records. Each EHR creation method yields the need for sophisticated text reuse detection tools in order to prepare the EHR collections for efficient secondary use relying on Natural Language Processing methods. Herein, we address the detection of two types of text reuse in French EHRs: 1) the detection of updated versions of the same document and 2) the detection of document duplicates that still bear surface differences due to OCR or de-identification processing. We present a robust text reuse detection method to automatically identify redundant document pairs in two French EHR corpora that achieves an overall macro F-measure of 0.68 and 0.60, respectively, and correctly identifies all redundant document pairs of interest.

Low-resource OCR error detection and correction in French Clinical Texts
Eva D’hondt | Cyril Grouin | Brigitte Grau
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

2015

Redundancy in French Electronic Health Records: A preliminary study
Eva D’hondt | Xavier Tannier | Aurélie Névéol
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

2014

Genre classification using Balanced Winnow in the DEFT 2014 challenge
Eva D’hondt
TALN-RECITAL 2014 Workshop DEFT 2014 : DÉfi Fouille de Textes (DEFT 2014 Workshop: Text Mining Challenge)

2013

Text Representations for Patent Classification
Eva D’hondt | Suzan Verberne | Cornelis Koster | Lou Boves
Computational Linguistics, Volume 39, Issue 3 - September 2013