A Multilingual Information Extraction Pipeline for Investigative Journalism

Gregor Wiedemann, Seid Muhie Yimam, Chris Biemann


Abstract
We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. In the use case, journalists receive a large collection of files, up to several gigabytes in size, with unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or from unofficial data leaks.
Anthology ID: D18-2014
Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month: November
Year: 2018
Address: Brussels, Belgium
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 78–83
URL: https://www.aclweb.org/anthology/D18-2014
DOI: 10.18653/v1/D18-2014
PDF: http://aclanthology.lst.uni-saarland.de/D18-2014.pdf