Odinson: A Fast Rule-based Information Extraction Framework

Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, Dane Bell


Abstract
We present Odinson, a rule-based information extraction framework, which couples a simple yet powerful pattern language that can operate over multiple representations of text, with a runtime system that operates in near real time. In the Odinson query language, a single pattern may combine regular expressions over surface tokens with regular expressions over graphs such as syntactic dependencies. To guarantee the rapid matching of these patterns, our framework indexes most of the necessary information for matching patterns, including directed graphs such as syntactic dependencies, into a custom Lucene index. Indexing minimizes the amount of expensive pattern matching that must take place at runtime. As a result, the runtime system matches a syntax-based graph traversal in 2.8 seconds in a corpus of over 134 million sentences, nearly 150,000 times faster than its predecessor.
Anthology ID:
2020.lrec-1.267
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
COLING | LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
2183–2191
Language:
English
URL:
https://www.aclweb.org/anthology/2020.lrec-1.267
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2020.lrec-1.267.pdf