LexiDB: Patterns & Methods for Corpus Linguistic Database Management

Matthew Coole, Paul Rayson, John Mariani


Abstract
LexiDB is a tool for storing, managing and querying corpus data. In contrast to other database management systems (DBMSs), it is designed specifically for text corpora. It improves on other corpus management systems (CMSs) because data can be added and deleted from corpora on the fly with the ability to add live data to existing corpora. LexiDB sits between these two categories of DBMSs and CMSs, more specialised to language data than a general purpose DBMS but more flexible than a traditional static corpus management system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to be able to scale out for ever growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying of multi-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench CWB - CQP) and indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent with large corpora and when handling queries with large result sets
Anthology ID:
2020.lrec-1.383
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
COLING | LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
3128–3135
Language:
English
URL:
https://www.aclweb.org/anthology/2020.lrec-1.383
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2020.lrec-1.383.pdf