Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing

Doaa Samy, Jerónimo Arenas-García, David Pérez-Fernández


Abstract
Legal-ES is an open-source resource kit for legal Spanish. It consists of a large-scale Spanish corpus of open legal texts and several kinds of language models, including word embeddings and topic models. The corpus contains over 1000 million words and covers a collection of open-access legislative and administrative documents in Spanish from different sources representing international, national and regional entities. The corpus is pre-processed and tokenized using spaCy. For the word embeddings, gensim was applied to the collection of tokens, producing a representation space that is especially suited to reflect the inherent characteristics of the legal domain. We also compute topic models, which provide a convenient tool for understanding the main topics in the corpus and for navigating the documents by exploiting the semantic similarity among them. We will analyse the time structure of a dynamic topic model to infer changes in the legal production of Spanish jurisdiction that have occurred over the analysed time frame.
Anthology ID:
2020.lt4gov-1.6
Volume:
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
LREC | LT4Gov | WS
Publisher:
European Language Resources Association
Pages:
32–36
Language:
English
URL:
https://www.aclweb.org/anthology/2020.lt4gov-1.6
PDF:
http://aclanthology.lst.uni-saarland.de/2020.lt4gov-1.6.pdf