Bálint Sass


2020

The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraz Repar | Matjaž Rihtar | Janez Brank
Proceedings of the 12th Language Resources and Evaluation Conference

This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
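
The ten-plus-four column layout described in this abstract can be illustrated with a minimal reader sketch. It relies on the `# global.columns` header line that CoNLL-U Plus files use to declare their columns. The four extra column names (`EX:NE`, `EX:NP`, `EX:IATE`, `EX:EUROVOC`), the sample sentence, and the function name `read_conllu_plus` are hypothetical placeholders, not the names or data used in the MARCELL corpus.

```python
import io

# Hypothetical sample: ten standard CoNLL-U columns plus four invented
# extra columns (the EX:* names are placeholders, not MARCELL's own).
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "EX:NE EX:NP EX:IATE EX:EUROVOC\n"
    "# sent_id = 1\n"
    "1\tEuropean\tEuropean\tADJ\t_\t_\t2\tamod\t_\t_\tB-ORG\t_\t_\t_\n"
    "2\tCommission\tCommission\tPROPN\t_\t_\t0\troot\t_\t_\tI-ORG\t_\t_\t_\n"
    "\n"
)


def read_conllu_plus(stream):
    """Yield sentences as lists of {column name: value} dicts."""
    columns = None
    sentence = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header declares the column names, separated by spaces.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (e.g. sent_id) are skipped here
        elif not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            fields = line.split("\t")
            if len(fields) != len(columns):
                raise ValueError("unexpected number of columns: " + line)
            sentence.append(dict(zip(columns, fields)))
    if sentence:
        yield sentence


sentences = list(read_conllu_plus(io.StringIO(SAMPLE)))
for token in sentences[0]:
    print(token["FORM"], token["UPOS"], token["EX:NE"])
```

Because the column names come from the header rather than from fixed positions, the same reader would work unchanged if a sub-corpus declared a different set of extra columns.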

The xtsv Framework and the Twelve Virtues of Pipelines
Balázs Indig | Bálint Sass | Iván Mittelholcz
Proceedings of the 12th Language Resources and Evaluation Conference

We present xtsv, an abstract framework for building NLP pipelines. It covers several kinds of functionalities which can be implemented at an abstract level. We survey these features and argue that all are desired in a modern pipeline. The framework has a simple yet powerful internal communication format which is essentially tsv (tab separated values) with header plus some additional features. We put emphasis on the capabilities of the presented framework, for example its ability to allow new modules to be easily integrated or replaced, or the variety of its usage options. When a module is put into xtsv, all functionalities of the system are immediately available for that module, and the module can be part of an xtsv pipeline. The design also allows convenient investigation and manual correction of the data flow from one module to another. We demonstrate the power of our framework with a successful application: a concrete NLP pipeline for Hungarian called the e-magyar text processing system (emtsv), which integrates Hungarian NLP tools in xtsv. All the advantages of the pipeline come from the inherent properties of the xtsv framework.
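
The idea of modules communicating through tsv with a header can be sketched as follows. This is not the xtsv API: the function `add_column`, the column names, and the sample data are hypothetical, and the sketch only assumes that each module finds its input column by name in the header and appends one new column.

```python
import io


def add_column(stream, source_col, target_col, func):
    """Read header-bearing tsv lines, append one column, yield tsv lines.

    Hypothetical module shape for illustration; not the xtsv interface.
    """
    header = next(stream).rstrip("\n").split("\t")
    src = header.index(source_col)  # locate the input column by name
    yield "\t".join(header + [target_col]) + "\n"
    for raw in stream:
        line = raw.rstrip("\n")
        if not line:
            yield "\n"  # blank lines (sentence breaks) pass through
            continue
        fields = line.split("\t")
        fields.append(func(fields[src]))
        yield "\t".join(fields) + "\n"


SAMPLE = "form\nThe\ndog\n\nBarks\n"

# Two modules chained: the second consumes the column the first produced.
step1 = add_column(io.StringIO(SAMPLE), "form", "lower", str.lower)
step2 = add_column(step1, "lower", "length", lambda s: str(len(s)))
output = "".join(step2)
print(output)
```

Since every intermediate result is itself a tsv stream with a header, the output of any step could be written to a file, inspected or corrected by hand, and fed into the next step.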

2019

One format to rule them all – The emtsv pipeline for Hungarian
Balázs Indig | Bálint Sass | Eszter Simon | Iván Mittelholcz | Noémi Vadász | Márton Makrai
Proceedings of the 13th Linguistic Annotation Workshop

We present a more efficient version of the e-magyar NLP pipeline for Hungarian called emtsv. It integrates Hungarian NLP tools in a framework whose individual modules can be developed or replaced independently and allows new ones to be added. The design also allows convenient investigation and manual correction of the data flow from one module to another. The improvements we publish include effective communication between the modules and support for using individual modules both in the chain and standalone. Our goals are accomplished using extended tsv (tab separated values) files, a simple, uniform, generic and self-documenting input/output format. Our vision is to maintain the system for a long time and to make it easier for external developers to fit their own modules into the system, thus sharing existing competencies in the field of processing Hungarian, a mid-resourced language. The source code is available under the LGPL 3.0 license at https://github.com/dlt-rilmta/emtsv

The “Jump and Stay” Method to Discover Proper Verb Centered Constructions in Corpus Lattices
Bálint Sass
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

The research presented here is based on the theoretical model of corpus lattices. We implemented this as an effective data structure, and developed an algorithm based on this structure to discover essential verbal expressions from corpus data. The idea behind the algorithm is the “jump and stay” principle, which tells us that our target expressions will be found at such places in the lattice where the value of a suitable function (defined on the vertex set of the corpus lattice) significantly increases (jumps) and then remains the same (stays). We evaluated our method on Hungarian data. Evaluation shows that about 75% of the obtained expressions are correct; actual errors are rare. Thus, this paper is 1. a proof of concept concerning the corpus lattice model, opening the way to investigate this structure further through our implementation; and 2. a proof of concept of the “jump and stay” idea and the algorithm itself, opening the way to apply it further, e.g. for other languages.
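
The “jump and stay” principle can be illustrated with a toy sketch on a single chain of lattice vertices. The abstract does not specify the function, the traversal of the lattice, or any thresholds, so everything here is an assumption made for illustration: the function name `jump_and_stay`, the `jump` and `tol` parameters, and the sample values are all hypothetical.

```python
def jump_and_stay(values, jump=0.3, tol=0.05):
    """Return indices where the value jumps and then stays.

    `values` holds the value of some function at consecutive vertices
    along one chain. An index i is reported when the value rises by at
    least `jump` from i-1 to i and changes by at most `tol` from i to i+1.
    Both thresholds are illustrative assumptions.
    """
    hits = []
    for i in range(1, len(values) - 1):
        jumped = values[i] - values[i - 1] >= jump
        stayed = abs(values[i + 1] - values[i]) <= tol
        if jumped and stayed:
            hits.append(i)
    return hits


# Hypothetical values along one chain of increasingly specific expressions.
chain = [0.10, 0.15, 0.70, 0.70, 0.72]
print(jump_and_stay(chain))
```

In this toy chain only the third vertex (index 2) qualifies: the value rises sharply when it is reached and barely changes at the next vertex.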

2018

E-magyar – A Digital Language Processing System
Tamás Váradi | Eszter Simon | Bálint Sass | Iván Mittelholcz | Attila Novák | Balázs Indig | Richárd Farkas | Veronika Vincze
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2014

The Hungarian Gigaword Corpus
Csaba Oravecz | Tamás Váradi | Bálint Sass
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The paper reports on the development of the Hungarian Gigaword Corpus (HGC), an extended new edition of the Hungarian National Corpus, with upgraded and redesigned linguistic annotation and an increased size of 1.5 billion tokens. Issues concerning the standard steps of corpus collection and preparation are discussed, with special emphasis on linguistic analysis and annotation, because Hungarian has some challenging characteristics with respect to computational processing. As the HGC is designed to serve as a resource for a wide range of linguistic research as well as for the interested public, a number of issues had to be resolved which were raised by trying to find a balance between these two application areas. The following main objectives have been defined for the development of the HGC, focusing on the pivotal concept of increase in:
- size: extending the corpus to a minimum of 1 billion words;
- quality: using new technology for development and analysis;
- coverage and representativity: taking new samples of language use and including further variants (in particular, transcribed spoken language data and user-generated content (social media) from the internet).

2013

What Do We Drink? Automatically Extending Hungarian WordNet With Selectional Preference Relations
Márton Miháltz | Bálint Sass
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

2009

A Unified Method for Extracting Simple and Multiword Verbs with Valence Information and Application for Hungarian
Bálint Sass
Proceedings of the International Conference RANLP-2009

Verb Argument Browser for Danish
Bálint Sass
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)