Attila Novák


2020

CBOW-tag: a Modified CBOW Algorithm for Generating Embedding Models from Annotated Corpora
Attila Novák | László Laki | Borbála Novák
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we present a modified version of the CBOW algorithm implemented in the fastText framework. Our modified algorithm, CBOW-tag, builds a vector space model that represents the original word forms and their annotations at the same time. We illustrate the results by presenting a model built from a corpus that includes morphological and syntactic annotations. The simultaneous presence of unannotated elements and different annotations in the model makes it possible to constrain nearest neighbour queries to specific types of elements. The model can thus efficiently answer questions such as What do we eat?, What can we do with a skeleton?, What else do we do with what we eat?, etc. Error analysis reveals that the model can highlight errors introduced into the annotation by the tagger and parser we used to generate the annotations, as well as lexical peculiarities in the corpus itself, especially if we do not limit the vocabulary of the model to frequent items.
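
The constrained nearest neighbour query mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the `[obj]` and `[subj]` prefixes used to mark annotated items, and the function names are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def constrained_neighbours(query, vectors, keep, k=3):
    """Return the k items most similar to `query` among those accepted by `keep`."""
    q = vectors[query]
    scored = [
        (cosine(q, vec), item)
        for item, vec in vectors.items()
        if item != query and keep(item)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


def is_object(item):
    """Hypothetical convention: annotated object items carry an "[obj]" prefix."""
    return item.startswith("[obj]")


# Toy vector space mixing plain word forms and annotated items (made-up numbers).
vectors = {
    "eat": [1.0, 0.0, 0.0],
    "drink": [0.9, 0.1, 0.0],
    "[obj]bread": [0.8, 0.2, 0.0],
    "[obj]soup": [0.7, 0.3, 0.1],
    "[subj]dog": [0.1, 0.9, 0.0],
    "bread": [0.5, 0.5, 0.0],
}

# "What do we eat?": neighbours of "eat", restricted to object-annotated items.
objects_of_eat = constrained_neighbours("eat", vectors, is_object, k=2)
print(objects_of_eat)
```

Because plain and annotated items share one space in this sketch, the filter alone decides which type of element a query may return; an unconstrained query for "eat" would rank "drink" first.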

Much Ado About Nothing – Identification of Zero Copulas in Hungarian Using an NMT Model
Andrea Dömötör | Zijian Győző Yang | Attila Novák
Proceedings of the 12th Language Resources and Evaluation Conference

The research presented in this paper concerns zero copulas in Hungarian, i.e. the phenomenon that nominal predicates lack an explicit verbal copula in the default present tense 3rd person indicative case. We created a tool based on the state-of-the-art transformer architecture implemented in the Marian NMT framework that can identify and mark the location of zero copulas, i.e. the position where an overt copula would appear in the non-default cases. Our primary aim was to support quantitative corpus-based linguistic research by creating a tool that can be used to compile a corpus of significant size containing examples of nominal predicates including the location of the zero copulas. We created the training corpus for our system by transforming sentences containing overt copulas into ones containing zero copula labels. However, we first needed to disambiguate occurrences of the massively ambiguous verb van ‘exist/be/have’. We performed this using a rule-based classifier relying on English translations in the English-Hungarian parallel subcorpus of the OpenSubtitles corpus. We created several NMT-based models using different sampling methods and optionally using our baseline model to synthesize additional training data. Our best model obtains almost 90% precision and 80% recall on an in-domain test set.
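
The training-data step described in the abstract, turning sentences with overt copulas into ones with zero copula labels, might look roughly like the sketch below. This is one plausible reading, not the authors' pipeline: the label string `<ZERO_COP>`, the function name, and the source/target pairing are assumptions, and the copula positions are assumed to come from a separate disambiguation step.

```python
ZERO_COP = "<ZERO_COP>"  # hypothetical label name, not taken from the paper


def make_training_pair(tokens, copula_positions):
    """Turn a sentence with overt copulas into a (source, target) pair.

    tokens: list of word tokens.
    copula_positions: set of indices of tokens already judged to be copulas
        (the disambiguation step is assumed to have happened elsewhere).
    The source drops the copulas; the target marks where they stood.
    """
    source = [t for i, t in enumerate(tokens) if i not in copula_positions]
    target = [ZERO_COP if i in copula_positions else t for i, t in enumerate(tokens)]
    return source, target


# "Péter tanár volt" ('Péter was a teacher') has the overt past-tense copula "volt";
# its present-tense counterpart "Péter tanár" has none.
source, target = make_training_pair(["Péter", "tanár", "volt"], {2})
print(source)
print(target)
```

Under this reading, a model trained on such pairs learns to insert the label where a copula would appear in the non-default cases.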

2019

Creation of a corpus with semantic role labels for Hungarian
Attila Novák | László Laki | Borbála Novák | Andrea Dömötör | Noémi Ligeti-Nagy | Ágnes Kalivoda
Proceedings of the 13th Linguistic Annotation Workshop

In this article, we present ongoing research, the immediate goal of which is to create a corpus annotated with semantic role labels for Hungarian that can be used to train a parser-based system capable of formulating relevant questions about the text it processes. We briefly describe the objectives of our research and our efforts at eliminating errors in the Hungarian Universal Dependencies corpus, which we use as the base of our annotation effort, at creating a Hungarian verbal argument database annotated with thematic roles, at classifying adjuncts, and at matching verbal argument frames to specific occurrences of verbs and participles in the corpus.

2018

Cross-Lingual Generation and Evaluation of a Wide-Coverage Lexical Semantic Resource
Attila Novák | Borbála Novák
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

E-magyar – A Digital Language Processing System
Tamás Váradi | Eszter Simon | Bálint Sass | Iván Mittelholcz | Attila Novák | Balázs Indig | Richárd Farkas | Veronika Vincze
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

A New Integrated Open-source Morphological Analyzer for Hungarian
Attila Novák | Borbála Siklósi | Csaba Oravecz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of a Hungarian research project has been to create an integrated Hungarian natural language processing framework. This infrastructure includes tools for analyzing Hungarian texts, integrated into a standardized environment. The morphological analyzer is one of the core components of the framework. The goal of this paper is to describe a fast and customizable morphological analyzer and its development framework, which synthesizes and further enriches the morphological knowledge implemented in previous tools existing for Hungarian. In addition, we present the method we applied to add semantic knowledge to the lexical database of the morphology. The method utilizes neural word embedding models and morphological and shallow syntactic knowledge.

2015

Automatic Diacritics Restoration for Hungarian
Attila Novák | Borbála Siklósi
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Restoring the intended structure of Hungarian ophthalmology documents
Borbála Siklósi | Attila Novák
Proceedings of BioNLP 15

2014

A New Form of Humor — Mapping Constraint-Based Computational Morphologies to a Finite-State Representation
Attila Novák
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

MorphoLogic’s Humor morphological analyzer engine has been used for the development of several high-quality computational morphologies, among them ones for complex agglutinative languages. However, Humor’s closed-source licensing scheme has been an obstacle to making these resources widely available. Moreover, the rule-based Humor engine has other limitations: it lacks support for morphological guessing and for the integration of frequency information or other weighting of the models. These problems were solved by converting the databases to a finite-state representation that allows for morphological guessing and the addition of weights, and for which open-source implementations exist.

2013

Morphological annotation of Old and Middle Hungarian corpora
Attila Novák | György Orosz | Nóra Wenszky
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

English to Hungarian Morpheme-based Statistical Machine Translation System with Reordering Rules
László Laki | Attila Novák | Borbála Siklósi
Proceedings of the Second Workshop on Hybrid Approaches to Translation

PurePos 2.0: a hybrid tool for morphological disambiguation
György Orosz | Attila Novák
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2009

MorphoLogic’s Submission for the WMT 2009 Shared Task
Attila Novák
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

The MetaMorpho Translation System
Attila Novák | László Tihanyi | Gábor Prószéky
Proceedings of the Third Workshop on Statistical Machine Translation

2006

Morphological Tools for Six Small Uralic Languages
Attila Novák
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This article presents a set of morphological tools for six small endangered minority languages belonging to the Uralic language family: Udmurt, Komi, Eastern Mari, Northern Mansi, Tundra Nenets and Nganasan. Following an introduction to the languages, the two sets of tools used in the project (MorphoLogic's Humor tools and the Xerox Finite State Tool) are described and compared. The article concludes with a comparison of the six computational morphologies.

2004

Combining Symbolic and Statistical Methods in Morphological Analysis and Unknown Word Guessing
Attila Novák | Viktor Nagy | Csaba Oravecz
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Highly inflectional/agglutinative languages like Hungarian typically feature so many possible word forms that automatic methods which provide morphosyntactic annotation on the basis of some training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyzer, which is able to analyze almost all word forms, alleviating the problem of unseen tokens. However, a smaller number of forms will still remain which are unknown even to the morphological analyzer and should be handled by some guesser mechanism. The paper describes a hybrid method which combines symbolic and statistical information to provide lemmatization and suffix analyses for unknown word forms. Evaluation is carried out with respect to the induction of possible analyses and their respective lexical probabilities for unknown word forms in a part-of-speech tagging system.