András Kornai


2019

Sentence Length
Gábor Borbély | András Kornai
Proceedings of the 16th Meeting on the Mathematics of Language

2016

Detecting Optional Arguments of Verbs
András Kornai | Dávid Márk Nemeskey | Gábor Recski
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We propose a novel method for detecting optional arguments of Hungarian verbs using only positive data. We introduce a custom variant of collexeme analysis that explicitly models the noise in verb frames. Our method is, for the most part, unsupervised: we use the spectral clustering algorithm described in Brew and Schulte im Walde (2002) to build a noise model from a short, manually verified seed list of verbs. We experimented with both raw count- and context-based clusterings and found their performance almost identical. The code for our algorithm and the frame list are freely available at http://hlt.bme.hu/en/resources/tade.

Measuring Semantic Similarity of Words Using Concept Networks
Gábor Recski | Eszter Iklódi | Katalin Pajkossy | András Kornai
Proceedings of the 1st Workshop on Representation Learning for NLP

Evaluating embeddings on dictionary-based similarity
Judit Ács | András Kornai
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

Evaluating multi-sense embeddings for semantic resolution monolingually and in word translation
Gábor Borbély | Márton Makrai | Dávid Márk Nemeskey | András Kornai
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

2015

Lexical Semantics and Model Theory: Together at Last?
András Kornai | Marcus Kracht
Proceedings of the 14th Meeting on the Mathematics of Language (MoL 2015)

Competence in lexical semantics
András Kornai | Judit Ács | Márton Makrai | Dávid Márk Nemeskey | Katalin Pajkossy | Gábor Recski
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

2013

Building basic vocabulary across 40 languages
Judit Ács | Katalin Pajkossy | András Kornai
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13)
András Kornai | Marco Kuhlmann
Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13)

Structure Learning in Weighted Languages
András Kornai | Attila Zséder | Gábor Recski
Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13)

Applicative structure in vector space models
Márton Makrai | Dávid Márk Nemeskey | András Kornai
Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality

The mathematics of language learning
András Kornai | Gerald Penn | James Rogers | Anssi Yli-Jyrä
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Tutorials)

2012

Rapid creation of large-scale corpora and frequency dictionaries
Attila Zséder | Gábor Recski | Dániel Varga | András Kornai
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe, and make public, large-scale language resources and the toolchain used in their creation, for fifteen medium density European languages: Catalan, Czech, Croatian, Danish, Dutch, Finnish, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Spanish, and Swedish. To make the process uniform across languages, we selected tools that are either language-independent or easily customizable for each language, and reimplemented all stages that were taking too long. To achieve processing times that are insignificant compared to the time data collection (crawling) takes, we reimplemented the standard sentence- and word-level tokenizers and created new boilerplate and near-duplicate detection algorithms. Preliminary experiments with non-European languages indicate that our methods are now applicable not just to our sample, but the entire population of digitally viable languages, with the main limiting factor being the availability of high quality stemmers.

2010

NP Alignment in Bilingual Corpora
Gábor Recski | András Rung | Attila Zséder | András Kornai
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Aligning the NPs of parallel corpora is logically halfway between the sentence- and word-alignment tasks that occupy much of the MT literature, but has received far less attention. NP alignment is a challenging problem, capable of rapidly exposing flaws both in the word-alignment and in the NP chunking algorithms one may bring to bear. It is also a very rewarding problem in that NPs are semantically natural translation units, which means that (i) word alignments will cross NP boundaries only exceptionally, and (ii) within sentences already aligned, the proportion of 1-1 alignments will be higher for NPs than for words. We created a simple English-Hungarian gold standard from Orwell’s 1984 (which already exists in manually verified POS-tagged format in many languages thanks to the Multext and Multext-East projects) by manually verifying the automatically generated NP chunking (we used the yamcha, mallet and hunchunk taggers) and manually aligning the maximal NPs and PPs. The maximal NP chunking problem is much harder than base NP chunking, with F-measure in the .7 range (as opposed to over .94 for base NPs). Since the results are highly impacted by the quality of the NP chunking, we tested our alignment algorithms both with real world (machine obtained) chunkings, where results are in the .35 range for the baseline algorithm which propagates GIZA++ word alignments to the NP level, and on idealized (manually obtained) chunkings, where the baseline reaches .4 and our current system reaches .64.

2008

Parallel Creation of Gigaword Corpora for Medium Density Languages - an Interim Report
Péter Halácsy | András Kornai | Péter Németh | Dániel Varga
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

For increased speed in developing gigaword language resources for medium resource density languages, we integrated several FOSS tools in the HUN* toolkit. While the speed and efficiency of the resulting pipeline have surpassed our expectations, our experience in developing LDC-style resource packages for Uzbek and Kurdish makes clear that neither the data collection nor the subsequent processing stages can be fully automated.

2007

Poster paper: HunPos – an open source trigram tagger
Péter Halácsy | András Kornai | Csaba Oravecz
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

Using a morphological analyzer in high precision POS tagging of Hungarian
Péter Halácsy | András Kornai | Csaba Oravecz | Viktor Trón | Dániel Varga
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper presents an evaluation of maxent POS disambiguation systems that incorporate an open source morphological analyzer to constrain the probabilistic models. The experiments show that the best proposed architecture, which is the first application of the maximum entropy framework in a Hungarian NLP task, outperforms comparable state of the art tagging methods and is able to handle out of vocabulary items robustly, allowing for efficient analysis of large (web-based) corpora.

Web-based frequency dictionaries for medium density languages
András Kornai | Péter Halácsy | Viktor Nagy | Csaba Oravecz | Viktor Trón | Dániel Varga
Proceedings of the 2nd International Workshop on Web as Corpus

2005

Hunmorph: Open Source Word Analysis
Viktor Trón | György Gyepesi | Péter Halácsy | András Kornai | László Németh | Dániel Varga
Proceedings of Workshop on Software

2004

Creating Open Language Resources for Hungarian
Péter Halácsy | András Kornai | László Németh | András Rung | István Szakadát | Viktor Trón
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

Classifying the Hungarian Web
András Kornai | Marc Krellenstein | Michael Mulligan | David Twomey | Fruzsina Veress | Alec Wysoker
10th Conference of the European Chapter of the Association for Computational Linguistics

1985

Natural Languages and the Chomsky Hierarchy
András Kornai
Second Conference of the European Chapter of the Association for Computational Linguistics