György Szarvas


2020

The Multilingual Amazon Reviews Corpus
Phillip Keung | Yichao Lu | György Szarvas | Noah A. Smith
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., ‘books’ or ‘appliances’). The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on review data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.
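
The argument for MAE over accuracy can be illustrated with a minimal sketch. The function names and the toy ratings below are assumptions made for illustration; they are not taken from the paper or its released code. Both hypothetical systems get the same number of ratings exactly right, but one misses by a single star and the other by four.

```python
from typing import Sequence


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold star rating exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def mean_absolute_error(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Average distance, in stars, between predicted and gold ratings."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


# Toy data (hypothetical): star ratings on a 1-5 scale.
gold = [1, 3, 5, 5]
near_miss = [2, 3, 4, 5]  # wrong answers are off by one star
far_miss = [5, 3, 1, 5]   # wrong answers are off by four stars

for name, pred in [("near_miss", near_miss), ("far_miss", far_miss)]:
    print(name, accuracy(gold, pred), mean_absolute_error(gold, pred))
```

Accuracy scores the two systems identically, while MAE separates them, which is the sense in which MAE respects the ordering of the ratings.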

2017

Inducing Semantic Micro-Clusters from Deep Multi-View Representations of Novels
Lea Frermann | György Szarvas
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Automatically understanding the plot of novels is important both for informing literary scholarship and applications such as summarization or recommendation. Various models have addressed this task, but their evaluation has remained largely intrinsic and qualitative. Here, we propose a principled and scalable framework leveraging expert-provided semantic tags (e.g., mystery, pirates) to evaluate plot representations in an extrinsic fashion, assessing their ability to produce locally coherent groupings of novels (micro-clusters) in model space. We present a deep recurrent autoencoder model that learns richly structured multi-view plot representations, and show that they i) yield better micro-clusters than less structured representations; and ii) are interpretable, and thus useful for further literary analysis or labeling of the emerging micro-clusters.

2013

Learning to Rank Lexical Substitutions
György Szarvas | Róbert Busa-Fekete | Eyke Hüllermeier
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Uncertainty Detection for Natural Language Watermarking
György Szarvas | Iryna Gurevych
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Supervised All-Words Lexical Substitution using Delexicalized Features
György Szarvas | Chris Biemann | Iryna Gurevych
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Cross-Genre and Cross-Domain Detection of Semantic Uncertainty
György Szarvas | Veronika Vincze | Richárd Farkas | György Móra | Iryna Gurevych
Computational Linguistics, Volume 38, Issue 2 - June 2012

2010

Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task
Richárd Farkas | Veronika Vincze | György Szarvas | György Móra | János Csirik
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

The CoNLL-2010 Shared Task: Learning to Detect Hedges and their Scope in Natural Language Text
Richárd Farkas | Veronika Vincze | György Móra | János Csirik | György Szarvas
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

TUD: Semantic Relatedness for Relation Classification
György Szarvas | Iryna Gurevych
Proceedings of the 5th International Workshop on Semantic Evaluation

2009

Exploring ways beyond the simple supervised learning approach for biological event extraction
György Móra | Richárd Farkas | György Szarvas | Zsolt Molnár
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task

2008

Hedge Classification in Biomedical Texts with a Weakly Supervised Selection of Keywords
György Szarvas
Proceedings of ACL-08: HLT

Hungarian Word-Sense Disambiguated Corpus
Veronika Vincze | György Szarvas | Attila Almási | Dóra Szauter | Róbert Ormándi | Richárd Farkas | Csaba Hatvani | János Csirik
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

To create the first Hungarian WSD corpus, 39 suitable word form samples were selected for the purpose of word sense disambiguation. Among other things, the selection criteria required the given word form to be frequent in Hungarian language usage and to have more than one sense considered frequent in usage. HNC and its Heti Világgazdaság subcorpus provided the basis for corpus text selection. This way, each sample has a relevant context (the whole article), and information on the lemma, POS-tagging and automatic tokenization is also available. When planning the corpus, 300-500 samples of each word form were to be annotated. This size makes it possible to compare the subcorpora prepared for the individual word forms with data available for other languages. However, the finalized database also contains unannotated samples and samples annotated by only one of the linguists. The corpus follows the format of the ACL’s SensEval/SemEval WSD tasks. The first version of the corpus was developed within the scope of the project titled The construction Hungarian WordNet Ontology and its application in Information Extraction Systems (Hatvani et al., 2007). The corpus is available for research and educational purposes and can be downloaded free of charge.

The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts
György Szarvas | Veronika Vincze | Richárd Farkas | János Csirik
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

2007

GYDER: Maxent Metonymy Resolution
Richárd Farkas | Eszter Simon | György Szarvas | Dániel Varga
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

A highly accurate Named Entity corpus for Hungarian
György Szarvas | Richárd Farkas | László Felföldi | András Kocsor | János Csirik
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A highly accurate Named Entity (NE) corpus for Hungarian that is publicly available for research purposes is introduced in the paper, along with its main properties. The results of experiments that apply various Machine Learning models and classifier combination schemes are also presented to serve as a benchmark for further research based on the corpus. The data is a segment of the Szeged Corpus (Csendes et al., 2004), consisting of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). The annotation procedure was carried out paying special attention to annotation accuracy. The corpus went through a parallel annotation phase done by two annotators, resulting in a tagging with an inter-annotator agreement rate of 99.89%. Controversial taggings were collected and discussed by the two annotators and a linguist with several years of experience in corpus annotation. These examples were tagged following the decision they made together, and finally all entities that had suspicious or dubious annotations were collected and checked for consistency. We consider the result of this correction process to be virtually free of errors. Our best performing Named Entity Recognizer (NER) model attained an F-measure of 92.86% on the corpus.