Jan Kocoń


2019

Multi-Level Sentiment Analysis of PolEmo 2.0: Extended Corpus of Multi-Domain Consumer Reviews
Jan Kocoń | Piotr Miłkowski | Monika Zaśko-Zielińska
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

In this article we present an extended version of PolEmo, a corpus of consumer reviews from four domains: medicine, hotels, products and school. The current version (PolEmo 2.0) contains 8,216 reviews comprising 57,466 sentences. Each text and sentence was manually annotated with sentiment in a 2+1 scheme, which gives a total of 197,046 annotations. We obtained high Positive Specific Agreement: 0.91 for texts and 0.88 for sentences. PolEmo 2.0 is publicly available under a Creative Commons copyright license. We explored recent deep learning approaches for the recognition of sentiment, such as Bi-directional Long Short-Term Memory (BiLSTM) and Bidirectional Encoder Representations from Transformers (BERT).
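
For readers unfamiliar with the agreement measure named in this abstract, the sketch below uses the standard definition of Positive Specific Agreement for one category: twice the number of items both annotators assigned to that category, divided by the total number of times either annotator used it. The function name, the label strings and the sample data are illustrative assumptions, not taken from the paper, and the paper's exact computation may differ.

```python
from typing import Sequence


def positive_specific_agreement(
    labels_a: Sequence[str], labels_b: Sequence[str], category: str
) -> float:
    """PSA for one category between two annotators (standard 2a / (2a + b + c) form)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    # a: items that both annotators assigned to the category
    both = sum(1 for x, y in zip(labels_a, labels_b) if x == category and y == category)
    # 2a + b + c: how often each annotator used the category, summed
    total = sum(1 for x in labels_a if x == category) + sum(
        1 for y in labels_b if y == category
    )
    if total == 0:
        return float("nan")  # category never used; agreement is undefined
    return 2 * both / total


# Hypothetical labels for four items (not data from PolEmo 2.0).
annotator_a = ["pos", "pos", "neg", "neu"]
annotator_b = ["pos", "neg", "neg", "pos"]
psa_pos = positive_specific_agreement(annotator_a, annotator_b, "pos")
```

Here both annotators chose "pos" for one item, and each used "pos" twice, so the value for "pos" is 2 * 1 / 4.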

Multi-level analysis and recognition of the text sentiment on the example of consumer opinions
Jan Kocoń | Monika Zaśko-Zielińska | Piotr Miłkowski
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this article, we present a novel multi-domain dataset of Polish text reviews, annotated with sentiment at two levels: sentences and whole documents. The annotation was made by linguists in a 2+1 scheme (with inter-annotator agreement analysis). We present a preliminary approach to the classification of labelled data using logistic regression, bidirectional long short-term memory recurrent neural networks (BiLSTM) and bidirectional encoder representations from transformers (BERT).

2018

Classifier-based Polarity Propagation in a WordNet
Jan Kocoń | Arkadiusz Janz | Maciej Piasecki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Improved Recognition and Normalisation of Polish Temporal Expressions
Jan Kocoń | Michał Marcińczuk
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this article we present the results of recent research on the recognition and normalisation of Polish temporal expressions. The temporal information extracted from text plays a major role in many information extraction systems, such as question answering, event recognition or discourse analysis. We propose a new method for normalising temporal expressions, called Cascade of Partial Rules. Here we describe the results achieved by an updated version of the Liner2 machine learning system.

Inforex — a collaborative system for text corpora annotation and analysis
Michał Marcińczuk | Marcin Oleksy | Jan Kocoń
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We report the first major upgrade of Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. Inforex is part of the Polish CLARIN infrastructure. It is integrated with a digital repository for storing and publishing language resources, and it allows users to visualize, browse and annotate text corpora stored in the repository. Following a series of workshops for researchers from the humanities and social sciences, we improved the graphical interface to make the system friendlier and more readable for inexperienced users. We also implemented new functionality for gold standard annotation, which includes private annotations and annotation agreement by a super-annotator.

Recognition of Genuine Polish Suicide Notes
Maciej Piasecki | Ksenia Młynarczyk | Jan Kocoń
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this article we present the results of recent research on the recognition of genuine Polish suicide notes (SNs). We provide a useful method to distinguish between SNs and other types of discourse, including counterfeited SNs. The method uses a wide range of word-based and semantic features and was evaluated using the Polish Corpus of Suicide Notes, which contains 1244 genuine SNs, expanded with a manually prepared set of 334 counterfeited SNs and 2200 letter-like texts from the Internet. We used the algorithm to create class-related sense dictionaries to improve the results of SN classification. The obtained results show that there are fundamental differences between genuine SNs and counterfeited SNs. The applied method of sense dictionary construction proved to be the best way of improving the model.

Liner2 — a Generic Framework for Named Entity Recognition
Michał Marcińczuk | Jan Kocoń | Marcin Oleksy
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

In this paper we present an adaptation of the Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.

2015

Recognition of Polish Temporal Expressions
Jan Kocoń | Michał Marcińczuk
Proceedings of the International Conference Recent Advances in Natural Language Processing

2013

Recognition of Named Entities Boundaries in Polish Texts
Michał Marcińczuk | Jan Kocoń
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

2012

Inforex – a web-based tool for text corpus management and semantic annotation
Michał Marcińczuk | Jan Kocoń | Bartosz Broda
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The aim of this paper is to present a system for semantic text annotation called Inforex. Inforex is a web-based system designed for managing and annotating text corpora on the semantic level, including annotation of Named Entities (NE), anaphora, Word Sense Disambiguation (WSD) and relations between named entities. The system also supports manual text clean-up and automatic text pre-processing, including text segmentation, morphosyntactic analysis and word selection for word sense annotation. Inforex can be accessed from any standards-compliant web browser supporting JavaScript. The user interface takes the form of dynamic HTML pages using AJAX technology. The server part of the system is written in PHP and the data is stored in a MySQL database. The system makes use of some external tools that are installed on the server or can be accessed via web services. The documents are stored in the database in their original format: plain text, XML or HTML. Tokenization and sentence segmentation are optional and are stored in a separate table. Tokens are stored as pairs of values representing the indexes of the first and last character of each token, together with sets of features representing the morpho-syntactic information.
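
The stand-off token representation described at the end of this abstract can be pictured with a small sketch. It is only an illustration in Python: Inforex itself is written in PHP with a MySQL back end, and the class name, field names, sample sentence and feature tags below are all hypothetical, not the system's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Token:
    """A token kept apart from the document text (hypothetical layout)."""
    first: int  # index of the token's first character in the document
    last: int   # index of the token's last character (inclusive)
    features: Set[str] = field(default_factory=set)  # morpho-syntactic tags


def token_text(document: str, token: Token) -> str:
    """Recover the token's surface form from the untouched original document."""
    return document[token.first : token.last + 1]


# The document stays in its original form; tokens only point into it.
document = "Ala ma kota."
tokens: List[Token] = [
    Token(0, 2, {"noun"}),    # illustrative feature labels only
    Token(4, 5, {"verb"}),
    Token(7, 10, {"noun"}),
    Token(11, 11, {"punct"}),
]
surface_forms = [token_text(document, t) for t in tokens]
```

Storing offsets rather than copies of the text means the original document is never modified by tokenization, which fits the abstract's note that segmentation is optional and kept in a separate table.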