Marko Tadić


2020

Natural Language Processing Chains Inside a Cross-lingual Event-Centric Knowledge Pipeline for European Union Under-resourced Languages
Diego Alves | Gaurish Thakkar | Marko Tadić
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

This article presents the strategy for developing a platform containing Language Processing Chains for European Union languages, covering the steps from Tokenization to Parsing and also including Named Entity Recognition and Sentiment Analysis. These chains are part of the first step of an event-centric knowledge processing pipeline whose aim is to process multilingual media information about major events that can have an impact in Europe and the rest of the world. Due to the differences in the availability of language resources for each language, we have built this strategy in three steps, starting with processing chains for the well-resourced languages and finishing with the development of new modules for the under-resourced ones. In order to classify all European Union official languages in terms of resources, we have analysed the size of annotated corpora as well as the existence of pre-trained models in mainstream Language Processing tools, and we have combined this information with the classification proposed in the META-NET white paper series.

Evaluating Language Tools for Fifteen EU-official Under-resourced Languages
Diego Alves | Gaurish Thakkar | Marko Tadić
Proceedings of the 12th Language Resources and Evaluation Conference

This article presents the results of the evaluation campaign of language tools available for fifteen EU-official under-resourced languages. The evaluation was conducted within the MSC ITN CLEOPATRA action that aims at building the cross-lingual event-centric knowledge processing on top of the application of linguistic processing chains (LPCs) for at least 24 EU-official languages. In this campaign, we concentrated on three existing NLP platforms (Stanford CoreNLP, NLP Cube, UDPipe) that all provide models for under-resourced languages, and in this first run we covered 15 under-resourced languages for which the models were available. We present the design of the evaluation campaign and present and discuss its results. We considered a difference of up to a single percentage point between the reported results and our tested results to be within the limits of acceptable tolerance and thus consider such results reproducible. However, for a number of languages, the results are below what was reported in the literature, while in some cases our testing results are even better than the ones reported previously. Particularly problematic was the evaluation of NERC systems. One of the reasons is the absence of a universally or cross-lingually applicable named entity classification scheme that would serve the NERC task in different languages analogously to the Universal Dependencies scheme in the parsing task. Building such a scheme has become one of our future research directions.

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the 12th Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraz Repar | Matjaž Rihtar | Janez Brank
Proceedings of the 12th Language Resources and Evaluation Conference

This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.

Building the Spanish-Croatian Parallel Corpus
Bojana Mikelenić | Marko Tadić
Proceedings of the 12th Language Resources and Evaluation Conference

This paper describes the building of the first Spanish-Croatian unidirectional parallel corpus, which has been constructed at the Faculty of Humanities and Social Sciences of the University of Zagreb. The corpus comprises eleven Spanish novels and their translations into Croatian done by six different professional translators. All the texts were published between 1999 and 2012. The corpus has more than 2 Mw, with approximately 1 Mw for each language. It was automatically sentence segmented and aligned, as well as manually post-corrected, and contains 71,778 translation units. In order to protect the copyright and to make the corpus available under a permissive CC-BY licence, the aligned translation units are shuffled. This limits the usability of the corpus to research on language units at the sentence level and below. There are two versions of the corpus in TMX format that will be available for download through the META-SHARE and CLARIN ERIC infrastructures. The first version contains plain TMX, while the second is lemmatised and POS-tagged and stored in the aTMX format.

2016

Building the Macedonian-Croatian Parallel Corpus
Ines Cebović | Marko Tadić
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a newly created parallel corpus of two under-resourced languages, namely the Macedonian-Croatian Parallel Corpus (mk-hr_pcorp), which was collected during 2015 at the Faculty of Humanities and Social Sciences, University of Zagreb. The mk-hr_pcorp is a unidirectional (mk→hr) parallel corpus composed of synchronic fictional prose texts received already in digital form, with over 500 Kw in each language. The corpus was sentence segmented and provides 39,735 aligned sentences. The alignment was done automatically and then post-corrected manually. The order of the alignments was shuffled, which enabled the corpus to be made available under a CC-BY license through META-SHARE. However, this prevents research on language units above the sentence level.

2014

Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics
Shuly Wintner | Marko Tadić | Bogdan Babych
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

CroDeriV: a new resource for processing Croatian morphology
Krešimir Šojat | Matea Srebačić | Marko Tadić | Tin Pavelić
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The paper deals with the processing of Croatian morphology and presents CroDeriV ― a newly developed language resource that contains data about morphological structure and derivational relatedness of verbs in Croatian. In its present shape, CroDeriV contains 14 192 Croatian verbs. Verbs in CroDeriV are analyzed for morphemes and segmented into lexical, derivational and inflectional morphemes. The structure of CroDeriV enables the detection of verbal derivational families in Croatian as well as the distribution and frequency of particular affixes and lexical morphemes. Derivational families consist of a verbal base form and all prefixed or suffixed derivatives detected in available machine readable Croatian dictionaries and corpora. Language data structured in this way was further used for the expansion of other language resources for Croatian, such as Croatian WordNet and the Croatian Morphological Lexicon. Matching the data from CroDeriV on one side and Croatian WordNet and the Croatian Morphological Lexicon on the other resulted in significant enrichment of Croatian WordNet and enlargement of the Croatian Morphological Lexicon.

The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

Language Processing Infrastructure in the XLike Project
Lluís Padró | Željko Agić | Xavier Carreras | Blaz Fortuna | Esteban García-Cuesta | Zhixing Li | Tadej Štajner | Marko Tadić
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the linguistic analysis tools and their infrastructure developed within the XLike project. The main goal of the implemented tools is to provide a set of functionalities for supporting some of the main objectives of XLike, such as enabling cross-lingual services for publishers, media monitoring or developing new business intelligence applications. The services cover seven major and minor languages: English, German, Spanish, Chinese, Catalan, Slovenian, and Croatian. These analyzers are provided as web services following a lightweight SOA architecture approach; they are publicly callable and are catalogued in META-SHARE.

Croatian Dependency Treebank 2.0: New Annotation Guidelines for Improved Parsing
Željko Agić | Daša Berović | Danijela Merkler | Marko Tadić
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a new version of the Croatian Dependency Treebank. It constitutes a slight departure from the previously closely observed Prague Dependency Treebank syntactic layer annotation guidelines as we introduce a new subset of syntactic tags on top of the existing tagset. These new tags are used in explicit annotation of subordinate clauses via subordinate conjunctions. Introducing the new annotation to Croatian Dependency Treebank, we also modify head attachment rules addressing subordinate conjunctions and subordinate clause predicates. In an experiment with data-driven dependency parsing, we show that implementing these new annotation guidelines leads to a statistically significant improvement in parsing accuracy. We also observe a substantial improvement in inter-annotator agreement, facilitating more consistent annotation in further treebank development.

RECSA: Resource for Evaluating Cross-lingual Semantic Annotation
Achim Rettinger | Lei Zhang | Daša Berović | Danijela Merkler | Matea Srebačić | Marko Tadić
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In recent years large repositories of structured knowledge (DBpedia, Freebase, YAGO) have become a valuable resource for language technologies, especially for the automatic aggregation of knowledge from textual data. One essential component of language technologies that leverage such knowledge bases is the linking of words or phrases in specific text documents with elements from the knowledge base (KB). We call this semantic annotation. At the same time, initiatives like Wikidata try to make those knowledge bases less language dependent in order to allow cross-lingual or language independent knowledge access. This poses a new challenge to semantic annotation tools which typically are language dependent and link documents in one language to a structured knowledge base grounded in the same language. Ultimately, the goal is to construct cross-lingual semantic annotation tools that can link words or phrases in one language to a structured knowledge database in any other language or to a language independent representation. To support this line of research we developed what we believe could serve as a gold standard Resource for Evaluating Cross-lingual Semantic Annotation (RECSA). We compiled a hand-annotated parallel corpus of 300 news articles in three languages with cross-lingual semantic groundings to the English Wikipedia and DBpedia. We hope that this new language resource, which is freely available, will help to establish a standard test set and methodology to comparatively evaluate cross-lingual semantic annotation technologies.

2012

Croatian Dependency Treebank: Recent Development and Initial Experiments
Daša Berović | Željko Agić | Marko Tadić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the current state of development of the Croatian Dependency Treebank, with special emphasis on adapting the Prague Dependency Treebank formalism to Croatian language specifics, and illustrate its possible applications in an experiment with dependency parsing using MaltParser. The treebank currently contains approximately 2870 sentences, of which 2699 sentences and 66930 tokens were used in this experiment. Three linear-time projective algorithms implemented by the MaltParser system (Nivre eager, Nivre standard and stack projective), running on default settings, were used in the experiment. The highest performing system, implementing the Nivre eager algorithm, scored (LAS 71.31 UAS 80.93 LA 83.87) within our experiment setup. The results obtained serve as an illustration of the treebank's usefulness in natural language processing research and as a baseline for further research in dependency parsing of Croatian.

Generation of Verbal Stems in Derivationally Rich Language
Krešimir Šojat | Nives Mikelić Preradović | Marko Tadić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents a procedure for generating prefixed verbs in Croatian comprising combinations of one, two or three prefixes. The result of this generation process is a pool of derivationally valid prefixed verbs, although not necessarily occurring in corpora. The statistics of occurrences of generated verbs in the Croatian National Corpus have been calculated. Further usage of such a language resource with generated potential verbs is also suggested, namely the enrichment of the Croatian Morphological Lexicon, Croatian Wordnet and CROVALLEX.

Open source multi-platform NooJ for NLP
Max Silberztein | Tamás Váradi | Marko Tadić
Proceedings of COLING 2012: Demonstration Papers

Central and South-East European Resources in META-SHARE
Marko Tadić | Tamás Váradi
Proceedings of COLING 2012: Demonstration Papers

2010

Corpus Aligner (CorAl) Evaluation on English-Croatian Parallel Corpora
Sanja Seljan | Marko Tadić | Željko Agić | Jan Šnajder | Bojana Dalbelo Bašić | Vjekoslav Osmann
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

An increasing demand for new language resources of recent EU members and accession countries has in turn initiated the development of different language tools and resources, such as alignment tools and corresponding translation memories for new language pairs. The primary goal of this paper is to provide a description of a free sentence alignment tool CorAl (Corpus Aligner), developed at the Faculty of Electrical Engineering and Computing, University of Zagreb. The tool performs paragraph alignment at the first step of the alignment process, which is followed by sentence alignment. Description of the tool is followed by its evaluation. The paper describes an experiment with applying the CorAl aligner to an English-Croatian parallel corpus of legislative domain using metrics of precision, recall and F1-measure. Results are discussed and the concluding sections discuss future directions of CorAl development.

Improving Chunking Accuracy on Croatian Texts by Morphosyntactic Tagging
Kristina Vučković | Željko Agić | Marko Tadić
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present the results of an experiment with utilizing a stochastic morphosyntactic tagger as a pre-processing module of a rule-based chunker and partial parser for Croatian in order to raise its overall chunking and partial parsing accuracy on Croatian texts. In order to conduct the experiment, we have manually chunked and partially parsed 459 sentences from the Croatia Weekly 100 kw newspaper sub-corpus taken from the Croatian National Corpus, which had previously also been morphosyntactically disambiguated and lemmatized. Due to the lack of resources of this type, these sentences were designated as a temporary chunking and partial parsing gold standard for Croatian. We have then evaluated the chunker and partial parser in three different scenarios: (1) chunking previously morphosyntactically untagged text, (2) chunking text that was tagged using the stochastic morphosyntactic tagger for Croatian and (3) chunking manually tagged text. The obtained F1-scores for the three scenarios were, respectively, 0.874 (P: 0.825, R: 0.930), 0.891 (P: 0.856, R: 0.928) and 0.914 (P: 0.904, R: 0.925). The paper provides the description of language resources and tools used in the experiment, its setup and discussion of results and perspectives for future work.

Towards Sentiment Analysis of Financial Texts in Croatian
Željko Agić | Nikola Ljubešić | Marko Tadić
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper presents results of an experiment dealing with sentiment analysis of Croatian text from the domain of finance. The goal of the experiment was to design a system model for automatic detection of general sentiment and polarity phrases in these texts. We have assembled a document collection from web sources writing on the financial market in Croatia and manually annotated articles from a subset of that collection for general sentiment. Additionally, we have manually annotated a number of these articles for phrases encoding positive or negative sentiment within a text. In the paper, we provide an analysis of the compiled resources. We show a statistically significant correspondence (1) between the overall market trend on the Zagreb Stock Exchange and the number of positively and negatively accented articles within periods of trend and (2) between the general sentiment of articles and the number of polarity phrases within those articles. We use this analysis as an input for designing a rule-based local grammar system for automatic detection of polarity phrases and evaluate it on held-out data. The system achieves F1-scores of 0.61 (P: 0.94, R: 0.45) and 0.63 (P: 0.97, R: 0.47) on positive and negative polarity phrases, respectively.

2008

Rule-Based Chunker for Croatian
Kristina Vučković | Marko Tadić | Zdravko Dovedan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we discuss a rule-based approach to chunking sentences in Croatian, implemented using local regular grammars within the NooJ development environment. We describe the rules and their implementation by regular grammars and at the same time show that in the NooJ environment it is extremely easy to fine-tune their different sub-rules. Since Croatian has strong morphosyntactic features that are shared between most or all elements of a chunk, the rules are built by taking these features into account and strongly relying on them. For the evaluation of our chunker we used an extracted set of manually annotated sentences from a 100 kw MSD-tagged and disambiguated Croatian corpus. Our chunker performed the best on VP-chunks (F: 97.01), while NP-chunks (F: 92.31) and PP-chunks (F: 83.08) were of lower quality. The results are comparable to chunker performance in the CoNLL-2000 shared task on chunking.

2007

Implementation of Croatian NERC System
Božo Bekavac | Marko Tadić
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing

2006

Evaluating Morphosyntactic Tagging of Croatian Texts
Željko Agić | Marko Tadić
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes results of the first successful effort in applying a stochastic strategy, namely a second order Markov model paradigm implemented by the TnT trigram tagger, to morphosyntactic tagging of Croatian texts. Besides the tagger, for purposes of both training and testing, we had at our disposal only a 100 Kw Croatia Weekly newspaper subcorpus, manually tagged using approximately 1000 different MULTEXT-East v3 morphosyntactic tags. The test basically consisted of randomly assigning a variable size portion of the corpus for the tagger’s training procedure and also another fixed-size portion, sized at 10% of the corpus, for the tagging procedure itself; this method allowed us not only to provide preliminary results regarding tagger accuracy on Croatian texts, but also to inspect the behavior of the stochastic tagging paradigm in general. The results were then taken from the test case providing 90% of the corpus for training purposes and varied from around 86% in the worst case scenario up to a peak of around 95% correctly assigned full MSD tags. Results on PoS only expectedly reached the human error level, with TnT correctly tagging above 98% of test sets on average. Most MSD errors occurred on types with the highest number of candidate tags per word form (nouns, pronouns and adjectives), while errors on PoS, although following the same pattern, were almost insignificant. Detailed insight into tagging and the F-measure for all PoS categories are provided in the course of the paper along with other facts of interest.

2004

Enlarging the Croatian Morphological Lexicon by Automatic Lexical Acquisition from Raw Corpora
Antoni Oliver | Marko Tadić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper presents experiments for enlarging the Croatian Morphological Lexicon by applying an automatic acquisition methodology. The basic sources of information for the system are a set of morphological rules and a raw corpus. The morphological rules have been automatically derived from the existing Croatian Morphological Lexicon, and in our experiments we have used a subset of the Croatian National Corpus. The methodology has proved to be efficient for those languages that, like Croatian, present a rich and mainly concatenative morphology. This method can be applied for the creation of new resources, as well as in the enrichment of existing ones. We also present an extension of the system that uses automatic querying of the Internet to acquire those entries for which we do not have enough information in our corpus.

Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian
Božo Bekavac | Petya Osenova | Kiril Simov | Marko Tadić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

The MULTEXT-East Morphosyntactic Specification for Slavic Languages
Tomaž Erjavec | Cvetana Krstev | Vladimír Petkevič | Kiril Simov | Marko Tadić | Duško Vitas
Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages

Building the Croatian Morphological Lexicon
Marko Tadić | Sanja Fulgosi
Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages

2002

Building the Croatian National Corpus
Marko Tadić
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

Building the Croatian-English Parallel Corpus
Marko Tadić
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
