Maud Ehrmann


2020

Language Resources for Historical Newspapers: the Impresso Collection
Maud Ehrmann | Matteo Romanello | Simon Clematide | Phillip Benjamin Ströbel | Raphaël Barman
Proceedings of the 12th Language Resources and Evaluation Conference

Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge – and real promise of digitization – is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this ‘Big Data of the Past’. Yet, the application of text processing tools on historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the ‘impresso - Media Monitoring of the Past’ project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

2017

Book Review: Linked Lexical Knowledge Bases: Foundations and Applications by Iryna Gurevych, Judith Eckle-Kohler and Michael Matuschek
Maud Ehrmann
Computational Linguistics, Volume 43, Issue 2 - June 2017

2016

Cross-lingual Linking of Multi-word Entities and their corresponding Acronyms
Guillaume Jacquet | Maud Ehrmann | Ralf Steinberger | Jaakko Väyrynen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper reports on an approach and experiments to automatically build a cross-lingual multi-word entity resource. Starting from a collection of millions of acronym/expansion pairs for 22 languages where expansion variants were grouped into monolingual clusters, we experiment with several aggregation strategies to link these clusters across languages. Aggregation strategies make use of string similarity distances and translation probabilities and they are based on vector space and graph representations. The accuracy of the approach is evaluated against Wikipedia’s redirection and cross-lingual linking tables. The resulting multi-word entity resource contains 64,000 multi-word entities with unique identifiers and their 600,000 multilingual lexical variants. We intend to make this new resource publicly available.

Named Entity Resources - Overview and Outlook
Maud Ehrmann | Damien Nouvel | Sophie Rosset
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Recognition of real-world entities is crucial for most NLP applications. Since its introduction some twenty years ago, named entity processing has undergone a significant evolution with, among others, the definition of new tasks (e.g. entity linking) and the emergence of new types of data (e.g. speech transcriptions, micro-blogging). These certainly pose new challenges which affect not only methods and algorithms but especially linguistic resources. Where do we stand with respect to named entity resources? This paper aims to provide a systematic overview of named entity resources, accounting for qualities such as multilingualism, dynamicity and interoperability, and to identify shortfalls in order to guide future developments.

2014

Media monitoring and information extraction for the highly inflected agglutinative language Hungarian
Júlia Pajzs | Ralf Steinberger | Maud Ehrmann | Mohamed Ebrahim | Leonida Della Rocca | Stefano Bucci | Eszter Simon | Tamás Váradi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page http://emm.newsbrief.eu/overview.html.

Clustering of Multi-Word Named Entity variants: Multilingual Evaluation
Guillaume Jacquet | Maud Ehrmann | Ralf Steinberger
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Multi-word entities, such as organisation names, are frequently written in many different ways. We have previously automatically identified over one million acronym pairs in 22 languages, consisting of their short form (e.g. EC) and their corresponding long forms (e.g. European Commission, European Union Commission). In order to automatically group such long form variants as belonging to the same entity, we cluster them, using bottom-up hierarchical clustering and pair-wise string similarity metrics. In this paper, we address the issue of how to evaluate the named entity variant clusters automatically, with minimal human annotation effort. We present experiments that make use of Wikipedia redirection tables and we show that this method produces good results.

Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0
Maud Ehrmann | Francesco Cecconi | Daniele Vannella | John Philip McCrae | Philipp Cimiano | Roberto Navigli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Recent years have witnessed a surge in the amount of semantic information published on the Web. Indeed, the Web of Data, a subset of the Semantic Web, has been increasing steadily in both volume and variety, transforming the Web into a ‘global database’ in which resources are linked across sites. Linguistic fields -- in a broad sense -- have not been left behind, and we observe a similar trend with the growth of linguistic data collections on the so-called ‘Linguistic Linked Open Data (LLOD) cloud’. While both Semantic Web and Natural Language Processing communities can obviously take advantage of this growing and distributed linguistic knowledge base, they are today faced with a new challenge, i.e., that of facilitating multilingual access to the Web of data. In this paper we present the publication of BabelNet 2.0, a wide-coverage multilingual encyclopedic dictionary and ontology, as Linked Data. The conversion made use of lemon, a lexicon model for ontologies particularly well-suited for this enterprise. The result is an interlinked multilingual (lexical) resource which can not only be accessed on the LOD, but also be used to enrich existing datasets with linguistic information, or to support the process of mapping datasets across languages.

2013

On Named Entity Recognition in Targeted Twitter Streams in Polish
Jakub Piskorski | Maud Ehrmann
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Acronym recognition and processing in 22 languages
Maud Ehrmann | Leonida Della Rocca | Ralf Steinberger | Hristo Tanev
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2011

Building a Multilingual Named Entity-Annotated Corpus Using Annotation Projection
Maud Ehrmann | Marco Turchi | Ralf Steinberger
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Highly Multilingual Coreference Resolution Exploiting a Mature Entity Repository
Josef Steinberger | Jenya Belyaeva | Jonathan Crawley | Leonida Della Rocca | Mohamed Ebrahim | Maud Ehrmann | Mijail Kabadjov | Ralf Steinberger | Erik van der Goot
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Creating Sentiment Dictionaries via Triangulation
Josef Steinberger | Polina Lenkova | Mohamed Ebrahim | Maud Ehrmann | Ali Hurriyetoglu | Mijail Kabadjov | Ralf Steinberger | Hristo Tanev | Vanni Zavarella | Silvia Vázquez
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

2009

Towards a Methodology for Named Entities Annotation
Karën Fort | Maud Ehrmann | Adeline Nazarenko
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2007

XRCE-M: A Hybrid System for Named Entity Metonymy Resolution
Caroline Brun | Maud Ehrmann | Guillaume Jacquet
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)