Bolette Sandford Pedersen

Also published as: Bolette Pedersen, Bo Pedersen, Bolette S. Pedersen, Bolette Sandford Pedersen


2020

Building Sense Representations in Danish by Combining Word Embeddings with Lexical Resources
Ida Rørmann Olsen | Bolette Pedersen | Asad Sayeed
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

Our aim is to identify suitable sense representations for NLP in Danish. We investigate sense inventories that correlate with human interpretations of word meaning and ambiguity as typically described in dictionaries and wordnets and that are well reflected distributionally as expressed in word embeddings. To this end, we study a number of highly ambiguous Danish nouns and examine the effectiveness of sense representations constructed by combining vectors from a distributional model with the information from a wordnet. We establish representations based on centroids obtained from wordnet synsets and example sentences, as well as representations established by other means, and test them in a word sense disambiguation task. We conclude that the more information extracted from the wordnet entries (example sentence, definition, semantic relations), the more successful the sense representation vector.

A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment
Sina Ahmadi | John Philip McCrae | Sanni Nimb | Fahad Khan | Monica Monachini | Bolette Pedersen | Thierry Declerck | Tanja Wissik | Andrea Bellandi | Irene Pisani | Thomas Troelsgård | Sussi Olsen | Simon Krek | Veronika Lipp | Tamás Váradi | László Simon | András Gyorffy | Carole Tiberius | Tanneke Schoonheim | Yifat Ben Moshe | Maya Rudich | Raya Abu Ahmad | Dorielle Lonke | Kira Kovalenko | Margit Langemets | Jelena Kallas | Oksana Dereza | Theodorus Fransen | David Cillessen | David Lindemann | Mikel Alonso | Ana Salgado | José Luis Sancho | Rafael-J. Ureña-Ruiz | Jordi Porta Zamorano | Kiril Simov | Petya Osenova | Zara Kancheva | Ivaylo Radev | Ranka Stanković | Andrej Perdih | Dejan Gabrovsek
Proceedings of the 12th Language Resources and Evaluation Conference

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

World Class Language Technology - Developing a Language Technology Strategy for Danish
Sabine Kirchmeier | Bolette Pedersen | Sanni Nimb | Philip Diderichsen | Peter Juel Henrichsen
Proceedings of the 12th Language Resources and Evaluation Conference

Although Denmark is one of the most digitized countries in Europe, no coordinated efforts have been made in recent years to support the Danish language with regard to language technology and artificial intelligence. In March 2019, however, the Danish government adopted a new, ambitious strategy for LT and artificial intelligence. In this paper, we describe the process behind the development of the language-related parts of the strategy: A Danish Language Technology Committee was constituted and a comprehensive series of workshops was organized in which users, suppliers, developers, and researchers gave their valuable input based on their experiences. We describe how, based on this experience, the focus areas and recommendations for the LT strategy were established, and which steps are currently being taken in order to put the strategy into practice.

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the 12th Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

Towards a Gold Standard for Evaluating Danish Word Embeddings
Nina Schneidermann | Rasmus Hvingelby | Bolette Pedersen
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents the process of compiling a model-agnostic similarity gold standard for evaluating Danish word embeddings based on human judgments made by 42 native speakers of Danish. Word embeddings resemble semantic similarity solely by distribution (meaning that word vectors do not reflect relatedness as differing from similarity), and we argue that this generalization poses a problem in most intrinsic evaluation scenarios. In order to be able to evaluate on both dimensions, our human-generated dataset is therefore designed to reflect the distinction between relatedness and similarity. The gold standard is applied for evaluating the “goodness” of six existing word embedding models for Danish, and it is discussed how a relatively low correlation can be explained by the fact that semantic similarity is substantially more challenging to model than relatedness, and that there seems to be a need for future human judgments to measure similarity in full context and along more than a single spectrum.

2018

A Danish FrameNet Lexicon and an Annotated Corpus Used for Training and Evaluating a Semantic Frame Classifier
Bolette Pedersen | Sanni Nimb | Anders Søgaard | Mareike Hartmann | Sussi Olsen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

The SemDaX Corpus ― Sense Annotations with Scalable Sense Inventories
Bolette Pedersen | Anna Braasch | Anders Johannsen | Héctor Martínez Alonso | Sanni Nimb | Sussi Olsen | Anders Søgaard | Nicolai Hartvig Sørensen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We launch the SemDaX corpus, which is a recently completed Danish human-annotated corpus available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. The aim of the developed corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms with the practical purpose of developing sense taggers for Danish. To these aims, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than what is normally seen: for the all-words task we double annotated 60% of the material and for the lexical sample task 100%. We include in the corpus not only the adjudicated files, but also the diverging annotations. In other words, we do not consider all disagreement to be noise; rather, it contains valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.

2015

Supersense tagging for Danish
Héctor Martínez Alonso | Anders Johannsen | Sussi Olsen | Sanni Nimb | Nicolai Hartvig Sørensen | Anna Braasch | Anders Søgaard | Bolette Sandford Pedersen
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015
Bolette Sandford Pedersen | Sussi Olsen | Lars Borin
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015

Coarse-grained sense annotation of Danish across textual domains
Sussi Olsen | Bolette S. Pedersen | Héctor Martínez Alonso | Anders Johannsen
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015

2014

The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

CLARA: A New Generation of Researchers in Common Language Resources and Their Applications
Koenraad De Smedt | Erhard Hinrichs | Detmar Meurers | Inguna Skadiņa | Bolette Pedersen | Costanza Navarretta | Núria Bel | Krister Lindén | Markéta Lopatková | Jan Hajič | Gisle Andersen | Przemyslaw Lenkiewicz
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

CLARA (Common Language Resources and Their Applications) is a Marie Curie Initial Training Network which ran from 2009 until 2014 with the aim of providing researcher training in crucial areas related to language resources and infrastructure. The scope of the project was broad and included infrastructure design, lexical semantic modeling, domain modeling, multimedia and multimodal communication, applications, and parsing technologies and grammar models. An international consortium of 9 partners and 12 associate partners employed researchers in 19 new positions and organized a training program consisting of 10 thematic courses and summer/winter schools. The project has resulted in new theoretical insights as well as new resources and tools. Most importantly, the project has trained a new generation of researchers who can perform advanced research and development in language resources and technologies.

Encompassing a spectrum of LT users in the CLARIN-DK Infrastructure
Lina Henriksen | Dorte Haltrup Hansen | Bente Maegaard | Bolette Sandford Pedersen | Claus Povlsen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

CLARIN-DK is a platform with language resources constituting the Danish part of the European infrastructure CLARIN ERIC. Unlike some other language based infrastructures CLARIN-DK is not solely a repository for upload and storage of data, but also a platform of web services permitting the user to process data in various ways. This involves considerable complications in relation to workflow requirements. The CLARIN-DK interface must guide the user to perform the necessary steps of a workflow; even when the user is inexperienced and perhaps has an unclear conception of the requested results. This paper describes a user driven approach to creating a user interface specification for CLARIN-DK. We indicate how different user profiles determined different crucial interface design options. We also describe some use cases established in order to give illustrative examples of how the platform may facilitate research.

2013

Nordic and Baltic Wordnets Aligned and Compared through “WordTies”
Bolette Sandford Pedersen | Lars Borin | Markus Forsberg | Neeme Kahusk | Krister Lindén | Jyrki Niemi | Niklas Nisbeth | Lars Nygaard | Heili Orav | Eirikur Rögnvaldsson | Mitchell Seaton | Kadri Vider | Kaarlo Voionmaa
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Baltic and Nordic Parts of the European Linguistic Infrastructure
Inguna Skadiņa | Andrejs Vasiļjevs | Lars Borin | Krister Lindén | Gyri Losnegaard | Sussi Olsen | Bolette Sandford Pedersen | Roberts Rozis | Koenraad De Smedt
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Annotation of regular polysemy and underspecification
Héctor Martínez Alonso | Bolette Sandford Pedersen | Núria Bel
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

A voting scheme to detect semantic underspecification
Héctor Martínez Alonso | Núria Bel | Bolette Sandford Pedersen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The following work describes a voting system to automatically classify the sense selection of the complex types Location/Organization and Container/Content, which depend on regular polysemy, as described by the Generative Lexicon (Pustejovsky, 1995). This kind of sense alternation very often presents semantic underspecification between its two possible selected senses. This kind of underspecification is not traditionally contemplated in word sense disambiguation systems, as disambiguation systems are still coping with the need for a representation and recognition of underspecification (Pustejovsky, 2009). The data are characterized by the morphosyntactic and lexical environment of the headwords and provided as input for a classifier. The baseline decision tree classifier is compared against an eight-member voting scheme obtained from variants of the training data generated by modifications on the class representation and from two different classification algorithms, namely decision trees and k-nearest neighbors. The voting system improves the accuracy for the non-underspecified senses, but the underspecified sense remains difficult to identify.

Towards a richer wordnet representation of properties
Sanni Nimb | Bolette Sandford Pedersen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper discusses how information on properties in a currently developed Danish thesaurus can be transferred to the Danish wordnet, DanNet, and in this way enrich the wordnet with the highly relevant links between properties and their external arguments (e.g. tasty ― food). In spite of the fact that the thesaurus is still under development (two thirds still to be compiled), we perform an automatic transfer of relations from the thesaurus to the wordnet which shows promising results. In all, 2,362 property relations are automatically transferred to DanNet and 2% of the transferred material is manually validated. The pilot validation indicates that approx. 90% of the transferred relations are correctly assigned whereas around 10% are either erroneous or just not very informative, a fact which, however, can partly be explained by the incompleteness of the material at its current stage. As a further consequence, the experiment has led to a richer specification of the editor guidelines to be used in the last compilation phase of the thesaurus.

Creation of an Open Shared Language Resource Repository in the Nordic and Baltic Countries
Andrejs Vasiļjevs | Markus Forsberg | Tatiana Gornostay | Dorte Haltrup Hansen | Kristín Jóhannsdóttir | Gunn Lyse | Krister Lindén | Lene Offersgaard | Sussi Olsen | Bolette Pedersen | Eiríkur Rögnvaldsson | Inguna Skadiņa | Koenraad De Smedt | Ville Oksanen | Roberts Rozis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, main actors in the academy, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. The three horizontal multilingual actions in META-NORD are overviewed in this paper: linking and validating Nordic and Baltic wordnets, the harmonisation of multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.

2011

Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)
Bolette Sandford Pedersen | Gunta Nešpore | Inguna Skadiņa
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

Identification of sense selection in regular polysemy using shallow features
Héctor Martínez Alonso | Núria Bel | Bolette Sandford Pedersen
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

“Andre ord” – a wordnet browser for the Danish wordnet, DanNet
Anders Johannsen | Bolette Sandford Pedersen
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

Merging Specialist Taxonomies and Folk Taxonomies in Wordnets - A case Study of Plants, Animals and Foods in the Danish Wordnet
Bolette S. Pedersen | Sanni Nimb | Anna Braasch
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we investigate the problem of merging specialist taxonomies with the more intuitive folk taxonomies in lexical-semantic resources like wordnets; and we focus in particular on plants, animals and foods. We show that a traditional dictionary like Den Danske Ordbog (DDO) survives well with several inconsistencies between different taxonomies of the vocabulary and that a restructuring is therefore necessary in order to compile a consistent wordnet resource on its basis. To this end, we apply Cruse’s definitions for hyponymies, namely those of natural kinds (such as plants and animals) on the one hand and functional kinds (such as foods) on the other. We pursue this distinction in the development of the Danish wordnet, DanNet, which has recently been built on the basis of DDO and is made open source for all potential users at www.wordnet.dk. Not surprisingly, we conclude that cultural background influences the structure of folk taxonomies quite radically, and that wordnet builders must therefore consider these carefully in order to capture their central characteristics in a systematic way.

2009

What do we need to know about humans? A view into the DanNet database
Bolette Sandford Pedersen | Anna Braasch
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

Merging a Syntactic Resource with a WordNet: a Feasibility Study of a Merge between STO and DanNet
Bolette Sandford Pedersen | Anna Braasch | Lina Henriksen | Sussi Olsen | Claus Povlsen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents a feasibility study of a merge between SprogTeknologisk Ordbase (STO), which contains morphological and syntactic information, and DanNet, which is a Danish WordNet containing semantic information in terms of synonym sets and semantic relations. The aim of the merge is to develop a richer, composite resource which we believe will have a broader usage perspective than the two seen in isolation. In STO, the organizing principle is based on the observable syntactic features of a lemma’s near context (labeled syntactic units or SynUs). In contrast, the basic unit in DanNet is constituted by semantic senses or - in wordnet terminology - synonym sets (synsets). The merge of the two resources is thus basically to be understood as a linking between SynUs and synsets. In the paper we discuss which parts of the merge can be performed semi-automatically and which parts require manual linguistic matching procedures. We estimate that this manual work will amount to approx. 39% of the lexicon material.

2006

Query Expansion on Compounds
Bolette Sandford Pedersen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Compounds constitute a specific issue in search, in particular in languages where they are written in one word, as is the case for Danish and the other Scandinavian languages. For such languages, expansion of the query compound into separate lemmas is a way of finding the often frequent alternative synonymous phrases in which the content of a compound can also be expressed. However, it is crucial to note that the number of irrelevant hits is generally very high when using this expansion strategy. The aim of this paper is to examine how we can obtain better search results on split compounds, partly by looking at the internal structure of the original compound, partly by analyzing the context in which the split compound occurs. We perform an NP analysis and introduce a new, linguistically based threshold for retrieved hits. The results obtained by using this strategy demonstrate that compound splitting combined with a shallow linguistic analysis focusing on the recognition of NPs can improve search by bringing down the number of irrelevant hits.

2004

“Human Language Technology Elements in a Knowledge Organisation System - The VID Project”
Costanza Navarretta | Bolette Sandford Pedersen | Dorte Haltrup Hansen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper describes how Human Language Technologies and linguistic resources are used to support the construction of components of a knowledge organisation system. In particular we focus on methodologies and resources for building a corpus-based domain ontology and extracting relevant metadata information for text chunks from domain-specific corpora.

Some Tests of an Unsupervised Model of Language Acquisition
Bo Pedersen | Shimon Edelman | Zach Solan | David Horn
Proceedings of the Workshop on Psycho-Computational Models of Human Language Acquisition

2002

Semantic Lexical Resources Applied to Content-based Querying - the OntoQuery Project
Bolette S. Pedersen | Patrizia Paggio
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

Semantic Encoding of Danish Verbs in SIMPLE - Adapting a Verb Framed Model to a Satellite-framed Language
Bolette Sandford Pedersen | Sanni Nimb
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1999

SIMPLE- Semantic Information for Multifunctional Plurilingual Lexica: Some Examples of Danish Concrete Nouns
Bolette Sandford Pedersen | Britt Keson
SIGLEX99: Standardizing Lexical Resources
