Leo Wanner


2020

ThemePro: A Toolkit for the Analysis of Thematic Progression
Monica Dominguez | Juan Soler | Leo Wanner
Proceedings of the 12th Language Resources and Evaluation Conference

This paper introduces ThemePro, a toolkit for the automatic analysis of thematic progression. Thematic progression is relevant to natural language processing (NLP) applications dealing, among others, with discourse structure, argumentation structure, natural language generation, summarization and topic detection. A web platform demonstrates the potential of this toolkit and provides a visualization of the results including syntactic trees, hierarchical thematicity over propositions and thematic progression over whole texts.

Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets
Paula Fortuna | Juan Soler | Leo Wanner
Proceedings of the 12th Language Resources and Evaluation Conference

The automatic detection of hate speech and related concepts has attracted considerable interest in recent years. Different datasets have been annotated and classified by applying different machine learning algorithms. However, few efforts have been made to clarify the applied categories and to homogenize different datasets. Our study takes up this demand. We analyze six different publicly available datasets in this field with respect to their similarity and compatibility. We conduct two different experiments. First, we try to make the datasets compatible and represent the dataset classes as fastText word vectors, analyzing the similarity between different classes in an intra- and inter-dataset manner. Second, we submit the chosen datasets to the Perspective API Toxicity classifier, achieving different performances depending on the categories and datasets. One of the main conclusions of these experiments is that many different definitions are being used for equivalent concepts, which makes most of the publicly available datasets incompatible. Grounded in our analysis, we provide guidelines for future dataset collection and annotation.

CollFrEn: Rich Bilingual English–French Collocation Resource
Beatriz Fisas | Joan Codina-Filbá | Luis Espinosa Anke | Leo Wanner
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

Collocations in the sense of idiosyncratic lexical co-occurrences of two syntactically bound words traditionally pose a challenge to language learners and many Natural Language Processing (NLP) applications alike. Reliable ground truth (i.e., ideally manually compiled) resources are thus of high value. We present a manually compiled bilingual English–French collocation resource with 7,480 collocations in English and 6,733 in French. Each collocation is enriched with information that facilitates its downstream exploitation in NLP tasks such as machine translation, word sense disambiguation, natural language generation, relation classification, and so forth. Our proposed enrichment covers: the semantic category of the collocation (its lexical function), its vector space representation (for each individual word as well as their joint collocation embedding), a subcategorization pattern of both its elements, as well as their corresponding BabelNet id, and finally, indices of their occurrences in large scale reference corpora.

Proceedings of the Third Workshop on Multilingual Surface Realisation
Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Simon Mille | Leo Wanner
Proceedings of the Third Workshop on Multilingual Surface Realisation

The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results
Simon Mille | Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Leo Wanner
Proceedings of the Third Workshop on Multilingual Surface Realisation

This paper presents results from the Third Shared Task on Multilingual Surface Realisation (SR’20) which was organised as part of the COLING’20 Workshop on Multilingual Surface Realisation. As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track where additionally, functional words and morphological information were removed. Moreover, each track had two subtracks: (a) restricted-resource, where only the data provided or approved as part of a track could be used for training models, and (b) open-resource, where any data could be used. The Shallow Track was offered in 11 languages and the Deep Track in 3. Systems were evaluated using both automatic metrics and direct assessment by human evaluators in terms of Readability and Meaning Similarity to reference outputs. We present the evaluation results, along with descriptions of the SR’19 tracks, data and evaluation methods, as well as brief summaries of the participating systems. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

2019

Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Leo Wanner
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

The Second Multilingual Surface Realisation Shared Task (SR’19): Overview and Evaluation Results
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Leo Wanner
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

We report results from the SR’19 Shared Task, the second edition of a multilingual surface realisation task organised as part of the EMNLP’19 Workshop on Multilingual Surface Realisation. As in SR’18, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where additionally, functional words and morphological information were removed. The shallow track was offered in eleven, and the deep track in three languages. Systems were evaluated (a) automatically, using a range of intrinsic metrics, and (b) by human judges in terms of readability and meaning similarity. This report presents the evaluation results, along with descriptions of the SR’19 tracks, data and evaluation methods. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

Collocation Classification with Unsupervised Relation Vectors
Luis Espinosa Anke | Steven Schockaert | Leo Wanner
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Lexical relation classification is the task of predicting whether a certain relation holds between a given pair of words. In this paper, we explore to what extent the current distributional landscape based on word embeddings provides a suitable basis for classification of collocations, i.e., pairs of words between which idiosyncratic lexical relations hold. First, we introduce a novel dataset with collocations categorized according to lexical functions. Second, we conduct experiments on a subset of this benchmark, comparing it in particular to the well-known DiffVec dataset. In these experiments, in addition to simple word vector arithmetic operations, we also investigate the role of unsupervised relation vectors as a complementary input. While these relation vectors indeed help, we also show that lexical function classification poses a greater challenge than the syntactic and semantic relations that are typically used for benchmarks in the literature.

A Hierarchically-Labeled Portuguese Hate Speech Dataset
Paula Fortuna | João Rocha da Silva | Juan Soler-Company | Leo Wanner | Sérgio Nunes
Proceedings of the Third Workshop on Abusive Language Online

Over the past years, the amount of online offensive speech has been growing steadily. To successfully cope with it, machine learning techniques are applied. However, ML-based techniques require sufficiently large annotated datasets. In recent years, different datasets have been published, mainly for English. In this paper, we present a new dataset for Portuguese, which has not been in focus so far. The dataset is composed of 5,668 tweets. For its annotation, we defined two different schemes used by annotators with different levels of expertise. Firstly, non-experts annotated the tweets with binary labels (‘hate’ vs. ‘no-hate’). Secondly, expert annotators classified the tweets following a fine-grained hierarchical multiple label scheme with 81 hate speech categories in total. The inter-annotator agreement varied from category to category, which reflects the insight that some types of hate speech are more subtle than others and that their detection depends on personal perception. This hierarchical annotation scheme is the main contribution of the presented work, as it facilitates the identification of different types of hate speech and their intersections. To demonstrate the usefulness of our dataset, we carried out a baseline classification experiment with pre-trained word embeddings and LSTM on the binary classified data, with a state-of-the-art outcome.

Teaching FORGe to Verbalize DBpedia Properties in Spanish
Simon Mille | Stamatia Dasiopoulou | Beatriz Fisas | Leo Wanner
Proceedings of the 12th International Conference on Natural Language Generation

Statistical generators increasingly dominate the research in NLG. However, grammar-based generators that are grounded in a solid linguistic framework remain very competitive, especially for generation from deep knowledge structures. Furthermore, if built modularly, they can be ported to other genres and languages with a limited amount of work, without the need to annotate a considerable amount of training data. One of these generators is FORGe, which is based on the Meaning-Text Model. In the recent WebNLG challenge (the first comprehensive task addressing the mapping of RDF triples to text) FORGe ranked first with respect to the overall quality in human evaluation. We extend the coverage of FORGe’s open source grammatical and lexical resources for English, so as to further improve the English texts, and port them to Spanish, to achieve a comparable quality. This confirms that, as already observed in the case of SimpleNLG, a robust universal grammar-driven framework and a systematic organization of the linguistic resources can be an adequate choice for NLG applications.

2018

Generation of a Spanish Artificial Collocation Error Corpus
Sara Rodríguez-Fernández | Roberto Carlini | Leo Wanner
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Compilation of Corpora for the Study of the Information Structure–Prosody Interface
Alicia Burga | Mónica Domínguez | Mireia Farrús | Leo Wanner
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Proceedings of the First Workshop on Multilingual Surface Realisation
Simon Mille | Anja Belz | Bernd Bohnet | Emily Pitler | Leo Wanner
Proceedings of the First Workshop on Multilingual Surface Realisation

The First Multilingual Surface Realisation Shared Task (SR’18): Overview and Evaluation Results
Simon Mille | Anja Belz | Bernd Bohnet | Yvette Graham | Emily Pitler | Leo Wanner
Proceedings of the First Workshop on Multilingual Surface Realisation

We report results from the SR’18 Shared Task, a new multilingual surface realisation task organised as part of the ACL’18 Workshop on Multilingual Surface Realisation. As in its English-only predecessor task SR’11, the shared task comprised two tracks with different levels of complexity: (a) a shallow track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a deep track where additionally, functional words and morphological information were removed. The shallow track was offered in ten, and the deep track in three languages. Systems were evaluated (a) automatically, using a range of intrinsic metrics, and (b) by human judges in terms of readability and meaning similarity. This report presents the evaluation results, along with descriptions of the SR’18 tracks, data and evaluation methods. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

Underspecified Universal Dependency Structures as Inputs for Multilingual Surface Realisation
Simon Mille | Anja Belz | Bernd Bohnet | Leo Wanner
Proceedings of the 11th International Conference on Natural Language Generation

In this paper, we present the datasets used in the Shallow and Deep Tracks of the First Multilingual Surface Realisation Shared Task (SR’18). For the Shallow Track, data in ten languages has been released: Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian and Spanish. For the Deep Track, data in three languages is made available: English, French and Spanish. We describe in detail how the datasets were derived from the Universal Dependencies V2.0, and report on an evaluation of the Deep Track input quality. In addition, we examine the motivation for, and likely usefulness of, deriving NLG inputs from annotations in resources originally developed for Natural Language Understanding (NLU), and assess whether the resulting inputs supply enough information of the right kind for the final stage in the NLG process.

Sentence Packaging in Text Generation from Semantic Graphs as a Community Detection Problem
Alexander Shvets | Simon Mille | Leo Wanner
Proceedings of the 11th International Conference on Natural Language Generation

An increasing amount of research tackles the challenge of text generation from abstract ontological or semantic structures, which are by their very nature potentially large connected graphs. These graphs must be “packaged” into sentence-wise subgraphs. We interpret the problem of sentence packaging as a community detection problem with post-optimization. Experiments on the texts of the VerbNet/FrameNet structure-annotated Penn Treebank, which have been converted into graphs by a coreference merge using Stanford CoreNLP, show a high F1-score of 0.738.

2017

On the Relevance of Syntactic and Discourse Features for Author Profiling and Identification
Juan Soler-Company | Leo Wanner
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The majority of approaches to author profiling and author identification focus mainly on lexical features, i.e., on the content of a text. We argue that syntactic and discourse features play a significantly more prominent role than they have been given in the past. We show that they achieve state-of-the-art performance in author and gender identification on a literary corpus while keeping the feature set small: the feature set used is composed of only 188 features and still outperforms the winner of the PAN 2014 shared task on author verification in the literary genre.

FORGe at SemEval-2017 Task 9: Deep sentence generation based on a sequence of graph transducers
Simon Mille | Roberto Carlini | Alicia Burga | Leo Wanner
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We present the contribution of Universitat Pompeu Fabra’s NLP group to the SemEval Task 9.2 (AMR-to-English Generation). The proposed generation pipeline comprises: (i) a series of rule-based graph-transducers for the syntacticization of the input graphs and the resolution of morphological agreements, and (ii) an off-the-shelf statistical linearization component.

Automatic Extraction of Parallel Speech Corpora from Dubbed Movies
Alp Öktem | Mireia Farrús | Leo Wanner
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper presents a methodology to extract parallel speech corpora based on any language pair from dubbed movies, together with an application framework in which some corresponding prosodic parameters are extracted. The obtained parallel corpora are especially suitable for speech-to-speech translation applications when a prosody transfer between source and target languages is desired.

Shared Task Proposal: Multilingual Surface Realization Using Universal Dependency Trees
Simon Mille | Bernd Bohnet | Leo Wanner | Anja Belz
Proceedings of the 10th International Conference on Natural Language Generation

We propose a shared task on multilingual Surface Realization, i.e., on mapping unordered and uninflected universal dependency trees to correctly ordered and inflected sentences in a number of languages. A second deeper input will be available in which, in addition, functional words, fine-grained PoS and morphological information will be removed from the input trees. The first shared task on Surface Realization was carried out in 2011 with a similar setup, with a focus on English. We think that it is time for relaunching such a shared task effort in view of the arrival of Universal Dependencies annotated treebanks for a large number of languages on the one hand, and the increasing dominance of Deep Learning, which proved to be a game changer for NLP, on the other hand.

A demo of FORGe: the Pompeu Fabra Open Rule-based Generator
Simon Mille | Leo Wanner
Proceedings of the 10th International Conference on Natural Language Generation

This demo paper presents the multilingual deep sentence generator developed by the TALN group at Universitat Pompeu Fabra, implemented as a series of rule-based graph-transducers for the syntacticization of the input graphs, the resolution of morphological agreements, and the linearization of the trees.

Revising the METU-Sabancı Turkish Treebank: An Exercise in Surface-Syntactic Annotation of Agglutinative Languages
Alicia Burga | Alp Öktem | Leo Wanner
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2016

A Semi-Supervised Approach for Gender Identification
Juan Soler | Leo Wanner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In most research studies on Author Profiling, large quantities of correctly labeled data are used to train the models. However, this does not reflect the reality in forensic scenarios: in practical linguistic forensic investigations, the resources that are available to profile the author of a text are usually scarce. To account for this fact, we implemented a Semi-Supervised Learning variant of the k nearest neighbors algorithm that uses small sets of labeled data and a larger amount of unlabeled data to classify the authors of texts by gender (man vs woman). We describe the enriched KNN algorithm and show that the use of unlabeled instances improves the accuracy of our gender identification model. We also present a feature set that facilitates the use of a very small number of instances, reaching accuracies higher than 70% with only 113 instances to train the model. We further show that the algorithm performs well using publicly available data.

Towards Multiple Antecedent Coreference Resolution in Specialized Discourse
Alicia Burga | Sergio Cajal | Joan Codina-Filbà | Leo Wanner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Despite the popularity of coreference resolution as a research topic, the overwhelming majority of the work in this area has so far focused on single-antecedent coreference only. Multiple antecedent coreference (MAC) has been largely neglected. This can be explained by the scarcity of the phenomenon of MAC in generic discourse. However, in specialized discourse such as patents, MAC is very dominant. It thus seems unavoidable to address the problem of MAC resolution in the context of tasks related to automatic patent material processing, among them abstractive summarization, deep parsing of patents, construction of concept maps of the inventions, etc. We present the first version of an operational rule-based MAC resolution strategy for patent material that covers the three major types of MAC: (i) nominal MAC, (ii) MAC with personal / relative pronouns, and (iii) MAC with reflexive / reciprocal pronouns. The evaluation shows that our strategy performs well in terms of precision and recall.

Example-based Acquisition of Fine-grained Collocation Resources
Sara Rodríguez-Fernández | Roberto Carlini | Luis Espinosa Anke | Leo Wanner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Collocations such as “heavy rain” or “make [a] decision” are combinations of two elements where one (the base) is freely chosen, while the choice of the other (the collocate) is restricted, depending on the base. Collocations present difficulties even to advanced language learners, who usually struggle to find the right collocate to express a particular meaning, e.g., both “heavy” and “strong” express the meaning ‘intense’, but while “rain” selects “heavy”, “wind” selects “strong”. Lexical Functions (LFs) describe the meanings that hold between the elements of collocations, such as ‘intense’, ‘perform’, ‘create’, ‘increase’, etc. Language resources with semantically classified collocations would be of great help for students; however, they are expensive to build, since they are manually constructed, and scarce. We present an unsupervised approach to the acquisition and semantic classification of collocations according to LFs, based on word embeddings in which, given an example of a collocation for each of the target LFs and a set of bases, the system retrieves a list of collocates for each base and LF.

A Neural Network Architecture for Multilingual Punctuation Generation
Miguel Ballesteros | Leo Wanner
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

An Automatic Prosody Tagger for Spontaneous Speech
Mónica Domínguez | Mireia Farrús | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Speech prosody is known to be central in advanced communication technologies. However, despite the advances of theoretical studies in speech prosody, so far, no large scale prosody annotated resources that would facilitate empirical research and the development of empirical computational approaches are available. This is to a large extent due to the fact that current common prosody annotation conventions offer a descriptive framework of intonation contours and phrasing based on labels. This makes it difficult to reach a satisfactory inter-annotator agreement during the annotation of gold standard annotations and, subsequently, to create consistent large scale annotations. To address this problem, we present an annotation schema for prominence and boundary labeling of prosodic phrases based upon acoustic parameters and a tagger for prosody annotation at the prosodic phrase level. Evaluation proves that inter-annotator agreement reaches satisfactory values, from 0.60 to 0.80 Cohen’s kappa, while the prosody tagger achieves acceptable recall and f-measure figures for five spontaneous samples used in the evaluation of monologue and dialogue formats in English and Spanish. The work presented in this paper is a first step towards a semi-automatic acquisition of large corpora for empirical prosodic analysis.

Extending WordNet with Fine-Grained Collocational Information via Supervised Distributional Learning
Luis Espinosa-Anke | Jose Camacho-Collados | Sara Rodríguez-Fernández | Horacio Saggion | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

WordNet is probably the best-known lexical resource in Natural Language Processing. While it is widely regarded as a high quality repository of concepts and semantic relations, updating and extending it manually is costly. One important type of information which could potentially add enormous value to WordNet is collocational information, which is paramount in tasks such as Machine Translation, Natural Language Generation and Second Language Learning. In this paper, we present ColWordNet (CWN), an extended WordNet version with fine-grained collocational information, automatically introduced thanks to a method exploiting linear relations between analogous sense-level embedding spaces. We perform both intrinsic and extrinsic evaluations, and release CWN for the use and scrutiny of the community.

Praat on the Web: An Upgrade of Praat for Semi-Automatic Speech Annotation
Mónica Domínguez | Iván Latorre | Mireia Farrús | Joan Codina-Filbà | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

This paper presents an implementation of the widely used speech analysis tool Praat as a web application with an extended functionality for feature annotation. In particular, Praat on the Web addresses some of the central limitations of the original Praat tool and provides (i) enhanced visualization of annotations in a dedicated window for feature annotation at interval and point segments, (ii) a dynamic scripting composition exemplified with a modular prosody tagger, and (iii) portability and an operational web interface. Speech annotation tools with such a functionality are key for exploring large corpora and designing modular pipelines.

Semantics-Driven Recognition of Collocations Using Word Embeddings
Sara Rodríguez-Fernández | Luis Espinosa-Anke | Roberto Carlini | Leo Wanner
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

Classification of Lexical Collocation Errors in the Writings of Learners of Spanish
Sara Rodríguez-Fernández | Roberto Carlini | Leo Wanner
Proceedings of the International Conference Recent Advances in Natural Language Processing

Towards a multi-layered dependency annotation of Finnish
Alicia Burga | Simon Mille | Anton Granvik | Leo Wanner
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

Data-driven sentence generation with non-isomorphic trees
Miguel Ballesteros | Bernd Bohnet | Simon Mille | Leo Wanner
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Visualizing Deep-Syntactic Parser Output
Juan Soler-Company | Miguel Ballesteros | Bernd Bohnet | Simon Mille | Leo Wanner
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2014

Improving Collocation Correction by Ranking Suggestions Using Linguistic Knowledge
Roberto Carlini | Joan Codina-Filba | Leo Wanner
Proceedings of the third workshop on NLP for computer-assisted language learning

Classifiers for data-driven deep sentence generation
Miguel Ballesteros | Simon Mille | Leo Wanner
Proceedings of the 8th International Natural Language Generation Conference (INLG)

Deep-Syntactic Parsing
Miguel Ballesteros | Bernd Bohnet | Simon Mille | Leo Wanner
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

How to Use less Features and Reach Better Performance in Author Gender Identification
Juan Soler Company | Leo Wanner
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Over the last years, author profiling in general and author gender identification in particular have become a popular research area due to their potential attractive applications that range from forensic investigations to online marketing studies. However, nearly all state-of-the-art works in the area still very much depend on the datasets they were trained and tested on, since they heavily draw on content features, mostly a large number of recurrent words or combinations of words extracted from the training sets. We show that using a small number of features that mainly depend on the structure of the texts we can outperform other approaches that depend mainly on the content of the texts and that use a huge number of features in the process of identifying if the author of a text is a man or a woman. Our system has been tested against a dataset constructed for our work as well as against two datasets that were previously used in other papers.

An Exercise in Reuse of Resources: Adapting General Discourse Coreference Resolution for Detecting Lexical Chains in Patent Documentation
Nadjet Bouayad-Agha | Alicia Burga | Gerard Casamayor | Joan Codina | Rogelio Nazar | Leo Wanner
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Stanford Coreference Resolution System (StCR) is a multi-pass, rule-based system that scored best in the CoNLL 2011 shared task on general discourse coreference resolution. We describe how the StCR has been adapted to the specific domain of patents and give some cues on how it can be adapted to other domains. We present a linguistic analysis of the patent domain and how we were able to adapt the rules to the domain and to expand coreferences with some lexical chains. A comparative evaluation shows an improvement of the coreference resolution system, denoting that (i) StCR is a valuable tool across different text genres; (ii) specialized discourse NLP may significantly benefit from general discourse NLP research.

2013

Towards the Annotation of Penn TreeBank with Information Structure
Bernd Bohnet | Alicia Burga | Leo Wanner
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Overview of the First Content Selection Challenge from Open Semantic Web Data
Nadjet Bouayad-Agha | Gerard Casamayor | Leo Wanner | Chris Mellish
Proceedings of the 14th European Workshop on Natural Language Generation

Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)
Eva Hajičová | Kim Gerdes | Leo Wanner
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

AnCora-UPF: A Multi-Level Annotation of Spanish
Simon Mille | Alicia Burga | Leo Wanner
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

2012

Towards a Surface Realization-Oriented Corpus Annotation
Leo Wanner | Simon Mille | Bernd Bohnet
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

The Surface Realisation Task: Recent Developments and Future Plans
Anja Belz | Bernd Bohnet | Simon Mille | Leo Wanner | Michael White
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

Content Selection From Semantic Web Data
Nadjet Bouayad-Agha | Gerard Casamayor | Leo Wanner | Chris Mellish
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

How Does the Granularity of an Annotation Scheme Influence Dependency Parsing Performance?
Simon Mille | Alicia Burga | Gabriela Ferraro | Leo Wanner
Proceedings of COLING 2012: Posters

2011

Content selection from an ontology-based knowledge base for the generation of football summaries
Nadjet Bouayad-Agha | Gerard Casamayor | Leo Wanner
Proceedings of the 13th European Workshop on Natural Language Generation

<StuMaBa>: From Deep Representation to Surface
Bernd Bohnet | Simon Mille | Benoît Favre | Leo Wanner
Proceedings of the 13th European Workshop on Natural Language Generation

2010

pdf bib
Broad Coverage Multilingual Deep Sentence Generation with a Stochastic Multi-Level Realizer
Bernd Bohnet | Leo Wanner | Simon Mille | Alicia Burga
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Open Source Graph Transducer Interpreter and Grammar Development Environment
Bernd Bohnet | Leo Wanner
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Graph and tree transducers have been applied in many NLP areas, among them machine translation, summarization, parsing, and text generation. In particular, the successful use of tree rewriting transducers for the introduction of syntactic structures in statistical machine translation contributed to their popularity. However, the potential of such transducers is limited because they do not handle graphs and because they "consume" the source structure in that they rewrite it instead of leaving it intact for intermediate consultations. In this paper, we describe an open source tree and graph transducer interpreter, which combines the advantages of graph transducers and two-tape Finite State Transducers and overcomes the limitations of state-of-the-art tree rewriting transducers. Along with the transducer, we present a graph grammar development environment that supports the compilation and maintenance of graph transducer grammatical and lexical resources. Such an environment is indispensable for any effort by human experts to create consistent large-coverage NLP resources.

pdf bib
Syntactic Dependencies for Multilingual and Multilevel Corpus Annotation
Simon Mille | Leo Wanner
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The relevance of syntactic dependency annotated corpora is nowadays unquestioned. However, a broad debate on the optimal set of dependency relation tags has not yet taken place. As a result, widely varying tag sets of widely varying sizes are used in different annotation initiatives. We propose a hierarchical dependency structure annotation schema that is more detailed and more flexible than the known annotation schemata. The schema allows us to choose the desired level of annotation detail, which facilitates the use of the schema for corpus annotation for different languages and for different NLP applications. Thanks to the inclusion of semantico-syntactic tags in the schema, we can annotate a corpus not only with syntactic dependency structures, but also with valency patterns as they are usually found in separate treebanks such as PropBank and NomBank. Semantico-syntactic tags and the level of detail of the schema furthermore facilitate the derivation of deep-syntactic and semantic annotations, leading to truly multilevel annotated dependency corpora. Such multilevel annotations can be readily used for the task of ML-based acquisition of grammar resources that map between the different levels of linguistic representation, something which forms part of, for instance, any natural language text generator.

pdf bib
Towards a Motivated Annotation Schema of Collocation Errors in Learner Corpora
Margarita Alonso Ramos | Leo Wanner | Orsolya Vincze | Gerard Casamayor del Bosque | Nancy Vázquez Veiga | Estela Mosqueira Suárez | Sabela Prieto González
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Collocations play a significant role in second language acquisition. In order to be able to offer efficient support to learners, an NLP-based CALL environment for learning collocations should be based on a representative collocation error annotated learner corpus. However, so far, no theoretically motivated collocation error tag set is available. Existing learner corpora tag collocation errors simply as “lexical errors”, which is clearly insufficient given the wide range of different collocation errors that learners make. In this paper, we present a fine-grained three-dimensional typology of collocation errors that has been derived in an empirical study from the learner corpus CEDEL2 compiled by a team at the Autonomous University of Madrid. The first dimension captures whether the error concerns the collocation as a whole or one of its elements; the second dimension captures the language-oriented error analysis, while the third exemplifies the interpretative error analysis. To facilitate a smooth annotation along this typology, we adapted Knowtator, a flexible off-the-shelf annotation tool implemented as a Protégé plugin.

2008

pdf bib
Using Semantically Annotated Corpora to Build Collocation Resources
Margarita Alonso Ramos | Owen Rambow | Leo Wanner
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present an experiment in extracting collocations from the FrameNet corpus, specifically, support verbs such as "direct" in "Environmentalists directed strong criticism at world leaders". Support verbs do not contribute meaning of their own and the meaning of the construction is provided by the noun; the recognition of support verbs is thus useful in text understanding. Having access to a list of support verbs is also useful in applications that can benefit from paraphrasing, such as generation (where paraphrasing can provide variety). This paper starts with a brief presentation of support verbs in Meaning-Text Theory, where they fall under the notion of lexical function, and then discusses how relevant information is encoded in the FrameNet corpus. We describe the resource extracted from the FrameNet corpus.

pdf bib
Making Text Resources Accessible to the Reader: the Case of Patent Claims
Simon Mille | Leo Wanner
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Hardly any other kind of text is as notoriously difficult to read as patents. This is first of all due to their abstract vocabulary and their very complex syntactic constructions. The claims in a patent are especially challenging: in accordance with international patent writing regulations, each claim must be rendered in a single sentence. As a result, sentences with more than 200 words are not uncommon. Therefore, paraphrasing of the claims in terms the user can understand is in high demand. We present a rule-based paraphrasing module that realizes paraphrasing of patent claims in English as a rewriting task. Prior to the rewriting proper, the module involves the stages of simplification and discourse and syntactic analyses. The rewriting makes use of a full-fledged text generator and consists of a number of genuine generation tasks such as aggregation, selection of referring expressions, choice of discourse markers and syntactic generation. As generator, we use the MATE workbench, which is based on the Meaning-Text Theory of linguistics.

pdf bib
Multilingual summarization in practice: the case of patent claims
Simon Mille | Leo Wanner
Proceedings of the 12th Annual conference of the European Association for Machine Translation

pdf bib
Two-step flow in bilingual lexicon extraction from unrelated corpora
Rogelio Nazar | Leo Wanner | Jorge Vivaldi
Proceedings of the 12th Annual conference of the European Association for Machine Translation

2006

pdf bib
Local Document Relevance Clustering in IR Using Collocation Information
Leo Wanner | Margarita Alonso Ramos
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A series of different automatic query expansion techniques has been suggested in Information Retrieval. To estimate how suitable a document term is as an expansion term, the most popular of them use a measure of the frequency of the co-occurrence of this term with one or several query terms. The benefit of using the linguistic relations that hold between query terms is often questioned. If a linguistic phenomenon is taken into account at all, it is the phrase structure or lexical compound. We propose a technique that is based on the restricted lexical co-occurrence (collocation) of query terms. We use the knowledge of collocations formed by query terms for two tasks: (i) document relevance clustering done in the first stage of local query expansion and (ii) choice of suitable expansion terms from the relevant document cluster. In this paper, we describe the first task, providing evidence from preliminary experiments on Spanish material that local relevance clustering benefits largely from knowledge of collocations.

2004

pdf bib
Enriching the Spanish EuroWordNet by Collocations
Leo Wanner | Margarita Alonso Ramos | Antonia Martí
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

pdf bib
Deriving the Communicative Structure in Applied NLG
Leo Wanner | Bernd Bohnet | Mark Giereth
Proceedings of the 9th European Workshop on Natural Language Generation (ENLG-2003) at EACL 2003

2001

pdf bib
On Using a Parallel Graph Rewriting Formalism in Generation
Bernd Bohnet | Leo Wanner
Proceedings of the ACL 2001 Eighth European Workshop on Natural Language Generation (EWNLG)

2000

pdf bib
A Development Environment for an MTT-Based Sentence Generator
Bernd Bohnet | Andreas Langjahr | Leo Wanner
INLG’2000 Proceedings of the First International Conference on Natural Language Generation

1998

pdf bib
De-Constraining Text Generation
Stephen Beale | Sergei Nirenburg | Evelyne Viegas | Leo Wanner
Natural Language Generation

1996

pdf bib
The HealthDoc Sentence Planner
Leo Wanner | Eduard Hovy
Eighth International Natural Language Generation Workshop

1994

pdf bib
On Lexically Biased Discourse Organization in Text Generation
Leo Wanner
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

pdf bib
Building Another Bridge over the Generation Gap
Leo Wanner
Proceedings of the Seventh International Workshop on Natural Language Generation

1990

pdf bib
A collocational based approach to salience-sensitive lexical selection
Leo Wanner | John A. Bateman
Proceedings of the Fifth International Workshop on Natural Language Generation