Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa, Raquel Silva (Editors)


Anthology ID:
L04-1
Month:
May
Year:
2004
Address:
Lisbon, Portugal
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)

pdf bib
Can We Talk? Prospects for Automatically Training Spoken Dialogue Systems
Marilyn Walker

pdf bib
Strategic Directions of National and International Research Funding
Hans Uszkoreit

pdf bib
Multilingual Content Processing
Gregor Thurmair

pdf bib
Collaborative Commentary: Opening Up Spoken Language Databases
Brian MacWhinney

pdf bib
Getting to the Heart of the Matter; Speech is More than Just the Expression of Text or Language
Nick Campbell

pdf bib
Industrial Needs for Language Resources
Bente Maegaard

pdf bib
Thesaurus or Logical Ontology, Which do we Need for Mining Text?
Junichi Tsujii

pdf bib
Information Extraction from Hindi Texts
Kamlesh Dutta | Saroj Kaushik | Nupur Prakash

pdf bib
The Language Belongs to the People!
Cornelis H.A. Koster | Stefan Gradmann

pdf bib
ALLES: Integrating NLP in ICALL Applications
Paul Schmidt | Sandrine Garnier | Mike Sharwood | Toni Badia | Lourdes Díaz | Martí Quixal | Ana Ruggia | Antonio S. Valderrabanos | Alberto J. Cruz | Enrique Torrejon | Celia Rico | Jorge Jimenez

pdf bib
The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation
George Doddington | Alexis Mitchell | Mark Przybocki | Lance Ramshaw | Stephanie Strassel | Ralph Weischedel

pdf bib
Multilingual Corpus-based Approach to the Resolution of English –ing
Lee Schwartz | Takako Aikawa

pdf bib
On the Problems of Creating a Golden Standard of Inflected Forms in Portuguese
Diana Santos | Anabela Barreiro

pdf bib
INSPIRE: Evaluation of a Smart-Home System for Infotainment Management and Device Control
Sebastian Möller | Jan Krebber | Alexander Raake | Paula Smeele | Martin Rajman | Mirek Melichar | Vincenzo Pallotta | Gianna Tsakou | Basilis Kladis | Anestis Vovos | Jettie Hoonhout | Dietmar Schuchardt | Nikos Fakotakis | Todor Ganchev | Ilyas Potamitis

pdf bib
Evaluating Multimodal NLG Using Production Experiments
Ielka van der Sluis | Emiel Krahmer

pdf bib
Concept Creation in Lexical Ontologies
Nuno Seco | Tony Veale | Jer Hayes

pdf bib
Polysemy and Category Structure in WordNet: An Evidential Approach
Tony Veale

pdf bib
Towards a Reference Annotation Framework
Susanne Salmon-Alt | Laurent Romary

pdf bib
A New ITU-T Recommendation on the Evaluation of Telephone-Based Spoken Dialogue Systems
Sebastian Möller

pdf bib
Raising the Bar: Stacked Conservative Error Correction Beyond Boosting
Dekai Wu | Grace Ngai | Marine Carpuat

pdf bib
An Analysis of the Relative Difficulty of Reuters-21578 Subsets
Franca Debole | Fabrizio Sebastiani

pdf bib
Experiences in Collection of Handwriting Data for Online Handwriting Recognition in Indic Scripts
Ajay S. Bhaskarabhatla | Sriganesh Madhvanath

pdf bib
Collocation Extraction Using Web Statistics
Hsin-Hsi Chen | Yi-Cheng Yu | Chih-Long Lin

pdf bib
An XML Representation for Annotated Handwriting Datasets for Online Handwriting Recognition
Ajay S. Bhaskarabhatla | Sriganesh Madhvanath

pdf bib
Reusing Language Resources for Speech Applications involving Emotion
Christina Alexandris | Stavroula-Evita Fotinea

pdf bib
Designing and Recording an Audiovisual Database of Emotional Speech in Basque
Eva Navas | Amaia Castelruiz | Iker Luengo | Jon Sánchez | Inmaculada Hernáez

pdf bib
Evaluation of Different Similarity Measures for the Extraction of Multiword Units in a Reinforcement Learning Environment
Gaël Dias | Sérgio Nunes

pdf bib
Terminal Device Oriented Comparable Corpora and its Alignment- Towards Extracting Paraphrasing Patterns
Hiroshi Nakagawa | Hidetaka Masuda | Dai Sato

pdf bib
Towards Basic Categories for Describing Properties of Texts in a Corpus
Serge Sharoff

pdf bib
Using Weighted Abduction to Align Term Variant Translations in Bilingual Texts
Michael Carl | Ecaterina Rascu | Johann Haller

pdf bib
Investigation on Semantics to Improve the COVAX System
Luciana Bordoni

pdf bib
Incremental Knowledge Acquisition from WordNet and EuroWordNet
Wim Peters

pdf bib
Finding Semantic Associations on Express Lane
Vivi Năstase | Rada Mihalcea

pdf bib
Infrastructure for Collaborative Annotation of Speech
Mickel Grönroos | Manne Miettinen

pdf bib
Automatic Language-Independent Induction of Gazetteer Lists
Diana Maynard | Kalina Bontcheva | Hamish Cunningham

pdf bib
Corpus Design, Recording and Phonetic Analysis of Greek Emotional Database
Nikos Fakotakis

pdf bib
Human Dialogue Modelling Using Annotated Corpora
Yorick Wilks | Nick Webb | Andrea Setzer | Mark Hepple | Roberta Catizone

pdf bib
CrossTowns: Automatically Generated Phonetic Lexicons of Cross-lingual Pronunciation Variants of European City Names
Stefan Schaden

pdf bib
Pattern Discovery in Named Organization Corpus
Hsin-Hsi Chen | Yi-Lin Chu

pdf bib
Connector Usage in the English Essay Writing of Japanese EFL Learners
Masumi Narita | Chieko Sato | Masatoshi Sugiura

pdf bib
Detection of Domain Specific Terminology Using Corpora Comparison
Patrick Drouin

pdf bib
Comparative Evaluation of a Stochastic Parser on Semantic and Syntactic-semantic Labels
Wolfgang Minker

pdf bib
Sinica BOW (Bilingual Ontological Wordnet): Integration of Bilingual WordNet and SUMO
Chu-Ren Huang | Ru-Yng Chang | Hsiang-Pin Lee

pdf bib
How to Disassemble Alphabetical Processions - Morphological Treatment of Unknown Words
Stephan Bopp | Sandro Pedrazzini | Elisabeth Maier

pdf bib
Creating Slovenian Language Resources for Development of Speech-to-speech Translation Components
Darinka Verdonik | Matej Rojc | Zdravko Kačič

pdf bib
Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data
Magnus Sahlgren

pdf bib
The Development and Integration of the LDA-Toolkit Into COST249 SpeechDat(II) SIG Reference Recognizer
Bojan Kotnik | Zdravko Kačič | Bogomir Horvat

pdf bib
Duration Modeling For Turkish Text-to-Speech Synthesis System
Özlem Öztürk | Özgul Salor | Tolga Çiloğlu | Mubeccel Demirekler

pdf bib
Clustering Concept Hierarchies from Text
Philipp Cimiano | Andreas Hotho | Steffen Staab

pdf bib
NIST Language Technology Evaluation Cookbook
Alvin F. Martin | John S. Garofolo | Jonathan C. Fiscus | Audrey N. Le | David S. Pallett | Mark A. Przybocki | Gregory A. Sanders

pdf bib
Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy
Satoshi Sekine | Chikashi Nobata

pdf bib
Sejong Korean Corpora in the Making
Beom-mo Kang | Hunggyu Kim

pdf bib
Creation and Assessment of Korean Speech and Noise DB in Car Environment
Yong-Ju Lee | Bong-Wan Kim | Young-Il Kim | Dae-Lim Choi | Kwang-Hyun Lee | Yongnam Um

pdf bib
Automatic Generation of Glosses in the OntoLearn System
Alessandro Cucchiarelli | Roberto Navigli | Francesca Neri | Paola Velardi

pdf bib
The COST278 Pan-European Broadcast News Database
An Vandecatseye | Jean-Pierre Martens | Joao Neto | Hugo Meinedo | Carmen Garcia-Mateo | Javier Dieguez | France Mihelic | Janez Zibert | Jan Nouza | Petr David | Matus Pleva | Anton Cizmar | Harris Papageorgiou | Christina Alexandris

pdf bib
A Spoken Afrikaans Language Resource Designed for Research on Pronunciation Variations
Daan Wissing | Jean-Pierre Martens | Ulrike Janke | Wim Goedertier

pdf bib
The BITS Speech Synthesis Corpus for German
Tania Ellbogen | Florian Schiel | Alexander Steffen

pdf bib
MAUS Goes Iterative
Florian Schiel

pdf bib
EuroWordNet as a Resource for Cross-language Information Retrieval
Mark Stevenson | Paul Clough

pdf bib
Finding the Correct Interpretation of Swedish Compounds, a Statistical Approach
Jonas Sjöbergh | Viggo Kann

pdf bib
Automatic Extraction of Hyponyms from Japanese Newspapers. Using Lexico-syntactic Patterns
Maya Ando | Satoshi Sekine | Shun Ishizaki

pdf bib
Extending a Verb-lexicon Using a Semantically Annotated Corpus
Karin Kipper | Benjamin Snyder | Martha Palmer

pdf bib
The Centre for Dutch Language and Speech Technology (TST Centre)
J.C.T. Beeken | P.H.J. van der Kamp

pdf bib
A Global Data Category Registry for Interoperable Language Resources
Sue Ellen Wright

pdf bib
The Integrated Language Database of 8th - 21st-Century Dutch
J. G. Kruyt

pdf bib
From Acts and Topics to Transactions and Dialogue Smoothness
Hans Dybkjær | Laila Dybkjær

pdf bib
Grouping Synonymous Sentences from a Parallel Corpus
Hideki Kashioka

pdf bib
Discovery of (New) Knowledge and the Analysis of Text Corpora
Khurshid Ahmad | Maria Teresa Musacchio

pdf bib
Evaluation of Microphone Array Front-Ends for ASR - an Extension of the AURORA Framework
Harald Höge | Josef G. Bauer | Christian Geißler | Panji Setiawan | Kai Steinert

pdf bib
Development of Slovenian Broadcast News Speech Database
Janez Žibert | France Mihelič

pdf bib
A Named Entity Recognizer for Danish
Eckhard Bick

pdf bib
The GENOMA-KB Project: Towards the Integration of Concepts, Terms, Textual Corpora and Entities
M. Teresa Cabré | Carme Bach | Rosa Estopà | Judit Feliu | Gemma Martínez | Jorge Vivaldi

pdf bib
Portuguese Large-scale Language Resources for NLP Applications
Elisabete Ranchhod | Paula Carvalho | Cristina Mota | Anabela Barreiro

pdf bib
Development of a Corpus Workbench for the METU Turkish Corpus
Umut Özge | Bilge Say

pdf bib
Mercedes, a Term-in-Context Highlighter
Raúl Araya | Jordi Vivaldi

pdf bib
The Bilingual Web Dictionary on Demand
Henrik Selsøe Sørensen

pdf bib
Making an XML-based Japanese-Slovene Learners’ Dictionary
Tomaž Erjavec | Kristina Hmeljak Sangawa | Irena Srdanović | Anton ml. Vahčič

pdf bib
MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
Tomaž Erjavec

pdf bib
A Galician Textual Corpus for Morphosyntactic Tagging with Application to Text-to-Speech Synthesis
Lorena Seijo Pereiro | Ana Martínez Ínsua | Francisco Méndez Pazó | Francisco Campillo Díaz | Eduardo Rodríguez Banga

pdf bib
The SPARTACUS-Database: a Spanish Sentence Database for Offline Handwriting Recognition
Salvador España | María José Castro | José Luis Hidalgo

pdf bib
Exploring Balkanet Shared Ontology for Multilingual Conceptual Indexing
Sofia Stamou | Goran Nenadic | Dimitris Christodoulakis

pdf bib
Building a Paraphrase Corpus for Speech Translation
Mitsuo Shimohata | Eiichiro Sumita | Yuji Matsumoto

pdf bib
Incremental Methods to Select Test Sentences for Evaluating Translation Ability
Yasuhiro Akiba | Eiichiro Sumita | Hiromi Nakaiwa | Seiichi Yamamoto | Hiroshi G. Okuno

pdf bib
Reusable Lexical Representations for Idioms
Jan Odijk

pdf bib
The Design of Czech Language Formal Listening Tests for the Evaluation of TTS Systems
Daniel Tihelka | Jindřich Matoušek

pdf bib
A Data-driven Adaptation of Prosody in a Multilingual TTS
Janez Stergar | Caglayan Erdem | Bogomir Horvat | Zdravko Kačič

pdf bib
MiniCors and Cast3LB: Two Semantically Tagged Spanish Corpora
M. Taulé | M. Civit | N. Artigas | M. García | L. Màrquez | M.A. Martí | B. Navarro

pdf bib
Probabilistic Detection of Context-Sensitive Spelling Errors
Johnny Bigert

pdf bib
Acquisition and Annotation of Slovenian Broadcast News Database
Andrej Žgank | Tomaž Rotovnik | Mirjam Sepesy Maučec | Darinka Verdonik | Janez Kitak | Damjan Vlaj | Vladimir Hozjan | Zdravko Kačič | Bogomir Horvat

pdf bib
The COST 278 MASPER Initiative - Crosslingual Speech Recognition with Large Telephone Databases
Andrej Žgank | Zdravko Kačič | Frank Diehl | Klara Vicsi | Gyorgy Szaszak | Jozef Juhar | Slavomir Lihan

pdf bib
Utilizing the One-Sense-per-Discourse Constraint for Fully Unsupervised Word Sense Induction and Disambiguation
Reinhard Rapp

pdf bib
A Freely Available Automatically Generated Thesaurus of Related Words
Reinhard Rapp

pdf bib
Using a Parallel Transcript/Subtitle Corpus for Sentence Compression
Vincent Vandeghinste | Erik Tjong Kim Sang

pdf bib
Handling Subtle Sense Distinctions Through Wordnet Semantic Types
Sofia Stamou | Dimitris Christodoulakis

pdf bib
Multi-lingual Evaluation of a Natural Language Generation System
Athanasios Karasimos | Amy Isard

pdf bib
The Tüba-D/Z Treebank: Annotating German with a Context-Free Backbone
Heike Telljohann | Erhard Hinrichs | Sandra Kübler

pdf bib
The NIST Meeting Room Pilot Corpus
John S. Garofolo | Christophe D. Laprun | Martial Michel | Vincent M. Stanford | Elham Tabassi

pdf bib
Securing Interpretability: The Case of Ega Language Documentation
Dafydd Gibbon | Catherine Bow | Steven Bird | Baden Hughes

pdf bib
A Comparative Study on Human Communication Behaviors and Linguistic Characteristics for Speech-to-Speech Translation
Toshiyuki Takezawa | Genichiro Kikui

pdf bib
Cost-effective Cross-lingual Document Classification
Núria Bel | Cornelis H.A. Koster | Marta Villegas

pdf bib
A Powerful and Versatile XML Format for Representing Role-semantic Annotation
Katrin Erk | Sebastian Padó

pdf bib
The MULI Project: Annotation and Analysis of Information Structure in German and English
Stefan Baumann | Caren Brinckmann | Silvia Hansen-Schirra | Geert-Jan Kruijff | Ivana Kruijff-Korbayová | Stella Neumann | Erich Steiner | Elke Teich | Hans Uszkoreit

pdf bib
Putting the Dutch PAROLE Corpus to Work
P. H. J. van der Kamp | J. G. Kruyt

pdf bib
Acquiring Reusable Multilingual Phonotactic Resources
Julie Carson-Berndsen | Robert Kelly

pdf bib
Phonological Treebanks. Issues in Generation and Application
Moritz Neugebauer | Stephen Wilson

pdf bib
Methodology for Rapid Prototyping and Testing of ASR Based User Interfaces
Pedro Concejero Cerezo | Juan José Rodríguez Soler | Daniel Tapias Merino | Alberto J. Sánchez García

pdf bib
Open Resources for Language Technology
Lars Degerstedt | Arne Jönsson

pdf bib
Unsupervised Text Mining for Ontology Extraction: An Evaluation of Statistical Measures
Marie-Laure Reinberger | Walter Daelemans

pdf bib
A Multilingual Phonological Resource Toolkit for Ubiquitous Speech Technology
Daniel Aioanei | Julie Carson-Berndsen | Anja Geumann | Robert Kelly | Moritz Neugebauer | Stephen Wilson

pdf bib
Benchmarking Ontology Tools. A Case Study for the WebODE Platform.
Oscar Corcho | Raúl García-Castro | Asunción Gómez-Pérez

pdf bib
A Chatbot as a Novel Corpus Visualization Tool
Bayan Abu Shawar | Eric Atwell

pdf bib
Evaluating Variants of the Lesk Approach for Disambiguating Words
Florentina Vasilescu | Philippe Langlais | Guy Lapalme

pdf bib
The Rationale for Building an Ontology Expressly for NLP
Sergei Nirenburg | Marjorie McShane | Stephen Beale

pdf bib
Some Meaning Procedures of Ontological Semantics
Marjorie McShane | Stephen Beale | Sergei Nirenburg

pdf bib
Using the Penn Treebank to Evaluate Non-Treebank Parsers
Eric K. Ringger | Robert C. Moore | Eugene Charniak | Lucy Vanderwende | Hisami Suzuki

pdf bib
Comparison of Some Automatic and Manual Methods for Summary Evaluation Based on the Text Summarization Challenge 2
Hidetsugu Nanba | Manabu Okumura

pdf bib
The Lancaster Corpus of Mandarin Chinese: A Corpus for Monolingual and Contrastive Language Study
Anthony McEnery | Zhonghua Xiao

pdf bib
Highlighting Latent Structure in Documents
H. Folch | B. Habert | M. Jardino | N. Pernelle | M.C. Rousset | A. Termier

pdf bib
Word Sense Disambiguation as a Wordnets’ Validation Method in Balkanet
Dan Tufis | Radu Ion | Nancy Ide

pdf bib
Term Translations in Parallel Corpora: Discovery and Consistency Check
Dan Tufis

pdf bib
The Corpógrafo – a Web-based Environment for Corpora Research
Luís Sarmento | Belinda Maia | Diana Santos

pdf bib
Automatic Classification of Geographic Named Entities
Daniel Ferrés | Marc Massot | Muntsa Padró | Horacio Rodríguez | Jordi Turmo

pdf bib
Acquiring Bayesian Networks from Text
Olivia Sanchez-Graillet | Massimo Poesio

pdf bib
Developping Tools and Building Linguistic Resources for Vietnamese Morpho-syntactic Processing
Thanh Bon Nguyen | Thi Minh Huyen Nguyen | Laurent Romary | Xuan Luong Vu

pdf bib
SpeechRecorder - a Universal Platform Independent Multi-Channel Audio Recording Software
Christoph Draxler | Klaus Jänsch

pdf bib
An Evaluation Protocol for Text Mining Tools : ALCESTE, SAS Text Miner, SPAD-CRM and Temis Text Mining Solutions Testing
Yasmina Quatrain | Sylvaine Nugier | Anne Peradotto

pdf bib
Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian
Alessandro Panunzi | Eugenio Picchi | Massimo Moneglia

pdf bib
Introducing the La Repubblica Corpus: A Large, Annotated, TEI(XML)-compliant Corpus of Newspaper Italian
Marco Baroni | Silvia Bernardini | Federica Comastri | Lorenzo Piccioni | Alessandra Volpi | Guy Aston | Marco Mazzoleni

pdf bib
Exploiting Semantic Web Technologies for Intelligent Access to Historical Documents
Nancy Ide | David Woolner

pdf bib
Using Cooccurrence Statistics and the Web to Discover Synonyms in a Technical Language
Marco Baroni | Sabrina Bisi

pdf bib
Semi-supervised Learning by Fuzzy Clustering and Ensemble Learning
Hiroyuki Shinnou | Minoru Sasaki

pdf bib
Speech & Expression; the Value of a Longitudinal Corpus
Nick Campbell

pdf bib
A Complete Understanding Speech System Based on Semantic Concepts
Salma Jamoussi | Kamel Smaïli | Dominique Fohr | Jean-Paul Haton

pdf bib
The CLaRK System: XML-based Corpora Development System for Rapid Prototyping
Kiril Simov | Alexander Simov | Hristo Ganev | Krasimira Ivanova | Ilko Grigorov

pdf bib
NLP-enhanced Error Checking for Catalan Unrestricted Text
Toni Badia | Àngel Gil | Martí Quixal | Oriol Valentín

pdf bib
Open-source Tools for Creation, Maintenance, and Storage of Lexical Resources for Language Generation from Ontologies
Kalina Bontcheva

pdf bib
User Query Analysis for the Specification and Evaluation of a Dialogue Processing and Retrieval System
Agnes Lisowska | Andrei Popescu-Belis | Susan Armstrong

pdf bib
Creation of Reusable Components and Language Resources for Named Entity Recognition in Russian
Borislav Popov | Angel Kirilov | Diana Maynard | Dimitar Manov

pdf bib
Abstracting a Dialog Act Tagset for Meeting Processing
Andrei Popescu-Belis

pdf bib
Online Evaluation of Coreference Resolution
Andrei Popescu-Belis | Loïs Rigouste | Susanne Salmon-Alt | Laurent Romary

pdf bib
FreeLing: An Open-Source Suite of Language Analyzers
Xavier Carreras | Isaac Chao | Lluís Padró | Muntsa Padró

pdf bib
Phrase-Based Dependency Evaluation of a Japanese Parser
Hisami Suzuki

pdf bib
Functional Requirements for an Interlinear Text Editor
Baden Hughes | Catherine Bow | Steven Bird

pdf bib
Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Project
Baden Hughes | David Penton | Steven Bird | Catherine Bow | Gillian Wigglesworth | Patrick McConvell | Jane Simpson

pdf bib
A Search Tool for Corpora with Positional Tagsets and Ambiguities
Adam Przepiórkowski | Zygmunt Krynicki | Łukasz Dębowski | Marcin Woliński | Daniel Janus | Piotr Bański

pdf bib
The American English SALA-II Data Collection
Peter A. Heeman

pdf bib
How Does Automatic Machine Translation Evaluation Correlate with Human Scoring as the Number of Reference Translations Increases?
Andrew Finch | Yasuhiro Akiba | Eiichiro Sumita

pdf bib
Evaluating the FOKS Error Model
Slaven Bilac | Timothy Baldwin | Hozumi Tanaka

pdf bib
Evaluation of a Speech Cuer: From Motion Capture to a Concatenative Text-to-cued Speech System
Guillaume Gibert | Gérard Bailly | Frédéric Eliséi | Denis Beautemps | Rémi Brun

pdf bib
Beyond TREC’s Filtering Track
Nikolaos Nanas | Victoria Uren | Anne de Roeck | John Domingue

pdf bib
A Corpus-based Syntactic Lexicon for Adverbs
Sanni Nimb

A word class often neglected in the field of NLP resources, namely adverbs, has lately been described in a computational lexicon produced at CST as one of the results of a Ph.D. project. The adverb lexicon, which is integrated in the Danish STO lexicon, gives detailed syntactic information on the type of modification and position, as well as on other syntactic properties, of approximately 800 Danish adverbs. One of the aims of the lexicon has been to establish a clear distinction between syntactic and semantic information: where other lexicons often generalize over the syntactic behavior of semantic classes of adverbs, here every adverb is described with respect to its own syntactic behavior in a text corpus, revealing very individual syntactic properties. Syntactic information on adverbs is needed in NLP systems that generate text, to ensure that adverbs are placed correctly in the phrase they modify. Systems that analyze text also need this information in order to attach adverbs to the right node in the syntactic parse tree. Within the field of linguistic research, several results can be deduced from the lexicon, e.g. knowledge of the syntactic classes of Danish adverbs.

pdf bib
The Future of Evaluation for Cross-Language Information Retrieval Systems
Carol Peters | Martin Braschler | Khalid Choukri | Julio Gonzalo | Michael Kluck

The objective of the Cross-Language Evaluation Forum (CLEF) is to promote research in the multilingual information access domain. In this short paper, we list the achievements of CLEF during its first four years of activity and describe how the range of tasks has been considerably expanded during this period. The aim of the paper is to demonstrate the importance of evaluation initiatives with respect to system research and development and to show how essential it is for such initiatives to keep abreast of and even anticipate the emerging needs of both system developers and application communities if they are to have a future.

pdf bib
SALA II Across the Finish Line: A Large Collection of Mobile Telephone Speech Databases from North and Latin America completed
Henk van den Heuvel | Phil Hall | Harald Höge | Asunción Moreno | Antonio Rincon | Francesco Senia

The SALA II project comprises mobile telephone recordings according to the SpeechDat (II) paradigm for several languages in North and Latin America. Each database contains the recordings of 1000 speakers, with the exception of US Spanish (2000 speakers) and US English (4000 speakers). A quarter of the recordings of each database are made in each of four environments: a quiet environment (home/office), the street, a public place, and a moving vehicle. This paper presents an evaluation of the project. It details experiences with the implementation of design specifications, speaker recruitment, data recordings (on site), data processing, orthographic transcription and lexicon generation. Furthermore, the validation procedure and its results are documented. Finally, the availability and distribution of the databases are addressed.

pdf bib
Parallel Corpora for the Galician Language: Building and Processing of the CLUVI (Linguistic Corpus of the University of Vigo)
Xavier Gómez-Guinovart | Elena Sacau Fontenla

In this paper, we present the methodology developed by the SLI (Computational Linguistics Group of the University of Vigo) for the building and processing of the CLUVI Corpus, showing the TMX-based XML specification designed to encode both morphosyntactic features and translation alignments in parallel corpora, and the solutions adopted for making the CLUVI parallel corpora freely available over the WWW (http://sli.uvigo.es/CLUVI/).

pdf bib
PBIE: A Data Preparation Toolkit Toward Developing a Parsing-Based Information Extraction System
Junko Hosaka | Igor V. Kurochkin | Akihiko Konagaya

We have developed a toolkit in which an annotation tool, a syntactic tree editor, and an extraction rule editor interact dynamically. Its output can be stored in a database for further use. In the field of biomedicine, there is a critical need for automatic text processing. However, current language processing approaches suffer from insufficient basic data incorporating both human domain expertise and domain-specific language processing capabilities. With the annotation tool presented here, a set of “gold standards” can be collected, representing what should be extracted. At the same time, any change in annotation can be viewed on an associated syntactic tree. These facilities provide a clear picture of the relationship between the extraction target and the syntactic tree. Underlying sentences can be analyzed with a parser which can be plugged in, or a set of parsed sentences can be used to generate the tree. Extraction rules written with the integrated editor can be applied at once, and their validity can immediately be verified both on the syntactic tree and on the sentence string by coloring the corresponding segments. Thus our toolkit enables the user to efficiently construct parse-based extraction rules. PBIE2 works under Windows 2000/XP and requires Microsoft Internet Explorer 6.0 or higher. The data can be stored in Microsoft Access.

pdf bib
A Syntactically Annotated Corpus of Tibetan
Andreas Wagner | Bettina Zeisler

This paper describes the creation of a syntactically annotated Tibetan corpus. This corpus forms a part of the TUSNELDA collection of corpora and databases for linguistic research. It will ultimately comprise spoken and written Tibetan texts originating from different regions and historical epochs. These texts are annotated with several kinds of linguistic information, in particular POS tags, phrases, argument structures of verbs, clauses and sentences, as well as several kinds of discourse units and textual segments. The annotation is done in XML. The primary research interest which guides the development of the corpus is the investigation of cross-clausal references, especially the relation between empty arguments (i.e. arguments not overtly realised in a clause) and their antecedents in previous clauses. For this purpose, such references are explicitly encoded so that they can be qualitatively and quantitatively evaluated with the help of standard XML techniques such as XPath search and XSLT transformations. Apart from this primary research interest, we expect that our corpus will be useful for other projects concerning Tibetan and related languages. Like other data in TUSNELDA, it will be made accessible via a WWW query interface.
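
The kind of query this abstract mentions can be sketched with a small XPath-style search. The following is only an illustration under stated assumptions: the element and attribute names (`clause`, `arg`, `empty`, `antecedent`) are hypothetical and are not taken from the TUSNELDA annotation scheme, and the sample text is invented. It uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration only; NOT the TUSNELDA schema.
SAMPLE = """
<text>
  <clause id="c1">
    <arg id="a1" role="subject">the king</arg>
    <verb>arrived</verb>
  </clause>
  <clause id="c2">
    <arg id="a2" role="subject" empty="yes" antecedent="a1"/>
    <verb>sat</verb>
  </clause>
  <clause id="c3">
    <arg id="a3" role="subject" empty="yes" antecedent="a1"/>
    <verb>spoke</verb>
  </clause>
</text>
"""


def empty_argument_links(xml_string):
    """Return (clause id, empty argument id, antecedent text) triples."""
    root = ET.fromstring(xml_string)
    # Index every argument by its id so antecedent references can be resolved.
    by_id = {arg.get("id"): arg for arg in root.iter("arg")}
    links = []
    for clause in root.findall(".//clause"):
        # XPath predicate: arguments of this clause marked as empty.
        for arg in clause.findall("./arg[@empty='yes']"):
            antecedent = by_id.get(arg.get("antecedent"))
            text = antecedent.text if antecedent is not None else None
            links.append((clause.get("id"), arg.get("id"), text))
    return links


links = empty_argument_links(SAMPLE)
for clause_id, arg_id, antecedent_text in links:
    print(clause_id, arg_id, antecedent_text)
```

Counting the returned triples gives a simple quantitative view of cross-clausal reference, while inspecting the antecedent text supports qualitative analysis.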

pdf bib
Lexical Entry Templates for Robust Deep Parsing
Montserrat Marimon | Núria Bel

We report on the development and employment of lexical entry templates in a large-coverage unification-based grammar of Spanish. The aim of the work reported in this paper is to provide robust deep linguistic processing in order to make the grammar more adequate for industrial NLP applications.

pdf bib
Tiered Tagging Revisited
Dan Tufis | Liviu Dragomirescu

In this paper we describe a new baseline tagset induction algorithm which, unlike the one described in previous work, is fully automatic and produces tagsets with better performance than before. The algorithm is an information-lossless transformation of the MULTEXT-EAST compliant lexical tags (MSD) into a reduced tagset that can be mapped back onto the lexicon tagset fully deterministically. Starting from the baseline tagsets, a corpus linguist who is an expert in the language in question may further reduce the tagsets, taking into account the distributional properties of the language. As any further reduction of the baseline tagsets entails losing information, adequate recovery rules should be designed to ensure that the final tagging is expressed in terms of the lexicon encoding. The algorithm is described in detail, and the generated baseline tagsets for Czech, English, Estonian, Hungarian, Romanian and Slovene are evaluated. They are much smaller than the corresponding MSDs and systematically ensure better tagging accuracy.
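
The lossless property described in this abstract can be illustrated with a short check. This is a minimal sketch under stated assumptions, not the authors' induction algorithm: the toy lexicon, the MSD-like strings and the two reduction functions are all hypothetical. A reduction is treated as lossless here if, for every word form, distinct lexicon tags never collapse into the same reduced tag, so that the word form plus the reduced tag always identifies exactly one original tag.

```python
def is_lossless(lexicon, reduce_tag):
    """True if no word form has two lexicon tags with the same reduced tag."""
    for tags in lexicon.values():
        reduced = {reduce_tag(tag) for tag in tags}
        if len(reduced) != len(set(tags)):
            return False
    return True


def recover(lexicon, reduce_tag, word, reduced_tag):
    """Map a (word, reduced tag) pair back to the unique full lexicon tag."""
    candidates = [t for t in lexicon[word] if reduce_tag(t) == reduced_tag]
    if len(candidates) != 1:
        raise ValueError(f"cannot recover a unique tag for {word!r}")
    return candidates[0]


# Hypothetical toy lexicon: word form -> set of MSD-like tags.
TOY_LEXICON = {
    "house": {"Ncns"},
    "run": {"Vmn", "Ncns"},
    "walks": {"Vmip3s", "Ncnp"},
}


def keep_two(tag):
    """Hypothetical reduction: keep category and subcategory only."""
    return tag[:2]


def keep_none(tag):
    """Hypothetical over-aggressive reduction: one tag for everything."""
    return "X"


print(is_lossless(TOY_LEXICON, keep_two))
print(is_lossless(TOY_LEXICON, keep_none))
print(recover(TOY_LEXICON, keep_two, "walks", "Vm"))
```

Any reduction that fails this check would need recovery rules of the kind the abstract mentions, because the lexicon alone no longer determines the original tag.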

pdf bib
A Methodology and Associated Tools for Building Interlingual Wordnets
Dan Tufis | Eduard Barbu

The paper describes the methodology and the tools we developed for the purpose of building a Romanian wordnet. The work is carried out within the BalkaNet European project, which is concerned with wordnets for Bulgarian, Czech, Greek, Romanian, Serbian and Turkish, all of them aligned via an interlingual index (ILI) to Princeton Wordnet. The structuring of the wordnets follows the principles adopted in EuroWordNet. In order to ensure maximal cross-lingual lexical coverage, the consortium decided to implement the same concepts, represented by a common set of ILI concepts. We describe the selection of concepts to be implemented in all the monolingual wordnets. The methodologies adopted by each partner were different and depended on the language resources and personnel available. For the Romanian wordnet, we decided that it should be based on the reference lexicographic descriptions of Romanian which we had in electronic form: EXPD, a heavily XML-annotated explanatory dictionary (developed in the previous CONCEDE project and based on the standard Explanatory Dictionary of Romanian); SYND, a published dictionary of synonyms which we keyboarded, encoded and completed with more than 4000 new synonymy sets extracted from EXPD; and EnRoD, a Romanian-English dictionary, most of it extracted automatically from parallel corpora and further hand-validated and extended. Besides these resources, like all the other members of the consortium, we had at our disposal the interlingual mapping of the Princeton Wordnet. All the above-mentioned resources have been incorporated into a user-friendly system, WnBuilder, which allows for cooperative work by a large number of lexicographers. When the distributed work is put together, the synsets are validated. Several kinds of errors show up, the most frequent and most difficult to solve being the case of a literal with the same sense number appearing in different synsets. We discuss the reasons for such conflicts as well as their correction, supported by another utility program called WnCorrector. The full paper presents WnBuilder and WnCorrector, as well as the status of the Romanian wordnet development.
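
The validation conflict described in this abstract, a literal with the same sense number appearing in different synsets, can be detected with a simple index. The sketch below is an illustration only: the data layout, the synset identifiers and the function name are hypothetical and do not reflect how WnBuilder or WnCorrector work internally.

```python
from collections import defaultdict


def find_sense_conflicts(synsets):
    """Return {(literal, sense): [synset ids]} for pairs found in 2+ synsets.

    `synsets` maps a synset id to a list of (literal, sense number) pairs.
    """
    seen = defaultdict(set)
    for synset_id, members in synsets.items():
        for literal, sense in members:
            seen[(literal, sense)].add(synset_id)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}


# Hypothetical merged output of several lexicographers.
MERGED = {
    "S001": [("casa", 1), ("locuinta", 1)],
    "S002": [("casa", 2)],
    "S003": [("casa", 1), ("camin", 1)],  # clashes with S001 on ("casa", 1)
}

conflicts = find_sense_conflicts(MERGED)
for (literal, sense), ids in conflicts.items():
    print(f"{literal}:{sense} appears in {', '.join(ids)}")
```

Each reported pair would then need manual resolution, for example by renumbering one sense or merging the affected synsets.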

pdf bib
Construction of a Bilingual Arabic-Spanish Lexicon of Verbs Based on a Parallel Corpus
Doaa Samy | Antonio Moreno-Sandoval | José M. Guirao

Parallel corpora are considered an important resource for the development of linguistic tools. In this paper our main goal is the development of a bilingual lexicon of verbs. The construction of this lexicon relies on two main resources: I) a parallel corpus (through alignment); II) the linguistic tools developed for Spanish (which serve as a starting point for developing tools for Arabic). In the end, aligned equivalent verbs are detected automatically from a Spanish-Arabic parallel corpus. To achieve this goal, we had to go through several preparatory stages: the assessment of the parallel corpus, the monolingual tokenization of each corpus, a preliminary sentence alignment and, finally, the application of the model for automatic extraction of equivalent verbs. Our method is hybrid, since it combines statistical and linguistic approaches.

pdf bib
A XML-Based Term Extraction Tool for Basque
I. Alegria | A. Gurrutxaga | P. Lizaso | X. Saralegi | S. Ugartetxea | R. Urizar

This project combines linguistic and statistical information to develop a term extraction tool for Basque. Because Basque is an agglutinative and highly inflected language, the treatment of morphosyntactic information is vital. In addition, due to the late unification process of the language, texts show greater term dispersion than in a highly normalized language. The result is a semi-automatic, XML-based terminology extraction tool for use in technical and scientific information management.

pdf bib
A Bayesian Model for Shallow Syntactic Parsing of Natural Language Texts
Manolis Maragoudakis | Nikos Fakotakis | George Kokkinakis

For the present work, we introduce and evaluate a novel Bayesian shallow syntactic parser that is able to perform robust detection of subject-object pairs and subject-direct object-indirect object triples for a given verb in a natural language sentence. The shallow parser infers the correct subject-object pairs based on knowledge provided by Bayesian network learning from annotated text corpora. The DELOS corpus, a collection of economic domain texts that has been automatically annotated using various morphological and syntactic tools, was used as training material. Our shallow parser makes use of limited linguistic input. More specifically, we consider only part-of-speech tagging, the voice and the mood of the verb, and the head word of a noun phrase. For the task of detecting the head word of a phrase we used a sentence boundary detector. Identifying the head word of a noun phrase, i.e. the word that holds the morphological information (case, number) of the whole phrase, also proves to be very helpful for our task, as its morphological tag is all the information that is needed regarding the phrase. The proposed method was evaluated against three other machine learning techniques, namely naive Bayes, k-Nearest Neighbor and Support Vector Machines, methods that have previously been applied to natural language processing tasks with satisfactory results. The experimental results show satisfactory performance for our proposed shallow parser, which reaches almost 92 per cent precision.

pdf bib
Multifunctional Computational Lexicon of Contemporary Portuguese: An Available Resource for Multitype Applications
Florbela Barreto | Raquel Amaro

This paper presents some aspects of the first Portuguese frequency lexicon extracted from a large corpus. The Multifunctional Computational Lexicon of Contemporary Portuguese (henceforth MCL) arose from the need to fill a gap in the study of contemporary Portuguese. Until recently, the frequency lexicons of Portuguese were very small, such as Português Fundamental, which consists of 2,217 words extracted from a 700,000-word corpus, and the Frequency Dictionary of Portuguese Words, based on a literary corpus of 500,000 words. We describe here the main steps taken for collecting the lexical and frequency data and some of the major problems that arose in the process. The resulting lexicon is a freely available, reliable resource for several types of applications.

pdf bib
Use and Evaluation of Prosodic Annotations in Dutch
Jacques Duchateau | Tim Ceyssens | Hugo Van hamme

In the development of annotations for a spoken database, an important issue is whether the annotations can be generated automatically with sufficient precision, or whether expensive manual annotations are needed. This paper discusses the case of prosodic annotations, which was investigated on the CGN database (Spoken Dutch Corpus). The main conclusions of this work are as follows. First, it was found that the available amount of manual prosodic annotations is sufficient for the development of our (baseline, decision tree based) prosodic models. In other words, more manual annotations do not improve the models. Second, the developed prosodic models for prominence are insufficiently accurate to produce automatic prominence annotations that are as good as the manual ones. On the other hand, the consistency between manual and automatic break annotations is as high as the inter-transcriber consistency for breaks. So given the current amount of manual break annotations, annotations for the remainder of the CGN database can be generated automatically with the same quality as the manual annotations.

pdf bib
Resources and Techniques for Multilingual Information Extraction
Stephan Busemann | Hans-Ulrich Krieger

Official travel warnings published regularly on the internet by the ministries for foreign affairs of France, Germany, and the UK provide a useful resource for assessing the risks associated with travelling to some countries. The shallow IE system SProUT has been extended to meet the specific needs of delivering a language-neutral output for English, French, or German input texts. A shared type hierarchy, a feature-enhanced gazetteer resource, and generic techniques of merging chunk analyses into larger results are major reusable results of this work.

pdf bib
Evaluating Factors Impacting the Accuracy of Forced Alignments in a Multimodal Corpus
Lei Chen | Yang Liu | Mary Harper | Eduardo Maia | Susan McRoy

People, when processing human-to-human communication, utilize everything they can in order to understand that communication, including speech and information such as the time and location of an interlocutor's gesture and gaze. Speech and gesture are known to exhibit a synchronous relationship in human communication; however, the precise nature of that relationship requires further investigation. The construction of computer models of multimodal human communication would be enabled by the availability of multimodal communication corpora annotated with synchronized gesture and speech features. To investigate the temporal relationships of these knowledge sources, we have collected and are annotating several multimodal corpora with time-aligned features. Forced alignment between a speech file and its transcription is a crucial part of multimodal corpus production. This paper investigates a number of factors that may contribute to highly accurate forced alignments to support the rapid production of these multimodal corpora including the acoustic model, the match between the speech used for training the system and that to be force aligned, the amount of data used to train the ASR system, the availability of speaker adaptation, and the duration of alignment segments.

pdf bib
Automatic Audio and Manual Transcripts Alignment, Time-code Transfer and Selection of Exact Transcripts
C. Barras | G. Adda | M. Adda-Decker | B. Habert | P. Boula de Mareüil | P. Paroubek

The present study focuses on automatic processing of sibling resources of audio and written documents, such as those available in audio archives or for parliament debates: written texts are close but not exact audio transcripts. Such resources deserve attention for several reasons: they represent an interesting testbed for studying differences between written and spoken material and they yield low cost resources for acoustic model training. When automatically transcribing the audio data, regions of agreement between automatic transcripts and written sources allow time-codes to be transferred to the written documents: this may be helpful in an audio archive or audio information retrieval environment. Regions of disagreement can be automatically selected for further correction by human transcribers. This study makes use of 10 hours of French radio interview archives with corresponding press-oriented transcripts. The audio corpus has then been transcribed using the LIMSI speech recognizer, resulting in automatic transcripts exhibiting an average word error rate of 12%. 80% of the text corpus (with word chunks of at least five words) can be exactly aligned with the automatic transcripts of the audio data. The residual word error rate on these 80% is less than 1%.
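As a rough illustration of the "regions of agreement" idea in the abstract above, the following Python sketch finds exact word-sequence matches of at least five words between a written text and an automatic transcript. It is not the authors' method: the function name `agreement_regions`, the lowercasing and whitespace tokenization, and the use of `difflib.SequenceMatcher` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def agreement_regions(written, automatic, min_words=5):
    """Return (written_start, automatic_start, length) word-index triples.

    Each triple marks a run of identical words shared by both texts.
    Runs shorter than min_words are dropped, mirroring the five-word
    chunk threshold mentioned in the abstract (assumed interpretation).
    """
    a = written.lower().split()    # naive tokenization: an assumption
    b = automatic.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


written = "the minister said that the budget will be voted next week in parliament"
automatic = "uh the minister said that the budget will be voted on next week in the parliament"
regions = agreement_regions(written, automatic)
```

In a setup like this, time-codes attached to the automatic transcript words inside each returned region could be copied to the corresponding written words, while the remaining stretches would be the candidates for manual correction.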

pdf bib
Evaluation of a Spoken Phonetic Database in Basque Language
V. Guijarrubia | I. Torres | L.J. Rodríguez

In this paper we present the evaluation of a spoken phonetic corpus designed to train acoustic models for Speech Recognition applications in the Basque language. A complete set of acoustic-phonetic decoding experiments was carried out over the proposed database. Context-dependent and context-independent phoneme units were used in these experiments with two different approaches to acoustic modeling, namely discrete and continuous Hidden Markov Models (HMMs). A complete set of HMMs was trained and tested with the database. Experimental results reveal that the database is large and phonetically rich enough to yield good acoustic models for integration into Continuous Speech Recognition Systems.

pdf bib
Using Paradigm Tables to Generate New Utterances Similar to those Existing in Linguistic Resources
Yves Lepage | Guilhem Peralta

We investigate the possibility of creating new linguistic utterances (short sentences) similar to those already present in an existing linguistic resource. Using paradigm tables ensures that the newly generated sentences resemble previous data while still being different. We report an experiment in which 1,201 new correct sentences were generated starting from only 22 seed sentences.

pdf bib
Collection and Evaluation of Broadcast News Data for Arabic
Mohamed Afify | Ossama Emam

pdf bib
A Language Resources Infrastructure for Bulgarian
Kiril Simov | Petya Osenova | Sia Kolkovska | Elisaveta Balabanova | Dimitar Doikoff

This paper describes the infrastructure of a basic language resources set for Bulgarian in the context of the BLARK initiative requirements. We focus on the treebanking task as a trigger for basic language resources compilation. Two strategies have been applied in this respect: (1) implementing the main pre-processing modules before the treebank compilation and (2) creating more elaborate types of resources in parallel to the treebank compilation. The description of language resources within the BulTreeBank project is divided into two parts: language technology, which includes tokenization, a morphosyntactic analyzer, morphosyntactic disambiguation, and partial grammars; and language data, which includes the layers of the BulTreeBank corpus and the variety of lexicons. The advantages of our approach to a less-spoken language (like Bulgarian) are as follows: it triggers the creation of the basic set of language resources, which are lacking for certain languages, and it raises the question of how language resources are created.

pdf bib
“You Stupid Tin Box” - Children Interacting with the AIBO Robot: A Cross-linguistic Emotional Speech Corpus
A. Batliner | C. Hacker | S. Steidl | E. Nöth | S. D’Arcy | M. Russell | M. Wong

This paper deals with databases that combine different aspects: children's speech, emotional speech, human-robot communication, cross-linguistics, and read vs. spontaneous speech. In a Wizard-of-Oz scenario, German and English children had to instruct Sony's AIBO robot to fulfil specific tasks. In one experimental condition, strictly parallel for German and English, the AIBO behaved 'disobediently' by following its own script irrespective of the child's commands. In this way, reactions of different children to the same sequence of AIBO's actions could be obtained. In addition, both the German and the English children were recorded reading texts. The data are transliterated orthographically; emotional user states and some other phenomena will be annotated. We report preliminary word recognition rates and classification results.

pdf bib
The Role of MultiWord Terminology in Knowledge Management
James Dowdall | Will Lowe | Jeremy Ellman | Fabio Rinaldi | Michael Hess

One of the major obstacles for knowledge management remains MultiWord Terminology (MWT). This paper explores the difficulties that arise and describes real world solutions implemented as part of the Parmenides project. Parmenides is being built as an integrated knowledge management package that combines information, MWT and ontology extraction methods in a semi-automated framework. The focus of this paper is on eliciting ontological fragments based on dedicated MWT processing.

pdf bib
The OPUS Corpus - Parallel and Free: http://logos.uio.no/opus
Jörg Tiedemann | Lars Nygaard

The OPUS corpus is a growing collection of translated documents collected from the internet. The current version contains about 30 million words in 60 languages. The entire corpus is sentence aligned and it also contains linguistic markup for certain languages.

pdf bib
Selecting the Correct English Synset for a Spanish Sense
Javier Farreres | Horacio Rodríguez

This work tries to enrich the Spanish Wordnet using a Spanish taxonomy as a knowledge source. The Spanish taxonomy is composed of Spanish senses, while the Spanish Wordnet is composed of synsets, mostly linked to the English WordNet. A set of weighted associations between Spanish words and Wordnet synsets is used for inferring associations between both taxonomies.

pdf bib
Collection of SLR in the Asian-Pacific Area
Asunción Moreno | Khalid Choukri | Phil Hall | Henk van den Heuvel | Eric Sanders | Francesco Senia | Herbert Tropf

The goal of this project (LILA) is the collection of a large number of spoken databases for training Automatic Speech Recognition Systems for telephone applications in the Asian-Pacific area. Specifications follow those of SpeechDat-like databases. Utterances will be recorded directly from calls made from either fixed or cellular telephones and consist of read text and answers to specific questions. The project is driven by a consortium composed of a large number of industrial companies. Each company is in charge of the production of two databases. The consortium shares the databases produced in the project. The goal of the project should be reached within the year 2005.

pdf bib
Derivational Relations in Flectional Languages - Czech Case
Jaroslava Hlaváčová | Jana Klímová

When a text in any language is submitted to a morphological analysis, there always remain some unrecognized words. We can lower their number by adding new words to the dictionary used by the morphological analyzer, but we can never capture the whole of the language. The system described in this paper (we call it "derivation module") deals with the unknown derived words. It aims not only at analyzing but also at synthesizing Czech derived words. Such a system is of particular value for automatic processing of languages where derivational morphology plays an important role in regular word formation.

pdf bib
Standards for Language Codes: developing ISO 639
David Dalby | Lee Gillam | Christopher Cox | Debbie Garside

pdf bib
SLR Validation: Current Trends and Developments
Henk van den Heuvel | Dorota Iskra | Eric Sanders | Folkert de Vriend

This paper deals with the quality evaluation (validation) of Spoken Language Resources (SLR). The current situation in terms of relevant validation criteria and procedures is briefly presented. Next, a number of validation issues related to new data formats (XML-based annotations, UTF-16 encoding) are discussed. Further, new validation cycles that were introduced in a series of new projects like SpeeCon and OrienTel are addressed: prompt sheet validation, lexicon validation and pre-release validation. Finally, SPEX's current and future

pdf bib
Identifying Definitions in Text Collections for Question Answering
Horacio Saggion

pdf bib
Multiple Sequence Alignment for Characterizing the Lineal Structure of Revision
Laura Alonso | Irene Castellón | Jordi Escribano | Xavier Messeguer | Lluís Padró

We present a first approach to the application of a data mining technique, Multiple Sequence Alignment, to the systematization of a polemic aspect of discourse, namely, the expression of contrast, concession, counterargument and semantically similar discursive relations. The representation of the phenomena under study is carried out by very simple techniques, mostly pattern-matching, but the results allow insightful conclusions to be drawn on the organization of this aspect of discourse: equivalence classes of discourse markers are established, and systematic patterns are discovered, which will be applied in enhancing a discursive parser.

pdf bib
Mining the Web for Discourse Markers
Ben Hutchinson

This paper proposes a methodology for obtaining sentences containing discourse markers from the World Wide Web. The proposed methodology is particularly suitable for collecting large numbers of discourse marker tokens. It relies on the automatic identification of discourse markers, and we show that this can be done with an accuracy within 9% of that of human performance. We also show that the distribution of discourse markers on the web correlates highly with that in a conventional balanced corpus.

pdf bib
A Pattern Extraction Workbench Combining Multiple Linguistic Levels
Magnus Merkel | Andreas Lange

In this paper an interactive pattern extraction workbench, I*Pex, is presented. The workbench comes in a graphical environment and is designed to be used in an incremental and interactive fashion with the user. Patterns can be constructed to work in combination, involving specifications on several linguistic levels simultaneously, from the character level using regular expressions, through parts of speech and dependency relations, to semantic roles. The input text format is based on the XCES XML format.

pdf bib
Exploiting Coreference Annotations for Text-to-Hypertext Conversion
Anke Holler | Jan Frederik Maas | Angelika Storrer

The paper describes an annotation scheme for coreference developed within the application context of text-to-hypertext conversion. In this context coreference is used (1) for generating document-internal and cross-document hyperlinks, and (2) for resolving anaphoric expressions in order to achieve cohesive closedness in hypertext nodes. We will argue that for the purpose of cross-document linking it is necessary to separate the annotation of coreference relations from the annotation of anaphoric relations. To account for this requirement, we developed a knowledge-based annotation scheme that relates referential expressions in the text to entities in a knowledge representation, which is modeled using XML Topic Maps.

pdf bib
“Why do you Ignore me?” - Proof that not all Direct Speech is Bad
Laura Hasler

In the automatic summarisation of written texts, direct speech is usually deemed unsuitable for inclusion in important sentences. This is due to the fact that humans do not usually include such quotations when they create summaries. In this paper, we argue that despite generally negative attitudes, direct speech can be useful for summarisation and ignoring it can result in the omission of important and relevant information. We present an analysis of a corpus of annotated newswire texts in which a substantial amount of speech is marked by different annotators, and describe when and why direct speech can be included in summaries. In an attempt to make direct speech more appropriate for summaries, we also describe rules currently being developed to transform it into a more summary-acceptable format.

pdf bib
“Human Language Technology Elements in a Knowledge Organisation System - The VID Project”
Costanza Navarretta | Bolette Sandford Pedersen | Dorte Haltrup Hansen

This paper describes how Human Language Technologies and linguistic resources are used to support the construction of components of a knowledge organisation system. In particular we focus on methodologies and resources for building a corpus-based domain ontology and extracting relevant metadata information for text chunks from domain-specific corpora.

pdf bib
Generic Text Summarization Using WordNet
Kedar Bellare | Anish Das Sarma | Atish Das Sarma | Navneet Loiwal | Vaibhav Mehta | Ganesh Ramakrishnan | Pushpak Bhattacharyya

pdf bib
Development of Bilingual Domain-Specific Ontology for Automatic Conceptual Indexing
Natalia V. Loukachevitch | Boris V. Dobrov

In this paper we describe the development, means of evaluation, and applications of the Russian-English Sociopolitical Thesaurus, developed specifically as a linguistic resource for automatic text processing applications. The sociopolitical domain is not a domain of social research but a broad domain of social relations including economic, political, military, cultural, sports and other subdomains. Knowledge of this domain is necessary for automatic text processing of such important documents as official documents, legislative acts, and newspaper articles.

pdf bib
Development of Ontologies with Minimal Set of Conceptual Relations
Natalia V. Loukachevitch | Boris V. Dobrov

In this paper we describe our approach to the development of ontologies with a small number of relation types. Non-taxonomic relations in our ontologies are based on the conception of ontological dependence described in formal ontology. This minimal set of relations does not depend on a domain or a task and makes it possible to begin ontology construction at once, as soon as a task is set and a domain is determined, and to obtain the first version of an ontology in a short time. Such an initial ontology can be used for information-retrieval applications and can serve as a structural basis for further development of the ontology.

pdf bib
Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons
Maria Fernanda Bacelar do Nascimento | Amália Mendes | Luísa Pereira

Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL's webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at present, more than 300 million words, taken by sampling from several types of written text (literary, newspaper, technical, didactic, juridical, parliamentary, etc.) and spoken text (informal and formal), pertaining to national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese). The LRs available for on-line queries include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from CRPC for specific CLUL projects and now available for on-line queries; b) a published sample of "Português Fundamental", a spoken CRPC subcorpus, available for text download; c) a frequency lexicon extracted from a CRPC subcorpus, available for both on-line queries and download. Other LRs available for Portuguese are also mentioned: C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages, a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon and the SIMPLE Lexicon.

pdf bib
Automatisation of the Activity of Term Collection in Different Languages
Bruno Cartoni | Pierrette Bouillon | Yalina Alphonse | Sabine Lehmann

This article describes the use and development of a tool for grammar and terminology control (FLAG), for the purposes of automating the verification of terminology for a large-scale user of multilingual terminology. It describes the various advantages of the tool and shows a process for transforming a traditional terminology list into a list of inflected forms as well as patterns which can be used to find possible morpho-syntactic derivations of terms.

pdf bib
Automatically Selecting Domain Markers for Terminology Extraction
Jorge Vivaldi | Horacio Rodríguez

Some approaches to automatic terminology extraction from corpora rely on existing semantic resources to guide the detection of terms. Most of these systems exploit specialised resources, like UMLS in the medical domain, while a few try to take advantage of general-purpose semantic resources, like EuroWordNet (EWN). As the term extraction task is clearly domain-dependent, when a general-purpose resource without specific domain information is used, we need a way of attaching domain information to the units of the resource. For big resources it is desirable that this semantic enrichment be carried out automatically. Given a specific domain, our proposal aims to detect in EWN those units that can be considered as domain markers (DM). We can define a DM as an EWN entry whose attached strings belong to the domain, as well as the variants of all its descendants through the hyponymy relation. The procedure we propose in this paper is fully automatic and, a priori, domain-independent. The only external knowledge it uses is a set of terms (an external vocabulary), each of which is considered to have at least one sense belonging to the domain.

pdf bib
The Ongoing Evaluation Campaign of Syntactic Parsing of French: EASY
Anne Vilnat | Patrick Paroubek | Laura Monceaux | Isabelle Robba | Véronique Gendner | Gabriel Illouz | Michèle Jardino

This paper presents EASY (Evaluation of Analyzers of SYntax), an ongoing evaluation campaign of syntactic parsing of French, a subproject of EVALDA in the French TECHNOLANGUE program. After presenting the elaboration of the annotation formalism, we describe the corpus building steps, the annotation tools, the evaluation measures and, finally, plans to produce a validated, large, syntactically annotated linguistic resource.

pdf bib
Annotators’ Agreement: The Case of Topic-Focus Articulation
Kateřina Veselá | Jiří Havelka | Eva Hajičová

The annotation of the Prague Dependency Treebank (PDT) is conceived of as a multilayered scenario that also comprises dependency representations (tectogrammatical tree structures, TGTS's) of the underlying structure of the sentences. TGTS's capture three basic aspects of the underlying structure of sentences: (a) the dependency tree structure, (b) the kinds of dependency syntactic relations, and (c) the basic characteristics of the topic-focus articulation (TFA). Since the PDT is a large collection and the annotations on the deepest layer are to a large extent performed by several human annotators (based on an automatic preprocessing module), it is more than necessary to observe the consistency of annotators and the agreement among them. In the present paper, we summarize the results of the evaluation of parallel annotations of several samples taken from PDT and the measures adopted to improve the consistency of annotations.

pdf bib
Evaluating Lexical Resources for a Semantic Tagger
Scott S. L. Piao | Paul Rayson | Dawn Archer | Tony McEnery

Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP community. In this paper, we evaluate the lexical coverage of the semantic lexicon both in terms of genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and some corpora of Newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76% to 97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has a remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems as well as to render them ‘future proof’, we need to evaluate their potential both synchronically and diachronically across genres.
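The coverage figures in the abstract above can be read as the share of corpus tokens that the lexicon recognizes. The Python sketch below illustrates that reading only; the abstract does not say whether coverage was counted over tokens or types, nor how multiword expressions were handled, so the token-level definition, the lowercasing, and the function name `lexical_coverage` are assumptions.

```python
def lexical_coverage(tokens, lexicon):
    """Return (coverage, unknown_words) for a tokenized text.

    coverage is the fraction of tokens found in the lexicon
    (token-level definition: an assumption, not taken from the paper).
    unknown_words is the sorted list of distinct tokens not found.
    """
    lowered = [t.lower() for t in tokens]
    if not lowered:
        return 0.0, []
    known = sum(1 for t in lowered if t in lexicon)
    unknown = sorted({t for t in lowered if t not in lexicon})
    return known / len(lowered), unknown


# Toy data, invented purely for illustration.
toy_lexicon = {"the", "court", "heard", "plaintiff"}
toy_tokens = ["The", "court", "heard", "the", "plaintiff", "yesternight"]
coverage, unknown = lexical_coverage(toy_tokens, toy_lexicon)
```

Listing the unknown words alongside the ratio is a design choice that makes it easy to see which domain-specific or historical items would need to be added to a lexicon.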

pdf bib
Multimodal Meaning Representation for Generic Dialogue Systems Architectures
Frédéric Landragin | Alexandre Denis | Annalisa Ricci | Laurent Romary

A unified language for the communicative acts between agents is essential for the design of multi-agent architectures. Whatever the type of interaction (linguistic, multimodal, including particular aspects such as force feedback), whatever the type of application (command dialogue, request dialogue, database querying), the concepts are common and we need a generic meta-model. In order to move towards task-independent systems, we need to clarify the module parameterization procedures. In this paper, we focus on the characteristics of a meta-model designed to represent meaning in linguistic and multimodal applications. This meta-model is called MMIL, for MultiModal Interface Language, and was first specified in the framework of the IST MIAMM European project. What we want to test here is how relevant MMIL is for a completely different context (a different task, a different interaction type, a different linguistic domain). We detail the exploitation of MMIL in the framework of the IST OZONE European project, and we draw conclusions on the role of MMIL in the parameterization of task-independent dialogue managers.

pdf bib
STO: A Danish Lexicon Resource - Ready for Applications
Anna Braasch | Sussi Olsen

This paper deals with the STO lexicon, the most comprehensive computational lexicon of Danish developed for NLP/HLT applications, which is now ready for use. Danish was one of the 12 EU languages participating in the LE-PAROLE and SIMPLE projects; it was therefore natural to continue this work, building on the experience obtained from these projects. The material for Danish produced within these projects, further enriched with language-specific information, is incorporated into the STO lexicon. First, we describe the main characteristics of the lexical coverage and linguistic content of the STO lexicon; second, we present some recent uses and point to some prospective exploitations of the material. Finally, we outline an internet-based user interface, which allows for browsing through the complex information content of the STO lexical database and some other selected WRLs for Danish.

pdf bib
A Domain-Independent Approach to IE Rule Development
Kalliopi Zervanou | John McNaught

A key element for the extraction of information in a natural language document is a set of shallow text analysis rules, which are typically based on pre-defined linguistic patterns. Current Information Extraction research aims at the automatic or semi-automatic acquisition of these rules. Within this research framework, we consider in this paper the potential for acquiring generic extraction patterns. Our research is based on the hypothesis that terms (the linguistic representation of concepts in a specialised domain) and Named Entities (the names of persons, organisations and dates of importance in the text) can together be considered as the basic semantic entities of textual information and can therefore be used as a basis for the conceptual representation of domain-specific texts and the definition of what constitutes an information extraction template in linguistic terms. The extraction patterns discovered by this approach involve significant associations of these semantic entities with verbs, and they can subsequently be translated into the grammar formalism of choice.

pdf bib
The French MEDIA/EVALDA Project: the Evaluation of the Understanding Capability of Spoken Language Dialogue Systems
Laurence Devillers | Hélène Maynard | Sophie Rosset | Patrick Paroubek | Kevin McTait | D. Mostefa | Khalid Choukri | Laurent Charnay | Caroline Bousquet | Nadine Vigouroux | Frédéric Béchet | Laurent Romary | Jean-Yves Antoine | J. Villaneau | Myriam Vergnes | J. Goulian

The aim of the MEDIA project is to design and test a methodology for the evaluation of context-dependent and independent spoken dialogue systems. We propose an evaluation paradigm based on the use of test suites from real-world corpora and a common semantic representation and common metrics. This paradigm should allow us to diagnose the context-sensitive understanding capability of dialogue systems. This paradigm will be used within an evaluation campaign involving several sites all of which will carry out the task of querying information from a database.

pdf bib
The C-ORAL-ROM CORPUS. A Multilingual Resource of Spontaneous Speech for Romance Languages
Emanuela Cresti | Fernanda Bacelar do Nascimento | Antonio Moreno Sandoval | Jean Veronis | Philippe Martin | Khalid Choukri

The C-ORAL-ROM project has delivered a multilingual corpus of spontaneous speech for the main Romance languages (Italian, French, Portuguese and Spanish). The collection aims to represent the variety of speech acts performed in everyday language and to enable the description of prosodic and syntactic structures in the four Romance languages. Sampling criteria are defined in a corpus design scheme. C-ORAL-ROM adopts two different sampling strategies, one for the formal and one for the informal part: while a set of typical domains of application is selected to document the formal use of language, the informal part documents speech variation using parameters referring to the event’s structure (dialogue vs. monologue) and the sociological domain of use (family-private vs. public). The four Romance corpora are tagged with respect to terminal and non-terminal prosodic breaks. Terminal breaks are assumed to be the more relevant cues for the identification of relevant linguistic domains in spontaneous speech (utterances). Relations with other concurrent criteria are discussed. The multimedia storage of the C-ORAL-ROM corpus is based on this principle; each textual string ending with a terminal break is aligned, through the Win Pitch speech software, to its acoustic counterpart, generating the database of all utterances.

pdf bib
Principles of a System for Terminological Concept Modelling
Bodil Nistrup Madsen | Hanne Erdman Thomsen | Carl Vikner

We are working on a project called CAOS - Computer-Aided Ontology Structuring - whose aim is to develop a computer system designed to enable semi-automatic construction of concept systems, or ontologies. The system is intended to be interactive and presupposes an end-user with a terminological background (terminologist or professional translator). CAOS supports terminological concept modelling. The backbone of this concept modelling is constituted by characteristics modelled by formal feature specifications, i.e. attribute-value pairs. Our use of feature specifications is subject to a number of principles and constraints. In this paper we want to demonstrate some of these principles and to show why they are necessary in order to permit the construction of an interactive tool for building terminological ontologies. We will also show how they contribute to determining the structuring of the ontologies in CAOS and to facilitating the work of the terminologist user.

pdf bib
On the Usefulness of Large Spoken Language Corpora for Linguistic Research
Christophe Van Bael | Helmer Strik | Henk van den Heuvel

In the past, fundamental linguistic research was typically conducted on small data sets that were handcrafted for the specific research at hand. However, from the eighties onwards, many large spoken language corpora have become available. This study investigates the usefulness of large multi-purpose spoken language corpora for fundamental linguistic research. A research task was designed in which we tried to capture the major pronunciation differences between three speech styles in context-sensitive re-write rules at the phone level. These re-write rules were extracted from the alignments of both a manual phonetic transcription and an automatic phonetic transcription with a canonical reference transcription of the same material.

pdf bib
WALA: A Multilingual Resource Repository for West African Languages
Dafydd Gibbon | Firmin Ahoua | Eddi Gbéry | Eno-Abasi Urua | Moses Ekpenyong

The West African Language Archive (WALA) initiative has emerged from a number of concurrent projects, and aims to encourage local scholars to create high quality decentralised repositories documenting West African languages, and to make these repositories available to language communities, language planners, educationalists and scientists via an internet metadata portal such as OLAC (Open Language Archive Community). A wide range of criteria has to be met in designing and implementing this kind of archive. We discuss these criteria with reference to experiences in documentation work in three very different ongoing language documentation projects, on designing an encyclopaedia, on documenting an endangered language, and on creating a speech synthesiser. We pay special attention to the provision of metadata, a formal variety of catalogue or housekeeping information, without which resources are doomed to remain inaccessible.

pdf bib
Annotating a Corpus for Building a Domain-specific Knowledge Base
Sabine Bartsch

pdf bib
A Comparison of Summarisation Methods Based on Term Specificity Estimation
Constantin Orăsan | Viktor Pekar | Laura Hasler

pdf bib
Measurements of Spoken Language Variability in a Multilingual Corpus. Predictable Aspects
Massimo Moneglia

pdf bib
Reliability of Lexical and Prosodic Cues in Two Real-life Spoken Dialog Corpora
L. Devillers | I. Vasilescu

pdf bib
WordNet Affect: an Affective Extension of WordNet
Carlo Strapparava | Alessandro Valitutti

pdf bib
The GENOMA-KB Platform: Queries over Integrated Linguistic Resources
Margarita Hospedales | Manel Rodríguez

pdf bib
Evaluation of Consensus on the Annotation of Prosodic Breaks in the Romance Corpus of Spontaneous Speech “C-ORAL-ROM”
Morena Danieli | Juan María Garrido | Massimo Moneglia | Andrea Panizza | Silvia Quazza | Marc Swerts

pdf bib
Towards the Use of Word Stems and Suffixes for Statistical Machine Translation
Maja Popović | Hermann Ney

pdf bib
Language Model Adaptation for Statistical Machine Translation Based on Information Retrieval
Matthias Eck | Stephan Vogel | Alex Waibel

pdf bib
Evaluating Name-Matching for Coreference Resolution
Olga Uryupina

pdf bib
Design and Implementation of a Semantic Search Engine for Portuguese
Carlos Amaral | Dominique Laurent | André Martins | Afonso Mendes | Cláudia Pinto

pdf bib
Converting Treebank Annotations to Language Neutral Syntax
Richard Campbell | Eric Ringger

pdf bib
Methodology For Building Thematic Indexes In Medicine For French
Yalina Alphonse | Pierrette Bouillon

pdf bib
Transcrigal: A Bilingual System for Automatic Indexing of Broadcast News
Carmen Garcia-Mateo | Javier Dieguez-Tirado | Laura Docio-Fernandez | Antonio Cardenal-Lopez

pdf bib
Abar-Hitz: An Annotation Tool for the Basque Dependency Treebank
Arantza Díaz de Ilarraza | Aitzpea Garmendia | Maite Oronoz

pdf bib
Creating Multi-purpose Linguistic Resources for Modern Greek: a Deep Modern Greek Grammar
Valia Kordoni | Julia Neu

pdf bib
Enriching the Spanish EuroWordNet by Collocations
Leo Wanner | Margarita Alonso Ramos | Antonia Martí

pdf bib
FrameNet as a “Net”
Charles J. Fillmore | Collin F. Baker | Hiroaki Sato

pdf bib
AV@CAR: A Spanish Multichannel Multimodal Corpus for In-Vehicle Automatic Audio-Visual Speech Recognition
Alfonso Ortega | Federico Sukno | Eduardo LLeida | Alejandro Frangi | Antonio Miguel | Luis Buera | Ernesto Zacur

pdf bib
Creation of a Doctor-Patient Dialogue Corpus Using Standardized Patients
Robert S. Melvin | Win May | Shrikanth Narayanan | Panayiotis Georgiou | Shadi Ganjavi

pdf bib
Talkbank: Building an Open Unified Multimodal Database of Communicative Interaction
Brian MacWhinney | Steven Bird | Christopher Cieri | Craig Martell

pdf bib
A Fine-Grained Evaluation Method for Speech-to-Speech Machine Translation Using Concept Annotations
Robert S. Belvin | Susanne Riehemann | Kristin Precoda

pdf bib
Rethinking Reusable Resources
David M. de Matos | Ricardo Ribeiro | Nuno J. Mamede

pdf bib
The Cross-Breeding of Dictionaries
Adam Meyers | Ruth Reeves | Catherine Macleod | Rachel Szekely | Veronika Zielinska | Brian Young

pdf bib
Annotating Noun Argument Structure for NomBank
Adam Meyers | Ruth Reeves | Catherine Macleod | Rachel Szekely | Veronika Zielinska | Brian Young | Ralph Grishman

pdf bib
Concept-based Queries: Combining and Reusing Linguistic Corpus Formats and Query Languages
Felix Sasaki | Andreas Witt | Dafydd Gibbon | Thorsten Trippel

pdf bib
Co-reference in Japanese Task-oriented Dialogues: A Contribution to the Development of Language-specific and Language-general Annotation Schemes and Resources
Felix Sasaki | Andreas Witt

pdf bib
Constructing Word-Sense Association Networks from Bilingual Dictionary and Comparable Corpora
Hiroyuki Kaji | Osamu Imaichi

A novel thesaurus named a gword-sense association networkh is proposed for the first time. It consists of nodes representing word senses, each of which is defined as a set consisting of a word and its translation equivalents, and edges connecting topically associated word senses. This word-sense association network is produced from a bilingual dictionary and comparable corpora by means of a newly developed fully automatic method. The feasibility and effectiveness of the method were demonstrated experimentally by using the EDR English-Japanese dictionary together with Wall Street Journal and Nihon Keizai Shimbun corpora. The word-sense association networks were applied to word-sense disambiguation as well as to a query interface for information retrieval.

pdf bib
Utilization of Multiple Language Resources for Robust Grammar-Based Tense and Aspect Classification
Alexis Palmer | Jonas Kuhn | Carlota Smith

This paper reports on an ongoing project that uses varied language resources and advanced NLP tools for a linguistic classification task in discourse semantics. The system we present is designed to assign a "situation entity" class label to each predicator in English text. The project goal is to achieve the best-possible identification of situation entities in naturally-occurring written texts by implementing a robust system that will deal with real corpus material, rather than just with constructed textbook examples of discourse. In this paper we focus on the combination of multiple information sources, which we see as being vital for a robust classification system. We use a deep syntactic grammar of English to identify morphological, syntactic, and discourse clues, and we use various lexical databases for fine-grained semantic properties of the predicators. Experiments performed to date show that enhancing the output of the grammar with information from lexical resources improves recall but lowers precision in the situation entity classification task.

pdf bib
Retrieving Annotated Corpora for Corpus Annotation
Kyôsuke Yoshida | Taiichi Hashimoto | Takenobu Tokunaga | Hozumi Tanaka

This paper introduces a tool, Bonsai, which supports humans in annotating corpora with morphosyntactic information and in retrieving syntactic structures stored in the database. Integrating annotation and retrieval enables users to annotate a new instance while looking back at already annotated sentences which share a similar morphosyntactic structure. We focus on the retrieval part of the system, and describe a method to decompose a large input query into smaller ones in order to gain retrieval efficiency. The proposed method is evaluated with the Penn Treebank corpus, showing significant improvements.

pdf bib
Classification of Japanese Spatial Nouns
Takenobu Tokunaga | Tomofumi Koyama | Suguru Saito | Masayuki Nakajima

We have already proposed a framework to represent a location in terms of both symbolic and numeric aspects. In order to deal with vague linguistic expressions of a location, the representation adopts a potential function mapping a location to its plausibility. This paper proposes a classification of Japanese spatial nouns and potential functions corresponding to each class. We focused on a common Japanese spatial expression "X no Y (Y of X)", where X is a reference object and Y is a spatial noun. For example, "tukue no migi (the right of the desk)" denotes a location with reference to the desk. Instances of this expression were collected from corpora, and the spatial nouns appearing in the Y position were classified into two major classes: those designating a part of the reference object and those designating a location apart from the reference object. The latter class was further classified into two subclasses: direction-oriented and distance-oriented. For each class, a potential function was designed to provide the meaning of the spatial nouns.

pdf bib
Meaningful Clusters
Antonio Sanfilippo | Gus Calapristi | Vernon Crow | Beth Hetzler | Alan Turner

We present an approach to the disambiguation of cluster labels that capitalizes on the notion of semantic similarity to assign WordNet senses to cluster labels. The approach provides interesting insights on how document clustering can provide the basis for developing a novel approach to word sense disambiguation.

pdf bib
Multi-Document Summarization Using Multiple-Sequence Alignment
V. Finley Lacatusu | Steven J. Maiorano | Sanda M. Harabagiu

This paper describes a novel clustering-based text summarization system that uses Multiple Sequence Alignment to improve the alignment of sentences within topic clusters. While most current clustering-based summarization systems base their summaries only on the common information contained in a collection of highly-related sentences, our system constructs more informative summaries that incorporate both the redundant and unique contributions of the sentences in the cluster. When evaluated using ROUGE, the summaries produced by our system represent a substantial improvement over the baseline, which is at 63% of the human performance.

pdf bib
RevisionBank: A Resource for Revision-based Multi-document Summarization and Evaluation
Jahna Otterbacher | Dragomir Radev

Multi-document summaries produced via sentence extraction often suffer from a number of cohesion problems, including dangling anaphora, sudden shifts in topic and incorrect or awkward chronological ordering. Therefore, the development of an automated revision process to correct such problems is a research area of current interest. We present the RevisionBank, a corpus of 240 extractive, multi-document summaries that have been manually revised to promote cohesion. The summaries were revised by six linguistics students using a constrained set of revision operations that we previously developed. In the current paper, we describe the process of developing a taxonomy of cohesion problems and corrective revision operators that address such problems, as well as an annotation schema for our corpus. Finally, we discuss how our taxonomy and corpus can be used for the study of revision-based multi-document summarization as well as for summary evaluation.

pdf bib
The Lácio-Web: Corpora and Tools to Advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools
Sandra Aluisio | Gisele Montilha Pinheiro | Aline M. P. Manfrin | Leandro H. M. de Oliveira | Luiz C. Genoves, Jr. | Stella E. O. Tagnin

In this paper we discuss the five requirements for building large publicly available corpora which guided the construction of the Lácio-Web corpora and their environments: 1) a comprehensive text typology; 2) text copyright clearance, compilation and annotation scheme; 3) a friendly and didactic interface; 4) the need to serve as support for several types of research; 5) the need to offer an array of associated tools. We also present the features that make the Lácio-Web corpora interesting and novel, as well as the limitations of this project, such as corpora size and balance, and the non-inclusion of spoken texts in the project’s reference corpus.

pdf bib
CST Bank: A Corpus for the Study of Cross-document Structural Relationships
Dragomir Radev | Jahna Otterbacher | Zhu Zhang

Clusters of multiple news stories related to the same topic exhibit a number of interesting properties. For example, when documents have been published at various points in time or by different authors or news agencies, one finds many instances of paraphrasing, information overlap and even contradiction. The current paper presents the Cross-document Structure Theory (CST) Bank, a collection of multi-document clusters in which pairs of sentences from different documents have been annotated for cross-document structure theory relationships. We will describe how we built the corpus, including our method for reducing the number of sentence pairs to be annotated by our hired judges, using lexical similarity measures. Finally, we will describe how CST and the CST Bank can be applied to different research areas such as multi-document summarization.

pdf bib
Applying Computational Linguistic Techniques in a Documentary Project for Q’anjob’al (Mayan, Guatemala)
Jonas Kuhn | B’alam Mateo-Toledo

This paper reports on a number of experiments in which we applied standard techniques from NLP in the context of documentation of endangered languages. We concentrated on the use of existing, freely available toolkits. Specifically, we explore the use of Finite-State Morphological Analysis, Maximum Entropy Part-of-Speech Tagging, and N-Gram Language Modeling.

pdf bib
Information Retrieval System Using Latent Contextual Relevance
Minoru Sasaki | Hiroyuki Shinnou

When relevance feedback, one of the most popular information retrieval models, is used in an information retrieval system, related words are extracted based on the first retrieval result. These words are then added to the original query, and retrieval is performed again using the updated query. In general, retrieval performance with such query expansion falls in comparison with the performance using the original query. One cause is that there are few synonyms in the thesaurus, and even when some synonyms are added to the query, the same documents are retrieved as a result. In this paper, to solve this problem with related words, we propose latent context relevance, which takes into account the relevance between the query and each index word in the document set.

pdf bib
Toward Text Understanding: Integrating Relevance-tagged Corpus and Automatically Constructed Case Frames
Daisuke Kawahara | Ryohei Sasano | Sadao Kurohashi

This paper proposes a wide-range anaphora resolution system toward text understanding. This system resolves zero, direct and indirect anaphors in Japanese texts by integrating two sorts of linguistic resources: a hand-annotated corpus with various relations and automatically constructed case frames. The corpus has relevance tags which consist of predicate-argument relations, relations between nouns and coreferences, and is utilized for learning parameters of the system and testing it. The case frames are indispensable knowledge both for detecting zero/indirect anaphors and estimating appropriate antecedents. Our preliminary experiments showed promising results.

pdf bib
Lexical Analysis of Agglutinative Languages Using a Dictionary of Lemmas and Lexical Transducers
Sun-Mee Bae | Key-Sun Choi

This paper presents a simple method for performing a lexical analysis of agglutinative languages like Korean, which have a heavy morphology. In particular, for nouns and adverbs with regular morphological modifications and/or high productivity, we do not need to artificially construct huge dictionaries of all inflected forms of lemmas. To construct a dictionary of lemmas and lexical transducers, we first automatically construct a dictionary of all inflected forms from the KAIST POS-Tagged Corpus. Secondly, we separate the lemmas from the sequences of inflectional suffixes. Thirdly, we describe their lexical transducers (i.e., morphological rules) to recognize all inflected forms of lemmas for nouns and adverbs according to the combinatorial restrictions between lemmas and their inflectional suffixes. Finally, we evaluate the advantages of this method.

pdf bib
Evaluation and Adaptation of a Specialised Language Checking Tool for Non-specialised Machine Translation and Non-expert MT Users for Multi-lingual Telecooperation
Rita Nüebel

Style guides or writing recommendations play an important role in the field of technical documentation production, e.g. in industrial contexts. Also, writing recommendations are used in technical contexts together with machine translation (MT) in order to circumvent the MT system's weaknesses. This paper describes the evaluation and adaptation of a language checker deployed in the project int.unity. In this project, both MT and a specialised language checker were adapted to the requirements of non-expert users and a non-technical domain. The language technology was integrated with the groupware platform BSCW to support the multi-lingual communication of geographically distributed teams concerned with trade union work. The users' languages were either German or English, i.e. the users were monolingual. We chose linguatec's server version of the Personal Translator 2004 MT system for the German<->English translations. The language checker CLAT for German and English has been developed at IAI. It is used by technical authors to support the production of high-quality technical documentation. The CLAT core system was adapted and extended in order to match the new requirements imposed by both the user profile and the subsequent MT application. In this paper, the focus will be on the assessment and adaptation of style rules for German.

pdf bib
A Critical Survey of the Methodology for IE Evaluation
A. Lavelli | M. E. Califf | F. Ciravegna | D. Freitag | C. Giuliano | N. Kushmerick | L. Romano

We survey the evaluation methodology adopted in Information Extraction (IE), as defined in the MUC conferences and in later independent efforts applying machine learning to IE. We point out a number of problematic issues that may hamper the comparison between results obtained by different researchers. Some of them are common to other NLP tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Issues specific to IE evaluation include: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an information extraction task, a number of characteristics should be clearly defined. However, in the papers only a few of them are usually explicitly specified. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. The goal is to reach a widespread agreement on such proposal so that future IE evaluations will adopt the proposed methodology, making comparisons between algorithms fair and reliable. In order to achieve this goal, we will develop and make available to the community a set of tools and resources that incorporate a standardized IE methodology.

pdf bib
Enriching WordNet Via Generative Metonymy and Creative Polysemy
Jer Hayes | Tony Veale | Nuno Seco

Metonymy is a creative process that establishes relationships based on contiguity or semantic relatedness between concepts. We outline a mechanism for deriving new concepts from WordNet using metonymy. We argue that by exploiting polysemy in WordNet we can take advantage of the metonymic relations between concepts. The focus of our metonymy generation work has been the creation of noun-noun compounds that do not already exist in WordNet and which can be profitably added to WordNet. The mechanism of metonymy generation we outline takes a source compound and creates new compounds by exploiting the polysemy associated with hyponyms of the head of the source compound. We argue that metonymy generation is a sound basis for concept creation as the newly created compounds are semantically related to the source concept. We demonstrate that metonymy generation based on polysemy is superior to a method of metonymy generation that ignores polysemy. These new concepts can be used to augment WordNet.

pdf bib
Evaluation and Adaptation of the Celex Dutch Morphological Database
Tom Laureys | Guy De Pauw | Hugo Van hamme | Walter Daelemans | Dirk Van Compernolle

This paper describes some important modifications to the Celex morphological database in the context of the FLaVoR project. FLaVoR aims to develop a novel modular framework for speech recognition, enabling the integration of complex linguistic knowledge sources, such as a morphological model. Morphology is a fairly unexploited linguistic information source speech recognizers could benefit from. This is especially true for languages which allow for a rich set of morphological operations, such as our target language Dutch. In this paper we focus on the exploitation of the Celex Dutch morphological database as the information source underlying two different morphological analyzers being developed within the project. Although the Celex database provides a valuable source of morphological information for Dutch, many modifications were necessary before it could be practically applied. We identify major problems, discuss the implemented solutions and finally experimentally evaluate the effect of our modifications to the database.

pdf bib
A Model of Semantic Representations Analysis for Chinese Sentences
Li Tang | Donghong Ji | Lingpeng Yang | Yu Nie

pdf bib
A Comparison of Two Variant Corpora: The Same Content with Different Source
Kyonghee Paik | Kiyonori Ohtake | Kazuhide Yamamoto

In order to investigate the effect of source language on translations, we investigate two variants of a Korean translation corpus. The first variant consists of Korean translations of 162,308 Japanese sentences from the ATR BTEC (Basic Expression Text Corpus). The second variant was made by translating the English translations of the Japanese sentences into Korean. We show that the source language text has a large influence on the target text. Even after normalizing orthographic differences, fewer than 8.3% of the sentences in the two variants were identical. We describe in general which phenomena differ and then discuss how our analysis can be used in natural language processing.

pdf bib
Training a Sentence-Level Machine Translation Confidence Measure
Christopher B. Quirk

We present a supervised method for training a sentence-level confidence measure on translation output using a human-annotated corpus. We evaluate a variety of machine learning methods. The resultant measure, while trained on a very small dataset, correlates well with human judgments, and proves to be effective on one task-based evaluation. Although the experiments have only been run on one MT system, we believe the features gathered are general enough that the approach will also work well on other systems.

pdf bib
Software Tools for Morphological Tagging of Zulu Corpora and Lexicon Development
Sonja E. Bosch | Laurette Pretorius

The aim of this paper is to discuss aspects of an on-going project on the development of grammatical and lexical resources for Zulu with sufficient coverage for unrestricted text. We explain how the basic software tools of computational morphology are used in linguistic processing, more specifically for automatic word form recognition and morphological tagging of the growing stock of electronic text corpora of a Bantu language such as Zulu. It is also shown how a machine-readable lexicon is in turn enhanced with the information acquired and extracted by means of such corpus analysis.

pdf bib
Improving Collocation Extraction for High Frequency Words
David Wible | Chin-Hwa Kuo | Nai-Lung Tsao

The purpose of this paper is to introduce an alternative word association measure aimed at addressing the under-extraction of collocations that contain high frequency words. While measures such as MI provide the important contribution of filtering out the sheer high frequency of words in the detection of collocations in large corpora, one side effect of this filtering is that it becomes correspondingly difficult for such measures to detect true collocations involving high frequency words. As an alternative, we propose normalizing the MI measure by dividing the frequency of a candidate lexeme by the number of senses of that lexeme. We premise this alternative approach on the one sense per collocation assumption of Yarowsky (1992; 1995). Ten verb-noun collocations involving three high frequency verbs (make, take, run) are used to compare the extraction results of traditional MI and the proposed normalized MI. Results show the ranking of these high-frequency verbs as candidate collocates with the target focal nouns is raised by normalizing MI as proposed. Side effects of these improved rankings are discussed, such as an increase in false positives resulting from higher recall. It is found that overall rank precision remains quite stable even with the increased recall of normalized MI.
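
The normalization described in this abstract can be illustrated with a minimal sketch. It assumes that MI is computed as pointwise mutual information from raw corpus counts and that the sense count divides the candidate lexeme's frequency in the denominator; the abstract does not give the exact formula, so the function names, parameters, and this reading of the normalization are hypothetical.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) from raw corpus counts."""
    return math.log2((pair_count * total) / (w1_count * w2_count))


def sense_normalized_pmi(pair_count, w1_count, w2_count, total, w1_senses):
    """PMI with the candidate lexeme's frequency divided by its sense count.

    Assumption: the normalization replaces freq(w1) by freq(w1) / senses(w1);
    this is one reading of the abstract, not the authors' stated formula.
    """
    return pmi(pair_count, w1_count / w1_senses, w2_count, total)


if __name__ == "__main__":
    # Invented counts for a high-frequency verb paired with a noun.
    plain = pmi(10, 1000, 100, 1_000_000)
    normalized = sense_normalized_pmi(10, 1000, 100, 1_000_000, w1_senses=8)
    print(plain, normalized)
```

Under this reading, a lexeme with k senses gains log2(k) in score, so highly polysemous high-frequency verbs move up the ranking, which is consistent with the effect the abstract reports.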

pdf bib
Annotation of Coreference Relations Among Linguistic Expressions and Images in Biological Articles
Ai Kawazoe | Asanobu Kitamoto | Nigel Collier

In this paper, we propose an annotation scheme which can be used not only for annotating coreference relations between linguistic expressions, but also those among linguistic expressions and images, in scientific texts such as biomedical articles. Images in the biomedical domain often contain important information for analyses and diagnoses, and we consider that linking images to textual descriptions of their semantic contents in terms of coreference relations is useful for multimodal access to the information. We present our annotation scheme and the concept of a "coreference pool," which plays a central role in the scheme. We also introduce a support tool for text annotation named Open Ontology Forge, which we have already developed, and additional functions for the software to cover image annotations (ImageOF), which are now being developed.

pdf bib
Evaluation of Cross-Language Information Retrieval Using the Domain-Specific GIRT Data as Parallel German-English Corpus
Michael Kluck

The development of the evaluation of domain-specific cross-language information retrieval (CLIR) is shown in the context of the Cross-Language Evaluation Forum (CLEF) campaigns from 2000 to 2003. The pre-conditions and the usable data and additionally available instruments are described. The main goals of this task of CLEF are to allow the evaluation of Cross-Language Information Retrieval (CLIR) systems in the context of structured data and in a domain-specific area (not in the more general context of floating, journalistic texts), and with the additional possibility to make use of thesauri which had been used for intellectual indexing of the documents and are provided with the data. The parallel German-English GIRT4 corpus is described and some of the results of the CLEF 2004 campaign are discussed.

pdf bib
Generating Coreferential Descriptions from a Structured Model of the Context
Hélène Manuélian

This paper shows on the basis of a corpus study how a model of the context should be structured for the generation of coreferring descriptions in French. We show that this way of structuring the context can help to generate more paraphrases and a particular kind of referring expressions used to add information about the referent.

pdf bib
Open Collaborative Development of the Thai Language Resources for Natural Language Processing
Thatsanee Charoenporn | Virach Sornlertlamvanich | Sawit Kasuriya | Chatchawarn Hansakunbuntheung | Hitoshi Isahara

pdf bib
Automatic Translation Memory Fuzzy Match Post-Editing: A Step Beyond Traditional TM/MT Integration
Lambros Kranias | Anna Samiotou

An innovative way of integrating Translation Memory (TM) and Machine Translation (MT) processing is presented which goes beyond the traditional "cascade" integration of Translation Memory and Machine Translation. The new method aims to automatically post-edit TM similar matches by the use of an MT module thus enhancing the TM fuzzy (similar) scores as well as enabling the utilisation of low-score TM fuzzy matches. This leads to substantial translation cost reduction. The suggested method, which can be classified as an Example-Based Machine Translation application, is analysed and examples are provided for clarification. It is evaluated through test results that involve human interaction. The method has been implemented within the ESTeam Translator (ET) Language Toolbox and is already in use in the various commercial installations of ET.

pdf bib
Linguistic Annotation of the Spoken Dutch Corpus: If We Had To Do It All Over Again
Ineke Schuurman | Wim Goedertier | Heleen Hoekstra | Nelleke Oostdijk | Richard Piepenbrock | Machteld Schouppe

After the successful completion of the Spoken Dutch Corpus (1998–2003) the time is ripe to take some time to sit back and reflect on our achievements and the procedures underlying them in order to learn from our experiences. In this paper we will in particular pay attention to issues affecting the levels of linguistic annotation, but some more general issues deserve to be treated as well (bug reporting, consistency). We will try to come up with solutions, but sometimes we want to invite further discussion from other researchers.

pdf bib
Combining Symbolic and Statistical Methods in Morphological Analysis and Unknown Word Guessing
Attila Novák | Viktor Nagy | Csaba Oravecz

Highly inflectional/agglutinative languages like Hungarian typically feature possible word forms in such a magnitude that automatic methods that provide morphosyntactic annotation on the basis of some training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyser, which is able to analyse almost all wordforms alleviating the problem of unseen tokens. However, although in a smaller number, there will still remain forms which are unknown even to the morphological analyzer and should be handled by some guesser mechanism. The paper will describe a hybrid method which combines symbolic and statistical information to provide lemmatization and suffix analyses for unknown word forms. Evaluation is carried out with respect to the induction of possible analyses and their respective lexical probabilities for unknown word forms in a part-of-speech tagging system.

pdf bib
A New Approach to the Corpus-based Statistical Investigation of Hungarian Multi-word Lexemes
Balázs Kis | Begoña Villada | Gosse Bouma | Gábor Ugray | Tamás Bíró | Gábor Pohl | John Nerbonne

pdf bib
Discarding Noise in an Automatically Acquired Lexicon of Support verb Constructions
M. Begoña Villada Moirón

We applied data-driven methods to carry out automatic acquisition of Dutch prepositional support verb constructions (SVCs) in corpora (e.g., iets in de gaten houden, "keep an eye on something"). This paper addresses the question of whether linguistic diagnostics help to discard noise from the n-best lists and how to (semi-)automatically apply such linguistic diagnostics to parsed corpora. We show that some of the linguistic diagnostics proposed in Hollebrandse (1993) effectively identify SVCs and contribute a modest error rate decrease.

pdf bib
Translation Memories Enrichment by Statistical Bilingual Segmentation
Francisco Nevado | Francisco Casacuberta | Josu Landa

A majority of Machine Aided Translation systems are based on comparisons between a source sentence and reference sentences stored in Translation Memories (TMs). The translation search is done by looking for sentences in a database which are similar to the source sentence. TMs have two basic limitations: the dependency on the repetition of complete sentences and the high cost of building a TM. As human translators do not only remember sentences from their preceding translations, but they also decompose the sentence to be translated and work with smaller units, it would be desirable to enrich the TM database with smaller translation units. This enrichment should also be automatic in order not to increase the cost of building a TM. We propose the application of two automatic bilingual segmentation techniques based on statistical translation methods in order to create new, shorter bilingual segments to be included in a TM database. An evaluation of the two techniques is carried out for a bilingual Basque-Spanish task.

pdf bib
The African Speech Technology Project: An Assessment
J. C. Roux | P. H. Louw | T. R. Niesler

This paper reflects on the recently completed African Speech Technology (AST) Project. The AST Project successfully developed eleven annotated telephone speech databases for five languages spoken in South Africa, i.e., Xhosa, Southern Sotho, Zulu, English and Afrikaans. These databases were used to train and test speech recognition systems applied in a multilingual telephone-based prototype hotel booking system. An overview is given of the database design and contents. The acquisition of the data is discussed with regard to the telephony interface, as well as speaker recruitment and briefing. Particular reference is made to some of the practical implications of acquiring appropriate data in under-developed communities. Database management processes such as transcription, quality control and validation are explained. This is followed by information on the development of the prototype. Results of usability tests are discussed, followed by an assessment of the Project as a whole.

pdf bib
Automatic Phonemic Labeling and Segmentation of Spoken Dutch
Kris Demuynck | Tom Laureys | Patrick Wambacq | Dirk Van Compernolle

The CGN corpus (Corpus Gesproken Nederlands/Spoken Dutch Corpus) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select the most likely one. Next, we identify the remaining difficulties when handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.

pdf bib
Using Large Multi-purpose Corpora for Specific Research Questions: Discourse Phenomena Related to Wh-questions in the Spoken Dutch Corpus
Nelleke Oostdijk | Lou Boves

In this paper, we investigate whether a dataset derived from a multi-purpose corpus such as the Spoken Dutch Corpus may be considered appropriate for developing a taxonomy of wh-questions, and a model of the way in which these questions are integrated in spoken discourse. We compare the results obtained from the Spoken Dutch Corpus with a similar analysis of a large random collection of FAQs from the internet. We find substantial differences between the questions in spoken discourse and FAQs. Therefore, it may not be trivial to use a general purpose corpus as a starting point for developing models for human-computer interaction.

pdf bib
Methods of Digital Access for Legal Language Documentation
Paola Mariani | Costanza Badii

For many years the Istituto di Teoria e Tecniche dell'Informazione Giuridica (ITTIG) of the Consiglio Nazionale delle Ricerche has studied the evolution of legal language, creating databases for documentation and digital retrieval of law texts. The ITTIG aims to document legal language through information technology in order to provide as wide an access as possible to its findings. The Institute has recently created an on-line digital database that includes the full text of the most important Italian laws (Codes and Constitutions) from the 16th to the 20th century. The ITTIG is also in the process of preparing another database made up of contexts from original legal sources of the 10th to the 20th century.

pdf bib
Architecture for Distributed Language Resource Management and Archiving
Peter Wittenburg | Heidi Johnson | Markus Buchhorn | Hennie Brugman | Daan Broeder

An architecture is presented that provides an integrated framework for managing, archiving and accessing language resources. This architecture was discussed in the DELAMAN network, a world-wide network of archives holding material about endangered languages. Such a framework will be built upon a metadata infrastructure, a mechanism to resolve unique resource identifiers, and user and access rights management components. These components are closely related and have to be based on redundant and distributed services. Existing middleware seems to be available for all these components; however, it has to be checked how they can interact with each other.

pdf bib
Creation and Validation of Large Lexica for Speech-to-Speech Translation Purposes
Hanne Fersøe | Elviira Hartikainen | Henk van den Heuvel | Giulio Maltese | Asunción Moreno | Shaunie Shammass | Ute Ziegenhain

This paper presents specifications and requirements for the creation and validation of large lexica that are needed in Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and statistical Speech-to-Speech Translation (SST) systems. The language resources are created and validated within the scope of the EU project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) during the years 2002-2005. Large lexica consisting of phonetic, suprasegmental and morpho-syntactic content will be provided with well-documented specifications for 13 languages. A short summary of the LC-STAR project itself is presented. An overview of the specification for the corpora collection and word extraction, as well as the specification and format of the lexica, is also presented. Particular attention is paid to the validation of the produced lexica and the lessons learnt during pre-validation. The created and validated language resources will be available via ELRA/ELDA.

pdf bib
Enlarging the Croatian Morphological Lexicon by Automatic Lexical Acquisition from Raw Corpora
Antoni Oliver | Marko Tadić

This paper presents experiments for enlarging the Croatian Morphological Lexicon by applying an automatic acquisition methodology. The basic sources of information for the system are a set of morphological rules and a raw corpus. The morphological rules have been automatically derived from the existing Croatian Morphological Lexicon, and in our experiments we have used a subset of the Croatian National Corpus. The methodology has proved to be efficient for languages that, like Croatian, have a rich and mainly concatenative morphology. This method can be applied to the creation of new resources, as well as to the enrichment of existing ones. We also present an extension of the system that uses automatic querying of the Internet to acquire those entries for which we do not have enough information in our corpus.

pdf bib
Learning to Predict Pitch Accents Using Bayesian Belief Networks for Greek Language
Panagiotis Zervas | Manolis Maragoudakis | Nikos Fakotakis | George Kokkinakis

pdf bib
A Grammar and Style Checker Based on Internet Searches
Joaquim Moré | Salvador Climent | Antoni Oliver

pdf bib
Cross-Disciplinary Integration of Metadata Descriptions
Peter Wittenburg | Greg Gulrajani | Daan Broeder | Marcus Uneson

pdf bib
Representing Italian Complex Nominals: A Pilot Study
Valeria Quochi

pdf bib
Text Corpora, Local Grammars and Prediction
Hayssam Traboulsi | David Cheng | Khurshid Ahmad

pdf bib
SMOR: A German Computational Morphology Covering Derivation, Composition and Inflection
Helmut Schmid | Arne Fitschen | Ulrich Heid

pdf bib
The Overview of the SST Speech Corpus of Japanese Learner English and Evaluation Through the Experiment on Automatic Detection of Learners’ Errors
Emi Izumi | Kiyotaka Uchimoto | Hitoshi Isahara

pdf bib
Dynamic Lexicographic Data Modelling. A Diachronic Dictionary Development Report
Paul Gévaudan | Dirk Wiebel

pdf bib
Re-using High-quality Resources for Continued Evaluation of Automated Summarization Systems
Laura Alonso | Maria Fuentes | Marc Massot | Horacio Rodríguez

pdf bib
Corpus-based Learning of Lexical Resources for German Named Entity Recognition
Marc Rössler

pdf bib
Collaborative Annotation of Sign Language Data with Peer-to-Peer Technology
Hennie Brugman | Onno Crasborn | Albert Russel

pdf bib
Semantic Categorization of Spanish Se-constructions
Glòria Vázquez | Ana Fernández Montraveta | Irene Castellón | Laura Alonso

pdf bib
Web Services Architecture for Language Resources
Angelo Dalli | Valentin Tablan | Kalina Bontcheva | Yorick Wilks | Daan Broeder | Hennie Brugman | Peter Wittenburg

pdf bib
A Large Metadata Domain of Language Resources
Daan Broeder | Thierry Declerck | Laurent Romary | Markus Uneson | Sven Strömqvist | Peter Wittenburg

pdf bib
MetaMorpho TM: A Rule-Based Translation Corpus
Tamás Gröbler | Gábor Hodász | Balázs Kis

pdf bib
Annotating Multi-media/Multi-modal Resources with ELAN
Hennie Brugman | Albert Russel

pdf bib
Annotation of Anaphoric Expressions in an Aligned Bilingual Corpus
Agnès Tutin | Meriam Haddara | Ruslan Mitkov | Constantin Orasan

pdf bib
Unexpected Productions May Well be Errors
Tylman Ule | Kiril Simov

pdf bib
A Framework for Evaluating the Suitability of Non-English Corpora for Language Engineering
Avik Sarkar | Anne De Roeck

pdf bib
Intelligent Building of Language Resources for HLT Applications
Anna Samiotou | Lambros Kranias | Dimitrios Kokkinakis

pdf bib
Collecting Spontaneously Spoken Queries for Information Retrieval
Tomoyosi Akiba | Atsushi Fujii | Katunobu Itou

pdf bib
Multilingual Pattern Libraries for Question Answering: a Case Study for Definition Questions
Hristo Tanev | Milen Kouylekov | Matteo Negri | Bonaventura Coppola | Bernardo Magnini

pdf bib
Automatic Transformation of Phrase Treebanks to Dependency Trees
Michael Daum | Kilian A. Foth | Wolfgang Menzel

pdf bib
Computational Lexicography and Carlo Emilio Gadda, Principe dell’Analisi e Duca della Buona Cognizione
Maria Luigia Ceccotti | Manuela Sassi

pdf bib
An Annotation Scheme for a Rhetorical Analysis of Biology Articles
Yoko Mizuta | Nigel Collier

pdf bib
Textual Distraction as a Basis for Evaluating Automatic Summarisers
Antoinette Renouf | Andrew Kehoe

pdf bib
Verb Valency Descriptors for a Syntactic Treebank
Milena Slavcheva

pdf bib
Integrated Language Technologies for Multilingual Information Services in the MEMPHIS Project
Walter Kasper | Jörg Steffen | Jakub Piskorski | Paul Buitelaar

pdf bib
Automatic Generation of Compound Word Lexicon for Hindi Speech Synthesis
S.R. Deepa | Kalika Bali | A.G. Ramakrishnan | Partha Pratim Talukdar

pdf bib
Summarization of Multimodal Information
Saif Ahmad | Paulo C. F. de Oliveira | Khurshid Ahmad

pdf bib
Design of an Interactive Web-based User Interface for Speech Database Query Formation
Toomas Altosaar | Matti Karjalainen

pdf bib
Migrating Language Resources from SGML to XML: The Text Encoding Initiative Recommendations
Syd Bauman | Alejandro Bia | Lou Burnard | Tomaž Erjavec | Christine Ruotolo | Susan Schreibman

pdf bib
Evaluating Conversation with Hans Christian Andersen
Niels Ole Bernsen | Laila Dybkjær | Svend Kiilerich

pdf bib
The New Dutch-Flemish HLT Programme: a Concerted Effort to Stimulate the HLT Sector
Catia Cucchiarini | Elisabeth D’Halleweyn

pdf bib
Related Word-pairs Extraction Without Dictionaries
Eiko Yamamoto | Kyoji Umemura

pdf bib
What is my Style? Using Stylistic Features of Portuguese Web Texts to Classify Web Pages According to Users’ Needs
Rachel Aires | Aline Manfrin | Sandra Aluísio | Diana Santos

pdf bib
BootCaT: Bootstrapping Corpora and Terms from the Web
Marco Baroni | Silvia Bernardini

pdf bib
N-Gram Language Modeling for Robust Multi-Lingual Document Classification
Jörg Steffen

pdf bib
A Word Alignment System Based on a Translation Equivalence Extractor
Ana-Maria Barbu

pdf bib
Using Profiles for IMDI Metadata Creation
Daan Broeder | Peter Wittenburg | Onno Crasborn

pdf bib
Rethinking Readability of Digital Editions — The Case of the AAC’s “Digital Brenner”
Karlheinz Mörth

pdf bib
Automatic Building Gazetteers of Co-referring Named Entities
Daniel Ferrés | Marc Massot | Muntsa Padró | Horacio Rodríguez | Jordi Turmo

pdf bib
Semi-Automatic Derivation of a French Lexicon from CLIPS
Nilda Ruimy | Pierrette Bouillon | Bruno Cartoni

pdf bib
The American National Corpus First Release
Nancy Ide | Keith Suderman

pdf bib
Identifying Morphosyntactic Preferences in Collocations
Stefan Evert | Ulrich Heid | Kristina Spranger

pdf bib
Towards General-Purpose Annotation Tools – How Far Are We Today?
Laila Dybkjær | Niels Ole Bernsen

pdf bib
Automated Morphological Segmentation and Evaluation
Uwe D. Reichel | Karl Weilhammer

pdf bib
A Registry of Standard Data Categories for Linguistic Annotation
Nancy Ide | Laurent Romary

pdf bib
A Natural Language Approach to Information Management: Tracking Scientific Advances Through the Structure of Words
Andrew Hippisley | Chara Karavasili

pdf bib
Building a Maritime Domain Lexicon: a Few Considerations on the Database Structure and the Semantic Coding
Rita Marinelli | Adriana Roventini | Alessandro Enea

pdf bib
Creating Open Language Resources for Hungarian
Péter Halácsy | András Kornai | László Németh | András Rung | István Szakadát | Viktor Trón

pdf bib
Test Collections for Patent-to-Patent Retrieval and Patent Map Generation in NTCIR-4 Workshop
Atsushi Fujii | Makoto Iwayama | Noriko Kando

pdf bib
Part-of-Speech Annotation of Biology Research Abstracts
Yuka Tateisi | Jun-ichi Tsujii

pdf bib
Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian
Božo Bekavac | Petya Osenova | Kiril Simov | Marko Tadić

pdf bib
Corporate Voice, Tone of Voice and Controlled Language Techniques
Lina Henriksen | Bart Jongejan | Bente Maegaard

pdf bib
Cypriot Speech Database: Data Collection and Greek to Cypriot Dialect Adaptation
Nikos Fakotakis

pdf bib
Automatic Extraction of Syntactic Semantic Patterns for Multilingual Resources
Borja Navarro | Manuel Palomar | Patricio Martínez-Barco

pdf bib
The Integral Dictionary: An Ontological Resource for the Semantic Web: Integration of EuroWordNet, Balkanet, TID, and SUMO
Dominique Dutoit | Pierre Nugues | Patrick de Torcy

pdf bib
Categorizing Web Pages as a Preprocessing Step for Information Extraction
Viktor Pekar | Richard Evans | Ruslan Mitkov

pdf bib
A Framework for Data-driven Video-realistic Audio-visual Speech-synthesis
Christian Weiss

pdf bib
Corpus Based Enrichment of GermaNet Verb Frames
Manuela Kunze | Dietmar Rösner

pdf bib
Semi-automatic Acquisition of Command Grammar
Thierry Poibeau | Bénédicte Goujon

pdf bib
Towards a Language Infrastructure for the Semantic Web
Thierry Declerck | Paul Buitelaar | Nicoletta Calzolari | Alessandro Lenci

pdf bib
Conversational Telephone Speech Corpus Collection for the NIST Speaker Recognition Evaluation 2004
Alvin Martin | David Miller | Mark Przybocki | Joseph Campbell | Hirotaka Nakasone

pdf bib
Augmenting Manual Dictionaries for Statistical Machine Translation Systems
Stephan Vogel | Christian Monson

pdf bib
Linguistic Corpus Search
Christian Biemann | Uwe Quasthoff | Christian Wolff

pdf bib
ENABLER Thematic Network of National Projects: Technical, Strategic and Political Issues of LRs
Nicoletta Calzolari | Khalid Choukri | Maria Gavrilidou | Bente Maegaard | Paola Baroni | Hanne Fersøe | Alessandro Lenci | Valérie Mapelli | Monica Monachini | Stelios Piperidis

pdf bib
The Influence of the Labeller’s Regional Background on Phonetic Transcriptions: Implications for the Evaluation of Spoken Language Resources
Evie Coussé | Steven Gillis | Hanne Kloots | Marc Swerts

pdf bib
Evaluation Resources for Concept-based Cross-Lingual Information Retrieval in the Medical Domain
Paul Buitelaar | Diana Steffen | Martin Volk | Dominic Widdows | Bogdan Sacaleanu | Špela Vintar | Stanley Peters | Hans Uszkoreit

pdf bib
Automatic Acquisition of Paradigmatic Relations Using Iterated Co-occurrences
Chris Biemann | Stefan Bordag | Uwe Quasthoff

pdf bib
Towards Ontology Engineering Based on Linguistic Analysis
Paul Buitelaar | Daniel Olejnik | Mihaela Hutanu | Alexander Schutz | Thierry Declerck | Michael Sintek

pdf bib
OrienTel - Telephony Databases Across Northern Africa and the Middle East
Dorota Iskra | Rainer Siemund | Jamal Borno | Asuncion Moreno | Ossama Emam | Khalid Choukri | Oren Gedge | Herbert Tropf | Albino Nogueiras | Imed Zitouni | Anastasios Tsopanoglou | Nikos Fakotakis

pdf bib
ELRA Validation Methodology and Standard Promotion for Linguistic Resources
Hanne Fersøe | Monica Monachini

pdf bib
The AAC [Austrian Academy Corpus] – An Enterprise to Develop Large Electronic Text Corpora
Hanno Biber | Evelyn Breiteneder

pdf bib
Improving Automatic Phonetic Transcription of Spontaneous Speech Through Variant-Based Pronunciation Variation Modelling
Diana Binnenpoorte | Catia Cucchiarini | Helmer Strik | Lou Boves

pdf bib
A General-Purpose, Off-the-shelf Anaphora Resolution Module: Implementation and Preliminary Evaluation
Massimo Poesio | Mijail A. Kabadjov

pdf bib
Building a Conceptual Graph Bank for Chinese Language
Donghong Ji | Li Tang | Lingpeng Yang

pdf bib
Enriching a French Treebank
Anne Abeillé | Nicolas Barrier

pdf bib
French-English Multi-word Term Alignment Based on Lexical Context Analysis
Béatrice Daille | Samuel Dufour-Kowalski | Emmanuel Morin

pdf bib
An Argumentative Annotation Schema for Meeting Discussions
Vincenzo Pallotta | Hatem Ghorbel | Patrick Ruch | Giovanni Coray

pdf bib
A Morphological Analyzer for Standard Albanian
Jochen Trommer | Dalina Kallulli

pdf bib
Generating an Arabic Full-form Lexicon for Bidirectional Morphology Lookup
Abdelhadi Soudi | Andreas Eisele

pdf bib
Orthographic and Phonetic Annotation of Very Large Czech Corpora with Quality Assessment
Petr Pollák | Jan Černocký

pdf bib
INQUER: A WordNet-based Question-Answering Application
Catarina Ribeiro | Ricardo Santos | João Correia | Rui Pedro Chaves | Palmira Marrafa

pdf bib
Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese
António Branco | João Silva

pdf bib
A High Quality Partial Parser for Annotating German Text Corpora
Stefan Klatt

pdf bib
Bayesian Semantics Incorporation to Web Content for Natural Language Information Retrieval
Manolis Maragoudakis | Nikos Fakotakis

pdf bib
Usability Evaluation of Spoken Dialogue Systems
Lars Bo Larsen

pdf bib
Enriching EWN with Syntagmatic Information by Means of WSD
Iulia Nica | Mª Antònia Martí | Andrés Montoyo | Sonia Vázquez

pdf bib
Proper Names and Polysemy: From a Lexicographic Experience
Rita Marinelli

pdf bib
Tools for Upgrading Printed Dictionaries by Means of Corpus-based Lexical Acquisition
Ulrich Heid | Bettina Säuberlich | Esther Debus-Gregor | Werner Scholze-Stubenrecht

pdf bib
Extraction of Polish Named-Entities
Jakub Piskorski

pdf bib
Automatic Acquisition of Sense Examples Using ExRetriever
Juan Fernández | Mauro Castillo | German Rigau | Jordi Atserias | Jordi Turmo

pdf bib
Combining Heterogeneous Lexical Resources
Cvetana Krstev | Duško Vitas | Ranka Stanković | Ivan Obradović | Gordana Pavlović-Lažetić

pdf bib
Spoken and Written Language Resources for Vietnamese
Viet-Bac Le | Do-Dat Tran | Eric Castelli | Laurent Besacier | Jean-François Serignat

pdf bib
Building and Using a Corpus of Shallow Dialogue Annotated Meetings
Andrei Popescu-Belis | Maria Georgescul | Alexander Clark | Susan Armstrong

pdf bib
XTERM: A Flexible Standard-Compliant XML-Based Termbase Management System
Lorenzo Piccioni | Eros Zanchetta

pdf bib
Word Sense Disambiguation Using Random Indexing
Márton Miháltz

pdf bib
Querying Both Time-aligned and Hierarchical Corpora with NXT Search
Ulrich Heid | Holger Voormann | Jan-Torsten Milde | Ulrike Gut | Katrin Erk | Sebastian Padó

pdf bib
Bypassing Greeklish!
A. Chalamandaris | P. Tsiakoulis | S. Raptis | G. Giannopoulos | G. Carayannis

pdf bib
Semi-Automatic UNL Dictionary Generation Using WordNet.PT
Catarina Ribeiro | Ricardo Santos | Rui Pedro Chaves | Palmira Marrafa

pdf bib
Bootstrapping a Database of German Multi-word Expressions
Alexander Geyken

pdf bib
A Practical Comparison of Different Filters Used in Automatic Term Extraction
Le An Ha

pdf bib
SVMTool: A General POS Tagger Generator Based on Support Vector Machines
Jesús Giménez | Lluís Màrquez

pdf bib
A Multi-Modal Documentation System for Warao
Stefanie Herrmann | Hartmut Keck | Stephan Kepser

pdf bib
The DeepThought Core Architecture Framework
Ulrich Callmeier | Andreas Eisele | Ulrich Schäfer | Melanie Siegel

pdf bib
Towards the Meaning Top Ontology: Sources of Ontological Meaning
Jordi Atserias | Salvador Climent | German Rigau

pdf bib
An Environment for Dialogue Corpora Collection (ENDIACC)
Zygmunt Vetulani

pdf bib
Development of Resources for a Bilingual Automatic Index System of Broadcast News in Basque and Spanish
G. Bordel | A. Ezeiza | K. Lopez de Ipina | M. Méndez | M. Peñagarikano | T. Rico | C. Tovar | E. Zulueta

pdf bib
An Acoustic Corpus Contemplating Regional Variation for Studies of European Portuguese Nasals
António Teixeira | Liliana Ferreira | Lurdes Moutinho | Rosa Lídia Coimbra | Raquel Lisboa

pdf bib
Experiments on Building Language Resources for Multi-Modal Dialogue Systems
Laurent Romary | Amalia Todirascu | David Langlois

pdf bib
Callisto: A Configurable Annotation Workbench
David Day | Chad McHenry | Robyn Kozierok | Laurel Riek

pdf bib
The Effect of Text Difficulty on Machine Translation Performance – A Pilot Study with ILR-Rated Texts in Spanish, Farsi, Arabic, Russian and Korean
Ray Clifford | Neil Granoien | Douglas Jones | Wade Shen | Clifford Weinstein

pdf bib
An Annotated German-Language Medical Text Corpus as Language Resource
Joachim Wermter | Udo Hahn

pdf bib
Application of the BLEU Method for Evaluating Free-text Answers in an E-learning Environment
Diana Pérez | Enrique Alfonseca | Pilar Rodríguez

pdf bib
Extraction of Hyperonymy of Adjectives from Large Corpora by Using the Neural Network Model
Kyoko Kanzaki | Qing Ma | Eiko Yamamoto | Masaki Murata | Hitoshi Isahara

pdf bib
The Penn Discourse Treebank
Eleni Miltsakaki | Rashmi Prasad | Aravind Joshi | Bonnie Webber

pdf bib
Using the Web as a Corpus for the Syntactic-Based Collocation Identification
Violeta Seretan | Luka Nerima | Eric Wehrli

pdf bib
Automatic Methods to Supplement Broad-Coverage Subcategorization Lexicons
Michael Schiehlen | Kristina Spranger

pdf bib
A Large-Scale Resource for Storing and Recognizing Technical Terminology
Henk Harkema | Robert Gaizauskas | Mark Hepple | Neil Davis | Yikun Guo | Angus Roberts | Ian Roberts

pdf bib
Evaluation of a Multimodal Dialogue System for Small-screen Devices
Holmer Hemsen

pdf bib
Web Services for Language Resources and Language Technology Applications
Christian Biemann | Stefan Bordag | Uwe Quasthoff | Christian Wolff

pdf bib
Development of New Telephone Speech Databases for French: the NEOLOGOS Project
Elisabeth Pinto | Delphine Charlet | Hélène François | Djamel Mostefa | Olivier Boëffard | Dominique Fohr | Odile Mella | Frédéric Bimbot | Khalid Choukri | Yann Philip | Francis Charpentier

pdf bib
Top Ontology as a Tool for Semantic Role Tagging
Karel Pala | Pavel Smrz

pdf bib
A Suite of Tools for Marking Up Textual Data for Temporal Text Mining Scenarios
Argyrios Vasilakopoulos | Michele Bersani | William J. Black

pdf bib
Frequent Term Distribution Measures for Dataset Profiling
Anne De Roeck | Avik Sarkar | Paul Garthwaite

pdf bib
Issues in Annotation of the Czech Spontaneous Speech Corpus in the MALACH project
Josef Psutka | Pavel Ircing | Jan Hajič | Vlasta Radová | Josef V. Psutka | William J. Byrne | Samuel Gustman

pdf bib
Ontology Evaluation Functionalities of RDF(S), DAML+OIL, and OWL Parsers and Ontology Platforms
Asunción Gómez-Pérez | M. Carmen Suárez-Figueroa

pdf bib
Word Association Norms as a Unique Supplement of Traditional Language Resources
Anna Sinopalnikova | Pavel Smrz

pdf bib
Towards a Dynamic Lexicon: Predicting the Syntactic Argument Structure of Complex Verbs
Nadine Aldinger

pdf bib
Semantic Annotating of Czech Corpus via WSD
Robert Král

pdf bib
Using the NITE XML Toolkit on the Switchboard Corpus to Study Syntactic Choice: a Case Study
Jean Carletta | Shipra Dingare | Malvina Nissim | Tatiana Nikitina

pdf bib
An Annotation Scheme for Information Status in Dialogue
Malvina Nissim | Shipra Dingare | Jean Carletta | Mark Steedman

pdf bib
Speech Recognition Simulation and its Application for Wizard-of-Oz Experiments
Alex Trutnev | Antoine Rozenknop | Martin Rajman

pdf bib
Language Modeling Using Dynamic Bayesian Networks
Murat Deviren | Khalid Daoudi | Kamel Smaïli

pdf bib
Pumping Documents Through a Domain and Genre Classification Pipeline
Udo Hahn | Joachim Wermter

pdf bib
A Hybrid Strategy For Regular Grammar Parsing
Kiril Simov | Petya Osenova

pdf bib
Cross-Language Acquisition of Semantic Models for Verbal Predicates
Jordi Atserias | Bernardo Magnini | Octavian Popescu | Eneko Agirre | Aitziber Atutxa | German Rigau | John Carroll | Rob Koeling

pdf bib
MED-TYP: A Typological Database for Mediterranean Languages
Andrea Sansò

pdf bib
A Graphical Tool for Handling Rule Grammars in Java Speech Grammar Format
Kallirroi Georgila | Nikos Fakotakis | George Kokkinakis

pdf bib
A Flexible Language Acquisition Tool Kit for Natural Language Processing
Svetlana Sheremetyeva

pdf bib
The Effect of Bias on an Automatically-built Word Sense Corpus
David Martínez | Eneko Agirre

pdf bib
Bilingual Connections for Trilingual Corpora: An XML Approach
Victoria Arranz | Núria Castell | Josep Maria Crego | Jesús Giménez | Adrià de Gispert | Patrik Lambert

pdf bib
CoGesT: a Formal Transcription System for Conversational Gesture
Thorsten Trippel | Dafydd Gibbon | Alexandra Thies | Jan-Torsten Milde | Karin Looks | Benjamin Hell | Ulrike Gut

pdf bib
Memory-based Classification of Proper Names in Norwegian
Anders Nøklestad

pdf bib
Comparative Evaluations in the Domain of Automatic Speech Recognition
Alex Trutnev | Martin Rajman

pdf bib
Consistent Storage of Metadata in Inference Lexica: the MetaLex Approach
Thorsten Trippel | Felix Sasaki | Dafydd Gibbon

pdf bib
Applying a Part-of-Speech Tagger to Postal Address Detection on the Web
Nuno Cavalheiro Marques | Sérgio Gonçalves

pdf bib
Unifying Lexicons in view of a Phonological and Morphological Lexical DB
Monica Monachini | Federico Calzolari | Michele Mammini | Sergio Rossi | Marisa Ulivieri

pdf bib
Toward an Annotation Software for Video of Sign Language, Including Image Processing Tools and Signing Space Modelling
A. Braffort | A. Choisier | C. Collet | P. Dalle | F. Gianni | F. Lenseigne | J. Segouat

pdf bib
Building Distributed Language Resources By Grid Computing
Fabio Tamburini

pdf bib
Mapping Dependency Structures to Phrase Structures and the Automatic Acquisition of Mapping Rules
Bernd Bohnet | Halyna Seniv

pdf bib
A Framework for Temporal Resolution
Georgiana Puşcaşu

pdf bib
EGRAM – A Grammar Development Environment and its Usage for Language Generation
Stephan Busemann

pdf bib
Large Scale Experiments for Semantic Labeling of Noun Phrases in Raw Text
Louise Guthrie | Roberto Basili | Fabio Zanzotto | Kalina Bontcheva | Hamish Cunningham | David Guthrie | Jia Cui | Marco Cammisa | Jerry Cheng-Chieh Liu | Cassia Farria Martin | Kristiyan Haralambiev | Martin Holub | Klaus Macherey | Fredrick Jelinek

pdf bib
Exploring Portability of Syntactic Information from English to Basque
Eneko Agirre | Aitziber Atutxa | Koldo Gojenola | Kepa Sarasola

pdf bib
Spanish WordNet 1.6: Porting the Spanish Wordnet Across Princeton Versions
Jordi Atserias | Luís Villarejo | German Rigau

pdf bib
An Annotated Corpus of Tutorial Dialogs on Mathematical Theorem Proving
Magdalena Wolska | Bao Quoc Vo | Dimitra Tsovaltzi | Ivana Kruijff-Korbayová | Elena Karagjosova | Helmut Horacek | Armin Fiedler | Christoph Benzmüller

pdf bib
Automatic Keyword Extraction from Spoken Text. A Comparison of Two Lexical Resources: EDR and WordNet
Lonneke van der Plas | Vincenzo Pallotta | Martin Rajman | Hatem Ghorbel

pdf bib
Pronominal Anaphora Resolution for Unrestricted Text
Anna Kupść | Teruko Mitamura | Benjamin Van Durme | Eric Nyberg

pdf bib
The ESTER Evaluation Campaign for the Rich Transcription of French Broadcast News
G. Gravier | J-F. Bonastre | E. Geoffrois | S. Galliano | K. McTait | K. Choukri

pdf bib
Steps Towards Semantically Annotated Language Resources
Manfred Klenner | Fabio Rinaldi | Michael Hess

pdf bib
Designing a Realistic Evaluation of an End-to-end Interactive Question Answering System
Nina Wacholder | Sharon Small | Bing Bai | Diane Kelly | Robert Rittman | Sean Ryan | Robert Salkin | Peng Song | Ying Sun | Ting Liu | Paul Kantor | Tomek Strzalkowski

pdf bib
Semi-Automatic Construction of a Question Treebank
Karin Müller

pdf bib
Calibrating Resource-light Automatic MT Evaluation: a Cheap Approach to Ranking MT Systems by the Usability of Their Output
Bogdan Babych | Debbie Elliott | Anthony Hartley

pdf bib
Multimodal, Multilingual Resources in the Subtitling Process
Stelios Piperidis | Iason Demiros | Prokopis Prokopidis | Peter Vanroose | Anja Hoethker | Walter Daelemans | Elsa Sklavounou | Manos Konstantinou | Yannis Karavidas

pdf bib
Perceptual Evaluation of Quality Deterioration Owing to Prosody Modification
Kazuki Adachi | Tomoki Toda | Hiromichi Kawanami | Hiroshi Saruwatari | Kiyohiro Shikano

pdf bib
Integration of Russian Language Resources
Serge A. Yablonsky

pdf bib
A2Q: An Agent-based Architecture for Multilingual Q&A
Roberto Basili | Nicola Lorusso | Maria Teresa Pazienza | Fabio Massimo Zanzotto

pdf bib
OntoTag’s Linguistic Ontologies: Enhancing Higher Level and Semantic Web Annotations
Guadalupe Aguado de Cea | Inmaculada Álvarez-de-Mon | Antonio Pareja-Lora

pdf bib
Exploiting Language Resources for Semantic Web Annotations
Kaarel Kaljurand | Fabio Rinaldi | James Dowdall | Michael Hess

pdf bib
Towards an International Standard on Feature Structure Representation
Kiyong Lee | Lou Burnard | Laurent Romary | Eric de la Clergerie | Thierry Declerck | Syd Bauman | Harry Bunt | Lionel Clément | Tomaž Erjavec | Azim Roussanaly | Claude Roux

pdf bib
The Translation Correction Tool: English-Spanish User Studies
Ariadna Font Llitjós | Jaime Carbonell

pdf bib
A Labelled Corpus for Prepositional Phrase Attachment
Brian Mitchell | Robert Gaizauskas

pdf bib
Comparing the Ambiguity Reduction Abilities of Probabilistic Context-Free Grammars
Gabriel Infante-Lopez | Maarten de Rijke

pdf bib
NameNet: a Self-Improving Resource for Name Classification
Paul Morarescu | Sanda Harabagiu

pdf bib
Image-Language Multimodal Corpora: Needs, Lacunae and an AI Synergy for Annotation
Katerina Pastra | Yorick Wilks

pdf bib
Detecting Errors in English Article Usage with a Maximum Entropy Classifier Trained on a Large, Diverse Corpus
Na-Rae Han | Martin Chodorow | Claudia Leacock

pdf bib
The Core of the Czech Derivational Dictionary
Radek Sedláček

pdf bib
Automatic Sentence Simplification for Subtitling in Dutch and English
Walter Daelemans | Anja Höthker | Erik Tjong Kim Sang

pdf bib
Enriching a Thai Lexical Database with Selectional Preferences
Canasai Kruengkrai | Thatsanee Charoenporn | Virach Sornlertlamvanich | Hitoshi Isahara

pdf bib
Results of the 2003 Topic Detection and Tracking Evaluation
Jonathan G. Fiscus

pdf bib
Parsing Ungrammatical Input: an Evaluation Procedure
Jennifer Foster

pdf bib
An Automatic Method for Constructing Domain-Specific Ontology Resources
Melania Degeratu | Vasileios Hatzivassiloglou

pdf bib
A Lexicon Module for a Grammar Development Environment
Ann Copestake | Fabre Lambeau | Benjamin Waldron | Francis Bond | Dan Flickinger | Stephan Oepen

pdf bib
Modelling Legitimate Translation Variation for Automatic Evaluation of MT Quality
Bogdan Babych | Anthony Hartley

pdf bib
Semantic Mark-up of Italian Legal Texts Through NLP-based Techniques
Roberto Bartolini | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli | Claudia Soria

pdf bib
Morphology Based Automatic Acquisition of Large-coverage Lexica
Lionel Clément | Benoît Sagot | Bernard Lang

pdf bib
Towards Intelligent Written Cultural Heritage Processing - Lexical Processing
Kiril Ribarov

pdf bib
Developing Language Resources for a Transnational Digital Government System
Violetta Cavalli-Sforza | Jaime G. Carbonell | Peter J. Jansen

pdf bib
Semi-automatic Syntactic and Semantic Corpus Annotation with a Deep Parser
Mary D. Swift | Myroslava O. Dzikovska | Joel R. Tetreault | James F. Allen

pdf bib
Collecting and Sharing Bilingual Spontaneous Speech Corpora: the ChinFaDial Experiment
Georges Fafiotte | Christian Boitet | Mark Seligman | Chengqing Zong

pdf bib
Can Anaphoric Definite Descriptions be Replaced by Pronouns?
Judita Preiss | Caroline Gasperin | Ted Briscoe

pdf bib
Hybrid Constraints for Robust Parsing: First Experiments and Evaluation
Roberto Bartolini | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli

pdf bib
E-Wiz: a Trapper Protocol for Hunting the Expressive Speech Corpora in Lab
Véronique Aubergé | Nicolas Audibert | Albert Rilliard

pdf bib
Agreement in Human Factoid Annotation for Summarization Evaluation
Simone Teufel | Hans van Halteren

pdf bib
Evaluating an Authentic Audio-Visual Expressive Speech Corpus
Albert Rilliard | Véronique Aubergé | Nicolas Audibert

pdf bib
The Italian NESPOLE! Corpus: a Multilingual Database with Interlingua Annotation in Tourism and Medical Domains
Nadia Mana | Roldano Cattoni | Emanuele Pianta | Franca Rossi | Fabio Pianesi | Susanne Burger

pdf bib
Linguistic Miner: An Italian Linguistic Knowledge System
Eugenio Picchi | Maria Luigia Ceccotti | Sebastiana Cucurullo | Manuela Sassi | Eva Sassolini

pdf bib
Metaphors in Wordnets: From Theory to Practice
Antonietta Alonge | Birte Lönneker

pdf bib
Standardization in Multimodal Content Representation: Some Methodological Issues
Harry Bunt | Laurent Romary

pdf bib
A Similarity Measure for Unsupervised Semantic Disambiguation
Roberto Basili | Marco Cammisa | Fabio Massimo Zanzotto

pdf bib
Usability Evaluation of Multimodal and Domain-Oriented Spoken Language Dialogue Systems
Laila Dybkjær | Niels Ole Bernsen | Wolfgang Minker

pdf bib
Using WordNet to Measure Semantic Orientations of Adjectives
Jaap Kamps | Maarten Marx | Robert J. Mokken | Maarten de Rijke

pdf bib
MT Goes Farming: Comparing Two Machine Translation Approaches on a New Domain
Per Weijnitz | Eva Forsbom | Ebba Gustavii | Eva Pettersson | Jörg Tiedemann

pdf bib
VOXMEX Speech Database: Design of a Phonetically Balanced Corpus
Esmeralda Uraga | César Gamboa

pdf bib
Data Driven Ontology Evaluation
Christopher Brewster | Harith Alani | Srinandan Dasmahapatra | Yorick Wilks

pdf bib
Embedding IMDI Metadata into a Large Phonetic Corpus
Oliver Schonefeld | Jan-Torsten Milde

pdf bib
Using Semantic Language Resources to Support Textual Inference for Question Answering
Francesca Bertagna

pdf bib
An Information Repository Model for Advanced Question Answering Systems
Vasco Calais Pedro | Jeongwoo Ko | Eric Nyberg | Teruko Mitamura

pdf bib
Content Interoperability of Lexical Resources: Open Issues and “MILE” Perspectives
Francesca Bertagna | Alessandro Lenci | Monica Monachini | Nicoletta Calzolari

pdf bib
Prague Czech-English Dependency Treebank. Syntactically Annotated Resources for Machine Translation
Martin Čmejrek | Jan Cuřín | Jiří Havelka | Jan Hajič | Vladislav Kuboň

pdf bib
Data Collection and Analysis of Mapudungun Morphology for Spelling Correction
Christian Monson | Lori Levin | Rodolfo Vega | Ralf Brown | Ariadna Font Llitjós | Alon Lavie | Jaime Carbonell | Eliseo Cañulef | Rosendo Huisca

pdf bib
An Efficient Word Confidence Measure Using Likelihood Ratio Scores
Arlindo O. Veiga | Fernando S. Perdigão

pdf bib
Adding Syntactic Annotations to Transcripts of Parent-Child Dialogs
Kenji Sagae | Brian MacWhinney | Alon Lavie

pdf bib
Distributional Consistency: As a General Method for Defining a Core Lexicon
Huarui Zhang | Churen Huang | Shiwen Yu

pdf bib
Computing Reliability for Coreference Annotation
Rebecca J. Passonneau

pdf bib
Publicly Available Topic Signatures for all WordNet Nominal Senses
Eneko Agirre | Oier Lopez de Lacalle

pdf bib
Road-testing the English Resource Grammar Over the British National Corpus
Timothy Baldwin | Emily M. Bender | Dan Flickinger | Ara Kim | Stephan Oepen

pdf bib
Interpreting BLEU/NIST Scores: How Much Improvement do We Need to Have a Better System?
Ying Zhang | Stephan Vogel | Alex Waibel

pdf bib
Exploiting Anchor Text as a Lexical Resource
Peter Anick

pdf bib
MEAD - A Platform for Multidocument Multilingual Text Summarization
Dragomir Radev | Timothy Allison | Sasha Blair-Goldensohn | John Blitzer | Arda Çelebi | Stanko Dimitrov | Elliott Drabek | Ali Hakim | Wai Lam | Danyu Liu | Jahna Otterbacher | Hong Qi | Horacio Saggion | Simone Teufel | Michael Topper | Adam Winkel | Zhu Zhang

pdf bib
Evaluation of Transcription and Annotation Tools for a Multi-modal, Multi-party Dialogue Corpus
Saurabh Garg | Bilyana Martinovski | Susan Robinson | Jens Stephan | Joel Tetreault | David R. Traum

pdf bib
Current Projects in Languages of Military Interest at the Defense Language Institute
Michael Emonts

pdf bib
A Multilingual Database of Idioms
Aline Villavicencio | Timothy Baldwin | Benjamin Waldron

pdf bib
Annotation Tools for Large-Scale Corpus Development: Using AGTK at the Linguistic Data Consortium
Kazuaki Maeda | Stephanie Strassel

pdf bib
Linguistic Resources for Effective, Affordable, Reusable Speech-to-Text
Stephanie Strassel

pdf bib
Building Part-of-Speech Corpora Through Histogram Hopping
Marc Vilain

pdf bib
An Emerging Transcontinental Collaborative Research and Education Agenda in Human Language Technologies
Gregory Ernest Monaco | Abdelhadi Soudi

pdf bib
Issues in Corpus Development for Multi-party Multi-modal Task-oriented Dialogue
Susan Robinson | Bilyana Martinovski | Saurabh Garg | Jens Stephan | David Traum

pdf bib
The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text
Christopher Cieri | David Miller | Kevin Walker

pdf bib
Evaluation of Multi-party Virtual Reality Dialogue Interaction
David R. Traum | Susan Robinson | Jens Stephan

pdf bib
The Mixer Corpus of Multilingual, Multichannel Speaker Recognition Data
Christopher Cieri | Joseph P. Campbell | Hirotaka Nakasone | David Miller | Kevin Walker

pdf bib
Building a Large Grammar for Italian
Alessandro Mazzei | Vincenzo Lombardo

pdf bib
Japanese MULTEXT: a Prosodic Corpus
Shigeyoshi Kitazawa | Shinya Kiriyama | Toshihiko Itoh | Nick Campbell

pdf bib
The OLISSIPO and LECTIO Projects
Giuseppe Cappelli | Paulo Alberto

pdf bib
A Public Reference Implementation of the RAP Anaphora Resolution Algorithm
Long Qiu | Min-Yen Kan | Tat-Seng Chua

pdf bib
NLP-enhanced Content Filtering Within the POESIA Project
Mark Hepple | Neil Ireson | Paolo Allegrini | Simone Marchi | Simonetta Montemagni | Jose Maria Gomez Hidalgo

pdf bib
WinPitch Corpus, a Text to Speech Alignment Tool for Multimodal Corpora
Philippe Martin

pdf bib
The Statistical Analysis of Morphosyntactic Distributions
Stefan Evert

pdf bib
CHeM: A System for the Automatic Analysis of e-mails in the Restoration and Conservation Domain
Luciana Bordoni | Leonardo Pasqualini | Filippo Sciarrone

pdf bib
Resources for Place Name Analysis
Robert Irie | Beth Sundheim

pdf bib
NEMLAR - An Arabic Language Resources Project
Bente Maegaard

pdf bib
Korean-Chinese-Japanese Multilingual Wordnet with Shared Semantic Hierarchy
Key-Sun Choi | Hee-Sook Bae | Wonseok Kang | Juho Lee | Eunhe Kim | Hekyeong Kim | Donghee Kim | Youngbin Song | Hyosik Shin

pdf bib
Intranet Try To Find Project (ITTF): An Approach for the Search of Relevant Information Inside an Organization
Christophe Jouis | Jean-Marie Ferru

pdf bib
A Progress Report from the Linguistic Data Consortium: Recent Activities in Resource Creation and Distribution and the Development of Tools and Standards
Christopher Cieri | Mark Liberman

pdf bib
Recent Activities within the European Language Resources Association: Issues on Sharing Language Resources and Evaluation
Khalid Choukri

pdf bib
EVALDA-CESART Project: Terminological Resources Acquisition Tools Evaluation Campaign
Widad Mustafa El Hadi | Ismail Timimi | Marianne Dabbadie

pdf bib
From Weaver to the ALPAC Report
Gabriella Pardelli | Manuela Sassi | Sara Goggi

pdf bib
The Verb in the Terminological Collocations. Contribution to the Development of a Morphological Analyser: MorphoCom
Rute Costa | Raquel Silva

pdf bib
Cluster Analysis and Classification of Named Entities
Joaquim F. Ferreira da Silva | Zornitsa Kozareva | José Gabriel Pereira Lopes

pdf bib
Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus
Khalid Choukri | Mahtab Nikkhou | Niklas Paulsson

pdf bib
Technolangue: A Permanent Evaluation and Information Infrastructure
Valérie Mapelli | Maria Nava | Sylvain Surcin | Djamel Mostefa | Khalid Choukri

pdf bib
Extending Wordnets To Implicit Information
Palmira Marrafa

pdf bib
Russian Information Retrieval Evaluation Seminar
Boris Dobrov | Igor Kuralenok | Natalia Loukachevitch | Igor Nekrestyanov | Ilya Segalovich