Benjamin Waldron


Language Resources and Chemical Informatics
C.J. Rupp | Ann Copestake | Peter Corbett | Peter Murray-Rust | Advaith Siddharthan | Simone Teufel | Benjamin Waldron
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

As in any scientific field, research papers are a primary source of information about chemistry. The data they contain are presented predominantly as unstructured text, and so are not immediately amenable to the processes developed within chemical informatics for carrying out chemistry research by information processing techniques. At one level, extracting the relevant information from research papers is a text mining task, requiring both extensive language resources and specialised knowledge of the subject domain. However, the papers also encode information about the way the research is conducted and the structure of the field itself. Applying language technology to research papers in chemistry can therefore facilitate eScience on several different levels. The SciBorg project sets out to provide an extensive, analysed corpus of published chemistry research. This relies on the cooperation of several journal publishers to provide papers in an appropriate form. The work is carried out as a collaboration involving the Computer Laboratory, Chemistry Department and eScience Centre at Cambridge University, and is funded under the UK eScience programme.


Preprocessing and Tokenisation Standards in DELPH-IN Tools
Benjamin Waldron | Ann Copestake | Ulrich Schäfer | Bernd Kiefer
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We discuss preprocessing and tokenisation standards within DELPH-IN, a large-scale open-source collaboration providing multiple independent multilingual shallow and deep processors. We describe (i) a component-specific XML interface format which has been used for some time to interface preprocessor results to the PET parser, and (ii) our implementation of a more generic XML interface format influenced heavily by the (ISO working draft) Morphosyntactic Annotation Framework (MAF). Our generic format encapsulates the information which may be passed from the preprocessing stage to a parser: it uses standoff annotation and a lattice for the representation of structural ambiguity, supports intra-annotation dependencies, and allows for highly structured annotation content. This work builds on the existing Heart of Gold middleware system, and on previous work on Robust Minimal Recursion Semantics (RMRS) as part of an inter-component interface. We give examples of usage with a number of the DELPH-IN processing components and deep grammars.

A Standoff Annotation Interface between DELPH-IN Components
Benjamin Waldron | Ann Copestake
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing


A Lexicon Module for a Grammar Development Environment
Ann Copestake | Fabre Lambeau | Benjamin Waldron | Francis Bond | Dan Flickinger | Stephan Oepen
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A Multilingual Database of Idioms
Aline Villavicencio | Timothy Baldwin | Benjamin Waldron
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Lexical Encoding of MWEs
Aline Villavicencio | Ann Copestake | Benjamin Waldron | Fabre Lambeau
Proceedings of the Workshop on Multiword Expressions: Integrating Processing