Gertjan van Noord

Also published as: Gertjan Van Noord


2020

A Shared Task of a New, Collaborative Type to Foster Reproducibility: A First Exercise in the Area of Language Science and Technology with REPROLANG2020
António Branco | Nicoletta Calzolari | Piek Vossen | Gertjan Van Noord | Dieter van Uytvanck | João Silva | Luís Gomes | André Moreira | Willem Elbers
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we introduce a new type of shared task, which is collaborative rather than competitive, designed to support and foster the reproduction of research results. We also describe the first event running such a novel challenge, present the results obtained, discuss the lessons learned and ponder on future undertakings.

UDapter: Language Adaptation for Truly Universal Dependency Parsing
Ahmet Üstün | Arianna Bisazza | Gosse Bouma | Gertjan van Noord
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent advances in multilingual dependency parsing have brought the idea of a truly universal parser closer to reality. However, cross-language interference and restrained model capacity remain major obstacles. To address this, we propose a novel multilingual task adaptation approach based on contextual parameter generation and adapter modules. This approach makes it possible to learn adapters via language embeddings while sharing model parameters across languages. It also allows for an easy but effective integration of existing linguistic typology features into the parsing network. The resulting parser, UDapter, outperforms strong monolingual and multilingual baselines on the majority of both high-resource and low-resource (zero-shot) languages, showing the success of the proposed adaptation approach. Our in-depth analyses show that soft parameter sharing via typological features is key to this success.

Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution
Lukas Edman | Antonio Toral | Gertjan van Noord
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we offer a potential solution: dependency-based word embeddings. These embeddings result in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.

AlpinoGraph: A Graph-based Search Engine for Flexible and Efficient Treebank Search
Peter Kleiweg | Gertjan van Noord
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

Multi-Team: A Multi-attention, Multi-decoder Approach to Morphological Analysis.
Ahmet Üstün | Rob van der Goot | Gosse Bouma | Gertjan van Noord
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes our submission to SIGMORPHON 2019 Task 2: Morphological analysis and lemmatization in context. Our model is a multi-task sequence to sequence neural network, which jointly learns morphological tagging and lemmatization. On the encoding side, we exploit character-level as well as contextual information. We introduce a multi-attention decoder to selectively focus on different parts of character and word sequences. To further improve the model, we train on multiple datasets simultaneously and use external embeddings for initialization. Our final model reaches an average morphological tagging F1 score of 94.54 and a lemma accuracy of 93.91 on the test data, ranking respectively 3rd and 6th out of 13 teams in the SIGMORPHON 2019 shared task.

Cross-Lingual Word Embeddings for Morphologically Rich Languages
Ahmet Üstün | Gosse Bouma | Gertjan van Noord
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Cross-lingual word embedding models learn a shared vector space for two or more languages so that words with similar meaning are represented by similar vectors regardless of their language. Although the existing models achieve high performance on pairs of morphologically simple languages, they perform very poorly on morphologically rich languages such as Turkish and Finnish. In this paper, we propose a morpheme-based model in order to increase the performance of cross-lingual word embeddings on morphologically rich languages. Our model includes a simple extension which enables us to exploit morphemes for cross-lingual mapping. We applied our model for the Turkish-Finnish language pair on the bilingual word translation task. Results show that our model outperforms the baseline models by 2% in the nearest neighbour ranking.

2018

A Taxonomy for In-depth Evaluation of Normalization for User Generated Content
Rob van der Goot | Rik van Noord | Gertjan van Noord
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Modeling Input Uncertainty in Neural Network Dependency Parsing
Rob van der Goot | Gertjan van Noord
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recently introduced neural network parsers allow for new approaches to circumvent data sparsity issues by modeling character level information and by exploiting raw data in a semi-supervised setting. Data sparsity is especially prevalent when transferring to non-standard domains. In this setting, lexical normalization has often been used in the past to circumvent data sparsity. In this paper, we investigate whether these new neural approaches provide functionality similar to lexical normalization, or whether they are complementary. We provide experimental results which show that a separate normalization component improves performance of a neural network parser even if it has access to character level information as well as external word embeddings. Further improvements are obtained by a straightforward but novel approach in which the top-N best candidates provided by the normalization component are available to the parser.

Squib: Reproducibility in Computational Linguistics: Are We Willing to Share?
Martijn Wieling | Josine Rawee | Gertjan van Noord
Computational Linguistics, Volume 44, Issue 4 - December 2018

This study focuses on an essential precondition for reproducibility in computational linguistics: the willingness of authors to share relevant source code and data. Ten years after Ted Pedersen’s influential “Last Words” contribution in Computational Linguistics, we investigate to what extent researchers in computational linguistics are willing and able to share their data and code. We surveyed all 395 full papers presented at the 2011 and 2016 ACL Annual Meetings, and identified whether links to data and code were provided. If working links were not provided, authors were requested to provide this information. Although data were often available, code was shared less often. When working links to code or data were not provided in the paper, authors provided the code in about one third of cases. For a selection of ten papers, we attempted to reproduce the results using the provided data and code. We were able to reproduce the results approximately for six papers. For only a single paper did we obtain the exact same results. Our findings show that even though the situation appears to have improved comparing 2016 to 2011, empiricism in computational linguistics still largely remains a matter of faith. Nevertheless, we are somewhat optimistic about the future. Ensuring reproducibility is not only important for the field as a whole, but also seems worthwhile for individual researchers: The median citation count for studies with working links to the source code is higher.

2017

Parser Adaptation for Social Media by Integrating Normalization
Rob van der Goot | Gertjan van Noord
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This work explores different approaches to using normalization for parser adaptation. Traditionally, normalization is used as a separate pre-processing step. We show that integrating the normalization model into the parsing algorithm is more beneficial. This way, multiple normalization candidates can be leveraged, which improves parsing performance on social media. We test this hypothesis by modifying the Berkeley parser; out-of-the-box it achieves an F1 score of 66.52. Our integrated approach reaches a significant improvement with an F1 score of 67.36, while using the best normalization sequence results in an F1 score of only 66.94.

Increasing Return on Annotation Investment: The Automatic Construction of a Universal Dependency Treebank for Dutch
Gosse Bouma | Gertjan van Noord
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

The Power of Character N-grams in Native Language Identification
Artur Kulmizev | Bo Blankers | Johannes Bjerva | Malvina Nissim | Gertjan van Noord | Barbara Plank | Martijn Wieling
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we explore the performance of a linear SVM trained on language independent character features for the NLI Shared Task 2017. Our basic system (GRONINGEN) achieves the best performance (87.56 F1-score) on the evaluation set using only 1-9 character n-grams as features. We compare this against several ensemble and meta-classifiers in order to examine how the linear system fares when combined with other, especially non-linear classifiers. Special emphasis is placed on the topic bias that exists by virtue of the assessment essay prompt distribution.

Distributional Lesk: Effective Knowledge-Based Word Sense Disambiguation
Dieke Oele | Gertjan van Noord
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

2016

Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders
Simon Šuster | Ivan Titov | Gertjan van Noord
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task
Rosa Gaudio | Gorka Labaka | Eneko Agirre | Petya Osenova | Kiril Simov | Martin Popel | Dieke Oele | Gertjan van Noord | Luís Gomes | João António Rodrigues | Steven Neale | João Silva | Andreia Querido | Nuno Rendeiro | António Branco
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Proceedings of the 2nd Deep Machine Translation Workshop
Jan Hajič | Gertjan van Noord | António Branco
Proceedings of the 2nd Deep Machine Translation Workshop

Obituary: In Memoriam: Susan Armstrong
Pierrette Bouillon | Paola Merlo | Gertjan van Noord | Mike Rosner
Computational Linguistics, Volume 42, Issue 2 - June 2016

2015

Comparison of Coreference Resolvers for Deep Syntax Translation
Michal Novák | Dieke Oele | Gertjan van Noord
Proceedings of the Second Workshop on Discourse in Machine Translation

Lexical choice in Abstract Dependency Trees
Dieke Oele | Gertjan van Noord
Proceedings of the 1st Deep Machine Translation Workshop

ROB: Using Semantic Meaning to Recognize Paraphrases
Rob van der Goot | Gertjan van Noord
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

From neighborhood to parenthood: the advantages of dependency representation over bigrams in Brown clustering
Simon Šuster | Gertjan van Noord
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Treelet Probabilities for HPSG Parsing and Error Correction
Angelina Ivanova | Gertjan van Noord
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Most state-of-the-art parsers aim to produce an analysis for any input despite errors. However, small grammatical mistakes in a sentence often cause the parser to fail to build a correct syntactic tree. Applications that can identify and correct mistakes during parsing are particularly interesting for processing noisy user-generated content. Such systems could potentially take advantage of the linguistic depth of broad-coverage precision grammars. In order to choose the best correction for an utterance, the probabilities of parse trees of different sentences should be comparable, which is not supported by the discriminative methods underlying parsing software for processing deep grammars. In the present work we assess the treelet model for determining generative probabilities for HPSG parsing with error correction. In the first experiment the treelet model is applied to the parse selection task and shows higher exact match accuracy than the baseline and PCFG. In the second experiment it is tested for the ability to score the parse tree of the correct sentence higher than the constituency tree of the original version of the sentence containing a grammatical error.

2011

Adaptability of Lexical Acquisition for Large-scale Grammars
Kostadin Cholakov | Gertjan van Noord | Valia Kordoni | Yi Zhang
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

An Empirical Comparison of Unknown Word Prediction Methods
Kostadin Cholakov | Gertjan van Noord | Valia Kordoni | Yi Zhang
Proceedings of 5th International Joint Conference on Natural Language Processing

Effective Measures of Domain Similarity for Parsing
Barbara Plank | Gertjan van Noord
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Reversible Stochastic Attribute-Value Grammars
Daniël de Kok | Barbara Plank | Gertjan van Noord
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Grammar-Driven versus Data-Driven: Which Parsing System Is More Affected by Domain Shifts?
Barbara Plank | Gertjan van Noord
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

Acquisition of Unknown Word Paradigms for Large-Scale Grammars
Kostadin Cholakov | Gertjan van Noord
Coling 2010: Posters

POS Multi-tagging Based on Combined Models
Yan Zhao | Gertjan van Noord
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In the POS tagging task, there are two kinds of statistical models: generative models, such as the HMM, and discriminative models, such as the Maximum Entropy Model (MEM). POS multi-tagging decoding methods include the N-best paths method and the forward-backward method. In this paper, we use the forward-backward decoding method based on a combined model of HMM and MEM. If P(t) is the forward-backward probability of each possible tag t, we first calculate P(t) according to the HMM and the MEM separately. For all tag options at a given position in a sentence, we normalize P(t) in the HMM and the MEM separately. The probability of the combined model is the sum of the normalized forward-backward probabilities P_norm(t) in the HMM and the MEM. For each word w, we select the tag for which the probability of the combined model is highest. In the experiments, the combined model achieves higher accuracy than either single model on POS tagging tasks in three languages: Chinese, English and Dutch. The results indicate that our combined model is effective.
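
The combination step described in this abstract can be illustrated with a small sketch. This is not the authors' code: the function names and the example scores are hypothetical, and the forward-backward probabilities are assumed to have been computed already by each model for a single word position.

```python
def normalize(scores):
    """Rescale the tag scores for one position so that they sum to 1."""
    total = sum(scores.values())
    return {tag: p / total for tag, p in scores.items()}


def combine(hmm_scores, mem_scores):
    """Add the separately normalized HMM and MEM probabilities per tag."""
    hmm_norm = normalize(hmm_scores)
    mem_norm = normalize(mem_scores)
    tags = set(hmm_norm) | set(mem_norm)
    return {t: hmm_norm.get(t, 0.0) + mem_norm.get(t, 0.0) for t in tags}


def best_tag(hmm_scores, mem_scores):
    """Pick the tag with the highest combined probability."""
    combined = combine(hmm_scores, mem_scores)
    return max(combined, key=combined.get)


# Hypothetical forward-backward scores P(t) for one word position.
hmm = {"NOUN": 0.6, "VERB": 0.3, "ADJ": 0.1}
mem = {"NOUN": 0.2, "VERB": 0.5, "ADJ": 0.1}

print(best_tag(hmm, mem))
```

With these made-up scores the HMM alone prefers NOUN, but the combined score favours VERB, since the MEM assigns it a much larger share after normalization.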

Using Unknown Word Techniques to Learn Known Words
Kostadin Cholakov | Gertjan van Noord
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

Combining Finite State and Corpus-based Techniques for Unknown Word Prediction
Kostadin Cholakov | Gertjan van Noord
Proceedings of the International Conference RANLP-2009

Parsed Corpora for Linguistics
Gertjan van Noord | Gosse Bouma
Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?

A generalized method for iterative error mining in parsing results
Daniël de Kok | Jianqiang Ma | Gertjan van Noord
Proceedings of the 2009 Workshop on Grammar Engineering Across Frameworks (GEAF 2009)

Learning Efficient Parsing
Gertjan van Noord
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

From D-Coi to SoNaR: a reference corpus for Dutch
Nelleke Oostdijk | Martin Reynaert | Paola Monachesi | Gertjan Van Noord | Roeland Ordelman | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.

Exploring an Auxiliary Distribution Based Approach to Domain Adaptation of a Syntactic Disambiguation Model
Barbara Plank | Gertjan van Noord
Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation

2007

ACL 2007 Workshop on Deep Linguistic Processing
Timothy Baldwin | Mark Dras | Julia Hockenmaier | Tracy Holloway King | Gertjan van Noord
ACL 2007 Workshop on Deep Linguistic Processing

Using Self-Trained Bilexical Preferences to Improve Disambiguation Accuracy
Gertjan van Noord
Proceedings of the Tenth International Conference on Parsing Technologies

The Impact of Deep Linguistic Processing on Parsing Technology
Timothy Baldwin | Mark Dras | Julia Hockenmaier | Tracy Holloway King | Gertjan van Noord
Proceedings of the Tenth International Conference on Parsing Technologies

2006

Syntactic Annotation of Large Corpora in STEVIN
Gertjan van Noord | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the Dutch/Flemish STEVIN programme. For part of this corpus, manually corrected syntactic annotations will be provided. The paper presents the background of the syntactic annotation efforts, the Alpino parser which is used as an important tool for constructing the syntactic annotations, as well as a number of other annotation tools and guidelines. For the full STEVIN corpus, automatically derived syntactic annotations will be provided in a later phase of the programme. A number of arguments is provided suggesting that such a resource can be very useful for applications in information extraction, ontology building, lexical acquisition, machine translation and corpus linguistics.

Robust Parsing, Error Mining, Automated Lexical Acquisition, and Evaluation
Gertjan van Noord
Proceedings of the Workshop on ROMAND 2006: Robust Methods in Analysis of Natural language Data

2004

Error Mining for Wide-Coverage Grammar Engineering
Gertjan van Noord
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2001

Unsupervised POS-Tagging Improves Parsing Accuracy and Parsing Efficiency
Robbert Prins | Gertjan van Noord
Proceedings of the Seventh International Workshop on Parsing Technologies

2000

Treatment of epsilon moves in subset construction
Gertjan van Noord
Computational Linguistics, Volume 26, Number 1, March 2000

Approximation and Exactness in Finite State Optimality Theory
Dale Gerdemann | Gertjan van Noord
Proceedings of the Fifth Workshop of the ACL Special Interest Group in Computational Phonology

1999

Transducers from Rewrite Rules with Backreferences
Dale Gerdemann | Gertjan van Noord
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

Treatment of e-Moves in Subset Construction
Gertjan van Noord
Finite State Methods in Natural Language Processing

1997

Grammatical analysis in the OVIS spoken-dialogue system
Mark-Jan Nederhof | Gosse Bouma | Rob Koeling | Gertjan van Noord
Interactive Spoken Dialog Systems: Bringing Speech and NLP Together in Real Applications

Hdrug. A Flexible and Extendible Development Environment for Natural Language Processing.
Gertjan van Noord | Gosse Bouma
Computational Environments for Grammar Development and Linguistic Engineering

An Efficient Implementation of the Head-Corner Parser
Gertjan van Noord
Computational Linguistics, Volume 23, Number 3, September 1997

1995

The intersection of Finite State Automata and Definite Clause Grammars
Gertjan van Noord
33rd Annual Meeting of the Association for Computational Linguistics

1994

Constraint-Based Categorial Grammar
Gosse Bouma | Gertjan van Noord
32nd Annual Meeting of the Association for Computational Linguistics

Adjuncts and the Processing of Lexical Rules
Gertjan van Noord | Gosse Bouma
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

1993

Head-driven Parsing for Lexicalist Grammars: Experimental Results
Gosse Bouma | Gertjan van Noord
Sixth Conference of the European Chapter of the Association for Computational Linguistics

1992

Self-Monitoring with Reversible Grammars
Gunter Neumann | Gertjan van Noord
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics

1991

Head Corner Parsing for Discontinuous Constituency
Gertjan van Noord
29th Annual Meeting of the Association for Computational Linguistics

Towards Uniform Processing of Constraint-based Categorial Grammars
Gertjan van Noord
Reversible Grammar in Natural Language Processing

1990

Reversible Unification Based Machine Translation
Gertjan van Noord
COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics

Semantic-Head-Driven Generation
Stuart M. Shieber | Gertjan van Noord | Fernando C. N. Pereira | Robert C. Moore
Computational Linguistics, Volume 16, Number 1, March 1990

1989

An Approach to Sentence-Level Anaphora in Machine Translation
Gertjan van Noord | Joke Dorrepaal | Doug Arnold | Steven Krauwer | Louisa Sadler | Louis des Tombe
Fourth Conference of the European Chapter of the Association for Computational Linguistics

A Semantic-Head-Driven Generation Algorithm for Unification-Based Formalisms
Stuart M. Shieber | Gertjan van Noord | Robert C. Moore | Fernando C. N. Pereira
27th Annual Meeting of the Association for Computational Linguistics