Abbas Ghaddar


2020

SEDAR: a Large Scale French-English Financial Domain Parallel Corpus
Abbas Ghaddar | Philippe Langlais
Proceedings of the 12th Language Resources and Evaluation Conference

This paper describes the acquisition, preprocessing and characteristics of SEDAR, a large-scale English-French parallel corpus for the financial domain. Our extensive experiments on machine translation show that SEDAR is essential to obtain good performance on finance. We observe a large gain in the performance of machine translation systems trained on SEDAR when tested on finance, which makes SEDAR suitable for studying domain adaptation in neural machine translation. The first release of the corpus comprises 8.6 million high-quality sentence pairs that are publicly available for research at https://github.com/autorite/sedar-bitext.

2019

Contextualized Word Representations from Distant Supervision with and for NER
Abbas Ghaddar | Philippe Langlais
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We describe a special type of deep contextualized word representation that is learned from distant supervision annotations and dedicated to named entity recognition. Our extensive experiments on 7 datasets show systematic gains across all domains over strong baselines, and demonstrate that our representation is complementary to previously proposed embeddings. We report new state-of-the-art results on CONLL and ONTONOTES datasets.

2018

Transforming Wikipedia into a Large-Scale Fine-Grained Entity Type Corpus
Abbas Ghaddar | Philippe Langlais
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Robust Lexical Features for Improved Neural Network Named-Entity Recognition
Abbas Ghaddar | Philippe Langlais
Proceedings of the 27th International Conference on Computational Linguistics

Neural network approaches to Named-Entity Recognition reduce the need for carefully hand-crafted features. While some features do remain in state-of-the-art systems, lexical features have been mostly discarded, with the exception of gazetteers. In this work, we show that this is unfair: lexical features are actually quite useful. We propose to embed words and entity types into a low-dimensional vector space that we train from annotated data produced by distant supervision thanks to Wikipedia. From this, we compute, offline, a feature vector representing each word. When used with a vanilla recurrent neural network model, this representation yields substantial improvements. We establish a new state-of-the-art F1 score of 87.95 on ONTONOTES 5.0, while matching state-of-the-art performance with an F1 score of 91.73 on the over-studied CONLL-2003 dataset.

2017

WiNER: A Wikipedia Annotated Corpus for Named Entity Recognition
Abbas Ghaddar | Philippe Langlais
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We revisit the idea of mining Wikipedia in order to generate named-entity annotations. We propose a new methodology that we applied to English Wikipedia to build WiNER, a large, high-quality, annotated corpus. We evaluate its usefulness on 6 NER tasks, comparing 4 popular state-of-the-art approaches. We show that LSTM-CRF is the approach that benefits the most from our corpus. We report impressive gains with this model when using a small portion of WiNER on top of the CONLL training material. Finally, we propose a simple but efficient method for exploiting the full range of WiNER, leading to further improvements.

2016

WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles
Abbas Ghaddar | Philippe Langlais
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents WikiCoref, an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia. Our annotation scheme follows that of OntoNotes, with a few differences. We annotated each markable with its coreference type, its mention type and the equivalent Freebase topic. Since most similar annotation efforts concentrate on very specific types of written text, mainly newswire, there is a lack of resources for otherwise over-used Wikipedia texts. The corpus described in this paper addresses this issue. We present a freely available resource we initially devised for improving coreference resolution algorithms dedicated to Wikipedia texts. Our corpus has no restriction on the topics of the documents being annotated, and documents of various sizes have been considered for annotation.

Coreference in Wikipedia: Main Concept Resolution
Abbas Ghaddar | Philippe Langlais
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning