Eric Atwell

Also published as: Eric S. Atwell, Eric Steven Atwell


2020

Constructing a Bilingual Hadith Corpus Using a Segmentation Tool
Shatha Altammami | Eric Atwell | Ammar Alsalka
Proceedings of the 12th Language Resources and Evaluation Conference

This article describes the process of gathering and constructing a bilingual parallel corpus of Islamic Hadith, which is the set of narratives reporting different aspects of the prophet Muhammad’s life. The corpus data is gathered from the six canonical Hadith collections using a custom segmentation tool that automatically segments and annotates the two Hadith components with 92% accuracy. This Hadith segmenter minimises the costs of language resource creation and produces consistent results, independent of the prior knowledge and experience that usually influence human annotators. The corpus includes more than 10M tokens and will be freely available via the LREC repository.

2019

Proceedings of the 3rd Workshop on Arabic Corpus Linguistics
Mahmoud El-Haj | Paul Rayson | Eric Atwell | Lama Alsudias
Proceedings of the 3rd Workshop on Arabic Corpus Linguistics

Text Segmentation Using N-grams to Annotate Hadith Corpus
Shatha Altammami | Eric Atwell | Ammar Alsalka
Proceedings of the 3rd Workshop on Arabic Corpus Linguistics

Classifying Arabic dialect text in the Social Media Arabic Dialect Corpus (SMADC)
Areej Alshutayri | Eric Atwell
Proceedings of the 3rd Workshop on Arabic Corpus Linguistics

2018

Web-based Annotation Tool for Inflectional Language Resources
Abdulrahman Alosaimy | Eric Atwell
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

An Empirical Study of Arabic Formulaic Sequence Extraction Methods
Ayman Alghamdi | Eric Atwell | Claire Brierley
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper implements the "collocation of Arabic keywords" approach for extracting formulaic sequences (FSs) in the form of high-frequency but semantically regular formulas that are not restricted to any syntactic construction or semantic domain. The study applies several distributional semantic models to automatically extract relevant FSs related to Arabic keywords. The data sets used in this experiment are drawn from a newly developed corpus-based Arabic wordlist of 5,189 lexical items, which represent a variety of Modern Standard Arabic (MSA) genres and regions; the wordlist is based on overlapping frequency, derived from a comprehensive comparison of four large Arabic corpora with a total size of over 8 billion running words. Empirical n-best precision evaluation methods are used to determine the best association measures (AMs) for extracting high-frequency and meaningful FSs. The gold standard reference FS list was developed in previous studies and manually evaluated against well-established quantitative and qualitative criteria. The results demonstrate that the MI.log_f AM performed best in extracting significant FSs from the large MSA corpus, while the T-score association measure performed worst.

Compilation of an Arabic Children’s Corpus
Latifa Al-Sulaiti | Noorhan Abbas | Claire Brierley | Eric Atwell | Ayman Alghamdi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Inspired by the Oxford Children’s Corpus, we have developed a prototype corpus of Arabic texts written and/or selected for children. Our Arabic Children’s Corpus of 2950 documents and nearly 2 million words has been collected manually from the web during a 3-month project. It is of high quality, and contains a range of different children’s genres based on sources located, including classic tales from The Arabian Nights, and popular fictional characters such as Goha. We anticipate that the current and subsequent versions of our corpus will lead to interesting studies in text classification, language use, and ideology in children’s texts.

Arabic Language WEKA-Based Dialect Classifier for Arabic Automatic Speech Recognition Transcripts
Areej Alshutayri | Eric Atwell | Abdulrahman Alosaimy | James Dickins | Michael Ingleby | Janet Watson
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This paper describes an Arabic dialect identification system which we developed for the Discriminating Similar Languages (DSL) 2016 shared task. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. We experimented with several classifiers, and the best accuracy was achieved using the Sequential Minimal Optimization (SMO) algorithm for training and testing, applied to three different feature sets, one per test run. Our approach achieved an accuracy of 42.85%, which is considerably worse than the evaluation scores of 80-90% on the training set and the accuracy of around 50% obtained with a “60:40” percentage split of the training set. We observed that Buckwalter transcripts from the Saarland Automatic Speech Recognition (ASR) system are given without short vowels, though the Buckwalter system has notation for these. We elaborate on these observations, describe our methods and analyse the training dataset.

2014

Tools for Arabic Natural Language Processing: a case study in qalqalah prosody
Claire Brierley | Majdi Sawalha | Eric Atwell
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we focus on the prosodic effect of qalqalah or “vibration” applied to a subset of Arabic consonants under certain constraints during correct Qur’anic recitation or taǧwīd, using our Boundary-Annotated Qur’an dataset of 77430 words (Brierley et al 2012; Sawalha et al 2014). These qalqalah events are rule-governed and are signified orthographically in the Arabic script. Hence they can be given abstract definition in the form of regular expressions and thus located and collected automatically. High frequency qalqalah content words are also found to be statistically significant discriminators or keywords when comparing Meccan and Medinan chapters in the Qur’an using a state-of-the-art Visual Analytics toolkit: Semantic Pathways. Thus we hypothesise that qalqalah prosody is one way of highlighting salient items in the text. Finally, we implement Arabic transcription technology (Brierley et al under review; Sawalha et al forthcoming) to create a qalqalah pronunciation guide where each word is transcribed phonetically in IPA and mapped to its chapter-verse ID. This is funded research under the EPSRC “Working Together” theme.
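
The abstract states that qalqalah events can be defined as regular expressions and collected automatically, but it does not give the patterns themselves. The following is a minimal sketch under a simplifying assumption: a qalqalah candidate is one of the five qalqalah letters (qaf, ta, ba, jim, dal) immediately followed by a sukun mark (U+0652). Pause-final qalqalah and the alternative sukun glyphs used in some Qur'anic orthographies are not modelled, and the function name `find_qalqalah` is hypothetical.

```python
import re

# Assumption: qalqalah letters are qaf, ta, ba, jim, dal (U+0642, U+0637,
# U+0628, U+062C, U+062F), and a candidate event is one of these letters
# directly followed by the standard sukun diacritic (U+0652).
QALQALAH_LETTERS = "\u0642\u0637\u0628\u062C\u062F"
SUKUN = "\u0652"
QALQALAH_RE = re.compile(f"[{QALQALAH_LETTERS}]{SUKUN}")


def find_qalqalah(words):
    """Return (index, word) pairs for words containing a candidate qalqalah event."""
    return [(i, w) for i, w in enumerate(words) if QALQALAH_RE.search(w)]


# Illustrative input, written with escapes so the diacritic order is explicit.
sample_words = [
    "\u064A\u064E\u062C\u0652\u0639\u064E\u0644\u064F",  # jim + sukun: matches
    "\u0643\u0650\u062A\u064E\u0627\u0628",              # no sukun: no match
]
matches = find_qalqalah(sample_words)
```

With the sample above, only the first word is returned. A real pattern set would need to follow the orthographic conventions of the specific Qur'an edition being processed.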

2012

QurAna: Corpus of the Quran annotated with Pronominal Anaphora
Abdul-Baquee Sharaf | Eric Atwell
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents QurAna: a large corpus created from the original Quranic text, where personal pronouns are tagged with their antecedents. These antecedents are maintained as an ontological list of concepts, which have proved helpful for information retrieval tasks. QurAna is characterized by: (a) a comparatively large number of pronouns tagged with antecedent information (over 24,500 pronouns), and (b) maintenance of an ontological concept list built from these antecedents. We have shown useful applications of this corpus. This corpus is the first of its kind for Classical Arabic text, and could also be used for interesting applications for Modern Standard Arabic. It would benefit researchers in obtaining empirical evidence and rules for building new anaphora resolution approaches. Such a corpus could also be used to train, optimize and evaluate existing approaches.

QurSim: A corpus for evaluation of relatedness in short texts
Abdul-Baquee Sharaf | Eric Atwell
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a large corpus created from the original Quranic text, where semantically similar or related verses are linked together. This corpus will be a valuable evaluation resource for computational linguists investigating similarity and relatedness in short texts. Furthermore, this dataset can be used for evaluation of paraphrase analysis and machine translation tasks. Our dataset is characterised by: (1) superior quality of relatedness assignment: as we have incorporated relations marked by well-known domain experts, this dataset could be considered a gold standard corpus for various evaluation tasks; (2) the size of our dataset: over 7,600 pairs of related verses are collected from scholarly sources with several levels of degree of relatedness. This dataset could be extended to over 13,500 pairs of related verses by observing the commutative property of strongly related pairs. This dataset was incorporated into online query pages where users can visualize, for a given verse, a network of all directly and indirectly related verses. Empirical experiments showed that only 33% of related pairs shared root words, emphasising the need to go beyond common lexical matching methods and to incorporate, in addition, semantic, domain-knowledge, and other corpus-based approaches.
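
The extension from over 7,600 to over 13,500 pairs relies on treating strongly related pairs as commutative: if verse A is strongly related to verse B, then B is strongly related to A. A minimal sketch of that step follows. The tuple layout, the verse identifiers, and the degree labels `"strong"` and `"weak"` are assumptions for illustration, not the corpus's actual format.

```python
def extend_with_reverse(pairs):
    """Add the reversed pair for every strongly related pair.

    `pairs` is an iterable of (source_verse, target_verse, degree) tuples.
    The "strong" label is a hypothetical stand-in for the corpus's own
    relatedness levels.
    """
    pairs = list(pairs)
    extended = set(pairs)
    for source, target, degree in pairs:
        if degree == "strong":
            extended.add((target, source, degree))
    return extended


# Illustrative data only: verse IDs in chapter:verse form.
example_pairs = [
    ("1:1", "2:2", "strong"),
    ("3:3", "4:4", "weak"),
]
extended_pairs = extend_with_reverse(example_pairs)
```

Using a set means a reverse pair that is already present is not counted twice, which is why the extended total is less than double the original.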

Predicting Phrase Breaks in Classical and Modern Standard Arabic Text
Majdi Sawalha | Claire Brierley | Eric Atwell
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We train and test two probabilistic taggers for Arabic phrase break prediction on a purpose-built, “gold standard”, boundary-annotated and PoS-tagged Qur'an corpus of 77430 words and 8230 sentences. In a related LREC paper (Brierley et al., 2012), we cover the dataset build. Here we report on comparative experiments with off-the-shelf N-gram and HMM taggers and coarse-grained feature sets for syntax and prosody, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with the trigram tagger, and significant gains in the recognition of minority class instances with both taggers, as measured by Balanced Classification Rate. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
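
To illustrate why a high majority-class baseline is challenging and why Balanced Classification Rate (BCR) is informative here, the sketch below computes both accuracy and BCR, taking BCR as the mean of sensitivity and specificity. The label strings and the toy data are assumptions; they are not drawn from the paper's corpus.

```python
def accuracy_and_bcr(gold, predicted, positive="break"):
    """Return (accuracy, balanced classification rate) for binary labels.

    BCR is computed as the mean of sensitivity (recall on the positive
    class) and specificity (recall on the negative class).
    """
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, (sensitivity + specificity) / 2


# Toy data: 17 non-breaks and 3 breaks, with a classifier that always
# predicts the majority class.
gold_labels = ["non-break"] * 17 + ["break"] * 3
majority_guess = ["non-break"] * 20
baseline_accuracy, baseline_bcr = accuracy_and_bcr(gold_labels, majority_guess)
```

On this toy data the always-non-break classifier scores 0.85 accuracy but only 0.5 BCR, since it never recognises a break. Gains on the minority class therefore show up in BCR even when accuracy moves little.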

Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing
Claire Brierley | Majdi Sawalha | Eric Atwell
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

A boundary-annotated and part-of-speech tagged corpus is a prerequisite for developing phrase break classifiers. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwīd (recitation) mark-up in the Qur'an which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive, and signifies a widely-used recitation style, and one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur'an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. In (Sawalha et al., 2012), we use the dataset in phrase break prediction experiments. This research is part of a larger-scale project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.

LAMP: A Multimodal Web Platform for Collaborative Linguistic Analysis
Kais Dukes | Eric Atwell
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the underlying software platform used to develop and publish annotations for the Quranic Arabic Corpus (QAC). The QAC (Dukes, Atwell and Habash, 2011) is a multimodal language resource that integrates deep tagging, interlinear translation, multiple speech recordings, visualization and collaborative analysis for the Classical Arabic language of the Quran. Available online at http://corpus.quran.com, the website is a popular study guide for Quranic Arabic, used by over 1.2 million visitors over the past year. We provide a description of the underlying software system that has been used to develop the corpus annotations. The multimodal data is made available online through an accessible cross-referenced web interface. Although our Linguistic Analysis Multimodal Platform (LAMP) has been applied to the Classical Arabic language of the Quran, we argue that our annotation model and software architecture may be of interest to other related corpus linguistics projects. Work related to LAMP includes recent efforts for annotating other Classical languages, such as Ancient Greek and Latin (Bamman, Mambrini and Crane, 2009), as well as commercial systems (e.g. Logos Bible study) that provide access to syntactic tagging for the Hebrew Bible and Greek New Testament (Brannan, 2011).

2010

Syntactic Annotation Guidelines for the Quranic Arabic Dependency Treebank
Kais Dukes | Eric Atwell | Abdul-Baquee M. Sharaf
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Quranic Arabic Dependency Treebank (QADT) is part of the Quranic Arabic Corpus (http://corpus.quran.com), an online linguistic resource organized by the University of Leeds, and developed through online collaborative annotation. The website has become a popular study resource for Arabic and the Quran, and is now used by over 1,500 researchers and students daily. This paper presents the treebank, explains the choice of syntactic representation, and highlights key parts of the annotation guidelines. The text being analyzed is the Quran, the central religious book of Islam, written in classical Quranic Arabic (c. 600 CE). To date, all 77,430 words of the Quran have a manually verified morphological analysis, and syntactic analysis is in progress. 11,000 words of Quranic Arabic have been syntactically annotated as part of a gold standard treebank. Annotation guidelines are especially important to promote consistency for a corpus which is being developed through online collaboration, since often many people will participate from different backgrounds and with different levels of linguistic expertise. The treebank is available online for collaborative correction to improve accuracy, with suggestions reviewed by expert Arabic linguists, and compared against existing published books of Quranic Syntax.

Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text
Majdi Sawalha | Eric Atwell
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Morphological analyzers and part-of-speech taggers are key technologies for most text analysis applications. Our aim is to develop a part-of-speech tagger for annotating a wide range of Arabic text formats, domains and genres including both vowelized and non-vowelized text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications. We foresee the advantage of enriching the text with part-of-speech tags of very fine-grained grammatical distinctions, which reflect expert interest in syntax and morphology, but not specific needs of end-users, because end-user applications are not known in advance. In this paper we review existing Arabic Part-of-Speech Taggers and tag-sets, and illustrate four different Arabic PoS tag-sets for a sample of Arabic text from the Quran. We describe the detailed fine-grained morphological feature tag set of Arabic, and the fine-grained Arabic morphological analyzer algorithm. We faced practical challenges in applying the morphological analyzer to the 100-million-word Web Arabic Corpus: we had to port the software to the National Grid Service, adapt the analyser to cope with spelling variations and errors, and utilise a Broad-Coverage Lexical Resource combining 23 traditional Arabic lexicons. Finally we outline the construction of a Gold Standard for comparative evaluation.

Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic
Majdi Sawalha | Eric Atwell
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Broad-coverage language resources which provide prior linguistic knowledge must improve the accuracy and the performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons have been constructed; these lexicons differ in ordering, size and aim or goal of construction. We collected 23 machine-readable lexicons, which are freely available on the web. We combined these into one large broad-coverage lexical resource by extracting information from disparate formats and merging traditional Arabic lexicons. To evaluate the broad-coverage lexical resource we computed coverage over the Qur’an, the Corpus of Contemporary Arabic, and a sample from the Arabic Web Corpus, using two methods. The first, counting exact word matches between test corpora and lexicon, scored about 65-68%; Arabic has a rich morphology with many combinations of roots, affixes and clitics, so about a third of words in the corpora did not have an exact match in the lexicon. The second computes coverage in terms of use in a lemmatizer program, which strips clitics to look for a match for the underlying lexeme; this scored about 82-85%.
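
The two coverage methods can be contrasted with a toy sketch. Everything here is an assumption for illustration: the tokens are Latin transliterations, the clitic lists are tiny and hypothetical, and `candidate_stems` is a naive stand-in for the lemmatizer program the abstract mentions, not its actual algorithm.

```python
# Hypothetical, deliberately tiny clitic inventories (transliterated).
PREFIXES = ("wa", "al", "bi")
SUFFIXES = ("hu", "ha")


def candidate_stems(token):
    """Return the token plus forms left after removing one prefix and/or suffix."""
    stems = {token}
    for prefix in ("",) + PREFIXES:
        for suffix in ("",) + SUFFIXES:
            if (token.startswith(prefix) and token.endswith(suffix)
                    and len(token) > len(prefix) + len(suffix)):
                stems.add(token[len(prefix):len(token) - len(suffix)])
    return stems


def coverage(tokens, lexicon, strip_clitics=False):
    """Fraction of tokens with a lexicon match, exact or after clitic stripping."""
    hits = 0
    for token in tokens:
        forms = candidate_stems(token) if strip_clitics else {token}
        if forms & lexicon:
            hits += 1
    return hits / len(tokens)


toy_lexicon = {"kitab", "qalam", "bayt"}
toy_tokens = ["kitab", "alkitab", "wabayt", "qalamhu", "madrasa"]
exact_coverage = coverage(toy_tokens, toy_lexicon)
stripped_coverage = coverage(toy_tokens, toy_lexicon, strip_clitics=True)
```

On the toy data, exact matching covers 1 of 5 tokens and clitic stripping covers 4 of 5, mirroring the direction of the gap the abstract reports between the two methods.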

ProPOSEC: A Prosody and PoS Annotated Spoken English Corpus
Claire Brierley | Eric Atwell
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We have previously reported on ProPOSEL, a purpose-built Prosody and PoS English Lexicon compatible with the Python Natural Language ToolKit. ProPOSEC is a new corpus research resource built using this lexicon, intended for distribution with the Aix-MARSEC dataset. ProPOSEC comprises multi-level parallel annotations, juxtaposing prosodic and syntactic information from different versions of the Spoken English Corpus, with canonical dictionary forms, in a query format optimized for Perl, Python, and text processing programs. The order and content of fields in the text file are as follows: (1) Aix-MARSEC file number; (2) word; (3) LOB PoS-tag; (4) C5 PoS-tag; (5) Aix SAM-PA phonetic transcription; (6) SAM-PA phonetic transcription from ProPOSEL; (7) syllable count; (8) lexical stress pattern; (9) default content or function word tag; (10) DISC stressed and syllabified phonetic transcription; (11) alternative DISC representation, incorporating lexical stress pattern; (12) nested arrays of phonemes and tonic stress marks from Aix. As an experimental dataset, ProPOSEC can be used to study correlations between these annotation tiers, where significant findings are then expressed as additional features for phrasing models integral to Text-to-Speech and Speech Recognition. As a training set, ProPOSEC can be used for machine learning tasks in Information Retrieval and Speech Understanding systems.

2008

An AI-inspired intelligent agent/student architecture to combine Language Resources research and teaching
Bayan Abu Shawar | Eric Atwell
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes experimental use of the multi-agent architecture to integrate Natural Language and Information Systems research and teaching, by casting a group of students as intelligent agents to collect and analyse English language resources from around the world. Section 2 and section 3 describe the hybrid intelligent information systems experiments at the University of Leeds and the results generated, including several research papers accepted at international conferences, and a finalist entry in the British Computer Society Machine Intelligence contest. Our proposals for applying the multi-agent idea in other universities such as the Arab Open University are presented in section 4. The conclusion is presented in section 5: the success of hybrid intelligent information systems experiments in generating research papers within a limited time.

ProPOSEL: A Prosody and POS English Lexicon for Language Engineering
Claire Brierley | Eric Atwell
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

ProPOSEL is a prototype prosody and PoS (part-of-speech) English lexicon for Language Engineering, derived from the following language resources: the computer-usable dictionary CUVPlus, the CELEX-2 database, the Carnegie-Mellon Pronouncing Dictionary, and the BNC, LOB and Penn Treebank PoS-tagged corpora. The lexicon is designed for the target application of prosodic phrase break prediction but is also relevant to other machine learning and language engineering tasks. It supplements the existing record structure for wordform entries in CUVPlus with syntactic annotations from rival PoS-tagging schemes, mapped to fields for default closed and open-class word categories and for lexical stress patterns representing the rhythmic structure of wordforms and interpreted as potential new text-based features for automatic phrase break classifiers. The current version of the lexicon comes as a textfile of 104052 separate entries and is intended for distribution with the Natural Language ToolKit; it is therefore accompanied by supporting Python software for manipulating the data so that it can be used for Natural Language Processing (NLP) and corpus-based research in speech synthesis and speech recognition.

ProPOSEL: a human-oriented prosody and PoS English lexicon for machine-learning and NLP
Claire Brierley | Eric Atwell
Coling 2008: Proceedings of the Workshop on Cognitive Aspects of the Lexicon (COGALEX 2008)

Comparative Evaluation of Arabic Language Morphological Analysers and Stemmers
Majdi Sawalha | Eric Atwell
Coling 2008: Companion volume: Posters

2007

Different measurement metrics to evaluate a chatbot system
Bayan Abu Shawar | Eric Atwell
Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies

2004

A Chatbot as a Novel Corpus Visualization Tool
Bayan Abu Shawar | Eric Atwell
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2000

The ISLE Corpus of Non-Native Spoken English
Wolfgang Menzel | Eric Atwell | Patrizia Bonaventura | Daniel Herron | Peter Howarth | Rachel Morton | Clive Souter
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Using Lexical Semantic Knowledge from Machine Readable Dictionaries for Domain Independent Language Modelling
George Demetriou | Eric Atwell | Clive Souter
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Language Identification in Unknown Signals
John Elliott | Eric Atwell | Bill Whyte
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

‘Increasing our Ignorance’ of Language: Identifying Language Structure in an Unknown ‘Signal’
John Elliott | Eric Atwell | Bill Whyte
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop

Comparing Linguistic Interpretation Schemes for English Corpora
Eric Atwell | George Demetriou | John Hughes | Amanda Schiffrin | Clive Souter | Sean Wilcock
Proceedings of the COLING-2000 Workshop on Linguistically Interpreted Corpora

1997

A Generic Template to evaluate integrated components in spoken dialogue systems
Gavin E. Churcher | Eric S. Atwell | Clive Souter
Interactive Spoken Dialog Systems: Bringing Speech and NLP Together in Real Applications

1994

AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models
Eric Atwell | John Hughes | Clive Souter
The Balancing Act: Combining Symbolic and Statistical Approaches to Language

1988

Project April - A Progress Report
Robin Haigh | Geoffrey Sampson | Eric Atwell
26th Annual Meeting of the Association for Computational Linguistics

1987

How to Detect Grammatical Errors in a Text Without Parsing It
Eric Steven Atwell
Third Conference of the European Chapter of the Association for Computational Linguistics

Pattern Recognition Applied to the Acquisition of a Grammatical Classification System From Unrestricted English Text
Eric Steven Atwell | Nicos Frixou Drakos
Third Conference of the European Chapter of the Association for Computational Linguistics