John S. Y. Lee

Also published as: John Lee


2020

A Counselling Corpus in Cantonese
John Lee | Tianyuan Cai | Wenxiu Xie | Lam Xing
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Virtual agents are increasingly used for delivering health information in general, and mental health assistance in particular. This paper presents a corpus designed for training a virtual counsellor in Cantonese, a variety of Chinese. The corpus consists of a domain-independent subcorpus that supports small talk for rapport building with users, and a domain-specific subcorpus that provides material for a particular area of counselling. The former consists of ELIZA-style responses, chitchat expressions, and a dataset of general dialog, all of which are reusable across counselling domains. The latter consists of example user inputs and appropriate chatbot replies relevant to the specific domain. In a case study, we created a chatbot with a domain-specific subcorpus that addressed 25 issues in test anxiety, with 436 inputs solicited from native speakers of Cantonese and 150 chatbot replies harvested from mental health websites. Preliminary evaluations show that Word Mover’s Distance achieved 56% accuracy in identifying the issue in user input, outperforming a number of baselines.

A Dataset for Investigating the Impact of Feedback on Student Revision Outcome
Ildiko Pilan | John Lee | Chak Yan Yeung | Jonathan Webster
Proceedings of the 12th Language Resources and Evaluation Conference

We present an annotation scheme and a dataset of teacher feedback provided for texts written by non-native speakers of English. The dataset consists of student-written sentences in their original and revised versions with teacher feedback provided for the errors. Feedback appears both in the form of open-ended comments and error category tags. We focus on a specific error type, namely linking adverbial (e.g. however, moreover) errors. The dataset has been annotated for two aspects: (i) revision outcome, establishing whether the re-written student sentence was correct, and (ii) directness, indicating whether teachers explicitly provided the correction in their feedback. This dataset allows for studies around the characteristics of teacher feedback and how these influence students’ revision outcome. We describe the data preparation process and we present initial statistical investigations regarding the effect of different feedback characteristics on revision outcome. These show that open-ended comments and mitigating expressions appear in a higher proportion of successful revisions than unsuccessful ones, while directness and metalinguistic terms have no effect. Given that the use of this type of data is relatively unexplored in natural language processing (NLP) applications, we also report some observations and challenges when working with feedback data.

Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System
Seid Muhie Yimam | Gopalakrishnan Venkatesh | John Lee | Chris Biemann
Proceedings of the 12th Language Resources and Evaluation Conference

We present the first approach to automatically building resources for academic writing. The aim is to build a writing aid system that automatically edits a text so that it better adheres to the academic style of writing. On top of existing academic resources, such as the Corpus of Contemporary American English (COCA) academic Word List, the New Academic Word List, and the Academic Collocation List, we also explore how to dynamically build such resources that would be used to automatically identify informal or non-academic words or phrases. The resources are compiled using different generic approaches that can be extended for different domains and languages. We describe the evaluation of resources with a system implementation. The system consists of informal word identification (IWI), academic candidate paraphrase generation, and paraphrase ranking components. To generate candidates and rank them in context, we have used the PPDB and WordNet paraphrase resources. We use the Concepts in Context (CoInCO) “All-Words” lexical substitution dataset both for the informal word identification and paraphrase generation experiments. Our informal word identification component achieves an F-1 score of 82%, significantly outperforming a stratified classifier baseline. The main contribution of this work is a domain-independent methodology to build targeted resources for writing aids.

Using Verb Frames for Text Difficulty Assessment
John Lee | Meichun Liu | Tianyuan Cai
Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet

This paper presents the first investigation on using semantic frames to assess text difficulty. Based on Mandarin VerbNet, a verbal semantic database that adopts a frame-based approach, we examine usage patterns of ten verbs in a corpus of graded Chinese texts. We identify a number of characteristics in texts at advanced grades: more frequent use of non-core frame elements; more frequent omission of some core frame elements; increased preference for noun phrases rather than clauses as verb arguments; and more frequent metaphoric usage. These characteristics can potentially be useful for automatic prediction of text readability.

Automatic Assistance for Academic Word Usage
Dariush Saberi | John Lee | Jonathan James Webster
Proceedings of the 28th International Conference on Computational Linguistics

This paper describes a writing assistance system that helps students improve their academic writing. Given an input text, the system suggests lexical substitutions that aim to incorporate more academic vocabulary. The substitution candidates are drawn from an academic word list and ranked by a masked language model. Experimental results show that lexical formality analysis can improve the quality of the suggestions, in comparison to a baseline that relies on the masked language model only.

Using Bilingual Patents for Translation Training
John Lee | Benjamin Tsou | Tianyuan Cai
Proceedings of the 28th International Conference on Computational Linguistics

While bilingual corpora have been instrumental for machine translation, their utility for training translators has been less explored. We investigate the use of bilingual corpora as pedagogical tools for translation in the technical domain. In a user study, novice translators revised Chinese translations of English patents through bilingual concordancing. Results show that concordancing with an in-domain bilingual corpus can yield greater improvement in translation quality of technical terms than a general-domain bilingual corpus.

2019

Noun Generation for Nominalization in Academic Writing
Dariush Saberi | John Lee
Proceedings of the 4th Workshop on Computational Creativity in Language Generation

Difficulty-aware Distractor Generation for Gap-Fill Items
Chak Yan Yeung | John Lee | Benjamin Tsou
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

Personalized Substitution Ranking for Lexical Simplification
John Lee | Chak Yan Yeung
Proceedings of the 12th International Conference on Natural Language Generation

A lexical simplification (LS) system substitutes difficult words in a text with simpler ones to make it easier for the user to understand. In the typical LS pipeline, the Substitution Ranking step determines the best substitution out of a set of candidates. Most current systems do not consider the user’s vocabulary proficiency, and always aim for the simplest candidate. This approach may overlook less-simple candidates that the user can understand, and that are semantically closer to the original word. We propose a personalized approach for Substitution Ranking to identify the candidate that is the closest synonym and is non-complex for the user. In experiments on learners of English at different proficiency levels, we show that this approach enhances the semantic faithfulness of the output, at the cost of a relatively small increase in the number of complex words.
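
The ranking criterion described in this abstract (the closest synonym that is non-complex for the user) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the similarity scores, and the representation of a user's vocabulary as a plain set of known words are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def rank_substitution(candidates: Dict[str, float],
                      known_words: Set[str]) -> Optional[str]:
    """Pick the candidate closest in meaning to the original word
    among those the user is assumed to know.

    candidates maps each substitute to a (hypothetical) semantic
    similarity score with the original word; higher means closer.
    Returns None when the user knows none of the candidates.
    """
    # Keep only candidates that are non-complex for this user.
    understandable = {w: s for w, s in candidates.items() if w in known_words}
    if not understandable:
        return None
    # Among those, prefer the semantically closest one.
    return max(understandable, key=understandable.get)


# Hypothetical similarity scores for substitutes of "arduous".
candidates = {"strenuous": 0.92, "difficult": 0.85, "hard": 0.78}
beginner = {"hard"}
intermediate = {"hard", "difficult"}

print(rank_substitution(candidates, beginner))
print(rank_substitution(candidates, intermediate))
```

In this toy setting the more proficient user receives a substitute that is semantically closer to the original word, whereas a system that always chooses the simplest candidate would give both users the same output.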

2018

L1-L2 Parallel Treebank of Learner Chinese: Overused and Underused Syntactic Structures
Keying Li | John Lee
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Personalizing Lexical Simplification
John Lee | Chak Yan Yeung
Proceedings of the 27th International Conference on Computational Linguistics

A lexical simplification (LS) system aims to substitute complex words with simple words in a text, while preserving its meaning and grammaticality. Despite individual users’ differences in vocabulary knowledge, current systems do not consider these variations; rather, they are trained to find one optimal substitution or ranked list of substitutions for all users. We evaluate the performance of a state-of-the-art LS system on individual learners of English at different proficiency levels, and measure the benefits of using complex word identification (CWI) models to personalize the system. Experimental results show that even a simple personalized CWI model, based on graded vocabulary lists, can help the system avoid some unnecessary simplifications and produce more readable output.

Personalized Text Retrieval for Learners of Chinese as a Foreign Language
Chak Yan Yeung | John Lee
Proceedings of the 27th International Conference on Computational Linguistics

This paper describes a personalized text retrieval algorithm that helps language learners select the most suitable reading material in terms of vocabulary complexity. The user first rates their knowledge of a small set of words, chosen by a graph-based active learning model. The system trains a complex word identification model on this set, and then applies the model to find texts that contain the desired proportion of new, challenging, and familiar vocabulary. In an evaluation on learners of Chinese as a foreign language, we show that this algorithm is effective in identifying simpler texts for low-proficiency learners, and more challenging ones for high-proficiency learners.

Register-sensitive Translation: a Case Study of Mandarin and Cantonese (Non-archival Extended Abstract)
Tak-sum Wong | John Lee
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Assisted Nominalization for Academic English Writing
John Lee | Dariush Saberi | Marvin Lam | Jonathan Webster
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)

2017

Identifying Speakers and Listeners of Quoted Speech in Literary Works
Chak Yan Yeung | John Lee
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We present the first study that evaluates both speaker and listener identification for direct speech in literary texts. Our approach consists of two steps: identification of speakers and listeners near the quotes, and dialogue chain segmentation. Evaluation results show that, on a corpus of literary texts, this approach outperforms a state-of-the-art rule-based approach.

Lexical Simplification with the Deep Structured Similarity Model
Lis Pereira | Xiaodong Liu | John Lee
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We explore the application of a Deep Structured Similarity Model (DSSM) to ranking in lexical simplification. Our results show that the DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming the state-of-the-art on two standard datasets used for the task.

Automatic Difficulty Assessment for Chinese Texts
John Lee | Meichun Liu | Chun Yin Lam | Tak On Lau | Bing Li | Keying Li
Proceedings of the IJCNLP 2017, System Demonstrations

We present a web-based interface that automatically assesses reading difficulty of Chinese texts. The system performs word segmentation, part-of-speech tagging and dependency parsing on the input text, and then determines the difficulty levels of the vocabulary items and grammatical constructions in the text. Furthermore, the system highlights the words and phrases that must be simplified or re-written in order to conform to the user-specified target difficulty level. Evaluation results show that the system accurately identifies the vocabulary level of 89.9% of the words, and detects grammar points at 0.79 precision and 0.83 recall.

Towards Universal Dependencies for Learner Chinese
John Lee | Herman Leung | Keying Li
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

Distractor Generation for Chinese Fill-in-the-blank Items
Shu Jiang | John Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper reports the first study on automatic generation of distractors for fill-in-the-blank items for learning Chinese vocabulary. We investigate the quality of distractors generated by a number of criteria, including part-of-speech, difficulty level, spelling, word co-occurrence and semantic similarity. Evaluations show that a semantic similarity measure, based on the word2vec model, yields distractors that are significantly more plausible than those generated by baseline methods.

Carrier Sentence Selection for Fill-in-the-blank Items
Shu Jiang | John Lee
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

Fill-in-the-blank items are a common form of exercise in computer-assisted language learning systems. To automatically generate an effective item, the system must be able to select a high-quality carrier sentence that illustrates the usage of the target word. Previous approaches for carrier sentence selection have considered sentence length, vocabulary difficulty, the position of the target word and the presence of finite verbs. This paper investigates the utility of word co-occurrence statistics and lexical similarity as selection criteria. In an evaluation on generating fill-in-the-blank items for learning Chinese as a foreign language, we show that these two criteria can improve carrier sentence quality.

L1-L2 Parallel Dependency Treebank as Learner Corpus
John Lee | Keying Li | Herman Leung
Proceedings of the 15th International Conference on Parsing Technologies

This opinion paper proposes the use of a parallel treebank as a learner corpus. We show how an L1-L2 parallel treebank — i.e., parse trees of non-native sentences, aligned to the parse trees of their target hypotheses — can facilitate retrieval of sentences with specific learner errors. We argue for its benefits, in terms of corpus re-use and interoperability, over a conventional learner corpus annotated with error tags. As a proof of concept, we conduct a case study on word-order errors made by learners of Chinese as a foreign language. We report precision and recall in retrieving a range of word-order error categories from L1-L2 tree pairs annotated in the Universal Dependency framework.

Splitting Complex English Sentences
John Lee | J. Buddhika K. Pathirage Don
Proceedings of the 15th International Conference on Parsing Technologies

This paper applies parsing technology to the task of syntactic simplification of English sentences, focusing on the identification of text spans that can be removed from a complex sentence. We report the most comprehensive evaluation to date on this task, using a dataset of sentences that exhibit simplification based on coordination, subordination, punctuation/parataxis, adjectival clauses, participial phrases, and appositive phrases. We train a decision tree with features derived from text span length, POS tags and dependency relations, and show that it significantly outperforms a parser-only baseline.

Quantitative Comparative Syntax on the Cantonese-Mandarin Parallel Dependency Treebank
Tak-sum Wong | Kim Gerdes | Herman Leung | John Lee
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2016

An Annotated Corpus of Direct Speech
John Lee | Chak Yan Yeung
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We propose a scheme for annotating direct speech in literary texts, based on the Text Encoding Initiative (TEI) and the coreference annotation guidelines from the Message Understanding Conference (MUC). The scheme encodes the speakers and listeners of utterances in a text, as well as the quotative verbs that report the utterances. We measure inter-annotator agreement on this annotation task. We then present statistics on a manually annotated corpus that consists of books from the New Testament. Finally, we visualize the corpus as a conversational network.

A Dependency Treebank of the Chinese Buddhist Canon
Tak-sum Wong | John Lee
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a dependency treebank of the Chinese Buddhist Canon, which contains 1,514 texts with about 50 million Chinese characters. The treebank was created by an automatic parser trained on a smaller treebank, containing four manually annotated sutras (Lee and Kong, 2014). We report results on word segmentation, part-of-speech tagging and dependency parsing, and discuss challenges posed by the processing of medieval Chinese. In a case study, we exploit the treebank to examine verbs frequently associated with Buddha, and to analyze usage patterns of quotative verbs in direct speech. Our results suggest that certain quotative verbs imply status differences between the speaker and the listener.

A Reading Environment for Learners of Chinese as a Foreign Language
John Lee | Chun Yin Lam | Shu Jiang
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a mobile app that provides a reading environment for learners of Chinese as a foreign language. The app includes a text database that offers over 500K articles from Chinese Wikipedia. These articles have been word-segmented; each word is linked to its entry in a Chinese-English dictionary, and to automatically-generated review exercises. The app estimates the reading proficiency of the user based on a “to-learn” list of vocabulary items. It automatically constructs and maintains this list by tracking the user’s dictionary lookup behavior and performance in review exercises. When a user searches for articles to read, search results are filtered such that the proportion of unknown words does not exceed a user-specified threshold.
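
The search filter mentioned at the end of this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the app's code: the function names are hypothetical, articles are assumed to be already word-segmented, and words on the "to-learn" list are treated as the user's unknown words.

```python
from typing import Dict, List, Set


def unknown_ratio(words: List[str], to_learn: Set[str]) -> float:
    """Fraction of tokens in a segmented article that are assumed unknown."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in to_learn) / len(words)


def filter_articles(articles: Dict[str, List[str]],
                    to_learn: Set[str],
                    threshold: float) -> List[str]:
    """Return titles whose proportion of unknown words is within the threshold."""
    return [title for title, words in articles.items()
            if unknown_ratio(words, to_learn) <= threshold]


# Toy data: two already-segmented articles and a small to-learn list.
articles = {
    "A": ["我", "喜歡", "貓"],
    "B": ["量子", "力學", "很", "難"],
}
to_learn = {"量子", "力學"}

print(filter_articles(articles, to_learn, threshold=0.2))
```

Raising the threshold admits more challenging articles; a threshold of 0.5 would also admit article "B" in this toy data, since half of its tokens are on the to-learn list.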

A Customizable Editor for Text Simplification
John Lee | Wenlong Zhao | Wenxiu Xie
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a browser-based editor for simplifying English text. Given an input sentence, the editor performs both syntactic and lexical simplification. It splits a complex sentence into shorter ones, and suggests word substitutions in drop-down lists. The user can choose the best substitution from the list, undo any inappropriate splitting, and further edit the sentence as necessary. A significant novelty is that the system accepts a customized vocabulary list for a target reader population. It identifies all words in the text that do not belong to the list, and attempts to substitute them with words from the list, thus producing a text tailored for the targeted readers.
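
The vocabulary-constrained substitution step described in this abstract can be sketched roughly as below. The synonym table, the function name, and the choice to flag words that have no in-list substitute are assumptions for illustration; the abstract does not specify how the editor obtains or ranks its candidates.

```python
from typing import Dict, List, Set, Tuple


def simplify(tokens: List[str],
             vocabulary: Set[str],
             synonyms: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Replace out-of-list words with in-list synonyms where possible.

    Returns the rewritten tokens and the out-of-list words for which
    no in-list substitute was found (left for manual editing).
    """
    output: List[str] = []
    unresolved: List[str] = []
    for tok in tokens:
        key = tok.lower()
        if key in vocabulary:
            output.append(tok)
            continue
        # First synonym (in the order given) that belongs to the target list.
        replacement = next(
            (s for s in synonyms.get(key, []) if s in vocabulary), None)
        if replacement is None:
            unresolved.append(tok)
            output.append(tok)
        else:
            output.append(replacement)
    return output, unresolved


# Toy target vocabulary and a hypothetical synonym table.
vocabulary = {"the", "will", "doctor", "start", "begin"}
synonyms = {"physician": ["medic", "doctor"], "commence": ["start", "begin"]}
tokens = ["the", "physician", "will", "commence", "the", "procedure"]

simplified, unresolved = simplify(tokens, vocabulary, synonyms)
print(" ".join(simplified))
print(unresolved)
```

Words that cannot be substituted are returned separately so that a user could edit them by hand, mirroring the interactive editing the abstract describes.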

A CALL System for Learning Preposition Usage
John Lee | Donald Sturgeon | Mengqi Luo
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Personalized Exercises for Preposition Learning
John Lee | Mengqi Luo
Proceedings of ACL-2016 System Demonstrations

Developing Universal Dependencies for Mandarin Chinese
Herman Leung | Rafaël Poiret | Tak-sum Wong | Xinying Chen | Kim Gerdes | John Lee
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

This article proposes a Universal Dependency Annotation Scheme for Mandarin Chinese, including POS tags and dependency analysis. We identify cases of idiosyncrasy of Mandarin Chinese that are difficult to fit into the current schema, which has mainly been based on the descriptions of various Indo-European languages. We discuss differences between our scheme and those of the Stanford Chinese Dependencies and the Chinese Dependency Treebank.

2015

Translation Quality and Effort: Options versus Post-editing
Donald Sturgeon | John S. Y. Lee
Proceedings of the 12th International Conference on Natural Language Processing

Automatic Detection of Sentence Fragments
Chak Yan Yeung | John Lee
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Automatic Detection of Comma Splices
John Lee | Chak Yan Yeung | Martin Chodorow
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

2013

Treebanking for Data-driven Research in the Classroom
John Lee | Ying Cheuk Hui | Yin Hei Kong
Proceedings of the Fourth Workshop on Teaching NLP and CL

2012

A Dependency Treebank of Classical Chinese Poems
John Lee | Yin Hei Kong
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Corpus of Textual Revisions in Second Language Writing
John Lee | Jonathan Webster
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A Classical Chinese Corpus with Nested Part-of-Speech Tags
John Lee
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Extracting Networks of People and Places from Literary Texts
John Lee | Chak Yan Yeung
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

Glimpses of Ancient China from Classical Chinese Poems
John Lee | Tak-sum Wong
Proceedings of COLING 2012: Posters

2011

Toward a Parallel Corpus of Spoken Cantonese and Written Chinese
John Lee
Proceedings of 5th International Joint Conference on Natural Language Processing

A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
John Lee | Jason Naradowsky | David A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Porting an Ancient Greek and Latin Treebank
John Lee | Dag Haug
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We have recently converted a dependency treebank, consisting of ancient Greek and Latin texts, from one annotation scheme to another that was independently designed. This paper makes two observations about this conversion process. First, we show that, despite significant surface differences between the two treebanks, a number of straightforward transformation rules yield a substantial level of compatibility between them, giving evidence for their sound design and high quality of annotation. Second, we analyze some linguistic annotations that require further disambiguation, proposing some simple yet effective machine learning methods.

2009

Human Evaluation of Article and Noun Number Usage: Influences of Context and Construction Variability
John Lee | Joel Tetreault | Martin Chodorow
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

Correcting Misuse of Verb Forms
John Lee | Stephanie Seneff
Proceedings of ACL-08: HLT

A Nearest-Neighbor Approach to the Automatic Analysis of Ancient Greek Morphology
John Lee
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

Detection of Non-Native Sentences Using Machine-Translated Training Data
John Lee | Ming Zhou | Xiaohua Liu
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Detecting Erroneous Sentences using Automatically Mined Sequential Patterns
Guihua Sun | Xiaohua Liu | Gao Cong | Ming Zhou | Zhongyang Xiong | John Lee | Chin-Yew Lin
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

A Computational Model of Text Reuse in Ancient Literary Texts
John Lee
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Combining Statistical and Knowledge-Based Spoken Language Understanding in Conditional Models
Ye-Yi Wang | Alex Acero | Milind Mahajan | John Lee
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2004

Automatic Article Restoration
John Lee
Proceedings of the Student Research Workshop at HLT-NAACL 2004

1997

Referring to Displays in Multimodal Interfaces
Daqing He | Graeme Ritchie | John Lee
Referring Phenomena in a Multimedia Context and their Computational Treatment