C. Lee Giles


2020

Acknowledgement Entity Recognition in CORD-19 Papers
Jian Wu | Pei Wang | Xin Wei | Sarah Rajtmajer | C. Lee Giles | Christopher Griffin
Proceedings of the First Workshop on Scholarly Document Processing

Acknowledgements are ubiquitous in scholarly papers. Existing acknowledgement entity recognition methods assume all named entities are acknowledged. Here, we examine the nuances between acknowledged and named entities by analyzing sentence structure. We develop an acknowledgement extraction system, AckExtract, based on open-source text mining software, and evaluate our method using manually labeled data. AckExtract takes the PDF of a scholarly paper as input and outputs acknowledgement entities. Results show an overall performance of F_1=0.92. We built a supplementary database by linking CORD-19 papers with acknowledgement entities extracted by AckExtract, including persons and organizations, and find that only up to 50–60% of named entities are actually acknowledged. We further analyze chronological trends of acknowledgement entities in CORD-19 papers. All code and labeled data are publicly available at https://github.com/lamps-lab/ackextract.

Learning CNF Blocking for Large-scale Author Name Disambiguation
Kunho Kim | Athar Sefid | C. Lee Giles
Proceedings of the First Workshop on Scholarly Document Processing

Author name disambiguation (AND) algorithms identify a unique author entity record from all similar or same publication records in scholarly or similar databases. Typically, a clustering method is used that requires calculation of similarities between each possible record pair. However, the total number of pairs grows quadratically with the size of the author database, making such clustering difficult for millions of records. One remedy is a blocking function that reduces the number of pairwise similarity calculations. Here, we introduce a new way of learning blocking schemes by using a conjunctive normal form (CNF) in contrast to the disjunctive normal form (DNF). We demonstrate on PubMed author records that CNF blocking reduces more pairs while preserving high pairs completeness compared to previous DNF-based methods, and that the computation time is significantly reduced. In addition, we show how to ensure that the method produces disjoint blocks so that much of the AND algorithm can be efficiently parallelized. Our CNF blocking method is tested on the entire PubMed database of 80 million author mentions and efficiently removes 82.17% of all author record pairs in 10 minutes.
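
To make the CNF/DNF distinction in this abstract concrete, here is a minimal sketch, not the authors' implementation. A CNF blocking function is a conjunction of clauses, each clause a disjunction of predicates; a DNF is the reverse. The toy records, field names, predicates, and the particular CNF below are all hypothetical. The sketch also enumerates every pair for clarity, which a real blocking system would avoid by indexing on blocking keys.

```python
from itertools import combinations

# Hypothetical toy author records; field names are illustrative only.
records = [
    {"id": 0, "first": "jian", "last": "wu", "affil": "psu"},
    {"id": 1, "first": "j", "last": "wu", "affil": "psu"},
    {"id": 2, "first": "jian", "last": "wang", "affil": "mit"},
    {"id": 3, "first": "kunho", "last": "kim", "affil": "psu"},
]

# Hypothetical blocking predicates: each compares one field of two records.
def same_last(a, b):
    return a["last"] == b["last"]

def same_first(a, b):
    return a["first"] == b["first"]

def same_affil(a, b):
    return a["affil"] == b["affil"]

def cnf_match(a, b, cnf):
    """CNF: every clause must hold; a clause holds if any predicate in it holds."""
    return all(any(pred(a, b) for pred in clause) for clause in cnf)

def dnf_match(a, b, dnf):
    """DNF: some term must hold; a term holds if all predicates in it hold."""
    return any(all(pred(a, b) for pred in term) for term in dnf)

def candidate_pairs(recs, match):
    """Pairs kept by the blocking function (brute force, for illustration)."""
    return [(a["id"], b["id"]) for a, b in combinations(recs, 2) if match(a, b)]

def reduction_ratio(recs, kept):
    """Fraction of all possible pairs that blocking removes."""
    total = len(recs) * (len(recs) - 1) // 2
    return 1 - len(kept) / total

def pairs_completeness(kept, true_pairs):
    """Fraction of true matching pairs that survive blocking."""
    return len(set(kept) & set(true_pairs)) / len(true_pairs)

# same_last AND (same_first OR same_affil); an assumed example scheme.
cnf = [[same_last], [same_first, same_affil]]
kept = candidate_pairs(records, lambda a, b: cnf_match(a, b, cnf))
rr = reduction_ratio(records, kept)
pc = pairs_completeness(kept, true_pairs=[(0, 1)])  # assumed ground truth
```

On this toy data the CNF keeps only the pair (0, 1), so five of the six possible pairs are removed while the one assumed true match is retained.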

CODA-19: Using a Non-Expert Crowd to Annotate Research Aspects on 10,000+ Abstracts in the COVID-19 Open Research Dataset
Ting-Hao Kenneth Huang | Chieh-Yang Huang | Chien-Kuang Cornelia Ding | Yen-Chia Hsu | C. Lee Giles
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

This paper introduces CODA-19, a human-annotated dataset that codes the Background, Purpose, Method, Finding/Contribution, and Other sections of 10,966 English abstracts in the COVID-19 Open Research Dataset. CODA-19 was created by 248 crowd workers from Amazon Mechanical Turk within 10 days, and achieved labeling quality comparable to that of experts. Each abstract was annotated by nine different workers, and the final labels were acquired by majority vote. The inter-annotator agreement (Cohen’s kappa) between the crowd and the biomedical expert (0.741) is comparable to inter-expert agreement (0.788). CODA-19’s labels have an accuracy of 82.2% when compared to the biomedical expert’s labels, while the accuracy between experts was 85.0%. Reliable human annotations help scientists access and integrate the rapidly accelerating coronavirus literature, and also serve as the battery of AI/NLP research, but obtaining expert annotations can be slow. We demonstrated that a non-expert crowd can be rapidly employed at scale to join the fight against COVID-19.
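
The two aggregation steps mentioned in this abstract, majority voting over worker labels and Cohen's kappa between two label sequences, can be sketched as follows. This is an illustrative sketch under assumed inputs, not the CODA-19 pipeline: the label strings and example sequences are made up, and ties in the vote are broken by first occurrence, which may differ from what the authors did.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is agreement expected by chance from each annotator's label frequencies.
    Undefined (division by zero) when p_e == 1.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: nine worker labels for one text segment.
worker_labels = ["Method"] * 5 + ["Finding"] * 3 + ["Other"]
final_label = majority_vote(worker_labels)

# Hypothetical example: aggregated crowd labels vs. one expert's labels.
crowd = ["Background", "Background", "Method", "Finding"]
expert = ["Background", "Method", "Method", "Finding"]
kappa = cohens_kappa(crowd, expert)
```

Kappa corrects raw agreement for chance, which is why the abstract reports it alongside plain accuracy.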

2018

Distractor Generation for Multiple Choice Questions Using Learning to Rank
Chen Liang | Xiao Yang | Neisarg Dave | Drew Wham | Bart Pursel | C. Lee Giles
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We investigate how machine learning models, specifically ranking models, can be used to select useful distractors for multiple choice questions. Our proposed models can learn to select distractors that resemble those in actual exam questions, which is different from most existing unsupervised ontology-based and similarity-based methods. We empirically study feature-based and neural network (NN)-based ranking models with experiments on the recently released SciQ dataset and our MCQL dataset. Experimental results show that feature-based ensemble learning methods (random forest and LambdaMART) outperform both the NN-based method and unsupervised baselines. These two datasets can also be used as benchmarks for distractor generation.

2015

Learning a Deep Hybrid Model for Semi-Supervised Text Classification
Alexander Ororbia II | C. Lee Giles | David Reitter
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Measuring Prerequisite Relations Among Concepts
Chen Liang | Zhaohui Wu | Wenyi Huang | C. Lee Giles
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction
Sujatha Das Gollapalli | Cornelia Caragea | Xiaoli Li | C. Lee Giles
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

Storybase: Towards Building a Knowledge Base for News Events
Zhaohui Wu | Chen Liang | C. Lee Giles
Proceedings of ACL-IJCNLP 2015 System Demonstrations

2013

Measuring Term Informativeness in Context
Zhaohui Wu | C. Lee Giles
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2010

Enhancing Cross Document Coreference of Web Documents with Context Similarity and Very Large Scale Text Categorization
Jian Huang | Pucktada Treeratpituk | Sarah Taylor | C. Lee Giles
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

SEERLAB: A System for Extracting Keyphrases from Scholarly Documents
Pucktada Treeratpituk | Pradeep Teregowda | Jian Huang | C. Lee Giles
Proceedings of the 5th International Workshop on Semantic Evaluation

2009

Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering
Jian Huang | Sarah M. Taylor | Jonathan L. Smith | Konstantinos A. Fotiadis | C. Lee Giles
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Solving the “Who’s Mark Johnson Puzzle”: Information Extraction Based Cross Document Coreference
Jian Huang | Sarah M. Taylor | Jonathan L. Smith | Konstantinos A. Fotiadis | C. Lee Giles
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium

2008

ParsCit: an Open-source CRF Reference String Parsing Package
Isaac Councill | C. Lee Giles | Min-Yen Kan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We describe ParsCit, a freely available, open-source implementation of a reference string parsing package. At the core of ParsCit is a trained conditional random field (CRF) model used to label the token sequences in the reference string. A heuristic model wraps this core with added functionality to identify reference strings in a plain text file and to retrieve the citation contexts. The package comes with utilities to run it as a web service or as a standalone utility. We evaluate ParsCit on three distinct reference string datasets and show that it compares well with previously published work.