Kyle Lo


pdf bib
S2ORC: The Semantic Scholar Open Research Corpus
Kyle Lo | Lucy Lu Wang | Mark Neumann | Rodney Kinney | Daniel Weld
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We introduce S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. We hope this resource will facilitate research and development of tools and tasks for text mining over academic text.

pdf bib
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
Suchin Gururangan | Ana Marasović | Swabha Swayamdipta | Kyle Lo | Iz Beltagy | Doug Downey | Noah A. Smith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

pdf bib
TLDR: Extreme Summarization of Scientific Documents
Isabel Cachola | Kyle Lo | Arman Cohan | Daniel Weld
Findings of the Association for Computational Linguistics: EMNLP 2020

We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden. We propose CATTS, a simple yet effective learning strategy for generating TLDRs that exploits titles as an auxiliary training signal. CATTS improves upon strong baselines under both automated metrics and human evaluations. Data and code are publicly available at

pdf bib
Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions
Dongyeop Kang | Andrew Head | Risham Sidhu | Kyle Lo | Daniel Weld | Marti A. Hearst
Proceedings of the First Workshop on Scholarly Document Processing

The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. Despite prior work on definition detection, current approaches are far from being accurate enough to use in realworld applications. In this paper, we first perform in-depth error analysis of the current best performing definition detection system and discover major causes of errors. Based on this analysis, we develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and heuristic filters, and evaluate it on a standard sentence-level benchmark. Because current benchmarks evaluate randomly sampled sentences, we propose an alternative evaluation that assesses every sentence within a document. This allows for evaluating recall in addition to precision. HEDDEx outperforms the leading system on both the sentence-level and the document-level tasks, by 12.7 F1 points and 14.4 F1 points, respectively. We note that performance on the high-recall document-level task is much lower than in the standard evaluation approach, due to the necessity of incorporation of document structure as features. We discuss remaining challenges in document-level definition detection, ideas for improvements, and potential issues for the development of reading aid applications.

pdf bib
CORD-19: The COVID-19 Open Research Dataset
Lucy Lu Wang | Kyle Lo | Yoganand Chandrasekhar | Russell Reas | Jiangjiang Yang | Doug Burdick | Darrin Eide | Kathryn Funk | Yannis Katsis | Rodney Michael Kinney | Yunyao Li | Ziyang Liu | William Merrill | Paul Mooney | Dewey A. Murdick | Devvret Rishi | Jerry Sheehan | Zhihong Shen | Brandon Stilson | Alex D. Wade | Kuansan Wang | Nancy Xin Ru Wang | Christopher Wilhelm | Boya Xie | Douglas M. Raymond | Daniel S. Weld | Oren Etzioni | Sebastian Kohlmeier
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the basis of many COVID-19 text mining and discovery systems. In this article, we describe the mechanics of dataset construction, highlighting challenges and key design decisions, provide an overview of how CORD-19 has been used, and describe several shared tasks built around the dataset. We hope this resource will continue to bring together the computing community, biomedical experts, and policy makers in the search for effective treatments and management policies for COVID-19.

pdf bib
Fact or Fiction: Verifying Scientific Claims
David Wadden | Shanchuan Lin | Kyle Lo | Lucy Lu Wang | Madeleine van Zuylen | Arman Cohan | Hannaneh Hajishirzi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce scientific claim verification, a new task to select abstracts from the research literature containing evidence that SUPPORTS or REFUTES a given scientific claim, and to identify rationales justifying each decision. To study this task, we construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SciFact, and demonstrate that simple domain adaptation techniques substantially improve performance compared to models trained on Wikipedia or political news. We show that our system is able to verify claims related to COVID-19 by identifying evidence from the CORD-19 corpus. Our experiments indicate that SciFact will provide a challenging testbed for the development of new systems designed to retrieve and reason over corpora containing specialized domain knowledge. Data and code for this new task are publicly available at A leaderboard and COVID-19 fact-checking demo are available at


pdf bib
SciBERT: A Pretrained Language Model for Scientific Text
Iz Beltagy | Kyle Lo | Arman Cohan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et. al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at

pdf bib
Combining Distant and Direct Supervision for Neural Relation Extraction
Iz Beltagy | Kyle Lo | Waleed Ammar
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In relation extraction with distant supervision, noisy labels make it difficult to train quality models. Previous neural models addressed this problem using an attention mechanism that attends to sentences that are likely to express the relations. We improve such models by combining the distant supervision data with an additional directly-supervised data, which we use as supervision for the attention weights. We find that joint training on both types of supervision leads to a better model because it improves the model’s ability to identify noisy sentences. In addition, we find that sigmoidal attention weights with max pooling achieves better performance over the commonly used weighted average attention in this setup. Our proposed method achieves a new state-of-the-art result on the widely used FB-NYT dataset.


pdf bib
Construction of the Literature Graph in Semantic Scholar
Waleed Ammar | Dirk Groeneveld | Chandra Bhagavatula | Iz Beltagy | Miles Crawford | Doug Downey | Jason Dunkelberger | Ahmed Elgohary | Sergey Feldman | Vu Ha | Rodney Kinney | Sebastian Kohlmeier | Kyle Lo | Tyler Murray | Hsu-Han Ooi | Matthew Peters | Joanna Power | Sam Skjonsberg | Lucy Wang | Chris Wilhelm | Zheng Yuan | Madeleine van Zuylen | Oren Etzioni
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in

pdf bib
Ontology alignment in the biomedical domain using entity definitions and context
Lucy Wang | Chandra Bhagavatula | Mark Neumann | Kyle Lo | Chris Wilhelm | Waleed Ammar
Proceedings of the BioNLP 2018 workshop

Ontology alignment is the task of identifying semantically equivalent entities from two given ontologies. Different ontologies have different representations of the same entity, resulting in a need to de-duplicate entities when merging ontologies. We propose a method for enriching entities in an ontology with external definition and context information, and use this additional information for ontology alignment. We develop a neural architecture capable of encoding the additional information when available, and show that the addition of external data results in an F1-score of 0.69 on the Ontology Alignment Evaluation Initiative (OAEI) largebio SNOMED-NCI subtask, comparable with the entity-level matchers in a SOTA system.