Min Song


Evaluating Research Novelty Detection: Counterfactual Approaches
Reinald Kim Amplayo | Seung-won Hwang | Min Song
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

In this paper, we explore strategies to evaluate models for the task of research paper novelty detection: Given all papers released at a given date, which of the papers discuss new ideas and influence future research? We find that novelty is not a singular concept and thus inherently lacks ground-truth annotations with cross-annotator agreement, which is a major obstacle in evaluating these models. The test-of-time award is the closest to such an annotation, but it can only be made retrospectively and is extremely scarce. We thus propose to compare and evaluate models using counterfactual simulations. First, we ask models if they can differentiate papers at time t from a counterfactual paper from a future time t+d. Second, we ask models if they can predict the test-of-time award at t+d. These are proxies that human annotators can agree on and that can easily be augmented by correlated signals, with which evaluation can be done through four tasks: classification, ranking, correlation, and feature selection. We show that these proxy evaluation methods complement each other regarding error handling, coverage, interpretability, and scope, and thus together contribute to the observation of the relative strengths of existing models.
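As a rough illustration of the first counterfactual proxy, the sketch below scores how often a model ranks a paper from time t+d above a paper from time t. It is only a minimal sketch under assumptions: the function name `pairwise_accuracy`, the pairwise metric itself, and the example scores are hypothetical and are not taken from the paper.

```python
def pairwise_accuracy(current_scores, future_scores):
    """Fraction of (future, current) pairs where the future paper scores higher.

    Ties count as half a win. This is an assumed proxy metric, not the
    paper's own evaluation procedure.
    """
    total = len(current_scores) * len(future_scores)
    if total == 0:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for f in future_scores:
        for c in current_scores:
            if f > c:
                wins += 1.0
            elif f == c:
                wins += 0.5
    return wins / total


# Hypothetical novelty scores produced by some model (illustrative only).
current = [0.2, 0.4, 0.1]   # papers released at time t
future = [0.9, 0.3]         # counterfactual papers from time t+d
acc = pairwise_accuracy(current, future)
print(round(acc, 3))
```

A value near 1.0 would suggest the model separates future papers from current ones, while a value near 0.5 would suggest it cannot tell them apart.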


Exploring the Leading Authors and Journals in Major Topics by Citation Sentences and Topic Modeling
Ha Jin Kim | Juyoung An | Yoo Kyung Jeong | Min Song
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

Refactoring the Genia Event Extraction Shared Task Toward a General Framework for IE-Driven KB Development
Jin-Dong Kim | Yue Wang | Nicola Colic | Seung Han Beak | Yong Hwan Kim | Min Song
Proceedings of the 4th BioNLP Shared Task Workshop

Analyzing Impact, Trend, and Diffusion of Knowledge associated with Neoplasms Research
Min Song
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)

Cancer (a.k.a. neoplasms in a broader sense) is one of the leading causes of death worldwide, and its incidence is expected to rise. To respond to this critical societal need, the cancer research community has made rigorous attempts to develop treatments for cancer. Accordingly, we observe a surge in the sheer volume of research products and outcomes in relation to neoplasms. In this talk, we introduce the notion of entitymetrics to provide a new lens for understanding the impact, trend, and diffusion of knowledge associated with neoplasms research. To this end, we collected over two million records from PubMed, the most popular search engine in the medical domain. Coupled with text mining techniques including named entity recognition, sentence boundary detection, and approximate string matching, entitymetrics enables us to analyze knowledge diffusion, impact, and trend at various knowledge entity units, such as bio-entity, organization, and country. At the end of the talk, future applications and possible directions of entitymetrics will be discussed.

Building Content-driven Entity Networks for Scarce Scientific Literature using Content Information
Reinald Kim Amplayo | Min Song
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

This paper proposes several network construction methods for collections of scarce scientific literature data. We define scarcity as lacking in value and in volume. Instead of using the paper’s metadata to construct several kinds of scientific networks, we use the full texts of the articles and automatically extract the entities needed to construct the networks. Specifically, we present seven kinds of networks using the proposed construction methods: co-occurrence networks for author, keyword, and biological entities, and citation networks for author, keyword, biological, and topic entities. We show two case studies that apply our proposed methods: CADASIL, a rare disease that is nevertheless the most common form of hereditary stroke disorder, and Metformin, the first-line medication for type 2 diabetes treatment. We apply our proposed method to four different applications for evaluation: finding prolific authors, finding important bio-entities, finding meaningful keywords, and discovering influential topics. The results show that the co-occurrence and citation networks constructed using the proposed method outperform the traditional networks. We also compare our proposed networks to traditional citation networks constructed using sufficient data and infer that, even with the same sufficient amount of data, our methods perform comparably to or better than the traditional methods.
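To make the idea of an entity co-occurrence network concrete, the following minimal sketch counts how often two entities appear in the same article. It assumes entity extraction has already been done; the function name `cooccurrence_edges` and the example entity lists are hypothetical and do not reproduce the paper's construction methods or data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs):
    """Count how often each unordered entity pair appears in the same document."""
    edges = Counter()
    for entities in docs:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical entity lists, one per article (illustrative only).
docs = [
    ["NOTCH3", "CADASIL", "stroke"],
    ["CADASIL", "NOTCH3"],
    ["metformin", "type 2 diabetes"],
]
edges = cooccurrence_edges(docs)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

The resulting weighted edge list can be loaded into any graph library to rank entities, for example by weighted degree.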