Chen-Tse Tsai


2019

pdf bib
Named Entity Recognition with Partially Annotated Training Data
Stephen Mayhew | Snigdha Chaturvedi | Chen-Tse Tsai | Dan Roth
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Supervised machine learning assumes the availability of fully-labeled data, but in many cases, such as low-resource languages, the only data available is partially annotated. We study the problem of Named Entity Recognition (NER) with partially annotated training data in which a fraction of the named entities are labeled, and all other tokens, entities or otherwise, are labeled as non-entity by default. In order to train on this noisy dataset, we need to distinguish between the true and false negatives. To this end, we introduce a constraint-driven iterative algorithm that learns to detect false negatives in the noisy set and downweigh them, resulting in a weighted training set. With this set, we train a weighted NER model. We evaluate our algorithm with weighted variants of neural and non-neural NER models on data in 8 languages from several language and script families, showing strong ability to learn from partial data. Finally, to show real-world efficacy, we evaluate on a Bengali NER corpus annotated by non-speakers, outperforming the prior state-of-the-art by over 5 points F1.

pdf bib
A Semi-Markov Structured Support Vector Machine Model for High-Precision Named Entity Recognition
Ravneet Arora | Chen-Tse Tsai | Ketevan Tsereteli | Prabhanjan Kambadur | Yi Yang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Named entity recognition (NER) is the backbone of many NLP solutions. F1 score, the harmonic mean of precision and recall, is often used to select/evaluate the best models. However, when precision needs to be prioritized over recall, a state-of-the-art model might not be the best choice. There is little in literature that directly addresses training-time modifications to achieve higher precision information extraction. In this paper, we propose a neural semi-Markov structured support vector machine model that controls the precision-recall trade-off by assigning weights to different types of errors in the loss-augmented inference during training. The semi-Markov property provides more accurate phrase-level predictions, thereby improving performance. We empirically demonstrate the advantage of our model when high precision is required by comparing against strong baselines based on CRF. In our experiments with the CoNLL 2003 dataset, our model achieves a better precision-recall trade-off at various precision levels.

2018

pdf bib
CogCompNLP: Your Swiss Army Knife for NLP
Daniel Khashabi | Mark Sammons | Ben Zhou | Tom Redman | Christos Christodoulopoulos | Vivek Srikumar | Nicholas Rizzolo | Lev Ratinov | Guanheng Luo | Quang Do | Chen-Tse Tsai | Subhro Roy | Stephen Mayhew | Zhili Feng | John Wieting | Xiaodong Yu | Yangqiu Song | Shashank Gupta | Shyam Upadhyay | Naveen Arivazhagan | Qiang Ning | Shaoshi Ling | Dan Roth
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Zero-Shot Open Entity Typing as Type-Compatible Grounding
Ben Zhou | Daniel Khashabi | Chen-Tse Tsai | Dan Roth
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The problem of entity-typing has been studied predominantly as a supervised learning problems, mostly with task-specific annotations (for coarse types) and sometimes with distant supervision (for fine types). While such approaches have strong performance within datasets they often lack the flexibility to transfer across text genres and to generalize to new type taxonomies. In this work we propose a zero-shot entity typing approach that requires no annotated data and can flexibly identify newly defined types. Given a type taxonomy, the entries of which we define as Boolean functions of freebase “types,” we ground a given mention to a set of type-compatible Wikipedia entries, and then infer the target mention’s type using an inference algorithm that makes use of the types of these entries. We evaluate our system on a broad range of datasets, including standard fine-grained and coarse-grained entity typing datasets, and on a dataset in the biological domain. Our system is shown to be competitive with state-of-the-art supervised NER systems, and to outperform them on out-of-training datasets. We also show that our system significantly outperforms other zero-shot fine typing systems.

2017

pdf bib
STCP: Simplified-Traditional Chinese Conversion and Proofreading
Jiarui Xu | Xuezhe Ma | Chen-Tse Tsai | Eduard Hovy
Proceedings of the IJCNLP 2017, System Demonstrations

This paper aims to provide an effective tool for conversion between Simplified Chinese and Traditional Chinese. We present STCP, a customizable system comprising statistical conversion model, and proofreading web interface. Experiments show that our system achieves comparable character-level conversion performance with the state-of-art systems. In addition, our proofreading interface can effectively support diagnostics and data annotation. STCP is available at http://lagos.lti.cs.cmu.edu:8002/

pdf bib
Cheap Translation for Cross-Lingual Named Entity Recognition
Stephen Mayhew | Chen-Tse Tsai | Dan Roth
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Recent work in NLP has attempted to deal with low-resource languages but still assumed a resource level that is not present for most languages, e.g., the availability of Wikipedia in the target language. We propose a simple method for cross-lingual named entity recognition (NER) that works well in settings with very minimal resources. Our approach makes use of a lexicon to “translate” annotated data available in one or several high resource language(s) into the target language, and learns a standard monolingual NER model there. Further, when Wikipedia is available in the target language, our method can enhance Wikipedia based methods to yield state-of-the-art NER results; we evaluate on 7 diverse languages, improving the state-of-the-art by an average of 5.5% F1 points. With the minimal resources required, this is an extremely portable cross-lingual NER approach, as illustrated using a truly low-resource language, Uyghur.

2016

pdf bib
Cross-lingual Wikification Using Multilingual Embeddings
Chen-Tse Tsai | Dan Roth
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Cross-Lingual Named Entity Recognition via Wikification
Chen-Tse Tsai | Stephen Mayhew | Dan Roth
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

pdf bib
Illinois Cross-Lingual Wikifier: Grounding Entities in Many Languages to the English Wikipedia
Chen-Tse Tsai | Dan Roth
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We release a cross-lingual wikification system for all languages in Wikipedia. Given a piece of text in any supported language, the system identifies names of people, locations, organizations, and grounds these names to the corresponding English Wikipedia entries. The system is based on two components: a cross-lingual named entity recognition (NER) model and a cross-lingual mention grounding model. The cross-lingual NER model is a language-independent model which can extract named entity mentions in the text of any language in Wikipedia. The extracted mentions are then grounded to the English Wikipedia using the cross-lingual mention grounding model. The only resources required to train the proposed system are the multilingual Wikipedia dump and existing training data for English NER. The system is online at http://cogcomp.cs.illinois.edu/page/demo_view/xl_wikifier

pdf bib
Concept Grounding to Multiple Knowledge Bases via Indirect Supervision
Chen-Tse Tsai | Dan Roth
Transactions of the Association for Computational Linguistics, Volume 4

We consider the problem of disambiguating concept mentions appearing in documents and grounding them in multiple knowledge bases, where each knowledge base addresses some aspects of the domain. This problem poses a few additional challenges beyond those addressed in the popular Wikification problem. Key among them is that most knowledge bases do not contain the rich textual and structural information Wikipedia does; consequently, the main supervision signal used to train Wikification rankers does not exist anymore. In this work we develop an algorithmic approach that, by carefully examining the relations between various related knowledge bases, generates an indirect supervision signal it uses to train a ranking model that accurately chooses knowledge base entries for a given mention; moreover, it also induces prior knowledge that can be used to support a global coherent mapping of all the concepts in a given document to the knowledge bases. Using the biomedical domain as our application, we show that our indirectly supervised ranking model outperforms other unsupervised baselines and that the quality of this indirect supervision scheme is very close to a supervised model. We also show that considering multiple knowledge bases together has an advantage over grounding concepts to each knowledge base individually.