Lei Cui


2020

TableBank: Table Benchmark for Image-based Table Detection and Recognition
Minghao Li | Lei Cui | Shaohan Huang | Furu Wei | Ming Zhou | Zhoujun Li
Proceedings of the 12th Language Resources and Evaluation Conference

We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research on image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which makes it difficult to generalize to real-world applications. With TableBank, which contains 417K high-quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks. We make TableBank publicly available and hope it will empower more deep learning approaches to the table detection and recognition task. The dataset and models can be downloaded from https://github.com/doc-analysis/TableBank.
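
As a rough illustration of the kind of detection baseline such a dataset supports, the sketch below fine-tunes torchvision's off-the-shelf Faster R-CNN on a single synthetic page with one "table" box. The data loading, image size, and box coordinates are placeholders (TableBank's actual annotation format is not reproduced here), and this is not the authors' baseline implementation.

```python
# Minimal table-detection fine-tuning sketch (requires torchvision >= 0.13).
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2  # background + "table"
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# One toy training step on a single page image with one annotated table box.
# In practice, images and boxes would come from the dataset annotations;
# the loading code is omitted and the values below are placeholders.
image = torch.rand(3, 800, 600)
target = {"boxes": torch.tensor([[100.0, 150.0, 500.0, 400.0]]),
          "labels": torch.tensor([1])}

model.train()
losses = model([image], [target])   # dict of detection losses
total = sum(losses.values())
total.backward()                    # an optimizer step would follow here
```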

DocBank: A Benchmark Dataset for Document Layout Analysis
Minghao Li | Yiheng Xu | Lei Cui | Shaohan Huang | Furu Wei | Zhoujun Li | Ming Zhou
Proceedings of the 28th International Conference on Computational Linguistics

Document layout analysis usually relies on computer vision models to understand documents while ignoring textual information that is vital to capture. Meanwhile, high-quality labeled datasets with both visual and textual information are still insufficient. In this paper, we present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis. DocBank is constructed in a simple yet effective way, using weak supervision from LaTeX documents available on arXiv.org. With DocBank, models from different modalities can be compared fairly, and multi-modal approaches can be further investigated to boost the performance of document layout analysis. We build several strong baselines and manually split train/dev/test sets for evaluation. Experimental results show that models trained on DocBank accurately recognize the layout information for a variety of documents. The DocBank dataset is publicly available at https://github.com/doc-analysis/DocBank.
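
To make the token-level framing concrete, here is a minimal sketch that casts layout analysis as token classification with a pre-trained language model. The label set, the toy page, and the label alignment are illustrative assumptions, not DocBank's actual annotation format or the paper's baselines.

```python
# Layout analysis as token-level sequence labeling (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["abstract", "author", "caption", "equation", "figure", "footer",
          "list", "paragraph", "reference", "section", "table", "title"]
label2id = {l: i for i, l in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

# A toy "page": a few tokens with hypothetical layout labels.
tokens = ["Deep", "Learning", "for", "Documents", "Minghao", "Li"]
labels = ["title", "title", "title", "title", "author", "author"]

enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
# Align word-level labels to sub-word positions; special tokens get -100.
word_ids = enc.word_ids(0)
aligned = [label2id[labels[w]] if w is not None else -100 for w in word_ids]
out = model(**enc, labels=torch.tensor([aligned]))
out.loss.backward()  # one gradient step would follow in a real training loop
```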

Unsupervised Fine-tuning for Text Clustering
Shaohan Huang | Furu Wei | Lei Cui | Xingxing Zhang | Ming Zhou
Proceedings of the 28th International Conference on Computational Linguistics

Fine-tuning with pre-trained language models (e.g., BERT) has achieved great success in many language understanding tasks in supervised settings (e.g., text classification). However, relatively little work has focused on applying pre-trained models in unsupervised settings, such as text clustering. In this paper, we propose a novel method to fine-tune pre-trained models for text clustering without supervision, simultaneously learning text representations and cluster assignments using a clustering-oriented loss. Experiments on three text clustering datasets (namely TREC-6, Yelp, and DBpedia) show that our model outperforms the baseline methods and achieves state-of-the-art results.
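
For intuition about what a clustering-oriented loss can look like, the sketch below implements the soft-assignment/KL target-distribution objective popularized by DEC (Xie et al., 2016) on random "sentence embeddings". This is a generic example of the idea, not necessarily the exact objective used in this paper.

```python
# DEC-style clustering-oriented loss on top of text embeddings (sketch).
import torch
import torch.nn.functional as F

def soft_assignments(embeddings, centroids, alpha=1.0):
    """Student's t-distribution similarity between points and centroids."""
    dist_sq = torch.cdist(embeddings, centroids).pow(2)
    q = (1.0 + dist_sq / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened targets that emphasize high-confidence assignments."""
    p = q.pow(2) / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

# Toy data: 8 "sentence embeddings" and 3 cluster centroids. In practice the
# embeddings would come from a pre-trained encoder being fine-tuned.
emb = torch.randn(8, 768, requires_grad=True)
centroids = torch.randn(3, 768, requires_grad=True)

q = soft_assignments(emb, centroids)
p = target_distribution(q).detach()           # targets are held fixed
loss = F.kl_div(q.log(), p, reduction="batchmean")
loss.backward()                               # gradients flow to encoder and centroids
```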

2019

Inspecting Unification of Encoding and Matching with Transformer: A Case Study of Machine Reading Comprehension
Hangbo Bao | Li Dong | Furu Wei | Wenhui Wang | Nan Yang | Lei Cui | Songhao Piao | Ming Zhou
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

Most machine reading comprehension (MRC) models handle encoding and matching separately, with different network architectures. In contrast, pre-trained language models with Transformer layers, such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2018), have achieved competitive performance on MRC. A research question naturally arises: apart from the benefits of pre-training, how much of the performance gain comes from the unified network architecture? In this work, we evaluate and analyze unifying the encoding and matching components with a Transformer for the MRC task. Experimental results on SQuAD show that the unified model outperforms previous networks that treat encoding and matching separately. We also introduce a metric to inspect whether a Transformer layer tends to perform encoding or matching. The analysis shows that the unified model learns different modeling strategies compared with previous manually designed models.
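
The "unified" formulation is easy to state in code: the question and passage are concatenated and processed by a single Transformer stack whose layers can both encode and match. The sketch below uses an off-the-shelf BERT question-answering head purely for illustration; it is not the authors' model, and the example question is made up.

```python
# Unified encoding-and-matching for span extraction (illustrative sketch).
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Where was the 57th ACL meeting held?"
passage = "The 57th Annual Meeting of the ACL was held in Florence, Italy."
inputs = tokenizer(question, passage, return_tensors="pt")  # one joint sequence

with torch.no_grad():
    out = model(**inputs)
start = out.start_logits.argmax(dim=-1).item()
end = out.end_logits.argmax(dim=-1).item()
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(answer)  # the QA head is untrained here, so the span is arbitrary until fine-tuned
```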

Retrieval-Enhanced Adversarial Training for Neural Response Generation
Qingfu Zhu | Lei Cui | Wei-Nan Zhang | Furu Wei | Ting Liu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Dialogue systems are usually built on either generation-based or retrieval-based approaches, yet neither benefits from the advantages of the other. In this paper, we propose a Retrieval-Enhanced Adversarial Training (REAT) method for neural response generation. Distinct from existing approaches, the REAT method leverages an encoder-decoder framework within an adversarial training paradigm, while taking advantage of N-best response candidates from a retrieval-based system to construct the discriminator. An empirical study on a large-scale publicly available benchmark dataset shows that the REAT method significantly outperforms the vanilla Seq2Seq model as well as the conventional adversarial training approach.
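
A toy sketch of the retrieval-enhanced idea is shown below: a discriminator scores a response given both the dialogue context and N-best candidates from a retrieval system. The architecture, pooling, and dimensions are invented for illustration and do not reproduce the REAT discriminator.

```python
# Retrieval-enhanced discriminator for response scoring (toy sketch).
import torch
import torch.nn as nn

class RetrievalEnhancedDiscriminator(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # context, response, and pooled retrieved candidates -> real/fake score
        self.scorer = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 1))

    def pool(self, ids):                      # ids: (batch, seq_len)
        return self.embed(ids).mean(dim=1)    # mean-pooled utterance vector

    def forward(self, context, response, candidates):
        # candidates: (batch, n_best, seq_len) -> average over the N-best list
        cand = self.embed(candidates).mean(dim=2).mean(dim=1)
        feats = torch.cat([self.pool(context), self.pool(response), cand], dim=-1)
        return self.scorer(feats).squeeze(-1)  # higher = more likely a human response

disc = RetrievalEnhancedDiscriminator(vocab_size=10000)
ctx = torch.randint(0, 10000, (4, 20))
resp = torch.randint(0, 10000, (4, 15))
cands = torch.randint(0, 10000, (4, 5, 15))   # 5 retrieved candidates each
logits = disc(ctx, resp, cands)
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.ones(4))
loss.backward()  # paired with generator updates in an adversarial training loop
```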

2018

EventWiki: A Knowledge Base of Major Events
Tao Ge | Lei Cui | Baobao Chang | Zhifang Sui | Furu Wei | Ming Zhou
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Fine-grained Coordinated Cross-lingual Text Stream Alignment for Endless Language Knowledge Acquisition
Tao Ge | Qing Dou | Heng Ji | Lei Cui | Baobao Chang | Zhifang Sui | Furu Wei | Ming Zhou
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper proposes to study fine-grained coordinated cross-lingual text stream alignment through a novel information network decipherment paradigm. We use Burst Information Networks as media to represent text streams and present a simple yet effective network decipherment algorithm with diverse clues to decipher the networks for accurate text stream alignment. Experiments on Chinese-English news streams show that our approach not only outperforms previous approaches to bilingual lexicon extraction from coordinated text streams but also can harvest high-quality alignments from large amounts of streaming data for endless language knowledge mining, which makes it a promising new paradigm for automatic language knowledge acquisition.

Neural Open Information Extraction
Lei Cui | Furu Wei | Ming Zhou
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Conventional Open Information Extraction (Open IE) systems are usually built on hand-crafted patterns from other NLP tools such as syntactic parsing, yet they face problems of error propagation. In this paper, we propose a neural Open IE approach with an encoder-decoder framework. Distinct from existing methods, the neural Open IE approach learns highly confident arguments and relation tuples bootstrapped from a state-of-the-art Open IE system. An empirical study on a large benchmark dataset shows that the neural Open IE system significantly outperforms several baselines, while maintaining comparable computational efficiency.
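
The encoder-decoder formulation can be illustrated as follows: the input is a sentence and the target is a linearized tuple, with training pairs bootstrapped from high-confidence extractions of an existing Open IE system. The model (T5) and the tuple markup below are illustrative stand-ins, not the paper's architecture or data format.

```python
# Open IE as sequence-to-sequence generation (illustrative sketch).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# A "silver" training pair, as if bootstrapped from a high-confidence
# extraction produced by an existing Open IE system (markup is hypothetical).
sentence = "extract tuples: Marie Curie won the Nobel Prize in 1903."
target = "<arg1> Marie Curie </arg1> <rel> won </rel> <arg2> the Nobel Prize </arg2>"

batch = tokenizer(sentence, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss
loss.backward()  # one step of teacher-forced training on the silver tuple
```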

2017

SuperAgent: A Customer Service Chatbot for E-commerce Websites
Lei Cui | Shaohan Huang | Furu Wei | Chuanqi Tan | Chaoqun Duan | Ming Zhou
Proceedings of ACL 2017, System Demonstrations

2016

News Stream Summarization using Burst Information Networks
Tao Ge | Lei Cui | Baobao Chang | Sujian Li | Ming Zhou | Zhifang Sui
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Event Detection with Burst Information Networks
Tao Ge | Lei Cui | Baobao Chang | Zhifang Sui | Ming Zhou
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Retrospective event detection is an important task for discovering previously unidentified events in a text stream. In this paper, we propose two fast centroid-aware event detection models based on a novel text stream representation, Burst Information Networks (BINets), to address this challenge. BINets are time-aware, efficient, and easy to analyze for identifying key information (centroids). These advantages allow the BINet-based approaches to achieve state-of-the-art performance on multiple datasets, demonstrating the efficacy of BINets for the task of event detection.
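
To give a flavor of the representation, the sketch below builds a simple burst-word co-occurrence graph over a dated text stream. The burst criterion (a frequency spike over the mean daily count) and the graph construction are simplified stand-ins and only loosely mirror BINets.

```python
# Simplified burst-word co-occurrence graph over a dated text stream.
from collections import Counter, defaultdict
from itertools import combinations

def detect_bursty_words(docs_by_day, spike=1.5):
    """Words whose daily count exceeds `spike` times their mean daily count."""
    daily = {day: Counter(w for doc in docs for w in doc.split())
             for day, docs in docs_by_day.items()}
    totals = Counter()
    for counts in daily.values():
        totals.update(counts)
    mean = {w: totals[w] / len(daily) for w in totals}
    return {day: {w for w, c in counts.items() if c > spike * mean[w]}
            for day, counts in daily.items()}

def build_burst_graph(docs_by_day, bursty):
    """Edges connect bursty words that co-occur in a document on a burst day."""
    edges = defaultdict(int)
    for day, docs in docs_by_day.items():
        for doc in docs:
            words = sorted(set(doc.split()) & bursty[day])
            for u, v in combinations(words, 2):
                edges[(day, u, v)] += 1
    return edges

# Tiny usage example on a two-day toy stream.
docs_by_day = {"2016-07-01": ["earthquake hits city", "city earthquake damage"],
               "2016-07-02": ["election results announced"]}
graph = build_burst_graph(docs_by_day, detect_bursty_words(docs_by_day))
```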

2014

Learning Topic Representation for SMT with Neural Networks
Lei Cui | Dongdong Zhang | Shujie Liu | Qiming Chen | Mu Li | Ming Zhou | Muyun Yang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

Multi-Domain Adaptation for SMT Using Multi-Task Learning
Lei Cui | Xilun Chen | Dongdong Zhang | Shujie Liu | Mu Li | Ming Zhou
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Bilingual Data Cleaning for SMT using Graph-based Random Walk
Lei Cui | Dongdong Zhang | Shujie Liu | Mu Li | Ming Zhou
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2010

Hybrid Decoding: Decoding with Partial Hypotheses Combination over Multiple SMT Systems
Lei Cui | Dongdong Zhang | Mu Li | Ming Zhou | Tiejun Zhao
Coling 2010: Posters

A Joint Rule Selection Model for Hierarchical Phrase-Based Translation
Lei Cui | Dongdong Zhang | Mu Li | Ming Zhou | Tiejun Zhao
Proceedings of the ACL 2010 Conference Short Papers