Building Large Chinese Corpus for Spoken Dialogue Research in Specific Domains

Changliang Li, Xiuying Wang


Abstract
Corpus is a valuable resource for information retrieval and data-driven natural language processing systems,especially for spoken dialogue research in specific domains. However,there is little non-English corpora, particular for ones in Chinese. Spoken by the nation with the largest population in the world, Chinese become increasingly prevalent and popular among millions of people worldwide. In this paper, we build a large-scale and high-quality Chinese corpus, called CSDC (Chinese Spoken Dialogue Corpus). It contains five domains and more than 140 thousand dialogues in all. Each sentence in this corpus is annotated with slot information additionally compared to other corpora. To our best knowledge, this is the largest Chinese spoken dialogue corpus, as well as the first one with slot information. With this corpus, we proposed a method and did a well-designed experiment. The indicative result is reported at last.
Anthology ID:
I17-2054
Volume:
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Month:
November
Year:
2017
Address:
Taipei, Taiwan
Venue:
IJCNLP
SIG:
Publisher:
Asian Federation of Natural Language Processing
Note:
Pages:
320–324
Language:
URL:
https://www.aclweb.org/anthology/I17-2054
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/I17-2054.pdf