Web-Scale Language-Independent Cataloging of Noisy Product Listings for E-Commerce

Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, Ankur Datta


Abstract
The cataloging of product listings through taxonomy categorization is a fundamental problem for any e-commerce marketplace, with applications ranging from personalized search recommendations to query understanding. However, manual and rule based approaches to categorization are not scalable. In this paper, we compare several classifiers for categorizing listings in both English and Japanese product catalogs. We show empirically that a combination of words from product titles, navigational breadcrumbs, and list prices, when available, improves results significantly. We outline a novel method using correspondence topic models and a lightweight manual process to reduce noise from mis-labeled data in the training set. We contrast linear models, gradient boosted trees (GBTs) and convolutional neural networks (CNNs), and show that GBTs and CNNs yield the highest gains in error reduction. Finally, we show GBTs applied in a language-agnostic way on a large-scale Japanese e-commerce dataset have improved taxonomy categorization performance over current state-of-the-art based on deep belief network models.
Anthology ID:
E17-1091
Volume:
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
Month:
April
Year:
2017
Address:
Valencia, Spain
Venue:
EACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
969–979
Language:
URL:
https://www.aclweb.org/anthology/E17-1091
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/E17-1091.pdf