Automatic Product Categorization for Official Statistics

Andrea Roberson


Abstract
The North American Product Classification System (NAPCS) is a comprehensive, hierarchical classification system for products (goods and services) that is consistent across the three North American countries. Beginning in 2017, the Economic Census will use NAPCS to produce economy-wide product tabulations. Respondents are asked to report data from a long, pre-specified list of potential products in a given industry, with some lists containing more than 50 potential products. Businesses have expressed the desire to supply Universal Product Codes (UPCs) to the U.S. Census Bureau instead. Much work has been done on categorizing products from their descriptions, but no study has applied these efforts to the calculation of official statistics (statistics published by government agencies) using only the text of UPC product descriptions. The question we address in this paper is: given UPCs and their associated product descriptions, can we accurately predict NAPCS codes? We tested the feasibility of businesses submitting a spreadsheet with Universal Product Codes and their associated text descriptions. This novel strategy classified text with very high accuracy: all of our algorithms exceeded 90 percent.
Anthology ID:
W19-3623
Volume:
Proceedings of the 2019 Workshop on Widening NLP
Month:
August
Year:
2019
Address:
Florence, Italy
Venues:
ACL | WS | WiNLP
Publisher:
Association for Computational Linguistics
Pages:
68–72
URL:
https://www.aclweb.org/anthology/W19-3623