Andrea Roberson


Automatic Product Categorization for Official Statistics
Andrea Roberson
Proceedings of the 2019 Workshop on Widening NLP

The North American Product Classification System (NAPCS) is a comprehensive, hierarchical classification system for products (goods and services) that is consistent across the three North American countries. Beginning in 2017, the Economic Census will use NAPCS to produce economy-wide product tabulations. Respondents are asked to report data from a long, pre-specified list of potential products in a given industry, with some lists containing more than 50 potential products. Businesses have expressed the desire to alternatively supply Universal Product Codes (UPC) to the U. S. Census Bureau. Much work has been done around the categorization of products using product descriptions. No study has applied these efforts for the calculation of official statistics (statistics published by government agencies) using only the text of UPC product descriptions. The question we address in this paper is: Given UPC codes and their associated product descriptions, can we accurately predict NAPCS? We tested the feasibility of businesses submitting a spreadsheet with Universal Product Codes and their associated text descriptions. This novel strategy classified text with very high accuracy rates, all of our algorithms surpassed over 90 percent.