On Hapax Legomena and Morphological Productivity

Janet Pierrehumbert, Ramon Granell


Abstract
Quantifying and predicting morphological productivity is a long-standing challenge in corpus linguistics and psycholinguistics. The same challenge reappears in natural language processing in the context of handling words that were not seen in the training set (out-of-vocabulary, or OOV, words). Prior research showed that a good indicator of the productivity of a morpheme is the number of words involving it that occur exactly once (the hapax legomena). A technical connection was adduced between this result and Good-Turing smoothing, which assigns probability mass to unseen events on the basis of the simplifying assumption that word frequencies are stationary. In a large-scale study of 133 affixes in Wikipedia, we develop evidence that success in fact depends on tapping the frequency range in which the assumptions of Good-Turing are violated.
Anthology ID:
W18-5814
Volume:
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
October
Year:
2018
Address:
Brussels, Belgium
Venues:
EMNLP | WS
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Note:
Pages:
125–130
Language:
URL:
https://www.aclweb.org/anthology/W18-5814
DOI:
10.18653/v1/W18-5814
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/W18-5814.pdf