On Hapax Legomena and Morphological Productivity
Janet Pierrehumbert, Ramon Granell
Abstract
Quantifying and predicting morphological productivity is a long-standing challenge in corpus linguistics and psycholinguistics. The same challenge reappears in natural language processing in the context of handling words that were not seen in the training set (out-of-vocabulary, or OOV, words). Prior research showed that a good indicator of the productivity of a morpheme is the number of words involving it that occur exactly once (the hapax legomena). A technical connection was adduced between this result and Good-Turing smoothing, which assigns probability mass to unseen events on the basis of the simplifying assumption that word frequencies are stationary. In a large-scale study of 133 affixes in Wikipedia, we develop evidence that success in fact depends on tapping the frequency range in which the assumptions of Good-Turing are violated.
- Anthology ID: W18-5814
- Volume: Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
- Month: October
- Year: 2018
- Address: Brussels, Belgium
- Venues: EMNLP | WS
- SIG: SIGMORPHON
- Publisher: Association for Computational Linguistics
- Pages: 125–130
- URL: https://www.aclweb.org/anthology/W18-5814
- DOI: 10.18653/v1/W18-5814
- PDF: http://aclanthology.lst.uni-saarland.de/W18-5814.pdf
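
As a rough illustration of the quantities the abstract refers to, the sketch below counts the hapax legomena among the tokens formed with one affix and computes the ratio of hapaxes to tokens. That ratio is the standard Good-Turing estimate of the total probability mass of unseen types, and it is also commonly used as a hapax-based productivity index. This is not the paper's own pipeline: the function name `hapax_stats` and the toy `-ness` sample are hypothetical, chosen only for illustration.

```python
from collections import Counter


def hapax_stats(tokens):
    """Return (n_tokens, n_types, n_hapax, p_unseen) for a list of word tokens.

    p_unseen = n_hapax / n_tokens is the Good-Turing estimate of the
    probability that the next token is a type not yet observed.
    """
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    # Hapax legomena: types that occur exactly once in the sample.
    n_hapax = sum(1 for c in counts.values() if c == 1)
    p_unseen = n_hapax / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, n_hapax, p_unseen


# Hypothetical toy sample of tokens ending in "-ness" (not data from the paper).
ness_tokens = [
    "happiness", "happiness", "happiness",
    "sadness", "sadness",
    "awareness", "blueness", "crispness",
]

n_tokens, n_types, n_hapax, p_unseen = hapax_stats(ness_tokens)
print(f"tokens={n_tokens} types={n_types} hapaxes={n_hapax} p_unseen={p_unseen:.3f}")
```

In this toy sample, three of the eight tokens are hapaxes, so the estimate is 3/8. The estimate relies on the stationarity assumption that the abstract describes as simplifying, so it should be read as a baseline rather than as the result the paper argues for.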