We present a system for identifying cognate sets across dictionaries of related languages. The likelihood of a cognate relationship is calculated on the basis of a rich set of features that capture both phonetic and semantic similarity, as well as the presence of regular sound correspondences. The similarity scores are used to cluster words from different languages that may originate from a common proto-word. When tested on the Algonquian language family, our system detects 63% of cognate sets while maintaining a cluster purity of 70%.
Creating a Comparative Dictionary of Totonac-Tepehua
Grzegorz Kondrak | David Beck | Philip Dilts
Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology
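
The abstract reports a cluster purity figure without defining it. As a minimal sketch, assuming the commonly used definition of purity (the share of words that fall in the majority gold cognate set of their cluster), the measure could be computed as follows. The paper may define purity differently; the function name `cluster_purity` and the word identifiers are hypothetical placeholders, not data from the paper.

```python
from collections import Counter


def cluster_purity(clusters, gold):
    """Return the fraction of words that belong to the majority
    gold cognate set of the cluster they were assigned to.

    clusters: list of lists of word identifiers (system output).
    gold: dict mapping each word identifier to its gold cognate-set label.
    """
    majority = 0  # words that agree with their cluster's dominant gold set
    total = 0     # all clustered words
    for cluster in clusters:
        if not cluster:
            continue  # skip empty clusters
        counts = Counter(gold[word] for word in cluster)
        majority += counts.most_common(1)[0][1]
        total += len(cluster)
    return majority / total if total else 0.0


# Hypothetical toy input: identifiers stand in for dictionary entries
# from three languages (a, b, c); the digits mark gold cognate sets.
clusters = [["a1", "b1", "c2"], ["a2", "b2"], ["c1"]]
gold = {"a1": 1, "b1": 1, "c1": 1, "a2": 2, "b2": 2, "c2": 2}

purity = cluster_purity(clusters, gold)
print(f"purity = {purity:.3f}")
```

In this toy input five of the six words sit in a cluster dominated by their own gold set, so the purity is 5/6. Purity alone rewards many small clusters, which is presumably why the abstract pairs it with the proportion of cognate sets detected.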