Identifying Linguistic Areas for Geolocation

Tommaso Fornaciari, Dirk Hovy


Abstract
Geolocating social media posts relies on the assumption that language carries sufficient geographic information. However, locations are usually given as continuous latitude/longitude tuples, so we first need to define discrete geographic regions that can serve as labels. Most studies use some form of clustering to discretize the continuous coordinates (Han et al., 2016). However, the resulting regions do not always correspond to existing linguistic areas. Consequently, accuracy at 100 miles tends to be good, but degrades for finer-grained distinctions, when different linguistic regions get lumped together. We describe a new algorithm, Point-to-City (P2C), an iterative k-d tree-based method for clustering geographic coordinates and associating them with towns. We create three sets of labels at different levels of granularity, and compare performance of a state-of-the-art geolocation model trained and tested with P2C labels to one with regular k-d tree labels. Even though P2C results in substantially more labels than the baseline, model accuracy increases significantly over using traditional labels at the fine-grained level, while staying comparable at 100 miles. The results suggest that identifying meaningful linguistic areas is crucial for improving geolocation at a fine-grained level.
Anthology ID:
D19-5530
Volume:
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
Month:
November
Year:
2019
Address:
Hong Kong, China
Venues:
EMNLP | WNUT | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
231–236
Language:
URL:
https://www.aclweb.org/anthology/D19-5530
DOI:
10.18653/v1/D19-5530
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/D19-5530.pdf