Conference Proceedings

Evaluating hypotheses in geolocation on a very large sample of Twitter

B Salehi, A Søgaard

Proceedings of the 3rd Workshop on Noisy User-generated Text | Association for Computational Linguistics | Published : 2017


Recent work in geolocation has proposed several hypotheses about which linguistic markers are relevant for detecting where people write from. In this paper, we examine six hypotheses against a corpus consisting of all tweets from the US that are geo-tagged, or whose geo-tags could be inferred, in a 19% sample of Twitter history. Our experiments lend support to all six hypotheses, including that spelling variants and hashtags are strong predictors of location. We also study what kinds of common nouns are predictive of location after controlling for named entities such as dolphins or sharks.