How Informative is a Term? Dispersion as a measure of Term Specificity

Rodney McDonell, Justin Zobel, Bodo Billerbeck

Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval | ASSOC COMPUTING MACHINERY | Published: 2016


Similarity functions assign scores to documents in response to queries. These functions take as input statistics about the terms in the queries and documents, where the intention is that the statistics estimate the relative informativeness of the terms. Common measures of informativeness use the number of documents containing each term (the document frequency) as a key measure. We argue in this paper that the distribution of within-document frequencies across a collection, which has not been considered in prior work, is also pertinent to informativeness: the most informative words tend to be those whose frequency of occurrence has high variance. We propose use of relative ..
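
To make the contrast between document frequency and the spread of within-document frequencies concrete, here is a minimal Python sketch on a toy collection. The function name `term_statistics`, the whitespace tokenisation, and the use of plain population variance are all assumptions made for illustration; the measure the authors actually propose is cut off in the abstract above and is not reproduced here.

```python
from collections import Counter
from statistics import pvariance


def term_statistics(docs):
    """Map each term to (document frequency, variance of within-document frequency).

    Illustrative only: tokens are lowercased whitespace-separated words, and
    the variance is taken over every document, including those where the
    term does not occur (frequency 0).
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = set().union(*counts)
    stats = {}
    for term in vocabulary:
        # Within-document frequency of the term in each document.
        freqs = [c[term] for c in counts]
        # Document frequency: number of documents containing the term.
        df = sum(1 for f in freqs if f > 0)
        stats[term] = (df, pvariance(freqs))
    return stats


# Hypothetical toy collection, not data from the paper.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the retrieval retrieval retrieval system",
    "the index and the query",
]
stats = term_statistics(docs)
for term in ("the", "cat", "retrieval"):
    df, var = stats[term]
    print(f"{term}: df={df}, variance={var}")
```

In this toy collection "the" appears in every document at a nearly constant rate, so its variance is low, while "retrieval" is concentrated in a single document and has a much higher variance. A collection this small can only illustrate the intuition; it says nothing about how such a statistic behaves in a real similarity function.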

