Conference Proceedings

Redundant documents and search effectiveness

Y Bernstein, J Zobel

International Conference on Information and Knowledge Management, Proceedings | Published: 2005


The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques - particularly document fingerprinting - for detecting content equivalence. Using these techniques on the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics were accurately identifying content-equivalence. We show, mo..

