Conference Proceedings

The case of the duplicate documents: measurement, search, and science

J Zobel, Y Bernstein, XF Zhou (ed.), J Li (ed.), HT Shen (ed.), M Kitsuregawa (ed.), Y Zhang (ed.)

FRONTIERS OF WWW RESEARCH AND DEVELOPMENT - APWEB 2006, PROCEEDINGS | SPRINGER-VERLAG BERLIN | Published : 2006

Abstract

Many of the documents in large text collections are duplicates and versions of each other. In recent research, we developed new methods for finding such duplicates; however, as there was no directly comparable prior work, we had no measure of whether we had succeeded. Worse, the concept of "duplicate" not only proved difficult to define, but on reflection was not logically defensible. Our investigation highlighted a paradox of computer science research: objective measurement of outcomes involves a subjective choice of preferred measure; and attempts to define measures can easily founder in circular reasoning. Also, some measures are abstractions that simplify complex real-world phenomena, so..

