Accurate discovery of co-derivative documents via duplicate text detection

Y Bernstein; J Zobel

Journal article

Information Systems | PERGAMON-ELSEVIER SCIENCE LTD | Published : 2006

DOI: 10.1016/j.is.2005.11.006

Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks ..
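As a rough illustration of the document fingerprinting idea the abstract describes (matching documents on the hash values of their chunks), the following is a minimal sketch in Python. It is not the spex algorithm and not the authors' implementation. The function names, the use of overlapping three-word chunks, the choice of MD5, and the decision to keep every chunk rather than a selected subset are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every overlapping run of `size` words (an assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def fingerprint(text, size=3):
    """Return the set of hash values of all chunks in `text`.

    Assumption: every chunk is kept; practical fingerprinting schemes
    usually retain only a selected subset of chunks.
    """
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks(text, size)}


def shared_chunk_counts(docs, size=3):
    """Map each pair of document ids to the number of chunk hashes they share."""
    index = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for h in fingerprint(text, size):
            index[h].add(doc_id)

    counts = defaultdict(int)
    for ids in index.values():
        # Every pair of documents under the same hash shares that chunk.
        for pair in combinations(sorted(ids), 2):
            counts[pair] += 1
    return dict(counts)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a quick brown fox jumps high",
    "c": "completely unrelated text about something else",
}
result = shared_chunk_counts(docs)
print(result)
```

Here documents "a" and "b" share the chunks "quick brown fox" and "brown fox jumps", so the pair is reported with a count of 2, while "c" shares nothing and does not appear. A pair with a high count relative to document length would be a candidate co-derivative under this simplified scheme.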
