Journal article

Relative lempelziv factorization for efficient storage and retrieval of web collections

C Hoobin, SJ Puglisi, J Zobel

Proceedings of the VLDB Endowment | Published : 2011

Abstract

Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as gzip. Random access to a specfic document then requires decompression of a block. The choice of block size is critical: it trades between compression effectiveness and document retrieval times. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in a LZ77-like encoding of the rest of the collection, relat..

View full abstract

University of Melbourne Researchers

Grants

Funding Acknowledgements

This work was supported by the Australian Research Council and the NICTA Victorian Research Laboratory. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications, and the Digital Economy, and the Australian Research Council through the ICT Centre of Excellence program. Simon J. Puglisi is supported by a Newton Fellowship.