Conference Proceedings

Principled Dictionary Pruning for Low-Memory Corpus Compression

J Tong, AI Wirth, J Zobel

Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval | ACM Press | Published: 2014


Compression of collections, such as text databases, can both reduce space consumption and increase retrieval efficiency, through better caching and better exploitation of the memory hierarchy. A promising technique is relative Lempel-Ziv coding, in which a sample of material from the collection serves as a static dictionary; in previous work, this method demonstrated extremely fast decoding and good compression ratios, while allowing random access to individual items. However, there is a trade-off between dictionary size and compression ratio, motivating the search for a compact, yet similarly effective, dictionary. In previous work it was observed that, since the dictionary is generated by ..


Awarded by NSF of China

Awarded by Program for New Century Excellent Talents in University

Awarded by Fundamental Research Funds for the Central Universities

Funding Acknowledgements

We thank Christopher Hoobin for providing the source code of RLZ. This work is partially supported by the Australian Research Council, the NSF of China (61373018, 11301288), the Program for New Century Excellent Talents in University (NCET-13-0301), and the Fundamental Research Funds for the Central Universities (65141021). Jiancong would also like to thank the China Scholarship Council (CSC) for the State Scholarship Fund.