CSAM: Compressed SAM format

Rodrigo Canovas, Alistair Moffat, Andrew Turpin



MOTIVATION: Next-generation sequencing machines produce vast amounts of genomic data. For the data to be useful, it must be stored and manipulated efficiently. This work responds to the combined challenge of compressing genomic data while providing fast access to regions of interest, without requiring decompression of whole files.

RESULTS: We describe CSAM (Compressed SAM format), a compression approach offering lossless and lossy compression for SAM files. The structures and techniques proposed are suitable for representing SAM files and support fast access to the compressed information. They generate more compact lossless representations than BAM, which ..

Funding Acknowledgements

This work was supported by the NICTA Victorian Research Laboratory, and funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Center of Excellence program. We thank Vadim Zalunin for help with CramTools usage, and Wei Shi and Jan Schroder for sharing their knowledge of the area.