Conference Proceedings
Evaluation of CD-HIT for constructing non-redundant databases
Q Chen, Y Wan, Y Lei, J Zobel, C Verspoor
Proceedings. IEEE International Conference on Bioinformatics and Biomedicine | IEEE | Published: 2016
Abstract
CD-HIT is one of the most popular tools for reducing sequence redundancy, and is considered to be the state-of-the-art method. It tries to minimise redundancy by reducing an input database to a set of representative sequences, under a user-defined threshold of sequence identity. We present a comprehensive assessment of the redundancy in the outputs of CD-HIT, exploring the impact of different identity thresholds and new evaluation data on the redundancy. We demonstrate that the relationship between threshold and redundancy is surprisingly weak. Applications of CD-HIT that set low identity thresholds may also suffer from substantial degradation in both efficiency and accuracy.
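As a rough illustration of what "reducing a database to representative sequences under an identity threshold" means, the following is a minimal Python sketch of greedy, longest-first clustering. It is not CD-HIT's implementation: the function names `toy_identity` and `greedy_representatives` are hypothetical, and the identity measure is a deliberately simple ungapped position-by-position comparison standing in for the alignment-based identity a real tool would compute.

```python
from typing import List


def toy_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence.

    Assumption: an ungapped, left-anchored comparison used only for
    illustration; it is not the identity measure CD-HIT computes.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def greedy_representatives(seqs: List[str], threshold: float) -> List[str]:
    """Keep a sequence only if it is below `threshold` identity to
    every representative chosen so far (longest sequences first)."""
    representatives: List[str] = []
    for seq in sorted(seqs, key=len, reverse=True):
        if not any(toy_identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


if __name__ == "__main__":
    database = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYI", "GGGGGGGGGG"]
    for t in (0.85, 1.0):
        print(t, greedy_representatives(database, t))
```

In this sketch, lowering the threshold merges more sequences into each cluster, so fewer representatives remain. The abstract's finding is that, in CD-HIT's actual outputs, the redundancy left among those representatives tracks the threshold only weakly.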
Grants
Awarded by Australian Research Council
Funding Acknowledgements
The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.