Evaluation of CD-HIT for constructing non-redundant databases
Q Chen, Y Wan, Y Lei, J Zobel, C Verspoor
2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) | IEEE | Published: 2016
CD-HIT is one of the most popular tools for reducing sequence redundancy, and is considered to be the state-of-the-art method. It tries to minimise redundancy by reducing an input database to a set of representative sequences, under a user-defined threshold of sequence identity. We present a comprehensive assessment of the redundancy in the outputs of CD-HIT, exploring the impact of different identity thresholds and new evaluation data on the redundancy. We demonstrate that the relationship between threshold and redundancy is surprisingly weak. Applications of CD-HIT that set low identity threshold values may also suffer from substantial degradation in both efficiency and accuracy.
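To illustrate the idea of reducing a database to representatives under an identity threshold, the following is a minimal Python sketch of greedy incremental clustering. It is not CD-HIT's implementation: the function names (`identity`, `greedy_cluster`, `max_pairwise_identity`) are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in for the alignment-based sequence identity that a real tool would compute. The last helper shows one simple way residual redundancy among the representatives might be inspected; it is an assumption for illustration, not the evaluation method of the paper.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT an alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering (illustrative sketch).

    Sequences are processed longest first. Each one joins the first existing
    representative it matches at >= threshold; otherwise it becomes a new
    representative. Returns a dict mapping representative -> cluster members.
    """
    clusters = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


def max_pairwise_identity(representatives):
    """Highest similarity between any two representatives (0.0 if fewer than two)."""
    reps = list(representatives)
    return max(
        (identity(a, b) for i, a in enumerate(reps) for b in reps[i + 1:]),
        default=0.0,
    )


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSSSSPP"]  # made-up toy sequences
    for t in (0.8, 0.95):
        result = greedy_cluster(toy, t)
        print(t, len(result), max_pairwise_identity(result))
```

Because a sequence is compared only against existing representatives, and joins the first match rather than the best one, the output of such a greedy scheme depends on processing order. Checking the representatives against each other afterwards, as `max_pairwise_identity` does, is one rough way to see how much similarity remains at a given threshold.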
Related Projects (1)
Awarded by Australian Research Council
The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.