Evaluation of CD-HIT for constructing non-redundant databases

Q Chen, Y Wan, Y Lei, J Zobel, C Verspoor

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) | IEEE | Published: 2016


CD-HIT is one of the most popular tools for reducing sequence redundancy, and is considered to be the state-of-the-art method. It tries to minimise redundancy by reducing an input database to a set of representative sequences, under a user-defined threshold of sequence identity. We present a comprehensive assessment of the redundancy in the outputs of CD-HIT, exploring the impact of different identity thresholds and new evaluation data on the redundancy. We demonstrate that the relationship between threshold and redundancy is surprisingly weak. Applications of CD-HIT that set low identity threshold values may also suffer from substantial degradation in both efficiency and accuracy.
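To make the idea of threshold-based reduction concrete, the following is a minimal Python sketch of greedy incremental clustering, the general strategy CD-HIT is known for: sequences are processed from longest to shortest, and each one either joins the first representative it matches at or above the threshold or becomes a new representative. This is an illustration only, not CD-HIT's implementation. The names `toy_identity` and `greedy_representatives` are hypothetical, and the identity measure is a deliberately simple ungapped stand-in for the alignment-based identity that CD-HIT actually computes.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Assumed stand-in for sequence identity: the fraction of matching
    positions when both sequences are left-aligned without gaps, divided
    by the length of the shorter sequence. CD-HIT itself uses
    alignment-based identity, which this does not reproduce."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_representatives(sequences: List[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering sketch.

    Sequences are visited from longest to shortest. Each sequence joins the
    first existing representative whose toy identity meets the threshold;
    otherwise it starts a new cluster as its own representative.
    Returns a mapping from representative to cluster members.
    """
    clusters: Dict[str, List[str]] = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if toy_identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: seq becomes one.
            clusters[seq] = [seq]
    return clusters


if __name__ == "__main__":
    # Made-up toy sequences, purely for illustration.
    example = ["MKTAYIAKQR", "MKTAYIAKQL", "MKTAYIGGGG", "GGGGGGGGGG"]
    for t in (0.85, 0.5):
        reps = list(greedy_representatives(example, t))
        print(t, len(reps), reps)
```

In this sketch a sequence is only ever compared against representatives, never against other cluster members, so two sequences kept as separate representatives are guaranteed to fall below the threshold only under this particular identity measure. Whether the retained representatives are still redundant under an independent similarity measure is the kind of question the assessment addresses.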