Journal article

Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases

Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel, Karin Verspoor

Journal of Data and Information Quality | Association for Computing Machinery | Published: 2018

Abstract

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applie..

Grants

Awarded by Australian Research Council


Funding Acknowledgements

We thank the Protein Information Resource team leader Hongzhan Huang for advice on the design of the case study. We also thank Jan Schroder for his discussions of this work. Q. Chen's work was supported by the Melbourne International Research Scholarship from the University of Melbourne. The project received funding from the Australian Research Council through a Discovery Project grant (DP150101550).