Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases

Qingyu Chen; Yu Wan; Xiuzhen Zhang; Yang Lei; Justin Zobel; Karin Verspoor

Journal article

ACM Journal of Data and Information Quality | Association for Computing Machinery | Published: 2018

DOI: 10.1145/3131611

Abstract

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applie..

University of Melbourne Researchers

Justin Zobel Author

Related Projects (1)

Natural language processing for automated validation of protein databases

The project aims to use natural language processing and information retrieval to reconcile and improve sources of biological information. Bi..

Grants

Awarded by Australian Research Council

Funding Acknowledgements

We thank the Protein Information Resources team leader Hongzhan Huang for advice on the design of the case study. We also thank Jan Schroder for his discussions of this work. Q. Chen's work was supported by the Melbourne International Research Scholarship from the University of Melbourne. The project received funding from the Australian Research Council through a Discovery Project grant (DP150101550).