Benchmarks for measurement of duplicate detection methods in nucleotide databases

Q Chen; J Zobel; K Verspoor

Journal article

Database | OXFORD UNIV PRESS | Published: 2023

DOI: 10.1093/database/baw164

Open access

Abstract

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leadin..

University of Melbourne Researchers

Justin Zobel Author

Related Projects (1)

Natural language processing for automated validation of protein databases

The project aims to use natural language processing and information retrieval to reconcile and improve sources of biological information. Bi..

Grants

Awarded by Australian Research Council

Funding Acknowledgements

Qingyu Chen's work is supported by an International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

Citation metrics

Scopus: 6

Web of Science: 7

Dimensions: 14

Keywords

Mathematical & Computational Biology

31 Biological Sciences

Records

46 Information and Computing Sciences

4605 Data Management and Data Science

Generic Health Relevance

Protein

Cd-Hit

Science & Technology

Life Sciences & Biomedicine