Journal article

Benchmarks for measurement of duplicate detection methods in nucleotide databases

Q Chen, J Zobel, K Verspoor

Database | OXFORD UNIV PRESS | Published: 2023

Abstract

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leadin..

University of Melbourne Researchers