Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases

Q Chen, J Zobel, K Verspoor

Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics | ACM | Published: 2015


The impact of duplicate or inconsistent records in databases can be severe and, for general databases, has led to the development of a range of techniques for identifying such records. In bioinformatics, duplication arises when two or more database records represent the same biological entity, a problem that has been known for over 20 years. However, only a limited number of techniques for detecting bioinformatic duplicates have emerged. Special techniques are required for handling large data sets (a common 5000-record data set has over 10 million pairs to compare) and imbalanced data (where the prevalence of duplicate pairs is minute compared to non-duplicate pairs). Biological domain interpretation..
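
The pair count quoted in the abstract follows from comparing every record with every other record once, which gives n(n-1)/2 pairs for n records. The short Python sketch below checks that figure and shows how small the duplicate fraction becomes at that scale. The function names and the duplicate count of 500 are hypothetical values chosen for illustration only; they are not taken from the paper.

```python
from math import comb


def pair_count(n_records: int) -> int:
    """Number of unordered record pairs in an all-against-all comparison."""
    return comb(n_records, 2)  # n * (n - 1) / 2


def duplicate_prevalence(n_records: int, n_duplicate_pairs: int) -> float:
    """Fraction of all candidate pairs that are duplicates."""
    total = pair_count(n_records)
    if total == 0:
        raise ValueError("need at least two records to form a pair")
    return n_duplicate_pairs / total


if __name__ == "__main__":
    # The abstract's example: a 5000-record data set.
    print(pair_count(5000))  # 12497500, i.e. over 10 million pairs

    # Hypothetical duplicate count, assumed purely for illustration.
    assumed_duplicate_pairs = 500
    print(duplicate_prevalence(5000, assumed_duplicate_pairs))
```

Under that assumed count, duplicates would make up only a few thousandths of a percent of all candidate pairs, which is the kind of class imbalance the abstract refers to.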

