Journal article

Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study

Qingyu Chen, Justin Zobel, Karin Verspoor

Database: The Journal of Biological Databases and Curation | Oxford University Press | Published: 2017

Abstract

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate..


Grants

Awarded by Australian Research Council


Funding Acknowledgements

Qingyu Chen's work is supported by an International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.