Automated detection of records in biological sequence databases that are inconsistent with the literature
Mohamed Reda Bouadjenek, Karin Verspoor, Justin Zobel
JOURNAL OF BIOMEDICAL INFORMATICS | ACADEMIC PRESS INC ELSEVIER SCIENCE | Published: 2017
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Ce...
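As a rough illustration of the cross-checking idea described in the abstract, the sketch below scores each database record against the text of its referenced article and labels low-scoring records as "suspicious". It is not the authors' method: the TF-IDF cosine similarity stands in for their IR-based quality indicators, and a fixed threshold stands in for their trained anomaly detection algorithm. The function names, the threshold value, and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (always positive).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_records(pairs, threshold=0.1):
    """Label each (record_text, article_text) pair.

    The threshold is an illustrative assumption, not a value from the paper.
    Returns a list of (label, score) tuples in input order.
    """
    docs = [text for pair in pairs for text in pair]
    vecs = tfidf_vectors(docs)
    results = []
    for i in range(len(pairs)):
        score = cosine(vecs[2 * i], vecs[2 * i + 1])
        label = "suspicious" if score < threshold else "confident"
        results.append((label, score))
    return results


# Hypothetical sample data: one consistent pair, one inconsistent pair.
SAMPLE_PAIRS = [
    ("Homo sapiens BRCA1 gene, complete cds",
     "We sequenced the BRCA1 gene in Homo sapiens and report the complete coding sequence"),
    ("Zea mays chloroplast rbcL gene",
     "A survey of bird migration patterns across northern Europe"),
]

if __name__ == "__main__":
    for (label, score), (record, _) in zip(classify_records(SAMPLE_PAIRS), SAMPLE_PAIRS):
        print(f"{label:10s} {score:.3f}  {record}")
```

A single similarity score and a hand-picked threshold keep the example short. A closer approximation of the abstract would compute several indicators per record and fit an anomaly detector to them.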
Related Projects (1)
Awarded by the Australian Research Council through a Discovery Project grant
The project receives funding from the Australian Research Council through Discovery Project grant DP150101550.