Natural Language Processing for Automated Validation of Protein Databases

Grant number: DP150101550 | Funding period: 2015 - 2018

Completed

Abstract

The project aims to use natural language processing and information retrieval to reconcile and improve sources of biological information. Biological research has produced vast volumes of information about proteins, captured in structured resources (databases) and unstructured documents. However, the accuracy of much of this information is questionable. The project proposes to develop methods to validate data and reduce the dramatic inconsistencies in protein information resources by leveraging observed correlations and complementarity between them, and specifically through targeted fact extraction from the biomedical literature. These methods will be applied at scale across millions of publi..

Related publications (16)

Benchmarks for measurement of duplicate detection methods in nucleotide databases

Q Chen, J Zobel, K Verspoor

2023-01-01

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy ..

Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

Qingyu Chen, Ramona Britto, Ivan Erill, Constance J Jeffery, Arthur Liberzon, Michele Magrane, Jun-Ichi Onami, Marc Robinson-Rechavi, Jana Sponarova, Justin Zobel, Karin Verspoor

2020-04-01

Biological databases represent an extraordinary collective volume of work. Diligently built up over decades and comprising many mi..

Automated assessment of biological database assertions using the scientific literature

Mohamed Reda Bouadjenek, Justin Zobel, Karin Verspoor

2019-04-29

Background: The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively b..

Search Effectiveness in Nonredundant Sequence Databases: Assessments and Solutions

Qingyu Chen, Xiuzhen Zhang, Yu Wan, Justin Zobel, Karin Verspoor

2018-12-26

Duplicate sequence records, that is, records having similar or identical sequences, are a challenge in search of biological sequence..

An Improved Neural Network Model for Joint POS Tagging and Dependency Parsing

Dat Quoc Nguyen, Karin Verspoor

2018-10-01

We propose a novel neural network model for joint part-of-speech (POS) tagging and dependency parsing. Our model extends the well-..

Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases

Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel, Karin Verspoor

2018-03-01

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. How..

Automated detection of records in biological sequence databases that are inconsistent with the literature

MR Bouadjenek, K Verspoor, J Zobel

2017-07-01

We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data ano..

Duplicates, redundancies and inconsistencies in the primary nucleotide databases: A descriptive study

Q Chen, J Zobel, K Verspoor

2017-01-01

GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Se..

Learning Biological Sequence Types Using the Literature

Mohamed Reda Bouadjenek, Karin Verspoor, Justin Zobel

2017-01-01

We explore in this paper automatic biological sequence type classification for records in biological sequence databases. The seque..

Literature consistency of bioinformatics sequence databases is effective for assessing record quality

MR Bouadjenek, K Verspoor, J Zobel

2017-01-01

Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These record..

Sequence clustering methods and completeness of biological database search

Q Chen, X Zhang, Y Wan, J Zobel, K Verspoor

2017-01-01

Supervised learning for detection of duplicates in genomic sequence databases

Q Chen, J Zobel, X Zhang, K Verspoor

2016-08-01

Motivation: First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to incon..

A categorical analysis of coreference resolution errors in biomedical texts

M Choi, J Zobel, K Verspoor

2016-04-01

Background: Coreference resolution is an essential task in information extraction from the published biomedical literature. It sup..

Coreference resolution improves extraction of Biological Expression Language statements from texts

M Choi, H Liu, W Baumgartner, J Zobel, K Verspoor

2016-01-01

We describe a system that automatically extracts biological events from biomedical journal articles, and translates those events i..

Evaluation of CD-HIT for constructing non-redundant databases

Q Chen, Y Wan, Y Lei, J Zobel, C Verspoor

2016-01-01

CD-HIT is one of the most popular tools for reducing sequence redundancy, and is considered to be the state-of-the-art method. It trie..

Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases

Q Chen, J Zobel, K Verspoor

2015-01-01

The impact of duplicate or inconsistent records in databases can be severe, and for general databases has led to development of a ..

University of Melbourne Researchers

Justin Zobel