Exploring Species-Based Strategies for Gene Normalization
Karin Verspoor, Christophe Roeder, Helen L Johnson, K Bretonnel Cohen, William A Baumgartner, Lawrence E Hunter
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS | IEEE COMPUTER SOC | Published: 2010
We introduce a system developed for the BioCreative II.5 community evaluation of information extraction of proteins and protein interactions. The paper focuses primarily on the gene normalization task of recognizing protein mentions in text and mapping them to the appropriate database identifiers based on contextual clues. We outline a "fuzzy" dictionary lookup approach to protein mention detection that matches regularized text to similarly regularized dictionary entries. We describe several different strategies for gene normalization that focus on species or organism mentions in the text, both globally throughout the document and locally in the immediate vicinity of a protein mention, and ...
Awarded by NIH
Awarded by NATIONAL LIBRARY OF MEDICINE
This work was supported by NIH grants 5R01LM009254 and 2R01LM008111 to Lawrence Hunter and grant 1R01LM010120-01 to Karin Verspoor. The authors thank Florian Leitner at CNIO for his patience in responding to myriad questions about the evaluation; Fabio Rinaldi, who shared mappings from species names listed in the Cell Line Knowledge Base to NCBI Taxonomy identifiers; their undergraduate research intern Cesar Mejia Munoz for his assistance in exploring the parameter space of the system; and their collaborator Philip Ogren for use of his coordination analysis module.