Comparing Medline citations using modified N-grams

Rao Muhammad Adeel Nawab, Mark Stevenson, Paul Clough

Journal of the American Medical Informatics Association | OXFORD UNIV PRESS | Published: 2014


OBJECTIVE: We aim to identify duplicate pairs of Medline citations, particularly when the documents are not identical but contain similar information. MATERIALS AND METHODS: Duplicate pairs of citations are identified by comparing word n-grams in pairs of documents. N-grams are modified using two approaches that take account of the fact that a document may have been altered: (1) deletion, in which an item in the n-gram is removed; and (2) substitution, in which an item in the n-gram is replaced with a similar term obtained from the Unified Medical Language System Metathesaurus. N-grams are also weighted using a score derived from a language model. Evaluation is carried out using a set of 520 ..
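
The deletion and substitution modifications described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the whitespace tokenizer, the containment-style score, and the `similar` dictionary (a hypothetical stand-in for a lookup of similar terms in the UMLS Metathesaurus) are all assumptions made for illustration. The language-model weighting is omitted.

```python
from typing import Dict, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int = 3) -> List[NGram]:
    """Return the word n-grams of a text (naive lowercase whitespace tokenizer)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deletion_variants(gram: NGram) -> Set[NGram]:
    """Deletion: remove one item from the n-gram, at each position in turn."""
    return {gram[:i] + gram[i + 1:] for i in range(len(gram))}


def substitution_variants(gram: NGram, similar: Dict[str, Set[str]]) -> Set[NGram]:
    """Substitution: replace one item with a similar term.

    `similar` is a hypothetical term -> similar-terms mapping standing in
    for a Metathesaurus lookup; it is an assumption of this sketch.
    """
    variants: Set[NGram] = set()
    for i, token in enumerate(gram):
        for alternative in similar.get(token, ()):
            variants.add(gram[:i] + (alternative,) + gram[i + 1:])
    return variants


def all_variants(gram: NGram, similar: Dict[str, Set[str]]) -> Set[NGram]:
    """The original n-gram together with its deletion and substitution variants."""
    return {gram} | deletion_variants(gram) | substitution_variants(gram, similar)


def similarity(a: str, b: str, n: int, similar: Dict[str, Set[str]]) -> float:
    """Fraction of n-grams in `a` that match `b` directly or via a modified form.

    This containment-style score is an illustrative choice, not the measure
    used in the paper.
    """
    grams_a = set(word_ngrams(a, n))
    if not grams_a:
        return 0.0
    pool_b: Set[NGram] = set()
    for gram in word_ngrams(b, n):
        pool_b |= all_variants(gram, similar)
    matched = sum(1 for gram in grams_a if all_variants(gram, similar) & pool_b)
    return matched / len(grams_a)
```

With exact trigram matching alone, "aspirin reduces heart attack risk" and "aspirin lowers heart attack risk" share only one of three trigrams. In this sketch the deletion variants let the other two trigrams match as well, since removing the altered word from each side leaves the same pair of words.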

