Bayesian text segmentation for index term identification and keyphrase extraction

D Newman; N Koilada; JH Lau; T Baldwin

Conference Proceedings

24th International Conference on Computational Linguistics, Proceedings of COLING 2012: Technical Papers | Published: 2012

Abstract

Automatically extracting terminology and index terms from scientific literature is useful for a variety of digital library, indexing and search applications. This task is non-trivial, complicated by domain-specific terminology and a steady introduction of new terminology. Correctly identifying nested terminology further adds to the challenge. We present a Dirichlet Process (DP) model of word segmentation where multiword segments are either retrieved from a cache or newly generated. We show how this DP-Segmentation model can be used to successfully extract nested terminology, outperforming previous methods for solving this problem. © 2012 The COLING.
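
The cache-or-generate idea in the abstract can be illustrated with the standard Dirichlet Process predictive rule, in which a segment is either reused from the cache in proportion to its past count or drawn fresh from a base distribution. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the concentration value `alpha=1.0`, and the toy base distribution (uniform words, geometric segment length) are all hypothetical.

```python
from collections import Counter


def segment_probability(segment, cache, alpha, base_prob):
    """Predictive probability of `segment` under a DP with a cache of past segments.

    cache: Counter mapping each previously generated segment to its count.
    alpha: DP concentration parameter (assumed value, not taken from the paper).
    base_prob: function returning the base-distribution probability of a segment.
    """
    total = sum(cache.values())
    # Probability mass for retrieving the segment from the cache.
    from_cache = cache[segment] / (total + alpha)
    # Probability mass for generating the segment anew from the base distribution.
    newly_generated = alpha * base_prob(segment) / (total + alpha)
    return from_cache + newly_generated


def toy_base_prob(segment, vocab_size=1000, stop_prob=0.5):
    """Hypothetical base distribution: uniform words, geometric segment length."""
    n_words = len(segment.split())
    return (1.0 / vocab_size) ** n_words * stop_prob * (1 - stop_prob) ** (n_words - 1)


# Toy cache of previously seen multiword segments (illustrative counts only).
cache = Counter({"dirichlet process": 3, "text segmentation": 1})
p_seen = segment_probability("dirichlet process", cache, alpha=1.0, base_prob=toy_base_prob)
p_new = segment_probability("index term", cache, alpha=1.0, base_prob=toy_base_prob)
```

In this toy setting a segment already in the cache receives far more probability than an unseen one, which is the effect that lets frequently repeated multiword terms emerge as candidate index terms.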

University of Melbourne Researchers

Tim Baldwin Author

Jey Han Lau Author

Citation metrics

45 (Scopus)