On collocations and topic models

JH Lau; T Baldwin; D Newman

Journal article

ACM Transactions on Speech and Language Processing | Published: 2013

DOI: 10.1145/2483969.2483972

Abstract

We investigate the impact of preextracting and tokenizing bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and suggest an alternate measure that penalizes model complexity. We show how the Akaike information criterion is a more appropriate measure, which suggests that using a modest number (up to 1000) of top-ranked bigrams is the optimal topic modelling configuration. Using these 1000 bigrams also results in im..
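As a rough illustration of why a complexity-penalized measure can rank models differently from raw likelihood, here is a minimal sketch of the standard Akaike information criterion, AIC = 2k - 2 ln L. The function name, the way parameters are counted, and all numbers below are assumptions made for illustration; they are not taken from the article.

```python
def aic(log_likelihood: float, num_params: int) -> float:
    """Standard Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical values for illustration only; not results from the article.
# The larger model fits slightly better but uses twice as many parameters.
small_model = aic(log_likelihood=-1000.0, num_params=50)
large_model = aic(log_likelihood=-990.0, num_params=100)

# The penalty term outweighs the small gain in likelihood,
# so the smaller model gets the lower (better) AIC.
preferred = "small" if small_model < large_model else "large"
```

With these made-up inputs, the model with the higher likelihood is not the one AIC prefers, which is the kind of disagreement between likelihood-based and complexity-penalized comparisons the abstract alludes to.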

University of Melbourne Researchers

Jey Han Lau Author

Tim Baldwin Author

Citation metrics

Scopus: 56

Dimensions: 47

Keywords

46 Information and Computing Sciences

4605 Data Management and Data Science