Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees
E Shareghi, M Petri, G Haffari, T Cohn
The Association for Computational Linguistics | Published: 2015
Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index (a compressed suffix tree), which provides near-optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and, although slower than leading LM toolkits, shows promising scaling properties, which we demonstrate through ∞-order modeling over the full Wikipedia collection.
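To make the idea of on-the-fly Kneser-Ney computation concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a single fixed discount `d`, and it replaces the compressed suffix tree with a hypothetical `NaiveIndex` class that answers the same kinds of count queries by scanning the token list. The names `NaiveIndex`, `followers`, `n_left`, and `kn_prob` are illustrative assumptions. No n-gram table is precomputed: every count is derived from the text at query time, and the context may be arbitrarily long.

```python
from collections import Counter

BOS = "<s>"  # start sentinel, so every real token has a predecessor


class NaiveIndex:
    """Hypothetical stand-in for a compressed suffix tree.

    Answers count queries by linear scans; a real succinct index
    would answer them without scanning the text.
    """

    def __init__(self, tokens):
        self.toks = [BOS] + list(tokens)
        # Number of distinct bigram types, the unigram-level denominator.
        self.bigram_types = len(set(zip(self.toks, self.toks[1:])))

    def followers(self, ctx):
        """Raw counts c(ctx w) for every token w that follows ctx."""
        m = len(ctx)
        out = Counter()
        for i in range(len(self.toks) - m):
            if self.toks[i:i + m] == ctx:
                out[self.toks[i + m]] += 1
        return out

    def n_left(self, pat):
        """Number of distinct tokens that precede pat in the text."""
        m = len(pat)
        return len({self.toks[i - 1]
                    for i in range(1, len(self.toks) - m + 1)
                    if self.toks[i:i + m] == pat})


def kn_prob(idx, ctx, w, d=0.75, top=True):
    """Interpolated Kneser-Ney P(w | ctx) with one discount d (an assumption).

    The longest context that was seen uses raw counts; every shorter
    context uses continuation counts (distinct left extensions).
    """
    ctx = list(ctx)
    if not ctx:
        # Base case: continuation-count unigram distribution.
        return idx.n_left([w]) / idx.bigram_types
    follow = idx.followers(ctx)
    if top:
        counts = dict(follow)
    else:
        counts = {x: idx.n_left(ctx + [x]) for x in follow}
    total = sum(counts.values())
    if total == 0:
        # Context never seen with a continuation: drop its oldest token.
        return kn_prob(idx, ctx[1:], w, d, top)
    types = sum(1 for c in counts.values() if c > 0)
    discounted = max(counts.get(w, 0) - d, 0.0) / total
    backoff = d * types / total
    return discounted + backoff * kn_prob(idx, ctx[1:], w, d, top=False)


corpus = "the cat sat on the mat the cat ate".split()
idx = NaiveIndex(corpus)
vocab = sorted(set(corpus))
print(kn_prob(idx, ["on", "the"], "mat"))
```

Because the context is shortened only when it has not been observed, the same function serves any query length, which is the sense in which no Markov order is imposed. In this sketch each query costs time linear in the corpus, and a compressed suffix tree is what would make such queries efficient at scale.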