Language identification: The long and the short of the matter

T Baldwin; M Lui

Conference Proceedings

NAACL HLT 2010: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference | Published: 2010

Abstract

Language identification is the task of identifying the language a given document is written in. This paper describes a detailed examination of what models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents. We also show that it is possible to perform language identification without having to perform explicit character encoding detection. © 2010 Association for Computational Linguistics.

University of Melbourne Researchers

Tim Baldwin (Author)

Citation metrics

118 (Scopus)