Conference Proceedings

langid.py: An off-the-shelf language identification tool

M Lui, T Baldwin

Proceedings of the Annual Meeting of the Association for Computational Linguistics | Association for Computational Linguistics | Published: 2012

Abstract

We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py and provide an empirical comparison on 5 long-document datasets and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users who require language identification but do not want to invest in preparing in-domain training data.
