Conference Proceedings

SnapGen: Taming High-Resolution Text-To-Image Models for Mobile Devices with Efficient Architectures and Training

J Chen, D Hu, X Huang, H Coskun, A Sahni, A Gupta, A Goyal, D Lahiri, R Singh, Y Idelbayev, J Cao, Y Li, KT Cheng, SHG Chan, M Gong, S Tulyakov, A Kag, Y Xu, J Ren

Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition | IEEE | Published: 2025

Abstract

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level app..
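The abstract mentions multi-level, cross-architecture knowledge distillation from a larger teacher model. As a rough illustration only (the paper's actual objective is not given here), a distillation loss of this kind is often formed by penalizing the student's deviation from the teacher at both the final output and matched intermediate feature levels. The function below is a hypothetical sketch under that assumption; the names, the MSE choice, and the `alpha` weighting are illustrative, not taken from the paper.

```python
import numpy as np

def multi_level_distill_loss(student_feats, teacher_feats,
                             student_out, teacher_out, alpha=0.5):
    """Hypothetical multi-level distillation loss (illustrative only):
    MSE on the final output plus MSE averaged over matched
    intermediate feature maps, mixed with weight alpha."""
    # Output-level term: match the teacher's final prediction
    out_term = np.mean((student_out - teacher_out) ** 2)
    # Feature-level term: match each paired intermediate representation
    feat_term = np.mean([np.mean((s - t) ** 2)
                         for s, t in zip(student_feats, teacher_feats)])
    return alpha * out_term + (1 - alpha) * feat_term
```

In practice the student and teacher have different architectures, so each feature pair would first be projected to a common shape; that projection step is omitted here for brevity.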

