CRIS: CLIP-Driven Referring Image Segmentation

Z Wang; Y Lu; Q Li; X Tao; Y Guo; M Gong; T Liu

Conference Proceedings

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | Published: 2022

DOI: 10.1109/CVPR52688.2022.01139

Abstract

Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties between text and image, it is challenging for a network to well align text and pixel-level features. Existing approaches use pretrained models to facilitate learning, yet separately transfer the language/vision knowledge from pretrained models, ignoring the multi-modal corresponding information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding a..
