Abstract: Although generative adversarial networks (GANs) have achieved great success in face image generation and manipulation, discovering meaningful directions in the latent space of GANs to manipulate semantic attributes remains a difficult yet important challenge in computer vision. Addressing this challenge typically requires large amounts of labeled data and several hours of network fine-tuning. However, obtaining an annotated collection of images for each desired manipulation is usually expensive and time-consuming. Recent works aim to overcome this limitation by leveraging pre-trained models. While promising, these approaches still fall short of the manipulation accuracy and photorealism required by real face-editing scenarios. To address these problems, we encode images and text descriptions into a shared embedding space and propose a unified image generation and manipulation framework that leverages the powerful joint representation capability of Contrastive Language-Image Pre-training (CLIP). With carefully designed network structures and loss functions, our framework learns a latent residual mapper network that maps input conditions to corresponding latent code residuals. This scheme enables our method to perform high-quality image generation and manipulation by exploiting the generative power of the pre-trained StyleGAN2 model. Extensive experiments demonstrate the superiority of our approach in terms of manipulation accuracy, visual realism, and irrelevant attribute preservation.
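To make the residual-mapping idea in the abstract concrete, below is a minimal sketch, not the authors' implementation: a text condition is encoded with the public openai/CLIP package, a small mapper network predicts a residual, and the residual is added to a StyleGAN2 latent code before synthesis. The `ResidualMapper` architecture, the 512-dimensional latent size, the placeholder latent `w`, and the `generator.synthesis` call are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import clip  # openai/CLIP package

class ResidualMapper(nn.Module):
    """Hypothetical mapper: CLIP condition embedding -> latent-code residual (delta w)."""
    def __init__(self, clip_dim=512, w_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, w_dim), nn.LeakyReLU(0.2),
            nn.Linear(w_dim, w_dim), nn.LeakyReLU(0.2),
            nn.Linear(w_dim, w_dim),
        )

    def forward(self, clip_embedding):
        return self.net(clip_embedding)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
mapper = ResidualMapper().to(device)

# Encode the text condition with CLIP and L2-normalize it.
tokens = clip.tokenize(["a smiling face"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens).float()
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# w stands in for the latent code of an input face (e.g., from GAN inversion).
# The edit is applied as a residual: w_edit = w + mapper(condition).
w = torch.randn(1, 512, device=device)          # placeholder latent code
w_edit = w + mapper(text_emb)                   # edited latent code
# edited_image = generator.synthesis(w_edit)    # hypothetical StyleGAN2 call
```

In this reading of the abstract, the pre-trained StyleGAN2 generator and CLIP encoders stay frozen, and only the mapper is trained so that the residual moves the latent code toward the text condition while leaving unrelated attributes unchanged; the actual network design and losses are given in the paper itself.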