
CLIP-guided StyleGAN Inversion for Text-driven Real Image Editing

Published: 29 August 2023


Abstract

Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the Contrastive Language-Image Pre-training (CLIP) embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results.


1 INTRODUCTION

The quality of images synthesized by Generative Adversarial Networks [Goodfellow et al. 2014] has reached a remarkable level in less than a decade. StyleGAN and its variants [Karras et al. 2019, 2020, 2021] are now capable of generating highly realistic images, while allowing control over the generation process by means of style mixing. Recent works [Härkönen et al. 2020; Shen et al. 2020] have demonstrated that StyleGAN learns disentangled attributes, making it possible to find directions in its latent space to generate images that possess such desired attributes. Consequently, there has been growing interest in utilizing semantic editing directions in the latent space, mostly for preset attributes such as gender, face orientation, and hair color.

Concurrent to the advances in generative modeling, we are also witnessing exciting breakthroughs in multimodal learning. For example, the recently proposed Contrastive Language-Image Pre-training (CLIP) model [Radford et al. 2021] provides an effective common embedding for images and text captions. Such an embedding, when combined with powerful GANs, paves the way toward text-guided image editing, one of the most natural and intuitive ways of manipulating images. Hence, it comes as no surprise that several recent works [Li et al. 2020; Xia et al. 2021a; Patashnik et al. 2021; Kocasari et al. 2021; Wei et al. 2022] have focused on mapping target textual descriptions to editing directions in the latent space of StyleGAN. While some methods perform optimization in the latent space guided by CLIP [Xia et al. 2021a; Patashnik et al. 2021], others train a separate mapper network for each type of textual edit [Patashnik et al. 2021] or a general mapper conditioned on reference images and textual descriptions [Wei et al. 2022]. Instance-based optimization methods require long inference times. Training mappers for a single text prompt reduces the inference time to a single forward pass but comes at the price of training time, as separate mappers need to be trained for each text prompt. Moreover, mappers that operate in the latent space do not directly consider the features of the original image, as they take inverted latent codes produced by pretrained GAN inversion networks as inputs.

In this study, we present a new approach, which we call CLIPInverter, to automatically edit an input image based on a target textual description containing multiple attributes by adjoining lightweight adapter modules to pretrained unconditional inversion methods. CLIPInverter includes a novel CLIP-conditioned adapter module (CLIPAdapter) that is attached to the pretrained encoder model to map both the input image and the target textual description to a residual latent code by utilizing the common CLIP embedding space. The residual latent code is then combined with the latent code of the input image obtained by the unconditional branch of the encoder and is fed to a CLIP-guided correction module (CLIPRemapper) that applies a final correction by blending the latent codes with latent codes predicted from the CLIP embedding of the target textual description based on learnable blending coefficients. The final latent code is decoded by a pretrained and frozen StyleGAN2 generator to synthesize the manipulated image that reflects the desired changes while preserving the identity of the original subject as much as possible. Our encoder-adapters are lightweight networks that directly modulate image feature maps using text embeddings, and they can be appended to many pretrained encoders. Our CLIP-guided correction module utilizes the CLIP text embeddings to enhance the manipulations of the generated images while preserving photorealism. Our method does not require any additional optimization over the latents, and it successfully applies manipulations using various text prompts in a single forward pass. Since we directly modulate feature maps extracted during the inversion phase, our method edits images more effectively than competing approaches, especially when multiple attributes are present in the target textual description, as demonstrated by our experiments. See Figure 2 for an overview of our framework.

Fig. 1.

Fig. 1. Multi-attribute real image manipulation with CLIPInverter. We present CLIPInverter that enables users to easily perform semantic changes on images using free natural text. Our approach is not specific to a certain category of images and can be applied to many different domains (e.g., human faces, cats, and birds) where a pretrained StyleGAN generator exists (top). Our approach specifically gives more accurate results for multi-attribute edits as compared to the prior work (middle). Moreover, as we utilize CLIP’s semantic embedding space, it can also perform manipulations based on reference images without any training or finetuning (bottom).

Fig. 2.

Fig. 2. An overview of our CLIPInverter approach in comparison to similar text-guided image manipulation methods. StyleCLIP-LM utilizes target description only in the loss function. HairCLIP additionally uses the description to modulate the latent code obtained by the encoder within the mapper. Alternatively, our CLIPInverter employs specially designed adapter layers, CLIPAdapter, to modulate the encoder in extracting the latent code with respect to the target description. To further obtain more accurate edits, it also makes use of an extra refinement module, CLIPRemapper, to make subsequent corrections on the predicted latent code.

Our method aims to strike a balance between distortion and editability [Tov et al. 2021]. Namely, our text-guided CLIPAdapter is utilized to find an editing direction that is aligned with the given target description, specific to the input image. By leveraging the inversion in the \(\mathcal {W}+\) space, we aim to preserve the identity of the input image in the manipulated output, which helps in achieving relatively low distortion. However, it is important to note that complete elimination of distortion is not feasible in this process. While we are able to preserve the identity to a certain degree, we observe that not all attributes described in the target caption may be fully captured in the manipulated image. To address this, we introduce the text-guided refinement module, CLIPRemapper, which applies a final correction to the latent code, further aligning it with the desired target description. Essentially, CLIPRemapper finds a more editable region in the vicinity of the latent code we obtain from the previous stage. This process substantially boosts the manipulation performance of our model, while keeping the distortion at a comparable level, as shown in our ablation study.

We demonstrate editing results for challenging cases where many attributes are present in the target description. Our method is not restricted to a particular domain like the commonly studied human faces, and we also evaluate our approach on bird and cat images. Exploiting the multimodal nature of CLIP, we can additionally use reference images, or textual descriptions containing vocabulary never seen during training, as the guiding signal instead of in-domain target descriptions. Finally, we show that linearly interpolating between the original latent code and the updated latent code results in smooth image manipulations, providing a means for users to control the manipulation process.

We evaluate our method on a diverse set of datasets and provide detailed qualitative results and comparisons against state-of-the-art models. Quantitative comparison in language-guided editing remains a challenge, as one needs to evaluate manipulations from different aspects, such as accuracy, preservation of text-irrelevant details, and photorealism. Current metrics are not suitable for this evaluation, as they do not consider some of these aspects at all. We propose two new metrics, Attribute Manipulation Accuracy (AMA) and CLIP Manipulative Precision (CMP), to measure how accurately the manipulations are applied and how well the text-irrelevant details are preserved. We perform quantitative comparisons against state-of-the-art models using these metrics along with Fréchet Inception Distance (FID). These comparisons, as well as a user study that we conducted to evaluate perceptual realism and manipulation accuracy, demonstrate the superiority of our approach over prior work.

Our code and models are publicly available at the project website.1


2 RELATED WORK

2.1 GAN Inversion

In response to the growing demand for interpretability and controllability in GANs, GAN inversion has emerged as a pivotal technique. By mapping a given image back into the latent space of a pretrained GAN model, as introduced by Zhu et al. [2016], GAN inversion facilitates a deeper understanding of the underlying features and structures in the latent space, enabling researchers to manipulate and interpret generated images with greater precision and insight. Below, we discuss some representative works that highlight the three main approaches to GAN inversion; please refer to the recent survey [Xia et al. 2021b] for an in-depth discussion of various other inversion methods.

The optimization-based methods directly optimize a latent code that reconstructs the target image as closely as possible using gradient descent [Abdal et al. 2019, 2020; Creswell and Bharath 2016; Tewari et al. 2020a]. This line of work is instance-specific and does not require any trainable modules. The learning-based methods invert an image using a learned encoder. This approach is similar to an autoencoder pipeline, where the pretrained generator acts as the decoder. Unconditional encoders [Tewari et al. 2020b; Zhu et al. 2020; Alaluf et al. 2021a; Bau et al. 2019a; Richardson et al. 2021; Tov et al. 2021; Bai et al. 2022] aim to solely invert the image, without any modifications, while conditional encoders [Alaluf et al. 2021b] are designed to obtain a latent code conditioned on attributes such as pose, age, or facial expressions. The so-called hybrid methods [Zhu et al. 2016; Bau et al. 2019b] combine optimization-based methods with learning-based methods. The images are first inverted to a latent code by a learned encoder. This latent code then becomes the initialization for the latent optimization and is optimized to reconstruct the target image.

More recent approaches build different architectures, fine-tune StyleGAN weights, or modulate feature maps for inversion. Style Transformer [Hu et al. 2022] uses a combination of convolutional neural networks and transformers to invert images into the latent space. Pivotal Tuning Inversion [Roich et al. 2021] fine-tunes the generator around a pivotal latent code to find a balance for the distortion-editability trade-off. Some methods [Alaluf et al. 2021c; Dinh et al. 2022] train hypernetworks to modulate the weights of a pre-trained StyleGAN network for accurate as well as editable inversions. Spatially-Adaptive Multilayer (SAM) GAN Inversion [Parmar et al. 2022] predicts invertibility maps and High-Fidelity GAN Inversion [Wang et al. 2022] predicts latent maps to modulate StyleGAN features.

While both optimization-based and hybrid approaches may reconstruct images faithfully, they require solving an optimization problem for each image, resulting in longer processing times. In contrast, our approach uses lightweight learned adapters appended to pretrained encoders, which provides a much faster alternative to current methods. Furthermore, we condition the inversion process directly on the target captions, which helps find a more effective editing direction in the latent space.

2.2 Latent Space Manipulation

Recent work has shown that GANs learn a semantically coherent latent space, enabling manipulations in the latent space to be mapped to semantic image edits. Specifically, StyleGAN [Karras et al. 2019] learns an intermediate latent space by employing a mapping network to transform the sampled latent code. These intermediate latent codes determine the parameters of the AdaIN [Huang and Belongie 2017] layers introduced in the generator to control the style of the generated image, allowing control over the synthesis at different levels. A common approach when manipulating images is to first invert the input image back into the latent space of a pretrained generator using GAN inversion and then traverse the latent space to find a meaningful direction. Such a direction can be found using explicit supervision from image attribute annotations [Shen et al. 2020; Abdal et al. 2021; Wu et al. 2020] or in an unsupervised manner [Voynov and Babenko 2020; Härkönen et al. 2020; Shen and Zhou 2021]. Recently proposed methods consider various modalities for conditional image manipulation. StyleMapGAN [Kim et al. 2021] proposes an intermediate latent space with spatial dimensions and spatial modulation, which enables local editing based on reference images. Similarly, the study by Collins et al. [2020] uses a transformation matrix to control the interpolation between an input image and a reference image in the latent space to locally edit the input image. The recent work of Alaluf et al. [2021b] manipulates an input image based on a target age by training an encoder conditioned on the target age to find residual latent codes to add to the inverted latent code of the original image. In a similar vein, we train adapter layers appended to an encoder conditioned on textual descriptions to output such residual latent codes. We also use the CLIP model to define supervisory signals based on the similarity between an image and a textual description.

Moreover, there are several latent spaces to consider in a StyleGAN2 generator. A mapping network transforms latent codes drawn from a normal distribution in the \(\mathcal {Z}\) space into an intermediate latent space \(\mathcal {W}\). The latent codes in the \(\mathcal {W}\) space are used at different stages of the StyleGAN2 generator, after being mapped to the \(\mathcal {S}\) space by affine transformations. The \(\mathcal {W}+\) space is an extended version of the \(\mathcal {W}\) space where a different \(\mathbf {w}\) is used for each style input of the generator. While some works find editing directions in the \(\mathcal {S}\) space, such as StyleCLIP-GD [Patashnik et al. 2021] and StyleMC [Kocasari et al. 2021], many others, like StyleCLIP-LO, StyleCLIP-LM [Patashnik et al. 2021], and SAM [Alaluf et al. 2021b], utilize the extended intermediate space \(\mathcal {W}+\). Our text-guided image encoder operates in \(\mathcal {W}+\) to find effective editing directions.

2.3 Text-guided Image Manipulation

Given an image and a target description in natural language, the aim of text-guided image manipulation is to generate images that reflect the desired semantic changes while preserving the details or attributes not mentioned in the text. ManiGAN [Li et al. 2020] learns a text-image affine combination module that selects image regions relevant to the language description, along with a detail correction module that modifies these regions. TediGAN [Xia et al. 2021a] enforces text and image matching by mapping the images and the text to the same latent space and performs further optimization to preserve the identity of the subjects in the original image.

More recent works use semantics learned by a multi-modal method such as CLIP [Radford et al. 2021]. StyleCLIP [Patashnik et al. 2021] uses the CLIP space to optimize for the latent code (StyleCLIP-LO) that minimizes the distance of the image and text pair. They also present a latent mapper (StyleCLIP-LM) that predicts residual latent codes corresponding to specific attributes. Finally, they also experiment with mapping a text prompt to a global direction (StyleCLIP-GD) in the latent space that is independent of the input image. The most recent StyleMC [Kocasari et al. 2021] model presents an efficient method to learn global directions in the \(\mathcal {S}\) space of StyleGAN2 for a given text prompt, by finding directions at lower resolutions and applying manipulations at higher resolutions. It also utilizes CLIP to minimize the distance between the generated image and the text prompt. Most recently and most similar to our approach, HairCLIP [Wei et al. 2022] modulates the inverted latent codes based on hairstyle and hair color inputs as image or text. Their approach is similar to StyleCLIP-LM. However, they also modulate the latent codes with the CLIP embeddings rather than solely optimizing the similarity in the CLIP space.

Our work shares some similarities with the aforementioned methods. Like the original TediGAN model, we employ an encoder to predict the latent code conditioned on the provided target description. That said, we estimate a residual latent code reflecting only the desired changes mentioned in the description, which is then added to the inverted latent code of the input image. The StyleCLIP-LM and StyleMC models predict residual latent codes similar to ours, but they require training their mapper functions from scratch for each text prompt via a loss function based on CLIP similarity. Most similar to our approach, HairCLIP applies modulations in the latent space after obtaining inversions with a pretrained network. However, we let CLIP embeddings modulate the feature maps via an adapter module for predicting the residual latent code. With this modulation, our inversion step is text-guided, whereas HairCLIP applies text conditioning in the latent space. We also train a correction module that blends latent codes with learnable blending coefficients for improved accuracy, quality, and fidelity in the output images. In Figure 2, we illustrate these fundamental differences between our approach and the most similar StyleCLIP-LM and HairCLIP methods.

Our approach allows us to manipulate fine-scale details by modulating the feature maps, resulting in more accurate manipulations than HairCLIP. Thanks to this process, we also eliminate the need for separate training, unlike StyleCLIP-LM. That is, once our model is trained, it can be directly used to manipulate images by considering a large variety of text prompts containing multiple attributes.

We provide extensive comparisons against the aforementioned recent StyleGAN-based methods in Section 4 and show the superiority or competitiveness of our proposed approach.

Recently, diffusion models trained with variational inference have achieved state-of-the-art performance in image generation [Dhariwal and Nichol 2021; Ho et al. 2020; Rombach et al. 2022]. Following this success, several diffusion-based text-guided image manipulation methods have been proposed. DiffusionCLIP [Kim et al. 2022] first converts images to latent noise by forward diffusion and then guides the reverse diffusion process with CLIP to control the attributes of the synthesized images. UniTune [Valevski et al. 2022] introduces a simple method to fine-tune large-scale text-to-image diffusion models on single images. Similarly, Imagic [Kawar et al. 2022] optimizes a text embedding and fine-tunes pretrained generative diffusion models to perform edits on a single image. Prompt-to-Prompt [Hertz et al. 2022] and its later extension Plug-and-Play [Tumanyan et al. 2023] achieve semantic edits by blending activations extracted from both the original and target prompts. These diffusion-based editing methods differ from ours in that each requires a large pretrained text-to-image network. Hence, we do not directly evaluate our approach against these methods, but we provide some comparisons in the supplementary material.

2.4 Adapter Layers

Adapter layers [Houlsby et al. 2019], originally proposed for Natural Language Processing (NLP) tasks, are compact modules that allow parameter sharing in an efficient manner. The key idea is to add adapter modules, consisting of a few layers, between the layers of a pretrained network. The parameters of the adapter module are updated during the fine-tuning phase on a downstream task, while the original parameters of the pretrained network remain the same. This way, most of the parameters of the pretrained network are shared between different downstream tasks, resulting in a model that is able to perform diverse tasks efficiently. Since the parameters of the pretrained network are frozen, the original capabilities of the model are preserved. In a transformer model, the module proposed for NLP [Houlsby et al. 2019] is inserted after the feed-forward layers, before the skip connection is added back. This module consists of a down-projection and an up-projection layer. Compared to the original pretrained model, the number of parameters in the adapter module is considerably smaller, allowing new tasks to be learned efficiently.

Adapter layers have also been used in computer vision tasks. Rebuffi et al. [2017] introduced residual adapter layers for multiple-domain learning in image recognition. Their residual adapter layers are slightly modified versions of the residual blocks in ResNet [He et al. 2016], where batch normalization and 1 \(\times\) 1 convolutions with residual connections are added to these blocks. Rebuffi et al. [2018] proposed several improvements over this module. They modified the series implementation of the residual adapter to obtain a parallel adapter, where the input to the convolutional blocks of the residual block is processed in parallel with the adapter convolutions and fed back to the original branch. They also investigated where to place the adapter layers in the ResNet to achieve the best performance. Finally, in VL-Adapter [Yi-Lin Sung 2022], the authors experimented with adapter layers in joint vision-and-language tasks. They added adapter modules consisting of downsampling and upsampling layers to the transformer architecture for parameter-efficient fine-tuning.

Our approach consists of adapter modules that we attach to inversion models. Our encoder-adapter module is similar to the mapping networks in StyleCLIP-LM. However, in our adapter modules, we modulate intermediate image feature maps extracted from the inversion model. After the modulation, the feature maps are fed back to the inversion model to be processed further. With this essential idea, we are able to add a text-conditional branch to existing GAN inversion models while preserving their unconditional inversion capabilities.


3 THE APPROACH

3.1 Overview of CLIPInverter

Our text-guided image editing framework includes two separate modules, namely CLIPAdapter and CLIPRemapper, each playing a different role in obtaining the desired edit. CLIPAdapter involves CLIP-conditioned adapter layers for the GAN inversion process, which are used for finding semantic editing directions in the latent space along which the given input image is manipulated. CLIPRemapper then performs a final refinement over the predicted latent code of the output image, considering the CLIP embedding of the input text prompt to further improve the manipulation accuracy as well as the perceptual quality.

Given an input image \(\mathbf {x}_{\mathbf {in}}\) and a desired target description \(\mathbf {t}_{\mathbf {target}}\), the goal of our CLIPInverter approach is to manipulate the input image and synthesize an output image \(\mathbf {x}_{\mathbf {out}}\) such that the end result reflects the attributes described in the text (e.g., hair color, age, and gender), while preserving the identity of the subject present in the original image or any other features not relevant to the description. Assuming that we have access to a StyleGAN2 [Karras et al. 2020] generator \(G\) that can synthesize images from a particular domain, we cast this text-guided manipulation task as finding a mapping of the input image \(\mathbf {x}_{\mathbf {in}}\) and the target text prompt \(\mathbf {t}_{\mathbf {target}}\) to a latent code \(\mathbf {w^*}\in \mathcal {W}+\) in the latent space of \(G\) so that when decoded it generates the manipulation result as \(\mathbf {x}_{\mathbf {out}} = G(\mathbf {w^*})\).

We perform the latent space mapping in two steps, using the unconditional and the conditional branches of the text-guided encoder, which we obtain by attaching CLIPAdapter to a pretrained image inversion network, namely encoder4editing (\(\text{e4e}\)) [Tov et al. 2021]. We first map the input image \(\mathbf {x}_{\mathbf {in}}\) to its latent code \(\mathbf {w}\) through the pretrained encoder \(\text{e4e}\). We then compute a residual latent vector \(\Delta \mathbf {w}\) through the conditional branch, which processes both the input image and the CLIP model [Radford et al. 2021] embedding of the textual description. The final image \(\mathbf {x}_{\mathbf {out}}\) is synthesized by passing the aggregated latent code first through the refinement module, \(\mathbf {w^*} = \text{CLIPRemapper}(\mathbf {w} + \Delta \mathbf {w})\), and then through a pretrained and frozen StyleGAN2 [Karras et al. 2020] generator. Specifically, CLIPRemapper applies this final correction to the latent code by predicting latent codes from the CLIP embedding of the target caption \(\mathbf {t}_{\mathbf {target}}\), which are then blended with the previously predicted latent code according to learned interpolation coefficients \(\alpha\).
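As a summary of this pipeline, a minimal PyTorch-style sketch of the inference flow is given below; the module and function names are illustrative placeholders rather than our actual implementation, and the adapter is shown as operating directly on the image for brevity even though it modulates intermediate encoder features (see Section 3.2):

```python
import torch

@torch.no_grad()
def clipinverter_edit(x_in, t_tokens, e4e, clip_adapter, clip_remapper, G, clip_model):
    """Illustrative sketch of the CLIPInverter inference flow (names are placeholders)."""
    t_emb = clip_model.encode_text(t_tokens)      # CLIP embedding of the tokenized target description
    w = e4e(x_in)                                 # unconditional inversion, e.g., shape (B, 18, 512)
    delta_w = clip_adapter(x_in, t_emb)           # text-conditioned residual latent code
    w_star = clip_remapper(w + delta_w, t_emb)    # CLIP-guided refinement of the aggregated code
    return G(w_star)                              # decode with the frozen StyleGAN2 generator
```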

In the following, we describe the details of the key modules of CLIPInverter and the loss functions we utilize during training.

3.2 CLIPAdapter: CLIP-guided Adapters for Latent Space Manipulation

Figure 3(a) shows the architecture of our proposed text-guided encoder, which follows the architecture of \(\text{e4e}\) with attached lightweight adapters that enable us to incorporate the textual descriptions. The original \(\text{e4e}\) architecture maps the input image to feature maps at three levels: coarse, medium, and fine. We introduce Adaptive Group Normalization (AdaGN) layers in CLIPAdapter, replacing the instance normalization in the Adaptive Instance Normalization (AdaIN) [Huang and Belongie 2017] layers, to modulate these features using features obtained from the CLIP [Radford et al. 2021] embedding of the target description.

Fig. 3.

Fig. 3. CLIPAdapter and CLIPRemapper modules of our CLIPInverter framework. Our text-guided image editing framework includes two key modules, CLIPAdapter and CLIPRemapper. CLIPAdapter employs CLIP-conditioned adapter layers within the GAN inversion process to find the semantic editing direction in the latent space. CLIPRemapper further refines the predicted edit direction to improve the manipulation accuracy, again based on the CLIP embedding of the input text prompt.

CLIPAdapter also employs shallow mapping networks, one for each level, to better align the multi-modal semantic space of the CLIP model with the \(\mathcal {W}+\) space of StyleGAN2. Specifically, we feed the text embedding obtained from the CLIP model to a multi-layer perceptron (MLP) that predicts the scale and shift parameters of the subsequent AdaGN blocks. Given the image features from the coarse, medium, and fine layers of the encoder, the AdaGN blocks perform feature modulation such that the outputs control the prediction of the residual latent codes.
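A minimal sketch of one such text-conditioned AdaGN block is shown below; the layer widths, activation, and group count are illustrative assumptions rather than the exact configuration used in our implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaGNBlock(nn.Module):
    """Modulates encoder feature maps with scale/shift parameters predicted from a CLIP text embedding."""

    def __init__(self, text_dim=512, feat_channels=512, num_groups=32):
        super().__init__()
        self.num_groups = num_groups
        # Shallow MLP mapping the CLIP text embedding to per-channel scale and shift.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, feat_channels),
            nn.LeakyReLU(0.2),
            nn.Linear(feat_channels, 2 * feat_channels),
        )

    def forward(self, feat, text_emb):
        # feat: (B, C, H, W) feature map from the encoder; text_emb: (B, text_dim)
        scale, shift = self.mlp(text_emb).chunk(2, dim=1)
        feat = F.group_norm(feat, self.num_groups)  # group normalization replaces instance normalization
        return feat * (1 + scale[..., None, None]) + shift[..., None, None]
```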

The design philosophy behind our encoder architecture is to insert adapter layers into a pretrained network so that, when computing the residual latent code that defines the manipulation direction in the \(\mathcal {W}+\) space, visual features relevant and irrelevant to the manipulation task can be identified in an image- and text-specific manner. Specifically, we factorize the layers of the \(\text{e4e}\) network into two groups: \(\text{e4e}_{\text{body}}\) and \(\text{e4e}_{\text{m2s}}\). While \(\text{e4e}_{\text{body}}\) includes the convolutional backbone layers and extracts a feature pyramid consisting of feature maps from coarse, medium, and fine levels, \(\text{e4e}_{\text{m2s}}\) consists of small convolutional mapping networks that transform these feature maps into the latent styles in the \(\mathcal {W}+\) space. We insert CLIPAdapter between \(\text{e4e}_{\text{body}}\) and \(\text{e4e}_{\text{m2s}}\).

More formally, to manipulate a given image \(\mathbf {x}_{\mathbf {in}}\) based on a text prompt \(\mathbf {t}_{\mathbf {target}}\), we start with obtaining the latent code \(\mathbf {w}\) of the original image in the \(\mathcal {W}+\) latent space of StyleGAN2 [Karras et al. 2020] via \(\text{e4e}\), (1) \(\begin{equation} \mathbf {w} = \text{e4e}(\mathbf {x}_{\mathbf {in}}) \in \mathbb {R}^{18 \times 512}. \end{equation}\)

To perform semantic edits on \(\mathbf {x}_{\mathbf {in}}\) to reflect the desired target look, we utilize the text-conditioned branch of our encoder network, which takes both the input image and the target textual description as input and outputs the residual latent code. During this process, we first extract intermediate feature maps \(\mathbf {c_i}\) from the body layers of the encoder network, \(\text{e4e}_{\text{body}}\), (2) \(\begin{equation} \mathbf {c_i} = \text{e4e}_{\text{body}}(\mathbf {x}_{\mathbf {in}}). \end{equation}\)

Next, we utilize the CLIP text embedding of the target text prompt \(\mathbf {t}_{\mathbf {target}}\) to modulate \(\mathbf {c_i}\), obtaining the modulated feature maps \(\mathbf {c_o}\) through our encoder-adapter layers CLIPAdapter: (3) \(\begin{equation} \mathbf {c_o} = \text{CLIPAdapter}(\mathbf {c_i}, \mathbf {t}_{\mathbf {target}}). \end{equation}\)

As the final step to predict the manipulation directions as residual latents \(\Delta \mathbf {w}\), we pass the modulated feature maps \(\mathbf {c_o}\) through the map2style layers of \(\text{e4e}\), \(\text{e4e}_{\text{m2s}}\), (4) \(\begin{equation} \Delta \mathbf {w} = \text{e4e}_{\text{m2s}}(\mathbf {c_o}) \in \mathbb {R}^{18 \times 512}. \end{equation}\)

Note that the body and map2style layers of \(\text{e4e}\) are combined to form the complete pretrained encoder, \(\text{e4e} = [\text{e4e}_\text{body}, \text{e4e}_\text{m2s}]\). The language conditioning happens in the adapter layers of CLIPAdapter, and these are the only layers with trained parameters in the inversion framework; the rest of the parameters remain frozen at their pretrained values.

3.3 CLIPRemapper: CLIP-guided Latent Vector Refinement

To further enhance the quality of the manipulated image, we introduce a final refinement step over the predicted latent code. As shown in Figure 3(b), our CLIPRemapper carries out this refinement process by mapping the CLIP text embedding of the given text prompt to the \(\mathcal {W}+\) space and then using the projected text embedding to steer the residual latent code predicted by CLIPAdapter toward a direction more compatible with the target text. Specifically, CLIPRemapper involves shallow mapping networks for each level to better align the image with the text. The text embedding obtained from CLIP is fed to MLPs at each stage to predict a component for latent code correction corresponding to the caption, as follows: (5) \(\begin{equation} \mathbf {\Delta \widehat{w}_i} = \text{MLP}_{i}(\mathbf {t}_\mathbf {target}). \end{equation}\)

Taking into account \(\Delta \mathbf {\widehat{w}_i}\), we apply a further correction to the residual latent code predicted through CLIPAdapter as (6) \(\begin{equation} \mathbf {\Delta {w_i}^{\prime }} = \frac{\left(\alpha _{i} * \Delta \mathbf {{w_i}} + (1 - \alpha _{i}) * \Delta \mathbf {\widehat{w}_i}\right)*\Vert \Delta \mathbf {{w_i}}\Vert }{\Vert {\alpha _{i} * \Delta \mathbf {{w_i}} + (1 - \alpha _{i}) * \Delta \mathbf {\widehat{w}_i}}\Vert }, \end{equation}\) where \(\alpha _i\) is a weighting factor that is defined as a learnable parameter and \(\Delta \mathbf {{w_i}^{\prime }}\) represents the final corrected residual latent code.

In particular, the corrected residual latent code \(\Delta \mathbf {{w_i}^{\prime }}\) is obtained by taking a linear combination of two separate codes, the residual latent code \(\Delta \mathbf {{w_i}}\) from CLIPAdapter and the vector \(\Delta \mathbf {\widehat{w}_i}\), followed by a normalization. We do not want the refinement procedure to make substantial changes in the predicted latent code. Hence, along with the loss functions introduced in the next section, the normalization further enforces the final latent code \(\Delta \mathbf {{w_i}^{\prime }}\) to stay in the vicinity of the residual latent code predicted in the previous step. We only make the necessary changes along the semantic directions suggested by the CLIP embedding of the target text \(\mathbf {t}_\mathbf {target}\), through a simple composition process in the StyleGAN latent space.
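A minimal sketch of this blending and renormalization step (Equations (5) and (6)), assuming the per-level residuals are stacked into tensors and the coefficients \(\alpha_i\) are stored as a learnable vector, is given below:

```python
import torch

def remap_residual(delta_w, delta_w_hat, alpha, eps=1e-8):
    """Blend the CLIPAdapter residual with the text-predicted residual and renormalize (Eq. (6)).

    delta_w, delta_w_hat: (num_levels, 512) residual latent codes
    alpha: (num_levels, 1) learnable blending coefficients
    """
    blended = alpha * delta_w + (1 - alpha) * delta_w_hat
    # Rescale so that each corrected residual keeps the norm of the original residual at that level.
    scale = delta_w.norm(dim=-1, keepdim=True) / (blended.norm(dim=-1, keepdim=True) + eps)
    return blended * scale
```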

CLIPRemapper effectively integrates the local inductive bias of the target description and the desired visual characteristics for the source image as suggested by the target description. In structured domains such as human faces, the residual latent code \(\Delta \widehat{w}\), obtained in an image-blind manner from the target description, produces interpretable results. This process, as demonstrated in Figure 4, combines the manipulated image generated by CLIPAdapter with a generic image that predominantly exhibits the characteristics mentioned in the target description, leading to further improvements in both manipulation accuracy and perceptual quality. In less structured domains, e.g., birds, \(\Delta \widehat{w}\) may not be interpretable, yet it still provides some improvements to the manipulations. Additional visualizations for cat and bird images can be found in the supplementary material.

Fig. 4.

Fig. 4. Visualization of the latent code correction operation via CLIPRemapper. For two sample images, we show the initial editing results generated solely by CLIPAdapter, the generic images generated via CLIPRemapper, and the final manipulations by CLIPInverter obtained by the suggested correction scheme. Our refinement module works as intended, providing edits more consistent with the target descriptions.

3.4 Training Losses

We train our proposed \(\text{CLIPInverter}\) model on a training set of images paired with their corresponding textual descriptions \(\lbrace (\mathbf {x}_\mathbf {in},\mathbf {t}_\mathbf {real})\rbrace\). Specifically, we employ a cyclic adversarial training strategy [Zhu et al. 2017], which involves two separate manipulation steps. In the first one, we feed the original input image \(\mathbf {x}_\mathbf {in}\) along with a target textual description \(\mathbf {t}_{target}\) (which does not match the input image) to our model. This process generates a manipulated image \(\mathbf {x}_\mathbf {out} = \text{CLIPInverter}(\mathbf {x}_\mathbf {in},\mathbf {t}_\mathbf {target})\). In the cyclic pass, we take this manipulated image \(\mathbf {x}_\mathbf {out}\) and the original text description \(\mathbf {t}_\mathbf {real}\) (which describes the original input image \(\mathbf {x}_\mathbf {in}\)) as inputs to obtain \(\widehat{\mathbf {x}}_\mathbf {in} = \text{CLIPInverter}(\mathbf {x}_\mathbf {out},\mathbf {t}_\mathbf {real})\). We expect \(\widehat{\mathbf {x}}_\mathbf {in}\) to closely resemble the original image \(\mathbf {x}_\mathbf {in}\) by enforcing cycle consistency. We obtain the target text description by rolling the minibatch, meaning that each image is paired with the textual description of the next image in the minibatch (see the sketch below). We train our model with a set of loss functions, each of which is used both in the first manipulation pass and in the following cycle pass. In the following, we only describe the losses for the first manipulation pass for simplicity of presentation.
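The rolled-minibatch pairing and the two passes can be sketched as follows (illustrative names; the model is assumed to take precomputed CLIP text embeddings):

```python
import torch

def cyclic_training_step(x_in, t_real_emb, model):
    """One training step: manipulation pass with rolled captions, then a cycle pass back."""
    # Pair each image with the caption of the next image in the minibatch.
    t_target_emb = torch.roll(t_real_emb, shifts=1, dims=0)

    x_out = model(x_in, t_target_emb)   # first manipulation pass
    x_rec = model(x_out, t_real_emb)    # cycle pass: x_rec should resemble x_in
    return x_out, x_rec
```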

We use \(\mathcal {L}_{2}\) and \(\mathcal {L}_{\mathrm{LPIPS}}\) [Zhang et al. 2018] losses to respectively enforce pixelwise and perceptual similarities between the input and the manipulated image, such that (7) \(\begin{equation} \mathcal {L}_{\mathrm{2}} = \Vert \mathbf {x}_{in}-\mathbf {x}_{out}\Vert _{2}, \end{equation}\) (8) \(\begin{equation} \mathcal {L}_{\mathrm{LPIPS}} = \Vert F(\mathbf {x}_{in})-F(\mathbf {x}_{out})\Vert _{2}, \end{equation}\) where \(F(\cdot)\) denotes deep features extracted from a pretrained AlexNet [Krizhevsky et al. 2012] model.

Ideally, we want any manipulation to preserve the identity of the subject in the original image. To preserve the identity, we employ an identity loss that maximizes the cosine similarity between the input image and the output image feature embeddings: (9) \(\begin{equation} \mathcal {L}_{\mathrm{ID}}=1-\langle R(\mathbf {x}_{in}), R(\mathbf {x}_{out})\rangle , \end{equation}\) where \(\langle \cdot , \cdot \rangle\) represents the cosine similarity between the feature vectors and \(R\) denotes a pretrained deep network. Specifically, we use the pretrained ArcFace [Deng et al. 2019] network for human faces, and a ResNet50 [He et al. 2015] network trained with MOCOv2 [Chen et al. 2020] for birds and cats.

We also employ the following regularization loss, which enforces the predicted latent codes to be close to the average latent code of the generator and was shown to improve overall image quality in previous work [Richardson et al. 2021], such that (10) \(\begin{equation} \mathcal {L}_{\text{reg}}=\Vert \mathbf {w^*}-\overline{\mathbf {w}}\Vert _{2}, \end{equation}\) where \(\mathbf {w^*}\) and \(\overline{\mathbf {w}}\) are the aggregated and the average latent codes, respectively.

Last, to enforce the similarity between the output image and the target description, we employ a directional CLIP loss [Gal et al. 2021]. Rather than directly minimizing the distance between the generated image \(\mathbf {x}_{out}\) and the text prompt \(\mathbf {t}_{target}\) in the CLIP space, directional CLIP loss aligns the direction from the input image \(\mathbf {x}_{in}\) to the manipulated image \(\mathbf {x}_{out}\) with the direction from the original text description \(\mathbf {t}_{real}\) to the target text description \(\mathbf {t}_{target}\): (11) \(\begin{equation} \begin{aligned} \Delta T &= E_{\mathrm{CLIP,T}}\left(\mathbf {t}_{target}\right)-E_{\mathrm{CLIP,T}}\left(\mathbf {t}_{real}\right),\\ \Delta I &= E_{\mathrm{CLIP,I}}\left(\mathbf {x}_{out}\right)-E_{\mathrm{CLIP,I}}\left(\mathbf {x}_{in}\right),\\ \mathcal {L}_{\text{direction}} &= 1-\frac{\Delta I \cdot \Delta T}{|\Delta I||\Delta T|}, \end{aligned} \end{equation}\) where \(E_{\mathrm{CLIP,T}}\) and \(E_{\mathrm{CLIP,I}}\) are the text and image encoders of CLIP, respectively.
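A minimal sketch of this directional loss, assuming the OpenAI CLIP interface (`encode_image`/`encode_text`) and images already preprocessed to CLIP's input resolution, is shown below:

```python
import torch
import torch.nn.functional as F

def directional_clip_loss(clip_model, x_in, x_out, tok_real, tok_target):
    """1 - cosine similarity between the image-space and text-space edit directions (Eq. (11))."""
    delta_t = clip_model.encode_text(tok_target) - clip_model.encode_text(tok_real)
    delta_i = clip_model.encode_image(x_out) - clip_model.encode_image(x_in)
    return 1.0 - F.cosine_similarity(delta_i, delta_t, dim=-1).mean()
```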

Our final loss function for the first manipulation pass is a weighted sum of the objectives: (12) \(\begin{equation} \mathcal {L}_{\mathrm{manipulation}} = \lambda _{1}\mathcal {L}_{2} + \lambda _{2}\mathcal {L}_{\mathrm{LPIPS}} + \lambda _{3}\mathcal {L}_{\mathrm{ID}} + \lambda _{4}\mathcal {L}_{\mathrm{reg}} + \lambda _{5}\mathcal {L}_{\mathrm{direction}} , \end{equation}\) where each \(\lambda _{i}\) determines the weight of the corresponding objective. The total loss including the first manipulation and the follow-up cycle passes is the following: (13) \(\begin{equation} \mathcal {L}_{\mathrm{total}} = \mathcal {L}_{\mathrm{manipulation}} + \lambda _6 \mathcal {L}_{\mathrm{cyclic}}, \end{equation}\) where \(\mathcal {L}_{\mathrm{cyclic}}\) is the cyclic consistency loss, which contains the same loss terms as \(\mathcal {L}_{\mathrm{manipulation}}\) in which \(\mathbf {x}_{out}\) is replaced with \(\widehat{\mathbf {x}}_{in}\), and \(\lambda _{6}\) is the weight for this cyclic loss.

During training, we follow a multi-stage regime. We first train CLIPAdapter (without using CLIPRemapper). Once it is fully trained, we freeze the weights of CLIPAdapter and train CLIPRemapper, optimizing the CLIP loss along with the L2, Learned Perceptual Image Patch Similarity (LPIPS), and Identity Similarity (ID) losses. For the LPIPS and L2 losses, we also include a loss between images generated with and without CLIPRemapper, which ensures that CLIPRemapper does not change the images drastically. In addition, we include an L2 regularization loss on the interpolation coefficients \(\alpha_i\) so that the interpolation between the two latent codes does not move far from the original code. This is also observed to remove artifacts in the generated images.


4 EXPERIMENTAL EVALUATION

4.1 Datasets

We conduct extensive evaluation on a variety of domains to illustrate the generalizability of our approach. We use the Multi-Modal CelebA-HQ [Lee et al. 2020; Xia et al. 2021a] dataset to train our model on the domain of human faces. This dataset consists of 30,000 images along with 10 textual descriptions for each image. We follow the default train/test split, using 6,000 images for testing and the remaining for training. For the birds domain, we use the CUB Birds dataset [Wah et al. 2011], which contains 11,788 images in total, including 2,933 images for testing, along with 10 captions for each image. Finally, for the domain of cat faces, we use the AFHQ-Cats dataset [Choi et al. 2020], which contains a total of 5,653 images, including 500 for testing. The captions for this dataset are generated using the approach mentioned in Nie et al. [2021] leveraging the CLIP [Radford et al. 2021] model.

4.2 Training Details

We use two pretrained models for each dataset: a StyleGAN2 generator and an \(\text{e4e}\) encoder. Keeping the weights of these models frozen, we train CLIPInverter using the cyclic adversarial training scheme described in the previous section. The mismatching captions are sampled in such a way that the matching caption for an image is sampled 25% of the time during training. In our experiments, for CLIPAdapter, we empirically set \(\lambda _1 = 1.0\), \(\lambda _2 = 0.6\), \(\lambda _3 = 0.1\), \(\lambda _4 = 0.005\), \(\lambda _5 = 1.0\), and \(\lambda _6 = 1.0\) and the learning rate to 0.0005. For CLIPRemapper, we increase the weight of the identity loss to \(\lambda _3~=~0.5\) and completely exclude the regularization loss during training. We initialize the linear coefficients \(\alpha _i\) with 0.05 and train them together with the parameters of CLIPRemapper. We train CLIPAdapter for 200k iterations on a single Tesla V100 GPU, which takes about 6 days, and CLIPRemapper for 20k iterations, which takes about a day.
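For reference, the reported loss weights and optimization settings can be collected into a simple configuration; the CLIPRemapper weights not mentioned above are assumed to stay unchanged:

```python
# Loss weights for CLIPAdapter training (as reported above).
ADAPTER_LOSS_WEIGHTS = dict(l2=1.0, lpips=0.6, id=0.1, reg=0.005, direction=1.0, cyclic=1.0)

# For CLIPRemapper training, the identity weight is raised and the regularization term dropped;
# the remaining weights are assumed unchanged.
REMAPPER_LOSS_WEIGHTS = dict(l2=1.0, lpips=0.6, id=0.5, reg=0.0, direction=1.0, cyclic=1.0)

LEARNING_RATE = 5e-4
ALPHA_INIT = 0.05  # initialization of the learnable blending coefficients
```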

4.3 Evaluation Metrics

Quantitative analysis of the language-guided image manipulation task is challenging. The quality and photorealism of the generated images can be evaluated with FID [Heusel et al. 2017]. However, there is no established way to evaluate the manipulation accuracy of a model. An effective model should only alter the attributes specified in the target text prompt, while preserving the original attributes in the rest of the input image. Hence, we also use the ID similarity [Deng et al. 2019] to assess identity preservation.

To evaluate model accuracy in terms of these aspects, we propose two metrics: AMA and CMP. Attribute Manipulation Accuracy measures how accurately a model can apply attribute manipulations. For face images, we train an attribute classifier using the images and their attribute annotations from the CelebA [Liu et al. 2015] dataset, following Nie et al. [2021]. Based on the validation accuracy of the classifier on different attributes, we select 15 of the best-performing attributes, such as blond hair, chubby, and mustache, out of the 40 included in CelebA (see the appendix for the full list). We use two versions of the AMA score. AMA-Single measures the accuracy of single-attribute manipulations. To evaluate this, we generate 50 image manipulations for each of the 15 selected attributes, resulting in a total of 750 images. For each manipulation, we employ pre-defined text prompts that specifically mention the attribute of interest, such as “This person has blond hair.” The accuracy is then determined by assessing how well the generated images align with the intended attribute manipulation. We evaluate the accuracy of these manipulations using the attribute classifier and take the mean of the accuracy across all attributes to obtain the final AMA score for that model. AMA-Multiple evaluates the accuracy of multiple-attribute manipulations. We generate target descriptions that involve combinations of two or three attributes and perform 50 image manipulations for each combination, resulting in a total of 350 images. We consider a manipulation successful only when the resulting changes can be accurately classified by the corresponding attribute classifiers, where a classification is deemed correct if the attribute score surpasses a threshold of 0.90.
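A minimal sketch of the AMA-Multiple success check, assuming a pretrained attribute classifier that outputs per-attribute probabilities, is given below:

```python
import torch

def ama_multiple(attr_probs, target_attr_indices, threshold=0.90):
    """Counts a multi-attribute manipulation as successful only if every targeted attribute
    is classified above the threshold.

    attr_probs: (B, num_attributes) sigmoid outputs of the attribute classifier
    target_attr_indices: indices of the attributes mentioned in the target caption
    """
    selected = attr_probs[:, target_attr_indices]   # (B, num_targets)
    success = (selected > threshold).all(dim=1)     # per-image success flags
    return success.float().mean().item()            # fraction of successful manipulations
```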

For cat and bird images, we use CLIP as a zero-shot classifier to calculate AMA. We employ 30 attributes present in the AFHQ-Cats [Choi et al. 2020] dataset and sample 40 of the 273 attributes present in the CUB [Wah et al. 2011] dataset. For each selected attribute, we generate template-based captions covering all the classes in the category to which the attribute belongs. Then, we prompt CLIP with the output image and the generated captions to obtain similarity scores for each caption. The manipulation is deemed successful if the caption with the correct label has the highest probability after applying softmax to the similarity scores.
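The zero-shot check can be sketched as follows, assuming the OpenAI CLIP package and template captions covering all classes of the attribute's category:

```python
import torch
import clip  # OpenAI CLIP package (assumed available)

@torch.no_grad()
def clip_zero_shot_correct(clip_model, image, captions, correct_idx, device="cuda"):
    """Returns True if the caption with the correct attribute label receives the highest score."""
    tokens = clip.tokenize(captions).to(device)
    image_feat = clip_model.encode_image(image)         # image assumed preprocessed for CLIP
    text_feat = clip_model.encode_text(tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (image_feat @ text_feat.T).softmax(dim=-1)  # similarity scores -> probabilities
    return probs.argmax(dim=-1).item() == correct_idx
```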

CLIP Manipulative Precision is a modified version of the Manipulative Precision metric proposed by ManiGAN [Li et al. 2020] that uses the pre-trained CLIP [Radford et al. 2021] image and text encoders. CMP measures how aligned the synthesized image is with the target text prompt \(\mathbf {t_{target}}\) and how well the original contents of the input image are preserved. It is defined as (14) \(\begin{equation} \text{CMP} = (1 - \text{diff}) * \text{sim}, \end{equation}\) where diff is the \(\mathcal {L}_{\mathrm{1}}\) pixel difference between the input image \(\mathbf {x_{in}}\) and the output image \(\mathbf {x_{out}}\), and sim is the CLIP similarity between the output image \(\mathbf {x_{out}}\) and the target textual description \(\mathbf {t_{target}}\). We calculate the CMP for each of the images generated for the AMA score and take their average to obtain the final CMP score for the corresponding model.
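Under these definitions, CMP can be computed as sketched below (illustrative; images are assumed to be scaled to [0, 1] for the pixel difference and preprocessed separately for CLIP):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_manipulative_precision(clip_model, x_in, x_out, x_out_clip, text_tokens):
    """CMP = (1 - L1 pixel difference) * CLIP similarity between the output and the target text."""
    diff = (x_in - x_out).abs().mean()                  # L1 pixel difference on [0, 1] images
    img_feat = clip_model.encode_image(x_out_clip)      # CLIP-preprocessed output image
    txt_feat = clip_model.encode_text(text_tokens)
    sim = F.cosine_similarity(img_feat, txt_feat, dim=-1).mean()
    return ((1.0 - diff) * sim).item()
```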

4.4 Qualitative Results

In Figure 5, we show that our method can manipulate images from very different domains such as human faces, cats, and birds. Given an input image, we manipulate it by simply providing a natural textual description highlighting the desired edits. As can be seen in the figure, the target descriptions can specify more than one attribute. For instance, one can simultaneously apply lipstick while changing the hair style of a woman, or alter the attitude and appearance of a cat at the same time.

Fig. 5.

Fig. 5. Qualitative manipulation results. We show sample text-guided manipulation results on human faces (left), cat images (middle), and bird images (right). Our approach successfully makes local semantic edits based on the target descriptions while keeping the generated outputs faithful to the input images. The images displayed on the left side are the inversion results obtained with the e4e encoder.

Our method can give plausible results regardless of the complexity of the provided target description. For instance, in Figure 6, we present the outcomes of our approach for compositions of different visual attributes. These results demonstrate that our method can handle the provided compositions and fully apply the changes mentioned in the descriptions to the original input images.

Fig. 6.

Fig. 6. More qualitative results. We provide example manipulation results where we apply various compositions of several facial attributes as target descriptions.

In Figure 7, we demonstrate that predicting a residual latent code for a given target description has the advantage that one can continuously interpolate between the original image and the final result, which allows users to control the degree of change made during the manipulation process. For example, the appearance of the subjects smoothly changes to reflect the increase in the intensity of the lipstick, and the color of the cats and the bird slightly changes.
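This interpolation simply amounts to scaling the predicted residual before decoding, as in the following sketch (assuming the corrected residual latent code is available as a tensor):

```python
import torch

def interpolate_edit(G, w, delta_w, steps=5):
    """Decode intermediate edits by walking from the original latent code along the residual."""
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        frames.append(G(w + t * delta_w))   # t=0: inversion, t=1: full manipulation
    return frames
```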

Fig. 7.

Fig. 7. Continuous manipulation results. We show that starting from the latent code of the original image and walking along the predicted residual latent codes, we can naturally obtain smooth image manipulations, providing control over the end result. For reference, we provide the original (left) and the target descriptions (right) below each row.

To some extent, our approach can also perform edits in a zero-shot setting, using descriptions never seen during training. The key to this ability lies in the CLIP-based text-guided adapters, which enable us to align the visual and textual domains and map out-of-domain textual descriptions to a semantic editing direction in the latent space. Hence, even if the terms in a target description are observed for the first time, our method can make the necessary changes in the input image as long as semantically similar terms have been seen during training. For instance, in Figure 8, we include a number of cases where the color or the structure of the hair is manipulated using novel descriptions that do not exist in the training set, such as curly hair, silver hair, and facial hair.

Fig. 8.

Fig. 8. Additional manipulation results with out-of-distribution descriptions. We demonstrate that our CLIPInverter method can perform manipulations with target descriptions involving words never seen during training but semantically similar to the observed ones.

In our proposed CLIPAdapter, we employ CLIP embeddings of the text prompt to modulate the convolutional feature maps and predict the residual latent code, representing the changes required on the input image to meet the desired target description. In fact, the CLIP model learns the alignment between images and text via a contrastive learning objective and discovers a common semantic space. Hence, our framework also allows for using exemplar images as the conditioning element without any changes or additional training. In Figure 9, we provide some qualitative results for such image-based manipulations performed by our approach. We observe that although no further training is done with reference images instead of target descriptions, our model performs well at transferring the appearance of the provided reference images to the input images.

Fig. 9.

Fig. 9. Image-based manipulation results. Our framework allows for using a reference image as the conditioning input for editing. In the figure, these reference images are given at the top-right. Results on different domains illustrate that our model can transfer the look of the conditioning images to the provided input images.

We refer readers to the supplementary material for more manipulation results.

4.5 Qualitative Comparisons to Other Text-guided Manipulation Methods

We compare our approach with various existing methods, including TediGAN [Xia et al. 2021a], StyleCLIP [Patashnik et al. 2021], StyleMC [Kocasari et al. 2021], and HairCLIP [Wei et al. 2022]. For StyleCLIP, we use the latent optimization-based model StyleCLIP-LO, and for TediGAN, we use the CLIP-based optimization approach (TediGAN-B). In all of our experiments, we use the public implementations provided by the authors. For HairCLIP, we slightly modify its neural architecture and train it accordingly. In the original paper, the authors consider different conditioning vectors for the mapper modules encoding hairstyle and hair color, as these refer to details at different scales. Since we focus on a generic text-guided manipulation process where it is hard to separate the textual terms into fine-, mid-, and high-level attributes, we let the CLIP embedding of the whole target description condition the mappers equally. All of these approaches use StyleGAN2 as a frozen generator and utilize the CLIP embedding to measure image-text similarity.

In Figure 10, we provide some qualitative comparisons between our method and the baselines on a number of human face images. As can be seen from the figure, our approach gives more accurate edits than the existing methods, especially for captions that describe multiple attribute manipulations. For instance, for the first image, our model is able to make meaningful changes to the original input image to reflect the look depicted in the target description, applying the gender change as well as changes in the eyebrows, hair, eyes, lips, and outfit. For the second input image, our model is able to generate the smile and the lipstick, while most of the other methods fail to apply both changes at the same time. In the last two examples, our manipulation results again reflect the given target descriptions much better than those of the competing approaches. Our method manipulates the gender, hair color, eyebrows, and age of the man and applies makeup. Similarly, it generates a smile for the woman and makes her wear a jacket, which is in line with the necktie mentioned in the description. In Figure 11, we further compare our results with those of the TediGAN-B, StyleCLIP-LO, and HairCLIP methods on bird and cat images. As with human faces, our model is able to generate visually more pleasing and relevant results than the competing approaches. For instance, our model is able to capture the yellow-greenish color mentioned in the description for the bird in the third row and the fearful look for the cat in the first row, while the other methods produce poor manipulations. For birds and cats, we could not provide any comparison against StyleCLIP-GD and StyleMC, as their codebases use a different implementation of StyleGAN and do not provide pretrained models for these datasets. In the supplementary material, we provide additional visual comparisons.

Fig. 10.

Fig. 10. Comparison against the state-of-the-art text-guided manipulation methods. Our method applies the target edits mentioned in the given descriptions much more accurately than the competing approaches, especially when there are multiple attributes present in the descriptions.

Fig. 11.

Fig. 11. Comparisons against other approaches on bird and cat images. As compared to TediGAN, our model generates reasonable manipulation results that are more consistent with the given target descriptions.

4.6 Quantitative Comparisons to Other Text-guided Manipulation Methods

We quantitatively compare our approach to the same methods considered in the qualitative comparisons, namely TediGAN [Xia et al. 2021a], StyleCLIP-LO and StyleCLIP-GD [Patashnik et al. 2021], StyleMC [Kocasari et al. 2021], and HairCLIP [Wei et al. 2022]. We use the metrics described in Section 4.3 (FID, AMA, and CMP), along with ID similarity, for these quantitative comparisons. The official PyTorch implementation [Seitzer 2020] is utilized to calculate the FID scores. The AMA and CMP scores are calculated using the procedure described in Section 4.3.

Table 1 shows the quantitative comparisons of our model against various state-of-the-art approaches. TediGAN-B achieves fairly good FID and CMP scores. However, the qualitative results show that TediGAN-B exploits adversarial ways of optimizing the CLIP similarity: it changes the input pixels very little, failing to apply the manipulations, and often produces distorted images.

Table 1. Quantitative Comparisons on the CelebA Dataset

Method         FID \(\downarrow\)   CMP \(\uparrow\)   AMA (Single) \(\uparrow\)   AMA (Multiple) \(\uparrow\)   ID \(\uparrow\)
TediGAN-B      55.424   0.285   11.286   1.142    37.97
StyleCLIP-LO   80.833   0.210   15.857   3.429    29.69
StyleCLIP-GD   82.393   0.191   33.143   11.429   57.37
StyleMC        84.088   0.187   12.143   2.857    30.05
HairCLIP       93.523   0.218   41.571   15.143   57.50
Ours           97.210   0.221   61.429   41.714   52.14

  • Our approach exhibits superior manipulation accuracy compared to other methods, particularly for manipulations involving multiple attributes, while maintaining a comparable level of perceptual quality. The best and second-best performing models are highlighted in bold and underlined, respectively.

While performing well in terms of one or two metrics, the competing approaches usually fail to be competitive across all four. StyleCLIP-LO achieves a fairly comparable CMP, since it optimizes the CLIP similarity for each instance, and a good FID score, but it fails to apply the given attribute manipulations accurately. StyleMC also achieves a good FID score, since it finds directions in the \(\mathcal {S}\) space, but it likewise fails to produce accurate manipulations. Even though StyleCLIP-GD performs better than these two models, its performance still falls behind that of our approach. Finally, HairCLIP achieves the best scores among the competing approaches. The results demonstrate the superiority of our model over HairCLIP, as our method achieves much higher manipulation accuracies while remaining competitive in terms of the FID and ID scores. Our approach finds a good balance between distortion and editability, applying manipulations successfully while remaining comparable in terms of photorealism, and it is therefore able to achieve good scores across all four metrics.

Table 2 presents the quantitative comparisons on the AFHQ-Cats and CUB datasets. Since CLIP is used as a similarity metric in CMP and as a zero-shot classifier in AMA, TediGAN-B again achieves very good scores on these two metrics. However, as the FID scores and the results shown in Figure 11 indicate, it produces highly blurred and non-realistic outputs that are not actually in line with the target descriptions. Another optimization-based method, StyleCLIP-LO, achieves worse AMA and CMP scores than TediGAN-B but a better FID. Its loss functions allow the model to produce realistic outputs, but it fails to apply the manipulations successfully, as can also be seen in Figure 11. HairCLIP generates images that are better aligned with the descriptions than the aforementioned methods. However, our approach outperforms HairCLIP by a large margin in terms of CMP and AMA while having fairly close or even better FID values. We underline the second-best performing model for each metric to highlight the advantage of our approach, since the best-performing models usually either exploit adversarial ways of optimizing the CLIP similarity, which yields high CMP and AMA values, or fail to apply the manipulations, which yields better FID values.

Table 2. Quantitative Comparisons on the AFHQ-Cats and CUB Datasets

               AFHQ-Cats                                  CUB
Method         FID \(\downarrow\)   CMP \(\uparrow\)   AMA \(\uparrow\)   FID \(\downarrow\)   CMP \(\uparrow\)   AMA \(\uparrow\)
TediGAN-B      39.414   0.255   82.467   42.007   0.233   59.500
StyleCLIP-LO   18.771   0.226   48.133   19.209   0.211   27.000
HairCLIP       21.087   0.227   44.667   26.447   0.218   57.050
Ours           24.172   0.245   76.467   25.837   0.221   66.000

  • Our approach demonstrates superior manipulation accuracy compared to other methods, while also preserving a comparable perceptual quality. The best and second-best performing models are highlighted in bold and underlined, respectively.

We additionally conduct a user study via Qualtrics to evaluate the performance of our approach and all the competing methods. Specifically, this user study focuses on two important aspects: (1) the accuracy of the edits with respect to the given target descriptions, and (2) the photorealism of the manipulated images. We randomly generate 48 questions and divide them into 3 groups of 16 questions each, making sure that at least 14 different subjects answer each group of questions. To measure accuracy, we show the users an input image, a target description, and the manipulation results of all the competing methods, and ask them to rank the results with respect to how consistent the edits are with the provided description. The participants do this by dragging the images into their preferred order, where the left-most position refers to the worst result (rank 1) and the right-most one to the best outcome (rank 6). To avoid any bias in the evaluation, the outputs of the methods are displayed in random order each time. For the questions regarding photorealism, we design a similar ranking task, but this time we show all the results in random order and ask the participants to order them with respect to how realistic they look. Please refer to the supplementary material for a screenshot of the user study shown to the participants.
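The snippet below is a small, self-contained sketch of how such per-question rankings can be aggregated into the average rank scores reported in Table 3; the `responses` list and its entries are hypothetical and only show the expected data layout.

```python
# Hypothetical sketch of aggregating per-question rankings (1 = worst, 6 = best)
# into the average rank scores reported in Table 3.
from collections import defaultdict

def average_ranks(responses):
    totals, counts = defaultdict(float), defaultdict(int)
    for ranking in responses:  # one dict per participant and question
        for method, rank in ranking.items():
            totals[method] += rank
            counts[method] += 1
    return {method: totals[method] / counts[method] for method in totals}

# Two made-up responses, only to illustrate the data layout.
responses = [
    {"TediGAN-B": 1, "StyleCLIP-LO": 3, "StyleMC": 4, "StyleCLIP-GD": 2, "HairCLIP": 5, "Ours": 6},
    {"TediGAN-B": 2, "StyleCLIP-LO": 4, "StyleMC": 3, "StyleCLIP-GD": 1, "HairCLIP": 6, "Ours": 5},
]
print(average_ranks(responses))
```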

Table 3 summarizes the results of our study, where the average ranking scores are reported. We find that, in terms of accuracy, the human subjects prefer our proposed method over all the competing approaches; that is, our method makes only the necessary edits in the input images with respect to the given target descriptions in a precise manner. HairCLIP and StyleCLIP-GD give the next most accurate results after our model. In terms of photorealism, our results are also rated higher than those of these two approaches, indicating that our results are both accurate and photo-realistic. That said, the human subjects rate the photorealism of the StyleMC and StyleCLIP-LO results significantly higher. However, the accuracy questions indicate that both StyleMC and StyleCLIP-LO have difficulty manipulating the given input images according to the target descriptions, in contrast to our proposed model. StyleMC and StyleCLIP-LO generally make minimal, mostly insufficient changes to the input images (as can also be seen in Figure 10), and thus do not degrade photorealism much.

Table 3. User Study Results

Task    TediGAN-B   StyleCLIP-LO   StyleMC   StyleCLIP-GD   HairCLIP   Ours
Acc.    1.848       3.401          3.526     3.611          4.015      4.598
Real.   1.218       4.604          4.282     3.609          3.544      3.743

  • The table reports the average rankings of the methods with respect to accuracy and realism, where higher values are better. The participants favor the results of our proposed model over the current state of the art when the accuracy of the manipulations is considered.

4.7 Ablation Study

When training our model, we leverage several loss terms. To analyze their contributions, we perform an ablation study in which we either remove or modify some of these loss terms during training. We provide visual comparisons between the resulting models, each trained with a different combination of loss terms, in Figure 12.

Fig. 12. Qualitative results for the ablation study. The global CLIP loss leads to unintuitive and unnatural results. Without perceptual losses, unwanted manipulations occur. Without the cycle pass or CLIPRemapper, we are not able to apply all the desired manipulations.

First, we employ the directional CLIP loss following [Gal et al. 2021] to better enforce the similarity between the image and the description. Compared to the global CLIP loss, which directly minimizes the distance between the manipulated image \(\mathbf {x}_{out}\) and the text prompt \(\mathbf {t}_{target}\) in the CLIP space, the directional CLIP loss aligns the direction between the source and target descriptions with the direction between the input and output images. As can be seen in the second column of Figure 12, the global CLIP loss suffers from artificial-looking manipulations and results in poorly constructed facial attributes compared to the directional CLIP loss.
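The following sketch contrasts the two objectives; it assumes CLIP-preprocessed image batches and uses the public OpenAI CLIP API, and it is an illustrative formulation rather than our exact training code.

```python
# Illustrative formulations of the global and directional CLIP losses discussed
# above (the latter following Gal et al. [2021]). Images are assumed to be
# CLIP-preprocessed batches; this is a sketch, not the exact training code.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
cosine = torch.nn.CosineSimilarity(dim=-1)

def encode_image(x: torch.Tensor) -> torch.Tensor:
    return clip_model.encode_image(x).float()

def encode_text(t: str) -> torch.Tensor:
    return clip_model.encode_text(clip.tokenize([t]).to(device)).float()

def global_clip_loss(x_out, t_target):
    # Pulls the edited image directly toward the target prompt in CLIP space.
    return 1.0 - cosine(encode_image(x_out), encode_text(t_target)).mean()

def directional_clip_loss(x_in, x_out, t_source, t_target):
    # Aligns the image-space edit direction with the text-space edit direction.
    delta_i = encode_image(x_out) - encode_image(x_in)
    delta_t = encode_text(t_target) - encode_text(t_source)
    return 1.0 - cosine(delta_i, delta_t).mean()
```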

Second, to preserve the features and details of the input image in the regions we do not wish to modify, we employ the pixel-wise \(\mathcal {L}_{\mathrm{2}}\) and perceptual \(\mathcal {L}_{\mathrm{LPIPS}}\) losses between the input and output images. In principle, these reconstruction terms counteract the directional CLIP loss, since the CLIP loss enforces image-text similarity by changing pixel values. To analyze the contribution of these terms, we reduce their weights in the overall objective. The third column of Figure 12 shows a manipulation example from this experiment: the smile in the first row is modified and the hair is changed to curly in the second row, even though these manipulations were not mentioned in the target description. This experiment demonstrates the necessity of these reconstruction terms for preventing unwanted manipulations.
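A minimal sketch of these reconstruction terms is given below, using the publicly available lpips package [Zhang et al. 2018]; the loss weights and input conventions are illustrative assumptions, not the exact values used in our experiments.

```python
# Sketch of the reconstruction terms that keep untouched regions intact, using
# the publicly available lpips package [Zhang et al. 2018]. Inputs are assumed
# to be aligned image tensors in [-1, 1]; the loss weights are illustrative.
import torch
import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # learned perceptual distance

def reconstruction_losses(x_in, x_out, w_l2=1.0, w_lpips=0.8):
    l2 = F.mse_loss(x_out, x_in)               # pixel-wise L2 term
    perceptual = lpips_fn(x_out, x_in).mean()  # LPIPS perceptual term
    return w_l2 * l2 + w_lpips * perceptual
```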

Third, we employ a cyclic-adversarial training strategy, where we first manipulate the image with a mismatching caption and then recover it by manipulating the output of the first pass with the matching target description. The fourth column of Figure 12 shows an example manipulation from the experiment where we remove this cyclic training regime. Even though the output is visually similar to that of our full model, we observe that the cycle-consistency loss helps preserve identity as well as improve manipulation accuracy.
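A rough sketch of this cycle pass is shown below; the `edit` function stands in for the full text-conditioned inversion-and-generation pipeline, and the choice of an L1 reconstruction penalty is an illustrative assumption.

```python
# Rough sketch of the cyclic training pass: manipulate with a mismatching
# caption, then recover the original by manipulating the result with the
# matching caption. `edit` is a placeholder for the full text-conditioned
# inversion and generation pipeline; the L1 penalty is an illustrative choice.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(edit, x, t_match, t_mismatch):
    x_mismatched = edit(x, t_mismatch)         # first pass: wrong description
    x_recovered = edit(x_mismatched, t_match)  # second pass: correct description
    return F.l1_loss(x_recovered, x)           # recovered image should match the input
```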

Finally, we utilize a CLIP-guided correction module CLIPRemapper to apply the manipulations more accurately and increase the image fidelity. We see from the last two columns of the figure that without CLIPRemapper, the model is not able to apply all of the specified manipulations accurately, like the hair color in the first row or the earrings in the second row.

Table 4 shows the quantitative analysis of the experiments described above. The metrics verify that the global CLIP loss performs much worse in terms of attribute manipulations; this model still achieves a high CMP, since it directly optimizes image-text similarity in the CLIP space rather than aligning semantic directions. When we reduce the weight of the perceptual losses, the model applies the manipulations with very high accuracy, but this comes at the price of perceptual quality, as the FID suggests, and unwanted manipulations, as the CMP suggests. Adding the cycle pass provides a better supervision signal for training, as the improvements in accuracy and CMP indicate. Without CLIPAdapter, our model is not able to achieve high accuracy scores, suggesting that the adapter layers successfully yield residual latent codes corresponding to semantic directions. Finally, adding CLIPRemapper substantially boosts the manipulation performance, with a slight decrease in photorealism in terms of FID. Overall, these quantitative results demonstrate that the combination of loss functions and lightweight modules we use allows our model to perform well across all metrics, applying manipulations accurately while preserving photorealism and preventing unwanted changes.

Table 4. Quantitative Analysis of the Ablation Study

Configuration                FID \(\downarrow\)   CMP \(\uparrow\)   AMA \(\uparrow\)
Ours w/ Global CLIP Loss     83.404    0.221   25.429
Ours w/o Perceptual Losses   105.432   0.194   65.571
Ours w/o Cycle Pass          85.851    0.215   40.857
Ours w/o CLIPAdapter         89.244    0.202   41.571
Ours w/o CLIPRemapper        88.395    0.216   53.28
Ours                         97.210    0.221   61.429

  • We calculate the metrics for each of the ablation experiments described above. The results demonstrate that our model finds a good balance, applying manipulations accurately without decreasing the perceptual quality of the generations. The best and second-best performing models are highlighted in bold and underlined, respectively.

4.8 Limitations

In our approach, the CLIPAdapter module can be integrated with any inversion network. Consequently, the limitations of the underlying inversion network are inherited by our approach. For instance, when employing the \(\text{e4e}\) encoder, the network may struggle to find accurate latent codes for inputs with unusual poses or challenging lighting conditions; the reconstructions then occasionally alter the identity and lose certain details. Despite these limitations, our approach remains capable of generating outputs that align with the provided textual descriptions. It is important to note that the output images are consistent with the reconstructed images rather than with the input images themselves. Figure 13 shows manipulation results of our approach on various challenging inputs. As observed, our method successfully applies the manipulations with respect to the reconstructions; however, some details present in the input images, such as shadows or specific lighting conditions, may not be fully preserved during the unconditional inversion phase. It is worth mentioning that this is a common limitation of current GAN-based editing methods, as many of these approaches rely on a pretrained encoder such as \(\text{e4e}\) to obtain an initial inversion of the inputs. For a comprehensive comparison of competing approaches under these challenging cases, we refer readers to the supplementary material.

Fig. 13. Limitations due to the GAN inversion step. We present editing results for various challenging inputs (left) with lighting conditions and shadows different from the training data. The underlying inversion model struggles to capture these details in the initial inversion phase (middle). Our approach generates results (right) that are consistent with the reconstructed images.

Additionally, the effectiveness of our method mainly lies in the proposed text-guided image encoder CLIPInverter, which estimates the residual latent code to capture the desired changes. Since CLIPInverter is trained on a set of training images paired with corresponding textual descriptions, the results of our approach may be affected by biases that exist in the training data. For instance, the Multi-Modal CelebA-HQ dataset of human face images provides 10 descriptions for each image, but these descriptions are not very diverse, often using similar adjectives for certain attributes. Moreover, there is an imbalance between the number of female and male images, causing a bias toward a specific gender for certain attributes. When only attributes are used in the textual descriptions without any pronouns, unexpected gender manipulations might occur due to these biases. As observed in Figure 14, when we use only the description “wavy hair,” a gender manipulation also occurs. We can alleviate this problem by using more comprehensive textual descriptions with additional details, such as “She has wavy hair,” which yields a much more accurate manipulation. Tackling the bias problem in a more systematic manner is an interesting direction for future work.

Fig. 14. Limitations of our proposed CLIPInverter method. Our approach might make some undesired changes to the given input image not mentioned in the provided textual description due to the biases that exist in the training set. This problem can be prevented by providing more comprehensive descriptions.


5 CONCLUSION

In this work, we have introduced CLIPInverter, a novel text-driven image editing approach. It manipulates an input image through the lens of the StyleGAN latent space solely from a target textual description, which is much more intuitive than commonly used user inputs such as sketches, strokes, or segmentation masks. The key component of our approach is the proposed text-guided adapter module, CLIPAdapter, which modulates image feature maps during inversion to extract semantic edit directions with respect to the provided target description. Moreover, we propose a text-guided refinement module, CLIPRemapper, which performs an additional correction step on the latent code predicted by CLIPAdapter to further boost the accuracy of the edits applied to the input image. Our model requires neither instance-level latent code optimization nor separate training for specific text prompts, as done in prior work, and thus provides a faster alternative to existing approaches.

Our approach is not limited to a specific domain, as it only requires a pretrained StyleGAN model. As our experimental analysis on several different datasets illustrates, our model can handle semantic edits through textual descriptions for very different domains. Moreover, thanks to the shared semantic space between images and text provided by the CLIP model [Radford et al. 2021], our model can also perform manipulations conditioned on another image or on a novel textual description that has not been seen during training. Our experiments demonstrate significant improvements over previous approaches in that our model can manipulate images with high accuracy and quality for any description.

Furthermore, it is important to highlight that our proposed framework is not limited to StyleGAN and can be seamlessly integrated into other deep generative models that operate on a latent space representation. Although our current implementation focuses on StyleGAN, the key contributions of our framework, namely CLIPAdapter and CLIPRemapper, are not specific to StyleGAN and can be easily adapted to other GAN architectures. This flexibility opens up opportunities for leveraging our framework in conjunction with recent advancements in latent space extension, such as dual-space GANs, which exhibit enhanced disentanglement of style and content information [Kwon and Ye 2021; Xu et al. 2022]. By incorporating our framework into these models, we can further enhance manipulation accuracy and broaden the range of images that can be generated based on textual descriptions.


REFERENCES

1. Abdal Rameen, Qin Yipeng, and Wonka Peter. 2019. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'19). 4431–4440.
2. Abdal Rameen, Qin Yipeng, and Wonka Peter. 2020. Image2StyleGAN++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20). IEEE.
3. Abdal Rameen, Zhu Peihao, Mitra Niloy J., and Wonka Peter. 2021. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Trans. Graph. 40, 3, Article 21 (May 2021), 21 pages.
4. Alaluf Yuval, Patashnik Or, and Cohen-Or Daniel. 2021a. ReStyle: A residual-based StyleGAN encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'21).
5. Alaluf Yuval, Patashnik Or, and Cohen-Or Daniel. 2021b. Only a matter of style: Age transformation using a style-based regression model. ACM Trans. Graph. 40, 4, Article 45 (2021).
6. Alaluf Yuval, Tov Omer, Mokady Ron, Gal Rinon, and Bermano Amit H. 2021c. HyperStyle: StyleGAN inversion with HyperNetworks for real image editing. arXiv:2111.15666. Retrieved from https://arxiv.org/abs/2111.15666
7. Bai Qingyan, Xu Yinghao, Zhu Jiapeng, Xia Weihao, Yang Yujiu, and Shen Yujun. 2022. High-fidelity GAN inversion with padding space. In Proceedings of the European Conference on Computer Vision (ECCV'22). Springer, 36–53.
8. Bau David, Strobelt Hendrik, Peebles William, Wulff Jonas, Zhou Bolei, Zhu Jun-Yan, and Torralba Antonio. 2019a. Semantic photo manipulation with a generative image prior. ACM Trans. Graph. 38, 4, Article 59 (July 2019), 11 pages.
9. Bau David, Zhu Jun-Yan, Wulff Jonas, Peebles William, Strobelt Hendrik, Zhou Bolei, and Torralba Antonio. 2019b. Inverting layers of a large generator. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'19).
10. Chen Xinlei, Fan Haoqi, Girshick Ross, and He Kaiming. 2020. Improved baselines with momentum contrastive learning. arXiv:2003.04297. Retrieved from https://arxiv.org/abs/2003.04297
11. Choi Yunjey, Uh Youngjung, Yoo Jaejun, and Ha Jung-Woo. 2020. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20).
12. Collins Edo, Bala Raja, Price Bob, and Süsstrunk Sabine. 2020. Editing in style: Uncovering the local semantics of GANs. arXiv:2004.14367. Retrieved from https://arxiv.org/abs/2004.14367
13. Creswell Antonia and Bharath Anil Anthony. 2016. Inverting the generator of a generative adversarial network. arXiv:1611.05644. Retrieved from http://arxiv.org/abs/1611.05644
14. Deng Jiankang, Guo Jia, Xue Niannan, and Zafeiriou Stefanos. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'19). 4690–4699.
15. Dhariwal Prafulla and Nichol Alex. 2021. Diffusion models beat GANs on image synthesis. arXiv:2105.05233. Retrieved from https://arxiv.org/abs/2105.05233
16. Dinh Tan M., Tran Anh Tuan, Nguyen Rang, and Hua Binh-Son. 2022. HyperInverter: Improving StyleGAN inversion via hypernetwork. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
17. Gal Rinon, Patashnik Or, Maron Haggai, Chechik Gal, and Cohen-Or Daniel. 2021. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. arXiv:2108.00946. Retrieved from https://arxiv.org/abs/2108.00946
18. Goodfellow Ian J., Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. 2014. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS'14). MIT Press, Cambridge, MA, 2672–2680.
19. He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2015. Deep residual learning for image recognition. arXiv:1512.03385. Retrieved from https://arxiv.org/abs/1512.03385
20. He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16). 770–778.
21. Hertz Amir, Mokady Ron, Tenenbaum Jay, Aberman Kfir, Pritch Yael, and Cohen-Or Daniel. 2022. Prompt-to-prompt image editing with cross attention control. arXiv:2208.01626. Retrieved from https://arxiv.org/abs/2208.01626
22. Heusel Martin, Ramsauer Hubert, Unterthiner Thomas, Nessler Bernhard, and Hochreiter Sepp. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc.
23. Ho Jonathan, Jain Ajay, and Abbeel Pieter. 2020. Denoising diffusion probabilistic models. arXiv:2006.11239. Retrieved from https://arxiv.org/abs/2006.11239
24. Houlsby Neil, Giurgiu Andrei, Jastrzebski Stanislaw, Morrone Bruna, de Laroussilhe Quentin, Gesmundo Andrea, Attariyan Mona, and Gelly Sylvain. 2019. Parameter-efficient transfer learning for NLP. arXiv:1902.00751. Retrieved from http://arxiv.org/abs/1902.00751
25. Hu Xueqi, Huang Qiusheng, Shi Zhengyi, Li Siyuan, Gao Changxin, Sun Li, and Li Qingli. 2022. Style transformer for image inversion and editing. arXiv:2203.07932. Retrieved from https://arxiv.org/abs/2203.07932
26. Huang Xun and Belongie Serge. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'17).
27. Härkönen Erik, Hertzmann Aaron, Lehtinen Jaakko, and Paris Sylvain. 2020. GANSpace: Discovering interpretable GAN controls. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS'20).
28. Karras Tero, Aittala Miika, Laine Samuli, Härkönen Erik, Hellsten Janne, Lehtinen Jaakko, and Aila Timo. 2021. Alias-free generative adversarial networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS'21).
29. Karras Tero, Laine Samuli, and Aila Timo. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'19). 4401–4410.
30. Karras Tero, Laine Samuli, Aittala Miika, Hellsten Janne, Lehtinen Jaakko, and Aila Timo. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20).
31. Kawar Bahjat, Zada Shiran, Lang Oran, Tov Omer, Chang Huiwen, Dekel Tali, Mosseri Inbar, and Irani Michal. 2022. Imagic: Text-based real image editing with diffusion models. arXiv:2210.09276. Retrieved from https://arxiv.org/abs/2210.09276
32. Kim Gwanghyun, Kwon Taesung, and Ye Jong Chul. 2022. DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22). 2426–2435.
33. Kim Hyunsu, Choi Yunjey, Kim Junho, Yoo Sungjoo, and Uh Youngjung. 2021. Exploiting spatial dimensions of latent in GAN for real-time image editing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'21).
34. Kocasari Umut, Dirik Alara, Tiftikci Mert, and Yanardag Pinar. 2021. StyleMC: Multi-channel based fast text-guided image generation and manipulation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV'21).
35. Krizhevsky Alex, Sutskever Ilya, and Hinton Geoffrey E. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS'12), Vol. 25. Curran Associates, Inc.
36. Kwon Gihyun and Ye Jong Chul. 2021. Diagonal attention and style-based GAN for content-style disentanglement in image generation and translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'21).
37. Lee Cheng-Han, Liu Ziwei, Wu Lingyun, and Luo Ping. 2020. MaskGAN: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20).
38. Li Bowen, Qi Xiaojuan, Lukasiewicz Thomas, and Torr Philip H. S. 2020. ManiGAN: Text-guided image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20). 7880–7889.
39. Liu Ziwei, Luo Ping, Wang Xiaogang, and Tang Xiaoou. 2015. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV'15).
40. Nie Weili, Vahdat Arash, and Anandkumar Anima. 2021. Controllable and compositional generation with latent-space energy-based models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS'21).
41. Parmar Gaurav, Li Yijun, Lu Jingwan, Zhang Richard, Zhu Jun-Yan, and Singh Krishna Kumar. 2022. Spatially-adaptive multilayer selection for GAN inversion and editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
42. Patashnik Or, Wu Zongze, Shechtman Eli, Cohen-Or Daniel, and Lischinski Dani. 2021. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'21). 2085–2094.
43. Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, Krueger Gretchen, and Sutskever Ilya. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139). PMLR, 8748–8763.
44. Rebuffi Sylvestre-Alvise, Bilen Hakan, and Vedaldi Andrea. 2017. Learning multiple visual domains with residual adapters. arXiv:1705.08045. Retrieved from http://arxiv.org/abs/1705.08045
45. Rebuffi Sylvestre-Alvise, Vedaldi Andrea, and Bilen Hakan. 2018. Efficient parametrization of multi-domain deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'18). 8119–8127.
46. Richardson Elad, Alaluf Yuval, Patashnik Or, Nitzan Yotam, Azar Yaniv, Shapiro Stav, and Cohen-Or Daniel. 2021. Encoding in style: A StyleGAN encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'21).
47. Roich Daniel, Mokady Ron, Bermano Amit H., and Cohen-Or Daniel. 2021. Pivotal tuning for latent-based editing of real images. ACM Trans. Graph. 42, 1 (2021).
48. Rombach Robin, Blattmann Andreas, Lorenz Dominik, Esser Patrick, and Ommer Björn. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22). 10684–10695.
49. Seitzer Maximilian. 2020. pytorch-fid: FID Score for PyTorch. Retrieved from https://github.com/mseitzer/pytorch-fid
50. Shen Yujun, Gu Jinjin, Tang Xiaoou, and Zhou Bolei. 2020. Interpreting the latent space of GANs for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20).
51. Shen Yujun and Zhou Bolei. 2021. Closed-form factorization of latent semantics in GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'21).
52. Tewari Ayush, Elgharib Mohamed, R. Mallikarjun B., Bernard Florian, Seidel Hans-Peter, Pérez Patrick, Zollhöfer Michael, and Theobalt Christian. 2020b. PIE: Portrait image embedding for semantic control. arXiv:2009.09485. Retrieved from https://arxiv.org/abs/2009.09485
53. Tewari Ayush, Elgharib Mohamed, R. Mallikarjun B., Bernard Florian, Seidel Hans-Peter, Pérez Patrick, Zollhöfer Michael, and Theobalt Christian. 2020a. PIE: Portrait image embedding for semantic control. ACM Trans. Graph. 39, 6, Article 223 (Nov. 2020), 14 pages.
54. Tov Omer, Alaluf Yuval, Nitzan Yotam, Patashnik Or, and Cohen-Or Daniel. 2021. Designing an encoder for StyleGAN image manipulation. ACM Trans. Graph. 40, 4, Article 133 (July 2021), 14 pages.
55. Tumanyan Narek, Geyer Michal, Bagon Shai, and Dekel Tali. 2023. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'23).
56. Valevski Dani, Kalman Matan, Matias Y., and Leviathan Yaniv. 2022. UniTune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv:2210.09477. Retrieved from https://arxiv.org/abs/2210.09477
57. Voynov Andrey and Babenko Artem. 2020. Unsupervised discovery of interpretable directions in the GAN latent space. In Proceedings of the International Conference on Machine Learning (ICML'20). PMLR, 9786–9796.
58. Wah C., Branson S., Welinder P., Perona P., and Belongie S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001. California Institute of Technology.
59. Wang Tengfei, Zhang Yong, Fan Yanbo, Wang Jue, and Chen Qifeng. 2022. High-fidelity GAN inversion for image attribute editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
60. Wei Tianyi, Chen Dongdong, Zhou Wenbo, Liao Jing, Tan Zhentao, Yuan Lu, Zhang Weiming, and Yu Nenghai. 2022. HairCLIP: Design your hair by text and reference image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
61. Wu Zongze, Lischinski Dani, and Shechtman Eli. 2020. StyleSpace analysis: Disentangled controls for StyleGAN image generation. arXiv:2011.12799. Retrieved from https://arxiv.org/abs/2011.12799
62. Xia Weihao, Yang Yujiu, Xue Jing-Hao, and Wu Baoyuan. 2021a. TediGAN: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'21).
63. Xia Weihao, Zhang Yulun, Yang Yujiu, Xue Jing-Hao, Zhou Bolei, and Yang Ming-Hsuan. 2021b. GAN inversion: A survey. arXiv:2101.05278. Retrieved from https://arxiv.org/abs/2101.05278
64. Xu Yanbo, Yin Yueqin, Jiang Liming, Wu Qianyi, Zheng Chengyao, Loy Chen Change, Dai Bo, and Wu Wayne. 2022. TransEditor: Transformer-based dual-space GAN for highly controllable facial editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
65. Sung Yi-Lin, Cho Jaemin, and Bansal Mohit. 2022. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'22).
66. Zhang Richard, Isola Phillip, Efros Alexei A., Shechtman Eli, and Wang Oliver. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'18). 586–595.
67. Zhu Jiapeng, Shen Yujun, Zhao Deli, and Zhou Bolei. 2020. In-domain GAN inversion for real image editing. In Proceedings of the European Conference on Computer Vision (ECCV'20).
68. Zhu Jun-Yan, Krähenbühl Philipp, Shechtman Eli, and Efros Alexei A. 2016. Generative visual manipulation on the natural image manifold. In Proceedings of the European Conference on Computer Vision (ECCV'16).
69. Zhu Jun-Yan, Park Taesung, Isola Phillip, and Efros Alexei A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'17).
