
"Dog" produces everything? The Israeli team proposed a zero sample training model, and the dog changed into Nicholas Cage

2021-08-26 09:01:55 Xinzhiyuan

Xinzhiyuan Report

Source: arXiv

Editors: Priscilla, 好困

【Xinzhiyuan Introduction】A research team from Tel Aviv University and NVIDIA leverages the semantic power of the CLIP model to propose a text-driven method, StyleGAN-NADA: no images from the new domain need to be collected, and a text prompt alone is enough to quickly generate domain-specific images.

Everything can be GANned: even the family dog can become Nicolas Cage!

GANs have a wide range of applications, covering image enhancement, editing, and even classification and regression tasks. But before any of that, a GAN first needs a large collection of training images.

In practice, for the paintings of a particular artist, fictional scenes, and the like, there may not be enough data to train a GAN; sometimes there is no data at all.

A zero-shot trained generator directly produces an artistic rendering of New York City

To solve this problem, the research team from Tel Aviv University and NVIDIA exploits the semantic power of large-scale Contrastive Language-Image Pre-training (CLIP) models and proposes a text-driven method: StyleGAN-NADA.

No images from the new domain need to be collected; text prompts alone are enough to generate domain-specific images.

Paper: https://arxiv.org/abs/2108.00946

Given a text prompt and a short period of training, the generator can be adapted to many domains with different styles and shapes.

There is no need to touch a single image: the signal from OpenAI's CLIP is enough to train the generator.

It also lets you say goodbye to training data, and it runs fast!

(Brace yourself) Type "Human" or "Zombie" into the box, and the corresponding image is generated immediately.

Method implementation

The team's goal is to shift a pre-trained generator from a given source domain to a new target domain using only text prompts, without any images from the target domain.

As the source of supervision for the target domain, the authors use nothing but a single pre-trained CLIP model.

Two key questions arise in this process:

(1) How can the semantic information encapsulated in CLIP best be extracted?

(2) How should the optimization be regularized to avoid adversarial solutions or mode collapse?

Overview of the network architecture

At the core of the method is a pair of intertwined generators built on the StyleGAN2 architecture.

The two generators share a mapping network, so the same latent code initially produces the same image in both.

Training setup

The two intertwined generators, Gfrozen and Gtrain, are initialized with the weights of a generator pre-trained on source-domain images.

The weights of Gfrozen remain fixed throughout training, while the weights of Gtrain are modified through optimization and an iterative layer-freezing scheme.

This process shifts the domain of Gtrain along the text direction provided by the user, while the two generators maintain a shared latent space.
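
In code, this setup can be sketched as follows. This is a minimal illustration rather than the authors' released implementation: load_pretrained_stylegan2, the mapping/synthesis submodules, the 512-dimensional latent, and the training hyperparameters are all placeholders standing in for whatever StyleGAN2 implementation is actually used, and directional_clip_loss is sketched further below.

```python
import copy
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder loader for a StyleGAN2 generator pre-trained on the source domain
# (any implementation exposing a mapping network z -> w and a synthesis network
# w -> image plays the same role here).
G_frozen = load_pretrained_stylegan2("stylegan2-source.pt").to(device)
G_train = copy.deepcopy(G_frozen)  # starts as an exact copy of the source generator

# Gfrozen only supplies reference images of the source domain.
G_frozen.eval()
for p in G_frozen.parameters():
    p.requires_grad_(False)

# Only the synthesis weights of the trainable copy are updated; both copies keep
# consuming the same mapped latent w, so each z yields a paired
# (source image, target image) with aligned content.
opt = torch.optim.Adam(G_train.synthesis.parameters(), lr=2e-3)

for step in range(num_steps):                        # num_steps: assumed hyperparameter
    z = torch.randn(batch_size, 512, device=device)  # assumed 512-dim latent
    w = G_frozen.mapping(z)                          # shared mapping network
    img_src = G_frozen.synthesis(w)                  # stays in the source domain
    img_trg = G_train.synthesis(w)                   # drifts toward the text-specified domain
    loss = directional_clip_loss(img_src, img_trg, "Dog", "Nicolas Cage")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the optimizer would cover only the adaptively selected subset of layers rather than the whole synthesis network, as described in the next sections.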

CLIP-based guidance

The authors rely on a pre-trained CLIP model as the only source of supervision for the target domain.

Global CLIP loss

This loss is designed to minimize the CLIP-space cosine distance between the generated image and a given target text.

However, this loss has drawbacks:

(1) It does nothing to maintain diversity.

(2) It is highly susceptible to adversarial solutions.

With insufficient regularization, the model learns pixel-level perturbations that fool CLIP rather than meaningful changes to the image.

These drawbacks make the global loss unsuitable for training the generator.

However, the authors still use it to adaptively determine which subset of layers to train at each iteration.
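
Concretely, the global loss is just one minus the cosine similarity between the CLIP embeddings of a generated image and of the target text. The sketch below is an illustration using OpenAI's clip package, not the authors' code; the simple resize stands in for CLIP's full preprocessing, and the helper names are assumptions.

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def clip_image_embed(img):
    # Map generator output in [-1, 1] to CLIP's input size; a faithful version
    # would also apply CLIP's normalization statistics.
    img = F.interpolate((img + 1) / 2, size=(224, 224), mode="bilinear", align_corners=False)
    e = clip_model.encode_image(img)
    return e / e.norm(dim=-1, keepdim=True)

def clip_text_embed(text):
    e = clip_model.encode_text(clip.tokenize([text]).to(device))
    return e / e.norm(dim=-1, keepdim=True)

def global_clip_loss(img, target_text):
    # CLIP-space cosine distance between the generated image and the target text.
    return (1 - clip_image_embed(img) @ clip_text_embed(target_text).T).mean()
```

In the scheme described here, this loss is not used to update the generator weights directly; it only serves, at each iteration, to identify the subset of layers that responds most strongly to the target text, and that subset is then unfrozen for the directional-loss update.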

Directional CLIP Loss

The second loss is therefore designed to preserve diversity and to prevent widespread corruption of the images.

For each latent code, two images are produced: one by the reference (frozen) generator and one by the modified (trainable) generator.

The authors require the CLIP-space direction between the embeddings of the reference (source) image and the modified (target) image to stay aligned with the CLIP-space direction between the embeddings of a pair of source and target texts.

In the corresponding loss, EI and ET denote the CLIP image and text encoders, Gfrozen and Gtrain the frozen source generator and the modified trainable generator, and tsource and ttarget the source and target domain texts.

This overcomes both drawbacks of the global loss:

(1) Inability to maintain diversity

The global loss is prone to mode collapse, but here, if the target generator produced only a single image, the CLIP-space directions from the different source images to that one target image would all differ,

so they could not all align with the text direction.

(2) Vulnerability to adversarial solutions

The network also finds it much harder to converge to an adversarial solution, because fooling CLIP would require a perturbation that works across an effectively unlimited set of instances.

Directional loss

The images produced by the two generators are embedded in CLIP space, and the vector ΔI connecting their embeddings is required to be collinear with the direction ΔT defined by the source and target texts, which is achieved by maximizing their normalized inner product.
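
Reusing the CLIP helpers from the previous sketch, a minimal version of this directional loss could look as follows (function and argument names are illustrative; for example, t_source = "Dog" and t_target = "Nicolas Cage"):

```python
def directional_clip_loss(img_src, img_trg, t_source, t_target):
    # Direction between the target and source text embeddings in CLIP space.
    delta_t = clip_text_embed(t_target) - clip_text_embed(t_source)
    delta_t = delta_t / delta_t.norm(dim=-1, keepdim=True)

    # Direction between each image pair: frozen-generator output -> trainable-generator output.
    delta_i = clip_image_embed(img_trg) - clip_image_embed(img_src)
    delta_i = delta_i / delta_i.norm(dim=-1, keepdim=True)

    # Make every per-sample image direction collinear with the text direction by
    # maximizing the normalized inner product, i.e. minimizing 1 - cosine similarity.
    return (1 - (delta_i * delta_t).sum(dim=-1)).mean()
```

Because every latent code contributes its own image direction that must match the single text direction, a collapsed generator or a single adversarial perturbation cannot satisfy the loss for all samples at once, which is exactly the argument made above.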

Embedding-norm loss

In some cases, a StyleCLIP latent mapper can better identify the regions of latent space that match the target domain.

However, the mapper tends to introduce unwanted semantic artifacts into the images, and these artifacts go hand in hand with an increase in the norm of the generated images' CLIP-space embeddings.

These norms are therefore constrained by an additional loss introduced during mapper training, which prevents the mapper from producing such artifacts.

Here M denotes the latent mapper.
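
The article does not give the formula for this term, so the snippet below is only a plausible sketch of the described idea, reusing clip_model and F from the earlier sketch: during training of the StyleCLIP-style latent mapper M, penalize the CLIP-space embedding norm of the images produced through M. The function name and exact form are assumptions, not the paper's definition.

```python
def embedding_norm_loss(G, M, w):
    # Hypothetical regularizer: the article links semantic artifacts to inflated
    # CLIP-space embedding norms, so we penalize that norm for images generated
    # through the mapper M. The loss actually used in the paper may differ.
    img = G.synthesis(M(w))
    img = F.interpolate((img + 1) / 2, size=(224, 224), mode="bilinear", align_corners=False)
    return clip_model.encode_image(img).norm(dim=-1).mean()
```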

Experimental results

The proposed method covers a wide range of out-of-domain image generation, from style and texture changes to shape modification, and from realistic to fantastical targets, all driven through a text-based interface.

Even the most extreme shape changes take only a few minutes of training.

The images above are randomly sampled outputs synthesized by generators whose domains range from faces and churches to dogs and cars.

For purely style-based changes, the authors allow all layers to be trained.

For minor shape modifications, they found that training roughly two-thirds of the model's layers (12 layers for the 1024×1024 model) provides a good compromise between stability and training time.

The source-domain text is "Dog".

Whereas the changes in the first two figures were mostly confined to style or slight shape adjustments, the model shown above has to make significant shape changes.

Comparison of results

An existing pre-trained generator can only be edited within its own domain; it cannot generate images outside the domain it was trained on.

The figure above shows the results of using the three StyleCLIP methods to edit latent codes along out-of-domain text directions.

As can be seen, only the model proposed in this paper successfully generates the corresponding images.

Reference:

https://arxiv.org/abs/2108.00946

