Nvidia’s “scarily realistic” image generator has been upgraded.
Editor’s note: This article comes from the WeChat public account “新智元” (ID: AI_era); source: arXiv; editor: Xiao Qin.
[Guide from Xinzhiyuan] StyleGAN is currently the most advanced high-resolution image synthesis method. The face photos it generates were once described as “so realistic it’s scary.” Now Nvidia researchers have released an upgraded version, StyleGAN2, which focuses on fixing characteristic artifacts and further improves the quality of generated images.
StyleGAN is an image generation method released by NVIDIA last year and open-sourced in February this year.
The images StyleGAN generates are strikingly realistic. It builds artificial images step by step, starting from a very low resolution and working up to high resolution (1024 × 1024). By separately modifying the input to each level of the network, it can control the visual features represented at that level, from coarse features (pose, face shape) to fine details (hair color), without affecting the other levels.
Faces generated by StyleGAN
StyleGAN is currently the most advanced high-resolution image synthesis method and has been shown to work reliably on a variety of datasets. In addition to realistic portraits, StyleGAN can also be used to generate animals, cars and even rooms.
However, StyleGAN is not perfect. Its most obvious flaw is that the generated images sometimes contain speckle-like artifacts, and this flaw has now been fixed.
Today, NVIDIA researchers released StyleGAN2, an upgraded version of StyleGAN that focuses on fixing these artifacts and further improves the quality of the generated images.
Images generated by StyleGAN2
Major improvements include:
- Significantly better image quality (better FID scores, fewer artifacts)
- A new method that replaces progressive growing, with finer details such as teeth and eyes
- Improved style mixing
- Smoother interpolation (via extra regularization)
- Faster training

Nvidia StyleGAN2
Redesigned StyleGAN image synthesis network
The distinctive feature of StyleGAN is its unconventional generator architecture. Instead of feeding the input latent code z ∈ Z only into the beginning of the network, the mapping network f first transforms it into an intermediate latent code w ∈ W. Learned affine transforms then produce styles, which control the layers of the synthesis network g through adaptive instance normalization (AdaIN).
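For intuition, AdaIN first normalizes each feature map to zero mean and unit variance, then scales and shifts it with a per-channel style. The sketch below is a minimal NumPy illustration of that operation, not NVIDIA’s implementation; the function and argument names are mine, and the learned affine transform that produces the styles from w is omitted.

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-8):
    """Adaptive instance normalization (illustrative sketch).

    x: (N, C, H, W) feature maps.
    style_scale, style_bias: (N, C) per-channel styles, in the real
    model produced by a learned affine transform of w (not shown).
    """
    # Normalize each feature map to zero mean, unit variance.
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)
    # Modulate with the style: new per-channel scale and bias.
    return (style_scale[:, :, None, None] * x_norm
            + style_bias[:, :, None, None])
```

After this operation, each feature map’s statistics are dictated entirely by the style, which is what gives each layer its independent control over the generated image.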
In this study, we focus all our analysis on W, because from the perspective of the synthesis network, W is the relevant latent space.
Many people have noticed characteristic artifacts in the images generated by StyleGAN. This study identifies two causes of these artifacts and describes how to eliminate them by changing the architecture and training methods.
Figure 1: Instance normalization causes speckle-like artifacts in images generated by StyleGAN
First, we investigated the origin of the common speckle-like artifacts and found that the generator creates them to circumvent a design flaw in its architecture. We redesigned the normalization used in the generator, which removes the artifacts.
Second, we analyzed artifacts related to progressive growing. Progressive growing has been very successful in stabilizing high-resolution GAN training.
We propose an alternative design that achieves the same goal (focusing on low-resolution images at the beginning of training, then gradually shifting attention to higher and higher resolutions) without changing the network topology during training. The new design also lets us reason about the effective resolution of the generated images, which turns out to be lower than expected, motivating us to design larger models.
Figure 2: Redesigned StyleGAN image synthesis network
As shown in Figure 2, (a) is the original StyleGAN, where A represents an affine transformation learned from W that produces a style; (b) shows the original StyleGAN architecture in more detail. Here, we decompose AdaIN into explicit normalization followed by modulation, each operating on the mean and standard deviation of every feature map.
We also annotate the learned weights (w), biases (b), and constant input (c), and redraw the gray boxes so that each box activates one style. The activation function (leaky ReLU) is always applied immediately after adding the bias. In (c), we make several changes to the original architecture, including removing some redundant operations at the beginning, moving the addition of b and B outside the style’s active area, and adjusting only the standard deviation of each feature map.
(d) is the modified architecture, which allows us to replace instance normalization with a “demodulation” operation. We apply the demodulation operation to the weights associated with each convolutional layer.
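A minimal sketch of the demodulation idea, assuming unit-variance input activations (names are mine, not the official implementation): the style scales the convolution weights per input channel, and demodulation then rescales each output filter so that the expected output variance is restored to one, which is what instance normalization previously enforced on the activations.

```python
import numpy as np

def modulate_demodulate(weight, style, eps=1e-8):
    """Weight (de)modulation replacing instance normalization.

    weight: (out_ch, in_ch, kh, kw) convolution weights.
    style:  (in_ch,) per-input-channel scales from the style vector.
    """
    # Modulate: scale each input channel of the weights by the style.
    w = weight * style[None, :, None, None]
    # Demodulate: rescale each output filter so that, for unit-variance
    # inputs, each output feature map has unit expected std deviation.
    demod = 1.0 / np.sqrt((w ** 2).sum(axis=(1, 2, 3)) + eps)
    return w * demod[:, None, None, None]
```

Because the scaling is baked into the weights rather than applied to actual activation statistics, the generator can no longer smuggle signal-strength information past the normalization, which is what produced the speckle artifacts.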
Figure 3: Demodulation in place of instance normalization removes feature artifacts from images and activations.
As shown in Figure 3, the redesigned StyleGAN2 architecture eliminates feature artifacts while retaining full controllability.
Quantitative analysis of the quality of GAN-generated images remains a challenging topic. Frechet inception distance (FID) measures the difference between two distribution densities in the high-dimensional feature space of an InceptionV3 classifier. Precision and recall (P&R) provide additional visibility by explicitly quantifying the percentage of generated images that are similar to the training data and the percentage of the training data that can be generated. We use these metrics to quantify the improvements in StyleGAN2.
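Concretely, FID is the Frechet distance between two Gaussians fitted to feature activations: ||mu_a - mu_b||² + Tr(C_a + C_b - 2(C_a C_b)^½). The sketch below computes this on arbitrary feature vectors; the real metric first extracts features with an InceptionV3 network, which is omitted here, and the function name is mine.

```python
import numpy as np

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: (N, D) feature activations (in the real metric,
    InceptionV3 features of real and generated images).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    diff = mu_a - mu_b
    # Tr((C_a C_b)^{1/2}) via the eigenvalues of C_a @ C_b, which are
    # real and non-negative for a product of covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * covmean_trace)
```

Identical feature distributions give a distance near zero; shifting one set by a constant vector d adds roughly ||d||² to the score, so lower FID means the generated distribution better matches the training data.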
Table 1: Main results
FID is largely unaffected (Table 1, rows A, B), but there is a notable shift from precision to recall.
FID and P&R are both based on classifier networks. Recent research has shown that such networks focus on textures rather than shapes, so these metrics cannot accurately capture all aspects of image quality. We use the perceptual path length (PPL) metric, a method for estimating the quality of latent-space interpolation, which correlates with the consistency and stability of shapes.
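The PPL idea can be sketched as follows: sample points along latent interpolation paths, perturb each slightly, and average the perceptual distance between the resulting image pairs, scaled by the inverse squared step size. This is a simplification with hypothetical names; the paper uses a learned perceptual distance between images and specific interpolation schemes (slerp in Z, lerp in W), both stubbed out here.

```python
import numpy as np

def perceptual_path_length(generate, distance, z_pairs, eps=1e-4):
    """Rough PPL estimate (illustrative sketch).

    generate: latent vector -> image.
    distance: (image, image) -> float; the paper uses a learned
    perceptual metric, here any distance can be plugged in.
    z_pairs: list of (z0, z1) latent endpoint pairs.
    """
    dists = []
    for z0, z1 in z_pairs:
        t = np.random.uniform(0.0, 1.0)
        # Images at two nearby points on the interpolation path.
        img_a = generate((1.0 - t) * z0 + t * z1)
        img_b = generate((1.0 - t - eps) * z0 + (t + eps) * z1)
        dists.append(distance(img_a, img_b) / eps ** 2)
    return float(np.mean(dists))
```

A smooth generator produces small image changes for small latent steps and therefore a low PPL; abrupt jumps along the path inflate the score.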
Based on this, we regularize the synthesis network toward smooth mappings and obtain a clear improvement in quality. To offset the computational overhead, we also propose performing all regularizations less frequently, observing that this does not compromise their effectiveness.
Figure 4
Figure 5
A new method replaces progressive growing, with finer details
Progressive growing has proven very successful in stabilizing high-resolution image synthesis, but it produces its own characteristic artifacts.
The key issue is that progressively grown generators appear to have a strong location preference for details: when features such as teeth or eyes should move smoothly across the image, they may instead stay in place and then jump to the next preferred location.
Figure 6 shows a related artifact. We believe the problem is that in progressive growing, each resolution temporarily serves as the output resolution, forcing it to generate maximal frequency details; this causes the trained network to have excessively high frequencies in its intermediate layers, compromising shift invariance.
Figure 6: Progressive growing results in “phase” artifacts. In this example, the teeth do not follow the pose change: the face turns to one side, but the teeth stay facing straight ahead, as indicated by the blue line.
To solve these problems, we propose an alternative method that eliminates the defects while retaining the benefits of progressive growing.
Although StyleGAN uses a simple feedforward design in its generator (synthesis network) and discriminator, a great deal of work has been devoted to better network architectures. In particular, skip connections [34, 22], residual networks [17, 16, 31], and hierarchical methods [7, 46, 47] have proven very successful. We therefore decided to re-evaluate StyleGAN’s network design and look for an architecture that generates high-quality images without progressive growing.
Figure 7: Three generator (above the dotted line) and discriminator architectures.
Figure 7a shows MSG-GAN [22], which uses multiple skip connections to connect matching resolutions of the generator and discriminator.
In Figure 7b, we simplify this design by upsampling and summing the RGB outputs corresponding to different resolutions. In the discriminator, we similarly provide downsampled images to each resolution block. We use bilinear filtering in all upsampling and downsampling operations.
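The generator side of the Figure 7b skip design can be sketched as progressively upsampling a running RGB sum and adding each finer resolution’s output. This is an illustration with my own names, and it uses nearest-neighbor upsampling for brevity where the paper uses bilinear filtering.

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbor 2x upsampling (the paper uses bilinear
    filtering; nearest-neighbor keeps this sketch short)."""
    return img.repeat(2, axis=-2).repeat(2, axis=-1)

def skip_generator_output(rgb_outputs):
    """Combine per-resolution RGB outputs into the final image.

    rgb_outputs: list of (C, H, W) arrays, coarsest first, each
    resolution double the previous one.
    """
    out = rgb_outputs[0]
    for rgb in rgb_outputs[1:]:
        # Upsample the running sum, then add this resolution's RGB.
        out = upsample2x(out) + rgb
    return out
```

Because every resolution contributes directly to the final image, training can emphasize coarse resolutions early and fine ones later without ever changing the network topology, which is what progressive growing achieved by structural surgery.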
In Figure 7c, we further modify the design to use residual connections. This design is similar to LAPGAN [7].
Table 2 compares three generator and discriminator architectures: StyleGAN’s original feedforward network, skip connections, and residual networks, all trained without progressive growing.
Table 2: Comparison of generator and discriminator structures without progressive growing.
For each of these nine combinations, FID and PPL results are provided. Two broad trends are visible: skip connections in the generator greatly improve PPL in all configurations, and a residual discriminator network clearly benefits FID.
StyleGAN2 uses a skip generator and a residual discriminator, without progressive growing. This corresponds to configuration E in Table 1; as the table shows, switching to this setting significantly improves FID and PPL.
Finally, we found that projecting images into the latent space W works significantly better with the new path-length-regularized generator than with the original StyleGAN.
Paper address:
https://arxiv.org/pdf/1912.04958.pdf
Code and trained models are open source:
https://github.com/NVlabs/stylegan2