The source code of this tutorial is in book/09.gan. For first-time use, please refer to the instructions in the Book documentation.
## Background
GAN \[[1](#References)\] (Generative Adversarial Network) is an unsupervised learning method in which two neural networks learn by competing against each other. It was proposed by Ian Goodfellow et al. in 2014; see the paper [Generative Adversarial Network](https://arxiv.org/abs/1406.2661).
A GAN consists of a generative network and a discriminative network. The generative network takes random samples from a latent space as input, and its output should imitate the real samples in the training set as closely as possible. The discriminative network takes either real samples or the output of the generative network as input and tries to distinguish generated samples from real ones, while the generative network tries to fool it. The two networks compete and keep adjusting their parameters; the ultimate goal is for the discriminative network to be unable to tell whether the output of the generative network is real \[[2](#References)\].
GANs are commonly used to generate convincing, realistic-looking pictures \[[3](#References)\]. They can also generate videos, 3D object models, and more.
## Results Preview
In this tutorial, the MNIST data set is used to train the network. After training for 19 epochs, the generated pictures are very close to the real ones. In the figure below, the first eight rows show real pictures and the remaining rows show pictures generated by the network:
Figure 1. Handwritten digits generated by the GAN
## Model Overview
### GAN
As its name suggests, a GAN learns a generative model of the data distribution in an adversarial way. The word "adversarial" refers to the competition between the Generator and the Discriminator. Take picture generation as an example:
- The Generator (G) receives random noise z and generates pictures that are as close to the real samples as possible, denoted G(z).
- The Discriminator (D) receives an input picture x and judges whether it is a real sample or a fake sample generated by the network. Its output D(x) is the probability that x is real: D(x)=1 means the Discriminator considers the input a real picture, while D(x)=0 means it considers the input a fake one.
During training, the two networks compete with each other until they eventually reach a dynamic equilibrium. This can be described as:
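
A common way to write this, following the original GAN paper \[[1](#References)\], is the minimax value function below. The symbols $p_{data}$ (distribution of real samples) and $p_z$ (distribution of the input noise) are borrowed from that paper rather than defined in this tutorial:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
$$

D is trained to maximize $V$, pushing D(x) toward 1 on real samples and D(G(z)) toward 0, while G is trained to minimize it by pushing D(G(z)) toward 1.
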
In the ideal case, G generates pictures G(z) that are very close to the real samples. D then can hardly judge whether a generated picture is real and can only guess, which means D(G(z))=0.5.
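
To see where the value 0.5 comes from: the GAN paper \[[1](#References)\] shows that, for a fixed G, the best possible discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)), where p_g is the distribution of generated samples. The sketch below evaluates that expression on small made-up discrete distributions; the function name and the probabilities are illustrative assumptions, not part of the tutorial code.

```python
import numpy as np

def optimal_discriminator(p_data, p_g):
    """Return D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    p_data = np.asarray(p_data, dtype=float)
    p_g = np.asarray(p_g, dtype=float)
    return p_data / (p_data + p_g)

# Made-up distribution over four outcomes (illustrative values only).
p_data = [0.1, 0.2, 0.3, 0.4]

# An imperfect generator: D* moves away from 0.5 wherever the two differ.
d_early = optimal_discriminator(p_data, [0.4, 0.3, 0.2, 0.1])

# A perfect generator: p_g equals p_data, so D* is 0.5 everywhere.
d_converged = optimal_discriminator(p_data, p_data)

print(d_early)
print(d_converged)
```

When the two distributions match, even the best discriminator can do no better than output 0.5 for every input.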
The figure below shows the training process of a GAN. Suppose the black, green, and blue lines represent the real sample distribution, the generated sample distribution, and the discriminative model, respectively. At the start of training, the discriminative model cannot distinguish real samples from generated ones well. Next, we fix the generative model and optimize the discriminative model, which gives the result shown in the second panel: the discriminative model now separates generated data from real data well. Third, we fix the discriminative model and optimize the generative model so that the discriminative model can no longer tell generated pictures from real ones; during this step the distribution of generated pictures moves closer to the real distribution. These two steps are iterated until convergence, at which point the generated distribution coincides with the real distribution and the discriminative model cannot distinguish real pictures from generated ones.
In practice, however, this perfect equilibrium is hard to reach, and the convergence theory of GANs is still an active research topic.
### DCGAN
[DCGAN](https://arxiv.org/abs/1511.06434)\[[4](#References)\] combines deep CNNs with GAN. Its principle is the same as GAN's, except that both the Generator and the Discriminator are implemented as CNNs. To improve the quality of generated samples and the convergence speed of the networks, the paper makes several changes to the network structure:
- Remove pooling layers: all pooling layers are replaced by strided convolutions in the Discriminator and fractional-strided convolutions in the Generator.
- Add batch normalization: batchnorm is used in both the Generator and the Discriminator.
- Use a fully convolutional network: the fully connected layers are removed to allow a deeper network structure.
- Activation functions: in the Generator (G), the last layer uses Tanh and all other layers use ReLU; in the Discriminator (D), LeakyReLU is used.
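
The three activation functions named in the last item can be written in a few lines of NumPy. This is only an illustration of the math, not Fluid code; the LeakyReLU slope of 0.2 is the value reported in the DCGAN paper and is an assumption here, since this tutorial does not state it.

```python
import numpy as np

def relu(x):
    """ReLU: keep positive values, zero out the rest."""
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.2):
    """LeakyReLU: like ReLU, but negative inputs keep a small slope alpha."""
    return np.where(x > 0, x, alpha * x)

def tanh(x):
    """Tanh: squashes any input into the open interval (-1, 1)."""
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(leaky_relu(x))
print(tanh(x))
```

Because Tanh bounds the Generator output to (-1, 1), real images are usually scaled to the same range before they are fed to the Discriminator.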
The structure of the Generator (G) in DCGAN is shown in the following figure:
This tutorial trains the Generator and the Discriminator on MNIST, a small data set that the paddle.dataset module downloads to the local machine automatically.
Please refer to [Digit Recognition](https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits) for a detailed introduction to MNIST.
## Train the Model
`09.gan/dc_gan.py` demonstrates the whole training process.
### Load the Package
First, load PaddlePaddle Fluid and the other related packages.
```python
# __future__ imports must come before any other import
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import math
import time
import six
import numpy as np
import PIL
import paddle
import paddle.fluid as fluid
import matplotlib
matplotlib.use('agg')  # select a non-interactive backend before importing pyplot
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
```
### Define Auxiliary Tools
Define a plot function to visualize the generated pictures.
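
A minimal sketch of such a plot function is given below. It assumes each generated picture arrives as a flattened 28x28 row with pixel values in [-1, 1]; the helper name `tile_images`, the one-pixel padding, and the figure size are illustrative choices and not necessarily those of `dc_gan.py`.

```python
import math
import numpy as np

IMG_DIM = 28  # side length of one image, matching img_dim in this tutorial

def tile_images(gen_data, img_dim=IMG_DIM, pad_dim=1):
    """Arrange flattened images into one square grid with a padding border."""
    num = gen_data.shape[0]
    images = gen_data.reshape(num, img_dim, img_dim)
    n = int(math.ceil(math.sqrt(num)))  # the grid has n x n cells
    padded = img_dim + pad_dim
    # Fill unused cells with zeros and pad the top/left edge of each image.
    images = np.pad(images, [(0, n * n - num), (pad_dim, 0), (pad_dim, 0)],
                    'constant')
    # (row, col, y, x) -> (row, y, col, x) -> a single 2D canvas.
    return (images.reshape(n, n, padded, padded)
                  .transpose(0, 2, 1, 3)
                  .reshape(n * padded, n * padded))

def plot(gen_data):
    """Draw the tiled grid and return the matplotlib figure."""
    # Imported lazily so that tile_images works without matplotlib installed.
    import matplotlib
    matplotlib.use('agg')
    import matplotlib.pyplot as plt
    fig = plt.figure(figsize=(8, 8))
    plt.axis('off')
    # vmin/vmax assume pixel values in [-1, 1].
    plt.imshow(tile_images(gen_data), cmap='Greys_r', vmin=-1, vmax=1)
    return fig
```

For example, `plot(batch).savefig('sample.png')` would write one grid image for a batch of generated pictures; the file name is hypothetical.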
The hyperparameters used throughout the tutorial are defined as follows:
```python
gf_dim = 64  # base number of channels in the Generator's feature maps; layer channel counts are multiples of it
df_dim = 64  # base number of channels in the Discriminator's feature maps; layer channel counts are multiples of it
gfc_dim = 1024 * 2  # dimension of the Generator's fully connected layer
dfc_dim = 1024  # dimension of the Discriminator's fully connected layer
img_dim = 28  # height and width of the input image
NOISE_SIZE = 100  # dimension of the input noise
LEARNING_RATE = 2e-4  # learning rate for training
epoch = 20  # number of training epochs
output = "./output_dcgan"  # path for saving models and test results
use_cudnn = False  # whether to use cuDNN
use_gpu = False  # whether to train on GPU
```
### Define the Network Structure
- Batch normalization (bn) layer
Call the `fluid.layers.batch_norm` interface to implement the bn layer. The activation function is ReLU by default.
```python
def bn(x, name=None, act='relu'):
    # `name` must be a string: it is used as the prefix of the parameter names.
    return fluid.layers.batch_norm(
        x,
        param_attr=name + '1',
        bias_attr=name + '2',
        moving_mean_name=name + '3',
        moving_variance_name=name + '4',
        name=name,
        act=act)
```
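
For intuition about what the wrapped layer computes, here is a NumPy sketch of training-mode batch normalization on an `[N, C, H, W]` tensor, followed by ReLU. It is not the Fluid implementation: the function name, the `eps` value, and the fixed scale and shift are illustrative assumptions, and the moving mean and variance that `fluid.layers.batch_norm` tracks for inference are omitted.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each channel over the batch and spatial axes, then scale and shift."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.RandomState(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))  # toy activations

normalized = batch_norm(x)               # per-channel mean ~0, variance ~1
activated = np.maximum(normalized, 0.0)  # ReLU, the default act of bn() above

print(normalized.mean(axis=(0, 2, 3)))
print(activated.shape)
```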
- Convolution layer
Call `fluid.nets.simple_img_conv_pool` to implement a convolution followed by pooling. The kernel size is 5x5, the pooling window is 2x2 with a stride of 2, and the activation function is specified by the particular network structure.
```python
def conv(x, num_filters, name=None, act=None):
    return fluid.nets.simple_img_conv_pool(
        input=x,
        filter_size=5,
        num_filters=num_filters,
        pool_size=2,
        pool_stride=2,
        param_attr=name + 'w',
        bias_attr=name + 'b',
        use_cudnn=use_cudnn,
        act=act)
```
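
As a quick check of the spatial sizes this layer produces, the sketch below applies the usual output-size formula for convolution and pooling. It assumes the convolution uses stride 1 and no padding (taken here to be the defaults of `simple_img_conv_pool`, which is an assumption) and a 28x28 MNIST input; the helper names are hypothetical.

```python
def conv_out_size(size, filter_size, stride=1, padding=0):
    """Output side length of a convolution or pooling window over one axis."""
    return (size + 2 * padding - filter_size) // stride + 1

def conv_pool_out_size(size, filter_size=5, pool_size=2, pool_stride=2):
    """Side length after a 5x5 convolution followed by 2x2 pooling with stride 2."""
    after_conv = conv_out_size(size, filter_size)
    return conv_out_size(after_conv, pool_size, stride=pool_stride)

first = conv_pool_out_size(28)      # 28 -> 24 (conv) -> 12 (pool)
second = conv_pool_out_size(first)  # 12 -> 8 (conv) -> 4 (pool)
print(first, second)  # prints: 12 4
```

Under these assumptions, two stacked `conv` layers reduce a 28x28 picture to a 4x4 feature map.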
- Fully Connected Layer
```python
def fc(x, num_filters, name=None, act=None):
    return fluid.layers.fc(input=x,
                           size=num_filters,
                           act=act,
                           param_attr=name + 'w',
                           bias_attr=name + 'b')
```
- Transpose Convolution Layer
In the Generator, full-size pictures must be generated from random samples. DCGAN uses transpose convolution layers for upsampling. In Fluid, we call `fluid.layers.conv2d_transpose` to implement transpose convolution.
```python
def deconv(x,
           num_filters,
           name=None,
           filter_size=5,
           stride=2,
           dilation=1,
           padding=2,
           output_size=None,
           act=None):
    return fluid.layers.conv2d_transpose(
        input=x,
        param_attr=name + 'w',
        bias_attr=name + 'b',
        num_filters=num_filters,
        output_size=output_size,
        filter_size=filter_size,
        stride=stride,
        dilation=dilation,
        padding=padding,
        use_cudnn=use_cudnn,
        act=act)
```
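
The `output_size` argument exists because a strided transpose convolution does not determine its output size uniquely: several input sizes of a strided convolution map to the same output size. The sketch below computes the smallest output size from the standard formula, using the defaults of `deconv` above. The admissible range `[smallest, smallest + stride)` follows our reading of the Fluid documentation and should be treated as an assumption, as should the 7x7 starting feature map; the helper names are hypothetical.

```python
def deconv_min_out_size(size, filter_size=5, stride=2, padding=2, dilation=1):
    """Smallest output side length of a transpose convolution over one axis."""
    return (size - 1) * stride - 2 * padding + dilation * (filter_size - 1) + 1

def valid_output_sizes(size, stride=2, **kwargs):
    """Output sizes assumed to be selectable through output_size."""
    smallest = deconv_min_out_size(size, stride=stride, **kwargs)
    return list(range(smallest, smallest + stride))

print(valid_output_sizes(7))   # prints: [13, 14]
print(valid_output_sizes(14))  # prints: [27, 28]
```

With these settings, passing `output_size=[14, 14]` and then `output_size=[28, 28]` would upsample a 7x7 feature map to a 28x28 picture in two steps.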
- Discriminator
The Discriminator is trained on real pictures from the data set and fake pictures produced by the Generator at the same time, and tries to output 1 for real pictures and 0 for fake ones. The Discriminator in this tutorial consists of two convolution-pooling layers and two fully connected layers; the last fully connected layer has a single neuron and outputs a binary classification result.
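
Pushing the output toward 1 on real pictures and toward 0 on generated ones is usually expressed as a binary cross-entropy loss. The NumPy sketch below illustrates the idea under that assumption; the function names and the example probabilities are hypothetical, and this is not the loss code of `dc_gan.py`.

```python
import numpy as np

def binary_cross_entropy(prob, label, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))

def discriminator_loss(d_real, d_fake):
    """Real pictures are labeled 1 and generated pictures are labeled 0."""
    real_loss = binary_cross_entropy(d_real, np.ones_like(d_real))
    fake_loss = binary_cross_entropy(d_fake, np.zeros_like(d_fake))
    return real_loss + fake_loss

# A discriminator that separates the two kinds of input well has a low loss.
confident = discriminator_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
# A discriminator that outputs 0.5 everywhere has loss 2 * log(2).
fooled = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(confident, fooled)
```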
- Generator
The Generator consists of two fully connected layers with BN and two transpose convolution layers. The input of the network is random noise, and the last transpose convolution layer has a single filter, so the output is a grayscale image.
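
A sketch of how one batch of input noise for the Generator might be drawn is shown below. The sizes come from this tutorial (`NOISE_SIZE` and the minibatch size), while the uniform distribution on [-1, 1) and the helper name `sample_noise` are assumptions made for illustration.

```python
import numpy as np

NOISE_SIZE = 100  # noise dimension, as in this tutorial
batch_size = 128  # minibatch size, as in this tutorial

def sample_noise(batch_size, noise_size, seed=None):
    """Draw one batch of noise vectors, uniform in [-1, 1) (an assumed choice)."""
    rng = np.random.RandomState(seed)
    noise = rng.uniform(-1.0, 1.0, size=(batch_size, noise_size))
    return noise.astype('float32')

z = sample_noise(batch_size, NOISE_SIZE, seed=0)
print(z.shape, z.dtype)
```

Each row of `z` is one latent vector that the Generator would turn into one 28x28 grayscale picture.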
Next, we start the training process. We use paddle.dataset.mnist.train() as the training data set. It returns a reader; in PaddlePaddle, a reader is a Python function that returns a Python generator each time it is called.
The `shuffle` used below is a reader decorator: it takes a reader A and returns another reader B. Reader B reads `buf_size` training samples into a buffer at a time, shuffles them, and then outputs them one by one.
`batch` is a special decorator whose input is a reader and whose output is a batched reader. In PaddlePaddle, a reader yields one training sample at a time, while a batched reader yields one minibatch at a time.
```python
batch_size = 128  # Minibatch size
train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.mnist.train(), buf_size=60000),
    batch_size=batch_size)
```
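
To make the reader and decorator idea concrete, here is a pure-Python sketch of the same pattern. It imitates the behavior described above and is not the PaddlePaddle implementation; `make_reader`, the `seed` parameter, and the toy data are hypothetical.

```python
import random

def make_reader(samples):
    """Return a reader: a function that returns a generator over the samples."""
    def reader():
        for sample in samples:
            yield sample
    return reader

def shuffle(reader, buf_size, seed=None):
    """Reader decorator: fill a buffer of buf_size samples, shuffle it, emit it."""
    rng = random.Random(seed)
    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        rng.shuffle(buf)  # flush the final, partially filled buffer
        for item in buf:
            yield item
    return shuffled_reader

def batch(reader, batch_size):
    """Reader decorator: group consecutive samples into lists of batch_size."""
    def batched_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # this sketch also emits the last, smaller batch
            yield current
    return batched_reader

toy_reader = batch(shuffle(make_reader(range(10)), buf_size=4, seed=0),
                   batch_size=3)
batches = list(toy_reader())
print(batches)
```

With `buf_size=4`, samples are only shuffled within groups of four, which is why the tutorial uses a buffer as large as the whole training set (`buf_size=60000`).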
### Create Executor
```python
if use_gpu:
    exe = fluid.Executor(fluid.CUDAPlace(0))
else:
    exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
```
### Start Training
In each training iteration, the Generator and the Discriminator are each updated a configurable number of times. To keep the Discriminator from converging to 0 too quickly, this tutorial by default trains the Discriminator once and the Generator twice per iteration.
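
The update schedule can be sketched as a plain loop. The two step functions are stand-ins passed in by the caller, so this only illustrates the order of updates and not the actual Fluid training code; the constant and function names are hypothetical.

```python
D_STEPS_PER_ITER = 1  # Discriminator updates per minibatch
G_STEPS_PER_ITER = 2  # Generator updates per minibatch

def train(num_epochs, batches_per_epoch, d_step, g_step):
    """Alternate Discriminator and Generator updates for every minibatch."""
    for epoch_id in range(num_epochs):
        for batch_id in range(batches_per_epoch):
            for _ in range(D_STEPS_PER_ITER):
                d_step(epoch_id, batch_id)
            for _ in range(G_STEPS_PER_ITER):
                g_step(epoch_id, batch_id)

# Stub steps that only record which network would be updated.
calls = []
train(num_epochs=1, batches_per_epoch=2,
      d_step=lambda epoch_id, batch_id: calls.append('D'),
      g_step=lambda epoch_id, batch_id: calls.append('G'))
print(calls)  # prints: ['D', 'G', 'G', 'D', 'G', 'G']
```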
```python
# Observe the image generated at the 10th epoch and 460th batch:
display_image(10, 460)
```
## Summary
DCGAN takes a random noise vector as input and passes it through a structure that mirrors a CNN in reverse, upsampling it into 2D data. With a generative model of this structure and a discriminative model with a CNN structure, DCGAN performs very well in image generation. In this example, we generated handwritten digit images with DCGAN. You can try changing the data set to generate images that meet your own needs, or changing the network structure to observe different generation results.
## References
[1] Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua. Generative Adversarial Networks. 2014. arXiv:1406.2661.

[2] Karpathy, Andrej; Abbeel, Pieter; Brockman, Greg; Chen, Peter; Cheung, Vicki; Duan, Rocky; Goodfellow, Ian; Kingma, Durk; Ho, Jonathan; Houthooft, Rein; Salimans, Tim; Schulman, John; Sutskever, Ilya; Zaremba, Wojciech. Generative Models. OpenAI, [April 7, 2016].

[3] Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi. Improved Techniques for Training GANs. 2016. arXiv:1606.03498 [cs.LG].

[4] Radford, Alec; Metz, Luke; Chintala, Soumith. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 2015. arXiv:1511.06434.