# Deep Learning 2: Part 2 Lesson 12

![](../img/1_iFOmwPIB-BHiM7G4ttDb9w.png)

### Generative Adversarial Networks (GANs)

[Video](https://youtu.be/ondivPiwQho) / [Forum](http://forums.fast.ai/t/part-2-lesson-12-in-class/15023)

GANs are a very hot technique, but they definitely deserve to be in the cutting-edge part of the course, because they haven't quite been proven to be useful for much yet — but they are nearly there, and they are certainly going to get there. We are going to focus on the things where they are definitely going to be useful in practice; there are many areas where they may become useful, but we don't know yet. The area where they are definitely going to be useful in practice is the kind of thing you see on the left of the slide — for example, turning drawings into rendered pictures. This comes from [a paper that was released just two days ago](https://arxiv.org/abs/1804.04732), so this is a very active area of research right now.

**From the last lecture** [ [**1:04**](https://youtu.be/ondivPiwQho%3Ft%3D1m4s) ]: One of our diversity fellows, Christine Payne, has a master's in medicine from Stanford, so she was interested in what it would look like if we built a language model of medicine. One thing we touched on briefly in lesson 4 but didn't really talk about last time is the idea that you can seed a generative language model: you have trained a language model on some corpus, and then you are going to generate some text from it. You can start by feeding in a few words — "here are the first few words to create the hidden state in the language model" — and then generate from there. Christine did something clever, which was to seed it with a question, repeat the question three times, and let it generate from there. She fed a language model lots of different medical texts and got the output below:

![](../img/1_v6gjjQ9Eu_yyJnj5qJoMyA.png)

What Jeremy finds interesting about this is that, to someone without a master's in medicine, it reads like a plausible answer — but it has no bearing on reality. He thinks it is an interesting ethical and user-experience dilemma. Jeremy is involved with a company called doc.ai which is trying to do a number of things, but ultimately to provide an app for doctors and patients that can help create a conversational user interface to help them with their medical issues. He has been saying to the software engineers on that team: please don't try to build a generative model with an LSTM, because they are going to be very good at creating bad advice that sounds impressive — a bit like political pundits or tenured professors who can say nonsense with great authority. So he thought this was a really interesting experiment. If you do some interesting experiments, please share them on the forums, in blog posts, on Twitter. Let people know about it and get noticed by awesome people.

#### CIFAR10 [ [5:26](https://youtu.be/ondivPiwQho%3Ft%3D5m26s) ]

Let's talk about CIFAR10. The reason we are doing this is that today we are going to look at some fairly bare-bones PyTorch to build these generative adversarial models. Right now there is no fastai support to speak of for GANs — it will come soon, but at the moment it doesn't exist, so we are going to be building a lot of models from scratch. We haven't done a lot of serious model building for a while. We looked at CIFAR10 in part 1 of the course and built something that got about 85% accuracy and took a couple of hours to train. Interestingly, there is a competition going on right now to see who can train CIFAR10 the fastest ([DAWN](https://dawn.cs.stanford.edu/benchmark/)), and the goal is to get it to 94% accuracy. It would be interesting to see whether we can build an architecture that gets to 94% accuracy, because that is a lot better than our previous attempt. Hopefully, in doing so, we will learn something about creating good architectures, which will be useful for looking at GANs today. It is also useful because Jeremy has been looking closely at the papers on different kinds of CNN architectures over the last few years, and he has realized that many of the insights in those papers are not widely leveraged and clearly not widely understood. So he wants to show you what happens if we take advantage of that understanding.

#### [cifar10-darknet.ipynb](https://github.com/fastai/fastai/blob/master/courses/dl2/cifar10-darknet.ipynb) [ [7:17](https://youtu.be/ondivPiwQho%3Ft%3D7m17s) ]

The notebook is called [darknet](https://pjreddie.com/darknet/) because the particular architecture we are going to look at is very close to the Darknet architecture. But you will see along the way that the Darknet architecture here is not the whole YOLO v3 end-to-end thing, just the part of it that they pre-train on ImageNet to do classification. It is pretty much the most generic, simple architecture you could imagine, so it is a really good starting point for experiments. So we will call it "darknet", but it isn't quite that, and you can fiddle with it to create things that are definitely not darknet. It is really just the basis of nearly every modern ResNet-based architecture.

CIFAR10 is a fairly small dataset [ [8:06](https://youtu.be/ondivPiwQho%3Ft%3D8m6s) ]. The images are only 32 by 32, and it is a great dataset to work with because:

*   You can train it relatively quickly, unlike ImageNet
*   It is a relatively small amount of data
*   It is actually quite hard to recognize the images, because 32 by 32 is too small to easily see what is going on

It is an under-appreciated dataset because it is old. Who wants to work with a small old dataset when they could use their whole server room to process something much bigger? But it is a really good dataset to focus on.

Go ahead and import our usual stuff, and we are going to try to build a network from scratch to train this with [ [8:58](https://youtu.be/ondivPiwQho%3Ft%3D8m58s) ].

```
%matplotlib inline
%reload_ext autoreload
%autoreload 2
```

```
from fastai.conv_learner import *

PATH = Path("data/cifar10/")
os.makedirs(PATH, exist_ok=True)
```

For those of you who are not 100% confident of your broadcasting and basic PyTorch skills, this is a really good exercise: figure out how Jeremy came up with these `stats` numbers. These numbers are the mean and standard deviation for each channel in CIFAR10. Try to make sure you can recreate those numbers, and see if you can do it in no more than a couple of lines of code (no loops!).
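If you want to check your answer afterwards, here is one way the computation might look (a minimal sketch; the `imgs` array below is a made-up placeholder standing in for the CIFAR10 training images loaded as floats in `[0, 1]`):

```python
import numpy as np

# imgs: float array of shape (N, 32, 32, 3) with values in [0, 1]
imgs = np.random.rand(50000, 32, 32, 3)  # placeholder standing in for CIFAR10

# per-channel mean and std, no loops: reduce over every axis except the channel axis
channel_mean = imgs.mean(axis=(0, 1, 2))
channel_std = imgs.std(axis=(0, 1, 2))
print(channel_mean, channel_std)
```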

Because these images are pretty small (the size is 32), we can use a larger batch size than usual [ [9:46](https://youtu.be/ondivPiwQho%3Ft%3D9m46s) ].

```
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog',
           'horse', 'ship', 'truck')
stats = (np.array([0.4914, 0.48216, 0.44653]),
         np.array([0.24703, 0.24349, 0.26159]))

num_workers = num_cpus()//2
bs = 256
sz = 32
```

Transforms [ [9:57](https://youtu.be/ondivPiwQho%3Ft%3D9m57s) ]: normally we use this standard set of side_on transforms for photos of normal objects. We are not going to use that here, because these images are so small that trying to rotate a 32 by 32 image a little bit introduces a lot of blocky distortion. The standard transforms people tend to use instead are a random horizontal flip, and then padding of 4 pixels (the size divided by 8) on each side. One thing that works really well is that, by default, fastai does not add the black padding that many other libraries do. Fastai takes the last 4 pixels of the existing photo, flips it, and reflects it, and we find we get much better results by using this reflection padding by default. Now that we have 40 by 40 images, this set of transforms will, at training time, randomly pick a 32 by 32 crop, so we get a little bit of variation but not heaps. We can use the usual `from_paths` to grab our data.

```
tfms = tfms_from_stats(stats, sz, aug_tfms=[RandomFlip()], pad=sz//8)
data = ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs)
```

Now we need an architecture, and we are going to create one which fits on a single screen [ [11:07](https://youtu.be/ondivPiwQho%3Ft%3D11m7s) ]. This is from scratch. We are using the predefined `Conv2d`, `BatchNorm2d`, and `LeakyReLU` modules, but we are not using any blocks or anything. The entire thing is on one screen, so if you are ever wondering "can I understand a modern, good-quality architecture?" — absolutely! Let's study this one.

```
def conv_layer(ni, nf, ks=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=ks, bias=False, stride=stride, padding=ks//2),
        nn.BatchNorm2d(nf, momentum=0.01),
        nn.LeakyReLU(negative_slope=0.1, inplace=True))
```

```
class ResLayer(nn.Module):
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni//2, ks=1)
        self.conv2 = conv_layer(ni//2, ni, ks=3)

    def forward(self, x): return x.add_(self.conv2(self.conv1(x)))
```

```
class Darknet(nn.Module):
    def make_group_layer(self, ch_in, num_blocks, stride=1):
        return [conv_layer(ch_in, ch_in*2, stride=stride)
               ] + [(ResLayer(ch_in*2)) for i in range(num_blocks)]

    def __init__(self, num_blocks, num_classes, nf=32):
        super().__init__()
        layers = [conv_layer(3, nf, ks=3, stride=1)]
        for i,nb in enumerate(num_blocks):
            layers += self.make_group_layer(nf, nb, stride=2-(i==1))
            nf *= 2
        layers += [nn.AdaptiveAvgPool2d(1), Flatten(), nn.Linear(nf, num_classes)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x): return self.layers(x)
```

The basic starting point for an architecture is that it is a stack of layers, and generally speaking there is going to be some kind of hierarchy of layers [ [11:51](https://youtu.be/ondivPiwQho%3Ft%3D11m51s) ]. At the very bottom level there are things like a convolutional layer and a batch norm layer, but any time you have a convolution you are probably going to have some standard sequence. Normally it is going to be:

1.  conv
2.  batch norm
3.  a non-linear activation (e.g. ReLU)

We will start out by deciding what our basic unit is going to be and define it in one function (`conv_layer`), so we don't have to worry about trying to keep everything consistent, and it makes everything a bit simpler.

**Leaky ReLU** [ [12:43](https://youtu.be/ondivPiwQho%3Ft%3D12m43s) ]:

![](../img/1_p1xIcvOk2F-EWWDgTTfW7Q.png)

The gradient of Leaky ReLU where _x_ < 0 varies, but it is a constant of about 0.1 or 0.01. The idea behind it is that when you are in the negative zone, you don't get a zero gradient, which would make it very hard to update. In practice, people have found Leaky ReLU more useful on smaller datasets and less useful on big datasets. But it is interesting that for the [YOLO v3](https://pjreddie.com/media/files/papers/YOLOv3.pdf) paper they used Leaky ReLU and got great performance from it. It rarely makes things worse and it often makes things better. So, if you need to create your own architecture, making your default "use Leaky ReLU" is probably not a bad idea.
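As a reminder of what the function itself does, here is a minimal sketch (written by hand for illustration, using the 0.1 negative slope from the `conv_layer` above rather than the built-in `nn.LeakyReLU`):

```python
import torch

def leaky_relu(x, negative_slope=0.1):
    # identity for x >= 0, a small constant slope for x < 0,
    # so the gradient in the negative region is negative_slope instead of 0
    return torch.where(x >= 0, x, negative_slope * x)

print(leaky_relu(torch.tensor([-2.0, -0.5, 0.0, 1.0])))
# tensor([-0.2000, -0.0500,  0.0000,  1.0000])
```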

You will notice that we didn't define a PyTorch module in `conv_layer`, we just did `nn.Sequential` [ [14:07](https://youtu.be/ondivPiwQho%3Ft%3D14m7s) ]. This is something that is really underutilized if you read other people's PyTorch code. People tend to write everything as a PyTorch module with `__init__` and `forward`, but if the thing you want is just a sequence of things one after the other, it is much more concise and easy to understand to make it a `Sequential`.

**Residual block** [ [14:40](https://youtu.be/ondivPiwQho%3Ft%3D14m40s) ]: As mentioned before, there is generally a hierarchy of units in most modern networks, and the next level up in this hierarchy for a ResNet is, as we now know, the ResBlock or residual block (see `ResLayer`). Back when we did CIFAR10 last time, we over-simplified this (cheated a little bit). We had `x` coming in, we put it through a `conv`, and then we added it back to `x` on the way out. In a real ResBlock, there are two of them. When we say "conv", we are using it as a shortcut for our `conv_layer` (conv, batch norm, ReLU).

![](../img/1_unH5bhpWH7HfLCG8WozNfA.png)

One interesting insight here is the number of channels in these convolutions [ [16:47](https://youtu.be/ondivPiwQho%3Ft%3D16m47s) ]. We have some `ni` coming in (some number of input channels/filters). The way the darknet folks set it up is that they make every one of these Res layers spit out the same number of channels that came in, and Jeremy likes that — that is why he used it in `ResLayer` — because it makes life simpler. The first conv halves the number of channels and the second conv doubles it again. So you have this funneling effect: 64 channels come in, get squished down to 32 channels by the first conv, and then taken back up to 64 channels again on the way out.

**Question:** Why is `inplace=True` set in the `LeakyReLU` [ [17:54](https://youtu.be/ondivPiwQho%3Ft%3D17m54s) ]? Thanks for asking! A lot of people forget this or don't know about it, but it is a really important memory trick. If you think about it, this `conv_layer` is the lowest-level thing, so pretty much everything in our ResNet, once it is all put together, is going to be lots of `conv_layer`s. If you don't have `inplace=True`, it is going to create a whole separate piece of memory for the output of the ReLU, so it is going to allocate a whole bunch of memory that is totally unnecessary. Another example: the original `forward` in `ResLayer` looked like this:

```
def forward(self, x): return x + self.conv2(self.conv1(x))
```

Hopefully some of you remember that nearly every function in PyTorch has an underscore-suffixed version which tells it to do the operation in place. `+` is equivalent to `add`, and the in-place version is `add_`, so this reduces memory usage:

```
def forward(self, x): return x.add_(self.conv2(self.conv1(x)))
```

These are handy little tricks. Jeremy forgot the `inplace=True` at first and was having to drop his batch size down to much lower amounts, which was driving him crazy — then he realized it was missing. You can also do the same with dropout, if you have dropout. Here are the things to look out for:

*   dropout
*   all the activation functions
*   any arithmetic operation

**Question**: In ResNets, why is bias usually set to False in `conv_layer` [ [19:53](https://youtu.be/ondivPiwQho%3Ft%3D19m53s) ]? Right after the `Conv` there is a `BatchNorm`. Remember, `BatchNorm` has 2 learnable parameters per activation — the thing you multiply by and the thing you add. If we had a bias in the `Conv` and then added another thing in `BatchNorm`, we would be adding two things, which is totally pointless — that is two weights where one would do. So if you have a BatchNorm after a `Conv`, you can either tell the `BatchNorm` not to include the additive bit, or, more easily, tell the `Conv` not to have a bias. There is no particular harm, but again it takes more memory, because that is more gradients it has to keep track of, so it is best avoided.

Another small trick: most people's `conv_layer`s have padding as a parameter [ [21:11](https://youtu.be/ondivPiwQho%3Ft%3D21m11s) ]. But generally speaking, you should be able to calculate the padding easily enough. If you have a kernel size of 3, then obviously it is going to overhang by one unit on each side, so we want padding of 1. Whereas if the kernel size is 1, then we don't need any padding. So in general, padding of kernel size "integer-divided" by 2 is what you need. There are some tweaks sometimes, but in this case it works perfectly well. Again, try to simplify your code by having the computer calculate stuff for you rather than having to do it yourself.

![](../img/1_Pc3_ut-tOnPm5FLdYqRrOA.png)

Another thing about the two `conv_layer`s [ [22:14](https://youtu.be/ondivPiwQho%3Ft%3D22m14s) ]: we have this idea of a bottleneck (reducing the number of channels and then increasing them again), and also the kernel sizes being used. The first one is a 1 by 1 `Conv`. What actually happens in a 1 by 1 conv? If we have a 4 by 4 grid with 32 filters/channels, we step through the grid one cell at a time, and the conv's kernel looks like the one in the middle. When we talk about kernel size we never mention the last piece — but say it is 1 by 1 by 32, because that part of the filter gets summed over. The kernel gets placed on the first cell (in yellow) and we take the dot product over those 32 deep bits, which gives us our first output. Then we move it to the second cell and get the second output. So there will be one dot product for every point in the grid. It allows us to change the dimensionality in the channel dimension in whatever way we want. We are creating `ni//2` filters, so we will have `ni//2` dot products, each of which is basically a weighted average of the input channels. With very little computation, it lets us add an extra step of computation and non-linearity. That is the cool trick — take advantage of these 1 by 1 convs, create the bottleneck, and then pull it back out again with a 3 by 3 conv, which takes full advantage of the 2D nature of the input. The 1 by 1 conv, by contrast, doesn't take advantage of that at all.
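One way to convince yourself of what a 1 by 1 conv does is a tiny shape experiment (a sketch, not from the lesson — the tensor sizes below are made up to match the 4x4, 32-channel example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 4, 4)                     # batch of 1, 32 channels, 4x4 grid
conv1x1 = nn.Conv2d(32, 16, kernel_size=1, bias=False)

out = conv1x1(x)
print(out.shape)                                  # torch.Size([1, 16, 4, 4]) - grid unchanged, channels mixed

# each output value is just a weighted sum across the 32 input channels at that grid point
manual = (x[0, :, 0, 0] * conv1x1.weight[0, :, 0, 0]).sum()
print(torch.allclose(manual, out[0, 0, 0, 0]))    # True
```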

![](../img/1_-lUndrOFGp_FdJ27T_Px-g.png)

There is not much in these two lines of code, but they are a really great test of your understanding and intuition [ [25:17](https://youtu.be/ondivPiwQho%3Ft%3D25m17s) ] — why does it work? Why do the tensors line up? Why do the dimensions all match nicely? Why is it a good idea? What is it really doing? It is a really good thing to fiddle around with. Maybe create some small versions in a Jupyter Notebook, run them yourself, and see what the inputs and outputs are. Really get a feel for it. Once you have done that, you can play around with different variations.

One of the really [under-appreciated](https://youtu.be/ondivPiwQho%3Ft%3D26m9s) papers [ [26:09](https://youtu.be/ondivPiwQho%3Ft%3D26m9s) ] is [Wide Residual Networks](https://arxiv.org/abs/1605.07146). It is a really simple paper, but what they did was fiddle around with these two lines of code:

*   What if we did `ni*2` instead of `ni//2`?
*   What if we added a `conv3`?

They came up with a simple notation for defining what those two lines of code can look like and showed lots of experiments. What they showed is that this bottleneck approach of decreasing the number of channels, which is almost universal in ResNets, is probably not a good idea. In fact, from the experiments, it is definitely not a good idea. What the bottleneck does is let you create really deep networks — the people who created ResNet got particularly famous for creating a 1001-layer network. But the thing about 1001 layers is that you can't calculate layer 2 until layer 1 is finished, and you can't calculate layer 3 until layer 2 is finished — it is sequential. GPUs don't like sequential. So what they showed is that if you have fewer layers but more calculation per layer, things work better — and one easy way to do that is to remove the `//2`, with no other changes:

![](../img/1_89Seymgfa5Bdx1_EXBW_lA.png)

Try this at home — a rough sketch of the tweak is below. Try running CIFAR and see what happens. Or even multiply it by 2, or fiddle around. That lets your GPU do more work, and it is really interesting because the vast majority of papers that talk about the performance of different architectures never actually time how long it takes to run a batch through them. They say "this one requires X floating-point operations per batch", but then they never bother to run it like a proper experimentalist and find out whether it is faster or slower. Many of the really famous architectures now turn out to be slow as molasses, take craploads of memory, and are just useless, because the researchers never bothered to check whether they are fast and whether they fit in RAM with normal batch sizes. So the Wide ResNet paper is unusual in that it actually times how long things take, as does the YOLO v3 paper, which came to many of the same insights. The YOLO v3 authors may have missed the Wide ResNet paper — they reached many of the same conclusions, but Jeremy is not sure they cited it, so they may not have been aware that all that work had been done. It is great to see people actually timing things and noticing what actually makes sense.
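As a rough sketch of the "remove the `//2`" tweak being described (this is not the exact code Jeremy ran, just the bottleneck taken out of the `ResLayer` defined above; it assumes `conv_layer` from earlier in the notebook):

```python
import torch.nn as nn

class WideResLayer(nn.Module):
    # same as ResLayer above, but without the bottleneck: both convs keep ni channels
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni, ks=1)
        self.conv2 = conv_layer(ni, ni, ks=3)

    def forward(self, x):
        return x.add_(self.conv2(self.conv1(x)))
```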

**Question**: What are your thoughts on SELU (scaled exponential linear units)? [ [29:44](https://youtu.be/ondivPiwQho%3Ft%3D29m44s) ] [SELU](https://youtu.be/ondivPiwQho%3Ft%3D29m44s) is largely for fully connected layers, and it allows you to get rid of batch norm. The basic idea is that if you use this particular activation function, it is self-normalizing. Self-normalizing means it will always stay at unit standard deviation and zero mean, and therefore you don't need batch norm. It hasn't really gone anywhere, and the reason is that it is incredibly finicky — you have to use a very specific initialization, otherwise it doesn't start out with exactly the right standard deviation and mean. It is very hard to use with things like embeddings; if you do, you have to use a particular kind of embedding initialization that doesn't make sense for embeddings. And after all that work, which is very hard to get right, if you do finally get it right, what is the point? Well, you have managed to get rid of some batch norm layers which weren't really hurting you anyway. It is interesting because with the SELU paper, the main reason people noticed it is that it was created by the inventor of LSTMs and it has a huge mathematical appendix. So people thought "lots of maths from a famous guy — it must be great!" But in practice, Jeremy doesn't see anybody using it to get any state-of-the-art results or win any competitions.

`Darknet.make_group_layer` contains a bunch of `ResLayer`s [ [31:28](https://youtu.be/ondivPiwQho%3Ft%3D31m28s) ]. A `group_layer` is going to have some number of channels/filters coming in. We double the number of incoming channels by using the standard `conv_layer`. Optionally, we halve the grid size by using a stride of 2. Then we do a whole bunch of ResLayers — we can choose how many (2, 3, 8, etc.), because remember ResLayers don't change the grid size and they don't change the number of channels, so you can add as many as you like without causing any problems. It will use more computation and more RAM, but there is no other reason you can't add as many as you like. So a `group_layer` ends up doubling the number of channels, because the initial convolution doubles the number of channels, and depending on what we pass in as `stride`, it may also halve the grid size if we set `stride=2`. Then we can do a whole bunch of Res block computations, as many as we like.

To define our `Darknet`, we are going to pass in something that looks like this [ [33:13](https://youtu.be/ondivPiwQho%3Ft%3D33m13s) ]:

```
m = Darknet([1, 2, 4, 6, 3], num_classes=10, nf=32)
m = nn.DataParallel(m, [1,2,3])
```

What this says is: create five group layers. The first will contain 1 extra ResLayer, the second will contain 2, then 4, 6, 3, and we want to start with 32 filters. The first of those ResLayers will contain 32 filters, and there will be just one extra ResLayer. The second one will double the number of filters, because that is what we do each time we have a new group layer. So the second will have 64, then 128, 256, 512, and that will be it. Nearly all of the network is going to be those bunches of layers, and remember, every one of those group layers also has one convolution at the start. Then all we have, before any of that happens, is one convolutional layer at the very start, and at the very end our standard adaptive average pooling, flatten, and a linear layer to create the number of classes at the end. To summarize [ [34:44](https://youtu.be/ondivPiwQho%3Ft%3D34m44s) ]: one convolution at one end, adaptive pooling and one linear layer at the other end, and in the middle these group layers, each consisting of a convolutional layer followed by `n` ResLayers.

**Adaptive average pooling** [ [35:02](https://youtu.be/ondivPiwQho%3Ft%3D35m2s) ]: Jeremy has mentioned this a number of times, but he has yet to see any code out there, anywhere, that uses adaptive average pooling. Everything he has seen is written like `nn.AvgPool2d(n)`, where `n` is a particular number — which means it is now tied to a particular image size, and that is definitely not what you want. So most people still seem to think that a particular architecture is tied to a particular size. That is a huge problem when people believe it, because it really limits their ability to use smaller sizes to kick off their modeling or to use smaller sizes for experimenting.

**Sequential** [ [35:53](https://youtu.be/ondivPiwQho%3Ft%3D35m53s) ]: A nice way to create architectures is to start by creating a list; in this case it is a list with just one `conv_layer` in it, and `make_group_layer` returns another list. Then we can use `+=` to append that list to the previous list, and do the same again for another list containing the `AdaptiveAvgPool2d`. Finally, we call `nn.Sequential` on all of those layers. Now the `forward` is just `self.layers(x)`.

![](../img/1_nr69J3I7lNPlsblmrLt15A.png)

This is a nice picture of how to make your architecture as simple as possible. There is a lot you could fiddle with here. You could parameterize the divisor of `ni`, making it a number you pass in, so you could pass in different numbers — maybe do times 2 instead. You could also pass in things that change the kernel size or change the number of conv layers. Jeremy has a version he is going to run for you which implements all of the different parameters in the Wide ResNet paper, so he could fiddle around and see what worked well.

![](../img/1_mR3BupmhN_XGo34Qvm58Sg.png)

```
lr = 1.3

learn = ConvLearner.from_model_data(m, data)
learn.crit = nn.CrossEntropyLoss()
learn.metrics = [accuracy]
wd = 1e-4
```

```
%time learn.fit(lr, 1, wds=wd, cycle_len=30, use_clr_beta=(20, 20, 0.95, 0.85))
```

Once we have that, we can use `ConvLearner.from_model_data` to take our PyTorch module and model data object and turn them into a learner [ [37:08](https://youtu.be/ondivPiwQho%3Ft%3D37m8s) ]. Give it a criterion, add a metric if we like, and then we can fit and away we go.

**Question**: Can you explain adaptive average pooling? How does setting it to 1 work [ [37:25](https://youtu.be/ondivPiwQho%3Ft%3D37m25s) ]? Sure. Normally when we do average pooling, say we have a 4x4 grid and we do `avgpool((2, 2))` [ [40:35](https://youtu.be/ondivPiwQho%3Ft%3D40m35s) ]. That creates a 2x2 area (blue below) and takes the average of those four cells. If we pass in `stride=1`, the next one is the 2x2 shown in green, and we take the average of that. So that is what a normal 2x2 average pooling would be. If we didn't have any padding, that would spit out a 3x3. If we wanted 4x4 out, we could add padding.

![](../img/1_vTPZGULUC12lQplYtkGZuQ.png)

What if we wanted 1x1? Then we could say `avgpool((4,4), stride=1)`, which would do the 4x4 in yellow and average the whole lot, resulting in 1x1. But that is just one way to do it. Rather than specifying the size of the pooling filter, why don't we instead say "I don't care what the size of the input grid is — I always want one by one"? That is what you are saying with `adap_avgpool(1)`. In this case, you don't specify the size of the pooling filter, you specify the size of the output we want. We want something that is one by one. If you put in a single integer `n`, it assumes you mean `n` by `n`. So adaptive average pooling with 1 on a 4x4 grid is the same as average pooling (4,4). If it were a 7x7 grid, it would be the same as average pooling (7,7). It is the same operation; it is just a way of expressing it so that regardless of the input, we get an output of the size we want.
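A quick way to see that equivalence is a small sketch with made-up tensors:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 4, 4)   # e.g. a final feature map: 512 channels on a 4x4 grid

fixed    = nn.AvgPool2d(4)(x)           # pooling window tied to the 4x4 input size
adaptive = nn.AdaptiveAvgPool2d(1)(x)   # "give me a 1x1 output, whatever the input size"

print(fixed.shape, adaptive.shape)      # both torch.Size([1, 512, 1, 1])
print(torch.allclose(fixed, adaptive))  # True

# the adaptive version keeps working when the image (and hence grid) size changes
print(nn.AdaptiveAvgPool2d(1)(torch.randn(1, 512, 7, 7)).shape)  # torch.Size([1, 512, 1, 1])
```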

**DAWNBench** [ [37:43](https://youtu.be/ondivPiwQho%3Ft%3D37m43s) ]: Let's see how we go with our simple network against these state-of-the-art results. Jeremy has it ready to go. We took all this stuff and put it into a simple Python script, and he modified some of the parameters he mentioned to create something he calls a `wrn_22` network — which doesn't officially exist anywhere, but has a bunch of changes to the parameters we talked about, based on Jeremy's experiments. It has a bunch of cool stuff like:

*   Leslie Smith's one cycle
*   a half-precision floating-point implementation

![](../img/1_OhRBmhkrMWXgMpDHKZGJJQ.png)

This is going to run on an AWS p3, which has 8 GPUs — Volta architecture GPUs that have special support for half-precision floating point. Fastai is the first library to actually integrate the Volta-optimized half-precision floating point into the library, so you can just do `learn.half()` and get that support automatically. It is also the first to integrate one cycle.

What it actually does is use PyTorch's multi-GPU support [ [39:35](https://youtu.be/ondivPiwQho%3Ft%3D39m35s) ]. Since there are eight GPUs, it is going to fire off eight separate Python processes, each of which trains on part of the data, and then at the end it passes the gradient updates back to the master process, which integrates them all together. So you will see lots of progress bars pop up together.

![](../img/1_7JkxOliX34lgAkzwGbl1tA.png)

When you do it this way, you can see it is training at three to four seconds per epoch. Whereas when Jeremy was training it earlier, he was getting 30 seconds per epoch. So doing it this way, we can train things about 10 times faster, which is pretty cool.

**Checking the status** [ [43:19](https://youtu.be/ondivPiwQho%3Ft%3D43m19s) ]:

![](../img/1_fgY5v-w-44eIBkkS4fPqEA.png)

Done! We got to 94%, and it took 3 minutes and 11 seconds. The previous state of the art was 1 hour 7 minutes. Was it worth fiddling around with those parameters and learning a bit about how these architectures actually work, rather than just using what came out of the box? Well, holy crap. We just used a publicly available instance (a spot instance, so it cost us $8 an hour — for 3 minutes, about 40 cents) to train this from scratch 20 times faster than anybody has ever done it before. So it is one of the craziest state-of-the-art results we have seen, and we have seen a few now, but this one just blew the previous result out of the water. This is partly thanks to fiddling around with those architecture parameters, but mainly, frankly, thanks to using Leslie Smith's one cycle. As a reminder of what it is doing [ [44:35](https://youtu.be/ondivPiwQho%3Ft%3D44m35s) ]: for the learning rate, it creates an upward path that is equally long as the downward path, so it is a true triangular cyclical learning rate (CLR). As usual, you get to pick the ratio between x and y (i.e. starting LR / peak LR).

![](../img/1_5lQZ0Jln6Cn29rd_9Bvzfw.png)

In this case, we picked 50 for that ratio, so we started with a much smaller learning rate. Then it has this cool idea where you say what percentage of your epochs is spent going from the bottom of the triangle all the way down to practically nothing — that is the second number. So 15% of the batches are spent going from the bottom of our triangle even further down.

![](../img/1_E0gxTQ5sf4XSceo9pWKxWQ.png)

That is not the only thing one cycle does — we also have momentum. Momentum goes from .95 to .85. In other words, when the learning rate is really low, we use a lot of momentum, and when the learning rate is really high, we use very little momentum, which makes a lot of sense — but until Leslie Smith showed this in the paper, Jeremy had never seen anybody do it before. It is a really cool trick. You can now use it via the `use-clr-beta` parameter in fastai ([Sylvain's forum post](http://forums.fast.ai/t/using-use-clr-beta-and-new-plotting-tools/14702)), and you should be able to replicate this state-of-the-art result. You can use it on your own computer or on Paperspace — the only thing you won't get is the multi-GPU piece, but that makes it a bit easier to train anyway.

**Question**: `make_group_layer` contains stride equal to 2, so that means stride is 1 for the first layer and 2 for the rest. What is the logic behind it? Usually the strides I have seen are odd [ [46:52](https://youtu.be/ondivPiwQho%3Ft%3D46m52s) ]. Strides are generally one or two — I think you are thinking of kernel sizes. Stride=2 means I jump across two, which means you halve the grid size. So I think you may have got strides and kernel sizes confused. If you have a stride of one, the grid size doesn't change. If you have a stride of two, it does. In this case, because it is CIFAR10 and 32 by 32 is small, we don't get to halve the grid size very often, because pretty quickly we would run out of cells. That is why the first layer has a stride of one, so we don't decrease the grid size straight away. It is kind of a nice way of doing it, and it is also why we have a small number as the first number in `Darknet([1, 2, 4, 6, 3], …)`. We can start out with not too much computation on the big grid, and then gradually do more and more computation as the grid gets smaller and smaller, because the computations on the smaller grid take less time.
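A tiny shape check of the stride-versus-kernel-size point (the tensor sizes here are made up):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 32, 32)  # 32 channels on a 32x32 grid

same = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)(x)
half = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)(x)

print(same.shape)  # torch.Size([1, 64, 32, 32]) - stride 1 keeps the grid size
print(half.shape)  # torch.Size([1, 64, 16, 16]) - stride 2 halves it
```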

### Generative Adversarial Networks (GANs) [ [48:49](https://youtu.be/ondivPiwQho%3Ft%3D48m49s) ]

*   [Wasserstein GAN](https://arxiv.org/abs/1701.07875)
*   [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/abs/1511.06434)

We are going to talk about generative adversarial networks, also known as GANs, and specifically we are going to focus on the Wasserstein GAN paper, whose authors include Soumith Chintala, who went on to create PyTorch. The Wasserstein GAN (WGAN) was heavily influenced by the deep convolutional generative adversarial networks paper, which Soumith was also involved with. It is a really interesting paper to read. A lot of it looks like this:

![](../img/1_9zXXZvCNC8_eF_V9LJIolw.png)

The good news is you can skip those bits, because there is also a part that looks like this:

![](../img/1_T90I-RKpUzV7yyo_TOqowQ.png)

A lot of papers have a theoretical section that seems to be there entirely to get past the reviewers' need for theory. That is not true of the WGAN paper. The theory bit is actually interesting — you don't need to know it to use the method, but if you want to learn about some cool ideas and see the thinking behind why this particular algorithm looks the way it does, it is absolutely fascinating. Before this paper came out, Jeremy didn't know of anybody who had studied the math it is based on, so everybody had to learn the math. The paper does a good job of laying out all the pieces (you have to do a bunch of reading yourself). So if you are interested in digging into the deeper math behind some paper and seeing what it is like to study it, I would pick this one, because at the end of the theory section you will say "I can now see why they made the algorithm the way it is."

The basic idea of a GAN is that it is a generative model [ [51:23](https://youtu.be/ondivPiwQho%3Ft%3D51m23s) ]. It is something that creates sentences, creates images, or generates something. It tries to create things which are very hard to tell apart from the real thing. So a generative model could be used to face-swap a video — the very controversial area of deepfakes and fake pornography. It could be used to fake somebody's voice. It could be used to fake the answer to a medical question — although in that case it is not really a fake, it could be a generated answer to a medical question that is actually a good answer, so you are generating language. You could generate a caption for an image, for example. So generative models have lots of interesting applications. But generally speaking, they need to be good enough that, for example, if you are using one to automatically create a new scene for Carrie Fisher in the next Star Wars movie and she is not around to play that part anymore, you want to generate an image of her that looks the same — it has to fool the Star Wars audience into thinking "okay, that doesn't look like some weird Carrie Fisher, that looks like the real Carrie Fisher." Or if you are trying to generate an answer to a medical question, you want to generate English that reads nicely and clearly and sounds authoritative and meaningful. The idea of a generative adversarial network is that we are not only going to create a generative model to generate, say, images; we are also going to create a second model that tries to pick which ones are real and which ones are generated (we will call those "fake"). So we have a generator that creates our fake content, and a discriminator that tries to get good at recognizing which ones are real and which ones are fake. There are going to be two models, and they are going to be adversarial, meaning the generator is going to try to keep getting better at fooling the discriminator into thinking that fake is real, and the discriminator is going to try to keep getting better at discriminating between the real and the fake. So they are going to go head to head. It is basically as easy as Jeremy just described [ [54:14](https://youtu.be/ondivPiwQho%3Ft%3D54m14s) ]:

*   We are going to build two models in PyTorch.
*   We are going to create a training loop that first says: the loss function for the discriminator is "can you tell the difference between real and fake?", then update its weights.
*   We are going to create a loss function for the generator that says "can you generate something that fools the discriminator?", and update its weights from that loss.
*   We are going to loop through that a few times and see what happens.

#### Looking at the code [ [54:52](https://youtu.be/ondivPiwQho%3Ft%3D54m52s) ]

[笔记本](https://github.com/fastai/fastai/blob/master/courses/dl2/wgan.ipynb)

There are lots of different things you can do with GANs. We are going to do something that is a bit boring but easy to understand, and it is kind of cool that it is even possible: we are going to generate some pictures from nothing. We are just going to get it to draw some pictures — specifically, pictures of bedrooms. Hopefully you will get a chance to play with your own datasets for this during the week. If you pick a dataset that is very varied, like ImageNet, and get a GAN to try to create ImageNet pictures, it tends not to do very well, because it is not clear enough what kind of picture you want. So it is better to give it, for example — there is a dataset called [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html), which is pictures of celebrity faces, and that works great with GANs. You create really clear celebrity faces that don't actually exist. The bedroom dataset is also a good one — lots of pictures of the same kind of thing.

There is something called the LSUN scene classification dataset [ [55:55](https://youtu.be/ondivPiwQho%3Ft%3D55m55s) ].

```
from fastai.conv_learner import *
from fastai.dataset import *
import gzip
```

Download the bedroom category of the LSUN scene classification dataset, unzip it, and convert it to jpg files (the script folder is in the `dl2` folder):

```
 curl 'http://lsun.cs.princeton.edu/htbin/download.cgi?tag=latest&category=bedroom&set=train' -o bedroom.zip 
```

```
 unzip bedroom.zip 
```

```
 pip install lmdb 
```

```
 python lsun-data.py {PATH}/bedroom_train_lmdb --out_dir {PATH}/bedroom 
```

This isn't tested on Windows — if it doesn't work, you can use a Linux box to convert the files and then copy them over. Alternatively, you can download [this 20% sample](https://www.kaggle.com/jhoward/lsun_bedroom) from Kaggle datasets.

```
PATH = Path('data/lsun/')
IMG_PATH = PATH/'bedroom'
CSV_PATH = PATH/'files.csv'
TMP_PATH = PATH/'tmp'
TMP_PATH.mkdir(exist_ok=True)
```

In this case, it is easier to go the CSV route for handling our data. So we generate a CSV with the list of files we want and a fake label "0", because we don't really have labels for these at all. One CSV file contains everything in the bedroom dataset, and another one contains a random 10%. It is nice to do that, because then we can use the sample most of the time when we are experimenting — there are well over a million files, and even just reading in the list takes a while.

```
files = PATH.glob('bedroom/**/*.jpg')

with CSV_PATH.open('w') as fo:
    for f in files: fo.write(f'{f.relative_to(IMG_PATH)},0\n')
```

```
# Optional - sampling a subset of files
CSV_PATH = PATH/'files_sample.csv'
```

```
files = PATH.glob('bedroom/**/*.jpg')

with CSV_PATH.open('w') as fo:
    for f in files:
        if random.random()<0.1:
            fo.write(f'{f.relative_to(IMG_PATH)},0\n')
```

This will look pretty familiar [ [57:10](https://youtu.be/ondivPiwQho%3Ft%3D57m10s) ]. This is from before Jeremy realized that sequential models are much better. So if you compare this to the previous conv block with a sequential model, there are a lot more lines of code here — but it does the same thing of conv, ReLU, batch norm.

```
class ConvBlock(nn.Module):
    def __init__(self, ni, no, ks, stride, bn=True, pad=None):
        super().__init__()
        if pad is None: pad = ks//2//stride
        self.conv = nn.Conv2d(ni, no, ks, stride, padding=pad, bias=False)
        self.bn = nn.BatchNorm2d(no) if bn else None
        self.relu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        x = self.relu(self.conv(x))
        return self.bn(x) if self.bn else x
```

The first thing we are going to do is to build a discriminator [ [57:47](https://youtu.be/ondivPiwQho%3Ft%3D57m47s) ]. A discriminator is going to receive an image as an input, and it's going to spit out a number. The number is meant to be lower if it thinks this image is real. Of course "what does it do for a lower number" thing does not appear in the architecture, that will be in the loss function. So all we have to do is to create something that takes an image and spits out a number. A lot of this code is borrowed from the original authors of this paper, so some of the naming scheme is different to what we are used to. But it looks similar to what we had before. We start out with a convolution (conv, ReLU, batch norm). Then we have a bunch of extra conv layers — this is not going to use a residual so it looks very similar to before a bunch of extra layers but these are going to be conv layers rather than res layers. At the end, we need to append enough stride 2 conv layers that we decrease the grid size down to no bigger than 4x4. So it's going to keep using stride 2, divide the size by 2, and repeat till our grid size is no bigger than 4. This is quite a nice way of creating as many layers as you need in a network to handle arbitrary sized images and turn them into a fixed known grid size.

**Question** : Does GAN need a lot more data than say dogs vs. cats or NLP? Or is it comparable [ [59:48](https://youtu.be/ondivPiwQho%3Ft%3D59m48s) ]? Honestly, I am kind of embarrassed to say I am not an expert practitioner in GANs. The stuff I teach in part one is things I am happy to say I know the best way to do these things and so I can show you state-of-the-art results like we just did with CIFAR10 with the help of some of the students. I am not there at all with GANs so I am not quite sure how much you need. In general, it seems it needs quite a lot but remember the only reason we didn't need too much in dogs and cats is because we had a pre-trained model and could we leverage pre-trained GAN models and fine tune them, probably. I don't think anybody has done it as far as I know. That could be really interesting thing for people to think about and experiment with. Maybe people have done it and there is some literature there we haven't come across. I'm somewhat familiar with the main pieces of literature in GANs but I don't know all of it, so maybe I've missed something about transfer learning in GANs. But that would be the trick to not needing too much data.

**Question** : So is the huge speed-up a combination of one cycle learning rate and momentum annealing, plus the eight-GPU parallel training and the half precision? Is it only possible to do the half-precision calculation with a consumer GPU? Another question, why is the calculation 8 times faster from single to half precision, while from double to single it is only 2 times faster [ [1:01:09](https://youtu.be/ondivPiwQho%3Ft%3D1h1m9s) ]? Okay, so the CIFAR10 result, it's not 8 times faster from single to half. It's about 2 or 3 times as fast from single to half. NVIDIA claims about the flops performance of the tensor cores, academically correct, but in practice meaningless because it really depends on what calls you need for what piece — so about 2 or 3x improvement for half. So the half precision helps a bit, the extra GPUs helps a bit, the one cycle helps an enormous amount, then another key piece was the playing around with the parameters that I told you about. So reading the wide ResNet paper carefully, identifying the kinds of things that they found there, and then writing a version of the architecture you just saw that made it really easy for us to fiddle around with parameters, staying up all night trying every possible combination of different kernel sizes, numbers of kernels, number of layer groups, size of layer groups. And remember, we did a bottleneck but actually we tended to focus instead on widening so we increase the size and then decrease it because it takes better advantage of the GPU. So all those things combined together, I'd say the one cycle was perhaps the most critical but every one of those resulted in a big speed-up. That's why we were able to get this 30x improvement over the state-of-the-art CIFAR10. We have some ideas for other things — after this DAWN bench finishes, maybe we'll try and go even further to see if we can beat one minute one day. That'll be fun.

```
class DCGAN_D(nn.Module):
    def __init__(self, isize, nc, ndf, n_extra_layers=0):
        super().__init__()
        assert isize % 16 == 0, "isize has to be a multiple of 16"
        self.initial = ConvBlock(nc, ndf, 4, 2, bn=False)
        csize,cndf = isize/2,ndf
        self.extra = nn.Sequential(*[ConvBlock(cndf, cndf, 3, 1)
                                     for t in range(n_extra_layers)])

        pyr_layers = []
        while csize > 4:
            pyr_layers.append(ConvBlock(cndf, cndf*2, 4, 2))
            cndf *= 2; csize /= 2
        self.pyramid = nn.Sequential(*pyr_layers)

        self.final = nn.Conv2d(cndf, 1, 4, padding=0, bias=False)

    def forward(self, input):
        x = self.initial(input)
        x = self.extra(x)
        x = self.pyramid(x)
        return self.final(x).mean(0).view(1)
```

So here is our discriminator [ [1:03:37](https://youtu.be/ondivPiwQho%3Ft%3D1h3m37s) ]. The important thing to remember about an architecture is it doesn't do anything other than have some input tensor size and rank, and some output tensor size and rank. As you see the last conv has one channel. This is different from what we are used to because normally our last thing is a linear block. But our last layer here is a conv block. It only has one channel but it has a grid size of something around 4x4 (no more than 4x4). So we are going to spit out (let's say it's 4x4) a 4 by 4 by 1 tensor. What we then do is we then take the mean of that. So it goes from 4x4x1 to a scalar. This is kind of like the ultimate adaptive average pooling because we have something with just one channel and we take the mean. So this is a bit different — normally we first do average pooling and then we put it through a fully connected layer to get our one thing out. But this is getting one channel out and then taking the mean of that. Jeremy suspects that it would work better if we did the normal way, but he hasn't tried it yet and he doesn't really have a good enough intuition to know whether he is missing something — but it will be an interesting experiment to try if somebody wants to stick an adaptive average pooling layer and a fully connected layer afterwards with a single output.

So that's a discriminator. Let's assume we already have a generator — somebody says "okay, here is a generator which generates bedrooms. I want you to build a model that can figure out which ones are real and which ones aren't". We are going to take the dataset and label a bunch of images which are fake bedrooms from the generator, and a bunch of images of real bedrooms from the LSUN dataset, and stick a 1 or a 0 on each one. Then we'll try to get the discriminator to tell the difference. So that is going to be simple enough. But we haven't been given a generator. We need to build one. We haven't talked about the loss function yet — we are going to assume that there's some loss function that does this thing.

#### **Generator** [ [1:06:15](https://youtu.be/ondivPiwQho%3Ft%3D1h6m15s) ]

A generator is also an architecture which doesn't do anything by itself until we have a loss function and data. But what are the ranks and sizes of the tensors? The input to the generator is going to be a vector of random numbers. In the paper, they call that the "prior." How big? We don't know. The idea is that a different bunch of random numbers will generate a different bedroom. So our generator has to take as input a vector, stick it through sequential models, and turn it into a rank 4 tensor (rank 3 without the batch dimension) — height by width by 3. So in the final step, `nc` (number of channels) is going to have to end up being 3 because it's going to create a 3 channel image of some size.

```
class DeconvBlock(nn.Module):
    def __init__(self, ni, no, ks, stride, pad, bn=True):
        super().__init__()
        self.conv = nn.ConvTranspose2d(ni, no, ks, stride, padding=pad, bias=False)
        self.bn = nn.BatchNorm2d(no)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.conv(x))
        return self.bn(x) if self.bn else x
```

```
class DCGAN_G(nn.Module):
    def __init__(self, isize, nz, nc, ngf, n_extra_layers=0):
        super().__init__()
        assert isize % 16 == 0, "isize has to be a multiple of 16"

        cngf, tisize = ngf//2, 4
        while tisize!=isize: cngf*=2; tisize*=2
        layers = [DeconvBlock(nz, cngf, 4, 1, 0)]

        csize, cndf = 4, cngf
        while csize < isize//2:
            layers.append(DeconvBlock(cngf, cngf//2, 4, 2, 1))
            cngf //= 2; csize *= 2

        layers += [DeconvBlock(cngf, cngf, 3, 1, 1) for t in range(n_extra_layers)]
        layers.append(nn.ConvTranspose2d(cngf, nc, 4, 2, 1, bias=False))
        self.features = nn.Sequential(*layers)

    def forward(self, input): return F.tanh(self.features(input))
```

**Question** : In ConvBlock, is there a reason why batch norm comes after ReLU (ie `self.bn(self.relu(…))` ) [ [1:07:50](https://youtu.be/ondivPiwQho%3Ft%3D1h7m50s) ]? I would normally expect to go ReLU then batch norm [ [1:08:23](https://youtu.be/ondivPiwQho%3Ft%3D1h8m23s) ]; that is actually the order that makes sense to Jeremy. The order we had in the darknet was what they used in the darknet paper, so everybody seems to have a different order of these things. In fact, most people for CIFAR10 have a different order again, which is batch norm → ReLU → conv, which is a quirky way of thinking about it, but it turns out that often for residual blocks that works better. That is called a "pre-activation ResNet." There are a few blog posts out there where people have experimented with different orderings of those things, and it seems to depend a lot on what specific dataset it is and what you are doing with it — although the difference in performance is small enough that you won't care unless it's for a competition.

#### Deconvolution [ [1:09:36](https://youtu.be/ondivPiwQho%3Ft%3D1h9m36s) ]

So the generator needs to start with a vector and end up with a rank 3 tensor. We don't really know how to do that yet. We need to use something called a “deconvolution” and PyTorch calls it transposed convolution — same thing, different name. Deconvolution is something which rather than decreasing the grid size, it increases the grid size. So as with all things, it's easiest to see in an Excel spreadsheet.

Here is a convolution. We start, let's say, with a 4 by 4 grid cell with a single channel. Let's put it through a 3 by 3 kernel with a single output filter. So we have a single channel in, a single filter kernel, so if we don't add any padding, we are going to end up with 2 by 2. Remember, the convolution is just the sum of the product of the kernel and the appropriate grid cell [ [1:11:09](https://youtu.be/ondivPiwQho%3Ft%3D1h11m9s) ]. So there is our standard 3 by 3 conv, one channel, one filter.

![](../img/1_FqkDO90rEDwa_CgxTAlyIQ.png)

So the idea now is we want to go the opposite direction [ [1:11:25](https://youtu.be/ondivPiwQho%3Ft%3D1h11m25s) ]. We want to start with our 2 by 2 and we want to create a 4 by 4. Specifically we want to create the same 4 by 4 that we started with. And we want to do that by using a convolution. How would we do that?

If we have a 3 by 3 convolution, then if we want to create a 4 by 4 output, we are going to need to create this much padding:

![](../img/1_flOxFmF21kUyLpPDJ6kr-w.png)

Because with this much padding, we are going to end up with 4 by 4. So let's say our convolutional filter was just a bunch of zeros then we can calculate our error for each cell just by taking this subtraction:

![](../img/1_HKcU-wgdLPgxd5kJfEkmlg.png)

Then we can get the sum of absolute values (L1 loss) by summing up the absolute values of those errors:

![](../img/1_mjLTOFUXneXGeER4hKj4Kw.png)

So now we could use optimization — in Excel it's called "solver" — to do a gradient descent. So we will set the Total cell equal to minimum and we'll try and reduce our loss by changing our filter. You can see it's come up with a filter such that Result is almost like Data. It's not perfect, and in general, you can't assume that a deconvolution can exactly create the same exact thing you want because there is just not enough information — there are 9 things in the filter and 16 things in the result. But it's made a pretty good attempt. So this is what a deconvolution looks like — a stride 1, 3x3 deconvolution on a 2x2 grid cell input.

![](../img/1_QzJe8qhpZl6hfKAB0Zw-vQ.png)

**Question** : How difficult is it to create a discriminator to identify fake news vs. real news [ [1:13:43](https://youtu.be/ondivPiwQho%3Ft%3D1h13m43s) ]? You don't need anything special — that's just a classifier. So you would just use the NLP classifier from previous class and lesson 4. In that case, there is no generative piece, so you just need a dataset that says these are the things that we believe are fake news and these are the things we consider to be real news and it should actually work very well. To the best of our knowledge, if you try it you should get as good a result as anybody else has got — whether it's good enough to be useful in practice, Jeremy doesn't know. The best thing you could do at this stage would be to generate a kind of a triage that says these things look pretty sketchy based on how they are written and then some human could go in and fact check them. NLP classifier and RNN can't fact-check things but it could recognize that these are written in that kind of highly popularized style which often fake news is written in so maybe these ones are worth paying attention to. That would probably be the best you could hope for without drawing on some kind of external data sources. But it's important to remember the discriminator is basically just a classifier and you don't need any special techniques beyond what we've already learned to do NLP classification.

#### ConvTranspose2d [ [1:16:00](https://youtu.be/ondivPiwQho%3Ft%3D1h16m) ]

To do deconvolution in PyTorch, just say:

`nn.ConvTranspose2d(ni, no, ks, stride, padding=pad, bias=False)`

*   `ni` : number of input channels
*   `no` : number of output channels
*   `ks` : kernel size

The reason it's called a ConvTranspose is because it turns out that this is the same as the calculation of the gradient of convolution. That's why they call it that.
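A minimal sketch of the shape behaviour (the sizes here are just illustrative, not taken from the notebook):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)  # 1 channel on a 2x2 grid

# stride 1, kernel 3, no padding: 2x2 -> 4x4, like the spreadsheet example above
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0, bias=False)
print(deconv(x).shape)   # torch.Size([1, 1, 4, 4])

# kernel 4, stride 2, padding 1 (the settings DCGAN_G uses): the grid size doubles
deconv2 = nn.ConvTranspose2d(1, 1, kernel_size=4, stride=2, padding=1, bias=False)
print(deconv2(x).shape)  # torch.Size([1, 1, 4, 4])
```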

**Visualizing** [ [1:16:33](https://youtu.be/ondivPiwQho%3Ft%3D1h16m33s) ]

![](../img/1_GZz25GtnzqaYy5MV5iQPmA.png)

<figcaption class="imageCaption">[http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html)</figcaption>



The one on the left is what we just saw: a 2x2 deconvolution. If there is a stride 2, then you don't just have padding around the outside, but you actually have to put padding in the middle as well. They are not actually quite implemented this way because this is slow to do. In practice, you'll implement them in a different way but it all happens behind the scenes, so you don't have to worry about it. We've talked about this [convolution arithmetic tutorial](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html) before and if you are still not comfortable with convolutions, and in order to get comfortable with deconvolutions, this is a great site to go to. If you want to see the paper, it is [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285) .

`DeconvBlock` looks identical to a `ConvBlock` except it has the word `Transpose` [ [1:17:49](https://youtu.be/ondivPiwQho%3Ft%3D1h17m49s) ]. We just go conv → relu → batch norm as before, and it has input filters and output filters. The only difference is that stride 2 means the grid size will double rather than halve.

![](../img/1_vUpDoEX5vPs6y3auKiCFsQ.png)

Question: Both `nn.ConvTranspose2d` and `nn.Upsample` seem to do the same thing, ie expand the grid size (height and width) from the previous layer. Can we say `nn.ConvTranspose2d` is always better than `nn.Upsample`, since `nn.Upsample` is merely resizing and filling in unknowns with zeros or interpolation [ [1:18:10](https://youtu.be/ondivPiwQho%3Ft%3D1h18m10s) ]? No, you can't. There is a fantastic interactive paper on distill.pub called [Deconvolution and Checkerboard Artifacts](https://distill.pub/2016/deconv-checkerboard/) which points out that what we are doing right now is extremely suboptimal but the good news is everybody else does it.

![](../img/1_-EmXZ1cNtZEO-2SwEYG6bA.png)

Have a look here — can you see these checkerboard artifacts? These are all from actual papers; basically they noticed that every one of these papers with generative models has these checkerboard artifacts, and what they realized is it's because when you have a stride 2 convolution with a size 3 kernel, the kernels overlap. So some grid cells get twice as much activation.

![](../img/1_rafmdyh7EfqCsptcOppq1w.png)

So even if you start with random weights, you end up with checkerboard artifacts. The deeper you get, the worse it gets. Their advice is less direct than it ought to be; Jeremy found that for most generative models, upsampling is better. If you use `nn.Upsample`, it's basically doing the opposite of pooling — it says let's replace this one grid cell with four (2x2). There are a number of ways to upsample — one is just to copy the value across to those four, another is to use bilinear or bicubic interpolation. There are various techniques to try and create a smooth upsampled version and you can choose any of them in PyTorch. If you do a 2 x 2 upsample and then a regular stride one 3 x 3 convolution, that is another way of doing the same kind of thing as a ConvTranspose — it's doubling the grid size and doing some convolutional arithmetic on it. For generative models, it pretty much always works better. In that distill.pub publication, they indicate that maybe that's a good approach, but they don't just come out and say "just do this", whereas Jeremy would just say "just do this". Having said that, for GANs, he hasn't had that much success with it yet, and he thinks it probably requires some tweaking to get it to work. The issue is that in the early stages, it doesn't create enough noise. He had a version where he tried to do it with an upsample and you could kind of see that the noise didn't look very noisy. Next week when we look at style transfer and super-resolution, you will see `nn.Upsample` really comes into its own.
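A sketch of that upsample-then-conv alternative (the channel sizes below are made up; this is not what the notebook's `DeconvBlock` actually uses):

```python
import torch
import torch.nn as nn

# double the grid size with nearest-neighbour upsampling, then a regular stride-1 3x3 conv
up_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 64, 8, 8)
print(up_block(x).shape)  # torch.Size([1, 32, 16, 16]) - same doubling as a stride-2 ConvTranspose2d
```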

The generator, we can now start with the vector [ [1:22:04](https://youtu.be/ondivPiwQho%3Ft%3D1h22m04s) ]. We can decide and say okay let's not think of it as a vector but actually it's a 1x1 grid cell, and then we can turn it into a 4x4 then 8x8 and so forth. That is why we have to make sure it's a suitable multiple so that we can create something of the right size. As you can see, it's doing the exact opposite as before. It's making the cell size bigger and bigger by 2 at a time as long as it can until it gets to half the size that we want, and then finally we add `n` more on at the end with stride 1. Then we add one more ConvTranspose to finally get to the size that we wanted and we are done. Finally we put that through a `tanh` and that will force us to be in the zero to one range because of course we don't want to spit out arbitrary size pixel values. So we have a generator architecture which spits out an image of some given size with the correct number of channels with values between zero and one.

![](../img/1_sNvYsoGpBl6vzCdcjWkH1Q.png)

At this point, we can now create our model data object [ [1:23:38](https://youtu.be/ondivPiwQho%3Ft%3D1h23m38s) ]. These things take a while to train, so we made it 128 by 128 (just a convenient way to make it a little bit faster). So that is going to be the size of the input, but then we are going to use transformations to turn it into 64 by 64.

There's been more recent advances which have attempted to really increase this up to high resolution sizes but they still tend to require either a batch size of 1 or lots and lots of GPUs [ [1:24:05](https://youtu.be/ondivPiwQho%3Ft%3D1h24m5s) ]. So we are trying to do things that we can do with a single consumer GPU. Here is an example of one of the 64 by 64 bedrooms.

```
 bs,sz,nz = 64,64,100 
```

```
tfms = tfms_from_stats(inception_stats, sz)
md = ImageClassifierData.from_csv(PATH, 'bedroom', CSV_PATH, tfms=tfms,
                                  bs=128, skip_header=False, continuous=True)
```

```
 md = md.resize(128) 
```

```
 x,_ = next(iter(md.val_dl)) 
```

```
 plt.imshow(md.trn_ds.denorm(x)[0]); 
```

![](../img/1_FIBPb5I8EloAjg7mvtRXaQ.png)

#### Putting them all together [ [1:24:30](https://youtu.be/ondivPiwQho%3Ft%3D1h24m30s) ]

We are going to do pretty much everything manually so let's go ahead and create our two models — our generator and discriminator and as you can see they are DCGAN, so in other words, they are the same modules that appeared in [this paper](https://arxiv.org/abs/1511.06434) . It is well worth going back and looking at the DCGAN paper to see what these architectures are because it's assumed that when you read the Wasserstein GAN paper that you already know that.

```
netG = DCGAN_G(sz, nz, 3, 64, 1).cuda()
netD = DCGAN_D(sz, 3, 64, 1).cuda()
```

**Question** : Shouldn't we use a sigmoid if we want values between 0 and 1 [ [1:25:06](https://youtu.be/ondivPiwQho%3Ft%3D1h25m6s) ]? As usual, our images have been normalized to have a range from -1 to 1, so their pixel values don't go between 0 and 1 anymore. This is why we want values going from -1 to 1 otherwise we wouldn't give a correct input for the discriminator (via [this post](http://forums.fast.ai/t/part-2-lesson-12-wiki/15023/140) ).

So we have a generator and a discriminator, and we need a function that returns a "prior" vector (ie a bunch of noise) [ [1:25:49](https://youtu.be/ondivPiwQho%3Ft%3D1h25m49s) ]. We do that by creating a bunch of zeros. `nz` is the size of `z` — very often in our code, if you see a mysterious letter, it's because that's the letter they used in the paper. Here, `z` is the size of our noise vector. We then fill it with random numbers drawn from a normal distribution with mean 0 and standard deviation 1. And that needs to be a variable because it's going to be participating in the gradient updates.

```
def create_noise(b):
    return V(torch.zeros(b, nz, 1, 1).normal_(0, 1))
```

```
preds = netG(create_noise(4))
pred_ims = md.trn_ds.denorm(preds)

fig, axes = plt.subplots(2, 2, figsize=(6, 6))
for i,ax in enumerate(axes.flat): ax.imshow(pred_ims[i])
```

![](../img/1_4nHm3LLiShNb0pSS3dCuCw.png)

So here is an example of creating some noise and the resulting four different pieces of noise.

```
def gallery(x, nc=3):
    n,h,w,c = x.shape
    nr = n//nc
    assert n == nr*nc
    return (x.reshape(nr, nc, h, w, c)
             .swapaxes(1,2)
             .reshape(h*nr, w*nc, c))
```

We need an optimizer in order to update our gradients [ [1:26:41](https://youtu.be/ondivPiwQho%3Ft%3D1h26m41s) ]. In the Wasserstein GAN paper, they told us to use RMSProp:

![](../img/1_5o4cwLlNjQfgrNVgLrsVlg.png)

We can easily do that in PyTorch:

```
optimizerD = optim.RMSprop(netD.parameters(), lr = 1e-4)
optimizerG = optim.RMSprop(netG.parameters(), lr = 1e-4)
```

In the paper, they suggested a learning rate of 0.00005 ( `5e-5` ); we found `1e-4` seemed to work, so we made it a little bit bigger.

Now we need a training loop [ [1:27:14](https://youtu.be/ondivPiwQho%3Ft%3D1h27m14s) ]:

![](../img/1_VROXSgyt6HWaJiMMY6ogFQ.png)

<figcaption class="imageCaption">For easier reading</figcaption>



A training loop will go through some number of epochs that we get to pick (so that's going to be a parameter). Remember, when you do everything manually, you've got to remember all the manual steps to do:

1.  You have to set your modules into training mode when you are training them and into evaluation mode when you are evaluating, because in training mode batch norm updates happen and dropout happens, and in evaluation mode those two things get turned off.
2.  We are going to grab an iterator from our training data loader
3.  We are going to see how many steps we have to go through and then we will use `tqdm` to give us a progress bar, and we are going to go through that many steps.

The first step of the algorithm in the paper is to update the discriminator (in the paper, they call discriminator a “critic” and `w` is the weights of the critic). So the first step is to train our critic a little bit, and then we are going to train our generator a little bit, and we will go back to the top of the loop. The inner `for` loop in the paper correspond to the second `while` loop in our code.

What we are going to do now is we have a generator that is random at the moment [ [1:29:06](https://youtu.be/ondivPiwQho%3Ft%3D1h29m6s) ]. So our generator will generate something that looks like the noise. First of all, we need to teach our discriminator to tell the difference between the noise and a bedroom — which shouldn't be too hard you would hope. So we just do it in the usual way but there is a few little tweaks:

1.  We are going to grab a mini batch of real bedroom photos so we can just grab the next batch from our iterator, turn it into a variable.
2.  Then we are going to calculate the loss for that — so this is going to be how much the discriminator thinks this looks fake (“does the real one look fake?”).
3.  Then we are going to create some fake images and to do that we will create some random noise, and we will stick it through our generator which at this stage is just a bunch of random weights. That will create a mini batch of fake images.
4.  Then we will put that through the same discriminator module as before to get the loss for that (“how fake does the fake one look?”). Remember, when you do everything manually, you have to zero the gradients ( `netD.zero_grad()` ) in your loop. If you have forgotten about that, go back to the part 1 lesson where we do everything from scratch.
5.  Finally, the total discriminator loss is equal to the real loss minus the fake loss.

So you can see that here [ [1:30:58](https://youtu.be/ondivPiwQho%3Ft%3D1h30m58s) ]:

![](../img/1_atls5DInIbp5wHZz8szQ1A.png)

They don't talk about the loss, they actually just talk about one of the gradient updates.

![](../img/1_9nGWityXFzNdgOxN15flRA.png)

In PyTorch, we don't have to worry about getting the gradients, we can just specify the loss and call `loss.backward()` then the discriminator's `optimizer.step()` [ [1:34:27](https://youtu.be/ondivPiwQho%3Ft%3D1h34m27s) ]. There is one key step, which is that we have to keep all of our weights, which are the parameters in the PyTorch module, in the small range of -0.01 to 0.01. Why? Because the mathematical assumptions that make this algorithm work only apply in a small ball. It is interesting to understand the math of why that is the case, but it's very specific to this one paper and understanding it won't help you understand any other paper, so only study it if you are interested. It is nicely explained and Jeremy thinks it's fun, but it won't be information that you will reuse elsewhere unless you get super into GANs. He also mentioned that after this paper came out, an improved Wasserstein GAN came out which said there are better ways to ensure that your weight space stays in this tight ball, namely to penalize gradients that are too high, so nowadays there are slightly different ways to do this. But this line of code is the key contribution and it is what makes it a Wasserstein GAN:

```
 for p in netD.parameters(): p.data.clamp_(-0.01, 0.01) 
```

At the end of this, we have a discriminator that can recognize real bedrooms and our totally random crappy generated images [ [1:36:20](https://youtu.be/ondivPiwQho%3Ft%3D1h36m20s) ]. Let's now try and create some better images. So now set trainable discriminator to false, set trainable generator to true, zero out the gradients of the generator. Our loss again is `fw` (discriminator) of the generator applied to some more random noise. So it's exactly the same as before where we did generator on the noise and then pass that to a discriminator, but this time, the thing that's trainable is the generator, not the discriminator. In other words, in the pseudo code, the thing they update is Ɵ which is the generator's parameters. So it takes noise, generate some images, try and figure out if they are fake or real, and use that to get gradients with respect to the generator, as opposed to earlier we got them with respect to the discriminator, and use that to update our weights with RMSProp with an alpha learning rate [ [1:38:21](https://youtu.be/ondivPiwQho%3Ft%3D1h38m21s) ].

```
 def train(niter, first= True ):  gen_iterations = 0  for epoch in trange(niter):  netD.train(); netG.train()  data_iter = iter(md.trn_dl)  i,n = 0,len(md.trn_dl)  with tqdm(total=n) as pbar:  while i < n:  set_trainable(netD, True )  set_trainable(netG, False )  d_iters = 100 if (first and (gen_iterations < 25)  or (gen_iterations % 500 == 0)) else 5  j = 0  while (j < d_iters) and (i < n):  j += 1; i += 1  for p in netD.parameters():  p.data.clamp_(-0.01, 0.01)  real = V(next(data_iter)[0])  real_loss = netD(real)  fake = netG(create_noise(real.size(0)))  fake_loss = netD(V(fake.data))  netD.zero_grad()  lossD = real_loss-fake_loss  lossD.backward()  optimizerD.step()  pbar.update()  set_trainable(netD, False )  set_trainable(netG, True )  netG.zero_grad()  lossG = netD(netG(create_noise(bs))).mean(0).view(1)  lossG.backward()  optimizerG.step()  gen_iterations += 1  print(f'Loss_D {to_np(lossD)}; Loss_G {to_np(lossG)}; '  f'D_real {to_np(real_loss)}; Loss_D_fake  {to_np(fake_loss)}') 
```

You'll see that it's unfair that the discriminator is getting trained _ncritic_ times ( `d_iters` in above code) which they set to 5 for every time we train the generator once. And the paper talks a bit about this but the basic idea is there is no point making the generator better if the discriminator doesn't know how to discriminate yet. So that's why we have the second while loop. And here is that 5:

```
d_iters = 100 if (first and (gen_iterations < 25)
                  or (gen_iterations % 500 == 0)) else 5
```

Actually something which was added in the later paper or maybe supplementary material is the idea that from time to time and a bunch of times at the start, you should do more steps at the discriminator to make sure that the discriminator is capable.

```
 torch.backends.cudnn.benchmark= True 
```

Let's train that for one epoch:

```
 train(1, False ) 
```

```
 0%| | 0/1 [00:00<?, ?it/s]  100%|██████████| 18957/18957 [19:48<00:00, 10.74it/s]  Loss_D [-0.67574]; Loss_G [0.08612]; D_real [-0.1782]; Loss_D_fake [0.49754]  100%|██████████| 1/1 [19:49<00:00, 1189.02s/it] 
```

Then let's create some noise so we can generate some examples.

```
 fixed_noise = create_noise(bs) 
```

But before that, reduce the learning rate by 10 and do one more pass:

```
set_trainable(netD, True)
set_trainable(netG, True)
optimizerD = optim.RMSprop(netD.parameters(), lr = 1e-5)
optimizerG = optim.RMSprop(netG.parameters(), lr = 1e-5)
```

```
 train(1, False ) 
```

```
 0%| | 0/1 [00:00<?, ?it/s]  100%|██████████| 18957/18957 [23:31<00:00, 13.43it/s]  Loss_D [-1.01657]; Loss_G [0.51333]; D_real [-0.50913]; Loss_D_fake [0.50744]  100%|██████████| 1/1 [23:31<00:00, 1411.84s/it] 
```

Then let's use the noise to pass it to our generator, then put it through our denormalization to turn it back into something we can see, and then plot it:

```
netD.eval(); netG.eval();
fake = netG(fixed_noise).data.cpu()
faked = np.clip(md.trn_ds.denorm(fake), 0, 1)

plt.figure(figsize=(9,9))
plt.imshow(gallery(faked, 8));
```

![](../img/1_b8XHbkL7E3tREt_T2mXFqQ.png)

And we have some bedrooms. These are not real bedrooms, and some of them don't look particularly like bedrooms, but some of them look a lot like bedrooms, so that's the idea. That's GAN. The best way to think about GAN is it is like an underlying technology that you will probably never use like this, but you will use in lots of interesting ways. For example, we are going to use it to create a cycle GAN.

**Question** : Is there any reason for using RMSProp specifically as the optimizer as opposed to Adam etc. [ [1:41:38](https://youtu.be/ondivPiwQho%3Ft%3D1h41m38s) ]? I don't remember it being explicitly discussed in the paper. I don't know if it's just experimental or the theoretical reason. Have a look in the paper and see what it says.

[From the forum](http://forums.fast.ai/t/part-2-lesson-12-wiki/15023/211)

> From experimenting I figured that Adam and WGANs not just work worse — it causes to completely fail to train meaningful generator.

> from WGAN paper:

> _Finally, as a negative result, we report that WGAN training becomes unstable at times when one uses a momentum based optimizer such as Adam [8] (with β1>0) on the critic, or when one uses high learning rates. Since the loss for the critic is nonstationary, momentum based methods seemed to perform worse. We identified momentum as a potential cause because, as the loss blew up and samples got worse, the cosine between the Adam step and the gradient usually turned negative. The only places where this cosine was negative was in these situations of instability. We therefore switched to RMSProp [21] which is known to perform well even on very nonstationary problems_

**Question** : Which could be a reasonable way of detecting overfitting while training? Or of evaluating the performance of one of these GAN models once we are done training? In other words, how does the notion of train/val/test sets translate to GANs [ [1:41:57](https://youtu.be/ondivPiwQho%3Ft%3D1h41m57s) ]? That is an awesome question, and there's a lot of people who make jokes about how GANs is the one field where you don't need a test set and people take advantage of that by making stuff up and saying it looks great. There are some famous problems with GANs, one of them is called Mode Collapse. Mode collapse happens where you look at your bedrooms and it turns out that there's only three kinds of bedrooms that every possible noise vector maps to. You look at your gallery and it turns out they are all just the same thing or just three different things. Mode collapse is easy to see if you collapse down to a small number of modes, like 3 or 4. But what if you have a mode collapse down to 10,000 modes? So there are only 10,000 possible bedrooms that all of your noise vectors collapse to. You wouldn't be able to see it in the gallery view we just saw because it's unlikely you would have two identical bedrooms out of 10,000. Or what if every one of these bedrooms is basically a direct copy of one of the inputs — it basically memorized some input. Could that be happening? And the truth is, most papers don't do a good job, or sometimes any job, of checking those things. So the question of how do we evaluate GANs, and even the point that maybe we should actually evaluate GANs properly, is something that is not widely enough understood even now. Some people are trying to really push. Ian Goodfellow was the first author on the most famous deep learning book and is the inventor of GANs and he's been sending a continuous stream of tweets reminding people about the importance of testing GANs properly. If you see a paper that claims exceptional GAN results, then this is definitely something to look at. Have they talked about mode collapse? Have they talked about memorization? And so on.

**Question** : Can GANs be used for data augmentation [ [1:45:33](https://youtu.be/ondivPiwQho%3Ft%3D1h45m33s) ]? Yeah, absolutely, you can use a GAN for data augmentation. Should you? I don't know. There are some papers that try to do semi-supervised learning with GANs. I haven't found any that are particularly compelling, showing state-of-the-art results on really interesting datasets that have been widely studied. I'm a little skeptical, and the reason is that in my experience, if you train a model with synthetic data, the neural net will become fantastically good at recognizing the specific quirks of your synthetic data, and that ends up being what it learns from. There are lots of other ways of doing semi-supervised models which do work well. There are some places where it can work. For example, you might remember Otavio Good, who created that fantastic visualization in part 1 of the zooming conv net showing a letter going through MNIST. He was, at least at that time, number one in autonomous remote control car competitions, and he trained his model using synthetically augmented data where he took real videos of a car driving around the circuit and added fake people and fake other cars. I think that worked well because A, he is kind of a genius, and B, he had a well-defined little subset that he had to work in. But in general, it's really, really hard to use synthetic data. I've tried using synthetic data in models for decades now (obviously not GANs, because they're pretty new) and in general it's very hard to do. Very interesting research question.

### Cycle GAN [ [1:41:08](https://youtu.be/ondivPiwQho%3Ft%3D1h41m8s) ]

[Paper](https://arxiv.org/abs/1703.10593) / [Notebook](https://github.com/fastai/fastai/blob/master/courses/dl2/cyclegan.ipynb)

We are going to use cycle GAN to turn horses into zebras. You can also use it to turn Monet prints into photos or to turn photos of Yosemite in summer into winter.

![](../img/1_dWd0lVTbnu80UZM641gCbw.gif)

This is going to be really straightforward because it's just a neural net [ [1:44:46](https://youtu.be/ondivPiwQho%3Ft%3D1h44m46s) ]. All we are going to do is create an input containing lots of zebra photos, pair each one with an equivalent horse photo, and just train a neural net that goes from one to the other. Or you could do the same thing for every Monet painting: create a dataset containing the photo of the place …oh wait, that's not possible because the places that Monet painted aren't there anymore, and there aren't exact zebra versions of horses …how the heck is this going to work? This seems to break everything we know about what neural nets can do and how they do it.

So somehow these folks at Berkeley created a model that can turn a horse into a zebra despite not having any paired photos. Unless they went out there and painted horses and took before-and-after shots, but I believe they didn't [ [1:47:51](https://youtu.be/ondivPiwQho%3Ft%3D1h47m51s) ]. So how the heck did they do this? It's kind of genius.

The person I know who is doing the most interesting practice of cycle GAN right now is one of our students Helena Sarin [**@** glagolista](https://twitter.com/glagolista) . She is the only artist I know of who is a cycle GAN artist.

![](../img/1_y0xHbQJvxcwUsx7EEK4nHQ.jpeg)

![](../img/1_QZWqdoLXR1TjgeWDivTlnA.jpeg)

![](../img/1_JIF1OaO04wxkWIP_7b14uA.jpeg)

![](../img/1_xn7L_rsu2J6Py2Mjq_q1LA.jpeg)

Here are some more of her amazing works and I think it's really interesting. I mentioned at the start of this class that GANs are in the category of stuff that is not there yet, but it's nearly there. And in this case, there is at least one person in the world who is creating beautiful and extraordinary artworks using GANs (specifically cycle GANs). There are at least a dozen people I know of who are doing interesting creative work with neural nets more generally. And the field of creative AI is going to expand dramatically.

![](../img/1_oqSRuiHT8Z9pWl0Zq9_Sjw.png)

Here is the basic trick [ [1:50:11](https://youtu.be/ondivPiwQho%3Ft%3D1h50m11s) ]. This is from the cycle GAN paper. We are going to have two sets of images (assuming we are doing this with images). The key thing is that they are not paired images, so we don't have a dataset of horses and the equivalent zebras. We have a bunch of horses and a bunch of zebras. Grab one horse _X_ , grab one zebra _Y_ . We are going to train a generator (what they call here a “mapping function”) that turns a horse into a zebra; we'll call that mapping function _G_ . We'll create another mapping function (aka generator) that turns a zebra into a horse, and we will call that _F_ . We will create a discriminator, just like we did before, which is going to get as good as possible at recognizing real from fake horses; that will be _Dx_ . Another discriminator is going to get as good as possible at recognizing real from fake zebras; we will call that _Dy_ . That is our starting point.
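
To keep the notation straight for the equations that follow, here is that cast of characters written out (this is just a restatement of the paragraph above in the paper's notation):

```
G : X \to Y   \quad \text{(horse-to-zebra generator)}
F : Y \to X   \quad \text{(zebra-to-horse generator)}
D_X : \text{discriminator that tells real horses } x \text{ from generated horses } F(y)
D_Y : \text{discriminator that tells real zebras } y \text{ from generated zebras } G(x)
```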

The key thing to making this work [ [1:51:27](https://youtu.be/ondivPiwQho%3Ft%3D1h51m27s) ] is that on top of those discriminator losses ( _Dx_ and _Dy_ ), we are going to create something called **cycle-consistency loss**. It says: after you turn your horse into a zebra with your generator (and the discriminator checks whether it looks like a real zebra), try to turn that zebra back into the same horse you started with. Then we have another function that checks whether this reconstructed horse, which was generated knowing nothing about _x_ (it was generated entirely from the zebra), is similar to the original horse or not. The idea is that if your generated zebra doesn't look anything like your original horse, you've got no chance of turning it back into the original horse. So a loss which compares _x-hat_ to _x_ is going to be really bad unless you can go into _Y_ and back out again, and you're probably only going to be able to do that if you create a zebra that preserves enough of the original horse to know what it looked like. And vice versa: take your zebra, turn it into a fake horse, check that the discriminator can recognize it, and then try to turn it back into the original zebra and check that it looks like the original.

So notice that _F_ (zebra to horse) and _G_ (horse to zebra) are each doing two jobs [ [1:53:09](https://youtu.be/ondivPiwQho%3Ft%3D1h53m9s) ]: _G_ turns the original horse into a zebra, and then _F_ turns that zebra back into the original horse (and they swap roles for the zebra cycle). So there are only two generators. There isn't a separate generator for the reverse mapping; you have to use the same generator that was used for the original mapping. So this is the cycle-consistency loss. I think this is genius. The idea that this is a thing that could even be possible: honestly, when this came out, it just never occurred to me as a thing that I could even try to solve. It seems so obviously impossible, and then the idea that you can solve it like this is, I think, so darn smart.

It's good to look at the equations in this paper because they are good examples — they are written pretty simply and it's not like some of the Wasserstein GAN paper which is lots of theoretical proofs and whatever else [ [1:54:05](https://youtu.be/ondivPiwQho%3Ft%3D1h54m5s) ]. In this case, they are just equations that lay out what's going on. You really want to get to a point where you can read them and understand them.

![](../img/1_Mygxs_TWrjycbanbH5aUeQ.png)

So we've got a horse _X_ and a zebra _Y_ [ [1:54:34](https://youtu.be/ondivPiwQho%3Ft%3D1h54m34s) ]. For our mapping function _G_ , which is the horse-to-zebra mapping function, there is a GAN loss, which is the bit we are already familiar with: we have horses, zebras, a fake-zebra recognizer, and a horse-to-zebra generator. The loss is what we saw before: it's our ability to draw one zebra out of our zebras and recognize whether it is real or fake, and to take a horse, turn it into a zebra, and recognize whether that's real or fake. You then do one minus the other (in this case, they have a log in there, but the log is not terribly important). So this is the thing we just saw. That is why we did the Wasserstein GAN first. This is just a standard GAN loss in math form.
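
Written out (this is the adversarial loss as it appears in the CycleGAN paper, just transcribed into LaTeX):

```
\mathcal{L}_{GAN}(G, D_Y, X, Y) =
      \mathbb{E}_{y \sim p_{data}(y)}\big[\log D_Y(y)\big]
    + \mathbb{E}_{x \sim p_{data}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big]
```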

**Question** : All of this sounds awfully like translating from one language to another and then back to the original. Have GANs or any equivalent been tried in translation [ [1:55:54](https://youtu.be/ondivPiwQho%3Ft%3D1h55m54s) ]? [Paper from the forum](https://arxiv.org/abs/1711.00043) . To back up to what I do know: normally with translation you require this kind of paired input (i.e. parallel text: “this is the French translation of this English sentence”). There have been a couple of recent papers that show the ability to create good quality translation models without paired data. I haven't implemented them, and I don't understand anything I haven't implemented, but they may well be using the same basic idea. We'll look at it during the week and get back to you.

**Cycle-consistency loss** [ [1:57:14](https://youtu.be/ondivPiwQho%3Ft%3D1h57m14s) ]: So we've got a GAN loss, and the next piece is the cycle-consistency loss. The basic idea here is that we start with our horse, use our zebra generator on that to create a zebra, use our horse generator on that to create a horse, and compare that to the original horse. The double lines with the 1 denote the L1 loss: the sum of the absolute values of the differences [ [1:57:35](https://youtu.be/ondivPiwQho%3Ft%3D1h57m35s) ]. Whereas if it were a 2, it would be the L2 loss (the 2-norm), which uses the squared differences.

![](../img/1_0wq511kW9eRhBMWS94G0Bw.png)

We now know this squiggle (~) idea: from our horses, grab a horse. This is what we mean by sampling from a distribution. There are all kinds of distributions, but most commonly in these papers we are using an empirical distribution; in other words, we've got some rows of data, so grab a row. So here, it is saying grab something from the data and we are going to call that thing _x_ . To recap:

1.  From our horse pictures, grab a horse
2.  Turn it into a zebra
3.  Turn it back into a horse
4.  Compare it to the original using the sum of the absolute values of the differences
5.  Do it for zebra to horse as well
6.  And add the two together

That is our cycle-consistency loss.
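
In the paper's notation, that recipe is (again, just the paper's equation transcribed into LaTeX):

```
\mathcal{L}_{cyc}(G, F) =
      \mathbb{E}_{x \sim p_{data}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
    + \mathbb{E}_{y \sim p_{data}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]
```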

**Full objective** [ [1:58:54](https://youtu.be/ondivPiwQho%3Ft%3D1h58m54s) ]

![](../img/1_84eYJ5eck_7r3zVJzrzGzA.png)

Now we get our loss function and the whole loss function depends on:

*   our horse generator
*   a zebra generator
*   our horse recognizer
*   our zebra recognizer (aka discriminator)

We are going to add up:

*   the GAN loss for recognizing horses
*   GAN loss for recognizing zebras
*   the cycle-consistency loss for our two generators

We have a lambda here; hopefully we are used to this idea by now: when you have two different kinds of loss, you chuck in a parameter you can multiply one of them by so that they end up at about the same scale [ [1:59:23](https://youtu.be/ondivPiwQho%3Ft%3D1h59m23s) ]. We did a similar thing with our bounding box loss compared to our classifier loss when we did localization.

Then for this loss function, we are going to try to maximize the capability of the discriminators to discriminate, whilst minimizing that for the generators. So the generators and the discriminators are going to be facing off against each other. When you see this _min max_ thing in papers, it basically means this idea that in your training loop, one thing is trying to make something better, the other is trying to make something worse, and there are lots of ways to do it, but most commonly you'll alternate between the two. You will often see this just referred to in math papers as min-max. So when you see min-max, you should immediately think **adversarial training** .
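
Putting the pieces together, the full objective from the paper and the min-max game being played over it are (transcribed into LaTeX):

```
\mathcal{L}(G, F, D_X, D_Y) =
      \mathcal{L}_{GAN}(G, D_Y, X, Y)
    + \mathcal{L}_{GAN}(F, D_X, Y, X)
    + \lambda \, \mathcal{L}_{cyc}(G, F)

G^*, F^* = \arg \min_{G, F} \; \max_{D_X, D_Y} \; \mathcal{L}(G, F, D_X, D_Y)
```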

#### Implementing cycle GAN [ [2:00:41](https://youtu.be/ondivPiwQho%3Ft%3D2h41s) ]

Let's look at the code. We are going to do something almost unheard of: I started looking at somebody else's code and I was not so disgusted that I threw the whole thing away and did it myself. I actually quite liked it; I liked it enough that I'm going to show it to my students. [This](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) is where the code came from; it's by some of the people who created the original code for cycle GANs, and they created a PyTorch version. I had to clean it up a little bit, but it's actually pretty darn good. The cool thing about this is that you are now going to get to see all the relevant bits of fast.ai written in a different way by somebody else. So you're going to get to see how they do datasets, data loaders, models, training loops, and so forth.

You'll find there is a `cgan` directory [ [2:02:12](https://youtu.be/ondivPiwQho%3Ft%3D2h2m12s) ] which is basically nearly the original, with some cleanups which I hope to submit as a PR sometime. It was written in a way that unfortunately made it a bit too tied to how they were using it as a script, so I cleaned it up a little so I could use it as a module. But other than that, it's pretty similar.

```
from fastai.conv_learner import *
from fastai.dataset import *
from cgan.options.train_options import *
```

So `cgan` is their code, copied from their github repo with some minor changes. The way the `cgan` mini-library has been set up is that the configuration options are assumed to be passed in as if to a script. So they have a `TrainOptions().parse` method and I'm basically passing in an array of script options (where's my data, how many threads, do I want dropout, how many iterations, what am I going to call this model, which GPU do I want to run it on). That gives us an `opt` object and you can see what it contains. You'll see that it contains some things we didn't mention; that is because it has defaults for everything we didn't mention.

```
opt = TrainOptions().parse(['--dataroot', '/data0/datasets/cyclegan/horse2zebra',
                            '--nThreads', '8', '--no_dropout', '--niter', '100',
                            '--niter_decay', '100', '--name', 'nodrop',
                            '--gpu_ids', '2'])
```

So rather than using fast.ai stuff, we are going to largely use cgan stuff.

```
from cgan.data.data_loader import CreateDataLoader
from cgan.models.models import create_model
```

The first thing we are going to need is a data loader. This is also a great opportunity for you to practice your ability to navigate through code with your editor or IDE of choice. We are going to start with `CreateDataLoader` . You should be able to use go-to-symbol (or, in vim, a tag) to jump straight to `CreateDataLoader` , and we can see that it's creating a `CustomDatasetDataLoader` . Then we can see `CustomDatasetDataLoader` is a `BaseDataLoader` . We can see that it's going to use a standard PyTorch DataLoader, so that's good. We know that if you are going to use a standard PyTorch DataLoader, you have to pass it a dataset, and we know that a dataset is something that has a length and an indexer, so presumably when we look at `CreateDataset` it's going to do that.

Here is `CreateDataset` , and this library does more than just cycle GAN; it handles both aligned and unaligned image pairs [ [2:04:46](https://youtu.be/ondivPiwQho%3Ft%3D2h4m46s) ]. We know that our image pairs are unaligned, so we are going to use `UnalignedDataset` .

![](../img/1_wDbxkFlSWbEnC9QDtymlZA.png)

As expected, it has `__getitem__` and `__len__` . For the length: A and B are our horses and zebras; we've got two sets, so whichever one is longer is the length of the dataset. `__getitem__` is going to:

*   Randomly grab something from each of our two horses and zebras
*   Open them up with pillow (PIL)
*   Run them through some transformations
*   Then we could either be turning horses into zebras or zebras into horses, so there's some direction
*   Return our horse, our zebra, a path to the horse, and a path to the zebra

Hopefully you can kind of see that this is looking pretty similar to the kind of things fast.ai does. Fast.ai obviously does quite a lot more when it comes to transforms and performance, but remember, this is research code for this one thing and it's pretty cool that they did all this work.
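
Here is a minimal sketch of what such an unpaired dataset might look like. This is purely illustrative (the class name, dictionary keys, and default transform are assumptions, not the `cgan` library's actual code), but it captures the unaligned behaviour described above:

```
# Illustrative sketch of an unpaired two-folder dataset (not the actual cgan code).
import os, random
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class UnpairedImageDataset(Dataset):
    def __init__(self, dir_A, dir_B, transform=None):
        self.paths_A = sorted(os.path.join(dir_A, f) for f in os.listdir(dir_A))
        self.paths_B = sorted(os.path.join(dir_B, f) for f in os.listdir(dir_B))
        self.transform = transform or transforms.ToTensor()

    def __len__(self):
        # the longer of the two unpaired collections
        return max(len(self.paths_A), len(self.paths_B))

    def __getitem__(self, idx):
        path_A = self.paths_A[idx % len(self.paths_A)]
        path_B = random.choice(self.paths_B)          # unaligned: any B will do
        img_A = self.transform(Image.open(path_A).convert('RGB'))
        img_B = self.transform(Image.open(path_B).convert('RGB'))
        return {'A': img_A, 'B': img_B, 'A_paths': path_A, 'B_paths': path_B}
```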

![](../img/1_zWN8sgzWry6qu7R9FS0Ydw.png)

```
data_loader = CreateDataLoader(opt)
dataset = data_loader.load_data()
dataset_size = len(data_loader)
dataset_size
```

```
 1334 
```

We've got a data loader so we can go and load our data into it [ [2:06:17](https://youtu.be/ondivPiwQho%3Ft%3D2h6m17s) ]. That will tell us how many mini-batches are in it (that's the length of the data loader in PyTorch).

Next step is to create a model. Same idea, we've got different kind of models and we're going to be doing a cycle GAN.

![](../img/1_TmC6TtfaP2xRyS9KK1ryjA.png)

Here is our `CycleGANModel` . There is quite a lot of stuff in `CycleGANModel` , so let's go through and find out what's going to be used. At this stage, we've just called the initializer, so when we initialize it, it's going to go through and define two generators, which is not surprising: a generator for our horses and a generator for our zebras. There is some machinery for it to generate a pool of fake data, and then we're going to grab our GAN loss, and, as we talked about, our cycle-consistency loss is an L1 loss. They are going to use Adam, so obviously for cycle GANs they found Adam works pretty well. Then we are going to have an optimizer for our horse discriminator, an optimizer for our zebra discriminator, and an optimizer for our generators. The optimizer for the generators is going to contain the parameters for both the horse generator and the zebra generator all in one place.

![](../img/1_eDn2CkHKsIDaAz1M5WnWBg.png)

So the initializer is going to set up all of the different networks and loss functions we need and they are going to be stored inside this `model` [ [2:08:14](https://youtu.be/ondivPiwQho%3Ft%3D2h8m14s) ].
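
To make that structure concrete, here is a condensed sketch of that kind of setup. None of this is the library's actual code: the placeholder architectures, names ( `netG_A` , `netD_A` , …) and hyperparameters are illustrative assumptions, but the shape of it (two generators sharing one optimizer, one optimizer per discriminator, a GAN criterion plus an L1 cycle criterion) follows the description above:

```
# Condensed, illustrative sketch of the kind of setup the initializer performs.
import itertools
import torch
from torch import nn, optim

def make_generator():
    # placeholder generator, just for illustration
    return nn.Sequential(nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(),
                         nn.Conv2d(64, 3, 7, padding=3))

def make_discriminator():
    # placeholder patch-style discriminator, just for illustration
    return nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                         nn.Conv2d(64, 1, 4, padding=1))

netG_A = make_generator()        # horse -> zebra
netG_B = make_generator()        # zebra -> horse
netD_A = make_discriminator()    # real vs. fake zebra
netD_B = make_discriminator()    # real vs. fake horse

criterionGAN   = nn.MSELoss()    # stand-in adversarial criterion
criterionCycle = nn.L1Loss()     # cycle-consistency is an L1 loss

# one optimizer covering BOTH generators, plus one per discriminator
optimizer_G   = optim.Adam(itertools.chain(netG_A.parameters(), netG_B.parameters()),
                           lr=2e-4, betas=(0.5, 0.999))
optimizer_D_A = optim.Adam(netD_A.parameters(), lr=2e-4, betas=(0.5, 0.999))
optimizer_D_B = optim.Adam(netD_B.parameters(), lr=2e-4, betas=(0.5, 0.999))
```

Chaining both generators' parameters into a single optimizer is what lets one `optimizer_G.step()` update both mapping functions at once.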

```
 model = create_model(opt) 
```

It then prints out and shows us exactly the PyTorch model we have. It's interesting to see that they are using ResNets, and the ResNets look pretty familiar: we have conv, batch norm, ReLU. `InstanceNorm` is basically the same as batch norm but it applies to one image at a time, and the difference isn't particularly important. You can also see they are doing reflection padding just like we are. You can kind of see that when you try to build everything from scratch like this, it is a lot of work and you can forget the nice little things that fast.ai does automatically for you; you have to do all of them by hand, and you end up with only a subset of them. So over time, hopefully soon, we'll get all of this GAN stuff into fast.ai and it'll be nice and easy.
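
For reference, here is a simplified sketch of the kind of ResNet block being printed there (reflection padding, conv, instance norm, ReLU, twice, plus a skip connection). This is a minimal stand-in, not the repo's exact module:

```
# Simplified sketch of a reflection-padded, instance-normalized ResNet block.
import torch
from torch import nn

class ResnetBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(dim, dim, kernel_size=3),
            nn.InstanceNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(dim, dim, kernel_size=3),
            nn.InstanceNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)   # residual (skip) connection

x = torch.randn(1, 256, 64, 64)
print(ResnetBlock(256)(x).shape)   # torch.Size([1, 256, 64, 64])
```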

![](../img/1_YTCDe7-xeLelfeQNiKiq4A.png)

We've got our model, and remember the model contains the loss functions, generators, discriminators, all in one convenient place [ [2:09:32](https://youtu.be/ondivPiwQho%3Ft%3D2h9m32s) ]. I've gone ahead and copied, pasted, and slightly refactored the training loop from their code so that we can run it inside the notebook. So this one should look pretty familiar: a loop to go through each epoch and a loop to go through the data. Before we did this, we set up `dataset` . This is actually not a PyTorch dataset; I think, slightly confusingly, it is what they use to refer to their combined (what we would call) model data object: all the data that they need. We loop through that with `tqdm` to get a progress bar, and so now we can go through and see what happens in the model.

```
total_steps = 0

for epoch in range(opt.epoch_count, opt.niter + opt.niter_decay + 1):
    epoch_start_time = time.time()
    iter_data_time = time.time()
    epoch_iter = 0

    for i, data in tqdm(enumerate(dataset)):
        iter_start_time = time.time()
        if total_steps % opt.print_freq == 0:
            t_data = iter_start_time - iter_data_time
        total_steps += opt.batchSize
        epoch_iter += opt.batchSize

        model.set_input(data)
        model.optimize_parameters()

        if total_steps % opt.display_freq == 0:
            save_result = total_steps % opt.update_html_freq == 0
        if total_steps % opt.print_freq == 0:
            errors = model.get_current_errors()
            t = (time.time() - iter_start_time) / opt.batchSize
        if total_steps % opt.save_latest_freq == 0:
            print('saving the latest model (epoch %d, total_steps %d)'
                  % (epoch, total_steps))
            model.save('latest')

        iter_data_time = time.time()

    if epoch % opt.save_epoch_freq == 0:
        print('saving the model at the end of epoch %d, iters %d'
              % (epoch, total_steps))
        model.save('latest')
        model.save(epoch)

    print('End of epoch %d / %d \t Time Taken: %d sec' %
          (epoch, opt.niter + opt.niter_decay, time.time() - epoch_start_time))
    model.update_learning_rate()
```

`set_input` [ [2:10:32](https://youtu.be/ondivPiwQho%3Ft%3D2h10m32s) ]: This is a different approach from what we do in fast.ai. It's kind of neat and quite specific to cycle GANs: internally, this model has the idea that we go into our data and grab the appropriate side. Depending on whether we are going horse to zebra or zebra to horse, `A` is either the horse or the zebra, and vice versa. If necessary it puts things on the appropriate GPU, then grabs the appropriate paths. So the model now has a mini-batch of horses and a mini-batch of zebras.
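
Roughly, the idea is something like this (an illustrative sketch using the same dictionary keys as the dataset sketch earlier; the function name and device handling are assumptions, not the library's exact code):

```
# Illustrative sketch of the set_input idea.
def set_input_sketch(model, data, direction='AtoB', device='cpu'):
    # depending on the direction, "A" is the horse mini-batch and "B" the zebra
    # mini-batch, or the other way around
    a, b = ('A', 'B') if direction == 'AtoB' else ('B', 'A')
    model.real_A = data[a].to(device)
    model.real_B = data[b].to(device)
    model.image_paths = data[a + '_paths']
```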

![](../img/1__s9OBHq4z1OBiR9SJORySw.png)

Now we optimize the parameters [ [2:11:19](https://youtu.be/ondivPiwQho%3Ft%3D2h11m19s) ]. It's kind of nice to see it laid out like this. You can see each step: first optimize the generators, then the horse discriminator, then the zebra discriminator. `zero_grad()` and `step()` are part of PyTorch. So the interesting bit is the actual thing that does the back propagation on the generator.

![](../img/1_CXawhHC0Mc9pgBFBIWg22Q.png)

Here it is [ [2:12:04](https://youtu.be/ondivPiwQho%3Ft%3D2h12m4s) ]. Let's jump to the key pieces. These are all the formulas we just saw in the paper. Take a horse and generate a zebra. Now use the discriminator to see whether we can tell that it's fake ( `pred_fake` ). Then pop that into the loss function we set up earlier to get a GAN loss based on that prediction. Do the same thing going in the opposite direction using the opposite discriminator, and put that through the loss function again. Then do the cycle-consistency loss: take the fake we created and try to turn it back into the original, and use the cycle-consistency loss function we created earlier to compare it to the real original. And here is that lambda: the weight we talked about, for which we just use the default they suggested in their options. Then do the same for the opposite direction and add them all together. We then do the backward step. That's it.
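
Continuing the illustrative names from the setup sketch above (again, this is a sketch of the logic described, not the repo's actual generator-backward code), the generator step looks roughly like this:

```
# Sketch of the generator update, reusing the illustrative names from the setup
# sketch above (netG_A: horse -> zebra, netG_B: zebra -> horse,
# netD_A judges zebras, netD_B judges horses).
real_A = torch.randn(1, 3, 128, 128)    # stand-in mini-batch of horses
real_B = torch.randn(1, 3, 128, 128)    # stand-in mini-batch of zebras
lambda_cyc = 10.0                       # weight on the cycle-consistency term

optimizer_G.zero_grad()

# GAN loss: the fake zebra should fool the zebra discriminator...
fake_B = netG_A(real_A)
pred_fake_B = netD_A(fake_B)
loss_G_A = criterionGAN(pred_fake_B, torch.ones_like(pred_fake_B))

# ...and the fake horse should fool the horse discriminator
fake_A = netG_B(real_B)
pred_fake_A = netD_B(fake_A)
loss_G_B = criterionGAN(pred_fake_A, torch.ones_like(pred_fake_A))

# cycle-consistency: horse -> zebra -> horse should give back the original horse
rec_A = netG_B(fake_B)
loss_cycle_A = criterionCycle(rec_A, real_A) * lambda_cyc

# ...and zebra -> horse -> zebra should give back the original zebra
rec_B = netG_A(fake_A)
loss_cycle_B = criterionCycle(rec_B, real_B) * lambda_cyc

loss_G = loss_G_A + loss_G_B + loss_cycle_A + loss_cycle_B
loss_G.backward()
optimizer_G.step()
```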

![](../img/1_q-ir1SHyywXmO5EkTDVq1w.png)

So we can do the same thing for the first discriminator [ [2:13:50](https://youtu.be/ondivPiwQho%3Ft%3D2h13m50s) ]. Since basically all the work has been done now, there's much less to do here. There that is. We won't step all through it but it's basically the same basic stuff that we've already seen.
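
A matching sketch for one discriminator step, with the same illustrative names (the real code also draws its fakes from the fake-image pool mentioned earlier):

```
# Sketch of one discriminator update: the zebra discriminator should say
# "real" for real zebras and "fake" for generated ones.
optimizer_D_A.zero_grad()

pred_real = netD_A(real_B)
loss_D_real = criterionGAN(pred_real, torch.ones_like(pred_real))

pred_fake = netD_A(fake_B.detach())     # detach so we don't backprop into the generator
loss_D_fake = criterionGAN(pred_fake, torch.zeros_like(pred_fake))

loss_D_A = 0.5 * (loss_D_real + loss_D_fake)
loss_D_A.backward()
optimizer_D_A.step()
```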

![](../img/1_PPZdNJDrTHrrQLVRjzucgg.png)

So `optimize_parameters()` is calculating the losses and doing the optimizer step. From time to time, save and print out some results. Then from time to time, update the learning rate so they've got some learning rate annealing built in here as well. Kind of like fast.ai, they've got this idea of schedulers which you can then use to update your learning rates.

![](../img/1_Xrc3Dxs8hKV7pWHQBfZSsQ.png)

For those of you who are interested in better understanding deep learning APIs, contributing more to fast.ai, or creating your own version of some of this stuff on a different back-end, it's cool to look at a second API that covers a subset of similar things, to get a sense for how they solve some of these problems and what the similarities and differences are.

```
def show_img(im, ax=None, figsize=None):
    if not ax: fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(im)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    return ax
```

```
def get_one(data):
    model.set_input(data)
    model.test()
    return list(model.get_current_visuals().values())
```

```
 model.save(201) 
```

```
test_ims = []
for i, o in enumerate(dataset):
    if i > 10: break
    test_ims.append(get_one(o))
```

```
def show_grid(ims):
    fig, axes = plt.subplots(2, 3, figsize=(9, 6))
    for i, ax in enumerate(axes.flat): show_img(ims[i], ax)
    fig.tight_layout()
```

```
 for i in range(8): show_grid(test_ims[i]) 
```

We train that for a little while and then we can just grab a few examples and here we have them [ [2:15:29](https://youtu.be/ondivPiwQho%3Ft%3D2h15m29s) ]. Here are horses, zebras, and back again as horses.

![](../img/1_CcsmcW4TlZvn7eywxQPHUQ.png)

![](../img/1_uMqqilXzEmTMXf8x5ry0CQ.png)

![](../img/1__9FdL_2vB30MCQ1V8fU-qw.png)

![](../img/1_SanvgXWJHOoucKANA6A36A.png)

![](../img/1_TcjrwAtTdYLkV1x5kCqdxA.png)

![](../img/1_Rlpp3gVTYSaKknsAq4qOig.png)

![](../img/1_xxPYAgd8hRxgvGv2mBQ2Vg.png)

![](../img/1_MxQzw0SwBT_iYfbyd4BD_w.png)

It took me like 24 hours to train it even that far so it's kind of slow [ [2:16:39](https://youtu.be/ondivPiwQho%3Ft%3D2h16m39s) ]. I know Helena is constantly complaining on Twitter about how long these things take. I don't know how she's so productive with them.

```
 #! wget https://people.eecs.berkeley.edu/~taesung_park/CycleGAN/datasets/horse2zebra.zip 
```

I will mention one more thing that just came out yesterday [ [2:16:54](https://youtu.be/ondivPiwQho%3Ft%3D2h16m54s) ]:

[Multimodal Unsupervised Image-to-Image Translation](https://arxiv.org/abs/1804.04732)

There is now multimodal unsupervised image-to-image translation with unpaired data. So you can now, for instance, create several different cats from this one dog.

This is not just creating one example of the output that you want, but creating multiple ones. This came out yesterday or the day before. I think it's pretty amazing. So you can kind of see how this technology is developing, and I think there are so many opportunities to do this with music, speech, or writing, or to create tools for artists.