From 6345c65adb3022cdb093f3f5815c6b652158c69c Mon Sep 17 00:00:00 2001
From: Aston Zhang
Date: Tue, 21 Aug 2018 01:19:03 +0000
Subject: [PATCH] revise momentum and adagrad

---
 .../softmax-regression.md        |   6 +-
 chapter_optimization/adagrad.md  |  72 ++-------
 chapter_optimization/momentum.md | 144 +++---------------
 3 files changed, 38 insertions(+), 184 deletions(-)

diff --git a/chapter_deep-learning-basics/softmax-regression.md b/chapter_deep-learning-basics/softmax-regression.md
index da5b1e0..85742e9 100644
--- a/chapter_deep-learning-basics/softmax-regression.md
+++ b/chapter_deep-learning-basics/softmax-regression.md
@@ -39,11 +39,9 @@ $$\hat{y}_1, \hat{y}_2, \hat{y}_3 = \text{softmax}(o_1, o_2, o_3),$$

where

$$
-\begin{aligned}
-\hat{y}_1 = \frac{ \exp(o_1)}{\sum_{i=1}^3 \exp(o_i)},\\
-\hat{y}_2 = \frac{ \exp(o_2)}{\sum_{i=1}^3 \exp(o_i)},\\
+\hat{y}_1 = \frac{ \exp(o_1)}{\sum_{i=1}^3 \exp(o_i)},\quad
+\hat{y}_2 = \frac{ \exp(o_2)}{\sum_{i=1}^3 \exp(o_i)},\quad
 \hat{y}_3 = \frac{ \exp(o_3)}{\sum_{i=1}^3 \exp(o_i)}.
-\end{aligned}
$$

It is easy to see that $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$ and $\hat{y}_1 > 0, \hat{y}_2 > 0, \hat{y}_3 > 0$, so $\hat{y}_1, \hat{y}_2, \hat{y}_3$ form a valid probability distribution. Then, if $\hat y_2=0.8$, we know that the image shows a cat with 80% probability, no matter what the other two values are. In addition, note that

diff --git a/chapter_optimization/adagrad.md b/chapter_optimization/adagrad.md
index b3bb139..582cca7 100644
--- a/chapter_optimization/adagrad.md
+++ b/chapter_optimization/adagrad.md
@@ -8,30 +8,31 @@ x_1 \leftarrow x_1 - \eta \frac{\partial{f}}{\partial{x_1}}, \quad
 x_2 \leftarrow x_2 - \eta \frac{\partial{f}}{\partial{x_2}}.
 $$

-In the ["Momentum"](./momentum.md) section we saw that when the gradients with respect to $x_1$ and $x_2$ differ greatly in magnitude (the example we used differed by a factor of 20), we have to choose a learning rate small enough that the dimension with the larger gradient does not diverge, but this makes the dimension with the smaller gradient converge slowly. Momentum relies on an exponentially weighted moving average to make the update directions of the independent variable more consistent and thereby reduce the chance of divergence. In this section we introduce the Adagrad algorithm, which automatically adjusts the learning rate of each dimension according to the magnitude of the values in that dimension, avoiding the problem that a single learning rate can hardly fit all dimensions.
+In the ["Momentum"](./momentum.md) section we saw that when the gradients with respect to $x_1$ and $x_2$ differ greatly in magnitude, we have to choose a learning rate small enough that the independent variable does not diverge in the dimension with the larger gradient, but this makes the independent variable iterate too slowly in the dimension with the smaller gradient. Momentum relies on an exponentially weighted moving average to make the update directions of the independent variable more consistent and thereby reduce the chance of divergence. In this section we introduce the Adagrad algorithm, which adjusts the learning rate of each dimension according to the magnitude of the gradient of the independent variable in that dimension, avoiding the problem that a single learning rate can hardly fit all dimensions.

 ## The Adagrad Algorithm

-The Adagrad algorithm maintains, for each dimension, the accumulated sum of squared gradients over all time steps. Before the algorithm starts we define an accumulator variable $\boldsymbol{s}$ with as many elements as the independent variable and initialize each of its elements to 0. In each iteration, let the mini-batch stochastic gradient be $\boldsymbol{g}$; we square this gradient element-wise and add it to the variable $\boldsymbol{s}$:
+Adagrad uses an accumulator variable $\boldsymbol{s}$ for the element-wise squares of the mini-batch stochastic gradient; its shape is the same as that of the independent variable. At the start, every element of $\boldsymbol{s}$ is initialized to 0. In each iteration, we first compute the mini-batch stochastic gradient $\boldsymbol{g}$, then square it element-wise and add it to $\boldsymbol{s}$:

-$$\boldsymbol{s} \leftarrow \boldsymbol{s} + \boldsymbol{g} \odot \boldsymbol{g}, $$
+$$\boldsymbol{s} \leftarrow \boldsymbol{s} + \boldsymbol{g} \odot \boldsymbol{g},$$

-where $\odot$ denotes element-wise multiplication (see the ["Mathematical Basics"](../chapter_appendix/math.md) section).
+where $\odot$ denotes element-wise multiplication (see the ["Mathematical Basics"](../chapter_appendix/math.md) section). Next, we rescale the learning rate of every element of the objective function's independent variable through element-wise operations:

-Before updating the independent variable, we divide each element of the gradient by the square root of the corresponding element of the accumulator, which puts every element on the same scale, and then multiply by the learning rate to update:
+$$\boldsymbol{g}' \leftarrow \frac{\eta}{\sqrt{\boldsymbol{s} + \epsilon}} \odot \boldsymbol{g},$$

-$$\boldsymbol{x} \leftarrow \boldsymbol{x} - \frac{\eta}{\sqrt{\boldsymbol{s} + \epsilon}} \odot \boldsymbol{g},$$
+where $\eta$ is the initial learning rate with $\eta > 0$, and $\epsilon$ is a constant added for numerical stability, such as $10^{-7}$. Note that the square root, division, and multiplication here are all element-wise operations. These element-wise operations give every element of the objective function's independent variable its own learning rate.

-Here the square root, division, and multiplication are all element-wise; $\epsilon$ is a positive constant added so that the divisor is never 0, such as $10^{-7}$.
+Finally, the independent variable is updated just as in mini-batch stochastic gradient descent, except that the learning rate in front of the gradient has already been adjusted:

-## Properties of Adagrad
+$$\boldsymbol{x} \leftarrow \boldsymbol{x} - \boldsymbol{g}'.$$
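To make these element-wise updates concrete, here is a minimal NumPy sketch of the three formulas above. The toy objective, step count, and hyperparameter values are illustrative assumptions only; the book's own NDArray implementation appears later in this patch.

```python
import numpy as np

eta, eps = 0.1, 1e-7         # assumed initial learning rate and stability constant
x = np.array([-5.0, -2.0])   # independent variable
s = np.zeros_like(x)         # accumulator of element-wise squared gradients

def grad(x):
    # Gradient of the toy objective f(x) = x1^2 + x2^2, used here only for illustration.
    return 2 * x

for t in range(10):
    g = grad(x)
    s += g * g                          # s <- s + g (*) g
    g_adj = eta / np.sqrt(s + eps) * g  # g' <- (eta / sqrt(s + eps)) (*) g
    x -= g_adj                          # x <- x - g'
print(x)
```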
-To better understand how the accumulator puts the updates of all elements of the independent variable on the same scale, consider the update at time step 1, when $\boldsymbol{s} = \boldsymbol{g} \odot \boldsymbol{g}$. Ignoring $\epsilon$, the update at this point is $\boldsymbol{x} \leftarrow \boldsymbol{x} - \eta\cdot\textrm{sign}(\boldsymbol{g})$, where $\textrm{sign}$ takes the sign element-wise. That is, no matter what the gradient values are, each element of the independent variable is updated by exactly $\eta$, $-\eta$, or 0 at this step.
-Viewed from another angle, if one element of the independent variable is always a fixed multiple of another, say $x_2=20x_1$, so that its gradient is also 20 times as large, then in Adagrad the updates of these two elements are always the same size rather than differing by a factor of 20.
-Moreover, since we only ever add to the accumulator, it keeps growing, which is equivalent to continually lowering the learning rate. For example, if the gradient at every time step is a constant $c$, then the learning rate at time step $t$ is $\frac{\eta}{c\sqrt{t}}$, which decays at a square-root rate.
+
+## Characteristics of Adagrad
+
+It should be emphasized that the accumulator $\boldsymbol{s}$ of element-wise squared mini-batch stochastic gradients appears in the denominator of the learning-rate-adjusted gradient $\boldsymbol{g}'$. Therefore, if the partial derivative of the objective function with respect to some element of the independent variable stays large, that element's learning rate drops quickly; conversely, if the partial derivative stays small, that element's learning rate drops slowly. However, because $\boldsymbol{s}$ keeps accumulating element-wise squared gradients, the learning rate of every element keeps decreasing (or stays unchanged) throughout the iterations. So if the learning rate drops quickly in early iterations while the current solution is still poor, Adagrad may have difficulty finding a useful solution in later iterations because the learning rate has become too small.

@@ -50,49 +51,7 @@ import numpy as np
 import math
 ```

-We first implement a simple Adagrad for the two-dimensional objective function $f(\boldsymbol{x})=0.1x_1^2+2x_2^2$ to inspect the trajectory of the independent variable.
-
-```{.python .input}
-f = lambda x1, x2: 0.1*x1**2 + 2*x2**2
-f_grad = lambda x1, x2: (0.2*x1, 2*x2)
-
-def adagrad(eta):
-    x1, x2 = -5, -2
-    sx1, sx2 = 0, 0
-    eps = 1e-7
-    res = [(x1, x2)]
-    for i in range(15):
-        gx1, gx2 = f_grad(x1, x2)
-        sx1 += gx1 ** 2
-        sx2 += gx2 ** 2
-        x1 -= eta / math.sqrt(sx1 + eps) * gx1
-        x2 -= eta / math.sqrt(sx2 + eps) * gx2
-        res.append((x1, x2))
-    return res
-
-def show(res):
-    x1, x2 = zip(*res)
-    gb.set_figsize((3.5, 2.5))
-    gb.plt.plot(x1, x2, '-o')
-
-    x1 = np.arange(-5.0, 1.0, .1)
-    x2 = np.arange(min(-3.0, min(x2)-1), max(3.0, max(x2)+1), .1)
-    x1, x2 = np.meshgrid(x1, x2)
-    gb.plt.contour(x1, x2, f(x1, x2), colors='g')
-
-    gb.plt.xlabel('x1')
-    gb.plt.ylabel('x2')
-
-show(adagrad(.9))
-```
-
-We can see that with $\eta=0.9$ the Adagrad trajectory is very smooth. But because of its built-in lowering of the learning rate, convergence is rather slow in the later stage. This same property also lets us use a larger learning rate with Adagrad.
-
-```{.python .input}
-show(adagrad(2))
-```
-
-Next we take the linear regression introduced earlier as an example. Let the number of samples in the data set be 1000. We generate the data set with a linear regression model whose weight `w` is [2, -3.4] and whose bias `b` is 4.2. The squared loss of this model is the objective function to be optimized, and the model parameters are the independent variable of the objective function.
+In the experiment, we take the linear regression introduced earlier as an example. Let the number of samples in the data set be 1000. We generate the data set with a linear regression model whose weight `w` is [2, -3.4] and whose bias `b` is 4.2. The squared loss of this model is the objective function to be optimized, and the model parameters are the independent variable of the objective function.

 We initialize the accumulator of element-wise squared gradients as zero tensors with the same shapes as the model parameters.

@@ -119,7 +78,7 @@ def init_params():
     return params, sqrs
 ```

-Next we implement Adagrad based on NDArray.
+Next we implement the Adagrad algorithm based on NDArray.

```{.python .input n=1}
def adagrad(params, sqrs, lr, batch_size):
@@ -148,7 +107,8 @@ def optimize(batch_size, lr, num_epochs, log_interval):
             adagrad([w, b], sqrs, lr, batch_size)
             if batch_i * batch_size % log_interval == 0:
                 ls.append(loss(net(features, w, b), labels).mean().asnumpy())
-    print('w:', w, '\nb:', b, '\n')
+    print('w[0]=%.2f, w[1]=%.2f, b=%.2f'
+          % (w[0].asscalar(), w[1].asscalar(), b.asscalar()))
     es = np.linspace(0, num_epochs, len(ls), endpoint=True)
     gb.semilogy(es, ls, 'epoch', 'loss')
```
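Because the accumulator $\boldsymbol{s}$ only ever grows, the effective learning rate of every element can only shrink over time. A tiny self-contained sketch, under the simplifying assumption of a constant scalar gradient $c$, shows the effective learning rate decaying roughly like $\eta/(c\sqrt{t})$; all values here are illustrative.

```python
import math

eta, c, eps = 0.4, 2.0, 1e-7   # assumed initial learning rate, constant gradient, stability term
s = 0.0
for t in range(1, 6):
    s += c * c                        # after t steps the accumulator equals t * c^2
    lr_t = eta / math.sqrt(s + eps)   # effective learning rate of this element
    print('t=%d  effective lr=%.4f  eta/(c*sqrt(t))=%.4f'
          % (t, lr_t, eta / (c * math.sqrt(t))))
```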
diff --git a/chapter_optimization/momentum.md b/chapter_optimization/momentum.md
index 36db5cd..cca7fb3 100644
--- a/chapter_optimization/momentum.md
+++ b/chapter_optimization/momentum.md
@@ -6,7 +6,7 @@
 ## Problems with Gradient Descent

-Given an objective function, the direction in which gradient descent iterates the independent variable depends only on the current position of the independent variable. This can cause problems. Consider an objective function $f(\boldsymbol{x})=0.1x_1^2+2x_2^2$ whose input is the two-dimensional vector $\boldsymbol{x} = [x_1, x_2]^\top$ and whose output is a scalar. Below we observe the iterations of gradient descent. First, import the packages or modules required for the experiments in this section.
+Given an objective function, the direction in which gradient descent iterates the independent variable depends only on the current position of the independent variable. This can cause problems. Consider an objective function $f(\boldsymbol{x})=0.1x_1^2+2x_2^2$ whose input is the two-dimensional vector $\boldsymbol{x} = [x_1, x_2]^\top$ and whose output is a scalar. To observe how gradient descent optimizes this objective function, we first import the packages or modules required for the experiments.

 ```{.python .input n=1}
 import sys
@@ -18,7 +18,7 @@ from mxnet import autograd, nd
 import numpy as np
 ```

-Next we implement gradient descent and a plotting function. Unlike in the previous section, the objective function here has a two-dimensional input, so in the plots we use contour lines to indicate the values of the objective function over the two-dimensional input.
+Next we implement gradient descent and a plotting function. Unlike in the previous section, the input of the objective function here is two-dimensional, so in the plots we use contour lines to indicate the values of the objective function over the two-dimensional input.

 ```{.python .input n=2}
 f = lambda x1, x2: 0.1 * x1 ** 2 + 2 * x2 ** 2
@@ -26,7 +26,7 @@ f_grad = lambda x1, x2: (0.2 * x1, 2 * x2)

 def gd(eta):
     x1, x2 = -5, -2
-    res = []
+    res = [(x1, x2)]
     for i in range(15):
         gx1, gx2 = f_grad(x1, x2)
         x1 = x1 - eta * gx1
@@ -52,45 +52,18 @@ def plot_iterate(res):
 plot_iterate(gd(0.9))
 ```

[output cell n=3 removed by this patch: SVG figure only]
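Evaluating the chapter's `f_grad` at the starting point $(-5, -2)$ makes the asymmetry described in the next paragraph concrete; the definition is repeated here so the check runs on its own.

```python
# Same gradient function as in the code block above, repeated for a self-contained check.
f_grad = lambda x1, x2: (0.2 * x1, 2 * x2)

gx1, gx2 = f_grad(-5, -2)
print(gx1, gx2)             # (-1.0, -4.0): the x2 component is 4 times larger in magnitude
print(abs(gx2) / abs(gx1))  # ratio of the two slopes at the starting point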
" - }, - "metadata": {}, - "output_type": "display_data" - } -] -``` - - +同一位置上,目标函数在竖直方向($x_2$轴方向)比在水平方向($x_1$轴方向)的斜率的绝对值更大。因此,给定学习率,梯度下降迭代自变量时会使自变量在竖直方向比在水平方向移动幅度更大。因此,我们需要一个较小的学习率(这里使用了0.9)从而避免自变量在竖直方向上越过目标函数最优解。然而,这造成了图中自变量向最优解移动较慢。 -然而,这造成了图7.2中自变量向最优解移动较慢。 +我们试着将学习率调的稍大一点,此时自变量在竖直方向不断越过最优解并逐渐发散。 ```{.python .input n=4} plot_iterate(gd(1.1)) ``` -```{.json .output n=4} -[ - { - "data": { - "image/svg+xml": "\n\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n", - "text/plain": "
" - }, - "metadata": {}, - "output_type": "display_data" - } -] -``` - ## 动量法 动量法的提出是为了应对梯度下降的上述问题。以小批量随机梯度下降为例,动量法对每次迭代的步骤做如下修改: - $$ \begin{aligned} \boldsymbol{v} &\leftarrow \gamma \boldsymbol{v} + \eta \nabla f_\mathcal{B}(\boldsymbol{x}),\\ @@ -100,60 +73,36 @@ $$ 其中$\boldsymbol{v}$是速度变量,动量超参数$\gamma$满足$0 \leq \gamma \leq 1$。动量法中的学习率$\eta$和有关小批量$\mathcal{B}$的随机梯度$\nabla f_\mathcal{B}(\boldsymbol{x})$已在[“梯度下降和随机梯度下降”](gd-sgd.md)一节中描述。 -在解释其原理前让我们从实验中观察是否动量法解决了之前的问题。 +在解释动量法的原理前,让我们先从实验中观察梯度下降在使用动量法后的迭代过程。与本节上一个实验相比,这里目标函数和自变量的初始位置均保持不变。 ```{.python .input n=5} def momentum(eta, mom): - x, y = -5, -2 - res = [] - v_x, v_y = 0, 0 + x1, x2 = -5, -2 + res = [(x1, x2)] + v_x1, v_x2 = 0, 0 for i in range(15): - gx, gy = f_grad(x, y) - v_x = mom * v_x + eta * gx - v_y = mom * v_y + eta * gy - x = x - v_x - y = y - v_y - res.append((x, y)) + gx1, gx2 = f_grad(x1, x2) + v_x1 = mom * v_x1 + eta * gx1 + v_x2 = mom * v_x2 + eta * gx2 + x1 = x1 - v_x1 + x2 = x2 - v_x2 + res.append((x1, x2)) return res plot_iterate(momentum(0.9, 0.2)) ``` -```{.json .output n=5} -[ - { - "data": { - "image/svg+xml": "\n\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n", - "text/plain": "
" - }, - "metadata": {}, - "output_type": "display_data" - } -] -``` +可以看到使用学习率$\eta=0.9$和动量超参数$\gamma=0.2$时,动量法在竖直方向上的移动更加平滑,且在水平方向上更快逼近最优解。 -可以看到使用$\eta=0.9$和动量参数$\gamma=0.2$,动量法在垂直方向上更加平滑,且加速了水平方向的进度。使用更大的$\eta=1.1$也不会使得收敛发散。 +我们还发现,使用更大的学习率$\eta=1.1$时,自变量也不再发散。由于能够使用更大的学习率,自变量可以在水平方向上以更快的速度逼近最优解。 ```{.python .input n=6} plot_iterate(momentum(1.1, 0.2)) ``` -```{.json .output n=6} -[ - { - "data": { - "image/svg+xml": "\n\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n", - "text/plain": "
" - }, - "metadata": {}, - "output_type": "display_data" - } -] -``` - ### 指数加权移动平均 -为了数学上理解动量法,让我们先解释指数加权移动平均(exponentially weighted moving average)。给定超参数$\gamma$且$0 \leq \gamma < 1$,当前时刻$t$的变量$y^{(t)}$是上一时刻$t-1$的变量$y^{(t-1)}$和当前时刻另一变量$x^{(t)}$的线性组合: +为了从数学上理解动量法,让我们先解释指数加权移动平均(exponentially weighted moving average)。给定超参数$\gamma$且$0 \leq \gamma < 1$,当前时刻$t$的变量$y^{(t)}$是上一时刻$t-1$的变量$y^{(t-1)}$和当前时刻另一变量$x^{(t)}$的线性组合: $$y^{(t)} = \gamma y^{(t-1)} + (1-\gamma) x^{(t)}.$$ @@ -188,7 +137,8 @@ $$\boldsymbol{v} \leftarrow \gamma \boldsymbol{v} + (1 - \gamma) \frac{\eta \nab 由指数加权移动平均的形式可得,速度变量$\boldsymbol{v}$实际上对$(\eta\nabla f_\mathcal{B}(\boldsymbol{x})) /(1-\gamma)$做了指数加权移动平均。给定动量超参数$\gamma$和学习率$\eta$,含动量法的小批量随机梯度下降可被看作使用了特殊梯度来迭代目标函数的自变量。这个特殊梯度是最近$1/(1-\gamma)$个时刻的$\nabla f_\mathcal{B}(\boldsymbol{x})/(1-\gamma)$的加权平均。 -给定目标函数,在动量法的每次迭代中,自变量在各个方向上的移动幅度不仅取决当前梯度,还取决过去各个梯度在各个方向上是否一致。在上面示例中,由于所有梯度的水平方向为正(向右)、在竖直上时正(向上)时负(向下),自变量在水平方向移动幅度逐渐增大,而在竖直方向移动幅度逐渐减小。这样,我们就可以使用较大的学习率,从而使自变量向最优解更快移动。 +给定目标函数,在动量法的每次迭代中,自变量在各个方向上的移动幅度不仅取决当前梯度,还取决过去各个梯度在各个方向上是否一致。在本节之前示例的优化问题中,由于所有梯度在水平方向上为正(向右)、而在竖直方向上时正(向上)时负(向下),自变量在水平方向的移动幅度逐渐增大,而在竖直方向的移动幅度逐渐减小。这样,我们就可以使用较大的学习率,从而使自变量向最优解更快移动。 + ## 实验 @@ -259,72 +209,18 @@ def optimize(batch_size, lr, mom, num_epochs, log_interval): optimize(batch_size=10, lr=0.2, mom=0.99, num_epochs=3, log_interval=10) ``` -```{.json .output n=10} -[ - { - "name": "stdout", - "output_type": "stream", - "text": "w[0]=-4.64, w[1]=7.65, b=-19.84\n" - }, - { - "data": { - "image/svg+xml": "\n\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n", - "text/plain": "
" - }, - "metadata": {}, - "output_type": "display_data" - } -] -``` - 假设学习率不变,为了降低上述特殊梯度中的系数,我们将动量超参数$\gamma$(`mom`)设0.9。此时,上述特殊梯度变成最近10个时刻的$10\nabla f_\mathcal{B}(\boldsymbol{x})$的加权平均。我们观察到,损失函数值在3个迭代周期后下降。 ```{.python .input n=11} optimize(batch_size=10, lr=0.2, mom=0.9, num_epochs=3, log_interval=10) ``` -```{.json .output n=11} -[ - { - "name": "stdout", - "output_type": "stream", - "text": "w[0]=2.00, w[1]=-3.40, b=4.20\n" - }, - { - "data": { - "image/svg+xml": "\n\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n", - "text/plain": "
" - }, - "metadata": {}, - "output_type": "display_data" - } -] -``` - 继续保持学习率不变,我们将动量超参数$\gamma$(`mom`)设0.5。此时,小梯度随机梯度下降可被看作使用了新的特殊梯度:这个特殊梯度是最近2个时刻的$2\nabla f_\mathcal{B}(\boldsymbol{x})$的加权平均。我们观察到,损失函数值在3个迭代周期后下降,且下降曲线较平滑。最终,优化所得的模型参数值与它们的真实值较接近。 ```{.python .input n=12} optimize(batch_size=10, lr=0.2, mom=0.5, num_epochs=3, log_interval=10) ``` -```{.json .output n=12} -[ - { - "name": "stdout", - "output_type": "stream", - "text": "w[0]=2.00, w[1]=-3.40, b=4.20\n" - }, - { - "data": { - "image/svg+xml": "\n\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n", - "text/plain": "
" - }, - "metadata": {}, - "output_type": "display_data" - } -] -``` - ## 小结 * 动量法使用了指数加权移动平均的思想。 -- GitLab