<img src="img/stepsize.png" width="400">
The domain of $x$ also affects the learning rate magnitude. This is all a very complicated, finicky business, and those experienced in the field tell me that picking the learning rate, starting positions, precision, and so on is very much an art. You can start out with a low learning rate and crank it up to see if you still converge without oscillating around the minimum. An excellent description of gradient descent and other minimization techniques can be found in [Numerical Recipes](http://apps.nrbook.com/fortran/index.html).
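For instance, one quick experiment is to run the same descent with several candidate learning rates and watch whether the iterates settle down, creep along, or blow up. The sketch below does that on a toy parabola with a known derivative; the function, starting point, rate values, and helper name are all illustrative placeholders, not anything prescribed by the lecture:
```python
import math

def descend(f, df, x0, eta, precision=1e-8, max_iter=10_000):
    """Gradient descent with an analytic derivative df; stops when f barely changes."""
    x = x0
    for steps in range(max_iter):
        x_next = x - eta * df(x)
        if not math.isfinite(x_next):           # iterates blew up: eta is too large
            return x_next, steps
        if abs(f(x_next) - f(x)) < precision:   # converged
            return x_next, steps
        x = x_next
    return x, max_iter                          # ran out of iterations

# Toy function (not the cos function used later in this lecture): minimum at x = 2
f  = lambda x: (x - 2) * (x - 2)
df = lambda x: 2 * (x - 2)

for eta in (0.001, 0.01, 0.1, 0.6, 1.1):        # crank the learning rate up and watch
    x_min, steps = descend(f, df, x0=5.0, eta=eta)
    print(f"eta={eta:<5}  x={x_min:>12.6g}  steps={steps}")
```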
### Approximating derivatives with finite differences
Sometimes the derivative is hard, expensive, or impossible to find analytically (symbolically). For example, some functions are themselves iterative in nature or even simulations that must be optimized. There might be no closed form for $f(x)$. To get around this and to reduce the input requirements, we can approximate the derivative in the neighborhood of a particular $x$ value. That way we can optimize any reasonably well-behaved function (left and right continuity would be nice). Our minimizer then only requires a starting location and $f(x)$, but not $f'(x)$, which makes the lives of our users much simpler and our minimizer much more flexible.
To approximate the derivative, we can take several approaches. The simplest involves a comparison. Since we really just need a direction, all we have to do is compare the current $f(x_i)$ with the values a small step, $h$, away in either direction: $f(x_{i}-h)$ and $f(x_{i}+h)$. If $f(x_{i}-h) < f(x_{i})$, we should move $x_{i+1}$ to the left of $x_{i}$. If $f(x_{i}+h) < f(x_{i})$, we should move $x_{i+1}$ to the right. These are called the backward and forward differences, but there is also a central difference. The excellent article [Stochastic Gradient Descent Tricks](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) has a lot of practical information on computing gradients and so on.
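A sketch of that comparison idea might look like the following; the helper name and the default step size $h$ are just illustrative choices:
```python
def slope_direction(f, x, h=0.0001):
    """Return -1, +1, or 0 depending on which neighbor of x has the smaller f value."""
    if f(x - h) < f(x):        # backward difference: downhill lies to the left
        return -1
    if f(x + h) < f(x):        # forward difference: downhill lies to the right
        return +1
    return 0                   # neither side is lower; x looks like a (local) minimum

# A direction-only update step would then be: x_next = x + eta * slope_direction(f, x)
```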
Using the direction of the slope works, but does not converge very fast. What we really want is to use the magnitude of the slope to make the algorithm go fast where it's steep and slow where it's shallow, because there it will be approaching a minimum. So, rather than just using the sign of the finite difference, we should use the magnitude or rate of change. Replacing the derivative in our recurrence relation with the finite (forward) difference, we get a similar formula:
\\[
x_{i+1} = x_i - \eta \frac{f(x_{i}+h) - f(x_{i})}{h} \text{ where } f'(x) \approx \frac{f(x_{i}+h) - f(x_{i})}{h}
\\]
```
x{i+1} = xi - η (f(xi+h) - f(xi)) / h , where f'(x) ~ (f(xi+h) - f(xi)) / h
```
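In code, the forward-difference approximation costs only one extra evaluation of $f$ per step. A minimal sketch (the function name and default $h$ are assumptions for illustration):
```python
def forward_difference(f, x, h=0.0001):
    """Approximate f'(x) with the forward difference (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# The update rule above then becomes: x_next = x - eta * forward_difference(f, x, h)
```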
To simplify things, we can roll the step size $h$ into the learning rate constant $\eta$, as we are going to pick that anyway.
\\[
x_{i+1} = x_i - \eta (f(x_{i}+h) - f(x_{i}))
\\]
```
x{i+1} = xi - η (f(xi+h) - f(xi))
```
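A single update written this way needs nothing but calls to $f$. In a sketch (again with an illustrative default for $h$), $\eta$ now plays the role of the learning rate divided by $h$:
```python
def step(f, x, eta, h=0.0001):
    """One update with h folded into eta: eta here acts as (learning rate / h)."""
    return x - eta * (f(x + h) - f(x))
```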
The step size is bigger when the slope is bigger and smaller as we approach the minimum (since the region is flatter). Abu-Mostafa indicates in his slides that $\eta$ should increase with the slope, whereas we keep it fixed and allow the finite difference to increase the step size. We are not normalizing the derivative/difference to a unit vector like he does (see his slides).
## An implementation
Our goal is to use gradient descent to minimize $f(x) = cos(3\pi x) / x$. To increase the chances of finding the global minimum, we can pick a few random starting locations in the range $[0.1,1.3]$ using the standard Python `random.uniform()` function and perform gradient descent on all of them. To observe our minimizer in action, we'll eventually plot the trace of $x$'s that indicates the steps taken by our gradient descent. Here are two sample descents where the $x$ and $f(x)$ values are displayed as well as the minima:
<img src="img/cos-trace-2minima.svg" width="350">
<img src="img/cos-trace-2minima-another.svg" width="350">
Recall from our square root lecture that we had a basic outline for an iterative method:
```python
x_prev = initial value
```
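One plausible way to flesh that outline out for this problem, reusing the simplified finite-difference update and a stopping test that compares successive $f$ values against a precision threshold, is sketched below; the parameter values, starting point, and the recorded trace are illustrative choices, not necessarily the lecture's exact ones:
```python
import math

def f(x):
    return math.cos(3 * math.pi * x) / x

def minimize(f, x0, eta=10, h=0.0001, precision=1e-10, max_iter=100_000):
    """Descend from x0 using only calls to f; eta absorbs the 1/h factor as described above."""
    x_prev = x0
    trace = [x_prev]                    # record every x so the steps can be plotted later
    for _ in range(max_iter):
        x_next = x_prev - eta * (f(x_prev + h) - f(x_prev))
        trace.append(x_next)
        if abs(f(x_next) - f(x_prev)) < precision:
            break
        x_prev = x_next
    return x_next, trace

x_min, trace = minimize(f, x0=0.2)
print(f"local minimum near x = {x_min:.6f}, f(x) = {f(x_min):.6f}, {len(trace)} steps")
```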