Crayon鑫 / Paddle (forked from PaddlePaddle / Paddle)
Commit 6829d94f
Authored Nov 06, 2017 by Travis CI

    Deploy to GitHub Pages: 5eb0ebaf

Parent: 9361062d

Showing 4 changed files with 60 additions and 62 deletions (+60 -62)
develop/doc/api/v2/config/optimizer.html      +29 -30
develop/doc/searchindex.js                    +1  -1
develop/doc_cn/api/v2/config/optimizer.html   +29 -30
develop/doc_cn/searchindex.js                 +1  -1
develop/doc/api/v2/config/optimizer.html

@@ -188,35 +188,44 @@
Optimizer

Momentum

Optimizers(update equation) for SGD method.

TODO(yuyang18): Complete comments.

class paddle.v2.optimizer.Momentum(momentum=None, sparse=False, **kwargs)

Removed (old docstring):

    SGD Optimizer.

    SGD is an optimization method, trying to find a neural network that
    minimize the “cost/error” of it by iteration. In paddle’s implementation
    SGD Optimizer is synchronized, which means all gradients will be wait to
    calculate and reduced into one gradient, then do optimize operation.

    The neural network consider the learning problem of minimizing an objective
    function, that has the form of a sum

        \[Q(w) = \sum_{i}^{n} Q_i(w)\]

    The value of function Q sometimes is the cost of neural network (Mean
    Square Error between prediction and label for example). The function Q is
    parametrised by w, the weight/bias of neural network. And weights is what to
    be learned. The i is the i-th observation in (trainning) data.

    So, the SGD method will optimize the weight by

        \[w = w - \eta \nabla Q(w) = w - \eta \sum_{i}^{n} \nabla Q_i(w)\]

    where \(\eta\) is learning rate. And \(n\) is batch size.

Added (new docstring):

    Momentum Optimizer.

    When sparse=False, the momentum update formula is as follows:

        \[\begin{split}v_{t} &= k * v_{t-1} - \gamma_t / (g_{t} + \lambda w_{t-1}) \\
          w_{t} &= w_{t-1} + v_{t}\end{split}\]

    where, \(k\) is momentum, \(\lambda\) is decay rate, \(\gamma_t\) is
    learning rate at the t’th iteration. \(w_{t}\) is the weight as the t’th
    iteration. And the \(v_{t}\) is the history momentum variable.

    When sparse=True, the update scheme:

        \[\begin{split}\alpha_t &= \alpha_{t-1} / k \\
          \beta_t &= \beta_{t-1} / (1 + \lambda \gamma_t) \\
          u_t &= u_{t-1} - \alpha_t \gamma_t g_t \\
          v_t &= v_{t-1} + \tau_{t-1} \alpha_t \gamma_t g_t \\
          \tau_t &= \tau_{t-1} + \beta_t / \alpha_t\end{split}\]

    where \(k\) is momentum, \(\lambda\) is decay rate, \(\gamma_t\) is
    learning rate at the t’th iteration.

    Parameters:
        momentum (float) – the momentum factor.
        sparse (bool) – with sparse support or not, False by default.
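As an aside, the dense (sparse=False) update quoted in the new docstring can be written
out in a few lines of NumPy. This is only an illustrative sketch, not PaddlePaddle code:
the momentum_step helper and its arguments are made up for this example, and the
"\gamma_t / (g_{t} + \lambda w_{t-1})" term in the rendered HTML is read here as
multiplication by the learning rate (the "/" is assumed to be a rendering artifact).

import numpy as np

def momentum_step(w, v, grad, lr=0.01, k=0.9, weight_decay=0.0):
    """One dense momentum update in the spirit of the docstring above.

    Assumed reading of the formula:
        v_t = k * v_{t-1} - lr * (g_t + weight_decay * w_{t-1})
        w_t = w_{t-1} + v_t
    """
    v = k * v - lr * (grad + weight_decay * w)  # update the momentum buffer
    w = w + v                                   # apply the velocity to the weights
    return w, v

# Toy usage: minimize Q(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, v, grad=w, lr=0.1, k=0.9)
print(w)  # close to [0, 0]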
Adam

Optimizers(update equation) for SGD method.

TODO(yuyang18): Complete comments.

class paddle.v2.optimizer.Adam(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)
...
...
@@ -225,7 +234,7 @@ The details of please refer <a class="reference external" href="https://arxiv.or
      \[\begin{split}m(w, t) &= \beta_1 m(w, t-1) + (1 - \beta_1) \nabla Q_i(w) \\
      v(w, t) &= \beta_2 v(w, t-1) + (1 - \beta_2)(\nabla Q_i(w))^2 \\
-     w &= w - \frac{\eta}{\sqrt{v(w,t) + \epsilon}}\end{split}\]
+     w &= w - \frac{\eta m(w, t)}{\sqrt{v(w,t) + \epsilon}}\end{split}\]
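The substance of this hunk is a small fix to the documented Adam update: the numerator
gains the first-moment estimate m(w, t). A minimal NumPy sketch of the corrected
equation follows; adam_step is a hypothetical helper for illustration, not Paddle's
implementation, and it mirrors the documented formula (which shows no bias-correction
terms).

import numpy as np

def adam_step(w, m, v, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-08):
    m = beta1 * m + (1 - beta1) * grad       # m(w, t): first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # v(w, t): second-moment estimate
    # Corrected step: eta * m(w, t) in the numerator; the old page dropped m(w, t).
    w = w - lr * m / np.sqrt(v + eps)
    return w, m, v

# Toy usage on Q(w) = 0.5 * ||w||^2 (gradient is w).
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for _ in range(3000):
    w, m, v = adam_step(w, m, v, grad=w, lr=0.01)
print(w)  # settles near [0, 0], oscillating within roughly lr of the minimum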
...
...
@@ -245,8 +254,6 @@ divided by zero.</li>
Adamax

Optimizers(update equation) for SGD method.

TODO(yuyang18): Complete comments.

class paddle.v2.optimizer.Adamax(beta1=0.9, beta2=0.999, **kwargs)
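Only the tail of the Adamax update is visible on this page: the next hunk header quotes
w_t = w_{t-1} - (\eta/(1-\beta_1^t))*m_t/u_t. The moment updates in the sketch below are
the standard Adamax ones and are an assumption, since they do not appear in the captured
diff; adamax_step is a hypothetical helper, not Paddle code.

import numpy as np

def adamax_step(w, m, u, grad, t, lr=0.001, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate m_t (assumed)
    u = np.maximum(beta2 * u, np.abs(grad))  # infinity-norm estimate u_t (assumed)
    # Line quoted in the next hunk header: w_t = w_{t-1} - (lr / (1 - beta1**t)) * m_t / u_t
    w = w - (lr / (1 - beta1 ** t)) * m / u
    return w, m, u

# Toy usage on Q(w) = 0.5 * ||w||^2 (gradient is w, nonzero at the start so u > 0).
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
u = np.zeros_like(w)
for t in range(1, 400):
    w, m, u = adamax_step(w, m, u, grad=w, t=t, lr=0.05)
print(w)  # hovers near [0, 0], within a few multiples of lr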
...
...
@@ -273,8 +280,6 @@ w_t & = w_{t-1} - (\eta/(1-\beta_1^t))*m_t/u_t\end{split}\]</div>
AdaGrad

Optimizers(update equation) for SGD method.

TODO(yuyang18): Complete comments.

class paddle.v2.optimizer.AdaGrad(**kwargs)
...
...
@@ -289,8 +294,6 @@ w & = w - \eta diag(G)^{-\frac{1}{2}} \circ g\end{split}\]</div>
DecayedAdaGrad

Optimizers(update equation) for SGD method.

TODO(yuyang18): Complete comments.

class paddle.v2.optimizer.DecayedAdaGrad(rho=0.95, epsilon=1e-06, **kwargs)
...
...
@@ -316,8 +319,6 @@ learning\_rate &= 1/sqrt( ( E(g_t^2) + \epsilon )\end{split}\]</div>
AdaDelta

Optimizers(update equation) for SGD method.

TODO(yuyang18): Complete comments.

class paddle.v2.optimizer.AdaDelta(rho=0.95, epsilon=1e-06, **kwargs)
...
...
@@ -345,8 +346,6 @@ E(dx_t^2) &= \rho * E(dx_{t-1}^2) + (1-\rho) * (-g*learning\_rate)^2\end{spl
RMSProp

Optimizers(update equation) for SGD method.

TODO(yuyang18): Complete comments.

class paddle.v2.optimizer.RMSProp(rho=0.95, epsilon=1e-06, **kwargs)
...
...
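For context, these are generated docs for the paddle.v2 Python API; the classes touched
by this commit are constructed roughly as below. This is a hedged sketch from the
historical v2 API and is not part of the diff: learning_rate is one of the extra keyword
arguments accepted through **kwargs, and the trainer line is commented out because cost
and parameters require a full network definition that this page does not include.

import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# Any of the classes documented above can serve as the update equation.
momentum = paddle.optimizer.Momentum(momentum=0.9, sparse=False, learning_rate=1e-3)
adam = paddle.optimizer.Adam(beta1=0.9, beta2=0.999, epsilon=1e-08, learning_rate=1e-3)

# trainer = paddle.trainer.SGD(cost=cost, parameters=parameters,
#                              update_equation=momentum)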
develop/doc/searchindex.js

Source diff not shown because the file is too large; view the blob instead.
develop/doc_cn/api/v2/config/optimizer.html

The same change as shown above for develop/doc/api/v2/config/optimizer.html, mirrored in
the Chinese docs; only the localized headerlink titles ("永久链接至标题") and the "参数:"
label differ. Hunks: @@ -201,35 +201,44 @@, @@ -238,7 +247,7 @@, @@ -258,8 +267,6 @@,
@@ -286,8 +293,6 @@, @@ -302,8 +307,6 @@, @@ -329,8 +332,6 @@, @@ -358,8 +359,6 @@.
develop/doc_cn/searchindex.js

Diff collapsed (not shown).