.. _cn_api_fluid_optimizer_AdamOptimizer:

AdamOptimizer
-------------------------------

.. py:class:: paddle.fluid.optimizer.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, parameter_list=None, regularization=None, name=None, lazy_mode=False)

The Adam optimizer comes from Section 2 of the `Adam paper <https://arxiv.org/abs/1412.6980>`_. It uses the first-order and second-order moment estimates of the gradients to dynamically adjust the learning rate of each parameter.

The parameter update is computed as follows:

.. math::
    t = t + 1
.. math::
    moment\_1\_out = \beta_1 * moment\_1 + (1 - \beta_1) * grad
.. math::
    moment\_2\_out = \beta_2 * moment\_2 + (1 - \beta_2) * grad * grad
.. math::
    learning\_rate = learning\_rate * \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}
.. math::
    param\_out = param - learning\_rate * \frac{moment\_1}{\sqrt{moment\_2} + \epsilon}

Related paper: `Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_
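
For illustration, a single Adam update step following the formulas above can be sketched in NumPy. This is only a reference sketch of the math, not the operator implementation used by the framework; all names below simply mirror the symbols in the formulas.

.. code-block:: python

    import numpy as np

    # Reference sketch of one Adam update step (illustration only).
    def adam_step(param, grad, moment_1, moment_2, t,
                  learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        t = t + 1
        moment_1_out = beta1 * moment_1 + (1 - beta1) * grad
        moment_2_out = beta2 * moment_2 + (1 - beta2) * grad * grad
        # bias-corrected step size
        lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        param_out = param - lr_t * moment_1_out / (np.sqrt(moment_2_out) + epsilon)
        return param_out, moment_1_out, moment_2_out, t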

Parameters:
    - **learning_rate** (float|Variable, optional) - The learning rate used for parameter updates. It can be a float value or a Variable holding a float value. Default value is 0.001.
    - **parameter_list** (list, optional) - The parameters to be optimized by the optimizer. This argument is required in dygraph mode; in static graph mode it defaults to None, in which case all parameters are optimized.
    - **beta1** (float|Variable, optional) - The exponential decay rate for the first moment estimates. It can be a float, or a Variable with shape [1] and data type float32. Default value is 0.9.
    - **beta2** (float|Variable, optional) - The exponential decay rate for the second moment estimates. It can be a float, or a Variable with shape [1] and data type float32. Default value is 0.999.
    - **epsilon** (float, optional) - A small float value for numerical stability. Default value is 1e-08.
    - **regularization** (WeightDecayRegularizer, optional) - A regularization strategy used to reduce generalization error, for example :ref:`cn_api_fluid_regularizer_L2DecayRegularizer`. Default value is None.
    - **name** (str, optional) - Normally there is no need to set this property; it is used by developers when printing debugging information. For details, please refer to :ref:`api_guide_Name`. Default value is None.
    - **lazy_mode** (bool, optional) - If set to True, only the elements that currently have gradients are updated. The original Adam algorithm keeps two moving-average accumulators that are updated at every step; in both dense and sparse mode every element of the two moving averages is updated, which can be slow when the parameters are very large. Lazy mode updates only the elements that currently have gradients, so it is faster, but it deviates from the original algorithm and may produce different results. Default value is False. See the sketch after this list.
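
A minimal, hypothetical sketch of enabling ``lazy_mode`` together with a sparse embedding lookup (the network below is made up for illustration and is not part of the original example set):

.. code-block:: python

    import paddle.fluid as fluid

    main = fluid.Program()
    startup = fluid.Program()
    with fluid.program_guard(main, startup):
        # integer word ids looked up in a large embedding table
        word = fluid.data(name='word', shape=[None, 1], dtype='int64')
        emb = fluid.layers.embedding(input=word, size=[10000, 64], is_sparse=True)
        loss = fluid.layers.mean(emb)

        # lazy_mode=True: only rows that received gradients in this step are updated
        adam = fluid.optimizer.AdamOptimizer(learning_rate=0.001, lazy_mode=True)
        adam.minimize(loss)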


**Code Examples**

.. code-block:: python

    import paddle
    import paddle.fluid as fluid

    place = fluid.CPUPlace()
    main = fluid.Program()
    with fluid.program_guard(main):
        # build a simple linear regression network
        x = fluid.layers.data(name='x', shape=[13], dtype='float32')
        y = fluid.layers.data(name='y', shape=[1], dtype='float32')
        y_predict = fluid.layers.fc(input=x, size=1, act=None)
        cost = fluid.layers.square_error_cost(input=y_predict, label=y)
        avg_cost = fluid.layers.mean(cost)

        # minimize the mean squared error with Adam
        adam_optimizer = fluid.optimizer.AdamOptimizer(0.01)
        adam_optimizer.minimize(avg_cost)

        fetch_list = [avg_cost]
        train_reader = paddle.batch(
            paddle.dataset.uci_housing.train(), batch_size=1)
        feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
        exe = fluid.Executor(place)
        exe.run(fluid.default_startup_program())
        for data in train_reader():
            exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)

.. code-block:: python

    # Adam with beta1/beta2 as Variable
    import paddle
    import paddle.fluid as fluid
    import paddle.fluid.layers.learning_rate_scheduler as lr_scheduler

    place = fluid.CPUPlace()
    main = fluid.Program()
    with fluid.program_guard(main):
        x = fluid.data(name='x', shape=[None, 13], dtype='float32')
        y = fluid.data(name='y', shape=[None, 1], dtype='float32')
        y_predict = fluid.layers.fc(input=x, size=1, act=None)
        cost = fluid.layers.square_error_cost(input=y_predict, label=y)
        avg_cost = fluid.layers.mean(cost)

        # define beta decay variable
        def get_decayed_betas(beta1_init, beta2_init, decay_steps, decay_rate):
            global_step = lr_scheduler._decay_step_counter()

            beta1 = fluid.layers.create_global_var(
                shape=[1],
                value=float(beta1_init),
                dtype='float32',
                # set persistable for save checkpoints and resume
                persistable=True,
                name="beta1")
            beta2 = fluid.layers.create_global_var(
                shape=[1],
                value=float(beta2_init),
                dtype='float32',
                # set persistable for save checkpoints and resume
                persistable=True,
                name="beta2")

            div_res = global_step / decay_steps
            decayed_beta1 = beta1_init * (decay_rate**div_res)
            decayed_beta2 = beta2_init * (decay_rate**div_res)
            fluid.layers.assign(decayed_beta1, beta1)
            fluid.layers.assign(decayed_beta2, beta2)

            return beta1, beta2

        beta1, beta2 = get_decayed_betas(0.9, 0.99, 1e5, 0.9)
        adam_optimizer = fluid.optimizer.AdamOptimizer(
                                            learning_rate=0.01,
                                            beta1=beta1,
                                            beta2=beta2)
        adam_optimizer.minimize(avg_cost)

        fetch_list = [avg_cost]
        train_reader = paddle.batch(
            paddle.dataset.uci_housing.train(), batch_size=1)
        feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
        exe = fluid.Executor(place)
        exe.run(fluid.default_startup_program())
        for data in train_reader():
            exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)


.. py:method:: minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add the backward pass to the network and update the Parameters in parameter_list according to the resulting gradients, so as to minimize the network loss value ``loss``.

Parameters:
    - **loss** (Variable) - The loss variable to be minimized.
    - **startup_program** (Program, optional) - The :ref:`cn_api_fluid_Program` used to initialize the parameters in parameter_list. Default value is None, in which case :ref:`cn_api_fluid_default_startup_program` is used.
    - **parameter_list** (list, optional) - A list of Parameters or Parameter.names to be updated. Default value is None, in which case all Parameters are updated.
    - **no_grad_set** (set, optional) - A set of Parameters or Parameter.names that do not need to be updated. Default value is None.
    - **grad_clip** (GradClipBase, optional) - The gradient clipping strategy. It is not needed in static graph mode; currently this argument only supports gradient clipping in dygraph mode and may be adjusted in the future. Default value is None.

Returns: (optimize_ops, params_grads), of type (list, list). optimize_ops is the list of operators appended to the network by this minimize call, and params_grads is a list of (param, grad) Variable pairs, where param is a Parameter and grad is its corresponding gradient.

Return type: tuple

**Code Example**

.. code-block:: python

    import numpy
    import paddle.fluid as fluid
     
    x = fluid.layers.data(name='X', shape=[13], dtype='float32')
    y = fluid.layers.data(name='Y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    loss = fluid.layers.mean(cost)
    adam = fluid.optimizer.AdamOptimizer(learning_rate=0.2)
    adam.minimize(loss)

    place = fluid.CPUPlace() # fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
     
    x = numpy.random.random(size=(10, 13)).astype('float32')
    y = numpy.random.random(size=(10, 1)).astype('float32')
    exe.run(fluid.default_startup_program())
    outs = exe.run(program=fluid.default_main_program(),
                   feed={'X': x, 'Y': y},
                   fetch_list=[loss.name])
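
The ``(optimize_ops, params_grads)`` tuple returned by ``minimize`` can also be captured explicitly and inspected. A minimal sketch on a network like the one above (variable names are chosen for illustration):

.. code-block:: python

    import paddle.fluid as fluid

    x = fluid.layers.data(name='X', shape=[13], dtype='float32')
    y = fluid.layers.data(name='Y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    loss = fluid.layers.mean(fluid.layers.square_error_cost(input=y_predict, label=y))

    adam = fluid.optimizer.AdamOptimizer(learning_rate=0.2)
    # optimize_ops: operators appended by minimize
    # params_grads: list of (Parameter, gradient Variable) pairs
    optimize_ops, params_grads = adam.minimize(loss)
    for param, grad in params_grads:
        print(param.name, grad.name)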

.. py:method:: clear_gradients()

**Note:**

  **1. This API only takes effect in** `Dygraph <../../user_guides/howto/dygraph/DyGraph.html>`_ **mode**

Clear the gradients of the parameters that need to be optimized.

**Code Example**

.. code-block:: python

    import paddle.fluid as fluid
    import numpy as np

    with fluid.dygraph.guard():
        value = np.arange(26).reshape(2, 13).astype("float32")
        a = fluid.dygraph.to_variable(value)
        linear = fluid.dygraph.Linear(13, 5, dtype="float32")
        optimizer = fluid.optimizer.Adam(learning_rate=0.02,
                                         parameter_list=linear.parameters())
        out = linear(a)
        out.backward()
        optimizer.minimize(out)
        optimizer.clear_gradients()


.. py:method:: current_step_lr()

**Note:**

  **1. This API only takes effect in** `Dygraph <../../user_guides/howto/dygraph/DyGraph.html>`_ **mode**

Get the learning rate of the current step. If LearningRateDecay is not used, the return value is the same on every call; otherwise, the learning rate of the current step is returned.

Returns: the learning rate of the current step.

Return type: float

**Code Example**

.. code-block:: python

    import paddle.fluid as fluid
    import numpy as np

    # example1: LearningRateDecay is not used, return value is all the same
    with fluid.dygraph.guard():
        emb = fluid.dygraph.Embedding([10, 10])
        adam = fluid.optimizer.Adam(0.001, parameter_list = emb.parameters())
        lr = adam.current_step_lr()
        print(lr) # 0.001

    # example2: PiecewiseDecay is used, return the step learning rate
    with fluid.dygraph.guard():
        inp = np.random.uniform(-0.1, 0.1, [10, 10]).astype("float32")
        linear = fluid.dygraph.nn.Linear(10, 10)
        inp = fluid.dygraph.to_variable(inp)
        out = linear(inp)
        loss = fluid.layers.reduce_mean(out)

        bd = [2, 4, 6, 8]
        value = [0.2, 0.4, 0.6, 0.8, 1.0]
        adam = fluid.optimizer.Adam(fluid.dygraph.PiecewiseDecay(bd, value, 0),
                           parameter_list=linear.parameters())

        # first step: learning rate is 0.2
        np.allclose(adam.current_step_lr(), 0.2, rtol=1e-06, atol=0.0) # True

        # learning rate for different steps
        ret = [0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0]
        for i in range(12):
            adam.minimize(loss)
            lr = adam.current_step_lr()
            np.allclose(lr, ret[i], rtol=1e-06, atol=0.0) # True