Optimizers¶
BaseSGDOptimizer¶
- 
class paddle.trainer_config_helpers.optimizers.BaseSGDOptimizer
- SGD Optimizer. - SGD is an optimization method, trying to find a neural network that minimize the “cost/error” of it by iteration. In paddle’s implementation SGD Optimizer is synchronized, which means all gradients will be wait to calculate and reduced into one gradient, then do optimize operation. - The neural network consider the learning problem of minimizing an objective function, that has the form of a sum \[Q(w) = \sum_{i}^{n} Q_i(w)\]- The value of function Q sometimes is the cost of neural network (Mean Square Error between prediction and label for example). The function Q is parametrised by w, the weight/bias of neural network. And weights is what to be learned. The i is the i-th observation in (trainning) data. - So, the SGD method will optimize the weight by \[w = w - \eta \nabla Q(w) = w - \eta \sum_{i}^{n} \nabla Q_i(w)\]- where \(\eta\) is learning rate. And \(n\) is batch size. 
MomentumOptimizer¶
- 
class paddle.trainer_config_helpers.optimizers.MomentumOptimizer(momentum=None, sparse=False)
- MomentumOptimizer. - When sparse=True, the update scheme: \[\begin{split}\alpha_t &= \alpha_{t-1} / k \\ \beta_t &= \beta_{t-1} / (1 + \lambda \gamma_t) \\ u_t &= u_{t-1} - \alpha_t \gamma_t g_t \\ v_t &= v_{t-1} + \tau_{t-1} \alpha_t \gamma_t g_t \\ \tau_t &= \tau_{t-1} + \beta_t / \alpha_t\end{split}\]- where \(k\) is momentum, \(\lambda\) is decay rate, \(\gamma_t\) is learning rate at the t’th step. - 参数: - sparse (bool) – with sparse support or not. 
AdamOptimizer¶
- 
class paddle.trainer_config_helpers.optimizers.AdamOptimizer(beta1=0.9, beta2=0.999, epsilon=1e-08)
- Adam optimizer. The details of please refer Adam: A Method for Stochastic Optimization \[\begin{split}m(w, t) & = \beta_1 m(w, t-1) + (1 - \beta_1) \nabla Q_i(w) \\ v(w, t) & = \beta_2 v(w, t-1) + (1 - \beta_2)(\nabla Q_i(w)) ^2 \\ w & = w - \frac{\eta}{\sqrt{v(w,t) + \epsilon}}\end{split}\]- 参数: - beta1 (float) – the \(\beta_1\) in equation.
- beta2 (float) – the \(\beta_2\) in equation.
- epsilon (float) – the \(\epsilon\) in equation. It is used to prevent divided by zero.
 
AdamaxOptimizer¶
- 
class paddle.trainer_config_helpers.optimizers.AdamaxOptimizer(beta1, beta2)
- Adamax optimizer. - The details of please refer this Adam: A Method for Stochastic Optimization \[\begin{split}m_t & = \beta_1 * m_{t-1} + (1-\beta_1)* \nabla Q_i(w) \\ u_t & = max(\beta_2*u_{t-1}, abs(\nabla Q_i(w))) \\ w_t & = w_{t-1} - (\eta/(1-\beta_1^t))*m_t/u_t\end{split}\]- 参数: - beta1 (float) – the \(\beta_1\) in the equation.
- beta2 (float) – the \(\beta_2\) in the equation.
 
AdaGradOptimizer¶
- 
class paddle.trainer_config_helpers.optimizers.AdaGradOptimizer
- Adagrad(for ADAptive GRAdient algorithm) optimizer. - For details please refer this Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. \[\begin{split}G &= \sum_{\tau=1}^{t} g_{\tau} g_{\tau}^T \\ w & = w - \eta diag(G)^{-\frac{1}{2}} \circ g\end{split}\]
DecayedAdaGradOptimizer¶
- 
class paddle.trainer_config_helpers.optimizers.DecayedAdaGradOptimizer(rho=0.95, epsilon=1e-06)
- AdaGrad method with decayed sum gradients. The equations of this method show as follow. \[\begin{split}E(g_t^2) &= \rho * E(g_{t-1}^2) + (1-\rho) * g^2 \\ learning\_rate &= 1/sqrt( ( E(g_t^2) + \epsilon )\end{split}\]- 参数: - rho (float) – The \(\rho\) parameter in that equation
- epsilon (float) – The \(\epsilon\) parameter in that equation.
 
AdaDeltaOptimizer¶
- 
class paddle.trainer_config_helpers.optimizers.AdaDeltaOptimizer(rho=0.95, epsilon=1e-06)
- AdaDelta method. The details of adadelta please refer to this ADADELTA: AN ADAPTIVE LEARNING RATE METHOD. \[\begin{split}E(g_t^2) &= \rho * E(g_{t-1}^2) + (1-\rho) * g^2 \\ learning\_rate &= sqrt( ( E(dx_{t-1}^2) + \epsilon ) / ( \ E(g_t^2) + \epsilon ) ) \\ E(dx_t^2) &= \rho * E(dx_{t-1}^2) + (1-\rho) * (-g*learning\_rate)^2\end{split}\]- 参数: - rho (float) – \(\rho\) in equation
- epsilon (float) – \(\rho\) in equation
 
RMSPropOptimizer¶
- 
class paddle.trainer_config_helpers.optimizers.RMSPropOptimizer(rho=0.95, epsilon=1e-06)
- RMSProp(for Root Mean Square Propagation) optimizer. For details please refer this slide. - The equations of this method as follows: \[\begin{split}v(w, t) & = \rho v(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2 \\ w & = w - \frac{\eta} {\sqrt{v(w,t) + \epsilon}} \nabla Q_{i}(w)\end{split}\]- 参数: - rho (float) – the \(\rho\) in the equation. The forgetting factor.
- epsilon (float) – the \(\epsilon\) in the equation.
 
settings¶
- 
paddle.trainer_config_helpers.optimizers.settings(*args, **kwargs)
- Set the optimization method, learning rate, batch size, and other training settings. The currently supported algorithms are SGD and Async-SGD. - 警告 - Note that the ‘batch_size’ in PaddlePaddle is not equal to global training batch size. It represents the single training process’s batch size. If you use N processes to train one model, for example use three GPU machines, the global batch size is N*’batch_size’. - 参数: - batch_size (int) – batch size for one training process.
- learning_rate (float) – learning rate for SGD
- learning_method (BaseSGDOptimizer) – The extension optimization algorithms of gradient descent, such as momentum, adagrad, rmsprop, etc. Note that it should be instance with base type BaseSGDOptimizer.
- regularization (BaseRegularization) – The regularization method.
- is_async (bool) – Is Async-SGD or not. Default value is False.
- model_average (ModelAverage) – Model Average Settings.
- gradient_clipping_threshold (float) – gradient clipping threshold. If gradient value larger than some value, will be clipped.
 
