mixed_precision.md 7.0 KB
Newer Older
L
leiyuning 已提交
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# Mixed Precision

<!-- TOC -->

- [Mixed Precision](#mixed-precision)
    - [Overview](#overview)
    - [Computation Process](#computation-process)
    - [Automatic Mixed Precision](#automatic-mixed-precision)
    - [Manual Mixed Precision](#manual-mixed-precision)

<!-- /TOC -->

## Overview

The mixed precision training method accelerates the deep learning neural network training process by using both the single-precision and half-precision data formats, and maintains the network precision achieved by the single-precision training at the same time. 
Mixed precision training can accelerate the computation process, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.

18 19
For FP16 operators, if the input data type is FP32, the backend of MindSpore will automatically handle it with reduced precision. Users could check the reduced-precision operators by enabling INFO log and then searching `reduce precision'.

L
leiyuning 已提交
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## Computation Process

The following figure shows the typical computation process of mixed precision in MindSpore.

![mix precision](./images/mix_precision.jpg)

1. Parameters are stored in FP32 format.
2. During the forward computation, if an FP16 operator is involved, the operator input and parameters need to be cast from FP32 to FP16.
3. The loss layer is set to FP32.
4. During backward computation, the value is multiplied by Loss Scale to avoid underflow due to a small gradient.
5. The FP16 parameter is used for gradient computation, and the result is cast back to FP32.
6. Then, the value is divided by Loss scale to restore the multiplied gradient.
7. The optimizer checks whether the gradient overflows. If yes, the optimizer skips the update. If no, the optimizer uses FP32 to update the original parameters.

This document describes the computation process by using examples of automatic and manual mixed precision.

## Automatic Mixed Precision

To use the automatic mixed precision, you need to invoke the corresponding API, which takes the network to be trained and the optimizer as the input. This API converts the operators of the entire network into FP16 operators (except the BatchNorm and Loss operators).
In addition, after the mixed precision is employed, the loss scale must be used to avoid data overflow.

The procedure is as follows:
1. Introduce the MindSpore mixed precision API.

2. Define the network. This step is the same as the common network definition. (You do not need to manually configure the precision of any specific operator.)

3. Use the amp.build_train_network() API to encapsulate the network model and optimizer. In this step, MindSpore automatically converts the operators to the required format.

A code example is as follows:

```python
51 52 53 54 55 56 57 58
import numpy as np
import mindspore.nn as nn
import mindspore.common.dtype as mstype
from mindspore import Tensor, context
from mindspore.ops import operations as P
from mindspore.nn import WithLossCell
from mindspore.nn import Momentum
from mindspore.nn.loss import MSELoss
L
leiyuning 已提交
59 60 61
# The interface of Auto_mixed precision
from mindspore.train import amp

62 63 64
context.set_context(mode=context.GRAPH_MODE)
context.set_context(device_target="Ascend")

L
leiyuning 已提交
65 66 67 68 69 70 71 72 73 74
# Define network
class LeNet5(nn.Cell):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, pad_mode='valid')
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
        self.fc1 = nn.Dense(16 * 5 * 5, 120)
        self.fc2 = nn.Dense(120, 84)
        self.fc3 = nn.Dense(84, 10)
        self.relu = nn.ReLU()
75
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
L
leiyuning 已提交
76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
        self.flatten = P.Flatten()

    def construct(self, x):
        x = self.max_pool2d(self.relu(self.conv1(x)))
        x = self.max_pool2d(self.relu(self.conv2(x)))
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize network
net = LeNet5()

# Define training data, label and sens
predict = Tensor(np.ones([1, 1, 32, 32]).astype(np.float32) * 0.01)
label = Tensor(np.zeros([1, 10]).astype(np.float32))
scaling_sens = Tensor(np.full((1), 1.0), dtype=mstype.float32)

# Define Loss and Optimizer
loss = MSELoss()
optimizer = Momentum(params=net.trainable_params(), learning_rate=0.1, momentum=0.9)
net_with_loss = WithLossCell(net, loss)
train_network = amp.build_train_network(net_with_loss, optimizer, level="O2")

# Run training
102
output = train_network(predict, label, scaling_sens)
L
leiyuning 已提交
103 104 105 106 107 108 109 110
```


## Manual Mixed Precision

MindSpore also supports manual mixed precision. It is assumed that only one dense layer in the network needs to be calculated by using FP32, and other layers are calculated by using FP16. The mixed precision is configured in the granularity of cell. The default format of a cell is FP32.

The following is the procedure for implementing manual mixed precision:
111
1. Define the network. This step is similar to step 2 in the automatic mixed precision. 
L
leiyuning 已提交
112

113
2. Configure the mixed precision. Use net.to_float(mstype.float16) to set all operators of the cell and its sub-cells to FP16. Then, configure the fc3 to FP32.
L
leiyuning 已提交
114 115 116 117 118 119

3. Use TrainOneStepWithLossScaleCell to encapsulate the network model and optimizer.

A code example is as follows:

```python
120 121 122 123 124 125 126 127 128 129 130 131 132
import numpy as np
import mindspore.nn as nn
import mindspore.common.dtype as mstype
from mindspore import Tensor, context
from mindspore.ops import operations as P
from mindspore.nn import WithLossCell, TrainOneStepWithLossScaleCell
from mindspore.nn import Momentum
from mindspore.nn.loss import MSELoss


context.set_context(mode=context.GRAPH_MODE)
context.set_context(device_target="Ascend")

L
leiyuning 已提交
133 134 135 136 137 138 139 140
# Define network
class LeNet5(nn.Cell):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, pad_mode='valid')
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
        self.fc1 = nn.Dense(16 * 5 * 5, 120)
        self.fc2 = nn.Dense(120, 84)
141
        self.fc3 = nn.Dense(84, 10)
L
leiyuning 已提交
142
        self.relu = nn.ReLU()
143
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
L
leiyuning 已提交
144 145 146 147 148 149 150 151 152 153 154 155 156
        self.flatten = P.Flatten()

    def construct(self, x):
        x = self.max_pool2d(self.relu(self.conv1(x)))
        x = self.max_pool2d(self.relu(self.conv2(x)))
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize network and set mixing precision
net = LeNet5()
157
net.to_float(mstype.float16)
158
net.fc3.to_float(mstype.float32)
L
leiyuning 已提交
159 160 161 162 163 164 165 166 167 168 169 170

# Define training data, label and sens
predict = Tensor(np.ones([1, 1, 32, 32]).astype(np.float32) * 0.01)
label = Tensor(np.zeros([1, 10]).astype(np.float32))
scaling_sens = Tensor(np.full((1), 1.0), dtype=mstype.float32)

# Define Loss and Optimizer
net.set_train()
loss = MSELoss()
optimizer = Momentum(params=net.trainable_params(), learning_rate=0.1, momentum=0.9)
net_with_loss = WithLossCell(net, loss)
train_network = TrainOneStepWithLossScaleCell(net_with_loss, optimizer)
171
train_network.set_train()
L
leiyuning 已提交
172 173

# Run training
174
output = train_network(predict, label, scaling_sens)
L
leiyuning 已提交
175
```