Commit 6755f426 authored by muli

update scratch

Parent 19e60fa9
......@@ -16,16 +16,6 @@ then we can check how many GPUs are available by running the command `nvidia-smi
!nvidia-smi
```
```{.json .output n=1}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Thu Oct 19 05:22:42 2017 \r\n+-----------------------------------------------------------------------------+\r\n| NVIDIA-SMI 375.26 Driver Version: 375.26 |\r\n|-------------------------------+----------------------+----------------------+\r\n| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n|===============================+======================+======================|\r\n| 0 Tesla M60 Off | 0000:00:1D.0 Off | 0 |\r\n| N/A 37C P0 38W / 150W | 319MiB / 7612MiB | 0% Default |\r\n+-------------------------------+----------------------+----------------------+\r\n| 1 Tesla M60 Off | 0000:00:1E.0 Off | 0 |\r\n| N/A 43C P0 44W / 150W | 2MiB / 7612MiB | 0% Default |\r\n+-------------------------------+----------------------+----------------------+\r\n \r\n+-----------------------------------------------------------------------------+\r\n| Processes: GPU Memory |\r\n| GPU PID Type Process name Usage |\r\n|=============================================================================|\r\n| 0 116696 C .../miniconda3/envs/gluon_zh_docs/bin/python 317MiB |\r\n+-----------------------------------------------------------------------------+\r\n"
}
]
```
We want to use all of the GPUs together to significantly speed up training (in terms of wall clock time).
Remember that CPUs and GPUs can each have multiple cores.
A laptop CPU might have 2 or 4 cores, while a server CPU might have up to 16 or 32 cores.
......@@ -48,19 +38,6 @@ Finally, we collect the gradients from each of the GPUs and sum them together be
The following pseudo-code shows how to train one data batch on *k* GPUs.
    def train_batch(data, k):
        # split data into k parts
        for i = 1, ..., k:  # run in parallel
            compute grad_i w.r.t. weight_i using data_i on the i-th GPU
        grad = grad_1 + ... + grad_k
        for i = 1, ..., k:  # run in parallel
            copy grad to the i-th GPU
            update weight_i using grad
## Define model and updater
......@@ -85,16 +62,16 @@ params = [W1, b1, W2, b2, W3, b3, W4, b4]
# network and loss
def lenet(X, params):
    # first conv
    h1_conv = nd.Convolution(data=X, weight=params[0], bias=params[1],
                             kernel=(3,3), num_filter=20)
    h1_activation = nd.relu(h1_conv)
    h1 = nd.Pooling(data=h1_activation, pool_type="avg",
                    kernel=(2,2), stride=(2,2))
    # second conv
    h2_conv = nd.Convolution(data=h1, weight=params[2], bias=params[3],
                             kernel=(5,5), num_filter=50)
    h2_activation = nd.relu(h2_conv)
    h2 = nd.Pooling(data=h2_activation, pool_type="avg",
                    kernel=(2,2), stride=(2,2))
    h2 = nd.flatten(h2)
    # first dense
......@@ -126,16 +103,6 @@ print('b1 weight = ', new_params[1])
print('b1 grad = ', new_params[1].grad)
```
```{.json .output n=3}
[
{
"name": "stdout",
"output_type": "stream",
"text": "b1 weight = \n[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n 0. 0.]\n<NDArray 20 @gpu(0)>\nb1 grad = \n[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n 0. 0.]\n<NDArray 20 @gpu(0)>\n"
}
]
```
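The cell above prints entries of `new_params`, which are produced by the `get_params` helper used later in `train`; its body falls in the collapsed lines of this diff. A minimal sketch of such a helper, assuming the parameter list layout shown earlier, might look like this:

```python
# Hedged sketch: the actual get_params definition is collapsed in this diff.
def get_params(params, ctx):
    # copy every parameter to the given device and allocate a gradient buffer there
    new_params = [p.copyto(ctx) for p in params]
    for p in new_params:
        p.attach_grad()
    return new_params
```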
Given a list of data that spans multiple GPUs, we then define a function to sum the data
and broadcast the results to each GPU.
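The function body itself sits in the collapsed portion of the hunk below. A minimal sketch of such an `allreduce`, which accumulates on the first device and then broadcasts the sum back, could be:

```python
# Hedged sketch of the collapsed allreduce cell: accumulate the sum on the
# first device, then broadcast it back to every other device.
def allreduce(data):
    # sum every copy onto data[0]'s device
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)
    # broadcast the result back
    for i in range(1, len(data)):
        data[0].copyto(data[i])
```

The test cell that follows shows exactly this behavior: `[1, 1]` on `gpu(0)` and `[2, 2]` on `gpu(1)` both become `[3, 3]` after the call.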
......@@ -153,16 +120,6 @@ allreduce(data)
print('After:', data)
```
```{.json .output n=4}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Before: [\n[[ 1. 1.]]\n<NDArray 1x2 @gpu(0)>, \n[[ 2. 2.]]\n<NDArray 1x2 @gpu(1)>]\nAfter: [\n[[ 3. 3.]]\n<NDArray 1x2 @gpu(0)>, \n[[ 3. 3.]]\n<NDArray 1x2 @gpu(1)>]\n"
}
]
```
Given a data batch, we define a function that splits the batch and copies each part onto the corresponding GPU; a sketch follows before the (partially collapsed) cell.
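The helper name below is an assumption, since the defining cell is only partially visible in this diff; in practice `gluon.utils.split_and_load` offers the same functionality, but the scratch version mirrors what the tutorial builds by hand.

```python
# Hypothetical split_and_load helper, assuming the batch size divides evenly
# across the devices.
def split_and_load(data, ctx):
    n, k = data.shape[0], len(ctx)
    assert n % k == 0, 'batch size must be divisible by the number of GPUs'
    m = n // k
    # slice the batch evenly and copy each slice onto its device
    return [data[i*m:(i+1)*m].as_in_context(ctx[i]) for i in range(k)]
```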
```{.python .input n=5}
......@@ -181,16 +138,6 @@ print('Load into', ctx)
print('Output:', splitted)
```
```{.json .output n=5}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Intput: \n[[ 0. 1. 2. 3.]\n [ 4. 5. 6. 7.]\n [ 8. 9. 10. 11.]\n [ 12. 13. 14. 15.]]\n<NDArray 4x4 @cpu(0)>\nLoad into [gpu(0), gpu(1)]\nOutput: [\n[[ 0. 1. 2. 3.]\n [ 4. 5. 6. 7.]]\n<NDArray 2x4 @gpu(0)>, \n[[ 8. 9. 10. 11.]\n [ 12. 13. 14. 15.]]\n<NDArray 2x4 @gpu(1)>]\n"
}
]
```
## Train one batch
Now we are ready to implement training on one data batch with data parallelism.
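The `train_batch` implementation is largely collapsed in the hunk below; the sketch here shows one way the pieces above fit together. Here `loss` is assumed to be a softmax cross-entropy loss and `SGD` the updater from the collapsed "model and updater" section; both names are assumptions, not visible in this diff.

```python
from mxnet import autograd

# Hedged sketch of train_batch; loss and SGD are assumed helpers.
def train_batch(data, label, dev_params, ctx, lr):
    # split the batch and its labels across the GPUs
    dev_data = split_and_load(data, ctx)
    dev_label = split_and_load(label, ctx)
    # forward and backward on each GPU (parallelized by the engine)
    with autograd.record():
        losses = [loss(lenet(X, W), Y)
                  for X, Y, W in zip(dev_data, dev_label, dev_params)]
    for l in losses:
        l.backward()
    # sum the gradients of each parameter over all GPUs and broadcast the sums back
    for i in range(len(dev_params[0])):
        allreduce([dev_params[c][i].grad for c in range(len(ctx))])
    # every GPU now holds identical gradients; update its local copy of the weights
    for params in dev_params:
        SGD(params, lr / data.shape[0])
```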
......@@ -232,16 +179,16 @@ def train(num_gpus, batch_size, lr):
    ctx = [gpu(i) for i in range(num_gpus)]
    print('Running on', ctx)
    # copy parameters to all GPUs
    dev_params = [get_params(params, c) for c in ctx]
    for epoch in range(5):
        # train
        start = time()
        for data, label in train_data:
            train_batch(data, label, dev_params, ctx, lr)
        nd.waitall()
        print('Epoch %d, training time = %.1f sec'%(
            epoch, time()-start))
......@@ -257,32 +204,12 @@ First run on a single GPU with batch size 256.
train(1, 256, 0.3)
```
```{.json .output n=8}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Running on [gpu(0)]\nEpoch 0, training time = 2.2 sec\n validation accuracy = 0.1001\nEpoch 1, training time = 1.8 sec\n validation accuracy = 0.6264\nEpoch 2, training time = 1.8 sec\n validation accuracy = 0.7881\nEpoch 3, training time = 1.8 sec\n validation accuracy = 0.7849\nEpoch 4, training time = 1.8 sec\n validation accuracy = 0.8259\n"
}
]
```
When running on multiple GPUs, we often want to increase the batch size so that each GPU still gets a batch large enough for good computational performance. Since a larger batch size sometimes slows down convergence, we often want to increase the learning rate as well.
```{.python .input n=9}
train(2, 512, 0.6)
```
```{.json .output n=9}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Running on [gpu(0), gpu(1)]\nEpoch 0, training time = 1.3 sec\n validation accuracy = 0.0995\nEpoch 1, training time = 1.1 sec\n validation accuracy = 0.1009\nEpoch 2, training time = 1.1 sec\n validation accuracy = 0.6300\nEpoch 3, training time = 1.1 sec\n validation accuracy = 0.7381\nEpoch 4, training time = 1.1 sec\n validation accuracy = 0.7972\n"
}
]
```
## Conclusion
We have shown how to implement data parallelism for a deep neural network from scratch. Thanks to automatic parallelization, we only need to write serial code, and the engine parallelizes it across multiple GPUs.
......