Commit 6755f426 authored by muli

update scratch

Parent 19e60fa9
......@@ -16,16 +16,6 @@ then we can check how many GPUs are available by running the command `nvidia-smi
!nvidia-smi
```
```{.json .output n=1}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Thu Oct 19 05:22:42 2017 \r\n+-----------------------------------------------------------------------------+\r\n| NVIDIA-SMI 375.26 Driver Version: 375.26 |\r\n|-------------------------------+----------------------+----------------------+\r\n| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n|===============================+======================+======================|\r\n| 0 Tesla M60 Off | 0000:00:1D.0 Off | 0 |\r\n| N/A 37C P0 38W / 150W | 319MiB / 7612MiB | 0% Default |\r\n+-------------------------------+----------------------+----------------------+\r\n| 1 Tesla M60 Off | 0000:00:1E.0 Off | 0 |\r\n| N/A 43C P0 44W / 150W | 2MiB / 7612MiB | 0% Default |\r\n+-------------------------------+----------------------+----------------------+\r\n \r\n+-----------------------------------------------------------------------------+\r\n| Processes: GPU Memory |\r\n| GPU PID Type Process name Usage |\r\n|=============================================================================|\r\n| 0 116696 C .../miniconda3/envs/gluon_zh_docs/bin/python 317MiB |\r\n+-----------------------------------------------------------------------------+\r\n"
}
]
```
We want to use all of the GPUs together to significantly speed up training (in terms of wall clock time).
Remember that CPUs and GPUs can each have multiple cores.
A laptop CPU might have 2 or 4 cores, while a server CPU might have up to 16 or 32 cores.
......@@ -48,19 +38,6 @@ Finally, we collect the gradients from each of the GPUs and sum them together be
The following pseudo-code shows how to train one data batch on *k* GPUs.
    def train_batch(data, k):
        # split data into k parts
        for i = 1, ..., k:  # run in parallel
            compute grad_i w.r.t. weight_i using data_i on the i-th GPU
        grad = grad_1 + ... + grad_k
        for i = 1, ..., k:  # run in parallel
            copy grad to the i-th GPU
            update weight_i using grad
## Define model and updater
......@@ -85,16 +62,16 @@ params = [W1, b1, W2, b2, W3, b3, W4, b4]
# network and loss
def lenet(X, params):
    # first conv
    h1_conv = nd.Convolution(data=X, weight=params[0], bias=params[1],
                             kernel=(3,3), num_filter=20)
    h1_activation = nd.relu(h1_conv)
    h1 = nd.Pooling(data=h1_activation, pool_type="avg",
                    kernel=(2,2), stride=(2,2))
    # second conv
    h2_conv = nd.Convolution(data=h1, weight=params[2], bias=params[3],
                             kernel=(5,5), num_filter=50)
    h2_activation = nd.relu(h2_conv)
    h2 = nd.Pooling(data=h2_activation, pool_type="avg",
                    kernel=(2,2), stride=(2,2))
    h2 = nd.flatten(h2)
    # first dense
......@@ -126,16 +103,6 @@ print('b1 weight = ', new_params[1])
print('b1 grad = ', new_params[1].grad)
```
```{.json .output n=3}
[
{
"name": "stdout",
"output_type": "stream",
"text": "b1 weight = \n[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n 0. 0.]\n<NDArray 20 @gpu(0)>\nb1 grad = \n[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n 0. 0.]\n<NDArray 20 @gpu(0)>\n"
}
]
```
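The cell above prints entries of `new_params`, which are produced by the `get_params` helper used later in `train`; its body falls in the collapsed lines of this diff. A minimal sketch of such a helper, assuming the parameter list layout shown earlier, might look like this:

```python
# Hedged sketch: the actual get_params definition is collapsed in this diff.
def get_params(params, ctx):
    # copy every parameter to the given device and allocate a gradient buffer there
    new_params = [p.copyto(ctx) for p in params]
    for p in new_params:
        p.attach_grad()
    return new_params
```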
Given a list of data that spans multiple GPUs, we then define a function to sum the data
and broadcast the results to each GPU.
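The function body itself sits in the collapsed portion of the hunk below. A minimal sketch of such an `allreduce`, which accumulates on the first device and then broadcasts the sum back, could be:

```python
# Hedged sketch of the collapsed allreduce cell: accumulate the sum on the
# first device, then broadcast it back to every other device.
def allreduce(data):
    # sum every copy onto data[0]'s device
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)
    # broadcast the result back
    for i in range(1, len(data)):
        data[0].copyto(data[i])
```

The test cell that follows shows exactly this behavior: `[1, 1]` on `gpu(0)` and `[2, 2]` on `gpu(1)` both become `[3, 3]` after the call.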
......@@ -153,16 +120,6 @@ allreduce(data)
print('After:', data)
```
```{.json .output n=4}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Before: [\n[[ 1. 1.]]\n<NDArray 1x2 @gpu(0)>, \n[[ 2. 2.]]\n<NDArray 1x2 @gpu(1)>]\nAfter: [\n[[ 3. 3.]]\n<NDArray 1x2 @gpu(0)>, \n[[ 3. 3.]]\n<NDArray 1x2 @gpu(1)>]\n"
}
]
```
Given a data batch, we define a function that splits the batch and copies each part onto the corresponding GPU; a sketch follows before the (partially collapsed) cell.
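The helper name below is an assumption, since the defining cell is only partially visible in this diff; in practice `gluon.utils.split_and_load` offers the same functionality, but the scratch version mirrors what the tutorial builds by hand.

```python
# Hypothetical split_and_load helper, assuming the batch size divides evenly
# across the devices.
def split_and_load(data, ctx):
    n, k = data.shape[0], len(ctx)
    assert n % k == 0, 'batch size must be divisible by the number of GPUs'
    m = n // k
    # slice the batch evenly and copy each slice onto its device
    return [data[i*m:(i+1)*m].as_in_context(ctx[i]) for i in range(k)]
```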
```{.python .input n=5}
......@@ -181,16 +138,6 @@ print('Load into', ctx)
print('Output:', splitted)
```
```{.json .output n=5}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Intput: \n[[ 0. 1. 2. 3.]\n [ 4. 5. 6. 7.]\n [ 8. 9. 10. 11.]\n [ 12. 13. 14. 15.]]\n<NDArray 4x4 @cpu(0)>\nLoad into [gpu(0), gpu(1)]\nOutput: [\n[[ 0. 1. 2. 3.]\n [ 4. 5. 6. 7.]]\n<NDArray 2x4 @gpu(0)>, \n[[ 8. 9. 10. 11.]\n [ 12. 13. 14. 15.]]\n<NDArray 2x4 @gpu(1)>]\n"
}
]
```
## Train one batch
Now we are ready to implement training on one data batch with data parallelism.
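The `train_batch` implementation is largely collapsed in the hunk below; the sketch here shows one way the pieces above fit together. Here `loss` is assumed to be a softmax cross-entropy loss and `SGD` the updater from the collapsed "model and updater" section; both names are assumptions, not visible in this diff.

```python
from mxnet import autograd

# Hedged sketch of train_batch; loss and SGD are assumed helpers.
def train_batch(data, label, dev_params, ctx, lr):
    # split the batch and its labels across the GPUs
    dev_data = split_and_load(data, ctx)
    dev_label = split_and_load(label, ctx)
    # forward and backward on each GPU (parallelized by the engine)
    with autograd.record():
        losses = [loss(lenet(X, W), Y)
                  for X, Y, W in zip(dev_data, dev_label, dev_params)]
    for l in losses:
        l.backward()
    # sum the gradients of each parameter over all GPUs and broadcast the sums back
    for i in range(len(dev_params[0])):
        allreduce([dev_params[c][i].grad for c in range(len(ctx))])
    # every GPU now holds identical gradients; update its local copy of the weights
    for params in dev_params:
        SGD(params, lr / data.shape[0])
```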
......@@ -232,16 +179,16 @@ def train(num_gpus, batch_size, lr):
    ctx = [gpu(i) for i in range(num_gpus)]
    print('Running on', ctx)
    # copy parameters to all GPUs
    dev_params = [get_params(params, c) for c in ctx]
    for epoch in range(5):
        # train
        start = time()
        for data, label in train_data:
            train_batch(data, label, dev_params, ctx, lr)
        nd.waitall()
        print('Epoch %d, training time = %.1f sec'%(
            epoch, time()-start))
......@@ -257,32 +204,12 @@ First run on a single GPU with batch size 256.
train(1, 256, 0.3)
```
```{.json .output n=8}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Running on [gpu(0)]\nEpoch 0, training time = 2.2 sec\n validation accuracy = 0.1001\nEpoch 1, training time = 1.8 sec\n validation accuracy = 0.6264\nEpoch 2, training time = 1.8 sec\n validation accuracy = 0.7881\nEpoch 3, training time = 1.8 sec\n validation accuracy = 0.7849\nEpoch 4, training time = 1.8 sec\n validation accuracy = 0.8259\n"
}
]
```
When running on multiple GPUs, we often want to increase the batch size so that each GPU still gets a batch large enough for good computational performance. Since a larger batch size sometimes slows down convergence, we often want to increase the learning rate as well.
```{.python .input n=9}
train(2, 512, 0.6)
```
```{.json .output n=9}
[
{
"name": "stdout",
"output_type": "stream",
"text": "Running on [gpu(0), gpu(1)]\nEpoch 0, training time = 1.3 sec\n validation accuracy = 0.0995\nEpoch 1, training time = 1.1 sec\n validation accuracy = 0.1009\nEpoch 2, training time = 1.1 sec\n validation accuracy = 0.6300\nEpoch 3, training time = 1.1 sec\n validation accuracy = 0.7381\nEpoch 4, training time = 1.1 sec\n validation accuracy = 0.7972\n"
}
]
```
## Conclusion
We have shown how to implement data parallelism for a deep neural network from scratch. Thanks to automatic parallelization, we only need to write serial code, and the engine parallelizes it across multiple GPUs.
......