# Distributed Image Classification Models Training

This folder contains implementations of **Image Classification Models** that are designed to support large-scale distributed training in two modes: parameter server mode and NCCL2 (NVIDIA NCCL2 communication library) collective mode.

## Getting Started

Before getting started, please make sure you have gone through the ImageNet [Data Preparation](../README.md#data-preparation).

1. The entry point is `dist_train.py`; some important flags are as follows:

    - `model`, the model to run; the default is the fine-tune model `DistResnet`.
    - `batch_size`, the batch size per device.
    - `update_method`, the update method; one of `local`, `pserver`, or `nccl2`.
    - `device`, the device to train on, CPU or GPU.
    - `gpus`, the number of GPU devices used by the process.

    You can check out more details of the flags with `python dist_train.py --help`; a minimal single-node (local mode) run is sketched after this list.

1. Runtime configurations

    We use environment variables to distinguish the different training roles of a distributed training job.

    - `PADDLE_TRAINING_ROLE`, the current training role, either `PSERVER` or `TRAINER`.
    - `PADDLE_TRAINERS`, the number of trainers in the job.
    - `PADDLE_CURRENT_IP`, the IP of the current instance.
    - `PADDLE_PSERVER_IPS`, the parameter server IP list, separated by ","; only used when `update_method` is `pserver`.
    - `PADDLE_TRAINER_ID`, the unique trainer ID within the job, in the range [0, PADDLE_TRAINERS).
    - `PADDLE_PSERVER_PORT`, the port the parameter server listens on.
    - `PADDLE_TRAINER_IPS`, the trainer IP list, separated by ","; only used when `update_method` is `nccl2`.

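The following is a minimal sketch of a single-node run with `--update_method local`, which needs none of the distributed environment variables above. The exact combination of flags is an assumption; check `python dist_train.py --help` for the flags your version supports.

``` bash
# Hypothetical single-node run in local mode: no parameter servers and no
# distributed environment variables, just the flags described above.
python dist_train.py \
    --model=DistResnet \
    --batch_size=32 \
    --update_method=local \
    --device=GPU \
    --gpus=8 \
    --data_dir=../data/ILSVRC2012
```
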
### Parameter Server Mode

In this example, we launch 4 parameter server instances and 4 trainer instances in the cluster:

1. Launch the parameter server process

    ``` bash
    PADDLE_TRAINING_ROLE=PSERVER \
    PADDLE_TRAINERS=4 \
    PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
    PADDLE_CURRENT_IP=192.168.0.100 \
    PADDLE_PSERVER_PORT=7164 \
    python dist_train.py \
        --model=DistResnet \
        --batch_size=32 \
        --update_method=pserver \
        --device=CPU \
        --data_dir=../data/ILSVRC2012
    ```

1. Launch the trainer process (a sketch for launching all four trainers at once follows this section)

    ``` bash
    PADDLE_TRAINING_ROLE=TRAINER \
    PADDLE_TRAINERS=4 \
    PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
    PADDLE_TRAINER_ID=0 \
    PADDLE_PSERVER_PORT=7164 \
    python dist_train.py \
        --model=DistResnet \
        --batch_size=32 \
        --update_method=pserver \
        --device=GPU \
        --data_dir=../data/ILSVRC2012
    ```

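Each parameter server instance sets `PADDLE_CURRENT_IP` to its own address, and each trainer sets a distinct `PADDLE_TRAINER_ID` in [0, 4). The sketch below launches all four trainers from one control machine; it assumes password-less `ssh` access, identical code and data paths on every node, and the hypothetical path `/path/to/dist_train`, none of which are part of this repo.

``` bash
# Hypothetical helper: start trainer i on node TRAINERS[i] over ssh.
# Assumes password-less ssh and identical paths on every node.
TRAINERS=(192.168.0.100 192.168.0.101 192.168.0.102 192.168.0.103)
for i in "${!TRAINERS[@]}"; do
    ssh "${TRAINERS[$i]}" \
        "cd /path/to/dist_train && \
         PADDLE_TRAINING_ROLE=TRAINER \
         PADDLE_TRAINERS=4 \
         PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
         PADDLE_TRAINER_ID=$i \
         PADDLE_PSERVER_PORT=7164 \
         python dist_train.py --model=DistResnet --batch_size=32 \
             --update_method=pserver --device=GPU --data_dir=../data/ILSVRC2012" &
done
wait
```
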
### NCCL2 Collective Mode

1. Launch the trainer process (run the same command on every node with its own `PADDLE_TRAINER_ID`; a sketch for deriving the ID follows this section)

    ``` bash
    PADDLE_TRAINING_ROLE=TRAINER \
    PADDLE_TRAINERS=4 \
    PADDLE_TRAINER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
    PADDLE_TRAINER_ID=0 \
    python dist_train.py \
        --model=DistResnet \
        --batch_size=32 \
        --update_method=nccl2 \
        --device=GPU \
        --data_dir=../data/ILSVRC2012
    ```

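In NCCL2 mode there is no parameter server; the same command runs on every trainer node, and each node sets `PADDLE_TRAINER_ID` to its index in `PADDLE_TRAINER_IPS`. The sketch below derives the ID from the node's own IP; the use of `hostname -I` to obtain that IP is an assumption and may need adjusting for your network setup.

``` bash
# Derive this node's trainer ID from its position in PADDLE_TRAINER_IPS.
# Assumes `hostname -I` reports the same IP that appears in the list below.
TRAINER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103
CURRENT_IP=$(hostname -I | awk '{print $1}')
IFS=',' read -ra IPS <<< "$TRAINER_IPS"
for i in "${!IPS[@]}"; do
    [ "${IPS[$i]}" = "$CURRENT_IP" ] && TRAINER_ID=$i
done

PADDLE_TRAINING_ROLE=TRAINER \
PADDLE_TRAINERS=4 \
PADDLE_TRAINER_IPS=$TRAINER_IPS \
PADDLE_TRAINER_ID=$TRAINER_ID \
python dist_train.py \
    --model=DistResnet \
    --batch_size=32 \
    --update_method=nccl2 \
    --device=GPU \
    --data_dir=../data/ILSVRC2012
```
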
### Visualize the Training Process

It's easy to draw the learning curve from the training logs; for example, the logs of ResNet50 look like this:

``` text
Pass 0, batch 0, loss 7.0336914, accucacys: [0.0, 0.00390625]
Pass 0, batch 1, loss 7.094781, accucacys: [0.0, 0.0]
Pass 0, batch 2, loss 7.007068, accucacys: [0.0, 0.0078125]
Pass 0, batch 3, loss 7.1056547, accucacys: [0.00390625, 0.00390625]
Pass 0, batch 4, loss 7.133543, accucacys: [0.0, 0.0078125]
Pass 0, batch 5, loss 7.3055463, accucacys: [0.0078125, 0.01171875]
Pass 0, batch 6, loss 7.341838, accucacys: [0.0078125, 0.01171875]
Pass 0, batch 7, loss 7.290557, accucacys: [0.0, 0.0]
Pass 0, batch 8, loss 7.264951, accucacys: [0.0, 0.00390625]
Pass 0, batch 9, loss 7.43522, accucacys: [0.00390625, 0.00390625]
```

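To draw the curve, extract the loss (or accuracy) column from such logs and feed it to any plotting tool. A minimal sketch, assuming the training output has been redirected to a file named `train.log` (the file name is an assumption):

``` bash
# Pull the loss value out of each "Pass ..., loss ..." line so it can be
# plotted with gnuplot, a spreadsheet, or matplotlib.
grep '^Pass' train.log | awk -F'loss ' '{split($2, a, ","); print a[1]}' > loss.txt
```
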
The figure below shows the top-1 training accuracy for local training with 8 GPUs, distributed training with 32 GPUs, and distributed training with the batch merge feature turned on. Note that the red curve is trained with the original model configuration, which does not include the warmup and some other detailed modifications.

For distributed training with 32 GPUs using `--model DistResnet`, we can achieve a test accuracy of 75.5% after 90 passes of training (the test accuracy is not shown in the figure below). We can also achieve this result using the "batch merge" feature by setting `--multi_batch_repeat 4`, with higher throughput.

<p align="center">
<img src="../images/resnet50_32gpus-acc1.png" height=300 width=528 > <br/>
Training top-1 accuracy curves
</p>

### Finetuning for Distributed Training

The default ResNet50 distributed training configuration is based on this paper: https://arxiv.org/pdf/1706.02677.pdf

- Use `--model DistResnet`.
- We use 32 P40 GPUs on 4 nodes, each with 8 GPUs.
- We set `batch_size=32` for each GPU; in the `batch_merge=on` case, we repeat the batch 4 times before communicating with the parameter server.
- The learning rate starts from 0.1 and warms up to 0.4 over 5 passes (because we already have gradient merging, we only need to scale it linearly with the trainer count) using 4 nodes.
- Using batch merge (`--multi_batch_repeat 4`) makes better use of GPU computing power and increases the total training throughput (see the sketch after this list). In the fine-tune configuration we have to use `batch_size=32` per GPU, and recent GPUs are fast enough that communication between nodes can limit the overall speed. In batch merge mode we run forward and backward computation for several batches, then merge the gradients and send them to the parameter server for optimization; we use separate batch norm mean and variance variables in each repeat so that adding repeats behaves the same as adding more GPUs.

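Below is a sketch of enabling batch merge for the trainer launch from the parameter server example above; the only change is the extra `--multi_batch_repeat 4` flag (how it combines with the other flags is an assumption, check `python dist_train.py --help`).

``` bash
# Same trainer launch as in the parameter server example, with batch merge
# enabled: each GPU runs 4 batches of size 32 before the gradients are merged
# and sent to the parameter servers.
PADDLE_TRAINING_ROLE=TRAINER \
PADDLE_TRAINERS=4 \
PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
PADDLE_TRAINER_ID=0 \
PADDLE_PSERVER_PORT=7164 \
python dist_train.py \
    --model=DistResnet \
    --batch_size=32 \
    --update_method=pserver \
    --multi_batch_repeat 4 \
    --device=GPU \
    --data_dir=../data/ILSVRC2012
```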

### Performance

TBD