Commit 934f1f67, authored by typhoonzero

refine dist image classification

Parent commit: 9a4f5786
...@@ -7,13 +7,15 @@ large-scaled distributed training with two distributed mode: parameter server mo

Before getting started, please make sure you have gone through the ImageNet [Data Preparation](../README.md#data-preparation).

1. The entrypoint file is `dist_train.py`. Its command-line arguments are almost the same as those of the original `train.py`, with the following arguments specific to distributed training:

    - `update_method`, specify the update method; choose from local, pserver or nccl2.
    - `multi_batch_repeat`, set this greater than 1 to merge batches before pushing gradients to pservers.
    - `start_test_pass`, when to start running tests.
    - `num_threads`, how many threads will be used for ParallelExecutor.
    - `split_var`, in pserver mode, whether to split one parameter onto several pservers; defaults to True.
    - `async_mode`, do asynchronous training; defaults to False.
    - `reduce_strategy`, choose from "reduce" or "allreduce".

    An illustrative invocation is sketched below; you can check out more details of the flags by `python dist_train.py --help`.
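For example, a hypothetical invocation that combines a few of these flags with the usual `train.py` arguments (the flag values below are illustrative only, and pserver mode additionally needs the environment variables described in the next section):

``` bash
# Illustrative only; consult `python dist_train.py --help` for the full list.
python dist_train.py \
    --model DistResNet \
    --batch_size 32 \
    --update_method pserver \
    --multi_batch_repeat 2 \
    --start_test_pass 0 \
    --num_threads 4
```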
...@@ -21,66 +23,27 @@ Before getting started, please make sure you have gone through the ImageNet [Data Preparation](../README.md#data-preparation).

We use environment variables to distinguish the different training roles of a distributed training job.

- General envs:
    - `PADDLE_TRAINER_ID`, the unique trainer ID of a job, in the range [0, PADDLE_TRAINERS_NUM).
    - `PADDLE_TRAINERS_NUM`, the trainer count of a distributed job.
    - `PADDLE_CURRENT_ENDPOINT`, the current process endpoint.
- Pserver mode:
    - `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
    - `PADDLE_PSERVER_ENDPOINTS`, the parameter server endpoint list, separated by ",".
- NCCL2 mode:
    - `PADDLE_TRAINER_ENDPOINTS`, the endpoint list of all workers, separated by ",".
### Try Out Different Distributed Training Modes

You can test whether distributed training works on a single node before deploying to the "real" cluster.

***NOTE: for best performance, we recommend using multi-process mode (No. 4 in the list below) together with fp16.***

1. Simply run `python dist_train.py` to start local training with the default configuration.
2. For pserver mode, run `bash run_ps_mode.sh` to start 2 pservers and 2 trainers; these 2 trainers will use GPU 0 and 1 to simulate 2 workers.
3. For nccl2 mode, run `bash run_nccl2_mode.sh` to start 2 workers.
4. For local/distributed multi-process mode, run `bash run_mp_mode.sh` (this test uses 4 GPUs).
### Visualize the Training Process

...@@ -88,16 +51,10 @@ It's easy to draw the learning curve according to the training logs, for example

the logs of ResNet50 are as follows:

``` text
Pass 0, batch 30, loss 7.569439, acc1: 0.0125, acc5: 0.0125, avg batch time 0.1720
Pass 0, batch 60, loss 7.027379, acc1: 0.0, acc5: 0.0, avg batch time 0.1551
Pass 0, batch 90, loss 6.819984, acc1: 0.0, acc5: 0.0125, avg batch time 0.1492
Pass 0, batch 120, loss 6.9076853, acc1: 0.0, acc5: 0.0125, avg batch time 0.1464
```
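A minimal sketch of turning such log lines into a curve; it assumes `matplotlib` is installed and that the log was written to `logs/tr0.log` (the path used by the test scripts below), both of which are assumptions rather than part of this commit:

``` python
import re
import matplotlib.pyplot as plt

batches, losses = [], []
with open("logs/tr0.log") as f:
    for line in f:
        # lines look like: "Pass 0, batch 30, loss 7.569439, acc1: ..."
        m = re.search(r"batch (\d+), loss ([\d.]+)", line)
        if m:
            batches.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(batches, losses)
plt.xlabel("batch")
plt.ylabel("train loss")
plt.savefig("loss_curve.png")
```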
The figure below shows top-1 train accuracy for local training with 8 GPUs and distributed training
......
import numpy as np
import paddle.fluid as fluid


def copyback_repeat_bn_params(main_prog):
    # Collect the Mean/Variance inputs of every batch_norm op, then copy the
    # statistics accumulated in the ".repeat.0" duplicates back into the
    # original variables so that evaluation uses the updated values.
    repeat_vars = set()
    for op in main_prog.global_block().ops:
        if op.type == "batch_norm":
            repeat_vars.add(op.input("Mean")[0])
            repeat_vars.add(op.input("Variance")[0])
    for vname in repeat_vars:
        real_var = fluid.global_scope().find_var("%s.repeat.0" % vname).get_tensor()
        orig_var = fluid.global_scope().find_var(vname).get_tensor()
        orig_var.set(np.array(real_var), fluid.CUDAPlace(0))  # test on GPU0


def append_bn_repeat_init_op(main_prog, startup_prog, num_repeats):
    # Create "<name>.repeat.<i>" copies of every batch_norm Mean/Variance
    # variable and append matching fill_constant initializers to the startup
    # program, one copy per repeated batch.
    repeat_vars = set()
    for op in main_prog.global_block().ops:
        if op.type == "batch_norm":
            repeat_vars.add(op.input("Mean")[0])
            repeat_vars.add(op.input("Variance")[0])

    for i in range(num_repeats):
        for op in startup_prog.global_block().ops:
            if op.type == "fill_constant":
                for oname in op.output_arg_names:
                    if oname in repeat_vars:
                        var = startup_prog.global_block().var(oname)
                        repeat_var_name = "%s.repeat.%d" % (oname, i)
                        repeat_var = startup_prog.global_block().create_var(
                            name=repeat_var_name,
                            type=var.type,
                            dtype=var.dtype,
                            shape=var.shape,
                            persistable=var.persistable)
                        main_prog.global_block()._clone_variable(repeat_var)
                        startup_prog.global_block().append_op(
                            type="fill_constant",
                            inputs={},
                            outputs={"Out": repeat_var},
                            attrs=op.all_attrs())
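These helpers appear to support the `multi_batch_repeat` flag; the following is a purely hypothetical sketch of how an entrypoint might wire them in (the variable names and call sites are assumptions, not code taken from `dist_train.py`):

``` python
# Hypothetical wiring, assuming train_prog, startup_prog, test_prog and an
# executor `exe` already exist.
if args.multi_batch_repeat > 1:
    # create the per-repeat batch_norm statistics before initialization
    append_bn_repeat_init_op(train_prog, startup_prog, args.multi_batch_repeat)
exe.run(startup_prog)

# ... training passes ...

if args.multi_batch_repeat > 1:
    # fold the repeated batch_norm statistics back before evaluation
    copyback_repeat_bn_params(train_prog)
# ... run test_prog ...
```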
import os

import paddle.fluid as fluid


def nccl2_prepare(args, startup_prog):
    # Transpile the startup program for NCCL2 (collective) mode so that each
    # worker knows its own endpoint and the full worker endpoint list.
    config = fluid.DistributeTranspilerConfig()
    config.mode = "nccl2"
    t = fluid.DistributeTranspiler(config=config)

    envs = args.dist_env
    t.transpile(envs["trainer_id"],
                trainers=','.join(envs["trainer_endpoints"]),
                current_endpoint=envs["current_endpoint"],
                startup_program=startup_prog)


def pserver_prepare(args, train_prog, startup_prog):
    # Transpile for parameter server mode and return the program pair that the
    # current role (PSERVER or TRAINER) should actually run.
    config = fluid.DistributeTranspilerConfig()
    config.slice_var_up = args.split_var
    t = fluid.DistributeTranspiler(config=config)

    envs = args.dist_env
    training_role = envs["training_role"]

    t.transpile(
        envs["trainer_id"],
        program=train_prog,
        pservers=envs["pserver_endpoints"],
        trainers=envs["num_trainers"],
        sync_mode=not args.async_mode,
        startup_program=startup_prog)
    if training_role == "PSERVER":
        pserver_program = t.get_pserver_program(envs["current_endpoint"])
        pserver_startup_program = t.get_startup_program(
            envs["current_endpoint"], pserver_program, startup_program=startup_prog)
        return pserver_program, pserver_startup_program
    elif training_role == "TRAINER":
        train_program = t.get_trainer_program()
        return train_program, startup_prog
    else:
        raise ValueError(
            'PADDLE_TRAINING_ROLE environment variable must be either TRAINER or PSERVER'
        )
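A minimal sketch of how a training script could dispatch on the `update_method` flag using these helpers (the surrounding names `args`, `train_prog`, and `startup_prog` are assumptions for illustration, not the exact code in `dist_train.py`):

``` python
# Hypothetical dispatch; "local" mode needs no transpilation at all.
if args.update_method == "pserver":
    # returns either the pserver programs or the trainer program,
    # depending on PADDLE_TRAINING_ROLE
    prog_to_run, startup_to_run = pserver_prepare(args, train_prog, startup_prog)
elif args.update_method == "nccl2":
    nccl2_prepare(args, startup_prog)  # rewrites startup_prog in place
    prog_to_run, startup_to_run = train_prog, startup_prog
else:
    prog_to_run, startup_to_run = train_prog, startup_prog
```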
import os


def dist_env():
    """
    Return a dict of all the variables that distributed training may use.
    NOTE: you may rewrite this function to suit your cluster environments.
    """
    trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
    num_trainers = 1
    training_role = os.getenv("PADDLE_TRAINING_ROLE", "TRAINER")
    assert (training_role == "PSERVER" or training_role == "TRAINER")

    # - PADDLE_TRAINER_ENDPOINTS means nccl2 mode.
    # - PADDLE_PSERVER_ENDPOINTS means pserver mode.
    # - PADDLE_CURRENT_ENDPOINT means current process endpoint.
    trainer_endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS")
    pserver_endpoints = os.getenv("PADDLE_PSERVER_ENDPOINTS")
    current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
    if trainer_endpoints:
        trainer_endpoints = trainer_endpoints.split(",")
        num_trainers = len(trainer_endpoints)
    elif pserver_endpoints:
        num_trainers = int(os.getenv("PADDLE_TRAINERS_NUM"))

    return {
        "trainer_id": trainer_id,
        "num_trainers": num_trainers,
        "current_endpoint": current_endpoint,
        "training_role": training_role,
        "pserver_endpoints": pserver_endpoints,
        "trainer_endpoints": trainer_endpoints
    }
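For a concrete picture, this is roughly what `dist_env()` returns for trainer 0 under the environment exported by the nccl2 test script shown later; setting the variables inline here is only for illustration, and assumes `dist_env` is in scope:

``` python
import os

os.environ.update({
    "PADDLE_TRAINING_ROLE": "TRAINER",
    "PADDLE_TRAINER_ENDPOINTS": "127.0.0.1:7160,127.0.0.1:7161",
    "PADDLE_CURRENT_ENDPOINT": "127.0.0.1:7160",
    "PADDLE_TRAINER_ID": "0",
})
print(dist_env())
# {'trainer_id': 0, 'num_trainers': 2,
#  'current_endpoint': '127.0.0.1:7160', 'training_role': 'TRAINER',
#  'pserver_endpoints': None,
#  'trainer_endpoints': ['127.0.0.1:7160', '127.0.0.1:7161']}
```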
#!/bin/bash
# Multi-process single-node test: one trainer process per GPU, all using the
# nccl2 update method (see run_mp_mode.sh in the README).

# Test using 4 GPUs
export CUDA_VISIBLE_DEVICES="0,1,2,3"
export MODEL="DistResNet"
export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161,127.0.0.1:7162,127.0.0.1:7163"
# PADDLE_TRAINERS_NUM is only used by the reader in nccl2 mode.
export PADDLE_TRAINERS_NUM="4"

mkdir -p logs

for i in {0..3}
do
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:716${i}" \
PADDLE_TRAINER_ID="${i}" \
FLAGS_selected_gpus="${i}" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 --fp16 1 --scale_loss 8 &> logs/tr$i.log &
done
#!/bin/bash
# NCCL2 single-node test: two single-GPU workers (see run_nccl2_mode.sh in the README).

export MODEL="DistResNet"
export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
# PADDLE_TRAINERS_NUM is only used by the reader in nccl2 mode.
export PADDLE_TRAINERS_NUM="2"

mkdir -p logs

# NOTE: set NCCL_P2P_DISABLE so that nccl2 distributed training can run on a single node.
PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
PADDLE_TRAINER_ID="0" \
CUDA_VISIBLE_DEVICES="0" \
NCCL_P2P_DISABLE="1" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr0.log &

PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
PADDLE_TRAINER_ID="1" \
CUDA_VISIBLE_DEVICES="1" \
NCCL_P2P_DISABLE="1" \
python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr1.log &
#!/bin/bash
# Pserver-mode single-node test: two parameter servers plus two single-GPU
# trainers (see run_ps_mode.sh in the README).

export MODEL="DistResNet"
export PADDLE_PSERVER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
export PADDLE_TRAINERS_NUM="2"

mkdir -p logs

PADDLE_TRAINING_ROLE="PSERVER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps0.log &

PADDLE_TRAINING_ROLE="PSERVER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps1.log &

PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
PADDLE_TRAINER_ID="0" \
CUDA_VISIBLE_DEVICES="0" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr0.log &

PADDLE_TRAINING_ROLE="TRAINER" \
PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
PADDLE_TRAINER_ID="1" \
CUDA_VISIBLE_DEVICES="1" \
python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr1.log &
...@@ -14,8 +14,9 @@ train_parameters = {
    "learning_strategy": {
        "name": "piecewise_decay",
        "batch_size": 256,
        "epochs": [30, 60, 80],
        "steps": [0.1, 0.01, 0.001, 0.0001],
        "warmup_passes": 5
    }
}
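For context, a minimal sketch of how such `learning_strategy` fields typically map to a step-based piecewise decay schedule; the `total_images` value and the use of `fluid.layers.piecewise_decay` here are assumptions for illustration, not code from this commit:

``` python
import paddle.fluid as fluid

params = {"batch_size": 256, "epochs": [30, 60, 80],
          "steps": [0.1, 0.01, 0.001, 0.0001], "warmup_passes": 5}

total_images = 1281167                      # ImageNet train set size (assumption)
steps_per_pass = total_images // params["batch_size"]

# Decay the learning rate at passes 30, 60 and 80, expressed in iterations.
boundaries = [e * steps_per_pass for e in params["epochs"]]
values = params["steps"]                    # one value per interval

lr = fluid.layers.piecewise_decay(boundaries=boundaries, values=values)
# "warmup_passes": 5 would additionally ramp the learning rate up over the
# first 5 passes, e.g. via fluid.layers.linear_lr_warmup in newer Fluid versions.
```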
...@@ -118,3 +119,4 @@ class DistResNet():
        short = self.shortcut(input, num_filters * 4, stride)
        return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')
...@@ -139,7 +139,7 @@ def _reader_creator(file_list,
    if mode == 'train' and os.getenv('PADDLE_TRAINING_ROLE'):
        # distributed mode if the env var `PADDLE_TRAINING_ROLE` exists
        trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
        trainer_count = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
        per_node_lines = len(full_lines) // trainer_count
        lines = full_lines[trainer_id * per_node_lines:(trainer_id + 1)
                           * per_node_lines]
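A quick illustration of the sharding arithmetic in the hunk above (the file-list contents here are made up):

``` python
# With 10 file-list lines and 4 trainers, each trainer reads a disjoint slice.
full_lines = ["img_%d.jpeg %d" % (i, i) for i in range(10)]
trainer_count = 4
per_node_lines = len(full_lines) // trainer_count   # 2
for trainer_id in range(trainer_count):
    shard = full_lines[trainer_id * per_node_lines:(trainer_id + 1) * per_node_lines]
    print(trainer_id, shard)
# Trainer 0 gets lines 0-1, trainer 1 gets 2-3, and so on; the last
# 10 % 4 = 2 lines are dropped by this integer-division split.
```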
......