Before getting started, please make sure you have gone through the imagenet [Data Preparation](../README.md#data-preparation).
1. The entrypoint file is `dist_train.py`. Its command-line arguments are almost the same as those of the original `train.py`, with the following arguments specific to distributed training:

- `update_method`, specify the update method, can choose from local, pserver or nccl2.
- `multi_batch_repeat`, set this greater than 1 to merge batches before pushing gradients to pservers.
- `start_test_pass`, when to start running tests.
- `num_threads`, how many threads will be used for ParallelExecutor.
- `split_var`, in pserver mode, whether to split one parameter to several pservers, default True.
- `async_mode`, do async training, default False.
- `reduce_strategy`, choose from "reduce", "allreduce".

You can check out more details of the flags with `python dist_train.py --help`.
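For example, a quick single-machine sanity run might look like the following sketch; the flag values are illustrative only, and any argument not set falls back to the script's defaults.

``` bash
# Illustrative local run: no pservers or NCCL2 setup required.
# Flag values are examples, not recommendations.
python dist_train.py \
    --update_method local \
    --num_threads 4 \
    --start_test_pass 0
```
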
We use environment variables to distinguish the different training roles of a distributed training job:

- General envs:
    - `PADDLE_TRAINER_ID`, the unique trainer ID of a job, ranging in [0, PADDLE_TRAINERS_NUM).
    - `PADDLE_TRAINERS_NUM`, the trainer count of a distributed job.
    - `PADDLE_CURRENT_ENDPOINT`, the endpoint of the current process.
- Pserver mode:
    - `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
    - `PADDLE_PSERVER_ENDPOINTS`, the parameter server endpoint list, separated by ",".
- NCCL2 mode:
    - `PADDLE_TRAINER_ENDPOINTS`, the endpoint list of all workers, separated by ",".
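
As a concrete sketch, a trainer process in pserver mode might be launched as below; the endpoints, ports, and counts are placeholders, not values taken from this repo.

``` bash
# Hypothetical 2-pserver, 2-trainer job; this command starts trainer 0.
PADDLE_TRAINING_ROLE=TRAINER \
PADDLE_TRAINER_ID=0 \
PADDLE_TRAINERS_NUM=2 \
PADDLE_CURRENT_ENDPOINT=192.168.0.11:6170 \
PADDLE_PSERVER_ENDPOINTS=192.168.0.1:6170,192.168.0.2:6170 \
python dist_train.py --update_method pserver
```

Each pserver instance would be started in the same way, with `PADDLE_TRAINING_ROLE=PSERVER` and its own `PADDLE_CURRENT_ENDPOINT`.
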
### Try Out Different Distributed Training Modes

You can test if distributed training works on a single node before deploying to the "real" cluster.

***NOTE: for best performance, we recommend using multi-process mode; see No.3, and use it together with fp16.***

1. simply run `python dist_train.py` to start local training with the default configuration.