@@ -7,13 +7,15 @@ large-scaled distributed training with two distributed mode: parameter server mo
Before getting started, please make sure you have gone through the ImageNet [Data Preparation](../README.md#data-preparation).
1. The entrypoint file is `dist_train.py`. Its command-line arguments are almost the same as those of the original `train.py`, with the following arguments specific to distributed training:
- `model`, the model to run; the default is the fine-tune model `DistResnet`.
- `batch_size`, the batch size per device.
- `update_method`, the update method; choose from `local`, `pserver`, or `nccl2`.
- `device`, whether to use the CPU or GPU device.
- `gpus`, the number of GPU devices used by the process.
- `multi_batch_repeat`, set this greater than 1 to merge batches before pushing gradients to the pservers.
- `start_test_pass`, the pass at which to start running tests.
- `num_threads`, the number of threads used by `ParallelExecutor`.
- `split_var`, in pserver mode, whether to split one parameter across several pservers; default is True.
- `async_mode`, whether to do async training; default is False.
- `reduce_strategy`, the reduce strategy; choose from "reduce" or "allreduce".
You can check more details of these flags with `python dist_train.py --help`. A sample invocation is sketched below.
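
As an illustration only, a minimal sketch of launching one trainer with some of the flags listed above might look like the following; the flag values (GPU count, batch size) are placeholders, not recommended settings, and assume the environment variables described in the next section are already set.

```bash
# A minimal sketch of one trainer invocation in pserver mode.
# Flag values below are illustrative placeholders; adapt them to your setup.
python dist_train.py \
    --model DistResnet \
    --update_method pserver \
    --device GPU \
    --gpus 8 \
    --batch_size 32
```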
...
...
@@ -21,66 +23,27 @@ Before getting started, please make sure you have go throught the imagenet [Data
We use environment variables to distinguish the different training roles of a distributed training job; an example configuration is sketched after this list.
- `PADDLE_TRAINING_ROLE`, the current training role; should be one of [PSERVER, TRAINER].
- `PADDLE_TRAINERS`, the number of trainers in the job.
- `PADDLE_CURRENT_IP`, the IP of the current instance.
- `PADDLE_PSERVER_IPS`, the parameter server IP list, separated by ","; only used when `update_method` is pserver.
- `PADDLE_TRAINER_ID`, the unique trainer ID within the job, in the range [0, PADDLE_TRAINERS).
- `PADDLE_PSERVER_PORT`, the port the parameter servers listen on.
- `PADDLE_TRAINER_IPS`, the trainer IP list, separated by ","; only used when `update_method` is nccl2.
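
For example, the environment for one trainer process of a 4-trainer, 4-pserver job in pserver mode could be set up roughly as follows; all IPs, the port, and the counts are placeholder values for illustration and must be replaced with your cluster's actual addresses.

```bash
# Illustrative environment for trainer 0 of a 4-trainer pserver-mode job.
# IP addresses, port, and counts are placeholders; adapt them to your cluster.
export PADDLE_TRAINING_ROLE=TRAINER
export PADDLE_TRAINERS=4
export PADDLE_TRAINER_ID=0
export PADDLE_CURRENT_IP=192.168.0.1
export PADDLE_PSERVER_IPS=192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4
export PADDLE_PSERVER_PORT=6174
```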
### Parameter Server Mode
In this example, we launch 4 parameter server instances and 4 trainer instances in the cluster: