Commit 9b914cb5 authored by Yancey1989

update by comment

Parent 29153a7b
# Distributed Image Classification Models Training
This folder contains implementations of **Image Classification Models**, which are designed to support
large-scale distributed training in two distributed modes: parameter server mode and NCCL2 (Nvidia NCCL2 communication library) collective mode.
## Getting Started
Before getting started, please make sure you have gone through the imagenet [Data Preparation](../README.md#data-preparation).
1. The entrypoint file is `dist_train.py`, some important flags are as follows:
- `model`, the model to run, such as `ResNet50`, `ResNet101`, etc.
- `batch_size`, the batch_size per device.
- `update_method`, specify the update method; choose from `local`, `pserver`, or `nccl2`.
- `device`, use CPU or GPU device.
- `gpus`, the number of GPU devices used by the process.
@@ -29,7 +29,7 @@ Before getting started, please make sure you have finished the imagenet [Data Pr
- `PADDLE_PSERVER_PORT`, the port that the parameter server listens on.
- `PADDLE_TRAINER_IPS`, the trainer IP list, separated by ",", only used when `update_method` is `nccl2` (see the sketch below).
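As a rough illustration of how the flags and environment variables above fit together, here is a minimal, hypothetical launch sketch (not taken from this repository); the port, IP addresses, and flag values are placeholders, not recommended settings.

```python
# Hypothetical launch sketch: builds a dist_train.py command from the flags
# documented above. All values below are placeholders.
import os
import subprocess

env = dict(os.environ)
env["PADDLE_PSERVER_PORT"] = "6174"                     # port the parameter server listens on
env["PADDLE_TRAINER_IPS"] = "192.168.0.1,192.168.0.2"   # only used when update_method is nccl2

cmd = [
    "python", "dist_train.py",
    "--model", "ResNet50",       # model to run
    "--batch_size", "32",        # batch size per device
    "--update_method", "local",  # local, pserver, or nccl2
    "--device", "GPU",
    "--gpus", "1",               # number of GPU devices used by the process
]
subprocess.check_call(cmd, env=env)
```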
### Parameter Server Mode
In this example, we launch 4 parameter server instances and 4 trainer instances in the cluster:
@@ -66,7 +66,7 @@ In this example, we launched 4 parameter server instances and 4 trainer instance
```
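The concrete launch commands are given in the block above. Purely as a hedged illustration of the 4-pserver / 4-trainer topology, the sketch below builds the endpoint list that such a cluster implies; the host names and port value are assumptions, not part of this repository.

```python
# Hypothetical topology sketch for the parameter server mode example:
# 4 pserver instances and 4 trainer instances. Host names are placeholders.
PSERVER_PORT = "6174"  # corresponds to PADDLE_PSERVER_PORT
pserver_hosts = ["ps0", "ps1", "ps2", "ps3"]
trainer_hosts = ["tr0", "tr1", "tr2", "tr3"]

# Endpoint list that every instance in the cluster needs to know about.
pserver_endpoints = ",".join("%s:%s" % (h, PSERVER_PORT) for h in pserver_hosts)
print("pserver endpoints:", pserver_endpoints)
print("trainer count:", len(trainer_hosts))
```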
### NCCL2 Collective Mode
1. Launch the trainer process
@@ -83,9 +83,9 @@ In this example, we launched 4 parameter server instances and 4 trainer instance
--data_dir=../data/ILSVRC2012
```
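Since `PADDLE_TRAINER_IPS` is a comma-separated list, each trainer can derive the full peer list from it. The following is a minimal sketch of that step; the port value and the printed format are assumptions, not part of `dist_train.py`.

```python
# Hypothetical sketch: derive the NCCL2 peer list from PADDLE_TRAINER_IPS.
import os

trainer_ips = os.environ["PADDLE_TRAINER_IPS"].split(",")  # e.g. "ip1,ip2,ip3,ip4"
trainer_port = "6170"                                      # assumed port, not from the docs
endpoints = ["%s:%s" % (ip, trainer_port) for ip in trainer_ips]

print("trainer count:", len(trainer_ips))
print("endpoints:", ",".join(endpoints))
```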
### Visualize the Training Process
It's easy to draw the learning curve according to the training logs; for example,
the logs of ResNet50 are as follows:
``` text
......
```
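Below is a minimal plotting sketch. It assumes each log line contains a pass index and a training accuracy in a `pass=<n> ... train_acc=<x>` style; adjust the regular expression and the log file name (`train.log` here is a placeholder) to match the actual logs printed by `dist_train.py`.

```python
# Hypothetical sketch: parse the training log and draw the learning curve.
# The assumed log line format is "pass=<int> ... train_acc=<float>";
# change the pattern to match the real log format shown above.
import re
import matplotlib.pyplot as plt

passes, accs = [], []
pattern = re.compile(r"pass=(\d+).*?train_acc=([0-9.]+)")
with open("train.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            passes.append(int(m.group(1)))
            accs.append(float(m.group(2)))

plt.plot(passes, accs)
plt.xlabel("pass")
plt.ylabel("train accuracy")
plt.title("ResNet50 learning curve")
plt.savefig("learning_curve.png")
```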
@@ -22,7 +22,7 @@ BENCHMARK_MODELS = [
def parse_args():
parser = argparse.ArgumentParser('Distributed Image Classification Training.')
parser.add_argument(
'--model',
type=str,
@@ -74,8 +74,6 @@ def parse_args():
default='flowers',
choices=['cifar10', 'flowers', 'imagenet'],
help='Optional dataset for benchmark.')
parser.add_argument(
'--infer_only', action='store_true', help='If set, run forward only.')
parser.add_argument(
'--no_test',
action='store_true',
@@ -84,10 +82,6 @@ def parse_args():
'--memory_optimize',
action='store_true',
help='If set, optimize runtime memory before start.')
parser.add_argument(
'--use_fake_data',
action='store_true',
help='If set, omit the actual data-reading operators.')
parser.add_argument(
'--update_method',
type=str,
@@ -104,19 +98,10 @@ def parse_args():
action='store_true',
default=False,
help='Whether to start the pserver in async mode to support ASGD.')
parser.add_argument(
'--use_reader_op',
action='store_true',
help='Whether to use the reader op; the data path must be specified when this is set.'
)
parser.add_argument(
'--no_random',
action='store_true',
help='If set, keep the random seed and do not shuffle the data.')
parser.add_argument(
'--use_lars',
action='store_true',
help='If set, use LARS for the optimizer; ONLY the ResNet model is supported.')
parser.add_argument(
'--reduce_strategy',
type=str,
......