PaddlePaddle / models
Commit 9b914cb5
Authored Sep 20, 2018 by Yancey1989
update by comment
Parent: 29153a7b
Showing 2 changed files with 8 additions and 23 deletions (+8 -23):
fluid/image_classification/dist_train/README.md (+7 -7)
fluid/image_classification/dist_train/args.py (+1 -16)
fluid/image_classification/dist_train/README.md @ 9b914cb5
 # Distributed Image Classification Models Training
 This folder contains implementations of **Image Classification Models**, they are designed to support
-large-scaled distributed training with two distributed architecture: parameter server with RPC and ring-base with Nvidia NCCL2 library.
+large-scaled distributed training with two distributed mode: parameter server mode and NCCL2(Nvidia NCCL2 communication library) collective mode.
 ## Getting Started
-Before getting started, please make sure you have finished the imagenet [Data Preparation](../README.md#data-preparation).
+Before getting started, please make sure you have go throught the imagenet [Data Preparation](../README.md#data-preparation).
 1. The entrypoint file is `dist_train.py`, some important flags are as follows:
 - `model`, the model to run with, such as `ResNet50`, `ResNet101` and etc..
 - `batch_size`, the batch_size per device.
-- `update_method`, specify the update method, local, pserver or nccl2.
+- `update_method`, specify the update method, can choose from local, pserver or nccl2.
 - `device`, use CPU or GPU device.
 - `gpus`, the GPU device count that the process used.
...
@@ -29,7 +29,7 @@ Before getting started, please make sure you have finished the imagenet [Data Pr
 - `PADDLE_PSERVER_PORT`, the port of the parameter pserver listened on.
 - `PADDLE_TRAINER_IPS`, the trainer IP list, separated by ",", only be used with upadte_method is nccl2.
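
Editorial note: the two environment variables listed above could be consumed roughly as follows. This is a minimal sketch, not code from `dist_train.py`. The helper name `read_cluster_env`, the keys of the returned dictionary, and the assumption that the port is only required in `pserver` mode are illustrative choices rather than anything the commit shows.

```python
def read_cluster_env(env, update_method):
    """Collect cluster settings from an environment mapping (hypothetical helper)."""
    settings = {"update_method": update_method}
    if update_method == "pserver":
        # PADDLE_PSERVER_PORT: the port the parameter server listens on.
        # Assumption: only pserver mode needs it.
        settings["pserver_port"] = int(env["PADDLE_PSERVER_PORT"])
    elif update_method == "nccl2":
        # PADDLE_TRAINER_IPS: trainer IP list separated by ",";
        # the README says it is only used when update_method is nccl2.
        raw_ips = env["PADDLE_TRAINER_IPS"]
        settings["trainer_ips"] = [
            ip.strip() for ip in raw_ips.split(",") if ip.strip()
        ]
    elif update_method != "local":
        raise ValueError("update_method must be local, pserver or nccl2")
    return settings


if __name__ == "__main__":
    # Example values only (documentation address range), not a real cluster.
    example_env = {"PADDLE_TRAINER_IPS": "192.0.2.1,192.0.2.2"}
    print(read_cluster_env(example_env, "nccl2"))
```

In a real launch script the mapping would presumably be `os.environ`; taking it as a parameter keeps the sketch easy to exercise without touching the process environment.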
-### Pserver Server with RPC
+### Parameter Server Mode
 In this example, we launched 4 parameter server instances and 4 trainer instances in the cluster:
...
@@ -66,7 +66,7 @@ In this example, we launched 4 parameter server instances and 4 trainer instance
 ```
-### Ring-base with Nvidia NCCL2 library
+### NCCL2 Collective Mode
 1. launch trainer process
...
@@ -83,9 +83,9 @@ In this example, we launched 4 parameter server instances and 4 trainer instance
 --data_dir=../data/ILSVRC2012
 ```
-### Training Curve
+### Visualize the Training Process
-It's easy to draw the training curve accroding to the training logs, for example,
+It's easy to draw the learning curve accroding to the training logs, for example,
 the logs of ResNet50 is as follows:
 ```text
...
 ```
fluid/image_classification/dist_train/args.py @ 9b914cb5
...
@@ -22,7 +22,7 @@ BENCHMARK_MODELS = [
 def parse_args():
-    parser = argparse.ArgumentParser('Fluid model benchmarks.')
+    parser = argparse.ArgumentParser('Distributed Image Classification Training.')
     parser.add_argument(
         '--model',
         type=str,
...
@@ -74,8 +74,6 @@ def parse_args():
         default='flowers',
         choices=['cifar10', 'flowers', 'imagenet'],
         help='Optional dataset for benchmark.')
-    parser.add_argument(
-        '--infer_only', action='store_true', help='If set, run forward only.')
     parser.add_argument(
         '--no_test',
         action='store_true',
...
@@ -84,10 +82,6 @@ def parse_args():
         '--memory_optimize',
         action='store_true',
         help='If set, optimize runtime memory before start.')
-    parser.add_argument(
-        '--use_fake_data',
-        action='store_true',
-        help='If set ommit the actual read data operators.')
     parser.add_argument(
         '--update_method',
         type=str,
...
@@ -104,19 +98,10 @@ def parse_args():
         action='store_true',
         default=False,
         help='Whether start pserver in async mode to support ASGD')
-    parser.add_argument(
-        '--use_reader_op',
-        action='store_true',
-        help='Whether to use reader op, and must specify the data path if set this to true.'
-    )
-    parser.add_argument(
-        '--no_random',
-        action='store_true',
-        help='If set, keep the random seed and do not shuffle the data.')
     parser.add_argument(
         '--use_lars',
         action='store_true',
         help='If set, use lars for optimizers, ONLY support resnet module.')
     parser.add_argument(
         '--reduce_strategy',
         type=str,
...
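
Editorial note: for orientation, the effect of this change on the command line can be approximated with a small stand-alone parser. This is a sketch under stated assumptions, not the contents of `args.py`. Only a few of the surviving flags are reproduced, the defaults for `--model` and `--update_method` are guesses, the `choices` list is taken from the README wording ("local, pserver or nccl2"), and `build_parser`, `accepts` and `REMOVED_FLAGS` are hypothetical names.

```python
import argparse

# Flags deleted by this commit, as shown in the diff above.
REMOVED_FLAGS = ['--infer_only', '--use_fake_data', '--use_reader_op', '--no_random']


def build_parser():
    """Build a reduced parser that mirrors a few flags kept by the commit."""
    # args.py passes the title string positionally; here it is passed
    # explicitly as the description.
    parser = argparse.ArgumentParser(
        description='Distributed Image Classification Training.')
    parser.add_argument(
        '--model', type=str, default='ResNet50',  # default is an assumption
        help='The model to run with.')
    parser.add_argument(
        '--update_method', type=str, default='local',  # default is an assumption
        choices=['local', 'pserver', 'nccl2'],
        help='Specify the update method.')
    parser.add_argument(
        '--memory_optimize', action='store_true',
        help='If set, optimize runtime memory before start.')
    parser.add_argument(
        '--use_lars', action='store_true',
        help='If set, use lars for optimizers, ONLY support resnet module.')
    return parser


def accepts(parser, argv):
    """Return True if argv parses; argparse raises SystemExit on unknown flags."""
    try:
        parser.parse_args(argv)
        return True
    except SystemExit:
        return False


if __name__ == '__main__':
    parser = build_parser()
    print(parser.parse_args(['--update_method', 'nccl2', '--use_lars']))
    for flag in REMOVED_FLAGS:
        print(flag, 'accepted' if accepts(parser, [flag]) else 'rejected')
```

Under these assumptions, a launch script that still passes any of the four removed flags would stop at argument parsing, so such scripts would need updating alongside this commit.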