Unverified commit fee90b50, authored by Yancey, committed by GitHub

Add async sgd document (#8474)

* add async sgd document

* fix ci

* update by comment

* update doc
Parent e84615ba
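## Command-line arguments # Command-line arguments
The following uses the code in `doc/howto/cluster/src/word2vec` as an example to introduce distributed training with the PaddlePaddle v2 API. The following uses the code in `doc/howto/cluster/src/word2vec` as an example to introduce distributed training with the PaddlePaddle v2 API.
### Starting the parameter server ## Starting the parameter server
Run the following command to start a parameter server, which then waits to exchange data with the trainer nodes: Run the following command to start a parameter server, which then waits to exchange data with the trainer nodes: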
```bash ```bash
$ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 $ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1
``` ```
To run the pserver program in the background and save its output to a log file, run: To run the pserver program in the background and save its output to a log file, run:
```bash ```bash
$ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 &> pserver.log $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 &> pserver.log
``` ```
...@@ -20,8 +23,10 @@ $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num ...@@ -20,8 +23,10 @@ $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num
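- ports_num_for_sparse: **required, default 0**, number of ports used for sparse parameter communication - ports_num_for_sparse: **required, default 0**, number of ports used for sparse parameter communication
- num_gradient_servers: **required, default 1**, total number of pservers in the current training job - num_gradient_servers: **required, default 1**, total number of pservers in the current training job
### Starting the trainer ## Starting the trainer
Run the following command to start a trainer program written in Python (the file name is arbitrary, for example train.py) Run the following command to start a trainer program written in Python (the file name is arbitrary, for example train.py)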
```bash ```bash
$ python train.py $ python train.py
``` ```
...@@ -67,7 +72,7 @@ paddle.init( ...@@ -67,7 +72,7 @@ paddle.init(
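- pservers: **required, default 127.0.0.1**, list of IPs of the pservers started for the current training job; separate multiple IPs with "," - pservers: **required, default 127.0.0.1**, list of IPs of the pservers started for the current training job; separate multiple IPs with ","
### Preparing the dataset ## Preparing the dataset
Refer to the sample data preparation script [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py) to prepare the training and validation datasets. We use the paddle.dataset.imikolov dataset and, according to the parallelism of the distributed job (the number of trainer nodes), set `SPLIT_COUNT` at the beginning of `prepare.py` to split the data into multiple parts. Refer to the sample data preparation script [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py) to prepare the training and validation datasets. We use the paddle.dataset.imikolov dataset and, according to the parallelism of the distributed job (the number of trainer nodes), set `SPLIT_COUNT` at the beginning of `prepare.py` to split the data into multiple parts.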
...@@ -84,7 +89,8 @@ for f in flist: ...@@ -84,7 +89,8 @@ for f in flist:
``` ```
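The sample program `prepare.py` splits the training set and the test set each into multiple files (3 in this example, with the suffixes `-00000`, `-00001` and `-00002`): The sample program `prepare.py` splits the training set and the test set each into multiple files (3 in this example, with the suffixes `-00000`, `-00001` and `-00002`):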
```
```bash
train.txt train.txt
train.txt-00000 train.txt-00000
train.txt-00001 train.txt-00001
...@@ -99,12 +105,13 @@ test.txt-00002 ...@@ -99,12 +105,13 @@ test.txt-00002
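Training data formats and the `reader()` of the training program differ widely between training jobs, so developers need to split the training data and write `reader()` according to the actual needs of their own job. Training data formats and the `reader()` of the training program differ widely between training jobs, so developers need to split the training data and write `reader()` according to the actual needs of their own job.
### Preparing the training program ## Preparing the training program
For every training job, we create a workspace on each node that contains the user's training program, its dependencies, and the mounted or downloaded training data shards. For every training job, we create a workspace on each node that contains the user's training program, its dependencies, and the mounted or downloaded training data shards.
In the end, the workspace should look like this: In the end, the workspace should look like this: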
```
```bash
. .
|-- my_lib.py |-- my_lib.py
|-- word_dict.pickle |-- word_dict.pickle
...@@ -133,3 +140,21 @@ test.txt-00002 ...@@ -133,3 +140,21 @@ test.txt-00002
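- `train_data_dir`: directory containing the training data; it can be mounted from distributed storage or downloaded to local disk before the job starts. - `train_data_dir`: directory containing the training data; it can be mounted from distributed storage or downloaded to local disk before the job starts.
- `test_data_dir`: directory containing the test dataset. - `test_data_dir`: directory containing the test dataset.
## Async SGD update
We can enable async SGD updates by setting parameters of the optimizer.
For example, set the `is_async` and `async_lagged_grad_discard_ratio` parameters of the `AdaGrad` optimizer: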
```python
adagrad = paddle.optimizer.AdaGrad(
is_async=True,
async_lagged_grad_discard_ratio=1.6,
learning_rate=3e-3,
regularization=paddle.optimizer.L2Regularization(8e-4))
```
- `is_async`: whether to use async SGD update mode.
- `async_lagged_grad_discard_ratio`: step control for async SGD updates; once enough gradients
(`async_lagged_grad_discard_ratio * num_gradient_servers`) have been received, later gradients
are discarded.
## Command-line arguments # Command-line arguments
We'll take `doc/howto/cluster/src/word2vec` as an example to introduce distributed training using PaddlePaddle v2 API. We'll take `doc/howto/cluster/src/word2vec` as an example to introduce distributed training using PaddlePaddle v2 API.
### Starting parameter server ## Starting parameter server
Type the below command to start a parameter server which will wait for trainers to connect: Type the below command to start a parameter server which will wait for trainers to connect:
```bash ```bash
$ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 $ paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 --nics=eth0
``` ```
If you wish to run parameter servers in background, and save a log file, you can type: If you wish to run parameter servers in background, and save a log file, you can type:
```bash ```bash
$ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 &> pserver.log $ stdbuf -oL /usr/bin/nohup paddle pserver --port=7164 --ports_num=1 --ports_num_for_sparse=1 --num_gradient_servers=1 --nics=eth0 &> pserver.log &
``` ```
Parameter Description Parameter Description
...@@ -21,8 +22,10 @@ Parameter Description ...@@ -21,8 +22,10 @@ Parameter Description
- ports_num: **required, default 1**, total number of ports the server will listen on. - ports_num: **required, default 1**, total number of ports the server will listen on.
- ports_num_for_sparse: **required, default 0**, number of ports that serve sparse parameter updates. - ports_num_for_sparse: **required, default 0**, number of ports that serve sparse parameter updates.
- num_gradient_servers: **required, default 1**, total number of gradient servers. - num_gradient_servers: **required, default 1**, total number of gradient servers.
- nics: **optional, default xgbe0,xgbe1**, names of the network devices the parameter server will listen on.
## Starting trainer
### Starting trainer
Type the command below to start the trainer (name the file whatever you want, like "train.py"): Type the command below to start the trainer (name the file whatever you want, like "train.py"):
```bash ```bash
...@@ -70,7 +73,7 @@ Parameter Description ...@@ -70,7 +73,7 @@ Parameter Description
- trainer_id: **required, default 0**, ID for every trainer, start from 0. - trainer_id: **required, default 0**, ID for every trainer, start from 0.
- pservers: **required, default 127.0.0.1**, list of IPs of parameter servers, separated by ",". - pservers: **required, default 127.0.0.1**, list of IPs of parameter servers, separated by ",".
### Prepare Training Dataset ## Prepare Training Dataset
Here's some example code, [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py): it downloads the public `imikolov` dataset and splits it into multiple files according to job parallelism (trainer count). Modify `SPLIT_COUNT` at the beginning of `prepare.py` to change the number of output files. Here's some example code, [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py): it downloads the public `imikolov` dataset and splits it into multiple files according to job parallelism (trainer count). Modify `SPLIT_COUNT` at the beginning of `prepare.py` to change the number of output files.
...@@ -88,7 +91,7 @@ for f in flist: ...@@ -88,7 +91,7 @@ for f in flist:
The example code `prepare.py` splits the training data and the testing data into 3 files each, with numeric suffixes `-00000`, `-00001` and `-00002`: The example code `prepare.py` splits the training data and the testing data into 3 files each, with numeric suffixes `-00000`, `-00001` and `-00002`:
``` ```bash
train.txt train.txt
train.txt-00000 train.txt-00000
train.txt-00001 train.txt-00001
...@@ -103,13 +106,13 @@ When job started, every trainer needs to get it's own part of data. In some dist ...@@ -103,13 +106,13 @@ When job started, every trainer needs to get it's own part of data. In some dist
Different training jobs may have different data formats and `reader()` functions, so developers may need to write their own data preparation scripts and `reader()` functions for their jobs. Different training jobs may have different data formats and `reader()` functions, so developers may need to write their own data preparation scripts and `reader()` functions for their jobs.
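As a minimal sketch of the split described above (not the actual `prepare.py`, whose body is not shown here), the following assumes a plain-text file with one sample per line and distributes its lines round-robin into `SPLIT_COUNT` files named with the `-00000` style suffix. The helpers `shard_path`, `split_file` and this `reader` are hypothetical, and mapping each trainer to the shard that matches its `trainer_id` is an assumption.

```python
SPLIT_COUNT = 3  # number of output files, one per trainer (assumed)


def shard_path(path, index):
    """Return the shard file name, e.g. train.txt-00001."""
    return "%s-%05d" % (path, index)


def split_file(path, split_count=SPLIT_COUNT):
    """Distribute the lines of `path` round-robin into `split_count` shards."""
    outputs = [open(shard_path(path, i), "w") for i in range(split_count)]
    try:
        with open(path) as src:
            for line_no, line in enumerate(src):
                outputs[line_no % split_count].write(line)
    finally:
        for out in outputs:
            out.close()
    return [shard_path(path, i) for i in range(split_count)]


def reader(path, trainer_id):
    """Yield the samples of the shard assigned to this trainer (assumed mapping)."""
    with open(shard_path(path, trainer_id)) as shard:
        for line in shard:
            yield line.rstrip("\n")
```

With this layout, trainer 0 would read `train.txt-00000`, trainer 1 would read `train.txt-00001`, and so on. A real job would replace the line-based split with whatever unit its data format requires.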
### Prepare Training program ## Prepare Training program
We'll create a *workspace* directory on each node, storing your training program, its dependencies, and the mounted or downloaded dataset directory. We'll create a *workspace* directory on each node, storing your training program, its dependencies, and the mounted or downloaded dataset directory.
Your workspace may look like: Your workspace may look like:
```
```bash
. .
|-- my_lib.py |-- my_lib.py
|-- word_dict.pickle |-- word_dict.pickle
...@@ -138,3 +141,21 @@ Your workspace may looks like: ...@@ -138,3 +141,21 @@ Your workspace may looks like:
- `train_data_dir`: contains training data. Mount it from a storage service or copy the training data here. - `train_data_dir`: contains training data. Mount it from a storage service or copy the training data here.
- `test_data_dir`: contains testing data. - `test_data_dir`: contains testing data.
## Async SGD Update
We can set some parameters of the optimizer to enable async SGD updates.
For example, we can set the `is_async` and `async_lagged_grad_discard_ratio` of the `AdaGrad` optimizer:
```python
adagrad = paddle.optimizer.AdaGrad(
is_async=True,
async_lagged_grad_discard_ratio=1.6,
learning_rate=3e-3,
regularization=paddle.optimizer.L2Regularization(8e-4))
```
- `is_async`: whether to use async SGD updates.
- `async_lagged_grad_discard_ratio`: controls async SGD gradient commits.
when `async_lagged_grad_discard_ratio * num_gradient_servers` commits have passed,
the current async gradient is silently discarded.
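The discard rule can be sketched with a toy model. This is illustrative only and is not PaddlePaddle's parameter server code: the class `AsyncServerModel` and its methods are hypothetical, and it assumes that "commits passed" means the number of gradient commits the server has applied since the trainer last pulled parameters, with a strict `>` comparison against the threshold.

```python
class AsyncServerModel(object):
    """Toy model of a parameter server that drops lagged async gradients."""

    def __init__(self, num_gradient_servers,
                 async_lagged_grad_discard_ratio=1.5):
        # Threshold from the text: ratio * number of gradient servers.
        self.threshold = async_lagged_grad_discard_ratio * num_gradient_servers
        self.commits = 0      # gradient commits applied so far
        self.pulled_at = {}   # trainer_id -> commit count at its last pull
        self.discarded = 0    # number of gradients dropped as too stale

    def pull(self, trainer_id):
        """Record that the trainer fetched the current parameters."""
        self.pulled_at[trainer_id] = self.commits

    def push(self, trainer_id):
        """Try to commit a gradient; return True if applied, False if discarded."""
        lag = self.commits - self.pulled_at.get(trainer_id, 0)
        if lag > self.threshold:
            self.discarded += 1
            return False
        self.commits += 1
        return True
```

With `num_gradient_servers=2` and a ratio of `1.5`, the threshold is 3 commits: a trainer that pulled parameters and then fell 4 commits behind has its next gradient dropped, and its gradients are accepted again after it pulls fresh parameters.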
...@@ -361,6 +361,7 @@ def settings(batch_size, ...@@ -361,6 +361,7 @@ def settings(batch_size,
learning_rate_decay_b=0., learning_rate_decay_b=0.,
learning_rate_schedule='poly', learning_rate_schedule='poly',
learning_rate_args='', learning_rate_args='',
async_lagged_grad_discard_ratio=1.5,
learning_method=None, learning_method=None,
regularization=None, regularization=None,
is_async=False, is_async=False,
...@@ -396,6 +397,10 @@ def settings(batch_size, ...@@ -396,6 +397,10 @@ def settings(batch_size,
value larger than some value, will be value larger than some value, will be
clipped. clipped.
:type gradient_clipping_threshold: float :type gradient_clipping_threshold: float
:param async_lagged_grad_discard_ratio: async SGD gradient commit control,
when async_lagged_grad_discard_ratio * num_gradient_servers commits have passed,
the current async SGD gradient is discarded.
:type async_lagged_grad_discard_ratio: float
""" """
if isinstance(regularization, BaseRegularization): if isinstance(regularization, BaseRegularization):
regularization = [regularization] regularization = [regularization]
...@@ -409,7 +414,7 @@ def settings(batch_size, ...@@ -409,7 +414,7 @@ def settings(batch_size,
args = [ args = [
'batch_size', 'learning_rate', 'learning_rate_decay_a', 'batch_size', 'learning_rate', 'learning_rate_decay_a',
'learning_rate_decay_b', 'learning_rate_schedule', 'learning_rate_args', 'learning_rate_decay_b', 'learning_rate_schedule', 'learning_rate_args',
'gradient_clipping_threshold' 'gradient_clipping_threshold', 'async_lagged_grad_discard_ratio'
] ]
kwargs = dict() kwargs = dict()
kwargs['algorithm'] = algorithm kwargs['algorithm'] = algorithm
......