- ports_num: **required, default 1**, total number of ports the parameter server will listen on.
- ports_num_for_sparse: **required, default 0**, number of ports that serve sparse parameter updates.
- num_gradient_servers: **required, default 1**, total number of gradient servers.
- nics: **optional, default xgbe0,xgbe1**, name of the network device the parameter server will listen on.
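Putting the parameters above together, a start command could look like the sketch below. This is only an illustration: the `--port` value and the exact flag spellings are assumptions, so check the flags supported by the `paddle pserver` binary in your installation.

```bash
# Sketch only: start one parameter server process; adjust the values for your cluster.
paddle pserver --port=7164 \
               --ports_num=1 \
               --ports_num_for_sparse=1 \
               --num_gradient_servers=1 \
               --nics=xgbe0
```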
## Starting trainer
Type the command below to start the trainer (you can name the file whatever you want, e.g. "train.py"):
```bash
python train.py
```
- trainer_id: **required, default 0**, ID of each trainer, starting from 0.
- pservers: **required, default 127.0.0.1**, list of parameter server IPs, separated by ",".
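For illustration only, the sketch below shows one way to pass these values to `paddle.init()` inside `train.py`. The keyword names mirror the parameter names in this guide, but treat the exact signature as an assumption and verify it against your PaddlePaddle version.

```python
import paddle.v2 as paddle

# Sketch: initialize this trainer process for the cluster job.
# trainer_id must be unique per trainer; pservers lists the parameter
# server IPs separated by ",". Other flags described in this guide
# (ports_num, num_gradient_servers, ...) can be passed the same way.
paddle.init(
    use_gpu=False,        # assumed common flag, not described in this section
    trainer_count=1,      # assumed common flag, not described in this section
    trainer_id=0,
    pservers="127.0.0.1")
```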
## Prepare Training Dataset
Here's some example code, [prepare.py](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/src/word2vec/prepare.py), which downloads the public `imikolov` dataset and splits it into multiple files according to the job parallelism (the trainer count). Modify `SPLIT_COUNT` at the beginning of `prepare.py` to change the number of output files.
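The linked `prepare.py` is the authoritative version. The sketch below only illustrates the idea of writing samples round-robin into `SPLIT_COUNT` files; the `paddle.dataset.imikolov` calls are assumptions about the dataset API and may not match what `prepare.py` actually does.

```python
import paddle.v2 as paddle

SPLIT_COUNT = 3   # one output file per trainer
N = 5             # assumed n-gram size for the imikolov reader

def split(reader, prefix):
    # Distribute samples round-robin over SPLIT_COUNT files so that each
    # trainer can later read only the files whose index it owns.
    files = [open("%s-%05d" % (prefix, i), "w") for i in range(SPLIT_COUNT)]
    for idx, sample in enumerate(reader()):
        files[idx % SPLIT_COUNT].write(" ".join(map(str, sample)) + "\n")
    for f in files:
        f.close()

if __name__ == "__main__":
    word_dict = paddle.dataset.imikolov.build_dict()
    split(paddle.dataset.imikolov.train(word_dict, N), "train.txt")
    split(paddle.dataset.imikolov.test(word_dict, N), "test.txt")
```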
The example code `prepare.py` will split the training data and testing data into 3 files each, with numeric suffixes such as `-00000`, `-00001` and `-00002`:
```bash
train.txt
train.txt-00000
train.txt-00001
train.txt-00002
test.txt
test.txt-00000
test.txt-00001
test.txt-00002
```
Different training jobs may have different data formats and `reader()` functions, so developers may need to write different data preparation scripts and `reader()` functions for their jobs.
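As an illustration only, a trainer-local `reader()` can restrict itself to the split files that belong to the current trainer. The helper below is hypothetical (it is not part of this guide) and assumes the plain-text split files produced in the previous step.

```python
import glob
import os

def cluster_reader(data_dir, trainer_id, trainer_count, prefix="train.txt"):
    # Hypothetical helper: yield samples only from the split files owned by
    # this trainer, assuming files named <prefix>-00000, <prefix>-00001, ...
    def reader():
        flist = sorted(glob.glob(os.path.join(data_dir, prefix + "-*")))
        for i, fname in enumerate(flist):
            if i % trainer_count != trainer_id:
                continue  # this shard belongs to another trainer
            with open(fname) as f:
                for line in f:
                    # each line is assumed to hold space-separated integers
                    yield [int(tok) for tok in line.split()]
    return reader
```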
## Prepare Training program
We'll create a *workspace* directory on each node to store your training program, its dependencies, and the mounted or downloaded dataset directory.
Your workspace might look like this:
```bash
.
|-- my_lib.py
|-- word_dict.pickle
|-- train.py
|-- train_data_dir/
`-- test_data_dir/
```
- `train_data_dir`: contains the training data. Mount it from a storage service or copy the training data here.
- `test_data_dir`: contains the testing data.
## Async SGD Update
We can set some parameters of the optimizer to make it support async SGD updates.
For example, we can set the `is_async` and `async_lagged_grad_discard_ratio` of the `AdaGrad` optimizer:
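The snippet below is a minimal sketch of such a configuration; the argument values are illustrative assumptions, and the exact constructor signature may differ across PaddlePaddle versions.

```python
import paddle.v2 as paddle

# Sketch: enable async SGD updates on the AdaGrad optimizer.
# The values are placeholders, not recommendations from this guide.
optimizer = paddle.optimizer.AdaGrad(
    learning_rate=3e-3,
    is_async=True,
    async_lagged_grad_discard_ratio=1.6)
```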
- `is_async`: whether to use Async-SGD or not.
- `async_lagged_grad_discard_ratio`: controls async SGD gradient commits. When `async_lagged_grad_discard_ratio * num_gradient_servers` commits have passed, the current async gradient is discarded silently.