A dataset is a list of files in *RecordIO* format. A RecordIO file consists of chunks.
## Task Queue
As mentioned in the [distributed training design doc](./README.md), a *task* is a data shard that the master server assigns to a trainer process to train on. A task consists of one or multiple *chunks* from one or multiple files. The master server maintains *task queues* to track the training progress.
### Task Queue Creation
...
...
1. The master server will scan through each RecordIO file to generate the *chunk index* and learn how many chunks each file has. A chunk can be referenced by the file path and the index of the chunk within the file. The chunk index is an in-memory data structure that enables fast access to each chunk; the index of a chunk within the file is an integer starting from 0, representing the n-th chunk within the file.
The definition of the chunk is:
```go
type Chunk struct {
    Idx   int            // index of the chunk within the file
    Path  string
    Index recordio.Index // chunk index
}
```
1. Chunks are grouped into tasks, and the tasks are filled into the todo queue. The pending queue and the done queue are initialized with no elements, as sketched below.
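To make the grouping and the three queues concrete, here is a minimal illustrative sketch. The master server's actual `Chunk` definition is the Go struct above; the C++ rendering below is only an assumption-level illustration, and every name in it other than the chunk fields (`Task`, `TaskQueues`, the queue names) is hypothetical rather than taken from the implementation.
```cpp
#include <deque>
#include <string>
#include <vector>

// Mirrors the Go Chunk struct above, for illustration only.
struct Chunk {
  int idx;           // index of the chunk within the file
  std::string path;  // path of the RecordIO file
};

// A task is one or more chunks, possibly drawn from several files.
struct Task {
  int id;
  std::vector<Chunk> chunks;
};

// The three task queues maintained by the master server.
struct TaskQueues {
  std::deque<Task> todo;     // filled with all tasks when the queues are created
  std::deque<Task> pending;  // tasks dispatched to trainers, awaiting completion
  std::deque<Task> done;     // finished tasks; starts empty, like pending
};
```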
The trainer selection process is encapsulated in the C API function:
```c
int paddle_begin_init_params(paddle_pserver_client* client, const char* config_proto);
```
The selected trainer's call to `paddle_begin_init_params` will return 1, while the other trainers' calls to `paddle_begin_init_params` will return 0. `paddle_get_params` will block until initialization is completed. As illustrated below:
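As a rough illustration of the trainer-side flow this protocol implies, here is a minimal sketch. Only the `paddle_begin_init_params` declaration and the blocking behaviour of `paddle_get_params` come from this document; everything else in the sketch, including how the selected trainer pushes its randomly initialized parameters, is a placeholder rather than the actual client API.
```cpp
// Declaration taken from the C API shown above.
struct paddle_pserver_client;
extern "C" int paddle_begin_init_params(paddle_pserver_client* client,
                                        const char* config_proto);

// Hypothetical trainer-side initialization flow; not the real Paddle client code.
void initialize_parameters(paddle_pserver_client* client, const char* config_proto) {
  if (paddle_begin_init_params(client, config_proto) == 1) {
    // This trainer was selected: randomly initialize the parameters locally
    // and send them to the parameter servers (the actual calls are omitted,
    // since they are not part of this document).
  }
  // Selected or not, every trainer then fetches the initialized parameters,
  // e.g. via paddle_get_params(...), which blocks until initialization has
  // completed. Its exact signature is not shown in this document, so it is
  // only referenced in this comment.
}
```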
# Design Doc: The C++ Class `Parameters`
`Parameters` is a concept we designed in the Paddle V2 API. `Parameters` is a container of parameters that lets Paddle share parameters between topologies. We described the usage of `Parameter` in [api.md](./api.md).
We initially implemented Parameters in Python when designing the V2 API. The current implementation has several defects:
* We just use `memcpy` to share Parameters between topologies, which is very inefficient.
* We did not implement sharing Parameters during training; we only trigger `memcpy` when training starts.
It is necessary to implement Parameters on the C++ side. However, this requires some code refactoring in Paddle, because Paddle was previously designed to train only one topology, i.e., each GradientMachine contains its Parameter as a data member. In the current Paddle implementation, there are three concepts associated with `Parameters`:
1. `paddle::Parameter`. A `Parameters` is a container for `paddle::Parameter`.
It is evident that we should use `paddle::Parameter` when developing `Parameters`.
However, the `Parameter` class contains many functions and does not have a clear interface.
It contains `create/store Parameter`, `serialize/deserialize`, `optimize(i.e SGD)`, and `randomize/zero`.
When developing `Parameters`, we only use the `create/store Parameter` functionality.
We should extract the functionalities of Parameter into separate classes to clean up the Paddle C++ implementation (a possible split is sketched after this list).
2. `paddle::GradientMachine` and its sub-classes, e.g., `paddle::MultiGradientMachine`, `paddle::NeuralNetwork`.
We should pass `Parameters` to `paddle::GradientMachine` during `forward/backward` to avoid `memcpy` between topologies.
Also, we should handle multi-GPU/CPU training, because `forward` and `backward` would run on multiple GPUs and CPUs.
`Parameters` should dispatch the parameter value to each device, and gather the parameter gradient from each device.
3. `paddle::ParameterUpdater`. The ParameterUpdater is used to update parameters in Paddle.
So `Parameters` should be used by `paddle::ParameterUpdater`, and `paddle::ParameterUpdater` should optimize `Parameters` (e.g., by SGD).
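Concept 1 above suggests splitting `paddle::Parameter`'s responsibilities before building `Parameters` on top of it. The split below is purely a hypothetical sketch based on the functionality groups listed above (create/store, serialize/deserialize, optimize, randomize/zero); the document does not prescribe these interfaces or names.
```cpp
#include <cstddef>
#include <string>

// Hypothetical decomposition of paddle::Parameter's responsibilities into
// narrow interfaces, so that Parameters would only depend on the
// create/store part.
class ParameterStorage {        // create/store Parameter
 public:
  virtual void resize(size_t size) = 0;
  virtual size_t size() const = 0;
  virtual ~ParameterStorage() = default;
};

class ParameterSerializer {     // serialize/deserialize
 public:
  virtual std::string serialize() const = 0;
  virtual void deserialize(const std::string& buf) = 0;
  virtual ~ParameterSerializer() = default;
};

class ParameterOptimizer {      // optimize (e.g., SGD)
 public:
  virtual void update(float learning_rate) = 0;
  virtual ~ParameterOptimizer() = default;
};

class ParameterInitializer {    // randomize/zero
 public:
  virtual void randomize() = 0;
  virtual void zero() = 0;
  virtual ~ParameterInitializer() = default;
};
```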
The step-by-step approach for implementing Parameters in the Paddle C++ core is listed below. Each step should be a separate PR that can be merged into Paddle one by one.
1. Clean up the `paddle::Parameter` interface. Extract the functionalities of `paddle::Parameter` to prepare for the implementation of Parameters.
2. Implement a `Parameters` class. It just stores the `paddle::Parameter` objects inside. Make `GradientMachine` use `Parameters` as a class member (see the sketch after this list).
3. Make `Parameters` support multi-CPU and multi-GPU training to prepare for sharing `Parameter` between topologies.
Because we need to share `Parameters` between topologies, it is `Parameters`'s responsibility to exchange Parameters between GPUs.
`GradientMachine` should not handle how to exchange Parameters, because a `GradientMachine` only trains one topology and we need to support training many topologies in Paddle, i.e., many GradientMachines could use one `Parameters` object.
* We should use a global function to exchange Parameters between GPUs, not a member function of `Parameters`. The `MultiGradientMachine` invokes this function, which takes `Parameters` as its input.
* The MultiGradientMachine contains many functionalities. Extracting the Parameters-exchanging logic would make MultiGradientMachine clearer and simpler.
4. Make `Parameters` an argument of the `forward/backward` functions rather than a data member of `GradientMachine`. For example, `forward` could be `forward(const Parameters& params, ...)` and `backward` could be `backward(Parameters* params, ...)`. After this step, Paddle can share `Parameters` between topologies.
5. `ParameterUpdater` is invoked by `GradientMachine` and `Trainer`, but it updates `Parameters`. At the end of this refactoring, we could change `ParameterUpdater` to use `Parameters` directly, making `ParameterUpdater`'s implementation clearer.
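To make steps 2 and 4 concrete, here is a minimal sketch of what the `Parameters` container and the new `forward/backward` signatures could look like. This is an assumption-level illustration: apart from `paddle::Parameter`, `GradientMachine`, and the `forward(const Parameters& params, ...)` / `backward(Parameters* params, ...)` signatures mentioned above, every name and detail below is hypothetical.
```cpp
#include <memory>
#include <string>
#include <unordered_map>

namespace paddle {

class Parameter {};  // stand-in for the existing paddle::Parameter class

// Step 2: a thin container that stores paddle::Parameter objects by name.
// shared_ptr lets several topologies reference the same parameter without memcpy.
class Parameters {
 public:
  void add(const std::string& name, std::shared_ptr<Parameter> param) {
    params_[name] = std::move(param);
  }

  std::shared_ptr<Parameter> get(const std::string& name) const {
    auto it = params_.find(name);
    return it == params_.end() ? nullptr : it->second;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<Parameter>> params_;
};

// Step 4: forward/backward take Parameters as arguments instead of reading a
// GradientMachine data member, so several GradientMachines (topologies) can
// share one Parameters object. This only sketches the proposed signatures,
// not the real GradientMachine interface.
class GradientMachine {
 public:
  virtual void forward(const Parameters& params /*, inputs, outputs, ... */) = 0;
  virtual void backward(Parameters* params /*, update callback, ... */) = 0;
  virtual ~GradientMachine() = default;
};

}  // namespace paddle
```
Whether the container keys parameters by name or by index, and whether it owns them or only references them, are design choices this document leaves open; the sketch only shows the sharing-by-argument shape of steps 2 and 4.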
sequence. It calculates precision, recall and F1 scores for the chunk detection.</p>
<p>To use the chunk evaluator, several concepts need to be clarified first.</p>
<ulclass="simple">
<li><strong>Chunk type</strong> is the type of the whole chunk and a chunk consists of one or several words. (For example in NER, ORG for organization name, PER for person name etc.)</li>
<li><strong>Tag type</strong> indicates the position of a word in a chunk. (B for begin, I for inside, E for end, S for single)</li>
</ul>
<p>We can name a label by combining tag type and chunk type. (i.e. B-ORG for the beginning of an organization name)</p>
<p>The construction of the label dictionary should obey the following rules:</p>
<ulclass="simple">
<li>Use one of the listed labelling schemes. These schemes differ in the way they indicate chunk boundaries.</li>
currently supports rectangular filters, the filter’s
shape will be (filter_size, filter_size_y).</li>
<li><strong>num_filters</strong>– The number of filters in each group.</li>
<li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) – Activation type. Default is tanh</li>
<li><strong>groups</strong> (<em>int</em>) – Group size of filters.</li>
<li><strong>stride</strong> (<em>int|tuple|list</em>) – The x dimension of the stride. Or input a tuple for two image
dimensions.</li>
...
...
<spanid="api-v2-layer-context-projection"></span><h3>context_projection<aclass="headerlink"href="#context-projection"title="Permalink to this headline">¶</a></h3>
<trclass="field-even field"><thclass="field-name">Returns:</th><tdclass="field-body"><pclass="first">LayerOutput object which is a memory.</p>
<trclass="field-even field"><thclass="field-name">Returns:</th><tdclass="field-body"><pclass="first">paddle.v2.config_base.Layer object which is a memory.</p>
<p>paddle.v2.config_base.Layer will be scattered into time steps.
SubsequenceInput will be scattered into sequence steps.
StaticInput will be imported to each time step, and doesn’t change
through time. It’s a mechanism to access layer outside step function.</p>
</li>
<li><strong>reverse</strong> (<em>bool</em>) – If reverse is set true, the recurrent unit will process the
input sequence in a reverse order.</li>
<li><strong>targetInlink</strong> (<em>paddle.v2.config_base.Layer|SubsequenceInput</em>) –<p>the input layer which share info with layer group’s output</p>
<p>Param input specifies multiple input layers. For
SubsequenceInput inputs, config should assign one input
layer that share info(the number of sentences and the number
of words in each sentence) with all layer group’s outputs.
targetInlink should be one of the layer group’s input.</p>
</li>
<li><strong>is_generating</strong>– If is generating, none of input type should be paddle.v2.config_base.Layer;
else, for training or testing, one of the input type must
<li><strong>bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em><em> or </em><em>bool</em>) – The Bias Attribute. If no bias, then pass False or
something not type of paddle.v2.attr.ParameterAttribute. None will get a
default Bias.</li>
<li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – The extra layer config. Default is None.</li>
</ul>
</td>
</tr>
...
...
<spanid="api-v2-layer-embedding"></span><h3>embedding<aclass="headerlink"href="#embedding"title="Permalink to this headline">¶</a></h3>
<li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – Extra Layer Attribute.</li>
<li><strong>bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em><em> or </em><em>bool</em>) – The Bias Attribute. If no bias, then pass False or
something not type of paddle.v2.attr.ParameterAttribute. None will get a
...
...
<h3>block_expand<aclass="headerlink"href="#block-expand"title="Permalink to this headline">¶</a></h3>
<li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – extra layer attributes.</li>
<li><strong>bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em><em> or </em><em>bool</em>) – The Bias Attribute. If no bias, then pass False or
something not type of paddle.v2.attr.ParameterAttribute. None will get a
...
...
<h3>addto<aclass="headerlink"href="#addto"title="Permalink to this headline">¶</a></h3>
<li><strong>input</strong> (<em>paddle.v2.config_base.Layer|list|tuple</em>) – Input layers. It could be a paddle.v2.config_base.Layer or list/tuple of
paddle.v2.config_base.Layer.</li>
<li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) – Activation Type, default is tanh.</li>
<li><strong>bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute|bool</em>) – Bias attribute. If False, means no bias. None is default
bias.</li>
<li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – Extra Layer attribute.</li>
...
...
<h3>linear_comb<aclass="headerlink"href="#linear-comb"title="Permalink to this headline">¶</a></h3>
Input should be a vector of positive numbers, without normalization.</p>
<h3>multi_binary_label_cross_entropy_cost<aclass="headerlink"href="#multi-binary-label-cross-entropy-cost"title="Permalink to this headline">¶</a></h3>
<li><strong>weight</strong> (<em>paddle.v2.config_base.Layer</em>) – weight layer, can be None(default)</li>
<li><strong>num_classes</strong> (<em>int</em>) – number of classes.</li>
<li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) – Activation, default is Sigmoid.</li>
<li><strong>param_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em>) – The Parameter Attribute|list.</li>
<li><strong>num_neg_samples</strong> (<em>int</em>) – number of negative samples. Default is 10.</li>
<li><strong>neg_distribution</strong> (<em>list|tuple|collections.Sequence|None</em>) – The distribution for generating the random negative labels.
...
...
If not None, its length must be equal to num_classes.</li>
<h3>hsigmoid<aclass="headerlink"href="#hsigmoid"title="Permalink to this headline">¶</a></h3>
<li><strong>context_proj_param_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None.</em>) – context projection parameter attribute.
None if user don’t care.</li>
<li><strong>fc_layer_name</strong> (<em>basestring</em>) – fc layer name. None if user don’t care.</li>
<li><strong>fc_param_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em>) – fc layer parameter attribute. None if user don’t care.</li>
<li><strong>fc_bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em>) – fc bias parameter attribute. False if no bias,
None if user don’t care.</li>
<li><strong>fc_act</strong> (<em>BaseActivation</em>) – fc layer activation type. None means tanh</li>
<li><strong>pool_bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None.</em>) – pooling layer bias attr. None if don’t care.
False if no bias.</li>
<li><strong>fc_attr</strong> (<em>ExtraLayerAttribute</em>) – fc layer extra attribute.</li>
<li><strong>context_attr</strong> (<em>ExtraLayerAttribute</em>) – context projection layer extra attribute.</li>
<li><strong>pool_attr</strong> (<em>ExtraLayerAttribute</em>) – pooling layer extra attribute.</li>
<spanid="api-trainer-config-helpers-network-text-conv-pool"></span><h3>text_conv_pool<aclass="headerlink"href="#text-conv-pool"title="Permalink to this headline">¶</a></h3>
<li><strong>context_proj_param_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None.</em>) – context projection parameter attribute.
None if user don’t care.</li>
<li><strong>fc_layer_name</strong> (<em>basestring</em>) – fc layer name. None if user don’t care.</li>
<li><strong>fc_param_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em>) – fc layer parameter attribute. None if user don’t care.</li>
<li><strong>fc_bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em>) – fc bias parameter attribute. False if no bias,
None if user don’t care.</li>
<li><strong>fc_act</strong> (<em>BaseActivation</em>) – fc layer activation type. None means tanh</li>
<li><strong>pool_bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None.</em>) – pooling layer bias attr. None if don’t care.
False if no bias.</li>
<li><strong>fc_attr</strong> (<em>ExtraLayerAttribute</em>) – fc layer extra attribute.</li>
<li><strong>context_attr</strong> (<em>ExtraLayerAttribute</em>) – context projection layer extra attribute.</li>
<li><strong>pool_attr</strong> (<em>ExtraLayerAttribute</em>) – pooling layer extra attribute.</li>
<spanid="api-trainer-config-helpers-network-simple-img-conv-pool"></span><h3>simple_img_conv_pool<aclass="headerlink"href="#simple-img-conv-pool"title="Permalink to this headline">¶</a></h3>
<dd><p>Same model from <aclass="reference external"href="https://gist.github.com/ksimonyan/211839e770f7b538e2d8">https://gist.github.com/ksimonyan/211839e770f7b538e2d8</a></p>
<li><strong>return_seq</strong> (<em>bool</em>) – If set False, outputs of the last time step are
concatenated and returned.
...
...
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">Returns:</th><tdclass="field-body"><pclass="first">paddle.v2.config_base.Layer object accroding to the return_seq.</p>
<trclass="field-even field"><thclass="field-name">Returns:</th><tdclass="field-body"><pclass="first">LayerOutput object accroding to the return_seq.</p>
<spanid="task-queue"></span><h2>Task Queue<aclass="headerlink"href="#task-queue"title="Permalink to this headline">¶</a></h2>
<p>As mentioned in <aclass="reference internal"href="README.html"><spanclass="doc">distributed training design doc</span></a>, a <em>task</em> is a data shard that the master server assigns to the trainer process to train on. A task consists of one or multiple <em>blocks</em> from one or multiple files. The master server maintains <em>task queues</em> to track the training progress.</p>
<p>As mentioned in <aclass="reference internal"href="README.html"><spanclass="doc">distributed training design doc</span></a>, a <em>task</em> is a data shard that the master server assigns to the trainer process to train on. A task consists of one or multiple <em>chunks</em> from one or multiple files. The master server maintains <em>task queues</em> to track the training progress.</p>
<divclass="section"id="task-queue-creation">
<spanid="task-queue-creation"></span><h3>Task Queue Creation<aclass="headerlink"href="#task-queue-creation"title="Permalink to this headline">¶</a></h3>
<ol>
...
...
@@ -197,21 +197,21 @@
</pre></div>
</div>
</li>
<li><pclass="first">The master server will scan through each RecordIO file to generate the <em>block index</em> and know how many blocks does each file have. A block can be referenced by the file path and the index of the block within the file. The block index is in memory data structure that enables fast access to each block, and the index of the block with the file is an integer start from 0, representing the n-th block within the file.</p>
<spanclass="nx">Idx</span><spanclass="kt">int</span><spanclass="c1">// index of the block within the file</span>
<li><pclass="first">The master server will scan through each RecordIO file to generate the <em>chunk index</em> and know how many chunks does each file have. A chunk can be referenced by the file path and the index of the chunk within the file. The chunk index is in memory data structure that enables fast access to each chunk, and the index of the chunk with the file is an integer start from 0, representing the n-th chunk within the file.</p>
<li><pclass="first">Blocks are grouped into tasks, and tasks are filled into the todo queue. The pending queue and the done queue are initialized with no element.</p>
<li><pclass="first">Chunks are grouped into tasks, and tasks are filled into the todo queue. The pending queue and the done queue are initialized with no element.</p>
<p>The selected trainer’s call to <codeclass="docutils literal"><spanclass="pre">paddle_begin_init_params</span></code> will return with 1, and the other trainers’ call to <codeclass="docutils literal"><spanclass="pre">paddle_begin_init_params</span></code> will block until initialization is done, and return 0. As illustrated below:</p>
<p>The selected trainer’s call to <codeclass="docutils literal"><spanclass="pre">paddle_begin_init_params</span></code> will return with 1, and the other trainers’ call to <codeclass="docutils literal"><spanclass="pre">paddle_begin_init_params</span></code> will return 0. <codeclass="docutils literal"><spanclass="pre">paddle_get_params</span></code> will be blocked until initialization is completed. As illustrated below:</p>
<p><imgsrc="./src/pserver_init.png"></p>
</div>
</div>
...
...
<spanclass="cm"> *</span>
<spanclass="cm"> * paddle_begin_init_params will be called from multiple trainers,</span>
<spanclass="cm"> * only one trainer will be selected to initialize the parameters on</span>
<spanclass="cm"> * parameter servers. Other trainers will be blocked until the</span>
<spanclass="cm"> * initialization is done, and they need to get the initialized</span>
<spanclass="cm"> * parameter servers. Other trainers need to get the initialized</span>
<spanclass="cm"> * parameters from parameter servers using @paddle_get_params.</span>
<spanclass="cm"> *</span>
<spanclass="cm"> * @param pserver_config_proto serialized parameter server configuration in</span>
<liclass="toctree-l2"><aclass="reference internal"href="../getstarted/build_and_install/index_en.html">Install and Build</a><ul>
<liclass="toctree-l3"><aclass="reference internal"href="../getstarted/build_and_install/docker_install_en.html">PaddlePaddle in Docker Containers</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/usage/k8s/k8s_en.html">Paddle On Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/usage/k8s/k8s_aws_en.html">Distributed PaddlePaddle Training on AWS with Kubernetes</a></li>
<liclass="toctree-l2"><aclass="reference internal"href="../howto/dev/new_layer_en.html">Write New Layers</a></li>
<spanid="design-doc-the-c-class-parameters"></span><h1>Design Doc: The C++ Class <codeclass="docutils literal"><spanclass="pre">Parameters</span></code><aclass="headerlink"href="#design-doc-the-c-class-parameters"title="Permalink to this headline">¶</a></h1>
<p><codeclass="docutils literal"><spanclass="pre">Parameters</span></code> is a concept we designed in Paddle V2 API. <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> is a container of parameters, and make Paddle can shared parameter between topologies. We described usages of <codeclass="docutils literal"><spanclass="pre">Parameter</span></code> in <aclass="reference internal"href="api.html"><spanclass="doc">api.md</span></a>.</p>
<p>We used Python to implement Parameters when designing V2 API before. There are several defects for current implementation:</p>
<ulclass="simple">
<li>We just use <codeclass="docutils literal"><spanclass="pre">memcpy</span></code> to share Parameters between topologies, but this is very inefficient.</li>
<li>We did not implement share Parameters while training. We just trigger <codeclass="docutils literal"><spanclass="pre">memcpy</span></code> when start training.</li>
</ul>
<p>It is necessary that we implement Parameters in CPP side. However, it could be a code refactoring for Paddle, because Paddle was designed for training only one topology before, i.e., each GradientMachine contains its Parameter as a data member. In current Paddle implementation, there are three concepts associated with <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>:</p>
<olclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code>. A <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> is a container for <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code>.
It is evident that we should use <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code> when developing <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>.
However, the <codeclass="docutils literal"><spanclass="pre">Parameter</span></code> class contains many functions and does not have a clear interface.
It contains <codeclass="docutils literal"><spanclass="pre">create/store</span><spanclass="pre">Parameter</span></code>, <codeclass="docutils literal"><spanclass="pre">serialize/deserialize</span></code>, <codeclass="docutils literal"><spanclass="pre">optimize(i.e</span><spanclass="pre">SGD)</span></code>, <codeclass="docutils literal"><spanclass="pre">randomize/zero</span></code>.
When we developing <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>, we only use <codeclass="docutils literal"><spanclass="pre">create/store</span><spanclass="pre">Parameter</span></code> functionality.
We should extract functionalities of Parameter into many classes to clean Paddle CPP implementation.</li>
<li><codeclass="docutils literal"><spanclass="pre">paddle::GradientMachine</span></code> and its sub-classes, e.g., <codeclass="docutils literal"><spanclass="pre">paddle::MultiGradientMachine</span></code>, <codeclass="docutils literal"><spanclass="pre">paddle::NeuralNetwork</span></code>.
We should pass <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> to <codeclass="docutils literal"><spanclass="pre">paddle::GradientMachine</span></code> when <codeclass="docutils literal"><spanclass="pre">forward/backward</span></code> to avoid <codeclass="docutils literal"><spanclass="pre">memcpy</span></code> between topologies.
Also, we should handle multi-GPU/CPU training, because <codeclass="docutils literal"><spanclass="pre">forward</span></code> and <codeclass="docutils literal"><spanclass="pre">backward</span></code> would perform on multi-GPUs and multi-CPUs.
<codeclass="docutils literal"><spanclass="pre">Parameters</span></code> should dispatch the parameter value to each device, and gather the parameter gradient from each device.</li>
<li><codeclass="docutils literal"><spanclass="pre">paddle::ParameterUpdater</span></code>. The ParameterUpdater is used to update parameters in Paddle.
So <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> should be used by <codeclass="docutils literal"><spanclass="pre">paddle::ParameterUpdater</span></code>, and <codeclass="docutils literal"><spanclass="pre">paddle::ParameterUpdater</span></code> should optimize <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> (by SGD).</li>
</ol>
<p>The step by step approach for implementation Parameters in Paddle C++ core is listed below. Each step should be a PR and could be merged into Paddle one by one.</p>
<olclass="simple">
<li>Clean <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code> interface. Extract the functionalities of <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code> to prepare for the implementation of Parameters.</li>
<li>Implementation a <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> class. It just stores the <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code> inside. Make <codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code> uses <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> as a class member.</li>
<li>Make <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> support Multi-CPU and Multi-GPU training to prepare for sharing <codeclass="docutils literal"><spanclass="pre">Parameter</span></code> between topologies.
Because we need share <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> between topologies, it is <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>‘s response to exchange Parameters between GPUs.
<codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code> should not handle how to exchange Parameters because <codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code> only used to train one topology and we need to support train many topologies in Paddle, i.e., there could be many GradientMachines use one <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>.<ul>
<li>We should use a global function to exchange Parameters between GPUs, not a member function in <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>. The <codeclass="docutils literal"><spanclass="pre">MultiGradientMachine</span></code> invoke this function, which uses <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> as this function inputs.</li>
<li>The MultiGradientMachine contains many functionalities. Extracting the Parameters exchanging logic could make MultiGradientMachine clearer and simpler.</li>
</ul>
</li>
<li>Make <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> as an argument for <codeclass="docutils literal"><spanclass="pre">forward/backward</span></code> function, not a data member for <codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code>. For example, <codeclass="docutils literal"><spanclass="pre">forward</span></code> could be <codeclass="docutils literal"><spanclass="pre">forward(const</span><spanclass="pre">Parameters&</span><spanclass="pre">params,</span><spanclass="pre">...)</span></code> and <codeclass="docutils literal"><spanclass="pre">backward</span></code> could be <codeclass="docutils literal"><spanclass="pre">backward(Parameters*</span><spanclass="pre">params,</span><spanclass="pre">...)</span></code>. After this step, Paddle could share <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> between topologies.</li>
<li><codeclass="docutils literal"><spanclass="pre">ParameterUpdater</span></code> is invoked by <codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code> and <codeclass="docutils literal"><spanclass="pre">Trainer</span></code>, but it updates <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>. In the end of this code refactoring, we could change <codeclass="docutils literal"><spanclass="pre">ParameterUpdater</span></code> directly uses <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> to make <codeclass="docutils literal"><spanclass="pre">ParameterUpdater</span></code>‘s implementation clear.</li>
<trclass="field-even field"><thclass="field-name">返回:</th><tdclass="field-body"><pclass="first">LayerOutput object which is a memory.</p>
<trclass="field-even field"><thclass="field-name">返回:</th><tdclass="field-body"><pclass="first">paddle.v2.config_base.Layer object which is a memory.</p>
<p>paddle.v2.config_base.Layer will be scattered into time steps.
SubsequenceInput will be scattered into sequence steps.
StaticInput will be imported to each time step, and doesn’t change
through time. It’s a mechanism to access layer outside step function.</p>
</li>
<li><strong>reverse</strong> (<em>bool</em>) – If reverse is set true, the recurrent unit will process the
input sequence in a reverse order.</li>
<li><strong>targetInlink</strong> (<em>LayerOutput|SubsequenceInput</em>) –<p>the input layer which share info with layer group’s output</p>
<li><strong>targetInlink</strong> (<em>paddle.v2.config_base.Layer|SubsequenceInput</em>) –<p>the input layer which share info with layer group’s output</p>
<p>Param input specifies multiple input layers. For
SubsequenceInput inputs, config should assign one input
layer that share info(the number of sentences and the number
of words in each sentence) with all layer group’s outputs.
targetInlink should be one of the layer group’s input.</p>
</li>
<li><strong>is_generating</strong>– If is generating, none of input type should be LayerOutput;
<li><strong>is_generating</strong>– If is generating, none of input type should be paddle.v2.config_base.Layer;
else, for training or testing, one of the input type must
<li><strong>bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em><em> or </em><em>bool</em>) – The Bias Attribute. If no bias, then pass False or
something not type of ParameterAttribute. None will get a
<li><strong>bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em><em> or </em><em>bool</em>) – The Bias Attribute. If no bias, then pass False or
something not type of paddle.v2.attr.ParameterAttribute. None will get a
default Bias.</li>
<li><strong>layer_attr</strong> (<em>ExtraLayerAttribute</em>) – The extra layer config. Default is None.</li>
<li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – The extra layer config. Default is None.</li>
<li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – Extra Layer Attribute.</li>
<li><strong>bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em><em> or </em><em>bool</em>) – The Bias Attribute. If no bias, then pass False or
something not type of paddle.v2.attr.ParameterAttribute. None will get a
<li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – extra layer attributes.</li>
<li><strong>bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em><em> or </em><em>bool</em>) – The Bias Attribute. If no bias, then pass False or
something not type of paddle.v2.attr.ParameterAttribute. None will get a
<li><strong>input</strong> (<em>paddle.v2.config_base.Layer|list|tuple</em>) – Input layers. It could be a paddle.v2.config_base.Layer or list/tuple of
paddle.v2.config_base.Layer.</li>
<li><strong>act</strong> (<em>paddle.v2.Activation.Base</em>) – Activation Type, default is tanh.</li>
<li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) – Activation Type, default is tanh.</li>
<li><strong>bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute|bool</em>) – Bias attribute. If False, means no bias. None is default
bias.</li>
<li><strong>layer_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – Extra Layer attribute.</li>
<li><strong>weight</strong> (<em>paddle.v2.config_base.Layer</em>) – weight layer, can be None(default)</li>
<li><strong>num_classes</strong> (<em>int</em>) – number of classes.</li>
<li><strong>act</strong> (<em>paddle.v2.Activation.Base</em>) – Activation, default is Sigmoid.</li>
<li><strong>act</strong> (<em>paddle.v2.activation.Base</em>) – Activation, default is Sigmoid.</li>
<li><strong>param_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em>) – The Parameter Attribute|list.</li>
<li><strong>num_neg_samples</strong> (<em>int</em>) – number of negative samples. Default is 10.</li>
<li><strong>neg_distribution</strong> (<em>list|tuple|collections.Sequence|None</em>) – The distribution for generating the random negative labels.
...
...
@@ -3400,7 +3375,7 @@ If not None, its length must be equal to num_classes.</li>
<li><strong>context_proj_param_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None.</em>) – context projection parameter attribute.
<li><strong>context_proj_param_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None.</em>) – context projection parameter attribute.
None if user don’t care.</li>
<li><strong>fc_name</strong> (<em>basestring</em>) – fc layer name. None if user don’t care.</li>
<li><strong>fc_param_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em>) – fc layer parameter attribute. None if user don’t care.</li>
<li><strong>fc_bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em>) – fc bias parameter attribute. False if no bias,
<li><strong>fc_layer_name</strong> (<em>basestring</em>) – fc layer name. None if user don’t care.</li>
<li><strong>fc_param_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em>) – fc layer parameter attribute. None if user don’t care.</li>
<li><strong>fc_bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em>) – fc bias parameter attribute. False if no bias,
None if user don’t care.</li>
<li><strong>fc_act</strong> (<em>paddle.v2.Activation.Base</em>) – fc layer activation type. None means tanh</li>
<li><strong>pool_bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None.</em>) – pooling layer bias attr. None if don’t care.
<li><strong>fc_act</strong> (<em>BaseActivation</em>) – fc layer activation type. None means tanh</li>
<li><strong>pool_bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None.</em>) – pooling layer bias attr. None if don’t care.
False if no bias.</li>
<li><strong>fc_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – fc layer extra attribute.</li>
<li><strong>context_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – context projection layer extra attribute.</li>
<li><strong>pool_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – pooling layer extra attribute.</li>
<li><strong>fc_attr</strong> (<em>ExtraLayerAttribute</em>) – fc layer extra attribute.</li>
<li><strong>context_attr</strong> (<em>ExtraLayerAttribute</em>) – context projection layer extra attribute.</li>
<li><strong>pool_attr</strong> (<em>ExtraLayerAttribute</em>) – pooling layer extra attribute.</li>
<li><strong>context_proj_param_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None.</em>) – context projection parameter attribute.
<li><strong>context_proj_param_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None.</em>) – context projection parameter attribute.
None if user don’t care.</li>
<li><strong>fc_name</strong> (<em>basestring</em>) – fc layer name. None if user don’t care.</li>
<li><strong>fc_param_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em>) – fc layer parameter attribute. None if user don’t care.</li>
<li><strong>fc_bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None</em>) – fc bias parameter attribute. False if no bias,
<li><strong>fc_layer_name</strong> (<em>basestring</em>) – fc layer name. None if user don’t care.</li>
<li><strong>fc_param_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em>) – fc layer parameter attribute. None if user don’t care.</li>
<li><strong>fc_bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None</em>) – fc bias parameter attribute. False if no bias,
None if user don’t care.</li>
<li><strong>fc_act</strong> (<em>paddle.v2.Activation.Base</em>) – fc layer activation type. None means tanh</li>
<li><strong>pool_bias_attr</strong> (<em>paddle.v2.attr.ParameterAttribute</em><em> or </em><em>None.</em>) – pooling layer bias attr. None if don’t care.
<li><strong>fc_act</strong> (<em>BaseActivation</em>) – fc layer activation type. None means tanh</li>
<li><strong>pool_bias_attr</strong> (<em>ParameterAttribute</em><em> or </em><em>None.</em>) – pooling layer bias attr. None if don’t care.
False if no bias.</li>
<li><strong>fc_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – fc layer extra attribute.</li>
<li><strong>context_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – context projection layer extra attribute.</li>
<li><strong>pool_attr</strong> (<em>paddle.v2.attr.ExtraAttribute</em>) – pooling layer extra attribute.</li>
<li><strong>fc_attr</strong> (<em>ExtraLayerAttribute</em>) – fc layer extra attribute.</li>
<li><strong>context_attr</strong> (<em>ExtraLayerAttribute</em>) – context projection layer extra attribute.</li>
<li><strong>pool_attr</strong> (<em>ExtraLayerAttribute</em>) – pooling layer extra attribute.</li>
<dd><p>Same model from <aclass="reference external"href="https://gist.github.com/ksimonyan/211839e770f7b538e2d8">https://gist.github.com/ksimonyan/211839e770f7b538e2d8</a></p>
<li><strong>return_seq</strong> (<em>bool</em>) – If set False, outputs of the last time step are
concatenated and returned.
...
...
@@ -646,10 +646,10 @@ concatenated and returned.</li>
</ul>
</td>
</tr>
<trclass="field-even field"><thclass="field-name">返回:</th><tdclass="field-body"><pclass="first">paddle.v2.config_base.Layer object accroding to the return_seq.</p>
<trclass="field-even field"><thclass="field-name">返回:</th><tdclass="field-body"><pclass="first">LayerOutput object accroding to the return_seq.</p>
<p>As mentioned in <aclass="reference internal"href="README.html"><spanclass="doc">distributed training design doc</span></a>, a <em>task</em> is a data shard that the master server assigns to the trainer process to train on. A task consists of one or multiple <em>blocks</em> from one or multiple files. The master server maintains <em>task queues</em> to track the training progress.</p>
<p>As mentioned in <aclass="reference internal"href="README.html"><spanclass="doc">distributed training design doc</span></a>, a <em>task</em> is a data shard that the master server assigns to the trainer process to train on. A task consists of one or multiple <em>chunks</em> from one or multiple files. The master server maintains <em>task queues</em> to track the training progress.</p>
<li><pclass="first">The master server will scan through each RecordIO file to generate the <em>block index</em> and know how many blocks does each file have. A block can be referenced by the file path and the index of the block within the file. The block index is in memory data structure that enables fast access to each block, and the index of the block with the file is an integer start from 0, representing the n-th block within the file.</p>
<spanclass="nx">Idx</span><spanclass="kt">int</span><spanclass="c1">// index of the block within the file</span>
<li><pclass="first">The master server will scan through each RecordIO file to generate the <em>chunk index</em> and know how many chunks does each file have. A chunk can be referenced by the file path and the index of the chunk within the file. The chunk index is in memory data structure that enables fast access to each chunk, and the index of the chunk with the file is an integer start from 0, representing the n-th chunk within the file.</p>
<li><pclass="first">Blocks are grouped into tasks, and tasks are filled into the todo queue. The pending queue and the done queue are initialized with no element.</p>
<li><pclass="first">Chunks are grouped into tasks, and tasks are filled into the todo queue. The pending queue and the done queue are initialized with no element.</p>
<p>The selected trainer’s call to <codeclass="docutils literal"><spanclass="pre">paddle_begin_init_params</span></code> will return with 1, and the other trainers’ call to <codeclass="docutils literal"><spanclass="pre">paddle_begin_init_params</span></code> will block until initialization is done, and return 0. As illustrated below:</p>
<p>The selected trainer’s call to <codeclass="docutils literal"><spanclass="pre">paddle_begin_init_params</span></code> will return with 1, and the other trainers’ call to <codeclass="docutils literal"><spanclass="pre">paddle_begin_init_params</span></code> will return 0. <codeclass="docutils literal"><spanclass="pre">paddle_get_params</span></code> will be blocked until initialization is completed. As illustrated below:</p>
<p><imgsrc="./src/pserver_init.png"></p>
</div>
</div>
...
...
@@ -266,16 +266,13 @@ name:sparse-n-1
<spanclass="cm"> *</span>
<spanclass="cm"> * paddle_begin_init_params will be called from multiple trainers,</span>
<spanclass="cm"> * only one trainer will be selected to initialize the parameters on</span>
<spanclass="cm"> * parameter servers. Other trainers will be blocked until the</span>
<spanclass="cm"> * initialization is done, and they need to get the initialized</span>
<spanclass="cm"> * parameter servers. Other trainers need to get the initialized</span>
<spanclass="cm"> * parameters from parameter servers using @paddle_get_params.</span>
<spanclass="cm"> *</span>
<spanclass="cm"> * @param pserver_config_proto serialized parameter server configuration in</span>
<spanid="design-doc-the-c-class-parameters"></span><h1>Design Doc: The C++ Class <codeclass="docutils literal"><spanclass="pre">Parameters</span></code><aclass="headerlink"href="#design-doc-the-c-class-parameters"title="永久链接至标题">¶</a></h1>
<p><codeclass="docutils literal"><spanclass="pre">Parameters</span></code> is a concept we designed in Paddle V2 API. <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> is a container of parameters, and make Paddle can shared parameter between topologies. We described usages of <codeclass="docutils literal"><spanclass="pre">Parameter</span></code> in <aclass="reference internal"href="api.html"><spanclass="doc">api.md</span></a>.</p>
<p>We used Python to implement Parameters when designing V2 API before. There are several defects for current implementation:</p>
<ulclass="simple">
<li>We just use <codeclass="docutils literal"><spanclass="pre">memcpy</span></code> to share Parameters between topologies, but this is very inefficient.</li>
<li>We did not implement share Parameters while training. We just trigger <codeclass="docutils literal"><spanclass="pre">memcpy</span></code> when start training.</li>
</ul>
<p>It is necessary that we implement Parameters in CPP side. However, it could be a code refactoring for Paddle, because Paddle was designed for training only one topology before, i.e., each GradientMachine contains its Parameter as a data member. In current Paddle implementation, there are three concepts associated with <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>:</p>
<olclass="simple">
<li><codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code>. A <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> is a container for <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code>.
It is evident that we should use <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code> when developing <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>.
However, the <codeclass="docutils literal"><spanclass="pre">Parameter</span></code> class contains many functions and does not have a clear interface.
It contains <codeclass="docutils literal"><spanclass="pre">create/store</span><spanclass="pre">Parameter</span></code>, <codeclass="docutils literal"><spanclass="pre">serialize/deserialize</span></code>, <codeclass="docutils literal"><spanclass="pre">optimize(i.e</span><spanclass="pre">SGD)</span></code>, <codeclass="docutils literal"><spanclass="pre">randomize/zero</span></code>.
When we developing <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>, we only use <codeclass="docutils literal"><spanclass="pre">create/store</span><spanclass="pre">Parameter</span></code> functionality.
We should extract functionalities of Parameter into many classes to clean Paddle CPP implementation.</li>
<li><codeclass="docutils literal"><spanclass="pre">paddle::GradientMachine</span></code> and its sub-classes, e.g., <codeclass="docutils literal"><spanclass="pre">paddle::MultiGradientMachine</span></code>, <codeclass="docutils literal"><spanclass="pre">paddle::NeuralNetwork</span></code>.
We should pass <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> to <codeclass="docutils literal"><spanclass="pre">paddle::GradientMachine</span></code> when <codeclass="docutils literal"><spanclass="pre">forward/backward</span></code> to avoid <codeclass="docutils literal"><spanclass="pre">memcpy</span></code> between topologies.
Also, we should handle multi-GPU/CPU training, because <codeclass="docutils literal"><spanclass="pre">forward</span></code> and <codeclass="docutils literal"><spanclass="pre">backward</span></code> would perform on multi-GPUs and multi-CPUs.
<codeclass="docutils literal"><spanclass="pre">Parameters</span></code> should dispatch the parameter value to each device, and gather the parameter gradient from each device.</li>
<li><codeclass="docutils literal"><spanclass="pre">paddle::ParameterUpdater</span></code>. The ParameterUpdater is used to update parameters in Paddle.
So <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> should be used by <codeclass="docutils literal"><spanclass="pre">paddle::ParameterUpdater</span></code>, and <codeclass="docutils literal"><spanclass="pre">paddle::ParameterUpdater</span></code> should optimize <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> (by SGD).</li>
</ol>
<p>The step by step approach for implementation Parameters in Paddle C++ core is listed below. Each step should be a PR and could be merged into Paddle one by one.</p>
<olclass="simple">
<li>Clean <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code> interface. Extract the functionalities of <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code> to prepare for the implementation of Parameters.</li>
<li>Implementation a <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> class. It just stores the <codeclass="docutils literal"><spanclass="pre">paddle::Parameter</span></code> inside. Make <codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code> uses <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> as a class member.</li>
<li>Make <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> support Multi-CPU and Multi-GPU training to prepare for sharing <codeclass="docutils literal"><spanclass="pre">Parameter</span></code> between topologies.
Because we need share <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> between topologies, it is <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>‘s response to exchange Parameters between GPUs.
<codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code> should not handle how to exchange Parameters because <codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code> only used to train one topology and we need to support train many topologies in Paddle, i.e., there could be many GradientMachines use one <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>.<ul>
<li>We should use a global function to exchange Parameters between GPUs, not a member function in <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>. The <codeclass="docutils literal"><spanclass="pre">MultiGradientMachine</span></code> invoke this function, which uses <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> as this function inputs.</li>
<li>The MultiGradientMachine contains many functionalities. Extracting the Parameters exchanging logic could make MultiGradientMachine clearer and simpler.</li>
</ul>
</li>
<li>Make <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> as an argument for <codeclass="docutils literal"><spanclass="pre">forward/backward</span></code> function, not a data member for <codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code>. For example, <codeclass="docutils literal"><spanclass="pre">forward</span></code> could be <codeclass="docutils literal"><spanclass="pre">forward(const</span><spanclass="pre">Parameters&</span><spanclass="pre">params,</span><spanclass="pre">...)</span></code> and <codeclass="docutils literal"><spanclass="pre">backward</span></code> could be <codeclass="docutils literal"><spanclass="pre">backward(Parameters*</span><spanclass="pre">params,</span><spanclass="pre">...)</span></code>. After this step, Paddle could share <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> between topologies.</li>
<li><codeclass="docutils literal"><spanclass="pre">ParameterUpdater</span></code> is invoked by <codeclass="docutils literal"><spanclass="pre">GradientMachine</span></code> and <codeclass="docutils literal"><spanclass="pre">Trainer</span></code>, but it updates <codeclass="docutils literal"><spanclass="pre">Parameters</span></code>. In the end of this code refactoring, we could change <codeclass="docutils literal"><spanclass="pre">ParameterUpdater</span></code> directly uses <codeclass="docutils literal"><spanclass="pre">Parameters</span></code> to make <codeclass="docutils literal"><spanclass="pre">ParameterUpdater</span></code>‘s implementation clear.</li>
Built with <ahref="http://sphinx-doc.org/">Sphinx</a> using a <ahref="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <ahref="https://readthedocs.org">Read the Docs</a>.