Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
<h2>DataProvider for the non-sequential model<a class="headerlink" href="#dataprovider-for-the-non-sequential-model" title="Permalink to this headline">¶</a></h2>
<p>Here we use the MNIST handwritten digit recognition data as an example to
illustrate how to write a simple PyDataProvider.</p>
<p>MNIST is a handwritten digit classification data set. It contains 70,000
grayscale images of digits, with labels ranging from 0 to 9. All the images
have been size-normalized and centered, so that every image has the same size
of 28 x 28 pixels.</p>
<p>A small part of the original data is shown below as an example:</p>
<p>The corresponding DataProvider is shown below:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.</span>
<span class="c1">#</span>
<span class="c1"># Licensed under the Apache License, Version 2.0 (the "License");</span>
<span class="c1"># you may not use this file except in compliance with the License.</span>
<span class="c1"># You may obtain a copy of the License at</span>
<span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">settings</span><span class="p">,</span> <span class="n">filename</span><span class="p">):</span>  <span class="c1"># settings is not used currently.</span>
    <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span>  <span class="c1"># open one of the training files</span>
    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>  <span class="c1"># read each line</span>
It sets some properties on the DataProvider and constructs a real PaddlePaddle
DataProvider from a very simple user-implemented Python function. It does not
matter if you are not familiar with <a class="reference external" href="http://www.learnpython.org/en/Decorators">decorators</a>. You can keep it simple by
treating <code class="code docutils literal"><span class="pre">@provider</span></code> as a fixed mark above the provider function you
implement.</p>
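<p>For readers new to decorators, the idea can be sketched as follows. The
<code class="code docutils literal"><span class="pre">toy_provider</span></code> decorator is a hypothetical stand-in written only for
illustration. It is not PaddlePaddle's <code class="code docutils literal"><span class="pre">@provider</span></code> implementation; it merely
attaches the configured properties to the user function.</p>

```python
def toy_provider(**properties):
    """Hypothetical stand-in for a provider-style decorator (illustration only)."""

    def decorate(func):
        # Attach the configured properties to the user function and return it.
        func.properties = dict(properties)
        return func

    return decorate


@toy_provider(input_types=["pixel", "label"], should_shuffle=True)
def process(settings, filename):
    # A real provider would open `filename` and yield one sample per record.
    yield {"pixel": [0.0], "label": 0}


print(process.properties["should_shuffle"])
```

<p>The decorated function is still an ordinary generator function; the decorator
only records how it should be used.</p>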
<p><a class="reference internal" href="#input-types">input_types</a> defines the data format that a DataProvider returns.
In this example, it is set to a dense vector of dimension 28 x 28 = 784 and an
integer scalar whose value ranges from 0 to 9.
<a class="reference internal" href="#input-types">input_types</a> can be set to several kinds of input formats; please refer to the
documentation of <a class="reference internal" href="#input-types">input_types</a> for more details.</p>
<p>The process method is the core part of constructing a real DataProvider in
PaddlePaddle. It implements how to open the text file, how to read one sample
from the original text file, how to convert it into the format defined by <a class="reference internal" href="#input-types">input_types</a>, and how to give it
back to the PaddlePaddle process at line 23.
Note that the data yielded by the process function must follow the same order in which the
<a class="reference internal" href="#input-types">input_types</a> are defined.</p>
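<p>The steps described above (read one line, convert it, give the sample back)
can be sketched in plain Python. This is a minimal sketch under stated
assumptions: the line layout <code class="code docutils literal"><span class="pre">label;p0 p1 ... p783</span></code> and the name
<code class="code docutils literal"><span class="pre">mnist_samples</span></code> are hypothetical, and the function takes an iterable of
lines rather than a file name so that it runs without PaddlePaddle.</p>

```python
def mnist_samples(lines, dim=28 * 28):
    """Yield one sample per text line.

    Assumed (hypothetical) line layout: "<label>;<p0> <p1> ... <p783>".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label_str, pixel_str = line.split(";")
        pixels = [float(p) for p in pixel_str.split()]
        if len(pixels) != dim:
            raise ValueError("expected %d pixels, got %d" % (dim, len(pixels)))
        # One dense vector and one integer label, matching the declared inputs.
        yield {"pixel": pixels, "label": int(label_str)}


demo_lines = ["7;" + " ".join(["0.5"] * 784), ""]
samples = list(mnist_samples(demo_lines))
print(len(samples), samples[0]["label"], len(samples[0]["pixel"]))
```

<p>In a real provider the same loop body would sit inside the decorated
<code class="code docutils literal"><span class="pre">process</span></code> function.</p>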
<p>With the help of PyDataProvider2, the user can focus on how to generate ONE training
sample by using the keyword <code class="code docutils literal"><span class="pre">yield</span></code>.
<code class="code docutils literal"><span class="pre">yield</span></code> is a Python keyword, and the concept related to it is the generator.</p>
<p>Only a few lines of code need to be added to the training configuration file;
the following can be taken as an example.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.</span>
<span class="c1">#</span>
<span class="c1"># Licensed under the Apache License, Version 2.0 (the "License");</span>
<span class="c1"># you may not use this file except in compliance with the License.</span>
<span class="c1"># You may obtain a copy of the License at</span>
</pre></div>
<span id="api-pydataprovider2-sequential-model"></span><h2>DataProvider for the sequential model<a class="headerlink" href="#dataprovider-for-the-sequential-model" title="Permalink to this headline">¶</a></h2>
<p>A sequence model takes sequences as its input. A sequence is made up of several
timesteps. A so-called timestep does not necessarily have anything to do
with time; it simply means that the order of the data is taken into
consideration in model design and training.
For example, a sentence can be interpreted as a kind of sequence data in NLP
tasks.</p>
<p>Here is an example of a data provider for English sentiment classification data.
The original input data is plain English text, labeled with positive or
negative sentiment (marked by 0 and 1 respectively).</p>
<p>A small part of the original data can be found, as an example, at the path below:</p>
<p>The corresponding data provider can be found at the path below:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.</span>
<span class="c1">#</span>
<span class="c1"># Licensed under the Apache License, Version 2.0 (the "License");</span>
<span class="c1"># you may not use this file except in compliance with the License.</span>
<span class="c1"># You may obtain a copy of the License at</span>
        <span class="c1"># The text is a sequence of integer values, and each value is a word id.</span>
        <span class="c1"># The whole sequence is the sentence whose sentiment we want to</span>
        <span class="c1"># predict.</span>
        <span class="s1">'data'</span><span class="p">:</span> <span class="n">integer_value_sequence</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dictionary</span><span class="p">)),</span>  <span class="c1"># text input</span>
    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>  <span class="c1"># read each line of file</span>
        <span class="n">label</span><span class="p">,</span> <span class="n">sentence</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">)</span>  <span class="c1"># get label and sentence</span>
        <span class="n">words</span> <span class="o">=</span> <span class="n">sentence</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">' '</span><span class="p">)</span>  <span class="c1"># get words</span>
        <span class="c1"># convert word string to word id</span>
        <span class="c1"># the word not in dictionary will be ignored.</span>
<p>This data provider for the sequential model is a little more complex than the
one for the MNIST dataset.
A new initialization method is introduced here.
The method <code class="code docutils literal"><span class="pre">on_init</span></code> is attached to the DataProvider through <code class="code docutils literal"><span class="pre">@provider</span></code>'s
<code class="code docutils literal"><span class="pre">init_hook</span></code> parameter, and it is invoked once when the DataProvider is
initialized. The <code class="code docutils literal"><span class="pre">on_init</span></code> function has the following parameters:</p>
<ul class="simple">
<li>The first parameter is the settings object.</li>
<li>The remaining parameters are passed as keyword arguments. Some of them are passed
by PaddlePaddle; see the reference for <a class="reference internal" href="#init-hook">init_hook</a>.
The <code class="code docutils literal"><span class="pre">dictionary</span></code> object is a Python dict passed from the trainer
configuration file, and it maps each word string to a word id.</li>
</ul>
<p>To pass these parameters into the DataProvider, the following lines should be added
to the trainer configuration file.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.</span>
<span class="c1">#</span>
<span class="c1"># Licensed under the Apache License, Version 2.0 (the "License");</span>
<span class="c1"># you may not use this file except in compliance with the License.</span>
<span class="c1"># You may obtain a copy of the License at</span>
<p>The definition is basically the same as in the MNIST example, except that this configuration:</p>
<ul class="simple">
<li>loads the dictionary, and</li>
<li>passes it as a parameter to the DataProvider.</li>
</ul>
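<p>Loading the dictionary could look like the sketch below. It assumes a
hypothetical word list with one word per line; the helper name
<code class="code docutils literal"><span class="pre">load_dictionary</span></code> and the <code class="code docutils literal"><span class="pre">provider_args</span></code> variable are illustrative and not part of
PaddlePaddle.</p>

```python
import io


def load_dictionary(word_file):
    """Map each distinct word to a consecutive id (hypothetical file layout)."""
    dictionary = {}
    for line in word_file:
        word = line.strip()
        if word and word not in dictionary:
            dictionary[word] = len(dictionary)
    return dictionary


# Stand-in for an on-disk word list, so the sketch runs anywhere.
dictionary = load_dictionary(io.StringIO("good\nbad\nmovie\n"))

# The trainer configuration would then hand this dict to the DataProvider,
# where it arrives as the `dictionary` keyword argument of on_init.
provider_args = {"dictionary": dictionary}
print(provider_args)
```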
<p>The <cite>input_types</cite> are configured in the method <code class="code docutils literal"><span class="pre">on_init</span></code>. This has the same
effect as configuring them through <code class="code docutils literal"><span class="pre">@provider</span></code>'s <code class="code docutils literal"><span class="pre">input_types</span></code> parameter.
However, here the <code class="code docutils literal"><span class="pre">input_types</span></code> are set at runtime, so we can set them to
different types according to the input data. The input of the neural network is a
sequence of word ids, so <code class="code docutils literal"><span class="pre">seq_type</span></code> is set to <code class="code docutils literal"><span class="pre">integer_value_sequence</span></code>.</p>
<p>During <code class="code docutils literal"><span class="pre">on_init</span></code>, we save the <code class="code docutils literal"><span class="pre">dictionary</span></code> variable to
<code class="code docutils literal"><span class="pre">settings</span></code>, and it will be used in <code class="code docutils literal"><span class="pre">process</span></code>. Note that the settings
parameter of the process function and that of the on_init function are the same
object.</p>
<p>The basic processing logic is the same as in MNIST's <code class="code docutils literal"><span class="pre">process</span></code> method. Each
sample in the data file is given back to the PaddlePaddle process.</p>
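<p>The interplay between <code class="code docutils literal"><span class="pre">on_init</span></code> and <code class="code docutils literal"><span class="pre">process</span></code> can be sketched without
PaddlePaddle as follows. The <code class="code docutils literal"><span class="pre">Settings</span></code> class is a hypothetical stand-in for the
settings object, and <code class="code docutils literal"><span class="pre">process</span></code> takes an iterable of lines instead of a file
name; both are assumptions made so that the sketch is self-contained.</p>

```python
class Settings(object):
    """Hypothetical stand-in for the settings object PaddlePaddle passes in."""


def on_init(settings, dictionary, **kwargs):
    # Save the dictionary so that process() can reach it later.
    settings.dictionary = dictionary


def process(settings, lines):
    for line in lines:
        label, sentence = line.rstrip("\n").split("\t")
        words = sentence.split(" ")
        # Words that are missing from the dictionary are ignored.
        word_ids = [
            settings.dictionary[w] for w in words if w in settings.dictionary
        ]
        yield {"data": word_ids, "label": int(label)}


settings = Settings()
on_init(settings, dictionary={"good": 0, "bad": 1, "movie": 2})
samples = list(process(settings, ["0\tgood movie\n", "1\tbad unknown movie\n"]))
print(samples)
```

<p>Because both functions receive the same settings object, anything stored in
<code class="code docutils literal"><span class="pre">on_init</span></code> is visible in <code class="code docutils literal"><span class="pre">process</span></code>.</p>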
<p>This covers the basic usage of PyDataProvider.
Please refer to the Reference section below for details.</p>
</div>
<div class="section" id="reference">
<h2>Reference<a class="headerlink" href="#reference" title="Permalink to this headline">¶</a></h2>
<div class="section" id="provider">
<h3>@provider<a class="headerlink" href="#provider" title="Permalink to this headline">¶</a></h3>
<dl class="function">
<dt id="paddle.trainer.PyDataProvider2.provider">
<code class="descclassname">paddle.trainer.PyDataProvider2.</code><code class="descname">provider</code><span class="sig-paren">(</span><em>input_types=None</em>, <em>should_shuffle=None</em>, <em>pool_size=-1</em>, <em>min_pool_size=-1</em>, <em>can_over_batch_size=True</em>, <em>calc_batch_size=None</em>, <em>cache=0</em>, <em>check=False</em>, <em>check_fail_continue=False</em>, <em>init_hook=None</em>, <em>**outter_kwargs</em><span class="sig-paren">)</span><a class="headerlink" href="#paddle.trainer.PyDataProvider2.provider" title="Permalink to this definition">¶</a></dt>
<dd><p>Provider decorator. Use it to turn a function into a PyDataProvider2 object.
In this function, the user only needs to get each sample for some train/test
<li><code class="code docutils literal"><span class="pre">sparse_binary_vector</span></code>: a sparse binary vector; most of its values are 0, and
the non-zero elements are fixed to 1.</li>
<li><code class="code docutils literal"><span class="pre">sparse_float_vector</span></code>: a sparse float vector; most of its values are 0, and the
non-zero elements can be any float value, given by the user.</li>
<li><code class="code docutils literal"><span class="pre">integer</span></code>: an integer scalar, used especially for labels or word indices.</li>
</ul>
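<p>What the two sparse types mean can be illustrated by expanding them into
dense vectors. The in-memory layouts used here (a list of non-zero indices for
the binary case, a list of <code class="code docutils literal"><span class="pre">(index,</span> <span class="pre">value)</span></code> pairs for the float case) are
assumptions made for this sketch, and both helper functions are hypothetical.</p>

```python
def binary_to_dense(indices, dim):
    """Expand a sparse binary vector (assumed: list of non-zero indices)."""
    dense = [0.0] * dim
    for i in indices:
        dense[i] = 1.0  # non-zero elements are fixed to 1
    return dense


def float_to_dense(pairs, dim):
    """Expand a sparse float vector (assumed: list of (index, value) pairs)."""
    dense = [0.0] * dim
    for i, value in pairs:
        dense[i] = value  # non-zero elements carry a user-given value
    return dense


print(binary_to_dense([1, 3], 5))
print(float_to_dense([(0, 0.5), (4, -2.0)], 5))
```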
<p>The three sequence types are:</p>
<ul class="simple">
<li><code class="code docutils literal"><span class="pre">SequenceType.NO_SEQUENCE</span></code> means the sample is not a sequence.</li>
<li><code class="code docutils literal"><span class="pre">SequenceType.SEQUENCE</span></code> means the sample is a sequence.</li>
<li><code class="code docutils literal"><span class="pre">SequenceType.SUB_SEQUENCE</span></code> means the sample is a nested sequence, in which each timestep of
the input sequence is itself a sequence.</li>
</ul>
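<p>For an integer input, the three sequence types differ only in how deeply the
sample is nested. The sketch below assumes plain Python lists as the sample
representation; <code class="code docutils literal"><span class="pre">nesting_depth</span></code> is a hypothetical helper, not a PaddlePaddle
function.</p>

```python
def nesting_depth(sample):
    """Return 0 for a scalar, 1 for a flat list, 2 for a list of lists."""
    depth = 0
    while isinstance(sample, list):
        depth += 1
        sample = sample[0] if sample else None
    return depth


no_sequence = 3                     # a single integer, not a sequence
sequence = [3, 1, 4]                # one integer per timestep
sub_sequence = [[3, 1], [4, 1, 5]]  # each timestep is itself a sequence

for sample in (no_sequence, sequence, sub_sequence):
    print(nesting_depth(sample), sample)
```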
<p>Different input types have different input formats. Their formats are shown
<p>The module that does most of the job is py_paddle.swig_paddle. It is
generated by SWIG and is fully documented; for more details you can use
Python's <code class="code docutils literal"><span class="pre">help()</span></code> function. Let's walk through the above Python script:</p>
<ul class="simple">
<li>At the beginning, use <code class="code docutils literal"><span class="pre">swig_paddle.initPaddle()</span></code> to initialize
PaddlePaddle with command line arguments. For more about command line arguments,
see <a class="reference internal" href="../../../howto/usage/cmd_parameter/detail_introduction_en.html#cmd-detail-introduction"><span class="std std-ref">Detail Description</span></a>.</li>
<li>Parse the configuration file that was used in training with <code class="code docutils literal"><span class="pre">parse_config()</span></code>.
Because the data to predict on usually has no label, and the output of prediction
is normally the output layer rather than the cost layer, you should modify
the configuration file accordingly before using it for prediction.</li>
<li>Create a neural network with
<code class="code docutils literal"><span class="pre">swig_paddle.GradientMachine.createFromConfigProto()</span></code>, which takes the
parsed configuration <code class="code docutils literal"><span class="pre">conf.model_config</span></code> as its argument. Then load the
trained parameters from the model with <code class="code docutils literal"><span class="pre">network.loadParameters()</span></code>.</li>
<li><dl class="first docutils">
<dt>Create a data converter object of the utility class <code class="code docutils literal"><span class="pre">DataProviderConverter</span></code>.</dt>
<dd><ul class="first last">
<li>Note: As swig_paddle can only accept C++ matrices, we offer the utility
class DataProviderConverter, which accepts the same input data as
PyDataProvider2. For more information, please refer to the documentation
of <a class="reference internal" href="../data_provider/pydataprovider2_en.html#api-pydataprovider2"><span class="std std-ref">PyDataProvider2</span></a>.</li>
</ul>
</dd>
</dl>
</li>
<li>Do the prediction with <code class="code docutils literal"><span class="pre">forwardTest()</span></code>, which takes the converted
input data and outputs the activations of the output layer.</li>
<span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">network</span><span class="p">,</span> <span class="n">swig_paddle</span><span class="o">.</span><span class="n">GradientMachine</span><span class="p">)</span>  <span class="c1"># For code hint.</span>