Commit 82e956b5, authored by Travis CI

Deploy to GitHub Pages: b15b2637

Parent 8503f028
......@@ -257,147 +257,11 @@
<li>process: PaddlePaddle calls the process function to read the data. Each time it reads a sample, process emits that sample with a yield statement so that PaddlePaddle can harvest it.</li>
</ul>
<p>The file <code class="docutils literal"><span class="pre">dataprovider_bow.py</span></code> gives a complete example:</p>
<div class="highlight-python"><table class="highlighttable"><tr><td class="code"><div class="highlight"><pre><span></span><span class="c1"># provider, CacheType and the input types come from PyDataProvider2.</span>
<span class="kn">from</span> <span class="nn">paddle.trainer.PyDataProvider2</span> <span class="kn">import</span> <span class="o">*</span>

<span class="c1"># id of a word that is not in the dictionary</span>
<span class="n">UNK_IDX</span> <span class="o">=</span> <span class="mi">0</span>

<span class="c1"># initializer is called by the framework during initialization.</span>
<span class="c1"># It allows the user to describe the data types and set up the</span>
<span class="c1"># necessary data structures for later use.</span>
<span class="c1"># `settings` is an object. initializer needs to properly fill settings.input_types.</span>
<span class="c1"># initializer can also store other data structures needed by process().</span>
<span class="c1"># In this example, the dictionary is stored in settings.</span>
<span class="c1"># `dictionary` and `kwargs` are arguments passed from trainer_config.lr.py</span>
<span class="hll"><span class="k">def</span> <span class="nf">initializer</span><span class="p">(</span><span class="n">settings</span><span class="p">,</span> <span class="n">dictionary</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
</span>    <span class="c1"># Put the word dictionary into settings</span>
    <span class="n">settings</span><span class="o">.</span><span class="n">word_dict</span> <span class="o">=</span> <span class="n">dictionary</span>
    <span class="c1"># settings.input_types specifies the data types that the data provider</span>
    <span class="c1"># generates.</span>
    <span class="n">settings</span><span class="o">.</span><span class="n">input_types</span> <span class="o">=</span> <span class="p">{</span>
        <span class="c1"># The first input is a sparse_binary_vector,</span>
        <span class="c1"># which means each dimension of the vector is either 0 or 1. It is the</span>
        <span class="c1"># bag-of-words (BOW) representation of the text.</span>
        <span class="s1">&#39;word&#39;</span><span class="p">:</span> <span class="n">sparse_binary_vector</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dictionary</span><span class="p">)),</span>
        <span class="c1"># The second input is an integer. It represents the category id of the</span>
        <span class="c1"># sample. 2 means there are two labels in the dataset</span>
        <span class="c1"># (1 for positive and 0 for negative).</span>
        <span class="s1">&#39;label&#39;</span><span class="p">:</span> <span class="n">integer_value</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="p">}</span>
<span class="c1"># Declare a data provider. Its init_hook is the &#39;initializer&#39; function above.</span>
<span class="c1"># It caches the data generated during the first pass in memory, so that</span>
<span class="c1"># later passes need no on-the-fly data generation.</span>
<span class="c1"># `settings` is the same object used by initializer().</span>
<span class="c1"># `file_name` is the name of a file listed in the train_list or test_list file given</span>
<span class="c1"># to define_py_data_sources2(). See trainer_config.lr.py.</span>
<span class="nd">@provider</span><span class="p">(</span><span class="n">init_hook</span><span class="o">=</span><span class="n">initializer</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="n">CacheType</span><span class="o">.</span><span class="n">CACHE_PASS_IN_MEM</span><span class="p">)</span>
<span class="hll"><span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">settings</span><span class="p">,</span> <span class="n">file_name</span><span class="p">):</span>
</span>    <span class="c1"># Open the input data file.</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_name</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="c1"># Read each line.</span>
        <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>
            <span class="c1"># Each line contains the label and text of the comment, separated by \t.</span>
            <span class="n">label</span><span class="p">,</span> <span class="n">comment</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\t</span><span class="s1">&#39;</span><span class="p">)</span>
            <span class="c1"># Split the words into a list.</span>
            <span class="n">words</span> <span class="o">=</span> <span class="n">comment</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
            <span class="c1"># Convert the words into a list of ids by looking them up in word_dict.</span>
            <span class="n">word_vector</span> <span class="o">=</span> <span class="p">[</span><span class="n">settings</span><span class="o">.</span><span class="n">word_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">UNK_IDX</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">]</span>
            <span class="c1"># Return the features for the current comment. The first is a list</span>
            <span class="c1"># of ids representing a 0-1 binary sparse vector of the text,</span>
            <span class="c1"># the second is the integer id of the label.</span>
            <span class="k">yield</span> <span class="p">{</span><span class="s1">&#39;word&#39;</span><span class="p">:</span> <span class="n">word_vector</span><span class="p">,</span> <span class="s1">&#39;label&#39;</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">label</span><span class="p">)}</span>
</pre></div>
</td></tr></table></div>
<p>For details, see <a class="reference internal" href="../../api/v1/data_provider/dataprovider_cn.html#api-dataprovider"><span class="std std-ref">Introduction to DataProvider</span></a>.</p>
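<p>The yield-based reading pattern can be tried without PaddlePaddle. The following is a minimal sketch, not the framework's implementation: it mirrors the parsing logic of process on an in-memory list of lines. The names process_lines, toy_dict and toy_lines are hypothetical, and UNK_IDX = 0 is an assumed id for out-of-dictionary words.</p>

```python
# Standalone sketch of the generator pattern used by process().
# Assumption: unknown words map to id 0 (UNK_IDX is illustrative here).
UNK_IDX = 0


def process_lines(lines, word_dict):
    """Yield one sample per 'label<TAB>comment' line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on unpacking
        label, comment = line.split('\t', 1)
        # Map every word to its id, falling back to UNK_IDX.
        word_ids = [word_dict.get(w, UNK_IDX) for w in comment.split()]
        yield {'word': word_ids, 'label': int(label)}


# Hypothetical toy data standing in for a dictionary and a data file.
toy_dict = {'<unk>': 0, 'good': 1, 'bad': 2, 'movie': 3}
toy_lines = ["1\tgood movie\n", "0\tbad film\n"]

samples = list(process_lines(toy_lines, toy_dict))
for sample in samples:
    print(sample)
```

<p>Because the function is a generator, samples are produced one at a time as the consumer asks for them, which is what lets a framework harvest records without loading the whole file first.</p>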
</div>
<div class="section" id="id8">
<h3>Defining Data Loading in the Configuration<a class="headerlink" href="#id8" title="Permalink to this headline"></a></h3>
<p>In the model configuration, data is loaded through the <code class="docutils literal"><span class="pre">define_py_data_sources2</span></code> interface:</p>
<div class="highlight-python"><table class="highlighttable"><tr><td class="code"><div class="highlight"><pre><span></span><span class="n">dict_file</span> <span class="o">=</span> <span class="s2">&quot;./data/dict.txt&quot;</span>
<span class="n">word_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">dict_file</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">f</span><span class="p">):</span>
        <span class="n">w</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">word_dict</span><span class="p">[</span><span class="n">w</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span>
<span class="n">is_predict</span> <span class="o">=</span> <span class="n">get_config_arg</span><span class="p">(</span><span class="s1">&#39;is_predict&#39;</span><span class="p">,</span> <span class="nb">bool</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">trn</span> <span class="o">=</span> <span class="s1">&#39;data/train.list&#39;</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">is_predict</span> <span class="k">else</span> <span class="bp">None</span>
<span class="n">tst</span> <span class="o">=</span> <span class="s1">&#39;data/test.list&#39;</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">is_predict</span> <span class="k">else</span> <span class="s1">&#39;data/pred.list&#39;</span>
<span class="n">process</span> <span class="o">=</span> <span class="s1">&#39;process&#39;</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">is_predict</span> <span class="k">else</span> <span class="s1">&#39;process_predict&#39;</span>
<span class="hll"><span class="n">define_py_data_sources2</span><span class="p">(</span>
</span>    <span class="n">train_list</span><span class="o">=</span><span class="n">trn</span><span class="p">,</span>
    <span class="n">test_list</span><span class="o">=</span><span class="n">tst</span><span class="p">,</span>
    <span class="n">module</span><span class="o">=</span><span class="s2">&quot;dataprovider_emb&quot;</span><span class="p">,</span>
    <span class="n">obj</span><span class="o">=</span><span class="n">process</span><span class="p">,</span>
    <span class="n">args</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;dictionary&quot;</span><span class="p">:</span> <span class="n">word_dict</span><span class="p">})</span>
</pre></div>
</td></tr></table></div>
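<p>The dictionary-building loop in the configuration can be checked in isolation. The sketch below is illustrative only: build_word_dict and toy_lines are hypothetical names, and the "word count" line layout is an assumption about the dictionary file. The loop itself only relies on the word being the first whitespace-separated field of each line.</p>

```python
def build_word_dict(lines):
    """Map the first field of each line to its zero-based line index."""
    word_dict = {}
    for i, line in enumerate(lines):
        # Like the configuration code, this assumes no blank lines.
        word = line.strip().split()[0]
        word_dict[word] = i
    return word_dict


# Hypothetical dictionary lines; the second column is assumed to be a count.
toy_lines = ["the 1024\n", "movie 512\n", "good 256\n"]
word_dict = build_word_dict(toy_lines)
print(word_dict)
```

<p>The resulting mapping is what the configuration passes to the data provider through the args parameter, where it arrives as the dictionary argument of the init hook.</p>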
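<p>The following explains the <code class="docutils literal"><span class="pre">define_py_data_sources2</span></code> data-loading configuration:</p>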
<ul class="simple">
<li>data/train.list, data/test.list: specify the training data and the test data.</li>
......