<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>Using PyDataProvider2 &mdash; PADDLE  documentation</title>
    
    <link rel="stylesheet" href="../../_static/classic.css" type="text/css" />
    <link rel="stylesheet" href="../../_static/pygments.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../../',
        VERSION:     '',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="../../_static/jquery.js"></script>
    <script type="text/javascript" src="../../_static/underscore.js"></script>
    <script type="text/javascript" src="../../_static/doctools.js"></script>
    <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
    <link rel="top" title="PADDLE  documentation" href="../../index.html" />
    <link rel="up" title="Introduction to Paddle's DataProvider" href="index.html" />
    <link rel="next" title="Writing a Custom DataProvider" href="write_new_dataprovider.html" />
    <link rel="prev" title="Introduction to Paddle's DataProvider" href="index.html" /> 
  </head>
  <body role="document">
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="../../genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="write_new_dataprovider.html" title="Writing a Custom DataProvider"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="index.html" title="Introduction to Paddle's DataProvider"
             accesskey="P">previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="../../index.html">PADDLE  documentation</a> &raquo;</li>
          <li class="nav-item nav-item-1"><a href="../index.html" >Configuration</a> &raquo;</li>
          <li class="nav-item nav-item-2"><a href="index.html" accesskey="U">Introduction to Paddle's DataProvider</a> &raquo;</li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body" role="main">
            
  <div class="section" id="pydataprovider2">
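<h1>Using PyDataProvider2<a class="headerlink" href="#pydataprovider2" title="Permalink to this headline"></a></h1>
<p>PyDataProvider is the recommended interface for feeding data to Paddle from Python. With it, you
only need to decide how to read each sample from a file; you do not need to care how the data is
transferred to Paddle or how it is stored. The interface reads data with multiple threads and
provides a simple cache.</p>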
<div class="section" id="id1">
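<h2>A Simple Use Case<a class="headerlink" href="#id1" title="Permalink to this headline"></a></h2>
<p>This section uses MNIST handwritten digit recognition to show how a simple PyDataProvider works.
MNIST is a digit classification dataset of 70,000 grayscale images. Each label is a digit from 0 to 9,
and the features are the 28*28 grayscale pixel values. Here each MNIST image is stored as one record
in a plain text file; sample data is shown below.</p>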
<div class="highlight-python"><div class="highlight"><pre><span></span>5;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.215686 0.533333 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.67451 0.992157 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.070588 0.886275 0.992157 0 0 0 0 0 0 0 0 0 0 0.192157 0.070588 0 0 0 0 0 0 0 0 0 0 0 0 0 0.670588 0.992157 0.992157 0 0 0 0 0 0 0 0 0 0.117647 0.933333 0.858824 0.313725 0 0 0 0 0 0 0 0 0 0 0 0.090196 0.858824 0.992157 0.831373 0 0 0 0 0 0 0 0 0 0.141176 0.992157 0.992157 0.611765 0.054902 0 0 0 0 0 0 0 0 0 0 0.258824 0.992157 0.992157 0.529412 0 0 0 0 0 0 0 0 0 0.368627 0.992157 0.992157 0.419608 0.003922 0 0 0 0 0 0 0 0 0 0.094118 0.835294 0.992157 0.992157 0.517647 0 0 0 0 0 0 0 0 0 0.603922 0.992157 0.992157 0.992157 0.603922 0.545098 0.043137 0 0 0 0 0 0 0 0.447059 0.992157 0.992157 0.956863 0.062745 0 0 0 0 0 0 0 0 0.011765 0.666667 0.992157 0.992157 0.992157 0.992157 0.992157 0.745098 0.137255 0 0 0 0 0 0.152941 0.866667 0.992157 0.992157 0.521569 0 0 0 0 0 0 0 0 0 0.070588 0.992157 0.992157 0.992157 0.803922 0.352941 0.745098 0.992157 0.945098 0.317647 0 0 0 0 0.580392 0.992157 0.992157 0.764706 0.043137 0 0 0 0 0 0 0 0 0 0.070588 0.992157 0.992157 0.776471 0.043137 0 0.007843 0.27451 0.882353 0.941176 0.176471 0 0 0.180392 0.898039 0.992157 0.992157 0.313725 0 0 0 0 0 0 0 0 0 0 0.070588 0.992157 0.992157 0.713725 0 0 0 0 0.627451 0.992157 0.729412 0.062745 0 0.509804 0.992157 0.992157 0.776471 0.035294 0 0 0 0 0 0 0 0 0 0 0.494118 0.992157 0.992157 0.968627 0.168627 0 0 0 0.423529 0.992157 0.992157 0.364706 0 0.717647 0.992157 0.992157 0.317647 0 0 0 0 0 0 0 0 0 0 0 0.533333 0.992157 0.984314 0.945098 0.603922 0 0 0 0.003922 0.466667 0.992157 0.988235 0.976471 0.992157 0.992157 
0.788235 0.007843 0 0 0 0 0 0 0 0 0 0 0 0.686275 0.882353 0.364706 0 0 0 0 0 0 0.098039 0.588235 0.992157 0.992157 0.992157 0.980392 0.305882 0 0 0 0 0 0 0 0 0 0 0 0 0.101961 0.67451 0.321569 0 0 0 0 0 0 0 0.105882 0.733333 0.976471 0.811765 0.713725 0 0 0 0 0 0 0 0 0 0 0 0 0 0.65098 0.992157 0.321569 0 0 0 0 0 0 0 0 0 0.25098 0.007843 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0.94902 0.219608 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.968627 0.764706 0.152941 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.498039 0.25098 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
0;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.298039 0.333333 0.333333 0.333333 0.337255 0.333333 0.333333 0.109804 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.027451 0.223529 0.776471 0.964706 0.988235 0.988235 0.988235 0.992157 0.988235 0.988235 0.780392 0.098039 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.14902 0.698039 0.988235 0.992157 0.988235 0.901961 0.87451 0.568627 0.882353 0.976471 0.988235 0.988235 0.501961 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.188235 0.647059 0.988235 0.988235 0.745098 0.439216 0.098039 0 0 0 0.572549 0.988235 0.988235 0.988235 0 0 0 0 0 0 0 0 0 0 0 0 0 0.2 0.933333 0.992157 0.941176 0.247059 0 0 0 0 0 0 0.188235 0.898039 0.992157 0.992157 0 0 0 0 0 0 0 0 0 0 0 0.039216 0.639216 0.933333 0.988235 0.913725 0.278431 0 0 0 0 0 0 0 0.113725 0.843137 0.988235 0.988235 0 0 0 0 0 0 0 0 0 0 0 0.235294 0.988235 0.992157 0.988235 0.815686 0.07451 0 0 0 0 0 0 0 0.333333 0.988235 0.988235 0.552941 0 0 0 0 0 0 0 0 0 0 0.211765 0.878431 0.988235 0.992157 0.701961 0.329412 0.109804 0 0 0 0 0 0 0 0.698039 0.988235 0.913725 0.145098 0 0 0 0 0 0 0 0 0 0.188235 0.890196 0.988235 0.988235 0.745098 0.047059 0 0 0 0 0 0 0 0 0 0.882353 0.988235 0.568627 0 0 0 0 0 0 0 0 0 0.2 0.933333 0.992157 0.992157 0.992157 0.447059 0.294118 0 0 0 0 0 0 0 0 0.447059 0.992157 0.768627 0 0 0 0 0 0 0 0 0 0 0.623529 0.988235 0.988235 0.988235 0.988235 0.992157 0.47451 0 0 0 0 0 0 0 0.188235 0.933333 0.87451 0.509804 0 0 0 0 0 0 0 0 0 0 0.992157 0.988235 0.937255 0.792157 0.988235 0.894118 0.082353 0 0 0 0 0 0 0.027451 0.647059 0.992157 0.654902 0 0 0 0 0 0 0 0 0 0 0 0.623529 0.988235 0.913725 0.329412 0.376471 0.184314 0 0 0 0 0 0 0.027451 0.513725 0.988235 0.635294 0.219608 0 0 0 0 0 
0 0 0 0 0 0 0.196078 0.929412 0.988235 0.988235 0.741176 0.309804 0 0 0 0 0 0 0.529412 0.988235 0.678431 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.223529 0.992157 0.992157 1 0.992157 0.992157 0.992157 0.992157 1 0.992157 0.992157 0.882353 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.023529 0.478431 0.654902 0.658824 0.952941 0.988235 0.988235 0.988235 0.992157 0.988235 0.729412 0.278431 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.196078 0.647059 0.764706 0.764706 0.768627 0.580392 0.047059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
4;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.180392 0.470588 0.623529 0.623529 0.623529 0.588235 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.243137 0.494118 0.862745 0.870588 0.960784 0.996078 0.996078 0.996078 0.996078 0.992157 0.466667 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.317647 0.639216 0.639216 0.639216 0.639216 0.639216 0.470588 0.262745 0.333333 0.929412 0.694118 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.811765 0.694118 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.811765 0.694118 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.811765 0.694118 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.184314 0.992157 0.694118 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.192157 0.996078 0.384314 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.454902 0.980392 0.219608 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.564706 0.941176 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.588235 0.776471 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.945098 0.560784 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.054902 0.952941 0.356863 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.337255 0.917647 0.109804 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.698039 0.701961 0.019608 0.4 0.662745 0.662745 0.662745 0.662745 0.662745 0.662745 0.662745 0.376471 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.090196 0.639216 0.972549 0.945098 0.913725 0.996078 0.996078 0.996078 0.996078 1 0.996078 0.996078 1 0.996078 0 0 0 0 0 0 0 0 0 0 0.007843 0.105882 0.717647 0.776471 0.905882 0.996078 0.996078 0.988235 0.980392 0.862745 0.537255 0.223529 0.223529 0.368627 0.376471 0.6 0.6 0.6 0 0 0 0 0 0 0 0 0.262745 0.470588 0.6 0.996078 0.996078 0.996078 0.996078 0.847059 0.356863 0.156863 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.909804 0.705882 0.823529 0.635294 0.490196 0.219608 0.113725 0.062745 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0.152941 0.152941 0.156863 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
</pre></div>
</div>
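<p>Fields are separated by ; characters. The first field is the label of the image and the second
field holds its pixel values. First, write the name of this data file (say, &#8216;mnist_train.txt&#8217;)
into train.list, so that train.list contains:</p>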
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="n">mnist_train</span><span class="o">.</span><span class="n">txt</span>
</pre></div>
</div>
<p>The corresponding DataProvider is:</p>
<div class="highlight-python"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre> 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25</pre></div></td><td class="code"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">paddle.trainer.PyDataProvider2</span> <span class="kn">import</span> <span class="o">*</span>


<span class="c1"># Define a py data provider</span>
<span class="nd">@provider</span><span class="p">(</span><span class="n">input_types</span><span class="o">=</span><span class="p">[</span>
    <span class="n">dense_vector</span><span class="p">(</span><span class="mi">28</span> <span class="o">*</span> <span class="mi">28</span><span class="p">),</span>
    <span class="n">integer_value</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="p">])</span>
<span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">settings</span><span class="p">,</span> <span class="n">filename</span><span class="p">):</span>  <span class="c1"># settings is not used currently.</span>
    <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span>  <span class="c1"># open one of the training files</span>

    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>  <span class="c1"># read each line</span>
        <span class="n">label</span><span class="p">,</span> <span class="n">pixel</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;;&#39;</span><span class="p">)[:</span><span class="mi">2</span><span class="p">]</span>  <span class="c1"># drop what follows the trailing &#39;;&#39;</span>

        <span class="c1"># get features and label</span>
        <span class="n">pixels_str</span> <span class="o">=</span> <span class="n">pixel</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39; &#39;</span><span class="p">)</span>

        <span class="n">pixels_float</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">each_pixel_str</span> <span class="ow">in</span> <span class="n">pixels_str</span><span class="p">:</span>
            <span class="n">pixels_float</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">each_pixel_str</span><span class="p">))</span>

        <span class="c1"># give data to paddle.</span>
        <span class="k">yield</span> <span class="n">pixels_float</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">label</span><span class="p">)</span>

    <span class="n">f</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>  <span class="c1"># close file</span>
</pre></div>
</td></tr></table></div>
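<p>The first line imports Paddle's PyDataProvider2 package. The main function is process, which
takes two parameters. The first is settings; this example does not use it, and it is described under
<a class="reference internal" href="#init-hook">init_hook</a>. The second is filename, which the Paddle process passes in; it is
one line of train.list, that is, one of the data file paths listed there.</p>
<p><code class="code docutils literal"><span class="pre">&#64;provider</span></code> is a Python <a class="reference external" href="http://www.learnpython.org/en/Decorators">Decorator</a>.
This line sets some attributes of the DataProvider and marks the process function as a DataProvider.
If you do not know what a <a class="reference external" href="http://www.learnpython.org/en/Decorators">Decorator</a> is, that is fine;
it is enough to know that it is simply a way to attach attributes.</p>
<p>The <a class="reference internal" href="#input-types">input_types</a> attribute declares what kind of data this DataProvider returns. Here it is a
dense vector with 28*28 entries and an integer value with 10 possible values, 0 through 9. For the other formats
that can be used, see the <a class="reference internal" href="#input-types">input_types</a> reference.</p>
<p>The process function is where data input is implemented. It opens the text file, reads it line by
line, converts each line into features that match <a class="reference internal" href="#input-types">input_types</a>, and returns them to the Paddle process on line 23. Note
that the values must be returned in the same order as they are declared in <a class="reference internal" href="#input-types">input_types</a>.</p>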
<p>Each return hands Paddle exactly one complete training sample, and it is done with the keyword <code class="code docutils literal"><span class="pre">yield</span></code>.
A PyDataProvider can return many training samples for one data file (as this example does) simply by
executing <code class="code docutils literal"><span class="pre">yield</span></code> multiple times in the process function. <code class="code docutils literal"><span class="pre">yield</span></code> is a Python keyword; the related
concept is the <code class="code docutils literal"><span class="pre">generator</span></code>. With this keyword, a function can return values multiple times.</p>
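<p>As a minimal sketch of how <code class="code docutils literal"><span class="pre">yield</span></code> turns a function into a generator, the snippet below parses records in the same label;pixels; layout without involving Paddle. The function name <code class="code docutils literal"><span class="pre">read_samples</span></code> and the shortened three-pixel records are made up for illustration.</p>

```python
def read_samples(lines):
    """Yield one (pixels, label) pair per input line."""
    for line in lines:
        # Each record looks like "label;p0 p1 ...;" so keep the first two fields.
        label, pixel = line.split(';')[:2]
        yield [float(p) for p in pixel.split(' ')], int(label)


# Two shortened records with three pixels each (hypothetical data).
records = ["5;0 0.5 1;\n", "0;1 0 0;\n"]
samples = list(read_samples(records))
```

<p>Calling the function does not run its body; samples are produced one at a time as the caller iterates, which is why a provider written this way never needs to hold a whole file in memory.</p>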
<p>In the training configuration, a single call is enough to make training use this DataProvider:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">paddle.trainer_config_helpers</span> <span class="kn">import</span> <span class="o">*</span>

<span class="n">define_py_data_sources2</span><span class="p">(</span><span class="n">train_list</span><span class="o">=</span><span class="s1">&#39;train.list&#39;</span><span class="p">,</span>
                        <span class="n">test_list</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
                        <span class="n">module</span><span class="o">=</span><span class="s1">&#39;mnist_provider&#39;</span><span class="p">,</span>
                        <span class="n">obj</span><span class="o">=</span><span class="s1">&#39;process&#39;</span><span class="p">)</span>
</pre></div>
</div>
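<p>This declares that the training data is &#8216;train.list&#8217; and that there is no test data. The DataProvider is the
&#8216;process&#8217; function in the &#8216;mnist_provider&#8217; module.</p>
<p>That completes the simple PyDataProvider example. To send data to Paddle, you only need to
know how to read <strong>one</strong> sample at a time from <strong>one file</strong>. The Paddle process takes care of the rest:</p>
<ul class="simple">
<li>Combining samples into batches for training</li>
<li>Shuffling the training data</li>
<li>Reading data with multiple threads</li>
<li>Caching training data in memory (optional)</li>
<li>CPU-&gt;GPU double buffering</li>
</ul>
<p>Simple, isn't it?</p>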
</div>
<div class="section" id="id3">
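<h2>Providing Data for Sequence Models<a class="headerlink" href="#id3" title="Permalink to this headline"></a></h2>
<p>A sequence model is one in which some dimension of the data is a sequence, that is, it carries
time-step information. Time steps need not relate to time; they only mean that the order of the
data matters. Text, for example, is sequence data.</p>
<p>The example here is English sentiment classification: given a piece of English text, classify it as
positive or negative sentiment (encoded as 0 and 1). Sample data:</p>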
<div class="highlight-python"><div class="highlight"><pre><span></span>0       I saw this movie at the AFI Dallas festival . It all takes place at a lake house and it looks wonderful .
1       This documentary makes you travel all around the globe . It contains rare and stunning sequels from the wilderness .
...
</pre></div>
</div>
<p>The DataProvider can be written as:</p>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">paddle.trainer.PyDataProvider2</span> <span class="kn">import</span> <span class="o">*</span>


<span class="k">def</span> <span class="nf">on_init</span><span class="p">(</span><span class="n">settings</span><span class="p">,</span> <span class="n">dictionary</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="c1"># on_init is invoked when the data provider is initialized. The dictionary</span>
    <span class="c1"># is passed from trainer_config, and is a dict object with type</span>
    <span class="c1"># (word string =&gt; word id).</span>

    <span class="c1"># set input types at runtime. This does the same thing as</span>
    <span class="c1"># @provider(input_types) does, but it is set dynamically during runtime.</span>
    <span class="n">settings</span><span class="o">.</span><span class="n">input_types</span> <span class="o">=</span> <span class="p">[</span>
        <span class="c1"># The text is a sequence of integer values, and each value is a word id.</span>
        <span class="c1"># The whole sequence is the sentence whose sentiment we want to</span>
        <span class="c1"># predict.</span>
        <span class="n">integer_value</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dictionary</span><span class="p">),</span> <span class="n">seq_type</span><span class="o">=</span><span class="n">SequenceType</span><span class="o">.</span><span class="n">SEQUENCE</span><span class="p">),</span>  <span class="c1"># text input</span>

        <span class="c1"># label positive/negative</span>
        <span class="n">integer_value</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="p">]</span>

    <span class="c1"># save dictionary as settings.dictionary. It will be used in process</span>
    <span class="c1"># method.</span>
    <span class="n">settings</span><span class="o">.</span><span class="n">dictionary</span> <span class="o">=</span> <span class="n">dictionary</span>


<span class="nd">@provider</span><span class="p">(</span><span class="n">init_hook</span><span class="o">=</span><span class="n">on_init</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">settings</span><span class="p">,</span> <span class="n">filename</span><span class="p">):</span>
    <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>  <span class="c1"># read each line of file</span>
        <span class="n">label</span><span class="p">,</span> <span class="n">sentence</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\t</span><span class="s1">&#39;</span><span class="p">)</span>  <span class="c1"># get label and sentence</span>
        <span class="n">words</span> <span class="o">=</span> <span class="n">sentence</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39; &#39;</span><span class="p">)</span>  <span class="c1"># get words</span>

        <span class="c1"># convert word string to word id</span>
        <span class="c1"># words not in the dictionary are ignored.</span>
        <span class="n">word_ids</span> <span class="o">=</span> <span class="p">[]</span>

        <span class="k">for</span> <span class="n">each_word</span> <span class="ow">in</span> <span class="n">words</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">each_word</span> <span class="ow">in</span> <span class="n">settings</span><span class="o">.</span><span class="n">dictionary</span><span class="p">:</span>
                <span class="n">word_ids</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">settings</span><span class="o">.</span><span class="n">dictionary</span><span class="p">[</span><span class="n">each_word</span><span class="p">])</span>

        <span class="c1"># give data to paddle.</span>
        <span class="k">yield</span> <span class="n">word_ids</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">label</span><span class="p">)</span>

    <span class="n">f</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</pre></div>
</div>
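<p>This sequence example is more involved, mainly because it adds an initialization mechanism. The <code class="code docutils literal"><span class="pre">on_init</span></code> function is
attached to the DataProvider through the <a class="reference internal" href="#init-hook">init_hook</a> parameter of <a class="reference internal" href="#provider">&#64;provider</a>. It runs when the
DataProvider is created. The initialization function takes the following parameters:</p>
<ul class="simple">
<li>The first parameter is the settings object.</li>
<li>All other parameters are passed as keyword arguments. Some of them are generated by Paddle automatically;
see <a class="reference internal" href="#init-hook">init_hook</a>. The <code class="code docutils literal"><span class="pre">dictionary</span></code> here is a dict object passed in from the training configuration,
mapping word strings to word ids.</li>
</ul>
<p>The variable is passed in as follows:</p>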
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">paddle.trainer_config_helpers</span> <span class="kn">import</span> <span class="o">*</span>

<span class="n">dictionary</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="o">...</span>  <span class="c1">#  read dictionary from outside</span>

<span class="n">define_py_data_sources2</span><span class="p">(</span><span class="n">train_list</span><span class="o">=</span><span class="s1">&#39;train.list&#39;</span><span class="p">,</span> <span class="n">test_list</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
                        <span class="n">module</span><span class="o">=</span><span class="s1">&#39;sentimental_provider&#39;</span><span class="p">,</span> <span class="n">obj</span><span class="o">=</span><span class="s1">&#39;process&#39;</span><span class="p">,</span>
                        <span class="c1"># the arguments above are the same as in the mnist sample.</span>
                        <span class="n">args</span><span class="o">=</span><span class="p">{</span>  <span class="c1"># pass to provider.</span>
                            <span class="s1">&#39;dictionary&#39;</span><span class="p">:</span> <span class="n">dictionary</span>
                        <span class="p">})</span>
</pre></div>
</div>
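<p>This declaration is essentially the same as in the MNIST example, except that:</p>
<ul class="simple">
<li>The dictionary is read in the configuration.</li>
<li>The dictionary is passed as an argument when the DataProvider is declared.</li>
</ul>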
<p>The <code class="code docutils literal"><span class="pre">on_init</span></code> function sets <cite>input_types</cite>. This has the same effect as setting
<cite>input_types</cite> in <a class="reference internal" href="#provider">&#64;provider</a>, but because <cite>on_init</cite> sets <cite>input_types</cite> at runtime,
the input types can be chosen according to the data. The input feature here is a sequence of word ids, so <code class="code docutils literal"><span class="pre">seq_type</span></code>
is set to a sequence (the <code class="code docutils literal"><span class="pre">integer_sequence</span></code> type could be used instead).</p>
<p>The dictionary is also stored in the settings object so that the <code class="code docutils literal"><span class="pre">process</span></code> function can use it. The settings in <code class="code docutils literal"><span class="pre">process</span></code>
and the settings in <code class="code docutils literal"><span class="pre">on_init</span></code> are the same object.</p>
<p>The logic of the <code class="code docutils literal"><span class="pre">process</span></code> function is essentially the same as in the MNIST example: it returns each sample in the file in turn.</p>
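<p>The per-line work of process can be tried without Paddle. The sketch below is an illustration only: the helper name <code class="code docutils literal"><span class="pre">sentence_to_sample</span></code> and the three-word dictionary are hypothetical, and the input line uses the tab-separated label and sentence layout from the sample data.</p>

```python
def sentence_to_sample(line, dictionary):
    """Turn one line (label, a tab, then the sentence) into (word_ids, label)."""
    label, sentence = line.rstrip('\n').split('\t')
    # Words missing from the dictionary are skipped, as in process above.
    word_ids = [dictionary[w] for w in sentence.split(' ') if w in dictionary]
    return word_ids, int(label)


# Hypothetical three-word dictionary (word string to word id).
toy_dictionary = {'this': 0, 'movie': 1, '.': 2}
sample = sentence_to_sample('1\tthis great movie .\n', toy_dictionary)
```

<p>The word great is not in the toy dictionary, so it contributes no id; the resulting list of ids is the sequence that matches the first entry of input_types.</p>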
<p>This concludes the basic introduction to PyDataProvider. For the other features a DataProvider offers, see the Reference section below.</p>
</div>
<div class="section" id="reference">
<h2>Reference<a class="headerlink" href="#reference" title="Permalink to this headline"></a></h2>
<div class="section" id="provider">
<h3>&#64;provider<a class="headerlink" href="#provider" title="Permalink to this headline"></a></h3>
<p><code class="code docutils literal"><span class="pre">&#64;provider</span></code> is a Python <a class="reference external" href="http://www.learnpython.org/en/Decorators">Decorator</a> that marks a function as a PyDataProvider. It accepts the following parameters:</p>
<ul class="simple">
<li><a class="reference internal" href="#input-types">input_types</a> is the input data format. For the available formats, see <a class="reference internal" href="#input-types">input_types</a>.</li>
<li>should_shuffle controls whether this DataProvider shuffles the data. If it is not set, data is shuffled during training
and not shuffled during testing.</li>
<li>pool_size is the number of samples the DataProvider holds in memory. A value of -1 places no limit on how many samples are held.</li>
<li>can_over_batch_size indicates whether Paddle may hold slightly more data than pool_size. Allowing this avoids many deadlock problems,
so True is generally recommended.</li>
<li>calc_batch_size takes a function that receives one sample and returns the batch size that sample counts for. By default each sample
counts as a batch size of 1, but to balance computation a single sample can be made to count as more than one.</li>
<li>cache is the data caching strategy; see <a class="reference internal" href="#cache">cache</a>.</li>
<li>init_hook is a function called at initialization; see <a class="reference internal" href="#init-hook">init_hook</a>.</li>
</ul>
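<p>As a hedged illustration of calc_batch_size, the function below weighs a sequence sample by its number of words. It assumes that the sample handed to the function is the (word_ids, label) pair yielded by process; the name <code class="code docutils literal"><span class="pre">count_words</span></code> is hypothetical.</p>

```python
def count_words(sample):
    """Return how much one sample counts toward the batch size."""
    # Assumption: a sample is the (word_ids, label) pair yielded by process.
    word_ids, _label = sample
    # Never report zero, so an empty sentence still counts as one unit.
    return max(1, len(word_ids))


# It would then be passed to the decorator, for example:
# @provider(init_hook=on_init, calc_batch_size=count_words)
short_cost = count_words(([3, 8], 0))
long_cost = count_words(([3, 8, 1, 4, 9], 1))
```

<p>With such a function, a batch fills up after fewer long sentences than short ones, which keeps the amount of computation per batch more even.</p>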
</div>
<div class="section" id="input-types">
<h3>input_types<a class="headerlink" href="#input-types" title="Permalink to this headline"></a></h3>
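<p>Paddle data comes in four main types and three sequence modes. The four data types are:</p>
<ul class="simple">
<li>dense_vector: a dense vector of floating-point numbers.</li>
<li>sparse_binary_vector: a sparse 0/1 vector; most entries are 0 and the remaining entries can only be 1.</li>
<li>sparse_float_vector: a sparse vector; most entries are 0 and the remaining entries can be any floating-point number.</li>
<li>integer_value: an integer label.</li>
</ul>
<p>The three sequence modes are:</p>
<ul class="simple">
<li>SequenceType.NO_SEQUENCE: not a sequence.</li>
<li>SequenceType.SEQUENCE: a time sequence.</li>
<li>SequenceType.SUB_SEQUENCE: a time sequence in which each element is itself a time sequence.</li>
</ul>
<p>The format to return depends on the data type and the sequence mode, as listed below:</p>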
<table border="1" class="docutils">
<colgroup>
<col width="17%" />
<col width="17%" />
<col width="28%" />
<col width="38%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">&nbsp;</th>
<th class="head">NO_SEQUENCE</th>
<th class="head">SEQUENCE</th>
<th class="head">SUB_SEQUENCE</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>dense_vector</td>
<td>[f, f, ...]</td>
<td>[[f, ...], [f, ...], ...]</td>
<td>[[[f, ...], ...], [[f, ...], ...],...]</td>
</tr>
<tr class="row-odd"><td>sparse_binary_vector</td>
<td>[i, i, ...]</td>
<td>[[i, ...], [i, ...], ...]</td>
<td>[[[i, ...], ...], [[i, ...], ...],...]</td>
</tr>
<tr class="row-even"><td>sparse_float_vector</td>
<td>[(i,f), (i,f), ...]</td>
<td>[[(i,f), ...], [(i,f), ...], ...]</td>
<td>[[[(i,f), ...], ...], [[(i,f), ...], ...],...]</td>
</tr>
<tr class="row-odd"><td>integer_value</td>
<td>i</td>
<td>[i, i, ...]</td>
<td>[[i, ...], [i, ...], ...]</td>
</tr>
</tbody>
</table>
<p>Here f stands for a floating-point number and i for an integer.</p>
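<p>To make the sparse layouts in the table concrete, the sketch below converts a dense list into both sparse forms. It assumes that each i in a sparse sample is the index of a non-zero entry; the helper names are hypothetical and not part of PyDataProvider2.</p>

```python
def to_sparse_binary(dense):
    """Return the indices of the non-zero entries: [i, i, ...]."""
    return [i for i, value in enumerate(dense) if value != 0]


def to_sparse_float(dense):
    """Return (index, value) pairs for the non-zero entries: [(i, f), ...]."""
    return [(i, float(value)) for i, value in enumerate(dense) if value != 0]


dense = [0, 0.5, 0, 0, 1.0]
binary_sample = to_sparse_binary([0, 1, 0, 0, 1])
float_sample = to_sparse_float(dense)
# Wrapping samples in a list gives the SEQUENCE layout: [[(i, f), ...], ...]
float_sequence = [to_sparse_float(dense), to_sparse_float([2.0, 0, 0, 0, 0])]
```

<p>Nesting one level deeper in the same way produces the SUB_SEQUENCE column of the table.</p>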
</div>
<div class="section" id="init-hook">
<h3>init_hook<a class="headerlink" href="#init-hook" title="Permalink to this headline"></a></h3>
<p>init_hook accepts a function that is called at initialization. Its parameters are:</p>
<ul>
<li><dl class="first docutils">
<dt>The first parameter is the settings object, the same object that process receives as its first parameter. Its attributes include:</dt>
<dd><ul class="first last simple">
<li>settings.input_types: sets the input types; see <a class="reference internal" href="#input-types">input_types</a>.</li>
<li>settings.logger: a logging object.</li>
</ul>
</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt>All other parameters are passed as keyword arguments. They include parameters defined by Paddle and parameters supplied by the user.</dt>
<dd><ul class="first last">
<li><dl class="first docutils">
<dt>Parameters defined by Paddle:</dt>
<dd><ul class="first last simple">
<li>is_train: a bool indicating whether this DataProvider is used for training or for
testing.</li>
<li>file_list: the list of all files.</li>
</ul>
</dd>
</dl>
</li>
<li><p class="first">User-defined parameters are set through args in the training configuration.</p>
</li>
</ul>
</dd>
</dl>
</li>
</ul>
<p>Note: Paddle reserves the right to add parameters, so init_hook should declare <code class="code docutils literal"><span class="pre">**kwargs</span></code> to accept the
parameters it does not use and remain compatible.</p>
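<p>The effect of <code class="code docutils literal"><span class="pre">**kwargs</span></code> can be checked in plain Python. In this sketch, <code class="code docutils literal"><span class="pre">FakeSettings</span></code> is a stand-in for the settings object that Paddle would create, and <code class="code docutils literal"><span class="pre">some_future_option</span></code> is a made-up name for a parameter that a later version might add.</p>

```python
class FakeSettings(object):
    """Stand-in for the settings object; Paddle creates the real one."""


def on_init(settings, dictionary, **kwargs):
    # 'dictionary' is the user argument; everything else lands in kwargs.
    settings.dictionary = dictionary
    settings.is_train = kwargs.get('is_train', True)


settings = FakeSettings()
# 'some_future_option' is a made-up name standing for a parameter that a
# later Paddle version might add; **kwargs absorbs it without an error.
on_init(settings, dictionary={'hello': 0}, is_train=False,
        file_list=['train.txt'], some_future_option=42)
```

<p>Without <code class="code docutils literal"><span class="pre">**kwargs</span></code> in the signature, the same call would raise a TypeError for every keyword the hook does not name.</p>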
</div>
<div class="section" id="cache">
<h3>cache<a class="headerlink" href="#cache" title="Permalink to this headline"></a></h3>
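<p>DataProvider offers two simple cache strategies:</p>
<ul class="simple">
<li>CacheType.NO_CACHE: caches nothing; data is read from the Python side on every pass.</li>
<li>CacheType.CACHE_PASS_IN_MEM: the first pass reads data from the Python side, and the remaining passes read
it directly from memory.</li>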
</ul>
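<p>The idea behind CACHE_PASS_IN_MEM can be sketched in a few lines of plain Python. This is a toy model of the behavior described above, not Paddle's implementation; the class <code class="code docutils literal"><span class="pre">PassCache</span></code> and the reader function are hypothetical.</p>

```python
class PassCache(object):
    """Toy cache: read from the source on pass one, from memory afterwards."""

    def __init__(self, read_fn):
        self.read_fn = read_fn  # callable returning an iterable of samples
        self.cached = None

    def one_pass(self):
        if self.cached is None:
            # First pass: pull every sample from the (slow) Python reader.
            self.cached = list(self.read_fn())
        return list(self.cached)


calls = []


def slow_reader():
    calls.append('read')  # record each time the source is actually read
    yield [0.0, 1.0], 1
    yield [1.0, 0.0], 0


cache = PassCache(slow_reader)
first = cache.one_pass()
second = cache.one_pass()
```

<p>The second pass returns the same samples without invoking the reader again, at the cost of keeping every sample in memory.</p>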
</div>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
  <h3><a href="../../index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Using PyDataProvider2</a><ul>
<li><a class="reference internal" href="#id1">A Simple Use Case</a></li>
<li><a class="reference internal" href="#id3">Providing Data for Sequence Models</a></li>
<li><a class="reference internal" href="#reference">Reference</a><ul>
<li><a class="reference internal" href="#provider">&#64;provider</a></li>
<li><a class="reference internal" href="#input-types">input_types</a></li>
<li><a class="reference internal" href="#init-hook">init_hook</a></li>
<li><a class="reference internal" href="#cache">cache</a></li>
</ul>
</li>
</ul>
</li>
</ul>

  <h4>Previous topic</h4>
  <p class="topless"><a href="index.html"
                        title="previous chapter">Introduction to Paddle's DataProvider</a></p>
  <h4>Next topic</h4>
  <p class="topless"><a href="write_new_dataprovider.html"
                        title="next chapter">Writing a Custom DataProvider</a></p>
  <div role="note" aria-label="source link">
    <h3>This Page</h3>
    <ul class="this-page-menu">
      <li><a href="../../_sources/ui/data_provider/pydataprovider2.txt"
            rel="nofollow">Show Source</a></li>
    </ul>
   </div>
<div id="searchbox" style="display: none" role="search">
  <h3>Quick search</h3>
    <form class="search" action="../../search.html" method="get">
      <input type="text" name="q" />
      <input type="submit" value="Go" />
      <input type="hidden" name="check_keywords" value="yes" />
      <input type="hidden" name="area" value="default" />
    </form>
    <p class="searchtip" style="font-size: 90%">
    Enter search terms or a module, class or function name.
    </p>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="../../genindex.html" title="General Index"
             >index</a></li>
        <li class="right" >
          <a href="write_new_dataprovider.html" title="Writing a Custom DataProvider"
             >next</a> |</li>
        <li class="right" >
          <a href="index.html" title="Introduction to Paddle's DataProvider"
             >previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="../../index.html">PADDLE  documentation</a> &raquo;</li>
          <li class="nav-item nav-item-1"><a href="../index.html" >Configuration</a> &raquo;</li>
          <li class="nav-item nav-item-2"><a href="index.html" >Introduction to Paddle's DataProvider</a> &raquo;</li> 
      </ul>
    </div>
    <div class="footer" role="contentinfo">
        &copy; Copyright 2016, PADDLE developers.
      Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.3.5.
    </div>
  </body>
</html>