Commit e4880016 authored by liaogang

Merge conflict with hl_cuda_device.cc

......@@ -7,18 +7,14 @@
hooks:
- id: yapf
- repo: https://github.com/pre-commit/pre-commit-hooks
sha: 4ef03c4223ad322c7adaa6c6c0efb26b57df3b71
sha: 7539d8bd1a00a3c1bfd34cdb606d3a6372e83469
hooks:
- id: check-added-large-files
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
- id: end-of-file-fixer
# TODO(yuyang): trailing whitespace has some bugs on markdown
# files now, please do not add it to the pre-commit hooks yet
# - id: trailing-whitespace
#
# TODO(yuyang): debug-statements not fit for Paddle, because
# not all of our python code is runnable. Some are used for
# documentation
# - id: debug-statements
- repo: https://github.com/PaddlePaddle/clang-format-pre-commit-hook.git
sha: 28c0ea8a67a3e2dbbf4822ef44e85b63a0080a29
hooks:
- id: clang-formater
......@@ -2,8 +2,8 @@ cmake_minimum_required(VERSION 2.8)
project(paddle CXX C)
set(PADDLE_MAJOR_VERSION 0)
set(PADDLE_MINOR_VERSION 8)
set(PADDLE_PATCH_VERSION 0b3)
set(PADDLE_MINOR_VERSION 9)
set(PADDLE_PATCH_VERSION 0a0)
set(PADDLE_VERSION ${PADDLE_MAJOR_VERSION}.${PADDLE_MINOR_VERSION}.${PADDLE_PATCH_VERSION})
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake")
......
# PaddlePaddle
[![Build Status](https://travis-ci.org/baidu/Paddle.svg?branch=master)](https://travis-ci.org/baidu/Paddle)
[![Coverage Status](https://coveralls.io/repos/github/baidu/Paddle/badge.svg?branch=develop)](https://coveralls.io/github/baidu/Paddle?branch=develop)
[![Join the chat at https://gitter.im/PaddlePaddle/Deep_Learning](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/PaddlePaddle/Deep_Learning?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)
[![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle)
[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://www.paddlepaddle.org/)
[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://www.paddlepaddle.org/cn/index.html)
[![Coverage Status](https://coveralls.io/repos/github/PaddlePaddle/Paddle/badge.svg?branch=develop)](https://coveralls.io/github/PaddlePaddle/Paddle?branch=develop)
[![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases)
[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
Welcome to the PaddlePaddle GitHub.
......@@ -14,7 +17,7 @@ developed by Baidu scientists and engineers for the purpose of applying deep
learning to many products at Baidu.
Our vision is to enable deep learning for everyone via PaddlePaddle.
Please refer to our [release announcement](https://github.com/baidu/Paddle/releases) to track the latest feature of PaddlePaddle.
Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddle/releases) to track the latest feature of PaddlePaddle.
## Features
......@@ -26,15 +29,15 @@ Please refer to our [release announcement](https://github.com/baidu/Paddle/relea
connection.
- **Efficiency**
In order to unleash the power of heterogeneous computing resource,
optimization occurs at different levels of PaddlePaddle, including
computing, memory, architecture and communication. The following are some
examples:
- Optimized math operations through SSE/AVX intrinsics, BLAS libraries
(e.g. MKL, ATLAS, cuBLAS) or customized CPU/GPU kernels.
- Highly optimized recurrent networks which can handle **variable-length**
sequence without padding.
- Optimized local and distributed training for models with high dimensional
sparse data.
......@@ -57,41 +60,39 @@ Please refer to our [release announcement](https://github.com/baidu/Paddle/relea
## Installation
Check out the [Install Guide](http://paddlepaddle.org/doc/build/) to install from
pre-built packages (**docker image**, **deb package**) or
directly build on **Linux** and **Mac OS X** from the source code.
## Documentation
Both [English Docs](http://paddlepaddle.org/doc/) and [Chinese Docs](http://paddlepaddle.org/doc_cn/) are provided for our users and developers.
- [Quick Start](http://paddlepaddle.org/doc/demo/quick_start/index_en) <br>
You can follow the quick start tutorial to learn how to use PaddlePaddle
step-by-step.
- [Example and Demo](http://paddlepaddle.org/doc/demo/) <br>
We provide five demos, including: image classification, sentiment analysis,
sequence to sequence model, recommendation, semantic role labeling.
- [Distributed Training](http://paddlepaddle.org/doc/cluster) <br>
This system supports training deep learning models on multiple machines
with data parallelism.
- [Python API](http://paddlepaddle.org/doc/ui/) <br>
PaddlePaddle supports building your system with either the Python interface
or C++. We also use SWIG to wrap the C++ source code into a user-friendly
interface for Python. You can use SWIG to create an interface for your
favorite programming language as well.
- [How to Contribute](http://paddlepaddle.org/doc/build/contribute_to_paddle.html) <br>
We sincerely appreciate your interest and contributions. If you would like to
contribute, please read the contribution guide.
- [Source Code Documents](http://paddlepaddle.org/doc/source/) <br>
## Ask Questions
Please join the [**gitter chat**](https://gitter.im/PaddlePaddle/Deep_Learning) or send email to
**paddle-dev@baidu.com** to ask questions and talk about methods and models.
Framework development discussions and
bug reports are collected on [Issues](https://github.com/baidu/paddle/issues).
You are welcome to submit questions and bug reports as [Github Issues](https://github.com/PaddlePaddle/Paddle/issues).
## Copyright and License
PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
......@@ -17,24 +17,15 @@ import os
from optparse import OptionParser
def extract_dict_features(pair_file, feature_file, src_dict_file,
tgt_dict_file):
src_dict = set()
tgt_dict = set()
with open(pair_file) as fin, open(feature_file, 'w') as feature_out, open(
src_dict_file, 'w') as src_dict_out, open(tgt_dict_file,
'w') as tgt_dict_out:
def extract_dict_features(pair_file, feature_file):
with open(pair_file) as fin, open(feature_file, 'w') as feature_out:
for line in fin:
sentence, labels = line.strip().split('\t')
sentence, predicate, labels = line.strip().split('\t')
sentence_list = sentence.split()
labels_list = labels.split()
src_dict.update(sentence_list)
tgt_dict.update(labels_list)
verb_index = labels_list.index('B-V')
verb_feature = sentence_list[verb_index]
mark = [0] * len(labels_list)
if verb_index > 0:
......@@ -42,47 +33,50 @@ def extract_dict_features(pair_file, feature_file, src_dict_file,
ctx_n1 = sentence_list[verb_index - 1]
else:
ctx_n1 = 'bos'
ctx_n1_feature = ctx_n1
if verb_index > 1:
mark[verb_index - 2] = 1
ctx_n2 = sentence_list[verb_index - 2]
else:
ctx_n2 = 'bos'
mark[verb_index] = 1
ctx_0_feature = sentence_list[verb_index]
ctx_0 = sentence_list[verb_index]
if verb_index < len(labels_list) - 2:
mark[verb_index + 1] = 1
ctx_p1 = sentence_list[verb_index + 1]
else:
ctx_p1 = 'eos'
ctx_p1_feature = ctx_p1
if verb_index < len(labels_list) - 3:
mark[verb_index + 2] = 1
ctx_p2 = sentence_list[verb_index + 2]
else:
ctx_p2 = 'eos'
feature_str = sentence + '\t' \
+ verb_feature + '\t' \
+ ctx_n1_feature + '\t' \
+ ctx_0_feature + '\t' \
+ ctx_p1_feature + '\t' \
+ predicate + '\t' \
+ ctx_n2 + '\t' \
+ ctx_n1 + '\t' \
+ ctx_0 + '\t' \
+ ctx_p1 + '\t' \
+ ctx_p2 + '\t' \
+ ' '.join([str(i) for i in mark]) + '\t' \
+ labels
feature_out.write(feature_str + '\n')
src_dict_out.write('<unk>\n')
src_dict_out.write('\n'.join(list(src_dict)))
tgt_dict_out.write('\n'.join(list(tgt_dict)))
if __name__ == '__main__':
usage = '-p pair_file -f feature_file -s source dictionary -t target dictionary '
usage = '-p pair_file -f feature_file'
parser = OptionParser(usage)
parser.add_option('-p', dest='pair_file', help='the pair file')
parser.add_option(
'-f', dest='feature_file', help='the file to store feature')
parser.add_option(
'-s', dest='src_dict', help='the file to store source dictionary')
parser.add_option(
'-t', dest='tgt_dict', help='the file to store target dictionary')
parser.add_option('-f', dest='feature_file', help='the feature file')
(options, args) = parser.parse_args()
extract_dict_features(options.pair_file, options.feature_file,
options.src_dict, options.tgt_dict)
extract_dict_features(options.pair_file, options.feature_file)
......@@ -51,7 +51,7 @@ def read_sentences(words_file):
for line in fin:
line = line.strip()
if line == '':
sentences.append(s.lower())
sentences.append(s)
s = ''
else:
s += line + ' '
......@@ -64,6 +64,11 @@ def transform_labels(sentences, labels):
if len(labels[i]) == 1:
continue
else:
verb_list = []
for x in labels[i][0]:
if x != '-':
verb_list.append(x)
for j in xrange(1, len(labels[i])):
label_list = labels[i][j]
current_tag = 'O'
......@@ -88,8 +93,7 @@ def transform_labels(sentences, labels):
is_in_bracket = True
else:
print 'error:', ll
sen_lab_pair.append((sentences[i], label_seq))
sen_lab_pair.append((sentences[i], verb_list[j-1], label_seq))
return sen_lab_pair
......@@ -97,9 +101,9 @@ def write_file(sen_lab_pair, output_file):
with open(output_file, 'w') as fout:
for x in sen_lab_pair:
sentence = x[0]
label_seq = ' '.join(x[1])
assert len(sentence.split()) == len(x[1])
fout.write(sentence + '\t' + label_seq + '\n')
label_seq = ' '.join(x[2])
assert len(sentence.split()) == len(x[2])
fout.write(sentence + '\t' + x[1]+'\t' +label_seq + '\n')
if __name__ == '__main__':
......
......@@ -14,6 +14,10 @@
# limitations under the License.
set -e
wget http://www.cs.upc.edu/~srlconll/conll05st-tests.tar.gz
wget https://www.googledrive.com/host/0B7Q8d52jqeI9ejh6Q1RpMTFQT1k/semantic_role_labeling/verbDict.txt --no-check-certificate
wget https://www.googledrive.com/host/0B7Q8d52jqeI9ejh6Q1RpMTFQT1k/semantic_role_labeling/targetDict.txt --no-check-certificate
wget https://www.googledrive.com/host/0B7Q8d52jqeI9ejh6Q1RpMTFQT1k/semantic_role_labeling/wordDict.txt --no-check-certificate
wget https://www.googledrive.com/host/0B7Q8d52jqeI9ejh6Q1RpMTFQT1k/semantic_role_labeling/emb --no-check-certificate
tar -xzvf conll05st-tests.tar.gz
rm conll05st-tests.tar.gz
cp ./conll05st-release/test.wsj/words/test.wsj.words.gz .
......@@ -22,4 +26,4 @@ gunzip test.wsj.words.gz
gunzip test.wsj.props.gz
python extract_pairs.py -w test.wsj.words -p test.wsj.props -o test.wsj.seq_pair
python extract_dict_feature.py -p test.wsj.seq_pair -f feature -s src.dict -t tgt.dict
python extract_dict_feature.py -p test.wsj.seq_pair -f feature
......@@ -17,11 +17,15 @@ from paddle.trainer.PyDataProvider2 import *
UNK_IDX = 0
def hook(settings, word_dict, label_dict, **kwargs):
def hook(settings, word_dict, label_dict, predicate_dict, **kwargs):
settings.word_dict = word_dict
settings.label_dict = label_dict
settings.predicate_dict = predicate_dict
#all inputs are integral and sequential type
settings.slots = [
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(predicate_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
......@@ -31,27 +35,33 @@ def hook(settings, word_dict, label_dict, **kwargs):
]
@provider(init_hook=hook)
def process(obj, file_name):
def get_batch_size(yield_data):
return len(yield_data[0])
@provider(init_hook=hook, should_shuffle=True, calc_batch_size=get_batch_size,
can_over_batch_size=False, cache=CacheType.CACHE_PASS_IN_MEM)
def process(settings, file_name):
with open(file_name, 'r') as fdata:
for line in fdata:
sentence, predicate, ctx_n1, ctx_0, ctx_p1, mark, label = \
sentence, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, label = \
line.strip().split('\t')
words = sentence.split()
sen_len = len(words)
word_slot = [obj.word_dict.get(w, UNK_IDX) for w in words]
word_slot = [settings.word_dict.get(w, UNK_IDX) for w in words]
predicate_slot = [obj.word_dict.get(predicate, UNK_IDX)] * sen_len
ctx_n1_slot = [obj.word_dict.get(ctx_n1, UNK_IDX)] * sen_len
ctx_0_slot = [obj.word_dict.get(ctx_0, UNK_IDX)] * sen_len
ctx_p1_slot = [obj.word_dict.get(ctx_p1, UNK_IDX)] * sen_len
predicate_slot = [settings.predicate_dict.get(predicate)] * sen_len
ctx_n2_slot = [settings.word_dict.get(ctx_n2, UNK_IDX)] * sen_len
ctx_n1_slot = [settings.word_dict.get(ctx_n1, UNK_IDX)] * sen_len
ctx_0_slot = [settings.word_dict.get(ctx_0, UNK_IDX)] * sen_len
ctx_p1_slot = [settings.word_dict.get(ctx_p1, UNK_IDX)] * sen_len
ctx_p2_slot = [settings.word_dict.get(ctx_p2, UNK_IDX)] * sen_len
marks = mark.split()
mark_slot = [int(w) for w in marks]
label_list = label.split()
label_slot = [obj.label_dict.get(w) for w in label_list]
yield word_slot, predicate_slot, ctx_n1_slot, \
ctx_0_slot, ctx_p1_slot, mark_slot, label_slot
label_slot = [settings.label_dict.get(w) for w in label_list]
yield word_slot, predicate_slot, ctx_n2_slot, ctx_n1_slot, \
ctx_0_slot, ctx_p1_slot, ctx_p2_slot, mark_slot, label_slot
......@@ -18,8 +18,9 @@ import sys
from paddle.trainer_config_helpers import *
#file paths
word_dict_file = './data/src.dict'
label_dict_file = './data/tgt.dict'
word_dict_file = './data/wordDict.txt'
label_dict_file = './data/targetDict.txt'
predicate_file= './data/verbDict.txt'
train_list_file = './data/train.list'
test_list_file = './data/test.list'
......@@ -30,8 +31,10 @@ if not is_predict:
#load dictionaries
word_dict = dict()
label_dict = dict()
predicate_dict = dict()
with open(word_dict_file, 'r') as f_word, \
open(label_dict_file, 'r') as f_label:
open(label_dict_file, 'r') as f_label, \
open(predicate_file, 'r') as f_pre:
for i, line in enumerate(f_word):
w = line.strip()
word_dict[w] = i
......@@ -40,6 +43,11 @@ if not is_predict:
w = line.strip()
label_dict[w] = i
for i, line in enumerate(f_pre):
w = line.strip()
predicate_dict[w] = i
if is_test:
train_list_file = None
......@@ -50,91 +58,157 @@ if not is_predict:
module='dataprovider',
obj='process',
args={'word_dict': word_dict,
'label_dict': label_dict})
'label_dict': label_dict,
'predicate_dict': predicate_dict })
word_dict_len = len(word_dict)
label_dict_len = len(label_dict)
pred_len = len(predicate_dict)
else:
word_dict_len = get_config_arg('dict_len', int)
label_dict_len = get_config_arg('label_len', int)
pred_len = get_config_arg('pred_len', int)
############################## Hyper-parameters ##################################
mark_dict_len = 2
word_dim = 32
mark_dim = 5
hidden_dim = 128
hidden_dim = 512
depth = 8
emb_lr = 1e-2
fc_lr = 1e-2
lstm_lr = 2e-2
########################### Optimizer #######################################
settings(
batch_size=150,
learning_method=AdamOptimizer(),
learning_rate=1e-3,
learning_method=MomentumOptimizer(momentum=0),
learning_rate=2e-2,
regularization=L2Regularization(8e-4),
gradient_clipping_threshold=25)
is_async=False,
model_average=ModelAverage(average_window=0.5,
max_average_window=10000),
)
#6 features
####################################### network ##############################
#8 features and 1 target
word = data_layer(name='word_data', size=word_dict_len)
predicate = data_layer(name='verb_data', size=word_dict_len)
predicate = data_layer(name='verb_data', size=pred_len)
ctx_n2 = data_layer(name='ctx_n2_data', size=word_dict_len)
ctx_n1 = data_layer(name='ctx_n1_data', size=word_dict_len)
ctx_0 = data_layer(name='ctx_0_data', size=word_dict_len)
ctx_p1 = data_layer(name='ctx_p1_data', size=word_dict_len)
ctx_p2 = data_layer(name='ctx_p2_data', size=word_dict_len)
mark = data_layer(name='mark_data', size=mark_dict_len)
if not is_predict:
target = data_layer(name='target', size=label_dict_len)
ptt = ParameterAttribute(name='src_emb', learning_rate=emb_lr)
layer_attr = ExtraLayerAttribute(drop_rate=0.5)
fc_para_attr = ParameterAttribute(learning_rate=fc_lr)
lstm_para_attr = ParameterAttribute(initial_std=0., learning_rate=lstm_lr)
para_attr = [fc_para_attr, lstm_para_attr]
word_embedding = embedding_layer(size=word_dim, input=word, param_attr=ptt)
predicate_embedding = embedding_layer(
size=word_dim, input=predicate, param_attr=ptt)
ctx_n1_embedding = embedding_layer(size=word_dim, input=ctx_n1, param_attr=ptt)
ctx_0_embedding = embedding_layer(size=word_dim, input=ctx_0, param_attr=ptt)
ctx_p1_embedding = embedding_layer(size=word_dim, input=ctx_p1, param_attr=ptt)
mark_embedding = embedding_layer(size=mark_dim, input=mark)
default_std=1/math.sqrt(hidden_dim)/3.0
emb_para = ParameterAttribute(name='emb', initial_std=0., learning_rate=0.)
std_0 = ParameterAttribute(initial_std=0.)
std_default = ParameterAttribute(initial_std=default_std)
predicate_embedding = embedding_layer(size=word_dim, input=predicate, param_attr=ParameterAttribute(name='vemb',initial_std=default_std))
mark_embedding = embedding_layer(name='word_ctx-in_embedding', size=mark_dim, input=mark, param_attr=std_0)
word_input=[word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [embedding_layer(size=word_dim, input=x, param_attr=emb_para) for x in word_input]
emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
hidden_0 = mixed_layer(
name='hidden0',
size=hidden_dim,
input=[
full_matrix_projection(input=word_embedding),
full_matrix_projection(input=predicate_embedding),
full_matrix_projection(input=ctx_n1_embedding),
full_matrix_projection(input=ctx_0_embedding),
full_matrix_projection(input=ctx_p1_embedding),
full_matrix_projection(input=mark_embedding),
])
bias_attr=std_default,
input=[ full_matrix_projection(input=emb, param_attr=std_default ) for emb in emb_layers ])
lstm_0 = lstmemory(input=hidden_0, layer_attr=layer_attr)
mix_hidden_lr = 1e-3
lstm_para_attr = ParameterAttribute(initial_std=0.0, learning_rate=1.0)
hidden_para_attr = ParameterAttribute(initial_std=default_std, learning_rate=mix_hidden_lr)
lstm_0 = lstmemory(name='lstm0',
input=hidden_0,
act=ReluActivation(),
gate_act=SigmoidActivation(),
state_act=SigmoidActivation(),
bias_attr=std_0,
param_attr=lstm_para_attr)
#stack L-LSTM and R-LSTM with direct edges
input_tmp = [hidden_0, lstm_0]
for i in range(1, depth):
fc = fc_layer(input=input_tmp, size=hidden_dim, param_attr=para_attr)
mix_hidden = mixed_layer(name='hidden'+str(i),
size=hidden_dim,
bias_attr=std_default,
input=[full_matrix_projection(input=input_tmp[0], param_attr=hidden_para_attr),
full_matrix_projection(input=input_tmp[1], param_attr=lstm_para_attr)
]
)
lstm = lstmemory(name='lstm'+str(i),
input=mix_hidden,
act=ReluActivation(),
gate_act=SigmoidActivation(),
state_act=SigmoidActivation(),
reverse=((i % 2)==1),
bias_attr=std_0,
param_attr=lstm_para_attr)
input_tmp = [mix_hidden, lstm]
feature_out = mixed_layer(name='output',
size=label_dict_len,
bias_attr=std_default,
input=[full_matrix_projection(input=input_tmp[0], param_attr=hidden_para_attr),
full_matrix_projection(input=input_tmp[1], param_attr=lstm_para_attr)
],
)
lstm = lstmemory(
input=fc,
act=ReluActivation(),
reverse=(i % 2) == 1,
layer_attr=layer_attr)
input_tmp = [fc, lstm]
prob = fc_layer(
input=input_tmp,
size=label_dict_len,
act=SoftmaxActivation(),
param_attr=para_attr)
if not is_predict:
cls = classification_cost(input=prob, label=target)
outputs(cls)
crf_l = crf_layer( name = 'crf',
size = label_dict_len,
input = feature_out,
label = target,
param_attr=ParameterAttribute(name='crfw',initial_std=default_std, learning_rate=mix_hidden_lr)
)
crf_dec_l = crf_decoding_layer(name = 'crf_dec_l',
size = label_dict_len,
input = feature_out,
label = target,
param_attr=ParameterAttribute(name='crfw')
)
eval = sum_evaluator(input=crf_dec_l)
outputs(crf_l)
else:
outputs(prob)
crf_dec_l = crf_decoding_layer(name = 'crf_dec_l',
size = label_dict_len,
input = feature_out,
param_attr=ParameterAttribute(name='crfw')
)
outputs(crf_dec_l)
......@@ -26,7 +26,7 @@ UNK_IDX = 0
class Prediction():
def __init__(self, train_conf, dict_file, model_dir, label_file):
def __init__(self, train_conf, dict_file, model_dir, label_file, predicate_dict_file):
"""
train_conf: trainer configure.
dict_file: word dictionary file name.
......@@ -35,26 +35,41 @@ class Prediction():
self.dict = {}
self.labels = {}
self.predicate_dict={}
self.labels_reverse = {}
self.load_dict_label(dict_file, label_file)
self.load_dict_label(dict_file, label_file, predicate_dict_file)
len_dict = len(self.dict)
len_label = len(self.labels)
conf = parse_config(train_conf, 'dict_len=' + str(len_dict) +
',label_len=' + str(len_label) + ',is_predict=True')
len_pred = len(self.predicate_dict)
conf = parse_config(
train_conf,
'dict_len=' + str(len_dict) +
',label_len=' + str(len_label) +
',pred_len=' + str(len_pred) +
',is_predict=True')
self.network = swig_paddle.GradientMachine.createFromConfigProto(
conf.model_config)
self.network.loadParameters(model_dir)
slots = [
integer_value_sequence(len_dict),
integer_value_sequence(len_pred),
integer_value_sequence(len_dict),
integer_value_sequence(len_dict),
integer_value_sequence(len_dict),
integer_value_sequence(len_dict),
integer_value_sequence(len_dict),
integer_value_sequence(2)
]
integer_value_sequence(len_dict), integer_value_sequence(len_dict),
integer_value_sequence(len_dict), integer_value_sequence(len_dict),
integer_value_sequence(len_dict), integer_value_sequence(2)
]
self.converter = DataProviderConverter(slots)
def load_dict_label(self, dict_file, label_file):
def load_dict_label(self, dict_file, label_file, predicate_dict_file):
"""
Load dictionary from self.dict_file.
"""
......@@ -65,39 +80,42 @@ class Prediction():
self.labels[line.strip()] = line_count
self.labels_reverse[line_count] = line.strip()
for line_count, line in enumerate(open(predicate_dict_file, 'r')):
self.predicate_dict[line.strip()] = line_count
def get_data(self, data_file):
"""
Get input data of paddle format.
"""
with open(data_file, 'r') as fdata:
for line in fdata:
sentence, predicate, ctx_n1, ctx_0, ctx_p1, mark, label = line.strip(
sentence, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, label = line.strip(
).split('\t')
words = sentence.split()
sen_len = len(words)
word_slot = [self.dict.get(w, UNK_IDX) for w in words]
predicate_slot = [self.dict.get(predicate, UNK_IDX)] * sen_len
predicate_slot = [self.predicate_dict.get(predicate, UNK_IDX)] * sen_len
ctx_n2_slot = [self.dict.get(ctx_n2, UNK_IDX)] * sen_len
ctx_n1_slot = [self.dict.get(ctx_n1, UNK_IDX)] * sen_len
ctx_0_slot = [self.dict.get(ctx_0, UNK_IDX)] * sen_len
ctx_p1_slot = [self.dict.get(ctx_p1, UNK_IDX)] * sen_len
ctx_p2_slot = [self.dict.get(ctx_p2, UNK_IDX)] * sen_len
marks = mark.split()
mark_slot = [int(w) for w in marks]
yield word_slot, predicate_slot, ctx_n2_slot, ctx_n1_slot, \
ctx_0_slot, ctx_p1_slot, ctx_p2_slot, mark_slot
yield word_slot, predicate_slot, ctx_n1_slot, \
ctx_0_slot, ctx_p1_slot, mark_slot
def predict(self, data_file):
def predict(self, data_file, output_file):
"""
data_file: file name of input data.
"""
input = self.converter(self.get_data(data_file))
output = self.network.forwardTest(input)
prob = output[0]["value"]
lab = list(np.argsort(-prob)[:, 0])
lab = output[0]["id"].tolist()
with open(data_file, 'r') as fin, open('predict.res', 'w') as fout:
with open(data_file, 'r') as fin, open(output_file, 'w') as fout:
index = 0
for line in fin:
sen = line.split('\t')[0]
......@@ -109,8 +127,8 @@ class Prediction():
def option_parser():
usage = ("python predict.py -c config -w model_dir "
"-d word dictionary -l label_file -i input_file")
usage = ("python predict.py -c config -w model_dir "
"-d word dictionary -l label_file -i input_file -p pred_dict_file")
parser = OptionParser(usage="usage: %s [options]" % usage)
parser.add_option(
"-c",
......@@ -131,6 +149,13 @@ def option_parser():
dest="label_file",
default=None,
help="label file")
parser.add_option(
"-p",
"--predict_dict_file",
action="store",
dest="predict_dict_file",
default=None,
help="predict_dict_file")
parser.add_option(
"-i",
"--data",
......@@ -144,6 +169,14 @@ def option_parser():
dest="model_path",
default=None,
help="model path")
parser.add_option(
"-o",
"--output_file",
action="store",
dest="output_file",
default=None,
help="output file")
return parser.parse_args()
......@@ -154,10 +187,12 @@ def main():
dict_file = options.dict_file
model_path = options.model_path
label_file = options.label_file
predict_dict_file = options.predict_dict_file
output_file = options.output_file
swig_paddle.initPaddle("--use_gpu=0")
predict = Prediction(train_conf, dict_file, model_path, label_file)
predict.predict(data_file)
predict = Prediction(train_conf, dict_file, model_path, label_file, predict_dict_file)
predict.predict(data_file,output_file)
if __name__ == '__main__':
......
......@@ -26,15 +26,18 @@ LOG=`get_best_pass $log`
LOG=(${LOG})
best_model_path="output/pass-${LOG[1]}"
config_file=db_lstm.py
dict_file=./data/src.dict
label_file=./data/tgt.dict
dict_file=./data/wordDict.txt
label_file=./data/targetDict.txt
predicate_dict_file=./data/verbDict.txt
input_file=./data/feature
output_file=predict.res
python predict.py \
-c $config_file \
-w $best_model_path \
-l $label_file \
-p $predicate_dict_file \
-d $dict_file \
-i $input_file
-i $input_file \
-o $output_file
......@@ -36,4 +36,5 @@ paddle train \
--job=test \
--use_gpu=false \
--config_args=is_test=1 \
--test_all_data_in_one_period=1 \
2>&1 | tee 'test.log'
......@@ -16,11 +16,14 @@
set -e
paddle train \
--config=./db_lstm.py \
--use_gpu=0 \
--log_period=5000 \
--trainer_count=1 \
--show_parameter_stats_period=5000 \
--save_dir=./output \
--trainer_count=4 \
--log_period=10 \
--num_passes=500 \
--use_gpu=false \
--show_parameter_stats_period=10 \
--num_passes=10000 \
--average_test_period=10000000 \
--init_model_path=./data \
--load_missing_parameter_strategy=rand \
--test_all_data_in_one_period=1 \
2>&1 | tee 'train.log'
2>&1 | tee 'train.log'
......@@ -29,6 +29,7 @@ settings(
batch_size=128,
learning_rate=2e-3,
learning_method=AdamOptimizer(),
average_window=0.5,
regularization=L2Regularization(8e-4),
gradient_clipping_threshold=25)
......
......@@ -17,7 +17,7 @@ PaddlePaddle does not need any preprocessing to sequence data, such as padding.
.. code-block:: python
settings.slots = [
settings.input_types = [
integer_value_sequence(len(settings.src_dict)),
integer_value_sequence(len(settings.trg_dict)),
integer_value_sequence(len(settings.trg_dict))]
......
# Semantic Role Labeling Tutorial #
Semantic role labeling (SRL) is a form of shallow semantic parsing whose goal is to discover the predicate-argument structure of each predicate in a given input sentence. SRL is useful as an intermediate step in a wide range of natural language processing tasks, such as information extraction, automatic document categorization and question answering. An example follows [1]:
[ <sub>A0</sub> He ] [ <sub>AM-MOD</sub> would ][ <sub>AM-NEG</sub> n’t ] [ <sub>V</sub> accept] [ <sub>A1</sub> anything of value ] from [<sub>A2</sub> those he was writing about ].
- V: verb
- A0: acceptor
- A1: thing accepted
- A2: accepted-from
- A3: Attribute
- AM-MOD: modal
- AM-NEG: negation
Given the verb "accept", the chunks in the sentence play certain semantic roles. Here, the label scheme is from the Penn Proposition Bank.
To date, most successful SRL systems have been built on top of some form of parsing results, with pre-defined feature templates over the syntactic structure. This tutorial presents an end-to-end system using a deep bidirectional long short-term memory (DB-LSTM) [2] to solve the SRL task, which largely outperforms the previous state-of-the-art systems. The system regards the SRL task as a sequence labeling problem.
## Data Description
The relevant paper [2] uses the CoNLL-2005 and CoNLL-2012 Shared Task data sets for training and testing. Due to the data license, this demo adopts only the test set of CoNLL-2005, which can be downloaded from its website.
To download and process the original data, users just need to execute the following commands:
```bash
cd data
./get_data.sh
```
Several new files appear in the `data` directory, as follows.
```bash
conll05st-release: the test data set of the CoNLL-2005 shared task
test.wsj.words: the Wall Street Journal sentences
test.wsj.props: the propositional arguments
src.dict: the dictionary of words in the sentences
tgt.dict: the label dictionary
feature: the features extracted from the data set
```
## Training
### DB-LSTM
Please refer to the Sentiment Analysis demo to learn more about the long short-term memory unit.
Unlike the bidirectional LSTM used in the Sentiment Analysis demo, DB-LSTM stacks LSTM layers in a different way. First, a standard LSTM processes the sequence in the forward direction. The input and output of this LSTM layer are taken by the next LSTM layer as input and processed in the reverse direction. These two standard LSTM layers compose a pair; we then stack LSTM layers pair after pair to obtain the deep LSTM model, as sketched after the figure below.
The following figure shows a temporally expanded 2-layer DB-LSTM network.
<center>
![pic](./network_arch.png)
</center>
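The loop below is a minimal sketch of this pair-stacking scheme, written with the same `trainer_config_helpers` calls that this demo's `db_lstm.py` uses; it assumes that config's context (`hidden_0`, `lstm_0`, `hidden_dim`, `depth`, `para_attr`, `layer_attr` already defined), so it is an illustration rather than a standalone script.

```python
# Stack LSTM layers pair after pair: each iteration adds one
# (fully-connected, LSTM) pair, alternating the LSTM direction.
input_tmp = [hidden_0, lstm_0]  # the first, forward pair
for i in range(1, depth):
    fc = fc_layer(input=input_tmp, size=hidden_dim, param_attr=para_attr)
    lstm = lstmemory(
        input=fc,
        act=ReluActivation(),
        reverse=(i % 2) == 1,  # odd layers process the sequence backwards
        layer_attr=layer_attr)
    input_tmp = [fc, lstm]  # both outputs feed the next pair
```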
### Features
Two input features play an essential role in this pipeline: predicate (pred) and argument (argu). Two other features, predicate context (ctx-p) and region mark (m<sub>r</sub>), are also adopted, because a single predicate word cannot fully describe the predicate, especially when the same word appears more than once in a sentence; with the predicate context, this ambiguity can be largely eliminated. Similarly, the region mark is m<sub>r</sub> = 1 if the argument position is inside the predicate context region, and m<sub>r</sub> = 0 otherwise. These four simple features are all we need for our SRL system. The features of one sample with the context size set to 1 are shown below [2]:
<center>
![pic](./feature.jpg)
</center>
In this sample, the corresponding labeled sentence is:
[ <sub>A1</sub> A record date ] has [ <sub>AM-NEG</sub> n't ] been [ <sub>V</sub> set ] .
In the demo, we adopt the feature template above, which consists of `argument`, `predicate`, `ctx-p (p=-1,0,1)` and `mark`, and use the `B/I/O` scheme to label each argument. These features and labels are stored in the `feature` file, separated by `\t`.
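As a rough, runnable sketch of how one such feature line is assembled (mirroring `extract_dict_feature.py`; the toy sentence and labels here are made up, and the boundary handling is simplified relative to the script):

```python
sentence_list = "A record date has n't been set".split()
labels_list = ['B-A1', 'I-A1', 'I-A1', 'O', 'B-AM-NEG', 'O', 'B-V']

verb_index = labels_list.index('B-V')  # position of the predicate
predicate = sentence_list[verb_index]
ctx_n1 = sentence_list[verb_index - 1] if verb_index > 0 else 'bos'
ctx_0 = sentence_list[verb_index]
ctx_p1 = (sentence_list[verb_index + 1]
          if verb_index + 1 < len(sentence_list) else 'eos')

# region mark: 1 inside the predicate context window, 0 outside
mark = [0] * len(labels_list)
for j in (verb_index - 1, verb_index, verb_index + 1):
    if 0 <= j < len(mark):
        mark[j] = 1

feature_line = '\t'.join([
    ' '.join(sentence_list), predicate, ctx_n1, ctx_0, ctx_p1,
    ' '.join(str(m) for m in mark), ' '.join(labels_list)])
print(feature_line)
```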
### Data Provider
`dataprovider.py` is the Python file that wraps the data. The `hook()` function defines the data slots for the network. The six features and the label are all IndexSlots.
```
def hook(settings, word_dict, label_dict, **kwargs):
settings.word_dict = word_dict
settings.label_dict = label_dict
#all inputs are integral and sequential type
settings.slots = [
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(2),
integer_value_sequence(len(label_dict))]
```
The corresponding data iterator is as follows:
```
@provider(use_seq=True, init_hook=hook)
def process(obj, file_name):
with open(file_name, 'r') as fdata:
for line in fdata:
sentence, predicate, ctx_n1, ctx_0, ctx_p1, mark, label = line.strip().split('\t')
words = sentence.split()
sen_len = len(words)
word_slot = [obj.word_dict.get(w, UNK_IDX) for w in words]
predicate_slot = [obj.word_dict.get(predicate, UNK_IDX)] * sen_len
ctx_n1_slot = [obj.word_dict.get(ctx_n1, UNK_IDX) ] * sen_len
ctx_0_slot = [obj.word_dict.get(ctx_0, UNK_IDX) ] * sen_len
ctx_p1_slot = [obj.word_dict.get(ctx_p1, UNK_IDX) ] * sen_len
marks = mark.split()
mark_slot = [int(w) for w in marks]
label_list = label.split()
label_slot = [obj.label_dict.get(w) for w in label_list]
yield word_slot, predicate_slot, ctx_n1_slot, ctx_0_slot, ctx_p1_slot, mark_slot, label_slot
```
The `process` function yields 7 lists: the six features and the label.
### Neural Network Config
`db_lstm.py` is the neural network config file; it loads the dictionaries and defines the data provider module and the network architecture for the training procedure.
Seven `data_layer`s load instances from the data provider. The six features are transformed into embeddings and mixed by a `mixed_layer`. Deep bidirectional LSTM layers extract features for the softmax layer. The objective function is the cross entropy of the labels.
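A condensed sketch of that wiring, using the same helpers as `db_lstm.py`; it assumes the six embedding layers, `hidden_dim`, `label_dict_len` and `target` are defined as in that config, and omits the stacked LSTM pairs sketched earlier:

```python
# Project the six feature embeddings and sum them into one hidden layer.
hidden_0 = mixed_layer(
    size=hidden_dim,
    input=[full_matrix_projection(input=emb) for emb in
           [word_embedding, predicate_embedding, ctx_n1_embedding,
            ctx_0_embedding, ctx_p1_embedding, mark_embedding]])
lstm_0 = lstmemory(input=hidden_0)  # first (forward) LSTM of the stack
# ... stacked LSTM pairs go here, as sketched in the DB-LSTM section ...
prob = fc_layer(input=[hidden_0, lstm_0], size=label_dict_len,
                act=SoftmaxActivation())  # per-word label distribution
cls = classification_cost(input=prob, label=target)  # cross-entropy cost
outputs(cls)
```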
### Run Training
The script for training is `train.sh`; users just need to execute:
```bash
./train.sh
```
The content of `train.sh`:
```
paddle train \
--config=./db_lstm.py \
--save_dir=./output \
--trainer_count=4 \
--log_period=10 \
--num_passes=500 \
--use_gpu=false \
--show_parameter_stats_period=10 \
--test_all_data_in_one_period=1 \
2>&1 | tee 'train.log'
```
- \--config=./db_lstm.py: network config file.
- \--save_dir=./output: output path to save models.
- \--trainer_count=4: set thread number (or GPU count).
- \--log_period=10: print a log every 10 batches.
- \--num_passes=500: set pass number; one pass in PaddlePaddle means training on all samples in the dataset once.
- \--use_gpu=false: use CPU to train; set it to true if you have installed the GPU version of PaddlePaddle and want to train on GPU.
- \--show_parameter_stats_period=10: show parameter statistics every 10 batches.
- \--test_all_data_in_one_period=1: test on all data in every test period.
After training, the models will be saved in directory `output`.
### Run testing
The script for testing is `test.sh`; users just need to execute:
```bash
./test.sh
```
The main part of `test.sh`:
```
paddle train \
--config=./db_lstm.py \
--model_list=$model_list \
--job=test \
--config_args=is_test=1 \
```
- \--config=./db_lstm.py: network config file
- \--model_list=$model_list: model list file
- \--job=test: indicate the test job
- \--config_args=is_test=1: flag to indicate testing
### Run prediction
The script for prediction is `predict.sh`; users just need to execute:
```bash
./predict.sh
```
In `predict.sh`, users should provide the network config file, model path, label file, word dictionary file and feature file:
```
python predict.py
-c $config_file
-w $model_path
-l $label_file
-d $dict_file
-i $input_file
```
`predict.py` is the main executable Python script; it loads the model, loads the data and runs prediction. The network outputs the probability distribution over labels; in the demo, we take the label with the maximum probability as the result. Users can also implement beam search or Viterbi decoding on top of the probability distribution matrix.
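A minimal, runnable sketch of that decoding step (the toy `prob` matrix is an assumption; in `predict.py` it is the matrix returned by `forwardTest`, one row per word and one column per label):

```python
import numpy as np

prob = np.array([[0.1, 0.7, 0.2],   # word 1: label 1 is most likely
                 [0.6, 0.3, 0.1]])  # word 2: label 0 is most likely
# Sort each row by descending probability and keep the top label id,
# exactly as predict.py does.
lab = list(np.argsort(-prob)[:, 0])
print(lab)  # label ids 1 and 0
```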
After prediction, the result is saved in `predict.res`.
## Reference
[1] Martha Palmer, Dan Gildea, and Paul Kingsbury. The Proposition Bank: An Annotated Corpus of Semantic Roles, Computational Linguistics, 31(1), 2005.
[2] Zhou, Jie, and Wei Xu. "End-to-end learning of semantic role labeling using recurrent neural networks." Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
# Semantic Role Labeling Tutorial #
Semantic role labeling (SRL) is a form of shallow semantic parsing whose goal is to discover the predicate-argument structure of each predicate in a given input sentence. SRL is useful as an intermediate step in a wide range of natural language processing tasks, such as information extraction, automatic document categorization and question answering. An example follows [1]:
[ <sub>A0</sub> He ] [ <sub>AM-MOD</sub> would ][ <sub>AM-NEG</sub> n’t ] [ <sub>V</sub> accept] [ <sub>A1</sub> anything of value ] from [<sub>A2</sub> those he was writing about ].
- V: verb
- A0: acceptor
- A1: thing accepted
- A2: accepted-from
- A3: Attribute
- AM-MOD: modal
- AM-NEG: negation
Given the verb "accept", the chunks in the sentence play certain semantic roles. Here, the label scheme is from the Penn Proposition Bank.
To date, most successful SRL systems have been built on top of some form of parsing results, with pre-defined feature templates over the syntactic structure. This tutorial presents an end-to-end system using a deep bidirectional long short-term memory (DB-LSTM) [2] to solve the SRL task, which largely outperforms the previous state-of-the-art systems. The system regards the SRL task as a sequence labeling problem.
## Data Description
The relevant paper [2] uses the CoNLL-2005 and CoNLL-2012 Shared Task data sets for training and testing. Due to the data license, this demo adopts only the test set of CoNLL-2005, which can be downloaded from its website.
To download and process the original data, users just need to execute the following commands:
```bash
cd data
./get_data.sh
```
Several new files appear in the `data` directory, as follows.
```bash
conll05st-release: the test data set of the CoNLL-2005 shared task
test.wsj.words: the Wall Street Journal sentences
test.wsj.props: the propositional arguments
feature: the features extracted from the data set
```
## Training
### DB-LSTM
Please refer to the Sentiment Analysis demo to learn more about the long short-term memory unit.
Unlike the bidirectional LSTM used in the Sentiment Analysis demo, DB-LSTM stacks LSTM layers in a different way. First, a standard LSTM processes the sequence in the forward direction. The input and output of this LSTM layer are taken by the next LSTM layer as input and processed in the reverse direction. These two standard LSTM layers compose a pair; we then stack LSTM layers pair after pair to obtain the deep LSTM model, as sketched after the figure below.
The following figure shows a temporally expanded 2-layer DB-LSTM network.
<center>
![pic](./network_arch.png)
</center>
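A trimmed sketch of this pairing as the updated `db_lstm.py` builds it; the parameter and bias attributes are omitted for brevity, and `hidden_0`, `lstm_0`, `hidden_dim` and `depth` are assumed from that config:

```python
# Stack L-LSTM and R-LSTM pairs with direct edges; each iteration mixes
# the previous pair's outputs and adds an LSTM in the opposite direction.
input_tmp = [hidden_0, lstm_0]
for i in range(1, depth):
    mix_hidden = mixed_layer(
        name='hidden' + str(i),
        size=hidden_dim,
        input=[full_matrix_projection(input=input_tmp[0]),
               full_matrix_projection(input=input_tmp[1])])
    lstm = lstmemory(
        name='lstm' + str(i),
        input=mix_hidden,
        act=ReluActivation(),
        reverse=(i % 2) == 1)  # odd layers run right-to-left
    input_tmp = [mix_hidden, lstm]
```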
### Features
Two input features play an essential role in this pipeline: predicate (pred) and argument (argu). Two other features, predicate context (ctx-p) and region mark (m<sub>r</sub>), are also adopted, because a single predicate word cannot fully describe the predicate, especially when the same word appears more than once in a sentence; with the predicate context, this ambiguity can be largely eliminated. Similarly, the region mark is m<sub>r</sub> = 1 if the argument position is inside the predicate context region, and m<sub>r</sub> = 0 otherwise. These four simple features are all we need for our SRL system. The features of one sample with the context size set to 1 are shown below [2]:
<center>
![pic](./feature.jpg)
</center>
In this sample, the corresponding labeled sentence is:
[ <sub>A1</sub> A record date ] has [ <sub>AM-NEG</sub> n't ] been [ <sub>V</sub> set ] .
In the demo, we adopt the feature template above, which consists of `argument`, `predicate`, `ctx-p (p=-2,-1,0,1,2)` and `mark`, and use the `B/I/O` scheme to label each argument. These features and labels are stored in the `feature` file, separated by `\t`.
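A rough, runnable sketch of the wider context window (mirroring the updated `extract_dict_feature.py`; `bos`/`eos` are the padding tokens that script uses, and the boundary handling is simplified):

```python
def context_window(sentence_list, verb_index):
    """Return (ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2) around the predicate,
    padding with 'bos'/'eos' at the sentence boundaries."""
    def tok(i):
        if i < 0:
            return 'bos'
        if i >= len(sentence_list):
            return 'eos'
        return sentence_list[i]
    return tuple(tok(verb_index + d) for d in (-2, -1, 0, 1, 2))

print(context_window('A record date has been set'.split(), 5))
# -> ('has', 'been', 'set', 'eos', 'eos')
```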
### Data Provider
`dataprovider.py` is the Python file that wraps the data. The `hook()` function defines the data slots for the network. The eight features and the label are all IndexSlots.
```
def hook(settings, word_dict, label_dict, predicate_dict, **kwargs):
settings.word_dict = word_dict
settings.label_dict = label_dict
settings.predicate_dict = predicate_dict
#all inputs are integral and sequential type
settings.slots = [
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(predicate_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(len(word_dict)),
integer_value_sequence(2),
integer_value_sequence(len(label_dict))]
```
The corresponding data iterator is as follows:
```
@provider(init_hook=hook, should_shuffle=True, calc_batch_size=get_batch_size,
can_over_batch_size=False, cache=CacheType.CACHE_PASS_IN_MEM)
def process(settings, file_name):
with open(file_name, 'r') as fdata:
for line in fdata:
sentence, predicate, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2, mark, label = \
line.strip().split('\t')
words = sentence.split()
sen_len = len(words)
word_slot = [settings.word_dict.get(w, UNK_IDX) for w in words]
predicate_slot = [settings.predicate_dict.get(predicate)] * sen_len
ctx_n2_slot = [settings.word_dict.get(ctx_n2, UNK_IDX)] * sen_len
ctx_n1_slot = [settings.word_dict.get(ctx_n1, UNK_IDX)] * sen_len
ctx_0_slot = [settings.word_dict.get(ctx_0, UNK_IDX)] * sen_len
ctx_p1_slot = [settings.word_dict.get(ctx_p1, UNK_IDX)] * sen_len
ctx_p2_slot = [settings.word_dict.get(ctx_p2, UNK_IDX)] * sen_len
marks = mark.split()
mark_slot = [int(w) for w in marks]
label_list = label.split()
label_slot = [settings.label_dict.get(w) for w in label_list]
yield word_slot, predicate_slot, ctx_n2_slot, ctx_n1_slot, \
ctx_0_slot, ctx_p1_slot, ctx_p2_slot, mark_slot, label_slot
```
The `process` function yields 9 lists: the eight features and the label.
### Neural Network Config
`db_lstm.py` is the neural network config file; it loads the dictionaries and defines the data provider module and the network architecture for the training procedure.
Nine `data_layer`s load instances from the data provider. The eight features are transformed into embeddings and mixed by a `mixed_layer`. Deep bidirectional LSTM layers extract features for the output layer, and a CRF layer on top scores the label sequence; its cost is the training objective.
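The part that is easy to miss is the training/prediction split around the CRF; below is a condensed sketch of it, following the updated `db_lstm.py` (`feature_out`, `target`, `label_dict_len` and `is_predict` as defined there):

```python
if not is_predict:
    # Training: crf_layer computes the CRF cost over the emission scores
    # (feature_out) and the gold labels; transition weights live in 'crfw'.
    crf = crf_layer(size=label_dict_len, input=feature_out, label=target,
                    param_attr=ParameterAttribute(name='crfw'))
    outputs(crf)
else:
    # Prediction: crf_decoding_layer shares the 'crfw' weights and emits
    # the best label-id path directly.
    crf_dec = crf_decoding_layer(size=label_dict_len, input=feature_out,
                                 param_attr=ParameterAttribute(name='crfw'))
    outputs(crf_dec)
```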
### Run Training
The script for training is `train.sh`; users just need to execute:
```bash
./train.sh
```
The content of `train.sh`:
```
paddle train \
--config=./db_lstm.py \
--use_gpu=0 \
--log_period=5000 \
--trainer_count=1 \
--show_parameter_stats_period=5000 \
--save_dir=./output \
--num_passes=10000 \
--average_test_period=10000000 \
--init_model_path=./data \
--load_missing_parameter_strategy=rand \
--test_all_data_in_one_period=1 \
2>&1 | tee 'train.log'
```
- \--config=./db_lstm.py: network config file.
- \--use_gpu=0: use CPU to train; set it to 1 if you have installed the GPU version of PaddlePaddle and want to train on GPU (note that crf_layer does not support GPU yet).
- \--log_period=5000: print a log every 5000 batches.
- \--trainer_count=1: set thread number (or GPU count).
- \--show_parameter_stats_period=5000: show parameter statistics every 5000 batches.
- \--save_dir=./output: output path to save models.
- \--num_passes=10000: set pass number; one pass in PaddlePaddle means training on all samples in the dataset once.
- \--average_test_period=10000000: test on the averaged parameters every average_test_period batches.
- \--init_model_path=./data: parameter initialization path.
- \--load_missing_parameter_strategy=rand: randomly initialize parameters missing from the initial model.
- \--test_all_data_in_one_period=1: test all data in one period.
After training, the models will be saved in the directory `output`. Our training curve is as follows:
<center>
![pic](./curve.jpg)
</center>
### Run testing
The script for testing is `test.sh`; users just need to execute:
```bash
./test.sh
```
The main part of `test.sh`:
```
paddle train \
--config=./db_lstm.py \
--model_list=$model_list \
--job=test \
--config_args=is_test=1 \
--test_all_data_in_one_period=1 \
```
- \--config=./db_lstm.py: network config file
- \--model_list=$model_list: model list file
- \--job=test: indicate the test job
- \--config_args=is_test=1: flag to indicate testing
- \--test_all_data_in_one_period=1: test all data in one period
### Run prediction
The script for prediction is `predict.sh`; users just need to execute:
```bash
./predict.sh
```
In `predict.sh`, users should provide the network config file, model path, label file, predicate dictionary file, word dictionary file, feature file and output file:
```
python predict.py
-c $config_file \
-w $best_model_path \
-l $label_file \
-p $predicate_dict_file \
-d $dict_file \
-i $input_file \
-o $output_file
```
`predict.py` is the main executable Python script; it loads the model, loads the data and runs prediction. With the CRF output layer, the network directly emits the best label path found by Viterbi decoding over the CRF, as a sequence of label ids.
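A minimal sketch of turning those decoded ids back into label strings, as `predict.py` does with its `labels_reverse` map (the toy ids and labels here are made up):

```python
# labels_reverse maps label id -> label string (built from targetDict.txt).
labels_reverse = {0: 'O', 1: 'B-A1', 2: 'I-A1'}
lab = [1, 2, 0]  # output[0]["id"].tolist() in predict.py
decoded = [labels_reverse[i] for i in lab]
print(decoded)  # ['B-A1', 'I-A1', 'O']
```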
After prediction, the result is saved in `predict.res`.
## Reference
[1] Martha Palmer, Dan Gildea, and Paul Kingsbury. The Proposition Bank: An Annotated Corpus of Semantic Roles, Computational Linguistics, 31(1), 2005.
[2] Zhou, Jie, and Wei Xu. "End-to-end learning of semantic role labeling using recurrent neural networks." Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
......@@ -6,7 +6,7 @@ Sentiment analysis is also used to monitor social media based on large amount of
On the other hand, grabbing user comments about products and analyzing their sentiment helps companies understand user preferences for their own products and even competing products.
This tutorial will guide you through the process of training a Long Short Term Memory (LSTM) Network to classify the sentiment of sentences from [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/), sometimes known as the [Internet Movie Database (IMDB)](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf). This dataset contains movie reviews along with their associated binary sentiment polarity labels, namely positive and negative. So randomly guessing yields 50% accuracy.
This tutorial will guide you through the process of training a Long Short Term Memory (LSTM) Network to classify the sentiment of sentences from [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/), sometimes known as the Internet Movie Database (IMDB). This dataset contains movie reviews along with their associated binary sentiment polarity labels, namely positive and negative. So randomly guessing yields 50% accuracy.
## Data Preparation
......@@ -39,7 +39,7 @@ imdbEr.txt imdb.vocab README test train
* imdbEr.txt: expected rating for each token in imdb.vocab.
* README: data documentation.
Both train and test set directory contains:
The files in the train set directory are listed below. The test set contains the same files except `unsup` and `urls_unsup.txt`.
```
labeledBow.feat neg pos unsup unsupBow.feat urls_neg.txt urls_pos.txt urls_unsup.txt
......@@ -151,6 +151,7 @@ settings(
batch_size=128,
learning_rate=2e-3,
learning_method=AdamOptimizer(),
average_window=0.5,
regularization=L2Regularization(8e-4),
gradient_clipping_threshold=25
)
......@@ -163,17 +164,18 @@ stacked_lstm_net(dict_dim, class_dim=class_dim,
* **Data Definition**:
* get\_config\_arg(): get arguments set via `--config_args=xx` on the command line.
* Define TrainData and TestData provider, here using Python interface (PyDataProviderWrapper) of PaddlePaddle to load data. For details, you can refer to the document of PyDataProvider.
* Define data provider, here using Python interface to load data. For details, you can refer to the document of PyDataProvider2.
* **Algorithm Configuration**:
* use sgd algorithm.
* use adam optimization.
* set batch size of 128.
* set average sgd window.
* set global learning rate.
* set L2 regularization.
* set gradient clipping threshold.
* **Network Configuration**:
* dict_dim: get dictionary dimension.
* class_dim: set category number, IMDB has two label, namely positive and negative label.
* dict_dim: dictionary dimension.
* class_dim: category number, IMDB has two label, namely positive and negative label.
* `stacked_lstm_net`: predefined network as shown in Figure 3, use this network by default.
* `bidirectional_lstm_net`: predefined network as shown in Figure 2.
......
......@@ -60,7 +60,7 @@ Implement C++ Class
The C++ class of the layer implements the initialization, forward, and backward part of the layer. The fully connected layer is at :code:`paddle/gserver/layers/FullyConnectedLayer.h` and :code:`paddle/gserver/layers/FullyConnectedLayer.cpp`. We list simplified version of the code below.
It needs to derive the base class :code:`paddle::BaseLayer`, and it needs to override the following functions:
It needs to derive the base class :code:`paddle::Layer`, and it needs to override the following functions:
- constructor and destructor.
- :code:`init` function. It is used to initialize the parameters and settings.
......
......@@ -53,7 +53,7 @@ above profilers.
.. literalinclude:: ../../paddle/math/tests/test_GpuProfiler.cpp
:language: c++
:lines: 107-121
:lines: 111-124
:linenos:
The above code snippet includes two methods; you can use either of them to profile the regions of interest.
......@@ -75,12 +75,12 @@ To enable built-in timer in PaddlePaddle, first you have to add :code:`REGISTER_
Then, all information could be stamped in the console via :code:`printStatus` or :code:`printAllStatus` function.
As a simple example, consider the following:
1. Add :code:`REGISTER_TIMER_INFO` and :code:`printStatus` functions (see the emphasize-lines).
1. Add :code:`REGISTER_TIMER_INFO` and :code:`printAllStatus` functions (see the emphasize-lines).
.. literalinclude:: ../../paddle/math/tests/test_GpuProfiler.cpp
:language: c++
:lines: 107-121
:emphasize-lines: 10-11,14
:lines: 111-124
:emphasize-lines: 8-10,13
:linenos:
2. Configure cmake with **WITH_TIMER** and recompile PaddlePaddle.
......@@ -126,8 +126,8 @@ To use this command line profiler **nvprof**, you can simply issue the following
.. literalinclude:: ../../paddle/math/tests/test_GpuProfiler.cpp
:language: c++
:lines: 107-121
:emphasize-lines: 7-8
:lines: 111-124
:emphasize-lines: 6-7
:linenos:
2. Configure cmake with **WITH_PROFILER** and recompile PaddlePaddle.
......
API
========
===
.. doxygenfile:: paddle/api/PaddleAPI.h
.. doxygenfile:: paddle/api/Internal.h
CUDA
====================
.. toctree::
:maxdepth: 3
cuda.rst
Matrix
====================
CUDA
====
.. toctree::
:maxdepth: 3
:maxdepth: 2
matrix.rst
nn.rst
utils.rst
Matrix
=======
======
Base Matrix
-------------
Base
----
hl_matrix.h
``````````````````
```````````
.. doxygenfile:: paddle/cuda/include/hl_matrix.h
hl_matrix_base.h
``````````````````
````````````````
.. doxygenfile:: paddle/cuda/include/hl_matrix_base.cuh
hl_matrix_apply.cuh
``````````````````````
```````````````````
.. doxygenfile:: paddle/cuda/include/hl_matrix_apply.cuh
hl_matrix_ops.cuh
``````````````````````
`````````````````
.. doxygenfile:: paddle/cuda/include/hl_matrix_ops.cuh
hl_matrix_type.cuh
``````````````````````
``````````````````
.. doxygenfile:: paddle/cuda/include/hl_matrix_type.cuh
hl_sse_matrix_kernel.cuh
``````````````````````````
````````````````````````
.. doxygenfile:: paddle/cuda/include/hl_sse_matrix_kernel.cuh
Matrix Function
---------------
hl_batch_transpose.h
``````````````````````````
````````````````````
.. doxygenfile:: paddle/cuda/include/hl_batch_transpose.h
Sparse Matrix
--------------
hl_sparse.h
``````````````````
.. doxygenfile:: paddle/cuda/include/hl_sparse.h
hl_sparse.ph
``````````````````````
.. doxygenfile:: paddle/cuda/include/hl_sparse.ph
Others
---------------
hl_aggregate.h
``````````````````
``````````````
.. doxygenfile:: paddle/cuda/include/hl_aggregate.h
hl_top_k.h
``````````
.. doxygenfile:: paddle/cuda/include/hl_top_k.h
hl_table_apply.h
``````````````````
````````````````
.. doxygenfile:: paddle/cuda/include/hl_table_apply.h
hl_top_k.h
``````````````````
.. doxygenfile:: paddle/cuda/include/hl_top_k.h
Sparse Matrix
-------------
hl_sparse.h
```````````
.. doxygenfile:: paddle/cuda/include/hl_sparse.h
hl_sparse.ph
````````````
.. doxygenfile:: paddle/cuda/include/hl_sparse.ph
Neural Networks
==================
Neural Network
==============
Base
-------
----
.. doxygenfile:: paddle/cuda/include/hl_gpu.h
.. doxygenfile:: paddle/cuda/include/hl_cnn.h
.. doxygenfile:: paddle/cuda/include/hl_functions.h
.. doxygenfile:: paddle/cuda/include/hl_avx_functions.h
.. doxygenfile:: paddle/cuda/include/hl_device_functions.cuh
.. doxygenfile:: paddle/cuda/include/hl_gpu_functions.cuh
Activation Functions
-----------------------
.. doxygenfile:: paddle/cuda/include/hl_activation_functions.h
CNN Related APIs
----------------
.. doxygenfile:: paddle/cuda/include/hl_cnn.h
.. doxygenfile:: paddle/cuda/include/hl_cuda_cudnn.h
.. doxygenfile:: paddle/cuda/include/hl_cuda_cudnn.ph
RNN Related APIs
-----------------
----------------
.. doxygenfile:: paddle/cuda/include/hl_recurrent_apply.cuh
.. doxygenfile:: paddle/cuda/include/hl_sequence.h
LSTM Model
``````````````
``````````
.. doxygenfile:: paddle/cuda/include/hl_lstm.h
.. doxygenfile:: paddle/cuda/include/hl_cpu_lstm.cuh
.. doxygenfile:: paddle/cuda/include/hl_gpu_lstm.cuh
.. doxygenfile:: paddle/cuda/include/hl_lstm_ops.cuh
GRU Model
````````````````
`````````
.. doxygenfile:: paddle/cuda/include/hl_gru_ops.cuh
.. doxygenfile:: paddle/cuda/include/hl_cpu_gru.cuh
.. doxygenfile:: paddle/cuda/include/hl_gpu_gru.cuh
RNN
====================
.. toctree::
:maxdepth: 3
rnn.rst
Cuda
=============
Utils
=====
Dynamic Link Libs
--------------------------
hl_dso_loader.h
``````````````````
-----------------
.. doxygenfile:: paddle/cuda/include/hl_dso_loader.h
GPU Resources
----------------
-------------
hl_cuda.ph
``````````````
``````````
.. doxygenfile:: paddle/cuda/include/hl_cuda.ph
hl_cuda.h
``````````````
`````````
.. doxygenfile:: paddle/cuda/include/hl_cuda.h
CUDA Wrapper
--------------
HPPL Base
---------
.. doxygenfile:: paddle/cuda/include/hl_base.h
hl_cuda_cublas.h
``````````````````````
CUBLAS Wrapper
--------------
.. doxygenfile:: paddle/cuda/include/hl_cuda_cublas.h
hl_cuda_cudnn.h
``````````````````````
.. doxygenfile:: paddle/cuda/include/hl_cuda_cudnn.h
hl_cuda_cudnn.h
``````````````````````
.. doxygenfile:: paddle/cuda/include/hl_cuda_cudnn.ph
Timer
-----
.. doxygenfile:: paddle/cuda/include/hl_time.h
Thread Resource
---------------
.. doxygenfile:: paddle/cuda/include/hl_thread.ph
Device Function
---------------
.. doxygenfile:: paddle/cuda/include/hl_device_functions.cuh
Utils
====================
.. toctree::
:maxdepth: 3
utils.rst
Utilities
===========
HPPL Base
------------
hl_base.h
``````````````
.. doxygenfile:: paddle/cuda/include/hl_base.h
Timer
-----------
hl_time.h
``````````````
.. doxygenfile:: paddle/cuda/include/hl_time.h
Thread Resource
-----------
hl_thread.ph
``````````````
.. doxygenfile:: paddle/cuda/include/hl_thread.ph
Activations
=============
===========
.. doxygenclass:: paddle::ActivationFunction
:members:
Data Providers Documents
==========================
.. toctree::
:maxdepth: 3
dataproviders.rst
==============
Data Providers
================
==============
Base DataProvider
------------------
DataProviders
=============
Base
----
.. doxygenclass:: paddle::DataProvider
:members:
DataProviderGroup
-------------------
-----------------
.. doxygenclass:: paddle::DataProviderGroup
:members:
MultiDataProvider
-------------------
-----------------
.. doxygenclass:: paddle::MultiDataProvider
:members:
PyDataProvider
===================
==============
IFieldScanner
-------------
......@@ -45,7 +49,7 @@ SparseValueScanner
:members:
SequenceScanner
------------------
---------------
.. doxygenclass:: paddle::SparseValueScanner
:members:
......@@ -69,8 +73,8 @@ IPyDataProvider
.. doxygenclass:: paddle::PyDataProvider2
:members:
Proto Data Provider
===================
ProtoDataProvider
=================
ProtoDataProvider
----------------
......@@ -78,6 +82,6 @@ ProtoDataProvider
:members:
ProtoSequenceDataProvider
----------------
-------------------------
.. doxygenclass:: paddle::ProtoSequenceDataProvider
:members:
Base Evaluator
==============
==========
Evaluators
==========
Base
====
Evaluator
---------
.. doxygenclass:: paddle::Evaluator
:members:
Utils
=====
Sum
===
SumEvaluator
------------
......
Evaluators
==========
.. toctree::
:maxdepth: 3
evaluators.rst
Gradient Machines
================
=================
GradientMachine
---------------------
---------------
.. doxygenclass:: paddle::GradientMachine
:members:
GradientMachineModel
--------------------
GradientMachineMode
-------------------
.. doxygenclass:: paddle::IGradientMachineMode
:members:
MultiGradientMachine
---------------------
--------------------
.. doxygenclass:: paddle::MultiGradientMachine
:members:
......@@ -21,20 +21,7 @@ TrainerThread
.. doxygenclass:: paddle::TrainerThread
:members:
Recurrent Gradient Machines
---------------------------
RecurrentGradientMachine
------------------------
.. doxygenclass:: paddle::RecurrentGradientMachine
:members:
Networks
========
NeuralNetwork
-------------
.. doxygenclass:: paddle::NeuralNetwork
:members:
ParallelNeuralNetwork
---------------------
.. doxygenclass:: paddle::ParallelNeuralNetwork
:members:
Gradient Machines Documents
=============================
.. toctree::
:maxdepth: 3
gradientmachines.rst
GServer
=======
.. toctree::
:maxdepth: 2
activations.rst
dataproviders.rst
evaluators.rst
gradientmachines.rst
layers.rst
networks.rst
Base
======
Layers
======
Base
====
Layer
-----
.. doxygenclass:: paddle::Layer
......@@ -17,7 +21,7 @@ Operator
:members:
Data Layer
===========
==========
.. doxygenclass:: paddle::DataLayer
:members:
......@@ -58,6 +62,11 @@ CudnnConvLayer
.. doxygenclass:: paddle::CudnnConvLayer
:members:
ExpandConvBaseLayer
-------------------
.. doxygenclass:: paddle::ExpandConvBaseLayer
:members:
ExpandConvLayer
---------------
.. doxygenclass:: paddle::ExpandConvLayer
......@@ -86,6 +95,16 @@ CudnnPoolLayer
.. doxygenclass:: paddle::CudnnPoolLayer
:members:
SpatialPyramidPoolLayer
-----------------------
.. doxygenclass:: paddle::SpatialPyramidPoolLayer
:members:
MaxOutLayer
-----------
.. doxygenclass:: paddle::MaxOutLayer
:members:
Norm Layers
===========
......@@ -402,6 +421,11 @@ TransLayer
Sampling Layers
===============
BilinearInterpLayer
-------------------
.. doxygenclass:: paddle::BilinearInterpLayer
:members:
MultinomialSampler
------------------
.. doxygenclass:: paddle::MultinomialSampler
......
Layers Documents
====================
.. toctree::
:maxdepth: 3
layer.rst
Networks
========
NeuralNetwork
-------------
.. doxygenclass:: paddle::NeuralNetwork
:members:
ParallelNeuralNetwork
---------------------
.. doxygenclass:: paddle::ParallelNeuralNetwork
:members:
# Source Code Documents
## cuda
- [CUDA](cuda/cuda/index.rst)
- [Matrix](cuda/matrix/index.rst)
- [RNN](cuda/rnn/index.rst)
- [Utils](cuda/utils/index.rst)
## gserver
- [Activations](gserver/activations/index.rst)
- [Data Providers](gserver/dataprovider/index.rst)
- [Evaluators](gserver/evaluators/index.rst)
- [Gradient Machines](gserver/gradientmachines/index.rst)
- [Layers](gserver/layers/index.rst)
## math
- [Matrix](math/matrix/index.rst)
- [Utils](math/utils/index.rst)
## parameter
- [Parameter](parameter/parameter/index.rst)
- [Update](parameter/update/index.rst)
- [Optimizer](parameter/optimizer/index.rst)
## pserver
- [Client](pserver/client/index.rst)
- [Network](pserver/network/index.rst)
- [Server](pserver/server/index.rst)
## trainer
- [Trainer](trainer/trainer.rst)
## api
- [API](api/api.rst)
## utils
- [CustomStackTrace](utils/customStackTrace.rst)
- [Enumeration wrapper](utils/enum.rst)
- [Lock](utils/lock.rst)
- [Queue](utils/queue.rst)
- [Thread](utils/thread.rst)
Source Code Documents
=====================
.. toctree::
:maxdepth: 1
gserver/index.rst
trainer.rst
parameter/index.rst
pserver/index.rst
api.rst
cuda/index.rst
math/index.rst
utils/index.rst
Functions
=========
MathFunctions
-------------
.. doxygenfile:: paddle/math/MathFunctions.h
SIMDFunctions
-------------
.. doxygenfile:: paddle/math/SIMDFunctions.h
Math
====
.. toctree::
:maxdepth: 2
vector.rst
matrix.rst
functions.rst
utils.rst
Matrix
======
Base
----
BaseMatrix Template
```````````````````
.. doxygenclass:: paddle::BaseMatrixT
:members:
Matrix
``````
.. doxygenclass:: paddle::Matrix
:members:
MatrixOffset
````````````
.. doxygenclass:: paddle::MatrixOffset
:members:
CpuMatrix
---------
CpuMatrix
`````````
.. doxygenclass:: paddle::CpuMatrix
:members:
SharedCpuMatrix
```````````````
.. doxygenclass:: paddle::SharedCpuMatrix
:members:
GpuMatrix
---------
.. doxygenclass:: paddle::GpuMatrix
:members:
CpuSparseMatrix
---------------
CpuSparseMatrix
```````````````
.. doxygenclass:: paddle::CpuSparseMatrix
:members:
SparseRowCpuMatrix
``````````````````
.. doxygenclass:: paddle::SparseRowCpuMatrix
:members:
SparseAutoGrowRowCpuMatrix
``````````````````````````
.. doxygenclass:: paddle::SparseAutoGrowRowCpuMatrix
:members:
SparsePrefetchRowCpuMatrix
``````````````````````````
.. doxygenclass:: paddle::SparsePrefetchRowCpuMatrix
:members:
SparseRowIdsCpuMatrix
`````````````````````
.. doxygenclass:: paddle::SparseRowIdsCpuMatrix
:members:
CacheRowCpuMatrix
`````````````````
.. doxygenclass:: paddle::CacheRowCpuMatrix
:members:
GpuSparseMatrix
---------------
.. doxygenclass:: paddle::GpuSparseMatrix
:members:
Matrix Documents
====================
.. toctree::
:maxdepth: 3
matrix.rst
Matrix
=======
Base
--------
.. doxygenfile:: paddle/math/BaseMatrix.h
Sparse Matrix
----------------
.. doxygenfile:: paddle/math/Matrix.h
.. doxygenfile:: paddle/math/Vector.h
.. doxygenfile:: paddle/math/MathUtils.h
.. doxygenfile:: paddle/math/SparseMatrix.h
.. doxygenfile:: paddle/math/SparseRowMatrix.h
.. doxygenfile:: paddle/math/CpuSparseMatrix.h
Others
----------
.. doxygenfile:: paddle/math/MathFunctions.h
.. doxygenfile:: paddle/math/SIMDFunctions.h
Utils
=======
Memory Manager
==============
Memory Handle
-------------
.. doxygenfile:: paddle/math/MemoryHandle.h
Allocator
---------
.. doxygenfile:: paddle/math/Allocator.h
PoolAllocator
`````````````
.. doxygenfile:: paddle/math/PoolAllocator.h
Storage
-------
.. doxygenfile:: paddle/math/Storage.h
Utils Documents
====================
.. toctree::
:maxdepth: 3
utils.rst
Vector
======
BaseVector
``````````
.. doxygenclass:: paddle::BaseVector
:members:
Vector Template
```````````````
.. doxygenclass:: paddle::VectorT
:members:
CpuVector Template
``````````````````
.. doxygenclass:: paddle::CpuVectorT
:members:
GpuVector Template
``````````````````
.. doxygenclass:: paddle::GpuVectorT
:members:
ParallelCpuVector Template
``````````````````````````
.. doxygenclass:: paddle::ParallelCpuVectorT
:members:
ParallelGpuVector Template
``````````````````````````
.. doxygenclass:: paddle::ParallelGpuVectorT
:members:
CpuGpuVector Template
`````````````````````
.. doxygenclass:: paddle::CpuGpuVectorT
:members:
Parameter
=========
.. toctree::
:maxdepth: 2
parameter.rst
optimizer.rst
updater.rst
Optimizer
=========
ParameterOptimizer
------------------
.. doxygenfile:: paddle/parameter/ParameterOptimizer.h
Regularizer
-----------
.. doxygenfile:: paddle/parameter/Regularizer.h
FirstOrderOptimizer
-------------------
.. doxygenfile:: paddle/parameter/FirstOrderOptimizer.h
AverageOptimizer
----------------
.. doxygenfile:: paddle/parameter/AverageOptimizer.h
OptimizerWithRegularizer
------------------------
.. doxygenfile:: paddle/parameter/OptimizerWithRegularizer.h
Parameter
=========
Parameter
---------
.. doxygenfile:: paddle/parameter/Argument.h
.. doxygenfile:: paddle/parameter/Parameter.h
.. doxygenfile:: paddle/parameter/ParallelParameter.h
Weight
------
.. doxygenfile:: paddle/parameter/Weight.h
Parameter Documents
====================
.. toctree::
:maxdepth: 3
parameter.rst
Parameter Documents
====================
.. toctree::
:maxdepth: 3
update.rst
Updater
=======
Base
----
.. doxygenfile:: paddle/parameter/ParameterUpdaterBase.h
Hook
----
.. doxygenfile:: paddle/parameter/ParameterUpdaterHook.h
Functions
---------
.. doxygenfile:: paddle/parameter/ParameterUpdateFunctions.h
Client
======
BaseClient
----------
.. doxygenclass:: paddle::BaseClient
:members:
ParameterClient2
----------------
.. doxygenclass:: paddle::ParameterClient2
:members:
Client
=========
.. doxygenclass:: paddle::BaseClient
:members:
:protected-members:
:private-members:
:undoc-members:
.. doxygenclass:: paddle::ParameterClient2
:members:
:protected-members:
:private-members:
:undoc-members:
Client Documents
====================
.. toctree::
:maxdepth: 3
client.rst
PServer
=======
.. toctree::
:maxdepth: 2
client.rst
network.rst
server.rst
utils.rst
Network
=======
SocketServer
------------
.. doxygenclass:: paddle::SocketServer
:members:
SocketWorker
------------
.. doxygenclass:: paddle::SocketWorker
:members:
SocketClient
------------
.. doxygenclass:: paddle::SocketClient
:members:
SocketChannel
-------------
.. doxygenclass:: paddle::SocketChannel
:members:
MessageReader
-------------
.. doxygenclass:: paddle::MsgReader
:members:
Network Documents
====================
.. toctree::
:maxdepth: 3
network.rst
Network
==========
Socket Server
----------------
.. doxygenclass:: paddle::SocketServer
:members:
:protected-members:
:private-members:
:undoc-members:
Socket Worker
----------------
.. doxygenclass:: paddle::SocketWorker
:members:
:protected-members:
:private-members:
:undoc-members:
Socket Client
----------------
.. doxygenclass:: paddle::SocketClient
:members:
:protected-members:
:private-members:
:undoc-members:
Socket Channel
---------------
.. doxygenclass:: paddle::SocketChannel
:members:
:protected-members:
:private-members:
:undoc-members:
Message Reader
---------------
.. doxygenclass:: paddle::MsgReader
:members:
:protected-members:
:private-members:
:undoc-members:
Server
======
ProtoServer
-----------
.. doxygenclass:: paddle::ProtoServer
:members:
ParameterServer2
----------------
.. doxygenclass:: paddle::ParameterServer2
:members:
Server Documents
====================
.. toctree::
:maxdepth: 3
server.rst
Server
==========
.. doxygenclass:: paddle::ProtoServer
:members:
:protected-members:
:private-members:
:undoc-members:
.. doxygenclass:: paddle::ParameterServer2
:members:
:protected-members:
:private-members:
:undoc-members:
......@@ -14,7 +14,7 @@ RemoteParameterUpdater
:members:
ConcurrentRemoteParameterUpdater
--------------------------------
.. doxygenclass:: paddle::ConcurrentRemoteParameterUpdater
:members:
......
CustomStackTrace
================
class CustomStackTrace
----------------------
.. doxygenclass:: paddle::CustomStackTrace
:members:
Enumeration wrapper
===================
namespace paddle::enumeration_wrapper
-------------------------------------
.. doxygennamespace:: paddle::enumeration_wrapper
Utils
=====
.. toctree::
:maxdepth: 2
lock.rst
queue.rst
thread.rst
customStackTrace.rst
enum.rst
Lock
====
RWLock
------
.. doxygenclass:: paddle::RWLock
:members:
ReadLockGuard
-------------
.. doxygenclass:: paddle::ReadLockGuard
:members:
SpinLock
--------
.. doxygenclass:: paddle::SpinLock
:members:
Semaphore
---------
.. doxygenclass:: paddle::Semaphore
:members:
ThreadBarrier
-------------
.. doxygenclass:: paddle::ThreadBarrier
:members:
LockedCondition
---------------
.. doxygenclass:: paddle::LockedCondition
:members:
Queue
=====
Queue
-----
.. doxygenclass:: paddle::Queue
:members:
BlockingQueue
-------------
.. doxygenclass:: paddle::BlockingQueue
:members:
Thread
======
Thread
------
.. doxygenclass:: paddle::Thread
:members:
ThreadWorker
------------
.. doxygenclass:: paddle::ThreadWorker
:members:
SyncThreadPool
--------------
.. doxygenclass:: paddle::SyncThreadPool
:members:
MultiThreadWorker
-----------------
.. doxygenclass:: paddle::MultiThreadWorker
:members:
AsyncThreadPool
---------------
.. doxygenclass:: paddle::AsyncThreadPool
:members:
MKL_ROOT,Path to MKL; ${MKL_ROOT}/include must contain mkl.h and ${MKL_ROOT}/lib must contain the mkl_core / mkl_sequential / mkl_intel_lp64 libraries
ATLAS_ROOT,Path to the ATLAS library; ${ATLAS_ROOT}/include must contain cblas.h and ${ATLAS_ROOT}/lib must contain the cblas and atlas libraries
OPENBLAS_ROOT,${OPENBLAS_ROOT}/include must contain cblas.h and ${OPENBLAS_ROOT}/lib must contain the openblas library
REFERENCE_CBLAS_ROOT,${REFERENCE_CBLAS_ROOT}/include must contain cblas.h and ${REFERENCE_CBLAS_ROOT}/lib must contain the cblas library
\ No newline at end of file
Compile option,Description,Note
MKL_ROOT,Path to MKL,${MKL_ROOT}/include must contain mkl.h and ${MKL_ROOT}/lib must contain the mkl_core / mkl_sequential / mkl_intel_lp64 libraries.
ATLAS_ROOT,Path to ATLAS,${ATLAS_ROOT}/include must contain cblas.h and ${ATLAS_ROOT}/lib must contain the cblas and atlas libraries.
OPENBLAS_ROOT,Path to OpenBLAS,${OPENBLAS_ROOT}/include must contain cblas.h and ${OPENBLAS_ROOT}/lib must contain the openblas library.
REFERENCE_CBLAS_ROOT,Path to REFERENCE BLAS,${REFERENCE_CBLAS_ROOT}/include must contain cblas.h and ${REFERENCE_CBLAS_ROOT}/lib must contain the cblas library.
\ No newline at end of file
Option,Description,Default
WITH_GPU,Whether to build with GPU support.,Depends on whether the CUDA toolchain is found
WITH_DOUBLE,Whether to use double-precision floating point.,No
WITH_DSO,Whether to load the CUDA libraries dynamically at runtime rather than linking them statically.,Yes
WITH_AVX,Whether to build PaddlePaddle binaries with AVX instructions,Yes
WITH_PYTHON,Whether to embed a Python interpreter; convenient for embedded use.,Yes
WITH_STYLE_CHECK,Whether to run code style checks at build time,Yes
WITH_RDMA,Whether to enable RDMA support,No
WITH_GLOG,Whether to use GLOG; if not a simplified logging implementation is used instead; convenient for embedded use.,Depends on whether GLOG is found
WITH_GFLAGS,Whether to use GFLAGS; if not a simplified command-line parser is used instead; convenient for embedded use.,Depends on whether GFLAGS is found
WITH_TIMER,Whether to enable timing; timing makes runs slightly slower and logs more verbose but helps debugging and benchmarking,No
WITH_TESTING,Whether to enable unit tests,Depends on whether gtest is found
WITH_DOC,Whether to build the English documentation,No
WITH_DOC_CN,Whether to build the Chinese documentation,No
WITH_SWIG_PY,Whether to build the Python SWIG interface; the SWIG interface makes inference and customized training easy,Depends on whether swig is found
Option,Description,Default
WITH_GPU,Whether to support GPU.,Depends on whether the CUDA toolchain is found
WITH_DOUBLE,Whether to use double-precision floating point.,No
WITH_DSO,Whether to load the CUDA libraries dynamically at runtime rather than linking them statically.,Yes
WITH_AVX,Whether to build PaddlePaddle binaries with AVX instructions,Yes
WITH_PYTHON,Whether to embed a Python interpreter; helpful for future embedded ports.,Yes
WITH_STYLE_CHECK,Whether to run code style checks at build time,Yes
WITH_RDMA,Whether to enable RDMA,No
WITH_GLOG,Whether to enable GLOG; if not a simplified logger is used instead; also helpful for future embedded ports.,Depends on whether GLOG is found
WITH_GFLAGS,Whether to use GFLAGS; if not a simplified command-line parser is used instead; also helpful for future embedded ports.,Depends on whether GFLAGS is found
WITH_TIMER,Whether to enable timing; timing makes runs slightly slower and logs more verbose but eases debugging and benchmarking,No
WITH_TESTING,Whether to enable unit tests,Depends on whether GTEST is found
WITH_DOC,Whether to build the Chinese and English documentation,No
WITH_SWIG_PY,Whether to build the Python SWIG interface; it can be used for inference and customized training,Depends on whether SWIG is found
\ No newline at end of file
Setting PaddlePaddle's Compile Options
======================================
PaddlePaddle's compile options can be set when invoking cmake. cmake is a cross-platform build script; invoking
cmake turns the cmake project files into makefiles for each platform. For detailed cmake usage please refer to
`the official cmake documentation <https://cmake.org/cmake-tutorial>`_ .
PaddlePaddle's compile options control whether a CPU or GPU binary is generated, which blas library is linked, and so on. The full
list of compile options follows.
PaddlePaddle's compile options
------------------------------
Bool compile options
++++++++++++++++++++
The following options can be set on the cmake command line using the -D flag. For example
:code:`cmake -D WITH_GPU=OFF`
.. csv-table:: PaddlePaddle's bool compile options
:widths: 1, 7, 2
:file: compile_options.csv
blas-related compile options
++++++++++++++++++++++++++++
PaddlePaddle can use any one of the cblas implementations `MKL <https://software.intel.com/en-us/intel-mkl>`_ ,
`Atlas <http://math-atlas.sourceforge.net/>`_ ,
`OpenBlas <http://www.openblas.net/>`_ and
`reference Blas <http://www.netlib.org/blas/>`_ .
Which blas is used is determined by paths specified at compile time.
When cmake runs it first searches the system paths (/usr/lib\:/usr/local/lib) for these blas implementations, and it
also reads the related path variables. The path variables are\:
.. csv-table:: PaddlePaddle's cblas compile options
:widths: 1, 9
:header: "Compile option", "Description"
:file: cblas_settings.csv
All of these variables can be set with the -D flag, e.g. :code:`cmake -D MKL_ROOT=/opt/mkl/`. They
can also be set through environment variables before invoking cmake. For example
.. code-block:: bash
export MKL_ROOT=/opt/mkl
cmake
Note that these variables only take effect the first time cmake runs. To reset
them afterwards, it is recommended to wipe the build directory ( :code:`rm -rf` ) and then set them again.
cuda/cudnn-related compile options
++++++++++++++++++++++++++++++++++
PaddlePaddle can be compiled and run with any cudnn version from cudnn v2 onwards, but keep the cudnn used for
building and the one used at runtime the same version as far as possible. We recommend the latest version, cudnn v5.1.
The CUDNN install path can be configured at cmake time via :code:`CUDNN_ROOT`, again using
-D, e.g. :code:`cmake -D CUDNN_ROOT=/opt/cudnnv5` .
Note that these variables only take effect the first time cmake runs. To reset
them afterwards, it is recommended to wipe the build directory ( :code:`rm -rf` ) and then set them again.
PaddlePaddle's Compile Options
==============================
PaddlePaddle's compile options cover things such as which CPU/GPU binaries to generate and which BLAS library to link. They can be set when invoking cmake; for detailed cmake usage please refer to the `official documentation <https://cmake.org/cmake-tutorial>`_ .
Bool compile options
--------------------
These options are set on the cmake command line with the ``-D`` flag, for example
.. code-block:: bash
cmake .. -DWITH_GPU=OFF
.. csv-table:: Bool compile options
:widths: 1, 7, 2
:file: compile_options.csv
BLAS/CUDA/Cudnn compile options
-------------------------------
BLAS
+++++
PaddlePaddle supports any one of the following BLAS libraries: `MKL <https://software.intel.com/en-us/intel-mkl>`_ , `ATLAS <http://math-atlas.sourceforge.net/>`_ , `OpenBlAS <http://www.openblas.net/>`_ and `REFERENCE BLAS <http://www.netlib.org/blas/>`_ .
.. csv-table:: BLAS path-related compile options
:widths: 1, 2, 7
:file: cblas_settings.csv
CUDA/Cudnn
+++++++++++
PaddlePaddle can be compiled and run with any cudnn version from cudnn v2 onwards, but please keep the cudnn used for building and the one used at runtime the same version. We recommend the latest version, cudnn v5.1.
Setting these options
+++++++++++++++++++++
PaddlePaddle locates the BLAS/CUDA/Cudnn libraries through paths specified at compile time. When cmake runs it first searches the system paths (/usr/lib\:/usr/local/lib) for these libraries, and it also reads the related path variables. They can be set with the ``-D`` flag, for example
.. code-block:: bash
cmake .. -DMKL_ROOT=/opt/mkl/ -DCUDNN_ROOT=/opt/cudnnv5
Note: these options only take effect the first time cmake runs. To reset them later, it is recommended to wipe the whole build directory (``rm -rf``) and then set them again.
\ No newline at end of file
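A minimal sketch of the clean-reconfigure workflow described above (the out-of-source ``build`` directory and the option values are assumptions for illustration):

.. code-block:: bash

    # wipe the cached configuration, then configure from scratch
    rm -rf build
    mkdir build && cd build
    cmake .. -DWITH_GPU=ON -DMKL_ROOT=/opt/mkl/ -DCUDNN_ROOT=/opt/cudnnv5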
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "PaddleAPI.h"
#include "PaddleAPIPrivate.h"
......@@ -112,7 +111,7 @@ void Arguments::setSlotSequenceStartPositions(size_t idx,
}
void Arguments::setSlotSubSequenceStartPositions(
size_t idx, IVector* vec) throw(RangeError) {
auto& a = m->getArg(idx);
auto& v = m->cast<paddle::IVector>(vec->getSharedPtr());
a.subSequenceStartPositions = std::make_shared<paddle::ICpuGpuVector>(v);
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "PaddleAPI.h"
#include "PaddleAPIPrivate.h"
#include "paddle/trainer/Trainer.h"
......@@ -44,8 +43,7 @@ TrainerConfig* TrainerConfig::createFromTrainerConfigFile(
return retv;
}
TrainerConfig* TrainerConfig::createFromProtoString(const std::string& str) {
auto retv = new TrainerConfig();
paddle::TrainerConfig trainerConfigProto;
auto conf = std::make_shared<paddle::TrainerConfigHelper>(trainerConfigProto);
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "PaddleAPI.h"
#include "PaddleAPIPrivate.h"
......@@ -27,7 +26,8 @@ GradientMachine::GradientMachine() : m(new GradientMachinePrivate()) {}
GradientMachine::~GradientMachine() { delete m; }
GradientMachine* GradientMachine::createFromPaddleModelPtr(
const void* confPtr,
GradientMatchineCreateMode mode,
const std::vector<int>& types) {
auto& conf = *(const paddle::ModelConfig*)(confPtr);
std::vector<ParameterType> realTypes;
......@@ -44,7 +44,8 @@ GradientMachine* GradientMachine::createFromPaddleModelPtr(
}
GradientMachine* GradientMachine::createByConfigProtoStr(
const std::string& protoStr,
GradientMatchineCreateMode mode,
const std::vector<int>& types) {
paddle::ModelConfig conf;
conf.ParseFromString(protoStr);
......@@ -56,13 +57,15 @@ GradientMachine* GradientMachine::createByConfigProtoStr(
}
GradientMachine* GradientMachine::createByModelConfig(
ModelConfig* conf,
GradientMatchineCreateMode mode,
const std::vector<int>& types) {
auto confPtr = &conf->m->conf->getModelConfig();
return GradientMachine::createFromPaddleModelPtr(confPtr, mode, types);
}
void GradientMachine::forward(const Arguments& inArgs,
Arguments* outArgs,
PassType passType) {
auto& in =
m->cast<std::vector<paddle::Argument>>(inArgs.getInternalArgumentsPtr());
......@@ -99,7 +102,8 @@ void GradientMachine::backward(const UpdateCallback& callback) {
}
void GradientMachine::forwardBackward(const Arguments& inArgs,
Arguments* outArgs,
PassType passType,
const UpdateCallback& callback) {
auto& in =
m->cast<std::vector<paddle::Argument>>(inArgs.getInternalArgumentsPtr());
......@@ -129,7 +133,7 @@ Parameter* GradientMachine::getParameter(size_t i) throw(RangeError) {
void GradientMachine::randParameters() { m->machine->randParameters(); }
Matrix* GradientMachine::getLayerOutput(const std::string& layerName) const
throw(UnsupportError) {
auto nn = std::dynamic_pointer_cast<paddle::NeuralNetwork>(m->machine);
if (nn) {
auto mat = nn->getLayerOutput(layerName);
......@@ -140,8 +144,11 @@ Matrix* GradientMachine::getLayerOutput(const std::string& layerName) const
}
SequenceGenerator* GradientMachine::asSequenceGenerator(
const std::vector<std::string>& dict,
size_t begin_id,
size_t end_id,
size_t max_length,
size_t beam_size) {
SequenceGenerator* r =
SequenceGenerator::createByGradientMachineSharedPtr(&m->machine);
r->setDict(dict);
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include "PaddleAPI.h"
......@@ -23,7 +22,8 @@ limitations under the License. */
template <typename T1, typename T2>
void staticCastVector(std::vector<T2>* dest, const std::vector<T1>& src) {
dest->resize(src.size());
std::transform(src.begin(),
src.end(),
dest->begin(),
[](T1 t) { return static_cast<T2>(t); });
}
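// A minimal usage sketch of staticCastVector (the element types here are
// illustrative assumptions): element-wise static_cast from one vector
// type into another.
//
//   std::vector<double> src = {1.5, 2.5};
//   std::vector<float> dest;
//   staticCastVector(&dest, src);  // dest now holds {1.5f, 2.5f}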
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "PaddleAPI.h"
#include "paddle/math/Matrix.h"
#include "paddle/math/SparseMatrix.h"
......@@ -44,17 +43,21 @@ Matrix* Matrix::createZero(size_t height, size_t width, bool useGpu) {
return m;
}
Matrix* Matrix::createDense(const std::vector<float>& data,
size_t height,
size_t width,
bool useGpu) {
auto m = new Matrix();
m->m->mat = paddle::Matrix::create(height, width, useGpu);
m->m->mat->copyFrom(data.data(), data.size());
return m;
}
Matrix* Matrix::createDenseFromNumpy(float* data,
int dim1,
int dim2,
bool copy,
bool useGpu) throw(UnsupportError) {
if (useGpu) {
/// Gpu mode only supports copy=True
if (!copy) {
......@@ -66,7 +69,9 @@ Matrix* Matrix::createDenseFromNumpy(float* data, int dim1, int dim2,
}
}
Matrix* Matrix::createCpuDenseFromNumpy(float* data,
int dim1,
int dim2,
bool copy) {
auto m = new Matrix();
if (copy) {
......@@ -85,12 +90,20 @@ Matrix* Matrix::createGpuDenseFromNumpy(float* data, int dim1, int dim2) {
return m;
}
Matrix* Matrix::createSparse(size_t height,
size_t width,
size_t nnz,
bool isNonVal,
bool isTrans,
bool useGpu) {
auto m = new Matrix();
m->m->mat = paddle::Matrix::createSparseMatrix(
height,
width,
nnz,
isNonVal ? paddle::NO_VALUE : paddle::FLOAT_VALUE,
isTrans,
useGpu);
return m;
}
......@@ -221,7 +234,8 @@ FloatArray Matrix::getData() const {
}
void Matrix::sparseCopyFrom(
const std::vector<int>& rows,
const std::vector<int>& cols,
const std::vector<float>& vals) throw(UnsupportError) {
auto cpuSparseMat =
std::dynamic_pointer_cast<paddle::CpuSparseMatrix>(m->mat);
......@@ -240,7 +254,8 @@ void Matrix::sparseCopyFrom(
void* Matrix::getSharedPtr() const { return &m->mat; }
void Matrix::toNumpyMatInplace(float** view_data,
int* dim1,
int* dim2) throw(UnsupportError) {
auto cpuMat = std::dynamic_pointer_cast<paddle::CpuMatrix>(m->mat);
if (cpuMat) {
......@@ -251,7 +266,8 @@ void Matrix::toNumpyMatInplace(float** view_data, int* dim1,
throw UnsupportError();
}
}
void Matrix::copyToNumpyMat(float** view_m_data,
int* dim1,
int* dim2) throw(UnsupportError) {
static_assert(sizeof(paddle::real) == sizeof(float),
"Currently PaddleAPI only support for single "
......@@ -269,8 +285,8 @@ void Matrix::copyToNumpyMat(float** view_m_data, int* dim1,
} else if (auto gpuMat = dynamic_cast<paddle::GpuMatrix*>(m->mat.get())) {
auto src = gpuMat->getData();
auto dest = *view_m_data;
hl_memcpy_device2host(
dest, src, sizeof(paddle::real) * (*dim1) * (*dim2));
} else {
LOG(WARNING) << "Unexpected Situation";
throw UnsupportError();
......@@ -278,7 +294,8 @@ void Matrix::copyToNumpyMat(float** view_m_data, int* dim1,
}
}
void Matrix::copyFromNumpyMat(float* data,
int dim1,
int dim2) throw(UnsupportError, RangeError) {
if (isSparse()) {
throw UnsupportError();
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <stddef.h>
......@@ -61,8 +60,8 @@ class RangeError {};
/// Unsupported-operation error, such as accessing GPU memory directly, etc.
class UnsupportError : public std::runtime_error {
public:
UnsupportError() : std::runtime_error(" "){};
UnsupportError(const std::string& message) : std::runtime_error(message){};
};
/// This type will map to python's list of float.
......@@ -112,7 +111,8 @@ public:
/**
* Create a matrix of size height x width, filled with zeros.
*/
static Matrix* createZero(size_t height,
size_t width,
bool useGpu = isUsingGpu());
/**
......@@ -124,8 +124,11 @@ public:
*
* @note the default sparse type is SPARSE_CSR.
*/
static Matrix* createSparse(size_t height,
size_t width,
size_t nnz,
bool isNonVal = true,
bool trans = false,
bool useGpu = isUsingGpu());
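// Illustrative sketch (the dimensions are made up): reserve room for 50
// non-zero 0/1-valued entries in a 100x200 matrix, in the default
// SPARSE_CSR layout, on whatever device isUsingGpu() selects:
//
//   Matrix* mat = Matrix::createSparse(100, 200, 50);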
/**
......@@ -134,13 +137,17 @@ public:
* @param data list of floats, as passed in from Python.
* @note the values will be copied into a new matrix.
*/
static Matrix* createDense(const std::vector<float>& data,
size_t height,
size_t width,
bool useGpu = isUsingGpu());
static Matrix* createDenseFromNumpy(
float* data,
int dim1,
int dim2,
bool copy = true,
bool useGpu = isUsingGpu()) throw(UnsupportError);
/**
* Create Cpu Dense Matrix from numpy matrix, dtype=float32
......@@ -151,7 +158,9 @@ public:
* @param copy true if copy into a new matrix, false will create
* matrix inplace.
*/
static Matrix* createCpuDenseFromNumpy(float* data,
int dim1,
int dim2,
bool copy = false);
/// Create Gpu Dense Matrix from numpy matrix, dtype=float32
......@@ -171,11 +180,13 @@ public:
* numpy_mat = m.toNumpyMat()
* @endcode
*/
void toNumpyMatInplace(float** view_data,
int* dim1,
int* dim2) throw(UnsupportError);
/// Copy To numpy mat.
void copyToNumpyMat(float** view_m_data,
int* dim1,
int* dim2) throw(UnsupportError);
/// Copy From Numpy Mat
......@@ -248,15 +259,18 @@ public:
static Vector* create(const std::vector<float>& data,
bool useGpu = isUsingGpu());
static Vector* createVectorFromNumpy(
float* data,
int dim,
bool copy = true,
bool useGpu = isUsingGpu()) throw(UnsupportError);
/**
* Create Cpu Vector from numpy array, which dtype=float32
*
* If copy is false, it will create vector inplace.
*/
static Vector* createCpuVectorFromNumpy(float* data,
int dim,
bool copy = false);
/// Create Gpu Vector from numpy array, which dtype=float32
......@@ -312,16 +326,19 @@ public:
static IVector* create(const std::vector<int>& data,
bool useGpu = isUsingGpu());
static IVector* createVectorFromNumpy(
int* data,
int dim,
bool copy = true,
bool useGpu = isUsingGpu()) throw(UnsupportError);
/**
* Create Cpu IVector from numpy array, which dtype=int32
*
* If copy is false, it will create vector inplace
*/
static IVector* createCpuVectorFromNumpy(int* data,
int dim,
bool copy = false);
/**
* Create Gpu IVector from numpy array, which dtype=int32
......@@ -605,7 +622,8 @@ class ParameterTraverseCallback {
public:
~ParameterTraverseCallback();
void apply(const std::vector<Vector*>& vecs,
const ParameterConfig& config,
size_t sparseId);
private:
......@@ -638,7 +656,8 @@ public:
void finishBatch();
void update(const std::vector<Vector*>& vecs,
const ParameterConfig& conf,
size_t sparseId = NO_SPARSE_ID);
std::vector<int> getParameterTypes() const;
......@@ -678,7 +697,8 @@ public:
* model config by TrainerConfig
*/
static GradientMachine* createByModelConfig(
ModelConfig* conf,
GradientMatchineCreateMode mode = CREATE_MODE_NORMAL,
const std::vector<int>& parameterTypes = defaultParamTypes);
/**
......@@ -701,7 +721,8 @@ public:
/**
* Combine forward/backward
*/
void forwardBackward(const Arguments& inArgs,
Arguments* outArgs,
PassType passType,
const UpdateCallback& callback = UpdateCallback());
......@@ -722,14 +743,17 @@ public:
*/
SequenceGenerator* asSequenceGenerator(
const std::vector<std::string>& dict = std::vector<std::string>(),
size_t begin_id = 0UL,
size_t end_id = 0UL,
size_t max_length = 100UL,
size_t beam_size = -1UL);
private:
GradientMachinePrivate* m;
static GradientMachine* createFromPaddleModelPtr(
const void* confPtr,
GradientMatchineCreateMode mode,
const std::vector<int>& types);
// Not to use c++ 11 init-list, so we use static var as function default arg.
......@@ -751,8 +775,8 @@ public:
/// Create A Trainer By TrainerConfig. using paddle command line.
static Trainer* createByCommandLine() throw(IOError);
static Trainer* create(TrainerConfig* optConfig,
GradientMachine* gm) throw(IOError);
/// Start training
void startTrain();
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "PaddleAPI.h"
#include "paddle/parameter/Parameter.h"
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "PaddleAPI.h"
#include "PaddleAPIPrivate.h"
#include "paddle/parameter/ParameterOptimizer.h"
......@@ -32,17 +31,21 @@ struct ParameterTraverseCallbackPrivate {
const paddle::ParameterOptimizer::TraverseCallback& callback)
: callback(callback) {}
void apply(const std::vector<Vector*>& vecs,
const ParameterConfig& conf,
size_t sparseId) {
std::vector<paddle::VectorPtr> real_vecs;
real_vecs.resize(vecs.size());
std::transform(vecs.begin(),
vecs.end(),
real_vecs.begin(),
[](Vector* v) {
if (v) {
return *(paddle::VectorPtr*)(v->getSharedPtr());
} else {
return paddle::VectorPtr();
}
});
paddle::ParameterConfig& real_conf =
*(paddle::ParameterConfig*)(const_cast<ParameterConfig&>(conf)
......@@ -86,10 +89,12 @@ void ParameterOptimizer::startBatch(size_t numSamplesProcessed) {
void ParameterOptimizer::finishBatch() { m->optimizer->finishBatch(); }
void ParameterOptimizer::update(const std::vector<Vector*>& vecs,
const ParameterConfig& conf,
size_t sparseId) {
ParameterTraverseCallbackPrivate invoker(
[&](const paddle::VectorPtr _vecs[],
const paddle::ParameterConfig& config,
size_t sid = -1UL) { m->optimizer->update(_vecs, config, sid); });
invoker.apply(vecs, conf, sparseId);
}
......@@ -116,8 +121,9 @@ void ParameterTraverseCallback::apply(const std::vector<Vector*>& vecs,
ParameterTraverseCallback* ParameterOptimizer::needSpecialTraversal(
const ParameterConfig& config) const {
auto& param_config =
*(paddle::ParameterConfig*)const_cast<ParameterConfig&>(config)
.getRawPtr();
auto callback = m->optimizer->needSpecialTraversal(param_config);
if (callback) {
auto retCallback = new ParameterTraverseCallback();
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "PaddleAPI.h"
#include "paddle/gserver/gradientmachines/GradientMachine.h"
#include "paddle/parameter/Argument.h"
......@@ -42,8 +41,10 @@ struct Path {
// position
static void findNBest(paddle::GradientMachine* gradMachine,
std::vector<paddle::Argument>& inArgs,
std::vector<Path>& finalPaths,
size_t bos_id,
size_t eos_id,
size_t max_length) {
std::vector<Path> paths;
Path emptyPath;
paths.push_back(emptyPath);
......@@ -166,7 +167,8 @@ public:
if (id < getSize()) {
Path& p = (*path_)[id];
std::ostringstream sout;
std::transform(p.ids.begin(),
p.ids.end(),
std::ostream_iterator<std::string>(sout, split ? " " : ""),
[&](int id) { return (*dict_)[id]; });
return sout.str();
......
......@@ -64,12 +64,11 @@ Trainer* Trainer::createByCommandLine() throw(IOError) {
Trainer::Trainer(TrainerConfig* config, GradientMachine* gm)
: m(new TrainerPrivate()) {
m->init(config->m->conf, /* testing= */ false, gm ? gm->m->machine : nullptr);
}
Trainer* Trainer::create(TrainerConfig* config,
GradientMachine* gm) throw(IOError) {
auto retv = new Trainer(config, gm);
if (retv->m->getConfig().IsInitialized()) {
return retv;
......@@ -134,15 +133,17 @@ void Trainer::finishTestPeriod() { m->finishTestPeriod(); }
Matrix* Trainer::getLayerOutput(const std::string& layerName) {
auto nn = std::dynamic_pointer_cast<paddle::NeuralNetwork>(
this->m->getGradientMachine());
CHECK(nn) << "trainerInternal_.getGradientMachine() is not NeuralNetwork";
auto m = nn->getLayerOutput(layerName);
return Matrix::createByPaddleMatrixPtr(&m);
}
void Trainer::forwardOneBatch(size_t batchSize) {
m->forwardOneBatch(batchSize);
}
bool TrainerPrivate::forwardOneBatch(size_t batchSize) {
CHECK(dataProvider_) << "data_provider is not specified";
paddle::DataBatch dataBatch;
int num = dataProvider_->getNextBatch(batchSize, &dataBatch);
......@@ -156,7 +157,6 @@ bool TrainerPrivate::forwardOneBatch(size_t batchSize) {
void TrainerPrivate::forwardOneDataBatch(
const std::vector<paddle::Argument>& inArgs) {
std::vector<paddle::Argument>& outArgs = forwardOutput_;
if (config_->getOptConfig().use_sparse_remote_updater()) {
......
......@@ -37,13 +37,15 @@ FloatArray::FloatArray(const float* b, const size_t l)
IntArray::IntArray(const int* b, const size_t l, bool f)
: buf(b), length(l), needFree(f) {}
IntWithFloatArray::IntWithFloatArray(const float* v,
const int* i,
size_t l,
bool f)
: valBuf(v), idxBuf(i), length(l), needFree(f) {}
bool isUsingGpu() { return FLAGS_use_gpu; }
void setUseGpu(bool useGpu) { FLAGS_use_gpu = useGpu; }
bool isGpuVersion() {
#ifdef PADDLE_ONLY_CPU
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "PaddleAPI.h"
#include "paddle/math/Vector.h"
......@@ -39,8 +38,10 @@ IVector* IVector::create(const std::vector<int>& data, bool useGpu) {
return v;
}
IVector* IVector::createVectorFromNumpy(int* data,
int dim,
bool copy,
bool useGpu) throw(UnsupportError) {
if (useGpu) {
/// if use gpu only copy=true is supported
if (!copy) {
......@@ -137,8 +138,8 @@ void IVector::copyToNumpyArray(int** view_m_data, int* dim1) {
if (auto cpuVec = dynamic_cast<paddle::CpuIVector*>(m->vec.get())) {
std::memcpy(*view_m_data, cpuVec->getData(), sizeof(int) * (*dim1));
} else if (auto gpuVec = dynamic_cast<paddle::GpuIVector*>(m->vec.get())) {
hl_memcpy_device2host(
*view_m_data, gpuVec->getData(), sizeof(int) * (*dim1));
} else {
LOG(INFO) << "Unexpected situation";
}
......@@ -201,8 +202,10 @@ Vector* Vector::createByPaddleVectorPtr(void* ptr) {
}
}
Vector* Vector::createVectorFromNumpy(float* data,
int dim,
bool copy,
bool useGpu) throw(UnsupportError) {
if (useGpu) {
/// if use gpu only copy=True is supported
if (!copy) {
......@@ -251,8 +254,8 @@ void Vector::copyToNumpyArray(float** view_m_data, int* dim1) {
if (auto cpuVec = dynamic_cast<paddle::CpuVector*>(m->vec.get())) {
std::memcpy(*view_m_data, cpuVec->getData(), sizeof(float) * (*dim1));
} else if (auto gpuVec = dynamic_cast<paddle::GpuVector*>(m->vec.get())) {
hl_memcpy_device2host(
*view_m_data, gpuVec->getData(), sizeof(float) * (*dim1));
} else {
LOG(INFO) << "Unexpected situation";
}
......
......@@ -81,5 +81,8 @@ else()
add_library(paddle_cuda ${CUDA_SOURCES})
endif()
add_style_check_target(paddle_cuda
${CUDA_SOURCES}
${CUDA_HEADERS}
${CUDA_DSO_SOURCES}
${CUDA_CXX_WITH_GPU_SOURCES})
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_ACTIVATION_FUNCTIONS_H_
#define HL_ACTIVATION_FUNCTIONS_H_
......@@ -21,11 +20,8 @@ limitations under the License. */
/**
* Active functions: sigmoid, relu, tanh and linear.
*/
#define HPPL_ACTIVE_FUNCTION \
{ hppl::sigmoid, hppl::relu, hppl::tanh, hppl::linear }
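/* The tables below line up with hl_activation_mode_t as defined in
 * hl_base.h (sigmoid, relu, tanh, linear), so as an illustrative sketch
 * cpu::forward[HL_ACTIVATION_RELU] resolves to hppl::relu. */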
namespace hppl {
......@@ -42,18 +38,18 @@ public:
#ifdef __NVCC__
namespace gpu {
static __device__ Active<real>::forward forward[] = HPPL_ACTIVE_FUNCTION;
static __device__ Active<real>::backward backward[] = HPPL_ACTIVE_FUNCTION;
}
#else
namespace cpu {
static Active<real>::forward forward[] = HPPL_ACTIVE_FUNCTION;
static Active<real>::backward backward[] = HPPL_ACTIVE_FUNCTION;
}
#ifdef __AVX__
namespace avx {
static Active<__m256>::forward forward[] = HPPL_ACTIVE_FUNCTION;
static Active<__m256>::backward backward[] = HPPL_ACTIVE_FUNCTION;
}
#endif
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_AGGREGATE_H_
#define HL_AGGREGATE_H_
......
......@@ -12,22 +12,21 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_AVX_FUNCTIONS_H_
#define HL_AVX_FUNCTIONS_H_
#include <immintrin.h>
namespace hppl {
__m256 relu(const __m256 a);
__m256 sigmoid(const __m256 a);
__m256 tanh(const __m256 a);
__m256 linear(const __m256 a);
__m256 relu(const __m256 a, const __m256 b);
__m256 sigmoid(const __m256 a, const __m256 b);
__m256 tanh(const __m256 a, const __m256 b);
__m256 linear(const __m256 a, const __m256 b);
} // namespace hppl
#endif // HL_AVX_FUNCTIONS_H_
......@@ -12,8 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_BASE_H_
#define HL_BASE_H_
......@@ -33,36 +31,36 @@ limitations under the License. */
* HPPL_STREAM_DEFAULT is HPPL default stream.
*/
typedef enum {
HPPL_STREAM_DEFAULT = 0, /* Thread Default Stream*/
HPPL_STREAM_1 = 1,
HPPL_STREAM_2 = 2,
HPPL_STREAM_3 = 3,
HPPL_STREAM_4 = 4,
HPPL_THREAD_STREAM_1 = 5,
HPPL_THREAD_STREAM_2 = 6,
HPPL_THREAD_STREAM_3 = 7,
HPPL_THREAD_STREAM_4 = 8,
HPPL_STREAM_END
} hl_stream_t;
/**
* @brief HPPL activation mode.
*/
typedef enum {
HL_ACTIVATION_SIGMOID = 0,
HL_ACTIVATION_RELU = 1,
HL_ACTIVATION_TANH = 2,
HL_ACTIVATION_LINEAR = 3,
HL_ACTIVATION_END
} hl_activation_mode_t;
/**
* @brief Transpose type.
*/
typedef enum {
HPPL_OP_N = 0, /* non transpose */
HPPL_OP_T = 1, /* transpose */
HPPL_OP_END
} hl_trans_op_t;
/**
......@@ -148,23 +146,21 @@ typedef struct {
* @brief Sparse matrix value type.
*/
typedef enum {
HL_NO_VALUE = 0, /* matrix values only 0 or 1 */
HL_FLOAT_VALUE = 1,
HL_VALUE_END
} hl_matrix_value_t;
/**
* @brief HPPL matrix format.
*/
typedef enum {
HL_SPARSE_CSR = 0,
HL_SPARSE_CSC = 1,
HL_SPARSE_END
} hl_matrix_format_t;
typedef struct _hl_matrix_s *hl_matrix_s;
/**
* @brief HPPL sparse matrix.
......@@ -177,12 +173,12 @@ typedef struct _hl_matrix_s * hl_matrix_s;
* @param nnz nonzero values of sparse matrix.
*/
typedef struct {
hl_matrix_s matrix;
hl_matrix_format_t format;
hl_matrix_value_t type;
int rows;
int cols;
size_t nnz;
} _hl_sparse_matrix_s, *hl_sparse_matrix_s;
#ifndef PADDLE_TYPE_DOUBLE
......@@ -195,7 +191,7 @@ typedef struct {
*
* HL_FLOAT_MIN: 1.17549435e-38F
*/
#define HL_FLOAT_MAX 3.40282347e+38F
/**
* if real == double
*
......@@ -203,20 +199,18 @@ typedef struct {
*
* HL_FLOAT_MIN: 2.2250738585072014e-308
*/
#define HL_FLOAT_MIN 1.17549435e-38F
#else
#define HL_FLOAT_MAX 1.7976931348623157e+308
#define HL_FLOAT_MIN 2.2250738585072014e-308
#endif
/**
* The maximum input value for exp, used to avoid overflow problem.
*
* Currently only used for tanh function.
*/
#define EXP_MAX_INPUT 40.0
/**
* @brief DIVUP(x, y) is similar to ceil(x / y).
......@@ -224,7 +218,7 @@ typedef struct {
* the size of blockDim.
*/
#ifndef DIVUP
#define DIVUP(x, y) (((x) + (y)-1) / (y))
#endif
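/* Illustrative: DIVUP(10, 3) == 4. A typical use (the names are
 * hypothetical) is sizing a kernel launch so every element is covered:
 *
 *   int numBlocks = DIVUP(numElements, blockSize);
 */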
#ifdef __NVCC__
......@@ -233,7 +227,7 @@ typedef struct {
#include "hl_cuda.h"
#include "cuda_runtime.h"
extern __thread bool g_sync_flag;
extern __thread cudaStream_t default_stream;
#define STREAM_DEFAULT default_stream
......@@ -241,16 +235,15 @@ extern __thread cudaStream_t default_stream;
* @brief Check cuda kernel execution.
* @param msg error string
*/
#define CHECK_SYNC(msg) \
if (true == g_sync_flag) { \
hl_stream_synchronize(HPPL_STREAM_DEFAULT); \
cudaError_t err = (cudaError_t)hl_get_device_last_error(); \
CHECK_EQ(cudaSuccess, err) \
<< "[" << msg << "] " \
<< "CUDA error: " << hl_get_device_error_string((size_t)err); \
}
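/* Illustrative usage (the kernel name is hypothetical): invoke CHECK_SYNC
 * right after a kernel launch so errors surface when g_sync_flag is set:
 *
 *   myKernel<<<grid, block>>>(args);
 *   CHECK_SYNC("myKernel failed");
 */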
#endif /* __NVCC__ */
#endif /* HL_BASE_H_ */
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_BATCH_TRANSPOSE_H_
#define HL_BATCH_TRANSPOSE_H_
......@@ -31,10 +30,7 @@ limitations under the License. */
* order. Each batch has height * width data, which are
* arranged in height-first (or row-first) manner.
*/
extern void batchTranspose(
const real* input, real* output, int width, int height, int batchSize);
#endif // HL_BATCH_TRANSPOSE_H_
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_CNN_H_
#define HL_CNN_H_
......@@ -37,15 +36,21 @@ limitations under the License. */
* @param[in] alpha
* @param[in] beta
*/
extern void hl_shrink_col2feature(const real* dataCol,
size_t channels,
size_t height,
size_t width,
size_t blockH,
size_t blockW,
size_t strideH,
size_t strideW,
size_t paddingH,
size_t paddingW,
size_t outputH,
size_t outputW,
real* dataIm,
real alpha = 1.0f,
real beta = 0.0f);
/**
* @brief Expand feature to column.
......@@ -65,14 +70,19 @@ extern void hl_shrink_col2feature(
* @param[out] dataCol expand data.
*
*/
extern void hl_expand_feature2col(const real* dataIm,
size_t channels,
size_t height,
size_t width,
size_t blockH,
size_t blockW,
size_t strideH,
size_t strideW,
size_t paddingH,
size_t paddingW,
size_t outputH,
size_t outputW,
real* dataCol);
/**
* @brief Maximum pool forward.
......@@ -94,15 +104,21 @@ extern void hl_expand_feature2col(
* @param[in] tgtStride stride between output data samples.
*
*/
extern void hl_maxpool_forward(const int frameCnt,
const real* inputData,
const int channels,
const int height,
const int width,
const int pooledH,
const int pooledW,
const int sizeX,
const int sizeY,
const int strideH,
const int strideW,
const int paddingH,
const int paddingW,
real* tgtData,
const int tgtStride);
/**
* @brief Maximum pool backward.
......@@ -125,20 +141,28 @@ extern void hl_maxpool_forward(
* @param[in] paddingH padding height.
* @param[in] paddingW padding width.
* @param[out] targetGrad output grad.
* @param[in] outStride stride between output data samples.
*
*/
extern void hl_maxpool_backward(const int frameCnt,
const real* inputData,
const real* outData,
const real* outGrad,
const int channels,
const int height,
const int width,
const int pooledH,
const int pooledW,
const int sizeX,
const int sizeY,
const int strideH,
const int strideW,
const int paddingH,
const int paddingW,
real scaleA,
real scaleB,
real* targetGrad,
const int outStride);
/**
* @brief Average pool forward.
......@@ -160,15 +184,21 @@ extern void hl_maxpool_backward(
* @param[in] tgtStride stride between output data samples.
*
*/
extern void hl_avgpool_forward(const int frameCnt,
const real* inputData,
const int channels,
const int height,
const int width,
const int pooledH,
const int pooledW,
const int sizeX,
const int sizeY,
const int strideH,
const int strideW,
const int paddingH,
const int paddingW,
real* tgtData,
const int tgtStride);
/**
* @brief Average pool backward.
......@@ -189,19 +219,26 @@ extern void hl_avgpool_forward(
* @param[in] scaleA scale.
* @param[in] scaleB scale.
* @param[out] backGrad output grad.
* @param[in] outStride stride between output data samples.
*
*/
extern void hl_avgpool_backward(const int frameCnt,
const real* outGrad,
const int channels,
const int height,
const int width,
const int pooledH,
const int pooledW,
const int sizeX,
const int sizeY,
const int strideH,
const int strideW,
int paddingH,
int paddingW,
real scaleA,
real scaleB,
real* backGrad,
const int outStride);
/**
* @brief Cross-map-response normalize forward.
......@@ -218,10 +255,16 @@ extern void hl_avgpool_backward(
* @param[in] beta scale.
*
*/
extern void hl_CMRNorm_forward(size_t frameCnt,
const real* in,
real* scale,
real* out,
size_t channels,
size_t height,
size_t width,
size_t sizeX,
real alpha,
real beta);
/**
* @brief Cross-map-response normalize backward.
......@@ -240,11 +283,18 @@ extern void hl_CMRNorm_forward(
* @param[in] beta scale.
*
*/
extern void hl_CMRNorm_backward(size_t frameCnt,
const real* inV,
const real* scale,
const real* outV,
const real* outDiff,
real* inDiff,
size_t channels,
size_t height,
size_t width,
size_t sizeX,
real alpha,
real beta);
/**
* @brief Bilinear interpolation forward.
......@@ -278,24 +328,24 @@ extern void hl_bilinear_forward(const real* inData,
const real ratioH,
const real ratioW);
/**
* @brief Bilinear interpolation backward.
*
* @param[out] inGrad input gradient.
* @param[in] inImgH input image height.
* @param[in] inImgW input image width.
* @param[in] inputH input batchSize.
* @param[in] inputW input image data dim.
* @param[in] outGrad output gradient.
* @param[in] outImgH output image height.
* @param[in] outImgW output image width.
* @param[in] outputH output batchSize.
* @param[in] outputW output image data dim.
* @param[in] numChannels number of channels.
* @param[in] ratioH inImgH / outImgH.
* @param[in] ratioW inImgW / outImgW.
*
*/
extern void hl_bilinear_backward(real* inGrad,
const size_t inImgH,
const size_t inImgW,
......@@ -321,9 +371,13 @@ extern void hl_bilinear_backward(real* inGrad,
* @param[in] featLen feature length = image height * image width.
* @param[in] groups number of groups.
*/
extern void hl_maxout_forward(const real* inData,
real* outData,
int* idData,
size_t batchSize,
size_t size,
size_t featLen,
size_t groups);
/**
* @brief MaxOut backward.
......@@ -336,8 +390,12 @@ extern void hl_maxout_forward(
* @param[in] featLen feature length = image height * image width.
* @param[in] groups number of groups.
*/
extern void hl_maxout_backward(real* inGrad,
const real* outGrad,
const int* idData,
size_t batchSize,
size_t size,
size_t featLen,
size_t groups);
#endif /* HL_CNN_H_ */
......@@ -12,18 +12,16 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_CUDA_H_
#define HL_CUDA_H_
#include "hl_base.h"
#include <string>
#include "hl_base.h"
/**
* @brief HPPL event.
*/
typedef struct _hl_event_st *hl_event_t;
/**
* @brief return cuda runtime api version.
......@@ -42,7 +40,7 @@ extern void hl_start();
* if device is NULL, will start all GPU.
* @param[in] number number of devices.
*/
extern void hl_specify_devices_start(int *device, int number);
/**
* @brief Queries if a device may directly access a peer device's memory.
......@@ -126,7 +124,7 @@ extern int hl_get_device();
*
* @return dest_d pointer to device memory.
*/
extern void* hl_malloc_device(size_t size);
extern void *hl_malloc_device(size_t size);
/**
* @brief Free device memory.
......@@ -143,7 +141,7 @@ extern void hl_free_mem_device(void *dest_d);
*
* @return dest_h pointer to host memory.
*/
extern void* hl_malloc_host(size_t size);
extern void *hl_malloc_host(size_t size);
/**
* @brief Free host page-lock memory.
......@@ -228,9 +226,9 @@ extern void hl_srand(unsigned int seed);
* @param[in] stream stream id.
*/
extern void hl_memcpy_async(void *dst,
                            void *src,
                            size_t size,
                            hl_stream_t stream);
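A hedged usage sketch of the async-copy pattern: enqueue the copy on a stream, queue dependent work on the same stream, then block. hl_stream_synchronize is assumed from the (partly collapsed) stream section below; check the header for its exact signature.

void h2d_copy_example(void* dst_d, void* src_h, size_t bytes,
                      hl_stream_t stream) {
  hl_memcpy_async(dst_d, src_h, bytes, stream);  // returns immediately
  // ... enqueue kernels that consume dst_d on the same stream ...
  hl_stream_synchronize(stream);  // assumed API; declaration collapsed here
}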
/**
* @brief Waits for stream tasks to complete.
......@@ -261,8 +259,7 @@ extern void hl_destroy_event(hl_event_t event);
*
* @return time Time between start and end in ms.
*/
extern float hl_event_elapsed_time(hl_event_t start, hl_event_t end);
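A hedged timing sketch: bracket GPU work with two events and read the elapsed milliseconds. hl_create_event, hl_stream_record_event, hl_event_synchronize, and HPPL_STREAM_DEFAULT are assumed from collapsed parts of this header (hl_destroy_event is referenced just above); verify the exact signatures before use.

void time_gpu_work() {
  hl_event_t start, stop;
  hl_create_event(&start);  // assumed creation API
  hl_create_event(&stop);
  hl_stream_record_event(HPPL_STREAM_DEFAULT, start);
  // ... launch kernels on HPPL_STREAM_DEFAULT ...
  hl_stream_record_event(HPPL_STREAM_DEFAULT, stop);
  hl_event_synchronize(stop);                     // wait for `stop`
  float ms = hl_event_elapsed_time(start, stop);  // ms between the events
  (void)ms;
  hl_destroy_event(start);
  hl_destroy_event(stop);
}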
/**
* @brief Records an event.
......@@ -300,7 +297,7 @@ extern void hl_set_device_flags_block();
/**
* @brief Returns the last error string from a cuda runtime call.
*/
extern const char* hl_get_device_error_string();
extern const char *hl_get_device_error_string();
/**
* @brief Returns the last error string from a cuda runtime call.
......@@ -309,7 +306,7 @@ extern const char* hl_get_device_error_string();
*
* @see hl_get_device_last_error()
*/
extern const char* hl_get_device_error_string(size_t err);
extern const char *hl_get_device_error_string(size_t err);
/**
* @brief Returns the last error number.
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_CUDA_CUBLAS_H_
#define HL_CUDA_CUBLAS_H_
......@@ -29,12 +28,8 @@ limitations under the License. */
* @param[in] ldc the first dimension of C_d.
*
*/
extern void hl_matrix_transpose(
    real *A_d, real *C_d, int dimM, int dimN, int lda, int ldc);
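The lda/ldc contract above, as a row-major CPU reference (our sketch, float instead of real):

void transpose_ref(const float* A, float* C, int dimM, int dimN,
                   int lda, int ldc) {
  // C[j][i] = A[i][j], with lda/ldc as the row strides of A and C.
  for (int i = 0; i < dimM; ++i)
    for (int j = 0; j < dimN; ++j)
      C[j * ldc + i] = A[i * lda + j];
}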
/*
* @brief Matrix transpose, while lda = dimN, ldc = dimM.
......@@ -45,10 +40,7 @@ extern void hl_matrix_transpose(real *A_d,
* @param[in] dimN matrix width.
*
*/
extern void hl_matrix_transpose(real *A_d, real *C_d, int dimM, int dimN);
/*
* @brief Matrix inverse
......@@ -60,11 +52,7 @@ extern void hl_matrix_transpose(real *A_d,
* @param[in] ldc the first dimension of C_d
*
*/
extern void hl_matrix_inverse(real *A_d, real *C_d, int dimN, int lda, int ldc);
/**
* @brief C_d = alpha*(op(A_d) * op(B_d)) + beta*C_d
......@@ -84,12 +72,19 @@ extern void hl_matrix_inverse(real *A_d,
* @param[in] ldc the first dimension of C_d.
*
*/
extern void hl_matrix_mul(real *A_d,
                          hl_trans_op_t transa,
                          real *B_d,
                          hl_trans_op_t transb,
                          real *C_d,
                          int dimM,
                          int dimN,
                          int dimK,
                          real alpha,
                          real beta,
                          int lda,
                          int ldb,
                          int ldc);
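For reference, the contract C_d = alpha*(op(A_d) * op(B_d)) + beta*C_d in plain CPU code for the no-transpose case; this is an illustrative sketch only, since hl_matrix_mul itself dispatches to cuBLAS:

void gemm_ref(const float* A, const float* B, float* C,
              int dimM, int dimN, int dimK,
              float alpha, float beta,
              int lda, int ldb, int ldc) {
  // Row-major GEMM with explicit leading dimensions, transa = transb = none.
  for (int m = 0; m < dimM; ++m) {
    for (int n = 0; n < dimN; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < dimK; ++k) {
        acc += A[m * lda + k] * B[k * ldb + n];
      }
      C[m * ldc + n] = alpha * acc + beta * C[m * ldc + n];
    }
  }
}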
/**
* @brief C_d = alpha*(op(A_d) * op(B_d)) + beta*C_d
......@@ -106,11 +101,16 @@ extern void hl_matrix_mul(real *A_d, hl_trans_op_t transa,
* @param[in] beta scalar used for multiplication.
*
*/
extern void hl_matrix_mul(real *A_d,
                          hl_trans_op_t transa,
                          real *B_d,
                          hl_trans_op_t transb,
                          real *C_d,
                          int dimM,
                          int dimN,
                          int dimK,
                          real alpha,
                          real beta);
/**
* @brief This function performs the matrix-vector multiplication.
......@@ -132,11 +132,17 @@ extern void hl_matrix_mul(real *A_d, hl_trans_op_t transa,
*
*/
extern void hl_matrix_mul_vector(real *A_d,
                                 hl_trans_op_t trans,
                                 real *B_d,
                                 real *C_d,
                                 int dimM,
                                 int dimN,
                                 real alpha,
                                 real beta,
                                 int lda,
                                 int incb,
                                 int incc);
/**
* @brief This function performs the matrix-vector multiplication.
......@@ -154,9 +160,13 @@ extern void hl_matrix_mul_vector(real *A_d, hl_trans_op_t trans,
* @param[in] beta scalar used for multiplication.
*
*/
extern void hl_matrix_mul_vector(real *A_d,
                                 hl_trans_op_t trans,
                                 real *B_d,
                                 real *C_d,
                                 int dimM,
                                 int dimN,
                                 real alpha,
                                 real beta);
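The matrix-vector counterpart, same caveats (no-transpose case, unit strides, float for real):

void gemv_ref(const float* A, const float* x, float* y,
              int dimM, int dimN, float alpha, float beta) {
  // y = alpha * A * x + beta * y, with A a dimM x dimN row-major matrix.
  for (int m = 0; m < dimM; ++m) {
    float acc = 0.0f;
    for (int n = 0; n < dimN; ++n) acc += A[m * dimN + n] * x[n];
    y[m] = alpha * acc + beta * y[m];
  }
}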
#endif /* HL_CUDA_CUBLAS_H_ */
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_CUDA_CUDNN_H_
#define HL_CUDA_CUDNN_H_
......@@ -22,7 +21,7 @@ limitations under the License. */
* hppl pooling mode
*/
typedef enum {
HL_POOLING_MAX = 0,
// average includes padded values
HL_POOLING_AVERAGE = 1,
// average does not include padded values
......@@ -324,17 +323,16 @@ extern void hl_convolution_forward_add_bias(hl_tensor_descriptor bias,
* @param[in] sizeInBytes gpu workspace size (bytes).
* @param[in] convBwdFilterAlgo backward filter algorithm.
*/
extern void hl_convolution_backward_filter(hl_tensor_descriptor input,
                                           real* input_data,
                                           hl_tensor_descriptor output,
                                           real* output_grad_data,
                                           hl_filter_descriptor filter,
                                           real* filter_grad_data,
                                           hl_convolution_descriptor conv,
                                           void* gpuWorkSpace,
                                           size_t sizeInBytes,
                                           int convBwdFilterAlgo);
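A hedged call-shape sketch for the caller-owned workspace: allocate with hl_malloc_device (declared in hl_cuda.h above), pass it with its byte size and the chosen algorithm id, then free it. Descriptor setup and algorithm selection are omitted because those APIs are collapsed in this diff.

void conv_bwd_filter_example(hl_tensor_descriptor inputDesc, real* inputData,
                             hl_tensor_descriptor outputDesc, real* outGrad,
                             hl_filter_descriptor filterDesc, real* filterGrad,
                             hl_convolution_descriptor convDesc,
                             size_t wsBytes, int algo) {
  void* workspace = hl_malloc_device(wsBytes);
  hl_convolution_backward_filter(inputDesc, inputData, outputDesc, outGrad,
                                 filterDesc, filterGrad, convDesc,
                                 workspace, wsBytes, algo);
  hl_free_mem_device(workspace);
}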
/**
* @brief convolution backward data(calculate input image grad data).
......@@ -350,17 +348,16 @@ extern void hl_convolution_backward_filter(
* @param[in] sizeInBytes gpu workspace size (bytes).
* @param[in] convBwdDataAlgo backward data algorithm.
*/
extern void hl_convolution_backward_data(hl_tensor_descriptor input,
                                         real* input_data_grad,
                                         hl_tensor_descriptor output,
                                         real* output_grad_data,
                                         hl_filter_descriptor filter,
                                         real* filter_data,
                                         hl_convolution_descriptor conv,
                                         void* gpuWorkSpace,
                                         size_t sizeInBytes,
                                         int convBwdDataAlgo);
/**
* @brief convolution backward bias(calculate bias grad data).
......@@ -383,8 +380,8 @@ extern void hl_convolution_backward_bias(hl_tensor_descriptor bias,
* @param[in] height matrix height.
* @param[in] width matrix width.
*/
extern void hl_softmax_forward(real* input,
                               real* output,
                               int height,
                               int width);
......@@ -396,8 +393,8 @@ extern void hl_softmax_forward(real *input,
* @param[in] height matrix height.
* @param[in] width matrix width.
*/
extern void hl_softmax_backward(real* output_value,
                                real* output_grad,
                                int height,
                                int width);
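A row-wise CPU reference for the height x width layout above (one softmax per row); our sketch, with the usual max-subtraction for numerical stability:

#include <algorithm>
#include <cmath>

void softmax_forward_ref(const float* in, float* out, int height, int width) {
  for (int r = 0; r < height; ++r) {
    const float* x = in + r * width;
    float* y = out + r * width;
    float mx = x[0];
    for (int c = 1; c < width; ++c) mx = std::max(mx, x[c]);
    float sum = 0.0f;
    for (int c = 0; c < width; ++c) {
      y[c] = std::exp(x[c] - mx);  // shift by the row max to avoid overflow
      sum += y[c];
    }
    for (int c = 0; c < width; ++c) y[c] /= sum;
  }
}

The backward pass pairs with it via the standard identity dx = y * (dy - sum(dy * y)), taken row-wise.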
......@@ -426,18 +423,18 @@ extern void hl_softmax_backward(real *output_value,
*
*/
extern void hl_batch_norm_forward_training(hl_tensor_descriptor inputDesc,
                                           real* input,
                                           hl_tensor_descriptor outputDesc,
                                           real* output,
                                           hl_tensor_descriptor bnParamDesc,
                                           real* scale,
                                           real* bias,
                                           double factor,
                                           real* runningMean,
                                           real* runningInvVar,
                                           double epsilon,
                                           real* savedMean,
                                           real* savedVar);
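To pin down the semantics, a per-channel NCHW CPU sketch of the training path: normalize with minibatch statistics, blend them into the running statistics with `factor`, and stash the saved mean and inverse variance for backward. The exponential-average convention and inverse-variance storage are our assumptions based on the cuDNN-style description above.

#include <cmath>
#include <cstddef>

void batch_norm_train_ref(const float* x, float* y,
                          const float* scale, const float* bias,
                          float* runningMean, float* runningInvVar,
                          float* savedMean, float* savedInvVar,
                          size_t N, size_t C, size_t HW,
                          double factor, double epsilon) {
  for (size_t c = 0; c < C; ++c) {
    double sum = 0.0, sumSq = 0.0;
    const double cnt = static_cast<double>(N * HW);
    for (size_t n = 0; n < N; ++n)
      for (size_t i = 0; i < HW; ++i) {
        const double v = x[(n * C + c) * HW + i];
        sum += v;
        sumSq += v * v;
      }
    const double mean = sum / cnt;
    const double var = sumSq / cnt - mean * mean;
    const double invStd = 1.0 / std::sqrt(var + epsilon);
    savedMean[c] = static_cast<float>(mean);
    savedInvVar[c] = static_cast<float>(invStd);
    // Assumed running-average convention: new = (1 - factor)*old + factor*cur.
    runningMean[c] =
        static_cast<float>((1.0 - factor) * runningMean[c] + factor * mean);
    runningInvVar[c] =
        static_cast<float>((1.0 - factor) * runningInvVar[c] + factor * invStd);
    for (size_t n = 0; n < N; ++n)
      for (size_t i = 0; i < HW; ++i) {
        const size_t idx = (n * C + c) * HW + i;
        y[idx] = scale[c] *
                     (x[idx] - static_cast<float>(mean)) *
                     static_cast<float>(invStd) +
                 bias[c];
      }
  }
}

The inference variant below is the same normalization with estimatedMean/estimatedVar substituted for the minibatch statistics.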
/**
* @brief cudnn batch norm forward.
......@@ -463,14 +460,14 @@ extern void hl_batch_norm_forward_training(hl_tensor_descriptor inputDesc,
*
*/
extern void hl_batch_norm_forward_inference(hl_tensor_descriptor inputDesc,
                                            real* input,
                                            hl_tensor_descriptor outputDesc,
                                            real* output,
                                            hl_tensor_descriptor bnParamDesc,
                                            real* scale,
                                            real* bias,
                                            real* estimatedMean,
                                            real* estimatedVar,
                                            double epsilon);
/**
......@@ -483,7 +480,8 @@ extern void hl_batch_norm_forward_inference(hl_tensor_descriptor inputDesc,
* @param[in] inGradDesc input tensor descriptor desc.
* @param[in] inGrad input data.
* @param[in] dBnParamDesc tensor descriptor desc.
*                          bnScale, bnBias, running mean/var,
*                          save_mean/var.
* @param[in] scale batch normalization scale parameter (in original
* paper scale is referred to as gamma).
* @param[in] scaleGrad batch normalization scale parameter (in original
......@@ -497,17 +495,17 @@ extern void hl_batch_norm_forward_inference(hl_tensor_descriptor inputDesc,
*
*/
extern void hl_batch_norm_backward(hl_tensor_descriptor inputDesc,
                                   real* input,
                                   hl_tensor_descriptor outGradDesc,
                                   real* outGrad,
                                   hl_tensor_descriptor inGradDesc,
                                   real* inGrad,
                                   hl_tensor_descriptor dBnParamDesc,
                                   real* scale,
                                   real* scaleGrad,
                                   real* biasGrad,
                                   double epsilon,
                                   real* savedMean,
                                   real* savedInvVar);
#endif // HL_CUDA_CUDNN_H_
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_DSO_LOADER_H_
#define HL_DSO_LOADER_H_
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_FUNCTIONS_H_
#define HL_FUNCTIONS_H_
......@@ -21,30 +20,30 @@ limitations under the License. */
/**
 * sigmoid threshold minimum
 */
#define SIGMOID_THRESHOLD_MIN -40.0

/**
 * sigmoid threshold maximum
 */
#define SIGMOID_THRESHOLD_MAX 13.0
#ifndef __NVCC__
namespace hppl {
/*
 * forward activation
 */
real relu(const real a);
real sigmoid(const real a);
real tanh(const real a);
real linear(const real a);
/*
 * backward activation
 */
real relu(const real a, const real b);
real sigmoid(const real a, const real b);
real tanh(const real a, const real b);
real linear(const real a, const real b);
} // namespace hppl
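The two thresholds above exist so that exp() inside the sigmoid stays within a safe single-precision range; a clamped reference (our sketch; the exact body of hppl::sigmoid may differ):

#include <cmath>

inline float sigmoid_ref(float a) {
  const float kMin = -40.0f;  // mirrors SIGMOID_THRESHOLD_MIN
  const float kMax = 13.0f;   // mirrors SIGMOID_THRESHOLD_MAX
  const float x = a < kMin ? kMin : (a > kMax ? kMax : a);
  return 1.0f / (1.0f + std::exp(-x));
}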
#ifdef __AVX__
......
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef HL_GPU_H_
#define HL_GPU_H_
......
(148 collapsed file diffs omitted.)