Commit 3304de7a authored by Yu Yang, committed by GitHub

Merge pull request #48 from reyoung/master

Merge Baidu changes into GitHub
......@@ -3,7 +3,7 @@ cmake_minimum_required(VERSION 2.8)
project(paddle CXX C)
set(PADDLE_MAJOR_VERSION 0)
set(PADDLE_MINOR_VERSION 8)
set(PADDLE_PATCH_VERSION 0b)
set(PADDLE_PATCH_VERSION 0b0)
set(PADDLE_VERSION ${PADDLE_MAJOR_VERSION}.${PADDLE_MINOR_VERSION}.${PADDLE_PATCH_VERSION})
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake")
......
......@@ -8,12 +8,12 @@ Docker is a tool designed to make it easier to create, deploy, and run applicati
### PaddlePaddle Docker images
There are six Docker images:
- paddledev/paddle:latest-cpu: PaddlePaddle CPU binary image.
- paddledev/paddle:latest-gpu: PaddlePaddle GPU binary image.
- paddledev/paddle:latest-cpu-devel: PaddlePaddle CPU binary image plus source code.
- paddledev/paddle:latest-gpu-devel: PaddlePaddle GPU binary image plus source code.
- paddledev/paddle:latest-cpu-demo: PaddlePaddle CPU binary image plus source code and demo
- paddledev/paddle:latest-gpu-demo: PaddlePaddle GPU binary image plus source code and demo
- paddledev/paddle:cpu-latest: PaddlePaddle CPU binary image.
- paddledev/paddle:gpu-latest: PaddlePaddle GPU binary image.
- paddledev/paddle:cpu-devel-latest: PaddlePaddle CPU binary image plus source code.
- paddledev/paddle:gpu-devel-latest: PaddlePaddle GPU binary image plus source code.
- paddledev/paddle:cpu-demo-latest: PaddlePaddle CPU binary image plus source code and demo
- paddledev/paddle:gpu-demo-latest: PaddlePaddle GPU binary image plus source code and demo
Tags with latest will be replaced by a released version.
......@@ -23,7 +23,7 @@ You have to install Docker in your machine which has linux kernel version 3.10+
You can use ```docker pull``` to download images first, or just launch a container with ```docker run```:
```bash
docker run -it paddledev/paddle:lastest-cpu
docker run -it paddledev/paddle:cpu-latest
```
If you want to launch container with GPU support, you need to set some environment variables at the same time:
......@@ -31,7 +31,7 @@ If you want to launch container with GPU support, you need to set some environme
```bash
export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run -it paddledev/paddle:latest-gpu
docker run -it paddledev/paddle:gpu-latest
```
### Notice
......
......@@ -165,7 +165,7 @@ We provide both C++ and Python interfaces to extract features. The following exa
### C++ Interface
First, specify image data list in `define_py_data_sources` in the config, see example `demo/model_zoo/resnet/resnet.py`.
First, specify the image data list in `define_py_data_sources2` in the config; see the example `demo/model_zoo/resnet/resnet.py` and the minimal sketch below.
```
train_list = 'train.list' if not is_test else None
......
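For reference, here is a minimal sketch of how `define_py_data_sources2` is typically declared in a trainer config; the module name, provider function, and extra args below are illustrative placeholders, not the actual values used in `resnet.py`:

```python
# Sketch only: the module/obj/args values are hypothetical placeholders.
from paddle.trainer_config_helpers import *

is_test = False
train_list = 'train.list' if not is_test else None

define_py_data_sources2(
    train_list=train_list,            # list of training images
    test_list='test.list',            # or None if there is no test list
    module='my_image_provider',       # hypothetical Python module holding the provider
    obj='process',                    # provider function defined in that module
    args={'mean_path': 'mean.meta'})  # extra args forwarded to the provider hook
```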
......@@ -257,7 +257,7 @@ In these network, we use several api in `trainer_config_helpers
* Text Convolution Pooling Layer, `text_conv_pool
<../../ui/api/trainer_config_helpers/networks.html
#trainer_config_helpers.networks.text_conv_pool>`_
* Declare Python Data Sources, `define_py_data_sources
* Declare Python Data Sources, `define_py_data_sources2
<../../ui/api/trainer_config_helpers/data_sources.html>`_
Data Provider
......
......@@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from py_paddle import swig_paddle, DataProviderWrapperConverter
from paddle.trainer.PyDataProviderWrapper import DenseSlot
from py_paddle import swig_paddle, DataProviderConverter
from paddle.trainer.PyDataProvider2 import dense_vector
from paddle.trainer.config_parser import parse_config
TEST_DATA = [[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
......@@ -89,12 +89,12 @@ TEST_DATA = [[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
def main():
conf = parse_config("./mnist_model/trainer_config.conf.norm", "")
conf = parse_config("./mnist_model/trainer_config.py", "")
print conf.data_config.load_data_args
network = swig_paddle.GradientMachine.createFromConfigProto(conf.model_config)
assert isinstance(network, swig_paddle.GradientMachine) # For code hint.
network.loadParameters("./mnist_model/")
converter = DataProviderWrapperConverter(False, [DenseSlot(784)])
converter = DataProviderConverter([dense_vector(784)])
inArg = converter(TEST_DATA)
print network.forwardTest(inArg)
......
......@@ -10,27 +10,35 @@ SWIG. The main steps of predict values in python are:
* Predict
Here is a sample python script that shows the typical prediction process for the
MNIST classification problem.
MNIST classification problem. The complete sample code can be found at
:code:`src_root/doc/ui/predict/predict_sample.py`.
.. literalinclude:: ./predict_sample.py
:language: python
:linenos:
:lines: 15-18,90-100,101-104
The module that does most of the work is py_paddle.swig_paddle; it is
generated by SWIG and is fully documented, so for more details you can use
Python's :code:`help()` function. Let's walk through the above python script:
* At the beginning, initialize PaddlePaddle with command line arguments(line 90).
* Parse the configuration file that is used in training(line 93).
* Create a neural network at line 95 according the parsed configuration, then
load the trained parameters from model at line 97.
* A utility class for data transformation is created at line 98.
* At the beginning, use :code:`swig_paddle.initPaddle()` to initialize
PaddlePaddle with command line arguments; for more about command line arguments,
see `Command Line Arguments <../cmd_argument/detail_introduction.html>`_.
* Parse the configuration file used in training with :code:`parse_config()`.
Because the data to predict usually has no label, and the prediction output is
normally the output layer rather than the cost layer, you should modify
the configuration file accordingly before using it for prediction.
* Create a neural network with
:code:`swig_paddle.GradientMachine.createFromConfigProto()`, which takes the
parsed configuration :code:`conf.model_config` as its argument. Then load the
trained parameters from the model with :code:`network.loadParameters()`.
* Create a data converter object of the utility class :code:`DataProviderConverter`.
- Note: As swig_paddle can only accept C++ matrices, we offer a utility
class DataProviderWraaperConverter that can accept the same input data with
PyDataProviderWrapper, for more information please refer to document
class DataProviderConverter that can accept the same input data as
PyDataProvider2; for more information, please refer to the documentation
of `PyDataProvider2 <../data_provider/pydataprovider2.html>`_.
* Do the prediction and output the result at line 100, forwardTest is another
utility class that directly takes the activations of the output layer.
* Do the prediction with :code:`forwardTest()`, which takes the converted
input data and outputs the activations of the output layer; a minimal end-to-end sketch is given below.
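Putting these steps together, here is a minimal sketch assembled from the MNIST
sample above; the model directory, the :code:`--use_gpu=0` flag, and the all-zero
784-dimensional input are illustrative assumptions rather than part of the shipped
sample.

.. code-block:: python

   from py_paddle import swig_paddle, DataProviderConverter
   from paddle.trainer.PyDataProvider2 import dense_vector
   from paddle.trainer.config_parser import parse_config

   # Initialize PaddlePaddle; the flag below (CPU only) is an assumed example.
   swig_paddle.initPaddle("--use_gpu=0")

   # Parse the training config (adjusted for prediction: no label, output layer last).
   conf = parse_config("./mnist_model/trainer_config.py", "")

   # Build the network from the parsed proto and load the trained parameters.
   network = swig_paddle.GradientMachine.createFromConfigProto(conf.model_config)
   network.loadParameters("./mnist_model/")

   # Convert raw Python data (one sample, one 784-dim dense slot) into paddle input.
   converter = DataProviderConverter([dense_vector(784)])
   inArg = converter([[[0.0] * 784]])

   # forwardTest() returns the activations of the output layer.
   print network.forwardTest(inArg)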
Here is a typical output:
......
......@@ -7,6 +7,9 @@ add_subdirectory(pserver)
add_subdirectory(trainer)
add_subdirectory(scripts)
configure_file(${CMAKE_CURRENT_SOURCE_DIR}/setup.py.in
${CMAKE_CURRENT_SOURCE_DIR}/setup.py)
if(WITH_PREDICT_SDK)
add_subdirectory(predict)
endif()
......
......@@ -266,25 +266,21 @@ template<int blockSize>
__global__ void KeMatrixClassificationError(real* in_A,
int* in_B,
real* out_C,
int dimM,
int dimN) {
__shared__ real max_s[blockSize];
__shared__ int max_l[blockSize];
int cnt = (dimN + blockSize -1) / blockSize;
int tid = threadIdx.x;
int lmt = tid;
int index = 0;
real t;
const int tid = threadIdx.x;
const int rowId = blockIdx.x;
max_s[tid] = -1e30f;
for (int ii = 0; ii < cnt && lmt < dimN; ii++) {
index = blockIdx.y*dimN + lmt;
t = in_A[index];
if (max_s[tid] < t) {
max_s[tid] = t;
max_l[tid] = lmt;
in_A += rowId * dimN;
real tmp;
for (int colId = tid; colId < dimN; colId += blockSize) {
tmp = in_A[colId];
if (max_s[tid] < tmp) {
max_s[tid] = tmp;
max_l[tid] = colId;
}
lmt += blockSize;
}
__syncthreads();
......@@ -300,7 +296,7 @@ __global__ void KeMatrixClassificationError(real* in_A,
__syncthreads();
if (tid == 0) {
out_C[blockIdx.y] = (max_l[0] == in_B[blockIdx.y] ? 0 : 1.0f);
out_C[rowId] = (max_l[0] == in_B[rowId] ? 0 : 1.0f);
}
}
......@@ -313,12 +309,9 @@ void hl_matrix_classification_error(real* A_d,
CHECK_NOTNULL(B_d);
CHECK_NOTNULL(C_d);
int blocksX = 1;
int blocksY = dimM;
dim3 threads(1024, 1);
dim3 grid(blocksX, blocksY);
KeMatrixClassificationError<1024><<< grid, threads, 0, STREAM_DEFAULT >>>
(A_d, B_d, C_d, dimM, dimN);
// each sample is calculated by one block
KeMatrixClassificationError<1024><<< dimM, 1024, 0, STREAM_DEFAULT >>>
(A_d, B_d, C_d, dimN);
CHECK_SYNC("hl_matrix_classification_error");
}
......
......@@ -47,81 +47,40 @@ bool CRFLayer::init(const LayerMap& layerMap,
// We don't need sequenceStartPositions because each sample of output_ is
// for the cost of one sequence.
setNeedSequenceInfo(false);
if (useGpu_) {
tmpCpuInput_.reserve(inputLayers_.size());
for (size_t i = 0; i < inputLayers_.size(); i++) {
tmpCpuInput_.push_back(Argument());
}
}
return true;
}
void CRFLayer::forward(PassType passType) {
Layer::forward(passType);
if (useGpu_) {
for (size_t i = 0; i < inputLayers_.size(); i++) {
tmpCpuInput_[i].resizeAndCopyFrom(getInput(i), false, HPPL_STREAM_1);
}
VectorPtr cpuParameterValue;
VectorPtr cpuParameterGradient;
cpuParameterValue =
Vector::create(parameter_->getBuf(PARAMETER_VALUE)->getSize(), false);
cpuParameterValue->
copyFrom(*parameter_->getBuf(PARAMETER_VALUE), HPPL_STREAM_1);
if (parameter_->getBuf(PARAMETER_GRADIENT)) {
cpuParameterGradient =
Vector::create(parameter_->getBuf(PARAMETER_GRADIENT)->getSize(),
false);
cpuParameterGradient->
copyFrom(*parameter_->getBuf(PARAMETER_GRADIENT), HPPL_STREAM_1);
} else {
cpuParameterGradient = nullptr;
}
forwardImp(tmpCpuInput_[0], tmpCpuInput_[1], cpuParameterValue,
cpuParameterGradient);
parameter_->getBuf(PARAMETER_VALUE)->copyFrom(*cpuParameterValue,
HPPL_STREAM_1);
if (parameter_->getBuf(PARAMETER_GRADIENT)) {
parameter_->getBuf(PARAMETER_GRADIENT)->copyFrom(*cpuParameterGradient,
HPPL_STREAM_1);
}
} else {
forwardImp(getInput(0), getInput(1), parameter_->getBuf(PARAMETER_VALUE),
parameter_->getBuf(PARAMETER_GRADIENT));
}
}
void CRFLayer::forwardImp(const Argument&output,
const Argument& label,
VectorPtr parameterValue,
VectorPtr parameterGradient) {
CHECK(!useGpu_) << "GPU is not supported";
const Argument& output = getInput(0);
const Argument& label = getInput(1);
CHECK(label.sequenceStartPositions);
CHECK(label.ids);
int batchSize = output.getBatchSize();
size_t numSequences = label.sequenceStartPositions->getSize() - 1;
resizeOutput(numSequences, 1);
std::vector<real> out(numSequences);
const int* starts = label.sequenceStartPositions->getData(false);
CHECK_EQ(starts[numSequences], batchSize);
VectorPtr cpuParameterValue;
VectorPtr cpuParameterGradient;
for (size_t i = 0; i < numSequences; ++i) {
if (i >= crfs_.size()) {
crfs_.emplace_back(numClasses_,
parameterValue->getData(),
parameterGradient
? parameterGradient->getData()
parameter_->getBuf(PARAMETER_VALUE)->getData(),
parameter_->getBuf(PARAMETER_GRADIENT)
? parameter_->getBuf(PARAMETER_GRADIENT)->getData()
: nullptr);
}
out[i] = crfs_[i].forward(
output_.value->getData()[i] = crfs_[i].forward(
output.value->getData() + numClasses_ * starts[i],
label.ids->getData() + starts[i], starts[i + 1] - starts[i]);
}
output_.value->copyFrom(out.data(), numSequences);
if (weightLayer_) {
const MatrixPtr& weight = getInputValue(*weightLayer_);
getOutputValue()->dotMul(*getOutputValue(), *weight);
......@@ -129,22 +88,8 @@ void CRFLayer::forwardImp(const Argument&output,
}
void CRFLayer::backward(const UpdateCallback &callback) {
(void)callback;
if (useGpu_) {
backwardImp(callback, tmpCpuInput_[0], tmpCpuInput_[1]);
const_cast<Argument&>(getInput(0)).
resizeAndCopyFrom(tmpCpuInput_[0], true, HPPL_STREAM_1);
const_cast<Argument&>(getInput(1)).
resizeAndCopyFrom(tmpCpuInput_[1], true, HPPL_STREAM_1);
} else {
backwardImp(callback, getInput(0), getInput(1));
}
}
void CRFLayer::backwardImp(const UpdateCallback& callback,
const Argument&output,
const Argument& label) {
const Argument& output = getInput(0);
const Argument& label = getInput(1);
const int* starts = label.sequenceStartPositions->getData(false);
int numSequences = label.sequenceStartPositions->getSize() - 1;
......@@ -159,9 +104,11 @@ void CRFLayer::backwardImp(const UpdateCallback& callback,
grad->mulScalar(weight);
}
}
if (coeff_ != real(1.0f)) {
output.grad->mulScalar(coeff_);
}
parameter_->incUpdate(callback);
}
......
......@@ -32,11 +32,7 @@ public:
explicit CRFLayer(const LayerConfig& config) : Layer(config) {}
virtual bool init(const LayerMap& layerMap, const ParameterMap& parameterMap);
virtual void forward(PassType passType);
void forwardImp(const Argument&output, const Argument& label,
VectorPtr parameterValue, VectorPtr parameterGradient);
virtual void backward(const UpdateCallback& callback);
void backwardImp(const UpdateCallback& callback, const Argument&output,
const Argument& label);
protected:
size_t numClasses_;
......@@ -44,7 +40,6 @@ protected:
std::vector<LinearChainCRF> crfs_;
LayerPtr weightLayer_; // weight for each sequence
real coeff_; // weight for the layer
std::vector<Argument> tmpCpuInput_;
};
} // namespace paddle
......@@ -18,27 +18,33 @@
import os
import sys
from paddle.trainer.PyDataProviderWrapper import *
from paddle.trainer.PyDataProvider2 import *
@init_hook_wrapper
def hook(obj, dict_file, **kwargs):
obj.word_dict = dict_file
obj.slots = [IndexSlot(len(obj.word_dict)), IndexSlot(3)]
obj.logger.info('dict len : %d' % (len(obj.word_dict)))
def hook(settings, dict_file, **kwargs):
settings.word_dict = dict_file
settings.input_types = [integer_value_sequence(len(settings.word_dict)),
integer_value_sequence(3)]
settings.logger.info('dict len : %d' % (len(settings.word_dict)))
@provider(use_seq=True, init_hook=hook)
def process(obj, file_name):
@provider(init_hook=hook)
def process(settings, file_name):
with open(file_name, 'r') as fdata:
for line in fdata:
label, comment = line.strip().split('\t')
label = int(''.join(label.split()))
words = comment.split()
word_slot = [obj.word_dict[w] for w in words if w in obj.word_dict]
word_slot = [settings.word_dict[w] for w in words if w in settings.word_dict]
yield word_slot, [label]
## for hierarchical sequence network
@provider(use_seq=True, init_hook=hook)
def process2(obj, file_name):
def hook2(settings, dict_file, **kwargs):
settings.word_dict = dict_file
settings.input_types = [integer_value_sub_sequence(len(settings.word_dict)),
integer_value_sub_sequence(3)]
settings.logger.info('dict len : %d' % (len(settings.word_dict)))
@provider(init_hook=hook2)
def process2(settings, file_name):
with open(file_name) as fdata:
label_list = []
word_slot_list = []
......@@ -47,7 +53,7 @@ def process2(obj, file_name):
label,comment = line.strip().split('\t')
label = int(''.join(label.split()))
words = comment.split()
word_slot = [obj.word_dict[w] for w in words if w in obj.word_dict]
word_slot = [settings.word_dict[w] for w in words if w in settings.word_dict]
label_list.append([label])
word_slot_list.append(word_slot)
else:
......
......@@ -21,11 +21,11 @@ dict_file = dict()
for line_count, line in enumerate(open(dict_path, "r")):
dict_file[line.strip()] = line_count
define_py_data_sources(train_list='gserver/tests/Sequence/train.list',
test_list=None,
module='sequenceGen',
obj='process',
args={"dict_file":dict_file})
define_py_data_sources2(train_list='gserver/tests/Sequence/train.list',
test_list=None,
module='sequenceGen',
obj='process',
args={"dict_file":dict_file})
settings(batch_size=5)
######################## network configure ################################
......
......@@ -21,11 +21,11 @@ dict_file = dict()
for line_count, line in enumerate(open(dict_path, "r")):
dict_file[line.strip()] = line_count
define_py_data_sources(train_list='gserver/tests/Sequence/train.list.nest',
test_list=None,
module='sequenceGen',
obj='process2',
args={"dict_file":dict_file})
define_py_data_sources2(train_list='gserver/tests/Sequence/train.list.nest',
test_list=None,
module='sequenceGen',
obj='process2',
args={"dict_file":dict_file})
settings(batch_size=2)
######################## network configure ################################
......
......@@ -179,10 +179,9 @@ TEST(Layer, CRFLayer) {
config.layerConfig.add_inputs();
config.layerConfig.add_inputs();
for (auto useGpu : {false, true}) {
testLayerGrad(config, "crf", 100, /* trans */ false, /* useGpu */ useGpu,
false /*useWeight*/, 0.03 /*epsilon*/);
}
// Not support GPU now
testLayerGrad(config, "crf", 100, /* trans */ false, /* useGpu */ false,
false /*useWeight*/, 0.03 /*epsilon*/);
}
TEST(Layer, CTCLayer) {
......
......@@ -1697,7 +1697,6 @@ TEST(Matrix, cosSimDerivate) {
}
}
void testParamReluForward(int height, int width, int w_height,
int w_width) {
MatrixPtr output = CpuMatrix::create(height, width, false, false);
......@@ -1736,7 +1735,6 @@ TEST(Matrix, paramReluForward) {
}
}
void testParamReluBackwardW(int height, int width, int w_height,
int w_width) {
MatrixPtr oGrad = CpuMatrix::create(height, width, false, false);
......@@ -1775,7 +1773,6 @@ TEST(Matrix, paramReluBackwardW) {
}
}
void testParamReluBackwardDiff(int height, int width, int w_height,
int w_width) {
MatrixPtr oGrad = CpuMatrix::create(height, width, false, false);
......@@ -1819,6 +1816,36 @@ TEST(Matrix, paramReluBackwardDiff) {
}
}
void testClassificationError(int numSamples, int dim) {
MatrixPtr cpuError = std::make_shared<CpuMatrix>(numSamples, 1);
MatrixPtr gpuError = std::make_shared<GpuMatrix>(numSamples, 1);
MatrixPtr cpuOutput = std::make_shared<CpuMatrix>(numSamples, dim);
MatrixPtr gpuOutput = std::make_shared<GpuMatrix>(numSamples, dim);
IVectorPtr cpuLabel = std::make_shared<CpuIVector>(numSamples);
IVectorPtr gpuLabel = std::make_shared<GpuIVector>(numSamples);
cpuOutput->randomizeUniform();
cpuLabel->rand(dim);
gpuOutput->copyFrom(*cpuOutput);
gpuLabel->copyFrom(*cpuLabel);
cpuError->classificationError(cpuOutput, cpuLabel);
gpuError->classificationError(gpuOutput, gpuLabel);
MatrixPtr check = std::make_shared<CpuMatrix>(numSamples, 1);
check->copyFrom(*gpuError);
MatrixCheckEqual(*cpuError, *check);
}
TEST(Matrix, classificationError) {
for (auto numSamples : {1, 10, 100, 1000, 70000}) {
for (auto dim : {1, 10, 100, 1000}) {
VLOG(3) << " numSamples=" << numSamples << " dim=" << dim;
testClassificationError(numSamples, dim);
}
}
}
int main(int argc, char** argv) {
testing::InitGoogleTest(&argc, argv);
initMain(argc, argv);
......
......@@ -269,6 +269,9 @@ void Argument::concat(const std::vector<Argument>& args,
const std::vector<int>& selectRows,
const std::vector<int>& seqStartPos, bool useGpu,
hl_stream_t stream, PassType passType) {
CHECK(!subSequenceStartPositions)
<< "undefined behavior for subsequence positions";
size_t batchSize = selectRows.size();
auto copyArg = [batchSize, stream](MatrixPtr& dst, MatrixPtr src,
int startRow, int pos, int size,
......@@ -347,9 +350,11 @@ void Argument::concat(const std::vector<Argument>& args, bool useGpu,
hl_stream_t stream, PassType passType) {
int32_t batchSize = 0;
int64_t numSequences = 0;
int64_t numSubSequences = 0;
for (auto& arg : args) {
batchSize += arg.getBatchSize();
numSequences += arg.getNumSequences();
numSubSequences += arg.getNumSubSequences();
}
auto copyArg = [batchSize, stream](MatrixPtr& dst, MatrixPtr src,
......@@ -393,8 +398,26 @@ void Argument::concat(const std::vector<Argument>& args, bool useGpu,
std::copy(src->begin(), src->end(), dst->begin() + startRow);
};
auto copySequencePos = []
(ICpuGpuVectorPtr& dstSeq, const ICpuGpuVectorPtr& srcSeq,
int dstNumSequences, int srcNumSequences,
int& startSequences, int startRow) {
if (srcSeq) {
ICpuGpuVector::resizeOrCreate(dstSeq, dstNumSequences + 1, false);
const int* src = srcSeq->getData(false);
int* dest = dstSeq->getMutableData(false);
for (int i = 0; i < srcNumSequences + 1; ++i) {
dest[i + startSequences] = src[i] + startRow;
}
startSequences += srcNumSequences;
} else {
dstSeq.reset();
}
};
int startRow = 0;
int startSequences = 0;
int startSubSequences = 0;
dataId = args[0].dataId;
for (auto& arg : args) {
CHECK_EQ(arg.dataId, dataId) << "Arguments in concat should have"
......@@ -403,17 +426,18 @@ void Argument::concat(const std::vector<Argument>& args, bool useGpu,
copyArg(value, arg.value, startRow, useGpu);
if (passType != PASS_TEST) copyArg(grad, arg.grad, startRow, useGpu);
copyIds(ids, arg.ids, startRow, useGpu);
if (arg.sequenceStartPositions) {
ICpuGpuVector::resizeOrCreate(sequenceStartPositions,
numSequences + 1,
false);
const int* src = arg.sequenceStartPositions->getData(false);
int* dest = sequenceStartPositions->getMutableData(false);
for (int i = 0; i < arg.getNumSequences() + 1; ++i) {
dest[i + startSequences] = src[i] + startRow;
}
startSequences += arg.getNumSequences();
}
copySequencePos(sequenceStartPositions,
arg.sequenceStartPositions,
numSequences,
arg.getNumSequences(),
startSequences,
startRow);
copySequencePos(subSequenceStartPositions,
arg.subSequenceStartPositions,
numSubSequences,
arg.getNumSubSequences(),
startSubSequences,
startRow);
copyStrs(strs, arg.strs, startRow, useGpu);
startRow += arg.getBatchSize();
}
......
......@@ -278,7 +278,7 @@ void ParameterClient2::prepareSendData(
if (sendingPara) {
sendJob->parallelInputIovs[serverId].push_back(
{sendMat->getLocalRow(row), sizeof(real) * blockSize});
{sendMat->getLocalRow(row), sizeof(real) * (size_t) blockSize});
/// detect sparse parameter distribution
sparseDistribution_->probeDistribution(serverId,
sizeof(real) * blockSize);
......@@ -302,8 +302,8 @@ void ParameterClient2::prepareSendData(
block->set_begin_pos(beginDim);
block->set_block_size(endDim - beginDim);
if (buf) {
sendJob->parallelInputIovs[serverId].push_back(
{buf + beginDim, sizeof(real) * (endDim - beginDim)});
sendJob->parallelInputIovs[serverId].push_back({buf + beginDim,
sizeof(real) * ((size_t) (endDim - beginDim))});
}
}
}
......
......@@ -724,7 +724,7 @@ void ParameterServer2::sendBackParameter(const ParameterBlock& block,
<< " id=" << block.para_id() << " block id=" << block.block_id();
real* valueBuffer = vectors_[parameterType]->getPoint(offset);
outputBuffers->push_back({valueBuffer, block.block_size()});
outputBuffers->push_back({valueBuffer, (size_t) block.block_size()});
}
void ParameterServer2::sendBackParameter(const ParameterBlock& block,
......
......@@ -148,7 +148,8 @@ void SocketChannel::writeMessage(const std::vector<struct iovec>& userIovs) {
std::vector<iovec> iovs;
iovs.reserve(userIovs.size() + 2);
iovs.push_back({&header, sizeof(header)});
iovs.push_back({&iovLengths[0], sizeof(iovLengths[0]) * header.numIovs});
iovs.push_back({&iovLengths[0],
sizeof(iovLengths[0]) * (size_t) header.numIovs});
iovs.insert(iovs.end(), userIovs.begin(), userIovs.end());
header.totalLength = 0;
......
......@@ -35,11 +35,11 @@ except:
pass
setup(name="py_paddle",
version="0.8.0b", # TODO(yuyang18): Make this version same as CMake
version="@PADDLE_VERSION@",
ext_modules=[
Extension('py_paddle._swig_paddle', # Build SWIG Extension.
['Paddle_wrap.cxx'],
extra_link_args=["-Xlinker", '-start-group'] +
extra_link_args=["-Xlinker", '-start-group'] +
extra_links + ["-Xlinker", "-end-group"]
)
],
......
......@@ -79,7 +79,7 @@ protected:
// The update function for after update operations, such as averager.
void threadTraverse(const ParameterOptimizer::TraverseCallback& callback,
int tid, size_t numThreads, Parameter* para);
typedef std::function<const ParameterOptimizer::TraverseCallback&(Parameter*)>
typedef std::function<const ParameterOptimizer::TraverseCallback(Parameter*)>
GetTraverseCallback;
void traverse(GetTraverseCallback getTraverseCallback);
};
......
......@@ -262,8 +262,8 @@ def SubModelEnd(name = None):
def MakeLayerNameInParentSubmodel(name):
suffix = ""
for submodel in g_submodel_stack[1:]:
suffix = "@" + submodel.name + suffix
if len(g_submodel_stack) > 1:
suffix = "@" + g_submodel_stack[-1].name
return name + suffix
def GetLayerBaseName(name):
......