Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into enable_manylinux

79e50c14 · typhoonzero · e1157c52 · 6f43c936
26 changed files
--- a/doc/design/float16.md
+++ b/doc/design/float16.md
+# Design Doc: float16
+## Why float16
+Half precision (float16) is a binary floating-point format that occupies 16 bits in memory. float16 is half the size of the traditional 32-bit single-precision format (float) and has lower precision and a smaller range.
+When high-precision computation is not required, using the float16 data type could potentially:
+- reduce storage space, memory bandwidth, and power usage;
+- increase the chance of data fitting into a smaller, lower-latency cache;
+- provide an arithmetic speedup if supported by hardware.
+## Survey of current float16 support
+A brief survey of float16 support across compilers, hardware, and libraries can be found below. Interested readers can refer to [link1](https://github.com/PaddlePaddle/Paddle/issues/4853) and [link2](https://github.com/Xreki/Xreki.github.io/blob/master/multi_data_types_in_dl_framework/ppt/float16_and_quantized_type.md) for more info.
+The float16 type is meant to serve as a key that lets the executor find and run the version of an operator kernel's compute method that is specialized for float16. It should be compatible with the natively supported float16 implementations, including `__half` for CUDA, `float16_t` for ARM, and `Eigen::half` for Eigen, to make writing customized float16 kernels easier.
+### Compiler
+- nvcc supports the `__half` data type starting with CUDA 7.5.
+- `__fp16` or `float16_t` is supported as storage type for gcc >= 6.1 and clang >= 3.4.
+- `__fp16` or `float16_t` is supported as arithmetic type for gcc >= 7.1 and clang >= 3.9.
+### Hardware
+- `__half` is supported on GPU with compute capability >= 5.3.
+- `__fp16` is supported as storage type for ARMv7-A, ARMv8-A, and above.
+- `__fp16` is supported as arithmetic type starting with ARMv8.2-A. Currently, the only microarchitecture implementing ARMv8.2-A is ARM Cortex-A75, which was announced in May 2017. There seem to be no application processors on the market yet that adopt this architecture. Qualcomm Snapdragon 845 reportedly uses the Cortex-A75 design and is expected in mobile devices in early 2018.
+### Libraries
+- [Eigen](https://github.com/RLovelett/eigen) >= 3.3 supports float16 calculation on both GPU and CPU using the `Eigen::half` class. It is mostly useful for Nvidia GPUs because its overloaded arithmetic operators use CUDA intrinsics. On CPU it falls back to software emulation, and there is no special treatment for ARM processors.
+- [ARM compute library](https://github.com/ARM-software/ComputeLibrary) >= 17.02.01 supports NEON FP16 kernels (requires ARMv8.2-A CPU).
+## Implementation
+The float16 class stores its value internally in a 16-bit `uint16_t` member.
+```
+struct float16 {
+  uint16_t x;
+};
+``` 
+float16 supports the following features:
+  - constructors / assignment operators that take input from primitive data types including bool, integers of various lengths, float, and double.
+  - constructors / assignment operators that take input from `__half` on CUDA, `float16_t` on ARM, and `Eigen::half` on Eigen.
+  - conversion operators to primitive data types and to the half-precision data types on CUDA, ARM, and Eigen.
+  - overloaded arithmetic operators for CUDA, ARM, and non-ARM CPUs, respectively. These operators take advantage of the CUDA and ARM intrinsics on the corresponding hardware.
+To support the above features, two fundamental conversion functions are provided:
+```
+float16 float_to_half_rn(float f);  // convert to half precision in round-to-nearest-even mode
+float half_to_float(float16 h);
+```
+which convert between float32 and float16 in both directions. These two functions use different conversion routines depending on the current hardware. CUDA/ARM intrinsics are used when the corresponding hardware is available. If the hardware or compiler does not support float32 to float16 conversion, the conversion is done by software emulation.
+## To do
+Once the float16 class is available, the following items remain:
+- Update pybind/tensor_py.h to bind C++ float16 with NumPy float16.
+- Modify `IndicateDataType()` method in `framework/operator.h` to make it compatible with float16.
+- Create a type-casting operator that can convert the data type in tensor between float16 and other types.
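The software-emulation path mentioned in the design doc can be sketched as follows. This is a minimal illustration, assuming IEEE 754 binary16/binary32 layouts and round-to-nearest-even; it reuses the `float16`, `float_to_half_rn`, and `half_to_float` names from the doc, but the bodies are an assumption and not the code this commit ships.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative stand-in for the float16 struct described in the design doc.
struct float16 {
  uint16_t x;
};

// float32 -> float16, round-to-nearest-even, software path only.
inline float16 float_to_half_rn(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // read the raw float32 bit pattern
  const uint32_t sign = (bits >> 16) & 0x8000u;
  const uint32_t mag = bits & 0x7fffffffu;
  uint32_t out;
  if (mag > 0x7f800000u) {
    out = 0x7e00u;  // NaN stays a (quiet) NaN
  } else if (mag >= 0x47800000u) {
    out = 0x7c00u;  // 65536 and above, including infinity, become infinity
  } else if (mag >= 0x38800000u) {
    // Normal half range: rebias the exponent (127 -> 15), then round away
    // the 13 dropped mantissa bits; a carry may round up to infinity.
    const uint32_t v = mag - 0x38000000u;
    out = (v + 0x0fffu + ((v >> 13) & 1u)) >> 13;
  } else if (mag >= 0x33000000u) {
    // Subnormal half range: shift the 24-bit significand into place.
    const uint32_t shift = 126u - (mag >> 23);
    const uint32_t sig = (mag & 0x007fffffu) | 0x00800000u;
    const uint32_t rem = sig & ((1u << shift) - 1u);
    const uint32_t half = 1u << (shift - 1u);
    out = sig >> shift;
    if (rem > half || (rem == half && (out & 1u))) ++out;
  } else {
    out = 0;  // too small to represent: rounds to signed zero
  }
  float16 h;
  h.x = static_cast<uint16_t>(sign | out);
  return h;
}

// float16 -> float32; every half value is exactly representable as a float.
inline float half_to_float(float16 h) {
  const uint32_t sign = (static_cast<uint32_t>(h.x) & 0x8000u) << 16;
  uint32_t exp = (h.x >> 10) & 0x1fu;
  uint32_t man = h.x & 0x3ffu;
  uint32_t bits;
  if (exp == 0x1fu) {
    bits = sign | 0x7f800000u | (man << 13);  // infinity or NaN
  } else if (exp != 0) {
    bits = sign | ((exp + 112u) << 23) | (man << 13);  // normal number
  } else if (man != 0) {
    exp = 113u;  // biased float32 exponent of 2^-14
    while ((man & 0x400u) == 0) {  // normalize the subnormal significand
      man <<= 1;
      --exp;
    }
    bits = sign | (exp << 23) | ((man & 0x3ffu) << 13);
  } else {
    bits = sign;  // signed zero
  }
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

On CUDA or ARM builds the same two entry points would presumably dispatch to hardware intrinsics instead, as the design doc describes.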
--- a/paddle/framework/op_desc.cc
+++ b/paddle/framework/op_desc.cc
@@ -67,8 +67,11 @@ class CompileTimeInferShapeContext : public InferShapeContext {
                      out);
    in_var->SetLoDLevel(out_var->GetLodLevel());
  }
+  bool IsRuntime() const override;
+ protected:
+  VarDesc::VarType GetVarType(const std::string &name) const override;
- private:
  DDim GetDim(const std::string &name) const override;
  void SetDim(const std::string &name, const DDim &dim) override;
@@ -451,6 +454,12 @@ void CompileTimeInferShapeContext::SetDim(const std::string &name,
                                          const DDim &dim) {
  block_.FindVarRecursive(name)->SetShape(framework::vectorize(dim));
 }
+bool CompileTimeInferShapeContext::IsRuntime() const { return false; }
+VarDesc::VarType CompileTimeInferShapeContext::GetVarType(
+    const std::string &name) const {
+  return block_.FindVarRecursive(name)->GetType();
+}
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@@ -15,7 +15,9 @@ limitations under the License. */
 #include "paddle/framework/operator.h"
 #include <algorithm>
 #include <atomic>
+#include "paddle/framework/lod_tensor_array.h"
 #include "paddle/framework/shape_inference.h"
+#include "paddle/framework/var_type.h"
 namespace paddle {
 namespace framework {
@@ -365,7 +367,9 @@ class RuntimeInferShapeContext : public InferShapeContext {
    out_tensor->set_lod(in_tensor.lod());
  }
- private:
+  bool IsRuntime() const override { return true; }
+ protected:
  DDim GetDim(const std::string& name) const override {
    Variable* var = scope_.FindVar(name);
    if (var->IsType<LoDTensor>()) {
@@ -388,6 +392,12 @@ class RuntimeInferShapeContext : public InferShapeContext {
    }
  }
+  VarDesc::VarType GetVarType(const std::string& name) const override {
+    auto* var = scope_.FindVar(name);
+    return ToVarType(var->Type());
+  }
+ private:
  const OperatorBase& op_;
  const Scope& scope_;
 };

--- a/paddle/framework/operator.h
+++ b/paddle/framework/operator.h
@@ -298,11 +298,10 @@ class ExecutionContext {
  }
 #ifdef PADDLE_WITH_CUDA
-  const platform::CUDADeviceContext& cuda_device_context() const {
+  const inline platform::CUDADeviceContext& cuda_device_context() const {
    PADDLE_ENFORCE(platform::is_gpu_place(device_context_.GetPlace()));
-    auto cuda_ctx =
+    return *reinterpret_cast<const platform::CUDADeviceContext*>(
-        reinterpret_cast<const platform::CUDADeviceContext*>(&device_context_);
+        &device_context_);
-    return *cuda_ctx;
  }
 #endif

--- a/paddle/framework/shape_inference.cc
+++ b/paddle/framework/shape_inference.cc
@@ -46,6 +46,23 @@ void InferShapeContext::SetDims(const std::vector<std::string> &names,
    SetDim(names[i], dims[i]);
  }
 }
+std::vector<VarDesc::VarType> InferShapeContext::GetInputsVarType(
+    const std::string &name) const {
+  return GetVarTypes(Inputs(name));
+}
+std::vector<VarDesc::VarType> InferShapeContext::GetOutputsVarType(
+    const std::string &name) const {
+  return GetVarTypes(Outputs(name));
+}
+std::vector<VarDesc::VarType> InferShapeContext::GetVarTypes(
+    const std::vector<std::string> &names) const {
+  std::vector<VarDesc::VarType> retv;
+  retv.resize(names.size());
+  std::transform(names.begin(), names.end(), retv.begin(),
+                 std::bind(std::mem_fn(&InferShapeContext::GetVarType), this,
+                           std::placeholders::_1));
+  return retv;
+}
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/shape_inference.h
+++ b/paddle/framework/shape_inference.h
@@ -16,6 +16,7 @@ limitations under the License. */
 #include "paddle/framework/attribute.h"
 #include "paddle/framework/ddim.h"
+#include "paddle/framework/framework.pb.h"
 namespace paddle {
 namespace framework {
@@ -26,6 +27,10 @@ class InferShapeContext {
  virtual bool HasInput(const std::string &name) const = 0;
  virtual bool HasOutput(const std::string &name) const = 0;
+  std::vector<VarDesc::VarType> GetInputsVarType(const std::string &name) const;
+  std::vector<VarDesc::VarType> GetOutputsVarType(
+      const std::string &name) const;
  virtual bool HasInputs(const std::string &name) const = 0;
  virtual bool HasOutputs(const std::string &name) const = 0;
@@ -46,6 +51,8 @@ class InferShapeContext {
  virtual void ShareLoD(const std::string &in, const std::string &out,
                        size_t i = 0, size_t j = 0) const = 0;
+  virtual bool IsRuntime() const = 0;
 protected:
  virtual framework::DDim GetDim(const std::string &name) const = 0;
  virtual void SetDim(const std::string &name, const framework::DDim &dim) = 0;
@@ -55,6 +62,11 @@ class InferShapeContext {
  void SetDims(const std::vector<std::string> &names,
               const std::vector<framework::DDim> &dims);
+  std::vector<VarDesc::VarType> GetVarTypes(
+      const std::vector<std::string> &names) const;
+  virtual VarDesc::VarType GetVarType(const std::string &name) const = 0;
 };
 }  // namespace framework
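The interface change above (a public `IsRuntime()` plus a protected `GetVarType()` hook that a shared helper maps over a list of names) can be sketched in isolation. The class names, the `VarType` enum, and the map-backed storage below are hypothetical stand-ins so that the example compiles on its own; they are not Paddle types.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class VarType { LOD_TENSOR, LOD_RANK_TABLE, LOD_TENSOR_ARRAY };

// Simplified stand-in for InferShapeContext.
class ShapeContextSketch {
 public:
  virtual ~ShapeContextSketch() = default;
  virtual bool IsRuntime() const = 0;

  // Same role as GetVarTypes in the diff, written with a lambda instead of
  // std::bind + std::mem_fn. It is public here only to keep the sketch short.
  std::vector<VarType> GetVarTypes(
      const std::vector<std::string>& names) const {
    std::vector<VarType> types(names.size());
    std::transform(names.begin(), names.end(), types.begin(),
                   [this](const std::string& n) { return GetVarType(n); });
    return types;
  }

 protected:
  virtual VarType GetVarType(const std::string& name) const = 0;
};

// Compile-time flavour: types come from declared variable descriptions.
class CompileTimeSketch : public ShapeContextSketch {
 public:
  explicit CompileTimeSketch(std::map<std::string, VarType> vars)
      : vars_(std::move(vars)) {}
  bool IsRuntime() const override { return false; }

 protected:
  VarType GetVarType(const std::string& name) const override {
    return vars_.at(name);
  }

 private:
  std::map<std::string, VarType> vars_;
};

// Runtime flavour: in Paddle this inspects the live Variable in the scope.
class RuntimeSketch : public ShapeContextSketch {
 public:
  explicit RuntimeSketch(std::map<std::string, VarType> vars)
      : vars_(std::move(vars)) {}
  bool IsRuntime() const override { return true; }

 protected:
  VarType GetVarType(const std::string& name) const override {
    return vars_.at(name);
  }

 private:
  std::map<std::string, VarType> vars_;
};

// Mirrors the early return added to SumOp::InferShape in this commit.
inline bool SkipShapeInference(const ShapeContextSketch& ctx,
                               const std::string& out_name) {
  return ctx.IsRuntime() &&
         ctx.GetVarTypes({out_name})[0] == VarType::LOD_TENSOR_ARRAY;
}
```

Keeping `GetVarType` protected means operators only ever see the plural accessors, which matches how `GetDim` and `SetDim` are already exposed.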

--- a/paddle/framework/var_type.h
+++ b/paddle/framework/var_type.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+#pragma once
+#include "paddle/framework/framework.pb.h"
+#include "paddle/framework/lod_rank_table.h"
+#include "paddle/framework/lod_tensor.h"
+#include "paddle/framework/lod_tensor_array.h"
+namespace paddle {
+namespace framework {
+inline VarDesc::VarType ToVarType(std::type_index type) {
+  if (type.hash_code() == typeid(LoDTensor).hash_code()) {
+    return VarDesc_VarType_LOD_TENSOR;
+  } else if (type.hash_code() == typeid(LoDRankTable).hash_code()) {
+    return VarDesc_VarType_LOD_RANK_TABLE;
+  } else if (type.hash_code() == typeid(LoDTensorArray).hash_code()) {
+    return VarDesc_VarType_LOD_TENSOR_ARRAY;
+  } else {
+    PADDLE_THROW("ToVarType:Unsupported type %s", type.name());
+  }
+}
+}  // namespace framework
+}  // namespace paddle
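`ToVarType` above matches types by comparing `hash_code()` values. The C++ standard only guarantees that equal types produce equal hash codes, not that different types produce different ones, so a variant that compares `std::type_index` objects directly may be worth considering. The sketch below uses hypothetical stub types and a hypothetical enum in place of the Paddle classes and protobuf constants.

```cpp
#include <stdexcept>
#include <string>
#include <typeindex>
#include <typeinfo>

// Hypothetical stand-ins for LoDTensor, LoDRankTable and LoDTensorArray.
struct LoDTensorStub {};
struct LoDRankTableStub {};
struct LoDTensorArrayStub {};

enum class VarTypeSketch { LOD_TENSOR, LOD_RANK_TABLE, LOD_TENSOR_ARRAY };

// Maps a runtime type to its enum tag using type_index equality, which
// compares the types themselves rather than their hash codes.
inline VarTypeSketch ToVarTypeSketch(std::type_index type) {
  if (type == std::type_index(typeid(LoDTensorStub))) {
    return VarTypeSketch::LOD_TENSOR;
  } else if (type == std::type_index(typeid(LoDRankTableStub))) {
    return VarTypeSketch::LOD_RANK_TABLE;
  } else if (type == std::type_index(typeid(LoDTensorArrayStub))) {
    return VarTypeSketch::LOD_TENSOR_ARRAY;
  }
  // The diff uses PADDLE_THROW here; a standard exception stands in for it.
  throw std::invalid_argument(std::string("Unsupported type ") + type.name());
}
```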
--- a/paddle/framework/variable.h
+++ b/paddle/framework/variable.h
@@ -48,6 +48,11 @@ class Variable {
  void Clear() { holder_.reset(); }
+  std::type_index Type() const {
+    PADDLE_ENFORCE(holder_ != nullptr, "Must hold memory");
+    return holder_->Type();
+  }
 private:
  struct Placeholder {
    virtual ~Placeholder() {}

--- a/paddle/gserver/layers/ConvBaseProjection.cpp
+++ b/paddle/gserver/layers/ConvBaseProjection.cpp
@@ -17,7 +17,7 @@ limitations under the License. */
 namespace paddle {
-ThreadLocalD<std::vector<MemoryHandle *>> ConvBaseProjection::convMem_;
+ThreadLocalD<std::vector<MemoryHandlePtr>> ConvBaseProjection::convMem_;
 ConvBaseProjection::ConvBaseProjection(const ProjectionConfig &config,
                                       ParameterPtr parameter,
@@ -175,18 +175,18 @@ void ConvBaseProjection::reshape(int batchSize) {
 }
 void *ConvBaseProjection::getSpaceBytes(size_t size) {
-  std::vector<MemoryHandle *> &convMem = *convMem_;
+  std::vector<MemoryHandlePtr> &convMem = *convMem_;
  if (convMem.empty()) {
    int numDevices = hl_get_device_count();
    convMem.resize(numDevices);
  }
  int devId = hl_get_device();
-  MemoryHandle **localMem = &(convMem[devId]);
+  MemoryHandlePtr localMem = convMem[devId];
-  if (NULL == *localMem || size > (*localMem)->getAllocSize()) {
+  if (NULL == localMem || size > localMem->getAllocSize()) {
-    *localMem = new GpuMemoryHandle(size);
+    localMem = std::make_shared<GpuMemoryHandle>(size);
  }
-  return (*localMem)->getBuf();
+  return localMem->getBuf();
 }
 ConvBaseProjection::~ConvBaseProjection() {
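One point in the `getSpaceBytes` hunk above may deserve a second look. The new code declares `localMem` as a `MemoryHandlePtr` value rather than a reference into `convMem`. Assuming `MemoryHandlePtr` is a `std::shared_ptr` alias (its definition is not part of this diff), assigning to that local copy would not update the cached slot, and the buffer would be released once the copy goes out of scope. The sketch below shows the difference with hypothetical stub types.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for GpuMemoryHandle; it only records its size.
struct BufferStub {
  explicit BufferStub(std::size_t n) : size(n) {}
  std::size_t size;
};
using BufferPtr = std::shared_ptr<BufferStub>;

// By-value copy: the new buffer is owned only by the local variable,
// so the cache slot is left untouched.
inline void EnsureByValue(std::vector<BufferPtr>& cache, std::size_t slot,
                          std::size_t size) {
  BufferPtr local = cache[slot];
  if (!local || size > local->size) {
    local = std::make_shared<BufferStub>(size);
  }
}

// By-reference: the assignment writes through to the cache slot.
inline void EnsureByReference(std::vector<BufferPtr>& cache, std::size_t slot,
                              std::size_t size) {
  BufferPtr& local = cache[slot];
  if (!local || size > local->size) {
    local = std::make_shared<BufferStub>(size);
  }
}
```

If the typedef is indeed a shared pointer, binding `localMem` by reference would preserve the caching behaviour of the raw-pointer version this commit replaces.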

--- a/paddle/gserver/layers/ConvBaseProjection.h
+++ b/paddle/gserver/layers/ConvBaseProjection.h
@@ -105,7 +105,7 @@ protected:
  bool bias_;
  std::unique_ptr<Weight> weight_;
-  static ThreadLocalD<std::vector<MemoryHandle*>> convMem_;
+  static ThreadLocalD<std::vector<MemoryHandlePtr>> convMem_;
 };
 }  // namespace paddle
--- a/paddle/operators/accuracy_op.cu
+++ b/paddle/operators/accuracy_op.cu
@@ -72,11 +72,8 @@ class AccuracyOpCUDAKernel : public framework::OpKernel<T> {
    }
    AccuracyCudaKernel<PADDLE_CUDA_NUM_THREADS><<<
-        1, PADDLE_CUDA_NUM_THREADS, 0,
+        1, PADDLE_CUDA_NUM_THREADS, 0, ctx.cuda_device_context().stream()>>>(
-        reinterpret_cast<const platform::CUDADeviceContext&>(
+        num_samples, infer_width, indices_data, label_data, accuracy_data);
-            ctx.device_context())
-            .stream()>>>(num_samples, infer_width, indices_data, label_data,
-                         accuracy_data);
  }
 };

--- a/paddle/operators/conv2d_transpose_cudnn_op.cu
+++ b/paddle/operators/conv2d_transpose_cudnn_op.cu
@@ -27,7 +27,6 @@ using ScopedTensorDescriptor = platform::ScopedTensorDescriptor;
 using ScopedFilterDescriptor = platform::ScopedFilterDescriptor;
 using ScopedConvolutionDescriptor = platform::ScopedConvolutionDescriptor;
 using DataLayout = platform::DataLayout;
-using CUDADeviceContext = platform::CUDADeviceContext;
 static constexpr size_t kConvCudnnWorkspaceLimitBytes = 1024 * 1024 * 1024;

--- a/paddle/operators/conv_cudnn_op.cu
+++ b/paddle/operators/conv_cudnn_op.cu
@@ -27,7 +27,6 @@ using ScopedTensorDescriptor = platform::ScopedTensorDescriptor;
 using ScopedFilterDescriptor = platform::ScopedFilterDescriptor;
 using ScopedConvolutionDescriptor = platform::ScopedConvolutionDescriptor;
 using DataLayout = platform::DataLayout;
-using CUDADeviceContext = platform::CUDADeviceContext;
 static constexpr size_t kCONV_CUDNN_WORKSPACE_LIMIT_BYTES = 1024 * 1024 * 1024;

--- a/paddle/operators/conv_shift_op.cu
+++ b/paddle/operators/conv_shift_op.cu
@@ -130,9 +130,7 @@ class ConvShiftKernel<platform::GPUPlace, T> : public framework::OpKernel<T> {
    dim3 grid_dim(num_x_blocks, batch_size);
-    auto stream = reinterpret_cast<const platform::CUDADeviceContext &>(
+    auto stream = context.cuda_device_context().stream();
-                      context.device_context())
-                      .stream();
    conv_shift_forward<T><<<grid_dim, x_per_block, mem_per_block, stream>>>(
        x_data, y_data, out_data, x_width, y_width, y_half_width, batch_size);
@@ -159,9 +157,7 @@ class ConvShiftGradKernel<platform::GPUPlace, T>
    int y_width = Y->dims()[1];
    int y_half_width = (y_width - 1) / 2;
-    auto stream = reinterpret_cast<const platform::CUDADeviceContext &>(
+    auto stream = context.cuda_device_context().stream();
-                      context.device_context())
-                      .stream();
    const int x_per_block = 256;
    int num_x_blocks = div_up(x_width, x_per_block);

--- a/paddle/operators/cross_entropy_op.cu
+++ b/paddle/operators/cross_entropy_op.cu
@@ -82,24 +82,19 @@ class CrossEntropyGradientOpCUDAKernel : public framework::OpKernel<T> {
    int block = 512;
    int grid = (batch_size * class_num + block - 1) / block;
+    auto stream = ctx.cuda_device_context().stream();
    if (ctx.Attr<bool>("soft_label")) {
      auto* label_data = label->data<T>();
-      SoftCrossEntropyGradientKernel<T><<<
+      SoftCrossEntropyGradientKernel<T><<<grid, block, 0, stream>>>(
-          grid, block, 0, reinterpret_cast<const platform::CUDADeviceContext&>(
+          dx_data, dy_data, x_data, label_data, batch_size, class_num);
-                              ctx.device_context())
-                              .stream()>>>(dx_data, dy_data, x_data, label_data,
-                                           batch_size, class_num);
    } else {
      math::SetConstant<platform::GPUPlace, T> functor;
      functor(ctx.device_context(), dx, 0);
      auto* label_data = label->data<int64_t>();
      grid = (batch_size + block - 1) / block;
-      CrossEntropyGradientKernel<T><<<
+      CrossEntropyGradientKernel<T><<<grid, block, 0, stream>>>(
-          grid, block, 0, reinterpret_cast<const platform::CUDADeviceContext&>(
+          dx_data, dy_data, x_data, label_data, batch_size, class_num);
-                              ctx.device_context())
-                              .stream()>>>(dx_data, dy_data, x_data, label_data,
-                                           batch_size, class_num);
    }
  }
 };

--- a/paddle/operators/lookup_table_op.cu
+++ b/paddle/operators/lookup_table_op.cu
@@ -74,10 +74,10 @@ class LookupTableCUDAKernel : public framework::OpKernel<T> {
    dim3 threads(128, 8);
    dim3 grids(8, 1);
-    LookupTable<T, 128, 8, 8><<<
+    LookupTable<
-        grids, threads, 0, reinterpret_cast<const platform::CUDADeviceContext&>(
+        T, 128, 8,
-                               context.device_context())
+        8><<<grids, threads, 0, context.cuda_device_context().stream()>>>(
-                               .stream()>>>(output, table, ids, N, K, D);
+        output, table, ids, N, K, D);
  }
 };
@@ -95,9 +95,7 @@ class LookupTableGradCUDAKernel : public framework::OpKernel<T> {
      auto* ids_data = ids->data<int64_t>();
      auto ids_dim = ids->dims();
-      auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
+      auto stream = context.cuda_device_context().stream();
-                        context.device_context())
-                        .stream();
      // copy GPU memory to CPU pinned memory
      framework::Vector<int64_t> new_rows;
      new_rows.resize(ids_dim[0]);
@@ -136,11 +134,10 @@ class LookupTableGradCUDAKernel : public framework::OpKernel<T> {
      dim3 threads(128, 8);
      dim3 grids(8, 1);
-      LookupTableGrad<T, 128, 8,
+      LookupTableGrad<
-                      8><<<grids, threads, 0,
+          T, 128, 8,
-                           reinterpret_cast<const platform::CUDADeviceContext&>(
+          8><<<grids, threads, 0, context.cuda_device_context().stream()>>>(
-                               context.device_context())
+          d_table, d_output, ids, N, K, D);
-                               .stream()>>>(d_table, d_output, ids, N, K, D);
    }
  }
 };

--- a/paddle/operators/math/detail/CMakeLists.txt
+++ b/paddle/operators/math/detail/CMakeLists.txt
-if(WITH_AVX)
+cc_library(activation_functions SRCS avx_functions.cc)
-    cc_library(activation_functions SRCS avx_functions.cc)
-endif()
--- a/paddle/operators/math/detail/avx_functions.cc
+++ b/paddle/operators/math/detail/avx_functions.cc
@@ -12,6 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
+#ifdef __AVX__
 #include <immintrin.h>
 #include "paddle/operators/math/detail/activation_functions.h"
 // TODO(qingqing) refine this dependence
@@ -84,3 +86,5 @@ __m256 Identity(const __m256 a, const __m256 b) { return a; }
 }  // namespace math
 }  // namespace operators
 }  // namespace paddle
+#endif
--- a/paddle/operators/multiplex_op.cu
+++ b/paddle/operators/multiplex_op.cu
@@ -35,9 +35,7 @@ class MultiplexGPUKernel : public framework::OpKernel<T> {
    Tensor index_t_cpu;
    index_t_cpu.CopyFrom(*ids, platform::CPUPlace(), ctx.device_context());
    auto* index = index_t_cpu.data<int32_t>();
-    auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
+    auto stream = ctx.cuda_device_context().stream();
-                      ctx.device_context())
-                      .stream();
    Place place = boost::get<Place>(ctx.GetPlace());
    for (auto i = 0; i < rows; i++) {
      int32_t k = index[i];
@@ -73,9 +71,7 @@ class MultiplexGradGPUKernel : public framework::OpKernel<T> {
    index_t_cpu.CopyFrom(*ids, platform::CPUPlace(), ctx.device_context());
    auto* index = index_t_cpu.data<int32_t>();
-    auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
+    auto stream = ctx.cuda_device_context().stream();
-                      ctx.device_context())
-                      .stream();
    Place place = boost::get<Place>(ctx.GetPlace());
    for (auto i = 0; i < rows; i++) {
      size_t k = static_cast<size_t>(index[i]);

--- a/paddle/operators/nccl_op.cu
+++ b/paddle/operators/nccl_op.cu
@@ -64,9 +64,7 @@ class NCCLAllReduceKernel : public framework::OpKernel<T> {
    auto* comm = ctx.Input<Communicator>("Communicator");
-    auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
+    auto stream = ctx.cuda_device_context().stream();
-                      ctx.device_context())
-                      .stream();
    // device id
    int gpu_id = boost::get<platform::GPUPlace>(ctx.GetPlace()).GetDeviceId();

--- a/paddle/operators/sum_op.cc
+++ b/paddle/operators/sum_op.cc
@@ -24,10 +24,16 @@ class SumOp : public framework::OperatorWithKernel {
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInputs("X"), "Inputs(X) should not be null");
-    auto x_dims = ctx->GetInputsDim("X");
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of SumOp should not be null.");
+    if (ctx->IsRuntime() &&
+        ctx->GetOutputsVarType("Out")[0] ==
+            framework::VarDesc::LOD_TENSOR_ARRAY) {
+      return;  // skip runtime InferShape when the output is a tensor array
+    }
+    auto x_dims = ctx->GetInputsDim("X");
    size_t N = x_dims.size();
    PADDLE_ENFORCE_GT(N, 1, "Input tensors count should > 1.");
@@ -39,6 +45,28 @@ class SumOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("Out", in_dim);
    ctx->ShareLoD("X", /*->*/ "Out");
  }
+ protected:
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    auto x_vars = ctx.MultiInputVar("X");
+    if (x_vars[0]->IsType<framework::LoDTensor>()) {
+      return framework::ToDataType(
+          x_vars[0]->Get<framework::LoDTensor>().type());
+    } else if (x_vars[0]->IsType<framework::SelectedRows>()) {
+      return framework::ToDataType(
+          x_vars[0]->Get<framework::SelectedRows>().value().type());
+    } else if (x_vars[0]->IsType<framework::LoDTensorArray>()) {
+      auto& array = x_vars[0]->Get<framework::LoDTensorArray>();
+      for (auto& each : array) {
+        if (each.numel() != 0) {
+          return framework::ToDataType(each.type());
+        }
+      }
+    }
+    PADDLE_THROW("Unexpected branch. Input type is %s",
+                 x_vars[0]->Type().name());
+  }
 };
 class SumOpMaker : public framework::OpProtoAndCheckerMaker {
@@ -63,18 +91,32 @@ class SumOpVarTypeInference : public framework::VarTypeInference {
  void operator()(const framework::OpDescBind& op_desc,
                  framework::BlockDescBind* block) const override {
    auto& inputs = op_desc.Input("X");
-    auto default_var_type = framework::VarDesc::SELECTED_ROWS;
+    auto var_type = framework::VarDesc::SELECTED_ROWS;
    bool any_input_is_lod_tensor = std::any_of(
        inputs.begin(), inputs.end(), [block](const std::string& name) {
          return block->Var(name)->GetType() == framework::VarDesc::LOD_TENSOR;
        });
-    if (any_input_is_lod_tensor) {
-      default_var_type = framework::VarDesc::LOD_TENSOR;
+    auto is_tensor_array = [block](const std::string& name) {
+      return block->Var(name)->GetType() ==
+             framework::VarDesc::LOD_TENSOR_ARRAY;
+    };
+    bool any_input_is_tensor_array =
+        std::any_of(inputs.begin(), inputs.end(), is_tensor_array);
+    bool all_inputs_are_tensor_array =
+        std::all_of(inputs.begin(), inputs.end(), is_tensor_array);
+    if (any_input_is_tensor_array) {
+      PADDLE_ENFORCE(all_inputs_are_tensor_array);
+      var_type = framework::VarDesc::LOD_TENSOR_ARRAY;
+    } else if (any_input_is_lod_tensor) {
+      var_type = framework::VarDesc::LOD_TENSOR;
    }
    auto out_var_name = op_desc.Output("Out").front();
-    block->Var(out_var_name)->SetType(default_var_type);
+    block->Var(out_var_name)->SetType(var_type);
  }
 };
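The output-type rule that `SumOpVarTypeInference` implements after this change can be restated as a small pure function. The enum and function names below are hypothetical, and a standard exception stands in for `PADDLE_ENFORCE`.

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

enum class SumVarType { SELECTED_ROWS, LOD_TENSOR, LOD_TENSOR_ARRAY };

// Returns the output type of a sum over inputs of the given types:
// tensor arrays must not be mixed with other types, any LoDTensor input
// makes the output a LoDTensor, and otherwise the output is SelectedRows.
inline SumVarType InferSumOutputType(const std::vector<SumVarType>& inputs) {
  auto is_array = [](SumVarType t) {
    return t == SumVarType::LOD_TENSOR_ARRAY;
  };
  auto is_lod_tensor = [](SumVarType t) {
    return t == SumVarType::LOD_TENSOR;
  };
  if (std::any_of(inputs.begin(), inputs.end(), is_array)) {
    if (!std::all_of(inputs.begin(), inputs.end(), is_array)) {
      throw std::invalid_argument("tensor arrays cannot be mixed with others");
    }
    return SumVarType::LOD_TENSOR_ARRAY;
  }
  if (std::any_of(inputs.begin(), inputs.end(), is_lod_tensor)) {
    return SumVarType::LOD_TENSOR;
  }
  return SumVarType::SELECTED_ROWS;
}
```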

--- a/paddle/operators/sum_op.h
+++ b/paddle/operators/sum_op.h
@@ -11,6 +11,7 @@ limitations under the License. */
 #pragma once
 #include "paddle/framework/eigen.h"
+#include "paddle/framework/lod_tensor_array.h"
 #include "paddle/framework/op_registry.h"
 #include "paddle/operators/math/math_function.h"
 #include "paddle/operators/math/selected_rows_functor.h"
@@ -28,7 +29,7 @@ using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
 template <typename Place, typename T>
 class SumKernel : public framework::OpKernel<T> {
 public:
-  void Compute(const framework::ExecutionContext& context) const override {
+  void Compute(const framework::ExecutionContext &context) const override {
    auto in_vars = context.MultiInputVar("X");
    int N = in_vars.size();
    auto out_var = context.OutputVar("Out");
@@ -36,7 +37,7 @@ class SumKernel : public framework::OpKernel<T> {
    bool in_place = out_var == in_vars[0];
    if (out_var->IsType<framework::LoDTensor>()) {
-      auto* out = context.Output<Tensor>("Out");
+      auto *out = context.Output<Tensor>("Out");
      out->mutable_data<T>(context.GetPlace());
      auto result = EigenVector<T>::Flatten(*out);
@@ -51,11 +52,11 @@ class SumKernel : public framework::OpKernel<T> {
      // If in_place, just skip the first tensor
      for (int i = in_place ? 1 : 0; i < N; i++) {
        if (in_vars[i]->IsType<framework::LoDTensor>()) {
-          auto& in_t = in_vars[i]->Get<framework::LoDTensor>();
+          auto &in_t = in_vars[i]->Get<framework::LoDTensor>();
          auto in = EigenVector<T>::Flatten(in_t);
          result.device(place) = result + in;
        } else if (in_vars[i]->IsType<framework::SelectedRows>()) {
-          auto& in_t = in_vars[i]->Get<framework::SelectedRows>();
+          auto &in_t = in_vars[i]->Get<framework::SelectedRows>();
          functor(context.device_context(), in_t, out);
        } else {
          PADDLE_THROW("Variable type must be LoDTensor/SelectedRows.");
@@ -63,8 +64,8 @@ class SumKernel : public framework::OpKernel<T> {
      }
    } else if (out_var->IsType<framework::SelectedRows>()) {
      PADDLE_ENFORCE(!in_place, "SelectedRows not support inplace sum now");
-      auto* out = context.Output<SelectedRows>("Out");
+      auto *out = context.Output<SelectedRows>("Out");
-      auto* out_value = out->mutable_value();
+      auto *out_value = out->mutable_value();
      // Runtime InferShape
      size_t first_dim = 0;
@@ -88,9 +89,36 @@ class SumKernel : public framework::OpKernel<T> {
                offset, out);
        offset += in_vars[i]->Get<SelectedRows>().value().numel();
      }
+    } else if (out_var->IsType<framework::LoDTensorArray>()) {
+      auto &out_array = *out_var->GetMutable<framework::LoDTensorArray>();
+      for (size_t i = in_place ? 1 : 0; i < in_vars.size(); ++i) {
+        PADDLE_ENFORCE(in_vars[i]->IsType<framework::LoDTensorArray>(),
+                       "Only support all inputs are TensorArray");
+        auto &in_array = in_vars[i]->Get<framework::LoDTensorArray>();
+        for (size_t i = 0; i < in_array.size(); ++i) {
+          if (in_array[i].numel() != 0) {
+            if (i >= out_array.size()) {
+              out_array.resize(i + 1);
+            }
+            if (out_array[i].numel() == 0) {
+              out_array[i].CopyFrom(in_array[i], in_array[i].place(),
+                                    context.device_context());
+              out_array[i].set_lod(in_array[i].lod());
+            } else {
+              PADDLE_ENFORCE(out_array[i].lod() == in_array[i].lod());
+              auto in = EigenVector<T>::Flatten(in_array[i]);
+              auto result = EigenVector<T>::Flatten(out_array[i]);
+              result.device(context.GetEigenDevice<Place>()) = result + in;
+            }
+          }
+        }
+      }
+    } else {
+      PADDLE_THROW("Unexpected branch, output variable type is %s",
+                   out_var->Type().name());
    }
  }
 };
 }  // namespace operators
 }  // namespace paddle
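The new `LoDTensorArray` branch of `SumKernel::Compute` sums arrays slot by slot: empty input slots are skipped, the output grows as needed, an empty output slot is filled by copy, and a non-empty one is accumulated into. A minimal sketch of that loop follows, assuming a tensor can be modelled as a flat `std::vector<float>` whose emptiness stands for `numel() == 0`. The type and function names are hypothetical, and LoD handling is left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using TensorStub = std::vector<float>;
using TensorArrayStub = std::vector<TensorStub>;

// Accumulates every input array into *out, one slot at a time.
inline void AccumulateArrays(const std::vector<TensorArrayStub>& inputs,
                             TensorArrayStub* out) {
  for (const TensorArrayStub& in_array : inputs) {
    // The inner index is named j here; the diff reuses i, which shadows
    // the outer loop variable.
    for (std::size_t j = 0; j < in_array.size(); ++j) {
      if (in_array[j].empty()) continue;  // nothing to add for this slot
      if (j >= out->size()) out->resize(j + 1);
      TensorStub& dst = (*out)[j];
      if (dst.empty()) {
        dst = in_array[j];  // first contribution: plain copy
      } else {
        if (dst.size() != in_array[j].size()) {
          throw std::invalid_argument("slot sizes differ");
        }
        for (std::size_t k = 0; k < dst.size(); ++k) {
          dst[k] += in_array[j][k];  // element-wise accumulation
        }
      }
    }
  }
}
```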
--- a/paddle/operators/tensor_array_read_write_op.cc
+++ b/paddle/operators/tensor_array_read_write_op.cc
@@ -115,7 +115,6 @@ class WriteToArrayInferVarType : public framework::VarTypeInference {
 public:
  void operator()(const framework::OpDescBind &op_desc,
                  framework::BlockDescBind *block) const override {
-    VLOG(10) << "I am here?";
    for (auto &out_var : op_desc.OutputArgumentNames()) {
      VLOG(10) << "Set Variable " << out_var << " as LOD_TENSOR_ARRAY";
      block->Var(out_var)->SetType(framework::VarDesc::LOD_TENSOR_ARRAY);

--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
@@ -801,12 +801,13 @@ def zeros(shape, dtype, main_program=None):
 def increment(x, value=1.0, main_program=None):
    helper = LayerHelper("increment", **locals())
+    tmp = helper.create_tmp_variable(dtype=x.data_type)
    helper.append_op(
        type='increment',
        inputs={'X': [x]},
-        outputs={'Out': [x]},
+        outputs={'Out': [tmp]},
        attrs={'step': value})
-    return x
+    return tmp
 def array_write(x, i, array=None, main_program=None):

--- a/python/paddle/v2/framework/tests/test_array_read_write_op.py
+++ b/python/paddle/v2/framework/tests/test_array_read_write_op.py
 import unittest
-import numpy
 import paddle.v2.framework.core as core
 import paddle.v2.framework.layers as layers
 from paddle.v2.framework.executor import Executor
+from paddle.v2.framework.backward import append_backward_ops
+from paddle.v2.framework.framework import g_main_program
+import numpy
 class TestArrayReadWrite(unittest.TestCase):
@@ -21,16 +21,20 @@ class TestArrayReadWrite(unittest.TestCase):
        i = layers.zeros(shape=[1], dtype='int64')
        arr = layers.array_write(x=x[0], i=i)
-        layers.increment(x=i)
+        i = layers.increment(x=i)
+        i.stop_gradient = True
        arr = layers.array_write(x=x[1], i=i, array=arr)
-        layers.increment(x=i)
+        i = layers.increment(x=i)
+        i.stop_gradient = True
        arr = layers.array_write(x=x[2], i=i, array=arr)
        i = layers.zeros(shape=[1], dtype='int64')
        a0 = layers.array_read(array=arr, i=i)
-        layers.increment(x=i)
+        i = layers.increment(x=i)
+        i.stop_gradient = True  # index should not calculate gradient
        a1 = layers.array_read(array=arr, i=i)
-        layers.increment(x=i)
+        i = layers.increment(x=i)
+        i.stop_gradient = True
        a2 = layers.array_read(array=arr, i=i)
        mean_a0 = layers.mean(x=a0)
@@ -61,6 +65,29 @@ class TestArrayReadWrite(unittest.TestCase):
                           scope=scope))
        self.assertEqual(outs[0], outs[1])
+        total_sum = layers.sums(input=[a_sum, x_sum])
+        total_sum_scaled = layers.scale(x=total_sum, scale=1 / 6.0)
+        append_backward_ops(total_sum_scaled)
+        g_vars = map(g_main_program.global_block().var,
+                     [each_x.name + "@GRAD" for each_x in x])
+        g_out = [
+            item.sum()
+            for item in map(
+                numpy.array,
+                exe.run(feed={'x0': tensor,
+                              'x1': tensor,
+                              'x2': tensor},
+                        fetch_list=g_vars))
+        ]
+        g_out_sum = numpy.array(g_out).sum()
+        # Since the final gradient is 1 and the network is entirely linear
+        # (it is built from mean_op), the input gradients should also sum
+        # to 1.
+        self.assertAlmostEqual(1.0, g_out_sum, delta=0.1)
 if __name__ == '__main__':
    unittest.main()
--- a/python/paddle/v2/framework/tests/test_word2vec.py
+++ b/python/paddle/v2/framework/tests/test_word2vec.py
@@ -118,6 +118,10 @@ train_reader = paddle.batch(
 place = core.CPUPlace()
 exe = Executor(place)
+# fix https://github.com/PaddlePaddle/Paddle/issues/5434 then remove
+# below exit line.
+exit(0)
 exe.run(startup_program, feed={}, fetch_list=[])
 PASS_NUM = 100
 for pass_id in range(PASS_NUM):