Unverified commit e6bc358d authored by zhang wenhui, committed by GitHub

【NPU】Cherry-pick ascendrc ops code by 0325 to develop (#32197)

* merge 31065

* Fix typo of selected_npus (#31230)

* merge 31249

* [NPU] Support npu op pow and pow grad (#31247)

* [NPU] Support npu op: (1) pow (2) pow_grad

* Support fp16

* Fix pow npu fp16 test (#31256)

* support list of list attribute for NPU (#31299)

* support list of list attribute for NPU

* fix compile problem

* fix reference

* [NPU] Support npu op: (1) slice (2) slice_grad (#31275)

* fix reading flags from env (#31329)

* merge 31347

* [NPU] Support npu op layer_norm and layer_norm_grad (#31310)

* init commit, add layer_norm npu kernel

* fix typo

* add unittest

* add unittest

* fix bug

* fix bug

* refine ut

* [NPU] add npu kernel for equal op (#31393)

* add npu kernel for equal op

* refine code

* add more ut

* update year

* [NPU] Support npu kernel for shape op  (#31427)

* add shape npu

* fix

* fix

* fix endif (#31431)

* Fix pow, use fillD instead of broadcast (#31433)

* Fix pow, refine code (#31440)

* fix cmake of cryptopp to avoid downloading every time (#31451)

* [NPU] squeeze and unsqueeze op for ascend (#31452)
Co-authored-by: root <xiayanming@baidu.com>

* Support npu kernel for gather op (#31458)

* add gather npu op

* code review done

* update python new line

* precommit

* fix review

* del commit

* 【NPU】add scale op for npu (#31499)

* add scale npu

* fix

* fix

* Support TensorFromVector, TensorToVector of bool type (#31518)

* support TensorFromVector, TensorToVector of bool type

* add ut

* fix compile problem

* 【NPU】support npu kernel for fill_constant op (#31521)

* add fill_constant npu

* add fill_constant npu

* fix

* cherry-pick 31422, solve conflict

* 【NPU】Support npu kernel for matmul op (#31544)

* add matmulv2_npu

* add matmul

* add matmul

* [NPU] Support npu op elementwise_mul and elementwise_mul_grad (#31571)

* [NPU] Support npu op elementwise_max (#31574)

* 【NPU】add relu op for  npu (#31515)

* add relu npu

* fixed

* fix

* 【NPU】Support npu kernel for reshape2 op (#31524)

* add reshape2 npu

* add reshape2

* [NPU] Support npu kernel for gather op fix bug (#31541)

* add gather npu op

* code review done

* update python new line

* precommit

* fix review

* del commit

* update gather_grad

* fix bug

* fix bug

* [NPU] Support npu kernel for amp_check_finite_and_unscale_npu op (#31457)

* Support npu kernel for amp_check_finite_and_unscale_npu op

* support EnforceNotMet exception

* fix exception bug

* modify python unittest

* precommit

* update c++ unittest

* fix review

* fix review

* [NPU] accuracy op (#31492)

* accuracy op

* fix license

* fix

* add test and fix bug

* [NPU] add Assign OP (#31561)

* add assign op

* add test assign npu test

* dele if def
Co-authored-by: oyjxer <1728722986@qq.com>

* [NPU] fix npu op elementwise_mul_grad (#31592)

* 【NPU】Support npu op gelu and gelu_grad (#31530)

* Support npu op gelu and gelu_grad

* Support npu op gelu and gelu_grad

* [NPU] fix assign cmake (#31595)

* fix gather_grad bug (#31607)

* [NPU] add range op (#31560)

* add range op

* fix codestyle; call GetSize directly
Co-authored-by: oyjxer <1728722986@qq.com>

* 【NPU】Support npu op elementwise_div and elementwise_div_grad (#31573)

* Support npu op elementwise_div and elementwise_div_grad

* Support npu op elementwise_div and elementwise_div_grad

* Support npu op elementwise_div and elementwise_div_grad

* [NPU] Support npu op log, log_grad, sqrt, sqrt_grad, square, tanh and tanh_grad (#31600)

* [NPU] Support npu op logicalnot_op (#31534)

* [NPU] Support npu op elementwise_min (#31575)

* [NPU] Support npu op elementwise_pow (#31576)

* [NPU] Support npu op table_lookup_v2 and table_lookup_v2_grad (#31399)

* [npu] support npu kernel `table_lookup_v2`

* clean up

* +python test

* +cmake

* clean up

* remove int8 kernel
+ python unittest for fp16

* clean up

* [NPU] support npu kernel for `less_than` (#31327)

* [npu] support npu kernel for `less than`

* remove int* kernel

* cleanup

* [NPU] Support npu kernel scatter op (#31624)

* Support npu kernel scatter op

* Add more test

* [NPU] fix allocator min chunk size (#31632)

* [NPU] Support NPU kernel cast op (#31635)
Co-authored-by: frankwhzhang <frankwhzhang@126.com>

* [NPU] add npu kernel for sgd (#31639)

* 【NPU】Support NPU kernel for reduce_sum op v2 (#31620)

* add reduce_sum

* fix broadcastd

* fix test

* fix

* add unsqueeze in reduce_sum

* add template

* add unittest for keep_dim

* test reduce_all
Co-authored-by: frankwhzhang <frankwhzhang@126.com>

* [NPU] add npu kernel for adam (#31644)

* add npu kernel for adam

* refine code

* disable test

* modify atol

* 【NPU】Support npu kernel for mul op (#31584)

* add mul

* add test mul

* [NPU] add npu kernel for softmax_with_cross_entropy (#31656)

* init

* fix bugs

* [NPU] add npu kernel for mean Op (#31562)

* update mean op

* update mean op

* give a better test activation
Co-authored-by: oyjxer <1728722986@qq.com>

* Revert "[NPU] add npu kernel for mean Op (#31562)" (#31665)

This reverts commit 468ac699.

* 【NPU】Add TensorCopy to NPU kernel for reduce_sum op  (#31667)

* update unittest

* add TensorCopy in npu grad kernel

* [NPU] Support npu op `expand` (#31405)

* [npu] support npu kernel  for `expand`

* [NPU] fix shape of dx in mul_grad (#31675)

* fix shape of dx

* refine code

* [NPU] add Increment op (#31563)

* add increment

* fix

* update test increment op inplace

* update increment op

* increment b = 2
Co-authored-by: oyjxer <1728722986@qq.com>

* [NPU] add NPU topk op (#31596)

* add topk op

* add cmake

* update topk npu op

* refactor func

* fix bug where the test did not run the NPU TopKD kernel

* NPUPlace(4) to NPUPlace(0)

* update comment
Co-authored-by: oyjxer <1728722986@qq.com>

* [NPU] Support NPU kernel sum op (#31671)

* [NPU] npu support `transpose` (#31486)

* cherry-pick 31564, solve conflict

* [NPU] Fix bug: Fix calculation errors of pow grad npu kernel (#31699)

* [NPU] Support testing grad of NPU ops in OpTest (#31697)

* [NPU] Support NPU kernel of stack op (#31711)

* [NPU] Remove redundant ctest of top_k_op_npu_test (#31718)

* [NPU] fix reshape npu op kernel (#31726)

* rename npu op file

* fix reshape

* [NPU] change transpose to transpose2 (#31734)

* change transpose to transpose2

* fix bug

* [NPU] Support  mean npu kernel (#31729)

* [NPU] fix some bugs of npu op (#31739)

* fix softmax

* fix mean

* fix lookup_table_v2

* 【NPU】Fix npu kernel elementwise_div_grad  (#31753)

* [NPU] fix the grad kernel diff bug of gather op (#31757)

* fix gather grad kernel diff

* fix gather grad kernel diff

* fix gather review bug

* 【NPU】Fix reshape test & add grad test (#31776)

* fix

* fix

* [NPU] support fp16 for npu accuracy op (#31797)

* [NPU] support list of tensor input (#31801)

* support list of tensor as npu input

* add comment

* fix typo

* fix typo

* [NPU] add npu kernel for concat op (#31695)

* add npu kernel for concat op

* add npu kernel for concat op

* refine code

* update

* refine concat_grad

* [NPU] Support npu kernel for op elementwise_floordiv (#31822)

* [NPU] fix bug of lookup_table_v2_grad (#31834)

* [NPU] support default stream (#31510)

* [NPU] support mixed precision input for npu layer norm (#31847)

* support mixed precision input for npu layer norm

* fix layer_norm npu kernel
Co-authored-by: zhiqiu <chenqiuliang@baidu.com>

* 【NPU】Support npu kernel for update_loss_scaling op (#31830)

* add update_loss_scaling_npu NPU kernel

* change TensorFromVec to Memset

* fix compile problem (#31850)

* [NPU] support npu for conditional_block op (#31854)

* 【NPU】Add int dtype kernel for reshape2 op (#31864)

* fix

* fix

* [NPU] fix some op bugs (#31855)

* fix some op bugs

* fix some bugs

* follow comments

* fix log level

* add ut

* [NPU] support fp16 of input for api pow (#31871)

* [NPU] add npu kernel for truncated_gaussian_random op (#31654)

* init

* add todo

* add npu kernel for truncated_gaussian_random

* add sync

* fix concat_grad

* fix typo

* fix compile

* fix compile

* fix compile

* fix compile

* fix compile

* fix compile

* fix code style

* fix code style

* fix code

* Fix op test (#32231)

* fix conditional block (#32243)

* fix code style
Co-authored-by: xiayanming <41795079@qq.com>
Co-authored-by: Leo Chen <chenqiuliang@baidu.com>
Co-authored-by: liym27 <33742067+liym27@users.noreply.github.com>
Co-authored-by: Reventon_L <luyuxiang1994@qq.com>
Co-authored-by: root <xiayanming@baidu.com>
Co-authored-by: oyjxer <1728722986@qq.com>
Co-authored-by: yinhaofeng <66763551+yinhaofeng@users.noreply.github.com>
Co-authored-by: OleNet <olenet@126.com>
Co-authored-by: Meiyim <chen_xuyi@outlook.com>
Co-authored-by: oyxuan-11 <963650125@qq.com>
Co-authored-by: pangyoki <pangyoki@126.com>
Parent 69d80274
@@ -32,7 +32,7 @@ cache_third_party(extern_gloo
TAG ${GLOO_TAG}
DIR GLOO_SOURCE_DIR)
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
ExternalProject_Add(
extern_gloo
${EXTERNAL_PROJECT_LOG_ARGS}
......
@@ -242,7 +242,7 @@ endif()
)
ENDFUNCTION()
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
SET(PROTOBUF_VERSION 3.8.0)
else()
SET(PROTOBUF_VERSION 3.1.0)
......
@@ -16,7 +16,7 @@ INCLUDE(ExternalProject)
SET(THREADPOOL_PREFIX_DIR ${THIRD_PARTY_PATH}/threadpool)
SET(THREADPOOL_SOURCE_DIR ${THIRD_PARTY_PATH}/threadpool/src/extern_threadpool)
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
SET(THREADPOOL_REPOSITORY https://gitee.com/tianjianhe/ThreadPool.git)
else()
SET(THREADPOOL_REPOSITORY ${GIT_URL}/progschj/ThreadPool.git)
......
@@ -43,7 +43,7 @@ cache_third_party(extern_warpctc
TAG ${WARPCTC_TAG}
DIR WARPCTC_SOURCE_DIR)
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
ExternalProject_Add(
extern_warpctc
${EXTERNAL_PROJECT_LOG_ARGS}
......
@@ -135,6 +135,7 @@ void TensorFromArray(const T* src, const size_t& array_size,
}
#endif
}
template <typename T>
void TensorFromVector(const std::vector<T>& src,
const platform::DeviceContext& ctx, Tensor* dst) {
@@ -167,6 +168,49 @@ void TensorFromVector(const std::vector<T>& src,
#endif
}
// The fully specialized function should be inline to avoid
// multi-definition.
template <>
inline void TensorFromVector(const std::vector<bool>& src,
const platform::DeviceContext& ctx, Tensor* dst) {
// vector<bool> has no data() member, use array instead.
// See details:
// https://stackoverflow.com/questions/46115669/why-does-stdvectorbool-have-no-data/46115714
bool* array = new bool[src.size()];
for (unsigned int i = 0; i < src.size(); i++) {
array[i] = static_cast<bool>(src[i]);
}
auto dst_place = ctx.GetPlace();
auto src_ptr = static_cast<const void*>(array);
platform::CPUPlace src_place;
dst->Resize({static_cast<int64_t>(src.size())});
auto dst_ptr = static_cast<void*>(dst->mutable_data<bool>(dst_place));
auto size = src.size() * sizeof(bool);
if (platform::is_cpu_place(dst_place)) {
memory::Copy(BOOST_GET_CONST(platform::CPUPlace, dst_place), dst_ptr,
src_place, src_ptr, size);
}
#ifdef PADDLE_WITH_CUDA
else if (platform::is_gpu_place(dst_place)) { // NOLINT
memory::Copy(
BOOST_GET_CONST(platform::CUDAPlace, dst_place), dst_ptr, src_place,
src_ptr, size,
reinterpret_cast<const platform::CUDADeviceContext&>(ctx).stream());
}
#endif
#ifdef PADDLE_WITH_ASCEND_CL
else if (platform::is_npu_place(dst_place)) { // NOLINT
memory::Copy(
BOOST_GET_CONST(platform::NPUPlace, dst_place), dst_ptr, src_place,
src_ptr, size,
reinterpret_cast<const platform::NPUDeviceContext&>(ctx).stream());
}
#endif
delete[] array;
}
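// Note: because std::vector<bool> is bit-packed and has no data() member,
// the specialization above stages the values through a temporary bool array,
// which costs one extra host-side copy compared to the generic path.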
template <typename T>
void TensorFromVector(const std::vector<T>& src, Tensor* dst) {
platform::CPUPlace dst_place = platform::CPUPlace();
@@ -179,6 +223,23 @@ void TensorFromVector(const std::vector<T>& src, Tensor* dst) {
memory::Copy(dst_place, dst_ptr, src_place, src_ptr, size);
}
template <>
inline void TensorFromVector(const std::vector<bool>& src, Tensor* dst) {
bool* array = new bool[src.size()];
for (unsigned int i = 0; i < src.size(); i++) {
array[i] = static_cast<bool>(src[i]);
}
platform::CPUPlace dst_place = platform::CPUPlace();
auto src_ptr = static_cast<const void*>(array);
platform::CPUPlace src_place;
dst->Resize({static_cast<int64_t>(src.size())});
auto dst_ptr = static_cast<void*>(dst->mutable_data<bool>(dst_place));
auto size = src.size() * sizeof(bool);
memory::Copy(dst_place, dst_ptr, src_place, src_ptr, size);
delete[] array;
}
template <typename T>
void TensorToVector(const Tensor& src, const platform::DeviceContext& ctx,
std::vector<T>* dst) {
@@ -212,6 +273,46 @@ void TensorToVector(const Tensor& src, const platform::DeviceContext& ctx,
#endif
}
template <>
inline void TensorToVector(const Tensor& src,
const platform::DeviceContext& ctx,
std::vector<bool>* dst) {
auto src_ptr = static_cast<const void*>(src.data<bool>());
auto size = src.numel() * sizeof(bool);
bool* array = new bool[src.numel()];
platform::CPUPlace dst_place;
dst->resize(src.numel());
auto dst_ptr = static_cast<void*>(array);
if (platform::is_cpu_place(src.place())) {
memory::Copy(dst_place, dst_ptr,
BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr,
size);
}
#ifdef PADDLE_WITH_CUDA
else if (platform::is_gpu_place(src.place())) { // NOLINT
memory::Copy(
dst_place, dst_ptr, BOOST_GET_CONST(platform::CUDAPlace, src.place()),
src_ptr, size,
reinterpret_cast<const platform::CUDADeviceContext&>(ctx).stream());
}
#endif
#ifdef PADDLE_WITH_ASCEND_CL
else if (platform::is_npu_place(src.place())) { // NOLINT
memory::Copy(
dst_place, dst_ptr, BOOST_GET_CONST(platform::NPUPlace, src.place()),
src_ptr, size,
reinterpret_cast<const platform::NPUDeviceContext&>(ctx).stream());
}
#endif
for (unsigned int i = 0; i < src.numel(); i++) {
(*dst)[i] = static_cast<bool>(array[i]);
}
delete[] array;
}
template <typename T>
void TensorToVector(const Tensor& src, std::vector<T>* dst) {
auto src_ptr = static_cast<const void*>(src.data<T>());
@@ -231,6 +332,32 @@ void TensorToVector(const Tensor& src, std::vector<T>* dst) {
BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr, size);
}
template <>
inline void TensorToVector(const Tensor& src, std::vector<bool>* dst) {
auto src_ptr = static_cast<const void*>(src.data<bool>());
auto size = src.numel() * sizeof(bool);
bool* array = new bool[src.numel()];
platform::CPUPlace dst_place;
dst->resize(src.numel());
auto dst_ptr = static_cast<void*>(array);
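// Without a DeviceContext there is no stream to drive a device-side copy,
// so this overload only accepts tensors that already live on the CPU.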
PADDLE_ENFORCE_EQ(
platform::is_cpu_place(src.place()), true,
platform::errors::InvalidArgument(
"The input tensor should be CPU device, but actually it is in %s.",
src.place()));
memory::Copy(dst_place, dst_ptr,
BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr, size);
for (unsigned int i = 0; i < src.numel(); i++) {
(*dst)[i] = static_cast<bool>(array[i]);
}
delete[] array;
}
std::ostream& operator<<(std::ostream& os, const Tensor& t);
} // namespace framework
} // namespace paddle
@@ -242,6 +242,61 @@ TEST(TensorToVector, Tensor) {
#endif
}
TEST(TensorToVector, Tensor_bool) {
{
paddle::framework::Tensor src;
bool* src_ptr =
src.mutable_data<bool>({3, 3}, paddle::platform::CPUPlace());
for (int i = 0; i < 3 * 3; ++i) {
src_ptr[i] = static_cast<bool>(i % 2);
}
paddle::platform::CPUPlace place;
std::vector<bool> dst;
paddle::framework::TensorToVector<bool>(src, &dst);
for (int i = 0; i < 3 * 3; ++i) {
EXPECT_EQ(src_ptr[i], dst[i]);
}
}
#ifdef PADDLE_WITH_CUDA
{
std::vector<bool> src_vec = {
false, true, false, true, false, true, false, true, false,
};
paddle::framework::Tensor gpu_tensor;
paddle::platform::CUDAPlace place;
paddle::platform::CUDADeviceContext gpu_ctx(place);
paddle::framework::TensorFromVector<bool>(src_vec, gpu_ctx, &gpu_tensor);
std::vector<bool> dst;
paddle::framework::TensorToVector<bool>(gpu_tensor, gpu_ctx, &dst);
for (int i = 0; i < 3 * 3; ++i) {
EXPECT_EQ(src_vec[i], dst[i]);
}
}
#endif
#ifdef PADDLE_WITH_ASCEND_CL
{
std::vector<bool> src_vec = {
false, true, false, true, false, true, false, true, false,
};
paddle::framework::Tensor npu_tensor;
paddle::platform::NPUPlace place(0);
paddle::platform::NPUDeviceContext npu_ctx(place);
paddle::framework::TensorFromVector<bool>(src_vec, npu_ctx, &npu_tensor);
std::vector<bool> dst;
paddle::framework::TensorToVector<bool>(npu_tensor, npu_ctx, &dst);
for (int i = 0; i < 3 * 3; ++i) {
EXPECT_EQ(src_vec[i], dst[i]);
}
}
#endif
}
TEST(TensorFromDLPack, Tensor) {
{
std::vector<int> src_vec = {1, 2, 3, 4, 5, 6, 7, 8, 9};
......
@@ -45,6 +45,17 @@ using Attribute = boost::variant<
using AttributeMap = std::unordered_map<std::string, Attribute>;
#ifdef PADDLE_WITH_ASCEND_CL
using NPUAttribute =
boost::variant<boost::blank, int, float, std::string, std::vector<int>,
std::vector<float>, std::vector<std::string>, bool,
std::vector<bool>, BlockDesc*, int64_t,
std::vector<BlockDesc*>, std::vector<int64_t>,
std::vector<double>, std::vector<std::vector<int64_t>>>;
using NPUAttributeMap = std::unordered_map<std::string, NPUAttribute>;
#endif
using OpCreator = std::function<OperatorBase*(
const std::string& /*type*/, const VariableNameMap& /*inputs*/,
const VariableNameMap& /*outputs*/, const AttributeMap& /*attrs*/)>;
......
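The NPUAttribute variant above is what lets NPU ops carry list-of-list attributes (see "support list of list attribute for NPU (#31299)" in the commit message). A minimal sketch of filling such an attribute map, with hypothetical attribute names chosen only for illustration:
#ifdef PADDLE_WITH_ASCEND_CL
// Hypothetical attribute names; the value types are the ones added to the
// NPUAttribute variant in the framework header patched above.
paddle::framework::NPUAttributeMap attrs = {
    {"axes", std::vector<int64_t>{0, 1}},
    {"paddings", std::vector<std::vector<int64_t>>{{0, 0}, {1, 1}}}};
#endif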
@@ -206,8 +206,16 @@ void Copy<platform::NPUPlace, platform::CPUPlace>(platform::NPUPlace dst_place,
if (UNLIKELY(num == 0)) return;
platform::SetNPUDeviceId(dst_place.device);
// NOTE(ascendrc): NPU memcpy async from host to device is a "real" async,
// which is different from CUDA. In Paddle, when async is called, "sync"
// is actually run, which means Paddle doesn't fully support async yet.
// TODO(ascendrc): Support NPU memcpy async for better performance.
stream = nullptr;
VLOG(4) << "memory::Copy " << num << " Bytes from " << src_place << " to "
<< dst_place << " by thream(" << stream << ")";
if (stream) {
platform::RecordEvent record_event("NpuMemcpyAsync:CPU->NPU");
platform::NPUMemcpyAsync(dst, src, num, ACL_MEMCPY_HOST_TO_DEVICE, stream);
@@ -226,8 +234,16 @@ void Copy<platform::CPUPlace, platform::NPUPlace>(platform::CPUPlace dst_place,
if (UNLIKELY(num == 0)) return;
platform::SetNPUDeviceId(src_place.device);
// NOTE(ascendrc): NPU memcpy async from device to host is a "real" async,
// which is different from CUDA. In Paddle, when async is called, "sync"
// is actually run, which means Paddle doesn't fully support async yet.
// TODO(ascendrc): Support NPU memcpy async for better performance.
stream = nullptr;
VLOG(4) << "memory::Copy " << num << " Bytes from " << src_place << " to "
<< dst_place << " by thream(" << stream << ")";
if (stream) {
platform::RecordEvent record_event("NpuMemcpyAsync:NPU->CPU");
platform::NPUMemcpyAsync(dst, src, num, ACL_MEMCPY_DEVICE_TO_HOST, stream);
......
@@ -124,6 +124,7 @@ if (WITH_ASCEND)
endif()
if (WITH_ASCEND_CL)
cc_test(assign_op_npu_test SRCS assign_op_npu_test.cc DEPS assign_op)
cc_library(npu_op_runner SRCS npu_op_runner.cc DEPS operator npu_info)
set(COMMON_OP_DEPS ${COMMON_OP_DEPS} npu_op_runner)
endif()
@@ -141,8 +142,8 @@ set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COMMON_OP_DEPS})
set(GLOB_OPERATOR_DEPS ${OPERATOR_DEPS} CACHE INTERNAL "Global Op dependencies")
cc_test(test_common_infer_shape_functions SRCS test_common_infer_shape_functions.cc DEPS common_infer_shape_functions ${COMMON_OP_DEPS} activation_op elementwise_add_op softmax_op softmax)
cc_test(assign_op_test SRCS assign_op_test.cc DEPS assign_op)
cc_test(gather_test SRCS gather_test.cc DEPS tensor)
cc_test(assign_op_test SRCS assign_op_test.cc DEPS assign_op)
cc_test(scatter_test SRCS scatter_test.cc DEPS tensor math_function)
cc_test(beam_search_decode_op_test SRCS beam_search_decode_op_test.cc DEPS lod_tensor)
cc_test(strided_memcpy_test SRCS strided_memcpy_test.cc DEPS tensor memory)
@@ -163,10 +164,19 @@ if (WITH_PYTHON)
cc_library(py_func_op SRCS py_func_op.cc DEPS op_registry python pybind)
endif()
if (WITH_ASCEND_CL)
cc_test(range_op_npu_test SRCS range_op_npu_test.cc DEPS op_registry range_op scope device_context enforce executor)
cc_test(lookup_table_v2_op_npu_test SRCS lookup_table_v2_op_npu_test.cc DEPS op_registry lookup_table_v2_op scope device_context enforce executor compare_op)
endif()
set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library")
add_subdirectory(benchmark)
cc_test(op_debug_string_test SRCS op_debug_string_test.cc DEPS elementwise_add_op)
if (WITH_ASCEND_CL)
cc_test(transpose_op_npu_test SRCS transpose_op_npu_test.cc DEPS op_registry transpose_op scope device_context enforce executor)
endif()
if(WITH_MKLDNN)
include(mkldnn/inplace_op_tests.cmake)
@@ -180,3 +190,7 @@ if(WITH_UNITY_BUILD)
# The specified link dependency needs to be displayed here.
target_link_libraries(paddle_operators_unity ${OP_HEADER_DEPS} ${COMMON_OP_DEPS})
endif()
if(WITH_ASCEND_CL)
cc_test(gelu_op_npu_test SRCS gelu_op_npu_test.cc DEPS op_registry gelu_op scope device_context enforce executor)
endif()
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/framework/ddim.h"
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/activation_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class PowNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto factor = ctx.Attr<float>("factor");
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("Power", {*x}, {*out},
{{"power", factor},
{"scale", static_cast<float>(1.0)},
{"shift", static_cast<float>(0.0)}});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class PowGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto factor = ctx.Attr<float>("factor");
auto x_dims = x->dims();
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// NOTE(liym27): dx = dout * factor * x.pow(factor-1)
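// Derivation: with y = pow(x, factor), dy/dx = factor * pow(x, factor - 1),
// so the chain rule gives dx = dout * factor * pow(x, factor - 1).
// Steps 1-4 below assemble this product from the Power, FillD and Mul ops.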
// Step1: Compute x_pow = x.pow(factor-1)
Tensor x_pow(x->type());
x_pow.mutable_data<T>(x->dims(), place);
auto runner_pow = NpuOpRunner("Power", {*x}, {x_pow},
{{"power", factor - static_cast<float>(1)}});
runner_pow.Run(stream);
// Step 2: Construct a broadcast factor, which has the same shape with x.
// 2.1 Get a factor tensor with shape [1].
Tensor factor_tensor(framework::proto::VarType::FP32);
factor_tensor.mutable_data<float>({1}, place);
TensorFromVector(std::vector<float>{factor}, ctx.device_context(),
&factor_tensor);
// 2.2 Get the factor which has the shape with x and the same value with
// factor.
Tensor factor_bc_tensor(framework::proto::VarType::FP32);
factor_bc_tensor.mutable_data<float>(x_dims, place);
auto runner_bc = NpuOpRunner("FillD", {factor_tensor}, {factor_bc_tensor},
{{"dims", framework::vectorize(x_dims)}});
runner_bc.Run(stream);
// Step 3: Compute x_power_mul_factor = factor * x.pow(factor-1)
Tensor x_power_mul_factor(x->type());
x_power_mul_factor.mutable_data<T>(x->dims(), place);
auto runner_mul_1 =
NpuOpRunner("Mul", {factor_bc_tensor, x_pow}, {x_power_mul_factor}, {});
runner_mul_1.Run(stream);
// Step 4: Compute dx = dout * factor * x.pow(factor-1)
dx->mutable_data<T>(place);
auto runner_mul_2 =
NpuOpRunner("Mul", {*dout, x_power_mul_factor}, {*dx}, {});
runner_mul_2.Run(stream);
}
};
template <typename DeviceContext, typename T>
class ReluNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("Relu",
{
*x,
},
{*out}, {});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class ReluGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* out = ctx.Input<Tensor>("Out");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
dx->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReluGrad", {*dout, *out}, {*dx}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class SqrtNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Sqrt", {*x}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class SqrtGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* out = ctx.Input<Tensor>("Out");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto place = ctx.GetPlace();
dx->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto dx_runner = NpuOpRunner("SqrtGrad", {*out, *dout}, {*dx}, {});
dx_runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class LogNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
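// log(x) is computed indirectly as Log1p(x - 1): build a tensor of ones,
// subtract it from x, then apply Log1p.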
Tensor one(x->type());
one.mutable_data<T>(x->dims(), place);
auto one_runner = NpuOpRunner("OnesLike", {*x}, {one}, {});
one_runner.Run(stream);
Tensor sub(x->type());
sub.mutable_data<T>(x->dims(), place);
auto sub_runner = NpuOpRunner("Sub", {*x, one}, {sub}, {});
sub_runner.Run(stream);
auto out_runner = NpuOpRunner("Log1p", {sub}, {*out}, {});
out_runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class LogGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* x = ctx.Input<Tensor>("X");
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto place = ctx.GetPlace();
dx->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("DivNoNan", {*dout, *x}, {*dx}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class TanhNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Tanh", {*x}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class TanhGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* out = ctx.Input<Tensor>("Out");
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto place = ctx.GetPlace();
dx->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto dx_runner = NpuOpRunner("TanhGrad", {*out, *dout}, {*dx}, {});
dx_runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class SquareNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Square", {*x}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
pow, ops::PowNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::PowNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
pow_grad, ops::PowGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::PowGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
relu, ops::ReluNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ReluNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
relu_grad,
ops::ReluGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ReluGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
sqrt, ops::SqrtNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::SqrtNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
sqrt_grad,
ops::SqrtGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::SqrtGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
log, ops::LogNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::LogNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
log_grad, ops::LogGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::LogGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
tanh, ops::TanhNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::TanhNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
tanh_grad,
ops::TanhGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::TanhGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
square, ops::SquareNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::SquareNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>,
ops::SquareNPUKernel<paddle::platform::NPUDeviceContext, int>);
@@ -4,3 +4,7 @@ if(WITH_UNITY_BUILD)
include(unity_build_rule.cmake)
endif()
register_operators()
if(WITH_ASCEND_CL)
cc_test(check_finite_and_unscale_op_npu_test SRCS check_finite_and_unscale_op_npu_test.cc DEPS op_registry check_finite_and_unscale_op scope device_context enforce executor)
endif()
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/amp/check_finite_and_unscale_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class CheckFiniteAndUnscaleNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const {
const auto xs = ctx.MultiInput<framework::Tensor>("X");
const auto* scale = ctx.Input<framework::Tensor>("Scale");
auto outs = ctx.MultiOutput<framework::Tensor>("Out");
auto* found_inf = ctx.Output<framework::Tensor>("FoundInfinite");
found_inf->mutable_data<bool>(ctx.GetPlace());
bool found_inf_data = false;
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// step1: inverse scale(RealDiv)
Tensor const_tensor;
const_tensor.mutable_data<T>({1}, ctx.GetPlace());
TensorFromVector(std::vector<T>{static_cast<T>(1.0)}, ctx.device_context(),
&const_tensor);
ctx.template device_context<paddle::platform::NPUDeviceContext>().Wait();
// Inverse(1.0/scale)
Tensor* tmp_inverse_out = const_cast<Tensor*>(scale);
Tensor inverse_out(scale->type());
inverse_out.Resize(scale->dims());
inverse_out.mutable_data<T>(ctx.GetPlace());
auto runner_inverse =
NpuOpRunner("Div", {const_tensor, *scale}, {inverse_out}, {});
runner_inverse.Run(stream);
tmp_inverse_out = &inverse_out;
size_t x_size = xs.size();
for (size_t i = 0; i < x_size; ++i) {
found_inf_data = true;
const auto* x = xs[i];
auto* out = outs[i];
out->mutable_data<T>(ctx.GetPlace());
// step2: CheckNumerics
// CheckNumerics runs on the Ascend AI CPU, which delivers poor
// performance.
Tensor check_xout(x->type());
check_xout.Resize(x->dims());
check_xout.mutable_data<T>(ctx.GetPlace());
try {
auto runner_checknumerics =
NpuOpRunner("CheckNumerics", {*x}, {check_xout},
{{"message", std::string("check_nan_and_inf")}});
runner_checknumerics.Run(stream);
} catch (platform::EnforceNotMet& exception) {
LOG(WARNING) << "[check_nan_and_inf] detected contains NaN or INF!!!";
found_inf_data = true;
} catch (...) {
LOG(WARNING) << "[check_nan_and_inf] detected contains NaN or INF!!!";
found_inf_data = true;
}
if (!found_inf_data) {
// Mul: out = x * (1 / scale)
auto runner_matmul =
NpuOpRunner("Mul", {*x, *tmp_inverse_out}, {*out}, {});
runner_matmul.Run(stream);
} else {
// ZerosLike
auto runner_zeroslike = NpuOpRunner("ZerosLike", {*x}, {*out}, {});
runner_zeroslike.Run(stream);
} // end if
} // end for
// set found_inf to true
if (found_inf_data) {
Tensor found_inf_tensor;
found_inf_tensor.Resize({1});
bool* is_found_inf =
found_inf_tensor.mutable_data<bool>(paddle::platform::CPUPlace());
*is_found_inf = true;
framework::TensorCopySync(found_inf_tensor, ctx.GetPlace(), found_inf);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(check_finite_and_unscale,
ops::CheckFiniteAndUnscaleNPUKernel<float>,
ops::CheckFiniteAndUnscaleNPUKernel<plat::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <algorithm>
#include <cstdlib>
#include <memory>
#include <random>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/platform/enforce.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
using Tensor = paddle::framework::Tensor;
USE_OP(check_finite_and_unscale);
USE_OP_DEVICE_KERNEL(check_finite_and_unscale, NPU);
struct InputVars {
std::string name;
f::LoDTensor *tensor;
};
template <typename T>
void Compare(f::Scope *scope, const p::DeviceContext &ctx) {
const f::DDim dims = f::make_ddim({2, 2});
auto place = ctx.GetPlace();
// init input
std::vector<InputVars> input_names = {
{"x", scope->Var("x")->GetMutable<f::LoDTensor>()},
{"x1", scope->Var("x1")->GetMutable<f::LoDTensor>()}};
auto *scale = scope->Var("scale")->GetMutable<f::LoDTensor>();
// init output
auto *out = scope->Var("out")->GetMutable<f::LoDTensor>();
auto *out1 = scope->Var("out1")->GetMutable<f::LoDTensor>();
auto *found_inf = scope->Var("found_inf")->GetMutable<f::LoDTensor>();
// Initialize input data
const int num_inputs = input_names.size();
size_t numel = static_cast<size_t>(f::product(dims));
for (int i = 0; i < num_inputs; ++i) {
std::vector<T> init_xs;
for (size_t j = 0; j < numel; ++j) {
if (j == 0) {
init_xs.push_back(static_cast<T>(NAN));
} else {
init_xs.push_back(static_cast<T>(j + 1));
}
}
f::TensorFromVector(init_xs, ctx, input_names[i].tensor);
input_names[i].tensor->Resize(dims);
}
f::TensorFromVector(std::vector<T>{static_cast<T>(0.5)}, ctx, scale);
ctx.Wait();
// run
f::AttributeMap attrs;
auto op = f::OpRegistry::CreateOp(
"check_finite_and_unscale", {{"X", {"x", "x1"}}, {"Scale", {"scale"}}},
{{"Out", {"out", "out1"}}, {"FoundInfinite", {"found_inf"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
// out0
std::vector<T> out_vec;
f::TensorToVector(*out, ctx, &out_vec);
EXPECT_EQ(out_vec.size(), static_cast<size_t>(4));
for (size_t j = 0; j < out_vec.size(); ++j) {
VLOG(3) << "out_vec[" << j << "]:" << out_vec[j];
}
ctx.Wait();
// out1
std::vector<T> out1_vec;
f::TensorToVector(*out1, ctx, &out1_vec);
EXPECT_EQ(out1_vec.size(), static_cast<size_t>(4));
for (size_t j = 0; j < out1_vec.size(); ++j) {
VLOG(3) << "out1_vec[" << j << "]:" << out1_vec[j];
}
ctx.Wait();
// out found_inf
Tensor found_inf_tensor;
found_inf_tensor.Resize({1});
bool *is_finite_data =
found_inf_tensor.mutable_data<bool>(paddle::platform::CPUPlace());
f::TensorCopy(*found_inf, place, &found_inf_tensor);
EXPECT_FALSE(*is_finite_data);
ctx.Wait();
}
TEST(check_finite_and_unscale, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx);
}
TEST(check_finite_and_unscale, NPU_fp16) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<p::float16>(&scope, ctx);
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/amp/update_loss_scaling_op.h"
#include <cmath>
#include <vector>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
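// Update applies the dynamic loss-scaling policy on the NPU:
// - if an inf/nan was found, zero the good-step counter and increment the
//   bad-step counter; once it reaches decr_every_n_nan_or_inf, multiply the
//   loss scale by decr_ratio (clamping the result to be at least 1) and
//   reset the bad-step counter;
// - otherwise zero the bad-step counter and increment the good-step counter;
//   once it reaches incr_every_n_steps, multiply the loss scale by incr_ratio
//   (keeping the previous scale if the result overflows) and reset the
//   good-step counter.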
template <typename T>
void Update(const platform::NPUDeviceContext& ctx,
const std::vector<bool> found_inf_vec,
const Tensor* pre_loss_scaling_tensor, const Tensor* good_in_tensor,
const Tensor* bad_in_tensor, const int incr_every_n_steps,
const int decr_every_n_nan_or_inf, const float incr_ratio,
const float decr_ratio, Tensor* updated_loss_scaling_tensor,
Tensor* good_out_tensor, Tensor* bad_out_tensor) {
auto place = ctx.GetPlace();
auto stream = ctx.stream();
if (found_inf_vec[0]) {
// good_out_data = 0
auto g = good_out_tensor->mutable_data<int>(place);
platform::NPUMemsetAsync(static_cast<void*>(g), 0,
good_out_tensor->numel() * sizeof(int), stream);
// bad_out_data = bad_in_data + 1
Tensor factor_tensor(bad_out_tensor->type());
factor_tensor.mutable_data<int>({1}, place);
TensorFromVector(std::vector<int>{1}, ctx, &factor_tensor);
auto runner_p2 = NpuOpRunner("Add", {*bad_in_tensor, factor_tensor},
{*bad_out_tensor}, {});
runner_p2.Run(stream);
std::vector<int> bad_out_data;
TensorToVector(*bad_out_tensor, ctx, &bad_out_data);
if (bad_out_data[0] == decr_every_n_nan_or_inf) {
auto runner_p3 = NpuOpRunner("Power", {*pre_loss_scaling_tensor},
{*updated_loss_scaling_tensor},
{{"power", static_cast<float>(1)},
{"scale", decr_ratio},
{"shift", static_cast<float>(0)}});
runner_p3.Run(stream);
std::vector<T> new_loss_scaling;
TensorToVector(*updated_loss_scaling_tensor, ctx, &new_loss_scaling);
if (new_loss_scaling[0] < static_cast<T>(1)) {
// updated_loss_scaling_data = 1
auto runner_p4 = NpuOpRunner("Power", {*pre_loss_scaling_tensor},
{*updated_loss_scaling_tensor},
{{"power", static_cast<float>(1)},
{"scale", static_cast<float>(0)},
{"shift", static_cast<float>(1)}});
runner_p4.Run(stream);
}
// bad_out_data = 0
auto b = bad_out_tensor->mutable_data<int>(place);
platform::NPUMemsetAsync(static_cast<void*>(b), 0,
bad_out_tensor->numel() * sizeof(int), stream);
}
} else {
// bad_out_data = 0
auto b = bad_out_tensor->mutable_data<int>(place);
platform::NPUMemsetAsync(static_cast<void*>(b), 0,
bad_out_tensor->numel() * sizeof(int), stream);
// good_out_data = good_in_data + 1
Tensor factor_tensor(good_out_tensor->type());
factor_tensor.mutable_data<int>({1}, place);
TensorFromVector(std::vector<int>{1}, ctx, &factor_tensor);
auto runner_p2 = NpuOpRunner("Add", {*good_in_tensor, factor_tensor},
{*good_out_tensor}, {});
runner_p2.Run(stream);
std::vector<int> good_out_data;
TensorToVector(*good_out_tensor, ctx, &good_out_data);
if (good_out_data[0] == incr_every_n_steps) {
auto runner_p3 = NpuOpRunner("Power", {*pre_loss_scaling_tensor},
{*updated_loss_scaling_tensor},
{{"power", static_cast<float>(1)},
{"scale", incr_ratio},
{"shift", static_cast<float>(0)}});
runner_p3.Run(stream);
std::vector<T> new_loss_scaling;
TensorToVector(*updated_loss_scaling_tensor, ctx, &new_loss_scaling);
if (!std::isfinite(new_loss_scaling[0])) {
// updated_loss_scaling_data = pre_loss_scaling_data
auto runner_p4 = NpuOpRunner("Power", {*pre_loss_scaling_tensor},
{*updated_loss_scaling_tensor},
{{"power", static_cast<float>(1)},
{"scale", static_cast<float>(1)},
{"shift", static_cast<float>(0)}});
runner_p4.Run(stream);
}
// good_out_data = 0
auto g = good_out_tensor->mutable_data<int>(place);
platform::NPUMemsetAsync(static_cast<void*>(g), 0,
good_out_tensor->numel() * sizeof(int), stream);
}
}
}
template <typename T>
class UpdateLossScalingFunctor<platform::NPUDeviceContext, T> {
public:
void operator()(const platform::NPUDeviceContext& dev_ctx,
const std::vector<bool> found_inf_vec,
const Tensor* pre_loss_scaling_tensor,
const Tensor* good_in_tensor, const Tensor* bad_in_tensor,
const int incr_every_n_steps,
const int decr_every_n_nan_or_inf, const float incr_ratio,
const float decr_ratio, Tensor* updated_loss_scaling_tensor,
Tensor* good_out_tensor, Tensor* bad_out_tensor) const {
Update<T>(dev_ctx, found_inf_vec, pre_loss_scaling_tensor, good_in_tensor,
bad_in_tensor, incr_every_n_steps, decr_every_n_nan_or_inf,
incr_ratio, decr_ratio, updated_loss_scaling_tensor,
good_out_tensor, bad_out_tensor);
}
};
template <typename T>
class LazyZerosNPU {
public:
void operator()(const platform::NPUDeviceContext& dev_ctx,
const std::vector<bool> found_inf_vec,
const std::vector<const framework::Tensor*>& xs,
const std::vector<framework::Tensor*>& outs) const {
for (size_t i = 0; i < xs.size(); ++i) {
auto* out = outs[i];
if (found_inf_vec[0]) {
VLOG(4) << "-- UpdateLossScaling: Find infinite grads. --";
auto place = dev_ctx.GetPlace();
auto stream = dev_ctx.stream();
auto g = out->mutable_data<T>(place);
platform::NPUMemsetAsync(static_cast<void*>(g), 0,
out->numel() * sizeof(T), stream);
}
}
}
};
template <typename DeviceContext, typename T>
class UpdateLossScalingNPUKernel : public framework::OpKernel<T> {
using MPDType = typename details::MPTypeTrait<T>::Type;
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto& dev_ctx = ctx.template device_context<DeviceContext>();
const auto xs = ctx.MultiInput<framework::Tensor>("X");
auto outs = ctx.MultiOutput<framework::Tensor>("Out");
const auto* found_inf = ctx.Input<Tensor>("FoundInfinite");
PADDLE_ENFORCE_EQ(found_inf->numel(), 1,
platform::errors::InvalidArgument(
"FoundInfinite must has only one element."));
std::vector<bool> found_inf_vec;
TensorToVector(*found_inf, ctx.device_context(), &found_inf_vec);
LazyZerosNPU<T>{}(dev_ctx, found_inf_vec, xs, outs);
const bool stop_update = ctx.Attr<bool>("stop_update");
if (stop_update) {
return;
}
const auto* pre_loss_scaling = ctx.Input<Tensor>("PrevLossScaling");
const auto* good_in = ctx.Input<Tensor>("InGoodSteps");
const auto* bad_in = ctx.Input<Tensor>("InBadSteps");
auto* updated_loss_scaling = ctx.Output<Tensor>("LossScaling");
auto* good_out = ctx.Output<Tensor>("OutGoodSteps");
auto* bad_out = ctx.Output<Tensor>("OutBadSteps");
updated_loss_scaling->mutable_data<MPDType>(dev_ctx.GetPlace());
good_out->mutable_data<int>(dev_ctx.GetPlace());
bad_out->mutable_data<int>(dev_ctx.GetPlace());
const int incr_every_n_steps = ctx.Attr<int>("incr_every_n_steps");
const int decr_every_n_nan_or_inf =
ctx.Attr<int>("decr_every_n_nan_or_inf");
const float incr_ratio = ctx.Attr<float>("incr_ratio");
const float decr_ratio = ctx.Attr<float>("decr_ratio");
UpdateLossScalingFunctor<DeviceContext, MPDType>{}(
dev_ctx, found_inf_vec, pre_loss_scaling, good_in, bad_in,
incr_every_n_steps, decr_every_n_nan_or_inf, incr_ratio, decr_ratio,
updated_loss_scaling, good_out, bad_out);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
update_loss_scaling,
ops::UpdateLossScalingNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::UpdateLossScalingNPUKernel<paddle::platform::NPUDeviceContext,
double>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <string>
#include "paddle/fluid/operators/assign_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/platform/float16.h"
namespace paddle {
namespace framework {
class OpDesc;
class Variable;
} // namespace framework
namespace imperative {
class OpBase;
} // namespace imperative
namespace platform {
struct CPUPlace;
struct CUDAPlace;
struct float16;
} // namespace platform
} // namespace paddle
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class AssignNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::LoDTensor>("X");
auto* out = ctx.Output<framework::LoDTensor>("Out");
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("Assign", {*out, *x}, {*out}, {});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(
assign, ops::AssignNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::AssignNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::AssignNPUKernel<paddle::platform::NPUDeviceContext, double>)
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(assign);
USE_OP_DEVICE_KERNEL(assign, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx,
std::string op_type) {
// init
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
std::vector<T> init;
init.push_back(static_cast<T>(1.0));
init.push_back(static_cast<T>(2.0));
init.push_back(static_cast<T>(3.0));
init.push_back(static_cast<T>(4.0));
TensorFromVector(init, ctx, tensor_x);
tensor_x->Resize({4});
ctx.Wait();
auto place = ctx.GetPlace();
auto out = scope->Var("Out");
auto tensor_out = out->GetMutable<f::LoDTensor>();
auto op =
f::OpRegistry::CreateOp(op_type, {{"X", {"X"}}}, {{"Out", {"Out"}}}, {});
op->Run(*scope, place);
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
ctx.Wait();
EXPECT_EQ((uint32_t)out_vec.size(), (uint32_t)4);
EXPECT_EQ(out_vec[0], static_cast<T>(1.0));
EXPECT_EQ(out_vec[1], static_cast<T>(2.0));
EXPECT_EQ(out_vec[2], static_cast<T>(3.0));
EXPECT_EQ(out_vec[3], static_cast<T>(4.0));
}
TEST(assign, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx, "assign");
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
#include "paddle/fluid/operators/cast_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
static std::map<framework::proto::VarType::Type, aclDataType>
DTYPE_2_ACL_DTYPE = {
{framework::proto::VarType::BOOL, ACL_BOOL},
{framework::proto::VarType::INT16, ACL_INT16},
{framework::proto::VarType::INT32, ACL_INT32},
{framework::proto::VarType::INT64, ACL_INT64},
{framework::proto::VarType::FP16, ACL_FLOAT16},
{framework::proto::VarType::FP32, ACL_FLOAT},
{framework::proto::VarType::FP64, ACL_DOUBLE},
};
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class CastNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
int dtype = ctx.Attr<int>("out_dtype");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
auto iter = DTYPE_2_ACL_DTYPE.find(
static_cast<framework::proto::VarType::Type>(dtype));
int aclDtype = iter->second;
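// The lookup assumes out_dtype is one of the types enumerated in
// DTYPE_2_ACL_DTYPE above; other dtypes are not handled by this kernel.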
if (dtype == framework::proto::VarType::FP32) {
out->mutable_data<float>(place);
} else if (dtype == framework::proto::VarType::FP16) {
out->mutable_data<paddle::platform::float16>(place);
} else if (dtype == framework::proto::VarType::INT16) {
out->mutable_data<int16_t>(place);
} else if (dtype == framework::proto::VarType::INT32) {
out->mutable_data<int32_t>(place);
} else if (dtype == framework::proto::VarType::INT64) {
out->mutable_data<int64_t>(place);
} else if (dtype == framework::proto::VarType::FP64) {
out->mutable_data<double>(place);
} else if (dtype == framework::proto::VarType::BOOL) {
out->mutable_data<bool>(place);
}
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Cast", {*x}, {*out},
{{"dst_type", static_cast<int32_t>(aclDtype)}});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
cast, ops::CastNPUKernel<paddle::platform::NPUDeviceContext, int16_t>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, int32_t>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, int64_t>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, bool>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
#endif
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/concat_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename T>
class ConcatNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto ins = ctx.MultiInput<framework::LoDTensor>("X");
framework::LoDTensor* out = ctx.Output<framework::LoDTensor>("Out");
PADDLE_ENFORCE_NOT_NULL(ins[0],
platform::errors::NotFound(
"The first input tensor is not initalized."));
auto axis = ctx.Attr<int>("axis");
if (ctx.HasInput("AxisTensor")) {
PADDLE_THROW(platform::errors::NotFound(
"The AxisTensor is not supported on NPU now."));
}
axis = ComputeAxis(static_cast<int64_t>(axis),
static_cast<int64_t>(ins[0]->dims().size()));
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
std::vector<framework::Tensor> inputs;
std::vector<std::string> names;
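// ConcatD takes a variable number of inputs, so collect the non-empty
// tensors and name each one x<i> after its input index for the runner to
// bind by name.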
for (size_t i = 0; i < ins.size(); ++i) {
if (ins[i] && ins[i]->numel() > 0) {
inputs.push_back(*ins[i]);
names.push_back("x" + std::to_string(i));
} else {
continue;
}
}
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner(
"ConcatD", {inputs}, {*out},
{{"concat_dim", axis}, {"N", static_cast<int>(inputs.size())}});
runner.AddInputNames(names);
runner.Run(stream);
}
};
template <typename T>
class ConcatGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* out_grad =
ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
auto ins = ctx.MultiInput<framework::LoDTensor>("X");
auto out_var_names = ctx.OutputNames(framework::GradVarName("X"));
auto outs =
ctx.MultiOutput<framework::LoDTensor>(framework::GradVarName("X"));
PADDLE_ENFORCE_NOT_NULL(ins[0],
platform::errors::NotFound(
"The first input tensor is not initalized."));
auto axis = ctx.Attr<int>("axis");
axis = ComputeAxis(static_cast<int64_t>(axis),
static_cast<int64_t>(ins[0]->dims().size()));
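// Each input's gradient is the slice of out_grad that starts at the running
// offset along the concat axis and spans that input's extent along it.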
int offset = 0;
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
for (size_t j = 0; j < outs.size(); ++j) {
// For stop gradient
// get output tensor that the name is not kEmptyVarName
if (out_var_names[j] != framework::kEmptyVarName &&
outs[j]->numel() != 0UL) {
outs[j]->mutable_data<T>(ctx.GetPlace());
std::vector<int> offsets;
std::vector<int> sizes;
for (int dim = 0; dim < ins[j]->dims().size(); ++dim) {
if (dim == axis) {
offsets.push_back(offset);
sizes.push_back(ins[j]->dims()[dim]);
} else {
offsets.push_back(0);
sizes.push_back(ins[j]->dims()[dim]);
}
}
auto runner = NpuOpRunner("SliceD", {*out_grad}, {*outs[j]},
{{"offsets", offsets}, {"size", sizes}});
runner.Run(stream);
}
if (ins[j]->numel() != 0UL) {
offset += ins[j]->dims()[axis];
}
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(concat, ops::ConcatNPUKernel<float>,
ops::ConcatNPUKernel<paddle::platform::float16>,
ops::ConcatNPUKernel<int>);
REGISTER_OP_NPU_KERNEL(concat_grad, ops::ConcatGradNPUKernel<float>,
ops::ConcatGradNPUKernel<paddle::platform::float16>,
ops::ConcatGradNPUKernel<int>);
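Not part of this PR, but for readers who want to exercise the concat kernel above, a minimal test sketch in the style of the other NPU unit tests in this patch set could look like the following; the helper name ConcatCompare, the chosen shapes and the expected element count are illustrative assumptions only.
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/tensor_util.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
USE_OP(concat);
USE_OP_DEVICE_KERNEL(concat, NPU);
template <typename T>
void ConcatCompare(f::Scope* scope, const p::DeviceContext& ctx) {
  // init: two [2, 3] tensors filled with 1 and 2 respectively
  auto tensor_x0 = scope->Var("X0")->GetMutable<f::LoDTensor>();
  auto tensor_x1 = scope->Var("X1")->GetMutable<f::LoDTensor>();
  TensorFromVector(std::vector<T>(6, static_cast<T>(1.0)), ctx, tensor_x0);
  tensor_x0->Resize(f::make_ddim({2, 3}));
  TensorFromVector(std::vector<T>(6, static_cast<T>(2.0)), ctx, tensor_x1);
  tensor_x1->Resize(f::make_ddim({2, 3}));
  auto tensor_out = scope->Var("Out")->GetMutable<f::LoDTensor>();
  ctx.Wait();
  // run concat along axis 0 -> expected output shape [4, 3]
  f::AttributeMap attrs = {{"axis", 0}};
  auto op = f::OpRegistry::CreateOp("concat", {{"X", {"X0", "X1"}}},
                                    {{"Out", {"Out"}}}, attrs);
  op->Run(*scope, ctx.GetPlace());
  ctx.Wait();
  std::vector<T> out_vec;
  TensorToVector(*tensor_out, ctx, &out_vec);
  EXPECT_EQ(out_vec.size(), static_cast<size_t>(12));
  // rows of X0 come first, then rows of X1
  EXPECT_EQ(out_vec[0], static_cast<T>(1.0));
  EXPECT_EQ(out_vec[11], static_cast<T>(2.0));
}
TEST(concat, NPU_fp32) {
  f::Scope scope;
  p::NPUDeviceContext ctx(p::NPUPlace(0));
  ConcatCompare<float>(&scope, ctx);
}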
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <algorithm>
#include <string>
#include <vector>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/op_version_registry.h"
#include "paddle/fluid/operators/controlflow/compare_op.h"
#include "paddle/fluid/operators/elementwise/elementwise_op_function.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#ifdef PADDLE_WITH_ASCEND_CL
namespace paddle {
namespace operators {
template <typename T>
class EqualNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::LoDTensor>("X");
auto* y = ctx.Input<framework::LoDTensor>("Y");
auto* out = ctx.Output<framework::LoDTensor>("Out");
out->mutable_data<bool>(ctx.GetPlace());
auto runner = NpuOpRunner("Equal", {*x, *y}, {*out}, {});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class LessThanNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::LoDTensor>("X");
auto* y = ctx.Input<framework::LoDTensor>("Y");
auto* z = ctx.Output<framework::LoDTensor>("Out");
// int axis = context.Attr<int>("axis");
z->mutable_data<bool>(ctx.GetPlace()); // allocate
auto runner = NpuOpRunner("Less", {*x, *y}, {*z});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(equal, ops::EqualNPUKernel<float>,
ops::EqualNPUKernel<plat::float16>,
ops::EqualNPUKernel<int>);
REGISTER_OP_NPU_KERNEL(
less_than,
ops::LessThanNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::LessThanNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
#endif
......@@ -78,6 +78,13 @@ class ConditionalOp : public framework::OperatorBase {
framework::TensorCopy(*ips[0], platform::CPUPlace(), &cpu_tensor);
platform::DeviceContextPool::Instance().Get(ips[0]->place())->Wait();
res = cpu_tensor.data<bool>()[0];
#endif
} else if (platform::is_npu_place(ips[0]->place())) {
#ifdef PADDLE_WITH_ASCEND_CL
framework::LoDTensor cpu_tensor;
framework::TensorCopy(*ips[0], platform::CPUPlace(), &cpu_tensor);
platform::DeviceContextPool::Instance().Get(ips[0]->place())->Wait();
res = cpu_tensor.data<bool>()[0];
#endif
} else {
res = ips[0]->data<bool>()[0];
......
......@@ -44,6 +44,11 @@ static void DataCopy(const framework::LoDTensor &src_item,
TensorCopySync(src_item, platform::CPUPlace(), dst_item);
}
#else
#ifdef PADDLE_WITH_ASCEND_CL
if (platform::is_npu_place(src_item.place())) {
platform::DeviceContextPool::Instance().Get(src_item.place())->Wait();
}
#endif
TensorCopySync(src_item, platform::CPUPlace(), dst_item);
#endif
} else {
......
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
#include "paddle/fluid/operators/controlflow/logical_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class LogicalNotNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("LogicalNot", {*x}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
logical_not,
ops::LogicalNotNPUKernel<paddle::platform::NPUDeviceContext, bool>);
#endif
......@@ -12,17 +12,18 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/elementwise/elementwise_add_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class ElementwiseAddNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
......@@ -39,12 +40,127 @@ class ElementwiseAddNPUKernel : public framework::OpKernel<T> {
}
};
template <typename T>
class ElementwiseAddGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// NOTE(zhiqiu): It seems the Ascend Add op follows the broadcast semantics
// with default axis=-1, so add_grad should do the reduce itself if needed.
// For example, the shape of each variable in elementwise_add:
// x, dx: [2, 3, 5]
// y, dy: [1, 5]
// out, dout: [2, 3, 5]
// Then, out = x + y => dx = dout, dy = dout
// And, the shape of dy can be computed by a two-stage reduce:
// 1. [2, 3, 5] => [3, 5], ReduceSumD on axis = 0, keep_dims = false.
// 2. [3, 5] => [1, 5], ReduceSumD on axis = 0, keep_dims = true.
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
// For dx
// stage 1
auto reduce_ndim = dout->dims().size() - dx->dims().size();
std::vector<int> axes;
for (auto i = 0; i < reduce_ndim; ++i) {
axes.push_back(i);
}
Tensor* tmp_dout = const_cast<Tensor*>(dout);
Tensor reduced_dout(dx->type());
if (axes.size() != 0) {
std::vector<int64_t> reduced_dout_dims;
for (auto i = reduce_ndim; i < dout->dims().size(); ++i) {
reduced_dout_dims.push_back(dout->dims()[i]);
}
reduced_dout.Resize(framework::make_ddim(reduced_dout_dims));
reduced_dout.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*dout}, {reduced_dout},
{{"axes", axes}, {"keep_dims", false}});
runner.Run(stream);
tmp_dout = &reduced_dout;
}
// stage 2
axes.clear();
for (auto i = 0; i < dx->dims().size(); ++i) {
if (dx->dims()[i] == 1) {
axes.push_back(i);
}
}
if (axes.size() != 0) {
auto runner = NpuOpRunner("ReduceSumD", {*tmp_dout}, {*dx},
{{"axes", axes}, {"keep_dims", true}});
runner.Run(stream);
} else {
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.Wait();
framework::TensorCopySync(*tmp_dout, ctx.GetPlace(), dx);
}
}
if (dy) {
// For dy
// stage 1
auto reduce_ndim = dout->dims().size() - dy->dims().size();
std::vector<int> axes;
for (auto i = 0; i < reduce_ndim; ++i) {
axes.push_back(i);
}
Tensor* tmp_dout = const_cast<Tensor*>(dout);
Tensor reduced_dout(dout->type());
if (axes.size() != 0) {
std::vector<int64_t> reduced_dout_dims;
for (auto i = reduce_ndim; i < dout->dims().size(); ++i) {
reduced_dout_dims.push_back(dout->dims()[i]);
}
reduced_dout.Resize(framework::make_ddim(reduced_dout_dims));
reduced_dout.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*dout}, {reduced_dout},
{{"axes", axes}, {"keep_dims", false}});
runner.Run(stream);
tmp_dout = &reduced_dout;
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.Wait();
}
// stage 2
axes.clear();
for (auto i = 0; i < dy->dims().size(); ++i) {
if (dy->dims()[i] == 1) {
axes.push_back(i);
}
}
if (axes.size() != 0) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*tmp_dout}, {*dy},
{{"axes", axes}, {"keep_dims", true}});
runner.Run(stream);
} else {
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.Wait();
framework::TensorCopySync(*tmp_dout, ctx.GetPlace(), dy);
}
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(elementwise_add, ops::ElementwiseAddNPUKernel<float>,
ops::ElementwiseAddNPUKernel<plat::float16>);
REGISTER_OP_NPU_KERNEL(elementwise_add_grad,
ops::ElementwiseAddGradNPUKernel<float>,
ops::ElementwiseAddGradNPUKernel<plat::float16>);
#endif
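The two-stage reduce described in the NOTE inside ElementwiseAddGradNPUKernel can be checked on the shapes from that comment. The following dependency-free sketch (plain C++, not part of the PR) only reproduces the axis-selection logic of the two stages:
#include <cstdio>
#include <vector>

int main() {
  // Shapes taken from the NOTE above: dout is [2, 3, 5], dy is [1, 5].
  std::vector<int> dout_dims = {2, 3, 5};
  std::vector<int> dy_dims = {1, 5};

  // Stage 1: reduce (keep_dims = false) the leading axes that dy lacks.
  int reduce_ndim = static_cast<int>(dout_dims.size()) -
                    static_cast<int>(dy_dims.size());
  std::vector<int> stage1_axes;
  for (int i = 0; i < reduce_ndim; ++i) stage1_axes.push_back(i);

  // Stage 2: reduce (keep_dims = true) the axes where dy is broadcast (== 1).
  std::vector<int> stage2_axes;
  for (int i = 0; i < static_cast<int>(dy_dims.size()); ++i) {
    if (dy_dims[i] == 1) stage2_axes.push_back(i);
  }

  // Here: stage 1 reduces axis {0} ([2, 3, 5] -> [3, 5]),
  //       stage 2 reduces axis {0} ([3, 5] -> [1, 5]).
  std::printf("stage 1 reduces %zu axis(es), stage 2 reduces %zu axis(es)\n",
              stage1_axes.size(), stage2_axes.size());
  return 0;
}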
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_div_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwiseDivNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Div", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class ElementwiseDivGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* out = ctx.Input<Tensor>("Out");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
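// y_power = y^(-1), i.e. 1 / y; dx below is formed as dout * (1 / y).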
Tensor y_power(y->type());
y_power.mutable_data<T>(y->dims(), place);
auto y_power_runner = NpuOpRunner("Power", {*y}, {y_power},
{{"power", static_cast<float>(-1)}});
y_power_runner.Run(stream);
if (dx) {
dx->mutable_data<T>(place);
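// Build a float mask that is 1 where x != 0 and 0 where x == 0
// (ZerosLike -> Equal -> LogicalNot -> Cast), then dx = dout * mask / y.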
Tensor tensor_zeros(x->type());
tensor_zeros.mutable_data<T>(x->dims(), place);
auto tensor_zeros_runner =
NpuOpRunner("ZerosLike", {*x}, {tensor_zeros}, {});
tensor_zeros_runner.Run(stream);
Tensor x_zero(paddle::framework::proto::VarType::BOOL);
x_zero.mutable_data<bool>(x->dims(), place);
auto x_zero_runner =
NpuOpRunner("Equal", {*x, tensor_zeros}, {x_zero}, {});
x_zero_runner.Run(stream);
Tensor x_nozero(paddle::framework::proto::VarType::BOOL);
x_nozero.mutable_data<bool>(x->dims(), place);
auto x_nozero_runner =
NpuOpRunner("LogicalNot", {x_zero}, {x_nozero}, {});
x_nozero_runner.Run(stream);
Tensor x_nozero_f(x->type());
x_nozero_f.mutable_data<T>(x->dims(), place);
auto x_nozero_f_runner =
NpuOpRunner("Cast", {x_nozero}, {x_nozero_f},
{{"dst_type", static_cast<int32_t>(0)}});
x_nozero_f_runner.Run(stream);
Tensor x_grad_w(x->type());
x_grad_w.mutable_data<T>(x->dims(), place);
auto x_grad_w_runner =
NpuOpRunner("Mul", {x_nozero_f, y_power}, {x_grad_w}, {});
x_grad_w_runner.Run(stream);
auto x_grad_runner = NpuOpRunner("Mul", {x_grad_w, *dout}, {*dx}, {});
x_grad_runner.Run(stream);
}
if (dy) {
dy->mutable_data<T>(place);
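// dy = dout * (-out / y); since out = x / y, this equals -dout * x / y^2.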
Tensor neg_out(y->type());
neg_out.mutable_data<T>(y->dims(), place);
auto neg_out_runner = NpuOpRunner("Neg", {*out}, {neg_out}, {});
neg_out_runner.Run(stream);
Tensor y_grad_w(y->type());
y_grad_w.mutable_data<T>(y->dims(), place);
auto y_grad_w_runner = NpuOpRunner("Div", {neg_out, *y}, {y_grad_w}, {});
y_grad_w_runner.Run(stream);
auto y_grad_runner = NpuOpRunner("Mul", {y_grad_w, *dout}, {*dy}, {});
y_grad_runner.Run(stream);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_div,
ops::ElementwiseDivNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseDivNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
elementwise_div_grad,
ops::ElementwiseDivGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseDivGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_div_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class ElementwiseFloorDivNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
out->mutable_data<T>(ctx.GetPlace());
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("FloorDiv", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(elementwise_floordiv,
ops::ElementwiseFloorDivNPUKernel<int>,
ops::ElementwiseFloorDivNPUKernel<int64_t>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_max_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwiseMaxNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Maximum", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_max,
ops::ElementwiseMaxNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseMaxNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_min_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwiseMinNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Minimum", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_min,
ops::ElementwiseMinNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseMinNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_mul_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwiseMulNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Mul", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class ElementwiseMulGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
if (dx) {
dx->mutable_data<T>(place);
auto dx_runner = NpuOpRunner("Mul", {*dout, *y}, {*dx}, {});
dx_runner.Run(stream);
}
if (dy) {
dy->mutable_data<T>(place);
auto dy_runner = NpuOpRunner("Mul", {*x, *dout}, {*dy}, {});
dy_runner.Run(stream);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_mul,
ops::ElementwiseMulNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseMulNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
elementwise_mul_grad,
ops::ElementwiseMulGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseMulGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
#endif
......@@ -74,6 +74,7 @@ void Compare(f::Scope* scope, const p::DeviceContext& ctx,
{{"Out", {"Out"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
......@@ -131,6 +132,7 @@ void CompareGrad(f::Scope* scope, const p::DeviceContext& ctx,
auto place = ctx.GetPlace();
op->Run(*scope, place);
ctx.Wait();
std::vector<T> dx_vec;
TensorToVector(*tensor_dx, ctx, &dx_vec);
......@@ -179,3 +181,9 @@ TEST(elementwise_sub_grad, NPU) {
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx, "elementwise_sub_grad");
}
TEST(elementwise_add_grad, NPU) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx, "elementwise_add_grad");
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_pow_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwisePowNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Pow", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_pow,
ops::ElementwisePowNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwisePowNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
......@@ -24,7 +23,7 @@ namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class ElementwiseSubNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
......@@ -43,7 +42,7 @@ class ElementwiseSubNPUKernel : public framework::OpKernel<T> {
}
};
template <typename T>
class ElementwiseSubGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
......@@ -51,8 +50,9 @@ class ElementwiseSubGradNPUKernel : public framework::OpKernel<T> {
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// NOTE(zhiqiu): It seems Ascend Sub follows the broadcast semantics with
// default axis=-1?
......@@ -66,89 +66,92 @@ class ElementwiseSubGradNPUKernel : public framework::OpKernel<T> {
// 1. [2, 3, 5] => [3, 5], ReduceSumD on axis = 0, keep_dims = false.
// 2. [3, 5] => [1, 5], ReduceSumD on axis = 0, keep_dims = true.
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
// For dx
// stage 1
auto reduce_ndim = dout->dims().size() - dx->dims().size();
std::vector<int> axes;
for (auto i = 0; i < reduce_ndim; ++i) {
axes.push_back(i);
}
Tensor* tmp_dout = const_cast<Tensor*>(dout);
Tensor reduced_dout(dx->type());
if (axes.size() != 0) {
std::vector<int64_t> reduced_dout_dims;
for (auto i = reduce_ndim; i < dout->dims().size(); ++i) {
reduced_dout_dims.push_back(dout->dims()[i]);
}
reduced_dout.Resize(framework::make_ddim(reduced_dout_dims));
reduced_dout.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*dout}, {reduced_dout},
{{"axes", axes}, {"keep_dims", false}});
runner.Run(stream);
tmp_dout = &reduced_dout;
}
// stage 2
axes.clear();
for (auto i = 0; i < dx->dims().size(); ++i) {
if (dx->dims()[i] == 1) {
axes.push_back(i);
}
}
if (axes.size() != 0) {
auto runner = NpuOpRunner("ReduceSumD", {*tmp_dout}, {*dx},
{{"axes", axes}, {"keep_dims", true}});
runner.Run(stream);
} else {
framework::TensorCopySync(*tmp_dout, ctx.GetPlace(), dx);
}
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
// For dy
// stage 1
auto reduce_ndim = dout->dims().size() - dy->dims().size();
std::vector<int> axes;
for (auto i = 0; i < reduce_ndim; ++i) {
axes.push_back(i);
}
Tensor* tmp_dout = const_cast<Tensor*>(dout);
Tensor reduced_dy(dy->type());
Tensor reduced_dout(dy->type());
if (axes.size() != 0) {
std::vector<int64_t> reduced_dout_dims;
for (auto i = reduce_ndim; i < dout->dims().size(); ++i) {
reduced_dout_dims.push_back(dout->dims()[i]);
}
reduced_dout.Resize(framework::make_ddim(reduced_dout_dims));
reduced_dout.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*dout}, {reduced_dout},
{{"axes", axes}, {"keep_dims", false}});
runner.Run(stream);
tmp_dout = &reduced_dout;
}
// stage 2
axes.clear();
Tensor* tmp_dy = tmp_dout;
for (auto i = 0; i < dy->dims().size(); ++i) {
if (dy->dims()[i] == 1) {
axes.push_back(i);
}
}
if (axes.size() != 0) {
reduced_dy.Resize(dy->dims());
reduced_dy.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*tmp_dout}, {reduced_dy},
{{"axes", axes}, {"keep_dims", true}});
runner.Run(stream);
tmp_dy = &reduced_dy;
}
// stage 3, negative
auto runner = NpuOpRunner("Neg", {*tmp_dy}, {*dy}, {});
runner.Run(stream);
}
}
};
......@@ -156,16 +159,11 @@ class ElementwiseSubGradNPUKernel : public framework::OpKernel<T> {
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(elementwise_sub, ops::ElementwiseSubNPUKernel<float>,
ops::ElementwiseSubNPUKernel<plat::float16>);
REGISTER_OP_NPU_KERNEL(elementwise_sub_grad,
ops::ElementwiseSubGradNPUKernel<float>,
ops::ElementwiseSubGradNPUKernel<plat::float16>);
#endif
......@@ -64,6 +64,12 @@ inline std::vector<int> get_expand_times(
TensorCopySync(*expand_tensor, platform::CPUPlace(), &cpu_expand_tensor);
expand_data = cpu_expand_tensor.data<int>();
}
#ifdef PADDLE_WITH_ASCEND_CL
if (platform::is_npu_place(expand_tensor->place())) {
TensorCopySync(*expand_tensor, platform::CPUPlace(), &cpu_expand_tensor);
expand_data = cpu_expand_tensor.data<int>();
}
#endif
#ifdef PADDLE_WITH_XPU
if (platform::is_xpu_place(expand_tensor->place())) {
TensorCopySync(*expand_tensor, platform::CPUPlace(), &cpu_expand_tensor);
......
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <iostream>
#include <memory>
#include <string>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/operators/expand_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class ExpandNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto rank = context.Input<Tensor>("X")->dims().size();
PADDLE_ENFORCE_GE(
rank, 1,
platform::errors::InvalidArgument(
"The number of dimensions of the input 'x' for Op(expand) "
"must be greater than or equal to 1, but the value received is %d.",
rank));
PADDLE_ENFORCE_LE(
rank, MAX_RANK_SUPPORTED,
platform::errors::InvalidArgument(
"The number of dimensions of the input 'x' for Op(expand) "
"must be less than or equal to %d, but the value received is %d.",
MAX_RANK_SUPPORTED, rank));
switch (rank) { REP_EXPAND_TEMPLATE(MAX_RANK_SUPPORTED) }
}
protected:
template <int Rank>
void Expand(const framework::ExecutionContext& context) const {
auto* in0 = context.Input<framework::LoDTensor>("X");
auto in_dims = in0->dims();
auto expand_times = get_expand_times(context);
PADDLE_ENFORCE_EQ(
static_cast<size_t>(in_dims.size()), expand_times.size(),
platform::errors::InvalidArgument(
"The number of elements (%d) of 'expand_times' for "
"Op(expand) must be equal to the number "
"of dimensions (%d) of the input.",
expand_times.size(), static_cast<size_t>(in_dims.size())));
auto* out0 = context.Output<framework::LoDTensor>("Out");
framework::DDim out_dims(in_dims);
for (size_t i = 0; i < expand_times.size(); ++i) {
out_dims[i] *= expand_times[i];
}
out0->Resize(out_dims);
out0->mutable_data<T>(context.device_context().GetPlace());
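// TileD repeats the input along each axis according to the "multiples"
// attribute, which carries the expand_times of this op.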
auto runner =
NpuOpRunner("TileD", {*in0}, {*out0}, {{"multiples", expand_times}});
auto stream =
context.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
expand, ops::ExpandNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ExpandNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
#endif
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <iostream>
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(expand);
USE_OP_DEVICE_KERNEL(expand, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx) {
// init
auto in = scope->Var("X");
auto expand_times = scope->Var("ExpandTimes");
auto out = scope->Var("Out");
auto in_t = in->GetMutable<f::LoDTensor>();
auto out_t = out->GetMutable<f::LoDTensor>();
auto expand_times_t = expand_times->GetMutable<f::LoDTensor>();
auto place = ctx.GetPlace();
TensorFromVector(std::vector<T>(3 * 1 * 7, 1), ctx, in_t);
TensorFromVector(std::vector<int>({1, 10, 1}), ctx, expand_times_t);
in_t->Resize(f::make_ddim({3, 1, 7}));
expand_times_t->Resize(f::make_ddim({3}));
out_t->Resize(f::make_ddim({3, 10, 7}));
out_t->mutable_data<T>(place);
f::AttributeMap attrs = {{}};
auto op = f::OpRegistry::CreateOp(
"expand", {{"X", {"X"}}, {"ExpandTimes", {"ExpandTimes"}}},
{{"Out", {"Out"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
auto out_dim = out_t->dims();
EXPECT_EQ(out_dim.at(0), 3);
EXPECT_EQ(out_dim.at(1), 10);
EXPECT_EQ(out_dim.at(2), 7);
}
TEST(expand, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx);
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/fill_constant_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/operators/utils.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class FillConstantNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto data_type =
static_cast<framework::proto::VarType::Type>(ctx.Attr<int>("dtype"));
auto str_value = ctx.Attr<std::string>("str_value");
auto float_value = ctx.Attr<float>("value");
auto* out_var = ctx.Output<framework::Tensor>("Out");
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
T value;
if (str_value.empty()) {
value = static_cast<T>(float_value);
} else {
// handle NaN/Inf first, which cannot be read from stream.
if (str_value == "inf") {
value = static_cast<T>(std::numeric_limits<double>::infinity());
} else if (str_value == "-inf") {
value = static_cast<T>(-std::numeric_limits<double>::infinity());
} else if (str_value == "nan") {
value = static_cast<T>(std::numeric_limits<double>::quiet_NaN());
} else {
std::stringstream convert_stream(str_value);
if (std::is_same<int64_t, T>::value) {
int64_t tmp_value;
convert_stream >> tmp_value;
value = static_cast<T>(tmp_value);
} else {
double tmp_value;
convert_stream >> tmp_value;
value = static_cast<T>(tmp_value);
}
}
}
auto shape = GetShape(ctx);
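// Wrap the scalar in a one-element tensor; FillD then expands it to the
// requested shape given by the "dims" attribute.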
Tensor tensor_tmp(data_type);
tensor_tmp.mutable_data<T>({1}, ctx.GetPlace());
TensorFromVector(std::vector<T>{value}, ctx.device_context(), &tensor_tmp);
out_var->mutable_data<T>(shape, place);
auto runner = NpuOpRunner("FillD", {tensor_tmp}, {*out_var},
{{"dims", framework::vectorize(shape)}});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
fill_constant,
ops::FillConstantNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::FillConstantNPUKernel<paddle::platform::NPUDeviceContext, bool>,
ops::FillConstantNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::FillConstantNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/gather_op.h"
#include <memory>
#include <string>
#include <vector>
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/kron_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/platform/npu_info.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class GatherOpNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
auto *x = ctx.Input<Tensor>("X");
auto *index = ctx.Input<Tensor>("Index");
auto *out = ctx.Output<Tensor>("Out");
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("Gather", {*x, *index}, {*out},
{{"validate_indices", true}});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class GatherGradOpNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
auto *index = ctx.Input<Tensor>("Index");
auto *x = ctx.Input<Tensor>("X");
auto *dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto *dx = ctx.Output<Tensor>(framework::GradVarName("X"));
// step1: Unsqueeze index
framework::Tensor tmp_tensor(index->type());
const auto index_dims = index->dims();
if (index_dims.size() == 1) {
tmp_tensor.ShareDataWith(*index);
std::vector<int64_t> new_dim = {index_dims[0], 1};
tmp_tensor.Resize(framework::make_ddim(new_dim));
index = &tmp_tensor;
}
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// step2: ZerosLike x in device
Tensor zeroslike_xout(x->type());
zeroslike_xout.Resize(x->dims());
auto p = zeroslike_xout.mutable_data<T>(ctx.GetPlace());
platform::NPUMemsetAsync(static_cast<void *>(p), 0,
zeroslike_xout.numel() * sizeof(T), stream);
// step3: scatter(x_grad)
dx->mutable_data<T>(ctx.GetPlace());
auto runner_scatter = NpuOpRunner(
"TensorScatterUpdate", {zeroslike_xout, *index, *dout}, {*dx}, {});
runner_scatter.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
gather, ops::GatherOpNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::GatherOpNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::GatherOpNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
gather_grad,
ops::GatherGradOpNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::GatherGradOpNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::GatherGradOpNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/gather_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(gather);
USE_OP_DEVICE_KERNEL(gather, NPU);
USE_OP(gather_grad);
USE_OP_DEVICE_KERNEL(gather_grad, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx,
std::string op_type) {
// init
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
auto index = scope->Var("Index");
auto tensor_index = index->GetMutable<f::LoDTensor>();
std::vector<T> init_x;
for (int64_t i = 1; i < 7; ++i) {
// 1,2,3,4,5,6
init_x.push_back(static_cast<T>(i));
}
// [[1, 2],[3, 4],[5, 6]]
TensorFromVector(init_x, ctx, tensor_x);
tensor_x->Resize(paddle::framework::make_ddim({3, 2}));
std::vector<int> init_index = {1, 2};
paddle::framework::TensorFromVector<int>(init_index, ctx, tensor_index);
tensor_index->Resize(paddle::framework::make_ddim({2}));
ctx.Wait();
auto out = scope->Var("Out");
auto tensor_out = out->GetMutable<f::LoDTensor>();
// run
f::AttributeMap attrs = {{"validate_indices", true}};
auto op = f::OpRegistry::CreateOp(
op_type, {{"X", {"X"}}, {"Index", {"Index"}}}, {{"Out", {"Out"}}}, attrs);
auto place = ctx.GetPlace();
op->Run(*scope, place);
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
ctx.Wait();
// ref:https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/tensor/manipulation/gather_cn.html#gather
for (int i = 0; i < static_cast<int>(out_vec.size()); ++i) {
VLOG(3) << "out_vec[" << i << "] : " << out_vec[i];
}
uint32_t expected_size = 4;
EXPECT_EQ((uint32_t)out_vec.size(), expected_size);
// {3, 4, 5, 6}
std::vector<T> expected_out_vec;
for (int64_t i = 3; i < 7; ++i) {
expected_out_vec.push_back(static_cast<T>(i));
}
for (uint32_t i = 0; i < out_vec.size(); i++) {
EXPECT_EQ(out_vec[i], expected_out_vec[i]);
}
}
template <typename T>
void CompareGrad(f::Scope* scope, const p::DeviceContext& ctx,
std::string op_type) {
// init
auto index = scope->Var("Index");
auto tensor_index = index->GetMutable<f::LoDTensor>();
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
auto dout = scope->Var("DOut");
auto tensor_dout = dout->GetMutable<f::LoDTensor>();
std::vector<int> init_index = {0, 1};
paddle::framework::TensorFromVector<int>(init_index, ctx, tensor_index);
tensor_index->Resize(paddle::framework::make_ddim({2}));
std::vector<T> init_x = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0};
TensorFromVector(init_x, ctx, tensor_x);
tensor_x->Resize(paddle::framework::make_ddim({3, 2}));
std::vector<T> init_dout = {5.0, 10.0, 2.0, 3.0};
TensorFromVector(init_dout, ctx, tensor_dout);
tensor_dout->Resize(paddle::framework::make_ddim({2, 2}));
ctx.Wait();
auto dx = scope->Var("DX");
auto tensor_dx = dx->GetMutable<f::LoDTensor>();
// run
f::AttributeMap attrs;
auto op = f::OpRegistry::CreateOp(
op_type, {{"X", {"X"}}, {"Index", {"Index"}}, {"Out@GRAD", {"DOut"}}},
{{"X@GRAD", {"DX"}}}, attrs);
auto place = ctx.GetPlace();
op->Run(*scope, place);
std::vector<T> dx_vec;
TensorToVector(*tensor_dx, ctx, &dx_vec);
ctx.Wait();
uint32_t expected_size = 3 * 2;
EXPECT_EQ((uint32_t)dx_vec.size(), expected_size);
std::vector<T> expected_dx_vec = {5.0, 10.0, 2.0, 3.0, 0.0, 0.0};
for (uint32_t i = 0; i < dx_vec.size(); i++) {
VLOG(3) << "dx_vec[i]=" << dx_vec[i];
EXPECT_EQ(dx_vec[i], expected_dx_vec[i]);
}
}
TEST(gather, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx, "gather");
}
TEST(gather, NPU_fp16) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<p::float16>(&scope, ctx, "gather");
}
TEST(gather_grad, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx, "gather_grad");
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/gelu_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class GeluNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Gelu", {*x}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class GeluGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto place = ctx.GetPlace();
dx->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
Tensor out(x->type());
out.mutable_data<T>(x->dims(), place);
auto out_runner = NpuOpRunner("Gelu", {*x}, {out}, {});
out_runner.Run(stream);
auto dx_runner = NpuOpRunner("GeluGrad", {*dout, *x, out}, {*dx}, {});
dx_runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
gelu, ops::GeluNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::GeluNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
gelu_grad,
ops::GeluGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::GeluGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(gelu);
USE_OP_DEVICE_KERNEL(gelu, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx) {
// init
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
std::vector<T> init_x;
for (int64_t i = 0; i < 10 * 10; ++i) {
init_x.push_back(static_cast<T>(1.0));
}
TensorFromVector(init_x, ctx, tensor_x);
tensor_x->Resize({10, 10});
auto out = scope->Var("Out");
auto tensor_out = out->GetMutable<f::LoDTensor>();
f::AttributeMap attrs;
ctx.Wait();
// run
auto place = ctx.GetPlace();
auto op = f::OpRegistry::CreateOp("gelu", {{"X", {"X"}}}, {{"Out", {"Out"}}},
attrs);
op->Run(*scope, place);
ctx.Wait();
// eval time
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < 100; i++) {
op->Run(*scope, place);
}
ctx.Wait();
gettimeofday(&end, NULL);
int micros =
(((end.tv_sec - start.tv_sec) * 1000000) + end.tv_usec) - (start.tv_usec);
printf("used time: %d\n", micros / 100);
// eval value
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
float expected = 0.841192;
for (uint32_t i = 0; i < out_vec.size(); i++) {
EXPECT_FLOAT_EQ(out_vec[i], static_cast<T>(expected));
}
}
template <typename T>
void CompareGrad(f::Scope* scope, const p::DeviceContext& ctx) {
auto dout = scope->Var("DOut");
auto tensor_dout = dout->GetMutable<f::LoDTensor>();
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
std::vector<T> init_dout;
for (int64_t i = 0; i < 10 * 10; ++i) {
init_dout.push_back(static_cast<T>(1.0));
}
std::vector<T> init_x;
for (int64_t i = 0; i < 10 * 10; ++i) {
init_x.push_back(static_cast<T>(1.0));
}
TensorFromVector(init_dout, ctx, tensor_dout);
tensor_dout->Resize({10, 10});
TensorFromVector(init_x, ctx, tensor_x);
tensor_x->Resize({10, 10});
auto dx = scope->Var("DX");
auto tensor_dx = dx->GetMutable<f::LoDTensor>();
f::AttributeMap attrs;
ctx.Wait();
// run
auto place = ctx.GetPlace();
auto op = f::OpRegistry::CreateOp("gelu_grad",
{{"Out@GRAD", {"DOut"}}, {"X", {"X"}}},
{{"X@GRAD", {"DX"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
// eval time
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < 100; i++) {
op->Run(*scope, place);
}
ctx.Wait();
gettimeofday(&end, NULL);
int micros =
(((end.tv_sec - start.tv_sec) * 1000000) + end.tv_usec) - (start.tv_usec);
printf("used time: %d\n", micros / 100);
// eval value
std::vector<T> dx_vec;
TensorToVector(*tensor_dx, ctx, &dx_vec);
float expected = 1.082964;
for (uint32_t i = 0; i < dx_vec.size(); i++) {
EXPECT_FLOAT_EQ(dx_vec[i], static_cast<T>(expected));
}
}
TEST(gelu, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx);
}
TEST(gelu_grad, NPU) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx);
}
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "paddle/fluid/operators/increment_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/platform/float16.h"
namespace paddle {
namespace framework {
class OpDesc;
class Variable;
} // namespace framework
namespace imperative {
class OpBase;
} // namespace imperative
} // namespace paddle
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class IncrementalNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto* x_tensor = context.Input<framework::Tensor>("X");
auto* out_tensor = context.Output<framework::Tensor>("Out");
float step = context.Attr<float>("step");
out_tensor->mutable_data<T>(context.GetPlace());
Tensor step_tensor(x_tensor->type());
std::vector<T> step_vec;
step_vec.push_back(static_cast<T>(step));
framework::TensorFromVector(step_vec, context.device_context(),
&step_tensor);
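// Out = X + step: the step scalar is wrapped in a one-element tensor and
// added to X with the Ascend Add op.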
auto runner =
NpuOpRunner("Add", {*x_tensor, step_tensor}, {*out_tensor}, {});
auto stream =
context.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace plat = paddle::platform;
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
increment,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext, int64_t>,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext,
plat::float16>)
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(increment);
USE_OP_DEVICE_KERNEL(increment, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx,
std::string op_type) {
// init
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
std::vector<T> init;
init.push_back(static_cast<T>(1.0));
TensorFromVector(init, ctx, tensor_x);
tensor_x->Resize({1});
ctx.Wait();
auto place = ctx.GetPlace();
auto out = scope->Var("Out");
auto tensor_out = out->GetMutable<f::LoDTensor>();
f::AttributeMap attr_input = {{"step", static_cast<float>(2.0)}};
auto op = f::OpRegistry::CreateOp("increment", {{"X", {"X"}}},
{{"Out", {"Out"}}}, attr_input);
op->Run(*scope, place);
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
ctx.Wait();
EXPECT_EQ((uint32_t)out_vec.size(), (uint32_t)1);
EXPECT_EQ(out_vec[0], static_cast<T>(3.0));
}
TEST(increment, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx, "increment");
}
TEST(increment, NPU_fp64) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<double>(&scope, ctx, "increment");
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/layer_norm_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
using DDim = framework::DDim;
using DataLayout = framework::DataLayout;
template <typename T>
class NormDataType;
template <>
class NormDataType<platform::float16> {
public:
// The scaling param type is float for HALF and FLOAT tensors
using ScalingParamType = const float;
using BatchNormParamType = float;
};
template <>
class NormDataType<float> {
public:
using ScalingParamType = const float;
using BatchNormParamType = float;
};
template <typename T>
using NormDataType = NormDataType<T>;
template <typename T>
using LayerNormParamType = typename NormDataType<T>::BatchNormParamType;
template <typename T>
class LayerNormNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
using U = LayerNormParamType<T>;
const auto begin_norm_axis = ctx.Attr<int>("begin_norm_axis");
const auto epsilon = ctx.Attr<float>("epsilon");
const auto* x = ctx.Input<Tensor>("X");
const auto* scale = ctx.Input<Tensor>("Scale");
const auto* bias = ctx.Input<Tensor>("Bias");
auto* y = ctx.Output<Tensor>("Y");
auto* mean = ctx.Output<Tensor>("Mean");
auto* variance = ctx.Output<Tensor>("Variance");
const auto& x_dims = x->dims();
std::vector<int> axes;
auto matrix_dim = framework::flatten_to_2d(x_dims, begin_norm_axis);
int right = static_cast<int>(matrix_dim[1]);
// The shape of scale and bias should be equal to x.shape[begin_norm_axis:],
// as required by Ascend.
for (auto i = begin_norm_axis; i < x_dims.size(); ++i) {
axes.push_back(x_dims[i]);
}
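// 'axes' now holds x.shape[begin_norm_axis:]; it is used as the FillD output
// shape for the default scale/bias and as their temporary shape below.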
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
Tensor default_scale(x->type());
if (!scale) {
default_scale.mutable_data<T>(framework::make_ddim(axes), place);
Tensor value(x->type());
value.mutable_data<T>({1}, place);
TensorFromVector(std::vector<T>{static_cast<T>(1.0)},
ctx.device_context(), &value);
auto runner =
NpuOpRunner("FillD", {value}, {default_scale}, {{"dims", axes}});
runner.Run(stream);
scale = &default_scale;
} else {
const_cast<Tensor*>(scale)->Resize(framework::make_ddim(axes));
}
Tensor default_bias(x->type());
if (!bias) {
default_bias.mutable_data<T>(framework::make_ddim(axes), place);
Tensor value(x->type());
value.mutable_data<T>({1}, place);
TensorFromVector(std::vector<T>{static_cast<T>(0)}, ctx.device_context(),
&value);
auto runner =
NpuOpRunner("FillD", {value}, {default_bias}, {{"dims", axes}});
runner.Run(stream);
bias = &default_bias;
} else {
const_cast<Tensor*>(bias)->Resize(framework::make_ddim(axes));
}
// cast scale from LayerNormParamType to T if needed
Tensor cast_scale(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
scale->type() == framework::proto::VarType::FP32) {
cast_scale.Resize(scale->dims());
cast_scale.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_scale =
NpuOpRunner("Cast", {*scale}, {cast_scale},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_scale.Run(stream);
} else {
cast_scale.ShareDataWith(*scale);
}
// cast bias from LayerNormParamType to T if needed
Tensor cast_bias(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
bias->type() == framework::proto::VarType::FP32) {
cast_bias.Resize(bias->dims());
cast_bias.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_bias =
NpuOpRunner("Cast", {*bias}, {cast_bias},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_bias.Run(stream);
} else {
cast_bias.ShareDataWith(*bias);
}
y->mutable_data<T>(ctx.GetPlace());
// mean should be of U type
Tensor* tmp_mean = mean;
Tensor cast_mean(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
(scale->type() == framework::proto::VarType::FP32 ||
bias->type() == framework::proto::VarType::FP32)) {
cast_mean.Resize(mean->dims());
cast_mean.mutable_data<T>(ctx.GetPlace());
tmp_mean = &cast_mean;
mean->mutable_data<U>(ctx.GetPlace());
} else {
mean->mutable_data<T>(ctx.GetPlace());
}
// same for variance
Tensor* tmp_variance = variance;
Tensor cast_variance(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
(scale->type() == framework::proto::VarType::FP32 ||
bias->type() == framework::proto::VarType::FP32)) {
cast_variance.Resize(variance->dims());
cast_variance.mutable_data<T>(ctx.GetPlace());
tmp_variance = &cast_variance;
variance->mutable_data<U>(ctx.GetPlace());
} else {
variance->mutable_data<T>(ctx.GetPlace());
}
auto runner = NpuOpRunner("LayerNorm", {*x, cast_scale, cast_bias},
{*y, *tmp_mean, *tmp_variance},
{{"begin_norm_axis", begin_norm_axis},
{"begin_params_axis", begin_norm_axis},
{"epsilon", epsilon}});
runner.Run(stream);
// cast back from FP16 to FP32
if (x->type() == framework::proto::VarType::FP16 &&
mean->type() == framework::proto::VarType::FP32) {
auto dst_dtype = ConvertToNpuDtype(mean->type());
auto runner_cast_mean =
NpuOpRunner("Cast", {*tmp_mean}, {*mean},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_mean.Run(stream);
}
// same for variance
if (x->type() == framework::proto::VarType::FP16 &&
variance->type() == framework::proto::VarType::FP32) {
auto dst_dtype = ConvertToNpuDtype(variance->type());
auto runner_cast_variance =
NpuOpRunner("Cast", {*tmp_variance}, {*variance},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_variance.Run(stream);
}
// revert shape of scale and bias
// TODO(zhiqiu): better implementation, use a temporary tensor to avoid
// writing to the input tensors.
const_cast<Tensor*>(scale)->Resize(framework::make_ddim({right}));
const_cast<Tensor*>(bias)->Resize(framework::make_ddim({right}));
}
};
template <typename T>
class LayerNormGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
using U = LayerNormParamType<T>;
const auto begin_norm_axis = ctx.Attr<int>("begin_norm_axis");
const auto* x = ctx.Input<Tensor>("X");
const auto& x_dims = x->dims();
const auto* mean = ctx.Input<Tensor>("Mean");
const auto* variance = ctx.Input<Tensor>("Variance");
const auto* scale = ctx.Input<Tensor>("Scale");
const auto* dy = ctx.Input<Tensor>(framework::GradVarName("Y"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dscale = ctx.Output<Tensor>(framework::GradVarName("Scale"));
auto* dbias = ctx.Output<Tensor>(framework::GradVarName("Bias"));
auto matrix_dim = framework::flatten_to_2d(x_dims, begin_norm_axis);
int right = static_cast<int>(matrix_dim[1]);
std::vector<int> axes;
for (auto i = begin_norm_axis; i < x_dims.size(); ++i) {
axes.push_back(x_dims[i]);
}
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// No gradient to compute, just return.
if (!dx && !dscale && !dbias) {
return;
}
// The rank of mean should be equal to that of x, as required by Ascend.
std::vector<int> new_shape;
for (auto i = 0; i < begin_norm_axis; ++i) {
new_shape.push_back(x_dims[i]);
}
for (auto i = begin_norm_axis; i < x_dims.size(); ++i) {
new_shape.push_back(1);
}
auto mean_dims = mean->dims();
const_cast<Tensor*>(mean)->Resize(framework::make_ddim(new_shape));
const_cast<Tensor*>(variance)->Resize(framework::make_ddim(new_shape));
Tensor default_scale(x->type());
if (!scale) {
default_scale.mutable_data<T>(framework::make_ddim(axes), place);
Tensor value(x->type());
value.mutable_data<T>({1}, place);
TensorFromVector(std::vector<T>{static_cast<T>(1.0)},
ctx.device_context(), &value);
auto runner =
NpuOpRunner("FillD", {value}, {default_scale}, {{"dims", axes}});
runner.Run(stream);
scale = &default_scale;
} else {
const_cast<Tensor*>(scale)->Resize(framework::make_ddim(axes));
}
// cast scale from LayerNormParamType to T if needed
Tensor cast_scale(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
scale->type() == framework::proto::VarType::FP32) {
cast_scale.Resize(scale->dims());
cast_scale.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_scale =
NpuOpRunner("Cast", {*scale}, {cast_scale},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_scale.Run(stream);
} else {
cast_scale.ShareDataWith(*scale);
}
// cast mean from LayerNormParamType to T if needed
Tensor cast_mean(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
mean->type() == framework::proto::VarType::FP32) {
cast_mean.Resize(mean->dims());
cast_mean.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_mean =
NpuOpRunner("Cast", {*mean}, {cast_mean},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_mean.Run(stream);
} else {
cast_mean.ShareDataWith(*mean);
}
// cast variance from LayerNormParamType to T if needed
Tensor cast_variance(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
variance->type() == framework::proto::VarType::FP32) {
cast_variance.Resize(variance->dims());
cast_variance.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_variance =
NpuOpRunner("Cast", {*variance}, {cast_variance},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_variance.Run(stream);
} else {
cast_variance.ShareDataWith(*variance);
}
Tensor dx_(dy->type()), dscale_(dy->type()), dbias_(dy->type());
dx = (dx == nullptr) ? &dx_ : dx;
dscale = (dscale == nullptr) ? &dscale_ : dscale;
dbias = (dbias == nullptr) ? &dbias_ : dbias;
dx->Resize(x->dims());
dx->mutable_data<T>(ctx.GetPlace());
dscale->Resize(framework::make_ddim(axes));
dbias->Resize(framework::make_ddim(axes));
// dscale should be of U type
Tensor* tmp_dscale = dscale;
Tensor cast_dscale(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
(mean->type() == framework::proto::VarType::FP32 ||
variance->type() == framework::proto::VarType::FP32)) {
cast_dscale.Resize(dscale->dims());
cast_dscale.mutable_data<T>(ctx.GetPlace());
tmp_dscale = &cast_dscale;
dscale->mutable_data<U>(ctx.GetPlace());
} else {
dscale->mutable_data<T>(ctx.GetPlace());
}
// same for dbias
Tensor* tmp_dbias = dbias;
Tensor cast_dbias(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
(mean->type() == framework::proto::VarType::FP32 ||
variance->type() == framework::proto::VarType::FP32)) {
cast_dbias.Resize(dbias->dims());
cast_dbias.mutable_data<T>(ctx.GetPlace());
tmp_dbias = &cast_dbias;
dbias->mutable_data<U>(ctx.GetPlace());
} else {
dbias->mutable_data<T>(ctx.GetPlace());
}
auto runner = NpuOpRunner("LayerNormGrad",
{*dy, *x, cast_variance, cast_mean, cast_scale},
{*dx, *tmp_dscale, *tmp_dbias}, {});
runner.Run(stream);
// cast back from FP16 to FP32
if (x->type() == framework::proto::VarType::FP16 &&
dscale->type() == framework::proto::VarType::FP32) {
auto dst_dtype = ConvertToNpuDtype(dscale->type());
auto runner_cast_dscale =
NpuOpRunner("Cast", {*tmp_dscale}, {*dscale},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_dscale.Run(stream);
}
// same for dbias
if (x->type() == framework::proto::VarType::FP16 &&
dbias->type() == framework::proto::VarType::FP32) {
auto dst_dtype = ConvertToNpuDtype(dbias->type());
auto runner_cast_dbias =
NpuOpRunner("Cast", {*tmp_dbias}, {*dbias},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_dbias.Run(stream);
}
const_cast<Tensor*>(mean)->Resize(mean_dims);
const_cast<Tensor*>(variance)->Resize(mean_dims);
const_cast<Tensor*>(scale)->Resize(framework::make_ddim({right}));
dscale->Resize(framework::make_ddim({right}));
dbias->Resize(framework::make_ddim({right}));
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(layer_norm, ops::LayerNormNPUKernel<float>,
ops::LayerNormNPUKernel<plat::float16>);
REGISTER_OP_NPU_KERNEL(layer_norm_grad, ops::LayerNormGradNPUKernel<float>,
ops::LayerNormGradNPUKernel<plat::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <iostream>
#include <memory>
#include <string>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class LookupTableV2NPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
auto *ids_t = ctx.Input<framework::LoDTensor>("Ids"); // int tensor
auto *output_t = ctx.Output<framework::LoDTensor>("Out"); // float tensor
auto *table_t = ctx.Input<framework::LoDTensor>("W");
auto *table_var = ctx.InputVar("W");
PADDLE_ENFORCE_EQ(
table_var->IsType<framework::LoDTensor>(), true,
platform::errors::InvalidArgument("NPU only accepts LoDTensor for W"));
output_t->mutable_data<T>(ctx.GetPlace());
framework::NPUAttributeMap attr_input = {{"validate_indices", false}};
auto runner =
NpuOpRunner("Gather", {*table_t, *ids_t}, {*output_t}, attr_input);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename T>
class LookupTableV2GradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
auto *ids_t = ctx.Input<framework::LoDTensor>("Ids");
auto *output_grad_t =
ctx.Input<framework::LoDTensor>(framework::GradVarName("Out"));
auto *table_grad_t =
ctx.Output<framework::LoDTensor>(framework::GradVarName("W"));
table_grad_t->mutable_data<T>(ctx.GetPlace());
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
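// dW is accumulated with ScatterAdd over a zero-initialized table:
// dW[ids[i], :] += dOut[i, :] for every looked-up id.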
// Build a zero tensor with the same shape as the weight gradient on the device.
Tensor zeroslike_w(table_grad_t->type());
zeroslike_w.Resize(table_grad_t->dims());
auto p = zeroslike_w.mutable_data<T>(ctx.GetPlace());
platform::NPUMemsetAsync(static_cast<void *>(p), 0,
zeroslike_w.numel() * sizeof(T), stream);
table_grad_t->mutable_data<T>(ctx.GetPlace());
auto runner_scatter =
NpuOpRunner("ScatterAdd", {zeroslike_w, *ids_t, *output_grad_t},
{*table_grad_t}, {});
runner_scatter.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
lookup_table_v2,
ops::LookupTableV2NPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::LookupTableV2NPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
lookup_table_v2_grad, ops::LookupTableV2GradNPUKernel<float>,
ops::LookupTableV2GradNPUKernel<paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <cmath>
#include <iostream>
#include <numeric>
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(lookup_table_v2);
USE_OP_DEVICE_KERNEL(lookup_table_v2, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx) {
// init
auto ids = scope->Var("Ids");
auto out = scope->Var("Out");
auto w = scope->Var("W");
auto ids_t = ids->GetMutable<f::LoDTensor>();
auto out_t = out->GetMutable<f::LoDTensor>();
auto w_t = w->GetMutable<f::LoDTensor>();
int bsz = 10;
int dim = 32;
int seqlen = 8;
int vocab_size = 100;
TensorFromVector(std::vector<int64_t>(bsz * seqlen, 3), ctx, ids_t);
std::vector<T> val(vocab_size * dim, 10.);
TensorFromVector(val, ctx, w_t);
ids_t->Resize({bsz, seqlen});
w_t->Resize({vocab_size, dim});
out_t->Resize({bsz, seqlen, dim});
ctx.Wait();
auto place = ctx.GetPlace();
out_t->mutable_data<T>(place);
f::AttributeMap attrs = {{}};
auto op = f::OpRegistry::CreateOp("lookup_table_v2",
{{"W", {"W"}}, {"Ids", {"Ids"}}},
{{"Out", {"Out"}}}, attrs);
op->Run(*scope, place);
std::vector<T> out_v;
TensorToVector(*out_t, ctx, &out_v);
ctx.Wait();
EXPECT_EQ(out_t->numel(), bsz * seqlen * dim);
T res = std::accumulate(out_v.begin(), out_v.end(), 0.);
float eps = 1.e-6;
EXPECT_LT(fabs(res - bsz * seqlen * dim * 10.), eps);
}
template <typename T>
void CompareGrad(f::Scope* scope, const p::DeviceContext& ctx) {
// init
auto w = scope->Var("W");
auto ids = scope->Var("Ids");
auto out = scope->Var("DOut");
auto dw = scope->Var("DW");
auto w_t = w->GetMutable<f::LoDTensor>();
auto ids_t = ids->GetMutable<f::LoDTensor>();
auto out_t = out->GetMutable<f::LoDTensor>();
auto dw_t = dw->GetMutable<f::LoDTensor>();
int bsz = 2;
int dim = 2;
int seqlen = 2;
int vocab_size = 4;
std::vector<int64_t> val_int(bsz * seqlen, 3);
std::vector<T> val(vocab_size * dim, 0.);
std::vector<T> val_out(bsz * seqlen * dim, 1.);
TensorFromVector(val_int, ctx, ids_t);
TensorFromVector(val, ctx, w_t);
TensorFromVector(val, ctx, dw_t);
TensorFromVector(val_out, ctx, out_t);
w_t->Resize({vocab_size, dim});
ids_t->Resize({bsz, seqlen});
out_t->Resize({bsz, seqlen, dim});
dw_t->Resize({vocab_size, dim});
ctx.Wait();
auto place = ctx.GetPlace();
out_t->mutable_data<T>(place);
w_t->mutable_data<T>(place);
dw_t->mutable_data<T>(place);
f::AttributeMap attrs = {{}};
auto op = f::OpRegistry::CreateOp(
"lookup_table_v2_grad",
{{"Ids", {"Ids"}}, {"W", {"W"}}, {"Out@GRAD", {"DOut"}}},
{{"W@GRAD", {"DW"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
std::vector<T> w_v;
TensorToVector(*dw_t, ctx, &w_v);
ctx.Wait();
EXPECT_EQ(dw_t->numel(), vocab_size * dim);
T res = std::accumulate(w_v.begin(), w_v.end(), 0.);
float eps = 1.e-6;
EXPECT_LT(fabs(res - bsz * seqlen * dim), eps);
}
TEST(lookup_table_v2, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx);
}
TEST(lookup_table_v2_grad, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx);
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/matmul_v2_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class MatMulV2NPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::Tensor>("X");
auto* y = ctx.Input<framework::Tensor>("Y");
auto* out = ctx.Output<framework::Tensor>("Out");
bool transpose_x = ctx.Attr<bool>("trans_x");
bool transpose_y = ctx.Attr<bool>("trans_y");
if (x->dims().size() == 2) {
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner(
"MatMul", {*x, *y}, {*out},
{{"transpose_x1", transpose_x}, {"transpose_x2", transpose_y}});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
} else if (x->dims().size() > 2) {
out->mutable_data<T>(ctx.GetPlace());
auto runner =
NpuOpRunner("BatchMatMul", {*x, *y}, {*out},
{{"adj_x1", transpose_x}, {"adj_x2", transpose_y}});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
}
};
template <typename DeviceContext, typename T>
class MatMulV2GradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::Tensor>("X");
auto* y = ctx.Input<framework::Tensor>("Y");
auto* dout = ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<framework::Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<framework::Tensor>(framework::GradVarName("Y"));
bool transpose_y = ctx.Attr<bool>("trans_y");
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
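// For Out = X * Y (2-D case): dX = dOut * Y^T and dY = X^T * dOut; when
// trans_y is true, Out = X * Y^T, so dX = dOut * Y and dY = dOut^T * X.
// The batched (>2-D) branch mirrors this with BatchMatMul.
// Note: trans_x is not consulted in this kernel.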
if (x->dims().size() == 2) {
if (transpose_y) {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx =
NpuOpRunner("MatMul", {*dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", false}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {*dout, *x}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
} else {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx =
NpuOpRunner("MatMul", {*dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", true}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {*x, *dout}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
}
} else if (x->dims().size() > 2) {
if (transpose_y) {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx = NpuOpRunner("BatchMatMul", {*dout, *y}, {*dx},
{{"adj_x1", false}, {"adj_x2", false}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy = NpuOpRunner("BatchMatMul", {*dout, *x}, {*dy},
{{"adj_x1", true}, {"adj_x2", false}});
runner_dy.Run(stream);
}
} else {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx = NpuOpRunner("BatchMatMul", {*dout, *y}, {*dx},
{{"adj_x1", false}, {"adj_x2", true}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy = NpuOpRunner("BatchMatMul", {*x, *dout}, {*dy},
{{"adj_x1", true}, {"adj_x2", false}});
runner_dy.Run(stream);
}
}
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
matmul_v2,
ops::MatMulV2NPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MatMulV2NPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
matmul_v2_grad,
ops::MatMulV2GradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MatMulV2GradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/mean_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/platform/float16.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class MeanNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::LoDTensor>("X");
auto* out = ctx.Output<framework::LoDTensor>("Out");
std::vector<int> axes;
framework::NPUAttributeMap attr_input = {{"keep_dims", false},
{"axes", axes}};
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceMeanD", {*x}, {*out}, attr_input);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class MeanGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto stream =
context.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto grad = context.Input<Tensor>(framework::GradVarName("Out"));
PADDLE_ENFORCE_EQ(grad->numel(), 1,
platform::errors::InvalidArgument(
"The input tensor of mean_grad must have exactly 1 element, "
"but Out@GRAD has %d elements.",
grad->numel()));
auto IG = context.Output<Tensor>(framework::GradVarName("X"));
IG->mutable_data<T>(context.GetPlace());
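// The gradient of mean is uniform over the input:
// dX[i] = dOut / numel(X) for every element, assembled below as
// (1 / numel) * OnesLike(X) * dOut.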
// ones
Tensor ones(grad->type());
ones.mutable_data<T>(IG->dims(), context.GetPlace());
auto runner_ones = NpuOpRunner("OnesLike", {*IG}, {ones}, {});
runner_ones.Run(stream);
// means
Tensor mean_tensor(grad->type());
mean_tensor.Resize({1});
mean_tensor.mutable_data<T>(context.GetPlace());
std::vector<float> mean_vec;
mean_vec.push_back(1.0 / static_cast<float>(IG->numel()));
framework::TensorFromVector(mean_vec, context.device_context(),
&mean_tensor);
// means mul ones
Tensor mean_ma(grad->type());
mean_ma.Resize(IG->dims());
mean_ma.mutable_data<T>(context.GetPlace());
auto runner_mul_1 = NpuOpRunner("Mul", {mean_tensor, ones}, {mean_ma}, {});
runner_mul_1.Run(stream);
// and mul grad
auto runner_mul_2 = NpuOpRunner("Mul", {mean_ma, *grad}, {*IG}, {});
runner_mul_2.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(
mean, ops::MeanNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::MeanNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MeanNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::MeanNPUKernel<paddle::platform::NPUDeviceContext, plat::float16>)
REGISTER_OP_NPU_KERNEL(
mean_grad, ops::MeanGradNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::MeanGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MeanGradNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::MeanGradNPUKernel<paddle::platform::NPUDeviceContext, plat::float16>)
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/controlflow/compare_op.h"
#include "paddle/fluid/operators/metrics/accuracy_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class AccuracyNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* pred = ctx.Input<Tensor>("Out");
auto* label = ctx.Input<Tensor>("Label");
// auto* logits = ctx.Input<Tensor>("Indices");
auto* acc = ctx.Output<Tensor>("Accuracy");
auto* correct = ctx.Output<Tensor>("Correct");
auto* total = ctx.Output<Tensor>("Total");
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
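// Computation sketch: cast pred and label to int32, compare them with Equal,
// cast the boolean mask to float, then
// Accuracy = ReduceMean(mask), Correct = ReduceSum(mask),
// Total = ReduceSum(OnesLike(label)).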
// cast pred
Tensor tmp_pred(pred->type());
tmp_pred.Resize(pred->dims());
tmp_pred.mutable_data<int>(ctx.GetPlace());
auto runner_cast_pred =
NpuOpRunner("Cast", {*pred}, {tmp_pred},
{{"dst_type", static_cast<int>(ACL_INT32)}});
runner_cast_pred.Run(stream);
// cast label
Tensor tmp_label(label->type());
tmp_label.Resize(label->dims());
tmp_label.mutable_data<int>(ctx.GetPlace());
auto runner_cast_label =
NpuOpRunner("Cast", {*label}, {tmp_label},
{{"dst_type", static_cast<int>(ACL_INT32)}});
runner_cast_label.Run(stream);
// equal
Tensor tmp_equal(label->type());
tmp_equal.Resize(label->dims());
tmp_equal.mutable_data<bool>(ctx.GetPlace());
auto runner_equal =
NpuOpRunner("Equal", {tmp_pred, tmp_label}, {tmp_equal}, {});
runner_equal.Run(stream);
// cast equal
Tensor tmp_equal_cast(label->type());
tmp_equal_cast.Resize(label->dims());
tmp_equal_cast.mutable_data<float>(ctx.GetPlace());
auto runner_cast_equal =
NpuOpRunner("Cast", {tmp_equal}, {tmp_equal_cast},
{{"dst_type", static_cast<float>(ACL_FLOAT)}});
runner_cast_equal.Run(stream);
// acc
acc->mutable_data<float>(ctx.GetPlace());
std::vector<int> axes_vec_1;
auto runner_acc = NpuOpRunner("ReduceMeanD", {tmp_equal_cast}, {*acc},
{{"keep_dims", false}, {"axes", axes_vec_1}});
runner_acc.Run(stream);
// correct
correct->mutable_data<float>(ctx.GetPlace());
std::vector<int> axes_vec_2;
auto runner_correct =
NpuOpRunner("ReduceSumD", {tmp_equal_cast}, {*correct},
{{"keep_dims", false}, {"axes", axes_vec_2}});
runner_correct.Run(stream);
// ones_tensor
Tensor ones_tensor(label->type());
ones_tensor.Resize(label->dims());
ones_tensor.mutable_data<int>(ctx.GetPlace());
auto runner_oneslike =
NpuOpRunner("OnesLike", {tmp_label}, {ones_tensor}, {});
runner_oneslike.Run(stream);
// ones_tensor_cast
Tensor ones_tensor_cast(label->type());
ones_tensor_cast.Resize(label->dims());
ones_tensor_cast.mutable_data<float>(ctx.GetPlace());
auto runner_ones_cast =
NpuOpRunner("Cast", {ones_tensor}, {ones_tensor_cast},
{{"dst_type", static_cast<float>(ACL_FLOAT)}});
runner_ones_cast.Run(stream);
// total
total->mutable_data<float>(ctx.GetPlace());
std::vector<int> axes_vec_3;
auto runner_total =
NpuOpRunner("ReduceSumD", {ones_tensor_cast}, {*total},
{{"keep_dims", false}, {"axes", axes_vec_3}});
runner_total.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
accuracy, ops::AccuracyNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::AccuracyNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>,
ops::AccuracyNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::AccuracyNPUKernel<paddle::platform::NPUDeviceContext, int64_t>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/mul_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class MulNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::Tensor>("X");
auto* y = ctx.Input<framework::Tensor>("Y");
auto* out = ctx.Output<framework::Tensor>("Out");
int x_num_col_dims = ctx.Attr<int>("x_num_col_dims");
int y_num_col_dims = ctx.Attr<int>("y_num_col_dims");
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
if (x_num_col_dims == 1 && y_num_col_dims == 1) {
if (x->dims().size() == 2 && y->dims().size() == 2) {
out->mutable_data<T>(ctx.GetPlace());
auto runner =
NpuOpRunner("MatMul", {*x, *y}, {*out},
{{"transpose_x1", false}, {"transpose_x2", false}});
runner.Run(stream);
} else if (x->dims().size() == 3 && y->dims().size() == 2) {
// reshape
Tensor tmp_x(x->type());
int64_t sec_dim = x->dims()[1] * x->dims()[2];
int64_t first_dim = x->dims()[0];
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
tmp_x.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*x, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_x);
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
out->mutable_data<T>(ctx.GetPlace());
// matmul
auto runner =
NpuOpRunner("MatMul", {tmp_x, *y}, {*out},
{{"transpose_x1", false}, {"transpose_x2", false}});
runner.Run(stream);
} else {
PADDLE_THROW(
platform::errors::InvalidArgument("NPU error: unsupported dims"));
}
// TODO: support other x_num_col_dims / y_num_col_dims combinations
} else if (x->dims().size() == 3 && y->dims().size() == 2) {
// for example: x.shape=[2, 3, 4] y.shape=[4, 5], expect [2, 3, 5]
PADDLE_ENFORCE_EQ(x_num_col_dims, 2,
platform::errors::InvalidArgument(
"Only x_num_col_dims == 2 is supported for now, but got %d",
x_num_col_dims));
// flatten => x.shape=[6, 4]
Tensor tmp_x(x->type());
int64_t first_dim = x->dims()[0] * x->dims()[1];
int64_t sec_dim = x->dims()[2];
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
tmp_x.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*x, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_x);
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
// matmul [6,4] , [4, 5] => [6, 5]
Tensor tmp_matmul(x->type());
tmp_matmul.Resize(framework::make_ddim({first_dim, y->dims()[1]}));
tmp_matmul.mutable_data<T>(ctx.GetPlace());
auto runner_matmul =
NpuOpRunner("MatMul", {tmp_x, *y}, {tmp_matmul},
{{"transpose_x1", false}, {"transpose_x2", false}});
runner_matmul.Run(stream);
// reshape [6, 5] => [2, 3, 5]
(*out).Resize(
framework::make_ddim({x->dims()[0], x->dims()[1], y->dims()[1]}));
out->mutable_data(ctx.GetPlace(), x->type());
framework::TensorCopy(
tmp_matmul, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), out);
(*out).Resize(
framework::make_ddim({x->dims()[0], x->dims()[1], y->dims()[1]}));
}
}
};
template <typename DeviceContext, typename T>
class MulGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::Tensor>("X");
auto* y = ctx.Input<framework::Tensor>("Y");
auto* dout = ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<framework::Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<framework::Tensor>(framework::GradVarName("Y"));
int x_num_col_dims = ctx.Attr<int>("x_num_col_dims");
int y_num_col_dims = ctx.Attr<int>("y_num_col_dims");
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
if (x_num_col_dims == 1 && y_num_col_dims == 1) {
if (x->dims().size() == 2 && y->dims().size() == 2) {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx =
NpuOpRunner("MatMul", {*dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", true}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {*x, *dout}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
} else if (x->dims().size() == 3 && y->dims().size() == 2) {
// flatten => x.shape=[6, 4]
// matmul
if (dx) {
// matmul [2, 5] * [12, 5] => [2, 12]
dx->mutable_data<T>(ctx.GetPlace());
auto dx_dims = dx->dims();
dx->Resize(framework::make_ddim({dout->dims()[0], y->dims()[0]}));
auto runner_matmul =
NpuOpRunner("MatMul", {*dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", true}});
runner_matmul.Run(stream);
// reshape [2, 12] => [2, 3, 4]
dx->Resize(dx_dims);
}
if (dy) {
// flatten
Tensor tmp_x(x->type());
int64_t sec_dim = x->dims()[1] * x->dims()[2];
int64_t first_dim = x->dims()[0];
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
tmp_x.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*x, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_x);
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {tmp_x, *dout}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
}
} else if (x->dims().size() == 3 && y->dims().size() == 2) {
// for example: x.shape=[2, 3, 4] y.shape=[4, 5], expect [2, 3, 5]
PADDLE_ENFORCE_EQ(x_num_col_dims, 2,
platform::errors::InvalidArgument(
"Only x_num_col_dims == 2 is supported for now, but got %d",
x_num_col_dims));
// tmp_dout is used by both dx and dy
Tensor tmp_dout(x->type());
int64_t dout_first_dim = dout->dims()[0] * dout->dims()[1];
int64_t dout_sec_dim = dout->dims()[2];
tmp_dout.Resize(framework::make_ddim({dout_first_dim, dout_sec_dim}));
tmp_dout.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*dout, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_dout);
tmp_dout.Resize(framework::make_ddim({dout_first_dim, dout_sec_dim}));
if (dx) {
// tmp_dout * y [6,5] * [4,5] => [6, 4]
dx->mutable_data<T>(ctx.GetPlace());
auto dx_dims = dx->dims();
dx->Resize(framework::make_ddim({dout_first_dim, y->dims()[0]}));
auto runner_matmul =
NpuOpRunner("MatMul", {tmp_dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", true}});
runner_matmul.Run(stream);
// reshape [6, 4] => [2, 3, 4]
dx->Resize(dx_dims);
}
if (dy) {
// flatten x.shape [2,3,4] => [6, 4]
Tensor tmp_x(x->type());
int64_t first_dim = x->dims()[0] * x->dims()[1];
int64_t sec_dim = x->dims()[2];
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
tmp_x.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*x, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_x);
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
// matmul: [6, 4]^T * [6, 5] => [4, 5]
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {tmp_x, tmp_dout}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
mul, ops::MulNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MulNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
mul_grad, ops::MulGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MulGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
......@@ -64,13 +64,21 @@ aclFormat ConvertToNpuFormat(DataLayout layout) {
return iter->second;
}
aclrtStream GetCurrentNPUStream() {
int device_id = platform::GetCurrentNPUDeviceId();
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
auto *dev_ctx = static_cast<platform::NPUDeviceContext *>(
pool.Get(platform::NPUPlace(device_id)));
return dev_ctx->stream();
}
NpuOpRunner::NpuOpRunner(std::string op_type) : op_type_(op_type) {
attr_ = aclopCreateAttr();
}
NpuOpRunner::NpuOpRunner(std::string op_type, const std::vector<Tensor> &inputs,
const std::vector<Tensor> &outputs,
const AttributeMap &attrs)
const NPUAttributeMap &attrs)
: op_type_(op_type) {
attr_ = aclopCreateAttr();
AddInputs(inputs);
......@@ -85,7 +93,7 @@ NpuOpRunner::~NpuOpRunner() {
const std::string &NpuOpRunner::Type() { return op_type_; }
NpuOpRunner &NpuOpRunner::AddAttr(const std::string &name,
const Attribute &attr) {
const NPUAttribute &attr) {
if (attr.type() == typeid(bool)) {
PADDLE_ENFORCE_NPU_SUCCESS(
aclopSetAttrBool(attr_, name.c_str(), BOOST_GET_CONST(bool, attr)));
......@@ -135,6 +143,16 @@ NpuOpRunner &NpuOpRunner::AddAttr(const std::string &name,
}
PADDLE_ENFORCE_NPU_SUCCESS(
aclopSetAttrListString(attr_, name.c_str(), s.size(), s.data()));
} else if (attr.type() == typeid(std::vector<std::vector<int64_t>>)) {
auto a = BOOST_GET_CONST(std::vector<std::vector<int64_t>>, attr);
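// For example, an attribute value of {{0, 0}, {1, 1}} is flattened below into
// data = {ptr_to_row_0, ptr_to_row_1} and num = {2, 2} before the call to
// aclopSetAttrListListInt (illustrative values).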
std::vector<int64_t *> data;
std::vector<int> num;
for (auto &&v : a) {
data.push_back(v.data());
num.push_back(v.size());
}
PADDLE_ENFORCE_NPU_SUCCESS(aclopSetAttrListListInt(
attr_, name.c_str(), data.size(), num.data(), data.data()));
} else {
PADDLE_THROW(platform::errors::Unimplemented(
"Can not convert attribubte '%s' to convert to aclopAttr", name));
......@@ -142,7 +160,7 @@ NpuOpRunner &NpuOpRunner::AddAttr(const std::string &name,
return *this;
}
NpuOpRunner &NpuOpRunner::AddAttrs(const AttributeMap &attrs) {
NpuOpRunner &NpuOpRunner::AddAttrs(const NPUAttributeMap &attrs) {
for (const auto &pair : attrs) {
AddAttr(pair.first, pair.second);
}
......@@ -175,6 +193,21 @@ NpuOpRunner &NpuOpRunner::AddInputs(const std::vector<Tensor> &tensors) {
return *this;
}
// NOTE(zhiqiu): For operators whose input is a list (such as concat, stack),
// It is needed to set the name of each input tensor.
NpuOpRunner &NpuOpRunner::AddInputNames(const std::vector<std::string> &names) {
PADDLE_ENFORCE_EQ(names.size(), input_descs_.size(),
platform::errors::InvalidArgument(
"The size of input names should be "
"equal to the size of input descs, but got the size "
"of input names is %d, the size of input descs is %d.",
names.size(), input_descs_.size()));
for (size_t i = 0; i < names.size(); ++i) {
aclSetTensorDescName(input_descs_[i], names[i].c_str());
}
return *this;
}
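// Usage sketch (illustrative only; the op name, input names and attributes
// below are assumptions, not taken from this change):
//   NpuOpRunner runner("ConcatD", {t0, t1}, {out},
//                      {{"concat_dim", 0}, {"N", 2}});
//   runner.AddInputNames({"x0", "x1"});
//   runner.Run(stream);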
NpuOpRunner &NpuOpRunner::AddOutputs(const std::vector<Tensor> &tensors) {
for (auto tensor : tensors) {
// create aclTensorDesc
......@@ -224,18 +257,22 @@ aclTensorDesc *NpuOpRunner::CreateTensorDesc(Tensor tensor) {
auto format = ConvertToNpuFormat(tensor.layout());
auto dims = framework::vectorize(tensor.dims());
VLOG(4) << dtype << " " << dims.size() << " " << dims[0] << "," << dims[1]
<< " " << format;
VLOG(4) << "NPU dtype:" << dtype << " "
<< "rank:" << dims.size() << " dims:" << tensor.dims()
<< " format:" << format;
auto *desc = aclCreateTensorDesc(dtype, dims.size(), dims.data(), format);
PADDLE_ENFORCE_NOT_NULL(
desc, platform::errors::External("Call aclCreateTensorDesc failed."));
PADDLE_ENFORCE_NPU_SUCCESS(aclSetTensorStorageFormat(desc, format));
PADDLE_ENFORCE_NPU_SUCCESS(
aclSetTensorStorageShape(desc, dims.size(), dims.data()));
return desc;
}
aclDataBuffer *NpuOpRunner::CreateDataBuffer(Tensor tensor) {
void *ptr = tensor.data<void>();
VLOG(4) << "ptr: " << ptr << ", size: " << tensor.memory_size();
VLOG(4) << "NPU ptr: " << ptr << ", size: " << tensor.memory_size();
auto *buffer = aclCreateDataBuffer(ptr, tensor.memory_size());
PADDLE_ENFORCE_NOT_NULL(
buffer, platform::errors::External("Call aclCreateDataBuffer failed."));
......@@ -243,11 +280,17 @@ aclDataBuffer *NpuOpRunner::CreateDataBuffer(Tensor tensor) {
}
void NpuOpRunner::Run(aclrtStream stream) {
if (!stream) {
VLOG(4) << "Run with default current npu stream: " << stream;
stream = GetCurrentNPUStream();
}
VLOG(4) << "op_type: " << op_type_;
VLOG(4) << "input_desc.size: " << input_descs_.size();
VLOG(4) << "output_desc.size: " << output_descs_.size();
VLOG(4) << "stream: " << stream;
VLOG(4) << "attr: " << attr_;
VLOG(4) << "stream: " << stream;
aclError ret = aclopCompileAndExecute(
op_type_.c_str(), input_descs_.size(), input_descs_.data(),
input_buffers_.data(), output_descs_.size(), output_descs_.data(),
......
......@@ -12,8 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#pragma once
#include <paddle/fluid/framework/operator.h>
#include <paddle/fluid/framework/type_defs.h>
#include <string>
#include <vector>
......@@ -26,8 +28,8 @@ namespace operators {
using Tensor = framework::Tensor;
using DataLayout = framework::DataLayout;
using Attribute = framework::Attribute;
using AttributeMap = framework::AttributeMap;
using NPUAttribute = framework::NPUAttribute;
using NPUAttributeMap = framework::NPUAttributeMap;
class NpuOpRunner {
public:
......@@ -35,15 +37,15 @@ class NpuOpRunner {
explicit NpuOpRunner(std::string op_type,
const std::vector<Tensor> &inputs = {},
const std::vector<Tensor> &outputs = {},
const AttributeMap &attrs = {});
const NPUAttributeMap &attrs = {});
~NpuOpRunner();
const std::string &Type();
NpuOpRunner &AddAttr(const std::string &name, const Attribute &attr);
NpuOpRunner &AddAttr(const std::string &name, const NPUAttribute &attr);
NpuOpRunner &AddAttrs(const AttributeMap &attrs);
NpuOpRunner &AddAttrs(const NPUAttributeMap &attrs);
NpuOpRunner &AddInput(const Tensor &tensor);
......@@ -51,6 +53,8 @@ class NpuOpRunner {
NpuOpRunner &AddInputs(const std::vector<Tensor> &tensors);
NpuOpRunner &AddInputNames(const std::vector<std::string> &names);
NpuOpRunner &AddOutputs(const std::vector<Tensor> &tensors);
aclTensorDesc *GetInputDesc(size_t index);
......@@ -65,7 +69,7 @@ class NpuOpRunner {
std::vector<aclDataBuffer *> &GetOutputBuffers();
void Run(aclrtStream stream);
void Run(aclrtStream stream = nullptr);
private:
aclTensorDesc *CreateTensorDesc(Tensor tensor);
......@@ -80,5 +84,8 @@ class NpuOpRunner {
aclopAttr *attr_{nullptr};
};
aclDataType ConvertToNpuDtype(framework::proto::VarType::Type dtype);
} // namespace operators
} // namespace paddle
#endif
(Diff collapsed.)
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/operators/optimizers/sgd_op.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class SGDNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* learning_rate = ctx.Input<framework::LoDTensor>("LearningRate");
auto* param_var = ctx.Input<framework::LoDTensor>("Param");
auto* grad_var = ctx.Input<framework::LoDTensor>("Grad");
auto* param_out = ctx.Output<framework::LoDTensor>("ParamOut");
param_out->mutable_data<T>(ctx.GetPlace());
auto runner =
NpuOpRunner("ApplyGradientDescent",
{*param_var, *learning_rate, *grad_var}, {*param_out}, {});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
// NOTE(zhiqiu): ApplyGradientDescent updates the parameter in place, so
// if param and param_out are not the same tensor, we need to copy.
if (param_out->data<T>() != param_var->data<T>()) {
ctx.template device_context<paddle::platform::NPUDeviceContext>().Wait();
framework::TensorCopySync(*param_var, ctx.GetPlace(), param_out);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
sgd, ops::SGDNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::SGDNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::SGDNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
(2 diffs collapsed.)
......@@ -42,3 +42,7 @@ endif()
if(WITH_ROCM)
hip_test(check_reduce_rank_test SRCS check_reduce_rank_test.cu DEPS tensor)
endif()
if(WITH_ASCEND_CL)
cc_test(reduce_any_op_npu_test SRCS reduce_any_op_npu_test.cc DEPS op_registry reduce_any_op scope device_context enforce executor)
endif()
(20 diffs collapsed.)
......@@ -531,7 +531,7 @@ if(WITH_DISTRIBUTE)
bash_test_modules(test_fleet_launch_async START_BASH test_fleet_launch_async.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
bash_test_modules(test_fleet_launch_cloud START_BASH test_fleet_launch_cloud.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
bash_test_modules(test_fleet_launch_nproc START_BASH test_fleet_launch_nproc.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
bash_test_modules(test_fleet_launch_ascend START_BASH test_fleet_launch_ascend.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
bash_test_modules(test_ascend_group START_BASH test_ascend_group.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
endif()
......
(Diff collapsed.)