Unverified commit e6bc358d authored by zhang wenhui, committed by GitHub

【NPU】Cherry-pick ascendrc ops code by 0325 to develop (#32197)

* merge 31065

* Fix typo of selected_npus (#31230)

* merge 31249

* [NPU] Support npu op pow and pow grad (#31247)

* [NPU] Support npu op: (1) pow (2) pow_grad

* Support fp16

* Fix pow npu fp16 test (#31256)

* support list of list attribute for NPU (#31299)

* support list of list attribute for NPU

* fix compile problem

* fix reference

* [NPU] Support npu op: (1) slice (2) slice_grad (#31275)

* fix reading flags from env (#31329)

* merge 31347

* [NPU] Support npu op layer_norm and layer_norm_grad (#31310)

* init commit, add layer_norm npu kernel

* fix typo

* add unittest

* add unittest

* fix bug

* fix bug

* refine ut

* [NPU] add npu kernel for equal op (#31393)

* add npu kernel for equal op

* refine code

* add more ut

* update year

* [NPU] Support npu kernel for shape op  (#31427)

* add shape npu

* fix

* fix

* fix endif (#31431)

* Fix pow, use fillD instead of broadcast (#31433)

* Fix pow, refine code (#31440)

* fix cmake of cryptopp to avoid downloading every time (#31451)

* [NPU] squeeze and unsqueeze op for ascend (#31452)
Co-authored-by: root <xiayanming@baidu.com>

* Support npu kernel for gather op (#31458)

* add gather npu op

* code review done

* update python new line

* precommit

* fix review

* del commit

* 【NPU】add scale op for npu (#31499)

* add scale npu

* fix

* fix

* Support TensorFromVector, TensorToVector of bool type (#31518)

* support TensorFromVector, TensorToVector of bool type

* add ut

* fix compile problem

* 【NPU】support npu kernel for fill_constant op (#31521)

* add fill_constant npu

* add fill_constant npu

* fix

* cherry-pick 31422, solve conflict

* 【NPU】Support npu kernel for matmul op (#31544)

* add matmulv2_npu

* add matmul

* add matmul

* [NPU] Support npu op elementwise_mul and elementwise_mul_grad (#31571)

* [NPU] Support npu op elementwise_max (#31574)

* 【NPU】add relu op for  npu (#31515)

* add relu npu

* fixed

* fix

* 【NPU】Support npu kernel for reshape2 op (#31524)

* add reshape2 npu

* add reshape2

* [NPU] Support npu kernel for gather op fix bug (#31541)

* add gather npu op

* code review done

* update python new line

* precommit

* fix review

* del commit

* update gather_grad

* fix bug

* fix bug

* [NPU] Support npu kernel for amp_check_finite_and_unscale_npu op (#31457)

* Support npu kernel for amp_check_finite_and_unscale_npu op

* support EnforceNotMet exception

* fix exception bug

* modify python unittest

* precommit

* update c++ unittest

* fix review

* fix review

* [NPU] accuracy op (#31492)

* accuracy op

* fix license

* fix

* add test and fix bug

* [NPU] add Assign OP (#31561)

* add assign op

* add test assign npu test

* dele if def
Co-authored-by: oyjxer <1728722986@qq.com>

* [NPU] fix npu op elementwise_mul_grad (#31592)

* 【NPU】Support npu op gelu and gelu_grad (#31530)

* Support npu op gelu and gelu_grad

* Support npu op gelu and gelu_grad

* [NPU] fix assign cmake (#31595)

* fix gather_grad bug (#31607)

* [NPU] add range op (#31560)

* add range op

* fix codestyle; call GetSize directly
Co-authored-by: oyjxer <1728722986@qq.com>

* 【NPU】Support npu op elementwise_div and elementwise_div_grad (#31573)

* Support npu op elementwise_div and elementwise_div_grad

* Support npu op elementwise_div and elementwise_div_grad

* Support npu op elementwise_div and elementwise_div_grad

* [NPU] Support npu op log, log_grad, sqrt, sqrt_grad, square, tanh and tanh_grad (#31600)

* [NPU] Support npu op logicalnot_op (#31534)

* [NPU] Support npu op elementwise_min (#31575)

* [NPU] Support npu op elementwise_pow (#31576)

* [NPU] Support npu op table_lookup_v2 and table_lookup_v2_grad (#31399)

* [npu] support npu kernel `table_lookup_v2`

* clean up

* +python test

* +cmake

* clean up

* remove int8 kernel
+ python unittest for fp16

* clean up

* [NPU] support npu kernel for `less_than` (#31327)

* [npu] support npu kernel for `less than`

* remove int* kernel

* cleanup

* [NPU] Support npu kernel scatter op (#31624)

* Support npu kernel scatter op

* Add more test

* [NPU] fix allocator min chunk size (#31632)

* [NPU] Support NPU kernel cast op (#31635)
Co-authored-by: frankwhzhang <frankwhzhang@126.com>

* [NPU] add npu kernel for sgd (#31639)

* 【NPU】Support NPU kernel for reduce_sum op v2 (#31620)

* add reduce_sum

* fix broadcastd

* fix test

* fix

* add unsqueeze in reduce_sum

* add template

* add unittest for keep_dim

* test reduce_all
Co-authored-by: frankwhzhang <frankwhzhang@126.com>

* [NPU] add npu kernel for adam (#31644)

* add npu kernel for adam

* refine code

* disable test

* modify atol

* 【NPU】Support npu kernel for mul op (#31584)

* add mul

* add test mul

* [NPU] add npu kernel for softmax_with_cross_entropy (#31656)

* init

* fix bugs

* [NPU] add npu kernel for mean Op (#31562)

* update mean op

* update mean op

* give a better test activation
Co-authored-by: oyjxer <1728722986@qq.com>

* Revert "[NPU] add npu kernel for mean Op (#31562)" (#31665)

This reverts commit 468ac699.

* 【NPU】Add TensorCopy to NPU kernel for reduce_sum op  (#31667)

* update unittest

* add TensorCopy in npu grad kernel

* [NPU] Support npu op `expand` (#31405)

* [npu] support npu kernel  for `expand`

* [NPU] fix shape of dx in mul_grad (#31675)

* fix shape of dx

* refine code

* [NPU] add Increment op (#31563)

* add increment

* fix

* update test increment op inplace

* update increment op

* increment b = 2
Co-authored-by: oyjxer <1728722986@qq.com>

* [NPU] add NPU topk op (#31596)

* add topk op

* add cmake

* update topk npu op

* refactor func

* fix bug where the test did not run the NPU TopKD kernel

* NPUPlace(4) to NPUPlace(0)

* update comment
Co-authored-by: oyjxer <1728722986@qq.com>

* [NPU] Support NPU kernel sum op (#31671)

* [NPU] npu support `transpose` (#31486)

* cherry-pick 31564, solve conflict

* [NPU] Fix bug: Fix calculation errors of pow grad npu kernel (#31699)

* [NPU] Support testing grad of NPU ops in OpTest (#31697)

* [NPU] Support NPU kernel of stack op (#31711)

* [NPU] Remove redundant ctest of top_k_op_npu_test (#31718)

* [NPU] fix reshape npu op kernel (#31726)

* rename npu op file

* fix reshape

* [NPU] change transpose to transpose2 (#31734)

* change transpose to transpose2

* fix bug

* [NPU] Support  mean npu kernel (#31729)

* [NPU] fix some bugs of npu op (#31739)

* fix softmax

* fix mean

* fix lookup_table_v2

* 【NPU】Fix npu kernel elementwise_div_grad  (#31753)

* [NPU] fix the grad kernel diff bug of gather op (#31757)

* fix gather grad kernel diff

* fix gather grad kernel diff

* fix gather review bug

* 【NPU】Fix reshape test & add grad test (#31776)

* fix

* fix

* [NPU] support fp16 for npu accuracy op (#31797)

* [NPU] support list of tensor input (#31801)

* support list of tensor as npu input

* add comment

* fix typo

* fix typo

* [NPU] add npu kernel for concat op (#31695)

* add npu kernel for concat op

* add npu kernel for concat op

* refine code

* update

* refine concat_grad

* [NPU] Support npu kernel for op elementwise_floordiv (#31822)

* [NPU] fix bug of lookup_table_v2_grad (#31834)

* [NPU] support default stream (#31510)

* [NPU] support mixed precision input for npu layer norm (#31847)

* support mixed precision input for npu layer norm

* fix layer_norm npu kernel
Co-authored-by: zhiqiu <chenqiuliang@baidu.com>

* 【NPU】Support npu kernel for update_loss_scaling op (#31830)

* add update_loss_scaling_npu NPU kernel

* change TensorFromVec to Memset

* fix compile problem (#31850)

* [NPU] support npu for conditional_block op (#31854)

* 【NPU】Add int dtype kernel for reshape2 op (#31864)

* fix

* fix

* [NPU] fix some op bugs (#31855)

* fix some op bugs

* fix some bugs

* follow comments

* fix log level

* add ut

* [NPU] support fp16 of input for api pow (#31871)

* [NPU] add npu kernel for truncated_gaussian_random op (#31654)

* init

* add todo

* add npu kernel for truncated_gaussian_random

* add sync

* fix concat_grad

* fix typo

* fix compile

* fix compile

* fix compile

* fix compile

* fix compile

* fix compile

* fix code style

* fix code style

* fix code

* Fix op test (#32231)

* fix conditional block (#32243)

* fix code style
Co-authored-by: xiayanming <41795079@qq.com>
Co-authored-by: Leo Chen <chenqiuliang@baidu.com>
Co-authored-by: liym27 <33742067+liym27@users.noreply.github.com>
Co-authored-by: Reventon_L <luyuxiang1994@qq.com>
Co-authored-by: root <xiayanming@baidu.com>
Co-authored-by: oyjxer <1728722986@qq.com>
Co-authored-by: yinhaofeng <66763551+yinhaofeng@users.noreply.github.com>
Co-authored-by: OleNet <olenet@126.com>
Co-authored-by: Meiyim <chen_xuyi@outlook.com>
Co-authored-by: oyxuan-11 <963650125@qq.com>
Co-authored-by: pangyoki <pangyoki@126.com>
Parent 69d80274
@@ -32,7 +32,7 @@ cache_third_party(extern_gloo
TAG ${GLOO_TAG}
DIR GLOO_SOURCE_DIR)
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
ExternalProject_Add(
extern_gloo
${EXTERNAL_PROJECT_LOG_ARGS}
......
@@ -242,7 +242,7 @@ endif()
)
ENDFUNCTION()
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
SET(PROTOBUF_VERSION 3.8.0)
else()
SET(PROTOBUF_VERSION 3.1.0)
......
@@ -16,7 +16,7 @@ INCLUDE(ExternalProject)
SET(THREADPOOL_PREFIX_DIR ${THIRD_PARTY_PATH}/threadpool)
SET(THREADPOOL_SOURCE_DIR ${THIRD_PARTY_PATH}/threadpool/src/extern_threadpool)
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
SET(THREADPOOL_REPOSITORY https://gitee.com/tianjianhe/ThreadPool.git)
else()
SET(THREADPOOL_REPOSITORY ${GIT_URL}/progschj/ThreadPool.git)
......
@@ -43,7 +43,7 @@ cache_third_party(extern_warpctc
TAG ${WARPCTC_TAG}
DIR WARPCTC_SOURCE_DIR)
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
ExternalProject_Add(
extern_warpctc
${EXTERNAL_PROJECT_LOG_ARGS}
......
@@ -135,6 +135,7 @@ void TensorFromArray(const T* src, const size_t& array_size,
}
#endif
}
template <typename T>
void TensorFromVector(const std::vector<T>& src,
const platform::DeviceContext& ctx, Tensor* dst) {
@@ -167,6 +168,49 @@ void TensorFromVector(const std::vector<T>& src,
#endif
}
// The fully specialized function should be inline to avoid
// multi-definition.
template <>
inline void TensorFromVector(const std::vector<bool>& src,
const platform::DeviceContext& ctx, Tensor* dst) {
// vector<bool> has no data() member, use array instead.
// See details:
// https://stackoverflow.com/questions/46115669/why-does-stdvectorbool-have-no-data/46115714
bool* array = new bool[src.size()];
for (unsigned int i = 0; i < src.size(); i++) {
array[i] = static_cast<bool>(src[i]);
}
auto dst_place = ctx.GetPlace();
auto src_ptr = static_cast<const void*>(array);
platform::CPUPlace src_place;
dst->Resize({static_cast<int64_t>(src.size())});
auto dst_ptr = static_cast<void*>(dst->mutable_data<bool>(dst_place));
auto size = src.size() * sizeof(bool);
if (platform::is_cpu_place(dst_place)) {
memory::Copy(BOOST_GET_CONST(platform::CPUPlace, dst_place), dst_ptr,
src_place, src_ptr, size);
}
#ifdef PADDLE_WITH_CUDA
else if (platform::is_gpu_place(dst_place)) { // NOLINT
memory::Copy(
BOOST_GET_CONST(platform::CUDAPlace, dst_place), dst_ptr, src_place,
src_ptr, size,
reinterpret_cast<const platform::CUDADeviceContext&>(ctx).stream());
}
#endif
#ifdef PADDLE_WITH_ASCEND_CL
else if (platform::is_npu_place(dst_place)) { // NOLINT
memory::Copy(
BOOST_GET_CONST(platform::NPUPlace, dst_place), dst_ptr, src_place,
src_ptr, size,
reinterpret_cast<const platform::NPUDeviceContext&>(ctx).stream());
}
#endif
delete[] array;
}
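// Note: because std::vector<bool> is bit-packed and has no data() member,
// the specialization above stages the values through a temporary bool array,
// which costs one extra host-side copy compared to the generic path.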
template <typename T>
void TensorFromVector(const std::vector<T>& src, Tensor* dst) {
platform::CPUPlace dst_place = platform::CPUPlace();
@@ -179,6 +223,23 @@ void TensorFromVector(const std::vector<T>& src, Tensor* dst) {
memory::Copy(dst_place, dst_ptr, src_place, src_ptr, size);
}
template <>
inline void TensorFromVector(const std::vector<bool>& src, Tensor* dst) {
bool* array = new bool[src.size()];
for (unsigned int i = 0; i < src.size(); i++) {
array[i] = static_cast<bool>(src[i]);
}
platform::CPUPlace dst_place = platform::CPUPlace();
auto src_ptr = static_cast<const void*>(array);
platform::CPUPlace src_place;
dst->Resize({static_cast<int64_t>(src.size())});
auto dst_ptr = static_cast<void*>(dst->mutable_data<bool>(dst_place));
auto size = src.size() * sizeof(bool);
memory::Copy(dst_place, dst_ptr, src_place, src_ptr, size);
delete[] array;
}
template <typename T>
void TensorToVector(const Tensor& src, const platform::DeviceContext& ctx,
std::vector<T>* dst) {
@@ -212,6 +273,46 @@ void TensorToVector(const Tensor& src, const platform::DeviceContext& ctx,
#endif
}
template <>
inline void TensorToVector(const Tensor& src,
const platform::DeviceContext& ctx,
std::vector<bool>* dst) {
auto src_ptr = static_cast<const void*>(src.data<bool>());
auto size = src.numel() * sizeof(bool);
bool* array = new bool[src.numel()];
platform::CPUPlace dst_place;
dst->resize(src.numel());
auto dst_ptr = static_cast<void*>(array);
if (platform::is_cpu_place(src.place())) {
memory::Copy(dst_place, dst_ptr,
BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr,
size);
}
#ifdef PADDLE_WITH_CUDA
else if (platform::is_gpu_place(src.place())) { // NOLINT
memory::Copy(
dst_place, dst_ptr, BOOST_GET_CONST(platform::CUDAPlace, src.place()),
src_ptr, size,
reinterpret_cast<const platform::CUDADeviceContext&>(ctx).stream());
}
#endif
#ifdef PADDLE_WITH_ASCEND_CL
else if (platform::is_npu_place(src.place())) { // NOLINT
memory::Copy(
dst_place, dst_ptr, BOOST_GET_CONST(platform::NPUPlace, src.place()),
src_ptr, size,
reinterpret_cast<const platform::NPUDeviceContext&>(ctx).stream());
}
#endif
for (unsigned int i = 0; i < src.numel(); i++) {
(*dst)[i] = static_cast<bool>(array[i]);
}
delete[] array;
}
template <typename T>
void TensorToVector(const Tensor& src, std::vector<T>* dst) {
auto src_ptr = static_cast<const void*>(src.data<T>());
@@ -231,6 +332,32 @@ void TensorToVector(const Tensor& src, std::vector<T>* dst) {
BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr, size);
}
template <>
inline void TensorToVector(const Tensor& src, std::vector<bool>* dst) {
auto src_ptr = static_cast<const void*>(src.data<bool>());
auto size = src.numel() * sizeof(bool);
bool* array = new bool[src.numel()];
platform::CPUPlace dst_place;
dst->resize(src.numel());
auto dst_ptr = static_cast<void*>(array);
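// Without a DeviceContext there is no stream to drive a device-side copy,
// so this overload only accepts tensors that already live on the CPU.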
PADDLE_ENFORCE_EQ(
platform::is_cpu_place(src.place()), true,
platform::errors::InvalidArgument(
"The input tensor should be CPU device, but actually it is in %s.",
src.place()));
memory::Copy(dst_place, dst_ptr,
BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr, size);
for (unsigned int i = 0; i < src.numel(); i++) {
(*dst)[i] = static_cast<bool>(array[i]);
}
delete[] array;
}
std::ostream& operator<<(std::ostream& os, const Tensor& t);
} // namespace framework
} // namespace paddle
@@ -242,6 +242,61 @@ TEST(TensorToVector, Tensor) {
#endif
}
TEST(TensorToVector, Tensor_bool) {
{
paddle::framework::Tensor src;
bool* src_ptr =
src.mutable_data<bool>({3, 3}, paddle::platform::CPUPlace());
for (int i = 0; i < 3 * 3; ++i) {
src_ptr[i] = static_cast<bool>(i % 2);
}
paddle::platform::CPUPlace place;
std::vector<bool> dst;
paddle::framework::TensorToVector<bool>(src, &dst);
for (int i = 0; i < 3 * 3; ++i) {
EXPECT_EQ(src_ptr[i], dst[i]);
}
}
#ifdef PADDLE_WITH_CUDA
{
std::vector<bool> src_vec = {
false, true, false, true, false, true, false, true, false,
};
paddle::framework::Tensor gpu_tensor;
paddle::platform::CUDAPlace place;
paddle::platform::CUDADeviceContext gpu_ctx(place);
paddle::framework::TensorFromVector<bool>(src_vec, gpu_ctx, &gpu_tensor);
std::vector<bool> dst;
paddle::framework::TensorToVector<bool>(gpu_tensor, gpu_ctx, &dst);
for (int i = 0; i < 3 * 3; ++i) {
EXPECT_EQ(src_vec[i], dst[i]);
}
}
#endif
#ifdef PADDLE_WITH_ASCEND_CL
{
std::vector<bool> src_vec = {
false, true, false, true, false, true, false, true, false,
};
paddle::framework::Tensor npu_tensor;
paddle::platform::NPUPlace place(0);
paddle::platform::NPUDeviceContext npu_ctx(place);
paddle::framework::TensorFromVector<bool>(src_vec, npu_ctx, &npu_tensor);
std::vector<bool> dst;
paddle::framework::TensorToVector<bool>(npu_tensor, npu_ctx, &dst);
for (int i = 0; i < 3 * 3; ++i) {
EXPECT_EQ(src_vec[i], dst[i]);
}
}
#endif
}
TEST(TensorFromDLPack, Tensor) {
{
std::vector<int> src_vec = {1, 2, 3, 4, 5, 6, 7, 8, 9};
......
@@ -45,6 +45,17 @@ using Attribute = boost::variant<
using AttributeMap = std::unordered_map<std::string, Attribute>;
#ifdef PADDLE_WITH_ASCEND_CL
using NPUAttribute =
boost::variant<boost::blank, int, float, std::string, std::vector<int>,
std::vector<float>, std::vector<std::string>, bool,
std::vector<bool>, BlockDesc*, int64_t,
std::vector<BlockDesc*>, std::vector<int64_t>,
std::vector<double>, std::vector<std::vector<int64_t>>>;
using NPUAttributeMap = std::unordered_map<std::string, NPUAttribute>;
#endif
using OpCreator = std::function<OperatorBase*(
const std::string& /*type*/, const VariableNameMap& /*inputs*/,
const VariableNameMap& /*outputs*/, const AttributeMap& /*attrs*/)>;
......
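The NPUAttribute variant above is what lets NPU ops carry list-of-list attributes (see "support list of list attribute for NPU (#31299)" in the commit message). A minimal sketch of filling such an attribute map, with hypothetical attribute names chosen only for illustration:
#ifdef PADDLE_WITH_ASCEND_CL
// Hypothetical attribute names; the value types are the ones added to the
// NPUAttribute variant in the framework header patched above.
paddle::framework::NPUAttributeMap attrs = {
    {"axes", std::vector<int64_t>{0, 1}},
    {"paddings", std::vector<std::vector<int64_t>>{{0, 0}, {1, 1}}}};
#endif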
@@ -206,8 +206,16 @@ void Copy<platform::NPUPlace, platform::CPUPlace>(platform::NPUPlace dst_place,
if (UNLIKELY(num == 0)) return;
platform::SetNPUDeviceId(dst_place.device);
// NOTE(ascendrc): NPU memcpy async from host to device is a "real" async,
// which is different from CUDA. In Paddle, when async is called, "sync"
// is actually run, which means Paddle doesn't fully support async yet.
// TODO(ascendrc): Support NPU memcpy async for better performance.
stream = nullptr;
VLOG(4) << "memory::Copy " << num << " Bytes from " << src_place << " to "
<< dst_place << " by thream(" << stream << ")";
if (stream) {
platform::RecordEvent record_event("NpuMemcpyAsync:CPU->NPU");
platform::NPUMemcpyAsync(dst, src, num, ACL_MEMCPY_HOST_TO_DEVICE, stream);
@@ -226,8 +234,16 @@ void Copy<platform::CPUPlace, platform::NPUPlace>(platform::CPUPlace dst_place,
if (UNLIKELY(num == 0)) return;
platform::SetNPUDeviceId(src_place.device);
// NOTE(ascendrc): NPU memcpy async from device to host is a "real" async,
// which is different from CUDA. In Paddle, when async is called, "sync"
// is actually run, which means Paddle doesn't fully support async yet.
// TODO(ascendrc): Support NPU memcpy async for better performance.
stream = nullptr;
VLOG(4) << "memory::Copy " << num << " Bytes from " << src_place << " to "
<< dst_place << " by thream(" << stream << ")";
if (stream) {
platform::RecordEvent record_event("NpuMemcpyAsync:NPU->CPU");
platform::NPUMemcpyAsync(dst, src, num, ACL_MEMCPY_DEVICE_TO_HOST, stream);
......
@@ -124,6 +124,7 @@ if (WITH_ASCEND)
endif()
if (WITH_ASCEND_CL)
cc_test(assign_op_npu_test SRCS assign_op_npu_test.cc DEPS assign_op)
cc_library(npu_op_runner SRCS npu_op_runner.cc DEPS operator npu_info)
set(COMMON_OP_DEPS ${COMMON_OP_DEPS} npu_op_runner)
endif()
@@ -141,8 +142,8 @@ set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COMMON_OP_DEPS})
set(GLOB_OPERATOR_DEPS ${OPERATOR_DEPS} CACHE INTERNAL "Global Op dependencies")
cc_test(test_common_infer_shape_functions SRCS test_common_infer_shape_functions.cc DEPS common_infer_shape_functions ${COMMON_OP_DEPS} activation_op elementwise_add_op softmax_op softmax)
cc_test(assign_op_test SRCS assign_op_test.cc DEPS assign_op)
cc_test(gather_test SRCS gather_test.cc DEPS tensor)
cc_test(assign_op_test SRCS assign_op_test.cc DEPS assign_op)
cc_test(scatter_test SRCS scatter_test.cc DEPS tensor math_function)
cc_test(beam_search_decode_op_test SRCS beam_search_decode_op_test.cc DEPS lod_tensor)
cc_test(strided_memcpy_test SRCS strided_memcpy_test.cc DEPS tensor memory)
@@ -163,10 +164,19 @@ if (WITH_PYTHON)
cc_library(py_func_op SRCS py_func_op.cc DEPS op_registry python pybind)
endif()
if (WITH_ASCEND_CL)
cc_test(range_op_npu_test SRCS range_op_npu_test.cc DEPS op_registry range_op scope device_context enforce executor)
cc_test(lookup_table_v2_op_npu_test SRCS lookup_table_v2_op_npu_test.cc DEPS op_registry lookup_table_v2_op scope device_context enforce executor compare_op)
endif()
set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library")
add_subdirectory(benchmark)
cc_test(op_debug_string_test SRCS op_debug_string_test.cc DEPS elementwise_add_op)
if (WITH_ASCEND_CL)
cc_test(transpose_op_npu_test SRCS transpose_op_npu_test.cc DEPS op_registry transpose_op scope device_context enforce executor)
endif()
if(WITH_MKLDNN)
include(mkldnn/inplace_op_tests.cmake)
@@ -180,3 +190,7 @@ if(WITH_UNITY_BUILD)
# The specified link dependency needs to be displayed here.
target_link_libraries(paddle_operators_unity ${OP_HEADER_DEPS} ${COMMON_OP_DEPS})
endif()
if(WITH_ASCEND_CL)
cc_test(gelu_op_npu_test SRCS gelu_op_npu_test.cc DEPS op_registry gelu_op scope device_context enforce executor)
endif()
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/framework/ddim.h"
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/activation_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class PowNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto factor = ctx.Attr<float>("factor");
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("Power", {*x}, {*out},
{{"power", factor},
{"scale", static_cast<float>(1.0)},
{"shift", static_cast<float>(0.0)}});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class PowGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto factor = ctx.Attr<float>("factor");
auto x_dims = x->dims();
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// NOTE(liym27): dx = dout * factor * x.pow(factor-1)
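// Derivation: with y = pow(x, factor), dy/dx = factor * pow(x, factor - 1),
// so the chain rule gives dx = dout * factor * pow(x, factor - 1).
// Steps 1-4 below assemble this product from the Power, FillD and Mul ops.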
// Step1: Compute x_pow = x.pow(factor-1)
Tensor x_pow(x->type());
x_pow.mutable_data<T>(x->dims(), place);
auto runner_pow = NpuOpRunner("Power", {*x}, {x_pow},
{{"power", factor - static_cast<float>(1)}});
runner_pow.Run(stream);
// Step 2: Construct a broadcast factor, which has the same shape with x.
// 2.1 Get a factor tensor with shape [1].
Tensor factor_tensor(framework::proto::VarType::FP32);
factor_tensor.mutable_data<float>({1}, place);
TensorFromVector(std::vector<float>{factor}, ctx.device_context(),
&factor_tensor);
// 2.2 Get the factor which has the shape with x and the same value with
// factor.
Tensor factor_bc_tensor(framework::proto::VarType::FP32);
factor_bc_tensor.mutable_data<float>(x_dims, place);
auto runner_bc = NpuOpRunner("FillD", {factor_tensor}, {factor_bc_tensor},
{{"dims", framework::vectorize(x_dims)}});
runner_bc.Run(stream);
// Step 3: Compute x_power_mul_factor = factor * x.pow(factor-1)
Tensor x_power_mul_factor(x->type());
x_power_mul_factor.mutable_data<T>(x->dims(), place);
auto runner_mul_1 =
NpuOpRunner("Mul", {factor_bc_tensor, x_pow}, {x_power_mul_factor}, {});
runner_mul_1.Run(stream);
// Step 4: Compute dx = dout * factor * x.pow(factor-1)
dx->mutable_data<T>(place);
auto runner_mul_2 =
NpuOpRunner("Mul", {*dout, x_power_mul_factor}, {*dx}, {});
runner_mul_2.Run(stream);
}
};
template <typename DeviceContext, typename T>
class ReluNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("Relu",
{
*x,
},
{*out}, {});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class ReluGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* out = ctx.Input<Tensor>("Out");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
dx->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReluGrad", {*dout, *out}, {*dx}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class SqrtNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Sqrt", {*x}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class SqrtGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* out = ctx.Input<Tensor>("Out");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto place = ctx.GetPlace();
dx->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto dx_runner = NpuOpRunner("SqrtGrad", {*out, *dout}, {*dx}, {});
dx_runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class LogNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
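// log(x) is computed indirectly as Log1p(x - 1): build a tensor of ones,
// subtract it from x, then apply Log1p.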
Tensor one(x->type());
one.mutable_data<T>(x->dims(), place);
auto one_runner = NpuOpRunner("OnesLike", {*x}, {one}, {});
one_runner.Run(stream);
Tensor sub(x->type());
sub.mutable_data<T>(x->dims(), place);
auto sub_runner = NpuOpRunner("Sub", {*x, one}, {sub}, {});
sub_runner.Run(stream);
auto out_runner = NpuOpRunner("Log1p", {sub}, {*out}, {});
out_runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class LogGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* x = ctx.Input<Tensor>("X");
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto place = ctx.GetPlace();
dx->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("DivNoNan", {*dout, *x}, {*dx}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class TanhNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Tanh", {*x}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class TanhGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* out = ctx.Input<Tensor>("Out");
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto place = ctx.GetPlace();
dx->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto dx_runner = NpuOpRunner("TanhGrad", {*out, *dout}, {*dx}, {});
dx_runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class SquareNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Square", {*x}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
pow, ops::PowNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::PowNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
pow_grad, ops::PowGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::PowGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
relu, ops::ReluNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ReluNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
relu_grad,
ops::ReluGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ReluGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
sqrt, ops::SqrtNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::SqrtNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
sqrt_grad,
ops::SqrtGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::SqrtGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
log, ops::LogNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::LogNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
log_grad, ops::LogGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::LogGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
tanh, ops::TanhNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::TanhNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
tanh_grad,
ops::TanhGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::TanhGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
square, ops::SquareNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::SquareNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>,
ops::SquareNPUKernel<paddle::platform::NPUDeviceContext, int>);
@@ -4,3 +4,7 @@ if(WITH_UNITY_BUILD)
include(unity_build_rule.cmake)
endif()
register_operators()
if(WITH_ASCEND_CL)
cc_test(check_finite_and_unscale_op_npu_test SRCS check_finite_and_unscale_op_npu_test.cc DEPS op_registry check_finite_and_unscale_op scope device_context enforce executor)
endif()
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/amp/check_finite_and_unscale_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class CheckFiniteAndUnscaleNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const {
const auto xs = ctx.MultiInput<framework::Tensor>("X");
const auto* scale = ctx.Input<framework::Tensor>("Scale");
auto outs = ctx.MultiOutput<framework::Tensor>("Out");
auto* found_inf = ctx.Output<framework::Tensor>("FoundInfinite");
found_inf->mutable_data<bool>(ctx.GetPlace());
bool found_inf_data = false;
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// step1: inverse scale(RealDiv)
Tensor const_tensor;
const_tensor.mutable_data<T>({1}, ctx.GetPlace());
TensorFromVector(std::vector<T>{static_cast<T>(1.0)}, ctx.device_context(),
&const_tensor);
ctx.template device_context<paddle::platform::NPUDeviceContext>().Wait();
// Inverse(1.0/scale)
Tensor* tmp_inverse_out = const_cast<Tensor*>(scale);
Tensor inverse_out(scale->type());
inverse_out.Resize(scale->dims());
inverse_out.mutable_data<T>(ctx.GetPlace());
auto runner_inverse =
NpuOpRunner("Div", {const_tensor, *scale}, {inverse_out}, {});
runner_inverse.Run(stream);
tmp_inverse_out = &inverse_out;
size_t x_size = xs.size();
for (size_t i = 0; i < x_size; ++i) {
found_inf_data = true;
const auto* x = xs[i];
auto* out = outs[i];
out->mutable_data<T>(ctx.GetPlace());
// step2: CheckNumerics
// CheckNumerics runs on the Ascend AI CPU, which delivers poor
// performance.
Tensor check_xout(x->type());
check_xout.Resize(x->dims());
check_xout.mutable_data<T>(ctx.GetPlace());
try {
auto runner_checknumerics =
NpuOpRunner("CheckNumerics", {*x}, {check_xout},
{{"message", std::string("check_nan_and_inf")}});
runner_checknumerics.Run(stream);
} catch (platform::EnforceNotMet& exception) {
LOG(WARNING) << "[check_nan_and_inf] detected contains NaN or INF!!!";
found_inf_data = true;
} catch (...) {
LOG(WARNING) << "[check_nan_and_inf] detected contains NaN or INF!!!";
found_inf_data = true;
}
if (!found_inf_data) {
// Mul: out = x * (1 / scale)
auto runner_matmul =
NpuOpRunner("Mul", {*x, *tmp_inverse_out}, {*out}, {});
runner_matmul.Run(stream);
} else {
// ZerosLike
auto runner_zeroslike = NpuOpRunner("ZerosLike", {*x}, {*out}, {});
runner_zeroslike.Run(stream);
} // end if
} // end for
// set found_inf to true
if (found_inf_data) {
Tensor found_inf_tensor;
found_inf_tensor.Resize({1});
bool* is_found_inf =
found_inf_tensor.mutable_data<bool>(paddle::platform::CPUPlace());
*is_found_inf = true;
framework::TensorCopySync(found_inf_tensor, ctx.GetPlace(), found_inf);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(check_finite_and_unscale,
ops::CheckFiniteAndUnscaleNPUKernel<float>,
ops::CheckFiniteAndUnscaleNPUKernel<plat::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <algorithm>
#include <cstdlib>
#include <memory>
#include <random>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/platform/enforce.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
using Tensor = paddle::framework::Tensor;
USE_OP(check_finite_and_unscale);
USE_OP_DEVICE_KERNEL(check_finite_and_unscale, NPU);
struct InputVars {
std::string name;
f::LoDTensor *tensor;
};
template <typename T>
void Compare(f::Scope *scope, const p::DeviceContext &ctx) {
const f::DDim dims = f::make_ddim({2, 2});
auto place = ctx.GetPlace();
// init input
std::vector<InputVars> input_names = {
{"x", scope->Var("x")->GetMutable<f::LoDTensor>()},
{"x1", scope->Var("x1")->GetMutable<f::LoDTensor>()}};
auto *scale = scope->Var("scale")->GetMutable<f::LoDTensor>();
// init output
auto *out = scope->Var("out")->GetMutable<f::LoDTensor>();
auto *out1 = scope->Var("out1")->GetMutable<f::LoDTensor>();
auto *found_inf = scope->Var("found_inf")->GetMutable<f::LoDTensor>();
// Initialize input data
const int num_inputs = input_names.size();
size_t numel = static_cast<size_t>(f::product(dims));
for (int i = 0; i < num_inputs; ++i) {
std::vector<T> init_xs;
for (size_t j = 0; j < numel; ++j) {
if (j == 0) {
init_xs.push_back(static_cast<T>(NAN));
} else {
init_xs.push_back(static_cast<T>(j + 1));
}
}
f::TensorFromVector(init_xs, ctx, input_names[i].tensor);
input_names[i].tensor->Resize(dims);
}
f::TensorFromVector(std::vector<T>{static_cast<T>(0.5)}, ctx, scale);
ctx.Wait();
// run
f::AttributeMap attrs;
auto op = f::OpRegistry::CreateOp(
"check_finite_and_unscale", {{"X", {"x", "x1"}}, {"Scale", {"scale"}}},
{{"Out", {"out", "out1"}}, {"FoundInfinite", {"found_inf"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
// out0
std::vector<T> out_vec;
f::TensorToVector(*out, ctx, &out_vec);
EXPECT_EQ(out_vec.size(), static_cast<size_t>(4));
for (size_t j = 0; j < out_vec.size(); ++j) {
VLOG(3) << "out_vec[" << j << "]:" << out_vec[j];
}
ctx.Wait();
// out1
std::vector<T> out1_vec;
f::TensorToVector(*out1, ctx, &out1_vec);
EXPECT_EQ(out1_vec.size(), static_cast<size_t>(4));
for (size_t j = 0; j < out1_vec.size(); ++j) {
VLOG(3) << "out1_vec[" << j << "]:" << out1_vec[j];
}
ctx.Wait();
// out found_inf
Tensor found_inf_tensor;
found_inf_tensor.Resize({1});
bool *is_finite_data =
found_inf_tensor.mutable_data<bool>(paddle::platform::CPUPlace());
f::TensorCopy(*found_inf, place, &found_inf_tensor);
EXPECT_FALSE(*is_finite_data);
ctx.Wait();
}
TEST(check_finite_and_unscale, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx);
}
TEST(check_finite_and_unscale, NPU_fp16) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<p::float16>(&scope, ctx);
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/amp/update_loss_scaling_op.h"
#include <cmath>
#include <vector>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
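// Update applies the dynamic loss-scaling policy on the NPU:
// - if an inf/nan was found, zero the good-step counter and increment the
//   bad-step counter; once it reaches decr_every_n_nan_or_inf, multiply the
//   loss scale by decr_ratio (clamping the result to be at least 1) and
//   reset the bad-step counter;
// - otherwise zero the bad-step counter and increment the good-step counter;
//   once it reaches incr_every_n_steps, multiply the loss scale by incr_ratio
//   (keeping the previous scale if the result overflows) and reset the
//   good-step counter.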
template <typename T>
void Update(const platform::NPUDeviceContext& ctx,
const std::vector<bool> found_inf_vec,
const Tensor* pre_loss_scaling_tensor, const Tensor* good_in_tensor,
const Tensor* bad_in_tensor, const int incr_every_n_steps,
const int decr_every_n_nan_or_inf, const float incr_ratio,
const float decr_ratio, Tensor* updated_loss_scaling_tensor,
Tensor* good_out_tensor, Tensor* bad_out_tensor) {
auto place = ctx.GetPlace();
auto stream = ctx.stream();
if (found_inf_vec[0]) {
// good_out_data = 0
auto g = good_out_tensor->mutable_data<int>(place);
platform::NPUMemsetAsync(static_cast<void*>(g), 0,
good_out_tensor->numel() * sizeof(int), stream);
// bad_out_data = bad_in_data + 1
Tensor factor_tensor(bad_out_tensor->type());
factor_tensor.mutable_data<int>({1}, place);
TensorFromVector(std::vector<int>{1}, ctx, &factor_tensor);
auto runner_p2 = NpuOpRunner("Add", {*bad_in_tensor, factor_tensor},
{*bad_out_tensor}, {});
runner_p2.Run(stream);
std::vector<int> bad_out_data;
TensorToVector(*bad_out_tensor, ctx, &bad_out_data);
if (bad_out_data[0] == decr_every_n_nan_or_inf) {
auto runner_p3 = NpuOpRunner("Power", {*pre_loss_scaling_tensor},
{*updated_loss_scaling_tensor},
{{"power", static_cast<float>(1)},
{"scale", decr_ratio},
{"shift", static_cast<float>(0)}});
runner_p3.Run(stream);
std::vector<T> new_loss_scaling;
TensorToVector(*updated_loss_scaling_tensor, ctx, &new_loss_scaling);
if (new_loss_scaling[0] < static_cast<T>(1)) {
// updated_loss_scaling_data = 1
auto runner_p4 = NpuOpRunner("Power", {*pre_loss_scaling_tensor},
{*updated_loss_scaling_tensor},
{{"power", static_cast<float>(1)},
{"scale", static_cast<float>(0)},
{"shift", static_cast<float>(1)}});
runner_p4.Run(stream);
}
// bad_out_data = 0
auto b = bad_out_tensor->mutable_data<int>(place);
platform::NPUMemsetAsync(static_cast<void*>(b), 0,
bad_out_tensor->numel() * sizeof(int), stream);
}
} else {
// bad_out_data = 0
auto b = bad_out_tensor->mutable_data<int>(place);
platform::NPUMemsetAsync(static_cast<void*>(b), 0,
bad_out_tensor->numel() * sizeof(int), stream);
// good_out_data = good_in_data + 1
Tensor factor_tensor(good_out_tensor->type());
factor_tensor.mutable_data<int>({1}, place);
TensorFromVector(std::vector<int>{1}, ctx, &factor_tensor);
auto runner_p2 = NpuOpRunner("Add", {*good_in_tensor, factor_tensor},
{*good_out_tensor}, {});
runner_p2.Run(stream);
std::vector<int> good_out_data;
TensorToVector(*good_out_tensor, ctx, &good_out_data);
if (good_out_data[0] == incr_every_n_steps) {
auto runner_p3 = NpuOpRunner("Power", {*pre_loss_scaling_tensor},
{*updated_loss_scaling_tensor},
{{"power", static_cast<float>(1)},
{"scale", incr_ratio},
{"shift", static_cast<float>(0)}});
runner_p3.Run(stream);
std::vector<T> new_loss_scaling;
TensorToVector(*updated_loss_scaling_tensor, ctx, &new_loss_scaling);
if (!std::isfinite(new_loss_scaling[0])) {
// updated_loss_scaling_data = pre_loss_scaling_data
auto runner_p4 = NpuOpRunner("Power", {*pre_loss_scaling_tensor},
{*updated_loss_scaling_tensor},
{{"power", static_cast<float>(1)},
{"scale", static_cast<float>(1)},
{"shift", static_cast<float>(0)}});
runner_p4.Run(stream);
}
// good_out_data = 0
auto g = good_out_tensor->mutable_data<int>(place);
platform::NPUMemsetAsync(static_cast<void*>(g), 0,
good_out_tensor->numel() * sizeof(int), stream);
}
}
}
template <typename T>
class UpdateLossScalingFunctor<platform::NPUDeviceContext, T> {
public:
void operator()(const platform::NPUDeviceContext& dev_ctx,
const std::vector<bool> found_inf_vec,
const Tensor* pre_loss_scaling_tensor,
const Tensor* good_in_tensor, const Tensor* bad_in_tensor,
const int incr_every_n_steps,
const int decr_every_n_nan_or_inf, const float incr_ratio,
const float decr_ratio, Tensor* updated_loss_scaling_tensor,
Tensor* good_out_tensor, Tensor* bad_out_tensor) const {
Update<T>(dev_ctx, found_inf_vec, pre_loss_scaling_tensor, good_in_tensor,
bad_in_tensor, incr_every_n_steps, decr_every_n_nan_or_inf,
incr_ratio, decr_ratio, updated_loss_scaling_tensor,
good_out_tensor, bad_out_tensor);
}
};
template <typename T>
class LazyZerosNPU {
public:
void operator()(const platform::NPUDeviceContext& dev_ctx,
const std::vector<bool> found_inf_vec,
const std::vector<const framework::Tensor*>& xs,
const std::vector<framework::Tensor*>& outs) const {
for (size_t i = 0; i < xs.size(); ++i) {
auto* out = outs[i];
if (found_inf_vec[0]) {
VLOG(4) << "-- UpdateLossScaling: Find infinite grads. --";
auto place = dev_ctx.GetPlace();
auto stream = dev_ctx.stream();
auto g = out->mutable_data<T>(place);
platform::NPUMemsetAsync(static_cast<void*>(g), 0,
out->numel() * sizeof(T), stream);
}
}
}
};
template <typename DeviceContext, typename T>
class UpdateLossScalingNPUKernel : public framework::OpKernel<T> {
using MPDType = typename details::MPTypeTrait<T>::Type;
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto& dev_ctx = ctx.template device_context<DeviceContext>();
const auto xs = ctx.MultiInput<framework::Tensor>("X");
auto outs = ctx.MultiOutput<framework::Tensor>("Out");
const auto* found_inf = ctx.Input<Tensor>("FoundInfinite");
PADDLE_ENFORCE_EQ(found_inf->numel(), 1,
platform::errors::InvalidArgument(
"FoundInfinite must has only one element."));
std::vector<bool> found_inf_vec;
TensorToVector(*found_inf, ctx.device_context(), &found_inf_vec);
LazyZerosNPU<T>{}(dev_ctx, found_inf_vec, xs, outs);
const bool stop_update = ctx.Attr<bool>("stop_update");
if (stop_update) {
return;
}
const auto* pre_loss_scaling = ctx.Input<Tensor>("PrevLossScaling");
const auto* good_in = ctx.Input<Tensor>("InGoodSteps");
const auto* bad_in = ctx.Input<Tensor>("InBadSteps");
auto* updated_loss_scaling = ctx.Output<Tensor>("LossScaling");
auto* good_out = ctx.Output<Tensor>("OutGoodSteps");
auto* bad_out = ctx.Output<Tensor>("OutBadSteps");
updated_loss_scaling->mutable_data<MPDType>(dev_ctx.GetPlace());
good_out->mutable_data<int>(dev_ctx.GetPlace());
bad_out->mutable_data<int>(dev_ctx.GetPlace());
const int incr_every_n_steps = ctx.Attr<int>("incr_every_n_steps");
const int decr_every_n_nan_or_inf =
ctx.Attr<int>("decr_every_n_nan_or_inf");
const float incr_ratio = ctx.Attr<float>("incr_ratio");
const float decr_ratio = ctx.Attr<float>("decr_ratio");
UpdateLossScalingFunctor<DeviceContext, MPDType>{}(
dev_ctx, found_inf_vec, pre_loss_scaling, good_in, bad_in,
incr_every_n_steps, decr_every_n_nan_or_inf, incr_ratio, decr_ratio,
updated_loss_scaling, good_out, bad_out);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
update_loss_scaling,
ops::UpdateLossScalingNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::UpdateLossScalingNPUKernel<paddle::platform::NPUDeviceContext,
double>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <string>
#include "paddle/fluid/operators/assign_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/platform/float16.h"
namespace paddle {
namespace framework {
class OpDesc;
class Variable;
} // namespace framework
namespace imperative {
class OpBase;
} // namespace imperative
namespace platform {
struct CPUPlace;
struct CUDAPlace;
struct float16;
} // namespace platform
} // namespace paddle
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class AssignNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::LoDTensor>("X");
auto* out = ctx.Output<framework::LoDTensor>("Out");
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("Assign", {*out, *x}, {*out}, {});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(
assign, ops::AssignNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::AssignNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::AssignNPUKernel<paddle::platform::NPUDeviceContext, double>)
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(assign);
USE_OP_DEVICE_KERNEL(assign, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx,
std::string op_type) {
// init
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
std::vector<T> init;
init.push_back(static_cast<T>(1.0));
init.push_back(static_cast<T>(2.0));
init.push_back(static_cast<T>(3.0));
init.push_back(static_cast<T>(4.0));
TensorFromVector(init, ctx, tensor_x);
tensor_x->Resize({4});
ctx.Wait();
auto place = ctx.GetPlace();
auto out = scope->Var("Out");
auto tensor_out = out->GetMutable<f::LoDTensor>();
auto op =
f::OpRegistry::CreateOp(op_type, {{"X", {"X"}}}, {{"Out", {"Out"}}}, {});
op->Run(*scope, place);
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
ctx.Wait();
EXPECT_EQ((uint32_t)out_vec.size(), (uint32_t)4);
EXPECT_EQ(out_vec[0], static_cast<T>(1.0));
EXPECT_EQ(out_vec[1], static_cast<T>(2.0));
EXPECT_EQ(out_vec[2], static_cast<T>(3.0));
EXPECT_EQ(out_vec[3], static_cast<T>(4.0));
}
TEST(assign, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx, "assign");
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
#include "paddle/fluid/operators/cast_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
static std::map<framework::proto::VarType::Type, aclDataType>
DTYPE_2_ACL_DTYPE = {
{framework::proto::VarType::BOOL, ACL_BOOL},
{framework::proto::VarType::INT16, ACL_INT16},
{framework::proto::VarType::INT32, ACL_INT32},
{framework::proto::VarType::INT64, ACL_INT64},
{framework::proto::VarType::FP16, ACL_FLOAT16},
{framework::proto::VarType::FP32, ACL_FLOAT},
{framework::proto::VarType::FP64, ACL_DOUBLE},
};
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class CastNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
int dtype = ctx.Attr<int>("out_dtype");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
auto iter = DTYPE_2_ACL_DTYPE.find(
static_cast<framework::proto::VarType::Type>(dtype));
int aclDtype = iter->second;
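// The lookup assumes out_dtype is one of the types enumerated in
// DTYPE_2_ACL_DTYPE above; other dtypes are not handled by this kernel.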
if (dtype == framework::proto::VarType::FP32) {
out->mutable_data<float>(place);
} else if (dtype == framework::proto::VarType::FP16) {
out->mutable_data<paddle::platform::float16>(place);
} else if (dtype == framework::proto::VarType::INT16) {
out->mutable_data<int16_t>(place);
} else if (dtype == framework::proto::VarType::INT32) {
out->mutable_data<int32_t>(place);
} else if (dtype == framework::proto::VarType::INT64) {
out->mutable_data<int64_t>(place);
} else if (dtype == framework::proto::VarType::FP64) {
out->mutable_data<double>(place);
} else if (dtype == framework::proto::VarType::BOOL) {
out->mutable_data<bool>(place);
}
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Cast", {*x}, {*out},
{{"dst_type", static_cast<int32_t>(aclDtype)}});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
cast, ops::CastNPUKernel<paddle::platform::NPUDeviceContext, int16_t>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, int32_t>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, int64_t>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, bool>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::CastNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
#endif
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/concat_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename T>
class ConcatNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto ins = ctx.MultiInput<framework::LoDTensor>("X");
framework::LoDTensor* out = ctx.Output<framework::LoDTensor>("Out");
PADDLE_ENFORCE_NOT_NULL(ins[0],
platform::errors::NotFound(
"The first input tensor is not initalized."));
auto axis = ctx.Attr<int>("axis");
if (ctx.HasInput("AxisTensor")) {
PADDLE_THROW(platform::errors::NotFound(
"The AxisTensor is not supported on NPU now."));
}
axis = ComputeAxis(static_cast<int64_t>(axis),
static_cast<int64_t>(ins[0]->dims().size()));
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
std::vector<framework::Tensor> inputs;
std::vector<std::string> names;
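// ConcatD takes a variable number of inputs, so collect the non-empty
// tensors and name each one x<i> after its input index for the runner to
// bind by name.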
for (size_t i = 0; i < ins.size(); ++i) {
if (ins[i] && ins[i]->numel() > 0) {
inputs.push_back(*ins[i]);
names.push_back("x" + std::to_string(i));
} else {
continue;
}
}
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner(
"ConcatD", {inputs}, {*out},
{{"concat_dim", axis}, {"N", static_cast<int>(inputs.size())}});
runner.AddInputNames(names);
runner.Run(stream);
}
};
template <typename T>
class ConcatGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* out_grad =
ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
auto ins = ctx.MultiInput<framework::LoDTensor>("X");
auto out_var_names = ctx.OutputNames(framework::GradVarName("X"));
auto outs =
ctx.MultiOutput<framework::LoDTensor>(framework::GradVarName("X"));
PADDLE_ENFORCE_NOT_NULL(ins[0],
platform::errors::NotFound(
"The first input tensor is not initalized."));
auto axis = ctx.Attr<int>("axis");
axis = ComputeAxis(static_cast<int64_t>(axis),
static_cast<int64_t>(ins[0]->dims().size()));
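// Each input's gradient is the slice of out_grad that starts at the running
// offset along the concat axis and spans that input's extent along it.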
int offset = 0;
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
for (size_t j = 0; j < outs.size(); ++j) {
// For stop gradient
// get output tensor that the name is not kEmptyVarName
if (out_var_names[j] != framework::kEmptyVarName &&
outs[j]->numel() != 0UL) {
outs[j]->mutable_data<T>(ctx.GetPlace());
std::vector<int> offsets;
std::vector<int> sizes;
for (int dim = 0; dim < ins[j]->dims().size(); ++dim) {
if (dim == axis) {
offsets.push_back(offset);
sizes.push_back(ins[j]->dims()[dim]);
} else {
offsets.push_back(0);
sizes.push_back(ins[j]->dims()[dim]);
}
}
auto runner = NpuOpRunner("SliceD", {*out_grad}, {*outs[j]},
{{"offsets", offsets}, {"size", sizes}});
runner.Run(stream);
}
if (ins[j]->numel() != 0UL) {
offset += ins[j]->dims()[axis];
}
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(concat, ops::ConcatNPUKernel<float>,
ops::ConcatNPUKernel<paddle::platform::float16>,
ops::ConcatNPUKernel<int>);
REGISTER_OP_NPU_KERNEL(concat_grad, ops::ConcatGradNPUKernel<float>,
ops::ConcatGradNPUKernel<paddle::platform::float16>,
ops::ConcatGradNPUKernel<int>);
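Not part of this PR, but for readers who want to exercise the concat kernel above, a minimal test sketch in the style of the other NPU unit tests in this patch set could look like the following; the helper name ConcatCompare, the chosen shapes and the expected element count are illustrative assumptions only.
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/tensor_util.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
USE_OP(concat);
USE_OP_DEVICE_KERNEL(concat, NPU);
template <typename T>
void ConcatCompare(f::Scope* scope, const p::DeviceContext& ctx) {
  // init: two [2, 3] tensors filled with 1 and 2 respectively
  auto tensor_x0 = scope->Var("X0")->GetMutable<f::LoDTensor>();
  auto tensor_x1 = scope->Var("X1")->GetMutable<f::LoDTensor>();
  TensorFromVector(std::vector<T>(6, static_cast<T>(1.0)), ctx, tensor_x0);
  tensor_x0->Resize(f::make_ddim({2, 3}));
  TensorFromVector(std::vector<T>(6, static_cast<T>(2.0)), ctx, tensor_x1);
  tensor_x1->Resize(f::make_ddim({2, 3}));
  auto tensor_out = scope->Var("Out")->GetMutable<f::LoDTensor>();
  ctx.Wait();
  // run concat along axis 0 -> expected output shape [4, 3]
  f::AttributeMap attrs = {{"axis", 0}};
  auto op = f::OpRegistry::CreateOp("concat", {{"X", {"X0", "X1"}}},
                                    {{"Out", {"Out"}}}, attrs);
  op->Run(*scope, ctx.GetPlace());
  ctx.Wait();
  std::vector<T> out_vec;
  TensorToVector(*tensor_out, ctx, &out_vec);
  EXPECT_EQ(out_vec.size(), static_cast<size_t>(12));
  // rows of X0 come first, then rows of X1
  EXPECT_EQ(out_vec[0], static_cast<T>(1.0));
  EXPECT_EQ(out_vec[11], static_cast<T>(2.0));
}
TEST(concat, NPU_fp32) {
  f::Scope scope;
  p::NPUDeviceContext ctx(p::NPUPlace(0));
  ConcatCompare<float>(&scope, ctx);
}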
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <algorithm>
#include <string>
#include <vector>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/op_version_registry.h"
#include "paddle/fluid/operators/controlflow/compare_op.h"
#include "paddle/fluid/operators/elementwise/elementwise_op_function.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#ifdef PADDLE_WITH_ASCEND_CL
namespace paddle {
namespace operators {
template <typename T>
class EqualNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::LoDTensor>("X");
auto* y = ctx.Input<framework::LoDTensor>("Y");
auto* out = ctx.Output<framework::LoDTensor>("Out");
out->mutable_data<bool>(ctx.GetPlace());
auto runner = NpuOpRunner("Equal", {*x, *y}, {*out}, {});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class LessThanNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::LoDTensor>("X");
auto* y = ctx.Input<framework::LoDTensor>("Y");
auto* z = ctx.Output<framework::LoDTensor>("Out");
// int axis = context.Attr<int>("axis");
z->mutable_data<bool>(ctx.GetPlace()); // allocate
auto runner = NpuOpRunner("Less", {*x, *y}, {*z});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(equal, ops::EqualNPUKernel<float>,
ops::EqualNPUKernel<plat::float16>,
ops::EqualNPUKernel<int>);
REGISTER_OP_NPU_KERNEL(
less_than,
ops::LessThanNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::LessThanNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
#endif
......@@ -78,6 +78,13 @@ class ConditionalOp : public framework::OperatorBase {
framework::TensorCopy(*ips[0], platform::CPUPlace(), &cpu_tensor);
platform::DeviceContextPool::Instance().Get(ips[0]->place())->Wait();
res = cpu_tensor.data<bool>()[0];
#endif
} else if (platform::is_npu_place(ips[0]->place())) {
#ifdef PADDLE_WITH_ASCEND_CL
framework::LoDTensor cpu_tensor;
framework::TensorCopy(*ips[0], platform::CPUPlace(), &cpu_tensor);
platform::DeviceContextPool::Instance().Get(ips[0]->place())->Wait();
res = cpu_tensor.data<bool>()[0];
#endif
} else {
res = ips[0]->data<bool>()[0];
......
......@@ -44,6 +44,11 @@ static void DataCopy(const framework::LoDTensor &src_item,
TensorCopySync(src_item, platform::CPUPlace(), dst_item);
}
#else
#ifdef PADDLE_WITH_ASCEND_CL
if (platform::is_npu_place(src_item.place())) {
platform::DeviceContextPool::Instance().Get(src_item.place())->Wait();
}
#endif
TensorCopySync(src_item, platform::CPUPlace(), dst_item);
#endif
} else {
......
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
#include "paddle/fluid/operators/controlflow/logical_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class LogicalNotNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("LogicalNot", {*x}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
logical_not,
ops::LogicalNotNPUKernel<paddle::platform::NPUDeviceContext, bool>);
#endif
......@@ -12,17 +12,18 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/elementwise/elementwise_add_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class ElementwiseAddNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
......@@ -39,12 +40,127 @@ class ElementwiseAddNPUKernel : public framework::OpKernel<T> {
}
};
template <typename T>
class ElementwiseAddGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// NOTE(zhiqiu): It seems the Ascend Add op follows the broadcast semantics
// with default axis=-1, so add_grad should do the reduce itself if needed.
// For example, the shape of each variable in elementwise_add:
// x, dx: [2, 3, 5]
// y, dy: [1, 5]
// out, dout: [2, 3, 5]
// Then, out = x + y => dx = dout, dy = dout
// And, the shape of dy can be computed by a two-stage reduce:
// 1. [2, 3, 5] => [3, 5], ReduceSumD on axis = 0, keep_dims = false.
// 2. [3, 5] => [1, 5], ReduceSumD on axis = 0, keep_dims = true.
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
// For dx
// stage 1
auto reduce_ndim = dout->dims().size() - dx->dims().size();
std::vector<int> axes;
for (auto i = 0; i < reduce_ndim; ++i) {
axes.push_back(i);
}
Tensor* tmp_dout = const_cast<Tensor*>(dout);
Tensor reduced_dout(dx->type());
if (axes.size() != 0) {
std::vector<int64_t> reduced_dout_dims;
for (auto i = reduce_ndim; i < dout->dims().size(); ++i) {
reduced_dout_dims.push_back(dout->dims()[i]);
}
reduced_dout.Resize(framework::make_ddim(reduced_dout_dims));
reduced_dout.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*dout}, {reduced_dout},
{{"axes", axes}, {"keep_dims", false}});
runner.Run(stream);
tmp_dout = &reduced_dout;
}
// stage 2
axes.clear();
for (auto i = 0; i < dx->dims().size(); ++i) {
if (dx->dims()[i] == 1) {
axes.push_back(i);
}
}
if (axes.size() != 0) {
auto runner = NpuOpRunner("ReduceSumD", {*tmp_dout}, {*dx},
{{"axes", axes}, {"keep_dims", true}});
runner.Run(stream);
} else {
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.Wait();
framework::TensorCopySync(*tmp_dout, ctx.GetPlace(), dx);
}
}
if (dy) {
// For dy
// stage 1
auto reduce_ndim = dout->dims().size() - dy->dims().size();
std::vector<int> axes;
for (auto i = 0; i < reduce_ndim; ++i) {
axes.push_back(i);
}
Tensor* tmp_dout = const_cast<Tensor*>(dout);
Tensor reduced_dout(dout->type());
if (axes.size() != 0) {
std::vector<int64_t> reduced_dout_dims;
for (auto i = reduce_ndim; i < dout->dims().size(); ++i) {
reduced_dout_dims.push_back(dout->dims()[i]);
}
reduced_dout.Resize(framework::make_ddim(reduced_dout_dims));
reduced_dout.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*dout}, {reduced_dout},
{{"axes", axes}, {"keep_dims", false}});
runner.Run(stream);
tmp_dout = &reduced_dout;
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.Wait();
}
// stage 2
axes.clear();
for (auto i = 0; i < dy->dims().size(); ++i) {
if (dy->dims()[i] == 1) {
axes.push_back(i);
}
}
if (axes.size() != 0) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*tmp_dout}, {*dy},
{{"axes", axes}, {"keep_dims", true}});
runner.Run(stream);
} else {
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.Wait();
framework::TensorCopySync(*tmp_dout, ctx.GetPlace(), dy);
}
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(elementwise_add, ops::ElementwiseAddNPUKernel<float>,
ops::ElementwiseAddNPUKernel<plat::float16>);
REGISTER_OP_NPU_KERNEL(elementwise_add_grad,
ops::ElementwiseAddGradNPUKernel<float>,
ops::ElementwiseAddGradNPUKernel<plat::float16>);
#endif
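The two-stage reduce described in the NOTE inside ElementwiseAddGradNPUKernel can be checked on the shapes from that comment. The following dependency-free sketch (plain C++, not part of the PR) only reproduces the axis-selection logic of the two stages:
#include <cstdio>
#include <vector>

int main() {
  // Shapes taken from the NOTE above: dout is [2, 3, 5], dy is [1, 5].
  std::vector<int> dout_dims = {2, 3, 5};
  std::vector<int> dy_dims = {1, 5};

  // Stage 1: reduce (keep_dims = false) the leading axes that dy lacks.
  int reduce_ndim = static_cast<int>(dout_dims.size()) -
                    static_cast<int>(dy_dims.size());
  std::vector<int> stage1_axes;
  for (int i = 0; i < reduce_ndim; ++i) stage1_axes.push_back(i);

  // Stage 2: reduce (keep_dims = true) the axes where dy is broadcast (== 1).
  std::vector<int> stage2_axes;
  for (int i = 0; i < static_cast<int>(dy_dims.size()); ++i) {
    if (dy_dims[i] == 1) stage2_axes.push_back(i);
  }

  // Here: stage 1 reduces axis {0} ([2, 3, 5] -> [3, 5]),
  //       stage 2 reduces axis {0} ([3, 5] -> [1, 5]).
  std::printf("stage 1 reduces %zu axis(es), stage 2 reduces %zu axis(es)\n",
              stage1_axes.size(), stage2_axes.size());
  return 0;
}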
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_div_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwiseDivNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Div", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class ElementwiseDivGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* out = ctx.Input<Tensor>("Out");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
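// y_power = y^(-1), i.e. 1 / y; dx below is formed as dout * (1 / y).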
Tensor y_power(y->type());
y_power.mutable_data<T>(y->dims(), place);
auto y_power_runner = NpuOpRunner("Power", {*y}, {y_power},
{{"power", static_cast<float>(-1)}});
y_power_runner.Run(stream);
if (dx) {
dx->mutable_data<T>(place);
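// Build a float mask that is 1 where x != 0 and 0 where x == 0
// (ZerosLike -> Equal -> LogicalNot -> Cast), then dx = dout * mask / y.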
Tensor tensor_zeros(x->type());
tensor_zeros.mutable_data<T>(x->dims(), place);
auto tensor_zeros_runner =
NpuOpRunner("ZerosLike", {*x}, {tensor_zeros}, {});
tensor_zeros_runner.Run(stream);
Tensor x_zero(paddle::framework::proto::VarType::BOOL);
x_zero.mutable_data<bool>(x->dims(), place);
auto x_zero_runner =
NpuOpRunner("Equal", {*x, tensor_zeros}, {x_zero}, {});
x_zero_runner.Run(stream);
Tensor x_nozero(paddle::framework::proto::VarType::BOOL);
x_nozero.mutable_data<bool>(x->dims(), place);
auto x_nozero_runner =
NpuOpRunner("LogicalNot", {x_zero}, {x_nozero}, {});
x_nozero_runner.Run(stream);
Tensor x_nozero_f(x->type());
x_nozero_f.mutable_data<T>(x->dims(), place);
auto x_nozero_f_runner =
NpuOpRunner("Cast", {x_nozero}, {x_nozero_f},
{{"dst_type", static_cast<int32_t>(0)}});
x_nozero_f_runner.Run(stream);
Tensor x_grad_w(x->type());
x_grad_w.mutable_data<T>(x->dims(), place);
auto x_grad_w_runner =
NpuOpRunner("Mul", {x_nozero_f, y_power}, {x_grad_w}, {});
x_grad_w_runner.Run(stream);
auto x_grad_runner = NpuOpRunner("Mul", {x_grad_w, *dout}, {*dx}, {});
x_grad_runner.Run(stream);
}
if (dy) {
dy->mutable_data<T>(place);
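// dy = dout * (-out / y); since out = x / y, this equals -dout * x / y^2.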
Tensor neg_out(y->type());
neg_out.mutable_data<T>(y->dims(), place);
auto neg_out_runner = NpuOpRunner("Neg", {*out}, {neg_out}, {});
neg_out_runner.Run(stream);
Tensor y_grad_w(y->type());
y_grad_w.mutable_data<T>(y->dims(), place);
auto y_grad_w_runner = NpuOpRunner("Div", {neg_out, *y}, {y_grad_w}, {});
y_grad_w_runner.Run(stream);
auto y_grad_runner = NpuOpRunner("Mul", {y_grad_w, *dout}, {*dy}, {});
y_grad_runner.Run(stream);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_div,
ops::ElementwiseDivNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseDivNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
elementwise_div_grad,
ops::ElementwiseDivGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseDivGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_div_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class ElementwiseFloorDivNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
out->mutable_data<T>(ctx.GetPlace());
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("FloorDiv", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(elementwise_floordiv,
ops::ElementwiseFloorDivNPUKernel<int>,
ops::ElementwiseFloorDivNPUKernel<int64_t>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_max_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwiseMaxNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Maximum", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_max,
ops::ElementwiseMaxNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseMaxNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_min_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwiseMinNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Minimum", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_min,
ops::ElementwiseMinNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseMinNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_mul_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwiseMulNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Mul", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class ElementwiseMulGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
if (dx) {
dx->mutable_data<T>(place);
auto dx_runner = NpuOpRunner("Mul", {*dout, *y}, {*dx}, {});
dx_runner.Run(stream);
}
if (dy) {
dy->mutable_data<T>(place);
auto dy_runner = NpuOpRunner("Mul", {*x, *dout}, {*dy}, {});
dy_runner.Run(stream);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_mul,
ops::ElementwiseMulNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseMulNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
elementwise_mul_grad,
ops::ElementwiseMulGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwiseMulGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
#endif
......@@ -74,6 +74,7 @@ void Compare(f::Scope* scope, const p::DeviceContext& ctx,
{{"Out", {"Out"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
......@@ -131,6 +132,7 @@ void CompareGrad(f::Scope* scope, const p::DeviceContext& ctx,
auto place = ctx.GetPlace();
op->Run(*scope, place);
ctx.Wait();
std::vector<T> dx_vec;
TensorToVector(*tensor_dx, ctx, &dx_vec);
......@@ -179,3 +181,9 @@ TEST(elementwise_sub_grad, NPU) {
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx, "elementwise_sub_grad");
}
TEST(elementwise_add_grad, NPU) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx, "elementwise_add_grad");
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/elementwise/elementwise_pow_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class ElementwisePowNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* y = ctx.Input<Tensor>("Y");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Pow", {*x, *y}, {*out}, {});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
elementwise_pow,
ops::ElementwisePowNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ElementwisePowNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
......@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <memory>
#include <string>
......@@ -24,7 +23,7 @@ namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class ElementwiseSubNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
......@@ -43,7 +42,7 @@ class ElementwiseSubNPUKernel : public framework::OpKernel<T> {
}
};
template <typename T>
class ElementwiseSubGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
......@@ -51,8 +50,9 @@ class ElementwiseSubGradNPUKernel : public framework::OpKernel<T> {
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// NOTE(zhiqiu): It seems Ascend Sub follows the broadcast semantics with
// default axis=-1?
......@@ -66,89 +66,92 @@ class ElementwiseSubGradNPUKernel : public framework::OpKernel<T> {
// 1. [2, 3, 5] => [3, 5], ReduceSumD on axis = 0, keep_dims = false.
// 2. [3, 5] => [1, 5], ReduceSumD on axis = 0, keep_dims = true.
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
// For dx
// stage 1
auto reduce_ndim = dout->dims().size() - dx->dims().size();
std::vector<int> axes;
for (auto i = 0; i < reduce_ndim; ++i) {
axes.push_back(i);
}
Tensor* tmp_dout = const_cast<Tensor*>(dout);
Tensor reduced_dout(dx->type());
if (axes.size() != 0) {
std::vector<int64_t> reduced_dout_dims;
for (auto i = reduce_ndim; i < dout->dims().size(); ++i) {
reduced_dout_dims.push_back(dout->dims()[i]);
}
reduced_dout.Resize(framework::make_ddim(reduced_dout_dims));
reduced_dout.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*dout}, {reduced_dout},
{{"axes", axes}, {"keep_dims", false}});
runner.Run(stream);
tmp_dout = &reduced_dout;
}
// stage 2
axes.clear();
for (auto i = 0; i < dx->dims().size(); ++i) {
if (dx->dims()[i] == 1) {
axes.push_back(i);
}
}
if (axes.size() != 0) {
auto runner = NpuOpRunner("ReduceSumD", {*tmp_dout}, {*dx},
{{"axes", axes}, {"keep_dims", true}});
runner.Run(stream);
} else {
framework::TensorCopySync(*tmp_dout, ctx.GetPlace(), dx);
}
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
// For dy
// stage 1
auto reduce_ndim = dout->dims().size() - dy->dims().size();
std::vector<int> axes;
for (auto i = 0; i < reduce_ndim; ++i) {
axes.push_back(i);
}
Tensor* tmp_dout = const_cast<Tensor*>(dout);
Tensor reduced_dy(dy->type());
Tensor reduced_dout(dy->type());
if (axes.size() != 0) {
std::vector<int64_t> reduced_dout_dims;
for (auto i = reduce_ndim; i < dout->dims().size(); ++i) {
reduced_dout_dims.push_back(dout->dims()[i]);
}
reduced_dout.Resize(framework::make_ddim(reduced_dout_dims));
reduced_dout.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*dout}, {reduced_dout},
{{"axes", axes}, {"keep_dims", false}});
runner.Run(stream);
tmp_dout = &reduced_dout;
}
// stage 2
axes.clear();
Tensor* tmp_dy = tmp_dout;
for (auto i = 0; i < dy->dims().size(); ++i) {
if (dy->dims()[i] == 1) {
axes.push_back(i);
}
}
if (axes.size() != 0) {
reduced_dy.Resize(dy->dims());
reduced_dy.mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceSumD", {*tmp_dout}, {reduced_dy},
{{"axes", axes}, {"keep_dims", true}});
runner.Run(stream);
tmp_dy = &reduced_dy;
}
// stage 3, negative
auto runner = NpuOpRunner("Neg", {*tmp_dy}, {*dy}, {});
runner.Run(stream);
}
}
};
......@@ -156,16 +159,11 @@ class ElementwiseSubGradNPUKernel : public framework::OpKernel<T> {
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(elementwise_sub, ops::ElementwiseSubNPUKernel<float>,
ops::ElementwiseSubNPUKernel<plat::float16>);
REGISTER_OP_NPU_KERNEL(elementwise_sub_grad,
ops::ElementwiseSubGradNPUKernel<float>,
ops::ElementwiseSubGradNPUKernel<plat::float16>);
#endif
......@@ -64,6 +64,12 @@ inline std::vector<int> get_expand_times(
TensorCopySync(*expand_tensor, platform::CPUPlace(), &cpu_expand_tensor);
expand_data = cpu_expand_tensor.data<int>();
}
#ifdef PADDLE_WITH_ASCEND_CL
if (platform::is_npu_place(expand_tensor->place())) {
TensorCopySync(*expand_tensor, platform::CPUPlace(), &cpu_expand_tensor);
expand_data = cpu_expand_tensor.data<int>();
}
#endif
#ifdef PADDLE_WITH_XPU
if (platform::is_xpu_place(expand_tensor->place())) {
TensorCopySync(*expand_tensor, platform::CPUPlace(), &cpu_expand_tensor);
......
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#include <iostream>
#include <memory>
#include <string>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/operators/expand_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class ExpandNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto rank = context.Input<Tensor>("X")->dims().size();
PADDLE_ENFORCE_GE(
rank, 1,
platform::errors::InvalidArgument(
"The number of dimensions of the input 'x' for Op(expand) "
"must be greater than or equal to 1, but the value received is %d.",
rank));
PADDLE_ENFORCE_LE(
rank, MAX_RANK_SUPPORTED,
platform::errors::InvalidArgument(
"The number of dimensions of the input 'x' for Op(expand) "
"must be less than or equal to %d, but the value received is %d.",
MAX_RANK_SUPPORTED, rank));
switch (rank) { REP_EXPAND_TEMPLATE(MAX_RANK_SUPPORTED) }
}
protected:
template <int Rank>
void Expand(const framework::ExecutionContext& context) const {
auto* in0 = context.Input<framework::LoDTensor>("X");
auto in_dims = in0->dims();
auto expand_times = get_expand_times(context);
PADDLE_ENFORCE_EQ(
static_cast<size_t>(in_dims.size()), expand_times.size(),
platform::errors::InvalidArgument(
"The number of elements (%d) of 'expand_times' for "
"Op(expand) must be equal to the number "
"of dimensions (%d) of the input.",
expand_times.size(), static_cast<size_t>(in_dims.size())));
auto* out0 = context.Output<framework::LoDTensor>("Out");
framework::DDim out_dims(in_dims);
for (size_t i = 0; i < expand_times.size(); ++i) {
out_dims[i] *= expand_times[i];
}
out0->Resize(out_dims);
out0->mutable_data<T>(context.device_context().GetPlace());
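// TileD repeats the input along each axis according to the "multiples"
// attribute, which carries the expand_times of this op.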
auto runner =
NpuOpRunner("TileD", {*in0}, {*out0}, {{"multiples", expand_times}});
auto stream =
context.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
expand, ops::ExpandNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::ExpandNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
#endif
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <iostream>
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(expand);
USE_OP_DEVICE_KERNEL(expand, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx) {
// init
auto in = scope->Var("X");
auto expand_times = scope->Var("ExpandTimes");
auto out = scope->Var("Out");
auto in_t = in->GetMutable<f::LoDTensor>();
auto out_t = out->GetMutable<f::LoDTensor>();
auto expand_times_t = expand_times->GetMutable<f::LoDTensor>();
auto place = ctx.GetPlace();
TensorFromVector(std::vector<T>(3 * 1 * 7, 1), ctx, in_t);
TensorFromVector(std::vector<int>({1, 10, 1}), ctx, expand_times_t);
in_t->Resize(f::make_ddim({3, 1, 7}));
expand_times_t->Resize(f::make_ddim({3}));
out_t->Resize(f::make_ddim({3, 10, 7}));
out_t->mutable_data<T>(place);
f::AttributeMap attrs = {{}};
auto op = f::OpRegistry::CreateOp(
"expand", {{"X", {"X"}}, {"ExpandTimes", {"ExpandTimes"}}},
{{"Out", {"Out"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
auto out_dim = out_t->dims();
EXPECT_EQ(out_dim.at(0), 3);
EXPECT_EQ(out_dim.at(1), 10);
EXPECT_EQ(out_dim.at(2), 7);
}
TEST(expand, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx);
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/fill_constant_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/operators/utils.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class FillConstantNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto data_type =
static_cast<framework::proto::VarType::Type>(ctx.Attr<int>("dtype"));
auto str_value = ctx.Attr<std::string>("str_value");
auto float_value = ctx.Attr<float>("value");
auto* out_var = ctx.Output<framework::Tensor>("Out");
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
T value;
if (str_value.empty()) {
value = static_cast<T>(float_value);
} else {
// handle NaN/Inf first, which cannot be read from stream.
if (str_value == "inf") {
value = static_cast<T>(std::numeric_limits<double>::infinity());
} else if (str_value == "-inf") {
value = static_cast<T>(-std::numeric_limits<double>::infinity());
} else if (str_value == "nan") {
value = static_cast<T>(std::numeric_limits<double>::quiet_NaN());
} else {
std::stringstream convert_stream(str_value);
if (std::is_same<int64_t, T>::value) {
int64_t tmp_value;
convert_stream >> tmp_value;
value = static_cast<T>(tmp_value);
} else {
double tmp_value;
convert_stream >> tmp_value;
value = static_cast<T>(tmp_value);
}
}
}
auto shape = GetShape(ctx);
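// Wrap the scalar in a one-element tensor; FillD then expands it to the
// requested shape given by the "dims" attribute.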
Tensor tensor_tmp(data_type);
tensor_tmp.mutable_data<T>({1}, ctx.GetPlace());
TensorFromVector(std::vector<T>{value}, ctx.device_context(), &tensor_tmp);
out_var->mutable_data<T>(shape, place);
auto runner = NpuOpRunner("FillD", {tensor_tmp}, {*out_var},
{{"dims", framework::vectorize(shape)}});
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
fill_constant,
ops::FillConstantNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::FillConstantNPUKernel<paddle::platform::NPUDeviceContext, bool>,
ops::FillConstantNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::FillConstantNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/gather_op.h"
#include <memory>
#include <string>
#include <vector>
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/kron_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/platform/npu_info.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class GatherOpNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
auto *x = ctx.Input<Tensor>("X");
auto *index = ctx.Input<Tensor>("Index");
auto *out = ctx.Output<Tensor>("Out");
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("Gather", {*x, *index}, {*out},
{{"validate_indices", true}});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class GatherGradOpNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
auto *index = ctx.Input<Tensor>("Index");
auto *x = ctx.Input<Tensor>("X");
auto *dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto *dx = ctx.Output<Tensor>(framework::GradVarName("X"));
// step1: Unsqueeze index
framework::Tensor tmp_tensor(index->type());
const auto index_dims = index->dims();
if (index_dims.size() == 1) {
tmp_tensor.ShareDataWith(*index);
std::vector<int64_t> new_dim = {index_dims[0], 1};
tmp_tensor.Resize(framework::make_ddim(new_dim));
index = &tmp_tensor;
}
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// step2: ZerosLike x in device
Tensor zeroslike_xout(x->type());
zeroslike_xout.Resize(x->dims());
auto p = zeroslike_xout.mutable_data<T>(ctx.GetPlace());
platform::NPUMemsetAsync(static_cast<void *>(p), 0,
zeroslike_xout.numel() * sizeof(T), stream);
// step3: scatter(x_grad)
dx->mutable_data<T>(ctx.GetPlace());
auto runner_scatter = NpuOpRunner(
"TensorScatterUpdate", {zeroslike_xout, *index, *dout}, {*dx}, {});
runner_scatter.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
gather, ops::GatherOpNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::GatherOpNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::GatherOpNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
gather_grad,
ops::GatherGradOpNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::GatherGradOpNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::GatherGradOpNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/gather_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(gather);
USE_OP_DEVICE_KERNEL(gather, NPU);
USE_OP(gather_grad);
USE_OP_DEVICE_KERNEL(gather_grad, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx,
std::string op_type) {
// init
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
auto index = scope->Var("Index");
auto tensor_index = index->GetMutable<f::LoDTensor>();
std::vector<T> init_x;
for (int64_t i = 1; i < 7; ++i) {
// 1,2,3,4,5,6
init_x.push_back(static_cast<T>(i));
}
// [[1, 2],[3, 4],[5, 6]]
TensorFromVector(init_x, ctx, tensor_x);
tensor_x->Resize(paddle::framework::make_ddim({3, 2}));
std::vector<int> init_index = {1, 2};
paddle::framework::TensorFromVector<int>(init_index, ctx, tensor_index);
tensor_index->Resize(paddle::framework::make_ddim({2}));
ctx.Wait();
auto out = scope->Var("Out");
auto tensor_out = out->GetMutable<f::LoDTensor>();
// run
f::AttributeMap attrs = {{"validate_indices", true}};
auto op = f::OpRegistry::CreateOp(
op_type, {{"X", {"X"}}, {"Index", {"Index"}}}, {{"Out", {"Out"}}}, attrs);
auto place = ctx.GetPlace();
op->Run(*scope, place);
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
ctx.Wait();
// ref:https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/tensor/manipulation/gather_cn.html#gather
for (int i = 0; i < static_cast<int>(out_vec.size()); ++i) {
VLOG(3) << "out_vec[" << i << "] : " << out_vec[i];
}
uint32_t expected_size = 4;
EXPECT_EQ((uint32_t)out_vec.size(), expected_size);
// {3, 4, 5, 6}
std::vector<T> expected_out_vec;
for (int64_t i = 3; i < 7; ++i) {
expected_out_vec.push_back(static_cast<T>(i));
}
for (uint32_t i = 0; i < out_vec.size(); i++) {
EXPECT_EQ(out_vec[i], expected_out_vec[i]);
}
}
template <typename T>
void CompareGrad(f::Scope* scope, const p::DeviceContext& ctx,
std::string op_type) {
// init
auto index = scope->Var("Index");
auto tensor_index = index->GetMutable<f::LoDTensor>();
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
auto dout = scope->Var("DOut");
auto tensor_dout = dout->GetMutable<f::LoDTensor>();
std::vector<int> init_index = {0, 1};
paddle::framework::TensorFromVector<int>(init_index, ctx, tensor_index);
tensor_index->Resize(paddle::framework::make_ddim({2}));
std::vector<T> init_x = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0};
TensorFromVector(init_x, ctx, tensor_x);
tensor_x->Resize(paddle::framework::make_ddim({3, 2}));
std::vector<T> init_dout = {5.0, 10.0, 2.0, 3.0};
TensorFromVector(init_dout, ctx, tensor_dout);
tensor_dout->Resize(paddle::framework::make_ddim({2, 2}));
ctx.Wait();
auto dx = scope->Var("DX");
auto tensor_dx = dx->GetMutable<f::LoDTensor>();
// run
f::AttributeMap attrs;
auto op = f::OpRegistry::CreateOp(
op_type, {{"X", {"X"}}, {"Index", {"Index"}}, {"Out@GRAD", {"DOut"}}},
{{"X@GRAD", {"DX"}}}, attrs);
auto place = ctx.GetPlace();
op->Run(*scope, place);
std::vector<T> dx_vec;
TensorToVector(*tensor_dx, ctx, &dx_vec);
ctx.Wait();
uint32_t expected_size = 3 * 2;
EXPECT_EQ((uint32_t)dx_vec.size(), expected_size);
std::vector<T> expected_dx_vec = {5.0, 10.0, 2.0, 3.0, 0.0, 0.0};
for (uint32_t i = 0; i < dx_vec.size(); i++) {
VLOG(3) << "dx_vec[i]=" << dx_vec[i];
EXPECT_EQ(dx_vec[i], expected_dx_vec[i]);
}
}
TEST(gather, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx, "gather");
}
TEST(gather, NPU_fp16) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<p::float16>(&scope, ctx, "gather");
}
TEST(gather_grad, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx, "gather_grad");
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/gelu_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename DeviceContext, typename T>
class GeluNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* out = ctx.Output<Tensor>("Out");
auto place = ctx.GetPlace();
out->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto runner = NpuOpRunner("Gelu", {*x}, {*out}, {});
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class GeluGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<Tensor>("X");
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto place = ctx.GetPlace();
dx->mutable_data<T>(place);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
Tensor out(x->type());
out.mutable_data<T>(x->dims(), place);
auto out_runner = NpuOpRunner("Gelu", {*x}, {out}, {});
out_runner.Run(stream);
auto dx_runner = NpuOpRunner("GeluGrad", {*dout, *x, out}, {*dx}, {});
dx_runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
gelu, ops::GeluNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::GeluNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
gelu_grad,
ops::GeluGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::GeluGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(gelu);
USE_OP_DEVICE_KERNEL(gelu, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx) {
// init
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
std::vector<T> init_x;
for (int64_t i = 0; i < 10 * 10; ++i) {
init_x.push_back(static_cast<T>(1.0));
}
TensorFromVector(init_x, ctx, tensor_x);
tensor_x->Resize({10, 10});
auto out = scope->Var("Out");
auto tensor_out = out->GetMutable<f::LoDTensor>();
f::AttributeMap attrs;
ctx.Wait();
// run
auto place = ctx.GetPlace();
auto op = f::OpRegistry::CreateOp("gelu", {{"X", {"X"}}}, {{"Out", {"Out"}}},
attrs);
op->Run(*scope, place);
ctx.Wait();
// eval time
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < 100; i++) {
op->Run(*scope, place);
}
ctx.Wait();
gettimeofday(&end, NULL);
int micros =
(((end.tv_sec - start.tv_sec) * 1000000) + end.tv_usec) - (start.tv_usec);
printf("used time: %d\n", micros / 100);
// eval value
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
float expected = 0.841192;
for (uint32_t i = 0; i < out_vec.size(); i++) {
EXPECT_FLOAT_EQ(out_vec[i], static_cast<T>(expected));
}
}
template <typename T>
void CompareGrad(f::Scope* scope, const p::DeviceContext& ctx) {
auto dout = scope->Var("DOut");
auto tensor_dout = dout->GetMutable<f::LoDTensor>();
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
std::vector<T> init_dout;
for (int64_t i = 0; i < 10 * 10; ++i) {
init_dout.push_back(static_cast<T>(1.0));
}
std::vector<T> init_x;
for (int64_t i = 0; i < 10 * 10; ++i) {
init_x.push_back(static_cast<T>(1.0));
}
TensorFromVector(init_dout, ctx, tensor_dout);
tensor_dout->Resize({10, 10});
TensorFromVector(init_x, ctx, tensor_x);
tensor_x->Resize({10, 10});
auto dx = scope->Var("DX");
auto tensor_dx = dx->GetMutable<f::LoDTensor>();
f::AttributeMap attrs;
ctx.Wait();
// run
auto place = ctx.GetPlace();
auto op = f::OpRegistry::CreateOp("gelu_grad",
{{"Out@GRAD", {"DOut"}}, {"X", {"X"}}},
{{"X@GRAD", {"DX"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
// eval time
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < 100; i++) {
op->Run(*scope, place);
}
ctx.Wait();
gettimeofday(&end, NULL);
int micros =
(((end.tv_sec - start.tv_sec) * 1000000) + end.tv_usec) - (start.tv_usec);
printf("used time: %d\n", micros / 100);
// eval value
std::vector<T> dx_vec;
TensorToVector(*tensor_dx, ctx, &dx_vec);
float expected = 1.082964;
for (uint32_t i = 0; i < dx_vec.size(); i++) {
EXPECT_FLOAT_EQ(dx_vec[i], static_cast<T>(expected));
}
}
TEST(gelu, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx);
}
TEST(gelu_grad, NPU) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx);
}
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "paddle/fluid/operators/increment_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/platform/float16.h"
namespace paddle {
namespace framework {
class OpDesc;
class Variable;
} // namespace framework
namespace imperative {
class OpBase;
} // namespace imperative
} // namespace paddle
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class IncrementalNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto* x_tensor = context.Input<framework::Tensor>("X");
auto* out_tensor = context.Output<framework::Tensor>("Out");
float step = context.Attr<float>("step");
out_tensor->mutable_data<T>(context.GetPlace());
Tensor step_tensor(x_tensor->type());
std::vector<T> step_vec;
step_vec.push_back(static_cast<T>(step));
framework::TensorFromVector(step_vec, context.device_context(),
&step_tensor);
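// Out = X + step: the step scalar is wrapped in a one-element tensor and
// added to X with the Ascend Add op.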
auto runner =
NpuOpRunner("Add", {*x_tensor, step_tensor}, {*out_tensor}, {});
auto stream =
context.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace plat = paddle::platform;
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
increment,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext, int64_t>,
ops::IncrementalNPUKernel<paddle::platform::NPUDeviceContext,
plat::float16>)
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(increment);
USE_OP_DEVICE_KERNEL(increment, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx,
std::string op_type) {
// init
auto x = scope->Var("X");
auto tensor_x = x->GetMutable<f::LoDTensor>();
std::vector<T> init;
init.push_back(static_cast<T>(1.0));
TensorFromVector(init, ctx, tensor_x);
tensor_x->Resize({1});
ctx.Wait();
auto place = ctx.GetPlace();
auto out = scope->Var("Out");
auto tensor_out = out->GetMutable<f::LoDTensor>();
f::AttributeMap attr_input = {{"step", static_cast<float>(2.0)}};
auto op = f::OpRegistry::CreateOp("increment", {{"X", {"X"}}},
{{"Out", {"Out"}}}, attr_input);
op->Run(*scope, place);
std::vector<T> out_vec;
TensorToVector(*tensor_out, ctx, &out_vec);
ctx.Wait();
EXPECT_EQ((uint32_t)out_vec.size(), (uint32_t)1);
EXPECT_EQ(out_vec[0], static_cast<T>(3.0));
}
TEST(increment, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx, "increment");
}
TEST(increment, NPU_fp64) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<double>(&scope, ctx, "increment");
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/layer_norm_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
using DDim = framework::DDim;
using DataLayout = framework::DataLayout;
template <typename T>
class NormDataType;
template <>
class NormDataType<platform::float16> {
public:
// The scaling param type is float for HALF and FLOAT tensors
using ScalingParamType = const float;
using BatchNormParamType = float;
};
template <>
class NormDataType<float> {
public:
using ScalingParamType = const float;
using BatchNormParamType = float;
};
template <typename T>
using NormDataType = NormDataType<T>;
template <typename T>
using LayerNormParamType = typename NormDataType<T>::BatchNormParamType;
template <typename T>
class LayerNormNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
using U = LayerNormParamType<T>;
const auto begin_norm_axis = ctx.Attr<int>("begin_norm_axis");
const auto epsilon = ctx.Attr<float>("epsilon");
const auto* x = ctx.Input<Tensor>("X");
const auto* scale = ctx.Input<Tensor>("Scale");
const auto* bias = ctx.Input<Tensor>("Bias");
auto* y = ctx.Output<Tensor>("Y");
auto* mean = ctx.Output<Tensor>("Mean");
auto* variance = ctx.Output<Tensor>("Variance");
const auto& x_dims = x->dims();
std::vector<int> axes;
auto matrix_dim = framework::flatten_to_2d(x_dims, begin_norm_axis);
int right = static_cast<int>(matrix_dim[1]);
// The shape of scale and bias should be equal to x.shape[begin_norm_axis:],
// as required by Ascend.
for (auto i = begin_norm_axis; i < x_dims.size(); ++i) {
axes.push_back(x_dims[i]);
}
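// 'axes' now holds x.shape[begin_norm_axis:]; it is used as the FillD output
// shape for the default scale/bias and as their temporary shape below.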
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
Tensor default_scale(x->type());
if (!scale) {
default_scale.mutable_data<T>(framework::make_ddim(axes), place);
Tensor value(x->type());
value.mutable_data<T>({1}, place);
TensorFromVector(std::vector<T>{static_cast<T>(1.0)},
ctx.device_context(), &value);
auto runner =
NpuOpRunner("FillD", {value}, {default_scale}, {{"dims", axes}});
runner.Run(stream);
scale = &default_scale;
} else {
const_cast<Tensor*>(scale)->Resize(framework::make_ddim(axes));
}
Tensor default_bias(x->type());
if (!bias) {
default_bias.mutable_data<T>(framework::make_ddim(axes), place);
Tensor value(x->type());
value.mutable_data<T>({1}, place);
TensorFromVector(std::vector<T>{static_cast<T>(0)}, ctx.device_context(),
&value);
auto runner =
NpuOpRunner("FillD", {value}, {default_bias}, {{"dims", axes}});
runner.Run(stream);
bias = &default_bias;
} else {
const_cast<Tensor*>(bias)->Resize(framework::make_ddim(axes));
}
// cast scale from LayerNormParamType to T if needed
Tensor cast_scale(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
scale->type() == framework::proto::VarType::FP32) {
cast_scale.Resize(scale->dims());
cast_scale.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_scale =
NpuOpRunner("Cast", {*scale}, {cast_scale},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_scale.Run(stream);
} else {
cast_scale.ShareDataWith(*scale);
}
// cast bias from LayerNormParamType to T if needed
Tensor cast_bias(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
bias->type() == framework::proto::VarType::FP32) {
cast_bias.Resize(bias->dims());
cast_bias.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_bias =
NpuOpRunner("Cast", {*bias}, {cast_bias},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_bias.Run(stream);
} else {
cast_bias.ShareDataWith(*bias);
}
y->mutable_data<T>(ctx.GetPlace());
// mean should be of U type
Tensor* tmp_mean = mean;
Tensor cast_mean(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
(scale->type() == framework::proto::VarType::FP32 ||
bias->type() == framework::proto::VarType::FP32)) {
cast_mean.Resize(mean->dims());
cast_mean.mutable_data<T>(ctx.GetPlace());
tmp_mean = &cast_mean;
mean->mutable_data<U>(ctx.GetPlace());
} else {
mean->mutable_data<T>(ctx.GetPlace());
}
// same for variance
Tensor* tmp_variance = variance;
Tensor cast_variance(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
(scale->type() == framework::proto::VarType::FP32 ||
bias->type() == framework::proto::VarType::FP32)) {
cast_variance.Resize(variance->dims());
cast_variance.mutable_data<T>(ctx.GetPlace());
tmp_variance = &cast_variance;
variance->mutable_data<U>(ctx.GetPlace());
} else {
variance->mutable_data<T>(ctx.GetPlace());
}
auto runner = NpuOpRunner("LayerNorm", {*x, cast_scale, cast_bias},
{*y, *tmp_mean, *tmp_variance},
{{"begin_norm_axis", begin_norm_axis},
{"begin_params_axis", begin_norm_axis},
{"epsilon", epsilon}});
runner.Run(stream);
// cast back from FP16 to FP32
if (x->type() == framework::proto::VarType::FP16 &&
mean->type() == framework::proto::VarType::FP32) {
auto dst_dtype = ConvertToNpuDtype(mean->type());
auto runner_cast_mean =
NpuOpRunner("Cast", {*tmp_mean}, {*mean},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_mean.Run(stream);
}
// same for variance
if (x->type() == framework::proto::VarType::FP16 &&
variance->type() == framework::proto::VarType::FP32) {
auto dst_dtype = ConvertToNpuDtype(variance->type());
auto runner_cast_variance =
NpuOpRunner("Cast", {*tmp_variance}, {*variance},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_variance.Run(stream);
}
// revert shape of scale and bias
// TODO(zhiqiu): better implementation, use a temporary tensor to avoid
// writing to the input tensors.
const_cast<Tensor*>(scale)->Resize(framework::make_ddim({right}));
const_cast<Tensor*>(bias)->Resize(framework::make_ddim({right}));
}
};
template <typename T>
class LayerNormGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
using U = LayerNormParamType<T>;
const auto begin_norm_axis = ctx.Attr<int>("begin_norm_axis");
const auto* x = ctx.Input<Tensor>("X");
const auto& x_dims = x->dims();
const auto* mean = ctx.Input<Tensor>("Mean");
const auto* variance = ctx.Input<Tensor>("Variance");
const auto* scale = ctx.Input<Tensor>("Scale");
const auto* dy = ctx.Input<Tensor>(framework::GradVarName("Y"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dscale = ctx.Output<Tensor>(framework::GradVarName("Scale"));
auto* dbias = ctx.Output<Tensor>(framework::GradVarName("Bias"));
auto matrix_dim = framework::flatten_to_2d(x_dims, begin_norm_axis);
int right = static_cast<int>(matrix_dim[1]);
std::vector<int> axes;
for (auto i = begin_norm_axis; i < x_dims.size(); ++i) {
axes.push_back(x_dims[i]);
}
auto place = ctx.GetPlace();
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
// No gradient to compute, just return.
if (!dx && !dscale && !dbias) {
return;
}
// The rank of mean should be equal to that of x, as required by Ascend.
std::vector<int> new_shape;
for (auto i = 0; i < begin_norm_axis; ++i) {
new_shape.push_back(x_dims[i]);
}
for (auto i = begin_norm_axis; i < x_dims.size(); ++i) {
new_shape.push_back(1);
}
auto mean_dims = mean->dims();
const_cast<Tensor*>(mean)->Resize(framework::make_ddim(new_shape));
const_cast<Tensor*>(variance)->Resize(framework::make_ddim(new_shape));
Tensor default_scale(x->type());
if (!scale) {
default_scale.mutable_data<T>(framework::make_ddim(axes), place);
Tensor value(x->type());
value.mutable_data<T>({1}, place);
TensorFromVector(std::vector<T>{static_cast<T>(1.0)},
ctx.device_context(), &value);
auto runner =
NpuOpRunner("FillD", {value}, {default_scale}, {{"dims", axes}});
runner.Run(stream);
scale = &default_scale;
} else {
const_cast<Tensor*>(scale)->Resize(framework::make_ddim(axes));
}
// cast scale from LayerNormParamType to T if needed
Tensor cast_scale(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
scale->type() == framework::proto::VarType::FP32) {
cast_scale.Resize(scale->dims());
cast_scale.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_scale =
NpuOpRunner("Cast", {*scale}, {cast_scale},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_scale.Run(stream);
} else {
cast_scale.ShareDataWith(*scale);
}
// cast mean from LayerNormParamType to T if needed
Tensor cast_mean(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
mean->type() == framework::proto::VarType::FP32) {
cast_mean.Resize(mean->dims());
cast_mean.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_mean =
NpuOpRunner("Cast", {*mean}, {cast_mean},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_mean.Run(stream);
} else {
cast_mean.ShareDataWith(*mean);
}
// cast variance from LayerNormParamType to T if needed
Tensor cast_variance(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
variance->type() == framework::proto::VarType::FP32) {
cast_variance.Resize(variance->dims());
cast_variance.mutable_data<T>(ctx.GetPlace());
auto dst_dtype = ConvertToNpuDtype(x->type());
auto runner_cast_variance =
NpuOpRunner("Cast", {*variance}, {cast_variance},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_variance.Run(stream);
} else {
cast_variance.ShareDataWith(*variance);
}
Tensor dx_(dy->type()), dscale_(dy->type()), dbias_(dy->type());
dx = (dx == nullptr) ? &dx_ : dx;
dscale = (dscale == nullptr) ? &dscale_ : dscale;
dbias = (dbias == nullptr) ? &dbias_ : dbias;
dx->Resize(x->dims());
dx->mutable_data<T>(ctx.GetPlace());
dscale->Resize(framework::make_ddim(axes));
dbias->Resize(framework::make_ddim(axes));
// dscale should be of U type
Tensor* tmp_dscale = dscale;
Tensor cast_dscale(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
(mean->type() == framework::proto::VarType::FP32 ||
variance->type() == framework::proto::VarType::FP32)) {
cast_dscale.Resize(dscale->dims());
cast_dscale.mutable_data<T>(ctx.GetPlace());
tmp_dscale = &cast_dscale;
dscale->mutable_data<U>(ctx.GetPlace());
} else {
dscale->mutable_data<T>(ctx.GetPlace());
}
// same for dbias
Tensor* tmp_dbias = dbias;
Tensor cast_dbias(x->type());
if (x->type() == framework::proto::VarType::FP16 &&
(mean->type() == framework::proto::VarType::FP32 ||
variance->type() == framework::proto::VarType::FP32)) {
cast_dbias.Resize(dbias->dims());
cast_dbias.mutable_data<T>(ctx.GetPlace());
tmp_dbias = &cast_dbias;
dbias->mutable_data<U>(ctx.GetPlace());
} else {
dbias->mutable_data<T>(ctx.GetPlace());
}
auto runner = NpuOpRunner("LayerNormGrad",
{*dy, *x, cast_variance, cast_mean, cast_scale},
{*dx, *tmp_dscale, *tmp_dbias}, {});
runner.Run(stream);
// cast back from FP16 to FP32
if (x->type() == framework::proto::VarType::FP16 &&
dscale->type() == framework::proto::VarType::FP32) {
auto dst_dtype = ConvertToNpuDtype(dscale->type());
auto runner_cast_dscale =
NpuOpRunner("Cast", {*tmp_dscale}, {*dscale},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_dscale.Run(stream);
}
// same for dbias
if (x->type() == framework::proto::VarType::FP16 &&
dbias->type() == framework::proto::VarType::FP32) {
auto dst_dtype = ConvertToNpuDtype(dbias->type());
auto runner_cast_dbias =
NpuOpRunner("Cast", {*tmp_dbias}, {*dbias},
{{"dst_type", static_cast<int>(dst_dtype)}});
runner_cast_dbias.Run(stream);
}
const_cast<Tensor*>(mean)->Resize(mean_dims);
const_cast<Tensor*>(variance)->Resize(mean_dims);
const_cast<Tensor*>(scale)->Resize(framework::make_ddim({right}));
dscale->Resize(framework::make_ddim({right}));
dbias->Resize(framework::make_ddim({right}));
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(layer_norm, ops::LayerNormNPUKernel<float>,
ops::LayerNormNPUKernel<plat::float16>);
REGISTER_OP_NPU_KERNEL(layer_norm_grad, ops::LayerNormGradNPUKernel<float>,
ops::LayerNormGradNPUKernel<plat::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <iostream>
#include <memory>
#include <string>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class LookupTableV2NPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
auto *ids_t = ctx.Input<framework::LoDTensor>("Ids"); // int tensor
auto *output_t = ctx.Output<framework::LoDTensor>("Out"); // float tensor
auto *table_t = ctx.Input<framework::LoDTensor>("W");
auto *table_var = ctx.InputVar("W");
PADDLE_ENFORCE_EQ(
table_var->IsType<framework::LoDTensor>(), true,
platform::errors::InvalidArgument("NPU only accepts LoDTensor for W"));
output_t->mutable_data<T>(ctx.GetPlace());
framework::NPUAttributeMap attr_input = {{"validate_indices", false}};
auto runner =
NpuOpRunner("Gather", {*table_t, *ids_t}, {*output_t}, attr_input);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename T>
class LookupTableV2GradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
auto *ids_t = ctx.Input<framework::LoDTensor>("Ids");
auto *output_grad_t =
ctx.Input<framework::LoDTensor>(framework::GradVarName("Out"));
auto *table_grad_t =
ctx.Output<framework::LoDTensor>(framework::GradVarName("W"));
table_grad_t->mutable_data<T>(ctx.GetPlace());
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
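// dW is accumulated with ScatterAdd over a zero-initialized table:
// dW[ids[i], :] += dOut[i, :] for every looked-up id.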
// Build a zero tensor with the same shape as the weight gradient on the device.
Tensor zeroslike_w(table_grad_t->type());
zeroslike_w.Resize(table_grad_t->dims());
auto p = zeroslike_w.mutable_data<T>(ctx.GetPlace());
platform::NPUMemsetAsync(static_cast<void *>(p), 0,
zeroslike_w.numel() * sizeof(T), stream);
table_grad_t->mutable_data<T>(ctx.GetPlace());
auto runner_scatter =
NpuOpRunner("ScatterAdd", {zeroslike_w, *ids_t, *output_grad_t},
{*table_grad_t}, {});
runner_scatter.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
lookup_table_v2,
ops::LookupTableV2NPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::LookupTableV2NPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
lookup_table_v2_grad, ops::LookupTableV2GradNPUKernel<float>,
ops::LookupTableV2GradNPUKernel<paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifndef _WIN32
#include <unistd.h>
#endif
#include <cmath>
#include <iostream>
#include <numeric>
#include <string>
#include <thread> // NOLINT
#include <vector>
#include "gtest/gtest.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/dropout_op.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/string/printf.h"
namespace f = paddle::framework;
namespace p = paddle::platform;
namespace m = paddle::operators::math;
USE_OP(lookup_table_v2);
USE_OP_DEVICE_KERNEL(lookup_table_v2, NPU);
template <typename T>
void Compare(f::Scope* scope, const p::DeviceContext& ctx) {
// init
auto ids = scope->Var("Ids");
auto out = scope->Var("Out");
auto w = scope->Var("W");
auto ids_t = ids->GetMutable<f::LoDTensor>();
auto out_t = out->GetMutable<f::LoDTensor>();
auto w_t = w->GetMutable<f::LoDTensor>();
int bsz = 10;
int dim = 32;
int seqlen = 8;
int vocab_size = 100;
TensorFromVector(std::vector<int64_t>(bsz * seqlen, 3), ctx, ids_t);
std::vector<T> val(vocab_size * dim, 10.);
TensorFromVector(val, ctx, w_t);
ids_t->Resize({bsz, seqlen});
w_t->Resize({vocab_size, dim});
out_t->Resize({bsz, seqlen, dim});
ctx.Wait();
auto place = ctx.GetPlace();
out_t->mutable_data<T>(place);
f::AttributeMap attrs = {{}};
auto op = f::OpRegistry::CreateOp("lookup_table_v2",
{{"W", {"W"}}, {"Ids", {"Ids"}}},
{{"Out", {"Out"}}}, attrs);
op->Run(*scope, place);
std::vector<T> out_v;
TensorToVector(*out_t, ctx, &out_v);
ctx.Wait();
EXPECT_EQ(out_t->numel(), bsz * seqlen * dim);
T res = std::accumulate(out_v.begin(), out_v.end(), 0.);
float eps = 1.e-6;
EXPECT_LT(fabs(res - bsz * seqlen * dim * 10.), eps);
}
template <typename T>
void CompareGrad(f::Scope* scope, const p::DeviceContext& ctx) {
// init
auto w = scope->Var("W");
auto ids = scope->Var("Ids");
auto out = scope->Var("DOut");
auto dw = scope->Var("DW");
auto w_t = w->GetMutable<f::LoDTensor>();
auto ids_t = ids->GetMutable<f::LoDTensor>();
auto out_t = out->GetMutable<f::LoDTensor>();
auto dw_t = dw->GetMutable<f::LoDTensor>();
int bsz = 2;
int dim = 2;
int seqlen = 2;
int vocab_size = 4;
std::vector<int64_t> val_int(bsz * seqlen, 3);
std::vector<T> val(vocab_size * dim, 0.);
std::vector<T> val_out(bsz * seqlen * dim, 1.);
TensorFromVector(val_int, ctx, ids_t);
TensorFromVector(val, ctx, w_t);
TensorFromVector(val, ctx, dw_t);
TensorFromVector(val_out, ctx, out_t);
w_t->Resize({vocab_size, dim});
ids_t->Resize({bsz, seqlen});
out_t->Resize({bsz, seqlen, dim});
dw_t->Resize({vocab_size, dim});
ctx.Wait();
auto place = ctx.GetPlace();
out_t->mutable_data<T>(place);
w_t->mutable_data<T>(place);
dw_t->mutable_data<T>(place);
f::AttributeMap attrs = {{}};
auto op = f::OpRegistry::CreateOp(
"lookup_table_v2_grad",
{{"Ids", {"Ids"}}, {"W", {"W"}}, {"Out@GRAD", {"DOut"}}},
{{"W@GRAD", {"DW"}}}, attrs);
op->Run(*scope, place);
ctx.Wait();
std::vector<T> w_v;
TensorToVector(*dw_t, ctx, &w_v);
ctx.Wait();
EXPECT_EQ(dw_t->numel(), vocab_size * dim);
T res = std::accumulate(w_v.begin(), w_v.end(), 0.);
float eps = 1.e-6;
EXPECT_LT(fabs(res - bsz * seqlen * dim), eps);
}
TEST(lookup_table_v2, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
Compare<float>(&scope, ctx);
}
TEST(lookup_table_v2_grad, NPU_fp32) {
f::Scope scope;
p::NPUDeviceContext ctx(p::NPUPlace(0));
CompareGrad<float>(&scope, ctx);
}
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/matmul_v2_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class MatMulV2NPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::Tensor>("X");
auto* y = ctx.Input<framework::Tensor>("Y");
auto* out = ctx.Output<framework::Tensor>("Out");
bool transpose_x = ctx.Attr<bool>("trans_x");
bool transpose_y = ctx.Attr<bool>("trans_y");
if (x->dims().size() == 2) {
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner(
"MatMul", {*x, *y}, {*out},
{{"transpose_x1", transpose_x}, {"transpose_x2", transpose_y}});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
} else if (x->dims().size() > 2) {
out->mutable_data<T>(ctx.GetPlace());
auto runner =
NpuOpRunner("BatchMatMul", {*x, *y}, {*out},
{{"adj_x1", transpose_x}, {"adj_x2", transpose_y}});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
}
};
template <typename DeviceContext, typename T>
class MatMulV2GradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::Tensor>("X");
auto* y = ctx.Input<framework::Tensor>("Y");
auto* dout = ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<framework::Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<framework::Tensor>(framework::GradVarName("Y"));
bool transpose_y = ctx.Attr<bool>("trans_y");
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
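// For Out = X * Y (2-D case): dX = dOut * Y^T and dY = X^T * dOut; when
// trans_y is true, Out = X * Y^T, so dX = dOut * Y and dY = dOut^T * X.
// The batched (>2-D) branch mirrors this with BatchMatMul.
// Note: trans_x is not consulted in this kernel.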
if (x->dims().size() == 2) {
if (transpose_y) {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx =
NpuOpRunner("MatMul", {*dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", false}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {*dout, *x}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
} else {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx =
NpuOpRunner("MatMul", {*dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", true}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {*x, *dout}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
}
} else if (x->dims().size() > 2) {
if (transpose_y) {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx = NpuOpRunner("BatchMatMul", {*dout, *y}, {*dx},
{{"adj_x1", false}, {"adj_x2", false}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy = NpuOpRunner("BatchMatMul", {*dout, *x}, {*dy},
{{"adj_x1", true}, {"adj_x2", false}});
runner_dy.Run(stream);
}
} else {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx = NpuOpRunner("BatchMatMul", {*dout, *y}, {*dx},
{{"adj_x1", false}, {"adj_x2", true}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy = NpuOpRunner("BatchMatMul", {*x, *dout}, {*dy},
{{"adj_x1", true}, {"adj_x2", false}});
runner_dy.Run(stream);
}
}
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
matmul_v2,
ops::MatMulV2NPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MatMulV2NPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
matmul_v2_grad,
ops::MatMulV2GradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MatMulV2GradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/mean_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/platform/float16.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class MeanNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::LoDTensor>("X");
auto* out = ctx.Output<framework::LoDTensor>("Out");
std::vector<int> axes;
framework::NPUAttributeMap attr_input = {{"keep_dims", false},
{"axes", axes}};
out->mutable_data<T>(ctx.GetPlace());
auto runner = NpuOpRunner("ReduceMeanD", {*x}, {*out}, attr_input);
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
}
};
template <typename DeviceContext, typename T>
class MeanGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto stream =
context.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
auto grad = context.Input<Tensor>(framework::GradVarName("Out"));
PADDLE_ENFORCE_EQ(grad->numel(), 1,
platform::errors::InvalidArgument(
"The input tensor of mean_grad must have exactly 1 element, "
"but Out@GRAD has %d elements.",
grad->numel()));
auto IG = context.Output<Tensor>(framework::GradVarName("X"));
IG->mutable_data<T>(context.GetPlace());
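// The gradient of mean is uniform over the input:
// dX[i] = dOut / numel(X) for every element, assembled below as
// (1 / numel) * OnesLike(X) * dOut.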
// ones
Tensor ones(grad->type());
ones.mutable_data<T>(IG->dims(), context.GetPlace());
auto runner_ones = NpuOpRunner("OnesLike", {*IG}, {ones}, {});
runner_ones.Run(stream);
// means
Tensor mean_tensor(grad->type());
mean_tensor.Resize({1});
mean_tensor.mutable_data<T>(context.GetPlace());
std::vector<float> mean_vec;
mean_vec.push_back(1.0 / static_cast<float>(IG->numel()));
framework::TensorFromVector(mean_vec, context.device_context(),
&mean_tensor);
// means mul ones
Tensor mean_ma(grad->type());
mean_ma.Resize(IG->dims());
mean_ma.mutable_data<T>(context.GetPlace());
auto runner_mul_1 = NpuOpRunner("Mul", {mean_tensor, ones}, {mean_ma}, {});
runner_mul_1.Run(stream);
// and mul grad
auto runner_mul_2 = NpuOpRunner("Mul", {mean_ma, *grad}, {*IG}, {});
runner_mul_2.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_NPU_KERNEL(
mean, ops::MeanNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::MeanNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MeanNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::MeanNPUKernel<paddle::platform::NPUDeviceContext, plat::float16>)
REGISTER_OP_NPU_KERNEL(
mean_grad, ops::MeanGradNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::MeanGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MeanGradNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::MeanGradNPUKernel<paddle::platform::NPUDeviceContext, plat::float16>)
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/controlflow/compare_op.h"
#include "paddle/fluid/operators/metrics/accuracy_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class AccuracyNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* pred = ctx.Input<Tensor>("Out");
auto* label = ctx.Input<Tensor>("Label");
// auto* logits = ctx.Input<Tensor>("Indices");
auto* acc = ctx.Output<Tensor>("Accuracy");
auto* correct = ctx.Output<Tensor>("Correct");
auto* total = ctx.Output<Tensor>("Total");
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
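// Computation sketch: cast pred and label to int32, compare them with Equal,
// cast the boolean mask to float, then
// Accuracy = ReduceMean(mask), Correct = ReduceSum(mask),
// Total = ReduceSum(OnesLike(label)).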
// cast pred
Tensor tmp_pred(pred->type());
tmp_pred.Resize(pred->dims());
tmp_pred.mutable_data<int>(ctx.GetPlace());
auto runner_cast_pred =
NpuOpRunner("Cast", {*pred}, {tmp_pred},
{{"dst_type", static_cast<int>(ACL_INT32)}});
runner_cast_pred.Run(stream);
// cast label
Tensor tmp_label(label->type());
tmp_label.Resize(label->dims());
tmp_label.mutable_data<int>(ctx.GetPlace());
auto runner_cast_label =
NpuOpRunner("Cast", {*label}, {tmp_label},
{{"dst_type", static_cast<int>(ACL_INT32)}});
runner_cast_label.Run(stream);
// equal
Tensor tmp_equal(label->type());
tmp_equal.Resize(label->dims());
tmp_equal.mutable_data<bool>(ctx.GetPlace());
auto runner_equal =
NpuOpRunner("Equal", {tmp_pred, tmp_label}, {tmp_equal}, {});
runner_equal.Run(stream);
// cast equal
Tensor tmp_equal_cast(label->type());
tmp_equal_cast.Resize(label->dims());
tmp_equal_cast.mutable_data<float>(ctx.GetPlace());
auto runner_cast_equal =
NpuOpRunner("Cast", {tmp_equal}, {tmp_equal_cast},
{{"dst_type", static_cast<float>(ACL_FLOAT)}});
runner_cast_equal.Run(stream);
// acc
acc->mutable_data<float>(ctx.GetPlace());
std::vector<int> axes_vec_1;
auto runner_acc = NpuOpRunner("ReduceMeanD", {tmp_equal_cast}, {*acc},
{{"keep_dims", false}, {"axes", axes_vec_1}});
runner_acc.Run(stream);
// correct
correct->mutable_data<float>(ctx.GetPlace());
std::vector<int> axes_vec_2;
auto runner_correct =
NpuOpRunner("ReduceSumD", {tmp_equal_cast}, {*correct},
{{"keep_dims", false}, {"axes", axes_vec_2}});
runner_correct.Run(stream);
// ones_tensor
Tensor ones_tensor(label->type());
ones_tensor.Resize(label->dims());
ones_tensor.mutable_data<int>(ctx.GetPlace());
auto runner_oneslike =
NpuOpRunner("OnesLike", {tmp_label}, {ones_tensor}, {});
runner_oneslike.Run(stream);
// ones_tensor_cast
Tensor ones_tensor_cast(label->type());
ones_tensor_cast.Resize(label->dims());
ones_tensor_cast.mutable_data<float>(ctx.GetPlace());
auto runner_ones_cast =
NpuOpRunner("Cast", {ones_tensor}, {ones_tensor_cast},
{{"dst_type", static_cast<float>(ACL_FLOAT)}});
runner_ones_cast.Run(stream);
// total
total->mutable_data<float>(ctx.GetPlace());
std::vector<int> axes_vec_3;
auto runner_total =
NpuOpRunner("ReduceSumD", {ones_tensor_cast}, {*total},
{{"keep_dims", false}, {"axes", axes_vec_3}});
runner_total.Run(stream);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
accuracy, ops::AccuracyNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::AccuracyNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>,
ops::AccuracyNPUKernel<paddle::platform::NPUDeviceContext, int>,
ops::AccuracyNPUKernel<paddle::platform::NPUDeviceContext, int64_t>);
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/mul_op.h"
#include "paddle/fluid/operators/npu_op_runner.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class MulNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::Tensor>("X");
auto* y = ctx.Input<framework::Tensor>("Y");
auto* out = ctx.Output<framework::Tensor>("Out");
int x_num_col_dims = ctx.Attr<int>("x_num_col_dims");
int y_num_col_dims = ctx.Attr<int>("y_num_col_dims");
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
if (x_num_col_dims == 1 && y_num_col_dims == 1) {
if (x->dims().size() == 2 && y->dims().size() == 2) {
out->mutable_data<T>(ctx.GetPlace());
auto runner =
NpuOpRunner("MatMul", {*x, *y}, {*out},
{{"transpose_x1", false}, {"transpose_x2", false}});
runner.Run(stream);
} else if (x->dims().size() == 3 && y->dims().size() == 2) {
// reshape
Tensor tmp_x(x->type());
int64_t sec_dim = x->dims()[1] * x->dims()[2];
int64_t first_dim = x->dims()[0];
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
tmp_x.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*x, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_x);
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
out->mutable_data<T>(ctx.GetPlace());
// matmul
auto runner =
NpuOpRunner("MatMul", {tmp_x, *y}, {*out},
{{"transpose_x1", false}, {"transpose_x2", false}});
runner.Run(stream);
} else {
PADDLE_THROW(
platform::errors::InvalidArgument("NPU error: unsupported dims"));
}
// TODO: support other x_num_col_dims / y_num_col_dims combinations
} else if (x->dims().size() == 3 && y->dims().size() == 2) {
// for example: x.shape=[2, 3, 4] y.shape=[4, 5], expect [2, 3, 5]
PADDLE_ENFORCE_EQ(x_num_col_dims, 2,
platform::errors::InvalidArgument(
"Only x_num_col_dims == 2 is supported for now, but got %d",
x_num_col_dims));
// flatten => x.shape=[6, 4]
Tensor tmp_x(x->type());
int64_t first_dim = x->dims()[0] * x->dims()[1];
int64_t sec_dim = x->dims()[2];
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
tmp_x.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*x, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_x);
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
// matmul [6,4] , [4, 5] => [6, 5]
Tensor tmp_matmul(x->type());
tmp_matmul.Resize(framework::make_ddim({first_dim, y->dims()[1]}));
tmp_matmul.mutable_data<T>(ctx.GetPlace());
auto runner_matmul =
NpuOpRunner("MatMul", {tmp_x, *y}, {tmp_matmul},
{{"transpose_x1", false}, {"transpose_x2", false}});
runner_matmul.Run(stream);
// reshape [6, 5] => [2, 3, 5]
(*out).Resize(
framework::make_ddim({x->dims()[0], x->dims()[1], y->dims()[1]}));
out->mutable_data(ctx.GetPlace(), x->type());
framework::TensorCopy(
tmp_matmul, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), out);
(*out).Resize(
framework::make_ddim({x->dims()[0], x->dims()[1], y->dims()[1]}));
}
}
};
template <typename DeviceContext, typename T>
class MulGradNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* x = ctx.Input<framework::Tensor>("X");
auto* y = ctx.Input<framework::Tensor>("Y");
auto* dout = ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<framework::Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<framework::Tensor>(framework::GradVarName("Y"));
int x_num_col_dims = ctx.Attr<int>("x_num_col_dims");
int y_num_col_dims = ctx.Attr<int>("y_num_col_dims");
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
if (x_num_col_dims == 1 && y_num_col_dims == 1) {
if (x->dims().size() == 2 && y->dims().size() == 2) {
if (dx) {
dx->mutable_data<T>(ctx.GetPlace());
auto runner_dx =
NpuOpRunner("MatMul", {*dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", true}});
runner_dx.Run(stream);
}
if (dy) {
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {*x, *dout}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
} else if (x->dims().size() == 3 && y->dims().size() == 2) {
// flatten => x.shape=[6, 4]
// matmul
if (dx) {
// matmul [2, 5] * [12, 5] => [2, 12]
dx->mutable_data<T>(ctx.GetPlace());
auto dx_dims = dx->dims();
dx->Resize(framework::make_ddim({dout->dims()[0], y->dims()[0]}));
auto runner_matmul =
NpuOpRunner("MatMul", {*dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", true}});
runner_matmul.Run(stream);
// reshape [2, 12] => [2, 3, 4]
dx->Resize(dx_dims);
}
if (dy) {
// flatten
Tensor tmp_x(x->type());
int64_t sec_dim = x->dims()[1] * x->dims()[2];
int64_t first_dim = x->dims()[0];
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
tmp_x.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*x, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_x);
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {tmp_x, *dout}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
}
} else if (x->dims().size() == 3 && y->dims().size() == 2) {
// for example: x.shape=[2, 3, 4] y.shape=[4, 5], expect [2, 3, 5]
PADDLE_ENFORCE_EQ(x_num_col_dims, 2,
platform::errors::InvalidArgument(
"Only x_num_col_dims == 2 is supported for now, but got %d",
x_num_col_dims));
// tmp_dout is used by both dx and dy
Tensor tmp_dout(x->type());
int64_t dout_first_dim = dout->dims()[0] * dout->dims()[1];
int64_t dout_sec_dim = dout->dims()[2];
tmp_dout.Resize(framework::make_ddim({dout_first_dim, dout_sec_dim}));
tmp_dout.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*dout, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_dout);
tmp_dout.Resize(framework::make_ddim({dout_first_dim, dout_sec_dim}));
if (dx) {
// tmp_dout * y [6,5] * [4,5] => [6, 4]
dx->mutable_data<T>(ctx.GetPlace());
auto dx_dims = dx->dims();
dx->Resize(framework::make_ddim({dout_first_dim, y->dims()[0]}));
auto runner_matmul =
NpuOpRunner("MatMul", {tmp_dout, *y}, {*dx},
{{"transpose_x1", false}, {"transpose_x2", true}});
runner_matmul.Run(stream);
// reshape [6, 4] => [2, 3, 4]
dx->Resize(dx_dims);
}
if (dy) {
// flatten x.shape [2,3,4] => [6, 4]
Tensor tmp_x(x->type());
int64_t first_dim = x->dims()[0] * x->dims()[1];
int64_t sec_dim = x->dims()[2];
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
tmp_x.mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(
*x, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), &tmp_x);
tmp_x.Resize(framework::make_ddim({first_dim, sec_dim}));
// matmul: [6, 4]^T * [6, 5] => [4, 5]
dy->mutable_data<T>(ctx.GetPlace());
auto runner_dy =
NpuOpRunner("MatMul", {tmp_x, tmp_dout}, {*dy},
{{"transpose_x1", true}, {"transpose_x2", false}});
runner_dy.Run(stream);
}
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
mul, ops::MulNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MulNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
REGISTER_OP_NPU_KERNEL(
mul_grad, ops::MulGradNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::MulGradNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
......@@ -64,13 +64,21 @@ aclFormat ConvertToNpuFormat(DataLayout layout) {
return iter->second;
}
aclrtStream GetCurrentNPUStream() {
int device_id = platform::GetCurrentNPUDeviceId();
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
auto *dev_ctx = static_cast<platform::NPUDeviceContext *>(
pool.Get(platform::NPUPlace(device_id)));
return dev_ctx->stream();
}
NpuOpRunner::NpuOpRunner(std::string op_type) : op_type_(op_type) {
attr_ = aclopCreateAttr();
}
NpuOpRunner::NpuOpRunner(std::string op_type, const std::vector<Tensor> &inputs,
const std::vector<Tensor> &outputs,
const AttributeMap &attrs)
const NPUAttributeMap &attrs)
: op_type_(op_type) {
attr_ = aclopCreateAttr();
AddInputs(inputs);
......@@ -85,7 +93,7 @@ NpuOpRunner::~NpuOpRunner() {
const std::string &NpuOpRunner::Type() { return op_type_; }
NpuOpRunner &NpuOpRunner::AddAttr(const std::string &name,
const Attribute &attr) {
const NPUAttribute &attr) {
if (attr.type() == typeid(bool)) {
PADDLE_ENFORCE_NPU_SUCCESS(
aclopSetAttrBool(attr_, name.c_str(), BOOST_GET_CONST(bool, attr)));
......@@ -135,6 +143,16 @@ NpuOpRunner &NpuOpRunner::AddAttr(const std::string &name,
}
PADDLE_ENFORCE_NPU_SUCCESS(
aclopSetAttrListString(attr_, name.c_str(), s.size(), s.data()));
} else if (attr.type() == typeid(std::vector<std::vector<int64_t>>)) {
auto a = BOOST_GET_CONST(std::vector<std::vector<int64_t>>, attr);
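// For example, an attribute value of {{0, 0}, {1, 1}} is flattened below into
// data = {ptr_to_row_0, ptr_to_row_1} and num = {2, 2} before the call to
// aclopSetAttrListListInt (illustrative values).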
std::vector<int64_t *> data;
std::vector<int> num;
for (auto &&v : a) {
data.push_back(v.data());
num.push_back(v.size());
}
PADDLE_ENFORCE_NPU_SUCCESS(aclopSetAttrListListInt(
attr_, name.c_str(), data.size(), num.data(), data.data()));
} else {
PADDLE_THROW(platform::errors::Unimplemented(
"Can not convert attribubte '%s' to convert to aclopAttr", name));
......@@ -142,7 +160,7 @@ NpuOpRunner &NpuOpRunner::AddAttr(const std::string &name,
return *this;
}
NpuOpRunner &NpuOpRunner::AddAttrs(const AttributeMap &attrs) {
NpuOpRunner &NpuOpRunner::AddAttrs(const NPUAttributeMap &attrs) {
for (const auto &pair : attrs) {
AddAttr(pair.first, pair.second);
}
......@@ -175,6 +193,21 @@ NpuOpRunner &NpuOpRunner::AddInputs(const std::vector<Tensor> &tensors) {
return *this;
}
// NOTE(zhiqiu): For operators whose input is a list (such as concat, stack),
// It is needed to set the name of each input tensor.
NpuOpRunner &NpuOpRunner::AddInputNames(const std::vector<std::string> &names) {
PADDLE_ENFORCE_EQ(names.size(), input_descs_.size(),
platform::errors::InvalidArgument(
"The size of input names should be "
"equal to the size of input descs, but got the size "
"of input names is %d, the size of input descs is %d.",
names.size(), input_descs_.size()));
for (size_t i = 0; i < names.size(); ++i) {
aclSetTensorDescName(input_descs_[i], names[i].c_str());
}
return *this;
}
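// Usage sketch (illustrative only; the op name, input names and attributes
// below are assumptions, not taken from this change):
//   NpuOpRunner runner("ConcatD", {t0, t1}, {out},
//                      {{"concat_dim", 0}, {"N", 2}});
//   runner.AddInputNames({"x0", "x1"});
//   runner.Run(stream);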
NpuOpRunner &NpuOpRunner::AddOutputs(const std::vector<Tensor> &tensors) {
for (auto tensor : tensors) {
// create aclTensorDesc
......@@ -224,18 +257,22 @@ aclTensorDesc *NpuOpRunner::CreateTensorDesc(Tensor tensor) {
auto format = ConvertToNpuFormat(tensor.layout());
auto dims = framework::vectorize(tensor.dims());
VLOG(4) << dtype << " " << dims.size() << " " << dims[0] << "," << dims[1]
<< " " << format;
VLOG(4) << "NPU dtype:" << dtype << " "
<< "rank:" << dims.size() << " dims:" << tensor.dims()
<< " format:" << format;
auto *desc = aclCreateTensorDesc(dtype, dims.size(), dims.data(), format);
PADDLE_ENFORCE_NOT_NULL(
desc, platform::errors::External("Call aclCreateTensorDesc failed."));
PADDLE_ENFORCE_NPU_SUCCESS(aclSetTensorStorageFormat(desc, format));
PADDLE_ENFORCE_NPU_SUCCESS(
aclSetTensorStorageShape(desc, dims.size(), dims.data()));
return desc;
}
aclDataBuffer *NpuOpRunner::CreateDataBuffer(Tensor tensor) {
void *ptr = tensor.data<void>();
VLOG(4) << "ptr: " << ptr << ", size: " << tensor.memory_size();
VLOG(4) << "NPU ptr: " << ptr << ", size: " << tensor.memory_size();
auto *buffer = aclCreateDataBuffer(ptr, tensor.memory_size());
PADDLE_ENFORCE_NOT_NULL(
buffer, platform::errors::External("Call aclCreateDataBuffer failed."));
......@@ -243,11 +280,17 @@ aclDataBuffer *NpuOpRunner::CreateDataBuffer(Tensor tensor) {
}
void NpuOpRunner::Run(aclrtStream stream) {
if (!stream) {
VLOG(4) << "Run with default current npu stream: " << stream;
stream = GetCurrentNPUStream();
}
VLOG(4) << "op_type: " << op_type_;
VLOG(4) << "input_desc.size: " << input_descs_.size();
VLOG(4) << "output_desc.size: " << output_descs_.size();
VLOG(4) << "stream: " << stream;
VLOG(4) << "attr: " << attr_;
VLOG(4) << "stream: " << stream;
aclError ret = aclopCompileAndExecute(
op_type_.c_str(), input_descs_.size(), input_descs_.data(),
input_buffers_.data(), output_descs_.size(), output_descs_.data(),
......
......@@ -12,8 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_ASCEND_CL
#pragma once
#include <paddle/fluid/framework/operator.h>
#include <paddle/fluid/framework/type_defs.h>
#include <string>
#include <vector>
......@@ -26,8 +28,8 @@ namespace operators {
using Tensor = framework::Tensor;
using DataLayout = framework::DataLayout;
using Attribute = framework::Attribute;
using AttributeMap = framework::AttributeMap;
using NPUAttribute = framework::NPUAttribute;
using NPUAttributeMap = framework::NPUAttributeMap;
class NpuOpRunner {
public:
......@@ -35,15 +37,15 @@ class NpuOpRunner {
explicit NpuOpRunner(std::string op_type,
const std::vector<Tensor> &inputs = {},
const std::vector<Tensor> &outputs = {},
const AttributeMap &attrs = {});
const NPUAttributeMap &attrs = {});
~NpuOpRunner();
const std::string &Type();
NpuOpRunner &AddAttr(const std::string &name, const Attribute &attr);
NpuOpRunner &AddAttr(const std::string &name, const NPUAttribute &attr);
NpuOpRunner &AddAttrs(const AttributeMap &attrs);
NpuOpRunner &AddAttrs(const NPUAttributeMap &attrs);
NpuOpRunner &AddInput(const Tensor &tensor);
......@@ -51,6 +53,8 @@ class NpuOpRunner {
NpuOpRunner &AddInputs(const std::vector<Tensor> &tensors);
NpuOpRunner &AddInputNames(const std::vector<std::string> &names);
NpuOpRunner &AddOutputs(const std::vector<Tensor> &tensors);
aclTensorDesc *GetInputDesc(size_t index);
......@@ -65,7 +69,7 @@ class NpuOpRunner {
std::vector<aclDataBuffer *> &GetOutputBuffers();
void Run(aclrtStream stream);
void Run(aclrtStream stream = nullptr);
private:
aclTensorDesc *CreateTensorDesc(Tensor tensor);
......@@ -80,5 +84,8 @@ class NpuOpRunner {
aclopAttr *attr_{nullptr};
};
aclDataType ConvertToNpuDtype(framework::proto::VarType::Type dtype);
} // namespace operators
} // namespace paddle
#endif
(Diff collapsed.)
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <memory>
#include <string>
#include "paddle/fluid/operators/npu_op_runner.h"
#include "paddle/fluid/operators/optimizers/sgd_op.h"
namespace paddle {
namespace operators {
template <typename DeviceContext, typename T>
class SGDNPUKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* learning_rate = ctx.Input<framework::LoDTensor>("LearningRate");
auto* param_var = ctx.Input<framework::LoDTensor>("Param");
auto* grad_var = ctx.Input<framework::LoDTensor>("Grad");
auto* param_out = ctx.Output<framework::LoDTensor>("ParamOut");
param_out->mutable_data<T>(ctx.GetPlace());
auto runner =
NpuOpRunner("ApplyGradientDescent",
{*param_var, *learning_rate, *grad_var}, {*param_out}, {});
auto stream =
ctx.template device_context<paddle::platform::NPUDeviceContext>()
.stream();
runner.Run(stream);
// NOTE(zhiqiu): ApplyGradientDescent updates the parameter in place, so
// if param and param_out are not the same tensor, we need to copy.
if (param_out->data<T>() != param_var->data<T>()) {
ctx.template device_context<paddle::platform::NPUDeviceContext>().Wait();
framework::TensorCopySync(*param_var, ctx.GetPlace(), param_out);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_NPU_KERNEL(
sgd, ops::SGDNPUKernel<paddle::platform::NPUDeviceContext, float>,
ops::SGDNPUKernel<paddle::platform::NPUDeviceContext, double>,
ops::SGDNPUKernel<paddle::platform::NPUDeviceContext,
paddle::platform::float16>);
(2 diffs collapsed.)
......@@ -42,3 +42,7 @@ endif()
if(WITH_ROCM)
hip_test(check_reduce_rank_test SRCS check_reduce_rank_test.cu DEPS tensor)
endif()
if(WITH_ASCEND_CL)
cc_test(reduce_any_op_npu_test SRCS reduce_any_op_npu_test.cc DEPS op_registry reduce_any_op scope device_context enforce executor)
endif()
(20 diffs collapsed.)
......@@ -531,7 +531,7 @@ if(WITH_DISTRIBUTE)
bash_test_modules(test_fleet_launch_async START_BASH test_fleet_launch_async.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
bash_test_modules(test_fleet_launch_cloud START_BASH test_fleet_launch_cloud.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
bash_test_modules(test_fleet_launch_nproc START_BASH test_fleet_launch_nproc.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
if(WITH_ASCEND)
if(WITH_ASCEND OR WITH_ASCEND_CL)
bash_test_modules(test_fleet_launch_ascend START_BASH test_fleet_launch_ascend.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
bash_test_modules(test_ascend_group START_BASH test_ascend_group.sh ENVS PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR})
endif()
......
(Diff collapsed.)