Unverified commit e4a134ac, authored by chentianyu03, committed by GitHub

support multiple inputs and outputs (#36851)

* initial tensor design & sign kernel demo

* add move constructor for meta & add lodtensor

* add dirs & sign xpu kernel

* add mean cpu&cuda kernel impl

* move sign & mean xpu & npu kernel

* add selected_rows basic impl

* refactor design, BaseTensor to DenseTensor, etc.

* add scale mkldnn kernel

* polish xpu & npu impl details

* fix mkldnn reuse compile failure

* change tensor operation lib name

* rename util filename

* add more comments

* change TensorImplInterface to TensorInterface

* add kernel key and factory

* remove MKLDNNTensorMeta, add MKLDNNDenseTensor

* change XXDeviceContext to XXContext

* add base kernel registrar utils & test on sign

* replace boost::any by paddle::any

* fix several CI failures

* fix npu compile error

* add ordered map util

* fix multiple ordered_map compile errors

* move dev into include dir

* support sign op in static op run

* fix static op run error

* fix new executor compile failure

* add dygraph branch & remove sign_op.h

* fix test_infer_no_need_buffer_slots

* fix rocm compile link error

* fix unitybuild error & clear glog

* fix npu compile failure

* skip quant trans test

* fix part of windows compile problems

* fix xpu enforce error

* fix inference test failure

* remove ordered_map to fix quant failure

* fix part of rocm compile failures

* add more register kernels

* revert scale kernel temporarily

* fix code format error

* add new kernel registrar macro

* rename top to tcmpt

* revert xpu, npu, mkldnn impl & remove op def

* add kernel args parse functor to auto parse args

* revert some change & add scale kernels

* add op proto in dygraph kernelcontext building

* polish kernel dispatch logic & naming rule

* fix scale kernel match error

* fix scale test failure

* add mean API and unittest

* test mean api success

* add branch to solve compile error

* skip clang format error

* add mean skip rule in op_library

* add dot kernel, api and unittest (#6)

* remove old kernel and add symbol link

* fix dot compile failure

* add macro for module declaration

* fix npu and xpu compile error

* revert sign, mean, scale, dot kernel removing

* add comment for keeping old kernel impl

* fix mutable_data error

* fix bfloat16 conflict

* fix inference undef error

* adapt to msvc compile rules

* polish comment for template inst

* add cmake template instantiation for win

* fix backend to place device id bug

* fix ifdef error

* Op2functor (#7)

* add kernel args maker class

* make args maker non-const

* remove debug log

* modify codes by review options

* split constructPrKernelContext function

* fix output name bug

* fix test_mean_op and test_sign_op failures

* fill_any_like kernel refactor (#10)

* fill_any_like kernel refactor

* remove useless code of full_like c++ api

* skip dtype for fill_any_like

* add attrs for kernel key construction

* add use_pt_kernel flag to control whether to use pt kernels (#13)

* add use_pt_kernel flag to control whether to use pt kernels

* change the default value to true for checking pt kernels

* fix mutable_data cuda place error

* move high level apis into hapi

* remove selectedrows adapting temporarily

* Support Scalar in Tensor Compute Library (#14)

* fill_any_like kernel refactor

* remove useless code of full_like c++ api

* Support Scalar in Tensor Compute Library

* add scalar in dygraph and static graph mode

* keep the basic type for attr, instead of using scalar for all

* merge the code

* remove mkldnn tensor & polish details

* use flat_hash_map and small_vector in kernel factory

* Refactor flatten kernel (#12)

* refactor flatten kernel

* update infershape function

* fix compile bugs

* fix bugs when merge

* fix compiler bugs

* fix bugs when running test_flatten_api

* fix bugs when running tests

* Revert "use flat_hash_map and small_vector in kernel factory"

This reverts commit 23091495cfdd3df8cc1be592d30f09ea66a7c72b.

* Move cpu, cuda and other device code into kernels (#15)

* fill_any_like kernel refactor

* remove useless code of full_like c++ api

* Support Scalar in Tensor Compute Library

* add scalar in dygraph and static graph mode

* keep the basic type for attr, instead of using scalar for all

* merge the code

* start refactor matmul

* move cpu, cuda and other device modules into kernels

* merge code

* polish code in operator.cc

* Perfect unittests (#16)

* perfect unittests

* update license

* replace with flat_hash_map, small_vector (#19)

* fix small_vector build error on windows platform

* replace with flat_hash_map, small_vector

* remove todo

* Perfect unittests (#20)

* perfect unittests

* update license

* fix bug when running tcmpt_utils_test

* refactor execution adapting impl

* fix insert conflict

* Fix CI bug of test_yolov3 (#21)

* fill_any_like kernel refactor

* remove useless code of full_like c++ api

* Support Scalar in Tensor Compute Library

* add scalar in dygraph and static graph mode

* keep the basic type for attr, instead of using scalar for all

* merge the code

* start refactor matmul

* move cpu, cuda and other device modules into kernels

* merge code

* polish code in operator.cc

* Fix CI bug of test_yolov3

* add the tensor base class, test=develop (#17)

* update the tensor base class, test=develop

* remove two funcs, test=develop

* update the error msg, test=develop
Co-authored-by: Chen Weihang <chenweihang@baidu.com>

* [no-verify] commit backend and tensor signature changes

* Rename tcmpt to pten (#23)

* rename tcmpt to pten

* update omitted files for rename to pten

* update omitted file for rename to pten

* remove the k prefix from all enum vars

* remove kernel_instantiate (#26)

* remove symbols and spatial_tensor

* change common to functions

* re-add share tensor impl methods

* add a candidate dense tensor class, test=develop (#28)

* change all Pt to Pten

* resolve conflict with xiaowei

* Op2functor opt1 (#27)

* replace with small vector and change to const &

* add std::move
Co-authored-by: Chen Weihang <chenweihang@baidu.com>

* polish kernel factory and kernel registry

* fix operator test error msg mismatch

* remove tensor signature and backend set member

* move scalar and polish enforce

* revert dtype layout change to fix error

* fix enum operator override error

* add several base unittests

* add pten utils tests

* polish some details

* Dev/op2func refactor 3 (#30)

* add a candidate dense tensor class, test=develop

* remove TensorBase::backend(), test=develop

* remove some ops, test=develop

* cherry-pick the pr of tensor meta, test=develop

* moves the dense tensor and some ops, test=develop

* update the linalg operator, test=develop

* update other operators, test=develop

* fix errors, test=develop

* fix bugs, test=develop

* try to resolve the problem of windows ci, test=develop

* updates codes, test=develop

* fix the tensor_utils.cc, test=develop

* modify the dense tensor, test=develop

* fix the data type, test=develop
Co-authored-by: shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

* polish some details

* polish kernel signature details

* fix a bug about offsets of the tensor, test=develop (#31)
Co-authored-by: shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

* support multiple inputs and outputs

* rm attrs {}

* fix multioutputs bug

* merge develop

* remove unused header file

* add missing & in const reference

* modify inputAt, outputAt to inputBetween, outputBetween
Co-authored-by: Chen Weihang <chenweihang@baidu.com>
Co-authored-by: zyfncg <1370305206@qq.com>
Co-authored-by: YuanRisheng <yuanrisheng@baidu.com>
Co-authored-by: 石晓伟 <39303645+Shixiaowei02@users.noreply.github.com>
Parent 4a7f1a0d
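
Overview of the change: before this commit a pten kernel could bind only one tensor per input or output slot. The diffs below thread a half-open [start, end) index range through KernelContext, so that a single kernel parameter can consume or produce a contiguous run of tensors, letting kernels declare std::vector<DenseTensor> inputs and std::vector<DenseTensor*> outputs directly in their C++ signatures. A hypothetical concat-style kernel, shown only to illustrate the signature forms the new helpers make dispatchable (name and body are illustrative, not part of this commit):

    // Hypothetical kernel: one multi-input slot, one single-output slot.
    void ConcatKernel(const CPUContext& dev_ctx,
                      const std::vector<DenseTensor>& xs,
                      DenseTensor* out) {
      // ... write the concatenation of xs into out ...
    }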
@@ -52,37 +52,37 @@ class KernelContext {
   }
 
   void EmplaceBackInput(std::shared_ptr<TensorBase> input) {
+    int index = inputs_.size();
     inputs_.emplace_back(std::move(input));
     // Record the start and end index of the input
-    int index = inputs_.size();
     input_range_.emplace_back(std::pair<int, int>(index, index + 1));
   }
 
   void EmplaceBackInputs(
-      paddle::SmallVector<std::shared_ptr<TensorBase>> inputs) {
+      const paddle::SmallVector<std::shared_ptr<TensorBase>>& inputs) {
+    int index = inputs_.size();
     for (auto in : inputs) {
-      inputs_.emplace_back(in);
+      inputs_.emplace_back(std::move(in));
     }
     // Record the start and end index of the input
-    int index = inputs_.size();
     input_range_.emplace_back(
         std::pair<int, int>(index, index + inputs.size()));
   }
 
   void EmplaceBackOutput(std::shared_ptr<TensorBase> output) {
+    int index = outputs_.size();
     outputs_.emplace_back(std::move(output));
     // Record the start and end index of the input
-    int index = outputs_.size();
     output_range_.emplace_back(std::pair<int, int>(index, index + 1));
   }
 
   void EmplaceBackOutputs(
-      paddle::SmallVector<std::shared_ptr<TensorBase>> outputs) {
+      const paddle::SmallVector<std::shared_ptr<TensorBase>>& outputs) {
+    int index = outputs_.size();
     for (auto out : outputs) {
-      outputs_.emplace_back(out);
+      outputs_.emplace_back(std::move(out));
     }
     // Record the start and end index of the input
-    int index = outputs_.size();
     output_range_.emplace_back(
         std::pair<int, int>(index, index + outputs.size()));
   }
@@ -96,11 +96,40 @@ class KernelContext {
     return static_cast<const TensorType&>(*(inputs_.at(idx)));
   }
 
+  template <typename TensorType>
+  std::vector<TensorType> InputBetween(size_t start, size_t end) const {
+    std::vector<TensorType> v;
+    for (size_t i = start; i < end; ++i) {
+      auto t = std::dynamic_pointer_cast<TensorType>(inputs_.at(i));
+      v.emplace_back(std::move(*t.get()));
+    }
+    return v;
+  }
+
+  const std::pair<int, int>& InputRangeAt(size_t idx) const {
+    return input_range_.at(idx);
+  }
+
+  const std::pair<int, int>& OutputRangeAt(size_t idx) const {
+    return output_range_.at(idx);
+  }
+
   template <typename TensorType>
   TensorType* MutableOutputAt(size_t idx) {
     return static_cast<TensorType*>(outputs_.at(idx).get());
   }
 
+  template <typename TensorType>
+  std::vector<TensorType*> MutableOutputBetween(size_t start, size_t end) {
+    std::vector<TensorType*> v;
+    for (size_t i = start; i < end; ++i) {
+      v.emplace_back(static_cast<TensorType*>(outputs_.at(i).get()));
+    }
+    return v;
+  }
+
   template <typename AttrType>
   AttrType AttrAt(size_t idx) const {
     try {
......
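
Taken together, the EmplaceBack* methods and the new accessors implement flat storage with range bookkeeping: all tensors live in the single inputs_ (or outputs_) vector, and input_range_ / output_range_ holds one half-open [start, end) pair per kernel argument slot. A minimal usage sketch, assuming ctx is an already-constructed KernelContext and x0/x1 are std::shared_ptr<TensorBase> handles to DenseTensors prepared by the caller:

    // Bind two tensors to one input slot; that slot's range becomes [0, 2).
    ctx.EmplaceBackInputs({x0, x1});
    const std::pair<int, int>& r = ctx.InputRangeAt(0);  // r == {0, 2}
    std::vector<DenseTensor> xs =
        ctx.InputBetween<DenseTensor>(r.first, r.second);

Note the asymmetry: InputBetween materializes DenseTensor values by moving out of the stored objects (std::move(*t.get())), while MutableOutputBetween returns raw pointers into outputs_, so writes through them land in the context's tensors.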
@@ -62,9 +62,17 @@ struct KernelArgsParseFunctor<Return_ (*)(Args_...)> {
     } else if (arg_type == std::type_index(typeid(const DenseTensor&))) {
       args_def->AppendInput(
           default_key.backend(), default_tensor_layout, default_key.dtype());
+    } else if (arg_type ==
+               std::type_index(typeid(const std::vector<DenseTensor>&))) {
+      args_def->AppendInput(
+          default_key.backend(), default_tensor_layout, default_key.dtype());
     } else if (arg_type == std::type_index(typeid(DenseTensor*))) {
       args_def->AppendOutput(
           default_key.backend(), default_tensor_layout, default_key.dtype());
+    } else if (arg_type ==
+               std::type_index(typeid(std::vector<DenseTensor*>))) {
+      args_def->AppendOutput(
+          default_key.backend(), default_tensor_layout, default_key.dtype());
     } else {
       // Attribute deal with
       // TODO(chenweihang): now here allow any types of attribute, maybe
......
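
KernelArgsParseFunctor walks the registered kernel's parameter types once, at static-registration time, and a std::vector parameter still contributes exactly one arg-def slot; the per-call element count lives only in the KernelContext ranges. A standalone sketch of the same type_index dispatch (plain C++, not Paddle code):

    #include <iostream>
    #include <typeindex>
    #include <vector>

    struct DenseTensor {};  // stand-in for the real class

    // Classify one parameter type the way the branches above do. typeid
    // ignores references and top-level cv-qualifiers, so the comparison for
    // `const std::vector<DenseTensor>&` really matches std::vector<DenseTensor>.
    const char* Classify(std::type_index arg_type) {
      if (arg_type == std::type_index(typeid(const DenseTensor&))) {
        return "single input slot";
      } else if (arg_type ==
                 std::type_index(typeid(const std::vector<DenseTensor>&))) {
        return "multi input slot";
      } else if (arg_type == std::type_index(typeid(DenseTensor*))) {
        return "single output slot";
      } else if (arg_type ==
                 std::type_index(typeid(std::vector<DenseTensor*>))) {
        return "multi output slot";
      }
      return "attribute";
    }

    int main() {
      std::cout << Classify(typeid(const std::vector<DenseTensor>&)) << "\n";
      std::cout << Classify(typeid(std::vector<DenseTensor*>)) << "\n";
    }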
@@ -79,7 +79,30 @@ using XPUContext = paddle::platform::XPUDeviceContext;
                     "Kernel's Input should appear before Attributes.");    \
       static_assert(out_idx == 0,                                          \
                     "Kernel's Input should appear before Outputs.");       \
-      const tensor_type& arg = ctx->InputAt<tensor_type>(in_idx);          \
+      const std::pair<int, int> range = ctx->InputRangeAt(in_idx);         \
+      const tensor_type& arg = ctx->InputAt<tensor_type>(range.first);     \
+      KernelCallHelper<Tail...>::                                          \
+          template Compute<dev_ctx_idx, in_idx + 1, attr_idx, out_idx>(    \
+              ctx, pargs..., arg);                                         \
+    }                                                                      \
+  }
+
+#define PT_SPECIALIZE_KernelCallHelper_FOR_MULTI_INPUT(tensor_type)        \
+  template <typename... Tail>                                              \
+  struct KernelCallHelper<const std::vector<tensor_type>&, Tail...> {      \
+    template <int dev_ctx_idx,                                             \
+              int in_idx,                                                  \
+              int attr_idx,                                                \
+              int out_idx,                                                 \
+              typename... PreviousArgs>                                    \
+    static void Compute(KernelContext* ctx, PreviousArgs&... pargs) {      \
+      static_assert(attr_idx == 0,                                         \
+                    "Kernel's Input should appear before Attributes.");    \
+      static_assert(out_idx == 0,                                          \
+                    "Kernel's Input should appear before Outputs.");       \
+      const std::pair<int, int> range = ctx->InputRangeAt(in_idx);         \
+      std::vector<tensor_type> arg = std::move(                            \
+          ctx->InputBetween<tensor_type>(range.first, range.second));      \
       KernelCallHelper<Tail...>::                                          \
           template Compute<dev_ctx_idx, in_idx + 1, attr_idx, out_idx>(    \
               ctx, pargs..., arg);                                         \
...@@ -104,20 +127,39 @@ using XPUContext = paddle::platform::XPUDeviceContext; ...@@ -104,20 +127,39 @@ using XPUContext = paddle::platform::XPUDeviceContext;
} \ } \
} }
#define PT_SPECIALIZE_KernelCallHelper_FOR_OUTPUT(tensor_type) \ #define PT_SPECIALIZE_KernelCallHelper_FOR_OUTPUT(tensor_type) \
template <typename... Tail> \ template <typename... Tail> \
struct KernelCallHelper<tensor_type*, Tail...> { \ struct KernelCallHelper<tensor_type*, Tail...> { \
template <int dev_ctx_idx, \ template <int dev_ctx_idx, \
int in_idx, \ int in_idx, \
int attr_idx, \ int attr_idx, \
int out_idx, \ int out_idx, \
typename... PreviousArgs> \ typename... PreviousArgs> \
static void Compute(KernelContext* ctx, PreviousArgs&... pargs) { \ static void Compute(KernelContext* ctx, PreviousArgs&... pargs) { \
tensor_type* arg = ctx->MutableOutputAt<tensor_type>(out_idx); \ const std::pair<int, int> range = ctx->OutputRangeAt(out_idx); \
KernelCallHelper<Tail...>:: \ tensor_type* arg = ctx->MutableOutputAt<tensor_type>(range.first); \
template Compute<dev_ctx_idx, in_idx, attr_idx, out_idx + 1>( \ KernelCallHelper<Tail...>:: \
ctx, pargs..., arg); \ template Compute<dev_ctx_idx, in_idx, attr_idx, out_idx + 1>( \
} \ ctx, pargs..., arg); \
} \
}
#define PT_SPECIALIZE_KernelCallHelper_FOR_MULTI_OUTPUT(tensor_type) \
template <typename... Tail> \
struct KernelCallHelper<std::vector<tensor_type*>, Tail...> { \
template <int dev_ctx_idx, \
int in_idx, \
int attr_idx, \
int out_idx, \
typename... PreviousArgs> \
static void Compute(KernelContext* ctx, PreviousArgs&... pargs) { \
const std::pair<int, int> range = ctx->OutputRangeAt(out_idx); \
std::vector<tensor_type*> arg = std::move( \
ctx->MutableOutputBetween<tensor_type>(range.first, range.second)); \
KernelCallHelper<Tail...>:: \
template Compute<dev_ctx_idx, in_idx, attr_idx, out_idx + 1>( \
ctx, pargs..., arg); \
} \
} }
template <typename T> template <typename T>
...@@ -152,6 +194,7 @@ struct KernelImpl<Return (*)(Args...), kernel_fn> { ...@@ -152,6 +194,7 @@ struct KernelImpl<Return (*)(Args...), kernel_fn> {
/* Input Helpers */ /* Input Helpers */
PT_SPECIALIZE_KernelCallHelper_FOR_INPUT(DenseTensor); PT_SPECIALIZE_KernelCallHelper_FOR_INPUT(DenseTensor);
PT_SPECIALIZE_KernelCallHelper_FOR_MULTI_INPUT(DenseTensor);
// TODO(chenweihang): adapt SelectedRows // TODO(chenweihang): adapt SelectedRows
// PT_SPECIALIZE_KernelCallHelper_FOR_INPUT(SelectedRowsTensor); // PT_SPECIALIZE_KernelCallHelper_FOR_INPUT(SelectedRowsTensor);
...@@ -168,6 +211,7 @@ struct KernelImpl<Return (*)(Args...), kernel_fn> { ...@@ -168,6 +211,7 @@ struct KernelImpl<Return (*)(Args...), kernel_fn> {
/* Output Helpers */ /* Output Helpers */
PT_SPECIALIZE_KernelCallHelper_FOR_OUTPUT(DenseTensor); PT_SPECIALIZE_KernelCallHelper_FOR_OUTPUT(DenseTensor);
PT_SPECIALIZE_KernelCallHelper_FOR_MULTI_OUTPUT(DenseTensor);
// TODO(chenweihang): adapt SelectedRows // TODO(chenweihang): adapt SelectedRows
// PT_SPECIALIZE_KernelCallHelper_FOR_OUTPUT(SelectedRowsTensor); // PT_SPECIALIZE_KernelCallHelper_FOR_OUTPUT(SelectedRowsTensor);
......
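
For readers new to this file, KernelCallHelper is a compile-time recursion: each partial specialization peels one parameter type off the kernel signature, fetches the matching argument from the KernelContext at the current running index, bumps that index, and recurses; a sentinel type tag ends the signature, at which point the collected arguments are forwarded to the kernel function. The following self-contained miniature reproduces the technique with simplified stand-ins (int plays the role of a tensor; the dev_ctx and attribute indices of the real helpers are elided):

    #include <iostream>
    #include <vector>

    // Miniature re-creation of the KernelCallHelper recursion. Ctx and the
    // int "tensors" are simplified stand-ins, not the pten classes.
    struct Ctx {
      std::vector<int> inputs;
      std::vector<int> outputs;
    };

    template <typename T>
    struct TypeTag {};  // sentinel marking the end of the signature

    template <typename Fn, Fn fn>
    struct KernelImpl;

    template <typename... Args, void (*kernel_fn)(Args...)>
    struct KernelImpl<void (*)(Args...), kernel_fn> {
      static void Compute(Ctx* ctx) {
        Helper<Args..., TypeTag<int>>::template Run<0, 0>(ctx);
      }

     private:
      template <typename... Rest>
      struct Helper;

      // Peel one input parameter (const int&) and advance in_idx.
      template <typename... Rest>
      struct Helper<const int&, Rest...> {
        template <int in_idx, int out_idx, typename... Prev>
        static void Run(Ctx* ctx, Prev&... prev) {
          const int& arg = ctx->inputs[in_idx];
          Helper<Rest...>::template Run<in_idx + 1, out_idx>(ctx, prev..., arg);
        }
      };

      // Peel one output parameter (int*) and advance out_idx.
      template <typename... Rest>
      struct Helper<int*, Rest...> {
        template <int in_idx, int out_idx, typename... Prev>
        static void Run(Ctx* ctx, Prev&... prev) {
          int* arg = &ctx->outputs[out_idx];
          Helper<Rest...>::template Run<in_idx, out_idx + 1>(ctx, prev..., arg);
        }
      };

      // Signature exhausted: all arguments collected, invoke the kernel.
      template <typename T>
      struct Helper<TypeTag<T>> {
        template <int in_idx, int out_idx, typename... Prev>
        static void Run(Ctx* ctx, Prev&... prev) {
          kernel_fn(prev...);
        }
      };
    };

    void AddOne(const int& x, int* out) { *out = x + 1; }  // toy "kernel"

    int main() {
      Ctx ctx{{41}, {0}};
      KernelImpl<decltype(&AddOne), &AddOne>::Compute(&ctx);
      std::cout << ctx.outputs[0] << "\n";  // prints 42
    }

The real macros above generate exactly such specializations, with the added range indirection so that one peeled parameter may cover several tensors.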
@@ -122,6 +122,14 @@ struct KernelKeyParser : ArgsIterator<KernelKeyParser> {
     key_set.dtype = x.type();
   }
 
+  void operator()(const std::vector<Tensor>& x) {
+    key_set.backend_set =
+        key_set.backend_set | detail::GetTensorBackendSet(x[0]);
+    // TODO(chenweihang): selecte multi layout and dtype
+    key_set.layout = x[0].layout();
+    key_set.dtype = x[0].type();
+  }
+
   // skip other type args, these args don't used in kernel selection
   template <typename T>
   void operator()(const T& x) {
......
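
As the TODO records, the kernel key for a vector argument is derived from its first element only, so mixed backends, layouts, or dtypes inside one multi-tensor input are not yet reconciled. Schematically (xs is a hypothetical vector of user input Tensors):

    // Dispatch-key derivation for a multi-input slot, per the overload above:
    // only xs[0] is consulted; xs[1..n-1] never influence the key.
    KernelKeyParser parser;
    parser(xs);  // folds GetTensorBackendSet(xs[0]) into key_set.backend_set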