diff --git a/CMakeLists.txt b/CMakeLists.txt
index 48e52961a95d50264b201eec50ccb3a462f39c54..317f7f9eb46a96e9f6ea393abf82d608af50fc4b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -138,12 +138,6 @@ else()
set(THIRD_PARTY_BUILD_TYPE Release)
endif()
-if(WITH_MKL)
- option(MKL_SPLIT_GEMM "PaddlePaddle MKL gemm would split to small ones" OFF)
- if (MKL_SPLIT_GEMM)
- add_definitions(-DPADDLE_MKL_SPLIT_GEMM)
- endif()
-endif()
set(WITH_MKLML ${WITH_MKL})
if (NOT DEFINED WITH_MKLDNN)
if (WITH_MKL AND AVX2_FOUND)
diff --git a/doc/fluid/design/dist_train/dist_train_nccl2.md b/doc/fluid/design/dist_train/dist_train_nccl2.md
index aa7455ec5de0d46d7c2b0cef3b7ebf4754af3cb1..b8b8427811cddcddf872db5badfd37c96a76c3e3 100644
--- a/doc/fluid/design/dist_train/dist_train_nccl2.md
+++ b/doc/fluid/design/dist_train/dist_train_nccl2.md
@@ -1,7 +1,7 @@
# Distributed Training with NCCL2
We design a pattern that can enable training with `ParallelExecutor` and
-using [NCCL2](https://developer.nvidia.com/nccl) as it's collective
+use [NCCL2](https://developer.nvidia.com/nccl) as its collective
communication library.
In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
@@ -9,14 +9,14 @@ to do multi GPU training. And if we initialize NCCL2 communicators as
ranks in a distributed environment, we can simply run the `ParallelExecutor`
as a distributed program! The only thing that may be different than in
the single node version is that we need to broadcast the NCCL unique ID
-to all the nodes, and initialize communicators using that ID, so NCCL2
-will know each other as ranks.
+to all the nodes and initialize communicators using that ID, so the NCCL2
+ranks can recognize each other.
To achieve this feature, we introduce a new operator: `gen_nccl_id` op,
so we are ***not*** "bound to" running NCCL2 with MPI; we can run it on
-what ever platform you like.
+whatever platform you like.
-It have two running modes:
+It has two running modes:
1. Generate and broadcast mode, which should be used on trainer 0;
1. Listen and fetch mode, which should be used on trainers other than 0.
@@ -29,7 +29,7 @@ initialize NCCL communicator objects.
The above figure indicates the general process when training with NCCL2
-distributed. Each trainer have the number of communicators equal to the
+distributed. Each trainer has a number of communicators equal to the
number of GPUs, but the ranks should match the global ranks number: here
we have total 8 GPUs, so `nranks==8`, for each trainer, the ranks should
be from 0 ~ 3 on trainer 0 and 4 ~ 7 on trainer 1.
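
For reference, the plain NCCL2 calls this design ultimately maps onto look roughly like the sketch below (not part of this patch). It assumes 2 trainers with 4 GPUs each, so `nranks == 8`; the helper name `InitNCCLComms` and passing the unique ID as an argument are illustrative only, since in this design the ID is actually delivered by the `gen_nccl_id` op.

```
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch only: initialize one communicator per local GPU with global ranks.
// Trainer 0 calls ncclGetUniqueId(&id) and broadcasts `id`; the other
// trainers receive it (via gen_nccl_id here) before calling this function.
void InitNCCLComms(int trainer_id, int gpus_per_trainer, int nranks,
                   ncclUniqueId id, ncclComm_t* comms) {
  ncclGroupStart();  // required when one thread sets up several ranks
  for (int i = 0; i < gpus_per_trainer; ++i) {
    int rank = trainer_id * gpus_per_trainer + i;  // 0~3 on trainer 0, 4~7 on trainer 1
    cudaSetDevice(i);
    ncclCommInitRank(&comms[i], nranks, id, rank);
  }
  ncclGroupEnd();
}
```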
diff --git a/doc/fluid/dev/new_op_cn.md b/doc/fluid/dev/new_op_cn.md
index c00f73be955e0fb54bb01ffa9a61b3f27c112f75..ff7408111fa20a7a6a3a2fe9f9ba20835918f399 100644
--- a/doc/fluid/dev/new_op_cn.md
+++ b/doc/fluid/dev/new_op_cn.md
@@ -36,19 +36,19 @@
OpProtoMake定义 |
-`.cc`文件,Backward Op不需要定义OpProtoMake |
+.cc 文件,Backward Op不需要定义OpProtoMake |
Op定义 |
- `.cc`文件 |
+ .cc 文件 |
Kernel实现 |
- CPU、CUDA共享Kernel实现在`.h`文件中,否则,CPU 实现在`.cc`文件中,CUDA 实现在`.cu`文件中。 |
+ CPU、CUDA共享Kernel实现在.h 文件中,否则,CPU 实现在.cc 文件中,CUDA 实现在.cu 文件中。 |
注册Op |
- Op注册实现在`.cc`文件;Kernel注册CPU实现在`.cc`文件中,CUDA实现在`.cu`文件中 |
+ Op注册实现在.cc 文件;Kernel注册CPU实现在.cc 文件中,CUDA实现在.cu 文件中 |
@@ -391,7 +391,7 @@ PADDLE_ENFORCE(ctx->HasInput("X"), "");
```
问题示例2 :提示信息过于简单
```
-PADDLE_ENFORCE(i != nullptr, "I must be set"); // I是什么?
+PADDLE_ENFORCE(i != nullptr, "i must be set"); // i是什么?
```
2. 在报错信息中使用开发人员定义的变量缩写,不易理解!
diff --git a/doc/fluid/howto/cluster/nccl2_rdma_training.md b/doc/fluid/howto/cluster/nccl2_rdma_training.md
index cecd5c3a7a7339e3be6772543a534728ec132105..8adaf324fccb4cda7af16b9bace559c0642ae444 100644
--- a/doc/fluid/howto/cluster/nccl2_rdma_training.md
+++ b/doc/fluid/howto/cluster/nccl2_rdma_training.md
@@ -1,12 +1,12 @@
# Distributed Training with NCCL2 and RDMA
-When doing distributed multi-GPU training, network bandwith often becomes the
-bottle neck. We introduce a way to use NCCL2 to do such training job to
-achieve best performace.
+When doing distributed multi-GPU training, network bandwidth often becomes the
+bottleneck. We introduce a way to use NCCL2 to do such training job to
+achieve best performance.
-## Prepare Hardwares with RDMA and Multiple GPUs
+## Prepare Hardware with RDMA and Multiple GPUs
-I'm using two Linux servers each of them is installed with 8 GPUs and
+I'm using two Linux servers, each installed with 8 GPUs and
one 100Gb RDMA card.
Base environment is:
@@ -25,7 +25,7 @@ In general, the steps including:
1. Use docker to run tests and make sure GPUs and RDMA can work inside
the container.
-I'll ommit section "Install GPU drivers" because we can find it easily
+I'll omit the section "Install GPU drivers" because we can find it easily
somewhere else.
### Install RDMA drivers
@@ -33,7 +33,7 @@ somewhere else.
For my case, I've got two machines with device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
"CentOS 7.4" and I updated the kernel to version 4.4 so that docker can
-work with latest overlay2 filesystem.
+work with the latest overlay2 filesystem.
***NOTE: before you start, make sure you have a way to get a console
of the server other than ssh because we may need to re-configure the
@@ -45,14 +45,14 @@ network device.***
1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything work, note that
   this operation may cause the network to go down if you are using this
- RDMA device as default network device and use ssh to login the server.
+   RDMA device as the default network device and use ssh to log in to the server.
1. Re-configure the network interface, for example:
`ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
`ip route add default via 192.168.16.1 dev eth2`.
1. Do the same thing on the other node.
1. Use `ping` to test if the two nodes have typical ICMP connection.
1. Use either `udaddy` or `ib_write_bw` to test the network connection is
- ready and have the desired bandwith.
+   ready and has the desired bandwidth.
### Prepare Docker Image to Run RDMA Programs
@@ -60,7 +60,7 @@ network device.***
package in it.
1. Start a docker container and mount GPU driver libs into it (you can
skip this step if you are using nvidia-docker).
-1. Mount RDMA dirvers and libs into the docker image (see below section),
+1. Mount RDMA drivers and libs into the docker image (see below section),
also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`
or just use privileged mode `--privileged`.
diff --git a/paddle/fluid/API.spec b/paddle/fluid/API.spec
index 9250cde1b2bc8fa1e14c0ba1ea9b509c496fc506..8c5cc44528a754f7612a23b1de09c247ca3f0c8e 100644
--- a/paddle/fluid/API.spec
+++ b/paddle/fluid/API.spec
@@ -162,6 +162,7 @@ paddle.fluid.layers.crop ArgSpec(args=['x', 'shape', 'offsets', 'name'], varargs
paddle.fluid.layers.rank_loss ArgSpec(args=['label', 'left', 'right', 'name'], varargs=None, keywords=None, defaults=(None,))
paddle.fluid.layers.prelu ArgSpec(args=['x', 'mode', 'param_attr', 'name'], varargs=None, keywords=None, defaults=(None, None))
paddle.fluid.layers.flatten ArgSpec(args=['x', 'axis', 'name'], varargs=None, keywords=None, defaults=(1, None))
+paddle.fluid.layers.stack ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=(0,))
paddle.fluid.layers.data ArgSpec(args=['name', 'shape', 'append_batch_size', 'dtype', 'lod_level', 'type', 'stop_gradient'], varargs=None, keywords=None, defaults=(True, 'float32', 0, VarType.LOD_TENSOR, True))
paddle.fluid.layers.open_recordio_file ArgSpec(args=['filename', 'shapes', 'lod_levels', 'dtypes', 'pass_num', 'for_parallel'], varargs=None, keywords=None, defaults=(1, True))
paddle.fluid.layers.open_files ArgSpec(args=['filenames', 'shapes', 'lod_levels', 'dtypes', 'thread_num', 'buffer_size', 'pass_num', 'is_test'], varargs=None, keywords=None, defaults=(None, None, 1, None))
@@ -191,7 +192,7 @@ paddle.fluid.layers.argsort ArgSpec(args=['input', 'axis', 'name'], varargs=None
paddle.fluid.layers.ones ArgSpec(args=['shape', 'dtype', 'force_cpu'], varargs=None, keywords=None, defaults=(False,))
paddle.fluid.layers.zeros ArgSpec(args=['shape', 'dtype', 'force_cpu'], varargs=None, keywords=None, defaults=(False,))
paddle.fluid.layers.reverse ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=None)
-paddle.fluid.layers.While.__init__ ArgSpec(args=['self', 'cond', 'name'], varargs=None, keywords=None, defaults=(None,))
+paddle.fluid.layers.While.__init__ ArgSpec(args=['self', 'cond', 'is_test', 'name'], varargs=None, keywords=None, defaults=(False, None))
paddle.fluid.layers.While.block ArgSpec(args=['self'], varargs=None, keywords=None, defaults=None)
paddle.fluid.layers.Switch.__init__ ArgSpec(args=['self', 'name'], varargs=None, keywords=None, defaults=(None,))
paddle.fluid.layers.Switch.case ArgSpec(args=['self', 'condition'], varargs=None, keywords=None, defaults=None)
diff --git a/paddle/fluid/framework/array.h b/paddle/fluid/framework/array.h
new file mode 100644
index 0000000000000000000000000000000000000000..be9efcd74924a2050a2fd9ab83059590a1a2a2fd
--- /dev/null
+++ b/paddle/fluid/framework/array.h
@@ -0,0 +1,48 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include <cstdint>
+#include "paddle/fluid/platform/hostdevice.h"
+
+namespace paddle {
+namespace framework {
+template <typename T, size_t N>
+class Array {
+ static_assert(N > 0, "The size of array must be larger than 0");
+
+ public:
+ HOSTDEVICE Array() {}
+
+ HOSTDEVICE explicit Array(const T &val) {
+ for (size_t i = 0; i < N; ++i) data_[i] = val;
+ }
+
+ HOSTDEVICE const T *Get() const { return data_; }
+
+ HOSTDEVICE T *GetMutable() { return data_; }
+
+ HOSTDEVICE T &operator[](size_t index) { return data_[index]; }
+
+ HOSTDEVICE const T &operator[](size_t index) const { return data_[index]; }
+
+ HOSTDEVICE constexpr size_t size() const { return N; }
+
+ private:
+ T data_[N];
+};
+
+} // namespace framework
+} // namespace paddle
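
A minimal usage sketch of the new `Array` class (illustrative only, not part of the patch): it behaves like a fixed-size `std::array` whose methods are marked `HOSTDEVICE`, so the same code compiles for both host and CUDA device use.

```
#include "paddle/fluid/framework/array.h"

void ArrayExample() {
  // Three elements, all initialized to 1 by the explicit Array(const T&) ctor.
  paddle::framework::Array<int, 3> dims(1);
  dims[1] = 128;
  dims[2] = 256;

  int numel = 1;
  for (size_t i = 0; i < dims.size(); ++i) numel *= dims[i];
  // numel == 1 * 128 * 256
}
```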
diff --git a/paddle/fluid/framework/details/multi_devices_graph_pass.cc b/paddle/fluid/framework/details/multi_devices_graph_pass.cc
index c5a13e7e1f45e1eb9b4271880630c52d30022f4b..bc61b0eacbf6c8a1fd4487ad5a442fed1b536345 100644
--- a/paddle/fluid/framework/details/multi_devices_graph_pass.cc
+++ b/paddle/fluid/framework/details/multi_devices_graph_pass.cc
@@ -763,6 +763,8 @@ void MultiDevSSAGraphBuilder::CreateDistTrainOp(ir::Graph *result,
// Create RPC related op handles that connects its in ops and out ops.
void MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result,
ir::Node *node) const {
+ // FIXME(typhoonzero): Cleanup this deps for both sync mode and async mode
+ // put them into transpiler.
int op_dev_id = -1;
if (node->Op()->Type() == "send") {
// TODO(paddle-dev): getting the first var is not safe.
@@ -771,26 +773,42 @@ void MultiDevSSAGraphBuilder::CreateRPCOp(ir::Graph *result,
"This hack no longer holds, please fix.");
  // the variable name which contains .block means it was split by
// split_byref op
- // so that we can balance the variable blocks to all the pserver
- // instances.
if (strategy_.reduce_ == BuildStrategy::ReduceStrategy::kAllReduce &&
node->inputs[0]->Name().find(".block") == std::string::npos) {
      std::vector<std::string> input_var_names;
for (ir::Node *n : node->inputs) {
input_var_names.push_back(n->Name());
}
- op_dev_id = GetAppropriateDeviceID(input_var_names);
+      auto send_param_grad = boost::get<std::vector<std::string>>(
+ node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName()));
+ PADDLE_ENFORCE_EQ(send_param_grad.size(), 2U);
+ op_dev_id = GetAppropriateDeviceID({send_param_grad[1]});
+ VLOG(10) << "send grad " << input_var_names[0] << " origin "
+ << send_param_grad[1] << " place: " << op_dev_id;
for (auto &varname : input_var_names) {
result->Get(kShardedVarDevice)
.emplace(varname, op_dev_id);
}
+ result->Get(kShardedVarDevice)
+ .emplace(send_param_grad[1], op_dev_id);
}
} else if (node->Op()->Type() == "recv") {
    std::vector<std::string> output_var_names;
for (ir::Node *n : node->outputs) {
output_var_names.push_back(n->Name());
}
- op_dev_id = GetAppropriateDeviceID(output_var_names);
+    auto recv_param_grad = boost::get<std::vector<std::string>>(
+ node->Op()->GetAttr(OpProtoAndCheckerMaker::OpRoleVarAttrName()));
+ // FIXME(typhoonzero): assume each recv op output one param
+ // Use the same place as send.
+ if (recv_param_grad.size() == 2U) {
+ op_dev_id = GetVarDeviceID(*result, recv_param_grad[1]);
+ VLOG(10) << "recv param " << recv_param_grad[0]
+ << " get grad place: " << recv_param_grad[1]
+ << " place: " << op_dev_id;
+ } else {
+ op_dev_id = GetAppropriateDeviceID(output_var_names);
+ }
for (auto &varname : output_var_names) {
result->Get(kShardedVarDevice)
.emplace(varname, op_dev_id);
diff --git a/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc b/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc
index 69944a42b688a9ea5ff29f75f18dd4b156848a27..361c91dc78c08a2cbf84ee88211d389c1e2312e5 100644
--- a/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc
+++ b/paddle/fluid/framework/details/multi_devices_graph_print_pass.cc
@@ -54,7 +54,8 @@ void GraphvizSSAGraphPrinter::Print(const ir::Graph &graph,
sout << "var_" << cur_var_id << " [label=\"" << var_handle_ptr->name_
<< "\\n"
<< var_handle_ptr->place_ << "\\n"
- << var_handle_ptr->version_ << "\"]" << std::endl;
+ << "scope: " << var_handle_ptr->scope_idx_ << "\\n"
+ << "v" << var_handle_ptr->version_ << "\"]" << std::endl;
} else if (dummy_ptr) {
sout << "var_" << cur_var_id << " [label=\"dummy\"]" << std::endl;
}
diff --git a/paddle/fluid/framework/ir/graph.h b/paddle/fluid/framework/ir/graph.h
index 25e33861c06c9fcd2625e3a4036a04508acbd2ca..0d27be5fc007746d6ca41ff0dbcea5c5f45599ef 100644
--- a/paddle/fluid/framework/ir/graph.h
+++ b/paddle/fluid/framework/ir/graph.h
@@ -142,8 +142,6 @@ class Graph {
nodes_.erase(node);
}
- const ProgramDesc &program() const { return program_; }
-
private:
// This method takes ownership of `node`.
ir::Node *AddNode(ir::Node *node) {
@@ -154,7 +152,7 @@ class Graph {
}
// NOTE: program_ shouldn't be exposed to user.
- const ProgramDesc &program_;
+ const ProgramDesc program_;
  std::map<std::string, boost::any> attrs_;
  std::map<std::string, std::function<void(void)>> attr_dels_;
  std::map<ir::Node *, std::unique_ptr<ir::Node>> nodes_;
diff --git a/paddle/fluid/framework/ir/graph_pattern_detecter_tester.cc b/paddle/fluid/framework/ir/graph_pattern_detecter_tester.cc
index 993c885a810fe80a170ed190b892b148d85e8b5f..06f9df5546910f492c9dd1da3e694623898d3d1d 100644
--- a/paddle/fluid/framework/ir/graph_pattern_detecter_tester.cc
+++ b/paddle/fluid/framework/ir/graph_pattern_detecter_tester.cc
@@ -163,8 +163,8 @@ TEST(GraphPatternDetecter, MultiSubgraph) {
// 3. Detect op2 -> var2 -> op4
// 4. Detect op2 -> var3 -> op5
// But 2 and 3 and 4 overlapped, so keep 2, so the final choices are 1 and 2
- ASSERT_GE(count, 1UL);
- ASSERT_LE(count, 2UL);
+ ASSERT_GE(count, 1);
+ ASSERT_LE(count, 2);
}
} // namespace ir
diff --git a/paddle/fluid/framework/ir/node.cc b/paddle/fluid/framework/ir/node.cc
index aca77da8d674f29b89c023717cdcd061232d023a..65c45c7d2038cd06168d50c202dc81b4389cc5ed 100644
--- a/paddle/fluid/framework/ir/node.cc
+++ b/paddle/fluid/framework/ir/node.cc
@@ -17,7 +17,7 @@ limitations under the License. */
namespace paddle {
namespace framework {
namespace ir {
-const char Node::kControlDepVarName[] = "__control_var";
+constexpr char Node::kControlDepVarName[];
} // namespace ir
} // namespace framework
} // namespace paddle
diff --git a/paddle/fluid/framework/ir/node.h b/paddle/fluid/framework/ir/node.h
index 063c70fb7b9c0f9b90d872a70f362459ef149391..aab3180e7e5ece5de5f5227e76f78687700fed87 100644
--- a/paddle/fluid/framework/ir/node.h
+++ b/paddle/fluid/framework/ir/node.h
@@ -27,7 +27,7 @@ namespace ir {
class Node {
public:
enum class Type { kOperation, kVariable };
- static const char kControlDepVarName[];
+ static constexpr char kControlDepVarName[] = "__control_var";
explicit Node(const std::string& name, Type type)
: name_(name), var_desc_(nullptr), op_desc_(nullptr), type_(type) {}
@@ -41,8 +41,7 @@ class Node {
explicit Node(OpDesc* op_desc)
: name_(op_desc->Type()),
var_desc_(nullptr),
- op_desc_(new OpDesc(*op_desc)), // TODO(panyx0718) the pointer in the
- // original OpDesc might go out.
+ op_desc_(new OpDesc(*op_desc, op_desc->Block())),
type_(Type::kOperation) {}
Type NodeType() const { return type_; }
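
The `kControlDepVarName` change above follows the usual pre-C++17 pattern for `static constexpr` data members: the in-class initializer supplies the value, but an out-of-line definition (now the only thing left in `node.cc`) is still needed in one translation unit in case the member is ODR-used. A standalone sketch of the pattern (illustrative names only):

```
// sketch.h
struct NodeSketch {
  static constexpr char kControlDepVarName[] = "__control_var";  // declaration + value
};

// sketch.cc
constexpr char NodeSketch::kControlDepVarName[];  // definition, initializer not repeated
```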
diff --git a/paddle/fluid/framework/op_proto_maker.cc b/paddle/fluid/framework/op_proto_maker.cc
index 2288c7fe6609a765612b468d69ad35101b92b384..9c289243c5a27839f628f3e143ce0363bf75a0b1 100644
--- a/paddle/fluid/framework/op_proto_maker.cc
+++ b/paddle/fluid/framework/op_proto_maker.cc
@@ -129,6 +129,10 @@ void OpProtoAndCheckerMaker::operator()(proto::OpProto* proto,
"Optimized for variable")
.SetDefault({});
+  AddAttr<std::vector<std::string>>(OpCreationCallstackAttrName(),
+                                    "Callstack for Op Creation.")
+ .SetDefault({});
+
Validate();
}
diff --git a/paddle/fluid/framework/op_proto_maker.h b/paddle/fluid/framework/op_proto_maker.h
index 80970291c9c234f1306162f4ffa3c2528f88c35f..cb9c8ab1704ab867182079db31a34125669c645b 100644
--- a/paddle/fluid/framework/op_proto_maker.h
+++ b/paddle/fluid/framework/op_proto_maker.h
@@ -39,6 +39,7 @@ class OpProtoAndCheckerMaker {
public:
static const char *OpRoleAttrName() { return "op_role"; }
static const char *OpRoleVarAttrName() { return "op_role_var"; }
+ static const char *OpCreationCallstackAttrName() { return "op_callstack"; }
void operator()(proto::OpProto *proto, OpAttrChecker *attr_checker);
diff --git a/paddle/fluid/framework/operator.cc b/paddle/fluid/framework/operator.cc
index d04f7744961b2561977f4d36d0073a97557043da..9f8cdf1aeba43d30676cb2adf80a77cab86547a8 100644
--- a/paddle/fluid/framework/operator.cc
+++ b/paddle/fluid/framework/operator.cc
@@ -11,15 +11,17 @@ distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
-#include <gflags/gflags.h>
-#include <glog/logging.h>
-
+#include "paddle/fluid/framework/operator.h"
#include <algorithm>
-
+#include <sstream>
+#include <string>
+#include <vector>
+#include "gflags/gflags.h"
+#include "glog/logging.h"
#include "paddle/fluid/framework/data_transform.h"
#include "paddle/fluid/framework/executor.h"
#include "paddle/fluid/framework/lod_tensor.h"
-#include "paddle/fluid/framework/operator.h"
+#include "paddle/fluid/framework/op_proto_maker.h"
#include "paddle/fluid/framework/shape_inference.h"
#include "paddle/fluid/framework/var_type.h"
#include "paddle/fluid/platform/profiler.h"
@@ -127,19 +129,48 @@ static LoD GetLoD(const Scope& scope, const std::string& name) {
}
void OperatorBase::Run(const Scope& scope, const platform::Place& place) {
- VLOG(4) << place << " " << DebugStringEx(&scope);
- if (platform::is_gpu_place(place)) {
+ try {
+ if (VLOG_IS_ON(4)) {
+ VLOG(4) << place << " " << DebugStringEx(&scope);
+ }
+ if (platform::is_gpu_place(place)) {
#ifndef PADDLE_WITH_CUDA
- PADDLE_THROW("Cannot run operator on place %s", place);
+ PADDLE_THROW("Cannot run operator on place %s", place);
#else
-    auto dev_id = boost::get<platform::CUDAPlace>(place).device;
- platform::SetDeviceId(dev_id);
+      auto dev_id = boost::get<platform::CUDAPlace>(place).device;
+ platform::SetDeviceId(dev_id);
#endif
+ }
+ platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
+ platform::RecordEvent record_event(Type(), pool.Get(place));
+ RunImpl(scope, place);
+ if (VLOG_IS_ON(3)) {
+ VLOG(3) << place << " " << DebugStringEx(&scope);
+ }
+ } catch (platform::EnforceNotMet exception) {
+ if (Attrs().count("sub_block") != 0) {
+ throw exception;
+ }
+
+    auto& callstack = Attr<std::vector<std::string>>(
+ OpProtoAndCheckerMaker::OpCreationCallstackAttrName());
+
+ if (callstack.empty()) {
+ throw exception;
+ }
+ std::ostringstream sout;
+ sout << "Invoke operator " << Type() << " error.\n";
+ sout << "Python Callstacks: \n";
+ for (auto& line : callstack) {
+ sout << line;
+ }
+ sout << "C++ Callstacks: \n";
+ sout << exception.err_str_;
+ exception.err_str_ = sout.str();
+ throw exception;
+ } catch (...) {
+ std::rethrow_exception(std::current_exception());
}
- platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
- platform::RecordEvent record_event(Type(), pool.Get(place));
- RunImpl(scope, place);
- VLOG(3) << place << " " << DebugStringEx(&scope);
}
bool OperatorBase::HasInputs(const std::string& name) const {
@@ -167,7 +198,7 @@ const std::vector& OperatorBase::Inputs(
}
bool OperatorBase::HasOutputs(const std::string& name) const {
- if (outputs_.find(name) != outputs_.end()) {
+ if (outputs_.end() != outputs_.find(name)) {
return true;
} else {
return false;
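
The consumer side of `op_callstack` is shown above: `OperatorBase::Run` catches `EnforceNotMet` and prepends the stored Python callstack to the error message. The producer side is not in this diff (it lives in the Python frontend). As an illustration only, attaching a callstack to an `OpDesc` from C++ would look roughly like this; `AttachCallstack` and `frames` are hypothetical names.

```
#include <string>
#include <vector>
#include "paddle/fluid/framework/op_desc.h"
#include "paddle/fluid/framework/op_proto_maker.h"

// Hypothetical helper: store the creation callstack on the op so that
// OperatorBase::Run can print it when the op later fails.
void AttachCallstack(paddle::framework::OpDesc* op,
                     const std::vector<std::string>& frames) {
  op->SetAttr(
      paddle::framework::OpProtoAndCheckerMaker::OpCreationCallstackAttrName(),
      frames);
}
```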
diff --git a/paddle/fluid/framework/selected_rows.cc b/paddle/fluid/framework/selected_rows.cc
index c202b0a5be1f891b8ae7b11e1f6e0ce02fcba588..a4319ffabb04f39437b76d97845e021ef9de66d3 100644
--- a/paddle/fluid/framework/selected_rows.cc
+++ b/paddle/fluid/framework/selected_rows.cc
@@ -139,7 +139,7 @@ int64_t SelectedRows::AutoGrownIndex(int64_t key, bool auto_grown) {
}
auto write_iter = id_to_index_.find(key);
if (write_iter == id_to_index_.end()) {
- size_t row_num = rows_.size();
+ int row_num = rows_.size();
if (row_num == value_->dims()[0]) {
rwlock_->UNLock();
PADDLE_THROW("selected rows is full, then length exceed %d", row_num);
@@ -182,7 +182,7 @@ void SelectedRows::Get(const framework::Tensor& ids, framework::Tensor* value,
PADDLE_ENFORCE_EQ(value_width, value->numel() / value->dims()[0],
"output tensor should have the same shape with table "
"except the dims[0].");
- for (size_t i = 0; i < ids.numel(); ++i) {
+ for (int i = 0; i < ids.numel(); ++i) {
    int64_t index = AutoGrownIndex(ids.data<int64_t>()[i], auto_grown);
framework::VisitDataType(
framework::ToDataType(value_->type()),
diff --git a/paddle/fluid/inference/analysis/analyzer_tester.cc b/paddle/fluid/inference/analysis/analyzer_tester.cc
index 52f5c4f5aea387c947ee909b79dae8a1bfb89d82..baa7600283a9bc0b81833b419a2ea64692ed2203 100644
--- a/paddle/fluid/inference/analysis/analyzer_tester.cc
+++ b/paddle/fluid/inference/analysis/analyzer_tester.cc
@@ -23,6 +23,8 @@
DEFINE_string(infer_ditu_rnn_model, "", "model path for ditu RNN");
DEFINE_string(infer_ditu_rnn_data, "", "data path for ditu RNN");
+DEFINE_int32(batch_size, 10, "batch size.");
+DEFINE_int32(repeat, 1, "Running the inference program repeat times.");
namespace paddle {
namespace inference {
@@ -92,7 +94,7 @@ struct DataRecord {
size_t batch_iter{0};
size_t batch_size{1};
DataRecord() = default;
- DataRecord(const std::string &path, int batch_size = 1)
+ explicit DataRecord(const std::string &path, int batch_size = 1)
: batch_size(batch_size) {
Load(path);
}
@@ -165,7 +167,6 @@ struct DataRecord {
};
void PrepareInputs(std::vector *input_slots, DataRecord *data,
int batch_size) {
- // DataRecord data(FLAGS_datapath, batch_size);
PaddleTensor lod_attention_tensor, init_zero_tensor, lod_tensor_tensor,
week_tensor, minute_tensor;
lod_attention_tensor.name = "data_lod_attention";
@@ -174,28 +175,33 @@ void PrepareInputs(std::vector *input_slots, DataRecord *data,
week_tensor.name = "week";
minute_tensor.name = "minute";
auto one_batch = data->NextBatch();
- // clang-format off
-  std::vector<int> rnn_link_data_shape
-      ({static_cast<int>(one_batch.rnn_link_data.size()), static_cast<int>(one_batch.rnn_link_data.front().size())});
+  std::vector<int> rnn_link_data_shape(
+      {static_cast<int>(one_batch.rnn_link_data.size()),
+       static_cast<int>(one_batch.rnn_link_data.front().size())});
lod_attention_tensor.shape.assign({1, 2});
lod_attention_tensor.lod.assign({one_batch.lod1, one_batch.lod2});
init_zero_tensor.shape.assign({batch_size, 15});
init_zero_tensor.lod.assign({one_batch.lod3});
lod_tensor_tensor.shape = rnn_link_data_shape;
lod_tensor_tensor.lod.assign({one_batch.lod1});
- week_tensor.shape.assign({(int) one_batch.rnn_week_datas.size(), (int) one_batch.rnn_week_datas.front().size()});
+ // clang-format off
+ week_tensor.shape.assign(
+      {static_cast<int>(one_batch.rnn_week_datas.size()),
+       static_cast<int>(one_batch.rnn_week_datas.front().size())});
week_tensor.lod.assign({one_batch.lod3});
- minute_tensor.shape.assign({(int) one_batch.rnn_minute_datas.size(),
- (int) one_batch.rnn_minute_datas.front().size()});
+ minute_tensor.shape.assign(
+      {static_cast<int>(one_batch.rnn_minute_datas.size()),
+       static_cast<int>(one_batch.rnn_minute_datas.front().size())});
minute_tensor.lod.assign({one_batch.lod3});
+ // clang-format on
// assign data
-  TensorAssignData(&lod_attention_tensor, std::vector<std::vector<float>>({{0, 0}}));
+ TensorAssignData(&lod_attention_tensor,
+                   std::vector<std::vector<float>>({{0, 0}}));
std::vector tmp_zeros(batch_size * 15, 0.);
TensorAssignData(&init_zero_tensor, {tmp_zeros});
TensorAssignData(&lod_tensor_tensor, one_batch.rnn_link_data);
TensorAssignData(&week_tensor, one_batch.rnn_week_datas);
TensorAssignData(&minute_tensor, one_batch.rnn_minute_datas);
- // clang-format on
// Set inputs.
auto init_zero_tensor1 = init_zero_tensor;
init_zero_tensor1.name = "hidden_init";
@@ -231,12 +237,9 @@ std::string DescribeTensor(const PaddleTensor &tensor) {
os << "\n";
os << " - data: ";
- // clang-format off
- int dim = std::accumulate(tensor.shape.begin(),
- tensor.shape.end(),
- 1,
- [](int a, int b) { return a * b; }); // clang-format on
- for (size_t i = 0; i < dim; i++) {
+ int dim = std::accumulate(tensor.shape.begin(), tensor.shape.end(), 1,
+ [](int a, int b) { return a * b; });
+ for (int i = 0; i < dim; i++) {
    os << static_cast<float *>(tensor.data.data())[i] << " ";
}
os << '\n';
@@ -300,13 +303,16 @@ void TestDituRNNPrediction(const std::string &model_path,
for (int i = 0; i < num_times; i++) {
predictor->Run(input_slots, &outputs);
}
- LOG(INFO) << "time/batch: " << timer.toc() / num_times;
+ LOG(INFO) << "===========profile result===========";
+ LOG(INFO) << "batch_size: " << batch_size << ", repeat: " << num_times
+ << ", latency: " << timer.toc() / num_times << "ms";
+ LOG(INFO) << "=====================================";
for (auto &out : outputs) {
size_t size = std::accumulate(out.shape.begin(), out.shape.end(), 1,
[](int a, int b) { return a * b; });
  float *data = static_cast<float *>(out.data.data());
- for (int i = 0;
+ for (size_t i = 0;
i < std::min(sizeof(ditu_rnn_target_data) / sizeof(float), size);
i++) {
EXPECT_NEAR(data[i], ditu_rnn_target_data[i], 1e-3);
@@ -336,7 +342,7 @@ TEST(Analyzer, SupportIRPass) {
// Directly infer with the original model.
TEST(Analyzer, DituRNN_without_analysis) {
TestDituRNNPrediction(FLAGS_infer_ditu_rnn_model, FLAGS_infer_ditu_rnn_data,
- 10, false, false);
+ FLAGS_batch_size, false, false, FLAGS_repeat);
}
// Inference with the original model with the analysis turned on, the analysis
@@ -344,14 +350,14 @@ TEST(Analyzer, DituRNN_without_analysis) {
TEST(Analyzer, DituRNN_with_analysis) {
LOG(INFO) << "ditu rnn with analysis";
TestDituRNNPrediction(FLAGS_infer_ditu_rnn_model, FLAGS_infer_ditu_rnn_data,
- 10, true, false, 1);
+ FLAGS_batch_size, true, false, FLAGS_repeat);
}
// Inference with analysis and IR. The IR module will fuse some large kernels.
TEST(Analyzer, DituRNN_with_analysis_with_IR) {
LOG(INFO) << "ditu rnn with analysis and IR fuse";
TestDituRNNPrediction(FLAGS_infer_ditu_rnn_model, FLAGS_infer_ditu_rnn_data,
- 10, true, true, 1);
+ FLAGS_batch_size, true, true, FLAGS_repeat);
}
} // namespace analysis
diff --git a/paddle/fluid/operators/attention_lstm_op.cc b/paddle/fluid/operators/attention_lstm_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1cb65346ee2b755b48f8dd8f1456a32861c3a0b6
--- /dev/null
+++ b/paddle/fluid/operators/attention_lstm_op.cc
@@ -0,0 +1,422 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/fluid/operators/attention_lstm_op.h"
+#include
+#include
+#include "paddle/fluid/operators/math/blas.h"
+#include "paddle/fluid/operators/math/cpu_vec.h"
+#include "paddle/fluid/operators/math/fc_compute.h"
+#include "paddle/fluid/platform/cpu_info.h"
+
+namespace paddle {
+namespace operators {
+
+void AttentionLSTMOp::InferShape(framework::InferShapeContext* ctx) const {
+ PADDLE_ENFORCE(ctx->HasInput("X"),
+ "Input(X) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasInput("C0"),
+ "Input(C0) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasInput("LSTMWeight"),
+ "Input(LSTMWeight) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasInput("LSTMBias"),
+ "Input(LSTMBias) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasInput("AttentionWeight"),
+ "Input(AttentionWeight) of AttentionLSTM should not be null.");
+
+ PADDLE_ENFORCE(ctx->HasOutput("Hidden"),
+ "Output(Hidden) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasOutput("Cell"),
+ "Output(Cell) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasOutput("AttentionedX"),
+ "Output(AttentionedX) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasOutput("AttentionFCOut"),
+ "Output(AttentionFCOut) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasOutput("LSTMX"),
+ "Output(LSTMX) of AttentionLSTM should not be null.");
+ PADDLE_ENFORCE(ctx->HasOutput("LSTMOUT"),
+ "Output(LSTMOUT) of AttentionLSTM should not be null.");
+
+ auto x_dims = ctx->GetInputDim("X");
+ const int M = x_dims[1];
+ PADDLE_ENFORCE_EQ(x_dims.size(), 2, "Input(X)'s rank must be 2.");
+
+ auto w_dims = ctx->GetInputDim("LSTMWeight");
+ const int D = w_dims[1] / 4;
+ PADDLE_ENFORCE_EQ(w_dims.size(), 2, "Input(LSTMWeight)'s rank must be 2.");
+ PADDLE_ENFORCE_EQ(w_dims[0], D + M,
+                    "LSTMWeight dims should be (%d + %d) * %d.", D, M, 4 * D);
+
+ auto b_dims = ctx->GetInputDim("LSTMBias");
+ PADDLE_ENFORCE_EQ(b_dims.size(), 2, "Input(LSTMBias)'s rank must be 2.");
+ PADDLE_ENFORCE_EQ(b_dims[0], 1, "LSTMBias dims should be 1 x %d.", 4 * D);
+ PADDLE_ENFORCE_EQ(b_dims[1], 4 * D, "LSTMBias dims should be 1 x %d.", 4 * D);
+
+ auto c_dims = ctx->GetInputDim("C0");
+ PADDLE_ENFORCE_EQ(c_dims.size(), 2, "Input(C0)'s rank must be 2.");
+ PADDLE_ENFORCE_EQ(c_dims[1], D, "C0 dims should be N x %d.", D);
+ if (ctx->HasInput("H0")) {
+ auto h_dims = ctx->GetInputDim("H0");
+ PADDLE_ENFORCE(h_dims == c_dims,
+ "The dimension of Input(H0) and Input(C0) "
+ "should be the same.");
+ }
+
+ auto atten_w_dims = ctx->GetInputDim("AttentionWeight");
+ PADDLE_ENFORCE_EQ(atten_w_dims.size(), 2,
+ "Input(AttentionWeight)'s rank must be 2.");
+ PADDLE_ENFORCE_EQ(atten_w_dims[0], M + D,
+ "AttentionWeight shapes must be (%d + %d) * 1.", M, D);
+ PADDLE_ENFORCE_EQ(atten_w_dims[1], 1,
+ "AttentionWeight shapes must be (%d + %d) * 1.", M, D);
+ if (ctx->HasInput("AttentionBias")) {
+ auto atten_b_dims = ctx->GetInputDim("AttentionBias");
+ PADDLE_ENFORCE_EQ(atten_b_dims.size(), 2,
+ "Input(AttentionBias)'s rank must be 2.");
+ PADDLE_ENFORCE_EQ(atten_b_dims[0], 1,
+ "AttentionBias shapes must be 1 * 1.");
+ PADDLE_ENFORCE_EQ(atten_b_dims[1], 1,
+ "AttentionBias shapes must be 1 * 1.");
+ }
+
+ if (ctx->HasInput("AttentionScalar")) {
+ auto dims = ctx->GetInputDim("AttentionScalar");
+ PADDLE_ENFORCE_EQ(dims.size(), 2,
+ "Input(AttentionScalar)'s rank must be 2.");
+ PADDLE_ENFORCE_EQ(dims[0], 1, "AttentionScalar shapes must be 1 * 1.");
+ PADDLE_ENFORCE_EQ(dims[1], 1, "AttentionScalar shapes must be 1 * 1.");
+ }
+
+ if (ctx->HasInput("AttentionScalarBias")) {
+ auto dims = ctx->GetInputDim("AttentionScalarBias");
+ PADDLE_ENFORCE(
+ ctx->HasInput("AttentionScalar"),
+ "AttentionScalar should not be null when have AttentionScalarBias.");
+ PADDLE_ENFORCE_EQ(dims.size(), 2,
+ "Input(AttentionScalarBias)'s rank must be 2.");
+ PADDLE_ENFORCE_EQ(dims[0], 1, "AttentionScalarBias shapes must be 1 * 1.");
+ PADDLE_ENFORCE_EQ(dims[1], 1, "AttentionScalarBias shapes must be 1 * 1.");
+ }
+
+ framework::DDim out_dims({x_dims[0], D});
+ ctx->SetOutputDim("Hidden", out_dims);
+ ctx->SetOutputDim("Cell", out_dims);
+ ctx->SetOutputDim("AttentionedX", {x_dims[0], 1});
+ ctx->SetOutputDim("LSTMX", {1, M});
+ ctx->SetOutputDim("LSTMOUT", {1, 4 * D});
+ // AttentionFCOut should be reshape as (maxseqlen,1) in runtime
+ ctx->ShareLoD("X", "Hidden");
+ ctx->ShareLoD("X", "Cell");
+}
+
+framework::OpKernelType AttentionLSTMOp::GetExpectedKernelType(
+ const framework::ExecutionContext& ctx) const {
+ return framework::OpKernelType(
+      framework::ToDataType(ctx.Input<framework::LoDTensor>("X")->type()),
+ ctx.device_context());
+}
+
+void AttentionLSTMOpMaker::Make() {
+ AddInput("X",
+ "(LoDTensor) the input is a LodTensor, which support "
+ "variable-time length input sequence. The underlying tensor in "
+ "this LoDTensor is a matrix with shape (T X M), where T is the "
+ "total time steps in this mini-batch, M is the dim size of x.");
+ AddInput("C0",
+ "(Tensor) LSTM C0"
+ "This is a tensor with shape (N x D), where N is the batch size, D "
+ "is the gate size."
+ "C0 is necessary because of attention.");
+ AddInput("H0",
+ "(Tensor, optional) LSTM H0"
+ "This is a tensor with shape (N x D), where N is the "
+ "batch size and D is the gate size.")
+ .AsDispensable();
+ AddInput("AttentionWeight",
+ "(Tensor) the weights of attention fc. Always relu the fc result."
+ "The shape is ((M+D) x 1), where M is the dim size of x, D is the "
+ "gate size of LSTM.");
+ AddInput("AttentionBias",
+ "(Tensor, optional) the bias of attention fc."
+ "The shape is (1 x 1)")
+ .AsDispensable();
+ AddInput("AttentionScalar",
+ "(Tensor, optional) the scalar on the result of attentioned fc. "
+ "Always relu the Scalar."
+ "The shape is (1 x 1)")
+ .AsDispensable();
+ AddInput("AttentionScalarBias",
+ "(Tensor, optional) the scalar bias of attention fc."
+ "The shape is (1 x 1)")
+ .AsDispensable();
+ AddInput("LSTMWeight",
+ "(Tensor) the combined weight of LSTM"
+ " - The shape is ((D+M) x 4D), where D is the hidden gate size, M "
+ "is the dim size of x"
+ " - Weight = {W_forget, W_input, W_output, W_cell}");
+ AddInput("LSTMBias",
+ "(Tensor) the combined bias of LSTM, shape (1x4D)."
+ "Note: we should add the bias of hidden and context accorindg to "
+ "the same gate: "
+ "{B_forget, B_input, B_output, B_cell}");
+ AddOutput("Hidden",
+ "(LoDTensor) (same as LSTMOp) the hidden state of LSTM operator. "
+ "The shape is (T x D), and lod is the same with the `Input`.");
+ AddOutput("Cell",
+ "(LoDTensor) (same as LSTMOp) the cell state of LSTM operator. "
+ "The shape is (T x D), and lod is the same with the `Input`.");
+ AddOutput("AttentionedX",
+ "(Tensor) shape is (T x 1), the result after X * AttentionWeight,"
+ " where T is the total time steps in this mini-batch,"
+ " D is the hidden size.")
+ .AsIntermediate();
+ AddOutput("AttentionFCOut",
+ "(Tensor) (max_seq_len, 1), compute at each step.")
+ .AsIntermediate();
+ AddOutput("LSTMX",
+ "(Tensor) the input X of LSTM for each step."
+ "Shape is (1 x M), where M is the x frame size")
+ .AsIntermediate();
+ AddOutput(
+ "LSTMOUT",
+ "(Tensor) the output of LSTM X(1*(D+M))* weight((D+M)*4D) for each step."
+ "Shape is (1 x 4D), where M is the x frame size")
+ .AsIntermediate();
+  AddAttr<std::string>("gate_activation",
+ "(string, default: sigmoid)"
+ "The activation for input gate, forget gate and output "
+ "gate, `sigmoid` by default.")
+ .SetDefault("sigmoid")
+ .InEnum({"sigmoid", "tanh", "relu", "identity"});
+  AddAttr<std::string>("cell_activation",
+ "(string, default: tanh)"
+ "The activation for cell output, `tanh` by defalut.")
+ .SetDefault("tanh")
+ .InEnum({"sigmoid", "tanh", "relu", "identity"});
+  AddAttr<std::string>("candidate_activation",
+ "(string, default: tanh)"
+ "The activation for candidate hidden state, "
+ "`tanh` by default.")
+ .SetDefault("tanh")
+ .InEnum({"sigmoid", "tanh", "relu", "identity"});
+ AddComment(R"DOC(
+Attention Long-Short Term Memory (LSTM) Operator.
+
+Attention part:
+concat( x(seqlen * M), expand( cell_t-1(1,D) ) ) => tmp(seqlen*(M+D))
+
+tmp(seqlen*(M+D)) * fc((M+D)*1) => fcout(seqlen*1) with bias, relu
+
+fcout(seqlen*1) * scalar => fcout(seqlen*1) with bias, relu
+
+dotmul and sum pool ( fcout(seqlen*1), x(seqlen * M) ) => lstm_x_t(1, M)
+
+LSTM part:
+use lstm_x_t as input and compute as standard LSTM.
+
+)DOC");
+}
+
+// y[i] = (x[i] + bias[0]) > 0 ? (x[i] + bias[0]) : 0;
+template <typename T>
+inline void bias_relu(const int n, const T* x, const T* bias, T* y) {
+ if (bias) {
+ for (int i = 0; i < n; ++i) {
+ y[i] = x[i] + bias[0];
+ }
+    math::vec_relu<T>(n, y, y);
+ } else {
+    math::vec_relu<T>(n, x, y);
+ }
+}
+
+template <typename DeviceContext, typename T>
+inline void vec_softmax(const math::BlasT<DeviceContext, T>& blas, const int n,
+ const T* x, T* y) {
+ T scalar = x[0];
+ // max
+ for (int i = 1; i < n; ++i) {
+ scalar = scalar < x[i] ? x[i] : scalar;
+ }
+
+ // sub
+ for (int i = 0; i < n; ++i) {
+ y[i] = x[i] - scalar;
+ }
+
+ // exp
+ blas.VEXP(n, y, y);
+
+ // sum
+ scalar = T(0);
+ for (int i = 0; i < n; ++i) {
+ scalar += y[i];
+ }
+
+ // scale
+  blas.SCAL(n, static_cast<T>(1) / scalar, y);
+}
+
+template <typename T>
+class AttentionLSTMKernel : public framework::OpKernel<T> {
+ public:
+ void Compute(const framework::ExecutionContext& ctx) const override {
+ using DeviceContext = paddle::platform::CPUDeviceContext;
+
+    auto* x = ctx.Input<LoDTensor>("X");
+    auto* h0 = ctx.Input<Tensor>("H0");
+    auto* c0 = ctx.Input<Tensor>("C0");
+    auto* atten_w = ctx.Input<Tensor>("AttentionWeight");
+    auto* atten_b = ctx.Input<Tensor>("AttentionBias");
+    auto* atten_scalar = ctx.Input<Tensor>("AttentionScalar");
+    auto* atten_scalar_bias = ctx.Input<Tensor>("AttentionScalarBias");
+    auto* lstm_w = ctx.Input<Tensor>("LSTMWeight");
+    auto* lstm_b = ctx.Input<Tensor>("LSTMBias");
+
+    auto* hidden_out = ctx.Output<LoDTensor>("Hidden");
+    auto* cell_out = ctx.Output<LoDTensor>("Cell");
+    auto* atted_x = ctx.Output<Tensor>("AttentionedX");
+    auto* fc_out = ctx.Output<Tensor>("AttentionFCOut");
+    auto* lstm_x = ctx.Output<Tensor>("LSTMX");
+    auto* lstm_out = ctx.Output<Tensor>("LSTMOUT");
+
+ // some shape should be reshape here since infershape can not get lod info
+ auto x_lod = x->lod();
+ const int N = x_lod[0].size() - 1; // batch size
+ auto x_dims = x->dims(); // T x M
+ auto w_dims = lstm_w->dims(); // (D+M) x 4D
+ const int total_T = x_dims[0];
+ const int M = x_dims[1]; // x frame size
+ const int D = w_dims[1] / 4; // gate frame size
+ const int D2 = D * 2;
+ const int D3 = D * 3;
+ const int D4 = w_dims[1];
+ int max_seq_len = x_lod[0][1];
+ for (int i = 1; i < N; ++i) {
+ int len = x_lod[0][i + 1] - x_lod[0][i];
+ max_seq_len = max_seq_len < len ? len : max_seq_len;
+ }
+ PADDLE_ENFORCE_EQ(x_lod.size(), 1, "Input(X)'s lod size must be 1.");
+ PADDLE_ENFORCE_EQ(c0->dims()[0], N, "C0 dims should be %d x %d.", N, D);
+ fc_out->Resize({max_seq_len, 1});
+
+    math::VecActivations<T> act_functor;
+    std::function<void(const int, const T *, T *)> act_gate, act_cell, act_cand;
+    act_gate = act_functor(ctx.Attr<std::string>("gate_activation"));
+    act_cell = act_functor(ctx.Attr<std::string>("cell_activation"));
+    act_cand = act_functor(ctx.Attr<std::string>("candidate_activation"));
+
+    const T* x_data = x->data<T>();
+    const T* h0_data = h0 ? h0->data<T>() : NULL;
+    const T* c0_data = c0->data<T>();
+    const T* lstm_w_data = lstm_w->data<T>();
+    const T* lstm_b_data = lstm_b->data<T>();
+    const T* atten_w_data = atten_w->data<T>();
+    const T* atten_b_data = atten_b ? atten_b->data<T>() : NULL;
+    const T* atten_scalar_data = atten_scalar ? atten_scalar->data<T>() : NULL;
+    const T* atten_scalar_bias_data =
+        atten_scalar_bias ? atten_scalar_bias->data<T>() : NULL;
+
+    T* hidden_out_data = hidden_out->mutable_data<T>(ctx.GetPlace());
+    T* cell_out_data = cell_out->mutable_data<T>(ctx.GetPlace());
+    T* atted_x_data = atted_x->mutable_data<T>(ctx.GetPlace());
+    T* fc_out_data = fc_out->mutable_data<T>(ctx.GetPlace());
+    T* lstm_x_data = lstm_x->mutable_data<T>(ctx.GetPlace());
+    T* lstm_out_data = lstm_out->mutable_data<T>(ctx.GetPlace());
+
+ // x(TxM) * fc (Mx1) part of atten_wgt(M+D)x1
+    auto blas = math::GetBlas<DeviceContext, T>(ctx);
+    math::FCCompute<DeviceContext, T>(blas, total_T, 1, M, x_data, atten_w_data,
+                                      atted_x_data, atten_b_data);
+
+ const T* cur_atten_x_data = atted_x_data;
+ const T* cur_x_data = x_data;
+ const T* prev_cell_data = NULL;
+ const T* prev_hidden_data = NULL;
+ T* cur_cell_out_data = cell_out_data;
+ T* cur_hidden_out_data = hidden_out_data;
+ for (int i = 0; i < N; ++i) {
+ int seq_len = x_lod[0][i + 1] - x_lod[0][i];
+ prev_cell_data = c0_data + i * D;
+ prev_hidden_data = h0_data ? h0_data + i * D : NULL;
+ for (int step = 0; step < seq_len; ++step) {
+ /// 1. compute attention vector
+ // 1a. prev_cell(1xD) * fc(D) rest part of atten_wgt
+ T prev_cell_bias = blas.DOT(D, prev_cell_data, atten_w_data + M);
+ // 1b. add cell bias and relu
+        bias_relu<T>(seq_len, cur_atten_x_data, &prev_cell_bias, fc_out_data);
+ // 1c. fc scalar
+ if (atten_scalar_data) {
+ blas.SCAL(seq_len, *atten_scalar_data, fc_out_data);
+          bias_relu<T>(seq_len, fc_out_data, atten_scalar_bias_data,
+                       fc_out_data);
+ }
+ // 1d. softmax
+        vec_softmax<DeviceContext, T>(blas, seq_len, fc_out_data, fc_out_data);
+ // mul x(seq_len*M) and sum pool
+        math::FCCompute<DeviceContext, T>(blas, 1, M, seq_len, fc_out_data,
+                                          cur_x_data, lstm_x_data);
+
+ /// 2. compute LSTM step
+ // lstm weight : concat[forget , input , output , tilde]
+ // shape : (D + M) x (4 * D)
+ // fc inputX(1xM) * weightX(M*(4D)) => 1 x 4D
+ blas.MatMul(1, D4, M, lstm_x_data, lstm_w_data + D * D4, lstm_out_data);
+ if (prev_hidden_data) {
+        blas.GEMM(CblasNoTrans, CblasNoTrans, 1, D4, D, static_cast<T>(1),
+                  prev_hidden_data, D, lstm_w_data, D4, static_cast<T>(1),
+ lstm_out_data, D4);
+ }
+ // since input is 1xM, so can use add bias
+ blas.VADD(D4, lstm_b_data, lstm_out_data, lstm_out_data);
+
+ // gate act: sigmoid
+ act_gate(D3, lstm_out_data, lstm_out_data);
+ // candicate act: tanh
+ act_cand(D, lstm_out_data + D3, lstm_out_data + D3);
+
+ // a = forget * prev_cell
+ blas.VMUL(D, lstm_out_data, prev_cell_data, lstm_out_data);
+
+ // b = input * tilde
+ blas.VMUL(D, lstm_out_data + D, lstm_out_data + D3, lstm_out_data + D);
+
+ // cell_out = a + b
+ blas.VADD(D, lstm_out_data, lstm_out_data + D, cur_cell_out_data);
+
+ // state act tanh(cell_out) * output_gate
+ act_cell(D, cur_cell_out_data, lstm_out_data);
+ blas.VMUL(D, lstm_out_data, lstm_out_data + D2, cur_hidden_out_data);
+
+ prev_hidden_data = cur_hidden_out_data;
+ prev_cell_data = cur_cell_out_data;
+ cur_cell_out_data = cur_cell_out_data + D;
+ cur_hidden_out_data = cur_hidden_out_data + D;
+ }
+ cur_x_data = cur_x_data + seq_len * M;
+ cur_atten_x_data = cur_atten_x_data + seq_len;
+ }
+ }
+};
+
+} // namespace operators
+} // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OPERATOR(attention_lstm, ops::AttentionLSTMOp,
+ ops::AttentionLSTMOpMaker,
+                  paddle::framework::DefaultGradOpDescMaker<true>);
+
+REGISTER_OP_CPU_KERNEL(attention_lstm, ops::AttentionLSTMKernel<float>,
+                       ops::AttentionLSTMKernel<double>);
diff --git a/paddle/fluid/operators/attention_lstm_op.h b/paddle/fluid/operators/attention_lstm_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..6ede3a7f3c96dd2d13d7c5c19816647e16a3c8d0
--- /dev/null
+++ b/paddle/fluid/operators/attention_lstm_op.h
@@ -0,0 +1,41 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/fluid/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+using LoDTensor = framework::LoDTensor;
+using Tensor = framework::Tensor;
+
+class AttentionLSTMOp : public framework::OperatorWithKernel {
+ public:
+ using framework::OperatorWithKernel::OperatorWithKernel;
+
+ void InferShape(framework::InferShapeContext* ctx) const override;
+
+ protected:
+ framework::OpKernelType GetExpectedKernelType(
+ const framework::ExecutionContext& ctx) const override;
+};
+
+class AttentionLSTMOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+ void Make() override;
+};
+
+} // namespace operators
+} // namespace paddle
diff --git a/paddle/fluid/operators/concat_op.h b/paddle/fluid/operators/concat_op.h
index a496301526f58875ff51aeaa5b2094c3c656531c..78be2e1e1f06c7a518e35a770c1dc9581b2d10fe 100644
--- a/paddle/fluid/operators/concat_op.h
+++ b/paddle/fluid/operators/concat_op.h
@@ -62,9 +62,21 @@ class ConcatGradKernel : public framework::OpKernel {
void Compute(const framework::ExecutionContext& ctx) const {
auto* out_grad =
ctx.Input(framework::GradVarName("Out"));
-    auto ins = ctx.MultiInput<framework::Tensor>("X");
+    auto ins = ctx.MultiInput<framework::LoDTensor>("X");
    auto out_var_names = ctx.Outputs(framework::GradVarName("X"));
-    auto outs = ctx.MultiOutput<framework::Tensor>(framework::GradVarName("X"));
+    auto outs =
+        ctx.MultiOutput<framework::LoDTensor>(framework::GradVarName("X"));
+
+ {
+ auto dx = outs;
+ auto x = ins;
+ for (size_t i = 0; i < dx.size(); ++i) {
+ if (dx[i] != nullptr) {
+ dx[i]->set_lod(x[i]->lod());
+ }
+ }
+ }
+
  int64_t axis = static_cast<int64_t>(ctx.Attr<int>("axis"));
// get output tensor that the name is not kEmptyVarName
diff --git a/paddle/fluid/operators/elementwise_add_mkldnn_op.cc b/paddle/fluid/operators/elementwise_add_mkldnn_op.cc
index c86cd57316078778e5930c9b524b931d523028d7..9ad82aec8182d6ba06b67391d71317a3d0df1833 100644
--- a/paddle/fluid/operators/elementwise_add_mkldnn_op.cc
+++ b/paddle/fluid/operators/elementwise_add_mkldnn_op.cc
@@ -137,9 +137,10 @@ class EltwiseAddMKLDNNKernel : public framework::OpKernel {
};
template <typename T>
-class EltwiseAddMKLDNNGradKernel : public framework::OpKernel<T> {
+class EltwiseAddMKLDNNGradKernel : public ElemwiseGradKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
+    ElemwiseGradKernel<T>::Compute(ctx);
using Tensor = framework::Tensor;
auto* dout = ctx.Input(framework::GradVarName("Out"));
diff --git a/paddle/fluid/operators/elementwise_add_op.h b/paddle/fluid/operators/elementwise_add_op.h
index 5356105e2e551c0528694091608fc7585dce66d2..c60cb1f92e99329d52f6ed39dccde406a5f83563 100644
--- a/paddle/fluid/operators/elementwise_add_op.h
+++ b/paddle/fluid/operators/elementwise_add_op.h
@@ -15,6 +15,7 @@ limitations under the License. */
#pragma once
#include "paddle/fluid/framework/eigen.h"
+#include "paddle/fluid/operators/elementwise_op.h"
#include "paddle/fluid/operators/elementwise_op_function.h"
#include "paddle/fluid/operators/math/blas.h"
@@ -136,9 +137,11 @@ elementwise_add_grad(const framework::ExecutionContext& ctx,
}
template <typename DeviceContext, typename T>
-class ElementwiseAddGradKernel : public framework::OpKernel<T> {
+class ElementwiseAddGradKernel : public ElemwiseGradKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
+    ElemwiseGradKernel<T>::Compute(ctx);
+
using Tensor = framework::Tensor;
auto* dout = ctx.Input(framework::GradVarName("Out"));
diff --git a/paddle/fluid/operators/elementwise_div_op.h b/paddle/fluid/operators/elementwise_div_op.h
index 95649ac46e6bd41b9e1a865794cdec3ae1e6e247..41a7950bf0c598507c0fda48c6a43f2fd38c41d2 100644
--- a/paddle/fluid/operators/elementwise_div_op.h
+++ b/paddle/fluid/operators/elementwise_div_op.h
@@ -14,8 +14,8 @@ limitations under the License. */
#pragma once
+#include "paddle/fluid/operators/elementwise_op.h"
#include "paddle/fluid/operators/elementwise_op_function.h"
-
namespace paddle {
namespace operators {
@@ -53,9 +53,10 @@ struct DivGradDY {
};
template <typename DeviceContext, typename T>
-class ElementwiseDivGradKernel : public framework::OpKernel<T> {
+class ElementwiseDivGradKernel : public ElemwiseGradKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
+    ElemwiseGradKernel<T>::Compute(ctx);
using Tensor = framework::Tensor;
auto* x = ctx.Input("X");
diff --git a/paddle/fluid/operators/elementwise_max_op.h b/paddle/fluid/operators/elementwise_max_op.h
index 527a18ee3ba88a158a13266a7fbcdafe59ec69d9..bfb5c931958b4ca890ea720af42dad91d5625abb 100644
--- a/paddle/fluid/operators/elementwise_max_op.h
+++ b/paddle/fluid/operators/elementwise_max_op.h
@@ -14,6 +14,7 @@ limitations under the License. */
#pragma once
+#include "paddle/fluid/operators/elementwise_op.h"
#include "paddle/fluid/operators/elementwise_op_function.h"
namespace paddle {
@@ -55,9 +56,10 @@ struct MaxGradDy {
};
template <typename DeviceContext, typename T>
-class ElementwiseMaxGradKernel : public framework::OpKernel<T> {
+class ElementwiseMaxGradKernel : public ElemwiseGradKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
+    ElemwiseGradKernel<T>::Compute(ctx);
using Tensor = framework::Tensor;
auto* x = ctx.Input("X");
diff --git a/paddle/fluid/operators/elementwise_min_op.h b/paddle/fluid/operators/elementwise_min_op.h
index d4e5831463f3e54c72789b6876ea696cf1b4ef4b..db035ffb52e619b337c8190af4ed0e155aaac48d 100644
--- a/paddle/fluid/operators/elementwise_min_op.h
+++ b/paddle/fluid/operators/elementwise_min_op.h
@@ -14,8 +14,8 @@ limitations under the License. */
#pragma once
+#include "paddle/fluid/operators/elementwise_op.h"
#include "paddle/fluid/operators/elementwise_op_function.h"
-
namespace paddle {
namespace operators {
@@ -55,9 +55,10 @@ struct MinGradDy {
};
template <typename DeviceContext, typename T>
-class ElementwiseMinGradKernel : public framework::OpKernel<T> {
+class ElementwiseMinGradKernel : public ElemwiseGradKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
+    ElemwiseGradKernel<T>::Compute(ctx);
using Tensor = framework::Tensor;
auto* x = ctx.Input("X");
diff --git a/paddle/fluid/operators/elementwise_mul_op.h b/paddle/fluid/operators/elementwise_mul_op.h
index 329d2d129a9ea450cd211f0c6d2ea5e37ff8491d..4437da4d95f97b5cbbca1650badf9710c26b4380 100644
--- a/paddle/fluid/operators/elementwise_mul_op.h
+++ b/paddle/fluid/operators/elementwise_mul_op.h
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
+#include "paddle/fluid/operators/elementwise_op.h"
#include "paddle/fluid/operators/elementwise_op_function.h"
#include "paddle/fluid/operators/math/blas.h"
@@ -84,9 +85,10 @@ struct MulGradDY {
};
template <typename DeviceContext, typename T>
-class ElementwiseMulGradKernel : public framework::OpKernel<T> {
+class ElementwiseMulGradKernel : public ElemwiseGradKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
+    ElemwiseGradKernel<T>::Compute(ctx);
using Tensor = framework::Tensor;
auto* x = ctx.Input("X");
diff --git a/paddle/fluid/operators/elementwise_op.h b/paddle/fluid/operators/elementwise_op.h
index d8a12e800ad733800c1ec333f15d31d4dcd1a3a5..a79b900b9801e6b80e4433a9acdd4dab6c34859d 100644
--- a/paddle/fluid/operators/elementwise_op.h
+++ b/paddle/fluid/operators/elementwise_op.h
@@ -205,6 +205,20 @@ class ElementwiseOpExplicitGrad : public ElementwiseOpGrad {
}
};
+template <typename T>
+class ElemwiseGradKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& context) const override {
+    auto* dx =
+        context.Output<framework::LoDTensor>(framework::GradVarName("X"));
+    if (dx != nullptr) {
+      auto& dout =
+          *context.Input<framework::LoDTensor>(framework::GradVarName("Out"));
+ dx->set_lod(dout.lod());
+ }
+ }
+};
+
} // namespace operators
} // namespace paddle
diff --git a/paddle/fluid/operators/elementwise_sub_op.h b/paddle/fluid/operators/elementwise_sub_op.h
index 11c7e3fe628001f095836a788f2bcc7c4ee7ad4b..3385df0897700d37d60d8804a01db777ebc02a7e 100644
--- a/paddle/fluid/operators/elementwise_sub_op.h
+++ b/paddle/fluid/operators/elementwise_sub_op.h
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
+#include "paddle/fluid/operators/elementwise_op.h"
#include "paddle/fluid/operators/elementwise_op_function.h"
namespace paddle {
@@ -50,9 +51,10 @@ struct SubGradDY {
};
template <typename DeviceContext, typename T>
-class ElementwiseSubGradKernel : public framework::OpKernel<T> {
+class ElementwiseSubGradKernel : public ElemwiseGradKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
+    ElemwiseGradKernel<T>::Compute(ctx);
using Tensor = framework::Tensor;
auto* dout = ctx.Input(framework::GradVarName("Out"));
diff --git a/paddle/fluid/operators/fusion_lstm_op.h b/paddle/fluid/operators/fusion_lstm_op.h
index 39dc09b4d116193399d8ac9a51e88dbc3e239918..7f79601602348ac454fc6c0cefcba0643ad8e6e2 100644
--- a/paddle/fluid/operators/fusion_lstm_op.h
+++ b/paddle/fluid/operators/fusion_lstm_op.h
@@ -13,7 +13,6 @@ See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
-// #include
#include "paddle/fluid/framework/op_registry.h"
namespace paddle {
diff --git a/paddle/fluid/operators/math/blas.h b/paddle/fluid/operators/math/blas.h
index 8dcf7c99f3860789dee834787eeb8b7ad4cc3530..da185d93c09f9b06bd5968b9c8e93176f9ef014b 100644
--- a/paddle/fluid/operators/math/blas.h
+++ b/paddle/fluid/operators/math/blas.h
@@ -90,6 +90,11 @@ class Blas {
void GEMM(bool transA, bool transB, int M, int N, int K, T alpha, const T* A,
int lda, const T* B, int ldb, T beta, T* C, int ldc) const;
+  template <typename T>
+ void GEMM(CBLAS_TRANSPOSE transA, CBLAS_TRANSPOSE transB, int M, int N, int K,
+ T alpha, const T* A, int lda, const T* B, int ldb, T beta, T* C,
+ int ldc) const;
+
#ifdef PADDLE_WITH_MKLML
template
T* GEMM_ALLOC(const CBLAS_IDENTIFIER id, const int M, const int N,
@@ -109,6 +114,10 @@ class Blas {
void GEMM_FREE(T* data) const;
#endif
+  template <typename T>
+ void MatMul(const int M, const int N, const int K, const T* A, const T* B,
+ T* C) const;
+
template
void MatMul(const framework::Tensor& mat_a, bool trans_a,
const framework::Tensor& mat_b, bool trans_b, T alpha,
@@ -140,10 +149,19 @@ class Blas {
template
void VCOPY(int n, const T* x, T* y) const;
+  template <typename T>
+ void VEXP(int n, const T* x, T* y) const;
+
template
void GEMV(bool trans_a, int M, int N, T alpha, const T* A, const T* B, T beta,
T* C) const;
+  template <typename T>
+ T DOT(int n, const T* x, const T* y) const;
+
+  template <typename T>
+ void SCAL(int n, const T a, T* x) const;
+
template
void BatchedGEMM(CBLAS_TRANSPOSE transA, CBLAS_TRANSPOSE transB, int M, int N,
int K, T alpha, const T* A, const T* B, T beta, T* C,
@@ -215,11 +233,26 @@ class BlasT : private Blas {
Base()->template VCOPY(args...);
}
+  template <typename... ARGS>
+ void VEXP(ARGS... args) const {
+ Base()->template VEXP(args...);
+ }
+
template
void GEMV(ARGS... args) const {
Base()->template GEMV(args...);
}
+  template <typename... ARGS>
+ T DOT(ARGS... args) const {
+ return Base()->template DOT(args...);
+ }
+
+  template <typename... ARGS>
+ void SCAL(ARGS... args) const {
+ Base()->template SCAL(args...);
+ }
+
template
void BatchedGEMM(ARGS... args) const {
Base()->template BatchedGEMM(args...);
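
Illustrative usage of the newly added `DOT`, `SCAL` and `VEXP` helpers (a sketch only, mirroring how `attention_lstm_op.cc` uses them; `NormalizedExp` is a made-up name and a CPU execution context `ctx` is assumed):

```
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/operators/math/blas.h"

template <typename T>
void NormalizedExp(const paddle::framework::ExecutionContext& ctx, int n,
                   const T* x, T* y) {
  auto blas = paddle::operators::math::GetBlas<
      paddle::platform::CPUDeviceContext, T>(ctx);
  T s = blas.DOT(n, x, x);                 // s = sum_i x[i] * x[i]
  blas.VCOPY(n, x, y);                     // y = x
  blas.SCAL(n, static_cast<T>(1) / s, y);  // y = x / s
  blas.VEXP(n, y, y);                      // y[i] = exp(y[i])
}
```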
diff --git a/paddle/fluid/operators/math/blas_impl.h b/paddle/fluid/operators/math/blas_impl.h
index dc77b6d793702458a22a2f59b68e9d9f2c23b4ff..e1df78d11e41c5f74e244643f40c6d0581fa6a4a 100644
--- a/paddle/fluid/operators/math/blas_impl.h
+++ b/paddle/fluid/operators/math/blas_impl.h
@@ -73,6 +73,16 @@ struct CBlas {
platform::dynload::cblas_sgemv(args...);
}
+  template <typename... ARGS>
+ static float DOT(ARGS... args) {
+ return platform::dynload::cblas_sdot(args...);
+ }
+
+  template <typename... ARGS>
+ static void SCAL(ARGS... args) {
+ platform::dynload::cblas_sscal(args...);
+ }
+
template
static void GEMM_BATCH(ARGS... args) {
platform::dynload::cblas_sgemm_batch(args...);
@@ -87,6 +97,11 @@ struct CBlas {
static void VMUL(ARGS... args) {
platform::dynload::vsMul(args...);
}
+
+  template <typename... ARGS>
+ static void VEXP(ARGS... args) {
+ platform::dynload::vsExp(args...);
+ }
};
template <>
@@ -138,6 +153,16 @@ struct CBlas {
platform::dynload::cblas_dgemv(args...);
}
+  template <typename... ARGS>
+ static double DOT(ARGS... args) {
+ return platform::dynload::cblas_ddot(args...);
+ }
+
+  template <typename... ARGS>
+ static void SCAL(ARGS... args) {
+ platform::dynload::cblas_dscal(args...);
+ }
+
template
static void GEMM_BATCH(ARGS... args) {
platform::dynload::cblas_dgemm_batch(args...);
@@ -152,6 +177,11 @@ struct CBlas {
static void VMUL(ARGS... args) {
platform::dynload::vdMul(args...);
}
+
+ template <typename... ARGS>
+ static void VEXP(ARGS... args) {
+ platform::dynload::vdExp(args...);
+ }
};
#else
@@ -210,6 +240,9 @@ struct CBlas<platform::float16> {
PADDLE_THROW("float16 SMM_GEMM not supported on CPU");
}
static void VMUL(...) { PADDLE_THROW("float16 VMUL not supported on CPU"); }
+ static void VEXP(...) { PADDLE_THROW("float16 VEXP not supported on CPU"); }
+ static void DOT(...) { PADDLE_THROW("float16 DOT not supported on CPU"); };
+ static void SCAL(...) { PADDLE_THROW("float16 SCAL not supported on CPU"); };
#ifdef PADDLE_WITH_MKLML
static void GEMM_BATCH(...) {
PADDLE_THROW("float16 GEMM_BATCH not supported on CPU");
@@ -217,64 +250,6 @@ struct CBlas<platform::float16> {
#endif
};
-template <typename T>
-inline bool UseXSMM(const int &m, const int &n, const int &k, bool transa,
- bool transb, const T &alpha, const T &beta) {
-#ifdef PADDLE_WITH_LIBXSMM
- // Refer to https://github.com/hfp/libxsmm/blob/master/README.md
- // But the threshold is custom
- constexpr int LIBXSMM_THRESHOLD = 20 * 20 * 20;
- if (m * n * k > LIBXSMM_THRESHOLD || transa || transb ||
- std::abs(alpha - static_cast<T>(1) >
- std::numeric_limits<T>::epsilon()) ||
- std::abs(beta) > std::numeric_limits<T>::epsilon()) {
- return false;
- } else {
- return true;
- }
-#endif
- return false;
-}
-
-template <>
-inline bool UseXSMM<platform::float16>(const int &m, const int &n, const int &k,
- bool transa, bool transb,
- const platform::float16 &alpha,
- const platform::float16 &beta) {
- return false;
-}
-
-template <typename T>
-inline void GEMM_WARP(CBLAS_ORDER order, CBLAS_TRANSPOSE transA,
- CBLAS_TRANSPOSE transB, int M, int N, int K, T alpha,
- const T *A, int lda, const T *B, int ldb, T beta, T *C,
- int ldc) {
-#ifdef PADDLE_WITH_LIBXSMM
- if (UseXSMM(M, N, K, transA != CblasNoTrans, transB != CblasNoTrans, alpha,
- beta)) {
- // Note: SMM use ColMajor
- const char transa = 'N';
- const char transb = 'N';
- CBlas<T>::SMM_GEMM(&transa, &transb, &N, &M, &K, &alpha, B, &ldb, A, &lda,
- &beta, C, &ldc);
- return;
- }
-#endif
-
-#ifdef PADDLE_MKL_SPLIT_GEMM
- constexpr int bs = 2;
- if (M % bs == 0 && transA == CblasNoTrans && transB == CblasNoTrans) {
- for (int off = 0; off < M; off += bs) {
- CBlas<T>::GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, bs, N, K, alpha,
- A + off * lda, lda, B, ldb, beta, C + off * ldb, ldc);
- }
- return;
- }
-#endif
- CBlas<T>::GEMM(CblasRowMajor, transA, transB, M, N, K, alpha, A, lda, B, ldb,
- beta, C, ldc);
-}
-
#ifdef PADDLE_WITH_MKLML
template <>
template <typename T>
@@ -319,8 +294,8 @@ void Blas<platform::CPUDeviceContext>::GEMM(CBLAS_TRANSPOSE transA,
int lda = (transA == CblasNoTrans) ? K : M;
int ldb = (transB == CblasNoTrans) ? N : K;
int ldc = N;
- GEMM_WARP(CblasRowMajor, transA, transB, M, N, K, alpha, A, lda, B, ldb,
- beta, C, ldc);
+ CBlas<T>::GEMM(CblasRowMajor, transA, transB, M, N, K, alpha, A, lda, B, ldb,
+ beta, C, ldc);
}
template <>
@@ -329,9 +304,20 @@ void Blas<platform::CPUDeviceContext>::GEMM(bool transA, bool transB, int M,
int N, int K, T alpha, const T *A,
int lda, const T *B, int ldb,
T beta, T *C, int ldc) const {
- GEMM_WARP(CblasRowMajor, transA == false ? CblasNoTrans : CblasTrans,
- transB == false ? CblasNoTrans : CblasTrans, M, N, K, alpha, A,
- lda, B, ldb, beta, C, ldc);
+ CBlas<T>::GEMM(CblasRowMajor, transA == false ? CblasNoTrans : CblasTrans,
+ transB == false ? CblasNoTrans : CblasTrans, M, N, K, alpha, A,
+ lda, B, ldb, beta, C, ldc);
+}
+
+template <>
+template <typename T>
+void Blas<platform::CPUDeviceContext>::GEMM(CBLAS_TRANSPOSE transA,
+ CBLAS_TRANSPOSE transB, int M,
+ int N, int K, T alpha, const T *A,
+ int lda, const T *B, int ldb,
+ T beta, T *C, int ldc) const {
+ CBlas<T>::GEMM(CblasRowMajor, transA, transB, M, N, K, alpha, A, lda, B, ldb,
+ beta, C, ldc);
}
template
@@ -399,6 +385,47 @@ void Blas<platform::CPUDeviceContext>::VMUL(int n, const T *x, const T *y,
#endif
}
+template <>
+template <typename T>
+void Blas<platform::CPUDeviceContext>::VEXP(int n, const T *x, T *y) const {
+#ifdef PADDLE_WITH_MKLML
+ CBlas<T>::VEXP(n, x, y);
+#else
+ // TODO: check whether OpenBLAS provides a vectorized exp
+ for (int i = 0; i < n; ++i) {
+ y[i] = std::exp(x[i]);
+ }
+#endif
+}
+
+template <>
+template <typename T>
+T Blas<platform::CPUDeviceContext>::DOT(int n, const T *x, const T *y) const {
+#ifdef PADDLE_WITH_MKLML
+ return CBlas<T>::DOT(n, x, 1, y, 1);
+#else
+ // TODO: check whether OpenBLAS provides cblas_dot
+ T sum = 0;
+ for (int i = 0; i < n; ++i) {
+ sum += x[i] * y[i];
+ }
+ return sum;
+#endif
+}
+
+template <>
+template <typename T>
+void Blas<platform::CPUDeviceContext>::SCAL(int n, const T a, T *x) const {
+#ifdef PADDLE_WITH_MKLML
+ CBlas<T>::SCAL(n, a, x, 1);
+#else
+ // TODO: check whether OpenBLAS provides cblas_scal
+ for (int i = 0; i < n; ++i) {
+ x[i] = a * x[i];
+ }
+#endif
+}
+
template <>
template <typename T>
void Blas<platform::CPUDeviceContext>::GEMV(bool trans_a, int M, int N, T alpha,
@@ -440,6 +467,42 @@ void Blas<platform::CPUDeviceContext>::BatchedGEMM(
#endif
}
+template <typename DeviceContext>
+template <typename T>
+void Blas<DeviceContext>::MatMul(const int M, const int N, const int K,
+ const T *A, const T *B, T *C) const {
+ this->template GEMM<T>(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
+ static_cast<T>(1), A, K, B, N, static_cast<T>(0), C,
+ N);
+}
+
+template <>
+template <typename T>
+void Blas<platform::CPUDeviceContext>::MatMul(const int M, const int N,
+ const int K, const T *A,
+ const T *B, T *C) const {
+#ifdef PADDLE_WITH_LIBXSMM
+ // Refer to https://github.com/hfp/libxsmm/blob/master/README.md
+ // but the threshold used here is custom: constexpr int LIBXSMM_THRESHOLD = 20 * 20 * 20;
+
+ // Since the matrices here are very small, each call is already fast,
+ // and checking if (M * N * K < LIBXSMM_THRESHOLD) would only add overhead,
+ // so use xsmm directly.
+ // Note: SMM use ColMajor
+ const char transa = 'N';
+ const char transb = 'N';
+ const T alpha = static_cast<T>(1);
+ const T beta = static_cast<T>(0);
+ CBlas<T>::SMM_GEMM(&transa, &transb, &N, &M, &K, &alpha, B, &N, A, &K, &beta,
+ C, &N);
+ return;
+#endif
+
+ CBlas<T>::GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
+ static_cast<T>(1), A, K, B, N, static_cast<T>(0), C, N);
+}
+
template <typename DeviceContext>
template <typename T>
void Blas<DeviceContext>::MatMul(const framework::Tensor &mat_a,
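
Beyond the MKL fast paths (`vsExp`/`vdExp`, `cblas_?dot`, `cblas_?scal`), the implementations above fall back to plain scalar loops, and the new fixed-shape `MatMul` always issues a row-major product with `lda = K` and `ldb = ldc = N`. The standalone check below (plain C++, no Paddle headers; the 2x3 and 3x2 values are made up for illustration) mirrors those fallback semantics:

```cpp
#include <cassert>
#include <vector>

int main() {
  // Row-major 2x3 (A) times 3x2 (B), as MatMul(M, N, K, A, B, C) computes it.
  const int M = 2, N = 2, K = 3;
  std::vector<double> A = {1, 2, 3, 4, 5, 6};     // M x K
  std::vector<double> B = {7, 8, 9, 10, 11, 12};  // K x N
  std::vector<double> C(M * N, 0.0);              // M x N

  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < K; ++k)
        C[i * N + j] += A[i * K + k] * B[k * N + j];
  assert(C[0] == 58 && C[1] == 64 && C[2] == 139 && C[3] == 154);

  // DOT fallback: plain sum of products; SCAL fallback: in-place scaling.
  double dot = 0.0;
  for (int i = 0; i < K; ++i) dot += A[i] * B[i];  // 1*7 + 2*8 + 3*9 = 50
  assert(dot == 50.0);
  for (int i = 0; i < K; ++i) A[i] *= 2.0;         // SCAL with a = 2
  assert(A[0] == 2 && A[1] == 4 && A[2] == 6);
  return 0;
}
```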
diff --git a/paddle/fluid/operators/math/concat.cc b/paddle/fluid/operators/math/concat.cc
index 55c8a472aca7fe700ef6a3f96bed1496d7b12b80..fbe7c2978385401b35765101c87387ff727be4e0 100644
--- a/paddle/fluid/operators/math/concat.cc
+++ b/paddle/fluid/operators/math/concat.cc
@@ -71,7 +71,7 @@ class ConcatGradFunctor<platform::CPUDeviceContext, T> {
public:
void operator()(const platform::CPUDeviceContext& context,
const framework::Tensor& input,
- const std::vector<const framework::Tensor*>& ref_inputs,
+ const std::vector<const framework::LoDTensor*>& ref_inputs,
const int axis, std::vector<framework::Tensor>* outputs) {
// TODO(zcd): Add input data validity checking
size_t num = outputs->size();
diff --git a/paddle/fluid/operators/math/concat.cu b/paddle/fluid/operators/math/concat.cu
index 5863d74fca21de8b77bc208fb95d8fd52562f7a7..820e73e779720e4f76168e0a84a254ef645784ee 100644
--- a/paddle/fluid/operators/math/concat.cu
+++ b/paddle/fluid/operators/math/concat.cu
@@ -189,7 +189,7 @@ class ConcatGradFunctor<platform::CUDADeviceContext, T> {
public:
void operator()(const platform::CUDADeviceContext& context,
const framework::Tensor& input,
- const std::vector<const framework::Tensor*>& ref_inputs,
+ const std::vector<const framework::LoDTensor*>& ref_inputs,
const int axis, std::vector<framework::Tensor>* outputs) {
// TODO(zcd): Add input data validity checking
int o_num = outputs->size();
diff --git a/paddle/fluid/operators/math/concat.h b/paddle/fluid/operators/math/concat.h
index 9e080f2e8be23768dcea47b577043beef37b2eaf..e5d7d860b371677b3cfc67a57390bdee0d0ecc37 100644
--- a/paddle/fluid/operators/math/concat.h
+++ b/paddle/fluid/operators/math/concat.h
@@ -15,7 +15,7 @@ limitations under the License. */
#pragma once
#include <vector>
#include "paddle/fluid/framework/data_type.h"
-#include "paddle/fluid/framework/tensor.h"
+#include "paddle/fluid/framework/lod_tensor.h"
namespace paddle {
namespace operators {
@@ -57,7 +57,7 @@ template <typename DeviceContext, typename T>
class ConcatGradFunctor {
public:
void operator()(const DeviceContext& context, const framework::Tensor& input,
- const std::vector<const framework::Tensor*>& ref_inputs,
+ const std::vector<const framework::LoDTensor*>& ref_inputs,
const int axis, std::vector<framework::Tensor>* outputs);
};
diff --git a/paddle/fluid/operators/math/cpu_vec.h b/paddle/fluid/operators/math/cpu_vec.h
new file mode 100644
index 0000000000000000000000000000000000000000..48c0da0e368a0fe6efcd758536e5659eeee26f7e
--- /dev/null
+++ b/paddle/fluid/operators/math/cpu_vec.h
@@ -0,0 +1,105 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include
+#include "paddle/fluid/platform/cpu_info.h"
+
+namespace paddle {
+namespace operators {
+namespace math {
+
+#define SIGMOID_THRESHOLD_MIN -40.0
+#define SIGMOID_THRESHOLD_MAX 13.0
+#define EXP_MAX_INPUT 40.0
+
+template <typename T>