Commit e01c770c authored by W wanghaoshuang

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into average_model

## Design Doc: Distributed Lookup Table Operator
A lookup table operator in PaddlePaddle where the table could be too
large to fit in the memory of a single computer.
## Background
A lookup table operator is widely used in deep learning to learn the
representation, or the
[*embedding*](http://www.cs.toronto.edu/~fritz/absps/ieee-lre.pdf), of
symbols.
### The Forward Algorithm
The forward algorithm of the lookup table is a multiplication of the
input vector x and the lookup table matrix W:
$$y = x * W$$
When x is a sparse vector of symbols, the above multiplication
simplifies into looking up the rows of W that correspond to the symbols
in x, denoted by W(x). Note that W could be huge and might not fit in
memory, so we need a distributed storage service that supports the
lookup of rows.
The following figure illustrates the multiplication of an x with two
non-zero elements, i.e., two symbols, by a lookup table W:
![lookup table](./src/lookup_table.png)
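For intuition, here is a minimal NumPy sketch of the sparse forward lookup. It is illustrative only, not PaddlePaddle code; `symbol_ids` and `W_x` are made-up names:

```python
import numpy as np

vocab_size, embed_dim = 10000, 8
W = np.random.rand(vocab_size, embed_dim).astype("float32")

symbol_ids = np.array([3, 42])   # the two non-zero entries of the sparse x
W_x = W[symbol_ids]              # W(x): gather rows instead of a full matmul
print(W_x.shape)                 # (2, 8)
```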
### The Backward Algorithm
The backward algorithm computes W'(x) using W(x). W'(x) has the same
size as W(x) and is much smaller than W.
To optimize W given W', we can do a simple SGD update:
$$W = W - \lambda * W'$$
or apply a more sophisticated algorithm that relies on both W' and W:
$$W = f(W, W')$$
The following figure illustrates the backward pass of the lookup
operator: ![lookup table training](./src/lookup_table_training.png)
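A minimal sketch of such a sparse update, assuming plain SGD and using NumPy for illustration (not the actual PaddlePaddle optimizer code):

```python
import numpy as np

vocab_size, embed_dim, learning_rate = 10000, 8, 0.1
W = np.random.rand(vocab_size, embed_dim).astype("float32")
symbol_ids = np.array([3, 42])            # symbols seen in the forward pass

# W'(x): gradients exist only for the rows that were looked up
grad_W_x = np.random.rand(len(symbol_ids), embed_dim).astype("float32")

# simple SGD: update only the touched rows; the rest of W stays untouched,
# which is why W'(x) is much smaller than W
W[symbol_ids] -= learning_rate * grad_W_x
```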
## Distributed Storage Service
The forward algorithm requires a distributed storage service for W.
The backward algorithm prefers that the storage system can apply the
optimization algorithm to W. The following two sections describe two
solutions: the former does not require the storage service to perform
optimization, while the latter does.
### Storage Service Doesn't Optimize
In this design, we use highly-optimized distributed storage, e.g.,
memcached, as the storage service, and we run the optimization
algorithm on parameter servers of PaddlePaddle. The following figure
illustrates the training process.
<!--
Note: please update the following URL when updating this digraph.
<img src='https://g.gravizo.com/svg?
digraph G {
rankdir="LR";
subgraph cluster1 {
P1 [label="pserver 1"];
P2 [label="pserver 2"];
T1 [label="trainer 1"];
T2 [label="trainer 2"];
T3 [label="trainer 3"];
}
KV [label="memcached"];
T1 -> P1;
T1 -> P2;
T2 -> P1;
T2 -> P2;
T3 -> P1;
T3 -> P2;
P1 -> KV [color=gray, weight=0.1];
KV -> P1 [color=gray, weight=0.1];
P2 -> KV [color=gray, weight=0.1];
KV -> P2 [color=gray, weight=0.1];
KV -> T1 [color=gray, weight=0.1];
KV -> T2 [color=gray, weight=0.1];
KV -> T3 [color=gray, weight=0.1];
}
)
'/>
-->
<img src='https://g.gravizo.com/svg?%20digraph%20G%20{%20rankdir=%22LR%22;%20subgraph%20cluster1%20{%20P1%20[label=%22pserver%201%22];%20P2%20[label=%22pserver%202%22];%20T1%20[label=%22trainer%201%22];%20T2%20[label=%22trainer%202%22];%20T3%20[label=%22trainer%203%22];%20}%20KV%20[label=%22memcached%22];%20T1%20-%3E%20P1;%20T1%20-%3E%20P2;%20T2%20-%3E%20P1;%20T2%20-%3E%20P2;%20T3%20-%3E%20P1;%20T3%20-%3E%20P2;%20P1%20-%3E%20KV%20[color=gray,%20weight=0.1];%20KV%20-%3E%20P1%20[color=gray,%20weight=0.1];%20P2%20-%3E%20KV%20[color=gray,%20weight=0.1];%20KV%20-%3E%20P2%20[color=gray,%20weight=0.1];%20KV%20-%3E%20T1%20[color=gray,%20weight=0.1];%20KV%20-%3E%20T2%20[color=gray,%20weight=0.1];%20KV%20-%3E%20T3%20[color=gray,%20weight=0.1];%20}'/>
Each trainer runs the forward and backward passes using its local
data:
1. In the forward pass, when a trainer runs the forward algorithm of a
lookup operator, it retrieves W(x) from the storage service.
1. The trainer computes W'(x) in the backward pass using W(x).
During the global update process (sketched after this list):
1. Each trainer uploads its W'(x) to parameter servers.
1. The parameter server runs the optimization algorithm, e.g., the
Adam optimization algorithm, which requires that
1. The parameter server retrieves W(x) from memcached, and
1. The parameter server pushes $\Delta W(x)=f(W(x), \lambda \sum_j
W'(x))$ to memcached, where $f$ denotes the optimization
algorithm.
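The per-row flow on a parameter server might look like the following sketch, where `kv_get` and `kv_put` are hypothetical stand-ins for the distributed storage client (e.g., a memcached wrapper, not a real PaddlePaddle API) and plain SGD stands in for $f$:

```python
import numpy as np

def apply_update(kv_get, kv_put, symbol_ids, summed_grad_rows, lam=0.1):
    for row_id, grad_row in zip(symbol_ids, summed_grad_rows):
        w_row = kv_get(row_id)          # 1. retrieve the W(x) row from storage
        delta = -lam * grad_row         # 2. f(W(x), lambda * sum_j W'(x)), here plain SGD
        kv_put(row_id, w_row + delta)   # 3. push the updated row back

# toy in-memory "storage service" standing in for memcached
store = {3: np.zeros(8), 42: np.zeros(8)}
apply_update(store.get, store.__setitem__, [3, 42],
             [np.ones(8), np.ones(8)])
```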
### Storage Service Does Optimize
This design is very similar to the above one, except that the
optimization algorithm $f$ runs on the storage service.
- Pro: parameter servers do not retrieve W(x) from the storage
service, thus saving half of the network communication.
- Con: the storage service needs to be able to run the optimization
algorithm.
## Conclusion
Let us implement the "storage service does not optimize" solution first,
at least as a baseline, because it is easier to reuse a well-optimized
distributed storage service like memcached. We can do the "storage
service does optimize" solution later or in parallel; if implemented
carefully, it should outperform the former.
......@@ -103,7 +103,7 @@ In computability theory, a system of data-manipulation rules, such as a programm
There are two ways to execute a Fluid program. When a program is executed, it creates a protobuf message [`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/a91efdde6910ce92a78e3aa7157412c4c88d9ee8/paddle/framework/framework.proto#L145) that describes the process and is conceptually like an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
There is a C++ class [`Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h), which runs a `ProgramDesc`, similar to how an interpreter runs a Python program.
There is a C++ class [`Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/executor.h), which runs a `ProgramDesc`, similar to how an interpreter runs a Python program.
Fluid is moving in the direction of a compiler, which is explained in [fluid_compiler.md](fluid_compiler.md).
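A minimal sketch of this execution model, assuming the `paddle.fluid` Python API of that period (`fluid.layers.data`, `fluid.layers.fc`, `fluid.Executor`); this is an illustration of the Executor/`ProgramDesc` relationship, not code from this commit:

```python
import numpy as np
import paddle.fluid as fluid

# Build a tiny program; the layer calls record operators into a ProgramDesc.
x = fluid.layers.data(name='x', shape=[4], dtype='float32')
y = fluid.layers.fc(input=x, size=2)

# The Executor interprets the ProgramDesc, much like an interpreter runs Python.
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
out, = exe.run(fluid.default_main_program(),
               feed={'x': np.random.rand(1, 4).astype('float32')},
               fetch_list=[y])
print(out.shape)
```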
......
......@@ -871,3 +871,67 @@ TEST(ChannelHolder, ChannelHolderDestroyUnblocksSendersTest) {
ch->Reset<int>(0);
ChannelHolderDestroyUnblockSenders(ch, false);
}
// This tests closing a ChannelHolder many times.
void ChannelHolderManyTimesClose(ChannelHolder *ch) {
const int num_threads = 15;
std::thread t[num_threads];
bool thread_ended[num_threads];
// Launches threads that try to send data to the channel.
for (size_t i = 0; i < num_threads / 3; i++) {
thread_ended[i] = false;
t[i] = std::thread(
[&](bool *ended) {
int data = 10;
ch->Send(&data);
*ended = true;
},
&thread_ended[i]);
}
// Launches threads that try to receive data from the channel.
for (size_t i = num_threads / 3; i < 2 * num_threads / 3; i++) {
thread_ended[i] = false;
t[i] = std::thread(
[&](bool *p) {
int data;
if (ch->Receive(&data)) {
EXPECT_EQ(data, 10);
}
*p = true;
},
&thread_ended[i]);
}
// Launches threads that try to close the channel.
for (size_t i = 2 * num_threads / 3; i < num_threads; i++) {
thread_ended[i] = false;
t[i] = std::thread(
[&](bool *p) {
if (!ch->IsClosed()) {
ch->close();
}
*p = true;
},
&thread_ended[i]);
}
std::this_thread::sleep_for(std::chrono::milliseconds(100)); // wait
// Verify that all threads are unblocked
for (size_t i = 0; i < num_threads; i++) {
EXPECT_EQ(thread_ended[i], true);
}
EXPECT_TRUE(ch->IsClosed());
// delete the channel
delete ch;
for (size_t i = 0; i < num_threads; i++) t[i].join();
}
TEST(ChannelHolder, ChannelHolderManyTimesCloseTest) {
// Check for Buffered Channel
ChannelHolder *ch = new ChannelHolder();
ch->Reset<int>(10);
ChannelHolderManyTimesClose(ch);
}
......@@ -113,10 +113,11 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id,
// and feed_holder_name. Raise exception when any mismatch is found.
// Return true if the block has feed operators and holder of matching info.
static bool has_feed_operators(
BlockDesc* block, std::map<std::string, const LoDTensor*>& feed_targets,
const BlockDesc& block,
std::map<std::string, const LoDTensor*>& feed_targets,
const std::string& feed_holder_name) {
size_t feed_count = 0;
for (auto* op : block->AllOps()) {
for (auto* op : block.AllOps()) {
if (op->Type() == kFeedOpType) {
feed_count++;
PADDLE_ENFORCE_EQ(op->Input("X")[0], feed_holder_name,
......@@ -135,7 +136,7 @@ static bool has_feed_operators(
"The number of feed operators should match 'feed_targets'");
// When feed operators are present, so should be the feed_holder
auto var = block->FindVar(feed_holder_name);
auto var = block.FindVar(feed_holder_name);
PADDLE_ENFORCE_NOT_NULL(var, "Block should already have a '%s' variable",
feed_holder_name);
PADDLE_ENFORCE_EQ(var->GetType(), proto::VarType::FEED_MINIBATCH,
......@@ -153,10 +154,10 @@ static bool has_feed_operators(
// and fetch_holder_name. Raise exception when any mismatch is found.
// Return true if the block has fetch operators and holder of matching info.
static bool has_fetch_operators(
BlockDesc* block, std::map<std::string, LoDTensor*>& fetch_targets,
const BlockDesc& block, std::map<std::string, LoDTensor*>& fetch_targets,
const std::string& fetch_holder_name) {
size_t fetch_count = 0;
for (auto* op : block->AllOps()) {
for (auto* op : block.AllOps()) {
if (op->Type() == kFetchOpType) {
fetch_count++;
PADDLE_ENFORCE_EQ(op->Output("Out")[0], fetch_holder_name,
......@@ -175,7 +176,7 @@ static bool has_fetch_operators(
"The number of fetch operators should match 'fetch_targets'");
// When fetch operators are present, so should be the fetch_holder
auto var = block->FindVar(fetch_holder_name);
auto var = block.FindVar(fetch_holder_name);
PADDLE_ENFORCE_NOT_NULL(var, "Block should already have a '%s' variable",
fetch_holder_name);
PADDLE_ENFORCE_EQ(var->GetType(), proto::VarType::FETCH_LIST,
......@@ -192,10 +193,19 @@ void Executor::Run(const ProgramDesc& program, Scope* scope,
const std::string& feed_holder_name,
const std::string& fetch_holder_name) {
platform::RecordBlock b(kProgramId);
auto* copy_program = new ProgramDesc(program);
bool has_feed_ops =
has_feed_operators(program.Block(0), feed_targets, feed_holder_name);
bool has_fetch_ops =
has_fetch_operators(program.Block(0), fetch_targets, fetch_holder_name);
ProgramDesc* copy_program = const_cast<ProgramDesc*>(&program);
if (!has_feed_ops || !has_fetch_ops) {
copy_program = std::unique_ptr<ProgramDesc>(new ProgramDesc(program)).get();
}
auto* global_block = copy_program->MutableBlock(0);
if (!has_feed_operators(global_block, feed_targets, feed_holder_name)) {
if (!has_feed_ops) {
// create feed_holder variable
auto* feed_holder = global_block->Var(feed_holder_name);
feed_holder->SetType(proto::VarType::FEED_MINIBATCH);
......@@ -228,7 +238,7 @@ void Executor::Run(const ProgramDesc& program, Scope* scope,
}
}
if (!has_fetch_operators(global_block, fetch_targets, fetch_holder_name)) {
if (!has_fetch_ops) {
// create fetch_holder variable
auto* fetch_holder = global_block->Var(fetch_holder_name);
fetch_holder->SetType(proto::VarType::FETCH_LIST);
......@@ -262,8 +272,6 @@ void Executor::Run(const ProgramDesc& program, Scope* scope,
GetFetchVariable(*scope, fetch_holder_name, idx);
}
}
delete copy_program;
}
ExecutorPrepareContext* Executor::Prepare(const ProgramDesc& program,
......@@ -313,9 +321,8 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
} // if (create_vars)
for (auto& op : ctx->ops_) {
VLOG(4) << place_ << " " << op->DebugStringEx(local_scope);
op->Run(*local_scope, place_);
VLOG(3) << place_ << " " << op->DebugStringEx(local_scope);
op->Run(*local_scope, place_);
if (FLAGS_benchmark) {
VLOG(2) << "Memory used after operator " + op->Type() + " running: "
......
......@@ -26,6 +26,7 @@ namespace paddle {
namespace framework {
std::once_flag gflags_init_flag;
std::once_flag p2p_init_flag;
void InitGflags(std::vector<std::string> &argv) {
std::call_once(gflags_init_flag, [&]() {
......@@ -42,6 +43,27 @@ void InitGflags(std::vector<std::string> &argv) {
});
}
void InitP2P(int count) {
#ifdef PADDLE_WITH_CUDA
std::call_once(p2p_init_flag, [&]() {
for (int i = 0; i < count; ++i) {
for (int j = 0; j < count; ++j) {
if (i == j) continue;
int can_acess = -1;
PADDLE_ENFORCE(cudaDeviceCanAccessPeer(&can_acess, i, j),
"Failed to test P2P access.");
if (can_acess != 1) {
LOG(WARNING) << "Cannot enable P2P access from " << i << " to " << j;
} else {
cudaSetDevice(i);
cudaDeviceEnablePeerAccess(j, 0);
}
}
}
});
#endif
}
void InitDevices() {
/* Init all available devices by default */
......@@ -63,7 +85,7 @@ void InitDevices() {
for (int i = 0; i < count; ++i) {
places.emplace_back(platform::CUDAPlace(i));
}
InitP2P(count);
platform::DeviceContextPool::Init(places);
}
......
......@@ -13,6 +13,11 @@ cc_library(paddle_fluid_shared SHARED
SRCS io.cc
DEPS ARCHIVE_START ${GLOB_OP_LIB} ${FLUID_CORE_MODULES} ARCHIVE_END)
set_target_properties(paddle_fluid_shared PROPERTIES OUTPUT_NAME paddle_fluid)
if(NOT APPLE)
# TODO(liuyiqun): Temporarily disable the link flag because it is not supported on Mac.
set(LINK_FLAGS "-Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/paddle_fluid.map")
set_target_properties(paddle_fluid_shared PROPERTIES LINK_FLAGS "${LINK_FLAGS}")
endif()
if(WITH_TESTING)
add_subdirectory(tests/book)
......
{
global:
*paddle*;
local:
*;
};
......@@ -70,16 +70,16 @@ void ConvOp::InferShape(framework::InferShapeContext* ctx) const {
framework::OpKernelType ConvOp::GetExpectedKernelType(
const framework::ExecutionContext& ctx) const {
framework::LibraryType library_{framework::LibraryType::kPlain};
framework::LibraryType library{framework::LibraryType::kPlain};
#ifdef PADDLE_WITH_CUDA
if (platform::CanCUDNNBeUsed(ctx)) {
library_ = framework::LibraryType::kCUDNN;
library = framework::LibraryType::kCUDNN;
}
#endif
#ifdef PADDLE_WITH_MKLDNN
if (library_ == framework::LibraryType::kPlain &&
if (library == framework::LibraryType::kPlain &&
platform::CanMKLDNNBeUsed(ctx)) {
library_ = framework::LibraryType::kMKLDNN;
library = framework::LibraryType::kMKLDNN;
}
#endif
......@@ -91,15 +91,15 @@ framework::OpKernelType ConvOp::GetExpectedKernelType(
"input and filter data type should be consistent");
if (input_data_type == framework::proto::VarType::FP16) {
PADDLE_ENFORCE_EQ(library_, framework::LibraryType::kCUDNN,
PADDLE_ENFORCE_EQ(library, framework::LibraryType::kCUDNN,
"float16 can only be used when CUDNN is used");
}
std::string data_format = ctx.Attr<std::string>("data_format");
// TODO(pzelazko-intel): enable MKLDNN layout when it's ready
framework::DataLayout layout_ = framework::StringToDataLayout(data_format);
return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout_,
library_);
framework::DataLayout layout = framework::StringToDataLayout(data_format);
return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout,
library);
}
Conv2DOpMaker::Conv2DOpMaker(OpProto* proto, OpAttrChecker* op_checker)
......
......@@ -15,6 +15,7 @@ limitations under the License. */
#include "paddle/fluid/framework/feed_fetch_type.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/platform/profiler.h"
namespace paddle {
namespace operators {
......@@ -28,6 +29,10 @@ class FeedOp : public framework::OperatorBase {
private:
void RunImpl(const framework::Scope &scope,
const platform::Place &place) const override {
// get device context from pool
auto *dev_ctx = platform::DeviceContextPool::Instance().Get(place);
platform::RecordEvent record_event(Type(), dev_ctx);
auto feed_var_name = Input("X");
auto *feed_var = scope.FindVar(feed_var_name);
......@@ -50,14 +55,10 @@ class FeedOp : public framework::OperatorBase {
auto &feed_item = feed_list.at(static_cast<size_t>(col));
auto *out_item = out_var->GetMutable<framework::FeedFetchType>();
// get device context from pool
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
auto &dev_ctx = *pool.Get(place);
if (platform::is_same_place(feed_item.place(), place)) {
out_item->ShareDataWith(feed_item);
} else {
framework::TensorCopy(feed_item, place, dev_ctx, out_item);
framework::TensorCopy(feed_item, place, *dev_ctx, out_item);
}
out_item->set_lod(feed_item.lod());
}
......
......@@ -15,6 +15,7 @@ limitations under the License. */
#include "paddle/fluid/framework/feed_fetch_type.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/platform/device_context.h"
#include "paddle/fluid/platform/profiler.h"
namespace paddle {
namespace operators {
......@@ -29,6 +30,9 @@ class FetchOp : public framework::OperatorBase {
private:
void RunImpl(const framework::Scope &scope,
const platform::Place &place) const override {
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
platform::RecordEvent record_event(Type(), pool.Get(place));
auto fetch_var_name = Input("X");
auto *fetch_var = scope.FindVar(fetch_var_name);
PADDLE_ENFORCE(fetch_var != nullptr,
......@@ -53,7 +57,6 @@ class FetchOp : public framework::OperatorBase {
// FIXME(yuyang18): Should we assume the fetch operator always generates
// CPU outputs?
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
auto &dev_ctx = *pool.Get(src_item.place());
TensorCopy(src_item, platform::CPUPlace(), dev_ctx, &dst_item);
......
......@@ -15,6 +15,7 @@ limitations under the License. */
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/platform/device_context.h"
#include "paddle/fluid/platform/profiler.h"
namespace paddle {
namespace operators {
......@@ -29,6 +30,9 @@ class LoadOp : public framework::OperatorBase {
private:
void RunImpl(const framework::Scope &scope,
const platform::Place &place) const override {
auto *dev_ctx = platform::DeviceContextPool::Instance().Get(place);
platform::RecordEvent record_event(Type(), dev_ctx);
auto filename = Attr<std::string>("file_path");
std::ifstream fin(filename);
PADDLE_ENFORCE(static_cast<bool>(fin), "Cannot open file %s for load op",
......@@ -41,9 +45,7 @@ class LoadOp : public framework::OperatorBase {
auto *tensor = out_var->GetMutable<framework::LoDTensor>();
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
auto &dev_ctx = *pool.Get(place);
DeserializeFromStream(fin, tensor, dev_ctx);
DeserializeFromStream(fin, tensor, *dev_ctx);
if (platform::is_gpu_place(place)) {
// copy CPU to GPU
......@@ -55,7 +57,7 @@ class LoadOp : public framework::OperatorBase {
out_var->Clear();
tensor = out_var->GetMutable<framework::LoDTensor>();
tensor->set_lod(cpu_tensor.lod());
TensorCopy(cpu_tensor, place, dev_ctx, tensor);
TensorCopy(cpu_tensor, place, *dev_ctx, tensor);
}
}
};
......
......@@ -22,17 +22,16 @@ class LoDResetOp : public framework::OperatorWithKernel {
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext *ctx) const override {
// input check
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of LoDResetOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutput("Out"),
"Output(Out) of LoDResetOp should not be null.");
// If target LoD is not set from Input(), then it must be set from Attr().
if (!ctx->HasInput("TargetLoD")) {
if (!ctx->HasInput("Y")) {
auto level0 = ctx->Attrs().Get<std::vector<int>>("target_lod");
PADDLE_ENFORCE(level0.size() > 1,
"Target LoD is not found, should be set to be a valid one "
"through Input() or Attr().");
PADDLE_ENFORCE_GT(level0.size(), 1,
"If Input(Y) not provided, the target lod should be "
"specified by attribute `target_lod`.");
}
ctx->SetOutputDim("Out", ctx->GetInputDim("X"));
}
......@@ -50,36 +49,77 @@ class LoDResetOpMaker : public framework::OpProtoAndCheckerMaker {
public:
LoDResetOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "(LoDTensor) The input tensor of lod_reset operator.");
AddInput("TargetLoD",
"(Tensor, optional) The target level 0 LoD from Input().")
AddInput("X",
"(Tensor, LoDTensor) Input variable of LoDResetOp which "
"could be a Tensor or LoDTensor, where the data of output "
"variable inherits from.");
AddInput("Y",
"(Tensor, LoDTensor, optional) If provided and Y is LoDTensor, "
"lod of Input(Y) would be considered as the target lod first, "
"otherwise data of Input(Y) would be considered as the "
"target lod.")
.AsDispensable();
AddOutput("Out", "(LoDTensor) The output tensor of lod_reset operator.");
AddOutput("Out",
"(LoDTensor) Output variable of LoDResetOp which should be a "
"LoDTensor.");
AddAttr<std::vector<int>>("target_lod",
"The target level 0 LoD from Attr().")
.SetDefault(std::vector<int>{});
AddComment(R"DOC(LoDReset operator
Reset LoD of Input(X) into a new one specified by Input(TargetLoD) or
Attr(target_lod), or set LoD for Input(X) if it doesn't have one.
Currently the lod_reset operator only supports the reset of level 0 LoD.
At least one of Input(TargetLoD) and Attr(target_lod) must be set,
and if both of them are set, Input(TargetLoD) will be chosen as the
target LoD.
Set LoD of `X` to a new one specified by `Y` or attribute `target_lod`. When `Y`
provided and `Y` is a LoDTensor, `Y.lod` would be considered as target LoD
first, otherwise `Y.data` would be considered as target LoD. If `Y` is not
provided, target LoD should be specified by attribute `target_lod`.
If target LoD is specified by `Y.data` or `target_lod`, only one level LoD
is supported.
Example 1:
Given a 1-level LoDTensor input(X):
X.lod = [[ 0, 2, 5 6 ]]
X.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
X.dims = [6, 1]
attr(target_lod): [0, 4, 6]
then we get a 1-level LoDTensor:
Out.lod = [[ 0, 4, 6 ]]
Out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
Out.dims = [6, 1]
Example 2:
An example:
Given a float LoDTensor X with shape (6, 1), its transpose form represents
Given a 1-level LoDTensor input(X):
X.lod = [[ 0, 2, 5 6 ]]
X.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
X.dims = [6, 1]
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
input(Y) is a Tensor:
Y.data = [[0, 2, 6]]
Y.dims = [1, 3]
with LoD = [[0, 2, 5, 6]] and the three (transposed) sequences look like
then we get a 1-level LoDTensor:
Out.lod = [[ 0, 2, 6 ]]
Out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
Out.dims = [6, 1]
[1.0, 2.0], [3.0, 4.0, 5.0], [6.0].
Example 3:
If target LoD = [0, 4, 6], the lod_reset operator will reset the LoD and
the sequences that the LoDTensor Output(Out) contains becomes:
Given a 1-level LoDTensor input(X):
X.lod = [[ 0, 2, 5 6 ]]
X.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
X.dims = [6, 1]
[1.0, 2.0, 3.0, 4.0], [5.0, 6.0].
input(Y) is a 2-level LoDTensor:
Y.lod = [[0, 2, 4], [0, 2, 5, 6]]
Y.data = [[1.1], [2.1], [3.1], [4.1], [5.1], [6.1]]
Y.dims = [6, 1]
then we get a 2-level LoDTensor:
Out.lod = [[0, 2, 4], [0, 2, 5, 6]]
Out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
Out.dims = [6, 1]
)DOC");
}
......@@ -90,10 +130,16 @@ class LoDResetGradOp : public framework::OperatorWithKernel {
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext *ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of LoDResetGradOp should not be null.");
PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
"Input(Out@GRAD) shouldn't be null.");
ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
"Input(Out@Grad) of LoDResetGradOp should not be null.");
auto x_grad_name = framework::GradVarName("X");
if (ctx->HasOutput(x_grad_name)) {
ctx->SetOutputDim(x_grad_name, ctx->GetInputDim("X"));
ctx->ShareLoD("X", /*->*/ x_grad_name);
}
}
protected:
......@@ -111,9 +157,13 @@ class LoDResetGradOp : public framework::OperatorWithKernel {
namespace ops = paddle::operators;
REGISTER_OP(lod_reset, ops::LoDResetOp, ops::LoDResetOpMaker, lod_reset_grad,
ops::LoDResetGradOp);
REGISTER_OP_CPU_KERNEL(lod_reset,
ops::LoDResetKernel<paddle::platform::CPUPlace, float>,
ops::LoDResetKernel<paddle::platform::CPUPlace, double>);
REGISTER_OP_CPU_KERNEL(
lod_reset, ops::LoDResetKernel<paddle::platform::CPUPlace, float>,
ops::LoDResetKernel<paddle::platform::CPUPlace, double>,
ops::LoDResetKernel<paddle::platform::CPUPlace, int>,
ops::LoDResetKernel<paddle::platform::CPUPlace, int64_t>);
REGISTER_OP_CPU_KERNEL(
lod_reset_grad, ops::LoDResetGradKernel<paddle::platform::CPUPlace, float>,
ops::LoDResetGradKernel<paddle::platform::CPUPlace, double>);
ops::LoDResetGradKernel<paddle::platform::CPUPlace, double>,
ops::LoDResetGradKernel<paddle::platform::CPUPlace, int>,
ops::LoDResetGradKernel<paddle::platform::CPUPlace, int64_t>);
......@@ -18,8 +18,12 @@ namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(
lod_reset, ops::LoDResetKernel<paddle::platform::CUDADeviceContext, float>,
ops::LoDResetKernel<paddle::platform::CUDADeviceContext, double>);
ops::LoDResetKernel<paddle::platform::CUDADeviceContext, double>,
ops::LoDResetKernel<paddle::platform::CUDADeviceContext, int>,
ops::LoDResetKernel<paddle::platform::CUDADeviceContext, int64_t>);
REGISTER_OP_CUDA_KERNEL(
lod_reset_grad,
ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, float>,
ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, double>);
ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, double>,
ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, int>,
ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, int64_t>);
......@@ -26,10 +26,21 @@ class LoDResetKernel : public framework::OpKernel<T> {
void Compute(const framework::ExecutionContext& ctx) const {
auto* out = ctx.Output<framework::LoDTensor>("Out");
auto* in = ctx.Input<framework::LoDTensor>("X");
auto* lod_t = ctx.Input<framework::Tensor>("TargetLoD");
auto* lod_t = ctx.Input<framework::LoDTensor>("Y");
out->ShareDataWith(*in);
std::vector<int> level0;
if (lod_t) {
if (lod_t->lod().size() > 0) {
auto y_lod = lod_t->lod();
auto last_level = y_lod[y_lod.size() - 1];
PADDLE_ENFORCE_EQ(last_level.back(), in->dims()[0],
"Last value of `Y`'s last level LoD should be equal "
"to the first dimension of `X`");
out->set_lod(y_lod);
return; // early return, since lod already set
} else {
auto* lod = lod_t->data<int>();
if (platform::is_gpu_place(ctx.GetPlace())) {
framework::Tensor lod_cpu;
......@@ -38,15 +49,16 @@ class LoDResetKernel : public framework::OpKernel<T> {
lod = lod_cpu.data<int>();
}
level0 = std::vector<int>(lod, lod + lod_t->numel());
}
} else {
level0 = ctx.Attr<std::vector<int>>("target_lod");
}
PADDLE_ENFORCE(level0.size() > 1UL,
"The size of target LoD should be greater than 1.");
PADDLE_ENFORCE(level0[0] == 0,
PADDLE_ENFORCE_GT(level0.size(), 1UL,
"Size of target LoD should be greater than 1.");
PADDLE_ENFORCE_EQ(level0[0], 0,
"Target LoD should be a vector starting from 0.");
PADDLE_ENFORCE(level0.back() == in->dims()[0],
PADDLE_ENFORCE_EQ(level0.back(), in->dims()[0],
"Target LoD should be a vector end with the "
"first dimension of Input(X).");
for (size_t i = 0; i < level0.size() - 1; ++i) {
......@@ -54,7 +66,6 @@ class LoDResetKernel : public framework::OpKernel<T> {
"Target LoD should be an ascending vector.");
}
out->ShareDataWith(*in);
// cast level0 to size_t
std::vector<size_t> ulevel0(level0.size(), 0);
std::transform(level0.begin(), level0.end(), ulevel0.begin(),
......
......@@ -43,7 +43,7 @@ math_library(sequence2batch)
math_library(sequence_padding)
math_library(sequence_pooling DEPS math_function)
math_library(sequence_scale)
math_library(softmax)
math_library(softmax DEPS math_function)
math_library(unpooling)
math_library(vol2col)
......
......@@ -44,7 +44,7 @@ class ConcatFunctor<platform::CPUDeviceContext, T> {
out_cols += t_cols;
input_cols[i] = t_cols;
}
auto& cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
auto cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
// computation
for (int k = 0; k < out_rows; ++k) {
......@@ -87,7 +87,7 @@ class ConcatGradFunctor<platform::CPUDeviceContext, T> {
input_cols += t_cols;
output_cols[i] = t_cols;
}
auto& cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
auto cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
// computation
for (int k = 0; k < input_rows; ++k) {
......
......@@ -131,7 +131,7 @@ def make_channel(dtype, capacity=0):
return channel
def channel_send(channel, value):
def channel_send(channel, value, copy=False):
"""
Sends a value through a channel variable. Used by an unbuffered or buffered
channel to pass data from within or to a concurrent Go block, where
......@@ -141,6 +141,8 @@ def channel_send(channel, value):
channel (Variable|Channel): Channel variable created using
`make_channel`.
value (Variable): Value to send to channel
copy (bool): Copy data while channel send. If False, then data
is moved. The input cannot be used after move.
Returns:
Variable: The boolean status on whether or not the channel
successfully sent the passed value.
......@@ -162,11 +164,26 @@ def channel_send(channel, value):
type=core.VarDesc.VarType.LOD_TENSOR,
dtype=core.VarDesc.VarType.BOOL)
X = value
if copy is True:
copied_X = helper.create_variable(
name=unique_name.generate(value.name + '_copy'),
type=value.type,
dtype=value.dtype,
shape=value.shape,
lod_level=value.lod_level,
capacity=value.capacity)
assign_op = channel_send_block.append_op(
type="assign_op", inputs={"X": value}, outputs={"Out": copied_X})
X = copied_X
channel_send_op = channel_send_block.append_op(
type="channel_send",
inputs={
"Channel": channel,
"X": value,
"X": X,
},
outputs={"Status": status})
......
......@@ -73,6 +73,7 @@ __all__ = [
'smooth_l1',
'one_hot',
'autoincreased_step_counter',
'lod_reset',
]
......@@ -3292,3 +3293,98 @@ def autoincreased_step_counter(counter_name=None, begin=1, step=1):
counter.stop_gradient = True
return counter
def lod_reset(x, y=None, target_lod=None):
"""
LoD Reset Operator. Set LoD of **x** to a new one specified by **y** or
**target_lod**. When **y** provided, **y.lod** would be considered as target
LoD first, otherwise **y.data** would be considered as target LoD. If **y**
is not provided, target LoD should be specified by **target_lod**.
If target LoD is specified by **Y.data** or **target_lod**, only one level
LoD is supported.
.. code-block:: text
* Example 1:
Given a 1-level LoDTensor x:
x.lod = [[ 0, 2, 5 6 ]]
x.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
x.dims = [6, 1]
target_lod: [0, 4, 6]
then we get a 1-level LoDTensor:
out.lod = [[ 0, 4, 6 ]]
out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
out.dims = [6, 1]
* Example 2:
Given a 1-level LoDTensor x:
x.lod = [[ 0, 2, 5 6 ]]
x.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
x.dims = [6, 1]
y is a Tensor:
y.data = [[0, 2, 6]]
y.dims = [1, 3]
then we get a 1-level LoDTensor:
out.lod = [[ 0, 2, 6 ]]
out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
out.dims = [6, 1]
* Example 3:
Given a 1-level LoDTensor x:
x.lod = [[ 0, 2, 5 6 ]]
x.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
x.dims = [6, 1]
y is a 2-level LoDTensor:
y.lod = [[0, 2, 4], [0, 2, 5, 6]]
y.data = [[1.1], [2.1], [3.1], [4.1], [5.1], [6.1]]
y.dims = [6, 1]
then we get a 2-level LoDTensor:
out.lod = [[0, 2, 4], [0, 2, 5, 6]]
out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
out.dims = [6, 1]
Args:
x (Variable): Input variable which could be a Tensor or LodTensor.
y (Variable|None): If provided, output's LoD would be derived from y.
target_lod (list|tuple|None): One level LoD which should be considered
as target LoD when y not provided.
Returns:
Variable: Output variable with LoD specified by this operator.
Raises:
ValueError: If y and target_lod are both None.
Examples:
.. code-block:: python
x = layers.data(name='x', shape=[10])
y = layers.data(name='y', shape=[10, 20], lod_level=2)
out = layers.lod_reset(x=x, y=y)
"""
helper = LayerHelper("lod_reset", **locals())
out = helper.create_tmp_variable(dtype=x.dtype)
if y is not None:
helper.append_op(
type="lod_reset", inputs={'X': x,
'Y': y}, outputs={'Out': out})
elif target_lod is not None:
helper.append_op(
type="lod_reset",
inputs={'X': x},
attrs={'target_lod': target_lod},
outputs={'Out': out})
else:
raise ValueError("y and target_lod should not be both None.")
return out
......@@ -327,6 +327,15 @@ class TestBook(unittest.TestCase):
self.assertIsNotNone(loss)
print(str(program))
def test_lod_reset(self):
program = Program()
with program_guard(program):
x = layers.data(name='x', shape=[10], dtype='float32')
y = layers.data(
name='y', shape=[10, 20], dtype='float32', lod_level=2)
print(layers.lod_reset(x=x, y=y))
print(str(program))
if __name__ == '__main__':
unittest.main()
......@@ -42,7 +42,7 @@ class TestLodResetOpByInput(OpTest):
target_lod_0 = [0, 4, 7, 10]
self.inputs = {
'X': (x, lod),
'TargetLoD': np.array([target_lod_0]).astype('int32')
'Y': np.array([target_lod_0]).astype('int32')
}
self.outputs = {'Out': (x, [target_lod_0])}
......@@ -50,7 +50,7 @@ class TestLodResetOpByInput(OpTest):
self.check_output()
def test_check_grad(self):
self.check_grad(["X"], "Out", no_grad_set=set("TargetLoD"))
self.check_grad(["X"], "Out", no_grad_set=set("Y"))
class TestLodResetOpBoth(OpTest):
......@@ -62,7 +62,7 @@ class TestLodResetOpBoth(OpTest):
target_lod_0_in = [0, 4, 7, 10]
self.inputs = {
'X': (x, lod),
'TargetLoD': np.array(target_lod_0_in).astype('int32')
'Y': np.array(target_lod_0_in).astype('int32')
}
self.attrs = {'target_lod': target_lod_0_attr}
self.outputs = {'Out': (x, [target_lod_0_in])}
......@@ -71,7 +71,24 @@ class TestLodResetOpBoth(OpTest):
self.check_output()
def test_check_grad(self):
self.check_grad(["X"], "Out", no_grad_set=set("TargetLoD"))
self.check_grad(["X"], "Out", no_grad_set=set("Y"))
class TestLodResetOpYIsLoDTensor(OpTest):
def setUp(self):
self.op_type = "lod_reset"
x = np.random.random((10, 20)).astype("float32")
lod = [[0, 3, 5, 10]]
y = np.random.random((10, 10)).astype("float32")
target_lod_0 = [[0, 4, 7, 10]]
self.inputs = {'X': (x, lod), 'Y': (y, target_lod_0)}
self.outputs = {'Out': (x, target_lod_0)}
def test_check_output(self):
self.check_output()
def test_check_grad(self):
self.check_grad(["X"], "Out", no_grad_set=set("Y"))
if __name__ == '__main__':
......