Commit 6ea45823, authored by fengjiayi

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into dev_update_reader_doc

...@@ -155,7 +155,7 @@ Cluster environment. ...@@ -155,7 +155,7 @@ Cluster environment.
<img src="src/remote_executor.png" width="500" align="center" /> <img src="src/remote_executor.png" width="500" align="center" />
`RemoteExecutor.run` sends the `ProgramDesc` and `RemoteExecutor.run` sends the `ProgramDesc` and
[TrainingJob](https://github.com/PaddlePaddle/cloud/blob/develop/doc/autoscale/README.md#training-job-resource) [TrainingJob](https://github.com/PaddlePaddle/cloud/blob/unreleased-tpr/doc/autoscale/README.md#training-job-resource)
to a server in the cluster which executes `RemoteExecutor.listen`. This server is responsible to a server in the cluster which executes `RemoteExecutor.listen`. This server is responsible
to start the final Kubernetes Jobs to run the different role of `ProgramDesc` from `ConfigMap`. to start the final Kubernetes Jobs to run the different role of `ProgramDesc` from `ConfigMap`.
......
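# API Comment Writing Standard
- [API comment modules](#api-comment-modules)
- [Format and examples](#format-and-examples)
- [Complete example](#complete-example)
## API comment modules
An API document must contain the following modules, written in the order listed:
- Python API Definition
The code definition of the API.
- Function Description
What the API does. Describe the meaning and purpose of the API or the operation it performs on its input, and give references with links where they exist. When necessary, give the formula and explain the key variables in it.
- Args Description
The API's parameters. Describe each parameter in the order it appears in the code definition, including its data type, default value (if any), and meaning.
- Returns
The API's return value. Describe what the return value means and, when necessary, its shape. If the return value is a tuple containing several items, describe each item in order.
- Raises (if any)
The exceptions or errors that may be raised and their likely causes. When several exceptions or errors are possible, list them one per item.
- Note (if any)
Points to note. When there are several notes, list them one per item.
- Examples
Usage examples for the API.
## Format and examples
API documents must be written in reStructuredText; see [this link](http://sphinx-doc-zh.readthedocs.io/en/latest/rest.html) for details of the format. The content format of each module is shown below with an example, using fc throughout:
- Python API Definition
- Format:
[Python API Definition]
- Example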
```
fc(input,
size,
num_flatten_dims=1,
param_attr=None,
bias_attr=None,
act=None,
name=None,
main_program=None,
startup_program=None)
```
- Function Description
- Format
This module should contain the following, written in the order listed:
[Function Description]
[Formula]
[Symbols' Descriptions if necessary]
[References if necessary]
- Example
[Function Description]
```
**Fully Connected Layer**
The fully connected layer can take multiple tensors as its inputs. It
creates a variable called weights for each input tensor, which represents
a fully connected weight matrix from each input unit to each output unit.
The fully connected layer multiplies each input tensor with its corresponding
weight to produce an output Tensor. If multiple input tensors are given,
the results of multiple multiplications will be summed up. If bias_attr is
not None, a bias variable will be created and added to the output. Finally,
if activation is not None, it will be applied to the output as well.
```
[Formula]
```
This process can be formulated as follows:
.. math::
Out = Act({\sum_{i=0}^{N-1}X_iW_i + b})
```
[Symbols' Descriptions if necessary]
```
In the above equation:
* :math:`N`: The number of input tensors.
* :math:`X_i`: The i-th input tensor.
* :math:`W_i`: The weight matrix created by this layer for the i-th input tensor.
* :math:`b`: The bias parameter created by this layer (if needed).
* :math:`Act`: The activation function.
* :math:`Out`: The output tensor.
```
[References if necessary]
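fc has no references that need to be listed, so this part is omitted. In other cases the references and their links must be given explicitly; taking layer_norm as an example: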
```
Refer to `Layer Normalization <https://arxiv.org/pdf/1607.06450v1.pdf>`_ for more details.
```
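As a rough illustration of the formula in the fc example above, the following sketch evaluates `Out = Act(sum_i X_i W_i + b)` on plain Python lists. It does not use PaddlePaddle: the helper names `matmul`, `add`, and `fc_reference` are hypothetical, and `math.tanh` stands in for an arbitrary activation.

```python
import math


def matmul(x, w):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def fc_reference(inputs, weights, bias=None, act=None):
    """Evaluate Out = Act(sum_i X_i W_i + b) for small matrices.

    Hypothetical reference helper, not part of PaddlePaddle.
    """
    # One product X_i W_i per input tensor, accumulated into a single sum.
    out = matmul(inputs[0], weights[0])
    for x, w in zip(inputs[1:], weights[1:]):
        out = add(out, matmul(x, w))
    # The bias b is a vector added to every row of the sum.
    if bias is not None:
        out = [[v + b for v, b in zip(row, bias)] for row in out]
    # The activation is applied element-wise.
    if act is not None:
        out = [[act(v) for v in row] for row in out]
    return out


x0 = [[1.0, 2.0]]                # shape (1, 2)
x1 = [[3.0]]                     # shape (1, 1)
w0 = [[1.0, 0.0], [0.0, 1.0]]    # shape (2, 2)
w1 = [[1.0, 1.0]]                # shape (1, 2)
out = fc_reference([x0, x1], [w0, w1], bias=[0.5, -0.5], act=math.tanh)
```

Here the two products are `[[1.0, 2.0]]` and `[[3.0, 3.0]]`, their sum plus the bias is `[[4.5, 4.5]]`, and `tanh` is then applied to each element.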
- Args Description
- Format
\[Arg's Name\][(Data Type, Default Value)][Description]
- Example
Part of fc's parameter comments:
```
Args:
input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
the input tensor(s) is at least 2.
param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
parameters/weights of this layer.
name (str, default None): The name of this layer.
```
- Returns
- Format
[Name][Shape]
- Example
```
Returns:
A tensor variable storing the transformation result.
```
When the return value is a tuple containing several items, describe each item in order; taking dynamic_lstm as an example:
```
Returns:
A tuple containing:
The hidden state of LSTM whose shape is (T X D).
The cell state of LSTM whose shape is (T X D).
```
- Raises
- Format
[Exception Type][Condition]
- Example
```
Raises:
ValueError: If the rank of the input is less than 2.
```
- Note
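- Format
[Note]
- Example
fc has no notes, so this module is omitted. Any notes must be stated explicitly, and when there are several they must be listed one per item; taking scaled\_dot\_product\_attention as an example: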
```
Note:
1. When num_heads > 1, three linear projections are learned respectively
to map input queries, keys and values into queries', keys' and values'.
queries', keys' and values' have the same shapes as queries, keys
and values.
2. When num_heads == 1, scaled_dot_product_attention has no learnable
parameters.
```
- Examples
- Format
\[Python Code Snippet]
- Example
```
Examples:
.. code-block:: python
data = fluid.layers.data(name="data", shape=[32, 32], dtype="float32")
fc = fluid.layers.fc(input=data, size=1000, act="tanh")
```
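One way to check a docstring against the module order described above is a small script. The sketch below is only an illustration: `check_sections`, `SECTION_ORDER`, and `OPTIONAL` are hypothetical names, not part of PaddlePaddle or Sphinx, and the check only looks for the `Args:`, `Returns:`, `Raises:`, `Note:`, and `Examples:` headers standing alone on a line.

```python
import re

# Section headers in the order the standard asks for.
SECTION_ORDER = ["Args", "Returns", "Raises", "Note", "Examples"]
# Raises and Note are only required when they apply.
OPTIONAL = {"Raises", "Note"}


def check_sections(docstring):
    """Return a list of problems found in the docstring's section headers."""
    problems = []
    # Collect headers such as "Args:" that stand alone on a line.
    found = re.findall(r"^\s*(Args|Returns|Raises|Note|Examples):\s*$",
                       docstring, flags=re.MULTILINE)
    for name in SECTION_ORDER:
        if name not in found and name not in OPTIONAL:
            problems.append("missing section: " + name)
    # The headers that are present must appear in the standard order.
    expected = [name for name in SECTION_ORDER if name in found]
    if found != expected:
        problems.append("sections out of order: " + ", ".join(found))
    return problems


good = """
    **Fully Connected Layer**

    Args:
        input (Variable): The input tensor.

    Returns:
        A tensor variable storing the transformation result.

    Examples:
        .. code-block:: python

            fc = fluid.layers.fc(input=data, size=1000, act="tanh")
"""
bad = """
    Returns:
        A tensor variable.

    Args:
        input (Variable): The input tensor.
"""
print(check_sections(good))
print(check_sections(bad))
```

For `good` the function reports no problems; for `bad` it reports the missing `Examples` section and the swapped `Args`/`Returns` order. A real checker would also need to validate the content of each section, which this sketch does not attempt.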
## Complete example
For the complete fc comment, see the [example](src/fc.py).
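The full fc docstring describes how `num_flatten_dims` turns a multidimensional input into a 2-D matrix. A minimal sketch of that shape arithmetic, assuming the rule is exactly as the docstring states it, could look like this. The function name `flattened_shape` is hypothetical and is not a PaddlePaddle API, and the range check on `num_flatten_dims` is an assumption of this sketch.

```python
from functools import reduce
from operator import mul


def flattened_shape(shape, num_flatten_dims=1):
    """Return the [height, width] of the matrix that `shape` is flattened into."""
    # Assumption: at least one dimension must remain on each side of the split.
    if not 0 < num_flatten_dims < len(shape):
        raise ValueError("num_flatten_dims must be in [1, rank - 1]")
    # The first num_flatten_dims dimensions form the height of the matrix.
    height = reduce(mul, shape[:num_flatten_dims], 1)
    # The remaining rank - num_flatten_dims dimensions form the width.
    width = reduce(mul, shape[num_flatten_dims:], 1)
    return [height, width]


shape_2d = flattened_shape([2, 3, 4, 5, 6], num_flatten_dims=3)
```

With the shape `[2, 3, 4, 5, 6]` and `num_flatten_dims = 3` used in the docstring, the height is 2 x 3 x 4 and the width is 5 x 6, so the result is `[24, 30]`.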
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
def fc(input,
size,
num_flatten_dims=1,
param_attr=None,
bias_attr=None,
act=None,
name=None):
"""
**Fully Connected Layer**
The fully connected layer can take multiple tensors as its inputs. It
creates a variable called weights for each input tensor, which represents
a fully connected weight matrix from each input unit to each output unit.
The fully connected layer multiplies each input tensor with its corresponding
weight to produce an output Tensor. If multiple input tensors are given,
the results of multiple multiplications will be summed up. If bias_attr is
not None, a bias variable will be created and added to the output. Finally,
if activation is not None, it will be applied to the output as well.
This process can be formulated as follows:
.. math::
Out = Act({\sum_{i=0}^{N-1}X_iW_i + b})
In the above equation:
* :math:`N`: The number of input tensors.
* :math:`X_i`: The i-th input tensor.
* :math:`W_i`: The weight matrix created by this layer for the i-th input tensor.
* :math:`b`: The bias parameter created by this layer (if needed).
* :math:`Act`: The activation function.
* :math:`Out`: The output tensor.
Args:
input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
the input tensor(s) is at least 2.
size(int): The number of output units in this layer.
num_flatten_dims (int, default 1): The fc layer can accept an input tensor with more than
two dimensions. If this happens, the multidimensional tensor will first be flattened
into a 2-dimensional matrix. The parameter `num_flatten_dims` determines how the input
tensor is flattened: the first `num_flatten_dims` (inclusive, index starts from 1)
dimensions will be flattened to form the first dimension of the final matrix (height of
the matrix), and the rest `rank(X) - num_flatten_dims` dimensions are flattened to
form the second dimension of the final matrix (width of the matrix). For example, suppose
`X` is a 5-dimensional tensor with a shape [2, 3, 4, 5, 6], and `num_flatten_dims` = 3.
Then, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] = [24, 30].
param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
parameters/weights of this layer.
bias_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for the bias
of this layer. If it is set to None, no bias will be added to the output units.
act (str, default None): Activation to be applied to the output of this layer.
name (str, default None): The name of this layer.
Returns:
A tensor variable storing the transformation result.
Raises:
ValueError: If rank of the input tensor is less than 2.
Examples:
.. code-block:: python
data = fluid.layers.data(name="data", shape=[32, 32], dtype="float32")
fc = fluid.layers.fc(input=data, size=1000, act="tanh")
"""
...@@ -445,15 +445,7 @@ class RuntimeInferShapeContext : public InferShapeContext { ...@@ -445,15 +445,7 @@ class RuntimeInferShapeContext : public InferShapeContext {
} }
std::vector<DDim> GetRepeatedDims(const std::string& name) const override { std::vector<DDim> GetRepeatedDims(const std::string& name) const override {
Variable* var = scope_.FindVar(name); PADDLE_THROW("Only compile time support this method");
if (var->IsType<ReaderHolder>()) {
return var->Get<ReaderHolder>().shapes();
} else {
PADDLE_THROW(
"Only ReaderHolder support 'GetRepeatedDims', but Variable %s's "
"type_id is %s.",
name, var->Type().name());
}
} }
void SetDim(const std::string& name, const DDim& dim) override { void SetDim(const std::string& name, const DDim& dim) override {
...@@ -470,15 +462,7 @@ class RuntimeInferShapeContext : public InferShapeContext { ...@@ -470,15 +462,7 @@ class RuntimeInferShapeContext : public InferShapeContext {
void SetRepeatedDims(const std::string& name, void SetRepeatedDims(const std::string& name,
const std::vector<DDim>& dims) override { const std::vector<DDim>& dims) override {
Variable* var = scope_.FindVar(name); PADDLE_THROW("Only compile time support this method");
if (var->IsType<ReaderHolder>()) {
var->GetMutable<ReaderHolder>()->set_shapes(dims);
} else {
PADDLE_THROW(
"Only ReaderHolder support 'SetRepeatedDims', but Variable %s's "
"type_id is %s.",
name, var->Type().name());
}
} }
proto::VarType::Type GetVarType(const std::string& name) const override { proto::VarType::Type GetVarType(const std::string& name) const override {
......
...@@ -16,14 +16,22 @@ ...@@ -16,14 +16,22 @@
namespace paddle { namespace paddle {
namespace framework { namespace framework {
ReaderBase::~ReaderBase() {}
DDim ReaderBase::shape(size_t idx) const { FileReader::FileReader(const std::vector<DDim> &dims) : dims_(dims) {}
PADDLE_ENFORCE_LT(
idx, shapes_.size(), void FileReader::ReadNext(std::vector<LoDTensor> *out) {
"Cannot get the %d'th shape, 'shapes_' only has %d elements.", idx, ReadNextImpl(out);
shapes_.size()); PADDLE_ENFORCE_EQ(out->size(), dims_.size());
return shapes_[idx]; for (size_t i = 0; i < dims_.size(); ++i) {
} auto &actual = out->at(i).dims();
auto &expect = dims_[i];
PADDLE_ENFORCE_EQ(actual.size(), expect.size());
for (int j = 0; j < actual.size(); ++j) {
PADDLE_ENFORCE(actual[j] == expect[j] || expect[j] == -1);
}
}
}
} // namespace framework } // namespace framework
} // namespace paddle } // namespace paddle
...@@ -16,51 +16,29 @@ ...@@ -16,51 +16,29 @@
#include "paddle/fluid/framework/ddim.h" #include "paddle/fluid/framework/ddim.h"
#include "paddle/fluid/framework/lod_tensor_array.h" #include "paddle/fluid/framework/lod_tensor_array.h"
#include "paddle/fluid/platform/place.h"
#include <memory>
#include <thread>
#include <vector>
namespace paddle { namespace paddle {
namespace framework { namespace framework {
class ReaderBase { class ReaderBase {
public: public:
explicit ReaderBase(const std::vector<DDim>& shapes) : shapes_(shapes) {
PADDLE_ENFORCE(!shapes_.empty());
}
virtual void ReadNext(std::vector<LoDTensor>* out) = 0; virtual void ReadNext(std::vector<LoDTensor>* out) = 0;
virtual void ReInit() = 0; virtual void ReInit() = 0;
DDim shape(size_t idx) const;
std::vector<DDim> shapes() const { return shapes_; }
void set_shapes(const std::vector<DDim>& shapes) { shapes_ = shapes; }
virtual bool HasNext() const = 0; virtual bool HasNext() const = 0;
virtual ~ReaderBase() {} virtual ~ReaderBase();
protected:
std::vector<DDim> shapes_;
};
class FileReader : public ReaderBase {
public:
explicit FileReader(const std::vector<DDim>& shapes) : shapes_(shapes) {}
void ReadNext(std::vector<LoDTensor>* out) override final {
ReadNextImpl(out);
CheckShapes(out);
}
virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;
protected:
CheckShape(const std::vector<LoDTensor>* out);
std::vector<DDim> shapes_;
}; };
class DecoratedReader : public ReaderBase { class DecoratedReader : public ReaderBase {
public: public:
explicit DecoratedReader(ReaderBase* reader) : reader_(reader) { explicit DecoratedReader(ReaderBase* reader) : ReaderBase(), reader_(reader) {
PADDLE_ENFORCE_NOT_NULL(reader_); PADDLE_ENFORCE_NOT_NULL(reader_);
} }
...@@ -72,6 +50,19 @@ class DecoratedReader : public ReaderBase { ...@@ -72,6 +50,19 @@ class DecoratedReader : public ReaderBase {
ReaderBase* reader_; ReaderBase* reader_;
}; };
class FileReader : public ReaderBase {
public:
explicit FileReader(const std::vector<DDim>& dims);
void ReadNext(std::vector<LoDTensor>* out) override;
protected:
virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;
private:
std::vector<DDim> dims_;
};
// The ReaderHolder is used as reader' unified wrapper, // The ReaderHolder is used as reader' unified wrapper,
// making it easier to access different type reader in Variables. // making it easier to access different type reader in Variables.
class ReaderHolder { class ReaderHolder {
...@@ -89,19 +80,6 @@ class ReaderHolder { ...@@ -89,19 +80,6 @@ class ReaderHolder {
reader_->ReInit(); reader_->ReInit();
} }
DDim shape(size_t idx) const {
PADDLE_ENFORCE_NOT_NULL(reader_);
return reader_->shape(idx);
}
std::vector<DDim> shapes() const {
PADDLE_ENFORCE_NOT_NULL(reader_);
return reader_->shapes();
}
void set_shapes(const std::vector<DDim>& shapes) {
PADDLE_ENFORCE_NOT_NULL(reader_);
reader_->set_shapes(shapes);
}
bool HasNext() const { return reader_->HasNext(); } bool HasNext() const { return reader_->HasNext(); }
private: private:
......
...@@ -97,7 +97,7 @@ bool RPCClient::AsyncGetVariable(const std::string& ep, ...@@ -97,7 +97,7 @@ bool RPCClient::AsyncGetVariable(const std::string& ep,
return true; return true;
} }
bool RPCClient::AsyncSendBatchBarrier(const std::string& ep, int64_t time_out) { void RPCClient::AsyncSendBatchBarrier(const std::string& ep, int64_t time_out) {
const auto ch = GetChannel(ep); const auto ch = GetChannel(ep);
BatchBarrierProcessor* s = new BatchBarrierProcessor(ch); BatchBarrierProcessor* s = new BatchBarrierProcessor(ch);
...@@ -108,8 +108,18 @@ bool RPCClient::AsyncSendBatchBarrier(const std::string& ep, int64_t time_out) { ...@@ -108,8 +108,18 @@ bool RPCClient::AsyncSendBatchBarrier(const std::string& ep, int64_t time_out) {
auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_); auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_);
rpc->Finish(&s->reply_, &s->status_, (void*)s); rpc->Finish(&s->reply_, &s->status_, (void*)s);
req_count_++; req_count_++;
}
return true; void RPCClient::AsyncSendFetchBarrier(const std::string& ep, int64_t time_out) {
const auto ch = GetChannel(ep);
FetchBarrierProcessor* s = new FetchBarrierProcessor(ch);
s->Prepare(time_out);
sendrecv::VariableMessage req;
req.set_varname(FETCH_BARRIER_MESSAGE);
auto rpc = s->stub_->AsyncGetVariable(s->context_.get(), req, &cq_);
rpc->Finish(&s->reply_, &s->status_, (void*)s);
req_count_++;
} }
bool RPCClient::Wait() { bool RPCClient::Wait() {
...@@ -154,7 +164,7 @@ bool RPCClient::Proceed() { ...@@ -154,7 +164,7 @@ bool RPCClient::Proceed() {
PADDLE_ENFORCE(tag); PADDLE_ENFORCE(tag);
// TODO(gongwb): add more retries. // TODO(gongwb): add more retries.
ClientBase* c = static_cast<ClientBase*>(tag); BaseProcessor* c = static_cast<BaseProcessor*>(tag);
if (!c->status_.ok()) { if (!c->status_.ok()) {
LOG(ERROR) << "proc param error:" << c->var_h_.String() LOG(ERROR) << "proc param error:" << c->var_h_.String()
<< " grpc error:" << c->status_.error_message(); << " grpc error:" << c->status_.error_message();
...@@ -174,6 +184,8 @@ std::shared_ptr<grpc::Channel> RPCClient::GetChannel(const std::string& ep) { ...@@ -174,6 +184,8 @@ std::shared_ptr<grpc::Channel> RPCClient::GetChannel(const std::string& ep) {
} }
grpc::ChannelArguments args; grpc::ChannelArguments args;
args.SetInt("grpc.testing.fixed_reconnect_backoff_ms", 5000);
args.SetCompressionAlgorithm(GRPC_COMPRESS_NONE);
args.SetMaxSendMessageSize(std::numeric_limits<int>::max()); args.SetMaxSendMessageSize(std::numeric_limits<int>::max());
args.SetMaxReceiveMessageSize(std::numeric_limits<int>::max()); args.SetMaxReceiveMessageSize(std::numeric_limits<int>::max());
......
...@@ -52,14 +52,14 @@ struct VarHandle { ...@@ -52,14 +52,14 @@ struct VarHandle {
void ProcGetResponse(const VarHandle& var_h, void ProcGetResponse(const VarHandle& var_h,
const sendrecv::VariableMessage& msg); const sendrecv::VariableMessage& msg);
class ClientBase { class BaseProcessor {
public: public:
explicit ClientBase(std::shared_ptr<grpc::Channel> ch) { explicit BaseProcessor(std::shared_ptr<grpc::Channel> ch) {
stub_ = sendrecv::SendRecvService::NewStub(ch); stub_ = sendrecv::SendRecvService::NewStub(ch);
context_ = NULL; context_ = NULL;
} }
virtual ~ClientBase() {} virtual ~BaseProcessor() {}
virtual void Prepare(const VarHandle& var_info, int64_t time_out) { virtual void Prepare(const VarHandle& var_info, int64_t time_out) {
context_.reset(new grpc::ClientContext()); context_.reset(new grpc::ClientContext());
...@@ -91,9 +91,10 @@ class ClientBase { ...@@ -91,9 +91,10 @@ class ClientBase {
typedef std::function<void(const VarHandle&, const sendrecv::VoidMessage&)> typedef std::function<void(const VarHandle&, const sendrecv::VoidMessage&)>
RequestSendCallBack; RequestSendCallBack;
class SendProcessor : public ClientBase { class SendProcessor : public BaseProcessor {
public: public:
explicit SendProcessor(std::shared_ptr<grpc::Channel> ch) : ClientBase(ch) {} explicit SendProcessor(std::shared_ptr<grpc::Channel> ch)
: BaseProcessor(ch) {}
virtual ~SendProcessor() {} virtual ~SendProcessor() {}
...@@ -110,9 +111,10 @@ class SendProcessor : public ClientBase { ...@@ -110,9 +111,10 @@ class SendProcessor : public ClientBase {
typedef std::function<void(const VarHandle&, const sendrecv::VariableMessage&)> typedef std::function<void(const VarHandle&, const sendrecv::VariableMessage&)>
RequestGetCallBack; RequestGetCallBack;
class GetProcessor : public ClientBase { class GetProcessor : public BaseProcessor {
public: public:
explicit GetProcessor(std::shared_ptr<grpc::Channel> ch) : ClientBase(ch) {} explicit GetProcessor(std::shared_ptr<grpc::Channel> ch)
: BaseProcessor(ch) {}
virtual ~GetProcessor() {} virtual ~GetProcessor() {}
...@@ -126,10 +128,10 @@ class GetProcessor : public ClientBase { ...@@ -126,10 +128,10 @@ class GetProcessor : public ClientBase {
RequestGetCallBack response_call_back_ = ProcGetResponse; RequestGetCallBack response_call_back_ = ProcGetResponse;
}; };
class BatchBarrierProcessor : public ClientBase { class BatchBarrierProcessor : public BaseProcessor {
public: public:
explicit BatchBarrierProcessor(std::shared_ptr<grpc::Channel> ch) explicit BatchBarrierProcessor(std::shared_ptr<grpc::Channel> ch)
: ClientBase(ch) {} : BaseProcessor(ch) {}
virtual ~BatchBarrierProcessor() {} virtual ~BatchBarrierProcessor() {}
...@@ -137,6 +139,17 @@ class BatchBarrierProcessor : public ClientBase { ...@@ -137,6 +139,17 @@ class BatchBarrierProcessor : public ClientBase {
sendrecv::VoidMessage reply_; sendrecv::VoidMessage reply_;
}; };
class FetchBarrierProcessor : public BaseProcessor {
public:
explicit FetchBarrierProcessor(std::shared_ptr<grpc::Channel> ch)
: BaseProcessor(ch) {}
virtual ~FetchBarrierProcessor() {}
virtual void Process() {}
sendrecv::VariableMessage reply_;
};
class RPCClient { class RPCClient {
public: public:
bool AsyncSendVariable(const std::string& ep, bool AsyncSendVariable(const std::string& ep,
...@@ -151,7 +164,10 @@ class RPCClient { ...@@ -151,7 +164,10 @@ class RPCClient {
const std::string& var_name, const std::string& var_name,
int64_t time_out = 600 * 1000); int64_t time_out = 600 * 1000);
bool AsyncSendBatchBarrier(const std::string& ep, void AsyncSendBatchBarrier(const std::string& ep,
int64_t time_out = 600 * 1000);
void AsyncSendFetchBarrier(const std::string& ep,
int64_t time_out = 600 * 1000); int64_t time_out = 600 * 1000);
bool Wait(); bool Wait();
......
...@@ -84,7 +84,7 @@ class RequestGet final : public RequestBase { ...@@ -84,7 +84,7 @@ class RequestGet final : public RequestBase {
explicit RequestGet(sendrecv::SendRecvService::AsyncService* service, explicit RequestGet(sendrecv::SendRecvService::AsyncService* service,
grpc::ServerCompletionQueue* cq, framework::Scope* scope, grpc::ServerCompletionQueue* cq, framework::Scope* scope,
const platform::DeviceContext* dev_ctx, const platform::DeviceContext* dev_ctx,
SimpleBlockQueue<char>* queue) SimpleBlockQueue<MessageWithName>* queue)
: RequestBase(service, cq), : RequestBase(service, cq),
responder_(&ctx_), responder_(&ctx_),
scope_(scope), scope_(scope),
...@@ -101,11 +101,16 @@ class RequestGet final : public RequestBase { ...@@ -101,11 +101,16 @@ class RequestGet final : public RequestBase {
// proc request. // proc request.
std::string var_name = request_.varname(); std::string var_name = request_.varname();
auto* var = scope_->FindVar(var_name); auto* var = scope_->FindVar(var_name);
SerializeToMessage(var_name, var, *dev_ctx_, &reply_); if (var_name != FETCH_BARRIER_MESSAGE) {
SerializeToMessage(var_name, var, *dev_ctx_, &reply_);
}
// TODO(gongwb): check var's info. // TODO(gongwb): check var's info.
responder_.Finish(reply_, grpc::Status::OK, this); responder_.Finish(reply_, grpc::Status::OK, this);
status_ = FINISH; status_ = FINISH;
queue_->Push('c'); MessageWithName msg_with_name =
// request name reply
std::make_pair(var_name, std::move(reply_));
queue_->Push(msg_with_name);
} }
protected: protected:
...@@ -114,12 +119,16 @@ class RequestGet final : public RequestBase { ...@@ -114,12 +119,16 @@ class RequestGet final : public RequestBase {
ServerAsyncResponseWriter<sendrecv::VariableMessage> responder_; ServerAsyncResponseWriter<sendrecv::VariableMessage> responder_;
framework::Scope* scope_; framework::Scope* scope_;
const platform::DeviceContext* dev_ctx_; const platform::DeviceContext* dev_ctx_;
SimpleBlockQueue<char>* queue_; SimpleBlockQueue<MessageWithName>* queue_;
}; };
void AsyncGRPCServer::WaitClientGet(int count) { void AsyncGRPCServer::WaitClientGet(int count) {
for (int i = 0; i < count; ++i) { int fetch_barriers = 0;
var_get_queue_.Pop(); while (fetch_barriers < count) {
auto msg = var_get_queue_.Pop();
if (msg.first == FETCH_BARRIER_MESSAGE) {
fetch_barriers++;
}
} }
} }
......
...@@ -77,7 +77,7 @@ class AsyncGRPCServer final : public sendrecv::SendRecvService::Service { ...@@ -77,7 +77,7 @@ class AsyncGRPCServer final : public sendrecv::SendRecvService::Service {
const platform::DeviceContext *dev_ctx_; const platform::DeviceContext *dev_ctx_;
// received variable from RPC, operators fetch variable from this queue. // received variable from RPC, operators fetch variable from this queue.
SimpleBlockQueue<MessageWithName> var_recv_queue_; SimpleBlockQueue<MessageWithName> var_recv_queue_;
SimpleBlockQueue<char> var_get_queue_; SimpleBlockQueue<MessageWithName> var_get_queue_;
// condition of the sub program // condition of the sub program
std::mutex barrier_mutex_; std::mutex barrier_mutex_;
......
...@@ -32,6 +32,7 @@ namespace detail { ...@@ -32,6 +32,7 @@ namespace detail {
#define LISTEN_TERMINATE_MESSAGE "TERMINATE@RECV" #define LISTEN_TERMINATE_MESSAGE "TERMINATE@RECV"
#define BATCH_BARRIER_MESSAGE "BATCH_BARRIER@RECV" #define BATCH_BARRIER_MESSAGE "BATCH_BARRIER@RECV"
#define FETCH_BARRIER_MESSAGE "FETCH_BARRIER@RECV"
typedef void (*DestroyCallback)(void*); typedef void (*DestroyCallback)(void*);
......
...@@ -128,8 +128,8 @@ class ListenAndServOp : public framework::OperatorBase { ...@@ -128,8 +128,8 @@ class ListenAndServOp : public framework::OperatorBase {
} }
} }
if (exit_flag) { if (exit_flag) {
rpc_service_->ShutDown();
rpc_service_->SetCond(1); rpc_service_->SetCond(1);
rpc_service_->ShutDown();
break; break;
} }
try { try {
...@@ -148,7 +148,7 @@ class ListenAndServOp : public framework::OperatorBase { ...@@ -148,7 +148,7 @@ class ListenAndServOp : public framework::OperatorBase {
} }
rpc_service_->SetCond(1); rpc_service_->SetCond(1);
// FIXME(typhoonzero): use another condition to sync wait clients get. // FIXME(typhoonzero): use another condition to sync wait clients get.
rpc_service_->WaitClientGet(ins.size()); rpc_service_->WaitClientGet(fan_in);
sparse_vars.clear(); sparse_vars.clear();
} // while(true) } // while(true)
} }
......
...@@ -104,19 +104,38 @@ class NCCLAllReduceOp : public framework::OperatorWithKernel { ...@@ -104,19 +104,38 @@ class NCCLAllReduceOp : public framework::OperatorWithKernel {
" Input(Communicator) of AllReduce op input should not be NULL"); " Input(Communicator) of AllReduce op input should not be NULL");
PADDLE_ENFORCE(ctx->HasOutput("Out"), PADDLE_ENFORCE(ctx->HasOutput("Out"),
" Output(Out) of AllReduce op output should not be NULL"); " Output(Out) of AllReduce op output should not be NULL");
auto x_dims = ctx->GetInputsDim("X");
std::string reduction = ctx->Attrs().Get<std::string>("reduction"); std::string reduction = ctx->Attrs().Get<std::string>("reduction");
PADDLE_ENFORCE((reduction == "ncclSum" || reduction == "ncclProd" || PADDLE_ENFORCE((reduction == "ncclSum" || reduction == "ncclProd" ||
reduction == "ncclMin" || reduction == "ncclMax"), reduction == "ncclMin" || reduction == "ncclMax"),
"invalid reduction."); "invalid reduction.");
auto x_dims = ctx->GetInputsDim("X");
ctx->SetOutputsDim("Out", x_dims); ctx->SetOutputsDim("Out", x_dims);
ctx->ShareLoD("X", /*->*/ "Out"); ctx->ShareLoD("X", /*->*/ "Out");
} }
}; };
// AllReduceOp
class NCCLAllReduceOpMaker : public framework::OpProtoAndCheckerMaker {
public:
NCCLAllReduceOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "The input of AllReduce op");
AddInput("Communicator", "Communicator for communicating between gpus");
AddOutput("Out", "The output of AllReduce op");
AddAttr<std::string>("reduction",
"(string, default 'ncclSum') "
"{'ncclMin', 'ncclMax', 'ncclProd', 'ncclSum'}.")
.SetDefault("ncclSum");
AddComment(R"DOC(
NCCLAllReduce Operator.
AllReduce the input tensors.
)DOC");
}
};
// ReduceOp // ReduceOp
class NCCLReduceOp : public framework::OperatorWithKernel { class NCCLReduceOp : public framework::OperatorWithKernel {
public: public:
...@@ -143,50 +162,6 @@ class NCCLReduceOp : public framework::OperatorWithKernel { ...@@ -143,50 +162,6 @@ class NCCLReduceOp : public framework::OperatorWithKernel {
} }
}; };
// BcastOp
class NCCLBcastOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
protected:
void InferShape(framework::InferShapeContext *ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("X"),
" Input(X) of Bcast op input should not be NULL");
PADDLE_ENFORCE(ctx->HasInput("Communicator"),
" Input(Communicator) of Bcast op input should not be NULL");
PADDLE_ENFORCE(ctx->HasOutput("Out"),
" Output(Out) of Bcast op output should not be NULL");
int root = ctx->Attrs().Get<int>("root");
PADDLE_ENFORCE(root != platform::kInvalidGPUId, "Bcast root must be set.");
auto x_dims = ctx->GetInputsDim("X");
ctx->SetOutputsDim("Out", x_dims);
ctx->ShareLoD("X", /*->*/ "Out");
}
};
// AllreduceOp
class NCCLAllReduceOpMaker : public framework::OpProtoAndCheckerMaker {
public:
NCCLAllReduceOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "The input of AllReduce op");
AddInput("Communicator", "Communicator for communicating between gpus");
AddOutput("Out", "The output of AllReduce op");
AddAttr<std::string>("reduction",
"(string, default 'ncclSum') "
"{'ncclMin', 'ncclMax', 'ncclProd', 'ncclSum'}.")
.SetDefault("ncclSum");
AddComment(R"DOC(
NCCLAllReduce Operator.
AllReduce the input tensors.
)DOC");
}
};
// ReduceOp // ReduceOp
class NCCLReduceOpMaker : public framework::OpProtoAndCheckerMaker { class NCCLReduceOpMaker : public framework::OpProtoAndCheckerMaker {
public: public:
...@@ -213,6 +188,29 @@ Reduce the tensors. ...@@ -213,6 +188,29 @@ Reduce the tensors.
} }
}; };
// BcastOp
class NCCLBcastOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
protected:
void InferShape(framework::InferShapeContext *ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("X"),
" Input(X) of Bcast op input should not be NULL");
PADDLE_ENFORCE(ctx->HasInput("Communicator"),
" Input(Communicator) of Bcast op input should not be NULL");
PADDLE_ENFORCE(ctx->HasOutput("Out"),
" Output(Out) of Bcast op output should not be NULL");
int root = ctx->Attrs().Get<int>("root");
PADDLE_ENFORCE(root != platform::kInvalidGPUId, "Bcast root must be set.");
auto x_dims = ctx->GetInputsDim("X");
ctx->SetOutputsDim("Out", x_dims);
ctx->ShareLoD("X", /*->*/ "Out");
}
};
// BcastOp // BcastOp
class NCCLBcastOpMaker : public framework::OpProtoAndCheckerMaker { class NCCLBcastOpMaker : public framework::OpProtoAndCheckerMaker {
public: public:
......
...@@ -43,13 +43,12 @@ class NCCLAllReduceKernel : public framework::OpKernel<T> { ...@@ -43,13 +43,12 @@ class NCCLAllReduceKernel : public framework::OpKernel<T> {
void Compute(const framework::ExecutionContext& ctx) const override { void Compute(const framework::ExecutionContext& ctx) const override {
PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()),
"This kernel only runs on GPU device."); "This kernel only runs on GPU device.");
auto* x = ctx.Input<LoDTensor>("X");
auto ins = ctx.MultiInput<LoDTensor>("X"); auto* out = ctx.Output<LoDTensor>("Out");
auto outs = ctx.MultiOutput<LoDTensor>("Out"); auto* comm = ctx.Input<Communicator>("Communicator");
std::string reduction = ctx.Attr<std::string>("reduction"); std::string reduction = ctx.Attr<std::string>("reduction");
ncclRedOp_t reduction_op_ = ncclSum;
ncclRedOp_t reduction_op_ = ncclSum;
if (reduction == "ncclMin") { if (reduction == "ncclMin") {
reduction_op_ = ncclMin; reduction_op_ = ncclMin;
} else if (reduction == "ncclMax") { } else if (reduction == "ncclMax") {
...@@ -61,30 +60,19 @@ class NCCLAllReduceKernel : public framework::OpKernel<T> { ...@@ -61,30 +60,19 @@ class NCCLAllReduceKernel : public framework::OpKernel<T> {
} else { } else {
PADDLE_THROW("Invalid reduction. default ncclSum."); PADDLE_THROW("Invalid reduction. default ncclSum.");
} }
auto* comm = ctx.Input<Communicator>("Communicator");
auto stream = ctx.cuda_device_context().stream();
// device id // device id
int gpu_id = boost::get<platform::CUDAPlace>(ctx.GetPlace()).GetDeviceId(); int gpu_id = boost::get<platform::CUDAPlace>(ctx.GetPlace()).GetDeviceId();
int idx = comm->GetCommId(gpu_id); int idx = comm->GetCommId(gpu_id);
VLOG(3) << "gpu : "
for (size_t i = 0; i < ins.size(); ++i) { << " invoke allreduce. send " << x->numel() << " recv "
VLOG(1) << "gpu : " << out->numel();
<< " invoke allreduce. send " << ins[i]->numel() << " recv " PADDLE_ENFORCE(platform::dynload::ncclAllReduce(
<< outs[i]->numel(); x->data<T>(), out->mutable_data<T>(ctx.GetPlace()), out->numel(),
NCCLTypeWrapper<T>::type, reduction_op_, comm->comms().at(idx),
PADDLE_ENFORCE(platform::dynload::ncclAllReduce( ctx.cuda_device_context().stream()));
ins[i]->data<T>(), outs[i]->mutable_data<T>(ctx.GetPlace()), VLOG(3) << "gpu : "
outs[i]->numel(), NCCLTypeWrapper<T>::type, reduction_op_, << " finished allreduce. send " << x->numel() << " recv "
comm->comms().at(idx), stream)); << out->numel();
PADDLE_ENFORCE(cudaStreamSynchronize(stream));
VLOG(1) << "gpu : "
<< " finished allreduce. send " << ins[i]->numel() << " recv "
<< outs[i]->numel();
}
} }
}; };
...@@ -94,13 +82,13 @@ class NCCLReduceKernel : public framework::OpKernel<T> { ...@@ -94,13 +82,13 @@ class NCCLReduceKernel : public framework::OpKernel<T> {
void Compute(const framework::ExecutionContext& ctx) const override { void Compute(const framework::ExecutionContext& ctx) const override {
PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()),
"This kernel only runs on GPU device."); "This kernel only runs on GPU device.");
auto x = ctx.Input<LoDTensor>("X"); // x0, x1, x2
auto ins = ctx.MultiInput<LoDTensor>("X"); // x0, x1, x2 auto out = ctx.Output<LoDTensor>("Out");
auto outs = ctx.MultiOutput<LoDTensor>("Out"); auto* comm = ctx.Input<Communicator>("Communicator");
int root = ctx.Attr<int>("root");
std::string reduction = ctx.Attr<std::string>("reduction"); std::string reduction = ctx.Attr<std::string>("reduction");
ncclRedOp_t reduction_op_ = ncclSum;
ncclRedOp_t reduction_op_ = ncclSum;
if (reduction == "ncclMin") { if (reduction == "ncclMin") {
reduction_op_ = ncclMin; reduction_op_ = ncclMin;
} else if (reduction == "ncclMax") { } else if (reduction == "ncclMax") {
...@@ -112,40 +100,21 @@ class NCCLReduceKernel : public framework::OpKernel<T> { ...@@ -112,40 +100,21 @@ class NCCLReduceKernel : public framework::OpKernel<T> {
} else { } else {
PADDLE_THROW("Invalid reduction. default ncclSum."); PADDLE_THROW("Invalid reduction. default ncclSum.");
} }
int root = ctx.Attr<int>("root");
auto* comm = ctx.Input<Communicator>("Communicator");
auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
ctx.device_context())
.stream();
// device id // device id
int gpu_id = boost::get<platform::CUDAPlace>(ctx.GetPlace()).GetDeviceId(); int gpu_id = boost::get<platform::CUDAPlace>(ctx.GetPlace()).GetDeviceId();
int idx = comm->GetCommId(gpu_id); int idx = comm->GetCommId(gpu_id);
T* recvbuffer = nullptr;
auto ins_names = ctx.Inputs("X"); if (root == gpu_id) {
std::hash<std::string> hasher; recvbuffer = out->mutable_data<T>(ctx.GetPlace());
for (size_t i = 0; i < ins.size(); ++i) {
if (root == platform::kInvalidGPUId) {
root = hasher(ins_names[i]) % comm->comms().size();
}
T* recvbuffer = nullptr;
if (root == gpu_id) {
recvbuffer = outs[i]->mutable_data<T>(ctx.GetPlace());
}
VLOG(1) << "gpu : " << gpu_id << " invoke reduce. send "
<< ins[i]->numel() << " recv " << outs[i]->numel();
PADDLE_ENFORCE(platform::dynload::ncclReduce(
ins[i]->data<T>(), recvbuffer, ins[i]->numel(),
NCCLTypeWrapper<T>::type, reduction_op_, root, comm->comms().at(idx),
stream));
PADDLE_ENFORCE(cudaStreamSynchronize(stream));
VLOG(1) << "gpu : " << gpu_id << " finished reduce. send "
<< ins[i]->numel() << " recv " << outs[i]->numel();
} }
VLOG(3) << "gpu : " << gpu_id << " invoke reduce. send " << x->numel()
<< " recv " << out->numel();
PADDLE_ENFORCE(platform::dynload::ncclReduce(
x->data<T>(), recvbuffer, x->numel(), NCCLTypeWrapper<T>::type,
reduction_op_, root, comm->comms().at(idx),
ctx.cuda_device_context().stream()));
VLOG(3) << "gpu : " << gpu_id << " finished reduce. send " << x->numel()
<< " recv " << out->numel();
} }
}; };
...@@ -155,47 +124,27 @@ class NCCLBcastKernel : public framework::OpKernel<T> { ...@@ -155,47 +124,27 @@ class NCCLBcastKernel : public framework::OpKernel<T> {
void Compute(const framework::ExecutionContext& ctx) const override { void Compute(const framework::ExecutionContext& ctx) const override {
PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()),
"This kernel only runs on GPU device."); "This kernel only runs on GPU device.");
int root = ctx.Attr<int>("root"); int root = ctx.Attr<int>("root");
auto* comm = ctx.Input<Communicator>("Communicator"); auto* comm = ctx.Input<Communicator>("Communicator");
auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
ctx.device_context())
.stream();
// device id // device id
int gpu_id = boost::get<platform::CUDAPlace>(ctx.GetPlace()).GetDeviceId(); int gpu_id = boost::get<platform::CUDAPlace>(ctx.GetPlace()).GetDeviceId();
int idx = comm->GetCommId(gpu_id); int idx = comm->GetCommId(gpu_id);
if (idx == root) { if (idx == root) {
auto ins = ctx.MultiInput<LoDTensor>("X"); auto* x = ctx.Input<LoDTensor>("X");
for (size_t i = 0; i < ins.size(); ++i) { VLOG(3) << "gpu : " << gpu_id << " invoke Bcast. send " << x->numel();
VLOG(1) << "gpu : " << gpu_id << " invoke Bcast. send " PADDLE_ENFORCE(platform::dynload::ncclBcast(
<< ins[i]->numel(); (void*)x->data<T>(), x->numel(), NCCLTypeWrapper<T>::type, root,
comm->comms().at(idx), ctx.cuda_device_context().stream()));
VLOG(1) << " before ncclBcast"; VLOG(3) << "gpu : " << gpu_id << " finished Bcast.";
PADDLE_ENFORCE(platform::dynload::ncclBcast(
(void*)ins[i]->data<T>(), ins[i]->numel(), NCCLTypeWrapper<T>::type,
root, comm->comms().at(idx), stream));
VLOG(1) << " after ncclBcast";
PADDLE_ENFORCE(cudaStreamSynchronize(stream));
VLOG(1) << "gpu : " << gpu_id << " finished Bcast.";
}
} else { } else {
auto outs = ctx.MultiOutput<LoDTensor>("Out"); auto* out = ctx.Output<LoDTensor>("Out");
for (size_t i = 0; i < outs.size(); ++i) { VLOG(3) << "gpu : " << gpu_id << " invoke Bcast. recv buffer "
VLOG(1) << "gpu : " << gpu_id << " invoke Bcast. recv buffer " << framework::product(out->dims());
<< framework::product(outs[i]->dims()); PADDLE_ENFORCE(platform::dynload::ncclBcast(
out->mutable_data<T>(ctx.GetPlace()), out->numel(),
PADDLE_ENFORCE(platform::dynload::ncclBcast( NCCLTypeWrapper<T>::type, root, comm->comms().at(idx),
outs[i]->mutable_data<T>(ctx.GetPlace()), outs[i]->numel(), ctx.cuda_device_context().stream()));
NCCLTypeWrapper<T>::type, root, comm->comms().at(idx), stream)); VLOG(3) << "gpu : " << gpu_id << " finished Bcast. recv " << out->numel();
PADDLE_ENFORCE(cudaStreamSynchronize(stream));
VLOG(1) << "gpu : " << gpu_id << " finished Bcast. recv "
<< outs[i]->numel();
}
} }
} }
}; };
......
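The three kernels above now take a single `X`/`Out` pair per op and translate the `reduction` attribute string into an NCCL operator. As a rough illustration of the collective semantics only (not of NCCL or of the Paddle kernels), the sketch below simulates an all-reduce and a rooted reduce over a list of per-device buffers. The names `REDUCTIONS`, `all_reduce` and `reduce_to_root` are hypothetical, and the table lists only the reduction strings visible in this diff.

```python
from functools import reduce
import operator

# Hypothetical table mirroring the attribute strings visible in the diff.
REDUCTIONS = {
    "ncclSum": operator.add,
    "ncclMin": min,
    "ncclMax": max,
}


def _combine(buffers, reduction):
    """Element-wise reduction over equally sized per-device buffers."""
    if reduction not in REDUCTIONS:
        raise ValueError("Invalid reduction: %r" % reduction)
    op = REDUCTIONS[reduction]
    return [reduce(op, column) for column in zip(*buffers)]


def all_reduce(buffers, reduction="ncclSum"):
    """Every device receives its own copy of the reduced result."""
    result = _combine(buffers, reduction)
    return [list(result) for _ in buffers]


def reduce_to_root(buffers, root, reduction="ncclSum"):
    """Only the root device receives the result; the others get None."""
    result = _combine(buffers, reduction)
    return [result if rank == root else None for rank in range(len(buffers))]
```

In the rooted case only one rank allocates a receive buffer, which corresponds to `recvbuffer` staying `nullptr` on non-root devices in `NCCLReduceKernel`.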
...@@ -24,11 +24,31 @@ static constexpr size_t kDoubleBufferSize = 2; ...@@ -24,11 +24,31 @@ static constexpr size_t kDoubleBufferSize = 2;
class DoubleBufferReader : public framework::DecoratedReader { class DoubleBufferReader : public framework::DecoratedReader {
public: public:
explicit DoubleBufferReader(ReaderBase* reader) struct Item {
: DecoratedReader(reader), Item() : ctx_(nullptr) {}
buffer_(framework::MakeChannel<std::vector<framework::LoDTensor>>(
kDoubleBufferSize)) { std::vector<framework::LoDTensor> payloads_;
std::thread prefetch(&DoubleBufferReader::PrefetchThreadFunc, this); platform::DeviceContext* ctx_;
};
explicit DoubleBufferReader(
ReaderBase* reader, platform::Place target_place = platform::CPUPlace())
: DecoratedReader(reader), place_(target_place) {
for (size_t i = 0; i < kDoubleBufferSize; ++i) {
if (platform::is_gpu_place(place_)) {
#ifdef PADDLE_WITH_CUDA
ctxs_.emplace_back(new platform::CUDADeviceContext(
boost::get<platform::CUDAPlace>(place_)));
#endif
}
}
start_thread();
}
void start_thread() {
buffer_ = framework::MakeChannel<Item>(kDoubleBufferSize);
std::thread prefetch([this] { PrefetchThreadFunc(); });
prefetch.detach(); prefetch.detach();
} }
...@@ -42,7 +62,10 @@ class DoubleBufferReader : public framework::DecoratedReader { ...@@ -42,7 +62,10 @@ class DoubleBufferReader : public framework::DecoratedReader {
private: private:
void PrefetchThreadFunc(); void PrefetchThreadFunc();
framework::Channel<std::vector<framework::LoDTensor>>* buffer_; framework::Channel<Item>* buffer_;
platform::Place place_;
std::vector<std::unique_ptr<platform::DeviceContext>> ctxs_;
mutable Item local_buffer_;
}; };
class CreateDoubleBufferReaderOp : public framework::OperatorBase { class CreateDoubleBufferReaderOp : public framework::OperatorBase {
...@@ -56,7 +79,20 @@ class CreateDoubleBufferReaderOp : public framework::OperatorBase { ...@@ -56,7 +79,20 @@ class CreateDoubleBufferReaderOp : public framework::OperatorBase {
->Get<framework::ReaderHolder>(); ->Get<framework::ReaderHolder>();
auto* out = scope.FindVar(Output("Out")) auto* out = scope.FindVar(Output("Out"))
->template GetMutable<framework::ReaderHolder>(); ->template GetMutable<framework::ReaderHolder>();
out->Reset(new DoubleBufferReader(underlying_reader.Get()));
auto place_str = Attr<std::string>("place");
platform::Place place;
if (place_str == "CPU") {
place = platform::CPUPlace();
} else {
std::istringstream sin(place_str);
sin.seekg(std::string("CUDA:").size(), std::ios::beg);
size_t num;
sin >> num;
place = platform::CUDAPlace(static_cast<int>(num));
}
out->Reset(new DoubleBufferReader(underlying_reader.Get(), place));
} }
}; };
...@@ -71,44 +107,73 @@ class CreateDoubleBufferReaderOpMaker : public DecoratedReaderMakerBase { ...@@ -71,44 +107,73 @@ class CreateDoubleBufferReaderOpMaker : public DecoratedReaderMakerBase {
It launches another thread to execute the 'underlying reader' asynchronously, It launches another thread to execute the 'underlying reader' asynchronously,
which prevents reading process from blocking subsequent training. which prevents reading process from blocking subsequent training.
)DOC"); )DOC");
std::unordered_set<std::string> enum_range;
constexpr size_t kMaxCUDADevs = 128;
for (size_t i = 0; i < kMaxCUDADevs; ++i) {
enum_range.insert(string::Sprintf("CUDA:%d", i));
}
enum_range.insert("CPU");
AddAttr<std::string>("place", "The double buffer place, default is CPU")
.SetDefault("CPU")
.InEnum({enum_range});
} }
}; };
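The `place` attribute above is restricted to the string `CPU` or one of `CUDA:0` through `CUDA:127`, and the op recovers the device id by skipping the `CUDA:` prefix. A minimal Python sketch of that validation and parsing follows; `allowed_places` and `parse_place` are hypothetical names, and the returned tuples merely stand in for `platform::CPUPlace` and `platform::CUDAPlace`.

```python
MAX_CUDA_DEVS = 128  # mirrors kMaxCUDADevs in the op maker


def allowed_places():
    """The set of strings accepted by the `place` attribute."""
    places = {"CUDA:%d" % i for i in range(MAX_CUDA_DEVS)}
    places.add("CPU")
    return places


def parse_place(place_str):
    """Return ("CPU", None) or ("CUDA", device_id) for a valid attribute."""
    if place_str not in allowed_places():
        raise ValueError("unsupported place: %r" % place_str)
    if place_str == "CPU":
        return ("CPU", None)
    # Skip the "CUDA:" prefix and read the device index that follows it.
    return ("CUDA", int(place_str[len("CUDA:"):]))
```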
void DoubleBufferReader::ReadNext(std::vector<framework::LoDTensor>* out) { void DoubleBufferReader::ReadNext(std::vector<framework::LoDTensor>* out) {
out->clear(); if (local_buffer_.payloads_.empty()) {
buffer_->Receive(out); buffer_->Receive(&local_buffer_);
}
*out = local_buffer_.payloads_;
local_buffer_.payloads_.clear();
if (local_buffer_.ctx_) {
local_buffer_.ctx_->Wait();
}
} }
void DoubleBufferReader::ReInit() { void DoubleBufferReader::ReInit() {
reader_->ReInit(); reader_->ReInit();
buffer_->Close(); buffer_->Close();
// The existing prefetch thread will terminate for the buffer_ is closed. start_thread();
buffer_ = framework::MakeChannel<std::vector<framework::LoDTensor>>(
kDoubleBufferSize);
std::thread prefetch(&DoubleBufferReader::PrefetchThreadFunc, this);
prefetch.detach();
} }
void DoubleBufferReader::PrefetchThreadFunc() { void DoubleBufferReader::PrefetchThreadFunc() {
VLOG(5) << "A new prefetch thread starts."; VLOG(5) << "A new prefetch thread starts.";
while (true) { size_t gpu_ctx_offset = 0;
std::vector<framework::LoDTensor> batch; while (reader_->HasNext()) {
reader_->ReadNext(&batch); Item batch;
if (batch.empty()) { reader_->ReadNext(&batch.payloads_);
// EOF if (platform::is_gpu_place(place_)) {
buffer_->Close(); std::vector<framework::LoDTensor> gpu_batch;
VLOG(5) << "Reached the end of the file. The prefetch thread terminates."; auto& gpu_ctx = this->ctxs_[gpu_ctx_offset++];
break; gpu_ctx_offset %= this->ctxs_.size();
gpu_batch.resize(batch.payloads_.size());
for (size_t i = 0; i < batch.payloads_.size(); ++i) {
framework::TensorCopy(batch.payloads_[i], place_, *gpu_ctx,
&gpu_batch[i]);
gpu_batch[i].set_lod(batch.payloads_[i].lod());
}
batch.ctx_ = gpu_ctx.get();
std::swap(gpu_batch, batch.payloads_);
} }
if (!buffer_->Send(&batch)) { if (!buffer_->Send(&batch)) {
VLOG(5) << "WARNING: The double buffer channel has been closed. The " VLOG(5) << "WARNING: The double buffer channel has been closed. The "
"prefetch thread terminates."; "prefetch thread terminates.";
break; break;
} }
} }
buffer_->Close();
} }
bool DoubleBufferReader::HasNext() const { PADDLE_THROW("Not Implemented"); } bool DoubleBufferReader::HasNext() const {
if (local_buffer_.payloads_.empty()) {
bool ok = buffer_->Receive(&local_buffer_);
return ok;
} else {
return true;
}
}
} // namespace reader } // namespace reader
} // namespace operators } // namespace operators
......
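The reworked `DoubleBufferReader` keeps a bounded channel of size `kDoubleBufferSize`, fills it from a prefetch thread, and implements `HasNext` by pulling one item into `local_buffer_` ahead of time. The Python sketch below illustrates that pattern with `queue.Queue` and a sentinel in place of a closed channel. It is an analogy under stated assumptions, not the Paddle implementation: the class and method names are hypothetical, and device copies and `ReInit` are left out.

```python
import queue
import threading

_EOF = object()    # sentinel standing in for a closed channel
_EMPTY = object()  # marks "no batch held locally"


class DoubleBufferReader:
    """Prefetches batches from `source` on a background thread."""

    def __init__(self, source, buffer_size=2):
        self._queue = queue.Queue(maxsize=buffer_size)
        self._local = _EMPTY  # batch received but not yet handed out
        self._done = False
        worker = threading.Thread(target=self._prefetch, args=(iter(source),))
        worker.daemon = True
        worker.start()

    def _prefetch(self, iterator):
        for batch in iterator:
            self._queue.put(batch)  # blocks while the buffer is full
        self._queue.put(_EOF)       # signal the end of the data

    def has_next(self):
        if self._local is _EMPTY and not self._done:
            item = self._queue.get()  # blocks until a batch or EOF arrives
            if item is _EOF:
                self._done = True
            else:
                self._local = item
        return self._local is not _EMPTY

    def read_next(self):
        if not self.has_next():
            raise EOFError("no more batches")
        batch, self._local = self._local, _EMPTY
        return batch
```

Because `has_next` may consume an item from the queue, the locally held batch has to be returned by the following `read_next`, which is the role `local_buffer_` plays in the C++ code.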
...@@ -19,11 +19,11 @@ namespace operators { ...@@ -19,11 +19,11 @@ namespace operators {
namespace reader { namespace reader {
template <typename T> template <typename T>
class RandomDataGenerator : public framework::FileReader { class RandomDataGenerator : public framework::ReaderBase {
public: public:
RandomDataGenerator(const std::vector<framework::DDim>& shapes, float min, RandomDataGenerator(const std::vector<framework::DDim>& shapes, float min,
float max) float max)
: FileReader(shapes), min_(min), max_(max) { : framework::ReaderBase(), min_(min), max_(max), shapes_(shapes) {
PADDLE_ENFORCE_LE( PADDLE_ENFORCE_LE(
min, max, "'min' shouldn't be greater than 'max'.(%f vs %f)", min, max); min, max, "'min' shouldn't be greater than 'max'.(%f vs %f)", min, max);
unsigned int seed = std::random_device()(); unsigned int seed = std::random_device()();
...@@ -59,6 +59,7 @@ class RandomDataGenerator : public framework::FileReader { ...@@ -59,6 +59,7 @@ class RandomDataGenerator : public framework::FileReader {
float max_; float max_;
std::minstd_rand engine_; std::minstd_rand engine_;
std::uniform_real_distribution<float> dist_; std::uniform_real_distribution<float> dist_;
std::vector<framework::DDim> shapes_;
}; };
template <typename T> template <typename T>
......
...@@ -20,21 +20,22 @@ namespace operators { ...@@ -20,21 +20,22 @@ namespace operators {
namespace reader { namespace reader {
class RecordIOFileReader : public framework::FileReader { class RecordIOFileReader : public framework::FileReader {
public: public:
RecordIOFileReader(const std::string& filename, explicit RecordIOFileReader(const std::string& filename,
const std::vector<framework::DDim>& shapes) const std::vector<framework::DDim>& dims)
: FileReader(shapes), : FileReader(dims),
scanner_(filename), scanner_(filename),
dev_ctx_(*platform::DeviceContextPool::Instance().Get( dev_ctx_(*platform::DeviceContextPool::Instance().Get(
platform::CPUPlace())) {} platform::CPUPlace())) {}
void ReadNext(std::vector<framework::LoDTensor>* out) override {
*out = framework::ReadFromRecordIO(scanner_, dev_ctx_);
}
bool HasNext() const override { return scanner_.HasNext(); } bool HasNext() const override { return scanner_.HasNext(); }
void ReInit() override { scanner_.Reset(); } void ReInit() override { scanner_.Reset(); }
protected:
void ReadNextImpl(std::vector<framework::LoDTensor>* out) override {
*out = framework::ReadFromRecordIO(scanner_, dev_ctx_);
}
private: private:
recordio::Scanner scanner_; recordio::Scanner scanner_;
const platform::DeviceContext& dev_ctx_; const platform::DeviceContext& dev_ctx_;
...@@ -54,12 +55,12 @@ class CreateRecordIOReaderOp : public framework::OperatorBase { ...@@ -54,12 +55,12 @@ class CreateRecordIOReaderOp : public framework::OperatorBase {
int(shape_concat.size()), int(shape_concat.size()),
"The accumulate of all ranks should be equal to the " "The accumulate of all ranks should be equal to the "
"shape concat's length."); "shape concat's length.");
std::vector<framework::DDim> shapes = RestoreShapes(shape_concat, ranks);
std::string filename = Attr<std::string>("filename"); std::string filename = Attr<std::string>("filename");
auto* out = scope.FindVar(Output("Out")) auto* out = scope.FindVar(Output("Out"))
->template GetMutable<framework::ReaderHolder>(); ->template GetMutable<framework::ReaderHolder>();
out->Reset(new RecordIOFileReader(filename, shapes)); out->Reset(
new RecordIOFileReader(filename, RestoreShapes(shape_concat, ranks)));
} }
}; };
...@@ -85,3 +86,5 @@ namespace reader = paddle::operators::reader; ...@@ -85,3 +86,5 @@ namespace reader = paddle::operators::reader;
REGISTER_FILE_READER_OPERATOR(create_recordio_file_reader, REGISTER_FILE_READER_OPERATOR(create_recordio_file_reader,
reader::CreateRecordIOReaderOp, reader::CreateRecordIOReaderOp,
reader::CreateRecordIOReaderOpMaker); reader::CreateRecordIOReaderOpMaker);
REGISTER_FILE_READER(recordio, reader::RecordIOFileReader);
...@@ -12,6 +12,9 @@ ...@@ -12,6 +12,9 @@
// See the License for the specific language governing permissions and // See the License for the specific language governing permissions and
// limitations under the License. // limitations under the License.
#include <random>
#include "glog/logging.h"
#include "paddle/fluid/operators/detail/safe_ref.h"
#include "paddle/fluid/operators/reader/reader_op_registry.h" #include "paddle/fluid/operators/reader/reader_op_registry.h"
namespace paddle { namespace paddle {
...@@ -20,43 +23,53 @@ namespace reader { ...@@ -20,43 +23,53 @@ namespace reader {
class ShuffleReader : public framework::DecoratedReader { class ShuffleReader : public framework::DecoratedReader {
public: public:
ShuffleReader(ReaderBase* reader, int buffer_size) ShuffleReader(ReaderBase* reader, size_t buffer_size, size_t seed = 0)
: DecoratedReader(reader), buffer_size_(buffer_size), iteration_pos_(0) { : DecoratedReader(reader), buffer_size_(buffer_size), seed_(seed) {
buffer_.reserve(buffer_size); VLOG(10) << "Create shuffle reader of " << reader_;
if (seed_ == 0) {
std::random_device device;
seed_ = device();
}
ReadIntoBuffers();
} }
void ReadNext(std::vector<framework::LoDTensor>* out) override; void ReadNext(std::vector<framework::LoDTensor>* out) override {
if (iteration_pos_ >= buffer_.size()) {
VLOG(10) << "Resetting shuffle buffer";
ReadIntoBuffers();
}
*out = buffer_[iteration_pos_++];
}
private: bool HasNext() const override {
int buffer_size_; return iteration_pos_ < buffer_.size() || reader_->HasNext();
std::vector<std::vector<framework::LoDTensor>> buffer_; }
size_t iteration_pos_;
};
void ShuffleReader::ReadNext(std::vector<framework::LoDTensor>* out) { private:
if (iteration_pos_ >= buffer_.size()) { void ReadIntoBuffers() {
// Reload buffer with new data
buffer_.clear(); buffer_.clear();
buffer_.reserve(buffer_size_); buffer_.reserve(buffer_size_);
for (int i = 0; i < buffer_size_; ++i) { iteration_pos_ = 0;
buffer_.push_back(std::vector<framework::LoDTensor>()); PADDLE_ENFORCE(reader_->HasNext());
reader_->ReadNext(&buffer_.back()); for (size_t i = 0; i < buffer_size_; ++i) {
if (buffer_.back().empty()) { if (!reader_->HasNext()) {
buffer_.pop_back();
break; break;
} }
buffer_.emplace_back();
reader_->ReadNext(&buffer_.back());
} }
// TODO(fengjiayi): 'std::random_shuffle' can be very slow. It needs to be std::mt19937 g(seed_);
// optimized. std::shuffle(buffer_.begin(), buffer_.end(), g);
std::random_shuffle(buffer_.begin(), buffer_.end()); seed_ = g(); // update seed_;
iteration_pos_ = 0; VLOG(10) << "random buffer size = " << buffer_.size();
} }
out->clear();
if (!buffer_.empty()) { size_t buffer_size_;
std::swap(*out, buffer_[iteration_pos_++]); std::vector<std::vector<framework::LoDTensor>> buffer_;
}
// if buffer_ is empty, the 'out' will return as an empty vector. size_t iteration_pos_;
} size_t seed_;
};
class CreateShuffleReaderOp : public framework::OperatorBase { class CreateShuffleReaderOp : public framework::OperatorBase {
public: public:
...@@ -67,10 +80,10 @@ class CreateShuffleReaderOp : public framework::OperatorBase { ...@@ -67,10 +80,10 @@ class CreateShuffleReaderOp : public framework::OperatorBase {
const platform::Place& dev_place) const override { const platform::Place& dev_place) const override {
const auto& underlying_reader = scope.FindVar(Input("UnderlyingReader")) const auto& underlying_reader = scope.FindVar(Input("UnderlyingReader"))
->Get<framework::ReaderHolder>(); ->Get<framework::ReaderHolder>();
auto* out = scope.FindVar(Output("Out")) auto& var = detail::Ref(scope.FindVar(Output("Out")));
->template GetMutable<framework::ReaderHolder>(); var.GetMutable<framework::ReaderHolder>()->Reset(
out->Reset( new ShuffleReader(underlying_reader.Get(),
new ShuffleReader(underlying_reader.Get(), Attr<int>("buffer_size"))); static_cast<size_t>(Attr<int>("buffer_size"))));
} }
}; };
......
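The new `ShuffleReader` shuffles within windows of `buffer_size` items: it refills the buffer when exhausted, shuffles it with a generator seeded from `seed_`, then replaces `seed_` with a fresh value from that generator. A minimal Python sketch of the same windowed shuffle is given below. The class and its look-ahead helpers are hypothetical, Python's `random.Random` stands in for `std::mt19937`, and unlike the C++ constructor this sketch tolerates an empty source instead of failing an enforce check.

```python
import random

_END = object()  # marks an exhausted source


class ShuffleReader:
    """Shuffles items of `source` within consecutive windows."""

    def __init__(self, source, buffer_size, seed=0):
        self._source = iter(source)
        self._buffer_size = buffer_size
        # As in the diff, a seed of 0 means "pick one at random".
        self._seed = seed or random.SystemRandom().getrandbits(32)
        self._lookahead = next(self._source, _END)
        self._buffer = []
        self._pos = 0
        self._fill()

    def _source_has_next(self):
        return self._lookahead is not _END

    def _source_read(self):
        item = self._lookahead
        self._lookahead = next(self._source, _END)
        return item

    def _fill(self):
        self._buffer = []
        self._pos = 0
        while len(self._buffer) < self._buffer_size and self._source_has_next():
            self._buffer.append(self._source_read())
        rng = random.Random(self._seed)
        rng.shuffle(self._buffer)
        self._seed = rng.getrandbits(32)  # the next window uses a new seed

    def has_next(self):
        return self._pos < len(self._buffer) or self._source_has_next()

    def read_next(self):
        # Callers are expected to check has_next() first.
        if self._pos >= len(self._buffer):
            self._fill()
        item = self._buffer[self._pos]
        self._pos += 1
        return item
```

Items never leave their window, so a small `buffer_size` gives only local shuffling; a fixed non-zero seed makes the order reproducible.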
...@@ -31,6 +31,11 @@ std::vector<framework::DDim> RestoreShapes(const std::vector<int>& shape_concat, ...@@ -31,6 +31,11 @@ std::vector<framework::DDim> RestoreShapes(const std::vector<int>& shape_concat,
return res; return res;
} }
std::unordered_map<std::string, FileReaderCreator>& FileReaderRegistry() {
static std::unordered_map<std::string, FileReaderCreator> regs;
return regs;
}
FileReaderMakerBase::FileReaderMakerBase( FileReaderMakerBase::FileReaderMakerBase(
framework::OpProtoAndCheckerMaker::OpProto* op_proto, framework::OpProtoAndCheckerMaker::OpProto* op_proto,
framework::OpAttrChecker* op_checker) framework::OpAttrChecker* op_checker)
......
...@@ -21,6 +21,20 @@ namespace paddle { ...@@ -21,6 +21,20 @@ namespace paddle {
namespace operators { namespace operators {
namespace reader { namespace reader {
using FileReaderCreator = std::function<framework::ReaderBase*(
const std::string&, const std::vector<framework::DDim>&)>;
std::unordered_map<std::string, FileReaderCreator>& FileReaderRegistry();
template <typename Reader>
int RegisterFileReader(const std::string& filetype) {
FileReaderRegistry()[filetype] = [](
const std::string& fn, const std::vector<paddle::framework::DDim>& dim) {
return new Reader(fn, dim);
};
return 0;
}
extern std::vector<framework::DDim> RestoreShapes( extern std::vector<framework::DDim> RestoreShapes(
const std::vector<int>& shape_concat, const std::vector<int>& ranks); const std::vector<int>& shape_concat, const std::vector<int>& ranks);
...@@ -73,3 +87,15 @@ class DecoratedReaderMakerBase : public framework::OpProtoAndCheckerMaker { ...@@ -73,3 +87,15 @@ class DecoratedReaderMakerBase : public framework::OpProtoAndCheckerMaker {
paddle::operators::reader::DecoratedReaderInferShape, \ paddle::operators::reader::DecoratedReaderInferShape, \
paddle::framework::EmptyGradOpMaker, \ paddle::framework::EmptyGradOpMaker, \
paddle::operators::reader::DecoratedReaderInferVarType) paddle::operators::reader::DecoratedReaderInferVarType)
#define REGISTER_FILE_READER(_filetype, _reader) \
STATIC_ASSERT_GLOBAL_NAMESPACE( \
_reg_file_reader_##_filetype, \
"Must use REGISTER_FILE_READER in global namespace"); \
int TouchFileReader##_filetype() { return 0; } \
int _reg_file_reader_entry_##_filetype = \
paddle::operators::reader::RegisterFileReader<_reader>(#_filetype)
#define USE_FILE_READER(filetype) \
extern int TouchFileReader##filetype(); \
static int _use_##filetype = TouchFileReader##filetype()
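`FileReaderRegistry`, `RegisterFileReader` and the `REGISTER_FILE_READER` macro together form a registry keyed by file type, where each entry is a creator taking a filename and a list of dims. The Python sketch below shows the same pattern with a module-level dict and a class decorator. All names are hypothetical; in particular `create_file_reader` is an assumed lookup helper, since this diff shows only the registration side, and the `RecordIOFileReader` here is a stand-in that just stores its arguments.

```python
_FILE_READER_REGISTRY = {}


def file_reader_registry():
    """Return the process-wide registry (filetype -> creator callable)."""
    return _FILE_READER_REGISTRY


def register_file_reader(filetype):
    """Class decorator that records a creator for `filetype`."""
    def decorator(reader_cls):
        def creator(filename, dims):
            return reader_cls(filename, dims)
        file_reader_registry()[filetype] = creator
        return reader_cls
    return decorator


def create_file_reader(filetype, filename, dims):
    """Hypothetical lookup helper; the diff only shows registration."""
    creators = file_reader_registry()
    if filetype not in creators:
        raise KeyError("no file reader registered for %r" % filetype)
    return creators[filetype](filename, dims)


@register_file_reader("recordio")
class RecordIOFileReader:
    """Stand-in class; it stores its arguments and reads nothing."""

    def __init__(self, filename, dims):
        self.filename = filename
        self.dims = dims
```

Keeping the dict behind a function mirrors the function-local static in `FileReaderRegistry()`, which avoids depending on the initialization order of globals across translation units in C++.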
...@@ -173,6 +173,15 @@ class ReduceMinOpMaker : public ReduceOpMaker { ...@@ -173,6 +173,15 @@ class ReduceMinOpMaker : public ReduceOpMaker {
} }
}; };
class ReduceProdOpMaker : public ReduceOpMaker {
public:
ReduceProdOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: ReduceOpMaker(proto, op_checker) {
SetComment("ReduceProd", "product");
AddComment(comment_);
}
};
} // namespace operators } // namespace operators
} // namespace paddle } // namespace paddle
...@@ -190,6 +199,9 @@ REGISTER_OP(reduce_max, ops::ReduceOp, ops::ReduceMaxOpMaker, reduce_max_grad, ...@@ -190,6 +199,9 @@ REGISTER_OP(reduce_max, ops::ReduceOp, ops::ReduceMaxOpMaker, reduce_max_grad,
REGISTER_OP(reduce_min, ops::ReduceOp, ops::ReduceMinOpMaker, reduce_min_grad, REGISTER_OP(reduce_min, ops::ReduceOp, ops::ReduceMinOpMaker, reduce_min_grad,
ops::ReduceGradOp); ops::ReduceGradOp);
REGISTER_OP(reduce_prod, ops::ReduceOp, ops::ReduceProdOpMaker,
reduce_prod_grad, ops::ReduceGradOp);
#define REGISTER_REDUCE_CPU_KERNEL(reduce_type, functor, grad_functor) \ #define REGISTER_REDUCE_CPU_KERNEL(reduce_type, functor, grad_functor) \
REGISTER_OP_CPU_KERNEL(reduce_type, \ REGISTER_OP_CPU_KERNEL(reduce_type, \
ops::ReduceKernel<paddle::platform::CPUDeviceContext, \ ops::ReduceKernel<paddle::platform::CPUDeviceContext, \
......
...@@ -93,6 +93,22 @@ struct MaxOrMinGradFunctor { ...@@ -93,6 +93,22 @@ struct MaxOrMinGradFunctor {
} }
}; };
struct ProdFunctor {
template <typename DeviceContext, typename X, typename Y, typename Dim>
void operator()(const DeviceContext& place, X& x, Y& y, const Dim& dim) {
y.device(place) = x.prod(dim);
}
};
struct ProdGradFunctor {
template <typename DeviceContext, typename X, typename Y, typename DX,
typename DY, typename Dim>
void operator()(const DeviceContext& place, X& x, Y& y, DX& dx, DY& dy,
const Dim& dim, int size) {
dx.device(place) = dy.broadcast(dim) * y.broadcast(dim) * x.inverse();
}
};
template <typename DeviceContext, typename T, typename Functor> template <typename DeviceContext, typename T, typename Functor>
class ReduceKernel : public framework::OpKernel<T> { class ReduceKernel : public framework::OpKernel<T> {
public: public:
...@@ -254,4 +270,5 @@ class ReduceGradKernel : public framework::OpKernel<T> { ...@@ -254,4 +270,5 @@ class ReduceGradKernel : public framework::OpKernel<T> {
__macro(reduce_sum, SumFunctor, SumGradFunctor); \ __macro(reduce_sum, SumFunctor, SumGradFunctor); \
__macro(reduce_mean, MeanFunctor, MeanGradFunctor); \ __macro(reduce_mean, MeanFunctor, MeanGradFunctor); \
__macro(reduce_max, MaxFunctor, MaxOrMinGradFunctor); \ __macro(reduce_max, MaxFunctor, MaxOrMinGradFunctor); \
__macro(reduce_min, MinFunctor, MaxOrMinGradFunctor); __macro(reduce_min, MinFunctor, MaxOrMinGradFunctor); \
__macro(reduce_prod, ProdFunctor, ProdGradFunctor);
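`ProdGradFunctor` computes `dx = dy * y * x.inverse()`, which follows from the fact that for a product y = x_1 * x_2 * ... * x_n the partial derivative with respect to x_i is y / x_i. The sketch below checks this for a full reduction over a flat list in plain Python. The function names are hypothetical, and the sketch ignores the `dim` and broadcasting handling of the real kernel. Note that the formula divides by each input, so it is undefined when any input is zero.

```python
def reduce_prod(xs):
    """Product of all elements of a flat list."""
    result = 1.0
    for x in xs:
        result *= x
    return result


def reduce_prod_grad(xs, dy):
    """Gradient of reduce_prod w.r.t. each input: dy * y / x_i."""
    y = reduce_prod(xs)
    return [dy * y / x for x in xs]  # raises ZeroDivisionError if x == 0


def numeric_grad(xs, index, eps=1e-6):
    """Central finite difference, used only to sanity-check the formula."""
    up = list(xs)
    down = list(xs)
    up[index] += eps
    down[index] -= eps
    return (reduce_prod(up) - reduce_prod(down)) / (2 * eps)
```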
...@@ -88,6 +88,12 @@ class SendOp : public framework::OperatorBase { ...@@ -88,6 +88,12 @@ class SendOp : public framework::OperatorBase {
rpc_client->AsyncGetVariable(epmap[i], ctx, scope, outs[i]); rpc_client->AsyncGetVariable(epmap[i], ctx, scope, outs[i]);
} }
PADDLE_ENFORCE(rpc_client->Wait()); PADDLE_ENFORCE(rpc_client->Wait());
// tell pservers that the current trainer has called fetch
for (auto& ep : endpoints) {
VLOG(3) << "send fetch barrier, ep: " << ep;
rpc_client->AsyncSendFetchBarrier(ep);
}
PADDLE_ENFORCE(rpc_client->Wait());
} }
} }
}; };
......
...@@ -250,6 +250,8 @@ class DistributeTranspiler: ...@@ -250,6 +250,8 @@ class DistributeTranspiler:
def get_trainer_program(self): def get_trainer_program(self):
# remove optimize ops and add a send op to main_program # remove optimize ops and add a send op to main_program
self.program.global_block().delete_ops(self.optimize_ops) self.program.global_block().delete_ops(self.optimize_ops)
# FIXME(typhoonzero): serializing once fixes an error that occurs when cloning.
self.program.__str__()
return self.program return self.program
def get_pserver_program(self, endpoint): def get_pserver_program(self, endpoint):
...@@ -309,7 +311,8 @@ class DistributeTranspiler: ...@@ -309,7 +311,8 @@ class DistributeTranspiler:
for _, opt_op in enumerate(opt_op_on_pserver): for _, opt_op in enumerate(opt_op_on_pserver):
if ufind.is_connected(op, opt_op): if ufind.is_connected(op, opt_op):
if self._is_opt_op(op): if self._is_opt_op(op):
self._append_pserver_ops(optimize_block, op, endpoint) self._append_pserver_ops(optimize_block, op, endpoint,
default_main_program())
else: else:
self._append_pserver_non_opt_ops(optimize_block, op) self._append_pserver_non_opt_ops(optimize_block, op)
break break
...@@ -520,7 +523,8 @@ class DistributeTranspiler: ...@@ -520,7 +523,8 @@ class DistributeTranspiler:
orig_var_name = varname[:suff_idx] orig_var_name = varname[:suff_idx]
return orig_var_name return orig_var_name
def _append_pserver_ops(self, optimize_block, opt_op, endpoint): def _append_pserver_ops(self, optimize_block, opt_op, endpoint,
origin_program):
program = optimize_block.program program = optimize_block.program
pserver_block = program.global_block() pserver_block = program.global_block()
new_inputs = dict() new_inputs = dict()
...@@ -576,7 +580,17 @@ class DistributeTranspiler: ...@@ -576,7 +580,17 @@ class DistributeTranspiler:
elif key == "LearningRate": elif key == "LearningRate":
# learning rate variable has already been created by non-optimize op, # learning rate variable has already been created by non-optimize op,
# don't create it once again. # don't create it once again.
new_inputs[key] = pserver_block.vars[opt_op.input(key)[0]] lr_varname = opt_op.input(key)[0]
if pserver_block.vars.has_key(lr_varname):
new_inputs[key] = pserver_block.vars[opt_op.input(key)[0]]
else:
origin_var = origin_program.global_block().vars[lr_varname]
tmpvar = pserver_block.create_var(
name=origin_var.name,
persistable=origin_var.persistable,
dtype=origin_var.dtype,
shape=origin_var.shape)
new_inputs[key] = tmpvar
for key in opt_op.input_names: for key in opt_op.input_names:
new_shape = None new_shape = None
......
...@@ -21,7 +21,7 @@ from ..executor import global_scope ...@@ -21,7 +21,7 @@ from ..executor import global_scope
__all__ = [ __all__ = [
'data', 'BlockGuardServ', 'ListenAndServ', 'Send', 'open_recordio_file', 'data', 'BlockGuardServ', 'ListenAndServ', 'Send', 'open_recordio_file',
'read_file' 'read_file', 'create_shuffle_reader', 'create_double_buffer_reader'
] ]
...@@ -245,6 +245,8 @@ def monkey_patch_reader_methods(reader): ...@@ -245,6 +245,8 @@ def monkey_patch_reader_methods(reader):
reader.eof = eof reader.eof = eof
reader.reset = reset reader.reset = reset
reader.stop_gradient = True
reader.persistable = True
return reader return reader
...@@ -285,6 +287,33 @@ def open_recordio_file(filename, shapes, lod_levels, dtypes): ...@@ -285,6 +287,33 @@ def open_recordio_file(filename, shapes, lod_levels, dtypes):
startup_var) startup_var)
def __create_decorated_reader__(op_type, reader, attrs):
var_name = unique_name(op_type)
startup_blk = default_startup_program().current_block()
startup_var = startup_blk.create_var(name=var_name)
startup_blk.append_op(
type=op_type,
inputs={'UnderlyingReader': reader},
outputs={'Out': [startup_var]},
attrs=attrs)
startup_var.persistable = True
return _copy_reader_var_(default_main_program().current_block(),
startup_var)
def create_shuffle_reader(reader, buffer_size):
return __create_decorated_reader__('create_shuffle_reader', reader,
{'buffer_size': int(buffer_size)})
def create_double_buffer_reader(reader, place=None):
attrs = dict()
if place is not None:
attrs['place'] = str(place).upper()
return __create_decorated_reader__('create_double_buffer_reader', reader,
attrs)
def read_file(file_obj): def read_file(file_obj):
helper = LayerHelper('read_file') helper = LayerHelper('read_file')
out = [ out = [
......
...@@ -49,6 +49,7 @@ __all__ = [ ...@@ -49,6 +49,7 @@ __all__ = [
'reduce_mean', 'reduce_mean',
'reduce_max', 'reduce_max',
'reduce_min', 'reduce_min',
'reduce_prod',
'sequence_first_step', 'sequence_first_step',
'sequence_last_step', 'sequence_last_step',
'dropout', 'dropout',
...@@ -84,13 +85,12 @@ def fc(input, ...@@ -84,13 +85,12 @@ def fc(input,
**Fully Connected Layer** **Fully Connected Layer**
The fully connected layer can take multiple tensors as its inputs. It The fully connected layer can take multiple tensors as its inputs. It
creates a variable (one for each input tensor) called weights for each creates a variable called weights for each input tensor, which represents
input tensor, which represents a fully connected weight matrix from a fully connected weight matrix from each input unit to each output unit.
each input unit to each output unit. The fully connected layer The fully connected layer multiplies each input tensor with its corresponding
multiplies each input tensor with its corresponding weight to produce weight to produce an output Tensor. If multiple input tensors are given,
an output Tensor. If multiple input tensors are given, the results of the results of multiple multiplications will be summed up. If bias_attr is
multiple multiplications will be summed up. If bias_attr is not None, not None, a bias variable will be created and added to the output. Finally,
a biases variable will be created and added to the output. Finally,
if activation is not None, it will be applied to the output as well. if activation is not None, it will be applied to the output as well.
This process can be formulated as follows: This process can be formulated as follows:
...@@ -109,44 +109,27 @@ def fc(input, ...@@ -109,44 +109,27 @@ def fc(input,
* :math:`Out`: The output tensor. * :math:`Out`: The output tensor.
Args: Args:
input(Variable|list): The input tensor(s) to the fully connected layer. input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
size(int): The number of output units in the fully connected layer. the input tensor(s) is at least 2.
num_flatten_dims(int): The fc layer can accept an input tensor with more size(int): The number of output units in this layer.
than two dimensions. If this happens, the num_flatten_dims (int, default 1): The fc layer can accept an input tensor with more than
multidimensional tensor will first be flattened two dimensions. If this happens, the multidimensional tensor will first be flattened
into a 2-dimensional matrix. The parameter into a 2-dimensional matrix. The parameter `num_flatten_dims` determines how the input
`num_flatten_dims` determines how the input tensor tensor is flattened: the first `num_flatten_dims` (inclusive, index starts from 1)
is flattened: the first `num_flatten_dims` dimensions will be flattened to form the first dimension of the final matrix (height of
(inclusive, index starts from 1) dimensions will the matrix), and the rest `rank(X) - num_flatten_dims` dimensions are flattened to
be flattened to form the first dimension of the form the second dimension of the final matrix (width of the matrix). For example, suppose
final matrix (height of the matrix), and the rest `X` is a 5-dimensional tensor with a shape [2, 3, 4, 5, 6], and `num_flatten_dims` = 3.
`rank(X) - num_flatten_dims` dimensions are Then, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] = [24, 30].
flattened to form the second dimension of the param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
final matrix (width of the matrix). For example, parameters/weights of this layer.
suppose `X` is a 5-dimensional tensor with a shape bias_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for the bias
[2, 3, 4, 5, 6], and `num_flatten_dims` = 3. Then, of this layer. If it is set to None, no bias will be added to the output units.
the flattened matrix will have a shape act (str, default None): Activation to be applied to the output of this layer.
[2 x 3 x 4, 5 x 6] = [24, 30]. By default, name (str, default None): The name of this layer.
`num_flatten_dims` is set to 1.
param_attr(ParamAttr|list): The parameter attribute for learnable
parameters/weights of the fully connected
layer.
param_initializer(ParamAttr|list): The initializer used for the
weight/parameter. If set None,
XavierInitializer() will be used.
bias_attr(ParamAttr|list): The parameter attribute for the bias parameter
for this layer. If set None, no bias will be
added to the output units.
bias_initializer(ParamAttr|list): The initializer used for the bias.
If set None, then ConstantInitializer()
will be used.
act(str): Activation to be applied to the output of the fully connected
layer.
name(str): Name/alias of the fully connected layer.
Returns: Returns:
Variable: The output tensor variable. A tensor variable storing the transformation result.
Raises: Raises:
ValueError: If rank of the input tensor is less than 2. ValueError: If rank of the input tensor is less than 2.
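The flattening rule and the multiply, sum, bias, activation sequence described in the docstring above can be sketched in plain NumPy. This is only an illustrative reference under stated assumptions, not the Paddle implementation: `flatten_to_2d` and `fc_reference` are hypothetical helper names, and weights and bias are passed in explicitly rather than created from `param_attr` and `bias_attr`.

```python
import numpy as np


def flatten_to_2d(x, num_flatten_dims=1):
    """Hypothetical helper: collapse a tensor into a 2-D matrix.

    The first `num_flatten_dims` dimensions form the height, the
    remaining dimensions form the width.
    """
    if x.ndim < 2:
        raise ValueError("rank of the input tensor must be at least 2")
    if not 1 <= num_flatten_dims < x.ndim:
        raise ValueError("num_flatten_dims must be in [1, rank(x))")
    height = int(np.prod(x.shape[:num_flatten_dims]))
    width = int(np.prod(x.shape[num_flatten_dims:]))
    return x.reshape(height, width)


def fc_reference(inputs, weights, bias=None, act=None, num_flatten_dims=1):
    """Hypothetical reference: act(sum_i flatten(X_i) @ W_i + b)."""
    out = sum(
        flatten_to_2d(x, num_flatten_dims) @ w
        for x, w in zip(inputs, weights))
    if bias is not None:
        out = out + bias
    if act is not None:
        out = act(out)
    return out


# Shape [2, 3, 4, 5, 6] with num_flatten_dims=3 becomes [2*3*4, 5*6].
x = np.ones((2, 3, 4, 5, 6), dtype="float32")
flat = flatten_to_2d(x, num_flatten_dims=3)

# One input, 7 output units, so the weight matrix is [30, 7].
w = np.ones((30, 7), dtype="float32")
out = fc_reference([x], [w], bias=np.ones(7, dtype="float32"),
                   num_flatten_dims=3)
```

With the shapes from the docstring example, `flat` is a 24 x 30 matrix and `out` is 24 x 7.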
...@@ -2202,6 +2185,53 @@ def reduce_min(input, dim=None, keep_dim=False, name=None): ...@@ -2202,6 +2185,53 @@ def reduce_min(input, dim=None, keep_dim=False, name=None):
return out return out
def reduce_prod(input, dim=None, keep_dim=False, name=None):
"""
Computes the product of tensor elements over the given dimension.
Args:
input (Variable): The input variable which is a Tensor or LoDTensor.
dim (int|None): The dimension along which the product is performed. If
:attr:`None`, multiply all elements of :attr:`input` and return a
Tensor variable with a single element, otherwise must be in the
range :math:`[-rank(input), rank(input))`. If :math:`dim < 0`,
the dimension to reduce is :math:`rank + dim`.
keep_dim (bool|False): Whether to reserve the reduced dimension in the
output Tensor. The result tensor will have one fewer dimension
than the :attr:`input` unless :attr:`keep_dim` is true.
name(str|None): A name for this layer(optional). If set None, the
layer will be named automatically.
Returns:
Variable: The reduced Tensor variable.
Examples:
.. code-block:: python
# x is a Tensor variable with following elements:
# [[0.2, 0.3, 0.5, 0.9]
# [0.1, 0.2, 0.6, 0.7]]
# Each example is followed by the corresponding output tensor.
fluid.layers.reduce_prod(x) # [0.0002268]
fluid.layers.reduce_prod(x, dim=0) # [0.02, 0.06, 0.3, 0.63]
fluid.layers.reduce_prod(x, dim=-1) # [0.027, 0.0084]
fluid.layers.reduce_prod(x, dim=1,
keep_dim=True) # [[0.027], [0.0084]]
"""
helper = LayerHelper('reduce_prod', **locals())
out = helper.create_tmp_variable(dtype=helper.input_dtype())
helper.append_op(
type='reduce_prod',
inputs={'X': input},
outputs={'Out': out},
attrs={
'dim': dim if dim is not None else 0,
'keep_dim': keep_dim,
'reduce_all': dim is None
})
return out
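The values in the `reduce_prod` docstring examples can be cross-checked with NumPy. This is a sketch only: `reduce_prod_reference` is a hypothetical name, and it assumes the operator follows the semantics the docstring states (`dim=None` reduces over all elements and yields a single-element tensor).

```python
import numpy as np


def reduce_prod_reference(x, dim=None, keep_dim=False):
    """Hypothetical NumPy stand-in for the documented semantics."""
    # axis=None multiplies every element; atleast_1d turns the scalar
    # result into a single-element array, as the docstring describes.
    return np.atleast_1d(np.prod(x, axis=dim, keepdims=keep_dim))


x = np.array([[0.2, 0.3, 0.5, 0.9],
              [0.1, 0.2, 0.6, 0.7]])

all_elements = reduce_prod_reference(x)                # one element
per_column = reduce_prod_reference(x, dim=0)           # shape (4,)
per_row = reduce_prod_reference(x, dim=-1)             # shape (2,)
per_row_kept = reduce_prod_reference(x, dim=1, keep_dim=True)  # shape (2, 1)
```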
def split(input, num_or_sections, dim=-1, name=None): def split(input, num_or_sections, dim=-1, name=None):
""" """
Split the input tensor into multiple sub-tensors. Split the input tensor into multiple sub-tensors.
......
...@@ -36,6 +36,7 @@ def convert_reader_to_recordio_file( ...@@ -36,6 +36,7 @@ def convert_reader_to_recordio_file(
feed_order=None): feed_order=None):
if feed_order is None: if feed_order is None:
feed_order = feeder.feed_names feed_order = feeder.feed_names
counter = 0
with create_recordio_writer(filename, compressor, with create_recordio_writer(filename, compressor,
max_num_records) as writer: max_num_records) as writer:
for batch in reader_creator(): for batch in reader_creator():
...@@ -43,3 +44,5 @@ def convert_reader_to_recordio_file( ...@@ -43,3 +44,5 @@ def convert_reader_to_recordio_file(
for each in feed_order: for each in feed_order:
writer.append_tensor(res[each]) writer.append_tensor(res[each])
writer.complete_append_tensor() writer.complete_append_tensor()
counter += 1
return counter
...@@ -13,6 +13,7 @@ ...@@ -13,6 +13,7 @@
# limitations under the License. # limitations under the License.
import framework import framework
from . import core
__all__ = [ __all__ = [
'append_regularization_ops', 'append_regularization_ops',
...@@ -46,9 +47,9 @@ def append_regularization_ops(parameters_and_grads, regularization=None): ...@@ -46,9 +47,9 @@ def append_regularization_ops(parameters_and_grads, regularization=None):
regularization_term = None regularization_term = None
if param.regularizer is not None: if param.regularizer is not None:
# Add variable for regularization term in grad block # Add variable for regularization term in grad block
regularization_term = param.regularizer(param, grad.block) regularization_term = param.regularizer(param, grad, grad.block)
elif regularization is not None: elif regularization is not None:
regularization_term = regularization(param, grad.block) regularization_term = regularization(param, grad, grad.block)
# If no gradient or no regularization specified, # If no gradient or no regularization specified,
# then we don't need to do anything # then we don't need to do anything
...@@ -82,7 +83,7 @@ class WeightDecayRegularizer(object): ...@@ -82,7 +83,7 @@ class WeightDecayRegularizer(object):
def __init__(self): def __init__(self):
pass pass
def __call__(self, param, block): def __call__(self, param, grad, block):
"""Add corresponding weight decay operations to the network """Add corresponding weight decay operations to the network
""" """
raise NotImplementedError() raise NotImplementedError()
...@@ -102,7 +103,7 @@ class L2DecayRegularizer(WeightDecayRegularizer): ...@@ -102,7 +103,7 @@ class L2DecayRegularizer(WeightDecayRegularizer):
super(L2DecayRegularizer, self).__init__() super(L2DecayRegularizer, self).__init__()
self._regularization_coeff = regularization_coeff self._regularization_coeff = regularization_coeff
def __call__(self, param, block): def __call__(self, param, grad, block):
"""Add L2 weight decay ops to network """Add L2 weight decay ops to network
Adds L2 weight decay ops. Adds L2 weight decay ops.
...@@ -117,8 +118,23 @@ class L2DecayRegularizer(WeightDecayRegularizer): ...@@ -117,8 +118,23 @@ class L2DecayRegularizer(WeightDecayRegularizer):
""" """
assert isinstance(param, framework.Parameter) assert isinstance(param, framework.Parameter)
assert isinstance(block, framework.Block) assert isinstance(block, framework.Block)
decay = block.create_var( decay = block.create_var(
dtype="float32", shape=param.shape, lod_level=param.lod_level) dtype="float32", shape=param.shape, lod_level=param.lod_level)
if grad.type == core.VarDesc.VarType.SELECTED_ROWS:
decay = block.create_var(
dtype="float32",
shape=param.shape,
type=core.VarDesc.VarType.SELECTED_ROWS)
block.append_op(
type='lookup_table',
inputs={'W': param,
'Ids': grad},
outputs={'Out': decay},
attrs={'is_sparse': True})
param = decay
# Append Op to calculate decay # Append Op to calculate decay
block.append_op( block.append_op(
type='scale', type='scale',
...@@ -141,7 +157,7 @@ class L1DecayRegularizer(WeightDecayRegularizer): ...@@ -141,7 +157,7 @@ class L1DecayRegularizer(WeightDecayRegularizer):
super(L1DecayRegularizer, self).__init__() super(L1DecayRegularizer, self).__init__()
self._regularization_coeff = regularization_coeff self._regularization_coeff = regularization_coeff
def __call__(self, param, block): def __call__(self, param, grad, block):
"""Add L1 weight decay ops to network """Add L1 weight decay ops to network
Adds L1 weight decay ops. Adds L1 weight decay ops.
...@@ -158,6 +174,19 @@ class L1DecayRegularizer(WeightDecayRegularizer): ...@@ -158,6 +174,19 @@ class L1DecayRegularizer(WeightDecayRegularizer):
assert isinstance(block, framework.Block) assert isinstance(block, framework.Block)
decay = block.create_var( decay = block.create_var(
dtype="float32", shape=param.shape, lod_level=param.lod_level) dtype="float32", shape=param.shape, lod_level=param.lod_level)
if grad.type == core.VarDesc.VarType.SELECTED_ROWS:
decay = block.create_var(
dtype="float32",
shape=param.shape,
type=core.VarDesc.VarType.SELECTED_ROWS)
block.append_op(
type='lookup_table',
inputs={'W': param,
'Ids': grad},
outputs={'Out': decay},
attrs={'is_sparse': True})
# Append sign op # Append sign op
block.append_op( block.append_op(
type='sign', inputs={"X": param}, outputs={"Out": decay}) type='sign', inputs={"X": param}, outputs={"Out": decay})
......
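The decay terms these regularizers build can be illustrated numerically. The sketch below is an assumption-laden NumPy analogue, not the operator graph itself: it assumes the `scale` op yields `coeff * param` for L2, that the L1 term is `coeff * sign(param)`, and that for a `SELECTED_ROWS` gradient only the parameter rows named by the gradient are gathered (the role `lookup_table` plays in the diff). All function names are hypothetical.

```python
import numpy as np


def l2_decay_term(param, coeff):
    """Assumed L2 term: the gradient of 0.5 * coeff * ||param||^2."""
    return coeff * param


def l1_decay_term(param, coeff):
    """Assumed L1 term: the subgradient of coeff * ||param||_1."""
    return coeff * np.sign(param)


def sparse_decay_term(param, rows, coeff, term_fn):
    """Gather only the rows touched by a sparse gradient, then decay them."""
    selected = param[rows]  # analogue of the lookup_table gather
    return term_fn(selected, coeff)


param = np.array([[1.0, -2.0],
                  [0.0, 3.0],
                  [-4.0, 5.0]])
rows = np.array([0, 2])  # rows present in the sparse gradient

dense_l2 = l2_decay_term(param, 0.1)
dense_l1 = l1_decay_term(param, 0.1)
sparse_l2 = sparse_decay_term(param, rows, 0.1, l2_decay_term)
```

Under these assumptions the sparse term has one row per selected index, so rows the gradient never touched receive no decay.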
...@@ -181,7 +181,10 @@ def train_main(use_cuda, is_sparse, is_local=True): ...@@ -181,7 +181,10 @@ def train_main(use_cuda, is_sparse, is_local=True):
cost = pd.cross_entropy(input=rnn_out, label=label) cost = pd.cross_entropy(input=rnn_out, label=label)
avg_cost = pd.mean(cost) avg_cost = pd.mean(cost)
optimizer = fluid.optimizer.Adagrad(learning_rate=1e-4) optimizer = fluid.optimizer.Adagrad(
learning_rate=1e-4,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.1))
optimize_ops, params_grads = optimizer.minimize(avg_cost) optimize_ops, params_grads = optimizer.minimize(avg_cost)
train_data = paddle.batch( train_data = paddle.batch(
......
...@@ -13,9 +13,10 @@ ...@@ -13,9 +13,10 @@
# limitations under the License. # limitations under the License.
import unittest import unittest
import paddle.fluid as fluid import paddle.fluid as fluid
import paddle.v2.dataset.mnist as mnist
import paddle.v2 as paddle import paddle.v2 as paddle
import paddle.v2.dataset.mnist as mnist
class TestRecordIO(unittest.TestCase): class TestRecordIO(unittest.TestCase):
...@@ -31,10 +32,10 @@ class TestRecordIO(unittest.TestCase): ...@@ -31,10 +32,10 @@ class TestRecordIO(unittest.TestCase):
name='label', shape=[1], dtype='int64'), name='label', shape=[1], dtype='int64'),
], ],
place=fluid.CPUPlace()) place=fluid.CPUPlace())
fluid.recordio_writer.convert_reader_to_recordio_file( self.num_batches = fluid.recordio_writer.convert_reader_to_recordio_file(
'./mnist.recordio', reader, feeder) './mnist.recordio', reader, feeder)
def test_main(self): def test_main(self, decorator_callback=None):
# use new program # use new program
with fluid.program_guard(fluid.Program(), fluid.Program()): with fluid.program_guard(fluid.Program(), fluid.Program()):
data_file = fluid.layers.open_recordio_file( data_file = fluid.layers.open_recordio_file(
...@@ -42,6 +43,8 @@ class TestRecordIO(unittest.TestCase): ...@@ -42,6 +43,8 @@ class TestRecordIO(unittest.TestCase):
shapes=[[-1, 784], [-1, 1]], shapes=[[-1, 784], [-1, 1]],
lod_levels=[0, 0], lod_levels=[0, 0],
dtypes=['float32', 'int64']) dtypes=['float32', 'int64'])
if decorator_callback is not None:
data_file = decorator_callback(data_file)
img, label = fluid.layers.read_file(data_file) img, label = fluid.layers.read_file(data_file)
hidden = fluid.layers.fc(input=img, size=100, act='tanh') hidden = fluid.layers.fc(input=img, size=100, act='tanh')
...@@ -51,14 +54,28 @@ class TestRecordIO(unittest.TestCase): ...@@ -51,14 +54,28 @@ class TestRecordIO(unittest.TestCase):
fluid.optimizer.Adam(learning_rate=1e-3).minimize(avg_loss) fluid.optimizer.Adam(learning_rate=1e-3).minimize(avg_loss)
exe = fluid.Executor(fluid.CPUPlace()) if fluid.core.is_compiled_with_cuda():
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program()) exe.run(fluid.default_startup_program())
avg_loss_np = [] avg_loss_np = []
# train a pass # train a pass
batch_id = 0
while not data_file.eof(): while not data_file.eof():
tmp, = exe.run(fetch_list=[avg_loss]) tmp, = exe.run(fetch_list=[avg_loss])
avg_loss_np.append(tmp) avg_loss_np.append(tmp)
batch_id += 1
data_file.reset() data_file.reset()
self.assertEqual(batch_id, self.num_batches)
self.assertLess(avg_loss_np[-1], avg_loss_np[0]) self.assertLess(avg_loss_np[-1], avg_loss_np[0])
def test_shuffle_reader(self):
self.test_main(decorator_callback=lambda reader: fluid.layers.create_shuffle_reader(reader, buffer_size=200))
def test_double_buffer_reader(self):
self.test_main(decorator_callback=lambda reader: fluid.layers.create_double_buffer_reader(reader,
place='cuda:0' if fluid.core.is_compiled_with_cuda() else 'cpu'))
...@@ -70,6 +70,19 @@ class TestMinOp(OpTest): ...@@ -70,6 +70,19 @@ class TestMinOp(OpTest):
self.check_output() self.check_output()
class TestProdOp(OpTest):
def setUp(self):
self.op_type = "reduce_prod"
self.inputs = {'X': np.random.random((5, 6, 10)).astype("float64")}
self.outputs = {'Out': self.inputs['X'].prod(axis=0)}
def test_check_output(self):
self.check_output()
def test_check_grad(self):
self.check_grad(['X'], 'Out')
class TestKeepDimReduce(OpTest): class TestKeepDimReduce(OpTest):
def setUp(self): def setUp(self):
self.op_type = "reduce_sum" self.op_type = "reduce_sum"
......