Commit 1f6e0448 authored by: T typhoonzero

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into prepare_pserver_executor

...@@ -36,11 +36,41 @@
- Trainer Count: 100
- Metrics: mini-batch / sec
| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
### Measure the Performance for Different PServer Count
...@@ -48,11 +78,41 @@
- Batch Size: 64
- Metrics: mini-batch / sec
| PServer Count | 10 | 20 | 40 | 60 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
### Measure Parallel Efficiency By Increasing Trainer Count
...@@ -67,11 +127,69 @@ The parallel efficiency is:
$E = \frac{S}{N}$
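where, with $T_1$ the throughput of a single trainer and $T_N$ the aggregate throughput of $N$ trainers, the usual definitions (assumed here, since the speedup definition is elided in this excerpt) are $S = T_N / T_1$ and $E = S / N$. For example, $T_1 = 16$ and $T_{10} = 120$ mini-batches/sec would give $S = 7.5$ and $E = 75\%$.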
| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |
## Reproduce the benchmark
......
...@@ -16,11 +16,41 @@ Setting environment variable: `MKL_NUM_THREADS=1`.
- Metrics: samples / sec
| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | 15.44 | 16.32 | 16.74 | 16.79 |
| PaddlePaddle v2 | 15.97 | 17.04 | 17.60 | 17.83 |
| TensorFlow | 9.09 | 9.10 | 9.24 | 8.66 |
### Different Batch Size
...@@ -28,12 +58,40 @@ Setting environment variable: `MKL_NUM_THREADS=1`.
- Trainer Count: 20
- Metrics: samples / sec
| Batch Size | 32 | 64 | 128 | 256 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | 190.20 | 222.15 | 247.40 | 258.18 |
| PaddlePaddle v2 | 170.96 | 233.71 | 256.14 | 329.23 |
| TensorFlow | - | - | - | - |
### Accelerate Rate
...@@ -41,11 +99,41 @@ Setting environment variable: `MKL_NUM_THREADS=1`.
- Batch Size: 128
- Metrics: samples / sec
| Trainer Count | 20 | 40 | 80 | 100 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | 263.29 (78.64%) | 518.80 (77.47%) | 836.26 (62.44%) | 1019.29 (60.89%) |
| PaddlePaddle v2 (need more tests) | 326.85 (92.85%) | 534.58 (75.93%) | 853.30 (60.60%) | 1041.99 (59.20%) |
| TensorFlow | - | - | - | - |
### Different Pserver Count
...@@ -53,11 +141,41 @@ Setting environment variable: `MKL_NUM_THREADS=1`.
- Batch Size: 128
- Metrics: samples / sec
| PServer Count | 3 | 6 | 10 | 20 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid (should fix in next PR) | 589.1 | 592.6 | 656.4 | 655.8 |
| PaddlePaddle v2 (need more tests) | 593.4 | 791.3 | 729.7 | 821.7 |
| TensorFlow | - | - | - | - |
*The performance gap between Fluid and v2 comes from network interference.*
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""seq2seq model for fluid."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import argparse
import time
import distutils.util
import paddle.v2 as paddle
import paddle.fluid as fluid
import paddle.fluid.core as core
import paddle.fluid.framework as framework
from paddle.fluid.executor import Executor
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--embedding_dim",
type=int,
default=512,
help="The dimension of embedding table. (default: %(default)d)")
parser.add_argument(
"--encoder_size",
type=int,
default=512,
help="The size of encoder bi-rnn unit. (default: %(default)d)")
parser.add_argument(
"--decoder_size",
type=int,
default=512,
help="The size of decoder rnn unit. (default: %(default)d)")
parser.add_argument(
"--batch_size",
type=int,
default=16,
help="The sequence number of a mini-batch data. (default: %(default)d)")
parser.add_argument(
"--dict_size",
type=int,
default=30000,
help="The dictionary capacity. Dictionaries of source sequence and "
"target dictionary have same capacity. (default: %(default)d)")
parser.add_argument(
"--pass_num",
type=int,
default=2,
help="The pass number to train. (default: %(default)d)")
parser.add_argument(
"--learning_rate",
type=float,
default=0.0002,
help="Learning rate used to train the model. (default: %(default)f)")
parser.add_argument(
"--infer_only", action='store_true', help="If set, run forward only.")
parser.add_argument(
"--beam_size",
type=int,
default=3,
help="The width for beam searching. (default: %(default)d)")
parser.add_argument(
"--use_gpu",
type=distutils.util.strtobool,
default=True,
help="Whether to use gpu. (default: %(default)d)")
parser.add_argument(
"--max_length",
type=int,
default=250,
help="The maximum length of sequence when doing generation. "
"(default: %(default)d)")
def lstm_step(x_t, hidden_t_prev, cell_t_prev, size):
    def linear(inputs):
        return fluid.layers.fc(input=inputs, size=size, bias_attr=True)

    forget_gate = fluid.layers.sigmoid(x=linear([hidden_t_prev, x_t]))
    input_gate = fluid.layers.sigmoid(x=linear([hidden_t_prev, x_t]))
    output_gate = fluid.layers.sigmoid(x=linear([hidden_t_prev, x_t]))
    cell_tilde = fluid.layers.tanh(x=linear([hidden_t_prev, x_t]))

    cell_t = fluid.layers.sums(input=[
        fluid.layers.elementwise_mul(
            x=forget_gate, y=cell_t_prev), fluid.layers.elementwise_mul(
                x=input_gate, y=cell_tilde)
    ])

    hidden_t = fluid.layers.elementwise_mul(
        x=output_gate, y=fluid.layers.tanh(x=cell_t))

    return hidden_t, cell_t
def seq_to_seq_net(embedding_dim, encoder_size, decoder_size, source_dict_dim,
target_dict_dim, is_generating, beam_size, max_length):
"""Construct a seq2seq network."""
def bi_lstm_encoder(input_seq, gate_size):
# Linear transformation part for input gate, output gate, forget gate
# and cell activation vectors need be done outside of dynamic_lstm.
# So the output size is 4 times of gate_size.
input_forward_proj = fluid.layers.fc(input=input_seq,
size=gate_size * 4,
act=None,
bias_attr=False)
forward, _ = fluid.layers.dynamic_lstm(
input=input_forward_proj, size=gate_size * 4, use_peepholes=False)
input_reversed_proj = fluid.layers.fc(input=input_seq,
size=gate_size * 4,
act=None,
bias_attr=False)
reversed, _ = fluid.layers.dynamic_lstm(
input=input_reversed_proj,
size=gate_size * 4,
is_reverse=True,
use_peepholes=False)
return forward, reversed
src_word_idx = fluid.layers.data(
name='source_sequence', shape=[1], dtype='int64', lod_level=1)
src_embedding = fluid.layers.embedding(
input=src_word_idx,
size=[source_dict_dim, embedding_dim],
dtype='float32')
src_forward, src_reversed = bi_lstm_encoder(
input_seq=src_embedding, gate_size=encoder_size)
encoded_vector = fluid.layers.concat(
input=[src_forward, src_reversed], axis=1)
encoded_proj = fluid.layers.fc(input=encoded_vector,
size=decoder_size,
bias_attr=False)
backward_first = fluid.layers.sequence_pool(
input=src_reversed, pool_type='first')
decoder_boot = fluid.layers.fc(input=backward_first,
size=decoder_size,
bias_attr=False,
act='tanh')
def lstm_decoder_with_attention(target_embedding, encoder_vec, encoder_proj,
decoder_boot, decoder_size):
def simple_attention(encoder_vec, encoder_proj, decoder_state):
decoder_state_proj = fluid.layers.fc(input=decoder_state,
size=decoder_size,
bias_attr=False)
decoder_state_expand = fluid.layers.sequence_expand(
x=decoder_state_proj, y=encoder_proj)
concated = fluid.layers.concat(
input=[encoder_proj, decoder_state_expand], axis=1)
attention_weights = fluid.layers.fc(input=concated,
size=1,
act='tanh',
bias_attr=False)
attention_weights = fluid.layers.sequence_softmax(
input=attention_weights)
weigths_reshape = fluid.layers.reshape(
x=attention_weights, shape=[-1])
scaled = fluid.layers.elementwise_mul(
x=encoder_vec, y=weigths_reshape, axis=0)
context = fluid.layers.sequence_pool(input=scaled, pool_type='sum')
return context
rnn = fluid.layers.DynamicRNN()
cell_init = fluid.layers.fill_constant_batch_size_like(
input=decoder_boot,
value=0.0,
shape=[-1, decoder_size],
dtype='float32')
cell_init.stop_gradient = False
with rnn.block():
current_word = rnn.step_input(target_embedding)
encoder_vec = rnn.static_input(encoder_vec)
encoder_proj = rnn.static_input(encoder_proj)
hidden_mem = rnn.memory(init=decoder_boot, need_reorder=True)
cell_mem = rnn.memory(init=cell_init)
context = simple_attention(encoder_vec, encoder_proj, hidden_mem)
decoder_inputs = fluid.layers.concat(
input=[context, current_word], axis=1)
h, c = lstm_step(decoder_inputs, hidden_mem, cell_mem, decoder_size)
rnn.update_memory(hidden_mem, h)
rnn.update_memory(cell_mem, c)
out = fluid.layers.fc(input=h,
size=target_dict_dim,
bias_attr=True,
act='softmax')
rnn.output(out)
return rnn()
if not is_generating:
trg_word_idx = fluid.layers.data(
name='target_sequence', shape=[1], dtype='int64', lod_level=1)
trg_embedding = fluid.layers.embedding(
input=trg_word_idx,
size=[target_dict_dim, embedding_dim],
dtype='float32')
prediction = lstm_decoder_with_attention(trg_embedding, encoded_vector,
encoded_proj, decoder_boot,
decoder_size)
label = fluid.layers.data(
name='label_sequence', shape=[1], dtype='int64', lod_level=1)
cost = fluid.layers.cross_entropy(input=prediction, label=label)
avg_cost = fluid.layers.mean(x=cost)
feeding_list = ["source_sequence", "target_sequence", "label_sequence"]
return avg_cost, feeding_list
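# Pack a list of variable-length integer sequences into a single LoDTensor:
# the sequences are concatenated along axis 0, and the LoD offsets record
# where each sequence begins and ends.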
def to_lodtensor(data, place):
    seq_lens = [len(seq) for seq in data]
    cur_len = 0
    lod = [cur_len]
    for l in seq_lens:
        cur_len += l
        lod.append(cur_len)
    flattened_data = np.concatenate(data, axis=0).astype("int64")
    flattened_data = flattened_data.reshape([len(flattened_data), 1])
    lod_t = core.LoDTensor()
    lod_t.set(flattened_data, place)
    lod_t.set_lod([lod])
    return lod_t, lod[-1]


def lodtensor_to_ndarray(lod_tensor):
    dims = lod_tensor.get_dims()
    ndarray = np.zeros(shape=dims).astype('float32')
    for i in xrange(np.product(dims)):
        ndarray.ravel()[i] = lod_tensor.get_float_element(i)
    return ndarray
def train():
avg_cost, feeding_list = seq_to_seq_net(
args.embedding_dim,
args.encoder_size,
args.decoder_size,
args.dict_size,
args.dict_size,
False,
beam_size=args.beam_size,
max_length=args.max_length)
# clone from default main program
inference_program = fluid.default_main_program().clone()
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
optimizer.minimize(avg_cost)
fluid.memory_optimize(fluid.default_main_program())
train_batch_generator = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.train(args.dict_size), buf_size=1000),
batch_size=args.batch_size)
test_batch_generator = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.wmt14.test(args.dict_size), buf_size=1000),
batch_size=args.batch_size)
place = core.CUDAPlace(0) if args.use_gpu else core.CPUPlace()
exe = Executor(place)
exe.run(framework.default_startup_program())
def do_validation():
total_loss = 0.0
count = 0
for batch_id, data in enumerate(test_batch_generator()):
src_seq = to_lodtensor(map(lambda x: x[0], data), place)[0]
trg_seq = to_lodtensor(map(lambda x: x[1], data), place)[0]
lbl_seq = to_lodtensor(map(lambda x: x[2], data), place)[0]
fetch_outs = exe.run(inference_program,
feed={
feeding_list[0]: src_seq,
feeding_list[1]: trg_seq,
feeding_list[2]: lbl_seq
},
fetch_list=[avg_cost],
return_numpy=False)
total_loss += lodtensor_to_ndarray(fetch_outs[0])[0]
count += 1
return total_loss / count
for pass_id in xrange(args.pass_num):
pass_start_time = time.time()
words_seen = 0
for batch_id, data in enumerate(train_batch_generator()):
src_seq, word_num = to_lodtensor(map(lambda x: x[0], data), place)
words_seen += word_num
trg_seq, word_num = to_lodtensor(map(lambda x: x[1], data), place)
words_seen += word_num
lbl_seq, _ = to_lodtensor(map(lambda x: x[2], data), place)
fetch_outs = exe.run(framework.default_main_program(),
feed={
feeding_list[0]: src_seq,
feeding_list[1]: trg_seq,
feeding_list[2]: lbl_seq
},
fetch_list=[avg_cost])
avg_cost_val = np.array(fetch_outs[0])
print('pass_id=%d, batch_id=%d, train_loss: %f' %
(pass_id, batch_id, avg_cost_val))
pass_end_time = time.time()
test_loss = do_validation()
time_consumed = pass_end_time - pass_start_time
words_per_sec = words_seen / time_consumed
print("pass_id=%d, test_loss: %f, words/s: %f, sec/pass: %f" %
(pass_id, test_loss, words_per_sec, time_consumed))
def infer():
pass
if __name__ == '__main__':
args = parser.parse_args()
if args.infer_only:
infer()
else:
train()
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import argparse
import time
import cProfile
import paddle.v2 as paddle
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler
SEED = 1
DTYPE = "float32"
# random seed must set before configuring the network.
# fluid.default_startup_program().random_seed = SEED
def parse_args():
parser = argparse.ArgumentParser("mnist model benchmark.")
parser.add_argument(
'--batch_size', type=int, default=128, help='The minibatch size.')
parser.add_argument(
'--iterations', type=int, default=35, help='The number of minibatches.')
parser.add_argument(
'--pass_num', type=int, default=5, help='The number of passes.')
parser.add_argument(
'--device',
type=str,
default='GPU',
choices=['CPU', 'GPU'],
help='The device type.')
parser.add_argument(
'--infer_only', action='store_true', help='If set, run forward only.')
parser.add_argument(
'--use_cprof', action='store_true', help='If set, use cProfile.')
parser.add_argument(
'--use_nvprof',
action='store_true',
help='If set, use nvprof for CUDA.')
args = parser.parse_args()
return args
def print_arguments(args):
vars(args)['use_nvprof'] = (vars(args)['use_nvprof'] and
vars(args)['device'] == 'GPU')
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).iteritems()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
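# A small LeNet-style CNN: two conv + pooling blocks followed by a softmax
# fully-connected classifier.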
def cnn_model(data):
conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=data,
filter_size=5,
num_filters=20,
pool_size=2,
pool_stride=2,
act="relu")
conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
pool_size=2,
pool_stride=2,
act="relu")
# TODO(dzhwinter) : refine the initializer and random seed settting
SIZE = 10
input_shape = conv_pool_2.shape
param_shape = [reduce(lambda a, b: a * b, input_shape[1:], 1)] + [SIZE]
scale = (2.0 / (param_shape[0]**2 * SIZE))**0.5
predict = fluid.layers.fc(
input=conv_pool_2,
size=SIZE,
act="softmax",
param_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.NormalInitializer(
loc=0.0, scale=scale)))
return predict
def eval_test(exe, batch_acc, batch_size_tensor, inference_program):
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=args.batch_size)
test_pass_acc = fluid.average.WeightedAverage()
for batch_id, data in enumerate(test_reader()):
img_data = np.array(map(lambda x: x[0].reshape([1, 28, 28]),
data)).astype(DTYPE)
y_data = np.array(map(lambda x: x[1], data)).astype("int64")
y_data = y_data.reshape([len(y_data), 1])
acc, weight = exe.run(inference_program,
feed={"pixel": img_data,
"label": y_data},
fetch_list=[batch_acc, batch_size_tensor])
test_pass_acc.add(value=acc, weight=weight)
pass_acc = test_pass_acc.eval()
return pass_acc
def run_benchmark(model, args):
if args.use_cprof:
pr = cProfile.Profile()
pr.enable()
start_time = time.time()
# Input data
images = fluid.layers.data(name='pixel', shape=[1, 28, 28], dtype=DTYPE)
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
# Train program
predict = model(images)
cost = fluid.layers.cross_entropy(input=predict, label=label)
avg_cost = fluid.layers.mean(x=cost)
# Evaluator
batch_size_tensor = fluid.layers.create_tensor(dtype='int64')
batch_acc = fluid.layers.accuracy(
input=predict, label=label, total=batch_size_tensor)
# inference program
inference_program = fluid.default_main_program().clone()
with fluid.program_guard(inference_program):
inference_program = fluid.io.get_inference_program(
target_vars=[batch_acc, batch_size_tensor])
# Optimization
opt = fluid.optimizer.AdamOptimizer(
learning_rate=0.001, beta1=0.9, beta2=0.999)
opt.minimize(avg_cost)
fluid.memory_optimize(fluid.default_main_program())
# Initialize executor
place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
exe = fluid.Executor(place)
# Parameter initialization
exe.run(fluid.default_startup_program())
# Reader
train_reader = paddle.batch(
paddle.dataset.mnist.train(), batch_size=args.batch_size)
accuracy = fluid.average.WeightedAverage()
for pass_id in range(args.pass_num):
accuracy.reset()
pass_start = time.time()
for batch_id, data in enumerate(train_reader()):
img_data = np.array(
map(lambda x: x[0].reshape([1, 28, 28]), data)).astype(DTYPE)
y_data = np.array(map(lambda x: x[1], data)).astype("int64")
y_data = y_data.reshape([len(y_data), 1])
start = time.time()
outs = exe.run(
fluid.default_main_program(),
feed={"pixel": img_data,
"label": y_data},
fetch_list=[avg_cost, batch_acc, batch_size_tensor]
) # The accuracy is the accumulation of batches, but not the current batch.
accuracy.add(value=outs[1], weight=outs[2])
end = time.time()
loss = np.array(outs[0])
acc = np.array(outs[1])
print("pass=%d, batch=%d, loss=%f, error=%f, elapse=%f" %
(pass_id, batch_id, loss, 1 - acc, end - start))
pass_end = time.time()
train_avg_acc = accuracy.eval()
test_avg_acc = eval_test(exe, batch_acc, batch_size_tensor,
inference_program)
print("pass=%d, train_avg_acc=%f, test_avg_acc=%f, elapse=%f" %
(pass_id, train_avg_acc, test_avg_acc,
pass_end - pass_start))
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
if args.use_nvprof and args.device == 'GPU':
with profiler.cuda_profiler("cuda_profiler.txt", 'csv') as nvprof:
run_benchmark(cnn_model, args)
else:
run_benchmark(cnn_model, args)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import functools
import numpy as np
import time
import cProfile, pstats, StringIO
import paddle.v2 as paddle
import paddle.fluid as fluid
import paddle.fluid.core as core
import paddle.fluid.profiler as profiler
def parse_args():
parser = argparse.ArgumentParser('Convolution model benchmark.')
parser.add_argument(
'--model',
type=str,
choices=['resnet_imagenet', 'resnet_cifar10'],
default='resnet_imagenet',
help='The model architecture.')
parser.add_argument(
'--batch_size', type=int, default=32, help='The minibatch size.')
parser.add_argument(
'--use_fake_data',
action='store_true',
help='use real data or fake data')
parser.add_argument(
'--skip_batch_num',
type=int,
default=5,
help='The first num of minibatch num to skip, for better performance test'
)
parser.add_argument(
'--iterations', type=int, default=80, help='The number of minibatches.')
parser.add_argument(
'--pass_num', type=int, default=100, help='The number of passes.')
parser.add_argument(
'--data_format',
type=str,
default='NCHW',
choices=['NCHW', 'NHWC'],
help='The data data_format, now only support NCHW.')
parser.add_argument(
'--device',
type=str,
default='GPU',
choices=['CPU', 'GPU'],
help='The device type.')
parser.add_argument(
'--data_set',
type=str,
default='flowers',
choices=['cifar10', 'flowers'],
help='Optional dataset for benchmark.')
parser.add_argument(
'--infer_only', action='store_true', help='If set, run forward only.')
parser.add_argument(
'--use_cprof', action='store_true', help='If set, use cProfile.')
parser.add_argument(
'--use_nvprof',
action='store_true',
help='If set, use nvprof for CUDA.')
parser.add_argument(
'--with_test',
action='store_true',
help='If set, test the testset during training.')
args = parser.parse_args()
return args
def print_arguments(args):
vars(args)['use_nvprof'] = (vars(args)['use_nvprof'] and
vars(args)['device'] == 'GPU')
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).iteritems()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
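# Convolution followed by batch normalization; the building block shared by
# all ResNet variants below.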
def conv_bn_layer(input, ch_out, filter_size, stride, padding, act='relu'):
conv1 = fluid.layers.conv2d(
input=input,
filter_size=filter_size,
num_filters=ch_out,
stride=stride,
padding=padding,
act=None,
bias_attr=False)
return fluid.layers.batch_norm(input=conv1, act=act)
def shortcut(input, ch_out, stride):
    ch_in = input.shape[1] if args.data_format == 'NCHW' else input.shape[-1]
    if ch_in != ch_out:
        return conv_bn_layer(input, ch_out, 1, stride, 0, None)
    else:
        return input


def basicblock(input, ch_out, stride):
    short = shortcut(input, ch_out, stride)
    conv1 = conv_bn_layer(input, ch_out, 3, stride, 1)
    conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1, act=None)
    return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')


def bottleneck(input, ch_out, stride):
    short = shortcut(input, ch_out * 4, stride)
    conv1 = conv_bn_layer(input, ch_out, 1, stride, 0)
    conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1)
    conv3 = conv_bn_layer(conv2, ch_out * 4, 1, 1, 0, act=None)
    return fluid.layers.elementwise_add(x=short, y=conv3, act='relu')


def layer_warp(block_func, input, ch_out, count, stride):
    res_out = block_func(input, ch_out, stride)
    for i in range(1, count):
        res_out = block_func(res_out, ch_out, 1)
    return res_out
def resnet_imagenet(input, class_dim, depth=50, data_format='NCHW'):
cfg = {
18: ([2, 2, 2, 1], basicblock),
34: ([3, 4, 6, 3], basicblock),
50: ([3, 4, 6, 3], bottleneck),
101: ([3, 4, 23, 3], bottleneck),
152: ([3, 8, 36, 3], bottleneck)
}
stages, block_func = cfg[depth]
conv1 = conv_bn_layer(input, ch_out=64, filter_size=7, stride=2, padding=3)
pool1 = fluid.layers.pool2d(
input=conv1, pool_type='avg', pool_size=3, pool_stride=2)
res1 = layer_warp(block_func, pool1, 64, stages[0], 1)
res2 = layer_warp(block_func, res1, 128, stages[1], 2)
res3 = layer_warp(block_func, res2, 256, stages[2], 2)
res4 = layer_warp(block_func, res3, 512, stages[3], 2)
pool2 = fluid.layers.pool2d(
input=res4,
pool_size=7,
pool_type='avg',
pool_stride=1,
global_pooling=True)
out = fluid.layers.fc(input=pool2, size=class_dim, act='softmax')
return out
def resnet_cifar10(input, class_dim, depth=32, data_format='NCHW'):
assert (depth - 2) % 6 == 0
n = (depth - 2) // 6
conv1 = conv_bn_layer(
input=input, ch_out=16, filter_size=3, stride=1, padding=1)
res1 = layer_warp(basicblock, conv1, 16, n, 1)
res2 = layer_warp(basicblock, res1, 32, n, 2)
res3 = layer_warp(basicblock, res2, 64, n, 2)
pool = fluid.layers.pool2d(
input=res3, pool_size=8, pool_type='avg', pool_stride=1)
out = fluid.layers.fc(input=pool, size=class_dim, act='softmax')
return out
def run_benchmark(model, args):
if args.use_cprof:
pr = cProfile.Profile()
pr.enable()
if args.data_set == "cifar10":
class_dim = 10
if args.data_format == 'NCHW':
dshape = [3, 32, 32]
else:
dshape = [32, 32, 3]
else:
class_dim = 102
if args.data_format == 'NCHW':
dshape = [3, 224, 224]
else:
dshape = [224, 224, 3]
input = fluid.layers.data(name='data', shape=dshape, dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
predict = model(input, class_dim)
cost = fluid.layers.cross_entropy(input=predict, label=label)
avg_cost = fluid.layers.mean(x=cost)
batch_size_tensor = fluid.layers.create_tensor(dtype='int64')
batch_acc = fluid.layers.accuracy(
input=predict, label=label, total=batch_size_tensor)
inference_program = fluid.default_main_program().clone()
with fluid.program_guard(inference_program):
inference_program = fluid.io.get_inference_program(
target_vars=[batch_acc, batch_size_tensor])
optimizer = fluid.optimizer.Momentum(learning_rate=0.01, momentum=0.9)
opts = optimizer.minimize(avg_cost)
fluid.memory_optimize(fluid.default_main_program())
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10()
if args.data_set == 'cifar10' else paddle.dataset.flowers.train(),
buf_size=5120),
batch_size=args.batch_size)
test_reader = paddle.batch(
paddle.dataset.cifar.test10()
if args.data_set == 'cifar10' else paddle.dataset.flowers.test(),
batch_size=args.batch_size)
def test(exe):
test_accuracy = fluid.average.WeightedAverage()
for batch_id, data in enumerate(test_reader()):
img_data = np.array(map(lambda x: x[0].reshape(dshape),
data)).astype("float32")
y_data = np.array(map(lambda x: x[1], data)).astype("int64")
y_data = y_data.reshape([-1, 1])
acc, weight = exe.run(inference_program,
feed={"data": img_data,
"label": y_data},
fetch_list=[batch_acc, batch_size_tensor])
test_accuracy.add(value=acc, weight=weight)
return test_accuracy.eval()
place = core.CPUPlace() if args.device == 'CPU' else core.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
accuracy = fluid.average.WeightedAverage()
if args.use_fake_data:
data = train_reader().next()
image = np.array(map(lambda x: x[0].reshape(dshape), data)).astype(
'float32')
label = np.array(map(lambda x: x[1], data)).astype('int64')
label = label.reshape([-1, 1])
iters, num_samples, start_time = 0, 0, time.time()
for pass_id in range(args.pass_num):
accuracy.reset()
train_accs = []
train_losses = []
for batch_id, data in enumerate(train_reader()):
if iters == args.skip_batch_num:
start_time = time.time()
num_samples = 0
if iters == args.iterations:
break
if not args.use_fake_data:
image = np.array(map(lambda x: x[0].reshape(dshape),
data)).astype('float32')
label = np.array(map(lambda x: x[1], data)).astype('int64')
label = label.reshape([-1, 1])
loss, acc, weight = exe.run(
fluid.default_main_program(),
feed={'data': image,
'label': label},
fetch_list=[avg_cost, batch_acc, batch_size_tensor])
iters += 1
num_samples += label[0]
accuracy.add(value=acc, weight=weight)
train_losses.append(loss)
train_accs.append(acc)
print("Pass: %d, Iter: %d, Loss: %f, Accuracy: %f" %
(pass_id, iters, loss, acc))
pass_train_acc = accuracy.eval()
# evaluation
if args.with_test:
pass_test_acc = test(exe)
train_elapsed = time.time() - start_time
print("Pass: %d, Loss: %f, Train Accuracy: %f\n" %
(pass_id, np.mean(train_losses), np.mean(train_accs)))
examples_per_sec = num_samples / train_elapsed
print('\nTotal examples: %d, total time: %.5f, %.5f examples/sec\n' %
(num_samples, train_elapsed, examples_per_sec))
if args.use_cprof:
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())
if __name__ == '__main__':
model_map = {
'resnet_imagenet': resnet_imagenet,
'resnet_cifar10': resnet_cifar10
}
args = parse_args()
print_arguments(args)
if args.data_format == 'NHWC':
raise ValueError('Only support NCHW data_format now.')
if args.use_nvprof and args.device == 'GPU':
with profiler.cuda_profiler("cuda_profiler.txt", 'csv') as nvprof:
run_benchmark(model_map[args.model], args)
else:
run_benchmark(model_map[args.model], args)
#!/bin/bash
# This script benchmarking the PaddlePaddle Fluid on
# single thread single GPU.
export CUDNN_PATH=/paddle/cudnn_v5/cuda/lib
# disable openmp and mkl parallel
#https://github.com/PaddlePaddle/Paddle/issues/7199
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1
ht=`lscpu |grep "per core"|awk -F':' '{print $2}'|xargs`
if [ $ht -eq 1 ]; then # HT is OFF
if [ -z "$KMP_AFFINITY" ]; then
export KMP_AFFINITY="granularity=fine,compact,0,0"
fi
if [ -z "$OMP_DYNAMIC" ]; then
export OMP_DYNAMIC="FALSE"
fi
else # HT is ON
if [ -z "$KMP_AFFINITY" ]; then
export KMP_AFFINITY="granularity=fine,compact,1,0"
fi
fi
# disable multi-gpu if have more than one
export CUDA_VISIBLE_DEVICES=0
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_PATH:$LD_LIBRARY_PATH
# vgg16
# cifar10 gpu cifar10 128
FLAGS_benchmark=true python fluid/vgg.py \
--device=GPU \
--batch_size=128 \
--skip_batch_num=5 \
--iterations=30 \
> vgg16_gpu_128.log 2>&1
# resnet50
# resnet50 gpu cifar10 128
FLAGS_benchmark=true python fluid/resnet.py \
--device=GPU \
--batch_size=128 \
--data_set=cifar10 \
--model=resnet_cifar10 \
--skip_batch_num=5 \
--iterations=30 \
> resnet50_gpu_128.log 2>&1
# lstm
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import cPickle
import os
import random
import time
import numpy
import paddle.v2 as paddle
import paddle.v2.dataset.imdb as imdb
import paddle.fluid as fluid
from paddle.v2 import batch
import paddle.fluid.profiler as profiler
def parse_args():
parser = argparse.ArgumentParser("Understand Sentiment by Dynamic RNN.")
parser.add_argument(
'--batch_size',
type=int,
default=32,
help='The sequence number of a batch data. (default: %(default)d)')
parser.add_argument(
'--emb_dim',
type=int,
default=512,
help='Dimension of embedding table. (default: %(default)d)')
parser.add_argument(
'--hidden_dim',
type=int,
default=512,
help='Hidden size of lstm unit. (default: %(default)d)')
parser.add_argument(
'--pass_num',
type=int,
default=100,
help='Epoch number to train. (default: %(default)d)')
parser.add_argument(
'--device',
type=str,
default='CPU',
choices=['CPU', 'GPU'],
help='The device type.')
parser.add_argument(
'--crop_size',
type=int,
default=int(os.environ.get('CROP_SIZE', '1500')),
help='The max sentence length of input. Since this model use plain RNN,'
' Gradient could be explored if sentence is too long')
args = parser.parse_args()
return args
word_dict = imdb.word_dict()
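# Keep only sentences with fewer than crop_size in-vocabulary words, since
# very long sequences can make the plain RNN's gradients blow up.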
def crop_sentence(reader, crop_size):
    unk_value = word_dict['<unk>']

    def __impl__():
        for item in reader():
            if len([x for x in item[0] if x != unk_value]) < crop_size:
                yield item

    return __impl__
def main():
args = parse_args()
lstm_size = args.hidden_dim
data = fluid.layers.data(
name="words", shape=[1], lod_level=1, dtype='int64')
sentence = fluid.layers.embedding(
input=data, size=[len(word_dict), args.emb_dim])
sentence = fluid.layers.fc(input=sentence, size=lstm_size, act='tanh')
rnn = fluid.layers.DynamicRNN()
with rnn.block():
word = rnn.step_input(sentence)
prev_hidden = rnn.memory(value=0.0, shape=[lstm_size])
prev_cell = rnn.memory(value=0.0, shape=[lstm_size])
def gate_common(
ipt,
hidden,
size, ):
gate0 = fluid.layers.fc(input=ipt, size=size, bias_attr=True)
gate1 = fluid.layers.fc(input=hidden, size=size, bias_attr=False)
gate = fluid.layers.sums(input=[gate0, gate1])
return gate
forget_gate = fluid.layers.sigmoid(
x=gate_common(word, prev_hidden, lstm_size))
input_gate = fluid.layers.sigmoid(
x=gate_common(word, prev_hidden, lstm_size))
output_gate = fluid.layers.sigmoid(
x=gate_common(word, prev_hidden, lstm_size))
cell_gate = fluid.layers.tanh(
x=gate_common(word, prev_hidden, lstm_size))
cell = fluid.layers.sums(input=[
fluid.layers.elementwise_mul(
x=forget_gate, y=prev_cell), fluid.layers.elementwise_mul(
x=input_gate, y=cell_gate)
])
hidden = fluid.layers.elementwise_mul(
x=output_gate, y=fluid.layers.tanh(x=cell))
rnn.update_memory(prev_cell, cell)
rnn.update_memory(prev_hidden, hidden)
rnn.output(hidden)
last = fluid.layers.sequence_pool(rnn(), 'last')
logit = fluid.layers.fc(input=last, size=2, act='softmax')
loss = fluid.layers.cross_entropy(
input=logit,
label=fluid.layers.data(
name='label', shape=[1], dtype='int64'))
loss = fluid.layers.mean(x=loss)
# add acc
batch_size_tensor = fluid.layers.create_tensor(dtype='int64')
batch_acc = fluid.layers.accuracy(input=logit, label=fluid.layers.data(name='label', \
shape=[1], dtype='int64'), total=batch_size_tensor)
inference_program = fluid.default_main_program().clone()
with fluid.program_guard(inference_program):
inference_program = fluid.io.get_inference_program(
target_vars=[batch_acc, batch_size_tensor])
adam = fluid.optimizer.Adam()
adam.minimize(loss)
fluid.memory_optimize(fluid.default_main_program())
place = fluid.CPUPlace() if args.device == 'CPU' else fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
def train_loop(pass_num, crop_size):
with profiler.profiler(args.device, 'total') as prof:
for pass_id in range(pass_num):
train_reader = batch(
paddle.reader.shuffle(
crop_sentence(imdb.train(word_dict), crop_size),
buf_size=25000),
batch_size=args.batch_size)
word_nums = 0
pass_start_time = time.time()
for batch_id, data in enumerate(train_reader()):
tensor_words = to_lodtensor([x[0] for x in data], place)
for x in data:
word_nums += len(x[0])
label = numpy.array([x[1] for x in data]).astype("int64")
label = label.reshape((-1, 1))
loss_np, acc, weight = exe.run(
fluid.default_main_program(),
feed={"words": tensor_words,
"label": label},
fetch_list=[loss, batch_acc, batch_size_tensor])
print("pass_id=%d, batch_id=%d, loss=%f, acc=%f" %
(pass_id, batch_id, loss_np, acc))
pass_end_time = time.time()
time_consumed = pass_end_time - pass_start_time
words_per_sec = word_nums / time_consumed
print("pass_id=%d, sec/pass: %f, words/s: %f" %
(pass_id, time_consumed, words_per_sec))
train_loop(args.pass_num, args.crop_size)
def to_lodtensor(data, place):
    seq_lens = [len(seq) for seq in data]
    cur_len = 0
    lod = [cur_len]
    for l in seq_lens:
        cur_len += l
        lod.append(cur_len)
    flattened_data = numpy.concatenate(data, axis=0).astype("int64")
    flattened_data = flattened_data.reshape([len(flattened_data), 1])
    res = fluid.LoDTensor()
    res.set(flattened_data, place)
    res.set_lod([lod])
    return res
if __name__ == '__main__':
main()
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""VGG16 benchmark in Fluid"""
from __future__ import print_function
import sys
import time
import numpy as np
import paddle.v2 as paddle
import paddle.fluid as fluid
import paddle.fluid.core as core
import argparse
import functools
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
'--batch_size', type=int, default=128, help="Batch size for training.")
parser.add_argument(
'--skip_batch_num',
type=int,
default=5,
help='The first num of minibatch num to skip, for better performance test')
parser.add_argument(
'--iterations', type=int, default=80, help='The number of minibatches.')
parser.add_argument(
'--learning_rate',
type=float,
default=1e-3,
help="Learning rate for training.")
parser.add_argument('--pass_num', type=int, default=50, help="No. of passes.")
parser.add_argument(
'--device',
type=str,
default='GPU',
choices=['CPU', 'GPU'],
help="The device type.")
parser.add_argument(
'--data_format',
type=str,
default='NCHW',
choices=['NCHW', 'NHWC'],
help='The data order, now only support NCHW.')
parser.add_argument(
'--data_set',
type=str,
default='cifar10',
choices=['cifar10', 'flowers'],
help='Optional dataset for benchmark.')
parser.add_argument(
'--with_test',
action='store_true',
help='If set, test the testset during training.')
args = parser.parse_args()
def vgg16_bn_drop(input):
def conv_block(input, num_filter, groups, dropouts):
return fluid.nets.img_conv_group(
input=input,
pool_size=2,
pool_stride=2,
conv_num_filter=[num_filter] * groups,
conv_filter_size=3,
conv_act='relu',
conv_with_batchnorm=True,
conv_batchnorm_drop_rate=dropouts,
pool_type='max')
conv1 = conv_block(input, 64, 2, [0.3, 0])
conv2 = conv_block(conv1, 128, 2, [0.4, 0])
conv3 = conv_block(conv2, 256, 3, [0.4, 0.4, 0])
conv4 = conv_block(conv3, 512, 3, [0.4, 0.4, 0])
conv5 = conv_block(conv4, 512, 3, [0.4, 0.4, 0])
drop = fluid.layers.dropout(x=conv5, dropout_prob=0.5)
fc1 = fluid.layers.fc(input=drop, size=512, act=None)
bn = fluid.layers.batch_norm(input=fc1, act='relu')
drop2 = fluid.layers.dropout(x=bn, dropout_prob=0.5)
fc2 = fluid.layers.fc(input=drop2, size=512, act=None)
return fc2
def main():
if args.data_set == "cifar10":
classdim = 10
if args.data_format == 'NCHW':
data_shape = [3, 32, 32]
else:
data_shape = [32, 32, 3]
else:
classdim = 102
if args.data_format == 'NCHW':
data_shape = [3, 224, 224]
else:
data_shape = [224, 224, 3]
# Input data
images = fluid.layers.data(name='pixel', shape=data_shape, dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
# Train program
net = vgg16_bn_drop(images)
predict = fluid.layers.fc(input=net, size=classdim, act='softmax')
cost = fluid.layers.cross_entropy(input=predict, label=label)
avg_cost = fluid.layers.mean(x=cost)
# Evaluator
batch_size_tensor = fluid.layers.create_tensor(dtype='int64')
batch_acc = fluid.layers.accuracy(
input=predict, label=label, total=batch_size_tensor)
# inference program
inference_program = fluid.default_main_program().clone()
with fluid.program_guard(inference_program):
inference_program = fluid.io.get_inference_program(
target_vars=[batch_acc, batch_size_tensor])
# Optimization
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
opts = optimizer.minimize(avg_cost)
fluid.memory_optimize(fluid.default_main_program())
# Initialize executor
place = core.CPUPlace() if args.device == 'CPU' else core.CUDAPlace(0)
exe = fluid.Executor(place)
# Parameter initialization
exe.run(fluid.default_startup_program())
# data reader
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10()
if args.data_set == 'cifar10' else paddle.dataset.flowers.train(),
buf_size=5120),
batch_size=args.batch_size)
test_reader = paddle.batch(
paddle.dataset.cifar.test10()
if args.data_set == 'cifar10' else paddle.dataset.flowers.test(),
batch_size=args.batch_size)
# test
def test(exe):
test_accuracy = fluid.average.WeightedAverage()
for batch_id, data in enumerate(test_reader()):
img_data = np.array(map(lambda x: x[0].reshape(data_shape),
data)).astype("float32")
y_data = np.array(map(lambda x: x[1], data)).astype("int64")
y_data = y_data.reshape([-1, 1])
acc, weight = exe.run(inference_program,
feed={"pixel": img_data,
"label": y_data},
fetch_list=[batch_acc, batch_size_tensor])
test_accuracy.add(value=acc, weight=weight)
return test_accuracy.eval()
iters, num_samples, start_time = 0, 0, time.time()
accuracy = fluid.average.WeightedAverage()
for pass_id in range(args.pass_num):
accuracy.reset()
train_accs = []
train_losses = []
for batch_id, data in enumerate(train_reader()):
if iters == args.skip_batch_num:
start_time = time.time()
num_samples = 0
if iters == args.iterations:
break
img_data = np.array(map(lambda x: x[0].reshape(data_shape),
data)).astype("float32")
y_data = np.array(map(lambda x: x[1], data)).astype("int64")
y_data = y_data.reshape([-1, 1])
loss, acc, weight = exe.run(
fluid.default_main_program(),
feed={"pixel": img_data,
"label": y_data},
fetch_list=[avg_cost, batch_acc, batch_size_tensor])
accuracy.add(value=acc, weight=weight)
iters += 1
num_samples += len(data)
print(
"Pass = %d, Iter = %d, Loss = %f, Accuracy = %f" %
(pass_id, iters, loss, acc)
) # The accuracy is the accumulation of batches, but not the current batch.
pass_train_acc = accuracy.eval()
train_losses.append(loss)
train_accs.append(acc)
# evaluation
if args.with_test:
pass_test_acc = test(exe)
train_elapsed = time.time() - start_time
print("Pass: %d, Loss: %f, Train Accuracy: %f\n" %
(pass_id, np.mean(train_losses), np.mean(train_accs)))
def print_arguments():
print('----------- Configuration Arguments -----------')
for arg, value in sorted(vars(args).iteritems()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
if __name__ == "__main__":
print_arguments()
main()
...@@ -494,6 +494,12 @@ reshape
.. autofunction:: paddle.fluid.layers.reshape
  :noindex:
pad
---
.. autofunction:: paddle.fluid.layers.pad
  :noindex:
scale
-----
......
...@@ -7,7 +7,7 @@ Polyak and Juditsky (1992) showed that the test performance of simple average of
Hence, to accelerate the speed of Stochastic Gradient Descent, Averaged Stochastic Gradient Descent (ASGD) was proposed in Polyak and Juditsky (1992). For ASGD, the running average of parameters obtained by SGD is used as the estimator for <img src="./images/theta_star.gif"/><br/> . The averaging is done as follows:
![](./images/asgd.gif)
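In text form (a hedged restatement of the image above, assuming the standard Polyak–Juditsky running average):

$$\bar{\theta}_t \;=\; \frac{1}{t}\sum_{i=1}^{t}\theta_i \;=\; \bar{\theta}_{t-1} + \frac{1}{t}\left(\theta_t - \bar{\theta}_{t-1}\right)$$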
We propose averaging for any optimizer similar to how ASGD performs it, as mentioned above.
......
...@@ -6,11 +6,33 @@ Here are some initial thoughts. Your comments are welcome!
I think we need only the following few CMake functions to make a project description mean and clean:
| C++ | CUDA C++ | Go |
|---|---|---|
| cc_library | nv_library | go_library |
| cc_binary | nv_binary | go_binary |
| cc_test | nv_test | go_test |
- The `_library` functions generate .a files from source code.
- The `_binary` functions generate executable binary files.
......
...@@ -14,11 +14,29 @@ In programming languages, a block is a pair of curly braces that includes local
Blocks work with control flow structures like `if`, `else`, and `for`, which have equivalents in deep learning:
| programming languages | PaddlePaddle |
|-----------------------|-----------------------|
| for, while loop | RNN, WhileOp |
| if, if-else, switch | IfElseOp, SwitchOp |
| sequential execution | a sequence of layers |
A key difference is that a C++ program describes a one-pass computation, whereas a deep learning program describes both the forward and backward passes.
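For illustration, a minimal Fluid sketch (using only layer and optimizer calls that also appear in the benchmark scripts of this commit; the shapes and learning rate are illustrative):

```python
import paddle.fluid as fluid

# Forward pass: each call below appends operators to the current block.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
pred = fluid.layers.fc(input=x, size=10, act='softmax')
cost = fluid.layers.cross_entropy(input=pred, label=label)
avg_cost = fluid.layers.mean(x=cost)

# Backward pass: minimize() appends the gradient and update operators to the
# same program, so a single program describes both passes.
fluid.optimizer.Adam(learning_rate=1e-3).minimize(avg_cost)
```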
...@@ -26,12 +44,33 @@ A key difference is that a C++ program describes a one pass computation, whereas
The existence of the backward pass makes the execution of a block of PaddlePaddle different from traditional programs:
| programming languages | PaddlePaddle |
|-----------------------|---------------------------------|
| stack | scope hierarchy |
| stack frame | scope |
| push at entering block | push at entering block |
| pop at leaving block | destroy when minibatch completes |
1. In traditional programs:
......
...@@ -86,12 +86,40 @@ def layer.fc(X):
We'd like to have Python bindings to operators in package `paddle.operator`, and Python compositions of operators in package `paddle.layer`. So we have the following concepts in the above illustrative example:
| C++ functions/functors | mul | add | | |
|------------------------|--------------|--------------|-------------|----------|
| C++ operator class | mulOp | addOp | FCOp | |
| Python binding | operator.mul | operator.add | operator.fc | |
| Python function | | | | layer.fc |
This is how we differentiate layers and operators in PaddlePaddle:
......
...@@ -2,12 +2,38 @@
Like other deep learning systems, PaddlePaddle supports training models from sequence data. Also, like other systems, PaddlePaddle represents a mini-batch of sequences as a Tensor. What is different is that PaddlePaddle doesn't require all sequences in a mini-batch to be of the same length, so there is no need for padding zeros.
| | TensorFlow | PaddlePaddle |
|-----------------------|------------|--------------|
| RNN | Support | Support |
| recursive RNN | Support | Support |
| padding zeros | Must | No need |
| blob data type | Tensor | LoDTensor |
PaddlePaddle achieves this flexibility by passing a new data type, *LoD Tensor*, between operators: a Tensor attached with a segmentation index known as *LoD*. The LoD index not only segments a tensor, it also recursively segments sub-sequences. This document presents the design of LoD and LoDTensor.
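A minimal, framework-free sketch of the idea (pure Python/NumPy, assuming the offset-based LoD convention; the real `LoDTensor` lives in `paddle.fluid.core`):

```python
import numpy as np

# Three sequences of lengths 3, 1, 2 packed into one mini-batch tensor.
seqs = [[1, 2, 3], [4], [5, 6]]
data = np.concatenate([np.array(s, dtype='int64') for s in seqs]).reshape(-1, 1)

# The LoD index stores cumulative offsets instead of padding every sequence:
lod = [0]
for s in seqs:
    lod.append(lod[-1] + len(s))
# lod == [0, 3, 4, 6]; sequence i occupies rows lod[i]:lod[i + 1] of `data`.
print(data[lod[1]:lod[2]])   # the second sequence: [[4]]
```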
......
...@@ -10,10 +10,27 @@ PaddlePaddle uses proto message to describe compile time program because :
The computation `Program` consists of nested `Blocks`. Each `Block` consists of data (i.e. `Variable`) and `Operations`. The concepts used to represent them are in the table below.
| | compile time | runtime |
|---|---|---|
| Data | VarDesc(proto) | Variable(cpp) |
| Operation | OpDesc(proto) | Operator(cpp) |
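To make the mapping concrete, a small sketch using only calls that appear in the benchmark scripts of this commit (the shapes are illustrative):

```python
import paddle.fluid as fluid

# Compile time: these calls only build the protobuf description
# (VarDesc / OpDesc entries in the default program); no tensors exist yet.
x = fluid.layers.data(name='x', shape=[4], dtype='float32')
y = fluid.layers.fc(input=x, size=2, act='softmax')

# Runtime: the Executor interprets that description, creating the C++
# Variables and Operators and running them on the chosen place.
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
```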
## Definition of VarType
......
...@@ -10,12 +10,38 @@ The answer relies on the fact that a `ProgramDesc` is similar to an abstract syn
The following table compares concepts in Fluid and Go
| Go | Fluid |
|----|-------|
| user-defined functions | [layers](https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle/fluid) |
| control-flow and built-in functions | [intrinsics/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators) |
| goroutines, channels | [class ThreadPool](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/framework/thread_pool.h) |
| runtime | [class Executor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h) |
## An Example Concurrent Program
......
...@@ -13,14 +13,41 @@ Most DL systems, including TensorFlow, Caffe2, and MxNet, can asynchronously exe ...@@ -13,14 +13,41 @@ Most DL systems, including TensorFlow, Caffe2, and MxNet, can asynchronously exe
There were many concurrent programming models, implemented in various forms: There were many concurrent programming models, implemented in various forms:
| concurrent programming model | implementation | <table>
|-----|-----| <thead>
| mutex | types and functions in standard libraries | <tr>
| semaphore | types and functions in standard libraries | <th>concurrent programming model</th>
| communicating sequential processes (CSP) | Go programming language | <th>implementation</th>
| actor model | Erlang programming language | </tr>
| message passing | MPI | </thead>
| bulk synchronous parallel (BSP) | Pregel distributed programming framework | <tbody>
<tr>
<td>mutex </td>
<td>types and functions in standard libraries </td>
</tr>
<tr>
<td>semaphore </td>
<td> types and functions in standard libraries </td>
</tr>
<tr>
<td> communicating sequential processes (CSP) </td>
<td> Go programming language </td>
</tr>
<tr>
<td> actor model </td>
<td> Erlang programming language </td>
</tr>
<tr>
<td> message passing </td>
<td> MPI </td>
</tr>
<tr>
<td> bulk synchronous parallel (BSP) </td>
<td> Pregel distributed programming framework </td>
</tr>
</tbody>
</table>
Since Fluid was designed to be a programming language, we would like to implement CSP in Fluid. Since Fluid was designed to be a programming language, we would like to implement CSP in Fluid.
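For readers unfamiliar with CSP, the following sketch illustrates the channel idea with a standard-library queue in Python; it is independent of whatever channel API Fluid eventually exposes.

```python
import threading
import queue

channel = queue.Queue(maxsize=4)  # a bounded channel, CSP-style

def producer():
    for i in range(8):
        channel.put(i)    # blocks when the channel is full
    channel.put(None)     # sentinel: no more data

def consumer():
    while True:
        item = channel.get()  # blocks when the channel is empty
        if item is None:
            break
        print('received', item)

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
```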
......
...@@ -2,12 +2,33 @@ ...@@ -2,12 +2,33 @@
Due to the refactorization of the PaddlePaddle core, we need Python classes to construct corresponding protobuf messages that describe a DL program. Due to the refactorization of the PaddlePaddle core, we need Python classes to construct corresponding protobuf messages that describe a DL program.
| Python classes | Protobuf messages | <table>
| --- | --- | <thead>
| Program | ProgramDesc | <tr>
| Block | BlockDesc | <th>Python classes</th>
| Operator | OpDesc | <th>Protobuf messages</th>
| Variable | VarDesc | </tr>
</thead>
<tbody>
<tr>
<td>Program </td>
<td>ProgramDesc </td>
</tr>
<tr>
<td>Block </td>
<td>BlockDesc </td>
</tr>
<tr>
<td>Operator </td>
<td>OpDesc </td>
</tr>
<tr>
<td>Variable </td>
<td>VarDesc </td>
</tr>
</tbody>
</table>
Please be aware that these Python classes need to maintain some construction-time information, which are not part of the protobuf messages. Please be aware that these Python classes need to maintain some construction-time information, which are not part of the protobuf messages.
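A hedged sketch of what this construction-time information looks like: each Python wrapper owns its protobuf desc plus Python-side bookkeeping (back-references, caches) that never reaches the serialized message. The class and attribute names below are illustrative, not the framework's actual code.

```python
class Block(object):
    def __init__(self, program, idx):
        self.desc = program.desc.block(idx)  # underlying BlockDesc proto (illustrative accessor)
        self.program = program               # construction-time only: back-reference, not serialized
        self.ops = []                        # Python-side Operator wrappers
        self.vars = {}                       # name -> Python-side Variable wrapper

class Operator(object):
    def __init__(self, block, op_type):
        self.block = block                   # construction-time only
        self.desc = block.desc.append_op()   # underlying OpDesc proto (illustrative accessor)
        self.desc.set_type(op_type)
```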
......
...@@ -10,11 +10,37 @@ Fluid is the answer. Fluid is similar to PyTorch and TensorFlow Eager Execution ...@@ -10,11 +10,37 @@ Fluid is the answer. Fluid is similar to PyTorch and TensorFlow Eager Execution
Deep learning infrastructure is one of the fastest evolving technologies. Within four years, there have already been three generations of technologies invented. Deep learning infrastructure is one of the fastest evolving technologies. Within four years, there have already been three generations of technologies invented.
| Existed since | model as sequence of layers | model as graph of operators | No model | <table>
|--|--|--|--| <thead>
| 2013 | Caffe, Theano, Torch, PaddlePaddle | | | <tr>
| 2015 | | TensorFlow, MxNet, Caffe2, ONNX, n-graph | | <th>Existed since</th>
| 2016 | | | PyTorch, TensorFlow Eager Execution, PaddlePaddle Fluid | <th>model as sequence of layers</th>
<th>model as graph of operators</th>
<th>No model</th>
</tr>
</thead>
<tbody>
<tr>
<td>2013 </td>
<td>Caffe, Theano, Torch, PaddlePaddle </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>2015 </td>
<td> </td>
<td>TensorFlow, MxNet, Caffe2, ONNX, n-graph </td>
<td> </td>
</tr>
<tr>
<td>2016 </td>
<td> </td>
<td> </td>
<td> PyTorch, TensorFlow Eager Execution, PaddlePaddle Fluid</td>
</tr>
</tbody>
</table>
From the above table, we see that the deep learning technology is evolving towards getting rid of the concept of a model. To understand the reasons behind this direction, a comparison of the *programming paradigms* or the ways to program deep learning applications using these systems, would be helpful. The following section goes over these. From the above table, we see that the deep learning technology is evolving towards getting rid of the concept of a model. To understand the reasons behind this direction, a comparison of the *programming paradigms* or the ways to program deep learning applications using these systems, would be helpful. The following section goes over these.
......
...@@ -36,11 +36,37 @@ At compile time, the Python program generates a protobuf message representation ...@@ -36,11 +36,37 @@ At compile time, the Python program generates a protobuf message representation
At runtime, the C++ program realizes the graph and runs it. At runtime, the C++ program realizes the graph and runs it.
| | Representation (protobuf messages) | Realization (C++ class objects) | <table>
|---|---|---| <thead>
|Data|[VarDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L107)|[Variable](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h#L24)| <tr>
|Operation|[OpDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L35)|[Operator](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L64)| <th></th>
|Block|BlockDesc|Block| <th>Representation (protobuf messages)</th>
<th>Realization (C++ class objects) </th>
</tr>
</thead>
<tbody>
<tr>
<td>Data</td>
<td>
<a href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L107">VarDesc</a></td>
<td>
<a href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h#L24">Variable</a></td>
</tr>
<tr>
<td>Operation </td>
<td>
<a href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L35">OpDesc</a></td>
<td>
<a href="https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h#L64">Operator</a></td>
</tr>
<tr>
<td>Block </td>
<td>BlockDesc </td>
<td>Block </td>
</tr>
</tbody>
</table>
The word *graph* is interchangeable with *block* in this document. A graph consists of computation steps and local variables similar to a C++/Java program block, or a pair of parentheses(`{` and `}`). The word *graph* is interchangeable with *block* in this document. A graph consists of computation steps and local variables similar to a C++/Java program block, or a pair of parentheses(`{` and `}`).
......
...@@ -68,11 +68,33 @@ We roughly break down the project into 14 tasks: ...@@ -68,11 +68,33 @@ We roughly break down the project into 14 tasks:
Tasks parallelizable within phases: Tasks parallelizable within phases:
Roadmap | Description | Parallelizable Tasks <table>
----------- | :------------------------------------ | :-------------------- <thead>
Phase I | Simplified model & components | *Task 1* ~ *Task 8* <tr>
Phase II | Standard model & benchmarking & profiling | *Task 9* ~ *Task 12* <th>Roadmap</th>
Phase III | Documentations | *Task13* ~ *Task14* <th>Description</th>
<th> Parallelizable Tasks</th>
</tr>
</thead>
<tbody>
<tr>
<td>Phase I </td>
<td>Simplified model & components </td>
<td>Task 1 ~ Task 8</td>
</tr>
<tr>
<td>Phase II </td>
<td> Standard model & benchmarking & profiling</td>
<td>Task 9 ~ Task 12 </td>
</tr>
<tr>
<td>Phase III </td>
<td> Documentations</td>
<td> Task13 ~ Task14 </td>
</tr>
</tbody>
</table>
Issue for each task will be created later. Contributions, discussions and comments are all highly appreciated and welcomed! Issue for each task will be created later. Contributions, discussions and comments are all highly appreciated and welcomed!
...@@ -121,18 +143,63 @@ Key ingredients about the layers: ...@@ -121,18 +143,63 @@ Key ingredients about the layers:
- Added to all above layers (except for data and loss layer). - Added to all above layers (except for data and loss layer).
- Sequence-wise normalization for RNNs: BatchNorm only performed on input-state projection and not state-state projection, for efficiency consideration. - Sequence-wise normalization for RNNs: BatchNorm only performed on input-state projection and not state-state projection, for efficiency consideration.
<table>
<thead>
<tr>
<th>Required Components</th>
<th> PaddlePaddle Support</th>
<th> Need to Develop</th>
</tr>
</thead>
<tbody>
<tr>
<td>Data Layer I (Spectrogram) </td>
<td>Not supported yet.</td>
<td>TBD (Task 3)</td>
</tr>
<tr>
<td>Data Layer II (Transcription) </td>
<td> paddle.data_type.integer_value_sequence</td>
<td> - </td>
</tr>
<tr>
<td>2D Convolution Layer </td>
<td> paddle.layer.image_conv_layer</td>
<td> - </td>
</tr>
<tr>
<td>DataType Converter (vec2seq)</td>
<td> paddle.layer.block_expand</td>
<td> - </td>
</tr>
<tr>
<td>Bi-/Uni-directional RNNs </td>
<td>paddle.layer.recurrent_group</td>
<td> - </td>
</tr>
<tr>
<td>Row Convolution Layer </td>
<td>Not supported yet.</td>
<td>TBD (Task 4)</td>
</tr>
<tr>
<td>CTC-loss Layer </td>
<td>paddle.layer.warp_ctc</td>
<td> - </td>
</tr>
<tr>
<td>Batch Normalization Layer </td>
<td>paddle.layer.batch_norm</td>
<td> - </td>
</tr>
<tr>
<td>CTC-Beam search </td>
<td>Not supported yet.</td>
<td> TBD (Task 6) </td>
</tr>
</tbody>
</table>
Required Components | PaddlePaddle Support | Need to Develop
:------------------------------------- | :-------------------------------------- | :-----------------------
Data Layer I (Spectrogram) | Not supported yet. | TBD (Task 3)
Data Layer II (Transcription) | `paddle.data_type.integer_value_sequence` | -
2D Convolution Layer | `paddle.layer.image_conv_layer` | -
DataType Converter (vec2seq) | `paddle.layer.block_expand` | -
Bi-/Uni-directional RNNs | `paddle.layer.recurrent_group` | -
Row Convolution Layer | Not supported yet. | TBD (Task 4)
CTC-loss Layer | `paddle.layer.warp_ctc` | -
Batch Normalization Layer | `paddle.layer.batch_norm` | -
CTC-Beam search | Not supported yet. | TBD (Task 6)
### Row Convolution ### Row Convolution
......
...@@ -4,9 +4,9 @@ ...@@ -4,9 +4,9 @@
.. toctree:: .. toctree::
:maxdepth: 1 :maxdepth: 1
new_op_en.md new_op_cn.md
new_op_kernel_en.md new_op_kernel.md
use_eigen_en.md use_eigen_cn.md
name_convention.md name_convention.md
support_new_device.md support_new_device.md
releasing_process.md releasing_process.md
......
...@@ -5,7 +5,7 @@ Development ...@@ -5,7 +5,7 @@ Development
:maxdepth: 1 :maxdepth: 1
new_op_en.md new_op_en.md
new_op_kernel_en.md new_op_kernel.md
use_eigen_en.md use_eigen_en.md
name_convention.md name_convention.md
support_new_device.md support_new_device.md
......
...@@ -26,13 +26,32 @@ ...@@ -26,13 +26,32 @@
Depending on whether a kernel is included, Ops fall into two kinds: Ops with kernels and Ops without kernels. The former inherit from `OperatorWithKernel`, the latter from `OperatorBase`. This tutorial mainly describes how to write an Op with a kernel; the items such an Op must provide are summarized below:
<table>
Content | Where it is defined <thead>
-------------- | :---------------------- <tr>
OpProtoMake definition | `.cc` file; a backward Op does not need an OpProtoMake <th>Content</th>
Op definition | `.cc` file <th>Where it is defined</th>
Kernel implementation | Kernels shared by CPU and CUDA live in the `.h` file; otherwise the CPU implementation goes in the `.cc` file and the CUDA implementation in the `.cu` file. </tr>
Op registration | The Op is registered in the `.cc` file; the CPU kernel is registered in the `.cc` file and the CUDA kernel in the `.cu` file </thead>
<tbody>
<tr>
<td>OpProtoMake definition </td>
<td>`.cc` file; a backward Op does not need an OpProtoMake </td>
</tr>
<tr>
<td>Op definition </td>
<td>`.cc` file</td>
</tr>
<tr>
<td>Kernel implementation </td>
<td>Kernels shared by CPU and CUDA live in the `.h` file; otherwise the CPU implementation goes in the `.cc` file and the CUDA implementation in the `.cu` file.</td>
</tr>
<tr>
<td>Op registration </td>
<td>The Op is registered in the `.cc` file; the CPU kernel is registered in the `.cc` file and the CUDA kernel in the `.cu` file</td>
</tr>
</tbody>
</table>
New op implementations are added under the directory [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators), with file names ending in `*_op.h` (if any), `*_op.cc`, and `*_op.cu` (if any). **The system automatically builds the op and its corresponding Python extension based on the file names.**
......
...@@ -33,6 +33,33 @@ Op definition | `.cc` files ...@@ -33,6 +33,33 @@ Op definition | `.cc` files
Kernel implementation | The kernel methods shared between CPU and CUDA are defined in `.h` files. CPU-specific kernels live in `.cc` files, while CUDA-specific kernels are implemented in `.cu`files. Kernel implementation | The kernel methods shared between CPU and CUDA are defined in `.h` files. CPU-specific kernels live in `.cc` files, while CUDA-specific kernels are implemented in `.cu`files.
Registering the Op | Ops are registered in `.cc` files; For Kernel registration, `.cc` files contain the CPU implementation, while `.cu` files contain the CUDA implementation. Registering the Op | Ops are registered in `.cc` files; For Kernel registration, `.cc` files contain the CPU implementation, while `.cu` files contain the CUDA implementation.
<table>
<thead>
<tr>
<th>Information</th>
<th> Where is it defined</th>
</tr>
</thead>
<tbody>
<tr>
<td>OpProtoMake definition </td>
<td> `.cc` files, Backward Op does not need an OpProtoMake interface. </td>
</tr>
<tr>
<td>Op definition </td>
<td> `.cc` files</td>
</tr>
<tr>
<td>Kernel implementation </td>
<td> The kernel methods shared between CPU and CUDA are defined in `.h` files. CPU-specific kernels live in `.cc` files, while CUDA-specific kernels are implemented in `.cu` files.</td>
</tr>
<tr>
<td>Registering the Op </td>
<td> Ops are registered in `.cc` files; For Kernel registration, `.cc` files contain the CPU implementation, while `.cu` files contain the CUDA implementation.</td>
</tr>
</tbody>
</table>
New Operator implementations are added to the list [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators), with file names in the format `*_op.h` (if applicable), `*_op.cc`, `*_op.cu` (if applicable).** The system will use the naming scheme to automatically build operators and their corresponding Python extensions.** New Operator implementations are added to the list [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators), with file names in the format `*_op.h` (if applicable), `*_op.cc`, `*_op.cu` (if applicable).** The system will use the naming scheme to automatically build operators and their corresponding Python extensions.**
......
...@@ -78,13 +78,116 @@ PaddlePaddle开发过程使用[git-flow](http://nvie.com/posts/a-successful-git- ...@@ -78,13 +78,116 @@ PaddlePaddle开发过程使用[git-flow](http://nvie.com/posts/a-successful-git-
Before each release, PaddlePaddle must first verify that all chapters of the PaddlePaddle Book work correctly. Correctness includes verifying model training both with the current `paddle_trainer` and with pure `Python` training.
| | Getting Started | Recognize Digits | Image Classification | Word Embedding | Sentiment Analysis | Semantic Role Labeling | Machine Translation | Personalized Recommendation | <table>
| --- | --- | --- | --- | --- | --- | --- | --- | --- | <thead>
| API.V2 + Docker + GPU | | | | | | | | | <tr>
| API.V2 + Docker + CPU | | | | | | | | | <th></th>
| `paddle_trainer` + Docker + GPU | | | | | | | | | <th>Getting Started</th>
| `paddle_trainer` + Docker + CPU | | | | | | | | | <th>Recognize Digits</th>
| API.V2 + Ubuntu + GPU | | | | | | | | | <th>Image Classification</th>
| API.V2 + Ubuntu + CPU | | | | | | | | | <th>Word Embedding</th>
| `paddle_trainer` + Ubuntu + GPU | | | | | | | | | <th>Sentiment Analysis</th>
| `paddle_trainer` + Ubuntu + CPU | | | | | | | | | <th>Semantic Role Labeling</th>
<th>Machine Translation</th>
<th>Personalized Recommendation</th>
</tr>
</thead>
<tbody>
<tr>
<td>API.V2 + Docker + GPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> API.V2 + Docker + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>`paddle_trainer` + Docker + GPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>`paddle_trainer` + Docker + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> API.V2 + Ubuntu + GPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>API.V2 + Ubuntu + CPU </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> `paddle_trainer` + Ubuntu + GPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> `paddle_trainer` + Ubuntu + CPU</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
...@@ -16,18 +16,58 @@ As a result, we design a particular format for tensor serialization. By default, ...@@ -16,18 +16,58 @@ As a result, we design a particular format for tensor serialization. By default,
The table below shows a tensor's byte view in detail. Note that all the signed values are written in the little-endian format. The table below shows a tensor's byte view in detail. Note that all the signed values are written in the little-endian format.
|field name | type | description | <table>
| --- | --- | --- | <thead>
| version | uint32_t | Version of saved file. Always 0 now. | <tr>
| tensor desc length | uint32_t | TensorDesc(Protobuf message) length in bytes. | <th>field name</th>
| tensor desc | void* | TensorDesc protobuf binary message | <th>type </th>
| tensor data | void* | Tensor's data in binary format. The length of `tensor_data` is decided by `TensorDesc.dims()` and `TensorDesc.data_type()` | <th>description </th>
| lod_level | uint64_t | Level of LoD | </tr>
| length of lod[0] | uint64_t | [Optional] length of lod[0] in bytes. | </thead>
| data of lod[0] | uint64_t* | [Optional] lod[0].data() | <tbody>
| ... | ... | ... | <tr>
<td> version</td>
<td> uint32_t </td>
<td> Version of saved file. Always 0 now.</td>
</tr>
<tr>
<td> tensor desc length </td>
<td> uint32_t </td>
<td> TensorDesc(Protobuf message) length in bytes. </td>
</tr>
<tr>
<td>tensor desc </td>
<td> void*</td>
<td> TensorDesc protobuf binary message </td>
</tr>
<tr>
<td> tensor data </td>
<td> void* </td>
<td> Tensor's data in binary format. The length of `tensor_data` is decided by `TensorDesc.dims()` and `TensorDesc.data_type()` </td>
</tr>
<tr>
<td> lod_level</td>
<td> uint64_t </td>
<td> Level of LoD </td>
</tr>
<tr>
<td> length of lod[0] </td>
<td> uint64_t </td>
<td> [Optional] length of lod[0] in bytes. </td>
</tr>
<tr>
<td> data of lod[0] </td>
<td> uint64_t* </td>
<td> [Optional] lod[0].data() </td>
</tr>
<tr>
<td>... </td>
<td> ... </td>
<td> ... </td>
</tr>
</tbody>
</table>
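As a sanity check of the layout above, the sketch below reads the fixed-size header fields of a saved tensor with Python's struct module; the field order and widths follow the table, while the file name is illustrative.

```python
import struct

def read_tensor_header(path):
    """Read version, TensorDesc length, and the raw TensorDesc bytes."""
    with open(path, 'rb') as f:
        version, = struct.unpack('<I', f.read(4))    # uint32, little-endian
        desc_len, = struct.unpack('<I', f.read(4))   # uint32, little-endian
        tensor_desc = f.read(desc_len)               # TensorDesc protobuf binary message
        return version, desc_len, tensor_desc

# version, desc_len, desc_bytes = read_tensor_header('fc_0.w_0')  # illustrative path
```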
## Summary ## Summary
......
...@@ -99,13 +99,49 @@ for pass_id in range(100): ...@@ -99,13 +99,49 @@ for pass_id in range(100):
### Notes on Running the Distributed Training Script
Running a distributed job requires assigning the parameters described in the table below:
| Parameter | Type | Description | Example | <table>
|:-------------|:------|:---------------------------------------|:-------------| <thead>
| trainer_id | int | ID of the current trainer node; trainer IDs range from 0 to n-1, where n is the value of trainers | 0/1/2/3 | <tr>
| pservers | str | list of parameter server endpoints | 127.0.0.1:6710,127.0.0.1:6711 | <th>Parameter</th>
| trainers | int | total number of trainer nodes, a number > 0 | 4 | <th>Type</th>
| server_endpoint | str | IP:PORT of the service started on the current node | 127.0.0.1:8789 | <th>Description</th>
| training_role | str | role of this node, TRAINER/PSERVER | PSERVER | <th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td>trainer_id </td>
<td>int </td>
<td>ID of the current trainer node; trainer IDs range from 0 to n-1, where n is the value of trainers </td>
<td>0/1/2/3 </td>
</tr>
<tr>
<td>pservers </td>
<td>str </td>
<td>list of parameter server endpoints </td>
<td>127.0.0.1:6710,127.0.0.1:6711 </td>
</tr>
<tr>
<td>trainers </td>
<td>int </td>
<td>total number of trainer nodes, a number > 0 </td>
<td>4 </td>
</tr>
<tr>
<td>server_endpoint </td>
<td>str </td>
<td>IP:PORT of the service started on the current node </td>
<td>127.0.0.1:8789 </td>
</tr>
<tr>
<td>training_role </td>
<td>str </td>
<td>role of this node, TRAINER/PSERVER </td>
<td>PSERVER </td>
</tr>
</tbody>
</table>
**Note:** ```training_role``` distinguishes the role of the service being started; it is used inside the training program and can be defined by the user as needed. The other parameters are required by the transpile function of fluid.DistributeTranspiler and must be defined before calling that function; an example is shown below:
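A minimal sketch of such a setup follows; the environment-variable names and the exact `transpile()` argument list are assumptions about the fluid.DistributeTranspiler API of this period, so consult the API reference for the authoritative signature.

```python
import os
import paddle.fluid as fluid

# Hypothetical way of passing in the parameters from the table above.
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
pservers = os.getenv("PADDLE_PSERVERS", "127.0.0.1:6710,127.0.0.1:6711")
trainers = int(os.getenv("PADDLE_TRAINERS", "4"))
server_endpoint = os.getenv("PADDLE_SERVER_ENDPOINT", "127.0.0.1:8789")
training_role = os.getenv("PADDLE_TRAINING_ROLE", "PSERVER")

t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=pservers, trainers=trainers)

if training_role == "PSERVER":
    pserver_prog = t.get_pserver_program(server_endpoint)
    # ... run the parameter-server program with an Executor ...
else:
    trainer_prog = t.get_trainer_program()
    # ... run the trainer program with an Executor ...
```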
......
...@@ -42,14 +42,40 @@ cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py ...@@ -42,14 +42,40 @@ cprofilev -a 0.0.0.0 -p 3214 -f profile.out main.py
The meaning of each column is:
| Column | Meaning | <table>
| --- | --- | <thead>
| ncalls | number of calls to the function | <tr>
| tottime | total time spent in the function itself, excluding time spent in functions it calls | <th>Column</th>
| percall | tottime divided by the number of calls | <th>Meaning</th>
| cumtime | total time of the function, including time spent in functions it calls | </tr>
| percall | cumtime divided by the number of calls | </thead>
| filename:lineno(function) | file name, line number, and function name | <tbody>
<tr>
<td>ncalls </td>
<td>number of calls to the function </td>
</tr>
<tr>
<td>tottime </td>
<td>total time spent in the function itself, excluding time spent in functions it calls </td>
</tr>
<tr>
<td>percall </td>
<td>tottime divided by the number of calls </td>
</tr>
<tr>
<td>cumtime </td>
<td>total time of the function, including time spent in functions it calls </td>
</tr>
<tr>
<td>percall </td>
<td>cumtime divided by the number of calls </td>
</tr>
<tr>
<td>filename:lineno(function) </td>
<td>file name, line number, and function name </td>
</tr>
</tbody>
</table>
### Identify Performance Bottlenecks
......
...@@ -57,14 +57,40 @@ port, we will see the output like the following: ...@@ -57,14 +57,40 @@ port, we will see the output like the following:
where each line corresponds to Python function, and the meaning of where each line corresponds to Python function, and the meaning of
each column is as follows: each column is as follows:
| column | meaning | <table>
| --- | --- | <thead>
| ncalls | the number of calls into a function | <tr>
| tottime | the total execution time of the function, not including the execution time of other functions called by the function | <th>column</th>
| percall | tottime divided by ncalls | <th>meaning </th>
| cumtime | the total execution time of the function, including the execution time of other functions being called | </tr>
| percall | cumtime divided by ncalls | </thead>
| filename:lineno(function) | where the function is defined | <tbody>
<tr>
<td> ncalls</td>
<td> the number of calls into a function</td>
</tr>
<tr>
<td>tottime</td>
<td> the total execution time of the function, not including the execution time of other functions called by the function</td>
</tr>
<tr>
<td> percall </td>
<td> tottime divided by ncalls</td>
</tr>
<tr>
<td> cumtime</td>
<td> the total execution time of the function, including the execution time of other functions being called</td>
</tr>
<tr>
<td> percall</td>
<td> cumtime divided by ncalls</td>
</tr>
<tr>
<td> filename:lineno(function) </td>
<td>where the function is defined </td>
</tr>
</tbody>
</table>
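When the interactive cprofilev view is not convenient, the same profile.out file can also be inspected with Python's standard pstats module, for example sorting by the cumtime column above; a minimal sketch:

```python
import pstats

# Load the profile written by cProfile/cprofilev and print the 10 functions
# with the largest cumulative time.
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)
```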
### Identify Performance Bottlenecks ### Identify Performance Bottlenecks
......
############################ .. _install_faq:
Install, Build and Unit test
############################ ###############################
Compile, Install, and Unit Test
###############################
.. contents::
1. Insufficient CUDA driver version
----------------------------------------------------------------
Users often face errors like `Cuda Error: CUDA driver version is insufficient for CUDA runtime version` when running the PaddlePaddle GPU Docker image. The usual cause is that the local CUDA driver libraries have not been mapped into the container.
You can solve the issue by running the following commands:
.. code-block:: bash
$ export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
$ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
$ docker run ${CUDA_SO} ${DEVICES} -it paddlepaddle/paddle:latest-gpu
For more information about installing and using Docker, please refer to the `PaddlePaddle Docker documentation <http://www.paddlepaddle.org/docs/0.11.0/documentation/zh/getstarted/build_and_install/docker_install_en.html>`_ .
2. Version mismatch between PythonLibs and PythonInterpreter
----------------------------------------------------------------
This is a common problem when CMake looks up Python. If multiple versions of Python are installed, CMake may pick a PythonLibs that does not match the PythonInterpreter. In that case, specify the Python version explicitly, as follows.
.. code-block:: bash
cmake .. -DPYTHON_EXECUTABLE=<exc_path> -DPYTHON_LIBRARY=<lib_path> -DPYTHON_INCLUDE_DIR=<inc_path>
Replace ``<exc_path>``, ``<lib_path>``, and ``<inc_path>`` with your local paths.
3. PaddlePaddle version is 0.0.0
------------------------------------------------
This issue can happen when you run :code:`paddle version` or :code:`cmake ..` and see the following warning:
.. code-block:: bash
CMake Warning at cmake/version.cmake:20 (message):
Cannot add paddle version from git tag
Pull all remote branches to your local machine with :code:`git fetch upstream`, and then run :code:`cmake` again.
4. paddlepaddle\*.whl is not a supported wheel on this platform.
------------------------------------------------------------------------
The primary cause of this issue is that pip cannot find a PaddlePaddle installation package that matches your current system. The latest PaddlePaddle Python installation package supports Linux x86_64 and MacOS 10.12, with Python 2.7 and pip 9.0.1.
You can upgrade pip with the following command:
.. code-block:: bash
pip install --upgrade pip
If that does not work, run :code:`python -c "import pip; print(pip.pep425tags.get_supported())"` to get the package suffixes your system supports, and compare them with the suffix of your installation package.
If the system supports :code:`linux_x86_64` but the installation package is :code:`manylinux1_x86_64`, you should upgrade pip to the latest version;
if the system supports :code:`manylinux1_x86_64` but the local installation package is :code:`linux_x86_64`, you can rename the whl package to :code:`manylinux1_x86_64` and then try again.
5. ImportError: No module named v2
----------------------------------
Please uninstall Paddle V1 if you have installed it before.
.. code-block:: bash
pip uninstall py_paddle paddle
Then install the PaddlePaddle Python packages: enter the build directory and run the following commands
pip install python/dist/paddle*.whl && pip install ../paddle/dist/py_paddle*.whl
6. Illegal instruction
-----------------------
This issue is usually caused by running a PaddlePaddle binary built with AVX SIMD instructions (used to speed up CPU computation) on a CPU that does not support AVX. Please choose the version that matches your CPU.
7. Python unittest fails
--------------------------------
If the following python unittest testcases fail:
.. code-block:: bash
24 - test_PyDataProvider (Failed)
26 - test_RecurrentGradientMachine (Failed)
27 - test_NetworkCompare (Failed)
28 - test_PyDataProvider2 (Failed)
32 - test_Prediction (Failed)
33 - test_Compare (Failed)
34 - test_Trainer (Failed)
35 - test_TrainerOnePass (Failed)
36 - test_CompareTwoNets (Failed)
37 - test_CompareTwoOpts (Failed)
38 - test_CompareSparse (Failed)
39 - test_recurrent_machine_generation (Failed)
40 - test_PyDataProviderWrapper (Failed)
41 - test_config_parser (Failed)
42 - test_swig_api (Failed)
43 - layers_test (Failed)
Please check the PaddlePaddle unittest logs which may suggest the following:
.. code-block:: bash
paddle package is already in your PYTHONPATH. But unittest need a clean environment.
Please uninstall paddle package before start unittest. Try to 'pip uninstall paddle'.
The solution is:
* Remove the old PaddlePaddle installation to create a clean environment for the unit tests. If the PaddlePaddle package is already in Python's site-packages, the unit tests will refer to it instead of the package in the :code:`/python` directory of the source tree. Setting :code:`PYTHONPATH` to :code:`/python` does not help either, because Python's search path gives priority to the installed package.
8. Failed to download the MKLML library
----------------------------------------------
.. code-block:: bash
make[2]: *** [third_party/mklml/src/extern_mklml-stamp/extern_mklml-download] error 4
make[1]: *** [CMakeFiles/extern_mklml.dir/all] error 2
make[1]: *** waiting for the unfinished jobs....
Cause: A slow network or SSL connection problems prevented the MKLML library from downloading successfully.
The solution is to download and install it manually; the specific steps are as follows.
.. code-block:: bash
# 1. enter the directory
cd build/third_party/mklml/src/extern_mklml
# 2. check the size of the package; it is normally about 75M, and a smaller size means the download failed
du -sh mklml_lnx_2018.0.1.20171007.tgz
# 3. manually download, unzip, and create the download-success stamp:
wget --no-check-certificate https://github.com/01org/mkl-dnn/releases/download/v0.11/mklml_lnx_2018.0.1.20171007.tgz -c -O mklml_lnx_2018.0.1.20171007.tgz
tar zxf mklml_lnx_2018.0.1.20171007.tgz
touch ../extern_mklml-stamp/extern_mklml-download
# 4. then compile again
TBD
...@@ -55,6 +55,7 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build( ...@@ -55,6 +55,7 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
const ProgramDesc &program) const { const ProgramDesc &program) const {
auto graph = new SSAGraph(); auto graph = new SSAGraph();
SSAGraph &result = *graph; SSAGraph &result = *graph;
std::unordered_set<std::string> og_has_been_broadcast;
result.vars_.resize(places_.size()); result.vars_.resize(places_.size());
bool is_forwarding = true; bool is_forwarding = true;
...@@ -122,9 +123,15 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build( ...@@ -122,9 +123,15 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
if (!is_forwarding) { if (!is_forwarding) {
auto var_names = op->OutputArgumentNames(); auto var_names = op->OutputArgumentNames();
// Currently, we assume that once gradient is generated, it can be
// broadcast, and each gradient is only broadcast once. But there are
// other cases, for example, we need to adjust the gradient according to
// the input when we get the gradient, which is not considered at present.
for (auto &og : var_names) { for (auto &og : var_names) {
if (grad_names_.count(og) != 0) { // is param grad if (grad_names_.count(og) != 0 &&
og_has_been_broadcast.count(og) == 0) { // is param grad
// Insert NCCL AllReduce Op // Insert NCCL AllReduce Op
og_has_been_broadcast.insert(og);
#ifdef PADDLE_WITH_CUDA #ifdef PADDLE_WITH_CUDA
result.ops_.emplace_back( result.ops_.emplace_back(
new NCCLAllReduceOpHandle(local_scopes_, places_, *nccl_ctxs_)); new NCCLAllReduceOpHandle(local_scopes_, places_, *nccl_ctxs_));
......
...@@ -76,7 +76,7 @@ void NCCLAllReduceOpHandle::RunImpl() { ...@@ -76,7 +76,7 @@ void NCCLAllReduceOpHandle::RunImpl() {
} }
} }
std::string NCCLAllReduceOpHandle::Name() const { return "NCCL AllReduce"; } std::string NCCLAllReduceOpHandle::Name() const { return "nccl_all_reduce"; }
} // namespace details } // namespace details
} // namespace framework } // namespace framework
} // namespace paddle } // namespace paddle
...@@ -14,6 +14,9 @@ ...@@ -14,6 +14,9 @@
#pragma once #pragma once
#include <string>
#include <vector>
#include "paddle/fluid/framework/details/op_handle_base.h" #include "paddle/fluid/framework/details/op_handle_base.h"
#include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/scope.h"
...@@ -34,6 +37,10 @@ struct NCCLAllReduceOpHandle : public OpHandleBase { ...@@ -34,6 +37,10 @@ struct NCCLAllReduceOpHandle : public OpHandleBase {
std::string Name() const override; std::string Name() const override;
// Delaying and buffering nccl_all_reduce calls together can significantly increase
// performance. Disable this feature by returning false.
bool IsMultiDeviceTransfer() override { return true; };
protected: protected:
void RunImpl() override; void RunImpl() override;
}; };
......
...@@ -13,6 +13,8 @@ ...@@ -13,6 +13,8 @@
// limitations under the License. // limitations under the License.
#pragma once #pragma once
#include <string>
#include <vector>
#include "paddle/fluid/framework/details/var_handle.h" #include "paddle/fluid/framework/details/var_handle.h"
#include "paddle/fluid/platform/device_context.h" #include "paddle/fluid/platform/device_context.h"
...@@ -53,6 +55,10 @@ class OpHandleBase { ...@@ -53,6 +55,10 @@ class OpHandleBase {
void AddOutput(VarHandleBase *out); void AddOutput(VarHandleBase *out);
// If the Op involves data transfer across multiple devices, it
// will likely block other computations.
virtual bool IsMultiDeviceTransfer() { return false; }
protected: protected:
virtual void RunImpl() = 0; virtual void RunImpl() = 0;
}; };
......
...@@ -23,22 +23,36 @@ ThreadedSSAGraphExecutor::ThreadedSSAGraphExecutor( ...@@ -23,22 +23,36 @@ ThreadedSSAGraphExecutor::ThreadedSSAGraphExecutor(
size_t num_threads, bool use_event, size_t num_threads, bool use_event,
const std::vector<Scope *> &local_scopes, const std::vector<Scope *> &local_scopes,
const std::vector<platform::Place> &places, const std::vector<platform::Place> &places,
std::unique_ptr<SSAGraph> &&graph) std::unique_ptr<SSAGraph> &&graph, bool allow_op_delay)
: SSAGraphExecutor(std::move(graph)), : SSAGraphExecutor(std::move(graph)),
pool_(num_threads >= 2 ? new ::ThreadPool(num_threads) : nullptr), pool_(num_threads >= 2 ? new ::ThreadPool(num_threads) : nullptr),
local_scopes_(local_scopes), local_scopes_(local_scopes),
places_(places), places_(places),
fetch_ctxs_(places), fetch_ctxs_(places),
use_event_(use_event) {} use_event_(use_event),
running_ops_(0),
allow_op_delay_(allow_op_delay) {}
void ThreadedSSAGraphExecutor::RunDelayedOps(
const std::unordered_set<OpHandleBase *> &delayed_ops) {
for (auto op : delayed_ops) {
op->Run(use_event_);
}
}
FeedFetchList ThreadedSSAGraphExecutor::Run( FeedFetchList ThreadedSSAGraphExecutor::Run(
const std::vector<std::string> &fetch_tensors) { const std::vector<std::string> &fetch_tensors) {
std::unordered_map<OpHandleBase *, size_t> pending_ops; std::unordered_map<OpHandleBase *, size_t> pending_ops;
std::unordered_set<VarHandleBase *> pending_vars; std::unordered_set<VarHandleBase *> pending_vars;
BlockingQueue<VarHandleBase *> ready_vars; BlockingQueue<VarHandleBase *> ready_vars;
std::unordered_set<OpHandleBase *> ready_ops; std::unordered_set<OpHandleBase *> ready_ops;
// For ops (e.g. nccl_all_reduce) that need to coordinate multiple
// streams from multiple GPUs, it's faster to buffer them and schedule
// together since we currently cannot overlap computation and memcpy streams.
// Should revisit it if overlapping is available.
std::unordered_set<OpHandleBase *> delayed_ops;
std::unordered_set<OpHandleBase *> blocked_by_delayed_ops;
std::unordered_set<VarHandleBase *> delayed_vars;
auto InsertPendingVar = [&pending_vars, &ready_vars](VarHandleBase &var) { auto InsertPendingVar = [&pending_vars, &ready_vars](VarHandleBase &var) {
pending_vars.insert(&var); pending_vars.insert(&var);
...@@ -106,7 +120,14 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( ...@@ -106,7 +120,14 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
auto run_all_ready_ops = [&] { auto run_all_ready_ops = [&] {
for (auto *op : ready_ops) { for (auto *op : ready_ops) {
RunOp(ready_vars, op); if (op->IsMultiDeviceTransfer() && allow_op_delay_) {
delayed_ops.insert(op);
delayed_vars.insert(op->outputs_.begin(), op->outputs_.end());
ready_vars.Extend(op->outputs_);
continue;
}
running_ops_++;
RunOp(&ready_vars, op);
} }
ready_ops.clear(); ready_ops.clear();
}; };
...@@ -118,13 +139,13 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( ...@@ -118,13 +139,13 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
} }
// Step 3. Execution // Step 3. Execution
while (!pending_vars.empty()) { while (!pending_vars.empty() || !ready_ops.empty() || !delayed_ops.empty()) {
// 1. Run All Ready ops // 1. Run All Ready ops
run_all_ready_ops(); run_all_ready_ops();
// 2. Find ready variable // 2. Find ready variable
bool timeout; bool timeout;
auto cur_ready_vars = ready_vars.PopAll(1000, &timeout); auto cur_ready_vars = ready_vars.PopAll(1, &timeout);
if (timeout) { if (timeout) {
if (exception_) { if (exception_) {
...@@ -141,13 +162,29 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( ...@@ -141,13 +162,29 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
auto &deps = pending_ops[op]; auto &deps = pending_ops[op];
--deps; --deps;
if (deps == 0) { if (deps == 0) {
if (delayed_vars.find(ready_var) != delayed_vars.end()) {
blocked_by_delayed_ops.insert(op);
} else {
ready_ops.insert(op); ready_ops.insert(op);
} }
} }
} }
}
// When there are no other ops to schedule, schedule buffered delayed
// ops and unblock other ops.
if (ready_ops.empty() && !delayed_ops.empty() && running_ops_ == 0) {
RunDelayedOps(delayed_ops);
delayed_ops.clear();
for (auto *op : blocked_by_delayed_ops) {
ready_ops.insert(op);
}
blocked_by_delayed_ops.clear();
}
// Keep loop until all vars are ready. // Keep loop until all vars are ready.
} }
PADDLE_ENFORCE(ready_ops.empty());
PADDLE_ENFORCE(delayed_ops.empty());
PADDLE_ENFORCE(blocked_by_delayed_ops.empty());
++computation_count_; ++computation_count_;
auto sync_computation = [&] { auto sync_computation = [&] {
...@@ -182,12 +219,13 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( ...@@ -182,12 +219,13 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
} }
void ThreadedSSAGraphExecutor::RunOp( void ThreadedSSAGraphExecutor::RunOp(
BlockingQueue<VarHandleBase *> &ready_var_q, details::OpHandleBase *op) { BlockingQueue<VarHandleBase *> *ready_var_q, details::OpHandleBase *op) {
auto op_run = [&ready_var_q, op, this] { auto op_run = [ready_var_q, op, this] {
try { try {
VLOG(10) << op->Name() << " : " << op->DebugString(); VLOG(10) << op->Name() << " : " << op->DebugString();
op->Run(use_event_); op->Run(use_event_);
ready_var_q.Extend(op->outputs_); running_ops_--;
ready_var_q->Extend(op->outputs_);
} catch (platform::EnforceNotMet ex) { } catch (platform::EnforceNotMet ex) {
exception_.reset(new platform::EnforceNotMet(ex)); exception_.reset(new platform::EnforceNotMet(ex));
} catch (...) { } catch (...) {
......
...@@ -14,7 +14,12 @@ ...@@ -14,7 +14,12 @@
#pragma once #pragma once
#include <chrono> #include <deque>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>
#include <functional> #include <functional>
#include "ThreadPool.h" // ThreadPool in thrird party #include "ThreadPool.h" // ThreadPool in thrird party
#include "paddle/fluid/framework/details/ssa_graph_executor.h" #include "paddle/fluid/framework/details/ssa_graph_executor.h"
...@@ -70,7 +75,8 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor { ...@@ -70,7 +75,8 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor {
ThreadedSSAGraphExecutor(size_t num_threads, bool use_event, ThreadedSSAGraphExecutor(size_t num_threads, bool use_event,
const std::vector<Scope *> &local_scopes, const std::vector<Scope *> &local_scopes,
const std::vector<platform::Place> &places, const std::vector<platform::Place> &places,
std::unique_ptr<SSAGraph> &&graph); std::unique_ptr<SSAGraph> &&graph,
bool allow_op_delay);
// Run a SSAGraph by a thread pool // Run a SSAGraph by a thread pool
// Use topological sort algorithm // Use topological sort algorithm
...@@ -79,9 +85,11 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor { ...@@ -79,9 +85,11 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor {
~ThreadedSSAGraphExecutor() {} ~ThreadedSSAGraphExecutor() {}
private: private:
void RunOp(BlockingQueue<VarHandleBase *> &ready_var_q, void RunOp(BlockingQueue<VarHandleBase *> *ready_var_q,
details::OpHandleBase *op); details::OpHandleBase *op);
void RunDelayedOps(const std::unordered_set<OpHandleBase *> &delayed_ops);
private: private:
std::unique_ptr<::ThreadPool> pool_; std::unique_ptr<::ThreadPool> pool_;
std::vector<Scope *> local_scopes_; std::vector<Scope *> local_scopes_;
...@@ -89,6 +97,8 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor { ...@@ -89,6 +97,8 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor {
platform::DeviceContextPool fetch_ctxs_; platform::DeviceContextPool fetch_ctxs_;
const bool use_event_; const bool use_event_;
std::unique_ptr<platform::EnforceNotMet> exception_; std::unique_ptr<platform::EnforceNotMet> exception_;
std::atomic<int> running_ops_;
bool allow_op_delay_;
size_t computation_count_{0}; size_t computation_count_{0};
size_t max_async_computation{100}; size_t max_async_computation{100};
......
...@@ -22,7 +22,7 @@ ...@@ -22,7 +22,7 @@
namespace paddle { namespace paddle {
namespace framework { namespace framework {
namespace details { namespace details {
struct OpHandleBase; class OpHandleBase;
// VarHandleBase is the var node in the dependency graph. // VarHandleBase is the var node in the dependency graph.
// A variable can only be generated by a single operator. i.e. // A variable can only be generated by a single operator. i.e.
......
...@@ -13,9 +13,10 @@ See the License for the specific language governing permissions and ...@@ -13,9 +13,10 @@ See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#include "paddle/fluid/framework/parallel_executor.h" #include "paddle/fluid/framework/parallel_executor.h"
#include <string> #include "paddle/fluid/platform/profiler.h"
#include "ThreadPool.h" #include <string>
#include <vector>
#ifdef PADDLE_WITH_CUDA #ifdef PADDLE_WITH_CUDA
#include "paddle/fluid/platform/nccl_helper.h" #include "paddle/fluid/platform/nccl_helper.h"
...@@ -47,7 +48,7 @@ ParallelExecutor::ParallelExecutor( ...@@ -47,7 +48,7 @@ ParallelExecutor::ParallelExecutor(
const std::vector<platform::Place> &places, const std::vector<platform::Place> &places,
const std::unordered_set<std::string> &params, const std::unordered_set<std::string> &params,
const ProgramDesc &startup_program, const ProgramDesc &main_program, const ProgramDesc &startup_program, const ProgramDesc &main_program,
const std::string &loss_var_name, Scope *scope) const std::string &loss_var_name, Scope *scope, bool allow_op_delay)
: member_(new ParallelExecutorPrivate(places)) { : member_(new ParallelExecutorPrivate(places)) {
member_->global_scope_ = scope; member_->global_scope_ = scope;
...@@ -82,8 +83,8 @@ ParallelExecutor::ParallelExecutor( ...@@ -82,8 +83,8 @@ ParallelExecutor::ParallelExecutor(
auto graph = builder.Build(main_program); auto graph = builder.Build(main_program);
member_->executor_.reset(new details::ThreadedSSAGraphExecutor( member_->executor_.reset(new details::ThreadedSSAGraphExecutor(
num_threads, use_event, member_->local_scopes_, places, num_threads, use_event, member_->local_scopes_, places, std::move(graph),
std::move(graph))); allow_op_delay));
// Step 3. Create vars in each scope; // Step 3. Create vars in each scope;
for (auto *scope : member_->local_scopes_) { for (auto *scope : member_->local_scopes_) {
...@@ -151,6 +152,7 @@ void ParallelExecutor::BCastParamsToGPUs( ...@@ -151,6 +152,7 @@ void ParallelExecutor::BCastParamsToGPUs(
void ParallelExecutor::Run(const std::vector<std::string> &fetch_tensors, void ParallelExecutor::Run(const std::vector<std::string> &fetch_tensors,
const std::string &fetched_var_name) { const std::string &fetched_var_name) {
platform::RecordBlock b(0);
auto fetch_data = member_->executor_->Run(fetch_tensors); auto fetch_data = member_->executor_->Run(fetch_tensors);
*member_->global_scope_->Var(fetched_var_name)->GetMutable<FeedFetchList>() = *member_->global_scope_->Var(fetched_var_name)->GetMutable<FeedFetchList>() =
fetch_data; fetch_data;
......
...@@ -14,8 +14,9 @@ limitations under the License. */ ...@@ -14,8 +14,9 @@ limitations under the License. */
#pragma once #pragma once
#include <future> #include <string>
#include <unordered_set> #include <unordered_set>
#include <vector>
#include "paddle/fluid/framework/executor.h" #include "paddle/fluid/framework/executor.h"
#include "paddle/fluid/framework/op_info.h" #include "paddle/fluid/framework/op_info.h"
#include "paddle/fluid/framework/program_desc.h" #include "paddle/fluid/framework/program_desc.h"
...@@ -37,7 +38,8 @@ class ParallelExecutor { ...@@ -37,7 +38,8 @@ class ParallelExecutor {
const std::unordered_set<std::string>& params, const std::unordered_set<std::string>& params,
const ProgramDesc& startup_program, const ProgramDesc& startup_program,
const ProgramDesc& main_program, const ProgramDesc& main_program,
const std::string& loss_var_name, Scope* scope); const std::string& loss_var_name, Scope* scope,
bool allow_op_delay);
void Run(const std::vector<std::string>& fetch_tensors, void Run(const std::vector<std::string>& fetch_tensors,
const std::string& fetched_var_name = "fetched_var"); const std::string& fetched_var_name = "fetched_var");
......
...@@ -10,6 +10,9 @@ See the License for the specific language governing permissions and ...@@ -10,6 +10,9 @@ See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#pragma once #pragma once
#include <vector>
#include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/framework/tensor.h"
...@@ -52,7 +55,7 @@ class SelectedRows { ...@@ -52,7 +55,7 @@ class SelectedRows {
private: private:
// Notice: rows can be duplicate. We can have {0, 4, 7, 0, 5, 7, 9} here. // Notice: rows can be duplicate. We can have {0, 4, 7, 0, 5, 7, 9} here.
// SelectedRows are simplely concated when adding together. Until a // SelectedRows are simply concated when adding together. Until a
// SelectedRows add a Tensor, will the duplicate rows be handled. // SelectedRows add a Tensor, will the duplicate rows be handled.
Vector<int64_t> rows_; Vector<int64_t> rows_;
std::unique_ptr<Tensor> value_{nullptr}; std::unique_ptr<Tensor> value_{nullptr};
......
...@@ -2,7 +2,8 @@ if(WITH_DISTRIBUTE) ...@@ -2,7 +2,8 @@ if(WITH_DISTRIBUTE)
grpc_library(sendrecvop_grpc SRCS bytebuffer_stream.cc sendrecvop_utils.cc grpc_client.cc grpc_library(sendrecvop_grpc SRCS bytebuffer_stream.cc sendrecvop_utils.cc grpc_client.cc
grpc_server.cc variable_response.cc PROTO send_recv.proto DEPS lod_tensor selected_rows) grpc_server.cc variable_response.cc PROTO send_recv.proto DEPS lod_tensor selected_rows)
set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor")
set_source_files_properties(serde_test.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) set_source_files_properties(serde_test.cc grpc_server_test.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS})
cc_test(serde_test SRCS serde_test.cc variable_response.cc DEPS grpc++_unsecure grpc_unsecure gpr cc_test(serde_test SRCS serde_test.cc variable_response.cc DEPS grpc++_unsecure grpc_unsecure gpr
cares zlib protobuf sendrecvop_grpc) cares zlib protobuf sendrecvop_grpc)
cc_test(grpc_server_test SRCS grpc_server_test.cc DEPS sendrecvop_grpc grpc++_unsecure grpc_unsecure gpr cares zlib protobuf)
endif() endif()
...@@ -12,8 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ...@@ -12,8 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#include "grpc_client.h" #include "paddle/fluid/operators/detail/grpc_client.h"
#include <sys/time.h>
#include <limits>
#include "paddle/fluid/framework/threadpool.h" #include "paddle/fluid/framework/threadpool.h"
namespace paddle { namespace paddle {
...@@ -52,7 +54,7 @@ bool RPCClient::AsyncSendVariable(const std::string& ep, ...@@ -52,7 +54,7 @@ bool RPCClient::AsyncSendVariable(const std::string& ep,
auto call = s->stub_g_.PrepareUnaryCall( auto call = s->stub_g_.PrepareUnaryCall(
s->context_.get(), "/sendrecv.SendRecvService/SendVariable", req, &cq_); s->context_.get(), "/sendrecv.SendRecvService/SendVariable", req, &cq_);
call->StartCall(); call->StartCall();
call->Finish(&s->reply_, &s->status_, (void*)s); call->Finish(&s->reply_, &s->status_, static_cast<void*>(s));
}); });
req_count_++; req_count_++;
...@@ -70,8 +72,7 @@ void ProcGetResponse(const VarHandle& var_h, ...@@ -70,8 +72,7 @@ void ProcGetResponse(const VarHandle& var_h,
template <typename T> template <typename T>
void RequestToByteBuffer(const T& proto, ::grpc::ByteBuffer* result) { void RequestToByteBuffer(const T& proto, ::grpc::ByteBuffer* result) {
::grpc::Slice slice(proto.ByteSizeLong()); ::grpc::Slice slice(proto.ByteSizeLong());
proto.SerializeWithCachedSizesToArray( proto.SerializeWithCachedSizesToArray(const_cast<uint8_t*>(slice.begin()));
const_cast<uint8_t*>(reinterpret_cast<const uint8_t*>(slice.begin())));
::grpc::ByteBuffer tmp(&slice, 1); ::grpc::ByteBuffer tmp(&slice, 1);
result->Swap(&tmp); result->Swap(&tmp);
} }
...@@ -109,7 +110,7 @@ bool RPCClient::AsyncGetVariable(const std::string& ep, ...@@ -109,7 +110,7 @@ bool RPCClient::AsyncGetVariable(const std::string& ep,
auto call = s->stub_g_.PrepareUnaryCall( auto call = s->stub_g_.PrepareUnaryCall(
s->context_.get(), "/sendrecv.SendRecvService/GetVariable", buf, &cq_); s->context_.get(), "/sendrecv.SendRecvService/GetVariable", buf, &cq_);
call->StartCall(); call->StartCall();
call->Finish(&s->reply_, &s->status_, (void*)s); call->Finish(&s->reply_, &s->status_, static_cast<void*>(s));
}); });
req_count_++; req_count_++;
...@@ -150,9 +151,10 @@ bool RPCClient::AsyncPrefetchVariable(const std::string& ep, ...@@ -150,9 +151,10 @@ bool RPCClient::AsyncPrefetchVariable(const std::string& ep,
s->response_call_back_ = ProcGetResponse; s->response_call_back_ = ProcGetResponse;
auto call = s->stub_g_.PrepareUnaryCall( auto call = s->stub_g_.PrepareUnaryCall(
s->context_.get(), "/sendrecv.SendRecvService/GetVariable", req, &cq_); s->context_.get(), "/sendrecv.SendRecvService/PrefetchVariable", req,
&cq_);
call->StartCall(); call->StartCall();
call->Finish(&s->reply_, &s->status_, (void*)s); call->Finish(&s->reply_, &s->status_, static_cast<void*>(s));
}); });
req_count_++; req_count_++;
...@@ -168,7 +170,7 @@ void RPCClient::AsyncSendBatchBarrier(const std::string& ep, int64_t time_out) { ...@@ -168,7 +170,7 @@ void RPCClient::AsyncSendBatchBarrier(const std::string& ep, int64_t time_out) {
sendrecv::VariableMessage req; sendrecv::VariableMessage req;
req.set_varname(BATCH_BARRIER_MESSAGE); req.set_varname(BATCH_BARRIER_MESSAGE);
auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_); auto rpc = s->stub_->AsyncSendVariable(s->context_.get(), req, &cq_);
rpc->Finish(&s->reply_, &s->status_, (void*)s); rpc->Finish(&s->reply_, &s->status_, static_cast<void*>(s));
req_count_++; req_count_++;
} }
...@@ -180,7 +182,7 @@ void RPCClient::AsyncSendFetchBarrier(const std::string& ep, int64_t time_out) { ...@@ -180,7 +182,7 @@ void RPCClient::AsyncSendFetchBarrier(const std::string& ep, int64_t time_out) {
sendrecv::VariableMessage req; sendrecv::VariableMessage req;
req.set_varname(FETCH_BARRIER_MESSAGE); req.set_varname(FETCH_BARRIER_MESSAGE);
auto rpc = s->stub_->AsyncGetVariable(s->context_.get(), req, &cq_); auto rpc = s->stub_->AsyncGetVariable(s->context_.get(), req, &cq_);
rpc->Finish(&s->reply_, &s->status_, (void*)s); rpc->Finish(&s->reply_, &s->status_, static_cast<void*>(s));
req_count_++; req_count_++;
} }
......
...@@ -14,6 +14,9 @@ limitations under the License. */ ...@@ -14,6 +14,9 @@ limitations under the License. */
#include "paddle/fluid/operators/detail/grpc_server.h" #include "paddle/fluid/operators/detail/grpc_server.h"
#include <limits>
#include <string>
using ::grpc::ServerAsyncResponseWriter; using ::grpc::ServerAsyncResponseWriter;
namespace paddle { namespace paddle {
...@@ -128,6 +131,49 @@ class RequestGet final : public RequestBase { ...@@ -128,6 +131,49 @@ class RequestGet final : public RequestBase {
SimpleBlockQueue<MessageWithName>* queue_; SimpleBlockQueue<MessageWithName>* queue_;
}; };
class RequestPrefetch final : public RequestBase {
public:
explicit RequestPrefetch(GrpcService::AsyncService* service,
::grpc::ServerCompletionQueue* cq,
framework::Scope* scope,
const platform::DeviceContext* dev_ctx,
framework::Executor* executor,
framework::ProgramDesc* program, int blkid)
: RequestBase(service, cq, dev_ctx),
responder_(&ctx_),
scope_(scope),
executor_(executor),
program_(program),
blkid_(blkid) {
int method_id = static_cast<int>(detail::GrpcMethod::kPrefetchVariable);
service_->RequestAsyncUnary(method_id, &ctx_, &request_, &responder_, cq_,
cq_, this);
}
virtual ~RequestPrefetch() {}
virtual std::string GetReqName() { return request_.varname(); }
virtual void Process() {
// prefetch process...
::grpc::ByteBuffer reply;
// TODO(Yancey1989): execute the Block which contains the prefetch ops
VLOG(3) << "RequestPrefetch Process in";
responder_.Finish(reply, ::grpc::Status::OK, this);
status_ = FINISH;
}
protected:
sendrecv::VariableMessage request_;
ServerAsyncResponseWriter<::grpc::ByteBuffer> responder_;
framework::Scope* scope_;
framework::Executor* executor_;
framework::ProgramDesc* program_;
int blkid_;
};
void AsyncGRPCServer::WaitClientGet(int count) { void AsyncGRPCServer::WaitClientGet(int count) {
int fetch_barriers = 0; int fetch_barriers = 0;
while (fetch_barriers < count) { while (fetch_barriers < count) {
...@@ -147,6 +193,7 @@ void AsyncGRPCServer::RunSyncUpdate() { ...@@ -147,6 +193,7 @@ void AsyncGRPCServer::RunSyncUpdate() {
cq_send_ = builder.AddCompletionQueue(); cq_send_ = builder.AddCompletionQueue();
cq_get_ = builder.AddCompletionQueue(); cq_get_ = builder.AddCompletionQueue();
cq_prefetch_ = builder.AddCompletionQueue();
server_ = builder.BuildAndStart(); server_ = builder.BuildAndStart();
LOG(INFO) << "Server listening on " << address_ << std::endl; LOG(INFO) << "Server listening on " << address_ << std::endl;
...@@ -155,6 +202,8 @@ void AsyncGRPCServer::RunSyncUpdate() { ...@@ -155,6 +202,8 @@ void AsyncGRPCServer::RunSyncUpdate() {
std::bind(&AsyncGRPCServer::TryToRegisterNewSendOne, this); std::bind(&AsyncGRPCServer::TryToRegisterNewSendOne, this);
std::function<void()> get_register = std::function<void()> get_register =
std::bind(&AsyncGRPCServer::TryToRegisterNewGetOne, this); std::bind(&AsyncGRPCServer::TryToRegisterNewGetOne, this);
std::function<void()> prefetch_register =
std::bind(&AsyncGRPCServer::TryToRegisterNewPrefetchOne, this);
t_send_.reset( t_send_.reset(
new std::thread(std::bind(&AsyncGRPCServer::HandleRequest, this, new std::thread(std::bind(&AsyncGRPCServer::HandleRequest, this,
...@@ -163,17 +212,21 @@ void AsyncGRPCServer::RunSyncUpdate() { ...@@ -163,17 +212,21 @@ void AsyncGRPCServer::RunSyncUpdate() {
t_get_.reset( t_get_.reset(
new std::thread(std::bind(&AsyncGRPCServer::HandleRequest, this, new std::thread(std::bind(&AsyncGRPCServer::HandleRequest, this,
cq_get_.get(), "cq_get", get_register))); cq_get_.get(), "cq_get", get_register)));
t_prefetch_.reset(new std::thread(
std::bind(&AsyncGRPCServer::HandleRequest, this, cq_prefetch_.get(),
"cq_prefetch", prefetch_register)));
// wait server // wait server
server_->Wait(); server_->Wait();
t_send_->join(); t_send_->join();
t_get_->join(); t_get_->join();
t_prefetch_->join();
} }
void AsyncGRPCServer::ShutdownQueue() { void AsyncGRPCServer::ShutdownQueue() {
std::unique_lock<std::mutex> lock(cq_mutex_); std::unique_lock<std::mutex> lock(cq_mutex_);
cq_send_->Shutdown(); cq_send_->Shutdown();
cq_get_->Shutdown(); cq_get_->Shutdown();
cq_prefetch_->Shutdown();
} }
// This URL explains why shutdown is complicate: // This URL explains why shutdown is complicate:
...@@ -186,6 +239,7 @@ void AsyncGRPCServer::ShutDown() { ...@@ -186,6 +239,7 @@ void AsyncGRPCServer::ShutDown() {
void AsyncGRPCServer::TryToRegisterNewSendOne() { void AsyncGRPCServer::TryToRegisterNewSendOne() {
std::unique_lock<std::mutex> lock(cq_mutex_); std::unique_lock<std::mutex> lock(cq_mutex_);
if (is_shut_down_) { if (is_shut_down_) {
VLOG(3) << "shutdown, do not TryToRegisterNewSendOne";
return; return;
} }
RequestSend* send = new RequestSend(&service_, cq_send_.get(), scope_, RequestSend* send = new RequestSend(&service_, cq_send_.get(), scope_,
...@@ -196,6 +250,7 @@ void AsyncGRPCServer::TryToRegisterNewSendOne() { ...@@ -196,6 +250,7 @@ void AsyncGRPCServer::TryToRegisterNewSendOne() {
void AsyncGRPCServer::TryToRegisterNewGetOne() { void AsyncGRPCServer::TryToRegisterNewGetOne() {
std::unique_lock<std::mutex> lock(cq_mutex_); std::unique_lock<std::mutex> lock(cq_mutex_);
if (is_shut_down_) { if (is_shut_down_) {
VLOG(3) << "shutdown, do not TryToRegisterNewGetOne";
return; return;
} }
RequestGet* get = new RequestGet(&service_, cq_get_.get(), scope_, dev_ctx_, RequestGet* get = new RequestGet(&service_, cq_get_.get(), scope_, dev_ctx_,
...@@ -203,6 +258,19 @@ void AsyncGRPCServer::TryToRegisterNewGetOne() { ...@@ -203,6 +258,19 @@ void AsyncGRPCServer::TryToRegisterNewGetOne() {
VLOG(4) << "Create RequestGet status:" << get->Status(); VLOG(4) << "Create RequestGet status:" << get->Status();
} }
void AsyncGRPCServer::TryToRegisterNewPrefetchOne() {
std::unique_lock<std::mutex> lock(cq_mutex_);
if (is_shut_down_) {
VLOG(3) << "shutdown, do not TryToRegisterNewPrefetchOne";
return;
}
RequestPrefetch* prefetch =
new RequestPrefetch(&service_, cq_prefetch_.get(), scope_, dev_ctx_,
executor_, program_, prefetch_blk_id_);
VLOG(4) << "Create RequestPrefetch status:" << prefetch->Status();
}
// FIXME(typhoonzero): change cq_name to enum. // FIXME(typhoonzero): change cq_name to enum.
void AsyncGRPCServer::HandleRequest(::grpc::ServerCompletionQueue* cq, void AsyncGRPCServer::HandleRequest(::grpc::ServerCompletionQueue* cq,
std::string cq_name, std::string cq_name,
...@@ -211,25 +279,28 @@ void AsyncGRPCServer::HandleRequest(::grpc::ServerCompletionQueue* cq, ...@@ -211,25 +279,28 @@ void AsyncGRPCServer::HandleRequest(::grpc::ServerCompletionQueue* cq,
void* tag = NULL; void* tag = NULL;
bool ok = false; bool ok = false;
while (true) { while (true) {
VLOG(3) << "HandleRequest for " << cq_name << " while in";
if (!cq->Next(&tag, &ok)) { if (!cq->Next(&tag, &ok)) {
LOG(INFO) << cq_name << " CompletionQueue shutdown!"; LOG(INFO) << cq_name << " CompletionQueue shutdown!";
break; break;
} }
VLOG(3) << "HandleRequest for " << cq_name << " while after Next";
PADDLE_ENFORCE(tag); PADDLE_ENFORCE(tag);
// FIXME(typhoonzero): de-couple the barriers with recv_op // FIXME(typhoonzero): de-couple the barriers with recv_op
if (!is_shut_down_ && cq_name == "cq_get") WaitCond(1); if (!is_shut_down_ && cq_name == "cq_get") WaitCond(1);
if (!is_shut_down_ && cq_name == "cq_send") WaitCond(0); if (!is_shut_down_ && cq_name == "cq_send") WaitCond(0);
RequestBase* base = (RequestBase*)tag; RequestBase* base = reinterpret_cast<RequestBase*>(tag);
// reference: // reference:
// https://github.com/tensorflow/tensorflow/issues/5596 // https://github.com/tensorflow/tensorflow/issues/5596
// https://groups.google.com/forum/#!topic/grpc-io/xftlRy-IQwM // https://groups.google.com/forum/#!topic/grpc-io/xftlRy-IQwM
// https://groups.google.com/forum/#!topic/grpc-io/ywATt88Ef_I // https://groups.google.com/forum/#!topic/grpc-io/ywATt88Ef_I
if (!ok) { if (!ok) {
LOG(WARNING) << cq_name << " recv no regular event:argument name" LOG(WARNING) << cq_name << " recv no regular event:argument name["
<< base->GetReqName(); << base->GetReqName() << "]";
TryToRegisterNewOne(); TryToRegisterNewOne();
delete base; delete base;
continue; continue;
......
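
The new cq_prefetch_ queue follows the same lifecycle as the existing send/get queues: seed one pending request, block on the completion queue in a dedicated thread, and re-register a fresh request after each one is handled. Below is a gRPC-free sketch of that register/handle/re-register cycle; BlockingQueue and the helper names are toy stand-ins for grpc::ServerCompletionQueue and the registration callbacks, not Paddle or gRPC APIs.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

template <typename T>
class BlockingQueue {  // toy stand-in for grpc::ServerCompletionQueue
 public:
  void Push(T v) {
    {
      std::lock_guard<std::mutex> l(mu_);
      q_.push(std::move(v));
    }
    cv_.notify_one();
  }
  bool Pop(T* v) {  // returns false once closed and drained
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return closed_ || !q_.empty(); });
    if (q_.empty()) return false;
    *v = std::move(q_.front());
    q_.pop();
    return true;
  }
  void Close() {
    {
      std::lock_guard<std::mutex> l(mu_);
      closed_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<T> q_;
  bool closed_ = false;
};

int main() {
  BlockingQueue<std::string> cq_prefetch;        // one queue per RPC kind
  auto register_new_prefetch = [&cq_prefetch] {  // TryToRegisterNewPrefetchOne analogue
    cq_prefetch.Push("prefetch-request");
  };

  register_new_prefetch();              // seed the first pending request
  std::thread t_prefetch([&] {          // HandleRequest(cq_prefetch_, ...) analogue
    std::string req;
    int handled = 0;
    while (handled < 3 && cq_prefetch.Pop(&req)) {
      std::cout << "handled " << req << "\n";
      ++handled;
      register_new_prefetch();          // re-register so the next request can be served
    }
  });
  t_prefetch.join();
  cq_prefetch.Close();
  return 0;
}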
...@@ -15,9 +15,12 @@ limitations under the License. */ ...@@ -15,9 +15,12 @@ limitations under the License. */
#pragma once #pragma once
#include <grpc++/grpc++.h> #include <grpc++/grpc++.h>
#include <thread> #include <string>
#include <utility>
#include "paddle/fluid/framework/executor.h"
#include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/scope.h"
#include "paddle/fluid/framework/selected_rows.h" #include "paddle/fluid/framework/selected_rows.h"
#include "paddle/fluid/framework/var_type.h" #include "paddle/fluid/framework/var_type.h"
...@@ -53,6 +56,12 @@ class AsyncGRPCServer final { ...@@ -53,6 +56,12 @@ class AsyncGRPCServer final {
void SetDevCtx(const platform::DeviceContext *dev_ctx) { dev_ctx_ = dev_ctx; } void SetDevCtx(const platform::DeviceContext *dev_ctx) { dev_ctx_ = dev_ctx; }
void SetProgram(framework::ProgramDesc *program) { program_ = program; }
void SetPrefetchBlkdId(int blkid) { prefetch_blk_id_ = blkid; }
void SetExecutor(framework::Executor *executor) { executor_ = executor; }
const ReceivedMessage Get() { return this->var_recv_queue_.Pop(); } const ReceivedMessage Get() { return this->var_recv_queue_.Pop(); }
void Push(const std::string &msg_name) { void Push(const std::string &msg_name) {
...@@ -66,6 +75,7 @@ class AsyncGRPCServer final { ...@@ -66,6 +75,7 @@ class AsyncGRPCServer final {
std::function<void()> TryToRegisterNewOne); std::function<void()> TryToRegisterNewOne);
void TryToRegisterNewSendOne(); void TryToRegisterNewSendOne();
void TryToRegisterNewGetOne(); void TryToRegisterNewGetOne();
void TryToRegisterNewPrefetchOne();
void ShutdownQueue(); void ShutdownQueue();
private: private:
...@@ -73,6 +83,7 @@ class AsyncGRPCServer final { ...@@ -73,6 +83,7 @@ class AsyncGRPCServer final {
volatile bool is_shut_down_ = false; volatile bool is_shut_down_ = false;
std::unique_ptr<::grpc::ServerCompletionQueue> cq_send_; std::unique_ptr<::grpc::ServerCompletionQueue> cq_send_;
std::unique_ptr<::grpc::ServerCompletionQueue> cq_get_; std::unique_ptr<::grpc::ServerCompletionQueue> cq_get_;
std::unique_ptr<::grpc::ServerCompletionQueue> cq_prefetch_;
GrpcService::AsyncService service_; GrpcService::AsyncService service_;
std::unique_ptr<::grpc::Server> server_; std::unique_ptr<::grpc::Server> server_;
...@@ -83,6 +94,7 @@ class AsyncGRPCServer final { ...@@ -83,6 +94,7 @@ class AsyncGRPCServer final {
// received variable from RPC, operators fetch variable from this queue. // received variable from RPC, operators fetch variable from this queue.
SimpleBlockQueue<MessageWithName> var_get_queue_; SimpleBlockQueue<MessageWithName> var_get_queue_;
// clients send variables to this queue.
ReceivedQueue var_recv_queue_; ReceivedQueue var_recv_queue_;
// condition of the sub program // condition of the sub program
...@@ -92,6 +104,11 @@ class AsyncGRPCServer final { ...@@ -92,6 +104,11 @@ class AsyncGRPCServer final {
std::unique_ptr<std::thread> t_send_; std::unique_ptr<std::thread> t_send_;
std::unique_ptr<std::thread> t_get_; std::unique_ptr<std::thread> t_get_;
std::unique_ptr<std::thread> t_prefetch_;
int prefetch_blk_id_;
framework::ProgramDesc *program_;
framework::Executor *executor_;
}; };
}; // namespace detail }; // namespace detail
......
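
Putting the new header members together, a parameter-server owner would configure the prefetch machinery roughly as below before starting the service. This is only an illustrative sketch built from the setters shown in this hunk (SetDevCtx, SetProgram, SetPrefetchBlkdId, SetExecutor) and the RunSyncUpdate entry point; the free function name is made up, the include list is abbreviated, and scope attachment is not part of this hunk, so it is omitted.

#include <string>

#include "paddle/fluid/framework/executor.h"
#include "paddle/fluid/framework/program_desc.h"
#include "paddle/fluid/operators/detail/grpc_server.h"

// Illustrative wiring only; mirrors what ListenAndServOp does further below.
void StartPrefetchCapableServer(const std::string& endpoint,
                                paddle::framework::ProgramDesc* program,
                                paddle::framework::Executor* executor,
                                const paddle::platform::DeviceContext* dev_ctx) {
  paddle::operators::detail::AsyncGRPCServer server(endpoint);
  server.SetDevCtx(dev_ctx);    // device context used when serving replies
  server.SetProgram(program);   // program holding the prefetch block
  server.SetPrefetchBlkdId(0);  // which block serves PrefetchVariable requests
  server.SetExecutor(executor); // executor that runs that block
  server.RunSyncUpdate();       // blocks; spawns cq_send/cq_get/cq_prefetch handler threads
}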
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <unistd.h>
#include <string>
#include <thread>
#include "gtest/gtest.h"
#include "paddle/fluid/operators/detail/grpc_client.h"
#include "paddle/fluid/operators/detail/grpc_server.h"
namespace framework = paddle::framework;
namespace platform = paddle::platform;
namespace detail = paddle::operators::detail;
std::unique_ptr<detail::AsyncGRPCServer> rpc_service_;
void StartServer(const std::string& endpoint) {
rpc_service_.reset(new detail::AsyncGRPCServer(endpoint));
rpc_service_->RunSyncUpdate();
}
TEST(PREFETCH, CPU) {
// start up a server instance backend
// TODO(Yancey1989): Need to start a server with optimize blocks and
// prefetch blocks.
std::thread server_thread(StartServer, "127.0.0.1:8889");
framework::Scope scope;
platform::CPUPlace place;
platform::CPUDeviceContext ctx(place);
// create var on local scope
std::string in_var_name("in");
std::string out_var_name("out");
auto* in_var = scope.Var(in_var_name);
auto* in_tensor = in_var->GetMutable<framework::LoDTensor>();
in_tensor->Resize({10, 10});
VLOG(3) << "before mutable_data";
in_tensor->mutable_data<int>(place);
scope.Var(out_var_name);
VLOG(3) << "before fetch";
detail::RPCClient client;
client.AsyncPrefetchVariable("127.0.0.1:8889", ctx, scope, in_var_name,
out_var_name);
client.Wait();
rpc_service_->ShutDown();
server_thread.join();
rpc_service_.reset(nullptr);
}
...@@ -76,10 +76,11 @@ namespace detail { ...@@ -76,10 +76,11 @@ namespace detail {
enum class GrpcMethod { enum class GrpcMethod {
kSendVariable, kSendVariable,
kGetVariable, kGetVariable,
kPrefetchVariable,
}; };
static const int kGrpcNumMethods = static const int kGrpcNumMethods =
static_cast<int>(GrpcMethod::kGetVariable) + 1; static_cast<int>(GrpcMethod::kPrefetchVariable) + 1;
inline const char* GrpcMethodName(GrpcMethod id) { inline const char* GrpcMethodName(GrpcMethod id) {
switch (id) { switch (id) {
...@@ -87,6 +88,8 @@ inline const char* GrpcMethodName(GrpcMethod id) { ...@@ -87,6 +88,8 @@ inline const char* GrpcMethodName(GrpcMethod id) {
return "/sendrecv.SendRecvService/SendVariable"; return "/sendrecv.SendRecvService/SendVariable";
case GrpcMethod::kGetVariable: case GrpcMethod::kGetVariable:
return "/sendrecv.SendRecvService/GetVariable"; return "/sendrecv.SendRecvService/GetVariable";
case GrpcMethod::kPrefetchVariable:
return "/sendrecv.SendRecvService/PrefetchVariable";
} }
// Shouldn't be reached. // Shouldn't be reached.
...@@ -114,5 +117,5 @@ class GrpcService final { ...@@ -114,5 +117,5 @@ class GrpcService final {
}; };
} // namespace detail } // namespace detail
} // namespace operator } // namespace operators
} // namespace paddle } // namespace paddle
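
Because kGrpcNumMethods is derived from the last enumerator, appending kPrefetchVariable automatically extends the method table, and GrpcMethodName only needs one extra case. The standalone copy below (runnable on its own; paths follow send_recv.proto) makes that pattern explicit.

#include <iostream>

enum class GrpcMethod {
  kSendVariable,
  kGetVariable,
  kPrefetchVariable,
};

static const int kGrpcNumMethods =
    static_cast<int>(GrpcMethod::kPrefetchVariable) + 1;

inline const char* GrpcMethodName(GrpcMethod id) {
  switch (id) {
    case GrpcMethod::kSendVariable:
      return "/sendrecv.SendRecvService/SendVariable";
    case GrpcMethod::kGetVariable:
      return "/sendrecv.SendRecvService/GetVariable";
    case GrpcMethod::kPrefetchVariable:
      return "/sendrecv.SendRecvService/PrefetchVariable";
  }
  return "unknown";  // shouldn't be reached
}

int main() {
  for (int i = 0; i < kGrpcNumMethods; ++i) {
    std::cout << GrpcMethodName(static_cast<GrpcMethod>(i)) << "\n";
  }
  return 0;
}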
...@@ -21,6 +21,8 @@ service SendRecvService { ...@@ -21,6 +21,8 @@ service SendRecvService {
rpc SendVariable(VariableMessage) returns (VoidMessage) {} rpc SendVariable(VariableMessage) returns (VoidMessage) {}
// Argument VariableMessage for GetVariable should only contain varname. // Argument VariableMessage for GetVariable should only contain varname.
rpc GetVariable(VariableMessage) returns (VariableMessage) {} rpc GetVariable(VariableMessage) returns (VariableMessage) {}
// Prefetch variable by Ids
rpc PrefetchVariable(VariableMessage) returns (VariableMessage) {}
} }
// VariableMessage is serialized paddle variable message. // VariableMessage is serialized paddle variable message.
......
...@@ -13,22 +13,13 @@ See the License for the specific language governing permissions and ...@@ -13,22 +13,13 @@ See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#include <stdint.h> #include <stdint.h>
#include <sys/stat.h>
#include <ostream> #include <ostream>
#include <thread>
#include <unistd.h>
#include "paddle/fluid/framework/executor.h" #include "paddle/fluid/framework/executor.h"
#include "paddle/fluid/framework/framework.pb.h"
#include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/proto_desc.h"
#include "paddle/fluid/framework/threadpool.h" #include "paddle/fluid/framework/threadpool.h"
#include "paddle/fluid/operators/detail/grpc_server.h" #include "paddle/fluid/operators/detail/grpc_server.h"
#include "paddle/fluid/operators/detail/sendrecvop_utils.h"
#include "paddle/fluid/operators/detail/simple_block_queue.h"
#include "paddle/fluid/string/printf.h"
namespace paddle { namespace paddle {
namespace operators { namespace operators {
...@@ -121,6 +112,11 @@ class ListenAndServOp : public framework::OperatorBase { ...@@ -121,6 +112,11 @@ class ListenAndServOp : public framework::OperatorBase {
prepared.begin(), prepared.begin(),
std::shared_ptr<framework::ExecutorPrepareContext>(nullptr)); std::shared_ptr<framework::ExecutorPrepareContext>(nullptr));
// TODO(qiao) set proper fields for table lookup and update
rpc_service_->SetExecutor(&executor);
rpc_service_->SetPrefetchBlkdId(0);
rpc_service_->SetProgram(program);
// TODO(typhoonzero): change this to a while_op for every cluster-batch. // TODO(typhoonzero): change this to a while_op for every cluster-batch.
bool exit_flag = false; bool exit_flag = false;
// Record received sparse variables, so that // Record received sparse variables, so that
...@@ -184,7 +180,8 @@ class ListenAndServOp : public framework::OperatorBase { ...@@ -184,7 +180,8 @@ class ListenAndServOp : public framework::OperatorBase {
ParallelExecuteBlocks(parallel_blkids, &executor, prepared, program, ParallelExecuteBlocks(parallel_blkids, &executor, prepared, program,
&recv_scope); &recv_scope);
VLOG(2) << "run all blocks spent (ms) " << detail::GetTimestamp() - ts; VLOG(3) << "run all blocks spent " << detail::GetTimestamp() - ts
<< "(ms)";
// Reset the received sparse variables, the sum operator would not // Reset the received sparse variables, the sum operator would not
// sum the input sparse variables which rows is empty at the next // sum the input sparse variables which rows is empty at the next
......
...@@ -18,6 +18,22 @@ limitations under the License. */ ...@@ -18,6 +18,22 @@ limitations under the License. */
namespace paddle { namespace paddle {
namespace operators { namespace operators {
static inline framework::OpKernelType ExpectedKernelType(
const framework::ExecutionContext& ctx) {
auto* table_var = ctx.InputVar("W");
if (table_var->IsType<LoDTensor>()) {
return framework::OpKernelType(
framework::ToDataType(table_var->Get<LoDTensor>().type()),
ctx.device_context());
} else if (table_var->IsType<SelectedRows>()) {
return framework::OpKernelType(
framework::ToDataType(table_var->Get<SelectedRows>().value().type()),
ctx.device_context());
} else {
PADDLE_THROW("W should be LoDTensor or SelectedRows");
}
}
class LookupTableOp : public framework::OperatorWithKernel { class LookupTableOp : public framework::OperatorWithKernel {
public: public:
using framework::OperatorWithKernel::OperatorWithKernel; using framework::OperatorWithKernel::OperatorWithKernel;
...@@ -51,9 +67,7 @@ class LookupTableOp : public framework::OperatorWithKernel { ...@@ -51,9 +67,7 @@ class LookupTableOp : public framework::OperatorWithKernel {
protected: protected:
framework::OpKernelType GetExpectedKernelType( framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override { const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType( return ExpectedKernelType(ctx);
framework::ToDataType(ctx.Input<LoDTensor>("W")->type()),
ctx.device_context());
} }
}; };
...@@ -84,7 +98,7 @@ class LookupTableOpMaker : public framework::OpProtoAndCheckerMaker { ...@@ -84,7 +98,7 @@ class LookupTableOpMaker : public framework::OpProtoAndCheckerMaker {
"If the value is -1, it makes no effect to lookup. " "If the value is -1, it makes no effect to lookup. "
"Otherwise the given value indicates padding the output " "Otherwise the given value indicates padding the output "
"with zeros whenever lookup encounters it in Ids.") "with zeros whenever lookup encounters it in Ids.")
.SetDefault(-1); .SetDefault(kNoPadding);
AddComment(R"DOC( AddComment(R"DOC(
Lookup Table Operator. Lookup Table Operator.
...@@ -124,9 +138,7 @@ class LookupTableOpGrad : public framework::OperatorWithKernel { ...@@ -124,9 +138,7 @@ class LookupTableOpGrad : public framework::OperatorWithKernel {
protected: protected:
framework::OpKernelType GetExpectedKernelType( framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override { const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType( return ExpectedKernelType(ctx);
framework::ToDataType(ctx.Input<LoDTensor>("W")->type()),
ctx.device_context());
} }
}; };
......
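
Both LookupTableOp and LookupTableOpGrad now delegate to the same ExpectedKernelType helper, which branches on whether W is a dense LoDTensor or a sparse SelectedRows and throws otherwise. A toy, framework-free analogue of that dispatch is sketched below; std::variant and the struct names stand in for the framework's Variable types and are purely illustrative.

#include <iostream>
#include <stdexcept>
#include <string>
#include <variant>

struct DenseTable  { std::string dtype = "float32"; };  // stands in for LoDTensor
struct SparseTable { std::string dtype = "float32"; };  // stands in for SelectedRows

using TableVar = std::variant<DenseTable, SparseTable>;

// One helper shared by the forward op and the grad op, as in the diff above.
std::string ExpectedKernelType(const TableVar& table) {
  if (auto* t = std::get_if<DenseTable>(&table)) return t->dtype;
  if (auto* t = std::get_if<SparseTable>(&table)) return t->dtype;
  throw std::runtime_error("W should be LoDTensor or SelectedRows");
}

int main() {
  TableVar dense = DenseTable{};
  TableVar sparse = SparseTable{};
  std::cout << ExpectedKernelType(dense) << "\n"
            << ExpectedKernelType(sparse) << "\n";
  return 0;
}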
...@@ -14,6 +14,9 @@ limitations under the License. */ ...@@ -14,6 +14,9 @@ limitations under the License. */
#pragma once #pragma once
#include <string>
#include <vector>
#include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/eigen.h"
#include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/op_registry.h"
...@@ -25,16 +28,37 @@ namespace operators { ...@@ -25,16 +28,37 @@ namespace operators {
using Tensor = framework::Tensor; using Tensor = framework::Tensor;
using LoDTensor = framework::LoDTensor; using LoDTensor = framework::LoDTensor;
using SelectedRows = framework::SelectedRows; using SelectedRows = framework::SelectedRows;
using DDim = framework::DDim;
static constexpr int64_t kNoPadding = -1;
inline size_t getIndex(const std::vector<int64_t> &rows, int64_t value) {
auto it = std::find(rows.begin(), rows.end(), value);
PADDLE_ENFORCE(it != rows.end(), "id should be in rows");
return static_cast<size_t>(std::distance(rows.begin(), it));
}
template <typename T> template <typename T>
class LookupTableKernel : public framework::OpKernel<T> { class LookupTableKernel : public framework::OpKernel<T> {
public: public:
void Compute(const framework::ExecutionContext& context) const override { void Compute(const framework::ExecutionContext &context) const override {
auto* table_t = context.Input<LoDTensor>("W"); auto *table_var = context.InputVar("W");
auto* ids_var = context.InputVar("Ids"); auto *ids_var = context.InputVar("Ids");
Tensor* output_t = context.Output<Tensor>("Out"); Tensor *output_t = context.Output<Tensor>("Out");
int64_t padding_idx = context.Attr<int64_t>("padding_idx");
DDim table_dim;
if (table_var->IsType<LoDTensor>()) {
table_dim = context.Input<LoDTensor>("W")->dims();
} else if (table_var->IsType<SelectedRows>()) {
auto *table_t = context.Input<SelectedRows>("W");
table_dim = table_t->value().dims();
} else {
PADDLE_THROW("table only support LoDTensor and SelectedRows");
}
int64_t* ids; int64_t *ids;
int64_t ids_numel; int64_t ids_numel;
// The type of Ids(Input) is SelectedRows or LoDTensor, when Ids's type // The type of Ids(Input) is SelectedRows or LoDTensor, when Ids's type
...@@ -42,39 +66,50 @@ class LookupTableKernel : public framework::OpKernel<T> { ...@@ -42,39 +66,50 @@ class LookupTableKernel : public framework::OpKernel<T> {
// when Ids's type is SelectedRows, the rows of Ids contains the // when Ids's type is SelectedRows, the rows of Ids contains the
// ids to be looked up in W. // ids to be looked up in W.
if (ids_var->IsType<LoDTensor>()) { if (ids_var->IsType<LoDTensor>()) {
auto* ids_t = context.Input<LoDTensor>("Ids"); auto *ids_t = context.Input<LoDTensor>("Ids");
ids = const_cast<int64_t*>(ids_t->data<int64_t>()); ids = const_cast<int64_t *>(ids_t->data<int64_t>());
ids_numel = ids_t->numel(); ids_numel = ids_t->numel();
} else if (ids_var->IsType<SelectedRows>()) { } else if (ids_var->IsType<SelectedRows>()) {
auto* ids_t = context.Input<SelectedRows>("Ids"); auto *ids_t = context.Input<SelectedRows>("Ids");
ids = const_cast<int64_t*>(ids_t->rows().data()); ids = const_cast<int64_t *>(ids_t->rows().data());
ids_numel = ids_t->rows().size(); ids_numel = ids_t->rows().size();
output_t->Resize({ids_numel, table_t->dims()[1]}); output_t->Resize({ids_numel, table_dim[1]});
} else { } else {
PADDLE_THROW("Unsupported Variable Type of Ids"); PADDLE_THROW("Unsupported Variable Type of Ids");
} }
int64_t padding_idx = context.Attr<int64_t>("padding_idx"); if (table_var->IsType<LoDTensor>()) {
auto *table_t = context.Input<LoDTensor>("W");
int64_t row_number = table_t->dims()[0];
int64_t row_width = table_t->dims()[1];
int N = table_t->dims()[0]; auto *table = table_t->data<T>();
int D = table_t->dims()[1]; auto *output = output_t->mutable_data<T>(context.GetPlace());
auto* table = table_t->data<T>();
auto* output = output_t->mutable_data<T>(context.GetPlace());
if (padding_idx == -1) {
for (int64_t i = 0; i < ids_numel; ++i) { for (int64_t i = 0; i < ids_numel; ++i) {
PADDLE_ENFORCE_LT(ids[i], N); if (padding_idx != kNoPadding && ids[i] == padding_idx) {
memset(output + i * row_width, 0, row_width * sizeof(T));
} else {
PADDLE_ENFORCE_LT(ids[i], row_number);
PADDLE_ENFORCE_GE(ids[i], 0); PADDLE_ENFORCE_GE(ids[i], 0);
memcpy(output + i * D, table + ids[i] * D, D * sizeof(T)); memcpy(output + i * row_width, table + ids[i] * row_width,
row_width * sizeof(T));
} }
} else { }
} else if (table_var->IsType<SelectedRows>()) {
const auto &table_t = table_var->Get<SelectedRows>();
int64_t row_width = table_t.value().dims()[1];
const auto *table = table_t.value().data<T>();
auto *output = output_t->mutable_data<T>(context.GetPlace());
for (int64_t i = 0; i < ids_numel; ++i) { for (int64_t i = 0; i < ids_numel; ++i) {
if (ids[i] == padding_idx) { if (padding_idx != kNoPadding && ids[i] == padding_idx) {
memset(output + i * D, 0, D * sizeof(T)); memset(output + i * row_width, 0, row_width * sizeof(T));
} else { } else {
PADDLE_ENFORCE_LT(ids[i], N);
PADDLE_ENFORCE_GE(ids[i], 0); PADDLE_ENFORCE_GE(ids[i], 0);
memcpy(output + i * D, table + ids[i] * D, D * sizeof(T)); auto id_index = getIndex(table_t.rows(), ids[i]);
memcpy(output + i * row_width, table + id_index * row_width,
row_width * sizeof(T));
} }
} }
} }
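
For the SelectedRows table, the new getIndex helper maps a sparse row id onto its physical offset in the value tensor before the memcpy. A minimal standalone analogue follows; GetRowIndex and the sample ids are hypothetical, not Paddle APIs.

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy analogue of getIndex: map a sparse row id to its physical offset
// in a SelectedRows-style value tensor. Assumes the id is present.
static size_t GetRowIndex(const std::vector<int64_t>& rows, int64_t id) {
  auto it = std::find(rows.begin(), rows.end(), id);
  assert(it != rows.end() && "id should be in rows");
  return static_cast<size_t>(std::distance(rows.begin(), it));
}

int main() {
  std::vector<int64_t> rows = {7, 3, 11};  // sparse row ids held by the table
  // physical row 1 of the value tensor stores the embedding for id 3
  return GetRowIndex(rows, 3) == 1 ? 0 : 1;
}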
...@@ -84,17 +119,27 @@ class LookupTableKernel : public framework::OpKernel<T> { ...@@ -84,17 +119,27 @@ class LookupTableKernel : public framework::OpKernel<T> {
template <typename T> template <typename T>
class LookupTableGradKernel : public framework::OpKernel<T> { class LookupTableGradKernel : public framework::OpKernel<T> {
public: public:
void Compute(const framework::ExecutionContext& context) const override { void Compute(const framework::ExecutionContext &context) const override {
auto *table_var = context.InputVar("W");
DDim table_dim;
if (table_var->IsType<LoDTensor>()) {
table_dim = context.Input<LoDTensor>("W")->dims();
} else if (table_var->IsType<SelectedRows>()) {
auto *table_t = context.Input<SelectedRows>("W");
table_dim = table_t->value().dims();
} else {
PADDLE_THROW("table only support LoDTensor and SelectedRows");
}
bool is_sparse = context.Attr<bool>("is_sparse"); bool is_sparse = context.Attr<bool>("is_sparse");
// Since paddings are not trainable and fixed in forward, the gradient of // Since paddings are not trainable and fixed in forward, the gradient of
// paddings makes no sense and we don't deal with it in backward. // paddings makes no sense and we don't deal with it in backward.
if (is_sparse) { if (is_sparse) {
auto* ids = context.Input<LoDTensor>("Ids"); auto *ids = context.Input<LoDTensor>("Ids");
auto* table = context.Input<LoDTensor>("W"); auto *d_output = context.Input<LoDTensor>(framework::GradVarName("Out"));
auto* d_output = context.Input<LoDTensor>(framework::GradVarName("Out")); auto *d_table = context.Output<SelectedRows>(framework::GradVarName("W"));
auto* d_table = context.Output<SelectedRows>(framework::GradVarName("W"));
auto* ids_data = ids->data<int64_t>(); auto *ids_data = ids->data<int64_t>();
auto ids_dim = ids->dims(); auto ids_dim = ids->dims();
framework::Vector<int64_t> new_rows; framework::Vector<int64_t> new_rows;
...@@ -104,31 +149,30 @@ class LookupTableGradKernel : public framework::OpKernel<T> { ...@@ -104,31 +149,30 @@ class LookupTableGradKernel : public framework::OpKernel<T> {
} }
d_table->set_rows(new_rows); d_table->set_rows(new_rows);
auto* d_table_value = d_table->mutable_value(); auto *d_table_value = d_table->mutable_value();
d_table_value->Resize({ids_dim[0], table->dims()[1]}); d_table_value->Resize({ids_dim[0], table_dim[1]});
d_table_value->mutable_data<T>(context.GetPlace()); d_table_value->mutable_data<T>(context.GetPlace());
d_table->set_height(table->dims()[0]); d_table->set_height(table_dim[0]);
auto* d_output_data = d_output->data<T>(); auto *d_output_data = d_output->data<T>();
auto* d_table_data = d_table_value->data<T>(); auto *d_table_data = d_table_value->data<T>();
PADDLE_ENFORCE_EQ(d_table_value->dims(), d_output->dims()); PADDLE_ENFORCE_EQ(d_table_value->dims(), d_output->dims());
memcpy(d_table_data, d_output_data, sizeof(T) * d_output->numel()); memcpy(d_table_data, d_output_data, sizeof(T) * d_output->numel());
} else { } else {
auto* ids = context.Input<LoDTensor>("Ids"); auto *ids = context.Input<LoDTensor>("Ids");
auto* d_output = context.Input<LoDTensor>(framework::GradVarName("Out")); auto *d_output = context.Input<LoDTensor>(framework::GradVarName("Out"));
auto* d_table = context.Output<LoDTensor>(framework::GradVarName("W")); auto *d_table = context.Output<LoDTensor>(framework::GradVarName("W"));
auto* table = context.Input<LoDTensor>("W");
auto* ids_data = ids->data<int64_t>(); auto *ids_data = ids->data<int64_t>();
auto ids_dim = ids->dims(); auto ids_dim = ids->dims();
int N = table->dims()[0]; int N = table_dim[0];
int D = d_output->dims()[1]; int D = d_output->dims()[1];
auto* d_output_data = d_output->data<T>(); auto *d_output_data = d_output->data<T>();
auto* d_table_data = d_table->mutable_data<T>(context.GetPlace()); auto *d_table_data = d_table->mutable_data<T>(context.GetPlace());
memset(d_table_data, 0, d_table->numel() * sizeof(T)); memset(d_table_data, 0, d_table->numel() * sizeof(T));
......
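
In the is_sparse branch of the gradient kernel, the table gradient is not a dense matrix: it becomes a SelectedRows whose rows are exactly the ids seen in the batch, whose value is d_output copied row-for-row, and whose height is the full table height taken from table_dim. A hedged, framework-free sketch of that layout (ToySelectedRows and the sample values are illustrative only):

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

struct ToySelectedRows {
  std::vector<int64_t> rows;  // which table rows the gradient touches
  std::vector<float> value;   // rows.size() x row_width, row-major
  int64_t height = 0;         // full table height (table_dim[0])
};

int main() {
  const int64_t table_height = 100;
  const int64_t row_width = 4;
  std::vector<int64_t> ids = {7, 3, 7};  // ids looked up in the batch
  std::vector<float> d_output(ids.size() * row_width, 1.0f);  // upstream gradient

  ToySelectedRows d_table;
  d_table.rows = ids;  // one gradient row per id occurrence
  d_table.value.resize(ids.size() * row_width);
  d_table.height = table_height;
  std::memcpy(d_table.value.data(), d_output.data(),
              sizeof(float) * d_output.size());  // gradient rows are d_output verbatim

  std::cout << "grad rows: " << d_table.rows.size()
            << ", value elements: " << d_table.value.size() << "\n";
  return 0;
}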
...@@ -137,6 +137,8 @@ class NCCLTester : public ::testing::Test { ...@@ -137,6 +137,8 @@ class NCCLTester : public ::testing::Test {
TEST_F(NCCLTester, ncclInitOp) {} TEST_F(NCCLTester, ncclInitOp) {}
// ncclAllReduceOp with desc // ncclAllReduceOp with desc
// TODO(helin): https://github.com/PaddlePaddle/Paddle/issues/9367
/*
TEST_F(NCCLTester, ncclAllReduceOp) { TEST_F(NCCLTester, ncclAllReduceOp) {
std::unique_ptr<f::OpDesc> op2(new f::OpDesc); std::unique_ptr<f::OpDesc> op2(new f::OpDesc);
op2->SetType("ncclAllReduce"); op2->SetType("ncclAllReduce");
...@@ -184,6 +186,7 @@ TEST_F(NCCLTester, ncclAllReduceOp) { ...@@ -184,6 +186,7 @@ TEST_F(NCCLTester, ncclAllReduceOp) {
} }
} }
} }
*/
// ncclReduceOp with desc // ncclReduceOp with desc
TEST_F(NCCLTester, ncclReduceOp) { TEST_F(NCCLTester, ncclReduceOp) {
...@@ -236,6 +239,8 @@ TEST_F(NCCLTester, ncclReduceOp) { ...@@ -236,6 +239,8 @@ TEST_F(NCCLTester, ncclReduceOp) {
} }
// ncclBcastOp with desc // ncclBcastOp with desc
// TODO(helin): https://github.com/PaddlePaddle/Paddle/issues/9540
/*
TEST_F(NCCLTester, ncclBcastOp) { TEST_F(NCCLTester, ncclBcastOp) {
std::unique_ptr<f::OpDesc> op2(new f::OpDesc); std::unique_ptr<f::OpDesc> op2(new f::OpDesc);
const int kRoot = 0; const int kRoot = 0;
...@@ -281,3 +286,4 @@ TEST_F(NCCLTester, ncclBcastOp) { ...@@ -281,3 +286,4 @@ TEST_F(NCCLTester, ncclBcastOp) {
ASSERT_NEAR(ct[j], result, 1e-5); ASSERT_NEAR(ct[j], result, 1e-5);
} }
} }
*/
...@@ -20,12 +20,29 @@ namespace paddle { ...@@ -20,12 +20,29 @@ namespace paddle {
namespace operators { namespace operators {
namespace reader { namespace reader {
static constexpr size_t kDoubleBufferSize = 2; // 'Double buffer' means we shall maintain two batches of input data at the same
// time. So kCacheSize should be at least 2.
static constexpr size_t kCacheSize = 2;
// There will be two batches out of the channel during training:
// 1. the one waiting to be sent to the channel
// 2. the one just received from the channel, which is also being used by
// subsequent operators.
// So the channel size should be kCacheSize - 2
static constexpr size_t kChannelSize = 0; // kCacheSize - 2
class DoubleBufferReader : public framework::DecoratedReader { class DoubleBufferReader : public framework::DecoratedReader {
public: public:
struct Item { struct Item {
Item() : ctx_(nullptr) {} Item() : ctx_(nullptr) {}
Item(Item&& b) {
payloads_ = std::move(b.payloads_);
ctx_ = std::move(b.ctx_);
}
Item& operator=(Item&& b) {
payloads_ = std::move(b.payloads_);
ctx_ = std::move(b.ctx_);
return *this;
}
std::vector<framework::LoDTensor> payloads_; std::vector<framework::LoDTensor> payloads_;
platform::DeviceContext* ctx_; platform::DeviceContext* ctx_;
...@@ -34,42 +51,44 @@ class DoubleBufferReader : public framework::DecoratedReader { ...@@ -34,42 +51,44 @@ class DoubleBufferReader : public framework::DecoratedReader {
explicit DoubleBufferReader( explicit DoubleBufferReader(
ReaderBase* reader, platform::Place target_place = platform::CPUPlace()) ReaderBase* reader, platform::Place target_place = platform::CPUPlace())
: DecoratedReader(reader), place_(target_place) { : DecoratedReader(reader), place_(target_place) {
for (size_t i = 0; i < kDoubleBufferSize; ++i) {
if (platform::is_gpu_place(place_)) {
#ifdef PADDLE_WITH_CUDA #ifdef PADDLE_WITH_CUDA
for (size_t i = 0; i < kCacheSize; ++i) {
if (platform::is_gpu_place(place_)) {
ctxs_.emplace_back(new platform::CUDADeviceContext( ctxs_.emplace_back(new platform::CUDADeviceContext(
boost::get<platform::CUDAPlace>(place_))); boost::get<platform::CUDAPlace>(place_)));
#endif
} }
} }
#endif
start_thread(); StartPrefetcher();
}
void start_thread() {
buffer_ = framework::MakeChannel<Item>(kDoubleBufferSize);
prefetcher_ = std::thread([this] { PrefetchThreadFunc(); });
} }
bool HasNext() const override;
void ReadNext(std::vector<framework::LoDTensor>* out) override; void ReadNext(std::vector<framework::LoDTensor>* out) override;
void ReInit() override; void ReInit() override;
~DoubleBufferReader() { ~DoubleBufferReader() { EndPrefetcher(); }
buffer_->Close();
prefetcher_.join(); private:
delete buffer_; void StartPrefetcher() {
channel_ = framework::MakeChannel<Item>(kChannelSize);
prefetcher_ = std::thread([this] { PrefetchThreadFunc(); });
} }
bool HasNext() const override; void EndPrefetcher() {
channel_->Close();
if (prefetcher_.joinable()) {
prefetcher_.join();
}
delete channel_;
channel_ = nullptr;
}
private:
void PrefetchThreadFunc(); void PrefetchThreadFunc();
std::thread prefetcher_; std::thread prefetcher_;
framework::Channel<Item>* buffer_; framework::Channel<Item>* channel_;
platform::Place place_; platform::Place place_;
std::vector<std::unique_ptr<platform::DeviceContext>> ctxs_; std::vector<std::unique_ptr<platform::DeviceContext>> ctxs_;
mutable Item local_buffer_;
}; };
class CreateDoubleBufferReaderOp : public framework::OperatorBase { class CreateDoubleBufferReaderOp : public framework::OperatorBase {
...@@ -123,70 +142,70 @@ class CreateDoubleBufferReaderOpMaker : public DecoratedReaderMakerBase { ...@@ -123,70 +142,70 @@ class CreateDoubleBufferReaderOpMaker : public DecoratedReaderMakerBase {
} }
}; };
bool DoubleBufferReader::HasNext() const {
while (!channel_->IsClosed() && !channel_->CanReceive()) {
}
return channel_->CanReceive();
}
void DoubleBufferReader::ReadNext(std::vector<framework::LoDTensor>* out) { void DoubleBufferReader::ReadNext(std::vector<framework::LoDTensor>* out) {
if (!HasNext()) { if (!HasNext()) {
PADDLE_THROW("There is no next data!"); PADDLE_THROW("There is no next data!");
} }
if (local_buffer_.payloads_.empty()) { Item batch;
buffer_->Receive(&local_buffer_); channel_->Receive(&batch);
} *out = batch.payloads_;
*out = local_buffer_.payloads_; if (batch.ctx_) {
local_buffer_.payloads_.clear(); batch.ctx_->Wait();
if (local_buffer_.ctx_) {
local_buffer_.ctx_->Wait();
} }
} }
void DoubleBufferReader::ReInit() { void DoubleBufferReader::ReInit() {
reader_->ReInit(); reader_->ReInit();
buffer_->Close(); EndPrefetcher();
prefetcher_.join(); StartPrefetcher();
delete buffer_;
start_thread();
} }
void DoubleBufferReader::PrefetchThreadFunc() { void DoubleBufferReader::PrefetchThreadFunc() {
VLOG(5) << "A new prefetch thread starts."; VLOG(5) << "A new prefetch thread starts.";
size_t gpu_ctx_offset = 0; std::vector<std::vector<framework::LoDTensor>> cpu_tensor_cache(kCacheSize);
std::vector<std::vector<framework::LoDTensor>> gpu_tensor_cache(kCacheSize);
size_t cached_tensor_id = 0;
while (reader_->HasNext()) { while (reader_->HasNext()) {
Item batch; Item batch;
reader_->ReadNext(&batch.payloads_); auto& cpu_batch = cpu_tensor_cache[cached_tensor_id];
reader_->ReadNext(&cpu_batch);
if (platform::is_gpu_place(place_)) { if (platform::is_gpu_place(place_)) {
std::vector<framework::LoDTensor> gpu_batch; auto& gpu_batch = gpu_tensor_cache[cached_tensor_id];
auto& gpu_ctx = this->ctxs_[gpu_ctx_offset++]; auto* gpu_ctx = ctxs_[cached_tensor_id].get();
gpu_ctx_offset %= this->ctxs_.size(); gpu_batch.resize(cpu_batch.size());
gpu_batch.resize(batch.payloads_.size()); for (size_t i = 0; i < cpu_batch.size(); ++i) {
for (size_t i = 0; i < batch.payloads_.size(); ++i) { framework::TensorCopy(cpu_batch[i], place_, *gpu_ctx, &gpu_batch[i]);
framework::TensorCopy(batch.payloads_[i], place_, *gpu_ctx, gpu_batch[i].set_lod(cpu_batch[i].lod());
&gpu_batch[i]); }
gpu_batch[i].set_lod(batch.payloads_[i].lod()); batch.payloads_ = gpu_batch;
} batch.ctx_ = gpu_ctx;
batch.ctx_ = gpu_ctx.get(); } else {
std::swap(gpu_batch, batch.payloads_); // CPUPlace
batch.payloads_ = cpu_batch;
} }
++cached_tensor_id;
cached_tensor_id %= kCacheSize;
try { try {
buffer_->Send(&batch); channel_->Send(&batch);
} catch (paddle::platform::EnforceNotMet e) { } catch (paddle::platform::EnforceNotMet e) {
VLOG(5) << "WARNING: The double buffer channel has been closed. The " VLOG(5) << "WARNING: The double buffer channel has been closed. The "
"prefetch thread will terminate."; "prefetch thread will terminate.";
break; break;
} }
} }
buffer_->Close(); channel_->Close();
VLOG(5) << "Prefetch thread terminates."; VLOG(5) << "Prefetch thread terminates.";
} }
bool DoubleBufferReader::HasNext() const {
if (local_buffer_.payloads_.empty()) {
bool ok = buffer_->Receive(&local_buffer_);
return ok;
} else {
return true;
}
}
} // namespace reader } // namespace reader
} // namespace operators } // namespace operators
} // namespace paddle } // namespace paddle
......
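
The refactor replaces the single local_buffer_ with kCacheSize reusable batch slots that the prefetch thread fills round-robin and hands to consumers through a bounded channel, so a slot is never overwritten while downstream operators may still be reading it. The framework-free sketch below illustrates the invariant (channel capacity = kCacheSize - 2) with three slots and a capacity-one channel; Chan is a toy stand-in for framework::Channel and all names are illustrative.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

constexpr size_t kCacheSize = 3;  // batch slots alive at once (toy value)

// Toy bounded channel with capacity kCacheSize - 2 == 1.
class Chan {
 public:
  void Send(size_t v) {
    std::unique_lock<std::mutex> l(mu_);
    cv_send_.wait(l, [this] { return closed_ || !slot_.has_value(); });
    if (closed_) return;
    slot_ = v;
    cv_recv_.notify_one();
  }
  bool Receive(size_t* v) {
    std::unique_lock<std::mutex> l(mu_);
    cv_recv_.wait(l, [this] { return closed_ || slot_.has_value(); });
    if (!slot_.has_value()) return false;
    *v = *slot_;
    slot_.reset();
    cv_send_.notify_one();
    return true;
  }
  void Close() {
    std::lock_guard<std::mutex> l(mu_);
    closed_ = true;
    cv_send_.notify_all();
    cv_recv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_send_, cv_recv_;
  std::optional<size_t> slot_;
  bool closed_ = false;
};

int main() {
  std::vector<std::vector<float>> cache(kCacheSize);  // cpu_tensor_cache analogue
  Chan channel;

  std::thread prefetcher([&] {                 // PrefetchThreadFunc analogue
    size_t cached_tensor_id = 0;
    for (int batch = 0; batch < 5; ++batch) {  // stands in for reader_->HasNext()
      cache[cached_tensor_id].assign(4, static_cast<float>(batch));  // ReadNext
      channel.Send(cached_tensor_id);          // hand the filled slot downstream
      cached_tensor_id = (cached_tensor_id + 1) % kCacheSize;
    }
    channel.Close();
  });

  size_t slot;
  while (channel.Receive(&slot)) {             // consumer-side ReadNext analogue
    std::cout << "consumed batch in slot " << slot
              << ", first value " << cache[slot][0] << "\n";
  }
  prefetcher.join();
  return 0;
}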
...@@ -17,90 +17,66 @@ limitations under the License. */ ...@@ -17,90 +17,66 @@ limitations under the License. */
namespace paddle { namespace paddle {
namespace operators { namespace operators {
class ReshapeOp : public framework::OperatorWithKernel {
public:
ReshapeOp(const std::string &type, const framework::VariableNameMap &inputs,
const framework::VariableNameMap &outputs,
const framework::AttributeMap &attrs)
: OperatorWithKernel(type, inputs, outputs, attrs) {}
void InferShape(framework::InferShapeContext *ctx) const override {
// input check
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of ReshapeOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutput("Out"),
"Output(Out) of ReshapeOp should not be null.");
auto shape = ctx->Attrs().Get<std::vector<int>>("shape");
PADDLE_ENFORCE(shape.size() > 0, "Attr(shape) shouldn't be empty.");
auto x_dims = ctx->GetInputDim("X");
std::vector<size_t> neg_dims_idx;
// set some dimension to -1 if it is unknown
const int unknown_size = -1;
for (size_t i = 0; i < shape.size(); ++i) {
PADDLE_ENFORCE(shape[i] > 0 || shape[i] == unknown_size,
"Each dimension of Attr(shape) must be positive or %d.",
unknown_size);
if (shape[i] == unknown_size) {
neg_dims_idx.push_back(i);
PADDLE_ENFORCE(neg_dims_idx.size() <= 1,
"Only one dimension of Attr(shape) can be unknown.");
}
}
int64_t capacity =
std::accumulate(shape.begin(), shape.end(), 1, std::multiplies<int>());
int64_t in_size = framework::product(x_dims);
if (neg_dims_idx.size() == 1) {
// dim infer
shape[neg_dims_idx[0]] = in_size / (-capacity);
// recalculate capacity
capacity = shape[neg_dims_idx[0]] * (-capacity);
}
// capacity check
PADDLE_ENFORCE(capacity == in_size,
"The size of Input(X) mismatches with Attr(shape).");
// resize output
std::vector<int64_t> shape_int64(shape.size(), 0);
std::transform(shape.begin(), shape.end(), shape_int64.begin(),
[](int a) { return static_cast<int64_t>(a); });
auto out_dims = framework::make_ddim(shape_int64);
ctx->SetOutputDim("Out", out_dims);
if (shape[0] == x_dims[0]) {
// Only pass LoD when the first dimension is equal between
// output and input.
ctx->ShareLoD("X", /*->*/ "Out");
}
}
};
class ReshapeOpMaker : public framework::OpProtoAndCheckerMaker { class ReshapeOpMaker : public framework::OpProtoAndCheckerMaker {
public: public:
ReshapeOpMaker(OpProto *proto, OpAttrChecker *op_checker) ReshapeOpMaker(OpProto *proto, OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) { : OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "The input tensor of reshape operator."); AddInput("X", "(Tensor). The input tensor of reshape operator.");
AddOutput("Out", "The output tensor of reshape operator."); AddInput("Shape",
AddAttr<std::vector<int>>("shape", "(Tensor<int32>, optional). If provided, reshape according to "
"(vector<int>) " "this given shape. That is to say it has a higher priority than "
"Target shape of reshape operator."); "the shape attribute, while the shape attribute still should be "
"set correctly to gurantee shape inference in compile time.")
.AsDispensable();
AddOutput("Out", "(Tensor). The output tensor of reshape operator.");
AddAttr<std::vector<int>>(
"shape", "(std::vector<int>) Target shape of reshape operator.");
AddAttr<bool>("inplace", AddAttr<bool>("inplace",
"Change the source tensor's shape without copy memory.") "(default: false) Change the source tensor's shape without "
.SetDefault(true); "memory copy. When Attr(inplace) is set true, the output "
"tensor shares memory with Input(X), otherwise, a new output "
"tensor is created, and its data are copied from Input(x).")
.SetDefault(false);
AddComment(R"DOC( AddComment(R"DOC(
Reshape Operator. Reshape Operator.
Reshape Input(X) into the shape specified by Attr(shape). Reshape Input(X) into the shape specified by Attr(shape) or Input(Shape). The
data in Input(X) are unchanged.
Examples:
1. Given a 3-D tensor Input(X) with a shape [2, 4, 6], and the target shape
specified by Attr(shape) is [6, 8], the reshape operator will transform Input(X)
into a 2-D tensor with shape [6, 8], leaving Input(X)'s data unchanged.
2. Given a 3-D tensor Input(X) with a shape [2, 4, 6], and the target shape
specified by Attr(shape) is [2, 3, -1, 2], the reshape operator will transform
Input(X) into a 4-D tensor with shape [2, 3, 4, 2], leaving Input(X)'s data
unchanged. In this case, one and only one dimension of Attr(shape) can be set to -1;
the value of this dimension is inferred from the total element number of
Input(X) and remaining dimensions.
An example: 3. Given a 3-D tensor Input(X) with a shape [2, 4, 6], and the target shape
Given a 2-D tensor X with 2 rows and 2 columns : [[1, 2], [3, 4]] specified by Attr(shape) is [-1, 0, 3, 2], the reshape operator will transform
Input(X) into a 4-D tensor with shape [2, 4, 3, 2], leaving Input(X)'s data
unchanged. In this case, besides -1, 0 means the actual dimension value is going
to be copied from the corresponding dimension of Input(X).
and target shape = [1, 4], the reshape operator will transform Note:
the tensor X into a 2-D tensor: [[1, 2, 3, 4]]
1. One and only one dimension in Attr(shape) can be set to -1. In this case,
the actual dimension value will be inferred from the total element number of
Input(X) and remaining dimensions.
2. More than one dimension in Attr(shape) can be set to 0, which means the real
dimension value will be copied from Input(X) at runtime. Note that the index of
0 cannot exceed Rank(X). For example, Input(X) is a 3-D tensor with shape
[2, 3, 4], Attr(shape) = [2, 3, 2, 0] is an invalid input.
3. Input(Shape) has a higher priority than Attr(shape) if it is provided, while
Attr(shape) still should be set correctly to guarantee shape inference in
compile-time.
One dimension in the target shape can be set -1, representing that its
size is unknown. In this case, the real dimension will be infered from
the original shape of Input(X) and other dimensions in the target shape.
)DOC"); )DOC");
} }
}; };
...@@ -119,6 +95,14 @@ class ReshapeGradOp : public framework::OperatorWithKernel { ...@@ -119,6 +95,14 @@ class ReshapeGradOp : public framework::OperatorWithKernel {
"Input(Out@GRAD) shouldn't be null."); "Input(Out@GRAD) shouldn't be null.");
ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X")); ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
} }
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext &ctx) const override {
return framework::OpKernelType(
framework::ToDataType(ctx.Input<framework::LoDTensor>("X")->type()),
ctx.device_context());
}
}; };
} // namespace operators } // namespace operators
......
...@@ -20,17 +20,129 @@ limitations under the License. */ ...@@ -20,17 +20,129 @@ limitations under the License. */
namespace paddle { namespace paddle {
namespace operators { namespace operators {
class ReshapeOp : public framework::OperatorWithKernel {
public:
ReshapeOp(const std::string &type, const framework::VariableNameMap &inputs,
const framework::VariableNameMap &outputs,
const framework::AttributeMap &attrs)
: OperatorWithKernel(type, inputs, outputs, attrs) {}
void InferShape(framework::InferShapeContext *ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of ReshapeOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutput("Out"),
"Output(Out) of ReshapeOp should not be null.");
const std::vector<int> &shape = ctx->Attrs().Get<std::vector<int>>("shape");
PADDLE_ENFORCE(!shape.empty(),
"The shape information must be set by Attr(shape).");
if (ctx->HasInput("Shape") && ctx->IsRuntime()) {
// If true, set the shape of Output(Out) according to Input(Shape) in
// ReshapeKernel with ExecutionContext. Also check LoD in ReshapeKernel.
ctx->ShareLoD("X", /*->*/ "Out");
return;
}
auto x_dims = ctx->GetInputDim("X");
auto out_dims = ValidateShape(shape, x_dims);
ctx->SetOutputDim("Out", out_dims);
if (x_dims[0] == out_dims[0]) {
// Only pass LoD when the first dimension of output and Input(X)
// are the same.
ctx->ShareLoD("X", /*->*/ "Out");
}
}
static framework::DDim ValidateShape(const std::vector<int> shape,
const framework::DDim &in_dims) {
const int64_t in_size = framework::product(in_dims);
// only one dimension can be set to -1, whose size will be automatically
// inferred.
const int64_t unk_dim_val = -1;
const int64_t copy_dim_val = 0;
std::vector<int64_t> output_shape(shape.size(), 0);
int64_t capacity = 1;
int unk_dim_idx = -1;
for (size_t i = 0; i < shape.size(); ++i) {
if (shape[i] == unk_dim_val) {
PADDLE_ENFORCE(
unk_dim_idx == -1,
"Only one input dimension of Attr(shape) can be unknown.");
unk_dim_idx = i;
} else if (shape[i] == copy_dim_val) {
PADDLE_ENFORCE(
static_cast<int>(i) < in_dims.size(),
"The index of dimension to copy from input shape must be less "
"than the size of input shape.");
} else {
PADDLE_ENFORCE(
shape[i] > 0,
"Each input dimension of Attr(shape) must not be negtive except "
"one unknown dimension.");
}
capacity *= (shape[i] ? shape[i] : in_dims[i]);
output_shape[i] =
(shape[i] ? static_cast<int64_t>(shape[i]) : in_dims[i]);
}
if (unk_dim_idx != -1) {
output_shape[unk_dim_idx] = -in_size / capacity;
PADDLE_ENFORCE_EQ(output_shape[unk_dim_idx] * capacity, -in_size,
"Invalid shape is given.");
} else {
PADDLE_ENFORCE_EQ(capacity, in_size, "Invalid shape is given.");
}
return framework::make_ddim(output_shape);
}
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext &ctx) const override {
return framework::OpKernelType(
framework::ToDataType(ctx.Input<framework::LoDTensor>("X")->type()),
ctx.device_context());
}
};
template <typename DeviceContext, typename T> template <typename DeviceContext, typename T>
class ReshapeKernel : public framework::OpKernel<T> { class ReshapeKernel : public framework::OpKernel<T> {
public: public:
void Compute(const framework::ExecutionContext& ctx) const { void Compute(const framework::ExecutionContext &ctx) const {
auto* out = ctx.Output<framework::Tensor>("Out"); auto *out = ctx.Output<framework::LoDTensor>("Out");
auto* in = ctx.Input<framework::Tensor>("X"); auto *in = ctx.Input<framework::LoDTensor>("X");
auto *shape_tensor = ctx.Input<framework::LoDTensor>("Shape");
framework::DDim out_dims = out->dims();
if (shape_tensor) {
auto *shape_data = shape_tensor->data<int>();
if (platform::is_gpu_place(ctx.GetPlace())) {
framework::Tensor cpu_shape_tensor;
TensorCopy(*shape_tensor, platform::CPUPlace(), ctx.device_context(),
&cpu_shape_tensor);
shape_data = cpu_shape_tensor.data<int>();
}
auto shape =
std::vector<int>(shape_data, shape_data + shape_tensor->numel());
out_dims = ReshapeOp::ValidateShape(shape, in->dims());
}
if (!in->lod().empty()) {
PADDLE_ENFORCE_EQ(
out_dims[0], in->dims()[0],
"Reshape operator cannot reshape an input sequence batch "
"into an output sequence batch that has a different "
"number of time steps. Please consider using "
"sequence_reshape op.");
}
bool inplace = ctx.Attr<bool>("inplace"); bool inplace = ctx.Attr<bool>("inplace");
auto out_dims = out->dims(); out->Resize(out_dims);
if (!inplace) { if (!inplace) {
out->mutable_data<T>(ctx.GetPlace()); out->mutable_data<T>(ctx.GetPlace());
framework::TensorCopy(*in, ctx.GetPlace(), ctx.device_context(), out); framework::TensorCopy(*in, ctx.GetPlace(), ctx.device_context(), out);
// TensorCopy will resize to in_dims.
out->Resize(out_dims); out->Resize(out_dims);
} else { } else {
out->ShareDataWith(*in); out->ShareDataWith(*in);
...@@ -42,9 +154,10 @@ class ReshapeKernel : public framework::OpKernel<T> { ...@@ -42,9 +154,10 @@ class ReshapeKernel : public framework::OpKernel<T> {
template <typename DeviceContext, typename T> template <typename DeviceContext, typename T>
class ReshapeGradKernel : public framework::OpKernel<T> { class ReshapeGradKernel : public framework::OpKernel<T> {
public: public:
void Compute(const framework::ExecutionContext& ctx) const { void Compute(const framework::ExecutionContext &ctx) const {
auto* d_out = ctx.Input<framework::Tensor>(framework::GradVarName("Out")); auto *d_out = ctx.Input<framework::Tensor>(framework::GradVarName("Out"));
auto* d_x = ctx.Output<framework::Tensor>(framework::GradVarName("X")); auto *d_x = ctx.Output<framework::Tensor>(framework::GradVarName("X"));
d_x->mutable_data<T>(ctx.GetPlace()); d_x->mutable_data<T>(ctx.GetPlace());
bool inplace = ctx.Attr<bool>("inplace"); bool inplace = ctx.Attr<bool>("inplace");
......
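
The -1 / 0 conventions implemented by ValidateShape are easiest to check on concrete numbers. The standalone re-implementation below (not the Paddle code itself, with plain vectors standing in for framework::DDim) reproduces the three examples from the op documentation above.

#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int64_t> InferShape(const std::vector<int>& shape,
                                const std::vector<int64_t>& in_dims) {
  int64_t in_size = 1;
  for (int64_t d : in_dims) in_size *= d;

  std::vector<int64_t> out(shape.size());
  int64_t capacity = 1;
  int unk_idx = -1;
  for (size_t i = 0; i < shape.size(); ++i) {
    if (shape[i] == -1) {
      assert(unk_idx == -1 && "only one dimension can be -1");
      unk_idx = static_cast<int>(i);
      out[i] = -1;                       // filled in below
    } else if (shape[i] == 0) {
      assert(i < in_dims.size() && "0 must index an existing input dim");
      out[i] = in_dims[i];               // copy the dimension from the input
    } else {
      out[i] = shape[i];
    }
    capacity *= (out[i] == -1 ? 1 : out[i]);
  }
  if (unk_idx >= 0) {
    out[unk_idx] = in_size / capacity;   // infer the single unknown dimension
    assert(out[unk_idx] * capacity == in_size && "invalid shape");
  } else {
    assert(capacity == in_size && "invalid shape");
  }
  return out;
}

int main() {
  std::vector<int64_t> x = {2, 4, 6};
  assert((InferShape({6, 8}, x) == std::vector<int64_t>{6, 8}));
  assert((InferShape({2, 3, -1, 2}, x) == std::vector<int64_t>{2, 3, 4, 2}));
  assert((InferShape({-1, 0, 3, 2}, x) == std::vector<int64_t>{2, 4, 3, 2}));
  return 0;
}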
...@@ -504,10 +504,10 @@ All parameter, weight, gradient are variables in Paddle. ...@@ -504,10 +504,10 @@ All parameter, weight, gradient are variables in Paddle.
const std::unordered_set<std::string> &params, const std::unordered_set<std::string> &params,
const ProgramDesc &startup_program, const ProgramDesc &startup_program,
const ProgramDesc &main_program, const std::string &loss_var_name, const ProgramDesc &main_program, const std::string &loss_var_name,
Scope *scope) { Scope *scope, bool allow_op_delay) {
new (&self) ParallelExecutor(num_threads, use_event, places, new (&self) ParallelExecutor(num_threads, use_event, places,
params, startup_program, main_program, params, startup_program, main_program,
loss_var_name, scope); loss_var_name, scope, allow_op_delay);
}) })
.def("run", &ParallelExecutor::Run); .def("run", &ParallelExecutor::Run);
......
...@@ -104,8 +104,10 @@ EOF ...@@ -104,8 +104,10 @@ EOF
# make install should also be test when unittest # make install should also be test when unittest
make install -j `nproc` make install -j `nproc`
pip install /usr/local/opt/paddle/share/wheels/*.whl pip install /usr/local/opt/paddle/share/wheels/*.whl
if [[ ${WITH_FLUID_ONLY:-OFF} == "OFF" ]] ; then
paddle version paddle version
fi fi
fi
} }
...@@ -183,6 +185,14 @@ EOF ...@@ -183,6 +185,14 @@ EOF
NCCL_DEPS="" NCCL_DEPS=""
fi fi
if [[ ${WITH_FLUID_ONLY:-OFF} == "OFF" ]]; then
PADDLE_VERSION="paddle version"
CMD='"paddle", "version"'
else
PADDLE_VERSION="true"
CMD='"true"'
fi
cat >> /paddle/build/Dockerfile <<EOF cat >> /paddle/build/Dockerfile <<EOF
ADD python/dist/*.whl / ADD python/dist/*.whl /
# run paddle version to install python packages first # run paddle version to install python packages first
...@@ -192,7 +202,7 @@ EOF ...@@ -192,7 +202,7 @@ EOF
pip install /*.whl; apt-get install -f -y && \ pip install /*.whl; apt-get install -f -y && \
apt-get clean -y && \ apt-get clean -y && \
rm -f /*.whl && \ rm -f /*.whl && \
paddle version && \ ${PADDLE_VERSION} && \
ldconfig ldconfig
${DOCKERFILE_CUDNN_DSO} ${DOCKERFILE_CUDNN_DSO}
${DOCKERFILE_GPU_ENV} ${DOCKERFILE_GPU_ENV}
...@@ -200,7 +210,7 @@ EOF ...@@ -200,7 +210,7 @@ EOF
ADD go/cmd/pserver/pserver /usr/bin/ ADD go/cmd/pserver/pserver /usr/bin/
ADD go/cmd/master/master /usr/bin/ ADD go/cmd/master/master /usr/bin/
# default command shows the paddle version and exit # default command shows the paddle version and exit
CMD ["paddle", "version"] CMD [${CMD}]
EOF EOF
} }
......
...@@ -81,6 +81,7 @@ if (WITH_TESTING) ...@@ -81,6 +81,7 @@ if (WITH_TESTING)
# enable v2 API unittest only when paddle swig api is compiled # enable v2 API unittest only when paddle swig api is compiled
add_subdirectory(paddle/v2/tests) add_subdirectory(paddle/v2/tests)
add_subdirectory(paddle/v2/plot/tests) add_subdirectory(paddle/v2/plot/tests)
add_subdirectory(paddle/v2/reader/tests)
endif() endif()
endif() endif()
add_subdirectory(paddle/fluid/tests) add_subdirectory(paddle/fluid/tests)
......
...@@ -37,7 +37,7 @@ __all__ = [ ...@@ -37,7 +37,7 @@ __all__ = [
'cifar', 'cifar',
'movielens', 'movielens',
'conll05', 'conll05',
'sentiment' 'sentiment',
'uci_housing', 'uci_housing',
'wmt14', 'wmt14',
'wmt16', 'wmt16',
......
...@@ -276,12 +276,15 @@ class DistributeTranspiler: ...@@ -276,12 +276,15 @@ class DistributeTranspiler:
suff_idx = v.name.find(".trainer_") suff_idx = v.name.find(".trainer_")
if suff_idx >= 0: if suff_idx >= 0:
orig_var_name = v.name[:suff_idx] orig_var_name = v.name[:suff_idx]
pserver_program.global_block().create_var( else:
orig_var_name = v.name
single_trainer_var = pserver_program.global_block().create_var(
name=orig_var_name, name=orig_var_name,
persistable=True, persistable=True,
type=v.type, type=v.type,
dtype=v.dtype, dtype=v.dtype,
shape=v.shape) shape=v.shape)
if self.trainers > 1:
for trainer_id in xrange(self.trainers): for trainer_id in xrange(self.trainers):
var = pserver_program.global_block().create_var( var = pserver_program.global_block().create_var(
name="%s.trainer_%d" % (orig_var_name, trainer_id), name="%s.trainer_%d" % (orig_var_name, trainer_id),
...@@ -290,6 +293,8 @@ class DistributeTranspiler: ...@@ -290,6 +293,8 @@ class DistributeTranspiler:
dtype=v.dtype, dtype=v.dtype,
shape=v.shape) shape=v.shape)
recv_inputs.append(var) recv_inputs.append(var)
else:
recv_inputs.append(single_trainer_var)
# step3 # step3
optimize_block = pserver_program.create_block(0) optimize_block = pserver_program.create_block(0)
...@@ -511,8 +516,11 @@ class DistributeTranspiler: ...@@ -511,8 +516,11 @@ class DistributeTranspiler:
def _append_split_op(self, program, gradblocks): def _append_split_op(self, program, gradblocks):
# Split variables that need to be split and append respective ops # Split variables that need to be split and append respective ops
add_suffix = False
if self.trainers > 1:
add_suffix = True
var_mapping = self._create_vars_from_blocklist( var_mapping = self._create_vars_from_blocklist(
program, gradblocks, add_trainer_suffix=True) program, gradblocks, add_trainer_suffix=add_suffix)
for varname, splited_vars in var_mapping.iteritems(): for varname, splited_vars in var_mapping.iteritems():
# variable that don't need to split have empty splited_vars # variable that don't need to split have empty splited_vars
if len(splited_vars) <= 1: if len(splited_vars) <= 1:
......
...@@ -48,8 +48,7 @@ def as_numpy(tensor): ...@@ -48,8 +48,7 @@ def as_numpy(tensor):
assert isinstance(tensor, core.LoDTensor) assert isinstance(tensor, core.LoDTensor)
lod = tensor.lod() lod = tensor.lod()
if len(lod) > 0: if len(lod) > 0:
raise RuntimeError( raise RuntimeError("Some of your fetched tensors hold LoD information. \
"Some of your featched tensors hold LoD information. \
They can not be completely cast to Python ndarray. \ They can not be completely cast to Python ndarray. \
Please set the parameter 'return_numpy' as 'False' to \ Please set the parameter 'return_numpy' as 'False' to \
return LoDTensor itself directly.") return LoDTensor itself directly.")
...@@ -180,59 +179,23 @@ def get_program_cache_key(feed, fetch_list): ...@@ -180,59 +179,23 @@ def get_program_cache_key(feed, fetch_list):
class Executor(object): class Executor(object):
def __init__(self, places): def __init__(self, place):
if not isinstance(places, list) and not isinstance(places, tuple): self.place = place
places = [places]
act_places = []
for each in places:
p = core.Place() p = core.Place()
p.set_place(each) p.set_place(place)
act_places.append(p) self.executor = core.Executor(p)
# TODO(dzhwinter) : only use the first place
self.executor = core.Executor(act_places[0])
self.places = places
self.program_caches = dict() self.program_caches = dict()
def aslodtensor(self, data): def as_lodtensor(self, data):
def accumulate(data): if isinstance(data, list):
if not isinstance(data, list): raise RuntimeError("Some of your feed data hold LoD information. \
return 1 They can not be completely cast from a list of Python \
return sum([accumulate(sub) for sub in data]) ndarray to LoDTensor. Please convert data to LoDTensor \
directly before feeding the data.\
def parselod(data): ")
seq_lens = [accumulate(seq) for seq in data] # single tensor case
cur_len = 0
lod = [cur_len]
for l in seq_lens:
cur_len += l
lod.append(cur_len)
return lod
assert len(self.places) != 0
if not isinstance(data, list):
# pure tensor case
tensor = core.LoDTensor()
tensor.set(data, self.places[0])
return tensor
else:
raise RuntimeError("Current implementation lacks unittests")
# lodtensor case
lod = []
if not isinstance(data[0], list):
lod.append(parselod(data))
flattened_data = np.concatenate(data, axis=0).astype("int64")
else:
while isinstance(data[0], list):
lod.append(parselod(seq))
flattened_data = [item for seq in data for item in seq]
data = flattened_data
flattened_data = np.concatenate(data, axis=0).astype("int64")
flattened_data = flattened_data.reshape([len(flattened_data), 1])
tensor = core.LoDTensor() tensor = core.LoDTensor()
tensor.set(flattened_data, self.places[0]) tensor.set(data, self.place)
tensor.set_lod(lod)
return tensor return tensor
def _get_program_cache(self, program_cache_key): def _get_program_cache(self, program_cache_key):
...@@ -293,7 +256,7 @@ class Executor(object): ...@@ -293,7 +256,7 @@ class Executor(object):
feed_target_name = op.desc.output('Out')[0] feed_target_name = op.desc.output('Out')[0]
cur_feed = feed[feed_target_name] cur_feed = feed[feed_target_name]
if not isinstance(cur_feed, core.LoDTensor): if not isinstance(cur_feed, core.LoDTensor):
cur_feed = self.aslodtensor(cur_feed) cur_feed = self.as_lodtensor(cur_feed)
idx = op.desc.attr('col') idx = op.desc.attr('col')
core.set_feed_variable(scope, cur_feed, feed_var_name, idx) core.set_feed_variable(scope, cur_feed, feed_var_name, idx)
else: else:
......
...@@ -19,7 +19,6 @@ from layer_function_generator import generate_layer_fn ...@@ -19,7 +19,6 @@ from layer_function_generator import generate_layer_fn
from layer_function_generator import autodoc from layer_function_generator import autodoc
from ..layer_helper import LayerHelper from ..layer_helper import LayerHelper
import tensor import tensor
import ops
import nn import nn
import math import math
...@@ -130,9 +129,9 @@ def detection_output(loc, ...@@ -130,9 +129,9 @@ def detection_output(loc,
target_box=loc, target_box=loc,
code_type='decode_center_size') code_type='decode_center_size')
old_shape = scores.shape old_shape = scores.shape
scores = ops.reshape(x=scores, shape=(-1, old_shape[-1])) scores = nn.reshape(x=scores, shape=(-1, old_shape[-1]))
scores = nn.softmax(input=scores) scores = nn.softmax(input=scores)
scores = ops.reshape(x=scores, shape=old_shape) scores = nn.reshape(x=scores, shape=old_shape)
scores = nn.transpose(scores, perm=[0, 2, 1]) scores = nn.transpose(scores, perm=[0, 2, 1])
scores.stop_gradient = True scores.stop_gradient = True
nmsed_outs = helper.create_tmp_variable(dtype=decoded_box.dtype) nmsed_outs = helper.create_tmp_variable(dtype=decoded_box.dtype)
...@@ -463,7 +462,7 @@ def ssd_loss(location, ...@@ -463,7 +462,7 @@ def ssd_loss(location,
num, num_prior, num_class = confidence.shape num, num_prior, num_class = confidence.shape
def __reshape_to_2d(var): def __reshape_to_2d(var):
return ops.reshape(x=var, shape=[-1, var.shape[-1]]) return nn.reshape(x=var, shape=[-1, var.shape[-1]])
# 1. Find matched boundding box by prior box. # 1. Find matched boundding box by prior box.
# 1.1 Compute IOU similarity between ground-truth boxes and prior boxes. # 1.1 Compute IOU similarity between ground-truth boxes and prior boxes.
...@@ -474,7 +473,7 @@ def ssd_loss(location, ...@@ -474,7 +473,7 @@ def ssd_loss(location,
# 2. Compute confidence for mining hard examples # 2. Compute confidence for mining hard examples
# 2.1. Get the target label based on matched indices # 2.1. Get the target label based on matched indices
gt_label = ops.reshape(x=gt_label, shape=gt_label.shape + (1, )) gt_label = nn.reshape(x=gt_label, shape=gt_label.shape + (1, ))
gt_label.stop_gradient = True gt_label.stop_gradient = True
target_label, _ = target_assign( target_label, _ = target_assign(
gt_label, matched_indices, mismatch_value=background_label) gt_label, matched_indices, mismatch_value=background_label)
...@@ -487,7 +486,7 @@ def ssd_loss(location, ...@@ -487,7 +486,7 @@ def ssd_loss(location,
conf_loss = nn.softmax_with_cross_entropy(confidence, target_label) conf_loss = nn.softmax_with_cross_entropy(confidence, target_label)
# 3. Mining hard examples # 3. Mining hard examples
conf_loss = ops.reshape(x=conf_loss, shape=(num, num_prior)) conf_loss = nn.reshape(x=conf_loss, shape=(num, num_prior))
conf_loss.stop_gradient = True conf_loss.stop_gradient = True
neg_indices = helper.create_tmp_variable(dtype='int32') neg_indices = helper.create_tmp_variable(dtype='int32')
dtype = matched_indices.dtype dtype = matched_indices.dtype
...@@ -556,7 +555,7 @@ def ssd_loss(location, ...@@ -556,7 +555,7 @@ def ssd_loss(location,
# 5.3 Compute overall weighted loss. # 5.3 Compute overall weighted loss.
loss = conf_loss_weight * conf_loss + loc_loss_weight * loc_loss loss = conf_loss_weight * conf_loss + loc_loss_weight * loc_loss
# reshape to [N, Np], N is the batch size and Np is the prior box number. # reshape to [N, Np], N is the batch size and Np is the prior box number.
loss = ops.reshape(x=loss, shape=[-1, num_prior]) loss = nn.reshape(x=loss, shape=[-1, num_prior])
loss = nn.reduce_sum(loss, dim=1, keep_dim=True) loss = nn.reduce_sum(loss, dim=1, keep_dim=True)
if normalize: if normalize:
normalizer = nn.reduce_sum(target_loc_weight) normalizer = nn.reduce_sum(target_loc_weight)
...@@ -709,7 +708,7 @@ def multi_box_head(inputs, ...@@ -709,7 +708,7 @@ def multi_box_head(inputs,
new_shape = [ new_shape = [
-1, reduce(lambda x, y: x * y, input.shape[axis:len(input.shape)]) -1, reduce(lambda x, y: x * y, input.shape[axis:len(input.shape)])
] ]
out = ops.reshape(x=input, shape=new_shape) out = nn.reshape(x=input, shape=new_shape)
return out return out
def _is_list_or_tuple_(data): def _is_list_or_tuple_(data):
...@@ -803,7 +802,7 @@ def multi_box_head(inputs, ...@@ -803,7 +802,7 @@ def multi_box_head(inputs,
mbox_loc.shape[0], mbox_loc.shape[0],
mbox_loc.shape[1] * mbox_loc.shape[2] * mbox_loc.shape[3] / 4, 4 mbox_loc.shape[1] * mbox_loc.shape[2] * mbox_loc.shape[3] / 4, 4
] ]
mbox_loc_flatten = ops.reshape(mbox_loc, shape=new_shape) mbox_loc_flatten = nn.reshape(mbox_loc, shape=new_shape)
mbox_locs.append(mbox_loc_flatten) mbox_locs.append(mbox_loc_flatten)
# get conf # get conf
...@@ -819,7 +818,7 @@ def multi_box_head(inputs, ...@@ -819,7 +818,7 @@ def multi_box_head(inputs,
conf_loc.shape[0], conf_loc.shape[1] * conf_loc.shape[2] * conf_loc.shape[0], conf_loc.shape[1] * conf_loc.shape[2] *
conf_loc.shape[3] / num_classes, num_classes conf_loc.shape[3] / num_classes, num_classes
] ]
conf_loc_flatten = ops.reshape(conf_loc, shape=new_shape) conf_loc_flatten = nn.reshape(conf_loc, shape=new_shape)
mbox_confs.append(conf_loc_flatten) mbox_confs.append(conf_loc_flatten)
if len(box_results) == 1: if len(box_results) == 1:
......
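The detection_output change above keeps the existing pattern of applying softmax over the class dimension by flattening the score tensor to 2-D first and restoring its original shape afterwards; only the reshape call moves from ops to nn. A small numpy sketch of that flatten, softmax, restore pattern, for illustration only (softmax_over_last_dim is a hypothetical name, not Fluid code):

    import numpy as np

    def softmax_over_last_dim(scores):
        """Apply softmax along the last axis by temporarily flattening
        all leading dimensions into rows, then restoring the shape."""
        old_shape = scores.shape
        flat = scores.reshape(-1, old_shape[-1])
        flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(flat)
        probs = exp / exp.sum(axis=1, keepdims=True)
        return probs.reshape(old_shape)

    x = np.random.rand(2, 3, 4).astype("float32")
    y = softmax_over_last_dim(x)
    assert np.allclose(y.sum(axis=-1), 1.0, atol=1e-5)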
...@@ -73,8 +73,10 @@ __all__ = [ ...@@ -73,8 +73,10 @@ __all__ = [
'smooth_l1', 'smooth_l1',
'one_hot', 'one_hot',
'autoincreased_step_counter', 'autoincreased_step_counter',
'reshape',
'lod_reset', 'lod_reset',
'lrn', 'lrn',
'pad',
] ]
...@@ -3265,6 +3267,8 @@ def one_hot(input, depth): ...@@ -3265,6 +3267,8 @@ def one_hot(input, depth):
The one-hot tensor or LodTensor, same as input. The one-hot tensor or LodTensor, same as input.
Examples: Examples:
.. code-block:: python
X is a LoDTensor: X is a LoDTensor:
X.lod = [[0, 1, 4]] X.lod = [[0, 1, 4]]
X.shape = [4, 1] X.shape = [4, 1]
...@@ -3319,6 +3323,102 @@ def autoincreased_step_counter(counter_name=None, begin=1, step=1): ...@@ -3319,6 +3323,102 @@ def autoincreased_step_counter(counter_name=None, begin=1, step=1):
return counter return counter
def reshape(x, shape, actual_shape=None, act=None, inplace=True, name=None):
"""
Gives a new shape to the input Tensor without changing its data.
The target shape can be given by :attr:`shape` or :attr:`actual_shape`.
:attr:`shape` is a list of integers while :attr:`actual_shape` is a tensor
variable. :attr:`actual_shape` has a higher priority than :attr:`shape`
if it is provided, while :attr:`shape` still should be set correctly to
guarantee shape inference at compile time.
Some tricks exist when specifying the target shape.
1. -1 means the value of this dimension is inferred from the total element
number of x and the remaining dimensions. Thus one and only one dimension can
be set to -1.
2. 0 means the actual dimension value is going to be copied from the
corresponding dimension of x. The indices of 0s in shape can not exceed
Rank(x).
Here are some examples to explain it.
1. Given a 3-D tensor x with a shape [2, 4, 6], and the target shape
is [6, 8], the reshape operator will transform x into a 2-D tensor with
shape [6, 8], leaving x's data unchanged.
2. Given a 3-D tensor x with a shape [2, 4, 6], and the target shape
specified is [2, 3, -1, 2], the reshape operator will transform x into a
4-D tensor with shape [2, 3, 4, 2], leaving x's data unchanged. In this
case, one dimension of the target shape is set to -1, and the value of this
dimension is inferred from the total element number of x and the remaining
dimensions.
3. Given a 3-D tensor x with a shape [2, 4, 6], and the target shape
is [-1, 0, 3, 2], the reshape operator will transform x into a 4-D tensor
with shape [2, 4, 3, 2], leaving x's data unchanged. In this case,
besides -1, the 0 means the actual dimension value is copied from the
corresponding dimension of x.
Args:
x(variable): The input tensor.
shape(list): The new shape. At most one dimension of the new shape can
be -1.
actual_shape(variable): An optional input. If provided, reshape
according to this shape rather than the one
given by :attr:`shape`. That is to say,
:attr:`actual_shape` has a higher priority
than :attr:`shape`.
act (str): The non-linear activation to be applied to output variable.
inplace(bool): If this flag is set to True, the output shares data with
the input without copying; otherwise, a new output tensor is
created whose data is copied from the input x.
Returns:
Variable: The output tensor.
Examples:
.. code-block:: python
data = fluid.layers.data(
name='data', shape=[2, 4, 6], dtype='float32')
reshaped = fluid.layers.reshape(
x=data, shape=[-1, 0, 3, 2], act='tanh', inplace=True)
"""
if not isinstance(shape, (list, tuple)):
raise ValueError("Input shape must be a Python list or tuple.")
# Validate the shape
unk_dim_idx = -1
for dim_idx, dim_size in enumerate(shape):
if dim_size == -1:
assert unk_dim_idx == -1, (
"Only one dimension in shape can be unknown.")
unk_dim_idx = dim_idx
elif dim_size == 0:
assert dim_idx < len(x.shape), (
"The indice of 0s in shape can not exceed Rank(X).")
else:
assert dim_size > 0, (
"Each dimension size given in shape must not be negtive "
"except one unknown dimension.")
helper = LayerHelper("reshape", **locals())
reshaped = helper.create_tmp_variable(dtype=x.dtype)
helper.append_op(
type="reshape",
inputs={"X": x,
"Shape": actual_shape}
if isinstance(actual_shape, Variable) else {"X": x},
attrs={"shape": shape,
"inplace": inplace},
outputs={"Out": reshaped})
return helper.append_activation(reshaped)
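The -1 and 0 rules described in the docstring amount to a small compile-time shape-inference step. A plain-Python restatement of those rules, using a hypothetical helper name (it is not part of the layer above):

    def infer_target_shape(x_shape, shape):
        """Resolve -1 and 0 entries of a reshape target against the
        known input shape (hypothetical helper, plain Python)."""
        out = []
        unk_dim_idx = -1
        for i, d in enumerate(shape):
            if d == 0:
                out.append(x_shape[i])      # 0 copies the matching dim of x
            elif d == -1:
                unk_dim_idx = i
                out.append(-1)              # filled in below
            else:
                out.append(d)
        if unk_dim_idx >= 0:
            total = 1
            for d in x_shape:
                total *= d
            known = 1
            for d in out:
                if d != -1:
                    known *= d
            out[unk_dim_idx] = total // known
        return out

    # Mirrors example 3 from the docstring above.
    assert infer_target_shape([2, 4, 6], [-1, 0, 3, 2]) == [2, 4, 3, 2]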
def lod_reset(x, y=None, target_lod=None): def lod_reset(x, y=None, target_lod=None):
""" """
LoD Reset Operator. Set LoD of **x** to a new one specified by **y** or LoD Reset Operator. Set LoD of **x** to a new one specified by **y** or
...@@ -3482,3 +3582,62 @@ def lrn(input, n=5, k=1.0, alpha=1e-4, beta=0.75, name=None): ...@@ -3482,3 +3582,62 @@ def lrn(input, n=5, k=1.0, alpha=1e-4, beta=0.75, name=None):
"beta": beta}) "beta": beta})
return lrn_out return lrn_out
def pad(x, paddings, pad_value=0., name=None):
"""
Pads a tensor with a constant value given by :attr:`pad_value`, and the
padded width is specified by :attr:`paddings`.
Specifically, the number of values padded before the contents of :attr:`x`
in dimension :attr:`i` is indicated by :attr:`paddings[2*i]`, and the number
of values padded after the contents of :attr:`x` in dimension :attr:`i` is
indicated by :attr:`paddings[2*i+1]`.
See below for an example.
.. code-block:: text
Given:
x = [[1, 2], [3, 4]]
paddings = [0, 1, 1, 2]
pad_value = 0
Return:
out = [[0, 1, 2, 0, 0]
[0, 3, 4, 0, 0]
[0, 0, 0, 0, 0]]
Args:
x (Variable): The input tensor variable.
paddings (list): A list of integers. Its elements specify the padded
width before and after for each dimension in turn.
The length of :attr:`paddings` must be
:math:`rank(x) \\times 2`.
pad_value (float): The constant value used to pad.
name(str|None): A name for this layer (optional). If set to None, the layer
will be named automatically.
Returns:
Variable: The padded tensor variable.
Examples:
.. code-block:: python
# x is a rank 2 tensor variable.
out = fluid.layers.pad(
x=x, paddings=[0, 1, 1, 2], pad_value=0.)
"""
helper = LayerHelper('pad', input=x, **locals())
dtype = helper.input_dtype()
out = helper.create_tmp_variable(dtype)
helper.append_op(
type='pad',
inputs={'X': x},
outputs={'Out': out},
attrs={'paddings': paddings,
'pad_value': float(pad_value)})
return out
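Assuming the semantics documented above, the docstring example can be reproduced with numpy by regrouping the flat paddings list into (before, after) pairs per dimension; a short illustrative sketch:

    import numpy as np

    x = np.array([[1, 2], [3, 4]])
    paddings = [0, 1, 1, 2]
    # Regroup the flat list into per-dimension (before, after) pairs.
    pairs = list(zip(paddings[0::2], paddings[1::2]))   # [(0, 1), (1, 2)]
    out = np.pad(x, pairs, mode="constant", constant_values=0)
    expected = np.array([[0, 1, 2, 0, 0],
                         [0, 3, 4, 0, 0],
                         [0, 0, 0, 0, 0]])
    assert (out == expected).all()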
...@@ -49,7 +49,6 @@ __activations__ = [ ...@@ -49,7 +49,6 @@ __activations__ = [
__all__ = [ __all__ = [
'mean', 'mean',
'mul', 'mul',
'reshape',
'scale', 'scale',
'sigmoid_cross_entropy_with_logits', 'sigmoid_cross_entropy_with_logits',
'elementwise_add', 'elementwise_add',
......
...@@ -21,7 +21,11 @@ __all__ = ['ParallelExecutor'] ...@@ -21,7 +21,11 @@ __all__ = ['ParallelExecutor']
class ParallelExecutor(object): class ParallelExecutor(object):
def __init__(self, loss_name, use_cuda, num_threads=None): def __init__(self,
loss_name,
use_cuda,
num_threads=None,
allow_op_delay=False):
places = [] places = []
if use_cuda: if use_cuda:
for i in xrange(core.get_cuda_device_count()): for i in xrange(core.get_cuda_device_count()):
...@@ -35,7 +39,12 @@ class ParallelExecutor(object): ...@@ -35,7 +39,12 @@ class ParallelExecutor(object):
places.append(p) places.append(p)
if num_threads is None: if num_threads is None:
num_threads = min(len(places) * 2, multiprocessing.cpu_count()) if use_cuda:
# Experiments on se-resnext show that too many threads hurt
# performance. Worth tuning for other models in the future.
num_threads = len(places)
else:
num_threads = min(len(places) * 2, multiprocessing.cpu_count())
startup = framework.default_startup_program() startup = framework.default_startup_program()
main = framework.default_main_program() main = framework.default_main_program()
...@@ -52,7 +61,8 @@ class ParallelExecutor(object): ...@@ -52,7 +61,8 @@ class ParallelExecutor(object):
startup.desc, startup.desc,
main.desc, main.desc,
loss_name, loss_name,
scope) scope,
allow_op_delay)
self.scope = scope self.scope = scope
def run(self, fetch_list): def run(self, fetch_list):
......
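When num_threads is not given, the constructor above now picks a default differently for GPU and CPU. A plain-Python restatement of that heuristic, with default_num_threads and num_places as hypothetical names used only for this sketch:

    import multiprocessing

    def default_num_threads(num_places, use_cuda):
        """Default thread-count heuristic applied when num_threads is None."""
        if use_cuda:
            # One thread per device; more threads hurt SE-ResNeXt in experiments.
            return num_places
        # On CPU, up to two threads per place, capped by the core count.
        return min(num_places * 2, multiprocessing.cpu_count())

    print(default_num_threads(4, use_cuda=True))    # 4
    print(default_num_threads(4, use_cuda=False))   # 8 on a machine with >= 8 cores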
...@@ -29,6 +29,7 @@ function(py_test_modules TARGET_NAME) ...@@ -29,6 +29,7 @@ function(py_test_modules TARGET_NAME)
endfunction() endfunction()
# test time consuming OPs in a separate process to exploit parallelism
list(REMOVE_ITEM TEST_OPS test_parallel_executor)
list(REMOVE_ITEM TEST_OPS test_warpctc_op) list(REMOVE_ITEM TEST_OPS test_warpctc_op)
list(REMOVE_ITEM TEST_OPS test_dyn_rnn) list(REMOVE_ITEM TEST_OPS test_dyn_rnn)
list(REMOVE_ITEM TEST_OPS test_mul_op) list(REMOVE_ITEM TEST_OPS test_mul_op)
...@@ -64,6 +65,7 @@ else() ...@@ -64,6 +65,7 @@ else()
endif(WITH_FAST_BUNDLE_TEST) endif(WITH_FAST_BUNDLE_TEST)
# tests with high overhead # tests with high overhead
py_test_modules(test_parallel_executor MODULES test_parallel_executor)
py_test_modules(test_warpctc_op MODULES test_warpctc_op ENVS FLAGS_warpctc_dir=${WARPCTC_LIB_DIR}) py_test_modules(test_warpctc_op MODULES test_warpctc_op ENVS FLAGS_warpctc_dir=${WARPCTC_LIB_DIR})
py_test_modules(test_train_dyn_rnn MODULES test_dyn_rnn) py_test_modules(test_train_dyn_rnn MODULES test_dyn_rnn)
py_test_modules(test_mul_op MODULES test_mul_op) py_test_modules(test_mul_op MODULES test_mul_op)
......
...@@ -334,7 +334,7 @@ class OpTest(unittest.TestCase): ...@@ -334,7 +334,7 @@ class OpTest(unittest.TestCase):
np.allclose( np.allclose(
actual_t, expect_t, atol=atol), actual_t, expect_t, atol=atol),
"Output (" + out_name + ") has diff at " + str(place) + "Output (" + out_name + ") has diff at " + str(place) +
str(actual_t) + str(expect_t)) str(actual_t) + "\n" + str(expect_t))
if isinstance(expect, tuple): if isinstance(expect, tuple):
self.assertListEqual(actual.lod(), expect[1], self.assertListEqual(actual.lod(), expect[1],
"Output (" + out_name + "Output (" + out_name +
...@@ -568,6 +568,6 @@ class OpTest(unittest.TestCase): ...@@ -568,6 +568,6 @@ class OpTest(unittest.TestCase):
fetch_list = [g for p, g in param_grad_list] fetch_list = [g for p, g in param_grad_list]
executor = Executor(place) executor = Executor(place)
return map( return map(np.array,
np.array, executor.run(prog, feed_dict, fetch_list,
executor.run(prog, feed_dict, fetch_list, return_numpy=False)) return_numpy=False))
...@@ -96,5 +96,47 @@ class TestLookupTableIdsIsSelectedRows(OpTest): ...@@ -96,5 +96,47 @@ class TestLookupTableIdsIsSelectedRows(OpTest):
self.check_with_place(place) self.check_with_place(place)
class TestLookupTableWIsSelectedRows(OpTest):
def check_with_place(self, place):
scope = core.Scope()
# create and initialize Id Variable
ids_tensor = scope.var('Ids').get_tensor()
ids_array = np.array([[0], [4], [3], [5]]).astype("int64")
ids_tensor.set(ids_array, place)
# create and initialize W Variable
rows = [0, 1, 2, 3, 4, 5, 6]
row_numel = 12
w_selected_rows = scope.var('W').get_selected_rows()
w_selected_rows.set_height(len(rows))
w_selected_rows.set_rows(rows)
w_array = np.ones((len(rows), row_numel)).astype("float32")
for i in range(len(rows)):
w_array[i] *= i
w_tensor = w_selected_rows.get_tensor()
w_tensor.set(w_array, place)
# create Out Variable
Out_tensor = scope.var('Out').get_tensor()
# create and run lookup_table operator
lookup_table = Operator("lookup_table", W='W', Ids='Ids', Out='Out')
lookup_table.run(scope, place)
# get result from Out
result_array = np.array(Out_tensor)
# all(): return True if all elements of the iterable are true (or if the iterable is empty)
for idx, row in enumerate(ids_array):
assert (row[0] == result_array[idx]).all()
def test_w_is_selected_rows(self):
places = [core.CPUPlace()]
# currently only support CPU
for place in places:
self.check_with_place(place)
if __name__ == "__main__": if __name__ == "__main__":
unittest.main() unittest.main()
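The new TestLookupTableWIsSelectedRows case feeds lookup_table a SelectedRows weight and checks that every looked-up row equals its row id, because w_array[i] is filled with the constant i. A plain numpy restatement of the expected gather, for illustration only:

    import numpy as np

    rows = list(range(7))
    row_numel = 12
    w_array = np.ones((len(rows), row_numel)).astype("float32")
    for i in range(len(rows)):
        w_array[i] *= i                      # row i holds the constant value i

    ids_array = np.array([[0], [4], [3], [5]]).astype("int64")
    expected = w_array[ids_array.flatten()]  # gather rows by id, shape (4, 12)
    assert (expected[1] == 4).all() and (expected[3] == 5).all()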
...@@ -23,7 +23,7 @@ import time ...@@ -23,7 +23,7 @@ import time
class TestRecvOp(unittest.TestCase): class TestRecvOp(unittest.TestCase):
def test_send(self): def no_test_send(self):
# Run init_serv in a thread # Run init_serv in a thread
place = fluid.CPUPlace() place = fluid.CPUPlace()
p = Process(target=self.init_serv, args=(place, )) p = Process(target=self.init_serv, args=(place, ))
......
File mode changed from 100755 to 100644
...@@ -22,13 +22,17 @@ import data_type ...@@ -22,13 +22,17 @@ import data_type
import topology import topology
import networks import networks
import evaluator import evaluator
from . import dataset
from . import reader
from . import plot from . import plot
import attr import attr
import op import op
import pooling import pooling
import inference import inference
import networks import networks
import minibatch
import plot import plot
import image
import paddle.trainer.config_parser as cp import paddle.trainer.config_parser as cp
__all__ = [ __all__ = [
...@@ -44,11 +48,14 @@ __all__ = [ ...@@ -44,11 +48,14 @@ __all__ = [
'data_type', 'data_type',
'attr', 'attr',
'pooling', 'pooling',
'dataset',
'reader',
'topology', 'topology',
'networks', 'networks',
'infer', 'infer',
'plot', 'plot',
'evaluator', 'evaluator',
'image',
'master', 'master',
] ]
...@@ -146,3 +153,4 @@ def init(**kwargs): ...@@ -146,3 +153,4 @@ def init(**kwargs):
infer = inference.infer infer = inference.infer
batch = minibatch.batch
This diff has been collapsed. (The same notice is repeated for each of the remaining changed files.)