Single machine, multiple GPUs (or multi-CPU mode): asynchronous data reading fails with ValueError: Feed a list of tensor, the list should be the same size as places
Created by: QianShengWu
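Does asynchronous data reading have a problem in PaddlePaddle's single-machine multi-card mode, such as inconsistent or conflicting data? The code is Baidu's simnet, modified to read data asynchronously.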
left = data.ops(name="left", shape=[1], dtype="int64", lod_level=1)
pos_right = data.ops(name="right", shape=[1], dtype="int64", lod_level=1)
neg_right = data.ops(name="neg_right", shape=[1], dtype="int64", lod_level=1)
left_feat, pos_score = net.predict(left, pos_right)
_, neg_score = net.predict(left, neg_right)
avg_cost = loss.compute(pos_score, neg_score)
reader = data_reader.get_reader(conf_dict, False, None)
py_reader = fluid.io.PyReader(feed_list=[left, pos_right, neg_right], capacity=640, use_double_buffer=True,
                              iterable=True)
Training:
batch_data = paddle.batch(reader, conf_dict["batch_size"], drop_last=False)
py_reader.decorate_sample_list_generator(batch_data, places=place)
for epoch_id in range(conf_dict["epoch_num"]):
    losses = []
    # Get batch data iterator
    start_time = time.time()
    total_loss = 0.0
    for iter, data in enumerate(py_reader()):
        avg_loss = parallel_executor.run(
            [avg_cost.name], feed=data)
The following error occurs:
➜ paddle git:(master) ✗ export CPU_NUM=2
➜ paddle git:(master) ✗ sh run_train.sh
{'net': {'module_name': 'bow', 'class_name': 'BOW', 'emb_dim': 128, 'bow_dim': 128, 'hidden_dim': 128}, 'loss': {'module_name': 'hinge_loss', 'class_name': 'HingeLoss', 'margin': 0.1}, 'optimizer': {'class_name': 'AdamOptimizer', 'learning_rate': 0.001, 'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-08}, 'dict_size': 14522, 'task_mode': 'pairwise', 'train_file_path': 'data/small.train', 'test_file_path': 'data/small.test', 'result_file_path': 'result_bow_pairwise', 'epoch_num': 10, 'model_path': 'models/bow_pairwise', 'use_epoch': 0, 'batch_size': 64, 'num_threads': 4}
I0813 17:46:26.663134 69371328 parallel_executor.cc:329] The number of CPUPlace, which is used in ParallelExecutor, is 2. And the Program will be copied 2 copies
I0813 17:46:26.673501 69371328 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
Traceback (most recent call last):
  File "paddle_simnet.py", line 232, in <module>
    train(conf_dict)
  File "paddle_simnet.py", line 119, in train
    [avg_cost.name], feed=data)
  File "/usr/local/lib/python3.7/site-packages/paddle/fluid/parallel_executor.py", line 280, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 666, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 508, in _run_parallel
    "Feed a list of tensor, the list should be the same size as places"
ValueError: Feed a list of tensor, the list should be the same size as places
Running the code on a single GPU or a single CPU works fine.
Could someone please take a look? I'm new to this, so thanks in advance.
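
A possible reading of the error, offered as a guess rather than a confirmed diagnosis: the message suggests that the feed handed to parallel_executor.run is a list whose length must equal the number of places the executor uses (2 here, from CPU_NUM=2), while the reader above was decorated with a single place, so each step may carry only one feed dict. The sketch below imitates that size check in plain Python without importing Paddle. The names check_feed and batches_per_step and the place strings are hypothetical and are not Paddle APIs. If this reading is right, passing a list of places to decorate_sample_list_generator (I believe fluid.cpu_places() or fluid.cuda_places() return such a list, but that is an assumption to verify against the installed version) might make the sizes agree.

```python
# Pure-Python sketch; nothing here calls Paddle. All names are hypothetical.

def check_feed(feed, places):
    """Imitate the size check that the error message implies."""
    if isinstance(feed, (list, tuple)) and len(feed) != len(places):
        raise ValueError(
            "Feed a list of tensor, the list should be the same size as places")
    return True


def batches_per_step(batches, places):
    """Group consecutive batches so each step yields one feed dict per place."""
    step = []
    for batch in batches:
        step.append({"left": batch})
        if len(step) == len(places):
            yield step
            step = []


executor_places = ["cpu:0", "cpu:1"]  # stands in for CPU_NUM=2
batches = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Reader bound to a single place: one feed dict per step, so sizes disagree.
single = next(batches_per_step(batches, ["cpu:0"]))
try:
    check_feed(single, executor_places)
    mismatch = False
except ValueError:
    mismatch = True

# Reader bound to the same two places as the executor: sizes agree.
paired = next(batches_per_step(batches, executor_places))
ok = check_feed(paired, executor_places)
```

In this toy version the single-place reader trips the check and the two-place reader passes it, which would also be consistent with the single-GPU and single-CPU runs working.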