Single machine with multiple GPUs (or multi-CPU mode): asynchronous data reading fails with ValueError: Feed a list of tensor, the list should be the same size as places (#19184) · Issue · PaddlePaddle / Paddle

Created by: QianShengWu

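Does asynchronous data reading in PaddlePaddle have a problem on a single machine with multiple GPUs, such as inconsistent or conflicting data? The code is Baidu's simnet, modified to read data asynchronously.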

left = data.ops(name="left", shape=[1], dtype="int64", lod_level=1)
pos_right = data.ops(name="right", shape=[1], dtype="int64", lod_level=1)
neg_right = data.ops(name="neg_right", shape=[1], dtype="int64", lod_level=1)
left_feat, pos_score = net.predict(left, pos_right)
_, neg_score = net.predict(left, neg_right)
avg_cost = loss.compute(pos_score, neg_score)
        
reader = data_reader.get_reader(conf_dict, False, None)
py_reader = fluid.io.PyReader(feed_list=[left, pos_right, neg_right], capacity=640,
                              use_double_buffer=True, iterable=True)

Training:

batch_data = paddle.batch(reader, conf_dict["batch_size"], drop_last=False)
py_reader.decorate_sample_list_generator(batch_data, places=place)

for epoch_id in range(conf_dict["epoch_num"]):
    losses = []
    # Get batch data iterator
    start_time = time.time()
    total_loss = 0.0
    for iter, data in enumerate(py_reader()):
        avg_loss = parallel_executor.run(
            [avg_cost.name], feed=data)

The following error occurs:

➜  paddle git:(master) ✗ export CPU_NUM=2
➜  paddle git:(master) ✗ sh run_train.sh 
{'net': {'module_name': 'bow', 'class_name': 'BOW', 'emb_dim': 128, 'bow_dim': 128, 'hidden_dim': 128}, 'loss': {'module_name': 'hinge_loss', 'class_name': 'HingeLoss', 'margin': 0.1}, 'optimizer': {'class_name': 'AdamOptimizer', 'learning_rate': 0.001, 'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-08}, 'dict_size': 14522, 'task_mode': 'pairwise', 'train_file_path': 'data/small.train', 'test_file_path': 'data/small.test', 'result_file_path': 'result_bow_pairwise', 'epoch_num': 10, 'model_path': 'models/bow_pairwise', 'use_epoch': 0, 'batch_size': 64, 'num_threads': 4}
I0813 17:46:26.663134 69371328 parallel_executor.cc:329] The number of CPUPlace, which is used in ParallelExecutor, is 2. And the Program will be copied 2 copies
I0813 17:46:26.673501 69371328 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
Traceback (most recent call last):
  File "paddle_simnet.py", line 232, in <module>
    train(conf_dict)
  File "paddle_simnet.py", line 119, in train
    [avg_cost.name], feed=data)
  File "/usr/local/lib/python3.7/site-packages/paddle/fluid/parallel_executor.py", line 280, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 666, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python3.7/site-packages/paddle/fluid/executor.py", line 508, in _run_parallel
    "Feed a list of tensor, the list should be the same size as places"
ValueError: Feed a list of tensor, the list should be the same size as places

The same code runs fine on a single GPU or a single CPU.

Could someone please take a look? Thanks in advance from a beginner.
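
For readers hitting the same message, the check named in the traceback can be illustrated without Paddle. The sketch below is only a plain-Python model, not Paddle's actual code: `iterate_feeds` and `check_feed_matches_places` are hypothetical names, and it assumes that an iterable reader yields a list with one feed dict per place it was given, while the executor compares the length of that list with its own number of places. Under that assumption, a reader set up with a single place cannot satisfy an executor running on two places (`CPU_NUM=2` in the log above).

```python
def iterate_feeds(batches, reader_places):
    """Hypothetical stand-in for an iterable reader.

    Assumption: each step yields a list holding one feed dict per reader place.
    """
    n = len(reader_places)
    for start in range(0, len(batches) - n + 1, n):
        yield [{"batch": b} for b in batches[start:start + n]]


def check_feed_matches_places(feed, executor_places):
    """Model of the size check that produces the ValueError in the traceback."""
    if isinstance(feed, (list, tuple)) and len(feed) != len(executor_places):
        raise ValueError(
            "Feed a list of tensor, the list should be the same size as places")
    return True


batches = [[1, 2], [3, 4], [5, 6], [7, 8]]   # four toy batches
executor_places = ["cpu:0", "cpu:1"]         # two devices, as with CPU_NUM=2

# Reader built with one place: each feed list has length 1, so the check fails.
single_place_feed = next(iterate_feeds(batches, ["cpu:0"]))
try:
    check_feed_matches_places(single_place_feed, executor_places)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Reader built with the same places as the executor: the check passes.
matched_feed = next(iterate_feeds(batches, executor_places))
match_ok = check_feed_matches_places(matched_feed, executor_places)
```

If this model matches what happens in the post, the thing to try would be passing a list of places with one entry per device (for example the result of `fluid.cpu_places()` or `fluid.cuda_places()`) as `places` to `decorate_sample_list_generator`, instead of a single place. This is an untested suggestion, not a confirmed fix.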
