paddle MPI: distributed synchronous training (sync_mode=True) hangs when using fleet transpiler mode
Created by: maosengshulei
This occurs on both paddle 1.6.1 and paddle 1.7.1. Trainer-side log:
Wed May 6 22:13:02 2020[1,9]<stdout>:E0506 22:13:01.996211903 1456 tcp_server_posix.cc:64] check for SO_REUSEPORT: {"created":"@1588774381.996186251","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Wed May 6 22:13:02 2020[1,36]<stdout>:E0506 22:13:02.006885582 1454 tcp_server_posix.cc:64] check for SO_REUSEPORT: {"created":"@1588774382.006858766","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Wed May 6 22:13:02 2020[1,9]<stdout>:I0506 22:13:01.999068 1456 grpc_server.cc:477] Server listening on 10.76.85.32:62004 successful, selected port: 62004
Wed May 6 22:13:02 2020[1,36]<stdout>:I0506 22:13:02.009769 1454 grpc_server.cc:477] Server listening on 10.76.56.42:62004 successful, selected port: 62004
Wed May 6 22:13:02 2020[1,29]<stdout>:E0506 22:13:02.024260706 1460 tcp_server_posix.cc:64] check for SO_REUSEPORT: {"created":"@1588774382.024233488","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Wed May 6 22:13:02 2020[1,29]<stdout>:I0506 22:13:02.027292 1460 grpc_server.cc:477] Server listening on 10.76.87.22:62004 successful, selected port: 62004
Wed May 6 22:13:02 2020[1,5]<stdout>:E0506 22:13:02.049540110 1459 tcp_server_posix.cc:64] check for SO_REUSEPORT: {"created":"@1588774382.049515524","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Wed May 6 22:13:02 2020[1,5]<stdout>:I0506 22:13:02.052477 1459 grpc_server.cc:477] Server listening on 10.76.85.15:62004 successful, selected port: 62004
Wed May 6 22:13:02 2020[1,38]<stdout>:E0506 22:13:02.064906617 1458 tcp_server_posix.cc:64] check for SO_REUSEPORT: {"created":"@1588774382.064880232","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Wed May 6 22:13:02 2020[1,38]<stdout>:I0506 22:13:02.067895 1458 grpc_server.cc:477] Server listening on 10.76.58.12:62004 successful, selected port: 62004
Wed May 6 22:13:02 2020[1,19]<stdout>:E0506 22:13:02.136696785 1457 tcp_server_posix.cc:64] check for SO_REUSEPORT: {"created":"@1588774382.136666332","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Wed May 6 22:13:02 2020[1,19]<stdout>:I0506 22:13:02.139530 1457 grpc_server.cc:477] Server listening on 10.76.86.28:62004 successful, selected port: 62004
Wed May 6 22:13:02 2020[1,4]<stdout>:2020-05-06 22:13:02,201-INFO: run pserver
Wed May 6 22:13:02 2020[1,22]<stdout>:E0506 22:13:02.322785357 1459 tcp_server_posix.cc:64] check for SO_REUSEPORT: {"created":"@1588774382.322756336","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Wed May 6 22:13:02 2020[1,22]<stdout>:I0506 22:13:02.325747 1459 grpc_server.cc:477] Server listening on 10.76.86.38:62004 successful, selected port: 62004
Wed May 6 22:13:02 2020[1,4]<stdout>:E0506 22:13:02.531316305 1458 tcp_server_posix.cc:64] check for SO_REUSEPORT: {"created":"@1588774382.531287183","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":169,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Wed May 6 22:13:02 2020[1,4]<stdout>:I0506 22:13:02.534200 1458 grpc_server.cc:477] Server listening on 10.76.85.12:62004 successful, selected port: 62004
Wed May 6 22:13:02 2020[1,18]<stdout>:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Wed May 6 22:13:02 2020[1,37]<stdout>:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Wed May 6 22:13:03 2020[1,18]<stdout>:2020-05-06 22:13:03,116-INFO: run pserver
Wed May 6 22:13:03 2020[1,12]<stdout>:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Wed May 6 22:13:03 2020[1,37]<stdout>:2020-05-06 22:13:03,200-INFO: run pserver
Wed May 6 22:13:03 2020[1,18]<stdout>:I0506 22:13:03.451534 1458 grpc_server.cc:477] Server listening on 10.76.86.25:62004 successful, selected port: 62004
Wed May 6 22:13:03 2020[1,37]<stdout>:I0506 22:13:03.550685 1455 grpc_server.cc:477] Server listening on 10.76.56.45:62004 successful, selected port: 62004
Wed May 6 22:13:03 2020[1,12]<stdout>:2020-05-06 22:13:03,767-INFO: run pserver
Wed May 6 22:13:04 2020[1,12]<stdout>:I0506 22:13:04.109608 1456 grpc_server.cc:477] Server listening on 10.76.135.12:62004 successful, selected port: 62004
Wed May 6 22:13:08 2020[1,24]<stdout>:get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
Wed May 6 22:13:09 2020[1,24]<stdout>:2020-05-06 22:13:09,238-INFO: run pserver
Wed May 6 22:13:09 2020[1,24]<stdout>:I0506 22:13:09.649618 1479 grpc_server.cc:477] Server listening on 10.76.86.42:62004 successful, selected port: 62004
Code:
```
strategy = DistributeTranspilerConfig()
strategy.sync_mode = True
optimizer = fluid.optimizer.Adam(learning_rate=1e-4, lazy_mode=True)
optimizer = fleet.distributed_optimizer(optimizer, strategy)

if fleet.is_server():
    logger.info("run pserver")
    fleet.init_server()
    fleet.run_server()
elif fleet.is_worker():
    logger.info("run trainer")
    fleet.init_worker()
    exe.run(fleet.startup_program)

    # Dataset fed through an external reader process.
    dataset = fluid.DatasetFactory().create_dataset()
    data_list_names = [var.name for var in data_list]
    dataset.set_use_var(data_list)
    pipe_command = "python dnn_reader.py %s %s %s" % (
        args.feat_list_path, args.mapping_path, args.seccate_map_path)
    dataset.set_pipe_command(pipe_command)
    dataset.set_batch_size(args.batch_size)
    thread_num = 10
    dataset.set_thread(thread_num)
    whole_filelist = [args.train_data_path + '/' + x
                      for x in os.listdir(args.train_data_path)]

    epochs = args.epochs
    for i in range(epochs):
        start = time.time()
        dataset.set_filelist(whole_filelist)
        exe.train_from_dataset(
            program=fleet.main_program,
            dataset=dataset,
            fetch_list=[avg_ctr_cost, avg_teacher_ctr_cost, avg_hint_loss,
                        student_acc, teacher_acc, student_auc,
                        student_batch_auc, teacher_auc, teacher_batch_auc],
            fetch_info=["student_loss", "teacher_loss", "hint_loss",
                        "student_accuracy", "teacher_accuracy", "student_auc",
                        "student_batch_auc", "teacher_auc", "teacher_batch_auc"],
            print_period=10,
            debug=False)
        sys.stderr.write('epoch%d is finished and takes %f s\n' % (
            (i + 1), time.time() - start))
```
With sync_mode set to False, distributed training runs normally.
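To help narrow down where the job hangs, here is a rough sketch of a helper that groups the tagged log lines by MPI rank and records which ranks printed "run pserver", "run trainer", or a "Server listening on ..." endpoint. It is plain Python and does not use any Paddle API. The names `summarize`, `RANK_RE`, and `LISTEN_RE` are hypothetical, and it assumes the `[1,9]<stdout>:` prefix is a `[job,rank]` tag added by the MPI launcher, which I have not confirmed for this cluster.

```python
import re
from collections import defaultdict

# Assumption: the "[1,9]<stdout>:" prefix is "[job,rank]" added by the MPI launcher.
RANK_RE = re.compile(r"\[\d+,(\d+)\]<std(?:out|err)>:(.*)")
LISTEN_RE = re.compile(r"Server listening on (\S+) successful")


def summarize(lines):
    """Map each rank to the events seen in its tagged output (hypothetical helper)."""
    events = defaultdict(
        lambda: {"endpoint": None, "run_pserver": False, "run_trainer": False})
    for line in lines:
        m = RANK_RE.search(line)
        if not m:
            continue  # skip lines without a rank tag
        rank, msg = int(m.group(1)), m.group(2)
        listen = LISTEN_RE.search(msg)
        if listen:
            events[rank]["endpoint"] = listen.group(1)
        if "run pserver" in msg:
            events[rank]["run_pserver"] = True
        if "run trainer" in msg:
            events[rank]["run_trainer"] = True
    return dict(events)


# Two lines copied from the log above, used as a small sample.
sample = [
    "Wed May 6 22:13:02 2020[1,9]<stdout>:I0506 22:13:01.999068 1456 "
    "grpc_server.cc:477] Server listening on 10.76.85.32:62004 successful, "
    "selected port: 62004",
    "Wed May 6 22:13:02 2020[1,4]<stdout>:2020-05-06 22:13:02,201-INFO: run pserver",
]
for rank, info in sorted(summarize(sample).items()):
    print(rank, info)
```

Running this over the full job log (for example, `summarize(open(path))` with `path` pointing at the saved output) should show whether any rank ever logged "run trainer", which may indicate whether the workers got as far as `fleet.init_worker()`.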