During MPI training, test AUC with multiple nodes is noticeably lower than with a single node
Created by: fty8788
After each pass, AUC is computed on the test set, and the results differ between single-node and multi-node training.

Single node:
Test at Pass 0, {'auc_evaluator_0': 0.7868553996086121, 'classification_error_evaluator': 0.2663058936595917}
http://nmg01-hpc-off-mon.dmop.baidu.com:8090/job/i-646520/

Multi-node (10 nodes):
Test at Pass 0, {'auc_evaluator_0': 0.7584338784217834, 'classification_error_evaluator': 0.2819223999977112}
Test at Pass 0, {'auc_evaluator_0': 0.7583796381950378, 'classification_error_evaluator': 0.28192976117134094}
Test at Pass 0, {'auc_evaluator_0': 0.7583440542221069, 'classification_error_evaluator': 0.2819364368915558}
...
http://nmg01-hpc-off-mon.dmop.baidu.com:8090/job/i-646689/
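To quantify the gap, the logged values can be parsed and compared. The following is only a minimal sketch: the helper name `parse_test_metrics` and the regular expression are illustrative assumptions, not part of Paddle, and only the three multi-node lines quoted above are used (the remaining nodes' output is elided in the report).

```python
import ast
import re

# Assumed log format, taken from the lines quoted in this report:
#   Test at Pass <n>, {<python dict of evaluator results>}
LOG_ENTRY = re.compile(r"Test at Pass (\d+), (\{.*?\})")


def parse_test_metrics(log_text):
    """Hypothetical helper: return [(pass_id, metrics_dict), ...] for each test entry."""
    return [(int(pass_id), ast.literal_eval(metrics))
            for pass_id, metrics in LOG_ENTRY.findall(log_text)]


single_node_log = (
    "Test at Pass 0, {'auc_evaluator_0': 0.7868553996086121, "
    "'classification_error_evaluator': 0.2663058936595917}"
)
multi_node_log = (
    "Test at Pass 0, {'auc_evaluator_0': 0.7584338784217834, "
    "'classification_error_evaluator': 0.2819223999977112} "
    "Test at Pass 0, {'auc_evaluator_0': 0.7583796381950378, "
    "'classification_error_evaluator': 0.28192976117134094} "
    "Test at Pass 0, {'auc_evaluator_0': 0.7583440542221069, "
    "'classification_error_evaluator': 0.2819364368915558}"
)

single_auc = parse_test_metrics(single_node_log)[0][1]["auc_evaluator_0"]
multi_aucs = [m["auc_evaluator_0"] for _, m in parse_test_metrics(multi_node_log)]

mean_multi_auc = sum(multi_aucs) / len(multi_aucs)
auc_gap = single_auc - mean_multi_auc          # single node vs. multi-node mean
node_spread = max(multi_aucs) - min(multi_aucs)  # disagreement among the nodes

print(f"single-node AUC : {single_auc:.4f}")
print(f"multi-node mean : {mean_multi_auc:.4f}")
print(f"gap             : {auc_gap:.4f}")
print(f"node spread     : {node_spread:.6f}")
```

On the quoted values, the gap at Pass 0 is roughly 0.028 AUC, while the three multi-node results differ from each other by less than 0.0001. The multi-node results therefore agree with one another, and the discrepancy is specifically between the single-node and multi-node runs.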
Submission command:
paddle cluster_train \
  --config mpi_train.py \
  --port 8788 \
  --use_gpu cpu \
  --use_remote_sparse 0 \
  --time_limit 100:00:00 \
  --submitter wangdong08 \
  --num_nodes 10 \
  --job_priority normal \
  --trainer_count 1 \
  --num_passes 10 \
  --fs_name hdfs://nmg01-mulan-hdfs.dmop.baidu.com:54310 \
  --fs_ugi fcr-ad,2f6b06d4ce \
  --train_data_path /app/ecom/fcr-ad/yitengfei/intent-q/paddle_dssm_intentid/1030.train \
  --test_data_path /app/ecom/fcr-ad/yitengfei/intent-q/paddle_dssm_intentid/1030.test \
  --output_path /app/ecom/fcr-ad/yitengfei/intent-q/paddle_dssm_intentid/model_output.1030_left_right-64.n10 \
  --where nmg01-hpc-off-dmop-slow-cpu-10G_cluster \
  --thirdparty thirdparty \
  --job_name wangdong08_paddle_cluster_dssm