Paddle MPI fleet distributed training intermittently produces NaN
Created by: maosengshulei
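- Version and environment info: 1) PaddlePaddle version: paddlepaddle 1.6.0 2) CPU: 4)
- Model info
An ordinary MLP click-through-rate (CTR) prediction model, trained in distributed mode through the fleet API. During training, the output of the final softmax layer becomes NaN, which causes a core dump when AUC is computed.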
Tensor[fc_3.tmp_2] shape: [1024,2,] dtype: f data: nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
PaddleCheckError: Expected predict_data <= 1, but received predict_data:nan > 1:1. The predict data must less or equal 1. at [/paddle/paddle/fluid/operators/metrics/auc_op.h:83] [operator < auc > error]
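However, after I resubmitted the same job, the model trained to completion normally and the NaN-induced core dump could not be reproduced. MPI log for this job: http://10.76.125.48:8910/fileview.html?type=logsdir&path=/&instance=0.app-user-20191103163759-40991--shulei_msd_mmoe_dnn_v1_20191102_paddlecloud So it does not look like the problem is caused by dirty data.
Now almost every job I submit to train on the full dataset hits the same core dump.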
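To narrow down where the NaN first appears, fetched values (for example the softmax input and `fc_3.tmp_2`) could be checked on the Python side before they reach the auc op. The following is only a minimal NumPy sketch, not Paddle API: `find_non_finite` and `stable_softmax` are hypothetical helper names introduced here for illustration, and the sketch assumes the tensors have already been fetched as arrays.

```python
import numpy as np


def find_non_finite(name, values):
    """Return a small report if `values` contains NaN/Inf, otherwise None.

    Hypothetical helper for debugging; `name` is just a label for the report.
    """
    arr = np.asarray(values, dtype=np.float64)
    bad = ~np.isfinite(arr)
    if not bad.any():
        return None
    first = np.argwhere(bad)[0]  # index of the first non-finite element
    return {
        "name": name,
        "count": int(bad.sum()),
        "first_index": tuple(int(i) for i in first),
    }


def stable_softmax(logits):
    """Softmax with the row maximum subtracted, so large logits do not overflow."""
    x = np.asarray(logits, dtype=np.float64)
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    # Made-up values, only to show how the helpers are called.
    fetched = [[0.1, 0.9], [float("nan"), 0.5]]
    print(find_non_finite("fc_3.tmp_2", fetched))
    print(stable_softmax([[1000.0, 0.0]]))
```

If the softmax input is already NaN or Inf, the problem is upstream (weights, gradients, or input features) rather than in the softmax itself; if the input is finite but very large, overflow inside the softmax is a more likely suspect.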