[1.5] fleet test: GPU asynchronous training hangs in Transpiler mode
Created by: ccmeteorljh
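Reproduction code: http://wiki.baidu.com/display/PDLPDL/Fleet+with+Simnet-BOW
The code was modified to train on GPU. Training hangs after some epochs. The last lines of each trainer's log are shown below.
trainer0.log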
Epoch: 7, Step: 0, Loss: 0.101000763476, Accuracy: 0.25, Samples/sec: 85.953255802, Queue: 1
Epoch: 7, Step: 1, Loss: 0.102627635002, Accuracy: 0.25, Samples/sec: 1282.85792935, Queue: 1
Epoch: 7, Step: 2, Loss: 0.100948266685, Accuracy: 0.5, Samples/sec: 1403.0118749, Queue: 1
Epoch: 7, Step: 3, Loss: 0.0964086353779, Accuracy: 0.5, Samples/sec: 1392.30008299, Queue: 0
Epoch: 7, Step: 4, Loss: 0.103044293821, Accuracy: 0.25, Samples/sec: 649.775987607, Queue: 0
Epoch: 7, Train total expend: 0.0626969337463
Epoch: 8, Step: 0, Loss: 0.10452979058, Accuracy: 0.25, Samples/sec: 89.9582627346, Queue: 2
Epoch: 8, Step: 1, Loss: 0.10368168354, Accuracy: 0.5, Samples/sec: 766.117904927, Queue: 3
Epoch: 8, Step: 2, Loss: 0.107750244439, Accuracy: 0.0, Samples/sec: 1275.05821553, Queue: 3
Epoch: 8, Step: 3, Loss: 0.0966242179275, Accuracy: 0.5, Samples/sec: 1163.46851595, Queue: 3
trainer1.log
Epoch: 5, Step: 6, Loss: 0.0934507548809, Accuracy: 0.75, Samples/sec: 1503.20007168, Queue: 1
Epoch: 5, Step: 7, Loss: 0.0980113819242, Accuracy: 0.5, Samples/sec: 1555.89501994, Queue: 0
Epoch: 5, Train total expend: 0.0729529857635
Epoch: 6, Step: 0, Loss: 0.100071139634, Accuracy: 0.5, Samples/sec: 85.31510806, Queue: 1
Epoch: 6, Step: 1, Loss: 0.0988587737083, Accuracy: 0.75, Samples/sec: 1220.96033768, Queue: 1
Epoch: 6, Step: 2, Loss: 0.0986485481262, Accuracy: 0.5, Samples/sec: 1466.28351687, Queue: 1
Epoch: 6, Step: 3, Loss: 0.104199156165, Accuracy: 0.25, Samples/sec: 459.662346914, Queue: 2
Epoch: 6, Step: 4, Loss: 0.102274246514, Accuracy: 0.5, Samples/sec: 1387.92322965, Queue: 1
Epoch: 6, Step: 5, Loss: 0.104135371745, Accuracy: 0.25, Samples/sec: 1488.13340429, Queue: 0
trainer2.log
Epoch: 10, Step: 0, Loss: 0.103060126305, Accuracy: 0.5, Samples/sec: 80.9324502289, Queue: 2
Epoch: 10, Step: 1, Loss: 0.108582191169, Accuracy: 0.0, Samples/sec: 898.715234626, Queue: 4
Epoch: 10, Step: 2, Loss: 0.10179399699, Accuracy: 0.5, Samples/sec: 1059.03396036, Queue: 5
Epoch: 10, Step: 3, Loss: 0.0983226001263, Accuracy: 0.75, Samples/sec: 1048.77264487, Queue: 5
Epoch: 10, Step: 4, Loss: 0.106994174421, Accuracy: 0.25, Samples/sec: 1084.28979513, Queue: 4
Epoch: 10, Step: 5, Loss: 0.107414141297, Accuracy: 0.0, Samples/sec: 1018.28210731, Queue: 3
Epoch: 10, Step: 6, Loss: 0.099970087409, Accuracy: 0.5, Samples/sec: 1175.03964141, Queue: 2
Epoch: 10, Step: 7, Loss: 0.100574709475, Accuracy: 0.5, Samples/sec: 1226.67368575, Queue: 1
Epoch: 10, Step: 8, Loss: 0.106034442782, Accuracy: 0.0, Samples/sec: 1267.06562948, Queue: 0
Epoch: 10, Train total expend: 0.0811319351196
trainer3.log
Epoch: 6, Step: 0, Loss: 0.100563570857, Accuracy: 0.5, Samples/sec: 82.246899297, Queue: 1
Epoch: 6, Step: 1, Loss: 0.104799412191, Accuracy: 0.25, Samples/sec: 1174.71054474, Queue: 1
Epoch: 6, Step: 2, Loss: 0.102766729891, Accuracy: 0.25, Samples/sec: 1350.49633744, Queue: 0
Epoch: 6, Step: 3, Loss: 0.1008400172, Accuracy: 0.25, Samples/sec: 1363.77954804, Queue: 0
Epoch: 6, Step: 4, Loss: 0.10382220149, Accuracy: 0.25, Samples/sec: 1128.63881601, Queue: 0
Epoch: 6, Train total expend: 0.0627670288086
Epoch: 7, Step: 0, Loss: 0.103582732379, Accuracy: 0.0, Samples/sec: 94.5381680895, Queue: 1
Epoch: 7, Step: 1, Loss: 0.105683520436, Accuracy: 0.0, Samples/sec: 1223.98891078, Queue: 1
Epoch: 7, Step: 2, Loss: 0.108535170555, Accuracy: 0.0, Samples/sec: 1466.28351687, Queue: 0
Epoch: 7, Step: 3, Loss: 0.109039276838, Accuracy: 0.25, Samples/sec: 1283.74137271, Queue: 0
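To compare where each trainer stopped, a small helper like the one below could extract the last logged step from each log. This is only a sketch: the function name `last_step` and the regular expression are illustrative assumptions based on the line format shown in the logs above, not part of the reproduction code.

```python
import re

# Matches per-step lines of the form shown in the logs above, e.g.
# "Epoch: 8, Step: 3, Loss: ..., Accuracy: ..., Samples/sec: ..., Queue: 3".
# Epoch summary lines ("Train total expend") have no "Step:" field and are skipped.
STEP_RE = re.compile(r"Epoch: (\d+), Step: (\d+), .*Queue: (\d+)")


def last_step(lines):
    """Return (epoch, step, queue) for the last step line, or None if there is none."""
    last = None
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            last = tuple(int(g) for g in match.groups())
    return last
```

It can be applied to an open log file, for example `last_step(open("trainer0.log"))`, and the results compared across the four trainers to see how far apart they were when training stopped.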