MPI cluster training fails
Created by: lyp2github
Wed Oct 10 02:46:27 2018[1,47]:18/10/10 02:46:27 WARN hdfs.FMSClient: DFS Read: org.apache.hadoop.hdfs.protocol.ReadSlowException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.73.51.16:22432 remote=/10.73.82.38:7001]
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.hdfs.FMSClient$DFSInputStream.readBuffer(FMSClient.java:3060)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.hdfs.FMSClient$DFSInputStream.readInternal(FMSClient.java:3127)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.hdfs.FMSClient$DFSInputStream.read(FMSClient.java:3092)
Wed Oct 10 02:46:27 2018[1,47]: at java.io.DataInputStream.read(DataInputStream.java:132)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.io.LimitInputStream.read(LimitInputStream.java:84)
Wed Oct 10 02:46:27 2018[1,47]: at java.io.InputStream.read(InputStream.java:82)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:138)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:159)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:434)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:414)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:348)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.fs.FsShell.run(FsShell.java:2203)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
Wed Oct 10 02:46:27 2018[1,47]: at org.apache.hadoop.fs.FsShell.main(FsShell.java:2353)
Wed Oct 10 02:46:27 2018[1,47]:
Wed Oct 10 02:46:33 2018[1,27]:18/10/10 02:46:33 INFO common.UpdateService: ZkstatusUpdater to nmg01-mulan-hdfs.dmop.baidu.com:54310 started
Wed Oct 10 02:46:33 2018[1,14]:*** Check failure stack trace: ***
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8caf6fa27d google::LogMessage::Fail()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8caf6fdd2c google::LogMessage::SendToLog()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8caf6f9da3 google::LogMessage::Flush()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8caf6ff23e google::LogMessageFatal::~LogMessageFatal()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8caf5573c1 paddle::SocketClient::TcpClient()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8caf5575a1 paddle::SocketClient::SocketClient()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8cb01da9b0 paddle::ParameterClient2::init()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8cafd6709d paddle::RemoteParameterUpdater::init()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8caf6da1ea ParameterUpdater::init()
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8caf383f7b _wrap_ParameterUpdater_init
Wed Oct 10 02:46:33 2018[1,2]:Download File: /app/ecom/fcr-opt/liuyaping/paddle/paddle_title_spa/test/part-00687-B
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b4cb9 PyEval_EvalFrameEx
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b6b28 PyEval_EvalCodeEx
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b5d10 PyEval_EvalFrameEx
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b6b28 PyEval_EvalCodeEx
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b5d10 PyEval_EvalFrameEx
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b6b28 PyEval_EvalCodeEx
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b5d10 PyEval_EvalFrameEx
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b6b28 PyEval_EvalCodeEx
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4b6c52 PyEval_EvalCode
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4e1c7d PyRun_FileExFlags
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4e3501 PyRun_SimpleFileExFlags
Wed Oct 10 02:46:33 2018[1,14]: @ 0x4159dd Py_Main
Wed Oct 10 02:46:33 2018[1,14]: @ 0x7f8cb1c09bd5 __libc_start_main
Wed Oct 10 02:46:33 2018[1,14]: @ 0x414b71 (unknown)
Wed Oct 10 02:46:33 2018[1,14]: @ (nil) (unknown)
Cluster: nmg-off. Job links: job/i-787772/ and job/i-787776/