From efd7ee8521986e7789ea88ec0e9a2c7ff5c83ca9 Mon Sep 17 00:00:00 2001 From: m3ngyang Date: Sun, 25 Mar 2018 19:35:20 +0800 Subject: [PATCH] translate Cluster Training and Prediction --- doc/v2/faq/cluster/index_en.rst | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/doc/v2/faq/cluster/index_en.rst b/doc/v2/faq/cluster/index_en.rst index 855b7e8e533..7cbcaeefcbd 100644 --- a/doc/v2/faq/cluster/index_en.rst +++ b/doc/v2/faq/cluster/index_en.rst @@ -2,4 +2,15 @@ Cluster Training and Prediction ############################### -TBD +.. contents:: + +1. Network connection errors in the log during muliti-node cluster training +------------------------------------------------ +The errors in the log belong to network connection during mulilti-node cluster training, for example, :code:`Connection reset by peer`. +This kind of error is usually caused by the abnormal exit of the training process in some node, and the others cannot connect with this node any longer. Steps to troubleshoot the problem as follows: + +* Find the first error in the :code:`train.log`, :code:`server.log`, check whether other fault casued the problem, such as FPE, lacking of memory or disk. + +* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job. + +* If the currnet MPI cluster does not support exclusive pattern, ask the operator to replace or update the current cluster. -- GitLab