Cluster train job will hang if there are too many parameter servers or ports
Created by: typhoonzero
If there are too many parameter servers or too many parameter server ports (or sparse ports), some parameter servers will wait forever.
When a parameter server starts up, it says:
W0522 12:00:09.495564 35864 ParameterServer2.cpp:269] --ports_num or --ports_num_for_sparse might be too large, or total dense parameter size or sparse parameters size might be too small, this psever doesn't store any parameter.
In `ParameterServer2.cpp`:
```cpp
void ParameterServer2::setParameter(const SendParameterRequest& request,
                                    std::vector<Buffer>& inputBuffers,
                                    SendParameterResponse* response,
                                    std::vector<Buffer>* outputBuffers) {
  ...
  if (!request.blocks().size()) {
    LOG(WARNING)
        << "--ports_num or --ports_num_for_sparse might be too large, "
        << "or total dense parameter size or sparse parameters size "
        << "might be too small, this psever doesn't store any parameter.";
    return;
  }
  ...
```
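The early return above fires whenever a request carries no blocks for this pserver. As a rough, self-contained illustration of how that can happen (this is not PaddlePaddle's actual partitioning code, and all sizes below are made up): if parameters are cut into fixed-size blocks and spread over one slot per pserver port, any slot beyond the number of blocks stores nothing.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical illustration: parameters are split into fixed-size blocks and
// the blocks are assigned round-robin over pserver "slots" (one slot per
// pserver port). If there are fewer blocks than slots, some slots stay empty.
int main() {
  const long denseParamSize = 1 << 20;  // 1M parameters (example value)
  const long blockSize      = 1 << 18;  // block size in elements (example value)
  const int  numPservers    = 8;
  const int  portsNum       = 4;        // analogous to --ports_num

  const long numBlocks = (denseParamSize + blockSize - 1) / blockSize;  // = 4
  const int  numSlots  = numPservers * portsNum;                        // = 32

  std::vector<int> blocksPerSlot(numSlots, 0);
  for (long b = 0; b < numBlocks; ++b) {
    ++blocksPerSlot[b % numSlots];  // round-robin assignment
  }

  int emptySlots = 0;
  for (int n : blocksPerSlot) {
    if (n == 0) ++emptySlots;
  }
  // With these numbers, 28 of the 32 slots store no parameters at all, so 28
  // pserver ports would print the warning shown above.
  std::printf("blocks=%ld slots=%d empty slots=%d\n",
              numBlocks, numSlots, emptySlots);
  return 0;
}
```

Under this kind of layout, increasing the number of pservers or `--ports_num` past the number of parameter blocks only creates idle slots.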
```cpp
void ParameterServer2::addGradient(const SendParameterRequest& request,
                                   std::vector<Buffer>& inputBuffers,
                                   SendParameterResponse* response,
                                   std::vector<Buffer>* outputBuffers) {
  if (!numPassFinishClients_) {
    REGISTER_BARRIER_DELTA_SERVER_SET(
        *statSet_,
        "forwardbackwardDelta",
        FLAGS_num_gradient_servers,
        request.trainer_id(),
        request.forwardbackward_time(),
        isSparseServer_ ? "_sparseUpdater" : "_denseUpdater");
  }
```
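For context on why a missing participant matters here: a barrier registered for `FLAGS_num_gradient_servers` participants only releases once every expected trainer checks in, so if some trainers never send to a given pserver, anyone waiting on that barrier is stuck. The snippet below is a generic illustration of that failure mode in plain C++11; it is not PaddlePaddle's barrier implementation, and it uses a timeout so the demo terminates instead of hanging.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// Minimal barrier: wait() returns true once `expected` threads have arrived,
// or false when the timeout expires first (a real, untimed barrier would just
// block forever in that situation).
class SimpleBarrier {
public:
  explicit SimpleBarrier(int expected) : expected_(expected) {}
  bool wait(std::chrono::seconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    ++arrived_;
    cv_.notify_all();
    return cv_.wait_for(lock, timeout, [this] { return arrived_ >= expected_; });
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  const int expected_;
  int arrived_ = 0;
};

int main() {
  const int expectedTrainers = 4;  // analogous to FLAGS_num_gradient_servers
  const int actualTrainers   = 3;  // one expected participant never shows up
  SimpleBarrier barrier(expectedTrainers);

  std::vector<std::thread> trainers;
  for (int i = 0; i < actualTrainers; ++i) {
    trainers.emplace_back([&barrier, i] {
      bool released = barrier.wait(std::chrono::seconds(2));
      std::printf("trainer %d released=%d\n", i, released);  // released=0
    });
  }
  for (auto& t : trainers) t.join();
  return 0;
}
```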
It seems that the hanging problem is due to some other reason, but I still need to figure out the details of the case where there are more parameter blocks than pserver instances.
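One low-risk way to gather those details would be a per-request log line near the top of `setParameter`/`addGradient` showing how many blocks each trainer actually sends to this pserver. This is a hypothetical diagnostic, not existing code; it only uses the request accessors already visible in the excerpts above.

```cpp
// Hypothetical diagnostic: makes an empty or under-filled pserver visible
// directly in the logs instead of only at the early-return warning.
LOG(INFO) << "received request from trainer " << request.trainer_id()
          << " with " << request.blocks().size() << " parameter blocks"
          << (isSparseServer_ ? " (sparse server)" : " (dense server)");
```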