Use a standard RPC library instead of implementing one manually.
Created by: reyoung
background
In Paddle, we manually implemented an RPC library for PServer <=> PClient communication. ProtoServer is the base class for the RPC server, and BaseClient is the base class for the RPC client. The code here is used to register methods in PServer.
The pros and cons are listed below:
pros
- Performance is good because the implementation is very thin and fine-tuned. It is written directly on top of the raw TCP stack and the raw RDMA stack.
cons
- It is not very stable, even in an HPC cluster. The reasons are:
  - It uses send/receive IOVs, and this API in Linux uses a 32-bit length. The length may overflow for a huge neural network with only 1 or 2 PServers.
  - It has no fault-tolerance handling. Illegal data written to a buffer, or any network error, causes every process in the cluster to die.
- It doesn't provide an on_disconnect API or any way to detect a disconnected process. Currently, a single disconnect event makes every process die.
- It doesn't provide an Authentication API, which is necessary for a cluster service.
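To make the 32-bit length concern concrete, here is a minimal arithmetic sketch. It is not Paddle code. It assumes the length field is a signed 32-bit integer, that parameters are float32, and that they are sharded evenly across PServers. The function names are hypothetical and exist only for illustration.

```python
# Illustrative arithmetic only; none of these names come from Paddle.
INT32_MAX = 2**31 - 1  # assumption: the length field is a signed 32-bit integer


def bytes_per_server(num_params: int, num_servers: int, bytes_per_param: int = 4) -> int:
    """Bytes each PServer holds, assuming float32 parameters sharded evenly."""
    params_per_server = -(-num_params // num_servers)  # ceiling division
    return params_per_server * bytes_per_param


def fits_in_int32(num_bytes: int) -> bool:
    """True if the byte count can be stored in a signed 32-bit length."""
    return 0 <= num_bytes <= INT32_MAX


def wrap_to_int32(num_bytes: int) -> int:
    """Value a signed 32-bit field would hold after two's-complement truncation."""
    wrapped = num_bytes & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped >= 2**31 else wrapped


if __name__ == "__main__":
    num_params = 1_000_000_000  # a hypothetical model with one billion parameters
    for servers in (1, 2, 4):
        size = bytes_per_server(num_params, servers)
        print(servers, size, fits_in_int32(size), wrap_to_int32(size))
```

Under these assumptions, one billion float32 parameters on a single PServer is 4,000,000,000 bytes, which does not fit and would wrap to a negative length. With two PServers each shard is 2,000,000,000 bytes, which fits but sits close to the limit, so a somewhat larger model would overflow again.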
conclusion
Using a standard PROTO RPC library would make the cluster API easier to expose and new logic easier to write, but it may hurt Paddle's current network performance.
So, should we replace Paddle's RPC with a standard RPC library?
Please thumbs up/down this issue to vote!