diff --git a/doc/design/cluster_train/pserver_client.md b/doc/design/cluster_train/pserver_client.md index 96cc649b2c2e26743348e4f6c7f571d5c7d750ca..c1cb93434e58a8f2fdfaa7c4831d17c4974e6782 100644 --- a/doc/design/cluster_train/pserver_client.md +++ b/doc/design/cluster_train/pserver_client.md @@ -2,6 +2,20 @@ For an overview of trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the parameter server's client library, which will manage communication with parameter servers. The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file. +## Parameter Initialization + +The parameters on parameter servers need to be initialized. To provide maximum flexibility, we need to allow trainer initialized the parameters. Only one trainer will do the initialization, the other trainers will wait for the completion of initialization and get the parameters from the parameter servers. + +To select the trainer for initialization, every trainer will try to get a distributed lock, whoever owns the lock will do the initialization. As illustrated below: + + +The select process is encapsulated in the C API function: +```c +int paddle_begin_init_params(paddle_pserver_client* client, const char* config_proto); +``` +The selected trainer's call to `paddle_begin_init_params` will return with 1, and the other trainers' call to `paddle_begin_init_params` will block until initialization is done, and return 0. As illustrated below: + + ## C Interface ```c diff --git a/doc/design/cluster_train/src/init_lock.graffle b/doc/design/cluster_train/src/init_lock.graffle new file mode 100644 index 0000000000000000000000000000000000000000..fa9149f21b1311eed48ef72ec55e556559d0fc94 Binary files /dev/null and b/doc/design/cluster_train/src/init_lock.graffle differ diff --git a/doc/design/cluster_train/src/init_lock.png b/doc/design/cluster_train/src/init_lock.png new file mode 100644 index 0000000000000000000000000000000000000000..92404ee6d6c0f9a7727952bae3c869ba338ecd7f Binary files /dev/null and b/doc/design/cluster_train/src/init_lock.png differ diff --git a/doc/design/cluster_train/src/pserver_init.graffle b/doc/design/cluster_train/src/pserver_init.graffle new file mode 100644 index 0000000000000000000000000000000000000000..730d3a561ffdc19e723b3cf6612471440951826a Binary files /dev/null and b/doc/design/cluster_train/src/pserver_init.graffle differ diff --git a/doc/design/cluster_train/src/pserver_init.png b/doc/design/cluster_train/src/pserver_init.png new file mode 100644 index 0000000000000000000000000000000000000000..4d502226d82ba271c50ae1bec5efaaaac4cc4434 Binary files /dev/null and b/doc/design/cluster_train/src/pserver_init.png differ