# Design Doc: The Client Library of Parameter Server

For an overview of trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the parameter server's client library, which will manage communication with parameter servers. The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file.

## Parameter Partition

Each parameter will be partitioned into parameter chunks so that the parameters are evenly distributed across the parameter servers. The partitioning is done automatically by the client library. *Sparse parameters* require slightly different treatment:

### Sparse Parameter

A sparse parameter is a parameter that is updated sparsely. The name is somewhat misleading: it does not have a sparse representation; conceptually, it is a dense vector.

Because a sparse parameter is updated sparsely, the trainer has to partition it. And because the parameter server merges all shards of a sparse parameter into the same file when saving the parameter, the shards need a special naming convention:

If a sparse parameter is partitioned into n shards, they should be named as:

```text
name:sparse-0
name:sparse-1
...
name:sparse-n-1
```
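
For illustration, a minimal sketch of how a client might generate these shard names (`sparse_shard_name` is a hypothetical helper, not part of the proposed API):

```c
#include <stdio.h>

/* Hypothetical helper: writes the name of shard i of a sparse
 * parameter into buf, following the "name:sparse-i" convention,
 * e.g. sparse_shard_name("embedding", 2, ...) -> "embedding:sparse-2". */
void sparse_shard_name(const char* name, int i, char* buf, size_t buf_len) {
    snprintf(buf, buf_len, "%s:sparse-%d", name, i);
}
```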

## Gradient Optimization

There are two ways to perform model optimization according to gradients:

- On Client
  The client runs the forward and backward passes for multiple steps. In each step, gradients are calculated and a new model is generated. After some number of steps, the client computes the difference between the newest model and the model from step 0, and sends that difference to the parameter servers. The parameter servers update their parameters using the difference directly, without any gradient-based optimization (such as Adam or L1 regularization); a sketch of this difference computation follows the list below.

- On Parameter Server
  The client sends gradients to the parameter servers, and the parameter servers perform the optimization using those gradients.
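
For illustration, the client-side ("On Client") mode amounts to sending a model delta instead of raw gradients. Below is a minimal sketch, assuming float32 parameters; `compute_model_delta` is a hypothetical helper, not part of the proposed API:

```c
/* Minimal sketch of the "On Client" mode: after k local steps, send
 * the difference between the newest model and the model at step 0. */
void compute_model_delta(const float* old_params, const float* new_params,
                         float* delta, int len) {
    for (int i = 0; i < len; i++) {
        /* The parameter server adds this delta to its copy verbatim,
         * without running an optimizer such as Adam. */
        delta[i] = new_params[i] - old_params[i];
    }
}
```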

## L1 and L2 Regularization

PaddlePaddle allows L1 or L2 regularization to be specified per parameter. Since the parameter server may perform the optimization, the trainer needs to pass a parameter configuration to the parameter servers when initializing each parameter, indicating which regularization to apply.
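
For instance, the per-parameter configuration could carry the regularization settings in a message along these lines (a hypothetical protobuf text-format example; the exact field names depend on PaddlePaddle's parameter configuration proto):

```text
# Hypothetical per-parameter configuration in protobuf text format.
name: "fc1.w"
decay_rate: 0.01      # L2 regularization coefficient
decay_rate_l1: 0.0    # L1 regularization coefficient
```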

## Parameter Initialization

The parameters on the parameter servers need to be initialized. To provide maximum flexibility, we allow a trainer to initialize the parameters. Only one trainer will do the initialization; the other trainers will wait for the initialization to complete and then get the parameters from the parameter servers.

### Trainer Selection

To select the trainer for initialization, every trainer will try to acquire a distributed lock; whoever owns the lock will do the initialization. As illustrated below:

<img src="./src/init_lock.png">

### Selection Process

The selection process is encapsulated in the C API function:
```c
int paddle_begin_init_params(paddle_pserver_client* client, const char* config_proto);
```
The selected trainer's call to `paddle_begin_init_params` will return 1, while the other trainers' calls will block until initialization is done and then return 0. As illustrated below:

<img src="./src/pserver_init.png">
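
In terms of the C interface declared in the next section, the initialization flow on every trainer could look like the following minimal sketch. `params`, `param_config_protos`, `param_names`, `dst_params`, and `n` are assumed to come from the trainer's setup code, and error handling is simplified to exiting so the cluster management system can restart the trainer:

```c
#include <stdlib.h>

/* Sketch of the initialization flow run by every trainer. */
void init_or_fetch_params(paddle_pserver_client* client,
                          const char* pserver_config_proto,
                          paddle_parameter* params,
                          const char** param_config_protos,
                          const char** param_names,
                          paddle_parameter* dst_params, int n) {
    if (paddle_begin_init_params(client, pserver_config_proto)) {
        /* This trainer holds the distributed lock: push initial values. */
        for (int i = 0; i < n; i++) {
            if (paddle_init_param(client, params[i], param_config_protos[i]) != 0)
                exit(1); /* restart and retry the whole initialization */
        }
        if (paddle_finish_init_params(client) != 0)
            exit(1);
    } else {
        /* paddle_begin_init_params returned 0, so another trainer has
         * finished initialization; fetch the initialized parameters. */
        if (paddle_get_params(client, param_names, dst_params, n) != 0)
            exit(1);
    }
}
```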

## C Interface

```c
typedef enum {
  PADDLE_ELEMENT_TYPE_INT32   = 0,
  PADDLE_ELEMENT_TYPE_UINT32  = 1,
  PADDLE_ELEMENT_TYPE_INT64   = 2,
  PADDLE_ELEMENT_TYPE_UINT64  = 3,
  PADDLE_ELEMENT_TYPE_FLOAT32 = 4,
  PADDLE_ELEMENT_TYPE_FLOAT64 = 5,
} paddle_element_type;

typedef struct {
  char*               name;
  paddle_element_type element_type;
  void*               content;
  int                 content_len;
} paddle_parameter, paddle_gradient;

typedef struct paddle_pserver_client paddle_pserver_client;

paddle_pserver_client* paddle_new_pserver_client();
void paddle_pserver_client_release(paddle_pserver_client* client);

/**
 * @brief paddle_begin_init_params begins to initialize parameters on
 * parameter servers.
 *
 * paddle_begin_init_params will be called from multiple trainers,
 * only one trainer will be selected to initialize the parameters on
 * parameter servers. Other trainers will be blocked until the
 * initialization is done, and they need to get the initialized
 * parameters from parameter servers using @paddle_get_params.
 *
 * @param pserver_config_proto serialized parameter server configuration in
 * Protocol Buffers format.
 * @return 1 if the trainer is selected to initialize parameter
 * servers, otherwise 0.
 */
int paddle_begin_init_params(paddle_pserver_client* client, const char* pserver_config_proto);

/**
 * @brief paddle_init_param initializes the parameter on parameter
 * servers.
 *
 * @param param the parameter to initialize.
 * @param param_config_proto the configuration for the parameter.
 * @return 0 if successful, otherwise -1. On failure, the trainer
 * needs to restart the entire initialization process (starting from
 * @paddle_begin_init_params). Or simply exit the program and wait for
 * the cluster management system to restart the trainer.
 */
int paddle_init_param(paddle_pserver_client* client, paddle_parameter param, const char* param_config_proto);

/**
 * @brief paddle_finish_init_params tells the parameter servers that the
 * client has sent all the parameters for initialization.
 *
 * @return 0 if successful, otherwise -1. On failure, the trainer
 * needs to restart the entire initialization process (starting from
 * @paddle_begin_init_params). Or simply exit the program and wait for
 * the cluster management system to restart the trainer.
 */
int paddle_finish_init_params(paddle_pserver_client* client);

/**
 * @brief paddle_send_grads sends gradients to parameter servers for
 * updating parameters.
 *
 * @param grads the array of gradients to send.
 * @param len the length of the gradient array.
 * @return 0 if successful, otherwise -1.
 */
int paddle_send_grads(paddle_pserver_client* client, const paddle_gradient* grads, int len);

/**
 * @brief paddle_get_params gets parameters from parameter servers.
 *
 * @param names the array of names of the parameters to get.
 * @param dst the destination array of parameters to save to.
 * @param len the length of the names array and the paddle_parameter
 * array.
 * @return 0 if successful, otherwise -1.
 */
int paddle_get_params(paddle_pserver_client* client, const char** names, paddle_parameter* dst, int len);

/**
 * @brief paddle_save_model tells the parameter servers to save the
 * parameters to the given path.
 *
 * @param path the path to save parameters.
 * @return 0 if successful, otherwise -1.
 */
int paddle_save_model(paddle_pserver_client* client, const char* path);
```
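
Putting the interface together, a trainer's steady-state loop could look like the following minimal sketch. `compute_gradients` is a hypothetical forward/backward pass and the checkpoint path is purely illustrative; `client`, `grads`, `names`, `dst`, and `n` are assumed to be set up as in the initialization sketch above:

```c
#include <stdlib.h>

/* Sketch of the steady-state loop: push gradients, pull updated
 * parameters, and finally checkpoint the model. */
void train_loop(paddle_pserver_client* client, paddle_gradient* grads,
                const char** names, paddle_parameter* dst, int n,
                int num_steps) {
    for (int step = 0; step < num_steps; step++) {
        compute_gradients(grads, n); /* hypothetical forward/backward pass */

        /* Push gradients; the parameter servers run the optimizer. */
        if (paddle_send_grads(client, grads, n) != 0)
            exit(1);
        /* Pull the freshly updated parameters for the next step. */
        if (paddle_get_params(client, names, dst, n) != 0)
            exit(1);
    }
    /* Optionally checkpoint the model on the parameter servers;
     * the path here is purely illustrative. */
    paddle_save_model(client, "/example/checkpoint");
    paddle_pserver_client_release(client);
}
```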