# Design Doc: The Client Library of Parameter Server

For an overview of the trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the parameter server's client library, which will manage communication with parameter servers. The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file.

## Parameter Partition

Each parameter will be partitioned into parameter blocks so that the parameters are evenly distributed across the parameter servers. The partitioning is done automatically by the client library. *Sparse parameters* require slightly different treatment:

### Sparse Parameter

A sparse parameter is a parameter that is updated sparsely. The name is somewhat misleading: it does not have a sparse representation; it has the same representation as a dense vector.

Because a sparse parameter is updated sparsely, the trainer has to partition it. And because the parameter servers will merge all shards of a sparse parameter into the same file when saving the parameter, a special naming convention is needed:

If a sparse parameter is partitioned into n shards, they should be named as:

```text
name:sparse-0
name:sparse-1
...
name:sparse-n-1
```

The library is unaware of the partition and treats each parameter independently. Only when saving parameters will the parameter servers merge the sparse parameter shards according to the naming convention.
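
As an illustration only, here is a minimal sketch of how a trainer could construct shard names following this convention and push them to the parameter servers. It assumes the `paddle_parameter` struct and the `paddle_init_param` function declared in the C Interface section below; the parameter name `embedding`, the shard count, and the configuration buffer are hypothetical examples.

```c
#include <stdio.h>
#include <string.h>

/* Sketch only: assumes the declarations from the "C Interface" section
 * below. "embedding", the shard count, and the config buffer are
 * hypothetical examples. */
void init_sparse_shards(paddle_pserver_client* client, paddle_parameter* shards,
                        int num_shards, const unsigned char* config, int config_len) {
  for (int i = 0; i < num_shards; i++) {
    char name[64];
    /* Shard i of the sparse parameter "embedding" is named "embedding:sparse-i". */
    snprintf(name, sizeof(name), "embedding:sparse-%d", i);
    shards[i].name = strdup(name);  /* each shard keeps its own copy of the name */
    paddle_init_param(client, shards[i], config, config_len);
  }
}
```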

## Model Optimization Using Gradients

There are two ways to perform model optimization using gradients:

- On Client

  The client does multiple steps of forward and backward passes. In each step, the gradients are calculated and a new model is generated. After some steps, the client calculates the difference between the newest model and the model at step 0, and sends the difference to the parameter servers. The parameter servers simply apply the difference to the parameters, without doing any optimization using gradients (such as Adam or L1 regularization). A sketch of this mode follows the list.

- On Parameter Server

  The client sends accumulated gradients to the parameter servers, and the parameter servers perform the optimization using these gradients.
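
As a rough sketch of the on-client mode mentioned above (not part of the design itself), the code below computes the difference between the newest model and the model at step 0 and sends it to the parameter servers as if it were a gradient. It assumes the C Interface declared later in this doc; the parameter name `w` and the float buffers are hypothetical examples.

```c
/* Sketch only: assumes the "C Interface" section below. The parameter
 * name "w" and the float buffers are hypothetical examples. */
int send_model_diff(paddle_pserver_client* client, const float* newest_model,
                    const float* model_at_step0, float* diff, int num_elems) {
  /* The "gradient" sent in this mode is just the model difference. */
  for (int i = 0; i < num_elems; i++) {
    diff[i] = newest_model[i] - model_at_step0[i];
  }

  paddle_gradient grad;
  grad.name = (char*)"w";
  grad.element_type = PADDLE_ELEMENT_TYPE_FLOAT32;
  grad.content = diff;
  grad.content_len = num_elems * (int)sizeof(float);

  /* The parameter servers apply the difference directly, without running
   * an optimizer of their own. */
  return paddle_send_grads(client, &grad, 1);
}
```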

## L1 and L2 Regularization

PaddlePaddle allows L1 or L2 regularization to be specified per parameter, so when the trainer initializes a parameter it needs to include a parameter configuration whenever L1 or L2 regularization is necessary.

## Parameter Initialization

The parameters on the parameter servers need to be initialized. To provide maximum flexibility, the trainer will initialize the parameters. Only one trainer will do the initialization; the other trainers will wait for the initialization to complete and then get the parameters from the parameter servers.

### Trainer Selection

To select the trainer for initialization, every trainer will try to acquire a distributed lock; whoever owns the lock will do the initialization. As illustrated below:

<img src="./src/init_lock.png">

### Trainer Selection Process

The trainer selection process is encapsulated in the C API function:
```c
int paddle_begin_init_params(paddle_pserver_client* client);
```
The selected trainer's call to `paddle_begin_init_params` will return 1, and the other trainers' calls to `paddle_begin_init_params` will return 0. `paddle_get_params` will block until initialization is completed. As illustrated below:

<img src="./src/pserver_init.png">
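
A minimal sketch of this selection logic on the trainer side, assuming the C Interface declared in the next section (the parameter array, names, configuration buffer, and counts are hypothetical placeholders):

```c
/* Sketch only: assumes the "C Interface" declared in the next section.
 * params, names, config and num_params are hypothetical placeholders. */
void initialize_parameters(paddle_pserver_client* client, paddle_parameter* params,
                           const char** names, const unsigned char* config,
                           int config_len, int num_params) {
  if (paddle_begin_init_params(client)) {
    /* This trainer holds the lock: push the initial values of all parameters. */
    for (int i = 0; i < num_params; i++) {
      paddle_init_param(client, params[i], config, config_len);
    }
    paddle_finish_init_params(client);
  } else {
    /* Not selected: paddle_get_params blocks until the selected trainer has
     * finished initialization, then fills `params`. */
    paddle_get_params(client, names, params, num_params);
  }
}
```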

## C Interface

```c
typedef enum {
  PADDLE_ELEMENT_TYPE_INT32   = 0,
  PADDLE_ELEMENT_TYPE_UINT32  = 1,
  PADDLE_ELEMENT_TYPE_INT64   = 2,
  PADDLE_ELEMENT_TYPE_UINT64  = 3,
  PADDLE_ELEMENT_TYPE_FLOAT32 = 4,
  PADDLE_ELEMENT_TYPE_FLOAT64 = 5,
} paddle_element_type;

typedef struct {
  char*               name;
  paddle_element_type element_type;
  void*               content;
  int                 content_len;
} paddle_parameter, paddle_gradient;

typedef struct paddle_pserver_client paddle_pserver_client;

paddle_pserver_client* paddle_new_pserver_client();
void paddle_pserver_client_release(paddle_pserver_client* client);

/**
 * @brief paddle_begin_init_params begins to initialize parameters on
 * parameter servers.
 *
 * paddle_begin_init_params will be called from multiple trainers,
 * only one trainer will be selected to initialize the parameters on
 * parameter servers. Other trainers need to get the initialized
 * parameters from parameter servers using @paddle_get_params.
 *
 * @return 1 if the trainer is selected to initialize parameter
 * servers, otherwise 0.
 */
int paddle_begin_init_params(paddle_pserver_client* client);

/**
 * @brief paddle_init_param initializes the parameter on parameter
 * servers.
 *
 * @param param the parameter to initialize.
 * @param param_config_proto the configuration for the parameter.
 * @param config_len the length of param_config_proto
 * @return 0 if successful, otherwise -1. On failure, the trainer
 * needs to restart the entire initialization process (starting from
 * @paddle_begin_init_params), or simply exit the program and wait for
 * the cluster management system to restart the trainer.
 */
int paddle_init_param(paddle_pserver_client* client, paddle_parameter param, const unsigned char* param_config_proto, int config_len);

/**
 * @brief paddle_finish_init_params tells the parameter servers that the client has
 * sent all parameters to parameter servers as initialization.
 *
 * @return 0 if successful, otherwise -1. On failure, the trainer
 * needs to restart the entire initialization process (starting from
 * @paddle_begin_init_params), or simply exit the program and wait for
 * the cluster management system to restart the trainer.
 */
int paddle_finish_init_params(paddle_pserver_client* client);

/**
 * @brief paddle_send_grads sends gradients to parameter servers for
 * updating parameters.
 *
 * @param grads the array of gradients to send.
 * @param len the length of the gradient array.
 * @return 0 if successful, otherwise -1.
 */
int paddle_send_grads(paddle_pserver_client* client, const paddle_gradient* grads, int len);

/**
 * @brief paddle_get_params gets parameters from parameter servers.
 *
 * paddle_get_params will block until parameters are initialized on
 * the parameter servers.
 *
 * @param names the array of names of the parameters to get.
 * @param dst the destination array of parameters to save to.
 * @param len the length of the names array and the paddle_parameter
 * array.
 * @return 0 if successful, otherwise -1.
 */
int paddle_get_params(paddle_pserver_client* client, const char** names, paddle_parameter* dst, int len);

/**
 * @brief paddle_save_model tells the parameter servers to save the
 * parameters to the given path.
 *
 * @param path the path to save parameters.
 * @return 0 if successful, otherwise -1.
 */
int paddle_save_model(paddle_pserver_client* client, const char* path);
```
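
To tie the interface together, here is a hedged end-to-end sketch of a trainer's main loop after initialization (as sketched earlier), using the on-parameter-server optimization mode. It assumes the declarations above; `compute_gradients`, the arrays, the step count, and the save path are hypothetical placeholders.

```c
/* Sketch only: assumes the declarations above. compute_gradients, the
 * arrays, num_steps and the save path are hypothetical placeholders. */
extern void compute_gradients(paddle_gradient* grads, int num_params);

void train(const char** names, paddle_parameter* params,
           paddle_gradient* grads, int num_params, int num_steps) {
  paddle_pserver_client* client = paddle_new_pserver_client();

  for (int step = 0; step < num_steps; step++) {
    /* The forward/backward pass happens elsewhere; it fills `grads`. */
    compute_gradients(grads, num_params);

    /* Push gradients; the parameter servers run the optimizer. */
    if (paddle_send_grads(client, grads, num_params) != 0) {
      /* On failure, the trainer could retry or exit and let the cluster
       * management system restart it. */
      break;
    }

    /* Pull the freshly updated parameters for the next step. */
    paddle_get_params(client, names, params, num_params);
  }

  /* Ask the parameter servers to persist the model. */
  paddle_save_model(client, "/path/to/model");  /* hypothetical path */

  paddle_pserver_client_release(client);
}
```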