# Design Doc: Prefetching Parameter From Parameter Server

## Abstract

We propose an approach to pre-fetch parameters from a Parameter Server during distributed training, so that Fluid can train a model whose parameters are too numerous to fit in a single trainer's memory.

## Background

For an embedding layer, the number of trainable parameters may be so large that they cannot be stored in one trainer's memory. In Fluid distributed training,
the [Distributed Transpiler](./parameter_server.md#distributed-transpiler) splits every parameter into a number of smaller parameters that are stored on the Parameter Servers. Hence, we can pre-fetch the parameters from the specified Parameter Server using the input `Ids`.

## Design

Before reading this design, readers should be familiar with the Fluid [Distributed Training Architecture](./distributed_architecture.md) and
[Parameter Server](./parameter_server.md).

### Partitioned Parameter

<img src="src/split_parameter.png" width="400" />

- The **Distributed Transpiler** splits a large parameter
(`weight`) into several partitioned parameters (`weight_0`, `weight_1`, `weight_2`), as shown in the
figure above.
- The partitioned parameters can be distributed across the Parameter Servers in `round-robin` order.
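
The two steps above can be sketched in plain Python. This is only an illustration, not the Distributed Transpiler's actual code: it assumes the parameter is split into contiguous row blocks of nearly equal size, and the helper names (`split_rows`, `round_robin`) and the endpoint strings are hypothetical.

```python
def split_rows(num_rows, num_blocks):
    """Split the row range [0, num_rows) into contiguous, near-equal blocks.

    Assumption: the partitioning is by contiguous rows; the real transpiler
    may use a different rule.
    """
    base, extra = divmod(num_rows, num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more row
        blocks.append((start, start + size))
        start += size
    return blocks


def round_robin(block_names, endpoints):
    """Assign the i-th partitioned parameter to endpoint i mod len(endpoints)."""
    return {name: endpoints[i % len(endpoints)]
            for i, name in enumerate(block_names)}


# A `weight` with 10 rows split into 3 partitioned parameters.
blocks = split_rows(10, 3)
names = ["weight_%d" % i for i in range(len(blocks))]

# Hypothetical Parameter Server endpoints.
placement = round_robin(names, ["ps0:6174", "ps1:6174"])
```

With two servers and three blocks, `weight_0` and `weight_2` land on the first endpoint and `weight_1` on the second.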

### Pre-fetching Parameters

<img src="src/prefetch_parameters.png" width="400" />

- The `prefetch_rpc` operator prefetches the parameter from the different Parameter
    Servers using the input `Ids`. We use [SelectedRows](../../../design/selected_rows.md)
    as the received variable type.
- The `merge_selected_rows` operator merges the received parameters into one
    `SelectedRows` variable.
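
The data flow of these two operators can be sketched as follows. This is a toy, in-process simulation under stated assumptions rather than the real operators: `SelectedRows` is reduced to a list of global row ids plus the matching row vectors, the weight is assumed to be partitioned into contiguous row blocks, and `fake_prefetch_rpc` is a hypothetical stand-in for the network call.

```python
import bisect


class SelectedRows:
    """Toy stand-in: `rows` holds global row ids, `values` the matching rows."""

    def __init__(self, rows, values):
        self.rows = list(rows)
        self.values = [list(v) for v in values]


def group_ids_by_block(ids, block_starts):
    """Map each distinct id to the index of the block that owns it."""
    groups = {}
    for i in sorted(set(ids)):
        block = bisect.bisect_right(block_starts, i) - 1
        groups.setdefault(block, []).append(i)
    return groups


def fake_prefetch_rpc(block_rows, block_start, ids):
    """Hypothetical server side: return the requested rows of one block."""
    return SelectedRows(ids, [block_rows[i - block_start] for i in ids])


def merge_selected_rows(parts):
    """Concatenate several SelectedRows into one."""
    rows, values = [], []
    for part in parts:
        rows.extend(part.rows)
        values.extend(part.values)
    return SelectedRows(rows, values)


# A 10-row weight, partitioned into blocks starting at rows 0, 4 and 7.
weight = [[float(r), float(r) * 10] for r in range(10)]
block_starts = [0, 4, 7]
block_ends = block_starts[1:] + [len(weight)]
blocks = [weight[s:e] for s, e in zip(block_starts, block_ends)]

ids = [8, 1, 5, 1]
groups = group_ids_by_block(ids, block_starts)
parts = [fake_prefetch_rpc(blocks[b], block_starts[b], groups[b])
         for b in sorted(groups)]
merged = merge_selected_rows(parts)
```

Only the three distinct rows named by `ids` are transferred, so the trainer never holds the full `weight`.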

## TODO

- `prefetch_rpc` operator that sends row indices and receives `SelectedRows` variables.
- `lookup_table` needs to support the `SelectedRows` variable type as input `Weight`.
- Async update: to avoid being held back by slow nodes, async update is important for distributed training.
  We need a design doc for it and will implement it in the future.
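
As a rough illustration of the `lookup_table` item above, the sketch below looks up `Ids` in a weight that holds only the prefetched rows. It is not the Fluid operator: the weight is passed as two plain lists (global row ids and row vectors), which is an assumed simplification of `SelectedRows`, and the function signature is hypothetical.

```python
def lookup_table(ids, weight_rows, weight_values):
    """Return the embedding row for each id, in the order the ids were given.

    weight_rows:   global row ids present in the prefetched weight
    weight_values: the row vectors, aligned with weight_rows
    """
    # Translate a global row id into its position within the prefetched rows.
    position = {row: k for k, row in enumerate(weight_rows)}
    return [weight_values[position[i]] for i in ids]


rows = [1, 5, 8]
values = [[1.0, 10.0], [5.0, 50.0], [8.0, 80.0]]
out = lookup_table([8, 1, 5, 1], rows, values)
```

The key difference from a dense lookup is the extra id-to-position mapping, since row `8` of the full weight is stored at position `2` of the prefetched variable.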