# Design Doc: Lookup Remote Table in Distributed Training

## Abstract

We propose an approach to pre-fetch parameters from the Parameter Server during distributed training, so that Fluid can train a model whose parameters are too large to be stored in a single trainer's memory.

## Background

For an embedding layer, the trainable parameter may be very large and may not fit in a single trainer's memory. In Fluid distributed training,
the [Distributed Transpiler](./parameter_server.md#distributed-transpiler) splits every large parameter into smaller partitioned parameters that are stored on the Parameter Servers. Hence, we can pre-fetch only the needed rows of the parameter from the corresponding Parameter Server using the input `Ids`.
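
To make this concrete, here is a minimal, hypothetical sketch (not the transpiler's actual code) of how an input id can be mapped to the Parameter Server that stores its row, assuming a simple round-robin placement of rows:

```python
# Hypothetical illustration: if the rows of a large parameter are placed on
# the Parameter Servers in a round-robin fashion, the input `Ids` tell us
# which server each needed row must be pre-fetched from.
NUM_PSERVERS = 3   # assumed number of Parameter Servers

def server_for_row(row_id):
    return row_id % NUM_PSERVERS

ids = [0, 4, 7, 11]
print({i: server_for_row(i) for i in ids})   # {0: 0, 4: 1, 7: 1, 11: 2}
```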

## Design

Prior to reading this design, it would be useful for the reader to be familiar with the Fluid [Distributed Training Architecture](./distributed_architecture.md) and
[Parameter Server](./parameter_server.md) designs.

The execution of `lookup local table` is as follows:

<img src="src/lookup_local_table.png" width="700" />

In some cases the parameter (`weight`) may be very large, e.g. 10 billion features, so the entire
table cannot be stored in one trainer's memory. We therefore need to partition this parameter and
pre-fetch the needed rows at the beginning of each mini-batch; we call this `lookup remote table`:
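
To see the scale involved, here is a rough back-of-the-envelope calculation (the embedding width and data type are assumptions; only the feature count comes from the example above):

```python
# Assumed: 10 billion features with 64-dimensional float32 embeddings.
rows, dim, bytes_per_float = 10**10, 64, 4
size_tib = rows * dim * bytes_per_float / 2**40
print(f"{size_tib:.1f} TiB")  # ~2.3 TiB, far beyond a single trainer's memory
```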

<img src="src/lookup_remote_table.png" width="700" />

The processing flow of `lookup remote table` is as follows:

1. partitioning the parameter

    <img src="src/split_parameter.png" width="400" />

    - The **Distributed Transpiler** splits the large parameter
      (`weight`) into partitioned parameters (`weight_0`, `weight_1`, `weight_2`), as shown in the figure above.
    - We can use a `round-robin` strategy to distribute the partitioned parameters across the Parameter Servers.

1. pre-fetching the parameter at the beginning of each mini-batch

    - The `prefetch_rpc` operator pre-fetches the parameter rows from the different Parameter
      Servers using the input `Ids`. We use [SelectedRows](../../../design/selected_rows.md)
      as the received variable type.
    - The `merge_selected_rows` operator merges the received parameters into one
      `SelectedRows` variable (a simplified end-to-end sketch follows this list).
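
The flow above can be illustrated with a short, NumPy-only sketch. Everything below (the round-robin placement, the `prefetch` and `merge_selected_rows` helpers, the table sizes) is an illustrative assumption, not the actual implementation of the Fluid operators:

```python
import numpy as np

NUM_PSERVERS = 3  # assumed number of Parameter Servers
EMB_DIM = 4       # assumed embedding dimension

# 1. Partitioned parameter: row i of `weight` lives on server i % NUM_PSERVERS.
full_weight = np.arange(12 * EMB_DIM, dtype=np.float32).reshape(12, EMB_DIM)
shards = {s: {i: full_weight[i] for i in range(12) if i % NUM_PSERVERS == s}
          for s in range(NUM_PSERVERS)}

def prefetch(ids):
    """Stand-in for `prefetch_rpc`: ask each server for the rows it owns."""
    results = []
    for s in range(NUM_PSERVERS):
        wanted = [i for i in ids if i % NUM_PSERVERS == s]
        if wanted:
            # Each reply is SelectedRows-like: (row indices, dense row values).
            results.append((wanted, np.stack([shards[s][i] for i in wanted])))
    return results

def merge_selected_rows(results):
    """Stand-in for `merge_selected_rows`: combine all replies into one
    SelectedRows-like (rows, values) pair."""
    rows = [i for r, _ in results for i in r]
    values = np.concatenate([v for _, v in results], axis=0)
    return rows, values

# 2. Pre-fetch at the beginning of a mini-batch, then do the local lookup.
ids = [3, 7, 2]
rows, values = merge_selected_rows(prefetch(ids))
row_of = {r: k for k, r in enumerate(rows)}
output = values[[row_of[i] for i in ids]]
assert np.allclose(output, full_weight[ids])
```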

## TODO

- Implement the `prefetch_rpc` operator to send row indices and receive `SelectedRows` variables.
- `lookup_table` needs to support the `SelectedRows` variable type as the input `Weight`.
- Async update: to avoid slow nodes, asynchronous updates are important for distributed training;
  we need a design doc and an implementation for this in the future.