# Design Doc: Large Model

## Abstract

We propose an approach to support large parameters.
The parameter of an embedding layer may be too large to fit in
one trainer's memory. In this approach, a Trainer prefetches the needed
slices of the parameter from different Parameter Server instances
according to the input `Ids`, runs forward and backward, and then sends
the gradients back to the Parameter Servers, which execute the optimize program.

## Design

**NOTE**: this approach is a feature of Fluid distributed training. You may want
to read [Distributed Architecture](./distributed_architecture.md) and
[Parameter Server](./parameter_server.md) before the following content.

Fluid large model distributed training uses the
[Distributed Transpiler](./parameter_server.md#distributed-transpiler) to split
a large parameter into multiple sliced parameters that are stored on the Parameter
Servers; the Trainer then prefetches the slices it needs through the `RPC` interface.

### Split Large Parameter

<img src="src/split_parameter.png" width="400" />

The **Distributed Transpiler** splits the large parameter
(weight) into sliced parameters (weight_0, weight_1, weight_2), as the
figure above shows.
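
As an illustration only (not the transpiler's actual code), the following sketch splits a made-up embedding weight into contiguous row blocks, one per Parameter Server; all shapes and names are assumptions.

```python
import numpy as np

# Hypothetical shapes for illustration only.
vocab_size, embed_dim, num_pservers = 10000, 64, 3
weight = np.random.rand(vocab_size, embed_dim).astype("float32")

# np.array_split keeps the row blocks contiguous and tolerates uneven division.
sliced_weights = np.array_split(weight, num_pservers, axis=0)
for i, block in enumerate(sliced_weights):
    # Each block would be placed on a different Parameter Server as weight_i.
    print("weight_%d shape: %s" % (i, block.shape))
```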

### Prefetch Parameters from Parameter Servers

<img src="src/prefetch_parameters.png" width="400" />

- The `PrefetchRpc` operator sends the row indices to the corresponding Parameter Servers,
  and then receives the `SelectedRows` they return.
- The difference from normal Fluid distributed training is that we only prefetch the rows
  selected by the input `Ids` instead of the whole parameter, as sketched below.
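
The sketch below simulates this prefetch step in plain Python/NumPy: the input `Ids` are grouped by the Parameter Server that owns them, each server answers with a `SelectedRows`-like (rows, values) pair, and the Trainer reassembles the lookup result in the original order. The RPC transport, shapes, and helper names are all assumptions for illustration.

```python
import numpy as np

# Hypothetical, self-contained simulation; the RPC calls are replaced by
# direct indexing into the sliced parameters weight_0..weight_2.
VOCAB, DIM, NUM_PSERVERS = 9000, 8, 3
BLOCK = VOCAB // NUM_PSERVERS                  # rows per sliced parameter
sliced = np.split(np.arange(VOCAB * DIM, dtype="float32").reshape(VOCAB, DIM),
                  NUM_PSERVERS)                # stands in for weight_0, weight_1, weight_2

def prefetch(ids):
    # 1. Group the deduplicated input Ids by the Parameter Server that owns them.
    rows_per_ps = {}
    for row in sorted(set(ids)):
        rows_per_ps.setdefault(row // BLOCK, []).append(row)
    # 2. Each Parameter Server answers with a SelectedRows-like (rows, values)
    #    pair; only the requested rows travel over the wire.
    fetched = {}
    for ps, rows in rows_per_ps.items():
        values = sliced[ps][[r % BLOCK for r in rows]]   # lookup on that shard
        fetched.update(zip(rows, values))
    # 3. Reassemble the lookup result in the original Id order.
    return np.stack([fetched[r] for r in ids])

ids = [7, 8999, 7, 4500]
print(prefetch(ids).shape)   # -> (4, 8)
```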

## TODO

- Async Update

  To avoid being blocked by slow nodes, asynchronous update is important for
  distributed training; we need a design doc for it and will implement it in the future.