# Design Doc: Large Model
## Abstract
We propose an approach to support large parameters.
The parameter of an embedding layer may be very large and could
not fit in one trainer's memory. In this approach, a Trainer
prefetches sliced parameters from different Parameter Server instances
according to the input `Ids`, then runs forward and backward, and sends
the gradients to the Parameter Servers to execute the optimize program.
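The end-to-end flow can be illustrated with a small, self-contained sketch. This is not Fluid code: the lookup table, gradient, and server-side update are all simulated with numpy, and the sizes (three servers, a 12 x 4 table) are arbitrary assumptions.

```python
import numpy as np

# Three Parameter Servers, each owning 4 rows of a 12 x 4 embedding table.
ROWS_PER_SHARD = 4
pserver_shards = [np.random.rand(ROWS_PER_SHARD, 4) for _ in range(3)]

def prefetch(ids):
    """Fetch only the rows named by `ids` from the server that owns each row."""
    return {i: pserver_shards[i // ROWS_PER_SHARD][i % ROWS_PER_SHARD].copy()
            for i in set(ids)}

def train_step(ids, lr=0.1):
    rows = prefetch(ids)                                   # sparse remote lookup
    grads = {i: np.ones_like(r) for i, r in rows.items()}  # stand-in for backward
    # "Send" the sparse gradients back; each server optimizes only the rows it owns.
    for i, g in grads.items():
        pserver_shards[i // ROWS_PER_SHARD][i % ROWS_PER_SHARD] -= lr * g

train_step([1, 5, 5, 10])
```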
## Design
Fluid large model distributed training uses the
[Distributed Transpiler](./parameter_server.md#distributed-transpiler) to split
a large parameter into multiple sliced parameters stored on the Parameter Servers,
and the Trainer prefetches them through the `RPC` interface.
### Split Large Parameter
<img src="src/split_parameter.png" width="400" />
**Distributed Transpiler** splits the large parameter
(weight) into several sliced parameters (weight_0, weight_1, weight_2), as shown in the
figure above.
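To make the row-wise split rule concrete, here is a minimal sketch. It is not the Distributed Transpiler code: the helper `split_rows`, the even-split policy, and the sizes are illustrative assumptions.

```python
import numpy as np

def split_rows(weight, pserver_num):
    """Split a (vocab_size, emb_dim) weight row-wise into `pserver_num` shards.

    Shard i holds rows [boundaries[i], boundaries[i + 1]) and would be placed
    on Parameter Server i, e.g. weight_0, weight_1, weight_2 in the figure.
    """
    vocab_size = weight.shape[0]
    # Rows per shard, rounded up so that every row belongs to exactly one shard.
    shard_size = (vocab_size + pserver_num - 1) // pserver_num
    boundaries = [min(i * shard_size, vocab_size) for i in range(pserver_num + 1)]
    shards = [weight[boundaries[i]:boundaries[i + 1]] for i in range(pserver_num)]
    return shards, boundaries

# A 10000 x 64 embedding table split across 3 Parameter Servers.
shards, boundaries = split_rows(np.zeros((10000, 64), dtype=np.float32), 3)
```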
### Prefetch Parameters from Parameter Servers
<img src="src/prefetch_parameters.png" width="400" />
- The `PrefetchRpc` operator sends the row indices to the corresponding Parameter Servers,
and then receives the SelectedRows.
- Unlike normal Fluid distributed training, we prefetch only the rows referenced by the
input `Ids` instead of the whole parameter (see the sketch after this list).
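Below is a hedged sketch of how a trainer could group the input `Ids` by the Parameter Server that owns each row before issuing the `PrefetchRpc`; `group_ids_by_server` and the `boundaries` list are illustrative assumptions, not the actual operator implementation.

```python
import bisect
from collections import defaultdict

def group_ids_by_server(ids, boundaries):
    """Map each unique row id in `ids` to (server, local row index).

    `boundaries` is the row-range split produced when the parameter was sliced,
    e.g. [0, 3334, 6668, 10000]; server i owns rows [boundaries[i], boundaries[i+1]).
    """
    per_server = defaultdict(list)
    for row_id in sorted(set(ids)):
        server = bisect.bisect_right(boundaries, row_id) - 1
        per_server[server].append(row_id - boundaries[server])  # local row index
    return per_server

# The trainer would issue one RPC per server with the grouped local indices and
# receive back only those rows (a SelectedRows-like result), instead of pulling
# the whole parameter.
print(group_ids_by_server([3, 4000, 4000, 9999], [0, 3334, 6668, 10000]))
```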
## TODO
- Async Update

  To avoid the slow-node problem, asynchronous update is important for distributed training.
  We need a design doc for it and will implement it in the future.