diff --git a/doc/fluid/design/dist_train/large_model.md b/doc/fluid/design/dist_train/large_model.md
new file mode 100644
index 0000000000000000000000000000000000000000..f82fa6f81e5d33f5006df13ca862d3a5e0c39dbb
--- /dev/null
+++ b/doc/fluid/design/dist_train/large_model.md
@@ -0,0 +1,40 @@
+# Design Doc: Large Model
+
+## Abstract
+
+We propose an approach to support large parameters.
+For an embedding layer, the parameter may be very large and may
+not fit in one Trainer's memory. In this approach, a Trainer
+prefetches slices of the parameter from different Parameter Server instances
+according to the input `Ids`, runs the forward and backward passes, and then
+sends the gradient to the Parameter Servers to execute the optimize program.
+
+## Design
+
+Fluid large model distributed training uses the
+[Distributed Transpiler](./parameter_server.md#distributed-transpiler) to split
+a large parameter into multiple parameters that are stored on Parameter Servers;
+the Trainer prefetches them through the `RPC` interface.
+
+### Split Large Parameter
+
+
+
+The **Distributed Transpiler** splits the large parameter
+(weight) into several sliced parameters (weight_0, weight_1, weight_2), as shown
+in the figure above.
+
+### Prefetch Parameters from Parameter Servers
+
+
+
+- The `PrefetchRpc` operator sends the row indices to the multiple Parameter Servers
+  and then receives the SelectedRows.
+- Unlike normal Fluid distributed training, we prefetch only the rows needed for the input `Ids`.
+
+## TODO
+
+- Async Update
+
+  Asynchronous update keeps slow nodes from stalling distributed training;
+  we need a design doc for it and plan to implement it in the future.
diff --git a/doc/fluid/design/dist_train/src/prefetch_parameters.graffle b/doc/fluid/design/dist_train/src/prefetch_parameters.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..067178972219e99add6db5a82db0f104d3517862
Binary files /dev/null and b/doc/fluid/design/dist_train/src/prefetch_parameters.graffle differ
diff --git a/doc/fluid/design/dist_train/src/prefetch_parameters.png b/doc/fluid/design/dist_train/src/prefetch_parameters.png
new file mode 100644
index 0000000000000000000000000000000000000000..ee54c35272898e0487c1193a85b204774811874b
Binary files /dev/null and b/doc/fluid/design/dist_train/src/prefetch_parameters.png differ
diff --git a/doc/fluid/design/dist_train/src/split_parameter.graffle b/doc/fluid/design/dist_train/src/split_parameter.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..6e2f13727d082dea7d3deaf99a43189dfaf29f3a
Binary files /dev/null and b/doc/fluid/design/dist_train/src/split_parameter.graffle differ
diff --git a/doc/fluid/design/dist_train/src/split_parameter.png b/doc/fluid/design/dist_train/src/split_parameter.png
new file mode 100644
index 0000000000000000000000000000000000000000..1776fb8c4eb0f86aed1cdd67f54d8a08a4e479ce
Binary files /dev/null and b/doc/fluid/design/dist_train/src/split_parameter.png differ
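
Note on `large_model.md`: the split-and-prefetch flow that the design doc describes can be illustrated with a minimal sketch. This is not Fluid code. The helper names `split_rows` and `prefetch` are hypothetical, the contiguous row-wise split is an assumption (the doc does not state how rows are assigned to slices), and plain in-process lists stand in for the Parameter Servers and the RPC calls.

```python
from collections import defaultdict


def split_rows(weight, num_shards):
    """Split a table into contiguous row blocks, one per (simulated) server.

    Assumption: rows are assigned in contiguous blocks of equal size,
    with a possibly shorter last block.
    """
    block = -(-len(weight) // num_shards)  # ceiling division
    return [weight[i * block:(i + 1) * block] for i in range(num_shards)]


def prefetch(shards, ids):
    """Fetch only the rows named by `ids`, grouping requests per shard."""
    block = len(shards[0])
    requests = defaultdict(list)
    for row_id in set(ids):  # duplicate ids are fetched once
        requests[row_id // block].append(row_id)

    fetched = {}
    for shard_idx, row_ids in requests.items():
        shard = shards[shard_idx]  # stands in for one RPC to that server
        for row_id in row_ids:
            fetched[row_id] = shard[row_id % block]

    # Return a SelectedRows-like pair: row indices plus their values.
    rows = sorted(fetched)
    return rows, [fetched[r] for r in rows]


# A 10 x 2 table split across 3 simulated servers.
weight = [[float(r), float(r) + 0.5] for r in range(10)]
shards = split_rows(weight, 3)
rows, values = prefetch(shards, [7, 2, 7, 9])
```

Here only three of the ten rows are transferred, which is the point of prefetching by `Ids` rather than pulling the whole parameter to the Trainer.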