diff --git a/doc/fluid/design/dist_train/large_model.md b/doc/fluid/design/dist_train/large_model.md
index 9689582130729553fb58353014bb4716736f1c10..92daa2b52eac5166c3113612c93ec871c763b9d0 100644
--- a/doc/fluid/design/dist_train/large_model.md
+++ b/doc/fluid/design/dist_train/large_model.md
@@ -1,44 +1,48 @@
-# Design Doc: Large Model
+# Design Doc: Prefetching Parameters from Parameter Server
 
 ## Abstract
 
-We propose an approach to support the large parameter.
-For embedding layer, the parameter may very large and could
-not be stored in one trainer's memory. In this approach, a Trainer would
-prefetch a sliced parameter from different Parameter Server instances
-according to the input `Ids`, and then run forward, backward and send
-the gradient to Parameter Server to execute the optimize program.
+We propose an approach to prefetch parameters from the Parameter
+Server during distributed training, so that Fluid can train a model
+whose parameters are too large to be stored in one trainer's memory.
+
+## Background
+
+For an embedding layer, the trainable parameter may be very large and could
+not be stored in one trainer's memory. In Fluid distributed training, the
+[Distributed Transpiler](./parameter_server.md#distributed-transpiler) splits every parameter into a number of smaller
+parameters that are stored on the Parameter Servers, so we can prefetch a
+parameter from the Parameter Server that holds it according to the input `Ids`.
 
 ## Design
 
-**NOTE**: this approach is a feature of Fluid distributed trianing, maybe you want
+This is a feature of Fluid distributed training; you may want
 to know [Distributed Architecture](./distributed_architecture.md) and
 [Parameter Server](./parameter_server.md) before reading the following content.
 
-Fluid large model distributed training use
-[Distributed Transpiler](./parameter_server.md#distributed-transpiler) to split
-a large parameter into multiple parameters which stored on Parameter Server, and
-the Trainer would prefetch them by `RPC` interface.
-
-### Split Large Parameter
+### Partitioned Parameter
 
-**Distributed Transpiler** would split the large parameter
-(weight) into some sliced parameters (weight_0, weight_1, weight_2) as the
+- **Distributed Transpiler** splits the large parameter
+(weight) into partitioned parameters (weight_0, weight_1, weight_2) as shown in the
 figure above.
+- We could use `round-robin` to distribute the partitioned parameters.
 
-### Prefetch Parameters from Parameter Servers
+### Prefetching Parameters
 
-- `PrefetchRpc` operator would send the rows index the multiple Parameter Servers,
-  and then receive the SelctedRows.
-- The different with normal Fluid distributed training, we only prefetch the rows
+- The `prefetch_rpc` operator prefetches the parameter rows from the different Parameter
+  Servers according to the input `Ids`; we use [SelectedRows](../../../design/selected_rows.md)
+  as the received variable type.
+- The `merge_selected_rows` operator merges the received parameters into one
+  `SelectedRows` variable.
 
 ## TODO
 
-- Async Update
-  - To avoid slow-node, Async update is important for distributed training,
-    we need an design doc and implement it in future.
+- Implement the `prefetch_rpc` operator to send row indices and receive `SelectedRows` variables.
+- `lookup_table` needs to support the `SelectedRows` variable type as its input `Weight`.
+- Async update: to avoid slow nodes, asynchronous update is important for distributed training;
+  we need a design doc and an implementation for it in the future.
diff --git a/doc/fluid/design/dist_train/src/prefetch_parameters.graffle b/doc/fluid/design/dist_train/src/prefetch_parameters.graffle
index c1a59b901745d81add42328b9995408d32d8127b..abbb7089829709cb5fdc337b7f9521a34f7d131a 100644
Binary files a/doc/fluid/design/dist_train/src/prefetch_parameters.graffle and b/doc/fluid/design/dist_train/src/prefetch_parameters.graffle differ
diff --git a/doc/fluid/design/dist_train/src/split_parameter.graffle b/doc/fluid/design/dist_train/src/split_parameter.graffle
index 6e2f13727d082dea7d3deaf99a43189dfaf29f3a..f020c0291e99f758f532d2611e72003478656750 100644
Binary files a/doc/fluid/design/dist_train/src/split_parameter.graffle and b/doc/fluid/design/dist_train/src/split_parameter.graffle differ
diff --git a/doc/fluid/design/dist_train/src/split_parameter.png b/doc/fluid/design/dist_train/src/split_parameter.png
index 1776fb8c4eb0f86aed1cdd67f54d8a08a4e479ce..d311c414a72517205545649a239882bdce3480c0 100644
Binary files a/doc/fluid/design/dist_train/src/split_parameter.png and b/doc/fluid/design/dist_train/src/split_parameter.png differ
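
To make the flow described in the design doc above concrete, here is a minimal NumPy sketch of the idea. This is not Fluid code: the real `prefetch_rpc` and `merge_selected_rows` are operators implemented in C++, the round-robin split is done by the Distributed Transpiler, and the dictionary-based stand-in for `SelectedRows` used here is an illustrative assumption only.

```python
import numpy as np

# Assumed toy sizes for the sketch: a [vocab_size, emb_dim] embedding
# table split row-wise across three parameter servers.
vocab_size, emb_dim, num_pservers = 10, 4, 3

# Full table, built here only so the demo can derive the per-server shards.
full_weight = np.arange(vocab_size * emb_dim, dtype=np.float32).reshape(vocab_size, emb_dim)

# "Distributed Transpiler" step (conceptually): round-robin row
# distribution, so pserver p holds rows {r | r % num_pservers == p}.
shards = {
    p: {r: full_weight[r] for r in range(vocab_size) if r % num_pservers == p}
    for p in range(num_pservers)
}

def prefetch_rpc(ids):
    """Stand-in for the `prefetch_rpc` operator: group the row ids by the
    pserver that owns them and 'receive' one SelectedRows-like result
    (rows + values) per server."""
    responses = []
    for p in range(num_pservers):
        rows = [i for i in ids if i % num_pservers == p]
        values = (np.stack([shards[p][r] for r in rows])
                  if rows else np.empty((0, emb_dim), dtype=np.float32))
        responses.append({"rows": rows, "values": values})
    return responses

def merge_selected_rows(responses):
    """Stand-in for the `merge_selected_rows` operator: concatenate the
    per-server results into a single SelectedRows-like variable."""
    rows = [r for resp in responses for r in resp["rows"]]
    values = np.concatenate([resp["values"] for resp in responses], axis=0)
    return {"rows": rows, "values": values}

ids = [2, 7, 2, 5]                      # input Ids from the data feed
merged = merge_selected_rows(prefetch_rpc(ids))

# lookup_table would then read each id's embedding from the merged
# SelectedRows instead of from a dense local weight.
row_index = {r: i for i, r in enumerate(merged["rows"])}
embeddings = merged["values"][[row_index[i] for i in ids]]
print(embeddings.shape)                 # (4, 4)
```

The point of the sketch is the communication pattern: the trainer sends only row indices over RPC and receives only the rows it needs, so the data transferred per step is proportional to the number of distinct `Ids` in the mini-batch rather than to the full size of the embedding table.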