diff --git a/doc/fluid/design/dist_train/prefetch_parameter.md b/doc/fluid/design/dist_train/prefetch_parameter.md
index e8ea7fe67b9a5666a88aa952c86477c0bc01338e..952d2bada9238b0668225893d661c4856464b35a 100644
--- a/doc/fluid/design/dist_train/prefetch_parameter.md
+++ b/doc/fluid/design/dist_train/prefetch_parameter.md
@@ -1,36 +1,45 @@
-# Design Doc: Prefetching Parameter From Parameter Server
+# Design Doc: Lookup Remote Table during Distributed Training
 
 ## Abstract
 
-We propose an approach to pre-fetch the parameters from a Parameter Server while distributed training so that Fluid is able to train a model with a large number of parameters that cannot be stored in one trainer's memory.
+We propose an approach to pre-fetch parameters from a Parameter Server during distributed training so that Fluid can train a model with a very large parameter that cannot be stored in one trainer's memory.
 
 ## Background
 
-For an embedding layer, the number of trainable parameters may be very large and it is likely that they may not be able to be stored in one trainer's memory. In Fluid distributed training,
-the [Distributed Transpiler](./parameter_server.md#distributed-transpiler) would split every parameter into a number of small parameters that are stored on the Parameter Server. Hence, we can pre-fetch the parameters from the specified Parameter Server using the input `Ids`.
+For an embedding layer, the trainable parameter may be so large that it cannot be stored in one trainer's memory. In Fluid distributed training,
+the [Distributed Transpiler](./parameter_server.md#distributed-transpiler) would split every large parameter into several small partitioned parameters that are stored on the Parameter Servers. Hence, we can pre-fetch the needed rows of the parameter from the specified Parameter Servers using the input `Ids`.
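To make the pre-fetch step concrete, the sketch below shows how input `Ids` could be routed to the Parameter Server shards that own them. This is an illustrative helper, not part of the Fluid API; it assumes round-robin row placement (row `i` lives on server `i % num_pservers`), as described in the Design section.

```python
def route_ids(ids, num_pservers):
    """Group lookup ids by the parameter server that owns each row.

    Assumes round-robin placement: global row `i` is stored on
    server `i % num_pservers`.
    """
    buckets = {s: [] for s in range(num_pservers)}
    for i in ids:
        buckets[i % num_pservers].append(i)
    return buckets

print(route_ids([0, 4, 5, 7], 3))  # prints {0: [0], 1: [4, 7], 2: [5]}
```

Each bucket would then be sent as one RPC request to the corresponding server, so a mini-batch issues at most one prefetch request per server.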
 ## Design
 
 Prior to reading this design, it would be useful for the reader to make themselves familiar with Fluid [Distributed Training Architecture](./distributed_architecture.md) and [Parameter Server](./parameter_server.md).
 
-### Partationed Parameter
+The execution of `lookup local table` is as follows:
 
-
+<img src="src/lookup_local_table.png"/>
 
-- **Distributed Transpiler** would split the large parameters
-(`weight`) into some partitioned parameters (`weight_0`, `weight_1`, `weight_2`) as shown in the
-figure above.
-- We can use `round-robin` to distribute the partitioned parameter.
+In some cases the parameter (`weight`) may be very large, e.g. 10 billion features; the entire
+data cannot be stored in one trainer's memory, so we need to partition this parameter and
+pre-fetch the needed rows at the beginning of each mini-batch. We call this `lookup remote table`:
 
-### Pre-fetching Parameters
-
+<img src="src/lookup_remote_table.png"/>
+
+The processing flow of `lookup remote table` is as follows:
 
-- `prefetch_rpc` operator would prefetch the parameter from different Parameter
+1. Partitioning the parameter
+
+   - The **Distributed Transpiler** would split the large parameter
+     (`weight`) into several partitioned parameters (`weight_0`, `weight_1`, `weight_2`) as shown in the figure above.
+   - We can use `round-robin` to distribute the partitioned parameters across the Parameter Servers.
+
+1. Pre-fetching the parameter at the beginning of each mini-batch
+
+   - The `prefetch_rpc` operator would prefetch the parameter rows from the different Parameter
Servers using the input `Ids`. We use [SelectedRows](../../../design/selected_rows.md) as the received variable type.
-- `merge_selected_rows` operator would merge the received parameters into one
+   - The `merge_selected_rows` operator would merge the received parameters into one
`SelectedRows` variable.
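The two-step flow above (per-server prefetch, then merge) can be sketched in plain Python. `fetch_rows` stands in for the `prefetch_rpc` operator and `merge_selected_rows` mimics the operator of the same name, using an `(rows, values)` pair of lists in place of a real `SelectedRows` variable; all helpers here are illustrative assumptions, not real Fluid APIs.

```python
def fetch_rows(ids, shard, num_pservers):
    """Stand-in for `prefetch_rpc`: return (id, row) pairs for the ids
    owned by one server's shard. Under round-robin placement, global
    row i sits at local index i // num_pservers on server i % num_pservers."""
    return [(i, shard[i // num_pservers]) for i in ids]

def merge_selected_rows(partials):
    """Stand-in for `merge_selected_rows`: combine the per-server
    (id, row) pairs into one SelectedRows-like (rows, values) result."""
    merged = dict(pair for pairs in partials for pair in pairs)
    rows = sorted(merged)
    return rows, [merged[i] for i in rows]

# Toy table: 6 rows, each a 2-float embedding, sharded over 3 servers
# by round-robin (server s holds global rows s, s + 3, ...).
table = [[float(r)] * 2 for r in range(6)]
shards = [table[s::3] for s in range(3)]

# Ids of one mini-batch, already routed per server: 0 -> server 0,
# 4 -> server 1, 5 -> server 2.
per_server_ids = {0: [0], 1: [4], 2: [5]}
partials = [fetch_rows(per_server_ids[s], shards[s], 3) for s in range(3)]
rows, values = merge_selected_rows(partials)
# rows == [0, 4, 5]; values holds the corresponding embedding rows.
```

The merge step deduplicates by row id, which is also why `SelectedRows` (row indices plus a dense value tensor) is a natural type for the received variable.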
 ## TODO
diff --git a/doc/fluid/design/dist_train/src/lookup_local_table.graffle b/doc/fluid/design/dist_train/src/lookup_local_table.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..180004cec944783468158ebcacd430e14a4fb2bb
Binary files /dev/null and b/doc/fluid/design/dist_train/src/lookup_local_table.graffle differ
diff --git a/doc/fluid/design/dist_train/src/lookup_local_table.png b/doc/fluid/design/dist_train/src/lookup_local_table.png
new file mode 100644
index 0000000000000000000000000000000000000000..e5719f39e5f33fa0dca8d42c9765b3b5b0dd77eb
Binary files /dev/null and b/doc/fluid/design/dist_train/src/lookup_local_table.png differ
diff --git a/doc/fluid/design/dist_train/src/lookup_remote_table.graffle b/doc/fluid/design/dist_train/src/lookup_remote_table.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..109dec020ce1abc4446c106cffc34da7b7209747
Binary files /dev/null and b/doc/fluid/design/dist_train/src/lookup_remote_table.graffle differ
diff --git a/doc/fluid/design/dist_train/src/lookup_remote_table.png b/doc/fluid/design/dist_train/src/lookup_remote_table.png
new file mode 100644
index 0000000000000000000000000000000000000000..4e2ef39964844ed71f577de0871efc3a9bd86567
Binary files /dev/null and b/doc/fluid/design/dist_train/src/lookup_remote_table.png differ
diff --git a/doc/fluid/design/dist_train/src/prefetch_parameters.graffle b/doc/fluid/design/dist_train/src/prefetch_parameters.graffle
deleted file mode 100644
index abbb7089829709cb5fdc337b7f9521a34f7d131a..0000000000000000000000000000000000000000
Binary files a/doc/fluid/design/dist_train/src/prefetch_parameters.graffle and /dev/null differ
diff --git a/doc/fluid/design/dist_train/src/prefetch_parameters.png b/doc/fluid/design/dist_train/src/prefetch_parameters.png
deleted file mode 100644
index 433ea3d612b51f325c8f87a58bb35216667910d0..0000000000000000000000000000000000000000
Binary files a/doc/fluid/design/dist_train/src/prefetch_parameters.png and /dev/null differ