diff --git a/doc/design/refactor/distributed_architecture.md b/doc/design/refactor/distributed_architecture.md
index ac7e98ccf1aadbb973a4801fde842375cf63448c..68db509a61e83ffcad084e3c0a957ee1924da27b 100644
--- a/doc/design/refactor/distributed_architecture.md
+++ b/doc/design/refactor/distributed_architecture.md
@@ -64,29 +64,31 @@ the cases - the compiler will optimize the IR:
-We can have our own IR too: PaddlePaddle can support model parallel by
+We can have our own IR, which is called [Program](../program.md).
+PaddlePaddle can support model parallelism by
 converting the IR so the user no longer need to manually do it in
 Python:
-The IR for PaddlePaddle after refactor is called `Block`, it specifies
-the computation dependency graph and the variables used in the
-computation.

 ### Limitation 3

 The user can not directly specify the parameter update rule for the
-parameter server because the parameter server does not use the same
-computation definition as the trainer. Instead, the update rule is
-baked in the parameter server. The user can not specify the update
-rule in the same way of specifying the trainer computation.
+parameter server because the previous implementation hard-coded the
+parameter server to only run vector-based optimization algorithms
+selected by configuration. The user cannot specify the parameter
+server's computation layer by layer.

-This could be fixed by making the parameter server run the same
+This could be fixed by making the parameter server run a separate
+IR built from the trainer's variable (tensor, SelectedRows)
+definitions, so that the parameter server runs the same
 computation definition as the trainer. For a detailed explanation,
 please see
-[Design Doc: Operation Graph Based Parameter Server](./dist_train.md)
+[Design Doc: Operation Graph Based Parameter Server](./parameter_server.md)

 ## Distributed Training Architecture
@@ -110,30 +112,63 @@ img, label = input[0], input[1]
 hidden = paddle.layer.fc(input=img, size=200, act=paddle.activation.Tanh())
 prediction = paddle.layer.fc(input=img, size=10, act=paddle.activation.Softmax())
 cost = paddle.layer.classification_cost(input=prediction, label=label)
-optimizer = paddle.optimizer.SGD(cost, learning_rate=0.01)
-session = paddle.session.NewRemote(num_trainer=3, num_ps=2, GPU_per_trainer=1)
+optimizer = paddle.optimizer.SGD(learning_rate=0.01)
+opts = optimizer.minimize(cost)
+exe = RemoteExecutor(num_trainer=3, num_ps=2, GPU_per_trainer=2, sync_batches=1)
+# this initializes variable data on both the parameter servers and the trainers
+exe.run(framework.default_startup_program())
+exe.sync()
+
 for i in range(1000):
-    _, cost_val = session.eval(targets=[cost, optimizer])
-    print cost_val
+    # feed data
+    ...
+    cost_val = exe.run(framework.default_main_program(),
+                       fetch_list=[cost])
+    print cost_val
 ```

 The code above is a typical Python trainer code, the neural
 network topology is built using helper functions such as
-`paddle.layer.fc`. The training is done by calling `session.eval`
+`paddle.layer.fc`. The training is done by calling `Executor.run`
 iteratively.

-#### session.eval
+#### RemoteExecutor
+
+As shown in the graph, `RemoteExecutor.run` sends the IR to the
+PaddlePaddle cluster for execution. You can also use the parameter
+`fetch_list` to interactively fetch variables back to the local
+machine for log printing.
+
+The Python `RemoteExecutor` is derived from the `Executor` class.
+For more information about `RemoteExecutor`, please
+see [Design Doc: RemoteExecutor](./remote_executor.md).
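+
+To make the relationship above concrete, here is a minimal sketch of how
+the Python `RemoteExecutor` could derive from the local `Executor`; the
+class body below is an illustrative assumption for this design doc, not
+the final API (only the constructor arguments and `run(fetch_list=...)`
+mirror the trainer example above):
+
+```python
+class RemoteExecutor(Executor):  # Executor: the local executor, see executor.md
+    """Sketch only: runs a Program on a PaddlePaddle Cloud TrainingJob."""
+
+    def __init__(self, num_trainer, num_ps, GPU_per_trainer, sync_batches):
+        # Record the requested cluster topology; a real implementation
+        # would also create or attach to the TrainingJob here.
+        self.num_trainer = num_trainer
+        self.num_ps = num_ps
+        self.GPU_per_trainer = GPU_per_trainer
+        self.sync_batches = sync_batches
+
+    def run(self, program, fetch_list=None):
+        # Serialize the Program (the IR), send it to the cluster, and
+        # fetch the variables named in fetch_list back to this process.
+        raise NotImplementedError("design sketch only")
+```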
+
+By default, `RemoteExecutor.run` starts a PaddlePaddle Cloud
+[TrainingJob](https://github.com/PaddlePaddle/cloud/blob/develop/doc/autoscale/README.md#training-job-resource), or you can launch each component of
+the executor with your own method:

-As shown in the graph, `session.eval` sends the IR and the evaluation
-inputs/targets to the PaddlePaddle cluster for evaluation. The
-targets can be any variable in the computation graph. When the target
-is the `optimizer` variable, the neural network will be optimized
-once. When the target is the `cost` variable, `session.eval` returns
-the cost value.
+- Data Parallelism
+  ```python
+  if os.getenv('PLACE_PSERVER'):
+    exe.run_pserver()
+  elif os.getenv('PLACE_TRAINER'):
+    exe.run_trainer()
+  ```
+- Model Parallelism
+  ```python
+  for part in exe.get_parallel_parts():
+    exe.run_part(part)
+  ```

-The Python `session` is a wrapper of the C++ `Session` class. For more
-information about `Session`, please
-see [Design Doc: Session](./session.md).
+#### Program and Executor
+
+As mentioned above, the IR is implemented as a [Program](../program.md).
+
+[Executor](../executor.md) parses the IR and converts it into a
+preferred graph for final execution. For local training you generally
+use `Executor` to run the graph locally; for any kind of distributed
+training, you can use `RemoteExecutor` to specify the desired
+distributed training method with some optional arguments.

 ### PaddlePaddle Converter
@@ -183,13 +218,6 @@ device computation time and device communication time.
 Model parallelism requires the general placement algorithm.

-### PaddlePaddle Runtime
-
-The PaddlePaddle runtime owns multiple devices (e.g., CPUs, GPUs) and
-runs the IR. The runtime does not need to do OP placement since it's
-already done by the converter.
-
-
 ### Local Training Architecture

 The local training architecture will be the same as the distributed
@@ -210,7 +238,7 @@ the Python reader will need to read from the distributed filesystem
 network traffic.

 When doing distributed training, the user can still use Python data
-reader: the training data are sent with `session.eval`. However should
+reader: the training data are sent with `Executor.run`. However, this should
 be used for debugging purpose only. The users are encouraged to use
 the read data OPs.
diff --git a/doc/design/refactor/session.md b/doc/design/refactor/session.md
deleted file mode 100644
index 1d9a26683c14f54e3b5fe41675cd03b5620646b8..0000000000000000000000000000000000000000
--- a/doc/design/refactor/session.md
+++ /dev/null
@@ -1,180 +0,0 @@
-# Design Doc: Session
-
-## Abstract
-
-The *session* object encapsulates the environment in which the
-computation graph is executed.
-
-We will have the *local* session and *remote* session, they offer the
-same [interface](#interface). The local session encapsulates the local
-runtime environment and the remote session encapsulates the cluster
-runtime environment.
-
-The local runtime environment contains:
-
-1. computation devices (i.e., CPU, GPU) handles, and
-1. the [scope](../scope.md) which holds all variables.
-
-The remote runtime environment contains:
-
-1. computation devices (i.e., CPU and GPU on node 0, 1) in a cluster,
-   and
-1. the distributed [scope](../scope.md) in a cluster which holds all
-   variables.
-
-The user can create a remote session on Paddle Cloud and evaluate the
-computation graph with it. In this way, the user can control the
-remote computation resource in a cluster from his local computer.
-
-
-## Background
-
-The current design has an implicit global session in which
-`paddle.eval()` is executed. The pain point is:
-
-Since the user is not able to explicitly switch between runtime
-environments, the user cannot run a topology in two independent
-environments.
-
-For example, in reinforcement learning, the user may want to have a
-stale model for inference and a fresh model for training, and only
-replace the stale model with the fresh model periodically.
-
-Furthermore, we have no concept that encapsulates a remote environment
-that executes a computation graph.
-
-We need the session object to address above issues.
-
-
-## Session
-
-A session is an object that owns the runtime environment. All
-computations are executed through `session.eval()`.
-
-
-### Interface
-
-```python
-eval(
-    targets,
-    feed_dict=None,
-)
-```
-
-Evaluates the target Operations or Variables in `targets`.
-
-- *targets*: the evaluation targets. Can be a single Operation or
-  Variable, or a list with the Operations or Variables as
-  elements. The value returned by `eval()` has the same shape as the
-  `target` argument.
-
-  The PaddlePaddle program is represented by
-  the [ProgramDesc](../design/program.md), `eval()` will infer the
-  ProgramDesc from the given targets and run the PaddlePaddle
-  program. Please
-  see
-  [this graph](./distributed_architecture.md#local-training-architecture) for
-  the detailed illustration for the local session
-  and
-  [this graph](./distributed_architecture.md#distributed-training-architecture) for
-  the detailed illustration for the remote session.
-
-- *feed_dict*: a dictionary that contains the tensors which override
-  the edges of the computation graph.
-
-  feed_dict not only can provide the input data, it can override any
-  OP's input as well:
-
-  ```python
-  a = pd.constant(2.0, name="a")
-  b = pd.variable(name="b")
-  c = pd.mul(a,b)
-  sess.eval(targets=c, feed_dict={"b":3.0}) # returns 6.0
-  ```
-
-```python
-close()
-```
-
-Closes the session and releases the scope that the session owns.
-
-
-### Create a Local Session
-
-```python
-session(
-    devices=None
-)
-```
-
-Creates a new session. One session owns one global scope, so creating
-multiple sessions will create different scopes.
-
-- *devices*: a single `string` or a list of `string` of device names,
-  the corresponding devices will be the computation devices for
-  `eval()`. If not specified, all available devices (e.g., all GPUs)
-  will be used. The user doesn't need to specify the CPU device since
-  it will be always used. Multiple sessions can use the same device.
-
-
-#### Example
-
-```Python
-a = paddle.constant(1.0)
-b = paddle.constant(2.0)
-c = a + b
-sess = paddle.session(devices=["gpu:0", "gpu:1", "fpga:0"])
-sess.eval(c)
-sess.close()
-```
-
-### Create a Remote Session
-
-```python
-create_cloud_job(
-    name,
-    num_trainer,
-    mem_per_trainer,
-    gpu_per_trainer,
-    cpu_per_trainer,
-    num_ps,
-    mem_per_ps,
-    cpu_per_ps,
-)
-```
-
-Creates a Paddle Cloud job. Fails if the job name exists.
-
-```python
-get_cloud_job(
-    name
-)
-```
-
-Gets a Paddle Cloud job.
-
-```python
-remote_session(
-    job
-)
-```
-
-- *job*: the Paddle Cloud job.
-
-#### Example
-
-```Python
-reader = paddle.reader.recordio("/pfs/home/peter/mnist-train-*") # data stored on Paddle Cloud
-image = reader.column(0)
-label = reader.column(1)
-fc1 = paddle.op.fc(image, size=256, act="sigmoid")
-fc2 = paddle.op.fc(fc1, size=10, act="softmax")
-cost = paddle.op.cross_entropy(fc2, label)
-opt = paddle.optimizer.sgd(cost)
-
-job = paddle.create_cloud_job("test", 3, "1G", 1, 1, 2, "1G", 1)
-sess = paddle.remote_ession(job)
-for i in range(1000):
-    sess.eval(opt)
-sess.close()
-```
diff --git a/doc/design/refactor/src/distributed_architecture.graffle b/doc/design/refactor/src/distributed_architecture.graffle
index f8496e57326c38de7468eb452a7713291d57653c..1ebbe70db04ebe0321f252a1de68aaefdaa32bd6 100644
Binary files a/doc/design/refactor/src/distributed_architecture.graffle and b/doc/design/refactor/src/distributed_architecture.graffle differ
diff --git a/doc/design/refactor/src/distributed_architecture.png b/doc/design/refactor/src/distributed_architecture.png
index 410c4510c6aab301dec95e6427fe80ac24e105fe..0da49f44123be7c0b84bf420ad81c33f0ac0a105 100644
Binary files a/doc/design/refactor/src/distributed_architecture.png and b/doc/design/refactor/src/distributed_architecture.png differ
diff --git a/doc/design/refactor/src/local_architecture.graffle b/doc/design/refactor/src/local_architecture.graffle
index cc7783c45381f25ded0b898649322c81418ad317..49fcc663ebe3824aa234e3a67aadf285cb417877 100644
Binary files a/doc/design/refactor/src/local_architecture.graffle and b/doc/design/refactor/src/local_architecture.graffle differ
diff --git a/doc/design/refactor/src/local_architecture.png b/doc/design/refactor/src/local_architecture.png
index 4b999538b7825c805292ee28b5e3256d5543bd09..14adc9fd72b855bb9f74fbf2c84ac9ec0cf2b122 100644
Binary files a/doc/design/refactor/src/local_architecture.png and b/doc/design/refactor/src/local_architecture.png differ