+- AlexNet
+
+| BatchSize | 64 | 128 | 256 |
+|--------------|--------| ------ | -------|
+| OpenBLAS | 45.62 | 72.79 | 107.22 |
+| MKLML | 66.37 | 105.60 | 144.04 |
+| MKL-DNN | 399.00 | 498.94 | 626.53 |
+
+
+
#### Inference
Test on batch size 1, 2, 4, 8, 16 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
- VGG-19
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|-------|-------|-------|-------|-------|
-| OpenBLAS | 1.07 | 1.08 | 1.06 | 0.88 | 0.65 |
+| OpenBLAS | 1.10 | 1.96 | 3.62 | 3.63 | 2.25 |
| MKLML | 5.58 | 9.80 | 15.15 | 21.21 | 28.67 |
| MKL-DNN | 75.07 | 88.64 | 82.58 | 92.29 | 96.75 |
+
+
- ResNet-50
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|-------|--------|--------|--------|--------|
-| OpenBLAS | 3.35 | 3.19 | 3.09 | 2.55 | 1.96 |
+| OpenBLAS | 3.31 | 6.72 | 11.59 | 13.17 | 9.27 |
| MKLML | 6.33 | 12.02 | 22.88 | 40.53 | 63.09 |
| MKL-DNN | 107.83| 148.84 | 177.78 | 189.35 | 217.69 |
+
- GoogLeNet
| BatchSize | 1 | 2 | 4 | 8 | 16 |
|-----------|--------|--------|--------|--------|--------|
-| OpenBLAS | 12.04 | 11.31 | 10.00 | 9.07 | 4.34 |
+| OpenBLAS | 12.06 | 23.56 | 34.48 | 36.45 | 23.12 |
| MKLML | 22.74 | 41.56 | 81.22 | 133.47 | 210.53 |
| MKL-DNN | 175.10 | 272.92 | 450.70 | 512.00 | 600.94 |
+
+
+- AlexNet
+
+| BatchSize | 1 | 2 | 4 | 8 | 16 |
+|-----------|--------|--------|--------|--------|--------|
+| OpenBLAS | 3.53 | 6.23 | 15.04 | 26.06 | 31.62 |
+| MKLML | 21.32 | 36.55 | 73.06 | 131.15 | 192.77 |
+| MKL-DNN | 442.91 | 656.41 | 719.10 | 847.68 | 850.51 |
+
+
### Laptop
TBD
diff --git a/benchmark/cluster/README.md b/benchmark/cluster/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b619613ea7a5b6e940ec735314e8e47338b2c600
--- /dev/null
+++ b/benchmark/cluster/README.md
@@ -0,0 +1,78 @@
+# Cluster Training Benchmark
+
+## Setup
+
+- Platform
+ - Kubernetes: v1.6.2
+ - Linux Kernel: v3.10.0
+
+- Resource
+ - CPU: 10 Cores per Pod
+ - Memory: 5GB per Pod
+
+- Docker Image
+
+  We use different base Docker images to run the benchmark on Kubernetes:
+ - PaddlePaddle v2: paddlepaddle/paddle:0.11.0
+ - PaddlePaddle Fluid: paddlepaddle/paddle:[commit-id]
+ - TensorFlow: tensorflow/tensorflow:1.5.0-rc0
+
+- Model
+ vgg16 is used in this benchmark.
+
+## Cases
+
+- Variable
+ - Batch Size of training data.
+ - PServer count of the training job.
+ - The number of trainers.
+
+- Invariant
+ - The resource of trainer/pserver Pod.
+
+### Measure the Performance for Different Batch Size
+
+- PServer Count: 40
+- Trainer Count: 100
+- Metrics: mini-batch / sec
+
+| Batch Size | 32 | 64 | 128 | 256 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - |
+| TensorFlow | - | - | - | - |
+
+### Measure the Performance for Different PServer Count
+
+- Trainer Count: 100
+- Batch Size: 64
+- Metrics: mini-batch / sec
+
+| PServer Count | 10 | 20 | 40 | 60 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - |
+| TensorFlow | - | - | - | - |
+
+### Measure Parallel Efficiency By Increasing Trainer Count
+
+- PServer Count: 20
+- Batch Size: 64
+- Metrics:
+
+$S = \frac{T_1}{T_N}$
+
+where $S$ is the speedup: the ratio of $T_1$ to $T_N$, the training times with 1 trainer and with $N$ trainers respectively.
+The parallel efficiency is:
+
+$E = \frac{S}{N}$
+
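+For example (hypothetical numbers, only to illustrate the metric): if one pass takes $T_1 = 1000$ seconds with 1 trainer and $T_{20} = 80$ seconds with 20 trainers, then $S = \frac{1000}{80} = 12.5$ and $E = \frac{12.5}{20} = 62.5\%$.
+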
+| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
+| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
+| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |
+
+## Reproduce the benchmark
+
+TODO
diff --git a/benchmark/cluster/vgg16/Dockerfile b/benchmark/cluster/vgg16/Dockerfile
new file mode 100644
index 0000000000000000000000000000000000000000..13ad8e1b6237e6f41a076c4fb54311728832ae33
--- /dev/null
+++ b/benchmark/cluster/vgg16/Dockerfile
@@ -0,0 +1,35 @@
+FROM nvidia/cuda:8.0-cudnn5-runtime-ubuntu16.04
+
+# you can get a mirror list here:
+# https://launchpad.net/ubuntu/+archivemirrors
+ARG UBUNTU_MIRROR
+RUN /bin/bash -c 'if [[ -n ${UBUNTU_MIRROR} ]]; then sed -i 's#http://archive.ubuntu.com/ubuntu#${UBUNTU_MIRROR}#g' /etc/apt/sources.list; fi'
+
+RUN apt-get update && apt-get install -y python python-dev python-pip iputils-ping libgtk2.0-dev
+RUN pip install -U kubernetes opencv-python
+
+RUN pip install paddlepaddle
+# if the network is slow, you may need to add a proxy here.
+# ENV https_proxy=
+RUN sh -c 'echo "import paddle.v2 as paddle\npaddle.dataset.cifar.train10()" | python'
+RUN pip uninstall -y paddlepaddle
+# unset the proxy if it is set.
+# ENV https_proxy=""
+
+# NOTE: By default, CI-built wheel packages are built with WITH_DISTRIBUTE=OFF,
+# so we must build one with distribute support to install in this image.
+ADD *.whl /
+RUN pip install /*.whl && rm -f /*.whl
+ENV LD_LIBRARY_PATH=/usr/local/lib
+
+# tf k8s
+RUN pip install tensorflow==1.4.0
+ADD tf_k8s /usr/bin
+RUN chmod +x /usr/bin/tf_k8s
+ADD vgg16_tf.py /workspace/
+
+# the lines below may change a lot during debugging
+ADD https://raw.githubusercontent.com/PaddlePaddle/cloud/develop/docker/paddle_k8s /usr/bin
+ADD https://raw.githubusercontent.com/PaddlePaddle/cloud/develop/docker/k8s_tools.py /root
+RUN chmod +x /usr/bin/paddle_k8s
+ADD vgg16_fluid.py vgg16_v2.py /workspace/
diff --git a/benchmark/cluster/vgg16/README.md b/benchmark/cluster/vgg16/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..cd681a1a282d9a26eac1c267bfa26967f8c3c9fd
--- /dev/null
+++ b/benchmark/cluster/vgg16/README.md
@@ -0,0 +1,77 @@
+# Performance for Distributed vgg16
+
+## Test Result
+
+### Hardware Information
+
+- CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
+- cpu MHz : 2101.000
+- cache size : 20480 KB
+
+### Blas settings
+
+Setting environment variable: `MKL_NUM_THREADS=1`.
+
+### Single Node Single Thread
+
+- Metrics: samples / sec
+
+| Batch Size | 32 | 64 | 128 | 256 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | 15.44 | 16.32 | 16.74 | 16.79 |
+| PaddlePaddle v2 | 15.97 | 17.04 | 17.60 | 17.83 |
+| TensorFlow | 9.09 | 9.10 | 9.24 | 8.66 |
+
+### Different Batch Size
+
+- PServer Count: 10
+- Trainer Count: 20
+- Metrics: samples / sec
+
+| Batch Size | 32 | 64 | 128 | 256 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | 190.20 | 222.15 | 247.40 | 258.18 |
+| PaddlePaddle v2 | 170.96 | 233.71 | 256.14 | 329.23 |
+| TensorFlow | - | - | - | - |
+
+
+### Acceleration Rate
+
+- PServer Count: 20
+- Batch Size: 128
+- Metrics: samples / sec
+
+| Trainer Count | 20 | 40 | 80 | 100 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | 263.29 (78.64%) | 518.80 (77.47%) | 836.26 (62.44%) | 1019.29 (60.89%) |
+| PaddlePaddle v2 (needs more tests) | 326.85 (92.85%) | 534.58 (75.93%) | 853.30 (60.60%) | 1041.99 (59.20%) |
+| TensorFlow | - | - | - | - |
+
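+The percentages in parentheses appear to be the parallel efficiency relative to the
+single-node single-thread throughput at the same batch size (128), i.e. samples/sec divided by
+(trainer count x single-thread samples/sec); for example, 263.29 / (20 x 16.74) is roughly 78.6%
+for Fluid with 20 trainers.
+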
+### Different PServer Count
+
+- Trainer Count: 60
+- Batch Size: 128
+- Metrics: samples / sec
+
+| PServer Count | 3 | 6 | 10 | 20 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid (should be fixed in the next PR) | 589.1 | 592.6 | 656.4 | 655.8 |
+| PaddlePaddle v2 | 593.4 | 791.3 | 729.7 | 821.7 |
+| TensorFlow | - | - | - | - |
+
+*The performance gap between Fluid and v2 comes from network interference.*
+
+
+## Steps to Run the Performance Test
+
+1. You must re-compile PaddlePaddle with `-DWITH_DISTRIBUTE` enabled to build it with distributed support.
+1. When the build finishes, copy the output `whl` package located under `build/python/dist` to the current directory.
+1. Run `docker build -t [image:tag] .` to build the docker image and run `docker push [image:tag]` to push the image to a repository so that Kubernetes can find it.
+1. Run `kubectl create -f pserver.yaml && kubectl create -f trainer.yaml` to start the job on your kubernetes cluster (you must configure the `kubectl` client before this step).
+1. Run `kubectl get po` to list the running pods, and run `kubectl logs [podID]` to fetch the logs of the pserver and trainer pods.
+
+Check the logs for the distributed training progress and analyze the performance.
+
+## Enable Verbose Logs
+
+Edit `pserver.yaml` and `trainer.yaml` and add the environment variables `GLOG_v=3` and `GLOG_logtostderr=1` to see what happened in detail.
diff --git a/benchmark/cluster/vgg16/fluid_pserver.yaml b/benchmark/cluster/vgg16/fluid_pserver.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..ee8b0763b62fc011f40f6197e929a68b48a93e47
--- /dev/null
+++ b/benchmark/cluster/vgg16/fluid_pserver.yaml
@@ -0,0 +1,72 @@
+apiVersion: extensions/v1beta1
+kind: ReplicaSet
+metadata:
+ name: vgg16job-pserver
+spec:
+ replicas: 10
+ template:
+ metadata:
+ labels:
+ paddle-job-pserver: vgg16job
+ spec:
+ hostNetwork: true
+ imagePullSecrets:
+ - name: job-registry-secret
+ containers:
+ - name: pserver
+ image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
+ imagePullPolicy: Always
+ ports:
+ - name: jobport-30236
+ containerPort: 30236
+ env:
+ - name: PADDLE_JOB_NAME
+ value: vgg16job
+ - name: MKL_NUM_THREADS
+ value: "1"
+ - name: TRAINING_ROLE
+ value: "PSERVER"
+ - name: TRAINERS
+ value: "20"
+ - name: PSERVERS
+ value: "10"
+ - name: TOPOLOGY
+ value: ""
+ - name: ENTRY
+ value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0"
+ - name: TRAINER_PACKAGE
+ value: "/workspace"
+ - name: PADDLE_INIT_PORT
+ value: "30236"
+ - name: PADDLE_INIT_NICS
+ value: "xgbe0"
+ - name: PADDLE_INIT_TRAINER_COUNT
+ value: "1"
+ - name: PADDLE_INIT_PORTS_NUM
+ value: "1"
+ - name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE
+ value: "1"
+ - name: PADDLE_INIT_NUM_GRADIENT_SERVERS
+ value: "20"
+ - name: PADDLE_INIT_NUM_PASSES
+ value: "1"
+ - name: PADDLE_INIT_USE_GPU
+ value: "0"
+ - name: LD_LIBRARY_PATH
+ value: "/usr/local/lib:/usr/local/nvidia/lib64"
+ - name: NAMESPACE
+ valueFrom:
+ fieldRef:
+ fieldPath: "metadata.namespace"
+ - name: POD_IP
+ valueFrom:
+ fieldRef:
+ fieldPath: "status.podIP"
+ command: ["paddle_k8s", "start_fluid"]
+ resources:
+ requests:
+ memory: 10Gi
+ cpu: 4
+ limits:
+ memory: 10Gi
+ cpu: 4
diff --git a/benchmark/cluster/vgg16/fluid_trainer.yaml b/benchmark/cluster/vgg16/fluid_trainer.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..3d56caac009464d1073423bb63abff1f8b0cf28f
--- /dev/null
+++ b/benchmark/cluster/vgg16/fluid_trainer.yaml
@@ -0,0 +1,69 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+ name: vgg16job-trainer
+spec:
+ parallelism: 20
+ completions: 20
+ template:
+ metadata:
+ labels:
+ paddle-job: vgg16job
+ spec:
+ imagePullSecrets:
+ - name: job-registry-secret
+ hostNetwork: true
+ containers:
+ - name: trainer
+ image: "registry.baidu.com/paddlepaddle/fluid_benchmark:vgg16"
+ imagePullPolicy: Always
+ command: ["paddle_k8s", "start_fluid"]
+ env:
+ - name: PADDLE_JOB_NAME
+ value: vgg16job
+ - name: TRAINING_ROLE
+ value: "TRAINER"
+ - name: TRAINERS
+ value: "20"
+ - name: PSERVERS
+ value: "10"
+ - name: TOPOLOGY
+ value: ""
+ - name: ENTRY
+ value: "MKL_NUM_THREADS=1 python /workspace/vgg16_fluid.py --local 0 --batch_size 128"
+ - name: TRAINER_PACKAGE
+ value: "/workspace"
+ - name: PADDLE_INIT_PORT
+ value: "30236"
+ - name: PADDLE_INIT_NICS
+ value: "xgbe0"
+ - name: PADDLE_INIT_TRAINER_COUNT
+ value: "1"
+ - name: PADDLE_INIT_PORTS_NUM
+ value: "1"
+ - name: PADDLE_INIT_PORTS_NUM_FOR_SPARSE
+ value: "1"
+ - name: PADDLE_INIT_NUM_GRADIENT_SERVERS
+ value: "20"
+ - name: PADDLE_INIT_NUM_PASSES
+ value: "1"
+ - name: PADDLE_INIT_USE_GPU
+ value: "0"
+ - name: LD_LIBRARY_PATH
+ value: "/usr/local/lib:/usr/local/nvidia/lib64"
+ - name: NAMESPACE
+ valueFrom:
+ fieldRef:
+ fieldPath: "metadata.namespace"
+ - name: POD_IP
+ valueFrom:
+ fieldRef:
+ fieldPath: "status.podIP"
+ resources:
+ requests:
+ memory: 40Gi
+ cpu: 2
+ limits:
+ memory: 40Gi
+ cpu: 2
+ restartPolicy: Never
diff --git a/benchmark/cluster/vgg16/tf_k8s b/benchmark/cluster/vgg16/tf_k8s
new file mode 100644
index 0000000000000000000000000000000000000000..4fc263d5f681aeabfa71f1758714d269d987b272
--- /dev/null
+++ b/benchmark/cluster/vgg16/tf_k8s
@@ -0,0 +1,82 @@
+#!/bin/bash
+check_trainer_ret() {
+ ret=$1
+ stdbuf -oL echo "job returned $ret...setting pod return message..."
+ stdbuf -oL echo "==============================="
+
+ if [ $ret -eq 136 ] ; then
+ echo "Error Arithmetic Operation(Floating Point Exception)" > /dev/termination-log
+ elif [ $ret -eq 139 ] ; then
+ echo "Segmentation Fault" > /dev/termination-log
+ elif [ $ret -eq 1 ] ; then
+ echo "General Error" > /dev/termination-log
+ elif [ $ret -eq 134 ] ; then
+ echo "Program Abort" > /dev/termination-log
+ fi
+  stdbuf -oL echo "termination log written..."
+ exit $ret
+}
+
+g_pservers=""
+g_trainers=""
+
+wait_running_pods(){
+ pserver_label="tf-job-pserver=${JOB_NAME}"
+ trainer_label="tf-job-trainer=${JOB_NAME}"
+
+ stdbuf -oL python /root/k8s_tools.py wait_pods_running ${pserver_label} ${PSERVERS_NUM}
+ stdbuf -oL python /root/k8s_tools.py wait_pods_running ${trainer_label} ${TRAINERS_NUM}
+
+ g_pservers=$(python /root/k8s_tools.py fetch_endpoints ${pserver_label} ${PORT})
+ g_trainers=$(python /root/k8s_tools.py fetch_endpoints ${trainer_label} ${PORT})
+}
+
+start_tf_pserver(){
+ wait_running_pods
+
+ label="tf-job-pserver=${JOB_NAME}"
+ pserver_id=$(python /root/k8s_tools.py fetch_id ${label})
+
+ cmd="${ENTRY} --ps_hosts=${g_pservers} --worker_hosts=${g_trainers} \
+ --job_name=${TF_JOB_NAME} --task_index=${pserver_id}"
+
+ stdbuf -oL sh -c "cd ${TRAINER_PACKAGE} && ${cmd}"
+}
+
+start_tf_trainer(){
+ wait_running_pods
+
+ label="tf-job-trainer=${JOB_NAME}"
+ trainer_id=$(python /root/k8s_tools.py fetch_id ${label})
+
+ cmd="${ENTRY} --ps_hosts=${g_pservers} --worker_hosts=${g_trainers} \
+ --job_name=${TF_JOB_NAME} --task_index=${trainer_id} --batch_size=${BATCH_SIZE}"
+
+ stdbuf -oL sh -c "cd ${TRAINER_PACKAGE} && ${cmd}"
+ check_trainer_ret $?
+}
+
+start_tf(){
+ if [[ "${TF_JOB_NAME}" == "worker" ]]; then
+ start_tf_trainer
+ else
+ start_tf_pserver
+ fi
+}
+
+usage() {
+ echo "usage: tf_k8s [




-
-
-
-Figure 2 illustrates the RNN's data flow
-
-
-
-
-
-
-
-
-PaddlePaddle can support model parallelism by converting the IR so that the user no longer needs to manually perform the computation and operations in the Python component:
-
-
-
-The IR for PaddlePaddle after refactoring is called a `Block`, it specifies the computation dependency graph and the variables used in the computation.
-
-### Limitation 3
-
-The user can not directly specify the parameter update rule for the parameter server in the Python module, since the parameter server does not use the same computation definition as the trainer. Instead, the update rule is baked inside the parameter server. The user can not specify the update rule explicitly.
-
-This could be fixed by making the parameter server run the same computation definition as the trainer (the user's Python module). For a detailed explanation, refer to this document -
-[Design Doc: Operation Graph Based Parameter Server](./parameter_server.md)
-
-## Distributed Training Architecture
-
-The revamped distributed training architecture can address the above discussed limitations. Below is the illustration of how it does so:
-
-
-
-The major components in the architecture are: *PaddlePaddle Python*, *PaddlePaddle converter* and *PaddlePaddle runtime*.
-
-### PaddlePaddle Python
-
-PaddlePaddle Python is the Python library that user's Python code invokes, to read the data. build the neural network topology, start training, etc.
-
-```Python
-paddle.init()
-input = paddle.op.recordIO("/home/data/mnist.recordio") # file stored on the cluster
-img, label = input[0], input[1]
-hidden = paddle.layer.fc(input=img, size=200, act=paddle.activation.Tanh())
-prediction = paddle.layer.fc(input=img, size=10, act=paddle.activation.Softmax())
-cost = paddle.layer.classification_cost(input=prediction, label=label)
-optimizer = paddle.optimizer.SGD(cost, learning_rate=0.01)
-session = paddle.session.NewRemote(num_trainer=3, num_ps=2, GPU_per_trainer=1)
-for i in range(1000):
- _, cost_val = session.eval(targets=[cost, optimizer])
- print cost_val
-```
-
-The above code is what a typical Python trainer code is, the neural network topology is built using the helper functions such as `paddle.layer.fc`. Training is done by calling `session.eval` iteratively.
-
-#### session.eval
-
-As shown in the graph, `session.eval` sends the IR and the evaluation inputs or targets to the PaddlePaddle cluster for evaluation.
-The targets can be any variable in the computation graph. When the target is say, the `optimizer` variable, the neural network will be optimized once. When the target is the `cost` variable, `session.eval` returns the cost value. Based on what the target is, an appropriate action is taken.
-
-The Python `session` is a wrapper of the C++ `Session` class. For more information about `Session`, refer to this document - [Design Doc: Session](./session.md).
-
-### PaddlePaddle Converter
-
-The PaddlePaddle converter automatically converts the IR in the request (IR and evaluation inputs/targets) from PaddlePaddle Python to partitioned IRs and dispatches the new IRs and evaluation inputs/targets to different PaddlePaddle runtimes. Below are the steps that are followed :
-
-1. Add a `feed` OP that feeds the eval inputs, and a `fetch` OP that fetches the eval targets to the IR.
-
-2. Extract a new computation (sub)graph with the `feed` and `fetch` OPs as the boundary. The runtime does not need to run the OP that is not dependent on the `fetch` OP.
-
-3. Optimize the computation graph.
-
-4. Place the OPs in the graph onto different devices on different PaddlePaddle runtime according to a placement algorithm and the device constraints specified by the user.
-
-5. Partition the graph according to runtime boundaries and add `send` / `recv` OP pair on the runtime boundaries.
-
-6. Dispatch the partitioned graph to different PaddlePaddle runtimes.
-
-7. PaddlePaddle runtimes with the `fetch` OP reports evaluation results back to the converter, the converter reports the evaluation results back to the PaddlePaddle Python.
-
-The output IRs will be cached to optimize the conversion latency.
-
-
-#### Placement Algorithm
-
-Our first implementation will only support "trainer-parameter server" placement: the parameters, initializers, and optimizers are all placed on the PaddlePaddle runtimes with the parameter server role. Everything else will be placed on the PaddlePaddle runtimes with the trainer role. This has the same functionality as the "trainer-parameter server" architecture of PaddlePaddle v0.10.0, but is more generic and flexible.
-
-In the future, a more general placement algorithm should be implemented, which makes placements according to the input IR, and a model of device computation time and device communication time. Model parallelism requires the generic placement algorithm.
-
-
-### PaddlePaddle Runtime
-
-The PaddlePaddle runtime owns multiple devices (e.g., CPUs, GPUs) and runs the IR. The runtime does not need to do OP placement since it is already done by the converter.
-
-
-### Local Training Architecture
-
-The local training architecture will be the same as the distributed training architecture, the difference is that everything runs locally, and there is just one PaddlePaddle runtime:
-
-
-
-
-### Training Data
-
-In PaddlePaddle v0.10.0, training data is typically read with a [data reader](../reader/README.md) from Python. This approach is no longer efficient when training in a distributed fashion since the Python process no longer runs on the same node with the trainer processes. The Python reader will need to read from the distributed filesystem (assuming it has the required access) and send to the trainers, doubling the network traffic.
-
-When doing distributed training, the user can still use Python data reader: the training data are sent with `session.eval`. However this should be used for debugging purpose only. The users are encouraged to use the read data OPs.
-
-
-## References:
-
-[1] [TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf)
-
-[2] [TensorFlow: A System for Large-Scale Machine Learning](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
diff --git a/doc/design/refactor/parameter_server.md b/doc/design/refactor/parameter_server.md
deleted file mode 100644
index fa3c5d7990213cf2b0d236e66e592dd2699da876..0000000000000000000000000000000000000000
--- a/doc/design/refactor/parameter_server.md
+++ /dev/null
@@ -1,106 +0,0 @@
-# Design Doc: Operation Graph Based Parameter Server
-
-## Abstract
-
-We propose an approach to implement the parameter server. In this
-approach, there is no fundamental difference between the trainer and
-the parameter server: they both run subgraphs, but subgraphs of
-different purposes.
-
-## Background
-
-The previous implementations of the parameter server does not run a
-subgraph. parameter initialization, optimizer computation, network
-communication and checkpointing are implemented twice on both the
-trainer and the parameter server.
-
-It would be great if we can write code once and use them on both the
-trainer and the parameter server: reduces code duplication and
-improves extensibility. Given that after the current refactor, we are
-representing everything as a computing graph on the
-trainer. Representing everything as a computing graph on the parameter
-server becomes a natural extension.
-
-## Design
-
-### Graph Converter
-
-The *graph converter* converts the user-defined operation (OP) graph
-into subgraphs to be scheduled on different nodes with the following
-steps:
-
-1. OP placement: the OPs will be placed on different nodes according
- to heuristic that minimizes estimated total computation
- time. Currently we will use a simple heuristic that puts parameter
- varable on parameter server workers and everything else on trainer
- workers.
-
-1. Add communication OPs to enable the communication between nodes.
-
-We will need these OPs: *Send*, *Recv*, *Enqueue*, *Dequeue*.
-
-Below is an example of converting the user defined graph to the
-subgraphs for the trainer and the parameter server:
-
-
-
-After converting:
-
-
-
-1. The parameter variable W and it's optimizer subgraph are placed on the parameter server.
-1. Operators are added to the subgraphs.
- - *Send* sends data to the connected *Recv* operator. The
- scheduler on the receive node will only schedule *Recv* operator
- to run when the *Send* operator has ran (the *Send* OP will mark
- the *Recv* OP runnable automatically).
- - *Enueue* enqueues the input variable, it can block until space
- become available in the queue.
- - *Dequeue* outputs configurable numbers of tensors from the
- queue. It will block until the queue have the required number of
- tensors.
-
-
-### Benefits
-
-- Model parallelism become easier to implement: it's an extension to
- the trainer - parameter server approach. we already have the
- communication OPs, but need to extend the graph converter's
- placement functionality.
-
-- User-defined optimizer is easier to add - user can now express it as
- a subgraph.
-
-- No more duplication logic inside the trainer and the parameter
- server mentioned in the background section.
-
-### Challenges
-
-- It might be hard for the graph converter to cut a general graph
- (without any hint for which subgraph is the optimizer). We may need
- to label which subgraph inside the OP graph is the optimizer.
-
-- It's important to balance the parameter shards of on multiple
- parameter server. If a single parameter is very big (some
- word-embedding, fully connected, softmax layer), we need to
- automatically partition the single parameter onto different
- parameter servers when possible (only element-wise optimizer depends
- on the parameter variable).
-
-### Discussion
-
-- In the "Aync SGD" figure, the "W" variable on the parameter server
- could be read and wrote concurrently, what is our locking strategy?
- E.g., each variable have a lock cpp method to be invoked by every
- OP, or, have a lock OP.
-
-- Can the Enqueue OP be implemented under our current tensor design
- (puts the input tensor into the queue tensor)?
-
-- *Dequeue* OP will have variable numbers of output (depends on the
- `min_count` attribute), does our current design support it? (similar
- question for the *Add* OP)
-
-
-### References:
-[1] [TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf)
diff --git a/doc/design/refactor/session.md b/doc/design/refactor/session.md
deleted file mode 100644
index 1d9a26683c14f54e3b5fe41675cd03b5620646b8..0000000000000000000000000000000000000000
--- a/doc/design/refactor/session.md
+++ /dev/null
@@ -1,180 +0,0 @@
-# Design Doc: Session
-
-## Abstract
-
-The *session* object encapsulates the environment in which the
-computation graph is executed.
-
-We will have the *local* session and *remote* session, they offer the
-same [interface](#interface). The local session encapsulates the local
-runtime environment and the remote session encapsulates the cluster
-runtime environment.
-
-The local runtime environment contains:
-
-1. computation devices (i.e., CPU, GPU) handles, and
-1. the [scope](../scope.md) which holds all variables.
-
-The remote runtime environment contains:
-
-1. computation devices (i.e., CPU and GPU on node 0, 1) in a cluster,
- and
-1. the distributed [scope](../scope.md) in a cluster which holds all
- variables.
-
-The user can create a remote session on Paddle Cloud and evaluate the
-computation graph with it. In this way, the user can control the
-remote computation resource in a cluster from his local computer.
-
-
-## Background
-
-The current design has an implicit global session in which
-`paddle.eval()` is executed. The pain point is:
-
-Since the user is not able to explicitly switch between runtime
-environments, the user cannot run a topology in two independent
-environments.
-
-For example, in reinforcement learning, the user may want to have a
-stale model for inference and a fresh model for training, and only
-replace the stale model with the fresh model periodically.
-
-Furthermore, we have no concept that encapsulates a remote environment
-that executes a computation graph.
-
-We need the session object to address above issues.
-
-
-## Session
-
-A session is an object that owns the runtime environment. All
-computations are executed through `session.eval()`.
-
-
-### Interface
-
-```python
-eval(
- targets,
- feed_dict=None,
-)
-```
-
-Evaluates the target Operations or Variables in `targets`.
-
-- *targets*: the evaluation targets. Can be a single Operation or
- Variable, or a list with the Operations or Variables as
- elements. The value returned by `eval()` has the same shape as the
- `target` argument.
-
- The PaddlePaddle program is represented by
- the [ProgramDesc](../design/program.md), `eval()` will infer the
- ProgramDesc from the given targets and run the PaddlePaddle
- program. Please
- see
- [this graph](./distributed_architecture.md#local-training-architecture) for
- the detailed illustration for the local session
- and
- [this graph](./distributed_architecture.md#distributed-training-architecture) for
- the detailed illustration for the remote session.
-
-- *feed_dict*: a dictionary that contains the tensors which override
- the edges of the computation graph.
-
- feed_dict not only can provide the input data, it can override any
- OP's input as well:
-
- ```python
- a = pd.constant(2.0, name="a")
- b = pd.variable(name="b")
- c = pd.mul(a,b)
- sess.eval(targets=c, feed_dict={"b":3.0}) # returns 6.0
- ```
-
-```python
-close()
-```
-
-Closes the session and releases the scope that the session owns.
-
-
-### Create a Local Session
-
-```python
-session(
- devices=None
-)
-```
-
-Creates a new session. One session owns one global scope, so creating
-multiple sessions will create different scopes.
-
-- *devices*: a single `string` or a list of `string` of device names,
- the corresponding devices will be the computation devices for
- `eval()`. If not specified, all available devices (e.g., all GPUs)
- will be used. The user doesn't need to specify the CPU device since
- it will be always used. Multiple sessions can use the same device.
-
-
-#### Example
-
-```Python
-a = paddle.constant(1.0)
-b = paddle.constant(2.0)
-c = a + b
-sess = paddle.session(devices=["gpu:0", "gpu:1", "fpga:0"])
-sess.eval(c)
-sess.close()
-```
-
-### Create a Remote Session
-
-```python
-create_cloud_job(
- name,
- num_trainer,
- mem_per_trainer,
- gpu_per_trainer,
- cpu_per_trainer,
- num_ps,
- mem_per_ps,
- cpu_per_ps,
-)
-```
-
-Creates a Paddle Cloud job. Fails if the job name exists.
-
-```python
-get_cloud_job(
- name
-)
-```
-
-Gets a Paddle Cloud job.
-
-```python
-remote_session(
- job
-)
-```
-
-- *job*: the Paddle Cloud job.
-
-#### Example
-
-```Python
-reader = paddle.reader.recordio("/pfs/home/peter/mnist-train-*") # data stored on Paddle Cloud
-image = reader.column(0)
-label = reader.column(1)
-fc1 = paddle.op.fc(image, size=256, act="sigmoid")
-fc2 = paddle.op.fc(fc1, size=10, act="softmax")
-cost = paddle.op.cross_entropy(fc2, label)
-opt = paddle.optimizer.sgd(cost)
-
-job = paddle.create_cloud_job("test", 3, "1G", 1, 1, 2, "1G", 1)
-sess = paddle.remote_ession(job)
-for i in range(1000):
- sess.eval(opt)
-sess.close()
-```
diff --git a/doc/design/refactor/src/distributed_architecture.graffle b/doc/design/refactor/src/distributed_architecture.graffle
deleted file mode 100644
index f8496e57326c38de7468eb452a7713291d57653c..0000000000000000000000000000000000000000
Binary files a/doc/design/refactor/src/distributed_architecture.graffle and /dev/null differ
diff --git a/doc/design/refactor/src/distributed_architecture.png b/doc/design/refactor/src/distributed_architecture.png
deleted file mode 100644
index 410c4510c6aab301dec95e6427fe80ac24e105fe..0000000000000000000000000000000000000000
Binary files a/doc/design/refactor/src/distributed_architecture.png and /dev/null differ
diff --git a/doc/design/refactor/src/local_architecture.graffle b/doc/design/refactor/src/local_architecture.graffle
deleted file mode 100644
index cc7783c45381f25ded0b898649322c81418ad317..0000000000000000000000000000000000000000
Binary files a/doc/design/refactor/src/local_architecture.graffle and /dev/null differ
diff --git a/doc/design/refactor/src/local_architecture.png b/doc/design/refactor/src/local_architecture.png
deleted file mode 100644
index 4b999538b7825c805292ee28b5e3256d5543bd09..0000000000000000000000000000000000000000
Binary files a/doc/design/refactor/src/local_architecture.png and /dev/null differ
diff --git a/doc/design/releasing_process.md b/doc/design/releasing_process.md
deleted file mode 100644
index 14c081ea84282e52a2e36475c3c0ea755122d154..0000000000000000000000000000000000000000
--- a/doc/design/releasing_process.md
+++ /dev/null
@@ -1,68 +0,0 @@
-# PaddlePaddle发行规范
-
-PaddlePaddle使用git-flow branching model做分支管理,使用[Semantic Versioning](http://semver.org/)标准表示PaddlePaddle版本号。
-
-PaddlePaddle每次发新的版本,遵循以下流程:
-
-1. 从`develop`分支派生出新的分支,分支名为`release/版本号`。例如,`release/0.10.0`
-1. 将新分支的版本打上tag,tag为`版本号rc.Patch号`。第一个tag为`0.10.0rc1`,第二个为`0.10.0rc2`,依次类推。
-1. 对这个版本的提交,做如下几个操作:
- * 修改`python/setup.py.in`中的版本信息,并将`istaged`字段设为`True`。
- * 编译这个版本的Docker发行镜像,发布到dockerhub。如果失败,修复Docker编译镜像问题,Patch号加一,返回第二步
- * 编译这个版本的Ubuntu Deb包。如果失败,修复Ubuntu Deb包编译问题,Patch号加一,返回第二步。
- * 使用Regression Test List作为检查列表,测试Docker镜像/ubuntu安装包的功能正确性
- * 如果失败,记录下所有失败的例子,在这个`release/版本号`分支中,修复所有bug后,Patch号加一,返回第二步
- * 编译这个版本的python wheel包,并发布到pypi。
- * 由于pypi.python.org目前遵循[严格的命名规范PEP 513](https://www.python.org/dev/peps/pep-0513),在使用twine上传之前,需要重命名wheel包中platform相关的后缀,比如将`linux_x86_64`修改成`manylinux1_x86_64`。
- * pypi上的package名称为paddlepaddle和paddlepaddle_gpu,如果要上传GPU版本的包,需要修改build/python/setup.py中,name: "paddlepaddle_gpu"并重新打包wheel包:`python setup.py bdist_wheel`。
- * 上传方法:
- ```
- cd build/python
- pip install twine
- twine upload dist/[package to upload]
- ```
-1. 第三步完成后,将`release/版本号`分支合入master分支,并删除`release/版本号`分支。将master分支的合入commit打上tag,tag为`版本号`。同时再将`master`分支合入`develop`分支。最后删除`release/版本号`分支。
-1. 编译master分支的Docker发行镜像,发布到dockerhub。编译ubuntu的deb包,发布到github release页面
-1. 协同完成Release Note的书写
-
-
-需要注意的是:
-
-* `release/版本号`分支一旦建立,一般不允许再从`develop`分支合入`release/版本号`。这样保证`release/版本号`分支功能的封闭,方便测试人员测试PaddlePaddle的行为。
-* 在`release/版本号`分支存在的时候,如果有bugfix的行为,需要将bugfix的分支同时merge到`master`, `develop`和`release/版本号`这三个分支。
-
-## PaddlePaddle 分支规范
-
-PaddlePaddle开发过程使用[git-flow](http://nvie.com/posts/a-successful-git-branching-model/)分支规范,并适应github的特性做了一些区别。
-
-* PaddlePaddle的主版本库遵循[git-flow](http://nvie.com/posts/a-successful-git-branching-model/)分支规范。其中:
- * `master`分支为稳定(stable branch)版本分支。每一个`master`分支的版本都是经过单元测试和回归测试的版本。
- * `develop`分支为开发(develop branch)版本分支。每一个`develop`分支的版本都经过单元测试,但并没有经过回归测试。
- * `release/版本号`分支为每一次Release时建立的临时分支。在这个阶段的代码正在经历回归测试。
-
-* 其他用户的fork版本库并不需要严格遵守[git-flow](http://nvie.com/posts/a-successful-git-branching-model/)分支规范,但所有fork的版本库的所有分支都相当于特性分支。
- * 建议,开发者fork的版本库使用`develop`分支同步主版本库的`develop`分支
- * 建议,开发者fork的版本库中,再基于`develop`版本fork出自己的功能分支。
- * 当功能分支开发完毕后,向PaddlePaddle的主版本库提交`Pull Reuqest`,进而进行代码评审。
- * 在评审过程中,开发者修改自己的代码,可以继续在自己的功能分支提交代码。
-
-* BugFix分支也是在开发者自己的fork版本库维护,与功能分支不同的是,BugFix分支需要分别给主版本库的`master`、`develop`与可能有的`release/版本号`分支,同时提起`Pull Request`。
-
-## PaddlePaddle回归测试列表
-
-本列表说明PaddlePaddle发版之前需要测试的功能点。
-
-### PaddlePaddle Book中所有章节
-
-PaddlePaddle每次发版本首先要保证PaddlePaddle Book中所有章节功能的正确性。功能的正确性包括验证PaddlePaddle目前的`paddle_trainer`训练和纯使用`Python`训练模型正确性。
-
-| | 新手入门章节 | 识别数字 | 图像分类 | 词向量 | 情感分析 | 语意角色标注 | 机器翻译 | 个性化推荐 |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| API.V2 + Docker + GPU | | | | | | | | |
-| API.V2 + Docker + CPU | | | | | | | | |
-| `paddle_trainer` + Docker + GPU | | | | | | | | |
-| `paddle_trainer` + Docker + CPU | | | | | | | | |
-| API.V2 + Ubuntu + GPU | | | | | | | | |
-| API.V2 + Ubuntu + CPU | | | | | | | | |
-| `paddle_trainer` + Ubuntu + GPU | | | | | | | | |
-| `paddle_trainer` + Ubuntu + CPU | | | | | | | | |
diff --git a/doc/design/speech/README.MD b/doc/design/speech/README.MD
deleted file mode 100644
index 7304650e628dba210488cd2dc4836318b5383b2a..0000000000000000000000000000000000000000
--- a/doc/design/speech/README.MD
+++ /dev/null
@@ -1,155 +0,0 @@
-# DeepSpeech2 on PaddlePaddle: Design Doc
-
-We are planning to build Deep Speech 2 (DS2) \[[1](#references)\], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:
-
-- Release a basic distributed implementation of DS2 on PaddlePaddle.
-- Contribute a chapter of Deep Speech to PaddlePaddle Book.
-
-Intensive system optimization and low-latency inference library (details in \[[1](#references)\]) are not yet covered in this first-stage plan.
-
-## Table of Contents
-
-- [Tasks](#tasks)
-- [Task Dependency](#task-dependency)
-- [Design Details](#design-details)
- - [Overview](#overview)
- - [Row Convolution](#row-convolution)
- - [Beam Search With CTC and LM](#beam-search-with-ctc-and-lm)
-- [Future Work](#future-work)
-- [References](#references)
-
-## Tasks
-
-We roughly break down the project into 14 tasks:
-
-1. Develop an **audio data provider**:
- - Json filelist generator.
- - Audio file format transformer.
- - Spectrogram feature extraction, power normalization etc.
- - Batch data reader with SortaGrad.
- - Data augmentation (optional).
- - Prepare (one or more) public English data sets & baseline.
-2. Create a **simplified DS2 model configuration**:
- - With only fixed-length (by padding) audio sequences (otherwise need *Task 3*).
- - With only bidirectional-GRU (otherwise need *Task 4*).
- - With only greedy decoder (otherwise need *Task 5, 6*).
-3. Develop to support **variable-shaped** dense-vector (image) batches of input data.
- - Update `DenseScanner` in `dataprovider_converter.py`, etc.
-4. Develop a new **lookahead-row-convolution layer** (See \[[1](#references)\] for details):
- - Lookahead convolution windows.
- - Within-row convolution, without kernels shared across rows.
-5. Build KenLM **language model** (5-gram) for beam search decoder:
- - Use KenLM toolkit.
- - Prepare the corpus & train the model.
- - Create infererence interfaces (for Task 6).
-6. Develop a **beam search decoder** with CTC + LM + WORDCOUNT:
- - Beam search with CTC.
- - Beam search with external custom scorer (e.g. LM).
- - Try to design a more general beam search interface.
-7. Develop a **Word Error Rate evaluator**:
- - update `ctc_error_evaluator`(CER) to support WER.
-8. Prepare internal dataset for Mandarin (optional):
- - Dataset, baseline, evaluation details.
- - Particular data preprocessing for Mandarin.
- - Might need cooperating with the Speech Department.
-9. Create **standard DS2 model configuration**:
- - With variable-length audio sequences (need *Task 3*).
- - With unidirectional-GRU + row-convolution (need *Task 4*).
- - With CTC-LM beam search decoder (need *Task 5, 6*).
-10. Make it run perfectly on **clusters**.
-11. Experiments and **benchmarking** (for accuracy, not efficiency):
- - With public English dataset.
- - With internal (Baidu) Mandarin dataset (optional).
-12. Time **profiling** and optimization.
-13. Prepare **docs**.
-14. Prepare PaddlePaddle **Book** chapter with a simplified version.
-
-## Task Dependency
-
-Tasks parallelizable within phases:
-
-Roadmap | Description | Parallelizable Tasks
------------ | :------------------------------------ | :--------------------
-Phase I | Simplified model & components | *Task 1* ~ *Task 8*
-Phase II | Standard model & benchmarking & profiling | *Task 9* ~ *Task 12*
-Phase III | Documentations | *Task13* ~ *Task14*
-
-Issue for each task will be created later. Contributions, discussions and comments are all highly appreciated and welcomed!
-
-## Design Details
-
-### Overview
-
-Traditional **ASR** (Automatic Speech Recognition) pipelines require great human efforts devoted to elaborately tuning multiple hand-engineered components (e.g. audio feature design, accoustic model, pronuncation model and language model etc.). **Deep Speech 2** (**DS2**) \[[1](#references)\], however, trains such ASR models in an end-to-end manner, replacing most intermediate modules with only a single deep network architecture. With scaling up both the data and model sizes, DS2 achieves a very significant performance boost.
-
-Please read Deep Speech 2 \[[1](#references),[2](#references)\] paper for more background knowledge.
-
-The classical DS2 network contains 15 layers (from bottom to top):
-
-- **Two** data layers (audio spectrogram, transcription text)
-- **Three** 2D convolution layers
-- **Seven** uni-directional simple-RNN layers
-- **One** lookahead row convolution layers
-- **One** fully-connected layers
-- **One** CTC-loss layer
-
-
+
+
+
+PaddlePaddle can support model parallelism by converting the IR so that the user no longer needs to manually perform the computation and operations in the Python component:
+
+
+
+The IR for PaddlePaddle after refactoring is called a `Block`; it specifies the computation dependency graph and the variables used in the computation.
+
+### Limitation 3
+
+The user cannot directly specify the parameter update rule for the parameter server in the Python module, since the parameter server does not use the same computation definition as the trainer. Instead, the update rule is baked into the parameter server, and the user cannot specify it explicitly.
+
+This could be fixed by making the parameter server also run an IR, which can be different from the IR on the trainer side.
+For a detailed explanation, refer to this document -
+[Design Doc: Parameter Server](./parameter_server.md)
+
+## Distributed Training Architecture
+
+The revamped distributed training architecture can address the above discussed limitations. Below is the illustration of how it does so:
+
+
+
+The major components are: *Python API*, *Distributed Transpiler* and *Remote Executor*.
+
+### Python API
+
+The Python API is the Python library that the user's Python code invokes to read the data, build the neural network topology, start training, etc.
+
+```Python
+images = fluid.layers.data(name='pixel', shape=[1, 28, 28], dtype='float32')
+label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+...
+predict = fluid.layers.fc(input=conv_pool_2, size=10, act="softmax")
+cost = fluid.layers.cross_entropy(input=predict, label=label)
+avg_cost = fluid.layers.mean(x=cost)
+optimizer = fluid.optimizer.Adam(learning_rate=0.01)
+optimizer.minimize(avg_cost)
+
+train_reader = paddle.batch(
+ paddle.reader.shuffle(
+ paddle.dataset.mnist.train(), buf_size=500),
+ batch_size=BATCH_SIZE)
+
+place = fluid.CPUPlace()
+exe = fluid.Executor(place)
+feeder = fluid.DataFeeder(feed_list=[images, label], place=place)
+exe.run(fluid.default_startup_program())
+
+for pass_id in range(10):
+    for data in train_reader():
+        loss, = exe.run(fluid.default_main_program(),
+                        feed=feeder.feed(data),
+                        fetch_list=[avg_cost])
+```
+
+The code above is a typical local training program; the "Training Program" is built using helper functions such as
+`fluid.layers.fc`. The training is done by calling `Executor.run`
+iteratively.
+
+For more details: the IR is implemented as [Program](../program.md), and `ProgramDesc` is the corresponding protobuf type.
+
+[Executor](../executor.md) simply runs the `ProgramDesc`. For local training you generally use
+`Executor` to run the program locally. For any kind of distributed training, you can use
+`RemoteExecutor` to specify the desired distributed training method with some optional arguments.
+
+### Distributed Transpiler
+
+The Distributed Transpiler automatically converts the IR (in protobuf format) to partitioned IRs. Then
+the Remote Executor dispatches the new IRs to Remote Executors across the cluster.
+Below are the steps that are followed (a rough sketch of this flow is shown after the list):
+
+1. The user only needs to change `Executor` to `RemoteExecutor` to turn a local program into a distributed program.
+1. `RemoteExecutor` calls `Distributed Transpiler` to "transpile" the user's program into several IRs representing a
+   distributed training program:
+   1. Parse configurations from `RemoteExecutor`.
+   1. Determine the type of distributed program, which can be DataParallelism, ModelParallelism or Streaming.
+   1. Partition the `ProgramDesc` according to the type and add `send` / `recv` OP pairs on the boundaries. Taking
+      DataParallelism as an example, the transpiler removes the optimization operators, adds a `send` OP to the
+      "trainer" role, and adds the optimization operators to the parameter server role after the `recv` OP.
+1. Dispatch the partitioned graph to the different `RemoteExecutor`s in the cluster.
+1. The `RemoteExecutor` on each node runs the received `ProgramDesc` until the end.
+
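+For illustration, below is a minimal sketch of what the transpilation step could look like if invoked
+directly. It reuses `exe`, `feeder`, `train_reader` and `avg_cost` from the Python API example above;
+the class and method names follow `fluid.DistributeTranspiler`, but the exact signatures and the
+endpoint / role values here are assumptions for illustration, not a final API.
+
+```python
+import paddle.fluid as fluid
+
+# Hypothetical job layout -- endpoints, trainer id and role are made-up values.
+pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
+trainer_id, num_trainers = 0, 2
+role = "TRAINER"  # or "PSERVER"
+
+# Assumed transpiler API: partition the program into trainer / pserver sub-programs.
+t = fluid.DistributeTranspiler()
+t.transpile(trainer_id, pservers=pserver_endpoints, trainers=num_trainers)
+
+if role == "PSERVER":
+    # A parameter-server node runs only the sub-program for its own endpoint.
+    exe.run(t.get_pserver_program("192.168.0.1:6174"))
+else:
+    # A trainer node runs the program whose optimization OPs were replaced
+    # by send/recv OPs during transpilation.
+    for data in train_reader():
+        exe.run(t.get_trainer_program(),
+                feed=feeder.feed(data), fetch_list=[avg_cost])
+```
+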
+
+### RemoteExecutor
+
+As shown in the graph, `RemoteExecutor.run` sends the IR to the cluster for execution.
+You can also use the `fetch_list` parameter to interactively fetch variables back to the local machine for
+log printing.
+
+The Python `RemoteExecutor` is derived from the `Executor` class.
+
+```python
+exe = RemoteExecutor(
+ feed=feeder.feed(data),
+ fetch_list=[avg_cost],
+ job_desc=JobDesc(
+ jobname,
+ num_trainer,
+ num_pserver,
+ cpu_per_trainer,
+ gpu_per_trainer,
+ mem_per_trainer,
+ cpu_per_pserver,
+ mem_per_pserver
+ ))
+for data in train_reader():
+    loss, = exe.run(trainer_prog,
+                    feed=feeder.feed(data),
+                    fetch_list=[avg_cost])
+```
+
+The `JobDesc` object describes the resource specification of the distributed job to run on the
+cluster environment.
+
+
+
+`RemoteExecutor.run` sends the `ProgramDesc` and
+[TrainingJob](https://github.com/PaddlePaddle/cloud/blob/unreleased-tpr/doc/autoscale/README.md#training-job-resource)
+to a server in the cluster which executes `RemoteExecutor.listen`. This server is responsible
+for starting the final Kubernetes Jobs to run the different roles of the `ProgramDesc` from a `ConfigMap`.
+
+
+### Placement Algorithm
+
+Our first implementation will only support "trainer-parameter server" placement: the parameters, initializers, and optimizers are all placed on the PaddlePaddle runtimes with the parameter server role. Everything else will be placed on the PaddlePaddle runtimes with the trainer role. This has the same functionality as the "trainer-parameter server" architecture of PaddlePaddle v0.10.0, but is more generic and flexible.
+
+In the future, a more general placement algorithm should be implemented, which makes placements according to the input IR, and a model of device computation time and device communication time. Model parallelism requires the generic placement algorithm.
+
+
+### Local Training Architecture
+
+The local training architecture will be the same as the distributed training architecture, the difference is that everything runs locally, and there is just one PaddlePaddle runtime:
+
+
+
+
+### Training Data
+
+In PaddlePaddle v0.10.0, training data is typically read
+with a [data reader](../reader/README.md) from Python. This approach is
+no longer efficient when training in a distributed fashion, since the Python
+process no longer runs on the same node as the trainer processes: the
+Python reader would need to read from the distributed filesystem
+(assuming it has access) and send the data to the trainers, doubling the
+network traffic.
+
+When doing distributed training, the user can still use a Python data
+reader: the training data are sent with `Executor.run`. However, this should
+be used for debugging purposes only. Users are encouraged to use
+the data-reading OPs.
+
+
+## References:
+
+[1] [TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf)
+
+[2] [TensorFlow: A System for Large-Scale Machine Learning](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
diff --git a/doc/fluid/design/dist_train/distributed_lookup_table_design.md b/doc/fluid/design/dist_train/distributed_lookup_table_design.md
new file mode 100644
index 0000000000000000000000000000000000000000..e543adf0f97cc6b47415b807d7a1ed1effec9b22
--- /dev/null
+++ b/doc/fluid/design/dist_train/distributed_lookup_table_design.md
@@ -0,0 +1,128 @@
+# Design Doc: Distributed Lookup Table Operator
+
+This document describes a lookup table operator in PaddlePaddle where the table could be too large
+to fit in the memory of a single computer.
+
+## Background
+
+A lookup table operator is widely used in deep learning for learning the
+representation, or the
+[*embedding*](http://www.cs.toronto.edu/~fritz/absps/ieee-lre.pdf), of
+symbols.
+
+### The Forward Algorithm
+
+The forward algorithm of the lookup table is a multiplication of the
+input vector x and the lookup table matrix W:
+
+$$y = x * W$$
+
+When x is a sparse vector of symbols, the above multiplication
+simplifies into looking up rows in W that correspond to symbols in x,
+denoted by W(x). Please be aware that W could be too large to fit in
+memory, so we'd need a distributed storage service that supports the
+lookup of rows.
+
+The following figure illustrates the multiplication of x with two
+non-zero elements, or say, two symbols, and a lookup table W:
+
+
+
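+To make this concrete, below is a tiny NumPy sketch of the forward lookup (illustrative only; the
+table size, ids and values are made up):
+
+```python
+import numpy as np
+
+vocab_size, emb_dim = 10000, 4
+W = np.random.rand(vocab_size, emb_dim)  # the (possibly huge) lookup table
+
+# x is a sparse vector with two non-zero symbols, e.g. ids 3 and 42.
+ids = np.array([3, 42])
+
+# y = x * W degenerates into gathering the corresponding rows of W, i.e. W(x).
+y = W[ids]  # shape: (2, emb_dim)
+```
+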
+### The Backward Algorithm
+
+The backward algorithm computes W'(x) using W(x). W'(x) has the same
+size as W(x) and is much smaller than W.
+
+To optimize W given W', we can do a simple SGD update:
+
+$$W = W - \lambda \cdot W'$$
+
+or some more sophisticated algorithms that rely on both W' and W:
+
+$$W = f(W, W')$$
+
+The following figure illustrates the backward pass of the lookup
+operator: 
+
+## Distributed Storage Service
+
+The forward algorithm requires a distributed storage service for W.
+The backward algorithm prefers that the storage system can apply the
+optimization algorithm on W. The following two sections describe two
+solutions -- the former doesn't require that the storage service can
+do optimization, while the latter does.
+
+### Storage Service Doesn't Optimize
+
+In this design, we use highly-optimized distributed storage, e.g.,
+memcached, as the storage service, and we run the optimization
+algorithm on parameter servers of PaddlePaddle. The following figure
+illustrates the training process.
+
+
+
+
+
+After conversion:
+
+
+
+## Implementation
+
+- `Multi-CPU Transpiler` will convert the graph to a multi-CPU graph
+  which will be executed with multiple threads.
+- `BlockingCounter` will `Init`/`Decrement` an atomic counter, and a blocking `Wait`
+  will wait for the atomic counter to become `0`:
+ ```cpp
+ BlockingCounter bc(thread_count);
+ for (int i = 0; i < thread_count; ++i) {
+    thread_pool->Start([&bc] { bc.DecrementCount(); });
+ }
+ bc.Wait();
+ ```
+- `ParallelDo` Operator
+ - Initialize a thread pool which is a Singleton.
+  - Use a block id as the input, and create and run the specified Block on independent scopes
+    with multiple threads.
+ - Initialize a `BlockingCounter` instance and wait until all threads are done.
+- `Split` Operator will split the Input Tensor into a TensorArray.
+- `Merge` merges all the gradients calculated in the different threads
+  with a `mean/sum/max/min...` method, and then runs the Optimizer OP to optimize `W`.
+
+## TODO
+
+- Improve the optimizer stage with multiple threads, since we could
+  assign the parameters to different threads and execute the
+  optimizer with multiple threads.
diff --git a/doc/fluid/design/dist_train/parameter_server.md b/doc/fluid/design/dist_train/parameter_server.md
new file mode 100644
index 0000000000000000000000000000000000000000..6ce48dfbfce8b094684b412ebfda7e505ddc30ae
--- /dev/null
+++ b/doc/fluid/design/dist_train/parameter_server.md
@@ -0,0 +1,107 @@
+# Design Doc: Parameter Server
+
+## Abstract
+
+We propose an approach to implement the parameter server. In this
+approach, there is no fundamental difference between the trainer and
+the parameter server: they both run subgraphs, but subgraphs of
+different purposes.
+
+## Background
+
+The previous implementations of the parameter server do not run a
+fluid sub-program. Parameter initialization, optimizer computation, network
+communication and checkpointing are implemented twice, on both the
+trainer and the parameter server.
+
+It would be great if we could write the code once and use it on both the
+trainer and the parameter server, since this reduces code duplication and
+improves extensibility. Given that after the current refactoring we are
+representing everything as a computation graph on the
+trainer, representing everything as a computation graph on the parameter
+server becomes a natural extension.
+
+## Design
+
+### Distributed Transpiler
+
+The *Distributed Transpiler* converts the user-defined fluid program
+into sub-programs to be scheduled on different nodes with the following
+steps:
+
+1. OP placement: the OPs will be placed on different nodes according
+ to a heuristic that minimizes the estimated total computation
+ time. Currently we will use a simple heuristic that puts parameter
+ variable on parameter server workers and everything else on trainer
+ workers.
+1. Add communication OPs to enable the communication between nodes.
+
+We will need these OPs: *Send*, *Recv*, *Enqueue*, *Dequeue*.
+
+Below is an example of converting the user defined graph to the
+subgraphs for the trainer and the parameter server:
+
+
+
+After converting:
+
+
+
+1. The parameter variable W and its optimizer program are placed on the parameter server.
+1. Operators are added to the program.
+  - *Send* sends data to the connected *Recv* operator. The
+    scheduler on the receiving node will only schedule the *Recv* operator
+    to run when the *Send* operator has run (the *Send* OP will mark
+    the *Recv* OP runnable automatically).
+  - *Enqueue* enqueues the input variable; it can block until space
+    becomes available in the queue.
+  - *Dequeue* outputs a configurable number of tensors from the
+ queue. It will block until the queue has the required number of
+ tensors.
+
+### Sparse Update
+
+For embedding layers, the gradient may have many rows containing only zeros during training.
+If the gradient uses a dense tensor for parameter optimization,
+it could consume unnecessary memory, slow down the calculations and waste
+bandwidth while doing distributed training.
+In Fluid, we introduce [SelectedRows](../selected_rows.md) to represent a list of rows containing
+non-zero gradient data. So when we do parameter optimization, both locally and remotely,
+we only need to send those non-zero rows to the optimizer operators:
+
+
+
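+As a rough NumPy sketch of the idea (this is not the Fluid `SelectedRows` API; the table size,
+row indices, gradient values and learning rate are made up):
+
+```python
+import numpy as np
+
+W = np.random.rand(10000, 4)   # embedding parameter
+lr = 0.01                      # learning rate
+
+# A SelectedRows-like gradient: only the touched rows and their values.
+rows = np.array([3, 42])       # rows with non-zero gradient
+values = np.random.rand(2, 4)  # gradient values for those rows
+
+# Sparse update: only (rows, values) are sent, and only those rows are updated.
+W[rows] -= lr * values
+```
+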
+### Benefits
+
+- Model parallelism becomes easier to implement: it is an extension to
+ the trainer - parameter server approach. We can have several "Transpilers"
+ to achieve different goals.
+- User-defined optimizer is easier to add - user can now express it as
+ a sub-program.
+- No more duplication logic inside the trainer and the parameter
+ server mentioned in the background section.
+
+### Challenges
+
+- It is important to balance the parameter shards on multiple
+ parameter servers. If a single parameter is very big (for example: some
+ word-embedding, fully connected, softmax layer), we need to
+ automatically partition the single parameter onto different
+ parameter servers when possible (only element-wise optimizer depends
+ on the parameter variable).
+- In the "Async SGD" figure, the "W" variable on the parameter server
+ could be read and written concurrently. See
+ [here](https://github.com/PaddlePaddle/Paddle/pull/6394) for more
+ details about concurrent program in Fluid.
+
+### Discussion
+
+- Can the Enqueue OP be implemented under our current tensor design
+ (put the input tensor into the queue tensor)?
+- *Dequeue* OP will have variable numbers of output (depending on the
+ `min_count` attribute), does our current design support it? (similar
+ question for the *Add* OP)
+
+### References
+
+[1] [TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf)
diff --git a/doc/design/refactor/src/compiler.graffle b/doc/fluid/design/dist_train/src/compiler.graffle
similarity index 100%
rename from doc/design/refactor/src/compiler.graffle
rename to doc/fluid/design/dist_train/src/compiler.graffle
diff --git a/doc/design/refactor/src/compiler.png b/doc/fluid/design/dist_train/src/compiler.png
similarity index 100%
rename from doc/design/refactor/src/compiler.png
rename to doc/fluid/design/dist_train/src/compiler.png
diff --git a/doc/design/refactor/src/dist-graph.graffle b/doc/fluid/design/dist_train/src/dist-graph.graffle
similarity index 100%
rename from doc/design/refactor/src/dist-graph.graffle
rename to doc/fluid/design/dist_train/src/dist-graph.graffle
diff --git a/doc/design/refactor/src/dist-graph.png b/doc/fluid/design/dist_train/src/dist-graph.png
similarity index 100%
rename from doc/design/refactor/src/dist-graph.png
rename to doc/fluid/design/dist_train/src/dist-graph.png
diff --git a/doc/fluid/design/dist_train/src/distributed_architecture.graffle b/doc/fluid/design/dist_train/src/distributed_architecture.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..d1b60141342232e06227c2d430ebc60ec349a907
Binary files /dev/null and b/doc/fluid/design/dist_train/src/distributed_architecture.graffle differ
diff --git a/doc/fluid/design/dist_train/src/distributed_architecture.png b/doc/fluid/design/dist_train/src/distributed_architecture.png
new file mode 100644
index 0000000000000000000000000000000000000000..29c7b0c0783f97c6d33b1db1ed484d6a2b9dd356
Binary files /dev/null and b/doc/fluid/design/dist_train/src/distributed_architecture.png differ
diff --git a/doc/design/refactor/src/local-graph.graffle b/doc/fluid/design/dist_train/src/local-graph.graffle
similarity index 100%
rename from doc/design/refactor/src/local-graph.graffle
rename to doc/fluid/design/dist_train/src/local-graph.graffle
diff --git a/doc/design/refactor/src/local-graph.png b/doc/fluid/design/dist_train/src/local-graph.png
similarity index 100%
rename from doc/design/refactor/src/local-graph.png
rename to doc/fluid/design/dist_train/src/local-graph.png
diff --git a/doc/fluid/design/dist_train/src/local_architecture.graffle b/doc/fluid/design/dist_train/src/local_architecture.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..49fcc663ebe3824aa234e3a67aadf285cb417877
Binary files /dev/null and b/doc/fluid/design/dist_train/src/local_architecture.graffle differ
diff --git a/doc/fluid/design/dist_train/src/local_architecture.png b/doc/fluid/design/dist_train/src/local_architecture.png
new file mode 100644
index 0000000000000000000000000000000000000000..14adc9fd72b855bb9f74fbf2c84ac9ec0cf2b122
Binary files /dev/null and b/doc/fluid/design/dist_train/src/local_architecture.png differ
diff --git a/doc/fluid/design/dist_train/src/lookup_table.png b/doc/fluid/design/dist_train/src/lookup_table.png
new file mode 100644
index 0000000000000000000000000000000000000000..72dfe3547f731d0d090338afb206b0549dff472e
Binary files /dev/null and b/doc/fluid/design/dist_train/src/lookup_table.png differ
diff --git a/doc/fluid/design/dist_train/src/lookup_table_training.png b/doc/fluid/design/dist_train/src/lookup_table_training.png
new file mode 100644
index 0000000000000000000000000000000000000000..cc7cc4aeb3b885850fe2f70f19fb84d5873bed1e
Binary files /dev/null and b/doc/fluid/design/dist_train/src/lookup_table_training.png differ
diff --git a/doc/fluid/design/dist_train/src/multi-threads.graffle b/doc/fluid/design/dist_train/src/multi-threads.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..e71173715fff92a0a933d0c7d83599ba948552c6
Binary files /dev/null and b/doc/fluid/design/dist_train/src/multi-threads.graffle differ
diff --git a/doc/fluid/design/dist_train/src/multi-threads/multi-threads@3x.png b/doc/fluid/design/dist_train/src/multi-threads/multi-threads@3x.png
new file mode 100644
index 0000000000000000000000000000000000000000..e40a869987dbbf5019d4cb03c1dab55b74d6c9f9
Binary files /dev/null and b/doc/fluid/design/dist_train/src/multi-threads/multi-threads@3x.png differ
diff --git a/doc/fluid/design/dist_train/src/multi-threads/single-thread@3x.png b/doc/fluid/design/dist_train/src/multi-threads/single-thread@3x.png
new file mode 100644
index 0000000000000000000000000000000000000000..4083aebfdd45af5fbac25fa2c4176bc08c3cb44a
Binary files /dev/null and b/doc/fluid/design/dist_train/src/multi-threads/single-thread@3x.png differ
diff --git a/doc/design/refactor/src/paddle-compile.graffle b/doc/fluid/design/dist_train/src/paddle-compile.graffle
similarity index 100%
rename from doc/design/refactor/src/paddle-compile.graffle
rename to doc/fluid/design/dist_train/src/paddle-compile.graffle
diff --git a/doc/design/refactor/src/paddle-compile.png b/doc/fluid/design/dist_train/src/paddle-compile.png
similarity index 100%
rename from doc/design/refactor/src/paddle-compile.png
rename to doc/fluid/design/dist_train/src/paddle-compile.png
diff --git a/doc/fluid/design/dist_train/src/remote_executor.graffle b/doc/fluid/design/dist_train/src/remote_executor.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..41b2067311694b56d211a4f32d1b76884eeffd2d
Binary files /dev/null and b/doc/fluid/design/dist_train/src/remote_executor.graffle differ
diff --git a/doc/fluid/design/dist_train/src/remote_executor.png b/doc/fluid/design/dist_train/src/remote_executor.png
new file mode 100644
index 0000000000000000000000000000000000000000..744e2fb2e0f1bbe058e991ba7b2a09000965ee79
Binary files /dev/null and b/doc/fluid/design/dist_train/src/remote_executor.png differ
diff --git a/doc/fluid/design/dist_train/src/sparse_update.graffle b/doc/fluid/design/dist_train/src/sparse_update.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..08d689a58f83698d8c1158ee3990ed8abf3a7a9a
Binary files /dev/null and b/doc/fluid/design/dist_train/src/sparse_update.graffle differ
diff --git a/doc/fluid/design/dist_train/src/sparse_update.png b/doc/fluid/design/dist_train/src/sparse_update.png
new file mode 100644
index 0000000000000000000000000000000000000000..8c872e6ac479f7d1b818a4a207956c43155d0ad7
Binary files /dev/null and b/doc/fluid/design/dist_train/src/sparse_update.png differ
diff --git a/doc/design/ops/images/2_level_rnn.dot b/doc/fluid/design/dynamic_rnn/2_level_rnn.dot
similarity index 100%
rename from doc/design/ops/images/2_level_rnn.dot
rename to doc/fluid/design/dynamic_rnn/2_level_rnn.dot
diff --git a/doc/design/ops/images/2_level_rnn.png b/doc/fluid/design/dynamic_rnn/2_level_rnn.png
similarity index 100%
rename from doc/design/ops/images/2_level_rnn.png
rename to doc/fluid/design/dynamic_rnn/2_level_rnn.png
diff --git a/doc/design/ops/images/rnn.dot b/doc/fluid/design/dynamic_rnn/rnn.dot
similarity index 100%
rename from doc/design/ops/images/rnn.dot
rename to doc/fluid/design/dynamic_rnn/rnn.dot
diff --git a/doc/design/ops/images/rnn.jpg b/doc/fluid/design/dynamic_rnn/rnn.jpg
similarity index 100%
rename from doc/design/ops/images/rnn.jpg
rename to doc/fluid/design/dynamic_rnn/rnn.jpg
diff --git a/doc/fluid/design/dynamic_rnn/rnn.md b/doc/fluid/design/dynamic_rnn/rnn.md
new file mode 100644
index 0000000000000000000000000000000000000000..6f414e5549b149bc88fb252085ff56dbb06730f8
--- /dev/null
+++ b/doc/fluid/design/dynamic_rnn/rnn.md
@@ -0,0 +1,153 @@
+# RNNOp design
+
+This document describes the RNN (Recurrent Neural Network) operator and how it is implemented in PaddlePaddle. The RNN op requires that all instances in a mini-batch have the same length. We will have a more flexible dynamic RNN operator in the future.
+
+## RNN Algorithm Implementation
+
+
+
+
+
+Figure 2 illustrates the RNN's data flow.
+
+
+
+
+
+
+cuDNN provides APIs that cover this whole series of computations, so we can use them in our GPU kernel.
+
+### Python
+
+`batch_norm_op` is wrapped as a layer in Python:
+
+```python
+def batch_norm_layer(net,
+                     input,
+                     output,
+                     scale,
+                     bias,
+                     use_global_est=False,
+                     epsilon=1e-6,
+                     momentum=0.99):
+    mean_cache = scope.new_var(name='estimated_mean', trainable=False)
+    var_cache = scope.new_var(name='estimated_var', trainable=False)
+    batch_mean = scope.new_var(name='batch_mean')
+    batch_var = scope.new_var(name='batch_var')
+    batch_norm_op = Operator('batch_norm_op',
+                             x=input,
+                             estimated_mean=mean_cache,
+                             estimated_var=var_cache,
+                             scale=scale,
+                             bias=bias,
+                             y=output,
+                             batch_mean=batch_mean,
+                             batch_var=batch_var,
+                             saved_mean=mean_cache,
+                             saved_var=var_cache,
+                             is_infer=False,
+                             use_global_est=use_global_est,
+                             epsilon=epsilon,
+                             momentum=momentum)
+    net.append_op(batch_norm_op)
+    return output
+```
+
+Because the Python API has not been finalized, the code above should be regarded as pseudo code. There are a few key points to note:
+
+1. `estimated_mean` and `estimated_var` are assigned the same variables as `saved_mean` and `saved_var` respectively, so they share the same memory. The output mean and variance values (`saved_mean` and `saved_var`) of one batch become the inputs (`estimated_mean` and `estimated_var`) of the next batch.
+
+2. `is_infer` decides whether `batch_norm_op` runs in training mode or inference mode. However, a network may contain both training and inference parts, and the user may switch `batch_norm_op`'s running mode in a Python `for` loop like this:
+
+```python
+for pass_id in range(PASS_NUM):
+    # ...
+    net.train()  # run training model
+    if pass_id % 100 == 0:
+        net.infer(test_image)  # run inference model
+    # ...
+```
+
+`is_infer` is an attribute, and once an operator is created its attributes cannot be changed. This suggests that we should maintain two `batch_norm_op`s in the model: one whose `is_infer` is `True` (call it `infer_batch_norm_op`) and one whose `is_infer` is `False` (call it `train_batch_norm_op`). They share all parameters and variables but are placed in two different branches. In other words, if a network contains a `batch_norm_op`, it forks into two branches: one goes through `train_batch_norm_op` and the other goes through `infer_batch_norm_op`, as in the sketch below:
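+
+A framework-free Python sketch of this idea follows; the class and variable names are illustrative only and are not Fluid APIs. The two operators are constructed once with different, immutable `is_infer` attributes, yet they read and write the same shared parameter and statistics storage, which is exactly the relationship between `train_batch_norm_op` and `infer_batch_norm_op`.
+
+```python
+class BatchNormOp(object):
+    """Toy stand-in for batch_norm_op; only the attribute/state sharing matters."""
+
+    def __init__(self, params, stats, is_infer):
+        self.params = params      # shared scale/bias
+        self.stats = stats        # shared estimated mean/var
+        self.is_infer = is_infer  # attribute fixed at construction time
+
+    def run(self, batch_mean, batch_var, momentum=0.99):
+        if self.is_infer:
+            # inference branch: use the global estimates, never update them
+            return self.stats["mean"], self.stats["var"]
+        # training branch: update the shared running statistics in place
+        self.stats["mean"] = momentum * self.stats["mean"] + (1 - momentum) * batch_mean
+        self.stats["var"] = momentum * self.stats["var"] + (1 - momentum) * batch_var
+        return batch_mean, batch_var
+
+
+params = {"scale": 1.0, "bias": 0.0}
+stats = {"mean": 0.0, "var": 1.0}
+train_batch_norm_op = BatchNormOp(params, stats, is_infer=False)
+infer_batch_norm_op = BatchNormOp(params, stats, is_infer=True)  # same storage
+```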
+
+
+

+
+
+
+* Note: the CI environment uses the Docker images from https://github.com/PaddlePaddle/buildtools as the build environment, so that more Linux
+  distributions can be supported. If you need to build manually, you can also use these images. They can also be downloaded from https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/.
+* PyPI does not allow overwriting an uploaded package, so once a wheel with a given version number has been published it cannot be changed. The next wheel must use a new version number before it can be uploaded.
+
+## Publishing the Docker Images
+
+After the PaddlePaddle CI above finishes building the wheel packages, it automatically pushes Docker images to DockerHub. Publishing the Docker images therefore only requires tagging the automatically pushed
+images with the corresponding version number:
+
+1. Visit https://hub.docker.com/r/paddlepaddle/paddle/tags/ and check that the latest tag was updated after the wheel builds above completed.
+1. Run `docker pull paddlepaddle/paddle:[latest tag]`; the latest tag can be `latest`, `latest-gpu`, and so on.
+1. Run `docker tag paddlepaddle/paddle:[latest tag] paddlepaddle/paddle:[version]`
+1. Run `docker push paddlepaddle/paddle:[version]`
+
+## PaddlePaddle Branching Conventions
+
+PaddlePaddle development follows the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) branching model, with some adjustments for GitHub's features.
+
+* The main PaddlePaddle repository follows the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) branching model, where:
+  * The `master` branch is the stable branch. Every version on `master` has passed both unit tests and regression tests.
+  * The `develop` branch is the development branch. Every version on `develop` has passed unit tests, but not regression tests.
+  * A `release/<version>` branch is a temporary branch created for each release. Code on such a branch is undergoing regression testing.
+
+* Forked repositories of other users do not need to strictly follow the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) model, but every branch in a fork is effectively a feature branch.
+  * It is recommended that developers keep the `develop` branch of their fork in sync with the `develop` branch of the main repository.
+  * It is recommended that developers create their own feature branches from `develop` in their fork.
+  * When a feature branch is finished, submit a `Pull Request` to the main PaddlePaddle repository for code review.
+  * During the review, developers can keep pushing fixes to the same feature branch.
+
+* Bug-fix branches are also maintained in the developer's own fork. Unlike a feature branch, a bug-fix branch needs to open `Pull Request`s against the main repository's `master` and `develop` branches, as well as any existing `release/<version>` branch, at the same time.
+
+## PaddlePaddle Regression Test List
+
+This list describes the features that must be tested before each PaddlePaddle release.
+
+### All Chapters of the PaddlePaddle Book
+
+Every PaddlePaddle release must first ensure that all chapters of the PaddlePaddle Book work correctly. Correctness covers verifying model training both with the current `paddle_trainer` and with the pure `Python` API.
+
+| | Quick Start | Recognize Digits | Image Classification | Word Vectors | Sentiment Analysis | Semantic Role Labeling | Machine Translation | Personalized Recommendation |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| API.V2 + Docker + GPU | | | | | | | | |
+| API.V2 + Docker + CPU | | | | | | | | |
+| `paddle_trainer` + Docker + GPU | | | | | | | | |
+| `paddle_trainer` + Docker + CPU | | | | | | | | |
+| API.V2 + Ubuntu + GPU | | | | | | | | |
+| API.V2 + Ubuntu + CPU | | | | | | | | |
+| `paddle_trainer` + Ubuntu + GPU | | | | | | | | |
+| `paddle_trainer` + Ubuntu + CPU | | | | | | | | |
diff --git a/doc/fluid/dev/src/fc.py b/doc/fluid/dev/src/fc.py
new file mode 100644
index 0000000000000000000000000000000000000000..3b074821cc2276a29b2a8639e82199fcf4d72020
--- /dev/null
+++ b/doc/fluid/dev/src/fc.py
@@ -0,0 +1,81 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+def fc(input,
+       size,
+       num_flatten_dims=1,
+       param_attr=None,
+       bias_attr=None,
+       act=None,
+       name=None):
+    """
+    **Fully Connected Layer**
+
+    The fully connected layer can take multiple tensors as its inputs. It
+    creates a variable called weights for each input tensor, which represents
+    a fully connected weight matrix from each input unit to each output unit.
+    The fully connected layer multiplies each input tensor with its corresponding
+    weight to produce an output Tensor. If multiple input tensors are given,
+    the results of the multiplications will be summed up. If bias_attr is
+    not None, a bias variable will be created and added to the output. Finally,
+    if activation is not None, it will be applied to the output as well.
+
+    This process can be formulated as follows:
+
+    .. math::
+
+        Out = Act({\sum_{i=0}^{N-1}X_iW_i + b})
+
+    In the above equation:
+
+    * :math:`N`: Number of the inputs.
+    * :math:`X_i`: The input tensor.
+    * :math:`W`: The weights created by this layer.
+    * :math:`b`: The bias parameter created by this layer (if needed).
+    * :math:`Act`: The activation function.
+    * :math:`Out`: The output tensor.
+
+    Args:
+        input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
+            the input tensor(s) is at least 2.
+        size(int): The number of output units in this layer.
+        num_flatten_dims (int, default 1): The fc layer can accept an input tensor with more than
+            two dimensions. If this happens, the multidimensional tensor will first be flattened
+            into a 2-dimensional matrix. The parameter `num_flatten_dims` determines how the input
+            tensor is flattened: the first `num_flatten_dims` (inclusive, index starts from 1)
+            dimensions will be flattened to form the first dimension of the final matrix (height of
+            the matrix), and the remaining `rank(X) - num_flatten_dims` dimensions are flattened to
+            form the second dimension of the final matrix (width of the matrix). For example, suppose
+            `X` is a 5-dimensional tensor with shape [2, 3, 4, 5, 6] and `num_flatten_dims` = 3.
+            Then, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] = [24, 30].
+        param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
+            parameters/weights of this layer.
+        bias_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for the bias
+            of this layer. If it is set to None, no bias will be added to the output units.
+        act (str, default None): Activation to be applied to the output of this layer.
+        name (str, default None): The name of this layer.
+
+    Returns:
+        A tensor variable storing the transformation result.
+
+    Raises:
+        ValueError: If rank of the input tensor is less than 2.
+
+    Examples:
+        .. code-block:: python
+
+            data = fluid.layers.data(name="data", shape=[32, 32], dtype="float32")
+            fc = fluid.layers.fc(input=data, size=1000, act="tanh")
+    """
diff --git a/doc/fluid/dev/support_new_device.md b/doc/fluid/dev/support_new_device.md
new file mode 100644
index 0000000000000000000000000000000000000000..8983df900460127fc130043c52373dab505363ba
--- /dev/null
+++ b/doc/fluid/dev/support_new_device.md
@@ -0,0 +1,240 @@
+# Design Doc: Supporting new Device/Library
+
+## Background
+
+Deep learning has a high demand for computing resources. New high-performance devices and computing libraries are appearing very frequently. Deep learning frameworks have to integrate these high-performance devices and computing libraries in a flexible and efficient manner.
+
+On one hand, hardware and computing libraries usually do not have a one-to-one correspondence. For example, Intel CPUs support Eigen and MKL computing libraries while Nvidia GPUs support Eigen and cuDNN computing libraries. We have to implement operator specific kernels for each computing library.
+
+On the other hand, users usually do not want to care about the low-level hardware and computing libraries when writing a neural network configuration. In Fluid, `Layer` is exposed in `Python`, and `Operator` is exposed in `C++`. Both `Layer` and `Operator` are hardware independent.
+
+Supporting a new device/library in Fluid is therefore a challenge.
+
+
+## Basic: Integrate A New Device/Library
+
+For a general overview of Fluid, please refer to the [overview doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/read_source.md).
+
+There are mainly three parts to consider when integrating a new device/library:
+
+- Place and DeviceContext: indicate the device id and manage hardware resources
+
+- Memory and Tensor: malloc/free data on a certain device
+
+- Math Functor and OpKernel: implement computing units on certain devices/libraries
+
+### Place and DeviceContext
+
+Please note that devices and computing libraries are not in one-to-one correspondence. A device can have many computing libraries, and a computing library can also support several devices.
+
+#### Place
+Fluid uses class [Place](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h#L55) to represent the device memory where data is located. If we add another device, we have to add the corresponding `DevicePlace`.
+
+```
+ | CPUPlace
+Place --| CUDAPlace
+ | FPGAPlace
+```
+
+And `Place` is defined as follows:
+
+```
+typedef boost::variant