PaddlePaddle Design Doc¶
-Ingredients¶
-As our design principle is starting from the essence: how could we -allow users to express and solve their problems as neural networks. -Some essential concepts that our API have to provide include:
--
-
- A topology is an expression of layers. -
- A layer could be any kind of computation, including cost. -
- Some layers have parameters, some don’t. Most costs don’t have -parameters. -
- In some topologies, layers share parameters. For -example, -the network for training a ranking model. -
- At programming time, users specify topologies and possible sharing -of parameters. PaddlePaddle can figure out and create parameters -required (and possibly shared) by one or more topologies. -
Starting from Examples¶
-As a summarization -of -our disucssion, -let us present two examples here:
-Example 1. Sharing Parameters between Layers¶
-We use -the -3-branch ranking model -in this example. For your convenience, I copy-a-paste the model’s -topology as follows:
-A -> f -\
-Q -> f --> cost
-B -> f -/
-The following program trains the topology including the cost, and then -use the sub-network in the trained topology in inference:
-def f(in):
- e = paddle.layer.embedding(in, parameter_name="embedding")
- o = paddle.layer.softmax(e, parameter_name="semantic")
- return o
-
-# Create 3 topologies (subnets), they share parameters because all
-# correspoinding layers have the same parameter names.
-fA = f(paddle.layer.data(input_name="A"))
-fB = f(paddle.layer.data(input_name="B"))
-fQ = f(paddle.layer.data(input_name="Q"))
-
-topology = paddle.layer.less_than(
- paddle.layer.cross_entropy(fA, fQ),
- paddle.layer.corss_entropy(fB, fQ))
-
-# Derive parameters required in topology and create them in model.
-parameters = paddle.parameters.create(topology)
-
-# Estimate parameters used in topology from data.
-paddle.train(topology, parameters, reader=read_ranking_model_data)
-
-# Inference using fA (or fB or fC, as they share their parameters).
-[testA, testB, testQ] = read_ranking_model_data()
-print "The sematic-vector of testA: ", paddle.infer(fA, parameters, testA)
-Example 2. Sharing Parameters between “Models”¶
-We use GAN in
-this example. In the following example program, d0 and d1
-correspond to the two networks in the following figure:

def G(in):
- # over-simplified example as G has only one layers:
- return paddle.layer.fc(in, parameter_name="G")
-
-def D(in);
- # again, over-simplified:
- return paddle.layer.fc(in, parameter_name="D")
-
-# Construct the first topology, which contains both D and G.
-# By learning this topology, we update parameters of G.
-d0 = paddle.layer.should_be_false(D(G(paddle.layer.data())))
-
-# Construct a second topology d1, which contains only D. By
-# training this topology, we update parameters of D. Note
-# that d1 share parameters with d0.
-d1 = paddle.layer.should_be_true(D(paddle.layer.data()))
-
-# Create parameters from a list of multiple topologies (models) for
-# the chance to share parameters between these topologies.
-parameters = paddle.parameters.create([d0, d1])
-
-# Iterative training of GAN.
-for ...:
- train(d0, parameters, reader=read_from_rng, immutable_parameters={"D"})
- train(d1, parameters, reader=read_from_realistic_images)
-
-# Use d1 for inference:
-print "D thinks a batch of images are realistic ", infer(d1, parameters, read_mnist_images)
-Summarization¶
-Above two programs reveal some important design concerns:
--
-
- Users describe a topology as an expression of layers. Every layer -has a parameter name. If the users don’t specify it explicitly, it’s automatically generated as a unique name. By -specifying the parameter name, users can specify the sharing of -parameters between layers and even between topologies. -
paddle.parameters.createfigures out parameters required by one -or more topologies from parameter names of layers. It creates these -parameters and returns aParameterSetobject, which is in essence -a map from parameter names to parameters.
-- At training and inference time,
paddle.trainandpaddle.infer-requires both a topology and the parameter set that holds the parameters of that topology. There are some reasons:-
-
- This prevents users from forgetting to call
-
paddle.parameters.create.
- paddle.trainneeds to know which parameter set to update.
-- Users could load another (pre-trained) parameter set and use it
-with a topology in
train.infer.
-
- - This prevents users from forgetting to call
-
- By specifying the
immutable_parametersparameter of -paddle.train, we can forbid the update of these parameters.
-
Reader¶
-Not all programming frameworks allow users to define I/O functions.
-An example is Google MapReduce, which can only read from text,
-SSTable, and RecordIO files. Hadoop MapReduce allows users to define
-readers and writers by deriving from base classes Reader and
-Writer. The former is less flexible but also less error-prone. We
-decide to provide the flexibility to users to define their readers.
There are some open questions here:
--
-
- Should a reader return a Python dictionary? -
- How to map multiple outputs from a reader to multiple data layers? -
- How to easily compose some existing readers to read more data and -feed a topology with more data layers? -
Training¶
-The recommended way to training a model is to call paddle.train,
-which simply calls paddle.trainer.Default, a global variable of
-type paddle.trainer.SGD. Equivalently, we can do
opt = paddle.trainer.SGD(..., paddle.updater.Adam(...))
-opt.train(topology, parameters, reader=read, ...)
-Updater¶
-Please be aware that a trainer can accept an updater as its data
-member, where an updater is a class derived from
-paddle.trainer.Updater. This is to make it easier to customize
-trainers, as discussed
-here.
Event Handler¶
-paddle.train and paddle.trainer.XXX.train take an optional
-parameter event_handler, which should be either None or a function
-that handle some events:
-
-
- BeginTraining -
- EndTraining -
- BeginIteration -
- EndIteration -
- BeginPass -
- EndPass -
where EndPass is sent if and only if the reader yields
-end_pass=True.
An example as follows:
-def event_handler(event):
- if ininstance(event, paddle.event.EndIteration):
- print paddle.test(...)
-
-paddle.train(topology, parameters, reader, event_handler)
-If we are writing a PaddlePaddle program in and for iPython/Jypyter, -we can use metaplotlib in the event handler to plot a curve of -cost/error versus iterations, as shown -here.
-Distributed Training¶
-If users want to do distributed training on a cluster, s/he should
-call paddle.dist_train and provides access tokens to the cluster as
-a parameter.
For example, if the user has a TLS certificate that allows him to -access a Kubernetes cluster, s/he should be able to call
-paddle.dist_train(model,
- trainer=paddle.trainer.SGD(...,
- paddle.updater.Adam(...)),
- reader=read,
- k8s_user="yi",
- k8s_token="kube_cluster_tls.pem",
- k8s_job="hello",
- num_parameter_servers=15)
-The pseudo code of paddle.dist_train is as follows:
def dist_train(topology, parameters, trainer, reader, ...):
- if os.getenv("KUBERNETES_SERVICE_HOST") == None:
- image_name = k8s_user + '/' + k8s_job
- docker_build(image_name)
- docker_push()
- kube_ctrl_start_job(image_name, k8s_user, k8s_token)
- else:
- rank = kube_list_containers_in_job_and_return_current_containers_rank()
- if rank == 0:
- master()
- elif rank < 15:
- parameter_server()
- else:
- trainer.train(model, reader=read)
-Please be aware that if a process is running on the Kubernetes -cluster, it will have some environment variables pre-defined.
-If dist_train doesn’t see these environment variables, it knows
-that it’s running on users’ personal computer, and it should work as a
-launcher. Otherwise, it knows that it’s running on the cluster and
-need to figure out its role as either the master, or a trainer, or a
-parameter server.
-
-By coordinating these processes, PaddlePaddle supports use both Synchronize Stochastic Gradient Descent (sync SGD) and Asynchronous Stochastic Gradient Descent (async SGD) to train user-defined neural network topologies.
-
-When training with sync SGD, parameter servers wait for all trainers to finish gradients update and then send the updated parameters to trainers, training can not proceed until the trainer received the updated parameters. This creates a synchronization point between trainers. When training with async SGD, each trainer upload gradient and download new parameters individually, without the synchronization with other trainers. Using asyc SGD will be faster in terms of time per pass, but have more noise in gradient since trainers are likely to have a stale model.
-
-### Master Server Process
-
-The master server process will:
-
-- Partition a dataset into [tasks](#task) and dispatch tasks to trainers.
-- Keep track of training progress on the dataset with [task queue](#task-queue). A training job will iterate on the dataset for a full pass until it goes into next pass.
-
-
-#### Task
-
-A task is a data shard to be trained. The total number of tasks will be much bigger than the total number of trainers. The number of data instances inside a task will be much bigger than the mini-batch size.
-
-#### Task Queue
-
-The master server has three task queues to track training progress. As illustrated in the graph below, Job A and Job B both have one master server. Each master server process has three task queues.
-
-
-
-- The todo queue holds tasks to be dispatched. When a job starts, the master server fills in the todo queue with all tasks.
-- The pending queue holds tasks that are currently training by trainers.
-- the done queue holds tasks that are already trained.
-
-The life cycle of a single task is illustrated below:
-
-
-
-1. When a new pass of training starts, all tasks will be placed in the todo queue.
-1. Upon trainer requests for new task, the master server will dispatch a task from todo queue to it, put the task in the pending queue and wait for completion.
-1. The trainer will work on its task and tell the master server once the task is completed and ask for new task. The master server will dispatch a new task to that trainer.
-1. If a task fails for any reason in trainer, or takes longer than a specific period of time, the master server will move the task back to the todo queue. The timeout count for that task will increase by one. If the timeout count is above a threshold, the task is likely to cause a trainer to crash, then it will be discarded.
-1. The master server will move completed task to the done queue. When the todo queue is empty, the master server will start a new pass by moving all tasks in the done queue to todo queue and reset the timeout counter of all tasks to zero.
-
-### Trainer Process
-
-The trainer process will:
-
-- Request tasks from the master.
-- Work on the tasks
-- Upload gradient to parameter servers, and update local model by downloading new parameters from parameter servers.
-
-### Parameter Server Process
-
-Parameter server processes hold the parameters collaboratively. The parameters are partitioned on different parameter servers.
-
-The parameter server will:
-
-- Receive gradient from the trainers, update its parameters, and give the trainers the latest parameters.
-- Periodically save its parameters to distributed file system by overriding the previous save.
-
-### Optimization Algorithms
-
-The communication pattern between the trainers and the parameter servers depends on the category of optimization algorithm:
-
-- Synchronous Stochastic Gradient Descent (sync-SGD)
-
- Parameter server will wait for all trainer finish n-th mini-batch calculation and send their gradients before broadcasting new parameters to every trainer. Every trainer will wait for the new parameters before starting n+1-th mini-batch.
-
-- Asynchronous Stochastic Gradient Descent (async-SGD)
-
- There will no synchronization between different trainers, and parameter server updates its parameter as soon as it receives new gradient:
-
- - Each trainer uploads its accumulated gradient every n mini-batches.
- - Every m mini-batches, the trainer downloads new parameters from parameter server.
- - n and m do not have to be equal.
-
-## Fault Tolerant
-
-The training job will pause if the master server processes is dead, or any of the parameter server process is dead. They will be started by [Kubernetes](https://kubernetes.io/) and recover in few minutes. Please refer to [fault recovery](#fault-recovery).
-
-The training job will continue to make progress if there is at least one training process running. The strategy depends on the type of optimization algorithm:
-
-- sync-SGD
-
- TODO
-
-- async-SGD
-
- Since async-SGD does not require synchronization between mini-batches, the system will by definition make process if at least one trainer is running.
-
-## Fault Recovery
-
-PaddlePaddle uses [etcd](https://github.com/coreos/etcd) to keep track of the states of processes. Because etcd is a distributed reliable key-value store, the restarted process can recover its states from etcd. The model parameters are periodically saved into distributed file system, so a restarted parameter server can recover its parameters from the saved file.
-
-Now we will introduce how each process recovers from a failure, the graph below shows how etcd is used:
-
-
-
-### Master Server Process
-
-When the master is started by the Kubernetes, it executes the following steps at startup:
-
-1. Grabs a unique *master* lock in etcd, which prevents concurrent master instantiations.
-1. Recovers the task queues from etcd if they already exist, otherwise, the master will create them.
-1. Write its ip address to */master/addr* so that trainers can discover it.
-1. Listens to trainers' request of task, dispatch one upon request, and updates task queue using an etcd transaction to ensure lock is held during the update.
-
-When the master server process is dead for any reason, Kubernetes will restart it. It will be online again with all states recovered from etcd in few minutes.
-
-### Trainer Process
-
-When the trainer is started by the Kubernetes, it executes the following steps at startup:
-
-1. Watches the available parameter server prefix keys `/ps/` on etcd and waits until the count of parameter servers reaches the desired count */ps_desired*.
-1. Finds and watches */master/addr* to get master's address.
-1. Requests for tasks from the master to start training.
-
-When a trainer fails, Kuberentes would try to restart it. The recovered trainer would fetch tasks from master and go on training.
-
-### Parameter Server Process
-
-When the parameter server is started by Kubernetes, it executes the following steps at startup:
-
-1. Read desired total number of parameter servers from etcd `/ps_desired`
-1. Search through etcd keys `/ps/
-
- The third parameter server joined:
-
-
-
-1. The parameter server can load parameters if there are already saved parameters in the save path (inferred from its index).
-1. Now the parameter server is ready for the trainers' requests.
-
-If the parameter server's etcd lease expires, the parameter server will kill itself.
-
-
-## Parameter Server Checkpointing
-See [here](./checkpointing.md)
-
-## Store and dispatching trainning data
-See [here](./data_dispatch.md)
-
-
-## Dynamic Scaling
-
-### Trainer Scaling
-
-TODO
-
-### Parameter Server Scaling
-
-Not planned for v1.
-
-## Training Dataset Format
-
-TODO
-
-## User Interface
-
-TODO
diff --git a/develop/doc/_sources/design/cluster_train/checkpointing.md.txt b/develop/doc/_sources/design/cluster_train/checkpointing.md.txt
deleted file mode 100644
index c87ef2c7d2636208866d05456d5d44316d0bb200..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/cluster_train/checkpointing.md.txt
+++ /dev/null
@@ -1,44 +0,0 @@
-## 模型参数检查点(Checkpointing)
-模型数据检查点的实现,可以有效的避免parameter server的单点或多点同时故障。模型参数检查点通过定期向磁盘上保存一份存储在parameter server内存中的模型数据的完整镜像,来保证训练过程可以从中间状态重新启动。在一个不可中断并缺少备份的训练任务中,可以通过阶段性的保存每个parameter server的数据快照(snapshot)到 ***分布式存储服务*** 达到容灾的目的,比如每隔10分钟最新的快照,并删除更早的快照。在出现单点故障时,只需要恢复这台节点,或者将这台节点迁移到另一个节点并启动即可恢复训练任务。
-
-
-
-### 快照保存的设计如下:
-
-说明:
-
-* parameter server在集群中启动后,自动挂载分布式存储目录,并把快照保存到这个目录下。
-* ***注:每个parameter server的检查点各自独立保存,暂时不考虑多个parameter server同步的保存一个特定时间点的全局检查点,因为这样做也没法保证消除随机性。***
-
-检查点保存程序流程:
-
-1. 如果满足条件"每隔10分钟"时,parameter server会获取parameters内存的`read_lock`,启动一个新的线程开始保存检查点。如果已经正在执行保存检查点的线程,则忽略。由于对parameters的更新需要获取parameters内存的`write_lock`,所以在写入快照的过程中,parameter server会暂停参数更新并等待。
-2. parameter server生成一个UUID,向指定的目录中一个新的文件(文件名为此UUID)写入快照数据。在快照写入完成后,计算这个文件的MD5 sum。然后在etcd的`/checkpoints/[pserver_id]`中写入json内容:`{"uuid": [UUID], "md5", "MD5 sum", "timestamp": xxxx}`。
-3. 删除磁盘目录中不是当前uuid的快照文件。
-4. 释放对paramters内存的锁定,停止保存检查点的线程。
-
-这里需要用户额外注意,在您的实际环境中,训练任务的运行可能会占满trainer和parameter server之间的网络带宽,如果parameter server此时还需要通过网络访问分布式存储以保存快照,可能会造成网络拥塞,而出现阶段性的运行停滞。
-
-### 从快照恢复
-
-在parameter server第一次启动或任意时间parameter server故障后被Kubernetes重新启动,则需要回滚到上一个检查点:
-
- 1. 从etcd中读取节点:`/checkpoints/[pserver_id]`获取最新的检查点的文件uuid
- 1. 从磁盘文件中加载uuid文件名的检查点快照文件,并加载其中的参数
- 1. 如果上面两步出现错误,则使用启动参数定义的初始化方法初始化参数
- 1. 开始提供服务
-
-## TODO List
-### 推测执行/加速执行(TODO)
-在异构集群中,如果存在某些trainer执行速度过慢会影响整体集群的速度(如图中Trainer 1),此时master将负责启动一个新的Trainer(Accelerate Trainer 2),使用同样的训练数据block。哪个trainer先完成block的训练,则把另一个慢速的kill掉。
-
-### 动态扩容/缩容
-目前只考虑动态扩容trainer数量,可以减小系统复杂性。
-
-## 术语
-* model: 指深度学习训练之后得到的所有参数,使用这个神经网络可以完成对新数据的预测
-* parameters: 神经网络中的参数,包括权重w和偏置b。一个神经网络的模型由大量的参数组成
-* shard: 分片,通常指将一个整体拆分成多份的其中的一份。
-* model shard: 将一个神经网络参数拆分成多份,每个shard分别存储在其中一台parameter server之上
-* parameter block: 多个parameter block构成一个model shard
-* 单点故障: 任意时刻只可能同时有一台服务器故障。由于集群中同时存在两台机器故障的概率极低((平均故障率*平均故障修复时间)^2)只对特殊在线系统考虑两台以上同时故障的容灾。
diff --git a/develop/doc/_sources/design/cluster_train/data_dispatch.md.txt b/develop/doc/_sources/design/cluster_train/data_dispatch.md.txt
deleted file mode 100644
index 1f5d22ff5e6abcb576d16cbe7391da1967a1ab8e..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/cluster_train/data_dispatch.md.txt
+++ /dev/null
@@ -1,160 +0,0 @@
-## 训练数据的存储和分发
-
-### 概念解释
-
-### 流程介绍
-生产环境中的训练数据集通常体积很大,并被存储在诸如Hadoop HDFS,Ceph,AWS S3之类的分布式存储之上。这些分布式存储服务通常会把数据切割成多个分片分布式的存储在多个节点之上。这样就可以在云端执行多种数据类计算任务,包括:
-
-* 数据预处理任务
-* Paddle训练任务
-* 在线模型预测服务
-
-
-
-
-A dataset is a list of files in *RecordIO* format. A RecordIO file consists of chunks, whereas each chunk consists some records.
-
-## Task Queue
-
-As mentioned in [distributed training design doc](./README.md), a *task* is a data shard that the master server assigns to the trainer process to train on. A task consists of one or multiple *chunks* from one or multiple files. The master server maintains *task queues* to track the training progress.
-
-### Task Queue Creation
-
-1. Each trainer will make an RPC call (using Go's [rpc](https://golang.org/pkg/net/rpc/) package) to the master server, telling it the RecordIO files representing the dataset specified by the user. Since every trainer will tell the master server the same dataset, only the first RPC call will be honored.
-
- The RPC interface is:
- ```go
- func (m *RPCServer) ReportDataset(Paths []string, dummy *int) error {
- }
- ```
-1. The master server will scan through each RecordIO file to generate the *chunk index* and know how many chunks does each file have. A chunk can be referenced by the file path and the index of the chunk within the file. The chunk index is in memory data structure that enables fast access to each chunk, and the index of the chunk with the file is an integer start from 0, representing the n-th chunk within the file.
-
- The definition of the chunk is:
- ```go
- type Chunk struct {
- Idx int // index of the chunk within the file
- Path string
- Index recordio.Index // chunk index
- }
- ```
-1. Chunks are grouped into tasks, and tasks are filled into the todo queue. The pending queue and the done queue are initialized with no element.
-
- The definition of the task is:
- ```go
- type Task struct {
- Index int
- Chunks []Chunk
- }
- ```
-
- The elements in the tasks queues is of type `TaskEntry`, containing a timeout counter (described in [task retry logic](#task-retry-logic)), and a task:
- ```go
- type TaskEntry struct {
- NumTimeout int
- Task Task
- }
- ```
-
- The definition of task queues is:
- ```go
- type TaskQueues struct {
- Todo []TaskEntry
- Pending map[int]TaskEntry // map from task index to task entry
- Done []TaskEntry
- }
- ```
-
-### Task Queue Persistence
-
-The task queues need to be persisted on [etcd](https://github.com/coreos/etcd) for fault recovery. Since the task queues only change once a task is completed or timed out, which is not very frequent, we can afford to synchronize with etcd every time the task queues change.
-
-We will serialize the task queues data structure with [gob encoding](https://golang.org/pkg/encoding/gob/), compress with gzip, and save into etcd synchronously under key `/task_queues`.
-
-### Task Dispatch
-
-The trainer will make an RPC call to master to get a new task when:
-
-- the trainer first started, or
-- the trainer finishes a task.
-
-The RPC interface is:
-```go
-func (m *RPCServer) GetTask(finished *Task, result *Task) error {
-}
-```
-Argument `finished` will be `nil` when the trainer is just started.
-
-During the RPC call the master will do the following:
-
-- Make a copy of the task queues, and update the copy reflecting the finished tasks and the new pending tasks.
-- Synchronize the copy of task queues with etcd using a transaction conditioned on holding the master lock.
-- Replace the task queues with the copy and report to the trainer with the new tasks if succeeded, or discard the copy and report the error to the trainer if failed.
-
-### Task Retry Logic
-
-When a task is dispatched to the trainer, the master will schedule a function for execution after the timeout duration (based on the moving average of task completion time). If the task entry in still in the pending queue, its timeout counter will increase by one, and the task will be moved to todo queue. If the timeout counter is above the threshold, the master will log the error and discard the task.
-
-Please note that since a timed out task could be completed after it has been dispatched for retry, so it is possible for a task to be processed multiple times. We do not try to prevent it from happening since it's fine to train on the same task multiple times due to the stochastic nature of the stochastic gradient decent algorithm.
diff --git a/develop/doc/_sources/design/cluster_train/pserver_client.md.txt b/develop/doc/_sources/design/cluster_train/pserver_client.md.txt
deleted file mode 100644
index 474b8c572cd92fc87e9f7f3f2b19d12cccd158de..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/cluster_train/pserver_client.md.txt
+++ /dev/null
@@ -1,171 +0,0 @@
-# Design Doc: The Client Library of Parameter Server
-
-For an overview of trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the parameter server's client library, which will manage communication with parameter servers. The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file.
-
-## Parameter Partition
-
-Each parameter will be partitioned into parameter blocks to make the parameters evenly distributed on parameter servers. The partition is done automatically by the client library. The *sparse parameter* require a little different treatment:
-
-### Sparse Parameter
-
-The sparse parameter is a parameter that is updated sparsely. The name is somewhat misleading, it does not have a sparse representation, it has the same representation as a dense vector.
-
-Because a sparse parameter is updated sparsely, the trainer will have to partition the sparse parameter. Because the parameter server will merge all sparse parameter shard into the same file when saving the parameter. It needs special naming convention:
-
-If a sparse parameter is partitioned into n shards, they should be named as:
-
-```text
-name:sparse-0
-name:sparse-1
-...
-name:sparse-n-1
-```
-
-The library is unaware of the partition, and treat each parameter independently. Only when saving parameters, the parameter servers will merge the sparse parameters according to the naming convention.
-
-## Model Optimization Using Gradients
-
-There are two ways to perform model optimization using gradients:
-
-- On Client
-
- The client does multiple steps of forward and backward update. In each step, the gradients are calculated and a new model is generated. After some steps, the client will calculate the difference between the newest model and the old model at step 0. The difference will be updated to parameter servers. Parameter servers will just update parameters using the difference without any optimization using gradients (such as Adam and L1 regularization).
-
-- On Parameter Server
-
- The client will send accumulated gradients to parameter servers, the parameter server will do the optimization using gradients.
-
-## L1 and L2 Regularization
-
-PaddlePaddle allows L1 or L2 regularizations to be specified per parameter, so when the trainer initializes the parameter it needs include a parameter configuration when L1 or L2 regularization is necessary.
-
-## Parameter Initialization
-
-The parameters on parameter servers need to be initialized. To provide maximum flexibility, the trainer will initialize the parameters. Only one trainer will do the initialization, the other trainers will wait for the completion of initialization and get the parameters from the parameter servers.
-
-### Trainer Selection
-
-To select the trainer for initialization, every trainer will try to get a distributed lock, whoever owns the lock will do the initialization. As illustrated below:
-
-
-
-### Trainer Selection Process
-
-The trainer select process is encapsulated in the C API function:
-```c
-int paddle_begin_init_params(paddle_pserver_client* client, const char* config_proto);
-```
-The selected trainer's call to `paddle_begin_init_params` will return with 1, and the other trainers' call to `paddle_begin_init_params` will return 0. `paddle_get_params` will be blocked until initialization is completed. As illustrated below:
-
-
-
-## C Interface
-
-```c
-typedef enum {
- PADDLE_ELEMENT_TYPE_INT32 = 0,
- PADDLE_ELEMENT_TYPE_UINT32 = 1,
- PADDLE_ELEMENT_TYPE_INT64 = 2,
- PADDLE_ELEMENT_TYPE_UINT64 = 3,
- PADDLE_ELEMENT_TYPE_FLOAT32 = 4,
- PADDLE_ELEMENT_TYPE_FLOAT64 = 5,
-} paddle_element_type;
-
-typedef struct {
- char* name;
- paddle_element_type element_type;
- unsigned char* content;
- int content_len;
-} paddle_parameter, paddle_gradient;
-
-typedef int paddle_pserver_client;
-
-/**
- * @brief creates a pserver client that talks to etcd for coordination.
- */
-paddle_pserver_client paddle_new_etcd_pserver_client(char* etcd_addr);
-
-/**
- * @brief creates a pserver client given pserver addresses.
- *
- * @param pserver_addrs comma-separated pserver addresses.
- * @param selected if current pserver client is selected to initialize all parameter servers.
- */
-paddle_pserver_client paddle_new_pserver_client(char* pserver_addrs, int selected);
-void paddle_pserver_client_release(paddle_pserver_client c);
-
-/**
- * @brief paddle_begin_init_params begins to initialize parameters on
- * parameter servers.
- *
- * paddle_begin_init_params will be called from multiple trainers,
- * only one trainer will be selected to initialize the parameters on
- * parameter servers. Other trainers need to get the initialized
- * parameters from parameter servers using @paddle_get_params.
- *
- * @return 1 if the trainer is selected to initialize parameter
- * servers, otherwise 0.
- */
-int paddle_begin_init_params(paddle_pserver_client client);
-
-/**
- * @brief paddle_init_param initializes the parameter on parameter
- * servers.
- *
- * @param param the parameter to initialize.
- * @param param_config_proto the configuration for the parameter.
- * @param config_len the length of param_config_proto
- * @return 0 if successful, otherwise -1. On failure, the trainer
- * needs to restart the entire initialization process (starting from
- * @paddle_begin_init_param). Or simply exit the program and wait for
- * the cluster management system to restart the trainer.
- */
-int paddle_init_param(paddle_pserver_client client, paddle_parameter param, const unsigned char* param_config_proto, int config_len);
-
-/**
- * @brief paddle_finish_init_params tells parameter servers client has
- * sent all parameters to parameter servers as initialization.
- *
- * @return 0 if successful, otherwise -1. On failure, the trainer
- * needs to restart the entire initialization process (starting from
- * @paddle_begin_init_param). Or simply exit the program and wait for
- * the cluster management system to restart the trainer.
- */
-int paddle_finish_init_params(paddle_pserver_client client);
-
-/**
- * @brief paddle_send_grads sends gradients to parameter servers for
- * updating parameters.
- *
- * @param grads the array of gradients to send.
- * @param len the length of the gradient array.
- * @param learning_rate the learning rate for the gradients.
- * @return 0 if successful, otherwise -1.
- */
-int paddle_send_grads(paddle_pserver_client client, const paddle_gradient* grads, int len);
-
-/**
- * @brief paddle_get_params gets parameters from parameter servers.
- *
- * paddle_get_params will block until parameters are initialized on
- * the parameter servers.
- *
- * @param dst the destination array of parameter pointers to save to.
- * The parameter pointer must be pre-popullated with required parameter name,
- * and the content of parameter must be pre-allocated of the size of required
- * parameter on pserver.
- * @param len the length of the names array and the paddle_parameter
- * array.
- * @return 0 if successful, otherwise -1.
- */
-int paddle_get_params(paddle_pserver_client client, paddle_parameter** dst, int len);
-
-/**
- * @brief paddle_save_model indicates parameters to save the parameter
- * to the given path
- *
- * @param path the path to save parameters.
- * @return 0 if successful, otherwise -1.
- */
-int paddle_save_model(paddle_pserver_client client, const char* path);
-```
diff --git a/develop/doc/_sources/design/cluster_train/remote_parameter_updater.md.txt b/develop/doc/_sources/design/cluster_train/remote_parameter_updater.md.txt
deleted file mode 100644
index 6e8e5938455b869e0f3367794c41250340b37f77..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/cluster_train/remote_parameter_updater.md.txt
+++ /dev/null
@@ -1,21 +0,0 @@
-# Design Doc: Remote Parameter Updater for Cluster Train
-
-For an overview of distribute training, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the parameter updater that will use parameter server cclient [The Client Library of Parameter Server Design Doc](pserver_client.md) to manage and update parameters.
-
-## Parameter Updater
-
-Parameter Updater is used by trainer to manage and update parameter, there are mainly two kind of parameter updater: local and remote, since this design is for cluster train, we will only discuss remote parameter updater here.
-
-### Remote Parameter Updater
-
-Remote Parameter Updater manage parameters through remote parameter server with the client that communicate with pserver([The Client Library of Parameter Server Design Doc](pserver_client.md))
-
-In PaddlePaddle Python V2 API, trainer is implemented in python, and the trainer will hold a instance of parameter updater and call it's functions directly. In this design, we will also expose the api of RemoteParameterUpdater to python with swig.
-
-#### Sparse Remote Parameter Updater
-
-Since we will only implement dense parameter management new, the mechanism for sparse parameter will be discussed in next stage.
-
-### Interface Design
-
-TBD
diff --git a/develop/doc/_sources/design/cluster_train/save_model.md.txt b/develop/doc/_sources/design/cluster_train/save_model.md.txt
deleted file mode 100644
index b755185c81ad617b9c85c47de0f5f65d2201c658..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/cluster_train/save_model.md.txt
+++ /dev/null
@@ -1,111 +0,0 @@
-# Design Doc: Save Model
-
-## Overview
-
-The model is the output of the training process. There are two
-ways from which user can obtain a model:
-
-- Save model triggered by user code: user code asks PaddlePaddle to
- save a model.
-- Convert model from the checkpoint: model being converted from
- pservers' periodic checkpoint. In this way, the user can cancel a
- job at any time, and still have a relatively fresh model (we
- checkpoint around every 5 minutes).
-
-### Trainer Saving Model vs. Pservers Saving Model
-
-Both trainers and pservers have access to the model. So the model can
-be saved from a trainer or pservers. We need to decide where the model
-is saved from.
-
-#### Dense Update vs. Sparse Update
-
-There are two types of model update methods: dense update and sparse
-update (when the model parameter is configured to be sparse).
-
-- Dense update
-
- Every trainer has it's own full copy of the model. Every model
- update will update the entire model.
-
-- Sparse update
-
- The training input is sparse, and the trainer does not have the
- entire model. It will only download the sub-model necessary related
- to the input. When updating the model, only the sub-model related to
- the training input is updated.
-
-
-#### Pservers Saving Model
-
-The benefit of letting pservers save model is they have the entire
-model all the time. However, since pservers are on different nodes, it
-requires a merging process to merge model shards into the same
-model. Thus requires the pservers to write models to a distributed
-filesystem, making the checkpoint shards visible to the merge program.
-
-#### Trainer Saving Model
-
-The benefit of letting one trainer to save the model is it does not
-require a distributed filesystem. And it's reusing the same save model
-logic when training locally - except when doing sparse update, the
-trainer needs to download the entire model during the saving process.
-
-#### Conclusion
-
-Given trainer saving model does not require a distributed filesystem,
-and is an intuitive extension to trainer saving model when training
-locally, we decide to let the trainer save the model when doing
-distributed training.
-
-
-### Convert Model from Checkpoint
-
-TODO
-
-
-## Timeline
-
-We first implement trainer save the model. Converting the latest
-snapshot to a model will be a TODO for future.
-
-
-## Trainer Save Model
-
-### Trainer Election
-
-One trainer will be elected as the one to save the model. When using
-etcd, trainer ID is a randomly generated UUID, the trainer will
-contact the master server requesting to save the model, and find out
-if itself is elected. When the master server is not used, unique
-trainer IDs will be given by the administrator, the trainer whose ID
-is "0" is elected to save the model.
-
-### Model Save Path
-
-Each trainer will be given the directory to save the model. The
-elected trainer will save the model to
-`given-directory/trainerID`. Since the trainer ID is unique, this
-would prevent concurrent save to the same file when multiple trainers
-are elected to save the model when split-brain problem happens.
-
-### What Happens When Model Is Saving
-
-It takes some time to save model, we need to define what will happen
-when save model is taking place.
-
-When doing dense update, the trainer uses the local model. Pservers
-does not need to pause model update.
-
-When doing sparse update. The trainer needs to download the entire
-model while saving. To get the most accurate model, the model update
-needs to be paused before the download starts and resumed after the
-download finishes. Otherwise, the trainer gets a model that is
-"polluted": some part of the model is old, some part of the model is
-new.
-
-It's unclear that the "polluted" model will be inferior due to the
-stochastic nature of deep learning, and pausing the model update will
-add more complexity to the system. Since supporting sparse update is a
-TODO item. We defer the evaluation of pause the model update or not
-during saving model to the future.
diff --git a/develop/doc/_sources/design/cluster_train/submit-job.md.txt b/develop/doc/_sources/design/cluster_train/submit-job.md.txt
deleted file mode 100644
index 8377d5489dc64bd2fdc5bb4f7bc737e7b489000d..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/cluster_train/submit-job.md.txt
+++ /dev/null
@@ -1,127 +0,0 @@
-# Submit a Distributed Training Job
-
-The user can submit a distributed training job with Python code, rather than with a command-line interface.
-
-## Runtime Environment On Kubernetes
-
-For a distributed training job, there is two Docker image called *runtime Docker image* and *base Docker image*. The runtime Docker image is the Docker image that gets scheduled by Kubernetes to run during training. The base Docker image is for building the runtime Docker image.
-
-### Base Docker Image
-
-Usually, the base Docker image is PaddlePaddle product Docker image including paddle binary files and python package. And of course, users can specify any image name hosted on any docker registry which users have the access right.
-
-### Runtime Docker Image
-
-The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image.
-
-- Handle Python Dependencies
-
- You need to provide requirements.txt file in your `trainer-package` folder. Example:
-
- ```txt
- pillow
- protobuf==3.1.0
- ```
- More [details](https://pip.readthedocs.io/en/1.1/requirements.html) about requirements, an example project looks like:
- ```bash
- paddle_example
- |-quick_start
- |-trainer.py
- |-dataset.py
- |-requirements.txt
- ```
-
-## Submit Distributed Training Job With Python Code
-
-
-- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
-- `/v1/trainer/job` will start a building job for preparing the runtime Docker image. When the building job is finished, Job Server will submit the PaddlePaddle distributed job to Kubernetes.
-- *NOTE*: For the first version, we will not prepare the runtime Docker image, instead, the package is uploaded to Paddle Cloud, and Paddle Cloud will mount the package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version as well.
-
-You can call `paddle.job.dist_train` and provide distributed training configuration as the parameters:
-```python
-paddle.job.dist_train(
- trainer=dist_trainer(),
- paddle_job=PaddleJob(
- job_name = "paddle-cloud",
- entry_point = "python %s"%__file__,
- trainer_package = "/example/word2vec",
- image = "yancey1989/paddle-job",
- trainers = 10,
- pservers = 3,
- trainer_cpu = 1,
- trainer_gpu = 1,
- trainer_mem = "10G",
- pserver_cpu = 1,
- pserver_mem = "2G"
- ))
-```
-
-The parameter `trainer` of `paddle.job.dist_train` is a function and you can implement it as follows:
-```python
-def dist_trainer():
- def trainer_creator():
- trainer = paddle.v2.trainer.SGD(...)
- trainer.train(...)
- return trainer_creator
-```
-
-The pseudo code of `paddle.job.dist_train` is as follows:
-```python
-def dist_train(trainer, paddle_job):
- # if the code is running on cloud, set PADDLE_ON_CLOUD=YES
- if os.getenv("RUNNING_ON_CLOUD", "NO") == "NO":
- #submit the paddle job
- paddle_job.submit()
- else:
- #start the training
- trainer()
-```
-### PaddleJob Parameters
-parameter | type | explanation
- --- | --- | ---
-job_name | str | the unique name for the training job
-entry_point | str | entry point for startup trainer process
-trainer_package | str | trainer package file path which user have the access right
-image|str|the [base image](#base-docker-image) for building the [runtime image](#runtime-docker-image)
-pservers|int| Parameter Server process count
-trainers|int| Trainer process count
-pserver_cpu|int| CPU count for each Parameter Server process
-pserver_mem|str| memory allocated for each Parameter Server process, a plain integer using one of these suffixes: E, P, T, G, M, K
-trainer_cpu|int| CPU count for each Trainer process
-trainer_mem|str| memory allocated for each Trainer process, a plain integer using one of these suffixes: E, P, T, G, M, K
-trainer_gpu|int| GPU count for each Trainer process, if you only want CPU, do not set this parameter
-
-### Deploy Parameter Server, Trainer and Master Process
- - Deploy PaddlePaddle Parameter Server processes, it's a Kubernetes ReplicaSet.
- - Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job.
- - Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet.
-
-## Job Server
-
-- RESTful API
-
- Job server provides RESTful HTTP API for receiving the trainer package and displaying
- PaddlePaddle job related informations.
- - `POST /v1/package` receive the trainer package and save them on CephFS
- - `POST /v1/trainer/job` submit a trainer job
- - `GET /v1/jobs/` list all jobs
- - `GET /v1/jobs/
-
-PaddlePaddle can support model parallelism by converting the IR so that the user no longer needs to manually perform the computation and operations in the Python component:
-
-
-
-The IR for PaddlePaddle after refactoring is called a `Block`, it specifies the computation dependency graph and the variables used in the computation.
-
-### Limitation 3
-
-The user can not directly specify the parameter update rule for the parameter server in the Python module, since the parameter server does not use the same computation definition as the trainer. Instead, the update rule is baked inside the parameter server. The user can not specify the update rule explicitly.
-
-This could be fixed by making the parameter server also run an IR, which can be different to the trainer side
-For a detailed explanation, refer to this document -
-[Design Doc: Parameter Server](./parameter_server.md)
-
-## Distributed Training Architecture
-
-The revamped distributed training architecture can address the above discussed limitations. Below is the illustration of how it does so:
-
-
-
-The major components are: *Python API*, *Distribute Transpiler* and *Remote Executor*.
-
-### Python API
-
-Python API is the Python library that user's Python code invokes, to read the data, build the neural network topology, and start training, etc.
-
-```Python
-images = fluid.layers.data(name='pixel', shape=[1, 28, 28], dtype='float32')
-label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-...
-predict = fluid.layers.fc(input=conv_pool_2, size=10, act="softmax")
-cost = fluid.layers.cross_entropy(input=predict, label=label)
-avg_cost = fluid.layers.mean(x=cost)
-optimizer = fluid.optimizer.Adam(learning_rate=0.01)
-optimizer.minimize(avg_cost)
-
-train_reader = paddle.batch(
- paddle.reader.shuffle(
- paddle.dataset.mnist.train(), buf_size=500),
- batch_size=BATCH_SIZE)
-
-place = fluid.CPUPlace()
-exe = fluid.Executor(place)
-
-for pass_id in range(10):
- for data in train_reader():
- loss, acc = exe.run(trainer_prog,
- feed=feeder.feed(data),
- fetch_list=[avg_cost])
-```
-
-The code above is a typical local training program, the "Training Program" is built using helper functions such as
-`fluid.layer.fc`. The training is done by calling `Executor.run`
-iteratively.
-
-For more details, the implementation of IR is [Program](../program.md), and `ProgramDesc` is the protobuf type.
-
-[Executor](../executor.md) simply runs the `ProgramDesc`. For local training you generally use
-`Executor` to run the program locally. For any kind of distributed training, you can use
-`RemoteExecutor` to specify desired distributed training method with some optional arguments.
-
-### Distributed Transpiler
-
-The Distributed Transpiler automatically converts the IR (in protobuf format) to partitioned IRs. Then
-the Remote Executor dispatches the new IRs to Remote Executors across the cluster.
-Below are the steps that are followed :
-
-1. User only need to change `Executor` to `RemoteExecutor` to change local program to distributed program.
-1. `RemoteExecutor` calls `Distributed Transpiler` to "transpile" user's program to several IRs representing a
- distributed training program:
- 1. Parse configurations from `RemoteExecutor`.
- 1. Determine the type of distributed program, can be DataParallelism, ModelParallelism or Streaming.
- 1. Partition the `ProgramDesc` according to type and add `send` / `recv` OP pair on the boundaries. Take
- DataParallelism type for example, it removes the optimization operators and add a `send` OP to the
- "trainer" role, then add the optimization operators to the parameter server role within the `recv` OP.
-1. Dispatch the partitioned graph to different `RemoteExecutor` in the cluster.
-1. `RemoteExecutor` on each node run the received `ProgramDesc` utill the end.
-
-
-### RemoteExecutor
-
-As shown in the graph, `RemoteExecutor.run` sends the IR to the cluster for Execution.
-You can also use parameter `fetch_list` to interactively fetch variable back to local for
-log printing.
-
-The Python `RemoteExecutor` is derived from `Executor` class.
-
-```python
-exe = RemoteExecutor(
- feed=feeder.feed(data),
- fetch_list=[avg_cost],
- job_desc=JobDesc(
- jobname,
- num_trainer,
- num_pserver,
- cpu_per_trainer,
- gpu_per_trainer,
- mem_per_trainer,
- cpu_per_pserver,
- mem_per_pserver
- ))
-for data in train_reader():
- loss, acc = exe.run(trainer_prog,
- feed=feeder.feed(data),
- fetch_list=[avg_cost])
-```
-
-`JobDesc` object describe the distributed job resource specification to run on
-Cluster environment.
-
-
-
-`RemoteExecutor.run` sends the `ProgramDesc` and
-[TrainingJob](https://github.com/PaddlePaddle/cloud/blob/develop/doc/autoscale/README.md#training-job-resource)
-to a server in the cluster which executes `RemoteExecutor.listen`. This server is responsible
-to start the final Kubernetes Jobs to run the different role of `ProgramDesc` from `ConfigMap`.
-
-
-### Placement Algorithm
-
-Our first implementation will only support "trainer-parameter server" placement: the parameters, initializers, and optimizers are all placed on the PaddlePaddle runtimes with the parameter server role. Everything else will be placed on the PaddlePaddle runtimes with the trainer role. This has the same functionality as the "trainer-parameter server" architecture of PaddlePaddle v0.10.0, but is more generic and flexible.
-
-In the future, a more general placement algorithm should be implemented, which makes placements according to the input IR, and a model of device computation time and device communication time. Model parallelism requires the generic placement algorithm.
-
-
-### Local Training Architecture
-
-The local training architecture will be the same as the distributed training architecture, the difference is that everything runs locally, and there is just one PaddlePaddle runtime:
-
-
-
-
-### Training Data
-
-In PaddlePaddle v0.10.0, training data is typically read
-with [data reader](../reader/README.md) from Python. This approach is
-no longer efficient when training distributedly since the Python
-process no longer runs on the same node with the trainer processes,
-the Python reader will need to read from the distributed filesystem
-(assuming it has the access) and send to the trainers, doubling the
-network traffic.
-
-When doing distributed training, the user can still use Python data
-reader: the training data are sent with `Executor.run`. However, should
-be used for debugging purpose only. The users are encouraged to use
-the read data OPs.
-
-
-## References:
-
-[1] [TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf)
-
-[2] [TensorFlow: A System for Large-Scale Machine Learning](https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf)
diff --git a/develop/doc/_sources/design/dist_refactor/multi_cpu.md.txt b/develop/doc/_sources/design/dist_refactor/multi_cpu.md.txt
deleted file mode 100644
index a8d8ee0422acc84835170a44eb83f9b5f0c6bb40..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/dist_refactor/multi_cpu.md.txt
+++ /dev/null
@@ -1,43 +0,0 @@
-# Design Doc: Execute the Program with Multi CPU
-
-## Abstract
-
-This Design Doc propose an approach to make the user-defined Op graph
-running with multi-CPU, we will use an auto transpiler to convert the user-defined
-Op graph to a multi-CPU Op graph, and run `ParallelDo` Op to run the graph.
-
-## Transpiler
-
-
-
-After converted:
-
-
-
-## Implement
-
-- `Multi-CPU Transpiler` will convert the graph to a multi-CPU graph
- which would be executed with multi-threads.
-- `BlockingCounter` will `Init/Decrement` an atomic counter, and Blocking `Wait`
- for the atomic counter become `0`:
- ```cpp
- BlockingCounter bc(thread_count);
- for (int i = 0; i < thread_count; ++i) {
- thread_pool->Start([&bc] {bc.DecrementCount(); })
- }
- bc.Wait();
- ```
-- `ParallelDo` Operator
- - Initialize a thread pool which is a Singleton.
- - Use a block id as the input, and create run the specify Block on independent scope
- with multi-threads.
- - Initialize a `BlockingCounter` instance and wait until all threads are done.
-- `Split` Operator will split the Input Tensor into a TensorArray.
-- `Merge` merge all the gradients which calculated in different threads
- with `mean/sum/max/min...` method, and then run the Optimizer Op to optimize `W`.
-
-## TODO
-
-- Improve the optimizer stage with multi-threads, since we could
- assign the parameters to the different threads and execute
- optimizer with multi-threads.
diff --git a/develop/doc/_sources/design/dist_refactor/parameter_server.md.txt b/develop/doc/_sources/design/dist_refactor/parameter_server.md.txt
deleted file mode 100644
index 805dd13048d41b995d2a01cda52b2ea33e4bbe1d..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/dist_refactor/parameter_server.md.txt
+++ /dev/null
@@ -1,96 +0,0 @@
-# Design Doc: Parameter Server
-
-## Abstract
-
-We propose an approach to implement the parameter server. In this
-approach, there is no fundamental difference between the trainer and
-the parameter server: they both run subgraphs, but subgraphs of
-different purposes.
-
-## Background
-
-The previous implementations of the parameter server do not run a
-fluid sub-program. Parameter initialization, optimizer computation, network
-communication and checkpointing are implemented twice on both the
-trainer as well as the parameter server.
-
-It would be great if we can write code once and use them on both: the
-trainer and the parameter server, since this reduces code duplication and
-improves extensibility. Given that after the current refactoring, we are
-representing everything as a computation graph on the
-trainer. Representing everything as a computation graph on the parameter
-server becomes a natural extension.
-
-## Design
-
-### Distributed Transpiler
-
-The *Distributed Transpiler* converts the user-defined fluid program
-into sub-programs to be scheduled on different nodes with the following
-steps:
-
-1. OP placement: the OPs will be placed on different nodes according
- to a heuristic that minimizes the estimated total computation
- time. Currently we will use a simple heuristic that puts parameter
- variable on parameter server workers and everything else on trainer
- workers.
-1. Add communication OPs to enable the communication between nodes.
-
-We will need these OPs: *Send*, *Recv*, *Enqueue*, *Dequeue*.
-
-Below is an example of converting the user defined graph to the
-subgraphs for the trainer and the parameter server:
-
-
-
-After converting:
-
-
-
-1. The parameter variable W and its optimizer program are placed on the parameter server.
-1. Operators are added to the program.
- - *Send* sends data to the connected *Recv* operator. The
- scheduler on the receive node will only schedule *Recv* operator
- to run when the *Send* operator has ran (the *Send* OP will mark
- the *Recv* OP runnable automatically).
- - *Enqueue* enqueues the input variable, it can block until space
- become available in the queue.
- - *Dequeue* outputs configurable numbers of tensors from the
- queue. It will block until the queue has the required number of
- tensors.
-
-
-### Benefits
-
-- Model parallelism becomes easier to implement: it is an extension to
- the trainer - parameter server approach. We can have several "Transpilers"
- to achieve different goals.
-- User-defined optimizer is easier to add - user can now express it as
- a sub-program.
-- No more duplication logic inside the trainer and the parameter
- server mentioned in the background section.
-
-### Challenges
-
-- It is important to balance the parameter shards on multiple
- parameter servers. If a single parameter is very big (for example: some
- word-embedding, fully connected, softmax layer), we need to
- automatically partition the single parameter onto different
- parameter servers when possible (only element-wise optimizer depends
- on the parameter variable).
-- In the "Async SGD" figure, the "W" variable on the parameter server
- could be read and written concurrently. See
- [here](https://github.com/PaddlePaddle/Paddle/pull/6394) for more
- details about concurrent program in Fluid.
-
-### Discussion
-
-- Can the Enqueue OP be implemented under our current tensor design
- (put the input tensor into the queue tensor)?
-- *Dequeue* OP will have variable numbers of output (depending on the
- `min_count` attribute), does our current design support it? (similar
- question for the *Add* OP)
-
-
-### References:
-[1] [TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf)
diff --git a/develop/doc/_sources/design/error_clip.md.txt b/develop/doc/_sources/design/error_clip.md.txt
deleted file mode 100644
index 58aa73b8cd38d01e2426278a3479714e4fb6a3b0..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/error_clip.md.txt
+++ /dev/null
@@ -1,92 +0,0 @@
-# Error Clip
-
-## Overview
-
-Error clip is widely used in model training to prevent gradient exploding. It takes some specific rules to adjust variables' gradients and prevent them from being too large. With it, values of a gradient will be checked before they are taken by the next `grad_op` and be shrunk if necessary.
-## Usage
-
-Users are allowed to assign different error clip methods or attributes to different `Variable`s. Users can specify it as a parameter of `Variable`'s constructor:
-
-```python
-var = framework.Variable(..., error_clip=myErrorClip, ...)
-```
-
-The default value of `error_clip` is `None`, which means no error clip is employed. When it's not `None`, it should take an object of `BaseErrorClipAttr`'s derived class. So far, `BaseErrorClipAttr` has only one derived class: `ErrorClipByValue`, whose constructor is:
-
-```python
-ErrorClipByValue(max, min=None)
-```
-
-`max` and `min` represent the maximal and minimal clip threshold respectively. In backward pass, all values of `var`'s gradient greater than `max` or less than `min` will be clipped to `max` and `min` respectively. When the `min` is None, the minimal threshold will be assigned with `-max` automatically.
-
-So we can enable the error clip with threshold `[-5.0, 5.0]` for variable `var` by:
-
-```python
-var = framework.Variable(..., error_clip=ErrorClipByValue(max=5.0), ...)
-```
-
-## Implementation
-
-The `BaseErrorClipAttr` and its derived class `ErrorClipByValue` are defined in *clip.py*.
-
-```python
-class BaseErrorClipAttr(object):
- def append_clip_op(self, block, grad_name):
- raise NotImplementedError()
-
-
-class ErrorClipByValue(BaseErrorClipAttr):
- def __init__(self, max, min=None):
- max = float(max)
- if min is None:
- min = -max
- else:
- min = float(min)
- self.max = max
- self.min = min
-
- def append_clip_op(self, block, grad_name):
- clip_op_desc = block.desc.append_op()
- clip_op_desc.set_type("clip")
- clip_op_desc.set_input("X", [grad_name])
- clip_op_desc.set_output("Out", [grad_name])
- clip_op_desc.set_attr("min", self.min)
- clip_op_desc.set_attr("max", self.max)
-```
-
-`BaseErrorClipAttr` has one main member function: `append_clip_op(self, block, grad_name)`.
-
-This function is used to create a `clip_op` and append it to the end of the given `block`. Because different error clip algorithms require different `clip_op`s, the function is declared as virtual in the base class. All derived classes must implement their own versions of this function.
-
-These `clip_op`s should be inserted right after the `grad_op`s whose output gradients need to be clipped. In other words, some `clip_op`s are appended to the end of the target block every time a new `grad_op` is added:
-
-```python
-for op_desc in grad_op_descs:
-    new_op_desc = target_block.desc.append_op()
-    new_op_desc.copy_from(op_desc)
-    callback(block=target_block, context=grad_to_var)
-```
-
-Here we employ a callback function to do this job. In the `_append_backward_ops_` function, a callback is invoked each time a `grad_op` is added to the `target_block`. The logic for appending `clip_op`s can be implemented inside this callback function.
-
-The callback function for `clip_op` appending is defined in *clip.py*:
-
-```python
-def error_clip_callback(block, context):
-    # the context is a grad_to_var map
-    grad_to_var = context
-    op_desc = block.desc.op(block.desc.op_size() - 1)
-    for grad_n in filter(lambda n: grad_to_var.has_key(n),
-                         op_desc.output_arg_names()):
-        fwd_var = block.var_recursive(grad_to_var[grad_n])
-        error_clip = getattr(fwd_var, "error_clip", None)
-        if not (error_clip is None or isinstance(error_clip,
-                                                 BaseErrorClipAttr)):
-            raise TypeError(
-                "Variable's error_clip should be an instance of BaseErrorClipAttr or None."
-            )
-        if error_clip is not None:
-            error_clip.append_clip_op(block, grad_n)
-```
-
-This function takes a `block` and a `context` (which is actually a grad\_to\_var map) as inputs. It checks each output of the last `OpDesc` in the `block`. Notice that the last `OpDesc` of the `block` must be a `grad_op`, and its outputs must be some forward variables' gradients. If an output gradient's corresponding forward variable has an `error_clip` attribute, `error_clip_callback` will call that `error_clip`'s `append_clip_op` function to append the required `clip_op` to the `block`.
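-
-The following is a minimal, framework-agnostic sketch of how such a callback can hook into the loop that copies `grad_op`s (a toy model for illustration only, not PaddlePaddle's actual backward implementation):
-
-```python
-class ToyBlock(object):
-    # A toy stand-in for a framework block: it only records op descriptions.
-    def __init__(self):
-        self.ops = []
-
-    def append_op(self, **op_desc):
-        self.ops.append(op_desc)
-
-def append_backward_ops(target_block, grad_op_descs, grad_to_var, callback):
-    # Copy each grad_op into the target block, then let the callback react,
-    # e.g. by appending a clip op right after the grad_op just added.
-    for op_desc in grad_op_descs:
-        target_block.append_op(**op_desc)
-        callback(block=target_block, context=grad_to_var)
-
-def toy_clip_callback(block, context):
-    last_op = block.ops[-1]
-    for grad_name in last_op.get("outputs", []):
-        if grad_name in context:  # this gradient's forward variable wants clipping
-            block.append_op(type="clip", inputs=[grad_name], outputs=[grad_name])
-
-block = ToyBlock()
-append_backward_ops(
-    block,
-    grad_op_descs=[{"type": "mul_grad", "outputs": ["W@GRAD"]}],
-    grad_to_var={"W@GRAD": "W"},
-    callback=toy_clip_callback)
-print([op["type"] for op in block.ops])  # ['mul_grad', 'clip']
-```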
diff --git a/develop/doc/_sources/design/evaluator.md.txt b/develop/doc/_sources/design/evaluator.md.txt
deleted file mode 100644
index 11cc129d56905a9ee666da92fbe6f8559c6d325a..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/evaluator.md.txt
+++ /dev/null
@@ -1,58 +0,0 @@
-## Evaluator Design
-
-### Problem Statement
-
-During training or inference, we provide an evaluation function to measure the model's performance, for example, accuracy, precision, etc. In the operator-based framework design, the data passes through the network pipeline batch by batch. As a result, inside an operator we only calculate the metrics for one mini-batch. Thus, we need to provide a mechanism to calculate the metrics over every N passes/batches that the user specifies.
-
-### Evaluator Design
-Currently, every operation is expressed in the graph. We divide the evaluator process into three steps.
-
-1. Initialize the metric state and add it into the block.
-
-2. Calculate the concerned metrics for every mini-batch. A single evaluator operator is only responsible for calculating the necessary statistics for one mini-batch. For example, each run of the accuracy operator only calculates the accuracy for one mini-batch of data.
-
-
-3. Merge the mini-batch statistics to form the evaluation result over multiple mini-batches (see the sketch below). For distributed or multi-GPU training, also aggregate the values from the different devices.
-
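-As a concrete toy example of step 3, merging accuracy statistics means accumulating the raw counts rather than averaging per-batch accuracies, since mini-batches may have different sizes:
-
-```python
-# Toy illustration of merging mini-batch statistics for accuracy.
-batches = [(8, 10), (3, 4)]           # (correct, total) per mini-batch
-correct = sum(c for c, _ in batches)  # 11
-total = sum(t for _, t in batches)    # 14
-print(correct / float(total))         # 0.7857..., not (0.8 + 0.75) / 2
-```
-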
-### Implementation
-This design is illustrated with the Python API below.
-Each metric operator needs to calculate the metric statistics and return the per-batch states. The Python side is responsible for accumulating the states over each pass.
-
-
-```python
-class Evaluator(object):
-    """
-    Evaluator base class.
-    """
-    def __init__(self, name, **kwargs):
-        """
-        Different evaluators may have different metric states. E.g., Accuracy needs two variables, the total and correct sample counts,
-        while AUC needs four variables: `true_positives`,
-        `true_negatives`, `false_positives` and `false_negatives`. So every evaluator should create the variables it needs and append them to the main_program.
-
-        The initialization of an Evaluator is responsible for creating
-        the metric states and appending them to the main_program.
-        """
-        pass
-
-    def _update_ops(self, input, label, **kwargs):
-        """
-        Add the operators that calculate the mini-batch metrics to the main_program.
-        Add increment operators to accumulate the metric states.
-        """
-
-    def reset(self, executor, reset_program=None):
-        """
-        Reset the metric states at the beginning of each pass or user-specified number of batches.
-        Execute the reset_program to reset the states.
-        """
-
-    def eval(self, executor, eval_program=None):
-        """
-        Merge the mini-batch statistics to form the evaluation result for multiple mini-batches.
-        Execute the eval_program and return the result.
-        """
-        return eval_result
-```
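-
-To make the three steps tangible, here is a self-contained NumPy sketch of an accuracy evaluator that follows the same spirit as the interface above. It keeps its states in plain Python attributes instead of program variables, so it is only a conceptual illustration, not PaddlePaddle code:
-
-```python
-import numpy as np
-
-class ToyAccuracyEvaluator(object):
-    def __init__(self):
-        # Step 1: initialize the metric states.
-        self.correct = 0
-        self.total = 0
-
-    def update(self, predictions, labels):
-        # Step 2: accumulate statistics for one mini-batch.
-        pred_labels = np.argmax(predictions, axis=1)
-        self.correct += int(np.sum(pred_labels == labels))
-        self.total += labels.shape[0]
-
-    def eval(self):
-        # Step 3: merge the accumulated statistics into the final metric.
-        return float(self.correct) / self.total
-
-    def reset(self):
-        self.correct = 0
-        self.total = 0
-
-evaluator = ToyAccuracyEvaluator()
-evaluator.update(np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([0, 0]))
-print(evaluator.eval())  # 0.5
-```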
diff --git a/develop/doc/_sources/design/executor.md.txt b/develop/doc/_sources/design/executor.md.txt
deleted file mode 100644
index 2d4b371cc56db82ce5747da6db07f05aa7f7e6c1..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/executor.md.txt
+++ /dev/null
@@ -1,29 +0,0 @@
-# Executor Design Doc
-
-## Motivation
-In [fluid](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/fluid.md), we encourage the user to use deep learning programming paradigms to describe the training process. When the user-written Python program is executed, it will first create a protobuf message
-[`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/a91efdde6910ce92a78e3aa7157412c4c88d9ee8/paddle/framework/framework.proto#L145) that describes the process and is conceptually like an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
-
-The executor runs the `ProgramDesc` like an interpreter. The `ProgramDesc` contains the intrinsics (operators in this case) and the variables that will be used; the executor explicitly executes this stored, precompiled code.
-
-## Overview
-
-An executor takes a `ProgramDesc`, a `block_id` and a `Scope`. The `ProgramDesc` is a list of blocks and each block contains the protobuf definition of all the parameters and operators in that block. The `block_id` specifies the entrance block, and the `Scope` is the container of all the variable instances, which persists across different runs.
-
-## Executor
-
-The `Executor` explicitly executes all the intrinsics (operators here) in the `block_id`th block of a `ProgramDesc`. Essentially, it instantiates Variables and Operators, then runs all the operators in sequence, one by one.
-This is very similar to pushing a stack frame when entering a block: the executor cleans up all the temporary variables when a mini-batch is finished. It does not, however, have the stack-frame pop process.
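-
-Below is a minimal Python sketch of this interpreter-style execution, using toy stand-ins for the program, scope and operators (a conceptual illustration only, not the actual C++ classes):
-
-```python
-class ToyScope(object):
-    """A toy scope: a plain name -> value map."""
-    def __init__(self):
-        self.vars = {}
-
-class ToyExecutor(object):
-    """Runs one block of a 'program' like an interpreter, one op at a time."""
-    def run(self, program, scope, block_id):
-        block = program[block_id]
-        for name in block["vars"]:          # instantiate the block's variables
-            scope.vars.setdefault(name, None)
-        for op in block["ops"]:             # execute operators in order
-            op(scope)
-
-# A toy "operator": reads its inputs from the scope and writes its output.
-def add_op(scope):
-    scope.vars["z"] = scope.vars["x"] + scope.vars["y"]
-
-scope = ToyScope()
-scope.vars.update({"x": 1, "y": 2})
-ToyExecutor().run(program=[{"vars": ["z"], "ops": [add_op]}], scope=scope, block_id=0)
-print(scope.vars["z"])  # 3
-```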
-
-### The interface
-```c++
- Executor(places);
-```
-An executor does not own any computing resources; a user can only construct an executor with the specified places.
-
-### Running an Executor
-
-```c++
- void Run(ProgramDesc, Scope, block_id, create_local_scope);
-```
-An `Executor` only provides a unified way to execute a `ProgramDesc`. The `ProgramDesc` is the target to be executed, the `Scope` specifies the variable container, the `block_id` indicates the entrance block, and `create_local_scope` is a boolean indicating whether to create a temporary local scope whose variables are destroyed after the execution finishes.
diff --git a/develop/doc/_sources/design/file_manager/README.md.txt b/develop/doc/_sources/design/file_manager/README.md.txt
deleted file mode 100644
index 3df10d801e568834729f902aace483d033340e2d..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/file_manager/README.md.txt
+++ /dev/null
@@ -1,87 +0,0 @@
-# FileManager Design Document
-## Goal
-This document describes the design of a system named FileManager, which makes it easy for users to upload their own training data for distributed training.
-
-The main features include:
-
-- Common command-line commands for managing files and directories
-- Resumable upload and download of large files
-
-## Terminology
-- PFS: short for `Paddlepaddle cloud File System`, an abstraction of the user's file storage space, as opposed to the local filesystem. We currently build it on top of CephFS.
-- [CephFS](http://docs.ceph.com/docs/master/cephfs/): a POSIX-compatible file system.
-- Chunk: the unit into which a file is logically split.
-
-## Modules
-### Architecture Diagram
-
-After compiling, the graph is as shown below:
-
-
-
-Operators are added to the sub-graphs. Every GPU is assigned a role such as `rank0`, `rank1`, etc.
-
-- **Broadcast**. The Broadcast operator distributes the initialized parameters to all the GPUs from the GPU that owns them, e.g. from the `rank0` GPU.
-- **AllReduce**. The AllReduce operator synchronizes parameters/gradients between GPUs. AllReduce is implemented with a ring-based communication method, which avoids the bottleneck of a single GPU.
-
-Note that the AllReduce operator forces the GPUs to synchronize at that point. Whether the whole training process runs in asynchronous or synchronous mode depends on where the AllReduce points are placed in the graph.
-
-As shown in the picture, each GPU computes the gradient of `W` and is followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; each GPU then runs the optimization process individually and applies the gradient to its own `W`.
-
-- **AllReduce**
-  Note that our AllReduce operator is a ring-based AllReduce implementation. If we used the NCCL2 AllReduce primitive directly, every GPU would optimize the full batch of data, wasting (n-1) GPUs' worth of compute resources. In addition, the NCCL2 built-in AllReduce only utilizes the communication resources during synchronization, and updating the gradients becomes a subsequent phase. In fact, we can amortize the gradient-update time cost into the communication phase. The process is:
-1. Every parameter has a root card, which is responsible for aggregating the gradients from the GPUs.
-2. The model's parameters are hashed to different root cards to ensure load balance between GPUs.
-3. Each card sends its gradient to its logical neighbor. After one round, the parameter's root card has aggregated the full gradient.
-4. The root card then optimizes the parameter.
-5. The root card sends its optimized result to its neighbor, and each neighbor forwards the parameter to the next card.
-6. Finish the synchronization round.
-
-The total time cost is 2 * (n-1) * per-parameter-send-time, so we achieve the goal of amortizing the update time into the communication phase.
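-
-To illustrate the two-phase ring scheme above, here is a small self-contained Python simulation (a conceptual sketch only: cards are list indices, per-parameter gradients are plain numbers, and the learning rate and initial weight are assumed values):
-
-```python
-def ring_update(local_grads, root, lr=0.1, w=1.0):
-    """Aggregate gradients to `root` along a ring, optimize there, then
-    propagate the updated parameter back around the ring."""
-    n = len(local_grads)
-    hops = 0
-    # Phase 1: each hop forwards the running sum to the next card,
-    # starting right after the root, so the root receives the full sum last.
-    acc = 0.0
-    card = (root + 1) % n
-    for _ in range(n - 1):
-        acc += local_grads[card]
-        card = (card + 1) % n
-        hops += 1
-    full_grad = acc + local_grads[root]
-    # The root card optimizes the parameter with the aggregated gradient.
-    new_w = w - lr * full_grad
-    # Phase 2: send the optimized parameter around the ring to all cards.
-    params = [None] * n
-    params[root] = new_w
-    card = root
-    for _ in range(n - 1):
-        nxt = (card + 1) % n
-        params[nxt] = params[card]
-        card = nxt
-        hops += 1
-    return params, hops
-
-params, hops = ring_update([0.1, 0.2, 0.3, 0.4], root=0)
-print(params)  # every card ends up holding the same updated parameter
-print(hops)    # 2 * (n - 1) = 6 sends in total
-```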
diff --git a/develop/doc/_sources/design/parallel_do.md.txt b/develop/doc/_sources/design/parallel_do.md.txt
deleted file mode 100644
index 42bd136f825986d94fafaeaa5f58edb02848a74c..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/parallel_do.md.txt
+++ /dev/null
@@ -1,163 +0,0 @@
-# Design Doc: Parallel_Do in PaddlePaddle
-
-In PaddlePaddle, we use the parallel_do primitive to represent multithreaded data-parallel processing.
-
-## Design overview
-
-The definition of a parallel_do op looks like the following:
-
-```c++
-AddInput(kInputs, "Inputs needed to be split onto different devices").AsDuplicable();
-AddInput(kParameters, "Parameters are duplicated over different devices")
- .AsDuplicable();
-AddInput(kPlaces, "Devices used for parallel processing");
-AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
-AddOutput(kParallelScopes,
- "Scopes for all local variables in forward pass. One scope for each device");
-AddAttr
-```
-* Note: the CI environment uses the Docker images from https://github.com/PaddlePaddle/buildtools as the build environment in order to support more Linux
-  distributions. If you need to build manually, you can also use these images. They can also be downloaded from https://hub.docker.com/r/paddlepaddle/paddle_manylinux_devel/tags/ .
-* PyPI does not allow overwriting an uploaded package, so once a wheel package with a given version number has been published, it cannot be changed. The next wheel package must use a new version number before it can be uploaded.
-
-## Publishing Docker Images
-
-After the PaddlePaddle CI described above finishes building the wheel packages, it automatically pushes Docker images to DockerHub. Publishing a Docker image therefore only requires tagging the automatically pushed image
-with the corresponding version number:
-
-1. Go to https://hub.docker.com/r/paddlepaddle/paddle/tags/ and check whether the latest tag was updated after the wheel build above completed.
-1. Run `docker pull paddlepaddle/paddle:[latest tag]`, where the latest tag can be latest, latest-gpu, etc.
-1. Run `docker tag paddlepaddle/paddle:[latest tag] paddlepaddle/paddle:[version]`
-1. Run `docker push paddlepaddle/paddle:[version]`
-
-## PaddlePaddle Branching Convention
-
-PaddlePaddle development follows the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) branching model, with a few adjustments to fit GitHub's characteristics.
-
-* The main PaddlePaddle repository follows the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) branching model, where:
-  * The `master` branch is the stable branch. Every version on the `master` branch has passed both unit tests and regression tests.
-  * The `develop` branch is the development branch. Every version on the `develop` branch has passed unit tests but not regression tests.
-  * A `release/<version>` branch is a temporary branch created for each release. Code on such a branch is undergoing regression testing.
-
-* Repositories forked by other users do not need to strictly follow the [git-flow](http://nvie.com/posts/a-successful-git-branching-model/) model, but every branch of a forked repository is effectively a feature branch.
-  * It is recommended that a forked repository use its `develop` branch to stay in sync with the main repository's `develop` branch.
-  * It is recommended to create feature branches in the forked repository based on its `develop` branch.
-  * When a feature branch is finished, submit a `Pull Request` to the main PaddlePaddle repository for code review.
-  * During the review, developers can keep revising their code by pushing to the same feature branch.
-
-* Bug-fix branches are also maintained in the developer's own fork. Unlike feature branches, a bug-fix branch needs to open `Pull Request`s against the main repository's `master`, `develop` and, if present, `release/<version>` branches at the same time.
-
-## PaddlePaddle Regression Test List
-
-This list describes the features that need to be tested before each PaddlePaddle release.
-
-### All Chapters of the PaddlePaddle Book
-
-For every release, PaddlePaddle must first guarantee that all chapters of the PaddlePaddle Book work correctly. Correctness covers both training with the current `paddle_trainer` and training purely with `Python`.
-
-| | Quick Start | Recognize Digits | Image Classification | Word Vectors | Sentiment Analysis | Semantic Role Labeling | Machine Translation | Personalized Recommendation |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| API.V2 + Docker + GPU | | | | | | | | |
-| API.V2 + Docker + CPU | | | | | | | | |
-| `paddle_trainer` + Docker + GPU | | | | | | | | |
-| `paddle_trainer` + Docker + CPU | | | | | | | | |
-| API.V2 + Ubuntu + GPU | | | | | | | | |
-| API.V2 + Ubuntu + CPU | | | | | | | | |
-| `paddle_trainer` + Ubuntu + GPU | | | | | | | | |
-| `paddle_trainer` + Ubuntu + CPU | | | | | | | | |
diff --git a/develop/doc/_sources/design/scope.md.txt b/develop/doc/_sources/design/scope.md.txt
deleted file mode 100644
index 4da76eebb74abcd26ec2b8671399e6bc4fb58574..0000000000000000000000000000000000000000
--- a/develop/doc/_sources/design/scope.md.txt
+++ /dev/null
@@ -1,124 +0,0 @@
-# Design of Scope in Paddle
-
-## Overview
-
-Scope is an important concept in programming languages; it defines a program region in which a set of bindings between names and entities applies. In a specific scope, a valid name is uniquely associated with an entity, such as a variable. In another scope, the same name may refer to a different entity or to nothing at all. It clearly restricts the visibility and validity of names in a program. Hence **Scope** is introduced to PaddlePaddle to manage variables in context. But unlike the original abstract concept, Scope is now an object with two important attributes:
-
-- Scope is an association of a name to variable.
-- Variables in a parent scope can be retrieved from local scope.
-
-A detailed explanation of these two attributes follows.
-
-
-## Scope is an association of a name to variable.
-
-Scope is an association of a name to a variable. All variables belong to a `Scope`. You need to specify a scope to run a Net, i.e., `net.Run(&scope)`. One net can run in different scopes and update different variables in each scope.
-
-
-1. Scope only contains a map of a name to variable.
-
- All parameters, data, and states in a Net should be variables and stored inside a scope. Each op should get its inputs and outputs for computation from a scope, such as data buffers, states (momentum), etc.
-
-1. A Variable can only be created by a Scope, and a variable can only be retrieved from a Scope. Users cannot create or get a variable outside a scope. This is a constraint of our framework, and it keeps our framework simple and clear.
-
-1. A Scope only contains methods that are used to Create and Get Variables. A Scope does not contain Operators and has no information to run them.
-  `Net` is designed to drive the computation, and a Scope only contains a map of variables. There is no computation logic inside a `Scope`. A Scope just handles the lifetime management of variables.
- - `Create` is used to create a Variable by its name and add the mapping relation.
- - `Get` is used to find a Variable by name.
-
-1. Every variable belongs to exactly one Scope.
-
-  A variable cannot belong to many scopes. If you want to use variables from a parent scope, you can use the `parent scope`.
-
-1. A Scope should destruct all the Variables inside it when it is itself destructed. Users can never store a `Variable` pointer somewhere else.
-
-  Because a Variable can only be obtained from a Scope, when destroying a Scope we also need to destroy all the Variables in it. If a user stores a `Variable` pointer in a private data member or some global variable, that pointer becomes invalid when the associated `Scope` is destroyed.
-
-```cpp
-class Scope {
- public:
- Variable* Var(const std::string& name);
- const Variable* FindVar(const std::string& name) const;
-
- private:
-  std::unordered_map<std::string, std::unique_ptr<Variable>> vars_;  // name -> variable, owned by the scope
-};
-```
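-
-As a conceptual sketch only (plain Python, not the C++ class above), the two attributes — a name-to-variable map and fallback lookup in a parent scope — can be illustrated like this:
-
-```python
-class ToyScope(object):
-    """A name -> variable map; lookups fall back to the parent scope."""
-    def __init__(self, parent=None):
-        self._vars = {}
-        self._parent = parent
-
-    def var(self, name):
-        # Create the variable in *this* scope if it does not exist yet.
-        return self._vars.setdefault(name, object())
-
-    def find_var(self, name):
-        if name in self._vars:
-            return self._vars[name]
-        return self._parent.find_var(name) if self._parent else None
-
-global_scope = ToyScope()
-w = global_scope.var("W")
-local_scope = ToyScope(parent=global_scope)
-print(local_scope.find_var("W") is w)  # True: the parent's variable is visible locally
-```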