# PaddlePaddle Design Doc

## Ingredients

As the first step of our design, we list important concepts in deep
learning and try to figure out their relationships, as shown below:

```
Model = {topology, parameters}

Evaluator = {Model*, activations}
- forward
- test(cost, ...)

GradientMachine = {Evaluator*, gradients}
- backward

Optimizer = {GradientMachine*}
- train(cost, ...)
- update
- checkpoint
```

where a pair of curly braces `{` and `}` indicates *composition*, `*`
indicates a *reference*, and `-` marks a "class method".


### Model

We used to think that parameters are part of the topology (or layers).
But that is not true, because multiple layers could share the same
parameter matrix. An example is a network that compares two text
segments in a semantic space:

```
          semantic
text A -> projection ---\
          layer A        \
                          cosine
                          similarity -> output
                          layer
          semantic       /
text B -> projection ---/
          layer B
```

In this network, the two semantic projection layers (A and B) share
the same parameter matrix.

For more information about our API that specifies topology and
parameter sharing, please refer to [TODO: API].


### Evaluator

Suppose that we have a trained ranking model; we should be able to
use it in our search engine. The search engine's Web server is a
concurrent program that serves many HTTP requests simultaneously. It
doesn't make sense for each of these threads to have its own copy of
the model, because that would duplicate topologies and parameters.
However, each thread should be able to record layer outputs, i.e.,
activations, computed from an input derived from the request. With an
*Evaluator* that saves activations, we can write an over-simplified
server program as:

```python
m = paddle.model.load("trained.model")

def handler(req):
    e = paddle.evaluator.create(m)
    e.forward(req)
    return e.activations(layer="output")  # returns activations of layer "output"

http.handle("/", handler)
```

### GradientMachine

Similar to evaluation, training needs to compute gradients in order to
update model parameters. Because an [optimizer](#optimizer) might run
multiple simultaneous threads to update the same model, gradients
should be separated from the model. Because gradients are used only in
training, not in serving, they should also be separate from the
Evaluator. Hence the `GradientMachine`.

### Optimizer

None of Model, Evaluator, nor GradientMachine implements the training
loop, hence the Optimizer. We can define a concurrent optimizer that
runs multiple simultaneous threads to train a model -- just let each
thread have its own GradientMachine object.

Most models should be trainable with `paddle.optimizer.SGD` by calling
its `train` method. Many customizations of the SGD algorithm happen in
the update equation, e.g., momentum and the Adam SGD algorithm. We
make `train` call `update` to perform each update, so that we can
derive `paddle.optimizer.Adam` from `paddle.optimizer.SGD` by
overriding only the `update` method, as the sketch below illustrates.
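The following is a minimal sketch of this `train`/`update` split, not
the proposed API: parameters and gradients are plain numpy arrays
keyed by name, and the `compute_gradients` callback stands in for a
`GradientMachine`; only the `train` and `update` methods correspond to
the design above.

```python
import numpy as np


class SGD(object):
    """Base optimizer: owns the training loop and delegates to update()."""

    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def train(self, parameters, compute_gradients, num_steps=100):
        # In the real design, train() would drive a GradientMachine over
        # minibatches from a reader; compute_gradients stands in for that.
        for _ in range(num_steps):
            self.update(parameters, compute_gradients(parameters))

    def update(self, parameters, gradients):
        # Plain SGD update equation.
        for name, g in gradients.items():
            parameters[name] -= self.learning_rate * g


class Adam(SGD):
    """Derived optimizer: overrides only the update equation."""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        super(Adam, self).__init__(learning_rate)
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m, self.v, self.t = {}, {}, 0

    def update(self, parameters, gradients):
        self.t += 1
        for name, g in gradients.items():
            self.m[name] = self.beta1 * self.m.get(name, 0.0) + (1 - self.beta1) * g
            self.v[name] = self.beta2 * self.v.get(name, 0.0) + (1 - self.beta2) * g * g
            m_hat = self.m[name] / (1 - self.beta1 ** self.t)
            v_hat = self.v[name] / (1 - self.beta2 ** self.t)
            parameters[name] -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.eps)
```

Because `train` is inherited unchanged, `Adam` works with the same
training loop (and, later, the same distributed launcher) as `SGD`.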

## Programming Interface

A fictive example of a PaddlePaddle program looks like the following:

```python
import paddle

def read(args):
    f = open_file(args["filename"])
    while True:
        mb = read_a_minibatch(f)
        end_pass = eof(f)
        if end_pass:
            f = open_file(args["filename"])  # rewind for reading again
        yield mb, end_pass

input = paddle.layer.data(...)
intermediate = paddle.layer.fc(input)
output = paddle.layer.softmax(intermediate)

model = paddle.model.create(output)

paddle.train(model, data_provider=read)
```

This shows some important parts of a program:

1. Define how to read (and augment) data by defining a function, in
   this example `read`, that `yield`s a minibatch and a boolean flag
   `end_pass`.

1. Define the topology, `input`, `intermediate`, and `output` in this
   example.

1. Create parameters from the topology, thus forming the model, by
   calling `paddle.model.create`.

1. Train the model by calling `paddle.train`.


### Reader

Not all programming frameworks allow users to define I/O functions.
An example is Google MapReduce, which can only read from text,
SSTable, and RecordIO files. Hadoop MapReduce allows users to define
readers and writers by deriving from the base classes `Reader` and
`Writer`. The former approach is less flexible but also less
error-prone. We decided to give users the flexibility to define their
own readers.


#### A Synthetic Data Reader

Sometimes we want to test a topology and/or a training algorithm using
synthetic data. We can do this by defining the reader as a
synthesizer:

```python
def read(args):
    while True:
        x = sample_from_uniform(0.0, 1.0)
        y = sample_from_gauss(2 * x, sigma)
        yield (x, y), False  # no end-of-file, so no end-of-pass
```

#### A Reader for Online Learning

Readers can also read an infinite data stream, e.g., a log stream from
a search engine, collected by Kafka:

```python
def read(args):
    log_stream = kafka.open_channel(args["kafka channel name"])
    while True:
        yield log_stream.read(), False  # no end-of-pass in online learning
```

### Topology

By default, layers don't have names. But if we want to refer to a
layer later, for example, when we serve the model and want the
activations/outputs of a layer, we should give it a name.

```python
input = paddle.layer.data(...)
intermediate = paddle.layer.fc(input, name="inter", ...)
output = paddle.layer.softmax(intermediate, name="output", ...)

m = paddle.model.create(output)
e = paddle.evaluator.create(m)
e.forward(read_an_input())          # compute activations of all layers.
print e.activations(layer="inter")  # retrieve the activations of layer "inter"
print e.activations(layer="output") # retrieve the activations of layer "output"
```

#### Sharing Parameters

In the [section above](#model) we showed a network whose two layers
share the same parameter matrix. To specify such cases, we give
"parameter names" to layers. If some layers have the same parameter
names, `paddle.model.create` creates a single parameter matrix for
these layers (see the sketch at the end of this subsection):

```python
text1 = paddle.layer.data(...)
semantic1 = paddle.layer.fc(text1, ..., parameter_name="semantic_projection")
text2 = paddle.layer.data(...)
semantic2 = paddle.layer.fc(text2, ..., parameter_name="semantic_projection")
out = paddle.layer.cosine(semantic1, semantic2)
```

We can also share parameter matrices between layers in different
models. To do this, we need an additional parameter that refers to a
model:

```python
model1_input = paddle.layer.data(...)
model1_output = paddle.layer.softmax(model1_input, ...,
                                     parameter_name="a_parameter_matrix")
model1 = paddle.model.create(model1_output)

# Another model
model2_semantic = paddle.layer.fc(text2, ...,
                                  parameter_name="a_parameter_matrix",
                                  parameter_model=model1)
```
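To make name-based sharing concrete, here is a minimal sketch, purely
illustrative and not the actual `paddle.model.create` implementation,
of how parameter matrices could be deduplicated by `parameter_name`.
The function name `create_parameters` and the layer attributes it
assumes (`inputs`, `parameter_name`, `parameter_shape`) are
hypothetical.

```python
import numpy as np

def create_parameters(output_layer):
    """Walk the topology and create one matrix per distinct parameter name."""
    parameters = {}   # parameter_name -> numpy matrix
    visited = set()

    def walk(layer):
        if id(layer) in visited:
            return
        visited.add(id(layer))
        name = getattr(layer, "parameter_name", None)
        if name is not None and name not in parameters:
            # Layers that repeat the same name share this single matrix.
            parameters[name] = np.random.randn(*layer.parameter_shape) * 0.01
        for upstream in getattr(layer, "inputs", []):
            walk(upstream)

    walk(output_layer)
    return parameters
```

Because the dictionary is keyed by `parameter_name`, the two
`semantic_projection` layers above would receive the same matrix
object, which is exactly the sharing behavior described in the Model
section.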

### Training

The recommended way to train a model is to call `paddle.train`, which
simply delegates to `paddle.optimizer.Default`, a global variable of
type `paddle.optimizer.SGD`. Equivalently, we can do

```python
opt = paddle.optimizer.SGD(...)
opt.train(model, reader=read, ...)
```

#### Distributed Training

If a user wants to do distributed training on a cluster, s/he should
call `paddle.dist_train` and provide access tokens to the cluster as
parameters.

For example, if the user has a TLS certificate that allows him/her to
access a Kubernetes cluster, s/he should be able to call

```python
paddle.dist_train(model,
                  reader=read,
                  optimizer=paddle.optimizer.SGD(...),
                  k8s_user="yi",
                  k8s_token="kube_cluster_tls.pem",
                  k8s_job="hello",
                  num_parameter_servers=15)
```

The pseudo code of `paddle.dist_train` is as follows:

```python
def dist_train():
    if os.getenv("KUBERNETES_SERVICE_HOST") is None:
        image_name = k8s_user + '/' + k8s_job
        docker_build(image_name)
        docker_push()
        kube_ctrl_start_job(image_name, k8s_user, k8s_token)
    else:
        rank = kube_list_containers_in_job_and_return_current_containers_rank()
        if rank == 0:
            master()
        elif rank <= num_parameter_servers:
            parameter_server()
        else:
            optimizer.train(model, reader=read)
```

Please be aware that if a process is running on the Kubernetes
cluster, it will have some environment variables pre-defined.

If `dist_train` doesn't see these environment variables, it knows
that it's running on the user's personal computer, and it should work
as a *launcher*. Otherwise, it knows that it's running on the cluster
and needs to figure out its role as either the master, a trainer, or
a parameter server. The sketch below illustrates this role assignment.
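The following is a minimal sketch under the conventions used above:
rank 0 is the master, ranks 1 through `num_parameter_servers` are
parameter servers, and the remaining ranks are trainers. The helper
names `determine_role` and `running_on_kubernetes` are hypothetical,
not part of the proposed API; the `KUBERNETES_SERVICE_HOST` check
mirrors the pseudo code above, since Kubernetes injects that
environment variable into every pod.

```python
import os

def running_on_kubernetes():
    # Kubernetes pre-defines KUBERNETES_SERVICE_HOST in every pod's environment.
    return os.getenv("KUBERNETES_SERVICE_HOST") is not None

def determine_role(rank, num_parameter_servers):
    """Map a container's rank within the Kubernetes job to its role."""
    if rank == 0:
        return "master"
    elif rank <= num_parameter_servers:
        return "parameter_server"
    else:
        return "trainer"
```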