From 5a1f0617459177da6c5d1b52b4d514da253df68e Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Sun, 12 Feb 2017 21:55:11 -0800
Subject: [PATCH] Update according to discussions in
 https://github.com/PaddlePaddle/Paddle/issues/1315

---
 doc/design/api.md | 293 +++++++++++++++++++---------------------------
 1 file changed, 119 insertions(+), 174 deletions(-)

diff --git a/doc/design/api.md b/doc/design/api.md
index dd4341b324..f66a34fa98 100644
--- a/doc/design/api.md
+++ b/doc/design/api.md
@@ -2,140 +2,148 @@
 ## Ingredients
 
-As the first step of our design, we list important concepts in deep
-learning and try to figure their relationship, as shown below:
+Our design principle is to start from the essence: how can we allow
+users to express and solve their problems with neural networks.
+Some essential concepts that our API has to provide include:
 
-```
-Model = {topology, parameters}
+1. A *topology* is an expression of *layers*.
 
-Evaluator = {Model*, activations}
-- forward
-- test(cost, ...)
+1. A layer could be any kind of computation, including *cost*.
 
-GradientMachine = {Evaluator*, gradients}
-- backward
+1. Some layers have parameters; some don't.  Most costs don't have
+   parameters.
 
-Optimizer = {GradientMachine*}
-- train(cost, ...)
-- update
-- checkpoint
-```
+1. In some topologies, layers share parameters.  For example,
+   [the network for training a ranking model](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850).
+
+1. At programming time, users specify topologies and possible sharing
+   of parameters.  PaddlePaddle can figure out and create the parameters
+   required (and possibly shared) by one or more topologies.
+
+
+## Starting from Examples
 
-where the pair of curly braces `{` and `}` indicate *composition*, `*`
-indicates a *reference*, and `-` marks a "class method".
 
-### Model
+As a summary of
+[our discussion](https://github.com/PaddlePaddle/Paddle/issues/1315),
+let us present two examples here:
 
-We used to think that parameters are part of the topology (or layers).
-But that is not true because multiple layers could share the same
-parameter matrix.  An example is a network that compares two text
-segments in a semantic space:
+### Example 1. Sharing Parameters between Layers
 
+We use the
+[3-branch ranking](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850) model
+in this example.  For your convenience, the model's topology is copied
+here:
 
 ```
-          semantic
-text A -> projection ---\
-          layer A        \
-                          cosine
-                          similarity -> output
-                          layer
-          semantic       /
-text B -> projection ---/
-          layer B
+A -> f -\
+Q -> f --> cost
+B -> f -/
 ```
 
-In this network, the two semantic projection layers (A and B) share
-the same parameter matrix.
-
-For more information about our API that specifies topology and
-parameter sharing, please refer to [TODO: API].
-
-
-### Evaluator
-
-Supposed that we have a trained ranking model, we should be able to
-use it in our search engine.  The search engine's Web server is a
-concurrent program so to serve many HTTP requests simultaneously.  It
-doesn't make sense for each of these threads to have its own copy of the model because that would duplicate topologies and parameters.
-However, each thread should be able to record layer outputs, i.e.,
-activations, computed from an input, derived from the request.  With
-*Evaluator* that saves activations, we can write the over-simplified
-server program as:
+The following program trains the topology, including the cost, and then
+uses a sub-network of the trained topology for inference:
 
 ```python
-m = paddle.model.load("trained.model")
-
-http.handle("/",
-            lambda req:
-                e = paddle.evaluator.create(m)
-                e.forward(req)
-                e.activation(layer="output")) # returns activations of layer "output"
+def f(x):
+    e = paddle.layer.embedding(x, parameter_name="embedding")
+    o = paddle.layer.softmax(e, parameter_name="semantic")
+    return o
+
+# Create 3 topologies (subnets); they share parameters because all
+# corresponding layers have the same parameter names.
+fA = f(paddle.layer.data(input_name="A"))
+fB = f(paddle.layer.data(input_name="B"))
+fQ = f(paddle.layer.data(input_name="Q"))
+
+topology = paddle.cost.less_than(
+    paddle.cost.cross_entropy(fA, fQ),
+    paddle.cost.cross_entropy(fB, fQ))
+
+# Derive the parameters required by the topology and create them.
+parameters = paddle.parameters.create(topology)
+
+# Estimate the parameters used in the topology from data.
+paddle.train(topology, parameters, reader=read_ranking_model_data)
+
+# Inference using fA (or fB or fQ, as they share their parameters).
+[testA, testB, testQ] = read_ranking_model_data()
+print "The semantic vector of testA: ", paddle.infer(fA, parameters, testA)
 ```
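+
+The reader `read_ranking_model_data` above is assumed to be defined by
+the user, and the exact reader interface is still an open question (see
+the Reader section below).  Purely as an illustration, and not as part
+of the proposed API, a synthetic reader could be a plain Python
+generator that yields minibatches keyed by the input names of the data
+layers.  The name `read_synthetic_ranking_data`, the dictionary format,
+and the shapes below are assumptions made only for this sketch:
+
+```python
+import numpy as np
+
+def read_synthetic_ranking_data(batch_size=32, seq_len=16, vocab_size=10000):
+    # Illustrative only: synthesize random word-id minibatches for the
+    # three data layers "A", "B", and "Q" declared in the topology above.
+    while True:
+        yield {
+            "A": np.random.randint(0, vocab_size, size=(batch_size, seq_len)),
+            "B": np.random.randint(0, vocab_size, size=(batch_size, seq_len)),
+            "Q": np.random.randint(0, vocab_size, size=(batch_size, seq_len)),
+        }
+```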
 
-### GradientMachine
-
-Similar to the evaluation, the training needs to compute gradients so
-to update model parameters.  Because an [optimizer](#optimizer) might
-run multiple simultaneous threads to update the same model, gradients
-should be separated from the model.  Because gradients are only used
-in training, but not serving, they should be separate from Evaluator.
-Hence the `GradientMachine`.
-
-### Optimizer
+### Example 2. Sharing Parameters between "Models"
 
-None of Model, Evaluator, nor GradientMachine implements the training
-loop, hence Optimizer.  We can define a concurrent optimizer that runs
-multiple simultaneous threads to train a model -- just let each
-thread has its own GradientMachine object.
+We use [GAN](https://github.com/PaddlePaddle/book/tree/develop/gan) in
+this example.  In the following example program, `d0` and `d1`
+correspond to the two networks in the following figure:
 
-Most models should be able to be trained using the
-`paddle.optimizer.SGD` by calling its `train` method.  Many
-customizations to the SGD algorithm happens with the update equation,
-e.g., momentum and the Adam SGD algorithm.  We make `train` calls
-`update` to do an update, so that we can derive a `paddle.optimizer.Adam`
-from `paddle.optimizer.SGD` by overrides only the `update` method.
-
-
-## Programming Interface
-
-A fictive example of PaddlePaddle program looks like the following:
+
 
 ```python
-import paddle
-
-def read(args):
-    f = open_file(args["filename"])
-    mb = read_a_minibatch(f)
-    end_pass = eof(f)
-    if end_pass:
-        f = open_file(args["filename"]) # rewind for reading again
-    yield mb, end_pass
-
-input = paddle.layer.data(...)
-intermediate = paddle.layers.fc(input)
-output = paddle.layer.softmax(intermediate)
-
-model = paddle.model.create(output)
-
-paddle.train(model, data_provider=read)
+def G(x):
+    # over-simplified example, as G has only one layer:
+    return paddle.layer.fc(x, parameter_name="G")
+
+def D(x):
+    # again, over-simplified:
+    return paddle.layer.fc(x, parameter_name="D")
+
+# Construct the first topology, which contains both D and G.
+# By learning this topology, we update the parameters of G.
+d0 = paddle.cost.should_be_false(D(G(paddle.layer.data())))
+
+# Construct a second topology d1, which contains only D.  By
+# training this topology, we update the parameters of D.  Note
+# that d1 shares parameters with d0.
+d1 = paddle.cost.should_be_true(D(paddle.layer.data()))
+
+# Create parameters from a list of multiple topologies (models) for
+# the chance to share parameters between these topologies.
+parameters = paddle.parameters.create([d0, d1])
+
+# Iterative training of the GAN.
+for ...:
+    train(d0, parameters, reader=read_from_rng, immutable_parameters={"D"})
+    train(d1, parameters, reader=read_from_realistic_images)
+
+# Use d1 for inference:
+print "D thinks a batch of images are realistic: ", infer(d1, parameters, read_mnist_images)
 ```
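+
+The `immutable_parameters={"D"}` argument above is meant to keep the
+parameters of D fixed while d0 is being trained.  Since `paddle.train`
+is only proposed here, the following is a plain-Python mock (using
+NumPy and hypothetical names) of the intended behavior: parameters
+whose names are listed as immutable are skipped by the update step.
+
+```python
+import numpy as np
+
+def sgd_update(parameters, gradients, learning_rate=0.01,
+               immutable_parameters=frozenset()):
+    # Mock of the intended semantics: parameters named in
+    # immutable_parameters keep their current values.
+    for name, value in parameters.items():
+        if name in immutable_parameters:
+            continue
+        value -= learning_rate * gradients[name]
+
+# A toy parameter set shared by the generator "G" and discriminator "D".
+params = {"G": np.zeros((4, 4)), "D": np.zeros((4, 4))}
+grads = {"G": np.ones((4, 4)), "D": np.ones((4, 4))}
+
+sgd_update(params, grads, immutable_parameters={"D"})
+assert params["D"].sum() == 0.0  # "D" was not touched.
+assert params["G"].sum() != 0.0  # "G" was updated.
+```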
 
-This shows some important part of a program:
-
-1. Define how to read (and augment) data by defining a function, in
-   this example, `read`, that `yields` a minibatch and a boolean flag
-   `eof_of_pass`.
+### Summarization
+
 
-1. Define the topology, `input`, `intermediate`, and `output` in this
-   example.
-
-1. Create parameters from the topology thus forms the model by calling
-   `paddel.model.create`.
+The above two programs reveal some important design concerns:
 
-1. Train the model by calling `paddle.train`.
+1. Users describe a topology as an expression of layers.  Every layer
+   has a *parameter name*.  If users don't specify it explicitly, it is
+   automatically generated as a unique name.  By specifying the
+   parameter name, users can specify the sharing of parameters between
+   layers and even between topologies.
+
+1. `paddle.parameters.create` figures out the parameters required by one
+   or more topologies from the parameter names of their layers.  It
+   creates these parameters and returns a `ParameterSet` object, which
+   is in essence a map from *parameter names* to *parameters* (a sketch
+   follows this list).
 
-### Reader
+1. At training and inference time, `paddle.train` and `paddle.infer`
+   require both a topology and the parameter set that holds the
+   parameters of that topology.  There are several reasons for this:
+
+   1. This prevents users from forgetting to call
+      `paddle.parameters.create`.
+   1. `paddle.train` needs to know which parameter set to update.
+   1. Users could load another (pre-trained) parameter set and use it
+      with a topology in `paddle.infer`.
+
+1. By specifying the `immutable_parameters` parameter of
+   `paddle.train`, we can forbid the update of these parameters.
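+
+To make the first two points concrete, the following plain-Python
+sketch (not the proposed API, just an illustration with made-up layer
+descriptions and toy shapes) mocks how `paddle.parameters.create` could
+walk one or more topologies and build the name-to-parameter map,
+creating each shared parameter only once:
+
+```python
+import numpy as np
+
+def create_parameters(topologies):
+    # Mock: each "topology" is reduced to a list of (parameter_name,
+    # shape) pairs collected from its layers; shared names map to a
+    # single array.
+    parameter_set = {}
+    for topology in topologies:
+        for name, shape in topology:
+            if name not in parameter_set:
+                parameter_set[name] = np.random.randn(*shape)
+    return parameter_set
+
+# The three branches of Example 1 all use the same two parameter names,
+# so only two parameters are created even though there are six layers.
+fA = [("embedding", (10000, 256)), ("semantic", (256, 128))]
+fB = [("embedding", (10000, 256)), ("semantic", (256, 128))]
+fQ = [("embedding", (10000, 256)), ("semantic", (256, 128))]
+
+parameters = create_parameters([fA, fB, fQ])
+assert sorted(parameters.keys()) == ["embedding", "semantic"]
+```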
+
+
+## Reader
 
 Not all programming frameworks allow users to define I/O functions.
 An example is Google MapReduce, which can only read from text,
@@ -145,78 +153,15 @@ readers and writers by deriving from base classes `Reader` and
 decide to provide the flexibility to users to define their readers.
 
 
-#### A Synthetic Data Reader
-
-Sometimes we want to test a topology and/or a training algorithm using
-synthetic data.  We can do this by defining the reader a synthesizer:
+There are some open questions here:
 
-```python
-def read(args):
-    x = sample_from_uniform(0.0, 1.0)
-    y = sample_from_gauss(2 * x, sigma)
-    yield {x, y}, False # no end-of-file so no end-of-pass
-```
+1. **Should a reader return a Python dictionary?**
 
-#### A Reader for Online Learning
-
-Readers can also read an infinite data stream, e.g., a log stream from
-a search engine and collected by Kafka:
+1. **How to map multiple outputs from a reader to multiple data layers?**
 
-```python
-def read(args):
-    log_stream = kafka.open_channel(args["kafka channel name"])
-    yeild log_stream.read(), False # no end-of-pass in online learning
-```
+1. **How to easily compose some existing readers to read more data and
+   feed a topology with more data layers?**
 
-### Topology
-
-By default, layers don't have names.  But if we want to refer to a
-layer later some time, for example, when we do serving using the model
-and wants activations/outputs of a layer, we should give it a name.
 
-```python
-input = paddle.layer.data(...)
-intermediate = paddle.layer.fc(input, name="inter", ...)
-output = paddle.layer.softmax(intermediate, name="output", ...)
-
-m = paddle.model.create(output)
-e = paddle.evaluator.create(model)
-e.forward(read_an_input()) # compute activations of all layers.
-print e.activations(layer="inter")  # retrieve the activations of layer "inter"
-print e.activations(layer="output") # retrieve the activations of layer "output"
-```
 
-#### Sharing Parameters
-
-In [above section](#model) we shows a network whose two layers share
-the same parameter matrix.  To specify such cases, we give "parameter
-names" to layers.  If some layers have the same paraemter names,
-`paddle.model.create` creates a single parameter matrix for these
-layers:
 
-```python
-text1 = paddle.layer.data(...)
-sematic1 = paddle.layer.fc(text1, ..., parameter_name="sematic_projection")
-text2 = paddle.layer.data(...)
-sematic2 = paddle.layer.fc(text2, ..., parameter_name="sematic_projection")
-out = paddle.layer.cosine(semantic1, semantic2)
-```
 
-We can also share parameter matrices between layers in different
-models.  To do this, we need an additional parameter that refers to a
-model:
 
-```python
-model1_input = paddle.layer.data(...)
-model1_output = paddle.layer.softmax(model1_input, ...,
-                                     parameter_name="a_parameter_matrix")
-model1 = paddle.model.create(model1_output)
-
-# Another model
-model2_semantic = paddle.layer.fc(text2, ...,
-                                  parameter_name="a_parameter_matrix",
-                                  parameter_model=model1)
-```
 
 
 ### Training
 
@@ -270,7 +215,7 @@ def dist_train():
 Please be aware that if a process is running on the Kubernetes cluster,
 it will have some environment variables pre-defined.
 
-If `dist_train` doesn't see these environment variables, it knowns
+If `dist_train` doesn't see these environment variables, it knows
 that it's running on users' personal computer, and it should work as
 a *launcher*.  Otherwise, it knows that it's running on the cluster and
 need to figure out its role as either the master, or a trainer, or a
--
GitLab