Merge pull request #1322 from wangkuiyi/design_doc_new_api

Update API design doc according to discussions in issue #1315

6ab6c356 · wangkuiyi · GitHub · a30f6aa9 · 6915cb29 · 6ab6c356

Showing 155 additions and 170 deletions

doc/design/api.md +155 -170

--- a/doc/design/api.md
+++ b/doc/design/api.md
@@ -2,140 +2,148 @@

 ## Ingredients

-As the first step of our design, we list important concepts in deep
-learning and try to figure their relationship, as shown below:
+Our design principle is to start from the essence: how can we
+allow users to express and solve their problems as neural networks?
+Some essential concepts that our API has to provide include:

-```
-Model = {topology, parameters}
+1. A *topology* is an expression of *layers*.

-Evaluator = {Model*, activations}
- forward
- test(cost, ...)
+1. A layer could be any kind of computation, including *cost*.

-GradientMachine = {Evaluator*, gradients}
- backward
+1. Some layers have parameters, some don't. Most costs don't have
+   parameters.

-Optimizer = {GradientMachine*}
- train(cost, ...)
- update
- checkpoint
-```
+1. In some topologies, layers share parameters.  For
+   example,
+   [the network for training a ranking model](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850).

-where the pair of curly braces `{` and `}` indicate *composition*, `*`
-indicates a *reference*, and `-` marks a "class method".
+1. At programming time, users specify topologies and possible sharing
+   of parameters.  PaddlePaddle can figure out and create parameters
+   required (and possibly shared) by one or more topologies.


-### Model
+## Starting from Examples

-We used to think that parameters are part of the topology (or layers).
-But that is not true because multiple layers could share the same
-parameter matrix.  An example is a network that compares two text
-segments in a semantic space:
+As a summary
+of
+[our discussion](https://github.com/PaddlePaddle/Paddle/issues/1315),
+let us present two examples here:

-```
-          semantic
-text A -> projection ---\
-          layer A        \
-                          cosine
-                          similarity -> output
-                          layer
-          semantic       /
-text B -> projection ---/
-          layer B
-```

-In this network, the two semantic projection layers (A and B) share
-the same parameter matrix.
+### Example 1. Sharing Parameters between Layers

-For more information about our API that specifies topology and
-parameter sharing, please refer to [TODO: API].
+We use
+the
+[3-branch ranking](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850) model
+in this example.  For convenience, the model's topology is copied
+below:

+```
+A -> f -\
+Q -> f --> cost
+B -> f -/
+```

-### Evaluator
-
-Supposed that we have a trained ranking model, we should be able to
-use it in our search engine.  The search engine's Web server is a
-concurrent program so to serve many HTTP requests simultaneously.  It
-doesn't make sense for each of these threads to have its own copy of the model because that would duplicate topologies and parameters.
-However, each thread should be able to record layer outputs, i.e.,
-activations, computed from an input, derived from the request.  With
-*Evaluator* that saves activations, we can write the over-simplified
-server program as:
+The following program trains the topology including the cost, and then
+uses a sub-network of the trained topology for inference:

 ```python
-m = paddle.model.load("trained.model")
-
-http.handle("/",
-            lambda req:
-                e = paddle.evaluator.create(m)
-                e.forward(req)
-                e.activation(layer="output")) # returns activations of layer "output"
+def f(x):
+    e = paddle.layer.embedding(x, parameter_name="embedding")
+    o = paddle.layer.softmax(e, parameter_name="semantic")
+    return o
+
+# Create 3 topologies (subnets). They share parameters because all
+# corresponding layers have the same parameter names.
+fA = f(paddle.layer.data(input_name="A"))
+fB = f(paddle.layer.data(input_name="B"))
+fQ = f(paddle.layer.data(input_name="Q"))
+
+topology = paddle.layer.less_than(
+               paddle.layer.cross_entropy(fA, fQ),
+               paddle.layer.cross_entropy(fB, fQ))
+
+# Derive parameters required in topology and create them in model.
+parameters = paddle.parameters.create(topology)
+
+# Estimate parameters used in topology from data.
+paddle.train(topology, parameters, reader=read_ranking_model_data)
+
+# Inference using fA (or fB or fQ, as they share their parameters).
+[testA, testB, testQ] = read_ranking_model_data()
+print "The semantic-vector of testA: ", paddle.infer(fA, parameters, testA)
 ```

-### GradientMachine
-
-Similar to the evaluation, the training needs to compute gradients so
-to update model parameters.  Because an [optimizer](#optimizer) might
-run multiple simultaneous threads to update the same model, gradients
-should be separated from the model.  Because gradients are only used
-in training, but not serving, they should be separate from Evaluator.
-Hence the `GradientMachine`.
-
-### Optimizer
-
-None of Model, Evaluator, nor GradientMachine implements the training
-loop, hence Optimizer.  We can define a concurrent optimizer that runs
-multiple simultaneous threads to train a model -- just let each
-thread has its own GradientMachine object.

-Most models should be able to be trained using the
-`paddle.optimizer.SGD` by calling its `train` method.  Many
-customizations to the SGD algorithm happens with the update equation,
-e.g., momentum and the Adam SGD algorithm.  We make `train` calls
-`update` to do an update, so that we can derive a `paddle.optimizer.Adam`
-from `paddle.optimizer.SGD` by overrides only the `update` method.
+### Example 2. Sharing Parameters between "Models"

+We use [GAN](https://github.com/PaddlePaddle/book/tree/develop/gan) in
+this example.  In the following example program, `d0` and `d1`
+correspond to the two networks in the following figure:

-## Programming Interface
-
-A fictive example of PaddlePaddle program looks like the following:
+<img src="https://github.com/wangyang59/book/raw/00036f4b0da5225041a6824587c1a01cf20159b1/gan/image/gan_ig.png" width=400 />

 ```python
-import paddle
+def G(x):
+    # over-simplified example as G has only one layer:
+    return paddle.layer.fc(x, parameter_name="G")
+
+def D(x):
+    # again, over-simplified:
+    return paddle.layer.fc(x, parameter_name="D")
+
+# Construct the first topology, which contains both D and G.
+# By learning this topology, we update parameters of G.
+d0 = paddle.layer.should_be_false(D(G(paddle.layer.data())))
+
+# Construct a second topology d1, which contains only D. By
+# training this topology, we update parameters of D.  Note
+# that d1 shares parameters with d0.
+d1 = paddle.layer.should_be_true(D(paddle.layer.data()))
+
+# Create parameters from a list of multiple topologies (models) for
+# the chance to share parameters between these topologies.
+parameters = paddle.parameters.create([d0, d1])
+
+# Iterative training of GAN.
+for ...:
+    paddle.train(d0, parameters, reader=read_from_rng, immutable_parameters={"D"})
+    paddle.train(d1, parameters, reader=read_from_realistic_images)
+
+# Use d1 for inference:
+print "D thinks a batch of images are realistic ", paddle.infer(d1, parameters, read_mnist_images)
+```

-def read(args):
-    f = open_file(args["filename"])
-    mb = read_a_minibatch(f)
-    end_pass = eof(f)
-    if end_pass:
-       f = open_file(args["filename"]) # rewind for reading again
-    yield mb, end_pass

-input = paddle.layer.data(...)
-intermediate = paddle.layers.fc(input)
-output = paddle.layer.softmax(intermediate)
+### Summarization

-model = paddle.model.create(output)

-paddle.train(model, data_provider=read)
-```
+The two programs above reveal some important design concerns:

-This shows some important part of a program:
+1. Users describe a topology as an expression of layers.  Every layer
+   has a *parameter name*.  If the users don't specify it explicitly, it's automatically generated as a unique name.  By
+   specifying the parameter name, users can specify the sharing of
+   parameters between layers and even between topologies.

-1. Define how to read (and augment) data by defining a function, in
-   this example, `read`, that `yields` a minibatch and a boolean flag
-   `eof_of_pass`.
+1. `paddle.parameters.create` figures out parameters required by one
+   or more topologies from parameter names of layers.  It creates these
+   parameters and returns a `ParameterSet` object, which is in essence
+   a map from *parameter names* to *parameters*.

-1. Define the topology, `input`, `intermediate`, and `output` in this
-   example.
+1. At training and inference time, `paddle.train` and `paddle.infer`
+   require both a topology and the parameter set that holds the parameters of that topology.  There are several reasons for this:

-1. Create parameters from the topology thus forms the model by calling
-   `paddel.model.create`.
+   1. This prevents users from forgetting to call
+      `paddle.parameters.create`.
+   1. `paddle.train` needs to know which parameter set to update.
+   1. Users could load another (pre-trained) parameter set and use it
+      with a topology in `paddle.infer`.

-1. Train the model by calling `paddle.train`.
+1. By specifying the `immutable_parameters` parameter of
+   `paddle.train`, we can forbid the update of these parameters.
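
The concerns above can be sketched in plain Python. This is only an illustration under stated assumptions, not PaddlePaddle's implementation: `Layer`, `create_parameters`, and `apply_gradients` are hypothetical names, parameters are assumed to be flat lists of floats, and the parameter set is modelled as an ordinary dict from parameter names to weights.

```python
import random


class Layer:
    """Hypothetical stand-in for a layer: an optional named parameter plus inputs."""

    def __init__(self, parameter_name=None, size=0, inputs=()):
        self.parameter_name = parameter_name  # None: the layer has no parameter
        self.size = size                      # number of weights (assumed flat)
        self.inputs = list(inputs)


def create_parameters(topologies, seed=0):
    """Return a dict from parameter name to weights for one or more topologies."""
    rng = random.Random(seed)
    parameters = {}
    stack = list(topologies)
    while stack:
        layer = stack.pop()
        stack.extend(layer.inputs)
        name = layer.parameter_name
        if name is None:
            continue
        if name in parameters:
            # Same name seen again: reuse the existing parameter (sharing).
            if len(parameters[name]) != layer.size:
                raise ValueError("conflicting sizes for parameter " + name)
            continue
        parameters[name] = [rng.uniform(-0.1, 0.1) for _ in range(layer.size)]
    return parameters


def apply_gradients(parameters, gradients, learning_rate=0.1,
                    immutable_parameters=()):
    """Update parameters in place, skipping names listed as immutable."""
    for name, gradient in gradients.items():
        if name in immutable_parameters:
            continue
        parameters[name] = [w - learning_rate * g
                            for w, g in zip(parameters[name], gradient)]


# Mirror the GAN example: d0 contains D and G, d1 contains only D.
d0 = Layer("D", 4, [Layer("G", 3, [Layer()])])
d1 = Layer("D", 4, [Layer()])
parameters = create_parameters([d0, d1])

frozen_d = list(parameters["D"])
old_g = list(parameters["G"])
apply_gradients(parameters, {"D": [1.0] * 4, "G": [1.0] * 3},
                immutable_parameters={"D"})
```

Because both topologies name their discriminator parameter `"D"`, only one entry is created for it, and listing `"D"` as immutable leaves it untouched while `"G"` is updated.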


-### Reader
+## Reader

 Not all programming frameworks allow users to define I/O functions.
 An example is Google MapReduce, which can only read from text,
@@ -145,91 +153,67 @@ readers and writers by deriving from base classes `Reader` and
 decide to provide the flexibility to users to define their readers.


-#### A Synthetic Data Reader
+There are some open questions here:

-Sometimes we want to test a topology and/or a training algorithm using
-synthetic data.  We can do this by defining the reader a synthesizer:
+1. **Should a reader return a Python dictionary?**

-```python
-def read(args):
-    x = sample_from_uniform(0.0, 1.0)
-    y = sample_from_gauss(2 * x, sigma)
-    yield {x, y}, False # no end-of-file so no end-of-pass
-```
+1. **How to map multiple outputs from a reader to multiple data layers?**

-#### A Reader for Online Learning
+1. **How to easily compose some existing readers to read more data and
+   feed a topology with more data layers?**

-Readers can also read an infinite data stream, e.g., a log stream from
-a search engine and collected by Kafka:

-```python
-def read(args):
-    log_stream = kafka.open_channel(args["kafka channel name"])
-    yeild log_stream.read(), False # no end-of-pass in online learning
-```
+## Training

-### Topology
-
-By default, layers don't have names.  But if we want to refer to a
-layer later some time, for example, when we do serving using the model
-and wants activations/outputs of a layer, we should give it a name.
+The recommended way to train a model is to call `paddle.train`,
+which simply calls `paddle.trainer.Default`, a global variable of
+type `paddle.trainer.SGD`.  Equivalently, we can write

 ```python
-input = paddle.layer.data(...)
-intermediate = paddle.layer.fc(input, name="inter", ...)
-output = paddle.layer.softmax(intermediate, name="output", ...)
-
-m = paddle.model.create(output)
-e = paddle.evaluator.create(model)
-e.forward(read_an_input()) # compute activations of all layers.
-print e.activations(layer="inter")  # retrieve the activations of layer "inter"
-print e.activations(layer="output") # retrieve the activations of layer "output"
+opt = paddle.trainer.SGD(..., paddle.updater.Adam(...))
+opt.train(topology, parameters, reader=read, ...)
 ```

-#### Sharing Parameters
+### Updater

-In [above section](#model) we shows a network whose two layers share
-the same parameter matrix.  To specify such cases, we give "parameter
-names" to layers.  If some layers have the same paraemter names,
-`paddle.model.create` creates a single parameter matrix for these
-layers:
+Please be aware that a trainer can accept an updater as its data
+member, where an updater is a class derived from
+`paddle.trainer.Updater`.  This is to make it easier to customize
+trainers, as discussed
+[here](https://github.com/PaddlePaddle/Paddle/issues/1319).
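
To illustrate why separating the updater from the trainer makes customization easier, here is a minimal sketch in plain Python. All class names (`Updater`, `PlainSGD`, `Momentum`, `Trainer`) are hypothetical, parameters are assumed to be scalars, and the gradient is supplied by a user function rather than computed by a real backward pass.

```python
class Updater:
    """Hypothetical base class: maps one (value, gradient) pair to a new value."""

    def update(self, name, value, gradient):
        raise NotImplementedError


class PlainSGD(Updater):
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def update(self, name, value, gradient):
        return value - self.learning_rate * gradient


class Momentum(Updater):
    def __init__(self, learning_rate, momentum):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}  # per-parameter state lives in the updater

    def update(self, name, value, gradient):
        v = self.momentum * self.velocity.get(name, 0.0) - self.learning_rate * gradient
        self.velocity[name] = v
        return value + v


class Trainer:
    """The training loop is written once; only the update rule is swapped."""

    def __init__(self, updater):
        self.updater = updater

    def train(self, parameters, gradient_fn, steps):
        for _ in range(steps):
            for name, gradient in gradient_fn(parameters).items():
                parameters[name] = self.updater.update(name, parameters[name], gradient)
        return parameters


def gradient_of_toy_cost(parameters):
    # Gradient of the toy cost (w - 3)^2 with respect to w.
    return {"w": 2.0 * (parameters["w"] - 3.0)}


sgd_result = Trainer(PlainSGD(0.1)).train({"w": 0.0}, gradient_of_toy_cost, 100)
momentum_result = Trainer(Momentum(0.1, 0.5)).train({"w": 0.0}, gradient_of_toy_cost, 100)
```

Both trainers share the same loop and drive `w` toward 3; the only difference is the updater object passed to the constructor.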

-```python
-text1 = paddle.layer.data(...)
-sematic1 = paddle.layer.fc(text1, ..., parameter_name="sematic_projection")
-text2 = paddle.layer.data(...)
-sematic2 = paddle.layer.fc(text2, ..., parameter_name="sematic_projection")
-out = paddle.layer.cosine(semantic1, semantic2)
-```
+### Event Handler

-We can also share parameter matrices between layers in different
-models.  To do this, we need an additional parameter that refers to a
-model:
+`paddle.train` and `paddle.trainer.XXX.train` take an optional
+parameter `event_handler`, which should be either `None` or a function
+that handles the following events:

-```python
-model1_input = paddle.layer.data(...)
-model1_output = paddle.layer.softmax(model1_input, ...,
-                                     parameter_name="a_parameter_matrix")
-model1 = paddle.model.create(model1_output)
-
-# Another model
-model2_semantic = paddle.layer.fc(text2, ...,
-                                  parameter_name="a_parameter_matrix",
-                                  parameter_model=model1)
-```
+1. BeginTraining
+1. EndTraining
+1. BeginIteration
+1. EndIteration
+1. BeginPass
+1. EndPass

-### Training
+where EndPass is sent if and only if the reader yields
+`end_pass=True`.

-The recommended way to training a model is to call `paddle.train`,
-which simply calls `paddle.optimizer.Default`, a global variable of
-type `paddle.optimizer.SGD`.  Equivalently, we can do
+An example follows:

 ```python
-opt = paddle.optimizer.SGD(...)
-opt.train(model, reader=read, ...)
+def event_handler(event):
+    if isinstance(event, paddle.event.EndIteration):
+        print paddle.test(...)
+
+paddle.train(topology, parameters, reader, event_handler)
 ```
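
For intuition about when each event fires, the following sketch shows one way a training loop could emit them. It is an assumption-laden illustration, not the actual trainer: the event classes, the `train` signature, and the `step` callback are hypothetical, and the reader is assumed to yield `(minibatch, end_pass)` pairs as described above.

```python
class Event:
    def __init__(self, **fields):
        self.__dict__.update(fields)


class BeginTraining(Event): pass
class EndTraining(Event): pass
class BeginPass(Event): pass
class EndPass(Event): pass
class BeginIteration(Event): pass
class EndIteration(Event): pass


def train(reader, step, event_handler=None, num_passes=1):
    """Run `step` on each minibatch and report progress through events."""
    emit = event_handler if event_handler is not None else (lambda event: None)
    emit(BeginTraining())
    pass_id, iteration, in_pass = 0, 0, False
    for minibatch, end_pass in reader():
        if not in_pass:
            emit(BeginPass(pass_id=pass_id))
            in_pass = True
        emit(BeginIteration(iteration=iteration))
        cost = step(minibatch)
        emit(EndIteration(iteration=iteration, cost=cost))
        iteration += 1
        if end_pass:  # EndPass is sent only when the reader says so
            emit(EndPass(pass_id=pass_id))
            pass_id += 1
            in_pass = False
            if pass_id == num_passes:
                break
    emit(EndTraining())


def toy_reader():
    yield [1.0, 2.0], False
    yield [3.0], False
    yield [4.0, 5.0], True


names, costs = [], []

def handler(event):
    names.append(type(event).__name__)
    if isinstance(event, EndIteration):
        costs.append(event.cost)

train(toy_reader, step=sum, event_handler=handler)
```

A handler that only cares about `EndIteration` can ignore every other event type, which keeps user code as short as the example above.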

-#### Distributed Training
+If we are writing a PaddlePaddle program in and for IPython/Jupyter,
+we can use matplotlib in the event handler to plot a curve of
+cost/error versus iterations, as shown
+[here](https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/).
+
+### Distributed Training

 If users want to do distributed training on a cluster, s/he should
 call `paddle.dist_train` and provide access tokens to the cluster as
@@ -240,8 +224,9 @@ access a Kubernetes cluster, s/he should be able to call

 ```python
 paddle.dist_train(model,
+                  trainer=paddle.trainer.SGD(...,
+                                             paddle.updater.Adam(...)),
                  reader=read,
-                  optimizer=paddle.optimizer.SGDOptimizer(...),
                  k8s_user="yi",
                  k8s_token="kube_cluster_tls.pem",
                  k8s_job="hello",
@@ -251,7 +236,7 @@ paddle.dist_train(model,
 The pseudo code of `paddle.dist_train` is as follows:

 ```python
-def dist_train():
+def dist_train(topology, parameters, trainer, reader, ...):
    if os.getenv("KUBERNETES_SERVICE_HOST") == None:
        image_name = k8s_user + '/' + k8s_job
        docker_build(image_name)
@@ -264,13 +249,13 @@ def dist_train():
        elif rank < 15:
            parameter_server()
        else:
-            optimizer.train(model, reader=read)
+            trainer.train(topology, parameters, reader=reader)
 ```

 Please be aware that if a process is running on the Kubernetes
 cluster, it will have some environment variables pre-defined.

-If `dist_train` doesn't see these environment variables, it knowns
+If `dist_train` doesn't see these environment variables, it knows
 that it's running on users' personal computer, and it should work as a
 *launcher*.  Otherwise, it knows that it's running on the cluster and
 needs to figure out its role as either the master, or a trainer, or a