api.md

# PaddlePaddle Design Doc

## Ingredients

As our design principle is starting from the essence: how could we
allow users to express and solve their problems at neural networks.
Some essential concepts that our API have to provide include:

1. A *topology* is an expression of *layers*.

1. A layer could be any kind of computation, including *cost*.

1. Some layers have parameters, some don't. Most costs don't have
   parameters.

1. In some topologies, layers share parameters.  For
   example,
   [the network for training a ranking model](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850).
   
1. At programming time, users specify topologies and possible sharing
   of parameters.  PaddlePaddle can figure out and create parameters
   required (and possibly shared) by one or more topologies.
   

## Starting from Examples

As a summarization
of
[our disucssion](https://github.com/PaddlePaddle/Paddle/issues/1315),
let us present two examples here:


### Example 1. Sharing Parameters between Layers

We use
the
[3-branch ranking](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850) model
in this example.  For your convenience, I copy-a-paste the model's
topology as follows:

```
A -> f -\
Q -> f --> cost
B -> f -/
```

The following program trains the topology including the cost, and then
use the sub-network in the trained topology in inference:

```python
def f(in):
    e = paddle.layer.embedding(in, parameter_name="embedding")
    o = paddle.layer.softmax(e, parameter_name="semantic")
    return o

# Create 3 topologies (subnets), they share parameters because all
# correspoinding layers have the same parameter names.
fA = f(paddle.layer.data(input_name="A"))
fB = f(paddle.layer.data(input_name="B"))
fQ = f(paddle.layer.data(input_name="Q"))

topology = paddle.cost.less_than(
               paddle.cost.cross_entropy(fA, fQ),
               paddle.cost.corss_entropy(fB, fQ))

# Derive parameters required in topology and create them in model.
parameters = paddle.parameters.create(topology)

# Estimate parameters used in topology from data.
paddle.train(topology, parameters, reader=read_ranking_model_data)

# Inference using fA (or fB or fC, as they share their parameters).
[testA, testB, testQ] = read_ranking_model_data()
print "The sematic-vector of testA: ", paddle.infer(fA, parameters, testA)
```


### Example 2. Sharing Parameters between "Models"

We use [GAN](https://github.com/PaddlePaddle/book/tree/develop/gan) in
this example.  In the following example program, `d0` and `d1`
correspond to the two networks in the following figure:

<img src="https://github.com/wangyang59/book/raw/00036f4b0da5225041a6824587c1a01cf20159b1/gan/image/gan_ig.png" width=400 />

```python
def G(in):
    # over-simplified example as G has only one layers:
    return paddle.layer.fc(in, parameter_name="G") 

def D(in);
    # again, over-simplified:
    return paddle.layer.fc(in, parameter_name="D")

# Construct the first topology, which contains both D and G.
# By learning this topology, we update parameters of G.
d0 = paddle.cost.should_be_false(D(G(paddle.layer.data())))

# Construct a second topology d1, which contains only D. By
# training this topology, we update parameters of D.  Note 
# that d1 share parameters with d0.
d1 = paddle.cost.should_be_true(D(paddle.layer.data()))

# Create parameters from a list of multiple topologies (models) for
# the chance to share parameters between these topologies.
parameters = paddle.parameters.create([d0, d1])

# Iterative training of GAN.
for ...:
    train(d0, parameters, reader=read_from_rng, immutable_parameters={"D"})
    train(d1, parameters, reader=read_from_realistic_images)

# Use d1 for inference:
print "D thinks a batch of images are realistic ", infer(d1, parameters, read_mnist_images)
```


### Summarization


Above two programs reveal some important design concerns:

1. Users describe a topology as an expression of layers.  Every layer
   has a *parameter name*.  If the users don't specify it explicitly, it's automatically generated as a unique name.  By
   specifying the parameter name, users can specify the sharing of
   parameters between layers and even between topologies.

1. `paddle.parameters.create` figures out parameters required by one
   or more topologies from parameter names of layers.  It creates these
   parameters and returns a `ParameterSet` object, which is in essence
   a map from *parameter names* to *parameters*.

1. At training and inference time, `paddle.train` and `paddle.infer`
   requires both a topology and the parameter set that holds the parameters of that topology.  There are some reasons:
   
   1. This prevents users from forgetting to call
      `paddle.parameters.create`.
   1. `paddle.train` needs to know which parameter set to update.
   1. Users could load another (pre-trained) parameter set and use it
      with a topology in `train.infer`.
   
1. By specifying the `immutable_parameters` parameter of
   `paddle.train`, we can forbid the update of these parameters.
   

## Reader

Not all programming frameworks allow users to define I/O functions.
An example is Google MapReduce, which can only read from text,
SSTable, and RecordIO files.  Hadoop MapReduce allows users to define
readers and writers by deriving from base classes `Reader` and
`Writer`.  The former is less flexible but also less error-prone.  We
decide to provide the flexibility to users to define their readers.


There are some open questions here:

1. **Should a reader return a Python dictionary?**

1. **How to map multiple outputs from a reader to multiple data layers?**

1. **How to easily compose some existing readers to read more data and
   feed a topology with more data layers?**


### Training

The recommended way to training a model is to call `paddle.train`,
which simply calls `paddle.optimizer.Default`, a global variable of
type `paddle.optimizer.SGD`.  Equivalently, we can do

```python
opt = paddle.optimizer.SGD(...)
opt.train(model, reader=read, ...)
```

#### Distributed Training

If users want to do distributed training on a cluster, s/he should
call `paddle.dist_train` and provides access tokens to the cluster as
a parameter.

For example, if the user has a TLS certificate that allows him to
access a Kubernetes cluster, s/he should be able to call

```python
paddle.dist_train(model,
                  reader=read,
                  optimizer=paddle.optimizer.SGDOptimizer(...),
                  k8s_user="yi",
                  k8s_token="kube_cluster_tls.pem",
                  k8s_job="hello",
                  num_parameter_servers=15)
```

The pseudo code if `paddle.dist_train` is as follows:

```python
def dist_train():
    if os.getenv("KUBERNETES_SERVICE_HOST") == None:
        image_name = k8s_user + '/' + k8s_job
        docker_build(image_name)
        docker_push()
        kube_ctrl_start_job(image_name, k8s_user, k8s_token)
    else:
        rank = kube_list_containers_in_job_and_return_current_containers_rank()
        if rank == 0:
            master()
        elif rank < 15:
            parameter_server()
        else:
            optimizer.train(model, reader=read)
```

Please be aware that if a process is running on the Kubernetes
cluster, it will have some environment variables pre-defined.

If `dist_train` doesn't see these environment variables, it knows
that it's running on users' personal computer, and it should work as a
*launcher*.  Otherwise, it knows that it's running on the cluster and
need to figure out its role as either the master, or a trainer, or a
parameter server.