# PaddlePaddle Design Doc

## Ingredients

Our design principle is to start from the essence: how can we allow
users to express and solve their problems as neural networks.  Some
essential concepts that our API has to provide include:

1. A *topology* is an expression of *layers*.

1. A layer could be any kind of computation, including *cost*.

1. Some layers have parameters, some don't. Most costs don't have
   parameters.

1. In some topologies, layers share parameters.  For
   example,
   [the network for training a ranking model](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850).

1. At programming time, users specify topologies and possible sharing
   of parameters.  PaddlePaddle can figure out and create parameters
   required (and possibly shared) by one or more topologies.


## Starting from Examples

As a summary of
[our discussion](https://github.com/PaddlePaddle/Paddle/issues/1315),
let us present two examples here:


### Example 1. Sharing Parameters between Layers

We use the
[3-branch ranking](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850) model
in this example.  For your convenience, I copy and paste the model's
topology as follows:

```
A -> f -\
Q -> f --> cost
B -> f -/
```

The following program trains the topology including the cost, and then
uses the sub-network in the trained topology for inference:

```python
def f(inputs):
    e = paddle.layer.embedding(inputs, parameter_name="embedding")
    o = paddle.layer.softmax(e, parameter_name="semantic")
    return o

# Create 3 topologies (subnets); they share parameters because all
# corresponding layers have the same parameter names.
fA = f(paddle.layer.data(input_name="A"))
fB = f(paddle.layer.data(input_name="B"))
fQ = f(paddle.layer.data(input_name="Q"))

topology = paddle.layer.less_than(
               paddle.layer.cross_entropy(fA, fQ),
               paddle.layer.cross_entropy(fB, fQ))

# Derive parameters required in topology and create them in model.
parameters = paddle.parameters.create(topology)

# Estimate parameters used in topology from data.
paddle.train(topology, parameters, reader=read_ranking_model_data)

# Inference using fA (or fB or fQ, as they share their parameters).
[testA, testB, testQ] = read_ranking_model_data()
print "The semantic-vector of testA: ", paddle.infer(fA, parameters, testA)
```


### Example 2. Sharing Parameters between "Models"

We use GAN in this example.  In the following example program, `d0` and `d1`
correspond to the two networks in the following figure:

<img src="https://github.com/wangyang59/book/raw/00036f4b0da5225041a6824587c1a01cf20159b1/gan/image/gan_ig.png" width=400 />

```python
def G(inputs):
    # over-simplified example as G has only one layer:
    return paddle.layer.fc(inputs, parameter_name="G")

def D(inputs):
    # again, over-simplified:
    return paddle.layer.fc(inputs, parameter_name="D")

# Construct the first topology, which contains both D and G.
# By learning this topology, we update parameters of G.
d0 = paddle.layer.should_be_false(D(G(paddle.layer.data())))

# Construct a second topology d1, which contains only D.  By
# training this topology, we update parameters of D.  Note
# that d1 shares parameters with d0.
d1 = paddle.layer.should_be_true(D(paddle.layer.data()))

# Create parameters from a list of multiple topologies (models) for
# the chance to share parameters between these topologies.
parameters = paddle.parameters.create([d0, d1])

# Iterative training of GAN.
for ...:
    train(d0, parameters, reader=read_from_rng, immutable_parameters={"D"})
    train(d1, parameters, reader=read_from_realistic_images)

# Use d1 for inference:
print "D thinks a batch of images are realistic ", infer(d1, parameters, read_mnist_images)
```


### Summary


The above two programs reveal some important design concerns:

1. Users describe a topology as an expression of layers.  Every layer
   has a *parameter name*.  If users don't specify it explicitly, it is
   automatically generated as a unique name.  By specifying the
   parameter name, users can specify the sharing of parameters between
   layers and even between topologies.

1. `paddle.parameters.create` figures out parameters required by one
   or more topologies from parameter names of layers.  It creates these
   parameters and returns a `ParameterSet` object, which is in essence
   a map from *parameter names* to *parameters* (see the sketch after
   this list).

1. At training and inference time, `paddle.train` and `paddle.infer`
   require both a topology and the parameter set that holds the
   parameters of that topology.  There are several reasons for this:

   1. This prevents users from forgetting to call
      `paddle.parameters.create`.
   1. `paddle.train` needs to know which parameter set to update.
   1. Users could load another (pre-trained) parameter set and use it
      with a topology in `paddle.infer`.

1. By specifying the `immutable_parameters` parameter of
   `paddle.train`, we can forbid the update of these parameters.
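
To make the relationship between topologies, parameter names, and
parameters concrete, here is a minimal sketch of what
`paddle.parameters.create` could return.  Only the name `ParameterSet`
comes from the discussion above; the use of NumPy arrays and the
`topology.parameter_names()` helper are hypothetical, not the actual
implementation.

```python
import numpy as np


class ParameterSet(object):
    """A map from parameter names to parameter values (sketch only)."""

    def __init__(self):
        self._params = {}  # parameter name -> numpy array

    def create(self, name, shape):
        # Create each named parameter only once, so that layers (and
        # even different topologies) using the same name share the
        # same array.
        if name not in self._params:
            self._params[name] = np.random.randn(*shape).astype("float32")
        return self._params[name]

    def get(self, name):
        return self._params[name]


def create(topologies):
    # Sketch of paddle.parameters.create: walk one or more topologies,
    # collect the parameter names declared by their layers, and create
    # each named parameter exactly once.
    if not isinstance(topologies, list):
        topologies = [topologies]
    params = ParameterSet()
    for topology in topologies:
        for name, shape in topology.parameter_names():  # hypothetical helper
            params.create(name, shape)
    return params
```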


## Reader

Not all programming frameworks allow users to define I/O functions.
An example is Google MapReduce, which can only read from text,
SSTable, and RecordIO files.  Hadoop MapReduce allows users to define
readers and writers by deriving from base classes `Reader` and
`Writer`.  The former is less flexible but also less error-prone.  We
decided to give users the flexibility to define their own readers.
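
For reference in the open questions below, a user-defined reader could
be as simple as a Python generator that yields one sample at a time.
The data file name and the choice of yielding a dictionary keyed by
data-layer name are assumptions of this sketch; whether that is the
right protocol is exactly what the questions below are about.

```python
def read_data():
    # A hypothetical user-defined reader: a plain generator that yields
    # one sample per iteration.  Here each sample is a dictionary keyed
    # by data-layer name ("A", "B", "Q"), which is one possible answer
    # to the questions below, not a settled design.
    with open("ranking_data.txt") as f:  # hypothetical data file
        for line in f:
            a, b, q = line.rstrip("\n").split("\t")
            yield {"A": a, "B": b, "Q": q}
```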


There are some open questions here:

1. **Should a reader return a Python dictionary?**

1. **How to map multiple outputs from a reader to multiple data layers?**

1. **How to easily compose some existing readers to read more data and
   feed a topology with more data layers?**


## Training

The recommended way to train a model is to call `paddle.train`,
which simply uses `paddle.trainer.Default`, a global variable of
type `paddle.trainer.SGD`.  Equivalently, we can do

```python
opt = paddle.trainer.SGD(..., paddle.updater.Adam(...))
opt.train(topology, parameters, reader=read, ...)
```

### Updater

Please be aware that a trainer can accept an updater as its data
member, where an updater is a class derived from
`paddle.trainer.Updater`.  This is to make it easier to customize
trainers, as discussed
[here](https://github.com/PaddlePaddle/Paddle/issues/1319).
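
As an illustration, a customized updater might look like the sketch
below.  The base class `paddle.trainer.Updater` is named above, but the
constructor arguments, the `update` hook, and its signature are
assumptions of this sketch rather than the actual interface.

```python
import paddle


class ScaledSGDUpdater(paddle.trainer.Updater):
    """Plain SGD with a global gradient scale (hypothetical example)."""

    def __init__(self, learning_rate=0.01, scale=1.0):
        self.learning_rate = learning_rate
        self.scale = scale

    def update(self, parameter, gradient):
        # Assumed hook: called by the trainer for every parameter after
        # each iteration; updates the parameter array in place.
        parameter -= self.learning_rate * self.scale * gradient


# The trainer then takes the updater as its data member, e.g.
#   opt = paddle.trainer.SGD(..., ScaledSGDUpdater(learning_rate=0.01))
```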

### Event Handler

`paddle.train` and `paddle.trainer.XXX.train` take an optional
parameter `event_handler`, which should be either `None` or a function
that handles the following events:

1. BeginTraining
1. EndTraining
1. BeginIteration
1. EndIteration
1. BeginPass
1. EndPass

where EndPass is sent if and only if the reader yields
`end_pass=True`.

An example is as follows:

```python
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        print paddle.test(...)

paddle.train(topology, parameters, reader, event_handler)
```

If we are writing a PaddlePaddle program in and for IPython/Jupyter,
we can use matplotlib in the event handler to plot a curve of
cost/error versus iterations, as shown
[here](https://blog.dominodatalab.com/interactive-dashboards-in-jupyter/).
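
A minimal sketch of such a plotting handler, building on the example
above, might look like the following.  It assumes matplotlib is
available in the notebook and that the `EndIteration` event exposes the
current cost as `event.cost`; neither assumption is specified by this
document.

```python
import matplotlib.pyplot as plt

costs = []

def plotting_event_handler(event):
    # Assumed: event.cost holds the cost of the iteration that just ended.
    if isinstance(event, paddle.event.EndIteration):
        costs.append(event.cost)
        plt.clf()
        plt.plot(costs)
        plt.xlabel("iteration")
        plt.ylabel("cost")
        plt.show()

paddle.train(topology, parameters, reader, plotting_event_handler)
```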

### Distributed Training

If a user wants to do distributed training on a cluster, s/he should
call `paddle.dist_train` and provide access tokens to the cluster as
parameters.

For example, if the user has a TLS certificate that allows access to a
Kubernetes cluster, s/he should be able to call

```python
paddle.dist_train(topology,
                  parameters,
                  trainer=paddle.trainer.SGD(...,
                                             paddle.updater.Adam(...)),
                  reader=read,
                  k8s_user="yi",
                  k8s_token="kube_cluster_tls.pem",
                  k8s_job="hello",
                  num_parameter_servers=15)
```

The pseudo code of `paddle.dist_train` is as follows:

```python
def dist_train(topology, parameters, trainer, reader, ...):
    if os.getenv("KUBERNETES_SERVICE_HOST") is None:
        image_name = k8s_user + '/' + k8s_job
        docker_build(image_name)
        docker_push()
        kube_ctrl_start_job(image_name, k8s_user, k8s_token)
    else:
        rank = kube_list_containers_in_job_and_return_current_containers_rank()
        if rank == 0:
            master()
        elif rank < 15:
            parameter_server()
        else:
            trainer.train(topology, parameters, reader=reader)
```

Please be aware that if a process is running on the Kubernetes
cluster, it will have some environment variables pre-defined.

If `dist_train` doesn't see these environment variables, it knows
that it's running on a user's personal computer, and it should work as a
*launcher*.  Otherwise, it knows that it's running on the cluster and
needs to figure out its role as either the master, a trainer, or a
parameter server.