From 5a1f0617459177da6c5d1b52b4d514da253df68e Mon Sep 17 00:00:00 2001
From: Yi Wang
Date: Sun, 12 Feb 2017 21:55:11 -0800
Subject: [PATCH] Update according to discussions in
 https://github.com/PaddlePaddle/Paddle/issues/1315

---
 doc/design/api.md | 293 +++++++++++++++++++---------------------------
 1 file changed, 119 insertions(+), 174 deletions(-)

diff --git a/doc/design/api.md b/doc/design/api.md
index dd4341b32..f66a34fa9 100644
--- a/doc/design/api.md
+++ b/doc/design/api.md
@@ -2,140 +2,148 @@
 ## Ingredients
 
-As the first step of our design, we list important concepts in deep
-learning and try to figure their relationship, as shown below:
+Our design starts from the essence: how can we allow users to express
+and solve their problems with neural networks.  Some essential concepts
+that our API has to provide include:
 
-```
-Model = {topology, parameters}
+1. A *topology* is an expression of *layers*.
 
-Evaluator = {Model*, activations}
-- forward
-- test(cost, ...)
+1. A layer could be any kind of computation, including *cost*.
 
-GradientMachine = {Evaluator*, gradients}
-- backward
+1. Some layers have parameters, some don't.  Most costs don't have
+   parameters.
 
-Optimizer = {GradientMachine*}
-- train(cost, ...)
-- update
-- checkpoint
-```
+1. In some topologies, layers share parameters.  For example,
+   [the network for training a ranking model](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850).
+
+1. At programming time, users specify topologies and the possible
+   sharing of parameters.  PaddlePaddle can figure out and create the
+   parameters required (and possibly shared) by one or more topologies.
+
+
+## Starting from Examples
 
-where the pair of curly braces `{` and `}` indicate *composition*, `*`
-indicates a *reference*, and `-` marks a "class method".
+As a summary of
+[our discussion](https://github.com/PaddlePaddle/Paddle/issues/1315),
+let us present two examples here:
 
-### Model
+### Example 1. Sharing Parameters between Layers
 
-We used to think that parameters are part of the topology (or layers).
-But that is not true because multiple layers could share the same
-parameter matrix.  An example is a network that compares two text
-segments in a semantic space:
+We use the
+[3-branch ranking](https://github.com/PaddlePaddle/Paddle/issues/1311#issuecomment-279121850)
+model in this example.  For your convenience, I copy-paste the model's
+topology as follows:
 
 ```
-          semantic
-text A -> projection ---\
-          layer A        \
-                          cosine
-                          similarity -> output
-                          layer
-          semantic       /
-text B -> projection ---/
-          layer B
+A -> f -\
+Q -> f --> cost
+B -> f -/
 ```
 
-In this network, the two semantic projection layers (A and B) share
-the same parameter matrix.
-
-For more information about our API that specifies topology and
-parameter sharing, please refer to [TODO: API].
-
-
-### Evaluator
-
-Supposed that we have a trained ranking model, we should be able to
-use it in our search engine.  The search engine's Web server is a
-concurrent program so to serve many HTTP requests simultaneously.  It
-doesn't make sense for each of these threads to have its own copy of
-the model because that would duplicate topologies and parameters.
-However, each thread should be able to record layer outputs, i.e.,
-activations, computed from an input, derived from the request.  With
-*Evaluator* that saves activations, we can write the over-simplified
-server program as:
+The following program trains the topology (which includes the cost),
+and then uses a sub-network of the trained topology for inference:
 
 ```python
-m = paddle.model.load("trained.model")
-
-http.handle("/",
-            lambda req:
-                e = paddle.evaluator.create(m)
-                e.forward(req)
-                e.activation(layer="output")) # returns activations of layer "output"
+def f(input):
+    e = paddle.layer.embedding(input, parameter_name="embedding")
+    o = paddle.layer.softmax(e, parameter_name="semantic")
+    return o
+
+# Create 3 topologies (subnets); they share parameters because all
+# corresponding layers have the same parameter names.
+fA = f(paddle.layer.data(input_name="A"))
+fB = f(paddle.layer.data(input_name="B"))
+fQ = f(paddle.layer.data(input_name="Q"))
+
+topology = paddle.cost.less_than(
+    paddle.cost.cross_entropy(fA, fQ),
+    paddle.cost.cross_entropy(fB, fQ))
+
+# Derive the parameters required by the topology and create them.
+parameters = paddle.parameters.create(topology)
+
+# Estimate the parameters used in the topology from data.
+paddle.train(topology, parameters, reader=read_ranking_model_data)
+
+# Inference using fA (or fB or fQ, as they share their parameters).
+[testA, testB, testQ] = read_ranking_model_data()
+print "The semantic vector of testA: ", paddle.infer(fA, parameters, testA)
 ```
 
-### GradientMachine
-
-Similar to the evaluation, the training needs to compute gradients so
-to update model parameters.  Because an [optimizer](#optimizer) might
-run multiple simultaneous threads to update the same model, gradients
-should be separated from the model.  Because gradients are only used
-in training, but not serving, they should be separate from Evaluator.
-Hence the `GradientMachine`.
 
-### Optimizer
+### Example 2. Sharing Parameters between "Models"
 
-None of Model, Evaluator, nor GradientMachine implements the training
-loop, hence Optimizer.  We can define a concurrent optimizer that runs
-multiple simultaneous threads to train a model -- just let each
-thread has its own GradientMachine object.
+We use [GAN](https://github.com/PaddlePaddle/book/tree/develop/gan) in
+this example.  In the following example program, `d0` and `d1`
+correspond to the two networks in the following figure:
 
-Most models should be able to be trained using the
-`paddle.optimizer.SGD` by calling its `train` method.  Many
-customizations to the SGD algorithm happens with the update equation,
-e.g., momentum and the Adam SGD algorithm.  We make `train` calls
-`update` to do an update, so that we can derive a `paddle.optimizer.Adam`
-from `paddle.optimizer.SGD` by overrides only the `update` method.
-
-
-## Programming Interface
-
-A fictive example of PaddlePaddle program looks like the following:
+
 
 ```python
-import paddle
-
-def read(args):
-    f = open_file(args["filename"])
-    mb = read_a_minibatch(f)
-    end_pass = eof(f)
-    if end_pass:
-       f = open_file(args["filename"]) # rewind for reading again
-    yield mb, end_pass
-
-input = paddle.layer.data(...)
-intermediate = paddle.layers.fc(input)
-output = paddle.layer.softmax(intermediate)
-
-model = paddle.model.create(output)
-
-paddle.train(model, data_provider=read)
+def G(input):
+    # Over-simplified example, as G has only one layer:
+    return paddle.layer.fc(input, parameter_name="G")
+
+def D(input):
+    # Again, over-simplified:
+    return paddle.layer.fc(input, parameter_name="D")
+
+# Construct the first topology, which contains both D and G.
+# By learning this topology, we update the parameters of G.
+d0 = paddle.cost.should_be_false(D(G(paddle.layer.data())))
+
+# Construct a second topology d1, which contains only D.  By
+# training this topology, we update the parameters of D.  Note
+# that d1 shares parameters with d0.
+d1 = paddle.cost.should_be_true(D(paddle.layer.data()))
+
+# Create parameters from a list of multiple topologies (models) so
+# that parameters can be shared between these topologies.
+parameters = paddle.parameters.create([d0, d1])
+
+# Iterative training of the GAN.
+for ...:
+    train(d0, parameters, reader=read_from_rng, immutable_parameters={"D"})
+    train(d1, parameters, reader=read_from_realistic_images)
+
+# Use d1 for inference:
+print "D thinks a batch of images are realistic: ", infer(d1, parameters, read_mnist_images)
 ```
 
-This shows some important part of a program:
-1. Define how to read (and augment) data by defining a function, in
-   this example, `read`, that `yields` a minibatch and a boolean flag
-   `eof_of_pass`.
+### Summarization
 
-1. Define the topology, `input`, `intermediate`, and `output` in this
-   example.
-1. Create parameters from the topology thus forms the model by calling
-   `paddel.model.create`.
+The above two programs reveal some important design concerns:
 
-1. Train the model by calling `paddle.train`.
+1. Users describe a topology as an expression of layers.  Every layer
+   has a *parameter name*.  If users don't specify it explicitly, a
+   unique name is generated automatically.  By specifying parameter
+   names, users can specify the sharing of parameters between layers,
+   and even between topologies.
 
+1. `paddle.parameters.create` figures out the parameters required by
+   one or more topologies from the parameter names of their layers.  It
+   creates these parameters and returns a `ParameterSet` object, which
+   is in essence a map from *parameter names* to *parameters* (see the
+   sketch after this list).
 
-### Reader
 
+1. At training and inference time, `paddle.train` and `paddle.infer`
+   require both a topology and the parameter set that holds the
+   parameters of that topology.  There are some reasons for this:
+
+   1. This prevents users from forgetting to call
+      `paddle.parameters.create`.
+   1. `paddle.train` needs to know which parameter set to update.
+   1. Users could load another (pre-trained) parameter set and use it
+      with a topology in `paddle.infer`.
+
+1. By specifying the `immutable_parameters` parameter of
+   `paddle.train`, we can forbid the update of the listed parameters.
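+
+To make the `ParameterSet` idea above concrete, here is a minimal
+sketch of what such an object might look like.  This is illustrative
+only: the NumPy-backed storage and the `create`/`get`/`names` methods
+are assumptions made for this sketch, not the final API.
+
+```python
+import numpy as np
+
+
+class ParameterSet(object):
+    """A sketch: a map from parameter names to parameter values."""
+
+    def __init__(self):
+        self._params = {}  # parameter name -> numpy ndarray
+
+    def create(self, name, shape):
+        # Create each parameter only once, so that layers (or
+        # topologies) using the same parameter name share one ndarray.
+        if name not in self._params:
+            self._params[name] = np.random.uniform(
+                -0.1, 0.1, shape).astype("float32")
+        return self._params[name]
+
+    def get(self, name):
+        return self._params[name]
+
+    def names(self):
+        return list(self._params.keys())
+```
+
+Under this sketch, `paddle.parameters.create` would walk the given
+topologies and call something like `create` once per unique parameter
+name it finds, which is exactly what makes sharing work.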
+
+
+## Reader
 
 Not all programming frameworks allow users to define I/O functions.  An
 example is Google MapReduce, which can only read from text,
@@ -145,78 +153,15 @@ readers and writers by deriving from base classes `Reader` and
 decide to provide the flexibility to users to define their readers.
 
 
-#### A Synthetic Data Reader
-
-Sometimes we want to test a topology and/or a training algorithm using
-synthetic data.  We can do this by defining the reader a synthesizer:
+There are some open questions here; one possible direction is sketched
+after this list:
 
-```python
-def read(args):
-    x = sample_from_uniform(0.0, 1.0)
-    y = sample_from_gauss(2 * x, sigma)
-    yield {x, y}, False # no end-of-file so no end-of-pass
-```
-
-#### A Reader for Online Learning
-
-Readers can also read an infinite data stream, e.g., a log stream from
-a search engine and collected by Kafka:
-
-```python
-def read(args):
-    log_stream = kafka.open_channel(args["kafka channel name"])
-    yeild log_stream.read(), False # no end-of-pass in online learning
-```
-
-### Topology
-
-By default, layers don't have names.  But if we want to refer to a
-layer later some time, for example, when we do serving using the model
-and wants activations/outputs of a layer, we should give it a name.
+1. **Should a reader return a Python dictionary?**
 
-```python
-input = paddle.layer.data(...)
-intermediate = paddle.layer.fc(input, name="inter", ...)
-output = paddle.layer.softmax(intermediate, name="output", ...)
-
-m = paddle.model.create(output)
-e = paddle.evaluator.create(model)
-e.forward(read_an_input()) # compute activations of all layers.
-print e.activations(layer="inter") # retrieve the activations of layer "inter"
-print e.activations(layer="output") # retrieve the activations of layer "output"
-```
-
-#### Sharing Parameters
+1. **How to map multiple outputs from a reader to multiple data layers?**
 
-In [above section](#model) we shows a network whose two layers share
-the same parameter matrix.  To specify such cases, we give "parameter
-names" to layers.  If some layers have the same paraemter names,
-`paddle.model.create` creates a single parameter matrix for these
-layers:
+1. **How to easily compose some existing readers to read more data and
+   feed a topology with more data layers?**
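+
+One possible (and purely illustrative) answer to the questions above:
+let every reader yield a Python dictionary keyed by data-layer name,
+and compose readers by merging those dictionaries.  The reader names
+and the `compose` helper below are assumptions made for this sketch,
+not a proposed API:
+
+```python
+import random
+
+
+def reader_xy():
+    # A synthetic reader: yields minibatches as dicts keyed by the
+    # names of data layers ("x" and "y" here).
+    for _ in range(8):
+        x = random.uniform(0.0, 1.0)
+        yield {"x": [x], "y": [2.0 * x]}
+
+
+def reader_label():
+    # Another reader, feeding a third data layer.
+    for _ in range(8):
+        yield {"label": [random.randint(0, 1)]}
+
+
+def compose(*readers):
+    # Merge the dictionaries yielded by several readers, so that one
+    # topology with several data layers can be fed from several sources.
+    iterators = [iter(r) for r in readers]
+    while True:
+        minibatch = {}
+        for it in iterators:
+            try:
+                minibatch.update(next(it))
+            except StopIteration:
+                return
+        yield minibatch
+
+
+# Each composed minibatch looks like {"x": [...], "y": [...], "label": [...]}
+# and could be mapped to data layers by name, e.g.:
+#   paddle.train(topology, parameters, reader=compose(reader_xy(), reader_label()))
+```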
 
-```python
-text1 = paddle.layer.data(...)
-sematic1 = paddle.layer.fc(text1, ..., parameter_name="sematic_projection")
-text2 = paddle.layer.data(...)
-sematic2 = paddle.layer.fc(text2, ..., parameter_name="sematic_projection")
-out = paddle.layer.cosine(semantic1, semantic2)
-```
-
-We can also share parameter matrices between layers in different
-models.  To do this, we need an additional parameter that refers to a
-model:
-
-```python
-model1_input = paddle.layer.data(...)
-model1_output = paddle.layer.softmax(model1_input, ...,
-                                     parameter_name="a_parameter_matrix")
-model1 = paddle.model.create(model1_output)
-
-# Another model
-model2_semantic = paddle.layer.fc(text2, ...,
-                                  parameter_name="a_parameter_matrix",
-                                  parameter_model=model1)
-```
 
 
 ### Training
@@ -270,7 +215,7 @@ def dist_train():
 Please be aware that if a process is running on the Kubernetes
 cluster, it will have some environment variables pre-defined.
 
-If `dist_train` doesn't see these environment variables, it knowns
+If `dist_train` doesn't see these environment variables, it knows
 that it's running on users' personal computer, and it should work as
 a *launcher*.  Otherwise, it knows that it's running on the cluster and
 need to figure out its role as either the master, or a trainer, or a
-- 
GitLab