Merge branch 'develop' of github.com:baidu/Paddle into feature/new_api

176d44ef · Yu Yang · dc7f2031 · a30f6aa9 · 176d44ef · 176d44ef
5 changed file
--- a/.travis.yml
+++ b/.travis.yml
@@ -54,7 +54,9 @@ before_install:
    fi
  - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then paddle/scripts/travis/before_install.osx.sh; fi
  - if [[ "$JOB" == "PRE_COMMIT" ]]; then sudo ln -s /usr/bin/clang-format-3.8 /usr/bin/clang-format; fi
-  - pip install numpy wheel protobuf sphinx recommonmark sphinx_rtd_theme virtualenv pre-commit requests==2.9.2 LinkChecker
+  # Paddle is using protobuf 3.1 currently. Protobuf 3.2 breaks the compatibility. So we specify the python 
+  # protobuf version.
+  - pip install numpy wheel 'protobuf==3.1' sphinx recommonmark sphinx_rtd_theme virtualenv pre-commit requests==2.9.2 LinkChecker
 script:
  - paddle/scripts/travis/main.sh
 notifications:

--- a/doc/design/api.md
+++ b/doc/design/api.md
+# PaddlePaddle Design Doc
+## Ingredients
+As the first step of our design, we list important concepts in deep
+learning and try to figure their relationship, as shown below:
+```
+Model = {topology, parameters}
+Evaluator = {Model*, activations}
+- forward
+- test(cost, ...)
+GradientMachine = {Evaluator*, gradients}
+- backward
+Optimizer = {GradientMachine*}
+- train(cost, ...)
+- update
+- checkpoint
+```
+where the pair of curly braces `{` and `}` indicate *composition*, `*`
+indicates a *reference*, and `-` marks a "class method".
+### Model
+We used to think that parameters are part of the topology (or layers).
+But that is not true because multiple layers could share the same
+parameter matrix.  An example is a network that compares two text
+segments in a semantic space:
+```
+          semantic
+text A -> projection ---\
+          layer A        \
+                          cosine
+                          similarity -> output
+                          layer
+          semantic       /
+text B -> projection ---/
+          layer B
+```
+In this network, the two semantic projection layers (A and B) share
+the same parameter matrix.
+For more information about our API that specifies topology and
+parameter sharing, please refer to [TODO: API].
+### Evaluator
+Supposed that we have a trained ranking model, we should be able to
+use it in our search engine.  The search engine's Web server is a
+concurrent program so to serve many HTTP requests simultaneously.  It
+doesn't make sense for each of these threads to have its own copy of the model because that would duplicate topologies and parameters.
+However, each thread should be able to record layer outputs, i.e.,
+activations, computed from an input, derived from the request.  With
+*Evaluator* that saves activations, we can write the over-simplified
+server program as:
+```python
+m = paddle.model.load("trained.model")
+http.handle("/",
+            lambda req:
+                e = paddle.evaluator.create(m)
+                e.forward(req)
+                e.activation(layer="output")) # returns activations of layer "output"
+```
+### GradientMachine
+Similar to the evaluation, the training needs to compute gradients so
+to update model parameters.  Because an [optimizer](#optimizer) might
+run multiple simultaneous threads to update the same model, gradients
+should be separated from the model.  Because gradients are only used
+in training, but not serving, they should be separate from Evaluator.
+Hence the `GradientMachine`.
+### Optimizer
+None of Model, Evaluator, nor GradientMachine implements the training
+loop, hence Optimizer.  We can define a concurrent optimizer that runs
+multiple simultaneous threads to train a model -- just let each
+thread has its own GradientMachine object.
+Most models should be able to be trained using the
+`paddle.optimizer.SGD` by calling its `train` method.  Many
+customizations to the SGD algorithm happens with the update equation,
+e.g., momentum and the Adam SGD algorithm.  We make `train` calls
+`update` to do an update, so that we can derive a `paddle.optimizer.Adam`
+from `paddle.optimizer.SGD` by overrides only the `update` method.
+## Programming Interface
+A fictive example of PaddlePaddle program looks like the following:
+```python
+import paddle
+def read(args):
+    f = open_file(args["filename"])
+    mb = read_a_minibatch(f)
+    end_pass = eof(f)
+    if end_pass:
+       f = open_file(args["filename"]) # rewind for reading again
+    yield mb, end_pass
+input = paddle.layer.data(...)
+intermediate = paddle.layers.fc(input)
+output = paddle.layer.softmax(intermediate)
+model = paddle.model.create(output)
+paddle.train(model, data_provider=read)
+```
+This shows some important part of a program:
+1. Define how to read (and augment) data by defining a function, in
+   this example, `read`, that `yields` a minibatch and a boolean flag
+   `eof_of_pass`.
+1. Define the topology, `input`, `intermediate`, and `output` in this
+   example.
+1. Create parameters from the topology thus forms the model by calling
+   `paddel.model.create`.
+1. Train the model by calling `paddle.train`.
+### Reader
+Not all programming frameworks allow users to define I/O functions.
+An example is Google MapReduce, which can only read from text,
+SSTable, and RecordIO files.  Hadoop MapReduce allows users to define
+readers and writers by deriving from base classes `Reader` and
+`Writer`.  The former is less flexible but also less error-prone.  We
+decide to provide the flexibility to users to define their readers.
+#### A Synthetic Data Reader
+Sometimes we want to test a topology and/or a training algorithm using
+synthetic data.  We can do this by defining the reader a synthesizer:
+```python
+def read(args):
+    x = sample_from_uniform(0.0, 1.0)
+    y = sample_from_gauss(2 * x, sigma)
+    yield {x, y}, False # no end-of-file so no end-of-pass
+```
+#### A Reader for Online Learning
+Readers can also read an infinite data stream, e.g., a log stream from
+a search engine and collected by Kafka:
+```python
+def read(args):
+    log_stream = kafka.open_channel(args["kafka channel name"])
+    yeild log_stream.read(), False # no end-of-pass in online learning
+```
+### Topology
+By default, layers don't have names.  But if we want to refer to a
+layer later some time, for example, when we do serving using the model
+and wants activations/outputs of a layer, we should give it a name.
+```python
+input = paddle.layer.data(...)
+intermediate = paddle.layer.fc(input, name="inter", ...)
+output = paddle.layer.softmax(intermediate, name="output", ...)
+m = paddle.model.create(output)
+e = paddle.evaluator.create(model)
+e.forward(read_an_input()) # compute activations of all layers.
+print e.activations(layer="inter")  # retrieve the activations of layer "inter"
+print e.activations(layer="output") # retrieve the activations of layer "output"
+```
+#### Sharing Parameters
+In [above section](#model) we shows a network whose two layers share
+the same parameter matrix.  To specify such cases, we give "parameter
+names" to layers.  If some layers have the same paraemter names,
+`paddle.model.create` creates a single parameter matrix for these
+layers:
+```python
+text1 = paddle.layer.data(...)
+sematic1 = paddle.layer.fc(text1, ..., parameter_name="sematic_projection")
+text2 = paddle.layer.data(...)
+sematic2 = paddle.layer.fc(text2, ..., parameter_name="sematic_projection")
+out = paddle.layer.cosine(semantic1, semantic2)
+```
+We can also share parameter matrices between layers in different
+models.  To do this, we need an additional parameter that refers to a
+model:
+```python
+model1_input = paddle.layer.data(...)
+model1_output = paddle.layer.softmax(model1_input, ...,
+                                     parameter_name="a_parameter_matrix")
+model1 = paddle.model.create(model1_output)
+# Another model
+model2_semantic = paddle.layer.fc(text2, ...,
+                                  parameter_name="a_parameter_matrix",
+                                  parameter_model=model1)
+```
+### Training
+The recommended way to training a model is to call `paddle.train`,
+which simply calls `paddle.optimizer.Default`, a global variable of
+type `paddle.optimizer.SGD`.  Equivalently, we can do
+```python
+opt = paddle.optimizer.SGD(...)
+opt.train(model, reader=read, ...)
+```
+#### Distributed Training
+If users want to do distributed training on a cluster, s/he should
+call `paddle.dist_train` and provides access tokens to the cluster as
+a parameter.
+For example, if the user has a TLS certificate that allows him to
+access a Kubernetes cluster, s/he should be able to call
+```python
+paddle.dist_train(model,
+                  reader=read,
+                  optimizer=paddle.optimizer.SGDOptimizer(...),
+                  k8s_user="yi",
+                  k8s_token="kube_cluster_tls.pem",
+                  k8s_job="hello",
+                  num_parameter_servers=15)
+```
+The pseudo code if `paddle.dist_train` is as follows:
+```python
+def dist_train():
+    if os.getenv("KUBERNETES_SERVICE_HOST") == None:
+        image_name = k8s_user + '/' + k8s_job
+        docker_build(image_name)
+        docker_push()
+        kube_ctrl_start_job(image_name, k8s_user, k8s_token)
+    else:
+        rank = kube_list_containers_in_job_and_return_current_containers_rank()
+        if rank == 0:
+            master()
+        elif rank < 15:
+            parameter_server()
+        else:
+            optimizer.train(model, reader=read)
+```
+Please be aware that if a process is running on the Kubernetes
+cluster, it will have some environment variables pre-defined.
+If `dist_train` doesn't see these environment variables, it knowns
+that it's running on users' personal computer, and it should work as a
+*launcher*.  Otherwise, it knows that it's running on the cluster and
+need to figure out its role as either the master, or a trainer, or a
+parameter server.
--- a/paddle/gserver/dataproviders/PyDataProvider2.cpp
+++ b/paddle/gserver/dataproviders/PyDataProvider2.cpp
@@ -647,7 +647,7 @@ public:
      DataBatch& gpuBatch = *batch;
      std::vector<Argument>& gpuArguments = gpuBatch.getStreams();
      gpuArguments.resize(cpuArguments.size());
-      gpuBatch.setSize(size);
+      gpuBatch.setSize(bsize);
      for (size_t i = 0; i < headers_.size(); ++i) {
        gpuArguments[i].resizeAndCopyFrom(
            cpuArguments[i], useGpu_, HPPL_STREAM_1);

--- a/python/paddle/trainer_config_helpers/data_sources.py
+++ b/python/paddle/trainer_config_helpers/data_sources.py
@@ -201,7 +201,7 @@ def define_py_data_sources2(train_list, test_list, module, obj, args=None):
        data.load_data_module = load_data_module
        data.load_data_object = load_data_object
        data.load_data_args = load_data_args
-        data.async_load_data = True
+        data.async_load_data = False
        return data
    define_py_data_sources(

--- a/python/paddle/trainer_config_helpers/tests/configs/protostr/test_split_datasource.protostr
+++ b/python/paddle/trainer_config_helpers/tests/configs/protostr/test_split_datasource.protostr
@@ -19,7 +19,7 @@ model_config {
 data_config {
  type: "py2"
  files: "train.list"
-  async_load_data: true
+  async_load_data: false
  for_test: false
  load_data_module: "a"
  load_data_object: "c"
@@ -58,7 +58,7 @@ opt_config {
 test_data_config {
  type: "py2"
  files: "test.list"
-  async_load_data: true
+  async_load_data: false
  for_test: true
  load_data_module: "b"
  load_data_object: "d"