Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into fix_im2seq

f2a32ddd · wanghaoshuang · 1234b8b4 · 95853fc1 · f2a32ddd · f2a32ddd
6 changed files
--- a/doc/design/dist_refactor/parameter_server.md
+++ b/doc/design/dist_refactor/parameter_server.md
@@ -9,16 +9,16 @@ different purposes.
 ## Background
-The previous implementations of the parameter server does not run a
+The previous implementations of the parameter server do not run a
 fluid sub-program. Parameter initialization, optimizer computation, network
 communication and checkpointing are implemented twice on both the
-trainer and the parameter server.
+trainer as well as the parameter server.
-It would be great if we can write code once and use them on both the
+It would be great if we can write code once and use them on both: the
-trainer and the parameter server: reduces code duplication and
+trainer and the parameter server, since this reduces code duplication and
-improves extensibility. Given that after the current refactor, we are
+improves extensibility. Given that after the current refactoring, we are
-representing everything as a computing graph on the
+representing everything as a computation graph on the
-trainer. Representing everything as a computing graph on the parameter
+trainer. Representing everything as a computation graph on the parameter
 server becomes a natural extension.
 ## Design
@@ -30,9 +30,9 @@ into sub-programs to be scheduled on different nodes with the following
 steps:
 1. OP placement: the OPs will be placed on different nodes according
-   to heuristic that minimizes estimated total computation
+   to a heuristic that minimizes the estimated total computation
   time. Currently we will use a simple heuristic that puts parameter
-   varable on parameter server workers and everything else on trainer
+   variable on parameter server workers and everything else on trainer
   workers.
 1. Add communication OPs to enable the communication between nodes.
@@ -47,22 +47,22 @@ After converting:
 <img src="src/dist-graph.png" width="700"/>
-1. The parameter variable W and it's optimizer program are placed on the parameter server.
+1. The parameter variable W and its optimizer program are placed on the parameter server.
 1. Operators are added to the program.
   - *Send* sends data to the connected *Recv* operator.  The
 	 scheduler on the receive node will only schedule *Recv* operator
 	 to run when the *Send* operator has ran (the *Send* OP will mark
 	 the *Recv* OP runnable automatically).
-   - *Enueue* enqueues the input variable, it can block until space
+   - *Enqueue* enqueues the input variable, it can block until space
     become available in the queue.
   - *Dequeue* outputs configurable numbers of tensors from the
-     queue. It will block until the queue have the required number of
+     queue. It will block until the queue has the required number of
     tensors.
 ### Benefits
-- Model parallelism become easier to implement: it's an extension to
+- Model parallelism becomes easier to implement: it is an extension to
  the trainer - parameter server approach. We can have several "Transpilers"
  to achieve different goals.
 - User-defined optimizer is easier to add - user can now express it as
@@ -72,22 +72,22 @@ After converting:
 ### Challenges
-- It's important to balance the parameter shards of on multiple
+- It is important to balance the parameter shards on multiple
-  parameter server. If a single parameter is very big (some
+  parameter servers. If a single parameter is very big (for example: some
  word-embedding, fully connected, softmax layer), we need to
  automatically partition the single parameter onto different
  parameter servers when possible (only element-wise optimizer depends
  on the parameter variable).
-- In the "Aync SGD" figure, the "W" variable on the parameter server
+- In the "Async SGD" figure, the "W" variable on the parameter server
-  could be read and wrote concurrently. See
+  could be read and written concurrently. See
  [here](https://github.com/PaddlePaddle/Paddle/pull/6394) for more
-  details about concurrent program in fluid.
+  details about concurrent program in Fluid.
 ### Discussion
 - Can the Enqueue OP be implemented under our current tensor design
-  (puts the input tensor into the queue tensor)?
+  (put the input tensor into the queue tensor)?
-- *Dequeue* OP will have variable numbers of output (depends on the
+- *Dequeue* OP will have variable numbers of output (depending on the
  `min_count` attribute), does our current design support it? (similar
  question for the *Add* OP)
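
Reading aid for the *Enqueue* / *Dequeue* semantics described in the parameter_server.md hunk above: a minimal Python sketch, not Fluid code. It uses the standard library's `queue.Queue` as a stand-in for the queue tensor. The helper names `enqueue` and `dequeue`, the capacity of 2, and the string "tensors" are illustrative assumptions; only `min_count` comes from the document.

```python
import queue
import threading


def enqueue(q, tensor):
    """Add one item; blocks while the bounded queue is full."""
    q.put(tensor)


def dequeue(q, min_count):
    """Return min_count items; blocks until that many have arrived.

    Simplification: items are taken one at a time as they arrive,
    rather than waiting for all of them to be present first.
    """
    return [q.get() for _ in range(min_count)]


# Capacity 2 is an arbitrary choice for the illustration.
q = queue.Queue(maxsize=2)
results = []

# The consumer asks for three items, more than the queue can hold at once,
# so the producer and consumer have to interleave.
consumer = threading.Thread(target=lambda: results.append(dequeue(q, 3)))
consumer.start()

for grad in ["g0", "g1", "g2"]:
    enqueue(q, grad)  # may block until the consumer frees a slot

consumer.join()
print(results)
```

With one producer and one consumer the FIFO order is preserved, so the consumer receives the three items in the order they were enqueued.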

--- a/doc/howto/optimization/cpu_profiling.md
+++ b/doc/howto/optimization/cpu_profiling.md
@@ -60,8 +60,7 @@ each column is as follows:
 | column | meaning |
 | --- | --- |
 | ncalls | the number of calls into a function |
-| tottime | the total execution time of the function, not including the
- execution time of other functions called by the function |
+| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
 | percall | tottime divided by ncalls |
 | cumtime | the total execution time of the function, including the execution time of other functions being called |
 | percall | cumtime divided by ncalls |
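
To see the columns from the table above in an actual report, the following sketch profiles two toy functions with Python's standard `cProfile` and `pstats` modules. The functions `inner` and `outer` are hypothetical and exist only to make `tottime` and `cumtime` differ: `outer` does little work itself, so its `tottime` is small while its `cumtime` includes the time spent in `inner`.

```python
import cProfile
import io
import pstats


def inner(n):
    # Does the actual work, so most time is attributed here as tottime.
    return sum(i * i for i in range(n))


def outer():
    # Mostly delegates to inner(): low tottime, high cumtime.
    return [inner(10000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.runcall(outer)

# Render the report into a string instead of stdout.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```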

--- a/paddle/gserver/layers/PriorBox.cpp
+++ b/paddle/gserver/layers/PriorBox.cpp
@@ -69,7 +69,7 @@ bool PriorBoxLayer::init(const LayerMap& layerMap,
  if (maxSize_.size() > 0) CHECK_EQ(minSize_.size(), maxSize_.size());
  // flip aspect ratios
-  for (int index = 0; index < tmp.size(); index++) {
+  for (unsigned index = 0; index < tmp.size(); index++) {
    real ar = tmp[index];
    if (fabs(ar - 1.) < 1e-6) continue;
    aspectRatio_.push_back(ar);

--- a/paddle/operators/ctc_align_op.h
+++ b/paddle/operators/ctc_align_op.h
@@ -51,7 +51,7 @@ class CTCAlignKernel : public framework::OpKernel<T> {
      T prev_token = -1;
      for (size_t i = input_lod[level][seq_idx];
           i < input_lod[level][seq_idx + 1]; ++i) {
-        if (input_data[i] != blank &&
+        if ((unsigned)input_data[i] != blank &&
            !(merge_repeated && input_data[i] == prev_token)) {
          output_data[output_idx] = input_data[i];
          ++output_idx;
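
The casts added in the C++ hunks silence signed/unsigned comparison warnings. As a hedged illustration of what such a cast does to a value, the Python sketch below uses `ctypes.c_uint32` to mimic a conversion to a 32-bit unsigned integer. The 32-bit width is an assumption (it is the usual size of `unsigned` on common platforms), and `as_uint32` is a hypothetical helper. The point is that non-negative values compare as before, while a negative value such as the `-1` used for `prev_token` would wrap to a large positive number if it were cast.

```python
import ctypes


def as_uint32(value):
    """Mimic a C conversion of an integer to a 32-bit unsigned value."""
    return ctypes.c_uint32(value).value


blank = 5

# A non-negative token is unchanged by the conversion.
token_is_blank = as_uint32(5) == blank

# A negative value wraps around modulo 2**32.
wrapped = as_uint32(-1)

print(token_is_blank, wrapped)
```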

--- a/paddle/operators/sequence_reshape_op.h
+++ b/paddle/operators/sequence_reshape_op.h
@@ -35,7 +35,7 @@ class SequenceReshapeKernel : public framework::OpKernel<T> {
    PADDLE_ENFORCE_EQ(in_lod.size(), 1UL,
                      "Only support one level sequence now.");
    PADDLE_ENFORCE_EQ(
-        in_dims[0], in_lod[0].back(),
+        (uint64_t)in_dims[0], in_lod[0].back(),
        "Inconsistent size between X.shape[0] and X.lod()[0].back().");
    auto in_lod_l0 = in_lod[0];

--- a/python/paddle/v2/image.py
+++ b/python/paddle/v2/image.py
@@ -176,7 +176,6 @@ def resize_short(im, size):
    :param size: the shorter edge size of image after resizing.
    :type size: int
    """
-    assert im.shape[-1] == 1 or im.shape[-1] == 3
    h, w = im.shape[:2]
    h_new, w_new = size, size
    if h > w:
@@ -267,7 +266,7 @@ def random_crop(im, size, is_color=True):
    return im
-def left_right_flip(im):
+def left_right_flip(im, is_color=True):
    """
    Flip an image along the horizontal direction.
    Return the flipped image.
@@ -278,13 +277,15 @@ def left_right_flip(im):
        im = left_right_flip(im)
-    :paam im: input image with HWC layout
+    :param im: input image with HWC layout or HW layout for gray image
    :type im: ndarray
+    :param is_color: whether input image is color or not
+    :type is_color: bool
    """
-    if len(im.shape) == 3:
+    if len(im.shape) == 3 and is_color:
        return im[:, ::-1, :]
    else:
-        return im[:, ::-1, :]
+        return im[:, ::-1]
 def simple_transform(im,
@@ -321,8 +322,9 @@ def simple_transform(im,
    if is_train:
        im = random_crop(im, crop_size, is_color=is_color)
        if np.random.randint(2) == 0:
-            im = left_right_flip(im)
+            im = left_right_flip(im, is_color)
    else:
+        im = center_crop(im, crop_size, is_color)
        im = center_crop(im, crop_size, is_color=is_color)
    if len(im.shape) == 3:
        im = to_chw(im)
@@ -331,8 +333,10 @@ def simple_transform(im,
    if mean is not None:
        mean = np.array(mean, dtype=np.float32)
        # mean value, may be one value per channel 
-        if mean.ndim == 1:
+        if mean.ndim == 1 and is_color:
            mean = mean[:, np.newaxis, np.newaxis]
+        elif mean.ndim == 1:
+            mean = mean
        else:
            # elementwise mean
            assert len(mean.shape) == len(im)
@@ -372,6 +376,6 @@ def load_and_transform(filename,
                 mean values per channel.
    :type mean: numpy array | list
    """
-    im = load_image(filename)
+    im = load_image(filename, is_color)
    im = simple_transform(im, resize_size, crop_size, is_train, is_color, mean)
    return im
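
The image.py changes above can be checked in isolation with NumPy. The sketch below re-implements the patched `left_right_flip` logic in a standalone form and shows the per-channel mean broadcast used for color images in CHW layout. It is an illustration under assumptions, not the module itself: the array shapes and values are made up for the example. The old `else` branch indexed three axes (`im[:, ::-1, :]`), which raises `IndexError` on a 2-D gray image; indexing two axes works for both layouts.

```python
import numpy as np


def left_right_flip(im, is_color=True):
    """Flip an HWC (color) or HW (gray) image along the width axis."""
    if im.ndim == 3 and is_color:
        return im[:, ::-1, :]
    # Two indices are valid for a 2-D gray image.
    return im[:, ::-1]


# Gray image, HW layout: 2 rows, 3 columns.
gray = np.arange(6).reshape(2, 3)
gray_flipped = left_right_flip(gray, is_color=False)

# Color image, HWC layout: 2 x 2 pixels, 3 channels.
color = np.arange(12).reshape(2, 2, 3)
color_flipped = left_right_flip(color, is_color=True)

# Per-channel mean on a CHW image: reshape (3,) to (3, 1, 1) so that
# each channel has its own mean subtracted.
chw = np.ones((3, 2, 2), dtype=np.float32)
mean = np.array([1.0, 2.0, 3.0], dtype=np.float32)
centered = chw - mean[:, np.newaxis, np.newaxis]

print(gray_flipped)
print(centered[:, 0, 0])
```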