diff --git a/doc/design/dist_refactor/parameter_server.md b/doc/design/dist_refactor/parameter_server.md
index 1094f06d461275a9ad4034d5e48b39856d967b71..805dd13048d41b995d2a01cda52b2ea33e4bbe1d 100644
--- a/doc/design/dist_refactor/parameter_server.md
+++ b/doc/design/dist_refactor/parameter_server.md
@@ -9,16 +9,16 @@ different purposes.
 
 ## Background
 
-The previous implementations of the parameter server does not run a
+The previous implementations of the parameter server do not run a
 fluid sub-program. Parameter initialization, optimizer computation, network
 communication and checkpointing are implemented twice on both the
-trainer and the parameter server.
+trainer as well as the parameter server.
 
-It would be great if we can write code once and use them on both the
-trainer and the parameter server: reduces code duplication and
-improves extensibility. Given that after the current refactor, we are
-representing everything as a computing graph on the
-trainer. Representing everything as a computing graph on the parameter
+It would be great if we can write code once and use them on both: the
+trainer and the parameter server, since this reduces code duplication and
+improves extensibility. Given that after the current refactoring, we are
+representing everything as a computation graph on the
+trainer. Representing everything as a computation graph on the parameter
 server becomes a natural extension.
 
 ## Design
@@ -30,9 +30,9 @@ into sub-programs to be scheduled on different nodes with the following
 steps:
 
 1. OP placement: the OPs will be placed on different nodes according
-   to heuristic that minimizes estimated total computation
+   to a heuristic that minimizes the estimated total computation
    time. Currently we will use a simple heuristic that puts parameter
-   varable on parameter server workers and everything else on trainer
+   variable on parameter server workers and everything else on trainer
    workers.
 
 1. Add communication OPs to enable the communication between nodes.
@@ -47,22 +47,22 @@ After converting:
 
-1. The parameter variable W and it's optimizer program are placed on the parameter server.
+1. The parameter variable W and its optimizer program are placed on the parameter server.
 1. Operators are added to the program.
    - *Send* sends data to the connected *Recv* operator.  The
      scheduler on the receive node will only schedule *Recv* operator
      to run when the *Send* operator has ran (the *Send* OP will mark
      the *Recv* OP runnable automatically).
-   - *Enueue* enqueues the input variable, it can block until space
+   - *Enqueue* enqueues the input variable, it can block until space
      become available in the queue.
    - *Dequeue* outputs configurable numbers of tensors from the
-     queue. It will block until the queue have the required number of
+     queue. It will block until the queue has the required number of
      tensors.
 
 ### Benefits
 
-- Model parallelism become easier to implement: it's an extension to
+- Model parallelism becomes easier to implement: it is an extension to
   the trainer - parameter server approach. We can have several
   "Transpilers" to achieve different goals.
 - User-defined optimizer is easier to add - user can now express it as
@@ -72,22 +72,22 @@ After converting:
 
 ### Challenges
 
-- It's important to balance the parameter shards of on multiple
-  parameter server. If a single parameter is very big (some
+- It is important to balance the parameter shards on multiple
+  parameter servers. If a single parameter is very big (for example: some
   word-embedding, fully connected, softmax layer), we need to
   automatically partition the single parameter onto different
   parameter servers when possible (only element-wise optimizer depends
   on the parameter variable).
-- In the "Aync SGD" figure, the "W" variable on the parameter server
-  could be read and wrote concurrently. See
+- In the "Async SGD" figure, the "W" variable on the parameter server
+  could be read and written concurrently. See
   [here](https://github.com/PaddlePaddle/Paddle/pull/6394) for more
-  details about concurrent program in fluid.
+  details about concurrent program in Fluid.
 
 ### Discussion
 
 - Can the Enqueue OP be implemented under our current tensor design
-  (puts the input tensor into the queue tensor)?
-- *Dequeue* OP will have variable numbers of output (depends on the
+  (put the input tensor into the queue tensor)?
+- *Dequeue* OP will have variable numbers of output (depending on the
   `min_count` attribute), does our current design support it? (similar
   question for the *Add* OP)
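Illustrative sketch (not part of the patch above): the design doc's step 1 places parameter variables and their optimizer OPs on the parameter server and leaves everything else on the trainer, and step 2 adds communication OPs at the cut. The following minimal Python rendering shows only that placement rule; the `Op` tuple, the `split_program` helper, and the `W@GRAD` naming are hypothetical illustrations, not Fluid's actual transpiler API.

```python
# Conceptual sketch of the OP-placement heuristic described in the design
# doc. Names (Op, split_program, "W@GRAD") are hypothetical and do not
# correspond to Fluid's real transpiler implementation.
from collections import namedtuple

Op = namedtuple("Op", ["type", "inputs", "outputs"])


def split_program(ops, param_names):
    """Assign every OP to either the trainer or the parameter server."""
    trainer_ops, pserver_ops = [], []
    for op in ops:
        # Heuristic: an OP that writes a parameter variable (e.g. the SGD
        # update) runs on the parameter server; everything else stays on
        # the trainer.
        if any(out in param_names for out in op.outputs):
            pserver_ops.append(op)
        else:
            trainer_ops.append(op)
    # Communication OPs at the boundary: the trainer sends the gradient,
    # the parameter server receives it before running the optimizer OP.
    trainer_ops.append(Op("send", ["W@GRAD"], []))
    pserver_ops.insert(0, Op("recv", [], ["W@GRAD"]))
    return trainer_ops, pserver_ops


ops = [
    Op("mul", ["X", "W"], ["Y"]),
    Op("mul_grad", ["X", "W", "Y@GRAD"], ["W@GRAD"]),
    Op("sgd", ["W", "W@GRAD"], ["W"]),  # optimizer update, writes parameter W
]
trainer_prog, pserver_prog = split_program(ops, param_names={"W"})
print([op.type for op in trainer_prog])  # ['mul', 'mul_grad', 'send']
print([op.type for op in pserver_prog])  # ['recv', 'sgd']
```

The sketch deliberately ignores Enqueue/Dequeue and parameter sharding; it only makes the "cut the graph at the parameter variables, then add communication OPs" idea concrete.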
diff --git a/doc/howto/optimization/cpu_profiling.md b/doc/howto/optimization/cpu_profiling.md
index 1775374cf6e518586c28bbd8e04946c74df7e4c5..368af40cc7308cf6f4c609361078fe3ba02213ed 100644
--- a/doc/howto/optimization/cpu_profiling.md
+++ b/doc/howto/optimization/cpu_profiling.md
@@ -60,8 +60,7 @@ each column is as follows:
 | column | meaning |
 | --- | --- |
 | ncalls | the number of calls into a function |
-| tottime | the total execution time of the function, not including the
-  execution time of other functions called by the function |
+| tottime | the total execution time of the function, not including the execution time of other functions called by the function |
 | percall | tottime divided by ncalls |
 | cumtime | the total execution time of the function, including the execution time of other functions being called |
 | percall | cumtime divided by ncalls |
diff --git a/paddle/gserver/layers/PriorBox.cpp b/paddle/gserver/layers/PriorBox.cpp
index 337b9ba7bc0fc4e4bb80ee7b248d934f111379d5..8faf032f550836579522016b4fff3db7e94746e3 100644
--- a/paddle/gserver/layers/PriorBox.cpp
+++ b/paddle/gserver/layers/PriorBox.cpp
@@ -69,7 +69,7 @@ bool PriorBoxLayer::init(const LayerMap& layerMap,
   if (maxSize_.size() > 0) CHECK_EQ(minSize_.size(), maxSize_.size());
 
   // flip aspect ratios
-  for (int index = 0; index < tmp.size(); index++) {
+  for (unsigned index = 0; index < tmp.size(); index++) {
     real ar = tmp[index];
     if (fabs(ar - 1.) < 1e-6) continue;
     aspectRatio_.push_back(ar);
diff --git a/paddle/operators/ctc_align_op.h b/paddle/operators/ctc_align_op.h
index 589413feb3dcbb7fea1f0a878b35d4bf714b5318..fed89aa1e899a2450b315f352b9695056ed13aec 100644
--- a/paddle/operators/ctc_align_op.h
+++ b/paddle/operators/ctc_align_op.h
@@ -51,7 +51,7 @@ class CTCAlignKernel : public framework::OpKernel {
       T prev_token = -1;
       for (size_t i = input_lod[level][seq_idx];
            i < input_lod[level][seq_idx + 1]; ++i) {
-        if (input_data[i] != blank &&
+        if ((unsigned)input_data[i] != blank &&
             !(merge_repeated && input_data[i] == prev_token)) {
           output_data[output_idx] = input_data[i];
           ++output_idx;
diff --git a/paddle/operators/sequence_reshape_op.h b/paddle/operators/sequence_reshape_op.h
index c6f528ab8a73294bb8ee91425f34e44c66f1932c..aaae7ab29281b72848515b80cc60931c13a294c9 100644
--- a/paddle/operators/sequence_reshape_op.h
+++ b/paddle/operators/sequence_reshape_op.h
@@ -35,7 +35,7 @@ class SequenceReshapeKernel : public framework::OpKernel {
     PADDLE_ENFORCE_EQ(in_lod.size(), 1UL,
                       "Only support one level sequence now.");
     PADDLE_ENFORCE_EQ(
-        in_dims[0], in_lod[0].back(),
+        (uint64_t)in_dims[0], in_lod[0].back(),
         "Inconsistent size between X.shape[0] and X.lod()[0].back().");
 
     auto in_lod_l0 = in_lod[0];
diff --git a/python/paddle/v2/image.py b/python/paddle/v2/image.py
index a7bb22a35519b87e196b014056649f3a1bfa504a..e5000e440cc8d822dbd38dce3978d2722d32ebe4 100644
--- a/python/paddle/v2/image.py
+++ b/python/paddle/v2/image.py
@@ -176,7 +176,6 @@ def resize_short(im, size):
     :param size: the shorter edge size of image after resizing.
     :type size: int
     """
-    assert im.shape[-1] == 1 or im.shape[-1] == 3
     h, w = im.shape[:2]
     h_new, w_new = size, size
     if h > w:
@@ -267,7 +266,7 @@ def random_crop(im, size, is_color=True):
     return im
 
 
-def left_right_flip(im):
+def left_right_flip(im, is_color=True):
     """
     Flip an image along the horizontal direction.
     Return the flipped image.
 
         im = left_right_flip(im)
 
-    :paam im: input image with HWC layout
+    :param im: input image with HWC layout or HW layout for gray image
     :type im: ndarray
+    :param is_color: whether input image is color or not
+    :type is_color: bool
     """
-    if len(im.shape) == 3:
+    if len(im.shape) == 3 and is_color:
         return im[:, ::-1, :]
     else:
-        return im[:, ::-1, :]
+        return im[:, ::-1]
 
 
 def simple_transform(im,
@@ -321,8 +322,9 @@ def simple_transform(im,
     if is_train:
         im = random_crop(im, crop_size, is_color=is_color)
         if np.random.randint(2) == 0:
-            im = left_right_flip(im)
+            im = left_right_flip(im, is_color)
     else:
+        im = center_crop(im, crop_size, is_color)
         im = center_crop(im, crop_size, is_color=is_color)
     if len(im.shape) == 3:
         im = to_chw(im)
@@ -331,8 +333,10 @@ def simple_transform(im,
     if mean is not None:
         mean = np.array(mean, dtype=np.float32)
         # mean value, may be one value per channel
-        if mean.ndim == 1:
+        if mean.ndim == 1 and is_color:
             mean = mean[:, np.newaxis, np.newaxis]
+        elif mean.ndim == 1:
+            mean = mean
         else:
             # elementwise mean
             assert len(mean.shape) == len(im)
@@ -372,6 +376,6 @@ def load_and_transform(filename,
         mean values per channel.
     :type mean: numpy array | list
     """
-    im = load_image(filename)
+    im = load_image(filename, is_color)
     im = simple_transform(im, resize_size, crop_size, is_train, is_color, mean)
     return im
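Illustrative sketch (not part of the patch): the `ctc_align_op.h` hunk above only changes a signed/unsigned comparison, but for readers unfamiliar with what the kernel computes, this pure-Python rendering of its alignment rule may help. The semantics are inferred from the condition shown in the hunk, and `ctc_align` is a made-up name: blank tokens are dropped and, when `merge_repeated` is set, a token equal to the immediately preceding one is skipped.

```python
# Pure-Python illustration of the CTC alignment rule (assumed semantics,
# inferred from the condition shown in CTCAlignKernel's inner loop).
def ctc_align(tokens, blank=0, merge_repeated=True):
    output, prev = [], None
    for token in tokens:
        # Keep the token unless it is the blank label or a merged repeat.
        if token != blank and not (merge_repeated and token == prev):
            output.append(token)
        prev = token
    return output


print(ctc_align([0, 1, 1, 0, 2, 2, 2, 0, 3]))  # -> [1, 2, 3]
```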
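Usage sketch (not part of the patch): the `image.py` changes thread an `is_color` flag through `left_right_flip`, `simple_transform`, and `load_and_transform`, so single-channel (HW) images survive the whole pipeline; previously the flip indexed a missing channel axis on 2-D input. The file name below is hypothetical and the shape comments assume `crop_size=48`.

```python
# Usage sketch for the updated paddle.v2.image helpers; "digit.png" is a
# hypothetical grayscale input file.
import paddle.v2.image as image

gray = image.load_image("digit.png", is_color=False)  # HW layout (2-D array)
gray = image.left_right_flip(gray, is_color=False)    # flips along width, stays HW
print(gray.shape)

# load_and_transform now forwards is_color to load_image as well, so loading,
# cropping, and flipping stay consistent for gray images.
im = image.load_and_transform(
    "digit.png", resize_size=64, crop_size=48, is_train=True, is_color=False)
print(im.shape)  # (48, 48) for a gray input; a color input would give (3, 48, 48)
```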