The need for a CPU-accelerated implementation of depthwise convolution in MobileNet
Created by: NHZlX
Depthwise Convolution
Background
- MobileNet is now widely used because of its small model size (~12M for the 1.0 MobileNet) and its good performance on many tasks (classification, detection, etc.). As its name suggests, it is widely used in embedded systems.
- PaddlePaddle is working on supporting embedded systems; therefore, MobileNet support in Paddle is indispensable.
- MobileNet mainly consists of two operations: depthwise convolution and pointwise convolution. Pointwise convolution is a 1x1 convolution with groups equal to 1, while depthwise convolution is a convolution with groups equal to the number of input channels. Optimizing MobileNet is therefore essentially optimizing depthwise convolution (a sketch of both operations follows this list).
- Although one can build depthwise convolution with ExpandConvLayer in Paddle, it is very slow, especially during training. GPU acceleration of depthwise convolution in Paddle has already been implemented, which speeds up MobileNet training: https://github.com/PaddlePaddle/Paddle/pull/2776
Need to do
- The im2col operation is not necessary for 1x1 convolution (see the GEMM sketch after this list).
- A CPU-accelerated implementation of depthwise convolution is needed; for ARM, NEON acceleration is also needed (see the NEON sketch after this list).
- Fuse batch normalization in Paddle (see the folding sketch after this list).
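On the first point: for a 1x1, stride-1 convolution, the im2col matrix is exactly the input tensor viewed as a C x (H*W) matrix, so the convolution collapses to a single matrix multiply with no data rearrangement. A minimal sketch, with a plain triple loop standing in for a BLAS sgemm call (the function name is illustrative):

```cpp
// For a 1x1, stride-1 convolution the im2col buffer would equal the input
// tensor itself viewed as a C x (H*W) matrix, so it can be skipped:
//
//   out[OC x (H*W)] = filter[OC x C] * in[C x (H*W)]
//
void Conv1x1AsGemm(const float* filter, const float* in, float* out,
                   int OC, int C, int HW) {
  for (int oc = 0; oc < OC; ++oc) {
    for (int i = 0; i < HW; ++i) {
      float sum = 0.f;
      for (int c = 0; c < C; ++c) {
        sum += filter[oc * C + c] * in[c * HW + i];
      }
      out[oc * HW + i] = sum;
    }
  }
}
```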
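On the second point: a natural NEON strategy is to vectorize across the output width, computing four neighbouring output pixels at once, since they all share every filter weight. A minimal sketch of the inner loop for one output row, assuming a 3x3 filter, stride 1, no padding, OW divisible by 4, and `in` pointing at the input row aligned with the output row; a real kernel would also handle tails and padding:

```cpp
#include <arm_neon.h>

// Inner loop for one output row of a 3x3, stride-1 depthwise convolution.
// `in` points at the input row aligned with this output row; OW is assumed
// to be a multiple of 4 (a real kernel would handle the scalar tail).
void DepthwiseRowNeon(const float* in, const float* filt, float* out,
                      int W, int OW) {
  for (int ow = 0; ow < OW; ow += 4) {
    float32x4_t acc = vdupq_n_f32(0.f);
    for (int kh = 0; kh < 3; ++kh) {
      for (int kw = 0; kw < 3; ++kw) {
        // Four neighbouring output pixels share the same filter weight,
        // so one vector load covers all four multiplications.
        float32x4_t v = vld1q_f32(in + kh * W + ow + kw);
        acc = vmlaq_n_f32(acc, v, filt[kh * 3 + kw]);
      }
    }
    vst1q_f32(out + ow, acc);
  }
}
```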
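On the third point: at inference time, batch normalization can be folded into the preceding convolution, because BN computes y = gamma * (z - mean) / sqrt(var + eps) + beta on the convolution output z, which is just a per-output-channel scale and shift of the weights and bias. A minimal sketch, assuming per-output-channel BN statistics (the function name is illustrative, not Paddle's API):

```cpp
#include <cmath>
#include <vector>

// Fold batch normalization into the preceding convolution for inference.
// Scaling each output channel's weights by  s = gamma / sqrt(var + eps)
// and shifting the bias makes the BN layer a no-op at run time.
void FuseBatchNorm(std::vector<float>* weight,  // OC x (weights per channel)
                   std::vector<float>* bias,    // OC; zeros if conv had no bias
                   const std::vector<float>& gamma,
                   const std::vector<float>& beta,
                   const std::vector<float>& mean,
                   const std::vector<float>& var,
                   int OC, float eps = 1e-5f) {
  int per_channel = static_cast<int>(weight->size()) / OC;
  for (int oc = 0; oc < OC; ++oc) {
    float s = gamma[oc] / std::sqrt(var[oc] + eps);
    for (int i = 0; i < per_channel; ++i) {
      (*weight)[oc * per_channel + i] *= s;
    }
    (*bias)[oc] = beta[oc] + s * ((*bias)[oc] - mean[oc]);
  }
}
```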