Optimizing elementwise_add for CPU with MKL
Created by: tpatejko
I am working on optimizing the `elementwise_add` operator for CPU. The operator adds two tensors `x` and `y` element by element and stores the result in tensor `z`. I am currently focusing on the case when both operands `x` and `y` have equal dimensions.
The optimization uses the MKL VML `v?Add` operation, which performs elementwise addition:
https://software.intel.com/en-us/mkl-developer-reference-c-v-add
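For reference, here is a minimal standalone sketch of what the VML call looks like for single precision (buffer sizes and variable names are just for illustration):

```cpp
#include <mkl.h>
#include <vector>

int main() {
  const MKL_INT n = 8;
  std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);

  // VML elementwise add: z[i] = x[i] + y[i] for i in [0, n)
  vsAdd(n, x.data(), y.data(), z.data());

  return 0;  // every element of z now holds 3.0f
}
```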
When `elementwise_add` is performed on GPU, and/or `x` and `y` have different dimensions, the algorithm falls back to the default implementation.
To implement the optimization, I extended the interface of the PaddlePaddle BLAS code (https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/math/blas.h) with two operations: VADD, which performs the elementwise add with the VML `vAdd` routine, and VCOPY, which copies one vector into another using the BLAS level 1 copy routine (`cblas_?copy`). For the non-MKL case, I implement VADD with VCOPY followed by the already available SAXPY routine.
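In isolation, the non-MKL path looks roughly like this (a sketch against plain CBLAS rather than the actual PaddlePaddle wrappers; the function name `vadd_fallback` is just for illustration):

```cpp
#include <cblas.h>

// Fallback elementwise add without VML:
// 1) copy y into z            (BLAS level 1 ?copy)
// 2) accumulate x into z      (?axpy: z = alpha * x + z)
void vadd_fallback(int n, const float* x, const float* y, float* z) {
  cblas_scopy(n, y, 1, z, 1);        // z = y
  cblas_saxpy(n, 1.0f, x, 1, z, 1);  // z = 1.0 * x + z
}
```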
Is it OK with you to extend the interface of the BLAS routines in PaddlePaddle for CPU? The algorithm is currently as follows. What do you think about it?
```cpp
auto* x = ctx.Input<T>("X");
auto* y = ctx.Input<T>("Y");
auto* z = ctx.Output<T>("Z");

if (ctx.is_cpu_place() && x->dims() == y->dims()) {
  // Treat x, y and z as flat, contiguous 1-D arrays.
  flatten(x);
  flatten(y);
  flatten(z);
  if (MKL_is_used()) {
    // MKL path: single VML call, z = x + y.
    VADD(x->numel(), x, y, z);
  } else {
    // SAXPY implements y = alpha * x + y,
    // so the content of y is first copied to z
    // and then x is added to z.
    VCOPY(y, z);
    SAXPY(x->numel(), 1.0 /*alpha*/, x, z);
  }
} else {
  // fall back to the default implementation
}
```