Optimizing elementwise_add for CPU with MKL
Created by: tpatejko
I am working on optimizing the `elementwise_add` operator for CPU. The operator adds two tensors `x` and `y` element by element and stores the result in tensor `z`. I am currently focusing on the case when both operands `x` and `y` have equal dimensions.
The optimization uses the MKL VML `v?Add` operation, which performs elementwise addition:
https://software.intel.com/en-us/mkl-developer-reference-c-v-add
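For reference, here is a minimal standalone sketch of what the VML call looks like for single precision (buffer sizes and variable names are just for illustration):

```cpp
#include <mkl.h>
#include <vector>

int main() {
  const MKL_INT n = 8;
  std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);

  // VML elementwise add: z[i] = x[i] + y[i] for i in [0, n)
  vsAdd(n, x.data(), y.data(), z.data());

  return 0;  // every element of z now holds 3.0f
}
```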
When `elementwise_add` is performed on GPU, and/or `x` and `y` have different dimensions, the algorithm falls back to the default implementation.
To implement the optimization, I extended the interface of the PaddlePaddle BLAS code (https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/math/blas.h) with two operations: VADD, which performs the elementwise add with the VML `vAdd` routine, and VCOPY, which copies one vector into another using the BLAS level 1 copy routine (`cblas_?copy`). For the non-MKL case, I implement VADD with VCOPY followed by the already available SAXPY routine.
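In isolation, the non-MKL path looks roughly like this (a sketch against plain CBLAS rather than the actual PaddlePaddle wrappers; the function name `vadd_fallback` is just for illustration):

```cpp
#include <cblas.h>

// Fallback elementwise add without VML:
// 1) copy y into z            (BLAS level 1 ?copy)
// 2) accumulate x into z      (?axpy: z = alpha * x + z)
void vadd_fallback(int n, const float* x, const float* y, float* z) {
  cblas_scopy(n, y, 1, z, 1);        // z = y
  cblas_saxpy(n, 1.0f, x, 1, z, 1);  // z = 1.0 * x + z
}
```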
Is it OK with you to extend the interface of the BLAS routines in PaddlePaddle for CPU? The algorithm is currently as follows. What do you think about it?
```cpp
auto* x = ctx.Input<T>("X");
auto* y = ctx.Input<T>("Y");
auto* z = ctx.Output<T>("Z");

if (ctx.is_cpu_place() && x->dims() == y->dims()) {
  // Treat x, y and z as flat, contiguous 1-D arrays.
  flatten(x);
  flatten(y);
  flatten(z);
  if (MKL_is_used()) {
    // MKL path: single VML call, z = x + y.
    VADD(x->numel(), x, y, z);
  } else {
    // SAXPY implements y = alpha * x + y,
    // so the content of y is first copied to z
    // and then x is added to z.
    VCOPY(y, z);
    SAXPY(x->numel(), 1.0 /*alpha*/, x, z);
  }
} else {
  // fall back to the default implementation
}
```