Created by: tpatejko
This PR optimizes the forward and backward passes of elementwise_add. The forward pass includes:
- MKL VML-based optimization with `v?Add` when MKL/MKLDNN is used
- BLAS-based optimization with `VCopy` and `SAXPY` operations when MKL is disabled
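To illustrate the BLAS-based path, here is a minimal sketch of an element-wise add expressed as a copy followed by an axpy with a scale factor of 1. The helpers `vcopy`, `axpy`, and `elementwise_add_blas_sketch` are hypothetical plain-loop stand-ins written for this example, not the PR's code or a real BLAS binding. The operand order (copy `x`, then accumulate `y`) is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a BLAS level 1 copy: y := x.
template <typename T>
void vcopy(std::size_t n, const T* x, T* y) {
  for (std::size_t i = 0; i < n; ++i) y[i] = x[i];
}

// Hypothetical stand-in for a BLAS level 1 axpy: y := alpha * x + y.
template <typename T>
void axpy(std::size_t n, T alpha, const T* x, T* y) {
  for (std::size_t i = 0; i < n; ++i) y[i] += alpha * x[i];
}

// z = x + y, computed as "copy x into z, then z += 1 * y".
// Assumes both inputs have the same length (no broadcasting).
template <typename T>
std::vector<T> elementwise_add_blas_sketch(const std::vector<T>& x,
                                           const std::vector<T>& y) {
  assert(x.size() == y.size());
  std::vector<T> z(x.size());
  vcopy(x.size(), x.data(), z.data());
  axpy(y.size(), T(1), y.data(), z.data());
  return z;
}
```

With a real BLAS library the two loops would be replaced by the library's copy and axpy routines; the decomposition itself stays the same.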
For the backward pass:
- BLAS level 1 `VCopy` is used to copy the `dx` and `dy` vectors.
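A copy is enough for the backward pass because, for z = x + y with operands of the same shape, the derivative of z with respect to each input is 1, so both input gradients equal the incoming output gradient. The sketch below shows this under that same-shape assumption. The function name `elementwise_add_grad_sketch` and its signature are hypothetical, and `std::copy` stands in for the BLAS copy routine.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Backward pass sketch for z = x + y (same shape, no broadcasting):
// dx := dout and dy := dout, so each gradient is a plain vector copy.
// dx and dy must already be sized to match dout.
template <typename T>
void elementwise_add_grad_sketch(const std::vector<T>& dout,
                                 std::vector<T>& dx,
                                 std::vector<T>& dy) {
  assert(dx.size() == dout.size() && dy.size() == dout.size());
  std::copy(dout.begin(), dout.end(), dx.begin());  // stand-in for a BLAS copy
  std::copy(dout.begin(), dout.end(), dy.begin());  // stand-in for a BLAS copy
}
```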
When integral or float16 types or a GPU device are used, the implementation falls back to the default (generic) `elementwise_add` operation.
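One way such a fallback could be decided is with a compile-time type check combined with a device check. The sketch below is an illustration only: `Device`, `Float16`, and `UsesOptimizedPath` are hypothetical names, and treating the standard floating-point types on CPU as the optimized case is an assumption drawn from the description above, not from the PR's code.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical device tag for this example.
enum class Device { kCPU, kGPU };

// Hypothetical 16-bit float wrapper. As a user-defined type it is not
// a standard floating-point type, so it takes the generic path below.
struct Float16 {
  std::uint16_t bits;
};

// True only for standard floating-point element types on CPU;
// integral types, Float16, and any GPU use select the generic fallback.
template <typename T>
constexpr bool UsesOptimizedPath(Device device) {
  return std::is_floating_point<T>::value && device == Device::kCPU;
}
```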