Created by: jczaja
This PR reimplements the elementwise_add MKL-DNN kernel, replacing the DNNL sum primitive with the binary primitive, which is slightly more efficient and supports in-place execution. The mkldnn inplace pass was updated to support the MKL-DNN elementwise_add.
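To illustrate what in-place execution means here, below is a minimal standalone sketch in plain C++. It does not use the DNNL API or the code from this PR; `add_out_of_place` and `add_in_place` are hypothetical helper names. The first variant writes into a separately allocated output, while the second reuses the first input's buffer as the output, which is what an in-place elementwise add can do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a DNNL call): out-of-place add.
// Allocates a separate output buffer and writes x + y into it.
std::vector<float> add_out_of_place(const std::vector<float>& x,
                                    const std::vector<float>& y) {
  assert(x.size() == y.size());
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = x[i] + y[i];
  }
  return out;
}

// Hypothetical helper (not a DNNL call): in-place add.
// Overwrites x with x + y, so no extra output buffer is allocated.
void add_in_place(std::vector<float>& x, const std::vector<float>& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] += y[i];
  }
}
```

Both variants produce the same values; the in-place one only avoids the extra allocation and the additional memory traffic that comes with it.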
Performance improvement:
- ~2% on ERNIE (B1 Thread 1 latency: 252 -> 245 fp32, 90 -> 87 INT8) on Intel Xeon(R) 6248
- ~1% on BERT (~162 FPS -> ~165 FPS) on Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz