Created by: jczaja
This PR provides an MKLDNN-based Softmax op implementation.
Performance and testing:
On the tested models, the MKLDNN Softmax op is roughly 10x faster than the plain CPU version. RNN Search (https://github.com/dzhwinter/benchmark/blob/master/fluid/machine_translation.py) completes training in ~90% of the plain CPU time, and it still converges with the MKLDNN Softmax op enabled.
Notes
- The cross_entropy grad op needed to be updated with code preventing -INF results, in the same way as was previously done in the cross_entropy forward op.
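
To illustrate the note above: the cross-entropy gradient involves dividing by the predicted probability, so a zero probability yields an infinite gradient, just as the forward op's log(0) yields -INF. A minimal sketch of the usual guard (clamping the input to a small epsilon before dividing; the function name, signature, and epsilon value here are hypothetical, not the actual Paddle code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical illustration: for cross entropy, dX = -label / x, which
// becomes -INF when x == 0. Clamping x from below with a small epsilon
// (mirroring the guard the forward op applies before log()) keeps the
// gradient finite.
std::vector<double> cross_entropy_grad(const std::vector<double>& x,
                                       const std::vector<double>& label,
                                       double eps = 1e-12) {
  std::vector<double> dx(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    // std::max(x[i], eps) prevents division by zero.
    dx[i] = -label[i] / std::max(x[i], eps);
  }
  return dx;
}
```

With this clamp, a zero probability produces a large but finite gradient instead of -INF, so training does not blow up with NaN/Inf values.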