Created by: NHZlX
This PR adds an emb_eltwise_layernorm OP. The OP is inference-only, so its backward unit test is disabled. The unit test precision is also adjusted to 1e-4.
The new OP fuses the embedding, eltwise_add, and layernorm OPs into a single OP. With CUDA 10.1 on a P4 card, the inference time of the normal ERNIE model drops from 9.5ms to 8.4ms.
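
For reviewers, here is a rough pure-Python reference of what the fusion computes. It is only a sketch, not the kernel in this PR: the function names, the three-table layout (word, position, sentence), the toy values, and `eps=1e-5` are all assumptions made for illustration. The unfused path materializes each intermediate result, while the fused path accumulates the embedding rows per token and normalizes right away.

```python
import math


def layer_norm(row, scale, bias, eps=1e-5):
    # Normalize one hidden vector, then apply the learned scale and bias.
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * s + b for v, s, b in zip(row, scale, bias)]


def unfused_reference(ids, tables, scale, bias, eps=1e-5):
    # Step 1: one embedding lookup per table (intermediate kept in memory).
    looked_up = [[table[i] for i in id_list] for table, id_list in zip(tables, ids)]
    # Step 2: element-wise add of the looked-up rows (second intermediate).
    summed = []
    for tok in range(len(ids[0])):
        rows = [per_table[tok] for per_table in looked_up]
        summed.append([sum(col) for col in zip(*rows)])
    # Step 3: layer norm over the hidden dimension.
    return [layer_norm(row, scale, bias, eps) for row in summed]


def fused_reference(ids, tables, scale, bias, eps=1e-5):
    # Single pass per token: accumulate the embedding rows, then normalize,
    # without storing the per-table lookups or the summed tensor.
    hidden = len(scale)
    out = []
    for tok in range(len(ids[0])):
        acc = [0.0] * hidden
        for table, id_list in zip(tables, ids):
            row = table[id_list[tok]]
            for h in range(hidden):
                acc[h] += row[h]
        out.append(layer_norm(acc, scale, bias, eps))
    return out


def max_abs_diff(a, b):
    # Largest element-wise difference between two [tokens][hidden] outputs.
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical toy inputs: three embedding tables with hidden size 4.
TABLES = [
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]],  # word
    [[0.01, 0.02, 0.03, 0.04], [0.05, 0.06, 0.07, 0.08]],                # position
    [[1.0, -1.0, 0.5, -0.5], [0.0, 0.25, 0.5, 0.75]],                    # sentence
]
IDS = [[2, 0], [0, 1], [1, 0]]  # IDS[t][i] is the id of token i in table t
SCALE = [1.0, 1.0, 1.0, 1.0]
BIAS = [0.0, 0.0, 0.0, 0.0]

if __name__ == "__main__":
    ref = unfused_reference(IDS, TABLES, SCALE, BIAS)
    fused = fused_reference(IDS, TABLES, SCALE, BIAS)
    print("max abs diff:", max_abs_diff(ref, fused))
```

A comparison like `max_abs_diff(ref, fused) < 1e-4` mirrors the tolerance used by the unit test mentioned above.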
This PR also includes the following optimizations:
- refine inplace_add_relu
- refine fc_eltwise_layernorm (8.4ms -> 7.8ms)

test=develop