Created by: baojun-nervana
- upgraded nGraph version
  - to be compatible with mkldnn v0.20
  - to introduce more fusions for BERT
- simplified nGraph engine
  - simplified input/output: reduced from 4 states (full/partial training/test) to 2 states (training and test); this also reduces the number of outputs and the cost of nGraph/Paddle I/O
  - restricted the nGraph engine to post-feed ops (pre-feed op support is unlikely to have much performance benefit)
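
For illustration only, the post-feed restriction can be pictured as choosing the span of ops that follows the last feed op. This is a minimal Python sketch, not the engine code in this PR: the function name `post_feed_span` and the plain list of op type strings are hypothetical, and the real engine works on Paddle's op descriptors rather than strings.

```python
def post_feed_span(op_types):
    """Return (start, end) indices of the ops that follow the last feed op.

    `op_types` is a hypothetical, simplified stand-in for a block's op list:
    just the op type names in execution order.
    """
    last_feed = -1
    for i, op_type in enumerate(op_types):
        if op_type == "feed":
            last_feed = i
    # Everything after the last feed op is a candidate for the engine;
    # if there is no feed op at all, the whole list is returned.
    return last_feed + 1, len(op_types)


ops = ["feed", "feed", "mul", "elementwise_add", "relu"]
start, end = post_feed_span(ops)
candidates = ops[start:end]  # ops the engine would be allowed to consider
```

Under these assumptions, ops before or at the last feed op are simply left to the default executor, which matches the idea that pre-feed support adds little.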