Created by: baojun-nervana
This PR includes the following changes:
- Cached the compiled ngraph function instead of the function itself; this gives some performance benefit by avoiding recompilation each time.
- Added basic cache management that clears the cache when it reaches the capacity cap (set to 5 for now); this is necessary for BERT model training.
- Simplified some logic and made the code more concise. The original code contained a cache-off mode for debugging purposes; it was replaced by forced cache clearing.
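
For readers unfamiliar with the pattern, the caching behaviour described above can be sketched roughly as follows. This is a minimal, language-agnostic illustration in Python, not the code in this PR: the class name `CompiledFunctionCache`, the `compile_fn` callback, and the choice to clear just before inserting a new entry are all assumptions made for the example. Only the capacity of 5 comes from the description.

```python
from typing import Any, Callable, Dict, Hashable


class CompiledFunctionCache:
    """Hypothetical sketch: cache compiled functions, clear everything at the cap."""

    def __init__(self, compile_fn: Callable[[Any], Any], capacity: int = 5) -> None:
        self._compile_fn = compile_fn      # assumed stand-in for the backend compile step
        self._capacity = capacity          # cap of 5, as stated in the description
        self._entries: Dict[Hashable, Any] = {}
        self.compile_count = 0             # counts real compilations, for illustration

    def __len__(self) -> int:
        return len(self._entries)

    def clear(self) -> None:
        """Drop every cached entry (also usable to force recompilation when debugging)."""
        self._entries.clear()

    def get(self, key: Hashable, function: Any) -> Any:
        """Return the compiled form of `function`, compiling only on a cache miss."""
        if key in self._entries:
            return self._entries[key]
        # Assumption: the whole cache is cleared once it is full, rather than
        # evicting a single entry as an LRU policy would.
        if len(self._entries) >= self._capacity:
            self.clear()
        compiled = self._compile_fn(function)
        self.compile_count += 1
        self._entries[key] = compiled
        return compiled
```

Under these assumptions, calling `clear()` before every `get()` would reproduce the effect of the removed cache-off mode, since each lookup then misses and recompiles.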