oneDNN support for DyGraph mode
Created by: wojtuss
With the dynamic graph (DyGraph) mode available in PaddlePaddle, a question arises whether the oneDNN library (previously named MKL-DNN, then DNNL) can be used to speed up the execution of dynamic graphs. Several factors have to be considered when adding oneDNN support to DyGraph mode:
- Our understanding is that DyGraph mode is most useful during topology design and model training, as it allows a more in-depth analysis of the topology one is working on. For inference workloads, static graphs are preferable as they offer more performance optimization options.
- oneDNN-based kernels present in PaddlePaddle employ a primitives caching mechanism which serves two main purposes:
  i. it allows reusing forward primitives in subsequent batch iterations in inference mode, avoiding the overhead of recreating the primitives,
  ii. it allows sharing certain objects (oneDNN forward primitive descriptors) between the forward and backward passes, which is required for creating backward oneDNN primitives during training.
  Assuming oneDNN support for DyGraph in training, purpose ii. makes some form of caching inevitable.
- Using the current caching mechanism in DyGraph mode runs into problems with cache keys. Primitives are stored in the cache map under keys constructed from, among other things, input/output variable names. In DyGraph mode the variable names change between batch iterations, preventing the primitives from being reused (see the sketch that follows this list). Either a predictable naming convention for DyGraph mode or a new cache key creation algorithm would be required. The latter is not easy to come up with and would probably involve architectural changes and a considerable amount of work to update all the operators.
- oneDNN support for any operator in training (in any mode) requires both the forward and backward kernels to be implemented using oneDNN. As our past focus for oneDNN-based optimizations was on inference, the current support for backward oneDNN kernels is rather scarce, and not many operators are enabled for training yet.
- Most of the oneDNN-based grad kernels enabled so far did not prove to be much (if at all) faster than the native grad kernels.
- A dynamic graph can also be used in inference mode. However, if performance matters, it is most likely transformed into a static graph first and then executed. The point of transformation from a dynamic graph into a static one seems the most suitable place for enabling oneDNN optimizations.
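To make the cache key problem concrete, below is a minimal Python sketch of how name-dependent keys defeat primitive reuse in DyGraph mode. It is an illustration only, not PaddlePaddle's actual C++ caching code; the key layout, function names, and variable names are hypothetical.

```python
# Minimal sketch of a name-dependent primitive cache (illustration only,
# not PaddlePaddle's actual C++ implementation).
primitive_cache = {}


def create_onednn_primitive(op_type, input_shape):
    # Stands in for the expensive oneDNN primitive creation step.
    return object()


def get_or_create_primitive(op_type, input_shape, in_name, out_name):
    # Hypothetical key layout: the key includes input/output variable names.
    key = f"{op_type}-{input_shape}-{in_name}-{out_name}"
    if key not in primitive_cache:
        primitive_cache[key] = create_onednn_primitive(op_type, input_shape)
    return primitive_cache[key]


# Static graph: variable names are stable across iterations, so the primitive
# is created once and then reused.
for batch in range(3):
    get_or_create_primitive("conv2d", (1, 3, 224, 224),
                            "conv2d_0.tmp_0", "conv2d_0.tmp_1")
print(len(primitive_cache))  # 1 entry, primitive reused

# DyGraph: temporary variable names differ in every iteration, so every
# iteration misses the cache and creates a new primitive.
for batch in range(3):
    get_or_create_primitive("conv2d", (1, 3, 224, 224),
                            f"generated_var_{2 * batch}",
                            f"generated_var_{2 * batch + 1}")
print(len(primitive_cache))  # 4 entries, nothing reused
```

Any fix would have to either make DyGraph variable names predictable or drop them from the key entirely, which is what makes the change non-trivial across all oneDNN-enabled operators.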
In view of the foregoing, we have concerns about the reasonableness of adding oneDNN support to DyGraph mode for training; a more viable prospect seems to be enabling oneDNN for inference after dynamic graphs have been converted into static ones (as sketched below).
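For reference, here is a minimal sketch of that dynamic-to-static path, assuming the Paddle 2.x dygraph-to-static API (paddle.jit.to_static / paddle.jit.save); the network, layer sizes, and file names are placeholders chosen only for illustration.

```python
import paddle
from paddle import nn


class SimpleNet(nn.Layer):
    """A trivial DyGraph model used only to illustrate the conversion."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(32, 10)

    def forward(self, x):
        return self.fc(x)


model = SimpleNet()
model.eval()

# Convert the dynamic graph into a static program and save it; the saved
# program is what the inference engine consumes and where oneDNN graph
# passes and kernels could be applied.
static_model = paddle.jit.to_static(
    model,
    input_spec=[paddle.static.InputSpec(shape=[None, 32], dtype="float32")])
paddle.jit.save(static_model, "simple_net")

# At inference time the saved static model is loaded through the native
# inference API, where oneDNN is switched on with the existing
# Config.enable_mkldnn() option.
config = paddle.inference.Config("simple_net.pdmodel", "simple_net.pdiparams")
config.enable_mkldnn()
predictor = paddle.inference.create_predictor(config)
```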
Please share Baidu's perspective on this issue. Which use case scenarios would you like oneDNN support to help with the most?