Created by: dzhwinter
This PR covers only the single-machine version; some conflicts occur when distributed training is enabled. It contains two main changes:
- The Python memory optimization is reimplemented equivalently on the C++ IR graph. It is turned off by default and can be enabled with "memory_optimize" in pybind.
- Implement an early_delete_pass, which can be enabled with "memory_early_delete" in pybind.
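
For reviewers unfamiliar with the idea, here is a minimal sketch of the kind of liveness analysis that both changes rely on. It is not the code in this PR: the `(name, inputs, outputs)` op tuples, the `sizes` table, the `keep` set, and all function names are hypothetical stand-ins for the IR graph, and the same-size greedy matching is a simplifying assumption.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for an IR graph op node: (op name, input vars, output vars).
Op = Tuple[str, List[str], List[str]]


def last_use_index(ops: List[Op]) -> Dict[str, int]:
    """Map each variable to the index of the last op that reads or writes it."""
    last: Dict[str, int] = {}
    for i, (_, inputs, outputs) in enumerate(ops):
        for var in inputs + outputs:
            last[var] = i
    return last


def early_delete_plan(ops: List[Op], keep: Set[str]) -> Dict[int, List[str]]:
    """For each op index, list the variables that can be freed right after it."""
    plan: Dict[int, List[str]] = {i: [] for i in range(len(ops))}
    for var, i in last_use_index(ops).items():
        if var not in keep:  # fed/fetched variables must survive the run
            plan[i].append(var)
    return plan


def reuse_plan(ops: List[Op], sizes: Dict[str, int], keep: Set[str]) -> Dict[str, str]:
    """Greedy reuse: a new output takes over a dead buffer of the same size."""
    last = last_use_index(ops)
    free_pool: List[str] = []        # buffers whose variable is already dead
    buffer_of: Dict[str, str] = {}   # variable -> buffer it is stored in
    for i, (_, inputs, outputs) in enumerate(ops):
        for var in inputs:
            buffer_of.setdefault(var, var)  # graph inputs own their buffer
        for var in outputs:
            if var in buffer_of:
                continue
            match = next((b for b in free_pool if sizes[b] == sizes[var]), None)
            if match is not None and var not in keep:
                free_pool.remove(match)
                buffer_of[var] = match
            else:
                buffer_of[var] = var
        # Release buffers only after the op has run, so an op's output
        # never shares a buffer with one of its own inputs.
        for var in inputs + outputs:
            if last[var] == i and var not in keep:
                buf = buffer_of[var]
                if buf not in free_pool:
                    free_pool.append(buf)
    return buffer_of
```

In this sketch, a buffer is released only after its last consumer has finished, so an output can never overwrite an input of the same op; that case is the one an inplace pass would handle.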
Model tests for QA: https://github.com/dzhwinter/modeltests
The following follow-up PRs should have been part of this one, but that would make it too large and it would take longer to pass all the tests:
- Add memory optimization for Executor (currently only in ParallelExecutor).
- Use IR memory optimization in C++ inference by adding the memory optimize pass to the inference passes.
- Test performance in inference cases.
- Fix conflicts with distributed training jobs, especially when merge_multi_batch_pass is enabled. [doing]
- Add an inplace pass before the C++ IR memory optimize pass to save more memory.