Created by: Superjomn
A memory optimization pass for inference that pre-analyzes memory usage, so no GC is needed at runtime.
It works in the following steps:
- Run inference with several batches of fake data; if the model takes sequence inputs, it is important to cover different sequence lengths.
- During these runs the predictor collects the shapes of all the tensors and persists them to a cache file (see the collection sketch below).
- Run inference formally; the memory optimization pass loads the cache file and splits the tensors into several clusters by their shape change behavior (currently just how the batch size affects them, see the clustering sketch below).
- Make a plan for reusing the memory of the existing tensors; note that a tensor can only reuse tensors in the same cluster (see the reuse-plan sketch below):
  - a fixed-shape tensor can reuse any other fixed-shape tensor,
  - a tensor with LoD enabled can only reuse tensors with the same sequence length,
  - the output of some pool ops changes the first dimension to one (for each sequence), so it can only share memory with other tensors that behave the same way (TODO: analyze all the shape dimensions).
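The warm-up collection step could look roughly like the sketch below. All names here (`predictor`, `make_batch`, the JSON cache layout) are illustrative assumptions, not the actual Paddle API; the point is just that every tensor's shape is recorded once per fake run and dumped to a cache file.

```python
import json
from collections import defaultdict

def collect_shape_cache(predictor, make_batch, cache_path,
                        batch_sizes=(1, 4, 8), seq_lens=(10, 30, 50)):
    """Run fake batches and persist the shape of every tensor per run."""
    records = defaultdict(list)            # tensor name -> one shape per run
    for bs in batch_sizes:
        for seq_len in seq_lens:           # vary sequence length for LoD inputs
            predictor.run(make_batch(bs, seq_len))
            for name, tensor in predictor.all_tensors().items():
                records[name].append(list(tensor.shape))
    with open(cache_path, "w") as f:       # persist to the cache file
        json.dump(records, f)
```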
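A sketch of the clustering step follows. The cluster key below is an assumption about what "shape change behavior" means, not the pass's real criterion: fixed-shape tensors form one cluster, and tensors whose first dimension varies are grouped by the exact sequence of first-dimension values they showed across the warm-up runs, which separates batch-sized tensors, LoD tensors of different sequence lengths, and pool outputs whose first dimension is the number of sequences.

```python
import json
from collections import defaultdict

def cluster_key(shapes):
    # Fixed-shape tensors can all reuse each other, so they share one key.
    if all(s == shapes[0] for s in shapes):
        return ("fixed",)
    # Otherwise key on the observed first dimensions: tensors whose first
    # dim changed in lockstep across the warm-up runs land in one cluster.
    return ("dim0", tuple(s[0] for s in shapes))

def build_clusters(cache_path):
    with open(cache_path) as f:
        records = json.load(f)             # tensor name -> list of shapes
    clusters = defaultdict(list)           # cluster key -> tensor names
    for name, shapes in records.items():
        clusters[cluster_key(shapes)].append(name)
    return clusters
```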
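For the reuse plan itself, a simple greedy scheme is enough to illustrate the idea: walk the tensors in execution order, return a buffer to its cluster's free pool once its owner's last reader has run, and let the next tensor in the same cluster pick it up. `order`, `last_use`, and `cluster_of` are assumed inputs for this sketch, not structures from the actual pass.

```python
from collections import defaultdict

def plan_reuse(order, last_use, cluster_of):
    """order: tensor names in execution order.
    last_use: name -> index of the last op reading the tensor.
    cluster_of: name -> cluster key; reuse never crosses clusters."""
    plan = {}                    # tensor name -> buffer id it will occupy
    free = defaultdict(list)     # cluster key -> buffer ids free for reuse
    busy = []                    # (release step, cluster key, buffer id)
    next_id = 0
    for step, name in enumerate(order):
        # Return buffers whose owning tensors will not be read again.
        still_busy = []
        for release, key, buf in busy:
            if release < step:
                free[key].append(buf)
            else:
                still_busy.append((release, key, buf))
        busy = still_busy
        key = cluster_of[name]
        if free[key]:
            buf = free[key].pop()                  # reuse a dead tensor's buffer
        else:
            buf, next_id = next_id, next_id + 1    # allocate a fresh buffer
        plan[name] = buf
        busy.append((last_use[name], key, buf))
    return plan

# Example: "a" dies before "c" is created, so "c" reuses a's buffer.
print(plan_reuse(["a", "b", "c"],
                 {"a": 1, "b": 2, "c": 2},
                 {"a": ("fixed",), "b": ("fixed",), "c": ("fixed",)}))
# -> {'a': 0, 'b': 1, 'c': 0}
```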
It should work with both LoD and non-LoD models.
Currently, I've tested it with DAM, and it saves about 77.8% of memory.