Created by: dzhwinter
A re-implement of memory transpiler. Main changes
-
add in-place cache strategy. we provide the Reuse tag in OpProto, and memory transpiler will reuse the memory if the op can run in-place.
-
more memory saving will be triggered, not only the shape equally block. Currently, our memory transpiler only reuses the shape equally memory block. I didn't figure out the reason why the previous implement does not fully support the bigger memory block. Obviously, if the memory block is bigger than we needed, we also can reuse it. https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/memory_optimization_transpiler.py#L189 https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/memory_optimization_transpiler.py#L252
-
compute liveness set with a new algorithm. In our transpiler, an important step is that we need to compute the variable liveness set. However, when the op count goes higher, for example, the se-renext 152 has 4000 ops, the previous reach fixpoint algorithm cost long time to converge. This PR use the worklist algorithm, which converges faster than the reach fix point algorithm.
-
The SSA-form graph optimized liveness algorithm. Given our memory strategy heavily relied on variable liveness range algorithm, so one idea is to generate more concrete variable liveness information. https://hal.inria.fr/inria-00558509v1/document I follow this paper, try this SSA-graph based liveness algorithm in https://github.com/PaddlePaddle/Paddle/pull/11385 However, I found some issues when I implement this algorithm.
first, the more concrete SSA-form liveness set is based on the hypothesis that loops and if condition happens everywhere. Because the loops and if condition can convert to phi variable*, which dominate the analysis later. If not, it works same as the normal liveness analysis algorithm. second, What's more, I do a forward analysis of our test program by hand(That means to compute the variable live period), and found that the result is same with the normal liveness analysis algorithm. third, In most ssa-graph application, the ssa program is the intermediate representation of user program. It's not easy to convert a ssa-form graph back to program desc.(Because I need to convert it back after finish the memory transpile). As a conclusion, I implement the worklist based liveness algorithm instead of the SSA-graph.