Created by: zhiqiu
PR types: New features
PR changes: Others
Describe
Merge PR #21242: Add BufferSharedIdentityInplaceOpPass
Details from #21242
Background
This PR further enhances the in-place strategy.
Some operators do not change the input data when in-place is performed. These operators (we call them "identity ops") include:
- reshape, reshape_grad and reshape_grad_grad
- squeeze and squeeze_grad
- unsqueeze and unsqueeze_grad
- flatten and flatten_grad
- assign and its grad (the grad of assign is assign itself)
- ...
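As a rough illustration of what the pass treats as an identity op, a whitelist check could look like the Python sketch below. The op-type names and the helper are illustrative only; the actual pass is implemented in C++, and the registered op type strings in Paddle may differ (e.g. reshape2 rather than reshape).

```python
# A minimal sketch, not the actual pass code: ops that only change metadata
# (shape) and never write to the input buffer, so sharing the input buffer
# with the output is always safe for them.
IDENTITY_OP_TYPES = {
    "reshape", "reshape_grad", "reshape_grad_grad",
    "squeeze", "squeeze_grad",
    "unsqueeze", "unsqueeze_grad",
    "flatten", "flatten_grad",
    "assign",  # the grad of assign is assign itself
}

def is_identity_op(op_type):
    """Return True if the op does not modify its input data."""
    return op_type in IDENTITY_OP_TYPES
```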
Suppose that X is the input of both op1 and op2, where op1 is an identity op and op2 is a non-identity op. In-place can be performed safely in op1, because op1 does not change the data of X even when in-place is performed. This PR adds BufferSharedIdentityInplaceOpPass to enable this kind of in-place strategy.
For example,
x2 = fluid.layers.reshape(x1, ...)
x3 = non_identity_op(x1)
Although x1 is the input of two ops, in-place can be performed safely when running reshape.
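To see why this is safe, a NumPy analogy may help: a reshape only creates a new view of the same buffer and never writes to it, so a later reader of x1 still sees the original data. This is only an analogy for the memory-sharing idea; the actual pass works on the fluid graph, not on NumPy arrays.

```python
import numpy as np

x1 = np.arange(6, dtype=np.float32)    # original buffer
x2 = x1.reshape(2, 3)                  # "identity op": x2 is a view sharing x1's memory
x3 = np.maximum(x1, 0.0)               # "non-identity op" (relu-like): writes a new buffer

assert x2.base is x1                   # no copy was made for the reshape
assert np.array_equal(x1, x2.ravel())  # x1's data is untouched by the reshape
```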
Design of BufferSharedIdentityInplaceOpPass
1. Do not consider last lived ops only
Suppose that we have a network like:
x2 = fluid.layers.reshape(x1, ...)
x3 = op2(x2)
x4 = op3(x1, x3)
It is obvious that the last lived ops of x1 are [op3] only (because reshape is strictly before op3 in the graph). However, since the last version of x1 is only read in reshape and op3, x1 can still be identity-inplaced in this case. Therefore, in the implementation of this PR, we check whether the last lived ops of x1 only read x1. If so, we scan all ops that read x1 to find the identity inplace ops, instead of only scanning the last lived ops of x1.
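In pseudocode, this check might look roughly like the sketch below, reusing the is_identity_op whitelist from the earlier sketch. All graph helpers here (last_lived_ops, only_reads, ops_reading_last_version) are hypothetical names; the real implementation is the C++ pass operating on the SSA graph.

```python
def collect_identity_inplace_ops(x, graph):
    """Sketch: find the ops in which `x` can be identity-inplaced.

    If every last lived op of `x` only reads it, we may scan *all* ops that
    read the last version of `x`, not just the last lived ops.
    """
    last_lived = graph.last_lived_ops(x)                # hypothetical helper
    if not all(op.only_reads(x) for op in last_lived):  # some op may write x: give up
        return []
    readers = graph.ops_reading_last_version(x)         # hypothetical helper
    return [op for op in readers if is_identity_op(op.type)]
```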
2. Non-branched identity inplace and branched identity inplace
There are two kinds of identity inplace reuse:
- Non-branched identity inplace: input X may be reused by only one output var, such as:
X -> reshape -> Y1 -> squeeze -> Y2 -> op1 -> ...
In-place can be performed in reshape and squeeze no matter whether op1 is a non-identity inplace op (e.g., relu) or not.
- Branched identity inplace: input X may be reused by many output vars. Branched non-identity inplace is not allowed, since such ops would change the data of the input, but branched identity inplace is allowed. For example,
     -> reshape -> Y1 -> op1 -> ...
X - |
     -> squeeze -> Y2 -> op2 -> ...
If op1 is an inplace relu, the inplace of reshape must not happen! Therefore, when we record the identity inplace ops, we should also know whether the leaf vars of the branched inplace tree would be non-identity inplaced or not. In the implementation of this PR, we record all the leaf var nodes that are non-identity inplaced and prune them when branched identity inplace happens (see the sketch after this list).
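A sketch of this pruning step, with the same hypothetical helper names as above: when one input feeds several identity ops, the branches whose leaf var would later be non-identity in-placed are dropped, so the shared buffer of X cannot be corrupted.

```python
def prune_branched_identity_inplace(x, identity_ops, graph):
    """Sketch: keep only the identity-inplace branches whose leaf var is
    never modified by a non-identity in-place op (e.g. an in-place relu)."""
    safe_ops = []
    for op in identity_ops:
        leaf = op.output_var(x)                       # hypothetical helper
        if not graph.is_non_identity_inplaced(leaf):  # leaf is never written in place
            safe_ops.append(op)                       # sharing x's buffer stays safe
    return safe_ops
```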
3. Mark some leaf vars as non-reusable to avoid further reuse errors
There are two cases in which the leaf vars should not be reused further (i.e., reused inside BufferSharedCrossOpMemoryReusePass):
- Branched identity inplace happens. If any leaf var is reused by other vars, the calculation result may be wrong.
- The last lived ops of X are not a subset of all identity inplace ops. In this case, the last lived ops of X may read the data of X after its data has been changed by another memory reuse process, so the calculation result may be wrong too. A sketch of this marking rule follows this list.
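The marking rule could be summarized by the sketch below (hypothetical helpers again). The recorded flag is what later stops BufferSharedCrossOpMemoryReusePass from reusing these vars.

```python
def mark_non_reusable_leaves(x, identity_ops, graph):
    """Sketch: forbid further reuse of the leaf vars in the two unsafe
    cases described above."""
    branched = len(identity_ops) > 1
    covered = set(graph.last_lived_ops(x)) <= set(identity_ops)
    if branched or not covered:
        for op in identity_ops:
            graph.mark_non_reusable(op.output_var(x))  # hypothetical helper
```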
Performance
memory
We evaluate the max batch_size of the transformer model on a single V100 card.
The default allocator strategy is auto_growth, with gc and inplace enabled.
| | dev | pr |
|---|---|---|
| model with reshape(inplace=True) | 10323 | 10322 |
| model with reshape(inplace=False) | 10326 | 10424 |
From the table above, we can conclude that:
- It helps little to set fluid.layers.reshape(inplace=True) on the transformer model, so we can remove the inplace parameter of reshape.
- The BufferSharedIdentityInplaceOpPass is able to increase the max batch_size on the transformer model from 10323 to 10424.
speed
We evaluate the training speed of the transformer model on a single V100 card.
The default allocator strategy is auto_growth, with gc and inplace enabled; batch_size is 10250.
| | dev (step/s) | pr (step/s) |
|---|---|---|
| model with reshape(inplace=True) | 2.23 | 2.21858 |
| model with reshape(inplace=False) | 2.23568 | 2.20041 |
From the table above, we can conclude that:
- With reshape(inplace=True), the training of the develop version speeds up from 2.18 to 2.21, about 1.4%.
- With this PR, the training of develop version speed up from -