Created by: zhiqiu
PR types: New features
PR changes: Others
Describe
Merge PR #21242: Add BufferSharedIdentityInplaceOpPass
Details from #21242
Background
This PR further enhances the in-place strategy.
Some operators do not change the input data when in-place is performed. These operators (we call them "identity ops") include:
- reshape, reshape_grad and reshape_grad_grad
- squeeze and squeeze_grad
- unsqueeze and unsqueeze_grad
- flatten and flatten_grad
- assign and its grad (the grad of assign is assign itself)
- ...
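As a rough illustration of what the pass treats as an identity op, a whitelist check could look like the Python sketch below. The op-type names and the helper are illustrative only; the actual pass is implemented in C++, and the registered op type strings in Paddle may differ (e.g. reshape2 rather than reshape).

```python
# A minimal sketch, not the actual pass code: ops that only change metadata
# (shape) and never write to the input buffer, so sharing the input buffer
# with the output is always safe for them.
IDENTITY_OP_TYPES = {
    "reshape", "reshape_grad", "reshape_grad_grad",
    "squeeze", "squeeze_grad",
    "unsqueeze", "unsqueeze_grad",
    "flatten", "flatten_grad",
    "assign",  # the grad of assign is assign itself
}

def is_identity_op(op_type):
    """Return True if the op does not modify its input data."""
    return op_type in IDENTITY_OP_TYPES
```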
Suppose that X is the input of both op1 and op2, where op1 is an identity op and op2 is a non-identity op. In-place can be performed safely in op1, because op1 does not change the data of X even when in-place is performed. This PR adds BufferSharedIdentityInplaceOpPass to enable this kind of in-place strategy.
For example,
x2 = fluid.layers.reshape(x1, ...)
x3 = non_identity_op(x1)
Although x1 is the input of two ops, in-place can be performed safely when running reshape.
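To see why this is safe, a NumPy analogy may help: a reshape only creates a new view of the same buffer and never writes to it, so a later reader of x1 still sees the original data. This is only an analogy for the memory-sharing idea; the actual pass works on the fluid graph, not on NumPy arrays.

```python
import numpy as np

x1 = np.arange(6, dtype=np.float32)    # original buffer
x2 = x1.reshape(2, 3)                  # "identity op": x2 is a view sharing x1's memory
x3 = np.maximum(x1, 0.0)               # "non-identity op" (relu-like): writes a new buffer

assert x2.base is x1                   # no copy was made for the reshape
assert np.array_equal(x1, x2.ravel())  # x1's data is untouched by the reshape
```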
Design of BufferSharedIdentityInplaceOpPass
1. Do not consider last lived ops only
Suppose that we have a network like:
x2 = fluid.layers.reshape(x1, ...)
x3 = op2(x2)
x4 = op3(x1, x3)
It is obvious that the last lived ops of x1 are [op3] only (because reshape is strictly before op3 in the graph). However, since the last version of x1 is only read in reshape and op3, x1 can still be identity-inplaced in this case. Therefore, in the implementation of this PR, we check whether the last lived ops of x1 only read x1. If so, we scan all ops that read x1 to find the identity inplace ops, instead of only scanning the last lived ops of x1.
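In pseudocode, this check might look roughly like the sketch below, reusing the is_identity_op whitelist from the earlier sketch. All graph helpers here (last_lived_ops, only_reads, ops_reading_last_version) are hypothetical names; the real implementation is the C++ pass operating on the SSA graph.

```python
def collect_identity_inplace_ops(x, graph):
    """Sketch: find the ops in which `x` can be identity-inplaced.

    If every last lived op of `x` only reads it, we may scan *all* ops that
    read the last version of `x`, not just the last lived ops.
    """
    last_lived = graph.last_lived_ops(x)                # hypothetical helper
    if not all(op.only_reads(x) for op in last_lived):  # some op may write x: give up
        return []
    readers = graph.ops_reading_last_version(x)         # hypothetical helper
    return [op for op in readers if is_identity_op(op.type)]
```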
2. Non-branched identity inplace and branched identity inplace
There are two kinds of identity inplace reuse:
- Non-branched identity inplace: input X may be reused by only one output var, such as:
X -> reshape -> Y1 -> squeeze -> Y2 -> op1 -> ...
In-place can be performed in reshape and squeeze no matter whether op1 is a non-identity inplace op (e.g., relu) or not.
- Branched identity inplace: input X may be reused by many output vars. Branched non-identity inplace is not allowed, since such ops would change the data of the input, but branched identity inplace is allowed. For example,
     -> reshape -> Y1 -> op1 -> ...
X - |
     -> squeeze -> Y2 -> op2 -> ...
If op1 is an inplace relu, the inplace of reshape must not happen! Therefore, when we record the identity inplace ops, we should also know whether the leaf vars of the branched inplace tree would be non-identity inplaced or not. In the implementation of this PR, we record all the leaf var nodes that are non-identity inplaced and prune them when branched identity inplace happens (see the sketch after this list).
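A sketch of this pruning step, with the same hypothetical helper names as above: when one input feeds several identity ops, the branches whose leaf var would later be non-identity in-placed are dropped, so the shared buffer of X cannot be corrupted.

```python
def prune_branched_identity_inplace(x, identity_ops, graph):
    """Sketch: keep only the identity-inplace branches whose leaf var is
    never modified by a non-identity in-place op (e.g. an in-place relu)."""
    safe_ops = []
    for op in identity_ops:
        leaf = op.output_var(x)                       # hypothetical helper
        if not graph.is_non_identity_inplaced(leaf):  # leaf is never written in place
            safe_ops.append(op)                       # sharing x's buffer stays safe
    return safe_ops
```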
3. Mark some leaf vars as non-reusable to avoid further reuse errors
There are two cases in which the leaf vars should not be reused further (i.e., reused inside BufferSharedCrossOpMemoryReusePass):
- Branched identity inplace happens. If any leaf var is reused by other vars, the calculation result may be wrong.
- The last lived ops of X are not a subset of all identity inplace ops. In this case, the last lived ops of X may read the data of X after its data has been changed by another memory reuse process, so the calculation result may be wrong too. A sketch of this marking rule follows this list.
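The marking rule could be summarized by the sketch below (hypothetical helpers again). The recorded flag is what later stops BufferSharedCrossOpMemoryReusePass from reusing these vars.

```python
def mark_non_reusable_leaves(x, identity_ops, graph):
    """Sketch: forbid further reuse of the leaf vars in the two unsafe
    cases described above."""
    branched = len(identity_ops) > 1
    covered = set(graph.last_lived_ops(x)) <= set(identity_ops)
    if branched or not covered:
        for op in identity_ops:
            graph.mark_non_reusable(op.output_var(x))  # hypothetical helper
```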
Performance
memory
We evaluate the max batch_size of the transformer model on a single V100 card.
The default allocator strategy is auto_growth, with gc and inplace enabled.
| | dev | pr |
|---|---|---|
| model with reshape(inplace=True) | 10323 | 10322 |
| model with reshape(inplace=False) | 10326 | 10424 |
From the table above, we can conclude that:
- It helps little to set fluid.layers.reshape(inplace=True) on the transformer model, so we can remove the inplace parameter of reshape.
- The BufferSharedIdentityInplaceOpPass is able to increase the max batch_size on the transformer model from 10323 to 10424.
speed
We evaluate the training speed of the transformer model on a single V100 card.
The default allocator strategy is auto_growth, with gc and inplace enabled; batch_size is 10250.
| | dev (step/s) | pr (step/s) |
|---|---|---|
| model with reshape(inplace=True) | 2.23 | 2.21858 |
| model with reshape(inplace=False) | 2.23568 | 2.20041 |
From the table above, we can conclude that:
- With reshape(inplace=True), the training of the develop version speeds up from 2.18 to 2.21, about 1.4%.
- With this PR, the training of develop version speed up from -