Created by: reyoung
This is an experimental feature, OFF by default. When the flag is ON, the gradients of the same parameter are summed eagerly instead of being accumulated by a single sum op at the end.
For example, the backward graph changes from
op --> w@grad@0
op --> w@grad@1
op --> w@grad@2
w@grad@0, w@grad@1, w@grad@2 --(sum)--> w@grad
to
op --> w@grad
op --> w@grad@1
w@grad, w@grad@1 --(sum)--> w@grad
op --> w@grad@2
w@grad, w@grad@2 --(sum)--> w@grad
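To make the idea concrete, here is a minimal, framework-independent sketch of the two accumulation orders (this is not Paddle's actual implementation, only an illustration of the data flow above):

import numpy as np

def deferred_sum(grad_pieces):
    # Original behaviour: every w@grad@i stays alive until one final sum op.
    buffered = [g for g in grad_pieces]       # w@grad@0, w@grad@1, w@grad@2, ...
    return np.sum(buffered, axis=0)           # single sum at the end

def eager_sum(grad_pieces):
    # Flag ON: each new piece is folded into w@grad as soon as it is produced,
    # so at most one extra gradient buffer is alive at any time.
    total = None
    for g in grad_pieces:
        total = g if total is None else total + g
    return total

pieces = [np.ones((2, 3)) for _ in range(3)]
assert np.allclose(deferred_sum(pieces), eager_sum(pieces))

Both orders give the same result; the eager order just shortens the lifetime of the intermediate w@grad@i buffers, which is what lets later memory reuse kick in.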
Also, I found something very interesting. I would expect this model to keep only 1 or 2 embedding gradient buffers, but the memory optimization pass does not optimize it that way. @dzhwinter Please turn this into a unit test case for your refactored memory optimizer.
import paddle.fluid as fluid

word = fluid.layers.data(name='word', shape=[1], dtype='int64')
# Six embedding lookups sharing the same parameter 'emb'.
emb = [
    fluid.layers.embedding(word, size=[65536, 256], param_attr='emb')
    for _ in range(6)
]
emb = fluid.layers.concat(emb)
emb = fluid.layers.mean(emb)
fluid.backward.append_backward(emb)
fluid.memory_optimize(fluid.default_main_program(), print_log=True)

# Dump the ops of the optimized program and the estimated memory usage.
gblock = fluid.default_main_program().block(0)
for op in gblock.ops:
    print op.input_arg_names, op.type, op.output_arg_names
print fluid.contrib.memory_usage(fluid.default_main_program(), 300)
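A rough sketch of how the expectation above could be checked automatically, rather than by reading the printed ops. It assumes the embedding gradient variables keep the usual 'emb@GRAD' name prefix after the backward and memory-optimize passes; the threshold of 2 buffers follows the expectation stated above.

def count_emb_grad_buffers(program):
    # Count distinct output variables that look like embedding gradients.
    names = set()
    for op in program.block(0).ops:
        for name in op.output_arg_names:
            if name.startswith('emb@GRAD'):
                names.add(name)
    return len(names)

# After building the program and running memory_optimize as above:
# assert count_emb_grad_buffers(fluid.default_main_program()) <= 2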