Need the ability to group operations together, similar to Collections in TensorFlow
Created by: abhinavarora
This problem came to light when I was investigating how we could move regularization to the pserver (https://github.com/PaddlePaddle/Paddle/issues/7432). The current distribute transpiler splits the `params` and `grads` and passes different slices to each pserver. Hence, when we create optimize ops, we use sliced parameters and gradients. However, the distribute transpiler currently does this through a hack: it identifies these ops by checking whether the op has inputs named `Param` and `Grad`. This works well because optimize ops have their own dedicated operators such as `sgd_op`, `adam_op`, etc.
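For concreteness, here is a minimal sketch of what this name-based check looks like; the accessors (`op.input_names`, `op.input`, `op.set_input`) are hypothetical stand-ins for the real framework API:

```python
def is_optimize_op(op):
    # Dedicated optimize ops (sgd_op, adam_op, ...) declare input slots
    # named "Param" and "Grad", so matching on slot names suffices here.
    return "Param" in op.input_names and "Grad" in op.input_names

def slice_optimize_inputs(ops, sliced_params, sliced_grads):
    # On each pserver, rewrite whole-tensor inputs to that server's slices.
    for op in ops:
        if is_optimize_op(op):
            op.set_input("Param", sliced_params[op.input("Param")])
            op.set_input("Grad", sliced_grads[op.input("Grad")])
```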
However, for regularization and gradient clipping we rely on generic tensor ops like `scale` and `elementwise_add`. These ops take the parameters as inputs, so on the pserver they should receive the sliced parameters. Thus we need a way to identify these ops in the distribute transpiler, so that we can make sure we pass the sliced `params` and `grads` as inputs to them. The above-mentioned hack will not work here, because these are generic ops whose inputs and outputs have names like `X`, `Y`, etc.
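Continuing the sketch above, the failure mode is easy to see with mock ops that carry only input-slot names (mocks are for illustration only):

```python
from types import SimpleNamespace

scale_op = SimpleNamespace(input_names=["X"])                     # regularization op
sgd_op = SimpleNamespace(input_names=["Param", "Grad", "LearningRate"])

assert not is_optimize_op(scale_op)  # missed: its param input won't be sliced
assert is_optimize_op(sgd_op)        # matched by the Param/Grad hack
```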
A hacky solution would be to create dedicated ops for regularization. Currently, the regularization layer adds a `scale` op and an `elementwise_add` op in Python. Instead, we could create a single op in C++ that composes these two ops.
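A sketch of the difference, with simplified `append_op` plumbing and a hypothetical fused op name:

```python
# Today: regularization is expressed as two generic ops, which the
# transpiler cannot recognize by their input names (X, Y).
def append_l2_decay(block, param, grad, coeff):
    decay = block.create_var()
    block.append_op(type="scale",
                    inputs={"X": param},
                    outputs={"Out": decay},
                    attrs={"scale": coeff})
    block.append_op(type="elementwise_add",
                    inputs={"X": grad, "Y": decay},
                    outputs={"Out": grad})

# Hacky fix: one dedicated C++ op (hypothetical name "l2_decay") whose
# Param/Grad input slots the transpiler could match like sgd_op/adam_op.
def append_l2_decay_fused(block, param, grad, coeff):
    block.append_op(type="l2_decay",
                    inputs={"Param": param, "Grad": grad},
                    outputs={"Out": grad},
                    attrs={"coeff": coeff})
```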
A better and more sustainable solution would be to support adding tags to Python ops, which would let us group ops that share a tag. That way, we can make sure that all ops added for regularization carry a regularization tag, and all gradient clipping ops carry a gradient-clipping tag. The distribute transpiler can then look up ops by tag and apply whatever slicing logic it needs to them. These tags are similar to the concept of Collections in TensorFlow.
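A rough sketch of what the tagging mechanism could look like; the attribute name `tags` and the helper names are hypothetical:

```python
REGULARIZATION = "regularization"
GRAD_CLIP = "gradient_clipping"

def append_tagged_op(block, tag, **op_args):
    # Record the group membership as a plain op attribute.
    op = block.append_op(**op_args)
    op.set_attr("tags", [tag])
    return op

def ops_with_tag(block, tag):
    # Group lookup works regardless of whether an op's inputs are
    # named Param/Grad or X/Y.
    return [op for op in block.ops if tag in (op.attr("tags") or [])]

def transpile_regularization_ops(block, sliced_params, sliced_grads):
    for op in ops_with_tag(block, REGULARIZATION):
        # Rewrite this op's inputs to use the pserver's sliced
        # params/grads, whatever its input slot names are.
        ...
```

This mirrors the `tf.add_to_collection` / `tf.get_collection` pattern in TensorFlow: ops opt into a named group when they are created, and downstream passes query the group instead of pattern-matching on op internals.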