PLSC: loss problem when training with multiple datasets
Created by: gobigrassland
PLSC handles large-scale classification training well. In practice, model training uses multiple datasets, and these datasets have varying degrees of overlapping ids; cleaning up the overlapping ids between datasets is tedious. So during training I feed in multiple datasets, share the backbone network parameters, and give each dataset its own classification layer. In the current PLSC code,

```python
shard_logit = loss._get_info('shard_logit')
shard_prob = loss._get_info('shard_prob')
shard_label = loss._get_info('shard_label')
shard_dim = loss._get_info('shard_dim')
```

is roughly where the large classification-layer weights are split across multiple GPUs (my description here may not be precise).
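To make the sharding idea above concrete, here is a minimal sketch (not PLSC code; all names are my own) of how a large classification layer can be partitioned across GPUs: each rank holds only its slice of the class weights and computes logits for its shard of classes.

```python
import numpy as np

def shard_dim(num_classes, num_ranks, rank):
    # Number of classes handled by this rank; the last rank takes the remainder.
    base = (num_classes + num_ranks - 1) // num_ranks
    return min(base, max(0, num_classes - rank * base))

def local_logits(features, shard_weight):
    # Logits for this rank's shard of classes only. A globally correct softmax
    # would additionally need collective ops over the per-shard max and sum.
    return features @ shard_weight

# Example: 10 classes over 3 ranks -> shards of 4, 4, 2 classes
dims = [shard_dim(10, 3, r) for r in range(3)]
print(dims)  # [4, 4, 2]
```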
The current PLSC code cannot handle the connification-layer weights of multiple classifiers. I modified the minimize function in plsc/model/dist_algo.py, roughly as follows:
```python
def compute_gradient_multi_branches(self, loss, dataset_name,
                                    startup_program=None, parameter_list=None,
                                    no_grad_set=None, callbacks=None):
    # Note: the key must match the '_get_info' calls below (underscore added).
    assert loss._get_info('shard_logit_{}'.format(dataset_name))

    shard_logit = loss._get_info('shard_logit_{}'.format(dataset_name))
    shard_prob = loss._get_info('shard_prob_{}'.format(dataset_name))
    shard_label = loss._get_info('shard_label_{}'.format(dataset_name))
    shard_dim = loss._get_info('shard_dim_{}'.format(dataset_name))

    op_maker = fluid.core.op_proto_and_checker_maker
    op_role_key = op_maker.kOpRoleAttrName()
    op_role_var_key = op_maker.kOpRoleVarAttrName()
    backward_role = int(op_maker.OpRole.Backward)
    loss_backward_role = int(op_maker.OpRole.Loss) | int(
        op_maker.OpRole.Backward)

    # minimize a scalar of reduce_sum to generate the backward network
    scalar = fluid.layers.reduce_sum(shard_logit)
    block = loss.block

    if not self._use_fp16:
        # ret = self._optimizer.minimize(scalar)
        params_grads = self._optimizer.backward(scalar)
        print(loss, scalar, dataset_name)
        # remove the unnecessary ops
        index = 0
        """
        for i, op in enumerate(block.ops):
            if op.all_attrs()[op_role_key] == loss_backward_role:
                index = i
                break
        """
        for i, op in enumerate(block.ops):
            print(i, dataset_name, block.ops[i])
```
I hope to compute gradients separately for the classification loss of each branch, and then aggregate the gradients across branches, but I found that the ops of the branches corresponding to different datasets differ greatly. I previously implemented this in TensorFlow, where only the final classification-layer parameters differ and need not be shared, so I could, for example, take the mean of the gradients of the shared parameters and then update. In my experiment with webface and vggface2, the ops of these two branches differ a lot: one branch has many more ops than the other.
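The TensorFlow-style scheme described above can be sketched framework-agnostically (all names here are assumptions, not PLSC APIs): each branch yields (parameter name -> gradient) pairs; gradients of shared backbone parameters are averaged across branches, while branch-specific classifier gradients pass through unchanged.

```python
import numpy as np

def merge_branch_grads(branch_grads):
    """branch_grads: list of dicts mapping parameter name -> gradient array.
    Parameters appearing in several branches (the shared backbone) get their
    gradients averaged; branch-only parameters keep their single gradient."""
    collected = {}
    for grads in branch_grads:
        for name, g in grads.items():
            collected.setdefault(name, []).append(g)
    return {name: np.mean(gs, axis=0) if len(gs) > 1 else gs[0]
            for name, gs in collected.items()}

# Hypothetical parameter names for a shared backbone and two dataset heads
g_webface = {"backbone.w": np.array([1.0, 2.0]), "fc_webface.w": np.array([3.0])}
g_vggface2 = {"backbone.w": np.array([3.0, 4.0]), "fc_vggface2.w": np.array([5.0])}
merged = merge_branch_grads([g_webface, g_vggface2])
print(merged["backbone.w"])  # [2. 3.]
```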
Some of the extra ops:
```
inputs {
  parameter: "X"
  arguments: "prelu_32.w_0@GRAD"
}
outputs {
  parameter: "Out"
  arguments: "prelu_32.w_0@GRAD"
}
type: "c_sync_calc_stream"
attrs {
  name: "op_device"
  type: STRING
  s: ""
}
attrs {
  name: "op_role"
  type: INT
  i: 1
}
attrs {
  name: "op_callstack"
  type: STRINGS
  strings: "
}
attrs {
  name: "op_namescope"
  type: STRING
  s: "/"
}
attrs {
  name: "op_role_var"
  type: STRINGS
}
```

```
inputs {
  parameter: "X"
  arguments: "prelu_24.w_0@GRAD"
}
outputs {
  parameter: "Out"
  arguments: "prelu_24.w_0@GRAD"
}
type: "c_allreduce_sum"
attrs {
  name: "op_device"
  type: STRING
  s: ""
}
attrs {
  name: "ring_id"
  type: INT
  i: 0
}
attrs {
  name: "use_calc_stream"
  type: BOOLEAN
  b: false
}
attrs {
  name: "op_role"
  type: INT
  i: 1
}
attrs {
  name: "op_role_var"
  type: STRINGS
}
```
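For context, a conceptual sketch (plain Python, not Paddle code) of what these extra ops do: `c_allreduce_sum` sums a gradient tensor across all ranks so every rank ends up with the same aggregated value, while `c_sync_calc_stream` only orders computation before communication and has no numeric effect.

```python
import numpy as np

def simulate_allreduce_sum(per_rank_tensors):
    """Simulate an all-reduce: every rank receives the element-wise sum
    of all ranks' tensors."""
    total = np.sum(per_rank_tensors, axis=0)
    return [total.copy() for _ in per_rank_tensors]

# Two ranks holding different local gradients for the same parameter
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
reduced = simulate_allreduce_sum(grads)
print(reduced[0])  # [4. 6.]
```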
The above is joint training on multiple datasets with a separate classification layer per dataset. Having run into the problem described, I would appreciate any advice.