Computation error inside while_loop (#24382) · Issue · PaddlePaddle / Paddle

Computation error inside while_loop

Created by: Ramlinbird

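Goal: the input is a label of shape [-1, -1], and its shape changes at every step. Every value along the second dimension should be expanded into one-hot form, giving [-1, -1, self._padding_seqlen], and then reduce_sum is applied on dim=1, giving [-1, self._padding_seqlen] (this produces the multi-class label).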

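During training I found that applying one_hot to label directly was not feasible, so I rewrote it as a loop using while_loop. With the loop, the actual output differs from the expected output: the computation inside the loop does not seem to use the loop counter, and every iteration appears to compute the same thing.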

        label = layers.squeeze(label, axes=[1])
        layers.Print(label, message="debug label: ")

### Original approach: applying one_hot to label directly occasionally ran out of memory, so it was replaced by the loop further below
        label_oh_0 = fluid.one_hot(label, self._padding_seqlen - 2, allow_out_of_range=True)
        layers.Print(layers.reduce_sum(label_oh_0), message="debug expected_label_sum: ")
        label_oh_0 = layers.reduce_sum(label_oh_0, dim=1)
### 

        label_oh = fluid.one_hot(label[:, 0], self._padding_seqlen - 2, allow_out_of_range=True)
        layers.Print(layers.reduce_sum(label_oh), message="debug label_oh: ")

        def cond(ind, label, label_oh, depth):   # parameters correspond to loop_vars
            return ind < layers.shape(label)[1]

        def body(ind, label, label_oh, depth):   # parameters correspond to loop_vars
            layers.Print(ind, message="debug ind-: ")
#### Something seems wrong here: judging by each printed "debug label_oh-:", the same value appears to be added on every iteration
            layers.Print(layers.reduce_sum(label_oh), message="debug label_oh-: ")
            label_oh += fluid.one_hot(label[:, ind], depth, allow_out_of_range=True)
##### 
            ind += 1
            return [ind, label, label_oh, depth]

        ind = layers.fill_constant(shape=[1], dtype='int32', value=1)   # loop counter
        depth = self._padding_seqlen - 2
        layers.Print(depth, message="debug depth: ")
        ind, label, label_oh, depth = layers.while_loop(cond, body, [ind, label, label_oh, depth], name=scope_name + "while_loop")

        label = layers.cast(label_oh, "float32")
        layers.Print(layers.reduce_sum(label_oh), message="debug {}_label_ohs: ".format(scope_name))
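For reference, the intended result can be sketched in plain Python, independent of Paddle. This is only an illustration under stated assumptions: the helper names (`one_hot_row`, `multi_hot_direct`, `multi_hot_by_column`) are hypothetical, out-of-range indices such as -1 are assumed to produce all-zero rows (as `allow_out_of_range=True` is expected to do), and every row is assumed to have at least one column. It shows that accumulating one column per iteration should give the same result as one-hot encoding everything and summing over dim 1.

```python
# Plain-Python reference sketch; names are hypothetical, not Paddle APIs.

def one_hot_row(index, depth):
    """One-hot vector of length `depth`; out-of-range indices (e.g. -1) give all zeros."""
    return [1.0 if k == index else 0.0 for k in range(depth)]


def multi_hot_direct(label, depth):
    """One-hot every entry of each row, then sum over the second dimension."""
    result = []
    for row in label:
        acc = [0.0] * depth
        for index in row:
            acc = [a + b for a, b in zip(acc, one_hot_row(index, depth))]
        result.append(acc)
    return result


def multi_hot_by_column(label, depth):
    """Same result, accumulated one column per iteration like the while_loop."""
    num_cols = len(label[0]) if label else 0
    # Start from column 0, mirroring the initial label_oh in the issue.
    label_oh = [one_hot_row(row[0], depth) for row in label]
    ind = 1  # loop counter, starts at the second column
    while ind < num_cols:
        for r, row in enumerate(label):
            # The column selected here must change with `ind` on every iteration.
            step = one_hot_row(row[ind], depth)
            label_oh[r] = [a + b for a, b in zip(label_oh[r], step)]
        ind += 1
    return label_oh


label = [[2, 0, -1], [1, 1, 3], [-1, -1, -1]]  # -1 stands for padding
depth = 4
direct = multi_hot_direct(label, depth)
looped = multi_hot_by_column(label, depth)
print(direct)
print(looped == direct)
```

If the slice taken inside the loop did not depend on `ind`, the two results would diverge, which is the kind of mismatch reported in this issue.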

Printed output from one step:

Tensor[squeeze_6.tmp_0]
	shape: [389,5,]
	dtype: l
	data: -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
1589175039	debug depth: 	The place is:CUDAPlace(0)
Tensor[tmp_143]
	shape: [1,]
	dtype: i
	data: 72,
1589175039	debug expected_label_sum: 	The place is:CUDAPlace(0)
Tensor[reduce_sum_32.tmp_0]
	shape: [1,]
	dtype: f
	data: 116, ## expected value: 116
1589175039	debug ind-: 	The place is:CUDAPlace(0)
Tensor[fill_constant_0.tmp_0]
	shape: [1,]
	dtype: i
	data: 1,
1589175039	debug label_oh-: 	The place is:CUDAPlace(0)
Tensor[reduce_sum_34.tmp_0]
	shape: [1,]
	dtype: f
	data: 50,
1589175039	debug ind-: 	The place is:CUDAPlace(0)
Tensor[fill_constant_0.tmp_0]
	shape: [1,]
	dtype: i
	data: 2,
1589175039	debug label_oh-: 	The place is:CUDAPlace(0)
Tensor[reduce_sum_34.tmp_0]
	shape: [1,]
	dtype: f
	data: 96, ## here 96 - 50 = 46
1589175039	debug ind-: 	The place is:CUDAPlace(0)
Tensor[fill_constant_0.tmp_0]
	shape: [1,]
	dtype: i
	data: 3,
1589175039	debug label_oh-: 	The place is:CUDAPlace(0)
Tensor[reduce_sum_34.tmp_0]
	shape: [1,]
	dtype: f
	data: 142, ## here 142 - 96 = 46
1589175039	debug ind-: 	The place is:CUDAPlace(0)
Tensor[fill_constant_0.tmp_0]
	shape: [1,]
	dtype: i
	data: 4,
1589175039	debug label_oh-: 	The place is:CUDAPlace(0)
Tensor[reduce_sum_34.tmp_0]
	shape: [1,]
	dtype: f
	data: 188,
1589175039	debug label_oh: 	The place is:CUDAPlace(0)
Tensor[reduce_sum_33.tmp_0]
	shape: [1,]
	dtype: f
	data: 50,
1589175039	debug emcell__label_ohs: 	The place is:CUDAPlace(0)
Tensor[reduce_sum_35.tmp_0]
	shape: [1,]
	dtype: f
	data: 234, # the final result is 234, which does not match the expected 116
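The numbers in this log can be checked with a few lines of arithmetic. The sketch below only restates the printed sums (the variable names are illustrative). It shows that every iteration adds the same amount, which is consistent with the same slice being added each time, although it does not prove the cause.

```python
# Sums copied from the log above; variable names are illustrative only.
initial = 50.0                        # "debug label_oh" (column 0 only)
in_loop = [50.0, 96.0, 142.0, 188.0]  # "debug label_oh-" at ind = 1, 2, 3, 4
final = 234.0                         # sum after the loop
expected = 116.0                      # "debug expected_label_sum"

# Amount added by each of the four iterations.
increments = [b - a for a, b in zip(in_loop, in_loop[1:] + [final])]
print(increments)

# Columns 1..4 together should contribute expected - initial,
# but the loop added sum(increments) instead.
print(expected - initial, sum(increments))
```

Four identical increments account exactly for the final value (50 plus four times the increment equals 234), whereas the expected total leaves only 116 - 50 for columns 1 to 4 combined.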
