SGD Op: diff between updating the fused parameter and updating the parameter separately
Created by: chenwhql
The experiment code is as follows:
- Executors compared: Executor vs. (ParallelExecutor + fuse_all_optimizer_ops = True)
```python
import numpy as np
import paddle.fluid as fluid

batch_size = 32
feed_dict = {
    'image': np.random.random([batch_size, 784]).astype('float32'),
    'label': np.random.random_integers(
        low=0, high=9, size=[batch_size, 1]).astype('int64')
}


def simple_fc_net():
    img = fluid.layers.data(name='image', shape=[784], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    prediction = fluid.layers.fc(img, size=10, act='softmax')
    loss = fluid.layers.cross_entropy(input=prediction, label=label)
    loss = fluid.layers.mean(loss)
    return loss


def build_program_and_scope():
    startup_program = fluid.Program()
    main_program = fluid.Program()
    startup_program.random_seed = 1
    main_program.random_seed = 1
    scope = fluid.Scope()
    with fluid.program_guard(main_program, startup_program):
        with fluid.unique_name.guard():
            loss = simple_fc_net()
            sgd = fluid.optimizer.SGD(learning_rate=1e-3)
            sgd.minimize(loss)
    with fluid.scope_guard(scope):
        exe = fluid.Executor(fluid.CPUPlace())
        exe.run(startup_program)
    return main_program, scope, exe, loss


# Baseline: plain Program run by Executor
prog1, scope1, exe, loss1 = build_program_and_scope()
# CompiledProgram run with fused optimizer ops
prog2, scope2, _, loss = build_program_and_scope()

build_strategy = fluid.BuildStrategy()
# If this strategy is turned off, there is no diff
build_strategy.fuse_all_optimizer_ops = True
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 1
compiled_prog = fluid.CompiledProgram(prog2).with_data_parallel(
    loss_name=loss.name,
    build_strategy=build_strategy,
    exec_strategy=exec_strategy,
    places=fluid.CPUPlace())

# Run both programs with the same feed and compare fc_0.b_0 after each step
for i in range(4):
    with fluid.scope_guard(scope1):
        fetch_val1, = exe.run(prog1,
                              feed=feed_dict,
                              fetch_list=['fc_0.b_0'])
    with fluid.scope_guard(scope2):
        fetch_val2, = exe.run(compiled_prog,
                              feed=feed_dict,
                              fetch_list=['fc_0.b_0'])
    if not np.array_equal(fetch_val1, fetch_val2):
        print("Iter: %d" % i)
        for j in range(len(fetch_val1)):
            if fetch_val1[j] != fetch_val2[j]:
                print("index: %d, val1: %.12f, val2: %.12f" %
                      (j, fetch_val1[j], fetch_val2[j]))
```
With build_strategy.fuse_all_optimizer_ops set to False, the two runs produce identical results. With build_strategy.fuse_all_optimizer_ops set to True, a diff like the following appears:
```
λ yq01-gpu-255-137-12-00 /work/self python fuse_test_diff.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
I1120 13:06:00.144999 27199 parallel_executor.cc:423] The number of CPUPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
W1120 13:06:00.145452 27199 fuse_optimizer_op_pass.cc:191] Find sgd operators : 2, and 2 for dense gradients. To make the speed faster, those optimization are fused during training.
I1120 13:06:00.146219 27199 build_strategy.cc:364] SeqOnlyAllReduceOps:0, num_trainers:1
I1120 13:06:00.146730 27199 parallel_executor.cc:287] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I1120 13:06:00.147424 27199 parallel_executor.cc:370] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
Iter: 1
index: 6, val1: -0.000070472430, val2: -0.000070472437
Iter: 2
index: 0, val1: -0.000048413985, val2: -0.000048413989
index: 6, val1: -0.000105456667, val2: -0.000105456682
```
Note that the diff here is extremely small, essentially at the limit of what float32 can resolve.
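To make this concrete, a quick numpy check using the two values from Iter 1, index 6 in the log above shows that the two results differ by roughly one float32 ULP at that magnitude, i.e. only in the last mantissa bit:

```python
import numpy as np

# Values copied from "Iter: 1, index: 6" in the log above.
v1 = np.float32(-0.000070472430)
v2 = np.float32(-0.000070472437)

diff = abs(np.float64(v1) - np.float64(v2))
ulp = np.spacing(np.abs(v1))  # spacing between adjacent float32 values at this magnitude

print("diff       :", diff)   # ~7.3e-12
print("float32 ULP:", ulp)    # ~7.3e-12, so the diff is about one ULP
```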
Difference between enabling and disabling fuse_all_optimizer_ops
With build_strategy.fuse_all_optimizer_ops set to True
The fc_0.b_0 and fc_0.w_0 that the two SGD ops are about to update are concatenated into a single block of memory, and that whole block is used as the input and output of the SGD Op computation. This process can be seen in the VLOG output:
```
$ GLOG_v=6 python fuse_test_diff.py
I1120 13:13:42.773070 27398 operator.cc:152] CPUPlace Op(coalesce_tensor), inputs:{Input[fc_0.b_0:float[10]({}), fc_0.w_0:float[784, 10]({})]}, outputs:{FusedOutput[@FUSEDVAR@_sgd_Param_fc_0.b_0:[0]({})], Output[fc_0.b_0:float[10]({}), fc_0.w_0:float[784, 10]({})]}.
I1120 13:13:42.773128 27398 operator.cc:986] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I1120 13:13:42.773192 27398 tensor_util.cu:28] TensorCopy 10 from CPUPlace to CPUPlace
I1120 13:13:42.773275 27398 tensor_util.cu:28] TensorCopy 784, 10 from CPUPlace to CPUPlace
I1120 13:13:42.773368 27398 operator.cc:172] CPUPlace Op(coalesce_tensor), inputs:{Input[fc_0.b_0:float[10]({}), fc_0.w_0:float[784, 10]({})]}, outputs:{FusedOutput[@FUSEDVAR@_sgd_Param_fc_0.b_0:float[9216]({})], Output[fc_0.b_0:float[10]({}), fc_0.w_0:float[784, 10]({})]}.
```
The length of the concatenated memory here is 9216, because some padding is applied: the first variable fc_0.b_0 (10 floats) is first padded up to 1024 floats, and then fc_0.w_0 is appended after it. So the 9216 is computed as 1024 (10 padded) + 8192 (7840 padded) = 9216; the sketch below reproduces this arithmetic.
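For reference, a small sketch of that padding arithmetic. It assumes the fuse pass aligns each chunk to 4096 bytes (1024 float32 elements), which matches the sizes in the log; the actual alignment constant lives in the C++ fuse / coalesce_tensor code:

```python
# Alignment expressed in float32 elements; 4096 bytes is an assumption that
# matches the observed chunk sizes (10 -> 1024, 7840 -> 8192).
ALIGN_FLOATS = 4096 // 4


def aligned_len(numel, align=ALIGN_FLOATS):
    # Round numel up to the next multiple of the alignment.
    return ((numel + align - 1) // align) * align


sizes = {'fc_0.b_0': 10, 'fc_0.w_0': 784 * 10}
print({name: aligned_len(n) for name, n in sizes.items()})
# {'fc_0.b_0': 1024, 'fc_0.w_0': 8192}
print(sum(aligned_len(n) for n in sizes.values()))
# 9216, matching the FusedOutput length in the VLOG above
```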
With build_strategy.fuse_all_optimizer_ops set to False
The ParallelExecutor graph is unchanged and has the same structure as what the Executor runs.
Some analysis so far
Some experiments and analysis have been done, but the root cause of the problem is still unclear. The observations below may serve as clues:
- Inside the coalesce_tensor op, the data copied before and after the fuse is identical, which rules out an error at the initial fuse step.
- Inside the SGD op, locating the fc_0.b_0 var, computing on a copy of it, and comparing with the corresponding slice of the fused result gives identical values, which rules out interference from the padding region of the contiguous memory (a numpy sketch of this check follows this list).
- Fetching fc_0.w_0 never shows a diff, while fc_0.b_0 diffs very easily.
- Swapping the memory order of w and b (putting w before b) still produces the diff.
- Shrinking fc_0.w_0 to [1, 10], the same length as fc_0.b_0, never produces a diff across multiple runs.
- A hand-written SGD computation that bypasses the current JIT kernel shows no diff.
- Feeding the SGD op JIT kernel standalone with input lengths from 1 to 10000 shows no diff; the diff only appears once the memory is padded.
- The fuse alignment length is currently 1024. For the 10-float fc_0.b_0, padding from 10 up to 11-15 elements computes correctly, while padding to 16 or more produces a diff.
- The SGD and Adam kernels show the diff; the Momentum kernel shows no diff.
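For reference, a minimal numpy sketch of the consistency check from the second bullet above: update a stand-in for fc_0.b_0 on its own, then update it as a slice of the fused, padded buffer, and compare. It only mimics the element-wise math, not the actual JIT SGD kernel, so it is expected to show no diff; it illustrates that the padding by itself should not change the result, which is consistent with the suspicion that the diff comes from the JIT kernel's handling of the fused buffer.

```python
import numpy as np

lr = np.float32(1e-3)
rng = np.random.RandomState(0)

# Stand-ins for fc_0.b_0 and its gradient (hypothetical data, not from the run).
b = rng.uniform(-1, 1, size=10).astype('float32')
b_grad = rng.uniform(-1, 1, size=10).astype('float32')

# Separate update, as the un-fused SGD op would compute it.
b_separate = b - lr * b_grad

# Fused update: b occupies the first 10 slots of a 1024-float aligned chunk.
fused_param = np.zeros(1024, dtype='float32')
fused_grad = np.zeros(1024, dtype='float32')
fused_param[:10] = b
fused_grad[:10] = b_grad
fused_out = fused_param - lr * fused_grad

# Expected to print True: element-wise IEEE math is unaffected by the padding.
print(np.array_equal(b_separate, fused_out[:10]))
```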