fp16 training does not work
Created by: ghost
I want to run fp16 training on both CPU and GPU, but I get the same error message in both cases.
python train.py --model=ResNet18 --batch_size=32 --use_gpu=True --fp16=True
Traceback (most recent call last):
  File "train.py", line 462, in <module>
    main()
  File "train.py", line 458, in main
    train(args)
  File "train.py", line 272, in train
    args=args)
  File "train.py", line 232, in build_program
    params_grads, main_prog, startup_prog, args.scale_loss)
  File "/root/models/PaddleCV/image_classification/utils/fp16_utils.py", line 94, in create_master_params_grads
    reduced_master_grad = fluid.layers.collective._allreduce(master_grad)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/layers/collective.py", line 47, in _allreduce
    "sync_mode": sync_mode})
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/framework.py", line 1689, in append_op
    attrs=kwargs.get("attrs", None))
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/framework.py", line 1016, in __init__
    proto = OpProtoHolder.instance().get_op_proto(type)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/framework.py", line 905, in get_op_proto
    raise ValueError("Operator \"%s\" has not been registered." % type)
ValueError: Operator "allreduce" has not been registered.
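The traceback ends in the op registry lookup, so one way to narrow this down might be to ask the installed build directly whether it registers the op. The following is only a minimal sketch: it reuses the `OpProtoHolder.instance().get_op_proto(...)` call and the `ValueError` shown in the traceback, while `is_op_registered` is a hypothetical helper name introduced here for illustration. It assumes the `paddle.fluid` package layout from the traceback and returns `None` when that package cannot be imported.

```python
def is_op_registered(op_type):
    """Report whether the installed Paddle build registers `op_type`.

    Hypothetical helper: returns True or False when paddle.fluid is
    importable, and None when it is not (layout assumed from the traceback).
    """
    try:
        from paddle.fluid.framework import OpProtoHolder
    except ImportError:
        # Paddle is missing, or this version does not ship paddle.fluid.
        return None
    try:
        # Same lookup that fails in the traceback above.
        OpProtoHolder.instance().get_op_proto(op_type)
        return True
    except ValueError:
        # Raised for operators that "have not been registered".
        return False


if __name__ == "__main__":
    print("allreduce registered:", is_op_registered("allreduce"))
```

If this reports `False`, the build in use does not include the `allreduce` operator, which would point at the build configuration rather than at `train.py`.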
My specs:
os: Ubuntu 16.04
PaddlePaddle: Release version built from the develop branch, GPU support enabled
models: develop branch
Disabling fp16 (--fp16=False) results in correct training on both GPU and CPU.
Can you please help me resolve this issue (if it's a mistake on my end) or fix it?