Unverified commit 13d5e590, authored by: G gaotingquan

fix: convert bn to sync_bn

The running_mean and running_var of BN are not synchronized across processes in distributed training,
which leads to a bug where the eval loss computed during training is inconsistent with a standalone eval.
Parent 474c918b
@@ -242,6 +242,11 @@ class Engine(object):
                 level=amp_level,
                 save_dtype='float32')
 
+        # TODO(gaotingquan): convert_sync_batchnorm is not effective
+        # eval loss in training is inconsistent with the eval only if bn is used,
+        # because the running_mean and running_var of bn are not synced in dist.
+        self.model = nn.SyncBatchNorm.convert_sync_batchnorm(self.model)
+
         # for distributed
         world_size = dist.get_world_size()
         self.config["Global"]["distributed"] = world_size != 1
...
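
For context, a minimal sketch of how this conversion is typically wired into a Paddle distributed training script. The helper name build_distributed_model and the surrounding setup are illustrative assumptions, not PaddleClas code; the only call taken from the commit is nn.SyncBatchNorm.convert_sync_batchnorm, which recursively replaces BatchNorm sublayers so running_mean and running_var are aggregated across ranks, which is intended to keep the eval loss seen during training consistent with a standalone eval.

# Illustrative sketch (not PaddleClas code): apply the same conversion
# before wrapping the model for data-parallel training.
import paddle
import paddle.nn as nn
import paddle.distributed as dist

def build_distributed_model(model: nn.Layer) -> nn.Layer:  # hypothetical helper
    # Replace every BatchNorm sublayer with SyncBatchNorm so the batch
    # statistics (running_mean / running_var) are synchronized across ranks.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    if dist.get_world_size() > 1:
        dist.init_parallel_env()
        # DataParallel only synchronizes gradients; SyncBatchNorm is what
        # keeps the BN running statistics consistent between processes.
        model = paddle.DataParallel(model)
    return model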