Unverified commit 13d5e590, authored by: G gaotingquan

fix: convert bn to sync_bn

The running_mean and running_var of BN are not synchronized across processes in distributed training,
which leads to a bug where the eval loss computed during training is inconsistent with a standalone eval.
Parent 474c918b
@@ -242,6 +242,11 @@ class Engine(object):
                 level=amp_level,
                 save_dtype='float32')
 
+        # TODO(gaotingquan): convert_sync_batchnorm is not effective
+        # eval loss in training is inconsistent with the eval only if bn is used,
+        # because the running_mean and running_var of bn are not synced in dist.
+        self.model = nn.SyncBatchNorm.convert_sync_batchnorm(self.model)
+
         # for distributed
         world_size = dist.get_world_size()
         self.config["Global"]["distributed"] = world_size != 1
...
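
For context, a minimal sketch of how this conversion is typically wired into a Paddle distributed training script. The helper name build_distributed_model and the surrounding setup are illustrative assumptions, not PaddleClas code; the only call taken from the commit is nn.SyncBatchNorm.convert_sync_batchnorm, which recursively replaces BatchNorm sublayers so running_mean and running_var are aggregated across ranks, which is intended to keep the eval loss seen during training consistent with a standalone eval.

# Illustrative sketch (not PaddleClas code): apply the same conversion
# before wrapping the model for data-parallel training.
import paddle
import paddle.nn as nn
import paddle.distributed as dist

def build_distributed_model(model: nn.Layer) -> nn.Layer:  # hypothetical helper
    # Replace every BatchNorm sublayer with SyncBatchNorm so the batch
    # statistics (running_mean / running_var) are synchronized across ranks.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    if dist.get_world_size() > 1:
        dist.init_parallel_env()
        # DataParallel only synchronizes gradients; SyncBatchNorm is what
        # keeps the BN running statistics consistent between processes.
        model = paddle.DataParallel(model)
    return model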