fix: convert bn to sync_bn

The running_mean and running_var of BN layers are not synchronized across processes in distributed training, so the eval loss computed during training is inconsistent with the loss from an eval-only run.

5ca50f84 · gaotingquan · 0258d35e · 5ca50f84

Showing 1 changed file with 5 additions and 0 deletions

ppcls/engine/engine.py +5 -0

--- a/ppcls/engine/engine.py
+++ b/ppcls/engine/engine.py
@@ -243,6 +243,11 @@ class Engine(object):
                level=amp_level,
                save_dtype='float32')

+        # TODO(gaotingquan): convert_sync_batchnorm is not effective
+        # eval loss in training is inconsistent with the eval only if bn is used,
+        # because the running_mean and running_var of bn are not synced in dist.
+        self.model = nn.SyncBatchNorm.convert_sync_batchnorm(self.model)
+
        # for distributed
        world_size = dist.get_world_size()
        self.config["Global"]["distributed"] = world_size != 1
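
The comment in the hunk above attributes the mismatch to per-process running statistics. A minimal pure-Python sketch of that effect follows; it does not use Paddle, and the shard values, the `MOMENTUM` constant, and all function names are hypothetical, chosen only for illustration. It assumes the usual update rule `running = momentum * running + (1 - momentum) * batch_stat` and tracks only the mean.

```python
# Sketch only: simulates BN running_mean on two ranks, with and without syncing
# the batch statistic across ranks. Not Paddle code; all names are hypothetical.

MOMENTUM = 0.9  # assumed weight on the previous running value


def batch_mean(xs):
    """Mean of one mini-batch of scalar activations."""
    return sum(xs) / len(xs)


def update_running(running, batch_stat, momentum=MOMENTUM):
    """One exponential-moving-average update of a running statistic."""
    return momentum * running + (1.0 - momentum) * batch_stat


# Hypothetical data: each rank sees a different shard at every step.
shards = {
    0: [[1.0, 2.0], [3.0, 4.0]],
    1: [[5.0, 6.0], [7.0, 8.0]],
}


def train_unsynced(shards):
    """Plain BN: every rank updates running_mean from its local batch only."""
    running = {rank: 0.0 for rank in shards}
    steps = len(next(iter(shards.values())))
    for step in range(steps):
        for rank, batches in shards.items():
            running[rank] = update_running(running[rank], batch_mean(batches[step]))
    return running


def train_synced(shards):
    """Sync BN: every rank updates running_mean from the global batch mean."""
    running = {rank: 0.0 for rank in shards}
    steps = len(next(iter(shards.values())))
    for step in range(steps):
        global_batch = [x for batches in shards.values() for x in batches[step]]
        global_stat = batch_mean(global_batch)
        for rank in shards:
            running[rank] = update_running(running[rank], global_stat)
    return running


unsynced = train_unsynced(shards)
synced = train_synced(shards)
print("unsynced running_mean per rank:", unsynced)
print("synced running_mean per rank:  ", synced)
```

Without syncing, the ranks end up with different running means, so evaluating with one rank's statistics depends on which rank's state is used. With the synced statistic, all ranks hold the same value. In the actual change this role is played by `nn.SyncBatchNorm.convert_sync_batchnorm(self.model)`.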