fix bug of multicard grad ncclAllReduce (#30554)

Cherry-pick of #30553: fix a bug in multi-card gradient ncclAllReduce. The gradient accumulators of parameters must be kept in a consistent order; otherwise, the multi-card ncclAllReduce of gradients is affected.
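
The ordering requirement above can be illustrated with a minimal sketch of an order-preserving, duplicate-free insert. `Accumulator` and `AppendUnique` are hypothetical names used only for illustration; they are not part of Paddle's API, and the real change operates on `GradientAccumulator*` inside `BasicEngine`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a gradient accumulator; only pointer identity
// matters for this sketch.
struct Accumulator {
  int param_id;
};

// Appends `acc` unless it is already present, so the vector keeps
// first-seen order and holds each pointer at most once.
// Returns true if the pointer was newly added.
bool AppendUnique(std::vector<Accumulator*>* list, Accumulator* acc) {
  assert(list != nullptr && acc != nullptr);
  if (std::find(list->begin(), list->end(), acc) != list->end()) {
    return false;  // already registered; its original position is kept
  }
  list->push_back(acc);
  return true;
}
```

Unlike a `std::unordered_set` of pointers, whose iteration order is unspecified and may depend on the addresses themselves, the vector is traversed in insertion order. The linear `std::find` is a reasonable trade-off as long as the number of leaf accumulators stays modest.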

96058384 · Zhou Wei · GitHub · 5844dfe4 · 96058384 · 96058384

Showing 9 additions and 3 deletions

paddle/fluid/imperative/basic_engine.cc +6 -2

paddle/fluid/imperative/basic_engine.h +3 -1

--- a/paddle/fluid/imperative/basic_engine.cc
+++ b/paddle/fluid/imperative/basic_engine.cc
@@ -328,9 +328,13 @@ void BasicEngine::Execute() {
                    "Cannot find gradient of variable %s", var->Name()));
          }
-          // leaf_accumulators_ : hooks and accumulate-grad for leaf tensor
+          // leaf_accumulators_ : hooks and accumulate-grad for leaf tensor,
+          // it should be ordered and not repeated.
          if (var->IsLeafGrad()) {
-            leaf_accumulators_.insert(iter->second.get());
+            if (std::find(leaf_accumulators_.begin(), leaf_accumulators_.end(),
+                          iter->second.get()) == leaf_accumulators_.end()) {
+              leaf_accumulators_.push_back(iter->second.get());
+            }
            if (iter->second->HasInnerVar()) {
              var = iter->second->InnerVar();

--- a/paddle/fluid/imperative/basic_engine.h
+++ b/paddle/fluid/imperative/basic_engine.h
@@ -69,7 +69,9 @@ class BasicEngine : public Engine {
  std::vector<std::pair<GradientAccumulator*, std::shared_ptr<VariableWrapper>>>
      need_accu_var_list_;
  // leaf_accumulators_ is only for leaf tensor(hooks/accumulate grad)
-  std::unordered_set<GradientAccumulator*> leaf_accumulators_;
+  // It must be ordered and free of duplicates, because all cards must
+  // process the vars in the same order.
+  std::vector<GradientAccumulator*> leaf_accumulators_;
  bool retain_graph_;
 };
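
To see why the header comment insists on the same order across cards, here is a toy model. It is not NCCL and not Paddle code. It assumes that each rank issues one collective call per gradient and that the i-th call on one rank is matched with the i-th call on the other. `CountMismatchedPairs` and the parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model: each vector lists the parameter names in the order a rank
// would issue its per-gradient collective calls. Calls are paired by
// position, so differing orders pair gradients of different parameters.
// Returns the number of positions where the paired names differ.
std::size_t CountMismatchedPairs(const std::vector<std::string>& rank0_order,
                                 const std::vector<std::string>& rank1_order) {
  assert(rank0_order.size() == rank1_order.size());
  std::size_t mismatches = 0;
  for (std::size_t i = 0; i < rank0_order.size(); ++i) {
    if (rank0_order[i] != rank1_order[i]) {
      ++mismatches;
    }
  }
  return mismatches;
}
```

With identical orders on both ranks the function returns 0. If one rank iterates `{"w", "b"}` and the other `{"b", "w"}`, as could happen when each process walks its own hash set of pointers, every position is mismatched.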