Unverified commit 2c1ed9b8 · Author: liuyuhui · Committer: GitHub

[Kunlun]fix multi xpu dygraph hang, test=kunlun (#32662) (#32696)

Parent 09adf20f
@@ -762,10 +762,11 @@ void Reducer::MarkGroupReady(size_t group_index) {
       // TODO(liuyuhui): Add try catch to deal with exception later,
       // otherwise the main thread will continue to run when an exception is
       // thrown in comm_pool_.
-      comm_pool_->enqueue([&] {
+      auto next_group = next_group_;
+      comm_pool_->enqueue([this, run_order, next_group, &group] {
         auto dev_id = BOOST_GET_CONST(platform::XPUPlace, place_).device;
         platform::SetXPUDeviceId(dev_id);
-        FusedAllReduceSchedule(run_order, group, next_group_);
+        FusedAllReduceSchedule(run_order, group, next_group);
         {
           std::lock_guard<std::mutex> lock(mutex_);
           comm_op_count_ -= 1;  // lock
......
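The change above copies `next_group_` into a local before enqueueing and captures that copy by value, instead of reading the member inside a `[&]` lambda. The commit does not spell out the failure mode, so the following is only a minimal sketch of the general C++ behavior involved, not Paddle code. `Scheduler`, `queue_`, `EnqueueReadingMember`, `EnqueueWithSnapshot`, and `RunDemo` are hypothetical names, and a plain vector of deferred tasks stands in for the thread pool so the result is deterministic.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a class that queues work and later mutates a
// member the queued work depends on. This is not Paddle's Reducer.
struct Scheduler {
  std::size_t next_group_ = 0;
  std::vector<std::function<std::size_t()>> queue_;

  // The task reads the member when it runs, so it observes later updates.
  void EnqueueReadingMember() {
    queue_.push_back([this] { return next_group_; });
  }

  // The member is copied at enqueue time, so the task sees a fixed value.
  void EnqueueWithSnapshot() {
    auto next_group = next_group_;
    queue_.push_back([next_group] { return next_group; });
  }
};

// Queues one task of each kind, advances the member, then runs the tasks.
std::vector<std::size_t> RunDemo() {
  Scheduler s;
  s.EnqueueReadingMember();
  s.EnqueueWithSnapshot();
  s.next_group_ = 5;  // the caller moves on before the queued tasks run

  std::vector<std::size_t> seen;
  for (auto& task : s.queue_) seen.push_back(task());
  return seen;
}
```

In this sketch the first task returns the updated value and the second returns the value that was current when it was queued. With a real thread pool the first variant would also be an unsynchronized read of a member that another thread may be writing.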