Created by: tonyyang-svail
Fixes https://github.com/PaddlePaddle/Paddle/issues/8592
Profiling result. Script: the example in this PR. Commands:
CUDA_VISIBLE_DEVICES=0 nvprof -f -o one.nvvp python parallel_executor_example.py --batch_size=32
CUDA_VISIBLE_DEVICES=0,1,2,3 nvprof -f -o four.nvvp python parallel_executor_example.py --batch_size=32
Setting | copy weights | forward and backward | merge gradient | apply gradient
---|---|---|---|---
1 GPU with NCCL on bp | / | 250 | / | 5
4 GPUs with NCCL on bp | / | 750 (AllReduce takes about 63%) | / | 5
## Save Model (to be implemented)
In the current implementation, the ParallelExecutor's constructor creates a base scope and n sub scopes (where n equals the number of GPUs), and the model is replicated in each sub scope. The save-model function cannot access the sub scopes created by the ParallelExecutor.
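To make the problem concrete, here is a minimal sketch of the current design. All class and variable names (`Scope`, `ParallelExecutor`, `save_model`, `"fc.w"`) are illustrative stand-ins, not the real Paddle C++/Python API:

```python
# Hypothetical sketch of the CURRENT design: the executor privately
# owns a base scope and n sub scopes, one per GPU.
class Scope:
    def __init__(self):
        self.vars = {}        # variable name -> value (e.g. weights)
        self.sub_scopes = []  # child scopes

    def new_sub_scope(self):
        s = Scope()
        self.sub_scopes.append(s)
        return s

class ParallelExecutor:
    def __init__(self, num_gpus):
        # The constructor creates a base scope and n sub scopes;
        # model parameters are replicated into each sub scope.
        self.base_scope = Scope()
        self.sub_scopes = [self.base_scope.new_sub_scope()
                           for _ in range(num_gpus)]
        for s in self.sub_scopes:
            s.vars["fc.w"] = "replicated weights"

# A save-model routine can only read from a scope the user holds:
def save_model(scope):
    return dict(scope.vars)

pe = ParallelExecutor(num_gpus=4)
user_scope = Scope()
# The trained parameters live in pe.sub_scopes, which the user never
# sees, so saving from a user-created scope finds nothing:
print(save_model(user_scope))  # {}
```

The sketch shows the core issue: the scopes holding the parameters are created inside the executor and are unreachable from any scope the user owns.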
## Proposed Solution
ParallelExecutor's constructor creates n-1 sub scopes. ParallelExecutor.run will take a scope parameter, which will be attached as another sub scope. In this way, the user can create a scope and use it both for ParallelExecutor.run and for saving the model.
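The proposed design can be sketched the same way. Again, the names (`Scope`, `ParallelExecutor`, `save_model`, `"fc.w"`) are hypothetical, chosen only to illustrate the scope ownership change:

```python
# Hypothetical sketch of the PROPOSED design: the constructor makes
# only n-1 internal sub scopes, and run() attaches a user-provided
# scope as the remaining one.
class Scope:
    def __init__(self):
        self.vars = {}
        self.sub_scopes = []

    def new_sub_scope(self):
        s = Scope()
        self.sub_scopes.append(s)
        return s

class ParallelExecutor:
    def __init__(self, num_gpus):
        self.base_scope = Scope()
        # Only n-1 sub scopes are created up front.
        self.sub_scopes = [self.base_scope.new_sub_scope()
                           for _ in range(num_gpus - 1)]

    def run(self, user_scope):
        # The user's scope serves as the n-th sub scope, so the
        # parameters it holds are updated alongside the others.
        for s in self.sub_scopes + [user_scope]:
            s.vars["fc.w"] = "updated weights"

def save_model(scope):
    return dict(scope.vars)

pe = ParallelExecutor(num_gpus=4)
user_scope = Scope()
pe.run(user_scope)
# The user now saves directly from the scope they own:
print(save_model(user_scope))  # {'fc.w': 'updated weights'}
```

The key design change is ownership: one replica's scope is created by the user rather than by the executor, so the same object is valid input to both run and the save-model routine.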