Created by: tonyyang-svail
Fixes https://github.com/PaddlePaddle/Paddle/issues/8592
Profiling result. Script: the example in this PR. Commands:
CUDA_VISIBLE_DEVICES=0 nvprof -f -o one.nvvp python parallel_executor_example.py --batch_size=32
CUDA_VISIBLE_DEVICES=0,1,2,3 nvprof -f -o four.nvvp python parallel_executor_example.py --batch_size=32
Setting | copy weights | forward and backward | merge gradient | apply gradient
---|---|---|---|---
1 GPU with NCCL on bp | / | 250 | / | 5
4 GPUs with NCCL on bp | / | 750 (AllReduce takes about 63%) | / | 5
## Save Model (to be implemented)
In the current implementation, the ParallelExecutor's constructor creates a base scope and n sub scopes (where n equals the number of GPUs), and the model is replicated in each sub scope. The save-model function cannot access the sub scopes created by the ParallelExecutor.
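To make the problem concrete, here is a minimal sketch of the current design. All class and variable names (`Scope`, `ParallelExecutor`, `save_model`, `"fc.w"`) are illustrative stand-ins, not the real Paddle C++/Python API:

```python
# Hypothetical sketch of the CURRENT design: the executor privately
# owns a base scope and n sub scopes, one per GPU.
class Scope:
    def __init__(self):
        self.vars = {}        # variable name -> value (e.g. weights)
        self.sub_scopes = []  # child scopes

    def new_sub_scope(self):
        s = Scope()
        self.sub_scopes.append(s)
        return s

class ParallelExecutor:
    def __init__(self, num_gpus):
        # The constructor creates a base scope and n sub scopes;
        # model parameters are replicated into each sub scope.
        self.base_scope = Scope()
        self.sub_scopes = [self.base_scope.new_sub_scope()
                           for _ in range(num_gpus)]
        for s in self.sub_scopes:
            s.vars["fc.w"] = "replicated weights"

# A save-model routine can only read from a scope the user holds:
def save_model(scope):
    return dict(scope.vars)

pe = ParallelExecutor(num_gpus=4)
user_scope = Scope()
# The trained parameters live in pe.sub_scopes, which the user never
# sees, so saving from a user-created scope finds nothing:
print(save_model(user_scope))  # {}
```

The sketch shows the core issue: the scopes holding the parameters are created inside the executor and are unreachable from any scope the user owns.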
## Proposed Solution
ParallelExecutor's constructor creates n-1 sub scopes. ParallelExecutor.run will take a scope parameter, which will be attached as another sub scope. In this way, the user can create a scope and use it both for ParallelExecutor.run and for saving the model.
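The proposed design can be sketched the same way. Again, the names (`Scope`, `ParallelExecutor`, `save_model`, `"fc.w"`) are hypothetical, chosen only to illustrate the scope ownership change:

```python
# Hypothetical sketch of the PROPOSED design: the constructor makes
# only n-1 internal sub scopes, and run() attaches a user-provided
# scope as the remaining one.
class Scope:
    def __init__(self):
        self.vars = {}
        self.sub_scopes = []

    def new_sub_scope(self):
        s = Scope()
        self.sub_scopes.append(s)
        return s

class ParallelExecutor:
    def __init__(self, num_gpus):
        self.base_scope = Scope()
        # Only n-1 sub scopes are created up front.
        self.sub_scopes = [self.base_scope.new_sub_scope()
                           for _ in range(num_gpus - 1)]

    def run(self, user_scope):
        # The user's scope serves as the n-th sub scope, so the
        # parameters it holds are updated alongside the others.
        for s in self.sub_scopes + [user_scope]:
            s.vars["fc.w"] = "updated weights"

def save_model(scope):
    return dict(scope.vars)

pe = ParallelExecutor(num_gpus=4)
user_scope = Scope()
pe.run(user_scope)
# The user now saves directly from the scope they own:
print(save_model(user_scope))  # {'fc.w': 'updated weights'}
```

The key design change is ownership: one replica's scope is created by the user rather than by the executor, so the same object is valid input to both run and the save-model routine.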