Two questions about distributed training for image classification
Created by: TheodoreG
I have been using the distributed training code for image classification at https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleCV/image_classification/dist_train/dist_train.py and have a couple of questions.
- The variable `nccl_id_var` seems to stay `None` for the whole program, and it does not change even after `append_nccl2_prepare()` is called. By the way, `append_nccl2_prepare()` appears to return nothing (see the first sketch below).
- In the training loop, I cannot tell how the program distributes different training batches to different devices. Ideally, each device should train on a different batch; or perhaps `ParallelExecutor` does this for me? (See the second sketch below.)
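To make the first question concrete, here is a minimal plain-Python sketch (not the actual `dist_train.py` code; the function body is hypothetical) of the pattern I think I am seeing: rebinding a name inside a function does not affect the caller's variable, so unless `append_nccl2_prepare()` returns the variable it creates, the caller's `nccl_id_var` would stay `None`:

```python
def append_nccl2_prepare_sketch():
    # Created and bound only inside the function's local scope;
    # in the real code, ops would be appended to the startup program here.
    nccl_id_var = "NCCLID"
    # Note: nothing is returned, which is the behavior I am asking about.

nccl_id_var = None
append_nccl2_prepare_sketch()
print(nccl_id_var)  # prints: None -- the caller's binding is unchanged
```

Is this just Python scoping, i.e. the `nccl_id_var` that matters lives only inside the program description, or should the function be returning it?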
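For the second question, this is a minimal sketch of how I currently understand the feeding pattern (a toy network under assumed Fluid 1.x APIs, not the actual `dist_train.py` code): one whole batch is fed per step, and my assumption is that `ParallelExecutor` splits it across devices along the batch dimension.

```python
import paddle
import paddle.fluid as fluid

# Tiny stand-in network, just for illustration.
image = fluid.layers.data(name='image', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
fc = fluid.layers.fc(input=image, size=10, act='softmax')
loss = fluid.layers.cross_entropy(input=fc, label=label)
avg_loss = fluid.layers.mean(loss)
fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_loss)

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

pe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_loss.name)
feeder = fluid.DataFeeder(place=place, feed_list=[image, label])

reader = paddle.batch(paddle.dataset.mnist.train(), batch_size=64)
for batch in reader():
    # I feed one whole batch here. Does ParallelExecutor scatter it
    # across all visible GPUs, or do I need to shard it myself?
    loss_v, = pe.run(fetch_list=[avg_loss.name], feed=feeder.feed(batch))
```

If `ParallelExecutor` does scatter the fed batch per device, a pointer to where that happens would clear this up.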