Created by: Yancey1989
Fixed #11143 (closed)
After parallel bcast, this PR improves training performance by about 15% on vgg + flowers with 2 trainers and 2 pservers.
overlap memcpy branch:
Pass = 0, Elapsed = 166, Training performance = 36.934574 imgs/s, Train accuracy = 0.027809, Test accuracy = 0.009874
Pass = 1, Elapsed = 163, Training performance = 37.542919 imgs/s, Train accuracy = 0.038393, Test accuracy = 0.009804
develop branch:
Pass = 0, Elapsed = 195, Training performance = 31.470824 imgs/s, Train accuracy = 0.028873, Test accuracy = 0.008731
Pass = 1, Elapsed = 177, Training performance = 34.738449 imgs/s, Train accuracy = 0.044999, Test accuracy = 0.015925
The improvement should be larger on resnet, because its parameter size is smaller than vgg's.
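For anyone who wants to re-derive the speedup from the logs above, here is a minimal sketch in Python. It only parses the `Training performance` field of the four log lines quoted in this description; the helper names (`mean_throughput`, `speedup_percent`) are made up for illustration and are not part of the benchmark script.

```python
import re

# Log lines copied from this PR description, grouped by branch.
LOGS = {
    "overlap_memcpy": [
        "Pass = 0, Elapsed = 166, Training performance = 36.934574 imgs/s",
        "Pass = 1, Elapsed = 163, Training performance = 37.542919 imgs/s",
    ],
    "develop": [
        "Pass = 0, Elapsed = 195, Training performance = 31.470824 imgs/s",
        "Pass = 1, Elapsed = 177, Training performance = 34.738449 imgs/s",
    ],
}

THROUGHPUT = re.compile(r"Training performance = ([\d.]+) imgs/s")


def throughputs(lines):
    """Extract the imgs/s value from each log line."""
    return [float(THROUGHPUT.search(line).group(1)) for line in lines]


def mean_throughput(lines):
    """Average imgs/s over all passes of one branch."""
    values = throughputs(lines)
    return sum(values) / len(values)


def speedup_percent(new, old):
    """Relative throughput gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    new = throughputs(LOGS["overlap_memcpy"])
    old = throughputs(LOGS["develop"])
    # Per-pass comparison, then the comparison of mean throughput.
    for i, (n, o) in enumerate(zip(new, old)):
        print(f"pass {i}: {speedup_percent(n, o):.1f}%")
    overall = speedup_percent(
        mean_throughput(LOGS["overlap_memcpy"]),
        mean_throughput(LOGS["develop"]),
    )
    print(f"mean throughput: {overall:.1f}%")
```

With only two passes per branch the per-pass numbers vary noticeably, so the result is best read as a rough estimate rather than a precise measurement.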