Dataset on cloud - partition randomly by PaddlePaddle
Created by: helinwang
For datasets on the cloud, we currently only allow the user to upload the dataset and an index file; we then partition the dataset for the user (according to https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/data_dispatch.md#上传训练文件 ).
@emailweixu suggested that we randomize the order of the data instances before partitioning. This is crucial because the index file may list instances in an order that gives a very bad label distribution for training (e.g., for digit classification, labels in order: 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3....). Partitioning such a file as-is would give each partition only one or two labels.
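
A minimal sketch of the idea, assuming the index file can be read as a list with one entry per data instance and that a round-robin split after shuffling is acceptable. The names `shuffle_and_partition`, `num_partitions`, and `seed` are hypothetical and not part of PaddlePaddle:

```python
import random


def shuffle_and_partition(index_entries, num_partitions, seed=None):
    """Shuffle index entries, then split them into num_partitions lists.

    Hypothetical helper for illustration only; the real index file
    format and partitioning scheme may differ.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    rng = random.Random(seed)      # local RNG so the shuffle is reproducible
    shuffled = list(index_entries)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    # Round-robin split: partition i takes entries i, i + n, i + 2n, ...
    return [shuffled[i::num_partitions] for i in range(num_partitions)]


# Sorted labels like the digit example above: 0,0,0,0,1,1,1,1,...
labels = [digit for digit in range(4) for _ in range(4)]
partitions = shuffle_and_partition(labels, num_partitions=4, seed=0)
for i, part in enumerate(partitions):
    print(i, part)
```

Without the shuffle, splitting the sorted list into contiguous chunks would put a single label in each partition; shuffling first makes each partition a random sample of the whole dataset.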