Dataset on cloud - partition randomly by PaddlePaddle (#1915) · Issue · PaddlePaddle / Paddle

Dataset on cloud - partition randomly by PaddlePaddle

Created by: helinwang

For datasets on the cloud, we currently only allow the user to upload the dataset and an index file; we then partition the dataset for the user (see https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/data_dispatch.md#上传训练文件 ).

@emailweixu suggested that we need to randomize the order of the data instances before the partition happens. This is crucial because the index file may list instances in an order that gives each partition a very bad label distribution for training (e.g., for digit classification, the labels may be in order: 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3....).
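
To make the suggestion concrete, here is a minimal sketch of shuffling the index entries before splitting them into partitions. It is only an illustration, not PaddlePaddle's implementation: the function name `partition_randomly`, the `"<file> <label>"` index line format, and the round-robin split are all assumptions made for this example.

```python
import random


def partition_randomly(index_lines, num_partitions, seed=None):
    """Shuffle index entries, then deal them round-robin into partitions.

    Hypothetical helper for illustration only; not part of PaddlePaddle.
    """
    rng = random.Random(seed)      # private RNG so a seed gives a repeatable split
    shuffled = list(index_lines)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)          # randomize instance order before partitioning
    # Entry k of the shuffled list goes to partition k % num_partitions.
    return [shuffled[i::num_partitions] for i in range(num_partitions)]


# Index file with labels sorted as in the example above: 0,0,0,0,1,1,1,1,...
# The "<file> <label>" line format is assumed for this sketch.
labels = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
index = [f"img_{i}.png {label}" for i, label in enumerate(labels)]

parts = partition_randomly(index, num_partitions=4, seed=0)
for n, part in enumerate(parts):
    print(n, [line.split()[1] for line in part])
```

Without the shuffle, cutting this index into four contiguous chunks would give each partition a single label. With the shuffle, the labels in each partition are mixed in expectation, though a single random split is not guaranteed to be balanced.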
