Created by: yaoxuefeng6
PR types
Others
PR changes
Others
Describe
Add data_generator to paddle.distributed.fleet.dataset to better illustrate the dataset class.
Changes based on the review comments on this PR: https://github.com/PaddlePaddle/Paddle/pull/27133#pullrequestreview-489234912
1. Mark the new dataset API docs as static-mode only.
2. Mark the dataset API in fluid as deprecated.
3. Update the example code to fit the 2.0 APIs.
The MyDataGenerator class, based on MultiSlotDataGenerator (my_data_generator.py):
import paddle
import paddle.distributed.fleet as fleet

class MyDataGenerator(fleet.MultiSlotDataGenerator):
    def generate_sample(self, line):
        def data_iter():
            for i in range(10000):
                yield ("words", [1, 2, 3, 4]), ("label", [0])
        return data_iter

if __name__ == "__main__":
    d = MyDataGenerator()
    d.run_from_stdin()
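For reference, here is a minimal sketch of the text line that a generator like the one above is expected to write to stdout for each sample. It assumes the multi-slot text format, in which every slot is serialized as its value count followed by its values. `format_multislot_sample` is a hypothetical helper written for illustration and is not part of Paddle.

```python
def format_multislot_sample(sample):
    """Serialize one sample as '<count> <v1> ... <vn>' per slot, space separated.

    Assumption: this mirrors the multi-slot text format; slot names are
    used only for readability and are not written to the output line.
    """
    fields = []
    for _slot_name, values in sample:
        if not values:
            raise ValueError("each slot needs at least one value")
        fields.append(str(len(values)))
        fields.extend(str(v) for v in values)
    return " ".join(fields)


# Same sample that generate_sample() yields in the PR example.
sample = (("words", [1, 2, 3, 4]), ("label", [0]))
line = format_multislot_sample(sample)
print(line)
```

Under that assumption, each yielded sample becomes one whitespace-separated line, which the dataset then maps onto the variables passed through `use_var`.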
Demo: using the data generator with paddle.distributed.InMemoryDataset/QueueDataset:
import paddle

paddle.enable_static()

slots = ["slot1", "slot2", "slot3", "slot4"]
slots_vars = []
for slot in slots:
    var = paddle.static.data(name=slot, shape=[1], dtype="int64", lod_level=1)
    slots_vars.append(var)

# create dataset instance directly with distributed.InMemoryDataset
dataset = paddle.distributed.InMemoryDataset()

# call init() to initialize single node related settings once.
# use my_data_generator in pipecommand
dataset.init(
    batch_size=32,
    thread_num=3,
    pipe_command="python my_data_generator.py",
    use_var=slots_vars)
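Here is a rough sketch of what `pipe_command` implies. It assumes that the dataset pipes the raw lines of each input file into the command's stdin and parses whatever the command prints on stdout. The inline script is a hypothetical stand-in for `my_data_generator.py` so that the snippet runs without Paddle installed. For brevity it prints one record per input line, not the 10000 that the PR example yields.

```python
import subprocess
import sys

# Hypothetical stand-in for my_data_generator.py: reads raw lines from stdin
# and prints one multi-slot record per input line.
STAND_IN_SCRIPT = (
    "import sys\n"
    "for _ in sys.stdin:\n"
    "    print('4 1 2 3 4 1 0')\n"
)


def run_pipe_command(raw_lines):
    """Feed raw_lines to the stand-in command and return its stdout lines."""
    result = subprocess.run(
        [sys.executable, "-c", STAND_IN_SCRIPT],
        input="".join(raw + "\n" for raw in raw_lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


records = run_pipe_command(["raw line a", "raw line b"])
print(records)
```

This is also a convenient way to check a generator script on its own before wiring it into `dataset.init()`: pipe a few raw lines into it and inspect the records it prints.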