diff --git a/tutorials/source_en/use/data_preparation/data_processing_and_augmentation.md b/tutorials/source_en/use/data_preparation/data_processing_and_augmentation.md
index 94973c079f3c50f3409349aae8b7775afff89826..ee669c4d3452f00e8ba08b36ed877348a3b6ac94 100644
--- a/tutorials/source_en/use/data_preparation/data_processing_and_augmentation.md
+++ b/tutorials/source_en/use/data_preparation/data_processing_and_augmentation.md
@@ -279,7 +279,7 @@ Data augmentation requires the `map` function. For details about how to use the
     ```
 2. Define data augmentation operators. The following uses `Resize` as an example:
     ```python
-    dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True)  # Deocde images.
+    dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True)  # Decode images.
     resize_op = transforms.Resize(size=(500,500), interpolation=Inter.LINEAR)
     dataset.map(input_columns="image", operations=resize_op)
diff --git a/tutorials/source_en/use/data_preparation/loading_the_datasets.md b/tutorials/source_en/use/data_preparation/loading_the_datasets.md
index 6a47a1a22b3b896b632aaf386421184f93306909..8f36ba5eef11c869344f9e48ea81e49f183d7f57 100644
--- a/tutorials/source_en/use/data_preparation/loading_the_datasets.md
+++ b/tutorials/source_en/use/data_preparation/loading_the_datasets.md
@@ -148,32 +148,112 @@ MindSpore can also read datasets in the `TFRecord` data format through the `TFRe
     ```
 ## Loading a Custom Dataset
-You can load a custom dataset using the `GeneratorDataset` object.
+In real scenarios, there are various kinds of datasets. For a custom dataset, or a dataset that cannot be loaded by the APIs directly, there are two options.
+One is to convert the dataset to the MindSpore data format (for details, see [Converting Datasets to the MindSpore Data Format](https://www.mindspore.cn/tutorial/en/master/use/data_preparation/converting_datasets.html)). The other is to use the `GeneratorDataset` object.
+The following shows how to use `GeneratorDataset`.
-1. 
Define a function (for example, `Generator1D`) to generate a dataset.
-    > The custom generation function returns the objects that can be called. Each time, tuples of `numpy array` are returned as a row of data.
+1. Define an iterable object to generate a dataset. Two examples follow: a custom function that contains `yield`, and a custom class that implements `__getitem__`.
+   Both generate a dataset containing the numbers 0 to 9.
+    > The custom iterable object returns a tuple of `numpy` arrays as a row of data each time.

+    An example of a custom function is as follows:
     ```python
     import numpy as np  # Import numpy lib.
-    def Generator1D():
-        for i in range(64):
+    def generator_func(num):
+        for i in range(num):
             yield (np.array([i]),)  # Note: a tuple of one element requires a trailing comma.
     ```
-2. Transfer `Generator1D` to `GeneratorDataset` to create a dataset and set `column` to data.
+
+    An example of a custom class is as follows:
     ```python
-    dataset = ds.GeneratorDataset(Generator1D, ["data"])
+    import numpy as np  # Import numpy lib.
+    class Generator():
+
+        def __init__(self, num):
+            self.num = num
+
+        def __getitem__(self, item):
+            return (np.array([item]),)  # Note: a tuple of one element requires a trailing comma.
+
+        def __len__(self):
+            return self.num
+    ```
+
+2. Create the datasets with `GeneratorDataset`. Pass `generator_func` to `GeneratorDataset` to create `dataset1`, and pass a `Generator` instance to create `dataset2`.
+   In both cases, set `column_names` to `["data"]`.
+    ```python
+    dataset1 = ds.GeneratorDataset(source=generator_func(10), column_names=["data"], shuffle=False)
+    dataset2 = ds.GeneratorDataset(source=Generator(10), column_names=["data"], shuffle=False)
     ```
 3. After creating a dataset, create an iterator for the dataset to obtain the corresponding data. 
Iterator creation methods are as follows:
-   - Create an iterator whose return value is of the sequence type.
+   - Create an iterator whose return value is of the sequence type. The following creates iterators for `dataset1` and `dataset2` and prints the output.
     ```python
-    for data in dataset.create_tuple_iterator():  # each data is a sequence
-        print(data[0])
+    print("dataset1:")
+    for data in dataset1.create_tuple_iterator():  # each data is a sequence
+        print(data)
+
+    print("dataset2:")
+    for data in dataset2.create_tuple_iterator():  # each data is a sequence
+        print(data)
     ```
+    The output is as follows:
+    ```
+    dataset1:
+    [array([0], dtype=int64)]
+    [array([1], dtype=int64)]
+    [array([2], dtype=int64)]
+    [array([3], dtype=int64)]
+    [array([4], dtype=int64)]
+    [array([5], dtype=int64)]
+    [array([6], dtype=int64)]
+    [array([7], dtype=int64)]
+    [array([8], dtype=int64)]
+    [array([9], dtype=int64)]
+    dataset2:
+    [array([0], dtype=int64)]
+    [array([1], dtype=int64)]
+    [array([2], dtype=int64)]
+    [array([3], dtype=int64)]
+    [array([4], dtype=int64)]
+    [array([5], dtype=int64)]
+    [array([6], dtype=int64)]
+    [array([7], dtype=int64)]
+    [array([8], dtype=int64)]
+    [array([9], dtype=int64)]
+    ```

-   - Create an iterator whose return value is of the dictionary type.
-    ```python
-    for data in dataset.create_dict_iterator():  # each data is a dictionary
+   - Create an iterator whose return value is of the dictionary type. The following creates iterators for `dataset1` and `dataset2` and prints the output. 
+    ```python
+    print("dataset1:")
+    for data in dataset1.create_dict_iterator():  # each data is a dictionary
+        print(data)
+
+    print("dataset2:")
+    for data in dataset2.create_dict_iterator():  # each data is a dictionary
+        print(data)
     ```
+    The output is as follows:
+    ```
+    dataset1:
+    {'data': array([0], dtype=int64)}
+    {'data': array([1], dtype=int64)}
+    {'data': array([2], dtype=int64)}
+    {'data': array([3], dtype=int64)}
+    {'data': array([4], dtype=int64)}
+    {'data': array([5], dtype=int64)}
+    {'data': array([6], dtype=int64)}
+    {'data': array([7], dtype=int64)}
+    {'data': array([8], dtype=int64)}
+    {'data': array([9], dtype=int64)}
+    dataset2:
+    {'data': array([0], dtype=int64)}
+    {'data': array([1], dtype=int64)}
+    {'data': array([2], dtype=int64)}
+    {'data': array([3], dtype=int64)}
+    {'data': array([4], dtype=int64)}
+    {'data': array([5], dtype=int64)}
+    {'data': array([6], dtype=int64)}
+    {'data': array([7], dtype=int64)}
+    {'data': array([8], dtype=int64)}
+    {'data': array([9], dtype=int64)}
+    ```
\ No newline at end of file
diff --git a/tutorials/source_zh_cn/use/data_preparation/data_processing_and_augmentation.md b/tutorials/source_zh_cn/use/data_preparation/data_processing_and_augmentation.md
index 13fc97b52e907834b2d88751ae82adc472a1803e..190440fb1e9addb7a4a13d6b2498da3fd143ac71 100644
--- a/tutorials/source_zh_cn/use/data_preparation/data_processing_and_augmentation.md
+++ b/tutorials/source_zh_cn/use/data_preparation/data_processing_and_augmentation.md
@@ -278,7 +278,7 @@ MindSpore提供`c_transforms`模块以及`py_transforms`模块函数供用户进
     ```
 2. 定义数据增强算子,以`Resize`为例:
     ```python
-    dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True)  # Deocde images.
+    dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True)  # Decode images. 
resize_op = transforms.Resize(size=(500,500), interpolation=Inter.LINEAR)
     dataset.map(input_columns="image", operations=resize_op)
diff --git a/tutorials/source_zh_cn/use/data_preparation/loading_the_datasets.md b/tutorials/source_zh_cn/use/data_preparation/loading_the_datasets.md
index 3186acaa2bd0c2779064fc2ae411bd324d8898f2..9e204a1a497e5ea6dd8e14b656630d2ec4f4865a 100644
--- a/tutorials/source_zh_cn/use/data_preparation/loading_the_datasets.md
+++ b/tutorials/source_zh_cn/use/data_preparation/loading_the_datasets.md
@@ -148,32 +148,110 @@ MindSpore也支持读取`TFRecord`数据格式的数据集,可以通过`TFReco
     ```
 ## 加载自定义数据集
-对于自定义数据集,可以通过`GeneratorDataset`对象加载。
+现实场景中,数据集的种类多种多样,对于自定义数据集或者目前不支持直接加载的数据集,有两种方法可以处理。
+一种方法是将数据集转成MindRecord格式(请参考[将数据集转换为MindSpore数据格式](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/converting_datasets.html)章节),另一种方法是通过`GeneratorDataset`对象加载,以下将展示如何使用`GeneratorDataset`。
-1. 定义一个函数(示例函数名为`Generator1D`)用于生成数据集的函数。
-    > 自定义的生成函数返回的是可调用的对象,每次返回`numpy array`的元组,作为一行数据。
+1. 定义一个可迭代的对象,用于生成数据集。以下展示了两种示例,一种是含有`yield`返回值的自定义函数,另一种是含有`__getitem__`的自定义类。两种示例都将产生一个含有从0到9数字的数据集。
+    > 自定义的可迭代对象,每次返回`numpy array`的元组,作为一行数据。

+    自定义函数示例如下:
     ```python
     import numpy as np  # Import numpy lib.
-    def Generator1D():
-        for i in range(64):
+    def generator_func(num):
+        for i in range(num):
             yield (np.array([i]),)  # Note: a tuple of one element requires a trailing comma.
     ```
-2. 将`Generator1D`传入`GeneratorDataset`创建数据集,并设定`column`名为“data”。
+
+    自定义类示例如下:
     ```python
-    dataset = ds.GeneratorDataset(Generator1D, ["data"])
+    import numpy as np  # Import numpy lib.
+    class Generator():
+
+        def __init__(self, num):
+            self.num = num
+
+        def __getitem__(self, item):
+            return (np.array([item]),)  # Note: a tuple of one element requires a trailing comma.
+
+        def __len__(self):
+            return self.num
+    ```
+2. 
使用`GeneratorDataset`创建数据集。将`generator_func`传入`GeneratorDataset`创建数据集`dataset1`,并设定`column_names`为`["data"]`。
+   将定义的`Generator`对象传入`GeneratorDataset`创建数据集`dataset2`,并设定`column_names`为`["data"]`。
+    ```python
+    dataset1 = ds.GeneratorDataset(source=generator_func(10), column_names=["data"], shuffle=False)
+    dataset2 = ds.GeneratorDataset(source=Generator(10), column_names=["data"], shuffle=False)
     ```
 3. 在创建数据集后,可以通过给数据创建迭代器的方式,获取相应的数据。有两种创建迭代器的方法。
-   - 创建返回值为序列类型的迭代器。
+   - 创建返回值为序列类型的迭代器。以下分别对`dataset1`和`dataset2`创建迭代器,并打印输出数据观察结果。
     ```python
-    for data in dataset.create_tuple_iterator():  # each data is a sequence
-        print(data[0])
+    print("dataset1:")
+    for data in dataset1.create_tuple_iterator():  # each data is a sequence
+        print(data)
+
+    print("dataset2:")
+    for data in dataset2.create_tuple_iterator():  # each data is a sequence
+        print(data)
     ```
+    输出如下所示:
+    ```
+    dataset1:
+    [array([0], dtype=int64)]
+    [array([1], dtype=int64)]
+    [array([2], dtype=int64)]
+    [array([3], dtype=int64)]
+    [array([4], dtype=int64)]
+    [array([5], dtype=int64)]
+    [array([6], dtype=int64)]
+    [array([7], dtype=int64)]
+    [array([8], dtype=int64)]
+    [array([9], dtype=int64)]
+    dataset2:
+    [array([0], dtype=int64)]
+    [array([1], dtype=int64)]
+    [array([2], dtype=int64)]
+    [array([3], dtype=int64)]
+    [array([4], dtype=int64)]
+    [array([5], dtype=int64)]
+    [array([6], dtype=int64)]
+    [array([7], dtype=int64)]
+    [array([8], dtype=int64)]
+    [array([9], dtype=int64)]
+    ```

-   - 创建返回值为字典类型的迭代器。
-    ```python
-    for data in dataset.create_dict_iterator():  # each data is a dictionary
+   - 创建返回值为字典类型的迭代器。以下分别对`dataset1`和`dataset2`创建迭代器,并打印输出数据观察结果。
+    ```python
+    print("dataset1:")
+    for data in dataset1.create_dict_iterator():  # each data is a dictionary
+        print(data)
+
+    print("dataset2:")
+    for data in dataset2.create_dict_iterator():  # each data is a dictionary
+        print(data)
     ```
+    输出如下所示:
+    ```
+    dataset1:
+    {'data': array([0], dtype=int64)}
+    {'data': array([1], dtype=int64)}
+    {'data': array([2], 
dtype=int64)} + {'data': array([3], dtype=int64)} + {'data': array([4], dtype=int64)} + {'data': array([5], dtype=int64)} + {'data': array([6], dtype=int64)} + {'data': array([7], dtype=int64)} + {'data': array([8], dtype=int64)} + {'data': array([9], dtype=int64)} + dataset2: + {'data': array([0], dtype=int64)} + {'data': array([1], dtype=int64)} + {'data': array([2], dtype=int64)} + {'data': array([3], dtype=int64)} + {'data': array([4], dtype=int64)} + {'data': array([5], dtype=int64)} + {'data': array([6], dtype=int64)} + {'data': array([7], dtype=int64)} + {'data': array([8], dtype=int64)} + {'data': array([9], dtype=int64)} + ```
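
The row contract that `GeneratorDataset` expects from both source styles in the patch above can be checked without MindSpore installed. The following is a minimal, dependency-free sketch (it is not part of the patch): `generator_func` and `Generator` mirror the tutorial's names, but plain Python lists stand in for `numpy` arrays so the snippet runs anywhere.

```python
# Sketch of the two GeneratorDataset source styles from the tutorial,
# independent of MindSpore. Plain lists stand in for numpy arrays.

def generator_func(num):
    """Iterable source: yields one row per step."""
    for i in range(num):
        yield ([i],)  # trailing comma makes this a one-element tuple


class Generator:
    """Random-access source: implements __getitem__ and __len__."""

    def __init__(self, num):
        self.num = num

    def __getitem__(self, item):
        return ([item],)  # one row, as a one-element tuple

    def __len__(self):
        return self.num


# Both styles produce the same sequence of rows.
rows_from_func = list(generator_func(3))
gen = Generator(3)
rows_from_class = [gen[i] for i in range(len(gen))]

print(rows_from_func)                      # [([0],), ([1],), ([2],)]
print(rows_from_func == rows_from_class)   # True
```

Either object can then be passed as `source=` when constructing the dataset, as the patch shows; the key point is that every row is a tuple, so a single column still needs the trailing comma.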