add customized dataset tutorial

b8159349 · Yanjun Peng · 960a991f
4 changed files
--- a/tutorials/source_en/use/data_preparation/data_processing_and_augmentation.md
+++ b/tutorials/source_en/use/data_preparation/data_processing_and_augmentation.md
@@ -279,7 +279,7 @@ Data augmentation requires the `map` function. For details about how to use the
    ```
 2. Define data augmentation operators. The following uses `Resize` as an example:
    ```python
-    dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True)  # Deocde images. 
+    dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True)  # Decode images. 
    resize_op = transforms.Resize(size=(500,500), interpolation=Inter.LINEAR)
    dataset.map(input_columns="image", operations=resize_op)


--- a/tutorials/source_en/use/data_preparation/loading_the_datasets.md
+++ b/tutorials/source_en/use/data_preparation/loading_the_datasets.md
@@ -148,32 +148,112 @@ MindSpore can also read datasets in the `TFRecord` data format through the `TFRe
    ```

 ## Loading a Custom Dataset
-You can load a custom dataset using the `GeneratorDataset` object.
+In real scenarios, datasets come in various formats. For a custom dataset, or a dataset that the APIs cannot load directly, there are two options.
+One is to convert the dataset to the MindSpore data format (for details, see [Converting Datasets to the Mindspore Data Format](https://www.mindspore.cn/tutorial/en/master/use/data_preparation/converting_datasets.html)). The other is to use the `GeneratorDataset` object.
+The following shows how to use `GeneratorDataset`.

-1. Define a function (for example, `Generator1D`) to generate a dataset.
-   > The custom generation function returns the objects that can be called. Each time, tuples of `numpy array` are returned as a row of data. 
+1. Define an iterable object to generate a dataset. Two examples follow: a custom function that contains `yield`, and a custom class that implements `__getitem__`.
+   Both generate a dataset containing the numbers from 0 to 9.
+   > The custom iterable object returns a tuple of `numpy` arrays as one row of data each time.

   An example of a custom function is as follows:
   ```python
   import numpy as np  # Import numpy lib.
-   def Generator1D():
-       for i in range(64):
+   def generator_func(num):
+       for i in range(num):
           yield (np.array([i]),)  # Note: a one-element tuple needs a trailing comma.
   ```
-2. Transfer `Generator1D` to `GeneratorDataset` to create a dataset and set `column` to data.  
+   An example of a custom class is as follows:   
   ```python
-   dataset = ds.GeneratorDataset(Generator1D, ["data"])
+   import numpy as np  # Import numpy lib.
+   class Generator():
+      
+       def __init__(self, num):
+           self.num = num
+      
+       def __getitem__(self, item):
+           return (np.array([item]),)  # Note: a one-element tuple needs a trailing comma.
+      
+       def __len__(self):
+           return self.num
+   ```
+   
+2. Create datasets with `GeneratorDataset`. Pass `generator_func` to `GeneratorDataset` to create `dataset1`, and set the column name to `data`.
+   Define a `Generator` object and pass it to `GeneratorDataset` to create `dataset2`, and set the column name to `data`.
+   ```python
+   dataset1 = ds.GeneratorDataset(source=generator_func(10), column_names=["data"], shuffle=False)
+   dataset2 = ds.GeneratorDataset(source=Generator(10), column_names=["data"], shuffle=False)
   ```

 3. After creating a dataset, create an iterator for the dataset to obtain the corresponding data. Iterator creation methods are as follows:
-   - Create an iterator whose return value is of the sequence type.
+   - Create an iterator whose return value is of the sequence type. The following creates iterators for `dataset1` and `dataset2` and prints the output.
      ```python
-      for data in dataset.create_tuple_iterator():  # each data is a sequence
+      print("dataset1:")
+      for data in dataset1.create_tuple_iterator():  # each data is a sequence
+          print(data)
+     
+      print("dataset2:")
+      for data in dataset2.create_tuple_iterator():  # each data is a sequence
-          print(data[0])
+          print(data)
      ```
+     The output is as follows:
+      ```
+      dataset1:
+      [array([0], dtype=int64)]
+      [array([1], dtype=int64)]
+      [array([2], dtype=int64)]
+      [array([3], dtype=int64)]
+      [array([4], dtype=int64)]
+      [array([5], dtype=int64)]
+      [array([6], dtype=int64)]
+      [array([7], dtype=int64)]
+      [array([8], dtype=int64)]
+      [array([9], dtype=int64)]
+      dataset2:
+      [array([0], dtype=int64)]
+      [array([1], dtype=int64)]
+      [array([2], dtype=int64)]
+      [array([3], dtype=int64)]
+      [array([4], dtype=int64)]
+      [array([5], dtype=int64)]
+      [array([6], dtype=int64)]
+      [array([7], dtype=int64)]
+      [array([8], dtype=int64)]
+      [array([9], dtype=int64)]
+      ```

-   - Create an iterator whose return value is of the dictionary type.
-      ```python 
-      for data in dataset.create_dict_iterator():  # each data is a dictionary
+   - Create an iterator whose return value is of the dictionary type. The following creates iterators for `dataset1` and `dataset2` and prints the output.
+      ```python
+      print("dataset1:")
+      for data in dataset1.create_dict_iterator():  # each data is a dictionary
+          print(data)
+     
+      print("dataset2:")
+      for data in dataset2.create_dict_iterator():  # each data is a dictionary
-          print(data["data"])
+          print(data)
      ```
+     The output is as follows:
+     ```
+     dataset1:
+     {'data': array([0], dtype=int64)}
+     {'data': array([1], dtype=int64)}
+     {'data': array([2], dtype=int64)}
+     {'data': array([3], dtype=int64)}
+     {'data': array([4], dtype=int64)}
+     {'data': array([5], dtype=int64)}
+     {'data': array([6], dtype=int64)}
+     {'data': array([7], dtype=int64)}
+     {'data': array([8], dtype=int64)}
+     {'data': array([9], dtype=int64)}
+     dataset2:
+     {'data': array([0], dtype=int64)}
+     {'data': array([1], dtype=int64)}
+     {'data': array([2], dtype=int64)}
+     {'data': array([3], dtype=int64)}
+     {'data': array([4], dtype=int64)}
+     {'data': array([5], dtype=int64)}
+     {'data': array([6], dtype=int64)}
+     {'data': array([7], dtype=int64)}
+     {'data': array([8], dtype=int64)}
+     {'data': array([9], dtype=int64)}
+     ```
\ No newline at end of file
--- a/tutorials/source_zh_cn/use/data_preparation/data_processing_and_augmentation.md
+++ b/tutorials/source_zh_cn/use/data_preparation/data_processing_and_augmentation.md
@@ -278,7 +278,7 @@ MindSpore提供`c_transforms`模块以及`py_transforms`模块函数供用户进
    ```
 2. Define data augmentation operators. The following uses `Resize` as an example:
    ```python
-    dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True)  # Deocde images. 
+    dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True)  # Decode images. 
    resize_op = transforms.Resize(size=(500,500), interpolation=Inter.LINEAR)
    dataset.map(input_columns="image", operations=resize_op)


--- a/tutorials/source_zh_cn/use/data_preparation/loading_the_datasets.md
+++ b/tutorials/source_zh_cn/use/data_preparation/loading_the_datasets.md
@@ -148,32 +148,110 @@ MindSpore也支持读取`TFRecord`数据格式的数据集，可以通过`TFReco
    ```

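 ## Loading a Custom Dataset
-You can load a custom dataset using the `GeneratorDataset` object.
+In real scenarios, datasets come in many varieties. For a custom dataset, or a dataset that cannot currently be loaded directly, there are two options.
+One is to convert the dataset to the MindRecord format (see [Converting Datasets to the MindSpore Data Format](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/converting_datasets.html)). The other is to load it through the `GeneratorDataset` object. The following shows how to use `GeneratorDataset`.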

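-1. Define a function (for example, `Generator1D`) to generate a dataset.
-   > The custom generation function returns a callable object. Each time, a tuple of `numpy array` is returned as one row of data.
+1. Define an iterable object to generate a dataset. Two examples follow: a custom function that contains `yield`, and a custom class that implements `__getitem__`. Both generate a dataset containing the numbers from 0 to 9.
+   > The custom iterable object returns a tuple of `numpy array` as one row of data each time.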

   An example of a custom function is as follows:
   ```python
   import numpy as np  # Import numpy lib.
-   def Generator1D():
-       for i in range(64):
+   def generator_func(num):
+       for i in range(num):
           yield (np.array([i]),)  # Note: a one-element tuple needs a trailing comma.
   ```
-2. Pass `Generator1D` to `GeneratorDataset` to create a dataset and set the `column` name to `data`.
+   
+   An example of a custom class is as follows:
   ```python
-   dataset = ds.GeneratorDataset(Generator1D, ["data"])
+   import numpy as np  # Import numpy lib.
+   class Generator():
+      
+       def __init__(self, num):
+           self.num = num
+      
+       def __getitem__(self, item):
+           return (np.array([item]),)  # Note: a one-element tuple needs a trailing comma.
+      
+       def __len__(self):
+           return self.num
+   ```
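+2. Create datasets with `GeneratorDataset`. Pass `generator_func` to `GeneratorDataset` to create the dataset `dataset1`, and set the `column` name to `data`.
+   Pass the defined `Generator` object to `GeneratorDataset` to create the dataset `dataset2`, and set the `column` name to `data`.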
+   ```python
+   dataset1 = ds.GeneratorDataset(source=generator_func(10), column_names=["data"], shuffle=False)
+   dataset2 = ds.GeneratorDataset(source=Generator(10), column_names=["data"], shuffle=False)
   ```

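 3. After creating a dataset, create an iterator for the dataset to obtain the corresponding data. There are two ways to create an iterator:
-   - Create an iterator whose return value is of the sequence type.
+   - Create an iterator whose return value is of the sequence type. The following creates iterators for `dataset1` and `dataset2` and prints the data to inspect the results.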
      ```python
-      for data in dataset.create_tuple_iterator():  # each data is a sequence
-          print(data[0])
+      print("dataset1:")
+      for data in dataset1.create_tuple_iterator():  # each data is a sequence
+          print(data)
+     
+      print("dataset2:")
+      for data in dataset2.create_tuple_iterator():  # each data is a sequence
+          print(data)
      ```
-
-   - Create an iterator whose return value is of the dictionary type.
-      ```python 
-      for data in dataset.create_dict_iterator():  # each data is a dictionary
+     The output is as follows:
+     ```
+     dataset1:
+     [array([0], dtype=int64)]
+     [array([1], dtype=int64)]
+     [array([2], dtype=int64)]
+     [array([3], dtype=int64)]
+     [array([4], dtype=int64)]
+     [array([5], dtype=int64)]
+     [array([6], dtype=int64)]
+     [array([7], dtype=int64)]
+     [array([8], dtype=int64)]
+     [array([9], dtype=int64)]
+     dataset2:
+     [array([0], dtype=int64)]
+     [array([1], dtype=int64)]
+     [array([2], dtype=int64)]
+     [array([3], dtype=int64)]
+     [array([4], dtype=int64)]
+     [array([5], dtype=int64)]
+     [array([6], dtype=int64)]
+     [array([7], dtype=int64)]
+     [array([8], dtype=int64)]
+     [array([9], dtype=int64)]
+     ```
+
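+   - Create an iterator whose return value is of the dictionary type. The following creates iterators for `dataset1` and `dataset2` and prints the data to inspect the results.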
+      ```python
+      print("dataset1:")
+      for data in dataset1.create_dict_iterator():  # each data is a dictionary
+          print(data)
+     
+      print("dataset2:")
+      for data in dataset2.create_dict_iterator():  # each data is a dictionary
-          print(data["data"])
+          print(data)
      ```
+     The output is as follows:
+     ```
+     dataset1:
+     {'data': array([0], dtype=int64)}
+     {'data': array([1], dtype=int64)}
+     {'data': array([2], dtype=int64)}
+     {'data': array([3], dtype=int64)}
+     {'data': array([4], dtype=int64)}
+     {'data': array([5], dtype=int64)}
+     {'data': array([6], dtype=int64)}
+     {'data': array([7], dtype=int64)}
+     {'data': array([8], dtype=int64)}
+     {'data': array([9], dtype=int64)}
+     dataset2:
+     {'data': array([0], dtype=int64)}
+     {'data': array([1], dtype=int64)}
+     {'data': array([2], dtype=int64)}
+     {'data': array([3], dtype=int64)}
+     {'data': array([4], dtype=int64)}
+     {'data': array([5], dtype=int64)}
+     {'data': array([6], dtype=int64)}
+     {'data': array([7], dtype=int64)}
+     {'data': array([8], dtype=int64)}
+     {'data': array([9], dtype=int64)}
+     ```
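
The two source styles from step 1 can be compared without MindSpore installed. The following is a minimal sketch and not part of the patch. It uses only the Python standard library, replaces the `numpy` arrays with plain lists, and introduces a hypothetical helper, `rows_from_indexable`, that reads a `__getitem__`/`__len__` object by index the way a loader plausibly would. It also shows that a generator object such as `generator_func(10)` is exhausted after one pass, which is standard Python behaviour. Whether `GeneratorDataset` works around this is not stated in the diff.

```python
def generator_func(num):
    """Function-style source: yields one-element tuples, as in the tutorial."""
    for i in range(num):
        yield ([i],)  # trailing comma makes this a one-element tuple


class Generator:
    """Class-style source: random access through __getitem__ and __len__."""

    def __init__(self, num):
        self.num = num

    def __getitem__(self, item):
        return ([item],)

    def __len__(self):
        return self.num


def rows_from_indexable(source):
    """Hypothetical helper: collect every row by index."""
    return [source[i] for i in range(len(source))]


gen = generator_func(10)
first_pass = list(gen)
second_pass = list(gen)  # the generator object is already exhausted

class_rows = rows_from_indexable(Generator(10))

print(len(first_pass), len(second_pass), len(class_rows))
print(first_pass == class_rows)
```

Both styles produce the same ten rows on the first pass. Only the class-style source can be read again without being re-created, which matters when the same source object is iterated more than once.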