Commit b8159349, authored by Yanjun Peng

add customized dataset tutorial

Parent 960a991f
......@@ -279,7 +279,7 @@ Data augmentation requires the `map` function. For details about how to use the
```
2. Define data augmentation operators. The following uses `Resize` as an example:
```python
dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True) # Deocde images.
dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True) # Decode images.
resize_op = transforms.Resize(size=(500,500), interpolation=Inter.LINEAR)
dataset.map(input_columns="image", operations=resize_op)
......
......@@ -148,32 +148,112 @@ MindSpore can also read datasets in the `TFRecord` data format through the `TFRe
```
## Loading a Custom Dataset
You can load a custom dataset using the `GeneratorDataset` object.
In real scenarios, datasets come in various forms. For a custom dataset, or a dataset that cannot be loaded directly through the APIs, there are two options.
One is to convert the dataset to the MindSpore data format (for details, see [Converting Datasets to the MindSpore Data Format](https://www.mindspore.cn/tutorial/en/master/use/data_preparation/converting_datasets.html)). The other is to use the `GeneratorDataset` object.
The following shows how to use `GeneratorDataset`.
1. Define a function (for example, `Generator1D`) to generate a dataset.
> The custom generation function returns the objects that can be called. Each time, tuples of `numpy array` are returned as a row of data.
1. Define an iterable object to generate a dataset. Two examples follow: a custom function that uses `yield`, and a custom class that implements `__getitem__`.
Both of them generate a dataset containing the numbers 0 to 9.
> The custom iterable object returns a tuple of `numpy` arrays as one row of data each time.
An example of a custom function is as follows:
```python
import numpy as np # Import numpy lib.
def Generator1D():
for i in range(64):
def generator_func(num):
for i in range(num):
yield (np.array([i]),) # Note: a one-element tuple needs a trailing comma.
```
2. Transfer `Generator1D` to `GeneratorDataset` to create a dataset and set `column` to data.
An example of a custom class is as follows:
```python
dataset = ds.GeneratorDataset(Generator1D, ["data"])
import numpy as np # Import numpy lib.
class Generator():
def __init__(self, num):
self.num = num
def __getitem__(self, item):
return (np.array([item]),) # Note: a one-element tuple needs a trailing comma.
def __len__(self):
return self.num
```
2. Create a dataset with `GeneratorDataset`. Pass `generator_func` to `GeneratorDataset` to create the dataset `dataset1`, and set the `column` name to `data`.
Instantiate `Generator` and pass it to `GeneratorDataset` to create the dataset `dataset2`, and set the `column` name to `data`.
```python
dataset1 = ds.GeneratorDataset(source=generator_func(10), column_names=["data"], shuffle=False)
dataset2 = ds.GeneratorDataset(source=Generator(10), column_names=["data"], shuffle=False)
```
3. After creating a dataset, create an iterator for the dataset to obtain the corresponding data. Iterator creation methods are as follows:
- Create an iterator whose return value is of the sequence type.
- Create an iterator whose return value is of the sequence type. The following creates iterators for `dataset1` and `dataset2` and prints the output.
```python
for data in dataset.create_tuple_iterator(): # each data is a sequence
print("dataset1:")
for data in dataset1.create_tuple_iterator(): # each data is a sequence
    print(data)
print("dataset2:")
for data in dataset2.create_tuple_iterator(): # each data is a sequence
    print(data)
```
The output is as follows:
```
dataset1:
[array([0], dtype=int64)]
[array([1], dtype=int64)]
[array([2], dtype=int64)]
[array([3], dtype=int64)]
[array([4], dtype=int64)]
[array([5], dtype=int64)]
[array([6], dtype=int64)]
[array([7], dtype=int64)]
[array([8], dtype=int64)]
[array([9], dtype=int64)]
dataset2:
[array([0], dtype=int64)]
[array([1], dtype=int64)]
[array([2], dtype=int64)]
[array([3], dtype=int64)]
[array([4], dtype=int64)]
[array([5], dtype=int64)]
[array([6], dtype=int64)]
[array([7], dtype=int64)]
[array([8], dtype=int64)]
[array([9], dtype=int64)]
```
- Create an iterator whose return value is of the dictionary type.
```python
for data in dataset.create_dict_iterator(): # each data is a dictionary
- Create an iterator whose return value is of the dictionary type. The following creates iterators for `dataset1` and `dataset2` and prints the output.
```python
print("dataset1:")
for data in dataset1.create_dict_iterator(): # each data is a dictionary
    print(data)
print("dataset2:")
for data in dataset2.create_dict_iterator(): # each data is a dictionary
    print(data)
```
The output is as follows:
```
dataset1:
{'data': array([0], dtype=int64)}
{'data': array([1], dtype=int64)}
{'data': array([2], dtype=int64)}
{'data': array([3], dtype=int64)}
{'data': array([4], dtype=int64)}
{'data': array([5], dtype=int64)}
{'data': array([6], dtype=int64)}
{'data': array([7], dtype=int64)}
{'data': array([8], dtype=int64)}
{'data': array([9], dtype=int64)}
dataset2:
{'data': array([0], dtype=int64)}
{'data': array([1], dtype=int64)}
{'data': array([2], dtype=int64)}
{'data': array([3], dtype=int64)}
{'data': array([4], dtype=int64)}
{'data': array([5], dtype=int64)}
{'data': array([6], dtype=int64)}
{'data': array([7], dtype=int64)}
{'data': array([8], dtype=int64)}
{'data': array([9], dtype=int64)}
```
\ No newline at end of file
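
The two source styles added in step 1 rely on different Python protocols: a function containing `yield` produces rows one after another, while a class with `__getitem__` and `__len__` serves rows by index. The sketch below illustrates this in plain Python, without MindSpore. It makes two assumptions that are not in the tutorial: rows hold plain integers instead of `numpy` arrays, and `__getitem__` raises `IndexError` past the end so that an ordinary `for` loop or `list()` call knows where to stop.

```python
def generator_func(num):
    """Yield one-element tuples, one row per iteration."""
    for i in range(num):
        yield (i,)  # trailing comma makes this a tuple, not a parenthesized int


class Generator:
    """Random-access source: rows are fetched by index."""

    def __init__(self, num):
        self.num = num

    def __getitem__(self, item):
        if item >= self.num:
            # Assumption added for this sketch: lets plain iteration terminate.
            raise IndexError(item)
        return (item,)

    def __len__(self):
        return self.num


rows_from_func = list(generator_func(3))
rows_from_class = list(Generator(3))  # falls back to indexing 0, 1, 2, ...

print(rows_from_func)
print(rows_from_class)
print(type((1,)).__name__, type((1)).__name__)  # tuple vs. int
```

Both sources produce the same rows here. The last line shows why the trailing comma matters: `(1,)` is a tuple, whereas `(1)` is just the integer `1`.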
......@@ -278,7 +278,7 @@ MindSpore provides the `c_transforms` and `py_transforms` module functions for users to
```
2. Define data augmentation operators. The following uses `Resize` as an example:
```python
dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True) # Deocde images.
dataset = ds.ImageFolderDatasetV2(DATA_DIR, decode=True) # Decode images.
resize_op = transforms.Resize(size=(500,500), interpolation=Inter.LINEAR)
dataset.map(input_columns="image", operations=resize_op)
......
......@@ -148,32 +148,110 @@ MindSpore also supports reading datasets in the `TFRecord` data format, which can be done through `TFReco
```
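## Loading a Custom Dataset
You can load a custom dataset using the `GeneratorDataset` object.
In real scenarios, datasets come in many varieties. For a custom dataset, or a dataset that cannot currently be loaded directly, there are two options.
One is to convert the dataset to the MindRecord format (see [Converting Datasets to the MindSpore Data Format](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/converting_datasets.html)). The other is to load it through the `GeneratorDataset` object. The following shows how to use `GeneratorDataset`.
1. Define a function (for example, `Generator1D`) to generate a dataset.
> The custom generation function returns a callable object. Each time, a tuple of `numpy array` is returned as a row of data.
1. Define an iterable object to generate a dataset. Two examples follow: a custom function that uses `yield`, and a custom class that implements `__getitem__`. Both generate a dataset containing the numbers 0 to 9.
> The custom iterable object returns a tuple of `numpy array` as one row of data each time.
An example of a custom function is as follows: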
```python
import numpy as np # Import numpy lib.
def Generator1D():
for i in range(64):
def generator_func(num):
for i in range(num):
yield (np.array([i]),) # Note: a one-element tuple needs a trailing comma.
```
2. Pass `Generator1D` to `GeneratorDataset` to create a dataset, and set the `column` name to "data".
An example of a custom class is as follows:
```python
dataset = ds.GeneratorDataset(Generator1D, ["data"])
import numpy as np # Import numpy lib.
class Generator():
def __init__(self, num):
self.num = num
def __getitem__(self, item):
return (np.array([item]),) # Note: a one-element tuple needs a trailing comma.
def __len__(self):
return self.num
```
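2. Create a dataset with `GeneratorDataset`. Pass `generator_func` to `GeneratorDataset` to create the dataset `dataset1`, and set the `column` name to "data".
Pass the defined `Generator` object to `GeneratorDataset` to create the dataset `dataset2`, and set the `column` name to "data".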
```python
dataset1 = ds.GeneratorDataset(source=generator_func(10), column_names=["data"], shuffle=False)
dataset2 = ds.GeneratorDataset(source=Generator(10), column_names=["data"], shuffle=False)
```
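3. After creating a dataset, create an iterator for the dataset to obtain the corresponding data. There are two ways to create an iterator.
- Create an iterator whose return value is of the sequence type.
- Create an iterator whose return value is of the sequence type. The following creates iterators for `dataset1` and `dataset2` and prints the output.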
```python
for data in dataset.create_tuple_iterator(): # each data is a sequence
print(data[0])
print("dataset1:")
for data in dataset1.create_tuple_iterator(): # each data is a sequence
print(data)
print("dataset2:")
for data in dataset2.create_tuple_iterator(): # each data is a sequence
print(data)
```
- Create an iterator whose return value is of the dictionary type.
```python
for data in dataset.create_dict_iterator(): # each data is a dictionary
The output is as follows:
```
dataset1:
[array([0], dtype=int64)]
[array([1], dtype=int64)]
[array([2], dtype=int64)]
[array([3], dtype=int64)]
[array([4], dtype=int64)]
[array([5], dtype=int64)]
[array([6], dtype=int64)]
[array([7], dtype=int64)]
[array([8], dtype=int64)]
[array([9], dtype=int64)]
dataset2:
[array([0], dtype=int64)]
[array([1], dtype=int64)]
[array([2], dtype=int64)]
[array([3], dtype=int64)]
[array([4], dtype=int64)]
[array([5], dtype=int64)]
[array([6], dtype=int64)]
[array([7], dtype=int64)]
[array([8], dtype=int64)]
[array([9], dtype=int64)]
```
- Create an iterator whose return value is of the dictionary type. The following creates iterators for `dataset1` and `dataset2` and prints the output.
```python
print("dataset1:")
for data in dataset1.create_dict_iterator(): # each data is a dictionary
    print(data)
print("dataset2:")
for data in dataset2.create_dict_iterator(): # each data is a dictionary
    print(data)
```
The output is as follows:
```
dataset1:
{'data': array([0], dtype=int64)}
{'data': array([1], dtype=int64)}
{'data': array([2], dtype=int64)}
{'data': array([3], dtype=int64)}
{'data': array([4], dtype=int64)}
{'data': array([5], dtype=int64)}
{'data': array([6], dtype=int64)}
{'data': array([7], dtype=int64)}
{'data': array([8], dtype=int64)}
{'data': array([9], dtype=int64)}
dataset2:
{'data': array([0], dtype=int64)}
{'data': array([1], dtype=int64)}
{'data': array([2], dtype=int64)}
{'data': array([3], dtype=int64)}
{'data': array([4], dtype=int64)}
{'data': array([5], dtype=int64)}
{'data': array([6], dtype=int64)}
{'data': array([7], dtype=int64)}
{'data': array([8], dtype=int64)}
{'data': array([9], dtype=int64)}
```
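
Step 2 passes `generator_func(10)`, which is a generator object rather than the function itself. In plain Python, a generator object can be consumed only once, which is worth keeping in mind if the data is read more than once. The sketch below shows that behavior without MindSpore. `make_source` is a hypothetical helper introduced for illustration, and rows hold plain integers instead of `numpy` arrays. How `GeneratorDataset` itself treats such a source across repeated passes is not stated in this diff and should be checked against the API documentation.

```python
def generator_func(num):
    """Yield one-element tuples, one row per iteration."""
    for i in range(num):
        yield (i,)


gen_object = generator_func(3)   # a generator object
first_pass = list(gen_object)    # consumes all rows
second_pass = list(gen_object)   # nothing left to yield


def make_source():
    """Hypothetical helper: build a fresh generator on every call."""
    return generator_func(3)


pass_a = list(make_source())
pass_b = list(make_source())     # a new generator, so the rows are available again

print(first_pass, second_pass)
print(pass_a == pass_b)
```

A class-based source such as `Generator` does not have this limitation, because each row is computed from its index on demand.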