All other datasets should subclass it. All subclasses should override `__len__`, which provides the size of the dataset, and `__getitem__`, which supports integer indexing in the range from 0 to `len(self)` exclusive.
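A minimal sketch of such a subclass, using a toy `SquaresDataset` (the name and contents are illustrative, not part of the API):

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy dataset: item i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Size of the dataset.
        return self.n

    def __getitem__(self, idx):
        # Integer indexing in [0, len(self)).
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return torch.tensor(idx), torch.tensor(idx ** 2)

ds = SquaresDataset(5)
```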
Dataset to concatenate multiple datasets. This is useful for assembling different existing datasets, possibly large-scale ones, since the concatenation is performed on the fly.
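For example, two small `TensorDataset`s can be joined into one; indices past the end of the first dataset fall through to the second without copying any data:

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

a = TensorDataset(torch.arange(3))        # items 0, 1, 2
b = TensorDataset(torch.arange(10, 14))   # items 10, 11, 12, 13

combined = ConcatDataset([a, b])
# len(combined) is the sum of the lengths; combined[4] resolves to b[1] on the fly.
```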
***dataset** ([_Dataset_](#torch.utils.data.Dataset"torch.utils.data.Dataset")) – dataset from which to load the data.
***batch_size** ([_int_](https://docs.python.org/3/library/functions.html#int"(in Python v3.7)")_,_ _optional_) – how many samples per batch to load (default: `1`).
***shuffle** ([_bool_](https://docs.python.org/3/library/functions.html#bool"(in Python v3.7)")_,_ _optional_) – set to `True` to have the data reshuffled at every epoch (default: `False`).
***sampler** ([_Sampler_](#torch.utils.data.Sampler"torch.utils.data.Sampler")_,_ _optional_) – defines the strategy to draw samples from the dataset. If specified, `shuffle` must be False.
***batch_sampler** ([_Sampler_](#torch.utils.data.Sampler"torch.utils.data.Sampler")_,_ _optional_) – like sampler, but returns a batch of indices at a time. Mutually exclusive with `batch_size`, `shuffle`, `sampler`, and `drop_last`.
***num_workers** ([_int_](https://docs.python.org/3/library/functions.html#int"(in Python v3.7)")_,_ _optional_) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: `0`)
***collate_fn** (_callable__,_ _optional_) – merges a list of samples to form a mini-batch.
***pin_memory** ([_bool_](https://docs.python.org/3/library/functions.html#bool"(in Python v3.7)")_,_ _optional_) – If `True`, the data loader will copy tensors into CUDA pinned memory before returning them.
***drop_last** ([_bool_](https://docs.python.org/3/library/functions.html#bool"(in Python v3.7)")_,_ _optional_) – set to `True` to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If `False` and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: `False`)
***timeout** (_numeric__,_ _optional_) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: `0`)
***worker_init_fn** (_callable__,_ _optional_) – If not `None`, this will be called on each worker subprocess with the worker id (an int in `[0, num_workers - 1]`) as input, after seeding and before data loading. (default: `None`)
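The interaction of `batch_size` and `drop_last` can be seen in a minimal loop over a toy dataset (10 samples, batches of 4):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10).float(), torch.arange(10))

# With drop_last=False (the default), the final batch may be smaller:
# 10 samples with batch_size=4 yields batches of sizes 4, 4, 2.
loader = DataLoader(ds, batch_size=4, shuffle=False, num_workers=0)
batches = list(loader)
```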
Note
By default, each worker will have its PyTorch seed set to `base_seed + worker_id`, where `base_seed` is a long generated by the main process using its RNG. However, seeds for other libraries may be duplicated upon initializing workers (e.g., NumPy), causing each worker to return identical random numbers. (See [My data loader workers return identical random numbers](notes/faq.html#dataloader-workers-random-seed) section in FAQ.) You may use [`torch.initial_seed()`](torch.html#torch.initial_seed"torch.initial_seed") to access the PyTorch seed for each worker in `worker_init_fn`, and use it to set other seeds before data loading.
Warning
If `spawn` start method is used, `worker_init_fn` cannot be an unpicklable object, e.g., a lambda function.
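One way to follow the note above is a top-level (and therefore picklable) `worker_init_fn` that derives a distinct NumPy seed from each worker's PyTorch seed; the function below is a sketch, not the only valid approach:

```python
import numpy as np
import torch

def worker_init_fn(worker_id):
    # torch.initial_seed() differs per worker (base_seed + worker_id),
    # so reducing it modulo 2**32 gives NumPy a distinct valid seed
    # and avoids identical random numbers across workers.
    np.random.seed(torch.initial_seed() % 2 ** 32)
```

Being defined at module level, this function is picklable and therefore safe to use with the `spawn` start method.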
Every Sampler subclass has to provide an `__iter__` method, providing a way to iterate over indices of dataset elements, and a `__len__` method that returns the length of the returned iterator.
Samples elements randomly. If sampling without replacement, elements are drawn from a shuffled dataset. If sampling with replacement, the user can specify `num_samples` to draw.
Parameters:
***data_source** ([_Dataset_](#torch.utils.data.Dataset"torch.utils.data.Dataset")) – dataset to sample from
***num_samples** ([_int_](https://docs.python.org/3/library/functions.html#int"(in Python v3.7)")) – number of samples to draw, default=len(dataset)
***replacement** ([_bool_](https://docs.python.org/3/library/functions.html#bool"(in Python v3.7)")) – samples are drawn with replacement if `True`, default=False
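Without replacement (the default), iterating a `RandomSampler` yields each index exactly once, in shuffled order:

```python
import torch
from torch.utils.data import RandomSampler, TensorDataset

ds = TensorDataset(torch.arange(5))

# Without replacement: the indices form a permutation of 0..len(ds)-1.
sampler = RandomSampler(ds)
indices = list(sampler)
```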
Samples elements from [0,..,len(weights)-1] with given probabilities (weights).
Parameters:
***weights** (_sequence_) – a sequence of weights, not necessarily summing to one
***num_samples** ([_int_](https://docs.python.org/3/library/functions.html#int"(in Python v3.7)")) – number of samples to draw
***replacement** ([_bool_](https://docs.python.org/3/library/functions.html#bool"(in Python v3.7)")) – if `True`, samples are drawn with replacement. If not, they are drawn without replacement, which means that when a sample index is drawn for a row, it cannot be drawn again for that row.
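A small sketch with illustrative weights: index 1 has weight zero and so is never drawn, while index 2 is ten times likelier than index 0:

```python
from torch.utils.data import WeightedRandomSampler

# Weights need not sum to one; an entry of 0.0 is never sampled.
weights = [0.1, 0.0, 1.0]
sampler = WeightedRandomSampler(weights, num_samples=6, replacement=True)
draws = list(sampler)
```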
Wraps another sampler to yield a mini-batch of indices.
Parameters:
***sampler** ([_Sampler_](#torch.utils.data.Sampler"torch.utils.data.Sampler")) – Base sampler.
***batch_size** ([_int_](https://docs.python.org/3/library/functions.html#int"(in Python v3.7)")) – Size of mini-batch.
***drop_last** ([_bool_](https://docs.python.org/3/library/functions.html#bool"(in Python v3.7)")) – If `True`, the sampler will drop the last batch if its size would be less than `batch_size`.
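Wrapping a `SequentialSampler` shows how the indices are chunked; with `drop_last=False` the trailing partial batch is kept:

```python
from torch.utils.data import BatchSampler, SequentialSampler

batches = list(BatchSampler(SequentialSampler(range(10)),
                            batch_size=3, drop_last=False))
# Ten indices in batches of three leave a final batch of one: [9].
```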
Sampler that restricts data loading to a subset of the dataset.
It is especially useful in conjunction with [`torch.nn.parallel.DistributedDataParallel`](nn.html#torch.nn.parallel.DistributedDataParallel"torch.nn.parallel.DistributedDataParallel"). In such a case, each process can pass a `DistributedSampler` instance as a `DataLoader` sampler, and load a subset of the original dataset that is exclusive to it.
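A sketch of the partitioning, with `num_replicas` and `rank` passed explicitly so no process group needs to be initialized (in real distributed training these usually come from the process group instead): two ranks split a 10-element dataset into disjoint halves that together cover it.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

ds = TensorDataset(torch.arange(10))

# One sampler per process; each sees an exclusive subset of the indices.
s0 = DistributedSampler(ds, num_replicas=2, rank=0)
s1 = DistributedSampler(ds, num_replicas=2, rank=1)
```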