# DataPipe Tutorial

> Original: [https://pytorch.org/data/beta/dp_tutorial.html](https://pytorch.org/data/beta/dp_tutorial.html)

## Using DataPipes

Suppose that we want to load data from CSV files with the following steps:

- List all CSV files in a directory
- Load CSV files
- Parse CSV files and yield rows
- Split our dataset into training and validation sets

There are a few [built-in DataPipes](torchdata.datapipes.iter.html) that can help us with the above operations.

- `FileLister` - [lists out files in a directory](generated/torchdata.datapipes.iter.FileLister.html)
- `Filter` - [filters the elements in a DataPipe based on a given function](generated/torchdata.datapipes.iter.Filter.html)
- `FileOpener` - [consumes file paths and returns opened file streams](generated/torchdata.datapipes.iter.FileOpener.html)
- `CSVParser` - [consumes file streams, parses the CSV contents, and returns one parsed line at a time](generated/torchdata.datapipes.iter.CSVParser.html)
- `RandomSplitter` - [randomly splits samples from a source DataPipe into groups](generated/torchdata.datapipes.iter.RandomSplitter.html)

As an example, the source code for `CSVParser` looks something like this:

[PRE0]

As mentioned in a different section, DataPipes can be invoked using their functional forms (recommended) or their class constructors. A pipeline can be assembled as follows:

[PRE1]

You can find the full list of built-in [IterDataPipes here](torchdata.datapipes.iter.html) and [MapDataPipes here](torchdata.datapipes.map.html).
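As a concrete illustration, here is a minimal sketch of the pipeline described above using the functional forms. The folder path, the 80/20 split, and `total_length=10` are illustrative values, not from the original:

```python
import torchdata.datapipes as dp

FOLDER = "path/to/csv/folder"  # hypothetical directory holding the CSV files

def is_csv(filename: str) -> bool:
    return filename.endswith(".csv")

datapipe = dp.iter.FileLister([FOLDER]).filter(filter_fn=is_csv)
datapipe = dp.iter.FileOpener(datapipe, mode="rt")
datapipe = datapipe.parse_csv(delimiter=",")
train, valid = datapipe.random_split(
    total_length=10, weights={"train": 0.8, "valid": 0.2}, seed=0
)
```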
## Working with DataLoader

In this section, we will demonstrate how you can use `DataPipe` with `DataLoader`. For the most part, you should be able to use it just by passing `dataset=datapipe` as an input argument into the `DataLoader`. For detailed documentation related to `DataLoader`, please visit [this PyTorch Core page](https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading).

Please refer to [this page](dlv2_tutorial.html) for using `DataPipe` with `DataLoader2`.

For this example, we will first define a helper function that generates a few CSV files with random labels and data.

[PRE2]

Next, we will build our DataPipes to read and parse through the generated CSV files. Note that we prefer to pass defined functions to DataPipes rather than lambda functions, because the former are serializable with pickle.

[PRE3]

Lastly, we will put everything together in `'__main__'` and pass the DataPipe into the DataLoader. Note that if you choose to use `Batcher` while setting `batch_size > 1` for DataLoader, your samples will be batched more than once. You should choose one or the other.

[PRE4]

The following statements will be printed to show the shapes of a single batch of labels and features.

[PRE5]

The reason why `n_sample = 12` is that `ShardingFilter` (`datapipe.sharding_filter()`) was not used, so each worker independently returns all samples. In this case, there are 10 rows per file and 3 files; with a batch size of 5, that gives us 6 batches per worker. With 2 workers, we get 12 total batches from the `DataLoader`.
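Condensed into a single runnable sketch (the helper, filter, and row-processing functions below are illustrative stand-ins, not the tutorial's exact code):

```python
import csv
import random

import torchdata.datapipes as dp
from torch.utils.data import DataLoader


def generate_csv(file_label: int, num_rows: int = 10, num_features: int = 3) -> None:
    # Writes one CSV file whose rows are the file's label plus random features.
    with open(f"sample_data{file_label}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label"] + [f"c{i}" for i in range(num_features)])
        for _ in range(num_rows):
            writer.writerow([file_label] + [random.random() for _ in range(num_features)])


def is_sample_csv(filename: str) -> bool:
    # A defined function (not a lambda), so the pipeline stays picklable for workers.
    return "sample_data" in filename and filename.endswith(".csv")


def row_processor(row):
    return {"label": int(row[0]), "data": [float(x) for x in row[1:]]}


def build_datapipes(root_dir: str = "."):
    datapipe = dp.iter.FileLister(root_dir).filter(filter_fn=is_sample_csv)
    datapipe = datapipe.open_files(mode="rt").parse_csv(delimiter=",", skip_lines=1)
    datapipe = datapipe.shuffle()
    return datapipe.map(row_processor)


if __name__ == "__main__":
    for label in range(3):  # 3 files x 10 rows
        generate_csv(file_label=label)
    dl = DataLoader(dataset=build_datapipes(), batch_size=5, num_workers=2)
    n_sample = sum(1 for _ in dl)
    print(f"n_sample = {n_sample}")  # 12: without sharding, each worker yields every sample
```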
In order for DataPipe sharding to work with `DataLoader`, we need to add the following.

[PRE6]

When we re-run, we will get:

[PRE7]

Note:

- Place `ShardingFilter` (`datapipe.sharding_filter`) as early as possible in the pipeline, especially before expensive operations such as decoding, in order to avoid repeating these expensive operations across worker/distributed processes.
- For a data source that needs to be sharded, it is crucial to add `Shuffler` before `ShardingFilter` to ensure data are globally shuffled before being split into shards. Otherwise, each worker process would always process the same shard of data across all epochs, meaning each batch would consist only of data from the same shard, which leads to low accuracy during training. However, this does not apply to a data source that has already been sharded for each multi-/distributed process, since `ShardingFilter` is then no longer required in the pipeline.
- There may be cases where placing `Shuffler` earlier in the pipeline leads to worse performance, because some operations (e.g. decompression) are faster with sequential reading. In those cases, we recommend decompressing the files prior to shuffling (potentially prior to any data loading).
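Applied to the illustrative `build_datapipes` from the sketch above, the change is a single extra step, with the shuffle kept before the shard as the notes advise:

```python
# (imports and helper functions as in the previous sketch)
def build_datapipes(root_dir: str = "."):
    datapipe = dp.iter.FileLister(root_dir).filter(filter_fn=is_sample_csv)
    datapipe = datapipe.open_files(mode="rt").parse_csv(delimiter=",", skip_lines=1)
    # Shuffle globally first, then shard, so each worker sees a different,
    # well-mixed subset rather than repeating the full dataset.
    datapipe = datapipe.shuffle()
    datapipe = datapipe.sharding_filter()
    return datapipe.map(row_processor)
```

With this version, the 30 rows are split across the 2 workers, so the sketch would print `n_sample = 6` instead of 12.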
You can find more DataPipe implementation examples for various research domains [on this page](examples.html).

## Implementing a Custom DataPipe

Currently, we already have a large number of built-in DataPipes, and we expect them to cover most necessary data processing operations. If none of them supports your needs, you can create your own custom DataPipe.

As a guiding example, let us implement an `IterDataPipe` that applies a callable to the input iterator. For `MapDataPipe`, take a look at the [map](https://github.com/pytorch/pytorch/tree/master/torch/utils/data/datapipes/map) folder for examples, and follow the steps below for the `__getitem__` method instead of the `__iter__` method.

### Naming

The naming convention for `DataPipe` is "Operation"-er, followed by `IterDataPipe` or `MapDataPipe`, as each DataPipe is essentially a container to apply an operation to data yielded from a source `DataPipe`. For succinctness, we alias to just "Operation-er" in `__init__` files. For our `IterDataPipe` example, we'll name the module `MapperIterDataPipe` and alias it as `iter.Mapper` under `torchdata.datapipes`.

For the functional method name, the convention is to name it after the operation, so that it can be chained as `datapipe.<operation>(...)`. For instance, the functional method name of `Mapper` is `map`, such that it can be invoked by `datapipe.map(...)`.

### Constructor

Datasets are now generally constructed as stacks of `DataPipes`, so each `DataPipe` typically takes a source `DataPipe` as its first argument. Here is a simplified version of Mapper as an example:

[PRE8]

Note:

- Avoid loading data from the source DataPipe in the `__init__` function, in order to support lazy data loading and save memory.
- If an `IterDataPipe` instance holds data in memory, be wary of in-place modification of the data. When a second iterator is created from the instance, the data may have already changed. Take the `IterableWrapper` [class](https://github.com/pytorch/pytorch/blob/master/torch/utils/data/datapipes/iter/utils.py) as a reference for how to `deepcopy` the data for each iterator.
- Avoid variable names that are taken by the functional names of existing DataPipes. For instance, `.filter` is the functional name that can be used to invoke `FilterIterDataPipe`. Having a variable named `filter` inside another `IterDataPipe` can lead to confusion.
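A minimal sketch of such a constructor, consistent with the notes above (nothing is read from `source_datapipe` here, which keeps the pipe lazy):

```python
from torch.utils.data import IterDataPipe

class MapperIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe: IterDataPipe, fn) -> None:
        super().__init__()
        # Store references only; data is consumed lazily during iteration.
        self.source_datapipe = source_datapipe
        self.fn = fn
```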
### Iterator

For `IterDataPipes`, an `__iter__` function is needed to consume data from the source `IterDataPipe` and then apply the operation over the data before `yield`.

[PRE9]

### Length

In many cases, as in our `MapperIterDataPipe` example, the `__len__` method of a DataPipe returns the length of the source DataPipe.

[PRE10]

However, note that `__len__` is optional for `IterDataPipe` and often inadvisable. For `CSVParserIterDataPipe` in the Using DataPipes section above, `__len__` is not implemented because the number of rows in each file is unknown before loading it. In some special cases, `__len__` can be made to either return an integer or raise an error depending on the input. In those cases, the error must be a `TypeError` to support Python's built-in functions like `list(dp)`.

### Registering DataPipes with the functional API

Each DataPipe can be registered to support functional invocation using the decorator `functional_datapipe`.

[PRE11]

The stack of DataPipes can then be constructed using their functional forms (recommended) or class constructors:

[PRE12]

In the above example, `datapipes1` and `datapipes2` represent the exact same stack of `IterDataPipe`s. We recommend using the functional form of DataPipes.
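Putting the whole guiding example together, a sketch might look like the following. The functional name `my_map` is deliberately hypothetical, since the built-in `Mapper` already owns the name `map`:

```python
from torch.utils.data import IterDataPipe, functional_datapipe
from torch.utils.data.datapipes.iter import IterableWrapper


@functional_datapipe("my_map")  # enables datapipe.my_map(...)
class MapperIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe: IterDataPipe, fn) -> None:
        super().__init__()
        self.source_datapipe = source_datapipe
        self.fn = fn

    def __iter__(self):
        for d in self.source_datapipe:
            yield self.fn(d)

    def __len__(self) -> int:
        return len(self.source_datapipe)


def double(x):
    return x * 2

# Functional form (recommended) and class constructor build the same stack.
datapipes1 = IterableWrapper(range(5)).my_map(double)
datapipes2 = MapperIterDataPipe(IterableWrapper(range(5)), double)
assert list(datapipes1) == list(datapipes2) == [0, 2, 4, 6, 8]
```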
## Working with Cloud Storage Providers

In this section, we show examples of accessing AWS S3, Google Cloud Storage, and Azure Cloud Storage with built-in `fsspec` DataPipes. Although only these providers are discussed here, with additional libraries, `fsspec` DataPipes should allow you to connect with other storage systems as well ([list of known implementations](https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations)).

Let us know on GitHub if you have a request for support for other cloud storage providers, or you have code examples to share with the community.

### Accessing AWS S3 with `fsspec` DataPipes

This requires the installation of the libraries `fsspec` ([documentation](https://filesystem-spec.readthedocs.io/en/latest/)) and `s3fs` ([s3fs GitHub repo](https://github.com/fsspec/s3fs)).

You can list out the files within an S3 bucket directory by passing a path that starts with `"s3://BUCKET_NAME"` to [FSSpecFileLister](generated/torchdata.datapipes.iter.FSSpecFileLister.html) (`.list_files_by_fsspec(...)`).

[PRE13]

You can also open files using [FSSpecFileOpener](generated/torchdata.datapipes.iter.FSSpecFileOpener.html) (`.open_files_by_fsspec(...)`) and stream them (if supported by the file format).

Note that you can also provide additional parameters via the argument `kwargs_for_open`. This can be useful for purposes such as accessing a specific bucket version, which you can do by passing in `{version_id: 'SOMEVERSIONID'}` (more [details about S3 bucket version awareness](https://s3fs.readthedocs.io/en/latest/#bucket-version-awareness) by `s3fs`). The supported arguments vary by the (cloud) file system that you are accessing.

In the example below, we are streaming the archive by using [TarArchiveLoader](generated/torchdata.datapipes.iter.TarArchiveLoader.html#) (`.load_from_tar(mode="r|")`), in contrast with the usual `mode="r:"`. This allows us to begin processing data inside the archive without downloading the whole archive into memory first.

[PRE14]
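A minimal sketch of that S3 flow (the bucket and prefix are placeholders; `anon=True` assumes a publicly readable bucket):

```python
from torchdata.datapipes.iter import IterableWrapper

# Requires `pip install fsspec s3fs`.
dp_s3 = IterableWrapper(["s3://BUCKET_NAME/DIRECTORY"]).list_files_by_fsspec()
dp_s3 = dp_s3.filter(filter_fn=lambda path: path.endswith(".tar"))
dp_s3 = dp_s3.open_files_by_fsspec(mode="rb", anon=True)
# mode="r|" streams the tar sequentially instead of requiring random access.
for file_name, stream in dp_s3.load_from_tar(mode="r|"):
    print(file_name)
```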
Finally, [FSSpecFileSaver](generated/torchdata.datapipes.iter.FSSpecSaver.html) is also available for writing data to the cloud.

### Accessing Google Cloud Storage (GCS) with `fsspec` DataPipes

This requires the installation of the libraries `fsspec` ([documentation](https://filesystem-spec.readthedocs.io/en/latest/)) and `gcsfs` ([gcsfs GitHub repo](https://github.com/fsspec/gcsfs)).

You can list out the files within a GCS bucket directory by specifying a path that starts with `"gcs://BUCKET_NAME"`. The bucket name in the example below is `uspto-pair`.

[PRE15]

Here is an example of loading a zip file `05900035.zip` from a bucket named `uspto-pair` inside the directory `applications`.

[PRE16]

### Accessing Azure Blob storage with `fsspec` DataPipes

This requires the installation of the libraries `fsspec` ([documentation](https://filesystem-spec.readthedocs.io/en/latest/)) and `adlfs` ([adlfs GitHub repo](https://github.com/fsspec/adlfs)). You can access data in Azure Data Lake Storage Gen2 by providing URIs starting with `abfs://`. For example, [FSSpecFileLister](generated/torchdata.datapipes.iter.FSSpecFileLister.html) (`.list_files_by_fsspec(...)`) can be used to list files in a directory in a container:

[PRE17]
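The same list-then-open pattern carries across providers. A sketch covering the GCS zip example above and an Azure listing (the Azure container and directory are taken from the public `pandemicdatalake` example that follows):

```python
from torchdata.datapipes.iter import IterableWrapper

# Requires `pip install fsspec gcsfs adlfs`, depending on the provider.

# GCS: stream the zip from the example above and enumerate its members.
dp_gcs = (
    IterableWrapper(["gcs://uspto-pair/applications/05900035.zip"])
    .open_files_by_fsspec(mode="rb")
    .load_from_zip()
)
for file_name, stream in dp_gcs:
    print(file_name)

# Azure: list a directory in a container; keyword arguments pass through to adlfs.
dp_abfs = IterableWrapper(["abfs://public/curated"]).list_files_by_fsspec(
    account_name="pandemicdatalake"
)
for file_name in dp_abfs:
    print(file_name)
```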
You can also open files using [FSSpecFileOpener](generated/torchdata.datapipes.iter.FSSpecFileOpener.html) (`.open_files_by_fsspec(...)`) and stream them (if supported by the file format).

Here is an example of loading a CSV file `ecdc_cases.csv` from a public container inside the directory `curated/covid-19/ecdc_cases/latest`, belonging to account `pandemicdatalake`.

[PRE18]

If necessary, you can also access data in Azure Data Lake Storage Gen1 by using URIs starting with `adl://` and `abfs://`, as described in the [README of the adlfs repo](https://github.com/fsspec/adlfs/blob/main/README.md).

### Accessing Azure ML Datastores with `fsspec` DataPipes

An Azure ML datastore is a *reference* to an existing storage account on Azure. The key benefits of creating and using an Azure ML datastore are:

- A common and easy-to-use API to interact with different storage types in Azure (Blob/Files/).
- Easier to discover useful datastores when working as a team.
- Authentication is automatically handled - both *credential-based* access (service principal/SAS/key) and *identity-based* access (Azure Active Directory/managed identity) are supported. When using credential-based authentication, you do not need to expose secrets in your code.

This requires the installation of the library `azureml-fsspec` ([documentation](https://learn.microsoft.com/python/api/azureml-fsspec/?view=azure-ml-py)).

You can access data in an Azure ML datastore by providing URIs starting with `azureml://`. For example, [FSSpecFileLister](generated/torchdata.datapipes.iter.FSSpecFileLister.html) (`.list_files_by_fsspec(...)`) can be used to list files in a directory in a container:

[PRE19]
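A sketch of such a listing (everything in angle brackets is a placeholder for your own subscription, resource group, and workspace; `workspaceblobstore` is the default datastore name used in the examples below):

```python
from torchdata.datapipes.iter import IterableWrapper

# Requires `pip install azureml-fsspec`.
uri = (
    "azureml://subscriptions/<sub_id>/resourcegroups/<resource_group>"
    "/workspaces/<workspace>/datastores/workspaceblobstore/paths/"
)
dp_aml = IterableWrapper([uri]).list_files_by_fsspec()
for file_name in dp_aml:
    print(file_name)
```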
You can also open files using [FSSpecFileOpener](generated/torchdata.datapipes.iter.FSSpecFileOpener.html) (`.open_files_by_fsspec(...)`) and stream them (if supported by the file format).

Here is an example of loading a tar file from the default Azure ML datastore `workspaceblobstore`, where the path is `/cifar-10-python.tar.gz` (top-level folder).

[PRE20]

Here is an example of loading a CSV file - the famous Titanic dataset ([download](https://raw.githubusercontent.com/Azure/azureml-examples/main/cli/assets/data/sample-data/titanic.csv)) - from the Azure ML datastore `workspaceblobstore`, where the path is `/titanic.csv` (top-level folder).

[PRE21]
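A sketch of the CSV case, reusing the placeholder URI prefix from the listing sketch above:

```python
from torchdata.datapipes.iter import IterableWrapper

# Requires `pip install azureml-fsspec`; placeholders as in the listing sketch above.
uri = (
    "azureml://subscriptions/<sub_id>/resourcegroups/<resource_group>"
    "/workspaces/<workspace>/datastores/workspaceblobstore/paths/titanic.csv"
)
dp_csv = IterableWrapper([uri]).open_files_by_fsspec(mode="rt").parse_csv(delimiter=",")
for row in dp_csv:
    print(row)  # each item is one parsed CSV row
```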