- en: Examples id: totrans-0 prefs: - PREF_H1 type: TYPE_NORMAL zh: 示例 - en: 原文:[https://pytorch.org/data/beta/examples.html](https://pytorch.org/data/beta/examples.html) id: totrans-1 prefs: - PREF_BQ type: TYPE_NORMAL zh: 原文:https://pytorch.org/data/beta/examples.html - en: In this section, you will find the data loading implementations (using DataPipes) of various popular datasets across different research domains. Some of the examples are implemented by the PyTorch team, and their implementation code is maintained within PyTorch libraries. Others are created by members of the PyTorch community. id: totrans-2 prefs: [] type: TYPE_NORMAL zh: 在本节中,您将找到不同研究领域中各种流行数据集的数据加载实现(使用DataPipes)。一些示例是由PyTorch团队实现的,实现代码在PyTorch库中维护。其他是由PyTorch社区成员创建的。 - en: Audio[](#audio "Permalink to this heading") id: totrans-3 prefs: - PREF_H2 type: TYPE_NORMAL zh: 音频这个标题的永久链接 - en: LibriSpeech[](#librispeech "Permalink to this heading") id: totrans-4 prefs: - PREF_H3 type: TYPE_NORMAL zh: LibriSpeech这个标题的永久链接 - en: 'The [LibriSpeech dataset](https://www.openslr.org/12/) is a corpus of approximately 1000 hours of 16kHz read English speech. Here is the [DataPipe implementation of LibriSpeech](https://github.com/pytorch/data/blob/main/examples/audio/librispeech.py) to load the data.' id: totrans-5 prefs: [] type: TYPE_NORMAL zh: LibriSpeech数据集是大约1000小时的16kHz英语朗读语音语料库。这里是加载数据的LibriSpeech的DataPipe实现。 - en: Text[](#text "Permalink to this heading") id: totrans-6 prefs: - PREF_H2 type: TYPE_NORMAL zh: 文本这个标题的永久链接 - en: Amazon Review Polarity[](#amazon-review-polarity "Permalink to this heading") id: totrans-7 prefs: - PREF_H3 type: TYPE_NORMAL zh: 亚马逊评论极性这个标题的永久链接 - en: The Amazon reviews dataset contains reviews from Amazon. Its purpose is to train text/sentiment classification models. 
In our DataPipe [implementation of the dataset](https://github.com/pytorch/data/blob/main/examples/text/amazonreviewpolarity.py), we described every step with detailed comments to help you understand what each DataPipe is doing. We recommend having a look at this example. id: totrans-8 prefs: [] type: TYPE_NORMAL zh: 亚马逊评论数据集包含来自亚马逊的评论。其目的是训练文本/情感分类模型。在我们的DataPipe数据集实现中,我们用详细的注释描述了每个步骤,以帮助您了解每个DataPipe正在做什么。我们建议您查看这个例子。 - en: IMDB[](#imdb "Permalink to this heading") id: totrans-9 prefs: - PREF_H3 type: TYPE_NORMAL zh: IMDB这个标题的永久链接 - en: This is a [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. Here is the [DataPipe implementation to load the data](https://github.com/pytorch/data/blob/main/examples/text/imdb.py). id: totrans-10 prefs: [] type: TYPE_NORMAL zh: 这是一个用于二元情感分类的大型电影评论数据集,包含25000条高度极性的电影评论用于训练和25000条用于测试。这里是加载数据的DataPipe实现。 - en: SQuAD[](#squad "Permalink to this heading") id: totrans-11 prefs: - PREF_H3 type: TYPE_NORMAL zh: SQuAD这个标题的永久链接 - en: '[SQuAD (Stanford Question Answering Dataset)](https://rajpurkar.github.io/SQuAD-explorer/) is a dataset for reading comprehension. It consists of a list of questions posed by crowdworkers on a set of Wikipedia articles. Here are the DataPipe implementations for [version 1.1](https://github.com/pytorch/data/blob/main/examples/text/squad1.py) and [version 2.0](https://github.com/pytorch/data/blob/main/examples/text/squad2.py).' 
id: totrans-12 prefs: [] type: TYPE_NORMAL zh: SQuAD(斯坦福问答数据集)是一个用于阅读理解的数据集。它由一组维基百科文章上的众包工作者提出的问题列表组成。这里是版本1.1的DataPipe实现和版本2.0的DataPipe实现。 - en: Additional Datasets in TorchText[](#additional-datasets-in-torchtext "Permalink to this heading") id: totrans-13 prefs: - PREF_H3 type: TYPE_NORMAL zh: TorchText中的其他数据集这个标题的永久链接 - en: In a separate PyTorch domain library [TorchText](https://github.com/pytorch/text), you will find some of the most popular datasets in the NLP field implemented as loadable datasets using DataPipes. You can find all of those [NLP datasets here](https://github.com/pytorch/text/tree/main/torchtext/datasets). id: totrans-14 prefs: [] type: TYPE_NORMAL zh: 在一个独立的PyTorch领域库TorchText中,您将找到一些最受欢迎的NLP领域数据集,这些数据集被实现为可使用DataPipes加载的数据集。您可以在这里找到所有这些NLP数据集。 - en: Vision[](#vision "Permalink to this heading") id: totrans-15 prefs: - PREF_H2 type: TYPE_NORMAL zh: 视觉这个标题的永久链接 - en: Caltech 101[](#caltech-101 "Permalink to this heading") id: totrans-16 prefs: - PREF_H3 type: TYPE_NORMAL zh: Caltech 101这个标题的永久链接 - en: The [Caltech 101 dataset](https://data.caltech.edu/records/20086) contains pictures of objects belonging to 101 categories. Here is the [DataPipe implementation of Caltech 101](https://github.com/pytorch/data/blob/main/examples/vision/caltech101.py). id: totrans-17 prefs: [] type: TYPE_NORMAL zh: Caltech 101数据集包含属于101个类别的对象的图片。这里是Caltech 101的DataPipe实现。 - en: Caltech 256[](#caltech-256 "Permalink to this heading") id: totrans-18 prefs: - PREF_H3 type: TYPE_NORMAL zh: Caltech 256这个标题的永久链接 - en: The [Caltech 256 dataset](https://data.caltech.edu/records/20087) contains 30607 images from 256 categories. Here is the [DataPipe implementation of Caltech 256](https://github.com/pytorch/data/blob/main/examples/vision/caltech256.py). 
id: totrans-19 prefs: [] type: TYPE_NORMAL zh: Caltech 256数据集包含来自256个类别的30607张图片。这里是Caltech 256的DataPipe实现。 - en: CamVid - Semantic Segmentation (community example)[](#camvid-semantic-segmentation-community-example "Permalink to this heading") id: totrans-20 prefs: - PREF_H3 type: TYPE_NORMAL zh: CamVid - 语义分割(社区示例)这个标题的永久链接 - en: The [Cambridge-driving Labeled Video Database (CamVid)](http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/) is a collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes. Here is a [DataPipe implementation of CamVid](https://github.com/tcapelle/torchdata/blob/main/01_Camvid_segmentation_with_datapipes.ipynb) created by our community. id: totrans-21 prefs: [] type: TYPE_NORMAL zh: 剑桥驾驶标记视频数据库(CamVid)是一个带有对象类语义标签的视频集合,附带元数据。该数据库提供了将每个像素与32个语义类别之一关联的地面实况标签。这里是我们社区创建的CamVid的DataPipe实现。 - en: laion2B-en-joined[](#laion2b-en-joined "Permalink to this heading") id: totrans-22 prefs: - PREF_H3 type: TYPE_NORMAL zh: laion2B-en-joined这个标题的永久链接 - en: The [laion2B-en-joined dataset](https://huggingface.co/datasets/laion/laion2B-en-joined) is a subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) containing English captions, URLs pointing to images, and other metadata. It contains around 2.32 billion entries. As of February 2023, around 86% of the URLs still point to valid images. Here is a [DataPipe implementation of laion2B-en-joined](https://github.com/pytorch/data/blob/main/examples/vision/laion5b.py) that filters out unsafe images and images with watermarks and loads the images from the URLs. 
id: totrans-23 prefs: [] type: TYPE_NORMAL zh: '[laion2B-en-joined数据集](https://huggingface.co/datasets/laion/laion2B-en-joined)是[LAION-5B数据集](https://laion.ai/blog/laion-5b/)的一个子集,包含英文标题、指向图像的URL以及其他元数据。它包含大约23.2亿条目。目前(2023年2月)大约86%的URL仍指向有效图像。这里有一个[laion2B-en-joined的DataPipe实现](https://github.com/pytorch/data/blob/main/examples/vision/laion5b.py),它会过滤掉不安全的图像和带有水印的图像,并从URL加载图像。' - en: Additional Datasets in TorchVision[](#additional-datasets-in-torchvision "Permalink to this heading") id: totrans-24 prefs: - PREF_H3 type: TYPE_NORMAL zh: TorchVision中的其他数据集 - en: In a separate PyTorch domain library [TorchVision](https://github.com/pytorch/vision), you will find some of the most popular datasets in the computer vision field implemented as loadable datasets using DataPipes. You can find all of those [vision datasets here](https://github.com/pytorch/vision/tree/main/torchvision/prototype/datasets/_builtin). id: totrans-25 prefs: [] type: TYPE_NORMAL zh: 在单独的PyTorch领域库[TorchVision](https://github.com/pytorch/vision)中,您将找到一些最受欢迎的计算机视觉领域数据集,这些数据集被实现为可加载的数据集,使用DataPipes。您可以在[这里找到所有这些视觉数据集](https://github.com/pytorch/vision/tree/main/torchvision/prototype/datasets/_builtin)。 - en: Note that these implementations are currently in the prototype phase, but they should be fully supported in the coming months. Nonetheless, they demonstrate the different ways DataPipes can be used for data loading. id: totrans-26 prefs: [] type: TYPE_NORMAL zh: 请注意,这些实现目前处于原型阶段,但它们应该在未来几个月内得到充分支持。尽管如此,它们展示了DataPipes可以用于数据加载的不同方式。 - en: Recommender System[](#recommender-system "Permalink to this heading") id: totrans-27 prefs: - PREF_H2 type: TYPE_NORMAL zh: 推荐系统 - en: Criteo 1TB Click Logs[](#criteo-1tb-click-logs "Permalink to this heading") id: totrans-28 prefs: - PREF_H3 type: TYPE_NORMAL zh: Criteo 1TB点击日志 - en: The [Criteo dataset](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset) contains feature values and click feedback for millions of display advertisements. 
It aims to benchmark algorithms for click-through rate (CTR) prediction. You can find a prototype-stage implementation of the [dataset with DataPipes in TorchRec](https://github.com/pytorch/torchrec/blob/main/torchrec/datasets/criteo.py). id: totrans-29 prefs: [] type: TYPE_NORMAL zh: '[Criteo数据集](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset)包含数百万个展示广告的特征值和点击反馈。它旨在为点击率(CTR)预测的算法提供基准。您可以在[TorchRec中使用DataPipes实现数据集的原型阶段](https://github.com/pytorch/torchrec/blob/main/torchrec/datasets/criteo.py)。' - en: Graphs, Meshes and Point Clouds[](#graphs-meshes-and-point-clouds "Permalink to this heading") id: totrans-30 prefs: - PREF_H2 type: TYPE_NORMAL zh: 图、网格和点云 - en: TigerGraph (community example)[](#tigergraph-community-example "Permalink to this heading") id: totrans-31 prefs: - PREF_H3 type: TYPE_NORMAL zh: TigerGraph(社区示例) - en: TigerGraph is a scalable graph data platform for AI and ML. You can find an [implementation](https://github.com/TigerGraph-DevLabs/torchdata_tutorial/blob/main/torchdata_example.ipynb) of graph feature engineering and machine learning with DataPipes in TorchData and data stored in a TigerGraph database, which includes computing PageRank scores in-database, pulling graph data and features with multiple DataPipes, and training a neural network using graph features in PyTorch. id: totrans-32 prefs: [] type: TYPE_NORMAL zh: TigerGraph是一个可扩展的用于AI和ML的图数据平台。您可以在[TorchData中使用DataPipes实现图特征工程和机器学习](https://github.com/TigerGraph-DevLabs/torchdata_tutorial/blob/main/torchdata_example.ipynb),数据存储在TigerGraph数据库中,其中包括在数据库中计算PageRank分数,使用多个DataPipes提取图数据和特征,以及使用PyTorch中的图特征训练神经网络。 - en: MoleculeNet (community example)[](#moleculenet-community-example "Permalink to this heading") id: totrans-33 prefs: - PREF_H3 type: TYPE_NORMAL zh: MoleculeNet(社区示例) - en: '[MoleculeNet](https://moleculenet.org/) is a benchmark specially designed for testing machine learning methods of molecular properties. 
You can find an implementation of the [HIV dataset with DataPipes in PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/datapipe.py), which includes converting SMILES strings into molecular graph representations.' id: totrans-34 prefs: [] type: TYPE_NORMAL zh: MoleculeNet是专门设计用于测试分子属性机器学习方法的基准。您可以在[PyTorch Geometric中使用DataPipes实现HIV数据集](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/datapipe.py),其中包括将SMILES字符串转换为分子图表示。 - en: Princeton ModelNet (community example)[](#princeton-modelnet-community-example "Permalink to this heading") id: totrans-35 prefs: - PREF_H3 type: TYPE_NORMAL zh: 普林斯顿ModelNet(社区示例) - en: The Princeton ModelNet project provides a comprehensive and clean collection of 3D CAD models across various object types. You can find an implementation of the [ModelNet10 dataset with DataPipes in PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/datapipe.py), which includes reading in meshes via [meshio](https://github.com/nschloe/meshio), and sampling of points from object surfaces and dynamic graph generation via [PyG’s functional transformations](https://pytorch-geometric.readthedocs.io/en/latest/modules/transforms.html). 
id: totrans-36 prefs: [] type: TYPE_NORMAL zh: 普林斯顿ModelNet项目提供了各种对象类型的全面且干净的3D CAD模型集合。您可以在[PyTorch Geometric中使用DataPipes实现ModelNet10数据集](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/datapipe.py),其中包括通过[meshio](https://github.com/nschloe/meshio)读取网格,从对象表面采样点以及通过[PyG的功能转换](https://pytorch-geometric.readthedocs.io/en/latest/modules/transforms.html)生成动态图。 - en: Timeseries[](#timeseries "Permalink to this heading") id: totrans-37 prefs: - PREF_H2 type: TYPE_NORMAL zh: 时间序列 - en: Custom DataPipe for Timeseries rolling window (community example)[](#custom-datapipe-for-timeseries-rolling-window-community-example "Permalink to this heading") id: totrans-38 prefs: - PREF_H3 type: TYPE_NORMAL zh: 用于时间序列滚动窗口的自定义DataPipe(社区示例) - en: This example implements a custom rolling-window DataPipe for timeseries forecasting tasks. Here is the [DataPipe implementation of a rolling window](https://github.com/tcapelle/torchdata/blob/main/02_Custom_timeseries_datapipe.ipynb). id: totrans-39 prefs: [] type: TYPE_NORMAL zh: 为时间序列预测任务实现滚动窗口自定义DataPipe。这里是滚动窗口的DataPipe实现。 - en: Using AIStore[](#using-aistore "Permalink to this heading") id: totrans-40 prefs: - PREF_H2 type: TYPE_NORMAL zh: 使用AIStore - en: Caltech 256 and Microsoft COCO (community example)[](#caltech-256-and-microsoft-coco-community-example "Permalink to this heading") id: totrans-41 prefs: - PREF_H3 type: TYPE_NORMAL zh: Caltech 256和Microsoft COCO(社区示例) - en: This example lists and loads data from AIS buckets (buckets that are not 3rd party backend-based) and remote cloud buckets (3rd party backend-based cloud buckets) using [AISFileLister](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLister.html#aisfilelister) and [AISFileLoader](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLoader.html#torchdata.datapipes.iter.AISFileLoader). 
id: totrans-42 prefs: [] type: TYPE_NORMAL zh: 从AIS存储桶(非第三方后端存储桶)和远程云存储桶(第三方后端云存储桶)中列出和加载数据,使用AISFileLister和AISFileLoader。 - en: Here is an [example which uses AISIO DataPipe](https://github.com/pytorch/data/blob/main/examples/aistore/aisio_usage_example.ipynb) for the [Caltech-256 Object Category Dataset](https://data.caltech.edu/records/20087) containing 256 object categories and a total of 30607 images stored on an AIS bucket and the [Microsoft COCO Dataset](https://cocodataset.org/#home) which has 330K images with over 200K labels of more than 1.5 million object instances across 80 object categories stored on Google Cloud. id: totrans-43 prefs: [] type: TYPE_NORMAL zh: 这是一个示例,使用AISIO DataPipe处理Caltech-256对象类别数据集和Microsoft COCO数据集。Caltech-256数据集包含256个对象类别和30607张图像,存储在AIS存储桶中;而Microsoft COCO数据集包含330K张图像,涵盖80个对象类别的超过200K个标签和超过150万个对象实例,存储在Google Cloud上。