# torchtext.datasets

> 原文：[`pytorch.org/text/stable/datasets.html`](https://pytorch.org/text/stable/datasets.html)

警告

torchtext 支持的数据集是来自[torchdata 项目](https://pytorch.org/data/beta/index.html)的数据管道，该项目仍处于 Beta 状态。这意味着 API 可能会在没有弃用周期的情况下发生更改。特别是，我们期望随着`torchdata`发布`DataLoaderV2`，当前的许多惯用法会发生变化。

以下是关于数据管道使用的一些建议：

+   要对数据管道进行洗牌，请在 DataLoader 中进行：`DataLoader(dp, shuffle=True)`。您不需要调用`dp.shuffle()`，因为`torchtext`已经为您做了。但请注意，除非您明确将`shuffle=True`传递给 DataLoader，否则数据管道不会被洗牌。

+   在使用多处理（`num_workers=N`）时，请使用内置的`worker_init_fn`： 

    ```py
    from torch.utils.data.backward_compatibility import worker_init_fn
    DataLoader(dp, num_workers=4, worker_init_fn=worker_init_fn, drop_last=True) 
    ```

    这将确保数据在工作进程之间不会重复。

+   我们还建议使用`drop_last=True`。如果不这样做，在某些情况下，一个时期结束时的批次大小可能会非常小（比其他映射样式数据集的批次大小小）。这可能会对准确性产生很大影响，特别是在使用批量归一化时。`drop_last=True`确保所有批次大小相等。

+   使用`DistributedDataParallel`进行分布式训练目前还不够稳定/支持，我们不建议在这一点上使用。它将在 DataLoaderV2 中得到更好的支持。如果您仍希望使用 DDP，请确保：

    +   所有工作进程（DDP 工作进程*和*DataLoader 工作进程）看到数据的不同部分。数据集已经包装在[ShardingFilter](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.ShardingFilter.html)中，您可能需要调用`dp.apply_sharding(num_shards, shard_id)`以将数据分片到排名（DDP 工作进程）和 DataLoader 工作进程中。一种方法是创建`worker_init_fn`，该函数调用`apply_sharding`并传递适当数量的分片（DDP 工作进程*DataLoader 工作进程）和分片 ID（通过排名和相应 DataLoader 的工作 ID 推断）。但请注意，这假定所有排名的 DataLoader 工作进程数量相等。

    +   所有 DDP 工作进程处理相同数量的批次。一种方法是通过将每个工作进程内的数据管道大小限制为`len(datapipe) // num_ddp_workers`来实现，但这可能不适用于所有用例。

    +   洗牌种子在所有工作进程中是相同的。您可能需要调用`torch.utils.data.graph_settings.apply_shuffle_seed(dp, rng)`

    +   洗牌种子在不同的时期是不同的。

    +   RNG 的其余部分（通常用于转换）在工作进程之间是**不同**的，以获得最大熵和最佳准确性。

一般用例如下：

```py
# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line) 
```

目前提供以下数据集。如果您想向存储库贡献新数据集或使用自己的自定义数据集，请参考[CONTRIBUTING_DATASETS.md](https://github.com/pytorch/text/blob/main/CONTRIBUTING_DATASETS.md)指南。

数据集

+   Text Classification

    +   AG_NEWS

    +   AmazonReviewFull

    +   AmazonReviewPolarity

    +   CoLA

    +   DBpedia

    +   IMDb

    +   MNLI

    +   MRPC

    +   QNLI

    +   QQP

    +   RTE

    +   SogouNews

    +   SST2

    +   STSB

    +   WNLI

    +   YahooAnswers

    +   YelpReviewFull

    +   YelpReviewPolarity

+   Language Modeling

    +   PennTreebank

    +   WikiText-2

    +   WikiText103

+   Machine Translation

    +   IWSLT2016

    +   IWSLT2017

    +   Multi30k

+   Sequence Tagging

    +   CoNLL2000Chunking

    +   UDPOS

+   Question Answer

    +   SQuAD 1.0

    +   SQuAD 2.0

+   Unsupervised Learning

    +   CC100

    +   EnWik9

## 文本分类

### AG_NEWS

```py
torchtext.datasets.AG_NEWS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

AG_NEWS 数据集

警告

目前仍然存在一些注意事项，使用 datapipes。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`paperswithcode.com/dataset/ag-news`](https://paperswithcode.com/dataset/ag-news)

每个拆分的行数：

+   训练：120000

+   测试：7600

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，test）

返回：

DataPipe，产生标签（1 到 4）和文本的元组

返回类型：

（[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)")，[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### AmazonReviewFull

```py
torchtext.datasets.AmazonReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

AmazonReviewFull 数据集

警告

目前仍然存在一些注意事项，使用 datapipes。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`arxiv.org/abs/1509.01626`](https://arxiv.org/abs/1509.01626)

每个拆分的行数：

+   训练：3000000

+   测试：650000

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，test）

返回：

DataPipe，产生标签（1 到 5）和包含评论标题和文本的文本的元组

返回类型：

（[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)")，[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### AmazonReviewPolarity

```py
torchtext.datasets.AmazonReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

AmazonReviewPolarity 数据集

警告

目前仍然存在一些注意事项，使用 datapipes。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`arxiv.org/abs/1509.01626`](https://arxiv.org/abs/1509.01626)

每个拆分的行数：

+   训练：3600000

+   测试：400000

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，test）

返回：

DataPipe，产生标签（1 到 2）和包含评论标题和文本的文本的元组

返回类型：

（[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)")，[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### CoLA

```py
torchtext.datasets.CoLA(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev', 'test'))
```

CoLA 数据集

警告

目前仍然存在一些注意事项，使用 datapipes。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`nyu-mll.github.io/CoLA/`](https://nyu-mll.github.io/CoLA/)

每个拆分的行数：

+   训练：8551

+   开发：527

+   测试：516

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，dev，test）

返回：

DataPipe，从 CoLA 数据集产生行（源（str），标签（int），句子（str））

返回类型：

（[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")，[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)")，[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### DBpedia

```py
torchtext.datasets.DBpedia(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

DBpedia 数据集

警告

目前仍然存在一些注意事项，如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`www.dbpedia.org/resources/latest-core/`](https://www.dbpedia.org/resources/latest-core/)

每个拆分的行数：

+   训练：560000

+   测试：70000

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，test）

返回：

DataPipe，产生包含新闻标题和内容的标签元组（1 到 14）和文本

返回类型：

([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### IMDb

```py
torchtext.datasets.IMDB(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

IMDB 数据集

警告

目前仍然存在一些注意事项，如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`ai.stanford.edu/~amaas/data/sentiment/`](http://ai.stanford.edu/~amaas/data/sentiment/)

每个拆分的行数：

+   训练：25000

+   测试：25000

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，test）

返回：

DataPipe，产生包含电影评论的标签元组（1 到 2）和文本

返回类型：

([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

使用`IMDB`的教程：

![T5-基础模型用于摘要、情感分类和翻译](img/054ec2c5b6c69ac648ddd68d0b5494e6.png)

T5-基础模型用于摘要、情感分类和翻译

T5-基础模型用于摘要、情感分类和翻译

### MNLI

```py
torchtext.datasets.MNLI(root='.data', split=('train', 'dev_matched', 'dev_mismatched'))
```

MNLI 数据集

警告

目前仍然存在一些注意事项，如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`cims.nyu.edu/~sbowman/multinli/`](https://cims.nyu.edu/~sbowman/multinli/)

每个拆分的行数：

+   训练：392702

+   dev_matched：9815

+   dev_mismatched：9832

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，dev_matched，dev_mismatched）

返回：

DataPipe，产生包含文本和标签（0 到 2）的元组。

返回类型：

元组[[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")]

### MRPC

```py
torchtext.datasets.MRPC(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

MRPC 数据集

警告

目前仍然存在一些注意事项，如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`www.microsoft.com/en-us/download/details.aspx?id=52398`](https://www.microsoft.com/en-us/download/details.aspx?id=52398)

每个拆分的行数：

+   训练：4076

+   测试：1725

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，test）

返回：

DataPipe，产生来自 MRPC 数据集的数据点，其中包含标签、句子 1、句子 2

返回类型：

（[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### QNLI

```py
torchtext.datasets.QNLI(root='.data', split=('train', 'dev', 'test'))
```

QNLI 数据集

有关更多详细信息，请参阅[`arxiv.org/pdf/1804.07461.pdf`](https://arxiv.org/pdf/1804.07461.pdf)（来自 GLUE 论文）

每个拆分的行数：

+   train：104743

+   dev：5463

+   test：5463

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，dev，test）

返回：

DataPipe，产生文本和标签（0 和 1）的元组。

返回类型：

（[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### QQP

```py
torchtext.datasets.QQP(root: str)
```

QQP 数据集

警告

使用 datapipes 目前仍然受到一些注意事项的限制。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs`](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs)

参数：

**root** - 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

返回：

DataPipe，产生来自 QQP 数据集的行（标签（int），问题 1（str），问题 2（str））

返回类型：

（[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### RTE

```py
torchtext.datasets.RTE(root='.data', split=('train', 'dev', 'test'))
```

RTE 数据集

有关更多详细信息，请参阅[`aclweb.org/aclwiki/Recognizing_Textual_Entailment`](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment)

每个拆分的行数：

+   train：2490

+   dev：277

+   test：3000

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，dev，test）

返回：

DataPipe，产生文本和/或标签（0 和 1）的元组。测试拆分仅返回文本。

返回类型：

Union[（[int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")),（[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))]

### SogouNews

```py
torchtext.datasets.SogouNews(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

SogouNews 数据集

警告

使用 datapipes 目前仍然受到一些注意事项的限制。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`arxiv.org/abs/1509.01626`](https://arxiv.org/abs/1509.01626)

> 每个拆分的行数：
> 
> +   train：450000
> +   
> +   test：60000
> +   
> 参数：
> 
> root：数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）split：要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，test）
> 
> 返回：
> 
> DataPipe，产生标签（1 到 5）和包含新闻标题和内容的文本的元组
> 
> rtype：
> 
> （int，str）

### SST2

```py
torchtext.datasets.SST2(root='.data', split=('train', 'dev', 'test'))
```

SST2 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多处理或分布式学习，请参阅此说明以获取进一步的指导。

有关更多详细信息，请参阅[`nlp.stanford.edu/sentiment/`](https://nlp.stanford.edu/sentiment/)

每个拆分的行数：

+   训练：67349

+   开发：872

+   测试：1821

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（训练、开发、测试）

返回：

DataPipe 会产生文本和/或标签（1 到 4）的元组。测试拆分仅返回文本。

返回类型：

Union[([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")), ([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"),)]

使用`SST2`的教程：

![SST-2 二进制文本分类与 XLM-RoBERTa 模型](img/98241cb68ab73fa3d56bc87944e16fd8.png)

SST-2 二进制文本分类与 XLM-RoBERTa 模型

SST-2 二进制文本分类与 XLM-RoBERTa 模型

### STSB

```py
torchtext.datasets.STSB(root='.data', split=('train', 'dev', 'test'))
```

STSB 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多处理或分布式学习，请参阅此说明以获取进一步的指导。

有关更多详细信息，请参阅[`ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark`](https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark)

每个拆分的行数：

+   训练：5749

+   开发：1500

+   测试：1379

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（训练、开发、测试）

返回：

DataPipe 会产生元组（索引（整数）、标签（浮点数）、句子 1（字符串）、句子 2（字符串））

返回类型：

([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [float](https://docs.python.org/3/library/functions.html#float "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### WNLI

```py
torchtext.datasets.WNLI(root='.data', split=('train', 'dev', 'test'))
```

WNLI 数据集

有关更多详细信息，请参阅[`arxiv.org/pdf/1804.07461v3.pdf`](https://arxiv.org/pdf/1804.07461v3.pdf)

每个拆分的行数：

+   训练：635

+   开发：71

+   测试：146

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（训练、开发、测试）

返回：

DataPipe 会产生文本和/或标签（0 到 1）的元组。测试拆分仅返回文本。

返回类型：

Union[([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")), ([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))]

### YahooAnswers

```py
torchtext.datasets.YahooAnswers(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

YahooAnswers 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多处理或分布式学习，请参阅此说明以获取进一步的指导。

有关更多详细信息，请参阅[`arxiv.org/abs/1509.01626`](https://arxiv.org/abs/1509.01626)

每个拆分的行数：

+   训练：1400000

+   测试：60000

参数：

+   **root** – 数据集保存的目录。默认值：os.path.expanduser(‘~/.torchtext/cache’)

+   **split** – 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：(train, test)

返回：

DataPipe，产生包含问题标题、问题内容和最佳答案的文本的元组

返回类型：

([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### YelpReviewFull

```py
torchtext.datasets.YelpReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

YelpReviewFull 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望使用此数据集进行洗牌、多进程处理或分布式学习，请参阅此说明。

有关更多详细信息，请参考[`arxiv.org/abs/1509.01626`](https://arxiv.org/abs/1509.01626)

每个拆分的行数：

+   训练：650000

+   测试：50000

参数：

+   **root** – 数据集保存的目录。默认值：os.path.expanduser(‘~/.torchtext/cache’)

+   **split** – 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：(train, test)

返回：

DataPipe，产生包含评论标签（1 到 5）和文本的元组

返回类型：

([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### YelpReviewPolarity

```py
torchtext.datasets.YelpReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

YelpReviewPolarity 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望使用此数据集进行洗牌、多进程处理或分布式学习，请参阅此说明。

有关更多详细信息，请参考[`arxiv.org/abs/1509.01626`](https://arxiv.org/abs/1509.01626)

每个拆分的行数：

+   训练：560000

+   测试：38000

参数：

+   **root** – 数据集保存的目录。默认值：os.path.expanduser(‘~/.torchtext/cache’)

+   **split** – 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：(train, test)

返回：

DataPipe，产生包含评论标签（1 到 2）和文本的元组

返回类型：

([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

## 语言建模

### PennTreebank

```py
torchtext.datasets.PennTreebank(root='.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
```

PennTreebank 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望使用此数据集进行洗牌、多进程处理或分布式学习，请参阅此说明。

有关更多详细信息，请参考[`catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html`](https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html)

每个拆分的行数：

+   训练：42068

+   验证：3370

+   测试：3761

参数：

+   **root** – 数据集保存的目录。默认值：os.path.expanduser(‘~/.torchtext/cache’)

+   **split** – 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：(train, valid, test)

返回：

DataPipe，产生来自 Treebank 语料库的文本

返回类型：

[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")

### WikiText-2

```py
torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
```

WikiText2 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望使用此数据集进行洗牌、多进程处理或分布式学习，请参阅此说明。

有关更多详细信息，请参考[`blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/`](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/)

每个拆分的行数：

+   训练：36718

+   有效：3760

+   测试：4358

参数：

+   **root** – 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** – 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，valid，test）

返回：

从维基百科文章中产生文本的 DataPipe

返回类型：

[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")

### WikiText103

```py
torchtext.datasets.WikiText103(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
```

WikiText103 数据集

警告

使用 datapipes 目前仍然受到一些注意事项的限制。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/`](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/)

每个拆分的行数：

+   训练：1801350

+   有效：3760

+   测试：4358

参数：

+   **root** – 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** – 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，valid，test）

返回：

从维基百科文章中产生文本的 DataPipe

返回类型：

[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")

## 机器翻译

### IWSLT2016

```py
torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')
```

IWSLT2016 数据集

警告

使用 datapipes 目前仍然受到一些注意事项的限制。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`wit3.fbk.eu/2016-01`](https://wit3.fbk.eu/2016-01)

可用的数据集包括以下内容：

**语言对**：

|  | “en” | “fr” | “de” | “cs” | “ar” |
| --- | --- | --- | --- | --- | --- |
| “en” |  | x | x | x | x |
| “fr” | x |  |  |  |  |
| “de” | x |  |  |  |  |
| “cs” | x |  |  |  |  |
| “ar” | x |  |  |  |  |

**验证/测试集**：[“dev2010”，“tst2010”，“tst2011”，“tst2012”，“tst2013”，“tst2014”]

参数：

+   **root** – 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** – 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（'train'，'valid'，'test'）

+   **language_pair** – 包含 src 和 tgt 语言的元组或列表

+   **valid_set** – 用于识别验证集的字符串。

+   **test_set** – 用于识别测试集的字符串。

返回：

从维基百科文章中产生源句子和目标句子的元组的 DataPipe

返回类型：

([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

示例

```py
>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(iter(train_iter)) 
```

### IWSLT2017

```py
torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
```

IWSLT2017 数据集

警告

使用 datapipes 目前仍然受到一些注意事项的限制。如果您希望使用此数据集进行洗牌、多处理或分布式学习，请参阅此说明。

有关更多详细信息，请参阅[`wit3.fbk.eu/2017-01`](https://wit3.fbk.eu/2017-01)

可用的数据集包括以下内容：

**语言对**：

|  | “en” | “nl” | “de” | “it” | “ro” |
| --- | --- | --- | --- | --- | --- |
| “en” |  | x | x | x | x |
| “nl” | x |  | x | x | x |
| “de” | x | x |  | x | x |
| “it” | x | x | x |  | x |
| “ro” | x | x | x | x |  |

参数：

+   **root** – 数据集保存的目录。默认值：os.path.expanduser（'〜/.torchtext/cache'）

+   **split** – 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（'train'，'valid'，'test'）

+   **language_pair** – 包含 src 和 tgt 语言的元组或列表

返回：

从维基百科文章中产生源句子和目标句子的元组的 DataPipe

返回类型：

([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

示例

```py
>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(iter(train_iter)) 
```

### Multi30k

```py
torchtext.datasets.Multi30k(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'))
```

Multi30k 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多处理或分布式学习，请参阅此说明以获取进一步的指导。

有关更多详细信息，请参阅[`www.statmt.org/wmt16/multimodal-task.html#task1`](https://www.statmt.org/wmt16/multimodal-task.html#task1)

每个拆分的行数：

+   训练：29000

+   验证：1014

+   测试：1000

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（'train'，'valid'，'test'）

+   **language_pair** - 包含源语言和目标语言的元组或列表。可用选项为（'de'，'en'）和（'en'，'de'）

返回：

产生源句子和目标句子的元组的 DataPipe

返回类型：

([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

使用`Multi30k`的教程：

![T5-基础模型用于摘要、情感分类和翻译](img/054ec2c5b6c69ac648ddd68d0b5494e6.png)

T5-基础模型用于摘要、情感分类和翻译

T5-基础模型用于摘要、情感分类和翻译

## 序列标注

### CoNLL2000Chunking

```py
torchtext.datasets.CoNLL2000Chunking(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
```

CoNLL2000Chunking 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多处理或分布式学习，请参阅此说明以获取进一步的指导。

有关更多详细信息，请参阅[`www.clips.uantwerpen.be/conll2000/chunking/`](https://www.clips.uantwerpen.be/conll2000/chunking/)

每个拆分的行数：

+   训练：8936

+   测试：2012

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，test）

返回：

产生单词列表以及相应词性标签和块标签的 DataPipe

返回类型：

[[list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")), [list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")), [list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))]

### UDPOS

```py
torchtext.datasets.UDPOS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
```

UDPOS 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多处理或分布式学习，请参阅此说明以获取进一步的指导。

每个拆分的行数：

+   训练：12543

+   验证：2002

+   测试：2077

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，valid，test）

返回：

产生单词列表以及相应词性标签的 DataPipe

返回类型：

[[list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")), [list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))]

## 问题回答

### SQuAD 1.0

```py
torchtext.datasets.SQuAD1(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))
```

SQuAD1 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多进程或分布式学习，请参阅此说明以获取进一步的指导。

有关更多详细信息，请参阅[`rajpurkar.github.io/SQuAD-explorer/`](https://rajpurkar.github.io/SQuAD-explorer/)

每个拆分的行数：

+   train: 87599

+   dev: 10570

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，dev）

返回：

DataPipe，产生 SQuaAD1 数据集中的数据点，包括上下文、问题、答案列表和上下文中对应的索引

返回类型:

([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")), [list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)")))

### SQuAD 2.0

```py
torchtext.datasets.SQuAD2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))
```

SQuAD2 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多进程或分布式学习，请参阅此说明以获取进一步的指导。

有关更多详细信息，请参阅[`rajpurkar.github.io/SQuAD-explorer/`](https://rajpurkar.github.io/SQuAD-explorer/)

每个拆分的行数：

+   train: 130319

+   dev: 11873

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **split** - 要返回的拆分或拆分。可以是字符串或字符串元组。默认值：（train，dev）

返回：

DataPipe，产生 SQuaAD1 数据集中的数据点，包括上下文、问题、答案列表和上下文中对应的索引

返回类型：

([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")), [list](https://docs.python.org/3/library/stdtypes.html#list "(在 Python v3.12 中)")([int](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)")))

## 无监督学习

### CC100

```py
torchtext.datasets.CC100(root: str, language_code: str = 'en')
```

CC100 数据集

警告

使用 datapipes 目前仍然存在一些注意事项。如果您希望在此数据集中使用洗牌、多进程或分布式学习，请参阅此说明以获取进一步的指导。

有关更多详细信息，请参阅[`data.statmt.org/cc-100/`](https://data.statmt.org/cc-100/)

参数：

+   **root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

+   **language_code** - 数据集的语言

返回：

DataPipe，产生语言代码和文本的元组

返回类型：

([str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"), [str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)"))

### EnWik9

```py
torchtext.datasets.EnWik9(root: str)
```

EnWik9 数据集

警告

目前仍然存在一些注意事项，如果您希望使用此数据集进行洗牌、多进程处理或分布式学习，请参阅此说明获取进一步指导。

有关更多详细信息，请参阅[`mattmahoney.net/dc/textdata.html`](http://mattmahoney.net/dc/textdata.html)

数据集中的行数：13147026

参数：

**root** - 数据集保存的目录。默认值：os.path.expanduser（'~/.torchtext/cache'）

返回：

从 WnWik9 数据集中产生原始文本行的 DataPipe

返回类型：

[str](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")