# torchtext.vocab

> 原文：[`pytorch.org/text/stable/vocab.html`](https://pytorch.org/text/stable/vocab.html)

## 词汇表

```py
class torchtext.vocab.Vocab(vocab)
```

```py
__contains__(token: str) → bool
```

参数：

**token** - 要检查成员资格的令牌。

返回：

令牌是否为词汇表成员。

```py
__getitem__(token: str) → int
```

参数：

**token** - 用于查找相应索引的令牌。

返回：

与关联令牌对应的索引。

```py
__init__(vocab) → None
```

初始化内部模块状态，由 nn.Module 和 ScriptModule 共享。

```py
__jit_unused_properties__ = ['is_jitable']
```

创建一个将令牌映射到索引的词汇对象。

参数：

**vocab** (*torch.classes.torchtext.Vocab* *或* *torchtext._torchtext.Vocab*) - 一个 cpp 词汇对象。

```py
__len__() → int
```

返回：

词汇表的长度。

```py
__prepare_scriptable__()
```

返回一个可 JIT 的词汇表。

```py
append_token(token: str) → None
```

参数：

**token** - 用于查找相应索引的令牌。

提高：

[**RuntimeError**](https://docs.python.org/3/library/exceptions.html#RuntimeError "(在 Python v3.12 中)") - 如果令牌已经存在于词汇表中

```py
forward(tokens: List[str]) → List[int]
```

调用 lookup_indices 方法

参数：

**tokens** - 用于查找其相应索引的令牌列表。

返回：

与一组令牌相关联的索引。

```py
get_default_index() → Optional[int]
```

返回：

如果设置了默认索引值，则返回默认索引值。

```py
get_itos() → List[str]
```

返回：

将索引映射到令牌的列表。

```py
get_stoi() → Dict[str, int]
```

返回：

将令牌映射到索引的字典。

```py
insert_token(token: str, index: int) → None
```

参数：

+   **token** - 用于查找相应索引的令牌。

+   **index** - 与关联令牌对应的索引。

提高：

[**RuntimeError**](https://docs.python.org/3/library/exceptions.html#RuntimeError "(在 Python v3.12 中)") - 如果索引不在范围[0, Vocab.size()]内，或者如果令牌已经存在于词汇表中。

```py
lookup_indices(tokens: List[str]) → List[int]
```

参数：

**tokens** - 用于查找其相应索引的令牌。

返回：

与令牌相关联的‘indices`。

```py
lookup_token(index: int) → str
```

参数：

**index** - 与关联令牌对应的索引。

返回：

用于查找相应索引的令牌。

返回类型：

令牌

提高：

[**RuntimeError**](https://docs.python.org/3/library/exceptions.html#RuntimeError "(在 Python v3.12 中)") - 如果索引不在范围[0, itos.size())内。

```py
lookup_tokens(indices: List[int]) → List[str]
```

参数：

**indices** - 用于查找其相应`令牌的索引。

返回：

与索引相关联的令牌。

提高：

[**RuntimeError**](https://docs.python.org/3/library/exceptions.html#RuntimeError "(在 Python v3.12 中)") - 如果索引不在范围[0, itos.size())内。

```py
set_default_index(index: Optional[int]) → None
```

参数：

**index** - 默认索引值。当查询 OOV 令牌时，将返回此索引。

## vocab

```py
torchtext.vocab.vocab(ordered_dict: Dict, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) → Vocab
```

用于创建将令牌映射到索引的词汇对象的工厂方法。

请注意，在构建词汇表时，将尊重有序字典中插入键值对的顺序。因此，如果按令牌频率排序对用户很重要，则应以反映这一点的方式创建有序字典。

参数：

+   **ordered_dict** - 有序字典，将令牌映射到其对应的出现频率。

+   **min_freq** - 需要包含令牌在词汇表中的最小频率。

+   **specials** - 要添加的特殊符号。所提供的令牌顺序将被保留。

+   **special_first** - 指示是否在开头或结尾插入符号。

返回：

一个词汇对象

返回类型：

torchtext.vocab.Vocab

示例

```py
>>> from torchtext.vocab import vocab
>>> from collections import Counter, OrderedDict
>>> counter = Counter(["a", "a", "b", "b", "b"])
>>> sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
>>> ordered_dict = OrderedDict(sorted_by_freq_tuples)
>>> v1 = vocab(ordered_dict)
>>> print(v1['a']) #prints 1
>>> print(v1['out of vocab']) #raise RuntimeError since default index is not set
>>> tokens = ['e', 'd', 'c', 'b', 'a']
>>> #adding <unk> token and default index
>>> unk_token = '<unk>'
>>> default_index = -1
>>> v2 = vocab(OrderedDict([(token, 1) for token in tokens]), specials=[unk_token])
>>> v2.set_default_index(default_index)
>>> print(v2['<unk>']) #prints 0
>>> print(v2['out of vocab']) #prints -1
>>> #make default index same as index of unk_token
>>> v2.set_default_index(v2[unk_token])
>>> v2['out of vocab'] is v2[unk_token] #prints True 
```

## build_vocab_from_iterator

```py
torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True, max_tokens: Optional[int] = None) → Vocab
```

从迭代器构建词汇表。

参数：

+   **iterator** - 用于构建词汇表的迭代器。必须产生令牌的列表或迭代器。

+   **min_freq** - 需要包含令牌在词汇表中的最小频率。

+   **specials** - 要添加的特殊符号。所提供的令牌顺序将被保留。

+   **special_first** - 指示是否在开头或结尾插入符号。

+   **max_tokens** - 如果提供，从最常见的令牌中创建词汇表，数量为 max_tokens - len(specials)。

返回：

一个词汇对象

返回类型：

torchtext.vocab.Vocab

示例

```py
>>> #generating vocab from text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens(file_path):
>>>     with io.open(file_path, encoding = 'utf-8') as f:
>>>         for line in f:
>>>             yield line.strip().split()
>>> vocab = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"]) 
```

## Vectors

```py
class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)
```

```py
__init__(name, cache=None, url=None, unk_init=None, max_vectors=None) → None
```

参数：

+   **name** - 包含向量的文件的名称

+   **cache** - 用于缓存向量的目录

+   **url** - 如果在缓存中找不到向量，则用于下载的 url

+   **unk_init**（*回调*）- 默认情况下，将词汇表外的词向量初始化为零向量；可以是任何接受张量并返回相同大小的张量的函数

+   **max_vectors**（[*int*](https://docs.python.org/3/library/functions.html#int "(在 Python v3.12 中)")）- 这可以用来限制加载的预训练向量的数量。大多数预训练向量集按照单词频率降序排序。因此，在整个集合无法放入内存或出于其他原因不需要整个集合的情况下，通过传递 max_vectors 可以限制加载集合的大小。

```py
get_vecs_by_tokens(tokens, lower_case_backup=False)
```

查找标记的嵌入向量。

参数：

+   **tokens** - 一个标记或标记列表。如果 tokens 是一个字符串，则返回形状为 self.dim 的 1-D 张量；如果 tokens 是一个字符串列表，则返回形状为(len(tokens), self.dim)的 2-D 张量。

+   **lower_case_backup** - 是否在小写中查找标记。如果为 False，则将查找原始大小写中的每个标记；如果为 True，则首先查找原始大小写中的每个标记，如果在属性 stoi 的键中找不到，则将查找小写中的标记。默认值：False。

示例

```py
>>> examples = ['chip', 'baby', 'Beautiful']
>>> vec = text.vocab.GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True) 
```

### 预训练词嵌入

## GloVe

```py
class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)
```

## FastText

```py
class torchtext.vocab.FastText(language='en', **kwargs)
```

## CharNGram

```py
class torchtext.vocab.CharNGram(**kwargs)
```