diff --git a/chapter_attention-mechanisms/attention-cues.md b/chapter_attention-mechanisms/attention-cues.md new file mode 100644 index 0000000000000000000000000000000000000000..0f79292dcb07f5073bd144b2811367cb30bbcc0b --- /dev/null +++ b/chapter_attention-mechanisms/attention-cues.md @@ -0,0 +1,117 @@ +# 注意线索 +:label:`sec_attention-cues` + +谢谢你关注这本书。注意力是一种稀缺的资源:目前你正在阅读这本书而忽略了其余的书。因此,与金钱类似,你的注意力是用机会成本来支付的。为了确保您现在的注意力值得投入,我们非常积极地谨慎注意制作一本好书。注意力是生命拱门的基石,也是任何作品例外主义的关键。 + +由于经济学研究稀缺资源的分配,因此我们正处在关注经济时代,人类的注意力被视为可以交换的有限、有价值和稀缺的商品。为了利用它,已经开发了许多商业模式。在音乐或视频流媒体服务上,我们要么关注他们的广告,要么付钱来隐藏它们。为了在线游戏世界的增长,我们要么注意参与战斗,以吸引新玩家,要么付钱立即变得强大。没什么是免费的。 + +总而言之,关注的是,我们环境中的信息并不稀少。在检查视觉场景时,我们的视神经收到的信息大约为每秒 $10^8$ 位,远远超过了我们的大脑能够完全处理的水平。幸运的是,我们的祖先从经验中学到(也称为数据),* 并非所有的感官输入都是一样的 *。在整个人类历史中,只将注意力引向感兴趣的一小部分信息的能力使我们的大脑能够更明智地分配资源来生存、成长和社交,例如检测掠食者、捕食者和伴侣。 + +## 生物学中的注意线索 + +为了解释我们的注意力是如何在视觉世界中部署的,一个双组成部分的框架已经出现并普遍存在。这个想法可以追溯到 19 世纪 90 年代的威廉·詹姆斯,他被认为是 “美国心理学之父” :cite:`James.2007`。在这个框架中,受试者使用 * 非言论提示 * 和 * 声音提示 * 有选择地引导注意力的焦点。 + +非自豪的提示是基于环境中物体的显著性和显著性。想象一下,你面前有五个物品:一份报纸、一篇研究论文、一杯咖啡、一本笔记本和一本 :numref:`fig_eye-coffee` 中的书。虽然所有纸制品都是黑白印刷的,但咖啡杯是红色的。换句话说,这种咖啡在这种视觉环境中本质上是突出和显眼的,自动而且非自愿地引起人们的注意。所以你把 fovea(视力最高的黄斑中心)带到咖啡上,如 :numref:`fig_eye-coffee` 所示。 + +![Using the nonvolitional cue based on saliency (red cup, non-paper), attention is involuntarily directed to the coffee.](../img/eye-coffee.svg) +:width:`400px` +:label:`fig_eye-coffee` + +喝咖啡后,你会变成含咖啡因并想读书。所以你转过头,重新聚焦你的眼睛,然后看看 :numref:`fig_eye-book` 中描述的书。与 :numref:`fig_eye-coffee` 中的咖啡偏向于根据显著程度进行选择的情况不同,在这种依赖任务的情况下,您可以选择受认知和言语控制的书。使用基于变量选择标准的 volitional 提示,这种形式的注意力更为谨慎。该主题的自愿努力也更加强大。 + +![Using the volitional cue (want to read a book) that is task-dependent, attention is directed to the book under volitional control.](../img/eye-book.svg) +:width:`400px` +:label:`fig_eye-book` + +## 查询、键和值 + +受到解释注意力部署的非自豪和言语的注意线索的启发,我们将在下文中描述通过纳入这两个注意线索来设计注意机制的框架。 + +首先,考虑只有非自豪提示可用的更简单的情况。要将选择偏向于感官输入,我们可以简单地使用参数化的完全连接层,甚至是非参数化的最大值或平均池。 + +因此,将注意力机制与那些完全连接的层或池层区别开来的是包含任意提示。在注意机制的背景下,我们将自由线索称为 * Queries*。考虑到任何查询,注意机制通过 * 注意力池 * 偏向于感官输入(例如中间特征表示)的选择。在注意机制的背景下,这些感官输入被称为 * 值 *。更一般地说,每个值都与一个 * 键 * 配对,这可以想象为该感官输入的非自旋提示。如 :numref:`fig_qkv` 所示,我们可以设计注意力池,以便给定的查询(volitional Cue)可以与键(非自豪提示)进行交互,这将指导偏差选择对值(感官输入)的偏差选择。 + +![Attention mechanisms bias selection over values (sensory inputs) via attention pooling, which incorporates queries (volitional cues) and keys (nonvolitional cues).](../img/qkv.svg) +:label:`fig_qkv` + +请注意,关注机制的设计有许多替代方案。例如,我们可以设计一个不可区分的注意力模型,该模型可以使用强化学习方法 :cite:`Mnih.Heess.Graves.ea.2014` 进行训练。鉴于该框架在 :numref:`fig_qkv` 中占据主导地位,该框架下的模型将成为本章我们关注的中心。 + +## 注意力的可视化 + +平均汇集可以被视为投入的加权平均值,其中权重是一致的。实际上,注意力池使用加权平均值聚合值,其中权重是在给定查询和不同键之间计算的。 + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +``` + +```{.python .input} +#@tab tensorflow +from d2l import tensorflow as d2l +import tensorflow as tf +``` + +为了可视化注意力权重,我们定义了 `show_heatmaps` 函数。它的输入 `matrices` 具有形状(要显示的行数,要显示的列数,查询数,键数)。 + +```{.python .input} +#@tab all +#@save +def show_heatmaps(matrices, xlabel, ylabel, titles=None, figsize=(2.5, 2.5), + cmap='Reds'): + d2l.use_svg_display() + num_rows, num_cols = matrices.shape[0], matrices.shape[1] + fig, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize, + sharex=True, sharey=True, squeeze=False) + for i, (row_axes, row_matrices) in enumerate(zip(axes, matrices)): + for j, (ax, matrix) in enumerate(zip(row_axes, row_matrices)): + pcm = ax.imshow(d2l.numpy(matrix), 
cmap=cmap) + if i == num_rows - 1: + ax.set_xlabel(xlabel) + if j == 0: + ax.set_ylabel(ylabel) + if titles: + ax.set_title(titles[j]) + fig.colorbar(pcm, ax=axes, shrink=0.6); +``` + +为了进行演示,我们考虑一个简单的情况,即仅当查询和键相同时,注意力权重为 1;否则为零。 + +```{.python .input} +#@tab all +attention_weights = d2l.reshape(d2l.eye(10), (1, 1, 10, 10)) +show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries') +``` + +在接下来的章节中,我们经常调用此函数来显示注意力权重。 + +## 摘要 + +* 人类的注意力是有限、宝贵和稀缺的资源。 +* 受试者使用非自豪和言语提示有选择地引导注意力。前者基于显著程度,后者取决于任务。 +* 由于包含了空白提示,注意机制与完全连接的层或池层不同。 +* 注意机制通过注意力池使选择偏向于值(感官输入),其中包含查询(言论提示)和键(非自豪提示)。键和值是配对的。 +* 我们可以直观地显示查询和键之间的注意力权重。 + +## 练习 + +1. 在机器翻译中通过令牌解码序列令牌时,空白提示可能是什么?什么是非自豪的线索和感官输入? +1. 随机生成 $10 \times 10$ 矩阵并使用 softmax 运算来确保每行都是有效的概率分布。可视化输出注意力权重。 + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/1596) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1592) +:end_tab: + +:begin_tab:`tensorflow` +[Discussions](https://discuss.d2l.ai/t/1710) +:end_tab: diff --git a/chapter_attention-mechanisms/attention-cues_origin.md b/chapter_attention-mechanisms/attention-cues_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..09585cdefe4847a565633f8ed49504c5d15409d5 --- /dev/null +++ b/chapter_attention-mechanisms/attention-cues_origin.md @@ -0,0 +1,238 @@ +# Attention Cues +:label:`sec_attention-cues` + +Thank you for your attention +to this book. +Attention is a scarce resource: +at the moment +you are reading this book +and ignoring the rest. +Thus, similar to money, +your attention is being paid with an opportunity cost. +To ensure that your investment of attention +right now is worthwhile, +we have been highly motivated to pay our attention carefully +to produce a nice book. +Attention +is the keystone in the arch of life and +holds the key to any work's exceptionalism. + + +Since economics studies the allocation of scarce resources, +we are +in the era of the attention economy, +where human attention is treated as a limited, valuable, and scarce commodity +that can be exchanged. +Numerous business models have been +developed to capitalize on it. +On music or video streaming services, +we either pay attention to their ads +or pay money to hide them. +For growth in the world of online games, +we either pay attention to +participate in battles, which attract new gamers, +or pay money to instantly become powerful. +Nothing comes for free. + +All in all, +information in our environment is not scarce, +attention is. +When inspecting a visual scene, +our optic nerve receives information +at the order of $10^8$ bits per second, +far exceeding what our brain can fully process. +Fortunately, +our ancestors had learned from experience (also known as data) +that *not all sensory inputs are created equal*. +Throughout human history, +the capability of directing attention +to only a fraction of information of interest +has enabled our brain +to allocate resources more smartly +to survive, to grow, and to socialize, +such as detecting predators, preys, and mates. + + + +## Attention Cues in Biology + +To explain how our attention is deployed in the visual world, +a two-component framework has emerged +and been pervasive. +This idea dates back to William James in the 1890s, +who is considered the "father of American psychology" :cite:`James.2007`. +In this framework, +subjects selectively direct the spotlight of attention +using both the *nonvolitional cue* and *volitional cue*. 
+ +The nonvolitional cue is based on +the saliency and conspicuity of objects in the environment. +Imagine there are five objects in front of you: +a newspaper, a research paper, a cup of coffee, a notebook, and a book such as in :numref:`fig_eye-coffee`. +While all the paper products are printed in black and white, +the coffee cup is red. +In other words, +this coffee is intrinsically salient and conspicuous in +this visual environment, +automatically and involuntarily drawing attention. +So you bring the fovea (the center of the macula where visual acuity is highest) onto the coffee as shown in :numref:`fig_eye-coffee`. + +![Using the nonvolitional cue based on saliency (red cup, non-paper), attention is involuntarily directed to the coffee.](../img/eye-coffee.svg) +:width:`400px` +:label:`fig_eye-coffee` + +After drinking coffee, +you become caffeinated and +want to read a book. +So you turn your head, refocus your eyes, +and look at the book as depicted in :numref:`fig_eye-book`. +Different from +the case in :numref:`fig_eye-coffee` +where the coffee biases you towards +selecting based on saliency, +in this task-dependent case you select the book under +cognitive and volitional control. +Using the volitional cue based on variable selection criteria, +this form of attention is more deliberate. +It is also more powerful with the subject's voluntary effort. + +![Using the volitional cue (want to read a book) that is task-dependent, attention is directed to the book under volitional control.](../img/eye-book.svg) +:width:`400px` +:label:`fig_eye-book` + + +## Queries, Keys, and Values + +Inspired by the nonvolitional and volitional attention cues that explain the attentional deployment, +in the following we will +describe a framework for +designing attention mechanisms +by incorporating these two attention cues. + +To begin with, consider the simpler case where only +nonvolitional cues are available. +To bias selection over sensory inputs, +we can simply use +a parameterized fully-connected layer +or even non-parameterized +max or average pooling. + +Therefore, +what sets attention mechanisms +apart from those fully-connected layers +or pooling layers +is the inclusion of the volitional cues. +In the context of attention mechanisms, +we refer to volitional cues as *queries*. +Given any query, +attention mechanisms +bias selection over sensory inputs (e.g., intermediate feature representations) +via *attention pooling*. +These sensory inputs are called *values* in the context of attention mechanisms. +More generally, +every value is paired with a *key*, +which can be thought of the nonvolitional cue of that sensory input. +As shown in :numref:`fig_qkv`, +we can design attention pooling +so that the given query (volitional cue) can interact with keys (nonvolitional cues), +which guides bias selection over values (sensory inputs). + +![Attention mechanisms bias selection over values (sensory inputs) via attention pooling, which incorporates queries (volitional cues) and keys (nonvolitional cues).](../img/qkv.svg) +:label:`fig_qkv` + +Note that there are many alternatives for the design of attention mechanisms. +For instance, +we can design a non-differentiable attention model +that can be trained using reinforcement learning methods :cite:`Mnih.Heess.Graves.ea.2014`. +Given the dominance of the framework in :numref:`fig_qkv`, +models under this framework +will be the center of our attention in this chapter. 
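To make the query-key-value framework above concrete, the following minimal sketch (our illustration, not part of the book's `d2l` code) scores a single query against every key, normalizes the scores into attention weights with softmax, and returns the weighted average of the values. The plain dot product used as the scoring function here is only a placeholder; later sections of this chapter develop the actual scoring functions and attention pooling.

```python
import torch

def toy_attention_pooling(query, keys, values):
    """query: (d,), keys: (m, d), values: (m, v) -> weighted average of values, (v,)."""
    scores = keys @ query                   # one score per key (placeholder: dot product)
    weights = torch.softmax(scores, dim=0)  # attention weights form a probability distribution
    return weights @ values                 # bias selection over values (sensory inputs)

query = torch.tensor([1.0, 0.0])                           # volitional cue
keys = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # nonvolitional cues
values = torch.tensor([[10.0], [20.0], [30.0]])            # sensory inputs
print(toy_attention_pooling(query, keys, values))  # the first value receives the largest weight
```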
+ + +## Visualization of Attention + +Average pooling +can be treated as a weighted average of inputs, +where weights are uniform. +In practice, +attention pooling aggregates values using weighted average, where weights are computed between the given query and different keys. + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import np, npx +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +``` + +```{.python .input} +#@tab tensorflow +from d2l import tensorflow as d2l +import tensorflow as tf +``` +To visualize attention weights, +we define the `show_heatmaps` function. +Its input `matrices` has the shape (number of rows for display, number of columns for display, number of queries, number of keys). + +```{.python .input} +#@tab all +#@save +def show_heatmaps(matrices, xlabel, ylabel, titles=None, figsize=(2.5, 2.5), + cmap='Reds'): + d2l.use_svg_display() + num_rows, num_cols = matrices.shape[0], matrices.shape[1] + fig, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize, + sharex=True, sharey=True, squeeze=False) + for i, (row_axes, row_matrices) in enumerate(zip(axes, matrices)): + for j, (ax, matrix) in enumerate(zip(row_axes, row_matrices)): + pcm = ax.imshow(d2l.numpy(matrix), cmap=cmap) + if i == num_rows - 1: + ax.set_xlabel(xlabel) + if j == 0: + ax.set_ylabel(ylabel) + if titles: + ax.set_title(titles[j]) + fig.colorbar(pcm, ax=axes, shrink=0.6); +``` + +For demonstration, +we consider a simple case where +the attention weight is one only when the query and the key are the same; otherwise it is zero. + +```{.python .input} +#@tab all +attention_weights = d2l.reshape(d2l.eye(10), (1, 1, 10, 10)) +show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries') +``` + +In the subsequent sections, +we will often invoke this function to visualize attention weights. + +## Summary + +* Human attention is a limited, valuable, and scarce resource. +* Subjects selectively direct attention using both the nonvolitional and volitional cues. The former is based on saliency and the latter is task-dependent. +* Attention mechanisms are different from fully-connected layers or pooling layers due to inclusion of the volitional cues. +* Attention mechanisms bias selection over values (sensory inputs) via attention pooling, which incorporates queries (volitional cues) and keys (nonvolitional cues). Keys and values are paired. +* We can visualize attention weights between queries and keys. + +## Exercises + +1. What can be the volitional cue when decoding a sequence token by token in machine translation? What are the nonvolitional cues and the sensory inputs? +1. Randomly generate a $10 \times 10$ matrix and use the softmax operation to ensure each row is a valid probability distribution. Visualize the output attention weights. 
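For the second exercise, one possible sketch (ours, not a reference solution) reuses the `show_heatmaps` function defined above and assumes a row-wise softmax is what "each row is a valid probability distribution" asks for:

```python
import torch

# Random scores, normalized so that every row is a valid probability distribution
scores = torch.rand(10, 10)
attention_weights = torch.softmax(scores, dim=-1)  # each row sums to 1
# `show_heatmaps` expects (no. of rows for display, no. of columns, no. of queries, no. of keys)
show_heatmaps(attention_weights.reshape((1, 1, 10, 10)),
              xlabel='Keys', ylabel='Queries')
```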
+ +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/1596) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1592) +:end_tab: + +:begin_tab:`tensorflow` +[Discussions](https://discuss.d2l.ai/t/1710) +:end_tab: diff --git a/chapter_attention-mechanisms/attention-scoring-functions.md b/chapter_attention-mechanisms/attention-scoring-functions.md new file mode 100644 index 0000000000000000000000000000000000000000..994e2c36086dfd9444da598f6dd6e0bd463f1ec4 --- /dev/null +++ b/chapter_attention-mechanisms/attention-scoring-functions.md @@ -0,0 +1,311 @@ +# 注意评分功能 +:label:`sec_attention-scoring-functions` + +在 :numref:`sec_nadaraya-waston` 中,我们使用高斯内核来模拟查询和键之间的交互。将 :eqref:`eq_nadaraya-waston-gaussian` 中的高斯内核的指数视为 * 注意力评分函数 *(简称 * 评分函数 *),这个函数的结果基本上被输入了 softmax 操作。因此,我们获得了与键配对的值的概率分布(注意力权重)。最后,注意力集中的输出只是基于这些注意力权重的值的加权总和。 + +从较高层面来说,我们可以使用上述算法实例化 :numref:`fig_qkv` 中的注意机制框架。:numref:`fig_attention_output` 表示 $a$ 的注意力评分函数,说明了如何将注意力集中的输出计算为加权值和。由于注意力权重是概率分布,因此加权总和基本上是加权平均值。 + +![Computing the output of attention pooling as a weighted average of values.](../img/attention-output.svg) +:label:`fig_attention_output` + +从数学上讲,假设我们有一个查询 $\mathbf{q} \in \mathbb{R}^q$ 和 $m$ 键值对 $(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)$,其中任何 $\mathbf{k}_i \in \mathbb{R}^k$ 和任何 $\mathbf{v}_i \in \mathbb{R}^v$。注意力池 $f$ 被实例化为值的加权总和: + +$$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i \in \mathbb{R}^v,$$ +:eqlabel:`eq_attn-pooling` + +其中查询 $\mathbf{q}$ 和键 $\mathbf{k}_i$ 的注意力权重(标量)是通过注意力评分函数 $a$ 的 softmax 运算计算的,该函数将两个向量映射到标量: + +$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^m \exp(a(\mathbf{q}, \mathbf{k}_j))} \in \mathbb{R}.$$ +:eqlabel:`eq_attn-scoring-alpha` + +正如我们所看到的,注意力评分功能 $a$ 的不同选择导致不同的注意力集中行为。在本节中,我们将介绍两个流行的评分功能,我们稍后将用来开发更复杂的注意力机制。 + +```{.python .input} +import math +from d2l import mxnet as d2l +from mxnet import np, npx +from mxnet.gluon import nn +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import math +import torch +from torch import nn +``` + +## 蒙面 Softmax 操作 + +正如我们刚才提到的,softmax 运算用于输出概率分布作为注意力权重。在某些情况下,并非所有的价值都应该被纳入注意力集中。例如,为了在 :numref:`sec_machine_translation` 中高效处理微型批量,某些文本序列填充了没有意义的特殊令牌。为了将注意力集中在仅作为值的有意义的令牌上,我们可以指定一个有效的序列长度(以令牌数表示),以便在计算 softmax 时过滤掉超出此指定范围的那些。通过这种方式,我们可以在下面的 `masked_softmax` 函数中实现这样的 * 掩码 softmax 操作 *,其中任何超出有效长度的值都被掩盖为零。 + +```{.python .input} +#@save +def masked_softmax(X, valid_lens): + """Perform softmax operation by masking elements on the last axis.""" + # `X`: 3D tensor, `valid_lens`: 1D or 2D tensor + if valid_lens is None: + return npx.softmax(X) + else: + shape = X.shape + if valid_lens.ndim == 1: + valid_lens = valid_lens.repeat(shape[1]) + else: + valid_lens = valid_lens.reshape(-1) + # On the last axis, replace masked elements with a very large negative + # value, whose exponentiation outputs 0 + X = npx.sequence_mask(X.reshape(-1, shape[-1]), valid_lens, True, + value=-1e6, axis=1) + return npx.softmax(X).reshape(shape) +``` + +```{.python .input} +#@tab pytorch +#@save +def masked_softmax(X, valid_lens): + """Perform softmax operation by masking elements on the last axis.""" + # `X`: 3D tensor, `valid_lens`: 1D or 2D tensor + if valid_lens is None: + return nn.functional.softmax(X, dim=-1) + else: + shape = X.shape + if valid_lens.dim() == 1: + valid_lens = torch.repeat_interleave(valid_lens, 
shape[1]) + else: + valid_lens = valid_lens.reshape(-1) + # On the last axis, replace masked elements with a very large negative + # value, whose exponentiation outputs 0 + X = d2l.sequence_mask(X.reshape(-1, shape[-1]), valid_lens, + value=-1e6) + return nn.functional.softmax(X.reshape(shape), dim=-1) +``` + +为了演示此函数的工作原理,请考虑由两个 $2 \times 4$ 矩阵示例组成的小批量,其中这两个示例的有效长度分别为 2 和 3 个。由于蒙面 softmax 操作,超出有效长度的值都被掩盖为零。 + +```{.python .input} +masked_softmax(np.random.uniform(size=(2, 2, 4)), d2l.tensor([2, 3])) +``` + +```{.python .input} +#@tab pytorch +masked_softmax(torch.rand(2, 2, 4), torch.tensor([2, 3])) +``` + +同样,我们也可以使用二维张量为每个矩阵示例中的每一行指定有效长度。 + +```{.python .input} +masked_softmax(np.random.uniform(size=(2, 2, 4)), + d2l.tensor([[1, 3], [2, 4]])) +``` + +```{.python .input} +#@tab pytorch +masked_softmax(torch.rand(2, 2, 4), d2l.tensor([[1, 3], [2, 4]])) +``` + +## 添加剂注意 +:label:`subsec_additive-attention` + +一般来说,当查询和键是不同长度的矢量时,我们可以使用附加注意力作为评分功能。给定查询 $\mathbf{q} \in \mathbb{R}^q$ 和关键 $\mathbf{k} \in \mathbb{R}^k$,* 加法注意 * 评分功能 + +$$a(\mathbf q, \mathbf k) = \mathbf w_v^\top \text{tanh}(\mathbf W_q\mathbf q + \mathbf W_k \mathbf k) \in \mathbb{R},$$ +:eqlabel:`eq_additive-attn` + +其中可学习的参数 $\mathbf W_q\in\mathbb R^{h\times q}$、$\mathbf W_k\in\mathbb R^{h\times k}$ 和 $\mathbf w_v\in\mathbb R^{h}$。相当于 :eqref:`eq_additive-attn`,查询和密钥被连接在一个 MLP 中,其中包含一个隐藏层,其隐藏单位的数量为 $h$,这是一个超参数。通过使用 $\tanh$ 作为激活函数和禁用偏见术语,我们将在以下内容中实现附加注意。 + +```{.python .input} +#@save +class AdditiveAttention(nn.Block): + """Additive attention.""" + def __init__(self, num_hiddens, dropout, **kwargs): + super(AdditiveAttention, self).__init__(**kwargs) + # Use `flatten=False` to only transform the last axis so that the + # shapes for the other axes are kept the same + self.W_k = nn.Dense(num_hiddens, use_bias=False, flatten=False) + self.W_q = nn.Dense(num_hiddens, use_bias=False, flatten=False) + self.w_v = nn.Dense(1, use_bias=False, flatten=False) + self.dropout = nn.Dropout(dropout) + + def forward(self, queries, keys, values, valid_lens): + queries, keys = self.W_q(queries), self.W_k(keys) + # After dimension expansion, shape of `queries`: (`batch_size`, no. of + # queries, 1, `num_hiddens`) and shape of `keys`: (`batch_size`, 1, + # no. of key-value pairs, `num_hiddens`). Sum them up with + # broadcasting + features = np.expand_dims(queries, axis=2) + np.expand_dims( + keys, axis=1) + features = np.tanh(features) + # There is only one output of `self.w_v`, so we remove the last + # one-dimensional entry from the shape. Shape of `scores`: + # (`batch_size`, no. of queries, no. of key-value pairs) + scores = np.squeeze(self.w_v(features), axis=-1) + self.attention_weights = masked_softmax(scores, valid_lens) + # Shape of `values`: (`batch_size`, no. of key-value pairs, value + # dimension) + return npx.batch_dot(self.dropout(self.attention_weights), values) +``` + +```{.python .input} +#@tab pytorch +#@save +class AdditiveAttention(nn.Module): + def __init__(self, key_size, query_size, num_hiddens, dropout, **kwargs): + super(AdditiveAttention, self).__init__(**kwargs) + self.W_k = nn.Linear(key_size, num_hiddens, bias=False) + self.W_q = nn.Linear(query_size, num_hiddens, bias=False) + self.w_v = nn.Linear(num_hiddens, 1, bias=False) + self.dropout = nn.Dropout(dropout) + + def forward(self, queries, keys, values, valid_lens): + queries, keys = self.W_q(queries), self.W_k(keys) + # After dimension expansion, shape of `queries`: (`batch_size`, no. 
of
+        # queries, 1, `num_hiddens`) and shape of `keys`: (`batch_size`, 1,
+        # no. of key-value pairs, `num_hiddens`). Sum them up with
+        # broadcasting
+        features = queries.unsqueeze(2) + keys.unsqueeze(1)
+        features = torch.tanh(features)
+        # There is only one output of `self.w_v`, so we remove the last
+        # one-dimensional entry from the shape. Shape of `scores`:
+        # (`batch_size`, no. of queries, no. of key-value pairs)
+        scores = self.w_v(features).squeeze(-1)
+        self.attention_weights = masked_softmax(scores, valid_lens)
+        # Shape of `values`: (`batch_size`, no. of key-value pairs, value
+        # dimension)
+        return torch.bmm(self.dropout(self.attention_weights), values)
+```
+
+让我们用一个玩具示例来演示上面的 `AdditiveAttention` 类,其中查询、键和值的形状(批量大小、步数或令牌序列长度、特征大小)分别为($2$、$1$、$20$)、($2$、$10$、$2$)和($2$、$10$、$4$)。注意力池输出的形状为(批量大小、查询的步数、值的特征大小)。
+
+```{.python .input}
+queries, keys = d2l.normal(0, 1, (2, 1, 20)), d2l.ones((2, 10, 2))
+# The two value matrices in the `values` minibatch are identical
+values = np.arange(40).reshape(1, 10, 4).repeat(2, axis=0)
+valid_lens = d2l.tensor([2, 6])
+
+attention = AdditiveAttention(num_hiddens=8, dropout=0.1)
+attention.initialize()
+attention(queries, keys, values, valid_lens)
+```
+
+```{.python .input}
+#@tab pytorch
+queries, keys = d2l.normal(0, 1, (2, 1, 20)), d2l.ones((2, 10, 2))
+# The two value matrices in the `values` minibatch are identical
+values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat(
+    2, 1, 1)
+valid_lens = d2l.tensor([2, 6])
+
+attention = AdditiveAttention(key_size=2, query_size=20, num_hiddens=8,
+                              dropout=0.1)
+attention.eval()
+attention(queries, keys, values, valid_lens)
+```
+
+尽管加法注意力包含可学习的参数,但由于本例中每个键都是相同的,所以注意力权重是均匀的,由指定的有效长度决定。
+
+```{.python .input}
+#@tab all
+d2l.show_heatmaps(d2l.reshape(attention.attention_weights, (1, 1, 2, 10)),
+                  xlabel='Keys', ylabel='Queries')
+```
+
+## 缩放点积注意力
+
+在计算上更高效的评分函数设计可以简单地使用点积。但是,点积操作要求查询和键具有相同的向量长度,比如 $d$。假设查询和键的所有元素都是独立的随机变量,且均值为零、方差为一,那么两个向量的点积的均值为零,方差为 $d$。为了确保无论向量长度如何,点积的方差始终为一,*缩放点积注意力*评分函数
+
+$$a(\mathbf q, \mathbf k) = \mathbf{q}^\top \mathbf{k} /\sqrt{d}$$
+
+将点积除以 $\sqrt{d}$。在实践中,为了提高效率,我们通常以小批量的方式计算注意力,例如对 $n$ 个查询和 $m$ 个键值对计算注意力,其中查询和键的长度为 $d$,值的长度为 $v$。查询 $\mathbf Q\in\mathbb R^{n\times d}$、键 $\mathbf K\in\mathbb R^{m\times d}$ 和值 $\mathbf V\in\mathbb R^{m\times v}$ 的缩放点积注意力是
+
+$$ \mathrm{softmax}\left(\frac{\mathbf Q \mathbf K^\top }{\sqrt{d}}\right) \mathbf V \in \mathbb{R}^{n\times v}.$$
+:eqlabel:`eq_softmax_QK_V`
+
+在下面缩放点积注意力的实现中,我们使用 dropout 进行模型正则化。
+
+```{.python .input}
+#@save
+class DotProductAttention(nn.Block):
+    """Scaled dot product attention."""
+    def __init__(self, dropout, **kwargs):
+        super(DotProductAttention, self).__init__(**kwargs)
+        self.dropout = nn.Dropout(dropout)
+
+    # Shape of `queries`: (`batch_size`, no. of queries, `d`)
+    # Shape of `keys`: (`batch_size`, no. of key-value pairs, `d`)
+    # Shape of `values`: (`batch_size`, no. of key-value pairs, value
+    # dimension)
+    # Shape of `valid_lens`: (`batch_size`,) or (`batch_size`, no.
of queries) + def forward(self, queries, keys, values, valid_lens=None): + d = queries.shape[-1] + # Set `transpose_b=True` to swap the last two dimensions of `keys` + scores = npx.batch_dot(queries, keys, transpose_b=True) / math.sqrt(d) + self.attention_weights = masked_softmax(scores, valid_lens) + return npx.batch_dot(self.dropout(self.attention_weights), values) +``` + +```{.python .input} +#@tab pytorch +#@save +class DotProductAttention(nn.Module): + """Scaled dot product attention.""" + def __init__(self, dropout, **kwargs): + super(DotProductAttention, self).__init__(**kwargs) + self.dropout = nn.Dropout(dropout) + + # Shape of `queries`: (`batch_size`, no. of queries, `d`) + # Shape of `keys`: (`batch_size`, no. of key-value pairs, `d`) + # Shape of `values`: (`batch_size`, no. of key-value pairs, value + # dimension) + # Shape of `valid_lens`: (`batch_size`,) or (`batch_size`, no. of queries) + def forward(self, queries, keys, values, valid_lens=None): + d = queries.shape[-1] + # Set `transpose_b=True` to swap the last two dimensions of `keys` + scores = torch.bmm(queries, keys.transpose(1,2)) / math.sqrt(d) + self.attention_weights = masked_softmax(scores, valid_lens) + return torch.bmm(self.dropout(self.attention_weights), values) +``` + +为了演示上述 `DotProductAttention` 类别,我们使用与先前玩具示例相同的键、值和有效长度进行附加注意。对于点积操作,我们将查询的特征大小与键的特征大小相同。 + +```{.python .input} +queries = d2l.normal(0, 1, (2, 1, 2)) +attention = DotProductAttention(dropout=0.5) +attention.initialize() +attention(queries, keys, values, valid_lens) +``` + +```{.python .input} +#@tab pytorch +queries = d2l.normal(0, 1, (2, 1, 2)) +attention = DotProductAttention(dropout=0.5) +attention.eval() +attention(queries, keys, values, valid_lens) +``` + +与加法注意力演示相同,由于 `keys` 包含无法通过任何查询区分的相同元素,因此获得了统一的注意力权重。 + +```{.python .input} +#@tab all +d2l.show_heatmaps(d2l.reshape(attention.attention_weights, (1, 1, 2, 10)), + xlabel='Keys', ylabel='Queries') +``` + +## 摘要 + +* 我们可以将注意力集中的输出计算为值的加权平均值,其中注意力评分功能的不同选择会导致不同的注意力集中行为。 +* 当查询和密钥是不同长度的矢量时,我们可以使用加法注意力评分功能。当它们相同时,缩放的点-产品注意力评分功能在计算上更有效率。 + +## 练习 + +1. 修改玩具示例中的按键并可视化注意力重量。添加剂的注意力和缩放的点-产品的注意力是否仍然产生相同的注意力?为什么或为什么不? +1. 只使用矩阵乘法,您能否为具有不同矢量长度的查询和键设计新的评分函数? +1. 当查询和键具有相同的矢量长度时,矢量求和是否比计分函数的点积更好?为什么或为什么不? + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/346) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1064) +:end_tab: diff --git a/chapter_attention-mechanisms/attention-scoring-functions_origin.md b/chapter_attention-mechanisms/attention-scoring-functions_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..71b77e577442e036bc23327f862a1825401f5c3c --- /dev/null +++ b/chapter_attention-mechanisms/attention-scoring-functions_origin.md @@ -0,0 +1,449 @@ +# Attention Scoring Functions +:label:`sec_attention-scoring-functions` + +In :numref:`sec_nadaraya-waston`, +we used a Gaussian kernel to model +interactions between queries and keys. +Treating the exponent of the Gaussian kernel +in :eqref:`eq_nadaraya-waston-gaussian` +as an *attention scoring function* (or *scoring function* for short), +the results of this function were +essentially fed into +a softmax operation. +As a result, +we obtained +a probability distribution (attention weights) +over values that are paired with keys. +In the end, +the output of the attention pooling +is simply a weighted sum of the values +based on these attention weights. + +At a high level, +we can use the above algorithm +to instantiate the framework of attention mechanisms +in :numref:`fig_qkv`. 
+Denoting an attention scoring function by $a$, +:numref:`fig_attention_output` +illustrates how the output of attention pooling +can be computed as a weighted sum of values. +Since attention weights are +a probability distribution, +the weighted sum is essentially +a weighted average. + +![Computing the output of attention pooling as a weighted average of values.](../img/attention-output.svg) +:label:`fig_attention_output` + + + +Mathematically, +suppose that we have +a query $\mathbf{q} \in \mathbb{R}^q$ +and $m$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)$, where any $\mathbf{k}_i \in \mathbb{R}^k$ and any $\mathbf{v}_i \in \mathbb{R}^v$. +The attention pooling $f$ +is instantiated as a weighted sum of the values: + +$$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i \in \mathbb{R}^v,$$ +:eqlabel:`eq_attn-pooling` + +where +the attention weight (scalar) for the query $\mathbf{q}$ +and key $\mathbf{k}_i$ +is computed by +the softmax operation of +an attention scoring function $a$ that maps two vectors to a scalar: + +$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^m \exp(a(\mathbf{q}, \mathbf{k}_j))} \in \mathbb{R}.$$ +:eqlabel:`eq_attn-scoring-alpha` + +As we can see, +different choices of the attention scoring function $a$ +lead to different behaviors of attention pooling. +In this section, +we introduce two popular scoring functions +that we will use to develop more +sophisticated attention mechanisms later. + +```{.python .input} +import math +from d2l import mxnet as d2l +from mxnet import np, npx +from mxnet.gluon import nn +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import math +import torch +from torch import nn +``` + +## Masked Softmax Operation + +As we just mentioned, +a softmax operation is used to +output a probability distribution as attention weights. +In some cases, +not all the values should be fed into attention pooling. +For instance, +for efficient minibatch processing in :numref:`sec_machine_translation`, +some text sequences are padded with +special tokens that do not carry meaning. +To get an attention pooling +over +only meaningful tokens as values, +we can specify a valid sequence length (in number of tokens) +to filter out those beyond this specified range +when computing softmax. +In this way, +we can implement such a *masked softmax operation* +in the following `masked_softmax` function, +where any value beyond the valid length +is masked as zero. 
+ +```{.python .input} +#@save +def masked_softmax(X, valid_lens): + """Perform softmax operation by masking elements on the last axis.""" + # `X`: 3D tensor, `valid_lens`: 1D or 2D tensor + if valid_lens is None: + return npx.softmax(X) + else: + shape = X.shape + if valid_lens.ndim == 1: + valid_lens = valid_lens.repeat(shape[1]) + else: + valid_lens = valid_lens.reshape(-1) + # On the last axis, replace masked elements with a very large negative + # value, whose exponentiation outputs 0 + X = npx.sequence_mask(X.reshape(-1, shape[-1]), valid_lens, True, + value=-1e6, axis=1) + return npx.softmax(X).reshape(shape) +``` + +```{.python .input} +#@tab pytorch +#@save +def masked_softmax(X, valid_lens): + """Perform softmax operation by masking elements on the last axis.""" + # `X`: 3D tensor, `valid_lens`: 1D or 2D tensor + if valid_lens is None: + return nn.functional.softmax(X, dim=-1) + else: + shape = X.shape + if valid_lens.dim() == 1: + valid_lens = torch.repeat_interleave(valid_lens, shape[1]) + else: + valid_lens = valid_lens.reshape(-1) + # On the last axis, replace masked elements with a very large negative + # value, whose exponentiation outputs 0 + X = d2l.sequence_mask(X.reshape(-1, shape[-1]), valid_lens, + value=-1e6) + return nn.functional.softmax(X.reshape(shape), dim=-1) +``` + +To demonstrate how this function works, +consider a minibatch of two $2 \times 4$ matrix examples, +where the valid lengths for these two examples +are two and three, respectively. +As a result of the masked softmax operation, +values beyond the valid lengths +are all masked as zero. + +```{.python .input} +masked_softmax(np.random.uniform(size=(2, 2, 4)), d2l.tensor([2, 3])) +``` + +```{.python .input} +#@tab pytorch +masked_softmax(torch.rand(2, 2, 4), torch.tensor([2, 3])) +``` + +Similarly, we can also +use a two-dimensional tensor +to specify valid lengths +for every row in each matrix example. + +```{.python .input} +masked_softmax(np.random.uniform(size=(2, 2, 4)), + d2l.tensor([[1, 3], [2, 4]])) +``` + +```{.python .input} +#@tab pytorch +masked_softmax(torch.rand(2, 2, 4), d2l.tensor([[1, 3], [2, 4]])) +``` + +## Additive Attention +:label:`subsec_additive-attention` + +In general, +when queries and keys are vectors of different lengths, +we can use additive attention +as the scoring function. +Given a query $\mathbf{q} \in \mathbb{R}^q$ +and a key $\mathbf{k} \in \mathbb{R}^k$, +the *additive attention* scoring function + +$$a(\mathbf q, \mathbf k) = \mathbf w_v^\top \text{tanh}(\mathbf W_q\mathbf q + \mathbf W_k \mathbf k) \in \mathbb{R},$$ +:eqlabel:`eq_additive-attn` + +where +learnable parameters +$\mathbf W_q\in\mathbb R^{h\times q}$, $\mathbf W_k\in\mathbb R^{h\times k}$, and $\mathbf w_v\in\mathbb R^{h}$. +Equivalent to :eqref:`eq_additive-attn`, +the query and the key are concatenated +and fed into an MLP with a single hidden layer +whose number of hidden units is $h$, a hyperparameter. +By using $\tanh$ as the activation function and disabling +bias terms, +we implement additive attention in the following. 
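To spell out the equivalence claimed above: concatenating the query and the key and multiplying by a single hidden-layer weight matrix gives exactly the same pre-activation as the sum of the two separate projections,

$$\mathbf W \begin{bmatrix}\mathbf q\\ \mathbf k\end{bmatrix} = \mathbf W_q\mathbf q + \mathbf W_k \mathbf k, \quad \text{where } \mathbf W = [\mathbf W_q\ \ \mathbf W_k] \in \mathbb{R}^{h\times(q+k)},$$

so $a(\mathbf q, \mathbf k) = \mathbf w_v^\top \tanh(\mathbf W [\mathbf q; \mathbf k])$ is an MLP with one hidden layer of $h$ units, a $\tanh$ activation, and no bias terms, applied to the concatenation $[\mathbf q; \mathbf k]$.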
+ +```{.python .input} +#@save +class AdditiveAttention(nn.Block): + """Additive attention.""" + def __init__(self, num_hiddens, dropout, **kwargs): + super(AdditiveAttention, self).__init__(**kwargs) + # Use `flatten=False` to only transform the last axis so that the + # shapes for the other axes are kept the same + self.W_k = nn.Dense(num_hiddens, use_bias=False, flatten=False) + self.W_q = nn.Dense(num_hiddens, use_bias=False, flatten=False) + self.w_v = nn.Dense(1, use_bias=False, flatten=False) + self.dropout = nn.Dropout(dropout) + + def forward(self, queries, keys, values, valid_lens): + queries, keys = self.W_q(queries), self.W_k(keys) + # After dimension expansion, shape of `queries`: (`batch_size`, no. of + # queries, 1, `num_hiddens`) and shape of `keys`: (`batch_size`, 1, + # no. of key-value pairs, `num_hiddens`). Sum them up with + # broadcasting + features = np.expand_dims(queries, axis=2) + np.expand_dims( + keys, axis=1) + features = np.tanh(features) + # There is only one output of `self.w_v`, so we remove the last + # one-dimensional entry from the shape. Shape of `scores`: + # (`batch_size`, no. of queries, no. of key-value pairs) + scores = np.squeeze(self.w_v(features), axis=-1) + self.attention_weights = masked_softmax(scores, valid_lens) + # Shape of `values`: (`batch_size`, no. of key-value pairs, value + # dimension) + return npx.batch_dot(self.dropout(self.attention_weights), values) +``` + +```{.python .input} +#@tab pytorch +#@save +class AdditiveAttention(nn.Module): + def __init__(self, key_size, query_size, num_hiddens, dropout, **kwargs): + super(AdditiveAttention, self).__init__(**kwargs) + self.W_k = nn.Linear(key_size, num_hiddens, bias=False) + self.W_q = nn.Linear(query_size, num_hiddens, bias=False) + self.w_v = nn.Linear(num_hiddens, 1, bias=False) + self.dropout = nn.Dropout(dropout) + + def forward(self, queries, keys, values, valid_lens): + queries, keys = self.W_q(queries), self.W_k(keys) + # After dimension expansion, shape of `queries`: (`batch_size`, no. of + # queries, 1, `num_hiddens`) and shape of `keys`: (`batch_size`, 1, + # no. of key-value pairs, `num_hiddens`). Sum them up with + # broadcasting + features = queries.unsqueeze(2) + keys.unsqueeze(1) + features = torch.tanh(features) + # There is only one output of `self.w_v`, so we remove the last + # one-dimensional entry from the shape. Shape of `scores`: + # (`batch_size`, no. of queries, no. of key-value pairs) + scores = self.w_v(features).squeeze(-1) + self.attention_weights = masked_softmax(scores, valid_lens) + # Shape of `values`: (`batch_size`, no. of key-value pairs, value + # dimension) + return torch.bmm(self.dropout(self.attention_weights), values) +``` + +Let us demonstrate the above `AdditiveAttention` class +with a toy example, +where shapes (batch size, number of steps or sequence length in tokens, feature size) +of queries, keys, and values +are ($2$, $1$, $20$), ($2$, $10$, $2$), +and ($2$, $10$, $4$), respectively. +The attention pooling output +has a shape of (batch size, number of steps for queries, feature size for values). 
+ +```{.python .input} +queries, keys = d2l.normal(0, 1, (2, 1, 20)), d2l.ones((2, 10, 2)) +# The two value matrices in the `values` minibatch are identical +values = np.arange(40).reshape(1, 10, 4).repeat(2, axis=0) +valid_lens = d2l.tensor([2, 6]) + +attention = AdditiveAttention(num_hiddens=8, dropout=0.1) +attention.initialize() +attention(queries, keys, values, valid_lens) +``` + +```{.python .input} +#@tab pytorch +queries, keys = d2l.normal(0, 1, (2, 1, 20)), d2l.ones((2, 10, 2)) +# The two value matrices in the `values` minibatch are identical +values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat( + 2, 1, 1) +valid_lens = d2l.tensor([2, 6]) + +attention = AdditiveAttention(key_size=2, query_size=20, num_hiddens=8, + dropout=0.1) +attention.eval() +attention(queries, keys, values, valid_lens) +``` + +Although additive attention contains learnable parameters, +since every key is the same in this example, +the attention weights are uniform, +determined by the specified valid lengths. + +```{.python .input} +#@tab all +d2l.show_heatmaps(d2l.reshape(attention.attention_weights, (1, 1, 2, 10)), + xlabel='Keys', ylabel='Queries') +``` + +## Scaled Dot-Product Attention + +A more computationally efficient +design for the scoring function can be +simply dot product. +However, +the dot product operation +requires that both the query and the key +have the same vector length, say $d$. +Assume that +all the elements of the query and the key +are independent random variables +with zero mean and unit variance. +The dot product of +both vectors has zero mean and a variance of $d$. +To ensure that the variance of the dot product +still remains one regardless of vector length, +the *scaled dot-product attention* scoring function + + +$$a(\mathbf q, \mathbf k) = \mathbf{q}^\top \mathbf{k} /\sqrt{d}$$ + +divides the dot product by $\sqrt{d}$. +In practice, +we often think in minibatches +for efficiency, +such as computing attention +for +$n$ queries and $m$ key-value pairs, +where queries and keys are of length $d$ +and values are of length $v$. +The scaled dot-product attention +of queries $\mathbf Q\in\mathbb R^{n\times d}$, +keys $\mathbf K\in\mathbb R^{m\times d}$, +and values $\mathbf V\in\mathbb R^{m\times v}$ +is + + +$$ \mathrm{softmax}\left(\frac{\mathbf Q \mathbf K^\top }{\sqrt{d}}\right) \mathbf V \in \mathbb{R}^{n\times v}.$$ +:eqlabel:`eq_softmax_QK_V` + +In the following implementation of the scaled dot product attention, we use dropout for model regularization. + +```{.python .input} +#@save +class DotProductAttention(nn.Block): + """Scaled dot product attention.""" + def __init__(self, dropout, **kwargs): + super(DotProductAttention, self).__init__(**kwargs) + self.dropout = nn.Dropout(dropout) + + # Shape of `queries`: (`batch_size`, no. of queries, `d`) + # Shape of `keys`: (`batch_size`, no. of key-value pairs, `d`) + # Shape of `values`: (`batch_size`, no. of key-value pairs, value + # dimension) + # Shape of `valid_lens`: (`batch_size`,) or (`batch_size`, no. 
of queries) + def forward(self, queries, keys, values, valid_lens=None): + d = queries.shape[-1] + # Set `transpose_b=True` to swap the last two dimensions of `keys` + scores = npx.batch_dot(queries, keys, transpose_b=True) / math.sqrt(d) + self.attention_weights = masked_softmax(scores, valid_lens) + return npx.batch_dot(self.dropout(self.attention_weights), values) +``` + +```{.python .input} +#@tab pytorch +#@save +class DotProductAttention(nn.Module): + """Scaled dot product attention.""" + def __init__(self, dropout, **kwargs): + super(DotProductAttention, self).__init__(**kwargs) + self.dropout = nn.Dropout(dropout) + + # Shape of `queries`: (`batch_size`, no. of queries, `d`) + # Shape of `keys`: (`batch_size`, no. of key-value pairs, `d`) + # Shape of `values`: (`batch_size`, no. of key-value pairs, value + # dimension) + # Shape of `valid_lens`: (`batch_size`,) or (`batch_size`, no. of queries) + def forward(self, queries, keys, values, valid_lens=None): + d = queries.shape[-1] + # Set `transpose_b=True` to swap the last two dimensions of `keys` + scores = torch.bmm(queries, keys.transpose(1,2)) / math.sqrt(d) + self.attention_weights = masked_softmax(scores, valid_lens) + return torch.bmm(self.dropout(self.attention_weights), values) +``` + +To demonstrate the above `DotProductAttention` class, +we use the same keys, values, and valid lengths from the earlier toy example +for additive attention. +For the dot product operation, +we make the feature size of queries +the same as that of keys. + +```{.python .input} +queries = d2l.normal(0, 1, (2, 1, 2)) +attention = DotProductAttention(dropout=0.5) +attention.initialize() +attention(queries, keys, values, valid_lens) +``` + +```{.python .input} +#@tab pytorch +queries = d2l.normal(0, 1, (2, 1, 2)) +attention = DotProductAttention(dropout=0.5) +attention.eval() +attention(queries, keys, values, valid_lens) +``` + +Same as in the additive attention demonstration, +since `keys` contains the same element +that cannot be differentiated by any query, +uniform attention weights are obtained. + +```{.python .input} +#@tab all +d2l.show_heatmaps(d2l.reshape(attention.attention_weights, (1, 1, 2, 10)), + xlabel='Keys', ylabel='Queries') +``` + +## Summary + +* We can compute the output of attention pooling as a weighted average of values, where different choices of the attention scoring function lead to different behaviors of attention pooling. +* When queries and keys are vectors of different lengths, we can use the additive attention scoring function. When they are the same, the scaled dot-product attention scoring function is more computationally efficient. + + + +## Exercises + +1. Modify keys in the toy example and visualize attention weights. Do additive attention and scaled dot-product attention still output the same attention weights? Why or why not? +1. Using matrix multiplications only, can you design a new scoring function for queries and keys with different vector lengths? +1. When queries and keys have the same vector length, is vector summation a better design than dot product for the scoring function? Why or why not? 
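For the second exercise, one candidate (a sketch under our own assumptions, not the book's reference solution) is the bilinear score $a(\mathbf q, \mathbf k) = \mathbf{q}^\top \mathbf W \mathbf{k}$ with a learnable $\mathbf W \in \mathbb{R}^{q\times k}$: it uses only matrix multiplications and accepts queries and keys of different vector lengths.

```python
import torch
from torch import nn

class BilinearScore(nn.Module):
    """Sketch of a(q, k) = q^T W k for queries of length `query_size` and keys of length `key_size`."""
    def __init__(self, query_size, key_size):
        super().__init__()
        # `W` maps keys into the query space; only matrix multiplications are involved
        self.W = nn.Parameter(torch.randn(query_size, key_size) * 0.01)

    def forward(self, queries, keys):
        # Shape of `queries`: (`batch_size`, no. of queries, `query_size`)
        # Shape of `keys`: (`batch_size`, no. of key-value pairs, `key_size`)
        # Shape of the returned scores: (`batch_size`, no. of queries, no. of key-value pairs)
        return torch.bmm(queries @ self.W, keys.transpose(1, 2))

scorer = BilinearScore(query_size=20, key_size=2)
scores = scorer(torch.randn(2, 1, 20), torch.ones(2, 10, 2))
print(scores.shape)  # torch.Size([2, 1, 10])
```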
+ +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/346) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1064) +:end_tab: diff --git a/chapter_attention-mechanisms/bahdanau-attention.md b/chapter_attention-mechanisms/bahdanau-attention.md new file mode 100644 index 0000000000000000000000000000000000000000..50c7d0faf3c2ec760a5aef50b755f58dd94ae30a --- /dev/null +++ b/chapter_attention-mechanisms/bahdanau-attention.md @@ -0,0 +1,255 @@ +# Bahdanau 关注 +:label:`sec_seq2seq_attention` + +我们在 :numref:`sec_seq2seq` 中研究了机器翻译问题,在那里我们设计了一个基于两个 RNN 的编码器解码器架构,用于顺序到序列的学习。具体来说,RNN 编码器将可变长度序列转换为固定形状的上下文变量,然后 RNN 解码器根据生成的令牌和上下文变量按令牌生成输出(目标)序列令牌。但是,即使并非所有输入(源)令牌都对解码某个标记都有用,但在每个解码步骤中仍使用编码整个输入序列的 *same* 上下文变量。 + +在为给定文本序列生成手写的一个单独但相关的挑战中,格雷夫斯设计了一种可区分的注意力模型,将文本字符与更长的笔迹对齐,其中对齐方式仅向一个方向移动 :cite:`Graves.2013`。受学习对齐想法的启发,Bahdanau 等人提出了一个没有严格的单向对齐限制 :cite:`Bahdanau.Cho.Bengio.2014` 的可区分注意力模型。在预测令牌时,如果不是所有输入令牌都相关,模型将仅对齐(或参与)输入序列中与当前预测相关的部分。这是通过将上下文变量视为注意力集中的输出来实现的。 + +## 模型 + +在下面描述 Bahdanau 对 RNN 编码器的关注时,我们将遵循 :numref:`sec_seq2seq` 中的相同符号。新的基于注意的模型与 :numref:`sec_seq2seq` 中的模型相同,只不过 :eqref:`eq_seq2seq_s_t` 中的上下文变量 $\mathbf{c}$ 在任何解码时间步骤 $t'$ 都会被 $\mathbf{c}_{t'}$ 替换。假设输入序列中有 $T$ 个令牌,解码时间步长 $t'$ 的上下文变量是注意力集中的输出: + +$$\mathbf{c}_{t'} = \sum_{t=1}^T \alpha(\mathbf{s}_{t' - 1}, \mathbf{h}_t) \mathbf{h}_t,$$ + +其中,时间步骤 $t' - 1$ 时的解码器隐藏状态 $\mathbf{s}_{t' - 1}$ 是查询,编码器隐藏状态 $\mathbf{h}_t$ 既是键,也是值,注意权重 $\alpha$ 是使用 :eqref:`eq_attn-scoring-alpha` 所定义的加法注意力评分函数计算的。 + +与 :numref:`fig_seq2seq_details` 中的香草 RNN 编码器解码器架构略有不同,:numref:`fig_s2s_attention_details` 描述了巴赫达瑙关注的同一架构。 + +![Layers in an RNN encoder-decoder model with Bahdanau attention.](../img/seq2seq-attention-details.svg) +:label:`fig_s2s_attention_details` + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import np, npx +from mxnet.gluon import rnn, nn +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +``` + +## 注意定义解码器 + +要在 Bahdanau 关注的情况下实现 RNN 编码器-解码器,我们只需重新定义解码器即可。为了更方便地显示学习的注意力权重,以下 `AttentionDecoder` 类定义了具有注意机制的解码器的基本接口。 + +```{.python .input} +#@tab all +#@save +class AttentionDecoder(d2l.Decoder): + """The base attention-based decoder interface.""" + def __init__(self, **kwargs): + super(AttentionDecoder, self).__init__(**kwargs) + + @property + def attention_weights(self): + raise NotImplementedError +``` + +现在让我们在接下来的 `Seq2SeqAttentionDecoder` 课程中以 Bahdanau 关注的情况下实施 RNN 解码器。解码器的状态初始化为 i) 编码器在所有时间步长的最终层隐藏状态(作为关注的键和值);ii) 最后一个时间步长的编码器全层隐藏状态(初始化解码器的隐藏状态);和 iii) 编码器有效长度(排除在注意力池中填充令牌)。在每个解码时间步骤中,解码器上一个时间步的最终层隐藏状态将用作关注的查询。因此,注意力输出和输入嵌入都连接为 RNN 解码器的输入。 + +```{.python .input} +class Seq2SeqAttentionDecoder(AttentionDecoder): + def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, + dropout=0, **kwargs): + super(Seq2SeqAttentionDecoder, self).__init__(**kwargs) + self.attention = d2l.AdditiveAttention(num_hiddens, dropout) + self.embedding = nn.Embedding(vocab_size, embed_size) + self.rnn = rnn.GRU(num_hiddens, num_layers, dropout=dropout) + self.dense = nn.Dense(vocab_size, flatten=False) + + def init_state(self, enc_outputs, enc_valid_lens, *args): + # Shape of `outputs`: (`num_steps`, `batch_size`, `num_hiddens`). + # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`, + # `num_hiddens`) + outputs, hidden_state = enc_outputs + return (outputs.swapaxes(0, 1), hidden_state, enc_valid_lens) + + def forward(self, X, state): + # Shape of `enc_outputs`: (`batch_size`, `num_steps`, `num_hiddens`). 
+ # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`, + # `num_hiddens`) + enc_outputs, hidden_state, enc_valid_lens = state + # Shape of the output `X`: (`num_steps`, `batch_size`, `embed_size`) + X = self.embedding(X).swapaxes(0, 1) + outputs, self._attention_weights = [], [] + for x in X: + # Shape of `query`: (`batch_size`, 1, `num_hiddens`) + query = np.expand_dims(hidden_state[0][-1], axis=1) + # Shape of `context`: (`batch_size`, 1, `num_hiddens`) + context = self.attention( + query, enc_outputs, enc_outputs, enc_valid_lens) + # Concatenate on the feature dimension + x = np.concatenate((context, np.expand_dims(x, axis=1)), axis=-1) + # Reshape `x` as (1, `batch_size`, `embed_size` + `num_hiddens`) + out, hidden_state = self.rnn(x.swapaxes(0, 1), hidden_state) + outputs.append(out) + self._attention_weights.append(self.attention.attention_weights) + # After fully-connected layer transformation, shape of `outputs`: + # (`num_steps`, `batch_size`, `vocab_size`) + outputs = self.dense(np.concatenate(outputs, axis=0)) + return outputs.swapaxes(0, 1), [enc_outputs, hidden_state, + enc_valid_lens] + + @property + def attention_weights(self): + return self._attention_weights +``` + +```{.python .input} +#@tab pytorch +class Seq2SeqAttentionDecoder(AttentionDecoder): + def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, + dropout=0, **kwargs): + super(Seq2SeqAttentionDecoder, self).__init__(**kwargs) + self.attention = d2l.AdditiveAttention( + num_hiddens, num_hiddens, num_hiddens, dropout) + self.embedding = nn.Embedding(vocab_size, embed_size) + self.rnn = nn.GRU( + embed_size + num_hiddens, num_hiddens, num_layers, + dropout=dropout) + self.dense = nn.Linear(num_hiddens, vocab_size) + + def init_state(self, enc_outputs, enc_valid_lens, *args): + # Shape of `outputs`: (`num_steps`, `batch_size`, `num_hiddens`). + # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`, + # `num_hiddens`) + outputs, hidden_state = enc_outputs + return (outputs.permute(1, 0, 2), hidden_state, enc_valid_lens) + + def forward(self, X, state): + # Shape of `enc_outputs`: (`batch_size`, `num_steps`, `num_hiddens`). 
+ # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`, + # `num_hiddens`) + enc_outputs, hidden_state, enc_valid_lens = state + # Shape of the output `X`: (`num_steps`, `batch_size`, `embed_size`) + X = self.embedding(X).permute(1, 0, 2) + outputs, self._attention_weights = [], [] + for x in X: + # Shape of `query`: (`batch_size`, 1, `num_hiddens`) + query = torch.unsqueeze(hidden_state[-1], dim=1) + # Shape of `context`: (`batch_size`, 1, `num_hiddens`) + context = self.attention( + query, enc_outputs, enc_outputs, enc_valid_lens) + # Concatenate on the feature dimension + x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1) + # Reshape `x` as (1, `batch_size`, `embed_size` + `num_hiddens`) + out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state) + outputs.append(out) + self._attention_weights.append(self.attention.attention_weights) + # After fully-connected layer transformation, shape of `outputs`: + # (`num_steps`, `batch_size`, `vocab_size`) + outputs = self.dense(torch.cat(outputs, dim=0)) + return outputs.permute(1, 0, 2), [enc_outputs, hidden_state, + enc_valid_lens] + + @property + def attention_weights(self): + return self._attention_weights +``` + +在以下内容中,我们使用包含 7 个时间步长的 4 个序列输入的小批量测试已实施的解码器,使用 Bahdanau 的注意力。 + +```{.python .input} +encoder = d2l.Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, + num_layers=2) +encoder.initialize() +decoder = Seq2SeqAttentionDecoder(vocab_size=10, embed_size=8, num_hiddens=16, + num_layers=2) +decoder.initialize() +X = d2l.zeros((4, 7)) # (`batch_size`, `num_steps`) +state = decoder.init_state(encoder(X), None) +output, state = decoder(X, state) +output.shape, len(state), state[0].shape, len(state[1]), state[1][0].shape +``` + +```{.python .input} +#@tab pytorch +encoder = d2l.Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, + num_layers=2) +encoder.eval() +decoder = Seq2SeqAttentionDecoder(vocab_size=10, embed_size=8, num_hiddens=16, + num_layers=2) +decoder.eval() +X = d2l.zeros((4, 7), dtype=torch.long) # (`batch_size`, `num_steps`) +state = decoder.init_state(encoder(X), None) +output, state = decoder(X, state) +output.shape, len(state), state[0].shape, len(state[1]), state[1][0].shape +``` + +## 培训 + +与 :numref:`sec_seq2seq_training` 类似,我们在这里指定超级测量器,实例化一个编码器和解码器,并在 Bahdanau 关注的情况下对这个模型进行机器翻译培训。由于新增的关注机制,这项培训比没有注意力机制的 :numref:`sec_seq2seq_training` 慢得多。 + +```{.python .input} +#@tab all +embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1 +batch_size, num_steps = 64, 10 +lr, num_epochs, device = 0.005, 250, d2l.try_gpu() + +train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps) +encoder = d2l.Seq2SeqEncoder( + len(src_vocab), embed_size, num_hiddens, num_layers, dropout) +decoder = Seq2SeqAttentionDecoder( + len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout) +net = d2l.EncoderDecoder(encoder, decoder) +d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device) +``` + +模型训练完毕后,我们用它将几个英语句子翻译成法语并计算它们的 BLEU 分数。 + +```{.python .input} +#@tab all +engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .'] +fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .'] +for eng, fra in zip(engs, fras): + translation, dec_attention_weight_seq = d2l.predict_seq2seq( + net, eng, src_vocab, tgt_vocab, num_steps, device, True) + print(f'{eng} => {translation}, ', + f'bleu {d2l.bleu(translation, fra, k=2):.3f}') +``` + +```{.python .input} +#@tab all +attention_weights = d2l.reshape( + d2l.concat([step[0][0][0] for step in dec_attention_weight_seq], 
0), + (1, 1, -1, num_steps)) +``` + +通过将翻译最后一个英语句子时的注意力权重可视化,我们可以看到每个查询都会在键值对上分配不均匀的权重。它显示,在每个解码步骤中,输入序列的不同部分都会有选择地聚合在注意力池中。 + +```{.python .input} +# Plus one to include the end-of-sequence token +d2l.show_heatmaps( + attention_weights[:, :, :, :len(engs[-1].split()) + 1], + xlabel='Key posistions', ylabel='Query posistions') +``` + +```{.python .input} +#@tab pytorch +# Plus one to include the end-of-sequence token +d2l.show_heatmaps( + attention_weights[:, :, :, :len(engs[-1].split()) + 1].cpu(), + xlabel='Key posistions', ylabel='Query posistions') +``` + +## 摘要 + +* 在预测令牌时,如果不是所有输入令牌都是相关的,那么具有 Bahdanau 关注的 RNN 编码器会有选择地聚合输入序列的不同部分。这是通过将上下文变量视为加法注意力池的输出来实现的。 +* 在 RNN 编码器解码器中,Bahdanau 的注意力将上一个时间步的解码器隐藏状态视为查询,编码器在所有时间步长的隐藏状态同时视为键和值。 + +## 练习 + +1. 在实验中用 LSTM 替换 GRU。 +1. 修改实验以将加法注意力评分功能替换为缩放的点积。它如何影响培训效率? + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/347) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1065) +:end_tab: diff --git a/chapter_attention-mechanisms/bahdanau-attention_origin.md b/chapter_attention-mechanisms/bahdanau-attention_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..b14b7574e1a27c598ef1092cfc9c2ba2bf7b6e36 --- /dev/null +++ b/chapter_attention-mechanisms/bahdanau-attention_origin.md @@ -0,0 +1,362 @@ +# Bahdanau Attention +:label:`sec_seq2seq_attention` + +We studied the machine translation +problem in :numref:`sec_seq2seq`, +where we designed +an encoder-decoder architecture based on two RNNs +for sequence to sequence learning. +Specifically, +the RNN encoder +transforms +a variable-length sequence +into a fixed-shape context variable, +then +the RNN decoder +generates the output (target) sequence token by token +based on the generated tokens and the context variable. +However, +even though not all the input (source) tokens +are useful for decoding a certain token, +the *same* context variable +that encodes the entire input sequence +is still used at each decoding step. + + +In a separate but related +challenge of handwriting generation for a given text sequence, +Graves designed a differentiable attention model +to align text characters with the much longer pen trace, +where the alignment moves only in one direction :cite:`Graves.2013`. +Inspired by the idea of learning to align, +Bahdanau et al. proposed a differentiable attention model +without the severe unidirectional alignment limitation :cite:`Bahdanau.Cho.Bengio.2014`. +When predicting a token, +if not all the input tokens are relevant, +the model aligns (or attends) +only to parts of the input sequence that are relevant to the current prediction. +This is achieved +by treating the context variable as an output of attention pooling. + +## Model + +When describing +Bahdanau attention +for the RNN encoder-decoder below, +we will follow the same notation in +:numref:`sec_seq2seq`. +The new attention-based model +is the same as that +in :numref:`sec_seq2seq` +except that +the context variable +$\mathbf{c}$ +in +:eqref:`eq_seq2seq_s_t` +is replaced by +$\mathbf{c}_{t'}$ +at any decoding time step $t'$. 
+Suppose that +there are $T$ tokens in the input sequence, +the context variable at the decoding time step $t'$ +is the output of attention pooling: + +$$\mathbf{c}_{t'} = \sum_{t=1}^T \alpha(\mathbf{s}_{t' - 1}, \mathbf{h}_t) \mathbf{h}_t,$$ + +where the decoder hidden state +$\mathbf{s}_{t' - 1}$ at time step $t' - 1$ +is the query, +and the encoder hidden states $\mathbf{h}_t$ +are both the keys and values, +and the attention weight $\alpha$ +is computed as in +:eqref:`eq_attn-scoring-alpha` +using the additive attention scoring function +defined by +:eqref:`eq_additive-attn`. + + +Slightly different from +the vanilla RNN encoder-decoder architecture +in :numref:`fig_seq2seq_details`, +the same architecture +with Bahdanau attention is depicted in +:numref:`fig_s2s_attention_details`. + +![Layers in an RNN encoder-decoder model with Bahdanau attention.](../img/seq2seq-attention-details.svg) +:label:`fig_s2s_attention_details` + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import np, npx +from mxnet.gluon import rnn, nn +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +``` + +## Defining the Decoder with Attention + +To implement the RNN encoder-decoder +with Bahdanau attention, +we only need to redefine the decoder. +To visualize the learned attention weights more conveniently, +the following `AttentionDecoder` class +defines the base interface for +decoders with attention mechanisms. + +```{.python .input} +#@tab all +#@save +class AttentionDecoder(d2l.Decoder): + """The base attention-based decoder interface.""" + def __init__(self, **kwargs): + super(AttentionDecoder, self).__init__(**kwargs) + + @property + def attention_weights(self): + raise NotImplementedError +``` + +Now let us implement +the RNN decoder with Bahdanau attention +in the following `Seq2SeqAttentionDecoder` class. +The state of the decoder +is initialized with +i) the encoder final-layer hidden states at all the time steps (as keys and values of the attention); +ii) the encoder all-layer hidden state at the final time step (to initialize the hidden state of the decoder); +and iii) the encoder valid length (to exclude the padding tokens in attention pooling). +At each decoding time step, +the decoder final-layer hidden state at the previous time step is used as the query of the attention. +As a result, both the attention output +and the input embedding are concatenated +as the input of the RNN decoder. + +```{.python .input} +class Seq2SeqAttentionDecoder(AttentionDecoder): + def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, + dropout=0, **kwargs): + super(Seq2SeqAttentionDecoder, self).__init__(**kwargs) + self.attention = d2l.AdditiveAttention(num_hiddens, dropout) + self.embedding = nn.Embedding(vocab_size, embed_size) + self.rnn = rnn.GRU(num_hiddens, num_layers, dropout=dropout) + self.dense = nn.Dense(vocab_size, flatten=False) + + def init_state(self, enc_outputs, enc_valid_lens, *args): + # Shape of `outputs`: (`num_steps`, `batch_size`, `num_hiddens`). + # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`, + # `num_hiddens`) + outputs, hidden_state = enc_outputs + return (outputs.swapaxes(0, 1), hidden_state, enc_valid_lens) + + def forward(self, X, state): + # Shape of `enc_outputs`: (`batch_size`, `num_steps`, `num_hiddens`). 
+ # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`, + # `num_hiddens`) + enc_outputs, hidden_state, enc_valid_lens = state + # Shape of the output `X`: (`num_steps`, `batch_size`, `embed_size`) + X = self.embedding(X).swapaxes(0, 1) + outputs, self._attention_weights = [], [] + for x in X: + # Shape of `query`: (`batch_size`, 1, `num_hiddens`) + query = np.expand_dims(hidden_state[0][-1], axis=1) + # Shape of `context`: (`batch_size`, 1, `num_hiddens`) + context = self.attention( + query, enc_outputs, enc_outputs, enc_valid_lens) + # Concatenate on the feature dimension + x = np.concatenate((context, np.expand_dims(x, axis=1)), axis=-1) + # Reshape `x` as (1, `batch_size`, `embed_size` + `num_hiddens`) + out, hidden_state = self.rnn(x.swapaxes(0, 1), hidden_state) + outputs.append(out) + self._attention_weights.append(self.attention.attention_weights) + # After fully-connected layer transformation, shape of `outputs`: + # (`num_steps`, `batch_size`, `vocab_size`) + outputs = self.dense(np.concatenate(outputs, axis=0)) + return outputs.swapaxes(0, 1), [enc_outputs, hidden_state, + enc_valid_lens] + + @property + def attention_weights(self): + return self._attention_weights +``` + +```{.python .input} +#@tab pytorch +class Seq2SeqAttentionDecoder(AttentionDecoder): + def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, + dropout=0, **kwargs): + super(Seq2SeqAttentionDecoder, self).__init__(**kwargs) + self.attention = d2l.AdditiveAttention( + num_hiddens, num_hiddens, num_hiddens, dropout) + self.embedding = nn.Embedding(vocab_size, embed_size) + self.rnn = nn.GRU( + embed_size + num_hiddens, num_hiddens, num_layers, + dropout=dropout) + self.dense = nn.Linear(num_hiddens, vocab_size) + + def init_state(self, enc_outputs, enc_valid_lens, *args): + # Shape of `outputs`: (`num_steps`, `batch_size`, `num_hiddens`). + # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`, + # `num_hiddens`) + outputs, hidden_state = enc_outputs + return (outputs.permute(1, 0, 2), hidden_state, enc_valid_lens) + + def forward(self, X, state): + # Shape of `enc_outputs`: (`batch_size`, `num_steps`, `num_hiddens`). + # Shape of `hidden_state[0]`: (`num_layers`, `batch_size`, + # `num_hiddens`) + enc_outputs, hidden_state, enc_valid_lens = state + # Shape of the output `X`: (`num_steps`, `batch_size`, `embed_size`) + X = self.embedding(X).permute(1, 0, 2) + outputs, self._attention_weights = [], [] + for x in X: + # Shape of `query`: (`batch_size`, 1, `num_hiddens`) + query = torch.unsqueeze(hidden_state[-1], dim=1) + # Shape of `context`: (`batch_size`, 1, `num_hiddens`) + context = self.attention( + query, enc_outputs, enc_outputs, enc_valid_lens) + # Concatenate on the feature dimension + x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1) + # Reshape `x` as (1, `batch_size`, `embed_size` + `num_hiddens`) + out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state) + outputs.append(out) + self._attention_weights.append(self.attention.attention_weights) + # After fully-connected layer transformation, shape of `outputs`: + # (`num_steps`, `batch_size`, `vocab_size`) + outputs = self.dense(torch.cat(outputs, dim=0)) + return outputs.permute(1, 0, 2), [enc_outputs, hidden_state, + enc_valid_lens] + + @property + def attention_weights(self): + return self._attention_weights +``` + +In the following, we test the implemented +decoder with Bahdanau attention +using a minibatch of 4 sequence inputs +of 7 time steps. 
+ +```{.python .input} +encoder = d2l.Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, + num_layers=2) +encoder.initialize() +decoder = Seq2SeqAttentionDecoder(vocab_size=10, embed_size=8, num_hiddens=16, + num_layers=2) +decoder.initialize() +X = d2l.zeros((4, 7)) # (`batch_size`, `num_steps`) +state = decoder.init_state(encoder(X), None) +output, state = decoder(X, state) +output.shape, len(state), state[0].shape, len(state[1]), state[1][0].shape +``` + +```{.python .input} +#@tab pytorch +encoder = d2l.Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, + num_layers=2) +encoder.eval() +decoder = Seq2SeqAttentionDecoder(vocab_size=10, embed_size=8, num_hiddens=16, + num_layers=2) +decoder.eval() +X = d2l.zeros((4, 7), dtype=torch.long) # (`batch_size`, `num_steps`) +state = decoder.init_state(encoder(X), None) +output, state = decoder(X, state) +output.shape, len(state), state[0].shape, len(state[1]), state[1][0].shape +``` + +## Training + + +Similar to :numref:`sec_seq2seq_training`, +here we specify hyperparemeters, +instantiate +an encoder and a decoder with Bahdanau attention, +and train this model for machine translation. +Due to the newly added attention mechanism, +this training is much slower than +that in :numref:`sec_seq2seq_training` without attention mechanisms. + +```{.python .input} +#@tab all +embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1 +batch_size, num_steps = 64, 10 +lr, num_epochs, device = 0.005, 250, d2l.try_gpu() + +train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps) +encoder = d2l.Seq2SeqEncoder( + len(src_vocab), embed_size, num_hiddens, num_layers, dropout) +decoder = Seq2SeqAttentionDecoder( + len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout) +net = d2l.EncoderDecoder(encoder, decoder) +d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device) +``` + +After the model is trained, +we use it to translate a few English sentences +into French and compute their BLEU scores. + +```{.python .input} +#@tab all +engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .'] +fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .'] +for eng, fra in zip(engs, fras): + translation, dec_attention_weight_seq = d2l.predict_seq2seq( + net, eng, src_vocab, tgt_vocab, num_steps, device, True) + print(f'{eng} => {translation}, ', + f'bleu {d2l.bleu(translation, fra, k=2):.3f}') +``` + +```{.python .input} +#@tab all +attention_weights = d2l.reshape( + d2l.concat([step[0][0][0] for step in dec_attention_weight_seq], 0), + (1, 1, -1, num_steps)) +``` + +By visualizing the attention weights +when translating the last English sentence, +we can see that each query assigns non-uniform weights +over key-value pairs. +It shows that at each decoding step, +different parts of the input sequences +are selectively aggregated in the attention pooling. + +```{.python .input} +# Plus one to include the end-of-sequence token +d2l.show_heatmaps( + attention_weights[:, :, :, :len(engs[-1].split()) + 1], + xlabel='Key posistions', ylabel='Query posistions') +``` + +```{.python .input} +#@tab pytorch +# Plus one to include the end-of-sequence token +d2l.show_heatmaps( + attention_weights[:, :, :, :len(engs[-1].split()) + 1].cpu(), + xlabel='Key posistions', ylabel='Query posistions') +``` + +## Summary + +* When predicting a token, if not all the input tokens are relevant, the RNN encoder-decoder with Bahdanau attention selectively aggregates different parts of the input sequence. 
This is achieved by treating the context variable as an output of additive attention pooling. +* In the RNN encoder-decoder, Bahdanau attention treats the decoder hidden state at the previous time step as the query, and the encoder hidden states at all the time steps as both the keys and values. + + +## Exercises + +1. Replace GRU with LSTM in the experiment. +1. Modify the experiment to replace the additive attention scoring function with the scaled dot-product. How does it influence the training efficiency? + + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/347) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1065) +:end_tab: diff --git a/chapter_attention-mechanisms/index.md b/chapter_attention-mechanisms/index.md new file mode 100644 index 0000000000000000000000000000000000000000..5704e04e8ae843aa02395253e6b005c14bddb31e --- /dev/null +++ b/chapter_attention-mechanisms/index.md @@ -0,0 +1,24 @@ +# 注意机制 +:label:`chap_attention` + +灵长类动物视觉系统的视神经接受大量的感官输入,远远超过了大脑能够完全处理的程度。幸运的是,并非所有的刺激都是平等的。意识的聚集和集中使灵长类动物能够在复杂的视觉环境中将注意力引向感兴趣的物体,例如猎物和掠食动物。只关注一小部分信息的能力具有进化意义,使人类能够生存和成功。 + +自 19 世纪以来,科学家们一直在研究认知神经科学领域的注意力。在本章中,我们将首先回顾一个热门框架,解释如何在视觉场景中部署注意力。受此框架中的注意线索的启发,我们将设计利用这些关注线索的模型。值得注意的是,1964 年的 Nadaraya-Waston 内核回归是具有 * 注意力机制 * 的机器学习的简单演示。 + +接下来,我们将继续介绍在深度学习中注意力模型设计中广泛使用的注意力函数。具体来说,我们将展示如何使用这些函数来设计 *Bahdanau 注意力 *,这是深度学习中的突破性注意力模型,可以双向对齐并且可以区分。 + +最后,配备了最近的 +*多头关注 * +和 * 自我关注 * 设计,我们将仅基于注意机制来描述 *Transer* 架构。自 2017 年提出建议以来,变形金刚一直在现代深度学习应用中普遍存在,例如语言、视觉、语音和强化学习领域。 + +```toc +:maxdepth: 2 + +attention-cues +nadaraya-waston +attention-scoring-functions +bahdanau-attention +multihead-attention +self-attention-and-positional-encoding +transformer +``` diff --git a/chapter_attention-mechanisms/index_origin.md b/chapter_attention-mechanisms/index_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..3b4b5885f549d764dfd463dba6d55d40554ce4ef --- /dev/null +++ b/chapter_attention-mechanisms/index_origin.md @@ -0,0 +1,68 @@ +# Attention Mechanisms +:label:`chap_attention` + +The optic nerve of a primate's visual system +receives massive sensory input, +far exceeding what the brain can fully process. +Fortunately, +not all stimuli are created equal. +Focalization and concentration of consciousness +have enabled primates to direct attention +to objects of interest, +such as preys and predators, +in the complex visual environment. +The ability of paying attention to +only a small fraction of the information +has evolutionary significance, +allowing human beings +to live and succeed. + +Scientists have been studying attention +in the cognitive neuroscience field +since the 19th century. +In this chapter, +we will begin by reviewing a popular framework +explaining how attention is deployed in a visual scene. +Inspired by the attention cues in this framework, +we will design models +that leverage such attention cues. +Notably, the Nadaraya-Waston kernel regression +in 1964 is a simple demonstration of machine learning with *attention mechanisms*. + +Next, we will go on to introduce attention functions +that have been extensively used in +the design of attention models in deep learning. +Specifically, +we will show how to use these functions +to design the *Bahdanau attention*, +a groundbreaking attention model in deep learning +that can align bidirectionally and is differentiable. 
+ +In the end, +equipped with +the more recent +*multi-head attention* +and *self-attention* designs, +we will describe the *Transformer* architecture +based solely on attention mechanisms. +Since their proposal in 2017, +Transformers +have been pervasive in modern +deep learning applications, +such as in areas of +language, +vision, speech, +and reinforcement learning. + +```toc +:maxdepth: 2 + +attention-cues +nadaraya-waston +attention-scoring-functions +bahdanau-attention +multihead-attention +self-attention-and-positional-encoding +transformer +``` + diff --git a/chapter_attention-mechanisms/multihead-attention.md b/chapter_attention-mechanisms/multihead-attention.md new file mode 100644 index 0000000000000000000000000000000000000000..4734add2c8aee0e7bc5a791185a4771a21c657d8 --- /dev/null +++ b/chapter_attention-mechanisms/multihead-attention.md @@ -0,0 +1,226 @@ +# 多头关注 +:label:`sec_multihead-attention` + +实际上,鉴于查询、键和值集相同,我们可能希望我们的模型将来自同一注意机制不同行为的知识结合起来,例如捕获序列内各种范围的依赖关系(例如,短范围与长距离)。因此,允许我们的注意机制共同使用查询、键和值的不同表示子空间可能是有益的。 + +为此,可以使用 $h$ 独立学习的线性投影来转换查询、键和值,而不是执行单一的注意力集中。然后,这些 $h$ 个预计查询、键和值将并行输入注意力集中。最后,$h$ 注意力集中输出被连接在一起,并与另一个学习的线性投影进行转换,以产生最终输出。这种设计被称为 * 多头注意 *,其中 $h$ 注意力池输出中的每个都是 * 头 * :cite:`Vaswani.Shazeer.Parmar.ea.2017`。:numref:`fig_multi-head-attention` 使用完全连接的图层来执行可学习的线性变换,描述了多头注意力。 + +![Multi-head attention, where multiple heads are concatenated then linearly transformed.](../img/multi-head-attention.svg) +:label:`fig_multi-head-attention` + +## 模型 + +在提供多头关注的实施之前,让我们以数学方式将这个模型正式化。给定查询 $\mathbf{q} \in \mathbb{R}^{d_q}$、一个键 $\mathbf{k} \in \mathbb{R}^{d_k}$ 和一个值 $\mathbf{v} \in \mathbb{R}^{d_v}$,每个注意头 $\mathbf{h}_i$ ($i = 1, \ldots, h$) 的计算方法为 + +$$\mathbf{h}_i = f(\mathbf W_i^{(q)}\mathbf q, \mathbf W_i^{(k)}\mathbf k,\mathbf W_i^{(v)}\mathbf v) \in \mathbb R^{p_v},$$ + +其中,可学习的参数 $\mathbf W_i^{(q)}\in\mathbb R^{p_q\times d_q}$、$\mathbf W_i^{(k)}\in\mathbb R^{p_k\times d_k}$ 和 $\mathbf W_i^{(v)}\in\mathbb R^{p_v\times d_v}$ 以及 $f$ 是注意力集中,例如 :numref:`sec_attention-scoring-functions` 中的添加剂注意力和扩大点产品注意力。多头注意力输出是另一种线性转换,通过 $h$ 头连接的可学习参数 $\mathbf W_o\in\mathbb R^{p_o\times h p_v}$: + +$$\mathbf W_o \begin{bmatrix}\mathbf h_1\\\vdots\\\mathbf h_h\end{bmatrix} \in \mathbb{R}^{p_o}.$$ + +基于这种设计,每个头都可能会关注输入的不同部分。可以表示比简单加权平均值更复杂的函数。 + +```{.python .input} +from d2l import mxnet as d2l +import math +from mxnet import autograd, np, npx +from mxnet.gluon import nn +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import math +import torch +from torch import nn +``` + +## 实施 + +在我们的实施过程中,我们为多头关注的每个人选择缩放的点产品注意力。为避免计算成本和参数化成本的显著增长,我们设置了 $p_q = p_k = p_v = p_o / h$。请注意,如果我们将查询、键和值的线性变换的输出数量设置为 $p_q h = p_k h = p_v h = p_o$,则可以并行计算 $h$ 头。在下面的实现中,$p_o$ 是通过参数 `num_hiddens` 指定的。 + +```{.python .input} +#@save +class MultiHeadAttention(nn.Block): + def __init__(self, num_hiddens, num_heads, dropout, use_bias=False, + **kwargs): + super(MultiHeadAttention, self).__init__(**kwargs) + self.num_heads = num_heads + self.attention = d2l.DotProductAttention(dropout) + self.W_q = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False) + self.W_k = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False) + self.W_v = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False) + self.W_o = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False) + + def forward(self, queries, keys, values, valid_lens): + # Shape of `queries`, `keys`, or `values`: + # (`batch_size`, no. 
of queries or key-value pairs, `num_hiddens`) + # Shape of `valid_lens`: + # (`batch_size`,) or (`batch_size`, no. of queries) + # After transposing, shape of output `queries`, `keys`, or `values`: + # (`batch_size` * `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + queries = transpose_qkv(self.W_q(queries), self.num_heads) + keys = transpose_qkv(self.W_k(keys), self.num_heads) + values = transpose_qkv(self.W_v(values), self.num_heads) + + if valid_lens is not None: + # On axis 0, copy the first item (scalar or vector) for + # `num_heads` times, then copy the next item, and so on + valid_lens = valid_lens.repeat(self.num_heads, axis=0) + + # Shape of `output`: (`batch_size` * `num_heads`, no. of queries, + # `num_hiddens` / `num_heads`) + output = self.attention(queries, keys, values, valid_lens) + + # Shape of `output_concat`: + # (`batch_size`, no. of queries, `num_hiddens`) + output_concat = transpose_output(output, self.num_heads) + return self.W_o(output_concat) +``` + +```{.python .input} +#@tab pytorch +#@save +class MultiHeadAttention(nn.Module): + def __init__(self, key_size, query_size, value_size, num_hiddens, + num_heads, dropout, bias=False, **kwargs): + super(MultiHeadAttention, self).__init__(**kwargs) + self.num_heads = num_heads + self.attention = d2l.DotProductAttention(dropout) + self.W_q = nn.Linear(query_size, num_hiddens, bias=bias) + self.W_k = nn.Linear(key_size, num_hiddens, bias=bias) + self.W_v = nn.Linear(value_size, num_hiddens, bias=bias) + self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=bias) + + def forward(self, queries, keys, values, valid_lens): + # Shape of `queries`, `keys`, or `values`: + # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`) + # Shape of `valid_lens`: + # (`batch_size`,) or (`batch_size`, no. of queries) + # After transposing, shape of output `queries`, `keys`, or `values`: + # (`batch_size` * `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + queries = transpose_qkv(self.W_q(queries), self.num_heads) + keys = transpose_qkv(self.W_k(keys), self.num_heads) + values = transpose_qkv(self.W_v(values), self.num_heads) + + if valid_lens is not None: + # On axis 0, copy the first item (scalar or vector) for + # `num_heads` times, then copy the next item, and so on + valid_lens = torch.repeat_interleave( + valid_lens, repeats=self.num_heads, dim=0) + + # Shape of `output`: (`batch_size` * `num_heads`, no. of queries, + # `num_hiddens` / `num_heads`) + output = self.attention(queries, keys, values, valid_lens) + + # Shape of `output_concat`: + # (`batch_size`, no. of queries, `num_hiddens`) + output_concat = transpose_output(output, self.num_heads) + return self.W_o(output_concat) +``` + +为了允许多个头的并行计算,上面的 `MultiHeadAttention` 类使用了下面定义的两个移调函数。具体来说,`transpose_output` 函数逆转了 `transpose_qkv` 函数的操作。 + +```{.python .input} +#@save +def transpose_qkv(X, num_heads): + # Shape of input `X`: + # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`). + # Shape of output `X`: + # (`batch_size`, no. of queries or key-value pairs, `num_heads`, + # `num_hiddens` / `num_heads`) + X = X.reshape(X.shape[0], X.shape[1], num_heads, -1) + + # Shape of output `X`: + # (`batch_size`, `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + X = X.transpose(0, 2, 1, 3) + + # Shape of `output`: + # (`batch_size` * `num_heads`, no. 
of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + return X.reshape(-1, X.shape[2], X.shape[3]) + + +#@save +def transpose_output(X, num_heads): + """Reverse the operation of `transpose_qkv`""" + X = X.reshape(-1, num_heads, X.shape[1], X.shape[2]) + X = X.transpose(0, 2, 1, 3) + return X.reshape(X.shape[0], X.shape[1], -1) +``` + +```{.python .input} +#@tab pytorch +#@save +def transpose_qkv(X, num_heads): + # Shape of input `X`: + # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`). + # Shape of output `X`: + # (`batch_size`, no. of queries or key-value pairs, `num_heads`, + # `num_hiddens` / `num_heads`) + X = X.reshape(X.shape[0], X.shape[1], num_heads, -1) + + # Shape of output `X`: + # (`batch_size`, `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + X = X.permute(0, 2, 1, 3) + + # Shape of `output`: + # (`batch_size` * `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + return X.reshape(-1, X.shape[2], X.shape[3]) + + +#@save +def transpose_output(X, num_heads): + """Reverse the operation of `transpose_qkv`""" + X = X.reshape(-1, num_heads, X.shape[1], X.shape[2]) + X = X.permute(0, 2, 1, 3) + return X.reshape(X.shape[0], X.shape[1], -1) +``` + +让我们使用键和值相同的玩具示例来测试我们实施的 `MultiHeadAttention` 类。因此,多头注意输出的形状是(`batch_size`、`num_queries`、`num_hiddens`)。 + +```{.python .input} +num_hiddens, num_heads = 100, 5 +attention = MultiHeadAttention(num_hiddens, num_heads, 0.5) +attention.initialize() +``` + +```{.python .input} +#@tab pytorch +num_hiddens, num_heads = 100, 5 +attention = MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens, + num_hiddens, num_heads, 0.5) +attention.eval() +``` + +```{.python .input} +#@tab all +batch_size, num_queries, num_kvpairs, valid_lens = 2, 4, 6, d2l.tensor([3, 2]) +X = d2l.ones((batch_size, num_queries, num_hiddens)) +Y = d2l.ones((batch_size, num_kvpairs, num_hiddens)) +attention(X, Y, Y, valid_lens).shape +``` + +## 摘要 + +* 多头关注通过查询、键和值的不同表示子空间将同一注意力集中的知识结合起来。 +* 要并行计算多头多头注意力,需要适当的张量操作。 + +## 练习 + +1. 在这个实验中,可视化多个头部的注意力重量。 +1. 假设我们有一个基于多头注意力的训练有素的模型,我们希望修剪最不重要的注意力头以提高预测速度。我们如何设计实验来衡量注意头的重要性? + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/1634) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1635) +:end_tab: diff --git a/chapter_attention-mechanisms/multihead-attention_origin.md b/chapter_attention-mechanisms/multihead-attention_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..32486fa6dc2d1f2e3371a8fbf64443c2a5233886 --- /dev/null +++ b/chapter_attention-mechanisms/multihead-attention_origin.md @@ -0,0 +1,309 @@ +# Multi-Head Attention +:label:`sec_multihead-attention` + + +In practice, +given the same set of queries, keys, and values +we may want our model to +combine knowledge from +different behaviors of the same attention mechanism, +such as capturing dependencies of various ranges (e.g., shorter-range vs. longer-range) +within a sequence. +Thus, +it may be beneficial +to allow our attention mechanism +to jointly use different representation subspaces +of queries, keys, and values. + + + +To this end, +instead of performing a single attention pooling, +queries, keys, and values +can be transformed +with $h$ independently learned linear projections. +Then these $h$ projected queries, keys, and values +are fed into attention pooling in parallel. 
+In the end, +$h$ attention pooling outputs +are concatenated and +transformed with another learned linear projection +to produce the final output. +This design +is called *multi-head attention*, +where each of the $h$ attention pooling outputs +is a *head* :cite:`Vaswani.Shazeer.Parmar.ea.2017`. +Using fully-connected layers +to perform learnable linear transformations, +:numref:`fig_multi-head-attention` +describes multi-head attention. + +![Multi-head attention, where multiple heads are concatenated then linearly transformed.](../img/multi-head-attention.svg) +:label:`fig_multi-head-attention` + + + + +## Model + +Before providing the implementation of multi-head attention, +let us formalize this model mathematically. +Given a query $\mathbf{q} \in \mathbb{R}^{d_q}$, +a key $\mathbf{k} \in \mathbb{R}^{d_k}$, +and a value $\mathbf{v} \in \mathbb{R}^{d_v}$, +each attention head $\mathbf{h}_i$ ($i = 1, \ldots, h$) +is computed as + +$$\mathbf{h}_i = f(\mathbf W_i^{(q)}\mathbf q, \mathbf W_i^{(k)}\mathbf k,\mathbf W_i^{(v)}\mathbf v) \in \mathbb R^{p_v},$$ + +where learnable parameters +$\mathbf W_i^{(q)}\in\mathbb R^{p_q\times d_q}$, +$\mathbf W_i^{(k)}\in\mathbb R^{p_k\times d_k}$ +and $\mathbf W_i^{(v)}\in\mathbb R^{p_v\times d_v}$, +and +$f$ is attention pooling, +such as +additive attention and scaled dot-product attention +in :numref:`sec_attention-scoring-functions`. +The multi-head attention output +is another linear transformation via +learnable parameters +$\mathbf W_o\in\mathbb R^{p_o\times h p_v}$ +of the concatenation of $h$ heads: + +$$\mathbf W_o \begin{bmatrix}\mathbf h_1\\\vdots\\\mathbf h_h\end{bmatrix} \in \mathbb{R}^{p_o}.$$ + +Based on this design, +each head may attend to different parts of the input. +More sophisticated functions than the simple weighted average +can be expressed. + +```{.python .input} +from d2l import mxnet as d2l +import math +from mxnet import autograd, np, npx +from mxnet.gluon import nn +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import math +import torch +from torch import nn +``` + +## Implementation + +In our implementation, +we choose the scaled dot-product attention +for each head of the multi-head attention. +To avoid significant growth +of computational cost and parameterization cost, +we set +$p_q = p_k = p_v = p_o / h$. +Note that $h$ heads +can be computed in parallel +if we set +the number of outputs of linear transformations +for the query, key, and value +to $p_q h = p_k h = p_v h = p_o$. +In the following implementation, +$p_o$ is specified via the argument `num_hiddens`. + +```{.python .input} +#@save +class MultiHeadAttention(nn.Block): + def __init__(self, num_hiddens, num_heads, dropout, use_bias=False, + **kwargs): + super(MultiHeadAttention, self).__init__(**kwargs) + self.num_heads = num_heads + self.attention = d2l.DotProductAttention(dropout) + self.W_q = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False) + self.W_k = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False) + self.W_v = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False) + self.W_o = nn.Dense(num_hiddens, use_bias=use_bias, flatten=False) + + def forward(self, queries, keys, values, valid_lens): + # Shape of `queries`, `keys`, or `values`: + # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`) + # Shape of `valid_lens`: + # (`batch_size`,) or (`batch_size`, no. of queries) + # After transposing, shape of output `queries`, `keys`, or `values`: + # (`batch_size` * `num_heads`, no. 
of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + queries = transpose_qkv(self.W_q(queries), self.num_heads) + keys = transpose_qkv(self.W_k(keys), self.num_heads) + values = transpose_qkv(self.W_v(values), self.num_heads) + + if valid_lens is not None: + # On axis 0, copy the first item (scalar or vector) for + # `num_heads` times, then copy the next item, and so on + valid_lens = valid_lens.repeat(self.num_heads, axis=0) + + # Shape of `output`: (`batch_size` * `num_heads`, no. of queries, + # `num_hiddens` / `num_heads`) + output = self.attention(queries, keys, values, valid_lens) + + # Shape of `output_concat`: + # (`batch_size`, no. of queries, `num_hiddens`) + output_concat = transpose_output(output, self.num_heads) + return self.W_o(output_concat) +``` + +```{.python .input} +#@tab pytorch +#@save +class MultiHeadAttention(nn.Module): + def __init__(self, key_size, query_size, value_size, num_hiddens, + num_heads, dropout, bias=False, **kwargs): + super(MultiHeadAttention, self).__init__(**kwargs) + self.num_heads = num_heads + self.attention = d2l.DotProductAttention(dropout) + self.W_q = nn.Linear(query_size, num_hiddens, bias=bias) + self.W_k = nn.Linear(key_size, num_hiddens, bias=bias) + self.W_v = nn.Linear(value_size, num_hiddens, bias=bias) + self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=bias) + + def forward(self, queries, keys, values, valid_lens): + # Shape of `queries`, `keys`, or `values`: + # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`) + # Shape of `valid_lens`: + # (`batch_size`,) or (`batch_size`, no. of queries) + # After transposing, shape of output `queries`, `keys`, or `values`: + # (`batch_size` * `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + queries = transpose_qkv(self.W_q(queries), self.num_heads) + keys = transpose_qkv(self.W_k(keys), self.num_heads) + values = transpose_qkv(self.W_v(values), self.num_heads) + + if valid_lens is not None: + # On axis 0, copy the first item (scalar or vector) for + # `num_heads` times, then copy the next item, and so on + valid_lens = torch.repeat_interleave( + valid_lens, repeats=self.num_heads, dim=0) + + # Shape of `output`: (`batch_size` * `num_heads`, no. of queries, + # `num_hiddens` / `num_heads`) + output = self.attention(queries, keys, values, valid_lens) + + # Shape of `output_concat`: + # (`batch_size`, no. of queries, `num_hiddens`) + output_concat = transpose_output(output, self.num_heads) + return self.W_o(output_concat) +``` + +To allow for parallel computation of multiple heads, +the above `MultiHeadAttention` class uses two transposition functions as defined below. +Specifically, +the `transpose_output` function reverses the operation +of the `transpose_qkv` function. + +```{.python .input} +#@save +def transpose_qkv(X, num_heads): + # Shape of input `X`: + # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`). + # Shape of output `X`: + # (`batch_size`, no. of queries or key-value pairs, `num_heads`, + # `num_hiddens` / `num_heads`) + X = X.reshape(X.shape[0], X.shape[1], num_heads, -1) + + # Shape of output `X`: + # (`batch_size`, `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + X = X.transpose(0, 2, 1, 3) + + # Shape of `output`: + # (`batch_size` * `num_heads`, no. 
of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + return X.reshape(-1, X.shape[2], X.shape[3]) + + +#@save +def transpose_output(X, num_heads): + """Reverse the operation of `transpose_qkv`""" + X = X.reshape(-1, num_heads, X.shape[1], X.shape[2]) + X = X.transpose(0, 2, 1, 3) + return X.reshape(X.shape[0], X.shape[1], -1) +``` + +```{.python .input} +#@tab pytorch +#@save +def transpose_qkv(X, num_heads): + # Shape of input `X`: + # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`). + # Shape of output `X`: + # (`batch_size`, no. of queries or key-value pairs, `num_heads`, + # `num_hiddens` / `num_heads`) + X = X.reshape(X.shape[0], X.shape[1], num_heads, -1) + + # Shape of output `X`: + # (`batch_size`, `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + X = X.permute(0, 2, 1, 3) + + # Shape of `output`: + # (`batch_size` * `num_heads`, no. of queries or key-value pairs, + # `num_hiddens` / `num_heads`) + return X.reshape(-1, X.shape[2], X.shape[3]) + + +#@save +def transpose_output(X, num_heads): + """Reverse the operation of `transpose_qkv`""" + X = X.reshape(-1, num_heads, X.shape[1], X.shape[2]) + X = X.permute(0, 2, 1, 3) + return X.reshape(X.shape[0], X.shape[1], -1) +``` + +Let us test our implemented `MultiHeadAttention` class +using a toy example where keys and values are the same. +As a result, +the shape of the multi-head attention output +is (`batch_size`, `num_queries`, `num_hiddens`). + +```{.python .input} +num_hiddens, num_heads = 100, 5 +attention = MultiHeadAttention(num_hiddens, num_heads, 0.5) +attention.initialize() +``` + +```{.python .input} +#@tab pytorch +num_hiddens, num_heads = 100, 5 +attention = MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens, + num_hiddens, num_heads, 0.5) +attention.eval() +``` + +```{.python .input} +#@tab all +batch_size, num_queries, num_kvpairs, valid_lens = 2, 4, 6, d2l.tensor([3, 2]) +X = d2l.ones((batch_size, num_queries, num_hiddens)) +Y = d2l.ones((batch_size, num_kvpairs, num_hiddens)) +attention(X, Y, Y, valid_lens).shape +``` + +## Summary + +* Multi-head attention combines knowledge of the same attention pooling via different representation subspaces of queries, keys, and values. +* To compute multiple heads of multi-head attention in parallel, proper tensor manipulation is needed. + + + +## Exercises + +1. Visualize attention weights of multiple heads in this experiment. +1. Suppose that we have a trained model based on multi-head attention and we want to prune least important attention heads to increase the prediction speed. How can we design experiments to measure the importance of an attention head? 
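+
+As a possible starting point for the first exercise, the following sketch
+rebuilds the toy example above and plots the attention weights of every head
+separately. It assumes, as in this book's implementation, that the inner
+`d2l.DotProductAttention` object stores its most recent `attention_weights`,
+and that the leading dimension of that tensor enumerates the minibatch first
+and the heads second (which is how `transpose_qkv` arranges it).
+
+```{.python .input}
+#@tab pytorch
+num_hiddens, num_heads = 100, 5
+attention = MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens,
+                               num_hiddens, num_heads, 0.5)
+attention.eval()
+batch_size, num_queries, num_kvpairs = 2, 4, 6
+valid_lens = d2l.tensor([3, 2])
+X = d2l.ones((batch_size, num_queries, num_hiddens))
+Y = d2l.ones((batch_size, num_kvpairs, num_hiddens))
+attention(X, Y, Y, valid_lens)
+# Shape of the stored weights (assumed, see above):
+# (`batch_size` * `num_heads`, no. of queries, no. of key-value pairs).
+# Separate the batch and head dimensions so heads can be shown side by side
+weights = d2l.reshape(attention.attention.attention_weights,
+                      (batch_size, num_heads, num_queries, num_kvpairs))
+# One row of heatmaps per example in the minibatch, one column per head
+d2l.show_heatmaps(weights.detach(), xlabel='Key positions',
+                  ylabel='Query positions',
+                  titles=[f'Head {i + 1}' for i in range(num_heads)],
+                  figsize=(7, 3.5))
+```
+
+With untrained parameters and all-ones inputs, each head simply spreads its
+weight uniformly over the valid key positions; the useful part of the sketch
+is the reshape that separates the batch and head dimensions, which applies
+unchanged to a trained model.
+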
+ + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/1634) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1635) +:end_tab: diff --git a/chapter_attention-mechanisms/nadaraya-waston.md b/chapter_attention-mechanisms/nadaraya-waston.md new file mode 100644 index 0000000000000000000000000000000000000000..4ec3a5b1847ab64997e1fa5baae8d7b4912028e0 --- /dev/null +++ b/chapter_attention-mechanisms/nadaraya-waston.md @@ -0,0 +1,373 @@ +# 注意力集中:Nadaraya-Watson 内核回归 +:label:`sec_nadaraya-waston` + +现在你知道了 :numref:`fig_qkv` 框架下关注机制的主要组成部分。概括一下,查询(名义提示)和键(非自豪提示)之间的交互导致了 * 注意力集中 *。注意力集中有选择性地聚合了值(感官输入)以产生产出。在本节中,我们将更详细地介绍注意力集中,以便让您从高层次了解注意力机制在实践中的运作方式。具体来说,1964 年提出的 Nadaraya-Watson 内核回归模型是一个简单而完整的示例,用于演示具有注意机制的机器学习。 + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import autograd, gluon, np, npx +from mxnet.gluon import nn + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +``` + +## 生成数据集 + +为了简单起见,让我们考虑以下回归问题:给定输入-产出对 $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ 的数据集,如何学习 $f$ 来预测任何新输入 $\hat{y} = f(x)$ 的输出 $\hat{y} = f(x)$? + +在这里,我们根据以下非线性函数生成一个人工数据集,噪声术语 $\epsilon$: + +$$y_i = 2\sin(x_i) + x_i^{0.8} + \epsilon,$$ + +其中 $\epsilon$ 服从平均值和标准差 0.5 的正态分布。同时生成了 50 个培训示例和 50 个测试示例。为了以后更好地直观地显示注意力模式,训练输入将进行排序。 + +```{.python .input} +n_train = 50 # No. of training examples +x_train = np.sort(d2l.rand(n_train) * 5) # Training inputs +``` + +```{.python .input} +#@tab pytorch +n_train = 50 # No. of training examples +x_train, _ = torch.sort(d2l.rand(n_train) * 5) # Training inputs +``` + +```{.python .input} +#@tab all +def f(x): + return 2 * d2l.sin(x) + x**0.8 + +y_train = f(x_train) + d2l.normal(0.0, 0.5, (n_train,)) # Training outputs +x_test = d2l.arange(0, 5, 0.1) # Testing examples +y_truth = f(x_test) # Ground-truth outputs for the testing examples +n_test = len(x_test) # No. of testing examples +n_test +``` + +以下函数绘制所有训练示例(由圆表示)、不带噪声项的地面真实数据生成函数 `f`(标记为 “Truth”)和学习的预测函数(标记为 “Pred”)。 + +```{.python .input} +#@tab all +def plot_kernel_reg(y_hat): + d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'], + xlim=[0, 5], ylim=[-1, 5]) + d2l.plt.plot(x_train, y_train, 'o', alpha=0.5); +``` + +## 平均池 + +我们首先可能是世界上对这个回归问题的 “最愚蠢” 的估算器:使用平均汇集来计算所有训练输出的平均值: + +$$f(x) = \frac{1}{n}\sum_{i=1}^n y_i,$$ +:eqlabel:`eq_avg-pooling` + +这如下图所示。正如我们所看到的,这个估算器确实不那么聪明。 + +```{.python .input} +y_hat = y_train.mean().repeat(n_test) +plot_kernel_reg(y_hat) +``` + +```{.python .input} +#@tab pytorch +y_hat = torch.repeat_interleave(y_train.mean(), n_test) +plot_kernel_reg(y_hat) +``` + +## 非参数化注意力池 + +显然,平均池忽略了输入 $x_i$。Nadaraya :cite:`Nadaraya.1964` 和 Waston :cite:`Watson.1964` 提出了一个更好的想法,根据输入位置对输出 $y_i$ 进行权衡: + +$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i,$$ +:eqlabel:`eq_nadaraya-waston` + +其中 $K$ 是 * 内核 *。:eqref:`eq_nadaraya-waston` 中的估计器被称为 *Nadaraya-Watson 内核回归 *。在这里我们不会深入研究内核的细节。回想一下 :numref:`fig_qkv` 中的关注机制框架。从注意力的角度来看,我们可以用更广泛的 * 注意力集合 * 的形式重写 :eqref:`eq_nadaraya-waston`: + +$$f(x) = \sum_{i=1}^n \alpha(x, x_i) y_i,$$ +:eqlabel:`eq_attn-pooling` + +其中 $x$ 是查询,$(x_i, y_i)$ 是键值对。比较 :eqref:`eq_attn-pooling` 和 :eqref:`eq_avg-pooling`,这里的注意力集中是 $y_i$ 的加权平均值。根据查询 $x$ 和 $\alpha$ 建模的密钥 $x_i$ 之间的交互作用,将 :eqref:`eq_attn-pooling` 中的 * 注意力权重 * $\alpha(x, x_i)$ 分配给相应的值 $y_i$。对于任何查询,它在所有键值对上的注意力权重都是有效的概率分布:它们是非负数的,总和为一。 + +要获得注意力集中的直觉,只需考虑一个 * 高斯内核 * 定义为 + +$$ +K(u) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{u^2}{2}). 
+$$ + +将高斯内核插入 :eqref:`eq_attn-pooling` 和 :eqref:`eq_nadaraya-waston` 就会给出 + +$$\begin{aligned} f(x) &=\sum_{i=1}^n \alpha(x, x_i) y_i\\ &= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}(x - x_j)^2\right)} y_i \\&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i. \end{aligned}$$ +:eqlabel:`eq_nadaraya-waston-gaussian` + +在 :eqref:`eq_nadaraya-waston-gaussian` 中,接近给定查询 $x$ 的密钥 $x_i$ 将得到 +*通过分配给密钥的相应值 $y_i$ 的 * 更大的注意力重量 * 来进一步注意 *。 + +值得注意的是,Nadaraya-Watson 内核回归是一个非参数模型;因此,:eqref:`eq_nadaraya-waston-gaussian` 就是 * 非参数化注意力池 * 的示例。在下面,我们基于此非参数化关注模型绘制预测。预测的线是平稳的,并且比普通集中产生的线更接近地面真相。 + +```{.python .input} +# Shape of `X_repeat`: (`n_test`, `n_train`), where each row contains the +# same testing inputs (i.e., same queries) +X_repeat = d2l.reshape(x_test.repeat(n_train), (-1, n_train)) +# Note that `x_train` contains the keys. Shape of `attention_weights`: +# (`n_test`, `n_train`), where each row contains attention weights to be +# assigned among the values (`y_train`) given each query +attention_weights = npx.softmax(-(X_repeat - x_train)**2 / 2) +# Each element of `y_hat` is weighted average of values, where weights are +# attention weights +y_hat = d2l.matmul(attention_weights, y_train) +plot_kernel_reg(y_hat) +``` + +```{.python .input} +#@tab pytorch +# Shape of `X_repeat`: (`n_test`, `n_train`), where each row contains the +# same testing inputs (i.e., same queries) +X_repeat = d2l.reshape(x_test.repeat_interleave(n_train), (-1, n_train)) +# Note that `x_train` contains the keys. Shape of `attention_weights`: +# (`n_test`, `n_train`), where each row contains attention weights to be +# assigned among the values (`y_train`) given each query +attention_weights = nn.functional.softmax(-(X_repeat - x_train)**2 / 2, dim=1) +# Each element of `y_hat` is weighted average of values, where weights are +# attention weights +y_hat = d2l.matmul(attention_weights, y_train) +plot_kernel_reg(y_hat) +``` + +现在让我们来看看注意力的权重。这里测试输入是查询,而训练输入是关键。由于两个输入都是排序的,我们可以看到查询键对越接近,注意力集中的注意力就越高。 + +```{.python .input} +d2l.show_heatmaps(np.expand_dims(np.expand_dims(attention_weights, 0), 0), + xlabel='Sorted training inputs', + ylabel='Sorted testing inputs') +``` + +```{.python .input} +#@tab pytorch +d2l.show_heatmaps(attention_weights.unsqueeze(0).unsqueeze(0), + xlabel='Sorted training inputs', + ylabel='Sorted testing inputs') +``` + +## 参数化注意力池 + +非参数 Nadaraya-Watson 内核回归具有 * 一致性 * 的好处:如果有足够的数据,此模型会收敛到最佳解决方案。尽管如此,我们可以轻松地将可学习的参数集成到注意力池中。 + +例如,与 :eqref:`eq_nadaraya-waston-gaussian` 略有不同,在下面的查询 $x$ 和键 $x_i$ 之间的距离乘以可学习参数 $w$: + +$$\begin{aligned}f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_i)w)^2\right)} y_i \\&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i.\end{aligned}$$ +:eqlabel:`eq_nadaraya-waston-gaussian-para` + +在本节的其余部分中,我们将通过学习 :eqref:`eq_nadaraya-waston-gaussian-para` 中注意力集中的参数来训练此模型。 + +### 批量矩阵乘法 +:label:`subsec_batch_dot` + +为了更有效地计算小批次的注意力,我们可以利用深度学习框架提供的批量矩阵乘法实用程序。 + +假设第一个微型批次包含 $n$ 矩阵 $n$,形状为 $a\times b$,第二个微型批次包含 $n$ 矩阵 $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$,形状为 73229363620。它们的批量矩阵乘法得出 $n$ 矩阵 $\mathbf{X}_1\mathbf{Y}_1, \ldots, \mathbf{X}_n\mathbf{Y}_n$,形状为 $a\times c$。因此,假定两个张量的形状($n$、$a$、$b$)和($b$,$c$)的形状,它们的批量矩阵乘法输出的形状为($n$、$a$、$c$)。 + +```{.python .input} +X = d2l.ones((2, 1, 4)) +Y = d2l.ones((2, 4, 6)) +npx.batch_dot(X, Y).shape +``` + +```{.python .input} +#@tab pytorch +X = d2l.ones((2, 1, 4)) 
+Y = d2l.ones((2, 4, 6)) +torch.bmm(X, Y).shape +``` + +在注意力机制的背景下,我们可以使用微型批次矩阵乘法来计算微型批次中值的加权平均值。 + +```{.python .input} +weights = d2l.ones((2, 10)) * 0.1 +values = d2l.reshape(d2l.arange(20), (2, 10)) +npx.batch_dot(np.expand_dims(weights, 1), np.expand_dims(values, -1)) +``` + +```{.python .input} +#@tab pytorch +weights = d2l.ones((2, 10)) * 0.1 +values = d2l.reshape(d2l.arange(20.0), (2, 10)) +torch.bmm(weights.unsqueeze(1), values.unsqueeze(-1)) +``` + +### 定义模型 + +使用微型批量矩阵乘法,下面我们根据 :eqref:`eq_nadaraya-waston-gaussian-para` 中的参数关注池定义 Nadaraya-Watson 内核回归的参数化版本。 + +```{.python .input} +class NWKernelRegression(nn.Block): + def __init__(self, **kwargs): + super().__init__(**kwargs) + self.w = self.params.get('w', shape=(1,)) + + def forward(self, queries, keys, values): + # Shape of the output `queries` and `attention_weights`: + # (no. of queries, no. of key-value pairs) + queries = d2l.reshape( + queries.repeat(keys.shape[1]), (-1, keys.shape[1])) + self.attention_weights = npx.softmax( + -((queries - keys) * self.w.data())**2 / 2) + # Shape of `values`: (no. of queries, no. of key-value pairs) + return npx.batch_dot(np.expand_dims(self.attention_weights, 1), + np.expand_dims(values, -1)).reshape(-1) +``` + +```{.python .input} +#@tab pytorch +class NWKernelRegression(nn.Module): + def __init__(self, **kwargs): + super().__init__(**kwargs) + self.w = nn.Parameter(torch.rand((1,), requires_grad=True)) + + def forward(self, queries, keys, values): + # Shape of the output `queries` and `attention_weights`: + # (no. of queries, no. of key-value pairs) + queries = d2l.reshape( + queries.repeat_interleave(keys.shape[1]), (-1, keys.shape[1])) + self.attention_weights = nn.functional.softmax( + -((queries - keys) * self.w)**2 / 2, dim=1) + # Shape of `values`: (no. of queries, no. 
of key-value pairs) + return torch.bmm(self.attention_weights.unsqueeze(1), + values.unsqueeze(-1)).reshape(-1) +``` + +### 培训 + +在以下内容中,我们将训练数据集转换为键和值,以训练注意力模型。在参数化注意力池中,任何训练输入都会从所有训练示例中获取键值对,但用于预测其输出。 + +```{.python .input} +# Shape of `X_tile`: (`n_train`, `n_train`), where each column contains the +# same training inputs +X_tile = np.tile(x_train, (n_train, 1)) +# Shape of `Y_tile`: (`n_train`, `n_train`), where each column contains the +# same training outputs +Y_tile = np.tile(y_train, (n_train, 1)) +# Shape of `keys`: ('n_train', 'n_train' - 1) +keys = d2l.reshape(X_tile[(1 - d2l.eye(n_train)).astype('bool')], + (n_train, -1)) +# Shape of `values`: ('n_train', 'n_train' - 1) +values = d2l.reshape(Y_tile[(1 - d2l.eye(n_train)).astype('bool')], + (n_train, -1)) +``` + +```{.python .input} +#@tab pytorch +# Shape of `X_tile`: (`n_train`, `n_train`), where each column contains the +# same training inputs +X_tile = x_train.repeat((n_train, 1)) +# Shape of `Y_tile`: (`n_train`, `n_train`), where each column contains the +# same training outputs +Y_tile = y_train.repeat((n_train, 1)) +# Shape of `keys`: ('n_train', 'n_train' - 1) +keys = d2l.reshape(X_tile[(1 - d2l.eye(n_train)).type(torch.bool)], + (n_train, -1)) +# Shape of `values`: ('n_train', 'n_train' - 1) +values = d2l.reshape(Y_tile[(1 - d2l.eye(n_train)).type(torch.bool)], + (n_train, -1)) +``` + +我们使用平方损失和随机梯度下降,训练参数化注意力模型。 + +```{.python .input} +net = NWKernelRegression() +net.initialize() +loss = gluon.loss.L2Loss() +trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5}) +animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, 5]) + +for epoch in range(5): + with autograd.record(): + l = loss(net(x_train, keys, values), y_train) + l.backward() + trainer.step(1) + print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}') + animator.add(epoch + 1, float(l.sum())) +``` + +```{.python .input} +#@tab pytorch +net = NWKernelRegression() +loss = nn.MSELoss(reduction='none') +trainer = torch.optim.SGD(net.parameters(), lr=0.5) +animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, 5]) + +for epoch in range(5): + trainer.zero_grad() + # Note: L2 Loss = 1/2 * MSE Loss. PyTorch has MSE Loss which is slightly + # different from MXNet's L2Loss by a factor of 2. 
Hence we halve the loss + l = loss(net(x_train, keys, values), y_train) / 2 + l.sum().backward() + trainer.step() + print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}') + animator.add(epoch + 1, float(l.sum())) +``` + +训练参数化注意力模型后,我们可以绘制其预测。试图使用噪点拟合训练数据集,预测线不如之前绘制的非参数对应线平滑。 + +```{.python .input} +# Shape of `keys`: (`n_test`, `n_train`), where each column contains the same +# training inputs (i.e., same keys) +keys = np.tile(x_train, (n_test, 1)) +# Shape of `value`: (`n_test`, `n_train`) +values = np.tile(y_train, (n_test, 1)) +y_hat = net(x_test, keys, values) +plot_kernel_reg(y_hat) +``` + +```{.python .input} +#@tab pytorch +# Shape of `keys`: (`n_test`, `n_train`), where each column contains the same +# training inputs (i.e., same keys) +keys = x_train.repeat((n_test, 1)) +# Shape of `value`: (`n_test`, `n_train`) +values = y_train.repeat((n_test, 1)) +y_hat = net(x_test, keys, values).unsqueeze(1).detach() +plot_kernel_reg(y_hat) +``` + +与非参数化注意力池相比,注意力权重较大的区域在可学习和参数化设置中变得更加锐利。 + +```{.python .input} +d2l.show_heatmaps(np.expand_dims(np.expand_dims(net.attention_weights, 0), 0), + xlabel='Sorted training inputs', + ylabel='Sorted testing inputs') +``` + +```{.python .input} +#@tab pytorch +d2l.show_heatmaps(net.attention_weights.unsqueeze(0).unsqueeze(0), + xlabel='Sorted training inputs', + ylabel='Sorted testing inputs') +``` + +## 摘要 + +* Nadaraya-Watson 内核回归是具有注意机制的机器学习示例。 +* Nadaraya-Watson 内核回归的注意力集中是训练输出的加权平均值。从注意力的角度来看,根据查询的函数和与值配对的键,将注意力权重分配给值。 +* 注意力池可以是非参数化的,也可以是参数化的。 + +## 练习 + +1. 增加培训示例的数量。你能更好地学习非参数 Nadaraya-Watson 内核回归吗? +1. 我们在参数化注意力池实验中学到的 $w$ 的价值是什么?为什么在可视化注意力权重时,它会使加权区域更加锐利? +1. 我们如何将超参数添加到非参数 Nadaraya-Watson 内核回归中以更好地预测? +1. 为本节的内核回归设计另一个参数化注意力池。训练这个新模型并可视化其注意力重量。 + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/1598) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1599) +:end_tab: diff --git a/chapter_attention-mechanisms/nadaraya-waston_origin.md b/chapter_attention-mechanisms/nadaraya-waston_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..bba974db28f5512c30b14872d8d31c165f9d8cce --- /dev/null +++ b/chapter_attention-mechanisms/nadaraya-waston_origin.md @@ -0,0 +1,466 @@ +# Attention Pooling: Nadaraya-Watson Kernel Regression +:label:`sec_nadaraya-waston` + +Now you know the major components of attention mechanisms under the framework in :numref:`fig_qkv`. +To recapitulate, +the interactions between +queries (volitional cues) and keys (nonvolitional cues) +result in *attention pooling*. +The attention pooling selectively aggregates values (sensory inputs) to produce the output. +In this section, +we will describe attention pooling in greater detail +to give you a high-level view of +how attention mechanisms work in practice. +Specifically, +the Nadaraya-Watson kernel regression model +proposed in 1964 +is a simple yet complete example +for demonstrating machine learning with attention mechanisms. + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import autograd, gluon, np, npx +from mxnet.gluon import nn + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +``` + +## Generating the Dataset + +To keep things simple, +let us consider the following regression problem: +given a dataset of input-output pairs $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, +how to learn $f$ to predict the output $\hat{y} = f(x)$ for any new input $x$? 
+ +Here we generate an artificial dataset according to the following nonlinear function with the noise term $\epsilon$: + +$$y_i = 2\sin(x_i) + x_i^{0.8} + \epsilon,$$ + +where $\epsilon$ obeys a normal distribution with zero mean and standard deviation 0.5. +Both 50 training examples and 50 testing examples +are generated. +To better visualize the pattern of attention later, the training inputs are sorted. + +```{.python .input} +n_train = 50 # No. of training examples +x_train = np.sort(d2l.rand(n_train) * 5) # Training inputs +``` + +```{.python .input} +#@tab pytorch +n_train = 50 # No. of training examples +x_train, _ = torch.sort(d2l.rand(n_train) * 5) # Training inputs +``` + +```{.python .input} +#@tab all +def f(x): + return 2 * d2l.sin(x) + x**0.8 + +y_train = f(x_train) + d2l.normal(0.0, 0.5, (n_train,)) # Training outputs +x_test = d2l.arange(0, 5, 0.1) # Testing examples +y_truth = f(x_test) # Ground-truth outputs for the testing examples +n_test = len(x_test) # No. of testing examples +n_test +``` + +The following function plots all the training examples (represented by circles), +the ground-truth data generation function `f` without the noise term (labeled by "Truth"), and the learned prediction function (labeled by "Pred"). + +```{.python .input} +#@tab all +def plot_kernel_reg(y_hat): + d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'], + xlim=[0, 5], ylim=[-1, 5]) + d2l.plt.plot(x_train, y_train, 'o', alpha=0.5); +``` + +## Average Pooling + +We begin with perhaps the world's "dumbest" estimator for this regression problem: +using average pooling to average over all the training outputs: + +$$f(x) = \frac{1}{n}\sum_{i=1}^n y_i,$$ +:eqlabel:`eq_avg-pooling` + +which is plotted below. As we can see, this estimator is indeed not so smart. + +```{.python .input} +y_hat = y_train.mean().repeat(n_test) +plot_kernel_reg(y_hat) +``` + +```{.python .input} +#@tab pytorch +y_hat = torch.repeat_interleave(y_train.mean(), n_test) +plot_kernel_reg(y_hat) +``` + +## Nonparametric Attention Pooling + +Obviously, +average pooling omits the inputs $x_i$. +A better idea was proposed +by Nadaraya :cite:`Nadaraya.1964` +and Waston :cite:`Watson.1964` +to weigh the outputs $y_i$ according to their input locations: + +$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i,$$ +:eqlabel:`eq_nadaraya-waston` + +where $K$ is a *kernel*. +The estimator in :eqref:`eq_nadaraya-waston` +is called *Nadaraya-Watson kernel regression*. +Here we will not dive into details of kernels. +Recall the framework of attention mechanisms in :numref:`fig_qkv`. +From the perspective of attention, +we can rewrite :eqref:`eq_nadaraya-waston` +in a more generalized form of *attention pooling*: + +$$f(x) = \sum_{i=1}^n \alpha(x, x_i) y_i,$$ +:eqlabel:`eq_attn-pooling` + + +where $x$ is the query and $(x_i, y_i)$ is the key-value pair. +Comparing :eqref:`eq_attn-pooling` and :eqref:`eq_avg-pooling`, +the attention pooling here +is a weighted average of values $y_i$. +The *attention weight* $\alpha(x, x_i)$ +in :eqref:`eq_attn-pooling` +is assigned to the corresponding value $y_i$ +based on the interaction +between the query $x$ and the key $x_i$ +modeled by $\alpha$. +For any query, its attention weights over all the key-value pairs are a valid probability distribution: they are non-negative and sum up to one. + +To gain intuitions of attention pooling, +just consider a *Gaussian kernel* defined as + +$$ +K(u) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{u^2}{2}). 
+$$ + + +Plugging the Gaussian kernel into +:eqref:`eq_attn-pooling` and +:eqref:`eq_nadaraya-waston` gives + +$$\begin{aligned} f(x) &=\sum_{i=1}^n \alpha(x, x_i) y_i\\ &= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}(x - x_j)^2\right)} y_i \\&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i. \end{aligned}$$ +:eqlabel:`eq_nadaraya-waston-gaussian` + +In :eqref:`eq_nadaraya-waston-gaussian`, +a key $x_i$ that is closer to the given query $x$ will get +*more attention* via a *larger attention weight* assigned to the key's corresponding value $y_i$. + +Notably, Nadaraya-Watson kernel regression is a nonparametric model; +thus :eqref:`eq_nadaraya-waston-gaussian` +is an example of *nonparametric attention pooling*. +In the following, we plot the prediction based on this +nonparametric attention model. +The predicted line is smooth and closer to the ground-truth than that produced by average pooling. + +```{.python .input} +# Shape of `X_repeat`: (`n_test`, `n_train`), where each row contains the +# same testing inputs (i.e., same queries) +X_repeat = d2l.reshape(x_test.repeat(n_train), (-1, n_train)) +# Note that `x_train` contains the keys. Shape of `attention_weights`: +# (`n_test`, `n_train`), where each row contains attention weights to be +# assigned among the values (`y_train`) given each query +attention_weights = npx.softmax(-(X_repeat - x_train)**2 / 2) +# Each element of `y_hat` is weighted average of values, where weights are +# attention weights +y_hat = d2l.matmul(attention_weights, y_train) +plot_kernel_reg(y_hat) +``` + +```{.python .input} +#@tab pytorch +# Shape of `X_repeat`: (`n_test`, `n_train`), where each row contains the +# same testing inputs (i.e., same queries) +X_repeat = d2l.reshape(x_test.repeat_interleave(n_train), (-1, n_train)) +# Note that `x_train` contains the keys. Shape of `attention_weights`: +# (`n_test`, `n_train`), where each row contains attention weights to be +# assigned among the values (`y_train`) given each query +attention_weights = nn.functional.softmax(-(X_repeat - x_train)**2 / 2, dim=1) +# Each element of `y_hat` is weighted average of values, where weights are +# attention weights +y_hat = d2l.matmul(attention_weights, y_train) +plot_kernel_reg(y_hat) +``` + +Now let us take a look at the attention weights. +Here testing inputs are queries while training inputs are keys. +Since both inputs are sorted, +we can see that the closer the query-key pair is, +the higher attention weight is in the attention pooling. + +```{.python .input} +d2l.show_heatmaps(np.expand_dims(np.expand_dims(attention_weights, 0), 0), + xlabel='Sorted training inputs', + ylabel='Sorted testing inputs') +``` + +```{.python .input} +#@tab pytorch +d2l.show_heatmaps(attention_weights.unsqueeze(0).unsqueeze(0), + xlabel='Sorted training inputs', + ylabel='Sorted testing inputs') +``` + +## Parametric Attention Pooling + +Nonparametric Nadaraya-Watson kernel regression +enjoys the *consistency* benefit: +given enough data this model converges to the optimal solution. +Nonetheless, +we can easily integrate learnable parameters into attention pooling. 
+ +As an example, slightly different from :eqref:`eq_nadaraya-waston-gaussian`, +in the following +the distance between the query $x$ and the key $x_i$ +is multiplied a learnable parameter $w$: + + +$$\begin{aligned}f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_i)w)^2\right)} y_i \\&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i.\end{aligned}$$ +:eqlabel:`eq_nadaraya-waston-gaussian-para` + +In the rest of the section, +we will train this model by learning the parameter of +the attention pooling in :eqref:`eq_nadaraya-waston-gaussian-para`. + + +### Batch Matrix Multiplication +:label:`subsec_batch_dot` + +To more efficiently compute attention +for minibatches, +we can leverage batch matrix multiplication utilities +provided by deep learning frameworks. + + +Suppose that the first minibatch contains $n$ matrices $\mathbf{X}_1, \ldots, \mathbf{X}_n$ of shape $a\times b$, and the second minibatch contains $n$ matrices $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ of shape $b\times c$. Their batch matrix multiplication +results in +$n$ matrices $\mathbf{X}_1\mathbf{Y}_1, \ldots, \mathbf{X}_n\mathbf{Y}_n$ of shape $a\times c$. Therefore, given two tensors of shape ($n$, $a$, $b$) and ($n$, $b$, $c$), the shape of their batch matrix multiplication output is ($n$, $a$, $c$). + +```{.python .input} +X = d2l.ones((2, 1, 4)) +Y = d2l.ones((2, 4, 6)) +npx.batch_dot(X, Y).shape +``` + +```{.python .input} +#@tab pytorch +X = d2l.ones((2, 1, 4)) +Y = d2l.ones((2, 4, 6)) +torch.bmm(X, Y).shape +``` + +In the context of attention mechanisms, we can use minibatch matrix multiplication to compute weighted averages of values in a minibatch. + +```{.python .input} +weights = d2l.ones((2, 10)) * 0.1 +values = d2l.reshape(d2l.arange(20), (2, 10)) +npx.batch_dot(np.expand_dims(weights, 1), np.expand_dims(values, -1)) +``` + +```{.python .input} +#@tab pytorch +weights = d2l.ones((2, 10)) * 0.1 +values = d2l.reshape(d2l.arange(20.0), (2, 10)) +torch.bmm(weights.unsqueeze(1), values.unsqueeze(-1)) +``` + +### Defining the Model + +Using minibatch matrix multiplication, +below we define the parametric version +of Nadaraya-Watson kernel regression +based on the parametric attention pooling in +:eqref:`eq_nadaraya-waston-gaussian-para`. + +```{.python .input} +class NWKernelRegression(nn.Block): + def __init__(self, **kwargs): + super().__init__(**kwargs) + self.w = self.params.get('w', shape=(1,)) + + def forward(self, queries, keys, values): + # Shape of the output `queries` and `attention_weights`: + # (no. of queries, no. of key-value pairs) + queries = d2l.reshape( + queries.repeat(keys.shape[1]), (-1, keys.shape[1])) + self.attention_weights = npx.softmax( + -((queries - keys) * self.w.data())**2 / 2) + # Shape of `values`: (no. of queries, no. of key-value pairs) + return npx.batch_dot(np.expand_dims(self.attention_weights, 1), + np.expand_dims(values, -1)).reshape(-1) +``` + +```{.python .input} +#@tab pytorch +class NWKernelRegression(nn.Module): + def __init__(self, **kwargs): + super().__init__(**kwargs) + self.w = nn.Parameter(torch.rand((1,), requires_grad=True)) + + def forward(self, queries, keys, values): + # Shape of the output `queries` and `attention_weights`: + # (no. of queries, no. 
of key-value pairs) + queries = d2l.reshape( + queries.repeat_interleave(keys.shape[1]), (-1, keys.shape[1])) + self.attention_weights = nn.functional.softmax( + -((queries - keys) * self.w)**2 / 2, dim=1) + # Shape of `values`: (no. of queries, no. of key-value pairs) + return torch.bmm(self.attention_weights.unsqueeze(1), + values.unsqueeze(-1)).reshape(-1) +``` + +### Training + +In the following, we transform the training dataset +to keys and values to train the attention model. +In the parametric attention pooling, +any training input takes key-value pairs from all the training examples except for itself to predict its output. + +```{.python .input} +# Shape of `X_tile`: (`n_train`, `n_train`), where each column contains the +# same training inputs +X_tile = np.tile(x_train, (n_train, 1)) +# Shape of `Y_tile`: (`n_train`, `n_train`), where each column contains the +# same training outputs +Y_tile = np.tile(y_train, (n_train, 1)) +# Shape of `keys`: ('n_train', 'n_train' - 1) +keys = d2l.reshape(X_tile[(1 - d2l.eye(n_train)).astype('bool')], + (n_train, -1)) +# Shape of `values`: ('n_train', 'n_train' - 1) +values = d2l.reshape(Y_tile[(1 - d2l.eye(n_train)).astype('bool')], + (n_train, -1)) +``` + +```{.python .input} +#@tab pytorch +# Shape of `X_tile`: (`n_train`, `n_train`), where each column contains the +# same training inputs +X_tile = x_train.repeat((n_train, 1)) +# Shape of `Y_tile`: (`n_train`, `n_train`), where each column contains the +# same training outputs +Y_tile = y_train.repeat((n_train, 1)) +# Shape of `keys`: ('n_train', 'n_train' - 1) +keys = d2l.reshape(X_tile[(1 - d2l.eye(n_train)).type(torch.bool)], + (n_train, -1)) +# Shape of `values`: ('n_train', 'n_train' - 1) +values = d2l.reshape(Y_tile[(1 - d2l.eye(n_train)).type(torch.bool)], + (n_train, -1)) +``` + +Using the squared loss and stochastic gradient descent, +we train the parametric attention model. + +```{.python .input} +net = NWKernelRegression() +net.initialize() +loss = gluon.loss.L2Loss() +trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5}) +animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, 5]) + +for epoch in range(5): + with autograd.record(): + l = loss(net(x_train, keys, values), y_train) + l.backward() + trainer.step(1) + print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}') + animator.add(epoch + 1, float(l.sum())) +``` + +```{.python .input} +#@tab pytorch +net = NWKernelRegression() +loss = nn.MSELoss(reduction='none') +trainer = torch.optim.SGD(net.parameters(), lr=0.5) +animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, 5]) + +for epoch in range(5): + trainer.zero_grad() + # Note: L2 Loss = 1/2 * MSE Loss. PyTorch has MSE Loss which is slightly + # different from MXNet's L2Loss by a factor of 2. Hence we halve the loss + l = loss(net(x_train, keys, values), y_train) / 2 + l.sum().backward() + trainer.step() + print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}') + animator.add(epoch + 1, float(l.sum())) +``` + +After training the parametric attention model, +we can plot its prediction. +Trying to fit the training dataset with noise, +the predicted line is less smooth +than its nonparametric counterpart that was plotted earlier. 
+ +```{.python .input} +# Shape of `keys`: (`n_test`, `n_train`), where each column contains the same +# training inputs (i.e., same keys) +keys = np.tile(x_train, (n_test, 1)) +# Shape of `value`: (`n_test`, `n_train`) +values = np.tile(y_train, (n_test, 1)) +y_hat = net(x_test, keys, values) +plot_kernel_reg(y_hat) +``` + +```{.python .input} +#@tab pytorch +# Shape of `keys`: (`n_test`, `n_train`), where each column contains the same +# training inputs (i.e., same keys) +keys = x_train.repeat((n_test, 1)) +# Shape of `value`: (`n_test`, `n_train`) +values = y_train.repeat((n_test, 1)) +y_hat = net(x_test, keys, values).unsqueeze(1).detach() +plot_kernel_reg(y_hat) +``` + +Comparing with nonparametric attention pooling, +the region with large attention weights becomes sharper +in the learnable and parametric setting. + +```{.python .input} +d2l.show_heatmaps(np.expand_dims(np.expand_dims(net.attention_weights, 0), 0), + xlabel='Sorted training inputs', + ylabel='Sorted testing inputs') +``` + +```{.python .input} +#@tab pytorch +d2l.show_heatmaps(net.attention_weights.unsqueeze(0).unsqueeze(0), + xlabel='Sorted training inputs', + ylabel='Sorted testing inputs') +``` + +## Summary + +* Nadaraya-Watson kernel regression is an example of machine learning with attention mechanisms. +* The attention pooling of Nadaraya-Watson kernel regression is a weighted average of the training outputs. From the attention perspective, the attention weight is assigned to a value based on a function of a query and the key that is paired with the value. +* Attention pooling can be either nonparametric or parametric. + + +## Exercises + +1. Increase the number of training examples. Can you learn nonparametric Nadaraya-Watson kernel regression better? +1. What is the value of our learned $w$ in the parametric attention pooling experiment? Why does it make the weighted region sharper when visualizing the attention weights? +1. How can we add hyperparameters to nonparametric Nadaraya-Watson kernel regression to predict better? +1. Design another parametric attention pooling for the kernel regression of this section. Train this new model and visualize its attention weights. 
+ + + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/1598) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1599) +:end_tab: diff --git a/chapter_attention-mechanisms/self-attention-and-positional-encoding.md b/chapter_attention-mechanisms/self-attention-and-positional-encoding.md new file mode 100644 index 0000000000000000000000000000000000000000..0a62debc9a4801a565848b1eafd129034d47ae7a --- /dev/null +++ b/chapter_attention-mechanisms/self-attention-and-positional-encoding.md @@ -0,0 +1,201 @@ +# 自我注意力和位置编码 +:label:`sec_self-attention-and-positional-encoding` + +在深度学习中,我们经常使用 CNN 或 RNN 对序列进行编码。现在请注意机制。想象一下,我们将一系列令牌输入注意力池,以便同一组令牌充当查询、键和值。具体来说,每个查询都会关注所有键值对并生成一个注意力输出。由于查询、键和值来自同一个地方,因此执行 +*自我关注 * :cite:`Lin.Feng.Santos.ea.2017,Vaswani.Shazeer.Parmar.ea.2017`,也称为 * 内心注意 * :cite:`Cheng.Dong.Lapata.2016,Parikh.Tackstrom.Das.ea.2016,Paulus.Xiong.Socher.2017`。 +在本节中,我们将讨论使用自我注意的序列编码,包括使用序列顺序的其他信息。 + +```{.python .input} +from d2l import mxnet as d2l +import math +from mxnet import autograd, np, npx +from mxnet.gluon import nn +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import math +import torch +from torch import nn +``` + +## 自我注意 + +给定一系列输入令牌 $\mathbf{x}_1, \ldots, \mathbf{x}_n$,其中任何 $\mathbf{x}_i \in \mathbb{R}^d$ ($1 \leq i \leq n$),它的自我注意力输出一个长度相同的序列 $\mathbf{y}_1, \ldots, \mathbf{y}_n$,其中 + +$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d$$ + +根据 :eqref:`eq_attn-pooling` 中关注集中 $f$ 的定义。使用多头注意力,以下代码片段计算具有形状的张量的自我注意力(批量大小、时间步长或令牌中的序列长度,$d$)。输出张量的形状相同。 + +```{.python .input} +num_hiddens, num_heads = 100, 5 +attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5) +attention.initialize() +``` + +```{.python .input} +#@tab pytorch +num_hiddens, num_heads = 100, 5 +attention = d2l.MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens, + num_hiddens, num_heads, 0.5) +attention.eval() +``` + +```{.python .input} +#@tab all +batch_size, num_queries, valid_lens = 2, 4, d2l.tensor([3, 2]) +X = d2l.ones((batch_size, num_queries, num_hiddens)) +attention(X, X, X, valid_lens).shape +``` + +## 比较 CNN、RNN 和自我注意 +:label:`subsec_cnn-rnn-self-attention` + +让我们比较将 $n$ 令牌序列映射到另一个相等长度序列的架构,其中每个输入或输出令牌由 $d$ 维矢量表示。具体来说,我们将考虑 CNN、RNN 和自我注意力。我们将比较它们的计算复杂性、顺序操作和最大路径长度。请注意,顺序操作会阻止并行计算,而任意序列位置组合之间的路径较短,可以更轻松地学习序列 :cite:`Hochreiter.Bengio.Frasconi.ea.2001` 中的远距离依赖关系。 + +![Comparing CNN (padding tokens are omitted), RNN, and self-attention architectures.](../img/cnn-rnn-self-attention.svg) +:label:`fig_cnn-rnn-self-attention` + +考虑一个内核大小为 $k$ 的卷积层。我们将在后面的章节中提供有关使用 CNN 处理序列的更多详细信息目前,我们只需要知道,由于序列长度是 $n$,所以输入和输出通道的数量都是 $d$,卷积层的计算复杂度为 $\mathcal{O}(knd^2)$。如 :numref:`fig_cnn-rnn-self-attention` 所示,CNN 是分层的,因此有 $\mathcal{O}(1)$ 个顺序操作,最大路径长度为 $\mathcal{O}(n/k)$。例如,$\mathbf{x}_1$ 和 $\mathbf{x}_5$ 处于 :numref:`fig_cnn-rnn-self-attention` 中内核大小为 3 的双层 CNN 的接受范围内。 + +更新 rnN 的隐藏状态时,$d \times d$ 权重矩阵和 $d$ 维隐藏状态的乘法计算复杂度为 $\mathcal{O}(d^2)$。由于序列长度为 $n$,因此循环层的计算复杂度为 $\mathcal{O}(nd^2)$。根据 :numref:`fig_cnn-rnn-self-attention`,有 $\mathcal{O}(n)$ 个顺序操作无法并行化,最大路径长度也是 $\mathcal{O}(n)$。 + +在自我注意中,查询、键和值都是 $n \times d$ 矩阵。考虑 :eqref:`eq_softmax_QK_V` 中的扩展点-产品关注点,其中 $n \times d$ 矩阵乘以 $d \times n$ 矩阵,然后输出 $n \times n$ 矩阵乘以 $n \times d$ 矩阵。因此,自我注意力具有 $\mathcal{O}(n^2d)$ 计算复杂性。正如我们在 :numref:`fig_cnn-rnn-self-attention` 中看到的那样,每个令牌都通过自我注意直接连接到任何其他令牌。因此,计算可以与 $\mathcal{O}(1)$ 顺序操作并行,最大路径长度也是 $\mathcal{O}(1)$。 + +总而言之,CNN 
和自我注意力都可以享受并行计算,而且自我注意力的最大路径长度最短。但是,相对于序列长度的二次计算复杂性使得自我注意力在很长的序列中非常缓慢。 + +## 位置编码 +:label:`subsec_positional-encoding` + +与逐个重复处理序列令牌的 RNN 不同,自我注意力会放弃顺序操作,而倾向于并行计算。要使用序列顺序信息,我们可以通过在输入表示中添加 * 位置编码 * 来注入绝对或相对位置信息。可以学习或修复位置编码。在下面,我们描述了基于正弦和余弦函数 :cite:`Vaswani.Shazeer.Parmar.ea.2017` 的固定位置编码。 + +假设输入表示 $\mathbf{X} \in \mathbb{R}^{n \times d}$ 包含一个序列中 $n$ 令牌的 $d$ 维嵌入。位置编码使用相同形状的位置嵌入矩阵 $\mathbf{P} \in \mathbb{R}^{n \times d}$ 输出 $\mathbf{X} + \mathbf{P}$,该矩阵在 $i^\mathrm{th}$ 行和 $(2j)^\mathrm{th}$ 或 $(2j + 1)^\mathrm{th}$ 列上的元素为 + +$$\begin{aligned} p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right),\\p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right).\end{aligned}$$ +:eqlabel:`eq_positional-encoding-def` + +乍一看,这种三角函数设计看起来很奇怪。在解释这个设计之前,让我们首先在下面的 `PositionalEncoding` 课中实现它。 + +```{.python .input} +#@save +class PositionalEncoding(nn.Block): + def __init__(self, num_hiddens, dropout, max_len=1000): + super(PositionalEncoding, self).__init__() + self.dropout = nn.Dropout(dropout) + # Create a long enough `P` + self.P = d2l.zeros((1, max_len, num_hiddens)) + X = d2l.arange(max_len).reshape(-1, 1) / np.power( + 10000, np.arange(0, num_hiddens, 2) / num_hiddens) + self.P[:, :, 0::2] = np.sin(X) + self.P[:, :, 1::2] = np.cos(X) + + def forward(self, X): + X = X + self.P[:, :X.shape[1], :].as_in_ctx(X.ctx) + return self.dropout(X) +``` + +```{.python .input} +#@tab pytorch +#@save +class PositionalEncoding(nn.Module): + def __init__(self, num_hiddens, dropout, max_len=1000): + super(PositionalEncoding, self).__init__() + self.dropout = nn.Dropout(dropout) + # Create a long enough `P` + self.P = d2l.zeros((1, max_len, num_hiddens)) + X = d2l.arange(max_len, dtype=torch.float32).reshape( + -1, 1) / torch.pow(10000, torch.arange( + 0, num_hiddens, 2, dtype=torch.float32) / num_hiddens) + self.P[:, :, 0::2] = torch.sin(X) + self.P[:, :, 1::2] = torch.cos(X) + + def forward(self, X): + X = X + self.P[:, :X.shape[1], :].to(X.device) + return self.dropout(X) +``` + +在位置嵌入矩阵 $\mathbf{P}$ 中,行对应于序列中的位置,列表示不同的位置编码维度。在下面的示例中,我们可以看到位置嵌入矩阵的 $6^{\mathrm{th}}$ 和 $7^{\mathrm{th}}$ 列的频率高于 $8^{\mathrm{th}}$ 和 $9^{\mathrm{th}}$ 列。$6^{\mathrm{th}}$ 和 $7^{\mathrm{th}}$ 列之间的偏移量($8^{\mathrm{th}}$ 和 $9^{\mathrm{th}}$ 列相同)是由于正弦函数和余弦函数的交替。 + +```{.python .input} +encoding_dim, num_steps = 32, 60 +pos_encoding = PositionalEncoding(encoding_dim, 0) +pos_encoding.initialize() +X = pos_encoding(np.zeros((1, num_steps, encoding_dim))) +P = pos_encoding.P[:, :X.shape[1], :] +d2l.plot(d2l.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)', + figsize=(6, 2.5), legend=["Col %d" % d for d in d2l.arange(6, 10)]) +``` + +```{.python .input} +#@tab pytorch +encoding_dim, num_steps = 32, 60 +pos_encoding = PositionalEncoding(encoding_dim, 0) +pos_encoding.eval() +X = pos_encoding(d2l.zeros((1, num_steps, encoding_dim))) +P = pos_encoding.P[:, :X.shape[1], :] +d2l.plot(d2l.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)', + figsize=(6, 2.5), legend=["Col %d" % d for d in d2l.arange(6, 10)]) +``` + +### 绝对位置信息 + +要了解沿编码维度单调降低的频率与绝对位置信息的关系,让我们打印出 $0, 1, \ldots, 7$ 的二进制表示形式。正如我们所看到的,每个数字、每两个数字和每四个数字上的最低位、第二位和第三位最低位分别交替。 + +```{.python .input} +#@tab all +for i in range(8): + print(f'{i} in binary is {i:>03b}') +``` + +在二进制表示中,较高的位比特的频率低于较低的位。同样,如下面的热图所示,位置编码通过使用三角函数在编码维度下降频率。由于输出是浮点数,因此此类连续表示比二进制表示法更节省空间。 + +```{.python .input} +P = np.expand_dims(np.expand_dims(P[0, :, :], 0), 0) +d2l.show_heatmaps(P, xlabel='Column (encoding dimension)', + ylabel='Row (position)', figsize=(3.5, 4), cmap='Blues') +``` + +```{.python 
.input} +#@tab pytorch +P = P[0, :, :].unsqueeze(0).unsqueeze(0) +d2l.show_heatmaps(P, xlabel='Column (encoding dimension)', + ylabel='Row (position)', figsize=(3.5, 4), cmap='Blues') +``` + +### 相对位置信息 + +除了捕获绝对位置信息之外,上述位置编码还允许模型轻松学习相对位置参加。这是因为对于任何固定仓位偏移 $\delta$,位置 $i + \delta$ 处的位置编码可以用 $i$ 位置的线性投影来表示。 + +这种预测可以用数学方式解释。代表 $\omega_j = 1/10000^{2j/d}$,对于任何固定抵消量 $\delta$,:eqref:`eq_positional-encoding-def` 中的任何一对 $(p_{i, 2j}, p_{i, 2j+1})$ 都可以线性预测到 $(p_{i+\delta, 2j}, p_{i+\delta, 2j+1})$: + +$$\begin{aligned} +&\begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\ -\sin(\delta \omega_j) & \cos(\delta \omega_j) \\ \end{bmatrix} +\begin{bmatrix} p_{i, 2j} \\ p_{i, 2j+1} \\ \end{bmatrix}\\ +=&\begin{bmatrix} \cos(\delta \omega_j) \sin(i \omega_j) + \sin(\delta \omega_j) \cos(i \omega_j) \\ -\sin(\delta \omega_j) \sin(i \omega_j) + \cos(\delta \omega_j) \cos(i \omega_j) \\ \end{bmatrix}\\ +=&\begin{bmatrix} \sin\left((i+\delta) \omega_j\right) \\ \cos\left((i+\delta) \omega_j\right) \\ \end{bmatrix}\\ +=& +\begin{bmatrix} p_{i+\delta, 2j} \\ p_{i+\delta, 2j+1} \\ \end{bmatrix}, +\end{aligned}$$ + +$2\times 2$ 预测矩阵不依赖于任何仓位指数 $i$。 + +## 摘要 + +* 在自我注意中,查询、键和值都来自同一个地方。 +* CNN 和自我注意都享受并行计算,自我注意力的最大路径长度最短。但是,相对于序列长度的二次计算复杂性使得自我注意力在很长的序列中非常缓慢。 +* 要使用序列顺序信息,我们可以通过在输入表示中添加位置编码来注入绝对或相对位置信息。 + +## 练习 + +1. 假设我们设计了一个深层架构,通过使用位置编码堆叠自我注意图层来表示序列。可能是什么问题? +1. 你能设计一种可学习的位置编码方法吗? + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/1651) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1652) +:end_tab: diff --git a/chapter_attention-mechanisms/self-attention-and-positional-encoding_origin.md b/chapter_attention-mechanisms/self-attention-and-positional-encoding_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..4fdcf2a0973a7ed4f5698a1df34c0cb3ef6e207a --- /dev/null +++ b/chapter_attention-mechanisms/self-attention-and-positional-encoding_origin.md @@ -0,0 +1,362 @@ +# Self-Attention and Positional Encoding +:label:`sec_self-attention-and-positional-encoding` + +In deep learning, +we often use CNNs or RNNs to encode a sequence. +Now with attention mechanisms. +imagine that we feed a sequence of tokens +into attention pooling +so that +the same set of tokens +act as queries, keys, and values. +Specifically, +each query attends to all the key-value pairs +and generates one attention output. +Since the queries, keys, and values +come from the same place, +this performs +*self-attention* :cite:`Lin.Feng.Santos.ea.2017,Vaswani.Shazeer.Parmar.ea.2017`, which is also called *intra-attention* :cite:`Cheng.Dong.Lapata.2016,Parikh.Tackstrom.Das.ea.2016,Paulus.Xiong.Socher.2017`. +In this section, +we will discuss sequence encoding using self-attention, +including using additional information for the sequence order. + +```{.python .input} +from d2l import mxnet as d2l +import math +from mxnet import autograd, np, npx +from mxnet.gluon import nn +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import math +import torch +from torch import nn +``` + +## Self-Attention + +Given a sequence of input tokens +$\mathbf{x}_1, \ldots, \mathbf{x}_n$ where any $\mathbf{x}_i \in \mathbb{R}^d$ ($1 \leq i \leq n$), +its self-attention outputs +a sequence of the same length +$\mathbf{y}_1, \ldots, \mathbf{y}_n$, +where + +$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d$$ + +according to the definition of attention pooling $f$ in +:eqref:`eq_attn-pooling`. 
+Using multi-head attention, +the following code snippet +computes the self-attention of a tensor +with shape (batch size, number of time steps or sequence length in tokens, $d$). +The output tensor has the same shape. + +```{.python .input} +num_hiddens, num_heads = 100, 5 +attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5) +attention.initialize() +``` + +```{.python .input} +#@tab pytorch +num_hiddens, num_heads = 100, 5 +attention = d2l.MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens, + num_hiddens, num_heads, 0.5) +attention.eval() +``` + +```{.python .input} +#@tab all +batch_size, num_queries, valid_lens = 2, 4, d2l.tensor([3, 2]) +X = d2l.ones((batch_size, num_queries, num_hiddens)) +attention(X, X, X, valid_lens).shape +``` + +## Comparing CNNs, RNNs, and Self-Attention +:label:`subsec_cnn-rnn-self-attention` + +Let us +compare architectures for mapping +a sequence of $n$ tokens +to another sequence of equal length, +where each input or output token is represented by +a $d$-dimensional vector. +Specifically, +we will consider CNNs, RNNs, and self-attention. +We will compare their +computational complexity, +sequential operations, +and maximum path lengths. +Note that sequential operations prevent parallel computation, +while a shorter path between +any combination of sequence positions +makes it easier to learn long-range dependencies within the sequence :cite:`Hochreiter.Bengio.Frasconi.ea.2001`. + + +![Comparing CNN (padding tokens are omitted), RNN, and self-attention architectures.](../img/cnn-rnn-self-attention.svg) +:label:`fig_cnn-rnn-self-attention` + +Consider a convolutional layer whose kernel size is $k$. +We will provide more details about sequence processing +using CNNs in later chapters. +For now, +we only need to know that +since the sequence length is $n$, +the numbers of input and output channels are both $d$, +the computational complexity of the convolutional layer is $\mathcal{O}(knd^2)$. +As :numref:`fig_cnn-rnn-self-attention` shows, +CNNs are hierarchical so +there are $\mathcal{O}(1)$ sequential operations +and the maximum path length is $\mathcal{O}(n/k)$. +For example, $\mathbf{x}_1$ and $\mathbf{x}_5$ +are within the receptive field of a two-layer CNN +with kernel size 3 in :numref:`fig_cnn-rnn-self-attention`. + +When updating the hidden state of RNNs, +multiplication of the $d \times d$ weight matrix +and the $d$-dimensional hidden state has +a computational complexity of $\mathcal{O}(d^2)$. +Since the sequence length is $n$, +the computational complexity of the recurrent layer +is $\mathcal{O}(nd^2)$. +According to :numref:`fig_cnn-rnn-self-attention`, +there are $\mathcal{O}(n)$ sequential operations +that cannot be parallelized +and the maximum path length is also $\mathcal{O}(n)$. + +In self-attention, +the queries, keys, and values +are all $n \times d$ matrices. +Consider the scaled dot-product attention in +:eqref:`eq_softmax_QK_V`, +where a $n \times d$ matrix is multiplied by +a $d \times n$ matrix, +then the output $n \times n$ matrix is multiplied +by a $n \times d$ matrix. +As a result, +the self-attention +has a $\mathcal{O}(n^2d)$ computational complexity. +As we can see in :numref:`fig_cnn-rnn-self-attention`, +each token is directly connected +to any other token via self-attention. +Therefore, +computation can be parallel with $\mathcal{O}(1)$ sequential operations +and the maximum path length is also $\mathcal{O}(1)$. 
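+
+To make the comparison concrete, the rough sketch below simply multiplies out the dominant terms derived above (constant factors and memory traffic are ignored, so the numbers are illustrative rather than measured runtimes):
+
+```{.python .input}
+#@tab all
+# Dominant-term operation counts from the analysis above (constants ignored)
+def dominant_costs(n, d, k):
+    return {'CNN': k * n * d**2, 'RNN': n * d**2, 'self-attention': n**2 * d}
+
+# For a short sequence the quadratic term of self-attention is harmless;
+# for a very long sequence it dominates the CNN and RNN costs
+print(dominant_costs(n=100, d=512, k=3))
+print(dominant_costs(n=10000, d=512, k=3))
+```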
+ +All in all, +both CNNs and self-attention enjoy parallel computation +and self-attention has the shortest maximum path length. +However, the quadratic computational complexity with respect to the sequence length +makes self-attention prohibitively slow for very long sequences. + + + + + +## Positional Encoding +:label:`subsec_positional-encoding` + + +Unlike RNNs that recurrently process +tokens of a sequence one by one, +self-attention ditches +sequential operations in favor of +parallel computation. +To use the sequence order information, +we can inject +absolute or relative +positional information +by adding *positional encoding* +to the input representations. +Positional encodings can be +either learned or fixed. +In the following, +we describe a fixed positional encoding +based on sine and cosine functions :cite:`Vaswani.Shazeer.Parmar.ea.2017`. + +Suppose that +the input representation $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the $d$-dimensional embeddings for $n$ tokens of a sequence. +The positional encoding outputs +$\mathbf{X} + \mathbf{P}$ +using a positional embedding matrix $\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape, +whose element on the $i^\mathrm{th}$ row +and the $(2j)^\mathrm{th}$ +or the $(2j + 1)^\mathrm{th}$ column is + +$$\begin{aligned} p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right),\\p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right).\end{aligned}$$ +:eqlabel:`eq_positional-encoding-def` + +At first glance, +this trigonometric-function +design looks weird. +Before explanations of this design, +let us first implement it in the following `PositionalEncoding` class. + +```{.python .input} +#@save +class PositionalEncoding(nn.Block): + def __init__(self, num_hiddens, dropout, max_len=1000): + super(PositionalEncoding, self).__init__() + self.dropout = nn.Dropout(dropout) + # Create a long enough `P` + self.P = d2l.zeros((1, max_len, num_hiddens)) + X = d2l.arange(max_len).reshape(-1, 1) / np.power( + 10000, np.arange(0, num_hiddens, 2) / num_hiddens) + self.P[:, :, 0::2] = np.sin(X) + self.P[:, :, 1::2] = np.cos(X) + + def forward(self, X): + X = X + self.P[:, :X.shape[1], :].as_in_ctx(X.ctx) + return self.dropout(X) +``` + +```{.python .input} +#@tab pytorch +#@save +class PositionalEncoding(nn.Module): + def __init__(self, num_hiddens, dropout, max_len=1000): + super(PositionalEncoding, self).__init__() + self.dropout = nn.Dropout(dropout) + # Create a long enough `P` + self.P = d2l.zeros((1, max_len, num_hiddens)) + X = d2l.arange(max_len, dtype=torch.float32).reshape( + -1, 1) / torch.pow(10000, torch.arange( + 0, num_hiddens, 2, dtype=torch.float32) / num_hiddens) + self.P[:, :, 0::2] = torch.sin(X) + self.P[:, :, 1::2] = torch.cos(X) + + def forward(self, X): + X = X + self.P[:, :X.shape[1], :].to(X.device) + return self.dropout(X) +``` + +In the positional embedding matrix $\mathbf{P}$, +rows correspond to positions within a sequence +and columns represent different positional encoding dimensions. +In the example below, +we can see that +the $6^{\mathrm{th}}$ and the $7^{\mathrm{th}}$ +columns of the positional embedding matrix +have a higher frequency than +the $8^{\mathrm{th}}$ and the $9^{\mathrm{th}}$ +columns. +The offset between +the $6^{\mathrm{th}}$ and the $7^{\mathrm{th}}$ (same for the $8^{\mathrm{th}}$ and the $9^{\mathrm{th}}$) columns +is due to the alternation of sine and cosine functions. 
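+
+Concretely, with $\omega_j = 1/10000^{2j/d}$ denoting the frequency shared by columns $2j$ and $2j+1$ in :eqref:`eq_positional-encoding-def`, the encoding dimension $d = 32$ used below gives $\omega_3 = 1/10000^{6/32} \approx 0.18$ for columns 6 and 7, and the lower $\omega_4 = 1/10000^{8/32} = 0.1$ for columns 8 and 9.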
+ +```{.python .input} +encoding_dim, num_steps = 32, 60 +pos_encoding = PositionalEncoding(encoding_dim, 0) +pos_encoding.initialize() +X = pos_encoding(np.zeros((1, num_steps, encoding_dim))) +P = pos_encoding.P[:, :X.shape[1], :] +d2l.plot(d2l.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)', + figsize=(6, 2.5), legend=["Col %d" % d for d in d2l.arange(6, 10)]) +``` + +```{.python .input} +#@tab pytorch +encoding_dim, num_steps = 32, 60 +pos_encoding = PositionalEncoding(encoding_dim, 0) +pos_encoding.eval() +X = pos_encoding(d2l.zeros((1, num_steps, encoding_dim))) +P = pos_encoding.P[:, :X.shape[1], :] +d2l.plot(d2l.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)', + figsize=(6, 2.5), legend=["Col %d" % d for d in d2l.arange(6, 10)]) +``` + +### Absolute Positional Information + +To see how the monotonically decreased frequency +along the encoding dimension relates to absolute positional information, +let us print out the binary representations of $0, 1, \ldots, 7$. +As we can see, +the lowest bit, the second-lowest bit, and the third-lowest bit alternate on every number, every two numbers, and every four numbers, respectively. + +```{.python .input} +#@tab all +for i in range(8): + print(f'{i} in binary is {i:>03b}') +``` + +In binary representations, +a higher bit has a lower frequency than a lower bit. +Similarly, +as demonstrated in the heat map below, +the positional encoding decreases +frequencies along the encoding dimension +by using trigonometric functions. +Since the outputs are float numbers, +such continuous representations +are more space-efficient +than binary representations. + +```{.python .input} +P = np.expand_dims(np.expand_dims(P[0, :, :], 0), 0) +d2l.show_heatmaps(P, xlabel='Column (encoding dimension)', + ylabel='Row (position)', figsize=(3.5, 4), cmap='Blues') +``` + +```{.python .input} +#@tab pytorch +P = P[0, :, :].unsqueeze(0).unsqueeze(0) +d2l.show_heatmaps(P, xlabel='Column (encoding dimension)', + ylabel='Row (position)', figsize=(3.5, 4), cmap='Blues') +``` + +### Relative Positional Information + +Besides capturing absolute positional information, +the above positional encoding +also allows +a model to easily learn to attend by relative positions. +This is because +for any fixed position offset $\delta$, +the positional encoding at position $i + \delta$ +can be represented by a linear projection +of that at position $i$. + + +This projection can be explained +mathematically. +Denoting +$\omega_j = 1/10000^{2j/d}$, +any pair of $(p_{i, 2j}, p_{i, 2j+1})$ +in :eqref:`eq_positional-encoding-def` +can +be linearly projected to $(p_{i+\delta, 2j}, p_{i+\delta, 2j+1})$ +for any fixed offset $\delta$: + +$$\begin{aligned} +&\begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\ -\sin(\delta \omega_j) & \cos(\delta \omega_j) \\ \end{bmatrix} +\begin{bmatrix} p_{i, 2j} \\ p_{i, 2j+1} \\ \end{bmatrix}\\ +=&\begin{bmatrix} \cos(\delta \omega_j) \sin(i \omega_j) + \sin(\delta \omega_j) \cos(i \omega_j) \\ -\sin(\delta \omega_j) \sin(i \omega_j) + \cos(\delta \omega_j) \cos(i \omega_j) \\ \end{bmatrix}\\ +=&\begin{bmatrix} \sin\left((i+\delta) \omega_j\right) \\ \cos\left((i+\delta) \omega_j\right) \\ \end{bmatrix}\\ +=& +\begin{bmatrix} p_{i+\delta, 2j} \\ p_{i+\delta, 2j+1} \\ \end{bmatrix}, +\end{aligned}$$ + +where the $2\times 2$ projection matrix does not depend on any position index $i$. + +## Summary + +* In self-attention, the queries, keys, and values all come from the same place. 
+* Both CNNs and self-attention enjoy parallel computation and self-attention has the shortest maximum path length. However, the quadratic computational complexity with respect to the sequence length makes self-attention prohibitively slow for very long sequences. +* To use the sequence order information, we can inject absolute or relative positional information by adding positional encoding to the input representations. + +## Exercises + +1. Suppose that we design a deep architecture to represent a sequence by stacking self-attention layers with positional encoding. What could be issues? +1. Can you design a learnable positional encoding method? + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/1651) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1652) +:end_tab: diff --git a/chapter_attention-mechanisms/transformer.md b/chapter_attention-mechanisms/transformer.md new file mode 100644 index 0000000000000000000000000000000000000000..06441fe94b8d86c92435cd585709199b759b6e86 --- /dev/null +++ b/chapter_attention-mechanisms/transformer.md @@ -0,0 +1,651 @@ +# 变压器 +:label:`sec_transformer` + +我们在 :numref:`subsec_cnn-rnn-self-attention` 中比较了 CNN、RNN 和自我注意力。值得注意的是,自我注意力同时享有并行计算和最短的最大路径长度。因此,自然而言,通过使用自我注意力来设计深层架构是很有吸引力的。与之前仍然依赖 RNN 进行输入表示 :cite:`Cheng.Dong.Lapata.2016,Lin.Feng.Santos.ea.2017,Paulus.Xiong.Socher.2017` 的自我注意模型不同,变压器模型完全基于注意机制,没有任何卷积层或循环层 :cite:`Vaswani.Shazeer.Parmar.ea.2017`。尽管最初提议对文本数据进行顺序学习,但变形金刚在各种现代深度学习应用中普遍存在,例如语言、视觉、语音和强化学习领域。 + +## 模型 + +作为编码器解码器架构的一个实例,变压器的整体架构在 :numref:`fig_transformer` 中介绍。正如我们所看到的,变压器由编码器和解码器组成。与 :numref:`fig_s2s_attention_details` 中 Bahdanau 对序列到序列学习的关注不同,输入(源)和输出(目标)序列嵌入在输入(源)和输出(目标)序列嵌入之前将添加到编码器和基于自我注意力堆叠模块的解码器之前。 + +![The Transformer architecture.](../img/transformer.svg) +:width:`500px` +:label:`fig_transformer` + +现在我们在 :numref:`fig_transformer` 中概述了变压器架构。从高层来看,变压器编码器是由多个相同层组成的堆栈,每层都有两个子层(两个子层表示为 $\mathrm{sublayer}$)。第一个是多头自我注意力集中,第二个是位置上的前馈网络。具体来说,在编码器的自我注意中,查询、键和值都来自前一个编码器层的输出。受 :numref:`sec_resnet` ResNet 设计的启发,两个子层周围都采用了残留连接。在变压器中,对于序列中任何位置的任何输入 $\mathbf{x} \in \mathbb{R}^d$,我们要求 $\mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$,以便剩余连接 $\mathbf{x} + \mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ 是可行的。从残留连接中添加这一点之后立即进行层规范化 :cite:`Ba.Kiros.Hinton.2016`。因此,变压器编码器为输入序列的每个位置输出 $d$ 维矢量表示。 + +变压器解码器也是由多个相同层组成的堆栈,具有残留连接和层标准化。除了编码器中描述的两个子层之外,解码器还在这两个子层之间插入第三个子层,称为编码器解码器注意力。在编码器解码器中,查询来自前一个解码器层的输出,键和值来自 Transcoransder 编码器输出。在解码器中,查询、键和值都来自上一个解码器层的输出。但是,解码器中的每个位置只能处理解码器中直到该位置的所有位置。这种 * 掩码 * 注意力保留了自动回归属性,确保预测仅依赖于已生成的输出令牌。 + +我们已经描述并实施了基于 :numref:`sec_multihead-attention` 中的缩放点产品和 :numref:`subsec_positional-encoding` 中的位置编码的多头关注。在下面,我们将实现变压器模型的其余部分。 + +```{.python .input} +from d2l import mxnet as d2l +import math +from mxnet import autograd, np, npx +from mxnet.gluon import nn +import pandas as pd +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import math +import pandas as pd +import torch +from torch import nn +``` + +## 定位前馈网络 + +位置向前馈网络使用同一个 MLP 转换所有序列位置的表示形式。这就是为什么我们称之为 * 职位 *。在下面的实现中,带有形状的输入 `X`(批量大小、时间步长或序列长度(标记为单位的序列长度、隐藏单位数或要素维度)将被双层 MLP 转换为形状的输出张量(批量大小、时间步长、`ffn_num_outputs`)。 + +```{.python .input} +#@save +class PositionWiseFFN(nn.Block): + def __init__(self, ffn_num_hiddens, ffn_num_outputs, **kwargs): + super(PositionWiseFFN, self).__init__(**kwargs) + self.dense1 = nn.Dense(ffn_num_hiddens, flatten=False, + activation='relu') + self.dense2 = nn.Dense(ffn_num_outputs, flatten=False) + + def forward(self, X): + return self.dense2(self.dense1(X)) +``` + 
+```{.python .input} +#@tab pytorch +#@save +class PositionWiseFFN(nn.Module): + def __init__(self, ffn_num_input, ffn_num_hiddens, ffn_num_outputs, + **kwargs): + super(PositionWiseFFN, self).__init__(**kwargs) + self.dense1 = nn.Linear(ffn_num_input, ffn_num_hiddens) + self.relu = nn.ReLU() + self.dense2 = nn.Linear(ffn_num_hiddens, ffn_num_outputs) + + def forward(self, X): + return self.dense2(self.relu(self.dense1(X))) +``` + +以下示例显示,张量的最内层维度会改变位置向前馈网络中的输出数量。由于相同的 MLP 在所有仓位上都变换,所以当所有这些位置的输入相同时,它们的输出也是相同的。 + +```{.python .input} +ffn = PositionWiseFFN(4, 8) +ffn.initialize() +ffn(np.ones((2, 3, 4)))[0] +``` + +```{.python .input} +#@tab pytorch +ffn = PositionWiseFFN(4, 4, 8) +ffn.eval() +ffn(d2l.ones((2, 3, 4)))[0] +``` + +## 剩余连接和层规范化 + +现在让我们关注 :numref:`fig_transformer` 中的 “添加和规范” 组件。正如我们在本节开头所述,这是一个残余连接,紧接着是层规范化。两者都是有效的深度架构的关键。 + +在 :numref:`sec_batch_norm` 中,我们解释了如何在一个小批量内批量标准化最近和重新调整示例。图层规范化与批量规范化相同,只是前者在要素维度上进行规范化。尽管在计算机视觉中广泛应用批量规范化,但在自然语言处理任务中,批量规范化通常不如图层规范化的效果,而自然语言处理任务的输入通常是可变长度的序列。 + +以下代码段通过层规范化和批量规范化比较了不同维度的规范化。 + +```{.python .input} +ln = nn.LayerNorm() +ln.initialize() +bn = nn.BatchNorm() +bn.initialize() +X = d2l.tensor([[1, 2], [2, 3]]) +# Compute mean and variance from `X` in the training mode +with autograd.record(): + print('layer norm:', ln(X), '\nbatch norm:', bn(X)) +``` + +```{.python .input} +#@tab pytorch +ln = nn.LayerNorm(2) +bn = nn.BatchNorm1d(2) +X = d2l.tensor([[1, 2], [2, 3]], dtype=torch.float32) +# Compute mean and variance from `X` in the training mode +print('layer norm:', ln(X), '\nbatch norm:', bn(X)) +``` + +现在我们可以使用剩余连接实现 `AddNorm` 类,然后再进行层规范化。退学也适用于正规化。 + +```{.python .input} +#@save +class AddNorm(nn.Block): + def __init__(self, dropout, **kwargs): + super(AddNorm, self).__init__(**kwargs) + self.dropout = nn.Dropout(dropout) + self.ln = nn.LayerNorm() + + def forward(self, X, Y): + return self.ln(self.dropout(Y) + X) +``` + +```{.python .input} +#@tab pytorch +#@save +class AddNorm(nn.Module): + def __init__(self, normalized_shape, dropout, **kwargs): + super(AddNorm, self).__init__(**kwargs) + self.dropout = nn.Dropout(dropout) + self.ln = nn.LayerNorm(normalized_shape) + + def forward(self, X, Y): + return self.ln(self.dropout(Y) + X) +``` + +剩余连接要求两个输入的形状相同,以便在加法操作后输出张量也具有相同的形状。 + +```{.python .input} +add_norm = AddNorm(0.5) +add_norm.initialize() +add_norm(d2l.ones((2, 3, 4)), d2l.ones((2, 3, 4))).shape +``` + +```{.python .input} +#@tab pytorch +add_norm = AddNorm([3, 4], 0.5) # Normalized_shape is input.size()[1:] +add_norm.eval() +add_norm(d2l.ones((2, 3, 4)), d2l.ones((2, 3, 4))).shape +``` + +## 编码器 + +由于组装变压器所需的所有必要组件,让我们首先在编码器中实现单层。以下 `EncoderBlock` 类包含两个子层:多头自我注意力和定位前馈网络,其中两个子层周围采用残留连接,然后再进行层规范化。 + +```{.python .input} +#@save +class EncoderBlock(nn.Block): + def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout, + use_bias=False, **kwargs): + super(EncoderBlock, self).__init__(**kwargs) + self.attention = d2l.MultiHeadAttention( + num_hiddens, num_heads, dropout, use_bias) + self.addnorm1 = AddNorm(dropout) + self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens) + self.addnorm2 = AddNorm(dropout) + + def forward(self, X, valid_lens): + Y = self.addnorm1(X, self.attention(X, X, X, valid_lens)) + return self.addnorm2(Y, self.ffn(Y)) +``` + +```{.python .input} +#@tab pytorch +#@save +class EncoderBlock(nn.Module): + def __init__(self, key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, num_heads, + dropout, use_bias=False, **kwargs): + super(EncoderBlock, 
self).__init__(**kwargs) + self.attention = d2l.MultiHeadAttention( + key_size, query_size, value_size, num_hiddens, num_heads, dropout, + use_bias) + self.addnorm1 = AddNorm(norm_shape, dropout) + self.ffn = PositionWiseFFN( + ffn_num_input, ffn_num_hiddens, num_hiddens) + self.addnorm2 = AddNorm(norm_shape, dropout) + + def forward(self, X, valid_lens): + Y = self.addnorm1(X, self.attention(X, X, X, valid_lens)) + return self.addnorm2(Y, self.ffn(Y)) +``` + +正如我们所看到的,变压器编码器中的任何图层都不会改变其输入的形状。 + +```{.python .input} +X = d2l.ones((2, 100, 24)) +valid_lens = d2l.tensor([3, 2]) +encoder_blk = EncoderBlock(24, 48, 8, 0.5) +encoder_blk.initialize() +encoder_blk(X, valid_lens).shape +``` + +```{.python .input} +#@tab pytorch +X = d2l.ones((2, 100, 24)) +valid_lens = d2l.tensor([3, 2]) +encoder_blk = EncoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5) +encoder_blk.eval() +encoder_blk(X, valid_lens).shape +``` + +在下面的变压器编码器实现中,我们堆叠上述 `EncoderBlock` 类的 `num_layers` 个实例。由于我们使用的值始终在-1 和 1 之间的固定位置编码,因此我们将可学习输入嵌入的值乘以嵌入维度的平方根,以便在总结输入嵌入和位置编码之前重新缩放。 + +```{.python .input} +#@save +class TransformerEncoder(d2l.Encoder): + def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, + num_heads, num_layers, dropout, use_bias=False, **kwargs): + super(TransformerEncoder, self).__init__(**kwargs) + self.num_hiddens = num_hiddens + self.embedding = nn.Embedding(vocab_size, num_hiddens) + self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout) + self.blks = nn.Sequential() + for _ in range(num_layers): + self.blks.add( + EncoderBlock(num_hiddens, ffn_num_hiddens, num_heads, dropout, + use_bias)) + + def forward(self, X, valid_lens, *args): + # Since positional encoding values are between -1 and 1, the embedding + # values are multiplied by the square root of the embedding dimension + # to rescale before they are summed up + X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens)) + self.attention_weights = [None] * len(self.blks) + for i, blk in enumerate(self.blks): + X = blk(X, valid_lens) + self.attention_weights[ + i] = blk.attention.attention.attention_weights + return X +``` + +```{.python .input} +#@tab pytorch +#@save +class TransformerEncoder(d2l.Encoder): + def __init__(self, vocab_size, key_size, query_size, value_size, + num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens, + num_heads, num_layers, dropout, use_bias=False, **kwargs): + super(TransformerEncoder, self).__init__(**kwargs) + self.num_hiddens = num_hiddens + self.embedding = nn.Embedding(vocab_size, num_hiddens) + self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout) + self.blks = nn.Sequential() + for i in range(num_layers): + self.blks.add_module("block"+str(i), + EncoderBlock(key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, + num_heads, dropout, use_bias)) + + def forward(self, X, valid_lens, *args): + # Since positional encoding values are between -1 and 1, the embedding + # values are multiplied by the square root of the embedding dimension + # to rescale before they are summed up + X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens)) + self.attention_weights = [None] * len(self.blks) + for i, blk in enumerate(self.blks): + X = blk(X, valid_lens) + self.attention_weights[ + i] = blk.attention.attention.attention_weights + return X +``` + +下面我们指定了超参数来创建一个双层变压器编码器。变压器编码器输出的形状是(批量大小、时间步长数、`num_hiddens`)。 + +```{.python .input} +encoder = TransformerEncoder(200, 24, 48, 8, 2, 0.5) +encoder.initialize() +encoder(np.ones((2, 100)), 
valid_lens).shape +``` + +```{.python .input} +#@tab pytorch +encoder = TransformerEncoder( + 200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5) +encoder.eval() +encoder(d2l.ones((2, 100), dtype=torch.long), valid_lens).shape +``` + +## 解码器 + +如 :numref:`fig_transformer` 所示,变压器解码器由多个相同的层组成。每个层都在以下 `DecoderBlock` 类中实现,其中包含三个子层:解码器自我注意、编码器-解码器注意力和定位前馈网络。这些子层周围使用残留连接,然后进行层规范化。 + +正如我们在本节前面所述,在蒙版的多头解码器自我注意力(第一个子层)中,查询、键和值都来自上一个解码器层的输出。训练顺序到序列模型时,输出序列的所有位置(时间步长)的令牌都是已知的。但是,在预测期间,输出序列是通过令牌生成的;因此,在任何解码器时间步骤中,只有生成的令牌才能用于解码器的自我注意力。为了在解码器中保留自动回归,其蒙版自我注意力指定 `dec_valid_lens`,以便任何查询只参与解码器中直到查询位置的所有位置。 + +```{.python .input} +class DecoderBlock(nn.Block): + # The `i`-th block in the decoder + def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, + dropout, i, **kwargs): + super(DecoderBlock, self).__init__(**kwargs) + self.i = i + self.attention1 = d2l.MultiHeadAttention(num_hiddens, num_heads, + dropout) + self.addnorm1 = AddNorm(dropout) + self.attention2 = d2l.MultiHeadAttention(num_hiddens, num_heads, + dropout) + self.addnorm2 = AddNorm(dropout) + self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens) + self.addnorm3 = AddNorm(dropout) + + def forward(self, X, state): + enc_outputs, enc_valid_lens = state[0], state[1] + # During training, all the tokens of any output sequence are processed + # at the same time, so `state[2][self.i]` is `None` as initialized. + # When decoding any output sequence token by token during prediction, + # `state[2][self.i]` contains representations of the decoded output at + # the `i`-th block up to the current time step + if state[2][self.i] is None: + key_values = X + else: + key_values = np.concatenate((state[2][self.i], X), axis=1) + state[2][self.i] = key_values + + if autograd.is_training(): + batch_size, num_steps, _ = X.shape + # Shape of `dec_valid_lens`: (`batch_size`, `num_steps`), where + # every row is [1, 2, ..., `num_steps`] + dec_valid_lens = np.tile(np.arange(1, num_steps + 1, ctx=X.ctx), + (batch_size, 1)) + else: + dec_valid_lens = None + + # Self-attention + X2 = self.attention1(X, key_values, key_values, dec_valid_lens) + Y = self.addnorm1(X, X2) + # Encoder-decoder attention. Shape of `enc_outputs`: + # (`batch_size`, `num_steps`, `num_hiddens`) + Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens) + Z = self.addnorm2(Y, Y2) + return self.addnorm3(Z, self.ffn(Z)), state +``` + +```{.python .input} +#@tab pytorch +class DecoderBlock(nn.Module): + # The `i`-th block in the decoder + def __init__(self, key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, num_heads, + dropout, i, **kwargs): + super(DecoderBlock, self).__init__(**kwargs) + self.i = i + self.attention1 = d2l.MultiHeadAttention( + key_size, query_size, value_size, num_hiddens, num_heads, dropout) + self.addnorm1 = AddNorm(norm_shape, dropout) + self.attention2 = d2l.MultiHeadAttention( + key_size, query_size, value_size, num_hiddens, num_heads, dropout) + self.addnorm2 = AddNorm(norm_shape, dropout) + self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens, + num_hiddens) + self.addnorm3 = AddNorm(norm_shape, dropout) + + def forward(self, X, state): + enc_outputs, enc_valid_lens = state[0], state[1] + # During training, all the tokens of any output sequence are processed + # at the same time, so `state[2][self.i]` is `None` as initialized. 
+ # When decoding any output sequence token by token during prediction, + # `state[2][self.i]` contains representations of the decoded output at + # the `i`-th block up to the current time step + if state[2][self.i] is None: + key_values = X + else: + key_values = torch.cat((state[2][self.i], X), axis=1) + state[2][self.i] = key_values + if self.training: + batch_size, num_steps, _ = X.shape + # Shape of `dec_valid_lens`: (`batch_size`, `num_steps`), where + # every row is [1, 2, ..., `num_steps`] + dec_valid_lens = torch.arange( + 1, num_steps + 1, device=X.device).repeat(batch_size, 1) + else: + dec_valid_lens = None + + # Self-attention + X2 = self.attention1(X, key_values, key_values, dec_valid_lens) + Y = self.addnorm1(X, X2) + # Encoder-decoder attention. Shape of `enc_outputs`: + # (`batch_size`, `num_steps`, `num_hiddens`) + Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens) + Z = self.addnorm2(Y, Y2) + return self.addnorm3(Z, self.ffn(Z)), state +``` + +为了便于在编码器-解码器注意和剩余连接中的加法操作,解码器的特征尺寸 (`num_hiddens`) 与编码器的特征尺寸 (`num_hiddens`) 相同。 + +```{.python .input} +decoder_blk = DecoderBlock(24, 48, 8, 0.5, 0) +decoder_blk.initialize() +X = np.ones((2, 100, 24)) +state = [encoder_blk(X, valid_lens), valid_lens, [None]] +decoder_blk(X, state)[0].shape +``` + +```{.python .input} +#@tab pytorch +decoder_blk = DecoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5, 0) +decoder_blk.eval() +X = d2l.ones((2, 100, 24)) +state = [encoder_blk(X, valid_lens), valid_lens, [None]] +decoder_blk(X, state)[0].shape +``` + +现在我们构建了由 `num_layers` 个 `DecoderBlock` 实例组成的整个变压器解码器。最后,一个完全连接的层计算所有 `vocab_size` 个可能的输出令牌的预测。解码器的自我注意力重量和编码器-解码器的注意权重都被存储,以供日后可视化。 + +```{.python .input} +class TransformerDecoder(d2l.AttentionDecoder): + def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, + num_heads, num_layers, dropout, **kwargs): + super(TransformerDecoder, self).__init__(**kwargs) + self.num_hiddens = num_hiddens + self.num_layers = num_layers + self.embedding = nn.Embedding(vocab_size, num_hiddens) + self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout) + self.blks = nn.Sequential() + for i in range(num_layers): + self.blks.add( + DecoderBlock(num_hiddens, ffn_num_hiddens, num_heads, + dropout, i)) + self.dense = nn.Dense(vocab_size, flatten=False) + + def init_state(self, enc_outputs, enc_valid_lens, *args): + return [enc_outputs, enc_valid_lens, [None] * self.num_layers] + + def forward(self, X, state): + X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens)) + self._attention_weights = [[None] * len(self.blks) for _ in range (2)] + for i, blk in enumerate(self.blks): + X, state = blk(X, state) + # Decoder self-attention weights + self._attention_weights[0][ + i] = blk.attention1.attention.attention_weights + # Encoder-decoder attention weights + self._attention_weights[1][ + i] = blk.attention2.attention.attention_weights + return self.dense(X), state + + @property + def attention_weights(self): + return self._attention_weights +``` + +```{.python .input} +#@tab pytorch +class TransformerDecoder(d2l.AttentionDecoder): + def __init__(self, vocab_size, key_size, query_size, value_size, + num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens, + num_heads, num_layers, dropout, **kwargs): + super(TransformerDecoder, self).__init__(**kwargs) + self.num_hiddens = num_hiddens + self.num_layers = num_layers + self.embedding = nn.Embedding(vocab_size, num_hiddens) + self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout) + self.blks = nn.Sequential() + for i 
in range(num_layers): + self.blks.add_module("block"+str(i), + DecoderBlock(key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, + num_heads, dropout, i)) + self.dense = nn.Linear(num_hiddens, vocab_size) + + def init_state(self, enc_outputs, enc_valid_lens, *args): + return [enc_outputs, enc_valid_lens, [None] * self.num_layers] + + def forward(self, X, state): + X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens)) + self._attention_weights = [[None] * len(self.blks) for _ in range (2)] + for i, blk in enumerate(self.blks): + X, state = blk(X, state) + # Decoder self-attention weights + self._attention_weights[0][ + i] = blk.attention1.attention.attention_weights + # Encoder-decoder attention weights + self._attention_weights[1][ + i] = blk.attention2.attention.attention_weights + return self.dense(X), state + + @property + def attention_weights(self): + return self._attention_weights +``` + +## 培训 + +让我们通过遵循变压器架构来实例化编码器解码器模型。在这里,我们指定变压器编码器和变压器解码器都有 2 层,使用 4 头注意力。与 :numref:`sec_seq2seq_training` 类似,我们训练变压器模型,以便在英语-法语机器翻译数据集上进行序列到序列的学习。 + +```{.python .input} +num_hiddens, num_layers, dropout, batch_size, num_steps = 32, 2, 0.1, 64, 10 +lr, num_epochs, device = 0.005, 200, d2l.try_gpu() +ffn_num_hiddens, num_heads = 64, 4 + +train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps) + +encoder = TransformerEncoder( + len(src_vocab), num_hiddens, ffn_num_hiddens, num_heads, num_layers, + dropout) +decoder = TransformerDecoder( + len(tgt_vocab), num_hiddens, ffn_num_hiddens, num_heads, num_layers, + dropout) +net = d2l.EncoderDecoder(encoder, decoder) +d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device) +``` + +```{.python .input} +#@tab pytorch +num_hiddens, num_layers, dropout, batch_size, num_steps = 32, 2, 0.1, 64, 10 +lr, num_epochs, device = 0.005, 200, d2l.try_gpu() +ffn_num_input, ffn_num_hiddens, num_heads = 32, 64, 4 +key_size, query_size, value_size = 32, 32, 32 +norm_shape = [32] + +train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps) + +encoder = TransformerEncoder( + len(src_vocab), key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, num_heads, + num_layers, dropout) +decoder = TransformerDecoder( + len(tgt_vocab), key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, num_heads, + num_layers, dropout) +net = d2l.EncoderDecoder(encoder, decoder) +d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device) +``` + +训练结束后,我们使用变压器模型将一些英语句子翻译成法语并计算它们的 BLEU 分数。 + +```{.python .input} +#@tab all +engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .'] +fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .'] +for eng, fra in zip(engs, fras): + translation, dec_attention_weight_seq = d2l.predict_seq2seq( + net, eng, src_vocab, tgt_vocab, num_steps, device, True) + print(f'{eng} => {translation}, ', + f'bleu {d2l.bleu(translation, fra, k=2):.3f}') +``` + +让我们在将最后一个英语句子翻译成法语时可视化变压器的注意力重量。编码器自我注意权重的形状为(编码器层数、注意头数、`num_steps` 或查询数、`num_steps` 或键值对的数量)。 + +```{.python .input} +#@tab all +enc_attention_weights = d2l.reshape( + d2l.concat(net.encoder.attention_weights, 0), + (num_layers, num_heads, -1, num_steps)) +enc_attention_weights.shape +``` + +在编码器的自我注意中,查询和键来自相同的输入序列。由于填充令牌不具有意义,并且输入序列的指定有效长度,因此没有查询参与填充令牌的位置。在以下内容中,将逐行呈现两层多头注意力权重。每位负责人都根据查询、键和值的单独表示子空间独立出席。 + +```{.python .input} +d2l.show_heatmaps( + enc_attention_weights, xlabel='Key positions', ylabel='Query 
positions', + titles=['Head %d' % i for i in range(1, 5)], figsize=(7, 3.5)) +``` + +```{.python .input} +#@tab pytorch +d2l.show_heatmaps( + enc_attention_weights.cpu(), xlabel='Key positions', + ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)], + figsize=(7, 3.5)) +``` + +为了可视化解码器的自我注意力权重和编码器-解码器的注意权重,我们需要更多的数据操作。例如,我们用零填充蒙面的注意力重量。请注意,解码器自我注意权重和编码器注意权重都有相同的查询:序列开始令牌后跟输出令牌。 + +```{.python .input} +dec_attention_weights_2d = [d2l.tensor(head[0]).tolist() + for step in dec_attention_weight_seq + for attn in step for blk in attn for head in blk] +dec_attention_weights_filled = d2l.tensor( + pd.DataFrame(dec_attention_weights_2d).fillna(0.0).values) +dec_attention_weights = d2l.reshape(dec_attention_weights_filled, + (-1, 2, num_layers, num_heads, num_steps)) +dec_self_attention_weights, dec_inter_attention_weights = \ + dec_attention_weights.transpose(1, 2, 3, 0, 4) +dec_self_attention_weights.shape, dec_inter_attention_weights.shape +``` + +```{.python .input} +#@tab pytorch +dec_attention_weights_2d = [head[0].tolist() + for step in dec_attention_weight_seq + for attn in step for blk in attn for head in blk] +dec_attention_weights_filled = d2l.tensor( + pd.DataFrame(dec_attention_weights_2d).fillna(0.0).values) +dec_attention_weights = d2l.reshape(dec_attention_weights_filled, + (-1, 2, num_layers, num_heads, num_steps)) +dec_self_attention_weights, dec_inter_attention_weights = \ + dec_attention_weights.permute(1, 2, 3, 0, 4) +dec_self_attention_weights.shape, dec_inter_attention_weights.shape +``` + +由于解码器自我注意的自动回归属性,查询位置后没有查询参与键值对。 + +```{.python .input} +#@tab all +# Plus one to include the beginning-of-sequence token +d2l.show_heatmaps( + dec_self_attention_weights[:, :, :, :len(translation.split()) + 1], + xlabel='Key positions', ylabel='Query positions', + titles=['Head %d' % i for i in range(1, 5)], figsize=(7, 3.5)) +``` + +与编码器自我注意的情况类似,通过输入序列的指定有效长度,输出序列中的任何查询都不会参与输入序列中的填充标记。 + +```{.python .input} +#@tab all +d2l.show_heatmaps( + dec_inter_attention_weights, xlabel='Key positions', + ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)], + figsize=(7, 3.5)) +``` + +尽管变压器架构最初是为了顺序到序列的学习而提出的,但正如我们将在本书后面发现的那样,变压器编码器或变压器解码器通常被单独用于不同的深度学习任务。 + +## 摘要 + +* 变压器是编码器解码器架构的一个实例,尽管在实践中可以单独使用编码器或解码器。 +* 在变压器中,多头自我注意力用于表示输入序列和输出序列,尽管解码器必须通过蒙版本保留自动回归属性。 +* 变压器中的残余连接和层标准化对于训练非常深入的模型都很重要。 +* 变压器模型中的向定位前馈网络使用相同的 MLP 转换所有序列位置的表示。 + +## 练习 + +1. 在实验中训练更深的变压器。它如何影响培训速度和翻译绩效? +1. 在变压器中用添加剂注意力取代缩放的点产品注意力是不错的主意吗?为什么? +1. 对于语言建模,我们应该使用 Transor 编码器、解码器还是两者?如何设计这种方法? +1. 如果输入序列很长,变形金刚会面临什么挑战?为什么? +1. 如何提高变形金刚的计算和内存效率?Hind: you may refer to the survey paper by Tay et al. :cite:`Tay.Dehghani.Bahri.ea.2020`。 +1. 我们如何在不使用 CNN 的情况下为图像分类任务设计基于变压器的模型?Hind: you may refer to the Vision Transformer :cite:`Dosovitskiy.Beyer.Kolesnikov.ea.2021`。 + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/348) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1066) +:end_tab: diff --git a/chapter_attention-mechanisms/transformer_origin.md b/chapter_attention-mechanisms/transformer_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..26c117e826290a21e68a310e23ca3c3b4a49600a --- /dev/null +++ b/chapter_attention-mechanisms/transformer_origin.md @@ -0,0 +1,885 @@ +# Transformer +:label:`sec_transformer` + + +We have compared CNNs, RNNs, and self-attention in +:numref:`subsec_cnn-rnn-self-attention`. +Notably, +self-attention +enjoys both parallel computation and +the shortest maximum path length. 
+Therefore, naturally,
+it is appealing to design deep architectures
+by using self-attention.
+Unlike earlier self-attention models
+that still rely on RNNs for input representations :cite:`Cheng.Dong.Lapata.2016,Lin.Feng.Santos.ea.2017,Paulus.Xiong.Socher.2017`,
+the Transformer model
+is solely based on attention mechanisms
+without any convolutional or recurrent layer :cite:`Vaswani.Shazeer.Parmar.ea.2017`.
+Though originally proposed
+for sequence to sequence learning on text data,
+Transformers have been
+pervasive in a wide range of
+modern deep learning applications,
+such as in areas of language, vision, speech, and reinforcement learning.
+
+## Model
+
+
+As an instance of the encoder-decoder
+architecture,
+the overall architecture of
+the Transformer
+is presented in :numref:`fig_transformer`.
+As we can see,
+the Transformer is composed of an encoder and a decoder.
+Different from
+Bahdanau attention
+for sequence to sequence learning
+in :numref:`fig_s2s_attention_details`,
+the input (source) and output (target)
+sequence embeddings
+are added with positional encoding
+before being fed into
+the encoder and the decoder
+that stack modules based on self-attention.
+
+![The Transformer architecture.](../img/transformer.svg)
+:width:`500px`
+:label:`fig_transformer`
+
+
+Now we provide an overview of the
+Transformer architecture in :numref:`fig_transformer`.
+On a high level,
+the Transformer encoder is a stack of multiple identical layers,
+where each layer
+has two sublayers (each is denoted as $\mathrm{sublayer}$).
+The first
+is a multi-head self-attention pooling
+and the second is a positionwise feed-forward network.
+Specifically,
+in the encoder self-attention,
+queries, keys, and values are all from
+the outputs of the previous encoder layer.
+Inspired by the ResNet design in :numref:`sec_resnet`,
+a residual connection is employed
+around both sublayers.
+In the Transformer,
+for any input $\mathbf{x} \in \mathbb{R}^d$ at any position of the sequence,
+we require that $\mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ so that
+the residual connection $\mathbf{x} + \mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ is feasible.
+This addition from the residual connection is immediately
+followed by layer normalization :cite:`Ba.Kiros.Hinton.2016`.
+As a result, the Transformer encoder outputs a $d$-dimensional vector representation for each position of the input sequence.
+
+The Transformer decoder is also
+a stack of multiple identical layers with residual connections and layer normalizations.
+Besides the two sublayers described in
+the encoder, the decoder inserts
+a third sublayer, known as
+the encoder-decoder attention,
+between these two.
+In the encoder-decoder attention,
+queries are from the
+outputs of the previous decoder layer,
+and the keys and values are
+from the Transformer encoder outputs.
+In the decoder self-attention,
+queries, keys, and values are all from
+the outputs of the previous decoder layer.
+However,
+each position in the decoder is
+allowed to only attend to all positions in the decoder
+up to that position.
+This *masked* attention
+preserves the auto-regressive property,
+ensuring that the prediction only depends on those output tokens that have been generated.
+
+
+We have already described and implemented
+multi-head attention based on scaled dot-products
+in :numref:`sec_multihead-attention`
+and positional encoding in :numref:`subsec_positional-encoding`.
+In the following,
+we will implement the rest of the Transformer model. 
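+
+Before diving in, it may help to picture the masking pattern that the decoder self-attention relies on: a query at position $i$ may only attend to key positions $1, \ldots, i$. The minimal plain-PyTorch sketch below illustrates this pattern; the decoder implementation later in this section encodes the same idea via `dec_valid_lens`.
+
+```{.python .input}
+#@tab pytorch
+import torch
+
+num_steps = 4
+# Query at position i (1-indexed) may attend to key positions 1, ..., i
+dec_valid_lens = torch.arange(1, num_steps + 1)
+# Boolean mask: entry [i, j] is True if query i may attend to key j
+causal_mask = torch.arange(num_steps)[None, :] < dec_valid_lens[:, None]
+print(causal_mask)
+```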
+ +```{.python .input} +from d2l import mxnet as d2l +import math +from mxnet import autograd, np, npx +from mxnet.gluon import nn +import pandas as pd +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import math +import pandas as pd +import torch +from torch import nn +``` + +## Positionwise Feed-Forward Networks + +The positionwise feed-forward network +transforms +the representation at all the sequence positions +using the same MLP. +This is why we call it *positionwise*. +In the implementation below, +the input `X` with shape +(batch size, number of time steps or sequence length in tokens, number of hidden units or feature dimension) +will be transformed by a two-layer MLP into +an output tensor of shape +(batch size, number of time steps, `ffn_num_outputs`). + +```{.python .input} +#@save +class PositionWiseFFN(nn.Block): + def __init__(self, ffn_num_hiddens, ffn_num_outputs, **kwargs): + super(PositionWiseFFN, self).__init__(**kwargs) + self.dense1 = nn.Dense(ffn_num_hiddens, flatten=False, + activation='relu') + self.dense2 = nn.Dense(ffn_num_outputs, flatten=False) + + def forward(self, X): + return self.dense2(self.dense1(X)) +``` + +```{.python .input} +#@tab pytorch +#@save +class PositionWiseFFN(nn.Module): + def __init__(self, ffn_num_input, ffn_num_hiddens, ffn_num_outputs, + **kwargs): + super(PositionWiseFFN, self).__init__(**kwargs) + self.dense1 = nn.Linear(ffn_num_input, ffn_num_hiddens) + self.relu = nn.ReLU() + self.dense2 = nn.Linear(ffn_num_hiddens, ffn_num_outputs) + + def forward(self, X): + return self.dense2(self.relu(self.dense1(X))) +``` + +The following example +shows that the innermost dimension +of a tensor changes to +the number of outputs in +the positionwise feed-forward network. +Since the same MLP transforms +at all the positions, +when the inputs at all these positions are the same, +their outputs are also identical. + +```{.python .input} +ffn = PositionWiseFFN(4, 8) +ffn.initialize() +ffn(np.ones((2, 3, 4)))[0] +``` + +```{.python .input} +#@tab pytorch +ffn = PositionWiseFFN(4, 4, 8) +ffn.eval() +ffn(d2l.ones((2, 3, 4)))[0] +``` + +## Residual Connection and Layer Normalization + +Now let us focus on +the "add & norm" component in :numref:`fig_transformer`. +As we described at the beginning +of this section, +this is a residual connection immediately +followed by layer normalization. +Both are key to effective deep architectures. + +In :numref:`sec_batch_norm`, +we explained how batch normalization +recenters and rescales across the examples within +a minibatch. +Layer normalization is the same as batch normalization +except that the former +normalizes across the feature dimension. +Despite its pervasive applications +in computer vision, +batch normalization +is usually empirically +less effective than layer normalization +in natural language processing +tasks, whose inputs are often +variable-length sequences. + +The following code snippet +compares the normalization across different dimensions +by layer normalization and batch normalization. 
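+
+As a reference point for reading its output: for a single example $\mathbf{x} \in \mathbb{R}^d$, layer normalization in its standard form computes
+
+$$\mathrm{LN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \boldsymbol{\beta},$$
+
+where the mean $\hat{\mu}$ and variance $\hat{\sigma}^2$ are taken over the $d$ features of that single example (with learnable scale $\boldsymbol{\gamma}$ and shift $\boldsymbol{\beta}$), whereas batch normalization estimates them across the minibatch separately for each feature.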
+ +```{.python .input} +ln = nn.LayerNorm() +ln.initialize() +bn = nn.BatchNorm() +bn.initialize() +X = d2l.tensor([[1, 2], [2, 3]]) +# Compute mean and variance from `X` in the training mode +with autograd.record(): + print('layer norm:', ln(X), '\nbatch norm:', bn(X)) +``` + +```{.python .input} +#@tab pytorch +ln = nn.LayerNorm(2) +bn = nn.BatchNorm1d(2) +X = d2l.tensor([[1, 2], [2, 3]], dtype=torch.float32) +# Compute mean and variance from `X` in the training mode +print('layer norm:', ln(X), '\nbatch norm:', bn(X)) +``` + +Now we can implement the `AddNorm` class +using a residual connection followed by layer normalization. +Dropout is also applied for regularization. + +```{.python .input} +#@save +class AddNorm(nn.Block): + def __init__(self, dropout, **kwargs): + super(AddNorm, self).__init__(**kwargs) + self.dropout = nn.Dropout(dropout) + self.ln = nn.LayerNorm() + + def forward(self, X, Y): + return self.ln(self.dropout(Y) + X) +``` + +```{.python .input} +#@tab pytorch +#@save +class AddNorm(nn.Module): + def __init__(self, normalized_shape, dropout, **kwargs): + super(AddNorm, self).__init__(**kwargs) + self.dropout = nn.Dropout(dropout) + self.ln = nn.LayerNorm(normalized_shape) + + def forward(self, X, Y): + return self.ln(self.dropout(Y) + X) +``` + +The residual connection requires that +the two inputs are of the same shape +so that the output tensor also has the same shape after the addition operation. + +```{.python .input} +add_norm = AddNorm(0.5) +add_norm.initialize() +add_norm(d2l.ones((2, 3, 4)), d2l.ones((2, 3, 4))).shape +``` + +```{.python .input} +#@tab pytorch +add_norm = AddNorm([3, 4], 0.5) # Normalized_shape is input.size()[1:] +add_norm.eval() +add_norm(d2l.ones((2, 3, 4)), d2l.ones((2, 3, 4))).shape +``` + +## Encoder + +With all the essential components to assemble +the Transformer encoder, +let us start by +implementing a single layer within the encoder. +The following `EncoderBlock` class +contains two sublayers: multi-head self-attention and positionwise feed-forward networks, +where a residual connection followed by layer normalization is employed +around both sublayers. + +```{.python .input} +#@save +class EncoderBlock(nn.Block): + def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout, + use_bias=False, **kwargs): + super(EncoderBlock, self).__init__(**kwargs) + self.attention = d2l.MultiHeadAttention( + num_hiddens, num_heads, dropout, use_bias) + self.addnorm1 = AddNorm(dropout) + self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens) + self.addnorm2 = AddNorm(dropout) + + def forward(self, X, valid_lens): + Y = self.addnorm1(X, self.attention(X, X, X, valid_lens)) + return self.addnorm2(Y, self.ffn(Y)) +``` + +```{.python .input} +#@tab pytorch +#@save +class EncoderBlock(nn.Module): + def __init__(self, key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, num_heads, + dropout, use_bias=False, **kwargs): + super(EncoderBlock, self).__init__(**kwargs) + self.attention = d2l.MultiHeadAttention( + key_size, query_size, value_size, num_hiddens, num_heads, dropout, + use_bias) + self.addnorm1 = AddNorm(norm_shape, dropout) + self.ffn = PositionWiseFFN( + ffn_num_input, ffn_num_hiddens, num_hiddens) + self.addnorm2 = AddNorm(norm_shape, dropout) + + def forward(self, X, valid_lens): + Y = self.addnorm1(X, self.attention(X, X, X, valid_lens)) + return self.addnorm2(Y, self.ffn(Y)) +``` + +As we can see, +any layer in the Transformer encoder +does not change the shape of its input. 
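+
+In equations (a compact restatement of the `forward` method above, with $\mathrm{AddNorm}(\mathbf{X}, \mathbf{Y}) = \mathrm{LN}(\mathbf{X} + \mathrm{Dropout}(\mathbf{Y}))$), an encoder block computes
+
+$$\mathbf{Y} = \mathrm{AddNorm}(\mathbf{X}, \mathrm{MultiHead}(\mathbf{X}, \mathbf{X}, \mathbf{X})), \qquad \mathrm{output} = \mathrm{AddNorm}(\mathbf{Y}, \mathrm{FFN}(\mathbf{Y})),$$
+
+where both the multi-head attention and the positionwise FFN are configured to produce `num_hiddens` output features, so every intermediate tensor has the same shape as $\mathbf{X}$.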
+ +```{.python .input} +X = d2l.ones((2, 100, 24)) +valid_lens = d2l.tensor([3, 2]) +encoder_blk = EncoderBlock(24, 48, 8, 0.5) +encoder_blk.initialize() +encoder_blk(X, valid_lens).shape +``` + +```{.python .input} +#@tab pytorch +X = d2l.ones((2, 100, 24)) +valid_lens = d2l.tensor([3, 2]) +encoder_blk = EncoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5) +encoder_blk.eval() +encoder_blk(X, valid_lens).shape +``` + +In the following Transformer encoder implementation, +we stack `num_layers` instances of the above `EncoderBlock` classes. +Since we use the fixed positional encoding +whose values are always between -1 and 1, +we multiply values of the learnable input embeddings +by the square root of the embedding dimension +to rescale before summing up the input embedding and the positional encoding. + +```{.python .input} +#@save +class TransformerEncoder(d2l.Encoder): + def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, + num_heads, num_layers, dropout, use_bias=False, **kwargs): + super(TransformerEncoder, self).__init__(**kwargs) + self.num_hiddens = num_hiddens + self.embedding = nn.Embedding(vocab_size, num_hiddens) + self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout) + self.blks = nn.Sequential() + for _ in range(num_layers): + self.blks.add( + EncoderBlock(num_hiddens, ffn_num_hiddens, num_heads, dropout, + use_bias)) + + def forward(self, X, valid_lens, *args): + # Since positional encoding values are between -1 and 1, the embedding + # values are multiplied by the square root of the embedding dimension + # to rescale before they are summed up + X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens)) + self.attention_weights = [None] * len(self.blks) + for i, blk in enumerate(self.blks): + X = blk(X, valid_lens) + self.attention_weights[ + i] = blk.attention.attention.attention_weights + return X +``` + +```{.python .input} +#@tab pytorch +#@save +class TransformerEncoder(d2l.Encoder): + def __init__(self, vocab_size, key_size, query_size, value_size, + num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens, + num_heads, num_layers, dropout, use_bias=False, **kwargs): + super(TransformerEncoder, self).__init__(**kwargs) + self.num_hiddens = num_hiddens + self.embedding = nn.Embedding(vocab_size, num_hiddens) + self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout) + self.blks = nn.Sequential() + for i in range(num_layers): + self.blks.add_module("block"+str(i), + EncoderBlock(key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, + num_heads, dropout, use_bias)) + + def forward(self, X, valid_lens, *args): + # Since positional encoding values are between -1 and 1, the embedding + # values are multiplied by the square root of the embedding dimension + # to rescale before they are summed up + X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens)) + self.attention_weights = [None] * len(self.blks) + for i, blk in enumerate(self.blks): + X = blk(X, valid_lens) + self.attention_weights[ + i] = blk.attention.attention.attention_weights + return X +``` + +Below we specify hyperparameters to create a two-layer Transformer encoder. +The shape of the Transformer encoder output +is (batch size, number of time steps, `num_hiddens`). 
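+
+As an aside on the embedding rescaling above, the effect of multiplying by the
+square root of the embedding dimension can be checked numerically. The sketch
+below assumes an Xavier-style embedding initialization whose entries have
+standard deviation about $1/\sqrt{\text{num\_hiddens}}$ (an assumption for
+illustration; default initializations differ between frameworks): without
+rescaling, such embeddings would be much smaller in magnitude than the
+positional encoding values in $[-1, 1]$.
+
+```python
+# Magnitude of embeddings before and after rescaling by sqrt(num_hiddens)
+# (assumes entries with standard deviation about 1/sqrt(num_hiddens) for illustration)
+import math
+import torch
+
+num_hiddens = 24
+emb = torch.randn(10000, num_hiddens) / math.sqrt(num_hiddens)
+print(float(emb.std()))                             # about 0.2, much smaller than 1
+print(float((emb * math.sqrt(num_hiddens)).std()))  # about 1, comparable to the
+                                                    # positional encoding range
+```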
+ +```{.python .input} +encoder = TransformerEncoder(200, 24, 48, 8, 2, 0.5) +encoder.initialize() +encoder(np.ones((2, 100)), valid_lens).shape +``` + +```{.python .input} +#@tab pytorch +encoder = TransformerEncoder( + 200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5) +encoder.eval() +encoder(d2l.ones((2, 100), dtype=torch.long), valid_lens).shape +``` + +## Decoder + +As shown in :numref:`fig_transformer`, +the Transformer decoder +is composed of multiple identical layers. +Each layer is implemented in the following +`DecoderBlock` class, +which contains three sublayers: +decoder self-attention, +encoder-decoder attention, +and positionwise feed-forward networks. +These sublayers employ +a residual connection around them +followed by layer normalization. + + +As we described earlier in this section, +in the masked multi-head decoder self-attention +(the first sublayer), +queries, keys, and values +all come from the outputs of the previous decoder layer. +When training sequence-to-sequence models, +tokens at all the positions (time steps) +of the output sequence +are known. +However, +during prediction +the output sequence is generated token by token; +thus, +at any decoder time step +only the generated tokens +can be used in the decoder self-attention. +To preserve auto-regression in the decoder, +its masked self-attention +specifies `dec_valid_lens` so that +any query +only attends to +all positions in the decoder +up to the query position. + +```{.python .input} +class DecoderBlock(nn.Block): + # The `i`-th block in the decoder + def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, + dropout, i, **kwargs): + super(DecoderBlock, self).__init__(**kwargs) + self.i = i + self.attention1 = d2l.MultiHeadAttention(num_hiddens, num_heads, + dropout) + self.addnorm1 = AddNorm(dropout) + self.attention2 = d2l.MultiHeadAttention(num_hiddens, num_heads, + dropout) + self.addnorm2 = AddNorm(dropout) + self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens) + self.addnorm3 = AddNorm(dropout) + + def forward(self, X, state): + enc_outputs, enc_valid_lens = state[0], state[1] + # During training, all the tokens of any output sequence are processed + # at the same time, so `state[2][self.i]` is `None` as initialized. + # When decoding any output sequence token by token during prediction, + # `state[2][self.i]` contains representations of the decoded output at + # the `i`-th block up to the current time step + if state[2][self.i] is None: + key_values = X + else: + key_values = np.concatenate((state[2][self.i], X), axis=1) + state[2][self.i] = key_values + + if autograd.is_training(): + batch_size, num_steps, _ = X.shape + # Shape of `dec_valid_lens`: (`batch_size`, `num_steps`), where + # every row is [1, 2, ..., `num_steps`] + dec_valid_lens = np.tile(np.arange(1, num_steps + 1, ctx=X.ctx), + (batch_size, 1)) + else: + dec_valid_lens = None + + # Self-attention + X2 = self.attention1(X, key_values, key_values, dec_valid_lens) + Y = self.addnorm1(X, X2) + # Encoder-decoder attention. 
Shape of `enc_outputs`: + # (`batch_size`, `num_steps`, `num_hiddens`) + Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens) + Z = self.addnorm2(Y, Y2) + return self.addnorm3(Z, self.ffn(Z)), state +``` + +```{.python .input} +#@tab pytorch +class DecoderBlock(nn.Module): + # The `i`-th block in the decoder + def __init__(self, key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, num_heads, + dropout, i, **kwargs): + super(DecoderBlock, self).__init__(**kwargs) + self.i = i + self.attention1 = d2l.MultiHeadAttention( + key_size, query_size, value_size, num_hiddens, num_heads, dropout) + self.addnorm1 = AddNorm(norm_shape, dropout) + self.attention2 = d2l.MultiHeadAttention( + key_size, query_size, value_size, num_hiddens, num_heads, dropout) + self.addnorm2 = AddNorm(norm_shape, dropout) + self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens, + num_hiddens) + self.addnorm3 = AddNorm(norm_shape, dropout) + + def forward(self, X, state): + enc_outputs, enc_valid_lens = state[0], state[1] + # During training, all the tokens of any output sequence are processed + # at the same time, so `state[2][self.i]` is `None` as initialized. + # When decoding any output sequence token by token during prediction, + # `state[2][self.i]` contains representations of the decoded output at + # the `i`-th block up to the current time step + if state[2][self.i] is None: + key_values = X + else: + key_values = torch.cat((state[2][self.i], X), axis=1) + state[2][self.i] = key_values + if self.training: + batch_size, num_steps, _ = X.shape + # Shape of `dec_valid_lens`: (`batch_size`, `num_steps`), where + # every row is [1, 2, ..., `num_steps`] + dec_valid_lens = torch.arange( + 1, num_steps + 1, device=X.device).repeat(batch_size, 1) + else: + dec_valid_lens = None + + # Self-attention + X2 = self.attention1(X, key_values, key_values, dec_valid_lens) + Y = self.addnorm1(X, X2) + # Encoder-decoder attention. Shape of `enc_outputs`: + # (`batch_size`, `num_steps`, `num_hiddens`) + Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens) + Z = self.addnorm2(Y, Y2) + return self.addnorm3(Z, self.ffn(Z)), state +``` + +To facilitate scaled dot-product operations +in the encoder-decoder attention +and addition operations in the residual connections, +the feature dimension (`num_hiddens`) of the decoder is +the same as that of the encoder. + +```{.python .input} +decoder_blk = DecoderBlock(24, 48, 8, 0.5, 0) +decoder_blk.initialize() +X = np.ones((2, 100, 24)) +state = [encoder_blk(X, valid_lens), valid_lens, [None]] +decoder_blk(X, state)[0].shape +``` + +```{.python .input} +#@tab pytorch +decoder_blk = DecoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5, 0) +decoder_blk.eval() +X = d2l.ones((2, 100, 24)) +state = [encoder_blk(X, valid_lens), valid_lens, [None]] +decoder_blk(X, state)[0].shape +``` + +Now we construct the entire Transformer decoder +composed of `num_layers` instances of `DecoderBlock`. +In the end, +a fully-connected layer computes the prediction +for all the `vocab_size` possible output tokens. +Both of the decoder self-attention weights +and the encoder-decoder attention weights +are stored for later visualization. 
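+
+Before assembling the full decoder below, it is worth making the causal
+structure encoded by `dec_valid_lens` explicit. The sketch below expands the
+valid lengths used during training into the equivalent boolean mask over
+(query position, key position); the d2l attention functions consume the valid
+lengths directly rather than such a matrix.
+
+```python
+# The causal mask implied by `dec_valid_lens`: query position i may attend to
+# key positions 1, ..., i (a small sketch; the attention code above works with
+# the valid lengths directly instead of building this matrix)
+import torch
+
+num_steps = 5
+dec_valid_lens = torch.arange(1, num_steps + 1)  # [1, 2, ..., num_steps] for one sequence
+mask = torch.arange(num_steps)[None, :] < dec_valid_lens[:, None]
+print(mask.int())
+# tensor([[1, 0, 0, 0, 0],
+#         [1, 1, 0, 0, 0],
+#         [1, 1, 1, 0, 0],
+#         [1, 1, 1, 1, 0],
+#         [1, 1, 1, 1, 1]])
+```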
+ +```{.python .input} +class TransformerDecoder(d2l.AttentionDecoder): + def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, + num_heads, num_layers, dropout, **kwargs): + super(TransformerDecoder, self).__init__(**kwargs) + self.num_hiddens = num_hiddens + self.num_layers = num_layers + self.embedding = nn.Embedding(vocab_size, num_hiddens) + self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout) + self.blks = nn.Sequential() + for i in range(num_layers): + self.blks.add( + DecoderBlock(num_hiddens, ffn_num_hiddens, num_heads, + dropout, i)) + self.dense = nn.Dense(vocab_size, flatten=False) + + def init_state(self, enc_outputs, enc_valid_lens, *args): + return [enc_outputs, enc_valid_lens, [None] * self.num_layers] + + def forward(self, X, state): + X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens)) + self._attention_weights = [[None] * len(self.blks) for _ in range (2)] + for i, blk in enumerate(self.blks): + X, state = blk(X, state) + # Decoder self-attention weights + self._attention_weights[0][ + i] = blk.attention1.attention.attention_weights + # Encoder-decoder attention weights + self._attention_weights[1][ + i] = blk.attention2.attention.attention_weights + return self.dense(X), state + + @property + def attention_weights(self): + return self._attention_weights +``` + +```{.python .input} +#@tab pytorch +class TransformerDecoder(d2l.AttentionDecoder): + def __init__(self, vocab_size, key_size, query_size, value_size, + num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens, + num_heads, num_layers, dropout, **kwargs): + super(TransformerDecoder, self).__init__(**kwargs) + self.num_hiddens = num_hiddens + self.num_layers = num_layers + self.embedding = nn.Embedding(vocab_size, num_hiddens) + self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout) + self.blks = nn.Sequential() + for i in range(num_layers): + self.blks.add_module("block"+str(i), + DecoderBlock(key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, + num_heads, dropout, i)) + self.dense = nn.Linear(num_hiddens, vocab_size) + + def init_state(self, enc_outputs, enc_valid_lens, *args): + return [enc_outputs, enc_valid_lens, [None] * self.num_layers] + + def forward(self, X, state): + X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens)) + self._attention_weights = [[None] * len(self.blks) for _ in range (2)] + for i, blk in enumerate(self.blks): + X, state = blk(X, state) + # Decoder self-attention weights + self._attention_weights[0][ + i] = blk.attention1.attention.attention_weights + # Encoder-decoder attention weights + self._attention_weights[1][ + i] = blk.attention2.attention.attention_weights + return self.dense(X), state + + @property + def attention_weights(self): + return self._attention_weights +``` + +## Training + +Let us instantiate an encoder-decoder model +by following the Transformer architecture. +Here we specify that +both the Transformer encoder and the Transformer decoder +have 2 layers using 4-head attention. +Similar to :numref:`sec_seq2seq_training`, +we train the Transformer model +for sequence to sequence learning on the English-French machine translation dataset. 
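+
+One more decoder detail is worth keeping in mind before moving on: training
+processes all target positions in parallel under the causal mask, whereas the
+translations produced after training are generated token by token, and each
+`DecoderBlock` then extends `state[2][i]` with the representation of the newly
+decoded position so that self-attention covers the whole prefix. A schematic
+sketch of that cache growth with dummy tensors (the value `num_hiddens = 4` is
+arbitrary):
+
+```python
+# Growth of the per-block key-value cache used at prediction time
+# (a schematic sketch with dummy tensors mirroring `state[2][self.i]`)
+import torch
+
+num_hiddens = 4
+cache = None  # plays the role of `state[2][i]` for one decoder block
+for step in range(3):                   # decode three tokens, one at a time
+    X = torch.randn(1, 1, num_hiddens)  # representation of the current token
+    cache = X if cache is None else torch.cat((cache, X), dim=1)
+    print(f'step {step}: self-attention sees {cache.shape[1]} position(s)')
+# step 0: self-attention sees 1 position(s)
+# step 1: self-attention sees 2 position(s)
+# step 2: self-attention sees 3 position(s)
+```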
+ +```{.python .input} +num_hiddens, num_layers, dropout, batch_size, num_steps = 32, 2, 0.1, 64, 10 +lr, num_epochs, device = 0.005, 200, d2l.try_gpu() +ffn_num_hiddens, num_heads = 64, 4 + +train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps) + +encoder = TransformerEncoder( + len(src_vocab), num_hiddens, ffn_num_hiddens, num_heads, num_layers, + dropout) +decoder = TransformerDecoder( + len(tgt_vocab), num_hiddens, ffn_num_hiddens, num_heads, num_layers, + dropout) +net = d2l.EncoderDecoder(encoder, decoder) +d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device) +``` + +```{.python .input} +#@tab pytorch +num_hiddens, num_layers, dropout, batch_size, num_steps = 32, 2, 0.1, 64, 10 +lr, num_epochs, device = 0.005, 200, d2l.try_gpu() +ffn_num_input, ffn_num_hiddens, num_heads = 32, 64, 4 +key_size, query_size, value_size = 32, 32, 32 +norm_shape = [32] + +train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps) + +encoder = TransformerEncoder( + len(src_vocab), key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, num_heads, + num_layers, dropout) +decoder = TransformerDecoder( + len(tgt_vocab), key_size, query_size, value_size, num_hiddens, + norm_shape, ffn_num_input, ffn_num_hiddens, num_heads, + num_layers, dropout) +net = d2l.EncoderDecoder(encoder, decoder) +d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device) +``` + +After training, +we use the Transformer model +to translate a few English sentences into French and compute their BLEU scores. + +```{.python .input} +#@tab all +engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .'] +fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .'] +for eng, fra in zip(engs, fras): + translation, dec_attention_weight_seq = d2l.predict_seq2seq( + net, eng, src_vocab, tgt_vocab, num_steps, device, True) + print(f'{eng} => {translation}, ', + f'bleu {d2l.bleu(translation, fra, k=2):.3f}') +``` + +Let us visualize the Transformer attention weights when translating the last English sentence into French. +The shape of the encoder self-attention weights +is (number of encoder layers, number of attention heads, `num_steps` or number of queries, `num_steps` or number of key-value pairs). + +```{.python .input} +#@tab all +enc_attention_weights = d2l.reshape( + d2l.concat(net.encoder.attention_weights, 0), + (num_layers, num_heads, -1, num_steps)) +enc_attention_weights.shape +``` + +In the encoder self-attention, +both queries and keys come from the same input sequence. +Since padding tokens do not carry meaning, +with specified valid length of the input sequence, +no query attends to positions of padding tokens. +In the following, +two layers of multi-head attention weights +are presented row by row. +Each head independently attends +based on a separate representation subspaces of queries, keys, and values. + +```{.python .input} +d2l.show_heatmaps( + enc_attention_weights, xlabel='Key positions', ylabel='Query positions', + titles=['Head %d' % i for i in range(1, 5)], figsize=(7, 3.5)) +``` + +```{.python .input} +#@tab pytorch +d2l.show_heatmaps( + enc_attention_weights.cpu(), xlabel='Key positions', + ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)], + figsize=(7, 3.5)) +``` + +To visualize both the decoder self-attention weights and the encoder-decoder attention weights, +we need more data manipulations. +For example, +we fill the masked attention weights with zero. 
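+
+The zero filling is needed because the decoder attention weights are collected
+step by step during prediction: at step $t$ the decoder self-attention only
+covers the $t$ positions generated so far, so the collected rows have ragged
+lengths. A toy illustration of the padding, mirroring the `fillna(0.0)` call
+used below (the weight values here are hypothetical):
+
+```python
+# Padding ragged attention-weight rows with zeros before stacking them
+# (a toy illustration; the weight values are made up)
+import pandas as pd
+
+ragged = [[1.0], [0.7, 0.3], [0.2, 0.5, 0.3]]  # e.g., self-attention at steps 1, 2, 3
+padded = pd.DataFrame(ragged).fillna(0.0).values
+print(padded)
+# [[1.  0.  0. ]
+#  [0.7 0.3 0. ]
+#  [0.2 0.5 0.3]]
+```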
+Note that +the decoder self-attention weights +and the encoder-decoder attention weights +both have the same queries: +the beginning-of-sequence token followed by +the output tokens. + +```{.python .input} +dec_attention_weights_2d = [d2l.tensor(head[0]).tolist() + for step in dec_attention_weight_seq + for attn in step for blk in attn for head in blk] +dec_attention_weights_filled = d2l.tensor( + pd.DataFrame(dec_attention_weights_2d).fillna(0.0).values) +dec_attention_weights = d2l.reshape(dec_attention_weights_filled, + (-1, 2, num_layers, num_heads, num_steps)) +dec_self_attention_weights, dec_inter_attention_weights = \ + dec_attention_weights.transpose(1, 2, 3, 0, 4) +dec_self_attention_weights.shape, dec_inter_attention_weights.shape +``` + +```{.python .input} +#@tab pytorch +dec_attention_weights_2d = [head[0].tolist() + for step in dec_attention_weight_seq + for attn in step for blk in attn for head in blk] +dec_attention_weights_filled = d2l.tensor( + pd.DataFrame(dec_attention_weights_2d).fillna(0.0).values) +dec_attention_weights = d2l.reshape(dec_attention_weights_filled, + (-1, 2, num_layers, num_heads, num_steps)) +dec_self_attention_weights, dec_inter_attention_weights = \ + dec_attention_weights.permute(1, 2, 3, 0, 4) +dec_self_attention_weights.shape, dec_inter_attention_weights.shape +``` + +Due to the auto-regressive property of the decoder self-attention, +no query attends to key-value pairs after the query position. + +```{.python .input} +#@tab all +# Plus one to include the beginning-of-sequence token +d2l.show_heatmaps( + dec_self_attention_weights[:, :, :, :len(translation.split()) + 1], + xlabel='Key positions', ylabel='Query positions', + titles=['Head %d' % i for i in range(1, 5)], figsize=(7, 3.5)) +``` + +Similar to the case in the encoder self-attention, +via the specified valid length of the input sequence, +no query from the output sequence +attends to those padding tokens from the input sequence. + +```{.python .input} +#@tab all +d2l.show_heatmaps( + dec_inter_attention_weights, xlabel='Key positions', + ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)], + figsize=(7, 3.5)) +``` + +Although the Transformer architecture +was originally proposed for sequence-to-sequence learning, +as we will discover later in the book, +either the Transformer encoder +or the Transformer decoder +is often individually used +for different deep learning tasks. + + +## Summary + +* The Transformer is an instance of the encoder-decoder architecture, though either the encoder or the decoder can be used individually in practice. +* In the Transformer, multi-head self-attention is used for representing the input sequence and the output sequence, though the decoder has to preserve the auto-regressive property via a masked version. +* Both the residual connections and the layer normalization in the Transformer are important for training a very deep model. +* The positionwise feed-forward network in the Transformer model transforms the representation at all the sequence positions using the same MLP. + + +## Exercises + +1. Train a deeper Transformer in the experiments. How does it affect the training speed and the translation performance? +1. Is it a good idea to replace scaled dot-product attention with additive attention in the Transformer? Why? +1. For language modeling, should we use the Transformer encoder, decoder, or both? How to design this method? +1. What can be challenges to Transformers if input sequences are very long? Why? +1. 
How can we improve the computational and memory efficiency of Transformers? Hint: you may refer to the survey paper by Tay et al. :cite:`Tay.Dehghani.Bahri.ea.2020`.
+1. How can we design Transformer-based models for image classification tasks without using CNNs? Hint: you may refer to the Vision Transformer :cite:`Dosovitskiy.Beyer.Kolesnikov.ea.2021`.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/348)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/1066)
+:end_tab: