<p>This file holds the implementations of the core modules of Capsule Networks.</p>
<p>I used <a href="https://github.com/jindongwang/Pytorch-CapsuleNet">jindongwang/Pytorch-CapsuleNet</a> to clarify some of the confusion I had with the paper.</p>
<p>Here's a notebook for training a Capsule Network on the MNIST dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/capsule_networks/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/e7c08e08586711ebb3e30242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<h1>Train a <a href="index.html">ConvMixer</a> on CIFAR 10</h1>
<p>This script trains a ConvMixer on the CIFAR-10 dataset.</p>
<p>This is not an attempt to reproduce the results of the paper. The paper uses image augmentations from <a href="https://github.com/rwightman/pytorch-image-models">PyTorch Image Models (timm)</a> for training. We haven't done this for simplicity, which causes our validation accuracy to drop.</p>
<p>ConvMixer is similar to <a href="https://nn.labml.ai/transformers/mlp_mixer/index.html">MLP-Mixer</a>. MLP-Mixer separates the mixing of spatial and channel dimensions by applying an MLP across the spatial dimension and then an MLP across the channel dimension (the spatial MLP replaces the <a href="https://nn.labml.ai/transformers/vit/index.html">ViT</a> attention and the channel MLP is the <a href="https://nn.labml.ai/transformers/feed_forward.html">FFN</a> of ViT).</p>
<p>ConvMixer uses a 1x1 convolution for channel mixing and a depth-wise convolution for spatial mixing. Since it's a convolution instead of a full MLP across the space, it mixes only nearby patches, in contrast to ViT or MLP-Mixer. Also, MLP-Mixer uses a two-layer MLP for each mixing while ConvMixer uses a single layer for each mixing.</p>
<p>The paper recommends removing the residual connection across the channel mixing (point-wise convolution) and having only a residual connection over the spatial mixing (depth-wise convolution). They also use <a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch normalization</a> instead of <a href="../normalization/layer_norm/index.html">Layer normalization</a>.</p>
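<p>To make the structure concrete, here is a minimal sketch of a single ConvMixer-style block along the lines described above (the class and argument names are ours, not from the paper's reference code):</p>
<pre><code class="language-python">import torch
from torch import nn

class ConvMixerBlock(nn.Module):
    """Depth-wise conv (spatial mixing) with a residual connection,
    followed by a 1x1 conv (channel mixing) without one."""
    def __init__(self, d: int, kernel_size: int = 7):
        super().__init__()
        # Depth-wise convolution mixes nearby patches, one kernel per channel
        self.depth_wise = nn.Conv2d(d, d, kernel_size, groups=d, padding='same')
        self.norm1 = nn.BatchNorm2d(d)
        # Point-wise (1x1) convolution mixes channels
        self.point_wise = nn.Conv2d(d, d, kernel_size=1)
        self.norm2 = nn.BatchNorm2d(d)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection only across the spatial mixing
        x = x + self.norm1(self.act(self.depth_wise(x)))
        # No residual connection across the channel mixing
        return self.norm2(self.act(self.point_wise(x)))
</code></pre>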
<p>Here's <a href="https://nn.labml.ai/conv_mixer/experiment.html">an experiment</a> that trains ConvMixer on CIFAR-10.</p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation/tutorial of the paper <a href="https://papers.labml.ai/paper/1503.02531">Distilling the Knowledge in a Neural Network</a>.</p>
<p>It's a way of training a small network using the knowledge in a trained larger network; i.e. distilling the knowledge from the large network.</p>
<p>A large model with regularization or an ensemble of models (using dropout) generalizes better than a small model when trained directly on the data and labels. However, a small model can be trained to generalize better with the help of a large model. Smaller models are better in production: they are faster and need less compute and memory.</p>
<p>The output probabilities of a trained model give more information than the labels because it assigns non-zero probabilities to incorrect classes as well. These probabilities tell us that a sample has a chance of belonging to certain classes. For instance, when given an image of the digit <em>7</em>, a well-generalized digit classifier will give a high probability to 7 and a small but non-zero probability to 2, while assigning almost zero probability to other digits. Distillation uses this information to train a small model better.</p>
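<p>As a rough sketch of how this is commonly done, the student can be trained on a weighted combination of a soft-target loss against the teacher's temperature-scaled probabilities and the usual cross-entropy against the labels (the function and parameter names below are illustrative, not the exact loss used in this implementation):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 5.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soften both distributions with the temperature
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened distributions;
    # scaled by T^2 so its gradients stay comparable to the hard loss
    soft_loss = F.kl_div(log_probs, soft_targets, reduction='batchmean') * temperature ** 2
    # Usual cross-entropy on the true labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
</code></pre>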
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/1710.10903">Graph Attention Networks</a>.</p>
<p>GATs work on graph data. A graph consists of nodes and edges connecting nodes. For example, in the Cora dataset the nodes are research papers and the edges are citations that connect the papers.</p>
<p>GAT uses masked self-attention, similar to <a href="https://nn.labml.ai/transformers/mha.html">transformers</a>. GAT consists of graph attention layers stacked on top of each other. Each graph attention layer gets node embeddings as inputs and outputs transformed embeddings. Each node's embedding pays attention to the embeddings of the other nodes it is connected to. The details of graph attention layers are included alongside the implementation.</p>
<p>Here is <a href="https://nn.labml.ai/graphs/gat/experiment.html">the training code</a> for training a two-layer GAT on the Cora dataset.</p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the GATv2 operator from the paper <a href="https://papers.labml.ai/paper/2105.14491">How Attentive are Graph Attention Networks?</a>.</p>
<p>GATv2s work on graph data. A graph consists of nodes and edges connecting nodes. For example, in the Cora dataset the nodes are research papers and the edges are citations that connect the papers.</p>
<p>The GATv2 operator fixes the static attention problem of the standard GAT: since the linear layers in the standard GAT are applied right after each other, the ranking of attended nodes is unconditioned on the query node. In contrast, GATv2 applies the non-linearity before the attention vector, so the ranking depends on the query node and every node can attend to any other node.</p>
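<p>A minimal sketch of the difference in the attention score computation, assuming a single head and a weight matrix shared between the query and key nodes (dimensions and names are ours, for illustration only):</p>
<pre><code class="language-python">import torch
from torch import nn
import torch.nn.functional as F

# Hypothetical dimensions for illustration
in_features, hidden = 8, 16
W = nn.Linear(in_features, hidden, bias=False)  # shared node transform
a = nn.Linear(2 * hidden, 1, bias=False)        # attention vector

def gat_score(h_i, h_j):
    # GAT: LeakyReLU is applied *after* a, so a and W collapse into a single
    # linear map and the ranking over j is the same for every query i (static)
    return F.leaky_relu(a(torch.cat([W(h_i), W(h_j)], dim=-1)), negative_slope=0.2)

def gatv2_score(h_i, h_j):
    # GATv2: LeakyReLU is applied *before* a, making the score a joint
    # non-linear function of the query and the key (dynamic attention)
    return a(F.leaky_relu(torch.cat([W(h_i), W(h_j)], dim=-1), negative_slope=0.2))
</code></pre>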
<p>Here is <a href="https://nn.labml.ai/graphs/gatv2/experiment.html">the training code</a> for training a two-layer GATv2 on the Cora dataset.</p>
<p>This is a collection of simple PyTorch implementations of neural networks and related algorithms. <a href="https://github.com/labmlai/annotated_deep_learning_paper_implementations">These implementations</a> are documented with explanations, and the <a href="index.html">website</a> renders them as side-by-side formatted notes. We believe these will help you understand the algorithms better.</p>
<p><img alt="Screenshot" src="dqn-light.png"></p>
<p>We are actively maintaining this repo and adding new implementations. Follow us on <a href="https://twitter.com/labmlai"><img alt="Twitter" src="https://img.shields.io/twitter/follow/labmlai?style=social"></a> for updates.</p>
<p>We need to know E[x^(k)] and Var[x^(k)] in order to perform the normalization. So during inference, you either need to go through the whole (or part of the) dataset and find the mean and variance, or you can use an estimate calculated during training. The usual practice is to calculate an exponential moving average of the mean and variance during the training phase and use that for inference.</p>
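<p>Here is a minimal sketch of batch normalization that keeps exponential moving averages of the statistics for use at inference time; it follows the standard recipe rather than the exact code in this implementation:</p>
<pre><code class="language-python">import torch
from torch import nn

class SimpleBatchNorm(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-5, momentum: float = 0.1):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.scale = nn.Parameter(torch.ones(channels))
        self.shift = nn.Parameter(torch.zeros(channels))
        # Running estimates used at inference time
        self.register_buffer('exp_mean', torch.zeros(channels))
        self.register_buffer('exp_var', torch.ones(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_shape = x.shape  # [batch_size, channels, ...]
        x = x.view(x_shape[0], x_shape[1], -1)
        if self.training:
            # Statistics over the batch and spatial dimensions
            mean = x.mean(dim=[0, 2])
            var = x.var(dim=[0, 2], unbiased=False)
            # Update the exponential moving averages
            self.exp_mean = (1 - self.momentum) * self.exp_mean + self.momentum * mean.detach()
            self.exp_var = (1 - self.momentum) * self.exp_var + self.momentum * var.detach()
        else:
            # Use the estimates collected during training
            mean, var = self.exp_mean, self.exp_var
        x = (x - mean.view(1, -1, 1)) / torch.sqrt(var.view(1, -1, 1) + self.eps)
        x = self.scale.view(1, -1, 1) * x + self.shift.view(1, -1, 1)
        return x.view(x_shape)
</code></pre>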
<p>Here's <a href="mnist.html">the training code</a> and a notebook for training a CNN classifier that uses batch normalization for the MNIST dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/011254fe647011ebbb8e0242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p><a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a> works well for large enough batch sizes but not for small batch sizes, because it normalizes over the batch. Training large models with large batch sizes is often not possible due to the memory capacity of the devices.</p>
<p>This paper introduces Group Normalization, which normalizes a set of features together as a group. This is based on the observation that classical features such as <a href="https://en.wikipedia.org/wiki/Scale-invariant_feature_transform">SIFT</a> and <a href="https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients">HOG</a> are group-wise features. The paper proposes dividing feature channels into groups and then separately normalizing all channels within each group.</p>
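<p>A minimal sketch of group normalization, assuming the input has shape <code>[batch_size, channels, ...]</code> (PyTorch also ships <code>nn.GroupNorm</code>, which implements the same idea):</p>
<pre><code class="language-python">import torch
from torch import nn

class SimpleGroupNorm(nn.Module):
    def __init__(self, groups: int, channels: int, eps: float = 1e-5):
        super().__init__()
        assert channels % groups == 0
        self.groups, self.eps = groups, eps
        self.scale = nn.Parameter(torch.ones(channels))
        self.shift = nn.Parameter(torch.zeros(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_shape = x.shape  # [batch_size, channels, ...]
        batch_size = x_shape[0]
        # Normalize within each group of channels (and the spatial dimensions)
        x = x.view(batch_size, self.groups, -1)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x = (x - mean) / torch.sqrt(var + self.eps)
        # Scale and shift per channel
        x = x.view(batch_size, x_shape[1], -1)
        x = self.scale.view(1, -1, 1) * x + self.shift.view(1, -1, 1)
        return x.view(x_shape)
</code></pre>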
<p>Here's a <a href="https://nn.labml.ai/normalization/group_norm/experiment.html">CIFAR 10 classification model</a> that uses group normalization.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/normalization/group_norm/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/081d950aa4e011eb8f9f0242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>This is the synthetic experiment with <em>Adam</em>. <a href="https://app.labml.ai/run/61ebfdaa384411eb94d8acde48001122">Here are the results</a>. You can see that Adam converges at x = +1.</p>
<p>This is the synthetic experiment with <em>AMSGrad</em>. <a href="https://app.labml.ai/run/uuid=bc06405c384411eb8b82acde48001122">Here are the results</a>. You can see that AMSGrad converges to the true optimum x = -1.</p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/1312.5602">Playing Atari with Deep Reinforcement Learning</a> along with <a href="https://nn.labml.ai/rl/dqn/model.html">Dueling Network</a>, <a href="https://nn.labml.ai/rl/dqn/replay_buffer.html">Prioritized Replay</a> and Double Q Network.</p>
<p>Here is the <a href="https://nn.labml.ai/rl/dqn/experiment.html">experiment</a> and <a href="https://nn.labml.ai/rl/dqn/model.html">model</a> implementation.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of <a href="https://papers.labml.ai/paper/1707.06347">Proximal Policy Optimization - PPO</a>.</p>
<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps on a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample the data. It does so by clipping the gradient flow if the updated policy is not close to the policy used to sample the data.</p>
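<p>The core of this is the clipped surrogate objective. A minimal sketch (function and argument names are ours) looks like this:</p>
<pre><code class="language-python">import torch

def ppo_clip_loss(log_pi: torch.Tensor,
                  sampled_log_pi: torch.Tensor,
                  advantage: torch.Tensor,
                  clip: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the policy
    # that was used to sample the data
    ratio = torch.exp(log_pi - sampled_log_pi)
    # Clip the ratio to [1 - eps, 1 + eps]
    clipped_ratio = ratio.clamp(min=1.0 - clip, max=1.0 + clip)
    # Take the pessimistic (minimum) of the clipped and unclipped objectives
    objective = torch.min(ratio * advantage, clipped_ratio * advantage)
    # Return a loss (negative objective) to minimize
    return -objective.mean()
</code></pre>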
<p>You can find an experiment that uses it <a href="https://nn.labml.ai/rl/ppo/experiment.html">here</a>. The experiment uses <a href="https://nn.labml.ai/rl/ppo/gae.html">Generalized Advantage Estimation</a>.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/6eff28a0910e11eb9b008db315936e2f"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2105.14103">An Attention Free Transformer</a>.</p>
<p>This paper replaces the <a href="https://nn.labml.ai/transformers/mha.html">self-attention layer</a> with a new efficient operation that has memory complexity of O(Td), where T is the sequence length and d is the dimensionality of the embeddings.</p>
<p>The paper introduces AFT along with AFT-local and AFT-conv. Here we have implemented AFT-local, which pays attention to nearby tokens in an autoregressive model.</p>
<p>Since training the compression with BPTT requires maintaining a very large computational graph (many time steps), the paper proposes an <em>auto-encoding loss</em> and an <em>attention reconstruction loss</em>. The auto-encoding loss decodes the original memories from the compressed memories and calculates the loss. The attention reconstruction loss computes the multi-headed attention results on the compressed memory and on the uncompressed memory and takes the mean squared error between them. We have implemented the latter here since it gives better results.</p>
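<p>A rough sketch of the attention reconstruction loss, assuming <code>attn</code> is a multi-head attention module that takes <code>(query, key, value)</code> and returns the attended output; in the actual method, gradients into the attention parameters and the original memories are stopped so that only the compression function is trained, which we only indicate here:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def attention_reconstruction_loss(attn, h, mem, compressed_mem):
    # Attention over the original (uncompressed) memories is the target;
    # no gradients flow into it
    with torch.no_grad():
        target = attn(h, mem, mem)
    # Attention over the compressed memories; gradients flow back into the
    # compression function through `compressed_mem`
    reconstruction = attn(h, compressed_mem, compressed_mem)
    # Mean squared error between the two attention results
    return F.mse_loss(reconstruction, target)
</code></pre>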
<p>This implementation uses pre-layer normalization while the paper uses post-layer normalization. Pre-layer norm does the layer norm before <a href="../feedforward.html">FFN</a> and self-attention, and the pass-through in the residual connection is not normalized. This is supposed to be more stable in standard transformer setups.</p>
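<p>A minimal sketch of a pre-layer-norm residual block, assuming <code>attn</code> and <code>ffn</code> are sub-layers that map a tensor to a tensor of the same shape:</p>
<pre><code class="language-python">import torch
from torch import nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize before self-attention; the residual path (x) is not normalized
        x = x + self.attn(self.norm_attn(x))
        # Normalize before the FFN; again the pass-through is untouched
        return x + self.ffn(self.norm_ffn(x))
</code></pre>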
<p>Here are <a href="https://nn.labml.ai/transformers/compressive/experiment.html">the training code</a> and a notebook for training a compressive transformer model on the Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/compressive/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/0d9b5338726c11ebb7c80242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>This is an annotated PyTorch implementation of the paper <a href="https://papers.labml.ai/paper/2102.11174">Linear Transformers Are Secretly Fast Weight Memory Systems</a>.</p>
<p>Here is the <a href="https://nn.labml.ai/transformers/fast_weights/index.html">annotated implementation</a>. Here are <a href="https://nn.labml.ai/transformers/fast_weights/experiment.html">the training code</a> and a notebook for training a fast weights transformer on the Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/fast_weights/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/928aadc0846c11eb85710242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>The updated feedback transformer shares weights used to calculate keys and values among the layers. We then calculate the keys and values for each step only once and keep them cached. The <a href="#shared_kv">second half</a> of this file implements this. We implemented a custom PyTorch function to improve performance.</p>
<p>Here's <a href="experiment.html">the training code</a> and a notebook for training a feedback transformer on the Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/feedback/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/d8eb9416530a11eb8fb50242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<ul><li><a href="glu_variants/experiment.html">Experiment that uses <code class="highlight"><span class="n">labml</span><span class="o">.</span><span class="n">configs</span></code></a></li>
<li><a href="glu_variants/simple.html">Simpler version from scratch</a></li></ul>
<h1><a href="https://nn.labml.ai/transformers/gmlp/index.html">Pay Attention to MLPs (gMLP)</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2105.08050">Pay Attention to MLPs</a>.</p>
<p>This paper introduces a Multilayer Perceptron (MLP) based architecture with gating, which they name <strong>gMLP</strong>. It consists of a stack of L <em>gMLP</em> blocks.</p>
<p>Here is <a href="https://nn.labml.ai/transformers/gmlp/experiment.html">the training code</a> for a gMLP-based autoregressive model.</p>
and <code class="highlight"><span class="n">hom</span></code>
to predict <code class="highlight"><span class="n">e</span></code>
and so on. So the model will start predicting with a shorter context first and then learn to use longer contexts later. Since MLMs have this problem, it's a lot faster to train if you start with a smaller sequence length initially and then use a longer sequence length later.</p>
<p>Here is <a href="https://nn.labml.ai/transformers/mlm/experiment.html">the training code</a> for a simple MLM model.</p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2105.01601">MLP-Mixer: An all-MLP Architecture for Vision</a>.</p>
<p>This paper applies the model to vision tasks. The model is similar to a transformer, with the attention layer replaced by an MLP that is applied across the patches (or tokens in the case of an NLP task).</p>
<p>Our implementation of MLP-Mixer is a drop-in replacement for the <a href="https://nn.labml.ai/transformers/mha.html">self-attention layer</a> in <a href="https://nn.labml.ai/transformers/models.html">our transformer implementation</a>. So it's just a couple of lines of code, transposing the tensor to apply the MLP across the sequence dimension.</p>
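<p>A minimal sketch of that idea, assuming inputs of shape <code>[seq_len, batch_size, d_model]</code> (names and dimension ordering are ours, for illustration):</p>
<pre><code class="language-python">import torch
from torch import nn

class SpatialMLP(nn.Module):
    """Token-mixing step: transpose so that the MLP acts across the
    sequence (patch) dimension instead of the channel dimension."""
    def __init__(self, seq_len: int, d_ffn: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(seq_len, d_ffn),
            nn.GELU(),
            nn.Linear(d_ffn, seq_len),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, d_model]
        x = x.permute(2, 1, 0)     # -> [d_model, batch_size, seq_len]
        x = self.mlp(x)            # mix along the sequence dimension
        return x.permute(2, 1, 0)  # back to [seq_len, batch_size, d_model]
</code></pre>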
<p>Although the paper applied MLP Mixer on vision tasks, we tried it on a <a href="https://nn.labml.ai/transformers/mlm/index.html">masked language model</a>. <a href="https://nn.labml.ai/transformers/mlp_mixer/experiment.html">Here is the experiment code</a>.</p>
<p>The most effective modification found by the search is using a squared ReLU instead of ReLU in the <a href="https://nn.labml.ai/transformers/feed_forward.html">position-wise feedforward module</a>.</p>
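<p>As a quick sketch, the activation is simply ReLU followed by squaring:</p>
<pre><code class="language-python">import torch
from torch import nn

class SquaredReLU(nn.Module):
    """Squared ReLU activation: relu(x) ** 2."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) ** 2
</code></pre>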
<h3>Multi-DConv-Head Attention (MDHA)</h3>
<p>The next effective modification is a depth-wise 3×1 convolution after the multi-head projection for queries, keys, and values. The convolution is along the sequence dimension and per channel (depth-wise). To be clear, if the number of channels in each head is d_k, the convolution will have a 3×1 kernel for each of the d_k channels.</p>
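<p>A minimal sketch of that depth-wise convolution, assuming the projected queries, keys, or values have shape <code>[seq_len, batch_size, heads, d_k]</code>; the left padding plus the crop at the end keeps it causal (names are ours):</p>
<pre><code class="language-python">import torch
from torch import nn

class SpatialDepthWiseConvolution(nn.Module):
    def __init__(self, d_k: int, kernel_size: int = 3):
        super().__init__()
        # One 1D kernel per channel (groups=d_k makes it depth-wise)
        self.conv = nn.Conv1d(d_k, d_k, kernel_size, padding=kernel_size - 1, groups=d_k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, heads, d_k]
        seq_len, batch_size, heads, d_k = x.shape
        # Fold batch and heads together; put channels before the sequence dim
        x = x.permute(1, 2, 3, 0).reshape(batch_size * heads, d_k, seq_len)
        x = self.conv(x)
        # Crop the extra positions at the end so each output only sees the past
        x = x[:, :, :seq_len]
        # Restore the original layout
        return x.reshape(batch_size, heads, d_k, seq_len).permute(3, 0, 1, 2)
</code></pre>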
<p><a href="https://nn.labml.ai/transformers/primer_ez/experiment.html">Here is the experiment code</a> for Primer EZ.</p>
<p>The Switch Transformer uses different parameters for each token by switching among parameters based on the token. Therefore, only a fraction of parameters are chosen for each token. So you can have more parameters but less computational cost.</p>
<p>The switching happens at the position-wise feedforward network (FFN) of each transformer block. The position-wise feedforward network consists of two fully connected layers applied sequentially. In the switch transformer we have multiple FFNs (multiple experts), and we choose which one to use based on a router. The router outputs a set of probabilities for picking an FFN, and we pick the one with the highest probability and only evaluate that. So essentially the computational cost is the same as having a single FFN. In our implementation this doesn't parallelize well when you have many or large FFNs since it all happens on a single GPU. In a distributed setup you would have each FFN (each very large) on a different device.</p>
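<p>A simplified sketch of the routing, ignoring capacity limits, token dropping and the load balancing loss (names are ours):</p>
<pre><code class="language-python">import torch
from torch import nn

class SwitchFFN(nn.Module):
    def __init__(self, d_model: int, experts: list):
        super().__init__()
        self.experts = nn.ModuleList(experts)        # each expert is an FFN
        self.router = nn.Linear(d_model, len(experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [n_tokens, d_model]
        route_prob = torch.softmax(self.router(x), dim=-1)
        # Send each token to its highest-probability expert only
        prob, expert_idx = route_prob.max(dim=-1)
        out = x.new_zeros(x.shape)
        # Evaluate each expert only on the tokens routed to it
        for i, expert in enumerate(self.experts):
            idx = (expert_idx == i).nonzero(as_tuple=True)[0]
            if len(idx) > 0:
                out[idx] = expert(x[idx])
        # Scale by the routing probability so the router receives gradients
        return out * prob.unsqueeze(-1)
</code></pre>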
<p>The paper introduces another loss term to balance load among the experts (FFNs) and discusses dropping tokens when routing is not balanced.</p>
<p>Here's <a href="experiment.html">the training code</a> and a notebook for training a switch transformer on the Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/switch/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/353770ce177c11ecaa5fb74452424f46"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>Annotated implementation of relative multi-headed attention is in <a href="https://nn.labml.ai/transformers/xl/relative_mha.html"><code class="highlight"><span class="n">relative_mha</span><span class="o">.</span><span class="n">py</span></code>
</a>.</p>
<p>Here's <a href="https://nn.labml.ai/transformers/xl/experiment.html">the training code</a> and a notebook for training a transformer XL model on the Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/xl/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/d3b6760c692e11ebb6a70242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<h1><a href="https://nn.labml.ai/uncertainty/evidence/index.html">Evidential Deep Learning to Quantify Classification Uncertainty</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/1806.01768">Evidential Deep Learning to Quantify Classification Uncertainty</a>.</p>
<p>Here is the <a href="https://nn.labml.ai/uncertainty/evidence/experiment.html">training code <code class="highlight"><span class="n">experiment</span><span class="o">.</span><span class="n">py</span></code></a>.</p>