...
 
Commits (4)
https://gitcode.net/greenplum/annotated_deep_learning_paper_implementations/-/commit/97e53c0f3d7b2c479b6b5fdfa10650f4b395c27c  "fix glu variants links"  2023-04-02T12:00:23+05:30  Varuna Jayasiri <vpjayasiri@gmail.com>
https://gitcode.net/greenplum/annotated_deep_learning_paper_implementations/-/commit/c5685c9ffed19ef5384957014807f8b4a611efea  "remove app.labml.ai links"  2023-04-02T12:10:18+05:30  Varuna Jayasiri <vpjayasiri@gmail.com>
https://gitcode.net/greenplum/annotated_deep_learning_paper_implementations/-/commit/d4c33355252393a25f8ded5ad719880135630417  "link to translations"  2023-04-02T14:03:23+05:30  Varuna Jayasiri <vpjayasiri@gmail.com>
https://gitcode.net/greenplum/annotated_deep_learning_paper_implementations/-/commit/b05c9e0c57c6223b8f59dc11be114b97896b0481  "zh translation"  2023-04-02T14:23:40+05:30  Varuna Jayasiri <vpjayasiri@gmail.com>
This diff is collapsed.
......@@ -77,7 +77,7 @@
<p>This file holds the implementations of the core modules of Capsule Networks.</p>
<p>I used <a href="https://github.com/jindongwang/Pytorch-CapsuleNet">jindongwang/Pytorch-CapsuleNet</a> to clarify some confusion I had with the paper.</p>
<p>Here&#x27;s a notebook for training a Capsule Network on the MNIST dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/capsule_networks/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/e7c08e08586711ebb3e30242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/capsule_networks/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
</div>
<div class='code'>
......
This diff is collapsed.
......@@ -73,13 +73,12 @@
<h1>Train a <a href="index.html">ConvMixer</a> on CIFAR 10</h1>
<p>This script trains a ConvMixer on the CIFAR 10 dataset.</p>
<p>This is not an attempt to reproduce the results of the paper. The paper uses image augmentations from <a href="https://github.com/rwightman/pytorch-image-models">PyTorch Image Models (timm)</a> for training. We haven&#x27;t done this for simplicity, which causes our validation accuracy to drop.</p>
<p><a href="https://app.labml.ai/run/0fc344da2cd011ecb0bc3fdb2e774a3d"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">20</span><span></span><span class="kn">from</span> <span class="nn">labml</span> <span class="kn">import</span> <span class="n">experiment</span>
<span class="lineno">21</span><span class="kn">from</span> <span class="nn">labml.configs</span> <span class="kn">import</span> <span class="n">option</span>
<span class="lineno">22</span><span class="kn">from</span> <span class="nn">labml_nn.experiments.cifar10</span> <span class="kn">import</span> <span class="n">CIFAR10Configs</span></pre></div>
<div class="highlight"><pre><span class="lineno">18</span><span></span><span class="kn">from</span> <span class="nn">labml</span> <span class="kn">import</span> <span class="n">experiment</span>
<span class="lineno">19</span><span class="kn">from</span> <span class="nn">labml.configs</span> <span class="kn">import</span> <span class="n">option</span>
<span class="lineno">20</span><span class="kn">from</span> <span class="nn">labml_nn.experiments.cifar10</span> <span class="kn">import</span> <span class="n">CIFAR10Configs</span></pre></div>
</div>
</div>
<div class='section' id='section-1'>
......@@ -93,7 +92,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">25</span><span class="k">class</span> <span class="nc">Configs</span><span class="p">(</span><span class="n">CIFAR10Configs</span><span class="p">):</span></pre></div>
<div class="highlight"><pre><span class="lineno">23</span><span class="k">class</span> <span class="nc">Configs</span><span class="p">(</span><span class="n">CIFAR10Configs</span><span class="p">):</span></pre></div>
</div>
</div>
<div class='section' id='section-2'>
......@@ -105,7 +104,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">34</span> <span class="n">patch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2</span></pre></div>
<div class="highlight"><pre><span class="lineno">32</span> <span class="n">patch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2</span></pre></div>
</div>
</div>
<div class='section' id='section-3'>
......@@ -117,7 +116,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">36</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">256</span></pre></div>
<div class="highlight"><pre><span class="lineno">34</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">256</span></pre></div>
</div>
</div>
<div class='section' id='section-4'>
......@@ -129,7 +128,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">38</span> <span class="n">n_layers</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">8</span></pre></div>
<div class="highlight"><pre><span class="lineno">36</span> <span class="n">n_layers</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">8</span></pre></div>
</div>
</div>
<div class='section' id='section-5'>
......@@ -141,7 +140,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">40</span> <span class="n">kernel_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">7</span></pre></div>
<div class="highlight"><pre><span class="lineno">38</span> <span class="n">kernel_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">7</span></pre></div>
</div>
</div>
<div class='section' id='section-6'>
......@@ -153,7 +152,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">42</span> <span class="n">n_classes</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span></pre></div>
<div class="highlight"><pre><span class="lineno">40</span> <span class="n">n_classes</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span></pre></div>
</div>
</div>
<div class='section' id='section-7'>
......@@ -165,8 +164,8 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">45</span><span class="nd">@option</span><span class="p">(</span><span class="n">Configs</span><span class="o">.</span><span class="n">model</span><span class="p">)</span>
<span class="lineno">46</span><span class="k">def</span> <span class="nf">_conv_mixer</span><span class="p">(</span><span class="n">c</span><span class="p">:</span> <span class="n">Configs</span><span class="p">):</span></pre></div>
<div class="highlight"><pre><span class="lineno">43</span><span class="nd">@option</span><span class="p">(</span><span class="n">Configs</span><span class="o">.</span><span class="n">model</span><span class="p">)</span>
<span class="lineno">44</span><span class="k">def</span> <span class="nf">_conv_mixer</span><span class="p">(</span><span class="n">c</span><span class="p">:</span> <span class="n">Configs</span><span class="p">):</span></pre></div>
</div>
</div>
<div class='section' id='section-8'>
......@@ -177,7 +176,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">50</span> <span class="kn">from</span> <span class="nn">labml_nn.conv_mixer</span> <span class="kn">import</span> <span class="n">ConvMixerLayer</span><span class="p">,</span> <span class="n">ConvMixer</span><span class="p">,</span> <span class="n">ClassificationHead</span><span class="p">,</span> <span class="n">PatchEmbeddings</span></pre></div>
<div class="highlight"><pre><span class="lineno">48</span> <span class="kn">from</span> <span class="nn">labml_nn.conv_mixer</span> <span class="kn">import</span> <span class="n">ConvMixerLayer</span><span class="p">,</span> <span class="n">ConvMixer</span><span class="p">,</span> <span class="n">ClassificationHead</span><span class="p">,</span> <span class="n">PatchEmbeddings</span></pre></div>
</div>
</div>
<div class='section' id='section-9'>
......@@ -189,9 +188,9 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">53</span> <span class="k">return</span> <span class="n">ConvMixer</span><span class="p">(</span><span class="n">ConvMixerLayer</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">d_model</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">),</span> <span class="n">c</span><span class="o">.</span><span class="n">n_layers</span><span class="p">,</span>
<span class="lineno">54</span> <span class="n">PatchEmbeddings</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">d_model</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">patch_size</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
<span class="lineno">55</span> <span class="n">ClassificationHead</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">d_model</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">n_classes</span><span class="p">))</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">device</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">51</span> <span class="k">return</span> <span class="n">ConvMixer</span><span class="p">(</span><span class="n">ConvMixerLayer</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">d_model</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">),</span> <span class="n">c</span><span class="o">.</span><span class="n">n_layers</span><span class="p">,</span>
<span class="lineno">52</span> <span class="n">PatchEmbeddings</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">d_model</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">patch_size</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
<span class="lineno">53</span> <span class="n">ClassificationHead</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">d_model</span><span class="p">,</span> <span class="n">c</span><span class="o">.</span><span class="n">n_classes</span><span class="p">))</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">device</span><span class="p">)</span></pre></div>
</div>
</div>
<div class='section' id='section-10'>
......@@ -202,7 +201,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">58</span><span class="k">def</span> <span class="nf">main</span><span class="p">():</span></pre></div>
<div class="highlight"><pre><span class="lineno">56</span><span class="k">def</span> <span class="nf">main</span><span class="p">():</span></pre></div>
</div>
</div>
<div class='section' id='section-11'>
......@@ -214,7 +213,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">60</span> <span class="n">experiment</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;ConvMixer&#39;</span><span class="p">,</span> <span class="n">comment</span><span class="o">=</span><span class="s1">&#39;cifar10&#39;</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">58</span> <span class="n">experiment</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;ConvMixer&#39;</span><span class="p">,</span> <span class="n">comment</span><span class="o">=</span><span class="s1">&#39;cifar10&#39;</span><span class="p">)</span></pre></div>
</div>
</div>
<div class='section' id='section-12'>
......@@ -226,7 +225,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">62</span> <span class="n">conf</span> <span class="o">=</span> <span class="n">Configs</span><span class="p">()</span></pre></div>
<div class="highlight"><pre><span class="lineno">60</span> <span class="n">conf</span> <span class="o">=</span> <span class="n">Configs</span><span class="p">()</span></pre></div>
</div>
</div>
<div class='section' id='section-13'>
......@@ -238,7 +237,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">64</span> <span class="n">experiment</span><span class="o">.</span><span class="n">configs</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="p">{</span></pre></div>
<div class="highlight"><pre><span class="lineno">62</span> <span class="n">experiment</span><span class="o">.</span><span class="n">configs</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="p">{</span></pre></div>
</div>
</div>
<div class='section' id='section-14'>
......@@ -250,8 +249,8 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">66</span> <span class="s1">&#39;optimizer.optimizer&#39;</span><span class="p">:</span> <span class="s1">&#39;Adam&#39;</span><span class="p">,</span>
<span class="lineno">67</span> <span class="s1">&#39;optimizer.learning_rate&#39;</span><span class="p">:</span> <span class="mf">2.5e-4</span><span class="p">,</span></pre></div>
<div class="highlight"><pre><span class="lineno">64</span> <span class="s1">&#39;optimizer.optimizer&#39;</span><span class="p">:</span> <span class="s1">&#39;Adam&#39;</span><span class="p">,</span>
<span class="lineno">65</span> <span class="s1">&#39;optimizer.learning_rate&#39;</span><span class="p">:</span> <span class="mf">2.5e-4</span><span class="p">,</span></pre></div>
</div>
</div>
<div class='section' id='section-15'>
......@@ -263,8 +262,8 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">70</span> <span class="s1">&#39;epochs&#39;</span><span class="p">:</span> <span class="mi">150</span><span class="p">,</span>
<span class="lineno">71</span> <span class="s1">&#39;train_batch_size&#39;</span><span class="p">:</span> <span class="mi">64</span><span class="p">,</span></pre></div>
<div class="highlight"><pre><span class="lineno">68</span> <span class="s1">&#39;epochs&#39;</span><span class="p">:</span> <span class="mi">150</span><span class="p">,</span>
<span class="lineno">69</span> <span class="s1">&#39;train_batch_size&#39;</span><span class="p">:</span> <span class="mi">64</span><span class="p">,</span></pre></div>
</div>
</div>
<div class='section' id='section-16'>
......@@ -276,7 +275,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">74</span> <span class="s1">&#39;train_dataset&#39;</span><span class="p">:</span> <span class="s1">&#39;cifar10_train_augmented&#39;</span><span class="p">,</span></pre></div>
<div class="highlight"><pre><span class="lineno">72</span> <span class="s1">&#39;train_dataset&#39;</span><span class="p">:</span> <span class="s1">&#39;cifar10_train_augmented&#39;</span><span class="p">,</span></pre></div>
</div>
</div>
<div class='section' id='section-17'>
......@@ -288,8 +287,8 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">76</span> <span class="s1">&#39;valid_dataset&#39;</span><span class="p">:</span> <span class="s1">&#39;cifar10_valid_no_augment&#39;</span><span class="p">,</span>
<span class="lineno">77</span> <span class="p">})</span></pre></div>
<div class="highlight"><pre><span class="lineno">74</span> <span class="s1">&#39;valid_dataset&#39;</span><span class="p">:</span> <span class="s1">&#39;cifar10_valid_no_augment&#39;</span><span class="p">,</span>
<span class="lineno">75</span> <span class="p">})</span></pre></div>
</div>
</div>
<div class='section' id='section-18'>
......@@ -301,7 +300,7 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">79</span> <span class="n">experiment</span><span class="o">.</span><span class="n">add_pytorch_models</span><span class="p">({</span><span class="s1">&#39;model&#39;</span><span class="p">:</span> <span class="n">conf</span><span class="o">.</span><span class="n">model</span><span class="p">})</span></pre></div>
<div class="highlight"><pre><span class="lineno">77</span> <span class="n">experiment</span><span class="o">.</span><span class="n">add_pytorch_models</span><span class="p">({</span><span class="s1">&#39;model&#39;</span><span class="p">:</span> <span class="n">conf</span><span class="o">.</span><span class="n">model</span><span class="p">})</span></pre></div>
</div>
</div>
<div class='section' id='section-19'>
......@@ -313,8 +312,8 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">81</span> <span class="k">with</span> <span class="n">experiment</span><span class="o">.</span><span class="n">start</span><span class="p">():</span>
<span class="lineno">82</span> <span class="n">conf</span><span class="o">.</span><span class="n">run</span><span class="p">()</span></pre></div>
<div class="highlight"><pre><span class="lineno">79</span> <span class="k">with</span> <span class="n">experiment</span><span class="o">.</span><span class="n">start</span><span class="p">():</span>
<span class="lineno">80</span> <span class="n">conf</span><span class="o">.</span><span class="n">run</span><span class="p">()</span></pre></div>
</div>
</div>
<div class='section' id='section-20'>
......@@ -326,8 +325,8 @@
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">86</span><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">&#39;__main__&#39;</span><span class="p">:</span>
<span class="lineno">87</span> <span class="n">main</span><span class="p">()</span></pre></div>
<div class="highlight"><pre><span class="lineno">84</span><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">&#39;__main__&#39;</span><span class="p">:</span>
<span class="lineno">85</span> <span class="n">main</span><span class="p">()</span></pre></div>
</div>
</div>
<div class='footer'>
......
This diff is collapsed.
......@@ -75,8 +75,7 @@
<p>ConvMixer is similar to <a href="https://nn.labml.ai/transformers/mlp_mixer/index.html">MLP-Mixer</a>. MLP-Mixer separates the mixing of spatial and channel dimensions by applying an MLP across the spatial dimension and then an MLP across the channel dimension (the spatial MLP replaces the <a href="https://nn.labml.ai/transformers/vit/index.html">ViT</a> attention and the channel MLP is the <a href="https://nn.labml.ai/transformers/feed_forward.html">FFN</a> of ViT).</p>
<p>ConvMixer uses a 1x1 convolution for channel mixing and a depth-wise convolution for spatial mixing. Since it&#x27;s a convolution instead of a full MLP across the space, it mixes only the nearby patches, in contrast to ViT or MLP-Mixer. Also, MLP-Mixer uses two-layer MLPs for each mixing while ConvMixer uses a single layer for each mixing.</p>
<p>The paper recommends removing the residual connection across the channel mixing (point-wise convolution) and having only a residual connection over the spatial mixing (depth-wise convolution). They also use <a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch normalization</a> instead of <a href="../normalization/layer_norm/index.html">Layer normalization</a>.</p>
<p>Here&#x27;s <a href="https://nn.labml.ai/conv_mixer/experiment.html">an experiment</a> that trains ConvMixer on CIFAR-10.</p>
<p><a href="https://app.labml.ai/run/0fc344da2cd011ecb0bc3fdb2e774a3d"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>Here&#x27;s <a href="https://nn.labml.ai/conv_mixer/experiment.html">an experiment</a> that trains ConvMixer on CIFAR-10. </p>
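The mixing scheme described above can be sketched as a single PyTorch module. This is a minimal illustrative sketch, not the repository's implementation: the class name and the `'same'` padding choice are assumptions.

```python
import torch
import torch.nn as nn


class ConvMixerLayerSketch(nn.Module):
    """One ConvMixer layer: depth-wise spatial mixing, then 1x1 channel mixing."""

    def __init__(self, d_model: int, kernel_size: int):
        super().__init__()
        # Depth-wise convolution mixes nearby spatial locations within each channel
        self.depth_wise = nn.Conv2d(d_model, d_model, kernel_size,
                                    groups=d_model, padding='same')
        self.act1 = nn.GELU()
        self.norm1 = nn.BatchNorm2d(d_model)
        # 1x1 (point-wise) convolution mixes channels at each spatial location
        self.point_wise = nn.Conv2d(d_model, d_model, kernel_size=1)
        self.act2 = nn.GELU()
        self.norm2 = nn.BatchNorm2d(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection only around the spatial (depth-wise) mixing,
        # as the paper recommends; no residual over the channel mixing
        x = x + self.norm1(self.act1(self.depth_wise(x)))
        return self.norm2(self.act2(self.point_wise(x)))
```

Note the use of batch normalization after each mixing, matching the paper's choice over layer normalization.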
</div>
<div class='code'>
......
This diff is collapsed.
This diff is collapsed.
......@@ -74,8 +74,7 @@
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation/tutorial of the paper <a href="https://papers.labml.ai/paper/1503.02531">Distilling the Knowledge in a Neural Network</a>.</p>
<p>It&#x27;s a way of training a small network using the knowledge in a trained larger network; i.e. distilling the knowledge from the large network.</p>
<p>A large model with regularization or an ensemble of models (using dropout) generalizes better than a small model when trained directly on the data and labels. However, a small model can be trained to generalize better with the help of a large model. Smaller models are better in production: faster, less compute, less memory.</p>
<p>The output probabilities of a trained model give more information than the labels because it assigns non-zero probabilities to incorrect classes as well. These probabilities tell us that a sample has a chance of belonging to certain classes. For instance, when given an image of the digit <em>7</em>, a generalized digit classifier will give a high probability to 7 and a small but non-zero probability to 2, while assigning almost zero probability to other digits. Distillation uses this information to train a small model better.</p>
<p><a href="https://app.labml.ai/run/d6182e2adaf011eb927c91a2a1710932"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p>The output probabilities of a trained model give more information than the labels because it assigns non-zero probabilities to incorrect classes as well. These probabilities tell us that a sample has a chance of belonging to certain classes. For instance, when given an image of the digit <em>7</em>, a generalized digit classifier will give a high probability to 7 and a small but non-zero probability to 2, while assigning almost zero probability to other digits. Distillation uses this information to train a small model better. </p>
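A minimal sketch of a distillation loss built from this idea: soften both teacher and student logits with a temperature, and combine the soft-target term with the usual cross-entropy on hard labels. The temperature and weighting values here are illustrative assumptions, not the repository's settings.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 5.0,
                      soft_weight: float = 0.9) -> torch.Tensor:
    # Teacher probabilities at high temperature expose the small but
    # non-zero probabilities of incorrect classes
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so soft-target gradients keep a comparable magnitude
    # as the temperature changes
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction='batchmean') * temperature ** 2
    # Standard cross-entropy on the true (hard) labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return soft_weight * soft_loss + (1 - soft_weight) * hard_loss
```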
</div>
<div class='code'>
......
This diff is collapsed.
This diff is not shown because it is too large. You can view the blob instead.
This diff is collapsed.
This diff is collapsed.
......@@ -75,8 +75,7 @@
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/1710.10903">Graph Attention Networks</a>.</p>
<p>GATs work on graph data. A graph consists of nodes and edges connecting nodes. For example, in the Cora dataset the nodes are research papers and the edges are citations that connect the papers.</p>
<p>GAT uses masked self-attention, kind of similar to <a href="https://nn.labml.ai/transformers/mha.html">transformers</a>. GAT consists of graph attention layers stacked on top of each other. Each graph attention layer gets node embeddings as inputs and outputs transformed embeddings. The node embeddings pay attention to the embeddings of other nodes it&#x27;s connected to. The details of graph attention layers are included alongside the implementation.</p>
<p>Here is <a href="https://nn.labml.ai/graphs/gat/experiment.html">the training code</a> for training a two-layer GAT on the Cora dataset.</p>
<p><a href="https://app.labml.ai/run/d6c636cadf3511eba2f1e707f612f95d"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p>Here is <a href="https://nn.labml.ai/graphs/gat/experiment.html">the training code</a> for training a two-layer GAT on the Cora dataset. </p>
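A minimal single-head sketch of the masked attention described above. The repository's implementation is multi-headed and more complete; the class name and the dense boolean adjacency-mask formulation here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionSketch(nn.Module):
    """Single-head graph attention: nodes attend only to their neighbours."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        # Attention score from the concatenation [g_i || g_j]
        self.attn = nn.Linear(2 * out_features, 1, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (n_nodes, in_features), adj: (n_nodes, n_nodes) boolean mask
        g = self.linear(h)
        n = g.shape[0]
        # Build [g_i || g_j] for every pair of nodes
        pairs = torch.cat([g.unsqueeze(1).expand(n, n, -1),
                           g.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1), negative_slope=0.2)
        # Mask out non-edges so each node attends only to connected nodes
        e = e.masked_fill(~adj, float('-inf'))
        a = torch.softmax(e, dim=-1)
        # Aggregate neighbour embeddings weighted by attention
        return a @ g
```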
</div>
<div class='code'>
......
This diff is collapsed.
This diff is collapsed.
......@@ -75,8 +75,7 @@
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the GATv2 operator from the paper <a href="https://papers.labml.ai/paper/2105.14491">How Attentive are Graph Attention Networks?</a>.</p>
<p>GATv2s work on graph data. A graph consists of nodes and edges connecting nodes. For example, in the Cora dataset the nodes are research papers and the edges are citations that connect the papers.</p>
<p>The GATv2 operator fixes the static attention problem of the standard GAT: since the linear layers in the standard GAT are applied right after each other, the ranking of attended nodes is unconditioned on the query node. In contrast, in GATv2, every node can attend to any other node.</p>
<p>Here is <a href="https://nn.labml.ai/graphs/gatv2/experiment.html">the training code</a> for training a two-layer GATv2 on the Cora dataset.</p>
<p><a href="https://app.labml.ai/run/34b1e2f6ed6f11ebb860997901a2d1e3"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p>Here is <a href="https://nn.labml.ai/graphs/gatv2/experiment.html">the training code</a> for training a two-layer GATv2 on the Cora dataset. </p>
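The fix can be sketched by looking only at how the attention score is computed: GATv2 moves the non-linearity between the two linear operations, so the score genuinely depends on the query node rather than producing the same ranking of keys for every query. Names and shapes here are illustrative assumptions; only the score computation is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GATv2ScoreSketch(nn.Module):
    """GATv2-style score: e(h_i, h_j) = a^T LeakyReLU(W [h_i || h_j])."""

    def __init__(self, in_features: int, hidden: int):
        super().__init__()
        self.W = nn.Linear(2 * in_features, hidden, bias=False)
        self.a = nn.Linear(hidden, 1, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        n = h.shape[0]
        # Concatenate [h_i || h_j] for every pair of nodes
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        # LeakyReLU sits *between* W and a, unlike standard GAT where the
        # two linear maps compose into one, making attention static
        return self.a(F.leaky_relu(self.W(pairs), 0.2)).squeeze(-1)
```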
</div>
<div class='code'>
......
This diff is collapsed.
......@@ -73,6 +73,7 @@
<p>This is a collection of simple PyTorch implementations of neural networks and related algorithms. <a href="https://github.com/labmlai/annotated_deep_learning_paper_implementations">These implementations</a> are documented with explanations, and the <a href="index.html">website</a> renders these as side-by-side formatted notes. We believe these would help you understand these algorithms better.</p>
<p><img alt="Screenshot" src="dqn-light.png"></p>
<p>We are actively maintaining this repo and adding new implementations. <a href="https://twitter.com/labmlai"><img alt="Twitter" src="https://img.shields.io/twitter/follow/labmlai?style=social"></a> for updates.</p>
<h2>Languages: <strong><a href="https://nn.labml.ai">English</a></strong> | <strong><a href="https://nn.labml.ai/zh/">Chinese (translated)</a></strong></h2>
<h2>Paper Implementations</h2>
<h4><a href="transformers/index.html">Transformers</a></h4>
<ul><li><a href="transformers/mha.html">Multi-headed attention</a> </li>
......
......@@ -102,7 +102,7 @@ M1001 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlis
<h2>Inference</h2>
<p>We need to know <span ><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.138em;vertical-align:-0.25em;"></span><span class="mord coloredeq eqg" style=""><span class="mord mathbb" style="">E</span><span class="mopen" style="">[</span><span class="mord" style=""><span class="mord mathnormal" style="">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8879999999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight" style=""><span class="mord mtight" style=""><span class="mopen mtight" style="">(</span><span class="mord mathnormal mtight" style="margin-right:0.03148em">k</span><span class="mclose mtight" style="">)</span></span></span></span></span></span></span></span></span><span class="mclose" style="">]</span></span></span></span></span></span> and <span ><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:1.138em;vertical-align:-0.25em;"></span><span class="mord coloredeq eqj" style=""><span class="mord mathnormal" style="">Va</span><span class="mord mathnormal" style="margin-right:0.02778em">r</span><span class="mopen" style="">[</span><span class="mord" style=""><span class="mord mathnormal" style="">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8879999999999999em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight" style=""><span class="mord mtight" style=""><span class="mopen mtight" style="">(</span><span class="mord mathnormal mtight" style="margin-right:0.03148em">k</span><span class="mclose mtight" style="">)</span></span></span></span></span></span></span></span></span><span class="mclose" 
style="">]</span></span></span></span></span></span> in order to perform the normalization. So during inference, you either need to go through the whole (or part of) dataset and find the mean and variance, or you can use an estimate calculated during training. The usual practice is to calculate an exponential moving average of mean and variance during the training phase and use that for inference.</p>
<p>Here&#x27;s <a href="mnist.html">the training code</a> and a notebook for training a CNN classifier that uses batch normalization for the MNIST dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/011254fe647011ebbb8e0242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
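The usual practice described above can be sketched directly: keep an exponential moving average of the batch mean and variance during training, and normalize with those running statistics at inference. This is a minimal sketch for a simple (batch, channels) input; the class name and momentum value are illustrative assumptions, and the learned scale and shift are omitted.

```python
import torch


class BatchNormEMASketch:
    def __init__(self, channels: int, eps: float = 1e-5, momentum: float = 0.1):
        self.eps = eps
        self.momentum = momentum
        # Running estimates of E[x] and Var[x], updated during training
        self.running_mean = torch.zeros(channels)
        self.running_var = torch.ones(channels)

    def train_step(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize with batch statistics and update the moving averages
        mean = x.mean(dim=0)
        var = x.var(dim=0, unbiased=False)
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        return (x - mean) / torch.sqrt(var + self.eps)

    def inference(self, x: torch.Tensor) -> torch.Tensor:
        # No pass over the dataset needed: use the estimates from training
        return (x - self.running_mean) / torch.sqrt(self.running_var + self.eps)
```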
</div>
<div class='code'>
......
......@@ -76,7 +76,7 @@
<p><a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a> works well for large enough batch sizes but not well for small batch sizes, because it normalizes over the batch. Training large models with large batch sizes is not possible due to the memory capacity of the devices.</p>
<p>This paper introduces Group Normalization, which normalizes a set of features together as a group. This is based on the observation that classical features such as <a href="https://en.wikipedia.org/wiki/Scale-invariant_feature_transform">SIFT</a> and <a href="https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients">HOG</a> are group-wise features. The paper proposes dividing feature channels into groups and then separately normalizing all channels within each group.</p>
<p>Here&#x27;s a <a href="https://nn.labml.ai/normalization/group_norm/experiment.html">CIFAR 10 classification model</a> that uses group normalization.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/normalization/group_norm/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/081d950aa4e011eb8f9f0242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/normalization/group_norm/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
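The group-wise normalization described above can be sketched in a few lines: fold the channel dimension into (groups, channels_per_group) and normalize each group over its channels and spatial positions, per sample. This minimal version omits the learned scale and shift; PyTorch ships the full layer as `torch.nn.GroupNorm`.

```python
import torch


def group_norm_sketch(x: torch.Tensor, groups: int, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch, channels, height, width)
    b, c, h, w = x.shape
    assert c % groups == 0, "channels must divide evenly into groups"
    # Split channels into groups
    x = x.view(b, groups, c // groups, h, w)
    # Normalize over each group's channels and spatial positions
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(b, c, h, w)
```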
</div>
<div class='code'>
......
......@@ -495,11 +495,11 @@
<div class='section-link'>
<a href='#section-29'>#</a>
</div>
<p>Run the synthetic experiment with <em>Adam</em>. <a href="https://app.labml.ai/run/61ebfdaa384411eb94d8acde48001122">Here are the results</a>. You can see that Adam converges at <span ><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord coloredeq eqbd" style=""><span class="mord mathnormal" style="">x</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.72777em;vertical-align:-0.08333em;"></span><span class="mord">+</span><span class="mord">1</span></span></span></span></span> </p>
<p>Run the synthetic experiment is <em>Adam</em>. You can see that Adam converges at <span ><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.43056em;vertical-align:0em;"></span><span class="mord coloredeq eqbd" style=""><span class="mord mathnormal" style="">x</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span></span><span class="base"><span class="strut" style="height:0.72777em;vertical-align:-0.08333em;"></span><span class="mord">+</span><span class="mord">1</span></span></span></span></span> </p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">202</span> <span class="n">_synthetic_experiment</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">201</span> <span class="n">_synthetic_experiment</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span></pre></div>
</div>
</div>
<div class='section' id='section-30'>
......@@ -507,11 +507,11 @@
<div class='section-link'>
<a href='#section-30'>#</a>
</div>
<p>Run the synthetic experiment with <em>AMSGrad</em>. <a href="https://app.labml.ai/run/uuid=bc06405c384411eb8b82acde48001122">Here are the results</a>. You can see that AMSGrad converges to the true optimal <span ><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.72777em;vertical-align:-0.08333em;"></span><span class="mord coloredeq eqv" style=""><span class="mord" style=""><span class="mord mathnormal coloredeq eqbd" style="">x</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel" style="">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mord" style="">−</span><span class="mord" style="">1</span></span></span></span></span></span> </p>
<p>Run the synthetic experiment with <em>AMSGrad</em>. You can see that AMSGrad converges to the true optimal <span ><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.72777em;vertical-align:-0.08333em;"></span><span class="mord coloredeq eqv" style=""><span class="mord" style=""><span class="mord mathnormal coloredeq eqbd" style="">x</span></span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mrel" style="">=</span><span class="mspace" style="margin-right:0.2777777777777778em;"></span><span class="mord" style="">−</span><span class="mord" style="">1</span></span></span></span></span></span> </p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">206</span> <span class="n">_synthetic_experiment</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">204</span> <span class="n">_synthetic_experiment</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span></pre></div>
</div>
</div>
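The synthetic experiment that the two sections above run can be reproduced in plain Python (a simplified sketch of the construction from the paper: β₁ = 0, no bias correction, a constant step size, and hyperparameters of my own choosing):

```python
import math

def run(amsgrad, steps=200_000, lr=0.02, beta2=0.99):
    """Online problem from the paper: f_t(x) = 1010x if t % 101 == 1 else -10x,
    with x constrained to [-1, 1]. The true optimum is x = -1."""
    x, v, v_hat = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 1010.0 if t % 101 == 1 else -10.0  # rare large gradient, frequent small ones
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = max(v_hat, v) if amsgrad else v  # AMSGrad keeps the running maximum
        x -= lr * g / (math.sqrt(v_hat) + 1e-8)
        x = min(1.0, max(-1.0, x))  # project back onto the feasible interval
    return x

adam_x = run(amsgrad=False)    # Adam forgets the rare gradient and drifts to +1
amsgrad_x = run(amsgrad=True)  # AMSGrad heads towards the true optimum at -1
```

Because Adam's second-moment estimate decays between the rare large gradients, the many small steps outweigh the one large one; AMSGrad's running maximum keeps the denominator large, so the large gradient keeps its influence.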
<div class='footer'>
......
......@@ -74,7 +74,7 @@
<h1><a href="https://nn.labml.ai/rl/dqn/index.html">Deep Q Networks (DQN)</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/1312.5602">Playing Atari with Deep Reinforcement Learning</a> along with <a href="https://nn.labml.ai/rl/dqn/model.html">Dueling Network</a>, <a href="https://nn.labml.ai/rl/dqn/replay_buffer.html">Prioritized Replay</a> and Double Q Network.</p>
<p>Here is the <a href="https://nn.labml.ai/rl/dqn/experiment.html">experiment</a> and <a href="https://nn.labml.ai/rl/dqn/model.html">model</a> implementation.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
</div>
<div class='code'>
......
......@@ -75,7 +75,7 @@
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of <a href="https://papers.labml.ai/paper/1707.06347">Proximal Policy Optimization - PPO</a>.</p>
<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample the data. It does so by clipping the gradient flow if the updated policy is not close to the policy used to sample the data.</p>
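The clipping idea can be written down directly as the clipped surrogate loss (a minimal NumPy sketch; the 0.2 clip range is the paper's default, everything else here is illustrative):

```python
import numpy as np

def ppo_clip_loss(log_pi_new, log_pi_old, advantage, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    ratio = np.exp(log_pi_new - log_pi_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic minimum: large policy moves get no extra credit
    return -np.mean(np.minimum(ratio * advantage, clipped * advantage))

# If the new policy doubled an action's probability, a positive advantage
# is credited only up to the clipped ratio of 1.2.
loss = ppo_clip_loss(np.log(2.0), 0.0, advantage=1.0)
```

Note the asymmetry: with a negative advantage the un-clipped ratio is the smaller term, so moving far from the old policy is still penalized in full.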
<p>You can find an experiment that uses it <a href="https://nn.labml.ai/rl/ppo/experiment.html">here</a>. The experiment uses <a href="https://nn.labml.ai/rl/ppo/gae.html">Generalized Advantage Estimation</a>.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/6eff28a0910e11eb9b008db315936e2f"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
</div>
<div class='code'>
......
......@@ -561,7 +561,7 @@
<url>
<loc>https://nn.labml.ai/diffusion/stable_diffusion/sampler/ddim.html</loc>
<lastmod>2022-09-15T16:30:00+00:00</lastmod>
<lastmod>2023-02-24T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......
......@@ -74,8 +74,7 @@
<h1><a href="https://nn.labml.ai/transformers/aft/index.html">An Attention Free Transformer</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2105.14103">An Attention Free Transformer</a>.</p>
<p>This paper replaces the <a href="https://nn.labml.ai/transformers/mha.html">self-attention layer</a> with a new efficient operation that has memory complexity of O(Td), where T is the sequence length and <span ><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord mathnormal">d</span></span></span></span></span> is the dimensionality of embeddings.</p>
<p>The paper introduces AFT along with AFT-local and AFT-conv. Here we have implemented AFT-local, which pays attention to nearby tokens in an autoregressive model.</p>
<p><a href="https://app.labml.ai/run/6348e504c3a511eba9529daa283fb495"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p>The paper introduces AFT along with AFT-local and AFT-conv. Here we have implemented AFT-local, which pays attention to nearby tokens in an autoregressive model. </p>
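The AFT operation itself is simple enough to sketch (a toy NumPy version of AFT-full with a causal mask; `w` is the learned pairwise position-bias matrix — this is an illustration, not the annotated implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_full_causal(q, k, v, w):
    """q, k, v: (T, d); w: (T, T) learned pairwise position biases.
    y_t = sigmoid(q_t) * sum_{t' <= t} exp(k_{t'} + w[t, t']) * v_{t'} / normalizer,
    where everything is elementwise over the d embedding dimensions."""
    T, d = q.shape
    y = np.empty_like(v)
    for t in range(T):
        a = np.exp(k[: t + 1] + w[t, : t + 1, None])  # (t + 1, d) positive weights
        y[t] = sigmoid(q[t]) * (a * v[: t + 1]).sum(axis=0) / a.sum(axis=0)
    return y

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 5)) for _ in range(3))
w = rng.normal(size=(4, 4)) * 0.1
y = aft_full_causal(q, k, v, w)
```

At the first position the weights normalize to one, so `y[0]` is just `sigmoid(q[0]) * v[0]`; unlike self-attention there is no T×T×d attention map, only the K + position-bias combination per dimension.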
</div>
<div class='code'>
......
......@@ -80,7 +80,7 @@
<p>Since training compression with BPTT requires maintaining a very large computational graph (many time steps), the paper proposes an <em>auto-encoding loss</em> and an <em>attention reconstruction loss</em>. The auto-encoding loss decodes the original memories from the compressed memories and calculates the loss. Attention reconstruction loss computes the multi-headed attention results on the compressed memory and on uncompressed memory and gets a mean squared error between them. We have implemented the latter here since it gives better results.</p>
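A stripped-down version of the attention reconstruction loss (single-head dot-product attention, with mean-pooling as a stand-in for the learned compression function — purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(h, mem):
    """Single-head dot-product attention of queries h over a memory."""
    scores = h @ mem.T / np.sqrt(h.shape[-1])
    return softmax(scores) @ mem

def attention_reconstruction_loss(h, mem, rate=2):
    # Mean-pool pairs of memories as a stand-in for the learned compression
    c_mem = mem.reshape(mem.shape[0] // rate, rate, mem.shape[1]).mean(axis=1)
    # MSE between attention over the original and the compressed memories
    diff = attend(h, mem) - attend(h, c_mem)
    return float(np.mean(diff ** 2))

h = np.random.randn(3, 8)    # queries
mem = np.random.randn(6, 8)  # memories, compressed 6 -> 3
loss = attention_reconstruction_loss(h, mem)
```

Minimizing this pushes the compression to preserve exactly the information that attention reads out, which is why it trains better than reconstructing the raw memories.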
<p>This implementation uses pre-layer normalization while the paper uses post-layer normalization. Pre-layer norm does the layer norm before <a href="../feedforward.html">FFN</a> and self-attention, and the pass-through in the residual connection is not normalized. This is supposed to be more stable in standard transformer setups.</p>
<p>Here are <a href="https://nn.labml.ai/transformers/compressive/experiment.html">the training code</a> and a notebook for training a compressive transformer model on the Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/compressive/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/0d9b5338726c11ebb7c80242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/compressive/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
</div>
<div class='code'>
......
......@@ -74,7 +74,7 @@
<h1><a href="https://nn.labml.ai/transformers/fast_weights/index.html">Fast weights transformer</a></h1>
<p>This is an annotated implementation of the paper <a href="https://papers.labml.ai/paper/2102.11174">Linear Transformers Are Secretly Fast Weight Memory Systems</a> in PyTorch.</p>
<p>Here is the <a href="https://nn.labml.ai/transformers/fast_weights/index.html">annotated implementation</a>. Here are <a href="https://nn.labml.ai/transformers/fast_weights/experiment.html">the training code</a> and a notebook for training a fast weights transformer on the Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/fast_weights/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/928aadc0846c11eb85710242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/fast_weights/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
</div>
<div class='code'>
......
......@@ -79,7 +79,7 @@
<p>The updated feedback transformer shares weights used to calculate keys and values among the layers. We then calculate the keys and values for each step only once and keep them cached. The <a href="#shared_kv">second half</a> of this file implements this. We implemented a custom PyTorch function to improve performance.</p>
<p>Here&#x27;s <a href="experiment.html">the training code</a> and a notebook for training a feedback transformer on Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/feedback/experiment.ipynb">Colab Notebook</a></p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/feedback/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/d8eb9416530a11eb8fb50242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/feedback/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
</div>
<div class='code'>
......
......@@ -72,9 +72,9 @@
<a href='#section-0'>#</a>
</div>
<h1>Gated Linear Units and Variants</h1>
<ul><li><a href="glu_variants/experiment.html">Experiment that uses <code class="highlight"><span></span><span class="n">labml</span><span class="o">.</span><span class="n">configs</span></code>
<ul><li><a href="experiment.html">Experiment that uses <code class="highlight"><span></span><span class="n">labml</span><span class="o">.</span><span class="n">configs</span></code>
</a> </li>
<li><a href="glu_variants/simple.html">Simpler version from scratch</a></li></ul>
<li><a href="simple.html">Simpler version from scratch</a></li></ul>
</div>
<div class='code'>
......
......@@ -74,8 +74,7 @@
<h1><a href="https://nn.labml.ai/transformers/gmlp/index.html">Pay Attention to MLPs (gMLP)</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2105.08050">Pay Attention to MLPs</a>.</p>
<p>This paper introduces a Multilayer Perceptron (MLP) based architecture with gating, which they name <strong>gMLP</strong>. It consists of a stack of <span ><span class="katex"><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathnormal">L</span></span></span></span></span> <em>gMLP</em> blocks.</p>
<p>Here is <a href="https://nn.labml.ai/transformers/gmlp/experiment.html">the training code</a> for a gMLP-based autoregressive model.</p>
<p><a href="https://app.labml.ai/run/01bd941ac74c11eb890c1d9196651a4a"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p>Here is <a href="https://nn.labml.ai/transformers/gmlp/experiment.html">the training code</a> for a gMLP-based autoregressive model. </p>
</div>
<div class='code'>
......
......@@ -98,8 +98,7 @@
and <code class="highlight"><span></span><span class="n">hom</span></code>
to predict <code class="highlight"><span></span><span class="n">e</span></code>
and so on. So the model will start predicting with a shorter context first and then learn to use longer contexts later. Since MLMs have this problem, it&#x27;s a lot faster to train if you start with a smaller sequence length initially and then use a longer sequence length later.</p>
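The masking step itself can be sketched as follows (token ids with a 15% masking rate as in BERT; the `MASK_ID` value and the rate are placeholder choices):

```python
import random

MASK_ID = 0  # placeholder id for the [MASK] token

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked input, targets). Targets are -1 everywhere except the
    masked positions, so the loss is computed only where tokens were replaced."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)   # hide the token from the model
            targets.append(tok)      # ...and ask it to predict the original
        else:
            masked.append(tok)
            targets.append(-1)       # ignored by the loss
    return masked, targets

inp, tgt = mask_tokens([5, 9, 4, 7, 8, 3, 2, 6], seed=1)
```

A full implementation would also sometimes keep the original token or substitute a random one at masked positions, as BERT does.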
<p>Here is <a href="https://nn.labml.ai/transformers/mlm/experiment.html">the training code</a> for a simple MLM model.</p>
<p><a href="https://app.labml.ai/run/3a6d22b6c67111ebb03d6764d13a38d1"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p>Here is <a href="https://nn.labml.ai/transformers/mlm/experiment.html">the training code</a> for a simple MLM model. </p>
</div>
<div class='code'>
......
......@@ -75,8 +75,7 @@
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2105.01601">MLP-Mixer: An all-MLP Architecture for Vision</a>.</p>
<p>This paper applies the model to vision tasks. The model is similar to a transformer, with the attention layer replaced by an MLP that is applied across the patches (or tokens in the case of an NLP task).</p>
<p>Our implementation of MLP Mixer is a drop-in replacement for the <a href="https://nn.labml.ai/transformers/mha.html">self-attention layer</a> in <a href="https://nn.labml.ai/transformers/models.html">our transformer implementation</a>. So it&#x27;s just a couple of lines of code, transposing the tensor to apply the MLP across the sequence dimension.</p>
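Those couple of lines amount to a transpose around an MLP (a NumPy sketch; `mlp` is an arbitrary stand-in for the position-wise FFN, and the sizes are illustrative):

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP applied to the last dimension."""
    return np.maximum(x @ w1, 0.0) @ w2

def mix_tokens(x, w1, w2):
    """x: (seq_len, d_model). Transpose so the MLP mixes across the sequence
    dimension instead of the channel dimension, then transpose back."""
    return mlp(x.T, w1, w2).T

seq_len, d_model, hidden = 6, 4, 12
x = np.random.randn(seq_len, d_model)
w1 = np.random.randn(seq_len, hidden) * 0.1   # weights now span positions, not channels
w2 = np.random.randn(hidden, seq_len) * 0.1
y = mix_tokens(x, w1, w2)
```

Note the consequence visible in the shapes: the token-mixing weights depend on `seq_len`, so unlike attention this layer is tied to a fixed sequence length.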
<p>Although the paper applied MLP Mixer on vision tasks, we tried it on a <a href="https://nn.labml.ai/transformers/mlm/index.html">masked language model</a>. <a href="https://nn.labml.ai/transformers/mlp_mixer/experiment.html">Here is the experiment code</a>.</p>
<p><a href="https://app.labml.ai/run/994263d2cdb511eb961e872301f0dbab"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p>Although the paper applied MLP Mixer on vision tasks, we tried it on a <a href="https://nn.labml.ai/transformers/mlm/index.html">masked language model</a>. <a href="https://nn.labml.ai/transformers/mlp_mixer/experiment.html">Here is the experiment code</a>. </p>
</div>
<div class='code'>
......
......@@ -78,8 +78,7 @@
<p>The most effective modification found by the search is using a squared ReLU instead of ReLU in the <a href="https://nn.labml.ai/transformers/feed_forward.html">position-wise feedforward module</a>.</p>
<h3>Multi-DConv-Head Attention (MDHA)</h3>
<p>The next effective modification is a depth-wise 3 × 1 convolution after the multi-head projection for queries, keys, and values. The convolution is along the sequence dimension and per channel (depth-wise). To be clear, if the number of channels in each head is d_k, the convolution will have 3 × 1 kernels for each of the d_k channels.</p>
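Both modifications are small enough to sketch (the squared ReLU is exactly as described; the depth-wise convolution is shown here as a plain causal per-channel convolution over the sequence, without the multi-head plumbing):

```python
import numpy as np

def squared_relu(x):
    """ReLU followed by squaring: max(x, 0)^2."""
    return np.maximum(x, 0.0) ** 2

def depthwise_conv_seq(x, kernels):
    """x: (T, d); kernels: (d, 3). Causal width-3 depth-wise convolution along
    the sequence axis, with one kernel per channel."""
    T, d = x.shape
    padded = np.vstack([np.zeros((2, d)), x])  # left-pad so position t sees t-2..t
    out = np.zeros_like(x)
    for t in range(T):
        out[t] = (padded[t : t + 3] * kernels.T).sum(axis=0)
    return out

x = np.random.randn(5, 4)
y = depthwise_conv_seq(x, np.random.randn(4, 3))
```

In MDHA this convolution is applied separately to the projected queries, keys, and values of each head.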
<p><a href="https://nn.labml.ai/transformers/primer_ez/experiment.html">Here is the experiment code</a> for Primer EZ.</p>
<p><a href="https://app.labml.ai/run/30adb7aa1ab211eca7310f80a114e8a4"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://nn.labml.ai/transformers/primer_ez/experiment.html">Here is the experiment code</a> for Primer EZ. </p>
</div>
<div class='code'>
......
......@@ -82,7 +82,6 @@
<li><a href="model.html">Model</a> </li>
<li><a href="dataset.html">Dataset</a>: Pre-calculate the nearest neighbors </li>
<li><a href="train.html">Training code</a></li></ul>
<p><a href="https://app.labml.ai/run/3113dd3ea1e711ec85ee295d18534021"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
</div>
<div class='code'>
......
......@@ -76,8 +76,7 @@
<p>The Switch Transformer uses different parameters for each token by switching among parameters based on the token. Therefore, only a fraction of parameters are chosen for each token. So you can have more parameters but less computational cost.</p>
<p>The switching happens at the Position-wise Feedforward network (FFN) of each transformer block. The position-wise feedforward network consists of two sequential fully connected layers. In the switch transformer we have multiple FFNs (multiple experts), and we choose which one to use based on a router. The output is a set of probabilities for picking an FFN, and we pick the one with the highest probability and only evaluate that. So essentially the computational cost is the same as having a single FFN. In our implementation this doesn&#x27;t parallelize well when you have many or large FFNs since it&#x27;s all happening on a single GPU. In a distributed setup you would have each FFN (each very large) on a different device.</p>
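The routing step for a single token can be sketched as follows (a NumPy illustration: the router is a linear layer followed by a softmax, and each "expert" is a stand-in for a full FFN — the sizes are arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def switch_ffn(x, router_w, experts):
    """x: (d,). Route the token to the highest-probability expert and scale its
    output by that probability (so the router still receives gradients)."""
    probs = softmax(router_w @ x)  # (n_experts,)
    i = int(np.argmax(probs))
    return probs[i] * experts[i](x), i

d, n_experts = 8, 4
rng = np.random.default_rng(0)
router_w = rng.normal(size=(n_experts, d))
# Each "expert" stands in for a full position-wise FFN; here just a linear map
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [(lambda w: (lambda t: w @ t))(w) for w in weights]

y, chosen = switch_ffn(rng.normal(size=d), router_w, experts)
```

Multiplying the expert output by its routing probability is what lets the router be trained end to end even though the argmax itself is not differentiable.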
<p>The paper introduces another loss term to balance load among the experts (FFNs) and discusses dropping tokens when routing is not balanced.</p>
<p>Here&#x27;s <a href="experiment.html">the training code</a> and a notebook for training a switch transformer on Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/switch/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/353770ce177c11ecaa5fb74452424f46"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p>Here&#x27;s <a href="experiment.html">the training code</a> and a notebook for training a switch transformer on Tiny Shakespeare dataset. </p>
</div>
<div class='code'>
......
......@@ -77,7 +77,7 @@
<p>Annotated implementation of relative multi-headed attention is in <a href="https://nn.labml.ai/transformers/xl/relative_mha.html"><code class="highlight"><span></span><span class="n">relative_mha</span><span class="o">.</span><span class="n">py</span></code>
</a>.</p>
<p>Here&#x27;s <a href="https://nn.labml.ai/transformers/xl/experiment.html">the training code</a> and a notebook for training a transformer XL model on Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/xl/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://app.labml.ai/run/d3b6760c692e11ebb6a70242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/xl/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a> </p>
</div>
<div class='code'>
......
......@@ -74,8 +74,7 @@
<h1><a href="https://nn.labml.ai/uncertainty/evidence/index.html">Evidential Deep Learning to Quantify Classification Uncertainty</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/1806.01768">Evidential Deep Learning to Quantify Classification Uncertainty</a>.</p>
<p>Here is the <a href="https://nn.labml.ai/uncertainty/evidence/experiment.html">training code <code class="highlight"><span></span><span class="n">experiment</span><span class="o">.</span><span class="n">py</span></code>
</a> to train a model on MNIST dataset.</p>
<p><a href="https://app.labml.ai/run/f82b2bfc01ba11ecbb2aa16a33570106"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a> </p>
</a> to train a model on MNIST dataset. </p>
</div>
<div class='code'>
......
......@@ -72,12 +72,12 @@
</div>
<h1><a href="https://nn.labml.ai/capsule_networks/index.html">胶囊网络</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation/tutorial of <a href="https://papers.labml.ai/paper/1710.09829">Dynamic Routing Between Capsules</a>.</p>
<p>A capsule network is a neural network architecture that embeds features as capsules and routes them to the next layer of capsules with a voting mechanism.</p>
<p>Unlike other model implementations, we include a sample because some concepts are hard to understand from the modules alone. <a href="mnist.html">Here is the annotated code for a model that classifies the MNIST dataset using capsules</a>.</p>
<p>This file holds the implementations of the core modules of Capsule Networks.</p>
<p>I used <a href="https://github.com/jindongwang/Pytorch-CapsuleNet">jindongwang/Pytorch-CapsuleNet</a> to clarify some confusions I had with the paper.</p>
<p>Here&#x27;s a notebook for training a Capsule Network on the MNIST dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/capsule_networks/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a><a href="https://app.labml.ai/run/e7c08e08586711ebb3e30242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>A capsule network is a neural network architecture that embeds features as capsules and routes them to the next layer of capsules with a voting mechanism.</p>
<p>Unlike other model implementations, we provide a sample because some concepts are hard to understand from the modules alone. <a href="mnist.html">Here is the annotated code for a model that classifies the MNIST dataset using capsules</a>.</p>
<p>This file contains the implementations of the core modules of Capsule Networks.</p>
<p>I used <a href="https://github.com/jindongwang/Pytorch-CapsuleNet">jindongwang/Pytorch-CapsuleNet</a> to clarify some confusions I had with the paper.</p>
<p>Here&#x27;s a notebook for training a Capsule Network on the MNIST dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/capsule_networks/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a></p>
</div>
<div class='code'>
......
......@@ -70,13 +70,12 @@
<div class='section-link'>
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/conv_mixer/index.html">Patches Are All You Need?</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2201.09792">Patches Are All You Need?</a></p>
<p>ConvMixer is similar to <a href="https://nn.labml.ai/transformers/mlp_mixer/index.html">MLP-Mixer</a>. MLP-Mixer separates the mixing of the spatial and channel dimensions by applying an MLP across the spatial dimension and then an MLP across the channel dimension (the spatial MLP replaces the <a href="https://nn.labml.ai/transformers/vit/index.html">ViT</a> attention, and the channel MLP is the <a href="https://nn.labml.ai/transformers/feed_forward.html">FFN</a> of ViT).</p>
<p>ConvMixer uses a 1x1 convolution for channel mixing and a depth-wise convolution for spatial mixing. Since it is a convolution rather than a full MLP over the whole space, it mixes only nearby patches, unlike ViT or MLP-Mixer. Also, MLP-Mixer uses two-layer MLPs for each mixing, while ConvMixer uses a single layer for each mixing.</p>
<p>The paper recommends removing the residual connection over the channel mixing (point-wise convolution) and keeping only the residual connection over the spatial mixing (depth-wise convolution). They also use <a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a> instead of <a href="../normalization/layer_norm/index.html">Layer Normalization</a>.</p>
<p><a href="https://nn.labml.ai/conv_mixer/experiment.html">Here is an experiment that trains ConvMixer on CIFAR-10</a>.</p>
<p><a href="https://app.labml.ai/run/0fc344da2cd011ecb0bc3fdb2e774a3d"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<h1><a href="https://nn.labml.ai/conv_mixer/index.html">Patches Are All You Need?</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2201.09792">Patches Are All You Need?</a></p>
<p>ConvMixer is similar to <a href="https://nn.labml.ai/transformers/mlp_mixer/index.html">MLP-Mixer</a>. MLP-Mixer separates the mixing of the spatial and channel dimensions by applying an MLP across the spatial dimension and then an MLP across the channel dimension (the spatial MLP replaces the <a href="https://nn.labml.ai/transformers/vit/index.html">ViT</a> attention, and the channel MLP is the <a href="https://nn.labml.ai/transformers/feed_forward.html">FFN</a> of ViT).</p>
<p>ConvMixer uses a 1x1 convolution for channel mixing and a depth-wise convolution for spatial mixing. Since it is a convolution rather than a full MLP over the whole space, it mixes only nearby patches, unlike ViT or MLP-Mixer. Also, MLP-Mixer uses two-layer MLPs for each mixing, while ConvMixer uses a single layer for each mixing.</p>
<p>The paper recommends removing the residual connection over the channel mixing (point-wise convolution) and keeping only the residual connection over the spatial mixing (depth-wise convolution). They also use <a href="https://nn.labml.ai/normalization/batch_norm/index.html">Batch Normalization</a> instead of <a href="../normalization/layer_norm/index.html">Layer Normalization</a>.</p>
<p><a href="https://nn.labml.ai/conv_mixer/experiment.html">Here is an experiment that trains ConvMixer on CIFAR-10</a>.</p>
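The two mixing operations can be sketched in NumPy (naive loops for clarity; a real implementation would use an optimized convolution, plus the activations and normalization the paper describes):

```python
import numpy as np

def depthwise_conv2d(x, k):
    """x: (C, H, W); k: (C, 3, 3). Per-channel 'same' convolution:
    spatial mixing, where each channel only sees its own neighbourhood."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j, None, None] * padded[:, i : i + h, j : j + w]
    return out

def pointwise_conv(x, w):
    """1x1 convolution, i.e. channel mixing: w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

x = np.random.randn(4, 8, 8)
spatial = depthwise_conv2d(x, np.random.randn(4, 3, 3))  # mix nearby patches
mixed = pointwise_conv(spatial, np.random.randn(4, 4))   # mix channels
```

The depth-wise step touches no cross-channel information and the point-wise step touches no spatial information, which is exactly the separation of concerns described above.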
</div>
<div class='code'>
......
......@@ -71,11 +71,10 @@
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/distillation/index.html">Distilling the Knowledge in a Neural Network</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation/tutorial of the paper <a href="https://papers.labml.ai/paper/1503.02531">Distilling the Knowledge in a Neural Network</a>.</p>
<p>This is a method for training a small network using the knowledge in a trained large network; i.e. distilling the knowledge from the large network.</p>
<p>A large model with regularization or an ensemble of models (using dropout) generalizes better than a small model when trained directly on the data and labels. However, a small model can be trained to generalize better with the help of a large model. A smaller model is better in production: faster, less compute, less memory.</p>
<p>The output probabilities of a trained model give more information than the labels, because it assigns non-zero probabilities to incorrect classes as well. These probabilities tell us that a sample has some chance of belonging to certain classes. For instance, when classifying digits, given an image of the digit <em>7</em>, a generalized model gives a high probability to 7 and a small but non-zero probability to 2, while assigning almost zero probability to the other digits. Distillation uses this information to train the small model better.</p>
<p><a href="https://app.labml.ai/run/d6182e2adaf011eb927c91a2a1710932"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen"></a></p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation/tutorial of the paper <a href="https://papers.labml.ai/paper/1503.02531">Distilling the Knowledge in a Neural Network</a>.</p>
<p>This is a method for training a small network using the knowledge in a trained large network; i.e. distilling the knowledge from the large network.</p>
<p>A large model with regularization or an ensemble of models (using dropout) generalizes better than a small model when trained directly on the data and labels. However, a small model can be trained to generalize better with the help of a large model. A smaller model is better in production: faster, less compute, less memory.</p>
<p>The output probabilities of a trained model give more information than the labels, because it assigns non-zero probabilities to incorrect classes as well. These probabilities tell us that a sample has some chance of belonging to certain classes. For instance, when classifying digits, given an image of the digit <em>7</em>, a generalized model gives a high probability to 7 and a small but non-zero probability to 2, while assigning almost zero probability to the other digits. Distillation uses this information to train the small model better.</p>
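The soft-target idea can be sketched as a temperature-scaled distillation loss (a minimal NumPy version; the T² factor keeps gradient magnitudes comparable across temperatures, and the logits and temperature here are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher_T || student_T) * T^2, with both sets of logits softened
    by the temperature T so small probabilities carry more signal."""
    p = softmax(teacher_logits / T)  # soft targets from the large model
    q = softmax(student_logits / T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([6.0, 2.0, 0.5])  # confident about class 0, some mass on class 1
student = np.array([4.0, 1.0, 1.0])
loss = distillation_loss(student, teacher)
```

In practice this term is combined with the ordinary cross-entropy against the true labels, weighted between the two.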
</div>
<div class='code'>
......
......@@ -73,6 +73,7 @@
<p>This is a collection of simple PyTorch implementations of neural networks and related algorithms. <a href="https://github.com/labmlai/annotated_deep_learning_paper_implementations">These implementations</a> are documented with explanations, and the <a href="index.html">website</a> renders them as side-by-side annotated notes. We believe these will help you understand the algorithms better.</p>
<p><img alt="Screenshot" src="dqn-light.png"></p>
<p>We are actively maintaining this repository and adding new implementations. Follow <a href="https://twitter.com/labmlai"><img alt="Twitter" src="https://img.shields.io/twitter/follow/labmlai?style=social"></a> for updates.</p>
<h2>Languages: <strong><a href="https://nn.labml.ai">English</a></strong> | <strong><a href="https://nn.labml.ai/zh/">Chinese (translated)</a></strong></h2>
<h2>Paper Implementations</h2>
<h4><a href="transformers/index.html">Transformers</a></h4>
<ul><li><a href="transformers/mha.html">Multi-Headed Attention</a></li>
......