diff --git a/docs/index.html b/docs/index.html index afc37122dd6e5611e262fb59b990efafe60594d8..8e17ac32124598e6efcbac03fd0e32acba97a6c4 100644 --- a/docs/index.html +++ b/docs/index.html @@ -125,6 +125,7 @@ and

Normalization Layers

Installation

pip install labml_nn
diff --git a/docs/normalization/batch_norm/index.html b/docs/normalization/batch_norm/index.html
index 2fed9a7e2c04abb568e56978a2ae0cf796734fc8..1888ef0f33212b11c7a3640bbe4cffec3f53c86d 100644
--- a/docs/normalization/batch_norm/index.html
+++ b/docs/normalization/batch_norm/index.html
@@ -156,18 +156,21 @@ a CNN classifier that uses batch normalization for the MNIST dataset.

Batch normalization layer $\text{BN}$ normalizes the input $X$ as follows:

When input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations, where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width. $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.

$$\text{BN}(X) = \gamma \frac{X - \underset{B, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{B, H, W}{Var}[X] + \epsilon}} + \beta$$

When input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings, where $B$ is the batch size and $C$ is the number of features. $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.

$$\text{BN}(X) = \gamma \frac{X - \underset{B}{\mathbb{E}}[X]}{\sqrt{\underset{B}{Var}[X] + \epsilon}} + \beta$$

When input $X \in \mathbb{R}^{B \times C \times L}$ is a batch of sequence embeddings, where $B$ is the batch size, $C$ is the number of features, and $L$ is the length of the sequence. $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.

$$\text{BN}(X) = \gamma \frac{X - \underset{B, L}{\mathbb{E}}[X]}{\sqrt{\underset{B, L}{Var}[X] + \epsilon}} + \beta$$
@@ -192,9 +195,9 @@ where $B$ is the batch size, $C$ is the number of features, and $L$ is the lengt

We’ve tried to use the same names for arguments as the PyTorch BatchNorm implementation.

-
129    def __init__(self, channels: int, *,
-130                 eps: float = 1e-5, momentum: float = 0.1,
-131                 affine: bool = True, track_running_stats: bool = True):
+
132    def __init__(self, channels: int, *,
+133                 eps: float = 1e-5, momentum: float = 0.1,
+134                 affine: bool = True, track_running_stats: bool = True):
@@ -205,14 +208,14 @@ where $B$ is the batch size, $C$ is the number of features, and $L$ is the lengt
-
141        super().__init__()
-142
-143        self.channels = channels
-144
-145        self.eps = eps
-146        self.momentum = momentum
-147        self.affine = affine
-148        self.track_running_stats = track_running_stats
+
144        super().__init__()
+145
+146        self.channels = channels
+147
+148        self.eps = eps
+149        self.momentum = momentum
+150        self.affine = affine
+151        self.track_running_stats = track_running_stats
@@ -223,9 +226,9 @@ where $B$ is the batch size, $C$ is the number of features, and $L$ is the lengt

Create parameters for $\gamma$ and $\beta$ for scale and shift

-
150        if self.affine:
-151            self.scale = nn.Parameter(torch.ones(channels))
-152            self.shift = nn.Parameter(torch.zeros(channels))
+
153        if self.affine:
+154            self.scale = nn.Parameter(torch.ones(channels))
+155            self.shift = nn.Parameter(torch.zeros(channels))
@@ -237,9 +240,9 @@ where $B$ is the batch size, $C$ is the number of features, and $L$ is the lengt mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$

-
155        if self.track_running_stats:
-156            self.register_buffer('exp_mean', torch.zeros(channels))
-157            self.register_buffer('exp_var', torch.ones(channels))
+
158        if self.track_running_stats:
+159            self.register_buffer('exp_mean', torch.zeros(channels))
+160            self.register_buffer('exp_var', torch.ones(channels))
@@ -253,7 +256,7 @@ mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$

[batch_size, channels, height, width]

-
159    def forward(self, x: torch.Tensor):
+
162    def forward(self, x: torch.Tensor):
@@ -264,7 +267,7 @@ mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$

Keep the original shape

-
167        x_shape = x.shape
+
170        x_shape = x.shape
@@ -275,7 +278,7 @@ mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$

Get the batch size

-
169        batch_size = x_shape[0]
+
172        batch_size = x_shape[0]
@@ -286,7 +289,7 @@ mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$

Sanity check to make sure the number of features is the same

-
171        assert self.channels == x.shape[1]
+
174        assert self.channels == x.shape[1]
@@ -297,7 +300,7 @@ mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$

Reshape into [batch_size, channels, n]

-
174        x = x.view(batch_size, self.channels, -1)
+
177        x = x.view(batch_size, self.channels, -1)
@@ -309,7 +312,7 @@ mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$

if we are in training mode or if we have not tracked exponential moving averages

-
178        if self.training or not self.track_running_stats:
+
181        if self.training or not self.track_running_stats:
@@ -321,7 +324,7 @@ if we are in training mode or if we have not tracked exponential moving averages i.e. the means for each feature $\mathbb{E}[x^{(k)}]$

-
181            mean = x.mean(dim=[0, 2])
+
184            mean = x.mean(dim=[0, 2])
@@ -333,7 +336,7 @@ i.e. the means for each feature $\mathbb{E}[x^{(k)}]$

i.e. the mean of the squares for each feature $\mathbb{E}[(x^{(k)})^2]$

-
184            mean_x2 = (x ** 2).mean(dim=[0, 2])
+
187            mean_x2 = (x ** 2).mean(dim=[0, 2])
@@ -344,7 +347,7 @@ i.e. the means for each feature $\mathbb{E}[(x^{(k)})^2]$

Variance for each feature $Var[x^{(k)}] = \mathbb{E}[(x^{(k)})^2] - \mathbb{E}[x^{(k)}]^2$

-
186            var = mean_x2 - mean ** 2
+
189            var = mean_x2 - mean ** 2
@@ -355,9 +358,9 @@ i.e. the means for each feature $\mathbb{E}[(x^{(k)})^2]$

Update exponential moving averages

-
189            if self.training and self.track_running_stats:
-190                self.exp_mean = (1 - self.momentum) * self.exp_mean + self.momentum * mean
-191                self.exp_var = (1 - self.momentum) * self.exp_var + self.momentum * var
+
192            if self.training and self.track_running_stats:
+193                self.exp_mean = (1 - self.momentum) * self.exp_mean + self.momentum * mean
+194                self.exp_var = (1 - self.momentum) * self.exp_var + self.momentum * var
@@ -368,9 +371,9 @@ i.e. the means for each feature $\mathbb{E}[(x^{(k)})^2]$

Use exponential moving averages as estimates

-
193        else:
-194            mean = self.exp_mean
-195            var = self.exp_var
+
196        else:
+197            mean = self.exp_mean
+198            var = self.exp_var
@@ -382,7 +385,7 @@ i.e. the means for each feature $\mathbb{E}[(x^{(k)})^2]$

-
198        x_norm = (x - mean.view(1, -1, 1)) / torch.sqrt(var + self.eps).view(1, -1, 1)
+
201        x_norm = (x - mean.view(1, -1, 1)) / torch.sqrt(var + self.eps).view(1, -1, 1)
@@ -394,8 +397,8 @@ i.e. the means for each feature $\mathbb{E}[(x^{(k)})^2]$

-
200        if self.affine:
-201            x_norm = self.scale.view(1, -1, 1) * x_norm + self.shift.view(1, -1, 1)
+
203        if self.affine:
+204            x_norm = self.scale.view(1, -1, 1) * x_norm + self.shift.view(1, -1, 1)
@@ -406,31 +409,49 @@ i.e. the means for each feature $\mathbb{E}[(x^{(k)})^2]$

Reshape to original and return

-
204        return x_norm.view(x_shape)
+
207        return x_norm.view(x_shape)
-
+
+

Simple test

+
+
+
210def _test():
+
+
+
+
+ + +
+
+
214    from labml.logger import inspect
+215
+216    x = torch.zeros([2, 3, 2, 4])
+217    inspect(x.shape)
+218    bn = BatchNorm(3)
+219
+220    x = bn(x)
+221    inspect(x.shape)
+222    inspect(bn.exp_var.shape)
+
+
+
+
+
-
207def _test():
-208    from labml.logger import inspect
-209
-210    x = torch.zeros([2, 3, 2, 4])
-211    inspect(x.shape)
-212    bn = BatchNorm(3)
-213
-214    x = bn(x)
-215    inspect(x.shape)
-216    inspect(bn.exp_var.shape)
-217
-218
-219if __name__ == '__main__':
-220    _test()
+
226if __name__ == '__main__':
+227    _test()
diff --git a/docs/normalization/batch_norm/readme.html b/docs/normalization/batch_norm/readme.html new file mode 100644 index 0000000000000000000000000000000000000000..de16cd155c0ad36c8515292fdb6ff37e8df45b92 --- /dev/null +++ b/docs/normalization/batch_norm/readme.html @@ -0,0 +1,180 @@ + + + + + + + + + + + + + + + + + + + + + + + Batch Normalization + + + + + + + + +
+
+
+
+


+


+
+
+
+
+ +

Batch Normalization

+

This is a PyTorch implementation of Batch Normalization from the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

+

Internal Covariate Shift

+

The paper defines Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. For example, let’s say there are two layers $l_1$ and $l_2$. At the beginning of training, the outputs of $l_1$ (the inputs to $l_2$) could follow the distribution $\mathcal{N}(0.5, 1)$. Then, after some training steps, they could move to $\mathcal{N}(0.6, 1.5)$. This is internal covariate shift.

+

Internal covariate shift will adversely affect training speed because the later layers ($l_2$ in the above example) have to adapt to this shifted distribution.

+

By stabilizing the distribution, batch normalization minimizes the internal covariate shift.

+

Normalization

+

It is known that whitening improves training speed and convergence. Whitening is linearly transforming inputs to have zero mean, unit variance, and be uncorrelated.
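As a concrete illustration (not from the paper or this repository), here is a minimal ZCA-style whitening sketch; the data shape, the random mixing matrix, and the eigenvalue clamp are arbitrary choices for the example:

```python
# Minimal ZCA whitening sketch: center the batch, then rotate/scale it with the
# inverse square root of the feature covariance so the features end up
# uncorrelated with unit variance.
import torch

x = torch.randn(512, 16) @ torch.randn(16, 16)         # [batch, features], correlated features

x_centered = x - x.mean(dim=0)                          # zero mean per feature
cov = (x_centered.T @ x_centered) / (x.shape[0] - 1)    # feature covariance matrix
eigvals, eigvecs = torch.linalg.eigh(cov)               # eigen-decomposition of a symmetric matrix
w = eigvecs @ torch.diag(eigvals.clamp(min=1e-5).rsqrt()) @ eigvecs.T
x_white = x_centered @ w                                # whitened batch

cov_white = (x_white.T @ x_white) / (x.shape[0] - 1)
print(torch.allclose(cov_white, torch.eye(16), atol=1e-3))  # covariance is now ≈ identity
```

The eigen-decomposition, and backpropagating through it during training, is what makes full whitening costly; this motivates the simplifications the paper introduces below.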

+

Normalizing outside gradient computation doesn’t work

+

Normalizing outside the gradient computation using pre-computed (detached) means and variances doesn’t work. For instance (ignoring the variance), let $$\hat{x} = x - \mathbb{E}[x]$$ where $x = u + b$, $b$ is a trained bias, and $\mathbb{E}[x]$ is outside the gradient computation (a pre-computed constant).

+

Note that $\hat{x}$ is not affected by $b$. Therefore, $b$ will increase or decrease based on $\frac{\partial{\mathcal{L}}}{\partial x}$ and keep on growing indefinitely with each training update. The paper notes that similar explosions happen with variances.
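A toy sketch of this failure mode (purely illustrative; the loss, shapes, and learning rate are made up for the example):

```python
# With the mean detached from the graph, the normalized output never changes,
# but the gradient w.r.t. the bias does not vanish, so b drifts every update.
import torch

u = torch.randn(256)                       # fixed layer output
b = torch.zeros(1, requires_grad=True)     # trained bias
opt = torch.optim.SGD([b], lr=0.01)

for step in range(3):
    x = u + b
    x_hat = x - x.mean().detach()          # E[x] treated as a pre-computed constant
    loss = x_hat.sum()                     # any loss that depends only on x_hat
    opt.zero_grad()
    loss.backward()
    opt.step()
    # b keeps moving even though x_hat (and hence the loss) is unchanged
    print(f'step {step}: b = {b.item():+.3f}, loss = {loss.item():.6f}')
```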

+

Batch Normalization

+

Whitening is computationally expensive because you need to de-correlate the inputs, and the gradients must flow through the full whitening calculation.

+

The paper introduces a simplified version, which they call Batch Normalization. The first simplification is to normalize each feature independently to have zero mean and unit variance: $$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$ where $x = (x^{(1)} \dots x^{(d)})$ is the $d$-dimensional input.

+

The second simplification is to use estimates of the mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$ from the mini-batch for normalization, instead of calculating them across the whole dataset.

+

Normalizing each feature to zero mean and unit variance could limit what the layer can represent. As an example, the paper illustrates that if the inputs to a sigmoid are normalized, most of them will fall within the $[-1, 1]$ range, where the sigmoid is roughly linear. To overcome this, each feature is scaled and shifted by two trained parameters $\gamma^{(k)}$ and $\beta^{(k)}$: $$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$ where $y^{(k)}$ is the output of the batch normalization layer.
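A minimal sketch of these two steps on a single mini-batch (the shapes and $\epsilon$ value are illustrative, not from this repository):

```python
# Normalize each feature over the mini-batch, then scale and shift it with the
# learned per-feature parameters gamma and beta.
import torch

x = torch.randn(32, 8)                      # [batch_size, features]
gamma = torch.ones(8, requires_grad=True)   # gamma^(k), learned scale
beta = torch.zeros(8, requires_grad=True)   # beta^(k), learned shift
eps = 1e-5

mean = x.mean(dim=0)                        # E[x^(k)] over the mini-batch
var = x.var(dim=0, unbiased=False)          # Var[x^(k)] over the mini-batch
x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance per feature
y = gamma * x_hat + beta                    # y^(k) = gamma^(k) x_hat^(k) + beta^(k)
```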

+

Note that when applying batch normalization after a linear transform like $Wu + b$, the bias parameter $b$ gets cancelled by the normalization. So you can and should omit the bias parameter in linear transforms right before batch normalization.
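For example, a hypothetical block (not taken from this repository) would disable the bias of the linear layer feeding into batch normalization:

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(256, 128, bias=False),  # any bias here would be cancelled by the mean subtraction
    nn.BatchNorm1d(128),              # beta plays the role of the bias instead
    nn.ReLU(),
)
```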

+

Batch normalization also makes backpropagation invariant to the scale of the weights. Empirically, it improves generalization, so it has regularization effects too.

+

Inference

+

We need to know $\mathbb{E}[x^{(k)}]$ and $Var[x^{(k)}]$ in order to perform the normalization. So during inference, you either need to go through the whole (or a part of the) dataset and find the mean and variance, or you can use estimates calculated during training. The usual practice is to calculate an exponential moving average of the mean and variance during the training phase and use that for inference.
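A condensed sketch of that practice (a stripped-down, function-level version of the module above; the momentum value and shapes are illustrative):

```python
# Track exponential moving averages of the batch statistics during training
# and fall back to them at inference time.
import torch

momentum, eps = 0.1, 1e-5
exp_mean, exp_var = torch.zeros(8), torch.ones(8)    # running estimates per feature

def batch_norm(x: torch.Tensor, training: bool) -> torch.Tensor:
    global exp_mean, exp_var
    if training:
        mean = x.mean(dim=0)
        var = x.var(dim=0, unbiased=False)
        # update the running estimates
        exp_mean = (1 - momentum) * exp_mean + momentum * mean
        exp_var = (1 - momentum) * exp_var + momentum * var
    else:
        mean, var = exp_mean, exp_var                # use the running estimates
    return (x - mean) / torch.sqrt(var + eps)

batch_norm(torch.randn(32, 8), training=True)        # a training step updates the averages
batch_norm(torch.randn(1, 8), training=False)        # inference works even with batch size 1
```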

+

Here’s the training code and a notebook for training a CNN classifier that uses batch normalization for the MNIST dataset.

+

Open In Colab · View Run

+
+
+ +
+
+
+ + + + + + \ No newline at end of file diff --git a/docs/normalization/index.html b/docs/normalization/index.html index a528757d731cc0e931df163002ce1cb7d29c6394..6f4b3143294ce68d669abcd412c26e6236c68618 100644 --- a/docs/normalization/index.html +++ b/docs/normalization/index.html @@ -74,10 +74,10 @@

Normalization Layers

TODO

diff --git a/docs/normalization/layer_norm/index.html b/docs/normalization/layer_norm/index.html index 49c0b303186e09cc0b86f73776eed61f3bd29ab9..b174a6f90b5ade6d84a406f399a937882e9f88e5 100644 --- a/docs/normalization/layer_norm/index.html +++ b/docs/normalization/layer_norm/index.html @@ -88,7 +88,7 @@ large NLP models are usually trained with small batch sizes. on a wider range of settings. Layer normalization transforms the inputs to have zero mean and unit variance across the features. -*Note that batch normalization, fixes the zero mean and unit variance for each vector. +Note that batch normalization fixes the zero mean and unit variance for each element. Layer normalization does it for each batch across all elements.

Layer normalization is generally used for NLP tasks.

We have used layer normalization in most of the transformer implementations. @@ -109,6 +109,29 @@ Layer normalization does it for each batch across all elements.

#

Layer Normalization

+

Layer normalization $\text{LN}$ normalizes the input $X$ as follows:

+

When input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings, where $B$ is the batch size and $C$ is the number of features. $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.

$$\text{LN}(X) = \gamma \frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{Var}[X] + \epsilon}} + \beta$$

When input $X \in \mathbb{R}^{L \times B \times C}$ is a batch of sequences of embeddings, where $B$ is the batch size, $C$ is the number of channels, and $L$ is the length of the sequence. $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$.

$$\text{LN}(X) = \gamma \frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{Var}[X] + \epsilon}} + \beta$$

When input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations, where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width. This is not a widely used scenario. $\gamma \in \mathbb{R}^{C \times H \times W}$ and $\beta \in \mathbb{R}^{C \times H \times W}$.

$$\text{LN}(X) = \gamma \frac{X - \underset{C, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{C, H, W}{Var}[X] + \epsilon}} + \beta$$
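For reference, the built-in `torch.nn.LayerNorm` handles these cases the same way, normalizing over the trailing `normalized_shape` dimensions; the shapes below are only illustrative:

```python
import torch
import torch.nn as nn

seq = torch.randn(10, 32, 512)      # [L, B, C]: a batch of sequences of embeddings
ln_seq = nn.LayerNorm(512)          # gamma and beta have shape [C]
print(ln_seq(seq).shape)            # torch.Size([10, 32, 512])

img = torch.randn(32, 3, 28, 28)    # [B, C, H, W]: a batch of image representations
ln_img = nn.LayerNorm([3, 28, 28])  # gamma and beta have shape [C, H, W]
print(ln_img(img).shape)            # torch.Size([32, 3, 28, 28])
```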

43class LayerNorm(Module):
@@ -120,18 +143,18 @@ Layer normalization does it for each batch across all elements.

#

We’ve tried to use the same names for arguments as the PyTorch LayerNorm implementation.

-
48    def __init__(self, normalized_shape: Union[int, List[int], Size], *,
-49                 eps: float = 1e-5,
-50                 elementwise_affine: bool = True):
+
72    def __init__(self, normalized_shape: Union[int, List[int], Size], *,
+73                 eps: float = 1e-5,
+74                 elementwise_affine: bool = True):
@@ -142,11 +165,11 @@ Layer normalization does it for each batch across all elements.

-
60        super().__init__()
-61
-62        self.normalized_shape = normalized_shape
-63        self.eps = eps
-64        self.elementwise_affine = elementwise_affine
+
84        super().__init__()
+85
+86        self.normalized_shape = normalized_shape
+87        self.eps = eps
+88        self.elementwise_affine = elementwise_affine
@@ -157,9 +180,9 @@ Layer normalization does it for each batch across all elements.

Create parameters for $\gamma$ and $\beta$ for gain and bias

-
66        if self.elementwise_affine:
-67            self.gain = nn.Parameter(torch.ones(normalized_shape))
-68            self.bias = nn.Parameter(torch.zeros(normalized_shape))
+
90        if self.elementwise_affine:
+91            self.gain = nn.Parameter(torch.ones(normalized_shape))
+92            self.bias = nn.Parameter(torch.zeros(normalized_shape))
@@ -173,7 +196,7 @@ Layer normalization does it for each batch across all elements.

[seq_len, batch_size, features]

-
70    def forward(self, x: torch.Tensor):
+
94    def forward(self, x: torch.Tensor):
@@ -181,10 +204,10 @@ Layer normalization does it for each batch across all elements.

-

Keep the original shape

+

Sanity check to make sure the shapes match

-
78        x_shape = x.shape
+
102        assert self.normalized_shape == x.shape[-len(self.normalized_shape):]
@@ -192,10 +215,10 @@ Layer normalization does it for each batch across all elements.

-

Sanity check to make sure the shapes match

+

The dimensions to calculate the mean and variance on

-
80        assert self.normalized_shape == x.shape[-len(self.normalized_shape):]
+
105        dims = [-(i + 1) for i in range(len(self.normalized_shape))]
@@ -203,10 +226,11 @@ Layer normalization does it for each batch across all elements.

-

Reshape into [M, S[0], S[1], ..., S[n]]

+

Calculate the mean of all elements; i.e. the means for each element $\mathbb{E}[X]$

-
83        x = x.view(-1, *self.normalized_shape)
+
109        mean = x.mean(dim=dims, keepdims=True)
@@ -214,11 +238,11 @@ Layer normalization does it for each batch across all elements.

-

Calculate the mean across first dimension; -i.e. the means for each element $\mathbb{E}[X}]$

+

Calculate the squared mean of all elements; i.e. the means for each element $\mathbb{E}[X^2]$

-
87        mean = x.mean(dim=0)
+
112        mean_x2 = (x ** 2).mean(dim=dims, keepdims=True)
@@ -226,11 +250,10 @@ i.e. the means for each element $\mathbb{E}[X}]$

-

Calculate the squared mean across first dimension; -i.e. the means for each element $\mathbb{E}[X^2]$

+

Variance of all elements $Var[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$

-
90        mean_x2 = (x ** 2).mean(dim=0)
+
114        var = mean_x2 - mean ** 2
@@ -238,10 +261,11 @@ i.e. the means for each element $\mathbb{E}[X^2]$

-

Variance for each element $Var[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$

+

Normalize $$\hat{X} = \frac{X - \mathbb{E}[X]}{\sqrt{Var[X] + \epsilon}}$$

-
92        var = mean_x2 - mean ** 2
+
117        x_norm = (x - mean) / torch.sqrt(var + self.eps)
@@ -249,11 +273,12 @@ i.e. the means for each element $\mathbb{E}[X^2]$

-

Normalize +

Scale and shift

-
95        x_norm = (x - mean) / torch.sqrt(var + self.eps)
+
119        if self.elementwise_affine:
+120            x_norm = self.gain * x_norm + self.bias
@@ -261,23 +286,21 @@ i.e. the means for each element $\mathbb{E}[X^2]$

-

Scale and shift -

+
-
97        if self.elementwise_affine:
-98            x_norm = self.gain * x_norm + self.bias
+
123        return x_norm
-
+
-

Reshape to original and return

+

Simple test

-
101        return x_norm.view(x_shape)
+
126def _test():
@@ -288,20 +311,27 @@ i.e. the means for each element $\mathbb{E}[X^2]$

-
104def _test():
-105    from labml.logger import inspect
-106
-107    x = torch.zeros([2, 3, 2, 4])
-108    inspect(x.shape)
-109    ln = LayerNorm(x.shape[2:])
-110
-111    x = ln(x)
-112    inspect(x.shape)
-113    inspect(ln.gain.shape)
-114
-115
-116if __name__ == '__main__':
-117    _test()
+
130    from labml.logger import inspect
+131
+132    x = torch.zeros([2, 3, 2, 4])
+133    inspect(x.shape)
+134    ln = LayerNorm(x.shape[2:])
+135
+136    x = ln(x)
+137    inspect(x.shape)
+138    inspect(ln.gain.shape)
+
+
+
+
+ + +
+
+
142if __name__ == '__main__':
+143    _test()
diff --git a/docs/normalization/layer_norm/readme.html b/docs/normalization/layer_norm/readme.html new file mode 100644 index 0000000000000000000000000000000000000000..d279c77f20d61c3da3e47faa50c192950f287b8e --- /dev/null +++ b/docs/normalization/layer_norm/readme.html @@ -0,0 +1,134 @@ + + + + + + + + + + + + + + + + + + + + + + + Layer Normalization + + + + + + + + +
+
+
+
+


+


+
+
+
+
+ +

Layer Normalization

+

This is a PyTorch implementation of Layer Normalization.

+

Limitations of Batch Normalization

+
  • You need to maintain running means.
  • Tricky for RNNs. Do you need different normalizations for each step?
  • Doesn’t work with small batch sizes; large NLP models are usually trained with small batch sizes.
  • Need to compute means and variances across devices in distributed training.
+

Layer Normalization

+

Layer normalization is a simpler normalization method that works on a wider range of settings. Layer normalization transforms the inputs to have zero mean and unit variance across the features. Note that batch normalization fixes the zero mean and unit variance for each feature across the batch, while layer normalization does it for each sample across all the features.
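A quick sketch of the difference, using the stock PyTorch layers on an illustrative shape:

```python
# Batch norm computes statistics over the batch dimension for each feature;
# layer norm computes them over the feature dimension for each sample.
import torch
import torch.nn as nn

x = torch.randn(32, 8)        # [batch, features]

bn = nn.BatchNorm1d(8)        # statistics over dim 0 (the batch)
ln = nn.LayerNorm(8)          # statistics over dim 1 (the features)

print(bn(x).mean(dim=0))      # ≈ 0 for every feature
print(ln(x).mean(dim=1))      # ≈ 0 for every sample
```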

+

Layer normalization is generally used for NLP tasks.

+

We have used layer normalization in most of the transformer implementations.

+
+
+ +
+
+
+ + + + + + \ No newline at end of file diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 048381dab00e43ab9d3e2337c154c4c14b9172ab..56e8e87cce17ccbda65d304d2918b74fc5fb0466 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -50,7 +50,7 @@ https://nn.labml.ai/activations/swish.html - 2021-01-25T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -83,6 +83,13 @@ + + https://nn.labml.ai/normalization/layer_norm/index.html + 2021-02-02T16:30:00+00:00 + 1.00 + + + https://nn.labml.ai/normalization/index.html 2021-02-01T16:30:00+00:00 @@ -183,7 +190,7 @@ https://nn.labml.ai/optimizers/mnist_experiment.html - 2020-12-10T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -225,7 +232,7 @@ https://nn.labml.ai/transformers/knn/train_model.html - 2021-01-25T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -253,7 +260,7 @@ https://nn.labml.ai/transformers/models.html - 2021-02-01T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -267,14 +274,14 @@ https://nn.labml.ai/transformers/gpt/index.html - 2021-02-01T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 https://nn.labml.ai/transformers/feed_forward.html - 2021-01-30T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -295,7 +302,7 @@ https://nn.labml.ai/transformers/feedback/index.html - 2021-02-01T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -309,7 +316,7 @@ https://nn.labml.ai/transformers/feedback/experiment.html - 2021-01-29T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -330,14 +337,14 @@ https://nn.labml.ai/transformers/glu_variants/experiment.html - 2021-01-26T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 https://nn.labml.ai/transformers/glu_variants/simple.html - 2021-01-26T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -358,7 +365,7 @@ https://nn.labml.ai/transformers/switch/index.html - 2021-02-01T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 @@ -372,28 +379,28 @@ https://nn.labml.ai/transformers/switch/experiment.html - 2021-01-25T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 https://nn.labml.ai/transformers/positional_encoding.html - 2021-01-07T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 https://nn.labml.ai/transformers/label_smoothing_loss.html - 2020-12-10T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 https://nn.labml.ai/transformers/mha.html - 2021-02-01T16:30:00+00:00 + 2021-02-02T16:30:00+00:00 1.00 diff --git a/labml_nn/__init__.py b/labml_nn/__init__.py index c6e44cfd1077eb2338f129678c091357cd589c66..f46c8fc08371304e445f4eda0ce602369cdab6c6 100644 --- a/labml_nn/__init__.py +++ b/labml_nn/__init__.py @@ -60,6 +60,7 @@ and #### ✨ [Normalization Layers](https://nn.labml.ai/normalization/index.html) * [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html) +* [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html) ### Installation diff --git a/labml_nn/normalization/__init__.py b/labml_nn/normalization/__init__.py index 986d9c754d29aa8673481929e85c302226409040..ac254aacd92712774823fa01f2c92ec00908526e 100644 --- a/labml_nn/normalization/__init__.py +++ b/labml_nn/normalization/__init__.py @@ -8,10 +8,10 @@ summary: > # Normalization Layers * [Batch Normalization](batch_norm/index.html) +* [Layer Normalization](layer_norm/index.html) *TODO* -* Layer Normalization * Instance Normalization * Group Normalization """ \ No newline at end of file diff --git a/labml_nn/normalization/batch_norm/__init__.py b/labml_nn/normalization/batch_norm/__init__.py index f64a1475659a8055d74e2d18acd610e790afd224..0e914a0c9c4589e65f9fbdf0ce2aa52e87899646 100644 --- 
a/labml_nn/normalization/batch_norm/__init__.py +++ b/labml_nn/normalization/batch_norm/__init__.py @@ -109,18 +109,21 @@ class BatchNorm(Module): When input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations, where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width. + $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$. $$\text{BN}(X) = \gamma \frac{X - \underset{B, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{B, H, W}{Var}[X] + \epsilon}} + \beta$$ - When input $X \in \mathbb{R}^{B \times C}$ is a batch of vector embeddings, + When input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings, where $B$ is the batch size and $C$ is the number of features. + $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$. $$\text{BN}(X) = \gamma \frac{X - \underset{B}{\mathbb{E}}[X]}{\sqrt{\underset{B}{Var}[X] + \epsilon}} + \beta$$ - When input $X \in \mathbb{R}^{B \times C \times L}$ is a batch of sequence embeddings, + When input $X \in \mathbb{R}^{B \times C \times L}$ is a batch of a sequence embeddings, where $B$ is the batch size, $C$ is the number of features, and $L$ is the length of the sequence. + $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$. $$\text{BN}(X) = \gamma \frac{X - \underset{B, L}{\mathbb{E}}[X]}{\sqrt{\underset{B, L}{Var}[X] + \epsilon}} + \beta$$ @@ -205,6 +208,9 @@ class BatchNorm(Module): def _test(): + """ + Simple test + """ from labml.logger import inspect x = torch.zeros([2, 3, 2, 4]) @@ -216,5 +222,6 @@ def _test(): inspect(bn.exp_var.shape) +# if __name__ == '__main__': _test() diff --git a/labml_nn/normalization/batch_norm/readme.md b/labml_nn/normalization/batch_norm/readme.md new file mode 100644 index 0000000000000000000000000000000000000000..573c079b0003138c1f0226adf0fcbfbc88cf4e5d --- /dev/null +++ b/labml_nn/normalization/batch_norm/readme.md @@ -0,0 +1,88 @@ +# [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html) + +This is a [PyTorch](https://pytorch.org) implementation of Batch Normalization from paper + [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167). + +### Internal Covariate Shift + +The paper defines *Internal Covariate Shift* as the change in the +distribution of network activations due to the change in +network parameters during training. +For example, let's say there are two layers $l_1$ and $l_2$. +During the beginning of the training $l_1$ outputs (inputs to $l_2$) +could be in distribution $\mathcal{N}(0.5, 1)$. +Then, after some training steps, it could move to $\mathcal{N}(0.5, 1)$. +This is *internal covariate shift*. + +Internal covariate shift will adversely affect training speed because the later layers +($l_2$ in the above example) has to adapt to this shifted distribution. + +By stabilizing the distribution batch normalization minimizes the internal covariate shift. + +## Normalization + +It is known that whitening improves training speed and convergence. +*Whitening* is linearly transforming inputs to have zero mean, unit variance, +and be uncorrelated. + +### Normalizing outside gradient computation doesn't work + +Normalizing outside the gradient computation using pre-computed (detached) +means and variances doesn't work. For instance. (ignoring variance), let +$$\hat{x} = x - \mathbb{E}[x]$$ +where $x = u + b$ and $b$ is a trained bias. +and $\mathbb{E}[x]$ is outside gradient computation (pre-computed constant). 
+ +Note that $\hat{x}$ has no effect of $b$. +Therefore, +$b$ will increase or decrease based +$\frac{\partial{\mathcal{L}}}{\partial x}$, +and keep on growing indefinitely in each training update. +The paper notes that similar explosions happen with variances. + +### Batch Normalization + +Whitening is computationally expensive because you need to de-correlate and +the gradients must flow through the full whitening calculation. + +The paper introduces simplified version which they call *Batch Normalization*. +First simplification is that it normalizes each feature independently to have +zero mean and unit variance: +$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$ +where $x = (x^{(1)} ... x^{(d)})$ is the $d$-dimensional input. + +The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$ +and variance $Var[x^{(k)}]$ from the mini-batch +for normalization; instead of calculating the mean and variance across whole dataset. + +Normalizing each feature to zero mean and unit variance could affect what the layer +can represent. +As an example paper illustrates that, if the inputs to a sigmoid are normalized +most of it will be within $[-1, 1]$ range where the sigmoid is linear. +To overcome this each feature is scaled and shifted by two trained parameters +$\gamma^{(k)}$ and $\beta^{(k)}$. +$$y^{(k)} =\gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$ +where $y^{(k)}$ is the output of the batch normalization layer. + +Note that when applying batch normalization after a linear transform +like $Wu + b$ the bias parameter $b$ gets cancelled due to normalization. +So you can and should omit bias parameter in linear transforms right before the +batch normalization. + +Batch normalization also makes the back propagation invariant to the scale of the weights. +And empirically it improves generalization, so it has regularization effects too. + +## Inference + +We need to know $\mathbb{E}[x^{(k)}]$ and $Var[x^{(k)}]$ in order to +perform the normalization. +So during inference, you either need to go through the whole (or part of) dataset +and find the mean and variance, or you can use an estimate calculated during training. +The usual practice is to calculate an exponential moving average of +mean and variance during the training phase and use that for inference. + +Here's [the training code](https://nn.labml.ai/normalization/layer_norm/mnist.html) and a notebook for training +a CNN classifier that use batch normalization for MNIST dataset. + +[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb) +[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=011254fe647011ebbb8e0242ac1c0002) diff --git a/labml_nn/normalization/layer_norm/__init__.py b/labml_nn/normalization/layer_norm/__init__.py index b711e81add018854fca88a131192dafbe47f69b0..6d913197ec125c7a139ebf9f5888be11349cc266 100644 --- a/labml_nn/normalization/layer_norm/__init__.py +++ b/labml_nn/normalization/layer_norm/__init__.py @@ -24,7 +24,7 @@ Layer normalization is a simpler normalization method that works on a wider range of settings. Layer normalization transformers the inputs to have zero mean and unit variance across the features. -*Note that batch normalization, fixes the zero mean and unit variance for each vector. 
+*Note that batch normalization fixes the zero mean and unit variance for each element.* Layer normalization does it for each batch across all elements. Layer normalization is generally used for NLP tasks. @@ -41,18 +41,42 @@ from labml_helpers.module import Module class LayerNorm(Module): - """ + r""" ## Layer Normalization + + Layer normalization $\text{LN}$ normalizes the input $X$ as follows: + + When input $X \in \mathbb{R}^{B \times C}$ is a batch of embeddings, + where $B$ is the batch size and $C$ is the number of features. + $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$. + $$\text{LN}(X) = \gamma + \frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{Var}[X] + \epsilon}} + + \beta$$ + + When input $X \in \mathbb{R}^{L \times B \times C}$ is a batch of a sequence of embeddings, + where $B$ is the batch size, $C$ is the number of channels, $L$ is the length of the sequence. + $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$. + $$\text{LN}(X) = \gamma + \frac{X - \underset{C}{\mathbb{E}}[X]}{\sqrt{\underset{C}{Var}[X] + \epsilon}} + + \beta$$ + + When input $X \in \mathbb{R}^{B \times C \times H \times W}$ is a batch of image representations, + where $B$ is the batch size, $C$ is the number of channels, $H$ is the height and $W$ is the width. + This is not a widely used scenario. + $\gamma \in \mathbb{R}^{C \times H \times W}$ and $\beta \in \mathbb{R}^{C \times H \times W}$. + $$\text{LN}(X) = \gamma + \frac{X - \underset{C, H, W}{\mathbb{E}}[X]}{\sqrt{\underset{C, H, W}{Var}[X] + \epsilon}} + + \beta$$ """ def __init__(self, normalized_shape: Union[int, List[int], Size], *, eps: float = 1e-5, elementwise_affine: bool = True): """ - * `normalized_shape` $S$ is shape of the elements (except the batch). + * `normalized_shape` $S$ is the shape of the elements (except the batch). The input should then be $X \in \mathbb{R}^{* \times S[0] \times S[1] \times ... \times S[n]}$ - * `eps` is $\epsilon$, used in $\sqrt{Var[X}] + \epsilon}$ for numerical stability + * `eps` is $\epsilon$, used in $\sqrt{Var[X] + \epsilon}$ for numerical stability * `elementwise_affine` is whether to scale and shift the normalized value We've tried to use the same names for arguments as PyTorch `LayerNorm` implementation. @@ -74,34 +98,35 @@ class LayerNorm(Module): For example, in an NLP task this will be `[seq_len, batch_size, features]` """ - # Keep the original shape - x_shape = x.shape # Sanity check to make sure the shapes match assert self.normalized_shape == x.shape[-len(self.normalized_shape):] - # Reshape into `[M, S[0], S[1], ..., S[n]]` - x = x.view(-1, *self.normalized_shape) + # The dimensions to calculate the mean and variance on + dims = [-(i + 1) for i in range(len(self.normalized_shape))] - # Calculate the mean across first dimension; - # i.e. the means for each element $\mathbb{E}[X}]$ - mean = x.mean(dim=0) - # Calculate the squared mean across first dimension; + # Calculate the mean of all elements; + # i.e. the means for each element $\mathbb{E}[X]$ + mean = x.mean(dim=dims, keepdims=True) + # Calculate the squared mean of all elements; # i.e. 
the means for each element $\mathbb{E}[X^2]$ - mean_x2 = (x ** 2).mean(dim=0) - # Variance for each element $Var[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$ + mean_x2 = (x ** 2).mean(dim=dims, keepdims=True) + # Variance of all element $Var[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$ var = mean_x2 - mean ** 2 - # Normalize $$\hat{X} = \frac{X} - \mathbb{E}[X]}{\sqrt{Var[X] + \epsilon}}$$ + # Normalize $$\hat{X} = \frac{X - \mathbb{E}[X]}{\sqrt{Var[X] + \epsilon}}$$ x_norm = (x - mean) / torch.sqrt(var + self.eps) # Scale and shift $$\text{LN}(x) = \gamma \hat{X} + \beta$$ if self.elementwise_affine: x_norm = self.gain * x_norm + self.bias - # Reshape to original and return - return x_norm.view(x_shape) + # + return x_norm def _test(): + """ + Simple test + """ from labml.logger import inspect x = torch.zeros([2, 3, 2, 4]) @@ -113,5 +138,6 @@ def _test(): inspect(ln.gain.shape) +# if __name__ == '__main__': _test() diff --git a/labml_nn/normalization/layer_norm/readme.md b/labml_nn/normalization/layer_norm/readme.md new file mode 100644 index 0000000000000000000000000000000000000000..2be7ecd58a30ed73100030f81a6b6064c6a3d924 --- /dev/null +++ b/labml_nn/normalization/layer_norm/readme.md @@ -0,0 +1,26 @@ +# [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html) + +This is a [PyTorch](https://pytorch.org) implementation of +[Layer Normalization](https://arxiv.org/abs/1607.06450). + +### Limitations of [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html) + +* You need to maintain running means. +* Tricky for RNNs. Do you need different normalizations for each step? +* Doesn't work with small batch sizes; +large NLP models are usually trained with small batch sizes. +* Need to compute means and variances across devices in distributed training + +## Layer Normalization + +Layer normalization is a simpler normalization method that works +on a wider range of settings. +Layer normalization transformers the inputs to have zero mean and unit variance +across the features. +*Note that batch normalization fixes the zero mean and unit variance for each element.* +Layer normalization does it for each batch across all elements. + +Layer normalization is generally used for NLP tasks. + +We have used layer normalization in most of the +[transformer implementations](https://nn.labml.ai/transformers/gpt/index.html). \ No newline at end of file diff --git a/readme.md b/readme.md index 6fa6ad7d7ebf4673f0369c4341abf27e1091ba12..7bd188a401f719a3c9e97251ab168dfeacc74a7f 100644 --- a/readme.md +++ b/readme.md @@ -66,6 +66,7 @@ and #### ✨ [Normalization Layers](https://nn.labml.ai/normalization/index.html) * [Batch Normalization](https://nn.labml.ai/normalization/batch_norm/index.html) +* [Layer Normalization](https://nn.labml.ai/normalization/layer_norm/index.html) ### Installation