Commit bbd442d4 authored by Varuna Jayasiri

transformer xl links

Parent ab01567b
......@@ -84,7 +84,10 @@ implementations.</p>
<ul>
<li><a href="transformers/mha.html">Multi-headed attention</a></li>
<li><a href="transformers/models.html">Transformer building blocks</a></li>
<li><a href="transformers/xl/relative_mha.html">Relative multi-headed attention</a>.</li>
<li><a href="transformers/xl/index.html">Transformer XL</a><ul>
<li><a href="transformers/xl/relative_mha.html">Relative multi-headed attention</a></li>
</ul>
</li>
<li><a href="transformers/gpt/index.html">GPT Architecture</a></li>
<li><a href="transformers/glu_variants/simple.html">GLU Variants</a></li>
<li><a href="transformers/knn/index.html">kNN-LM: Generalization through Memorization</a></li>
......
......@@ -426,6 +426,13 @@
</url>
<url>
<loc>https://nn.labml.ai/transformers/xl/experiment.html</loc>
<lastmod>2021-02-07T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://nn.labml.ai/transformers/xl/index.html</loc>
<lastmod>2021-02-07T16:30:00+00:00</lastmod>
......
......@@ -78,10 +78,12 @@ from paper <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need<
and derivatives and enhancements of it.</p>
<ul>
<li><a href="mha.html">Multi-head attention</a></li>
<li><a href="xl/relative_mha.html">Relative multi-head attention</a></li>
<li><a href="models.html">Transformer Encoder and Decoder Models</a></li>
<li><a href="positional_encoding.html">Fixed positional encoding</a></li>
</ul>
<h2><a href="xl/index.html">Transformer XL</a></h2>
<p>This implements the Transformer XL model using
<a href="xl/relative_mha.html">relative multi-head attention</a></p>
<h2><a href="gpt">GPT Architecture</a></h2>
<p>This is an implementation of GPT-2 architecture.</p>
<h2><a href="glu_variants/simple.html">GLU Variants</a></h2>
......@@ -100,10 +102,10 @@ Our implementation only has a few million parameters and doesn&rsquo;t do model
It does single GPU training but we implement the concept of switching as described in the paper.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">49</span><span></span><span class="kn">from</span> <span class="nn">.configs</span> <span class="kn">import</span> <span class="n">TransformerConfigs</span>
<span class="lineno">50</span><span class="kn">from</span> <span class="nn">.models</span> <span class="kn">import</span> <span class="n">TransformerLayer</span><span class="p">,</span> <span class="n">Encoder</span><span class="p">,</span> <span class="n">Decoder</span><span class="p">,</span> <span class="n">Generator</span><span class="p">,</span> <span class="n">EncoderDecoder</span>
<span class="lineno">51</span><span class="kn">from</span> <span class="nn">.mha</span> <span class="kn">import</span> <span class="n">MultiHeadAttention</span>
<span class="lineno">52</span><span class="kn">from</span> <span class="nn">labml_nn.transformers.xl.relative_mha</span> <span class="kn">import</span> <span class="n">RelativeMultiHeadAttention</span></pre></div>
<div class="highlight"><pre><span class="lineno">52</span><span></span><span class="kn">from</span> <span class="nn">.configs</span> <span class="kn">import</span> <span class="n">TransformerConfigs</span>
<span class="lineno">53</span><span class="kn">from</span> <span class="nn">.models</span> <span class="kn">import</span> <span class="n">TransformerLayer</span><span class="p">,</span> <span class="n">Encoder</span><span class="p">,</span> <span class="n">Decoder</span><span class="p">,</span> <span class="n">Generator</span><span class="p">,</span> <span class="n">EncoderDecoder</span>
<span class="lineno">54</span><span class="kn">from</span> <span class="nn">.mha</span> <span class="kn">import</span> <span class="n">MultiHeadAttention</span>
<span class="lineno">55</span><span class="kn">from</span> <span class="nn">labml_nn.transformers.xl.relative_mha</span> <span class="kn">import</span> <span class="n">RelativeMultiHeadAttention</span></pre></div>
</div>
</div>
</div>
......
This diff is collapsed.
......@@ -17,7 +17,8 @@ implementations.
* [Multi-headed attention](transformers/mha.html)
* [Transformer building blocks](transformers/models.html)
* [Relative multi-headed attention](transformers/xl/relative_mha.html).
* [Transformer XL](transformers/xl/index.html)
* [Relative multi-headed attention](transformers/xl/relative_mha.html)
* [GPT Architecture](transformers/gpt/index.html)
* [GLU Variants](transformers/glu_variants/simple.html)
* [kNN-LM: Generalization through Memorization](transformers/knn/index.html)
......
......@@ -14,10 +14,13 @@ from paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762),
and derivatives and enhancements of it.
* [Multi-head attention](mha.html)
* [Relative multi-head attention](xl/relative_mha.html)
* [Transformer Encoder and Decoder Models](models.html)
* [Fixed positional encoding](positional_encoding.html)
## [Transformer XL](xl/index.html)
This implements the Transformer XL model using
[relative multi-head attention](xl/relative_mha.html)
## [GPT Architecture](gpt)
This is an implementation of GPT-2 architecture.
......
......@@ -30,7 +30,6 @@ Here's [the training code](experiment.html) and a notebook for training a transf
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb)
[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002)
"""
......
# Transformer XL
This is an implementation of
[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
in [PyTorch](https://pytorch.org).
The vanilla transformer has a limited attention span,
equal to the length of the sequence trained in parallel.
All these positions have a fixed positional encoding.
Transformer XL increases this attention span by letting
each position pay attention to precalculated past embeddings.
For instance, if the context length is $l$, it keeps the embeddings of
all layers for the previous batch of length $l$ and feeds them to the current step.
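
A minimal sketch of this memory mechanism (illustrative only: it uses PyTorch's
standard `nn.MultiheadAttention` instead of the relative attention the model
actually needs, and `MEM_LEN` and the shapes are made-up values):

```python
import torch
import torch.nn as nn

MEM_LEN = 32                          # illustrative memory length
attn = nn.MultiheadAttention(embed_dim=128, num_heads=4)

def segment_step(x, mem):
    # x:   [seq_len, batch, 128]  embeddings for the current segment
    # mem: [mem_len, batch, 128]  embeddings saved from previous segments
    ctx = torch.cat([mem, x], dim=0)   # keys/values span memory + current segment
    out, _ = attn(x, ctx, ctx)         # queries come only from the current segment
    new_mem = ctx[-MEM_LEN:].detach()  # keep the latest steps; no gradients flow back
    return out, new_mem
```

In the actual model a memory like this is kept for every layer, and gradients
never propagate into it.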
If we used fixed positional encodings, these pre-calculated embeddings would have
the same positions as the current context.
Transformer XL therefore introduces relative positional encoding, where the
positional information is injected at the attention calculation instead.
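
Here is the gist of the relative attention score as a deliberately naive,
unbatched, single-head sketch (the function name and shapes are illustrative;
the real implementation vectorizes the position term with a shift trick):

```python
import torch

def relative_scores(q, k, rel_emb, u, v):
    # q:       [L_q, d] queries for the current segment
    # k:       [L_k, d] keys over memory + current segment (L_k >= L_q)
    # rel_emb: [L_k, d] embedding r_d for each relative distance d = 0..L_k-1
    # u, v:    [d]      learned global content / position biases
    L_q, L_k = q.size(0), k.size(0)
    mem_len = L_k - L_q
    scores = torch.full((L_q, L_k), float('-inf'))
    for i in range(L_q):
        for j in range(L_k):
            d = (i + mem_len) - j      # how far key j lies in the past
            if d >= 0:                 # causal: future keys stay masked at -inf
                # content term (q_i + u) . k_j  plus  position term (q_i + v) . r_d
                scores[i, j] = (q[i] + u) @ k[j] + (q[i] + v) @ rel_emb[d]
    return scores                      # softmax over the last dim gives weights
```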
The annotated implementation of relative multi-headed attention is in [`relative_mha.py`](relative_mha.html).
Here's [the training code](experiment.html) and a notebook for training a Transformer XL model on the Tiny Shakespeare dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb)
[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002)
......@@ -23,7 +23,8 @@ implementations almost weekly.
* [Multi-headed attention](https://nn.labml.ai/transformers/mha.html)
* [Transformer building blocks](https://nn.labml.ai/transformers/models.html)
* [Relative multi-headed attention](https://nn.labml.ai/transformers/xl/relative_mha.html).
* [Transformer XL](https://nn.labml.ai/transformers/xl/index.html)
* [Relative multi-headed attention](https://nn.labml.ai/transformers/xl/relative_mha.html)
* [GPT Architecture](https://nn.labml.ai/transformers/gpt/index.html)
* [GLU Variants](https://nn.labml.ai/transformers/glu_variants/simple.html)
* [kNN-LM: Generalization through Memorization](https://nn.labml.ai/transformers/knn)
......