Commit 3f7ce825 authored by Varuna Jayasiri

typos switch transformer

Parent 5aa62bed
......@@ -379,7 +379,7 @@
<url>
<loc>https://nn.labml.ai/transformers/switch/index.html</loc>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<lastmod>2021-02-10T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......
......@@ -77,16 +77,16 @@
<a href="https://arxiv.org/abs/2101.03961">Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</a>.
Our implementation only has a few million parameters and doesn&rsquo;t do model parallel distributed training.
It does single GPU training, but we implement the concept of switching as described in the paper.</p>
<p>The Switch Transformer uses different parameters for each token by switching among parameters,
based on the token. So only a fraction of parameters is chosen for each token, so you
<p>The Switch Transformer uses different parameters for each token by switching among parameters
based on the token. Therefore, only a fraction of the parameters are chosen for each token, so you
can have more parameters but less computational cost.</p>
<p>The switching happens at the Position-wise Feedforward network (FFN) of each transformer block.
Position-wise feedforward network is a two sequentially fully connected layers.
The position-wise feedforward network consists of two fully connected layers applied in sequence.
In the switch transformer we have multiple FFNs (multiple experts),
and we choose which one to use based on a router.
The outputs a set of probabilities for picking a FFN,
and we pick the one with the highest probability and only evaluates that.
So essentially the computational cost is same as having a single FFN.
The router outputs a set of probabilities for picking an FFN,
and we pick the one with the highest probability and only evaluate that.
So essentially the computational cost is the same as having a single FFN.
In our implementation this doesn&rsquo;t parallelize well when you have many or large FFNs since it&rsquo;s all
happening on a single GPU.
In a distributed setup you would have each FFN (each very large) on a different device.</p>
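<p>To make the routing concrete, here is a minimal sketch of a switch FFN in PyTorch.
It is an illustration under assumptions rather than the repository's code: the names
<code>SwitchFFN</code>, <code>d_model</code>, <code>d_ff</code> and <code>n_experts</code>
are ours, and expert capacity and token dropping are omitted.</p>
<pre><code class="python">
import torch
import torch.nn as nn


class SwitchFFN(nn.Module):
    """Illustrative top-1 routed FFN; not the repository's exact module."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # One position-wise FFN (two fully connected layers) per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff),
                          nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])
        # The router is a single linear layer giving one logit per expert.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, d_model]; flatten to individual tokens.
        seq_len, batch_size, d_model = x.shape
        tokens = x.reshape(-1, d_model)
        # Routing probabilities over experts; pick the top-1 expert per token.
        route_prob = torch.softmax(self.router(tokens), dim=-1)
        max_prob, routes = route_prob.max(dim=-1)
        out = tokens.new_zeros(tokens.shape)
        for i, expert in enumerate(self.experts):
            idx = torch.nonzero(routes == i).squeeze(-1)
            if idx.numel() > 0:
                # Only the chosen expert is evaluated for these tokens.
                out[idx] = expert(tokens[idx])
        # Scale by the routing probability so the router receives gradients.
        out = out * max_prob.unsqueeze(-1)
        return out.reshape(seq_len, batch_size, d_model)
</code></pre>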
......@@ -460,7 +460,7 @@ We route to the expert with highest probability</p>
* the final output
* number of tokens routed to each expert
* sum of probabilities for each expert
* number of tokens dropped
* number of tokens dropped.
These are used for the load balancing loss and logging, as sketched below.</p>
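<p>A hedged sketch of how these statistics could feed the load balancing loss from the
paper, N · Σ<sub>i</sub> f<sub>i</sub> · P<sub>i</sub>, where f<sub>i</sub> is the fraction
of tokens routed to expert i and P<sub>i</sub> is the mean router probability for expert i;
the function and argument names here are assumptions, not the repository's API:</p>
<pre><code class="python">
import torch


def load_balancing_loss(counts: torch.Tensor,
                        route_prob_sum: torch.Tensor,
                        n_tokens: int) -> torch.Tensor:
    # counts[i] is the number of tokens routed to expert i;
    # route_prob_sum[i] is the sum of router probabilities for expert i.
    n_experts = counts.numel()
    # f_i: fraction of tokens dispatched to expert i.
    f = counts.float() / n_tokens
    # P_i: mean router probability assigned to expert i.
    p = route_prob_sum / n_tokens
    # This is minimized by uniform routing; scale it by a small coefficient
    # before adding it to the main loss.
    return n_experts * (f * p).sum()
</code></pre>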
</div>
<div class='code'>
......@@ -473,7 +473,7 @@ These are used for the load balancing loss and logging</p>
<a href='#section-30'>#</a>
</div>
<h1>Switch Transformer Block</h1>
<p>This is same as <a href="../models.html#TransformerLayer">normal transformer block</a>
<p>This is the same as the <a href="../models.html#TransformerLayer">normal transformer block</a>,
but it handles the extra outputs of the switch feedforward module.</p>
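<p>A minimal sketch of such a block, assuming the switch FFN returns the four values listed
earlier (output, token counts, probability sums, dropped tokens); the module names and the
use of <code>nn.MultiheadAttention</code> are our assumptions, not the repository's API:</p>
<pre><code class="python">
import torch
import torch.nn as nn


class SwitchTransformerBlock(nn.Module):
    """Illustrative block; assumes ffn returns the four routing outputs."""

    def __init__(self, d_model: int, n_heads: int, ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        self.ffn = ffn

    def forward(self, x: torch.Tensor):
        # Standard self-attention sub-layer with a residual connection.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm_attn(x + attn_out)
        # The switch FFN also returns routing statistics.
        ffn_out, counts, route_prob, n_dropped = self.ffn(x)
        x = self.norm_ffn(x + ffn_out)
        # Return the statistics so the trainer can add the balancing loss.
        return x, counts, route_prob, n_dropped
</code></pre>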
</div>
<div class='code'>
......