Commit 3f7ce825 authored by Varuna Jayasiri

typos switch transformer

Parent 5aa62bed
......@@ -379,7 +379,7 @@
<url>
<loc>https://nn.labml.ai/transformers/switch/index.html</loc>
<lastmod>2021-02-02T16:30:00+00:00</lastmod>
<lastmod>2021-02-10T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
......
......@@ -77,16 +77,16 @@
<a href="https://arxiv.org/abs/2101.03961">Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</a>.
Our implementation only has a few million parameters and doesn&rsquo;t do model parallel distributed training.
It does single GPU training, but we implement the concept of switching as described in the paper.</p>
<p>The Switch Transformer uses different parameters for each token by switching among parameters,
based on the token. So only a fraction of parameters is chosen for each token, so you
<p>The Switch Transformer uses different parameters for each token by switching among parameters
based on the token. Therefore, only a fraction of the parameters are chosen for each token, so you
can have more parameters but less computational cost.</p>
<p>The switching happens at the Position-wise Feedforward network (FFN) of each transformer block.
Position-wise feedforward network is a two sequentially fully connected layers.
The position-wise feedforward network consists of two fully connected layers applied in sequence.
In the switch transformer we have multiple FFNs (multiple experts),
and we choose which one to use based on a router.
The outputs a set of probabilities for picking a FFN,
and we pick the one with the highest probability and only evaluates that.
So essentially the computational cost is same as having a single FFN.
The router outputs a set of probabilities for picking an FFN,
and we pick the one with the highest probability and only evaluate that.
So essentially the computational cost is the same as having a single FFN.
In our implementation this doesn&rsquo;t parallelize well when you have many or large FFNs since it&rsquo;s all
happening on a single GPU.
In a distributed setup you would have each FFN (each very large) on a different device.</p>
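<p>To make the routing concrete, here is a minimal sketch of a switch FFN in PyTorch.
It is an illustration under assumptions rather than the repository's code: the names
<code>SwitchFFN</code>, <code>d_model</code>, <code>d_ff</code> and <code>n_experts</code>
are ours, and expert capacity and token dropping are omitted.</p>
<pre><code class="python">
import torch
import torch.nn as nn


class SwitchFFN(nn.Module):
    """Illustrative top-1 routed FFN; not the repository's exact module."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # One position-wise FFN (two fully connected layers) per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff),
                          nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])
        # The router is a single linear layer giving one logit per expert.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, d_model]; flatten to individual tokens.
        seq_len, batch_size, d_model = x.shape
        tokens = x.reshape(-1, d_model)
        # Routing probabilities over experts; pick the top-1 expert per token.
        route_prob = torch.softmax(self.router(tokens), dim=-1)
        max_prob, routes = route_prob.max(dim=-1)
        out = tokens.new_zeros(tokens.shape)
        for i, expert in enumerate(self.experts):
            idx = torch.nonzero(routes == i).squeeze(-1)
            if idx.numel() > 0:
                # Only the chosen expert is evaluated for these tokens.
                out[idx] = expert(tokens[idx])
        # Scale by the routing probability so the router receives gradients.
        out = out * max_prob.unsqueeze(-1)
        return out.reshape(seq_len, batch_size, d_model)
</code></pre>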
......@@ -460,7 +460,7 @@ We route to the expert with highest probability</p>
* the final output
* number of tokens routed to each expert
* sum of probabilities for each expert
* number of tokens dropped
* number of tokens dropped.
These are used for the load balancing loss and logging, as sketched below.</p>
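<p>A hedged sketch of how these statistics could feed the load balancing loss from the
paper, N · Σ<sub>i</sub> f<sub>i</sub> · P<sub>i</sub>, where f<sub>i</sub> is the fraction
of tokens routed to expert i and P<sub>i</sub> is the mean router probability for expert i;
the function and argument names here are assumptions, not the repository's API:</p>
<pre><code class="python">
import torch


def load_balancing_loss(counts: torch.Tensor,
                        route_prob_sum: torch.Tensor,
                        n_tokens: int) -> torch.Tensor:
    # counts[i] is the number of tokens routed to expert i;
    # route_prob_sum[i] is the sum of router probabilities for expert i.
    n_experts = counts.numel()
    # f_i: fraction of tokens dispatched to expert i.
    f = counts.float() / n_tokens
    # P_i: mean router probability assigned to expert i.
    p = route_prob_sum / n_tokens
    # This is minimized by uniform routing; scale it by a small coefficient
    # before adding it to the main loss.
    return n_experts * (f * p).sum()
</code></pre>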
</div>
<div class='code'>
......@@ -473,7 +473,7 @@ These are used for the load balancing loss and logging</p>
<a href='#section-30'>#</a>
</div>
<h1>Switch Transformer Block</h1>
<p>This is same as <a href="../models.html#TransformerLayer">normal transformer block</a>
<p>This is the same as the <a href="../models.html#TransformerLayer">normal transformer block</a>,
but it handles the extra outputs of the switch feedforward module.</p>
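<p>A minimal sketch of such a block, assuming the switch FFN returns the four values listed
earlier (output, token counts, probability sums, dropped tokens); the module names and the
use of <code>nn.MultiheadAttention</code> are our assumptions, not the repository's API:</p>
<pre><code class="python">
import torch
import torch.nn as nn


class SwitchTransformerBlock(nn.Module):
    """Illustrative block; assumes ffn returns the four routing outputs."""

    def __init__(self, d_model: int, n_heads: int, ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        self.ffn = ffn

    def forward(self, x: torch.Tensor):
        # Standard self-attention sub-layer with a residual connection.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm_attn(x + attn_out)
        # The switch FFN also returns routing statistics.
        ffn_out, counts, route_prob, n_dropped = self.ffn(x)
        x = self.norm_ffn(x + ffn_out)
        # Return the statistics so the trainer can add the balancing loss.
        return x, counts, route_prob, n_dropped
</code></pre>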
</div>
<div class='code'>
......