Greenplum / Annotated Deep Learning Paper Implementations
Commit 3f7ce825
Authored Feb 10, 2021 by Varuna Jayasiri

✍️ typos switch transformer

Parent: 5aa62bed
Showing 2 changed files with 9 additions and 9 deletions (+9 −9)
docs/sitemap.xml (+1 −1)
docs/transformers/switch/index.html (+8 −8)
docs/sitemap.xml
@@ -379,7 +379,7 @@
     <url>
         <loc>https://nn.labml.ai/transformers/switch/index.html</loc>
-        <lastmod>2021-02-02T16:30:00+00:00</lastmod>
+        <lastmod>2021-02-10T16:30:00+00:00</lastmod>
         <priority>1.00</priority>
     </url>
docs/transformers/switch/index.html
@@ -77,16 +77,16 @@
 <a href="https://arxiv.org/abs/2101.03961">Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</a>.
 Our implementation only has a few million parameters and doesn’t do model parallel distributed training.
 It does single GPU training, but we implement the concept of switching as described in the paper.</p>
-<p>The Switch Transformer uses different parameters for each token by switching among parameters,
-based on the token. So only a fraction of parameters is chosen for each token, so you
+<p>The Switch Transformer uses different parameters for each token by switching among parameters
+based on the token. Thererfore, only a fraction of parameters are chosen for each token. So you
 can have more parameters but less computational cost.</p>
 <p>The switching happens at the Position-wise Feedforward network (FFN) of each transformer block.
-Position-wise feedforward network is a two sequentially fully connected layers.
+Position-wise feedforward network consists of two sequentially fully connected layers.
 In switch transformer we have multiple FFNs (multiple experts),
 and we chose which one to use based on a router.
-The outputs a set of probabilities for picking a FFN,
-and we pick the one with the highest probability and only evaluates that.
-So essentially the computational cost is same as having a single FFN.
+The output is a set of probabilities for picking a FFN,
+and we pick the one with the highest probability and only evaluate that.
+So essentially the computational cost is the same as having a single FFN.
 In our implementation this doesn’t parallelize well when you have many or large FFNs since it’s all
 happening on a single GPU.
 In a distributed setup you would have each FFN (each very large) on a different device.</p>
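The switching described in this paragraph can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the repository's actual code: the names `SwitchFFN`, `d_model`, `d_ff`, and `n_experts` are assumed, and the expert capacity limit (the source of dropped tokens) is omitted for brevity.

```python
import torch
import torch.nn as nn


class SwitchFFN(nn.Module):
    """Minimal sketch of switch routing: one expert FFN is picked per token."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # Each expert is an ordinary position-wise FFN: two fully connected layers.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router produces a probability over experts for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor):
        # x: [seq_len, batch_size, d_model]; flatten to a list of tokens.
        seq_len, batch_size, d_model = x.shape
        tokens = x.reshape(-1, d_model)
        route_prob = torch.softmax(self.router(tokens), dim=-1)
        # Switch: keep only the highest-probability expert for each token.
        max_prob, routes = route_prob.max(dim=-1)
        out = tokens.new_zeros(tokens.shape)
        for i, expert in enumerate(self.experts):
            idx = torch.nonzero(routes == i).squeeze(-1)
            if idx.numel() > 0:
                # Only the chosen expert runs on these tokens, so the
                # compute is roughly that of a single FFN.
                out[idx] = expert(tokens[idx])
        # Scale by the routing probability so gradients flow to the router.
        out = out * max_prob.unsqueeze(-1)
        # Extra outputs listed further down in this diff: counts per expert,
        # probability sums per expert, and dropped tokens (always 0 here,
        # since this sketch enforces no capacity limit).
        counts = torch.bincount(routes, minlength=len(self.experts))
        return out.reshape(seq_len, batch_size, d_model), counts, route_prob.sum(0), 0
```

In a distributed setup the per-expert loop would instead dispatch token batches to the devices holding each expert; on a single GPU, as the paragraph notes, the loop serializes and parallelizes poorly.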
@@ -460,7 +460,7 @@ We route to the expert with highest probability</p>
 * the final output
 * number of tokens routed to each expert
 * sum of probabilities for each expert
-* number of tokens dropped
+* number of tokens dropped.
 These are used for the load balancing loss and logging</p>
 </div>
 <div class='code'>
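The last three outputs feed the load balancing loss. In the Switch Transformers paper that loss is α · N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability for expert i. A hedged sketch, with assumed argument names and the coefficient α left to the caller:

```python
import torch


def load_balancing_loss(counts: torch.Tensor, route_prob_sum: torch.Tensor,
                        n_tokens: int) -> torch.Tensor:
    """Sketch of the Switch Transformer load balancing loss.

    counts:         tokens routed to each expert, shape [n_experts]
    route_prob_sum: summed router probabilities per expert, shape [n_experts]
    """
    n_experts = counts.numel()
    f = counts.float() / n_tokens     # fraction of tokens sent to each expert
    p = route_prob_sum / n_tokens     # mean router probability per expert
    # Minimized when routing is uniform across experts.
    return n_experts * (f * p).sum()
```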
@@ -473,7 +473,7 @@ These are used for the load balancing loss and logging</p>
 <a href='#section-30'>#</a>
 </div>
 <h1>Switch Transformer Block</h1>
-<p>This is same as <a href="../models.html#TransformerLayer">normal transformer block</a>
+<p>This is the same as <a href="../models.html#TransformerLayer">normal transformer block</a>
 with handling extra outputs of switch feedforward module.</p>
 </div>
 <div class='code'>
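One way such a block can thread those extra outputs through, again as an assumed sketch rather than the repository's exact layer (attention module, dropout, and norm placement are simplified away):

```python
import torch.nn as nn


class SwitchTransformerBlock(nn.Module):
    """Sketch: a normal transformer block whose FFN also returns routing stats."""

    def __init__(self, attn: nn.Module, feed_forward: nn.Module, d_model: int):
        super().__init__()
        self.attn = attn                  # any self-attention module
        self.feed_forward = feed_forward  # e.g. the SwitchFFN sketched above
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Self-attention with a residual connection, as in a normal block.
        x = x + self.attn(self.norm1(x), mask)
        # The switch FFN returns its output plus the routing statistics
        # needed for the load balancing loss and logging.
        ff, counts, route_prob_sum, n_dropped = self.feed_forward(self.norm2(x))
        x = x + ff
        return x, counts, route_prob_sum, n_dropped
```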