__init__.zh.json
{
 "<h1>Primer: Searching for Efficient Transformers for Language Modeling</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of the paper <a href=\"https://papers.labml.ai/paper/2109.08668\">Primer: Searching for Efficient Transformers for Language Modeling</a>.</p>\n<p>The authors do an evolutionary search for transformer architectures. They name the architecture found using the search Primer (PRIMitives searched transformER). <strong>Primer EZ</strong> is the architecture with the two most robust modifications in Primer compared to  the original transformer. Primer EZ trains a lot faster than the vanilla transformer.</p>\n<h3>Squared ReLU</h3>\n<p>The most effective modification found by the search is using a square ReLU instead of ReLU in the <a href=\"../feed_forward.html\">position-wise feedforward module</a>.</p>\n<p><span translate=no>_^_0_^_</span></p>\n<h3>Multi-DConv-Head Attention (MDHA)</h3>\n<p>The next effective modification is a depth-wise <span translate=no>_^_1_^_</span> convolution after multi-head projection  for queries, keys, and values. The convolution is along the sequence dimension and per channel (depth-wise). To be clear, if the number of channels in each head is <span translate=no>_^_2_^_</span> the convolution will have <span translate=no>_^_3_^_</span> kernels for each of the <span translate=no>_^_4_^_</span> channels.</p>\n<p><a href=\"experiment.html\">Here is the experiment code</a>, for Primer EZ.</p>\n": "<h1>\u5165\u95e8\uff1a\u5bfb\u627e\u7528\u4e8e\u8bed\u8a00\u5efa\u6a21\u7684\u9ad8\u6548\u8f6c\u6362\u5668</h1>\n<p>\u8fd9\u662f <a href=\"https://pytorch.org\">P <a href=\"https://papers.labml.ai/paper/2109.08668\">rimer\uff1a\u4e3a\u8bed\u8a00\u5efa\u6a21\u5bfb\u627e\u9ad8\u6548\u8f6c\u6362\u5668</a>\u8bba\u6587\u7684 PyTorch</a> \u5b9e\u73b0\u3002</p>\n<p>\u4f5c\u8005\u5bf9\u53d8\u538b\u5668\u67b6\u6784\u8fdb\u884c\u4e86\u8fdb\u5316\u63a2\u7d22\u3002\u4ed6\u4eec\u4f7f\u7528\u641c\u7d22 Primer\uff08Primitives \u641c\u7d22 Transformer\uff09\u547d\u540d\u627e\u5230\u7684\u67b6\u6784\u3002\u4e0e\u539f\u59cb\u53d8\u538b\u5668\u76f8\u6bd4\uff0cP@@ <strong>rimer EZ</strong> \u662f\u5728 Primer \u4e2d\u8fdb\u884c\u4e86\u4e24\u9879\u6700\u5f3a\u5927\u7684\u4fee\u6539\u7684\u67b6\u6784\u3002Primer EZ \u7684\u8bad\u7ec3\u901f\u5ea6\u6bd4\u539f\u7248\u53d8\u538b\u5668\u5feb\u5f88\u591a\u3002</p>\n<h3>Squared ReLU</h3>\n<p>\u641c\u7d22\u53d1\u73b0\u7684\u6700\u6709\u6548\u7684\u4fee\u6539\u662f\u5728<a href=\"../feed_forward.html\">\u4f4d\u7f6e\u524d\u9988\u6a21\u5757\u4e2d\u4f7f\u7528\u65b9\u5f62 ReLU \u800c\u4e0d\u662f Re</a> LU\u3002</p>\n<p><span translate=no>_^_0_^_</span></p>\n<h3>Multi-conv-Head \u6ce8\u610f\u529b (MDHA)</h3>\n<p>\u4e0b\u4e00\u4e2a\u6709\u6548\u7684\u4fee\u6539\u662f\u5728\u67e5\u8be2\u3001\u952e\u548c\u503c\u7684\u591a\u5934\u6295\u5f71\u4e4b\u540e\u7684\u6df1\u5ea6<span translate=no>_^_1_^_</span>\u5377\u79ef\u3002\u5377\u79ef\u6cbf\u7740\u5e8f\u5217\u7ef4\u5ea6\u548c\u6bcf\u4e2a\u901a\u9053\uff08\u6df1\u5ea6\uff09\u8fdb\u884c\u3002\u9700\u8981\u660e\u786e\u7684\u662f\uff0c\u5982\u679c\u6bcf\u4e2a\u4fe1\u5934\u4e2d\u7684\u901a\u9053\u6570\u4e3a<span translate=no>_^_2_^_</span>\uff0c\u5219\u5377\u79ef\u5c06\u4e3a\u6bcf\u4e2a<span translate=no>_^_4_^_</span>\u901a\u9053\u90fd\u6709<span translate=no>_^_3_^_</span>\u5185\u6838\u3002</p>\n<p><a href=\"experiment.html\">\u4ee5\u4e0b\u662f Primer EZ \u7684\u5b9e\u9a8c\u4ee3\u7801</a>\u3002</p>\n",
 "<h2>Multi-DConv-Head Attention (MDHA)</h2>\n<p>We extend our original implementation of <a href=\"../mha.html#MHA\">Multi-Head Attention</a> and add the spatial depth-wise convolution to query, key and value projections.</p>\n": "<h2>\u591a dconv-Head \u6ce8\u610f\u529b (MDHA)</h2>\n<p>\u6211\u4eec\u6269\u5c55\u4e86\u6700\u521d\u7684 M <a href=\"../mha.html#MHA\">ulti-Head</a> Attention \u5b9e\u73b0\uff0c\u5e76\u5c06\u7a7a\u95f4\u6df1\u5ea6\u5377\u79ef\u6dfb\u52a0\u5230\u67e5\u8be2\u3001\u952e\u548c\u503c\u6295\u5f71\u4e2d\u3002</p>\n",
 "<h2>Spatial Depth Wise Convolution</h2>\n": "<h2>\u7a7a\u95f4\u6df1\u5ea6\u660e\u667a\u5377\u79ef</h2>\n",
 "<h2>Squared ReLU activation</h2>\n<p><span translate=no>_^_0_^_</span></p>\n<p>Squared ReLU is used as the activation function in the  <a href=\"../feed_forward.html\">position wise feedforward module</a>.</p>\n": "<h2>\u6fc0\u6d3b\u5e73\u65b9 ReLU</h2>\n<p><span translate=no>_^_0_^_</span></p>\n<p>Squared RelU \u5728<a href=\"../feed_forward.html\">\u4f4d\u7f6e\u524d\u9988\u6a21\u5757</a>\u4e2d\u7528\u4f5c\u6fc0\u6d3b\u51fd\u6570\u3002</p>\n",
 "<p> </p>\n": "<p></p>\n",
 "<p> <span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span></p>\n": "<p><span translate=no>_^_0_^_</span>\u6709\u5f62\u72b6<span translate=no>_^_1_^_</span></p>\n",
 "<p>1D convolution accepts input of the form <span translate=no>_^_0_^_</span> </p>\n": "<p>\u4e00\u7ef4\u5377\u79ef\u63a5\u53d7\u4ee5\u4e0b\u5f62\u5f0f\u7684\u8f93\u5165<span translate=no>_^_0_^_</span></p>\n",
 "<p><a href=\"../mha.html#MHA\">Multi-Head Attention</a> will create query, key and value projection modules <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span>, and <span translate=no>_^_2_^_</span>.</p>\n<p>We combine a spatial depth-wise convolution layer to each of them and replace <span translate=no>_^_3_^_</span>, <span translate=no>_^_4_^_</span>, and <span translate=no>_^_5_^_</span>.</p>\n<p>\ud83d\udcdd <em>We feel this cleaner implementation is easier to understand since it clearly shows the difference between this and vanilla transformer multi-head attention</em>. </p>\n": "<p><a href=\"../mha.html#MHA\">Multi-Head</a> Attention \u5c06\u521b\u5efa\u67e5\u8be2\u3001\u952e\u548c\u4ef7\u503c\u6295\u5f71\u6a21\u5757<span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span>\u3001\u548c<span translate=no>_^_2_^_</span>\u3002</p>\n<p>\u6211\u4eec\u5c06\u7a7a\u95f4\u6df1\u5ea6\u5377\u79ef\u5c42\u7ec4\u5408\u5230\u6bcf\u4e2a\u5c42\u4e0a\uff0c\u5e76\u66ff\u6362<span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u3001\u548c<span translate=no>_^_5_^_</span>\u3002</p>\n<p>\ud83d\udcdd <em>\u6211\u4eec\u8ba4\u4e3a\u8fd9\u79cd\u66f4\u7b80\u6d01\u7684\u5b9e\u73b0\u66f4\u5bb9\u6613\u7406\u89e3\uff0c\u56e0\u4e3a\u5b83\u6e05\u695a\u5730\u663e\u793a\u4e86\u8fd9\u4e0e\u666e\u901a\u53d8\u538b\u5668\u591a\u5934\u5173\u6ce8\u4e4b\u95f4\u7684\u533a\u522b</em>\u3002</p>\n",
 "<p>Apply ReLU </p>\n": "<p>\u7533\u8bf7 ReLU</p>\n",
 "<p>Change the shape to <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5c06\u5f62\u72b6\u6539\u4e3a<span translate=no>_^_0_^_</span></p>\n",
 "<p>Crop the right most <span translate=no>_^_0_^_</span> results since we padded both sides </p>\n": "<p>\u88c1\u526a\u6700\u53f3\u8fb9\u7684<span translate=no>_^_0_^_</span>\u7ed3\u679c\uff0c\u56e0\u4e3a\u6211\u4eec\u586b\u5145\u4e86\u4e24\u8fb9</p>\n",
 "<p>Get the shape </p>\n": "<p>\u5f97\u5230\u5f62\u72b6</p>\n",
 "<p>Permute to <span translate=no>_^_0_^_</span> </p>\n": "<p>\u6392\u5217\u4e3a<span translate=no>_^_0_^_</span></p>\n",
 "<p>Reshape to <span translate=no>_^_0_^_</span> </p>\n": "<p>\u91cd\u5851\u4e3a<span translate=no>_^_0_^_</span></p>\n",
 "<p>Square it </p>\n": "<p>\u628a\u5b83\u5f04\u5e73\u4e86</p>\n",
 "<p>We use PyTorch&#x27;s <span translate=no>_^_0_^_</span> module. We set the number of groups to be equal to the number of channels so that it does a separate convolution (with different kernels) for each channel. We add padding to both sides and later crop the right most <span translate=no>_^_1_^_</span> results </p>\n": "<p>\u6211\u4eec\u4f7f\u7528 PyTorch \u7684<span translate=no>_^_0_^_</span>\u6a21\u5757\u3002\u6211\u4eec\u5c06\u7ec4\u7684\u6570\u91cf\u8bbe\u7f6e\u4e3a\u7b49\u4e8e\u901a\u9053\u6570\uff0c\u4ee5\u4fbf\u5b83\u5bf9\u6bcf\u4e2a\u901a\u9053\u8fdb\u884c\u5355\u72ec\u7684\u5377\u79ef\uff08\u4f7f\u7528\u4e0d\u540c\u7684\u5185\u6838\uff09\u3002\u6211\u4eec\u5728\u4e24\u8fb9\u6dfb\u52a0\u586b\u5145\uff0c\u7136\u540e\u88c1\u526a\u6700\u53f3\u8fb9\u7684<span translate=no>_^_1_^_</span>\u7ed3\u679c</p>\n",
 "<ul><li><span translate=no>_^_0_^_</span> is the number of channels in each head</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u662f\u6bcf\u4e2a head \u4e2d\u7684\u901a\u9053\u6570</li></ul>\n",
 "Primer: Searching for Efficient Transformers for Language Modeling": "\u5165\u95e8\uff1a\u4e3a\u8bed\u8a00\u5efa\u6a21\u5bfb\u627e\u9ad8\u6548\u7684\u53d8\u6362\u5668",
 "This is an annotated implementation/tutorial of Primer: Searching for Efficient Transformers for Language Modeling for Vision in PyTorch.": "\u8fd9\u662f PyTorch \u4e2d\u7684 Primer\uff1a\u641c\u7d22\u7528\u4e8e\u89c6\u89c9\u8bed\u8a00\u5efa\u6a21\u7684\u9ad8\u6548\u53d8\u6362\u5668\u7684\u5e26\u6ce8\u91ca\u7684\u5b9e\u73b0/\u6559\u7a0b\u3002"
}