{
 "<h1>HyperLSTM module</h1>\n": "<h1>HyperLSTM \u6a21\u5757</h1>\n",
 "<h1>HyperNetworks - HyperLSTM</h1>\n<p>We have implemented HyperLSTM introduced in paper <a href=\"https://papers.labml.ai/paper/1609.09106\">HyperNetworks</a>, with annotations using <a href=\"https://pytorch.org\">PyTorch</a>. <a href=\"https://blog.otoro.net/2016/09/28/hyper-networks/\">This blog post</a> by David Ha gives a good explanation of HyperNetworks.</p>\n<p>We have an experiment that trains a HyperLSTM to predict text on Shakespeare dataset. Here&#x27;s the link to code: <a href=\"experiment.html\"><span translate=no>_^_0_^_</span></a></p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/hypernetworks/experiment.ipynb\"><span translate=no>_^_1_^_</span></a></p>\n<p>HyperNetworks use a smaller network to generate weights of a larger network. There are two variants: static hyper-networks and dynamic hyper-networks. Static HyperNetworks have smaller networks that generate weights (kernels) of a convolutional network. Dynamic HyperNetworks generate parameters of a recurrent neural network for each step. This is an implementation of the latter.</p>\n<h2>Dynamic HyperNetworks</h2>\n<p>In a RNN the parameters stay constant for each step. Dynamic HyperNetworks generate different parameters for each step. HyperLSTM has the structure of a LSTM but the parameters of each step are changed by a smaller LSTM network.</p>\n<p>In the basic form, a Dynamic HyperNetwork has a smaller recurrent network that generates a feature vector corresponding to each parameter tensor of the larger recurrent network. Let&#x27;s say the larger network has some parameter <span translate=no>_^_2_^_</span> the smaller network generates a feature vector <span translate=no>_^_3_^_</span> and we dynamically compute <span translate=no>_^_4_^_</span> as a linear transformation of <span translate=no>_^_5_^_</span>. 
For instance <span translate=no>_^_6_^_</span> where <span translate=no>_^_7_^_</span> is a 3-d tensor parameter and <span translate=no>_^_8_^_</span> is a tensor-vector multiplication. <span translate=no>_^_9_^_</span> is usually a linear transformation of the output of the smaller recurrent network.</p>\n<h3>Weight scaling instead of computing</h3>\n<p>Large recurrent networks have large dynamically computed parameters. These are calculated using linear transformation of feature vector <span translate=no>_^_10_^_</span>. And this transformation requires an even larger weight tensor. That is, when <span translate=no>_^_11_^_</span> has shape <span translate=no>_^_12_^_</span>, <span translate=no>_^_13_^_</span> will be <span translate=no>_^_14_^_</span>.</p>\n<p>To overcome this, we compute the weight parameters of the recurrent network by dynamically scaling each row of a matrix of same size.</p>\n<span translate=no>_^_15_^_</span><p>where <span translate=no>_^_16_^_</span> is a <span translate=no>_^_17_^_</span> parameter matrix.</p>\n<p>We can further optimize this when we compute <span translate=no>_^_18_^_</span>, as <span translate=no>_^_19_^_</span> where <span translate=no>_^_20_^_</span> stands for element-wise multiplication.</p>\n": "<h1>\u8d85\u7f51\u7edc-HyperLSTM</h1>\n<p>\u6211\u4eec\u5df2\u7ecf\u5b9e\u73b0\u4e86\u8bba\u6587 <a href=\"https://papers.labml.ai/paper/1609.09106\">HyperNetworks</a> \u4e2d\u4ecb\u7ecd\u7684 HyperLSTM\uff0c\u5e76\u4f7f\u7528 <a href=\"https://pytorch.org\">PyTorch</a> \u8fdb\u884c\u4e86\u6ce8\u91ca\u3002<a href=\"https://blog.otoro.net/2016/09/28/hyper-networks/\">David Ha\u7684\u8fd9\u7bc7\u535a\u5ba2\u6587\u7ae0</a>\u5f88\u597d\u5730\u89e3\u91ca\u4e86HyperNetworks\u3002</p>\n<p>\u6211\u4eec\u6709\u4e00\u4e2a\u5b9e\u9a8c\u53ef\u4ee5\u8bad\u7ec3 HyperLSTM \u6765\u9884\u6d4b\u838e\u58eb\u6bd4\u4e9a\u6570\u636e\u96c6\u4e0a\u7684\u6587\u672c\u3002\u4ee5\u4e0b\u662f\u4ee3\u7801\u94fe\u63a5\uff1a<a 
href=\"experiment.html\"><span translate=no>_^_0_^_</span></a></p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/hypernetworks/experiment.ipynb\"><span translate=no>_^_1_^_</span></a></p>\n<p>HyperNetworks \u4f7f\u7528\u8f83\u5c0f\u7684\u7f51\u7edc\u6765\u751f\u6210\u8f83\u5927\u7f51\u7edc\u7684\u6743\u91cd\u3002\u6709\u4e24\u79cd\u53d8\u4f53\uff1a\u9759\u6001\u8d85\u7f51\u7edc\u548c\u52a8\u6001\u8d85\u7f51\u7edc\u3002\u9759\u6001\u8d85\u7f51\u7edc\u5177\u6709\u8f83\u5c0f\u7684\u7f51\u7edc\uff0c\u7528\u4e8e\u751f\u6210\u5377\u79ef\u7f51\u7edc\u7684\u6743\u91cd\uff08\u5377\u79ef\u6838\uff09\u3002\u52a8\u6001\u8d85\u7f51\u7edc\u4e3a\u6bcf\u4e2a\u6b65\u9aa4\u751f\u6210\u5faa\u73af\u795e\u7ecf\u7f51\u7edc\u7684\u53c2\u6570\u3002\u8fd9\u662f\u540e\u8005\u7684\u5b9e\u73b0\u3002</p>\n<h2>\u52a8\u6001\u8d85\u7f51\u7edc</h2>\n<p>\u5728 RNN \u4e2d\uff0c\u6bcf\u4e2a\u6b65\u9aa4\u7684\u53c2\u6570\u4fdd\u6301\u4e0d\u53d8\u3002\u52a8\u6001\u8d85\u7f51\u7edc\u4e3a\u6bcf\u4e2a\u6b65\u9aa4\u751f\u6210\u4e0d\u540c\u7684\u53c2\u6570\u3002HyperLSTM \u5177\u6709 LSTM \u7684\u7ed3\u6784\uff0c\u4f46\u6bcf\u4e2a\u6b65\u9aa4\u7684\u53c2\u6570\u90fd\u7531\u8f83\u5c0f\u7684 LSTM \u7f51\u7edc\u66f4\u6539\u3002</p>\n<p>\u5728\u57fa\u672c\u5f62\u5f0f\u4e2d\uff0c\u52a8\u6001\u8d85\u7f51\u7edc\u5177\u6709\u8f83\u5c0f\u7684\u5faa\u73af\u7f51\u7edc\uff0c\u8be5\u7f51\u7edc\u751f\u6210\u4e0e\u8f83\u5927\u5faa\u73af\u7f51\u7edc\u7684\u6bcf\u4e2a\u53c2\u6570\u5f20\u91cf\u5bf9\u5e94\u7684\u7279\u5f81\u5411\u91cf\u3002\u5047\u8bbe\u8f83\u5927\u7684\u7f51\u7edc\u6709\u4e00\u4e9b\u53c2\u6570<span translate=no>_^_2_^_</span>\uff0c\u8f83\u5c0f\u7684\u7f51\u7edc\u751f\u6210\u4e00\u4e2a\u7279\u5f81\u5411\u91cf<span translate=no>_^_3_^_</span>\uff0c\u6211\u4eec\u5c06<span translate=no>_^_4_^_</span>\u52a8\u6001\u8ba1\u7b97\u4e3a<span translate=no>_^_5_^_</span>\u7684\u7ebf\u6027\u53d8\u6362\u3002\u4f8b\u5982\uff0c<span 
translate=no>_^_6_^_</span>\uff0c\u5176\u4e2d<span translate=no>_^_7_^_</span>\u662f\u4e09\u7ef4\u5f20\u91cf\u53c2\u6570\uff0c<span translate=no>_^_8_^_</span>\u662f\u5f20\u91cf\u5411\u91cf\u4e58\u6cd5\u3002<span translate=no>_^_9_^_</span>\u901a\u5e38\u662f\u8f83\u5c0f\u7684\u5faa\u73af\u7f51\u7edc\u8f93\u51fa\u7684\u7ebf\u6027\u53d8\u6362\u3002</p>\n<h3>\u7f29\u653e\u6743\u91cd\u800c\u4e0d\u662f\u8ba1\u7b97\u6743\u91cd</h3>\n<p>\u5927\u578b\u5faa\u73af\u7f51\u7edc\u5177\u6709\u5927\u91cf\u7684\u52a8\u6001\u8ba1\u7b97\u53c2\u6570\u3002\u8fd9\u4e9b\u53c2\u6570\u662f\u4f7f\u7528\u7279\u5f81\u5411\u91cf<span translate=no>_^_10_^_</span>\u7684\u7ebf\u6027\u53d8\u6362\u8ba1\u7b97\u7684\u3002\u800c\u4e14\u8fd9\u79cd\u53d8\u6362\u9700\u8981\u66f4\u5927\u7684\u6743\u91cd\u5f20\u91cf\u3002\u4e5f\u5c31\u662f\u8bf4\uff0c\u5f53<span translate=no>_^_11_^_</span>\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_12_^_</span>\u65f6\uff0c<span translate=no>_^_13_^_</span>\u7684\u5f62\u72b6\u5c06\u662f<span translate=no>_^_14_^_</span>\u3002</p>\n<p>\u4e3a\u4e86\u514b\u670d\u8fd9\u4e2a\u95ee\u9898\uff0c\u6211\u4eec\u901a\u8fc7\u52a8\u6001\u7f29\u653e\u76f8\u540c\u5927\u5c0f\u7684\u77e9\u9635\u7684\u6bcf\u4e00\u884c\u6765\u8ba1\u7b97\u5faa\u73af\u7f51\u7edc\u7684\u6743\u91cd\u53c2\u6570\u3002</p>\n<span translate=no>_^_15_^_</span><p>\u5176\u4e2d<span translate=no>_^_16_^_</span>\u662f<span translate=no>_^_17_^_</span>\u53c2\u6570\u77e9\u9635\u3002</p>\n<p>\u5728\u8ba1\u7b97<span translate=no>_^_18_^_</span>\u65f6\uff0c\u6211\u4eec\u53ef\u4ee5\u8fdb\u4e00\u6b65\u4f18\u5316\uff0c\u5c06\u5176\u8ba1\u7b97\u4e3a<span translate=no>_^_19_^_</span>\uff0c\u5176\u4e2d<span translate=no>_^_20_^_</span>\u4ee3\u8868\u9010\u5143\u7d20\u4e58\u6cd5\u3002</p>\n",
 "<h2>HyperLSTM Cell</h2>\n<p>For HyperLSTM the smaller network and the larger network both have the LSTM structure. This is defined in Appendix A.2.2 in the paper.</p>\n": "<h2>HyperLSTM Cell</h2>\n<p>\u5bf9\u4e8e HyperLSTM\uff0c\u8f83\u5c0f\u7684\u7f51\u7edc\u548c\u8f83\u5927\u7684\u7f51\u7edc\u90fd\u5177\u6709 LSTM \u7ed3\u6784\u3002\u8fd9\u5728\u8bba\u6587\u7684\u9644\u5f55 A.2.2 \u4e2d\u8fdb\u884c\u4e86\u5b9a\u4e49\u3002</p>\n",
 "<p> </p>\n": "<p></p>\n",
 "<p> <span translate=no>_^_0_^_</span> is the size of the input <span translate=no>_^_1_^_</span>, <span translate=no>_^_2_^_</span> is the size of the LSTM, and <span translate=no>_^_3_^_</span> is the size of the smaller LSTM that alters the weights of the larger outer LSTM. <span translate=no>_^_4_^_</span> is the size of the feature vectors used to alter the LSTM weights.</p>\n<p>We use the output of the smaller LSTM to compute <span translate=no>_^_5_^_</span>, <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> using linear transformations. We calculate <span translate=no>_^_8_^_</span>, <span translate=no>_^_9_^_</span>, and <span translate=no>_^_10_^_</span> from these, using linear transformations again. These are then used to scale the rows of weight and bias tensors of the main LSTM.</p>\n<p>\ud83d\udcdd Since the computation of <span translate=no>_^_11_^_</span> and <span translate=no>_^_12_^_</span> are two sequential linear transformations these can be combined into a single linear transformation. 
However we&#x27;ve implemented this separately so that it matches with the description in the paper.</p>\n": "<p><span translate=no>_^_0_^_</span>\u662f\u8f93\u5165<span translate=no>_^_1_^_</span>\u7684\u5927\u5c0f\uff0c<span translate=no>_^_2_^_</span>\u662f LSTM \u7684\u5927\u5c0f\uff0c<span translate=no>_^_3_^_</span>\u662f\u8f83\u5c0f\u7684 LSTM \u7684\u5927\u5c0f\uff0c\u5b83\u4f1a\u6539\u53d8\u8f83\u5927\u7684\u5916\u90e8 LSTM \u7684\u6743\u91cd\u3002<span translate=no>_^_4_^_</span>\u662f\u7528\u4e8e\u6539\u53d8 LSTM \u6743\u91cd\u7684\u7279\u5f81\u5411\u91cf\u7684\u5927\u5c0f\u3002</p>\n<p>\u6211\u4eec\u4f7f\u7528\u8f83\u5c0f\u7684 LSTM \u7684\u8f93\u51fa\uff0c\u901a\u8fc7\u7ebf\u6027\u53d8\u6362\u8ba1\u7b97<span translate=no>_^_5_^_</span>\u3001<span translate=no>_^_6_^_</span>\u548c<span translate=no>_^_7_^_</span>\u3002\u6211\u4eec\u518d\u6b21\u4f7f\u7528\u7ebf\u6027\u53d8\u6362\uff0c\u7531\u5b83\u4eec\u8ba1\u7b97<span translate=no>_^_8_^_</span>\u3001<span translate=no>_^_9_^_</span>\u548c<span translate=no>_^_10_^_</span>\u3002\u7136\u540e\u4f7f\u7528\u5b83\u4eec\u6765\u7f29\u653e\u4e3b LSTM \u7684\u6743\u91cd\u548c\u504f\u7f6e\u5f20\u91cf\u7684\u884c\u3002</p>\n<p>\ud83d\udcdd \u7531\u4e8e<span translate=no>_^_11_^_</span>\u548c<span translate=no>_^_12_^_</span>\u7684\u8ba1\u7b97\u662f\u4e24\u4e2a\u8fde\u7eed\u7684\u7ebf\u6027\u53d8\u6362\uff0c\u56e0\u6b64\u53ef\u4ee5\u5c06\u5b83\u4eec\u7ec4\u5408\u6210\u5355\u4e2a\u7ebf\u6027\u53d8\u6362\u3002\u4f46\u662f\uff0c\u6211\u4eec\u5c06\u5176\u5206\u5f00\u5b9e\u73b0\uff0c\u4ee5\u4fbf\u4e0e\u8bba\u6587\u4e2d\u7684\u63cf\u8ff0\u76f8\u5339\u914d\u3002</p>\n",
 "<p> Create a network of <span translate=no>_^_0_^_</span> of HyperLSTM.</p>\n": "<p>\u521b\u5efa\u4e00\u4e2a\u7531 HyperLSTM<span translate=no>_^_0_^_</span> \u7ec4\u6210\u7684\u7f51\u7edc\u3002</p>\n",
 "<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
 "<p><span translate=no>_^_0_^_</span> \ud83e\udd14 In the paper it was specified as <span translate=no>_^_1_^_</span> I feel that it&#x27;s a typo. </p>\n": "<p><span translate=no>_^_0_^_</span>\ud83e\udd14 \u8bba\u6587\u4e2d\u5c06\u5176\u6307\u5b9a\u4e3a<span translate=no>_^_1_^_</span>\uff0c\u6211\u89c9\u5f97\u8fd9\u662f\u4e00\u4e2a\u7b14\u8bef\u3002</p>\n",
 "<p>Collect the output <span translate=no>_^_0_^_</span> of the final layer </p>\n": "<p>\u6536\u96c6\u6700\u540e\u4e00\u5c42\u7684\u8f93\u51fa<span translate=no>_^_0_^_</span></p>\n",
 "<p>Collect the outputs of the final layer at each step </p>\n": "<p>\u5728\u6bcf\u4e00\u6b65\u6536\u96c6\u6700\u540e\u4e00\u5c42\u7684\u8f93\u51fa</p>\n",
 "<p>Create cells for each layer. Note that only the first layer gets the input directly. Rest of the layers get the input from the layer below </p>\n": "<p>\u4e3a\u6bcf\u5c42\u521b\u5efa\u5355\u5143\u3002\u8bf7\u6ce8\u610f\uff0c\u53ea\u6709\u7b2c\u4e00\u5c42\u76f4\u63a5\u83b7\u5f97\u8f93\u5165\u3002\u5176\u4f59\u5c42\u4ece\u4e0b\u9762\u7684\u5c42\u83b7\u53d6\u8f93\u5165</p>\n",
 "<p>Get the state of the layer </p>\n": "<p>\u83b7\u53d6\u8be5\u5c42\u7684\u72b6\u6001</p>\n",
 "<p>Initialize the state with zeros if <span translate=no>_^_0_^_</span> </p>\n": "<p>\u5982\u679c<span translate=no>_^_0_^_</span>\uff0c\u5219\u4f7f\u7528\u96f6\u521d\u59cb\u5316\u72b6\u6001</p>\n",
 "<p>Input to the first layer is the input itself </p>\n": "<p>\u7b2c\u4e00\u5c42\u7684\u8f93\u5165\u662f\u8f93\u5165\u672c\u8eab</p>\n",
 "<p>Input to the next layer is the state of this layer </p>\n": "<p>\u4e0b\u4e00\u5c42\u7684\u8f93\u5165\u662f\u8be5\u5c42\u7684\u72b6\u6001</p>\n",
 "<p>Layer normalization </p>\n": "<p>\u5c42\u5f52\u4e00\u5316</p>\n",
 "<p>Loop through the layers </p>\n": "<p>\u5faa\u73af\u904d\u5386\u5404\u5c42</p>\n",
 "<p>Reverse stack the tensors to get the states of each layer</p>\n<p>\ud83d\udcdd You can just work with the tensor itself but this is easier to debug </p>\n": "<p>\u53cd\u5411\u5806\u53e0\u5f20\u91cf\u4ee5\u83b7\u5f97\u6bcf\u5c42\u7684\u72b6\u6001</p>\n<p>\ud83d\udcdd \u4f60\u53ef\u4ee5\u53ea\u4f7f\u7528\u5f20\u91cf\u672c\u8eab\uff0c\u4f46\u8fd9\u66f4\u5bb9\u6613\u8c03\u8bd5</p>\n",
 "<p>Stack the outputs and states </p>\n": "<p>\u5806\u53e0\u8f93\u51fa\u548c\u72b6\u6001</p>\n",
 "<p>Store sizes to initialize state </p>\n": "<p>\u5b58\u50a8\u5927\u5c0f\u4ee5\u521d\u59cb\u5316\u72b6\u6001</p>\n",
 "<p>The input to the hyperLSTM is <span translate=no>_^_0_^_</span> where <span translate=no>_^_1_^_</span> is the input and <span translate=no>_^_2_^_</span> is the output of the outer LSTM at previous step. So the input size is <span translate=no>_^_3_^_</span>.</p>\n<p>The output of hyperLSTM is <span translate=no>_^_4_^_</span> and <span translate=no>_^_5_^_</span>. </p>\n": "<p>HyperLSTM \u7684\u8f93\u5165\u662f<span translate=no>_^_0_^_</span>\uff0c\u5176\u4e2d<span translate=no>_^_1_^_</span>\u662f\u8f93\u5165\uff0c<span translate=no>_^_2_^_</span>\u662f\u5916\u90e8 LSTM \u5728\u4e0a\u4e00\u6b65\u7684\u8f93\u51fa\u3002\u56e0\u6b64\uff0c\u8f93\u5165\u5927\u5c0f\u4e3a<span translate=no>_^_3_^_</span>\u3002</p>\n<p>HyperLSTM \u7684\u8f93\u51fa\u4e3a<span translate=no>_^_4_^_</span>\u548c<span translate=no>_^_5_^_</span>\u3002</p>\n",
 "<p>The weight matrices <span translate=no>_^_0_^_</span> </p>\n": "<p>\u6743\u91cd\u77e9\u9635<span translate=no>_^_0_^_</span></p>\n",
 "<p>We calculate <span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span>, <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> in a loop </p>\n": "<p>\u6211\u4eec\u5faa\u73af\u8ba1\u7b97<span translate=no>_^_0_^_</span>\u3001<span translate=no>_^_1_^_</span>\u3001<span translate=no>_^_2_^_</span>\u548c<span translate=no>_^_3_^_</span></p>\n",
 "<span translate=no>_^_0_^_</span><p> </p>\n": "<span translate=no>_^_0_^_</span><p></p>\n",
 "<ul><li><span translate=no>_^_0_^_</span> has shape <span translate=no>_^_1_^_</span> and </li>\n<li><span translate=no>_^_2_^_</span> is a tuple of <span translate=no>_^_3_^_</span>.  <span translate=no>_^_4_^_</span> have shape <span translate=no>_^_5_^_</span> and  <span translate=no>_^_6_^_</span> have shape <span translate=no>_^_7_^_</span>.</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_1_^_</span>\uff0c</li>\n<li><span translate=no>_^_2_^_</span>\u662f<span translate=no>_^_3_^_</span>\u7684\u5143\u7ec4\u3002<span translate=no>_^_4_^_</span>\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_5_^_</span>\uff0c<span translate=no>_^_6_^_</span>\u7684\u5f62\u72b6\u4e3a<span translate=no>_^_7_^_</span>\u3002</li></ul>\n",
 "A PyTorch implementation/tutorial of HyperLSTM introduced in paper HyperNetworks.": "\u8bba\u6587 HyperNetworks \u4e2d\u4ecb\u7ecd\u7684 HyperLSTM \u7684 PyTorch \u5b9e\u73b0/\u6559\u7a0b\u3002",
 "HyperNetworks - HyperLSTM": "\u8d85\u7f51\u7edc-HyperLSTM"
}