{ "

HyperLSTM module

\n": "

HyperLSTM \u6a21\u5757

\n", "

HyperNetworks - HyperLSTM

\n

We have implemented HyperLSTM introduced in paper HyperNetworks, with annotations using PyTorch. This blog post by David Ha gives a good explanation of HyperNetworks.

\n

We have an experiment that trains a HyperLSTM to predict text on Shakespeare dataset. Here's the link to code: _^_0_^_

\n

_^_1_^_

\n

HyperNetworks use a smaller network to generate weights of a larger network. There are two variants: static hyper-networks and dynamic hyper-networks. Static HyperNetworks have smaller networks that generate weights (kernels) of a convolutional network. Dynamic HyperNetworks generate parameters of a recurrent neural network for each step. This is an implementation of the latter.

\n

Dynamic HyperNetworks

\n

In a RNN the parameters stay constant for each step. Dynamic HyperNetworks generate different parameters for each step. HyperLSTM has the structure of a LSTM but the parameters of each step are changed by a smaller LSTM network.

\n

In the basic form, a Dynamic HyperNetwork has a smaller recurrent network that generates a feature vector corresponding to each parameter tensor of the larger recurrent network. Let's say the larger network has some parameter _^_2_^_ the smaller network generates a feature vector _^_3_^_ and we dynamically compute _^_4_^_ as a linear transformation of _^_5_^_. For instance _^_6_^_ where _^_7_^_ is a 3-d tensor parameter and _^_8_^_ is a tensor-vector multiplication. _^_9_^_ is usually a linear transformation of the output of the smaller recurrent network.

\n

Weight scaling instead of computing

\n

Large recurrent networks have large dynamically computed parameters. These are calculated using linear transformation of feature vector _^_10_^_. And this transformation requires an even larger weight tensor. That is, when _^_11_^_ has shape _^_12_^_, _^_13_^_ will be _^_14_^_.

\n

To overcome this, we compute the weight parameters of the recurrent network by dynamically scaling each row of a matrix of same size.

\n_^_15_^_

where _^_16_^_ is a _^_17_^_ parameter matrix.

\n

We can further optimize this when we compute _^_18_^_, as _^_19_^_ where _^_20_^_ stands for element-wise multiplication.

\n": "

\u8d85\u7f51\u7edc-HyperLSTM

\n

\u6211\u4eec\u5df2\u7ecf\u5b9e\u73b0\u4e86\u8bba\u6587 Hyper Networks \u4e2d\u4ecb\u7ecd\u7684 Hyper LSTM\uff0c\u5e76\u4f7f\u7528 PyTorch \u8fdb\u884c\u4e86\u6ce8\u91ca\u3002David Ha\u7684\u8fd9\u7bc7\u535a\u5ba2\u6587\u7ae0\u5f88\u597d\u5730\u89e3\u91ca\u4e86HyperNetworks\u3002

\n

\u6211\u4eec\u6709\u4e00\u4e2a\u5b9e\u9a8c\u53ef\u4ee5\u8bad\u7ec3 HyperLSTM \u6765\u9884\u6d4b\u838e\u58eb\u6bd4\u4e9a\u6570\u636e\u96c6\u4e0a\u7684\u6587\u672c\u3002\u4ee5\u4e0b\u662f\u4ee3\u7801\u94fe\u63a5\uff1a_^_0_^_

\n

_^_1_^_

\n

HyperNetworks \u4f7f\u7528\u8f83\u5c0f\u7684\u7f51\u7edc\u6765\u751f\u6210\u8f83\u5927\u7f51\u7edc\u7684\u6743\u91cd\u3002\u6709\u4e24\u79cd\u53d8\u4f53\uff1a\u9759\u6001\u8d85\u7f51\u7edc\u548c\u52a8\u6001\u8d85\u7f51\u7edc\u3002\u9759\u6001\u8d85\u7f51\u7edc\u5177\u6709\u8f83\u5c0f\u7684\u7f51\u7edc\uff0c\u7528\u4e8e\u751f\u6210\u5377\u79ef\u7f51\u7edc\u7684\u6743\u91cd\uff08\u5185\u6838\uff09\u3002\u52a8\u6001\u8d85\u7f51\u7edc\u4e3a\u6bcf\u4e2a\u6b65\u9aa4\u751f\u6210\u5faa\u73af\u795e\u7ecf\u7f51\u7edc\u7684\u53c2\u6570\u3002\u8fd9\u662f\u540e\u8005\u7684\u5b9e\u73b0\u3002

\n

\u52a8\u6001\u8d85\u7f51\u7edc

\n

\u5728 RNN \u4e2d\uff0c\u6bcf\u4e2a\u6b65\u9aa4\u7684\u53c2\u6570\u4fdd\u6301\u4e0d\u53d8\u3002\u52a8\u6001\u8d85\u7f51\u7edc\u4e3a\u6bcf\u4e2a\u6b65\u9aa4\u751f\u6210\u4e0d\u540c\u7684\u53c2\u6570\u3002HyperLSTM \u5177\u6709 LSTM \u7684\u7ed3\u6784\uff0c\u4f46\u6bcf\u4e2a\u6b65\u9aa4\u7684\u53c2\u6570\u90fd\u7531\u8f83\u5c0f\u7684 LSTM \u7f51\u7edc\u66f4\u6539\u3002

\n

\u5728\u57fa\u672c\u5f62\u5f0f\u4e2d\uff0cDynamic HyperNetwork \u5177\u6709\u8f83\u5c0f\u7684\u5faa\u73af\u7f51\u7edc\uff0c\u8be5\u7f51\u7edc\u751f\u6210\u4e0e\u8f83\u5927\u5faa\u73af\u7f51\u7edc\u7684\u6bcf\u4e2a\u53c2\u6570\u5f20\u91cf\u5bf9\u5e94\u7684\u7279\u5f81\u5411\u91cf\u3002\u5047\u8bbe\u8f83\u5927\u7684\u7f51\u7edc\u6709\u4e00\u4e9b\u53c2\u6570_^_2_^_\uff0c\u8f83\u5c0f\u7684\u7f51\u7edc\u751f\u6210\u4e00\u4e2a\u7279\u5f81\u5411\u91cf_^_3_^_\uff0c\u6211\u4eec\u52a8\u6001\u8ba1\u7b97_^_4_^_\u4e3a\u7684\u7ebf\u6027\u53d8\u6362_^_5_^_\u3002\u4f8b\u5982\uff0c_^_6_^_\u5176\u4e2d_^_7_^_\u662f\u4e09\u7ef4\u5f20\u91cf\u53c2\u6570\uff0c_^_8_^_\u662f\u5f20\u91cf\u5411\u91cf\u4e58\u6cd5\u3002_^_9_^_\u901a\u5e38\u662f\u8f83\u5c0f\u7684\u5faa\u73af\u7f51\u7edc\u8f93\u51fa\u7684\u7ebf\u6027\u53d8\u6362\u3002

\n

\u6309\u91cd\u91cf\u7f29\u653e\u800c\u4e0d\u662f\u8ba1\u7b97

\n

\u5927\u578b\u5faa\u73af\u7f51\u7edc\u5177\u6709\u5927\u91cf\u7684\u52a8\u6001\u8ba1\u7b97\u53c2\u6570\u3002\u8fd9\u4e9b\u662f\u4f7f\u7528\u7279\u5f81\u5411\u91cf\u7684\u7ebf\u6027\u53d8\u6362\u8ba1\u7b97_^_10_^_\u7684\u3002\u800c\u4e14\u8fd9\u79cd\u53d8\u6362\u9700\u8981\u66f4\u5927\u7684\u6743\u91cd\u5f20\u91cf\u3002\u4e5f\u5c31\u662f\u8bf4\uff0c\u5f53_^_11_^_\u6709\u5f62\u72b6\u65f6_^_12_^_\uff0c_^_13_^_\u5c06\u662f_^_14_^_\u3002

\n

\u4e3a\u4e86\u514b\u670d\u8fd9\u4e2a\u95ee\u9898\uff0c\u6211\u4eec\u901a\u8fc7\u52a8\u6001\u7f29\u653e\u76f8\u540c\u5927\u5c0f\u7684\u77e9\u9635\u7684\u6bcf\u4e00\u884c\u6765\u8ba1\u7b97\u5faa\u73af\u7f51\u7edc\u7684\u6743\u91cd\u53c2\u6570\u3002

\n_^_15_^_

\u5176\u4e2d_^_16_^_\u662f_^_17_^_\u53c2\u6570\u77e9\u9635\u3002

\n

\u6211\u4eec\u53ef\u4ee5\u5728\u8ba1\u7b97\u65f6\u8fdb\u4e00\u6b65\u5bf9\u5176\u8fdb\u884c\u4f18\u5316_^_18_^_\uff0c\u56e0\u4e3a_^_19_^_\u5176\u4e2d_^_20_^_\u4ee3\u8868\u9010\u5143\u7d20\u4e58\u6cd5\u3002

\n", "

HyperLSTM Cell

\n

For HyperLSTM the smaller network and the larger network both have the LSTM structure. This is defined in Appendix A.2.2 in the paper.

\n": "

HyperLSTM Cell

\n

\u5bf9\u4e8e HyperLSTM\uff0c\u8f83\u5c0f\u7684\u7f51\u7edc\u548c\u8f83\u5927\u7684\u7f51\u7edc\u90fd\u5177\u6709 LSTM \u7ed3\u6784\u3002\u8fd9\u5728\u767d\u76ae\u4e66\u7684\u9644\u5f55A.2.2\u4e2d\u8fdb\u884c\u4e86\u5b9a\u4e49\u3002

\n", "

\n": "

\n", "

_^_0_^_ is the size of the input _^_1_^_, _^_2_^_ is the size of the LSTM, and _^_3_^_ is the size of the smaller LSTM that alters the weights of the larger outer LSTM. _^_4_^_ is the size of the feature vectors used to alter the LSTM weights.

\n

We use the output of the smaller LSTM to compute _^_5_^_, _^_6_^_ and _^_7_^_ using linear transformations. We calculate _^_8_^_, _^_9_^_, and _^_10_^_ from these, using linear transformations again. These are then used to scale the rows of weight and bias tensors of the main LSTM.

\n

\ud83d\udcdd Since the computation of _^_11_^_ and _^_12_^_ are two sequential linear transformations these can be combined into a single linear transformation. However we've implemented this separately so that it matches with the description in the paper.

\n": "

_^_0_^_\u662f\u8f93\u5165\u7684\u5927\u5c0f_^_1_^_\uff0c_^_2_^_\u662f LSTM \u7684\u5927\u5c0f\uff0c_^_3_^_\u662f\u8f83\u5c0f\u7684 LSTM \u7684\u5927\u5c0f\uff0c\u5b83\u4f1a\u6539\u53d8\u66f4\u5927\u7684\u5916\u90e8 LSTM\u3002_^_4_^_\u662f\u7528\u4e8e\u6539\u53d8 LSTM \u6743\u91cd\u7684\u7279\u5f81\u5411\u91cf\u7684\u5927\u5c0f\u3002

\n

\u6211\u4eec\u4f7f\u7528\u8f83\u5c0f\u7684 LSTM \u7684\u8f93\u51fa\u8fdb\u884c\u8ba1\u7b97_^_5_^_\uff0c_^_6_^_\u5e76_^_7_^_\u4f7f\u7528\u7ebf\u6027\u53d8\u6362\u3002\u6211\u4eec\u518d\u6b21\u4f7f\u7528\u7ebf\u6027\u53d8\u6362\u8fdb\u884c\u8ba1\u7b97_^_8_^__^_9_^_\u3001\u548c_^_10_^_\u8ba1\u7b97\u3002\u7136\u540e\u4f7f\u7528\u5b83\u4eec\u6765\u7f29\u653e\u4e3b LSTM \u7684\u6743\u91cd\u548c\u504f\u7f6e\u5f20\u91cf\u7684\u884c\u3002

\n

\ud83d\udcdd \u7531\u4e8e_^_11_^_\u548c\u7684\u8ba1\u7b97_^_12_^_\u662f\u4e24\u4e2a\u8fde\u7eed\u7684\u7ebf\u6027\u53d8\u6362\uff0c\u56e0\u6b64\u53ef\u4ee5\u5c06\u5b83\u4eec\u7ec4\u5408\u6210\u5355\u4e2a\u7ebf\u6027\u53d8\u6362\u3002\u4f46\u662f\uff0c\u6211\u4eec\u5df2\u7ecf\u5355\u72ec\u5b9e\u73b0\u4e86\u8fd9\u4e00\u70b9\uff0c\u4ee5\u4fbf\u5b83\u4e0e\u8bba\u6587\u4e2d\u7684\u63cf\u8ff0\u76f8\u5339\u914d\u3002

\n", "

Create a network of _^_0_^_ of HyperLSTM.

\n": "

\u521b\u5efa\u4e00\u4e2a\u7531 HyperLSTM_^_0_^_ \u7ec4\u6210\u7684\u7f51\u7edc\u3002

\n", "

_^_0_^_

\n": "

_^_0_^_

\n", "

_^_0_^_ \ud83e\udd14 In the paper it was specified as _^_1_^_ I feel that it's a typo.

\n": "

_^_0_^_\ud83e\udd14 \u5728\u62a5\u7eb8\u4e0a\u6307\u5b9a\u4e86\u5b83\uff0c\u56e0\u4e3a_^_1_^_\u6211\u89c9\u5f97\u8fd9\u662f\u4e00\u4e2a\u9519\u5b57\u3002

\n", "

Collect the output _^_0_^_ of the final layer

\n": "

\u6536\u96c6\u6700\u540e\u4e00\u5c42_^_0_^_\u7684\u8f93\u51fa

\n", "

Collect the outputs of the final layer at each step

\n": "

\u5728\u6bcf\u4e00\u6b65\u6536\u96c6\u6700\u540e\u4e00\u5c42\u7684\u8f93\u51fa

\n", "

Create cells for each layer. Note that only the first layer gets the input directly. Rest of the layers get the input from the layer below

\n": "

\u4e3a\u6bcf\u5c42\u521b\u5efa\u5355\u5143\u3002\u8bf7\u6ce8\u610f\uff0c\u53ea\u6709\u7b2c\u4e00\u5c42\u76f4\u63a5\u83b7\u5f97\u8f93\u5165\u3002\u5176\u4f59\u56fe\u5c42\u4ece\u4e0b\u9762\u7684\u56fe\u5c42\u83b7\u53d6\u8f93\u5165

\n", "

Get the state of the layer

\n": "

\u83b7\u53d6\u56fe\u5c42\u7684\u72b6\u6001

\n", "

Initialize the state with zeros if _^_0_^_

\n": "

\u4f7f\u7528\u96f6\u521d\u59cb\u5316\u72b6\u6001\u5982\u679c_^_0_^_

\n", "

Input to the first layer is the input itself

\n": "

\u7b2c\u4e00\u5c42\u7684\u8f93\u5165\u662f\u8f93\u5165\u672c\u8eab

\n", "

Input to the next layer is the state of this layer

\n": "

\u4e0b\u4e00\u5c42\u7684\u8f93\u5165\u662f\u8be5\u56fe\u5c42\u7684\u72b6\u6001

\n", "

Layer normalization

\n": "

\u5c42\u89c4\u8303\u5316

\n", "

Loop through the layers

\n": "

\u5faa\u73af\u7a7f\u8fc7\u56fe\u5c42

\n", "

Reverse stack the tensors to get the states of each layer

\n

\ud83d\udcdd You can just work with the tensor itself but this is easier to debug

\n": "

\u53cd\u5411\u5806\u53e0\u5f20\u91cf\u4ee5\u83b7\u5f97\u6bcf\u5c42\u7684\u72b6\u6001

\n

\ud83d\udcdd \u4f60\u53ef\u4ee5\u53ea\u4f7f\u7528\u5f20\u91cf\u672c\u8eab\uff0c\u4f46\u8fd9\u66f4\u5bb9\u6613\u8c03\u8bd5

\n", "

Stack the outputs and states

\n": "

\u5806\u53e0\u8f93\u51fa\u548c\u72b6\u6001

\n", "

Store sizes to initialize state

\n": "

\u5b58\u50a8\u5927\u5c0f\u4ee5\u521d\u59cb\u5316\u72b6\u6001

\n", "

The input to the hyperLSTM is _^_0_^_ where _^_1_^_ is the input and _^_2_^_ is the output of the outer LSTM at previous step. So the input size is _^_3_^_.

\n

The output of hyperLSTM is _^_4_^_ and _^_5_^_.

\n": "

HyperLSTM \u7684\u8f93\u5165\u662f_^_0_^_\u4e0a\u4e00\u6b65\u4e2d\u5916\u90e8 LSTM \u7684\u8f93\u5165\uff0c_^_2_^_\u4e5f\u662f\u5916\u90e8 LSTM \u7684\u8f93\u51fa\u3002_^_1_^_\u56e0\u6b64\uff0c\u8f93\u5165\u5927\u5c0f\u4e3a_^_3_^_\u3002

\n

HyperLSTM \u7684\u8f93\u51fa\u4e3a_^_4_^_\u548c_^_5_^_\u3002

\n", "

The weight matrices _^_0_^_

\n": "

\u6743\u91cd\u77e9\u9635_^_0_^_

\n", "

We calculate _^_0_^_, _^_1_^_, _^_2_^_ and _^_3_^_ in a loop

\n": "

\u6211\u4eec\u5faa\u73af\u8ba1\u7b97_^_0_^__^_1_^_\u3001_^_2_^_\u548c_^_3_^_

\n", "_^_0_^_

\n": "_^_0_^_

\n", "\n": "\n", "A PyTorch implementation/tutorial of HyperLSTM introduced in paper HyperNetworks.": "\u8bba\u6587 HyperNetworks \u4e2d\u4ecb\u7ecd\u4e86 HyperLSTM \u7684 PyTorch \u5b9e\u73b0/\u6559\u7a0b\u3002", "HyperNetworks - HyperLSTM": "\u8d85\u7f51\u7edc-HyperLSTM" }