The paper introduces multiple choices for $f_c$ and we have only implemented
1D convolution, which seems to give the best results.
Each layer has a separate compression operation $f_c^{(i)}$ where
$i$ is the layer number.
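As a rough sketch (not necessarily the exact module used here), a strided 1D convolution
can compress $n$ memories down to $n / c$ for a compression rate $c$; the `Conv1dCompression`
name and the `[seq_len, batch_size, d_model]` memory layout below are assumptions for illustration.

```python
import torch
from torch import nn


class Conv1dCompression(nn.Module):
    """Compress memories along the sequence dimension with a strided 1D convolution."""

    def __init__(self, compression_rate: int, d_model: int):
        super().__init__()
        # Kernel size and stride equal to the compression rate `c`
        # map `n` memories to `n // c` compressed memories.
        self.conv = nn.Conv1d(d_model, d_model,
                              kernel_size=compression_rate, stride=compression_rate)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # `mem` is `[seq_len, batch_size, d_model]`; Conv1d expects `[batch, channels, seq]`
        mem = mem.permute(1, 2, 0)
        c_mem = self.conv(mem)
        # Back to `[seq_len // compression_rate, batch_size, d_model]`
        return c_mem.permute(2, 0, 1)
```

Each layer $i$ would hold its own instance of such a module as $f_c^{(i)}$.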
## Training compression operation
Since training the compression operation with BPTT requires maintaining
a very large computational graph (many time steps), the paper proposes
an *auto-encoding loss* and an *attention reconstruction loss*.
The auto-encoding loss decodes the original memories from the compressed memories
and computes the reconstruction loss between them.
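The auto-encoding loss is not what we implement here, but a minimal sketch of it,
assuming the convolutional compression above and a hypothetical transposed-convolution
decoder, could look like this:

```python
import torch
from torch import nn
import torch.nn.functional as F


class AutoEncodingLoss(nn.Module):
    """Illustrative sketch: reconstruct original memories from compressed memories."""

    def __init__(self, compression_rate: int, d_model: int):
        super().__init__()
        # Hypothetical learned decoder that up-samples by the compression rate
        self.decoder = nn.ConvTranspose1d(d_model, d_model,
                                          kernel_size=compression_rate, stride=compression_rate)

    def forward(self, mem: torch.Tensor, c_mem: torch.Tensor) -> torch.Tensor:
        # `c_mem` is `[c_seq_len, batch_size, d_model]`; ConvTranspose1d expects `[batch, channels, seq]`
        reconstructed = self.decoder(c_mem.permute(1, 2, 0)).permute(2, 0, 1)
        # Reconstruction error against the (detached) original memories
        return F.mse_loss(reconstructed, mem.detach())
```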
The attention reconstruction loss computes the multi-headed attention results
on the compressed memory and on the uncompressed memory, and takes the mean squared error
between them.
We have implemented the latter here since it gives better results.
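A minimal sketch of the attention reconstruction loss, assuming a multi-head attention
callable `attn(query, key, value)` over tensors of shape `[seq_len, batch_size, d_model]`
(names and signature are illustrative):

```python
import torch
import torch.nn.functional as F


def attention_reconstruction_loss(attn, h: torch.Tensor,
                                  mem: torch.Tensor, c_mem: torch.Tensor) -> torch.Tensor:
    """MSE between attention over the original memories and over the compressed memories."""
    # Detach the hidden states and memories so this loss does not train the main network
    h, mem = h.detach(), mem.detach()
    # Attention over the uncompressed memories (the target)
    target = attn(query=h, key=mem, value=mem)
    # Attention over the compressed memories
    predicted = attn(query=h, key=c_mem, value=c_mem)
    return F.mse_loss(predicted, target)
```

In the paper, gradients of the compression losses are stopped from flowing into the main
network, so they train only the compression operations $f_c^{(i)}$.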
This implementation uses pre-layer normalization
while the paper uses post-layer normalization.
Pre-layer norm does the layer norm before the [FFN](../feedforward.html) and
self-attention, and the pass-through in the residual connection is not normalized.
This is supposed to be more stable in standard transformer setups.
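For illustration, the two arrangements differ roughly as follows
(function names here are just for the sketch):

```python
import torch
from torch import nn


def pre_norm_block(x: torch.Tensor, norm: nn.LayerNorm, sublayer) -> torch.Tensor:
    # Pre-layer norm (used here): normalize before the sub-layer (FFN or self-attention);
    # the residual pass-through `x` is left unnormalized.
    return x + sublayer(norm(x))


def post_norm_block(x: torch.Tensor, norm: nn.LayerNorm, sublayer) -> torch.Tensor:
    # Post-layer norm (used in the paper): normalize after adding the residual.
    return norm(x + sublayer(x))
```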
Here are [the training code](experiment.html) and a notebook for training a compressive transformer
model on the Tiny Shakespeare dataset.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/compressive/experiment.ipynb)