@@ -90,7 +90,7 @@ In a distributed setup you would have each FFN (each very large) on a different
discusses dropping tokens when routing is not balanced.</p>
<p>Here’s <a href="experiment.html">the training code</a> and a notebook for training a switch transformer on the Tiny Shakespeare dataset.</p>
<p><a href="https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/switch/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>