{"font_size":0.4,"font_color":"#FFFFFF","background_alpha":0.5,"background_color":"#9C27B0","Stroke":"none","body":[{"from":4.52,"to":8.67,"location":2,"content":"Okay. So I'm delighted to introduce,"},{"from":8.67,"to":11.36,"location":2,"content":"um, our first lot of invited speakers."},{"from":11.36,"to":14.46,"location":2,"content":"And so we're gonna have two invited speakers, um, today."},{"from":14.46,"to":16.26,"location":2,"content":"So starting off, um,"},{"from":16.26,"to":18.96,"location":2,"content":"we go and have Ashish Vaswani who's gonna be"},{"from":18.96,"to":23.11,"location":2,"content":"talking about self attention for generative models and in particular,"},{"from":23.11,"to":25.53,"location":2,"content":"um, we'll introduce some of the work on"},{"from":25.53,"to":29.19,"location":2,"content":"transformers that he is well-known for along with his colleagues."},{"from":29.19,"to":32.22,"location":2,"content":"Um and then as a sort of, um,"},{"from":32.22,"to":35.23,"location":2,"content":"a special edition then we're also going to have"},{"from":35.23,"to":39.44,"location":2,"content":"Anna Huang talking about some applications of this work."},{"from":39.44,"to":41.54,"location":2,"content":"There are actually at least a couple of people in the class who are"},{"from":41.54,"to":43.79,"location":2,"content":"actually interested in music applications."},{"from":43.79,"to":48.74,"location":2,"content":"So this will be your one chance in the course to see music applications of deep learning."},{"from":48.74,"to":51.89,"location":2,"content":"Okay, um, so I'll hand it over to Ashish."},{"from":51.89,"to":53.36,"location":2,"content":"Thanks, Chris and, uh, thanks, Evie."},{"from":53.36,"to":55.94,"location":2,"content":"Uh, Anna is actually here to make the class less dull."},{"from":55.94,"to":57.83,"location":2,"content":"So [LAUGHTER] she's the highlight on this one."},{"from":57.83,"to":60.84,"location":2,"content":"So uh, so, uh, hi everyone."},{"from":60.84,"to":63.4,"location":2,"content":"Um, um, uh excited to be here."},{"from":63.4,"to":66.2,"location":2,"content":"This is a very large class."},{"from":66.2,"to":67.92,"location":2,"content":"Uh, first invited speaker,"},{"from":67.92,"to":69.92,"location":2,"content":"no pressure, so hopefully this will all go well."},{"from":69.92,"to":74.97,"location":2,"content":"Uh, so yes, so the talk is going to be about, uh, self attention."},{"from":74.97,"to":78.34,"location":2,"content":"Um, and so the purpose is,"},{"from":78.34,"to":82.69,"location":2,"content":"is not going to be just to talk about a particular model, but, as,"},{"from":82.69,"to":85.85,"location":2,"content":"as, as, as empiricists and, and,"},{"from":85.85,"to":87.83,"location":2,"content":"like, well, I'm an empiricist and I"},{"from":87.83,"to":90.38,"location":2,"content":"consume machine learning to apply it to various tasks."},{"from":90.38,"to":95.21,"location":2,"content":"And, and, and, well, starting point always is to ask this question, you know,"},{"from":95.21,"to":96.8,"location":2,"content":"what are the- what's the structure in"},{"from":96.8,"to":98.72,"location":2,"content":"my dataset or what are the symmetries in my dataset,"},{"from":98.72,"to":101.88,"location":2,"content":"and is there a model that exists that that's a very good- that,"},{"from":101.88,"to":106.01,"location":2,"content":"that has the inductive biases to model these properties that exist in my dataset."},{"from":106.01,"to":108.45,"location":2,"content":"So 
hopefully, over the course of this, uh,"},{"from":108.45,"to":111.68,"location":2,"content":"this, this lecture Anna and I will convince you that, uh,"},{"from":111.68,"to":114.41,"location":2,"content":"self attention indeed does have some- has"},{"from":114.41,"to":116.17,"location":2,"content":"the modeling abilities and inductive biases that"},{"from":116.17,"to":118.69,"location":2,"content":"potentially could be useful for the problems that you care about."},{"from":118.69,"to":124.79,"location":2,"content":"Um, so, um, this talk is going to be about learning representations primarily of,"},{"from":124.79,"to":127.44,"location":2,"content":"uh, variable length data where we have images but,"},{"from":127.44,"to":130.19,"location":2,"content":"uh, most of it is going to be variable length data."},{"from":130.19,"to":131.96,"location":2,"content":"And, uh, and, and,"},{"from":131.96,"to":134.97,"location":2,"content":"and all of us care about this problem because we- in"},{"from":134.97,"to":137.99,"location":2,"content":"deep learning, and deep learning is all about representation learning."},{"from":137.99,"to":142.66,"location":2,"content":"And if- and building the right tools for learning representations as,"},{"from":142.66,"to":144.83,"location":2,"content":"as, as, as sort of- is an important factor in,"},{"from":144.83,"to":146.6,"location":2,"content":"in achieving empirical success."},{"from":146.6,"to":149.81,"location":2,"content":"Um, now, uh, the models of choice,"},{"from":149.81,"to":152.16,"location":2,"content":"the primary workhorse for"},{"from":152.16,"to":155.6,"location":2,"content":"perhaps even now, or up to this point, had been recurrent neural networks."},{"from":155.6,"to":160.73,"location":2,"content":"Um, um, how, how many people here are familiar with RNNs?"},{"from":160.73,"to":163.06,"location":2,"content":"[LAUGHTER] Okay."},{"from":163.06,"to":165.26,"location":2,"content":"So definitely up to this point,"},{"from":165.26,"to":167.69,"location":2,"content":"the primary workhorse has been recurrent neural networks,"},{"from":167.69,"to":170.47,"location":2,"content":"and some of the more, uh, some, uh,"},{"from":170.47,"to":174.54,"location":2,"content":"some gated variants that explicitly add multiplicative interactions like LSTMs,"},{"from":174.54,"to":178.18,"location":2,"content":"they also, they also have mechanisms that allow for better gradient transfer."},{"from":178.18,"to":180.62,"location":2,"content":"And some recent variants like gated, uh,"},{"from":180.62,"to":182.3,"location":2,"content":"recurrent units that are a simplification,"},{"from":182.3,"to":186.86,"location":2,"content":"they're kind of the- they're- they dominate this, this recurrent landscape."},{"from":186.86,"to":189.6,"location":2,"content":"Um, and typically how did recurrent neural networks, uh,"},{"from":189.6,"to":192.68,"location":2,"content":"learn or, um, produce representations?"},{"from":192.68,"to":195.76,"location":2,"content":"They consume a string or a sentence, um,"},{"from":195.76,"to":197.63,"location":2,"content":"even an image, imagine, you know,"},{"from":197.63,"to":201.13,"location":2,"content":"in a particular- in sequentially and, uh, at each,"},{"from":201.13,"to":202.53,"location":2,"content":"at each, uh, position,"},{"from":202.53,"to":204.9,"location":2,"content":"at each timestep they produce, they produce a,"},{"from":204.9,"to":206.9,"location":2,"content":"a continuous representation 
that's"},{"from":206.9,"to":210.84,"location":2,"content":"summarization of, of everything that they've actually crunched through."},{"from":210.84,"to":216.96,"location":2,"content":"Um, now, so in, in, in the,"},{"from":216.96,"to":219.48,"location":2,"content":"in the realm of large data, uh,"},{"from":219.48,"to":221.31,"location":2,"content":"par- having parallel models is,"},{"from":221.31,"to":222.89,"location":2,"content":"is quite, is quite beneficial."},{"from":222.89,"to":225.14,"location":2,"content":"In fact, I was actually reading Oliver Selfridge."},{"from":225.14,"to":226.37,"location":2,"content":"Uh, he was a,"},{"from":226.37,"to":228.77,"location":2,"content":"he was a professor at MIT and, uh, he had this,"},{"from":228.77,"to":233.36,"location":2,"content":"uh, sorry, he wrote the precursor to deep nets, it's called Pandemonium."},{"from":233.36,"to":234.59,"location":2,"content":"I would recommend everybody to read it."},{"from":234.59,"to":236.45,"location":2,"content":"And he has this fascinating note that, you know,"},{"from":236.45,"to":237.96,"location":2,"content":"if you give me more parallel computation,"},{"from":237.96,"to":239.9,"location":2,"content":"I'll just add more data and make it slower."},{"from":239.9,"to":242.18,"location":2,"content":"So you can consume more data."},{"from":242.18,"to":247.26,"location":2,"content":"Um, and, and recurrence, uh, recurrence sort of just by construction, um,"},{"from":247.26,"to":248.91,"location":2,"content":"limits parallelization because you have to,"},{"from":248.91,"to":251.1,"location":2,"content":"you have to wait until- your wait un-"},{"from":251.1,"to":254.03,"location":2,"content":"for a particular time point to produce a representation."},{"from":254.03,"to":256.27,"location":2,"content":"Um, but if there's any questions,"},{"from":256.27,"to":257.32,"location":2,"content":"please raise your hands, I'll"},{"from":257.32,"to":258.9,"location":2,"content":"hopefully look around and, and,"},{"from":258.9,"to":261.23,"location":2,"content":"uh, be able to attend to your question."},{"from":261.23,"to":264.92,"location":2,"content":"Um, and again, and, and now because we're actually producing these representations,"},{"from":264.92,"to":266.15,"location":2,"content":"we're sort of summarizing,"},{"from":266.15,"to":267.74,"location":2,"content":"you know, if you want to pass information,"},{"from":267.74,"to":269.76,"location":2,"content":"if you want to pass co-reference information,"},{"from":269.76,"to":272.35,"location":2,"content":"then we kind of have to shove all of this inside"},{"from":272.35,"to":276.11,"location":2,"content":"this fixed size vector, so it could potentially be difficult to model."},{"from":276.11,"to":279.63,"location":2,"content":"And, uh, while they have been successful in language, uh,"},{"from":279.63,"to":282.53,"location":2,"content":"explicit they don't have- the architecture"},{"from":282.53,"to":285.89,"location":2,"content":"doesn't have a very clear explicit way to model hierarchy which is,"},{"from":285.89,"to":288.15,"location":2,"content":"which is something that's very important in language."},{"from":288.15,"to":294.39,"location":2,"content":"Um, now, um, so there has been- there has been excellent work on,"},{"from":294.39,"to":298.64,"location":2,"content":"a precursor to self attention that actually surmounted some of these difficulties."},{"from":298.64,"to":301.55,"location":2,"content":"And what surmounted these difficulties, basically, was convolutional 
sequence models"},{"from":301.55,"to":305.18,"location":2,"content":"where you have these limited receptive field convolutions that,"},{"from":305.18,"to":307.22,"location":2,"content":"again, consumed the sentence now not,"},{"from":307.22,"to":309.59,"location":2,"content":"not sequentially but in depth."},{"from":309.59,"to":312.13,"location":2,"content":"And they produce representations for every-"},{"from":312.13,"to":314.72,"location":2,"content":"they produce representations of your variable length sequences."},{"from":314.72,"to":317.72,"location":2,"content":"Um, and, uh, they're trivial to"},{"from":317.72,"to":321.11,"location":2,"content":"parallelize because you can apply these convolutions simultaneously at every position."},{"from":321.11,"to":323.03,"location":2,"content":"Each layer is trivial to parallelize."},{"from":323.03,"to":326.17,"location":2,"content":"Uh, the, the, the serial dependencies are only in the number of layers."},{"from":326.17,"to":328.24,"location":2,"content":"Um, you can get, uh,"},{"from":328.24,"to":329.96,"location":2,"content":"you can- you can get"},{"from":329.96,"to":332.75,"location":2,"content":"these local dependencies efficiently because a single application of"},{"from":332.75,"to":337.48,"location":2,"content":"a convolution can consume all the information inside its local receptive field."},{"from":337.48,"to":339.32,"location":2,"content":"Um, now if you want to have"},{"from":339.32,"to":342.17,"location":2,"content":"these really long distance interactions while you"},{"from":342.17,"to":345.02,"location":2,"content":"don't have to pass through a linear number of steps,"},{"from":345.02,"to":346.06,"location":2,"content":"you still because these,"},{"from":346.06,"to":349.67,"location":2,"content":"because these receptive fields are local you might need something like linear"},{"from":349.67,"to":353.52,"location":2,"content":"in depth or logarithmic if you're doing something like dilated convolutions."},{"from":353.52,"to":356.03,"location":2,"content":"So there's still need- the number of layers that are needed are"},{"from":356.03,"to":359.21,"location":2,"content":"still a function of the length of the of, of your string."},{"from":359.21,"to":361.07,"location":2,"content":"Uh, but they're a great development and they"},{"from":361.07,"to":363.32,"location":2,"content":"actually pushed a lot of research like WaveNet, for example,"},{"from":363.32,"to":365.23,"location":2,"content":"is a classic sort of success story of"},{"from":365.23,"to":368.82,"location":2,"content":"convolutio- convolutional sequence models, and ByteNet."},{"from":368.82,"to":375.08,"location":2,"content":"Um, now, so far attention has been like one of the most important components,"},{"from":375.08,"to":376.81,"location":2,"content":"the sort of content-based,"},{"from":376.81,"to":379.06,"location":2,"content":"you know, memory retrieval mechanism."},{"from":379.06,"to":383.56,"location":2,"content":"And it's content-based because you have your decoder that attends to all this content,"},{"from":383.56,"to":386.63,"location":2,"content":"that's your encoder and then just sort of decides what to wha- what,"},{"from":386.63,"to":388.58,"location":2,"content":"what information to absorb based on how similar"},{"from":388.58,"to":390.98,"location":2,"content":"this content is to every position in the memory."},{"from":390.98,"to":393.44,"location":2,"content":"So this has been a very critical mechanism 
in,"},{"from":393.44,"to":394.95,"location":2,"content":"uh, in neural machine translation."},{"from":394.95,"to":396.95,"location":2,"content":"So now the question that we asked was, like, why,"},{"from":396.95,"to":400.46,"location":2,"content":"why not just use attention for representations and, uh,"},{"from":400.46,"to":403.79,"location":2,"content":"now here's what sort of a rough framework of this,"},{"from":403.79,"to":406.44,"location":2,"content":"this representation mechanism would look like, uh,"},{"from":406.44,"to":409.63,"location":2,"content":"the way- just sort of repeating what attention is essentially."},{"from":409.63,"to":412.36,"location":2,"content":"Now imagine you have- you want to represent the word,"},{"from":412.36,"to":415.73,"location":2,"content":"re-represent the word representing, you want to construct its new representation."},{"from":415.73,"to":418.61,"location":2,"content":"And then first, uh, you, you attend or you,"},{"from":418.61,"to":420.71,"location":2,"content":"you compare yourself, you compare your content,"},{"from":420.71,"to":422.76,"location":2,"content":"and in the beginning it could just be a word embedding."},{"from":422.76,"to":425.54,"location":2,"content":"You compare your content with all your words, and with all,"},{"from":425.54,"to":427.34,"location":2,"content":"with all the embeddings and based on these,"},{"from":427.34,"to":429.9,"location":2,"content":"based on these compatibilities or these comparisons,"},{"from":429.9,"to":434.18,"location":2,"content":"you produce, uh, you produce a weighted combination of your entire neighborhood,"},{"from":434.18,"to":436.18,"location":2,"content":"and based on that weighted combination you,"},{"from":436.18,"to":437.87,"location":2,"content":"you summarize all that information."},{"from":437.87,"to":440.15,"location":2,"content":"So it's, like, you're re-expressing yourself in terms"},{"from":440.15,"to":442.73,"location":2,"content":"of a weighted combination of your entire neighborhood."},{"from":442.73,"to":443.93,"location":2,"content":"That's what attention does,"},{"from":443.93,"to":448.95,"location":2,"content":"and you can add feed-forward layers to basically sort of compute new features for you."},{"from":448.95,"to":454.7,"location":2,"content":"Um, now, um so the first part is going to be about how, like,"},{"from":454.7,"to":457.76,"location":2,"content":"some of the properties of self attention actually help us in text generation, like,"},{"from":457.76,"to":459.32,"location":2,"content":"what inductive biases are actually useful,"},{"from":459.32,"to":460.95,"location":2,"content":"and we empirically showed that indeed they,"},{"from":460.95,"to":463.06,"location":2,"content":"they move the needle in text generation."},{"from":463.06,"to":464.99,"location":2,"content":"And this is going to be about machine translation,"},{"from":464.99,"to":467.42,"location":2,"content":"but there was other work also that we'll talk about later."},{"from":467.42,"to":469.88,"location":2,"content":"So [NOISE] now with this, uh,"},{"from":469.88,"to":471.88,"location":2,"content":"with this sort of, uh,"},{"from":471.88,"to":475.47,"location":2,"content":"with this attention mechanism you get this- we get a constant path length."},{"from":475.47,"to":478,"location":2,"content":"So all pairs or a word can in-"},{"from":478,"to":481.1,"location":2,"content":"position can interact with any position, every position simultaneously."},{"from":481.1,"to":484.25,"location":2,"content":"Um, 
hopefully if the number of positions is not too many."},{"from":484.25,"to":486.41,"location":2,"content":"Uh, attention just by virtue of, like,"},{"from":486.41,"to":488.06,"location":2,"content":"it's a construction, you have a softmax,"},{"from":488.06,"to":490.2,"location":2,"content":"you have these gating and multiplicative interactions."},{"from":490.2,"to":492.68,"location":2,"content":"And again, I'm not gonna be able to explain why,"},{"from":492.68,"to":494.19,"location":2,"content":"but it's, it's interesting, like,"},{"from":494.19,"to":495.29,"location":2,"content":"you've seen these models, like,"},{"from":495.29,"to":496.4,"location":2,"content":"even, even the, uh,"},{"from":496.4,"to":499.98,"location":2,"content":"even Pixel, PixelCNN, uh, or, um,"},{"from":499.98,"to":501.66,"location":2,"content":"when it was actually modeling images,"},{"from":501.66,"to":504.96,"location":2,"content":"they explicitly had to add these multiplicative interactions inside the model to,"},{"from":504.96,"to":506.88,"location":2,"content":"to basically beat RNNs,"},{"from":506.88,"to":509.39,"location":2,"content":"and attention just by construction gets this because you're,"},{"from":509.39,"to":513.03,"location":2,"content":"you're multiplying the attention probabilities with your, with your activations."},{"from":513.03,"to":514.58,"location":2,"content":"It's trivial to parallelize, why?"},{"from":514.58,"to":519.44,"location":2,"content":"Because you can just do attention with matmuls, especially the variant that we use in our paper,"},{"from":519.44,"to":520.87,"location":2,"content":"uh, in our work."},{"from":520.87,"to":523.89,"location":2,"content":"And, uh, so now the question is"},{"from":523.89,"to":529.16,"location":2,"content":"convolutional sequence to- convolutional sequence models have been very successful in,"},{"from":529.16,"to":532.33,"location":2,"content":"in, in, in ge- generative tasks for text."},{"from":532.33,"to":534.83,"location":2,"content":"Can we actually do the same or achieved the same with, uh,"},{"from":534.83,"to":538.58,"location":2,"content":"with, uh, attention as our primary workhorse for representation learning."},{"from":538.58,"to":543.49,"location":2,"content":"Um, so just to sort of add some context and there's been some,"},{"from":543.49,"to":547.43,"location":2,"content":"there's been some- up to- up to the transformer there have been a lot of"},{"from":547.43,"to":552.02,"location":2,"content":"great work on using self attention primarily for classification within."},{"from":552.02,"to":555.29,"location":2,"content":"There was, there was work on self attention within the confines of,"},{"from":555.29,"to":556.61,"location":2,"content":"like, recurrent neural networks."},{"from":556.61,"to":559.37,"location":2,"content":"Um, perhaps the closest to us is the,"},{"from":559.37,"to":560.91,"location":2,"content":"is the memory networks,"},{"from":560.91,"to":562.82,"location":2,"content":"uh, by Weston, Sukhbaatar,"},{"from":562.82,"to":565.72,"location":2,"content":"where they actually had a version of recurrent attention,"},{"from":565.72,"to":567.29,"location":2,"content":"but they didn't have, uh,"},{"from":567.29,"to":570.71,"location":2,"content":"but they didn't actually- empirically,"},{"from":570.71,"to":573.5,"location":2,"content":"they didn't show it to work on sort of conditional modeling, like,"},{"from":573.5,"to":577.37,"location":2,"content":"uh, translation and their mechanism was, uh, 
like,"},{"from":577.37,"to":581.55,"location":2,"content":"they were using sort of a fixed- they were using a fixed query at every step."},{"from":581.55,"to":583.5,"location":2,"content":"So there's- it, it leaves something to be desired."},{"from":583.5,"to":587.06,"location":2,"content":"They still had this question, is it actually going to work, um, on,"},{"from":587.06,"to":590.87,"location":2,"content":"on, on large scale machine translation systems or large-scale text generation systems."},{"from":590.87,"to":594.1,"location":2,"content":"So this is sort of the, the culmination of, um,"},{"from":594.1,"to":597.43,"location":2,"content":"of the, the self attention, our self attention work."},{"from":597.43,"to":600.5,"location":2,"content":"This is the tran- the- and we put it together in the transformer model."},{"from":600.5,"to":603.2,"location":2,"content":"And, uh, so what does this look like?"},{"from":603.2,"to":605.98,"location":2,"content":"So we're going to use attention pri- we're going to use"},{"from":605.98,"to":609.39,"location":2,"content":"attention primarily for computing representations so- of your input."},{"from":609.39,"to":611.48,"location":2,"content":"Imagine you're doing English to German translation."},{"from":611.48,"to":614.03,"location":2,"content":"So you have your words, and notice that,"},{"from":614.03,"to":616.61,"location":2,"content":"uh, attention is, uh, permutation invariant."},{"from":616.61,"to":619.22,"location":2,"content":"So you just change the order of your positions."},{"from":619.22,"to":620.91,"location":2,"content":"You change the order of your words and, and,"},{"from":620.91,"to":623.32,"location":2,"content":"uh, it's not going to affect the actual output."},{"from":623.32,"to":625.34,"location":2,"content":"So in ord- in order to maintain order we add,"},{"from":625.34,"to":626.99,"location":2,"content":"we add position representations."},{"from":626.99,"to":629.71,"location":2,"content":"And, uh, there's two kinds that we tried in the paper,"},{"from":629.71,"to":633.18,"location":2,"content":"these, these fantastic sinusoids that Noam invented."},{"from":633.18,"to":635.63,"location":2,"content":"And we also use learned representations which are"},{"from":635.63,"to":638.09,"location":2,"content":"very plain vanilla, both of them work equally well."},{"from":638.09,"to":640.42,"location":2,"content":"Um, and, uh, so,"},{"from":640.42,"to":642.89,"location":2,"content":"so first we have- so the encoder looks as follows, right?"},{"from":642.89,"to":646.97,"location":2,"content":"So we have a self attention layer that just recomputes the representation, uh,"},{"from":646.97,"to":650.09,"location":2,"content":"for every position simultaneously using attention,"},{"from":650.09,"to":651.54,"location":2,"content":"then we have a feed-forward layer."},{"from":651.54,"to":652.82,"location":2,"content":"And we also have residual,"},{"from":652.82,"to":654.38,"location":2,"content":"residual connections and I'll,"},{"from":654.38,"to":656.6,"location":2,"content":"I'll sort of give you a glimpse of what these residual connections"},{"from":656.6,"to":659.09,"location":2,"content":"might be bringing. That is, between every,"},{"from":659.09,"to":662.99,"location":2,"content":"every layer, and the input we have a skip connection that just adds the activations."},{"from":662.99,"to":665.33,"location":2,"content":"Uh, and then this tuple of, uh,"},{"from":665.33,"to":668.13,"location":2,"content":"self attention and feed-forward layer 
just essentially repeats."},{"from":668.13,"to":670.22,"location":2,"content":"Now, on the decoder side, uh,"},{"from":670.22,"to":673.92,"location":2,"content":"we've- we, we have a sort of standard encoder decoder architecture."},{"from":673.92,"to":677.5,"location":2,"content":"On the decoder side, we mimic a language model using self attention,"},{"from":677.5,"to":680.3,"location":2,"content":"and the way to mimic a language model using self attention is to impose"},{"from":680.3,"to":683.54,"location":2,"content":"causality by just masking out the positions that you can look at."},{"from":683.54,"to":685.66,"location":2,"content":"So basically, uh,"},{"from":685.66,"to":689.28,"location":2,"content":"the first position it's- it can't look forward, it's illegal to look forward."},{"from":689.28,"to":692.08,"location":2,"content":"It can look at itself because we actually shift the input."},{"from":692.08,"to":694.75,"location":2,"content":"Um, so it's not copying, uh."},{"from":694.75,"to":697.24,"location":2,"content":"It's kind of surprising that parti- with these models,"},{"from":697.24,"to":698.79,"location":2,"content":"it's very easy to copy at one point,"},{"from":698.79,"to":701.59,"location":2,"content":"when early on it was even harder to ge- you know,"},{"from":701.59,"to":703.36,"location":2,"content":"do copying with recurrent models."},{"from":703.36,"to":704.86,"location":2,"content":"But now, at least, you can copy really well,"},{"from":704.86,"to":706.85,"location":2,"content":"which is a positive sign, I think overall."},{"from":706.85,"to":709.83,"location":2,"content":"Um, but, uh, so now on the decoder side, uh,"},{"from":709.83,"to":711.15,"location":2,"content":"we have, uh, we have"},{"from":711.15,"to":714.38,"location":2,"content":"this causal self attention layer followed by encoder-decoder attention,"},{"from":714.38,"to":716.18,"location":2,"content":"where we actually attend to the, uh,"},{"from":716.18,"to":719.45,"location":2,"content":"last layer of the encoder and a feed-forward layer, and this tripled,"},{"from":719.45,"to":720.67,"location":2,"content":"repeats a mul- a few times,"},{"from":720.67,"to":722.95,"location":2,"content":"and at the end we have the standard cross-entropy loss."},{"from":722.95,"to":728.47,"location":2,"content":"Um, and, um, so, um, sort of,"},{"from":728.47,"to":730.65,"location":2,"content":"staring at the- at,"},{"from":730.65,"to":732.74,"location":2,"content":"at our parti- at the particular variant of the self-"},{"from":732.74,"to":735.21,"location":2,"content":"of the attention mechanis- mechanism that we use,"},{"from":735.21,"to":737.97,"location":2,"content":"we went for both- we went for simplicity and speed."},{"from":737.97,"to":741.63,"location":2,"content":"So, um, so how do you actually compute attention?"},{"from":741.63,"to":744.47,"location":2,"content":"So imagine you want to re-represent the position e2."},{"from":744.47,"to":746.93,"location":2,"content":"And, uh, we're going to first linearly,"},{"from":746.93,"to":750.22,"location":2,"content":"linearly transform it into, uh, a query,"},{"from":750.22,"to":752.15,"location":2,"content":"and then we're gonna linearly transform"},{"from":752.15,"to":754.52,"location":2,"content":"every position in your neighborhood"},{"from":754.52,"to":756.51,"location":2,"content":"or let's say every position at the input because this is the,"},{"from":756.51,"to":757.8,"location":2,"content":"uh, uh, the encoder 
side,"},{"from":757.8,"to":759.17,"location":2,"content":"to, uh, a key."},{"from":759.17,"to":761.68,"location":2,"content":"And these linear transformations can actually be thought of as features,"},{"from":761.68,"to":763.1,"location":2,"content":"and I'll talk more about it later on."},{"from":763.1,"to":765.5,"location":2,"content":"So it's like- it's, it's basically a bilinear form."},{"from":765.5,"to":768.36,"location":2,"content":"You're projecting these vectors into a space where dot product is"},{"from":768.36,"to":771.65,"location":2,"content":"a good- where just a dot product is a good proxy for similarity."},{"from":771.65,"to":773.2,"location":2,"content":"Okay? So now, you have your logit,"},{"from":773.2,"to":775.81,"location":2,"content":"so you just do a so- softmax to compute a convex combination."},{"from":775.81,"to":777.93,"location":2,"content":"And now based on this convex combination,"},{"from":777.93,"to":781.48,"location":2,"content":"you're going to then re-express e2 in"},{"from":781.48,"to":785.38,"location":2,"content":"terms of this convex combination of all the vectors of all these positions."},{"from":785.38,"to":788.11,"location":2,"content":"And before doing- before doing the convex combination,"},{"from":788.11,"to":790.5,"location":2,"content":"we again do a linear transformation to produce values."},{"from":790.5,"to":793.94,"location":2,"content":"And then we do a second linear transformation just to"},{"from":793.94,"to":797.62,"location":2,"content":"mix this information and pass it through a- pass it through a feedforward layer."},{"from":797.62,"to":799.08,"location":2,"content":"And this is- um,"},{"from":799.08,"to":801.91,"location":2,"content":"and all of this can be expressed basically"},{"from":801.91,"to":804.9,"location":2,"content":"in two- in two- in two matrix multiplications,"},{"from":804.9,"to":807.62,"location":2,"content":"and the square root factor is just to make sure that these,"},{"from":807.62,"to":809.08,"location":2,"content":"these dot products don't blow up."},{"from":809.08,"to":810.42,"location":2,"content":"It's just a scaling factor."},{"from":810.42,"to":812.14,"location":2,"content":"And, uh, and, and,"},{"from":812.14,"to":813.61,"location":2,"content":"wha- why is this particular- why is"},{"from":813.61,"to":815.74,"location":2,"content":"this mechanism attractive? 
Well, it's just really fast."},{"from":815.74,"to":817.34,"location":2,"content":"You can do this very quickly on a GPU,"},{"from":817.34,"to":819.01,"location":2,"content":"and simul- you can do it simultaneously for"},{"from":819.01,"to":823.04,"location":2,"content":"all positions with just two matmuls and a softmax."},{"from":823.04,"to":825.32,"location":2,"content":"Um, on the decoder side it's,"},{"from":825.32,"to":826.64,"location":2,"content":"it's exactly the same,"},{"from":826.64,"to":834.59,"location":2,"content":"except we impose causality by just adding, uh, minus 1e9 to the logits."},{"from":834.59,"to":838.13,"location":2,"content":"So it basi- it's just- you just get zero probabilities on those positions."},{"from":838.13,"to":840.82,"location":2,"content":"So we just impose causality by, by adding these,"},{"from":840.82,"to":844.41,"location":2,"content":"uh, highly negative values on the attention- on the attention logits."},{"from":844.41,"to":846.75,"location":2,"content":"Um, is, is everything-"},{"from":846.75,"to":847.41,"location":2,"content":"[LAUGHTER]"},{"from":847.41,"to":853.6,"location":2,"content":"I thought that was a question."},{"from":853.6,"to":858.46,"location":2,"content":"So, um, [LAUGHTER] okay so attention is really, uh, attention is cheap."},{"from":858.46,"to":860.89,"location":2,"content":"So because it's- because this variant of"},{"from":860.89,"to":863.9,"location":2,"content":"attention just involve two- involves two matrix multiplications,"},{"from":863.9,"to":866.51,"location":2,"content":"it's quadratic in the length of your sequence."},{"from":866.51,"to":870.82,"location":2,"content":"And now what's the computational profile of RNNs or convolutions?"},{"from":870.82,"to":872.37,"location":2,"content":"They're quadratic in the dimension."},{"from":872.37,"to":875.04,"location":2,"content":"Because, basically, you can just think of a convolution just flattening"},{"from":875.04,"to":878.17,"location":2,"content":"your input or just applying a linear transformation on top of it, right?"},{"from":878.17,"to":880.84,"location":2,"content":"So- and when does this actually become very attractive?"},{"from":880.84,"to":884.14,"location":2,"content":"This becomes very, very attractive when your dimension is,"},{"from":884.14,"to":886.63,"location":2,"content":"uh, much larger than your length."},{"from":886.63,"to":888.46,"location":2,"content":"Which is the case for machine translation."},{"from":888.46,"to":891.34,"location":2,"content":"Now, we will talk about cases when there's- when the- when this is not true,"},{"from":891.34,"to":894.28,"location":2,"content":"and we have to- we have to do a- we have to make other model developments."},{"from":894.28,"to":896.44,"location":2,"content":"Um, but, uh, but for"},{"from":896.44,"to":898.02,"location":2,"content":"short sequences or sequences where"},{"from":898.02,"to":900.02,"location":2,"content":"your length does- where your dimension dominates length,"},{"from":900.02,"to":902.89,"location":2,"content":"attention is a very- has a very favorable computation profile."},{"from":902.89,"to":906.36,"location":2,"content":"And as you can see, it's about four times faster than an RNN."},{"from":906.36,"to":908.92,"location":2,"content":"Um, um, and, and faster than"},{"from":908.92,"to":913,"location":2,"content":"a convolutional model where the- you have a kernel of- like filter width of, uh, three."},{"from":913,"to":919.32,"location":2,"content":"So, so there's still one 
problem."},{"from":919.32,"to":921.46,"location":2,"content":"Now, here's something- so in language,"},{"from":921.46,"to":922.6,"location":2,"content":"typically, we want to know, like,"},{"from":922.6,"to":923.9,"location":2,"content":"who did what to whom, right?"},{"from":923.9,"to":925.77,"location":2,"content":"So now, imagine you applied a convolutional filter."},{"from":925.77,"to":926.83,"location":2,"content":"Because you actually have"},{"from":926.83,"to":930.45,"location":2,"content":"different linear transformations based on let- relative distances,"},{"from":930.45,"to":931.77,"location":2,"content":"like this, this, this, this,"},{"from":931.77,"to":935.32,"location":2,"content":"linear transformation on the word who, uh, o- o- on the concept,"},{"from":935.32,"to":937.69,"location":2,"content":"we can have- can learn this concept of who and, and, and,"},{"from":937.69,"to":940.36,"location":2,"content":"pick out different information from this embedding of the word I."},{"from":940.36,"to":941.93,"location":2,"content":"And this linear transformation,"},{"from":941.93,"to":944.53,"location":2,"content":"the lre- the red linear transformation can pick out different information"},{"from":944.53,"to":947.76,"location":2,"content":"from kicked and the blue linear transformation can pick out different,"},{"from":947.76,"to":949.45,"location":2,"content":"different information from ball."},{"from":949.45,"to":953.23,"location":2,"content":"Now, when you have a single attention layer, this is difficult."},{"from":953.23,"to":955.33,"location":2,"content":"Because all- because they're just a convex combination"},{"from":955.33,"to":957.32,"location":2,"content":"where you have the same linear transformation everywhere."},{"from":957.32,"to":960.22,"location":2,"content":"All that's available to you is just a- is just mixing proportions."},{"from":960.22,"to":963.67,"location":2,"content":"So you can't pick out different pieces of information from different places."},{"from":963.67,"to":970.27,"location":2,"content":"Well, what if we had one attention layer for who?"},{"from":970.27,"to":973.6,"location":2,"content":"So you can think of an attention layer as something like a feature detector almost,"},{"from":973.6,"to":975.04,"location":2,"content":"like, because a particular- it,"},{"from":975.04,"to":978.34,"location":2,"content":"it might try to- it might- because it carries with it a linear transformation,"},{"from":978.34,"to":981.7,"location":2,"content":"so it's projecting them in a space that- which starts caring maybe about syntax,"},{"from":981.7,"to":984.49,"location":2,"content":"or it's projecting in this space which starts caring about who or what."},{"from":984.49,"to":988.94,"location":2,"content":"Uh, then we can have another attention layer for or attention head for what,"},{"from":988.94,"to":991.49,"location":2,"content":"did what, and other- another attention head for,"},{"from":991.49,"to":994.12,"location":2,"content":"for, for whom- to whom."},{"from":994.12,"to":997.3,"location":2,"content":"And all of this can actually be done in parallel,"},{"from":997.3,"to":999.21,"location":2,"content":"and that's actually- and that's exactly what we do."},{"from":999.21,"to":1001.25,"location":2,"content":"And for efficiency, instead of actually"},{"from":1001.25,"to":1004.23,"location":2,"content":"having these dimensions operating in a large space,"},{"from":1004.23,"to":1006.99,"location":2,"content":"we just- we just reduce the dimensionality of all these 
heads"},{"from":1006.99,"to":1010.23,"location":2,"content":"and we operate these attention layers in parallel, sort of bridging the gap."},{"from":1010.23,"to":1011.67,"location":2,"content":"Now, here's a, uh,"},{"from":1011.67,"to":1013.66,"location":2,"content":"perhaps, well, here's a little quiz."},{"from":1013.66,"to":1016.4,"location":2,"content":"I mean, can you actually- is there"},{"from":1016.4,"to":1021.11,"location":2,"content":"a combination of heads or is there a configuration in which you can,"},{"from":1021.11,"to":1024.26,"location":2,"content":"actually, exactly simulate a convolution probably with more parameters?"},{"from":1024.26,"to":1026.39,"location":2,"content":"I think there should be a simple way to show that if you"},{"from":1026.39,"to":1029.65,"location":2,"content":"had mo- more heads or heads are a function of positions,"},{"from":1029.65,"to":1031.88,"location":2,"content":"you could probably just simulate a convolution,"},{"from":1031.88,"to":1033.38,"location":2,"content":"but- although with a lot of parameters."},{"from":1033.38,"to":1035.15,"location":2,"content":"Uh, so it can- in, in,"},{"from":1035.15,"to":1037.14,"location":2,"content":"in the limit, it can actually simulate a convolution."},{"from":1037.14,"to":1041.28,"location":2,"content":"Uh, and it also- we can al- we can continue to enjoy the benefits of parallelism,"},{"from":1041.28,"to":1043.05,"location":2,"content":"but we did increase the number of softmaxes"},{"from":1043.05,"to":1044.82,"location":2,"content":"because each head then carries with it a softmax."},{"from":1044.82,"to":1047.19,"location":2,"content":"But the amount of FLOPS didn't change because we-"},{"from":1047.19,"to":1050.01,"location":2,"content":"instead of actually having these heads operating in very large dimensions,"},{"from":1050.01,"to":1052.22,"location":2,"content":"they're operating in very small dimensions."},{"from":1052.22,"to":1055.11,"location":2,"content":"Um, so, uh, when we applied this on, on,"},{"from":1055.11,"to":1057.54,"location":2,"content":"on machine translation, um,"},{"from":1057.54,"to":1060.3,"location":2,"content":"we were able to drama- uh, dramatically outperform,"},{"from":1060.3,"to":1063.64,"location":2,"content":"uh, previous results on English-German and English-French translation."},{"from":1063.64,"to":1067.08,"location":2,"content":"So we had a pretty standard setup: 32,000-word vocabularies,"},{"from":1067.08,"to":1070.32,"location":2,"content":"WordPiece encodings, WMT14-, uh,"},{"from":1070.32,"to":1072.54,"location":2,"content":"WMT 2014, uh, was our test set,"},{"from":1072.54,"to":1073.97,"location":2,"content":"2013 did the dev set."},{"from":1073.97,"to":1079.12,"location":2,"content":"And, uh, and some of these results were much stronger than even our previous ensemble models."},{"from":1079.12,"to":1082.69,"location":2,"content":"And, um, and on English-French also,"},{"from":1082.69,"to":1085.4,"location":2,"content":"we had some- we had some very favorabl- favorable results."},{"from":1085.4,"to":1086.63,"location":2,"content":"Uh, and we- and we are,"},{"from":1086.63,"to":1088.42,"location":2,"content":"we, we, we achieved state of the art."},{"from":1088.42,"to":1091.26,"location":2,"content":"Now, ste- stepping back a bit, uh,"},{"from":1091.26,"to":1093.9,"location":2,"content":"I- I'm not claiming that we,"},{"from":1093.9,"to":1097.11,"location":2,"content":"we arrived at an architecture that has better expressivity than an 
LSTM."},{"from":1097.11,"to":1098.6,"location":2,"content":"I mean, there's, there's, there's,"},{"from":1098.6,"to":1102.88,"location":2,"content":"there's theorems that are- that say that LSTMs can model any function."},{"from":1102.88,"to":1107.87,"location":2,"content":"Um, perhaps, all we did was just build an architecture that was good for SGD."},{"from":1107.87,"to":1110.89,"location":2,"content":"Because stochastic gradient descent could just train this architecture really well,"},{"from":1110.89,"to":1113.4,"location":2,"content":"because the gradient dynamics in attention are very simple- attention is"},{"from":1113.4,"to":1114.62,"location":2,"content":"just a linear combination."},{"from":1114.62,"to":1117.82,"location":2,"content":"And, uh, um, I think that's- I,"},{"from":1117.82,"to":1119.56,"location":2,"content":"I think that's actually favorable."},{"from":1119.56,"to":1122.23,"location":2,"content":"But hopefully, uh, as we- as we go on,"},{"from":1122.23,"to":1123.48,"location":2,"content":"but the- well, I'd,"},{"from":1123.48,"to":1124.86,"location":2,"content":"I'd also like to point out that, you know,"},{"from":1124.86,"to":1127.77,"location":2,"content":"we do explicit mo- we do explicitly model all,"},{"from":1127.77,"to":1129.44,"location":2,"content":"all path connection, all, all,"},{"from":1129.44,"to":1134.07,"location":2,"content":"all pairwise connections and it has its adva- advantage of a very clear modeling,"},{"from":1134.07,"to":1136.58,"location":2,"content":"very clear relationships directly between, between any two words."},{"from":1136.58,"to":1140.64,"location":2,"content":"Um, and, like, hopefully we'll be able to also"},{"from":1140.64,"to":1142.26,"location":2,"content":"show that there are other inductive biases."},{"from":1142.26,"to":1145.61,"location":2,"content":"That it's not just like building more architectures that,"},{"from":1145.61,"to":1148.72,"location":2,"content":"that are good for- that are good inductive biases for SGD."},{"from":1148.72,"to":1153,"location":2,"content":"So frameworks, a lot of our work was initially pushed out in tensor2tensor."},{"from":1153,"to":1155.98,"location":2,"content":"Maybe that might change in the future with the arrival of JAX."},{"from":1155.98,"to":1158.79,"location":2,"content":"There's ano- there's a framework also from Amazon called Sockeye."},{"from":1158.79,"to":1160.81,"location":2,"content":"There's also Fairseq, uh, the se- the"},{"from":1160.81,"to":1163.64,"location":2,"content":"convolutional sequence-to-sequence toolkit from Facebook that the,"},{"from":1163.64,"to":1166.29,"location":2,"content":"they prob- I'm actually not sure if it has a transformer implementation,"},{"from":1166.29,"to":1169.48,"location":2,"content":"but they have some really good sequence-to-sequence models as well."},{"from":1169.48,"to":1171.7,"location":2,"content":"Um, okay."},{"from":1171.7,"to":1172.85,"location":2,"content":"So the importance of residuals."},{"from":1172.85,"to":1177.88,"location":2,"content":"So, uh, we have these resil- residual connections, uh, between, um,"},{"from":1177.88,"to":1181.8,"location":2,"content":"so we have these residual connections that go from here to- here to here,"},{"from":1181.8,"to":1185.24,"location":2,"content":"here to here, like between every pair of layers, and it's interesting."},{"from":1185.24,"to":1187.89,"location":2,"content":"So we, um, we- so what we do is we just"},{"from":1187.89,"to":1191.03,"location":2,"content":"add the position information 
at the input to the model."},{"from":1191.03,"to":1193.53,"location":2,"content":"And, uh, we don't infuse- we don't infuse"},{"from":1193.53,"to":1196.15,"location":2,"content":"or we don't inject position information at every layer."},{"from":1196.15,"to":1202.18,"location":2,"content":"So when, uh, we severed these residual connections and we loo- stared at these,"},{"from":1202.18,"to":1204.81,"location":2,"content":"uh, stared at these attention distributions, this is the center or,"},{"from":1204.81,"to":1207.55,"location":2,"content":"sort of, the middle map is this attention distribution."},{"from":1207.55,"to":1210.75,"location":2,"content":"You actually- basically, it- it's been unable to pick this diagonal."},{"from":1210.75,"to":1213.37,"location":2,"content":"It should have a very strong diagonal focus."},{"from":1213.37,"to":1215.33,"location":2,"content":"And so what has happened was these residuals"},{"from":1215.33,"to":1218.15,"location":2,"content":"were carrying this position information to every layer."},{"from":1218.15,"to":1220.82,"location":2,"content":"And because these subsequent layers had no notion of position,"},{"from":1220.82,"to":1222.9,"location":2,"content":"they were fi- finding it hard to actually attend."},{"from":1222.9,"to":1225.88,"location":2,"content":"This is the encoder-decoder attention which typically ends up being diagonal."},{"from":1225.88,"to":1227.38,"location":2,"content":"Now, so then we, uh, we said okay."},{"from":1227.38,"to":1230.7,"location":2,"content":"So then we actually continued with- continued to sever the residuals,"},{"from":1230.7,"to":1232.95,"location":2,"content":"but we added position information back in at every layer."},{"from":1232.95,"to":1234.84,"location":2,"content":"We injected position information back in."},{"from":1234.84,"to":1236.39,"location":2,"content":"And we didn't recover the accuracy,"},{"from":1236.39,"to":1237.79,"location":2,"content":"but we did get some of this,"},{"from":1237.79,"to":1239.3,"location":2,"content":"sort of, diagonal focus back in."},{"from":1239.3,"to":1241.4,"location":2,"content":"So the residuals are doing more, but they're certainly,"},{"from":1241.4,"to":1244.16,"location":2,"content":"definitely moving this position information to the model there."},{"from":1244.16,"to":1246.81,"location":2,"content":"They're pumping this position information through the model."},{"from":1246.81,"to":1249.07,"location":2,"content":"Um, okay."},{"from":1249.07,"to":1251.37,"location":2,"content":"So, so that was- that was- so, so now we saw that,"},{"from":1251.37,"to":1252.44,"location":2,"content":"you know, being able to, sort of,"},{"from":1252.44,"to":1253.89,"location":2,"content":"model both long- and short-,"},{"from":1253.89,"to":1256.51,"location":2,"content":"short-term relationships, uh, sh- uh, long and,"},{"from":1256.51,"to":1258.43,"location":2,"content":"long- and short-distance relationships with,"},{"from":1258.43,"to":1261.74,"location":2,"content":"with attention is beneficial for, for text generation."},{"from":1261.74,"to":1263.53,"location":2,"content":"Um, what kind of inductive,"},{"from":1263.53,"to":1266.78,"location":2,"content":"inductive biases lay- actually, uh, appear, or what,"},{"from":1266.78,"to":1270.86,"location":2,"content":"what kind of phenomena appear in images and something that we constantly see- constantly"},{"from":1270.86,"to":1272.74,"location":2,"content":"see in images and music is this notion 
of"},{"from":1272.74,"to":1275.18,"location":2,"content":"repeating structure that's very similar to each other."},{"from":1275.18,"to":1278.06,"location":2,"content":"You have these motifs that repeat at, at different scales."},{"from":1278.06,"to":1281.42,"location":2,"content":"So, for example, there's a b- it's another artificial but beautiful example of"},{"from":1281.42,"to":1284.9,"location":2,"content":"self-similarity where you have this Van Gogh painting where this texture or these,"},{"from":1284.9,"to":1286.41,"location":2,"content":"these little objects just repeat."},{"from":1286.41,"to":1290.28,"location":2,"content":"These images are- these different pieces of the image are very sa- similar to each other,"},{"from":1290.28,"to":1291.6,"location":2,"content":"but they might have different scales."},{"from":1291.6,"to":1292.95,"location":2,"content":"Uh, again in music,"},{"from":1292.95,"to":1294.62,"location":2,"content":"here's a motif that repeats, uh,"},{"from":1294.62,"to":1296.61,"location":2,"content":"that could have- it could have, like,"},{"from":1296.61,"to":1300.12,"location":2,"content":"di- various, like, spans of time between in, in, between it."},{"from":1300.12,"to":1303.25,"location":2,"content":"So, um, so, so this,"},{"from":1303.25,"to":1304.44,"location":2,"content":"so we, we, we,"},{"from":1304.44,"to":1305.78,"location":2,"content":"we attempted after this to see, well,"},{"from":1305.78,"to":1309.72,"location":2,"content":"to ask this question: can self-attention help us in modeling other objects like images?"},{"from":1309.72,"to":1311.71,"location":2,"content":"So the, the path we took was, sort of,"},{"from":1311.71,"to":1317.45,"location":2,"content":"standard auto-regressive image modeling the- or probabilistic image modeling, not GANs."},{"from":1317.45,"to":1318.91,"location":2,"content":"Because it was- well, one, it was very easy."},{"from":1318.91,"to":1320.09,"location":2,"content":"We had a language model almost."},{"from":1320.09,"to":1322.24,"location":2,"content":"So this is just like language modeling on images."},{"from":1322.24,"to":1323.91,"location":2,"content":"Uh, and also training at maximum"},{"from":1323.91,"to":1324.93,"location":2,"content":"likelihood allows you to, sort of,"},{"from":1324.93,"to":1326.72,"location":2,"content":"measure, measure how well you're doing on,"},{"from":1326.72,"to":1328.78,"location":2,"content":"uh, on, on your held-out set."},{"from":1328.78,"to":1330.84,"location":2,"content":"Uh, and it also gives you diversity,"},{"from":1330.84,"to":1332.79,"location":2,"content":"so you hopefully are covering all possible, uh,"},{"from":1332.79,"to":1335.9,"location":2,"content":"different kinds of images you- So, um,"},{"from":1335.9,"to":1337.29,"location":2,"content":"and to this point there's al- we had"},{"from":1337.29,"to":1338.88,"location":2,"content":"an advantage that's also been- there are- there've been"},{"from":1338.88,"to":1342.42,"location":2,"content":"good work on using recurrent models like PixelRNN and PixelCNN,"},{"from":1342.42,"to":1346.33,"location":2,"content":"that, that were actually getting some very good compression rates. 
Um-"},{"from":1346.33,"to":1351.81,"location":2,"content":"And, um, again here,"},{"from":1351.81,"to":1355.61,"location":2,"content":"originally the argument was that, well, you know,"},{"from":1355.61,"to":1357.92,"location":2,"content":"in images because there- because you want symmetry,"},{"from":1357.92,"to":1359.3,"location":2,"content":"because you want like if you have a face,"},{"from":1359.3,"to":1361.58,"location":2,"content":"you want, you want one ear to sort of match with the other."},{"from":1361.58,"to":1363.58,"location":2,"content":"If you had a large receptive field,"},{"from":1363.58,"to":1367.13,"location":2,"content":"which you could potentially get with attention at a lower computational cost,"},{"from":1367.13,"to":1370.82,"location":2,"content":"then it should benefit- then it should be quite beneficial for, for images,"},{"from":1370.82,"to":1373.64,"location":2,"content":"for images and you wouldn't need many layers like you do in"},{"from":1373.64,"to":1377.95,"location":2,"content":"convolutions to actually get dependencies between these far away pixels."},{"from":1377.95,"to":1380.67,"location":2,"content":"So it seem like self-attention would have been a- what, what,"},{"from":1380.67,"to":1383.74,"location":2,"content":"what was already a good computational mechanism, right?"},{"from":1383.74,"to":1386.58,"location":2,"content":"But this sort of- but it was actually interesting to see"},{"from":1386.58,"to":1389.7,"location":2,"content":"how it even modeled- naturally modeled self-similarity,"},{"from":1389.7,"to":1392.46,"location":2,"content":"and people have used self-similarity in image generation like, you know, uh,"},{"from":1392.46,"to":1395.76,"location":2,"content":"there's this really cool work by Efros where they actually see, okay,"},{"from":1395.76,"to":1398.67,"location":2,"content":"in the training set, what are those patches that are really,"},{"from":1398.67,"to":1399.81,"location":2,"content":"that are really similar to me?"},{"from":1399.81,"to":1401.64,"location":2,"content":"And based on the patches that are really similar to me,"},{"from":1401.64,"to":1403.04,"location":2,"content":"I'm going to fill up the information."},{"from":1403.04,"to":1405.49,"location":2,"content":"So it's like actually doing image generation."},{"from":1405.49,"to":1407.43,"location":2,"content":"Uh, there is this really classic work called"},{"from":1407.43,"to":1409.98,"location":2,"content":"non-local means where they do image denoising,"},{"from":1409.98,"to":1411.91,"location":2,"content":"where they want to denoise this sort of,"},{"from":1411.91,"to":1414.45,"location":2,"content":"this patch P. 
And they say,"},{"from":1414.45,"to":1418.29,"location":2,"content":"I'm going to- based on my similarity between all other patches in my image,"},{"from":1418.29,"to":1421.13,"location":2,"content":"I'm going to compute some function of content-based similarity,"},{"from":1421.13,"to":1423.53,"location":2,"content":"and based on the similarity I'm going to pull information."},{"from":1423.53,"to":1426.64,"location":2,"content":"So as- and exploiting this fact that images are very self-similar."},{"from":1426.64,"to":1430.44,"location":2,"content":"And, uh, uh, this has also been sort of,"},{"from":1430.44,"to":1432.39,"location":2,"content":"uh, applied in some recent work."},{"from":1432.39,"to":1435.03,"location":2,"content":"Now if you just took this encoder self-attention mechanism"},{"from":1435.03,"to":1437.17,"location":2,"content":"and just replaced these word embeddings with patches,"},{"from":1437.17,"to":1438.77,"location":2,"content":"and that's kind of exactly what it's doing."},{"from":1438.77,"to":1441.33,"location":2,"content":"It's, it's computing this notion of content-based similarity"},{"from":1441.33,"to":1444.12,"location":2,"content":"between these elements and then based on this content-based similarity,"},{"from":1444.12,"to":1447.51,"location":2,"content":"it constructs a convex combination that essentially brings these things together."},{"from":1447.51,"to":1449.38,"location":2,"content":"So it's, it's a very ni- it was,"},{"from":1449.38,"to":1451.55,"location":2,"content":"it was quite- it was very pleasant to see that,"},{"from":1451.55,"to":1453.97,"location":2,"content":"oh, this is a differentiable way of doing non-local means."},{"from":1453.97,"to":1462.03,"location":2,"content":"And, uh, and we took the transformer architecture and replaced words with pixels."},{"from":1462.03,"to":1466.01,"location":2,"content":"Uh, there was some- there were some architecture adjustments to do."},{"from":1466.01,"to":1468.3,"location":2,"content":"And, uh, so this was but- this was"},{"from":1468.3,"to":1471.09,"location":2,"content":"basically the kind of- it was very similar to the original work,"},{"from":1471.09,"to":1474.36,"location":2,"content":"and here the position representations instead of being, you know,"},{"from":1474.36,"to":1476.76,"location":2,"content":"one-dimensional, they were- because we are not dealing with sequences,"},{"from":1476.76,"to":1478.35,"location":2,"content":"we have two-dimensional position representations."},{"from":1478.35,"to":1480.37,"location":2,"content":"Um, okay."},{"from":1480.37,"to":1482.07,"location":2,"content":"So I pointed out before,"},{"from":1482.07,"to":1485.78,"location":2,"content":"attention has a very com- very favorable computational profile"},{"from":1485.78,"to":1489.27,"location":2,"content":"if your length- if your dimension dominates length,"},{"from":1489.27,"to":1491.25,"location":2,"content":"which if- which is absolutely untrue for,"},{"from":1491.25,"to":1492.54,"location":2,"content":"absolutely untrue for images."},{"from":1492.54,"to":1496.17,"location":2,"content":"Uh, because even for like 32 by- even for 32 by 32 images,"},{"from":1496.17,"to":1499.26,"location":2,"content":"when you flatten them and you- and you flatten them, you have- you get 3,072"},{"from":1499.26,"to":1502.96,"location":2,"content":"positions, uh, so it's your standard CIFAR image."},{"from":1502.96,"to":1506.4,"location":2,"content":"Um, so simple solution, 
uh,"},{"from":1506.4,"to":1509.22,"location":2,"content":"because like convolutions of- I mean,"},{"from":1509.22,"to":1511.14,"location":2,"content":"you get- convolutions are basically looked"},{"from":1511.14,"to":1513.35,"location":2,"content":"at local windows and you get translational equivariance."},{"from":1513.35,"to":1516.66,"location":2,"content":"We said, \"Okay. Let's adopt the same strategy.\""},{"from":1516.66,"to":1519.22,"location":2,"content":"And also there's a lot of spatial locality and images."},{"from":1519.22,"to":1524.27,"location":2,"content":"Uh, but now, we will still have a better computational profile."},{"from":1524.27,"to":1527.34,"location":2,"content":"If your- if your receptive field is still smaller than your dimension,"},{"from":1527.34,"to":1529.48,"location":2,"content":"you can afford- you can actually still do"},{"from":1529.48,"to":1534.6,"location":2,"content":"much more long distance computation than a standard convolution because you're,"},{"from":1534.6,"to":1537.62,"location":2,"content":"uh, because you're quadratic in length."},{"from":1537.62,"to":1540.39,"location":2,"content":"So as long as we didn't increase our length beyond the dimension,"},{"from":1540.39,"to":1542.41,"location":2,"content":"we still had a favorable computational profile."},{"from":1542.41,"to":1544.44,"location":2,"content":"And so the way we did it was, uh,"},{"from":1544.44,"to":1546.19,"location":2,"content":"we essentially had, uh,"},{"from":1546.19,"to":1547.95,"location":2,"content":"two kinds of rasterizations."},{"from":1547.95,"to":1552.21,"location":2,"content":"So we had a one-dimensional rasterization where you had a sort of single query block,"},{"from":1552.21,"to":1554.79,"location":2,"content":"uh, which was, uh,"},{"from":1554.79,"to":1558.66,"location":2,"content":"which was then attending or to the- into a larger memory block,"},{"from":1558.66,"to":1562.68,"location":2,"content":"uh, in this rasterized fashion along the- along, along the rows."},{"from":1562.68,"to":1565.32,"location":2,"content":"Um, then we tried another form of rasterization,"},{"from":1565.32,"to":1567.78,"location":2,"content":"falling standard two-dimensional locality,"},{"from":1567.78,"to":1570.36,"location":2,"content":"where you had- where we actually produced the image in,"},{"from":1570.36,"to":1573.3,"location":2,"content":"uh, in blocks and within each block we had a rasterization scheme."},{"from":1573.3,"to":1578.64,"location":2,"content":"Um, again, these- the image transformer layer was very similar."},{"from":1578.64,"to":1581.28,"location":2,"content":"We had two-dimensional position representations along"},{"from":1581.28,"to":1584.67,"location":2,"content":"with query- with the same- with a very similar attention mechanism."},{"from":1584.67,"to":1587.22,"location":2,"content":"Um, and we tried"},{"from":1587.22,"to":1590.55,"location":2,"content":"both super-resolution and unconditional and conditional image generation."},{"from":1590.55,"to":1594.07,"location":2,"content":"Uh, this is- this is Ne- Niki Parmar,"},{"from":1594.07,"to":1597.26,"location":2,"content":"I and a co- and a few other authors from Brain,"},{"from":1597.26,"to":1599.58,"location":2,"content":"um, and we presented it at ICML."},{"from":1599.58,"to":1604.82,"location":2,"content":"And, uh, we were able to achieve better perplexity than existing models."},{"from":1604.82,"to":1607.65,"location":2,"content":"So PixelSNAIL is actually another model that used- 
mixed"},{"from":1607.65,"to":1610.69,"location":2,"content":"both convolutions and self-attention and they- they outperformed us on,"},{"from":1610.69,"to":1612.67,"location":2,"content":"on, on, on, on, bits per dimension."},{"from":1612.67,"to":1614.31,"location":2,"content":"So we were measuring perplexity because these are"},{"from":1614.31,"to":1616.56,"location":2,"content":"probabilistic- these are probabilistic models."},{"from":1616.56,"to":1618.67,"location":2,"content":"It's like basically a language model of images and,"},{"from":1618.67,"to":1621.12,"location":2,"content":"and it just- and your- and the factorization"},{"from":1621.12,"to":1623.58,"location":2,"content":"of your language model just depends on how you rasterize."},{"from":1623.58,"to":1625.62,"location":2,"content":"In the- in this- in the one-D rasterization,"},{"from":1625.62,"to":1627,"location":2,"content":"we went first rows and then columns."},{"from":1627,"to":1628.18,"location":2,"content":"In the two-D rasterization,"},{"from":1628.18,"to":1631.15,"location":2,"content":"we went blockwise and inside each block we rasterized."},{"from":1631.15,"to":1634.65,"location":2,"content":"On ImageNet, we achieved better perplexities, and,"},{"from":1634.65,"to":1638.78,"location":2,"content":"uh, so yeah, I mean we're at a GAN level, right?"},{"from":1638.78,"to":1643.69,"location":2,"content":"I mean this weird- this is- I think probabilist auto-regressive Image generation,"},{"from":1643.69,"to":1646.72,"location":2,"content":"uh, by this point had not reached GANs."},{"from":1646.72,"to":1651.09,"location":2,"content":"At ICLR 2019, there's a paper by Nal that actually uses self-attention and gets very,"},{"from":1651.09,"to":1652.58,"location":2,"content":"very good quality images."},{"from":1652.58,"to":1654.7,"location":2,"content":"But what we, what we observed was,"},{"from":1654.7,"to":1656.67,"location":2,"content":"we were getting structured objects fairly well."},{"from":1656.67,"to":1659.93,"location":2,"content":"Like can people recognize what the second row is?"},{"from":1659.93,"to":1663.77,"location":2,"content":"Cars. [OVERLAPPING]"},{"from":1663.77,"to":1666.05,"location":2,"content":"I heard- I said- most- almost everyone said cars."},{"from":1666.05,"to":1668.72,"location":2,"content":"I'm not going to ask who said something else, but yes, they're cars."},{"from":1668.72,"to":1673.35,"location":2,"content":"yeah. 
And, uh, so the- and the last row is another vehicles like,"},{"from":1673.35,"to":1678.33,"location":2,"content":"uh, so essentially when structured jo- structured objects were easy to capture."},{"from":1678.33,"to":1681.16,"location":2,"content":"Um, like frogs and sort of,"},{"from":1681.16,"to":1684.15,"location":2,"content":"you know, objects that were camouflaged just turned into this mush."},{"from":1684.15,"to":1687.09,"location":2,"content":"Um, and- but on super resolution,"},{"from":1687.09,"to":1688.65,"location":2,"content":"now super-resolution is interesting because"},{"from":1688.65,"to":1690.38,"location":2,"content":"there's a lot of conditioning information, right?"},{"from":1690.38,"to":1693.53,"location":2,"content":"And, uh, when you have a lot of conditioning information, the,"},{"from":1693.53,"to":1695.52,"location":2,"content":"the sort of possible- you break- you,"},{"from":1695.52,"to":1697.9,"location":2,"content":"you actually lock quite a few of the modes."},{"from":1697.9,"to":1700.02,"location":2,"content":"So there's only a few options you can have at the output."},{"from":1700.02,"to":1702.39,"location":2,"content":"And super- our super resolution results are much better."},{"from":1702.39,"to":1706.77,"location":2,"content":"We were able to get better facial orientation and structure than previous work."},{"from":1706.77,"to":1711.39,"location":2,"content":"And these are samples at different temperatures and, uh, and, uh,"},{"from":1711.39,"to":1714.69,"location":2,"content":"and we wou- when we quantify this with actual human evaluators,"},{"from":1714.69,"to":1716.16,"location":2,"content":"we- like we flash an image and said,"},{"from":1716.16,"to":1717.35,"location":2,"content":"is this real, is this false?"},{"from":1717.35,"to":1718.63,"location":2,"content":"And we were able to, uh,"},{"from":1718.63,"to":1720.75,"location":2,"content":"we were able to fool humans like four"},{"from":1720.75,"to":1723.29,"location":2,"content":"times better than previous results in super resolution."},{"from":1723.29,"to":1726.99,"location":2,"content":"Again, these are not- these results like I, I guess the,"},{"from":1726.99,"to":1730.47,"location":2,"content":"the latest GAN result from Nvidia makes us look like a joke."},{"from":1730.47,"to":1731.71,"location":2,"content":"But, I mean this is,"},{"from":1731.71,"to":1733.05,"location":2,"content":"I mean, we're starting later than GAN."},{"from":1733.05,"to":1734.13,"location":2,"content":"So hopefully we'll catch up."},{"from":1734.13,"to":1737.25,"location":2,"content":"But, but the point here is that this is an interesting inductive bias for images,"},{"from":1737.25,"to":1739.5,"location":2,"content":"so very natural inductive bias for images."},{"from":1739.5,"to":1741.38,"location":2,"content":"Um, and, uh, and,"},{"from":1741.38,"to":1745.65,"location":2,"content":"and there is hope to apply it- for applying in classification and other such tasks also."},{"from":1745.65,"to":1747.45,"location":2,"content":"Um, so one interesting thing,"},{"from":1747.45,"to":1749.64,"location":2,"content":"just to sort of both out of curiosity and"},{"from":1749.64,"to":1752.74,"location":2,"content":"asking how good is maximum or like does maximum likelihood."},{"from":1752.74,"to":1756.18,"location":2,"content":"Well, one, does the model actually capture some interesting structure in the role?"},{"from":1756.18,"to":1757.65,"location":2,"content":"Second, do you get 
diversity?"},{"from":1757.65,"to":1759.54,"location":2,"content":"Well, maximum likelihood should get diversity,"},{"from":1759.54,"to":1761.95,"location":2,"content":"by, by virtue, by virtue of what it does."},{"from":1761.95,"to":1763.86,"location":2,"content":"Uh, so then we just- we did image completion."},{"from":1763.86,"to":1765.87,"location":2,"content":"And why is- why image completion because as soon as you"},{"from":1765.87,"to":1768,"location":2,"content":"lock down half the image to the goal truth,"},{"from":1768,"to":1770.61,"location":2,"content":"you're actually shaving off a lot of the possible modes."},{"from":1770.61,"to":1772.23,"location":2,"content":"So you have a much easier time sampling."},{"from":1772.23,"to":1774.09,"location":2,"content":"So, uh, so the first is,"},{"from":1774.09,"to":1775.93,"location":2,"content":"uh, first is what we supply to the model."},{"from":1775.93,"to":1778.79,"location":2,"content":"The, the, the right row- the right most column is,"},{"from":1778.79,"to":1781.15,"location":2,"content":"is gold, and we were able to generate different samples."},{"from":1781.15,"to":1783.27,"location":2,"content":"But what was really interesting is the third row."},{"from":1783.27,"to":1786.18,"location":2,"content":"Uh, so the rightmost column is- the rightmost column is gold."},{"from":1786.18,"to":1788.65,"location":2,"content":"Uh, now if you look at the third row, this horse."},{"from":1788.65,"to":1792.13,"location":2,"content":"So actually there's this sort of glimpse or a suggestion of a pull,"},{"from":1792.13,"to":1795.05,"location":2,"content":"but the model hallucinated a human in some of these,"},{"from":1795.05,"to":1796.17,"location":2,"content":"in some of these images,"},{"from":1796.17,"to":1798.45,"location":2,"content":"which is interesting like in- it does capture at least"},{"from":1798.45,"to":1802.23,"location":2,"content":"the data teaches it to capture some structure about the world."},{"from":1802.23,"to":1806.19,"location":2,"content":"Um, the dog is just cute and I guess it also shows that, you know,"},{"from":1806.19,"to":1807.48,"location":2,"content":"there was this entire object,"},{"from":1807.48,"to":1810.66,"location":2,"content":"this chair, that the model just completely refused to imagine."},{"from":1810.66,"to":1812.84,"location":2,"content":"So there's a lot of difficulty."},{"from":1812.84,"to":1815.07,"location":2,"content":"And I guess Anna is gonna talk about"},{"from":1815.07,"to":1819.46,"location":2,"content":"[NOISE] the another way to exploit self- self-similarity."},{"from":1819.46,"to":1820.08,"location":2,"content":"Thank you."},{"from":1820.08,"to":1831.6,"location":2,"content":"[APPLAUSE]"},{"from":1831.6,"to":1834.06,"location":2,"content":"So thank you Ashish for the introduction."},{"from":1834.06,"to":1837.11,"location":2,"content":"Uh, so there's a lot of self-similarity in images."},{"from":1837.11,"to":1839.46,"location":2,"content":"There's also a lot of self-similarity in, in music."},{"from":1839.46,"to":1843.18,"location":2,"content":"So we can imagine, transformer being a, a good model for it."},{"from":1843.18,"to":1846.06,"location":2,"content":"Uh, we- we're going to show how,"},{"from":1846.06,"to":1848.1,"location":2,"content":"uh, we can add more to,"},{"from":1848.1,"to":1850.35,"location":2,"content":"to the self attention, to think more about kind of"},{"from":1850.35,"to":1854.22,"location":2,"content":"relational information and how that could help, uh, music 
generation."},{"from":1854.22,"to":1857.16,"location":2,"content":"[NOISE] So, uh, first I want to"},{"from":1857.16,"to":1861.22,"location":2,"content":"clarify what is the raw representation that we're working with right now."},{"from":1861.22,"to":1863.28,"location":2,"content":"So analogous to language,"},{"from":1863.28,"to":1867.06,"location":2,"content":"you can think about there's text and somebody is reading out a text,"},{"from":1867.06,"to":1869.43,"location":2,"content":"so they add their kind of own intonations to it,"},{"from":1869.43,"to":1872.38,"location":2,"content":"and then you have sound waves coming out of that speech."},{"from":1872.38,"to":1876.1,"location":2,"content":"So for music there's a va- very similar kind of, uh,"},{"from":1876.1,"to":1881.27,"location":2,"content":"line of a generation where you say the composer has an idea,"},{"from":1881.27,"to":1883.36,"location":2,"content":"uh, writes down the score and then,"},{"from":1883.36,"to":1885.58,"location":2,"content":"a performer performs it and then you get sound."},{"from":1885.58,"to":1889.54,"location":2,"content":"So what we're going to focus on today is mostly, uh,"},{"from":1889.54,"to":1891.41,"location":2,"content":"you can think of the score but it's actually,"},{"from":1891.41,"to":1894.08,"location":2,"content":"er, a performance, um,"},{"from":1894.08,"to":1901.55,"location":2,"content":"in that it's a symbolic representation where MIDI pianos were used and,"},{"from":1901.55,"to":1904.07,"location":2,"content":"uh, um, professional amateur, uh,"},{"from":1904.07,"to":1906.64,"location":2,"content":"musicians were performing on the pianos."},{"from":1906.64,"to":1907.89,"location":2,"content":"So we have the recorded,"},{"from":1907.89,"to":1909.66,"location":2,"content":"uh, information of their playing."},{"from":1909.66,"to":1911.3,"location":2,"content":"So in particular, um,"},{"from":1911.3,"to":1915.81,"location":2,"content":"at each time se- step modeling music as this sequential, uh,"},{"from":1915.81,"to":1918.72,"location":2,"content":"process, what is being output are, okay,"},{"from":1918.72,"to":1920.14,"location":2,"content":"turn this note on, ah,"},{"from":1920.14,"to":1921.96,"location":2,"content":"advance the clock by this much,"},{"from":1921.96,"to":1923.22,"location":2,"content":"and then turn this note off."},{"from":1923.22,"to":1925.96,"location":2,"content":"And also there is, uh, dynamics information,"},{"from":1925.96,"to":1927.66,"location":2,"content":"so when you turn the note on, you first say like,"},{"from":1927.66,"to":1929.98,"location":2,"content":"how loud it's going to be."},{"from":1929.98,"to":1933.09,"location":2,"content":"Uh, so traditionally, uh, modeling, uh,"},{"from":1933.09,"to":1935.09,"location":2,"content":"music as kind of a language,"},{"from":1935.09,"to":1938.13,"location":2,"content":"we've been using, uh, recurrent neural networks."},{"from":1938.13,"to":1943.35,"location":2,"content":"And, um, because as Ashish introduced and, and talked about,"},{"from":1943.35,"to":1945.51,"location":2,"content":"there is a lot of compression that needs to happen,"},{"from":1945.51,"to":1949.83,"location":2,"content":"like a long sequence has to be embedded into like a fixed length vector."},{"from":1949.83,"to":1952.2,"location":2,"content":"And that becomes hard when, uh,"},{"from":1952.2,"to":1955.2,"location":2,"content":"in music you have- you have repetition coming,"},{"from":1955.2,"to":1957.15,"location":2,"content":"um, at a 
distance."},{"from":1957.15,"to":1959.49,"location":2,"content":"So, uh, I'm first going to show you,"},{"from":1959.49,"to":1963.27,"location":2,"content":"um, samples from, from the RNNs,"},{"from":1963.27,"to":1966.45,"location":2,"content":"from a transformer and then from a music transformer that has"},{"from":1966.45,"to":1968.58,"location":2,"content":"the relative attention and kind of let you hear"},{"from":1968.58,"to":1971.97,"location":2,"content":"the differences and then I'll go into how we,"},{"from":1971.97,"to":1974.58,"location":2,"content":"uh, what are, what are the, uh,"},{"from":1974.58,"to":1978.66,"location":2,"content":"modifications we needed to do on top of the, uh, transformer model."},{"from":1978.66,"to":1980.74,"location":2,"content":"Uh, so here, uh,"},{"from":1980.74,"to":1983.3,"location":2,"content":"this task is kind of the image completion task."},{"from":1983.3,"to":1988.34,"location":2,"content":"So we give it an initial motif and then we ask the model to do continuations."},{"from":1988.34,"to":1990.66,"location":2,"content":"So this is the motif that we fed."},{"from":1990.66,"to":1996.22,"location":2,"content":"[MUSIC] How many people recognize that?"},{"from":1996.22,"to":1999.09,"location":2,"content":"Awesome. Okay. [LAUGHTER] Yeah,"},{"from":1999.09,"to":2000.38,"location":2,"content":"so this is a, uh,"},{"from":2000.38,"to":2002.9,"location":2,"content":"kind of a fragment from a Chopin Etude piece."},{"from":2002.9,"to":2004.91,"location":2,"content":"And we're going to ask, uh,"},{"from":2004.91,"to":2006.68,"location":2,"content":"the RNN to do a continuation."},{"from":2006.68,"to":2014.99,"location":2,"content":"[NOISE]"},{"from":2014.99,"to":2028.33,"location":2,"content":"[MUSIC]"},{"from":2028.33,"to":2030.95,"location":2,"content":"So in here, like in the beginning, it was trying to repeat it."},{"from":2030.95,"to":2032.33,"location":2,"content":"But very fast, it, er,"},{"from":2032.33,"to":2035.87,"location":2,"content":"wandered off into, its other different ideas."},{"from":2035.87,"to":2038.12,"location":2,"content":"So that's one challenge because it's, uh,"},{"from":2038.12,"to":2041.7,"location":2,"content":"not able to directly look back to what happened in the past, uh, and,"},{"from":2041.7,"to":2044.06,"location":2,"content":"and can just look at kind of a blu- blurry version,"},{"from":2044.06,"to":2046.4,"location":2,"content":"and that blurry version becomes more and more blurry."},{"from":2046.4,"to":2048.45,"location":2,"content":"Uh, so this is what the transformer does."},{"from":2048.45,"to":2050.99,"location":2,"content":"Uh, so so, uh, a detail is, uh,"},{"from":2050.99,"to":2054.45,"location":2,"content":"these models are trained on half the length that you're hearing."},{"from":2054.45,"to":2058.76,"location":2,"content":"So we're kinda asking the model to generalize beyond the length that it's trained on."},{"from":2058.76,"to":2060.17,"location":2,"content":"And you can see for this transformer,"},{"from":2060.17,"to":2062.28,"location":2,"content":"it, it deteriorates beyond that."},{"from":2062.28,"to":2065.15,"location":2,"content":"But it can hold the motif pretty consistent."},{"from":2065.15,"to":2074.69,"location":2,"content":"[MUSIC] Okay. 
You, you,"},{"from":2074.69,"to":2075.77,"location":2,"content":"you ge- you get the idea."},{"from":2075.77,"to":2080.69,"location":2,"content":"[LAUGHTER] So initially, it was able to do this repetition really well."},{"from":2080.69,"to":2082.4,"location":2,"content":"Uh, so it was able to copy it very well."},{"from":2082.4,"to":2084.17,"location":2,"content":"But beyond the length that was trained on,"},{"from":2084.17,"to":2087.44,"location":2,"content":"it kinda didn't know how to cope with, like longer contexts."},{"from":2087.44,"to":2088.88,"location":2,"content":"And, uh, what you see,"},{"from":2088.88,"to":2091.32,"location":2,"content":"uh, the, the last one is from the music transformer."},{"from":2091.32,"to":2093.35,"location":2,"content":"I think so that kind of [NOISE] the relational information."},{"from":2093.35,"to":2096.47,"location":2,"content":"And you can just see visually how it's very consistent and kinda"},{"from":2096.47,"to":2099.94,"location":2,"content":"repeating these [NOISE] these larger, uh, arcs."},{"from":2099.94,"to":2121.19,"location":2,"content":"[MUSIC]"},{"from":2121.19,"to":2123.82,"location":2,"content":"Yeah. So that was, uh, music transformer."},{"from":2123.82,"to":2127.07,"location":2,"content":"And so in music,"},{"from":2127.07,"to":2130.41,"location":2,"content":"the, the self similarity that we talked about, uh,"},{"from":2130.41,"to":2131.76,"location":2,"content":"so we see, uh,"},{"from":2131.76,"to":2132.95,"location":2,"content":"the motif here, and so,"},{"from":2132.95,"to":2135.01,"location":2,"content":"so there we primed the model with a motif,"},{"from":2135.01,"to":2136.45,"location":2,"content":"and this is actually a sample,"},{"from":2136.45,"to":2137.87,"location":2,"content":"unconditioned sample from the model."},{"from":2137.87,"to":2140.69,"location":2,"content":"So nothing, er, there was no priming that the, uh,"},{"from":2140.69,"to":2142.88,"location":2,"content":"model kinda had to create its own motif and then,"},{"from":2142.88,"to":2145.11,"location":2,"content":"uh, do, uh, continuations from there."},{"from":2145.11,"to":2149.21,"location":2,"content":"And here, uh, if we kinda look at it and analyze it a bit, you see,"},{"from":2149.21,"to":2151.8,"location":2,"content":"uh, a lot of repetition,"},{"from":2151.8,"to":2154.04,"location":2,"content":"uh, with gaps in between."},{"from":2154.04,"to":2156.64,"location":2,"content":"And if you look at the self attention structure,"},{"from":2156.64,"to":2158.87,"location":2,"content":"we actually do see the model,"},{"from":2158.87,"to":2160.63,"location":2,"content":"uh, looking at the relevant parts."},{"from":2160.63,"to":2164.07,"location":2,"content":"Even if, if it was not immediately, uh, preceding it."},{"from":2164.07,"to":2165.5,"location":2,"content":"So, so here, uh,"},{"from":2165.5,"to":2169.97,"location":2,"content":"what I colored shaded out is where the motif, um, occurs."},{"from":2169.97,"to":2171.83,"location":2,"content":"Uh, and you can, uh, see the different colors,"},{"from":2171.83,"to":2174.71,"location":2,"content":"there's a different attention heads and they're kinda focusing,"},{"from":2174.71,"to":2176.81,"location":2,"content":"uh, among those, uh, grayed out sections."},{"from":2176.81,"to":2179.75,"location":2,"content":"[NOISE] So I'll play the sample and we also have"},{"from":2179.75,"to":2183.7,"location":2,"content":"a visualization that kind of shows you as the music is pa- 
uh,"},{"from":2183.7,"to":2188.93,"location":2,"content":"is being played or what notes it was attending to as it was predicting that note."},{"from":2188.93,"to":2191.15,"location":2,"content":"And, uh, this was generated from scratch."},{"from":2191.15,"to":2193.88,"location":2,"content":"And, uh, so the self attention is, um,"},{"from":2193.88,"to":2197.27,"location":2,"content":"from, from kind of note to note level or event to event level."},{"from":2197.27,"to":2199.32,"location":2,"content":"So it's, it's quite low level."},{"from":2199.32,"to":2200.97,"location":2,"content":"Uh, so when you look at it, it's,"},{"from":2200.97,"to":2202.66,"location":2,"content":"it's ki- a little bit overwhelming."},{"from":2202.66,"to":2204.35,"location":2,"content":"It has like multiple heads and,"},{"from":2204.35,"to":2205.93,"location":2,"content":"er, a lot of things moving."},{"from":2205.93,"to":2207.95,"location":2,"content":"Uh, but there's kind of these structural moments"},{"from":2207.95,"to":2210.28,"location":2,"content":"where you would kind of see more of this, uh,"},{"from":2210.28,"to":2212.8,"location":2,"content":"clean, uh, kind of,"},{"from":2212.8,"to":2215.27,"location":2,"content":"uh, sections where it's attending to."},{"from":2215.27,"to":2272.39,"location":2,"content":"[MUSIC]"},{"from":2272.39,"to":2273.71,"location":2,"content":"VOkay. So, um,"},{"from":2273.71,"to":2275.69,"location":2,"content":"how, how did we do that?"},{"from":2275.69,"to":2279.44,"location":2,"content":"And so starting from kind of the the regular attention mechanism,"},{"from":2279.44,"to":2282.7,"location":2,"content":"we know it's, uh, a weighted average of the past history."},{"from":2282.7,"to":2284.69,"location":2,"content":"Uh, and the nice thing is, uh,"},{"from":2284.69,"to":2287.16,"location":2,"content":"however far it is, we have direct access to it."},{"from":2287.16,"to":2288.84,"location":2,"content":"So if we know, uh,"},{"from":2288.84,"to":2290.87,"location":2,"content":"there are kind of motifs that occurred,"},{"from":2290.87,"to":2293,"location":2,"content":"uh, in in early on in the piece,"},{"from":2293,"to":2295.39,"location":2,"content":"we're still able to based on, uh,"},{"from":2295.39,"to":2297.08,"location":2,"content":"the fact that things that are similar,"},{"from":2297.08,"to":2299.24,"location":2,"content":"uh, to be able to retrieve those."},{"from":2299.24,"to":2302.91,"location":2,"content":"Um, but, uh, it also becomes,"},{"from":2302.91,"to":2305.03,"location":2,"content":"all the past becomes kind of a bag of words,"},{"from":2305.03,"to":2307.31,"location":2,"content":"like there is no structure of which came,"},{"from":2307.31,"to":2308.57,"location":2,"content":"uh, before or after."},{"from":2308.57,"to":2311.2,"location":2,"content":"So there's the positional sinusoids that Ashish talked about."},{"from":2311.2,"to":2313.59,"location":2,"content":"That, uh, basically in this, uh,"},{"from":2313.59,"to":2318.39,"location":2,"content":"indices indexes into a sinusoids that are moving at different speeds."},{"from":2318.39,"to":2320.64,"location":2,"content":"And so close-by positions would have, uh,"},{"from":2320.64,"to":2322.16,"location":2,"content":"a very similar kind of, uh,"},{"from":2322.16,"to":2326.32,"location":2,"content":"cross section into those multiple sinusoids."},{"from":2326.32,"to":2328.8,"location":2,"content":"Uh, in contrast for, er,"},{"from":2328.8,"to":2330.92,"location":2,"content":"for convolutions, you kinda have this, 
uh,"},{"from":2330.92,"to":2334.94,"location":2,"content":"fixed filter that's moving around that captures the relative distance."},{"from":2334.94,"to":2336.88,"location":2,"content":"Like 1B4, 2B4."},{"from":2336.88,"to":2339.18,"location":2,"content":"And these are kind of, uh,"},{"from":2339.18,"to":2342.93,"location":2,"content":"in some ways like a rigid structure that allows you to be, uh,"},{"from":2342.93,"to":2344.93,"location":2,"content":"a kind of, uh, bring in the,"},{"from":2344.93,"to":2347.44,"location":2,"content":"the distance information very explicitly."},{"from":2347.44,"to":2350.77,"location":2,"content":"Um, you can imagine relative attention, um,"},{"from":2350.77,"to":2353.08,"location":2,"content":"with the multiple heads, uh, at play,"},{"from":2353.08,"to":2355.39,"location":2,"content":"uh, to be some combination of these."},{"from":2355.39,"to":2357.17,"location":2,"content":"So, uh, on one hand,"},{"from":2357.17,"to":2358.58,"location":2,"content":"you can access, uh,"},{"from":2358.58,"to":2360.49,"location":2,"content":"the the history very directly."},{"from":2360.49,"to":2362.51,"location":2,"content":"On the other hand, you also know, er,"},{"from":2362.51,"to":2365.21,"location":2,"content":"how you rel- relate to this history."},{"from":2365.21,"to":2366.86,"location":2,"content":"Uh, capturing for example,"},{"from":2366.86,"to":2369.57,"location":2,"content":"like translational invariance and, er,"},{"from":2369.57,"to":2372.44,"location":2,"content":"and we, uh, and for example,"},{"from":2372.44,"to":2375.45,"location":2,"content":"we think one of the reasons why in the beginning, uh,"},{"from":2375.45,"to":2378.83,"location":2,"content":"priming samples that you heard that the, uh,"},{"from":2378.83,"to":2380.95,"location":2,"content":"music transformer was able to generate"},{"from":2380.95,"to":2383.74,"location":2,"content":"beyond the length that it was trained on at a very coherent way,"},{"from":2383.74,"to":2387.83,"location":2,"content":"is that it's able to kind of rely on this translational invariance to to carry,"},{"from":2387.83,"to":2390.78,"location":2,"content":"uh, the relational information forward."},{"from":2390.78,"to":2395,"location":2,"content":"So, if we take a closer look at how how how the,"},{"from":2395,"to":2396.55,"location":2,"content":"how this works is, uh,"},{"from":2396.55,"to":2398.54,"location":2,"content":"the regular transformer you have,"},{"from":2398.54,"to":2400.25,"location":2,"content":"you compare all the queries and keys,"},{"from":2400.25,"to":2402.26,"location":2,"content":"so you get kind of this, uh, square matrix."},{"from":2402.26,"to":2404.39,"location":2,"content":"You can think of it as like a self similarity,"},{"from":2404.39,"to":2406.01,"location":2,"content":"uh, matrix, so it's, uh, a square."},{"from":2406.01,"to":2408.89,"location":2,"content":"Uh, what relative attention does is,"},{"from":2408.89,"to":2412.36,"location":2,"content":"to add an additional term that thinks, uh,"},{"from":2412.36,"to":2414.53,"location":2,"content":"that thinks about whenever you're comparing two things,"},{"from":2414.53,"to":2416.21,"location":2,"content":"how far are you apart?"},{"from":2416.21,"to":2418.82,"location":2,"content":"And also based on the content, do I,"},{"from":2418.82,"to":2421.34,"location":2,"content":"do I care about things that are two steps away or"},{"from":2421.34,"to":2424.18,"location":2,"content":"three steps away or I maybe care about things that are 
recurring,"},{"from":2424.18,"to":2426.28,"location":2,"content":"at kind of a periodical distance."},{"from":2426.28,"to":2429.3,"location":2,"content":"And, uh, with that information gathered,"},{"from":2429.3,"to":2433.84,"location":2,"content":"that influences, uh, the the similarity between positions."},{"from":2433.84,"to":2435.82,"location":2,"content":"And in particular, uh,"},{"from":2435.82,"to":2439.46,"location":2,"content":"this extra term is based on, um, the distance."},{"from":2439.46,"to":2440.51,"location":2,"content":"So you wanna, uh,"},{"from":2440.51,"to":2441.95,"location":2,"content":"gather the embeddings, uh,"},{"from":2441.95,"to":2444.5,"location":2,"content":"that's irrelevant to the, uh,"},{"from":2444.5,"to":2446.18,"location":2,"content":"the query key distances,"},{"from":2446.18,"to":2449.28,"location":2,"content":"uh, on the [NOISE] on the logits."},{"from":2449.28,"to":2451.72,"location":2,"content":"So, in translation, this,"},{"from":2451.72,"to":2453.23,"location":2,"content":"uh, has shown, uh,"},{"from":2453.23,"to":2455.01,"location":2,"content":"a lot of improvement in,"},{"from":2455.01,"to":2457.73,"location":2,"content":"um, for example English to to German translation."},{"from":2457.73,"to":2460.01,"location":2,"content":"Uh, but in translation,"},{"from":2460.01,"to":2461.76,"location":2,"content":"the sequences are usually quite short."},{"from":2461.76,"to":2463.41,"location":2,"content":"It's only a sentence to sentence."},{"from":2463.41,"to":2465.11,"location":2,"content":"Uh, a translation for example,"},{"from":2465.11,"to":2467.21,"location":2,"content":"maybe 50 words or 100 words."},{"from":2467.21,"to":2472.01,"location":2,"content":"But the music, er, samples that you've heard are in the range of 2,000 time-steps."},{"from":2472.01,"to":2476.01,"location":2,"content":"So it's like 2,000 tokens need to be able to fit in memory."},{"from":2476.01,"to":2477.5,"location":2,"content":"So this was a problem, uh,"},{"from":2477.5,"to":2483.35,"location":2,"content":"because the original formulation relied on building this 3D tensor that's,"},{"from":2483.35,"to":2485.8,"location":2,"content":"uh, that's very large in memory."},{"from":2485.8,"to":2487.72,"location":2,"content":"Um, and and why this is the case?"},{"from":2487.72,"to":2490.05,"location":2,"content":"It's because for every pair,"},{"from":2490.05,"to":2492.71,"location":2,"content":"uh, you look up what the,"},{"from":2492.71,"to":2495.2,"location":2,"content":"what the re- so you can compute what the relative distance is,"},{"from":2495.2,"to":2498.32,"location":2,"content":"and then you look up an embedding that corresponds to that distance."},{"from":2498.32,"to":2503.54,"location":2,"content":"So, um, for like this there's a length by length, like L by L, uh, matrix."},{"from":2503.54,"to":2504.82,"location":2,"content":"You need like, uh,"},{"from":2504.82,"to":2507.64,"location":2,"content":"to collect embeddings for each of the positions and that's, uh,"},{"from":2507.64,"to":2511.07,"location":2,"content":"depth D. 
So that gives us the 3D."},{"from":2511.07,"to":2512.9,"location":2,"content":"What we realized is,"},{"from":2512.9,"to":2518.48,"location":2,"content":"you can actually just directly multiply the queries and the embedding distances."},{"from":2518.48,"to":2520.67,"location":2,"content":"[NOISE] And they, uh,"},{"from":2520.67,"to":2522.08,"location":2,"content":"come out kind of in a different order,"},{"from":2522.08,"to":2524.63,"location":2,"content":"because now you have the queries ordered by a relative distance,"},{"from":2524.63,"to":2527.93,"location":2,"content":"but you need the queries ordered by keys, uh,"},{"from":2527.93,"to":2531.44,"location":2,"content":"which is kind of a absolute by absolute, uh, configuration."},{"from":2531.44,"to":2533.36,"location":2,"content":"So what we could do is just, uh,"},{"from":2533.36,"to":2536.7,"location":2,"content":"do a series of skewing, uh,"},{"from":2536.7,"to":2540.51,"location":2,"content":"to to put it into the right, uh, configuration."},{"from":2540.51,"to":2543.88,"location":2,"content":"And this is, uh, yeah."},{"from":2543.88,"to":2545.57,"location":2,"content":"Just a, just a quick contrast to,"},{"from":2545.57,"to":2548.48,"location":2,"content":"to show, um, the difference in memory requirements."},{"from":2548.48,"to":2551.7,"location":2,"content":"So, er, a lot of the times the challenge is in, uh,"},{"from":2551.7,"to":2553.82,"location":2,"content":"being able to scale, uh, you know,"},{"from":2553.82,"to":2557.66,"location":2,"content":"being able to be more memory efficient so that [NOISE] you can model longer sequences."},{"from":2557.66,"to":2560.42,"location":2,"content":"So with that, uh, this is,"},{"from":2560.42,"to":2562.85,"location":2,"content":"um, I can play you one more example if we have time."},{"from":2562.85,"to":2565.13,"location":2,"content":"But if we don't have time, we can, go ahead."},{"from":2565.13,"to":2566.18,"location":2,"content":"We'll see more of that."},{"from":2566.18,"to":2567.98,"location":2,"content":"Okay. [LAUGHTER] So this is,"},{"from":2567.98,"to":2569.93,"location":2,"content":"this is, uh, maybe a one, uh,"},{"from":2569.93,"to":2574.48,"location":2,"content":"about a one-minute sample and I- I hope you like it."},{"from":2574.48,"to":2646.39,"location":2,"content":"Thanks. [MUSIC]"},{"from":2646.39,"to":2647.72,"location":2,"content":"Thank you for listening."},{"from":2647.72,"to":2658.61,"location":2,"content":"[APPLAUSE]."},{"from":2658.61,"to":2663.83,"location":2,"content":"[LAUGHTER] Thanks, Anna. 
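And here is a hedged sketch of the skewing idea described above, assuming a causal decoder setting: multiply the queries directly against the (L, D) relative embeddings to get an (L, L) matrix ordered by relative distance, then pad, reshape, and slice it so it lines up with absolute key positions, so the (L, L, D) tensor is never built. Function names are hypothetical; entries above the diagonal come out scrambled but are removed by the causal mask.

```python
import numpy as np

def skew(rel_scores):
    """Re-index an (L, L) matrix from (query, relative distance) to (query, key).

    Column L-1 of the input holds distance 0, column 0 holds distance -(L-1).
    Pad one column on the left, reshape to (L+1, L), drop the first row.
    """
    L = rel_scores.shape[0]
    padded = np.pad(rel_scores, ((0, 0), (1, 0)))   # (L, L+1), zero column on the left
    reshaped = padded.reshape(L + 1, L)             # (L+1, L)
    return reshaped[1:, :]                          # (L, L), aligned with key positions

def relative_logits_efficient(Q, K, rel_emb):
    """Relative attention logits without materializing the (L, L, D) tensor.

    Q, K:     (L, D) queries and keys.
    rel_emb:  (L, D) embeddings for distances -(L-1), ..., 0.
    """
    D = Q.shape[-1]
    return (Q @ K.T + skew(Q @ rel_emb.T)) / np.sqrt(D)
```

On the masked (key position not after query position) entries this agrees with the naive per-pair lookup, while the extra memory drops from order L * L * D to order L * D for the relative embeddings.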
Um, um, great."},{"from":2663.83,"to":2666.97,"location":2,"content":"Um, so to sort to, um,"},{"from":2666.97,"to":2671.61,"location":2,"content":"so relative attention has been a powerful mechanism for,"},{"from":2671.61,"to":2675.18,"location":2,"content":"um, a very powerful mechanism for music."},{"from":2675.18,"to":2677.28,"location":2,"content":"It's also helped in machine translation."},{"from":2677.28,"to":2679.36,"location":2,"content":"Um, one really interesting, uh,"},{"from":2679.36,"to":2681.55,"location":2,"content":"consequences of, uh, of, um,"},{"from":2681.55,"to":2684.39,"location":2,"content":"one really interesting consequence of relative attention in,"},{"from":2684.39,"to":2686.03,"location":2,"content":"uh, images, is that,"},{"from":2686.03,"to":2688.37,"location":2,"content":"um, like convolutions achieve,"},{"from":2688.37,"to":2690.74,"location":2,"content":"uh, convolutions achieve translational equivariance."},{"from":2690.74,"to":2691.97,"location":2,"content":"So if you have,"},{"from":2691.97,"to":2694.64,"location":2,"content":"let's say, you wa- uh, you have this,"},{"from":2694.64,"to":2698.34,"location":2,"content":"this red dot or this feature that you're computing at this red dot,"},{"from":2698.34,"to":2701.47,"location":2,"content":"it doesn't depend on where the image of the dog is in the image,"},{"from":2701.47,"to":2704.72,"location":2,"content":"is in the the larger image. It just doesn't depend on its absolute location."},{"from":2704.72,"to":2707,"location":2,"content":"It's going to, it's going to produce the same activation."},{"from":2707,"to":2710.91,"location":2,"content":"So you have- convolutions have this nice, uh, translation equivariance."},{"from":2710.91,"to":2713.14,"location":2,"content":"Now, with, with relative,"},{"from":2713.14,"to":2715.22,"location":2,"content":"uh, positions or relative attention,"},{"from":2715.22,"to":2718.55,"location":2,"content":"you get exactly the same effect because you don't have any- once you just"},{"from":2718.55,"to":2722.49,"location":2,"content":"remove this notion of absolute position that you are injecting [NOISE] into the model,"},{"from":2722.49,"to":2724.28,"location":2,"content":"uh, once you've, once you've removed that,"},{"from":2724.28,"to":2726.47,"location":2,"content":"then your attention computation,"},{"from":2726.47,"to":2728.84,"location":2,"content":"because it actually includes I mean, we've,"},{"from":2728.84,"to":2732.22,"location":2,"content":"we've- Niki and I couple of others have actually,"},{"from":2732.22,"to":2734.87,"location":2,"content":"and Anna were actually working on images and seems-"},{"from":2734.87,"to":2737.48,"location":2,"content":"and it seems to actually show, uh, better results."},{"from":2737.48,"to":2742.04,"location":2,"content":"Um, this actio- this now satisfies this,"},{"from":2742.04,"to":2744.44,"location":2,"content":"uh, uh, the- I mean, it,"},{"from":2744.44,"to":2747.47,"location":2,"content":"it can achieve translation equivariance which is a great property for images."},{"from":2747.47,"to":2749.3,"location":2,"content":"So there's a lot of- it seems like this might be"},{"from":2749.3,"to":2751.25,"location":2,"content":"an interesting direction to pursue if you want to push,"},{"from":2751.25,"to":2755.09,"location":2,"content":"uh, Self-Attention in images for a self-supervised learning."},{"from":2755.09,"to":2759.78,"location":2,"content":"Um, I guess on, on self-supervised learning so the geni- generative modeling work 
that,"},{"from":2759.78,"to":2761.45,"location":2,"content":"that I talked about before in,"},{"from":2761.45,"to":2765.32,"location":2,"content":"in itself just having probabilistic models of images is, I mean,"},{"from":2765.32,"to":2766.89,"location":2,"content":"I guess the best model of an image is I,"},{"from":2766.89,"to":2769.58,"location":2,"content":"I go to Google search and I pick up an image and I just give it to you,"},{"from":2769.58,"to":2772.32,"location":2,"content":"but I guess generative models of images are useful because,"},{"from":2772.32,"to":2774.47,"location":2,"content":"if you want to do something like semis-, uh, uh,"},{"from":2774.47,"to":2776.81,"location":2,"content":"self supervised learning where you just pre-train a model on"},{"from":2776.81,"to":2779.38,"location":2,"content":"a lot of- on a lot of unlabeled data then you transfer it."},{"from":2779.38,"to":2782.76,"location":2,"content":"So hopefully, this is gonna help and this is gonna be a part of that machinery."},{"from":2782.76,"to":2786.89,"location":2,"content":"Um, another interesting, uh,"},{"from":2786.89,"to":2790.52,"location":2,"content":"another indus-interesting structure that relative attention allows you to model,"},{"from":2790.52,"to":2791.96,"location":2,"content":"is, uh, is, is kind of a graph."},{"from":2791.96,"to":2793.52,"location":2,"content":"So imagine you have this, uh,"},{"from":2793.52,"to":2796.26,"location":2,"content":"you have this similarity graph where these red edges are,"},{"from":2796.26,"to":2797.6,"location":2,"content":"are this notion of companies,"},{"from":2797.6,"to":2800.18,"location":2,"content":"and the blue edge is a notion of a fruit, uh,"},{"from":2800.18,"to":2804.5,"location":2,"content":"and um, an apple takes these two forms."},{"from":2804.5,"to":2807.14,"location":2,"content":"And, uh, and you could just imagine"},{"from":2807.14,"to":2810.65,"location":2,"content":"relative attention just modeling this- just being able to model,"},{"from":2810.65,"to":2812.28,"location":2,"content":"or being able to- you, you,"},{"from":2812.28,"to":2816.17,"location":2,"content":"yourself being able to impose these different notions of similarity uh,"},{"from":2816.17,"to":2818.38,"location":2,"content":"between, uh, between, uh, different elements."},{"from":2818.38,"to":2820.72,"location":2,"content":"Uh, so if you have like, if you have graph problems, um,"},{"from":2820.72,"to":2823.93,"location":2,"content":"then relative self-attention might be a good fit for you."},{"from":2823.93,"to":2828.53,"location":2,"content":"Um, there's also, there's also a simi- quite a position paper by Battaglia et al from"},{"from":2828.53,"to":2833.93,"location":2,"content":"Deep Mind that talks about relative attention and how it can be used, um, within graphs."},{"from":2833.93,"to":2835.58,"location":2,"content":"So while we're on graphs,"},{"from":2835.58,"to":2838.68,"location":2,"content":"I just wanted to- perhaps might be interesting to connect,"},{"from":2838.68,"to":2841.49,"location":2,"content":"um, uh, of- some, uh,"},{"from":2841.49,"to":2842.81,"location":2,"content":"excellent work that was done on, uh,"},{"from":2842.81,"to":2845.03,"location":2,"content":"on graphs called Message Passing Neural Networks."},{"from":2845.03,"to":2847.26,"location":2,"content":"And it's quite funny, so if you look at,"},{"from":2847.26,"to":2850.73,"location":2,"content":"if you look at the message passing function, 
um,"},{"from":2850.73,"to":2854.48,"location":2,"content":"what it's saying is you're actually just passing messages between pairs of nodes."},{"from":2854.48,"to":2857.09,"location":2,"content":"So you can just think of self attention as imposing a fully connect- it's"},{"from":2857.09,"to":2859.97,"location":2,"content":"like a bipe- a full, a complete bipartite graph,"},{"from":2859.97,"to":2862.25,"location":2,"content":"and, uh, you're, you're passing messages between,"},{"from":2862.25,"to":2863.75,"location":2,"content":"you're passing messages between nodes."},{"from":2863.75,"to":2866.54,"location":2,"content":"Now message passing, message passing neural networks did exactly that."},{"from":2866.54,"to":2869.42,"location":2,"content":"They were passing messages between nodes as well. And how are they different?"},{"from":2869.42,"to":2871.58,"location":2,"content":"Well, the only way that when- well, mathematically,"},{"from":2871.58,"to":2873.97,"location":2,"content":"they were only different in that message passing was,"},{"from":2873.97,"to":2877.37,"location":2,"content":"was, uh, forcing the messages to be between pairs of nodes,"},{"from":2877.37,"to":2880.79,"location":2,"content":"but just because of the Softmax function where you get interaction between all the nodes,"},{"from":2880.79,"to":2883.18,"location":2,"content":"self attention is like a message passing mechanism,"},{"from":2883.18,"to":2885.47,"location":2,"content":"where the interactions are between all, all nodes."},{"from":2885.47,"to":2887.32,"location":2,"content":"So, uh, they're, they're like,"},{"from":2887.32,"to":2888.8,"location":2,"content":"they're not too far mathematically,"},{"from":2888.8,"to":2891.32,"location":2,"content":"and also the me- the Message Passing Paper introduces"},{"from":2891.32,"to":2894.62,"location":2,"content":"an interesting concept called Multiple Towers that are similar to multi-head attention,"},{"from":2894.62,"to":2896.59,"location":2,"content":"uh, that, that Norman invented."},{"from":2896.59,"to":2901.16,"location":2,"content":"And, uh, it's like you run k copies of these message passing neural networks in parallel."},{"from":2901.16,"to":2903.59,"location":2,"content":"So there's a lot of similarity between existing, you know,"},{"from":2903.59,"to":2907.8,"location":2,"content":"this connects to work that existed before but these connections sort of came in later."},{"from":2907.8,"to":2911.93,"location":2,"content":"Um, we have a graph library where we kind of connected these both,"},{"from":2911.93,"to":2914.15,"location":2,"content":"both these strands message passing and, uh, we,"},{"from":2914.15,"to":2917.49,"location":2,"content":"uh, we put it out in tensor2tensor."},{"from":2917.49,"to":2920.7,"location":2,"content":"Um, so to sort of summarize, um,"},{"from":2920.7,"to":2923.51,"location":2,"content":"the properties that Self-Attention has been able to help"},{"from":2923.51,"to":2926.24,"location":2,"content":"us model is this constant path length between any two,"},{"from":2926.24,"to":2927.87,"location":2,"content":"any two positions, and it's been,"},{"from":2927.87,"to":2929.6,"location":2,"content":"it's been shown to be quite useful in,"},{"from":2929.6,"to":2932.16,"location":2,"content":"in, in, uh, in sequence modeling."},{"from":2932.16,"to":2936.2,"location":2,"content":"This advantage of having unbounded memory not having to pack information in finite,"},{"from":2936.2,"to":2938.36,"location":2,"content":"in, in sort of a finite amount of- 
in a,"},{"from":2938.36,"to":2939.57,"location":2,"content":"in a fixed amount of space,"},{"from":2939.57,"to":2943.63,"location":2,"content":"uh, where in, in our case our memory essentially grows with the sequences is,"},{"from":2943.63,"to":2947.18,"location":2,"content":"is helps you computationally, uh, it's trivial to parallelize."},{"from":2947.18,"to":2949.11,"location":2,"content":"You can, you can crunch a lot of data, it's uh,"},{"from":2949.11,"to":2952.04,"location":2,"content":"which is useful if you wanna have your large data sets."},{"from":2952.04,"to":2954.28,"location":2,"content":"We found that it can model Self-Similarity."},{"from":2954.28,"to":2956.33,"location":2,"content":"Uh, It seems to be a very natural thing, uh,"},{"from":2956.33,"to":2960.35,"location":2,"content":"a very, a very natural phenomenon if you're dealing with images or music."},{"from":2960.35,"to":2963.2,"location":2,"content":"Also, relative attention allows you to sort of, gives you this added dimension"},{"from":2963.2,"to":2966.08,"location":2,"content":"of being able to model expressive timing and music,"},{"from":2966.08,"to":2967.93,"location":2,"content":"well, this translational equivariance,"},{"from":2967.93,"to":2970.47,"location":2,"content":"uh, it extends naturally to graphs."},{"from":2970.47,"to":2977.03,"location":2,"content":"Um, so this part or everything that I talked so far was about sort of parallel training."},{"from":2977.03,"to":2981.91,"location":2,"content":"Um, so there's a very active area of research now using the Self-Attention models for,"},{"from":2981.91,"to":2983.97,"location":2,"content":"for, for less auto-regressive generation."},{"from":2983.97,"to":2985.79,"location":2,"content":"So notice a- at generation time,"},{"from":2985.79,"to":2987.57,"location":2,"content":"notice that the decoder mask was causal,"},{"from":2987.57,"to":2988.67,"location":2,"content":"we couldn't look into the future."},{"from":2988.67,"to":2991.19,"location":2,"content":"So when we're, when we're generating we're still"},{"from":2991.19,"to":2994.25,"location":2,"content":"generating sequentially left to right on the target side."},{"from":2994.25,"to":2996.84,"location":2,"content":"Um, so, um, and, and,"},{"from":2996.84,"to":2999.17,"location":2,"content":"and, and why, why is generation hard?"},{"from":2999.17,"to":3000.67,"location":2,"content":"Well, because your outputs are multi-modal."},{"from":3000.67,"to":3002.84,"location":2,"content":"I f you had- if you want to translate English to German,"},{"from":3002.84,"to":3004.28,"location":2,"content":"there's multiple ways and,"},{"from":3004.28,"to":3008.41,"location":2,"content":"and, and your, your second word that you're translating will depend on the first word."},{"from":3008.41,"to":3011.61,"location":2,"content":"For example, if you, if you first- the first word that you predict was danke,"},{"from":3011.61,"to":3013.68,"location":2,"content":"then that's going to change the second word that you predict."},{"from":3013.68,"to":3015.67,"location":2,"content":"And if you just predicted them independently,"},{"from":3015.67,"to":3017.62,"location":2,"content":"then you can imagine you can just have all sorts of"},{"from":3017.62,"to":3020.18,"location":2,"content":"permutations of these which will be incorrect."},{"from":3020.18,"to":3022.69,"location":2,"content":"Uh, and the way we actually break modes is"},{"from":3022.69,"to":3024.94,"location":2,"content":"just- or we make decisions is just sequential 
generation."},{"from":3024.94,"to":3027.7,"location":2,"content":"Once we commit to a word that makes a decision,"},{"from":3027.7,"to":3030.49,"location":2,"content":"and then that nails down what's the next word that you're going to predict."},{"from":3030.49,"to":3034.21,"location":2,"content":"So there's been some, there's been some work on, it's an active research area, uh,"},{"from":3034.21,"to":3036.7,"location":2,"content":"and you can kind of categorize some of these papers like"},{"from":3036.7,"to":3041.74,"location":2,"content":"the non-autogressive transformer of the fast- the third paper, fast decoding."},{"from":3041.74,"to":3043.87,"location":2,"content":"Um, the fourth paper towards a better understanding"},{"from":3043.87,"to":3046,"location":2,"content":"of all Vector Quantized Auto-encoders into this group,"},{"from":3046,"to":3049.26,"location":2,"content":"where they're actually make- doing the decision making in a latent space,"},{"from":3049.26,"to":3053.47,"location":2,"content":"that's being, uh, it's e- either being learned using word alignments,"},{"from":3053.47,"to":3056.86,"location":2,"content":"uh, fertilities, or that's being learned using Auto-encoders."},{"from":3056.86,"to":3059.68,"location":2,"content":"So you make- you do the decision making in latent space,"},{"from":3059.68,"to":3062.28,"location":2,"content":"and then you- once you've made the decisions in latent space,"},{"from":3062.28,"to":3064.03,"location":2,"content":"you assume that all your outputs,"},{"from":3064.03,"to":3065.72,"location":2,"content":"are actually conditionally independent,"},{"from":3065.72,"to":3067.18,"location":2,"content":"given that you've made these decisions."},{"from":3067.18,"to":3068.49,"location":2,"content":"So that's how they actually speed up."},{"from":3068.49,"to":3070.6,"location":2,"content":"There's also- there's ano- there's another paper."},{"from":3070.6,"to":3071.86,"location":2,"content":"The second one is a"},{"from":3071.86,"to":3074.02,"location":2,"content":"paper that does Iterative Refinement."},{"from":3074.02,"to":3077.66,"location":2,"content":"There is also a Blockwise Parallel Decoding paper by Mitchell Stern,"},{"from":3077.66,"to":3080.22,"location":2,"content":"uh, Noam Shazeer, and Jakob Uszkoreit, uh,"},{"from":3080.22,"to":3083.44,"location":2,"content":"where they essentially just run multiple models like, uh,"},{"from":3083.44,"to":3089.44,"location":2,"content":"and rescore using a more- a decode using a faster model and score,"},{"from":3089.44,"to":3091.41,"location":2,"content":"using the more expensive model."},{"from":3091.41,"to":3093.58,"location":2,"content":"So that's how it sort of it speeds it up."},{"from":3093.58,"to":3098.35,"location":2,"content":"Um, [NOISE] transfer learning has had the- Self-Attention has been beneficial in transfer"},{"from":3098.35,"to":3102.82,"location":2,"content":"learning, GPT from OpenAI and BERT are two classic examples."},{"from":3102.82,"to":3104.99,"location":2,"content":"There's been some work on actually, scaling this up,"},{"from":3104.99,"to":3108.09,"location":2,"content":"like add a factor as, uh, efficient optimizer."},{"from":3108.09,"to":3112.12,"location":2,"content":"Um, there's a, there's a recent paper by Rohan Anil and Yoram Singer."},{"from":3112.12,"to":3114.45,"location":2,"content":"Um, there's also Mesh-Tensorflow,"},{"from":3114.45,"to":3117.85,"location":2,"content":"which actually they've been able to train 
models"},{"from":3117.85,"to":3122.53,"location":2,"content":"of just several orders of magnitude larger than the original models have been trained."},{"from":3122.53,"to":3125.71,"location":2,"content":"So there's, I mean, when you're working this large data regime you would probably want to"},{"from":3125.71,"to":3127.72,"location":2,"content":"memorize a lot of- you want to memorize"},{"from":3127.72,"to":3130.27,"location":2,"content":"a lot of things inside your parameters used to train a larger model."},{"from":3130.27,"to":3132.41,"location":2,"content":"Uh, Mesh-Tensorflow can uh, can let you do that."},{"from":3132.41,"to":3136.3,"location":2,"content":"Um, there has been a lot of interesting work, universal transformers,"},{"from":3136.3,"to":3139.24,"location":2,"content":"sort of recurrent neural networks can actually count very nicely."},{"from":3139.24,"to":3141.91,"location":2,"content":"There's these cute papers by Schmidhuber where he actually shows"},{"from":3141.91,"to":3145.48,"location":2,"content":"that recurring neural, the count- the cell mechanism just learns a nice counter,"},{"from":3145.48,"to":3147.34,"location":2,"content":"like if you're- you can learn kind of a to the n,"},{"from":3147.34,"to":3149.23,"location":2,"content":"b to the n, uh, with LSTM."},{"from":3149.23,"to":3151.74,"location":2,"content":"So then, uh, universals transformers"},{"from":3151.74,"to":3154.66,"location":2,"content":"brings back recurrence in depth inside the transformer."},{"from":3154.66,"to":3157.2,"location":2,"content":"Uh, there is a really cool Wikipedia paper,"},{"from":3157.2,"to":3160.9,"location":2,"content":"um, simultaneously with the image transformer paper that also uses local attention."},{"from":3160.9,"to":3166.06,"location":2,"content":"Transformer-XL paper that sort of combines recurrence with Self-Attention,"},{"from":3166.06,"to":3167.29,"location":2,"content":"so they do Self-Attention in chunks,"},{"from":3167.29,"to":3170.28,"location":2,"content":"but they sort of summarize history by using recurrence, it's kinda cute."},{"from":3170.28,"to":3172.14,"location":2,"content":"It's been used in speech but I don't know if there's been"},{"from":3172.14,"to":3175.32,"location":2,"content":"some fairly big success stories of Self-Attention in speech."},{"from":3175.32,"to":3178.35,"location":2,"content":"Uh, again, similar issues where you have very large, uh,"},{"from":3178.35,"to":3180.9,"location":2,"content":"um as positions to,"},{"from":3180.9,"to":3183.16,"location":2,"content":"uh, to do Self-Attention over."},{"from":3183.16,"to":3188.05,"location":2,"content":"So yeah, um, self supervision is a- if it works it would be,"},{"from":3188.05,"to":3189.64,"location":2,"content":"it would be, it would be very beneficial."},{"from":3189.64,"to":3192.91,"location":2,"content":"We wouldn't need large label datasets, understanding transfer,"},{"from":3192.91,"to":3195.49,"location":2,"content":"transfers is becoming very succe- becoming- is becoming"},{"from":3195.49,"to":3198.96,"location":2,"content":"a reality in NLP with BERT and some of these other models."},{"from":3198.96,"to":3201.63,"location":2,"content":"So understanding how these, what's actually happening is a-"},{"from":3201.63,"to":3204.55,"location":2,"content":"is an interesting area of ongoing research for me and a couple."},{"from":3204.55,"to":3209.51,"location":2,"content":"And a few of my collaborators and uh, multitask learning and surmounting 
this,"},{"from":3209.51,"to":3212.86,"location":2,"content":"this quadratic problem with Self-Attention is"},{"from":3212.86,"to":3218.15,"location":2,"content":"an interesting area of research that I- that I'd like to pursue. Thank you."}]}