{"font_size":0.4,"font_color":"#FFFFFF","background_alpha":0.5,"background_color":"#9C27B0","Stroke":"none","body":[{"from":4.19,"to":8.79,"location":2,"content":"So today, we're very pleased to have as our second, um,"},{"from":8.79,"to":10.98,"location":2,"content":"invited speaker, Richard Socher,"},{"from":10.98,"to":14.09,"location":2,"content":"he is the chief scientist at Salesforce."},{"from":14.09,"to":18.04,"location":2,"content":"Um, Richard actually also has a lot more connection to this class,"},{"from":18.04,"to":21.84,"location":2,"content":"um, because, um, for several years, um,"},{"from":21.84,"to":24.97,"location":2,"content":"Richard was involved either as instructor or, um,"},{"from":24.97,"to":29.11,"location":2,"content":"co-instructor in teaching this material at Stanford,"},{"from":29.11,"to":32.6,"location":2,"content":"um, so he sort of knows the course, um, pretty well."},{"from":32.6,"to":34.03,"location":2,"content":"Um, and so today,"},{"from":34.03,"to":38.83,"location":2,"content":"he's going to be talking about some of the challenges and recent work"},{"from":38.83,"to":43.69,"location":2,"content":"in doing multitask learning in natural language processing. So welcome, Richard."},{"from":43.69,"to":46.59,"location":2,"content":"Thank you. Hello, everybody. I'm excited to be here."},{"from":46.59,"to":49.38,"location":2,"content":"Uh, yeah, I want to talk to you today about what we,"},{"from":49.38,"to":51.28,"location":2,"content":"in short, called decaNLP."},{"from":51.28,"to":54.63,"location":2,"content":"I want to first give a big shout out to Bryan McCann."},{"from":54.63,"to":56.9,"location":2,"content":"He's the first author of this, uh, paper,"},{"from":56.9,"to":60.2,"location":2,"content":"and I've pitched this idea to a lot of people in the last, like,"},{"from":60.2,"to":61.28,"location":2,"content":"three to four years,"},{"from":61.28,"to":62.41,"location":2,"content":"and most people were like,"},{"from":62.41,"to":64.73,"location":2,"content":"\"This is too much pre-processing because you're trying to"},{"from":64.73,"to":67.29,"location":2,"content":"do 10 different tasks in one model.\""},{"from":67.29,"to":69.51,"location":2,"content":"That's sort of where the decathlon, uh,"},{"from":69.51,"to":71.81,"location":2,"content":"wording comes in, uh, but he,"},{"from":71.81,"to":73.31,"location":2,"content":"he really stuck to it, uh,"},{"from":73.31,"to":76.73,"location":2,"content":"did all the pre-processing and all the things that you now know like tokenization,"},{"from":76.73,"to":78.5,"location":2,"content":"and it turns out a lot of different data sets,"},{"from":78.5,"to":80.27,"location":2,"content":"have a different conception of what a word is."},{"from":80.27,"to":81.71,"location":2,"content":"This wasn't two words,"},{"from":81.71,"to":83.48,"location":2,"content":"uh, or one word,"},{"from":83.48,"to":85.36,"location":2,"content":"and things like that, and that changes how you"},{"from":85.36,"to":87.47,"location":2,"content":"write all your evaluation scripts and all of that."},{"from":87.47,"to":89.17,"location":2,"content":"So Bryan, uh, is,"},{"from":89.17,"to":90.77,"location":2,"content":"is a really phenomenal researcher,"},{"from":90.77,"to":91.98,"location":2,"content":"uh, with us in the group,"},{"from":91.98,"to":95.34,"location":2,"content":"and Nitish has helped us a lot on the optimization side of this,"},{"from":95.34,"to":96.48,"location":2,"content":"uh, and then Caiming 
Xiong,"},{"from":96.48,"to":98.42,"location":2,"content":"the Director of Research, has done a lot of, uh,"},{"from":98.42,"to":101.73,"location":2,"content":"really phenomenal work that's kind of helpful in pretty much all our projects."},{"from":101.73,"to":104.83,"location":2,"content":"So I'm going to tell you a couple of different, uh,"},{"from":104.83,"to":108.56,"location":2,"content":"lines of reasoning that led us to,"},{"from":108.56,"to":110.53,"location":2,"content":"uh, this idea of multitask learning."},{"from":110.53,"to":114.17,"location":2,"content":"And the first one was sort of trying to take a step back and looking at the field,"},{"from":114.17,"to":118.95,"location":2,"content":"and I noticed not like that much of a historical class but basically pre-2010,"},{"from":118.95,"to":124.34,"location":2,"content":"most natural language processing had kind of these very hand-designed features,"},{"from":124.34,"to":125.66,"location":2,"content":"and we basically just had,"},{"from":125.66,"to":128.71,"location":2,"content":"uh, machine learning kind of learned weights,"},{"from":128.71,"to":132.68,"location":2,"content":"uh, in the optimization procedure for these human-designed features."},{"from":132.68,"to":140.78,"location":2,"content":"And so in 2010, Chris and I and others sort of started to work in deep learning for feature learning."},{"from":140.78,"to":142.15,"location":2,"content":"So everything was a word vector and now,"},{"from":142.15,"to":145.91,"location":2,"content":"we can back-propagate into them and actually learn those representations."},{"from":145.91,"to":147.41,"location":2,"content":"And I think currently,"},{"from":147.41,"to":148.88,"location":2,"content":"we're kind of in a state where we do a lot of"},{"from":148.88,"to":151.94,"location":2,"content":"deep architecture engineering for specific tasks,"},{"from":151.94,"to":153.11,"location":2,"content":"and you've seen this already."},{"from":153.11,"to":154.7,"location":2,"content":"You have like an NER model,"},{"from":154.7,"to":156.35,"location":2,"content":"you have a question and answering model,"},{"from":156.35,"to":157.75,"location":2,"content":"you have a translation model,"},{"from":157.75,"to":159.11,"location":2,"content":"and we basically now,"},{"from":159.11,"to":161.99,"location":2,"content":"each of these communities has at least, uh,"},{"from":161.99,"to":164.66,"location":2,"content":"converged on is probably some kind of neural network,"},{"from":164.66,"to":167.57,"location":2,"content":"but there's still a lot of different kinds of architectures of"},{"from":167.57,"to":171.04,"location":2,"content":"these neural networks that you're working on for each different task."},{"from":171.04,"to":172.59,"location":2,"content":"And so the question is like, okay,"},{"from":172.59,"to":174.13,"location":2,"content":"we're gonna probably do that for"},{"from":174.13,"to":177.17,"location":2,"content":"another couple of years because we're making good progress,"},{"from":177.17,"to":178.55,"location":2,"content":"but what's sort of next,"},{"from":178.55,"to":179.99,"location":2,"content":"uh, on the research side?"},{"from":179.99,"to":182.48,"location":2,"content":"And what I actually love about this class so much is that"},{"from":182.48,"to":185,"location":2,"content":"you go from like maybe not knowing much about NLP at"},{"from":185,"to":187.72,"location":2,"content":"all to you can basically understand"},{"from":187.72,"to":190.88,"location":2,"content":"the state-of-the-art 
research papers as they come out now,"},{"from":190.88,"to":192.95,"location":2,"content":"uh, and this, this is one of those."},{"from":192.95,"to":195.48,"location":2,"content":"Uh, so [NOISE] why,"},{"from":195.48,"to":197.84,"location":2,"content":"why not continue to work in this multitask regime?"},{"from":197.84,"to":199.28,"location":2,"content":"In some ways, I feel like, uh,"},{"from":199.28,"to":200.96,"location":2,"content":"the community is a little bit, uh,"},{"from":200.96,"to":202.7,"location":2,"content":"like this cute dog, where we, kind of,"},{"from":202.7,"to":205.96,"location":2,"content":"randomly restart, uh, after every project."},{"from":205.96,"to":209.84,"location":2,"content":"And it's kind of clear to me that if you have a lot of training data, uh,"},{"from":209.84,"to":214.92,"location":2,"content":"and you define a specific data set and task on that data set,"},{"from":214.92,"to":219.08,"location":2,"content":"you start to architecture engineer your model to hill-climb on a particular metric,"},{"from":219.08,"to":221.42,"location":2,"content":"or leaderboard, or publications,"},{"from":221.42,"to":223.66,"location":2,"content":"or products, or whatever it is, uh,"},{"from":223.66,"to":225.71,"location":2,"content":"then as long as your data set has"},{"from":225.71,"to":228.09,"location":2,"content":"roughly a good representative set of"},{"from":228.09,"to":230.88,"location":2,"content":"1,000 times the number of output classes that you have,"},{"from":230.88,"to":236.21,"location":2,"content":"you'll probably get it into a regi- regime where you're in the 80 to 90 percent accuracy,"},{"from":236.21,"to":239.36,"location":2,"content":"or F1, where you're basically doing pretty okay."},{"from":239.36,"to":242.3,"location":2,"content":"And of course, now when you look at trends on ImageNet,"},{"from":242.3,"to":245,"location":2,"content":"you have 1,000 different classes in computer vision,"},{"from":245,"to":248.64,"location":2,"content":"1,000 different classes, each has 1,000 images."},{"from":248.64,"to":251.46,"location":2,"content":"So if you have roughly a million images, you do pretty well."},{"from":251.46,"to":253.74,"location":2,"content":"And in machine translation, ideally,"},{"from":253.74,"to":256.25,"location":2,"content":"you know, you have many more, you have like hundreds of thousands of words,"},{"from":256.25,"to":261.73,"location":2,"content":"so you want many millions of examples of each of the,"},{"from":261.73,"to":263.09,"location":2,"content":"uh, words in their context."},{"from":263.09,"to":264.83,"location":2,"content":"And of course, you know, the caveat is"},{"from":264.83,"to":267.62,"location":2,"content":"machine translation doesn't work to the level of humans,"},{"from":267.62,"to":270.11,"location":2,"content":"but it works well enough to have it at least in products,"},{"from":270.11,"to":274.75,"location":2,"content":"and even the best human translators use it as sort of a pre-translation and then,"},{"from":274.75,"to":277.03,"location":2,"content":"uh, sort of, clean it up."},{"from":277.03,"to":279.99,"location":2,"content":"And so it's also clear to me that in this regime,"},{"from":279.99,"to":281.48,"location":2,"content":"and if we want to get to, sort of,"},{"from":281.48,"to":283.55,"location":2,"content":"more general AI features, uh,"},{"from":283.55,"to":287.36,"location":2,"content":"we need to have some kind of more continuous learning of a single model."},
{"from":287.36,"to":289.84,"location":2,"content":"Because if we keep restarting at every project,"},{"from":289.84,"to":291.83,"location":2,"content":"we're never going to get to a single model that, kind of,"},{"from":291.83,"to":295.71,"location":2,"content":"encompasses more and more of the complexity of natural language."},{"from":295.71,"to":299.12,"location":2,"content":"And, uh, when I say we start from random,"},{"from":299.12,"to":301.3,"location":2,"content":"you of course know that that's not quite true"},{"from":301.3,"to":304.19,"location":2,"content":"because we do have some things that we pre-train,"},{"from":304.19,"to":306.29,"location":2,"content":"namely word vectors, and in computer vision,"},{"from":306.29,"to":307.52,"location":2,"content":"we have even more things."},{"from":307.52,"to":309.02,"location":2,"content":"And so in some ways that is, ah,"},{"from":309.02,"to":311.75,"location":2,"content":"an aspirational ideal for NLP,"},{"from":311.75,"to":313.86,"location":2,"content":"because in computer vision, you would be, kind of,"},{"from":313.86,"to":315.59,"location":2,"content":"crazy to not use some kind of"},{"from":315.59,"to":319.61,"location":2,"content":"convolutional neural network that has pre-train- has been pre-trained on some kind of"},{"from":319.61,"to":322.52,"location":2,"content":"tasks like ImageNet when you start with your project and"},{"from":322.52,"to":325.99,"location":2,"content":"try to classify objects or do object detection and a lot of other things."},{"from":325.99,"to":329.75,"location":2,"content":"And in some ways, the whole community could get behind it very quickly,"},{"from":329.75,"to":332.46,"location":2,"content":"because I mean, you know, once it worked, uh,"},{"from":332.46,"to":334.13,"location":2,"content":"reasonably well, because there was a, sort of,"},{"from":334.13,"to":335.99,"location":2,"content":"single blocking task in computer vision."},{"from":335.99,"to":338.61,"location":2,"content":"If you can't even tell apart a dog from a cat from a house,"},{"from":338.61,"to":342.42,"location":2,"content":"it doesn't really make sense to think of even larger, uh, vision projects."},{"from":342.42,"to":345.21,"location":2,"content":"And in NLP, we've had a lot of success with word vectors,"},{"from":345.21,"to":346.65,"location":2,"content":"you know a lot of those now,"},{"from":346.65,"to":348.75,"location":2,"content":"and it started from, sort of, just a small, uh,"},{"from":348.75,"to":351.78,"location":2,"content":"window-based approach with Word2Vec and GloVe, uh,"},{"from":351.78,"to":355.02,"location":2,"content":"then we had, uh, context vectors that were trained, uh,"},{"from":355.02,"to":357.3,"location":2,"content":"on machine translation, but basically,"},{"from":357.3,"to":360.05,"location":2,"content":"instead of just having a single set of word vectors,"},{"from":360.05,"to":364.45,"location":2,"content":"we actually pre-trained some of the LSTMs that came on top of those word vectors,"},{"from":364.45,"to":366.93,"location":2,"content":"and, uh, the way we trained that, uh,"},{"from":366.93,"to":369.05,"location":2,"content":"was also actually Bryan McCann's paper on"},{"from":369.05,"to":372.53,"location":2,"content":"contextual vectors with machine translation and then ELMo,"},{"from":372.53,"to":376.26,"location":2,"content":"kind of, replaced machine translation with, uh, language modeling,"},{"from":376.26,"to":378.57,"location":2,"content":"which of course is even better because there's even more 
training data,"},{"from":378.57,"to":380.34,"location":2,"content":"and it still tells you a lot, uh,"},{"from":380.34,"to":383.21,"location":2,"content":"and kind of captures in some ways a more complex version of"},{"from":383.21,"to":386.9,"location":2,"content":"distributional sort of hypotheses that we had in simpler word vectors,"},{"from":386.9,"to":389.64,"location":2,"content":"and BERT, not quite a language model but also, kind of,"},{"from":389.64,"to":391.61,"location":2,"content":"trying to predict words in their context, uh,"},{"from":391.61,"to":394.4,"location":2,"content":"but pre-training a lot more layers and a lot deeper networks."},{"from":394.4,"to":399.7,"location":2,"content":"And so we see the success of pre-training a certain set of weights."},{"from":399.7,"to":401.26,"location":2,"content":"And so the question is,"},{"from":401.26,"to":404.31,"location":2,"content":"why not try to pre-train the entire model?"},{"from":404.31,"to":406.65,"location":2,"content":"As in including your output,"},{"from":406.65,"to":410.14,"location":2,"content":"your softmax, your pointer mechanisms and everything,"},{"from":410.14,"to":414.24,"location":2,"content":"and then just taking a completely pre-trained model and trying to do something,"},{"from":414.24,"to":416.89,"location":2,"content":"and that is, kind of, the goal that we have."},{"from":416.89,"to":418.89,"location":2,"content":"And so, uh, we, sort of,"},{"from":418.89,"to":420.52,"location":2,"content":"ask ourselves why hasn't this happened?"},{"from":420.52,"to":421.74,"location":2,"content":"Why are we, you know,"},{"from":421.74,"to":423.43,"location":2,"content":"the first to think about, like,"},{"from":423.43,"to":425.81,"location":2,"content":"trying to pre-train the entirety of the model,"},{"from":425.81,"to":427.37,"location":2,"content":"the encoders, and decoders,"},{"from":427.37,"to":428.42,"location":2,"content":"and outputs, and everything."},{"from":428.42,"to":432.74,"location":2,"content":"Uh, and I think part of it is that NLP requires a lot of different kinds of reasoning."},{"from":432.74,"to":434.42,"location":2,"content":"You've seen many of them already."},{"from":434.42,"to":438.29,"location":2,"content":"You have some logical reasoning like 550 people in this room,"},{"from":438.29,"to":440.3,"location":2,"content":"25 leave, are there still people in the room,"},{"from":440.3,"to":442.79,"location":2,"content":"and you logically can answer that question,"},{"from":442.79,"to":445.93,"location":2,"content":"and you have lots of different kinds of linguistic and emotional reasoning,"},{"from":445.93,"to":447.47,"location":2,"content":"sentiment analysis, you know,"},{"from":447.47,"to":450.14,"location":2,"content":"this is a typical Nicolas Cage movie and then you need to know that that's a"},{"from":450.14,"to":453.59,"location":2,"content":"probably negative review unless you like Nicolas Cage movies."},{"from":453.59,"to":456.47,"location":2,"content":"Um, no judgment. 
And, uh,"},{"from":456.47,"to":458.18,"location":2,"content":"you know, visual types of reasoning and so on."},{"from":458.18,"to":461.45,"location":2,"content":"And so I think partly because of that complexity in the beginning to feel,"},{"from":461.45,"to":466.58,"location":2,"content":"didn't really make much progress and now and then kind of separate it."},{"from":466.58,"to":470.68,"location":2,"content":"And I think in some cases, kind of artificially separated into all these separate tasks,"},{"from":470.68,"to":472.34,"location":2,"content":"like you have named entity recognition,"},{"from":472.34,"to":475.8,"location":2,"content":"part of speech tagging, and semantic role labeling and, and so on."},{"from":475.8,"to":478.56,"location":2,"content":"And, and in some ways- and it sounds kind of snarky but,"},{"from":478.56,"to":479.99,"location":2,"content":"you know, it made a lot of sense at the time,"},{"from":479.99,"to":482.54,"location":2,"content":"and it allowed us to make a lot of progress in the community,"},{"from":482.54,"to":484.85,"location":2,"content":"but basically we started chasing these benchmarks,"},{"from":484.85,"to":486.29,"location":2,"content":"and all these different communities, kind of,"},{"from":486.29,"to":488.61,"location":2,"content":"started going off in their own ways."},{"from":488.61,"to":490.32,"location":2,"content":"And we even have some communities that say,"},{"from":490.32,"to":491.95,"location":2,"content":"\"We do general question answering,"},{"from":491.95,"to":494.99,"location":2,"content":"and there's literally workshops on general question answering, and when I asked,"},{"from":494.99,"to":498.35,"location":2,"content":"uh, the organizers, \"Can I ask your model what the sentiment is of this tweet?\""},{"from":498.35,"to":501.24,"location":2,"content":"They're like, \"No, that's sentiment analysis. 
Go to that different workshop."},{"from":501.24,"to":502.51,"location":2,"content":"It's down, down the hall.\""},{"from":502.51,"to":504.27,"location":2,"content":"But I'm like, \"That's a- that's a question."},{"from":504.27,"to":507.33,"location":2,"content":"Why can't you answer it in the general question answering workshop?\""},{"from":507.33,"to":509.94,"location":2,"content":"Um, and so a lot of people then say,"},{"from":509.94,"to":511.54,"location":2,"content":"\"Well, if you want to work on more general stuff,"},{"from":511.54,"to":513.86,"location":2,"content":"it has to be an unsupervised, kind of,"},{"from":513.86,"to":516.7,"location":2,"content":"task and the, the feature will not be supervised.\""},{"from":516.7,"to":520.49,"location":2,"content":"I don't think NLP will be completely unsupervised,"},{"from":520.49,"to":522.83,"location":2,"content":"and we won't solve it, uh, completely unsupervised,"},{"from":522.83,"to":525.41,"location":2,"content":"because in the end, language has a lot of supervision for people,"},{"from":525.41,"to":529.02,"location":2,"content":"uh, and, uh, I think for, for systems also."},{"from":529.02,"to":532.62,"location":2,"content":"Uh, and you won't, you know,"},{"from":532.62,"to":534.6,"location":2,"content":"if you have- there's a child and it's in a jungle,"},{"from":534.6,"to":537.29,"location":2,"content":"it will probably develop a pretty good visual cortex by itself,"},{"from":537.29,"to":539.37,"location":2,"content":"but it won't develop language by itself."},{"from":539.37,"to":541.23,"location":2,"content":"And then- and then also, like,"},{"from":541.23,"to":543.72,"location":2,"content":"I think if you'll just allow AI's to talk to one another,"},{"from":543.72,"to":546.2,"location":2,"content":"it makes very little sense for them to try to come up with as"},{"from":546.2,"to":549.14,"location":2,"content":"inefficient of a communication protocol as humans have with, you know,"},{"from":549.14,"to":553.97,"location":2,"content":"sequential processing of language because algorithms and computers could,"},{"from":553.97,"to":556.07,"location":2,"content":"if there's no supervision of human language,"},{"from":556.07,"to":559.46,"location":2,"content":"they could just communicate in much more efficient ways with one another."},{"from":559.46,"to":561.05,"location":2,"content":"So I think it's fairly clear,"},{"from":561.05,"to":564.49,"location":2,"content":"we need a lot of supervision, uh, in NLP."},{"from":564.49,"to":567.84,"location":2,"content":"And so basically, all of this has led us, uh,"},{"from":567.84,"to":574.34,"location":2,"content":"to trying to think about a unified multitask model for a lot of different NLP tasks."},{"from":574.34,"to":576.51,"location":2,"content":"By the way, if you have any questions, just raise your hand."},{"from":576.51,"to":579.11,"location":2,"content":"Okay, let's make this very interactive."},{"from":579.11,"to":582.55,"location":2,"content":"Um, basically, we want this unified model, uh,"},{"from":582.55,"to":585.57,"location":2,"content":"to decide how to transfer knowledge,"},{"from":585.57,"to":587.88,"location":2,"content":"uh, and not have it, sort of, be manually assigned."},{"from":587.88,"to":589.28,"location":2,"content":"Like in most cases,"},{"from":589.28,"to":590.87,"location":2,"content":"when you assign your project you say, \"Oh,"},{"from":590.87,"to":595.03,"location":2,"content":"well I know that named entity recognition part of speech tagging help each 
other."},{"from":595.03,"to":596.87,"location":2,"content":"Because once you know something is a noun,"},{"from":596.87,"to":600.73,"location":2,"content":"then it's more likely that it's also a named entity.\""},{"from":600.73,"to":605.09,"location":2,"content":"And in this case, we want to basically allow for the single unified model"},{"from":605.09,"to":609.89,"location":2,"content":"to know itself how to do domain adaptation and wha- how to share the weights,"},{"from":609.89,"to":612.65,"location":2,"content":"and that will hopefully then lead to a lot of,"},{"from":612.65,"to":615.93,"location":2,"content":"uh, transfer learning and zero shot learning capabilities."},{"from":615.93,"to":619.1,"location":2,"content":"I also think that if we get to this, sort of,"},{"from":619.1,"to":623.26,"location":2,"content":"hard goal of having a single fa- single unified multitask model,"},{"from":623.26,"to":627.14,"location":2,"content":"then we'll easy- be able to more easily adapt it to"},{"from":627.14,"to":631.09,"location":2,"content":"new tasks and we'll be also able to deploy it in production more quickly."},{"from":631.09,"to":632.4,"location":2,"content":"If nowadays you want to build"},{"from":632.4,"to":635.57,"location":2,"content":"a little squirrel detector and connect it to your sprinkler system,"},{"from":635.57,"to":637.89,"location":2,"content":"you can just download some off-the-shelf software,"},{"from":637.89,"to":640.2,"location":2,"content":"and it will basically, kind of, work."},{"from":640.2,"to":642.17,"location":2,"content":"That is not the case if you try to do"},{"from":642.17,"to":644.39,"location":2,"content":"a pretty complex language project where you"},{"from":644.39,"to":646.96,"location":2,"content":"want to translate into some completely new language or,"},{"from":646.96,"to":650.24,"location":2,"content":"you know, analyze some website and then do something else afterwards."},{"from":650.24,"to":651.89,"location":2,"content":"So, uh, you also,"},{"from":651.89,"to":656.37,"location":2,"content":"when you actually try to deploy and use these kinds of tools and companies,"},{"from":656.37,"to":659.08,"location":2,"content":"you'll realize that there are a lot of different kinds of groups."},{"from":659.08,"to":660.2,"location":2,"content":"There's the search group,"},{"from":660.2,"to":661.31,"location":2,"content":"and the chatbot team,"},{"from":661.31,"to":662.54,"location":2,"content":"and the translation team,"},{"from":662.54,"to":665.93,"location":2,"content":"and, uh, and the social sentiment analysis team,"},{"from":665.93,"to":667.1,"location":2,"content":"and they all use different models,"},{"from":667.1,"to":668.39,"location":2,"content":"and they all deploy different models,"},{"from":668.39,"to":670.85,"location":2,"content":"and they all have to build a lot of overhead into"},{"from":670.85,"to":675.15,"location":2,"content":"the core of the- or around that core of an AI model."},{"from":675.15,"to":678.24,"location":2,"content":"So basically, um, lastly,"},{"from":678.24,"to":680.43,"location":2,"content":"it was, sort of, what we had with, with this dog."},{"from":680.43,"to":682.17,"location":2,"content":"I think that once we have this unified model,"},{"from":682.17,"to":684.38,"location":2,"content":"it will also be a first step to being able to"},{"from":684.38,"to":686.87,"location":2,"content":"then continually learn this and just have a single model that just"},{"from":686.87,"to":688.88,"location":2,"content":"gets better and 
better over time and starts"},{"from":688.88,"to":692.03,"location":2,"content":"to capture more and more of the complexity of language."},{"from":692.03,"to":693.98,"location":2,"content":"All right, any questions around, sort of,"},{"from":693.98,"to":701.7,"location":2,"content":"the motivation high level?"},{"from":701.7,"to":704.86,"location":2,"content":"All right. So then, uh,"},{"from":704.86,"to":708.37,"location":2,"content":"it's sort of the question, how do we actually make that happen?"},{"from":708.37,"to":712.13,"location":2,"content":"And then we -- I first sort of sat down and looked at, like,"},{"from":712.13,"to":716.56,"location":2,"content":"the general sort of formats of all the tasks that you may experience in"},{"from":716.56,"to":718.51,"location":2,"content":"this class and that NLP sort of has as a field in"},{"from":718.51,"to":721,"location":2,"content":"general and I think they can broadly"},{"from":721,"to":723.1,"location":2,"content":"be classified into these three different categories."},{"from":723.1,"to":724.9,"location":2,"content":"Sequence tagging, you already know."},{"from":724.9,"to":727.84,"location":2,"content":"Things like NER or aspect-specific sentiment or in"},{"from":727.84,"to":732.25,"location":2,"content":"a specific context we want to classify if a word is positive or negative."},{"from":732.25,"to":734.38,"location":2,"content":"Uh, and then text classification,"},{"from":734.38,"to":737.29,"location":2,"content":"just a single label for the entire piece of text"},{"from":737.29,"to":740.34,"location":2,"content":"and then sequence to sequence, a lot of different, you know,"},{"from":740.34,"to":743.58,"location":2,"content":"problems fall into that and I actually personally love, uh,"},{"from":743.58,"to":747.49,"location":2,"content":"these three particular tasks: machine translation, summarization, question answering."},{"from":747.49,"to":751.45,"location":2,"content":"Because they are immediately useful, and you don't have to explain to somebody,"},{"from":751.45,"to":754.2,"location":2,"content":"\"Oh, but why do you need the semantic role labeller or parser? 
\""},{"from":754.2,"to":756.49,"location":2,"content":"If you're a layman and you, you know,"},{"from":756.49,"to":758.62,"location":2,"content":"on the Internet you understand immediately why it's"},{"from":758.62,"to":761.14,"location":2,"content":"useful to do summarization, question answering,"},{"from":761.14,"to":763.24,"location":2,"content":"or translation and an improvement in"},{"from":763.24,"to":766.84,"location":2,"content":"those tasks kind of immediately translates in- into better products,"},{"from":766.84,"to":771.43,"location":2,"content":"uh, and people being able to communicate better and more efficiently with language."},{"from":771.43,"to":777.4,"location":2,"content":"So, that, uh, kind of analysis led us to think,"},{"from":777.4,"to":781.03,"location":2,"content":"uh, about these what I call three equivalent supertasks of NLP."},{"from":781.03,"to":783.91,"location":2,"content":"Uh, and basically they are"},{"from":783.91,"to":787.78,"location":2,"content":"language modeling, question answer now- question answering and dialogue systems."},{"from":787.78,"to":791.41,"location":2,"content":"Uh, language modeling, basically trying to predin- predict the next word,"},{"from":791.41,"to":792.43,"location":2,"content":"you've already worked on that."},{"from":792.43,"to":798.77,"location":2,"content":"Uh, and usually it's only used to rescore or basically to pre-train these days."},{"from":798.77,"to":802.64,"location":2,"content":"But really if you ask me a question and then you try to predict the next couple of words,"},{"from":802.64,"to":805.43,"location":2,"content":"then that is also language modeling"},{"from":805.43,"to":808.81,"location":2,"content":"and if you're able to predict the next couple of words after a question, like,"},{"from":808.81,"to":812.35,"location":2,"content":"what were the named entities in the sentence and then you just generate, you know,"},{"from":812.35,"to":814.12,"location":2,"content":"Dresden was a location,"},{"from":814.12,"to":816.43,"location":2,"content":"Richard was a person and whatnot."},{"from":816.43,"to":821.14,"location":2,"content":"Uh, then you can kind of cast almost all of these tasks into language modeling."},{"from":821.14,"to":822.58,"location":2,"content":"Uh, similarly question answering,"},{"from":822.58,"to":824.08,"location":2,"content":"you can ask any kind of question,"},{"from":824.08,"to":825.43,"location":2,"content":"what is the translation,"},{"from":825.43,"to":828.12,"location":2,"content":"what's the summary, uh, and so on,"},{"from":828.12,"to":830.77,"location":2,"content":"and then with dialogue right now it's kind of tricky because there are"},{"from":830.77,"to":835.93,"location":2,"content":"no really good dialogue datasets out there and a lot of times you want some interaction,"},{"from":835.93,"to":840.01,"location":2,"content":"you have to run user studies and most of the existing NLP task would"},{"from":840.01,"to":844.36,"location":2,"content":"basically be pretty short one-step dialogues like what are the named entity tags,"},{"from":844.36,"to":845.56,"location":2,"content":"and you give them and that's it."},{"from":845.56,"to":849.85,"location":2,"content":"So it's a little bit overkill and because of that we basically converged,"},{"from":849.85,"to":853.52,"location":2,"content":"uh, on question answering as our main formalism."},{"from":853.52,"to":858.36,"location":2,"content":"And here is now an overview of the 10 different tasks that we 
have,"},{"from":858.36,"to":861.61,"location":2,"content":"uh, and we cast all of them as question answering."},{"from":861.61,"to":865.12,"location":2,"content":"These are literally the tr- the training,"},{"from":865.12,"to":867.7,"location":2,"content":"uh, the format of the training dataset, uh,"},{"from":867.7,"to":870.88,"location":2,"content":"and eventually also the way we formulate"},{"from":870.88,"to":875.53,"location":2,"content":"the test set and you'll see basically for every single task,"},{"from":875.53,"to":878.61,"location":2,"content":"you have a context as some kind of document."},{"from":878.61,"to":879.7,"location":2,"content":"It could be a Wikipedia article,"},{"from":879.7,"to":881.5,"location":2,"content":"it could be a tweet, it could be a longer document,"},{"from":881.5,"to":885.55,"location":2,"content":"whatever, and you ask a question about it and you want to generate an answer."},{"from":885.55,"to":889.09,"location":2,"content":"And I'm actually -- I'm curious if you can think of any task in NLP"},{"from":889.09,"to":892.79,"location":2,"content":"that couldn't be formulated in this kind of structure."},{"from":892.79,"to":895.72,"location":2,"content":"Uh, so, let's go over some of these."},{"from":895.72,"to":897.87,"location":2,"content":"Uh, the first one is sort of the standard,"},{"from":897.87,"to":900.14,"location":2,"content":"uh, task that all- you're all familiar with now."},{"from":900.14,"to":902.44,"location":2,"content":"The SQuAD, Stanford Question Answering Dataset."},{"from":902.44,"to":906.88,"location":2,"content":"Uh, where the answer is essentially a phrase somewhere in the context."},{"from":906.88,"to":912.26,"location":2,"content":"But then, uh, the second one is something that you would never see in most,"},{"from":912.26,"to":916.9,"location":2,"content":"uh, generalized, uh, question answering workshops and that is, uh,"},{"from":916.9,"to":920.56,"location":2,"content":"having a context of the single sentence asking what is the translation from"},{"from":920.56,"to":925.09,"location":2,"content":"English into German and the output is again a sequence of words but in this case,"},{"from":925.09,"to":926.5,"location":2,"content":"and we color them differently here."},{"from":926.5,"to":931.87,"location":2,"content":"Uh, this is blue because all these words are basically not in the context and not in"},{"from":931.87,"to":935.11,"location":2,"content":"the question and we will just generate them"},{"from":935.11,"to":939.28,"location":2,"content":"with a standard softmax to basically answer this question."},{"from":939.28,"to":943.39,"location":2,"content":"We can also ask what is the summary and you can see that those"},{"from":943.39,"to":947.29,"location":2,"content":"two in some ways is artificial to make them into a natural language question."},{"from":947.29,"to":951.25,"location":2,"content":"You could just say translate or summarize and this is just like"},{"from":951.25,"to":956.14,"location":2,"content":"one kind of task token in your network but actually half of these tasks."},{"from":956.14,"to":962.3,"location":2,"content":"It makes sense because the question also has ac- is different for every example."},{"from":962.3,"to":966.04,"location":2,"content":"So this one here is natural language inference, NLI, uh,"},{"from":966.04,"to":970.92,"location":2,"content":"She covered also where we want to ask whether two sentences entail each other,"},{"from":970.92,"to":974.81,"location":2,"content":"contradict each other or 
there's some neutral relationship between them."},{"from":974.81,"to":976.9,"location":2,"content":"You've seen a lot of sentiment."},{"from":976.9,"to":978.58,"location":2,"content":"And this here is kind of important."},{"from":978.58,"to":982.6,"location":2,"content":"We actually ask, is this sentence positive or negative, versus just, what is the sentiment"},{"from":982.6,"to":987.75,"location":2,"content":"and what- why that is important is that you see here in green,"},{"from":987.75,"to":990.76,"location":2,"content":"this answer here actually comes from"},{"from":990.76,"to":994.38,"location":2,"content":"a word in the question, and if we formulate it that way,"},{"from":994.38,"to":999.33,"location":2,"content":"we can eventually do zero-shot learning where we ask a new question that was"},{"from":999.33,"to":1004.15,"location":2,"content":"never asked before for a new set of labels and magically, in some cases,"},{"from":1004.15,"to":1006.18,"location":2,"content":"it still actually works and we'll, you know,"},{"from":1006.18,"to":1010.5,"location":2,"content":"ask que- we can ask questions like is this story happy or sad and it will still"},{"from":1010.5,"to":1012.12,"location":2,"content":"give us an answer even though we've never given"},{"from":1012.12,"to":1015.2,"location":2,"content":"it a training dataset of a bunch of happy and sad stories."},{"from":1015.2,"to":1019.74,"location":2,"content":"So, it's kind of zero-shot classification that you get to in"},{"from":1019.74,"to":1022.23,"location":2,"content":"some cases if you formulate your questions in a way"},{"from":1022.23,"to":1025.27,"location":2,"content":"that the answer is present as a word in the question."},{"from":1025.27,"to":1028.34,"location":2,"content":"Then we have semantic role labeling here."},{"from":1028.34,"to":1035.54,"location":2,"content":"So what has something experienced, kind of a random weird question."},{"from":1035.54,"to":1038.45,"location":2,"content":"Then we have zero-shot relation extraction: who is"},{"from":1038.45,"to":1042.26,"location":2,"content":"the illustrator of Cycle of the Werewolf,"},{"from":1042.26,"to":1044.58,"location":2,"content":"we also have some dialogue state tracking."},{"from":1044.58,"to":1048.62,"location":2,"content":"What is the current state in- in a dialogue and the context just keeps on"},{"from":1048.62,"to":1053.98,"location":2,"content":"growing with the dialogue and then we also have SQL,"},{"from":1053.98,"to":1057.69,"location":2,"content":"WikiSQL, a translation task, but not translating into"},{"from":1057.69,"to":1062.03,"location":2,"content":"another natural language, but translating into a SQL database query."},{"from":1062.03,"to":1063.72,"location":2,"content":"It's actually a super-helpful task."},{"from":1063.72,"to":1067.83,"location":2,"content":"There's a, you know, a lot of data out there that is stored in databases."},{"from":1067.83,"to":1070.44,"location":2,"content":"If you can access it without having to ask"},{"from":1070.44,"to":1073.38,"location":2,"content":"somebody who knows how to program SQL it will make"},{"from":1073.38,"to":1076.2,"location":2,"content":"that data available to a lot more people so"},{"from":1076.2,"to":1079.26,"location":2,"content":"they can analyze it and do business analytics and so on."},{"from":1079.26,"to":1082.74,"location":2,"content":"And then here, Winograd Schemas and anaphora resolution."},{"from":1082.74,"to":1086.1,"location":2,"content":"Uh, some people call this kind of common sense reasoning but 
it's kind of,"},{"from":1086.1,"to":1090.22,"location":2,"content":"you know, mostly just anaphora resolution trying to understand in this context."},{"from":1090.22,"to":1092.38,"location":2,"content":"Uh, what -- who's, you know,"},{"from":1092.38,"to":1095.55,"location":2,"content":"uh, the word like who had given help,"},{"from":1095.55,"to":1099.03,"location":2,"content":"was it Susan or Joanne, and then based on this context,"},{"from":1099.03,"to":1102.9,"location":2,"content":"you can kind of should be able to figure that out and again here,"},{"from":1102.9,"to":1106.86,"location":2,"content":"the question is different for every single example. All right, yeah?"},{"from":1106.86,"to":1109.89,"location":2,"content":"When you're testing it -- like when you ask,"},{"from":1109.89,"to":1111.8,"location":2,"content":"is this sentence positive or negative,"},{"from":1111.8,"to":1115.29,"location":2,"content":"does it sometimes, like, [inaudible]?"},{"from":1115.29,"to":1117.77,"location":2,"content":"Great question. So, the question is when I ask,"},{"from":1117.77,"to":1120.51,"location":2,"content":"is this sentence positive or negative will it sometimes eventually"},{"from":1120.51,"to":1123.91,"location":2,"content":"accidentally switch to a different one of the task and, uh,"},{"from":1123.91,"to":1127.11,"location":2,"content":"we actually have a slide on that and the answer is it's surprisingly good at"},{"from":1127.11,"to":1132.78,"location":2,"content":"knowing how to go about doing the task and where to get the answer where it's from."},{"from":1132.78,"to":1136.86,"location":2,"content":"Um, and yeah, they'll make more sense in a couple of slides once we go over the model."},{"from":1136.86,"to":1138.56,"location":2,"content":"Any other questions about,"},{"from":1138.56,"to":1140.82,"location":2,"content":"uh, the question answering formalism?"},{"from":1140.82,"to":1144.93,"location":2,"content":"Are you able to formulate text generation in the question answer format as well?"},{"from":1144.93,"to":1146.68,"location":2,"content":"Like, tell me a story."},{"from":1146.68,"to":1150.19,"location":2,"content":"Good question. 
So can we do text generation, uh,"},{"from":1150.19,"to":1151.8,"location":2,"content":"like tell me a story, uh,"},{"from":1151.8,"to":1154.59,"location":2,"content":"from a random kind of -- or in this kind of formalism."},{"from":1154.59,"to":1159.45,"location":2,"content":"Uh, we don't have that as a task because largely it's really hard to evaluate."},{"from":1159.45,"to":1162.12,"location":2,"content":"It'll tell you some random stuff and then is that a good story or not,"},{"from":1162.12,"to":1164.33,"location":2,"content":"is it grammatical, you have to come up with a lot of,"},{"from":1164.33,"to":1165.75,"location":2,"content":"uh, sort of, uh,"},{"from":1165.75,"to":1168.42,"location":2,"content":"evaluation metrics which we actually are doing for"},{"from":1168.42,"to":1171.33,"location":2,"content":"some of the dialogue systems and in case of dialogue,"},{"from":1171.33,"to":1173.28,"location":2,"content":"why does -- why are they equivalent because"},{"from":1173.28,"to":1176.16,"location":2,"content":"the context can just keep on growing and every time, uh,"},{"from":1176.16,"to":1178.39,"location":2,"content":"the user said something, uh,"},{"from":1178.39,"to":1183.53,"location":2,"content":"you basically try to then predict the next answer in that dialogue."},{"from":1183.53,"to":1188.7,"location":2,"content":"And so I think you could very easily [NOISE] use this to generate texts."},{"from":1188.7,"to":1191.22,"location":2,"content":"Uh, you basically just ask -- tell it like what is, you know,"},{"from":1191.22,"to":1194.49,"location":2,"content":"what's a good ending of the story and you maybe start the context with like"},{"from":1194.49,"to":1198.42,"location":2,"content":"two or three words and then you ask the model to generate more and more words,"},{"from":1198.42,"to":1201.97,"location":2,"content":"uh, in the form of this network I'll describe in a second. Yeah?"},{"from":1201.97,"to":1204.72,"location":2,"content":"I was wondering like, uh, when you're training"},{"from":1204.72,"to":1207.8,"location":2,"content":"it and you're trying to research like a new task."},{"from":1207.8,"to":1211.47,"location":2,"content":"Uh, does it like learn with less data?"},{"from":1211.47,"to":1214.32,"location":2,"content":"That is an amazingly thoughtful question"},{"from":1214.32,"to":1216.93,"location":2,"content":"and it's- it's so important we'll have a bunch of slides on it."},{"from":1216.93,"to":1220.98,"location":2,"content":"So maybe we'll- we'll go -- we'll continue and we'll get to that question, uh,"},{"from":1220.98,"to":1225.08,"location":2,"content":"in a lot of detail because it's sort of why we're doing it and, the short answer is yes."},{"from":1225.08,"to":1227.86,"location":2,"content":"But we'll get to more details. 
All right."},{"from":1227.86,"to":1230.21,"location":2,"content":"So these are basically the 10 tasks."},{"from":1230.21,"to":1233.97,"location":2,"content":"Uh, and again this is the actual format for it."},{"from":1233.97,"to":1235.89,"location":2,"content":"So if you have a problem,"},{"from":1235.89,"to":1237.81,"location":2,"content":"and you can cast it in this format, uh,"},{"from":1237.81,"to":1240.63,"location":2,"content":"you can just take, uh, the open source code and run it and,"},{"from":1240.63,"to":1242.03,"location":2,"content":"uh, it'll- it'll work."},{"from":1242.03,"to":1245.01,"location":2,"content":"And so when you kind of analyze and think about what we've done here."},{"from":1245.01,"to":1247.68,"location":2,"content":"In some ways, we've taken the tasks that"},{"from":1247.68,"to":1250.95,"location":2,"content":"usually is kind of in your head but it's not given to the model."},{"from":1250.95,"to":1254.73,"location":2,"content":"The model is just given an input x and an output y in almost all of"},{"from":1254.73,"to":1260.76,"location":2,"content":"the supervised systems and instead we're actually including the task in the inputs,"},{"from":1260.76,"to":1265.95,"location":2,"content":"uh, in the set of inputs to the model. So you can kind of call this meta-supervised learning."},{"from":1265.95,"to":1268.26,"location":2,"content":"So again the question, uh,"},{"from":1268.26,"to":1271.14,"location":2,"content":"is kind of our task definition for each of these different tasks."},{"from":1271.14,"to":1273.57,"location":2,"content":"The model has to figure out itself when to ask the question"},{"from":1273.57,"to":1276.18,"location":2,"content":"that way it can also figure out itself when to"},{"from":1276.18,"to":1281.57,"location":2,"content":"transfer knowledge from these other tasks and y is again just the answer."},{"from":1281.57,"to":1285.33,"location":2,"content":"So, in some ways it's meta-supervised learning and I'm quite excited"},{"from":1285.33,"to":1289.56,"location":2,"content":"because once you allow the task to be given to the model as input,"},{"from":1289.56,"to":1292.17,"location":2,"content":"it can kind of decide itself how to go about"},{"from":1292.17,"to":1295.02,"location":2,"content":"solving that particular task and now you can learn,"},{"from":1295.02,"to":1296.84,"location":2,"content":"uh, a lot more powerful models."},{"from":1296.84,"to":1299.31,"location":2,"content":"So once we had the dataset,"},{"from":1299.31,"to":1302.27,"location":2,"content":"we thought \"Okay, how do we now solve this problem?\""},{"from":1302.27,"to":1303.96,"location":2,"content":"The simplest way is you could just say, \"Well,"},{"from":1303.96,"to":1305.01,"location":2,"content":"I have a big if statement,"},{"from":1305.01,"to":1307.26,"location":2,"content":"I have a classifier in the beginning and then I classify."},{"from":1307.26,"to":1309.22,"location":2,"content":"If this is a machine translation task,"},{"from":1309.22,"to":1311.02,"location":2,"content":"then run my machine translation model.\""},{"from":1311.02,"to":1314.3,"location":2,"content":"And in general, in Python that would still be just like one big python,"},{"from":1314.3,"to":1316.43,"location":2,"content":"uh, model with a bunch of if statements, right?"},{"from":1316.43,"to":1318.77,"location":2,"content":"And that's not the goal because then we wouldn't get to any of"},{"from":1318.77,"to":1322.19,"location":2,"content":"the transfer learning and zero-shot capabilities that 
we're hoping for."},{"from":1322.19,"to":1327.63,"location":2,"content":"So [NOISE] we wanted the model"},{"from":1327.63,"to":1330.11,"location":2,"content":"to have the capability to internally adjust"},{"from":1330.11,"to":1335.36,"location":2,"content":"to these different tasks and make these decisions itself."},{"from":1335.36,"to":1338.49,"location":2,"content":"And basically, all of those considerations and all"},{"from":1338.49,"to":1340.62,"location":2,"content":"of those thoughts led us, uh, to this model."},{"from":1340.62,"to":1342.12,"location":2,"content":"So before I go, uh,"},{"from":1342.12,"to":1343.45,"location":2,"content":"into a little bit more detail,"},{"from":1343.45,"to":1345.83,"location":2,"content":"I'll just like sort of give you the high-level overview."},{"from":1345.83,"to":1347.92,"location":2,"content":"Again, you start with the context."},{"from":1347.92,"to":1350.71,"location":2,"content":"Um, you start- you ask a question about, uh,"},{"from":1350.71,"to":1353.7,"location":2,"content":"that context document, and then we're going to generate,"},{"from":1353.7,"to":1358.56,"location":2,"content":"uh, the answer one word at a time by either pointing to the context,"},{"from":1358.56,"to":1360.05,"location":2,"content":"and you've had pointers already, right?"},{"from":1360.05,"to":1364.04,"location":2,"content":"Pointer networks, all that? Great. Um, pointing to a question word,"},{"from":1364.04,"to":1368.19,"location":2,"content":"or choosing a word from an external vocabulary with your standard softmax classifier."},{"from":1368.19,"to":1372.63,"location":2,"content":"Uh, and we'll have a pointer switch mechanism that will kind"},{"from":1372.63,"to":1377.41,"location":2,"content":"of choose how much to weight [NOISE] each of these three generation mechanisms."},{"from":1377.41,"to":1380.76,"location":2,"content":"So, uh, let's dig a little bit into this model."},{"from":1380.76,"to":1384.6,"location":2,"content":"Fortunately, uh, in some ways it's kind of just taking the best, uh,"},{"from":1384.6,"to":1389.16,"location":2,"content":"of the current sort of state-of-the-art techniques and putting them together in a way,"},{"from":1389.16,"to":1391.56,"location":2,"content":"uh, that- that generalizes well enough."},{"from":1391.56,"to":1394.14,"location":2,"content":"Uh, you can look at all the code on decanlp.com,"},{"from":1394.14,"to":1396.87,"location":2,"content":"[NOISE] it has like thousands of, uh,"},{"from":1396.87,"to":1400.4,"location":2,"content":"stars and, uh, and forks and stuff combined, uh,"},{"from":1400.4,"to":1401.8,"location":2,"content":"and you can, you know,"},{"from":1401.8,"to":1404.18,"location":2,"content":"basically run everything, uh,"},{"from":1404.18,"to":1409.76,"location":2,"content":"in this, uh, on these experiments with just one command."},{"from":1409.76,"to":1413.61,"location":2,"content":"It'll download all the datasets and everything and- and run everything,"},{"from":1413.61,"to":1416.34,"location":2,"content":"you can really explore what it looks like but let's- let's"},{"from":1416.34,"to":1419.37,"location":2,"content":"dive a little bit into the details of this model."},{"from":1419.37,"to":1421.07,"location":2,"content":"In some ways again, it just kind of takes"},{"from":1421.07,"to":1423.87,"location":2,"content":"all the best ingredients from deep learning [NOISE] NLP,"},{"from":1423.87,"to":1428.49,"location":2,"content":"most of which you've already learned 
about and puts them together in a reasonable way."},{"from":1428.49,"to":1430.47,"location":2,"content":"So we start with fixed GloVe embeddings."},{"from":1430.47,"to":1432.63,"location":2,"content":"Eventually, we'll- we updated, uh,"},{"from":1432.63,"to":1434.73,"location":2,"content":"the embeddings to CoVe embeddings, uh,"},{"from":1434.73,"to":1437.71,"location":2,"content":"and probably it'll work even better if you update them to BERT embeddings."},{"from":1437.71,"to":1440.82,"location":2,"content":"Uh, but at some point we kind of have to move on and do other things."},{"from":1440.82,"to":1443.46,"location":2,"content":"Uh, but basically, you have a fixed set of word vectors,"},{"from":1443.46,"to":1445.86,"location":2,"content":"and that is kind of important because in some of these,"},{"from":1445.86,"to":1448.55,"location":2,"content":"uh, data sets, they're much smaller than others."},{"from":1448.55,"to":1450.36,"location":2,"content":"Uh, and as you know from SQuAD,"},{"from":1450.36,"to":1452.58,"location":2,"content":"if you actually backpropagate into the word vectors,"},{"from":1452.58,"to":1454.68,"location":2,"content":"you just do really, really well on your trained dataset,"},{"from":1454.68,"to":1458.31,"location":2,"content":"but then you won't generalize because of most of the [NOISE] text,"},{"from":1458.31,"to":1461.43,"location":2,"content":"uh, test documents will include words you've never seen before."},{"from":1461.43,"to":1464.64,"location":2,"content":"So if you change all the word vectors during training, uh,"},{"from":1464.64,"to":1468.3,"location":2,"content":"it won't- it won't work very well at test time and won't generalize the unseen words."},{"from":1468.3,"to":1470.36,"location":2,"content":"So, uh, fixed GloVe embeddings,"},{"from":1470.36,"to":1471.99,"location":2,"content":"if you don't have word vectors, uh,"},{"from":1471.99,"to":1475.14,"location":2,"content":"for unseen words, we also have character n-gram embeddings."},{"from":1475.14,"to":1477.87,"location":2,"content":"Then we pipe them through a simple linear layer,"},{"from":1477.87,"to":1479.25,"location":2,"content":"and then we have a shared, uh,"},{"from":1479.25,"to":1482.54,"location":2,"content":"bidirectional LSTM with skip connections."},{"from":1482.54,"to":1486.26,"location":2,"content":"And so, uh, it's a deep- deep one so you skip to higher layers,"},{"from":1486.26,"to":1489.09,"location":2,"content":"and it's shared between the context and the questions."},{"from":1489.09,"to":1491.85,"location":2,"content":"So they have basically the same [NOISE] set of weights."},{"from":1491.85,"to":1496.44,"location":2,"content":"[NOISE] Then, uh, we have a co-attention layer."},{"from":1496.44,"to":1498.84,"location":2,"content":"Uh, where we basically just have outer products, uh,"},{"from":1498.84,"to":1503.4,"location":2,"content":"between all the hidden states of those two sequences,"},{"from":1503.4,"to":1506.07,"location":2,"content":"and again, have skip connections, uh,"},{"from":1506.07,"to":1508.05,"location":2,"content":"to circumvent, uh, those as well."},{"from":1508.05,"to":1511.2,"location":2,"content":"So now you have kind of context or question dependent, uh,"},{"from":1511.2,"to":1515.46,"location":2,"content":"contextual representations [NOISE] or- or representations of that context."},{"from":1515.46,"to":1518.97,"location":2,"content":"[NOISE] Uh, then we feed those into our transformer layers,"},{"from":1518.97,"to":1523.58,"location":2,"content":"uh, and 
we actually tried to use transformers for all the things,"},{"from":1523.58,"to":1525.77,"location":2,"content":"with having no LSTMs or any of that."},{"from":1525.77,"to":1528.73,"location":2,"content":"Uh, unfortunately, transformer layers were still, uh,"},{"from":1528.73,"to":1532.59,"location":2,"content":"very, uh, finicky and very hard to optimize,"},{"from":1532.59,"to":1535.02,"location":2,"content":"and there's a lot of trickery with the learning rates,"},{"from":1535.02,"to":1538.52,"location":2,"content":"and we could just not get them to perform really well,"},{"from":1538.52,"to":1541.76,"location":2,"content":"uh, on- on these 10 different tasks."},{"from":1541.76,"to":1545.76,"location":2,"content":"Uh, [NOISE] sometimes you had one transformer layer, one transformer network,"},{"from":1545.76,"to":1546.93,"location":2,"content":"that worked really well in one task,"},{"from":1546.93,"to":1549.33,"location":2,"content":"but the only other transformer network that worked well"},{"from":1549.33,"to":1551.89,"location":2,"content":"on the second task had like half the layers."},{"from":1551.89,"to":1555.15,"location":2,"content":"And once you tried to have one network with the same number of layers,"},{"from":1555.15,"to":1557.71,"location":2,"content":"it just wouldn't work on either of the two tasks anymore."},{"from":1557.71,"to":1560.64,"location":2,"content":"Uh, and so- so yeah, unfortunately as nice as they"},{"from":1560.64,"to":1563.58,"location":2,"content":"are because they're nicely parallelizable on GPUs,"},{"from":1563.58,"to":1565.11,"location":2,"content":"uh, they weren't yet robust enough,"},{"from":1565.11,"to":1566.82,"location":2,"content":"uh, to- to be used for this."},{"from":1566.82,"to":1569.28,"location":2,"content":"[NOISE] So we had to have these LSTMs,"},{"from":1569.28,"to":1571.2,"location":2,"content":"uh, before and after the transformer layers."},{"from":1571.2,"to":1575.3,"location":2,"content":"[NOISE] And then we essentially just have a standard sort of autoregressive, uh,"},{"from":1575.3,"to":1577.77,"location":2,"content":"decoder where given the last state,"},{"from":1577.77,"to":1579.72,"location":2,"content":"uh, we generate the next word."},{"from":1579.72,"to":1582.09,"location":2,"content":"And then we have these three pointer mechanisms."},{"from":1582.09,"to":1584.46,"location":2,"content":"Uh, they're very similar to the pointer ne- mechanisms you already know."},{"from":1584.46,"to":1588.4,"location":2,"content":"But now on top of these very contextualized representations, uh,"},{"from":1588.4,"to":1590.58,"location":2,"content":"at the end of this encoder, uh,"},{"from":1590.58,"to":1593.64,"location":2,"content":"and it basically learns to either point to question words,"},{"from":1593.64,"to":1595.77,"location":2,"content":"context words based on the hidden states,"},{"from":1595.77,"to":1598.13,"location":2,"content":"or have also a standard softmax,"},{"from":1598.13,"to":1601.39,"location":2,"content":"and then we just basically have a weighted sum,"},{"from":1601.39,"to":1605.49,"location":2,"content":"convex sum, of these three different distributions of output words."},{"from":1605.49,"to":1608.12,"location":2,"content":"[NOISE] All right."},{"from":1608.12,"to":1612.69,"location":2,"content":"So I think these are mostly standard components that you've already seen,"},{"from":1612.69,"to":1614.61,"location":2,"content":"uh, you've already seen all their 
details."},{"from":1614.61,"to":1615.94,"location":2,"content":"But if you have any questions,"},{"from":1615.94,"to":1618.69,"location":2,"content":"um, about how we put it together? Yeah?"},{"from":1618.69,"to":1622.92,"location":2,"content":"[NOISE] So the output- the output has to be a word."},{"from":1622.92,"to":1626.61,"location":2,"content":"That's right. The output has to be a word and it's always either a word from the context,"},{"from":1626.61,"to":1628.47,"location":2,"content":"a word from the question or a word from the softmax."},{"from":1628.47,"to":1631.05,"location":2,"content":"[NOISE]"},{"from":1631.05,"to":1635.61,"location":2,"content":"That's- the data preprocessing I guess it's different with each task."},{"from":1635.61,"to":1638.22,"location":2,"content":"So the data preprocessing is different for each task,"},{"from":1638.22,"to":1640.95,"location":2,"content":"but we basically had to normalize everything to have"},{"from":1640.95,"to":1643.71,"location":2,"content":"the same tokenization and- and all of that. [NOISE]"},{"from":1643.71,"to":1649.77,"location":2,"content":"Uh, so do the double arrows in the encoding just represent there's a bidirectional?"},{"from":1649.77,"to":1650.13,"location":2,"content":"Yeah."},{"from":1650.13,"to":1650.78,"location":2,"content":"Okay."},{"from":1650.78,"to":1652.39,"location":2,"content":"Yeah. But the double arrows,"},{"from":1652.39,"to":1654,"location":2,"content":"uh, here are just bidirectional."},{"from":1654,"to":1658.08,"location":2,"content":"So left to right and right to left for the LSTMs. All right."},{"from":1658.08,"to":1661.05,"location":2,"content":"So what datasets, uh, are we using?"},{"from":1661.05,"to":1664.13,"location":2,"content":"Uh, I mentioned that that was a big headache in the beginning."},{"from":1664.13,"to":1666.54,"location":2,"content":"Uh, we definitely wanted to include a lot of the sequence to"},{"from":1666.54,"to":1669.72,"location":2,"content":"sequence tasks that we felt like are very,"},{"from":1669.72,"to":1674.06,"location":2,"content":"um, sort of high level and I- immediately useful, uh,"},{"from":1674.06,"to":1677.95,"location":2,"content":"and in some ways what this also shows you is that"},{"from":1677.95,"to":1683.31,"location":2,"content":"nowadays you don't have to work as much on some of the intermediate representations,"},{"from":1683.31,"to":1685.28,"location":2,"content":"uh, in NLP anymore."},{"from":1685.28,"to":1689.49,"location":2,"content":"Uh, you can just directly go for the end tasks that that real users might care about,"},{"from":1689.49,"to":1692.34,"location":2,"content":"and then have these end-to-end trainable systems,"},{"from":1692.34,"to":1694.69,"location":2,"content":"uh, that really do quite well."},{"from":1694.69,"to":1697.29,"location":2,"content":"And, uh, I've myself worked a lot on parsing."},{"from":1697.29,"to":1698.41,"location":2,"content":"And so I don't wanna, you know,"},{"from":1698.41,"to":1699.54,"location":2,"content":"say we- we don't need it."},{"from":1699.54,"to":1701.58,"location":2,"content":"There's certainly still tasks that you do need it for,"},{"from":1701.58,"to":1706.1,"location":2,"content":"but it's kind of surprising that you can just go directly to translation or summarization"},{"from":1706.1,"to":1708.87,"location":2,"content":"without having intermediate representations that"},{"from":1708.87,"to":1712.04,"location":2,"content":"were sort of very specifically 
hand-designed."},{"from":1712.04,"to":1716.31,"location":2,"content":"Um, so we had those three really interesting, uh, and hard tasks."},{"from":1716.31,"to":1718.38,"location":2,"content":"Question answering, machine translation, summarization."},{"from":1718.38,"to":1721.26,"location":2,"content":"They actually also have the three biggest datasets,"},{"from":1721.26,"to":1722.82,"location":2,"content":"uh, of all of these."},{"from":1722.82,"to":1726.96,"location":2,"content":"Uh, then we had NLI, and basically, um,"},{"from":1726.96,"to":1732.19,"location":2,"content":"all of these, uh, 10 datasets [NOISE] were, uh,"},{"from":1732.19,"to":1736.88,"location":2,"content":"publicly available, uh, and in several cases especially for translation,"},{"from":1736.88,"to":1741.03,"location":2,"content":"you could actually find much larger, uh, translation datasets,"},{"from":1741.03,"to":1743.79,"location":2,"content":"but we also tried to keep it, uh,"},{"from":1743.79,"to":1748.53,"location":2,"content":"to a- to a size where normal people that don't work in gigantic companies with huge, uh,"},{"from":1748.53,"to":1753.54,"location":2,"content":"GPU infrastructures could still run experiments, [NOISE] uh, themselves."},{"from":1753.54,"to":1756.63,"location":2,"content":"So universities and folks, uh, can still run it on."},{"from":1756.63,"to":1758.98,"location":2,"content":"Basically if you have just a single GPU,"},{"from":1758.98,"to":1761.38,"location":2,"content":"it'll probably take about a week or so, uh,"},{"from":1761.38,"to":1763.68,"location":2,"content":"to run an experiment."},{"from":1763.68,"to":1766.63,"location":2,"content":"If you have multiple GPUs on one large AWS machine,"},{"from":1766.63,"to":1769.56,"location":2,"content":"you can kind of run an experiment in a day or two."},{"from":1769.56,"to":1771.75,"location":2,"content":"And so especially for translation, right,"},{"from":1771.75,"to":1775.61,"location":2,"content":"you could get a lot more data, uh, than IWSLT."},{"from":1775.61,"to":1778.47,"location":2,"content":"And each of these, uh,"},{"from":1778.47,"to":1782.1,"location":2,"content":"communities and datasets and- and tasks has their own metric."},{"from":1782.1,"to":1784.05,"location":2,"content":"We actually tried to, in the beginning,"},{"from":1784.05,"to":1786.33,"location":2,"content":"we had a lot of discussion about how we should"},{"from":1786.33,"to":1789.87,"location":2,"content":"define the measure of success for this project."},{"from":1789.87,"to":1791.57,"location":2,"content":"Uh, it doesn't make sense, uh,"},{"from":1791.57,"to":1795.3,"location":2,"content":"to have a normalized F1 score for basically all the different tasks,"},{"from":1795.3,"to":1797.31,"location":2,"content":"but then we basically realized that"},{"from":1797.31,"to":1800.25,"location":2,"content":"these different communities have different metrics for a reason."},{"from":1800.25,"to":1805.01,"location":2,"content":"Uh, unfortunately at least all of these metrics are from 0-100 in theory."},{"from":1805.01,"to":1807.4,"location":2,"content":"Of course, in practice, you rarely ever see, uh,"},{"from":1807.4,"to":1810.27,"location":2,"content":"a translation system of a 100, uh,"},{"from":1810.27,"to":1812.28,"location":2,"content":"or even high 90s of a BLEU score,"},{"from":1812.28,"to":1814.93,"location":2,"content":"uh, or these really, really high ROUGE scores."},{"from":1814.93,"to":1818.55,"location":2,"content":"But, you know, in theory they go from 0-100, and 
so, uh,"},{"from":1818.55,"to":1824.04,"location":2,"content":"we kept basically intact the different evaluation metrics for each of these communities,"},{"from":1824.04,"to":1826.44,"location":2,"content":"and we just said we're going to sum them up."},{"from":1826.44,"to":1829.38,"location":2,"content":"And, uh, when we first talked about this,"},{"from":1829.38,"to":1831.15,"location":2,"content":"we have- had a lot of discussion,"},{"from":1831.15,"to":1832.89,"location":2,"content":"uh, with- with others also like, oh,"},{"from":1832.89,"to":1835.53,"location":2,"content":"but translation is so much more important because it's much"},{"from":1835.53,"to":1838.24,"location":2,"content":"bigger and it's a much more useful task than you still,"},{"from":1838.24,"to":1840.63,"location":2,"content":"you know, silly like pronoun resolution Winograd Schemas"},{"from":1840.63,"to":1843.15,"location":2,"content":"which only have a couple hundred training samples."},{"from":1843.15,"to":1845.73,"location":2,"content":"And so you should have weighted translation more and"},{"from":1845.73,"to":1848.31,"location":2,"content":"then literally five questions later somebody's like,"},{"from":1848.31,"to":1850.14,"location":2,"content":"\"Why didn't you weight pronoun resolution more?"},{"from":1850.14,"to":1854.37,"location":2,"content":"That is a really hard task that captures sort of common sense reasoning and, you know,"},{"from":1854.37,"to":1856.59,"location":2,"content":"the complexity of language and semantics,"},{"from":1856.59,"to":1860.34,"location":2,"content":"and unlike all this, like, statistical pattern matching [NOISE] that you do in translation.\""},{"from":1860.34,"to":1863.19,"location":2,"content":"And I was like, I used to talk to that guy [LAUGHTER] and like,"},{"from":1863.19,"to":1864.51,"location":2,"content":"uh, hopefully in the end,"},{"from":1864.51,"to":1868.05,"location":2,"content":"we'll just all agree that like it's reasonable to sum them up, uh,"},{"from":1868.05,"to":1873.64,"location":2,"content":"and of course, you also have to tackle when you run experiments in this."},{"from":1873.64,"to":1877.85,"location":2,"content":"Uh, a lot of the complexity that you have in machine learning and,"},{"from":1877.85,"to":1881.63,"location":2,"content":"you know, stuff that very few people talk about like having very skewed distributions."},{"from":1881.63,"to":1884.61,"location":2,"content":"So you have translation which has, uh,"},{"from":1884.61,"to":1886.62,"location":2,"content":"millions or hundreds of thousands of examples,"},{"from":1886.62,"to":1887.73,"location":2,"content":"and you have Winograd Schemas,"},{"from":1887.73,"to":1889.92,"location":2,"content":"uh, that only have a couple hundred."},{"from":1889.92,"to":1894.75,"location":2,"content":"How do you train that such that you don't just completely ignore the smaller dataset."},{"from":1894.75,"to":1898.35,"location":2,"content":"Uh, so we'll get to some of the optimization trickery,"},{"from":1898.35,"to":1902.01,"location":2,"content":"uh, that Nitish spent several months on in a bit."},{"from":1902.01,"to":1905.31,"location":2,"content":"But I first wanna sort of give you the first set of experiments."},{"from":1905.31,"to":1906.96,"location":2,"content":"So as you can see from all the numbers,"},{"from":1906.96,"to":1908.57,"location":2,"content":"there's a lot of experiments, uh,"},{"from":1908.57,"to":1910.69,"location":2,"content":"that we ran to even get to 
this,"},{"from":1910.69,"to":1912.96,"location":2,"content":"and so we'll walk through this, uh, quite carefully."},{"from":1912.96,"to":1916.11,"location":2,"content":"I think hopefully you'll get some ideas also for- for ablations,"},{"from":1916.11,"to":1919.8,"location":2,"content":"or experiments that you might wanna run in your, um,"},{"from":1919.8,"to":1921.21,"location":2,"content":"in your experiments and in your,"},{"from":1921.21,"to":1923.67,"location":2,"content":"uh, problem- final- final projects."},{"from":1923.67,"to":1925.29,"location":2,"content":"So what are we looking at here?"},{"from":1925.29,"to":1927.4,"location":2,"content":"So basically, uh, on the left side,"},{"from":1927.4,"to":1928.77,"location":2,"content":"we have single task performance."},{"from":1928.77,"to":1933.47,"location":2,"content":"So here, each number comes from its different model that was trained,"},{"from":1933.47,"to":1936.33,"location":2,"content":"um, separately on just one task."},{"from":1936.33,"to":1942.54,"location":2,"content":"Uh, each row- each column here is the same architecture, uh,"},{"from":1942.54,"to":1943.93,"location":2,"content":"and [NOISE] on the right side here,"},{"from":1943.93,"to":1945.43,"location":2,"content":"we basically have, uh,"},{"from":1945.43,"to":1951.16,"location":2,"content":"for each column is basically the same architecture and the same exact model."},{"from":1951.16,"to":1954.67,"location":2,"content":"So here, we have four different models and here, uh,"},{"from":1954.67,"to":1957.16,"location":2,"content":"we have 40 different models,"},{"from":1957.16,"to":1960.11,"location":2,"content":"and each column again is the same architecture."},{"from":1960.11,"to":1961.72,"location":2,"content":"And so the simplest, uh,"},{"from":1961.72,"to":1964.62,"location":2,"content":"first column here is just a standard sequence to sequence"},{"from":1964.62,"to":1968.28,"location":2,"content":"model with very few bells and whistles and some pointers,"},{"from":1968.28,"to":1969.96,"location":2,"content":"but nothing sort of major."},{"from":1969.96,"to":1971.27,"location":2,"content":"It's pretty deep, you know,"},{"from":1971.27,"to":1973.55,"location":2,"content":"stack bidirectional LSTM skip connections,"},{"from":1973.55,"to":1977.78,"location":2,"content":"all the standard good well-tuned stuff for sequence to sequence models."},{"from":1977.78,"to":1980.94,"location":2,"content":"And, uh, then we added self-attention."},{"from":1980.94,"to":1983.4,"location":2,"content":"Um, this- this sort of, uh,"},{"from":1983.4,"to":1986.31,"location":2,"content":"basically, uh, transformer layers."},{"from":1986.31,"to":1988.11,"location":2,"content":"[NOISE] Then we have this co-attention layer of"},{"from":1988.11,"to":1990.22,"location":2,"content":"the outer products that we mentioned in the beginning,"},{"from":1990.22,"to":1992.71,"location":2,"content":"and then we also added the question pointer."},{"from":1992.71,"to":1998.33,"location":2,"content":"So having the ability to point to a word in a question."},{"from":1998.33,"to":2001.67,"location":2,"content":"All right. Any questions about this table?"},{"from":2001.67,"to":2003.32,"location":2,"content":"We'll dig into some of the details."},{"from":2003.32,"to":2005.09,"location":2,"content":"Uh, okay. 
Well, we'll dig into"},{"from":2005.09,"to":2007.76,"location":2,"content":"the details first and then maybe you can think of some questions."},{"from":2007.76,"to":2009.83,"location":2,"content":"So let's analyze, uh,"},{"from":2009.83,"to":2012.74,"location":2,"content":"what's going on in this table because there are a lot of numbers, uh,"},{"from":2012.74,"to":2016.51,"location":2,"content":"and you really want to carefully analyze and sort of distinguish."},{"from":2016.51,"to":2017.89,"location":2,"content":"I think my first, uh,"},{"from":2017.89,"to":2020.59,"location":2,"content":"observation was, wow, we can have a single architecture."},{"from":2020.59,"to":2023.17,"location":2,"content":"Like, even, even this is not quite what we want, right?"},{"from":2023.17,"to":2024.54,"location":2,"content":"We want a single model."},{"from":2024.54,"to":2026.14,"location":2,"content":"But even this kind of showed us, wow,"},{"from":2026.14,"to":2031.43,"location":2,"content":"you can have a single architecture that actually does really well and somewhat randomly,"},{"from":2031.43,"to":2033.92,"location":2,"content":"in some cases, it actually had gotten state-of-the-art results."},{"from":2033.92,"to":2036.02,"location":2,"content":"So Wiki SQL, for instance,"},{"from":2036.02,"to":2039.2,"location":2,"content":"this architecture had the best model"},{"from":2039.2,"to":2042.24,"location":2,"content":"to translate natural language English questions into SQL queries,"},{"from":2042.24,"to":2045.53,"location":2,"content":"which was a surprise to us because it is the ninth dataset."},{"from":2045.53,"to":2048.95,"location":2,"content":"It was really not like a priority for us and when we designed"},{"from":2048.95,"to":2052.97,"location":2,"content":"the model and thought about how to generate words and pointer mechanisms and so on."},{"from":2052.97,"to":2056.39,"location":2,"content":"We just kind of had the standard context of SQL words"},{"from":2056.39,"to":2059.99,"location":2,"content":"and we asked the question what's the translation to SQL, and then, uh,"},{"from":2059.99,"to":2064.79,"location":2,"content":"somewhat surprisingly to us this particular architecture had the state-of-the-art, uh,"},{"from":2064.79,"to":2067.82,"location":2,"content":"on SQL generation and bunch of folks in that community kind"},{"from":2067.82,"to":2070.86,"location":2,"content":"of picked it up more quickly because it had state-of-the-art."},{"from":2070.86,"to":2072.59,"location":2,"content":"And that's- uh, unfortunately,"},{"from":2072.59,"to":2074.91,"location":2,"content":"it doesn't have that many other state-of-the-art numbers, uh,"},{"from":2074.91,"to":2076.4,"location":2,"content":"which is why it's harder, uh,"},{"from":2076.4,"to":2077.75,"location":2,"content":"it's actually a much harder task."},{"from":2077.75,"to":2080.2,"location":2,"content":"And what you also observe is that,"},{"from":2080.2,"to":2082.32,"location":2,"content":"uh, in several of the cases, uh,"},{"from":2082.32,"to":2084.08,"location":2,"content":"using the multitask model,"},{"from":2084.08,"to":2086.64,"location":2,"content":"so having a single model for all the 10 tasks,"},{"from":2086.64,"to":2088.88,"location":2,"content":"uh, actually hurts performance at first."},{"from":2088.88,"to":2092.12,"location":2,"content":"And this is also something you rarely read in papers because papers"},{"from":2092.12,"to":2095.21,"location":2,"content":"have a strong selection bias to only publish positive 
results."},{"from":2095.21,"to":2100.31,"location":2,"content":"Uh, and when you look at most transfer learning and multitask learning papers,"},{"from":2100.31,"to":2104.66,"location":2,"content":"they're sort of an outside of the actual model consideration of like,"},{"from":2104.66,"to":2109.1,"location":2,"content":"well, let's only combine tasks that we know will work well with one another."},{"from":2109.1,"to":2111.05,"location":2,"content":"And if they don't work and hurt performance,"},{"from":2111.05,"to":2113.28,"location":2,"content":"then we'd just exclude them from our experiments."},{"from":2113.28,"to":2116.61,"location":2,"content":"And so you don't see many negative task results, uh,"},{"from":2116.61,"to":2120.22,"location":2,"content":"in the literature and there are a few papers here and there that, uh,"},{"from":2120.22,"to":2124.91,"location":2,"content":"study basically the opposite side of transfer learning and that is,"},{"from":2124.91,"to":2128.32,"location":2,"content":"uh, catastrophic interference and catastrophic forgetting."},{"from":2128.32,"to":2132.11,"location":2,"content":"So interference is when you train two different tasks in the same model,"},{"from":2132.11,"to":2135.16,"location":2,"content":"and to interfere with one another next, you hurt each other's performance."},{"from":2135.16,"to":2137.96,"location":2,"content":"And catastrophic forgetting is if you train continually"},{"from":2137.96,"to":2141.3,"location":2,"content":"your first train in one task then you train on a second task,"},{"from":2141.3,"to":2142.89,"location":2,"content":"people used to think,"},{"from":2142.89,"to":2144.08,"location":2,"content":"\"Oh, well, you know,"},{"from":2144.08,"to":2145.79,"location":2,"content":"basically the first task will be completely"},{"from":2145.79,"to":2148.97,"location":2,"content":"forgotten,\" and you just work well on the second task."},{"from":2148.97,"to":2152.75,"location":2,"content":"If you train neural networks sort of in a sequential way one task and then"},{"from":2152.75,"to":2156.85,"location":2,"content":"another and somewhat surprisingly, uh,"},{"from":2156.85,"to":2159.16,"location":2,"content":"we- we found that things aren't actually"},{"from":2159.16,"to":2161.93,"location":2,"content":"catastrophically being forgotten in these models,"},{"from":2161.93,"to":2164.41,"location":2,"content":"turns out that if you train them sequentially and"},{"from":2164.41,"to":2167.07,"location":2,"content":"you add a little bit of the original to the first task,"},{"from":2167.07,"to":2168.76,"location":2,"content":"it comes back very, very quickly."},{"from":2168.76,"to":2170.66,"location":2,"content":"So while the performance is really bad,"},{"from":2170.66,"to":2172.91,"location":2,"content":"you can get to the really good performance very,"},{"from":2172.91,"to":2174.47,"location":2,"content":"very quickly in very few iterations."},{"from":2174.47,"to":2178.11,"location":2,"content":"So but it's one of the many interesting sort of tidbits that we found,"},{"from":2178.11,"to":2180.91,"location":2,"content":"uh, in the course of this that we haven't even published yet. 
All right."},{"from":2180.91,"to":2184.05,"location":2,"content":"So, uh, focusing on, uh,"},{"from":2184.05,"to":2186.56,"location":2,"content":"the transformer layers here we basically find transformers"},{"from":2186.56,"to":2189.28,"location":2,"content":"do help the original sequence to sequence model a lot."},{"from":2189.28,"to":2193.41,"location":2,"content":"So if you tune them carefully and you combine them with, uh,"},{"from":2193.41,"to":2196.24,"location":2,"content":"some bidirectional LSTMs and so on, uh,"},{"from":2196.24,"to":2198.41,"location":2,"content":"they were very helpful and improved, uh,"},{"from":2198.41,"to":2201.8,"location":2,"content":"across a bunch of different datasets, in some cases quite significantly."},{"from":2201.8,"to":2206.39,"location":2,"content":"Another observation is question-answering and semantic role labeling,"},{"from":2206.39,"to":2209.66,"location":2,"content":"uh, actually can predict each other's performance quite well."},{"from":2209.66,"to":2211.67,"location":2,"content":"If one works well, the other works well,"},{"from":2211.67,"to":2213.14,"location":2,"content":"uh, and- and vice-versa."},{"from":2213.14,"to":2214.4,"location":2,"content":"If they don't work well,"},{"from":2214.4,"to":2216.59,"location":2,"content":"uh, both of them don't work very well."},{"from":2216.59,"to":2220.85,"location":2,"content":"Um, and it's also interesting because both of those tasks have different questions for,"},{"from":2220.85,"to":2224.07,"location":2,"content":"uh, every training example."},{"from":2224.07,"to":2227.78,"location":2,"content":"Pointing. Uh, so the question pointing,"},{"from":2227.78,"to":2229.52,"location":2,"content":"uh, is super important."},{"from":2229.52,"to":2231.7,"location":2,"content":"Uh, we actually have in some cases, uh,"},{"from":2231.7,"to":2233.91,"location":2,"content":"twice the performance even for,"},{"from":2233.91,"to":2235.57,"location":2,"content":"and this is kind of surprising to us,"},{"from":2235.57,"to":2238.7,"location":2,"content":"a simple classification task where you could just have a standard Softmax."},{"from":2238.7,"to":2242.64,"location":2,"content":"But instead of saying you have a Softmax of entailment, contradiction, and so on,"},{"from":2242.64,"to":2245.01,"location":2,"content":"you just basically, uh,"},{"from":2245.01,"to":2248.01,"location":2,"content":"point to the word entailment in the question."},{"from":2248.01,"to":2252.05,"location":2,"content":"And that was also the case for Winograd Schemas that also benefited a lot,"},{"from":2252.05,"to":2254,"location":2,"content":"uh, from this pointer mechanism."},{"from":2254,"to":2256.19,"location":2,"content":"[NOISE]"},{"from":2256.19,"to":2256.88,"location":2,"content":"Can you explain that?"},{"from":2256.88,"to":2259.49,"location":2,"content":"Sure. Um, can we explain it? 
Why-"},{"from":2259.49,"to":2261.47,"location":2,"content":"[inaudible]"},{"from":2261.47,"to":2262.76,"location":2,"content":"Why does it help so much?"},{"from":2262.76,"to":2264.98,"location":2,"content":"Um, in some ways,"},{"from":2264.98,"to":2267.86,"location":2,"content":"I think partly is the whole architecture"},{"from":2267.86,"to":2271.16,"location":2,"content":"has been gotten- has gotten better and better at pointing."},{"from":2271.16,"to":2273.32,"location":2,"content":"And part of the reason we actually do very,"},{"from":2273.32,"to":2274.73,"location":2,"content":"very poorly in translation,"},{"from":2274.73,"to":2279.02,"location":2,"content":"which is the only task that hurt in the- our first experiments a lot, uh,"},{"from":2279.02,"to":2282.5,"location":2,"content":"in the multitask setting is that that is the only task that now has to generate,"},{"from":2282.5,"to":2285.44,"location":2,"content":"uh, results from a completely separate Softmax,"},{"from":2285.44,"to":2287.66,"location":2,"content":"whereas the rest of the architecture got really,"},{"from":2287.66,"to":2292.53,"location":2,"content":"really good at pointing to things to answer questions, any kind of question."},{"from":2292.53,"to":2295.55,"location":2,"content":"Uh, and so but in some ways,"},{"from":2295.55,"to":2297.56,"location":2,"content":"I think that is one explanation,"},{"from":2297.56,"to":2299.72,"location":2,"content":"but I- I don't think it's- it's all of it."},{"from":2299.72,"to":2309.01,"location":2,"content":"I think we still need to figure out more why this happens. All right."},{"from":2309.01,"to":2312.2,"location":2,"content":"Now, multitask learning is the most"},{"from":2312.2,"to":2315.47,"location":2,"content":"helpful when it comes to zero-shot and I'm actually very excited about that."},{"from":2315.47,"to":2319.84,"location":2,"content":"So this is a zero-shot relation extraction where you have different kinds of, uh,"},{"from":2319.84,"to":2322.43,"location":2,"content":"relations that you might wanna extract and you might have never"},{"from":2322.43,"to":2325.55,"location":2,"content":"seen like the student-teacher relationship that you're trying"},{"from":2325.55,"to":2327.86,"location":2,"content":"to identify in a certain context or"},{"from":2327.86,"to":2331.74,"location":2,"content":"a product company relationship or something like that."},{"from":2331.74,"to":2335.48,"location":2,"content":"And so, uh, that one actually, uh,"},{"from":2335.48,"to":2338.18,"location":2,"content":"benefited a lot and almost got twice, uh,"},{"from":2338.18,"to":2340.28,"location":2,"content":"as high in terms of the accuracy, uh,"},{"from":2340.28,"to":2342.38,"location":2,"content":"when you learned it with everything else."},{"from":2342.38,"to":2344.36,"location":2,"content":"So these were questions, it's never seen before,"},{"from":2344.36,"to":2346.26,"location":2,"content":"relations that it's never seen before,"},{"from":2346.26,"to":2348.72,"location":2,"content":"and it got twice as good, uh,"},{"from":2348.72,"to":2353.21,"location":2,"content":"and benefited a lot especially from having seen other kinds of questions."},{"from":2353.21,"to":2356.87,"location":2,"content":"And in some ways, we have to give a lot of credit to SQuAD too,"},{"from":2356.87,"to":2358.89,"location":2,"content":"uh, because SQuAD as a dataset,"},{"from":2358.89,"to":2364.76,"location":2,"content":"uh, kind of pushed people into thinking about pointers as a mechanism to generate 
answers."},{"from":2364.76,"to":2368.75,"location":2,"content":"And pointers, we kind of see them like as a given and they don't get that much credit,"},{"from":2368.75,"to":2373.53,"location":2,"content":"but they allow you to predict answers that you've never seen before at training time."},{"from":2373.53,"to":2376.04,"location":2,"content":"To generate words, you've never seen before at training time,"},{"from":2376.04,"to":2379.85,"location":2,"content":"which is actually quite- quite amazing. All right."},{"from":2379.85,"to":2383.09,"location":2,"content":"Now, the main observation though"},{"from":2383.09,"to":2386.81,"location":2,"content":"here is that you still if you had an Oracle that would tell you"},{"from":2386.81,"to":2390.28,"location":2,"content":"exactly which task you're currently in"},{"from":2390.28,"to":2394.68,"location":2,"content":"and you would be perfectly kind of separating these into 10 different models,"},{"from":2394.68,"to":2398.95,"location":2,"content":"maybe they're all the same architecture but there's still 10 different models, then, uh,"},{"from":2398.95,"to":2402.41,"location":2,"content":"you would actually still do slightly better,"},{"from":2402.41,"to":2406.53,"location":2,"content":"uh, than the first version of this multitask learning model."},{"from":2406.53,"to":2409.07,"location":2,"content":"And that is largely because we"},{"from":2409.07,"to":2412.43,"location":2,"content":"chose to include a bunch of different tasks that have nothing to do"},{"from":2412.43,"to":2415.13,"location":2,"content":"with one another and we wanted the community to start"},{"from":2415.13,"to":2418.31,"location":2,"content":"thinking about tackling catastrophic interference, right?"},{"from":2418.31,"to":2421.68,"location":2,"content":"If you learn like a new language or, you know,"},{"from":2421.68,"to":2424.67,"location":2,"content":"you learn how to understand social media on Twitter,"},{"from":2424.67,"to":2426.86,"location":2,"content":"you don't replace all your language,"},{"from":2426.86,"to":2428.82,"location":2,"content":"uh, you know, in- in your brain."},{"from":2428.82,"to":2430.82,"location":2,"content":"You have one brain, it keeps getting smarter,"},{"from":2430.82,"to":2432.07,"location":2,"content":"you keep learning new skills,"},{"from":2432.07,"to":2435.14,"location":2,"content":"even when that skills that are new to you are very,"},{"from":2435.14,"to":2436.52,"location":2,"content":"very different from old skills."},{"from":2436.52,"to":2440.42,"location":2,"content":"So in some ways we may have made our lives too hard,"},{"from":2440.42,"to":2441.77,"location":2,"content":"and now we're actually thinking, okay,"},{"from":2441.77,"to":2444.62,"location":2,"content":"maybe if you wanna publish a nicer paper on multitask learning,"},{"from":2444.62,"to":2446.81,"location":2,"content":"we'll just look at all the tasks that do help each other,"},{"from":2446.81,"to":2448.88,"location":2,"content":"and then we'll just, you know, have groups of tasks,"},{"from":2448.88,"to":2451.45,"location":2,"content":"and then I can very quickly publish,"},{"from":2451.45,"to":2454.01,"location":2,"content":"uh, some, some nice state-of-the-art papers."},{"from":2454.01,"to":2457.37,"location":2,"content":"But basically here, uh, we're still, uh,"},{"from":2457.37,"to":2463.91,"location":2,"content":"quite significantly away in the decaScore between 10 different models and a single model."},{"from":2463.91,"to":2466.28,"location":2,"content":"Now, this of 
course is kind of an oracle score,"},{"from":2466.28,"to":2469.8,"location":2,"content":"that's why we put it in parentheses because you don't actually have this oracle."},{"from":2469.8,"to":2471.26,"location":2,"content":"And in some cases,"},{"from":2471.26,"to":2473.78,"location":2,"content":"it's quite easy to build an almost perfect classifier."},{"from":2473.78,"to":2476.61,"location":2,"content":"So, you know, separating what is the summary"},{"from":2476.61,"to":2479.81,"location":2,"content":"based on that question and what is the translation from English to German,"},{"from":2479.81,"to":2481.61,"location":2,"content":"you can do with almost 100 percent accuracy."},{"from":2481.61,"to":2485.09,"location":2,"content":"Uh, but, uh, SQuAD, question-answering,"},{"from":2485.09,"to":2486.66,"location":2,"content":"and zero-shot relation extraction,"},{"from":2486.66,"to":2489.57,"location":2,"content":"and question-answering as a semantic role labeling,"},{"from":2489.57,"to":2493.22,"location":2,"content":"those are actually easily confused in terms of how"},{"from":2493.22,"to":2497.33,"location":2,"content":"to generate the answers and you wouldn't quite know,"},{"from":2497.33,"to":2500.87,"location":2,"content":"uh, which into which model to route, uh, this."},{"from":2500.87,"to":2504.93,"location":2,"content":"So in some sense, this is kind of theoretical. All right."},{"from":2504.93,"to":2507.71,"location":2,"content":"Now, I mentioned that we have this prob- this"},{"from":2507.71,"to":2511.73,"location":2,"content":"complexity in the optimization strategy and this is one of the many,"},{"from":2511.73,"to":2515.8,"location":2,"content":"um, sort of problems that don't get that much, uh, coverage."},{"from":2515.8,"to":2517.53,"location":2,"content":"But when you have a very,"},{"from":2517.53,"to":2519.78,"location":2,"content":"uh, imbalanced or skewed dataset,"},{"from":2519.78,"to":2525.01,"location":2,"content":"it's easy to lose track and basically overpower the smaller dataset tasks."},{"from":2525.01,"to":2527.51,"location":2,"content":"And so, uh, the first, uh,"},{"from":2527.51,"to":2530.78,"location":2,"content":"simplest training- we actually tried a ton of different training strategies,"},{"from":2530.78,"to":2533.6,"location":2,"content":"but in the end, this fully joint one worked quite well."},{"from":2533.6,"to":2538.16,"location":2,"content":"But actually promised to ask go wait for questions, uh, on this table."},{"from":2538.16,"to":2540.68,"location":2,"content":"So any questions on all these results so far? Yeah?"},{"from":2540.68,"to":2544.55,"location":2,"content":"So, uh, [NOISE] since you mentioned that if you had"},{"from":2544.55,"to":2546.74,"location":2,"content":"an oracle that will tell you which task it is and"},{"from":2546.74,"to":2549.22,"location":2,"content":"you have two better ways having 10 different ones."},{"from":2549.22,"to":2552.44,"location":2,"content":"So really try training a model on"},{"from":2552.44,"to":2555.71,"location":2,"content":"like data meaning what task is interested in this particular version?"},{"from":2555.71,"to":2558.31,"location":2,"content":"We did. 
And so it- it confused, you know,"},{"from":2558.31,"to":2562.24,"location":2,"content":"SQuAD and- and basically the other,"},{"from":2562.24,"to":2567.26,"location":2,"content":"uh, two types of problems that were also cast as question answering."},{"from":2567.26,"to":2569.35,"location":2,"content":"So it confused those."},{"from":2569.35,"to":2573.49,"location":2,"content":"Um, but then a lot of the others, it was able to like, very perfectly do it."},{"from":2573.49,"to":2576.19,"location":2,"content":"But then you basically, as soon as you,"},{"from":2576.19,"to":2581.11,"location":2,"content":"uh, were to try to then build a whole model and get a decaScore,"},{"from":2581.11,"to":2585.39,"location":2,"content":"if your- if your classifier is even like 90 percent accurate,"},{"from":2585.39,"to":2588.53,"location":2,"content":"you basically multiply this by 0.9 and"},{"from":2588.53,"to":2591.68,"location":2,"content":"you get dinged so hard that it- it's not competitive anymore."},{"from":2591.68,"to":2594.35,"location":2,"content":"So it is actually hard if you try to just build"},{"from":2594.35,"to":2597.08,"location":2,"content":"that whole system and keep adding sort of if-then-else statements,"},{"from":2597.08,"to":2598.88,"location":2,"content":"uh, to make that, uh,"},{"from":2598.88,"to":2600.89,"location":2,"content":"into sort of a single system. Yeah?"},{"from":2600.89,"to":2604.09,"location":2,"content":"Have you tried telling the model what kind of task it's doing,"},{"from":2604.09,"to":2607.33,"location":2,"content":"just giving that indicator of the kind of task quickly?"},{"from":2607.33,"to":2609.01,"location":2,"content":"I mean, in some ways,"},{"from":2609.01,"to":2610.12,"location":2,"content":"we did in this case,"},{"from":2610.12,"to":2613.36,"location":2,"content":"because we only trained each model separately on it."},{"from":2613.36,"to":2614.28,"location":2,"content":"[inaudible]"},{"from":2614.28,"to":2616.91,"location":2,"content":"Um, only through the question."},{"from":2616.91,"to":2619.18,"location":2,"content":"Yeah. 
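The back-of-the-envelope penalty being described: if you route examples to 10 separate models with an imperfect task classifier, and a misrouted example contributes roughly zero to its metric, the expected decaScore is roughly the oracle score times the routing accuracy. A tiny illustration with assumed numbers:

```python
# Illustrative numbers only; the 618 oracle figure comes up later in the talk,
# and the zero-score-when-misrouted assumption is a simplification.
oracle_deca_score = 618.0      # sum over the 10 single-task models with perfect routing
routing_accuracy = 0.90        # hypothetical task-classifier accuracy

expected_routed_score = routing_accuracy * oracle_deca_score
print(expected_routed_score)   # 556.2, i.e. a hit of roughly 62 points
```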
Because I was thinking the"},{"from":2619.18,"to":2622.76,"location":2,"content":"um, maybe it's not that important that the model figure out what we want it to"},{"from":2622.76,"to":2624.97,"location":2,"content":"do in- in a practical [NOISE] application"},{"from":2624.97,"to":2627.56,"location":2,"content":"if we could just tell it what we want it to do right now?"},{"from":2627.56,"to":2629.42,"location":2,"content":"In some cases, you could tell."},{"from":2629.42,"to":2631.43,"location":2,"content":"Uh, so the question is sort of,"},{"from":2631.43,"to":2633.26,"location":2,"content":"uh, and even in the multitask setting,"},{"from":2633.26,"to":2636.09,"location":2,"content":"you could have like an extra kind of token to say,"},{"from":2636.09,"to":2638.15,"location":2,"content":"\"Now, you're doing summarization."},{"from":2638.15,"to":2639.95,"location":2,"content":"So, and that's another input.\""},{"from":2639.95,"to":2641.26,"location":2,"content":"Uh, in some ways,"},{"from":2641.26,"to":2643.61,"location":2,"content":"whether you have a summarization token,"},{"from":2643.61,"to":2645.65,"location":2,"content":"uh, or you ask what is the summary?"},{"from":2645.65,"to":2648.13,"location":2,"content":"It actually I don't think makes that big of a difference."},{"from":2648.13,"to":2651.19,"location":2,"content":"It's just now you can query this model in"},{"from":2651.19,"to":2653.14,"location":2,"content":"very natural language rather than having to know"},{"from":2653.14,"to":2655.6,"location":2,"content":"kind of a special token to, to query the model."},{"from":2655.6,"to":2659.71,"location":2,"content":"Uh, and we'll see actually in a couple of slides that the model is not confused,"},{"from":2659.71,"to":2662.86,"location":2,"content":"uh, when it comes to how to generate the answers."},{"from":2662.86,"to":2664.71,"location":2,"content":"So, for every of the task,"},{"from":2664.71,"to":2668.66,"location":2,"content":"it knows very clearly how to generate the words to get to the right,"},{"from":2668.66,"to":2670.7,"location":2,"content":"to get to, you know, a reasonably accurate answer."},{"from":2670.7,"to":2676.52,"location":2,"content":"[NOISE] Um, in the- [inaudible] does the model"},{"from":2676.52,"to":2682.58,"location":2,"content":"see all of the data and then [inaudible] that class or does it only include a [inaudible]?"},{"from":2682.58,"to":2685.4,"location":2,"content":"Oh, great question. 
So, how do we train, uh, the single task models?"},{"from":2685.4,"to":2687.98,"location":2,"content":"They're only trained on that dataset."},{"from":2687.98,"to":2691.7,"location":2,"content":"So, the SQuAD number here is just a single model that has only seen SQuAD training."},{"from":2691.7,"to":2697.25,"location":2,"content":"[NOISE] So, your point about the,"},{"from":2697.25,"to":2699.05,"location":2,"content":"um, the pointer exception for the, uh,"},{"from":2699.05,"to":2702.31,"location":2,"content":"[inaudible] generally more helpful than [inaudible]?"},{"from":2702.31,"to":2704.83,"location":2,"content":"Somewhat surprisingly, even, ah,"},{"from":2704.83,"to":2706.32,"location":2,"content":"in the case here, uh,"},{"from":2706.32,"to":2709.07,"location":2,"content":"where we had, um, this is MultiNLI,"},{"from":2709.07,"to":2710.69,"location":2,"content":"this particular model, I mean,"},{"from":2710.69,"to":2712.55,"location":2,"content":"if you just have the standard sequence to sequence,"},{"from":2712.55,"to":2714.03,"location":2,"content":"it just generates, you know,"},{"from":2714.03,"to":2716.66,"location":2,"content":"also with a softmax, uh, that label."},{"from":2716.66,"to":2718.64,"location":2,"content":"So in that sense, it's quite similar."},{"from":2718.64,"to":2723.65,"location":2,"content":"Uh, but yeah, it was actually better able to just point, which actually led us, uh,"},{"from":2723.65,"to":2727.73,"location":2,"content":"for a while into thinking about maybe we should have a project where we just say point to"},{"from":2727.73,"to":2732.13,"location":2,"content":"all the things and just get rid of softmax classifiers forever."},{"from":2732.13,"to":2735.89,"location":2,"content":"Um, the problem is when you then try to do translation also,"},{"from":2735.89,"to":2737.21,"location":2,"content":"it's like okay wow,"},{"from":2737.21,"to":2738.39,"location":2,"content":"what do you point to,"},{"from":2738.39,"to":2740.42,"location":2,"content":"and then you kind of pre-train it and do"},{"from":2740.42,"to":2743.75,"location":2,"content":"some alignment and it gets kinda very large and you point to a lot of different like,"},{"from":2743.75,"to":2746.36,"location":2,"content":"you may have like- like tens of thousands of potential candidates."},{"from":2746.36,"to":2749.54,"location":2,"content":"So we kinda discarded it as like a single unifying model for all the things,"},{"from":2749.54,"to":2751.89,"location":2,"content":"but you could point to a lot of different,"},{"from":2751.89,"to":2752.99,"location":2,"content":"like a lot of these tasks,"},{"from":2752.99,"to":2754.28,"location":2,"content":"you could actually point to and"},{"from":2754.28,"to":2761.44,"location":2,"content":"I think it's another interesting side project that could spawn from this, yeah."},{"from":2761.44,"to":2763.74,"location":2,"content":"Just a quick question to how,"},{"from":2763.74,"to":2766.91,"location":2,"content":"how sensitive [inaudible] how sensitive, uh,"},{"from":2766.91,"to":2769.85,"location":2,"content":"the individual components [inaudible] was when you"},{"from":2769.85,"to":2773.24,"location":2,"content":"slightly perturb the relative weights of them in the loss function?"},{"from":2773.24,"to":2776.86,"location":2,"content":"So, we -- the question is, uh, how, um,"},{"from":2776.86,"to":2779.8,"location":2,"content":"sensitive were the tasks if we were to,"},{"from":2779.8,"to":2782.82,"location":2,"content":"um, add weights to the different 
tasks?"},{"from":2782.82,"to":2787.49,"location":2,"content":"We [NOISE] did in the optimization kind of did a lot of trickery on"},{"from":2787.49,"to":2792.08,"location":2,"content":"how to train it but we never said this task only matters like 0.5 or something."},{"from":2792.08,"to":2794.93,"location":2,"content":"So, we didn't do that analysis. Yeah?"},{"from":2794.93,"to":2797.99,"location":2,"content":"Co-attention seems to be a burden a little bit."},{"from":2797.99,"to":2799.07,"location":2,"content":"In some cases, yeah."},{"from":2799.07,"to":2804.43,"location":2,"content":"Is it the [inaudible] co-attention and order but no co-attention or is that kind of like,"},{"from":2804.43,"to":2807.32,"location":2,"content":"\"Oh, you already saw the test data so, like, you can't use these.\""},{"from":2807.32,"to":2809.05,"location":2,"content":"I mean, these are all dep sets."},{"from":2809.05,"to":2813.56,"location":2,"content":"Um, but it's, you could definitely do even more architecture engineering."},{"from":2813.56,"to":2815.9,"location":2,"content":"In fact, there's this whole field which I don't think"},{"from":2815.9,"to":2818.69,"location":2,"content":"you gotten to, right, neural architecture search?"},{"from":2818.69,"to":2822.51,"location":2,"content":"Yeah. So like you can actually combine your reinforcement learning, um,"},{"from":2822.51,"to":2825.7,"location":2,"content":"and you say the action space for the reinforcement learning agent"},{"from":2825.7,"to":2827.36,"location":2,"content":"are trying to have a couple of"},{"from":2827.36,"to":2829.58,"location":2,"content":"different modules of neural nets like maybe you want to have"},{"from":2829.58,"to":2831.18,"location":2,"content":"like a CNN layer and then like"},{"from":2831.18,"to":2834.32,"location":2,"content":"a memory layer and then an LSTM layer and maybe it's bidirectional and you"},{"from":2834.32,"to":2839.47,"location":2,"content":"basically let a reinforcement learning agent figure out all of these decisions."},{"from":2839.47,"to":2842.86,"location":2,"content":"Uh, so I think it would be phenomenal to try to apply"},{"from":2842.86,"to":2845.21,"location":2,"content":"neural architecture search not to what's"},{"from":2845.21,"to":2847.79,"location":2,"content":"usually being done which is we already know how to do image classification,"},{"from":2847.79,"to":2850.72,"location":2,"content":"we'll just do it slightly better with NAS, neural architecture search."},{"from":2850.72,"to":2851.93,"location":2,"content":"But we actually try to find"},{"from":2851.93,"to":2854.81,"location":2,"content":"a single architecture for multi-task learning which we don't know."},{"from":2854.81,"to":2858.62,"location":2,"content":"The problem of course is that already getting to these."},{"from":2858.62,"to":2861.47,"location":2,"content":"All these numbers took a lot of compute time and a lot of"},{"from":2861.47,"to":2864.88,"location":2,"content":"fiddling around with stuff and it is, I can,"},{"from":2864.88,"to":2868.99,"location":2,"content":"I can only give you sort of an idea of like how often we'd say,"},{"from":2868.99,"to":2870.89,"location":2,"content":"\"Oh man, we got like this really amazing result"},{"from":2870.89,"to":2873.11,"location":2,"content":"in this task but it needed this learning rate.\""},{"from":2873.11,"to":2875,"location":2,"content":"And it turns out the same model,"},{"from":2875,"to":2877.1,"location":2,"content":"same set of hyperparameters 
everything,"},{"from":2877.1,"to":2881.55,"location":2,"content":"but this other task to get to good performance needed a much higher learning rate."},{"from":2881.55,"to":2885.65,"location":2,"content":"And now, you try to combine those two tasks only together and you're like,"},{"from":2885.65,"to":2887.34,"location":2,"content":"\"Okay, how do you choose your learning rate now?\""},{"from":2887.34,"to":2889.07,"location":2,"content":"You choose the, you know,"},{"from":2889.07,"to":2891.65,"location":2,"content":"if you choose the task, the learning rate from the task that is, you know,"},{"from":2891.65,"to":2893.78,"location":2,"content":"bigger than the smaller tasks just doesn't work"},{"from":2893.78,"to":2895.97,"location":2,"content":"well at all because it needed this higher learning rate."},{"from":2895.97,"to":2899.41,"location":2,"content":"If you'd use the higher learning rate that the smaller task and the smaller dataset,"},{"from":2899.41,"to":2903.99,"location":2,"content":"uh, did really well on then the large one just overfits and doesn't work well either."},{"from":2903.99,"to":2905.96,"location":2,"content":"If you try to do the average, neither of the two work."},{"from":2905.96,"to":2909.56,"location":2,"content":"Like there's a lot of complexity in trying to do multitask learning."},{"from":2909.56,"to":2915.1,"location":2,"content":"That's why, that's why it's such an interesting I think, uh, research challenge."},{"from":2915.1,"to":2918.41,"location":2,"content":"All right, any more questions about this first set of results?"},{"from":2918.41,"to":2919.78,"location":2,"content":"They get, they will get better."},{"from":2919.78,"to":2922.27,"location":2,"content":"We, we have, we have had some ideas already,"},{"from":2922.27,"to":2927.25,"location":2,"content":"uh, on, on how to improve them."},{"from":2927.25,"to":2929.78,"location":2,"content":"All right. 
So, uh,"},{"from":2929.78,"to":2931.78,"location":2,"content":"how did we actually train this whole thing?"},{"from":2931.78,"to":2934.89,"location":2,"content":"Um, we had tried a lot of different things but in the end, uh,"},{"from":2934.89,"to":2938.99,"location":2,"content":"this very simple fully joint training strategy actually worked the best."},{"from":2938.99,"to":2942.8,"location":2,"content":"Uh, and that is you basically take a mini batch from each of"},{"from":2942.8,"to":2947.54,"location":2,"content":"the different tasks and you just train on that mini batch from that task."},{"from":2947.54,"to":2951.47,"location":2,"content":"So basically just going through all the 10 tasks and then round robin,"},{"from":2951.47,"to":2953.69,"location":2,"content":"uh, go through them."},{"from":2953.69,"to":2956.82,"location":2,"content":"Um, now it turns out, ah,"},{"from":2956.82,"to":2959.09,"location":2,"content":"that that does not work,"},{"from":2959.09,"to":2961.46,"location":2,"content":"uh, quite as well, uh,"},{"from":2961.46,"to":2966.05,"location":2,"content":"as another training strategy and if you look into optimization,"},{"from":2966.05,"to":2967.68,"location":2,"content":"uh, strategies in neural nets, uh,"},{"from":2967.68,"to":2969.17,"location":2,"content":"there are actually a couple of papers on"},{"from":2969.17,"to":2971.72,"location":2,"content":"so-called curriculum learning, where the idea is,"},{"from":2971.72,"to":2976.43,"location":2,"content":"you start with training your model with simple pro- simple instances of your problems."},{"from":2976.43,"to":2978.83,"location":2,"content":"So, in translation, for instance you start training with"},{"from":2978.83,"to":2981.99,"location":2,"content":"very short sentences and then you go to larger and larger,"},{"from":2981.99,"to":2984.56,"location":2,"content":"uh, sentences, uh, or longer and longer sentences."},{"from":2984.56,"to":2987.55,"location":2,"content":"Uh, now it turns out for multi-task learning,"},{"from":2987.55,"to":2989.28,"location":2,"content":"you actually want to do the opposite."},{"from":2989.28,"to":2992.05,"location":2,"content":"You wanna do anti-curriculum learning."},{"from":2992.05,"to":2995.33,"location":2,"content":"Uh, and that is you start with the hardest tasks and you iterate on"},{"from":2995.33,"to":2998.93,"location":2,"content":"those for a while and then you add the simple tasks later on."},{"from":2998.93,"to":3002.05,"location":2,"content":"And to some degree, I think this is intuitive because when"},{"from":3002.05,"to":3007.78,"location":2,"content":"you train this very gigantic and powerful model,"},{"from":3007.78,"to":3011.02,"location":2,"content":"uh, on a very simple task like"},{"from":3011.02,"to":3014.51,"location":2,"content":"sentiment and you just need to classify everything to be positive or negative."},{"from":3014.51,"to":3018.22,"location":2,"content":"You train all of these weights and you arrive at sort of, uh,"},{"from":3018.22,"to":3020.71,"location":2,"content":"local optima that are quite deep and very"},{"from":3020.71,"to":3024.37,"location":2,"content":"specific to just generating these two words and if you then try to get out of that,"},{"from":3024.37,"to":3027.43,"location":2,"content":"out of this local optimum for that very simple task"},{"from":3027.43,"to":3030.66,"location":2,"content":"and then try to generate all these other kinds of words and point to different,"},{"from":3030.66,"to":3033.93,"location":2,"content":"you know, words 
it's never seen before then SQuAD,"},{"from":3033.93,"to":3036.94,"location":2,"content":"it's very very hard to come out of that local optimum."},{"from":3036.94,"to":3040.97,"location":2,"content":"And that is sort of my intuition of why it actually makes more sense to say,"},{"from":3040.97,"to":3044.93,"location":2,"content":"\"Let's start with SQuAD and machine translation and a couple of these harder tasks."},{"from":3044.93,"to":3047.02,"location":2,"content":"We'll make the model very general purpose."},{"from":3047.02,"to":3048.91,"location":2,"content":"It has to generate a lot of different things,"},{"from":3048.91,"to":3052.24,"location":2,"content":"create a softmax, German words,"},{"from":3052.24,"to":3054.46,"location":2,"content":"it has to point to all kinds of"},{"from":3054.46,"to":3057.89,"location":2,"content":"different words and be able to parse all kinds of different Wikipedia paragraphs.\""},{"from":3057.89,"to":3061.32,"location":2,"content":"And you do that a couple of times and then once you've finished,"},{"from":3061.32,"to":3063.19,"location":2,"content":"uh, this sort of pre-training, uh,"},{"from":3063.19,"to":3069.22,"location":2,"content":"stage or anti-curriculum, then you move on and add sort of the simpler smaller tasks."},{"from":3069.22,"to":3071.59,"location":2,"content":"So [NOISE] with that, uh,"},{"from":3071.59,"to":3075.09,"location":2,"content":"relatively simple change that did take us,"},{"from":3075.09,"to":3077.45,"location":2,"content":"uh, a lot of different experiments to get to."},{"from":3077.45,"to":3080.2,"location":2,"content":"Um, we actually, uh,"},{"from":3080.2,"to":3082.05,"location":2,"content":"closed or, uh, um,"},{"from":3082.05,"to":3085.57,"location":2,"content":"went closer to closing that gap and now, um,"},{"from":3085.57,"to":3090.33,"location":2,"content":"we're only sort of, um, 14, uh, away."},{"from":3090.33,"to":3092.78,"location":2,"content":"Right, yeah, uh, 14 or so."},{"from":3092.78,"to":3095.18,"location":2,"content":"Uh, but there's still, uh,"},{"from":3095.18,"to":3097.7,"location":2,"content":"a big gap and the biggest, uh,"},{"from":3097.7,"to":3100.88,"location":2,"content":"nuisance and issue that we had was with a translation."},{"from":3100.88,"to":3102.84,"location":2,"content":"Basically, if you look at all of these,"},{"from":3102.84,"to":3104.91,"location":2,"content":"most things are kind of similar,"},{"from":3104.91,"to":3109.16,"location":2,"content":"get slightly better, um and it's sort of a toss up but then and,"},{"from":3109.16,"to":3112.13,"location":2,"content":"and roughly similar, but translation was really bad."},{"from":3112.13,"to":3113.45,"location":2,"content":"It's almost only half, uh,"},{"from":3113.45,"to":3116.42,"location":2,"content":"the performance in the multitask learning setup,"},{"from":3116.42,"to":3120.11,"location":2,"content":"and part of that is because translation was the only task that had"},{"from":3120.11,"to":3125.96,"location":2,"content":"a very large Softmax vocabulary of words that were in no other task."},{"from":3125.96,"to":3128.07,"location":2,"content":"And most of the other tasks,"},{"from":3128.07,"to":3130.43,"location":2,"content":"actually were doing really well with pointing."},{"from":3130.43,"to":3134.57,"location":2,"content":"And so, uh, my interpretation of this was that the intermediate layers,"},{"from":3134.57,"to":3136.55,"location":2,"content":"all these representations that we learned 
with"},{"from":3136.55,"to":3139.52,"location":2,"content":"bi-directional LSTMs and transformers, they got really,"},{"from":3139.52,"to":3141.88,"location":2,"content":"really good at being pointed to,"},{"from":3141.88,"to":3147.56,"location":2,"content":"like creating hidden representations that the answer module can point to very accurately."},{"from":3147.56,"to":3149.47,"location":2,"content":"And then you have this one task that is like,"},{"from":3149.47,"to":3151.09,"location":2,"content":"I don't point to almost anything,"},{"from":3151.09,"to":3154.24,"location":2,"content":"I basically just generate other words and then different vocabulary."},{"from":3154.24,"to":3157.61,"location":2,"content":"And so those hidden representations became less useful for that task."},{"from":3157.61,"to":3161.36,"location":2,"content":"And so, that was one of the insights and that led"},{"from":3161.36,"to":3165.02,"location":2,"content":"to one of the ways of trying to improve this."},{"from":3165.02,"to":3167.61,"location":2,"content":"Now, one of the interesting issues that we had is,"},{"from":3167.61,"to":3169.04,"location":2,"content":"when we improved the model,"},{"from":3169.04,"to":3171.5,"location":2,"content":"the multi-single model for all 10 tasks,"},{"from":3171.5,"to":3173.09,"location":2,"content":"a lot of times we said, well,"},{"from":3173.09,"to":3175.28,"location":2,"content":"but now we also have to go back and run"},{"from":3175.28,"to":3179.06,"location":2,"content":"10 more experiments on all the single tasks to have a proper comparison, right?"},{"from":3179.06,"to":3181.28,"location":2,"content":"Because if you tune the thing you care about,"},{"from":3181.28,"to":3184.79,"location":2,"content":"and you stop tuning the thing you wanna show you can do better than,"},{"from":3184.79,"to":3186.28,"location":2,"content":"then that's not fair."},{"from":3186.28,"to":3189.47,"location":2,"content":"Uh, so you always wanna give as much, uh,"},{"from":3189.47,"to":3193.66,"location":2,"content":"TLC and focus and experiment time to your baselines."},{"from":3193.66,"to":3198.67,"location":2,"content":"And so, uh, in some cases we actually,"},{"from":3198.67,"to":3202.41,"location":2,"content":"uh, improved some- improved something."},{"from":3202.41,"to":3206.49,"location":2,"content":"But then, we improve both the 10 separate models and our model,"},{"from":3206.49,"to":3209.09,"location":2,"content":"and some cases like the 10 separate models improved, even more."},{"from":3209.09,"to":3210.49,"location":2,"content":"So the gap got even larger."},{"from":3210.49,"to":3212.72,"location":2,"content":"It's kind of the opposite of what we wanted to show, but in general,"},{"from":3212.72,"to":3214.22,"location":2,"content":"it's better for both tests,"},{"from":3214.22,"to":3216.53,"location":2,"content":"uh, for the architecture overall."},{"from":3216.53,"to":3217.97,"location":2,"content":"So basically, we started, uh,"},{"from":3217.97,"to":3220.22,"location":2,"content":"with this fully joint training and we have"},{"from":3220.22,"to":3222.51,"location":2,"content":"this sort of set of single models that we could,"},{"from":3222.51,"to":3224.15,"location":2,"content":"in theory with some oracle,"},{"from":3224.15,"to":3225.34,"location":2,"content":"kind of just sum up, uh,"},{"from":3225.34,"to":3227.01,"location":2,"content":"in their scores, to get a decaScore."},{"from":3227.01,"to":3229.11,"location":2,"content":"So the gap started at 
23."},{"from":3229.11,"to":3233.03,"location":2,"content":"And then, uh, we basically did this anti-curriculum training,"},{"from":3233.03,"to":3235.79,"location":2,"content":"uh, which, uh, lowered the gap to 15."},{"from":3235.79,"to":3237.38,"location":2,"content":"So we're kind of excited,"},{"from":3237.38,"to":3238.76,"location":2,"content":"uh, making good progress."},{"from":3238.76,"to":3239.93,"location":2,"content":"Then we switched, uh,"},{"from":3239.93,"to":3241.88,"location":2,"content":"from GloVe and use CoVe."},{"from":3241.88,"to":3244.05,"location":2,"content":"So contextual vectors, um,"},{"from":3244.05,"to":3246.32,"location":2,"content":"which actually increased the gap a lot again."},{"from":3246.32,"to":3249.32,"location":2,"content":"So everything got better, but the 10 separate models got"},{"from":3249.32,"to":3253,"location":2,"content":"even better than the one single model that does the 10 tasks."},{"from":3253,"to":3254.65,"location":2,"content":"Um, so the gap got bigger,"},{"from":3254.65,"to":3257.14,"location":2,"content":"but everybody's performance increased."},{"from":3257.14,"to":3259.51,"location":2,"content":"So it was still overall a good thing."},{"from":3259.51,"to":3262.78,"location":2,"content":"Uh, and then, uh, we basically figured,"},{"from":3262.78,"to":3264.61,"location":2,"content":"especially with this machine translation issue,"},{"from":3264.61,"to":3266.47,"location":2,"content":"we shouldn't just pre-train on SQuAD,"},{"from":3266.47,"to":3270.1,"location":2,"content":"but we also should include machine translation in"},{"from":3270.1,"to":3274.84,"location":2,"content":"this pre-training in the beginning so the model doesn't just start learning to point."},{"from":3274.84,"to":3277.63,"location":2,"content":"Um, and that helped us, uh,"},{"from":3277.63,"to":3280.16,"location":2,"content":"to reduce the gap between the 10 separate models,"},{"from":3280.16,"to":3283.09,"location":2,"content":"Oracle, and the single model to about five points."},{"from":3283.09,"to":3284.69,"location":2,"content":"And then, uh, we basically said,"},{"from":3284.69,"to":3286.64,"location":2,"content":"okay, translation is still not that good."},{"from":3286.64,"to":3287.78,"location":2,"content":"We just keep oversampling."},{"from":3287.78,"to":3292.76,"location":2,"content":"So, every time we go through one of these round robin mini-batch sets,"},{"from":3292.76,"to":3294.74,"location":2,"content":"we just always include machine translation."},{"from":3294.74,"to":3299.27,"location":2,"content":"And that basically allowed us to then reduce the gap,"},{"from":3299.27,"to":3301.03,"location":2,"content":"uh, to just a single point."},{"from":3301.03,"to":3303.59,"location":2,"content":"So now, uh, we started, uh,"},{"from":3303.59,"to":3306.65,"location":2,"content":"couple of, several months ago, uh, at 586."},{"from":3306.65,"to":3308.96,"location":2,"content":"And now the single, uh,"},{"from":3308.96,"to":3311.33,"location":2,"content":"oracle with 10 different models,"},{"from":3311.33,"to":3312.56,"location":2,"content":"if you were to sum them up,"},{"from":3312.56,"to":3316.1,"location":2,"content":"get 618, uh, and the, you know,"},{"from":3316.1,"to":3319.99,"location":2,"content":"better contextual vectors and tuning and adding a lot more translation,"},{"from":3319.99,"to":3323.21,"location":2,"content":"and translation is still not as good as we would like it to be, uh,"},{"from":3323.21,"to":3326.53,"location":2,"content":"but now, 
several of the other tasks benefited a bunch."},{"from":3326.53,"to":3330.14,"location":2,"content":"And now we're basically one decaScore away from"},{"from":3330.14,"to":3333.74,"location":2,"content":"having a single model that does as well as 10 different ones."},{"from":3333.74,"to":3336.39,"location":2,"content":"And you can basically,"},{"from":3336.39,"to":3338.53,"location":2,"content":"you could run even more experiments,"},{"from":3338.53,"to":3341.93,"location":2,"content":"in some ways you could burn millions of dollars on AWS cost here,"},{"from":3341.93,"to":3347.18,"location":2,"content":"because most of the time we kept the hyperparameters of these different models the same."},{"from":3347.18,"to":3349.39,"location":2,"content":"Like each of these, you could also say, well,"},{"from":3349.39,"to":3352.01,"location":2,"content":"maybe this multitask model needs to have 50 more layers,"},{"from":3352.01,"to":3353.72,"location":2,"content":"or maybe 19 more layers,"},{"from":3353.72,"to":3356.22,"location":2,"content":"or maybe five more layers and maybe they should be 1000,"},{"from":3356.22,"to":3357.86,"location":2,"content":"you know, wider in their hidden dimensions."},{"from":3357.86,"to":3361.31,"location":2,"content":"And you could basically run a lot more experiments."},{"from":3361.31,"to":3363.83,"location":2,"content":"Maybe hopefully, eventually, the community jointly does that,"},{"from":3363.83,"to":3366.17,"location":2,"content":"and then we can kind of move, move towards that."},{"from":3366.17,"to":3368.48,"location":2,"content":"But we figured, okay, we're pretty close,"},{"from":3368.48,"to":3373.85,"location":2,"content":"so we moved on to some other things which maybe I'll tell you about next year."},{"from":3373.85,"to":3376.72,"location":2,"content":"[LAUGHTER] But basically, um,"},{"from":3376.72,"to":3378.98,"location":2,"content":"let's do some analysis of what happened in this project."},{"from":3378.98,"to":3382.24,"location":2,"content":"And this is kind of, I think something that I would encourage you all to do as well."},{"from":3382.24,"to":3385.46,"location":2,"content":"Like you, you can chase the numbers for a while and in some ways,"},{"from":3385.46,"to":3388.39,"location":2,"content":"you should always be skeptical about your evaluations."},{"from":3388.39,"to":3389.78,"location":2,"content":"And in some cases,"},{"from":3389.78,"to":3393.23,"location":2,"content":"you've seen- we've seen in the NLP community people"},{"from":3393.23,"to":3396.93,"location":2,"content":"like basically just optimize BLEU scores for translation for years."},{"from":3396.93,"to":3398.69,"location":2,"content":"And then somebody came out with a paper and said, well,"},{"from":3398.69,"to":3404.51,"location":2,"content":"it turns out BLEU metrics and human evaluations on how good of a translation is this,"},{"from":3404.51,"to":3406.18,"location":2,"content":"aren't actually that correlated."},{"from":3406.18,"to":3408.32,"location":2,"content":"And you're like, ah, that that sucks,"},{"from":3408.32,"to":3413,"location":2,"content":"we just spent years of our lives tuning that metric and publishing a bunch of papers."},{"from":3413,"to":3417.29,"location":2,"content":"Um, and so in some ways all of these metrics have flaws, uh, you know,"},{"from":3417.29,"to":3420.14,"location":2,"content":"root scores summarization is a super,"},{"from":3420.14,"to":3423.38,"location":2,"content":"uh, subjective kind of a 
task."},{"from":3423.38,"to":3425.47,"location":2,"content":"And summarization, for instance,"},{"from":3425.47,"to":3427.73,"location":2,"content":"when you analyze the errors, uh,"},{"from":3427.73,"to":3430.59,"location":2,"content":"you often realize that word vectors have problems too."},{"from":3430.59,"to":3432.92,"location":2,"content":"So, for instance, the word vector for Jason, John,"},{"from":3432.92,"to":3435.29,"location":2,"content":"and Jeremy are all kind of the same, right?"},{"from":3435.29,"to":3436.94,"location":2,"content":"They all have similar, uh,"},{"from":3436.94,"to":3440.05,"location":2,"content":"distributions, similar contexts, windows, and so on."},{"from":3440.05,"to":3442.61,"location":2,"content":"And so word vectors of names are very similar."},{"from":3442.61,"to":3445.84,"location":2,"content":"And so in summarization errors, you realize, oh,"},{"from":3445.84,"to":3449.3,"location":2,"content":"well, you know, this article, news article talked about Jeremy being kidnapped."},{"from":3449.3,"to":3451.16,"location":2,"content":"But the summary said that Jason was kidnapped."},{"from":3451.16,"to":3453.65,"location":2,"content":"And you like, well, you know, in the evaluation metric"},{"from":3453.65,"to":3456.32,"location":2,"content":"that's just one word is off and like, all the rest is correct,"},{"from":3456.32,"to":3458,"location":2,"content":"but it's a pretty important word."},{"from":3458,"to":3460.97,"location":2,"content":"And so, word vectors have like issues"},{"from":3460.97,"to":3464.07,"location":2,"content":"for summarization that are pretty fundamental and I don't think,"},{"from":3464.07,"to":3466.84,"location":2,"content":"uh, anybody's tackling really well right now."},{"from":3466.84,"to":3468.88,"location":2,"content":"Uh, and so all of these metrics have issues."},{"from":3468.88,"to":3471.62,"location":2,"content":"I would argue though that combining the 10 actually"},{"from":3471.62,"to":3474.44,"location":2,"content":"makes it less problematic and more meaningful,"},{"from":3474.44,"to":3476.63,"location":2,"content":"than looking at each one separately."},{"from":3476.63,"to":3480.72,"location":2,"content":"Uh, because now you can't use the idiosyncrasies of"},{"from":3480.72,"to":3484.97,"location":2,"content":"one particular evaluation metric to just get like your score a little bit higher."},{"from":3484.97,"to":3489.74,"location":2,"content":"Um, because then, if you just tune with that particular thing in mind,"},{"from":3489.74,"to":3493.37,"location":2,"content":"it will hurt some of the other tasks and you won't get to the sort of general,"},{"from":3493.37,"to":3495.95,"location":2,"content":"uh, NLP model that much more easily."},{"from":3495.95,"to":3498.61,"location":2,"content":"All right. 
So now, let's do some analysis uh,"},{"from":3498.61,"to":3500.64,"location":2,"content":"of this model and, uh,"},{"from":3500.64,"to":3504.14,"location":2,"content":"look at, and this is the kinda thing that comes to one of the questions that was asked."},{"from":3504.14,"to":3508.3,"location":2,"content":"Uh, is this model able to kind of generate the right words for the right tasks?"},{"from":3508.3,"to":3511.78,"location":2,"content":"And here, we basically looked at the distributions of how often, uh,"},{"from":3511.78,"to":3517.1,"location":2,"content":"the model generated words in these differen- with these three different mechanisms,"},{"from":3517.1,"to":3520.37,"location":2,"content":"Softmax vocabulary, context pointers, or question pointers."},{"from":3520.37,"to":3522.51,"location":2,"content":"And, uh, as you can see,"},{"from":3522.51,"to":3525.5,"location":2,"content":"in the majority of cases it knows exactly how to generate."},{"from":3525.5,"to":3527.91,"location":2,"content":"So, uh, for, uh,"},{"from":3527.91,"to":3531.11,"location":2,"content":"question, answering, and semantic role labeling,"},{"from":3531.11,"to":3535.36,"location":2,"content":"and SQuAD and Wiki SQL and,"},{"from":3535.36,"to":3539.15,"location":2,"content":"um, summarization, it basically uses the context pointer."},{"from":3539.15,"to":3541.57,"location":2,"content":"So it just points into the context document."},{"from":3541.57,"to":3542.8,"location":2,"content":"And we know for SQuAD,"},{"from":3542.8,"to":3545.99,"location":2,"content":"that is basically [NOISE] how the data set was generated."},{"from":3545.99,"to":3548.6,"location":2,"content":"So that's the only thing that that really makes a lot of sense."},{"from":3548.6,"to":3551.93,"location":2,"content":"Uh, what's kind of cool is that in some cases like summarization,"},{"from":3551.93,"to":3554.24,"location":2,"content":"it sometimes creates new words or, you know,"},{"from":3554.24,"to":3557.33,"location":2,"content":"that weren't in the context document wherein pointed to."},{"from":3557.33,"to":3559.91,"location":2,"content":"Uh, and for zero-shot relation extraction,"},{"from":3559.91,"to":3561.45,"location":2,"content":"also sometimes uses, uh,"},{"from":3561.45,"to":3564.05,"location":2,"content":"this external vocabulary and in some cases the context pointer."},{"from":3564.05,"to":3566.21,"location":2,"content":"So for the most part, uh,"},{"from":3566.21,"to":3571.97,"location":2,"content":"this model doesn't- is not confused how to execute on a task given, uh,"},{"from":3571.97,"to":3575.18,"location":2,"content":"this question formalism rather than, uh, the,"},{"from":3575.18,"to":3577.37,"location":2,"content":"uh, format of sort of this is the task,"},{"from":3577.37,"to":3581.2,"location":2,"content":"just do this particular test."},{"from":3581.2,"to":3584.03,"location":2,"content":"Now, um, you might argue,"},{"from":3584.03,"to":3585.83,"location":2,"content":"okay, I'm not that impressed by, you know,"},{"from":3585.83,"to":3588.5,"location":2,"content":"having the performance be slightly the same with one model versus"},{"from":3588.5,"to":3591.59,"location":2,"content":"10 separate models even though it's nice if you wanna deploy it right,"},{"from":3591.59,"to":3593.26,"location":2,"content":"like, uses less RAM and all of that,"},{"from":3593.26,"to":3594.97,"location":2,"content":"assuming they're the same size,"},{"from":3594.97,"to":3597.08,"location":2,"content":"uh, while, you know, one-tenth the 
size."},{"from":3597.08,"to":3600.71,"location":2,"content":"But what I'm excited about is more like the next couple of results."},{"from":3600.71,"to":3602.75,"location":2,"content":"And namely, sort of this transfer learning,"},{"from":3602.75,"to":3604.55,"location":2,"content":"domain adaptation, and zero-shot,"},{"from":3604.55,"to":3606.02,"location":2,"content":"uh, these kinds of capabilities."},{"from":3606.02,"to":3611.63,"location":2,"content":"So here, uh, we chose two data sets that weren't included in the original 10."},{"from":3611.63,"to":3617.8,"location":2,"content":"And we basically trained a pre-trained model on this versus a random model."},{"from":3617.8,"to":3620.51,"location":2,"content":"And, uh, randomly here again,"},{"from":3620.51,"to":3621.86,"location":2,"content":"they're the same architecture,"},{"from":3621.86,"to":3625.3,"location":2,"content":"and pre-trained means the entirety of the model was pre-trained."},{"from":3625.3,"to":3626.95,"location":2,"content":"All the, you know,"},{"from":3626.95,"to":3631.32,"location":2,"content":"encoders including the decoder in the Softmax and everything, uh,"},{"from":3631.32,"to":3636.14,"location":2,"content":"and to two other tasks where another IWSLT language pair namely,"},{"from":3636.14,"to":3637.68,"location":2,"content":"translating from English to Czech, uh,"},{"from":3637.68,"to":3640.88,"location":2,"content":"and named entity recognition tasks that you all know very well."},{"from":3640.88,"to":3643.46,"location":2,"content":"So basically what we found is that,"},{"from":3643.46,"to":3645.93,"location":2,"content":"uh, it converges much more quickly,"},{"from":3645.93,"to":3647.81,"location":2,"content":"uh, in the beginning, uh, and then,"},{"from":3647.81,"to":3651.2,"location":2,"content":"there's still a significant but not gigantic gap."},{"from":3651.2,"to":3655.59,"location":2,"content":"So this pre-training on these completely separate kinds of task had helped."},{"from":3655.59,"to":3658.74,"location":2,"content":"And, uh, I think that's,"},{"from":3658.74,"to":3660.36,"location":2,"content":"that's pretty exciting, um,"},{"from":3660.36,"to":3662.42,"location":2,"content":"especially sort of the quicker convergence, like,"},{"from":3662.42,"to":3664.16,"location":2,"content":"learning more quickly, uh,"},{"from":3664.16,"to":3666.31,"location":2,"content":"whatever new task you, you come up with,"},{"from":3666.31,"to":3669.01,"location":2,"content":"which also means in some cases you can get away with"},{"from":3669.01,"to":3671.95,"location":2,"content":"less training data on these new- on these new tasks."},{"from":3671.95,"to":3675.97,"location":2,"content":"Uh, now domain adaptation is kind of the simpler form of transfer learning,"},{"from":3675.97,"to":3679.28,"location":2,"content":"where you basically just have a different,"},{"from":3679.28,"to":3681.41,"location":2,"content":"uh, type of, uh,"},{"from":3681.41,"to":3683.06,"location":2,"content":"you know, distribution for your words."},{"from":3683.06,"to":3686.75,"location":2,"content":"Uh, we mentioned we have the Stanford Sentiment Treebank for sentiment analysis."},{"from":3686.75,"to":3689.78,"location":2,"content":"Uh, and then we analyze this on different,"},{"from":3689.78,"to":3691.61,"location":2,"content":"uh, sentiment data sets,"},{"from":3691.61,"to":3694.51,"location":2,"content":"namely Amazon product reviews and Yelp restaurant reviews,"},{"from":3694.51,"to":3696.61,"location":2,"content":"and out of the box 
without any training,"},{"from":3696.61,"to":3699.97,"location":2,"content":"the model just got 80% accuracy on both of those data sets."},{"from":3699.97,"to":3702.32,"location":2,"content":"Uh, and I think for practitioners,"},{"from":3702.32,"to":3705.14,"location":2,"content":"that is pretty exciting because you basically didn't have to train anything,"},{"from":3705.14,"to":3706.61,"location":2,"content":"it just kind of worked out of the box,"},{"from":3706.61,"to":3708.83,"location":2,"content":"download it from GitHub, and run it."},{"from":3708.83,"to":3711.62,"location":2,"content":"Uh, SNLI, that was slightly different."},{"from":3711.62,"to":3713.33,"location":2,"content":"It didn't quite work as well."},{"from":3713.33,"to":3715.28,"location":2,"content":"It's another natural language inference data set,"},{"from":3715.28,"to":3719.14,"location":2,"content":"but has very different- a very different distribution, different, uh,"},{"from":3719.14,"to":3721.04,"location":2,"content":"kinds of domains, uh, that,"},{"from":3721.04,"to":3723.29,"location":2,"content":"uh, these entailment questions are asked over."},{"from":3723.29,"to":3726.98,"location":2,"content":"Uh, and here, out of the box it achieved 62."},{"from":3726.98,"to":3730.2,"location":2,"content":"Uh, but then, uh, once you fine tuned it and"},{"from":3730.2,"to":3734.23,"location":2,"content":"similar to these experiments here continue to actually train on this data set,"},{"from":3734.23,"to":3737.68,"location":2,"content":"it quickly uh, converged to 87 which was"},{"from":3737.68,"to":3741.63,"location":2,"content":"still two percent gain over a randomlyor initialized McCann model. Yeah."},{"from":3741.63,"to":3749.07,"location":2,"content":"In that experiment, did you evaluate how much less data you can get away with?"},{"from":3749.07,"to":3752.9,"location":2,"content":"Did we evaluate how much less data we can get away with? We didn't."},{"from":3752.9,"to":3755.51,"location":2,"content":"And in some ways, whenever you would run this experiment,"},{"from":3755.51,"to":3758,"location":2,"content":"you'd basically be like, you'd still not do as well."},{"from":3758,"to":3761.55,"location":2,"content":"Like, everything- all these models will still do better with more training data."},{"from":3761.55,"to":3763.64,"location":2,"content":"So you just kind of, it would be a fuzzy kind of say,"},{"from":3763.64,"to":3766.22,"location":2,"content":"like, cut- fuzzy sort of result, right?"},{"from":3766.22,"to":3768.14,"location":2,"content":"Where you say, well, with one-tenth we might get"},{"from":3768.14,"to":3770.89,"location":2,"content":"to 50 and the other model might get only to 40,"},{"from":3770.89,"to":3772.16,"location":2,"content":"doing something like that."},{"from":3772.16,"to":3774.83,"location":2,"content":"Um, we don't- I don't have those numbers."},{"from":3774.83,"to":3777.38,"location":2,"content":"It would be kind of actually also a neat, neat, uh,"},{"from":3777.38,"to":3779.75,"location":2,"content":"analysis to do. Yeah."},{"from":3779.75,"to":3786.84,"location":2,"content":"So if you wanted to like train on a new task [inaudible]."},{"from":3786.84,"to":3787.93,"location":2,"content":"Yeah."},{"from":3787.93,"to":3790.16,"location":2,"content":"[inaudible] ."},{"from":3790.16,"to":3793.11,"location":2,"content":"So, do we have the code to train a new task? 
Yes, we do."},{"from":3793.11,"to":3794.7,"location":2,"content":"Um, you can just, uh, edit,"},{"from":3794.7,"to":3796.8,"location":2,"content":"make it into this format using context."},{"from":3796.8,"to":3799.47,"location":2,"content":"Here's a question, simple like CSV type format,"},{"from":3799.47,"to":3804.16,"location":2,"content":"and then you add it and you can both like train the pre-trained model yourself."},{"from":3804.16,"to":3808.69,"location":2,"content":"You can download a pre-trained model and just add it. So I'll look it up, yeah."},{"from":3808.69,"to":3814.8,"location":2,"content":"Do you know how this compares to using other kinds of pre-trained representations like, say BERT?"},{"from":3814.8,"to":3817.33,"location":2,"content":"So, um, it's a great question."},{"from":3817.33,"to":3820.12,"location":2,"content":"So how does this compare to other pre-trained representations like BERT?"},{"from":3820.12,"to":3821.93,"location":2,"content":"So, in some ways,"},{"from":3821.93,"to":3824.2,"location":2,"content":"people say BERT is kind of this model that does everything,"},{"from":3824.2,"to":3826.69,"location":2,"content":"but when you actually read the paper, you realize, well,"},{"from":3826.69,"to":3829.93,"location":2,"content":"it's a separate model for these different tasks, right?"},{"from":3829.93,"to":3832.38,"location":2,"content":"If you wanted to have a classification task,"},{"from":3832.38,"to":3834.07,"location":2,"content":"you have a little token in the beginning,"},{"from":3834.07,"to":3835.33,"location":2,"content":"and you have a different top layer."},{"from":3835.33,"to":3837.4,"location":2,"content":"If you wanna do a sequence labeling task,"},{"from":3837.4,"to":3838.45,"location":2,"content":"you have a different top layer."},{"from":3838.45,"to":3840.4,"location":2,"content":"If you wanted to do a sequence extraction task,"},{"from":3840.4,"to":3841.76,"location":2,"content":"you have a different top layer."},{"from":3841.76,"to":3846.22,"location":2,"content":"So, BERT isn't actually a single model for all of these different tasks."},{"from":3846.22,"to":3848.41,"location":2,"content":"Ah, and then, on all the results,"},{"from":3848.41,"to":3851.8,"location":2,"content":"there's a lot of extra tuning for each of the data sets,"},{"from":3851.8,"to":3853.76,"location":2,"content":"and tasks, uh, that, you know,"},{"from":3853.76,"to":3856.03,"location":2,"content":"different learning rate for this task, uh,"},{"from":3856.03,"to":3859.12,"location":2,"content":"different size, or different sets of BERT, and so on."},{"from":3859.12,"to":3861.67,"location":2,"content":"So, we're also super excited, we're like maybe this is it,"},{"from":3861.67,"to":3863.59,"location":2,"content":"we'll just run everything on BERT,"},{"from":3863.59,"to":3865.18,"location":2,"content":"and then we looked into all the details,"},{"from":3865.18,"to":3866.92,"location":2,"content":"and there's so much excitement in the beginning."},{"from":3866.92,"to":3869.02,"location":2,"content":"And then the more we dug through the details,"},{"from":3869.02,"to":3871.8,"location":2,"content":"the less excited we became as this being like sort of the answer,"},{"from":3871.8,"to":3873.58,"location":2,"content":"because it is not a single model."},{"from":3873.58,"to":3876.88,"location":2,"content":"Uh, in some ways, it's probably better to- for pre-training."},{"from":3876.88,"to":3878.29,"location":2,"content":"So instead of 
CoVe,"},{"from":3878.29,"to":3881.14,"location":2,"content":"you can have kind of BERT at the very beginning,"},{"from":3881.14,"to":3883.45,"location":2,"content":"and my hunch is everything will get slightly better,"},{"from":3883.45,"to":3886.07,"location":2,"content":"but you still need to have, um,"},{"from":3886.07,"to":3892.12,"location":2,"content":"a lot of the- a lot of the other sort of modeling architecture on top of it."},{"from":3892.12,"to":3896.05,"location":2,"content":"Uh, and then the sad thing is to really get the state of the art results,"},{"from":3896.05,"to":3900.36,"location":2,"content":"there's a lot of very spec- task-specific tuning of those last top layers."},{"from":3900.36,"to":3904.53,"location":2,"content":"So, if you try to unify that task-specific tuning,"},{"from":3904.53,"to":3906.7,"location":2,"content":"you lose a lot of the good performance of BERT."},{"from":3906.7,"to":3910.49,"location":2,"content":"Um, so, unfortunately, it's not quite the sort of,"},{"from":3910.49,"to":3912.18,"location":2,"content":"\"Oh, just use BERT for it,"},{"from":3912.18,"to":3915.22,"location":2,"content":"and you'll just have state-of-the-art numbers and all the things.\""},{"from":3915.22,"to":3918.57,"location":2,"content":"Um, I could probably go like talk about it a lot more, but, uh,"},{"from":3918.57,"to":3921.3,"location":2,"content":"I think it still makes sense to think about, um,"},{"from":3921.3,"to":3923.07,"location":2,"content":"some of the ideas from BERT,"},{"from":3923.07,"to":3926.36,"location":2,"content":"like basically, add as one of the tasks language modeling."},{"from":3926.36,"to":3930.99,"location":2,"content":"That would be very likely the task that helps the most for all the other tasks,"},{"from":3930.99,"to":3933.48,"location":2,"content":"and we should include that, uh,"},{"from":3933.48,"to":3937.53,"location":2,"content":"it also would be nice to have a faster model right now."},{"from":3937.53,"to":3940.27,"location":2,"content":"Um, it's hard to do language modeling is very, very large,"},{"from":3940.27,"to":3941.74,"location":2,"content":"it benefits even more from,"},{"from":3941.74,"to":3943.84,"location":2,"content":"you know, billions and billions of words."},{"from":3943.84,"to":3945.67,"location":2,"content":"It's hard to train the McCann model,"},{"from":3945.67,"to":3948.94,"location":2,"content":"this current question answering model of the co-attention mechanism of the question"},{"from":3948.94,"to":3952.03,"location":2,"content":"with like an increasingly large context."},{"from":3952.03,"to":3954.97,"location":2,"content":"So you'd have to kind of split it also like BERT,"},{"from":3954.97,"to":3959.02,"location":2,"content":"works also reasonably well only for like at most I think 500 words or so,"},{"from":3959.02,"to":3962.05,"location":2,"content":"and if you wanted to do summarization you'd basically have to cut"},{"from":3962.05,"to":3966.49,"location":2,"content":"the original document to only 500 words, and then try to summarize it."},{"from":3966.49,"to":3969.82,"location":2,"content":"So, there are a lot of like devil in the details that they didn't have to figure out,"},{"from":3969.82,"to":3972.52,"location":2,"content":"because they said, \"Well, we'll just sort of just like word vectors,"},{"from":3972.52,"to":3976.42,"location":2,"content":"we can take them in, and then we do a lot of other stuff that is task-specific,"},{"from":3976.42,"to":3978.78,"location":2,"content":"um, with those- those word 
vectors,"},{"from":3978.78,"to":3980.35,"location":2,"content":"or with the BERT architecture.\""},{"from":3980.35,"to":3982.72,"location":2,"content":"I still- I don't want to- this BERT is obviously amazing,"},{"from":3982.72,"to":3985.12,"location":2,"content":"and we are looking into trying to use ideas from it."},{"from":3985.12,"to":3987.4,"location":2,"content":"But unfortunately, it wasn't just sort of a silver bullet to"},{"from":3987.4,"to":3993.36,"location":2,"content":"solve multi-task learning. Mm-hmm?"},{"from":3993.36,"to":3995.51,"location":2,"content":"Pre-training process to be considered, uh,"},{"from":3995.51,"to":4000.99,"location":2,"content":"prioritized sampling based off of how much fewer group, how much loss there is?"},{"from":4000.99,"to":4002.67,"location":2,"content":"Sorry, did we- say again?"},{"from":4002.67,"to":4006.39,"location":2,"content":"Would you consider prioritizing sampling [inaudible]?"},{"from":4006.39,"to":4008.37,"location":2,"content":"So, did we consider prioritizing the sampling?"},{"from":4008.37,"to":4011.76,"location":2,"content":"So in some ways with this pre-trained strategy here, um,"},{"from":4011.76,"to":4016.5,"location":2,"content":"that's kind of what we did by basically focusing on these really hard tasks."},{"from":4016.5,"to":4022.14,"location":2,"content":"And, uh, a lot of like the gap in the end was improved by really waiting for,"},{"from":4022.14,"to":4024.55,"location":2,"content":"like four of the tasks at the very end,"},{"from":4024.55,"to":4025.99,"location":2,"content":"uh, bef- unti- you know, uh,"},{"from":4025.99,"to":4028.56,"location":2,"content":"until after you're gone through, uh,"},{"from":4028.56,"to":4030.75,"location":2,"content":"sort of oversampling all of these,"},{"from":4030.75,"to":4031.8,"location":2,"content":"uh, really hard tasks."},{"from":4031.8,"to":4036.38,"location":2,"content":"In the last 10 minutes, uh, basically, uh,"},{"from":4036.38,"to":4038.4,"location":2,"content":"th- the most exciting thing, uh,"},{"from":4038.4,"to":4042.54,"location":2,"content":"for- for last though I think you could also do a lot more work in this direction."},{"from":4042.54,"to":4044.46,"location":2,"content":"Uh, I mentioned the sole question pointer"},{"from":4044.46,"to":4046.38,"location":2,"content":"and zero short learning in the beginning, and, uh,"},{"from":4046.38,"to":4049.97,"location":2,"content":"we basically just tried to play around with that a little bit, um,"},{"from":4049.97,"to":4052.18,"location":2,"content":"and found that in some cases,"},{"from":4052.18,"to":4055.08,"location":2,"content":"it actually kind of magically works."},{"from":4055.08,"to":4057.06,"location":2,"content":"Uh, so here, we tried, uh,"},{"from":4057.06,"to":4058.72,"location":2,"content":"a sentence John had a party,"},{"from":4058.72,"to":4060.86,"location":2,"content":"but no one came, and he was all alone."},{"from":4060.86,"to":4063.96,"location":2,"content":"And then we asked, \"Is this story sad, or happy?\""},{"from":4063.96,"to":4066.12,"location":2,"content":"And while the model could've, you know,"},{"from":4066.12,"to":4067.92,"location":2,"content":"generate some random German words,"},{"from":4067.92,"to":4069.57,"location":2,"content":"or some random SQL words,"},{"from":4069.57,"to":4071.24,"location":2,"content":"or it's just said whatever,"},{"from":4071.24,"to":4074.49,"location":2,"content":"it actually pointed to, of all the words,"},{"from":4074.49,"to":4076.44,"location":2,"content":"you 
could've pointed to in the context or the question that"},{"from":4076.44,"to":4078.82,"location":2,"content":"pointed to \"Sad\", which is pretty cool."},{"from":4078.82,"to":4081.75,"location":2,"content":"Like- and it's just one small sample,"},{"from":4081.75,"to":4083.58,"location":2,"content":"and, you know, you could do a lot more,"},{"from":4083.58,"to":4088.91,"location":2,"content":"you could try to come up with a very large zero-shot kind of classification data set,"},{"from":4088.91,"to":4090.3,"location":2,"content":"which is actually kind of hard too."},{"from":4090.3,"to":4092.55,"location":2,"content":"You have to be quite creative, it's not like you can just say, \"Oh,"},{"from":4092.55,"to":4093.75,"location":2,"content":"it would just take all these reviews,"},{"from":4093.75,"to":4095.7,"location":2,"content":"and label them as these, you know, positive negative."},{"from":4095.7,"to":4099.81,"location":2,"content":"Ah, but so, I think we- we need to do more work in that direction."},{"from":4099.81,"to":4103.23,"location":2,"content":"Somebody will hopefully create a zero-shot kind of task data set,"},{"from":4103.23,"to":4105.57,"location":2,"content":"that is not just zero-shot for, you know,"},{"from":4105.57,"to":4109.05,"location":2,"content":"kind of new distributions or something with completely different, uh, outputs."},{"from":4109.05,"to":4111.81,"location":2,"content":"Uh, but we- we tried a couple,"},{"from":4111.81,"to":4112.95,"location":2,"content":"and it doesn't always work, right."},{"from":4112.95,"to":4114.51,"location":2,"content":"You can be adversarial about it,"},{"from":4114.51,"to":4118.47,"location":2,"content":"you can make this basically looks most similar to,"},{"from":4118.47,"to":4120.51,"location":2,"content":"is the sentiment positive or negative?"},{"from":4120.51,"to":4122.81,"location":2,"content":"Uh, is this sen- is this sentence positive or negative?"},{"from":4122.81,"to":4125.95,"location":2,"content":"That was the formalism we had for sentiment analysis."},{"from":4125.95,"to":4127.66,"location":2,"content":"And so you could,"},{"from":4127.66,"to":4130.38,"location":2,"content":"if you make the question more and more different,"},{"from":4130.38,"to":4132,"location":2,"content":"eventually, it'll kinda get tripped up."},{"from":4132,"to":4135.02,"location":2,"content":"Ah, and it's clear that it's benefited, uh,"},{"from":4135.02,"to":4137.01,"location":2,"content":"from the word vectors,"},{"from":4137.01,"to":4139.02,"location":2,"content":"of sad being closer to negative,"},{"from":4139.02,"to":4141.36,"location":2,"content":"and then understanding sort of through all these,"},{"from":4141.36,"to":4143.72,"location":2,"content":"uh, correlations, and- and, uh,"},{"from":4143.72,"to":4148.92,"location":2,"content":"deep representations that there are other sort of sad words in this context,"},{"from":4148.92,"to":4150.12,"location":2,"content":"or- or whatever it is."},{"from":4150.12,"to":4152.37,"location":2,"content":"Uh, and so, it was able to point to this."},{"from":4152.37,"to":4154.74,"location":2,"content":"But you can be adversarial, it doesn't always work."},{"from":4154.74,"to":4156.78,"location":2,"content":"But even the fact that, uh,"},{"from":4156.78,"to":4160.34,"location":2,"content":"it was sort of zero-shot classification based on word vectors, uh,"},{"from":4160.34,"to":4162.15,"location":2,"content":"for new kinds of questions,"},{"from":4162.15,"to":4164.07,"location":2,"content":"uh, personally, it 
was very exciting to me."},{"from":4164.07,"to":4166.17,"location":2,"content":"And we tried a couple of other things like,"},{"from":4166.17,"to":4168.61,"location":2,"content":"uh, Bryan gave a talk and nobody clapped."},{"from":4168.61,"to":4169.65,"location":2,"content":"Was Bryan happy, or sad?"},{"from":4169.65,"to":4170.67,"location":2,"content":"And it also got it right."},{"from":4170.67,"to":4173.3,"location":2,"content":"So, um, there are a couple- a couple of the,"},{"from":4173.3,"to":4176.19,"location":2,"content":"the examples were, were at least as happy or sad thing worked."},{"from":4176.19,"to":4179.3,"location":2,"content":"And then, uh, a couple of other sort of adjective questions that we,"},{"from":4179.3,"to":4180.78,"location":2,"content":"we tried but, um,"},{"from":4180.78,"to":4183.69,"location":2,"content":"what I'm- what I would be most excited about is eventually actually"},{"from":4183.69,"to":4187.76,"location":2,"content":"trying to have a zero-shot classification task,"},{"from":4187.76,"to":4189.68,"location":2,"content":"uh, that combines the different tasks too."},{"from":4189.68,"to":4192.54,"location":2,"content":"So, uh, unfortunately, there's no data set for that,"},{"from":4192.54,"to":4194.46,"location":2,"content":"so we didn't train it, so it doesn't happen with the model."},{"from":4194.46,"to":4197.73,"location":2,"content":"But in theory, if you ask what is the sum- you can summarize,"},{"from":4197.73,"to":4199.99,"location":2,"content":"and you can translate from English into German,"},{"from":4199.99,"to":4202.52,"location":2,"content":"why couldn't you ask the model for a German summary?"},{"from":4202.52,"to":4204.24,"location":2,"content":"And if that worked, eventually,"},{"from":4204.24,"to":4205.65,"location":2,"content":"that would be even more amazing,"},{"from":4205.65,"to":4207.39,"location":2,"content":"but it, it doesn't work right now,"},{"from":4207.39,"to":4209.19,"location":2,"content":"because we never ask it sort of for these"},{"from":4209.19,"to":4212.31,"location":2,"content":"compositional task- these compositional task questions."},{"from":4212.31,"to":4215.49,"location":2,"content":"But is yet another interesting line of research that I think could spawn from this."},{"from":4215.49,"to":4216.68,"location":2,"content":"Uh, all right."},{"from":4216.68,"to":4219.15,"location":2,"content":"So, I hope I could show you that this sort of"},{"from":4219.15,"to":4224.13,"location":2,"content":"decaNLP framework is an interesting new benchmark for generalized NLP."},{"from":4224.13,"to":4227.16,"location":2,"content":"Uh, I do think it's a reasonably good framework"},{"from":4227.16,"to":4230.31,"location":2,"content":"for tackling a bunch of the really hard questions in the field."},{"from":4230.31,"to":4232.26,"location":2,"content":"Uh, more general language understanding,"},{"from":4232.26,"to":4233.55,"location":2,"content":"and question answering of course,"},{"from":4233.55,"to":4237.18,"location":2,"content":"uh, multitask learning, domain adaptation, uh,"},{"from":4237.18,"to":4239.79,"location":2,"content":"which we sort of analyzed a little bit with the sentiment,"},{"from":4239.79,"to":4241.81,"location":2,"content":"and SNLI versus multi NLI,"},{"from":4241.81,"to":4244.71,"location":2,"content":"um, transfer learning, and then weight sharing."},{"from":4244.71,"to":4246.78,"location":2,"content":"I think it's clear, everybody loves weight sharing,"},{"from":4246.78,"to":4248.85,"location":2,"content":"you 
wanna share as many weights as possible."},{"from":4248.85,"to":4252.38,"location":2,"content":"Uh, word vector started at, uh, ELMo,"},{"from":4252.38,"to":4255.3,"location":2,"content":"CoVe, and now BERT basically share more and more,"},{"from":4255.3,"to":4256.55,"location":2,"content":"deeper and deeper layers."},{"from":4256.55,"to":4259.56,"location":2,"content":"It would be great if we can unify that last bit also, uh,"},{"from":4259.56,"to":4262.57,"location":2,"content":"and then share basically the entirety of the networks,"},{"from":4262.57,"to":4265.2,"location":2,"content":"and then eventually hopefully get to zero-shot learning."},{"from":4265.2,"to":4267.33,"location":2,"content":"Now, there's a bunch of related work."},{"from":4267.33,"to":4269.22,"location":2,"content":"The original paper has over 100,"},{"from":4269.22,"to":4271.73,"location":2,"content":"um, citations in it, uh, of,"},{"from":4271.73,"to":4273.52,"location":2,"content":"of, you know, papers to other,"},{"from":4273.52,"to":4276.4,"location":2,"content":"other, um, lines of, uh, work."},{"from":4276.4,"to":4278.49,"location":2,"content":"But, uh, this is actually zero- at least some of"},{"from":4278.49,"to":4281.67,"location":2,"content":"the models and papers that influenced us the most,"},{"from":4281.67,"to":4283.92,"location":2,"content":"uh, in, in our thinking and modelling."},{"from":4283.92,"to":4285.47,"location":2,"content":"Uh, one of them actually comes from,"},{"from":4285.47,"to":4287.55,"location":2,"content":"uh, the two instructors of the class."},{"from":4287.55,"to":4291.16,"location":2,"content":"And so, um, hopefully, uh, we can,"},{"from":4291.16,"to":4295.05,"location":2,"content":"you know, sort of think about what- what's next after all this architecture engineering."},{"from":4295.05,"to":4298.13,"location":2,"content":"And, uh, I think one potential answer to that, uh,"},{"from":4298.13,"to":4302.4,"location":2,"content":"is single multitask learning for more generalized NLP models."},{"from":4302.4,"to":4313.62,"location":2,"content":"[NOISE] All right. Thank you. [APPLAUSE]"}]}