Thanks, Drew, for the intro, and thanks everyone for coming — so many of you, in fact; I really appreciate it. It's interesting you mentioned restricted Boltzmann machines. I'm kind of curious: how many people know what a restricted Boltzmann machine is? Just raise your hand. OK, this is actually more than I expected. I don't even know if I would teach it now, because it's barely used anymore. I do have a vested interest in RBMs: I have an online course, as some of you might know, on deep learning and neural nets, with a whole section on RBMs, and there aren't that many sections on RBMs online. So if they come back I might become even more famous — I think this is key for me to increase my profile. So if anyone here wants to make RBMs work really well on some important problem, please go for it; I'll support you, potentially even financially.

All right, but what I'll talk about today is my recent interest in a problem, few-shot learning, and particularly an approach that has become popular in the past year or two, based on an idea called meta-learning. This talk is actually not entirely focused on my own work; I thought I would try to present an overview of this topic, including the work of many other great people, and maybe give you a sense of what might be interesting challenges and opportunities for research that many of the young researchers here might want to engage in and investigate moving forward.

In terms of my trajectory, how I got to this particular problem: first of all, I noticed that the amazing success we've had with deep learning approaches to many problems related to AI all rests on the assumption that we have a lot of data available for these problems. If we truly believe in this trajectory we're trying to make progress on — towards AI that can solve any problem — then the assumption that we will actually be collecting hundreds of thousands or millions of labeled examples for every individual AI problem we'd like to solve is probably unrealistic, just as unrealistic as it was when we were talking about expert systems and thought that for any given problem we could just dump all the expert knowledge into a computer and that would give it the intelligence we'd expect.

Also, scientifically it's quite interesting: why is it that we as humans learn certain things very quickly, from very few examples or exercises? If you're at school, exercises are essentially training data for humans, and we can learn things very quickly from few examples, whereas our machines — I make the comparison of a really, really hard-working student that's also really, really dumb and slow: it just works its ass off through millions and millions of exercises and eventually gets it. That's not how we operate; that's not how smart we are. So there's this scientific question: why is that, and why are we not there yet?

In previous years of my research I've looked at one high-level idea, which is: for any given problem, let's try to leverage data from other sources that isn't labeled data for that specific problem. Early in my career, as a PhD student, I looked at using unlabeled data and doing unsupervised learning, and at using data that comes in multiple modalities.
If you have images and text, the text might serve as some sort of weak label for the image. But what I'll be talking about today is more what I would call multi-domain data: the idea of trying to do transfer learning, in some way, from problems where we have some data to problems where we don't.

The one setup to keep in mind for most of this talk is that the problem I want to address is one where the training set I'm given is, say, a five-class classification problem, and I have very few examples for each class — maybe one labeled example, maybe five. So in this case I have five classes — a dog, a lion, some bulls, and so on — and then I have a test set, and I need a way of producing, from this very small training set, a model that will generalize well to the images in the test set, like this lion here and this bull. That's the problem we're trying to address; we want to solve problems of this form.

As I mentioned, people are actually quite good at this. This is an example Josh Tenenbaum often gives: if I show you, for the first time, a Segway like this, and then you're given all of these different images and asked which ones are in the same category, I think most people would more or less get the right answer — in particular, in this case, identify this one as the other Segway. And machines are getting much better at it too — perhaps not yet in the setting of natural images, but, for instance, there's this paper from Josh Tenenbaum's lab where they used a form of meta-learning to get human-level performance on recognizing characters from various alphabets. So this progress seems to suggest we might start having the tools to make significant headway on what I'm after.

There's one scientific mission, which is to close this gap between humans and machines, and there's also a very practical consideration: it would be nice if we, as users, had a very simple way of defining classifiers and predictors, where we could just provide a few labels and get a pretty accurate classifier. One example of what this might look like, which I like to give, is from this website Teachable Machine, a web experiment Google has come up with. What I'll show you is just me having fun with my kids, defining a training image classifier with three classes — the green, the purple, and the orange class — where the green is going to be someone smiling, the purple someone yawning, and the orange someone being scared. So what I do is collect data: while I'm doing something in front of my webcam, I capture the examples for each class by pressing the button, and when you see the colors changing, that's the prediction of the classifier. You can see that once it has data about me, it's doing pretty well — you can see my acting skills on display here. But then of course if you go further out of the input distribution and ask my daughter, the classifier is totally broken; it just doesn't work at all, because it has only about ten examples for each class. But then, if she starts collecting some examples of herself — smiling, yawning, and being scared — then it sort of gets it after that. Ideally what you'd want is a system where you can do this for a few people, a few situations, a few contexts, and then it would be able to actually generalize well outside of that distribution.
So, say, if I ask her sister to come along — she was a bit less shy — in her case it sort of works: it gets yawning, and it gets being scared. That's an example of the kind of generalization we'd like to achieve — and granted, these are my daughters, so they're supposed to look somewhat alike, assuming they share some of my genes, which I hope they do. Now obviously this is a pretty hard project; it took us a while to make that demo work. We tried a bunch of different visual classes we might want to define, and a bunch of them just were not working. So for natural images, even that fairly restricted setting is already quite challenging. But you can see that if we had a system like this — one thing deep learning has allowed is making computer vision methods much more robust and more easily usable by many people, including people in industry, which is why there's been all this commercial interest — and if we could further reduce the barrier by not even having to collect tons of examples for any given classification problem, I think this could have a really big impact in industry as well. So those are the dual motivations: scientifically, I think there's something to learn about ourselves and about intelligence, and practically, this would be really useful.

All right, so the problem we'll be looking at: we're going to try to address directly this problem of few-shot learning. What I mean by that is a problem where, from a small training set of input and label pairs, where the number of examples L is very small, we want a learning algorithm that is able to produce, say, the parameters of a predictor or model M that will generalize well to new examples that weren't in this training set. The constraint we're putting on the setting is that this training set will be small, and that's the setting in which we're interested in the performance of this learning algorithm.

One approach that has become quite popular in the past one or two years is, instead of using all of our knowledge about how we might do this — say, what properties of images might be useful for achieving this strong generalization from few examples — to just try to learn that learning algorithm. Let's treat the learning algorithm A, which takes a set and produces parameters for a model, as something we can train, and then train it on examples of such problems. For that reason we call this either learning to learn or, more often, the approach of meta-learning. That's what I'll talk about now.

The problem of few-shot learning itself is definitely not new. Transfer learning in general is a problem people have looked at, and it's a concept that has been quite useful: there are these well-known results where, if you train a model on ImageNet, it's been shown to transfer really well to different image classification problems, and in fact to even surpass the hand-engineered features that had been designed in the computer vision community for a long time. Here, in few-shot learning, we're still going to rely on this idea of doing transfer learning to new datasets, but we're in a sense trying to transfer not just the features but the whole process of going from a training set to a predictor that I can apply.
And again, few-shot learning — also in the form of one-shot learning — was the subject of a bunch of papers a bit more than ten years ago, where people were looking at this and making some progress, and a lot of that progress was again largely based on designing features. A lot of the core of deep learning is just deciding, whenever we can, to train a system end to end on data — that's really one of the key reasons this approach has been working pretty well — and in a sense, in meta-learning, we go one step beyond and say: let's learn the learning algorithm end to end. We have pretty good tools now to create computer programs that we can backpropagate through; can we leverage that technology, but now to learn a learning algorithm?

All right, so let's start to put some notation around this problem. If we're going to talk about meta-learning, we should first talk about just learning. A learning algorithm is really an algorithm whose input is a training set of input and target pairs, and whose output is a model — say the parameters of some model, which I'll sometimes call the learner. A good learning algorithm is one that, for a given training set, produces a model that has good performance on a test set from the distribution of interest. That's a very basic definition of a learning algorithm.

So what should a meta-learning algorithm be? It's a learning algorithm itself, but in this setting its input is going to be a training set that we'll call a meta-training set, in the sense that it's actually going to be a training set of datasets, where each dataset is a split into a training set and its corresponding test set, and we have a bunch of these examples. These training/test set pairs — we'll often call them episodes — will be the examples on which our meta-learning algorithm runs and on which we'll be training a learning algorithm. A meta-learning algorithm has parameters, but these are the parameters of the learning algorithm A, which takes a training set and produces the parameters of a model for that corresponding problem. And a good meta-learning algorithm is one whose performance is good on new episodes — on new training and test set splits for other problems. Maybe a visual illustration of this will be more useful, but before I go there, this notion of measuring the generalization of a meta-learning algorithm is, I think, quite important.

A small tangent: there are other problems that some people might be tempted to call meta-learning problems. One of them is doing what's called architecture search, where you take a neural net and you optimize the number of layers and the connectivity of that network. My claim is that if you don't take a meta-learning procedure and evaluate it on new problems, then it's not really meta-learning; it's more like optimization. If you're working on CIFAR-10 and you're optimizing the architecture for CIFAR-10 to get good performance on the test set of CIFAR-10, that's closer to optimization than it is to meta-learning. Meta-learning would be taking that procedure that optimizes the hyperparameters and the architecture of a neural net, training it on, say, CIFAR-10, and then showing that it actually works well on some other image classification problem, or some other machine learning problem. That, to me, is meta-learning: there needs to be this notion of generalization to new problems.
Sorry — yes? Yes, so the question is about the relationship between these problems in meta-training and meta-testing. I think this will get clearer as I move forward, but that's actually a problem in itself, and something where I think we still need to iterate on how we define it.

So, visually, here is what this problem looks like. As I mentioned, what we want is essentially a learning algorithm that generalizes well. (You didn't open your chips before — let's all wait for Dhruv to open his chips. OK, all right.) In meta-learning, what we want to do, as I said, is tackle this particular problem: from a small training set, produce parameters that work well on its corresponding test set. These problems will be in our meta-test set, and we'll assume we have a meta-training set of other episodes, where an episode is this grayed-out capsule here, containing a training and test set split. In particular, the way we usually set up the problem, we make sure that the meta-training set has images of classes that are different from the classes in the meta-test set, and that's the extent to which we're measuring generalization; we're asking, in some sense, for an extrapolation to classes it has never seen before. These examples come from what's called the miniImageNet benchmark, which is essentially a subset of ImageNet where we've taken 64 classes to construct the episodes for meta-training, and then another set of 20 classes used for the images in the meta-test set. So here you can see that piano is one of the classes in the episodes of the meta-training set, but you will not find any pianos in any of the images of the meta-test set.

In meta-learning, what we're really doing when meta-training is iterating over this meta-training set. We go over the first episode, and we have our meta-learner, which is where the parameters of our meta-learning approach sit — those are the parameters we're learning. We feed the training set into the meta-learner, and conceptually that meta-learner is going to produce a learner, by outputting parameters for it, or produce a model that I can now apply: one that takes a test input and makes a prediction for its label. To set up my meta-learning training objective, I define a loss on the learner's outputs on the test set examples of that episode, and I backpropagate backwards through these arrows — I find a way of adjusting the parameters that sit in the meta-learner to optimize that loss. So we're trying to get as close as we can to the problem of generalization: we're optimizing a meta-learner that produces models M that actually generalize to test examples that were not fed as input to the learning algorithm. And then, if we do gradient descent, which is usually what we do, we do an update on the meta-learner's parameters and move on to the next episode.
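To make that loop concrete, here is a minimal sketch of episodic meta-training in PyTorch — not the code from any particular paper. The `meta_learner` module is a hypothetical stand-in that maps an episode's training set to predictions on the episode's test inputs (any of the approaches discussed later could play that role), and the cross-entropy loss used here is the one that will be made explicit in a moment.

```python
# Minimal sketch of the episodic meta-training loop (hypothetical names).
import torch
import torch.nn.functional as F

def meta_train(meta_learner, episodes, lr=1e-3):
    opt = torch.optim.Adam(meta_learner.parameters(), lr=lr)
    for x_train, y_train, x_test, y_test in episodes:
        # The meta-learner consumes the episode's training set and produces
        # predictions (logits) for the episode's test inputs.
        logits = meta_learner(x_train, y_train, x_test)
        # The loss is measured on test examples that were never fed to the
        # learning algorithm -- a proxy for generalization within the episode.
        loss = F.cross_entropy(logits, y_test)
        opt.zero_grad()
        loss.backward()   # backpropagate through the learning procedure itself
        opt.step()        # gradient update on the meta-learner's parameters
```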
There, the meta-learner gets a different training set, produces a different model, we look at the loss on that episode's test set examples, and we backpropagate into the meta-learner again.

Are there any questions about this setup? I'm sure there are — does anyone have the courage to ask? Yes? Yes — the meta-learner is not, explicitly at least, taking a learner as input; it does not have access to a definition of the learner, at least not in some structured format. We'll see a bunch of examples of meta-learners. There are some approaches to meta-learning right now that will not let you change the family of the model at meta-test time — they will just assume it only produces parameters for, say, a ConvNet with four layers; approaches like MAML are of that nature. Right now I can't really think of any approach in meta-learning where the type of learner drastically changes between episodes, and down the line that's definitely something we want to consider and investigate, but yes, often they make fairly strong assumptions about the nature of the learner, such as knowing what parameters it has. Yes — there's a meta-validation set, with sixteen other classes that are different from both meta-training and meta-testing. And yes, absolutely, it's essentially a cross-entropy loss that we'll be using; we'll see that in a bit.

All right, a quick note on some nomenclature you'll find in papers. The notation I like to use is to talk about, within an episode, a split into a training set and a test set, and then we have this notion of a meta-training set that we train the meta-learner on and a meta-test set for measuring the meta-learner's generalization. It follows the concept that what we're training here is a learning algorithm. It gets pretty hairy, both in writing a paper and in presenting — as I found in my previous presentations — to have both training sets and meta-training sets; sometimes you get confused about which one you're referring to. So you'll find some papers that, instead of talking about the training set of an episode, call it the support set, and the test set of an episode is instead a query set, and then you can use "training set" and "test set" for these sets of episodes. That notation is definitely clearer; I just don't find it as cool, so I tend to avoid it — not the best reason — but I will try to stick with my notation. Just be aware that when you read a meta-learning paper and it talks about the support set and the query set, it's referring to this split within each episode: the support set is the set of examples fed as input to the learning algorithm, and the query set is the test examples on which you evaluate the loss for that episode.

So if you have this procedure that takes a training set and produces a model, and that model defines a distribution over the possible labels for a new test input in my episode, one way of writing it is that this model is effectively also conditioning on the training set — that's what this whole procedure is; you can think of it as giving you a conditional distribution over the label of a new input given the full training set. Then one thing we can do for the loss of an episode is to optimize the average negative log-probability of those label assignments — the cross-entropy loss, as we often call it — for that episode, averaging over all test set examples in that episode.
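Written out — my reconstruction of the loss just described, with θ denoting the meta-learner's parameters — the objective for a single episode is:

```latex
\mathcal{L}(\text{episode}) \;=\; -\,\frac{1}{|D_{\text{test}}|}
\sum_{(\hat{x},\,\hat{y}) \,\in\, D_{\text{test}}}
\log p_{\theta}\!\left(\hat{y} \mid \hat{x},\, D_{\text{train}}\right)
```

Meta-training then minimizes the average of this quantity over the episodes of the meta-training set.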
Now, depending on the choice of meta-learner, across the various approaches published so far, what really changes is essentially this: we're going to do gradient descent on that objective, and the approaches just use different definitions of that conditional distribution.

So what I'll do for most of this presentation is go over the various approaches that came early on and are among the most successful ones; they also span the different directions one might take on this problem, so I think they're good things to know about if you want to dive into this literature. I like to think of there being two categories of approaches. One of them says: let's not throw out all the literature on machine learning algorithms; let's instead take inspiration from actual hand-designed machine learning algorithms — the structure they take and how they function — but introduce some parameters into them and train those parameters with meta-learning. Some take inspiration from what a k-NN predictor or a kernel machine might look like — that's essentially matching networks. There's a relationship between assuming a Gaussian classifier in some learned latent space and training prototypical networks, which is another approach I'll talk about. And another one: a lot of our training algorithms are essentially gradient-descent based, so let's use gradient descent as the model for the meta-learner — I'll talk about that, and MAML is an example. Then there's the other approach, which says: maybe we're wrong about what a learning algorithm should look like, and maybe it should just be some black-box, generic neural network that we train to make these conditional-distribution predictions — I have a new input, I have a training set, and some black-box neural net produces the softmax over the labels. I'm not going to talk about MANN so much, though it was one of the early examples of doing that; what I'll cover instead is SNAIL, which is more recent and up there among the state-of-the-art approaches.

But let's first start with the first family, where we take inspiration from known learning algorithms. One of the first approaches proposed was matching networks, coming from DeepMind, from Oriol Vinyals. There, the idea is that we assume the label y is predicted essentially by taking a linear combination of the labels I have in the training set of my episode, where each weight is a function of the input associated with the label being weighted and of the test input I'm trying to make a prediction for. Here the y's are essentially one-hot vectors, so if the weights sum to one, you get a distribution over the label of x̂, where x̂ is one of the test set inputs in my episode. One way of parametrizing this weight — essentially the vote by x_i for its actual label in the training set — is to do a softmax over some comparison function between a representation of the test input x̂ and some other representation of the training set input x_i.
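Here is a minimal sketch of that prediction rule, assuming separate embedding networks f and g for the training and test inputs and cosine similarity as the comparison function; the fuller "full context" encodings discussed next are left out.

```python
# Minimal sketch of the matching-network prediction rule (PyTorch).
# `f` and `g` are assumed embedding networks; cosine similarity is one
# possible comparison function between test and training embeddings.
import torch
import torch.nn.functional as F

def matching_net_predict(f, g, x_train, y_train, x_test, n_classes):
    z_train = F.normalize(f(x_train), dim=1)          # (N, d) support embeddings
    z_test = F.normalize(g(x_test), dim=1)            # (M, d) query embeddings
    sims = z_test @ z_train.t()                       # cosine similarities (M, N)
    attn = F.softmax(sims, dim=1)                     # attention over training points
    y_onehot = F.one_hot(y_train, n_classes).float()  # (N, C) one-hot labels
    return attn @ y_onehot                            # (M, C) distribution over labels
```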
They've looked at various architectures for representing both functions g and f — wait, that's g; I said j, sorry, French hiccup. They've looked at simply using different neural nets for the two; they've also looked at — this is the dotted-line arrow here — having the representation of the test input modulate how you process the full training set, and at defining an actual RNN that goes, in some arbitrary order and in both directions, over all the inputs in the training set, trying to make those two representations as expressive as they can. They actually found they could get better results by using this sort of bi-directional RNN over the set, ordered in some arbitrary order. So that's essentially end-to-end learning a pattern matcher, so that it's able to predict the right label for a new example.

An approach that's one of my favorites, because it's super simple — so it's a good first approach to consider as a meta-learning approach; it's quite fast to train and quite simple — is prototypical networks. There, what we're trying to learn is, per class, a prototype, and the way we classify some test example x (in the previous notation that would be x̂) is that we take the test input, map it into some representation space, compare that with the prototype of each class using some negative distance function — could be Euclidean, could be cosine — and then pass these negative distances, or similarities, through a softmax; that's how we get our distribution over the labels. For the prototypes themselves, you could imagine various architectures for extracting a prototype for each class; here they use something very simple: you take some convolutional neural net that maps to a vector space, and you take the per-class average of those vectors — averaging over all examples (x, y) where y corresponds to class k. So quite simple. Here the parameters of the meta-learner are really just the representation, because we've fixed how that representation is used. Effectively you're end-to-end learning that representation such that it works well when used in what is essentially a Gaussian classifier — where in the Gaussian classifier you would take the average representation of each class and then look at the likelihood of a test example under the Gaussian with each of those means, and use that to make a prediction. If you use the squared Euclidean distance here, you recover essentially a Gaussian classifier on that representation, but you're learning the representation explicitly such that it works well for a Gaussian classifier — and in that sense it's effectively a form of meta-learning.
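A minimal sketch of prototypical-network classification with the squared Euclidean distance; `f` is the embedding network, which is the only thing being meta-learned here.

```python
# Minimal sketch of prototypical-network classification (PyTorch).
import torch
import torch.nn.functional as F

def proto_net_logits(f, x_train, y_train, x_test, n_classes):
    z_train = f(x_train)                             # (N, d) support embeddings
    z_test = f(x_test)                               # (M, d) query embeddings
    # One prototype per class: the mean embedding of that class's examples.
    prototypes = torch.stack([
        z_train[y_train == k].mean(dim=0) for k in range(n_classes)
    ])                                               # (C, d)
    # Logits are negative squared Euclidean distances to each prototype;
    # a softmax over these gives the distribution over labels.
    return -torch.cdist(z_test, prototypes).pow(2)   # (M, C)
```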
So, as I mentioned, we effectively get a Gaussian classifier if we use the squared Euclidean distance here — which often actually works surprisingly well. You can also show that the prototypes are then effectively acting as the output weights of a neural net applied to the test example: the embedding function is like the first few layers of the model you apply on test examples, and the prototypes provide the output weights. That's because, with the squared Euclidean distance, you can factor it into a term that multiplies the representation with the prototypes, plus a constant term that acts as a bias. So you can also think of it as producing weights — but only the output weights — of a neural net that you apply on the test set of a given episode.

Yes? Right — so, how you get a prototype: I think there's presumably some innovation that could be done there. Why have just one prototype per class? That's a very simplifying assumption. Why use this average pooling? You can think of it as doing average pooling at the level of classes; you could do other types of pooling — max pooling, other things.

The third approach within this family — taking inspiration from learning algorithms that exist — is to take inspiration from gradient descent itself. Right now one of our most successful learning algorithms is stochastic gradient descent: we come up with some parametrization of our learner, our model, and then, as you know, in gradient descent we initialize the parameters θ_0 of our model and repeatedly apply this update rule, where this is the loss as measured on a mini-batch, this is the gradient of that loss at my current parameter values, and I move in the opposite direction of the gradient with some step size in parameter space, which gives me the new parameters. Well, that formula might be a good inspiration for some sort of recurrent neural net that is iterated and unrolled in the space of parameters, and that was our inspiration in this ICLR paper with Sachin Ravi in 2017, where we noticed the similarity between this and, for instance, an LSTM state. If you're familiar with LSTMs, you have this cell state, which is updated by taking the current cell state, possibly modulating it with a forget gate, and then adding an update to the cell state — c̃_t — which is modulated by an input gate. The input gate, the forget gate, pretty much everything here is produced by a form of recurrent neural network. Now, if we make this cell state be the parameter vector of my model — say the ConvNet I want to train to do image classification on some problem — so c_t becomes θ_t, and let's say I just fix f_t to be one (so no forget gate), and i_t is effectively going to be my learning rate, and I impose that the cell-state update be the negative gradient of the loss on the training data in my episode with respect to my parameters, then I essentially recover the same equation.

And we're able to train LSTMs fine — very well, in fact — so that suggests we should be able to backpropagate through a few iterations of gradient descent. That was the inspiration here: we effectively have an LSTM, which is why we called our method the meta-learner LSTM, but the cell state is imposed to be the parameter vector of some ConvNet, and the cell-state update is imposed to be the negative gradient of the ConvNet parameters on the loss of the current mini-batch. Otherwise we use regular input and forget gates. We decided to learn the forget gate because it acts kind of like a regularizer: since it multiplies by a term that might be smaller than one, it effectively shrinks the current parameter values a bit towards zero. And the input gate is effectively the learning rate — to what extent you want to move fast. So the meta-learner could potentially learn to both optimize and regularize, depending on what it does with the input and forget gates.
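To make the analogy concrete, here are the two updates side by side, with the substitutions just described:

```latex
\text{LSTM cell state:}\quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\qquad\qquad
\text{SGD:}\quad \theta_t = \theta_{t-1} - \alpha_t \,\nabla_{\theta_{t-1}} \mathcal{L}_t
```

Setting c_t ← θ_t, f_t ← 1, i_t ← α_t, and c̃_t ← −∇θ L_t makes the two coincide; the meta-learner LSTM then lets the input and forget gates be learned instead of fixed.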
More visually — I think I've said all of this already — here is what it looks like. For a given episode, where I have a training set and a test set, I start with some initial values of the parameters of my ConvNet: all the convolutional filters and the fully connected layers. For the output weights, there's no real transfer that can happen there, because for a given episode, if we have five classes, we randomly order them and assign them labels — integers from 1 to 5 — so there's no real structure in the output weights we can learn; we just initialize those to 0. Then we ask: for that initial value of the parameters, what's my loss on my first mini-batch of inputs X and labels Y? Since we're doing few-shot learning, we actually use the full training set — in our experiments we use at most 5 classes and 5 examples per class, so that's a very small mini-batch already. So I do the forward pass with θ_0 on X, look at the loss with respect to Y, and that gives me a gradient of my loss. Now I feed that gradient to the meta-learner, which combines it with the current value of the parameters θ_0 to give me a new value of my ConvNet's parameters, θ_1. Then I ask: with θ_1, what's the gradient for my second mini-batch — which, in few-shot learning, is again just the training set, as I mentioned — and I get a new gradient, which I combine with θ_1 to give me a new parameter value θ_2, and I iterate this, unrolling my graph, until I get a final value of the parameters, θ_T. Then I take θ_T, and now I look at the loss — but now on the inputs of the test set of my episode, examples that were not used in computing any of the gradients and were not used anywhere in this forward pass — and that's the loss term here. What I do then is backpropagate through this whole recursive graph, much like I would backpropagate through an RNN or LSTM.

We actually do block some gradients: the gradient we feed in here is effectively a function of the parameters of my meta-learner, so I would have to compute gradients of gradients, which is computationally expensive and can take a lot of memory. We find in practice that ignoring those gradients is sometimes fine — in the particular results I'll be presenting, we were able to get good results — so we block those gradients, largely for compute-efficiency reasons. Then we backpropagate into all the parameters of my meta-learner. Those parameters are the parameters inside the thing that's giving me my input gate and my forget gate, but they're also going to be the initial parameters: where do I start my optimization, what is the initial value of my ConvNet before I do a few steps of gradient descent? In fact, a lot of the transfer happens from there. You can think of this trained initial parameter vector as essentially a parameter vector that's close to the solution across all the episodes, and then these few gradient steps for a given episode just bring me close to the right solution for that particular episode. So a lot of transfer learning actually happens through that trained initialization.
Another way of thinking of it is that we're end-to-end learning this pre-train-then-fine-tune pipeline. In pre-training you do a bunch of training on some data, then you fix those parameters and just do a few gradient descent steps, but on a new dataset that isn't necessarily the one you pre-trained on. Here we're essentially end-to-end training the whole thing: we're training the initialization such that, if I do a bit of fine-tuning, I'm going to get better results — because it's trained end to end to do that well.

Yes? Right — so the question is whether we've looked at whether what is recovered in terms of the initialization reflects some of the built-in knowledge about training these neural nets that we've discovered over the years. That's a great question. We haven't looked at what it was learning in terms of the initialization. We did look at what the input gate and the forget gate were learning, and it was not interpretable — it was really hard to make sense of, in fact — so it's not clear what it's learning. One inspiration for this paper was a previous paper by David Duvenaud and others at Harvard, where they backpropagated through multiple steps of gradient descent, but they did it on large datasets. Here we only backprop through something like twelve steps of gradient descent — really just a few steps on a given episode — largely for compute reasons, but also because, in a few-shot learning setting, we suspect we shouldn't do too many steps of gradient descent, otherwise we risk overfitting. But that earlier work was meant to see what kind of knowledge could be extracted by doing this. They might have looked at properties of the initialization, I don't remember, but they did look at the learning rates that were learned, and they found something interesting: unintuitively, what seemed optimal was, I think, to train the top layer a lot first and then — actually, now I'm not sure whether what I'm saying is right; I know I have the slide somewhere. Yes: the last layer's learning rate would jump up early on and then go back down — this is just the learning rate trained in an end-to-end way — and then later on it would instead crank up the learning rate of the first layer. Now, this is almost certainly not globally optimal, but it was kind of interesting, because this is not necessarily common wisdom: the fear when training a deep neural net is that initially the features are random, so the output weights can't really learn anything meaningful until the features have been learned. So seeing this was kind of surprising.

OK, yes? So the way we do it: for the mini-batches in this approach in particular, since we're using the full gradient on the mini-batch, it's order-agnostic, because we take an average over those examples. Matching networks would be the approach that uses an RNN — a bi-directional RNN — over the set of inputs, so it has to be order-dependent: I'm sure that if, for a given episode, you take the training set and try different orders, you're going to get different predictions.
In fact, my guess is that if you want to slightly improve your results at meta-test time, you could take an episode, try multiple random orderings of the input examples, and just average the predictions; I wouldn't be surprised if that were a valuable form of variance reduction that gives you somewhat better results. In our setting, because we're using the full-batch gradient when computing the gradient that goes into the meta-learner LSTM, it's actually agnostic — it's not order-dependent. Right — as for the order of the episodes themselves, I don't know that it is more dependent on order than regular stochastic gradient descent is dependent on the order of mini-batches; I don't have a reason to think of it as more or less dependent. But what we did find is interesting: if you take a regular approach to learning a representation — which would be to train on all of the data present in all the episodes — and then use that to initialize, say, this approach, or prototypical networks, or something like that, we do tend to find that we get better results. This is more recent; in the past few months there have been a few results suggesting this, which suggests there are properties of this optimization problem, set up this way, that are maybe challenging and not well understood yet. So that would be the only reason I'd think there might be something more clever one wants to do in terms of how to create these episodes for training. It's a great question.

OK — I'll go quickly over MAML. MAML came out a few months after our ICLR paper, and what's interesting is that what they effectively did is remove the LSTM parts — the input and forget gates. So instead of trying to learn a model that produces the regularizer and the learning rate at each time step, just remove that, assume you're doing constant-learning-rate updates, but still learn the initialization. And it gets as good results, and often better; it even seems to be more stable, which is interesting, because you'd think this is a pretty small-capacity meta-learner. There are even theoretical results from Chelsea Finn at Berkeley, where she could show that this restricted form of meta-learner is actually a universal approximator under some assumptions — which again suggests that there is an inductive bias, but not one that removes the potential for universal approximation. So if you were to use a gradient-based method like this, I would not use the meta-learner LSTM, which it turns out was a bit overly complicated; plain constant-learning-rate gradient descent was sufficient to do meta-learning and learn a good initialization.
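As a rough sketch of what such a gradient-based meta-learner looks like in code, here is a first-order, MAML-style inner/outer step — the names are hypothetical, and it uses the same kind of gradient blocking (no gradients of gradients) mentioned earlier for the meta-learner LSTM, rather than full second-order MAML. `forward(params, x)` is an assumed functional forward pass of the learner given its list of parameter tensors, and `init_params` is the learned initialization.

```python
# Sketch of a first-order MAML-style meta-training step (PyTorch).
import torch
import torch.nn.functional as F

def maml_episode_loss(forward, init_params, episode, inner_lr=0.01, inner_steps=5):
    x_train, y_train, x_test, y_test = episode
    # Inner loop: a few plain gradient steps starting from the shared initialization.
    params = [p.clone() for p in init_params]
    for _ in range(inner_steps):
        loss = F.cross_entropy(forward(params, x_train), y_train)
        grads = torch.autograd.grad(loss, params)
        # First-order approximation: treat inner gradients as constants
        # (no gradients of gradients), largely to save compute and memory.
        params = [p - inner_lr * g.detach() for p, g in zip(params, grads)]
    # Outer loss: evaluate the adapted parameters on the episode's test set.
    return F.cross_entropy(forward(params, x_test), y_test)
```

An outer optimizer over `init_params` would call this once per episode, backpropagate the returned loss, and step; that backward pass is what trains the initialization.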
The other approach I mentioned is to say: let's ignore all of these existing machine learning algorithms, not take inspiration from them at all, and just apply some sort of black-box sequential neural network, trained such that it effectively learns a learning algorithm. One way of doing that is to set up a sequential prediction problem where you feed in, as the first steps of the sequence, the inputs and corresponding labels from the training set of the episode, and then only at the very end do you provide the input of a test set example from that episode. Of course you don't provide its label — otherwise it could cheat and just copy the label to the output — so you provide nothing there, just a bunch of zeros, but you have this sequential neural net that is then able to make a prediction for the label that's missing. Then you use modules we're familiar with that let us handle sequences of variable size; in their case they used dilated convolutions and self-attention layers — so here you have convolutions, attention, convolutions, and some more attention at the top. Without going into the details, what they found is that you can train this very black-box model and it actually does surprisingly well at making predictions for new problems — effectively learning some sort of learning algorithm. So that's another approach, and it could be that we'll design other forms of it; some people have previously looked at using memory networks, or LSTMs. SNAIL is right now kind of the best result in that category of approaches, but who knows — if the state of the art in how we make sequence-to-sequence predictions changes, this kind of approach might eventually become more successful.

So, on to the results. These are initial results we presented early in 2017 on this miniImageNet benchmark, which has kind of become the de facto benchmark people use in few-shot learning. It was actually originally proposed by Oriol Vinyals in the matching networks paper, but not all the details needed to reproduce the benchmark were provided, so we later reproduced it, and people can ask for it if they want to compare with our results. Essentially, it takes just a hundred classes from ImageNet and splits those 100 classes into 64 for training, 16 for validation, and 20 for testing. To generate episodes: if you're in the meta-training set, you take a random subset of five classes from those 64 classes (if you're doing five-way classification, which is all we'll do here). If you're doing one-shot, then for each of those classes you take a single randomly selected image and put it in the training set, and then in the test set of the episode you take — I think it's 10, I forget the exact number — a few more examples from the same five classes that were selected for that episode. If you're generating a validation episode, you instead draw the classes from the 16 validation classes, and from the 20 testing classes if you're meta-testing.
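For concreteness, here is a sketch of that episode-generation procedure for an N-way, K-shot setup — the names are hypothetical, and the number of test examples per class is just a parameter.

```python
# Sketch of N-way, K-shot episode sampling from a class-partitioned split
# (e.g. one of miniImageNet's 64/16/20 class splits). `images_by_class`
# maps each class id in the split to a list of its images.
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=10):
    classes = random.sample(list(images_by_class), n_way)
    train_set, test_set = [], []
    for label, cls in enumerate(classes):      # relabel the classes 0..n_way-1
        images = random.sample(images_by_class[cls], k_shot + n_query)
        train_set += [(img, label) for img in images[:k_shot]]
        test_set += [(img, label) for img in images[k_shot:]]
    return train_set, test_set
```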
The first thing we want to see is whether we can do better than fairly simple baselines, which essentially learn a representation on all the data from the 64 training classes in a very simple way — just doing 64-way classification on mini-batches drawn from those 64 classes — and then do something afterwards on the episodes of the meta-test set. One option is fine-tuning: say we do five steps of gradient descent from the pre-trained initialization. That actually doesn't work very well in our experiments, which suggests that end-to-end learning the initialization in the context where it's used — where you're doing a few gradient descent steps afterwards — is indeed valuable; it's really hard to do pre-training separately in a way that works well if you then do fine-tuning at meta-test time. The other baseline is, of course, to just use that representation — scrapping the output layer, using the top hidden layer — and then doing nearest-neighbor prediction in the episodes of the meta-test set. That actually works pretty well, but thanks to meta-learning we're able to edge out that performance, by a little bit up to a fair amount. These are the results for matching networks — FCE refers to using the more complicated encoding of the test and training set with the bi-directional RNN, which got them somewhat better results — and our meta-learner LSTM at that time was doing a bit better, in particular in the five-shot setting. Later on, more results were published using prototypical networks, MAML, and SNAIL; we have the results here. For me the conclusion is more or less that all of these results are in about the same ballpark. Certain design choices differ a bit between them: I think prototypical networks might use a slightly bigger neural net than, say, MAML; I know SNAIL used residual connections and a pretty big neural net, whereas MAML was trying to use the same number of parameters as matching networks and as our work. So these are not entirely comparable, and right now we're working on defining a new benchmark where we don't have these issues of not being able to compare with past results, and that's also more substantial than just a hundred classes. But all of these methods do seem able to edge out what would have been the approach of learning a representation and then using some other method afterwards.

I want to close by talking about challenges. The notion of how we define these episodes — how you define a distribution over training and test sets — is really arbitrary here. We make a pretty strong assumption, which is that all the problems we'll face at meta-test time are 5-class problems, and that they're going to be either one-shot or five-shot; in fact, in our case we train one model for one-shot and another model for five-shot. Designing methods that are actually robust to the number of ways and the number of shots in episodes is the natural next step for this research — because, of course, in the little demo I showed, the number of examples I was collecting with my laptop's webcam was unknown a priori; I didn't even know myself how many I was going to provide. Also, measuring our ability to generalize to new classes — getting a sense of how far we're able to go beyond the support of the classes we have at meta-training time — is something we need to study more. As I mentioned, we're currently working on a benchmark that would effectively contain multiple datasets: right now we took ImageNet and simulated many datasets from it, but in real life we would have really different datasets; the datasets we'd see at meta-test time would be different from what we've seen before. So we're trying to create a benchmark that looks a bit closer to that.

And finally, here we've only studied supervised classification, but we could go beyond this; there are other settings that are interesting. One of them is semi-supervised learning.
This is work we published at ICLR this year, where in our episodes we assume we don't only have a few labeled examples: we also have potentially 10 times more unlabeled examples, where these examples belong to any one of the classes of the problem — but also, and these are the minuses here, some images that are not from any of those classes, which we call distractors. In that paper — Meta-Learning for Semi-Supervised Few-Shot Classification, which I won't go over in detail because I don't have enough time — we extend prototypical networks to that setup. Effectively what we do is a two-step process: starting from the training set, we try to infer soft labels for the unlabeled set, include those soft labels back into how we compute the prototypes for each class, then look at how well these prototypes predict the labels in the test set of the episode, and backpropagate through this whole two-step process — and we find we can do better than not end-to-end learning that process.
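A rough sketch of that soft-label refinement step, in its basic form and ignoring the distractor handling from the paper: compute soft assignments of the unlabeled examples to the current prototypes, then recompute each prototype as a weighted mean of labeled and soft-assigned unlabeled embeddings. `counts` is assumed to hold the number of labeled examples per class.

```python
# Sketch of refining prototypes with unlabeled examples via soft assignments.
import torch
import torch.nn.functional as F

def refine_prototypes(f, prototypes, counts, x_unlabeled):
    z = f(x_unlabeled)                                           # (U, d) embeddings
    # Soft labels: softmax over negative squared distances to each prototype.
    soft = F.softmax(-torch.cdist(z, prototypes).pow(2), dim=1)  # (U, C)
    # Weighted mean: labeled sums plus soft-weighted unlabeled embeddings.
    num = prototypes * counts.unsqueeze(1) + soft.t() @ z        # (C, d)
    den = counts + soft.sum(dim=0)                               # (C,)
    return num / den.unsqueeze(1)
```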
We could also do few-shot distribution estimation. This is a paper from DeepMind where they imagined the problem: I give you a few examples from just one class of characters, and I want a system that produces a generative model that can generate more examples that look like they come from that distribution. This is also an ICLR paper. And I have this ongoing AI-ON project, which anyone can participate in — it's sort of an experiment in open research, where anyone can come in and say "I want to help out, I want to participate"; we have a GitHub, we have a Slack channel, and people come in, read the literature, and if they have time they try to contribute to our code base. There we try to apply some of these ideas of few-shot distribution estimation to music generation, where the idea would be that an artist comes in and says "I want to hear melodies that sound kind of like these", and they would immediately get a generative model that might give that artist some ideas for writing a new song.

You can end-to-end learn a data augmentation network — this is work from Facebook that I just find really interesting. They're using prototypical networks again, but now the idea is that we also train a model that takes one of the training examples, concatenates it with some noise, and produces a new image — or at least a vector that fits in the input space — which gets the same label as the original image. If we concatenate that to the training set of the episode and feed this into a prototypical network, we want to train this generator to improve the performance of the prototypical network. It might not actually generate things that look like images, but it generates things that help in defining the prototypes in such a way that you get better generalization — which was kind of a clever thing to do.

You can do program induction. The idea here, in this paper from NIPS, is that you have these Karel example tasks, where you're given the input to a program and the corresponding output — the program here is essentially one that moves this little guy, which in this case moves down and puts a little circle wherever there's no brick wall; that's this program here. Now assume you only observe the inputs and corresponding outputs of that program: you would like to extract, from many input/output pairs, a model that can take a new input and produce the right output. The only thing that's "program" about this is that we know the underlying function that created these targets comes from some language, from an actual program. What they showed is that you can use something somewhat similar to a prototypical network: a task encoder takes all the input/output pairs, encodes them in some space, does some pooling, and then conditions an output decoder that takes in a new input and now needs to produce the correct output. So this paper might be worth looking at — I realize I'm going fast, but I'm sort of out of time — and this is in the space of structured prediction.

There's also this other interesting problem where you might want to do few-shot image segmentation. Here the user provides just a few key points indicating what the positive class is and what is just background, and now you want a neural net that takes in this annotation and the corresponding image and can condition, on that, another network that takes a new image and needs to produce the full segmentation for that same category. In this case you see that the white dots are around the dog, so what you want is for the network to learn: OK, my representation of the task is that I should segment out the dogs in images. If you connect these two components, you can end-to-end learn the encoding of the task such that the network applied to the new image does dog segmentation and not other types of segmentation.

I think ultimately where we really want to go is a more interactive setting. I mentioned that the goal is to have someone provide examples, see the performance, correct it by providing new examples, and so on — and that might lead us to think about how to do meta active learning: maybe you want the system to also suggest "can you label this, because I'm not sure what you want here". Right now some people have looked at this, but not with a whole lot of success, I would say; I think it's still an open question whether we can learn an active learning policy, for instance. I did have some work published at NIPS where we were doing this in the context of recommendation, where for a given user you provide the items they've engaged with and didn't engage with, and that conditions a model that predicts whether they would engage with a new item. But I think we've barely scratched the surface of this interactive setting.

All right, so those are the challenges I encourage people to think about, and things I think our community should think about: defining new benchmarks that are a bit more realistic, going beyond supervised classification — I think there are tons of opportunities there — and also moving closer to this interactive setting. And with that, thank you all.