Thanks, Drew, for the intro, and thanks everyone for coming — so many of you, in fact; I really appreciate it. It's interesting you mentioned restricted Boltzmann machines. I'm kind of curious: how many people know what a restricted Boltzmann machine is? Just raise your hand. OK, this is actually more than I expected. I don't even know if I would teach it now, because it's barely used anymore. I do have a vested interest in RBMs: I have an online course, as some of you might know, on deep learning and neural nets, with a whole section on RBMs, and there aren't that many sections on RBMs online. So if they come back I might become even more famous — I think this is key for me to increase my profile. So if anyone here wants to make RBMs work really well on some important problem, please go for it; I'll support you, potentially even financially.

All right, but what I'll talk about today is my recent interest in a problem, few-shot learning, and particularly an approach that has become popular in the past year or two, based on an idea called meta-learning. This talk is actually not entirely focused on my own work; I thought I would try to present an overview of this topic, including the work of many other great people, and maybe give you a sense of what might be interesting challenges and opportunities for research that many of the young researchers here might want to engage in and investigate moving forward.

In terms of my trajectory, how I got to this particular problem: first of all, I noticed that the amazing success we've had with deep learning approaches to many problems related to AI all rests on the assumption that we have a lot of data available for these problems. If we truly believe in this trajectory we're trying to make progress on — towards AI that can solve any problem — then the assumption that we will actually be collecting hundreds of thousands or millions of labeled examples for every individual AI problem we'd like to solve is probably unrealistic, just as unrealistic as it was when we were talking about expert systems and thought that for any given problem we could just dump all the expert knowledge into a computer and that would give it the intelligence we'd expect.

Also, scientifically it's quite interesting: why is it that we as humans learn certain things very quickly, from very few examples or exercises? If you're at school, exercises are essentially training data for humans, and we can learn things very quickly from few examples, whereas our machines — I make the comparison of a really, really hard-working student that's also really, really dumb and slow: it just works its ass off through millions and millions of exercises and eventually gets it. That's not how we operate; that's not how smart we are. So there's this scientific question: why is that, and why are we not there yet?

In previous years of my research I've looked at one high-level idea, which is: for any given problem, let's try to leverage data from other sources that isn't labeled data for that specific problem. Early in my career, as a PhD student, I looked at using unlabeled data and doing unsupervised learning, and at using data that comes in multiple modalities.
If you have images and text, the text might serve as some sort of weak label for the image. But what I'll be talking about today is more what I would call multi-domain data: the idea of trying to do transfer learning, in some way, from problems where we have some data to problems where we don't.

The one setup to keep in mind for most of this talk is that the problem I want to address is one where the training set I'm given is, say, a five-class classification problem, and I have very few examples for each class — maybe one labeled example, maybe five. So in this case I have five classes — a dog, a lion, some bulls, and so on — and then I have a test set, and I need a way of producing, from this very small training set, a model that will generalize well to the images in the test set, like this lion here and this bull. That's the problem we're trying to address; we want to solve problems of this form.

As I mentioned, people are actually quite good at this. This is an example Josh Tenenbaum often gives: if I show you, for the first time, a Segway like this, and then you're given all of these different images and asked which ones are in the same category, I think most people would more or less get the right answer — in particular, in this case, identify this one as the other Segway. And machines are getting much better at it too — perhaps not yet in the setting of natural images, but, for instance, there's this paper from Josh Tenenbaum's lab where they used a form of meta-learning to get human-level performance on recognizing characters from various alphabets. So this progress seems to suggest we might start having the tools to make significant headway on what I'm after.

There's one scientific mission, which is to close this gap between humans and machines, and there's also a very practical consideration: it would be nice if we, as users, had a very simple way of defining classifiers and predictors, where we could just provide a few labels and get a pretty accurate classifier. One example of what this might look like, which I like to give, is from this website Teachable Machine, a web experiment Google has come up with. What I'll show you is just me having fun with my kids, defining a training image classifier with three classes — the green, the purple, and the orange class — where the green is going to be someone smiling, the purple someone yawning, and the orange someone being scared. So what I do is collect data: while I'm doing something in front of my webcam, I capture the examples for each class by pressing the button, and when you see the colors changing, that's the prediction of the classifier. You can see that once it has data about me, it's doing pretty well — you can see my acting skills on display here. But then of course if you go further out of the input distribution and ask my daughter, the classifier is totally broken; it just doesn't work at all, because it has only about ten examples for each class. But then, if she starts collecting some examples of herself — smiling, yawning, and being scared — then it sort of gets it after that. Ideally what you'd want is a system where you can do this for a few people, a few situations, a few contexts, and then it would be able to actually generalize well outside of that distribution.
So, say, if I ask her sister to come along — she was a bit less shy — in her case it sort of works: it gets yawning, and it gets being scared. That's an example of the kind of generalization we'd like to achieve — and granted, these are my daughters, so they're supposed to look somewhat alike, assuming they share some of my genes, which I hope they do. Now obviously this is a pretty hard project; it took us a while to make that demo work. We tried a bunch of different visual classes we might want to define, and a bunch of them just were not working. So for natural images, even that fairly restricted setting is already quite challenging. But you can see that if we had a system like this — one thing deep learning has allowed is making computer vision methods much more robust and more easily usable by many people, including people in industry, which is why there's been all this commercial interest — and if we could further reduce the barrier by not even having to collect tons of examples for any given classification problem, I think this could have a really big impact in industry as well. So those are the dual motivations: scientifically, I think there's something to learn about ourselves and about intelligence, and practically, this would be really useful.

All right, so the problem we'll be looking at: we're going to try to address directly this problem of few-shot learning. What I mean by that is a problem where, from a small training set of input and label pairs, where the number of examples L is very small, we want a learning algorithm that is able to produce, say, the parameters of a predictor or model M that will generalize well to new examples that weren't in this training set. The constraint we're putting on the setting is that this training set will be small, and that's the setting in which we're interested in the performance of this learning algorithm.

One approach that has become quite popular in the past one or two years is, instead of using all of our knowledge about how we might do this — say, what properties of images might be useful for achieving this strong generalization from few examples — to just try to learn that learning algorithm. Let's treat the learning algorithm A, which takes a set and produces parameters for a model, as something we can train, and then train it on examples of such problems. For that reason we call this either learning to learn or, more often, the approach of meta-learning. That's what I'll talk about now.

The problem of few-shot learning itself is definitely not new. Transfer learning in general is a problem people have looked at, and it's a concept that has been quite useful: there are these well-known results where, if you train a model on ImageNet, it's been shown to transfer really well to different image classification problems, and in fact to even surpass the hand-engineered features that had been designed in the computer vision community for a long time. Here, in few-shot learning, we're still going to rely on this idea of doing transfer learning to new datasets, but we're in a sense trying to transfer not just the features but the whole process of going from a training set to a predictor that I can apply.
And again, few-shot learning — also in the form of one-shot learning — was the subject of a bunch of papers a bit more than ten years ago, where people were looking at this and making some progress, and a lot of that progress was again largely based on designing features. A lot of the core of deep learning is just deciding, whenever we can, to train a system end to end on data — that's really one of the key reasons this approach has been working pretty well — and in a sense, in meta-learning, we go one step beyond and say: let's learn the learning algorithm end to end. We have pretty good tools now to create computer programs that we can backpropagate through; can we leverage that technology, but now to learn a learning algorithm?

All right, so let's start to put some notation around this problem. If we're going to talk about meta-learning, we should first talk about just learning. A learning algorithm is really an algorithm whose input is a training set of input and target pairs, and whose output is a model — say the parameters of some model, which I'll sometimes call the learner. A good learning algorithm is one that, for a given training set, produces a model that has good performance on a test set from the distribution of interest. That's a very basic definition of a learning algorithm.

So what should a meta-learning algorithm be? It's a learning algorithm itself, but in this setting its input is going to be a training set that we'll call a meta-training set, in the sense that it's actually going to be a training set of datasets, where each dataset is a split into a training set and its corresponding test set, and we have a bunch of these examples. These training/test set pairs — we'll often call them episodes — will be the examples on which our meta-learning algorithm runs and on which we'll be training a learning algorithm. A meta-learning algorithm has parameters, but these are the parameters of the learning algorithm A, which takes a training set and produces the parameters of a model for that corresponding problem. And a good meta-learning algorithm is one whose performance is good on new episodes — on new training and test set splits for other problems. Maybe a visual illustration of this will be more useful, but before I go there, this notion of measuring the generalization of a meta-learning algorithm is, I think, quite important.

A small tangent: there are other problems that some people might be tempted to call meta-learning problems. One of them is doing what's called architecture search, where you take a neural net and you optimize the number of layers and the connectivity of that network. My claim is that if you don't take a meta-learning procedure and evaluate it on new problems, then it's not really meta-learning; it's more like optimization. If you're working on CIFAR-10 and you're optimizing the architecture for CIFAR-10 to get good performance on the test set of CIFAR-10, that's closer to optimization than it is to meta-learning. Meta-learning would be taking that procedure that optimizes the hyperparameters and the architecture of a neural net, training it on, say, CIFAR-10, and then showing that it actually works well on some other image classification problem, or some other machine learning problem. That, to me, is meta-learning: there needs to be this notion of generalization to new problems.
Sorry — yes? Yes, so the question is about the relationship between these problems in meta-training and meta-testing. I think this will get clearer as I move forward, but that's actually a problem in itself, and something where I think we still need to iterate on how we define it.

So, visually, here is what this problem looks like. As I mentioned, what we want is essentially a learning algorithm that generalizes well. (You didn't open your chips before — let's all wait for Dhruv to open his chips. OK, all right.) In meta-learning, what we want to do, as I said, is tackle this particular problem: from a small training set, produce parameters that work well on its corresponding test set. These problems will be in our meta-test set, and we'll assume we have a meta-training set of other episodes, where an episode is this grayed-out capsule here, containing a training and test set split. In particular, the way we usually set up the problem, we make sure that the meta-training set has images of classes that are different from the classes in the meta-test set, and that's the extent to which we're measuring generalization; we're asking, in some sense, for an extrapolation to classes it has never seen before. These examples come from what's called the miniImageNet benchmark, which is essentially a subset of ImageNet where we've taken 64 classes to construct the episodes for meta-training, and then another set of 20 classes used for the images in the meta-test set. So here you can see that piano is one of the classes in the episodes of the meta-training set, but you will not find any pianos in any of the images of the meta-test set.

In meta-learning, what we're really doing when meta-training is iterating over this meta-training set. We go over the first episode, and we have our meta-learner, which is where the parameters of our meta-learning approach sit — those are the parameters we're learning. We feed the training set into the meta-learner, and conceptually that meta-learner is going to produce a learner, by outputting parameters for it, or produce a model that I can now apply: one that takes a test input and makes a prediction for its label. To set up my meta-learning training objective, I define a loss on the learner's outputs on the test set examples of that episode, and I backpropagate backwards through these arrows — I find a way of adjusting the parameters that sit in the meta-learner to optimize that loss. So we're trying to get as close as we can to the problem of generalization: we're optimizing a meta-learner that produces models M that actually generalize to test examples that were not fed as input to the learning algorithm. And then, if we do gradient descent, which is usually what we do, we do an update on the meta-learner's parameters and move on to the next episode.
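To make that loop concrete, here is a minimal sketch of episodic meta-training in PyTorch — not the code from any particular paper. The `meta_learner` module is a hypothetical stand-in that maps an episode's training set to predictions on the episode's test inputs (any of the approaches discussed later could play that role), and the cross-entropy loss used here is the one that will be made explicit in a moment.

```python
# Minimal sketch of the episodic meta-training loop (hypothetical names).
import torch
import torch.nn.functional as F

def meta_train(meta_learner, episodes, lr=1e-3):
    opt = torch.optim.Adam(meta_learner.parameters(), lr=lr)
    for x_train, y_train, x_test, y_test in episodes:
        # The meta-learner consumes the episode's training set and produces
        # predictions (logits) for the episode's test inputs.
        logits = meta_learner(x_train, y_train, x_test)
        # The loss is measured on test examples that were never fed to the
        # learning algorithm -- a proxy for generalization within the episode.
        loss = F.cross_entropy(logits, y_test)
        opt.zero_grad()
        loss.backward()   # backpropagate through the learning procedure itself
        opt.step()        # gradient update on the meta-learner's parameters
```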
There, the meta-learner gets a different training set, produces a different model, we look at the loss on that episode's test set examples, and we backpropagate into the meta-learner again.

Are there any questions about this setup? I'm sure there are — does anyone have the courage to ask? Yes? Yes — the meta-learner is not, explicitly at least, taking a learner as input; it does not have access to a definition of the learner, at least not in some structured format. We'll see a bunch of examples of meta-learners. There are some approaches to meta-learning right now that will not let you change the family of the model at meta-test time — they will just assume it only produces parameters for, say, a ConvNet with four layers; approaches like MAML are of that nature. Right now I can't really think of any approach in meta-learning where the type of learner drastically changes between episodes, and down the line that's definitely something we want to consider and investigate, but yes, often they make fairly strong assumptions about the nature of the learner, such as knowing what parameters it has. Yes — there's a meta-validation set, with sixteen other classes that are different from both meta-training and meta-testing. And yes, absolutely, it's essentially a cross-entropy loss that we'll be using; we'll see that in a bit.

All right, a quick note on some nomenclature you'll find in papers. The notation I like to use is to talk about, within an episode, a split into a training set and a test set, and then we have this notion of a meta-training set that we train the meta-learner on and a meta-test set for measuring the meta-learner's generalization. It follows the concept that what we're training here is a learning algorithm. It gets pretty hairy, both in writing a paper and in presenting — as I found in my previous presentations — to have both training sets and meta-training sets; sometimes you get confused about which one you're referring to. So you'll find some papers that, instead of talking about the training set of an episode, call it the support set, and the test set of an episode is instead a query set, and then you can use "training set" and "test set" for these sets of episodes. That notation is definitely clearer; I just don't find it as cool, so I tend to avoid it — not the best reason — but I will try to stick with my notation. Just be aware that when you read a meta-learning paper and it talks about the support set and the query set, it's referring to this split within each episode: the support set is the set of examples fed as input to the learning algorithm, and the query set is the test examples on which you evaluate the loss for that episode.

So if you have this procedure that takes a training set and produces a model, and that model defines a distribution over the possible labels for a new test input in my episode, one way of writing it is that this model is effectively also conditioning on the training set — that's what this whole procedure is; you can think of it as giving you a conditional distribution over the label of a new input given the full training set. Then one thing we can do for the loss of an episode is to optimize the average negative log-probability of those label assignments — the cross-entropy loss, as we often call it — for that episode, averaging over all test set examples in that episode.
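Written out — my reconstruction of the loss just described, with θ denoting the meta-learner's parameters — the objective for a single episode is:

```latex
\mathcal{L}(\text{episode}) \;=\; -\,\frac{1}{|D_{\text{test}}|}
\sum_{(\hat{x},\,\hat{y}) \,\in\, D_{\text{test}}}
\log p_{\theta}\!\left(\hat{y} \mid \hat{x},\, D_{\text{train}}\right)
```

Meta-training then minimizes the average of this quantity over the episodes of the meta-training set.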
Now, depending on the choice of meta-learner, across the various approaches published so far, what really changes is essentially this: we're going to do gradient descent on that objective, and the approaches just use different definitions of that conditional distribution.

So what I'll do for most of this presentation is go over the various approaches that came early on and are among the most successful ones; they also span the different directions one might take on this problem, so I think they're good things to know about if you want to dive into this literature. I like to think of there being two categories of approaches. One of them says: let's not throw out all the literature on machine learning algorithms; let's instead take inspiration from actual hand-designed machine learning algorithms — the structure they take and how they function — but introduce some parameters into them and train those parameters with meta-learning. Some take inspiration from what a k-NN predictor or a kernel machine might look like — that's essentially matching networks. There's a relationship between assuming a Gaussian classifier in some learned latent space and training prototypical networks, which is another approach I'll talk about. And another one: a lot of our training algorithms are essentially gradient-descent based, so let's use gradient descent as the model for the meta-learner — I'll talk about that, and MAML is an example. Then there's the other approach, which says: maybe we're wrong about what a learning algorithm should look like, and maybe it should just be some black-box, generic neural network that we train to make these conditional-distribution predictions — I have a new input, I have a training set, and some black-box neural net produces the softmax over the labels. I'm not going to talk about MANN so much, though it was one of the early examples of doing that; what I'll cover instead is SNAIL, which is more recent and up there among the state-of-the-art approaches.

But let's first start with the first family, where we take inspiration from known learning algorithms. One of the first approaches proposed was matching networks, coming from DeepMind, from Oriol Vinyals. There, the idea is that we assume the label y is predicted essentially by taking a linear combination of the labels I have in the training set of my episode, where each weight is a function of the input associated with the label being weighted and of the test input I'm trying to make a prediction for. Here the y's are essentially one-hot vectors, so if the weights sum to one, you get a distribution over the label of x̂, where x̂ is one of the test set inputs in my episode. One way of parametrizing this weight — essentially the vote by x_i for its actual label in the training set — is to do a softmax over some comparison function between a representation of the test input x̂ and some other representation of the training set input x_i.
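Here is a minimal sketch of that prediction rule, assuming separate embedding networks f and g for the training and test inputs and cosine similarity as the comparison function; the fuller "full context" encodings discussed next are left out.

```python
# Minimal sketch of the matching-network prediction rule (PyTorch).
# `f` and `g` are assumed embedding networks; cosine similarity is one
# possible comparison function between test and training embeddings.
import torch
import torch.nn.functional as F

def matching_net_predict(f, g, x_train, y_train, x_test, n_classes):
    z_train = F.normalize(f(x_train), dim=1)          # (N, d) support embeddings
    z_test = F.normalize(g(x_test), dim=1)            # (M, d) query embeddings
    sims = z_test @ z_train.t()                       # cosine similarities (M, N)
    attn = F.softmax(sims, dim=1)                     # attention over training points
    y_onehot = F.one_hot(y_train, n_classes).float()  # (N, C) one-hot labels
    return attn @ y_onehot                            # (M, C) distribution over labels
```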
They've looked at various architectures for representing both functions g and f — wait, that's g; I said j, sorry, French hiccup. They've looked at simply using different neural nets for the two; they've also looked at — this is the dotted-line arrow here — having the representation of the test input modulate how you process the full training set, and at defining an actual RNN that goes, in some arbitrary order and in both directions, over all the inputs in the training set, trying to make those two representations as expressive as they can. They actually found they could get better results by using this sort of bi-directional RNN over the set, ordered in some arbitrary order. So that's essentially end-to-end learning a pattern matcher, so that it's able to predict the right label for a new example.

An approach that's one of my favorites, because it's super simple — so it's a good first approach to consider as a meta-learning approach; it's quite fast to train and quite simple — is prototypical networks. There, what we're trying to learn is, per class, a prototype, and the way we classify some test example x (in the previous notation that would be x̂) is that we take the test input, map it into some representation space, compare that with the prototype of each class using some negative distance function — could be Euclidean, could be cosine — and then pass these negative distances, or similarities, through a softmax; that's how we get our distribution over the labels. For the prototypes themselves, you could imagine various architectures for extracting a prototype for each class; here they use something very simple: you take some convolutional neural net that maps to a vector space, and you take the per-class average of those vectors — averaging over all examples (x, y) where y corresponds to class k. So quite simple. Here the parameters of the meta-learner are really just the representation, because we've fixed how that representation is used. Effectively you're end-to-end learning that representation such that it works well when used in what is essentially a Gaussian classifier — where in the Gaussian classifier you would take the average representation of each class and then look at the likelihood of a test example under the Gaussian with each of those means, and use that to make a prediction. If you use the squared Euclidean distance here, you recover essentially a Gaussian classifier on that representation, but you're learning the representation explicitly such that it works well for a Gaussian classifier — and in that sense it's effectively a form of meta-learning.
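A minimal sketch of prototypical-network classification with the squared Euclidean distance; `f` is the embedding network, which is the only thing being meta-learned here.

```python
# Minimal sketch of prototypical-network classification (PyTorch).
import torch
import torch.nn.functional as F

def proto_net_logits(f, x_train, y_train, x_test, n_classes):
    z_train = f(x_train)                             # (N, d) support embeddings
    z_test = f(x_test)                               # (M, d) query embeddings
    # One prototype per class: the mean embedding of that class's examples.
    prototypes = torch.stack([
        z_train[y_train == k].mean(dim=0) for k in range(n_classes)
    ])                                               # (C, d)
    # Logits are negative squared Euclidean distances to each prototype;
    # a softmax over these gives the distribution over labels.
    return -torch.cdist(z_test, prototypes).pow(2)   # (M, C)
```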
So, as I mentioned, we effectively get a Gaussian classifier if we use the squared Euclidean distance here — which often actually works surprisingly well. You can also show that the prototypes are then effectively acting as the output weights of a neural net applied to the test example: the embedding function is like the first few layers of the model you apply on test examples, and the prototypes provide the output weights. That's because, with the squared Euclidean distance, you can factor it into a term that multiplies the representation with the prototypes, plus a constant term that acts as a bias. So you can also think of it as producing weights — but only the output weights — of a neural net that you apply on the test set of a given episode.

Yes? Right — so, how you get a prototype: I think there's presumably some innovation that could be done there. Why have just one prototype per class? That's a very simplifying assumption. Why use this average pooling? You can think of it as doing average pooling at the level of classes; you could do other types of pooling — max pooling, other things.

The third approach within this family — taking inspiration from learning algorithms that exist — is to take inspiration from gradient descent itself. Right now one of our most successful learning algorithms is stochastic gradient descent: we come up with some parametrization of our learner, our model, and then, as you know, in gradient descent we initialize the parameters θ_0 of our model and repeatedly apply this update rule, where this is the loss as measured on a mini-batch, this is the gradient of that loss at my current parameter values, and I move in the opposite direction of the gradient with some step size in parameter space, which gives me the new parameters. Well, that formula might be a good inspiration for some sort of recurrent neural net that is iterated and unrolled in the space of parameters, and that was our inspiration in this ICLR paper with Sachin Ravi in 2017, where we noticed the similarity between this and, for instance, an LSTM state. If you're familiar with LSTMs, you have this cell state, which is updated by taking the current cell state, possibly modulating it with a forget gate, and then adding an update to the cell state — c̃_t — which is modulated by an input gate. The input gate, the forget gate, pretty much everything here is produced by a form of recurrent neural network. Now, if we make this cell state be the parameter vector of my model — say the ConvNet I want to train to do image classification on some problem — so c_t becomes θ_t, and let's say I just fix f_t to be one (so no forget gate), and i_t is effectively going to be my learning rate, and I impose that the cell-state update be the negative gradient of the loss on the training data in my episode with respect to my parameters, then I essentially recover the same equation.

And we're able to train LSTMs fine — very well, in fact — so that suggests we should be able to backpropagate through a few iterations of gradient descent. That was the inspiration here: we effectively have an LSTM, which is why we called our method the meta-learner LSTM, but the cell state is imposed to be the parameter vector of some ConvNet, and the cell-state update is imposed to be the negative gradient of the ConvNet parameters on the loss of the current mini-batch. Otherwise we use regular input and forget gates. We decided to learn the forget gate because it acts kind of like a regularizer: since it multiplies by a term that might be smaller than one, it effectively shrinks the current parameter values a bit towards zero. And the input gate is effectively the learning rate — to what extent you want to move fast. So the meta-learner could potentially learn to both optimize and regularize, depending on what it does with the input and forget gates.
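To make the analogy concrete, here are the two updates side by side, with the substitutions just described:

```latex
\text{LSTM cell state:}\quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\qquad\qquad
\text{SGD:}\quad \theta_t = \theta_{t-1} - \alpha_t \,\nabla_{\theta_{t-1}} \mathcal{L}_t
```

Setting c_t ← θ_t, f_t ← 1, i_t ← α_t, and c̃_t ← −∇θ L_t makes the two coincide; the meta-learner LSTM then lets the input and forget gates be learned instead of fixed.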
More visually — I think I've said all of this already — here is what it looks like. For a given episode, where I have a training set and a test set, I start with some initial values of the parameters of my ConvNet: all the convolutional filters and the fully connected layers. For the output weights, there's no real transfer that can happen there, because for a given episode, if we have five classes, we randomly order them and assign them labels — integers from 1 to 5 — so there's no real structure in the output weights we can learn; we just initialize those to 0. Then we ask: for that initial value of the parameters, what's my loss on my first mini-batch of inputs X and labels Y? Since we're doing few-shot learning, we actually use the full training set — in our experiments we use at most 5 classes and 5 examples per class, so that's a very small mini-batch already. So I do the forward pass with θ_0 on X, look at the loss with respect to Y, and that gives me a gradient of my loss. Now I feed that gradient to the meta-learner, which combines it with the current value of the parameters θ_0 to give me a new value of my ConvNet's parameters, θ_1. Then I ask: with θ_1, what's the gradient for my second mini-batch — which, in few-shot learning, is again just the training set, as I mentioned — and I get a new gradient, which I combine with θ_1 to give me a new parameter value θ_2, and I iterate this, unrolling my graph, until I get a final value of the parameters, θ_T. Then I take θ_T, and now I look at the loss — but now on the inputs of the test set of my episode, examples that were not used in computing any of the gradients and were not used anywhere in this forward pass — and that's the loss term here. What I do then is backpropagate through this whole recursive graph, much like I would backpropagate through an RNN or LSTM.

We actually do block some gradients: the gradient we feed in here is effectively a function of the parameters of my meta-learner, so I would have to compute gradients of gradients, which is computationally expensive and can take a lot of memory. We find in practice that ignoring those gradients is sometimes fine — in the particular results I'll be presenting, we were able to get good results — so we block those gradients, largely for compute-efficiency reasons. Then we backpropagate into all the parameters of my meta-learner. Those parameters are the parameters inside the thing that's giving me my input gate and my forget gate, but they're also going to be the initial parameters: where do I start my optimization, what is the initial value of my ConvNet before I do a few steps of gradient descent? In fact, a lot of the transfer happens from there. You can think of this trained initial parameter vector as essentially a parameter vector that's close to the solution across all the episodes, and then these few gradient steps for a given episode just bring me close to the right solution for that particular episode. So a lot of transfer learning actually happens through that trained initialization.
Another way of thinking of it is that we're end-to-end learning this pre-train-then-fine-tune pipeline. In pre-training you do a bunch of training on some data, then you fix those parameters and just do a few gradient descent steps, but on a new dataset that isn't necessarily the one you pre-trained on. Here we're essentially end-to-end training the whole thing: we're training the initialization such that, if I do a bit of fine-tuning, I'm going to get better results — because it's trained end to end to do that well.

Yes? Right — so the question is whether we've looked at whether what is recovered in terms of the initialization reflects some of the built-in knowledge about training these neural nets that we've discovered over the years. That's a great question. We haven't looked at what it was learning in terms of the initialization. We did look at what the input gate and the forget gate were learning, and it was not interpretable — it was really hard to make sense of, in fact — so it's not clear what it's learning. One inspiration for this paper was a previous paper by David Duvenaud and others at Harvard, where they backpropagated through multiple steps of gradient descent, but they did it on large datasets. Here we only backprop through something like twelve steps of gradient descent — really just a few steps on a given episode — largely for compute reasons, but also because, in a few-shot learning setting, we suspect we shouldn't do too many steps of gradient descent, otherwise we risk overfitting. But that earlier work was meant to see what kind of knowledge could be extracted by doing this. They might have looked at properties of the initialization, I don't remember, but they did look at the learning rates that were learned, and they found something interesting: unintuitively, what seemed optimal was, I think, to train the top layer a lot first and then — actually, now I'm not sure whether what I'm saying is right; I know I have the slide somewhere. Yes: the last layer's learning rate would jump up early on and then go back down — this is just the learning rate trained in an end-to-end way — and then later on it would instead crank up the learning rate of the first layer. Now, this is almost certainly not globally optimal, but it was kind of interesting, because this is not necessarily common wisdom: the fear when training a deep neural net is that initially the features are random, so the output weights can't really learn anything meaningful until the features have been learned. So seeing this was kind of surprising.

OK, yes? So the way we do it: for the mini-batches in this approach in particular, since we're using the full gradient on the mini-batch, it's order-agnostic, because we take an average over those examples. Matching networks would be the approach that uses an RNN — a bi-directional RNN — over the set of inputs, so it has to be order-dependent: I'm sure that if, for a given episode, you take the training set and try different orders, you're going to get different predictions.
In fact, my guess is that if you want to slightly improve your results at meta-test time, you could take an episode, try multiple random orderings of the input examples, and just average the predictions; I wouldn't be surprised if that were a valuable form of variance reduction that gives you somewhat better results. In our setting, because we're using the full-batch gradient when computing the gradient that goes into the meta-learner LSTM, it's actually agnostic — it's not order-dependent. Right — as for the order of the episodes themselves, I don't know that it is more dependent on order than regular stochastic gradient descent is dependent on the order of mini-batches; I don't have a reason to think of it as more or less dependent. But what we did find is interesting: if you take a regular approach to learning a representation — which would be to train on all of the data present in all the episodes — and then use that to initialize, say, this approach, or prototypical networks, or something like that, we do tend to find that we get better results. This is more recent; in the past few months there have been a few results suggesting this, which suggests there are properties of this optimization problem, set up this way, that are maybe challenging and not well understood yet. So that would be the only reason I'd think there might be something more clever one wants to do in terms of how to create these episodes for training. It's a great question.

OK — I'll go quickly over MAML. MAML came out a few months after our ICLR paper, and what's interesting is that what they effectively did is remove the LSTM parts — the input and forget gates. So instead of trying to learn a model that produces the regularizer and the learning rate at each time step, just remove that, assume you're doing constant-learning-rate updates, but still learn the initialization. And it gets as good results, and often better; it even seems to be more stable, which is interesting, because you'd think this is a pretty small-capacity meta-learner. There are even theoretical results from Chelsea Finn at Berkeley, where she could show that this restricted form of meta-learner is actually a universal approximator under some assumptions — which again suggests that there is an inductive bias, but not one that removes the potential for universal approximation. So if you were to use a gradient-based method like this, I would not use the meta-learner LSTM, which it turns out was a bit overly complicated; plain constant-learning-rate gradient descent was sufficient to do meta-learning and learn a good initialization.
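As a rough sketch of what such a gradient-based meta-learner looks like in code, here is a first-order, MAML-style inner/outer step — the names are hypothetical, and it uses the same kind of gradient blocking (no gradients of gradients) mentioned earlier for the meta-learner LSTM, rather than full second-order MAML. `forward(params, x)` is an assumed functional forward pass of the learner given its list of parameter tensors, and `init_params` is the learned initialization.

```python
# Sketch of a first-order MAML-style meta-training step (PyTorch).
import torch
import torch.nn.functional as F

def maml_episode_loss(forward, init_params, episode, inner_lr=0.01, inner_steps=5):
    x_train, y_train, x_test, y_test = episode
    # Inner loop: a few plain gradient steps starting from the shared initialization.
    params = [p.clone() for p in init_params]
    for _ in range(inner_steps):
        loss = F.cross_entropy(forward(params, x_train), y_train)
        grads = torch.autograd.grad(loss, params)
        # First-order approximation: treat inner gradients as constants
        # (no gradients of gradients), largely to save compute and memory.
        params = [p - inner_lr * g.detach() for p, g in zip(params, grads)]
    # Outer loss: evaluate the adapted parameters on the episode's test set.
    return F.cross_entropy(forward(params, x_test), y_test)
```

An outer optimizer over `init_params` would call this once per episode, backpropagate the returned loss, and step; that backward pass is what trains the initialization.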
The other approach I mentioned is to say: let's ignore all of these existing machine learning algorithms, not take inspiration from them at all, and just apply some sort of black-box sequential neural network, trained such that it effectively learns a learning algorithm. One way of doing that is to set up a sequential prediction problem where you feed in, as the first steps of the sequence, the inputs and corresponding labels from the training set of the episode, and then only at the very end do you provide the input of a test set example from that episode. Of course you don't provide its label — otherwise it could cheat and just copy the label to the output — so you provide nothing there, just a bunch of zeros, but you have this sequential neural net that is then able to make a prediction for the label that's missing. Then you use modules we're familiar with that let us handle sequences of variable size; in their case they used dilated convolutions and self-attention layers — so here you have convolutions, attention, convolutions, and some more attention at the top. Without going into the details, what they found is that you can train this very black-box model and it actually does surprisingly well at making predictions for new problems — effectively learning some sort of learning algorithm. So that's another approach, and it could be that we'll design other forms of it; some people have previously looked at using memory networks, or LSTMs. SNAIL is right now kind of the best result in that category of approaches, but who knows — if the state of the art in how we make sequence-to-sequence predictions changes, this kind of approach might eventually become more successful.

So, on to the results. These are initial results we presented early in 2017 on this miniImageNet benchmark, which has kind of become the de facto benchmark people use in few-shot learning. It was actually originally proposed by Oriol Vinyals in the matching networks paper, but not all the details needed to reproduce the benchmark were provided, so we later reproduced it, and people can ask for it if they want to compare with our results. Essentially, it takes just a hundred classes from ImageNet and splits those 100 classes into 64 for training, 16 for validation, and 20 for testing. To generate episodes: if you're in the meta-training set, you take a random subset of five classes from those 64 classes (if you're doing five-way classification, which is all we'll do here). If you're doing one-shot, then for each of those classes you take a single randomly selected image and put it in the training set, and then in the test set of the episode you take — I think it's 10, I forget the exact number — a few more examples from the same five classes that were selected for that episode. If you're generating a validation episode, you instead draw the classes from the 16 validation classes, and from the 20 testing classes if you're meta-testing.
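For concreteness, here is a sketch of that episode-generation procedure for an N-way, K-shot setup — the names are hypothetical, and the number of test examples per class is just a parameter.

```python
# Sketch of N-way, K-shot episode sampling from a class-partitioned split
# (e.g. one of miniImageNet's 64/16/20 class splits). `images_by_class`
# maps each class id in the split to a list of its images.
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=10):
    classes = random.sample(list(images_by_class), n_way)
    train_set, test_set = [], []
    for label, cls in enumerate(classes):      # relabel the classes 0..n_way-1
        images = random.sample(images_by_class[cls], k_shot + n_query)
        train_set += [(img, label) for img in images[:k_shot]]
        test_set += [(img, label) for img in images[k_shot:]]
    return train_set, test_set
```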
The first thing we want to see is whether we can do better than fairly simple baselines, which essentially learn a representation on all the data from the 64 training classes in a very simple way — just doing 64-way classification on mini-batches drawn from those 64 classes — and then do something afterwards on the episodes of the meta-test set. One option is fine-tuning: say we do five steps of gradient descent from the pre-trained initialization. That actually doesn't work very well in our experiments, which suggests that end-to-end learning the initialization in the context where it's used — where you're doing a few gradient descent steps afterwards — is indeed valuable; it's really hard to do pre-training separately in a way that works well if you then do fine-tuning at meta-test time. The other baseline is, of course, to just use that representation — scrapping the output layer, using the top hidden layer — and then doing nearest-neighbor prediction in the episodes of the meta-test set. That actually works pretty well, but thanks to meta-learning we're able to edge out that performance, by a little bit up to a fair amount. These are the results for matching networks — FCE refers to using the more complicated encoding of the test and training set with the bi-directional RNN, which got them somewhat better results — and our meta-learner LSTM at that time was doing a bit better, in particular in the five-shot setting. Later on, more results were published using prototypical networks, MAML, and SNAIL; we have the results here. For me the conclusion is more or less that all of these results are in about the same ballpark. Certain design choices differ a bit between them: I think prototypical networks might use a slightly bigger neural net than, say, MAML; I know SNAIL used residual connections and a pretty big neural net, whereas MAML was trying to use the same number of parameters as matching networks and as our work. So these are not entirely comparable, and right now we're working on defining a new benchmark where we don't have these issues of not being able to compare with past results, and that's also more substantial than just a hundred classes. But all of these methods do seem able to edge out what would have been the approach of learning a representation and then using some other method afterwards.

I want to close by talking about challenges. The notion of how we define these episodes — how you define a distribution over training and test sets — is really arbitrary here. We make a pretty strong assumption, which is that all the problems we'll face at meta-test time are 5-class problems, and that they're going to be either one-shot or five-shot; in fact, in our case we train one model for one-shot and another model for five-shot. Designing methods that are actually robust to the number of ways and the number of shots in episodes is the natural next step for this research — because, of course, in the little demo I showed, the number of examples I was collecting with my laptop's webcam was unknown a priori; I didn't even know myself how many I was going to provide. Also, measuring our ability to generalize to new classes — getting a sense of how far we're able to go beyond the support of the classes we have at meta-training time — is something we need to study more. As I mentioned, we're currently working on a benchmark that would effectively contain multiple datasets: right now we took ImageNet and simulated many datasets from it, but in real life we would have really different datasets; the datasets we'd see at meta-test time would be different from what we've seen before. So we're trying to create a benchmark that looks a bit closer to that.

And finally, here we've only studied supervised classification, but we could go beyond this; there are other settings that are interesting. One of them is semi-supervised learning.
This is work we published at ICLR this year, where in our episodes we assume we don't only have a few labeled examples: we also have potentially 10 times more unlabeled examples, where these examples belong to any one of the classes of the problem — but also, and these are the minuses here, some images that are not from any of those classes, which we call distractors. In that paper — Meta-Learning for Semi-Supervised Few-Shot Classification, which I won't go over in detail because I don't have enough time — we extend prototypical networks to that setup. Effectively what we do is a two-step process: starting from the training set, we try to infer soft labels for the unlabeled set, include those soft labels back into how we compute the prototypes for each class, then look at how well these prototypes predict the labels in the test set of the episode, and backpropagate through this whole two-step process — and we find we can do better than not end-to-end learning that process.
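A rough sketch of that soft-label refinement step, in its basic form and ignoring the distractor handling from the paper: compute soft assignments of the unlabeled examples to the current prototypes, then recompute each prototype as a weighted mean of labeled and soft-assigned unlabeled embeddings. `counts` is assumed to hold the number of labeled examples per class.

```python
# Sketch of refining prototypes with unlabeled examples via soft assignments.
import torch
import torch.nn.functional as F

def refine_prototypes(f, prototypes, counts, x_unlabeled):
    z = f(x_unlabeled)                                           # (U, d) embeddings
    # Soft labels: softmax over negative squared distances to each prototype.
    soft = F.softmax(-torch.cdist(z, prototypes).pow(2), dim=1)  # (U, C)
    # Weighted mean: labeled sums plus soft-weighted unlabeled embeddings.
    num = prototypes * counts.unsqueeze(1) + soft.t() @ z        # (C, d)
    den = counts + soft.sum(dim=0)                               # (C,)
    return num / den.unsqueeze(1)
```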
We could also do few-shot distribution estimation. This is a paper from DeepMind where they imagined the problem: I give you a few examples from just one class of characters, and I want a system that produces a generative model that can generate more examples that look like they come from that distribution. This is also an ICLR paper. And I have this ongoing AI-ON project, which anyone can participate in — it's sort of an experiment in open research, where anyone can come in and say "I want to help out, I want to participate"; we have a GitHub, we have a Slack channel, and people come in, read the literature, and if they have time they try to contribute to our code base. There we try to apply some of these ideas of few-shot distribution estimation to music generation, where the idea would be that an artist comes in and says "I want to hear melodies that sound kind of like these", and they would immediately get a generative model that might give that artist some ideas for writing a new song.

You can end-to-end learn a data augmentation network — this is work from Facebook that I just find really interesting. They're using prototypical networks again, but now the idea is that we also train a model that takes one of the training examples, concatenates it with some noise, and produces a new image — or at least a vector that fits in the input space — which gets the same label as the original image. If we concatenate that to the training set of the episode and feed this into a prototypical network, we want to train this generator to improve the performance of the prototypical network. It might not actually generate things that look like images, but it generates things that help in defining the prototypes in such a way that you get better generalization — which was kind of a clever thing to do.

You can do program induction. The idea here, in this paper from NIPS, is that you have these Karel example tasks, where you're given the input to a program and the corresponding output — the program here is essentially one that moves this little guy, which in this case moves down and puts a little circle wherever there's no brick wall; that's this program here. Now assume you only observe the inputs and corresponding outputs of that program: you would like to extract, from many input/output pairs, a model that can take a new input and produce the right output. The only thing that's "program" about this is that we know the underlying function that created these targets comes from some language, from an actual program. What they showed is that you can use something somewhat similar to a prototypical network: a task encoder takes all the input/output pairs, encodes them in some space, does some pooling, and then conditions an output decoder that takes in a new input and now needs to produce the correct output. So this paper might be worth looking at — I realize I'm going fast, but I'm sort of out of time — and this is in the space of structured prediction.

There's also this other interesting problem where you might want to do few-shot image segmentation. Here the user provides just a few key points indicating what the positive class is and what is just background, and now you want a neural net that takes in this annotation and the corresponding image and can condition, on that, another network that takes a new image and needs to produce the full segmentation for that same category. In this case you see that the white dots are around the dog, so what you want is for the network to learn: OK, my representation of the task is that I should segment out the dogs in images. If you connect these two components, you can end-to-end learn the encoding of the task such that the network applied to the new image does dog segmentation and not other types of segmentation.

I think ultimately where we really want to go is a more interactive setting. I mentioned that the goal is to have someone provide examples, see the performance, correct it by providing new examples, and so on — and that might lead us to think about how to do meta active learning: maybe you want the system to also suggest "can you label this, because I'm not sure what you want here". Right now some people have looked at this, but not with a whole lot of success, I would say; I think it's still an open question whether we can learn an active learning policy, for instance. I did have some work published at NIPS where we were doing this in the context of recommendation, where for a given user you provide the items they've engaged with and didn't engage with, and that conditions a model that predicts whether they would engage with a new item. But I think we've barely scratched the surface of this interactive setting.

All right, so those are the challenges I encourage people to think about, and things I think our community should think about: defining new benchmarks that are a bit more realistic, going beyond supervised classification — I think there are tons of opportunities there — and also moving closer to this interactive setting. And with that, thank you all.