Thanks for having me; it's been fun to visit. Thank you. So: understanding complex, noisy data streams is obviously a really important part of vision, and of cognition in general. Parsing the "blooming, buzzing confusion" of the world is a really challenging piece of figuring out how to deal with the world. Before I go forward I should say that I definitely would like questions in the middle of the presentation, so stop me any time. I would much prefer not to get through my slides but to actually communicate, rather than just talk at you for an hour or so. There are slides and some prepared text, but please just interrupt whenever it's relevant. So, as I was saying, parsing these really noisy data streams is very important. You look at that image and you can tell pretty quickly that there are two cars, one in front of the other, on a field in front of mountains, and it's much faster for you to figure that out than for me to say it. And you can do this despite a huge amount of variation: position, pose, size, illumination, distortion, noise, background variation, scene variation. Presumably nobody here has seen this image before, but you can still tell it's a car; even if you don't know what an Alfa is, you can still tell it's a car. So there's obviously a lot of variation that makes this problem challenging, and this is true in other domains too, not just vision. You look at that and it's hard for you to tell what it says (most of you weren't here earlier when I was testing it). It says "Hannah is good at compromising," and you could tell that even though you probably haven't heard that particular sentence before, with that speaker identity, and so forth. What's really going on is that, in some sense, an explicit representation is being constructed, where somehow, under some objective function, the categories of interest are being pulled apart.

You really need two things to make that happen. You need selectivity for different objects: you need to tell when it's one car versus, you know, a boat or something. And you need tolerance for changes to the input. It's computationally easy to have either one of these things alone (templates or invariants get you either of them), but together it's really hard. One way to visualize this: you have a population representation of an object along some kind of manifold, say an object-identity manifold, a face transforming through space. A good representation has the property that two different faces will be separated across all those different transformations. A bad one would look like this, and the real one, on the pixels, looks like this. In some sense the problem is that they're all jumbled up together; there's this tremendous tangling of the factors of interest. Another way to think about it is that the natural physics axes of the world, like retinal photoreceptor voltages or hair cell deflections, the sensor axes of the different devices: things that are straight in those axes are really not straight in the axes of natural behavioral events, like a face deforming or moving through a complex environment. The two coordinate systems are totally misaligned. So that's one reason why problems of this kind are computationally hard: there's a nonlinear misalignment. But it also needs to be done fast, which is another reason the computation is hard. You look at those images and figure out what's going on, and you know what the stuff is WAY faster than you can say what it is, and you can do it while you're listening to me talk. Nonetheless, that processing has to happen somehow pretty quickly, in parallel. In fact, if you think of the core object recognition regime as roughly one hundred to two hundred milliseconds of looking at an image, humans are significantly above chance, practically maxing out their performance, even at that very short duration.

So the core idea is that cortical brain tissue is effectively doing some kind of computation to go from this representation to this representation, and in particular that it somehow untangles those very tangled-up structures. And it's well known from a lot of neurophysiology and anatomy that those processes are likely sensory cascades: a series of fairly simple operations that nonetheless, when strung in series, do this very complicated nonlinear untangling. Now, putting a bunch of neuroanatomic, architectonic, and latency evidence together, what you see in the visual system (this is the ventral visual stream) is the idea that a lot of object recognition and other higher-level visual behaviors may be read out roughly at the top of a sequence of processing stages, where maybe each stage is roughly ten milliseconds after the previous one once you get into cortex. This sequence of brain areas passing data along is what a sensory cascade looks like. Another way to think about it is that information comes in on the retina and then gets re-represented through these different layers, to the point that by the time it gets to the top of the ventral visual stream, in inferior temporal cortex, it's possible to read out lots of interesting things easily from that neural representation. OK.
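To make the tangling point concrete, here is a minimal toy sketch (my own illustration, not from the talk's experiments): two "object manifolds" that are tangled in their raw sensor axes become linearly separable after a simple nonlinear re-representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n=200):
    """Points along an 'identity manifold': one object under a
    continuous transformation (here parameterized by angle)."""
    theta = rng.uniform(0, 2 * np.pi, n)
    pts = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)
    return pts + rng.normal(0, 0.05, pts.shape)

# Two objects whose manifolds are tangled in the raw ("pixel") axes.
A, B = ring(1.0), ring(2.0)
X = np.vstack([A, B])
y = np.concatenate([-np.ones(len(A)), np.ones(len(B))])

def linear_readout_acc(X, y):
    """Training accuracy of a least-squares linear classifier (with bias)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return np.mean(np.sign(Xb @ w) == y)

acc_raw = linear_readout_acc(X, y)  # tangled manifolds: near chance
# One nonlinear feature (squared radius) untangles them completely.
acc_untangled = linear_readout_acc((X**2).sum(1, keepdims=True), y)
```

The point of the toy is only the contrast: no linear readout separates the two rings in raw coordinates, while a single nonlinear feature makes the same linear readout nearly perfect.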
So if you want to think about this as a sort of fast movie, it's something like this: you have that sequence of things going on in your head as you look at the images, and the neurons are firing in a sequential way, roughly doing something different for each image, but somehow in such a way that by the time activity gets to this area at the top, lots of things have become explicit about what's going on in the scene. Question? ... It's probably not smooth, in the sense that there may be clear ways in which small changes in the pixel representation lead to big changes downstream. So you can think of untangling as a process that takes things that are actually fairly non-smooth and smooths them out a bit. ... Are you adding information? Well, in a sense, yes: if you learn the parameters of that process, you're adding your prior information about what the parameters should be. Of course, the information is all in the image; you can see what the thing is by looking at it. So it's not that the amount of information in the Shannon-theoretic sense increases; that's not possible. But that doesn't preclude the information being laid out in a smoother way, so that simple decoders can get at it. And I think that's the right way to think about it: it's not that you're adding information about the image; you're adding the same information for every image, namely your priors about how, given a pattern, things should be smoothed out. OK, does that make sense? Yeah, OK.
So underlying this is the idea that there are three things: stimuli, neurons, and behavior. Stimuli are what comes in, neurons are the stuff in the middle, and behavior is what you read out. What I just said, basically in response to the question, was that there's this kind of information-transformation process that actually loses a bunch of information (that's why I've drawn it as a bottleneck, an hourglass kind of structure) and then pushes it into a representation that can be read out to do many different things: figure out the category, location, size, pose, etc., in a way that would be really hard to do with linear readouts from the input representation. Basically, that kind of untangling occurs. Now, to make that concrete, I'm going to tell you a little bit about a multi-year electrophysiology experiment in the macaque. This was done in collaboration with Ha Hong and Jim DiCarlo, in Jim's group at MIT. Multi-electrode arrays were implanted in V4 and IT, which are toward the top of the ventral visual stream, and a couple of hundred sites were collected. With that setup, responses to about six thousand images were recorded.

In this case the images were constructed by taking sixty-four three-dimensional objects, in eight categories, and putting them on uncorrelated photographic backgrounds, at three levels of variation: low variation, where the objects are at a fixed position, pose, and size; a medium level of variation; and high variation, where the objects can be all over the place at different poses and sizes. The object categories are a sort of random selection of natural categories: animals, boats, cars, chairs, etc. That gives you a sense of what's in there. So you take the images recorded with the setup I mentioned a moment ago and bin the spike count of each neuron. The neurons don't start spiking until a bit after the images are actually presented, because you wait for the signal to get through the visual system. If you bin at 70 to 170 milliseconds post-stimulus presentation and average over a bunch of repetitions, you get one scalar per neuron per image, and that's what's in this matrix, effectively.

First I want to talk about why this data is interesting, what about it makes it useful. Imagine you have that data and you try to do some decoding from it, meaning you want to build a linear combination across neurons to do something like detect whether it's an animal or not, or a different linear combination for cars, whatever. You're linearly trying to read out, from the neural responses, what's present in the image. If you do that with the V4 data (V4 is kind of an intermediate visual area), what you see is that it's quite good at low variation. If you try to detect animals versus boats versus cars, etc., the basic categorization task, you get something like sixty-something, almost seventy percent at low variation; chance here is 12.5 percent, since it's an eight-way task. But V4 is much worse at high variation, even though the neurons are being driven by the images; the stuff is in the receptive fields of the neurons, for those who think about such things. Nonetheless, the population is not able to easily decode the content of the images. You can also collect human data on a whole bunch of these tasks, like the basic animals-versus-boats-versus-cars categorization, or within-category tasks: a within-category cars task, or a faces task of faces versus each other. If you measure human behavior, what you see on this basic categorization task (that's the black bar) is that humans are way better than V4 at high variation; they're more comparable at low variation, but humans are way better at high. But if you look at the IT population, the area after V4 in the ventral-stream sequence, IT is much better.

So you can decode pretty well out of IT: with the neural features treated like machine-learning features and a linear decoder, you can decode what the object is pretty well. Now, this was back in 2012, and when we first looked at this, a lot of machine-learning algorithms at the time were also getting smashed by that high-variation part of the dataset. So there was something interesting about the computations being done from V4 to IT. And if you look at many different tasks (each dot here is a different visual task), comparing the neural decode to human performance, you can predict human performance pretty well out of IT. The same things humans are bad at, IT is bad at; things humans are good at, IT is good at. You can predict pretty well out of this top-level visual representation, but less well out of V4, and much less well out of earlier layers in the sequence. So it's not just that the performance is good; it's that these neural features are making an interesting pattern, and that pattern predicts the human pattern. In some detailed way, behavior follows these neurons. What that basically says is that those neurons are worth explaining. I'll just note briefly that monkeys can do this behavior as well, and you can record exactly what behavioral patterns they make. If you record their errors, the errors monkeys make on this task are very similar to the errors humans make. So the neurons in the monkeys are predicting the human behavior, and the monkey behavior is also predicting the human behavior.
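The binning and linear-decoding pipeline just described can be sketched end to end. This is a hedged toy version with simulated Poisson spike counts standing in for the recordings (all sizes and rates are made up for illustration; the actual dataset and decoder details differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_cat, per_cat, n_reps = 80, 8, 30, 5
n_images = n_cat * per_cat

# Hypothetical stand-in for the recordings: Poisson spike counts in a
# 100 ms window, with a weak category-dependent firing rate per neuron.
cats = np.repeat(np.arange(n_cat), per_cat)
base = rng.uniform(5, 20, (n_neurons, 1))        # baseline, spikes/sec
tuning = rng.uniform(0, 15, (n_neurons, n_cat))  # category tuning
rates = base + tuning[:, cats]                   # (neurons, images)
counts = rng.poisson(rates[..., None] * 0.1,     # 70-170 ms window
                     size=(n_neurons, n_images, n_reps))

# Average over repetitions: one scalar per neuron per image.
R = counts.mean(axis=2).T                        # (images, neurons)

# Linear readout of category: one-vs-rest least squares, held-out test.
idx = rng.permutation(n_images)
tr, te = idx[:n_images // 2], idx[n_images // 2:]
Xtr = np.hstack([R[tr], np.ones((len(tr), 1))])
Xte = np.hstack([R[te], np.ones((len(te), 1))])
W, *_ = np.linalg.lstsq(Xtr, np.eye(n_cat)[cats[tr]], rcond=None)
acc = np.mean(np.argmax(Xte @ W, axis=1) == cats[te])  # chance: 1/8
```

The matrix `R` is the "one scalar per neuron per image" object from the talk, and `acc` is the kind of eight-way decode accuracy being compared against the 12.5 percent chance level.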
So what this basically says is that by the time you get up to IT, the visual representation has done this untangling in such a way that a linear readout is able to predict category behavior. All of this is a long way of saying that a predictive model of neural responses in this pathway is of real interest: you want it because you want a quantitative hypothesis for what generated those neural responses. So the obvious solution is to use something like convolutional neural networks, and this was obvious even five years ago, because they build in the basic neuroanatomy of the ventral stream: they're hierarchical and retinotopic, that is to say, spatially tiled. These are the things that were known from the neuroanatomy, so it made obvious sense to use a system that had them, and convolutional neural networks were designed with these in mind. Just briefly, since I'm sure many people here are familiar with them: these are multilayer neural networks whose individual layers are made up of neurally plausible basic operations like filtering, thresholding, pooling, normalization, etc. Each of these has a kind of neuroscience interpretation and a data-science interpretation, so you can think of them as being useful for various reasons. I could go into the details of all of this, but it boils down to a tremendous amount of work that people did over decades, and I'm going to just posit that we use this.
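As a concrete sketch of the layer operations just listed, here is one model "layer" in plain numpy: convolve with a Gabor-like filter, threshold, and max-pool. This is my own minimal illustration of the generic operations, not the talk's actual model code.

```python
import numpy as np

rng = np.random.default_rng(1)

def gabor(size, wavelength, theta, sigma):
    """Gabor wavelet: a sinusoidal carrier under a Gaussian envelope,
    the classic model of a V1 simple-cell receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate into filter axes
    g = (np.exp(-(x**2 + y**2) / (2 * sigma**2))
         * np.cos(2 * np.pi * xr / wavelength))
    return g - g.mean()                          # zero response to uniform input

def conv2d_valid(img, f):
    """Plain 'valid' 2-D filtering by sliding the filter over the image."""
    fh, fw = f.shape
    out = np.empty((img.shape[0] - fh + 1, img.shape[1] - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * f)
    return out

def layer(img, f, pool=2):
    """One model layer: filter -> threshold (ReLU) -> spatial max-pool."""
    r = np.maximum(conv2d_valid(img, f), 0)      # filtering + threshold
    h, w = (r.shape[0] // pool) * pool, (r.shape[1] // pool) * pool
    r = r[:h, :w].reshape(h // pool, pool, w // pool, pool)
    return r.max(axis=(1, 3))                    # pooling

img = rng.normal(0, 1, (32, 32))                 # stand-in "image"
out = layer(img, gabor(7, wavelength=4, theta=0, sigma=2.0))
```

A real network would apply a whole bank of such filters at each layer (one output "slice" per filter type) and share the same weights across all spatial locations; that weight sharing is what "convolutional" means in the next paragraph.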
The key thing is that these structures are applied convolutionally, meaning basically the same at all locations. In theory you don't have to have weight sharing, but that's the typical choice, both to make things tractable and to deal with the fact that natural image statistics are pretty similar at different locations. So if you have an input which is image-like, you get an output which is image-like too, and in this particular picture there's one gray slice per filter type. OK. Now, on natural images, averaged over many trials, models like this, with the types of nonlinearities I mentioned a moment ago and filters of this kind, predict roughly fifty percent of the response variance to natural stimuli in V1, that is to say, early in the ventral stream. And a key insight is that one of the reasons they're able to do this is that the filters have this nice shape, roughly in the form of Gabor wavelets, with different wavelet elements for different orientations and frequencies. So this boils down a tremendous amount, decades, of work in visual systems neuroscience into the idea that a one-layer convolutional network does a reasonable job representing neural responses in early visual cortex. It's interesting to think about where this comes from. Of course, you can think of it as coming from the intuitions of the discoverers of V1 and its properties: there's just this fixed basis that makes sense if you're smart enough, and maybe you can figure out in general what good properties such a basis would have. In this case, people looking at the experiments theorized something like these basis elements.

There are other good ways to get to this as well, for example the idea that you want neurons to represent their environment and to do so as efficiently as possible. This is the sparse coding idea that Olshausen and Field and others came up with, which is basically: if you have a network that takes an image and has to reproduce the same image through a hidden layer, and you impose a sparsity prior on the hidden layer, then it turns out that if you train the first-layer filters to do this, you get filters that look roughly like Gabors. That's a really nice idea, because it gives you an underlying reason to understand, in a principled way, why the filters are the way they are. So it's natural to take some of those ideas and push them up the ventral stream, if you're interested in neural responses in higher cortical areas. After all, the input is image-like and the output is image-like, so you can just keep stacking. Of course, there's a huge number of parameters consistent with that idea: architectural parameters (how many layers, how many filters, etc., the network structure itself) and the continuous values of all the filter templates at each layer. That's a huge number of parameters, and the obvious big question is how you find the right ones. I'm going to just say (there's a lot of discussion about why this is the case) that it turns out to be really hard to generalize these basic approaches to many-layered networks. There's no obvious way to do it that works.
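To show the mechanics of the sparse coding idea, here is a toy alternating-minimization sketch of the Olshausen-Field-style objective, minimize ||x - Da||^2 + lam*||a||_1 over codes a and dictionary D. This is my own illustrative version with made-up sizes and random data rather than natural image patches, so the learned filters here will not be Gabors; with whitened natural patches and longer training they come out Gabor-like.

```python
import numpy as np

rng = np.random.default_rng(3)

patch_dim, n_basis, n_patches = 64, 32, 200
X = rng.normal(0, 1, (patch_dim, n_patches))     # stand-in "patches"
D = rng.normal(0, 1, (patch_dim, n_basis))       # dictionary (filters)
D /= np.linalg.norm(D, axis=0)
lam, n_iters = 0.5, 30

def ista_codes(D, X, lam, steps=50):
    """Sparse inference: iterative shrinkage-thresholding for the codes."""
    L = np.linalg.norm(D, 2) ** 2                # Lipschitz const. of gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(steps):
        A = A - (D.T @ (D @ A - X)) / L          # gradient step on recon error
        A = np.sign(A) * np.maximum(np.abs(A) - lam / L, 0.0)  # L1 shrinkage
    return A

for _ in range(n_iters):
    A = ista_codes(D, X, lam)                    # infer sparse codes
    D -= 0.01 * (D @ A - X) @ A.T                # gradient step on dictionary
    D /= np.linalg.norm(D, axis=0)               # keep basis elements unit norm

A = ista_codes(D, X, lam)
sparsity = np.mean(A == 0)                       # fraction of exactly-zero codes
recon_err = np.linalg.norm(X - D @ A) / np.linalg.norm(X)
```

The sparsity prior is what does the work: the L1 shrinkage drives most codes to exactly zero, and the dictionary adapts so that a few active elements per patch suffice for reconstruction.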
There's another strategy, which is direct neural fitting: you record data and then fit the parameters to reproduce the neural responses, on a per-unit basis. But it turns out that's really hard to do too, because there's not enough neural data to constrain this very large model class. It's a very nice idea, but in practice it leads to overfitting. So the basic state of affairs a few years ago, when we started looking at this, was that direct neural fitting was less successful in higher visual areas: V4, IT, etc. So we came up with a strategy (and there's nothing non-trivial about these ideas) of using the other constraint you have, which is that the system has to do a task. I showed you earlier that the system actually does a really interesting behavioral task. So: optimize for categorization, fix the parameters that way, and then compare to neural data. There are really two underlying metrics here. One is a performance metric: accuracy on some challenging high-variation visual object categorization task. The other is neural predictivity: the ability of the model to predict each individual site's responses. And the idea is that if you do this for a challenging task (it's a challenge for the network engineer, just as it is for the animals doing the task), the hypothesis was that by optimizing performance you would get better neural predictivity too. Of course, this begs the question of what it means to map a neural network to the brain. One way of thinking about this: imagine two brains, a source brain and a target brain. A very natural way to map them is to just ask that one be a linear transform of the other. It's plainly not the case that for every neuron in monkey one there's an exactly matching neuron in monkey two.

But it makes sense to give yourself a little slop: neurons in monkey one are a linear combination of neurons in monkey two. That's the slop we give ourselves in comparing a model to the brain. In other words, you basically treat each neural site as a linear combination of model units, fit the mapping with linear regression, and then measure accuracy as goodness of fit on held-out testing images. So the question was whether we could do that. Before we got into trying to deeply optimize neural networks to solve the task, we wanted to do some simple high-throughput experiments to check whether we had a chance. First, we did something like random selection of model parameters (random selection of those architectural parameters I mentioned a moment ago) and then measured both performance and neural predictivity; each dot here is a different model. What you see is that there's a modest, reasonable correlation between performance on the one hand and neural predictivity on the other, about 0.55, with quite a bit of variability as well. So this wasn't terrible. But it led us to think: what if we did some hyperparameter optimization on the architectural parameters and then measured neural predictivity? That's what the blue dots are; each blue dot here is again a different model. You see that the optimization works pretty well, in that you're able to get out to the right, but also that the correlation increases a lot: it's now roughly 0.8. So the message is: if you want better neural predictivity, maybe you should performance-optimize.
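The linear mapping metric just described can be sketched as follows. This is a minimal illustration with synthetic data, assuming a simple ridge-regularized fit; the actual regression and cross-validation details in the work may differ.

```python
import numpy as np

rng = np.random.default_rng(4)
n_images, n_units = 300, 50

# Hypothetical model features, and a synthetic "neural site" that is a
# noisy linear combination of them (the mapping assumption in the talk).
F = rng.normal(0, 1, (n_images, n_units))
w_true = rng.normal(0, 1, n_units)
site = F @ w_true + rng.normal(0, 2.0, n_images)

train, test = slice(0, 200), slice(200, None)

def ridge_fit(X, y, alpha=1.0):
    """Regularized linear mapping from model units to one neural site."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

w = ridge_fit(F[train], site[train])
pred = F[test] @ w

# Goodness of fit (R^2) on held-out testing images.
ss_res = np.sum((site[test] - pred) ** 2)
ss_tot = np.sum((site[test] - site[test].mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

The key design point is the held-out split: the regression weights are the "slop" allowed by the mapping, so the fit must be scored on images the mapping never saw, otherwise the slop can absorb noise.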
Of course, the opposite might make sense as well: optimize for neural predictivity and then measure performance. That's what these red dots are, and this was interesting because it sort of reinforced the overfitting problem: although you could optimize for neural predictivity, you couldn't actually get much better than you would by just optimizing for performance alone, and you didn't get the performance back. So blue made a lot more sense than red. But really, this was where we were at the time relative to a bunch of other approaches, and the problem was that all of these were sort of bad everywhere; I really wanted to be up and to the right. That was basically the problem, so we had to do much better at optimizing for performance. We did a bunch of stuff, we basically threw the kitchen sink at it, not with the idea that the optimization method itself would be anything biological, but just to see if we could push out to the right. So we did a bunch of things to improve the architectural parameters: the automated meta-parameter optimization I mentioned, ensembles of models chosen by boosting, and also filter parameter optimization using stochastic gradient descent on whatever categorization task we had available. At the time we didn't quite use softmax; we actually used slightly different loss functions than are typical now, but it's a very similar idea. With those ideas together we produced a model, and we did it with a bunch of different training sets. It turns out that the most effective one was the ImageNet training set, which has by now become a standard in the field: thousands of images in thousands of categories. And the idea was training on real photos. The first question was: having done that, does it generalize to the types of stimuli we had here? To make that test as strong as we could, we removed categories of photographs that appeared in the test set: no animals, no boats, no cars, etc. We got a model that way and tested it, and it turned out that as you get better on the training task, you get better on the testing task as well, so generalization works pretty well. In other words, the features were learning something that was useful for general object recognition, even though the model hadn't seen those particular categories before. I should say, what's happening at this point is that we're retraining a linear classifier on the top hidden layer of this thing. So, just to put it in perspective: in the picture I showed you before, the red bar is the model that came out of that. It was doing pretty well on the high-variation condition, unlike V4 and some of the earlier models.
So our core question was: does it predict neurons better thereby? Here's one representation of the data that's useful to look at for some neurons. What we're seeing is neuron 53 in IT, and its response, not over time, but laid out on a per-category basis; each dot is one of the thousands of testing images. What you see is that this is a face neuron, because it really likes to respond to faces, way above the baseline level. So this is an easy neuron to look at in this type of representation. And what you can see is that if you try to predict this neuron from the top hidden layer of the model, using the linear regression metric I mentioned a moment ago, it does a pretty good job. Not perfect (the R-squared is about 0.55), but it's matching a lot of the ups and downs of the neuron. And this is true not just for neurons that are really easy to interpret, like the face neurons, but also for others, like site 42, which, I don't know exactly what it is, but it has a bunch of peaks, and those neurons are able to be fit pretty well too, the red line being the prediction. So we were able to go quite a bit up and to the right; that's the basic short story. If the blue dots were what you saw in the lower left, we were able to get up here: more than a hundred percent improvement in neural fitting, and much better performance as well.
So that was really gratifying, and it suggested we could have at least reasonably predictive models of higher cortical areas, which had been really hard to figure out; these are kind of mysterious areas where we don't know exactly what they're responding to. Given that, we then wanted to investigate what was going on with them. One way to do that is to compare the different layers of the model to the data. So what we're trying to do here is fit that same neuron I showed you before, but with layer one, the first layer. What you see is that it's basically able to capture the low-variation part. In this layout, organized by category and then within each block by variability, the images where the object is head-on at a fixed position, pose, and size are in the left part of each block. So layer one is able to be selective, but when you look at the higher-variation conditions, it's just not able to be robust and tolerant. If you go up through the layers, though, that's what becomes possible: being both selective and tolerant at the same time. So this is a pretty natural way of thinking about how tolerance is built up through the layers of the network. And if you look at the median over all the units, and ask at which layer the fit is best, you see that as you go up through the layers you get better; that's what those red bars are.
Now, if it were the case that the neurons were totally driven by their categorical value, this wouldn't be that interesting, right? Because if you're good at categorization and the neurons are totally driven by category alone, just categorical in their responses, then it's sort of obvious that this should work. So to check that, we built ideal observer models: models that try to predict neurons by knowing the category perfectly, or by knowing all of the variables that generated the image, the position, the pose, the size, etc. That's what you see here. Those are certainly better than chance in IT, but much less good than the network model. Question? ... We'll get there; it's a good question. ... No, these data were collected with the animal doing RSVP: it's just sitting there staring at the images, and the animal's task is to make sure that its fixation doesn't leave the center of the image. Now, it's a good question what happens differently in the neurons if the animal is doing a task. It's a great question that I'm going to mostly not talk about today.
But just notice one thing: remember those relationships I showed you between neural data and human performance? What's funny about what's going on there is that it's monkeys doing nothing predicting humans doing the task. So that suggests you shouldn't worry about that issue too much for the purposes of what I'm going to tell you today. If you're worried about dynamics and all that, it might change things. But if you're interested in just that sort of 70 to 170 millisecond temporal average bin, then maybe you're OK, because even that average bin in the monkey doing nothing (well, it's not asleep; it's an awake, behaving monkey whose behavior is to fixate) is informative. What that says is that as long as the animal is awake and motivated to do at least something, namely watching the image, then from those neural responses you can decode behavior quite accurately at a per-task level. So it's worth explaining these data, even if there would be more detail if the animal were doing something. But what I was saying here was basically that the fact that these categorical or all-variable ideal observers didn't predict the data nearly as well as those models suggests that you need both things. You need a performance constraint: it's got to actually do the task, and the ideal observers do the task, as do the models, the neural networks. But you also need to actually be a neural network. It's those two things together, a performance constraint within an architectural constraint, that lead to the better neural predictivity. You can't just read the category off the image and say that's what the neuron is doing. OK.
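The ideal observer comparison can be sketched like this (a synthetic illustration of the logic, with made-up numbers): a "neuron" with both categorical structure and image-by-image structure within each category is predicted better by feature-based regression than by a category-only ideal observer.

```python
import numpy as np

rng = np.random.default_rng(6)
n_cat, per_cat, n_feat = 8, 50, 30
cats = np.repeat(np.arange(n_cat), per_cat)

# Model features carry both a category signal and within-category detail.
F = (np.eye(n_cat)[cats] @ rng.normal(0, 1, (n_cat, n_feat))
     + rng.normal(0, 1, (n_cat * per_cat, n_feat)))
w = rng.normal(0, 1, n_feat)
site = F @ w + rng.normal(0, 1.0, n_cat * per_cat)   # synthetic "neuron"

idx = rng.permutation(len(cats))
train, test = idx[:300], idx[300:]

def heldout_r2(X, y):
    """Least-squares fit (with bias) scored as R^2 on held-out images."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w_fit, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
    pred = Xb[test] @ w_fit
    return 1 - (np.sum((y[test] - pred) ** 2)
                / np.sum((y[test] - y[test].mean()) ** 2))

onehot = np.eye(n_cat)[cats]              # "category ideal observer" predictor
r2_category = heldout_r2(onehot, site)    # above chance, but limited
r2_features = heldout_r2(F, site)         # captures within-category structure too
```

The category-only predictor is better than chance because the neuron does have a categorical component, but it saturates there; the feature-based predictor also captures the within-category ups and downs, which is the gap the talk is pointing at.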
Now, to your question earlier about intermediate areas: we actually also had data in other visual areas, like V4, so we wanted to compare to that. Just to give you a sense of what V4 data looks like: if you put it in that same representation I showed for the face neuron, it looks like this, which is basically a mess. And this is heavily subsampled; otherwise it would just be a black blob. This is why V4 is hard to understand. Now, it turns out that if you look at early layers or late layers of the model, you don't do a particularly good job of explaining this data; the regression is not that great. But if you look at intermediate layers, especially this layer three, you do a not-bad job, certainly better than the early or late layers. So I can't tell you why this neuron is doing that stuff, why the peaks are the way they are, but what this is saying is that this intermediate layer of the network, optimized to do the downstream task, has the right nonlinear basis to predict that neuron. If you summarize this across all the V4 neurons in the population, what you see is that predictivity basically peaks in the middle layers. And again, you're way better than other existing models were. And just to drive the point home: these categorical ideal observers are terrible in V4; they're basically not above chance. Question? ... Totally, yeah, there's nothing special about this network having this particular depth; I don't mean to give that impression. Deeper models that we can build now are better, and they parse out the intermediate layers better. Though it's not monotonic: as you go to twenty or a hundred layers it gets worse, in the sense that of course you do better on the task, eventually better than humans, but it doesn't track with the neurons after a certain point.
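The "intermediate layer fits the intermediate area best" logic can be sketched with a toy example (entirely synthetic, my own illustration): build a small random deep net, define a "V4-like" site as a noisy readout of the middle layer's basis, and check that held-out regression from each layer peaks in the middle.

```python
import numpy as np

rng = np.random.default_rng(5)
n_images, dim, n_layers = 400, 40, 5

# A toy random deep net: each layer is relu(prev @ W).
X = rng.normal(0, 1, (n_images, dim))
layers, h = [], X
for _ in range(n_layers):
    W = rng.normal(0, 1 / np.sqrt(dim), (dim, dim))
    h = np.maximum(h @ W, 0)
    layers.append(h)

# A synthetic "V4-like" site: noisy readout of the MIDDLE layer's basis.
site = layers[2] @ rng.normal(0, 1, dim) + rng.normal(0, 1.0, n_images)

train, test = slice(0, 300), slice(300, None)

def heldout_r2(F, y, alpha=1.0):
    """Ridge regression (with bias) from layer features to the site,
    scored as R^2 on held-out images."""
    Fb = np.hstack([F, np.ones((len(F), 1))])
    n = Fb.shape[1]
    w = np.linalg.solve(Fb[train].T @ Fb[train] + alpha * np.eye(n),
                        Fb[train].T @ y[train])
    pred = Fb[test] @ w
    return 1 - (np.sum((y[test] - pred) ** 2)
                / np.sum((y[test] - y[test].mean()) ** 2))

r2_per_layer = [heldout_r2(F, site) for F in layers]  # expected mid-layer peak
```

In this toy the middle layer wins by construction, because it contains the exact nonlinear basis the site is built from; earlier and later layers can only reach it through extra nonlinearities. That is the shape of the argument being made for V4, not a model of V4 itself.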
So that doesn't work so well, but if you have ten layers or eight layers, probably closer to the number of stages in the ventral stream, and this is later work, you can get a better picture and maybe parcel out those areas. Unfortunately we don't have quite enough data in the IT subdivisions, pIT, cIT, and aIT, right now, separately, to be statistically sure about that. But in practice I totally agree: it's not like layer three just is V4, full stop. I don't mean to suggest that. I mean to say something like: the intermediate-style computations, you have to have them, and when you have them they give you a better picture of what the intermediate neural response areas are like. So just to contrast: IT predictivity goes up through the layers, V4 predictivity peaks in the middle, and, as we and many others have observed, early layers also get what look like Gabor wavelets and a variety of other interesting phenomenology, in layer one and other early layers. Others, probably better than us, have also shown that among the layers of the model, the early layers give the best prediction of early visual cortex. That's not to say these models of V1 are better than other models of V1 that may be more biologically detailed; those comparisons are really interesting to do and I think open at this point. But anyway, what you can see from this, and I think I'll skip ahead in the interest of time, is basically the suggestion that you can complement the from-below approach, where you try to figure out what each layer is doing, think about it, and then build new filters on top of those, with this kind of top-down approach where you impose a behavioral constraint.
Together with the architecture, what you actually get is a pretty good picture of constraints at multiple different layers along the pathway. We don't have V2 data ourselves, so we really can't make that comparison very carefully, but to some extent it's been made. OK, so this was a really useful intermediate modeling endpoint, because it suggested that although we didn't exactly know what all those intermediate or early layers were doing, we could put this architectural piece of knowledge down and this functional piece of knowledge down, and although each was broad strokes by itself, neither one alone telling you exactly what the model needs to be, putting them together gives you a much more quantitatively exact picture. So that was good. But at the same time, you'd really like to be able to learn something qualitative from such a model. It gets criticized as a black box; people say, and this gets on my nerves a little bit, that it's a black-box model of a black-box system, so what are you learning? Of course, it's a totally transparent box in the sense that the cost of measuring everything about it is practically zero, or whatever it costs to run your GPU. The brain is a black box because it's extraordinarily expensive to figure out what's inside it; you can't measure everything, and the activity you can measure is very expensive to get, and so on. So in some sense this box is extremely transparent; what it isn't is obviously understandable. It's not simple. So what you really want to do with such a thing is figure out how to generate qualitative predictions that are useful or interesting. And the basic approach, the only one I really believe in at this point, is that you do a bunch of experiments in the model, and when something comes out of them that
does not match your intuition, you check it against the real data. So it's a strategy for coming up with qualitative predictions, and I'll tell you now about one of the more interesting ones that we saw. So obviously you can do categorization tasks: you look at that image, you see the plane, and maybe you know it's an F-16. But you can not only tell what it is; you know where it is, how big it is, its aspect ratio, things like that. In fact, you can quite quickly assess the scene as a whole. So the obvious question is: where are all these properties coded neurally? And the sort of obvious hypothesis, which was basically our working model, especially given what I showed you a minute ago, would be that it's not at the top of the ventral stream that these properties are encoded. Because you imagine that the identity-preserving transformations are aggregated over at each layer; that's how invariance is managed, or at least it's natural to imagine invariance being built that way. As receptive field size increases, you might imagine category tolerance increases but position sensitivity goes down. That's a natural thing to think, and we certainly thought roughly a version of it. So maybe these properties are encoded earlier, in early visual areas, or somewhere else in the brain. Just to put the various hypotheses down: previous studies had told us that, for categorical properties, information goes up through the layers. For the category-orthogonal properties, what did we know? Well, hypothesis one would be that there is a tolerance-sensitivity trade-off; that's very natural. Hypothesis two is a softer version of the trade-off, pegged to human performance: V1 is better than humans at some of these judgments, so maybe you fall to human performance by the time you get to IT.
Then there's a third hypothesis, which is what Jim DiCarlo, my postdoc advisor, thought was probably going to happen: that for the category-orthogonal properties, the information would be, quote, preserved. You wouldn't totally lose it; it wouldn't make sense to throw it out if you could use it, so why would you? So that information would somehow be preserved. We wanted to figure out which of these it would be, and in doing so we ran the various tasks in the model and found something we weren't totally expecting, which was that as the network got better at the categorization task, performance read out of the top hidden layer on a position-estimation task was also getting better. Even though the goal was to become invariant to position, the representation was getting much better at reporting it. Maybe not impossible, but certainly a little surprising. So then we thought, well, maybe this is hypothesis two, not hypothesis one. This was true for all the tasks we looked at: position estimation, scale, rotation, and so on. So then we asked: at which layer does position estimation peak? Is it early, or in the middle? I had thought it would be in the middle, because there you have boundary detection, which would tell you roughly where the objects are. But as it turns out, you get better as you go up through the layers for these as well: performance on the position-estimation task increases at each model layer, and this was true for every task we could check. And you know, this was one of twelve experiments I ran that week, and it was the only one that did something I was not expecting.
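Concretely, "performance on a position-estimation task from a layer" means a cross-validated linear readout of a continuous property. A pure-Python sketch with a small multivariate ordinary-least-squares fit; the features and numbers are toy stand-ins (the real analyses regress from thousands of model units, with regularization):

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system A w = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[col][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def heldout_r2(features, target):
    """OLS with intercept on the first half of the stimuli; R^2 on the rest."""
    X = [[1.0] + list(f) for f in features]
    half, d = len(X) // 2, len(X[0])
    # Normal equations X^T X w = X^T y on the training half.
    A = [[sum(X[i][j] * X[i][k] for i in range(half)) for k in range(d)]
         for j in range(d)]
    b = [sum(X[i][j] * target[i] for i in range(half)) for j in range(d)]
    w = solve(A, b)
    preds = [sum(wj * xj for wj, xj in zip(w, x)) for x in X[half:]]
    actual = target[half:]
    mean = sum(actual) / len(actual)
    ss_res = sum((p - a) ** 2 for p, a in zip(preds, actual))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

The same function scores any category-orthogonal property (scale, rotation) by swapping in a different target vector, which is how the "gets better layer by layer" curves would be computed.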
So eventually we convinced ourselves that we should collect more data to figure out whether the neural data was in line with this, because it was a little out of line with what we expected. We reproduced our original result that categorization decoding is better in IT than in V4 (that's a given by this setup) and better in V4 than in V1. But it actually turns out that IT is also better than V4 at position estimation, and in fact, if you look across all the various tasks, we found that in general, always, IT was better than V4, and usually V4 was better than V1, across all these tasks. We also did an experiment with more classic, standard receptive-field-mapping stimuli, which have an x and y position and an orientation property, because we expected to find a different result there, given what people have shown, and indeed that is true; the classical results stand, we didn't contradict them or anything. In other words, for things like x and y position with those types of stimuli, V1 is better than the higher areas. Go ahead. [Audience question.] This particular plot is the V1 model, but the V1 data also has this property, even more so, and we later checked that. So, you know, V1 is really good at bar-code-reading-type tasks; that's not surprising; it's good at figuring out exactly where these things are, better than humans, so far as we can tell. So we weren't contradicting the known knowledge. To put this in the perspective of human psychophysical behavior: you measure behavior for all these different tasks, and you plot neural performance as a fraction of human behavior, as a function of the number of units that you draw from.
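The unit-pooling idea behind those curves can be sketched in a few lines: decode a binary variable from the average of N noisy units and watch accuracy grow with N. Signal strength and the deterministic pseudo-noise here are hypothetical, just to make the demo reproducible.

```python
import math

def noise(trial, unit):
    # Deterministic pseudo-noise in roughly [-1, 1], so the demo is reproducible.
    return math.sin(12.9898 * trial + 78.233 * unit)

def accuracy(n_units, n_trials=200, signal=0.2):
    """Decode a binary label from the mean of n_units weakly tuned units."""
    correct = 0
    for t in range(n_trials):
        label = 1 if t % 2 == 0 else -1
        pooled = sum(label * signal + noise(t, u) for u in range(n_units)) / n_units
        correct += (pooled > 0) == (label > 0)
    return correct / n_trials
```

Plotting `accuracy(n)` against `log(n)` gives curves of the kind described next: per-unit gains that are roughly constant on a semi-log axis until performance saturates.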
This is on a semi-log plot, and what you see is that for each additional unit you gain a certain amount of performance on categorization tasks, like basic categorization (animals versus boats, and so on) or subordinate identification (cars versus each other), at roughly the same rate, and this is also true across many different tasks, effectively all of them, and it's much more constant in IT than it is for the other areas. So basically it's saying that for every bit of inferential power you get from an additional unit in IT on one of these tasks, you get roughly equal amounts of power on all the other tasks. Not exactly, but reasonably so. So really what's going on is this, at least for these types of stimuli, and this is really the crucial thing: it's not about the task, as it turns out; it's more about the nature of the stimuli. So what you should take away from this is, one, IT is definitely not invariant. If you think of a sort of specific-to-general picture in which each area is aggregating over the variation dimensions, that's definitely not what's happening in this ventral stream process. Perhaps some kind of generic aggregation is happening, but you're not aggregating over the invariance dimensions, the category-identity-preserving
transformations. Two, the lower-level properties are not that low-level. We thought of things like position as low-level properties, but at least with complex objects and complex backgrounds, they're not, and they're not intermediate either, which is what I was betting on, based on border detection in particular. Really, it depends on the nature of the stimuli. For simple stimuli there will be properties that are low-level, but for these complex stimuli, all the properties travel together, the categorical and non-categorical properties alike. And it's not just that not all position information is lost; the information isn't reduced, like we talked about earlier, it's made more explicit, for every one of these types of properties. So maybe this suggests that IT is doing some kind of generic scene parsing, for the foveal or central image area. Just to summarize why this was useful: of course you could have discovered this if you had had the data; you didn't need a neural network model to do this, it's all data. But we would never have done the experiments allowing us to make these plots if we hadn't suspected that this could possibly be true, and the only reason we suspected it is that we ran a bunch of experiments within a model that had some potential to be right, and got an answer we didn't expect. That's the sense in which I think predictive models are useful, even if they're not totally understandable in all their details. So, around the time I was doing this, I thought it might be interesting to look into some things in auditory cortex as well, because it also seemed to have a bunch of properties, invariance-like things, of the kind happening in vision. Not exactly the same, of course, but roughly related. And I ran into these folks, also at MIT, Alex Kell, Sam Norman-Haignere, and Josh McDermott, who are really interested in audition.
And so together we went down the road of trying to think about how some of these ideas apply in auditory cortex. Just as a little background: auditory cortex presupposes, and uses, a tremendous amount of really interesting subcortical auditory processing, so the inferior colliculus and various other areas are doing a lot of really interesting things after the cochlea. But then there's this area of cortex, shown in green here, roughly anterior of early visual cortex, and the question is: what are those units doing? What was known was that there's roughly a core / belt / parabelt structure, not perfectly, and this is a monkey brain, we actually measured in humans, so this is just schematic. In the primary area of auditory cortex it was thought that basically some kind of spectrotemporal filtering was occurring, but in the non-primary auditory areas, the belt and the parabelt, it was not totally clear exactly what was happening. So our question was: could we figure out better models of non-primary, that is, higher, auditory cortex? We tried a bunch of different ways of setting this up. One-dimensional convolutions in time were the most obvious thing, since time is evidently local, but it also turned out to be useful, in fact more efficient, to treat frequency as local as well, and that is to say, to use a coarse model of the cochlea as the input: to transform the sound amplitude waveform into this type of diagram, this sort of cochleagram representation, and to stick those into a convnet. That was the strategy: basically, optimize for performance on a challenging auditory task and then compare to neural data.
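A crude sketch of what a cochleagram computation amounts to: short-time energy in log-spaced frequency bands, yielding the time-by-frequency "image" that gets fed to the convnet. A real cochlear model would use gammatone filters and envelope compression; this naive DFT version (all parameters hypothetical) just illustrates the representation.

```python
import math

def cochleagram(signal, sr, n_bands=16, win=256, hop=128, fmin=50.0, fmax=None):
    """Short-time energy in log-spaced frequency bands of a 1-D signal."""
    fmax = fmax or sr / 2.0
    # Log-spaced band centre frequencies, roughly like cochlear spacing.
    centres = [fmin * (fmax / fmin) ** (i / (n_bands - 1)) for i in range(n_bands)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        chunk = signal[start:start + win]
        row = []
        for f in centres:
            # Magnitude of the DFT coefficient at frequency f (Goertzel-style).
            re = sum(x * math.cos(2 * math.pi * f * n / sr) for n, x in enumerate(chunk))
            im = sum(x * math.sin(2 * math.pi * f * n / sr) for n, x in enumerate(chunk))
            row.append(math.sqrt(re * re + im * im) / win)
        frames.append(row)
    return frames, centres
```

A pure tone should light up the band whose centre frequency is closest to the tone, which is the sanity check on any such front end.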
The task that we decided to use was a 600-way word recognition task: take a bunch of words from various speech corpora, TIMIT, Wall Street Journal, and so on, and combine them with significant background noise. So the samples sound like "she had your dark suit in greasy wash water all year" mixed with auditory scenes, speech babble, and music clips in the background; it was really non-trivial. And the task was to pick out the middle word in each one-second clip, so for "she had your", you would pick out "had". Humans are definitely above chance, chance being well below one percent here, and humans are around seventy-something percent on this task, so they're doing pretty well but are far from ceiling. By optimizing a network we can also do pretty well. This plot shows performance up through the layers of such a network: close to, not quite at, but close to, human performance at the top layer, measured on held-out data with different speakers, different auditory backgrounds, and so on. So the question is how to compare this to neural data. Before we did that, we wanted to understand whether the behavioral comparison was useful. Again, there's a lot of interesting behavioral work you can do in audition to look at the patterns. In particular, you can take various speech clips and put them on either dry backgrounds or noise backgrounds of various kinds, auditory scenes, speech babble, music, speech-shaped noise, at different levels, different signal-to-noise-ratio levels.
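The stimulus construction just described boils down to mixing speech and noise at a target signal-to-noise ratio. A minimal sketch (the real pipeline also normalizes levels and clips to one second):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech/noise power ratio equals snr_db, then mix."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Desired noise power implied by the target SNR (in dB).
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` and the noise type is exactly how the grid of listening conditions in the next part is generated.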
Again this is a 600-way task, and if you measure human performance in each of the different conditions you get a pattern: with different levels and types of noise, you get different proportions correct for the different noise backgrounds. And it turns out that if you compare the model we built to the human pattern, it predicts it pretty well. Not perfectly, but it's really hard to distinguish from different humans; different humans don't predict each other perfectly here either. So the model's prediction of this behavioral pattern is pretty good, and note that the network was not optimized to get this pattern correct; it just happens to fall out behaviorally. This is a little like the scatter plot of dots that I showed you before for the visual system. You can also ask whether distances in the model's feature space are a good predictor of the perceptual effect of distorting the signal. To make a long story short, distances in the early layers' responses don't predict the human pattern that well, but as you go up through the layers you get much better behavioral prediction. So task optimization does give you a reasonable space in which distances between stimuli predict the performance pattern, the proportions-correct behavior. OK, so we were in a situation where we had something that was performing reasonably well and matching the behavioral pattern reasonably well, so we wanted to ask to what extent it predicted neural responses. In this case we were using humans, for a couple of reasons. One, my collaborators work with humans, so I had no choice. The other is that humans do this stuff naturally, whereas it's hard to get macaques to do a lot of these tasks; it's not clear that they really have
good responses to a lot of these things. So Sam Norman-Haignere, Nancy Kanwisher, and Josh McDermott measured responses to one hundred sixty-five commonly heard natural sounds, and while humans were listening to them, their brains were measured with fMRI. So a baby crying, running water, a car horn, speech, a variety of stuff. And for each voxel you measure an average response to each sound. Nothing like the temporal or spatial resolution that we have with electrophysiology, but it produces a data matrix of roughly the same form for our purposes: sounds by voxels. Now, this is the sort of ugly situation, if you're doing encoding models, where the number of predictors is way bigger than the number of stimuli, so normally you have to do all sorts of things to prevent overfitting. Actually, for us it's not a big deal, because we separately predict each voxel, so for us it's just more checks. We were interested in using a similar regression metric to do this, and I should say that although we use the same types of regressions here, if you look at how many units were needed, I should have had a slide on this, you need many more model units on average to predict each voxel than you do in the electrophysiology case. So there are titrations of the resolution of the mapping that should be explored more, but we really haven't done that yet. Just to describe what came out of this: first of all, the reliability was not bad. This is just the neural data; sorry, this is a reliability plot, and what it's saying is that basically the reliably responsive stuff is in auditory cortex, and outside of it there's some reliability, but significantly less.
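The reliability measure alluded to here is typically a split-half computation: correlate a voxel's sound responses across independent halves of the scanning runs, then Spearman-Brown correct. A minimal sketch with hypothetical numbers:

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def split_half_reliability(even_runs, odd_runs):
    """Correlate per-sound responses across halves; Spearman-Brown correct
    to estimate the reliability of the full dataset."""
    r = pearson(even_runs, odd_runs)
    return 2 * r / (1 + r)
```

Voxels whose corrected reliability is near zero put a low ceiling on any model's predictivity, which is why the analyses that follow are restricted to the reliable regions.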
If you look at median predictivity as a function of model layer, you see it peaking in the slightly later regions of the model. But in particular, if you look at the difference in predictivity between high and low model layers, the low layers are much better at predicting the primary area, that's what's outlined there, using a separate primary-area-finding localizer, while in the non-primary areas the later layers are significantly better; that's what the blue means. This was interesting and encouraging, because it suggested we could confirm, and deepen our understanding of, where hierarchical structure in auditory cortex is in humans, which is otherwise hard: early layers are a better explanation of primary cortex, while higher layers are a better explanation of non-primary auditory cortex. There was a spectrotemporal filtering model around that was sort of the best model for primary cortex, so we compared to that in the high-reliability areas, and what we saw was about the same; this is a bad color plot, but what you're seeing is that this
kind of dark yellow is where the net was about the same as the spectrotemporal model, and the brighter yellow is where the deep net was better. What you see is basically that primary auditory cortex was about as well predicted by the early layers of this model as by the spectrotemporal models that came before, although actually a bit better, so there's stuff going on in primary auditory cortex that's captured by early layers of the network and not captured by the spectrotemporal model. But there are significant improvements basically everywhere, and especially in the non-primary areas, where it's really hard to do much with the spectrotemporal model at all. You can also look on a region-of-interest basis. In the tonotopic area, variance is best explained by intermediate layers, and that's interesting, actually: it's not the earliest layers that best explain primary auditory cortex, it's intermediate ones, suggesting, and this is something others have seen in a sort of anecdotal way, that relative to visual cortex, so-called primary auditory cortex is actually more stages in. There's more going on in the subcortical auditory system, relative to the whole system, than there is for the corresponding thing in vision. So perhaps if we had good subcortical data we'd see the early layers mapping to, say, the inferior colliculus, or something like that. We don't have that data, so it's hard to say, but it's an interesting question. If you look in speech-selective cortex, and this is speech-selective cortex delineated in a different way, with an independent
localizer, what you see is that predictivity goes up through the layers, maybe not totally surprisingly, since we optimized for a speech task. But if you look in a bunch of different ROIs, tonotopic, but also pitch-selective, music-selective, speech-selective, you get significantly better predictions out of the task-optimized CNN, especially in the non-tonotopic, non-primary areas, the music-selective area for instance. And now we're really trying to bore into that, to see if, with models that have multiple different task optimizations, for example not just word recognition but something to do with music, it's a little hard to formalize that task in a good way, but we think we have an idea for how to do it, perhaps we can get at finer structure in auditory cortex that would otherwise be hard to come up with hypotheses for. We think we may be able to identify several streams, and in particular finer structure within word recognition. So, just to compare this to the earlier vision work: if you look at word recognition performance on the task I mentioned, against auditory cortex predictivity on that 165-stimulus set, there's actually quite a strong correlation; each dot here is a different model again. I tend to be a little embarrassed about this plot because the number is very high, but nonetheless, we've reproduced it a couple of different ways, and there's this quite strong correlation in this regime. And this is really just to say that although the model architectures are different, the tasks are different, and the data are different, and if you take a vision model and try to use it to explain auditory cortex data it's significantly less good, nonetheless there's this sort of principle that
binds them together: basically, if you optimize for a task with a reasonable architecture, then you get a pretty good predictor of intermediate neural computations. Maybe it's a little grandiose to call that a principle; it's more of a heuristic: if you want a better model than you now have of some cortical area, and your current models are terrible at the tasks that you think the area performs, make the models better at the tasks. And then, if the architectures are also somewhat neurally reasonable, by combining those constraints you get a better prediction at the more intermediate, more detailed levels. Of course, this is not beautiful in the way that the classic receptive-field picture was; we don't get a beautiful description of what the receptive-field structure is. Despite a lot of trying, there aren't a lot of great results visualizing what the filters are. We did a lot of that and it wasn't that successful, and other people have done it better than us, but again, I don't feel it's that successful, in the sense that although you can sometimes see what some filters are doing, some of the time, in some ways, it's very messy: some units seem to be understandable, and most not. So I'm not saying that this type of understanding is impossible for higher visual areas; maybe one day we will figure it out. What I'm saying is that in the meantime there is a sense in which something deeply principled is going on here. Basically, when you build a network, you first have to formulate a model class, in this case the CNN
class. I should say, in this picture, this visualization, each point is a model, and as you go outward along the radius you get bigger and deeper networks; it's just a way of visualizing the architecture class space. But then, within the model class, only some sub-manifolds of models are actually able to perform any given ethologically valid task; most models are bad at any given task, and so a small subset is able to do each one. Tasks pick out these bits of the space. And of course, to get there from an arbitrary starting point, you need to implement some kind of generic learning rule. In this case, and mostly going forward, that's gradient descent, with hyperparameter optimization on top for the metaparameter selection. And then, once you've done that, you can map to the brain data: you can ask to what extent it's the case that you get a strong mapping by doing this process, which allows you to say something about the internal structure. That's principled, even if it's not quite as structurally elegant as being able to interpret the receptive fields. You can think of these three things in a fairly formal way, and we do, as a recipe for how to make models better: an architecture class; a task, which is a loss function on a dataset; and a learning rule, which here is a mixture of hyperparameter optimization and gradient descent. Of course it's really not a principle, in the sense that if you push it too far it doesn't work: if you push performance on any given task beyond human performance, you're probably not going to get better at predicting neurons. In fact, we see that ResNet isn't better at predicting neural responses than VGG.
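The three ingredients named here can be shown in miniature: an architecture class (here, a single logistic unit), a task (a cross-entropy loss on a toy dataset), and a learning rule (plain gradient descent). Everything is a hypothetical toy, not the actual training setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, steps=2000):
    """Learning rule: full-batch gradient descent on cross-entropy loss."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for (x1, x2), y in data:
            p = sigmoid(w[0] * x1 + w[1] * x2 + b)
            err = p - y                 # d(cross-entropy)/d(logit)
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        n = len(data)
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b

# Task: classify points by whether x1 + x2 > 4 (linearly separable).
data = [((x1 / 4, x2 / 4), 1 if x1 + x2 > 4 else 0)
        for x1 in range(5) for x2 in range(5)]
w, b = train(data)
accuracy = sum((sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5) == (y == 1)
               for (x1, x2), y in data) / len(data)
```

Swapping any one ingredient, a different architecture class, a different loss/dataset, or a different optimizer, changes which models the procedure finds, which is exactly the knob-turning described in the talk.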
And VGG is definitely better than what we had back in 2012, 2013, but it's not enormously better, somewhat better, and it also gets worse as you eventually start to push too far. So I'm not saying there's some deep principle by which categorization optimization is the answer to the visual system. It's not; it's just one proxy task that gets you closer than you otherwise would have gotten. But of course, I'm really interested in things beyond this domain. So instead of CNNs, I'm interested in looking at recurrent networks; instead of categorization, looking at task switching. Ideally somebody would hand me a better, more biologically realistic learning rule than strict backpropagation; I'm going to leave that problem aside, since it's hard. But then I'm really interested in mapping to other brain regions as well, parietal cortex, frontal cortex. So, just in terms of what's going on in the lab now, we're just getting set up, but one thing of real interest is recurrent architectures for dynamic tasks: instead of just the feedforward structure, have local recurrences, have long-range recurrences and long-range feedbacks, and think about tasks like object and action recognition, which are inherently dynamic, as well as time discounting, where you've got to do the task, but you've got to do it fast and accurately, or situations where there's heavy occlusion. Basically, take the same strategy: optimize for these tasks and check against both the static data, where maybe we'll do better than before, and also dynamic data, looking at the neural response patterns themselves, the dynamics over time. Can we predict non-trivial structure in the way the neurons change over time, which they certainly do? Like I said, we were averaging from seventy to one hundred
seventy milliseconds, but there's reliable dynamics at the ten-or-twenty-millisecond scale, and what are those patterns doing? And to the question that came up earlier about the animals doing a task: that's a place where we think it's clear there are some differences. That is to say, the dynamics are different when the animals are doing a task, and that probably cashes out in some fraction of the images, the harder ones, becoming doable for those animals, but a small fraction. So maybe the absolute performance numbers aren't that different, but the subset that is different is really interesting, and tells you something about how recurrent dynamics are doing processing. So this is one direction I think is really interesting. I'm also interested in pushing into other cortical areas. I showed you vision and audition, visual cortex and auditory cortex, but I really want to go into a non-human, or rather non-primate, animal, because all sorts of interesting neuroscience techniques can be done in these other systems. And one of the most interesting cortical areas to me is rodent somatosensory cortex, in particular the whisker trigeminal system, because it's very advanced; the animal is doing something very interesting with its whiskers, and it's arguably better at using its whiskers than it is at seeing. And so we came up with the idea, this is work with Mitra Hartmann at Northwestern, who is a whisker expert, of building neural networks that take a three-dimensional sensor model of the whisker array as input and then try to solve reasonably ecologically relevant
Shape the Texan tasks with that type of input OK And those are all and of having to be recurrent models because it's a very temporal signal OK and that's work that's underway we have some pretty good models there and so now we're in the process of comparing two cortical data from that system of course the biggest hole probably in the thing that I've said so far is that although maybe it's the case that the adult state is reasonably predictive. Of by the models by you know optimizing for categorization. The learning rules the learning task is terribly like on a biological right in the sense that no no my cat no human presumably but certainly no mechanic has access to thousands of labels in each of thousands of categories that's just crazy right so but none of the system does a good job. So you really we need to if you know people back to the picture I mean it's a couple slides ago of these three things. Even if we know what architecture class is OK in a visual system and even if we are going to assume OK that we are excuse me even if we are soon that we're going to be using sort of gradient the center based methods. Excuse me. That you know obviously was something wrong with our task with our last function on our dataset OK So you know a big. A big region of interest in the lab is trying to figure out better task for self supervised learning brain and trying to be as creative as possible here because we're feeling like we there's a lot of been a lot of negative results coming out of auto and coders and similar things you know so one of these things is most interesting to us is future prediction underage and controlled action so interactive data set construction but there's a hard task and I don't have a lot to report on that yet although it's a really interesting direction. 
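To make the future-prediction idea concrete, here is a minimal self-supervised sketch. This is entirely a toy construction of my own, not the lab's actual models: a linear next-step predictor is trained on a synthetic sensory stream (a noisily rotating 2-D point standing in for a moving stimulus), using only the prediction error as the training signal, with no labels anywhere.

```python
import numpy as np

# Hypothetical toy stream: a 2-D point rotating at fixed angular
# velocity stands in for a structured sensory input. Nothing here
# comes from the actual models described in the talk.
rng = np.random.default_rng(0)
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.zeros((200, 2))
x[0] = [1.0, 0.0]
for t in range(199):
    x[t + 1] = R @ x[t] + 0.01 * rng.standard_normal(2)

def loss(W):
    """Mean squared error of the next-frame prediction."""
    pred = x[:-1] @ W.T
    return float(np.mean((pred - x[1:]) ** 2))

# Self-supervised objective: predict the next frame from the current
# one; the prediction error itself is the training signal, no labels.
W = np.zeros((2, 2))
lr = 0.1
initial = loss(W)
for _ in range(500):
    pred = x[:-1] @ W.T
    err = pred - x[1:]
    grad = 2 * err.T @ x[:-1] / len(err)   # gradient of squared error
    W -= lr * grad
final = loss(W)
print(f"prediction MSE: {initial:.4f} -> {final:.6f}")
```

The same principle is meant to scale up: swap the linear map for a recurrent network and the rotating point for video frames under the agent's own actions, and the objective stays the same, predict what comes next.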
And finally: although you might have a system with a really great representation that can do all these different tasks, like I showed you earlier, categorization, localization, size, pose and so on, on any given task in the world you're actually only ever doing one of them. You have to figure out which one it is, learn it fast, choose it, and learn how to switch between them. So we're trying to understand what kind of controller architecture allows an agent like this, embodied in a world, to do continual learning in a reasonable way. That is to say: figure out what task is happening now, learn readouts that are good for it, learn how to build readouts in a way that uses the information you already have efficiently, and then switch flexibly between them. So those are some ongoing projects in the group that build out from deep, rich sensory models and try to deepen our understanding of them, but also use them to understand other cognitive domains as well.

Anyway, I'll just leave you by saying that this kind of task-driven modeling can significantly improve quantitative models of intermediate and higher-level cortical areas, there can be qualitative insight if you work at it, and the concepts are useful across multiple sensory modalities. So that's what I have to say, and please, let's have some more questions. Great, so thank you.

Yes. Yeah, so you want an understanding, roughly, of what the view is, what the features are doing, and what they could be.
Well, if you know how to do it, let me know. I've been very unsatisfied by our own and other people's efforts to make a kind of word-level story of what the features are computing. I can tell you that at some level they're computing what they need to compute so the next level can compute what it needs to compute, so the next level can compute what it needs to compute, so at the end you solve the task. And that's a story of a kind: look, you've got this big nonlinear system, it was optimized by evolution and development to solve this big nonlinear task, there you are. That's an explanation of some kind, but I think you want more than that. I have a high criterion for what it would take for a visualization or an explanation of the units to be useful: you have to be able to reduce the training time by virtue of whatever insight you have. If you could tell me, look, knowing this thing reduces the training time a lot, you could just stick it in, if you have an intuition for it, if you have formulas for it. Certainly unsupervised data, the sparse autoencoding approach, is much cheaper than what we have to do, where labels are comparatively expensive. So if there were some insight that allowed you to get out of having that bad supervision task as the driver, that would be a way of reducing the training data in some sense, and I think that would be really great. And you know, for all that, those gray bars from earlier, those sucked even harder.

Now, you're asking something interesting; there are a number of different questions embedded in there. I think one of them is the easy version, so I'll answer that one first, which is: how much did neuroscience help, how much does neuroanatomy tell you? And I think it's a very long, slow road, in the sense that, and I was talking to somebody about this earlier today, the first convnets were developed by a Japanese researcher back in the late seventies, and he told me at lunch one day that the reason he knew to use the convnet structure was that his neighbor down the hall was Tanaka, a neuroscientist, who told him: look, the visual system solves this with a hierarchical, retinotopic structure. And to a mathematician, hierarchical and retinotopic sounds like repeated convolutions. I don't know how true the story is, but it's what he said, and that's the kind of thing that happens once in a lifetime, slowly, in an unpredictable way, where some qualitative insight from neuroscience may help you make a better model, and it took a long time for the models to actually be engineered to the point where they were better. The same could be said for a bunch of those intermediate steps I showed you: filtering, that's the convolution part, but also the different nonlinearities that were in there. Neuroscientists figured out things about divisive normalization and other types of normalization steps before they were useful from a convnet-operation point of view, but it was very uncertain, there were lots of things that could go wrong in implementing them, and so it was really not a direct translation. So in my own work I'm certainly not counting on neuroscience insight to tell me what to do next in a detailed way. I would say half of my group does AI and machine learning, on the hope that by solving more interesting tasks we'll have models for neural processes that we currently can't even begin to approach.
Then there's another version of the question, which is harder, and I'm going to rephrase it and then punt on it, which is basically: is it the case that there are other model classes that are not convolutional, in some sense, whatever you mean by not convolutional, that predict the neural data as well and solve the task as well? And the current answer is no. Well, there's one problem there, which is: what do you mean by not convolutional? Of course something might not formally look convolutional but actually compute the same functions internally. So, modulo that definitional issue: the answer at this point is that there are no other classes that even solve the task well, so we don't have a positive control there. I wish we did, and it would be a great question to answer in that case, so I have to punt on it, basically, because we don't have it. I don't know, maybe we'll never have it, maybe basically the only types of networks that solve the task are of that form, or maybe there's something out there that's really different in some way, that could solve the categorization task well but then doesn't look like neural data, and that would be a great positive control. But let me put it this way: you can do the following thing, which is take a shallow neural network and approximate the deep neural network with it to some extent. Those shallow networks can to some extent solve the task, but they're terrible at explaining the neural response patterns. So in a way that's something like a jury-rigged positive control; it's basically the universal approximation theorem failing you, in the sense that the approximation is real but it applies only at the input-output level. A shallow network that approximates the deep network's input-output behavior isn't necessarily going to have intermediate steps, neurons, that look like the deep network's intermediate steps. So in a way that's a bit of a positive control, but I think you'd want something better than that, and I don't know how to give it to you at this point. Yeah.

OK, thanks, thanks for having me.
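Going back to the continual-learning controller idea from near the end of the talk, here is a minimal sketch in which everything (the features, the two tasks, the probe scheme) is my own hypothetical toy choice, not the actual architecture under study: one frozen shared representation, one cheap linear readout per task, and a controller that infers which task is currently running from a few labeled probe trials and switches to the matching readout.

```python
import numpy as np

# Hypothetical setup: a frozen shared representation plus per-task
# linear readouts, with task identity inferred from a few probes.
rng = np.random.default_rng(1)
A = rng.standard_normal((32, 2))

def phi(x):
    """Frozen shared representation: random tanh features."""
    return np.tanh(x @ A.T)

tasks = {0: lambda x: np.sign(x[:, 0]),   # toy task 0: left vs right
         1: lambda x: np.sign(x[:, 1])}   # toy task 1: up vs down

# Learn a cheap readout per task by least squares on shared features.
X = rng.standard_normal((500, 2))
F = phi(X)
readouts = {k: np.linalg.lstsq(F, f(X), rcond=None)[0]
            for k, f in tasks.items()}

def accuracy(w, x, y):
    return float(np.mean(np.sign(phi(x) @ w) == y))

def infer_task(x_probe, y_probe):
    """Controller: pick the readout that works on the probe trials."""
    return max(readouts, key=lambda k: accuracy(readouts[k], x_probe, y_probe))

# The environment secretly runs task 1; 20 probes suffice to switch.
Xp = rng.standard_normal((20, 2))
current = 1
guess = infer_task(Xp, tasks[current](Xp))
print("inferred task:", guess)
```

The design point this is meant to illustrate is the division of labor: the expensive representation is learned once and shared, while per-task readouts are cheap enough to learn fast and to swap flexibly as the controller's inferred task changes.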