Thank you so much for the introduction. Henrik is indeed a dear friend of mine, and has been for many years. However, when someone introduces you and the introduction has the word "ridiculous" in it twice in the first sentence, it kind of makes you nervous. So I hope not to be ridiculous, is all I can say.
I am going to be talking about tractability and attention, probably a lot more about attention than tractability in vision, but that's really the general theme. I'd like to start by first thanking the students who have played a role in developing all the work that I'll be showing you today, among them Alexander Andreopoulos, Yiming Ye, Ksenia Shubina, and Eugene Simine. I would also like to thank the funding sources: the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada Research Chairs program.
So where did this all start? It started with a paper in 1971 by Harry Barrow and Robin Popplestone; some of you may remember those names. (This is very disconcerting: what's on my slides is not appearing on my laptop, so I'm going to have to walk back here every once in a while.) What they said is that they looked at the problem of object recognition from a very realistic perspective.
If you're trying to recognize things, what do we do as humans? Well, if I'm at all puzzled by what I'm looking at, I might walk up to it and look more closely, take a look at it from a couple of different viewpoints, even touch it, and not just touch it but move it around in order to manipulate it. If you think about it, that's how children learn: by manipulating things, right? So the point is that the process of recognition is very active. It's not simply a matter of showing an image and saying "go"; there's a lot more to it than just that.
In 1980 I had just finished my PhD thesis. What I had done was build an attention system within knowledge-based vision, and I was really happy with it; I was very proud of my work. The problem was that the prevailing view at the time of how you do vision was due to David Marr, who did not include anything about attention in what he wrote, and certainly did not like the use of knowledge in trying to understand vision. So here's poor me wondering: I just graduated; did my department give me a degree for doing something that's wrong? Maybe some of you have been in that position before. I was kind of puzzled, and I was very concerned.
In 1985, Ruzena Bajcsy wrote a very nice paper on active perception, where she summarized very nicely what Barrow and Popplestone had said: active sensing is the problem of intelligent control strategies applied to the data acquisition process, including the data interpretation part. In other words, there was necessarily a hypothesize-and-test cycle involved in data acquisition.
So that was really good; it motivated me to think that maybe I wasn't wrong, or at least to ask whom I should believe, Ruzena or David. And then I met Anne Treisman, a well-known cognitive psychologist who has worked on human visual search for a long time, and I think that set me off on a path. I will try to summarize this path, because you might be able to tell that I've been at this for a little while, by showing you the yellow brick road along which I have passed; I hope this shows up. At the beginning there is a very tiny bit at the top, which is the paper I referred to at the first ICCV, on the complexity-level analysis work, where we had all these big numbers. The idea of those big numbers was to show that the straightforward brute-force approach to vision just can't possibly be right.
It turns out that Ruzena really liked that paper, and at a NATO Advanced Study Institute in Italy in 1990 she took me aside and said: well, that was really nice work, but if you don't show that the same thing is true for the time-varying case, for video, then you really haven't solved the whole problem. That set me off working on that, and there was a paper in IJCV that talks about active versus passive perception and looks at the time-varying case. We then had a nice CVPR paper that looked at active recognition behaviors: how do you actually incorporate this hypothesize-and-test cycle that Ruzena was talking about?
We decided that the current sensors were not adequate, and we designed our own head, TRISH. The key new component is that it involved torsion as another degree of freedom: cyclotorsion. What is cyclotorsion? It turns out, as you may or may not know, that your eyes actually rotate this way, about the line of sight. You look puzzled. They actually rotate this way, five to ten degrees or so, and it's not something that you can control easily, but it's been known since Helmholtz, who worked out all of the math for it. It turns out to be very valuable when you're trying to look downwards, when you're looking off the center point. If you look off in this direction, your eyes will rotate in order to bring things into focus. So we included that in the design of our head and showed exactly how it can enhance stereo.
Then, in a CVPR paper, I looked at the problem more broadly: if we include behaviors and not just perception, is the problem still very difficult? Yes, it is. I proved that it was formally intractable, using methods from computational complexity, and then designed a control strategy that I called STAR; in that period, in robotics, we were trying systems based on that. And then there was this nice paper on attention and viewpoint control. The reason it's in red is just to give my co-authors credit: Henrik was involved in that paper, and I thank him for his collaboration there, along with Sven Dickinson, who is at the University of Toronto now and was my postdoc at the time, another graduate student of mine, David Wilkes, and, if I remember correctly, another student from Stockholm, Göran Olofsson.
We then started putting some of these ideas on our robotic wheelchair platform, which we called PLAYBOT. We developed a search algorithm for how one can do object search in an environment, and you'll see more of that in the remainder of the talk. We looked at the problem of view degeneracy and how you can use focal-length control. We looked at sensor planning and proved that it was NP-hard; you'll see some of that today. We looked at the issue of saliency, how one develops a saliency algorithm, and this is one based on information theory. We got our wheelchair to actually open doors, and we presented that in 2007. There were more proofs about localization of an object, and different search policies examined in an empirical study that you'll see later today. We then ported this to a humanoid robot and showed that these algorithms seem to work well and are easily portable to other platforms.
There was a nice paper that addressed the issue of the active aspects: the sensor parameters. Do they matter? I'll talk to you about that. Then some more theoretical work as well, and then the most recent work, which will be presented in May at the Canadian Conference on Computer and Robot Vision, where your own Frank Dellaert, I think, is a keynote speaker. Yes? Did you know that? OK. So that's the path that I've taken since 1987.
So I feel that this is a nice snapshot of a research program with a very consistent theme: namely, trying to understand those early thoughts that I showed from Barrow and Popplestone and from Bajcsy, in terms of how we actually implement all of that, and trying to understand it in a deep sense.
So, in this presentation, what I'd like to do is argue that active vision is a subset of visual attention, speaking more broadly. I'll point you to the fact that these issues of formal problem analysis are critical to pointing us to viable solutions. I'll show you a search algorithm that I'll call minimalist and attentive, and I'll show you why I call it that; show you different evaluations in three separate demonstrations of it; and then end with a discussion about sensor settings and how they matter.
So let's start with active vision as a subset of attention. Why active? I gave you the kind of arm-waving version that Bajcsy wrote about, but really, what can we get by being active? Well, I tried here to list everything I could think of. There may be more; if you think of more, please do tell me, so I can make the list longer. It looks more impressive if you have more stuff on your slide, regardless of what Patrick Winston says about giving good talks. I figured at least a couple of people would laugh, and I was happy that a couple of people did; I can explain that later. But the main point of this slide is to highlight for you that there is a tremendous amount of basic activity that can be thought of as active vision.
It's not a matter of: here's an image, tell me what's in it. It's not quite so simple in the real world. In fact, if you ask whether these can be considered attention: each requires the act of selecting one from among many options, and a simple definition of attention is that attention is the cognitive process of selectively concentrating on one aspect of the environment while ignoring other things. So each of those selections would be guided by a purpose; each of them is taken with the intent of moving the system closer to its goal. This is classically what any control-theoretic operation would like to do: move your system closer to the goal. So the preferred definition, which appears in one of my writings, is that attention is a set of mechanisms required to tune the search processes in perception and cognition to achieve their best performance for a given task. That's the working definition of attention that we'll use throughout.
All of those active tasks that I mentioned can be seen in most real tasks. This is a list of household tasks, and this is the list of tasks that DARPA has asked the Robotics Challenge to work on. I challenge you to choose any of these and say that there's nothing active about it at all. They all require active mechanisms, and they all require more than one active mechanism operating in concert. So if that is the case, how exactly can we do that? How do you put all of that together into one package and make it all work? That's an open challenge, I think, for the community at large. I'm certainly not going to give you much of the solution today.
Being active also entails costs; you can read about that in the 1992 paper. You have to decide that an action is needed, determine which one to apply and in what sequence, and account for the costs involved, including adopting the new viewpoint, executing the changes, and determining correspondences. The benefits should outweigh the costs, so cost has to play a major role, and this is a component that is normally not considered in computer vision: namely, the cost of doing something. If you have a robot and you want the robot to execute real tasks, you need to consider the cost of executing those tasks as well, and make sure that it's not insurmountable with respect to what you're doing.
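To make the benefits-versus-costs idea concrete, here is a minimal sketch of an action chooser that only acts when the expected benefit outweighs the cost. This is my own illustration, not the system's actual formulation; the action names and numbers are made up.

```python
def choose_action(candidates):
    """Pick the candidate whose benefit-minus-cost is largest and positive.

    Each candidate is a dict with hypothetical fields:
      'benefit' - expected gain in task progress (e.g., detection probability)
      'cost'    - cost of adopting the viewpoint, executing the change, and
                  re-establishing correspondences
    Returns None when no action is worth its cost.
    """
    best = max(candidates, key=lambda a: a["benefit"] - a["cost"])
    return best if best["benefit"] > best["cost"] else None

actions = [
    {"name": "turn_camera", "benefit": 0.30, "cost": 0.05},
    {"name": "move_base",   "benefit": 0.40, "cost": 0.50},  # a costly move
]
print(choose_action(actions)["name"])  # turn_camera wins on net benefit
```

Note that the higher-benefit action loses here because its cost is also higher; that is exactly the trade-off usually left out of computer vision.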
So that's my argument as to why I think active vision is really a subset of attention.
Moving on to formal analysis. This is where the big numbers that Henrik referred to at the beginning came from, because what I was trying to do was understand how difficult vision is. After all, if you look at the field of attention, which lives primarily in cognitive science, psychology, and neuroscience, terms like capacity limits and resource allocation appear all the time, and they're not quantified anywhere. No one tells you. Everyone says it's too much, but no one says how much is too much. What actually fits into the brain?
So I was motivated by that particular problem, and that's where that first paper came from, and many of the proofs afterwards, in order to show that what the brain is doing is not solving this idealized vision problem. It's solving something else, and finding the something else is really the trick. That's where these formal analyses can help point you. I'm not going to be talking about any of that today, but rather show you how those same formal analyses can point you to practical solutions in robotics as well. But the key thing is that any perceiving agent needs attention. Any perceiving agent needs attention. I don't care what it's implemented on, whether it's biological or hydraulic; any perceiving agent needs attention, and the theory makes that clear.
So we worked very hard, over how many years, twenty-five years, to develop a body of theoretical work that tries to show exactly how difficult these problems are. The reasons these problems are difficult are all combinatorial reasons. It's not the fact that there are a lot of pixels; it's not the fact that there are a lot of photoreceptors in the retina. It's the fact that you don't know which combination corresponds to anything that's recognizable. So in every one of these problems, you have the power set of all pixels, or the power set of all photoreceptors, to think about as the size of your search space. That's really the issue here, and that's what many of these theorems are based on; the NP-completeness of these things is based really on that.
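As a back-of-the-envelope illustration of the combinatorics behind these theorems: the number of pixel subsets is 2 to the number of pixels, so even a tiny image explodes. The image sizes below are illustrative, not numbers from the talk.

```python
import math

# The search space the theorems quantify is the power set of pixels:
# 2**N subsets for an image with N pixels.
for side in (8, 32, 256):
    n = side * side
    # int(n * log10(2)) gives the approximate number of decimal digits
    print(f"{side}x{side} image: 2^{n} = about 10^{int(n * math.log10(2))} pixel subsets")
```

Even the 8-by-8 image yields about 10^19 subsets, which is why "which combination of pixels matters" cannot be searched by brute force.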
The ones that I've highlighted here show you the theorems that tell you something about how the problem becomes easier. What these show is that, in every case, when you can add some direction to the system, due to knowledge of the task, and that knowledge can guide the process, things become linear. That's really a key result, if you think about it. It tells you that including task direction is very important, and that task direction can be world knowledge; it can be things that are specific to the task. It doesn't actually have to be very much at all in order to turn this into a linear problem. That theoretical work is all pretty solid; some of the basic theorems have actually been replicated by others as well.
Do these theorems matter? It's easy, sometimes, for people to say: well, there's lots of theoretical work, and that's nice, you had a nice time playing with pencil and paper and you got some papers out of it, but who cares? Well, I would say they do matter. You see all the time in computer vision, especially, purely feedforward architectures with no attention; they're very, very popular, and this is really the Marr school of vision. But if you look carefully at what those algorithms do, they all try to restrict their search space by making assumptions in advance. They assume that faces are well framed in the image; they assume that we will only look at four different orientations of edges. They make a variety of assumptions, all of which are geared to limiting the search space in a sufficiently important manner that things can fit into computers. And as computers become faster, you can do more and more of that, and that lets you think sometimes that maybe the theorems really don't matter, because I can do all this stuff.
The thesis that I mentioned was done on a PDP-11/45; I had 256K of memory on which to build my vision system. Trust me, I had to be attentive; there was no way to do it otherwise. So does the theory not matter? The point is that this approach doesn't scale. If you're going to claim that you have a solution that scales, and you're making all these assumptions in advance, then that's not going to get you there. You really have to look at the problem in a deeper fashion. And why do I mention human behavior in all of this?
Anyway, what does that have to do with building a robot or a real system? Well, typically, real robots interact with people, and those people expect the robot they're interacting with, say a companion robot that my elderly mother might need, to behave in a way that they understand. They expect that it has the kinds of capabilities that people have, or at least that it operates in an understandable manner. So it's a robot-interface issue: you look at human behavior and try to build something that is at least consistent and compatible with it, so that humans know how to interact with that robot.
And one of the big aspects of human behavior is human visual attention. It turns out that it's a very, very broad and complex phenomenon; I had a nice talk with some of you this morning about exactly how broad it is. It's not just the saliency map that you see most often in computer vision; there's much more to it. And now a short commercial break: if you want to see more about what is there, there is a great book I can recommend that gives you a very broad view of attention in this way, plus a complete set of references for everything that's ever been done in the field.
So. Henrik, I'm assuming I still have time? OK, I will go a little bit quickly over some of the rest, so bear with me, but I'm happy to talk to you afterwards.
So, I mentioned to you that there is an object search algorithm that I will present, and that's next. We first have to define the problem. This is work with Yiming Ye. We defined the object search problem in the following way.
The problem is to select a sequence of actions that maximizes the probability that a robot with active sensing capabilities will find an object in a partially unknown 3D space within a given set of resources. It's basically finding the set of actions, from the space of all possible actions in that environment, that satisfies this cost constraint, which we took to be time only in the beginning, and maximizes the probability that I actually find the object.
As it turns out, the updating action is just standard Bayesian updating; there's nothing very complex about it. We have a conditional probability of being able to find the object with whatever recognizer you're using; the recognizer, by the way, doesn't matter for the rest of this. They all have different properties, and it doesn't matter, as you'll see. So this is standard Bayes, but the problem is that we know none of these quantities to start. This is the probability for a given set of actions; we want to maximize the probability of finding the object, and it's given by this sum over the actions. Everything is encoded as an occupancy grid: we're using very, very small voxels, and within each one we encode the probability that the object is centered at that point. So the probability shown there is for a particular object, at a particular point, and at a particular time in the sequence.
Optimality doesn't so much matter, because it's too hard. We really can't do that; you can prove that this problem is hard, and we proved that in a 2001 paper.
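The Bayesian update on a failed detection can be sketched in a few lines. This is a simplified illustration, assuming a flat list of grid cells and a single scalar detection probability for the recognizer, both simplifications of the formulation on the slide.

```python
def update_on_failure(p, viewed, p_detect):
    """Bayes update of per-cell target probabilities after the recognizer
    reports 'not found' from one viewpoint.

    p        : prior probability that the target is centered in each cell
    viewed   : indices of the cells inside the sensed wedge
    p_detect : probability the recognizer detects the target when the
               target's cell is actually in view (one scalar, for simplicity)
    """
    viewed = set(viewed)
    # P(not detected) = 1 - p_detect * (prior mass inside the wedge)
    norm = 1.0 - p_detect * sum(p[i] for i in viewed)
    return [
        (p[i] * (1.0 - p_detect) if i in viewed else p[i]) / norm
        for i in range(len(p))
    ]

# Four cells, uniform prior; a failed look at cells 0 and 1
# shifts probability mass toward cells 2 and 3.
post = update_on_failure([0.25, 0.25, 0.25, 0.25], [0, 1], p_detect=0.9)
```

The viewed cells are discounted by the miss probability and everything is renormalized, which is exactly the "zero it out and update" step described below, done softly rather than literally to zero.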
So we'll do something heuristic, because we're not going to get an optimal solution. We'll divide this into two kinds of actions: where to look next, and where to move next. There's no look-ahead here; I'll come back to the look-ahead issue later.
So where do we look next? We have an occupancy grid of little cubes, I think five by five by five centimeters in our example runs, so they're small cubes. What we're going to do is look at the region around the robot and ask: in which direction is the sum over cubes the largest? How do we do that?
If you think about it, whenever you point a camera at a scene, there's a certain region that you're looking at, and a certain region within which the recognizer can actually function. There's a visual field, and then there's a depth of field within which the recognizer works: objects can't be too close, because you can't recognize them, and objects can't be too far away, because again you can't recognize them; there are certain values for those limits. So you get a kind of wedge, and you can put all these wedges together, all around the robot, for all the viewpoints, and ask: what is the sum of probabilities within this wedge? What about within this wedge? And so forth.
Right. Then you say: whichever wedge is strongest gives me the greatest likelihood, and I'll choose that one. Initially, unless you've done something special, they're all the same, so you just choose one at random and keep going. Then you apply your recognizer, and if it says the object is not there, you take that portion of the wedge, zero it out, update the probabilities, and go again. You'll see how this works in an example.
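The where-to-look step reduces to an argmax over wedges. A minimal sketch, where each wedge is simply represented as the set of grid cells it covers (the geometry of building those sets is omitted here):

```python
def best_wedge(p, wedges):
    """Return the index of the viewing wedge whose cells carry the most
    probability mass, i.e., the most promising direction to look next.

    p      : per-cell probabilities that the target is centered there
    wedges : list of wedges, each a list of cell indices it covers
    """
    scores = [sum(p[i] for i in w) for w in wedges]
    return scores.index(max(scores))

# Six cells; wedge 1 covers the cells where the target is most likely.
p = [0.05, 0.05, 0.30, 0.40, 0.10, 0.10]
wedges = [[0, 1], [2, 3], [4, 5]]
print(best_wedge(p, wedges))  # -> 1
```

With a uniform prior all wedges tie, which is why the first look is effectively random, as described above.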
Once I've exhausted all the different directions that I can look at from one given position, I'd like to figure out where to move next. I do that by hypothesizing: suppose I go over there, and I do the same thing; what's the total sum of probabilities over the views from that position? How about if I go over there; what's the total sum from that position? I look at all the positions, choose the largest one, move to that position, and then repeat the cycle. OK, so it's as simple as that kind of algorithm.
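The where-to-move step is the same idea one level up: score each hypothetical position by the probability mass its views would cover, and move to the best one. Again a sketch of my own, with positions and their visible cells given directly:

```python
def best_position(p, views_at):
    """Pick the next robot position: the one whose available views together
    cover the most remaining probability mass.

    p        : per-cell target probabilities
    views_at : dict mapping position -> list of wedges,
               each wedge a list of cell indices visible from there
    """
    def score(pos):
        covered = set()              # count each cell once per position
        for wedge in views_at[pos]:
            covered.update(wedge)
        return sum(p[i] for i in covered)
    return max(views_at, key=score)

p = [0.05, 0.05, 0.60, 0.20, 0.05, 0.05]
views_at = {"A": [[0, 1], [1, 4]], "B": [[2, 3], [3, 5]]}
print(best_position(p, views_at))  # -> B
```

Note there is no look-ahead: each move is chosen greedily from the current probability map, exactly as in the talk.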
Why would I call this an attentive algorithm? Recall our definition. Basically, we have a number of mechanisms of attention, and you can see there are many different kinds of attention, not just a single one. I won't go into all of that detail, but what I can tell you is that the strategy I just showed you selects a viewpoint, selects actions, restricts the recognition task to a visual field, and suppresses consideration of previously seen regions. All of those are attentive functions, so it is an attentive kind of structure.
Why is it minimalist? There's a lot it doesn't do. It doesn't produce a map, so I can't tell you where all the objects are in any faithful way. It doesn't cover the space; it does a search only. It does not require knowledge of the interior, just what the boundary might be, and it uses only vision and dead reckoning. So this is about as small as you can get. Part of the beauty of it is that afterwards, as you'll see, you can add on all kinds of other functionality. When other functionality is available, it interfaces nicely with this and the whole is improved; but if you don't have it, it will still function, and it functions in a very dogged way, as we've seen before.
So let me demonstrate this. This is our lab; this appeared in CVIU in 2010. The robot begins there, facing toward the shelves, and the object that the system is looking for is this. This is the most brain-dead kind of recognition of all: you show it the object once, it creates a kind of template, and it looks for it. So it relies on a basic assumption, namely that the robot finds its way to the object in such a way that it has a good view of the object it's looking for. It doesn't have to be exactly front-on; it has quite a lot of leeway, depending on the recognizer. But it's very simple, and it didn't learn the rotated version; it only learned the upright one. The total number of positions is thirty-two, because it goes meter by meter here.
The number of actions to choose from at any step is 544, because of all the different viewpoints. The number of occupancy grid locations is 450,000, each holding probabilities, and the number of possible states is very, very large. So that's the setup for this.
This is the algorithm functioning, so let me just explain. This is a picture of the robot as it's searching; that is the image being tested from any particular position. This is the map of the environment; the robot does not know the map at all, it's for you only. That's the robot. This is the wedge within which the recognizer operates, and the direction in which the camera is pointing; the zero degrees there says that it's looking in a horizontal direction, without tilt. This is the slice through the middle of the probability density function: the black shows where I have been and have recognized, where I have realized that the object is not there, and the green bits are the stereo hits that the Bumblebee's built-in stereo disparity maps tell us about. And that's all.
So this is position one, sensing action one. That's position two, so it's now looking toward the object, but it's far away: it's looking that way, the object is over here, and looking from far away it can't see it. Positions three and four. So at this point it has analyzed only four images of the whole space and decided it's time to move, because it can get nothing useful out of anything here. Next it moves over to this position and starts looking in this direction; again, the object it's looking for is over here, but it doesn't have a good view of that object. So that's the robot looking over there.
We do the same thing: look at different viewpoints. In this case, it's looking back over areas it had already looked at, and realizes that, no, it has to tilt. So this is at a thirty-degree angle, because it has already looked at the zero-degree angle this way. And it continues looking; that second position has not found the object that it wants, so it moves on, with only eight images needing to be inspected so far. It moves on. There's the object, but now the robot is looking in the other direction. It moves on here and looks upwards: is the object on the ceiling? Maybe they're trying to trick me. So it looks up here and continues to look. That's the third position, and now it's looking right at the object here, but this is not recognizable yet, given what was learned about the object. So it now chooses to move to the next position over here, takes another look, and now it can recognize the object.
The interesting thing about all of this is that the probability density map has been created during the process of recognition, and this required the robot to be in only four positions, moving three times, with only thirteen images inspected. If you think about what a brute-force inspection of the space would be: how many images would you have to look at if you were going to try to find that object in a blind, totally brute-force manner? I don't know; it could be one, it could be a million. You just don't know. This took thirteen, so I think that's actually pretty good.
So we decided we needed to check exactly how good this could be, and maybe look at it with a different set of policies as well. [Audience question.] Yes, this is guided by the fact that every time it decides where to look next, it looks at the current probability density function. So yes, it knows where it has looked, and it knows where the previous looking has not been fruitful. Yes.
As for whether brute force would be intractable: first of all, who knows how many images there are. Also, the more common question that I get, which is related to what you asked, is: why not use a POMDP? There are lots of examples of those. What I tried to show you with the number of states is that it's far larger than what normal POMDPs seem to be able to handle. And so, yeah.
OK. So, number one, we did the theory and showed that it's an NP-hard problem. It's NP-hard because the structure of the problem is the same as knapsack, and we can do a reduction from knapsack to the object search problem. So it's provably hard in that sense, and that tells you that there is going to be no optimal algorithm that you can find; it's in a sense saying that, in the worst case, everything is going to be brute force.
So if you ask how many images it is possible to capture with this mobile platform on that floor, I think the number would be astronomical, because you can look from any angle; you're not subdividing the space in any intelligent manner, you're just looking at all possible images that can be created from that space. That would be brute force. [Audience.] No, the search space I created was not combinatorial, because I was not designing a brute-force algorithm.
So if I go back here and look at how many actions there are: the number of actions available from any step is that many. And if I look at the number of possible states, it would be this number, which corresponds to the number of occupancy grid positions times two, because each one codes a binary value, whether it is occupied, whether the object is there, and each also includes a probability, which is a real number, who knows how big that is. And then the states, times the number of possible steps, gives you the options at each point.
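A rough calculation makes the "astronomical" claim concrete, using the numbers from the setup earlier (about 450,000 occupancy grid cells and 544 candidate actions per step); the sequence length below is my own illustrative choice.

```python
import math

cells = 450_000
actions_per_step = 544

# A single binary flag per cell already gives 2**450000 joint configurations.
log10_configs = cells * math.log10(2)
print(f"binary occupancy alone: about 10^{log10_configs:.0f} states")

# That is before counting the real-valued probability stored in each cell,
# or the action choices compounding over a multi-step search:
k = 10  # illustrative search length
print(f"{actions_per_step}^{k} action sequences of length {k}")
```

Around 10^135463 occupancy configurations alone is far beyond what POMDP solvers enumerate, which is the point of the answer above.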
OK. So we decided to look at all of this a little more closely and see how performance would change with a change in the decision policy. In order to have attentive selection, you depend on some decision process, so we decided to look at different strategies. The one that we used so far, in the example I showed you, is this one: explore the current position first, with the next position determined by the maximum probability that you get. Another one could be to choose the action with the largest total detection probability; this corresponds to a greedy choice, because you're not looking at everything in a fine-grained way. Another one could be to maximize probability while minimizing the distance traveled. And another one could be to just minimize distance.
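The last three policies can be written as scoring rules over candidate positions (the first policy differs in when it triggers a move rather than in its score). This is my own paraphrase as code; the policy names and numbers are illustrative, not from the experiments.

```python
def next_position(candidates, policy):
    """candidates: list of (probability_mass, distance) per reachable position.
    Returns the index of the position the given policy would choose."""
    score = {
        "max_total_prob":    lambda p, d: p,               # greedy on probability alone
        "prob_per_distance": lambda p, d: p / (d + 1e-9),  # maximize probability, minimize travel
        "min_distance":      lambda p, d: -d,              # just go to the nearest candidate
    }[policy]
    return max(range(len(candidates)), key=lambda i: score(*candidates[i]))

cands = [(0.6, 8.0), (0.5, 2.0), (0.1, 0.5)]
print(next_position(cands, "prob_per_distance"))  # -> 1: good mass, short trip
```

The three policies pick three different positions here, which is why the empirical comparison below is worth running at all.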
So we tested these four, and we did it by running the algorithm in the same space you saw. We placed the robot in five different positions; we placed the target in four different positions, which were on tables; and we tested it, first with no knowledge at all, and second with the knowledge that the object was on one of those tables. In other words, you take the initial, uniformly equal probability density function and you increase it at the tables: you give it a hint, and you see what happens. That is 160 runs; 145 of them were successful in terms of the object being found, and the rest we could easily explain, because we had no localization on the robot, so it sometimes got itself lost, and stereo also failed at times; we had no smarts there with respect to those things.
So, for the results of these, we look now at the number of actions, the total time it took, and the distance traveled for each of these different sets. This is with no prior knowledge; this is with knowledge that the target is on one of the tables. The interesting thing is that option C, where you maximize the probability while minimizing the distance traveled, gives you the best result in all cases. This is very useful if you're building a practical system, because what it tells you is, first of all, that some knowledge is better than none; it's going to give you a win. If you need to minimize time, then you take option C plus use knowledge. If you need to minimize the number of actions, it's a different subset of choices, and if you need to minimize distance, again, it's this. So it gives you some information on how to design a search strategy, where you have the possibility of including variations in these options.
So we can look at cost a little bit more deeply. This is a collaboration with Honda Research Institute Europe, with Alexander Andreopoulos, who was my PhD student, and myself; it came about just by a chance meeting. He went over to Honda in Offenbach and ported our search algorithm to this humanoid robot, which I'm not supposed to name; I'll explain that for you after the cameras go off. It uses the same search algorithm and defines another search policy as well.
So there's just an example so
that you could have tables or
whole rooms full of objects.
This is the environment that the tests were done in, and it's only vision and only dead reckoning for robot localization. The cost that was added in looks very much like the policy I mentioned earlier, except they included all of the costs of the humanoid being able to move, not just the distance; that's all encapsulated in this variable T.
I can show you this video off camera; don't ask me why. They used the knowledge of the intractability of the localization problem in order to improve the overall function, and it actually did a very nice job. The video does show the robot doing the sensible thing for a target here: it actually walks around and then finds the target eventually.
We can add in some look-ahead. I mentioned that look-ahead was something we did not include so far, and it's basically a kind of search we can use. So the first example had no knowledge; then we added some hints about which regions to try first. We're now going to add in some saliency knowledge, and in the future we're going to look at indirect search knowledge, like Tom Garvey and Ballard talked about, and at predictive knowledge as well.
So saliency, to us, is look-ahead, and this will be presented at the same conference where Frank is giving the keynote. As you know, in search, look-ahead is always a combinatorial thing. If you're playing a game, for example, and you're looking ahead many moves, it's combinatorial; it gets bad very, very quickly. But we're not playing a game here; we have a vision problem. So in vision, you can just look ahead.
So let me explain.
So the impact of a recognition action in our algorithm affects only this recognition depth of field. So that's this visual angle: if the robot's there, that's the visual angle that the camera sees. This is a region where you can't recognize because it's too close; outside here you can't recognize because it's too far.
So this is the region that's affected by our recognition algorithm. Then, for the binocular camera that we're using, which is just a Bumblebee, the accurate stereo range is actually much longer than the recognition range, so the recognition field of view and the camera depth of field are different. What it means, then, is that we could use saliency to tell us if there's anything interesting in that farther region. So this is literally looking ahead as you go, and it's non-combinatorial.
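The two ranges just described can be pictured as a simple distance test: recognition works only inside a band, while stereo sees much farther, and in that far band saliency can still be computed as look-ahead. The distances here are placeholders I've assumed, not the camera's actual specifications:

```python
# Sketch of the regions discussed: a band where the recognizer works,
# and a farther band where stereo still gives geometry and saliency can
# be used as look-ahead. All distances (in meters) are assumed.

def region_for(distance, rec_near=0.5, rec_far=2.0, stereo_far=8.0):
    if distance < rec_near:
        return "too close"            # inside the recognition near limit
    if distance <= rec_far:
        return "recognize"            # run the recognition algorithm here
    if distance <= stereo_far:
        return "saliency lookahead"   # too far to recognize, check saliency
    return "out of range"

print(region_for(1.0))   # recognize
print(region_for(5.0))   # saliency lookahead
```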
The saliency algorithm we're using is due to my student Neil Bruce: AIM, Attention based on Information Maximization. I can't go into the detail, but I'm happy to say that throughout all of the benchmarks that have been done on saliency algorithms, it always seems to rank consistently near the top, so it seems to have captured something very good. And the example of it is nonfunctional; well, there are a couple of videos in there, but they're not working. Anyway.
So the way this is used is as follows. This is an image; you run the generic AIM saliency algorithm and get that as the regions that are salient. You then take your target, build a color histogram of it, back-project it into that image, and add it to the saliency map. The target in this case is this red cup, the little toy here, but you have some red stuff here as well. So by the time you're finished, you have some hot spots here and here, which are the ones that would be driving it.
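The back-projection step can be sketched in plain Python: quantize colors, build a normalized histogram of the target's pixels, back-project it into the image, and add the result to the saliency map. This is my own illustration of the idea, not the authors' code; the bin count and the simple additive fusion are assumptions:

```python
# Sketch of target color back-projection added to a saliency map.
# Pixels are (r, g, b) tuples; maps are nested lists. The 4-bin
# quantization and the additive combination are assumptions.

def quantize(pixel, bins=4):
    step = 256 // bins
    return tuple(channel // step for channel in pixel)

def backproject(image, target_pixels, bins=4):
    hist = {}
    for px in target_pixels:                     # target color histogram
        k = quantize(px, bins)
        hist[k] = hist.get(k, 0) + 1
    n = float(len(target_pixels))
    # probability that each image pixel's color came from the target
    return [[hist.get(quantize(px, bins), 0) / n for px in row]
            for row in image]

def combine(saliency, backproj):
    # hot spots = generically salient AND target-colored
    return [[s + b for s, b in zip(srow, brow)]
            for srow, brow in zip(saliency, backproj)]

red, blue = (255, 0, 0), (0, 0, 255)
image = [[red, blue], [blue, red]]
hot = combine([[0.5, 0.5], [0.1, 0.1]], backproject(image, [red, red]))
print(hot)  # [[1.5, 0.5], [0.1, 1.1]]
```

The highest cells in `hot` are exactly the "hot spots" that would drive where the robot looks next.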
So in effect, this being the setup that we just looked at, you have the object being in front. I won't go through this in the interest of time, but this is basically the way it works, so I'll move through it quickly.
We did experiments on three separate
portions of our lab again putting
the robot in different positions putting
targets these red rectangles in different
positions in order to test it, and did several tests in each of these different environments, just to see how all of this works.
Here's how one of them looks.
The target is here the robot's
looking up in this direction.
So the green bits here are the stereo
hits that we had before.
But the grey overlay is what saliency tells you. Now, the whiter the saliency bit is, the stronger the saliency of what it's looking at.
So it goes through the whole thing again.
Three actions.
It moves over to here.
And by the time it gets there, it seems that this actually is where the target is; that's the hottest spot in the plane. So it just takes, you know, five views in order to find the item with saliency. I won't show this video because it's boring; it's just the robot moving to do that.
These are the actual test results, and the end result is that you win with saliency, say a couple of meters, a few minutes, and so forth.
So it does seem to give you some benefit
to include that dimension of it and
we even tested it on.
things that seem much more difficult. Will this run? Namely, finding a rock. So that's the rock it's supposed to find, and it's not working; I apologize, I don't know why.
This is the first view it took. That's the saliency map and the view that it sees; that's where it was pointing. And by the third view it found the rock. So it's very fast in terms of finding the saliency for any particular object.
Sensor settings. I might just skip over this and give you the punchline in the interest of time. This is work that Alexander Andreopoulos did, and I think it's very important to understand exactly what the effect of camera imaging parameters on recognition is. That's what Alex did: he built this nice data set.
So the data set is there if anyone is interested in it. We did a comparison against all of these interest-point algorithms in order to see what the effect of changing shutter time and gain in the camera is, for all of those algorithms.
So what you see in this plot is that a check mark means the average precision-recall was over point five for the same illumination as the target. Look at how sparse these things are. What this tells you is that, given any one of these algorithms, you have no idea how it's actually going to perform unless you try it, and even once you try, you don't know how to extend it; you just don't know how it generalizes. You can look at this with even more variability, and you see in these plots, which I realize you can't read, results for different illumination levels; depending on the illumination level, you get wildly different performance from these algorithms.
Interestingly, except for our saliency algorithm, which seems to be the most consistent; I don't know how that happened.
What it really tells you, though, and this is what matters, is that there's a current deficit in the community around large image-dataset comparisons. I don't know how useful they are, and the reason why is that we have the potential that these comparisons are meaningless: unless you know the provenance of an image, you don't know if you can compare them.
You can't take a random data set of a million images, run, you know, a set of classifiers on them, and expect the results to be comparable, if it's true that depending on the light level, the shutter, and the other camera parameters you get different performance.
I just don't know what that means any longer. So I'm thinking that we should be a little bit more careful about creating data sets. I'm not saying that the data-set idea is bad; I'm saying that the creation of the data sets needs to be done in a more principled fashion. Namely, you build these data sets by choosing a set of camera parameters and then being consistent throughout, and then you understand how things perform.
But the point is there. So let me conclude quickly; I'm only five minutes over. I've tried to argue that active vision is a subset of visual attention, and that most common real tasks involve at least one attentive task. So basically, I think that any kind of useful robot would have to include some attentive tasks in it, for whatever it winds up doing. Formal analysis, please note, is critical: it points us to viable solutions because it tells us what things are not possible.
I've shown you a minimalist attentive search algorithm. Basically, this minimalist thing is just greedy search over this probability-distribution space, and it's the kind of algorithm that you can then add a number of different embellishments to, so that, in very much the same way as Brooks developed subsumption, you have an algorithm that works kind of all the time and you can add stuff onto it, other layers of processing. That's exactly what we have here.
So when you don't have any knowledge.
It will still function.
It will be slower, it might have more error, but it will still function, and as you add different components it gets better.
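That minimalist loop might be sketched like this, under assumed interfaces: greedily select the action covering the most probability mass, sense, and on failure rule out the examined cells and renormalize. This is a reconstruction of the idea, not the published algorithm:

```python
# Reconstruction (assumed interfaces) of a greedy attentive search loop:
# pick the action covering the most probability mass under the current
# PDF, sense, and if the target isn't found, zero out the examined
# cells and renormalize before the next iteration.

def greedy_search(pdf, actions, sense, max_steps=50):
    """pdf: dict cell -> probability. actions: dict name -> set of cells
    the action examines. sense(name) -> True if the target is seen."""
    def score(name):                     # probability mass the action covers
        return sum(pdf[c] for c in actions[name])
    for _ in range(max_steps):
        best = max(actions, key=score)
        if sense(best):
            return best                  # target found
        for cell in actions[best]:
            pdf[cell] = 0.0              # examined and ruled out
        total = sum(pdf.values())
        if total == 0.0:
            break                        # nowhere left to look
        for cell in pdf:
            pdf[cell] /= total           # renormalize
    return None

pdf = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
actions = {"left": {0, 1}, "right": {2, 3}}
print(greedy_search(pdf, actions, sense=lambda a: a == "right"))  # right
```

The embellishments described in the talk (hints, saliency, motion costs) would all enter through the PDF or the scoring function, leaving this outer loop unchanged.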
That's what I'm trying to show you here.
I've shown you a demonstration
of differing policies.
And I've shown you that we can use long-range visual saliency as a method of non-combinatorial look-ahead in the search process, and finally that sensor settings do matter.
It's a mostly ignored aspect of vision, and it confounds current interest-point and recognition benchmark strategies, as far as I'm concerned.
And P.S.: I asked at the beginning, was I wrong in my thesis? I think I've concluded that I wasn't. I'm happy to answer any questions. My pleasure.
[Inaudible audience question.]
Yes, I think we're actually looking at a different problem which has the same characteristics.
We have an industrial collaborator
who's building companion robots and
they want them to be operational
in a hospital environment.
So the main stumbling block they have there is that, as this companion robot is going down the hallway,
You have people coming.
You have people in wheelchairs.
You have gurneys with you
know patients on them and
you have to be able to
detect all of those and
avoid them in some cases recognize them
because you need to interact with them.
So we're using exactly
the same sort of method there.
We haven't progressed as far there as I wanted, but I see no major stumbling block to going in that direction.
So I have this weird aspect about my lab: I don't allow my students to use other sensors. It's not because they're not useful; I fully understand that they're useful. But at heart I'm a vision person.
I want to understand how we do it visually
because you and I can do it visually.
So that's why I kind of
make that restriction.
But in general, if you add other sensors that provide a piece of information that can be useful, by all means. It's just that in my lab, you know, people are stuck, and they have to just use visible light.
Yes.
So, with respect to stereo, it's not terribly sensitive, other than the fact that the robot can get lost. We haven't, in these implementations, built in any sort of path planner or mapping system or anything like that; that's something to add on.
So that's the stereo. The recognizer is a more interesting thing, because one of the more important problems that we have not looked at is this: we always show it one view. When you can't get to that view, can you do multi-view recognition? We're currently looking at that as well, but our solution to this point is that we use a family of recognizers, so that we can put one in and take one out. And when we've looked at their sensitivity with respect to the angle of presentation of this one view, they all seem to have, you know, reasonable performance within maybe twenty degrees of that rotation, and the same thing with occlusion. So it does handle some variability there, but in the general case we do have to look at the multi-viewpoint issue.
So, on the object question: I think you're absolutely right, because the NP-hardness looks at the very general case. An object could be, for example, these five people that I'm looking at, if I call them an object, because they're all separate, right? So it's looking at a very general case; that's right, that's true.
No, no, they're not looking at a single point. No, the salient points simply tell you where in the probability density function to increase the value, so that you look there sooner rather than later.