Thank you. Can you all hear me? Yes? All right, thank you, James, for that very kind introduction. I'll just get started. What I am interested in is semantic image understanding. Ideally, I'd like for machines to be able to take images like this and detect and recognize all the objects, tell us whether the image was taken indoors or outdoors, where it was taken, whether there are any people in it, what they are doing, how they are interacting with each other, and, in the limit, answer any question about the image that we would expect a person to be able to answer.

The title of my talk is Words, Pictures, and Common Sense, so I thought I'd start by talking about why I think this intersection of words and pictures, or language and vision, is interesting. The first motivation is that pictures are everywhere; we are surrounded by visual content, and words are how we communicate. So for a variety of applications it seems like being able to link this visual content with language, with words, would be very useful. For example, any time you want to interact with, organize, sift through, or navigate through these typically large quantities of unstructured visual data, it seems like if you have a link between this visual data and words, then these tasks might become easier. There's a lot of multimodal information on the web: when you have images, when you have videos, you typically have some text that goes along with them, and if you have a good way of linking these words to the visual content, then you might be able to leverage this inherently multimodal nature of the information around us and learn from it. If you think of visually impaired users, again, if you have a link between visual content and words, then words or language can substitute for that visual content and make it accessible to users when it may not have been if it was purely visual. The same thing applies to analysts: if a lot of visual data is thrown at them, it's hard for them to sift through all of it and extract the information that they need, but if you can summarize this information through words, then it becomes much more accessible. So one motivation behind words and pictures is that it opens up a lot of these applications.

The other motivation is that building this link gives us a way of demonstrating certain AI capabilities. If I can take an image and describe it well, let's say in a sentence, then that could be one way in which I have demonstrated that my machine has understood this image. And the same the other way around: if I can take a sentence with some associated visual content, and if I can ground that language in the visual content, then that could be one way of demonstrating that I have understood what the language is saying. There can be philosophical debates about this, but I think there is some value to this perspective. And the other thing,
especially from a computer vision perspective, is that language is rich and compositional. We're so used to thinking of vision tasks as bucketed recognition: if I'm doing object detection, I take boxes and place them into one of twenty different categories; if I'm doing segmentation, I take a pixel and place it into one of many different categories; if I'm doing classification, I take an image and place it into one of, let's say, a thousand categories. We're used to placing everything into one of several discrete bins, but because language is so rich and compositional, you can't afford to create bins for every single semantic concept that you might care about. That forces us to think beyond bucketed recognition and build models that can deal with the rich, compositional nature of language. So I think trying to map vision, or images, or visual content to language helps us push the frontiers of computer vision.

And the last motivation is that a lot of progress in AI has typically happened in separate fields: there's a lot of progress in computer vision, there's a lot of progress in natural language processing, and so it seems like it could be fun and interesting and useful to think of problems at the intersections of these different fields. Taking steps towards these AI-complete problems, if you will, can help us push the envelope of what AI can do today. So those are my main motivations behind looking at problems at the intersection of images, or vision, and language.

In my abstract I perhaps overambitiously promised four different things, and I will attempt to go through them. Visual question answering I will talk about just for a couple of minutes, because I'm giving a talk tomorrow where we'll get into it a little bit more. I already met someone earlier today who was hoping I would talk about VQA, and for others in the audience like that, I apologize, but you can get some of it tomorrow. All right, so let me start with the first topic.

In this vision and language space, obviously, I'm not the only one who is excited; there's a lot of interesting work going on at this intersection. One problem in particular that has received a lot of attention is image captioning, where the task is to take an image and describe it in one sentence. For those of you following computer vision: just at CVPR last year there were, I think, eight papers all on image captioning, all at once, and there was a whole session on vision and language, most of which was image captioning work. There was even a New York Times article, perhaps a little prematurely, on image captioning abilities about a year ago. So now we have machines that can take images like this and automatically generate descriptions that say something like "a man in a blue wetsuit is surfing on a wave," or "a group of young people playing a game of frisbee," or "a park in the middle of nowhere" — which sounds very human; "middle of nowhere" is an interesting phrase to be picking up on — or "a pot of broccoli on a stove," and so on. So we have all these techniques that are doing automatic image description, and especially last year, when all these papers came out in parallel,
there wasn't a good way of evaluating how well these techniques were working. We all know, from computer vision and other fields, that good evaluation protocols and benchmarks tend to really drive progress, especially in the early stages when a lot of people are excited about a problem. We've seen this for low-level vision tasks, we've seen it for mid-level tasks like segmentation, and for high-level semantic tasks like detecting and segmenting objects. But for image description there wasn't a good way of evaluating.

Evaluation has been problematic for image description, and I believe for a lot of language generation tasks in general, for a variety of reasons. One is that if you use automatic metrics, people have found over and over again that they don't usually correlate well with human judgment, and that's not very satisfying. The other option is human evaluation, but the problem with that is it's typically expensive — you have to pay someone to do the evaluation — and it's not easy to reproduce across labs: people might use different interfaces to collect these judgments, and then it's not clear whether the numbers are comparable. Oftentimes people will measure different aspects of the language — is it fluent, is it accurate, and so on — but then there is no good way of bringing all these different factors together into one number that you can use as your metric. And finally, one particularly interesting thing that we realized while we were working on this is the following. You might say, OK, let's just ask about overall quality: let's just ask people how much they like a certain caption, or a certain generated piece of language. What we found is that there's actually a difference between what humans like and what is human-like.

Let me give you an example. If I show you this image and I show you two descriptions — the first is "an owl sitting in a tree" and the second is "a multicolored owl with black, white, and camel colored feathers is looking to the left of the camera" — and I ask you which description you like, which description you think is better, many of you might think the second description is better. We've done this on Mechanical Turk, and people tend to like the second description; that is what humans like. But if I showed you this image and asked you to write a description, most of you would just have said "an owl sitting in a tree." That is the more human-like description. And we did do this: we asked forty different people to describe this image, and if you read through the responses — "an owl sitting in a tree," "an owl is sitting on a branch," "an owl is looking towards the camera," "an owl in a bush in front of a tree," and on and on. So that becomes an issue: when you're collecting your ground truth, your data looks like this, but when you're asking people to evaluate, they are preferring the second kind of description, so there's a mismatch between the ground truth that you're learning from and what you're holding up as the gold standard when you evaluate. That's problematic.

So what we argued
was: if the goal of image captioning is to describe an image in a human-like fashion, then let's just directly measure that. Let's directly measure how human-like a caption is — that is, for a candidate caption, how well does it match how several different people describe this image? That's when we introduced this idea of consensus-based image description evaluation, where we check, for a candidate description, for a candidate caption, how well it matches how different people describe the image. I won't go into all the details, but we introduced a new annotation modality so that you can measure this consensus through humans, and we introduced new datasets — in order to measure consensus, you need enough people describing the image; only then can you get a reliable measure of what the consensus is. So we introduced datasets where we have images with fifty different sentences describing each image. And then we have an automatic metric that also tries to measure this consensus, as opposed to the human annotation I just mentioned. This metric is called CIDEr: Consensus-based Image Description Evaluation.

Let me show you what it can do. We're back to that image of the owl. There's a metric called BLEU, which is precision-based and, depending on how you implement it, tends to have a bias towards shorter descriptions. So BLEU — an automatic evaluation metric — thinks that a really good description of this image is just "owl." Then there's another metric called ROUGE, which is recall-based and tends to have a bias towards longer descriptions, and it thinks that the long description I gave you at the beginning is a better description of this image. And then CIDEr, which is just checking how good a match a caption is to how people on average tend to describe this image, picks "an owl sitting in a tree" as a good description.

Here is another example. BLEU says "a cat in a tree" is a good description. ROUGE says "a cat is sitting in the branches of a very skinny tree." And CIDEr says "a cat stuck in a tree" is a good description. Which is the most human-like description? It is the one CIDEr picks: if you look at the forty-eight people describing this image — "a cat stuck in a tree," "cat stuck in a tree," "there is a cat stuck in a tree" — a lot of people describe this image in that fashion, and that's what CIDEr ends up picking up on. And just to convince you that CIDEr isn't simply going by some medium length that is not too long and not too short, here's another image. BLEU says "a man." ROUGE says "a man sitting with his hands together with two plastic bottles sitting in front of him." And CIDEr in this case picks a longer description as well: "a bored man with glasses with two empty bottles in front of him." The reason it goes for this sort of longish description is that when we asked multiple people to describe the image, many of them chose to give this sort of longish description. That's the consensus, and that's what CIDEr rewards: it very explicitly goes by how people describe the image.
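To make the consensus idea concrete, here is a minimal Python sketch of consensus scoring. It is not the actual CIDEr implementation (CIDEr uses TF-IDF weighted n-gram vectors and a length penalty in its CIDEr-D variant); it simply averages an n-gram cosine similarity between a candidate caption and every human reference, which is the spirit of the metric. The function names and toy references are illustrative.

```python
from collections import Counter
from math import sqrt

def ngrams(sentence, n):
    """Lowercased n-gram counts for a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(c1, c2):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(c1[g] * c2[g] for g in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def consensus_score(candidate, references, n_max=4):
    """Average similarity of the candidate to *all* human references,
    averaged over n-gram orders 1..n_max (CIDEr-like, without TF-IDF)."""
    score = 0.0
    for n in range(1, n_max + 1):
        cand = ngrams(candidate, n)
        sims = [cosine(cand, ngrams(ref, n)) for ref in references]
        score += sum(sims) / len(sims)
    return score / n_max

# The candidate that matches how most people describe the image scores highest.
refs = ["an owl sitting in a tree", "an owl is sitting on a branch",
        "an owl perched in front of a tree"]
print(consensus_score("an owl sitting in a tree", refs))
print(consensus_score("a multicolored owl with black and white feathers", refs))
```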
Question? Right, that's a great question, and it points to an issue with the problem of image captioning in general: the task is always defined as "here's an image, describe it." Describe it for what purpose is never specified. I'll make this connection again at the end of my talk — this is part of what motivated visual question answering, which is very task-driven: here's a question about the image, and your goal in understanding the image is to answer that question. So yes, that's an issue with this task. As far as I know, people always just say "describe this image in a sentence," and the purpose is never specified.

Another question? So, about quality control: this is all done on Amazon Mechanical Turk, in my case and for other datasets people have collected, for example MS COCO. There is quality control in that you only accept workers who've done enough good work in the past, so you're eliminating malicious workers who might be intentionally lazy and just trying to make money off of you. But you're not providing any feedback in terms of "these are the kinds of sentences we're looking for." You do tell them basic things: provide one entire sentence, avoid spelling mistakes, make sure the sentence is fluent and grammatically correct. You give them those sorts of basic instructions so that they're not producing gibberish, but not much beyond that. Right. Any other questions?

All right. We can also quantitatively show that CIDEr does better, but in the interest of time I'll skip that; I'm happy to talk about the details later if you'd like. There was an image captioning challenge on the Microsoft COCO dataset, which James was involved in building, and CIDEr was one of the metrics there. But I should say that it's still far from being anywhere close to human evaluation: people have found that in these competitions, machine-generated captions outperform human-generated captions by a large margin according to these automatic metrics. That doesn't say much about how good the captions are; it says more about how bad these evaluation metrics are. That's all I'll say about that.

So this was about the consensus — the average behavior of descriptions: you ask many people to describe an image, and how do they describe it on average? What I want to talk about next is more about the variance, and the signal in how these descriptions vary. That's this idea of image specificity. The idea is: if I show you this image and I ask you to describe it,
the kinds of descriptions I would get — these are four descriptions that four different people gave us — are: "people lined up in a terminal," "people lined up at a train station," "long line at the station," "people waiting for a train outside the station." That's fairly consistent, and these are the kinds of things you would like to see to believe that image captioning is a well-defined problem that a machine can hope to achieve. If people are not consistent, then you might wonder what it is you're even trying to reproduce. But if you have an image like this and you ask four different people to describe it, one person might say "alleyway in a small town," another says "people sitting and walking in a shopping area," a third says "man walking in a shopping area with others selling products," and the fourth says "sunbeam shining through the skylight." Depending on your background, whether you recognize this place or not, you might say something quite different.

So the points I'm trying to make are these. One, multiple descriptions of the same image can vary quite a bit, and this is something I don't think a lot of people have explicitly acknowledged. Two, this amount of variance varies across images: some images are ambiguous and not specific, so the variance is high, but some images are very specific, where multiple people describing the same image will produce more or less the same description. And the third, and most important, point is that we think this image specificity need not be thought of just as noise. This variance in descriptions, and the fact that it differs across images, is actually a characteristic of the image that could be a useful signal to exploit when we work on problems at the intersection of vision and language.

One application where we tried to exploit this signal — the fact that some images are described the exact same way by everyone, while other images are described differently by different people — is text-based image retrieval. The setup is: you have some database of images, and each image has been described with one caption, so there is one caption per image. Now there's a user who has a certain target image in mind that he or she is trying to retrieve from this database, and what they issue is a query, which is a sentence describing the image they have in mind. What you're going to do is match the query sentence to the captions in the database and sort them, and you would hope that the image the user was looking for is ranked high. So this is text-based image retrieval: you are retrieving images, but based on matching text. Our idea for introducing specificity into this application was the following. The intuition was: if an image is specific, I know that multiple people tend to describe it in the same way, so for a specific image I expect that the query and the caption I have in my database should be a really good match, because there aren't many different ways of describing this image.
So for a specific image, I should regard it as a relevant result only if there's a really good match between the query and the caption in my database, because I know the image is specific and there aren't many different ways of describing it. On the other hand, if an image is ambiguous, I know there are many different ways of describing it, so even if my query doesn't match the database caption all that well, it may still in fact be a relevant result, just because the image is ambiguous. Even a moderate similarity between the query and the database caption might be sufficient to call that image a relevant result. So the idea is that the similarity between the query and the database caption should be modulated based on how specific the image is: the same similarity value means two different things depending on whether the image is specific or ambiguous.

Yes? So I wasn't going to go into those details, but what we do is take multiple descriptions from people, compute the pairwise similarity between all those descriptions, and average that. The similarity can be human-annotated, or it can be computed using automatic measures like word mover's distance in word2vec space or some other embedding space, or BLEU based on n-grams — anything you want.

And here is what we can show for retrieval. This is the baseline approach that does not modulate based on specificity. In order to estimate specificity we need multiple sentences describing each image — that's what I'm calling training sentences — but if you're not reasoning about specificity you don't need those sentences, so the baseline performance is flat. And this is the performance you get if you reason about image specificity; in this case lower is better, because this is the rank of the target image you cared about. What we're seeing here, on two different datasets shown in red and blue, is that we need about eight sentences per image in one case, or seventeen in the other, to estimate image specificity accurately enough for it to actually help. At this point you should all be saying: how does this make any sense? I'm essentially saying that every image in my database should have seventeen different descriptions just so we know how much the variance is, and that's not practical. So what we've shown in the paper is that you can in fact predict image specificity automatically using just image features: for a database of images, I first compute image specificity using multiple descriptions, train a regressor that predicts the specificity value from image features alone, and then use the predicted specificity for retrieval. Any questions here?

Yes — deep learning features? So yes, we've tried a variety of features, and the deep learning features work the best. They carry some semantic information: they know what kinds of objects are present and they have a general sense of the layout, so they work best. But it's possible that if you fine-tune these networks for the task of predicting image specificity, they might work even better.
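As a rough sketch of what was just described, here is how image specificity can be computed from multiple human descriptions and then used to modulate a retrieval score. The Jaccard word-overlap similarity and the division-by-specificity normalization are illustrative placeholders, not the measures used in the actual work, which explored human judgments, embedding-based similarities, and a learned mapping.

```python
from itertools import combinations

def sentence_sim(a, b):
    """Placeholder similarity: word-overlap Jaccard between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def specificity(descriptions):
    """Average pairwise similarity among several human descriptions of the
    same image: high value = specific image, low value = ambiguous image.
    Assumes at least two descriptions are available."""
    pairs = list(combinations(descriptions, 2))
    return sum(sentence_sim(a, b) for a, b in pairs) / len(pairs)

def retrieve(query, database):
    """database: list of (image_id, caption, specificity) triples.
    Modulate the query-caption match by specificity: an ambiguous image
    (low specificity) needs less raw similarity to be ranked as relevant."""
    scored = [(sentence_sim(query, cap) / max(spec, 1e-6), img)
              for img, cap, spec in database]
    return sorted(scored, reverse=True)  # highest modulated score first
```

In practice, the specificity value in the database would come from the regressor trained on image features, so no extra sentences are needed at retrieval time.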
Yes? No, so that part is a bit misleading; I should have dropped those points off. Those two points are essentially semi-cheating, where the last sentence that came in was the sentence that, at test time when you're doing retrieval, happens to be the database caption. In image retrieval the train and test split is not really clean, and we just wanted to see how much you would benefit if you knew that this caption is the one in the database while estimating specificity. So those points were just coming from that. Yes?

Yes — that's right. We'd have to go back and check whether we varied this, but this is the same dataset I introduced for the evaluation, so we had fifty sentences per image; that's the most we have. We'd have to go back and check how performance degrades if we used fewer and fewer of those for estimating specificity while training the regressors. I see — so we had about a thousand images at most; it wasn't that large. It works reasonably well; it's nothing groundbreaking, but it generalizes well enough. Any other questions?

All right, so this is going to be the main chunk of my talk: this idea of learning common sense. Going back to the scenario with captions, I mentioned that we now have machines that can take images like this and generate a description that says something like "construction workers in orange safety vests are working on the road." This is great, it's exciting. But take an image like this: if a person describes it, someone might say "a man was rescued from his truck that is hanging dangerously from a bridge." If you think about what it would take to produce words like "rescued" and "dangerously," there isn't anything specific about this image, any particular detail, that directly corresponds to the word "rescued" or the word "dangerous." When we look at this image, we have a sense of what happened before and what is likely to happen next, and based on that we say things like "rescued" and "dangerous." That involves some common sense knowledge about how the world functions, which allows us to say something like this, and which, as far as I know, current machines don't have.

So if we really wanted to learn this common sense, what might we do? One solution a lot of people have pursued is to learn it from text: we have lots and lots of text on the web, and we can mine it to learn a whole bunch of common sense. But the issue with text is that there's a reporting bias. Text is not an unbiased representation of the world; it's an artifact that humans generated, and humans talk about things they like talking about. Humans like mentioning things that are interesting, which often tend to be uncommon, and it's not interesting to talk about things that are common. Here is one example: these are the number of times different words occur in a certain text corpus. If I use these word frequencies as an estimate of how often these things happen in the world, it would turn out that we inhale six times as often as we exhale — and in fact that we get murdered seventeen times as often as we exhale.
If you look at different body parts being mentioned in these text corpora, we have heads one thousand and eighty-five times as often as we have gall bladders. Gall bladders are just not interesting to talk about. So there is this well-known reporting bias in text: you can't use frequencies in text as an estimate of how often something happens in the world. Here is another example. The question is: do birds fly? If I look at the definition of a bird — "any warm-blooded, egg-laying vertebrate of the class Aves, characterized by a body covering of feathers and forelimbs modified as wings" — and then I look at penguins, the very first thing it says is "any of several flightless aquatic birds." If I just read this, I would think birds are flightless. But the whole point is that it's worth mentioning that penguins are flightless precisely because birds typically fly. There needs to be this added level of reasoning for us to take a definition like that and realize that because it calls out penguins as flightless, it must mean that birds typically fly — and even this logic doesn't always hold. So the point is that it's hard to learn common sense from text alone; there's a lot of useful signal in it, but it doesn't have everything.

So the thought then is: well, we have this visual world around us that has a lot of structure in it — can we just exploit that? It is a more unbiased sampling of what happens in the world around us, and the structure is certainly there. If I give you an image like this and you had to describe it, you might say "two professors conversing in front of a blackboard." Now if I change this image slightly, you would probably change the description to "two professors stand in front of a blackboard." So the gaze structure is important to whether two people are conversing or not. There is a lot of structure that we're exploiting — but how do we get a machine to learn from it? How do I get a machine to realize that just flipping the gaze can change the semantic meaning of an interaction between people?

There are a few reasons why this is hard to do. One is the lack of visual density: I don't have pairs of real images that change ever so slightly such that the meaning also changes, so I don't have that fine-grained signal to learn from. The second issue is that annotations are expensive: if I wanted to learn that gaze is relevant to whether people are conversing or not, I would need gaze annotated, expressions annotated, and a whole bunch of other things, which is very expensive to collect. You might say, well, if annotations are expensive, let's just use computer vision techniques to estimate gaze and expressions and poses and what people are wearing and so on — but that becomes a chicken-and-egg problem. I'm saying that I want to learn common sense so that I can do computer vision better, so that I can understand images better, and yet I'm now saying that in order to learn this common sense, I first need to solve computer vision. So our attempt at breaking out of this chicken-and-egg problem was to question whether photorealism is even necessary to learn common sense.
Our thought was that the common sense lies in the semantics of the images, not in the actual pixel values. So we can give up on photorealism, focus just on the semantics, and hopefully learn common sense from that. With that in mind, we introduced these two characters, Mike and Jenny, and they live in a world full of toys and animals and all sorts of things they can do in the park, with rain and clouds and lightning and trees. Mike and Jenny can have different expressions and different poses.

Now we can do a lot of fun things. We can go to Amazon Mechanical Turk, show workers this interface, and ask them to create the abstract but semantically rich scenes you see here. They can move objects around at three different depths, they can flip them around, and with just a handful of objects — I think there's something like six objects in this scene — there is a story you can tell; it is a semantically rich scene. With this you can do things you could never do with real images. I can give people a description — something like "Mike gets rid of the bear by giving him a hot dog while Jenny runs away" — and ask six different people to create a scene that depicts it. All of these scenes look very different: they have different objects present or absent, but what's consistent across them is that each has Mike, each has Jenny, there's a bear, Mike is facing the bear holding a hot dog towards it, and Jenny is running off in the opposite direction. Where exactly Mike and Jenny are, all of that changes, but the core semantic visual features remain intact. If you tried to collect six real images that are this different yet all carry the same semantic meaning, that would be really hard to do.

So we have this dataset of a thousand semantic classes, where there are ten scenes that all correspond to the same language description. One thing to note is that the visual features are all annotated — it's a fully annotated scene, because it's an abstract scene, a cartoon scene, so the annotations come for free, including gaze, expressions, poses, object presence, the three levels of depth, and so on. Now we can begin to ask interesting questions like: which visual features are important to the semantic meaning? Is it location, relative location, gaze, expression? Which words correlate with which semantic visual features?

From this dataset we can do things like take as input a description that says "Jenny is catching the ball. Mike is kicking the ball. The table is next to the tree," do some basic natural language processing to extract tuples of the form (primary object, relation, secondary object), and automatically generate an abstract scene, a cartoon scene, that depicts this meaning. Just for reference, this was the ground-truth scene a human had generated. Again, there are a lot of differences — the ball itself has changed — but the meaning remains the same. The reason we lost the tree is that the language processing dropped that tuple, and so the generated image doesn't have a tree.
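To illustrate the shape of that pipeline, here is a toy sketch: parse a description into (primary, relation, secondary) tuples, then lay out clip-art pieces. The hard-coded relation list and the placement rule below are only placeholders; in the actual work, what to place where was learned from the annotated scenes rather than scripted.

```python
# Toy tuple extraction and scene layout; relation phrases and coordinates
# are hypothetical placeholders, not the learned model from the paper.
RELATIONS = ["is catching", "is kicking", "is next to"]

def extract_tuples(sentence):
    """Very crude tuple extraction: split on a known relation phrase."""
    s = sentence.lower().strip(". ")
    for rel in RELATIONS:
        if rel in s:
            primary, secondary = s.split(rel)
            return [(primary.strip(), rel, secondary.strip())]
    return []

def render_scene(tuples):
    """Map each tuple to clip-art pieces with coarse (x, y, depth) slots."""
    scene = []
    for i, (primary, rel, secondary) in enumerate(tuples):
        scene.append({"clipart": primary, "x": 100 + 200 * i, "y": 250, "depth": 1})
        scene.append({"clipart": secondary, "x": 220 + 200 * i, "y": 260, "depth": 1})
    return scene

print(render_scene(extract_tuples("Jenny is catching the ball.")))
```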
We then decided to focus on a very narrow domain of two people interacting. We had an interface with posable people whose limbs you can freely move, and we asked workers to create a scene where, say, someone is dancing with another person. At first it looks like one person is hitting the other person on the head, but if you wait a second, a change in expression changes the meaning. So we collected data for "dancing with," "walking with," "walking away from," "talking to," "arguing with" — very fine-grained distinctions between interactions between just two people, with no other objects and no other context — and we tried to see if we could learn these differences. These are some examples that Mechanical Turk workers created, and I was quite impressed with what they did; some of these are fairly elaborate, especially if you look at the dancing poses.

These are images from the web corresponding to the same interactions, and laid out like this, it's very tempting to ask: if I train on the abstract data, can I test on real images and detect these interactions? That's exactly what we did. Obviously there's a huge domain shift, so we don't expect this to work all that accurately — there is no 3D; it's just one plane that people can move around in — but we tried it anyway. This is chance performance on these sixty categories, and this is the performance you get if we assume that in real images the poses have been accurately detected, because that particular obstacle is not what we were interested in; we wanted to know whether the semantics carry over. Just for completeness, this is what we got with an actual pose detector, which at this point is from two years ago — this was 2014, and I'm sure there are much more accurate ones now, especially with all the deep learning techniques. So you can transfer from this completely abstract world to real images without anything terribly sophisticated.

But I started all of this with common sense and then digressed a little, so let me come back to common sense. One task we considered, to assess whether we can learn common sense from these abstract scenes or not, is assessing the plausibility of a relation. If I give you a tuple that says "man lifts meal," the machine's job is to assess whether this can happen in the real world or not. Another example is an obviously implausible tuple, where the machine should hopefully decide that no, this is not very plausible. Our thinking behind how to do this was the following: if a tuple, a relation, is similar to other relations that I know are plausible, then I conclude that this relation is plausible; but if it is not similar to other relations that I know are plausible, then it's probably not plausible.
For example, if someone has already told me that "person holds sandwich" is plausible and that "man eats pizza" is plausible, then if I compute the similarity between "man holds meal" and these two tuples, I will probably find a good similarity, and because I know those two are plausible, I'll conclude that this relation is plausible as well. That is the underlying approach we decided to follow. Our other insight was that in addition to reasoning about similarity in text, we should also reason about the similarity between these relations in the visual world: if I were to imagine a man holding a meal, a person holding a sandwich, a man eating pizza, and in this imagination these things tend to be similar, then that should also count for something, even if the textual similarity isn't all that high.

Let me give you an example. Here are two tuples: "boy stared at man" and "boy looked at man" — both of these would be similar even in text. Here are two other pairs of tuples that would be similar in text; that's great. But here's a pair of tuples — in particular, focus on the relations, "stands over" and "reaches for" — if I compute, say, a word-vector representation of "stands over" and "reaches for" and measure the similarity, I probably won't get a very high match. But if I look at a visual instantiation of these, both can in fact have a very similar instantiation: the same scene you might describe as a dean standing over a graduate, you could also describe as a man reaching for a jacket. Because they can both have a similar visual instantiation, my claim is that we should consider these tuples to be similar, even though textually they may not be. Does that make sense?

So we did exactly that. We have a collection of abstract scenes and a collection of tuples that describe them, and now, given any new tuple provided as input, I can compute its similarity to all these text tuples, and I can compute its similarity to the visual instantiations of these tuples, and we combine those two similarities to assess whether the new tuple is plausible or not. So we use both vision and language to assess the plausibility of a text-only tuple provided as input — and note that this is a purely common-sense task posed in text; there is no image as input.

If we look at accuracy — this is average precision; for these tuples we have annotations from humans about whether they are plausible or not, and our system outputs a plausibility score, so we can compute these two metrics, and higher is better — this is what you get using text alone (it's a text-only task, so it makes sense to use only text), this is what you get if you completely ignore the text and only look at the vision side, and this is what you get if you combine both: you see an improvement in performance on this common-sense task of assessing the plausibility of relations.

Here's another task that seems to be purely text-based but can leverage common sense: a fill-in-the-blank task. "Mike is having lunch when he sees a bear." And then, blank:
does Mike order something to eat? Does Mike hug the cub? Does Mike remember that bears are mammals? Or does Mike try to hide? Hopefully most people in this room would agree that the most plausible answer is the last one: Mike tries to hide. How did we arrive at that answer? Well, we reason about the fact that people — kids in particular — are scared of wild animals, and when you're scared of something you either run away, which you shouldn't do with bears, I think, depending on the kind of bear, or you try to hide. With that common sense, we decide that this is the most likely option.

Another task is visual paraphrasing. The question is: can these two descriptions be describing the same scene? "Jenny is going to throw her pie at Mike" — that's one description. The second description is "Jenny is very angry, and Jenny is holding her pie." Again, I think most of us would agree that it is quite plausible that these two are describing the same scenario, and the reason we think that is we know that when people are angry they might throw things, and in order to throw something you have to be holding it.

Our argument in trying to solve these tasks is the following: instead of mining through a whole bunch of text that tells you that when you're angry you throw things, and that in order to throw something you have to be holding it, we can just imagine the scene corresponding to the description. If I imagine someone throwing something, I imagine them holding it. If I go back to this latent representation, if you will, of the visual signal that lies behind the text, and reason about similarity in that visual space, I might do better than reasoning about the text alone.

That's exactly what we did. If you have a fill-in-the-blank question with four options, we plug each option into the description, which gives us four candidate descriptions. I already told you that we had a way, from a couple of years ago, of generating a scene that corresponds to a given description, and recall that this imagination doesn't need to be photorealistic — it just needs to capture the semantics — so we can imagine it in the abstract world. Now we can extract features from the text and from these generated scenes to reason about which of the four options is most likely. For visual paraphrasing, you have two descriptions; the baseline approach would just extract text features from both descriptions and reason about some form of similarity to decide whether they describe the same scene or not, but instead we can generate the scenes corresponding to each description and add those visual features into the decision.

Just like with the plausibility of tuples, we see similar gains in performance. This is our performance on these two tasks — higher is better — with text alone, vision alone, and then text and vision combined, and there is an improvement. So we've used visual abstraction for a variety of things: studying mappings between images and text; learning concepts about the world, like those interactions between people, without using any real images, just the visual abstractions; and we've looked at other properties as well. Overall, we're studying these high-level image understanding tasks without waiting for the low-level computer vision problems to be solved.
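The way the imagined scenes enter these models can be summarized in a few lines. Below is a hedged sketch: a candidate statement is scored by its similarity, in both text space and imagined-scene space, to things already known to be plausible. The feature vectors, the max-similarity aggregation, and the mixing weight alpha are stand-ins; the actual work used richer features (tuple occurrences, object presence, pose, expression, relative location) and learned how to combine the two signals.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def plausibility(option_text_vec, option_scene_vec,
                 known_text_vecs, known_scene_vecs, alpha=0.5):
    """Score an option by its closest match to relations/scenarios known to be
    plausible, combining similarity in text space with similarity between the
    *imagined* (generated) abstract scenes. alpha is a hypothetical weight."""
    text_sim = max(cosine(option_text_vec, v) for v in known_text_vecs)
    scene_sim = max(cosine(option_scene_vec, v) for v in known_scene_vecs)
    return alpha * text_sim + (1 - alpha) * scene_sim
```

For fill-in-the-blank, each candidate completion is plugged into the story, a scene is generated for it with the earlier pipeline, and the highest-scoring option is chosen; for visual paraphrasing, the same kind of text-plus-scene features feed a classifier that decides whether the two descriptions match.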
We've used it for learning common sense knowledge, like we just talked about, and overall what I find exciting about these abstract scenes is that they are a very rich annotation modality. You can show a person an abstract scene and ask them to describe it; show them a description and ask them to create a scene that corresponds to it; show them a scene and ask them to modify it so the description changes in a certain way; or perturb a scene and ask them how the description changes — and all sorts of other things which you cannot do with real images, because it's hard to manipulate photorealistic image data. And just as a teaser, we have a new dataset of abstract scenes with much more realistic-looking clip art, many different people — not just Mike and Jenny — of different genders, ages, and races, indoors and outdoors, with many more objects. It is online if anyone is interested.

I want to wrap up with a teaser about visual question answering; as I said, I'll present more details tomorrow. The idea, coming back to your earlier question, is that image captions tend to be very generic. If I take an image like this and a machine spits out a description that says "a giraffe standing in the grass next to a tree," we would think this is amazing — it recognized the giraffe, the grass, the tree. But it turns out that if you look at the MS COCO dataset, for instance, and take a random sampling of all images that have giraffes, this description fits most of them just fine. So it's very plausible that the only thing the machine did was recognize the giraffe, and the language model spit out the rest. It's not clear that these models are really understanding these images as well as it might seem: a coarse understanding of the image followed by a simple language model can produce deceptively good image captions. The other issue is that it's very passive: if there is some particular piece of information that I want from this image, I as a user have no way of asking for it; the system will just put out a one-sentence generic description.

So what we're excited about is this task of visual question answering: given an image and a natural language, open-ended question — any question about the image — the machine's task is to answer that question in natural language. You might have an image like this: "What is the mustache made of?" You might have an image like this: "How many slices of pizza are there?" — that's counting objects, which detection is good at — but then, "Is this a vegetarian pizza?" You can imagine you might need access to some knowledge base that says if you detect meat in the image, then it's probably not vegetarian. "Is this person expecting company?" — that takes a lot of social common sense: there are extra glasses set out, so yes, this person is probably expecting company. "Does this person have 20/20 vision?" — to answer that, you need to know that 20/20 vision means good eyesight, and a glasses detector alone is not enough; you would need some knowledge base or common sense knowledge. So what we're working on is building models that can do essentially this: you can upload an image and type in a question, and the system will hopefully produce accurate answers to these questions.
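For the curious, the kind of model this usually comes down to can be sketched quickly: encode the image with a CNN, encode the question with a recurrent network, fuse the two, and classify over a fixed set of frequent answers. This is a generic baseline sketch, not the exact model behind the demo; the layer sizes and answer vocabulary below are placeholders.

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """Minimal sketch of a common VQA baseline: precomputed CNN image
    features + LSTM question encoding, fused by element-wise multiplication,
    then a classifier over frequent answers."""
    def __init__(self, vocab_size, num_answers, img_dim=4096, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers))

    def forward(self, img_feats, question_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))  # final hidden state
        q = h[-1]                                           # (batch, hidden)
        v = torch.tanh(self.img_proj(img_feats))            # (batch, hidden)
        return self.classifier(q * v)                       # answer scores

# Dummy usage: a batch of 2 precomputed image features and 2 tokenized questions.
model = VQABaseline(vocab_size=10000, num_answers=1000)
scores = model(torch.randn(2, 4096), torch.randint(0, 10000, (2, 8)))
print(scores.shape)  # torch.Size([2, 1000])
```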
The nice thing is that these questions can often get at details that might be in the background of the image; they get at a lot of detail. It's task-driven: you're understanding the image for the purpose of answering a specific question about it; it's not generic. And what we like is that it brings together words, pictures, and common sense: you need to know a lot about the world to answer some of these questions. What we have so far is a dataset that is publicly available now. It has over 250,000 images, which include more than 200,000 real images from the Microsoft COCO dataset and the 50,000 abstract scenes I talked about. Each image has three questions associated with it that we collected on Mechanical Turk, and each question has ten answers provided by humans. So it's over three-quarters of a million questions and over ten million answers, and depending on how much interest there is in the community, we might even grow it over the years. That was a lot of Mechanical Turk work — over ten thousand workers were involved — and these are some of the fun statistics. We have an evaluation server set up: you can upload your results on our dataset today and see where you stand on the leaderboard, and we're going to organize a challenge at CVPR next summer, CVPR 2016. If you're interested, you can find out more on this webpage; the dataset, the paper, and everything else is there.

All right, so to summarize. I said I'm interested in semantic image understanding, and I think a lot of work in computer vision traditionally has focused on the image part of the equation, where we take these images, these pixels, and try to make decisions or reason about them. But I think equal members of this equation are the semantics — and words and language are a convenient way of conveying the semantics — and the understanding part — the common sense and the knowledge-based reasoning — and I think all of these need to work together to get to the holy grail of semantic image understanding: answering any question about an image. I talked about a few different things, but if you had to leave with the two main things I am excited about: one is teaching machines common sense, and especially the use of visual abstractions to do that — that's something I think is a lot of fun — and the second is finding tasks that let us truly measure semantic image understanding as a whole, and I think visual question answering is a nice step towards that, and we have a dataset that goes with it. I will stop there. Thank you.

Question? Yes. All right, so there were a few different tasks that I talked about; I think your question applies most to visual paraphrasing, where we have two descriptions and the task is to figure out whether they match. To answer your question: no, we have not given special preference to the objects in the scene that were mentioned in the description. That could be interesting. We're learning a classifier on top of the features, so in theory it could learn that whatever is mentioned in the description and is present in the scene should contribute more to the final similarity, but we haven't explicitly encoded that, so that could be interesting.
Yes. We haven't done that yet. We've tried to learn the common sense from these abstract scenes, so you can think of it as us building a knowledge base: for these relations, you can think of the primary and secondary objects as entities in a knowledge base, the relation as the relation, and our method as providing a score for whether that edge should exist in the knowledge base or not. So that part of the work can be thought of as building a knowledge base through abstract scenes. In terms of using existing knowledge bases to do something like answering questions about images, we haven't done that yet. There is some work — not on this sort of realistic visual-question-answering dataset, but more on toy problems — using something like memory networks, where you have the ability to read an existing knowledge base, and the neural network learns which facts are relevant and how to reason about them to produce an answer. But that's something we haven't looked at ourselves. Does that make sense?

Yes. So, the part where we learn to assess how plausible a relation is — that assessment of plausibility, we expect, generalizes from the abstract world to the real world, and we have in fact shown that: the abstract scenes were collected for a different subset of the data, and the test tuples were extracted from Microsoft COCO captions, which were written for real images, so they are not biased in any way by the abstract world. Those are clean, realistic tuples whose plausibility we can assess using our abstract-scene library, so that is the extent to which this common sense has been shown to generalize. They're still images from a dataset — it's not a robot walking around and running into random things; it's still MS COCO — but putting that aside, the other thing we attempted was with the interactions between humans: we trained models on abstract scenes and then tested them on real images. Those real images were still downloaded for the same sixty concepts that we cared about, so in that sense the evaluation was biased by the choices we made for the abstract scenes, but there we were checking whether the semantics in this clip-art world, this cartoon world — in terms of poses and so on — carry over to the real world. So those are the two ways in which we have attempted to show that some of this generalizes, but there's still a lot of room for improvement.