Thank you. Can you all hear me? Yes? All right, thank you, James, for that very kind introduction. I'll just get started. What I am interested in is semantic image understanding. Ideally, I'd like for machines to be able to take images like this and detect and recognize all the objects, tell us whether the image was taken indoors or outdoors, where it was taken, whether there are any people in it, what they are doing, how they are interacting with each other, and, in the limit, answer any question about the image that we would expect a person to be able to answer.

The title of my talk is Words, Pictures, and Common Sense, so I thought I'd start by talking about why I think this intersection of words and pictures, or language and vision, is interesting. The first motivation is that pictures are everywhere; we are surrounded by visual content, and words are how we communicate. So for a variety of applications it seems like being able to link this visual content with language, with words, would be very useful. For example, any time you want to interact with, organize, sift through, or navigate through these typically large quantities of unstructured visual data, it seems like if you have a link between this visual data and words, then these tasks might become easier. There's a lot of multimodal information on the web: when you have images, when you have videos, you typically have some text that goes along with them, and if you have a good way of linking these words to the visual content, then you might be able to leverage this inherently multimodal nature of the information around us and learn from it. If you think of visually impaired users, again, if you have a link between visual content and words, then words or language can substitute for that visual content and make it accessible to users when it may not have been if it was purely visual. The same thing applies to analysts: if a lot of visual data is thrown at them, it's hard for them to sift through all of it and extract the information that they need, but if you can summarize this information through words, then it becomes much more accessible. So one motivation behind words and pictures is that it opens up a lot of these applications.

The other motivation is that building this link gives us a way of demonstrating certain AI capabilities. If I can take an image and describe it well, let's say in a sentence, then that could be one way in which I have demonstrated that my machine has understood this image. And the same the other way around: if I can take a sentence with some associated visual content, and if I can ground that language in the visual content, then that could be one way of demonstrating that I have understood what the language is saying. There can be philosophical debates about this, but I think there is some value to this perspective. And the other thing,
especially from a computer vision perspective, is that language is rich and compositional. We're so used to thinking of vision tasks as bucketed recognition: if I'm doing object detection, I take boxes and place them into one of twenty different categories; if I'm doing segmentation, I take a pixel and place it into one of many different categories; if I'm doing classification, I take an image and place it into one of, let's say, a thousand categories. We're used to placing everything into one of several discrete bins, but because language is so rich and compositional, you can't afford to create bins for every single semantic concept that you might care about. That forces us to think beyond bucketed recognition and build models that can deal with the rich, compositional nature of language. So I think trying to map vision, or images, or visual content to language helps us push the frontiers of computer vision.

And the last motivation is that a lot of progress in AI has typically happened in separate fields: there's a lot of progress in computer vision, there's a lot of progress in natural language processing, and so it seems like it could be fun and interesting and useful to think of problems at the intersections of these different fields. Taking steps towards these AI-complete problems, if you will, can help us push the envelope of what AI can do today. So those are my main motivations behind looking at problems at the intersection of images, or vision, and language.

In my abstract I perhaps overambitiously promised four different things, and I will attempt to go through them. Visual question answering I will talk about just for a couple of minutes, because I'm giving a talk tomorrow where we'll get into it a little bit more. I already met someone earlier today who was hoping I would talk about VQA, and for others in the audience like that, I apologize, but you can get some of it tomorrow. All right, so let me start with the first topic.

In this vision and language space, obviously, I'm not the only one who is excited; there's a lot of interesting work going on at this intersection. One problem in particular that has received a lot of attention is image captioning, where the task is to take an image and describe it in one sentence. For those of you following computer vision: just at CVPR last year there were, I think, eight papers all on image captioning, all at once, and there was a whole session on vision and language, most of which was image captioning work. There was even a New York Times article, perhaps a little prematurely, on image captioning abilities about a year ago. So now we have machines that can take images like this and automatically generate descriptions that say something like "a man in a blue wetsuit is surfing on a wave," or "a group of young people playing a game of frisbee," or "a park in the middle of nowhere" — which sounds very human; "middle of nowhere" is an interesting phrase to be picking up on — or "a pot of broccoli on a stove," and so on. So we have all these techniques that are doing automatic image description, and especially last year, when all these papers came out in parallel,
there wasn't a good way of evaluating how well these techniques were working. We all know, from computer vision and other fields, that good evaluation protocols and benchmarks tend to really drive progress, especially in the early stages when a lot of people are excited about a problem. We've seen this for low-level vision tasks, we've seen it for mid-level tasks like segmentation, and for high-level semantic tasks like detecting and segmenting objects. But for image description there wasn't a good way of evaluating.

Evaluation has been problematic for image description, and I believe for a lot of language generation tasks in general, for a variety of reasons. One is that if you use automatic metrics, people have found over and over again that they don't usually correlate well with human judgment, and that's not very satisfying. The other option is human evaluation, but the problem with that is it's typically expensive — you have to pay someone to do the evaluation — and it's not easy to reproduce across labs: people might use different interfaces to collect these judgments, and then it's not clear whether the numbers are comparable. Oftentimes people will measure different aspects of the language — is it fluent, is it accurate, and so on — but then there is no good way of bringing all these different factors together into one number that you can use as your metric. And finally, one particularly interesting thing that we realized while we were working on this is the following. You might say, OK, let's just ask about overall quality: let's just ask people how much they like a certain caption, or a certain generated piece of language. What we found is that there's actually a difference between what humans like and what is human-like.

Let me give you an example. If I show you this image and I show you two descriptions — the first is "an owl sitting in a tree" and the second is "a multicolored owl with black, white, and camel colored feathers is looking to the left of the camera" — and I ask you which description you like, which description you think is better, many of you might think the second description is better. We've done this on Mechanical Turk, and people tend to like the second description; that is what humans like. But if I showed you this image and asked you to write a description, most of you would just have said "an owl sitting in a tree." That is the more human-like description. And we did do this: we asked forty different people to describe this image, and if you read through the responses — "an owl sitting in a tree," "an owl is sitting on a branch," "an owl is looking towards the camera," "an owl in a bush in front of a tree," and on and on. So that becomes an issue: when you're collecting your ground truth, your data looks like this, but when you're asking people to evaluate, they are preferring the second kind of description, so there's a mismatch between the ground truth that you're learning from and what you're holding up as the gold standard when you evaluate. That's problematic.

So what we argued
was: if the goal of image captioning is to describe an image in a human-like fashion, then let's just directly measure that. Let's directly measure how human-like a caption is — that is, for a candidate caption, how well does it match how several different people describe this image? That's when we introduced this idea of consensus-based image description evaluation, where we check, for a candidate description, for a candidate caption, how well it matches how different people describe the image. I won't go into all the details, but we introduced a new annotation modality so that you can measure this consensus through humans, and we introduced new datasets — in order to measure consensus, you need enough people describing the image; only then can you get a reliable measure of what the consensus is. So we introduced datasets where we have images with fifty different sentences describing each image. And then we have an automatic metric that also tries to measure this consensus, as opposed to the human annotation I just mentioned. This metric is called CIDEr: Consensus-based Image Description Evaluation.

Let me show you what it can do. We're back to that image of the owl. There's a metric called BLEU, which is precision-based and, depending on how you implement it, tends to have a bias towards shorter descriptions. So BLEU — an automatic evaluation metric — thinks that a really good description of this image is just "owl." Then there's another metric called ROUGE, which is recall-based and tends to have a bias towards longer descriptions, and it thinks that the long description I gave you at the beginning is a better description of this image. And then CIDEr, which is just checking how good a match a caption is to how people on average tend to describe this image, picks "an owl sitting in a tree" as a good description.

Here is another example. BLEU says "a cat in a tree" is a good description. ROUGE says "a cat is sitting in the branches of a very skinny tree." And CIDEr says "a cat stuck in a tree" is a good description. Which is the most human-like description? It is the one CIDEr picks: if you look at the forty-eight people describing this image — "a cat stuck in a tree," "cat stuck in a tree," "there is a cat stuck in a tree" — a lot of people describe this image in that fashion, and that's what CIDEr ends up picking up on. And just to convince you that CIDEr isn't simply going by some medium length that is not too long and not too short, here's another image. BLEU says "a man." ROUGE says "a man sitting with his hands together with two plastic bottles sitting in front of him." And CIDEr in this case picks a longer description as well: "a bored man with glasses with two empty bottles in front of him." The reason it goes for this sort of longish description is that when we asked multiple people to describe the image, many of them chose to give this sort of longish description. That's the consensus, and that's what CIDEr rewards: it very explicitly goes by how people describe the image.
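To make the consensus idea concrete, here is a minimal Python sketch of consensus scoring. It is not the actual CIDEr implementation (CIDEr uses TF-IDF weighted n-gram vectors and a length penalty in its CIDEr-D variant); it simply averages an n-gram cosine similarity between a candidate caption and every human reference, which is the spirit of the metric. The function names and toy references are illustrative.

```python
from collections import Counter
from math import sqrt

def ngrams(sentence, n):
    """Lowercased n-gram counts for a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(c1, c2):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(c1[g] * c2[g] for g in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def consensus_score(candidate, references, n_max=4):
    """Average similarity of the candidate to *all* human references,
    averaged over n-gram orders 1..n_max (CIDEr-like, without TF-IDF)."""
    score = 0.0
    for n in range(1, n_max + 1):
        cand = ngrams(candidate, n)
        sims = [cosine(cand, ngrams(ref, n)) for ref in references]
        score += sum(sims) / len(sims)
    return score / n_max

# The candidate that matches how most people describe the image scores highest.
refs = ["an owl sitting in a tree", "an owl is sitting on a branch",
        "an owl perched in front of a tree"]
print(consensus_score("an owl sitting in a tree", refs))
print(consensus_score("a multicolored owl with black and white feathers", refs))
```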
Question? Right, that's a great question, and it points to an issue with the problem of image captioning in general: the task is always defined as "here's an image, describe it." Describe it for what purpose is never specified. I'll make this connection again at the end of my talk — this is part of what motivated visual question answering, which is very task-driven: here's a question about the image, and your goal in understanding the image is to answer that question. So yes, that's an issue with this task. As far as I know, people always just say "describe this image in a sentence," and the purpose is never specified.

Another question? So, about quality control: this is all done on Amazon Mechanical Turk, in my case and for other datasets people have collected, for example MS COCO. There is quality control in that you only accept workers who've done enough good work in the past, so you're eliminating malicious workers who might be intentionally lazy and just trying to make money off of you. But you're not providing any feedback in terms of "these are the kinds of sentences we're looking for." You do tell them basic things: provide one entire sentence, avoid spelling mistakes, make sure the sentence is fluent and grammatically correct. You give them those sorts of basic instructions so that they're not producing gibberish, but not much beyond that. Right. Any other questions?

All right. We can also quantitatively show that CIDEr does better, but in the interest of time I'll skip that; I'm happy to talk about the details later if you'd like. There was an image captioning challenge on the Microsoft COCO dataset, which James was involved in building, and CIDEr was one of the metrics there. But I should say that it's still far from being anywhere close to human evaluation: people have found that in these competitions, machine-generated captions outperform human-generated captions by a large margin according to these automatic metrics. That doesn't say much about how good the captions are; it says more about how bad these evaluation metrics are. That's all I'll say about that.

So this was about the consensus — the average behavior of descriptions: you ask many people to describe an image, and how do they describe it on average? What I want to talk about next is more about the variance, and the signal in how these descriptions vary. That's this idea of image specificity. The idea is: if I show you this image and I ask you to describe it,
the kinds of descriptions I would get — these are four descriptions that four different people gave us — are: "people lined up in a terminal," "people lined up at a train station," "long line at the station," "people waiting for a train outside the station." That's fairly consistent, and these are the kinds of things you would like to see to believe that image captioning is a well-defined problem that a machine can hope to achieve. If people are not consistent, then you might wonder what it is you're even trying to reproduce. But if you have an image like this and you ask four different people to describe it, one person might say "alleyway in a small town," another says "people sitting and walking in a shopping area," a third says "man walking in a shopping area with others selling products," and the fourth says "sunbeam shining through the skylight." Depending on your background, whether you recognize this place or not, you might say something quite different.

So the points I'm trying to make are these. One, multiple descriptions of the same image can vary quite a bit, and this is something I don't think a lot of people have explicitly acknowledged. Two, this amount of variance varies across images: some images are ambiguous and not specific, so the variance is high, but some images are very specific, where multiple people describing the same image will produce more or less the same description. And the third, and most important, point is that we think this image specificity need not be thought of just as noise. This variance in descriptions, and the fact that it differs across images, is actually a characteristic of the image that could be a useful signal to exploit when we work on problems at the intersection of vision and language.

One application where we tried to exploit this signal — the fact that some images are described the exact same way by everyone, while other images are described differently by different people — is text-based image retrieval. The setup is: you have some database of images, and each image has been described with one caption, so there is one caption per image. Now there's a user who has a certain target image in mind that he or she is trying to retrieve from this database, and what they issue is a query, which is a sentence describing the image they have in mind. What you're going to do is match the query sentence to the captions in the database and sort them, and you would hope that the image the user was looking for is ranked high. So this is text-based image retrieval: you are retrieving images, but based on matching text. Our idea for introducing specificity into this application was the following. The intuition was: if an image is specific, I know that multiple people tend to describe it in the same way, so for a specific image I expect that the query and the caption I have in my database should be a really good match, because there aren't many different ways of describing this image.
So for a specific image, I should regard it as a relevant result only if there's a really good match between the query and the caption in my database, because I know the image is specific and there aren't many different ways of describing it. On the other hand, if an image is ambiguous, I know there are many different ways of describing it, so even if my query doesn't match the database caption all that well, it may still in fact be a relevant result, just because the image is ambiguous. Even a moderate similarity between the query and the database caption might be sufficient to call that image a relevant result. So the idea is that the similarity between the query and the database caption should be modulated based on how specific the image is: the same similarity value means two different things depending on whether the image is specific or ambiguous.

Yes? So I wasn't going to go into those details, but what we do is take multiple descriptions from people, compute the pairwise similarity between all those descriptions, and average that. The similarity can be human-annotated, or it can be computed using automatic measures like word mover's distance in word2vec space or some other embedding space, or BLEU based on n-grams — anything you want.

And here is what we can show for retrieval. This is the baseline approach that does not modulate based on specificity. In order to estimate specificity we need multiple sentences describing each image — that's what I'm calling training sentences — but if you're not reasoning about specificity you don't need those sentences, so the baseline performance is flat. And this is the performance you get if you reason about image specificity; in this case lower is better, because this is the rank of the target image you cared about. What we're seeing here, on two different datasets shown in red and blue, is that we need about eight sentences per image in one case, or seventeen in the other, to estimate image specificity accurately enough for it to actually help. At this point you should all be saying: how does this make any sense? I'm essentially saying that every image in my database should have seventeen different descriptions just so we know how much the variance is, and that's not practical. So what we've shown in the paper is that you can in fact predict image specificity automatically using just image features: for a database of images, I first compute image specificity using multiple descriptions, train a regressor that predicts the specificity value from image features alone, and then use the predicted specificity for retrieval. Any questions here?

Yes — deep learning features? So yes, we've tried a variety of features, and the deep learning features work the best. They carry some semantic information: they know what kinds of objects are present and they have a general sense of the layout, so they work best. But it's possible that if you fine-tune these networks for the task of predicting image specificity, they might work even better.
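As a rough sketch of what was just described, here is how image specificity can be computed from multiple human descriptions and then used to modulate a retrieval score. The Jaccard word-overlap similarity and the division-by-specificity normalization are illustrative placeholders, not the measures used in the actual work, which explored human judgments, embedding-based similarities, and a learned mapping.

```python
from itertools import combinations

def sentence_sim(a, b):
    """Placeholder similarity: word-overlap Jaccard between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def specificity(descriptions):
    """Average pairwise similarity among several human descriptions of the
    same image: high value = specific image, low value = ambiguous image.
    Assumes at least two descriptions are available."""
    pairs = list(combinations(descriptions, 2))
    return sum(sentence_sim(a, b) for a, b in pairs) / len(pairs)

def retrieve(query, database):
    """database: list of (image_id, caption, specificity) triples.
    Modulate the query-caption match by specificity: an ambiguous image
    (low specificity) needs less raw similarity to be ranked as relevant."""
    scored = [(sentence_sim(query, cap) / max(spec, 1e-6), img)
              for img, cap, spec in database]
    return sorted(scored, reverse=True)  # highest modulated score first
```

In practice, the specificity value in the database would come from the regressor trained on image features, so no extra sentences are needed at retrieval time.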
Yes? No, so that part is a bit misleading; I should have dropped those points off. Those two points are essentially semi-cheating, where the last sentence that came in was the sentence that, at test time when you're doing retrieval, happens to be the database caption. In image retrieval the train and test split is not really clean, and we just wanted to see how much you would benefit if you knew that this caption is the one in the database while estimating specificity. So those points were just coming from that. Yes?

Yes — that's right. We'd have to go back and check whether we varied this, but this is the same dataset I introduced for the evaluation, so we had fifty sentences per image; that's the most we have. We'd have to go back and check how performance degrades if we used fewer and fewer of those for estimating specificity while training the regressors. I see — so we had about a thousand images at most; it wasn't that large. It works reasonably well; it's nothing groundbreaking, but it generalizes well enough. Any other questions?

All right, so this is going to be the main chunk of my talk: this idea of learning common sense. Going back to the scenario with captions, I mentioned that we now have machines that can take images like this and generate a description that says something like "construction workers in orange safety vests are working on the road." This is great, it's exciting. But take an image like this: if a person describes it, someone might say "a man was rescued from his truck that is hanging dangerously from a bridge." If you think about what it would take to produce words like "rescued" and "dangerously," there isn't anything specific about this image, any particular detail, that directly corresponds to the word "rescued" or the word "dangerous." When we look at this image, we have a sense of what happened before and what is likely to happen next, and based on that we say things like "rescued" and "dangerous." That involves some common sense knowledge about how the world functions, which allows us to say something like this, and which, as far as I know, current machines don't have.

So if we really wanted to learn this common sense, what might we do? One solution a lot of people have pursued is to learn it from text: we have lots and lots of text on the web, and we can mine it to learn a whole bunch of common sense. But the issue with text is that there's a reporting bias. Text is not an unbiased representation of the world; it's an artifact that humans generated, and humans talk about things they like talking about. Humans like mentioning things that are interesting, which often tend to be uncommon, and it's not interesting to talk about things that are common. Here is one example: these are the number of times different words occur in a certain text corpus. If I use these word frequencies as an estimate of how often these things happen in the world, it would turn out that we inhale six times as often as we exhale — and in fact that we get murdered seventeen times as often as we exhale.
If you look at different body parts being mentioned in these text corpora, we have heads one thousand and eighty-five times as often as we have gall bladders. Gall bladders are just not interesting to talk about. So there is this well-known reporting bias in text: you can't use frequencies in text as an estimate of how often something happens in the world. Here is another example. The question is: do birds fly? If I look at the definition of a bird — "any warm-blooded, egg-laying vertebrate of the class Aves, characterized by a body covering of feathers and forelimbs modified as wings" — and then I look at penguins, the very first thing it says is "any of several flightless aquatic birds." If I just read this, I would think birds are flightless. But the whole point is that it's worth mentioning that penguins are flightless precisely because birds typically fly. There needs to be this added level of reasoning for us to take a definition like that and realize that because it calls out penguins as flightless, it must mean that birds typically fly — and even this logic doesn't always hold. So the point is that it's hard to learn common sense from text alone; there's a lot of useful signal in it, but it doesn't have everything.

So the thought then is: well, we have this visual world around us that has a lot of structure in it — can we just exploit that? It is a more unbiased sampling of what happens in the world around us, and the structure is certainly there. If I give you an image like this and you had to describe it, you might say "two professors conversing in front of a blackboard." Now if I change this image slightly, you would probably change the description to "two professors stand in front of a blackboard." So the gaze structure is important to whether two people are conversing or not. There is a lot of structure that we're exploiting — but how do we get a machine to learn from it? How do I get a machine to realize that just flipping the gaze can change the semantic meaning of an interaction between people?

There are a few reasons why this is hard to do. One is the lack of visual density: I don't have pairs of real images that change ever so slightly such that the meaning also changes, so I don't have that fine-grained signal to learn from. The second issue is that annotations are expensive: if I wanted to learn that gaze is relevant to whether people are conversing or not, I would need gaze annotated, expressions annotated, and a whole bunch of other things, which is very expensive to collect. You might say, well, if annotations are expensive, let's just use computer vision techniques to estimate gaze and expressions and poses and what people are wearing and so on — but that becomes a chicken-and-egg problem. I'm saying that I want to learn common sense so that I can do computer vision better, so that I can understand images better, and yet I'm now saying that in order to learn this common sense, I first need to solve computer vision. So our attempt at breaking out of this chicken-and-egg problem was to question whether photorealism is even necessary to learn common sense.
Our thought was that the common sense lies in the semantics of the images, not in the actual pixel values. So we can give up on photorealism, focus just on the semantics, and hopefully learn common sense from that. With that in mind, we introduced these two characters, Mike and Jenny, and they live in a world full of toys and animals and all sorts of things they can do in the park, with rain and clouds and lightning and trees. Mike and Jenny can have different expressions and different poses.

Now we can do a lot of fun things. We can go to Amazon Mechanical Turk, show workers this interface, and ask them to create the abstract but semantically rich scenes you see here. They can move objects around at three different depths, they can flip them around, and with just a handful of objects — I think there's something like six objects in this scene — there is a story you can tell; it is a semantically rich scene. With this you can do things you could never do with real images. I can give people a description — something like "Mike gets rid of the bear by giving him a hot dog while Jenny runs away" — and ask six different people to create a scene that depicts it. All of these scenes look very different: they have different objects present or absent, but what's consistent across them is that each has Mike, each has Jenny, there's a bear, Mike is facing the bear holding a hot dog towards it, and Jenny is running off in the opposite direction. Where exactly Mike and Jenny are, all of that changes, but the core semantic visual features remain intact. If you tried to collect six real images that are this different yet all carry the same semantic meaning, that would be really hard to do.

So we have this dataset of a thousand semantic classes, where there are ten scenes that all correspond to the same language description. One thing to note is that the visual features are all annotated — it's a fully annotated scene, because it's an abstract scene, a cartoon scene, so the annotations come for free, including gaze, expressions, poses, object presence, the three levels of depth, and so on. Now we can begin to ask interesting questions like: which visual features are important to the semantic meaning? Is it location, relative location, gaze, expression? Which words correlate with which semantic visual features?

From this dataset we can do things like take as input a description that says "Jenny is catching the ball. Mike is kicking the ball. The table is next to the tree," do some basic natural language processing to extract tuples of the form (primary object, relation, secondary object), and automatically generate an abstract scene, a cartoon scene, that depicts this meaning. Just for reference, this was the ground-truth scene a human had generated. Again, there are a lot of differences — the ball itself has changed — but the meaning remains the same. The reason we lost the tree is that the language processing dropped that tuple, and so the generated image doesn't have a tree.
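To illustrate the shape of that pipeline, here is a toy sketch: parse a description into (primary, relation, secondary) tuples, then lay out clip-art pieces. The hard-coded relation list and the placement rule below are only placeholders; in the actual work, what to place where was learned from the annotated scenes rather than scripted.

```python
# Toy tuple extraction and scene layout; relation phrases and coordinates
# are hypothetical placeholders, not the learned model from the paper.
RELATIONS = ["is catching", "is kicking", "is next to"]

def extract_tuples(sentence):
    """Very crude tuple extraction: split on a known relation phrase."""
    s = sentence.lower().strip(". ")
    for rel in RELATIONS:
        if rel in s:
            primary, secondary = s.split(rel)
            return [(primary.strip(), rel, secondary.strip())]
    return []

def render_scene(tuples):
    """Map each tuple to clip-art pieces with coarse (x, y, depth) slots."""
    scene = []
    for i, (primary, rel, secondary) in enumerate(tuples):
        scene.append({"clipart": primary, "x": 100 + 200 * i, "y": 250, "depth": 1})
        scene.append({"clipart": secondary, "x": 220 + 200 * i, "y": 260, "depth": 1})
    return scene

print(render_scene(extract_tuples("Jenny is catching the ball.")))
```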
We then decided to focus on a very narrow domain of two people interacting. We had an interface with posable people whose limbs you can freely move, and we asked workers to create a scene where, say, someone is dancing with another person. At first it looks like one person is hitting the other person on the head, but if you wait a second, a change in expression changes the meaning. So we collected data for "dancing with," "walking with," "walking away from," "talking to," "arguing with" — very fine-grained distinctions between interactions between just two people, with no other objects and no other context — and we tried to see if we could learn these differences. These are some examples that Mechanical Turk workers created, and I was quite impressed with what they did; some of these are fairly elaborate, especially if you look at the dancing poses.

These are images from the web corresponding to the same interactions, and laid out like this, it's very tempting to ask: if I train on the abstract data, can I test on real images and detect these interactions? That's exactly what we did. Obviously there's a huge domain shift, so we don't expect this to work all that accurately — there is no 3D; it's just one plane that people can move around in — but we tried it anyway. This is chance performance on these sixty categories, and this is the performance you get if we assume that in real images the poses have been accurately detected, because that particular obstacle is not what we were interested in; we wanted to know whether the semantics carry over. Just for completeness, this is what we got with an actual pose detector, which at this point is from two years ago — this was 2014, and I'm sure there are much more accurate ones now, especially with all the deep learning techniques. So you can transfer from this completely abstract world to real images without anything terribly sophisticated.

But I started all of this with common sense and then digressed a little, so let me come back to common sense. One task we considered, to assess whether we can learn common sense from these abstract scenes or not, is assessing the plausibility of a relation. If I give you a tuple that says "man lifts meal," the machine's job is to assess whether this can happen in the real world or not. Another example is an obviously implausible tuple, where the machine should hopefully decide that no, this is not very plausible. Our thinking behind how to do this was the following: if a tuple, a relation, is similar to other relations that I know are plausible, then I conclude that this relation is plausible; but if it is not similar to other relations that I know are plausible, then it's probably not plausible.
For example, if someone has already told me that "person holds sandwich" is plausible and that "man eats pizza" is plausible, then if I compute the similarity between "man holds meal" and these two tuples, I will probably find a good similarity, and because I know those two are plausible, I'll conclude that this relation is plausible as well. That is the underlying approach we decided to follow. Our other insight was that in addition to reasoning about similarity in text, we should also reason about the similarity between these relations in the visual world: if I were to imagine a man holding a meal, a person holding a sandwich, a man eating pizza, and in this imagination these things tend to be similar, then that should also count for something, even if the textual similarity isn't all that high.

Let me give you an example. Here are two tuples: "boy stared at man" and "boy looked at man" — both of these would be similar even in text. Here are two other pairs of tuples that would be similar in text; that's great. But here's a pair of tuples — in particular, focus on the relations, "stands over" and "reaches for" — if I compute, say, a word-vector representation of "stands over" and "reaches for" and measure the similarity, I probably won't get a very high match. But if I look at a visual instantiation of these, both can in fact have a very similar instantiation: the same scene you might describe as a dean standing over a graduate, you could also describe as a man reaching for a jacket. Because they can both have a similar visual instantiation, my claim is that we should consider these tuples to be similar, even though textually they may not be. Does that make sense?

So we did exactly that. We have a collection of abstract scenes and a collection of tuples that describe them, and now, given any new tuple provided as input, I can compute its similarity to all these text tuples, and I can compute its similarity to the visual instantiations of these tuples, and we combine those two similarities to assess whether the new tuple is plausible or not. So we use both vision and language to assess the plausibility of a text-only tuple provided as input — and note that this is a purely common-sense task posed in text; there is no image as input.

If we look at accuracy — this is average precision; for these tuples we have annotations from humans about whether they are plausible or not, and our system outputs a plausibility score, so we can compute these two metrics, and higher is better — this is what you get using text alone (it's a text-only task, so it makes sense to use only text), this is what you get if you completely ignore the text and only look at the vision side, and this is what you get if you combine both: you see an improvement in performance on this common-sense task of assessing the plausibility of relations.

Here's another task that seems to be purely text-based but can leverage common sense: a fill-in-the-blank task. "Mike is having lunch when he sees a bear." And then, blank:
does Mike order something to eat? Does Mike hug the cub? Does Mike remember that bears are mammals? Or does Mike try to hide? Hopefully most people in this room would agree that the most plausible answer is the last one: Mike tries to hide. How did we arrive at that answer? Well, we reason about the fact that people — kids in particular — are scared of wild animals, and when you're scared of something you either run away, which you shouldn't do with bears, I think, depending on the kind of bear, or you try to hide. With that common sense, we decide that this is the most likely option.

Another task is visual paraphrasing. The question is: can these two descriptions be describing the same scene? "Jenny is going to throw her pie at Mike" — that's one description. The second description is "Jenny is very angry, and Jenny is holding her pie." Again, I think most of us would agree that it is quite plausible that these two are describing the same scenario, and the reason we think that is we know that when people are angry they might throw things, and in order to throw something you have to be holding it.

Our argument in trying to solve these tasks is the following: instead of mining through a whole bunch of text that tells you that when you're angry you throw things, and that in order to throw something you have to be holding it, we can just imagine the scene corresponding to the description. If I imagine someone throwing something, I imagine them holding it. If I go back to this latent representation, if you will, of the visual signal that lies behind the text, and reason about similarity in that visual space, I might do better than reasoning about the text alone.

That's exactly what we did. If you have a fill-in-the-blank question with four options, we plug each option into the description, which gives us four candidate descriptions. I already told you that we had a way, from a couple of years ago, of generating a scene that corresponds to a given description, and recall that this imagination doesn't need to be photorealistic — it just needs to capture the semantics — so we can imagine it in the abstract world. Now we can extract features from the text and from these generated scenes to reason about which of the four options is most likely. For visual paraphrasing, you have two descriptions; the baseline approach would just extract text features from both descriptions and reason about some form of similarity to decide whether they describe the same scene or not, but instead we can generate the scenes corresponding to each description and add those visual features into the decision.

Just like with the plausibility of tuples, we see similar gains in performance. This is our performance on these two tasks — higher is better — with text alone, vision alone, and then text and vision combined, and there is an improvement. So we've used visual abstraction for a variety of things: studying mappings between images and text; learning concepts about the world, like those interactions between people, without using any real images, just the visual abstractions; and we've looked at other properties as well. Overall, we're studying these high-level image understanding tasks without waiting for the low-level computer vision problems to be solved.
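The way the imagined scenes enter these models can be summarized in a few lines. Below is a hedged sketch: a candidate statement is scored by its similarity, in both text space and imagined-scene space, to things already known to be plausible. The feature vectors, the max-similarity aggregation, and the mixing weight alpha are stand-ins; the actual work used richer features (tuple occurrences, object presence, pose, expression, relative location) and learned how to combine the two signals.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def plausibility(option_text_vec, option_scene_vec,
                 known_text_vecs, known_scene_vecs, alpha=0.5):
    """Score an option by its closest match to relations/scenarios known to be
    plausible, combining similarity in text space with similarity between the
    *imagined* (generated) abstract scenes. alpha is a hypothetical weight."""
    text_sim = max(cosine(option_text_vec, v) for v in known_text_vecs)
    scene_sim = max(cosine(option_scene_vec, v) for v in known_scene_vecs)
    return alpha * text_sim + (1 - alpha) * scene_sim
```

For fill-in-the-blank, each candidate completion is plugged into the story, a scene is generated for it with the earlier pipeline, and the highest-scoring option is chosen; for visual paraphrasing, the same kind of text-plus-scene features feed a classifier that decides whether the two descriptions match.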
We've used it for learning common sense knowledge, like we just talked about, and overall what I find exciting about these abstract scenes is that they are a very rich annotation modality. You can show a person an abstract scene and ask them to describe it; show them a description and ask them to create a scene that corresponds to it; show them a scene and ask them to modify it so the description changes in a certain way; or perturb a scene and ask them how the description changes — and all sorts of other things which you cannot do with real images, because it's hard to manipulate photorealistic image data. And just as a teaser, we have a new dataset of abstract scenes with much more realistic-looking clip art, many different people — not just Mike and Jenny — of different genders, ages, and races, indoors and outdoors, with many more objects. It is online if anyone is interested.

I want to wrap up with a teaser about visual question answering; as I said, I'll present more details tomorrow. The idea, coming back to your earlier question, is that image captions tend to be very generic. If I take an image like this and a machine spits out a description that says "a giraffe standing in the grass next to a tree," we would think this is amazing — it recognized the giraffe, the grass, the tree. But it turns out that if you look at the MS COCO dataset, for instance, and take a random sampling of all images that have giraffes, this description fits most of them just fine. So it's very plausible that the only thing the machine did was recognize the giraffe, and the language model spit out the rest. It's not clear that these models are really understanding these images as well as it might seem: a coarse understanding of the image followed by a simple language model can produce deceptively good image captions. The other issue is that it's very passive: if there is some particular piece of information that I want from this image, I as a user have no way of asking for it; the system will just put out a one-sentence generic description.

So what we're excited about is this task of visual question answering: given an image and a natural language, open-ended question — any question about the image — the machine's task is to answer that question in natural language. You might have an image like this: "What is the mustache made of?" You might have an image like this: "How many slices of pizza are there?" — that's counting objects, which detection is good at — but then, "Is this a vegetarian pizza?" You can imagine you might need access to some knowledge base that says if you detect meat in the image, then it's probably not vegetarian. "Is this person expecting company?" — that takes a lot of social common sense: there are extra glasses set out, so yes, this person is probably expecting company. "Does this person have 20/20 vision?" — to answer that, you need to know that 20/20 vision means good eyesight, and a glasses detector alone is not enough; you would need some knowledge base or common sense knowledge. So what we're working on is building models that can do essentially this: you can upload an image and type in a question, and the system will hopefully produce accurate answers to these questions.
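For the curious, the kind of model this usually comes down to can be sketched quickly: encode the image with a CNN, encode the question with a recurrent network, fuse the two, and classify over a fixed set of frequent answers. This is a generic baseline sketch, not the exact model behind the demo; the layer sizes and answer vocabulary below are placeholders.

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """Minimal sketch of a common VQA baseline: precomputed CNN image
    features + LSTM question encoding, fused by element-wise multiplication,
    then a classifier over frequent answers."""
    def __init__(self, vocab_size, num_answers, img_dim=4096, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers))

    def forward(self, img_feats, question_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))  # final hidden state
        q = h[-1]                                           # (batch, hidden)
        v = torch.tanh(self.img_proj(img_feats))            # (batch, hidden)
        return self.classifier(q * v)                       # answer scores

# Dummy usage: a batch of 2 precomputed image features and 2 tokenized questions.
model = VQABaseline(vocab_size=10000, num_answers=1000)
scores = model(torch.randn(2, 4096), torch.randint(0, 10000, (2, 8)))
print(scores.shape)  # torch.Size([2, 1000])
```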
The nice thing is that these questions can often get at details that might be in the background of the image; they get at a lot of detail. It's task-driven: you're understanding the image for the purpose of answering a specific question about it; it's not generic. And what we like is that it brings together words, pictures, and common sense: you need to know a lot about the world to answer some of these questions. What we have so far is a dataset that is publicly available now. It has over 250,000 images, which include more than 200,000 real images from the Microsoft COCO dataset and the 50,000 abstract scenes I talked about. Each image has three questions associated with it that we collected on Mechanical Turk, and each question has ten answers provided by humans. So it's over three-quarters of a million questions and over ten million answers, and depending on how much interest there is in the community, we might even grow it over the years. That was a lot of Mechanical Turk work — over ten thousand workers were involved — and these are some of the fun statistics. We have an evaluation server set up: you can upload your results on our dataset today and see where you stand on the leaderboard, and we're going to organize a challenge at CVPR next summer, CVPR 2016. If you're interested, you can find out more on this webpage; the dataset, the paper, and everything else is there.

All right, so to summarize. I said I'm interested in semantic image understanding, and I think a lot of work in computer vision traditionally has focused on the image part of the equation, where we take these images, these pixels, and try to make decisions or reason about them. But I think equal members of this equation are the semantics — and words and language are a convenient way of conveying the semantics — and the understanding part — the common sense and the knowledge-based reasoning — and I think all of these need to work together to get to the holy grail of semantic image understanding: answering any question about an image. I talked about a few different things, but if you had to leave with the two main things I am excited about: one is teaching machines common sense, and especially the use of visual abstractions to do that — that's something I think is a lot of fun — and the second is finding tasks that let us truly measure semantic image understanding as a whole, and I think visual question answering is a nice step towards that, and we have a dataset that goes with it. I will stop there. Thank you.

Question? Yes. All right, so there were a few different tasks that I talked about; I think your question applies most to visual paraphrasing, where we have two descriptions and the task is to figure out whether they match. To answer your question: no, we have not given special preference to the objects in the scene that were mentioned in the description. That could be interesting. We're learning a classifier on top of the features, so in theory it could learn that whatever is mentioned in the description and is present in the scene should contribute more to the final similarity, but we haven't explicitly encoded that, so that could be interesting.
Yes. We haven't done that yet. We've tried to learn the common sense from these abstract scenes, so you can think of it as us building a knowledge base: for these relations, you can think of the primary and secondary objects as entities in a knowledge base, the relation as the relation, and our method as providing a score for whether that edge should exist in the knowledge base or not. So that part of the work can be thought of as building a knowledge base through abstract scenes. In terms of using existing knowledge bases to do something like answering questions about images, we haven't done that yet. There is some work — not on this sort of realistic visual-question-answering dataset, but more on toy problems — using something like memory networks, where you have the ability to read an existing knowledge base, and the neural network learns which facts are relevant and how to reason about them to produce an answer. But that's something we haven't looked at ourselves. Does that make sense?

Yes. So, the part where we learn to assess how plausible a relation is — that assessment of plausibility, we expect, generalizes from the abstract world to the real world, and we have in fact shown that: the abstract scenes were collected for a different subset of the data, and the test tuples were extracted from Microsoft COCO captions, which were written for real images, so they are not biased in any way by the abstract world. Those are clean, realistic tuples whose plausibility we can assess using our abstract-scene library, so that is the extent to which this common sense has been shown to generalize. They're still images from a dataset — it's not a robot walking around and running into random things; it's still MS COCO — but putting that aside, the other thing we attempted was with the interactions between humans: we trained models on abstract scenes and then tested them on real images. Those real images were still downloaded for the same sixty concepts that we cared about, so in that sense the evaluation was biased by the choices we made for the abstract scenes, but there we were checking whether the semantics in this clip-art world, this cartoon world — in terms of poses and so on — carry over to the real world. So those are the two ways in which we have attempted to show that some of this generalizes, but there's still a lot of room for improvement.