[00:00:05] >> Welcome everyone to the machine learning seminar. It is my great pleasure to introduce today's speaker, an assistant professor of computer science whose research interests are in computer vision and machine learning. She has been very active in the community, including serving on program committees, and her research has been supported by the National Science Foundation as well as industry sponsors such as Amazon. [00:00:38] What I personally enjoy about her work is that it's a bit of a breath of fresh air: she goes after a lot of problems and questions that aren't necessarily mainstream but that I think are fascinating, and you can see some of that in her body of work. So it's a pleasure to have her with us, and with that I'll let you take it from here.

>> Thanks, Debbie, for the invitation and the introduction. The title of the talk is "Reasoning about Complex Media from Weak Multimodal Supervision." There are a lot of words in there, but the talk will hopefully explain what I'm trying to do. What I'm interested in is understanding the rhetoric of multimodal media, and I probably don't need to convince you, [00:01:32] especially this week and this month, that this is important. But understanding how media affects opinion is very tricky. We want to understand the agenda or intent of the media. More recently we've looked at political articles, but what I'll talk about first is visual advertisements. [00:01:57] The challenge is that these are very creative, they target human audiences, and the data is limited, in the sense that ads and political images are images with intent: a designer actually has to spend time to create an advertisement, so ads are not as plentiful as general images. The supervision we have on them is also [00:02:24] limited, both in count and in extent, because advertisements appeal to human knowledge even more than other images do, and the external knowledge you need may not be available at all. One nice thing about the media we're looking at is that they have both visual and textual aspects, so the text gives you some sort of supervision for understanding the images; however, the relation between text and image is indirect, [00:02:57] unlike in other problems we look at in vision and language. So even though the text is a type of supervision for the images, as I'll show, it is quite noisy supervision. What I'm interested in more recently, still within these applications, is how we can learn visual models for some of these tasks from noisy data. As a motivational slide that I've used for a while, coming from my background in media studies: at the top are two pieces of media that have been very influential, one from the sixties and one from the nineties. [00:03:37] These had real social impact. At the bottom are a bunch of advertisements. The first one probably everyone has seen; the second group are the Absolut ads. I don't so much care about their impact in terms of increasing market share
[00:04:00] by whatever factor; I'm interested in them being extremely creative and extremely challenging. With a second of thought you can probably figure out why this one is "Absolut Chicago" (because of the wind) and why this one is "Absolut Peel" (because it's an orange peel). [00:04:20] But these are going to be extremely challenging for vision algorithms to explain: why is this Chicago, why is this a peel? So to get started, we formulated the task of decoding image advertisements as understanding the message the ad is trying to convey. Our vision systems are not good at this simply because they haven't been trying to be good at it. We can recognize concepts in these images; this is from a few years back, so even then a system could caption these images fairly accurately. The problem is capturing what the image means, which is really the reason the image was created. [00:05:04] Before our work, vision algorithms were unable to do this, and only because nobody was looking at this persuasive aspect. One explanation of what this ad means, which comes from an annotator, is that a Burger King burger must taste really good if even the competitor's employees buy it secretly. [00:05:26] We know it's secret because of the trench coat, and we know this is Ronald McDonald. There's a lot of conjecture, a lot of knowledge, that you need to understand this, and I think that's why it's an interesting problem. [00:05:49] Explicitly, here are some of the challenges, some of which we've looked at and some of which we haven't. A fraction of ads, not a trivial fraction, feature atypical objects that are deliberately not photorealistic, so vision algorithms will have a lot of trouble understanding them: here is a deer made out of junk, a cow made out of fruit, lungs that are actually trees, an owl made of coffee beans. Developing representations that are robust to this is something I'm interested in, and I actually have a recent grant on this topic, but it's not something I'll talk about today. Another really cool thing is that some ads, even though they are static images, imply a certain physical process, [00:06:35] and this process relates to the point the ad is making. These two are a little less obvious, maybe, but in this one we can all agree the earth is melting, and in this one, if you think about what's going to happen next, the animal is going to be crushed by the arms of the clock. [00:06:52] It's saying that every sixty seconds a species dies out; time is running out, and we need to rescue them now. Then there is common sense, if you will, in the form of the symbolic associations that ads rely on: guns are dangerous, eggs are fragile, chili peppers stand for hot things. And notice the ambiguity at the heart of this one: the ketchup is hot as in spicy, not hot as in temperature. We've looked a little at how to leverage such symbolic associations to understand the message of an ad. [00:07:36] There's also an interesting vision-and-language aspect, which is the type of text that
ads feature, what I'm going to call a slogan, which often involves a metaphorical use of text. Sometimes it's literal: here this really is a winter collection, so this is a literal use of the phrase "winter collection." Here it's not literal: the woman is dressed in cardboard boxes, and I believe the intent is to [00:08:10] evoke a statue in your mind, which makes the state of the person seem even more tragic. It's a call to action, in this case to help prevent or work against human trafficking; this is a public service announcement about human trafficking. So the relationship between image and text in these two ads, especially the second one, is very [00:08:36] abstract and very tricky, but also very interesting. To start, we collected an advertisement dataset. It's an image dataset of 64,000 ads, and it has a variety of annotations: the topic of the ad (basically, if it's a product ad, what is it trying to sell; if it's a PSA, what is it about), sentiment annotations (what sentiment is the ad trying to evoke: is it trying to make you hungry, to make you scared, to cheer you up), [00:09:11] and the main annotation we use, these action-reason statements, which are a natural-language answer to "what does this ad suggest you should do, and what reason does it provide for taking the suggested action," plus a few other types of annotations. We also have a similar dataset with fewer videos, annotated with some of the same types of annotations. [00:09:38] In terms of tasks, this is one of the core tasks we explore: what message does the ad convey (we call that the action), and what argument does it provide for taking the suggested action (that's the reason). You can think of this at a high level as question answering where the answer is open-ended, and we've gone through different ways of formulating the task in practice. [00:10:06] One thing we settled on, because the open-ended version is so challenging, is a multiple-choice task: given a number of options for the action-reason statement, the algorithm has to pick the one that matches the image. [00:10:26] Just to give you a sense of how challenging this is, even when people actually understand the ad, which is not always the case: here I have an ad with three annotations that are all correct, all actually matching this image. "I should drink this" is the action, and "because it helps you recover" or "because it keeps you young" are the reasons the ad [00:10:48] provides. Here's an example method, a recent one we came up with for this task, which builds on prior methods. We have a PSA that says "Read the road"; it shows a crashed motorcycle, and it's about being careful. [00:11:10] The slogan is "Read the road," and we don't need many other parts. What we want is a metric learning task where this image needs to be matched with a statement like "I should be careful on the road so I don't crash" (that's our action-reason statement), [00:11:37] and its visual representation should not be close to an unrelated statement like "I should buy this bike," which was written for a different ad.
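To make the metric-learning formulation concrete, here is a minimal sketch of this kind of training objective. This is a paraphrase under assumptions, not the actual implementation from the paper: the encoders, dimensions, and negative-sampling scheme are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: match an ad image to its action-reason statement in a joint space.

class ImageEncoder(nn.Module):
    """Placeholder: projects pre-extracted image features into the joint space."""
    def __init__(self, feat_dim=2048, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)

class TextEncoder(nn.Module):
    """Placeholder: mean-pools word embeddings of a statement, then projects."""
    def __init__(self, vocab_size=30000, embed_dim=256):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, embed_dim)  # default mode is mean
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids, offsets):
        return F.normalize(self.proj(self.emb(token_ids, offsets)), dim=-1)

# Pull the image toward its annotated statement, push it away from a
# statement that belongs to a different ad.
triplet = nn.TripletMarginLoss(margin=0.2)

def training_step(img_enc, txt_enc, img_feats, pos_ids, pos_off, neg_ids, neg_off):
    anchor = img_enc(img_feats)               # ad image
    positive = txt_enc(pos_ids, pos_off)      # "I should be careful on the road..."
    negative = txt_enc(neg_ids, neg_off)      # statement from an unrelated ad
    return triplet(anchor, positive, negative)
```

At test time, the multiple-choice task then amounts to embedding all candidate statements and ranking them by distance to the image embedding.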
At a high level, we have an embedding for the image, and the pieces of text go into that image embedding. The image embedding itself is mostly standard: image regions with learned attention scores. We also have a representation of the slogan, which is embedded text. [00:12:01] To leverage our idea that ads rely on symbols, which comes from media studies theory, we have a symbol classifier. Our symbols are things like "danger": concepts that are abstract but can be demonstrated with specific objects. I have a slide that goes into this a little more in a moment, but basically the symbol classifier outputs how likely it is that the image contains any of these symbols. And then, because we don't have object annotations on these images, [00:12:40] we use a captioning model to predict descriptions of the objects in the image, and all of that goes into the image embedding. About the symbolism: I want to eventually go beyond what we've done, but what we have done is collect annotations from people about a set of symbols, or concepts, and the types of images that refer to each concept. Those types of images are signifiers, as the semiotics literature would call them: signifiers for something called the signified, in other words the concept. So you get different images that show danger, that symbolically refer to danger; images that refer to being natural or healthy, which are related concepts; [00:13:39] symbols for strength, often muscles or strong cars; things that refer to speed, like bullets and rockets; and things that refer to energy. Going back to the method, here is a high-level overview of results. We're trying to retrieve action-reason statements, what you should do and why, and we report accuracy and rank, [00:14:06] where accuracy should be high and rank should be low (rank here is the rank of the correct statement among the options). We break results down into product ads and PSAs, because product ads are generally less creative and less abstract, while PSAs tend to get really creative, [00:14:28] and we also evaluate on videos, which are even harder. If you compare using just the image as input versus just the slogan (the text embedded in the ad) to predict the action-reason statement, the text alone is generally a lot more useful, because it's more directly tied to the message. The image is there to attract your attention, to explain something in the text; it could even, [00:14:56] on the surface, contradict the text. But generally the message is much more clearly encoded in the slogan, [00:15:09] which we knew from the start, so we hadn't spent much time on the slogan as input, because that's less interesting. This is a journal version of a previous method, though, so here we did input the slogan as features. If you look at the action part and the reason part separately: the action is basically "go buy this," and the reason is the why, and the why is generally more challenging. Lower ranks are better here, and for the reason the ranks are worse.
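Before moving on to the videos, here is a minimal sketch of how the fused image embedding described above could be assembled: attention-weighted region features, predicted symbol probabilities, and an embedding of a generated caption, combined into one vector. The dimensions, the symbol list, and the single-head attention are my assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SYMBOLS = ["danger", "natural", "strength", "speed", "energy"]  # illustrative subset

class AdImageEmbedding(nn.Module):
    def __init__(self, region_dim=2048, cap_dim=256, embed_dim=256):
        super().__init__()
        self.attn = nn.Linear(region_dim, 1)                 # attention score per region
        self.symbol_head = nn.Linear(region_dim, len(SYMBOLS))
        self.fuse = nn.Linear(region_dim + len(SYMBOLS) + cap_dim, embed_dim)

    def forward(self, regions, caption_emb):
        # regions: (n_regions, region_dim); caption_emb: (cap_dim,)
        weights = torch.softmax(self.attn(regions).squeeze(-1), dim=0)
        pooled = (weights.unsqueeze(-1) * regions).sum(dim=0)  # attended image feature
        sym_probs = torch.sigmoid(self.symbol_head(pooled))    # p(danger), p(speed), ...
        fused = torch.cat([pooled, sym_probs, caption_emb], dim=-1)
        return F.normalize(self.fuse(fused), dim=-1)
```

This fused vector plays the role of the image embedding in the matching objective sketched earlier; the caption embedding stands in for object annotations we don't have.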
[00:15:38] Now, what happens in the videos: the first two results use images, and this last result is the same task but on our video dataset. There we have speech, which you can think of as an analog of slogans, but the speech alone is a lot less useful as an input, meaning the message of the ad is not directly encoded in what people are saying. [00:16:03] And here's a qualitative result. Again, the method just needs to select the right statement. Ours is at the bottom; above it is a standard attention-based method trained on our dataset, where the green regions are the regions it finds important. What those regions are hiding is that the person appears to be putting on lipstick, but it's actually a cigarette. This ad is meant to look like a beauty ad, and the twist in your perception when you see that it's a cigarette and not lipstick is intended. The baseline falls for this: it thinks it's a beauty ad, whereas ours selects a statement saying you should stop smoking. [00:17:04] The reason I got interested in ads is that they rely on common sense and allude to a lot of human knowledge, so the obvious next thing to try was to leverage external knowledge, and here we just mean the classic, old-school setting where you query a knowledge base, [00:17:25] a general-purpose one, so it will contain a lot of irrelevant material too. The question is how to leverage it to do better on the action-reason statement retrieval task. We have some of the same components as before: we have regions, and we have OCR to get slogans, and we use the slogans to query the knowledge base. [00:17:50] Here the slogan is "Introducing the Nike Hyperdunk," so we use the words as a query to retrieve entries from the knowledge base; the retrieved entries are the blue boxes. We get Nike the company, of course, but on the right we also get Nike the asteroid and Nike the Greek goddess, [00:18:14] which are not relevant. Our algorithm has to figure this out; it's not given this information. What it needs to do for this specific image is retrieve a statement like "I should buy Nike sneakers because they will protect my feet." [00:18:30] We represent all this information in a graph. The challenge is that not all of this knowledge is relevant, so we need to learn which pieces are. There's also an annoying issue we discovered, which is that the way the task is set up, the method can kind of cheat, or take shortcuts. If there's only a single candidate statement that refers to Nike, the model will just learn to use the word "Nike": it won't reason about the properties of the sneakers being portrayed, it will just do the easy thing, which is to match "Nike" in the [00:19:14] output to "Nike" in the slogan. But that's not going to work when it's a less common company, or when there's no entity to match between input and output.
So this inflates results, but it also doesn't generalize, and we call this a type of shortcut that the method is using. It turns out similar shortcuts exist for other datasets, not just ours. [00:19:40] To prevent overfitting to the shortcuts, we do something very similar to dropout, but at the token level: we drop words in the slogan and we drop words in the knowledge base entries, [00:20:01] to make training more robust and ultimately improve performance on the original task. As for how we represent knowledge in a graph: we want to essentially contextualize the meaning of the slogan with the pieces of knowledge. The original slogan representation is t, and the h are embeddings for the knowledge snippets we retrieve, all of them, including ones that are not relevant, [00:20:20] because we don't know in advance what's relevant. The alpha terms are learned and allow the method to choose which knowledge to use, so the contextualized slogan is roughly t plus a sum of the h weighted by the alphas. The global node is our overall representation for the image, and it receives information from the visual region proposals and from the slogans, which have previously been contextualized with the external knowledge; the betas again let the model choose what information to use. This is trained with our metric learning task, which is to match the image to the human-annotated action-reason statement. In a little more detail on the masking: we mask training data to prevent the model from relying too much on word matching or object matching. [00:21:09] The three strategies we use are to drop words from the slogans; to drop, from the knowledge base entries, the words we used as the query, "Nike" for instance; [00:21:34] or to replace entries from DBpedia, which is the knowledge base we use, with other entries altogether. For qualitative results, look at this figure first. It shows two ads with the desired action-reason statements we want to retrieve, along with the learned graphs, where we only show connections with weight over 0.05, meaning only connections that actually matter. The edges are the alphas and betas I mentioned, which choose which knowledge to use. This column is the standard way to use knowledge, which is to treat it as just an extra input; on the right is where we use masking to make the task harder at training time. The orange boxes (we should have changed the color scheme, I guess) are entries from the knowledge base, and we see many more pieces from the knowledge base being used when we apply masking at training time. So here the ad is about nature and recycling, [00:22:42] and our method looks at information about nature and about recycling and uses it to better understand the slogan. The method without masking instead learns to rely on shallow shortcuts, like matching "Chanel" in the slogan to whatever statement contains "Chanel," which may work in some cases, but in general is not reliable.
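As a rough illustration of the token-level masking idea, the three strategies might look like the following sketch. The probabilities and the exact procedure are assumptions; the paper's augmentation may differ.

```python
import random

def mask_slogan(tokens, p=0.15):
    """Randomly drop slogan words: like dropout, but on tokens."""
    return [t for t in tokens if random.random() > p]

def mask_query_words(entry_tokens, query_words):
    """Drop from a knowledge-base entry the words that retrieved it,
    e.g. remove 'Nike' so the model cannot simply string-match it."""
    query = {w.lower() for w in query_words}
    return [t for t in entry_tokens if t.lower() not in query]

def replace_entry(entries, all_entries, p=0.1):
    """Occasionally swap a retrieved entry for an unrelated one."""
    out = list(entries)
    if out and random.random() < p:
        out[random.randrange(len(out))] = random.choice(all_entries)
    return out
```

The intended effect is that word overlap between slogan, knowledge, and candidate statements stops being a reliable signal, so the model is forced to actually use the content of the knowledge.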
[00:23:11] Here we have a side task where we evaluate to what extent the method, through the alpha and beta weights, can choose the right knowledge to use, and our method with masking is much, much better. A more standard evaluation is accuracy and retrieval metrics [00:23:33] against prior methods. The bottom of the table is more of an ablation, where V is just the visual regions and K is the knowledge. In the first two columns higher is better; in the last (rank) column lower is better. If you just add knowledge naively, you get almost no benefit, which is consistent with what prior papers found. However, if you add masking, whichever of the masking strategies, the results improve, and they improve over both of the previous [00:24:13] rows, meaning that the knowledge actually becomes useful. Thinking of slogans again, let me make explicit what I was saying about the relation between image and text being more abstract than in problems we've looked at before. In captioning, the same objects are mentioned in the caption and [00:24:39] shown in the image, so you can essentially think of the image and text as redundant, in the sense that the same message is transmitted independently across both channels. In visual question answering, you ask "who's wearing glasses?" and the answer is "the man"; the questions are fairly literal too. [00:25:00] In ads, it turns out, the two channels are instead complementary: image and text are connected, but in a non-literal way. Here's a super abstract image, and you might say it doesn't even make sense to a human, but the text says "they can't afford to wait for evolution," the ad is about noise pollution in the ocean, and it shows a dolphin that has evolved headphones. [00:25:29] The point is, the text uses the word "evolution" and the image shows a dolphin with headphones, which is a very, very indirect connection; maybe if I just showed you the slogan and the objects, you could figure out the connection. But it still makes sense, and this image and text did appear together, so they do have some connection; it's just a different type of connection. So in this paper we try to understand what kinds of images and text are [00:26:00] parallel, co-occurring and actually saying the same thing (here you have a bottle of clean water and the text is about drinking water), versus image and text that co-occur, in the sense of being the slogan and image of the same ad, but are not parallel. There are three ways of being non-parallel that we came up with from our dataset. Starting with this one: the modalities may be ambiguous without each other. If you see a
Coke or Pepsi bottle squished between other things like a sandwich, it doesn't really make sense until you read the slogan; one modality is ambiguous without the other. The image can also be purely decorative. Or, even more interesting but more challenging, the image and text may appear to say opposite things, to create some sort of deeper meaning [00:26:52] (I won't get into this; it's just ad designers being super creative). We also tried to see what relations between symbolic concepts and more literal, recognizable, objective concepts we can find through the metric learning we did, because we projected both [00:27:14] object words and the more abstract words from our action-reason statements into the same space, and we looked at neighborhoods in that space. We found that if you take a symbolic concept like "comfort," which you might find in our reason statements ("you should buy this because you'll be comfortable"), then its neighbors in this learned space, where we project a whole bunch of things together, are objects like pillow, bed, blanket, couch, which makes sense: [00:27:50] if you wanted to demonstrate being comfortable, these are the objects you would use, and similarly for demonstrating being cool. And then you have two objects that both derive from "tomato," but one of them correlates with the concept "delicious" and the other correlates with a different concept, in the sense that these are their neighbors [00:28:10] in the joint space we learn. We also discovered associations between image regions and text. We got these associations purely from the action-reason retrieval training: we didn't have any annotation for ketchup or Pepsi or lipstick; it's just that these words were mentioned in the action-reason statements, [00:28:34] along with a whole bunch of other words. For very concrete concepts like Oreos and ketchup, we can actually localize them; the green boxes are the regions that best correspond to the concept. So we were able to localize them without trying to, and without any supervision [00:28:54] for Oreos or for ketchup. A lot of that inspired the following approach, which has nothing to do with ads and goes back to something much more mainstream, object detection. In this paper we try to learn object detection models from weak supervision in the form of captions. Whether captions are weak supervision or not is maybe a little controversial; our point is that people generally provide text when they upload photos, this text tends to have the form of a caption, and it will contain a lot of words that are not really about the objects in the image. [00:29:32] Here we actually use captions that are crowd-sourced and written to be clean descriptions of images, but what I'm interested in now is extending this to even more naturally occurring, noisy captions. Work on using image-level labels for localization obviously exists; we're trying to make [00:29:59] the supervision even weaker, in the sense that it's not object labels but captions, which may or may not mention all the objects, and which also mention things that are not objects and should be ignored.
So we want to go from captions to having localization ability at test time. The way we do this is a little unsatisfactory to me, in that we do use a small amount of supervision, but it's very small, and there are ways to reduce it further. The point is, we learn a text classifier that, given a caption, produces the object labels for that caption. We get these label examples from an image dataset, but in principle they don't have to come from [00:30:54] an actual image sample at all. The challenge, again, is that the caption won't mention everything that's relevant. You have this image with a person in a tie, a bunch of other persons, and a bottle, and the captions won't mention all of these: people write "a man delivering a lecture" or "wearing a suit," and the tie is never mentioned. [00:31:22] A few others have looked at this captions-as-supervision setting before, but they haven't proposed how to use it for object detection. Our point is that "tie" cannot be extracted through lexical matching, because it doesn't appear anywhere in the captions, but a human can figure out that if a person is delivering a lecture at a conference, they're probably wearing a tie. [00:31:50] So we can learn this mapping from captions to labels. This figure shows different ways of doing that. The text classifier is the approach I just mentioned. Alternatively, you could use some sort of word-embedding similarity learned for your dataset, or [00:32:11] you could do exact matching, meaning that if the caption mentions "tie," you take "tie" as a label for the image. You could also rely on a hand-crafted synonym dictionary, because if the caption says "man" and the label is "person," you still want to count that as a match. [00:32:30] This figure shows that it doesn't matter much whether we use 100 percent of our data to train the text classifier or just 5 percent: recall, in terms of the labels we can retrieve, barely changes. That figure was evaluated on whether we get the right pseudo labels at the image level; this table is actual detection performance using those pseudo labels. [00:32:56] The top rows use ground-truth labels at the image level, so they are upper bounds, and this is on the PASCAL VOC 2007 test set; there's also a model that uses COCO images with true labels, and we're trying to match these upper bounds with our noisy strategy. The [00:33:19] second block of results covers the different ways of mapping captions to pseudo labels, and our text classifier is the best of them, reaching 36.3 mAP, which comes fairly close to the upper bounds and beats exact matching and the hand-crafted dictionary. [00:33:53] And the benefit of the text classifier over the other ways of generating pseudo labels actually transfers to other datasets, that is, datasets whose data is different from the data we used to train the text classifier, which we were encouraged by.
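In spirit, the caption-to-pseudo-label step is a multi-label text classifier. Here is a minimal sketch of the idea; the architecture, names, and threshold are assumptions rather than the paper's actual model.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 80  # e.g. the detector's label set

class CaptionToLabels(nn.Module):
    """Reads a caption, predicts which object classes are present, even when
    they are never mentioned literally ('man delivering a lecture' -> tie)."""
    def __init__(self, vocab_size=30000, dim=300):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)   # mean of word embeddings
        self.cls = nn.Linear(dim, NUM_CLASSES)

    def forward(self, token_ids, offsets):
        return self.cls(self.emb(token_ids, offsets))  # one logit per class

model = CaptionToLabels()
loss_fn = nn.BCEWithLogitsLoss()  # multi-label objective on the small labeled set

def pseudo_labels(model, token_ids, offsets, threshold=0.5):
    """Classes predicted above threshold become image-level pseudo labels
    for training a weakly supervised detector."""
    with torch.no_grad():
        probs = torch.sigmoid(model(token_ids, offsets))
    return [row.nonzero(as_tuple=True)[0].tolist() for row in (probs > threshold)]
```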
[00:34:19] That was basically going from ads, which have these essentially caption-like pieces of information, to applying whatever we discovered in the ad space to this more mainstream, basic setting. And this is something I'm interested in generally: how to learn detection models from noisy language supervision. [00:34:41] Related to that, we started looking, now back to the slightly more unusual material, at political articles, [00:35:01] and how they construct narratives. The motivation is very similar to our ads work, but political articles have a lot of text, and the images are not going to be as creative, because they're mostly naturally occurring images; still, they will carry some sort of spin. The first thing we did was try to infer political bias from the images alone. The intuition, and there is research supporting this, is that there are values associated with the left and values associated with the right; we have this very silly figure that tries to show the intuition, but obviously things are more complex than that. [00:35:33] So we want to predict the political leaning of an image from the image itself: basically, predict whether an image appeared in a left-leaning or a right-leaning source. Our noisy supervision at training time [00:36:07] is to inherit the label of the source and apply it to all images that appeared in that news source. At test time we do have human labels, because we don't want to be trying to classify the political bias of, say, a cat; a lot of images simply don't carry bias. So we have a weakly supervised setting with images inside their lengthy articles. For training we have weak labels, the bias of the media source, which is much easier to get because a media source gets just one label, and we take these labels from [00:36:30] a website called Media Bias/Fact Check (there are other websites like it). We spider these media sources for articles about controversial topics, whatever was controversial at the time we did the work, [00:36:54] and we get about a million images for training, together with their associated text. For testing, we have about a thousand images that were clearly labeled left or right, meaning labeled as having a bias by a majority of annotators. Methodologically, we have two stages. As I said, we don't want to use the text at test time, but we do use it at training time as a privileged modality, [00:37:18] and the reason is that the task is easier, maybe not much easier, but easier, from text than from images, in the sense that bias is more obvious in text. So we first learn a document embedding on the articles; this is just a way to represent the text, and it's actually just doc2vec, [00:37:41] trained on our dataset, which you can think of as an unsupervised step.
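Since stage one's text representation is plain doc2vec, a minimal sketch with gensim might look like the following; the hyperparameters and preprocessing here are placeholder guesses, not the values used in the work.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Assumed: `articles` is a list of raw article strings crawled from the sources.
articles = ["first article text ...", "second article text ..."]
corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(articles)]

# Unsupervised document embedding (stage 1 of the pipeline).
model = Doc2Vec(vector_size=300, window=5, min_count=5, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Feature vector for a new article, to be fused with ResNet image features.
doc_vec = model.infer_vector("new article text".lower().split())
```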
We then fuse image features from a ResNet with the doc2vec features of the text and train a bias classifier on the combination; this is really just a way to learn image features that are more relevant to the bias prediction task. [00:38:02] In stage two, we keep the features of that ResNet, which are now appropriate for the task, we get rid of the text input, and we train just the last layer, which is basically a classifier, so that it uses the features tuned for bias classification to classify left versus right from the image alone. [00:38:31] The quantitative results are not so interesting to me. What I think is interesting is a way to tangibly demonstrate bias on something fairly fixed, where the visual variability is controlled: the faces of well-known politicians and how they appear in left- versus right-leaning media. So we have a face dataset with pictures of Trump, who at the time was, I guess, the most relevant, plus Clinton and Obama, and for these we have a lot of data: Trump in left-leaning sources, Trump in right-leaning sources. We want to see how these differ, and to do that we used a [00:39:14] generative model we developed in prior work, which models left and right using distributions over facial attributes predicted from a separate dataset, so we don't require any extra annotations. The method from that prior paper gives blurry results, which is fine; the point is to see how the results get modified. [00:39:56] These are the original images and these are the reconstructions, which again are blurry, but that's not what we focus on. Now we modify all of these images: we push them very far to the left and very far to the right. I say "very" because we exaggerate the shift by applying a scale factor, just to make the pattern more obvious. If you look at samples of Trump pushed to the left, even where he was smiling he starts looking angry, whereas Clinton and Obama generally become happier. [00:40:22] Obama, interestingly, is looking down here, and on the left he starts looking up; this is something our model learned, that looking up and smiling make for a more favorable portrayal. Now if you do the reverse and push Trump to the right, he starts smiling in a way that I think looks very unrealistic, but this is presumably what a very, very positive portrayal of Trump looks like on the far right. [00:40:51] Clinton, where she was smiling, starts looking very angry, and Obama, similarly, where he was looking up, now starts looking down; but unlike Clinton, at the opposite end of the spectrum, Obama doesn't look angry in any of these photographs.
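The "push far left, far right" manipulation amounts to moving a face's attribute vector along the axis separating the two source distributions, with a scale factor to exaggerate the pattern. Here is a minimal sketch of that idea under my own assumptions about the formulation; the actual model operates through a generative network and learned distributions.

```python
import numpy as np

def push(attrs, mu_left, mu_right, direction="left", scale=2.0):
    """Shift predicted facial attributes along the left-right axis.
    scale > 1 exaggerates the shift to make the pattern visible."""
    axis = mu_left - mu_right          # from right-leaning mean to left-leaning mean
    if direction == "right":
        axis = -axis
    return attrs + scale * axis        # shifted attributes, fed to the generator

# Placeholder numbers: mean attribute vectors (e.g. smiling, angry, looking-up)
# estimated from one politician's images in left- vs right-leaning sources.
mu_left = np.array([0.8, 0.1, 0.6])
mu_right = np.array([0.3, 0.5, 0.2])
face = np.array([0.5, 0.3, 0.4])
far_left_version = push(face, mu_left, mu_right, direction="left")
```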
[00:41:18] Quantitatively, back on the bias prediction task, we break our results down by category and compare against a bunch of baselines, but this is not all that interesting, so I'll move on to the next task, which is retrieval. Now we look at the text a little more explicitly, the text we have in these political articles, in our politics dataset, and we try to do something that in principle is classic, cross-modal retrieval, but in a setting that is much more challenging. [00:41:55] Most prior work relies on COCO where, again, images and texts are essentially redundant. That's not to say it's not a difficult task; it's just a different type of task. Here, instead, you have an image of a protest, and the text, which as a human you could probably still match to the image, has a much more indirect relationship to it: this one says that for immigrants the administration has become almost everything they feared, this one is about guns, this one [00:42:23] is about voting rights. We want to be able to do cross-modal retrieval in this space. Most prior work assumes images and text are tightly aligned, but text in real-world media is not so well aligned: the image might visualize just one aspect of the text, or vice versa. Our politics dataset is much more visually diverse, and much noisier, because we harvested it from the web; and, something we don't quite leverage here but which poses another challenge, the article is often much longer than a caption. What we're really interested in is how the two modalities are complementary. Here is an admittedly exaggerated, complex example: you have this image, [00:43:11] and the authors of the article chose it as a form of visual rhetoric, so the modalities are complementary. This image of the Greek goddess of justice can be paired with any of these texts: it can be about the Supreme Court, about whether same-sex couples can marry, about the future of abortion, about a grand jury's decision not to indict an officer illustrating both the power and the limits of the Black Lives Matter movement, or about the Supreme Court giving the go-ahead on the border wall. All of these can appear with this image. [00:43:50] On the other hand, this particular text was paired with very visually diverse images: these are all about the border wall (here an ICE officer, here an immigrant child approaching the wall), all compatible with the text, but visually very diverse. Our goal is essentially to make the visual representations aware of semantics beyond what a plain triplet loss can give you. In these illustrations, texts are denoted by circles and images by squares, [00:44:26] x and y are an image and a text that appear in the same article, and the color denotes which article something appeared in. Now take the green and blue images: they appear with texts that are semantically related, so here is x_i with y_i, and x_j with y_j.
If you train with a standard triplet loss, these two images are not necessarily going to end up as neighbors. What we're saying is that if y_i and y_j are semantically related in the original doc2vec space, then in the learned cross-modal space they don't have to be identical, but the original semantic similarity is something we want to preserve. [00:45:13] Existing methods like the triplet loss, and other, stronger retrieval losses that we tried, don't necessarily preserve semantic similarity: they pull each x and its paired y closer, but they might do that by pulling the two y's apart in opposite directions. So two texts that are semantically similar may actually be stretched apart, and their images may undesirably remain far apart. [00:45:45] What we do is still use a triplet loss, but we also apply it within the two modalities, within the text modality and within the image modality, so there is a special kind of triplet of things that we bring together. In a little more detail, these are four pieces of text, denoted by the triangles here, whose images should be brought close. Our point is that we first learn a model of text semantics [00:46:15] with doc2vec, and then keep things that are semantic neighbors close in the new space, and this applies to both the texts and the images: texts that were originally close are enforced to be close in the cross-modal space, and so are the images that appeared with them. Those images, even though they are visually distant, will be pulled together because they appear with texts that are close. [00:46:47] We compare against a pretty strong combination of existing metric learning losses, including more vision-and-language-specific ones, and we evaluate on four datasets, two that are generally well aligned and two that are generally not: Politics and GoodNews, which are both datasets of articles. (Conceptual Captions is less aligned than COCO, but still fairly literal.) We have a standard retrieval task, [00:47:11] evaluating the rank at which we retrieve the right item, again in a multiple-choice fashion, and we outperform most of these methods. More importantly, look at the kinds of images that were not neighbors but were brought closer, [00:47:39] the pairs where our method had the most impact, in the sense of the smallest ratio between the distance between the two images in our space and their distance in the basic triplet-loss space. These are obviously visually very different, but they're related: they're both about the EU and about justice.
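To make this concrete, here is a minimal sketch of combining the usual cross-modal triplet term with within-modality triplet terms that preserve doc2vec neighborhoods. This is my paraphrase of the mechanism as described in the talk; the neighbor selection, negatives, and weighting are assumptions.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)

def semantic_triplet_loss(img_emb, txt_emb, doc2vec_sim, neg_idx):
    """img_emb, txt_emb: (B, D), row i is a paired image/text from one article.
    doc2vec_sim: (B, B) similarity of the texts in the original doc2vec space.
    neg_idx: (B,) indices of (assumed randomly sampled) in-batch negatives."""
    B = img_emb.size(0)

    # For each anchor, take as the semantic positive the in-batch item whose
    # text is the nearest doc2vec neighbor (excluding the anchor itself).
    sim = doc2vec_sim - 1e9 * torch.eye(B, device=doc2vec_sim.device)
    nbr = sim.argmax(dim=1)

    # Standard cross-modal term: each image close to its own text.
    cross = triplet(img_emb, txt_emb, txt_emb[neg_idx])

    # Within-text term: texts that were doc2vec neighbors stay close.
    within_txt = triplet(txt_emb, txt_emb[nbr], txt_emb[neg_idx])

    # Within-image term: images whose texts were neighbors are pulled together,
    # even when they are visually dissimilar.
    within_img = triplet(img_emb, img_emb[nbr], img_emb[neg_idx])

    return cross + within_txt + within_img
```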
Something we did a little more recently is to come up with ways to weight samples by how abstract they are, in other words, how much they should be prioritized in the overall training. The way we rate a pair is related to what we did in the previous part: we look at the text's nearest neighbors, we take the images corresponding to those neighboring texts, and we measure how visually diverse those images are. So you have this image of children with Israeli flags, with text about Netanyahu being the key figure behind the present government, versus a much more literal case where you have a laptop and a description of it. [00:48:36] The corresponding visual neighbors, the images paired with the nearest-neighbor texts, are shown at the top, and for the first pair they are obviously much more visually diverse, so we call that a more abstract pair and we prioritize it. [00:48:54] This figure shows two types of image-text pairs that are rated highly by our method, under different metrics. We outperform lots of other ways of rating samples on all the datasets except COCO, where we don't expect to win, since COCO is more literal. [00:49:15] And in terms of cross-modal retrieval, given the word "justice" or "patriotism," this is what our method retrieves, which is much more relevant, and this is what the baseline retrieves. So with that, I'd like to thank my funding sources and my students, and I'm happy to take questions.

>> Thanks, Adriana. [00:49:40] Does anybody have questions? You can also put them in the chat. There's one question so far: how did you manage to get the ad pictures, did you just crawl them? [00:50:04]

>> We generated a giant list of topics that ads might be about (it's in the supplementary, a giant list my students came up with), and for each of these keywords a student basically scraped Google Images. [00:50:28] Not everything retrieved that way was actually an ad image, and we didn't want to run human verification on every single image, so we did human verification on a subset only, trained a basic ad-versus-not-ad classifier, which worked really, really well, which was kind of surprising, and then applied this classifier to clean up the data. [00:50:55] Then we had a secondary process during annotation, where if something had gone through verification but wasn't actually an ad, annotators could flag it. Thanks for the question.
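An ad-versus-not-ad filter of the kind described is essentially a binary image classifier fine-tuned on the human-verified subset. Here is a minimal sketch of such a filter; the backbone, threshold, and training details are assumptions, not the actual cleaning pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

# Fine-tune a standard CNN to separate ads from non-ads.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)   # classes: not-ad (0), ad (1)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One step on the human-verified subset (labels: 1 = ad, 0 = not ad)."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

def keep_ads(images, threshold=0.9):
    """Apply the trained filter: keep scraped images confidently called ads."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=1)[:, 1]
    return images[probs > threshold]
```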
>> [00:51:24] Any other questions? Well, I was curious about one piece: you were talking about collecting annotations for the signifiers and the signified, for the symbolism that you use. Did you experiment with trying to learn that from external knowledge, or to mine it in some way?

>> I didn't catch the first part, but yes, I think we did try something. The basic thing we have is just a straight-up, standardly trained classifier. I think we did play a little with using external knowledge for the symbols, but as part of a separate task. [00:51:56] And actually, on the symbols, just as I was talking to someone this morning I got really excited about that idea, because when we did this work there wasn't as much interest in common sense, but now there is, and I think there are ways we can try to use either [00:52:10] transformer-style retrieval as a form of external knowledge, or actual knowledge bases. So we'll probably revisit that.

>> You could even treat whatever annotations you have as just a test set and mine the rest from existing resources. Some sources are [00:52:34] particularly good for identifying what symbols and values people use in politics, things of that sort.

>> Yeah, and I'm working with someone who had earlier looked at metaphors in language, and I think whatever serves as a metaphor in language could be a good candidate for being a symbol here. [00:53:05]

>> Harsh has a question: given models pretrained on data like Conceptual Captions, do you think there's any hope of using the context that may be present there?

>> Yeah, I'm sure it would definitely help; we haven't played with that much yet. [00:53:25] We have standard forms of pretraining, like ImageNet, and that does something sensible, and some of the embeddings we use are cross-modally pretrained, but not the kind of large multimodal models you're talking about, so we haven't experimented with that yet. [00:53:48]

>> There's another question: can you explain how the DBpedia knowledge is gathered?

>> Right, so we just want to [00:54:16] get a little bit of context, like what is Nike, what is nature, what is recycling, so we take whatever comment exists in DBpedia about the concept [00:54:34] and use it alongside the other features in our model. There are other knowledge bases we could use that are a little more structured, not just a comment but something more like a graph; this is just what we played with at the time.

>> I see another question: concerning the parallel and non-parallel relationship between image and text, how many did you label, and did the algorithms manage to learn what is and isn't parallel?

>> Thanks for the question. We didn't label the whole dataset; we labeled about a thousand images as parallel or non-parallel. We had a lot of trouble: even though what we were going for was clear to us, it was really hard to convey to the annotators. We thought maybe next time we'll hire English majors, or undergrads who would understand us a little better. It's really not the annotators' fault; it's kind of our fault for failing to explain the task well, even though we tried.
[00:55:17] But we do have some annotations, and for some of them there is reasonable human agreement. We tried a number of methods to predict parallelism automatically, and some work better than others. For example, we looked at whether, if you extract the top objects in the image and some number of top words from the slogan, there is a lot of overlap between the two. We also looked at things like memorability on the vision side as a proxy for non-parallelism, and at measures of concreteness in the language space as a measure of non-parallelism. [00:56:00] But nothing works really great; the things we thought would work well didn't.

>> Were you using voice recognition for the videos?

>> Yes, we did use speech recognition. When I talked about speech, we actually ran speech recognition on the video ads to get representations of what's being talked about, but we can probably do better than what we've done so far. [00:56:24]

>> On the far-left and far-right manipulation: could this be applied to something in between, and how many pictures do you need for an accurate model?

>> You can probably use it for things in between; our model actually lets you set how much to saturate, so rather than exaggerating, you can show how much of a difference there really is in the dataset itself. [00:56:45] As for how many images we need: for the generative model we trained, no matter how we split the data, we kept it at about 60,000 images, but in terms of faces it's even less, something on the order of tens of thousands, maybe 20,000 or 30,000, [00:57:04] which is why we had to use a method that gives us blurry results.

>> I have one more question, from Harsh. You mentioned that captions may not mention every object in an image. Is it possible that the pseudo labels miss objects even when they are mentioned in the captions, and did you observe this behavior in your evaluation, say for images of very busy scenes? [00:57:45]

>> I think we don't have results measuring that separately, but it is something I'm interested in and would probably pursue; I don't think we had such a breakdown. [00:58:05] Probably, if you look at the accuracy for different object categories, there are some that are less likely to be mentioned, so I could look at that and try to find a pattern, but it's not something where I have a conclusion ready for you. It's actually a really good question, so maybe I'll have an answer the next time someone asks. [00:58:31] Thanks for the question.

>> We're about out of time, but I guess there's one more question I can take: could you
briefly mention what is being considered as weak supervision in one of the projects?

>> Yeah, I didn't properly define weak supervision. I just mean that I don't want to have an annotation for everything I do: I don't want left-or-right labels for all the images, and I don't want object labels at the bounding-box level, so it's easiest to define weak supervision in those settings. [00:59:10] There's also the knowledge base setting, which is a little more relevant to this question: we work with a general-purpose knowledge base, so there's going to be a lot of noise, and a lot of what is retrieved is not useful for our task. [00:59:31] But yeah, it's not a very rigorous definition of weak supervision.

>> Right, I think we're out of time. Thank you, Adriana, for giving us this very interesting talk, thanks everyone for asking questions and tuning in, and have a great week.