[00:00:15] >> [Introductory remarks from the host.] Thanks everyone for coming to my talk. I'm a PhD candidate at Cornell University, and I work at the intersection of natural language and machine learning. What I'm really interested in is designing natural language agents which can do interesting things for us, and today I'll be talking about situated natural language understanding. The goal here is to do natural language understanding in context, and these are some problems where that setting arises. On the left we see these popular voice assistant applications. [00:01:13] Is it turned on? OK. So you see these voice assistant applications on the left, like the Google voice assistant. In the center you see these robots which have to follow natural language instructions — in this case, make some dessert. And on the right you see these booking-assistant-style agents, which have to interact with a human and book a flight. These are very different applications, but they all have a natural language interface, and the beauty of natural language is that it allows non-expert users to provide a broad range of objectives, which makes it a natural fit for these applications. So let's look at these applications in more detail.
[00:01:57] For the voice assistant applications on the left, by context I mean something like the settings of the phone — whether it has an alarm set, for example — and an action could be to cancel an alarm, to play music, or to tell me the temperature. For the robots following instructions, the context is the physical space in which they are situated, and the actions would be the values that you give to the controller. For the booking assistant application, the context could be the prior history of the conversation, and an action could be to generate a response asking for clarification, to print the itinerary, or to book the flight. Now, as I said, these are very different applications, but as you can see they have a very similar interface and a very similar set of challenges, which involve understanding the semantics, reasoning about the context, and then taking appropriate actions. [00:02:47] I will mostly be discussing my work on the problem in the center, which is called instruction following — robot instruction following, or agent instruction following — and we will look at these challenges in more detail. But before that, I'll give a brief background of my research, just to situate where I'm coming from. A lot of my research, as was mentioned, has been on instruction following, and this includes some very early work as well as more recent work using end-to-end deep learning methods for this problem; we'll be looking at some of this work today. [00:03:23] The other problem which I'm very interested in is question answering.
Or really a more general problem called semantic parsing, where you take natural language text and map it to a formal representation, which you can then use for question answering or something else; some of my work has been on designing semantic parsing systems for question answering. Now, both question answering and instruction following have a very similar nature, where you perform a sequence of actions and you get some kind of feedback which is often delayed, and a learning framework which accurately captures this problem is reinforcement learning. People have been using it a lot for [00:04:01] playing games, but also for more realistic tasks like question answering, and some of my work has been on understanding what these algorithms really do — can we prove guarantees, and under what conditions can we say that they will solve the problem accurately? I won't be discussing this work, but if you're interested, I'll be delighted to talk to you about it after my talk. Finally, I really enjoy doing open source research — creating research software and releasing it. It's a great way to give back to the community, and it also helps with fast reproducibility, so we're working on some software in our lab, including simulators and a learning framework; we'll briefly look at it today in the context of the problem I mentioned. That problem is instruction following. In this case there is an agent — or a robot, or you can use your favorite metaphor for it — situated in a physical or a simulated environment. It can observe the world in the form of RGB images, or some LIDAR or other form of data, and it is given an instruction and has to map this instruction to actions in the real world. It has to do so while reasoning about what it can do in the real world, but also while following the instruction, and this cartoon kind of accurately shows what the problem is about.
[00:05:19] Previous work, at least until the last few years, has mostly focused on approaching the problem by building elaborate pipelines. These pipelines have models which are tailored to specific aspects of the problem: as I mentioned, you have to reason about the context and the instruction and generate actions, so people have, for example, built pipelines which have a perception module that takes the image and produces some form of information about what objects are there, a parser that takes the instruction and outputs some form of representation, and a planning module that takes all this information and generates actions. These pipeline methods are really popular in fields like robotics and have been used for actual tasks like manipulating forklift robots. But the problem with this approach is that it requires a lot of engineering effort in designing and maintaining them. For example, if you build this pipeline, you have to design what the intermediate representations will be — what the interface between the parser and the planner will look like — so you have to build things like ontologies or some form of structured representation, and often these representations are very hard to scale. For example, if you now want to use this pipeline for a different robot, or if the problem changes to a different environment, a lot of this work has to be redone. This makes it hard to scale the method [00:06:44] to new problems. And here is some related work.
There was some very original work using rule-based methods, like the work by Terry Winograd in the early 1970s. This work is very interesting because, if you read about it, it was very promising — people felt that it showed AI was quite achievable and would probably be solved in the next few years — but then people gradually realized that these kinds of systems are hard to scale, so eventually people started working on more learning-based approaches. This includes a lot of work; one example I really like is the work from Stephanie Tellex, then at MIT and now a professor at Brown: she designed a real system which was deployed on a forklift robot and moved stuff around in an open field. But as I mentioned, a lot of this work has been building these pipelines, or assuming that a part of the problem — say, the computer vision problem — has been solved, and these methods are expensive in terms of engineering and, as I mentioned, hard to scale. So what we propose is what we call the direct mapping approach: instead of building an elaborate pipeline, we build a single model which will reason about the raw data — the observations — and the instruction, and generate actions. This single model, we hope, will be able to do all the reasoning that these elaborate pipelines are doing, and it has to do it while reasoning directly with the raw data, for example RGB images. So let's be more concrete with a specific task. This task was introduced by Yonatan Bisk and others and contains fairly challenging instructions for moving blocks on a map. The instruction here is "Put the Toyota block in the same row as the [other] block, in the first open space to the right of that block." [00:08:40] And in order to do this task, this is what the agent has to do.
The agent observes the world as RGB images and has to find which block it has to move, and then generate a sequence of actions — so here it is moving this block around to the destination shown in the red square. (The red square is for our visualization; it's not something the agent can see.) We can view this as taking a sequence of actions: we have the instruction, then there is the initial layout of the map, and the agent has to generate a sequence of actions, which involves selecting one of the blocks — for example the block with the Toyota logo — and moving it a step in the north, south, east, or west direction. Finally, the agent has to indicate that it has completed the task with a stop action; this is the final action it takes. This is actually kind of important, because you want to design agents that don't just accidentally complete the task but actually know whether they have completed it. This is different from, if you are familiar with them, arcade game environments like Mario Brothers, where if you just happen to reach the house, the game is over. So this, I think, is a very important part of the problem that people have overlooked. Putting these actions together: there are about 20 blocks and 4 directions, so 80 actions of this type, plus the special stop action, which gives you 81 possible actions you can take at any given time step. This is a fairly large action space, which complicates the problem. So let's look at the challenges in solving this problem. The agent is given an image of the current environment and the instruction, and it has to understand that the phrase "Toyota block" refers to these pixels over here, this other phrase refers to these pixels here, the phrase "in the same row as" refers to anything which is inside this rectangular grid, and the phrase "in the first open space to the right of" refers to anything which is inside this red square.
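As a quick illustration of the action space just described, the 81 actions can be enumerated directly. The block count and direction names follow the talk; the encoding itself is my own, for exposition:

```python
# Enumerate the action space: one move action per (block, direction)
# pair, plus the special STOP action. The encoding is illustrative.
BLOCKS = range(20)                      # ~20 blocks, each with a company logo
DIRECTIONS = ["north", "south", "east", "west"]

ACTIONS = [(b, d) for b in BLOCKS for d in DIRECTIONS] + ["STOP"]

print(len(ACTIONS))   # 20 * 4 + 1 = 81 actions available at every time step
```

The agent picks one of these 81 options at each step, which is what makes the decision space comparatively large for a learned model.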
[00:10:44] Combining all of this, it has to understand that the Toyota block should be moved over there, and then finally generate a sequence of actions — and in this case it has to solve some planning problems as well, for example moving the block around these obstacles. [00:11:12] Yes — so actually the original paper does exactly that, but if you just give it the goal state, then you're assuming that there is a planner which will then take the goal state and achieve it. In this paper we were trying to do everything together, using a single model. Of course you could make several assumptions here — for example, you only ever move a single block on this map, so you could say: [00:11:37] why don't we directly predict which block to move and then figure out how to move it? But we wanted to study how much we can really do with a single model, without baking in assumptions — because, let's be honest, this task in itself is not that interesting; it's mostly interesting as a testbed for developing models that we can then apply to more realistic tasks, and we will look at more realistic tasks as we move forward. The challenges we discussed can be put in four categories. First, there is the challenge of understanding complex phrases like "in the same row" or "to the right of". Then you also need to do visual understanding: you have to understand what kinds of objects are there and what the relations between these objects are. [00:12:18] Finally, you need to move blocks around obstacles, which involves some planning challenges. And most importantly, the way I see it, the biggest challenge is learning: here we are generalizing to new instructions and images at test time, and this is different from the standard reinforcement learning setting, where you train on a game and then test on the same game. So there's a generalization challenge. [00:12:44] [Inaudible audience question.]
So, about the optimization: the agent has to take a sequence of actions to complete the task while doing all the reasoning, similar to what we just discussed. And I do want to point out, as I mentioned, that this generalization part is quite tricky, because we are generalizing along two different axes, and finally we have to do it using a very small dataset; why that is important we will see when we discuss learning later. [00:13:43] So at test time it will be a new instruction — a new sequence of words, which could include unseen words as well. Correct, yeah. And it is a completely deterministic setting — right, yeah — there is no noise here; this is a fairly simple setting, although there are interesting combinatorial possibilities, because oftentimes you have to look at multiple blocks to find a location, and that can be challenging in a different way. So this problem is actually deceptively tricky in that respect. Now, remember the agent has to take a sequence of actions, so at a given time step, what information can the agent access to make the next decision? We will call that information the agent context. At a given time step the agent of course has access to the instruction; it has access to the current and the previously observed images; and we will assume that the agent memorizes the actions it has taken in the past. We put all of this in one category, call it the agent context, and give it the symbol s̃ (s-tilde). This is all the information the agent has to make its decision — there is no access to any structured environment representation here. Once we have defined this agent context, we use it to generate actions using another concept called the policy. The policy is this box which takes the context and outputs a probability distribution over the action space — so here the second action has a probability of 0.2 and corresponds to moving block number 10, which is the block with this logo, a single step in the south direction. We give the symbol π to the policy, and by π(a | s̃) we mean the probability of taking the action a when given the context s̃. [00:15:57] Remember, our aim here is to design a model which doesn't use any expensive engineering. — You're allowed to move any block, yes, but I would say that's not necessarily following the instruction, because the instruction specifically says that you should move the Toyota block. And actually, our evaluation metric wouldn't like that — you would be penalized for doing it. So our aim here is to minimize engineering effort and not use any external knowledge base, and we design a policy model to reflect that: we use this neural network architecture. [00:16:51] Right — for now there is no reward function; that will come when we talk about learning. Some people like to view the reward function as part of the problem; in general there's a reward function and then a separate evaluation metric, and I like to view those as different things — though if you talk to an RL person, the reward function is the problem. So there is a problem-specific reward function.
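As a concrete rendering of the agent context s̃ and the policy interface π(a | s̃) just described — the field and function names here are my own, for exposition, not the paper's:

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of the agent context s-tilde: everything the agent
# may condition on at a given time step.
@dataclass
class AgentContext:
    instruction: List[str]        # the (fixed) instruction tokens
    images: List[object]          # current + previously observed RGB frames
    past_actions: List[int]       # indices of actions taken so far

def uniform_policy(ctx: AgentContext, n_actions: int = 81) -> List[float]:
    """A trivial pi(. | s-tilde): a probability distribution over 81 actions."""
    return [1.0 / n_actions] * n_actions

ctx = AgentContext(["put", "the", "toyota", "block"], [], [])
probs = uniform_policy(ctx)
assert abs(sum(probs) - 1.0) < 1e-9   # a valid probability distribution
```

Any learned policy has the same signature: context in, distribution over the 81 actions out.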
[00:17:15] This is the neural network which defines our policy. It contains three main components. There is a convolutional neural network which takes the current and the last four images and maps them to a vector representation — CNNs are a type of deep learning architecture which has been used widely in the computer vision community. Then we take the natural language instruction and map it, using a recurrent neural network with a non-linearity, to another vector representation. And finally we take the previous action of the agent and map it to another vector representation. We then concatenate these vectors, project the representation into a single space, and use that to generate a probability distribution over which block has to be moved and in which direction — or whether the stop action should be taken. [00:18:17] I think it could be useful for recovering from failure: if, for example, you just moved a block and realize it's the wrong thing, you still have access to the previous images, which could help you recover. You could definitely come up with a policy which doesn't look at that, but I think a fair bit of redundancy shouldn't be harmful. [00:18:39] Yeah — and this is, I would say, a fairly simple neural network architecture, and we're not using any expensive knowledge bases or resources here. If you look at some of the pipelines that came before, there's often a parser, then some kind of semantic representation or knowledge base; here we're going to use a very simple network, and we're going to hope that it can do all the things those pipelines were doing. And it's simple enough that people have been able to use it for other tasks, like this recent work here.
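The three-component data flow just described can be sketched end to end. This is a toy stand-in, not the paper's model: the CNN is replaced by a flatten-plus-linear map and the RNN by an averaged word embedding, so that the concatenate-project-softmax structure stays visible; all sizes, names, and the random initialization are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_ACTIONS, VOCAB = 32, 81, 100
IMG = 5 * 24 * 24 * 3               # current + last 4 frames, flattened (toy size)

W_img = 0.01 * rng.normal(size=(IMG, D))            # stand-in for the CNN
E_word = 0.01 * rng.normal(size=(VOCAB, D))         # stand-in for the instruction RNN
E_act = 0.01 * rng.normal(size=(N_ACTIONS, D))      # previous-action embedding
W_out = 0.01 * rng.normal(size=(3 * D, N_ACTIONS))  # joint projection

def policy(images, instruction_ids, prev_action):
    img_vec = images.reshape(-1) @ W_img             # visual representation
    txt_vec = E_word[instruction_ids].mean(axis=0)   # instruction representation
    act_vec = E_act[prev_action]                     # action representation
    logits = np.concatenate([img_vec, txt_vec, act_vec]) @ W_out
    p = np.exp(logits - logits.max())                # softmax over the 81 actions
    return p / p.sum()

probs = policy(rng.normal(size=(5, 24, 24, 3)), np.array([3, 17, 42]), 80)
```

The real architecture swaps the stand-ins for a CNN and an RNN, but the interface — three vectors concatenated into one distribution over actions — is the same.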
[00:19:18] So now that we have defined the policy, how do we go about training? To do that, we're going to make certain assumptions about our learning setting. We're going to assume that for each example we have access to a demonstration — a sequence of contexts and actions: [00:19:34] s̃1 is the initial context, a1 is the action that the oracle policy would take if this context were given as input, then the second context, and so on. The idea is that, because as I mentioned this is a deterministic domain, if you follow this sequence of actions you perfectly complete the task. [00:19:56] That's assumption number one. The second assumption is inspired by my work in robotics: during training, we are going to assume access to the world state, to be able to provide feedback to the agent. The world state contains information about what objects are present in the world and what their positions and configurations are. This assumption has been made in several papers, and generally the way you realize it is that you place some kind of camera which measures the positions of these objects — you're essentially training the agent in a very controlled lab setting, and then you deploy it in an open setting where that assumption cannot hold. That's the motivation behind this assumption. [00:20:42] Now, for each example, since we have already assumed access to a demonstration, a very simple thing we can do for training is supervised learning. Supervised learning takes the demonstrations and maximizes the log-likelihood of the actions in them. Mathematically, you solve this optimization, where θ are the parameters of the neural network policy, the expectation is over examples in the training data, the summation is over the steps in the demonstration, and a_t and s̃_t are the action and the context.
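The supervised objective — maximize the log-likelihood of the oracle actions along each demonstration — amounts to the following loss. Here `policy_probs` is any function returning π(· | s̃) as a list of probabilities; the names and the uniform-policy example are mine:

```python
import math

# Negative log-likelihood of oracle actions along a demonstration:
# minimizing this maximizes sum_t log pi(a_t | s_tilde_t).
def supervised_loss(policy_probs, demonstration):
    """demonstration: list of (context, oracle_action_index) pairs."""
    nll = 0.0
    for context, action in demonstration:
        nll -= math.log(policy_probs(context)[action])
    return nll / len(demonstration)

# Sanity check: a uniform policy over 81 actions has per-step loss log(81).
uniform = lambda ctx: [1.0 / 81] * 81
demo = [("s1", 5), ("s2", 80)]
loss = supervised_loss(uniform, demo)
```

In practice the gradient of this loss is computed by backpropagation through the policy network, sampling minibatches from the training demonstrations.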
[00:21:17] You solve this optimization by sampling from the training demonstrations and doing simple backpropagation. This is a very old approach — for example, Pomerleau used it for autonomous navigation in the nineties. So we could do that, but the problem is that an agent trained using supervised learning finds it hard to recover from failures. If the agent makes a mistake, what ends up happening is a mismatch between the train-time and test-time state distributions, which affects the performance; this goes by various names — [00:21:59] some people call it exposure bias, for example. Mathematically, this is the mismatch: during training you optimize this objective, where states are sampled using the perfect distribution — the oracle distribution — and the actions come from the oracle, and you maximize the log-likelihood of these actions. But during testing, the agent takes actions on its own, so now the distribution over states is induced by the agent's own policy, while you optimized under the oracle's distribution. These two distributions are different, and when distributions are different we shouldn't expect that performance on the new distribution will be good. This problem is called exposure bias, among other names. [00:22:56] For supervised learning, yes — you have an oracle demonstration, and you just maximize the likelihood of its actions. I think what you're getting at is imitation learning. You can totally do that, but you have to assume access to a model of the environment, which is actually very hard to realize in practice — I can talk about it when we discuss learning.
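An informal back-of-the-envelope view of this mismatch: if the cloned policy deviates from the demonstrated states with some small probability at each step, the chance of still being on the training state distribution decays exponentially with the horizon. The numbers ε and T below are made up for illustration:

```python
# If the policy errs with probability eps at each step, the probability of
# having stayed on the demonstrated states for all T steps is (1 - eps)^T.
eps, T = 0.05, 40
p_on_distribution = (1 - eps) ** T   # roughly 0.13 for these toy numbers
```

So even a 95%-accurate per-step policy spends most of a 40-step episode in states it was never trained on, which is exactly the failure-recovery problem described above.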
[00:23:29] The goal now is to design an agent which can learn to recover from these failures, and a very popular and powerful paradigm which has been applied to such tasks recently is reinforcement learning. In reinforcement learning, the agent samples actions according to its current policy; for every action it receives a reward, and the aim is to maximize the expected total reward, shown in this figure. The expectation is both with respect to the examples in the training data and with respect to the actions the agent takes for any single example, and the total reward is the sum of rewards it receives for any one example. That's the objective, and the idea is that by learning to explore, the agent will be able to recover from failures, so there won't be a mismatch between training and testing. But we haven't talked about the reward function yet, and this is important, because at the heart of a reinforcement learning algorithm is the reward function — that's how you encode the problem you want to solve. The problem we want to solve is to reach the goal state, and to do so in as few steps as possible. A very simple reward function which encodes this behavior is: we give a reward of plus one to the agent if it reaches the goal and stops there; we give it a small negative reward if it declares completion — stops — in the wrong state; and we give it a small negative penalty otherwise, to minimize the length of the path, because there can be multiple ways to complete the task. We can provide this reward function because we have already assumed access to the world state, so we know where the objects are, and this reward function is pretty easy to program. [00:25:20] And now, with a reward function, we can use a very standard reinforcement learning algorithm.
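The reward just described is easy to write down given the (assumed observable) world state. The constants here are illustrative stand-ins, not the paper's exact values:

```python
GOAL_REWARD = 1.0          # reaching the goal state and stopping there
WRONG_STOP_PENALTY = -1.0  # declaring completion in the wrong state
STEP_PENALTY = -0.02       # small per-step cost, favoring shorter paths

def reward(world_state, goal_state, action):
    """Sparse problem reward, computable because the world state is observed."""
    if action == "STOP":
        return GOAL_REWARD if world_state == goal_state else WRONG_STOP_PENALTY
    return STEP_PENALTY
```

Note that the only positive signal comes from stopping exactly at the goal — the sparsity that causes trouble later.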
We'll come to that later. So, a very standard reinforcement learning algorithm is policy gradient learning — it's really a family of reinforcement learning algorithms, but the way it generally works is that the agent samples actions using the current policy and generates what's called an experience, shown here. This experience contains the contexts, the actions sampled from the policy given the previous context, and the rewards the agent receives. We generate the experience and use it to optimize the reinforcement learning objective. There is a standard theorem by Richard Sutton which lets you compute the gradient of this objective — this is called policy gradient learning — and using the experience you can approximate the gradient [00:26:24] with this formula. Here we sum over the steps in the experience; π, remember, is the policy; a_t is the action; s̃_t is the agent context; and Q is a multiplicative factor which measures the reward you would receive in the long term if you take action a_t in context s̃_t. [00:26:46] You can think of this multiplicative factor as a kind of credit: how good is this action if I take it now? This credit takes into account the entire long-term effect of the action. And this completes the reinforcement learning algorithm. But the problem is that these algorithms are quite data-hungry. This has been theoretically studied — the literature has a good deal of information on this — and it has also been empirically shown in a bunch of recent work on deep reinforcement learning. [00:27:11] And this, I think, is very challenging for instruction following, and for natural language understanding in general, because remember that our context contains the instruction.
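A common concrete choice for the multiplicative factor Q in this formula is the observed return from step t onward — a REINFORCE-style estimate. In the sketch below (my own notation, γ = 1 for this episodic task), scalars stand in for the per-step ∇ log π vectors:

```python
# Q(s_t, a_t) approximated by the reward-to-go of the sampled experience.
def returns_to_go(rewards):
    out, total = [], 0.0
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]          # Q-estimate for each step t

def policy_gradient_estimate(grad_log_pi, rewards):
    """sum_t Q_t * grad log pi(a_t | s_tilde_t), with scalar stand-ins
    for the per-step gradient vectors."""
    return sum(q * g for q, g in zip(returns_to_go(rewards), grad_log_pi))
```

Each step's gradient is weighted by everything that happens afterwards, which is precisely the long-term credit described above — and also a major source of variance.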
And instructions are written by human beings, which makes it hard to scale these datasets enough to apply these reinforcement learning algorithms, and this is critical. [00:27:39] It kind of makes it hard to use vanilla reinforcement learning here. — Yes, and that's kind of critical; with these algorithms you have to optimize for several things. But in general, in spite of all the work, the problem remains that reinforcement learning algorithms have high sample complexity. So we decided that instead of working with the general reinforcement learning problem, we would work with what's called the contextual bandit problem. This problem is inspired by a bunch of older work on bandits and was introduced in this paper by John Langford and Tong Zhang in 2007. [00:28:38] In this setting you maximize only the immediate reward. You can think of a setting where a user comes to a website, your task is to show an advertisement to the user, and based on whether the user decides to click on the advertisement or not, you get some reward — and that's the whole task. It's a single-step reinforcement learning problem, and theoretically it has been shown that this problem has lower sample complexity than reinforcement learning, meaning that with fewer samples you can train a better contextual bandit agent than a reinforcement learning agent. But as I mentioned, these methods are designed for single-step decision making, whereas in our case the agent has to make a sequence of actions. [00:29:23] So what we did was adapt the contextual bandit setting for reinforcement learning: we apply an approximation, which gives us this algorithm.
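The approximation about to be described replaces the long-term credit Q(s̃_t, a_t) in the gradient estimate with the immediate reward r_t alone — a biased but lower-variance estimate. A sketch, with scalars standing in for the per-step ∇ log π vectors:

```python
# Contextual bandit approximation: weight each step's gradient by the
# immediate reward r_t instead of the long-term return Q(s_t, a_t).
def cb_policy_gradient_estimate(grad_log_pi, rewards):
    return sum(r * g for r, g in zip(rewards, grad_log_pi))

# Toy experience: each step's weight ignores all future rewards.
rewards = [-0.02, -0.02, 1.0]
cb_estimate = cb_policy_gradient_estimate([1.0, 1.0, 1.0], rewards)
```

Compared with weighting by the full return, early steps no longer receive credit from the eventual goal reward — which is exactly why this only works once the reward itself is made informative.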
This is a new algorithm we call contextual bandit policy gradient. Step one remains the same: you still sample actions according to the policy. But in the second step, when we compute the gradient, we apply an approximation: we only optimize for the immediate reward, as opposed to the Q function we had before — instead of Q, we only have the immediate reward here. This actually gives you a biased gradient, but it reduces the variance, so in that sense it gives you lower error with fewer samples. The approximation is the contextual bandit approximation: at each time step you only care about maximizing the immediate reward. But there is still a problem: the reward function we saw a few slides before is extremely sparse. For the example we considered before — here, the blue or purple line shows the demonstration, and the color on the map shows how much reward you would get if you placed the correct block there — you can see that the agent never gets a positive reward until it actually reaches the goal and solves the problem. A policy at the beginning of training is randomly initialized, so it's going to be about as good as a random walk, and the probability that a random walk reaches the goal is negligible. This makes it very hard to do reinforcement learning with these reward functions; [00:31:07] some people informally call them sparse reward functions. And this is even more problematic in our case, because we are training the agent to maximize the immediate reward only, and the immediate reward here is pretty meaningless — it just gives you minus one, or a small negative penalty. So what we did was use a technique called reward shaping to change the reward function, adding an informative shaping term, shown here with the symbol F.
We use a new reward function, which we hope is richer — not sparse — and then optimize that. Now, a few slides before, I said that the reward function defines the problem we are solving. So if I use a different reward function, how come I'm solving the same problem as before? There is a theorem — the shaping theorem — which says what form F can take such that the new problem is the same as the old one. The theorem, from a 1999 paper, is that if F has this functional form for some potential function φ, then we maintain the preference ordering over policies: if you prefer policy π1 to policy π2 with respect to the old reward function, you will prefer π1 to π2 with respect to the new reward function — the ordering over policies stays the same. [00:32:33] So we use this shaping theorem to modify our reward function from being sparse to being richer. In a nutshell, we add shaping terms which incentivize the agent to move closer to the goal state and also to follow the demonstration — we add this extra information to our reward function. I don't have time to discuss more of the details, but if there's a question I'll be happy to dive in. And that's going to be our main learning algorithm: contextual bandit approximated reinforcement learning with a shaped reward function. [00:33:15] How do you get this shaping term?
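The potential-based form from the 1999 shaping theorem (Ng, Harada and Russell) is F(s, s') = γ·φ(s') − φ(s). A toy sketch with a made-up one-dimensional potential — negative distance to the goal — showing why the induced problem is unchanged:

```python
GAMMA = 1.0

def phi(state, goal):
    return -abs(state - goal)        # toy potential: negative distance to goal

def shaped_reward(r, s, s_next, goal):
    # F(s, s') = gamma * phi(s') - phi(s): potential-based shaping, which
    # preserves the ordering over policies.
    return r + GAMMA * phi(s_next, goal) - phi(s, goal)

# Along any trajectory the shaping terms telescope: the total extra reward
# is phi(final) - phi(initial), independent of the path taken.
traj = [0, 1, 2, 3]                  # states; goal at 3
rs = [-0.02, -0.02, 1.0]            # original sparse rewards per step
total_shaped = sum(shaped_reward(r, s, s2, 3)
                   for r, (s, s2) in zip(rs, zip(traj, traj[1:])))
```

Every step toward the goal now earns positive reward immediately, while the telescoping sum guarantees no policy ordering has changed.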
OK, great — good question. The demonstration term, I would assume, is not super hard to get; what is hard to get is the first term here, which is also the more important of the two shaping terms. The way you get it is to assume access to a distance function: given two states, it computes how far apart they are. In general, getting this distance function is tricky, but for this problem we were able to learn it, using what's called metric learning: we have access to a bunch of contrastive data — two states which should be close and two states which should be far — and we do metric learning to learn a distance function. But in general this can be hard. [00:34:00] For evaluating our agent we'll need one more assumption. — You could do something like what you suggested, where you take an action and then want to be told which action at this state gives the right behavior, but then you need to know what's going to happen if you take that action. In our algorithm, you take an action and only then compute the reward; for DAgger, you essentially need to know what the state is going to look like if you take a given action. So you can do DAgger, but it's one more assumption: you need a model. [00:34:50] Right — but, as I said, we don't really have to make that assumption in general. What I was saying is that DAgger is a very strong algorithm — it gives you very strong feedback: if you're in a given state, you want to find the optimal action to take at that state.
[00:35:33] And that requires you to understand, for all possible actions, what state you would end up in if you took that action — which needs exactly a model. So you need a one-step model; you have to do model-based learning. If you're ready to make that assumption — and I think in the blocks world you could learn that — [00:35:50] then you can perfectly do that; I think that would be a very strong algorithm. But that assumption is kind of hard to satisfy once we go to more realistic problems. So, the setup here was introduced by the original paper from Yonatan Bisk; it contains a small dataset of about 12,000 training instructions and 3,000 test instructions, and during testing we are working with unseen instructions and maps. The execution error that we evaluate the agent on is the distance between where the blocks end up and their gold positions: we compute these pairwise distances and sum them — that's the evaluation metric. And these are the results. We compare our algorithm against these baselines. The "stop" baseline just takes the stop action at the beginning of the task, so it doesn't take any actual action; [00:36:42] the random walk baseline samples an action uniformly at random, and performs even worse — as we'd expect, random actions make the situation worse. The supervised learning algorithm performs better. Deep Q-learning and REINFORCE — two popular reinforcement learning algorithms, or at least popular two years ago — perform about like that. And finally, our contextual bandit algorithm outperforms these baselines. [00:37:14] You mean the — I don't remember the exact results for each one of these, of course.
I would say about 3.9. This is an example, with agents trained with the different learning algorithms. The instruction here is: move to the left of Heineken, slide up far enough to clear Heineken, slide right until almost touching Shell, slide up until evenly aligned. [00:37:52] So let's see what the agent trained with supervised learning does: it takes a first wrong action and then just completely fails to do the task. The reinforcement learning agents have a hard time; they don't take any action properly, and I have some ideas about why this is happening that I'm happy to discuss offline. And finally, the agent trained with our approach is able to do the task. [00:38:16] And this is a failure case. We did some qualitative analysis of how often that happens, and we saw that in about 90 to 92 percent of cases the instruction mostly talks about where you should eventually end up, so your comment is roughly still correct. So there's a failure case where the instruction is: slide block 18 to the right and then up so that it is directly under block 17. [00:38:53] An agent trained with our algorithm performs this up-and-down oscillating behavior, and we call it the stopping problem. One reason I believe this happens is that there is a label bias problem here: the agent gets negative feedback for stopping anywhere other than the gold state, so the agent really learns that the stopping action is the last thing it should ever do, and that's why you see this kind of behavior. And this is bad, because in real life, if you don't know how to stop, bad things happen to you. [00:39:35] Again, I'm going to skip this in the interest of time. So let's look at a different topic. Here we are going to move to something more realistic, which is 3D environments. All of us here are living in a 3D
world, and when we go to a different city and ask for directions from someone, we often have to do this spatial reasoning. So to test whether we can design agents that do this kind of reasoning, we collected a corpus [00:40:05] which has instructions for an agent navigating in an open landscape. Here the instruction is: walk towards and around the back of the ladder and take a left; walk past the pumpkin and towards the grey boulder, and stop before you reach it. The agent views the world in this first-person view, and this is a top-down view of the environment. [00:40:29] In some ways this is a more challenging dataset than the block world, because in the block world we were in this very top-down, pixelated environment where the blocks are generally to the left or right of each other, or you build a line or these diagonal arrangements. Here the instructions are much more ambiguous; for example, there is no precise stopping location, and this makes both spatial reasoning and evaluation hard. [00:40:56] The other challenge is that here we are working in a 3D environment, so problems like occlusion start to become real. For example, here we see that the statue is partially occluded by the ladder, and objects far from the agent appear smaller than those closer to the agent. This is different from the block world, where all objects are at the same scale and there is no occlusion. [00:41:18] The task still remains the same: the instruction-following agent has to take a sequence of actions and indicate completion with the stop action. The action space is different, though: the agent can move a step forward or turn left or right by 30 degrees. Remember, for the previous problem we created this concept of an agent context, which is the information the agent has in order to take actions. Here we are going to use a slightly different [00:41:50] assumption for the agent context.
As before, we still have access to the instruction and the current and previous images, but we're going to assume two more things. First, we have access to the agent's pose, which is the position and the rotation of the agent; this extra information is fairly common in robotics, and there are algorithms that give it to you. [00:42:13] The second assumption we're going to make is access to a panorama of the agent's surroundings, and the reason we do so will become clear shortly. So remember, we said we're interested in designing agents that can learn to do spatial reasoning. What's wrong with the original model we looked at in the first half of the talk? [00:42:34] The problem is that, because of the black-box nature of the neural network we saw before, we don't know whether the agent has understood the task without executing its actions in the scene, and this is in general unsafe and hard to verify. In the worst case, you don't want an agent to take actions when you don't know whether it has understood the task.
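The agent context described above can be summarized as a simple container: instruction, current and previous images, plus the two new assumptions (pose and panorama). The field names here are illustrative, not the talk's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Pose:
    x: float         # position on the ground plane
    y: float
    rotation: float  # heading, e.g. in degrees

@dataclass
class AgentContext:
    """Information the agent conditions on, per the assumptions in the talk.

    Illustrative sketch: `pose` and `panorama` are the two extra assumptions
    made for the 3D environment (common in robotics, via localization).
    """
    instruction: List[str]                       # tokenized natural language instruction
    current_image: Optional[object]              # first-person observation at this step
    previous_images: List[object] = field(default_factory=list)
    pose: Optional[Pose] = None                  # assumed available from localization
    panorama: Optional[object] = None            # assumed 360-degree view of surroundings

ctx = AgentContext(instruction=["walk", "forward"], current_image=None)
```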
[00:43:00] So how do we design agents that, even prior to starting to take actions, give you some evidence that they have understood the task? This has advantages like interpretability and safety: we can understand where a mistake is being made, and whether the agent even understood the task to begin with. We'll get this interpretability using a decomposition: we break the instruction-following task into two subtasks. First, given the panorama and the instruction, a goal prediction model generates a distribution over where the agent should go; then we use this prediction to generate actions using the pose and an action generation model. Decomposing the task this way has two advantages: one, as I mentioned, is interpretability, and the second is that because we remove the language understanding from the generation part, it makes the learning problem for action generation easier. So let's look at the first subproblem: given the panorama and the instruction, the goal prediction model generates a distribution where the pixel values show how confident the agent is that that's where the goal is located, and we frame it as an image-to-image problem conditioned on the natural language. [00:44:30] When we thought of this decomposition, we looked at the computer vision literature for models that do this kind of image-to-image conversion, and one model that we found to be really quite powerful is the U-Net model from Ronneberger et al. 2015. So it was designed for... [00:44:53] Yeah.
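The two-stage decomposition above can be sketched as control flow: language is consumed once by the goal predictor, and the action generator then works only from the goal distribution and the pose. The environment and model interfaces here are illustrative assumptions, with toy stand-ins to make the sketch runnable.

```python
def follow_instruction(instruction, env, goal_model, action_model, max_steps=40):
    """Two-stage decomposition from the talk: predict where to go, then how to get there.

    Control-flow sketch only; env/model interfaces are illustrative assumptions.
    """
    goal_dist = goal_model(env.panorama(), instruction)  # language is handled once, here
    actions = []
    for _ in range(max_steps):
        action = action_model(goal_dist, env.pose())      # no language needed at this stage
        actions.append(action)
        if action == "STOP":                              # agent must indicate completion
            break
        env.step(action)
    return actions

# Toy stand-ins showing the interface.
class ToyEnv:
    def __init__(self):
        self.steps = 0
    def panorama(self):
        return "panorama"
    def pose(self):
        return self.steps
    def step(self, action):
        self.steps += 1

trace = follow_instruction(
    "walk to the boulder", ToyEnv(),
    goal_model=lambda pano, instr: "goal-distribution",
    action_model=lambda goal, pose: "FORWARD" if pose < 3 else "STOP",
)
```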
In the same paper we have another dataset which does have walls. What we do there is take a panorama and predict the next goal, then take another panorama, so we do it multiple times. For example, if you decide that you should go to another room, you first predict the door as the goal, go through it, and then go on to the next goal. [00:45:15] No, but you have a heuristic which predicts what the intermediate points, or the decision states, look like. So here, the U-Net model takes the input image and performs a sequence of convolution operations followed by a sequence of deconvolution operations to generate an output image. The idea is that this allows you to do hierarchical reasoning: you have this stack of convolutions which can reason at different scales, and then you have the opposite operations which allow you to regenerate the image from the convolutions' output. But there's no natural language here, and hence no natural-language-conditioned reasoning, so we took this model and modified it to do language-based reasoning, and we get a new model called LingUNet. [00:46:04] The way we do it is to use an LSTM to map the natural language to a vector representation, and then use this vector representation to do text-based convolutions at every layer of the model. The idea is that because these convolutions are doing reasoning at multiple scales of the image, the model can now do text-based reasoning at those different scales. We use this LingUNet model to generate the goal distribution. [00:46:39] Then there's the action generation model, which takes the distribution and the agent's pose and generates the actions. This is a very simple architecture with a simple camera projection to map the pixels to ground locations; in the interest of time I'll skip over the details because I don't think they're [00:46:59] super interesting, and talk about the learning.
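The text-conditioned convolution idea can be sketched in NumPy: a slice of the LSTM instruction embedding is mapped to a 1x1 convolution kernel that mixes the channels of a feature map at one scale. This is a simplified sketch of the conditioning mechanism, not the paper's exact LingUNet architecture; the shapes and the linear map `W` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_conditioned_1x1_conv(features, text_slice, W):
    """Apply a 1x1 convolution whose kernel is computed from the instruction.

    features: (C_in, H, W) feature map from the image encoder.
    text_slice: the slice of the instruction embedding assigned to this layer.
    W: (C_out * C_in, len(text_slice)) linear map producing the kernel.
    Simplified sketch of LingUNet-style text conditioning.
    """
    c_in, h, w = features.shape
    kernel = (W @ text_slice).reshape(-1, c_in)   # (C_out, C_in), derived from language
    # A 1x1 convolution is just channel mixing at every spatial location.
    return np.einsum('oc,chw->ohw', kernel, features)

# Toy example: 4 input channels, 2 output channels, an 8x8 feature map.
feats = rng.standard_normal((4, 8, 8))
text = rng.standard_normal(6)                     # one slice of the LSTM embedding
W = rng.standard_normal((2 * 4, 6))
out = text_conditioned_1x1_conv(feats, text, W)
```

Applying one such text-derived kernel per convolutional scale is what lets different parts of the instruction act on different levels of the image representation.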
So now we have two different models, a goal prediction model and an action generation model, and we find that a different type of learning works better for each. For the goal prediction we use simple supervised learning to minimize the cross-entropy loss; here we don't have any kind of mismatch problem, so simple supervised learning works perfectly well. [00:47:28] For the second model, the action generation, we use the algorithm we introduced in the first half, the contextual bandit algorithm. The key thing to note is that there is no natural language here: the agent context for this model doesn't contain natural language, so we could potentially scale the data by creating synthetic input-output examples for training the action generation model. [00:47:54] This simplifies the learning, and we get pretty high accuracy for action generation. Yes? Great question. In this case, no, because we are training it specifically to go towards that goal, but in another paper, at CoRL, instead of predicting only the goal location we predict the path as well. So we do that there, but it's a different paper and I won't be talking about it here. [00:48:28] Sorry, go ahead. OK, if I understood the question correctly: how do we go from here to actually being able to reach the goal, even if we predict it perfectly? The way we do it, and I'll tell you very briefly, we can talk more after the talk, is that we do a camera projection to map the pixels to a 3D location in the real world, and then we constantly track that 3D location at every frame. That's why we use the agent's pose: given the height of the agent and its position, you can easily write down the camera projection equations.
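The pixel-to-ground camera projection mentioned above can be sketched with a flat-ground pinhole model: a ray through the pixel, from a camera at known height, is intersected with the ground plane. This is a simplified sketch under a level-camera assumption; the paper's exact equations may differ (a tilted camera needs an extra rotation).

```python
def pixel_to_ground(u, v, focal, cam_height):
    """Project a pixel onto the ground plane with a flat-ground pinhole model.

    u, v: pixel offsets from the image center (v > 0 means below the horizon).
    focal: focal length in pixels; cam_height: camera height above the ground.
    Returns (x, z): lateral offset and forward distance in the camera frame.
    Simplified sketch assuming a level camera looking at the horizon.
    """
    if v <= 0:
        raise ValueError("pixel at or above the horizon; ray never hits the ground")
    t = cam_height / v        # ray parameter where the ray meets the ground (y = 0)
    return (t * u, t * focal)

# A pixel straight below the image center maps to a point directly ahead.
x, z = pixel_to_ground(u=0.0, v=50.0, focal=100.0, cam_height=1.5)
```

Once a goal pixel is projected to (x, z), the agent can track that 3D point across frames using its pose, which is exactly why the pose assumption is made.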
[00:49:34] Yeah, so let me skip some parts and go to what's more interesting. We collected a new corpus which is slightly bigger than the block world corpus; it has about 20,000 training instructions and 4,000 for dev and test, and the evaluation metric is the distance between the agent's final position and the gold position. So we still use the stop-distance-based evaluation metric, and we found that it correlates reasonably well with human evaluation. [00:50:04] And these are the results. The stop and random walk baselines we saw before continue to perform badly. Misra 17 is the architecture and learning algorithm we introduced before; it does better, but is still quite bad. Chaplot 18 is an architecture and learning algorithm from CMU; it performs reinforcement learning using what's called generalized advantage estimation. [00:50:31] And finally, our model outperforms these baselines and has the key advantage of interpretability, so we can actually look at these distributions to see where the agent makes a mistake. So let's do that. Here you see these panoramas with the goal predictions superimposed on them. The first instruction is: fly between the palm tree and the pond. We can see the agent is quite confident that the goal is between the palm tree and the pond on the right. The second instruction is: curve around the big rock, keeping it to your left. We can see the agent has some confusion over which side of the rock it should go, but the mode of the distribution is still on the right side. This is kind of interesting, because the distributions give you more than the mode: the spread shows which places along the way are confusing the agent. [00:51:21] And these are some failure cases. In the example at the top, the agent goes to the wrong side of the object, maybe because it's confused by the light-colored body here.
And over here the instruction is just too vague, so as you can see the distribution is all over the place, which is kind of what we expect. [00:51:41] We also did some quantitative analysis in the paper; if you're interested, you can look at it. Let me skip ahead and talk about some recent work. In the same paper where we introduced this problem, we introduced an even more challenging setup of doing navigation and manipulation in a 3D house. We built this simulator using Unity 3D, and we have this agent which can do manipulation: pick up objects, move around, open drawers, and even turn on televisions and so on. This is a much harder problem because, as I mentioned, it has walls and you have to navigate between rooms, and you have instructions like "clean the table," where you have to keep moving, picking things up, and putting them in the sink. So it needs much harder language understanding and much harder planning, and it's an open challenge; if you're interested, we can talk about it. I'm going to talk about some recent work in the next couple of minutes. [00:52:43] We looked at several of these simulated environments today, and in general simulated environments have been gaining momentum and popularity in the learning community because they allow reproducibility, and it's very easy for people to just git clone something to get started. They're greatly useful, and I think we're learning new stuff using these simulated environments, but in general real life is much harder. [00:53:07] Specifically, real life has much harder control problems and much harder image understanding, so there is a gap between even the best simulated environments you have today and real life. To understand how big that gap is, we looked at these two problems in isolation.
[00:53:25] So in this recent paper at CoRL, with Valts Blukis, Ross Knepper, and Yoav Artzi, we looked at the task of mapping observations and instructions directly to velocities for the controller of a drone. The drone brings real dynamics and a fair bit of complexity; we were able to get reasonable performance, but I think we can still keep pushing it further. And in the second paper, recently at CVPR, with Howard Chen and Noah Snavely, we looked at doing navigation on a map built from Google Street View in New York, where the agent has to observe these challenging real-life images and follow even more complex natural language instructions. This is an open problem; we have some initial navigation results which are quite poor, and I think there's a lot of room to grow on these tasks. [00:54:18] And finally, we talked about a lot of these simulated environments and models today; we also built this integrated learning framework which has implementations of these models and learning algorithms. If you're interested, you can check out the repository here.
And finally, this is also my last slide; there are many open directions. I talked about instruction following today, where you give an instruction to an agent and the agent is expected to follow it, but real life has much more complexity. For example, people learn by interacting with each other, so there is dialogue in real life: you can ask for clarifications, for example if the task is not well formed, or you can ask, "did you really mean this?" if it's something completely garbled. So one direction is working more on dialogue for doing task-based understanding, and some work has been coming out on these kinds of problems. The second problem is how we evaluate these tasks. I talked about metrics like stop distance, but in general these metrics can be very poor proxies, because you can have instructions that talk about intermediate goals, or something more subtle. So how do we go about designing better evaluation metrics for these problems? I think this issue is also present in other natural language understanding problems, like machine translation, where people have shown that proxy metrics like BLEU are not very good. And then, how do we design better learning algorithms that can do, for example, more complex exploration and work with a limited amount of data? [00:55:54] And lastly, here we used a sort of reward function that we computed using the environment, but in general people give each other feedback all the time using natural language, right? "You did a great job," "you shouldn't have done this, you should have done that." So how do you take this natural language feedback, map it to a scalar reward, and use it for training agents? [00:56:14] And this could be great for assistants like Cortana. I'll stop here. Do we have time for questions?
QUESTION: So let me tell you how we went about it. In the first paper we were trying to really push how much we could do without any engineering, and you can think of that as a baseline, because you're not making any assumptions. I would assume the answer lies somewhere in the middle, but closer to the end-to-end methods. In the second paper we went slightly towards the engineering side, because we realized that these single black-box models are kind of dangerous to even deploy: what ended up happening was that we were using these models, but if they make an error, we just can't do anything about it; we don't have enough understanding of why these models make the error. So there's a safety and a debuggability issue. We went a bit to the left, I would say, towards more engineering, but if you think about it, the kind of engineering we have is scalable, so you can use it for different environments: things like goal prediction you can always do if you have image and text, so you're not really introducing ontologies. If you're familiar with these semantic-parsing-based methods, they introduce ontologies which are very hard to scale to new problems. So we're introducing some level of engineering, but not so much that it becomes very hard to scale to new problems.