[00:00:05] >> I am very excited today to host Csaba Szepesvári from the University of Alberta. Csaba is very well known in the reinforcement learning and bandit community and needs no introduction, but I'm going to briefly introduce him though. He's an AI chair and the team leader of the Foundations team at DeepMind, and he's a professor of computer science at the University of Alberta. He got his Ph.D. from József Attila University in Hungary. [00:00:34] In addition to publishing in top journals and conferences, he currently serves as an action editor for JMLR and an associate editor for Mathematics of Operations Research. He has written several books; the most recent of them is Bandit Algorithms, which you can see here, and which is, I guess, the standard textbook now on bandits; it came out just last fall. [00:00:59] He is a co-inventor of UCT, the algorithm which is a variant of Monte Carlo tree search that was used in the AlphaGo program, which defeated a top Go professional a few years ago. During the pandemic Csaba started a new virtual seminar series on reinforcement learning theory, which I think is a great service to the community. [00:01:23] In terms of the talk, if you have any questions, please type them in the Q&A box; I will monitor the box and, if need be, interrupt Csaba and ask your questions. Without any further ado, Csaba, the floor is yours. Yes, thank you for the nice introduction. [00:01:42] It's a pleasure to be there, virtually at least. I said that I'm going to talk about some work that I'm very excited about. This is in the context of reinforcement learning, and more narrowly in the context of planning in large MDPs, and it tries to investigate the limits of what you can do in this case. If you have a large MDP, then there are some lower bounds that say that, well, you need to touch every state-action pair in the MDP
if you want to compute actions of good quality, and that's too much, because most MDPs are too large. And then the question is what can you do, and that's what this talk is about, and the answer is going to be: well, you have to do the computations in some compressed form. This whole talk, of course, this work, wouldn't have been possible without a lot of people. I'm going to present papers written by [00:02:46] my student Gellért Weisz, [inaudible], other students that I happened to work with who are not my students, Philip Amortila from McGill, [inaudible], and [inaudible] and Tor Lattimore. And a large portion of that research has been inspired by [00:03:11] Simon Du, Sham Kakade, Ruosong Wang and Lin Yang, [inaudible]; their paper was really inspiring for all this work that I'm going to present today. The contents are super simple: first I'm going to talk about the what and the why, then the results, and conclusions. [00:03:38] Big questions. So this is about what I've been doing and why I've been doing it. We're trying to answer the question of why reinforcement learning is so successful, and when it is so successful. On the slide you see some pictures highlighting some of the successes of reinforcement learning algorithms in various domains: robotics, [00:04:05] or games, where it was the Atari demonstration that a single algorithm is able to learn to play a whole suite of computer games just from looking at what's happening on the screen, which I think is a pretty amazing success. And then computer Go, where again reinforcement learning-based algorithms defeated a top player a couple of years back. But we shouldn't forget about early successes like TD-Gammon, where again reinforcement learning-based algorithms were able to
[00:04:41] achieve a very, very high level of professional play early on, in the nineties. So we see all these successes; the question is why this is happening, and when can we repeat them. So next. The other question that one might ask is: OK, this is fine, maybe we are just lucky in these cases, but we want to apply RL more broadly, and the question then is how to do that: what are the limits, and how to broaden the applicability of RL. [00:05:20] Next. The next question is that these algorithms have a lot of moving parts, so to say, and the question is whether all of these are really necessary, critical for success; which parts are necessary, and, hopefully, how to simplify or improve these algorithms. [00:05:44] And so ultimately the question is what is a good algorithm; I'd argue that's the question. OK, so in terms of what we mean by a good algorithm, what it is that we are trying to achieve, it would be nice to have at least some principles that are going to govern when we call something a success, and for that I propose three things: flexibility, [00:06:17] effective algorithms, and efficient algorithms. To break these down a little bit further: we are going to say that an algorithm is flexible if it's able to take any function approximation technique and use it, whenever the function approximation technique is suitable. [00:06:45] We are not going to ask anything impossible from the algorithms; we are just asking that if you throw at the algorithm a function approximation technique, and that technique happens to be a good fit to the problem setting, let it be Go or [00:07:03] robotics or any other domain, then we expect the algorithm to do well. So that's the effectiveness of the algorithm: the result should be as good as the function approximation
technique allows it to be, and maybe there could be some extra gap [00:07:27] that comes from the fact that the algorithm has finite compute time and a finite number of queries that it submits to a simulator; we do all of this in the context of simulation optimization. And efficiency is just the usual runtime requirement: if there are some natural problem parameters, the size of the problem, like the planning horizon, the number of actions, how many weights you have in your function approximator, [00:08:02] then the runtime should be polynomial in all of these, hopefully of some low order. So this research is done in the context of Markov decision processes, which everyone in the audience has probably heard about. Just very briefly, [00:08:28] this is a setting where you have a bunch of states and a bunch of actions; you take an action, it transitions you to the next state, and you incur some reward along the way, maybe a random reward, and you're trying to take the actions in such a way that you steer the system towards the part of the state-action space where you see high rewards. You are interested in the long-term [00:08:56] reward: maybe you sum up all the rewards, maybe you apply some discounting when you're summing up the rewards, maybe you think of the average reward along the trajectories, but the important thing is that you care about the long-term effects. Because of this, this is a very interesting setting where you have to bring in planning and thinking about a larger part of the problem. So maybe today you have to take an action that puts you into a good situation, so that tomorrow you can achieve some good outcome with a large reward.
[00:09:31] So when solving MDPs, which is a classic framework for capturing sequential decision making under uncertainty, the key solution concepts are the value functions. A policy is a way of choosing the actions based on the past information, where the future may be random, and the value function of a policy just gives you how much total reward the policy is going to collect in expectation; it's a function of the initial state where the policy is started from. So you sum up all the rewards, you take the expectation, and that's the value of the policy. The optimal value is the maximum value that you can achieve from every possible state, and in general you're interested in coming up with policies that match these optimal values as closely as possible. So this is the computational problem that you have: maybe you have some way of describing the MDP and you want to compute this. [00:10:40] There are these other functions, the q-functions or action-value functions, which are very similar to the state value functions; they just say that you start at a state, you take an action, and after that you follow the policy. So q^pi is going to give you the total expected reward in this setting, and q* gives you the maximum possible values. The basic theory of MDPs says that if we want to solve an MDP, that is, to be able to find a policy which is optimal, whose value function matches the v* function, it is sufficient to figure out what this q* function is. [00:11:24] q* encodes a little bit of extra information, and the basic theory just says that it's all in q*. Thankfully.
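As a minimal sketch of the claim that "it's all in q*", the following computes q* by value iteration on a tiny MDP and reads off a greedy policy. The two-state MDP, the function name `q_value_iteration`, and the discount factor 0.5 are illustrative assumptions, not taken from the talk; enumerating states like this is only feasible for small MDPs.

```python
import numpy as np

def q_value_iteration(P, r, gamma, iters=1000):
    """Iterate q <- r + gamma * P v, where v(s) = max_a q(s, a).

    P has shape (S, A, S): P[s, a, s2] is the probability of moving to s2.
    r has shape (S, A): expected immediate reward.
    """
    q = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        v = q.max(axis=1)          # greedy state values under the current q
        q = r + gamma * (P @ v)    # one Bellman optimality backup
    return q

# Hypothetical two-state example: state 1 pays reward 1 forever,
# state 0 pays nothing, and only action 1 moves from state 0 to state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
P[1, :, 1] = 1.0   # state 1: both actions stay in state 1
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])

q_star = q_value_iteration(P, r, gamma=0.5)
greedy_policy = q_star.argmax(axis=1)   # acting greedily w.r.t. q* is optimal
```

In state 0 the immediate rewards of both actions are equal, yet the greedy policy prefers action 1, because it leads to the state with the large long-term reward.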
So these functions are functions of the states, or functions of the state-action space, and the question is how big are these MDPs, how many states do we have. To start with this robotics example on the slide: you see that it's not going to fly if you think that you can just enumerate all the states, because the state space is continuous, and it's a high-dimensional state space. If you try to discretize, you're going to face the curse of dimensionality, so simple, naive discretization techniques are not going to work. The next example is the Atari games; there you see that you have 10 to the [inaudible] states. The next example is Go, [00:12:26] with 10 to the 170 board positions, and the next example is backgammon, and there the number is a little bit smaller, so next, 10 to the 20. Yeah, so that number is a little bit lower, but it's just so big that you can't possibly fit all these functions into the memory. All right. [00:13:00] So then, how do these algorithms work? I'm going to posit that the reason these algorithms work, by and large, is because, to date, all of these successes are achieved when the algorithms are run and trained on a simulator, using a bunch of compute and flexible function approximation like neural networks. So next. [00:13:30] If you just put all these points together, that is the setting of planning that I'm going to look at today. I'm going to ask the question of what are good algorithms in the simulator setting, and you can view that as a planning setting, and then the question is
[00:13:51] do these algorithms work, and when do they work. And instead of asking questions about complicated things like nonlinear function approximation with neural networks, let's just dial back a little bit and ask questions about the simpler setting where you have linear function approximation. The reason for doing that is that [00:14:17] it seems prudent to try to answer a simpler question first before trying to answer a more complicated question, and there is a lot to be learned, a lot to be gained, by answering the simpler question. So why function approximation? Well, state spaces are too large; you can't deal with them without doing some sort of compression, and function approximation is a way of compressing whatever quantities you wish to compress. These quantities could be, [00:14:54] next slide, value functions, so these are functions over the space of state-action pairs, or they could be policies. The simplest form of a policy is just a deterministic policy that maps states to actions. You can try to compress either the value functions or the policies, or both. In today's talk I'm going to focus on compressing value functions, but we're going to briefly touch on [00:15:32] policies. Right, so how does this work? I said that we're going to use linear function approximation. In linear function approximation you have a bunch of basis functions, phi_1, phi_2, up to phi_d; that's what's in the corner of the slide. Let's say you want to approximate the action-value function of some policy, which is a function of the state-action pairs (s, a). [00:16:05] The idea is that with just a few basis functions, maybe, as shown in the figure, with just six basis functions, you can approximate functions
that are defined over large spaces, like infinite spaces. And it's not crazy to say that you can get good approximations to some target function without using too many basis functions: if you have some additional prior knowledge about what this function may be, then you can do this by choosing the basis functions in the right way. [00:16:43] So, next. Again, the requirement in terms of flexibility with function approximation is that the algorithms should accept any feature map, so they shouldn't be too picky about what feature map is thrown at them. [00:17:12] That's the criterion of flexibility. The other thing that we might want to have is efficiency, which is going to be this polynomial runtime guarantee, and in terms of effectiveness we want to know how good the policy is going to be. So, next slide, to define efficiency. [00:17:39] Let's dive into this in a bit more detail. You have the simulator, which is also going to feed us features, and there is a planner routine. We're going to talk about the complexity of the planner routine, and the efficiency and effectiveness of the planner routine. The planner routine is given the initial state s_0, and it's also given the feature vector of that state; in this case we are approximating [00:18:15] the state value function. The planner can then submit queries to the simulator in the form of state-action pairs, and the simulator responds to the queries by replying back with the next state, drawn according to the dynamics of the MDP, so this is P_a(s), a distribution over states, and the next state is sampled from the underlying distribution of the MDP. The planner also gets back the reward; maybe it's a stochastic reward, here I just wrote it for
[00:18:55] deterministic rewards. And it also gets back the features of the next state. The planner can do this however many times it wants, but at the end of the day it has to decide: OK, I know enough, and I know what action to take at this initial state s_0. So the computation that the planner needs to do [00:19:20] is to just come up with an action. And then there are variations: if you have finite-horizon planning problems, then you have these stages, of where you are in the horizon, and those would be supplied to the planner, and then the planner is going to ask the simulator: OK, we are in this stage, what's the feature vector corresponding to the next state at the next stage. [00:19:49] And if you do action-value function approximation, then the variation is that the feature vectors are going to be supplied for all the actions. So that's how efficiency is defined: efficiency is going to be how many queries are submitted and how much computation is done. But then the next question is how good is the planner. To figure that out, imagine that the planner is interconnected with, so to say, the real environment, which is the same as what the simulator simulates, [00:20:23] what the planner had access to, and it's interconnected with it in a closed-loop fashion: whatever action it comes up with is fed back to the environment, the environment moves to the next state, and that state is fed back to the planner. In this closed-loop fashion the planner is going to induce a policy in the environment, and we evaluate how good the planner is by how good the policy is that it induces in the environment. [00:20:57] Right. OK, so with that we covered both flexibility and efficiency, so let's talk about effectiveness. Next.
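The interaction protocol described above can be sketched as follows. This is a toy illustration under stated assumptions, not the interface used in the papers: the class `Simulator`, the functions `toy_planner` and `closed_loop_return`, and the chain environment are all hypothetical names made up here, and the toy planner ignores the features it is handed, whereas the planners in the talk rely on them.

```python
import random

class Simulator:
    """Hypothetical simulator: answers (state, action) queries and counts them."""

    def __init__(self, step_fn, feature_fn, seed=0):
        self.step_fn = step_fn        # (state, action, rng) -> (next_state, reward)
        self.feature_fn = feature_fn  # state -> feature vector
        self.rng = random.Random(seed)
        self.num_queries = 0          # query complexity is measured by this counter

    def query(self, state, action):
        self.num_queries += 1
        next_state, reward = self.step_fn(state, action, self.rng)
        return next_state, reward, self.feature_fn(next_state)

def toy_planner(sim, state, features, actions, samples=10):
    """Toy planner: picks the action with the best average immediate reward.
    It only looks one step ahead and does not use `features`."""
    def average_reward(action):
        return sum(sim.query(state, action)[1] for _ in range(samples)) / samples
    return max(actions, key=average_reward)

def closed_loop_return(planner, sim, env_step, feature_fn, state, actions,
                       steps, gamma, seed=1):
    """Run the planner in closed loop with the environment and return the
    discounted sum of rewards of the policy that the planner induces."""
    rng = random.Random(seed)
    total, discount = 0.0, 1.0
    for _ in range(steps):
        action = planner(sim, state, feature_fn(state), actions)
        state, reward = env_step(state, action, rng)
        total += discount * reward
        discount *= gamma
    return total

# Hypothetical chain environment: action 1 moves right and pays 1,
# action 0 stays put and pays 0 (deterministic, so rng is unused).
def chain_step(state, action, rng):
    return state + action, float(action)

def chain_features(state):
    return (1.0, float(state))

sim = Simulator(chain_step, chain_features)
value = closed_loop_return(toy_planner, sim, chain_step, chain_features,
                           state=0, actions=[0, 1], steps=3, gamma=0.5)
```

Efficiency corresponds to `sim.num_queries` (plus the computation done per call), and effectiveness to the return of the induced policy, here `value`.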
Now, the problem is that effectiveness has to be defined, and for that we need to say what we mean by how well the MDP is fit by the feature map that the planner was supplied with. [00:21:34] To get this off the ground, imagine that you have the MDPs listed on the left, these blue dots, all possible MDPs, and on the right you have all these feature maps, and for each pair you can think about a measure of how well the MDP is fit by the feature map. Actually, you can think about multiple such measures. One such measure would be, let's say the feature map is for action-value functions, how well the optimal action-value function, q*, can be approximated. That's one [00:22:16] such metric, but you can think about other metrics that people came up with over the years. The second metric is how well the features are going to approximate all the action-value functions of all the policies in the MDP. That also just gives you a number; the smaller the number, the better the features are. If the number is big, then the features are not great, and then we don't require the planner to do a good job, because it was not supplied with a sufficiently good feature vector. The very last one is [00:22:56] the feature vectors fitting v*; I haven't talked about that yet, but it's going to be important at a later stage of the talk, at the end. So can we imagine that these metrics are going to be small? Yes, they could be small when things can be compressed, like in this little silly example where you have this picture of a car and you see these flowers blinking. [00:23:25] The blinking of the flowers is in the state, the state of the flowers, so that's part of the state space of the MDP,
they have some dynamics, which is unrelated to how the car is traveling on this map, and you want to control the car, and you can compress away everything that is related to the flowers in this example, so you can compress the state space. Admittedly this is a silly example, but it is meant to show that in an MDP you potentially have a lot of things that are unrelated to how to act optimally, and then you can do a great amount of compression, and maybe you're able to design these basis functions and come up with ones that are going to be a good fit to the MDP just by thinking about what's important for controlling, for developing a good controller. [00:24:21] Why so complicated? [inaudible] A possibility would be to think about policy search, where you have all these features, and then you could say: let's say policies depend on the features, so we're compressing policies, and they depend on the features maybe in a linear fashion, like in Boltzmann or softmax policies, and then you come up with some metric, and then you just want to maximize this metric. [00:24:54] The planning problem would then be to use the simulator and then somehow solve this policy search problem. As before, you can say that the feature set is a good match if this metric is going to be high, and maybe you have the parameters of the optimal policy. [00:25:21] But the problem is that this simple approach doesn't work. This policy search problem is NP
hard even for the simplest possible cases that people can imagine. Usually the metric that people come up with in these policy search, policy gradient methods is some kind of average of the performance of the policy over the states; that's shown here. And this result by Vlassis, Littman, and Barber from 2012 says that even in this very simple setting where you use state aggregation, [00:26:03] this policy search problem is NP-hard, and it's also hard to approximate. So it's really a hopeless problem even for finite MDPs; if you're compressing too much, you're asking too much. That's why our framework is a little bit different and might feel even more complicated, but there is this hardness result, so we really can't just try to do this. Right, so the next slide. [00:26:35] So let's get to some results. The results are going to come in three forms. First I'm going to look at the case where the function approximation technique is such that all the action-value functions, for all the policies, can be fit well by the function approximation technique being used. The second setting is going to be a relaxed setting where we only assume that the optimal action-value function can be well represented by the features, and a third setting is with the optimal state value function, and I'm going to talk about results in that setting as well. But the first setting, [00:27:25] and let's go to the next slide, is the setting where all the action-value functions are close to the function space. Here the first question that we are facing is whether our target accuracy, that is, the suboptimality of the policy that the planner is supposed to compute, is much smaller than the square root of d times the [00:27:51] function approximation error or not. And the very surprising result, by Du, Kakade, Wang and Yang, is that if this accuracy demand is too high,
then there is no planner that would be able to come up with such a policy without exponentially many queries, either in the number of [00:28:23] dimensions, the number of features, or in the planning horizon. [inaudible] Next slide, please. Can we go to the previous slide? The next one, yeah, previous slide. Yes, thank you. Right, and so this is too many queries. And this is an algorithm-independent result, so it's not just that the algorithms that we came up with are too lame and we should improve them; it's an information-theoretic impossibility result. And note that this is under the setting where we assume that all the action-value functions are well represented, and even in this case [00:29:18] algorithms that try to do this compressed computation cannot achieve a target accuracy that is much smaller than the square root of d times epsilon, where epsilon is the approximation error and d is the number of features. And then the question is: OK, if we [00:29:43] are content with an accuracy which is not that demanding, then is it possible to get efficient algorithms? And the answer is yes: the algorithm from the good old days, approximate policy iteration, which got a new name, DQN, and was the algorithm that was used [00:30:08] in the Atari games when it's combined with neural networks, is an algorithm that, together with a piece of cleverness that I'm going to talk about on the next slide, is able to give you polynomial computational and query complexity.
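The talk goes on to name optimal experimental design as the tool for controlling extrapolation. As a rough sketch of what such a design computes, and not the authors' implementation, the following runs a standard Frank-Wolfe (Wynn/Fedorov-style) iteration for a G-optimal design over a finite set of feature vectors. The function names and the random feature matrix are assumptions made for illustration. By the Kiefer-Wolfowitz theorem the optimal worst-case value of phi^T G(rho)^{-1} phi equals d, which is one way a square-root-of-d factor can enter error bounds.

```python
import numpy as np

def leverage_scores(Phi, rho):
    """phi_i^T G(rho)^{-1} phi_i for every row phi_i of Phi,
    where G(rho) = sum_i rho_i * phi_i phi_i^T."""
    G = Phi.T @ (rho[:, None] * Phi)
    return np.einsum("ij,jk,ik->i", Phi, np.linalg.inv(G), Phi)

def g_optimal_design(Phi, iters=2000):
    """Frank-Wolfe iteration with exact line search for an approximately
    G-optimal design over the rows of Phi (assumes the rows span R^d)."""
    n, d = Phi.shape
    rho = np.full(n, 1.0 / n)            # start from the uniform design
    for _ in range(iters):
        lev = leverage_scores(Phi, rho)
        i = int(np.argmax(lev))          # point where extrapolation is worst
        if lev[i] <= d * (1 + 1e-9):     # optimal: worst-case value equals d
            break
        step = (lev[i] / d - 1.0) / (lev[i] - 1.0)
        rho *= 1.0 - step                # shift some mass to the worst point
        rho[i] += step
    return rho

# Hypothetical feature vectors: 50 random points in 4 dimensions.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 4))
rho = g_optimal_design(Phi)
worst = leverage_scores(Phi, rho).max()  # should be close to d = 4
```

Measuring the target function only at the points where `rho` is positive, with effort proportional to `rho`, keeps the least-squares prediction error under control at every other feature vector in the set.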
[00:30:29] The difference between what people are doing in practice and what this result is telling us to do is that the result is telling us that, for these algorithms, it's pretty important that you control the extrapolation errors. What are these extrapolation errors? If you're measuring some function, let's say a linear function or close to a linear function, at some points, then you might ask [00:31:00] about the value of the function outside of the convex hull, or something like that, of that set of points, and that's called extrapolation. The result says that we have to be very mindful of this, and the square root of d blow-up factor is actually coming from, and can be traced back to, these extrapolation errors. The lower bound says that the square root of d blow-up of the approximation error is unavoidable, and the upper bound is going to match this, and the matching is happening by controlling the extrapolation errors in a careful way. For this the algorithm needs to reason in a very careful way about the extrapolation errors, and here we bring in the technique of optimal experimental design from statistics, and it happens that that's the tool to control extrapolation. [00:31:59] All right, so next slide. Then the next question you might ask is what are the limits of the suboptimality: in the limit of infinite compute, how good is the policy that the algorithm is going to get? The previous slide was kind of hiding this; I said I'm going to say something about this. If there are no approximation errors, then in the limit of an infinite amount of compute we're going to get a policy which is optimal, so that's good. But what if the approximation error, next,
[00:32:41] is positive, so epsilon is positive? What is the best suboptimality that we can get? For this policy iteration algorithm, this was not shown on the previous slide, but what happens is that the suboptimality gap is going to scale with this epsilon, but it's blown up by a factor proportional to not only the square root of the number of features, but also a power of 1/(1 - gamma), where 1/(1 - gamma) is usually called the planning horizon in the discounted setting, so here gamma is the discount factor. So this just says that the planning horizon is also going to blow up the function approximation error. And then we're going to ask whether this is necessary: is this upper bound maybe too loose? Next. [00:33:42] Apparently, it happens that Daniel Russo, just last year, [inaudible], has a new paper where he shows that, for approximate policy iteration with state aggregation, this blow-up in terms of the dependence on the effective horizon is unavoidable. He doesn't have the square root of d; the other kind of idea is that the square root of d is unavoidable, and with state aggregation the square root of d can indeed be removed. So I believe that that kind of closes the case for state aggregation and approximate policy iteration, but the question still is whether some other algorithm can do better than this. Next slide. [00:34:37] For policy iteration we have these lower and upper bounds, and if you look at the literature, Bruno Scherrer had a paper in 2014, and he already shows some bounds there, where the dependence on the effective horizon is much milder, but there is the appearance of this constant C, which
is a problem-dependent constant, of course; it's not just depending on the feature map, it depends on the MDP, so it's highly problem dependent. [00:35:12] This constant can be arbitrarily large. A similar result was obtained last year for an algorithm called Q-NPG, which is like an instance of the natural policy gradient type of algorithms; they have the milder dependence on the effective horizon, but this constant C is there. And then the state aggregation paper actually gives a very hopeful result: it shows that you don't have to have this C constant there, and the dependence on the effective horizon can also be vastly improved. Next. [00:35:59] We had a previous algorithm that was applied in a different context, this Politex algorithm, which is very similar to Q-NPG, if not the same. You can modify the proofs that were given for these algorithms, and if you modify the proofs for these algorithms, then [00:36:21] you can show that this algorithm gives this ideal scaling with the horizon, ideal in the sense that it's much better than anything before. We don't know whether that is optimal; I suspect maybe it is, but that's an open question. So the lesson here is that with some better design you can improve things. API is not bad, but maybe we should [00:36:51] [inaudible]. Next topic. There is a quick question, if I may read the question: how can we compress the state space, how do we determine which features we really want? [00:37:16] Well. So in a practical application you would need to think about what matters for the application.
Here I'm trying to, you know, sidestep this question with the algorithm design, and I would say that it's outside of the scope of this talk. [00:37:39] But clearly, if you want to solve some practical problem, then you have to come up with some way of doing this. I think that just following the standard procedures is probably going to give you pretty good features. If you are skeptical about whether linear function approximation is sufficient, [00:38:06] maybe I share your skepticism, but I think that it's kind of necessary for us to address these questions in the simpler context before going to something more complex like neural networks. And also we see some examples where people are able to replicate some of the successes; for example, for Atari games, linear function approximation has been shown to be almost on par with the neural networks, so it's not so bad. There is a lot of regularity, a lot of things are irrelevant, and maybe you can figure out what's irrelevant and just ignore it, so that's how you compress. But the other thing I would say is that you can just try different things, [00:39:02] and you want to have an algorithm that performs better if you supply better features, so you can have this outer loop around all of this. If you have an algorithm that is getting the best out of any set of features that you throw at it, then you can just iterate in that outer loop. [00:39:23] With that, let's get back to this other setting, where the optimal action-value function lies close to the span of the features.
If the target accuracy is much smaller than the square root of d times epsilon, the same previous lower bound applies, because this is an even more demanding setting. And then the next question is: OK, do we have many actions here or not? It turns out that, depending on what the answer to this question is, you might end up in very different regimes of this problem. So, next. First, if we have many actions, then we have a recent paper where we showed that a lower bound similar to the previous one still applies: there is no algorithm in this setting, [00:40:23] even in the case when the optimal action-value function is exactly representable by the features, that would be able to get away without asking exponentially many queries, either in the dimension or in the horizon. So if both the dimension and the horizon are large, which is the setting that we would like to be in, this is just too demanding a goal. It seems that the number of actions is going to matter. Next slide. This just drives home this point. And the next slide: how this is proved. [00:41:04] Next slide. Next. So we come up with a construction in the finite-horizon setting. [inaudible] You have deterministic dynamics and stochastic rewards, and there are H stages. On the slide you see the illustration of the MDP; it's basically a big tree with some extra things. [00:41:42] The dynamics is deterministic: you take an action and you get to the next state, you take an action and get to the next state, up to the end. The rewards are going to be zero up to the very end; there the rewards are going to be
And the idea of the lower bound in this construction is that we have to ensure that we can come up with some feature map such that the optimal action-value function is linear in the feature map. To achieve this, it is sufficient to maintain the Bellman optimality equation, because that equation has a unique solution. And we decided early on that we are going to have many actions, because in the previous lower-bound construction we have seen that having many actions just works out for this. [00:42:38] We rely on this special type of feature vectors: these are vectors of unit length that are nearly orthogonal to each other, and you can have exponentially many of those in the dimension. It's pretty fantastic. So if the inner product between any two of the vectors [00:42:56] only has to be less than a constant in absolute value, you can pack in exponentially many of these vectors. That maybe contradicts our low-dimensional view of the world, but it's a fact: you can have exponentially many of these vectors, and those are going to be the feature vectors that we assign to the actions, so that gaining information about the value of one of the actions doesn't give you much information about the value of the other actions. So we choose deterministic transitions, and in the last stage we are going to have Bernoulli rewards. Next. [00:43:39] And these Bernoulli rewards need to have a small scale; the scale is going to be two to the minus H, roughly.
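The claim about nearly orthogonal unit vectors can be checked numerically on a small scale. The sketch below is only an illustration, not the paper's construction: it draws random unit vectors (the sizes 256 and 128 and the helper names are arbitrary choices for this example) and computes the largest absolute inner product between any two distinct vectors. There are more vectors than dimensions, yet the overlaps should stay well below 1.

```python
import math
import random


def random_unit_vectors(num_vectors, dim, seed=0):
    """Sample vectors uniformly from the unit sphere by normalizing Gaussians."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(num_vectors):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def max_abs_inner_product(vectors):
    """Largest |<u, v>| over all pairs of distinct vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


# Twice as many vectors as dimensions, all pairwise nearly orthogonal.
vectors = random_unit_vectors(num_vectors=256, dim=128)
overlap = max_abs_inner_product(vectors)
print(f"{len(vectors)} unit vectors in dimension 128, max |inner product| = {overlap:.3f}")
```

Typical inner products of random unit vectors scale like one over the square root of the dimension, which is the mechanism that lets the number of nearly orthogonal vectors grow exponentially with the dimension.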
The reward scale has to be this small because otherwise a planner could just go and experiment in the last stage and figure out the optimal action. Somewhat arbitrarily, we decided that in every state the same action is going to be optimal, because that simplifies the Bellman optimality equation: it takes a very simple form, and we are in a much better position to make the construction happen. But now, if the optimal action is revealed by the action values in the very last stage, the planner could just go in, measure the immediate rewards, and figure out the identity of the optimal action, unless we make the rewards really tiny. So we are going to have Bernoulli rewards with only a very small bias in their parameter. Unless the planner issues at least two to the H queries, it is not going to be able to figure out which action is the optimal one. The optimal action would have a larger reward; all the other actions would have [00:44:58] less. But all the rewards are so small in scale that this doesn't help the planner. To achieve this in a gradual fashion, so that there are no big gaps between the values across stages, the rewards are scaled down exponentially with the stage, and exponentially many actions are going to be necessary for hiding the large reward at the very first stage. Next slide. [00:45:25] And to ensure that the suboptimal actions give no information to the planner, we also decided to set the immediate rewards to zero everywhere except for the last stage, and we also had to add a feature bias to maintain consistency. Next. And we have to have a few more extra elements, also to maintain consistency. This is just a slightly hairy part. [00:45:58] Altogether, all this means is that a planner that looks at this MDP
can experiment with the actions at any stage, but in the very last stage the signal-to-noise ratio is really bad. In the intermediate stages the planner is facing a needle-in-a-haystack problem, because there are just way too many actions to try. There is no difference between the suboptimal actions: the dynamics is known and deterministic, so there is nothing to be learned about that, and the suboptimal actions all give zero reward. The optimal action is just one of the exponentially many actions, so the planner really needs to try all the actions, or at least half of them, to find out what is a good action and what is the optimal action. [00:46:47] So altogether, this is the construction, and this is why the problem is hard. Next. And this leaves open the question of what happens when you have a constant number of actions. So that is still an interesting open question. Next. So then I'd like to cover the very last part, which is when you have state-value function realizability. [00:47:14] Well, the previous picture kind of repeats in this case. If you have many actions here, then you are also doomed, because in the setting where only the optimal state-value function is realizable, you really have to do one-step lookahead computations, and that will take time proportional to the number of actions; if you have too many actions, just the computation time is going to be too large. But if you don't have too many actions, and if you care about the query complexity, then we have some better news: we have this algorithm, which is very nice, I think. [00:47:49] It should be the next one. We call this TensorPlan, and for a fixed number of actions it achieves polynomial query complexity. So the number of queries that TensorPlan is going to need to find a good policy in this local planning problem is going to be [00:48:14] polynomial in the dimension and the horizon, raised to the power of A. So if A is a constant, this is polynomial in the dimension, the horizon, and the target accuracy.
But what's interesting about this is that it's not really an ADP-type algorithm. It's still using something like the Bellman equation, but it's not using the Bellman optimality equation, even though it is shooting for, you know, the optimal [00:48:47] state-value function. The way the algorithm works is that it keeps a set of possible parameter vectors that it couldn't rule out yet. Based on past information, it tries to shrink this hypothesis set, which is supposed to contain the parameter vector that parametrizes the optimal [00:49:14] value function. Given this set, it solves this linear optimization problem: what is the parameter vector that gives us the best value at the initial state? That's theta plus on this figure. Then, given this vector, it comes up with a policy: this policy is the one that makes everything consistent. So you have [00:49:42] this value function, and if this value function is consistent, then there should be at least one action such that the value of the current state under this value function is the same as the immediate reward plus the expected value at the next state under this value function. Was this on the previous slide or on the next slide? [00:50:14] No, the previous slide. Yeah, the previous one. Thank you, sorry. OK, so you have this policy and you want to roll out with this policy. While running the rollouts, the algorithm is going to check whether this kind of consistency holds, whether this value function really satisfies this. And here what's important is that there is no max: you are just checking whether there is any action. You are happy to take any action that satisfies this; it does not have to be the maximizing action, so you are not going to stop if
[00:50:53] the maximizing action doesn't satisfy this; just any action that satisfies it is going to be good enough. A result of this idea is that we actually get a stronger result: what you can prove is that this algorithm is able to compete with any policy whose value function lies in the span of the features. So the case when the optimal value function is in the span of the features is just a special case. Even if that is not the case, whatever policy there is, [00:51:26] maybe it's a great policy, maybe it's a poor policy, if its value function lies in the span of the features, then the algorithm is going to guarantee that it will collect as much reward, or almost as much, as this policy, and so it is going to be able to compete with the best of these policies. [00:51:48] So the algorithm just rolls out: it uses the simulator and checks for this kind of consistency. That local consistency implies that the policy actually induces at least as much value as what is predicted by the value function. And because the [00:52:10] parameter vector was chosen in an optimistic fashion, this is going to imply that you are competing with the best of the policies whose value function lies in the span of the features that we have, right? So that is the idea of the algorithm. [00:52:30] Optimism plus rollouts. Somehow the local consistency checks are where the Bellman-type equations come in.
But in a different way. And if the algorithm detects a contradiction, then it shrinks this set. As I said, the drawback of this algorithm is that it works, but it is only query-efficient; it is not computationally efficient at the same time. The paper covers this question of query complexity: you can prove that the query complexity is polynomial. But it leaves open the question of whether this algorithm, or any other algorithm, has polynomial runtime complexity for this setting. [00:53:20] All right, so, next, some more questions. Would ADP algorithms work in this setting? If you think about what the natural algorithm to use would be, it would be something like least-squares value iteration, and I think you can show that this algorithm is going to suffer from an amplification of [00:53:43] the errors, the estimation errors, that scales very badly with the dimension and the horizon, at least in the worst case. I think that the likely answer is that, you know, it's not going to work, and I don't know of any other polynomial-time algorithm. The other question is: is this algorithm sensitive to violations of realizability? It seems to me that it's going to be not bad, but we haven't got to the end of this question. [00:54:13] But maybe one of the most interesting questions is whether the same ideas and methods extend to q-star realizability with few actions. We don't know the answer, actually. Summary. It seems that the key idea in a lot of RL, [00:54:41] which I think was invented maybe already by Bellman and is pushed to the extremes these days, is that we need to do these computations in compressed form. These days people like to use neural networks for doing this, but it's also important to investigate what happens in simpler cases [00:55:04] like linear function approximation. Next slide.
And so for this setting I propose that we should study this problem in this kind of framework, where we demand that algorithms should be flexible: they should be able to use any feature map, in the linear setting. They should be effective: [00:55:25] if the feature map is a better fit to the MDP, the algorithm should perform better. And they should also be efficient: in terms of their runtime, they should be polynomial. We have seen a lot of results today. Next slide. For the first of the three settings, the conclusion I have is that these extrapolation errors provably do matter, and [00:55:58] amplification of the function approximation error is unavoidable; it's provably unavoidable. Whether an algorithm is or isn't matching the best possible amplification depends in a very subtle way on the form of the algorithm, and we have seen that some algorithms, like Politex, are better than others. It seems that the theme there is that if you end up doing some sort of update, you should do it in a cautious way. [00:56:44] And if you want to gain more flexibility, because assuming that all the action-value functions lie close to the span may seem too restrictive, and you only assume that the optimal action-value function or the optimal value function lies in the span, then we see that to get any [00:57:10] positive results we have to change the way the algorithm is designed, and we end
up with these optimistic algorithms with constraint generation and rollouts, which I find quite interesting and fascinating. We then have to see how to translate algorithms like these into practical algorithms, but I think that it's really exciting to see these new algorithm designs coming up, and optimism especially showing up, even though we don't care about how much reward we collect here; it's not a learning problem, just a planning problem. But it's really natural to come up with these optimistic algorithms here. [00:57:54] So one question is whether all this is going to generalize: can you swap the function approximation technique for something else, going from linear to nonlinear? In the past there have been quite a few works that were doing that, and I'm quite optimistic that you can do this, at least under some conditions, but maybe it's going to be a little bit more challenging in some other cases. [00:58:25] For the first setting, when all the action-value functions are representable by the function approximator, we have done some of that, and there is a lot more to do there. But the question of extrapolation errors is, I think, the key issue, and it's still not trivial to handle. We also know that, elsewhere in the literature, [00:58:50] people are going after, you know, adversarial examples; in my mind they are really going after these extrapolation errors, and we see that they are not trivial to deal with, so we really have to look very carefully at what's going on there. What we have learned from this research is that extrapolation errors provably do matter. [00:59:15] I don't think that we can expect them not to matter in the nonlinear setting.
If anything, maybe they are going to matter more, and then the question is how to handle this: what's a good way out there? Of course there are many other settings, like the batch setting and the online setting, which are harder in some ways than the planning setting. The batch setting can be so hard that you can prove a lot of impossibility results, and in that way it becomes an easy setting, because you don't have any algorithm design problems: you can just prove all these impossibility results. But maybe what I said is a bit too pessimistic. These are fascinating questions. All right. [00:59:59] Next slide. So that's the end of my talk. I put up some references here, and there are some more references. And I do have this website; I'm teaching this RL theory course. Thank you. OK. [01:00:27] So this is about RL theory. I'm putting up lecture notes there, and a lot of the material I talked about is covered there; it is already there. And I'm happy to answer questions. Thanks, Csaba. There is one question in the Q&A box: "What are some key things to consider when designing a reward function?" [01:01:02] So, that is completely unrelated, of course, to my talk, but it's a fascinating question nevertheless. Well, I'll tell you that I've been thinking quite a lot about what the objective should be, and we can try to answer this question by avoiding this question. [01:01:32] I think what happens is that for many of the practical problems we see that we have multiple objectives that we want to meet, right? So you want your robot to be energy efficient, you want the robot to accomplish the task, and so on and so forth. [01:01:53] There are all these requirements, and any one of them individually is not that hard to formulate in mathematical form.
Although, you know, a requirement like "complete the task", OK, that may be a little bit more difficult to formulate, for example if you have a [01:02:14] complex task. Maybe this is not a trivial thing, but it's much easier than putting everything together. One challenge is that when you have these trade-offs between the different objectives, if you try to resolve the trade-offs in some ad hoc manner of combining things, it tends to not work. So choosing the reward function is maybe not that easy; we have to recognize that. [01:02:48] These days it's more an art than a science, so why not try to put some science into that? To me, that's more like multi-objective reinforcement learning, where someone comes up with a bunch of reward functions and then you have to resolve the trade-offs in various ways. So I'm really excited about this direction; there are very interesting questions there. [01:03:21] But I guess you can think about other approaches that people have tried in the past, like imitation learning, learning from experts, like inverse reinforcement learning. I think people are also trying to learn the reward function, and those approaches are also interesting. Reward design is an exciting problem, and one big question that I have is whether you can somehow [01:03:47] reduce everything to standard reinforcement learning. I think it's a fascinating area; there are a lot of papers coming out on this. Let's open the floor to the audience for any other questions: one or two quick questions before we wrap up, because we're already running late. There's one more question that just came in; I'll read it out.
[01:04:34] He says: "Not sure if I heard you correctly, but I think you mentioned that people use adversarial training as a proxy to get good extrapolation. If so, can you expand on your concerns with that approach?" OK, thank you. Yeah. So I think that, yes, that's my interpretation; people don't necessarily say that. But if I look at this literature [01:05:01] on adversarial examples and how to defend against adversarial examples, I think that the basic observation there is that if you train your neural network against some dataset, then the distribution of the dataset is going to affect what the neural network knows about, and with small deviations from the distribution, the errors blow up. To me, that's just the lack of extrapolation. Here in reinforcement learning we see that for these algorithms it's crucial to control these extrapolation errors; this is the worst-case extrapolation error, and that's what the whole thing is about. [01:05:48] And it happens that with linear function approximation the errors can blow up by a factor of the square root of d, but it's not unbounded: at least you have a limit on how big the blow-up is. And I think that with neural networks you have the same question, and you could ask: what is a good design, what's a good way of controlling this blow-up of the errors? We shouldn't be too optimistic and expect that the error is not going to blow up, but at least we should try to limit the blow-up. And we also see that for some function approximators, like,
[01:06:25] say, think of state aggregation, or piecewise constant basis functions: if you have piecewise constant basis functions, there's no blow-up of the error. OK, so extrapolation is trivial there: it's piecewise constant, so in every piece you measure at one point, you have just that measurement, and you're good; you don't need to worry about blow-ups with that representation at all. To me, the adversarial examples are really telling us that this really happens, especially with neural networks, and people are trying hard to control that. [01:07:00] I think it's a fascinating topic, and it would be very nice to really connect that work back to reinforcement learning. In reinforcement learning I haven't seen people trying to use these techniques to control these extrapolation errors. Maybe it's just my ignorance, but I think that, given what we have learned about the linear case, controlling these extrapolation errors is crucially important. [01:07:27] So, yeah, something like this. Thanks, Tom. I don't see any other questions, so maybe we'll end the talk here. Thanks, Csaba.