[00:00:05] >> Hello everyone, welcome to our seminar. It is our great pleasure to welcome Greg, an engineering manager for control design products at MathWorks, who joined MathWorks in 2003. [00:00:28] He has been a leading force behind many of the toolboxes that we all use, and recently his team also worked on a very interesting toolbox that has to do with reinforcement learning and control. Greg and I met many years ago at conferences, and since then we have been really good friends. So please dive into his talk; it is very interesting, and he will be happy to answer any questions. The room is yours. [00:01:02] >> Thanks for the introduction, and thank you for inviting me to this seminar series; I appreciate the opportunity to speak here. So today's talk is about reinforcement learning and how you can leverage it for control. The agenda will focus on a particular example: how to train a biped robot to walk. Through this talk I'll go through what reinforcement learning is, how you would solve this problem using a traditional controls approach, how we apply reinforcement learning to solve it, and then wrap up with some key takeaways. So what is reinforcement learning? Reinforcement learning is a subset of machine learning. Within machine learning, most people are familiar with unsupervised and supervised learning. Unsupervised learning is when you have unlabeled data and you are trying to discover some structure in the data, for example using clustering. [00:01:58] Supervised learning is when you have labeled data and you do either classification or regression; the most common example is image classification: is this a dog or a cat? The third type of machine learning is reinforcement learning, and this is when you use interaction data. You don't have labeled data; instead you interact with your environment in order to learn to control the system and make decisions. One thing that differentiates reinforcement learning from other machine learning techniques is that the sequence of actions or decisions that you take matters. For example, in image classification it doesn't matter whether you classify the dog or the cat first, or in what order you do it, but when you are interacting with a system the decision making is temporal and sequential: if you are driving a car, it matters whether you turned left or right at the previous time step. [00:02:54] Reinforcement learning is learning a behavior, or accomplishing a task, through trial and error by interacting with an environment. Sometimes you will see the term deep reinforcement learning, which is when you use a deep neural network to solve complex problems. There are a lot of applications for reinforcement learning. Some of the most popular ones in the news are video games: you see systems beating Atari games and Go world champions, and there are a lot of video game and game-type applications, which is very exciting. But it can also be used for autonomous vehicles, control of systems such as drones, and robotics, and a robotics application is what we will focus on today. So our goal today is to train a robot to walk in a straight line. How would you go about doing this?
[00:03:50] So we have several motors that actuate the joints of the robot, and we need to determine the sequence of motor commands that makes the robot walk. By selecting the right set of motor commands we can get a locomotion behavior from the robot. Typically you do this with control: we have a system, we provide some actions to the system, which results in some behavior, and the goal is to determine what actions generate the behavior that we desire. If you are familiar with feedback control, this is a standard problem: you observe the behavior through different sensors such as encoders, accelerometers, vision systems, et cetera; you take the state observations and compare them to your desired behavior; and the controller then acts on the error between the desired behavior and the actual behavior of the system to determine what actions it should take next to try to achieve the desired behavior. That is the concept of feedback control. Let's apply this to our robot system with a traditional controls approach. [00:05:03] Here we have observations: camera data coming in, and sensor data coming in from encoders on the joints, accelerometers, etc., and we want to extract useful information from all of it. Typically we start with perception: you might have perception engineers extract features from the camera data, then you fuse the sensor data with those visual features, doing some sort of sensor fusion and state estimation or localization, to get information about your pose, where you are in the environment, your current state. Then you feed these states into a control system to determine what motor commands to use to generate the desired behavior. In this case we have six motor commands, one for each joint of each leg. However, it is a little more complicated than that: if we drill into the control system we will see that there are actually a large number of other pieces we would need when developing a control system. [00:06:11] We have low-level controllers, such as PIDs, to make sure the motors track the reference trajectories in speed and position. We have higher-level trajectory planning that creates planned trajectories, which are then used as references to track, and we might also have some outer loops to account for balancing the robot, external disturbances, or uncertainty in the system.
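To make the feedback-control baseline described above concrete before turning to the reinforcement learning alternative, here is a minimal MATLAB sketch of a discrete feedback loop: a PI law acting on the error between desired and measured behavior. The gains, sample time, and toy first-order plant are all illustrative placeholders, not anything from the talk.

% Minimal sketch of a feedback loop: act on the error between desired and measured behavior.
Kp = 2.0; Ki = 0.5;              % illustrative proportional and integral gains
dt = 0.01;                       % sample time [s]
ref  = 1;                        % desired behavior (step reference)
y    = 0;                        % measured output of a toy first-order plant
eInt = 0;                        % accumulated (integrated) error
for k = 1:500
    e    = ref - y;              % error: desired minus observed behavior
    eInt = eInt + e*dt;
    u    = Kp*e + Ki*eInt;       % action computed from the error (PI law)
    y    = y + dt*(-y + u);      % toy plant dynamics responding to the action
end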
[00:06:43] Putting it all together, this can be a very complicated problem; you could spend whole disciplines on perception, sensor fusion, and control. In this talk we will look at an alternative approach to solving this problem. What if we took all of this stuff [00:07:01] and lumped it into a single black-box controller? Here we are going to take the camera data and the sensor data, feed them into one controller, and generate the motor commands directly. This would be nice if we were able to do it, but the question is: how do we design this complicated black box? [00:07:22] One way is reinforcement learning. So what is reinforcement learning? We use the definition of Sutton and Barto from their textbook on reinforcement learning: reinforcement learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward. The learner is not told which actions to take, but must discover which actions yield the most reward by trying them. [00:07:47] An easy way to think about this is teaching a dog new tricks. You have a dog; the owner takes an action and throws the stick, so the observation the dog sees is the stick flying and the owner yelling "fetch". [00:08:10] The dog may just sit down, or it may go and fetch the stick; if it fetches the stick and comes back, you give it a reward, a treat. So you are reinforcing the behavior that you want the dog to learn, and that is the principle we are using here. If you have ever trained a dog you will notice there is a lot of interaction to get the dog to do the right action, and once it does the right action it gets a reward, and that reinforces it over and over again. Before we get into solving the reinforcement learning problem, I'll go over some of the terminology used in reinforcement learning. There are two main components: the agent and the environment. The environment includes the robot, or whatever physical item is being interacted with, and the agent is the logic that takes observations and sends actions; you can think of it as something like the brain. [00:09:04] In this case the observations would be things like joint angles, accelerations, possibly a camera or vision system, etc., and the actions sent to the robot are the motor torques to each of the joints. In addition, the environment provides a reward stating how well the agent is doing for the given actions it is taking. If we look at this in terms of the reinforcement learning agent, we want to learn the optimal policy. [00:09:34] This is the same diagram, just drawn a little differently: we have a policy, which takes the observations and maps them through a function to determine the actions, and we have a reinforcement learning algorithm. The reinforcement learning algorithm takes the action the policy took, the current observations, and the current reward being received, and it acts as an optimization to find the optimal policy, the one that maximizes the long-term reward. [00:10:03] So the goal of the reinforcement learning algorithm is to decide how it should update the policy to maximize its long-term reward. What is the workflow we would use to solve this problem? There are several pieces. First, we need an environment that we are interacting with. We need to define a reward signal; the reward signal is very important, since it determines what task we are trying to solve. We need a policy: this is the function we are using to make decisions. We need an agent to train; the agent trains the policy to maximize its reward. [00:10:35] And finally, you may want to deploy; whether you train in simulation or on hardware, eventually you want to deploy to a hardware device. Let's first focus on the environment. The environment, as I said, is everything outside of the agent.
[00:11:04] The agent takes in observations, which are the state of the environment as observed by the agent, and provides the action; when we actuate the joints of the robot this changes the state of the environment. The environment also returns a reward based on the current behavior of the system. In our case we have this biped robot, and we will mainly work with it through simulation, but I will talk about the trade-offs of learning on hardware versus in simulation. As I mentioned, we will have a large number of observations, pose information as well as joint information and so on, and the actions going to the robot will be the torques applied by each of the motors at each of the joints. There are two ways you can interact with the environment: a real environment or a simulated one. On the left here we have a real environment, the actual robot, and here we have a simulated environment, which would be something like Simulink, where you develop a simulation that represents the real system. [00:12:18] One of the benefits of using a real environment is accuracy: there is no model mismatch, because you are learning on the actual system you are trying to train, so accuracy is a big plus. One of the downsides is risk: as you learn through interaction you have to explore various actions; I have to explore in order to see whether I can achieve a better behavior. If I explore poorly — say I'm driving a car and I turn left into a wall — that's a problem, so you really have to worry about safety as you are learning. In addition, reinforcement learning is very sample inefficient: you need a large number of samples to learn, and that can cause wear and tear on your actual robot. [00:13:12] One of the benefits of using a simulated environment is training speed. In the real environment you can only train at real clock time, whereas you can often run simulations faster than wall-clock time, so you can train faster, as well as use other resources such as cloud computing and distributing the simulations across multiple computers. It also provides flexible simulation conditions: there may be some conditions that you wouldn't necessarily be able to put the robot into [00:13:45] in real life, but you can create those scenarios, with different initial conditions or different situations, within a simulated environment. And as I mentioned before, a key benefit is safety: you don't have to worry during exploration about taking a bad action that could damage your robot or injure somebody else, so safety is a big, critical advantage here. [00:14:12] The downside is accuracy: your simulated environment may not model the real environment exactly. This could be due to friction, disturbances, or simply the physical dynamics of the system. However, there are a lot of tools you can use to improve your models, such as parameter estimation, to improve how well the simulated environment matches your real system. In this talk we are focusing on simulation, and we are using Simulink to model the biped robot.
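To make the environment interface concrete, here is a rough MATLAB sketch of how a Simulink model can be exposed as a reinforcement learning environment. The model name, agent block path, and the observation/action dimensions are assumptions for illustration and differ across toolbox releases.

% Sketch: wiring a Simulink model up as an RL environment (names and sizes assumed).
numObs = 29;                                   % assumed number of observations
numAct = 6;                                    % one torque command per joint
obsInfo = rlNumericSpec([numObs 1]);           % continuous observation spec
actInfo = rlNumericSpec([numAct 1], ...
    'LowerLimit',-1,'UpperLimit',1);           % normalized torque commands

mdl      = 'rlWalkingBipedRobot';              % hypothetical model name
agentBlk = [mdl '/RL Agent'];                  % hypothetical path to the agent block
env = rlSimulinkEnv(mdl, agentBlk, obsInfo, actInfo);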
[00:14:49] We are modeling the robot using our physical modeling toolbox, Simscape, and we are able to model the dynamics of the system as well as the contact forces with the environment. If we take a look into this model, we have a couple of blocks: the agent block, and the walking robot block, which represents the environment. [00:15:16] We represent the robot's legs, left and right; we have the sensors; we have the world and ground, hips and torso, et cetera. We use the physical modeling blocks to represent each of the revolute joints and links of the system, and if we look into the sensors we are measuring quite a few kinematic quantities of the system, [00:15:36] which we feed back into our observations. The next piece I want to talk about is the reward. Now that we have defined the environment, we want to focus on what the reward is for the task we are trying to solve. As I mentioned, the reward is a function that outputs a scalar number representing the immediate goodness of the agent being in a particular state and taking a particular action, and the goal is to maximize the long-term reward for the system. [00:16:10] The tricky thing with the reward is: how do you translate a behavior into a numerical value? We have to create a numerical value that encodes the behavior we are trying to train. You can think of it as how to incentivize the robot: "is this the way you want me to go?", from the robot's point of view. [00:16:30] This is the final reward that we came up with, and I'll go into more detail later about how we arrived at it. Our reward includes the current forward velocity, which is our goal: we want the robot to walk forward. We also don't want the robot to stray too far from the straight line, so we penalize the deviation from the center line. We want to make sure the robot stays upright, so we want it to keep some nominal torso height, and there is a term that penalizes deviation from that nominal height. We want it to walk as long as possible, so we give it a [00:17:12] reward for each time step it remains upright. And we also want to minimize actuator effort, so we penalize how much torque we apply at each of the motor joints. Building the final reward is quite easy: we have a subsystem block, and if we look inside of it we simply built up this reward as a block diagram. You could also use a MATLAB Function block to write the reward out explicitly as an equation, roughly as in the sketch below, so it is very easy to define a reward within this modeling structure.
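For the MATLAB Function block route just mentioned, the shaped reward could be written roughly like this. The weights, signal names, and nominal height are illustrative assumptions, not the values used in the shipped example.

function r = walkingReward(vx, y, z, torques, Ts)
% Sketch of the shaped reward described in the talk (all weights are placeholders).
%   vx      - forward velocity of the torso
%   y       - lateral deviation from the center line
%   z       - current torso height
%   torques - vector of joint torque commands at this time step
%   Ts      - sample time (the upright bonus is paid every step)
zNominal = 1.0;                      % assumed nominal torso height [m]
r =  vx ...                          % reward forward progress
  - 3*y^2 ...                        % penalize drifting off the center line
  - 50*(z - zNominal)^2 ...          % penalize deviating from the nominal height
  - 0.02*sum(torques.^2) ...         % penalize actuator effort
  + 25*Ts;                           % small bonus for every step still upright
end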
Now that we have defined the reward, the next piece is the policy [00:18:01] that we are using. As I mentioned, the policy sits inside the agent, and just to recall, the policy is the function that maps the observations to the actions, and we also use a reinforcement learning algorithm to optimize this policy. [00:18:27] In this particular example we use an actor-critic method, so let me describe how actor-critic methods work. There are two neural networks, and typically they are trained simultaneously. We have an actor, which takes the state observations and provides actions. [00:18:47] The actor is trying to learn the best action to take given the observations, and the critic is used to estimate the value of each state, or state-action pair, that the system visits. This value represents how good that action was: [00:19:05] the expected long-term reward for taking that action from the given state. Using this critic we can then critique the actor; we can use the critic to improve the actor's behavior. As I mentioned, the critic represents the value, the long-term reward the policy can expect from a given state and action. How does this work? The actor is responsible, as I said, for taking the state observations and producing the action. [00:19:45] The critic then takes the state observations and the action that the actor took and predicts the value of that action, the long-term reward that would be received. It compares this prediction with the reward signal that comes back, and, [00:20:14] essentially using the Bellman equation, we can look at how far the prediction deviates, how well the critic is representing the expected value given the new reward coming in versus its current parameterization. Using this error signal we can update the critic so that it does a better job of approximating the value of a given state-action pair. Once we have a good critic, the critic then uses this information to update the actor, [00:20:38] based on its knowledge of which actions are good in particular states. For creating the agent we are going to use a reinforcement learning algorithm called Deep Deterministic Policy Gradient (DDPG). In constructing it we need to create a critic network and an actor (policy) network, and we also have to set some associated hyperparameters; I'll go through how this is done in the Reinforcement Learning Toolbox. First we need to create a critic network, which takes the state observations and actions and returns a value. Here we are using the Deep Network Designer from the Deep Learning Toolbox to build the network: the observations come in through a few fully connected layers with ReLU layers, the action comes in through a fully connected layer, and then there is a final ReLU and fully connected layer producing the value. That is the critic structure we will be using for this example. Similarly, we can use the Deep Network Designer to create the actor network: here the observations come in through a few fully connected layers with ReLU activation functions, and the final layer in the actor network is a tanh layer, because we want to keep our torque commands between plus and minus one; we are able to embed our actuation constraints explicitly in [00:22:07] the actor policy.
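A programmatic sketch of critic and actor networks of the kind just described (fully connected layers with ReLU activations, and a tanh output on the actor to keep torques in [-1, 1]) is below. The layer sizes are assumptions, and the exact layer and constructor names vary across MATLAB releases.

% Critic network: Q(observation, action) -> scalar value (sizes are illustrative).
numObs = 29; numAct = 6;
obsPath = [featureInputLayer(numObs,'Name','obs')
           fullyConnectedLayer(400,'Name','fc1')
           reluLayer('Name','relu1')
           fullyConnectedLayer(300,'Name','obsOut')];
actPath = [featureInputLayer(numAct,'Name','act')
           fullyConnectedLayer(300,'Name','actOut')];
common  = [additionLayer(2,'Name','add')
           reluLayer('Name','relu2')
           fullyConnectedLayer(1,'Name','qValue')];
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, common);
criticNet = connectLayers(criticNet,'obsOut','add/in1');
criticNet = connectLayers(criticNet,'actOut','add/in2');

% Actor network: deterministic policy; tanh keeps the torque commands in [-1, 1].
actorNet = [featureInputLayer(numObs,'Name','obs')
            fullyConnectedLayer(400,'Name','afc1')
            reluLayer('Name','arelu1')
            fullyConnectedLayer(300,'Name','afc2')
            reluLayer('Name','arelu2')
            fullyConnectedLayer(numAct,'Name','afc3')
            tanhLayer('Name','tanh')];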
To create the DDPG agent in the toolbox, we specify the options for the actor and the critic. These are hyperparameters such as which optimizer you want to use, for example Adam or a stochastic gradient approach, [00:22:29] whether you want regularization, and the learning rates. We create the representations from the networks we defined in the previous step, and then there are some other hyperparameters we can set, such as the discount factor and the mini-batch size; these all go into how [00:22:52] the agent does its updates. For the reinforcement learning agent you can also specify noise options, which control how the agent will go about exploring the environment. Once we have the networks created, we construct the agent from the actor and critic representations and the associated hyperparameters for [00:23:15] the reinforcement learning agent. Now that we have defined the agent object, we can use Simulink: there is an RL Agent block in Simulink, so you can drag that in and place the agent in the rest of the system. So now that we have defined the agent, the reward function, and the environment of the walking robot, the next step is to [00:23:48] train the agent. For training our agent there are several choices. We can accelerate training by running simulations in parallel; this can be done on multi-core computers, on clusters, or in the cloud. Basically we have a client and several workers: the client tells the workers to run simulations, and when a worker is done it sends its data back, the agent updates itself, and then it sends the update back out to the workers to run additional simulations. This greatly accelerates [00:24:21] the learning process by distributing the simulations across resources. We also support training with GPUs, so you can run the inference for the actor and the critic on GPUs. This is generally more beneficial the deeper the networks are; shallower networks can actually run faster on the CPU because of the overhead of transferring data to GPU memory. Training the agent is pretty simple: we specify some training options, such as the maximum number of episodes to run and the maximum number of steps to take per episode, [00:25:01] and the stopping criterion for training; here we stop when the average reward reaches a particular level. During training we are using parallelization, and we send experiences back after every 32 steps that a worker takes; this is set through the parallelization options. Then we simply call train on the agent with the environment and the training options, and let it run; a rough sketch of all of this is below.
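Continuing the sketches above (which define obsInfo, actInfo, env, criticNet, and actorNet), agent construction and training could look roughly as follows. The option and constructor names follow the Reinforcement Learning Toolbox but vary by release, and every numeric value here is a placeholder except the 32-step parallel send interval mentioned in the talk.

% Wrap the networks in critic/actor representations (constructor names vary by release).
Ts = 0.025;                                   % assumed agent sample time [s]
criticOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNet, obsInfo, actInfo, ...
    'Observation',{'obs'},'Action',{'act'}, criticOpts);
actorOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNet, obsInfo, actInfo, ...
    'Observation',{'obs'},'Action',{'tanh'}, actorOpts);

% DDPG hyperparameters, including the exploration noise model.
agentOpts = rlDDPGAgentOptions('SampleTime',Ts,'DiscountFactor',0.99, ...
    'MiniBatchSize',128,'ExperienceBufferLength',1e6);
agentOpts.NoiseOptions.Variance = 0.1;        % exploration noise (placeholder value)
agent = rlDDPGAgent(actor, critic, agentOpts);

% Training options: stop on average reward, run episode simulations in parallel.
trainOpts = rlTrainingOptions('MaxEpisodes',5000,'MaxStepsPerEpisode',1000, ...
    'StopTrainingCriteria','AverageReward','StopTrainingValue',100, ...
    'UseParallel',true);
trainOpts.ParallelizationOptions.Mode = 'async';
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Experiences';
trainOpts.ParallelizationOptions.StepsUntilDataIsSent  = 32;   % as mentioned in the talk
trainingStats = train(agent, env, trainOpts);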
[00:25:33] So here is a video of it running. Initially it is running in real time, and then eventually we speed it up for time purposes. You can see the episode count; initially we don't see it learning very much, but once it learns to take a few steps, the walking improves quite a bit. [00:26:02] Here we have the episode reward plotted over the episodes. As is typical with reinforcement learning, it goes through peaks and valleys during learning, like many other methods, but the good news is that the trend is moving upward, so the walking is improving [00:26:23] the more samples we take. Here we see that we ran 2418 episodes, and finally, with the trained agent, when we press play and walk with it, you can see the agent walking well along a straight line; it starts to deviate but moves back to the center line. So here we saw [00:27:01] an accelerated view of the training process and the final walking result, but let's look in a little more detail at how we got here, at what is called reward shaping. Reward shaping is part of training: a lot of times you will see the agent having trouble learning, so you want to give it more incentive to learn and to improve sample efficiency, and often there is a feedback loop you go through, adjusting the reward function. [00:27:33] So why does the reward function matter so much? If you were just starting out you might say "I want the robot to walk forward", so let's start by creating a reward which is purely forward velocity. If we run with just this reward, it tends to get stuck in a local minimum of simply falling forward, since that gives it an initial spurt of velocity. If we trained it longer with more exploration, maybe it would eventually learn to walk, but there is not much guiding the robot toward the type of behavior it should be trying to achieve. The next step was to add a reward for each time step the robot remains upright, and also a penalty for not maintaining a certain torso height. When we did this we saw that the robot does a better job at walking; however, it's not ideal, it's very jerky, [00:28:28] and it looks like it's going to fall over a few times. But it did do what we asked: it moves forward and maintains a torso height. The next step was to add an additional penalty for [00:28:46] the energy used for actuation, so here we penalize the energy used in the motor torque for each of the joints. Now we start getting really good walking behavior, and it is walking more like a human. If you think about it, humans evolved to have pretty efficient walking, so it is to be expected that this type of behavior emerges once you start penalizing actuation: you get much more efficient movement. However, you saw that the robot doesn't really maintain the center line; it deviates from it, and we never told the robot explicitly in this reward function that that was part of the task. [00:29:27] So the next step was to add a penalty for deviating from the center line. When we do that, we get good walking behavior, and when the robot starts to deviate from the center line it tracks back to it. The main takeaway here is that the reward function matters: the sample efficiency with which a reinforcement learning algorithm learns a particular behavior can be dramatically affected by how you incentivize the agent through the rewards.
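The reward-shaping progression just described can be summarized as successive versions of the reward. The weights and the sample values below are purely illustrative placeholders, reusing the ones from the earlier reward sketch.

% Illustrative values for one time step (all made up).
vx = 0.6; y = 0.05; z = 0.95; zNominal = 1.0; Ts = 0.025; torques = 0.1*ones(6,1);

% Successive reward versions described above (weights are placeholders):
r1 = vx;                                 % v1: forward velocity only           -> learns to fall forward
r2 = r1 + 25*Ts - 50*(z - zNominal)^2;   % v2: + upright bonus, height penalty -> walks, but jerky
r3 = r2 - 0.02*sum(torques.^2);          % v3: + actuation penalty             -> smoother, human-like gait
r4 = r3 - 3*y^2;                         % v4: + center-line penalty           -> walks and tracks the line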
[00:30:05] Now that we have a trained agent that we like — and we did all of this in simulation — we want to deploy it onto our robot. Currently there are a couple of ways you can deploy to a target. Here we have our policy trained in the simulated environment, and we are going to take this policy and put it onto the target hardware to see how well it works in the real environment. [00:30:39] You can do this with MATLAB Coder or GPU Coder: you can generate C or C++ code for the policy, or CUDA code if you are running on a GPU, to run the policy on an embedded device. [00:30:59] Basically, you take the policy, run it through our coder products, and the code is generated automatically. Using MATLAB Coder or GPU Coder we can also target different vendor libraries, such as Intel MKL-DNN, NVIDIA cuDNN or TensorRT, or ARM Compute libraries.
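A deployment sketch along the lines described, generating code for just the trained policy, is below. generatePolicyFunction and the coder configuration calls are from the MathWorks toolchain, while the observation size and the choice of a library target are assumptions.

% Generate a standalone policy evaluation function from the trained agent.
generatePolicyFunction(agent);            % creates evaluatePolicy.m plus a data file

% Generate C/C++ code for the policy with MATLAB Coder
% (for CUDA, a GPU Coder configuration such as coder.gpuConfig could be used instead).
cfg = coder.config('lib');
codegen -config cfg evaluatePolicy -args {ones(29,1)} -report   % 29-element observation assumed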
The key takeaways from my talk are these. We can use reinforcement learning to solve complicated control and decision problems. One of the nice things about reinforcement learning, as we saw, is that a large number of observations can come straight into the control system; typically you would have to do perception and sensor fusion to reduce them to a manageable subset that a controller could use, but reinforcement learning scales well to observing a large amount of information and determining the appropriate action. [00:31:45] Reward shaping is very important in improving learning outcomes; we find this is one of the trickier parts, coming up with good reward functions that improve sample efficiency and map to the desired behavior you want your agent to learn. And MATLAB, Simulink, and the Reinforcement Learning Toolbox provide a complete workflow for deep reinforcement learning. Here we went through the example [00:32:15] of this biped robot; this is a shipping example, so you can find it in the documentation as well. So that ends my presentation. I think I went through it relatively fast, but I'm happy to answer any questions. >> One question: is this example available already? [00:32:44] >> Yes, yes, it ships with the toolbox. So the Reinforcement Learning Toolbox is what is required to [00:33:05] run the reinforcement learning itself. In this example we also use Simulink and Simscape Multibody to model the robot, so the environment would also require Simulink as well as Simscape Multibody. We ship this example with the product, and we also have some videos: if you go to the MATLAB Tech Talks there is a series of videos [00:33:33] on reinforcement learning, and this example is featured there as well. >> Great. You have used this for the walking robot; have you, or the customers you work with, tried other applications with this toolbox? [00:34:05] >> Yeah, we've worked with a lot of customers; we've probably engaged with over 150 customers on various applications. We see a lot of problems in terms of scheduling. A particular example might be scheduling pumps: say you have pumps pumping water to a water tower and you are trying to meet demand for the water supply; how do you optimize the scheduling of those pumps to minimize cost [00:34:42] while satisfying the demand and the variations in demand? That's an abstract representation, but we see a lot of scheduling-type problems. We've also seen problems in radio communications, and we see a lot of problems in heavy equipment. [00:35:05] So there is a large number of applications. We also have a lot of customers that previously, or currently, do a lot of dynamic programming, and they are looking at whether they can reduce the engineering overhead of setting up a dynamic programming problem by using reinforcement learning; that's another area as well. >> There are a lot of questions here, let me go through them. One is: how long, in real time, did it take to train this problem? [00:35:32] >> Let me go back to the training plot. Unfortunately it shows the elapsed time in seconds — it says 1514 seconds, if I'm reading it right — and I'm not good at doing math in my head. [00:36:06] So that's the elapsed training time, and I think we had 8 workers, so you can do the math; maybe I'll have a better answer in a few seconds. >> Sally is asking: in this example, could the agent be deployed to a real robot rather than only run in the model — and is it only the policy that is deployed, not the learning itself? [00:36:41] >> So we didn't actually deploy this one to a real robot yet, but we have deployed reinforcement learning policies to some simple hardware examples, where we've done experiments deploying the trained policy. Currently we don't support learning on hardware explicitly, only deploying the policy, and [00:37:10] we are working on providing functionality to learn directly on the hardware as well. >> Anonymous asks: is it possible to import network weights, for compatibility with, for example, PyTorch? >> Yes. There is a format called ONNX — I forget exactly what it stands for, Open Neural Network Exchange. ONNX is a generic format which many of the popular frameworks [00:37:44] support as an intermediate or interchange representation. We support ONNX import. There are some layers that we don't support, so what happens with a few imported networks [00:38:13] is that if they have a custom layer from PyTorch, or a layer that we don't have explicitly, a placeholder layer is put in for it. If you go to our documentation page, it documents all the layers that we support, and we support all the popular ones. It's one of our popular features; ONNX is a popular way of sharing networks between different platforms. >> Is it compatible with PyTorch? >> Yeah, it's compatible with PyTorch. I don't know about TensorFlow explicitly, but there is an importer for Keras as well. [00:38:48] But yes, most of the popular packages leverage this open neural network exchange format.
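On the ONNX import question, a minimal sketch is below; importONNXLayers requires a Deep Learning Toolbox support package, the file name is hypothetical, and the exact import options vary by release.

% Sketch: bringing in a network exported from PyTorch/Keras via ONNX.
lgraph = importONNXLayers('policy.onnx', ...    % hypothetical ONNX file
    'OutputLayerType','regression');            % unsupported operators become placeholder layers
findPlaceholderLayers(lgraph)                   % list any layers that need manual replacement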
>> And regarding the training structure: is the policy trained online, or do you first collect data and then train the policy? You partially answered this. >> Yes, the policies here are actually trained online. [00:39:17] So this is different from supervised learning, although you can use previously collected data: if you have what is called an off-policy method for reinforcement learning, you can use a previously collected data set and learn from that. But here we are continuously running simulations and learning while the simulations are running, so the policy is dynamically updated as you learn. [00:39:42] Typically you get much better results this way than by trying a purely supervised learning approach, because it tends to converge better toward an optimal solution. >> In the interest of time, one practical question: will attendees have access to the slides? [00:40:08] >> I can follow up on that afterwards; I do know that you are planning to publish the recording. >> The next question asks: what does the critic network learn, again? >> Yeah, so the critic network is trying to figure out, given the current state and the action you take, and assuming you then follow the current policy, what your long-term reward is. So it is trying to estimate how good the action that you took was. [00:40:43] If you think of the word "critic": if you are going to criticize something, like an actor, if you are going to criticize somebody's performance, [00:41:05] you associate a value with it. So you can think of the critic as evaluating the actor, saying "I give you a 10", and the actor then uses that information during learning to achieve better scores, a better evaluation from the critic. It's like a movie critic. >> So the critic is like the movie critic, and the actor is the one in the movie? >> Yeah. [00:41:29] >> How did you come up with the reward function? How much domain knowledge do you need to put into the reward? >> So typically, in any type of deep learning work, the more knowledge you can embed the better. When people look at reinforcement learning, there are camps that want to minimize how much information is put into the reward function; this can work for things like the Atari games, where the score is the reward function and you may not even know how it is computed. But in engineering applications you typically want to embed your domain knowledge into it. [00:42:09] Even if we had just set the reward function to forward velocity, we may eventually have gotten there with enough exploration, but it might have taken a very long time. If you can put some domain knowledge into the design of the reward function, you often improve the sample efficiency with which you can learn. [00:42:33] You can start with a naive take; typically what we do is assume we have a good reward function, evaluate what kind of behavior emerges [00:42:51] from it, and then look at whether there are terms we could add to further improve the behavior. Sometimes it is hard to define a task in terms of a numerical reward, and there is often some playing with the weights: do you care more about tracking, or do you care more about actuation effort? It's similar to tuning the weights of the cost function in a classical optimal control problem. [00:43:17] >> How easy is it to build the robot model itself, for example importing it from CAD? >> Yes, so with Simscape we can actually import CAD, and we have several examples where
we've imported different robot arms from CAD models directly. Sometimes you have to do a little bit of cleanup of the model, but it captures the joints and [00:43:52] the rigid bodies, so typically if we have a CAD drawing of the system we use that to import it. You can also start from your equations in MATLAB, or leverage something like the Robotics System Toolbox, which already has some simulation models and representations you can use. [00:44:19] But it's pretty easy using Simscape: you basically set the inertias and so forth for each of the rigid bodies and whatever joints there are. It's pretty straightforward to create, and we have examples in the documentation and videos of how this is done. >> Is embedding your knowledge in the reward function the best way to guide the training? [00:44:40] >> Yeah, I think embedding your domain knowledge into the reward function is key. Sometimes you simply didn't provide enough information to solve the task, and some mistakes only become obvious when you see the behavior that emerges. [00:45:12] With one reward function, what we asked for was just forward velocity while minimizing actuation, and what the robot learned was to make the least amount of motion that would make it fall forward, using just enough torque to fall forward. Which makes sense: you are asking it to move forward while also minimizing how much control effort it uses. [00:45:34] So the behavior that emerged was to use just enough actuator effort to fall forward. Sometimes when you see that, you say: well, I also want it to remain upright, so let me try to maintain the torso at a certain level. >> With neural networks as function approximators, how do you know the training will converge? >> Yeah, so one of the challenges with reinforcement learning, [00:46:10] when you have a neural network as the function approximator, is that none of these methods really guarantee convergence, as far as I am aware. If you have linear policies and critics, there are some proofs where you can guarantee convergence, [00:46:39] but typically you cannot, and you will see a lot of different algorithms, such as DDPG, TD3 — twin-delayed something, I forget what the rest of the acronym stands for — and others such as SAC, soft actor-critic. These methods trade off convergence properties versus sample efficiency; they don't necessarily guarantee convergence in all cases, but some of them tend to have better convergence properties in practice. >> What about PPO, for instance — does it behave better? [00:47:06] >> Sometimes, yes. It depends on what type of policy you are using; for different policies there are different convergence conditions. [00:47:36] >> Another question, which you have already partially answered: how is the critic itself learned — does it use the reward of the action, following a greedy approach? You answered it partially.
[00:47:58] >> Yeah, as I mentioned before — let me go back to this slide. The critic actually takes the reward coming in: we have the critic's current estimate of the value here, and then, taking the reward and the value of the next state and action, [00:48:20] we can determine what the new best estimate is based on the current data. Using this — essentially the Bellman equation — we can compute an update to the parameterization of the critic. So the critic is continuously learning as well, and basically the critic is good when [00:48:45] this error goes to zero.
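For reference, the critic update described in this answer — take the reward plus the discounted value of the next state and action, compare it with the critic's current estimate, and drive that error to zero — is usually written in the standard DDPG form (a textbook formulation, not copied from the slides):

y_t = r_t + \gamma \, Q'\!\left(s_{t+1}, \mu'(s_{t+1})\right), \qquad
L_{\text{critic}} = \frac{1}{N}\sum_t \big(y_t - Q(s_t, a_t)\big)^2

Here Q is the critic, \mu is the actor, and Q' and \mu' are their slowly updated target copies; the critic is "good" when the error y_t - Q(s_t, a_t) goes to zero, and the actor parameters \theta are then updated by ascending the gradient of \frac{1}{N}\sum_t Q\big(s_t, \mu_\theta(s_t)\big).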
>> How do you balance the different terms in the reward function? >> Yes, so much like other design methodologies, when you have a cost function or a reward function, coming up with the relative balance between the terms is the hard part. Sometimes it is a matter of normalization: [00:49:19] if you get all the signals onto the same scale, then it is usually easier to weight them on equal grounds, whereas if the values are very disparate in magnitude, you need to account for that in the weights. So one approach is to normalize all the components of your reward function and then start [00:49:40] tweaking the weights based on the importance of each term: are you more worried about straying from the path, or are you more worried about staying upright? There is a trade-off there, so much like any optimization cost function, you do some scaling and then balance the weights according to your desired objectives. [00:50:13] >> Is the reward always a function of the state and the action? >> The state and action, and sometimes also the next state — for example, the case where the next state takes you into a bad location. But yes, typically the reward is a function of the state and the action, and sometimes also of the next state resulting from taking that action in that state. [00:50:49] >> How is the actor updated when the actions are continuous — do you use a gradient? >> Yes, it uses a gradient-based method to compute the updates to the network weights, and in this case the actions are continuous. [00:51:27] >> Can you say something about on-policy versus off-policy methods? >> Yeah, so on-policy versus off-policy: the main difference is that if you hear of things like policy gradient methods, those are on-policy methods. On-policy means that you need to learn from the data that the current policy generated: for a particular set of parameters of the policy, [00:51:56] when I do my update I have to use the data from that policy. So if I collected data previously, changed my policy, and then collected some more data, I can't use the old data anymore to update my policy, because the method requires data generated with the current policy. Off-policy means I can learn from data from another policy: I could have a whole other policy running, or I could have data collected with the same policy or with a different policy that has a different parameterization. [00:52:38] One of the things that is nice about this is that you can use things such as replay buffers, where you store previous states and actions and reuse that data over and over again. Off-policy methods also allow you — say you ran a bunch of experiments in the past, or trained a different agent — to run that data through an off-policy method and reuse what you previously collected. [00:53:07] For example, if you have data collected from another controller, or from hardware experiments, you could still use it with an off-policy method. It just means these methods can use data generated with a different policy, whereas on-policy means I can only use data generated with the current one. >> In off-policy learning we use what is called the behavior policy; is that the policy that actually generated the data? >> Yeah. >> Very good. We have asked many questions already, but there is one more: everything so far assumed a reward function that you design, but what about the cases where even you yourself don't know the desired behavior? [00:54:06] >> Yeah, so you have to have some way of encoding the behavior. If there is no feedback telling you whether you are doing something good, there is no way to learn how to change your behavior. Even with a dog or a kid, you give verbal feedback, or you clap when something is exciting; that is feedback. Without some sort of reward [00:54:38] there is no way to incentivize behavior. Now, you may not know the reward function explicitly. There is something called inverse reinforcement learning: let's say you have an expert, an expert driver or operator, and you want to learn what reward makes that person an expert. The idea appears both in optimal control and in reinforcement learning: you have a policy which you say is optimal, and you want to figure out what cost or reward function that person is implicitly trying to optimize. [00:55:17] So that's one way: if you have an example of an expert, you can use those types of techniques to try to back out the reward function that they are maximizing. >> Perfect. Unless somebody wants to follow up on the questions, I'll just say that I think there are a lot of interesting things that you are working on, [00:55:50] and every time I see a presentation from you and your group, you are always upping your game. So thank you very much for taking the time — I know you are busy — to give this presentation to us and our students. Thank you very much again, and if anybody has any further questions, direct them to me, I will ask Greg, and we can go from there. [00:56:18] Thank you very much, Greg, and thank you everybody for being here. >> Thank you very much, and thanks for all the great questions.