Let's get started. It is my great pleasure to introduce today's speaker, Koushil Sreenath. He is currently an associate professor of mechanical engineering at UC Berkeley. As many of you already know, he is a leading researcher in dynamic nonlinear control and legged locomotion, and his work has been covered by numerous media outlets. He is the recipient of the NSF CAREER Award, a Hellman Fellowship, a best paper award at RSS, and a Google Faculty Research Award. I believe the talk will be super fun and informative, so please welcome our speaker.

Thank you. It's great to be at Georgia Tech. I was here about seven years ago for one of these seminars, and things have changed so much. You have such a nice space — I'm actually jealous. You can do robot experiments in so many different places, treadmills and all; that's amazing. And it's great to see so many robotics students here. So thanks for having me; it's wonderful to be here. I was a little worried about this time slot — I thought I'd be the one keeping you away from lunch — but I guess we have food here, so that's awesome.

Today I'm going to talk a little bit about legged loco-manipulation, and you'll need to wait until the end of the talk to see how the title makes sense. The idea is to combine the use of legs for both locomotion and manipulation, and we'll be looking at the interplay between model-based control and reinforcement learning. This is probably one of my first talks centered on reinforcement learning, so fair warning: if you ask hard RL questions, I may not be able to answer them. My training is more on the model-based side.

Before I get started, I want to thank all my collaborators and the funding agencies that made this work possible, and more importantly my students, who produced all the results I'll show. They get amazing stuff done and I steal their credit when giving these talks; if it weren't for them, none of this would have happened. These are some of my PhD students.

To start off, I'm going to show you a video of the state of the art in locomotion from the animal kingdom. Let's see if there's volume here — the sound isn't playing; anyway. What you're looking at is a high-speed chase of a mountain goat, with a snow leopard doing the chasing, and essentially its life depends on this: miss a single step and you're dead. These animals have amazing limb coordination and visuomotor coordination. Sorry to show you this during lunch. Now, if you felt sorry for the mountain goat, you should actually feel happy for it — it's the snow leopards that are going extinct. So that's the lighter side.

My research vision is to create the next generation of highly dynamic robots. By highly dynamic I mean speed, agility, efficiency, robustness, and hopefully some notion of safety thrown in as well. We'll explore what these concepts mean as we go.

I'll take you through a whirlwind of different concepts. We'll start on the left with model-based methods and go all the way to data-driven methods on the right. We'll start with control Lyapunov functions and control barrier functions, which can formally guarantee stability and safety. These don't work well if you have model uncertainty, so we'll then look at traditional ways to address model uncertainty: robust control or adaptive control. We can also use another approach where we combine model-based and data-driven techniques.
So we can bring in RL or Gaussian processes to address the model uncertainty, and hopefully still preserve some sort of guarantees. Then we'll switch all the way over to the other side and look at pure reinforcement learning. Here we'll explore network architectures all the way from RMA to transformers. We'll also look at RL that outputs torque directly rather than outputting positions. We'll look at some fun applications — how to apply this to different tasks in soccer, from shooting a goal to goalkeeping. Then, finally, we'll close the loop and incorporate some model-based knowledge to make the learning sample efficient.

Okay, so let's get started with control Lyapunov functions and control-Lyapunov-function-based quadratic programs. If you've taken a Robotics 101 course, you have probably seen that most robot models can be written down this way. These systems are nonlinear, but they are also control affine — linear in the control input — so any controller you develop is probably for this particular class of nonlinear systems. Now, the gold standard for formally guaranteeing stability of nonlinear systems is through Lyapunov functions. Classically, you have a Lyapunov function V that goes to zero at the equilibrium point, and you want to pick an input that drives V̇ negative, which means V decreases and the state converges asymptotically to the equilibrium. If you can find such a Lyapunov function, your job is done and you can go celebrate — but finding one is not easy.

What we want here is something stronger than asymptotic stability: we want V to go to zero exponentially fast, by enforcing V̇ + λV ≤ 0. If you satisfy this constraint with equality, you are going exactly down this dotted line; if you satisfy it as an inequality, you are going under the dotted line, so V is upper bounded by an exponentially decaying function. The goal is to choose a control input that enforces this constraint directly — this is an exponential control Lyapunov function. And V̇ itself has an affine structure in u through these Lie derivatives; if you haven't seen this before, don't worry too much about it.

Alright, let's think a little about how to pick this u to guarantee the constraint — and it turns out people already did this back in the 1980s. If I collect all the terms that don't multiply u, L_fV + λV, and call that ψ₀, and call the term that multiplies u, L_gV, ψ₁, then I want to enforce ψ₀ + ψ₁ u ≤ 0. Just to break the ice, let me ask you a question: if ψ₀ is negative, what's the best choice of u? Yes, zero — that's exactly right. I want to do the least possible effort: if the dynamics are already stabilizing, I don't have to apply any control input. If ψ₀ is positive, then to do the least possible effort I pick the u that satisfies the constraint with equality. That sort of controller already existed back in the 1980s — it's the classical pointwise min-norm controller. And it turns out this controller is exactly equivalent to a quadratic program with a single constraint: minimize the control effort subject to this exact constraint.
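As a concrete aside (not the robot code from the talk), here is a minimal sketch of that pointwise min-norm controller / CLF-QP. The names f, g, V, gradV and the rate lam are placeholders for whatever control-affine model and Lyapunov function you have.

```python
import numpy as np

def clf_qp_min_norm(x, f, g, V, gradV, lam=1.0):
    """Sketch of the pointwise min-norm CLF controller described above.

    Assumes control-affine dynamics  xdot = f(x) + g(x) u  (g(x) is n x m)
    and the exponential CLF constraint  Vdot + lam * V <= 0, written as
        psi0 + psi1 @ u <= 0,
    with psi0 = LfV + lam*V and psi1 = LgV.  The closed-form branch below
    coincides with the QP:  min ||u||^2  s.t.  psi0 + psi1 @ u <= 0.
    """
    dV = gradV(x)                     # row vector dV/dx, shape (n,)
    psi0 = dV @ f(x) + lam * V(x)     # terms that do not multiply u
    psi1 = dV @ g(x)                  # terms that multiply u, shape (m,)

    if psi0 <= 0.0:
        return np.zeros_like(psi1)    # dynamics already decaying: do nothing
    # otherwise push just hard enough to satisfy the constraint with equality
    # (assumes psi1 != 0 whenever psi0 > 0, i.e. the CLF condition holds)
    return -psi0 * psi1 / (psi1 @ psi1)
```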
Now, once you have an optimization problem, you can throw in other inequalities, other constraints — for example, constraints on the input. All robots have input saturation on the actuators, so I can add those. But the moment I add them, I have two constraints that could conflict, so I need to relax one of them: I relax stability a little bit and penalize the relaxation. This CLF-QP can run at 1 kHz on fairly complex robots. Here is MABEL — this experiment was actually run remotely, even before the pandemic days — so it does work in practice. That's one way we can handle stability.

Now let's take a moment and ask: what about safety? For this we'll use the concept of control barrier functions — some of you are probably familiar with them — and we'll try to combine them with Lyapunov functions. The motivation is the classical problem of walking on stepping stones. Humans do this really well; even this little kid nails the stepping stones. He does fail to account for the thin film of water on the stones and slips, but he does nail the stepping stones. So how do you guarantee that your legged robot always hits the stepping stone, and also guarantee that the friction cone is not violated? If you miss a stepping stone by a centimeter, you fall into a crevice. And you have to do this in real time — it's not as if you can pre-plan the trajectory.

For that, we look at control barrier functions. Essentially you have a barrier function B, and the set of states where it is non-negative is the safe set. I want to choose a control that keeps the state evolving in this set. The barrier function itself does not depend on u, but its time derivative does, so I can put a constraint on Ḃ that enforces that x stays in the set. For example, if I make sure Ḃ(x, u) is greater than or equal to minus gamma times B, then B is lower bounded by an exponentially decaying function, and I indirectly enforce the safety constraint. So now I can put both concepts together: Ḃ depends on u and I choose a u that satisfies this constraint. It's very similar to the control Lyapunov function; the only difference is the direction of the inequality — there you are upper bounded by an exponential, here you are lower bounded by one. In that case you want to evolve in one green region; here you want to evolve in this other green region.

You can design barrier functions for the task of walking on stepping stones and formulate all of this as a single optimization. Once again, if you have multiple constraints you can have infeasibility, so you relax stability to guarantee safety. With more constraints, the whole quadratic program could become infeasible, and if it is infeasible you lose all your guarantees. Assuming the optimization problem is feasible, you can make guarantees on stability and safety.

Alright, let's look at this in action. You can apply this to quadrotors: here's a quadrotor being completely manually teleoperated, and a safety filter kicks in when the operator tries to violate the safety constraint of leaving the allowed space. Or you can design CBFs to walk on a terrain of stepping stones.
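To show how the pieces of that combined optimization fit together, here is a hedged sketch of a CLF-CBF QP with a relaxed stability constraint and input limits, written with cvxpy. The ψ terms, penalty weight, and limits are placeholders, assumed to be computed elsewhere from the model; this is an illustration, not the controller from the talk.

```python
import cvxpy as cp

def clf_cbf_qp(psi0_clf, psi1_clf, psi0_cbf, psi1_cbf, u_max, rho=1e3):
    """Sketch of a combined CLF-CBF quadratic program.

    Constraint pieces (computed from the model, placeholders here):
        CLF:  psi0_clf + psi1_clf @ u <= delta   (stability, relaxed)
        CBF:  psi0_cbf + psi1_cbf @ u >= 0       (safety, hard)
    plus actuator limits |u| <= u_max.  The relaxation delta is penalized,
    so stability is sacrificed only when it conflicts with safety.
    """
    m = len(psi1_clf)
    u = cp.Variable(m)
    delta = cp.Variable(nonneg=True)

    objective = cp.Minimize(cp.sum_squares(u) + rho * delta)
    constraints = [
        psi0_clf + psi1_clf @ u <= delta,   # relaxed exponential CLF constraint
        psi0_cbf + psi1_cbf @ u >= 0,       # exponential CBF (safety) constraint
        cp.abs(u) <= u_max,                 # input saturation
    ]
    cp.Problem(objective, constraints).solve()
    return u.value
```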
Here the robot has knowledge of only one step ahead — it doesn't know the entire terrain, so I can't pre-plan the trajectory for the motion. And the feet are point contacts, so I can't stop and think about where to place my feet; I have to place my foot right away or I'm going to fall. You can do this on more complex systems too. This is the DURUS humanoid, and you can see the motions are dynamic — the torso is bobbing up and down, and the step length and step width are being modified on the fly. You can also do this with step height. Here we are combining control barrier functions with multiple periodic orbits: you can switch from one periodic orbit to another to increase your step length and step width variability. In fact, this was so effective that you can do quite a lot with just multiple periodic orbits.

Here it is in action on ATRIAS. Once again, the controller is given only one step of preview — knowledge of where the next step is — and the step length is changing stochastically. You can also have step height changes along with the step length. And if you notice, some of these stones are not bolted to the ground, so they actually wobble if you apply too much force; we also have to ensure we're not applying too much force.

Okay, so this is pretty cool, right? You don't really have a choice — you'll say yes because you're being nice. So that's great. What's one of the disadvantages? Any ideas? [Someone answers.] Sure — the Lyapunov function and the barrier function have to be designed. That's one aspect of it, but let's say I can do that design. Anything else? [Someone answers.] Yes, that's exactly right: I need a model of the system. And even if I've done the best possible system ID on the robot, I never have a perfectly precise model. I always have some nominal model, and the true model is different from the nominal one. I develop my controller on the nominal model, and I have guarantees on the nominal model. So the true V̇ and the true Ḃ are the V̇ and Ḃ computed on the nominal model plus some uncertainty terms, and those uncertainty terms are unknown. So how do I design something on the nominal model and get it to work on the true system? I'm going to focus much of the rest of the talk on this question.

The key point is that I don't know Δ_V and Δ_B. Both of these depend not only on the state but also on the input u, and because the system dynamics are control affine, these uncertainty terms are also control affine — so they have structure you can exploit. I'll quickly give you two ways to address this and then go into more depth on other ways. If you're doing model-based control, you can do robust control: essentially, I put an upper bound on the worst possible uncertainty in the deltas, and then I can run a robust CLF-QP or CBF-QP. This does work; I don't have time to go into it. Or you can look at the adaptive case — robust control is conservative because you always assume the worst case — where you learn the uncertainty delta online. We used L1 adaptive control, and you can deploy that on the robot as well. More interestingly, you can use reinforcement learning to learn these delta terms: here you have the QP, which is the min-norm controller on the nominal plant, and then a network that learns the delta terms to address the uncertainty. But the moment I use RL, what do I lose? Everybody knows: you lose the guarantees.
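To make the structure of those uncertainty terms concrete (the notation below is mine — the talk only sketches this verbally), the true Lyapunov and barrier derivatives decompose roughly as:

```latex
\dot V(x,u) \;=\; \underbrace{\dot{\tilde V}(x,u)}_{\text{nominal model}}
\;+\; \underbrace{\Delta_V(x) \;+\; \Delta_V^{u}(x)\,u}_{\text{unknown, but affine in } u},
\qquad
\dot B(x,u) \;=\; \dot{\tilde B}(x,u) \;+\; \Delta_B(x) \;+\; \Delta_B^{u}(x)\,u .
```

The robust, adaptive, RL, and Gaussian-process approaches mentioned here differ only in how they bound or estimate the Δ terms.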
Now, everything that I claimed with the CLF and CBF goes out the window, because I have a network in the loop. The next best thing I could do, instead of RL, is Gaussian processes: I can do GP regression to estimate this delta. Once I do that, I get stochastic bounds on delta, so I can write chance constraints on the CLF and the CBF, and the QP gets converted into a second-order cone program. I won't go into more detail on this either, because today I want to focus more on the data-driven side. We'll start with pure RL and then see how we can actually say something about guarantees.

Alright, let's look at reinforcement learning — reinforcement learning for a complex system such as Cassie. Georgia Tech has one of these as well, which is pretty cool. The idea is that I want a policy that looks at the current state and outputs an action, and the policy is trained using a reward. The main problem is that I don't know what the reward should be. I could engineer the reward to make it work — spend a lot of time tuning it — but we want a better way to do this. This was joint work with Sergey Levine and Pieter Abbeel's groups at Berkeley, and the idea is: we have a model, and we can compute periodic orbits for this model, so we can get a whole bunch of gaits. So why not do imitation learning? The reference motion comes from trajectory optimization and is a periodic gait, and the reward engineering is now replaced by imitating it. You can also add terms to minimize contact impacts and joint torques, and that becomes your reward — a lot easier. And you don't have to stick with one reference motion: you can have a whole bunch of periodic orbits for different velocities — walking forward, walking sideways, walking at different heights — take a command, pick one of these, and train a single network that imitates all of these gaits.

What does this network look like? It's a traditional RL setup; the only difference is that I have a gait library with lots of periodic orbits. Given a command, I pick one of these gaits and pass that reference to the policy. The policy itself has two inputs: one is the output of the system, and the other is the past output of the policy itself, so in some sense the neural network is doing a system ID on part of the system. It learns that part of the system and spits out motor positions; then a PD controller outputs torques to track those motor positions. The policy runs at about 30 Hz, the PD controller at 2 kHz.

Before you deploy it on hardware, you have the famous sim-to-real gap. Since you know a lot about the model, you can put all of that information in your simulator, and then you do domain randomization — the classical thing. We model three sources of mismatch, basically: dynamics uncertainty, sensor noise, and communication delay. If you address all of these in your domain randomization, the sim-to-real transfer potentially becomes easier. We also go one step further: we have a simulation model that is really fast for RL training, but before we put the policy on hardware, we run it in our high-fidelity Simulink simulator, which is more realistic, and if it works there, there's a higher chance it works on hardware. And in fact there is zero-shot transfer to this complex system, Cassie, and it does work. You can see more experimental results here.
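For reference, the gait-library imitation reward described above might look roughly like the sketch below; the weights and exponential kernels are illustrative, not the values actually used on Cassie.

```python
import numpy as np

def imitation_reward(q, qdot, tau, q_ref, qdot_ref,
                     w_pose=0.6, w_vel=0.2, w_torque=0.2):
    """Sketch of a gait-library imitation reward.

    q, qdot, tau    : measured joint positions, velocities, torques
    q_ref, qdot_ref : reference from the periodic orbit selected by the
                      current command (forward/sideways speed, height)
    """
    r_pose   = np.exp(-5.0 * np.sum((q - q_ref) ** 2))       # track reference pose
    r_vel    = np.exp(-0.1 * np.sum((qdot - qdot_ref) ** 2))  # track reference velocity
    r_torque = np.exp(-1e-4 * np.sum(tau ** 2))               # discourage large torques
    return w_pose * r_pose + w_vel * r_vel + w_torque * r_torque
```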
You can apply perturbations and all kinds of crazy stuff. Now, one thing we said is that we don't have any guarantees, and the least we would actually want for this system is some sort of stability guarantee. The robustness is already pretty good — the robot walks, gets perturbed against the gantry, and recovers; it's pretty impressive. So the main problem we want to address now is: is there a way to provide some stability guarantees?

If you work in nonlinear control, what is the classical way to prove that a nonlinear controller is stable? Any thoughts? I have a nonlinear system, I develop a nonlinear controller, and I want to prove the closed-loop system is stable — what's the first thing I would try? You could try Lyapunov, but it's actually pretty hard to find a Lyapunov function. Is there something easier? [Someone answers.] Linearization and eigenvalues — awesome. If I could linearize this closed-loop system, I could look at the eigenvalues and conclude local stability even for the nonlinear system. But linearizing a closed loop that contains a neural network policy is going to be a pain, even if it were possible.

The next best thing is to treat the entire closed-loop system as a black box. I excite it with a bunch of inputs and observe the outputs, then take those input-output pairs and do system identification, with the restriction that the model I fit is a linear model. With that single restriction, I can identify the system using standard system ID techniques. So I throw in inputs — a bunch of steps, a bunch of ramps, some chirps — and out comes the response of the closed-loop system with the policy. From this I identify a linear system, a transfer function. From the transfer function I can predict what the response should have been, and comparing the prediction with the true response gives me a fit of how good the model is: we compute an RMSE and get an accuracy for the model.

So what I have here is a transfer function. This closed-loop system takes commands — walk at a particular forward velocity, walk sideways, turn, a bunch of different commands — and I'm showing you one of those input-output pairs. There are four inputs and four outputs, so I identify four different transfer functions, each single-input single-output. Once I have these, the next thing I do is look at the poles and zeros of each transfer function, and from that I can conclude bounded-input bounded-output stability. It's still not asymptotic stability, but it's the best I can do — input-output stability for this linear system — and it provides guarantees that should transfer over to the nonlinear system in some local fashion. In fact, all four of these are BIBO stable. More importantly, they are also minimum phase, which basically means the system doesn't initially move in the opposite direction of the command; if it did, your planner would have to be a bit more careful.

Now, the system is actually multi-input multi-output, so you want to check what happens if you excite it with multiple inputs. If I excite it with one input, I get one output and fit a SISO model. But if I excite the system with two inputs simultaneously, I get an output, and I use the SISO models to predict that output.
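As an aside, here is a rough sketch of what one of those single-input single-output fits could look like. The actual work fits transfer functions from step/ramp/chirp responses; the discrete-time ARX least-squares form and the model orders below are stand-ins for that.

```python
import numpy as np

def fit_siso_arx(u, y, na=4, nb=4):
    """Fit a discrete-time ARX model by least squares from logged
    command/response data of the closed loop (policy + robot):
        y[k] = a1*y[k-1] + ... + a_na*y[k-na] + b1*u[k-1] + ... + b_nb*u[k-nb]
    Orders na, nb are arbitrary choices for this sketch.
    """
    n = max(na, nb)
    rows, targets = [], []
    for k in range(n, len(y)):
        rows.append(np.concatenate([y[k - na:k][::-1], u[k - nb:k][::-1]]))
        targets.append(y[k])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    a, b = theta[:na], theta[na:]

    # Poles of the identified model: roots of z^na - a1*z^(na-1) - ... - a_na.
    poles = np.roots(np.concatenate([[1.0], -a]))
    bibo_stable = np.all(np.abs(poles) < 1.0)   # all poles inside the unit circle
    return a, b, poles, bibo_stable
```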
And if that fit — using the SISO models to predict the two-input response — is still good, then I can conclude that my RL policy is decoupling the input-output mapping. In some sense, the RL policy is doing what's called input-output linearization: a single input controls only a single output of the closed-loop system. All of this you can find out through the linear fit. It's pretty interesting, though we cannot make the general conclusion that every RL policy does this — it's just what we observe in this particular system, and there are still a lot of open questions. So what can we conclude? We have BIBO stability, minimum phase, and decoupled dynamics — but this is only true if the input frequency is less than about 0.6 Hz. If I excite the system faster than that, a lot of nonlinearities show up in the output and that breaks the linear fit, so I have to respect that limit.

Once I have this linear model, I can use it to design a higher-level planner or a high-level safety policy on top of the closed-loop system. It will be interesting to see whether the guarantees provided on the linear model carry over to the closed-loop system — I'm particularly looking for safety guarantees. These are still preliminary results, but hopefully they show that we can at least start thinking about this. Note there is a low-pass filter between the planner and the policy, because the linearity breaks when the input is above around 0.6 Hz, as we observed.

Okay, let's look at a different kind of architecture we can use for RL on complex systems. We'll take one of the well-known results from Berkeley, from Jitendra Malik's group — RMA, rapid motor adaptation — and apply it to Cassie, which requires a few extra things. The way we applied RL earlier, we had a whole bunch of periodic orbits to take care of the reward. The problem is that if you make the robot walk on all kinds of different terrains, you have to change the periodic orbit, and maintaining so many periodic orbits becomes very cumbersome. If the policy could adapt to the terrain, that would solve the problem.

This approach has its roots in adaptive control, and it works as follows. I have a policy that generates walking. Its inputs are the previous action and previous state — a short history — but, more importantly, there is privileged information: the mass of the robot, the friction of the ground, and a whole bunch of parameters that I cannot observe at test time but do know in simulation. I take all of this, pass it through an encoder, and get some latent variables, which I feed to the policy. In some sense those latent variables characterize what kind of ground I'm walking on, and the policy can adapt to that. The problem is that I don't know this ground truth at test time, so I need to infer those values from a longer history of data. The adaptation module is essentially doing an observer's job: from the previous inputs and outputs it estimates the latent variables, and we do a regression to match the latents it produces to the ones coming from the privileged encoder, while the base policy stays fixed — that's why there's no arrow here. This works pretty well on quadrupeds. On Cassie, it turns out we don't have full observability: from this history you only get a noisy estimate of the latent variables, so we need to retune the policy to work with noisy observations of the latents.
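A minimal sketch of the two-stage RMA-style training just described, with made-up dimensions; in the actual method the adaptation module is a temporal CNN over the history, which is simplified here to an MLP over a flattened history.

```python
import torch
import torch.nn as nn

# Illustrative shapes only.
ENV_DIM, LATENT_DIM, OBS_DIM, ACT_DIM, HIST_LEN = 17, 8, 47, 10, 50

env_encoder = nn.Sequential(              # phase 1: privileged params -> latent z
    nn.Linear(ENV_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))

policy = nn.Sequential(                   # base policy: (obs, z) -> action
    nn.Linear(OBS_DIM + LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))

adaptation = nn.Sequential(               # phase 2: long obs/act history -> z_hat
    nn.Linear(HIST_LEN * (OBS_DIM + ACT_DIM), 256), nn.ReLU(),
    nn.Linear(256, LATENT_DIM))

def phase2_loss(history, env_params):
    """Regress z_hat (from the observable history) onto the frozen encoder's z;
    no gradient flows into the encoder or the base policy."""
    with torch.no_grad():
        z = env_encoder(env_params)
    z_hat = adaptation(history.flatten(start_dim=1))
    return nn.functional.mse_loss(z_hat, z)
```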
Now, if you do this, you can make the robot walk on slippery ground. I call it slippery ground, but you don't actually see the robot slipping much because the policy is good — just some tiny slips. You can also make this work on soft ground and a whole bunch of other things. You can make it pull a large payload, 40 kg. Here there is a cable on the robot, and the cable goes taut and then slack, so the force on the robot is changing quite drastically while it's pulling, and the policy is able to adapt to this.

Okay, let's switch gears and look at more dynamic motions: jumping. In particular, we want a single jump — a single hop — which is a lot harder than multiple hops. With multiple hops you have more time to stabilize; with a single hop I take off and then basically have to stick the landing, which is pretty hard. My elementary school kid calls this a bunny hop; I don't know the technical term — I think athletes call it a standing jump. So how do you achieve jumping where the landing location is specified? We start with a policy that has a couple of inputs and basically the same sort of outputs as before: motor positions into a PD controller. We provide the landing target — the desired step length and step height. We provide a reference motion, which comes from an animation; it's not dynamically feasible, it just shows what a jump should look like. Then we have a short history of the system's inputs and outputs — the last four samples. More importantly, we have a long history. The idea is that for a jump, the take-off condition is critical: once I'm in the air, angular momentum is conserved and I can't do much, so I need to know what my take-off was to actually nail the jump. We have a CNN that tries to capture that from a longer, two-second history.

We train this on different jumps — first a single jump, then multiple variations. Here is the same policy executing different jumps, and you can see these are pretty large jumps. The same policy is also used to do the landing — that's one reason you'll see a little bit of oscillation. You can do different jump lengths onto targets, with step length and step height specified. By the way, this robot is end-of-life, which means that if anything goes wrong with it, the company is not going to support repairs — even more pressure on the student. If you look at the slow motion, it looks pretty dynamic, very agile. In fact, I've tried some of these high jumps myself: I can land, but I can't stick the landing — I keep going forward because of the momentum. So it's pretty interesting. This is the longest jump we got, 1.4 m, and with this combination of length and height it's actually pretty hard for a human to jump and stick the landing. If you've been doing box jumps at your gym, you should come by the lab and show us how it's done.

Okay, let's move on to another architecture: transformers. You've probably all heard of transformers in various contexts. If we look at the neural network architectures that have been applied to locomotion — this is a brief subset of the literature — MLPs, CNNs, and LSTMs have all been applied. The question is: what's next?
Transformers have been around for a while, revolutionizing almost every field they touch — why not locomotion? Our hypothesis was that, from the history of observations and actions, the network could do in-context learning to react to slips, missed steps, or stumbles. So we asked: can we do that with transformers? We applied a fairly standard transformer architecture; the only difference is that it's a causal transformer, so it doesn't look into the future, only at past tokens. We tokenize observation-action pairs, feed in a context of 16 of them, and predict the next action. Amazingly, it works in practice — I didn't expect it to. We first had to get a simulator running on GPUs, which took a lot of time, and then training took about 12 hours; the students somehow discovered a machine with A100 GPUs on campus. We got a policy that transfers zero-shot to hardware. It has never been trained on slopes, yet it's robust to slopes. These are preliminary results, but you get emergent arm-swinging behavior — there is nothing in the reward that says "swing your arms."

The students threw everything they could find in the lab at it to check robustness, including Ethernet cables, which got destroyed. You'll actually see it try to step over things or step back — at least, that's what we hypothesize it's doing. It's fairly robust to perturbations. Here's Ilija putting a loaded backpack on the robot and checking robustness. But that by itself is not much of a robustness test — you can throw tiny boxes at it, but come on, nobody is going to be convinced the robot is robust by that. So then they found something bigger to throw at the robot, to push it to failure; here they are trying even harder on this robot as well. The robot is surprisingly robust. It's surprising what this transformer architecture has learned, and it will be amazing to see what you can do with it going beyond locomotion, combining it with manipulation. If you're interested, you can also look at the attention — the attention is time-varying. We have yet to generate a nice plot of this; we have four heads, and I think we need to look over multiple steps and see what it attends to. My hypothesis is that the first step after contact is probably the most important and that's where it pays the most attention, but we need to check whether that's true.

Okay, one other neural network architecture. One of the things your controls colleagues will tell you, if you come from a controls background, is: what about that neural network you have there? You still have a controller — in some sense the feedback is still done by a PD controller, not by the learning. This is the traditional architecture for most policies on hardware in robotics: a neural network spits out joint positions, then a PD controller — with gains you spend a lot of time tuning — spits out joint torques. The network runs at a low frequency and the PD loop at a high frequency. I got tired of these questions, so I said, okay, why don't we take out the PD controller entirely and have a pure learning-based policy? In the traditional architecture the learning is arguably playing the role of planning on the fly; if I take out the PD controller, the network has to run at a higher frequency and the learning actually becomes the controller.
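The two control stacks being contrasted here can be sketched as follows; the gains and rates are illustrative, and the policy objects are placeholders.

```python
# Sketch contrasting the two stacks discussed above (values illustrative).

def position_policy_step(policy, obs, q, qdot, kp=40.0, kd=1.0):
    """Conventional stack: the policy outputs target joint positions at a low
    rate, and an inner PD loop (run much faster, e.g. ~1-2 kHz) converts
    them to torques."""
    q_des = policy(obs)                      # low-rate "plan": desired joint positions
    tau = kp * (q_des - q) - kd * qdot       # high-rate PD tracking
    return tau

def torque_policy_step(policy, obs):
    """Torque-based stack: the policy itself runs fast (e.g. ~500 Hz on the A1
    mentioned in the talk) and outputs joint torques directly -- the learning
    is the controller."""
    return policy(obs)                       # torques straight from the network
```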
That's what we did on the A1 robot: no PD controller, with the network running at a higher frequency, 500 Hz, and it does work in practice. You can do locomotion on different terrains, including on granular media — it doesn't come anywhere close to what Dan Goldman does, but it's our poor man's approximation. We can also do what I call torturing of robots: the students get more and more abusive, and it's quite a sight. This is actually used to show that the torque-based policy is compliant and works well. And if you're in the Bay Area, you can do all these fancy shots with the Golden Gate in the background — totally useless for a paper, but the students wanted to do it. Here's a comparison of position-based RL with torque-based RL: the position-based policy basically fails here, while with the torque-based policy the torso almost hits the ground but the robot still recovers. It's pretty compliant, pretty good.

Okay, I have maybe around ten more minutes, so let me talk about some fun projects. Here we'll look at loco-manipulation: how we can use legs not only for locomotion but also for manipulation, going along this new axis. We'll look at soccer shooting — any soccer fans? In most countries outside the US, soccer is like a religion; in the US not so much, but hopefully we can help change that.

So, soccer shooting. The problem is the following: I have a quadrupedal robot and a regulation-size soccer ball, maybe size four, and I want to shoot it into a goal. We apply RL in a fairly classical way; the only difference is that the network has an input which is a Bezier polynomial, and that polynomial specifies the end-effector trajectory of the foot that hits the ball. We can look at this network and see that it's pretty robust — while it's kicking, you can perturb it and check robustness. Then we add another policy on top that spits out those Bezier polynomials; we make sure this works in hardware, and this higher-level policy changes the way you kick by changing the Bezier polynomial. You can then close the loop with vision.

Now, it turns out that if I do pure zero-shot transfer to hardware, it fails, because the soccer ball in my simulation is nice, rigid, and spherical. In the real world it's not a sphere — it's more like a polyhedron with some 16 or 18 panels — it's compliant, and its friction properties change as it rolls on the ground. So you actually have to learn from real-world data as well and fine-tune the policy. Let's look at this in practice. I have a whole bunch of different Bezier polynomials to do the kick, and we do domain randomization to make sure all of this works in hardware. But when we apply it to the task of hitting a goal, it almost always fails. So we collect some data and retrain the network with the experimental data, and it turns out it then succeeds most of the time. Our goal is just this little box with a tag, and we had to take about 130 shots — so we had students doing this 130 times: place the ball, let the robot kick, run after the ball, bring it back, repeat. This works pretty well. Here's our accuracy map. Since we're using experimental data, the policy actually learns to use the wall and bounce the ball in, which is more robust — a little lazy, I think. It also learns to curve the ball toward the goal, just to show off. It does fail in some cases if the ball is a little too far away.
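Since both the shooting and (as described next) the goalkeeping policies take a Bezier polynomial as their end-effector command, here is a small sketch of evaluating such a curve; the control points below are made up for illustration.

```python
import numpy as np
from math import comb

def bezier(control_points, s):
    """Evaluate a Bezier curve at phase s in [0, 1].

    A higher-level policy outputs control points like these, and the
    low-level policy tracks the resulting end-effector (foot) trajectory.
    control_points is (n+1, 3) for a degree-n curve in 3D (illustrative).
    """
    pts = np.asarray(control_points, dtype=float)
    n = len(pts) - 1
    coeffs = np.array([comb(n, k) * s**k * (1 - s)**(n - k) for k in range(n + 1)])
    return coeffs @ pts

# e.g. a degree-3 kick trajectory swept over the swing phase:
foot_path = [bezier([[0, 0, 0], [0.1, 0, 0.1], [0.3, 0, 0.15], [0.4, 0, 0.05]], s)
             for s in np.linspace(0, 1, 50)]
```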
Okay, so if you can shoot at a goal, the next problem in soccer is the task of goalkeeping. For this we actually had a student from Georgia Tech help us — he's right here, and he's going to be joining Berkeley soon. He ended up in our lab for the summer and said, okay, I'm going to try soccer goalkeeping. When my PhD student pitched this problem, I thought: no way, this is not going to work — there's no way a robot can jump up in real time, use its legs as arms, block the ball out of the air, and stop a goal. That's what was going on in my head; I didn't vocalize it. I told the students, hey, good idea, go for it — something you learn as a young faculty member from watching senior colleagues, who have been doing this all along. Anyway, the students disappear for a while and then come back and say, hey, we have this working.

What's happening here is that we have a bunch of different skills and we train a policy for each. Start with the simple skill of sidestepping: the ball is rolling toward you, close by, so the robot just sidesteps to block the goal. Then there's a ball that's up near the top of the goal, so you have to jump to block it. Or the ball is far away, toward the goalpost, so you actually have to dive to block it. So we have three different policies that we train. All of them take another input, which is a Bezier polynomial describing the end-effector trajectory of the foot — the foot is now effectively acting as the hand. On top of those, we put another policy that picks which of these skills to use, as well as the Bezier command for that skill. And we close the loop with an object detector that finds where the ball is and predicts the ball's position and velocity.

Alright, I'm going to play a video and try to stay quiet. If you roll the ball, it works pretty well; if the ball is a little too far away, it goes into the diving policy; if you throw it up, it jumps, and if you watch carefully it's using its front legs to box the ball away from the goal. Then they called me and said, hey, this works, so I tried some real shots. Here the policy has actually learned to do a header — look at the slow motion; that's pretty impressive. It's not always successful: I think we have about an 87% success rate. Professional goalkeepers are around 65%, but that's not a fair comparison — the ball comes at a much higher speed, and the goalkeeper is usually moving even before the ball does. We found that we needed all three skills to get that accuracy — the success rate improves with the number of skills — and it can do all this crazy stuff.

Now, we also have the robot that learned to shoot, so we said, okay, let's have the shooter play against the keeper and automate a few things. It turns out the shooter is not that good — or the goalkeeper is really good — and it doesn't score much. So I said, okay, let's try the next best thing and get some kids, soccer players on their school teams. Their kicks are pretty slow. Here we're using a camera, and the camera is not fast enough, so for faster kicks you have to use a motion capture environment. Here, in the motion capture setup, the robot kicks the ball back so fast that it surprises the shooter. Okay — maybe a few more minutes?
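To illustrate the hierarchy, here is a toy stand-in for the skill selector; in the actual system the selector is itself a learned policy that also outputs the Bezier command, whereas the hand-written rules and the x = 0 goal-plane convention below are just assumptions for the sketch.

```python
import numpy as np

SKILLS = ["sidestep", "jump", "dive"]   # the three trained low-level policies

def select_skill(ball_pos, ball_vel, reach=0.4, t_horizon=1.0):
    """Toy geometric stand-in for the high-level goalkeeper policy.

    Assumed convention: goal plane at x = 0, ball approaching from x > 0,
    positions/velocities in metres and m/s from the ball detector.
    """
    ball_pos, ball_vel = np.asarray(ball_pos), np.asarray(ball_vel)
    # Predict where the ball crosses the goal plane within the horizon.
    t_cross = np.clip(-ball_pos[0] / min(ball_vel[0], -1e-6), 0.0, t_horizon)
    intercept = ball_pos + t_cross * ball_vel

    lateral, height = abs(intercept[1]), intercept[2]
    if lateral <= reach and height <= 0.15:
        return "sidestep"      # ball close and rolling low: just step across
    if lateral <= reach:
        return "jump"          # ball close but in the air: jump and block
    return "dive"              # ball far toward the post: dive
```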
I wanted to talk about closing the loop and putting some model-based structure back into RL — so let's look at efficient RL. What we want to do is the following: I have a simulation, I train my policy there, and it works really well in simulation. But when I go to hardware I have the same sim-to-real gap, and it typically fails unless I've done a great job on the simulation. So my goal is to use experimental data to fine-tune the policy — but I don't want to use too much data, because every experiment is expensive to run. Can we make the fine-tuning process faster using some model knowledge?

The model knowledge we want to incorporate here is control Lyapunov functions. If you look at model predictive control, MPC actually uses a control Lyapunov function in its cost — that's what guarantees stability of the MPC, and it also lets you keep the horizon short, because the CLF captures the long-horizon behavior. Can we do the same thing for RL? Hopefully it gives us better sample efficiency. Even more importantly, the control Lyapunov function we use can be an approximate CLF; it doesn't have to be an exact one. So the idea is: I take a control Lyapunov function, use it in my cost, do my training, and then I can reduce my horizon — or in RL terms, change my discount factor gamma, whatever you want to call it. Here's an experiment; hopefully the video plays. The policy is trained in simulation and then fine-tuned with 10 seconds of experimental data. This is the famous cartpole experiment, very familiar to control folks. It does stabilize, with some oscillations. If I fine-tune with a little more experimental data — 20 seconds — you actually get rid of the oscillations: this is the policy fine-tuned with 20 seconds of experimental data, and it does the swing-up and then balances. We've also applied this on an A1 robot for tracking walking velocities, and it's pretty effective.

But I think I'm almost out of time, so let me stop here. I've shown you a whole bunch of different things, all the way from model-based to data-driven techniques and combinations of the two. What the future will be, I'm not quite sure — that's something all of you will probably answer. I'll stop here. Thanks. Happy to take questions.

Question: A question regarding your safety guarantees — specifically when you approximate the closed loop with the network as a linear model. There are certain safety guarantees I would trust, but here it seems there are a lot of assumptions. Two directions I'd be interested in: one, do you have applications in mind where you don't need a strict guarantee but this is still more helpful than just saying "it usually works"? Or alternatively, do you have ideas for how you could build on your method to make the guarantees stronger?

Okay, first, just a clarification: with the linear model that we fit, we are providing some sort of stability guarantees, not safety guarantees — bounded-input bounded-output stability guarantees. Our hypothesis is that with this fitted linear model you can design a safety controller, and then we actually have to do the work to check whether the resulting state trajectory would stay close enough.
So we're not making any safety guarantees yet, but that's something we need to look into.

Question: This is really impressive work. You mentioned that for some of these there's no reference trajectory, and there isn't much of a foundation for this kind of learning right now. I'm curious: in the future, when you think about more complex combinations of locomotion and manipulation, do you think the learning approach is the right way to go, or do you think something like a model-based gait library can be combined with it as well?

Okay, so in some sense your question is: model-based versus data-driven — which way to go? People like myself, who were trained in model-based approaches, are now at a crossroads, and I don't have an answer for you. For a long time I thought RL policies only worked in animation, or only on simple robots. I don't think that's true anymore. I would say they are actually producing better performance on real hardware than model-based methods — I can't quantify this yet, but it's pretty impressive. You can't just throw it away and say, hey, there are no guarantees, so I'm not going to use it. So I think the future will be some nice combination of the two. Now, there are disadvantages with RL: it's sample inefficient, it's hard to train, and there are no guarantees — if you can solve those, it will be interesting. The way I see it, if you don't cannibalize your own research, someone else will do it for you; this is what companies do, so it's your choice. If you're young faculty, it's very easy to change course. If you're senior faculty, you're well established and don't have to care about the new things. People in between are the ones in a dilemma.

Question: What about the reference for something like jumping? And I have a second question.

Okay, so for the walking work, we had a trajectory optimization that used the model of the system to generate periodic orbits, and we used that reference motion in the reward. When we went to jumping, we thought, okay, why don't we start with a reference motion that just characterizes a jump. So we simply provided an animation showing what a jump should look like; it's not dynamically feasible, but it is kinematically feasible for the robot, and it turns out the network can still learn from that. So you do not need dynamically feasible reference motions, even for a dynamic motion such as jumping. Your second question was whether this runs onboard. Yes — when the students are lazy, they connect a tether and run it on a cluster or a small desktop, but I've actually forced them to run it onboard, so you can run the whole thing on a NUC. The inference is pretty fast; at least on the A1 you can do inference at kilohertz rates.

And yes — if I have a model and can derive a control Lyapunov function for it, rather than throwing that away, I can use it in my cost, in the reward for the RL. If I don't have a control Lyapunov function at all, what I can do is train in simulation and use the learned value function as an approximate CLF for further fine-tuning on the experimental data. That also seems to work.

Question: With a lot of the model-based methods you use, you can say strong things about them in local neighborhoods — your linearization, for instance, gives you statements you can make. With RL, or any kind of data-driven method, the scope of the data — the set of inputs and outputs you've seen — sort of defines where you're interpolating versus extrapolating.
Do you have any sense: if I really wanted to maximize the global efficacy of a policy, am I better off spending my money on a control theorist or on a machine learning person?

Okay, now I'm on the spot to pick one. If you have the money, you should hire both. The control theoretician will get you the models you can use to make the simulation better, which is what the RL policy actually trains on; if your simulation model is off, there's no way it's going to work on hardware anyway. So "data-driven, model-free" is actually not model-free — you still need a pretty decent model in simulation. But you get a lot of robustness from the RL, I think, because it's able to explore perturbations in lots of different directions that a typical controller would not have seen or would not have been able to capture. For that kind of robustness, I think the RL policy is good. So: some combination of the two.

Question: I'd like to ask one more question — the transformer results were very impressive. You mentioned the architectures: can the transformer really do better than an MLP, or something like a temporal convolutional network architecture?

I think in the paper we do a comparison between different policies, but it's still not a complete picture. The transformer has this property that it can do in-context learning, which basically means I can look at samples of what I did in the past — the context window of the previous steps. So if I just slipped, I can use that information to change how I take my next step. With an MLP, that has to be captured in the weights; here it can be captured in the attention in addition to the weights. So maybe that provides robustness, but I don't think you can conclude that this is more robust than the classical architectures unless you do a full study of it.