My name is Mark Riedl. I'm in the College of Computing and I don't have any robots, but I want my robot socks, so I think that's why Seth invited me here. What I'm going to talk about is, I guess, a bit of an apology from my community, which works in virtual worlds, and a recognition that you all are doing hard things in the open real world, and that we need to do things more like you. But I'm not ready to buy a robot. Okay. This is work that was done primarily with my colleague Jonathan Balloch and a whole bunch of really smart people along the way. What I'm going to talk about is adapting reinforcement learning agents to open-world problems, specifically problems involving novelty; I'll explain what that means in a bit. Let me start with things we've all seen in the news. There have been some really great results: AlphaGo, agents playing StarCraft, playing Dota. Games and reinforcement learning go great together. People build these game-playing systems and then say that games are a stepping stone to the real world. But these games are closed worlds. What do I mean by that? They have finite, though large, sets of states. They have finite, though sometimes again large, numbers of actions. And even though they may be probabilistic in nature, they have well-defined, if not always well-enumerated, rules for what happens when you take certain actions; they are somewhat predictable environments. Now, the issue is that while we can make really great demonstrations in those games, it is also really, really easy to make deep reinforcement learning systems fail in games: all we have to do is tweak the rules a little. I'll give some examples of changes that make things fail. Any small change can have catastrophic implications for a reinforcement learning agent that has trained on millions and millions of hours of a particular computer game. I'm going to argue, and I think everyone will probably agree with me, that if we really want games to be stepping stones to the real world, they have to work more like the real world: they have to be open, like open worlds. I'm not going to cover everything we would need to do to make games work like the real world, but I'll describe one property we have tried to introduce into games to make them more open-world: something called novelty. So what is a novelty? A novelty is a sudden change to the world dynamics at execution time. Three examples of what I mean. A volcano erupts and changes the climate, so what we have learned about which crops succeed and which fail no longer holds. In a household, someone re-keys a door: suddenly my key doesn't work on that door, even though it has worked thousands of times before. And in computer games, the game companies are always pushing updates: they take your favorite weapon and make it weak, and then you can't play the way you used to, and you're sad. Humans fail at these things too, but we adapt fast. When our reinforcement learning systems run into these situations, they say: my policy just gives me zero reward now, I don't know what to do, I can't do anything, please retrain me.
Novelties, importantly, are sudden changes to the dynamics. You can think of a novelty as a domain transfer: when the novelty happens, you have effectively gone from one game or environment to another in which most of the rules are the same but something of consequence is different. The distinction from ordinary transfer learning is that this happens during deployment, and you're not allowed to stop: you can't take your agent out of the environment, train it for a billion more hours, and put it back. There is no separate domain-transfer retraining phase. You have to learn to deal with the change in the moment, while continuing to execute; you can't go away, retrain, and come back to the environment. So our operating setting looks like this. We have a training phase, which we're all used to, and we say, all right, we're good, we're deploying. Then something in the world changes. This is where we would normally say: the domain has shifted, let's go retrain and then redeploy into the new environment. What we're saying is that you have to keep going. And obviously, if you have to keep going, you have to get your performance back up fast. So how do we handle these novel situations? We've been developing some techniques of our own, although this is a general problem that I think everyone who doesn't have robots should work on too; those of you with robots obviously handle these sorts of things all the time. We need to detect novelty and recover a high-performing policy as quickly as possible, without going into an offline training mode. What does that require? First, there's the problem of detecting that novelty has happened, because if it has, you may need to turn training back on. And if you want that training to happen quickly, you also need ways to learn very fast. One way to do that is to model the world dynamics, use that model to reason about the new world and what might now be wrong with your current policy, and update the policy as quickly as possible. One way to do this is with something called a world model. It's an old idea that has been rebranded and is now cool and sexy and hot. Basically: learn your transition function. A lot of the time in reinforcement learning we go model-free; we just learn the policy and trust that it will work. Here we're saying: learn the policy, but also learn the transition dynamics of your world, which are the rules if you're talking about a computer game. It turns out you can speed up policy training and policy re-learning really dramatically, because you can imagine what future states will look like and use that as a much richer form of feedback. (The slide shows a grid-world agent playing a game and imagining its next states; the imagined states differ only slightly from the real ones, so it has learned the dynamics pretty well.) Now we make this neurosymbolic, which just means we mix neural systems and symbolic systems. We let deep reinforcement learning handle the stochasticity and nondeterminism of the game; that's why we use it in the first place, and it works particularly well for that.
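To make the basic world-model idea concrete, here is a minimal sketch for a discrete grid world. The class and method names are my own invention for illustration, not the actual system from the talk: the model simply tallies observed transitions and can then "imagine" next states without touching the environment.

```python
import numpy as np

class TabularWorldModel:
    """Tally observed transitions in a discrete world and sample a
    predicted next state for a (state, action) pair."""

    def __init__(self):
        self.counts = {}  # (state, action) -> {next_state: count}

    def observe(self, state, action, next_state):
        outcomes = self.counts.setdefault((state, action), {})
        outcomes[next_state] = outcomes.get(next_state, 0) + 1

    def imagine(self, state, action):
        """Predict a next state without touching the real environment."""
        outcomes = self.counts.get((state, action))
        if not outcomes:
            return state  # never seen this transition: assume no change
        # Sample proportionally to observed frequency, which lets the
        # model represent stochastic dynamics.
        states, counts = zip(*outcomes.items())
        probs = np.array(counts, dtype=float) / sum(counts)
        return states[np.random.choice(len(states), p=probs)]
```

Imagined transitions from a model like this can be fed to the policy as extra training data, which is what makes re-learning fast.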
But we build a symbolic world model of the transitions. The reason is that symbolic systems can update, often in a single step. If something unexpected happens, we can recognize that something is wrong very quickly — often from a single instance — and update the model, sometimes by changing just one set of symbols, instead of retraining a neural predictive system. And, as with all world-model systems, even though it's symbolic we can still imagine interacting with the environment without actually interacting with it. Instead of taking many, many steps in the environment, we can step back and say: let me imagine what would happen if I tried different actions, predict those outcomes, check my reward, and train myself on my own imagination. I'm not going to spend a lot of time on the details, but we have to learn this symbolic rule system. The rules look like this: we need to know what our features are, and then we learn the ranges over which actions succeed and fail. I don't want to say too much about this, but we can handle nondeterminism: we just get multiple rules with overlapping preconditions and effects, and that's all okay. We learn this simultaneously with the policy — we do regular reinforcement learning while also learning the symbolic model, using techniques like integer programming. Once we have that model, we can do novelty detection. You're going along during deployment and some transition occurs that is not properly explained by the learned rules — usually an action fails that you were expecting to succeed, or some other transition happens that you weren't expecting. That means your rules are now wrong: the world has changed in some particular way. You've detected the novelty, and the update is pretty easy: you change the bounds on some of those symbolic parameters and you have a new model, ready to go. With the new model, you can start doing imagination-based retraining. Before that, consider the normal reinforcement learning training loop. It's pretty simple: query your policy for the best action, execute that action, observe the new state you're in, put the transition into a training buffer, and push it through your neural net. Imagination-based training simply replaces the execution step: instead of executing the action, we ask our symbolic world model what new state parameters we would expect to see. Suddenly we can run the environment inside the agent itself. In practice we use a mix of both, because our rule model is going to be imperfect — the system is still learning, and over time imagined rollouts will drift from the real state of the world — so we periodically re-ground by taking a step in the real world. You can adjust the ratio: take one step in the real world and fifty in imagination, or whatever works.
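As a rough illustration of the symbolic rule idea — the feature names, rule structure, and repair step here are hypothetical stand-ins, not the actual learned rules, which the talk says are fit with techniques like integer programming — a rule pairs precondition ranges with expected effects, a novelty is a transition the rules cannot explain, and the repair is a one-step edit to the offending rule:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    action: str
    preconditions: dict  # feature -> (low, high) acceptable range
    effects: dict        # feature -> expected change in that feature

    def applies(self, state):
        return all(lo <= state[f] <= hi
                   for f, (lo, hi) in self.preconditions.items())

    def predict(self, state):
        nxt = dict(state)
        for f, delta in self.effects.items():
            nxt[f] += delta
        return nxt

def detect_novelty(rules, state, action, observed_next):
    """A transition the learned rules cannot explain means the world
    has changed: either a matching rule predicted wrongly, or no rule
    covers the transition at all."""
    for rule in rules:
        if rule.action == action and rule.applies(state):
            if rule.predict(state) == observed_next:
                return None       # explained: no novelty
            return rule           # rule fired but was wrong
    return "uncovered"            # nothing explains this transition

def repair(rule, state, observed_next):
    """One-step model update: rewrite the rule's effects to match what
    was actually observed, rather than retraining a neural predictor."""
    for f in rule.effects:
        rule.effects[f] = observed_next[f] - state[f]
```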
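And here is a sketch of the imagination-based training loop just described, assuming placeholder `policy`, `env`, and `world_model` objects (including an assumed `env.reward_fn` for scoring imagined states); the ratio of imagined to real steps is the knob mentioned above:

```python
IMAGINE_RATIO = 50  # imagined steps per real environment step

def training_step(policy, env, world_model, state, buffer, t):
    """One iteration of the mixed real/imagined loop. `policy`, `env`,
    and `world_model` are placeholders for your own components."""
    action = policy.best_action(state)

    if t % (IMAGINE_RATIO + 1) == 0:
        # Periodic real step: re-ground in the environment so imagined
        # rollouts don't drift from the true state of the world.
        next_state, reward = env.step(action)
        world_model.observe(state, action, next_state)
    else:
        # Imagined step: ask the world model what we would expect to
        # see, and score it with the known reward function.
        next_state = world_model.imagine(state, action)
        reward = env.reward_fn(next_state)

    buffer.append((state, action, reward, next_state))
    policy.update(buffer)  # ordinary RL update on real + imagined data
    return next_state
```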
The full architecture looks big and scary, but those of you familiar with reinforcement learning will see the familiar path from the game state, through the policy model, and back into the game state. That's what happens under normal circumstances, pre-novelty. There's also a bunch of novelty-detection machinery that isn't particularly important for today. If novelty happens, we have a second loop that doesn't go to the game engine but goes through the rule system itself — that's the magic part, where we imagine the next steps — and we still compute reward and everything else as normal. We can switch back and forth between these two loops or run them in parallel. So, does this work? Well, we try to screw up our agents. We teach the agent to find the yellow key that opens the yellow door, then we change the game so that only a different key works. In the results, blue is the regular, naive reinforcement learning agent: novelty hits, and it takes a long time to re-learn the policy, because it basically throws everything away. Our baseline, another world-model system called Dreamer, is the green and orange in the middle. Unfortunately, our baseline knocked the socks off of us and learned faster — it's really hard to beat Dreamer; it's a really good learning system. But we're close, and we re-learn our policy with roughly two orders of magnitude fewer backpropagation updates, so we're much more efficient in doing that learning; we're just slower. Where our system really excels is in other situations. We trained the agent in a world where it could walk through lava, and then we said: no, actually, lava hurts you and kills you. Here we actually beat the baseline, and I'm pretty happy about that. My favorite case is the reverse: we teach it that lava hurts, but then we make lava not hurt anymore. Most world-model systems never figure this out, but our system does. It learns very quickly that it's now okay to step into lava: the model changes and says lava doesn't hurt anymore, walk through it. What Dreamer's neural world model does is say: I stepped into the lava, I imagine this hurting, I step out. It doesn't want to acknowledge right away that the world has changed, and it takes a long time for the neural system to re-learn — it relies too much on memory, on its own imagination. This is what's called a shortcut novelty, and we're pretty happy that we handle it. Why is this working? Imagination punishes the agent when things that used to give reward don't give reward anymore. It can figure out that punishment in its own imagination, because the world model has changed: I'm now predicting no reward for the things that used to work, so I lower the utility of certain states, actions, and transitions. At a certain point — before taking a single step in the real world, or after only a few — the agent says: that trajectory no longer gives me any utility, so where else can I get utility? It goes back into explore mode. It switches out of exploit and back into explore very quickly, and these world models help it do that faster. The reason we're able to find shortcuts is much the same.
As soon as the world model lets the agent step into lava once, in a single step it knows that no other piece of lava is going to hurt it, and it can very quickly explore its way through the lava without worrying about it. So we get several benefits from these symbolic world models. Let me wrap up with some observations. When the rules of the game change, fast adaptation is essential. Symbolic world models are very good at updating very fast; they're not good at playing the game. That's why we let the neural system play the game and have the symbolic system provide novelty detection and world-model updates. The symbolic systems are also interpretable: you can look at the rules and understand how the system thinks the world works, and you can check whether it's wrong. We can also manually seed the rules, partially or fully. We can say: I'll tell you, the agent, how the world works — or at least my best understanding of it — and you go learn the rest. That's a very powerful thing when we have certain expectations of our agent and want to let people partially specify that information. We think these symbolic systems can do even more: they can guide exploration. We could say, this key doesn't work — let me see what other actions have also failed, and go look for actions that used to succeed but now fail. That would ultimately give us more robust world models and more robust policies, which would hold up better under real transfer or additional novelties down the road. And ultimately, going back to where we started: it really makes no sense to call computer games stepping stones to the real world if we don't make the games pose some of the problems that you all face out in the real world. So our little MiniGrid-based environment, which is novelty-aware — it lets you inject novelty and run experiments — is available. We're trying to encourage lots of people to use these sorts of environments to test robustness to novelty and to think about how algorithms can adapt. Simple simulation environments with novelty are only one type of open world, but if they're useful to you, we invite you to use them, and reach out to me if you want help using them. All right, that's it for me. Thank you for staying all day. We have time for a couple of questions before the next talk. Question from the audience: Really interesting — how do you think this would generalize to continuous spaces, spaces with more uncertainty? So the question is: how would this work in a more continuous space? We're doing this in Doom as well, so we can play that game, but Doom is still somewhat discrete — it's not fully continuous in the way you want your robots to operate. Because the world model is symbolic, what you do need is to collapse the continuous space into a discrete set of symbols. There are lots of different ways of doing that. Some of them require a bit of knowledge engineering — sorry, knowledge engineering is still part of our world — or you can use things like scene graphs, systems that have been pre-trained to boil things down into symbols.
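As a hedged example of what that collapse might look like — the feature names and thresholds here are invented, and a real system would use learned perception or scene graphs instead — even simple binning turns continuous readings into symbols a rule system can reason over:

```python
def symbolize(observation):
    """Map continuous sensor readings to a small set of discrete
    symbols a rule-based world model can reason over."""
    symbols = {}
    dist = observation["distance_to_door"]   # meters (hypothetical)
    if dist < 1.0:
        symbols["door"] = "adjacent"
    elif dist < 5.0:
        symbols["door"] = "near"
    else:
        symbols["door"] = "far"
    symbols["holding_key"] = observation["gripper_force"] > 0.2
    return symbols
```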
And the symbols don't have to be perfect — the world model doesn't have to be perfect. It just has to be right enough to make good guesses down the road. Great question. Let me repeat the next one: is there any utility in knowing which possible world you could be living in? Oh, interesting. Okay, so the question is: we have one world model — would it be useful to have multiple world models and maintain some sort of probability over which model is right? That's an interesting idea, and it might lead to some additional robustness. Right now we do boil it down to a single model and say: we think this is right. But I don't think there's any reason you couldn't have competing world models make predictions and then train on those. If they conflict with each other too much, you might just confuse the agent, so that would be an interesting set of experiments — yes, you could try to converge on a particular model. Right now our approach is to put the uncertainty inside the world model itself and say: this rule has lots of different possible effects, and we don't actually know which of them is right or which one you'll get. So that's another way to do it — the question is really where you put the uncertainty. Thank you very much.