But at some point I was like, all right, I think I've become my own person, I'm ready to come back home and see what that's like. I'm very excited to be here at Georgia Tech. I can't imagine how many times I drove by here as a kid. I never imagined being a professor speaking here with you today. In my lab, the Cognitive Optimization and Relational (CORE) Robotics Laboratory at Georgia Tech, what we do is try to characterize the human in human-robot interaction, so that we can develop robots that we can actually place in the hands of end users, to empower them to teach these robots, to program them with new skills for assistive tasks, so that these robots can anticipate the needs of their teammates. Humans shouldn't just have to sit there with an Xbox controller and try to puppeteer the robot all day long, which is typically what we think of in robotic applications today, and that's very far from the future we imagine with robots in popular culture. Let's start with an example. Many of you who come from an HCI background will be like, yeah, this is obvious, but for a lot of people in AI, machine learning, and robotics, this is surprising. This is the USS Bunker Hill. It's a guided missile cruiser. I spent time on the ship, on the bridge, as part of my time at MIT Lincoln Laboratory. My colleagues and I developed decision support tools to make recommendations for how to defend this ship if there was an enemy anti-ship missile attack. That might look like, in red, a raid of enemy anti-ship missiles. They're going to be flying at you close to the speed of sound, probably in their terminal phase, flying low to the surface of the ocean. And if you're standing on the bridge of a ship, you only have about 35 kilometers of line of sight to a target. Something starting at 35 kilometers away, traveling at 1,000 kilometers an hour: you have seconds to minutes to make a decision. If there are 100 of these things flying at your face and you're some 24-year-old tactical action officer, what are you going to do about it? Okay. You're going to decide: do I launch interceptors, hitting a bullet with a bullet? I don't have so many of those. Am I going to launch chaff, basically smoke-like clouds with bits of aluminum foil in them? Do I launch flares? Do I launch what they call rubber ducky surface decoys, a giant rubber balloon that looks like a ship on a radar display? Well, I could just deploy everything everywhere and hope it works out, but those will run out. And then what happens if more missiles come? Or what happens if the decoys that I launch cause a missile to miss me but hit one of my partner ships? This stuff has actually happened before. Fratricide has happened because of countermeasures. A British ship, I believe, shot at a US ship, because the US ship launched chaff, and the British ship's automated guns thought the chaff was the enemy, so they shot straight at the friendly ship. Now let's scale this up to coordinated reasoning among distributed operations. Now we're talking about a single person, sitting probably on the E-2D, trying to do this entire battle management. If you talk to a warfighter and you tell them, hey, I'm really smart, I'm from MIT or Georgia Tech or wherever, I'm a scientist, I'm going to come up with the best battle management algorithm for you, they're going to stop listening to you, because they're going to say you cannot manage a battle. You can't get optimality, you can't manage it, you can't control it.
One of my colleagues was literally called a weenie in a lab coat. "I will not listen to a weenie in a lab coat" is what a captain told one of my colleagues. So what does it look like when you develop decision support tools that can help coordinate these defensive assets? Well, my colleagues at MIT Lincoln Laboratory would go on board these ships. This is an actual bridge of a cruiser; this is just a graphic because I don't have the picture of the actual tape, but they found tape literally placed over the buttons that would turn these systems on or off. Because the warfighters don't trust them. They don't get a chance to train with them. They're very expensive. And what happens if you practice using this defensive countermeasure? Maybe it costs a few million dollars. That's not so bad. I mean, it's bad, but it's not so bad. The real bad thing is if Lockheed or Raytheon were making it, and they had a contract to produce 100 of these, and it goes wrong, they could lose that contract. So there's all this political pressure to prevent anyone from actually using anything, from trying the whole system. So the actual trust building, the shared mental models of how AI systems are going to work, facilitating appropriate reliance and compliance, all of that is totally screwed up, we'll put it that way. We need more HCI. I could give a whole hour-long talk about why HCI is broken within the DoD, but instead I'm going to talk about the cool technologies and ways of characterizing humans that we've been developing in my lab to try to counteract some of these problems when we go and deploy robotics and AI in the real world. It's not just the Navy where we have these problems. It's Tesla with autonomous driving. Some of you might have seen recent reports alleging that the actual incident rate for Tesla's Autopilot or Full Self-Driving is about ten times that of a human on the road, in contrast to what Elon Musk has boasted of it being twice as safe. We have autonomy in the cockpit that's leading to degradation of skills and actually setting humans up to fail. This is an Air France crash that happened over the Atlantic Ocean. There were lots of mistakes that Airbus made in designing the automation aboard these airplanes, one of which was: surely, if my sensors are telling me that I'm going less than 50 or 60 knots at 30,000 feet, that must be a sensor malfunction. It actually wasn't, and the plane was literally dropping straight out of the sky. NASA is supporting my NASA Early Career Fellowship with the Jet Propulsion Laboratory. We just had a paper accepted on developing decision support tools that give automated path planning recommendations to facilitate fast traversal on Mars. We're using a human-centered approach to try to learn the objective function criteria that the experts who plan these paths are actually using. Because if you just ask them what a good path is, and then say, okay, let me go take that and develop an engineering tool that's going to come up with the right path for you given the objective function you told me, the rover drivers are going to say, no, I'm not going to use it, that's bad. They don't use it. They have all the AI researchers in the world who would love to help solve this problem, and every time they say, no, that's not what I wanted. Then there are surgeons, who are notorious for having god-like complexes, and who often reject any intervention from technology other than one they can fully control, like the da Vinci.
And I think autonomy and AI fail to live up to their potential because researchers like us are failing to understand the human in human-autonomy interaction. The Boeing 737 MAX, which is now flying again, was a great recent example: a system that was working well, we developed automation for a problem that didn't actually need to be addressed, we did it wrong, and we tried to cover it up. In my lab, I have a vision for moving robots from a laboratory setting, from pristine behind-the-cage environments, to working alongside humans in manufacturing settings. This is my work with Boeing; for the Navy, of course; also in health care, whether it's decision support for allocating which nurses or doctors should go take care of which patients for triage or inpatient care; and then also for exploring the solar system. I want everyone to have their own R2-D2. R2-D2 was not a puppet. R2-D2 was a partner that not only could learn from and adapt to Luke Skywalker, but also could assert initiative. There was mixed-initiative teaming there. There wasn't a clear leader-follower relationship. Identifying the right conditions and then creating the capability for that to actually happen, true teaming, where you can both take turns and interrupt each other, is a very challenging thing. That is not really a solved problem. We're going to do this in three ways. First, we're going to give robots insights into human decision making. I do this by developing machine learning models and statistical methods, and also by conducting human subject experiments, so that we can develop models of human behavior that let us predict what you're going to do next, before you do it. We can use this either to imitate you. Say I want to have a robot that is going to learn to play ping pong just like you: I'm a robot, I have no idea how to play ping pong. Can I just watch you play the sport, and then figure out how you play and try to mimic you, much like how children often learn from adults and other caregivers? We can also use these models to say: if you're a particular ping pong player and you like to play on this particular side, you like to be very aggressive, or maybe you're a weaker player, I can be your partner, your teammate, to help compensate for those potential weaknesses, and we can work together as a true team. Second, we give humans insights into robot decision making, primarily through explainable artificial intelligence. My lab has a special focus on the reinforcement learning setting. Robots are going to learn through trial and error how to accomplish tasks, but then how can the robot tell the person? Say I'm going to be your assistant, to go gather resources in Minecraft and help you build this house: how can I tell you what I've learned about the best way for us to team together? Well, we could just work together over many, many iterations and you could reverse engineer it yourself, much like humans do if, say, you have to do ad hoc teaming and you can't talk to each other, but that would be somewhat inefficient. What if I could just tell you: here is my policy, here's my representation of my behavior, some behavior tree, that could help facilitate the adoption of a shared mental model more quickly? And lastly, we scale that up to coordinated reasoning in human-robot teams. We account for the fact that humans and robots are actually different, and not all humans are the same. Humans have unique capabilities.
This could be that somebody is a welder, somebody else might be really good at riveting, and you want to account for those differences if you're going to coordinate a team to build aircraft in assembly manufacturing. But it could also be that humans themselves have time-varying, stochastic performance. Humans get better over time, ideally, with practice. We see this in manufacturing a lot with seasonal work: people will get better at the task you hire them for, and you have to adapt to that. This also brings up interesting political and ethical questions about whether or not it's actually appropriate for a robot to characterize human task performance. In many states, and also in union settings, actually having that low-level information about how good you are at each task is forbidden. What I'm going to do today, to try to appeal to all the students who are here for credit and for the free lunch, is engage you and ask you to pick which of these three topics you would like to explore in this talk. First, how robots can learn behavioral models of human decision making: mostly, learning how to imitate human skills, skill transfer from a human to a robot, given that humans are suboptimal in how we perform tasks and also diverse, or heterogeneous. You've got to learn from a lot of diverse people if you want to scale up for distributed robotics. Second, explainable artificial intelligence: how robots can explain to people what they've learned about how to interact with the world. Then third, how we do coordinated reasoning, specifically with a lot of graph neural networks. All right, hands up for topic number one. Hands up for topic two. Hands up for topic three. Oh, I think topic two has it. Okay. I did this at UW and they picked topic one, and they didn't believe me that I had a topic two prepared, so they were asking me afterwards. But I do, yes, exactly. I want robots to interact with people without a significant amount of training. As I mentioned with the Navy setting, training may not even be possible. Ideally, I'd like every warfighter to have an iPad so they can play war in an environment where they can experiment with these tools. That's actually really hard to make happen. Real training takes a lot of time. We've seen a number of cases where the US Navy has actually run into crashes with itself, ships crashing into each other, because what we have is training at the margins. You have an eight-hour job of actually running the ship, then you have an eight-hour job of paperwork, like a desk job, and then you get eight hours to sleep. Where does training happen? Now imagine I deploy robots to help at a construction job site, or I deploy a robot in a manufacturing environment for the very first time, or you want to play a pickup game of soccer. This is where ad hoc teaming is really exemplified. This is the idea behind RoboCup, if you've never heard of it: you get a bunch of NAO robots, or child-like robots, and we're going to have them all go play soccer. And imagine if they've never, ever played soccer before. How do you do a pickup game of soccer with robots? That's what I'm talking about. You need to learn who's good at what and where you fit in with the team, and then how we orchestrate or choreograph the team dynamics.
This is formally defined as ad hoc teaming: we are going to be unaware of the capabilities and behaviors of the other agents, which is going to limit our ability to collaborate with them. High-performing humans and human teams maintain mental models and shared mental models of their teammates. Mental models are what really separate a novice, or even an amateur, from an expert: I don't actually have to pay attention to everything that's happening in my environment. I can focus on one specific important task, because in my mind I'm able to simulate and predict what everyone else is going to do, and I know when the next time is that I need to pay attention to something else. Whereas a novice has to pay attention to everything all the time, and they can't ever get anything done. We want to short-circuit the process of forming shared mental models that you would naturally get from interacting, from playing pickup soccer games with somebody 1,000 times. How could we facilitate that shared mental model and understanding of what everyone on the team is going to do in one or two iterations? This joint coordination is going to depend on shared representations, the ability to predict others' actions, and integrating the effect of those predictions. We're going to do that with explainable artificial intelligence and interfaces. Now, I'm not really going to talk too much about the interface today; whatever we can hack together is what our interface is going to be. We're going to mostly focus on the representation of the machine learning model that's going to afford the ability to explain itself. Neural networks are the go-to for deep learning and deep reinforcement learning. This is a joke from a colleague of mine at CU Boulder: why did the neural network cross the road? Well, I don't know, but it did really well, so who cares? Right? That's what a lot of people are saying. That's what people were saying a few years ago, that's what people are saying now, and that's what people will continue to say. Sam Altman can go up to Capitol Hill and make everyone scared, but at the end of the day, is anyone really stopping this? No. There's this belief, and Cynthia Rudin has talked a lot about this, that there's a trade-off between how well you can interpret a machine learning model's representation and how accurately the model is going to reflect the phenomenon you're trying to capture. Sure, generally I think this is true, but not in all cases. I think that we can have accuracy and interpretability, with some assumptions. The most important assumption is that you're going to have the right features. You need interpretable features to begin with. If we're going to talk about interpretable models, I'm not talking about taking in an image dataset, reasoning from pixel space, and then somehow getting a decision tree out of that. We can talk about how you might be able to do that, but that's not what I'm going to talk about. I'm talking about starting with human-interpretable features, tabular data, something in an Excel spreadsheet, some higher-level semantic description of an environment. Can we learn some decision tree, rule list, or other interpretable model to guide our robots or explain what human behavior is? Decision trees can be described as a recursive function, where at the leaf nodes, here in green, you're going to make some prediction about the world.
Here we're going to make predictions about what action a human would take, or what action the robot should take in a reinforcement learning setting. And the blue nodes are decision nodes that are going to give you a true or false about some condition being satisfied. Now, these conditions are Boolean: the output is either one or zero depending on whether some feature element j (x is our feature vector, j is the element, and a describes the node-specific selection of a feature) is greater than or less than some threshold phi sub a. What we can't do is apply proximal policy optimization (PPO), TRPO, or any other policy-gradient-based iterative learning to a decision tree through gradient descent, because it's not differentiable. There are tricks we can apply. As a first step, we can do what physicists like to do when they want to prove something: make a fuzzy approximation of the controller or the model, where we actually replace this Boolean split with a sigmoid. We can have a degree of truth of something being satisfied or not. If you're going to give somebody a drug, and some test comes back positive and you want to classify them as having a disease or not, it's not that there's some specific threshold where they're sick or not sick; it's a degree of being sick. Now, because it's fully continuous in this space, we can perform gradient descent and do supervised learning, reinforcement learning, et cetera. Every decision node will have all of the features of the input, the state space, say your age, sex, BMI, et cetera, and you're going to have a splitting criterion that you learn, and then some temperature parameter alpha that describes the transition from truth to falsity. At the leaf nodes we're now going to have a vector of probabilities; it describes a discrete probability mass distribution over what actions we might want to take, so we can explore, we can try different actions, and then we learn those probability distributions at the leaves. We can sum over the entire graph; it's basically a specific graph of a neural network. We can then get an exact probability distribution that we want to learn. How do we actually train these things in a reinforcement learning world? In an AISTATS paper, I proved some limitations of using Q-learning-based approaches. If you're familiar with Q-learning, it can be fickle to work with compared to policy-gradient-based approaches. Q-learning basically has the tree learn the efficacy of taking an action, whereas policy-gradient-based approaches try to predict the probability that you should take the action, that it is the best one. I'll just quickly note: if you're trying to learn how to balance a pole on top of a cart, you can move the cart back and forth, left to right, and try to balance this inverted pendulum on it. If it's falling to the right, you actually want to move the cart to the right. What we can see is that there is one optimum for what the splitting threshold should be: it's when the threshold sits right in the middle of the range and correctly classifies those states. For policy gradients, you get this nice curve where the optimum is in the middle. Q-learning-based approaches introduce all of these additional critical points that make it much more difficult to learn over these graphical models.
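Before looking at how well this learns, here is a minimal sketch, assuming a PyTorch setup, of the kind of fuzzy, differentiable decision tree just described. The class and parameter names are my own illustration, not the lab's actual code: a depth-one tree with one decision node whose Boolean split is relaxed into a sigmoid with temperature alpha, and two leaves that each hold a learnable probability distribution over discrete actions.

```python
import torch
import torch.nn as nn


class SoftDecisionNode(nn.Module):
    """One 'fuzzy' decision node: sigmoid(alpha * (w . x - phi)) instead of 1[x_j > phi]."""

    def __init__(self, num_features: int, alpha: float = 1.0):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(num_features))  # soft feature selection
        self.threshold = nn.Parameter(torch.zeros(1))            # splitting threshold phi
        self.alpha = alpha                                        # truth/falsity temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns a degree of truth in [0, 1] rather than a hard yes/no.
        return torch.sigmoid(self.alpha * (x @ self.weights - self.threshold))


class SoftTreePolicy(nn.Module):
    """Depth-1 fuzzy tree: one decision node, two leaves with learnable action logits."""

    def __init__(self, num_features: int, num_actions: int):
        super().__init__()
        self.node = SoftDecisionNode(num_features)
        self.leaf_true = nn.Parameter(torch.zeros(num_actions))
        self.leaf_false = nn.Parameter(torch.zeros(num_actions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.node(x).unsqueeze(-1)                      # (batch, 1) degree of truth
        probs_true = torch.softmax(self.leaf_true, dim=-1)  # leaf action distributions
        probs_false = torch.softmax(self.leaf_false, dim=-1)
        # Weighted mixture over leaves: an exact, fully differentiable action distribution.
        return p * probs_true + (1.0 - p) * probs_false
```

Because the output is a differentiable probability distribution over actions, you can train it directly with a policy-gradient method rather than distilling a neural network after the fact. How well can we learn with an interpretable model?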
Is it going to suck because it's so sparse and interpretable? We applied it to some standard RL baselines: cart pole; lunar lander, where you're trying to land the spacecraft between the two flags and you have thrusters that can push you left, right, and up; a wildfire tracking domain, where we have UAVs trying to fly over a wildfire and track where it's propagating; and a StarCraft II domain, the find-and-defeat-Zerglings minigame. Interestingly, we reached parity with or exceeded the performance of a neural network here, in this case a multilayer perceptron, just a normal feed-forward neural network with rectified linear units. We're able to beat it when we have tabular data. The representation, the inductive bias that we get from a tree-like representation, is actually really powerful. But is it interpretable? Well, we take that tree, which has all those parameters and is fuzzy, and we crispify it: we say now we're going back to the Boolean world, so we just do an argmax over all the features, we take the best feature, the best splitting criterion, the best action, instead of having this fuzzy approximation. We do take a hit in performance, which I will address later. But the real point I want to make here is that you'll notice a state-action DT is worse. What's a state-action DT? Many people in industry, and even in academia, say: why don't we just train a deep neural network with reinforcement learning to learn the controller for our robot? Then we just watch it, see some description of the world, the state or a feature vector x of its position, its velocity, et cetera, and what the neural network told it to do, moving left or right, and use supervised learning to train a decision tree offline to capture that behavior. Well, you can get 99.99% accuracy on that. If your decision tree is big enough, your network small enough, et cetera, you can distill that pretty well. But is that supervised loss the same thing as having the tree now control the robot? No. If that tree ever gets into a state that it didn't see from the neural network, it's going to have to make a guess. It's probably going to be wrong, it's going to wander off from the distribution of states it was trained under, it's going to get lost, and it has no hope of actually acting like a controller to get back to a distribution of states where it knows what to do. It's not being trained in a closed-loop fashion. Instead, if you actually use reinforcement learning on the interpretable representation itself, you are able to learn the interpretable policy directly. That's a key point, if you leave here today, that I want you to know: if you're going to do interpretable machine learning, train it for the thing you want to use it for. Don't train it to approximate something, unless all you care about is an approximation. This is really pertinent for medicine, because what I see people doing nowadays is ChatGPT, or deep neural networks, for predicting whether you have some tumor in an x-ray. You're going to start training a decision tree to approximate the neural network's behavior, but the neural network will be the thing that's actually making the decisions about your health care. Then, at the end of the day, is the decision tree really relevant? There's a lot of reason to think it's not.
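Going back to the crispification step mentioned above, here is a rough sketch of what it looks like, continuing the hypothetical names from the soft-tree sketch earlier (this is illustrative, not the actual implementation): take an argmax over the learned feature weights, turn the sigmoid back into a hard Boolean test, and take the argmax action at each leaf.

```python
import torch


def crispify(policy):
    """Convert the fuzzy SoftTreePolicy sketched earlier into a crisp if/else rule."""
    j = int(torch.argmax(torch.abs(policy.node.weights)))    # single most important feature
    w_j = float(policy.node.weights[j])                      # keep its sign
    phi = float(policy.node.threshold)                       # learned splitting threshold
    a_true = int(torch.argmax(policy.leaf_true))             # best action if the test passes
    a_false = int(torch.argmax(policy.leaf_false))           # best action otherwise

    def crisp_policy(x):
        # Hard Boolean split: no degrees of truth, fully readable as a single rule.
        return a_true if w_j * float(x[j]) > phi else a_false

    rule = f"if {w_j:+.2f} * x[{j}] > {phi:.2f}: action {a_true}, else action {a_false}"
    return crisp_policy, rule
```

The caveat above still applies: if you only train the fuzzy tree and crispify at the end, you take that performance hit, which is what the follow-on work addresses.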
So we've since then enhanced the performance of these trees in work with Ford for autonomous driving, where we introduced Gumbel-Softmax as a way to deal with the fuzzy approximation and select the particular features that we want, and also used straight-through gradient tricks, so that we're training the sparse, crisp tree rather than training the fuzzy tree. Doing those things actually helps us greatly improve performance, nicely landing between the two flags. And we also have this cool robot demo. You'll notice that the leaves we've now actually extended to be linear controllers. And in ongoing work, this was Rohan Paleja's work that has now been handed off to a student named Josh, we're putting model predictive controllers, the objective function that you want for a model predictive controller, in each of the leaves, so you can have this hybrid dynamical system that we're hoping to then deploy with Ford Motor Company.
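As a minimal sketch of the Gumbel-Softmax, straight-through idea just mentioned (my own illustration under the assumption of a PyTorch setup, not the Ford project's code): the forward pass picks exactly one feature per decision node, so you are effectively training the sparse, crisp split, while gradients still flow through the soft relaxation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelFeatureSelector(nn.Module):
    """Selects one feature per forward pass with straight-through Gumbel-Softmax."""

    def __init__(self, num_features: int, tau: float = 1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_features))  # learnable feature preferences
        self.tau = tau                                          # relaxation temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # hard=True: the forward pass uses a one-hot feature selection (a crisp choice),
        # but gradients are taken through the soft sample (the straight-through trick).
        one_hot = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        return x @ one_hot   # value of the single selected feature for each sample
```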
Okay, what I've shared so far is the way we can use interpretable representations, these decision trees, in reinforcement learning settings, to describe what a robot has learned about how to navigate its own environment. In a reinforcement learning setting, the robot is working by itself to figure out the world and then report back to you what it has learned that's useful. But what if what the robot is trying to understand is you, not just how to control itself? If we want to understand how experts think, how humans think, you typically hire an army of consultants, and they're going to interview 35 pilots, codify everything they've learned into empathy maps, which the pilots have probably never even heard of, or cognitive task analyses, whatever. They're going to ask a bunch of questions and use their own intuition, and read some textbooks on how to be a fighter pilot, and shove all of that into a drone and say that's a human in a box, a fighter pilot in a box. That's literally what they call it. There was this guy named Benny at Lincoln Laboratory, and this long project about how we can put Benny in a box. Benny did not want to be put in a box, because that was his job security. He may have obfuscated how he does everything he does; nobody could figure it out. The problem is that experts can give you what they think about, but not how they think about it, even if they're trying to help you. There was this great study by Bilalić and colleagues in 2008, where they took expert chess players, set them up to win, to achieve a checkmate in a three-move sequence, then changed something about the board so that the three-move checkmate was impossible, and asked them to find another way to achieve a checkmate, say in four moves, and to think aloud. The experts would say, I'm looking all over the board trying to find novel ways of achieving the four-move checkmate. They're pleading with you, telling you, I'm so smart, I'm brainstorming. A couple of issues. First, if you track their eyes, all their eyes were doing was tracing out the three-move sequence over and over again. There's a cognitive disconnect between what you're doing and what you think you're doing, or at least what you want to think you're doing. Second, experts get stuck in habituation. Einstellung is roughly German for habituation. I think our former dean, Charles Isbell, would agree; I've heard him say this too: I typically don't listen to people. I watch what they do and then try to figure out a way of reverse engineering behavioral models of people. Now, when I talk to roboticists, I say wholeheartedly the opposite: I say we need to listen to people and talk to them, because roboticists almost never do. But when I talk to people whose whole job is to talk to people, I will say: don't just talk to people; watch what they actually do, objectively, and try to figure it out, because they may be misrepresenting reality even though they believe what they're saying. I think some of the best teachers, for example if you have a private pilot's license, the best teachers I've seen who can teach those skills, are actually able to say what they are really thinking and doing. But most people lack that ability. Empirically, how are we going to quantify a model of human behavior from observations? Let's say we have the state of the world, in this case a human: we have joint angles for the human's arm, or the robot's arm, for hitting a topspin shot in ping pong. We have actions: those are the torques that you exert on all of your joints. And maybe you also have, for your state, the position of the ball, the state of the ball. Then we observe a human: pi-star human takes in the state of the world and then takes an action. Then we have a robot policy, pi of s, and we use some optimization technique to figure out how we can minimize some measure of divergence, some difference, between what the human is doing and what the robot is doing in the same state. You might do this offline, behavior cloning is what we call it, where the human is just demonstrating tasks for you, or have the robot try to do the task and the human correct it later, which is called imitation learning. There are pros and cons to both.
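As a minimal sketch of that divergence-minimization idea (illustrative names, assuming a discrete action space and a PyTorch policy that outputs action logits): given recorded state-action pairs from the human, behavior cloning reduces to a cross-entropy, negative log-likelihood loss on the robot policy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def behavior_cloning_loss(robot_policy: nn.Module,
                          states: torch.Tensor,
                          human_actions: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the human's observed actions and pi(a | s)."""
    logits = robot_policy(states)            # (batch, num_actions) action preferences
    # Minimizing this drives pi toward the human's action distribution in each state,
    # i.e. it minimizes a KL-style divergence against the empirical demonstrations.
    return F.cross_entropy(logits, human_actions)


# Usage sketch (hypothetical optimizer and data loader):
# for states, actions in demonstration_loader:
#     optimizer.zero_grad()
#     behavior_cloning_loss(robot_policy, states, actions).backward()
#     optimizer.step()
```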
I'm going to share with you some of our work today, for example, on how we can learn from suboptimal demonstrations how to perform the task better: learning a better correlation between the ground-truth reward function for a task and what we predict to be the goodness of the demonstration, without ever asking a person which is better. If we have time, we'll get to how we do that. First we're going to start with learning from heterogeneous demonstrations, whether it's a human performing different types of table tennis shots, topspin, slice, a lob shot, a drop shot, or humans performing navigational tasks differently, like driving a car. Some people may want to pass on the left, some people pass on the right. We are going to see heterogeneity in our dataset. If you take the average of somebody who passes on the left with people who pass on the right, what do you get? It's not that the machine will always pick a mode; it may pick the average and then crash into the car in front of it, if you haven't carefully designed your decision space. But am I blowing this out of proportion? Okay. There was a study in the '90s where they tried to get commercial airline pilots to demonstrate flying airplanes, and then used machine learning, in this case a decision tree, to imitate the execution of the flight plan. They found that there was so much diversity among commercial airline pilots flying the same flight plan that the decision tree they trained on the lumped data was bogus. It was better to just have one pilot give them, if there are n people, n times the data on their homogeneous set of behavior, and train the machine learning model off that. For cardiac perfusion, under an NIH R01 that I have working with Dr. Marco Zenati and Dr. Roger Dias, at the VA and Harvard respectively, they gave us a tabular dataset of decisions that a cardiac perfusionist makes. This is the person who pumps your blood for you while you are under open heart surgery. They oxygenate it, make sure you don't have bubbles, et cetera, because otherwise you'll die. They had two perfusionists give us decision input-output data, state-action pair data, for how they're going to take care of a patient. We wanted to predict what they would do. It's a small dataset, tabular; we didn't have that much information, so the accuracy overall isn't that important. What was interesting is that we didn't know that there were two perfusionists in the dataset. We thought there was just one, or that perfusionists would all think alike. I decided to run our model, but I gave a separate row identifier for each data point. Our machine learning model automatically split the data and said, this is one set of behavior, this is a different mode of behavior. And then I told our partners, who were like, oh wow, how did you know that there were actually two perfusionists in this dataset? I'm like, yeah, our machine learning model told us. But you need that: it captured about 8% of the accuracy deficit from here to here. When you're working with different people, they are not going to think alike, and you're going to have to figure out ways for your machine learning algorithm to handle that diversity. If you just train a single model on each individual person, that probably will not be a deployable option, because people are not going to be patient enough to give you enough data. The idea for us is that we want, instead of learning one policy for everyone, or even clustering people into different types like aggressive, defensive, whatever it may be (because you're still losing (K-1)/K of the data for every policy that you're training, where K is the number of clusters), to learn a latent embedding space that can describe how you are similar to and different from every other person within the behavior space of that task. We're going to extend our decision trees to personalized neural trees, where we add an embedding omega, a vector that is nonsensical, just a bunch of numbers, and we pass it as features into our decision tree. But we're going to use mutual information maximization between this embedding omega and the actions that you take, conditioned on the state. So if I give the model two different omegas, it should give me different predicted actions. Everyone is going to get their own omega, like their own key code, their own DNA. You pass that to the model, and it's going to learn how you are different from everyone else. I don't have time to go through all the math, but basically you get a lower bound on the mutual information to maximize. But because we don't actually have access to the real probability distribution for the likelihood of an embedding given the state of the world that you're in and the action you take, we have to learn it, which, deep down, at some primal, animal level, I still don't really fully accept that you can actually do.
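To make that concrete, here is a minimal sketch of the idea (my illustration, not the lab's personalized-neural-tree code, and using plain MLPs in place of the tree): the policy conditions on the person's embedding omega, and an auxiliary network tries to recover omega from the resulting state-action behavior. The reconstruction term is the usual variational lower bound on the mutual information between omega and the actions, here with a fixed-variance Gaussian posterior so it reduces to a squared error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PersonalizedPolicy(nn.Module):
    """pi(a | s, omega): behavior conditioned on a per-person embedding ('key code')."""

    def __init__(self, state_dim: int, embed_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + embed_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, state: torch.Tensor, omega: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, omega], dim=-1))       # action logits


class EmbeddingPosterior(nn.Module):
    """q(omega | s, a): tries to recover the embedding from observed behavior."""

    def __init__(self, state_dim: int, num_actions: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + num_actions, 64), nn.ReLU(),
                                 nn.Linear(64, embed_dim))

    def forward(self, state: torch.Tensor, action_onehot: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action_onehot], dim=-1))


def mutual_information_bonus(posterior, state, action_onehot, omega):
    # If q can reconstruct omega from (s, a), the behavior is informative about the
    # person: maximizing the negative error maximizes a lower bound on I(omega; a | s).
    return -F.mse_loss(posterior(state, action_onehot), omega)
```

In training, this bonus would be added, with some weight, to the imitation loss, so the model both matches each person's demonstrations and keeps behaviors for different omegas distinguishable.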
So we're going to train, excuse me, here's our policy, which can predict what you're going to do depending on the state of the world that you're in and this key code, or embedding vector, omega. And then we take in the state of the world and the action that was predicted and try to recover what your embedding was. If we optimize for the recovery of the key code that we passed in, while trying to mimic or match your action distribution, then we've done a good job. We did that with our decision trees. We applied it to a varied set of domains, the last one being: can you demonstrate for a newbie automated Uber car how you want to drive around the city to make as much money as possible, so that you don't actually have to drive around the city yourself? I think that would be pretty nice, and we beat baselines in this setting, which was pretty cool. Also, the interpretable version of it, once we converted it back to a sparse decision tree, worked very well. This was a paper by Rohan Paleja at NeurIPS a couple of years ago. Is XAI actually useful, though? I just got up here and said it can facilitate the development of shared mental models. Maybe it makes you feel warm and fuzzy inside that you can see what a neural network is going to do. It might not even matter, though. If you're a patient and you get a decision tree about why you should get radiation or chemotherapy, and the decision tree has 10,000 leaves, what's the point of that? Anyways, there was a court case where a judge ruled that such a thing was actually uninterpretable, even though the driving company involved tried to say that it was. So many people are jumping on the XAI bandwagon without actually asking what users want, what they will benefit from, and I thought it was high time that we do. So we conducted a study with participants from Amazon Mechanical Turk. I know it's not the best, but we wanted to do something large-scale, and that's what we did; it was also partially during the pandemic. We looked at a set of questions that were common-sense reasoning tasks. For example, I'd read this: Mark has just started running and is trying to train for a local marathon, so he's a newbie. The marathon is set to take place in a month, so Mark has been training very hard. That's a mistake. Unfortunately, a week before the marathon, Mark suffered an injury. Where was Mark injured? Elbow, neck, knee, or back? You don't have enough information to answer this. How many of you would say elbow? How many of you say back, knee, neck? All right. So most of you said knee. That's what I thought too. That was the point of this: you don't really know, but you can make an educated guess, and there should be some preponderance of evidence, some Bayesian prior and likelihood, that you can reason about for what this might be. Okay. And then we're going to give decision support with a virtual robot. The robot is going to say, hey, I think the injury would be in the knee, and then the robot may give an explanation: I took Mark's upcoming marathon into consideration to determine the injury would be in the knee. That would be feature-importance-based reasoning, something in the feature space that tells you this is what I'm paying attention to in making my recommendation to you. We used decision trees, counterfactual reasoning, templated language, case-based reasoning, et cetera. So we enumerated a bunch of possible responses, and we had the robot be purposefully right or wrong.
The robot was wrong 30% of the time, to try to trick you, to see if the explanation could actually help you figure out when the robot was trying to trick you. Then you'd rate whether or not you agreed with and understood the robot's recommendation. Here's the panoply of different recommendation types. What are some highlights? We found that subjective explainability was correlated with Godspeed anthropomorphism, which is like a measure of social competence. Explainability was correlated with social competence, trust, and objective performance in the task. That's good. But performance did not correlate with any particular method; there's no one ring to rule them all here. There's a lot of work still to be done on the human factors here, and others have been looking at this as well: whether you're a CS student, whether you're a digital native or not, those things are all going to impact what kind of explanation you're going to want to work with. Believe it or not, some Georgia Tech students actually really like working with pseudocode over templated language. That makes sense, but it might not be what you think of to begin with. And not all XAI methods are created equal. Overall, counterfactuals, in our general population, appeared to be the most explainable. A counterfactual says: if something was different about the world, here's what I would have done instead. A lot of people say that's grounded in how humans think, through pairwise comparisons. I wanted decision trees to be the best, but they weren't. But the good thing is that a decision tree, because it basically has all these counterfactuals in its reasoning, can easily be converted into counterfactual reasoning. It's kind of hard to do policy gradients over counterfactual reasoning, but you can do it over decision trees, and then we can convert that into counterfactual reasoning.
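As a toy sketch of that decision-tree-to-counterfactual conversion (a made-up tree encoding for illustration, not the study's code): walk the input down the crisp tree, then flip the last decision on its path and report what the tree would have recommended instead.

```python
def explain_counterfactual(tree, x):
    """tree: nested dicts, {'feature': j, 'threshold': t, 'true': ..., 'false': ...} for a
    decision node or {'action': name} for a leaf. x: a feature vector (indexable)."""
    node, path = tree, []
    while 'action' not in node:                       # walk to a leaf, remembering each split
        went_true = x[node['feature']] > node['threshold']
        path.append((node, went_true))
        node = node['true'] if went_true else node['false']
    prediction = node['action']
    if not path:                                      # degenerate tree: nothing to flip
        return prediction, f"I always recommend '{prediction}'."

    # Counterfactual: flip the last decision on the path and see where we land instead.
    last, went_true = path[-1]
    alt = last['false'] if went_true else last['true']
    while 'action' not in alt:                        # follow the other branch down to a leaf
        alt = alt['true'] if x[alt['feature']] > alt['threshold'] else alt['false']

    return (prediction,
            f"I recommended '{prediction}'. Had feature {last['feature']} been on the other "
            f"side of {last['threshold']}, I would have recommended '{alt['action']}' instead.")
```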
I'm right now preparing a manuscript, actually with my wife, who's an assistant professor at Emory; she's a pediatric neuroimmunologist. We repeated this study with neurologists, and we actually found that they did not like counterfactual reasoning at all. So I think there's also going to be interesting domain-specific catering that we'll need to do when we develop XAI interfaces for people. Last thing: agreement with wrong decisions. We did not want inappropriate compliance with or reliance on a robot. If it's giving you the wrong decision, we want you to reject it, and we want you to detect the flawed reasoning, hopefully through XAI. Interestingly, decision trees tricked users into agreeing with incorrect recommendations more. I think this could be because, well, they're Turkers; they might not have been trying really hard. We know something like 50% of Turkers are just trying to get done with the task, and there's also 20% or so that are adversarial. Yes, that could be part of it. I also think decision trees may be intimidating, even if you're trained on them, depending on the size of the tree, and the trees here were still relatively small. Which then says, for us: if we learn these decision trees, which are a powerful representation for reinforcement learning, maybe we distill them or convert them into something like counterfactual reasoning that supports people better. We don't want to just do XAI; we want to do it right. I will note that we have a publication out showing that when you conduct human subject experiments and use Likert scales to measure things like trust, something like 97% of people in HRI, the field of human-robot interaction, do it wrong. If you want to go read something quasi-scandalous, go check out that paper. We also have a journal paper that's a nice tutorial on how to design and administer Likert scales appropriately. The last main thing I want to share is how we can look not just at instantaneous decision making, does somebody have cancer, or did Mark bust his knee or was it something else, but at teaming over multiple sequential decisions, in this case for building a house in Minecraft. We're going to have a robot that has trained itself to complete assistive tasks for building a house, and it has also notionally used machine learning to figure out a behavioral model of the human. Now, it's all a lie; we hand-coded those models. But that's how we set it up, because we wanted the robots to actually be competent and reliable. We would either show people the policy that the robot had for itself, or show people the policy that the robot said was, here's what I think you're going to be doing, or we would withhold everything, and we judged whether that positively or negatively impacted people's ability to maintain situational awareness: at least three levels, being able to perceive the environment, comprehend or understand what is going on, and then project into the future what will happen, both for the robot and for what the robot thinks you are going to be doing. I mentioned the three conditions here; that was our factor for AI support. We also stratified that in a quasi-experiment based upon human expertise, novices versus experts. Now, everyone had to have some minimum qualification for Minecraft. I think we said 20 hours of gameplay experience, and we had some tests to make sure we weren't dealing with people who would still be learning the basic mechanics while they're playing. We did this in two experiments, and I'll tell you why later. First, we just had people watch the game, and we would interrupt it, blank out the screen, and use the Situation Awareness Global Assessment Technique: ask them fact-based questions about what is happening, what will happen, what are you doing, what's your avatar doing, what's the robot's avatar doing? We can gauge whether they're right or wrong. We did this with no explanation, the human on the left, the robot on the right is just working; with a status-based explanation, what the robot thinks the human is doing and what the robot says it is doing or is in the process of doing; or we can show the full decision tree, with yellow highlighting the particular part of it that's currently active. What we found is that levels 2 and 3 of situational awareness, in a 48-participant study, were enhanced by decision trees. Decision trees were marginally better than just the status-based report, literally what am I doing right now, the verb, but they were significantly better than having no decision support. This is a win for decision trees, a win for XAI, for improving situational awareness when you're in a passive observation mode, like human supervisory control. Great. But it actually makes almost everyone, the average person, worse when you're actually teaming. What we found in a 30-participant follow-on study of objective performance is that novice users augmented with status-only decision support did complete the task quicker than they did without the AI.
The expert users with AI-based support, though, were actually slower, for both kinds: just the verb of what you're doing right now, and the full decision tree. Expert users with partial versus full support were significantly faster; expert users performed better with partial support. However, what did people say they wanted? People wanted the decision tree, but they performed worse with it. Humans do want AI; we're not going to get away from that. But while a complete policy explanation is better for situational awareness, we have to be careful about what the role of the human is. Is the human an equal teammate that needs to be focused on doing, or is the human a supervisor? If the human is going to be a supervisor, then I think give them more decision support, explainable AI, to help them understand and intuit the behavior of the agents. But if the human has to be doing the task, stick to the more verbal level. There are a lot of papers on how human-human teams work together in an expert setting, and typically expert humans use implicit, deliberate communication. Think about it: if you have a bunch of newbies working together, they're probably just shouting out everything they're doing all the time, and the communication channels get overloaded with nonsense. Expert teams only give updates when they think that information will be valuable to somebody else. I think that's where we need to go if we're going to have XAI in the moment. We need to think about not just what am I doing, but what does the other person need to know about me. So: interpretable algorithms can perform well for reinforcement learning and learning from demonstration. Explanations of black-box models are not good to use as the actual policies that control the world you're living in. We want to use explainable AI a priori to build shared mental models, and also for human supervisory control. XAI during teaming can cause information overload, and XAI may trick users into accepting bad advice, so be careful. And the utility of XAI methods is user dependent, and we have ongoing work on personalization of those methods. Hopefully in a few years I'll be able to come back and say that we deployed this on a car and it didn't crash, but you never know. Autonomous driving is not the boon it was even just a couple of years ago; I think people are realizing it's a really hard problem. I'm going to wrap up. I want to thank all of the students in my lab who contributed to the work presented today. Rohan Paleja, Andrew Silva, and others have all been pioneers doing explainable artificial intelligence in my lab. I will tease one of the fun things that I do in my spare time. It's not really spare time, right, it's my fun time at work; this other stuff is really fun too. But I grew up playing competitive tennis. I still play in some adult leagues, and adults are crazy. They still hook you on line calls, they still get upset and have temper tantrums. Adults are crazy, but tennis is a lot of fun, and robotics is really hard. And ultimately, maybe we can use the XAI methods we're developing so that I can start working with this robot to be a good doubles team. I found it really hard. I was a reasonable doubles player in high school, I was always the best one in my high school, so I got to pick who to play doubles with, but it took me like three years to find a good partner. I'm very picky. It's really hard to find somebody that you mesh with.
So how can I get this robot to mesh with me on the court? I think that's like a pinnacle challenge for robotics. All right. I'll thank my sponsors, and thank you all for your time. I'll be happy to take a few questions. Yes? Beyond what you hear about the black box, what about modern AI? Is its decision making explainable in some sort of manner? However an AI manifests, it still exists in some sort of state, so I understand there's a level of interpretation in explaining what the AI is doing, but at a base level, is there really a black box anywhere, or is there always some explanation? Interesting. Okay, so the question is: is it ever truly a black box, in any deployment? I'm going to say yes, to keep it simple. Take, for example, ChatGPT, these large language models. I'm not going to call them a foundational model, because I think that's bogus; I don't think they'll be the foundation of artificial general intelligence, but they are powerful. There are people, a small community of researchers, who are working on what's called mechanistic interpretability. Let's say we have a neural network. It doesn't really matter what the architecture is, just say it's complicated. All you're doing is you're going to pass in two numbers, and it's supposed to multiply them and output the result. Can you figure out what part of the neural network is actually doing the multiplication? There's a lot of it that's probably doing nothing, and just some key little part of it that's doing multiplication. What if it's a calculator, so you can pass in a symbol for multiplication, division, subtraction, or addition? You should be able to divide the neural network into four parts, one part for each of those operations. It gets a lot more complicated when we're talking about understanding natural language, but people are working on whether you can identify what circuitry is in the neural network, and where, how deep into it, it actually starts doing this. We saw, ten years ago, Geoffrey Hinton's Nature paper that looked at the features you learn with images as input, where the lower levels are looking at Gabor-like features, SIFT-like representations, basically crescent-moon shapes, squiggles and things, and when you get to the higher levels, the highest levels, you get prototypes for a face, or an arm, or a leg, or a car, or a wheel. That version of interpretability can say it's not a black box, because maybe we can hunt for these things and try to learn them. But I'd say it's almost like an alien crash-landed on the Earth and now we're doing an autopsy, but the alien is made out of silicon and none of the organs match up with what we have. Is that a black box? Well, we can do surgery and try to figure it out, but I think that's a far cry from what I'm trying to say, which is that we want to design for interpretability to begin with. My fear is that we're going to get so good at making these aliens that we don't understand how they work, and the performance will just get so good that we will stop caring and just accept it. Maybe it'll be for the best, but that's my prediction of what will happen. Any other questions? Yes? Can you use a hierarchical task network to support online learning, and also on a physical system? Yes, I have some colleagues, like Professor Chernova, who had a PhD student who graduated from her lab who was working on learning hierarchical task networks.
Often that's done in an offline setting, where you just give it a bunch of data from a world whose phenomena you're trying to model with hierarchical task networks. With a lot of those approaches, they will continue to split and get lower and lower in their nested reasoning as you find more caveats online, without ever going back up to a higher level and saying, I'm going to redo this because it's just so messed up. Now there's all this technical debt, basically. That's why we did these fully optimizable things online: most of those approaches commit early on to something and then don't go back and revise it. But I think you probably could. Yeah. Yes? How do you decide, when working with people in health care, which specific application would be benefited by machine learning? So, how do you decide what application is right for machine learning? The cynical, and my true, answer to you is that what we actually do is figure out who's willing to work with us and who can give us money to do it. Often there's a mismatch between where the most critical need is, or where the lowest-hanging fruit is, and where the money is, and where the interests and the capacities are. That's really unfortunate. For cardiac perfusion, I was connected by my former PhD advisor, Julie Shah, with people at the VA. She just wanted to support me and was like, hey, connect with these people, and I'm like, okay, let's do it. And we got an R01, and we had to convince the reviewers that there really was a problem that needed to be solved. The issue with health care, and proving that there's a problem that needs to be solved, is that doctors don't write down when something went wrong, because that's medico-legal; you're going to get destroyed in a courtroom. If you all are going to go out there as technically minded people who are going to try to solve problems, how are you going to get people to admit on paper that there are problems? That's the biggest problem I have in my health care applications. The first one was an NSF Graduate Research Fellowship. Nobody had to tell me there was a problem; nobody had to admit on paper that there was a problem. I just got to say, hey, can I go help? And I got to play around and do cool stuff in an obstetrics and gynecology ward. Not because I sought that out, but my PhD advisor was married to an OB-GYN, and it was like, yeah, let's go do stuff. Then we found there were problems, and then we could publish on them. Long story short: it's hard. Well, if you want, you can go back out into that weather, or you can hang out. Thank you for your time. Hopefully, I'll see you all again.