[00:00:06] >> [Inaudible introduction.] Thanks everybody for being here. Can everybody in the back hear me? Great. Okay, feel free to interrupt me at any time. I'm super excited to be here; this is actually my first talk at Georgia Tech, so I'm really thrilled to be able to share with you some of the work that we're doing around human-focused reinforcement learning. When I started working on it, reinforcement learning wasn't particularly popular, but that's changed enormously over the last 5 to 10 years. Probably everybody now is familiar with some of the remarkable successes that reinforcement learning has enabled: things like Go, where agents are beating humans, and some of the really impressive results in robotics that are happening right now. [00:01:14] And so when we look at these sorts of domains, when I talk to people outside of the areas of AI and reinforcement learning, they might say, okay, great, so you guys are going to solve this reinforcement learning thing, maybe the field is starting to wrap up, so what are the open challenges? [00:01:29] And when we think about this, I think about the sorts of things that we have relied on heavily in order to get the results we've seen over the last few years. In particular, when we think about video games, we have access to a simulator with which we can simulate millions and millions of games, and in fact that's exactly what they do: they run an enormous number of simulation trials in which the agent plays Go millions of times and eventually achieves human-level performance, and this requires an enormous amount of data to train. Even in robotics we often have fairly good physics simulators, at least outside of sort of contact-rich manipulation, where we can again use the simulation, try things an enormous number of times, and then do sim-to-real transfer. So these are all domains where we can leverage the fact that it is essentially easy for us to try things out. But my interest in this really comes from a different set of domains. I'm very interested in how we can use reinforcement learning to amplify people's potential, and in particular, when I started my postdoc I got interested in the question of how we can use reinforcement learning for education. I came into this pretty naive: I thought, great, education is a sequential problem, we have outcomes we want to optimize, all we have to do is kind of sprinkle RL over it and everything will be fine. A decade later I wouldn't say we're quite at that point yet, but I think we've learned a lot, and it's really helped push the technical frontiers of both machine learning and what we can do on the education side. What I've shown you here is one of my early papers in this area, where we looked at how we could use adaptive decision making to select levels for groups of typically three students playing a communal game, and the challenge was how to equalize levels to keep everybody engaged when they're doing, say, a fractions task or simple mathematics. [00:03:16] And when I started thinking about those sorts of domains, as well as a lot of other educational tasks that we've thought about over the last decade,
I started thinking about what the particular challenges are that we encounter when we start to do adaptive decision making in the context of people. We're thinking about educational systems, and my work recently has also started to consider applications to healthcare, and I think we face a lot of different types of constraints when we start to think about these problems. In general we have no good simulator of human physiology, behavior, and learning, so we can't just plug in our simulated human learner, optimize a whole bunch, and then hope that we've found an optimal solution. [00:03:56] And also, gathering real data impacts real people, so we can't just run our system many times and hope that we eventually find something good while it fails to teach people fractions along the way. And so I became really interested in thinking about the technical challenges that arise when we start tackling these cases. [00:04:15] When I started thinking about this problem I was picturing it in a figure that looks like this, where we do reinforcement learning to support people in the context of education and healthcare, but I've really moved to thinking about it more like this, where I think about the case that we're hopefully going to end up with these hybrid human-AI systems that are much better than either can alone. This is a somewhat different view than some people have, though of course many people share my view as well: some people view it more as AI taking over entirely, and other people say, well, humans are already the best they could be, so the best we could do is maybe apprenticeship learning or imitation learning. What I'm really interested in is problems where we don't think people are perfect: we don't think people are perfect teachers, we don't think that we have completely solved, whatever that means, healthcare, and so the hope is that we could do better than either humans or machines are doing right now. [00:05:11] So what I'm going to do today is talk about — this is obviously a very complex challenge — but I'll talk today about some of the main areas where my lab is trying to tackle these challenges, and I'm going to talk about counterfactuals, efficiency, and complex objectives. [00:05:26] But before I do that, I know it's a pretty diverse group. I'm not going to be able to do justice to 50 years of sequential decision making, but I will do a really quick overview. A lot of the work that I think about involves stochastic decision processes: we imagine that we have some sort of state of the world, say trying to capture the state of the learner. [00:05:45] And then we have a policy, which I'm going to denote pi, which is a mapping from states to actions, and the idea is that those actions you take then affect the state of the world, and then we get some sort of reward signal back — or if you're from controls, there's a cost function. [00:06:02] And what we're often interested in doing is thinking of this loop going around many, many times, where you think of the value of a policy as the expected discounted sum of rewards we would get — that's a function of the immediate reward plus a discounted sum of the future rewards we expect to obtain from that state.
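To pin down that verbal definition, here is the standard way it's usually written; this notation is my addition, since the talk only states it in words:

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t \,\middle|\, s_0 = s,\; a_t \sim \pi(\cdot \mid s_t)\right]
\;=\; \mathbb{E}\big[\, r_0 + \gamma\, V^{\pi}(s_1) \,\big|\, s_0 = s \,\big]
```

That is, the value is the immediate reward plus the discounted value of wherever the world takes you next.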
[00:06:19] And so there's a beautiful legacy of this work, starting from Bellman in roughly the fifties, which started off thinking about the case where you actually know how the world works — where you know what the dynamics of the system are and you know what the rewards are — but it's still a really complicated planning problem to figure out what you should do. [00:06:38] Now, often in these cases we're going to be interested in figuring out the best thing to do, so we want to think about not just evaluating a policy, how good it is, but actually trying to extract the best possible decision policy to maximize the sum of the rewards. [00:06:52] And a really interesting question, of course, is what if you don't know these systems? I said that in healthcare and education we probably don't have perfect simulators or models of how the world works and how things will change, so we're going to assume that the reward model and the dynamics model are unobserved, and we can only learn about them through experience with the world. [00:07:16] So that's my very, very brief overview of the notation, and we're going to continue in this type of formalism throughout this talk. The first thing that I want to talk about is counterfactuals, and this is actually probably my favorite topic right now. I think it's one of the beautiful instances where we get a combination of really deep mathematical theory and really important practical impact. So what do I mean by this? Well, I think we're increasingly living in an age where we have enormous amounts of data about decisions made and their outcomes. We now have an enormous number of electronic medical record systems, we have incredible amounts of consumer data — probably your phones are collecting data about you right now. [00:07:59] There are many, many implications of that, and one of the potential possibilities is to use that data to make better decisions. In particular, if we think about something like a healthcare example, we can think of a sequence of decisions being made, often in response to changes in outcomes. So maybe we can think about different individuals, the different sequences of treatments they had, and some measure of health outcomes, and the question is: can we leverage this historical data to try to make better decisions for a new individual? [00:08:32] Now, this is a really challenging question because the data is censored, so we can never know what it would have been like if we had made a different sequence of decisions. Right now, you'll never know how much better your life could have been if you had gone and gotten coffee instead of coming to my talk — that is a future you will never know. So this is the general challenge: we have historical data about decisions made and their outcomes, and we can never know the counterfactual, what would have happened under other decisions that could have been made. [00:09:05] So that's one really big open challenge. The second big challenge is the need for generalization.
So what do I mean by that? If we think about making a sequence of decisions, there's a combinatorial number of decision sequences that could be made, and it's going to be infeasible for us to explore all those possible pathways. I can't possibly run a clinical trial to consider every sequence of interventions I could do for healthcare; that's just not going to be feasible to scale. So we're going to need some way to either leverage structure or to generalize, so that we can make inferences about treatment plans we've never tried. [00:09:44] And I'm going to demonstrate later that in some cases there's going to be sufficient variability that that might be possible. Now, right now I think is a really exciting time to be thinking about counterfactual reasoning and causal inference — one of the recent Turing Award winners is very interested in this topic — and many people are very curious about it, because it's considered to be needed to make further advances towards predictive models that really capture structure about the world. This has been studied for a long time: there's a very deep history of this in statistics, econometrics, and epidemiology that has been really exciting to start to get familiar with, but the vast majority of this prior work considers a single decision — in fact, often a single binary decision, like treat or not treat. [00:10:32] And I think what we have right now is increasing evidence of scenarios where we're making not just a single decision but a sequence of decisions, like in the reinforcement learning context. And so I've been very interested in this question, which I call counterfactual or batch off-policy reinforcement learning, where you have a dataset of prior data that was collected by some decision policy — say physicians, how clinicians interact in the hospital — and we want to use that to make better decisions. And what I want to convince you of is that that problem is really hard. [00:11:06] So one of these huge challenges is covariate shift. I really love this figure that came from my colleagues up in Boston; their two labs thought about illustrating this problem in the context of healthcare. Imagine that you have some data, and what you want to do is evaluate how good your policy is at treating people, and it has a particular strategy for treating people. In particular, it would say: hey, for these people, given their current oxygen intake, I would do mechanical ventilation. [00:11:41] Well, in order to evaluate that decision policy, you can only look at people who happened to have gotten mechanical ventilation, so it's only going to be a subset of people. And then maybe I want to consider: okay, some of those people experience discomfort; under my AI policy I'm going to do sedation. So again, only a few of those people actually got sedation, and now I have this pool of maybe 15 people, and then, because of their blood pressure measures, under my decision policy I would apply vasopressors.
[00:12:11] Well, now you end up with this tiny effective cohort of around 5 to 6 people that match the decision policy you're trying to evaluate. The reason for this is the basic setup: the data was collected under some decision policy — say from particular physicians or particular protocols in hospitals — and you want to evaluate how good an alternative one would be, which means by definition you're going to be taking different types of decisions, which often end up in different types of states. So you have what we call covariate shift in this case, because the data you need comes from a different distribution than the one under which the data was originally generated. [00:12:51] So that's the first challenge. How can we deal with this? One idea is that maybe we could build some models — predictive models of what might happen if I do mechanical ventilation under that oxygen intake. We could take that historical data, build predictive reward models or predictive statistical models, plug those into our favorite planning algorithm or Q-learning or whatever we want, and then compute an optimal policy. This is in fact what we started doing when we began thinking about this problem a few years ago, and then we observed something troubling. [00:13:29] So this shows a very simple scenario where we just have somewhere between 5 to 20 states — that's very small, but ideally we can think of this x-axis as the complexity of the representation we're using — and the y-axis shows how good these models are. And this is cross-validated; this isn't overfitting. It's saying, perhaps unsurprisingly — this is for an education domain — that slightly more complicated models of student learning actually fit the data better. [00:13:59] So this wasn't so surprising, and we thought, okay, great, we're getting better predictive models and then we can plug those into our decision making. The thing we found surprising was the second line, which shows: if we take those models, build a decision policy, and then deploy it, how good is it? And what the black line says is that in this domain you only want to use the 10-state model. [00:14:25] Even though you're getting better predictive models, you're actually getting worse decisions. This seemed really weird to us when we first looked at it, and we were kind of concerned, because we had thought you just need really good predictive models and then you do planning on top of those. The reason this occurs — and this has been noted by some researchers here, and some who also used to be here — is that these models have misspecification; in particular, the model is not Markov anymore, so when you run these planning algorithms, they rely on assumptions that are false, and even if you have better predictive models, your decisions may be worse. [00:15:08] So that's the first thing that should indicate that this problem can be fairly subtle. The second is: okay, maybe we just need to make sure that our models are unbiased, so that we can get good, consistent estimates, and then as long as we have those, we can use these good estimates of decision policies to make choices about what to deploy in the future. [00:15:29] In particular, I'll talk in a minute about statistical techniques to do that, but one thing that we also observed is: okay, let's get our great unbiased estimates of how good different decision policies could be.
Well, unfortunately it turns out that even if you have these unbiased policy estimates, they don't always lead to making good policy selections. [00:15:51] And the intuition for this is that you don't just care about bias, you also care about variance, and if you ignore this then you can make systematically bad decisions. So that should highlight why these problems are complex: we have this covariate shift, we can't just rely on getting good models, we can't just rely on unbiased estimators. We really need to grapple with the fact that we've got this old data from a completely different decision policy, we want to make better decisions in the future, and how can we do this? So I've been really interested in how we can take old data and make better decisions in the future with some reliability, and we've been thinking about the full spectrum of this problem, from how do we evaluate policies, to how do we pick policies, to how do we bound their performance. [00:16:39] I'm just going to talk about one of those: how do we do this really efficiently — how do we take as small an amount of data as possible and evaluate a policy well? This is joint work with my former postdoc Phil Thomas, who's now a professor at UMass Amherst. [00:16:56] So I suggested before that you can't just build a parametric model and then hope that that will allow you to plan well and get out a good policy; the problem is that even though these methods tend to have low variance, they often have a bias, unless you make strong assumptions that the model will really converge to the true thing. [00:17:16] So the alternative is to use importance sampling, which is a technique that allows us to use data from one distribution to estimate a quantity under a different distribution. The idea — I think it's a really beautiful idea, and maybe you're already familiar with it, but I'll go through it quickly — is this: let's imagine we have a quantity, say the value of a policy; call that R, and we want to estimate its expectation over a distribution P of x. So we would just take the probability of x times the reward we get under x. Well, the cool thing we can do is multiply and divide by one — multiply and divide by Q of x, which is an alternative distribution over x's. [00:17:58] And then we can see that the expected value of R under one distribution is equal to its expected value under a different distribution once we reweight things. And this is really cool, because it means we can use old data that was generated from a different policy and reweight those trajectories to look more like the data we would get under the policy we care about. So maybe in the past I prescribed Advil half the time and aspirin half the time, but in the future I want to prescribe Advil more, so in my old dataset I'm going to upweight the places where the decisions made look like the new policy. [00:18:33] So it allows us to transform between distributions. And it turns out we can do the same thing for reinforcement learning, by reweighting our trajectories to look more like the trajectories we would get under the new decision policy. And the really cool thing about this is that we don't actually have to know anything about the dynamics of the world; we just have to know the probability of making decisions under the old policy and the new policy.
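As a concrete illustration of that reweighting idea, here is a minimal sketch of per-trajectory importance sampling for off-policy evaluation. The trajectory format and function name are my own assumptions for illustration; the talk doesn't prescribe an implementation.

```python
import numpy as np

def per_trajectory_is(trajectories, gamma):
    """Per-trajectory importance sampling estimate of the evaluation
    policy's expected discounted return.

    trajectories: list of trajectories; each is a list of steps
        (reward, prob_behavior, prob_eval), i.e. the reward plus the
        probability the behavior policy and the evaluation policy each
        assign to the action that was actually taken.
    Unbiased (assuming the behavior policy covers the evaluation policy),
    but the variance can grow exponentially with the horizon.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (reward, prob_behavior, prob_eval) in enumerate(traj):
            weight *= prob_eval / prob_behavior   # reweight toward the new policy
            ret += (gamma ** t) * reward          # discounted return of this trajectory
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```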
[00:19:01] Now, this has been known for a number of years in reinforcement learning, and many people have been really interested in doing this. The problem is that these types of estimates tend to have really high variance — in fact, we've shown it can be exponential in the horizon of decisions you're making. So we have these two types of techniques: parametric models that might be asymptotically wrong, and importance sampling methods that are unbiased but have really high variance. [00:19:28] And so we were inspired by some work from the statistics community that says: combine them — generally, ensembles often work really well, and that can be true here too. The idea at a high level is that we're going to combine parametric models with these unbiased methods. [00:19:45] These ideas were extended to reinforcement learning as doubly robust estimators, where the nice property is that you can get an unbiased estimator if either of these two things is right. So in some ways I think of this as a robustness guarantee: if either your model is correct, or you can accurately estimate these reweightings of the probabilities of taking decisions, then you can get a good estimate. [00:20:09] But what I would say is that in practice we often don't care about this asymptotic regime: we have a real, finite amount of data and we want to get good estimates. So how can we do this with a finite amount of data? Our intuition was that often what we really care about is just accuracy — we care about mean squared error, which is a combination of bias and variance — so let's try to directly optimize for that. We know that model-based estimators may have high bias and low variance, and importance sampling estimators may have low bias and high variance, so can we combine the two directly? [00:20:46] And so we introduced two new off-policy evaluation estimators, and again our goal is to do much better with limited amounts of data in order to make good decisions. The first idea is to use weighted doubly robust methods, inspired by the fact that for importance sampling, estimates often have much lower variance if you use weighted importance sampling, and the nice thing is that it's a very small change. [00:21:14] Empirically it can give much better estimates, and it still has really good statistical properties, meaning that as you get more and more data you are asymptotically consistent. The second idea is a little more complicated, but the idea is that we're going to blend smoothly between these different types of estimators. In particular, you could imagine some estimators with high bias and low variance, essentially relying mostly on the model and only a little on importance sampling; then we have estimators where we mostly rely on importance sampling; and there's a smooth spectrum between all of these. [00:21:55] And so what we can do is try to directly optimize across this spectrum. We have this combination of bias and variance, we try to directly minimize it, we solve it as a quadratic program, and we can prove that this is strongly consistent.
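Here is a rough numpy sketch of the weighted doubly robust idea just described: combine a learned model of the values with importance weights, and optionally self-normalize the weights to cut variance. This is my own illustrative reconstruction under simplifying assumptions (a fixed trajectory format and given approximate value functions), not the exact estimator from the paper.

```python
import numpy as np

def weighted_doubly_robust(trajectories, gamma, q_hat, v_hat, weighted=True):
    """Sketch of a (weighted) doubly robust off-policy value estimate.

    trajectories: list of trajectories; each is a list of steps
        (state, action, reward, prob_behavior, prob_eval).
    q_hat(state, action), v_hat(state): approximate value functions for
        the evaluation policy, e.g. fit from the same logged data.
    """
    n = len(trajectories)
    horizon = max(len(traj) for traj in trajectories)

    # Cumulative importance ratios rho[i, t] = prod over t' <= t of pi_e / pi_b.
    rho = np.ones((n, horizon))
    for i, traj in enumerate(trajectories):
        ratio = 1.0
        for t, (_, _, _, prob_behavior, prob_eval) in enumerate(traj):
            ratio *= prob_eval / prob_behavior
            rho[i, t] = ratio
        rho[i, len(traj):] = ratio            # keep the last ratio after termination

    # Ordinary DR divides by n; weighted DR self-normalizes per step,
    # trading a little bias for (often much) lower variance.
    weights = rho / rho.sum(axis=0, keepdims=True) if weighted else rho / n
    prev = np.hstack([np.full((n, 1), 1.0 / n), weights[:, :-1]])

    estimate = 0.0
    for i, traj in enumerate(trajectories):
        for t, (state, action, reward, _, _) in enumerate(traj):
            # Importance-weighted reward, corrected by the model-based
            # control variate (q_hat minus the previous step's v_hat).
            estimate += (gamma ** t) * (
                weights[i, t] * reward
                - (weights[i, t] * q_hat(state, action) - prev[i, t] * v_hat(state))
            )
    return estimate
```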
[00:22:12] For those of you that are familiar with these types of estimators, you might be saying: well, if you actually knew what the bias was — the difference between the true value and what you're estimating — then you wouldn't have this problem; you can't know what the bias is, so how do you optimize if you don't know the bias? [00:22:28] Our insight there was to say: we don't know what the bias is, but we can get a bound on it. Because importance sampling is consistent, we can look at the difference between it and our model-based estimator, and if we have a confidence bound over the importance sampling estimator, the gap between these gives us the minimum bias that must occur. [00:22:55] Okay, so even if you don't care about the mathematics behind this, you might wonder: is this actually any better, does it buy us something? So we compared, in a small simulation, across lots of different types of techniques. What you see here on the x-axis is essentially how much data we have — how much historical data — and the y-axis is how well we estimate this policy; this is a simulator, so we know the ground truth. You can see, perhaps not surprisingly given that I'm showing this graph, that our method is doing very well — it always has the best accuracy for the amount of data — but I think the more important thing is to look across the x-axis and notice that we need roughly an order of magnitude less data to get the same accuracy of estimate. [00:23:45] Why is that important? Because we normally need to know something about how good these estimators are in order to deploy them, and so if I talk to some of my domain experts who are doctors or teachers, I can say: even if you only have around 100 data points, I can still tell you whether or not this is better. That's much, much better than needing a million data points, because in a lot of our real contexts we are data limited. I've done studies with around 1,000 or 1,200 students in classrooms, which is a lot of work, but it's nothing compared to the amount of data you would get if you run an AlphaGo-style experiment — it's not even on the same axes. [00:24:25] So we need these methods in order to do good inference from small data. And we've seen that; partly because of this, some of my colleagues have been looking at things like sepsis treatment, trying to use offline, previously collected data to make better decisions. This is a case where you're deciding how to use IV fluids and vasopressors, you have rewards for survival and negative rewards for death, about 750 states, and they had access to roughly 20,000 ICU patients — again quite a lot, but nothing compared to the data you would normally need. [00:25:05] And what they found in this case is that our weighted doubly robust method was the only one with good enough estimates to be able to say: hey, this new policy, we think it's reliably better. Because if you don't have good enough estimates, the estimates will look like they overlap a lot, and it will be hard to tell whether you've actually got a better policy, or to be confident that it's better.
[00:25:27] So in general you might still be skeptical about my first claim, which is that from all this historical data we actually have enough data and enough variation to find things that are better than what people are already doing, and I want to argue that it's possible — that we don't have to only mimic what has been done in the past. [00:25:48] Some of you might be familiar with CartPole; it's a very simple control task and a common benchmark for reinforcement learning. We looked at this recently — this is a different method than I showed before, but I want to use it to illustrate what's possible. This is a task where you try to keep a pole balanced on a cart, and we have a behavior policy here, in blue, which is not very good — optimal here is around 200, and you can see this is not very good. [00:26:19] There are some previous methods that can use that old data to do better: they take data generated from that not-very-good policy and learn a better policy using these types of off-policy techniques. And we showed that by being more strategic — the green policy — we could actually get to near optimal. [00:26:40] So I highlight this to say: with the same data, if you change the techniques, there can be enough variation in the data that you can do enormously better than what the data has already been shown to be doing, because you can leverage the natural variation in that dataset and identify the cases that lead to much better outcomes. [00:27:01] And it turns out that compared to some other prior approaches we also still do much better, so it really pays to be strategic in the way that we leverage this data. And I want to highlight that that's not just true in this domain; there are other domains, including an HIV simulator domain, where we don't get close to optimal, but again you can get a huge improvement compared to the behavior policy. The behavior policy here is in blue — we only have data from that behavior decision policy — and what we get is in green. [00:27:29] So I think in many cases there will exist — not always; if you only have one policy that's ever been used, you can't — but in many real-world cases you'll have some natural variation that gives you the opportunity to do much better with that existing data. The final thing I want to highlight in terms of applications, and one of our original motivations for coming to this problem, was optimizing an online fractions game. This is a collaboration with the University of Washington and some of our students, and we had access in that case to old data from around 11,000 students, where levels had been sequenced in different orders. Our goal was to try to increase persistence, because this was an optional educational game and people would just drop out after a while, and we thought the material had some pedagogical value, so we wanted to help people persist. So we could use this offline data to learn a policy which would say things like: depending on how long you took on the last level, and a whole bunch of other features of how you're interacting with the game, we can follow a branching policy — we can use adaptive reinforcement learning to decide what to give next. [00:28:42] And so we did this, and we found that we could increase persistence by 30 percent.
So this is again using the old randomized data: we found that there was a much, much better policy hidden in that data that we could extract, and then we verified that with an experiment with another 2,000 users. [00:28:59] So I just want to highlight here that I think this sort of off-policy, counterfactual reasoning can be really, really powerful. There are a huge number of foundational theoretical questions still to be answered, but there are real practical possibilities here for leveraging our old data better, and that's great because then we don't have to do online experimentation in cases where that's not feasible. [00:29:25] [Audience question, partly inaudible: did you have data for all of those branches?] Great question — I don't know if everybody heard that. The question was: in this data, we're looking at people branching; do you have data for all these branches? In this case we do, for the short horizon — people, for better or worse, only play this game for around 5 to 7 levels, most of them — so for that we could have coverage. As you get into broader cases you can't have coverage over all of this, and I think one of the biggest questions we've been thinking about recently is how you do well within the coverage you have. [00:30:09] Because often you're not going to be able to explore the whole space, and I think right now our methods tend to be overly optimistic, so they can lead you to that one person who happened to do well, which is not reliable. [00:30:25] And please feel free to interrupt me with any other questions. So, as I said, this is one of my favorite topics right now; I think it's super exciting and there are a lot of open questions. I will highlight one thing that almost all of this work has historically relied on, which is that you have access to all the information people are using to make their decisions. Formally we call this no confounding, and it means, for example, that in our electronic medical record systems you have all the variables doctors are using to make decisions. This is almost certainly wrong, and so one of the things we've been working on recently is how to relax that assumption and do sensitivity analysis in those cases. [00:31:08] So the second thing I want to talk about is efficiency. In the first case I was considering applications where you already have a lot of data. [Audience question, partly inaudible] So the question is saying that a lot of medical knowledge has been passed down for many generations, and there may or may not be solid clinical reasons to support it — in many cases there may be, but there may not be — and could you use these types of methods to understand which of these are actually associated with better clinical outcomes. Yes, but only within the space of things that have been tried. So we can only — and this goes to the earlier question as well — explore within the space of decisions people have tried historically. If nobody ever prescribed Tylenol, we can't tell you what it would be like to prescribe Tylenol; but if some people happened to prescribe Advil in the same situation where other people prescribed aspirin, then we could tell you. So we can leverage natural variation to infer those decisions, but only within that restricted space.
[00:32:35] So in some of these cases it can be really hard to try things out online, but of course the real promise of machine learning is systems that learn with experience, and I think the real beauty of reinforcement learning is systems that can actively try things out and actively explore. In some cases I think that's feasible — not in all, but in some scenarios we have the chance, where it's low-stakes enough and we're unsure enough about what to do, that it will be reasonable to do trial-and-error learning, but we really want to be sample efficient. [00:33:07] So I was interested in the case of transfer learning. I was thinking about cases where you have a series of individuals — a series of students — going through different educational games or educational pathways, and each of those tasks involves a sequence of decisions, and my goal was: how do we think about leveraging structure across these tasks so we can improve for a new student or a new patient? And so I started thinking about this more formally, in terms of what the theoretical foundations are and the practical improvements we can get. Since I started working on this problem there's been a lot of excitement over things like meta-learning, lifelong learning, et cetera, but the vast majority of the success there still tends to be empirical, and I was really curious about whether we could be provably faster: what can we do with prior experience that makes us provably more efficient? When I started investigating that, I realized that we didn't really understand this even for a single task, even for some quite basic settings, and so I'm going to briefly describe some of the work that we're doing there, to highlight how we think about whether a reinforcement learning algorithm is good. [00:34:09] And I guess the reason why I think theory is still important, even though there have been some incredible empirical successes, is that these empirical successes are normally on somewhere between one to 50 benchmarks — I'd say 50, that's about the number of Atari games, and that's probably the biggest suite people normally consider — and I think that's great, but it doesn't give us any assurances of how well algorithms will work on future domains, whereas theory, despite many of its limitations, gives us assurances on how well these algorithms will work on any problem before we deploy them. So I think it's still really helpful. [00:34:50] So how do I, and most others in the community, formally evaluate our algorithms? Either in terms of regret or PAC. If we think about an episodic scenario, which I like to think of as a series of students or patients, we look at what the return was — so maybe for the first student this was the return — and then over time, as the algorithm learns and tries things out, hopefully it's improving, so its performance is getting higher. [00:35:18] And we'd like to evaluate how good this is compared to the best possible. So let's imagine — we can't know this in reality — that this is the optimal return, the best you could do in this scenario; again, this is unknown.
We can compare how well we did to how well we could have done and subtract, for each of those cases — it's like we're grading the algorithm at each time point and taking this difference — and we define regret as the sum of these differences: the total shortfall over the whole sequence of decisions you've made up to this time. [00:35:51] So regret, I think, is a really harsh measure. It's sort of saying: if you had gone to a different kindergarten 15 years ago, what would your life trajectory have been like? And it's summing up all these differences from the very beginning. Now, in some cases we only care about avoiding pretty bad decisions. [00:36:12] So we're willing to tolerate some amount of suboptimality, but only a little bit, and we just want to count up the number of times we make decisions that are not near-optimal. We call that PAC — probably approximately correct RL — and it counts up the number of significant mistakes. [00:36:32] There's been a lot of excitement and interest over which algorithms either have good regret or don't make a lot of mistakes. And you can show that if you don't do anything smart — this is even just for the tabular case, where we can write down all the states — you can have exponential PAC bounds, making a whole bunch of bad decisions. [00:36:50] And you can have essentially linear regret, which means you essentially always make bad decisions. Then there's been a lot of work over the last couple of decades looking at how to improve that, getting quantities that scale as a function of the size of the domain. [00:37:08] And as of last year we've essentially closed this problem: we now have tight upper and lower bounds for both the PAC setting, at least in the dominant term, and regret. So to some extent we now formally know how hard these problems are in the worst case. [00:37:28] But fortunately, in many cases things are not as bad as the worst case, and so with another student, Andrea Zanette, we've been starting to develop the first instance-dependent bounds. What I mean by this is that we design a generic reinforcement learning algorithm where we can guarantee that if the problem is easier, it's going to do better. [00:37:47] And I think this is really important, because if we want to democratize AI, we need people who are not PhDs in computer science to be able to use these algorithms and know how well they'll work, and so we have to be sure that these types of algorithms will work well and exploit structure when it's present. [00:38:06] So I want to talk just really briefly about this, to share some of the ideas we used to try to achieve it. One of the key insights for how we attain these types of results is thinking about the difference between our estimated value function — the expected discounted sum of rewards — and the truth. Historically, people thought about this in terms of estimating the model parameters, and used things like Hoeffding's inequality to bound things. [00:38:36] There was an improvement over this when people started to use Bernstein's inequality, which looks at the variance.
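(Stepping back for a moment to the regret and PAC definitions above, here is a toy numeric illustration — my own example with made-up numbers: regret sums every shortfall relative to the optimal return, while a PAC-style count only tallies the episodes that were significantly suboptimal.)

```python
import numpy as np

def regret_and_pac_mistakes(episode_returns, optimal_return, epsilon):
    """Toy illustration of regret vs. PAC-style mistake counting.

    episode_returns: return the learning algorithm achieved on each
        episode (e.g. each successive student or patient).
    optimal_return: the best achievable expected return (unknown in
        practice; known here only because this is a toy example).
    epsilon: how much suboptimality we tolerate before calling an
        episode a "significant mistake".
    """
    gaps = optimal_return - np.asarray(episode_returns, dtype=float)
    regret = gaps.sum()                     # sums every shortfall since the start
    mistakes = int((gaps > epsilon).sum())  # counts only the clearly bad episodes
    return regret, mistakes

# e.g. regret_and_pac_mistakes([2, 5, 8, 9, 10], optimal_return=10, epsilon=1.0)
# returns (16.0, 3): total regret of 16, and 3 episodes more than epsilon
# below optimal.
```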
But it turns out that what you really care about is how different the value of the next state can be. The way I like to think about this: if I'm driving on the highway and there's somebody who might or might not move to let me get to the next exit, and let's say I'm trying to get somewhere across town — if the next exit is only 0.25 miles away, it doesn't really matter whether they move or not; whether I get off at this exit or the next one, the amount of time for me to get to work is going to be about the same. But if they don't move and the next exit is 50 miles away, then it makes a really big difference whether or not they move, because either I'm going to have a really short drive to work or a really long one. And so that would be a case where we have a really big split in the value of the next state: in the first case it's going to be, say, 17 minutes versus 18 minutes, and in the second case it could be 17 minutes versus an hour. [00:39:39] And the size of that gap turns out to determine how hard the problem is to learn. It turns out this is the key thing that decides whether it's going to require a lot of data for us to figure out how to make the right decision. [00:39:54] And so it enhances our understanding of why it's hard to learn to make good decisions in some cases. What we can show here is that two cases are pretty easy — these were known before, but they fall out of our unified theory. Deterministic MDPs are cases where only one thing is going to happen — I'm always going to get off the highway — and those are pretty easy to learn. The other is where everything is totally random and my decisions don't affect the next state — we can call these bandits — and those are also pretty easy, because I don't have to think about the future. [00:40:26] The hard ones are in between, where whether I get off the highway or not makes a really big difference to my value function, and I have to learn that sometimes I can get off the highway and sometimes I can't. This also helps answer part of a previous learning theory open question about whether long-horizon problems are really harder to learn. [00:40:46] On the other hand, I think it verifies the empirical findings which have shown that the variance of the value function seems to determine how much data you need to learn in these different types of problems, and now we have the formal underpinnings of why that's the case. [00:41:03] So that's the first result to do that, and now we've started to understand these single-task settings, but we've also been doing a lot of work on the transfer learning setting, and again I want to argue that these are not just exciting things from a theoretical perspective but can also be practically helpful. One of the things I was interested in, as I said before, is when you have a sequence of students, and one of the cases we started thinking about is: what if you have a sequence of, say, customers and you're trying to make news article recommendations — you're trying to decide, this is from Yahoo from a few years ago, which news article recommendation to make — and what we could show in this case is that by doing sequential transfer learning we could do much
[00:41:47] better than prior approaches. One prior approach you could use — the orange and red lines in the middle — treats everyone the same, so you don't do any transfer learning, and the problem with that is that people do have different preferences, so you tend not to get very good click-through rates. Alternatively, if you try to learn from scratch for each new individual, you just don't have enough data, so it takes too long to personalize. [00:42:11] But if you do the sort of transfer learning approach with strategic exploration, you get our blue line, where you can see that we get significantly higher click-through rates. That's part of why I think these strategic exploration problems are important. The other reason I picked this particular example is that Yahoo, sadly, would not allow us to just try this out on their live system, but they said: we'll give you a dataset of 500,000 people. So this is an online algorithm, and we're using some of our ideas from counterfactual reasoning to evaluate an online algorithm offline. [00:42:45] There we can see how well it would have done, assuming future customers are similar to past customers. So I think there are a lot of really exciting questions here, both on the theoretical side and on how we can use these in practice, so we can quickly adapt to new scenarios in cases where that's possible. [00:43:04] The final thing I want to talk about is complex objectives. You might wonder — it may be hard to read from there — what this last slide is. So, I live in the Palo Alto school district, I have two young boys, so I pay attention to some of these things, and I was looking at the desired characteristics, the goals, of my school district. I pulled up their sort of promise statement, and you can see there's this enormous list of things the school district is trying to do — this is only page one — and in particular they want to do all these things which seem like they might be impossible to do simultaneously with finite budgets: students will perform at or above grade level, and they'll get to discover their passions, and we'll make sure we've got equity, and we'll support teachers — there's just this incredibly long list. And yet when I teach reinforcement learning, what I say is: we're going to maximize expected reward. [00:43:58] So it's clear that the complexity of the way people think about optimizing things is enormously different from the complexity with which we, at least standardly, teach RL, and so I've been thinking for a few years about how we deal with these complex multi-objective optimization problems. [00:44:16] And I've also been thinking about the fact that, at least in real life, we almost never get things right the first time. When we build our educational systems we have to change them; even our simple fractions tutors we have to change; on your iPhone we're always adding new features, maybe new sensors, to try to improve performance. And so instead of thinking of there being a static set of states and a static set of actions, we should start to think about this as a continual, [00:44:45] evolving environment where those things are also going to be modifiable.
So we've been thinking about that a lot — I think it's a really cool question — but I'm not going to talk about any of our work on that; I'd be happy to chat about it afterwards. Instead I want to talk about this complex multi-objective optimization problem. [00:44:59] So I was really excited that we had a paper a couple of months ago — this is again with my former postdoc Phil Thomas — where we're trying to think about how to handle these complex objectives, at least with historical data. The context is that we now have this opportunity for machine learning and AI in all these sorts of applications, and there's also all this very important work and attention on the fact that we're failing a lot — we're deploying things that are not fair to all groups — and so how do we start to at least mitigate against the known failure behaviors that we want to avoid? [00:45:36] I'm not going to deal with the unknown unknowns at all, but we know, even just from a legislative perspective, that there are some types of behaviors we really want to avoid. So in particular, we don't just want to maximize an objective — maybe that's our expected value — we want to maximize it subject to some constraints on behavior. [00:45:55] Previously, a lot of the methods to do this — and there are some beautiful methods from the optimization community — have tended to assume there's some nice structure in these constraints and in your objective, often convexity, that allows you to do this in a computationally efficient way. But with Phil and some of our other collaborators, we were less focused on computational efficiency; we just wanted to be able to start tackling the constraints we thought were important for the application areas we were looking at. [00:46:28] And in particular we wanted to be able to handle things like safety and fairness, and think about cases where, for example, you have a new decision policy you want to deploy and you want to be sure it's better than the old one. This comes up all the time: I'll talk to people in companies who will say, okay, maybe you can use our historical data, but we're not going to deploy anything unless we're pretty sure it's going to be better — what assurances can you give in this case? [00:46:59] So let me write down our simple algorithm for tackling this. The idea is that we're going to have people who are not PhDs in computer science write down behavioral constraints — constraints on fairness or constraints on safety — and say with what probability they want those to hold, whether they want really high confidence or not, and give us some data. [00:47:17] And then, if you have a finite set of decision policies you're considering running in the future, we can compute a generalization bound for each of those constraints, for each candidate policy. Then you figure out whether each candidate passes all of those constraints with high probability, and we pick whichever of the safe ones has the best performance. [00:47:37] And then, importantly, if there's nothing that's safe, we say we can't find one.
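A minimal sketch of that selection loop, assuming a finite candidate set and a Hoeffding-style bound for each behavioral constraint; the data layout and field names here are hypothetical placeholders for illustration, not the actual interface from the paper:

```python
import numpy as np

def hoeffding_lower_bound(samples, delta, value_range=1.0):
    """High-confidence lower bound on a mean via Hoeffding's inequality."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean() - value_range * np.sqrt(np.log(1.0 / delta) / (2.0 * len(samples)))

def select_safe_policy(candidates, delta):
    """Pick the best candidate that passes every constraint with high confidence.

    candidates: list of dicts with (hypothetical) fields
        'performance_estimates' -- per-trajectory estimates of the return
        'constraint_estimates'  -- one array per behavioral constraint,
                                   encoded so that a mean >= 0 means "acceptable"
    delta: allowed probability of violating a constraint.
    """
    safe = [
        cand for cand in candidates
        if all(hoeffding_lower_bound(g, delta) >= 0.0
               for g in cand["constraint_estimates"])
    ]
    if not safe:
        return None   # "No solution found": be transparent rather than guess
    # Among the candidates that look safe, return the best-performing one.
    return max(safe, key=lambda c: float(np.mean(c["performance_estimates"])))
```

Returning None rather than the least-bad candidate is what lets the system say "I can't satisfy that" instead of silently violating a constraint, which is the transparency property discussed next.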
And I think that's really important, because having these algorithms be more transparent about their limitations and failures is really helpful. For all of those desired criteria that I put up for the Palo Alto school district, we probably can't satisfy them all, and the algorithm will say: we can't satisfy that. We don't yet tell them whether that's because we don't have enough data or because the constraints are fundamentally conflicting, but at least it starts to build a dialogue between the human and the AI about when and what is possible. [00:48:13] So the case we were really interested in, which we explored in our paper, was diabetes management. There's a really cool high-fidelity simulator that's been approved by the FDA to replace early-stage animal trials — so, to be clear, we're not recommending this as a clinical policy, this is only a tiny step towards that — but I think it's really nice to have these high-fidelity simulators as alternatives to video games for challenging benchmark problems. What we wanted to do is find new policies, from a single individual's data, that we are confident would be better in terms of a multi-objective constraint on hypoglycemia and hyperglycemia. So what we're doing, just to be clear, is this: we have an existing decision policy making insulin dosage recommendations — it's running and generating all this old data — and we apply our algorithm to that old data. We don't get to actively explore, and we only output something, we only change, when we're confident we've got something better. [00:49:19] And what this shows you here is whether you get undesirable behavior — we're looking at hypoglycemia — and you can see that if you use standard machine learning algorithms they often give you a lot of undesirable behavior at the beginning, because they're not being constrained, but our method never does, and even though we can't improve immediately, we only improve when it's safe. [00:49:40] So after around 100 to 250 days — a few months — we can say: yes, we think we can find a better policy for this individual. And the nice thing is that we can then get these better policies without having suffered bad expected returns early on, because we only output a policy when we're confident. [00:50:00] So I think this illustrates that, practically, with realistic amounts of data — of course you'd like it to happen in a week, but with a few months — you can do personalized insulin pump decisions, or at least a step towards that; there would need to be a ton of additional work, but we could start to make that possible in a way that is safe. [00:50:22] Now, for those of you who are familiar with these sorts of methods, you might have noticed I only did that for a finite set of policies. Generally you'd like to consider a continuous set of decision policies, and we've started to do that in the context of bandits; in particular, here we were curious about fairness constraints.
[00:50:39] So we're trying to think about constraining behavior in general, not just for safety, and here we want fairness across different subgroups, so that you don't end up with a decision policy that is systematically better for one group than another. We looked at this for a number of different datasets, like loan approval and criminal recidivism, and also a tutoring experiment. In ours, we made a bad tutor — a bad teacher — that systematically gave females worse tutoring decisions than males, and we did so in a way that, if you ignored these constraints, you would say, well, it does so much better for males that of course we should deploy it — but that wouldn't be fair. And the way we encode fairness here is to say: you can't output a policy that is much worse than your current default policy for one subgroup; you have to produce something that doesn't systematically sacrifice performance on one group in order to do better on another. [00:51:19] And again, you need more data than standard approaches in order to output something that satisfies this — on the order of a thousand samples; we ran these experiments on Amazon Mechanical Turk. [00:51:45] And these other approaches can still fail to be fair even after a large amount of data. What I want to highlight here is that it's not just that the other approaches need more data; they're simply not trying to enforce this fairness constraint, so even with lots of data they may still fail to be safe or fair. Now, the other thing you might wonder is whether you're sacrificing a lot of performance to get these constraints satisfied, and what you can see here is: not a lot. [00:52:13] At the beginning we might lose a little bit, but essentially we're in the same range as the other approaches in terms of actual average performance, so you often lose a little by having these constraints, but not significant amounts. So that's just to highlight that we're starting to have techniques — there's still an enormous number of open questions — but we're starting to have techniques that can mitigate against this undesirable behavior, so we can have things we feel confident about deploying, and also transparency about when that's not possible. [00:52:45] And so what I shared today is a little bit of our work on counterfactuals, efficiency, and complex objectives, and I'm really excited about the future potential of these hybrid systems where we can hopefully do much better together than either can alone. Thank you. [00:53:09] [Audience question, inaudible] So the question was: what other domains, outside of education and health, would be impactful for these approaches? I think it's certainly relevant to a lot of consumer marketing — whether you're making content or product recommendations, all these ideas tend to apply similarly. [00:53:31] I think there are other cases where simulators are very expensive — things in climate and elsewhere could also be important — where you care about sample efficiency and it's still really expensive to generate data. [Audience question, partly inaudible] So the question is saying that for things like safety, you may not be able to write down all your constraints in advance, and so what should you do in those cases. I think that's an important, really hard question.
If there are unknown unknowns — which we might call catastrophic failures — what should you do? One thing we've been thinking about for that is cases where you have a default safe policy, maybe one designed by experts, say for flying a plane or something like that, which you believe does satisfy those constraints. In those cases you could imagine falling back to that if you can detect failure, but in general, if you really don't have any sense of where the failures could occur, I think that's very challenging, outside of having fast detection and switching to a human. [00:54:51] [Audience question, inaudible] I think that's a great question. So on the educational side, learning is often a social process — my former colleague does a lot of work on that, thinking about how you have educational technology in a social setting. [00:55:15] In our work we haven't looked at that as much; I think it's a really interesting question. One place we've been thinking about it is: how do you decide when to involve teachers, what information to give to teachers so as to leverage their expertise, or even just other individuals — like, you're stuck, turn to Sally, you know. [00:55:31] So I think it's a really interesting area; we just haven't focused on it much. [Audience question, inaudible] Absolutely, and one of the things we found really interesting is that sometimes our systems learn not to use some subset of the pedagogical activities we initially gave them, and that gives us data that those weren't very effective, because the system optimized them out. So then I, as a teacher, know those ones weren't so effective — what are the features of those? — which means I can design better things in the future. [00:56:18] [Inaudible]