Good afternoon, everyone. Welcome to today's machine learning seminar. We have Alekh as our seminar speaker, and I'm very excited to host him today. Alekh has been a researcher at Microsoft Research for several years and is soon going to move to Google. He has been working on the theoretical foundations of machine learning, spanning many subareas, including large-scale and distributed optimization, ideas from statistics, online learning, and most recently reinforcement learning, which is what he'll talk to us about today. The focus of his work is on designing algorithms for which you can provide theoretical guarantees. His work at Microsoft has resulted in the creation of a new service that is part of Microsoft Azure. He has had several honors in his career; in particular, his work has been recognized with a best paper award and an ACM SIGAI Industry Impact Award, the second of which was for the work on Azure Personalization. So without further delay, the floor is yours.

Thanks, Shiva, for that kind introduction, and thanks very much for having me. We'll try to keep this as interactive as possible, which is a nice format, so if you have questions or clarifications during the talk, please post them in the Q&A or chat and we'll try to take them as frequently as we can.

So today I want to discuss some thoughts on building a theoretical framework for doing representation learning in reinforcement learning tasks. The title is packed with a few keywords, so let's unpack them one by one, starting with what reinforcement learning is about. We'll see a couple of examples.

Let's consider designing, say, a mobile application where we want to give activity recommendations to people when they're taking a short break from their day's work. Such an application might begin by perceiving the state of the user. This user state might consist of things like their physiological data, height, weight, age, et cetera, as well as more transient things like their current heart rate, what the last meal they had was and when they had it, and so on. Based on all of this information, we might recommend an activity like going for a short walk or run, or just doing a few push-ups next to the desk, and observe some outcomes like how many calories they burned or steps they walked. Of course, the overall goal here is to try and help the user of such an application meet their health targets, such as managing their weight or blood sugar levels, helping them sleep better at night, and so on.

So, some terminology that we'll use throughout the talk, in this example that I'll call the activity example. The information we perceive about the user we'll call a state. The activities we choose amongst are our actions. The feedback we receive is called a reward. And the long-term health goals of the user correspond to long-term rewards, or values.

Now, a very different scenario that we can nevertheless cast in similar terms is that of navigation from a first-person perspective. Let's imagine a robotic agent trying to navigate the sort of domain shown here. It's got some sort of head-mounted sensor, like a camera or lidar, and it receives images from that sensor at each time step, based on which it tries to select what it wants to do next: move in one of the four directions. Maybe it has a goal that it's trying to reach, and it gets a reward of one
if it reaches that goal. There is some small penalty for lingering around and a large penalty for falling into one of these lava pits and dying. So in words, the task we are trying to describe here is finding the shortest path to the goal without dying. Of course, this is an extremely over-simplified caricature of robotic navigation, but it conveys the idea. I'll call this the navigation example today.

So we can take these couple of examples and distill them into a more general reinforcement learning framework, which is a sequential and interactive learning process between an environment and a learning agent. Each interaction begins with the agent perceiving the state of its environment; we'll denote states by x. The agent has some actions to choose amongst, and it will pick one action, denoted by a, and then it observes a reward, denoted by r. Now, this is a learning agent, which means it's trying to learn something. Inside the learning agent is an object called a policy that maps the observed state to the action it chooses. We denote policies by pi, and you can think of a policy as some sort of neural network which does this mapping from states to actions. What the learning agent is trying to do is have many, many such interactions and, by observing the rewards, improve the parameters of this policy, such as the parameters of this neural network, to try and maximize the expected cumulative sum of rewards it gets over a long sequence of interactions. The number of interactions over which we try to optimize this reward is denoted by capital H, for horizon. It generally signifies how far ahead in the future the agent needs to think to judge the consequences of its actions properly. So if you're trying to manage diet over a 30-day period, the horizon is 30 if you make one decision every day. So that's the basic reinforcement learning paradigm (there's a short code sketch of this interaction loop a little further below).

Now, in the context of this reinforcement learning paradigm, what I want to talk to you about is a particular problem called representation learning. I'll start by motivating why we want to understand this representation learning problem in the context of RL. I'll then present a general framework for doing representation learning and instantiate a couple of examples of that framework; time permitting, we'll be able to go over both. Okay, so are there any questions about the basic reinforcement learning setup so far, or shall we go into representation learning? No questions, so we can go ahead.

Okay. So why are we interested in doing representation learning? One thing that is salient in a lot of domains where we would like to do reinforcement learning, such as robotics, conversational agents, intelligent tutoring systems, and the resource allocation type problems people are now trying to solve with reinforcement learning, is that when the agent receives the state of its environment, that state is often some complex sensory input. It might consist of signals that are audio, video, images, text, or complex structured data streams of other sorts, and the agent has to figure out how to make sense of these complex inputs.
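To make the interaction protocol described above concrete, here is a minimal sketch of the agent-environment loop. The `env` and `policy` objects and their methods are illustrative placeholders assumed for this sketch, not part of the talk:

```python
# Minimal sketch of one episode of the agent-environment interaction loop.
# `env` and `policy` are placeholders: any environment exposing reset()/step()
# and any mapping from (state, step) to an action would fit this shape.

def run_episode(env, policy, H):
    """Run one episode of horizon H and return the cumulative reward."""
    x = env.reset()               # agent perceives the initial state x
    total_reward = 0.0
    for h in range(H):            # H = horizon of the interaction
        a = policy(x, h)          # the policy pi maps the observed state to an action a
        x, r = env.step(a)        # environment emits the next state x and a reward r
        total_reward += r
    return total_reward           # the learner tunes pi to maximize this in expectation
```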
And while there are at least fifty years, if not more, at this point, of theory around reinforcement learning and its algorithms, working with such complex inputs remains a challenging task for most reinforcement learning algorithms, especially the ones that come with strong theoretical guarantees on their behavior. As a person who likes to have theoretically grounded algorithms, one of my personal research goals is to develop sample-efficient reinforcement learning methods that can actually scale to such environments, because having good reinforcement learning agents in these domains would be truly transformative in what we can do. But this complexity of the input space presents a real challenge.

Now, one observation we've made, which potentially offers a way to tackle this complex input space, is that while the inputs the agent receives are complex, there is often some underlying latent state that is significantly simpler. Of course, this latent state is not exposed to us in explicit form, but it's there. For instance, in robotics tasks the agent might perceive its environment only through its sensors, but there might be a simple latent state such as its position, configuration, and goal. Similarly, in a conversational system, while what we observe is the transcript of the dialogue, the state is really some underlying need in the user's head which is driving that conversation, and that is often much simpler, or has far fewer degrees of variation, than the potential different transcripts the agent might see. Same thing with intelligent tutoring systems, where the student's knowledge provides that latent state; in resource allocation problems, it's the state of the underlying system we're working with. So the existence of such a latent state is fairly ubiquitous in many application scenarios. And somehow what we would like to do is allow the RL agent to work with these latent states, because they are potentially much simpler. But of course they're latent, and we don't have them.

So this observation starts to suggest a very natural paradigm for reinforcement learning from these complex inputs, where we create a modified pipeline for reinforcement learning. We start from some rich observation of the environment, such as an image. We plug this input into some black box, which I'll call the representation learning black box for now. What this black box would do, in an ideal world, is spit out the latent state. And then the RL agent will simply forget the image, just take the latent state, and proceed as if that were the state of the environment. And now maybe we can invoke some of the standard reinforcement learning algorithms that work for simple state spaces.

Some of the benefits of doing this are self-evident: it results in a simpler reinforcement learning problem whenever the latent state is simpler. A more significant benefit, which is maybe less obvious, is that sometimes we'll have similar domains in which we are trying to do a variety of tasks. In such scenarios, this representation learning black box can be shared across all those different tasks. For instance, even for a single robotic arm, there are many different tasks people want to use that same arm for in the same environment, and they can all share the same representation learning. And if you start thinking at an even higher level,
of course, there are many tasks that take place within the same laws of physics. In computer vision and NLP, the use of representations shared across many, many different tasks has of course been transformative, and we want to bring some of these benefits to reinforcement learning as well.

While this sounds promising, it does come with a challenge. When we are trying to do representation learning for reinforcement learning, we have this intended behavior that we would like the representation learning black box to have. But how do we make sure that this behavior is indeed what the system learns? How do we drive the system to uncover these latent states? One of the reasons this turns out to be extra hard in reinforcement learning, compared to supervised learning, is that in reinforcement learning you don't begin with some large dataset. You've got an agent, the agent gets dropped into an environment, and the agent has to act: it has to explore the environment and learn from its exploration and observations. Now, one thing we can do fairly easily is this: if somebody gives us a very rich dataset that in some sense consists of all possible kinds of observations the agent can get, then using those observations to get to these latent states is something we have reasonable algorithmic ideas for. On the other hand, nobody gives us such a representative dataset to begin with. What we also know how to do is this: if somebody gives us the mapping from observations to latent states, we know how to collect a very representative dataset using existing exploration algorithms. So we have a bit of a catch-22. Given diverse data, we know how to get to good representations. Given good representations, we know how to collect diverse data. But we need to do both simultaneously, and we start with neither. That intertwined coupling of exploration and representation learning presents one of the unique challenges we face when trying to do representation learning for RL in particular, a challenge that doesn't manifest in many of the use cases we see in vision, NLP, and other domains. And overcoming this challenge is one of the main things we'll see how to do in today's talk. Okay, any questions so far? No? Great.

In order to better answer these questions and study these challenges, we'll now introduce a mathematical model in which we can ground the problem of representation learning, give a meaning to what these latent states are, and reason about how to solve the representation learning problem. The model we'll use is what's called a low-rank MDP. MDP here stands for Markov decision process; if you're not familiar with what a Markov decision process is, don't worry, it's not particularly important right now.

So usually what happens in reinforcement learning is that when the agent is in a state x and chooses an action a, this action is going to update the state of the environment. If you're standing in one place and you decide to go forward, your state has changed because your position in the environment has changed, and the view you see around you has changed a little bit. And in many problems of interest, the update that happens to the state is not just some deterministic function; there is a distribution over the next states in which the agent might find itself. For instance, when we try to move forward,
we don't move forward the same distance with every step. There is some stochasticity in the step length, and that induces stochasticity in the positions we will take. If we're thinking about a conversational system, the stochasticity might be much larger, because the exact set of utterances that follow after the conversational agent says something, or the exact impact it has on the mental state of the user, is much more variable. So in general, given a state and action, there's a distribution over the next states the agent might perceive at the next time step, and this is one of the central objects we worry about in reinforcement learning.

Now, let's think of the number of potential inputs, the number of possible x's the agent can ever receive, as being finite for now, but extremely large. If you think of this next-state distribution as a matrix, its dimensions are the number of states times actions on one axis and the number of states on the other. That's an extremely large matrix. If we had to learn this matrix in its entirety, which many reinforcement learning algorithms aim to do, that would require a lot of data, because it has a lot of different entries. Without any nice structure on this matrix, it seems hopeless to try and learn it. So what low-rank MDPs do is assume a particular low-rank structure on this matrix. The factorization looks like the following: we assume that this matrix decomposes as the inner product between some features phi* of the state and action that we started in, and some other features mu* of the next state. So the conditional probability P(x' | x, a) is equal to the inner product between a vector phi*(x, a) and another vector mu*(x'). The shared dimension d of these two vectors is assumed to be much smaller than the number of unique states, so the number of parameters here is potentially much smaller. And one reason this is a natural model to look to for representation learning is that it immediately gives us a natural way to think about having useful features on states and actions, through this mapping phi*. We'll see in a moment why this carries the same latent-state meaning we were looking for.

One way to think intuitively about this low-rank MDP setup is the following. Imagine a very simple probabilistic model which looks a bit like an HMM. You see some states x, but really there is some discrete latent state, which takes a small number of values, underlying these observed states x. What happens when you take the action a in state x is that you first go to one of the values of that latent state. Now, suppose you land in this particular latent state. There is some emission distribution associated with each different value of the latent state, and the actual state the agent observes is emitted from that emission distribution. So in this simple model, phi* denotes the distribution over the discrete latent states that the environment will go to next, and mu* just denotes the emission probability of the observation x' from each of the different latent state values.
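Written out, the low-rank factorization and the latent-state special case just described look roughly as follows (d is the shared feature dimension and z ranges over the latent-state values):

```latex
% Low-rank MDP: the transition operator factorizes through d-dimensional features
\[
P(x' \mid x, a) \;=\; \big\langle \phi^*(x,a),\, \mu^*(x') \big\rangle
               \;=\; \sum_{i=1}^{d} \phi^*_i(x,a)\,\mu^*_i(x'),
\qquad d \ll \text{number of states}.
\]

% HMM-like special case: \phi^*(x,a) is the distribution over the d latent values z,
% and \mu^*_z(\cdot) is the emission distribution associated with latent value z
\[
P(x' \mid x, a) \;=\; \sum_{z=1}^{d}
      \underbrace{\Pr(z \mid x, a)}_{\phi^*_z(x,a)}\;
      \underbrace{\Pr(x' \mid z)}_{\mu^*_z(x')} .
\]
```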
So naturally, the dimension of each vector here is just the number of different latent states. Since HMMs, for instance, are so ubiquitous in many learning tasks, this gives you some sense of the types of things low-rank MDPs capture: they allow us to bring HMM-like models, or topic-model-like things, into reinforcement learning settings.

And there are two reasons we think of this phi* as the representation we are after. One is that, as we just saw, phi* exactly gives a mapping from a state and action to the distribution over the next latent state. The other is that it's well known that if somebody gave phi* to us, if the phi* function were known, then reinforcement learning becomes simple: we know algorithms that are computationally efficient and sample efficient, and that actually work reasonably well, provided this modeling assumption is true and phi* has been given to us. So we know that if we had access to this representation, reinforcement learning is doable. But of course we don't have access to that representation, and hence we need to do some representation learning. Okay, is that clear to people so far?

So given this model, we can now state our representation learning goals a bit more formally. Since we don't have access to phi*, what do we do? We'll adopt the usual learning-theoretic approach and say that somebody has given us a class, capital Phi, of candidate functions that look like phi*, mapping states and actions to d-dimensional vectors. And we're going to find our approximation to phi*, denoted phi-hat, by searching over this set. In some cases, we also assume that we have a set of candidate mappings, Upsilon, for mu*, and we want to search over that class to also find mu-hat. So we basically want to find approximations to phi* and mu* by searching over the respective classes. And as before, we have this intertwined challenge between exploration and representation learning: we need to figure out how to collect the right data that enables representation learning, and, once we collect some useful data, how to approximate phi* and mu* given that data. That's just a more mathematical formulation of the challenge we described earlier.

There is a shared framework in which we can design the algorithms I plan to talk about today for doing this representation learning, and it's actually a very simple framework: it's sort of a forward algorithm. We have this MDP in which we interact with the environment capital-H times. What we'll do is first try to collect useful, diverse data over all possible states we can encounter at level 1. Then we try to learn a good representation for these states, use this representation to somehow get to diverse states at level 2, use that data to learn a good representation at level 2, then use the learned representation to get to diverse states at level 3 and learn a good representation at level 3, and so on. So it's a forward procedure, going level by level into the MDP and doing representation learning.

So let's start with the first approach to operationalize this intuition, which is an algorithm called FLAMBE that we presented at NeurIPS last year; it stands for Feature Learning And Model-Based Exploration.
This is going to be a model-based algorithm in which we try to learn approximations to both phi* and mu*. And the basic representation learning approach is actually going to be very simple: it's just going to do some maximum likelihood inference. What I mean by that is the following. We have this forward algorithm, so let's say we are currently collecting diverse data at some level h: after taking h minus 1 actions, we are encountering some diverse states at level h, and we want to find a good representation for these states, that is, approximations phi-hat and mu-hat at this level. What we know is that if we had access to phi* and mu*, they would capture this conditional distribution over next states given a state and action. So in order to search for their approximations, we try to find candidate phi-hat and mu-hat that similarly capture this probability, in the following way. Say we are reaching diverse states at some level h. We then take another action uniformly at random, and also collect the state we encounter at level h plus 1. So now you've got state, action, next-state triples, where the next state at level h plus 1 is being sampled from this conditional distribution. Now we try to find a pair phi-hat, mu-hat that maximizes the log-likelihood of these observed samples. So we take these (x_h, a_h, x_{h+1}) triples, feed them into a maximum likelihood inference, and define phi-hat, mu-hat to be the argmax of the log-likelihood over our respective Phi and Upsilon classes. This gives us an approximate representation at this level h.

The next thing we do is the following. Phi-hat and mu-hat are essentially describing an MDP of their own, where instead of generating next states according to the true distribution, you generate them according to this estimated distribution P-hat. We can treat this as an MDP and run planning algorithms within it to try and collect diverse states at level h plus 1, which allows us to recurse into the next step of our algorithm. And for this planning step needed to reach diverse states at level h plus 1, it turns out there already are algorithms in the literature. Because now we are acting in a known MDP, this estimated MDP P-hat: we know the features of states and actions here, and we know mu-hat, so this is now a fully known MDP. There are algorithms for this that are computationally efficient, and of course there's no sampling involved here, because the dynamics are known. So we can just use these existing algorithms to find a policy for reaching diverse states at the next step. Once you have this updated policy, you collect some data with it in the real world to reach diverse states at level h plus 1, again take a uniformly random action, collect the observed state at level h plus 2, and use those triples to learn a good representation at level h plus 1, and so on. So that's the algorithm.
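Here is a schematic sketch of that forward procedure as just described. The helper names (`collect_triples`, `mle_fit`, `plan_exploration_policy`) are illustrative placeholders for the steps of the algorithm, not an actual implementation from the paper:

```python
# Schematic sketch of the forward, FLAMBE-style procedure described above.
# All helpers are hypothetical placeholders standing in for the three steps:
# data collection at the frontier, maximum likelihood fitting, and planning
# inside the estimated MDP.

def forward_representation_learning(env, Phi, Upsilon, H, n_samples):
    policies = []   # exploratory policies, one per level
    reps = []       # learned (phi_hat, mu_hat) pairs, one per level
    for h in range(H):
        # 1. Roll in with the exploratory policies for the first h steps, take a
        #    uniformly random action at level h, and record (x_h, a_h, x_{h+1}) triples.
        triples = collect_triples(env, policies, level=h, n=n_samples)

        # 2. Maximum likelihood over the candidate classes Phi x Upsilon:
        #    maximize the sum over triples of log <phi(x_h, a_h), mu(x_{h+1})>.
        phi_hat, mu_hat = mle_fit(triples, Phi, Upsilon)
        reps.append((phi_hat, mu_hat))

        # 3. Plan inside the *estimated* MDP P_hat(x'|x,a) = <phi_hat(x,a), mu_hat(x')>,
        #    whose dynamics are fully known, to get a policy that reaches diverse
        #    states at level h+1; no real-world samples are needed for this step.
        policies.append(plan_exploration_policy(reps, level=h))
    return reps, policies
```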
What can we say about this algorithm? Can I ask a couple of quick questions first? Yeah, please. I assume you're looking at a finite time horizon problem, right? That's correct. And the h here, which is the level in the recursion: are you assuming that P is changing with respect to h, or is it the same at all levels? We can deal with either, though the mappings we learn are actually different at every step; that's just for convenience. Okay, got it. And I'm a little confused about how you are doing the sampling: are you assuming you have access to a simulator, or are you working with trajectories? Trajectories. So, basically, remember that the data we collect is in the real world, by acting according to the policies we derive in the planning step. In the planning step we are essentially trying to find policies that lead to high diversity of states in terms of the learned features, and then we execute these policies in the real world and collect data according to them. Of course, the only part of the data we look at is the state, action, next-state triples at the frontier of our learning process, so the prefix of the trajectory is not really used. This is the simplest way of describing the algorithm; there is a more complex version that looks at the entire trajectories and does learning not just at one level but wherever learning is possible, but it's a bit more confusing to describe. Okay, and one more thing: you use the policy up to the frontier level, and at the frontier you pick a uniform action, is that correct? Yes. Thanks for asking these questions.

Now, one thing that at first glance should not be very obvious is the following. FLAMBE is trying to find features that allow us to do maximum likelihood, that is, it's trying to approximate the conditional distribution over next states given the previous state and action. Why is this going to result in good features? Why is it going to give us good approximations to the features phi*? And in particular, when we do planning with respect to the learned features phi-hat to try and find diverse states in terms of those phi-hat features, why is that ever going to be sufficient to find diverse states in terms of the phi* features, which is what we actually need in the proofs? So the correctness of this algorithm is rather non-obvious when you look at it the first time. Nevertheless, it does come with a proof of correctness. What we can show is that, under the assumption that phi* and mu* do lie in the respective classes we've formed for those mappings, FLAMBE will estimate the transition model, the operator P, to an accuracy epsilon in total variation distance, with high probability, using a number of samples that is polynomial in all the relevant parameters: the dimension, the horizon, the number of actions, one over epsilon, the log cardinality of the function classes, and so on. Everything runs in polynomial time as long as you can solve this maximum likelihood inference problem in polynomial time; if the chosen classes are such that you cannot do MLE in polynomial time, then of course we can't help that. So it's an oracle-efficient approach.
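Schematically, the guarantee just stated has the following shape; the exact polynomial and the distribution under which the error is measured are as in the paper, and this is only meant to convey the form:

```latex
% Assuming realizability (\phi^* \in \Phi and \mu^* \in \Upsilon), with probability
% at least 1 - \delta, FLAMBE returns \hat{P}(x' \mid x,a) = \langle \hat\phi(x,a), \hat\mu(x') \rangle
% that is \epsilon-accurate in total variation,
\[
\big\| \hat{P}_h(\cdot \mid x, a) - P_h(\cdot \mid x, a) \big\|_{\mathrm{TV}} \;\le\; \epsilon ,
\]
% using a number of episodes that is
\[
\mathrm{poly}\!\big( d,\; H,\; |\mathcal{A}|,\; 1/\epsilon,\; \log|\Phi|,\; \log|\Upsilon|,\; \log(1/\delta) \big).
\]
```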
The reason I just say "polynomial" here is that while we do have concrete sample complexity bounds, it's pretty clear they are far from optimal, far from being sharp for this algorithm, and there's a lot of room for improvement on that side. We also know the guarantees can definitely be improved under stronger assumptions, and the paper talks about these things.

But this guarantee by itself still doesn't tell us what we really wanted to know. It tells us something about estimating transition probabilities, which shouldn't be surprising, because we are trying to estimate transition probabilities. But what does it have to do with learning good features, which is the task we set out to do? The fun thing that happens is that good representation learning in this model actually falls out as a corollary of estimating the transition model. What you can show is that if we take the learned features phi-hat, then things like the optimal policy, the Q* function (the Q-value function of the optimal policy), and in fact the Q function of any policy the agent might choose to use, can all be expressed as approximately linear functions of the learned features. So the learned features actually provide a useful representation in which good policies and good value functions can be expressed as simple objects. The striking part is that FLAMBE doesn't actually look at the reward signal at all during the learning process, so this linearity claim is based purely on FLAMBE learning the transition dynamics, and it's going to be true for any reward function you decide to use to measure the agent's progress.

This, by the way, is actually quite powerful, because of how reinforcement learning gets used in practice. If we are using some standard simulator for evaluating a new algorithm, then the reward function comes with the simulator. But when you apply reinforcement learning to an actual real-world problem, nobody tells you what the reward function is. You have some vague ideas of what you want to do in your application, and there's effectively a reward engineering phase where you try different choices of reward function and see what kind of behavior each one produces. It's very useful that you can learn the representation just once and use it to learn good policies for many, many different reward functions; it would be rather onerous to redo the whole process every time you change your reward function. What's even more powerful is that the data FLAMBE collects during its execution is really quite rich: in fact, learning good policies for any reward function downstream can be done without collecting any further experience. You can just take the dataset FLAMBE has collected and run standard algorithms like fitted Q-iteration or fitted policy iteration on that dataset, without any further real-world interaction. So there's a nice multitask-learning type of benefit that FLAMBE gives.

One final thing I want to say about FLAMBE is to give you a little bit of a flavor of why it works, because that is not necessarily obvious. Remember, the policies FLAMBE collects data with are learned in the estimated MDP, with transition dynamics defined by P-hat. We know that this is indeed an MDP with a low-rank factorization of the dynamics, and one in which all the features are known. So we know that when we do planning in this MDP, using one of the theoretically correct planning algorithms for such MDPs,
the resulting policy will indeed collect diverse states, at least in this estimated MDP. Now, when we execute this policy in the real world, one of two things can happen. Either the policy also does well in the real-world setting, or it visits some state and action where our estimated MDP is a poor approximation to the real world; that's the only situation in which there can be a divergence between what happens under P-hat and in the true world. And in the second case, by visiting those states and actions, we collect useful data for improving our approximation P-hat, because we actually observe state, action, next-state triples there, add that data to our maximum likelihood inference, and our approximation improves. The only situation that can be bad for the algorithm is if we keep falling into this second bucket over and over again for a very long time. And you can construct a potential function argument, leveraging the low-rank properties of the real MDP as well, which ensures that the second case cannot happen very often. That's the part that requires some work, but it can be done using reasonably well-understood arguments.

So that's all I'm going to say about FLAMBE. If you have any questions about FLAMBE, this is a good point to stop and take them. There's one question from the audience about the exact dependence on one over epsilon; you had a poly(1/epsilon) in your main theorem. Yeah, I actually don't remember exactly, because it changes based on the exact set of assumptions we're making. In the most general case, I think it's something like one over epsilon to the tenth or eleventh. It can be significantly improved; I think it can be brought down to something like one over epsilon to the fifth if we make much stronger assumptions, but that's probably still far from optimal. But again, it's not that surprising that we have slack in these results, because even for simpler special cases of this fairly general model I'm describing today, the best known sample complexity bounds are quite loose. There are special cases of the setup, things called block MDPs, which people have studied for longer, including in some of my own work, and even there getting sharp rates remains a fairly important open challenge. So there's definitely a lot to be improved on that front. And we know that information-theoretically none of this is a barrier: there is an algorithm from 2017 called OLIVE which can solve these problems at a rate of one over epsilon squared, but OLIVE is hopelessly computationally inefficient. So what's happening is that we're trading off statistical efficiency for computational efficiency, and this is the best trade-off we know of so far in this expressive model. But we hope there are better trade-off points; ideally, we can get all the way to information-theoretic efficiency together with computational efficiency.

I have a quick question. Yeah? You mentioned that if phi* and mu* are known, prior work tells you how to find exploratory policies? Just phi*, sorry, you just need phi*. Just phi*, okay? Yeah, so
basically this is the linear MDP result of Chi Jin, or of Lin Yang and Mengdi Wang, or some of our own work from NeurIPS 2020 on PC-PG. The idea is that once you know the features phi*, you can give yourself an exploration bonus to go toward under-explored directions in the state space. You just add that bonus to the reward function, and that works: you basically use UCB value-iteration type methods with that bonus function, or UCB policy-optimization type methods. Correct, thanks.

Okay. So of course FLAMBE leaves a lot to improve in terms of the sample complexity guarantees. But perhaps we can step up a level and ask whether there are other places where we might want something different. FLAMBE is a model-based algorithm: it tries to actually approximate the transition dynamics by learning approximations to both phi and mu. But as I just said in my response to the last question, what we really need in order to do planning is access to the features phi. So you could say that learning mu is superfluous, and maybe we should try to remove the requirement of learning mu. Because if I don't need to approximate mu, maybe I can tackle tasks that are even richer, where the mapping mu is hopelessly complex, cannot be captured in some nice function class, and I cannot hope to learn it, but phi is still nice, so I can hope to learn it. In reinforcement learning parlance, this basically rules out model-based approaches that try to learn the transition dynamics and forces us to think model-free. There's the usual debate about whether we should be model-based or model-free for a variety of reasons; in the interest of time I'm not going to go into it today, but there are strong arguments for both. Nevertheless, it's not so obvious whether we can do something purely model-free here at all. We wanted to understand this question, and that brings us to the goal of model-free representation learning.

Sorry, this is a bit of a busy slide; I switched laptops recently and lost my animations, otherwise this would look much nicer. One of the first things we have to answer is this: when we were being model-based, we had a natural notion of what a good representation constitutes, namely the ability to approximate the transition dynamics. In a model-free setting, where we are not trying to learn the transition dynamics, we cannot go after that goal, so we have to form a surrogate measure of having a good representation. And maybe we already got a hint of what this surrogate definition might be when we were talking about the corollaries of estimating the transition dynamics at the end of the last part. We know that if somebody gives us phi*, then the optimal policy is a linear function using phi*: the Q function of the optimal policy, the Q* function, is a linear function, and in fact the value function of any policy for any reward can be written as a linear function under phi*. This is a well-known result from the earlier papers on linear MDPs. So what we can ask of an approximation phi-hat is to satisfy the same contract: for any reward function and any policy, I want the value function, the expected reward the policy gets, to be expressible as a linear function of my approximate features phi-hat.
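The structural fact behind this contract, which the next part of the talk asks phi-hat to mimic, is that under the low-rank factorization the conditional expectation of any function of the next state is exactly linear in phi*. Roughly (w_f here is just a name for the induced weight vector):

```latex
% For any f : \mathcal{X} \to \mathbb{R}, the factorization gives
\[
\mathbb{E}\big[ f(x_{h+1}) \,\big|\, x_h = x,\, a_h = a \big]
   \;=\; \int f(x')\,\big\langle \phi^*(x,a),\, \mu^*(x') \big\rangle \, dx'
   \;=\; \big\langle \phi^*(x,a),\, w_f \big\rangle ,
\qquad w_f \;:=\; \int f(x')\, \mu^*(x')\, dx' .
\]
% Taking f to be a value function at the next level is what makes value functions
% (up to the immediate-reward term) linear in \phi^*.
```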
There, of course, the weight vector w will change with every policy and reward function; it's a function of both, and that's why that dependence is essential here. So we want to find features phi-hat that satisfy this contract. In fact, we look for something even a little bit stronger: we look for features for which the conditional expectation of any function f of the next state, given the previous state and action, can be written as a linear function, because this property holds for phi*. So we go after this guarantee for all functions chosen from some class, and we note that there is no mu involved anywhere in the statement of this guarantee, so this is something we can at least hope to do without ever involving the mapping mu in the process.

Now, these days people use adversarial machine learning a lot, and thinking in those terms suggests at least one natural approach to finding a mapping phi satisfying this guarantee. We can think of a learner working with representations phi and an adversary working with discriminators f, like a GAN-type setup, where the learner's goal is to realize the function f using linear functions of the previous state-action features, and the adversary's job is to try to maximize the discrepancy. So we can try to solve this saddle-point optimization, where we try to find a mapping that keeps this loss small for all possible adversary functions from some class. This saddle point has an inner minimization as well, because we have the min over w as part of the loss, so it's not just the usual min-max but a min-max-min. Unfortunately, this by itself doesn't work, for somewhat complicated technical reasons. For people who've done reinforcement learning before, I'll just say it's the standard double-sampling type issue: the amount of variance the loss has depends on the function f, and that creates statistical problems when trying to work with this objective directly. There's also a well-known fix, which removes this residual variance by subtracting it off. So there's a much nastier-looking optimization problem that addresses the double-sampling issue, where for every f we first ask: given our class Phi and a set of weights, what is the best possible value of this loss we can get for that f? We subtract that off, and then find the phi that optimizes the worst-case residual. After this residual variance has been removed, it turns out this objective is actually a reasonable one for doing model-free representation learning.

Sorry, I'm going a little faster in this part in the interest of time, but we can operationalize this objective in the following manner. We use an incremental algorithm in which we find discriminators incrementally. At time t, we've got some set of discriminators, and we find a feature map that works for all of them, just by adding up their loss functions. Once we have this feature map, the discriminator player finds a new function f_{t+1} to defeat this mapping phi-hat by inducing a large residual loss on it: it tries to find a function f for which the best achievable value of this loss over the class is small, but the loss of the current phi-hat is large. And we repeat this process over and over again.
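Schematically, the double-sampling-corrected objective just described looks as follows; norm constraints on the weights and the data distributions are omitted, and this is only the shape of the objective:

```latex
% Min-max-min objective over representations \phi \in \Phi and discriminators f \in \mathcal{F},
% with expectations over the collected (x, a, x') triples. The subtracted term is the best
% value of the loss achievable for that f over the whole class, which removes the
% double-sampling variance.
\[
\hat\phi \;=\; \arg\min_{\phi \in \Phi}\; \max_{f \in \mathcal{F}}\;
   \Big\{ \min_{w}\; \mathbb{E}\big[ \big( \langle \phi(x,a), w \rangle - f(x') \big)^2 \big]
    \;-\; \min_{\tilde\phi \in \Phi,\, \tilde w}\; \mathbb{E}\big[ \big( \langle \tilde\phi(x,a), \tilde w \rangle - f(x') \big)^2 \big] \Big\} .
\]
% Incremental version: at round t, fit \hat\phi_t against f_1, \dots, f_t by summing their
% losses, then let the discriminator pick f_{t+1} with a large residual against \hat\phi_t.
```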
What we can show is that this process provably converges in poly(d) iterations, where d is the dimension of the feature map, and that it learns to approximate all discriminators in our discriminator class. It has guarantees similar to FLAMBE, except it needs an additional assumption about reachability that I'm not going to talk about. The other thing I want to highlight is that this algorithm is computationally much harder than FLAMBE, because we now have to solve these nested max-min problems, which we don't know how to solve in general with provable computational efficiency. We have one special case, discussed in the paper, where we can do this, but in general this is a much harder problem where we don't always have polynomial-time computational efficiency.

Okay, so I'm going to wrap up to leave some time for questions and discussion. In addition to presenting these results, the goal of this talk was also to introduce a broader philosophy for thinking about reinforcement learning tasks where the agent's inputs consist of rich sensory observations. The way we are trying to tackle them is by decoupling the learning of useful representations from the actual reinforcement learning as much as possible. This has a number of benefits we already talked about, and it results in some of the very few approaches we know of that can handle non-linear function approximation and exploration in reinforcement learning in a provable manner. Of course, operationalizing this philosophy in concrete scenarios is fraught with a number of challenges to do with finding the right supervision signal and dealing with distribution shift and double-sampling type issues, but we've got some promising starts now, and some results on which to build.

There's of course a lot that remains to be done here. I hinted a little at the potential to generalize greatly across related tasks, which naturally presents itself as a benefit of representation learning; so far we've only looked at very simple models of generalization, where we deal with variations in reward with the dynamics fixed, so what more can we capture? There is of course room to do a lot better on both the computational and the sample efficiency side. One of the questions we are looking into is when the agent does not necessarily want to learn a good representation for the entire input, but only for part of it, essentially because a lot of the input consists of things that are outside the agent's control. If you think about the world you are immersed in, you control just a small slice of it; all of us would maybe like to control more, but we don't. So we often don't benefit from trying to learn detailed representations of the parts of the environment we don't control, and exploiting this can lead to serious sample complexity improvements. And finally, we've been so busy understanding these things in theory that we've not had a chance to thoroughly investigate their empirical properties, so experiments are on the to-do list as well.
If you want to learn more, here are pointers to the papers I talked about today, and I want to conclude by thanking my collaborators who were involved in these works. With that, I'm happy to answer any further questions. First of all, let's thank the speaker. There are a couple of questions from the audience, and both are actually related. One question is about how this can be extended to the infinite horizon setting; the other is about what can be done when the state and action spaces are infinite. Okay. So for the infinite horizon setting: FLAMBE, I think, is reasonably easy to extend. When an algorithm has this forward nature, where it has to finish learning at one step before it can go and explore the next step, going to the infinite horizon is less easy; but the most general version of FLAMBE in the paper doesn't have this forward structure, and that algorithm is actually quite straightforward to extend. For the model-free approach, it's so far a little less clear what the right extension to the infinite horizon would be; we would basically have to rethink the algorithm a little bit, and that's a good question to think about. Sorry, I forgot the second question, I didn't quite follow it. Right, infinite state and action spaces: for infinite states, we are already dealing with that. The number of different x's doesn't have to be finite here; we are just assuming they can be embedded into a finite-dimensional space. The number of actions is actually interesting, in that we do require finitely many actions. There are lower bounds, I think from some folks who think about representation learning and give negative results, which I believe are exponential in the dimension. The reason those lower bounds don't contradict our results is that they have to go to action spaces that are exponentially large in the dimension. So we know that, without additional structure on the action space, infinitely large action spaces are not something we can tackle.

I have a question, please. So everything we've talked about is for low-rank MDPs, where there really is this phi* structure. If you want to go beyond this, what can one do for general MDPs? I guess there are fundamental limits, because if the MDP itself has a very complicated structure, then it would be hard to learn. But one can ask: suppose I want to work with d-dimensional feature vectors, what is the best possible thing I can do with d dimensions? That's, I think, a hard question to pose well. I'm not sure; no, not without some other assumption on the MDP. I'm pretty sure there will be hardness results for the question you're asking, if you don't make any assumption on the MDP but just want to find the best d-dimensional approximation. A related question that people have been trying to answer for a while, and for which we now have some hardness results in the infinite-action case, is this: if you give me an arbitrary mapping phi*, and you tell me to find the best Q* approximation with those given features, so you don't even have to find the features, somebody gives them to you and you just have to find the best Q* approximation with them, then this problem is already hard.
It has an exponential sample complexity. I see, I see. So this is one of the main challenges in reinforcement learning: just making assumptions on the function approximation side seems insufficient. We have to make structural assumptions on the task itself, on the dynamics or the rewards. Yeah, thanks. I don't see any other questions from the audience, so I guess we can end now. Let's thank the speaker again. Thank you. Thanks for attending.