Thank you, thank you very much for having me, for inviting me here — it's really nice — and thanks so much for this nice introduction. Let me first say that this is going to be a talk on joint work with two of my collaborators (one of them Andreas Elsener). And it will be on sharp oracle inequalities for non-convex loss functions. So the idea is that you do empirical risk minimization — you minimize some function which is not convex — and I want to study the behavior of stationary points. But all of this also works if you forget about the non-convexity and just look at global minimizers, and so on. So the stress is a little on non-convex losses; then I'll talk about oracle inequalities and explain what I mean by that, and sharp oracle inequalities and explain what I mean by that, and it will be in the high-dimensional situation. OK, so that's the aim: theoretical properties of global minimizers of losses which are not very smooth, or of smooth loss functions which are possibly not convex, and then looking at stationary points. The idea of the lecture is to develop a new unifying theory, so that you can handle all kinds of cases in one shot. So these are theoretical results — it is not about developing new methodology — combining some convex and non-convex optimization with statistical and probabilistic theory.

So let me sketch the set-up. First of all there are data — we are statisticians, so there are data: n observations X_1, ..., X_n. In general they can lie in arbitrary spaces, but usually they are just Euclidean spaces; so there are n observations. And then you want to estimate some parameter, which is some aspect of the distribution of these data. The parameter space is p-dimensional and I call it B; it is some given, fixed subset of a high-dimensional space, so p is very large.

Now, what is the target? The parameter of interest is defined as the minimizer of some function, the theoretical risk function. So we have this risk function, and the important point is that it is not known — if it were known, this would just be an optimization problem — but it is an unknown function and you want to find its minimizer. I call that minimizer the target, because it is a special point: it is sitting here, and I want to find this minimum. But I do not observe the theoretical risk, the red function; instead I have my data and I have some estimate of that risk function, the black function. And because you can observe it, it is called the empirical risk. So instead of doing the minimization here, in the red curve, I do it here, in the domain where I can observe something, and hopefully the two functions are close, so that their minimizers are close as well. But one of the points of this talk will be that this function — the one you can observe, your estimator of the risk — can be non-convex; it is going to be wiggly, so the picture is not completely correct. The main idea is still that you want to find its minimum, but maybe that is not so easy, because it is not convex. So we want to see how close this is to that. So far so good.

So this is the set-up. I did not really write down how the empirical loss depends on the observations, but anyway: there are n observations, and there are p parameters.
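To fix the notation, let me write this set-up out in symbols; this is my own summary rather than a slide, and I write R for the theoretical risk and R-hat for the empirical risk:

\[
\beta^0 \;:=\; \arg\min_{\beta \in \mathcal{B}} R(\beta), \qquad \mathcal{B} \subseteq \mathbb{R}^p \ \text{given}, \qquad R \ \text{unknown},
\]
\[
\hat\beta \ \text{obtained from the observable } \hat R \ \text{(by minimizing, or finding a stationary point)}, \qquad \text{typically } R(\beta) = \mathbb{E}\,\hat R(\beta).
\]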
And p is much larger than n — that is the high-dimensional case. OK. And then you have to assume something; the typical assumption is sparsity. You do not really assume it — it is more something you believe; it is not something you impose, but you hope that the underlying parameter you are trying to estimate is sparse: there is a lot of noise and nothing going on, and only a few non-zero coefficients. So we say the target is sparse, or approximately sparse: approximately equal to some sparse vector, and sparse means that there are only a few non-zero coefficients. S is the support, the active set, and its cardinality is s — s stands for sparsity — and sparse means that s is somehow small. Or it can be that many of the coefficients are almost zero; let us not worry too much about that — you can also have some kind of weak sparsity.

OK, then we consider the situation where the theoretical risk function is convex — the unknown risk function is convex, but we do not know it — while the empirical loss function, the one with the hat, is observed but is possibly not differentiable, or possibly not convex, and may have multiple local minima.

Now let me give a couple of examples to show what I am thinking about. So again high dimensions: we have a regression model with a response variable Y and a design matrix X whose columns are the co-variables. You can do a linear regression of Y on X: you say Y is a linear combination of the columns of X plus error, but p is larger than n, so the number of parameters — the coefficients in that linear combination — is large. We take this linear approximation, and now we measure the error in terms of absolute values, so not squared but absolute values, and we end up with an empirical risk function. You could say it is differentiable, because it fails to be differentiable only at a couple of points — the absolute value function is not differentiable at zero — but it is not a very smooth function: its derivative is more or less the sign function, which jumps, so it is definitely not smooth. That is an example of a convex but not very smooth loss.

Here is a fairly simple example of a smooth loss function which is not convex: the errors-in-variables model. Suppose you again want to do linear regression — Y is a linear function of X plus error, the errors are just standard Gaussian or whatever, nothing difficult — but while you observe Y, you do not observe X: you observe it with error. The measurement error is independent of X, say, with a known covariance matrix. Now, if you knew X you could do least squares, and then you would do something like this — let me just write it up quickly, it is quite simple. You do least squares: you minimize over β the squared error, normalized by n; this is what you would minimize if you could do least squares. What occurs there is the Gram matrix of the X's, then a linear term, and a term you do not care about because it does not depend on the parameter β over which you are minimizing. So I do exactly that, but now in the errors-in-variables model, so I replace X by what I actually observe, which is Z.
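What I just wrote up, written out — a hedged reconstruction of the standard computation rather than a copy of the board:

\[
\frac{1}{n}\,\|Y - X\beta\|_2^2
\;=\; \beta^\top \hat\Sigma\,\beta \;-\; \frac{2}{n}\,\beta^\top X^\top Y \;+\; \frac{1}{n}\,\|Y\|_2^2,
\qquad \hat\Sigma \;:=\; \frac{X^\top X}{n},
\]

and the last term plays no role in the minimization over β. In the errors-in-variables version, X is replaced by the observed Z in the linear term, and the Gram matrix gets the correction described next.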
And then in the quadratic term I replace the Gram matrix, which I do not observe, by an unbiased estimator: I take the Gram matrix of Z, which I do observe, and I subtract the covariance matrix of the measurement errors, and then I get an unbiased estimator of the Gram matrix of the X's. That is a very standard procedure — a little bit naive maybe, but this is what you do. The point is that while the true Gram matrix is positive semi-definite, once you subtract a positive semi-definite matrix you possibly get something which is no longer positive semi-definite, no guarantee whatsoever. So this is a non-convex loss function: you could minimize it, but you do not know how to minimize it in practice.

Another example is sparse principal components. You have a data matrix with n rows and p columns, and you are trying to estimate the first principal component of the covariance matrix Σ. So think of a matrix with i.i.d. rows, each row having covariance matrix Σ, and you want to estimate the first principal component. You could do it like this, via the best rank-one approximation, and it is again a non-convex problem. We will study it in high dimensions, where p is large, and so we also need regularization there. So these are some examples of non-convex problems.

Yet another example: suppose you want to do a one-step Newton-Raphson procedure. Then you need the matrix of second derivatives of the function you are looking at, and you need to invert it. In high dimensions you would have to be very lucky: in general you cannot do that, because the matrix will be singular and not even positive semi-definite. So you cannot invert that matrix, but you can try to build some kind of surrogate inverse by a kind of regularized projection — inverting a matrix is just like projecting each variable on all the others, and you can do this with a regularized projection — but this will again be a non-convex problem.

OK, so this is our aim — maybe I should watch the clock. We want to extend the theory to sharp oracle inequalities — so I have to explain what I mean by sharp — in situations where the risk is not so smooth, or not convex. There is some existing work, especially for these non-convex problems; the new element here is the sharpness of the result, which helps you when you have model misspecification, or when you take a learning point of view. The theoretical elements are based on work by Vladimir Koltchinskii and Karim Lounici, which is very important also for this work, and their colleague Sasha Tsybakov.

So first let me explain what "sharp" means in a sharp oracle inequality, and then what "oracle" means. A sharp oracle inequality has the following form. You have this risk function, this unknown function; you want to minimize it, but maybe you are minimizing it over a subset of Euclidean space and maybe the true minimizer is not in that set, so you want to get the best possible approximation: you want to minimize this function over the parameter space, as well as possible, plus a remainder term.
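Written out, the shape of the statement is (my schematic paraphrase, constants suppressed):

\[
R(\hat\beta) \;\le\; \min_{\beta \in \mathcal{B}} R(\beta) \;+\; \text{remainder}.
\]

"Sharp" refers to the constant one in front of the minimum of the risk; "oracle" refers to the size of the remainder.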
So you do not really have a model: you just want the best possible approximation of the minimal risk. You are not thinking of a true parameter and so on; you just want to minimize a function. And what does "sharp" mean? To see it, note that the true target, the overall minimizer, could be β⁰, and you can of course subtract on both sides any constant — the result is just the same, it cancels. So subtract this constant, the risk at the target: what you get on the left is then called the excess risk, and on the right another excess risk. Nothing changes. Now, you can have results where there is a constant in front of the excess risk on the right which is larger than one, and that is not such a problem if you think of asymptotics and so on. But if you have a misspecified model, it can be that the best approximation in your class is still far away from the true minimizer, so this excess risk does not become small — it does not go to zero — and if it does not go to zero and there is a constant bigger than one in front of it, then the result is just useless. Sharpness means that this term need not go to zero and you still have a meaningful result: you forget about getting close to the true minimizer and only aim at getting close to the best approximation within your class. That is what "sharp" means. So it is a learning point of view: you do not start with a statistical model and hope that the model is true; you just define your parameter, your goal, like this.

Then the oracle part is about the remainder; the remainder should of course be nice and small. What is an oracle inequality? It says that the remainder should be, more or less, the number of active parameters divided by the number of observations. The number of observations is n in my notation, and the number of active parameters is the sparsity. Why is that a good benchmark? Think of regression or similar situations: typically you pay of the order of one observation per active parameter, so s/n would be the good behavior here. But you do not know how many parameters are really active: you start with p parameters and only s of them are active, and you do not know this a priori. So if you can achieve this, it is really like an oracle: an oracle is somebody, or a procedure, that behaves as if it knew which parameters are active — but we do not know it, we have no idea. Still you get this kind of good behavior in the remainder, as if an oracle had told you what to do. So s is the number of active parameters — and again, from the learning point of view, maybe you are outside the parameter space, you have misspecification, and then you just take the best approximation and the sparsity of that.

That is the benchmark we want to achieve, but there is an "approximately" sign here, and what does it mean? It means, first of all, that reaching the oracle exactly is of course a bit too much to ask for: the price to pay for not knowing the active set is logarithmic in the original number of parameters — there will be an additional log p factor in the end. And the second point concerns the sparsity s itself: it has to be replaced by something a bit larger, the effective sparsity.
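Schematically — my paraphrase, with constants suppressed, and with the effective sparsity Γ² still to be defined:

\[
\text{oracle remainder} \;\approx\; \frac{s}{n}
\qquad\text{versus}\qquad
\text{achievable remainder} \;\approx\; \Gamma^2 \cdot \frac{\log p}{n},
\]

so in the well-identified cases, where Γ² is of the order of s, the price for not knowing the active set is essentially the factor log p.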
This effective sparsity, which I will denote by Γ², I still have to explain, but you can imagine that somehow you need some kind of identifiability of these parameters: depending on how well identified you are, you get closer to or further from this benchmark. So you have to take into account some identifiability, which will typically make the effective sparsity a little bit larger than the sparsity — the plain number of non-zero parameters — and a little bit more complicated to compute.

OK, so where are we. Now, what do we do? [Question from the audience.] No, no — I will probably not say much about that, maybe a little; it is more or less a separate lecture, but I will give an idea here. So this is like a preview of what is to come.

The estimator will be the following. We take the empirical risk — think of the one I just wrote down for the errors-in-variables model, or the sparse principal components, or least absolute deviations — that is the empirical risk. And because of the high-dimensional problem we add a penalty, which will be a norm; in most of my talk it will just be the ℓ1-norm, to fix ideas, but it can be a general norm. And λ is a tuning parameter: if you take it large you get more regularization, and if you take it zero — if you do not penalize and you minimize this function, or take stationary points — then you are in trouble, because you have so many parameters and you overfit. So the regularization is really important.

In principle, what we want as an estimator is the minimizer of this penalized function; then we get a random quantity, an estimator of the target. Minimizing this function you can, as it were, do as follows: you just take the derivative and set it to zero — we are in Euclidean space, nothing special, we can just take the derivative — so I am going to have to define what that derivative is. By setting it to zero and solving, you get stationary points, or the minimizer. For the empirical risk there is usually less of a problem, especially if it is differentiable, but the derivative of a norm I have to define. So that is one of the tasks: define stationary points.

That is to come, and then we have to define the curvature — maybe let me go back quickly to the picture. Here we go. We have this function that we are trying to minimize: we are minimizing the black one, but the red one is the theoretical risk, and we are thinking about finding its minimum. You can imagine that if there is more curvature the problem gets easier, and if the function is very flat the problem is more difficult, so you have to quantify that somehow in terms of the curvature. So we quantify that curvature. And then we measure the distance between two points in the parameter space: you could take the Euclidean distance, but in many cases you have to take some other distance because it is more natural, so we measure distances in terms of some norm. So now we have two norms: one for defining the curvature, and one here as the penalty. The curvature norm gives identifiability, and then we have to relate this norm to the norm occurring in the penalty, and that is where the effective sparsity comes in — it is a norm comparison.
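As a small aside before the definitions: to make the penalized estimator concrete, here is a minimal numerical sketch — my own illustration, not from the talk, with all function names, data and constants (noise levels, step size, tuning parameter) made up — of composite (proximal) gradient descent on the errors-in-variables loss from before, with an ℓ1 penalty. For a non-convex empirical risk such an algorithm can only be expected to deliver a stationary point, which is exactly the kind of point the theory below is about.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_risk, beta_init, lam, step, n_iter=400):
    """Composite (proximal) gradient descent for
         min_beta  R_hat(beta) + lam * ||beta||_1.
    When R_hat is smooth but non-convex, this only finds a stationary point.
    (In the talk the parameter space is a bounded set; the projection onto it
    is omitted here for simplicity.)"""
    beta = beta_init.copy()
    for _ in range(n_iter):
        beta = soft_threshold(beta - step * grad_risk(beta), step * lam)
    return beta

# Toy errors-in-variables data (all numbers are made up for illustration).
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)
Z = X + 0.3 * rng.standard_normal((n, p))   # covariates observed with error
Sigma_W = 0.09 * np.eye(p)                  # known covariance of the measurement error

# R_hat(beta) = beta'(Z'Z/n - Sigma_W)beta/2 - beta'Z'y/n:
# the corrected Gram matrix is unbiased for X'X/n but can be indefinite,
# so the penalized criterion is non-convex.
Gram = Z.T @ Z / n - Sigma_W
grad = lambda b: Gram @ b - Z.T @ y / n

beta_hat = proximal_gradient(grad, np.zeros(p), lam=0.3, step=0.05)
print("estimated active set:", np.flatnonzero(np.abs(beta_hat) > 1e-6))
```

The only point of the sketch is that the corrected Gram matrix can be indefinite, so the criterion is non-convex, and yet the iteration settles on a sparse stationary point — the kind of object the theorems are about.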
So let me do that: let me first define stationary points. Well, global minimization: if you could do it, you would just minimize, and I call that the argmin solution, with a hat. For stationary points I have to define the derivative of the norm, and here it is: you need to extend the derivative to the subdifferential. If you are not familiar with that, it is no problem at all; just accept that you can define the "derivative" of a norm in the following sense. If you are a little more familiar, or curious, here is an example: take the ℓ1-norm, the sum of the absolute values of the coefficients. Then it is like taking the derivative of the absolute value function: here the derivative is minus one, there it is plus one, and at this point it is not differentiable, so the subdifferential is just any value between minus one and plus one — the whole interval. That is what is written here: this is the subdifferential of the ℓ1-norm, just what I said, and you can do this for all other norms as well; what occurs in general is the dual norm, and for the ℓ1-norm the dual norm is the sup-norm. Anyway, you can define it.

So you have the subdifferential, and the stationary point is this: assuming that your empirical risk function is differentiable, you take its derivative, you take the "derivative" of the norm, set the whole thing to zero and solve — that is how a stationary point is defined. And because the parameter space can be just some bounded subset of Euclidean space, you can be sitting at the boundary, so let me make it a little more general to take that into account; do not worry too much, it is just slightly more general. If you have this inequality, we say we have a stationary point. The β in the inequality is arbitrary — in fact we will later plug in the best sparse approximation, the oracle, so to speak. So if you have a global minimizer, you have a stationary point; the other way around is of course not guaranteed. That is the stationary point.

Now the curvature, and then the effective sparsity; first the curvature. We are going to formulate a condition on the theoretical risk, the underlying unknown risk function, which we assume to be convex with a certain amount of convexity. So, as I said, take some norm on the parameter space, which is a subset of Euclidean space, and take G a strictly convex function. Then I say the risk function is strictly convex in the following sense: if you take a convex combination, (1−t) times one point plus t times the other, then the risk at the convex combination is smaller than the convex combination of the risks — that is convexity — and to quantify how convex, we have here an extra minus t(1−t) times something which describes the amount of convexity, so to speak. That something is the function G, which is large when the two points are far apart: if the distance is large, this term is large. The norm is how we measure the distance between points in the parameter space. Do not worry too much about the details; the idea is that this is a description of the amount of convexity, and we call it strict convexity. This is what I will need when there are no derivatives involved — we use strict convexity when you cannot differentiate.
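In symbols — and hedged, since this is my own reconstruction of the two definitions just described rather than a copy of the slides — the stationary-point condition and the strict-convexity (two-point) condition read roughly as follows:

\[
\dot{\hat R}(\tilde\beta)^\top(\beta - \tilde\beta) \;+\; \lambda\bigl(\|\beta\|_1 - \|\tilde\beta\|_1\bigr) \;\ge\; 0
\qquad\text{for all } \beta \in \mathcal{B},
\]

which any global minimizer of the penalized criterion satisfies, and

\[
R\bigl((1-t)\beta' + t\beta\bigr) \;\le\; (1-t)\,R(\beta') + t\,R(\beta) \;-\; t(1-t)\,G\bigl(\|\beta - \beta'\|\bigr),
\qquad 0 < t < 1,
\]

with G the convex function quantifying the amount of convexity and ‖·‖ the curvature norm.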
If you can differentiate, I look instead at the linear approximation: the difference between the original function and its linear approximation. That difference has to be lower bounded by some convex function; I call this the margin condition. So you approximate by a linear function — the derivative of the risk should exist — and the difference between the function and its linear approximation is called the Bregman divergence, and I assume this Bregman divergence is lower bounded by some convex function G. That is all; it is again just a description of curvature. So this is my description of curvature, and typically in my examples this function G will be quadratic, like in a two-term Taylor expansion; but there are examples where it is something else — say something like |x| to the power 1+δ — while |x| to the power two is the standard case. And the Bregman divergence: here you have a function, at this point you take the linear approximation, and the gap between the two is the Bregman divergence; this is in one dimension — in higher dimensions it is a little harder to draw.

Now I am going to restrict myself to this particular norm, the ℓ1-norm, and later on I will just tell you that you can extend everything to general norms; we already have enough technical details, so let me first restrict myself. So the norm in the penalty is now the ℓ1-norm. And I need some notation — why do I need it? To define this effective sparsity. Take an arbitrary subset S of the variables; I denote by β_S and β_{−S} the following. Here is an example: I have an arbitrary vector, and I take some subset, say {2, 3, 7}. For β_S I leave the coefficients 2, 3 and 7 untouched and set all the others to zero — I restrict myself to this subset of the variables — and β_{−S} is the same but with the complement: those three are set to zero and the other ones are left untouched. So that is just a convenient notation; I hope it is OK.

And then the effective sparsity. Here comes the whole story — people have been working on this for quite a while — but the effective sparsity, which I define now, is just a comparison of the ℓ1-norm in the penalty with the norm appearing in the curvature condition. A few words: in the literature it is often described a little differently, with terms like restricted eigenvalues, or RIP, the restricted isometry property. I call it a compatibility constant, and on Tuesday next week I will say more about it — you can even refine these things so that you get really good results — but in this set-up, geometrically, it would be something like an ℓ1 version of canonical correlations: something with canonical correlations or eigenvalues. So the effective sparsity is, roughly, the number of non-zero coefficients divided by such a constant, and that constant can be very small, which can blow the whole thing up — that is the Γ². And here is a schematic picture of the effective sparsity: the set S is just the first variable, these are all the other variables, you take the convex set determined by the sign constraints, and then the distance to that set — that is this number here. If that distance is very small, the effective sparsity will be very large.
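A hedged way to write this down — my paraphrase of the compatibility-type definition, with the exact cone and constants as on the slides: for a candidate active set S with |S| = s,

\[
\Gamma^2(S) \;=\; \frac{s}{\phi^2(S)},
\qquad
\phi^2(S) \;:=\; \min\Bigl\{\, \frac{s\,\|\beta\|^2}{\|\beta_S\|_1^2} \;:\; \|\beta_{-S}\|_1 \le L\,\|\beta_S\|_1 \,\Bigr\},
\]

where ‖·‖ is the curvature norm and the minimum runs over a cone of this type (possibly with sign constraints, as in the picture). A well-conditioned problem keeps φ² bounded away from zero, so Γ² is of order s; a nearly degenerate one makes φ² tiny and Γ² huge.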
OK, so that is the effective sparsity. [Question: what is the norm there?] Good question — it is this one, the norm in which I measure the curvature. Typically it will be the Euclidean norm, but often a weighted version of it is more natural.

Now, this is the main result for the non-differentiable but convex case. Take the global minimizer, and let us for simplicity say that the curvature function G is quadratic. Then comes a condition which has nothing to do with that: the noise level. Every probabilistic argument is hiding here — we are zooming out. It just says: I have the empirical risk and the theoretical risk; how far are they from each other, in some sense? Do not worry about the "in some sense"; that quantity is called the noise level. To handle it you need probabilistic arguments, but once you have them you can say: OK, this is something random, I can bound it with large probability, I call that bound the noise level, and I take my tuning parameter larger than the noise level. And then I have this sharp oracle result: it holds with constant one — the risk at the estimator, the global minimizer, is bounded by the risk at any β, so you can then minimize over β, plus the effective sparsity times the tuning parameter, more or less squared; the squares come from the quadratic margin, that is, from the curvature.

That is the flavor of the result; the hard part is of course hiding in the noise level, otherwise it is quite clear what it says. But typically the noise level is of order the square root of log p over n, and if the effective sparsity is also more or less the true sparsity, then the bound says: the difference is of order log p over n times the number of active parameters. So you see that you achieve the benchmark, the sharp oracle rate. [Question about the probability.] If the noise-level bound holds with high probability, then this inequality holds with the same high probability: the theorem is a deterministic statement, the noise condition holds on a set of large probability, and on that set the conclusion holds as well.

OK. So that was the non-differentiable case, but the issue there is that you need the global minimizer. If you cannot get the global minimizer, what do we do? If you can differentiate, we use the margin condition via the Bregman divergence for the curvature, and then you only need stationary points: for any solution of the stationary-point equations you have a sharp oracle result, exactly of the same form as on the previous slide. And you again need a noise level, comparing the empirical risk with the theoretical risk — but now not the functions themselves: you compare their derivatives.

[Question.] Let me look back a little — maybe here. This was the non-differentiable case, and there I used the subdifferential; if it is not differentiable I just do not go all the way down to zero there. Yes — good point. [Question about the remainder.] The remainder — let me see, good question — is something like λ² times Γ², and λ² is of order log p over n in typical cases.
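Schematically, and again as my paraphrase with constants suppressed, both theorems say:

\[
\lambda \;\gtrsim\; \lambda_\varepsilon \ \ (\text{the noise level})
\quad\Longrightarrow\quad
R(\hat\beta) \;\le\; \min_{\beta \in \mathcal{B}}\Bigl\{ R(\beta) \;+\; C\,\lambda^2\,\Gamma^2(S_\beta) \Bigr\},
\]

with constant one in front of R(β); here S_β is the active set of the candidate β, and β̂ is the global minimizer in the first theorem and any stationary point in the second. With λ of order √(log p / n) and Γ² of order s, the remainder is of order s·log p/n.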
And Γ² — I did not write it out, but the flavor of the result is that you should think of it as being of the order of the sparsity; that is the idea. Actually, the effective sparsity can be much larger, but in most examples people try to prove exactly this, that it is of this order; if it is not, then you simply do not have this nice oracle rate. [Question.] Well, this effective sparsity is difficult to compute. I will give an example where I can compute it explicitly — that is for the total variation penalty. In other cases people take a design which is random, and then show that with high probability it is of a certain order; explicitly computing the effective sparsity is very hard. In my experience you only have bounds, with arguments coming from randomization and things like that.

OK, so you have these two theorems, one for the non-differentiable case and one for the non-convex case, and they have the same flavor. In general, if the curvature is described by some general curvature function — something more general instead of a quadratic — then what appears in the bound is the convex conjugate of that function instead of the square. In the more standard quadratic case, like a two-term Taylor expansion, the convex conjugate is again a square.

OK, now let us see how much time I have — something like fifteen minutes. So I will go quickly through this. There is the random part of the problem, which I call the empirical process condition; just to give an idea: if you do, for instance, generalized linear models, you have observations X and Y and you want to approximate Y via linear combinations of the X's, maybe with some link function; anyway, you have a loss function ρ. If this loss function is Lipschitz, we call that one assumption, and if its derivative is Lipschitz, we call that the other assumption. In both cases: for the global minimizer with just a Lipschitz loss function you get a noise level of order the square root of log p over n, and for stationary points with a Lipschitz derivative you again get a noise level of order the square root of log p over n. This is just standard empirical process theory — if you know it, it is just applying those tools — so you indeed get this order for the probabilistic part.

So, as a conclusion for these generalized linear models: if the loss function is Lipschitz you get the inequality for the global minimizer — and that is useful for the convex case, because only then can you actually find it — and if the derivative of the loss function is Lipschitz, you get the sharp oracle inequality for stationary points, and that is useful for the non-convex case.

[Question.] Yes, of course — that is a good question. What I am saying here is that s is the number of non-zero parameters, but you can also have a weak sparsity assumption. Just not to go into too many special cases: you can also look at something like the sum of |β_j| to the power r, for some r — even with r equal to one you can prove things — with r between zero and one, and assume that this number is of suitably small order.
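In symbols, this is the usual ℓ_r way of expressing weak sparsity (the exact order required I do not reproduce here):

\[
\rho_r \;:=\; \sum_{j=1}^{p} |\beta_j|^{\,r}, \qquad 0 \le r \le 1,
\]

with r = 0 giving back the number of non-zero coefficients (strong sparsity), and a small ρ_r for 0 < r ≤ 1 being the weak form.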
With these kinds of assumptions the theory goes through; r equal to zero is the special case where you count the number of non-zero parameters. So that is the strong form of sparsity and the other is a weaker form, but the theory goes through: you just have to do some kind of trade-off, approximating a vector like this by a sparse vector, with the approximation error of the right order, and then the whole theory goes through. Yes, it is a good question — especially in function estimation people use this kind of weak sparsity: if you expand your function in some basis, you describe sparsity of the coefficients in this weak sense, not the strong one. [Question about scaling.] It depends on the model, but typically you assume that the ℓ2-norm of the vector is bounded, so that you scale it like that — something like that. Yes, that is a good question; of course you need some scaling — it is hiding in there.

Now, least absolute deviations in high dimensions. If you do least absolute deviations, the loss function is like this, and the derivative is like the sign function, which is definitely not Lipschitz, so you are not in the "Lipschitz derivative" case; you are in the not-so-smooth case. So you need global minimizers — for stationary points I do not know how to prove anything — but the loss is convex, so that is no problem. So let me just go on: here is the result. For least absolute deviations you have a sharp oracle inequality, where the effective sparsity — in the standard literature it is more or less expressed in this form, but it is the same thing — uses as the norm in the curvature condition some kind of weighted ℓ2-norm.

You can also apply this to sparse principal components. What you have to do there is first to get into a local neighborhood, so that in that local neighborhood the theoretical loss is convex, while the empirical loss is still not convex. So we take this loss function with the empirical covariance matrix, do the best sparse approximation, and add an ℓ1 penalty. Then you have to check the conditions; this involves some random quadratic forms, so a little bit of random matrix theory for the probabilistic part, but in the end, applying the theory, the effective sparsity is of the same order as the sparsity, and you get these kinds of results for sparse principal components.

Also — this is just a remark — you can push these arguments further and even do inference for sparse principal components: if you do a one-step Newton-Raphson estimator you can prove asymptotic normality, and this involves again the same arguments for the non-convex loss, because the Hessian there is not going to be positive definite.

OK, this is just to convince you that everything goes through for general norms. Take some norm, look at some risk, penalize by this norm, and minimize — that gives the estimator — or you can look at stationary points, and we know how they are defined. For the global minimizer we assume strict convexity, the two-point condition from before. So everything goes through, except that it is now a general norm, and we want the norm to be sparsity inducing, because we are thinking about sparsity. And inducing sparsity, you can check, is in a sense the same as assuming some kind of decomposability, or some kind of reverse triangle property — the ℓ1-norm is the prototype, see below — I cannot really give the details here.
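For the ℓ1-norm the property is immediate; for a general sparsity-inducing norm Ω the requirement is, roughly — and this is my hedged paraphrase of the (weak) decomposability idea, not the precise condition from the slides:

\[
\|\beta\|_1 \;=\; \|\beta_S\|_1 + \|\beta_{-S}\|_1
\qquad\text{and, more generally,}\qquad
\Omega(\beta) \;\ge\; \Omega(\beta_S) + \underline{\Omega}(\beta_{-S})
\]

for the candidate active sets S, where \underline{Ω} is some norm on the coefficients outside S.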
I just want to convince you that the whole theory goes through if you have the proper structure in your norm, a structure which induces sparsity — that is what is written here. And then you see here the effective sparsity, which is again this norm comparison. So everything is the same, and the theory is also the same: this is completely the same statement, but now for general norms, so you have the sharp oracle inequalities.

Here are some examples of such norms: the ℓ1-norm; a sorted version of it, the SLOPE norm — the sorted ℓ1-norm, maybe you know it; norms generated from cones — I will not give the details, but this is structured sparsity; and nuclear norms, with generalizations to tensors. These are the norms I know — maybe you know more. They all have this sparsity-inducing property.

So let us see: for instance, if you do matrix completion, you can do it with a robust loss. For matrix completion you use the nuclear norm, or trace norm: you have some matrix and you observe only certain entries, and you assume the matrix is of low rank — that is the sparsity here, sparsity in the principal components. So you observe certain entries, possibly with noise — I am going quickly over this — and instead of least squares you do a robust version: for instance, you take as loss function least absolute deviations, or the Huber loss. Then you can apply the theory. You do have to do the random part, which can be a bit difficult but is doable — actually it is quite standard: you end up with the dual norm of the nuclear norm, which is the maximal singular value of a matrix, so you end up with maximal eigenvalues of random matrices, and those are well studied. Good. So you go through all these conditions and you end up with the following result: for robust matrix completion, an oracle inequality where the remainder is the number of non-zero parameters divided by the sample size, up to the logarithmic price. And what is this number of parameters? Well, if you have a matrix of rank s and it is a p1 × p2 matrix, with p1 the larger of the two — say the number of rows — then the number of parameters is more or less p1 times the rank of the matrix, so this is again the number of parameters divided by the number of observations in the remainder. This is the usual structure. [Question from the audience, partly inaudible.] Sure, yes — if the dimension — sure. Very good.

So let me conclude; the last part was a bit quick. The idea is that you can do this in a very general framework: oracle inequalities which are sharp, so allowing for model misspecification, and allowing for non-convexity — although we do assume that the underlying function, the unknown theoretical risk, is convex. You get them in high dimensions as long as the norms are sparsity inducing — what people call decomposability, or the triangle property — as long as you have enough curvature of the risk, and as long as the effective sparsity, which I did not really define but only hinted at, behaves well. These are the ingredients, and then you have these sharp oracle inequalities. And it applies to several models: least absolute deviations, errors in variables, sparse principal components — a very nice example of a non-convex loss — robust matrix completion, matrix SLOPE, which is an example I did not show, and it is important for the first step of a one-step Newton-Raphson type procedure.
And I will talk about that part — taking the inverse of a matrix — on Thursday next week, not in this very general context, but it is useful for constructing, from these structured estimators, new estimators which are asymptotically normal, so that you can build confidence intervals. OK, I want to thank you very much for your patience, and I am happy to take your questions.

[Question from the audience, partly inaudible — about non-convex penalties.] Yes, you can actually come back to that. It is curious: it seems to work, but there is a proof and it is not sharp; let me give you the reference, by Po-Ling Loh and co-authors. They do these kinds of penalties, but you will not end up with a sharp inequality there, and I am pretty much convinced it is impossible to get anything like that with SCAD.

[Question about lower bounds, partly inaudible.] So there are two ways you can think of this. For minimax — well, there are results, not exactly minimax, by Martin Wainwright and co-authors, saying that if you do the Lasso, then for polynomial-time algorithms this factor has to appear in the lower bound as well — it looks like a computational lower bound, computational minimax if you like. You can also — and I will talk about that on Tuesday — just look at the algorithm itself, at the Lasso, and then you can show that, depending a little bit on the underlying parameter, it also appears in the lower bound, just for this particular estimator.

[Final question inaudible.] OK — thank you very much.