[00:00:12] >> And you need 14 items, but that's OK. OK, thanks, thanks for the introduction. It's always good to come back, because I have multiple friends here — so you should all treat me to barbecue. This is one of the good things about Georgia Tech. OK, so my talk today is sort of a continuation of what I talked about yesterday. [00:00:56] I'm going to focus again on maybe one or two particular problems and try to illustrate why it is important to combine statistics and optimization. This time it is joint work with Cong Ma, who was my student, my collaborator Yuejie Chi at CMU, and Jianqing Fan, my colleague at Princeton. OK, so I motivated this yesterday, so I'm going to be very quick. We care about nonconvex optimization, but nonconvex optimization is super scary, so we need to find a way to deal with it. Practitioners always try to just use some simple empirical algorithms, like gradient descent, in the sense that [00:01:41] they seem to be working extremely well. But because these are highly nonconvex problems, in general we cannot hope to solve them in a satisfactory way — yet they are still used every single day in practice and seem to achieve good performance. So we try to provide some perspective for explaining this kind of thing. [00:02:05] And one way to do this is through more careful statistical analysis, which basically says: if you have a nice statistical model for generating the data that you care about, then sometimes the problem looks much, much nicer than the worst-case situation — even though the problem might still be highly nonconvex, it is probably not the difficult kind of nonconvex problem to solve. We have a recent overview paper summarizing some of the recent advances along this line; if you find this interesting, take a look. OK, so let me try to be slightly more specific and use one example to show you some of the interesting messages happening in the community. This is a problem called low-rank matrix recovery, or matrix sensing. Basically, what it asks is: if I have a low-rank matrix, and I have some random measurements of this low-rank matrix, I want to recover the matrix of interest. This is a basic problem, and if you try to solve it as [00:03:16] some kind of least-squares problem, this gives you a problem that looks like this. This is a very standard and very natural way to approach it. In general this problem is highly nonconvex, because you are sort of dealing with a degree-four polynomial in the decision variable, so in general this is supposed to be a very difficult one. [00:03:44] Fortunately, if we have some structure in the sampling operator — for example, if you assume the A_i matrices, which are like the sampling matrices in this case, are i.i.d. Gaussian or i.i.d. sub-Gaussian or whatever — all of a sudden this becomes a nice problem [00:04:05] that's not difficult to solve. A few years ago, a number of groups discovered a very interesting message, which basically says that if you have this kind of i.i.d. Gaussian measurement, and if the number of samples exceeds the number of degrees of freedom of this problem, then — and this is the first very interesting message — for this problem there are no spurious local minima.
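For concreteness: the slides are not transcribed, but the natural least-squares formulation for matrix sensing being described is presumably of the following standard form (my notation; a hedged reconstruction):

$$ \operatorname*{minimize}_{X \in \mathbb{R}^{n \times r}}\; f(X) \;=\; \frac{1}{4m} \sum_{i=1}^{m} \Big( \langle A_i,\, X X^\top \rangle - y_i \Big)^2, \qquad y_i = \langle A_i,\, M^\star \rangle, \quad \operatorname{rank}(M^\star) = r, $$

which is a degree-four polynomial in the factor X, hence nonconvex.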
[00:04:34] So you have some stationary points there, but you don't have bad local minima: you only have global minima — which in this case happen to be the ground truth, or something highly correlated with the ground truth — and saddle points. And the other thing is that the saddle points you do have are [00:04:58] a very nice kind of saddle point: strict saddle points, in the sense that there is negative curvature there — sufficiently strong negative curvature — so if you get close to one, there's a way to escape the saddle point efficiently. So this is an interesting observation [00:05:22] that has been made in the last several years. And this kind of phenomenon seems to be common to a lot of different applications, which sort of motivates a lot of the investigations along this line. So now, when a lot of people try to look at nonconvex problems, they investigate them in the following way: [00:05:44] they first look at the landscape — some geometric analysis — by checking whether there is some nice landscape property one can exploit; and after that, they basically hand such interesting geometric properties to optimization people, tell them "I have a little more structure than generic nonconvex optimization," and ask them to find a generic algorithm for solving the problem. Each step of this pipeline has received a lot of attention. For landscape analysis — [00:06:23] from some simple neural networks, dictionary learning, matrix completion, to some simple blind deconvolution problems — this "no bad local minima" kind of phenomenon seems to be found in all of these applications. And then there are generic algorithms for solving such problems, maybe dating back to, you know, cubic regularization, and more recently there is perturbed gradient descent and [00:06:52] a lot of different versions of that, which allow you to solve the problem: if you have a nonconvex problem with no bad local minima and with strict saddle points, a lot of these algorithms find a global solution in polynomial time.
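As an aside, the formal notion being invoked here is, I believe, the standard strict-saddle property from this literature; a hedged paraphrase, with ε, γ, δ as illustrative parameters:

$$ \text{for every } x: \qquad \|\nabla f(x)\|_2 \ge \epsilon, \quad \text{or} \quad \lambda_{\min}\big(\nabla^2 f(x)\big) \le -\gamma, \quad \text{or} \quad x \text{ is } \delta\text{-close to a global minimizer}. $$

In words: every point either has a large gradient, or has sufficiently strong negative curvature, or is already near a global solution — so a generic algorithm always has a direction of progress.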
OK, well, I would like to argue today that this kind of analysis paradigm — which is definitely very useful and attractive when you look at generic problems — might actually be overly conservative if you only care about maybe one or two specific classes of problems, like matrix completion, or like the phase retrieval problem I mentioned yesterday. I'm going to try to use maybe one or two examples to illustrate why this is the case, and whether there is any better way for us to gain an understanding of this kind of problem. So the key message here is: sometimes, if you only care about a specific problem, you really need to combine the statistical and the convergence analyses together in order to get something nicer. If you do things this way, maybe sometimes even the simplest possible nonconvex method already achieves the best possible performance guarantees you can expect. When I say best performance guarantees, I mean you have the optimal statistical accuracy, and you have the fastest possible convergence rate. Again, I'm going to just use the problem I mentioned yesterday as a guiding thread; it is really not restricted to this problem — for matrix completion, blind deconvolution, dictionary learning, a lot of the messages, I believe, are the same — but this is probably the easiest one for me to use to illustrate the points. [00:08:48] So this is the main topic today; I'm going to use it to illustrate the importance of this integrated way of thinking about statistics and optimization. OK, so this is the problem that I mentioned. Just to quickly remind you of the notation: I have [00:09:07] x*, which is n-dimensional, and I have a few quadratic equations about this data — each observation y_k comes from a sampling vector a_k. So I have a set of quadratic equations; I know a_k, I know y_k, and I want to recover or estimate x*. The problem I mentioned yesterday finds a lot of applications in optics and beyond, so I'm not going to repeat that, because I already covered it yesterday. [00:09:41] So, in order to solve this problem, a very natural starting point, which I also mentioned yesterday, is: maybe let's look at this nonconvex least-squares optimization problem and see whether we can solve it to global optimality. [00:10:11] The issue, of course, is that this is degree four in x, so computationally it is supposed to be very challenging in general. Now — and this is, again, well known; I mentioned it yesterday — people try to find a nice initialization, followed by some kind of gradient descent, in order to solve it, and this extremely simple paradigm has already enjoyed a lot of theoretical success. And the rationale behind this two-stage approach [00:10:33] is very, very simple. First of all, this is a nonconvex problem, but if you are able to restrict your attention to a region that is not too far away from the ground truth, maybe within this region the landscape is not that bad. OK, and if you can do this, [00:10:54] and if you can continue to do refinement without jumping out of this nice region, then maybe, if it converges, it converges to the right thing. So now the difficulty lies in two parts: first, you find a way to get into this nice region; and second, you want to find the largest possible step size that does not make you jump out of this nice region. So this is the rationale behind this two-stage approach, which has generated a lot of interest in a lot of different problems: phase retrieval, matrix completion.
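To fix the notation in symbols (my reconstruction of the slide): the measurements and the natural nonconvex least-squares objective for phase retrieval are

$$ y_k = \big(a_k^\top x^\star\big)^2, \quad k = 1, \dots, m, \qquad \operatorname*{minimize}_{x \in \mathbb{R}^n}\; f(x) = \frac{1}{4m} \sum_{k=1}^{m} \Big[ \big(a_k^\top x\big)^2 - y_k \Big]^2, $$

and the goal is to recover x* up to its global sign, which is all the quadratic measurements can reveal.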
[00:11:30] For a lot of different problems, people have been trying to use this kind of paradigm, usually using a spectral method or a tensor method to find the starting point, and then running some kind of gradient descent, alternating minimization, or stochastic gradient descent, and things like this — which, if you do things carefully, seem to work very well for this class of problems. [00:11:56] But I would like to pose the following question. Most of the work so far — maybe 99 percent of the work so far — has focused on this two-stage approach when fast convergence is wanted. My question is: is a careful initialization really necessary for this class of problems if you want to achieve fast convergence? Because for some of these problems, if you go to practitioners, they probably never really use a very careful way to initialize the algorithm; oftentimes they just start randomly, and it seems to work quite well. So this motivates our question: do we have to use careful initialization, or will the algorithm automatically work? This is the main question I'd like to address today. OK, so let me mention again why a good initialization is helpful. Suppose we have a nice initialization, say using a spectral method. This is the phase retrieval problem in two dimensions, and this is the level set that I plot; this is the global solution. If you are able to start from somewhere not too far away from the global solution, then within this local region the function looks like a convex function — and if you have a convex function, you know how to solve it. So that's the rationale. Now the question is: can you start from anywhere? Well, not really — at least if you want to just use vanilla gradient descent, probably not, because there are saddle points here. If you have a saddle point and you run vanilla gradient descent, you can get stuck there. I mean, you can escape it, but if you insist on using the simplest version of gradient descent, then you may get stuck at the saddle point, and you don't want that. But if you look at this picture, how many saddle points do we have? We only have two saddle points here: we have two global solutions, and we have two saddle points. So if you just generate your initial point randomly, there is very little chance that you get close to these points. In fact, this is a two-dimensional problem; in high dimensions, if you generate points randomly, with probability almost one you will not get close to any of these saddle points. Now the question is: if I want to start the algorithm randomly — for example, from here — can we hope to still run [00:14:36] gradient descent but still guarantee fast convergence to the global solution? And why do we care about this? Because, again, if this works, then practitioners would probably prefer this approach: it's very simple, it's model-agnostic, so it might be better in terms of robustness if there are some kinds of model-mismatch issues. [00:15:00] So this is worth addressing. So now let's try to see what prior work tells us about this part.
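To make the two-stage recipe concrete, here is a minimal Python sketch of the generic pipeline for phase retrieval — spectral initialization followed by vanilla gradient descent. This is a hedged reconstruction of the standard recipe discussed above, not the speaker's code; the step size and scaling constants are illustrative.

```python
import numpy as np

def spectral_init(A, y):
    """Spectral initialization: top eigenvector of (1/m) * sum_k y_k a_k a_k^T,
    rescaled to the estimated signal norm (E[y_k] = ||x*||^2 for Gaussian a_k)."""
    m, n = A.shape
    Y = (A.T * y) @ A / m                 # (1/m) * sum_k y_k a_k a_k^T
    _, eigvecs = np.linalg.eigh(Y)        # eigenvalues in ascending order
    return np.sqrt(np.mean(y)) * eigvecs[:, -1]

def grad(A, y, x):
    """Gradient of f(x) = (1/4m) * sum_k ((a_k^T x)^2 - y_k)^2."""
    m = A.shape[0]
    Ax = A @ x
    return A.T @ ((Ax ** 2 - y) * Ax) / m

def gradient_descent(A, y, x0, eta=0.1, iters=200):
    """Stage 2: plain gradient descent refinement from the initial point x0."""
    x = x0.copy()
    for _ in range(iters):
        x = x - eta * grad(A, y, x)
    return x
```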
First of all, there is a group at Columbia — John Wright's group — that looked at the geometric analysis of this problem, and they found messages very similar to the ones I just mentioned for matrix sensing: there are no bad local minima for this problem if the design vectors are generated randomly, and all the saddle points are strict, with strong enough negative curvature. So this is a good starting point: the landscape is nice. [00:15:48] And building on this, another group — Michael Jordan, Ben Recht, Jason Lee, and some of their co-authors — was able to show the following message: if you have a nice landscape, meaning no bad local minima, and you run vanilla gradient descent with random initialization, then almost surely gradient descent converges to a global solution. This is a very nice starting point: it says that if you combine random initialization with vanilla gradient descent, then whenever it converges, it has to converge to the ground truth. [00:16:29] But what I would like to argue is that this result is still, practically speaking, far from enough, because it does not really tell us how fast this convergence is. In fact, almost-sure convergence might actually mean you take forever to converge. Indeed, the same group of people identified one example — not this problem, but another example — showing that [00:16:56] that problem has a nice landscape, yet it takes gradient descent exponential time to converge — exponential in the dimension of the problem. If that happens, this is not an encouraging message for us, because we really want algorithms that converge as fast as possible, especially when we're dealing with [00:17:16] large-scale applications. OK, so let's look at a bit of numerical evidence to see whether we can expect fast convergence. These are some very simple numerical experiments that I have run: I run gradient descent with Gaussian initialization, and I plot the L2 error as a function of the iteration count. [00:17:45] I vary the dimension of the problem from 100 to 1000, plotting the convergence curve for each problem here. From these curves you see that there seem to be two stages: in the first stage, it takes maybe 30 iterations [00:18:05] to get to, for example, an error of about 0.5 relative error; after that, you see linear convergence — the y-axis is plotted on a logarithmic scale, so you see linear convergence after, say, the first 30 or so iterations. So the key message is: numerically, it takes roughly several tens of iterations [00:18:31] to get to a point from which you see linear convergence. This seems to be a very nice numerical message, and regardless of the problem size, this phenomenon seems to always hold. So you have fast convergence: in total, it takes about 200 iterations of gradient descent to get to 5 digits of accuracy. This seems to be something very nice for practitioners. [00:19:03] OK, so now my goal is to try to explain why this is true. Let me try to explain this curve a little bit more via another numerical experiment. This is the curve I just mentioned, error versus iteration count; now, there's one part that's a little bit less clear — what's happening in the first 30 iterations? [00:19:27] So I'm going to plot one more thing.
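Before moving on, here is a minimal sketch of the kind of experiment behind these curves — random Gaussian initialization, tracking both the relative ℓ2 error (up to the global sign ambiguity) and the signal correlation that comes up next. The sizes and step size are illustrative stand-ins for those on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 1000                        # illustrative; the talk varies n from 100 to 1000
xstar = rng.standard_normal(n)
xstar /= np.linalg.norm(xstar)          # ground truth, normalized to unit norm
A = rng.standard_normal((m, n))         # i.i.d. Gaussian design vectors
y = (A @ xstar) ** 2                    # phaseless quadratic measurements

x = rng.standard_normal(n) / np.sqrt(n) # random (Gaussian) initialization
eta = 0.1
for t in range(200):
    Ax = A @ x
    x = x - eta * (A.T @ ((Ax ** 2 - y) * Ax)) / m   # gradient step
    err = min(np.linalg.norm(x - xstar),
              np.linalg.norm(x + xstar))             # error modulo global sign
    corr = abs(x @ xstar)                            # signal component
    if t % 20 == 0:
        print(f"iter {t:3d}  error {err:.3e}  correlation {corr:.3e}")
```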
I'm going to plot the correlation between my current iterate and the global solution. In signal processing language, this is sort of the signal component of the iterate. Since I'm starting the algorithm randomly, and this is a high-dimensional object, at the very beginning you expect this correlation to be very small. [00:20:01] But in the first several iterations you see an exponential increase in this signal component (linear on this log scale), which means that after very few iterations the signal component becomes so large that it reveals a lot of information. And at this point of the curve the signal component is already sufficiently large; after that, you see the linear convergence, where you just keep refining the estimate very fast. [00:20:34] So if I combine these two curves, you might sort of believe that in each of the stages something is happening exponentially fast, so that if I combine both of them, maybe in total, within just a logarithmic number of iterations, we can get to arbitrary accuracy. [00:20:55] And this is something that we can formalize as a theorem: basically, if your design is i.i.d. Gaussian — meaning all the design vectors are generated randomly — and you start your algorithm randomly, this is what we can say. OK, so this looks a bit complicated; don't worry, I'm going to use four messages to explain what I'm talking about here. Just one thing: this function is a distance metric modulo some global ambiguity, so you can basically just treat it as an ℓ2 distance function. So what this theorem says can be broken into four parts. [00:21:46] First of all, it says the following: it takes O(log n) iterations — n is the dimension of the problem — to get to an accuracy [00:22:05] that is reasonably good. Here, when I say reasonably good, I mean an estimation error of roughly 0.1; it takes about log n iterations to get there. The second thing is: after this stage, we see linear convergence, and the contraction factor is an absolute constant, independent of the dimension — so think of it as, you know, 0.9. [00:22:29] Combining these two messages together — this is what I want to say — the total iteration complexity is log n plus log(1/ε) if you want ε accuracy. Note that the second part has no dependency on n, so in general it probably just takes you several hundred iterations to get, you know, 10 digits of accuracy. [00:22:58] And all of this happens as long as the number of equations — the number of samples you have — exceeds n times some log factor. Here n is the dimension of the problem: if you want to recover n unknown variables, you need at least n equations, even if you know the phases. So this is almost information-theoretically optimal, [00:23:23] up to some log factors. So this seems to be a very nice message: first, it says you converge very fast; second, it says that in order to enable fast convergence, you don't really need many equations.
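Loosely restated — a hedged paraphrase, with constants and the exact high-probability statement suppressed: for i.i.d. Gaussian design vectors a_k, random initialization, a suitable constant step size, and m ≳ n poly log m samples, gradient descent satisfies, with high probability,

$$ \operatorname{dist}(x_t, x^\star) := \min\big\{ \|x_t - x^\star\|_2,\; \|x_t + x^\star\|_2 \big\} \;\le\; \varepsilon\, \|x^\star\|_2 \qquad \text{for all } t \gtrsim \log n + \log(1/\varepsilon), $$

where the minimum over ±x* is exactly the "distance modulo global ambiguity" just mentioned.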
And if you care about statistical accuracy, we can show something very similar. Suppose your data are corrupted by noise, and you run the same algorithm. [00:23:55] What we can say is that it no longer converges to the ground truth, but it converges to the maximum likelihood estimate of the problem, [00:24:12] within the same number of iterations. You can also show that the maximum likelihood estimator for this Gaussian-noise problem model, if you can find it, is optimal: it achieves the Cramér–Rao bound exactly — including the preconstant. So this is really something that is settled. All right, I'm probably going to skip the numerics part. Let me get into the analysis and tell you a little bit about how we analyze this thing and why understanding the statistical properties is crucial. So this is the first stage: I'm going to start by analyzing what happens from random initialization [00:24:53] to the point where we get into the local region. And just to remind you: prior theory does not give any kind of convergence rate guarantee, whereas using our theory we are able to say it only takes O(log n) iterations to get there. [00:25:13] So let me start with a simpler problem. The simpler problem is: suppose I have infinitely many samples. If you have infinite samples, in statistical language this means we are looking at the population-level dynamics. The problem then becomes very easy: you basically replace your gradient by the expectation of the gradient — the population-level gradient — which for this particular problem is very easy to calculate. I won't do the calculation for you, but it takes maybe five minutes. [00:25:47] OK, so now let's look at this population-level dynamics and see whether we can analyze it. How are we going to do it? We do the following: in order to analyze this, we introduce two parameters. The first parameter is the correlation between my current iterate and the global solution — which is what I plotted before. The second parameter is basically everything else: the first is the signal part, and the second is the part of the iterate orthogonal to the signal component — the residual part. I use these two parameters to try to capture what's happening in this problem, and you can easily check — it takes maybe half an hour — that for this problem (and also for many other problems, like matrix completion and dictionary learning) [00:26:47] these two parameters experience the following evolution: the signal part increases exponentially fast, while the residual part goes down exponentially. And this is something that's very easy to establish. In fact, we have a very simple characterization of the evolution of these two parameters, and since there are only two of them, I can analyze everything by hand. [00:27:17] Trust me, this is a very simple thing to analyze. OK, so what this says is that if you have infinite samples, this is actually a very simple problem that you can expect to solve in a very efficient way. Now the question is: how are we going to analyze the finite-sample case? I mean, this is what matters — we don't have infinitely many samples; we have very few samples. The number of samples we have is just slightly above the information-theoretic limit. How can we even expect this to happen?
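Before turning to finite samples, here is my reconstruction of the population-level calculation just alluded to (assuming ‖x*‖₂ = 1 and i.i.d. standard Gaussian a_k; a sketch, not the exact slide). The population gradient works out to

$$ \nabla F(x) := \mathbb{E}\big[\nabla f(x)\big] = \big(3\|x\|_2^2 - 1\big)\,x \;-\; 2\,\big(x^{\star\top} x\big)\,x^\star . $$

Writing the iterate as $x_t = \alpha_t\, x^\star + \beta_t\, x_\perp$ (signal and residual components), the population-level update $x_{t+1} = x_t - \eta \nabla F(x_t)$ decouples into

$$ \alpha_{t+1} = \alpha_t \Big[ 1 + 3\eta\big(1 - \|x_t\|_2^2\big) \Big], \qquad \beta_{t+1} = \beta_t \Big[ 1 + \eta\big(1 - 3\|x_t\|_2^2\big) \Big], $$

so the signal-to-residual ratio $\alpha_t / \beta_t$ grows geometrically — matching the exponential growth of the correlation curve and the exponential decay of the residual seen in the plots.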
[00:27:54] OK, so naturally: if this is my gradient, then this is the residual relative to the population-level gradient. Since we know that at the population level everything works well, you probably just want to make sure the residual part is small enough that it doesn't affect the population-level dynamics too much. [00:28:21] But it happens that this is a very difficult thing to control. Just to give you one example: this one component, if you want to understand it, looks very complicated — it is a degree-four polynomial in a_k and in my current iterate — so it is complicated [00:28:45] to analyze. Now, what I have just shown is that at the population level everything works well. Let me now try to cheat a little bit: suppose I am allowed to pretend that my sampling vectors are independent of my current iterate — which is definitely not true — but say I'm allowed to make this heuristic assumption. Then things might not be that difficult, because this residual is an average of a lot of random components, and by the central limit theorem you can sort of control it up to, you know, a level that is roughly the standard deviation of this quantity. [00:29:33] So this basically says that if I really had this kind of independence, then everything would be fine. But the real difficulty is that I do not have this independence — unless I use fresh data in every iteration — because whatever iterate I have, x_t, is dependent on the a_k's, so I cannot really use central-limit-theorem-type arguments to analyze it. So the key analysis component of this work is about how to show, on a very intuitive level, that my current iterate is almost independent of each of the sampling vectors. If this is true, then we can hope that this mimics what happens in the independent case, and maybe we can see something similar to central-limit-theorem-type behavior, which allows us to show everything. And we are going to use an analysis called leave-one-out analysis to do this job, which is a bit difficult to explain, so I'm going to skip this part for now and fast-forward to the second stage. In the second stage I need to do the same kind of thing, and I'm going to use roughly the same technique, but there it is slightly easier for me to explain; so let me move to the second stage, and there I'll come back and mention a little bit about how we handle things. So the second stage is the stage after you have entered the local region, up until you find the global solution. Let me just mention that prior theory — like the original Wirtinger-flow kind of result — roughly says it takes n times log(1/ε) iterations to achieve ε accuracy, and our theory basically says that you don't actually need this n factor: it can be done in a completely dimension-free manner. OK, so locally we are basically analyzing gradient descent — trying to see how to analyze gradient descent. I think probably all of you have taken optimization courses, and in optimization courses we are told that for gradient descent to converge fast, there are two standard conditions; if they are satisfied, you have fast convergence.
First of all, you want some kind of strong convexity — [00:32:17] maybe in a restricted sense, in some local sense, but you want something like strong convexity to hold. Second, you want [00:32:36] some kind of smoothness — a Lipschitz-gradient kind of condition — so that, at every point, the function is sandwiched between two quadratic functions. If both hold, then gradient descent is quite good: it converges very fast. So these are the two standard conditions. Now let's check [00:33:02] what these conditions look like for this particular problem. So this is the setup: I'm going to introduce α and β as the strong convexity and smoothness parameters. In terms of error contraction, this is what standard optimization theory tells us: [00:33:31] if these conditions hold, this is the convergence rate we get, and the contraction factor depends on the condition number β/α. Basically, for this problem, the condition number determines the rate of convergence. Of course, you can use more complicated things, like accelerated methods, to improve this dependence to maybe a square root, but that is not my point here. [00:33:58] My point is: let's just focus on this extremely simple gradient descent and see whether this really is the convergence rate. So let's look at these two parameters. For this problem, first of all, locally — at least locally — the problem is strongly convex. Actually, this is a very nice thing: locally, as long as the point is not too far away from the ground truth, the global solution, the loss is strongly convex with parameter 0.5. This is very nice. [00:34:23] But unfortunately, the condition number, even locally, is extremely bad. It's a condition number that depends on the dimension of the problem; in fact, the smoothness parameter scales as n, the dimension of the problem. What this means is that if you run gradient descent and you want to attain ε accuracy, it takes you roughly n times log(1/ε) iterations to get there, because the iteration complexity is the condition number times some log factor. [00:35:07] So this is the total number of iterations needed if you just use off-the-shelf optimization theory. It seems that, if optimization theory gave the right prediction, you would actually say that it converges quite slowly — and as the problem becomes higher-dimensional, you get slower convergence. [00:35:31] OK, so what is wrong? Let's take a second look at the gradient descent theory and see why the condition number is important — let's try to see in which region we can expect to have both strong convexity and smoothness for this particular problem. So this is [00:35:57] the Hessian; I'm just writing it down, but you don't need to worry about the exact expression. The key thing I'd like to mention is that it depends on x almost only through this part, and we basically want to understand this part. If x is very closely aligned with any of the a_k's, this term becomes very large, and if it is large, the whole thing might become very large, which gives us a bad smoothness parameter. In probability language, this is a heavy-tailed distribution, and if you have a heavy-tailed distribution, then with high probability you will see some very large values.
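For reference, the textbook guarantee being invoked is the following (standard; stated here for an α-strongly convex, β-smooth function with minimizer $\hat{x}$ and step size $\eta = 2/(\alpha+\beta)$):

$$ \|x_{t+1} - \hat{x}\|_2 \;\le\; \frac{\beta - \alpha}{\beta + \alpha}\, \|x_t - \hat{x}\|_2 \;=\; \frac{\kappa - 1}{\kappa + 1}\, \|x_t - \hat{x}\|_2, \qquad \kappa := \beta/\alpha, $$

so reaching ε accuracy takes $O(\kappa \log(1/\varepsilon))$ iterations. With α ≈ 0.5 and β scaling like n, as just described for this loss, that is exactly the n·log(1/ε) iteration count quoted above.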
[00:36:39] So the message here is said if my decision variable is too close to any of the design vector of a K. and then the small is parameter becomes an issue OK So let me try to plot something in the in this poll to show you what I really mean it so this is my ground truth so this should be extra. [00:37:02] And let me look at the local region and what I was saying is that if you want to have strong converse city and small for this maybe you want to make sure that your decision variable X. is not too close to any of the design vector So if this is a one I want to make sure my X. is fully within this region so for example an X. here so that X. is not too close from a one OK So they are sort of like off. [00:37:34] A one if that is true and it's very good. In the language do they call you incoherent basically is sort of saying the X. needs to be almost 0 fog in the world to each of day. And this is just for what I want you also want to make coherent We respect to every single sampling vector so you probably also make sure that we respect to a tool you also have something like. [00:38:01] So somehow saying that I still have a nice region and I enjoy both drunken buses here and small thing but this region is no longer able to bore this region looks like you know a poli top you know if you have a polytope sometimes you might run into issues using a generic optimization theory. [00:38:23] And in fact a lot of prior works when they tried to say I want a fast fast convergent algorithm I want to enforce some additional steps truncation by projection like you know change the laws function you know blah blah blah in order to make sure that my iterate always stay within this Polito and if you are able to always stay within this polytope way then the self-willed theory will tell you there you converge for us. [00:38:53] Just a site so most of the PIO work for a lot of these problems they usually try to enforce some kind of. As I mentioned truncation projection you know change your function blah blah blah in order to make sure it converges fast but for an regularize version very little is known extra So for face ritual I mentioned yesterday that you know we have some guarantees there which are suboptimal formations completion the D.S.P. for work there's almost nothing there blind to convolution the same thing happens to so people have not really look at how to regularize version that much mostly because they are unable to control the trajectory of the problem. [00:39:40] But the question is on regularized methyl really suboptimal for these problems or is just because of some technical issues. OK so all finding is basically to justify that actually there's no need to enforce any kind of these you know extra. Regularize ation. We can while we can make sure you said if you are doing statistical analysis more carefully you can show that if you start from this point within this. [00:40:09] Next iteration you won't leave this in the top with high probability you will still still stay within here so but you always stay within this polytope and as I said this is a Polito with strong and well conditioned this and if that is true and then you converge very fast OK this is what we call implicit regularization the sort of say that you optimization algorithm automatically in forces you to stay incoherent with all of. [00:40:40] This meaning there you always stay with the industry to deal. With that we get convergence and this is something where you won't be able to derive. 
And this is the analysis I'm going to use; let me describe it very, very briefly. The idea comes from the probability and statistics literature: [00:41:13] it was used early on in understanding the central limit theorem, and later on a lot of statisticians have tried to use it to understand, you know, linear regression, logistic regression, and so on and so forth. We are trying to borrow some of the insight there to understand this problem. [00:41:32] So what is the idea? OK, this is my problem: I have a matrix of sampling vectors, I have a ground truth, and these are my observations. [00:41:54] What I want to say is that each of the a_k's is nearly orthogonal to my iterates — this is basically what I want to show. So how am I going to do it? Suppose I only care about this condition with respect to a_1. If I want to understand this, I'm going to create an auxiliary sequence in the following way: I'm going to drop one sample. Originally I have all the samples; [00:42:30] now I drop one sample — I'm not going to use it at all, I completely discard this one — and I rerun my algorithm in the same way: I just discard one sample, use all the remaining samples, and run the same algorithm. So I get iterates, where I use a superscript to indicate that the l-th sample has been dropped. These are my new iterates. [00:43:01] Why do I do this? Because this iterate has never seen the sample a_1, and a_1 is generated randomly — this is what we have assumed. So, with high probability, within any polynomial number of iterations, this iterate is going to be nearly orthogonal to a_1: a_1 is an independent random Gaussian vector, and they have no correlation, so they are going to be nearly orthogonal. [00:43:32] And this is exactly what I want: if this were the true iterate, it would basically satisfy the condition I want. But this is a fake sequence, used to analyze the true sequence — so how do we connect them? Because I only dropped one sample [00:43:53] out of, you know, a million samples, the two runs are expected to be quite close; there must be some continuity argument saying that these two sequences do not deviate too much. And if one sequence is nearly orthogonal to a_1, and the other sequence is close to this fake sequence, then, putting everything together, you can show that the true sequence is also nearly orthogonal to a_1. [00:44:25] So this is the key idea: I generate an auxiliary sequence to obtain the independence, and then I use a continuity argument to show that the two sequences are quite close, so the true sequence is also nearly independent of the sampling vector. You can do this for every one of the sampling vectors, take the union bound, and you get everything. [00:44:49] OK, so this is the very high-level idea behind this leave-one-out analysis, and this is really something coming almost exclusively from the statistical literature. Then, coming back to the initialization part: there, things are much, much harder, so I'm not going to explain in detail. I'm still going to use this leave-one-out analysis, except that I design my leave-one-out sequences in a more complicated way — for example, sometimes I want to flip the signs of my measurements, and sometimes I want to do both operations simultaneously — but these are technicalities, and I'm not going to explain all the details.
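Here is a schematic Python sketch of the leave-one-out construction just described — the true run and an auxiliary run that never sees sample 1. This is purely illustrative; the sizes and step size are assumptions, not values from the talk.

```python
import numpy as np

def run_gd(A, y, x0, eta=0.1, iters=100):
    """Gradient descent on f(x) = (1/4m) sum_k ((a_k^T x)^2 - y_k)^2; returns final iterate."""
    m = A.shape[0]
    x = x0.copy()
    for _ in range(iters):
        Ax = A @ x
        x = x - eta * (A.T @ ((Ax ** 2 - y) * Ax)) / m
    return x

rng = np.random.default_rng(1)
n, m = 100, 1000
xstar = rng.standard_normal(n)
xstar /= np.linalg.norm(xstar)
A = rng.standard_normal((m, n))
y = (A @ xstar) ** 2
x0 = rng.standard_normal(n) / np.sqrt(n)      # the SAME random init for both runs

x_true = run_gd(A, y, x0)                     # true sequence: sees all m samples
x_loo  = run_gd(np.delete(A, 0, axis=0),      # leave-one-out sequence: a_1 dropped
                np.delete(y, 0), x0)

# Continuity: the two runs stay close.  Independence: the leave-one-out iterate,
# which never saw a_1, is nearly orthogonal to it (|a_1^T x| on the ~sqrt(log m)
# scale, rather than the ~sqrt(n) scale of a fully aligned vector).
print("gap between runs:", np.linalg.norm(x_true - x_loo))
print("normalized |a_1^T x_loo|:", abs(A[0] @ x_loo) / np.linalg.norm(x_loo))
```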
[00:45:32] OK, so coming back to the message: one thing you might wonder is how come gradient descent never really needs to escape the saddle points. If you look at the numerical experiments, you actually see that even though there is a saddle point — which I plot in blue here — the dynamics never really get close to the saddle point. If you run gradient descent, at the beginning it sort of moves closer to it, but once it gets sufficiently close to the saddle point, there seems to be some positive force dragging it away [00:46:14] from the saddle point and moving it toward the global solution. And since the iterates never actually get close to the saddle points, this is a very nice thing: it means we don't really need any kind of sophisticated saddle-escaping algorithm here. In fact, if you just used the generic saddle-escaping theory — with its more complicated algorithms, like trust-region methods or perturbed gradient [00:46:48] descent — you would get an iteration complexity that is, you know, polynomial in n. But if you do it more carefully, if you only care about this kind of problem — like phase retrieval or matrix completion — you get O(log n) iterations. So you don't really need any of the sophisticated machinery; you automatically get the benefit. This is what I want to say: some of the simple algorithms work much better — actually much better than what we'd expect — if you understand the problem in a more careful way. [00:47:22] OK, so let me just conclude: by blending statistics and optimization, we can sometimes show that even the simplest algorithms, like gradient descent or stochastic gradient descent, work extremely well. You probably don't need to initialize smartly, you probably don't need sophisticated regularization, you probably don't need to split the samples, and you probably don't need sophisticated saddle-escaping algorithms. This is all summarized in the following paper. Thank you very much.